# Core NLP Libraries

# NLTK (Natural Language Toolkit)



- Python library designed to work with human language data
- provides access to over 50 corpora and lexical resources
- offers tools for fundamental NLP tasks:
    1. Tokenization (splitting text into words)
    2. Stemming (reducing words to their root form)
    3. Tagging (assigning a grammatical label to each word according to its role in the text, e.g. noun, verb, adjective)
    4. Parsing (analyzing sentence structure; the related chunking step identifies and classifies named entities, e.g. people, locations)
    5. Classification
    6. Semantic Reasoning
- free, open-source, and cross-platform (Windows, macOS, Linux)
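
To make the tokenization step in the list above concrete, here is a minimal sketch using only Python's standard library. The function name `simple_tokenize` and the regular expression are illustrative assumptions; this is not NLTK's tokenizer, which handles contractions, abbreviations, and many other cases differently.

```python
import re

def simple_tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    # \w+ matches a run of letters, digits, or underscores;
    # [^\w\s] matches one character that is neither a word character nor whitespace.
    return re.findall(r"\w+|[^\w\s]", text)

sentence = "The actors were amazing!"
print(sentence.split())           # whitespace split keeps "amazing!" as one token
print(simple_tokenize(sentence))  # punctuation becomes its own token
```

The comparison with `str.split()` shows why a dedicated tokenizer matters: a word with punctuation attached will not match a plain word list.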

### Resources:
- https://www.nltk.org/
- https://www.nltk.org/book/
- https://github.com/hb20007/hands-on-nltk-tutorial

### ABCs of NLTK

In [4]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
from nltk.tag import pos_tag
from nltk.chunk import ne_chunk
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import wordnet

# Download necessary NLTK data (run this once)
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!
[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Package words is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [15]:
from tabulate import tabulate # For prettier output

text = "At the Broad Institute in Cambridge, Dr. Jennifer Doudna's work with CRISPR technology transformed genetic research."

# 1. Tokenization
tokens = word_tokenize(text)
print("\n--- Tokenization ---")
print(tokens)

# 2. Stemming
stemmer = PorterStemmer()
stemmed_tokens = [stemmer.stem(token) for token in tokens]
print("\n--- Stemming ---")
print(stemmed_tokens)

# 3. Part-of-Speech Tagging
tagged_tokens = pos_tag(tokens)
print("\n--- Part-of-Speech Tagging ---")
print(tabulate(tagged_tokens, headers=["Token", "POS Tag"], tablefmt="fancy_grid"))

# 4. Parsing (Named Entity Recognition)
ner_tree = ne_chunk(tagged_tokens)

print("\n--- Named Entity Recognition (Tree Structure) ---")
ner_tree.pretty_print()

print("\n--- Named Entity Recognition (Tabular Representation) ---")

ner_results = []
for subtree in ner_tree:
    if hasattr(subtree, 'label'):  # Check if it's a named entity chunk
        entity = " ".join(word for word, tag in subtree.leaves())
        ner_results.append((entity, subtree.label()))
    else:  # It's a regular token
        ner_results.append((subtree[0], 'O')) # 'O' means 'Outside' any named entity.

print(tabulate(ner_results, headers=["Entity/Token", "NER Tag"], tablefmt="fancy_grid"))


--- Tokenization ---
['At', 'the', 'Broad', 'Institute', 'in', 'Cambridge', ',', 'Dr.', 'Jennifer', 'Doudna', "'s", 'work', 'with', 'CRISPR', 'technology', 'transformed', 'genetic', 'research', '.']

--- Stemming ---
['at', 'the', 'broad', 'institut', 'in', 'cambridg', ',', 'dr.', 'jennif', 'doudna', "'s", 'work', 'with', 'crispr', 'technolog', 'transform', 'genet', 'research', '.']

--- Part-of-Speech Tagging ---
╒═════════════╤═══════════╕
│ Token       │ POS Tag   │
╞═════════════╪═══════════╡
│ At          │ IN        │
├─────────────┼───────────┤
│ the         │ DT        │
├─────────────┼───────────┤
│ Broad       │ NNP       │
├─────────────┼───────────┤
│ Institute   │ NNP       │
├─────────────┼───────────┤
│ in          │ IN        │
├─────────────┼───────────┤
│ Cambridge   │ NNP       │
├─────────────┼───────────┤
│ ,           │ ,         │
├─────────────┼───────────┤
│ Dr.         │ NNP       │
├─────────────┼───────────┤
│ Jennifer    │ NNP       │
├─────────────┼──────

In [28]:
# 5. Classification Example (Simple Sentiment Analysis using Naive Bayes)
def simple_sentiment_analysis(text):
    def word_feats(words):
        return {word: True for word in words}

    positive_words = ['good', 'awesome', 'fantastic', 'amazing']
    negative_words = ['bad', 'terrible', 'awful', 'horrible']

    positive_features = [(word_feats(positive_words), 'pos')]
    negative_features = [(word_feats(negative_words), 'neg')]

    train_set = positive_features + negative_features
    classifier = NaiveBayesClassifier.train(train_set)

    # Tokenize and lowercase so that "amazing!" or "Good" still match the word lists
    words = word_tokenize(text.lower())
    feats = word_feats(words)
    return classifier.classify(feats)

print(simple_sentiment_analysis("I really liked the play, the actors were amazing!"))
print(simple_sentiment_analysis("This play was so sad. I felt awful afterwards."))
print(simple_sentiment_analysis("This play was so sad. I felt awful afterwards. The acting was fantastic and the sound quality good."))

# 6. Semantic Reasoning Example (Using WordNet)
def simple_semantic_reasoning(word1, word2):
    synsets1 = wordnet.synsets(word1)
    synsets2 = wordnet.synsets(word2)

    if synsets1 and synsets2:
        # Only the first listed sense of each word is compared
        similarity = synsets1[0].wup_similarity(synsets2[0]) # Wu-Palmer similarity
        if similarity is None: # wup_similarity can return None if no common ancestor is found
            return f"'{word1}' and '{word2}' have no defined Wu-Palmer similarity"
        if similarity > 0.5: # arbitrary threshold
            return f"'{word1}' and '{word2}' are semantically similar (similarity: {similarity:.2f})"
        else:
            return f"'{word1}' and '{word2}' are not very similar (similarity: {similarity:.2f})"
    else:
        return "One or both words not found in WordNet."

print(simple_semantic_reasoning("house", "apartment"))
print(simple_semantic_reasoning("house", "flat"))
print(simple_semantic_reasoning("chips", "potato"))
print(simple_semantic_reasoning("chips", "poker"))

pos
neg
pos
'house' and 'apartment' are semantically similar (similarity: 0.82)
'house' and 'flat' are not very similar (similarity: 0.43)
'chips' and 'potato' are semantically similar (similarity: 0.95)
'chips' and 'poker' are not very similar (similarity: 0.22)


## Problems with these easy examples:
*Sentiment Analysis*
- the classifier only learns whether a word appeared in the positive or the negative word list; words outside both lists carry no signal and context is lost (e.g. "sad" doesn't always indicate a negative review)
- code doesn't handle negation ("not good")
- sarcasm and irony are very hard for simple algorithms to detect
- code doesn't look at the sentence as a whole
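
As a rough illustration of the negation problem listed above, the sketch below scores text with the same word lists and flips the polarity of a word that directly follows a negator. The `NEGATORS` set and the function `lexicon_score` are assumptions made for this example; it is a simple word-counting scheme, not the Naive Bayes classifier used in the cell above.

```python
POSITIVE = {"good", "awesome", "fantastic", "amazing"}
NEGATIVE = {"bad", "terrible", "awful", "horrible"}
NEGATORS = {"not", "no", "never"}  # assumed, deliberately tiny list

def lexicon_score(text):
    """Return +1 per positive word and -1 per negative word.

    A word directly preceded by a negator has its sign flipped.
    """
    score = 0
    negate = False
    for raw in text.lower().split():
        word = raw.strip(".,!?;:\"'")  # drop punctuation attached to the word
        if word in NEGATORS:
            negate = True
            continue
        polarity = (word in POSITIVE) - (word in NEGATIVE)  # 1, -1, or 0
        score += -polarity if negate else polarity
        negate = False  # a negator only affects the very next word
    return score

print(lexicon_score("The play was not good"))  # -1
print(lexicon_score("not bad at all"))         # 1
print(lexicon_score("not very good"))          # 1: the negation is lost after "very"
```

The last call shows how brittle such a rule is: any intervening word, or a contraction such as "wasn't", defeats it.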

*Semantic Reasoning*
- "house" and "flat" have a lower similarity because the code compares only the first WordNet sense of each word, and that sense need not be the intended one; there are also regional differences in the usage of the word "flat" (e.g. in the UK it means an apartment)
- "chips" and "poker" have a low similarity because the word "chips" has several meanings (potato chips or poker chips) and only one of them is compared
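
The effect of comparing only the first sense can be sketched on a toy taxonomy. Everything in the block below (the `PARENT` hierarchy, the `SENSES` inventory, and the helper names) is hypothetical and far simpler than WordNet; the formula is the usual Wu-Palmer definition, 2 * depth(lcs) / (depth(a) + depth(b)), with the root at depth 1. NLTK's `wup_similarity` differs in its details.

```python
# Hypothetical toy taxonomy (child -> parent); not WordNet's actual hierarchy.
PARENT = {
    "artifact": "entity",
    "structure": "artifact",
    "housing": "structure",
    "dwelling": "housing",
    "house": "dwelling",
    "apartment": "housing",
    "land": "entity",
    "flat_land": "land",
}

# Hypothetical sense inventory: each word lists its senses, first sense first.
SENSES = {
    "house": ["house"],
    "flat": ["flat_land", "apartment"],
}

def path_to_root(node):
    """Return [node, parent, ..., root]."""
    path = [node]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path

def depth(node):
    """Number of nodes from the root down to node (root has depth 1)."""
    return len(path_to_root(node))

def wup(a, b):
    """Wu-Palmer similarity of two nodes in the toy taxonomy."""
    ancestors_b = set(path_to_root(b))
    # Walking upward from a, the first shared node is the deepest common ancestor.
    lcs = next(node for node in path_to_root(a) if node in ancestors_b)
    return 2 * depth(lcs) / (depth(a) + depth(b))

def first_sense_similarity(w1, w2):
    return wup(SENSES[w1][0], SENSES[w2][0])

def best_sense_similarity(w1, w2):
    return max(wup(s1, s2) for s1 in SENSES[w1] for s2 in SENSES[w2])

print(round(first_sense_similarity("house", "flat"), 2))  # 0.22
print(round(best_sense_similarity("house", "flat"), 2))   # 0.73
```

Taking the maximum over all sense pairs is one common workaround, though it can also overstate similarity by picking up an unintended sense.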

-> Naive Bayes trained on a handful of words and first-sense WordNet similarity both struggle with the complexities of human language. Real-world NLP applications use more sophisticated techniques, such as deep learning models, more extensive lexical resources, and explicit handling of negation, sarcasm, and other linguistic phenomena.

# spaCy

- Python library for advanced NLP
- designed for efficiency and speed, especially on large volumes of text
- provides pre-trained statistical models and word vectors for various languages
- tools for core NLP tasks:
  1. Tokenization (splitting text into words)
  2. Lemmatization (reducing words to their base form, more accurate than stemming)
  3. Part-of-Speech Tagging (assigning grammatical labels, e.g., nouns, verbs)
  4. Named Entity Recognition (identifying and classifying entities, e.g., people, locations, organizations)
  5. Dependency Parsing (analyzing sentence structure and relationships between words)
  6. Text Classification (spaCy's own text categorizer has to be trained first; this notebook uses the external TextBlob library for a simple example)
  7. Semantic Similarity (computing similarity between words and documents using word vectors)
- free, open-source, and cross-platform (Windows, macOS, Linux)
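
The difference between stemming and lemmatization mentioned in the list above can be sketched with a toy example. The suffix rules and the lookup table here are illustrative assumptions: `toy_stem` is not the Porter algorithm, and `toy_lemmatize` stands in for a real lemmatizer, which relies on a full vocabulary and part-of-speech information.

```python
# Toy suffix rules, checked in order; a deliberate simplification, not the Porter algorithm.
SUFFIX_RULES = [("ies", "i"), ("ed", ""), ("s", "")]

# Hypothetical lookup table standing in for a real lemmatizer's vocabulary.
LEMMAS = {"studies": "study", "transformed": "transform", "was": "be"}

def toy_stem(word):
    """Strip the first matching suffix, whether or not the result is a real word."""
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word

def toy_lemmatize(word):
    """Look the word up; fall back to the word itself if it is unknown."""
    return LEMMAS.get(word, word)

for word in ["studies", "transformed", "was"]:
    print(f"{word}: stem={toy_stem(word)}, lemma={toy_lemmatize(word)}")
# studies: stem=studi, lemma=study
# transformed: stem=transform, lemma=transform
# was: stem=wa, lemma=be
```

Rule-based stripping is fast but can produce non-words such as "studi" or "wa", whereas a lemmatizer returns dictionary forms at the cost of needing linguistic resources.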

## Resources:
- https://spacy.io/

In [33]:
import spacy
from tabulate import tabulate

# Load the English language model
spacy.cli.download("en_core_web_md")
nlp = spacy.load("en_core_web_md")

text = "At the Broad Institute in Cambridge, Dr. Jennifer Doudna's work with CRISPR technology transformed genetic research."

# 1. Tokenization and POS Tagging (Combined in spaCy)
doc = nlp(text)
token_pos = [(token.text, token.pos_) for token in doc]
print("\n--- Tokenization and Part-of-Speech Tagging ---")
print(tabulate(token_pos, headers=["Token", "POS Tag"], tablefmt="fancy_grid"))

# 2. Lemmatization (spaCy has no stemmer; it lemmatizes instead, which is usually more accurate)
lemmatized_tokens = [token.lemma_ for token in doc]
print("\n--- Lemmatization ---")
print(lemmatized_tokens)

# 4. Named Entity Recognition (NER)
ner_results = [(ent.text, ent.label_) for ent in doc.ents]
print("\n--- Named Entity Recognition ---")
print(tabulate(ner_results, headers=["Entity", "NER Tag"], tablefmt="fancy_grid"))

✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_md')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.

--- Tokenization and Part-of-Speech Tagging ---
╒═════════════╤═══════════╕
│ Token       │ POS Tag   │
╞═════════════╪═══════════╡
│ At          │ ADP       │
├─────────────┼───────────┤
│ the         │ DET       │
├─────────────┼───────────┤
│ Broad       │ PROPN     │
├─────────────┼───────────┤
│ Institute   │ PROPN     │
├─────────────┼───────────┤
│ in          │ ADP       │
├─────────────┼───────────┤
│ Cambridge   │ PROPN     │
├─────────────┼───────────┤
│ ,           │ PUNCT     │
├─────────────┼───────────┤
│ Dr.         │ PROPN     │
├─────────────┼───────────┤
│ Jennifer    │ PROPN     │
├─────────────┼───────────┤
│

In [42]:
# Example Sentiment Analysis (using TextBlob for simplicity)
from textblob import TextBlob

def simple_sentiment_analysis_textblob(text):
    blob = TextBlob(text)
    sentiment = blob.sentiment.polarity
    if sentiment > 0:
        return "pos"
    elif sentiment < 0:
        return "neg"
    else:
        return "neutral"

print("\n--- Sentiment Analysis (TextBlob) ---")
print(simple_sentiment_analysis_textblob("I really liked the play, the actors were amazing!"))
print(simple_sentiment_analysis_textblob("This play was so sad. I felt awful afterwards."))
print(simple_sentiment_analysis_textblob("This play was so sad. I felt awful afterwards. The acting was fantastic and the sound quality good."))

# Example Semantic Reasoning (using spaCy similarity)
def simple_semantic_reasoning_spacy(word1, word2):
    doc1 = nlp(word1) # nlp() returns a Doc, even for a single word
    doc2 = nlp(word2)
    similarity = doc1.similarity(doc2)
    return f"'{word1}' and '{word2}' similarity: {similarity:.2f}"

print("\n--- Semantic Reasoning (spaCy) ---")
print(simple_semantic_reasoning_spacy("house", "apartment"))
print(simple_semantic_reasoning_spacy("house", "flat"))
print(simple_semantic_reasoning_spacy("chips", "potato"))
print(simple_semantic_reasoning_spacy("fries", "potato"))
print(simple_semantic_reasoning_spacy("chips", "poker"))
print(simple_semantic_reasoning_spacy("draw", "poker"))


--- Sentiment Analysis (TextBlob) ---
pos
neg
neg

--- Semantic Reasoning (spaCy) ---
'house' and 'apartment' similarity: 0.33
'house' and 'flat' similarity: 0.33
'chips' and 'potato' similarity: 0.16
'fries' and 'potato' similarity: 1.00
'chips' and 'poker' similarity: 0.03
'draw' and 'poker' similarity: 0.12


## Problems with these easy examples:
*Sentiment Analysis (TextBlob)*
- Relies on lexicon-based polarity scores for individual words, so context is lost
- Doesn't handle negation well
- Struggles with sarcasm and complex language
- Doesn't understand the whole sentence's semantic structure

*Semantic Reasoning (spaCy)*
- Similarity is the cosine similarity of static word vectors: each word has a single vector regardless of context, so all senses of a word are blended together
- 'Chips' ambiguity is still problematic
- Doesn't fully understand nuanced word relationships
- Similarity is based on statistical co-occurrence, not deep semantic understanding
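
To show what vector-based similarity computes, here is a minimal sketch of cosine similarity and of averaging word vectors into a document vector. The two-dimensional `VECTORS` are made up for illustration (real spaCy vectors have hundreds of dimensions), and the helper names are assumptions for this example.

```python
import math

# Hypothetical 2-dimensional vectors (axis 0 ~ "food", axis 1 ~ "gambling").
VECTORS = {
    "potato": [1.0, 0.0],
    "poker": [0.0, 1.0],
    "chips": [1.0, 1.0],  # a single static vector blends both senses
}

def cosine(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def doc_vector(words):
    """Average the word vectors component by component."""
    vectors = [VECTORS[w] for w in words]
    return [sum(component) / len(vectors) for component in zip(*vectors)]

print(round(cosine(VECTORS["chips"], VECTORS["potato"]), 3))  # 0.707
print(round(cosine(VECTORS["chips"], VECTORS["poker"]), 3))   # 0.707
print(cosine(VECTORS["potato"], VECTORS["poker"]))            # 0.0
```

Because "chips" has only one vector, it ends up equally close to both neighbours in this toy setup; the vector cannot tell which sense a given sentence intends.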
