# Stemming and Lemmatization

#### 1. Given the list of pluralized words below, define your own simple word stemmer function or class, limited to only simple rules and regex. No libraries! It should strip basic endings.

In [8]:
import re

plurals = [
    "flies",
    "denied",
    "itemization",
    "sensational",
    "reference",
    "colonizer",
]

# define rules: (suffix pattern, replacement)
rules = [
    (r'ies$', 'y'),      # flies -> fly
    (r'ied$', 'y'),      # denied -> deny
    (r'ation$', 'e'),    # itemization -> itemize
    (r'al$', ''),        # sensational -> sensation
    (r'ence$', ''),      # reference -> refer
    (r'izer$', 'ize'),   # colonizer -> colonize
]

# stemmer implementation
def stem(text):
    """Apply every matching suffix rule to each word in the list `text`.

    Words that match no rule are left out of the result.
    """
    stemmed_words = []
    for word in text:
        for suffix, replacement in rules:
            if re.search(suffix, word):
                stemmed_words.append(re.sub(suffix, replacement, word))
    return stemmed_words

print(stem(plurals))

['fly', 'deny', 'itemize', 'sensation', 'refer', 'colonize']


#### 2. After your initial implementation, run it on the following words:

In [9]:
new_words = [
    "friendly",
    "puzzling",
    "helpful",
]

print(stem(new_words))

[]
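
The empty list appears because `stem` only appends a word when one of its rules matches, so words without a matching suffix disappear from the output. Below is a minimal sketch of one way to handle this, assuming unmatched words should be kept unchanged and only the first matching rule should be applied. The names `stem_keep_unmatched` and `demo_rules` are hypothetical and not part of the original exercise.

```python
import re

# Same suffix rules as in the exercise above, repeated so the sketch is self-contained.
demo_rules = [
    (r'ies$', 'y'),
    (r'ied$', 'y'),
    (r'ation$', 'e'),
    (r'al$', ''),
    (r'ence$', ''),
    (r'izer$', 'ize'),
]

def stem_keep_unmatched(words, rules=demo_rules):
    """Apply the first matching suffix rule; keep the word as-is if none match."""
    stemmed = []
    for word in words:
        for suffix, replacement in rules:
            if re.search(suffix, word):
                stemmed.append(re.sub(suffix, replacement, word))
                break  # stop after the first matching rule
        else:
            # no rule matched: keep the original word instead of dropping it
            stemmed.append(word)
    return stemmed

print(stem_keep_unmatched(["flies", "friendly", "puzzling", "helpful"]))
# ['fly', 'friendly', 'puzzling', 'helpful']
```

Keeping unmatched words makes the output list line up with the input list, but it does not make the rules any more general: each new suffix still has to be added by hand.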


#### 3. Realizing that fixing future words manually can be problematic, pick an NLTK stemmer of your choice and run it on all the words:

In [10]:
# source: Getting Started with Natural Language Processing, ch. 3

import nltk
import string
from nltk import word_tokenize
from nltk.corpus import stopwords
from nltk.stem.lancaster import LancasterStemmer

all_words = plurals + new_words

def stemmer(word_list):

    # retrieve stopwords
    stopwords_list = set(stopwords.words('english'))

    # use Lancaster Stemmer
    st = LancasterStemmer()

    # stem words
    stemmed_words = [st.stem(word) for word in word_list
                    if word.lower() not in stopwords_list and word not in string.punctuation]
    return stemmed_words

stemmed_word_list = stemmer(all_words)
print(stemmed_word_list)


['fli', 'deny', 'item', 'sens', 'ref', 'colon', 'friend', 'puzzl', 'help']


#### 4. There are likely a few words in the outputs above that would cause issues in real-world applications. Pick some examples, and show how they are solved with a lemmatizer. Use either spaCy or nltk.

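The Lancaster stemmer above produces several stems that are not real words, such as "fli" for "flies" and "puzzl" for "puzzling", and it over-stems others: "itemization" becomes "item" and "colonizer" becomes "colon". Errors like these (over-stemming and under-stemming) reduce the accuracy of any downstream word analysis. Lemmatization instead maps a word to its dictionary form (lemma) using vocabulary and morphological information. For example, a lemmatizer reduces "flies" to "fly" by recognizing it as an inflected form of "fly", either the noun (the fly) or the verb (to fly). Note that NLTK's WordNetLemmatizer treats every word as a noun unless a part-of-speech argument is passed, which is why verb forms such as "denied" come back unchanged in the output below.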

In [11]:
import nltk
from nltk.stem import WordNetLemmatizer

# download necessary NLTK data if it is not already available locally
def download_nltk_data(package):
    try:
        nltk.data.find(package)
    except LookupError:
        nltk.download(package.split('/')[-1])

download_nltk_data('corpora/wordnet')

# initialize the WordNet lemmatizer
lemmatizer = WordNetLemmatizer()

# lemmatize words (without a POS argument, every word is treated as a noun)
lemmatized_word_list = [lemmatizer.lemmatize(word.lower()) for word in all_words]
print(lemmatized_word_list)

['fly', 'denied', 'itemization', 'sensational', 'reference', 'colonizer', 'friendly', 'puzzling', 'helpful']


# Stemming/Lemmatization - Practical Example
Using the news corpus (subset/category of the Brown corpus), perform common text normalization techniques such as stopword filtering and stemming/lemmatization. Compare the top 10 most common **words** before and after these normalization techniques.

In [12]:
from nltk.corpus import brown
from nltk.probability import FreqDist

words = brown.words(categories='news')

# calculate frequency distribution
fdist_initial = FreqDist(words)

# print the top 10 most common words
print(fdist_initial.most_common(10))

[('the', 5580), (',', 5188), ('.', 4030), ('of', 2849), ('and', 2146), ('to', 2116), ('a', 1993), ('in', 1893), ('for', 943), ('The', 806)]


In [15]:
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

stopwords_list = set(stopwords.words('english'))
words = brown.words(categories='news')
# initialize lemmatizer
lemmatizer = WordNetLemmatizer()

# filter out stopwords and lemmatize remaining words
normalized_words = [lemmatizer.lemmatize(word.lower()) for word in words
                    if word.lower() not in stopwords_list and word.isalpha()]

# calculate frequency distribution of the normalized words
fdist_normalized = FreqDist(normalized_words)

# print the top 10 most common words after normalization
print(fdist_normalized.most_common(10))

[('said', 406), ('would', 246), ('year', 244), ('new', 241), ('one', 221), ('state', 213), ('last', 177), ('two', 174), ('first', 158), ('president', 143)]


# TF-IDF
TF-IDF (term frequency-inverse document frequency) measures how important a word is to a document relative to a collection of documents (the corpus).

$$
\text{tf-idf}(t, d, D) = \text{tf}(t, d) \times \text{idf}(t, D)
$$

Where:
- $t$ is the term (word)
- $d$ is the document
- $D$ is the corpus
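
The formula leaves $\text{tf}$ and $\text{idf}$ open, and several variants exist. One possible choice, which matches the smoothed implementation in the next exercise, is sketched below. The symbols $f_{t,d}$ (the number of times $t$ occurs in $d$) and $|d|$ (the number of tokens in $d$) are introduced here for illustration.

$$
\text{tf}(t, d) = \frac{f_{t,d}}{|d|}, \qquad
\text{idf}(t, D) = \log \frac{1 + |D|}{1 + |\{d \in D : t \in d\}|}
$$

With this smoothing, a term that occurs in every document gets $\text{idf} = \log 1 = 0$, and a term that occurs in no document still has a finite idf.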



#### 1. Implement TF-IDF using NLTK's FreqDist (no use of e.g. scikit-learn and other high-level libraries).

In [None]:
import math
from typing import List
from nltk.probability import FreqDist

def tf(document: List[str], term: str) -> float:
    """Relative frequency of `term` in a tokenized document.

    Only `term` is lowercased here, so the tokens in `document` should
    already be lowercased if matching is meant to be case-insensitive.
    """
    fdist = FreqDist(document)
    return fdist[term.lower()] / len(document)


def idf(documents: List[List[str]], term: str) -> float:
    """Smoothed inverse document frequency of `term` over all documents."""
    term = term.lower()
    total_docs = len(documents)
    # number of documents that contain the term (case-insensitive)
    term_docs = sum(1 for doc in documents if term in {word.lower() for word in doc})

    # add-one smoothing avoids division by zero when the term occurs in no
    # document, or when there are no documents at all
    return math.log((1 + total_docs) / (1 + term_docs))


def tf_idf(
    all_documents: List[List[str]],
    document: List[str],
    term: str,
) -> float:
    """TF-IDF score of `term` in `document`, relative to `all_documents`."""
    return tf(document, term) * idf(all_documents, term)


#### 2. With your TF-IDF function in place, calculate the TF-IDF for the following words in the first document of the news articles found in the Brown corpus: 

- *the*
- *nevertheless*
- *highway*
- *election*

Perform any preprocessing steps you deem necessary. Comment on your findings.

In [None]:
fileids = brown.fileids(categories='news')
first_doc = list(brown.words(fileids[0]))
all_docs = [list(brown.words(fileid)) for fileid in fileids]

term_list = ['the', 'nevertheless', 'highway', 'election']

for term in term_list:
    score = tf_idf(all_docs, first_doc, term)
    print(f'{term}: {score}')


the: 0.0
nevertheless: 0.0
highway: 0.0029400864103517653
election: 0.008253604709969881


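I decided not to remove stopwords such as "the" so that their tf-idf scores could be compared with those of content words. "the" scores 0.0 even though it is frequent in the first document: it occurs in every news document, so its idf is log(1) = 0. "nevertheless" also scores 0.0, most likely for the opposite reason: it does not appear in the first document, so its tf is 0. "highway" and "election" occur in the first document but only in some of the others, so they receive positive scores, with "election" ranked highest. In short, tf-idf rewards words that are frequent in one document but rare across the collection.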
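
A score of exactly 0.0 can arise in two different ways: the term is absent from the document (tf = 0), or it is present in every document (idf = 0 under the smoothed formula). The sketch below illustrates both on a toy corpus. The data and the helper names `toy_tf` and `toy_idf` are hypothetical and are not taken from the Brown corpus or from the functions above.

```python
import math
from collections import Counter

def toy_tf(doc, term):
    # relative frequency of the term in one tokenized document
    return Counter(doc)[term] / len(doc)

def toy_idf(docs, term):
    # smoothed idf: log((1 + N) / (1 + document frequency))
    df = sum(1 for doc in docs if term in doc)
    return math.log((1 + len(docs)) / (1 + df))

toy_docs = [
    ["the", "election", "was", "held"],
    ["the", "highway", "was", "closed"],
    ["the", "jury", "met"],
]
first = toy_docs[0]

for term in ["the", "highway", "election"]:
    tf_value = toy_tf(first, term)
    idf_value = toy_idf(toy_docs, term)
    print(f"{term}: tf={tf_value:.2f}, idf={idf_value:.2f}, tf-idf={tf_value * idf_value:.4f}")
```

Here "the" has a non-zero tf but an idf of 0, "highway" has a non-zero idf but a tf of 0, and only "election" ends up with a positive score.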

#### 3. While TF-IDF is primarily used for information retrieval and text mining, reflect on how TF-IDF could be used in a language modeling context.

In a language modeling context, TF-IDF weights could serve as a signal of which words carry the most information, for example by placing more emphasis on less common but significant words in a downstream task such as sentiment analysis.

#### 4. You were previously introduced to word representations. TF-IDF can be considered one. What are some differences between the TF-IDF output and one that is computed once from a vocabulary (e.g. one-hot encoding)?

TF-IDF assigns each word a numerical weight based on how often it occurs in a document relative to how many documents in the collection contain it, so the same word can receive different weights in different documents and distinctive words are highlighted. In contrast, a one-hot encoding is computed once from the vocabulary: each word gets a fixed binary vector that only identifies the word, without any information about frequency or context.
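
To make the contrast concrete, here is a small sketch on a two-document toy corpus. The data and the helpers `one_hot` and `tfidf_vector` are hypothetical, and the idf uses the same smoothed formula as earlier in this notebook.

```python
import math
from collections import Counter

mini_docs = [
    ["wheat", "price", "rose"],
    ["wheat", "export", "fell"],
]
# shared vocabulary: one vector position per distinct word
vocab = sorted({w for doc in mini_docs for w in doc})

def one_hot(word):
    # fixed per-word vector: depends only on the vocabulary
    return [1 if w == word else 0 for w in vocab]

def tfidf_vector(doc, docs):
    # per-document vector: one tf-idf weight per vocabulary entry
    counts = Counter(doc)
    vec = []
    for w in vocab:
        df = sum(1 for d in docs if w in d)
        idf = math.log((1 + len(docs)) / (1 + df))
        vec.append(counts[w] / len(doc) * idf)
    return vec

print(vocab)
print(one_hot("wheat"))
print(tfidf_vector(mini_docs[0], mini_docs))
```

The one-hot vector for "wheat" is the same wherever the word occurs, while its tf-idf weight depends on the document and the collection. In this toy corpus the weight is 0 because "wheat" appears in both documents.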

# TF-IDF - Practical Example
You will again be looking at specific words for a document, but this time weighted by their TF-IDF scores. Ideally, the scoring should be able to retrieve representative words for this document in context of its document collection or category.

You will do the following:
- Select a category from the Reuters (news) corpus
- Perform preprocessing
- Calculate TF-IDF scores
- Find the top 5 words for *each document* in a subset of documents in your collection (e.g. 5, 10, ... documents total)
- Inspect whether these words make sense for a given document, and comment on your findings.

In [None]:
import nltk
from nltk.corpus import reuters, stopwords
from nltk.tokenize import word_tokenize

nltk.download("reuters")
nltk.download("punkt")

category = 'wheat'
documents = reuters.fileids(category)

# preprocess documents
stopword_list = set(stopwords.words('english'))
preprocessed_docs = []

for doc_id in documents[:10]:  # subset of 10 documents
    words = [word.lower() for word in word_tokenize(reuters.raw(doc_id)) if word.isalpha()]
    preprocessed_docs.append([word for word in words if word not in stopword_list])

top_words_per_doc = []

for doc in preprocessed_docs:
    scores = {}
    for word in set(doc):
        tf_idf_score = tf_idf(preprocessed_docs, doc, word)
        # store the score in the dictionary
        scores[word] = tf_idf_score

    top_words = sorted(scores.items(), key=lambda x: x[1], reverse=True)[:5]
    top_words_per_doc.append(top_words)

# display top 5 words for each document
for i, top_words in enumerate(top_words_per_doc, start=1):
    formatted_top_words = ', '.join([f"{word}: {score:.4f}" for word, score in top_words])
    print(f"Document {i}: {formatted_top_words}")

Document 1: price: 0.0533, approval: 0.0533, approved: 0.0533, sri: 0.0533, continental: 0.0533
Document 2: mln: 0.0800, sown: 0.0509, pct: 0.0423, last: 0.0349, crop: 0.0349
Document 3: agent: 0.0960, honduras: 0.0720, pct: 0.0570, better: 0.0480, laydays: 0.0480
Document 4: tunisia: 0.1263, french: 0.0947, tender: 0.0722, credits: 0.0631, coface: 0.0631
Document 5: flour: 0.1218, iraq: 0.0913, ccc: 0.0609, bonus: 0.0609, bid: 0.0609
Document 6: buy: 0.1218, egypt: 0.1218, authorized: 0.1218, ship: 0.0609, existing: 0.0609
Document 7: agreement: 0.1019, january: 0.0764, pl: 0.0669, signed: 0.0669, discussing: 0.0669
Document 8: hectares: 0.1299, mln: 0.0920, china: 0.0650, henan: 0.0568, pests: 0.0568
Document 9: fao: 0.0781, world: 0.0521, mln: 0.0482, output: 0.0390, record: 0.0298
Document 10: oil: 0.1065, prices: 0.0838, export: 0.0663, adjusted: 0.0639, follows: 0.0559




# Part-of-speech tagging

#### 1. Briefly describe your understanding of POS tagging and its possible use-cases in context of text generation applications/language modeling.

POS tagging assigns each word in a text a part of speech, such as noun, verb, or adjective, which gives a basic view of the text's syntax. In text generation, the tags can be used to check or enforce grammatical structure. In language modeling, they help disambiguate words that can be more than one part of speech, e.g. "fly" as a noun or a verb.

#### 2. Train a UnigramTagger (NLTK) using the Brown corpus. 
Hint: the taggers in nltk require a list of sentences containing tagged words.
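
Conceptually, a unigram tagger memorizes the most frequent tag for each word in the training data and has no answer for words it has never seen. The sketch below shows that idea in pure Python on toy data shaped like the tagged sentences NLTK expects. It is not NLTK's actual implementation, and the names `train_unigram` and `tag_with_unigram` are hypothetical.

```python
from collections import Counter, defaultdict

# Toy training data: a list of sentences, each a list of (word, tag) pairs.
toy_tagged_sents = [
    [("you", "PRON"), ("justify", "VERB"), ("actions", "NOUN")],
    [("they", "PRON"), ("fly", "VERB")],
    [("a", "DET"), ("fly", "NOUN")],
    [("the", "DET"), ("fly", "NOUN")],
]

def train_unigram(tagged_sents):
    """Map each word to the tag it carries most often in the training data."""
    tag_counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            tag_counts[word][tag] += 1
    return {word: counts.most_common(1)[0][0] for word, counts in tag_counts.items()}

def tag_with_unigram(model, tokens):
    # unseen words get None, since this sketch has no backoff tagger
    return [(tok, model.get(tok)) for tok in tokens]

model = train_unigram(toy_tagged_sents)
print(tag_with_unigram(model, ["the", "justifier", "fly"]))
# [('the', 'DET'), ('justifier', None), ('fly', 'NOUN')]
```

Because the tag depends only on the word itself, "fly" always receives its majority tag regardless of context, and a word missing from the training data receives `None`.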

In [None]:
import nltk
from nltk.corpus import brown
from nltk.tag import UnigramTagger

nltk.download('brown')
nltk.download('punkt')

trained_tagger = None

def tag_sentence(sentence):
    global trained_tagger

    # train the tagger if it hasn't been trained yet
    if trained_tagger is None:
        brown_tagged_sents = brown.tagged_sents(tagset='universal')
        trained_tagger = UnigramTagger(brown_tagged_sents)

    # tokenize if necessary
    if isinstance(sentence, str):
        sentence = nltk.word_tokenize(sentence)

    # tag the sentence using the trained UnigramTagger
    tagged_sentence = trained_tagger.tag(sentence)

    return tagged_sentence





#### 3. Use this tagger to tag the text given below. Print out the POS tags for all variants of "justify".

In [None]:
text = """
Imagine a situation where you have to explain why you did something – that's when you justify your actions. So, let's say you made a decision; you, as the justifier, need to give good reasons (justifications) for your choice. You might use justifying words to make your point clear and reasonable. Justifying can be a bit like saying, "Here's why I did what I did." When you justify things, you're basically providing the why behind your actions. So, being a good justifier involves carefully explaining, giving reasons, and making sure others understand your choices
"""

tagged_sentence = tag_sentence(text)
print(text)
print(tagged_sentence)



Imagine a situation where you have to explain why you did something – that's when you justify your actions. So, let's say you made a decision; you, as the justifier, need to give good reasons (justifications) for your choice. You might use justifying words to make your point clear and reasonable. Justifying can be a bit like saying, "Here's why I did what I did." When you justify things, you're basically providing the why behind your actions. So, being a good justifier involves carefully explaining, giving reasons, and making sure others understand your choices

[('Imagine', 'VERB'), ('a', 'DET'), ('situation', 'NOUN'), ('where', 'ADV'), ('you', 'PRON'), ('have', 'VERB'), ('to', 'PRT'), ('explain', 'VERB'), ('why', 'ADV'), ('you', 'PRON'), ('did', 'VERB'), ('something', 'NOUN'), ('–', None), ('that', 'ADP'), ("'s", None), ('when', 'ADV'), ('you', 'PRON'), ('justify', 'VERB'), ('your', 'DET'), ('actions', 'NOUN'), ('.', '.'), ('So', 'ADV'), (',', '.'), ('let', 'VERB'), ("'s", None), ('
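
The task also asks for the tags of all variants of "justify". Below is a small sketch of one way to pull them out of a tagged list, assuming a simple prefix match on "justif" is good enough. The sample pairs are copied from the UnigramTagger output shown in this notebook, and `justify_variants` is a hypothetical helper.

```python
def justify_variants(tagged_tokens, prefix="justif"):
    """Return the (word, tag) pairs whose word starts with the given prefix."""
    return [(word, tag) for word, tag in tagged_tokens
            if word.lower().startswith(prefix)]

# A few (word, tag) pairs as produced by the unigram tagger
sample_tagged = [
    ("you", "PRON"), ("justify", "VERB"), ("your", "DET"),
    ("the", "DET"), ("justifier", None),
    ("justifications", "NOUN"), ("justifying", "VERB"),
]

for word, tag in justify_variants(sample_tagged):
    print(f"{word}: {tag}")
```

On the full text, the same helper could be called as `justify_variants(tagged_sentence)`.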

#### 4. Your results may be disappointing. Repeat the same task as above using both the default NLTK pos-tagger and spaCy. Compare the results.

In [None]:
import nltk
nltk.download('averaged_perceptron_tagger')

# tokenize the sentence
tokens = nltk.word_tokenize(text)

# NLTK's default POS tagger
nltk_tagged = nltk.pos_tag(tokens)




In [None]:
# Install spaCy in your environment if you haven't already:
# pip install spacy
# python -m spacy download en_core_web_sm

import spacy

# load spaCy's small English pipeline
nlp = spacy.load("en_core_web_sm")

# process sentence
doc = nlp(text)

# extract tokens and POS tags
spacy_tagged = [(token.text, token.pos_) for token in doc]

Comparison of results:

In [None]:
print(tagged_sentence)
print(nltk_tagged)
print(spacy_tagged)

[('Imagine', 'VERB'), ('a', 'DET'), ('situation', 'NOUN'), ('where', 'ADV'), ('you', 'PRON'), ('have', 'VERB'), ('to', 'PRT'), ('explain', 'VERB'), ('why', 'ADV'), ('you', 'PRON'), ('did', 'VERB'), ('something', 'NOUN'), ('–', None), ('that', 'ADP'), ("'s", None), ('when', 'ADV'), ('you', 'PRON'), ('justify', 'VERB'), ('your', 'DET'), ('actions', 'NOUN'), ('.', '.'), ('So', 'ADV'), (',', '.'), ('let', 'VERB'), ("'s", None), ('say', 'VERB'), ('you', 'PRON'), ('made', 'VERB'), ('a', 'DET'), ('decision', 'NOUN'), (';', '.'), ('you', 'PRON'), (',', '.'), ('as', 'ADP'), ('the', 'DET'), ('justifier', None), (',', '.'), ('need', 'VERB'), ('to', 'PRT'), ('give', 'VERB'), ('good', 'ADJ'), ('reasons', 'NOUN'), ('(', '.'), ('justifications', 'NOUN'), (')', '.'), ('for', 'ADP'), ('your', 'DET'), ('choice', 'NOUN'), ('.', '.'), ('You', 'PRON'), ('might', 'VERB'), ('use', 'NOUN'), ('justifying', 'VERB'), ('words', 'NOUN'), ('to', 'PRT'), ('make', 'VERB'), ('your', 'DET'), ('point', 'NOUN'), ('clear'

#### 5. Finally, explore more features of what the spaCy *document* includes related to topics covered in this lab.

In [None]:
import spacy
from spacy import displacy

# use the first 5 sentences from the news category of the Brown corpus
sentences = brown.sents(categories='news')[:5]

nlp = spacy.load("en_core_web_sm")

print("Named Entity Recognition:")

for sentence in sentences:
    # Brown sentences are token lists, so join them back into a string for spaCy
    sentence_text = " ".join(sentence)
    doc = nlp(sentence_text)

    # print the sentence followed by its named entities
    print(f"{doc}:")
    for ent in doc.ents:
        print(f" - {ent.text} ({ent.label_})")


Named Entity Recognition:
The Fulton County Grand Jury said Friday an investigation of Atlanta's recent primary election produced `` no evidence '' that any irregularities took place .:
 - The Fulton County Grand Jury (ORG)
 - Friday (DATE)
 - Atlanta (GPE)
The jury further said in term-end presentments that the City Executive Committee , which had over-all charge of the election , `` deserves the praise and thanks of the City of Atlanta '' for the manner in which the election was conducted .:
 - the City Executive Committee (ORG)
 - the City of Atlanta (GPE)
The September-October term jury had been charged by Fulton Superior Court Judge Durwood Pye to investigate reports of possible `` irregularities '' in the hard-fought primary which was won by Mayor-nominate Ivan Allen Jr. .:
 - September-October (DATE)
 - Fulton Superior Court (ORG)
 - Durwood Pye (PERSON)
 - Ivan Allen Jr. (PERSON)
`` Only a relative handful of such reports was received '' , the jury said , `` considering the wid