# Stemming and Lemmatization

#### 1. Given the list of words below, define your own simple word stemmer (a function or a class) that uses only simple rules and regular expressions. No external libraries! It should strip basic endings.

In [None]:
plurals = [
    "flies",
    "denied",
    "itemization",
    "sensational",
    "reference",
    "colonizer",
]

# TODO: implement your own simple stemmer
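
One possible starting point is an ordered list of suffix rules where the first match wins. This is a minimal sketch, not a reference solution: the names `SUFFIX_RULES` and `simple_stem` and the choice of rules are assumptions made for illustration.

```python
import re

# Ordered (pattern, replacement) rules; the first rule that matches wins.
# The rule set is illustrative and deliberately small.
SUFFIX_RULES = [
    (re.compile(r"ization$"), "ize"),
    (re.compile(r"ational$"), "ate"),
    (re.compile(r"izer$"), "ize"),
    (re.compile(r"ies$"), "y"),
    (re.compile(r"ied$"), "y"),
    (re.compile(r"(ence|er|ing|ly|ful|s)$"), ""),
]


def simple_stem(word: str) -> str:
    """Lower-case the word and apply the first matching suffix rule."""
    word = word.lower()
    for pattern, replacement in SUFFIX_RULES:
        if pattern.search(word):
            return pattern.sub(replacement, word)
    return word


for word in ["flies", "denied", "itemization", "sensational", "reference", "colonizer"]:
    print(word, "->", simple_stem(word))
```

Rules like these over-strip short words (`simple_stem("her")` returns `"h"`), so a minimum stem length is one guard worth considering.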

#### 2. After your initial implementation, run it on the following words:

In [None]:
new_words = [
    "friendly",
    "puzzling",
    "helpful",
]
# TODO: run your stemmer on the new words

#### 3. Since manually adding rules for every new word quickly becomes impractical, pick an NLTK stemmer of your choice and run it on all the words:

In [None]:
import nltk

all_words = plurals + new_words

# TODO: use an nltk stemming implementation to stem `all_words`
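
As a hedged illustration, the sketch below uses NLTK's `PorterStemmer`; the choice of stemmer is an assumption, and other NLTK stemmers could be swapped in. The word list is repeated so the snippet runs on its own.

```python
from nltk.stem import PorterStemmer

# Same words as `plurals + new_words`, repeated to keep the snippet self-contained.
words = [
    "flies", "denied", "itemization", "sensational", "reference",
    "colonizer", "friendly", "puzzling", "helpful",
]

stemmer = PorterStemmer()

# Map every word to its stem so the results are easy to inspect.
stems = {word: stemmer.stem(word) for word in words}

for word, stem in stems.items():
    print(f"{word:>12} -> {stem}")
```

Stems such as `fli` (from "flies") are not dictionary words, which is worth keeping in mind for the next question.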

#### 4. There are likely a few words in the outputs above that would cause issues in real-world applications. Pick some examples, and show how a lemmatizer handles them. Use either spaCy or NLTK.

Your answer here! Code below.

In [None]:
# TODO: basic observations on which examples are problematic with stemming + implement lemmatization with spacy/nltk

# Stemming/Lemmatization - Practical Example
Using the news category of the Brown corpus, apply common text normalization techniques such as stopword filtering and stemming/lemmatization. Compare the 10 most common **words** before and after normalization.

In [None]:
# import nltk; nltk.download('brown')  # ensure we have the data
from nltk.corpus import brown
news = brown.words(categories='news')

# TODO: find the top 10 most common words

In [None]:
# TODO: find the top 10 most common words after applying text normalization techniques
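
The before/after comparison can be sketched with `collections.Counter`. Everything below is a stand-in: `tokens` is a hypothetical toy list in place of the Brown news words, `STOPWORDS` is a tiny hand-written set, and `naive_stem` is a placeholder for a real stemmer or lemmatizer.

```python
from collections import Counter

# Hypothetical stand-in for the Brown news tokens.
tokens = [
    "The", "jury", "said", "the", "election", "was", "fair", ".",
    "The", "elections", "were", "praised", ",", "the", "jury", "said", ".",
]

# Tiny illustrative stopword set; a fuller list would be used in practice.
STOPWORDS = {"the", "was", "were", "a", "an", "of", "and"}


def naive_stem(token: str) -> str:
    """Placeholder normalizer: strip a trailing 's' from words longer than 3 characters."""
    return token[:-1] if token.endswith("s") and len(token) > 3 else token


# Counts on the raw tokens: case variants and punctuation are counted separately.
raw_counts = Counter(tokens)

# Counts after lower-casing, dropping punctuation and stopwords, and stemming.
normalized = [
    naive_stem(token.lower())
    for token in tokens
    if token.isalpha() and token.lower() not in STOPWORDS
]
normalized_counts = Counter(normalized)

print(raw_counts.most_common(10))
print(normalized_counts.most_common(10))
```

In the toy data, "The" and "the" are counted separately before normalization, while "election" and "elections" are merged afterwards.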

# TF-IDF
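TF-IDF (term frequency-inverse document frequency) measures how important a word is to a document relative to a collection of documents (the corpus): the score grows with the word's frequency in the document and shrinks as the word appears in more documents of the corpus.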

$$
\text{tf-idf}(t, d, D) = \text{tf}(t, d) \times \text{idf}(t, D)
$$

Where:
- $t$ is the term (word)
- $d$ is the document
- $D$ is the corpus
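
The two factors come in several variants. One common choice, given here as an assumption rather than the definition this lab requires, uses the relative frequency for tf and a log-scaled ratio for idf:

$$
\text{tf}(t, d) = \frac{f_{t,d}}{\sum_{t' \in d} f_{t',d}}, \qquad
\text{idf}(t, D) = \log \frac{|D|}{|\{d' \in D : t \in d'\}|}
$$

Here $f_{t,d}$ is the number of times $t$ occurs in $d$, and $|D|$ is the number of documents in the corpus. With this choice, a term that occurs in every document gets $\text{idf} = \log 1 = 0$, and the ratio is undefined for a term that occurs in no document, which is why smoothed variants exist.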



#### 1. Implement TF-IDF using NLTK's `FreqDist` (without scikit-learn or other high-level libraries).

In [None]:
from typing import List

##########################################################
# Feel free to change everything below.
# It is merely a guide to understand the inputs/outputs
##########################################################


############ TODO ############
def tf(document: List[str], term: str) -> float:
    """
    Calculate the term frequency (TF) of a given term in a document.

    Args:
        document (List[str]): The document in which to calculate the term frequency.
        term (str): The term for which to calculate the term frequency.

    Returns:
        float: The term frequency of the given term in the document.
    """
    return


############ TODO ############
def idf(documents: List[List[str]], term: str) -> float:
    """
    Calculate the inverse document frequency (IDF) of a term in a collection of documents.

    Args:
        documents (List[List[str]]): A list of documents, where each document is represented as a list of strings.
        term (str): The term for which IDF is calculated.

    Returns:
        float: The IDF value of the term.
    """
    return


############ TODO ############
def tf_idf(
    all_documents: List[List[str]],
    document: List[str],
    term: str,
) -> float:
    """
    Calculate the TF-IDF score of a term: tf(term, document) * idf(term, all_documents).
    """
    return
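
For orientation, here is a minimal sketch of one way the three functions could fit together, assuming relative frequency for tf and `log(N / df)` for idf. It uses `collections.Counter` so that it runs without NLTK; `nltk.FreqDist` subclasses `Counter` and can be dropped in. The function names and the toy documents are hypothetical.

```python
import math
from collections import Counter
from typing import List


def term_frequency(document: List[str], term: str) -> float:
    """Relative frequency of `term` in `document` (0.0 for an empty document)."""
    if not document:
        return 0.0
    counts = Counter(document)  # nltk.FreqDist(document) could be used instead
    return counts[term] / len(document)


def inverse_document_frequency(documents: List[List[str]], term: str) -> float:
    """log(N / df), or 0.0 when the term occurs in no document."""
    doc_freq = sum(1 for doc in documents if term in doc)
    if doc_freq == 0:
        return 0.0
    return math.log(len(documents) / doc_freq)


def tf_idf_score(documents: List[List[str]], document: List[str], term: str) -> float:
    """Product of the two factors above."""
    return term_frequency(document, term) * inverse_document_frequency(documents, term)


# Hypothetical toy corpus of three tokenized documents.
docs = [
    ["the", "election", "was", "held"],
    ["the", "highway", "was", "closed"],
    ["the", "jury", "met"],
]

print(tf_idf_score(docs, docs[0], "the"))       # 0.0, since "the" is in every document
print(tf_idf_score(docs, docs[0], "election"))  # 0.25 * log(3)
```

Returning 0.0 for unseen terms is a design choice made here to avoid a division by zero; smoothing the document frequency is a common alternative.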


#### 2. With your TF-IDF function in place, calculate the TF-IDF for the following words in the first document of the news articles found in the Brown corpus: 

- *the*
- *nevertheless*
- *highway*
- *election*

Perform any preprocessing steps you deem necessary. Comment on your findings.

In [None]:
fileids = brown.fileids(categories='news')
first_doc = list(brown.words(fileids[0]))
all_docs = [list(brown.words(fileid)) for fileid in fileids]

# TODO: preprocess and calculate tf-idf scores.

#### 3. While TF-IDF is primarily used for information retrieval and text mining, reflect on how TF-IDF could be used in a language modeling context.

Your answer here!

#### 4. You were previously introduced to word representations. TF-IDF can be considered one. What are some differences between the TF-IDF output and a representation that is computed once from a vocabulary (e.g. one-hot encoding)?

Your answer here!

# TF-IDF - Practical Example
You will again look at specific words in a document, this time weighted by their TF-IDF scores. Ideally, the scoring retrieves words that are representative of the document in the context of its document collection or category.

You will do the following:
- Select a category from the Reuters (news) corpus
- Perform preprocessing
- Calculate TF-IDF scores
- Find the top 5 words for *each document* in a subset of documents in your collection (e.g. 5, 10, ... documents total)
- Inspect whether these words make sense for a given document, and comment on your findings.
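
The ranking step in the list above can be sketched as follows. The helper `top_terms` and the toy documents are hypothetical; with the real data, the token lists would come from the Reuters corpus (for example via `reuters.fileids(...)` and `reuters.words(...)`) after preprocessing. The scoring assumes relative frequency for tf and `log(N / df)` for idf.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def top_terms(documents: Dict[str, List[str]], k: int = 5) -> Dict[str, List[Tuple[str, float]]]:
    """Return the k highest-scoring (term, tf-idf) pairs for every document."""
    n_docs = len(documents)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for tokens in documents.values() for term in set(tokens))

    ranked = {}
    for doc_id, tokens in documents.items():
        counts = Counter(tokens)
        scores = {
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
        # Sort by descending score, breaking ties alphabetically.
        ranked[doc_id] = sorted(scores.items(), key=lambda item: (-item[1], item[0]))[:k]
    return ranked


# Hypothetical, already preprocessed toy documents.
toy = {
    "doc1": ["wheat", "export", "rose", "wheat", "price"],
    "doc2": ["oil", "price", "fell", "oil", "export"],
    "doc3": ["bank", "rate", "rose", "bank", "price"],
}

for doc_id, terms in top_terms(toy, k=3).items():
    print(doc_id, terms)
```

In the toy data, "price" occurs in every document and therefore scores 0.0, while the repeated, document-specific terms rank first.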

In [None]:
import nltk; nltk.download("reuters")
from nltk.corpus import reuters

# Part-of-speech tagging

#### 1. Briefly describe your understanding of POS tagging and its possible use cases in the context of text generation applications and language modeling.

Your answer here!

#### 2. Train a UnigramTagger (NLTK) using the Brown corpus. 
Hint: the taggers in nltk require a list of sentences containing tagged words.

In [None]:
# TODO: train a unigram tagger on the brown corpus
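
To make the expected input format concrete, the sketch below trains a `UnigramTagger` on a hypothetical two-sentence training set instead of the Brown corpus. The sentences and their tags are made up for illustration; for the exercise, the tagged sentences of the Brown corpus (`brown.tagged_sents()`) would take their place.

```python
from nltk.tag import UnigramTagger

# Hypothetical miniature training set in the format the tagger expects:
# a list of sentences, each a list of (word, tag) pairs.
train_sents = [
    [("you", "PPSS"), ("justify", "VB"), ("your", "PP$"), ("actions", "NNS")],
    [("you", "PPSS"), ("explain", "VB"), ("your", "PP$"), ("choice", "NN")],
]

tagger = UnigramTagger(train_sents)

# A unigram tagger assigns each word the tag it saw most often in training;
# words it never saw are tagged None.
tagged = tagger.tag(["you", "justify", "justifying", "choices"])
print(tagged)
```

Because the tagger looks up whole word forms, "justifying" and "choices" receive `None` here even though closely related forms were in the training data.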

#### 3. Use this tagger to tag the text given below. Print out the POS tags for all variants of "justify".

In [None]:
text = """
Imagine a situation where you have to explain why you did something – that's when you justify your actions. So, let's say you made a decision; you, as the justifier, need to give good reasons (justifications) for your choice. You might use justifying words to make your point clear and reasonable. Justifying can be a bit like saying, "Here's why I did what I did." When you justify things, you're basically providing the why behind your actions. So, being a good justifier involves carefully explaining, giving reasons, and making sure others understand your choices
"""

# TODO: use your trained tagger
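
Picking out the variants of "justify" from tagged output can be done with a simple prefix match. This is a sketch under assumptions: the helper `justify_variants` and the sample pairs are hypothetical, and matching on the prefix "justif" is only one way to define a variant.

```python
import re
from typing import List, Optional, Tuple

TaggedToken = Tuple[str, Optional[str]]


def justify_variants(tagged: List[TaggedToken]) -> List[TaggedToken]:
    """Keep the (word, tag) pairs whose word starts with 'justif', ignoring case."""
    return [
        (word, tag)
        for word, tag in tagged
        if re.match(r"justif", word, re.IGNORECASE)
    ]


# Hypothetical tagger output; None marks a word the tagger could not tag.
sample = [
    ("you", "PPSS"),
    ("justify", "VB"),
    ("Justifying", None),
    ("the", "AT"),
    ("justifier", None),
]

print(justify_variants(sample))
```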

#### 4. Your results may be disappointing. Repeat the same task using both the default NLTK POS tagger and spaCy. Compare the results.

In [None]:
# TODO: use the default NLTK tagger

In [None]:
# TODO: use spacy to fetch pos tags from the document

#### 5. Finally, explore which other features of the spaCy *document* relate to topics covered in this lab.

In [None]:
# TODO