> Introduction to Natural Language Processing

Natural Language Processing (NLP) is all about teaching computers to understand and work with human language (the words we use every day). From chatbots and translation apps to spam filters and voice assistants, NLP powers many tools we rely on without even noticing. In this notebook, we’ll explore some of the basic techniques that let machines handle text in meaningful ways.

## Setup

We start with the `import`s used throughout the notebook.

In [None]:
import numpy as np

import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS, TfidfTransformer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.decomposition import LatentDirichletAllocation

from sklearn.datasets import fetch_20newsgroups
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report

Below we'll be using two [nltk](https://www.nltk.org) resources that are not installed along with the library.

In [None]:
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('stopwords')

<font color='red'>TO-DO</font> What are the two NLTK resources we downloaded above?

# Text Preprocessing

Text is inherently *unstructured data* in the sense that it does not follow a fixed format (it's not a neat table, array or spreadsheet). Hence, some *preprocessing* is required before text can be fed into a machine learning (ML) model. Learning to do this *preprocessing* is the purpose of this section.

Notice that
> As machine learning algorithms process numbers rather than text, the text must be converted to numbers.
>
> -- <cite>[Wikipedia](https://en.wikipedia.org/wiki/Large_language_model#Tokenization)</cite>

### Tokenization

Tokenization is the process of splitting a text into smaller pieces, called tokens. These tokens can be words, phrases, or even sentences. Each token is then assigned a (unique) number. Tokenization is a crucial step in NLP as it allows us to analyze and process textual data more effectively.

CAVEAT: This is markedly different from [*Lexical* tokenization](https://en.wikipedia.org/wiki/Lexical_analysis#Lexical_token_and_lexical_tokenization), the latter being of no interest to us in this course.

Let's *tokenize* the following text (you can replace it with any text you like)


In [None]:
text = 'They can kill you, but the legalities of eating you are quite a bit dicier (from "Infinite Jest", by David Foster Wallace)'

using the `nltk` library

In [None]:
tokens = word_tokenize(text)
print("Tokens:", tokens)

With this particular *tokenizer* each token is a word, and punctuation marks are treated as separate tokens. This is useful for many NLP tasks, but there are other tokenizers that can handle punctuation differently or tokenize at the character level. 
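
To make the contrast between tokenizers concrete, here is a minimal sketch that uses only Python's standard library rather than `nltk`. It compares whitespace splitting, a simple regular expression that separates punctuation, and character-level splitting. The sample sentence and the regular expression are illustrative assumptions; this is not how `word_tokenize` works internally.

```python
import re

sample = "Hello, world! It's fine."  # hypothetical sample sentence

# 1) Whitespace tokenizer: punctuation stays glued to the words
whitespace_tokens = sample.split()

# 2) Regex tokenizer: runs of word characters, or single punctuation marks
regex_tokens = re.findall(r"\w+|[^\w\s]", sample)

# 3) Character-level tokenizer: every character (spaces included) is a token
char_tokens = list(sample)

print("Whitespace:", whitespace_tokens)
print("Regex:     ", regex_tokens)
print("Characters:", char_tokens[:6], "...")
```

Which tokenizer is appropriate depends on the task: the finer the tokens, the smaller the vocabulary but the longer the token sequences.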

After tokenization, each distinct token is assigned a number: its index in the **vocabulary** of the model, i.e., the set of all possible tokens.
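
As a rough illustration of this numbering step, the sketch below builds a vocabulary from a made-up list of tokens and encodes the text as a sequence of indices. The token list is hypothetical, and numbering tokens by order of first appearance is just one possible convention, assumed here for simplicity.

```python
tokens_demo = ["the", "cat", "saw", "the", "other", "cat"]  # hypothetical, already tokenized

# Vocabulary: every distinct token gets the next free index
vocabulary = {}
for tok in tokens_demo:
    if tok not in vocabulary:
        vocabulary[tok] = len(vocabulary)

# Encode the text as a sequence of token indices
token_ids = [vocabulary[tok] for tok in tokens_demo]

print(vocabulary)
print(token_ids)
```

Repeated tokens map to the same index, so the vocabulary is typically much smaller than the text itself.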

<font color='red'>TO-DO</font> What would the vocabulary be here?

### Stopwords Removal

Clearly some words are more informative than others. If a toddler says "cat tree", you can guess she's probably trying to say "there is a cat on the tree". Words such as "the" or "a" are very common in English, but they don't usually carry much meaning on their own. These common words are called **stopwords**. Removing stopwords can help reduce *noise* in the data and improve the performance of NLP models.

There are predefined *lists of stopwords* we can use (we downloaded one such list above!!).

In [None]:
stop_words = stopwords.words('english')
stop_words[:10]

We could exploit them to *filter* the above list of `tokens`

In [None]:
filtered_tokens = list(filter(lambda w: w.lower() not in stop_words, tokens))
filtered_tokens

<font color='red'>TO-DO</font>: Which words were removed as stopwords?

In [None]:
set(tokens) - set(filtered_tokens)

Later on we will look at an example showing the effect of removing stopwords.

# Feature Extraction: Bag‑of‑Words & tf-idf

Instead of NLTK, here we will make use of [scikit-learn](https://scikit-learn.org), a popular (sort of *standard*) library for ML. It provides a number of tools for text processing, including feature extraction methods such as Bag-of-Words and tf-idf (stay tuned!!).

Let us make up some *documents*

In [None]:
docs = [
    "I love programming in Python.",
    "Python is great for NLP tasks.",
    "I enjoy learning new languages."
]

How can we turn each document into a *fixed-size* vector of numbers? Once we have a vocabulary, we can simply count the number of times each of its elements shows up in every *document*. Every document is then represented by a vector the size of the vocabulary. This *transformation* is a really common task that *scikit-learn* automates through the class [CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html). What you obtain is a [Bag-of-words](https://en.wikipedia.org/wiki/Bag-of-words_model) (or *BoW*) representation of the corpus.
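
Before relying on the library, it may help to see the counting done by hand. The sketch below builds a Bag-of-words representation with plain Python on a made-up corpus (`toy_docs` is hypothetical, and splitting on whitespace is a simplifying assumption; it is not the tokenization that `CountVectorizer` applies).

```python
from collections import Counter

toy_docs = ["red apple red", "green apple", "green green pear"]  # hypothetical corpus

# Vocabulary: sorted distinct words over the whole corpus
vocab_demo = sorted({word for doc in toy_docs for word in doc.split()})

# One count vector per document, each with len(vocab_demo) entries
bow = []
for doc in toy_docs:
    counts = Counter(doc.split())  # missing words count as 0
    bow.append([counts[word] for word in vocab_demo])

print(vocab_demo)
for row in bow:
    print(row)
```

Note that word order within a document is lost: only the counts survive, hence the name "bag".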

<font color='red'>TO-DO</font>: Exploit the class [CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)  (it supports *stopwords*) to *vectorize* the above `list` of *documents*. Print the vector, i.e., the *counts*, for the first *document*.

<font color='red'>TO-DO</font>: What's the vocabulary? How many times does the 3rd word in the vocabulary show up in the 2nd document?

We now have a way of representing every text document in a corpus as a fixed-size vector of numbers (*counts*). However, even after getting rid of *stopwords*, which have little-to-zero meaning, it's clear that not every word is equally significant. Intuitively, words that show up *all the time* in a corpus (e.g., the word "medical" in a bunch of documents on medicine) are not very significant and, vice versa, words that occur rarely might yield some useful hints for a task. We have a name for this intuition: [term frequency–inverse document frequency](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) or *tf-idf*. The idea is: for each document, we measure how often each word (*term*) shows up in it (*term frequency*) and weigh that down according to the fraction of documents in the corpus that contain the word (*document frequency*). In its basic form, the term frequency is multiplied by the logarithm of the inverse of that fraction.
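
The following sketch works through the basic, unsmoothed variant of tf-idf by hand on a small made-up count matrix (the vocabulary and counts are hypothetical). It is only meant to illustrate the idea: scikit-learn's `TfidfVectorizer` uses a smoothed idf and normalizes each vector by default, so its numbers will differ.

```python
import math

vocab_demo = ["apple", "green", "pear", "red"]  # hypothetical vocabulary
counts = [            # hypothetical counts, one row per document
    [1, 0, 0, 2],
    [1, 1, 0, 0],
    [0, 2, 1, 0],
]
n_docs = len(counts)

# Document frequency: number of documents containing each term
df = [sum(1 for row in counts if row[j] > 0) for j in range(len(vocab_demo))]

# Inverse document frequency (plain variant): log(N / df)
idf = [math.log(n_docs / d) for d in df]

# tf-idf: relative frequency of the term in the document, times its idf
tfidf = []
for row in counts:
    total = sum(row)
    tfidf.append([(c / total) * w for c, w in zip(row, idf)])

for row in tfidf:
    print([round(x, 3) for x in row])
```

In the first document, "red" ends up with a much larger weight than "apple": it is both more frequent in that document and rarer across the corpus.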

<font color='red'>TO-DO</font>:  Use the [TfidfVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) class to obtain the tf-idf instead of *raw* counts.

<font color='red'>TO-DO</font>: What is the word with the smallest non-zero tf-idf value across all documents?

<font color='red'>TO-DO</font>:

- How is *sparsity* (every document only includes a small number of terms in the vocabulary) handled by the above classes (to avoid taking up a huge amount of memory)?
- After adding a document **without** the word “python,” should the tf-idf of the latter increase or decrease in the original documents?

## Why Stopwords Matter

Notice that, once every document becomes a fixed-size vector of numbers (of size $N$, that of the vocabulary), it's easy to compare documents by comparing their corresponding vectors. In principle, any *distance* in $\mathbb{R}^N$ can be used.

We'll compute [cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity) between two short sentences **with** and **without** stopword removal to illustrate how stopwords can inflate similarity scores. Consider a slightly contrived corpus

In [None]:
a_sentence = "The cat sat on the mat"
another_sentence = "A dog sat on the rug"

Are the sentences similar?

**Without** removing stopwords

In [None]:
vec_all = CountVectorizer().fit_transform([a_sentence, another_sentence])
print("Cosine similarity (stopwords kept):",
      cosine_similarity(vec_all)[0,1].round(3))

**After** removing stopwords

In [None]:
vec_ns = CountVectorizer(stop_words='english').fit_transform([a_sentence, another_sentence])
print("Cosine similarity (stopwords removed):",
      cosine_similarity(vec_ns)[0,1].round(3))

Notice how removing stopwords lowers the similarity by excluding common words like *the* and *on*, giving a truer sense of semantic distance.
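
To see where such numbers come from, the sketch below computes cosine similarity by hand on count vectors written out manually for the two sentences. The vectors are assumptions made for illustration: the one-letter word "a" is left out, and the stopword-free version assumes that exactly "on" and "the" are dropped, which may not match scikit-learn's built-in list.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hand-built counts over the vocabulary [cat, dog, mat, on, rug, sat, the]
with_stop_a = [1, 0, 1, 1, 0, 1, 2]  # "The cat sat on the mat"
with_stop_b = [0, 1, 0, 1, 1, 1, 1]  # "A dog sat on the rug"

# Same sentences over [cat, dog, mat, rug, sat], assuming "on" and "the" are dropped
no_stop_a = [1, 0, 1, 0, 1]
no_stop_b = [0, 1, 0, 1, 1]

print("Stopwords kept:   ", round(cosine(with_stop_a, with_stop_b), 3))  # 0.632
print("Stopwords removed:", round(cosine(no_stop_a, no_stop_b), 3))      # 0.333
```

With stopwords kept, most of the overlap between the two vectors comes from "on" and "the"; once they are gone, only "sat" is shared.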

# Text Classification with Naive Bayes

Let us put to use what we have learned by building a *news classifier*: given a piece of news, the task is to decide its *category* (theme, topic) among a set of competing ones. We'll be making use of **2 categories** of the `20 Newsgroups` dataset, which is readily available through *scikit-learn*.

In [None]:
categories = ['rec.autos', 'rec.sport.baseball']

The training and test splits are fetched separately.

In [None]:
train = fetch_20newsgroups(subset='train', categories=categories, remove=('headers','footers','quotes'))
test  = fetch_20newsgroups(subset='test',  categories=categories, remove=('headers','footers','quotes'))

<font color='red'>TO-DO</font>: Take a look at a couple of *posts* (either from the *training* or *test* set), along with their corresponding *category* (i.e., label).

<font color='red'>TO-DO</font>: Turn the documents into tf-idf *vectors*. For that, `fit` the *vectorizer* on the training set and, afterwards, use the resulting `TfidfVectorizer` object to `transform` both the training and test sets separately. Store the results in `X_train` and `X_test`, the names used by the cells below.

<font color='red'>TO-DO</font>: Take a look at the vector of numbers for the previously checked *posts*.

<font color='red'>TO-DO</font>: What happens when there is a word in the test set that was not seen in the training set (when you `fit_transform` on the latter)? Make up a document with a single non-existent word and check how many non-zero elements there are in its tf-idf representation (obtained through the `*Vectorizer` fitted on the training set).

Let us train a simple [Naive Bayes classifier](https://en.wikipedia.org/wiki/Naive_Bayes_classifier) on the *training set*...

In [None]:
clf = MultinomialNB()
clf.fit(X_train, train.target)

...and use it to predict on the *test set*

In [None]:
pred = clf.predict(X_test)

<font color='red'>TO-DO</font>: What is the overall accuracy of the classifier? You need to compare the above `pred`ictions against the actual `target`s in the *test* set.

<font color='red'>TO-DO</font>: Check one of the misclassified *posts*. Are you able to tell which class it belongs to (i.e., what it is about)?

<font color='red'>TO-DO</font>: How come all the texts end up represented by vectors of the same length?

# Introduction to Topic Modeling

Topic modeling uncovers latent themes in a corpus *without supervision* (notice that above we had *labels*, i.e., supervision, for each piece of news). We’ll use [Latent Dirichlet allocation](https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation) or LDA (not to be confused with *Linear discriminant analysis*, also abbreviated in the literature as LDA).

Let us get more data from the `20 Newsgroups` dataset...but this time without making use of the labels (i.e., the categories). Indeed, we are only using the `data` attribute (**not** the `target`).

In [None]:
categories = ['talk.politics.misc', 'comp.graphics', 'sci.space']
data = fetch_20newsgroups(subset='train', categories=categories, remove=('headers','footers','quotes'))
docs = data.data  # list of raw text documents

Let us first *vectorize* the corpus using a vanilla `CountVectorizer`, since raw counts are more aligned with the assumptions behind LDA.

In [None]:
vec = CountVectorizer(stop_words='english')
X = vec.fit_transform(docs)

Let us `fit` the model using LDA

In [None]:
lda = LatentDirichletAllocation(n_components=3, random_state=42)
lda.fit(X)

Let us take a look at the most important (most likely) words in every topic (aka, `component`)

In [None]:
vocab = vec.get_feature_names_out()
for idx, topic in enumerate(lda.components_):
    top_terms = [vocab[i] for i in topic.argsort()[-5:][::-1]]
    print(f"Topic {idx+1}: {top_terms}")

Can you match the above topics with *categories* in the `20 Newsgroups` dataset?

<font color='red'>TO-DO</font>: It might be convenient to ignore some of the words in the above topics. Re-*fit* the model excluding a couple of words: "mr" and "don".

<font color='red'>TO-DO</font>: What happens when you increase or decrease the number of topics? Notice that, right now, we have as many topics as *actual* categories in the data.

In [None]:
get_topics(docs, n_components=2, stop_words=list(ENGLISH_STOP_WORDS) + ['mr', 'don'])

<font color='red'>TO-DO</font>: Can a word show up in two different topics? What happens when you modify the value of `random_state` passed to `LatentDirichletAllocation`? Why is that?