# STA 141B Data & Web Technologies for Data Analysis

### Lecture 13, 2/22/23, Natural language processing


### Announcements

 - Homework due tomorrow. 

### Today's topics
- Natural Language Processing
     - Standardizing Text
     - Feature extraction
         - Term frequencies
         - One-hot encoding
         - Term Frequency-Inverse Document Frequency

### Resources
- [Natural Language Processing with Python][nlpp], chapters 1-3. Beware: the print version is for Python 2.
- [Scikit-Learn Documentation][skl], especially the section about [Text Feature Extraction](https://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction)


[PDSH]: https://jakevdp.github.io/PythonDataScienceHandbook/
[ProGit]: https://git-scm.com/book/
[nlpp]: https://www.nltk.org/book/
[atap]: https://search.library.ucdavis.edu/primo-explore/fulldisplay?docid=01UCD_ALMA51320822340003126&context=L&vid=01UCD_V1&search_scope=everything_scope&tab=default_tab&lang=en_US
[skl]: https://scikit-learn.org/stable/documentation.html


### Standardizing Text

We standardize numerical data in order to make fair comparisons, comparisons that are not influenced by the location and scale of the data. Similarly, you can standardize text (tokens) to make sure comparisons are fair and accurate.

For example, `"Cat"` and `"cat"` are the same word even though they're different tokens. Converting all characters to lowercase is one way to standardize a document.

Some common standardization techniques for text are:

* Lowercasing
* Stemming: Use patterns to remove prefixes and suffixes from words.
* Lemmatization: Look up each token in a dictionary and replace it with a root word. Similar to stemming, but more accurate.
* Stopword Removal: Remove tokens that don't contribute meaning. For example, "the" is meaningless on its own.
* Identifying Outliers: Identify and possibly remove non-standard "words" like numbers, misspellings, and code.

How and whether you should standardize a document or corpus depends on what kind of analysis you want to do. There is no formula; you must think carefully and experiment to determine which standardization techniques work best for your problem.
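
As a rough illustration of how several of these techniques combine, here is a minimal sketch that uses only the standard library. The function name `standardize` and the tiny stopword set are hypothetical choices for this example, not part of __nltk__.

```python
import re

# Hypothetical stopword set, for illustration only
STOPWORDS = {"the", "a", "an", "and", "or", "in", "by", "at"}

def standardize(text, stopwords=STOPWORDS):
    """Lowercase, tokenize on word characters, then drop stopwords and non-alphabetic tokens."""
    tokens = re.findall(r"\w+", text.lower())
    return [t for t in tokens if t.isalpha() and t not in stopwords]

tokens = standardize("The Cat saw 2 dogs in the yard.")
print(tokens)
```

Whether dropping numbers or stopwords is appropriate depends on the analysis, so each step should be treated as optional.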

#### Lowercasing

You can use Python's string methods for simple text transformations.

In [None]:
chapter[:100]

In [None]:
chapter.lower()[:100]

In [None]:
chapter.upper()[:100]

In [None]:
import re

words = re.findall(r"\w+", chapter)

In [None]:
words[0:9]

In [None]:
lower = [w.lower() for w in words]  # lowercase every token; w.upper() would uppercase instead
lower[:10]
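
To see why lowercasing matters for counting, the sketch below compares token counts before and after lowercasing. The sample sentence is made up for this example.

```python
import re
from collections import Counter

# Made-up sample text, for illustration only
text = "Cat people say a cat is the best pet. Cat videos agree."
tokens = re.findall(r"\w+", text)

raw = Counter(tokens)                         # "Cat" and "cat" are counted separately
lowered = Counter(t.lower() for t in tokens)  # both forms are merged into "cat"

print(raw["Cat"], raw["cat"])
print(lowered["cat"])
print(len(raw), len(lowered))  # number of distinct tokens before and after
```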

#### Stemming

_Stemming_ runs an algorithm on each token to remove affixes (prefixes and suffixes). The result is called a _stem_.

Stemming is useful if you want to ignore affixes.

For example, most English verbs use suffixes to mark the tense. We write "They fish" (present) and "They fished" (past). Without any standardization, the tokens "fish" and "fished" would be treated as separate words. Stemming converts both tokens to the common stem "fish":

In [None]:
stemmer = nltk.PorterStemmer()
[stemmer.stem(w) for w in words][0:10]

In [None]:
print(nltk.PorterStemmer().stem("whales"))
print(nltk.PorterStemmer().stem("whaling"))
print(nltk.PorterStemmer().stem("whalebone"))
print(nltk.PorterStemmer().stem("narwhales"))

Stemmers use a sequence of rules to determine the stem for each token, but natural languages are full of special cases and exceptions. So as you can see in the example above, some stems are not words, and sometimes tokens that seem like they should have the same stem don't.

Several different stemmers are provided in the `nltk.stem` submodule.
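
To make the idea of rule-based stemming concrete, here is a toy suffix stripper. It is a deliberately simplified sketch, not the Porter algorithm, and the function name `toy_stem` and its suffix list are hypothetical.

```python
def toy_stem(token, suffixes=("ing", "ed", "es", "s")):
    """Strip the first matching suffix, keeping at least three characters of stem."""
    for suffix in suffixes:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

for w in ["fish", "fished", "fishing", "whales", "whaling"]:
    print(w, "->", toy_stem(w))
```

Even this small rule set shows both behaviors described above: "fished" and "fishing" share the stem "fish", while "whales" and "whaling" are reduced to "whal", which is not a word.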

#### Lemmatization

_Lemmatization_ looks up each token in a dictionary to find a root word, or _lemma_.

Lemmatization serves the same purpose as stemming. Lemmatization is more accurate, but requires a dictionary and usually takes longer.
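
The dictionary lookup can be pictured with a toy example. The miniature dictionary and the function `toy_lemmatize` below are hypothetical and only illustrate the idea; a real lemmatizer such as WordNet's uses a far larger dictionary plus morphological rules.

```python
# Hypothetical miniature lemma dictionary keyed by (token, part of speech)
LEMMAS = {
    ("whales", "n"): "whale",
    ("whaling", "v"): "whale",
    ("whaling", "n"): "whaling",
    ("better", "a"): "good",
}

def toy_lemmatize(token, pos="n"):
    """Return the dictionary lemma, or the token unchanged if it is not listed."""
    return LEMMAS.get((token, pos), token)

print(toy_lemmatize("whaling"))       # looked up as a noun
print(toy_lemmatize("whaling", "v"))  # looked up as a verb
print(toy_lemmatize("narwhales"))     # not in the dictionary, returned unchanged
```

The same token can have different lemmas depending on its part of speech, and tokens missing from the dictionary pass through unchanged.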

In [None]:
nltk.download('wordnet')

In [None]:
nltk.WordNetLemmatizer().lemmatize("whales")

In [None]:
nltk.WordNetLemmatizer().lemmatize("whaling")

In [None]:
nltk.WordNetLemmatizer().lemmatize("whaling", "v") #this is a verb - it should be lemmatized to 'whale'

In [None]:
nltk.WordNetLemmatizer().lemmatize("whalebone")

In [None]:
nltk.WordNetLemmatizer().lemmatize("narwhales")

The WordNet lemmatizer needs part-of-speech information to lemmatize words correctly; without it, every token is treated as a noun. You can get approximate part-of-speech information with __nltk__'s `pos_tag()` function.

In [None]:
nltk.download('averaged_perceptron_tagger')

In [None]:
nltk.pos_tag(["whaled"])

In [None]:
nltk.pos_tag(["whaling"])

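By default, `nltk.pos_tag()` returns Penn Treebank POS tags (for example, `VBG` for a gerund or present participle). This tag set was derived from the older [Brown POS tags][brown].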

[brown]: https://en.wikipedia.org/wiki/Brown_Corpus#Part-of-speech_tags_used
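
To pass tagger output to the WordNet lemmatizer, the tags have to be translated into the single-letter codes that WordNet uses (`"n"`, `"v"`, `"a"`, `"r"`). The helper below is a minimal sketch, assuming the tags follow the Penn Treebank convention used by the default English tagger; the name `penn_to_wordnet` is hypothetical.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet part-of-speech letter (noun by default)."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("R"):
        return "r"  # adverb
    return "n"      # noun, and the fallback for every other tag

for tag in ["NNS", "VBG", "JJ", "RB", "DT"]:
    print(tag, "->", penn_to_wordnet(tag))
```

Each `(token, tag)` pair from `nltk.pos_tag()` could then be lemmatized with `nltk.WordNetLemmatizer().lemmatize(token, penn_to_wordnet(tag))`.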

#### Foreign language

In [None]:
from nltk.stem.snowball import SnowballStemmer

In [None]:
fr = SnowballStemmer('french')

sent = "En mathématiques, une fonction càdlàg (continue à droite, limite à gauche) est ..."
nltk.word_tokenize(sent)

nltk.pos_tag([fr.stem(word) for word in nltk.word_tokenize(sent)])

In [None]:
moby_tags = nltk.pos_tag(words)
moby_tags[0:10]

The `nltk.stem` submodule also provides several different lemmatizers.

#### Stopword Removal

_Stopwords_ are words that appear frequently but don't add meaning.

In English, "the", "a", and "at" are examples. However, exactly which words are stopwords depends on your analysis. Words that are meaningless in one analysis might be very important in others.

You can filter out stopwords with a list comprehension:

In [None]:
stopwords = ["the", "a", "and", "or", "in", "by"]
# Compare in lowercase so that capitalized stopwords such as "The" are removed too
[w for w in words if w.lower() not in stopwords][0:10]

__nltk__ also provides a stopwords corpus that contains common stopwords for several languages.

In [None]:
nltk.download("stopwords")

In [None]:
stopwords = nltk.corpus.stopwords.words("english")
[w for w in words if w.lower() not in stopwords][0:10]

### Feature Engineering for Natural Language Data

Most statistical techniques take numbers as input. You may have already noticed this when working with categorical data. We can't compute the mean, median, standard deviation, or z-score if the observations aren't numbers. While we can fit linear models, it takes extra work because we have to create, or _engineer_, indicator variables.

We face the same problem with natural language data. We need to _quantify_ documents, or turn them into numbers, so that we can use a wider variety of statistical techniques. We can do this by engineering features from our documents.

So: what kinds of features can we create for language data?

In [None]:
import numpy as np
import pandas as pd
import nltk, nltk.corpus

#### Term Frequencies

One solution is to extend the idea of frequency analysis. We used frequency analysis to study individual documents, but what if we compute the word frequencies for every document in our corpus, and use those frequencies as features?

Let's try this for a small corpus:

In [None]:
corpus = ["The cat saw the dog was angry at the other cat.", 
          "The dog saw the cat was angry at the other cat.", 
          "The canary saw the iguana was sad."]

def get_freq_doc(doc):
    words = (w.lower() for w in nltk.word_tokenize(doc))
    words = (w for w in words if w not in ["the", "a", "an", "at", 'other', "."] and w.isalnum())
    return nltk.FreqDist(words)

In [None]:
# use the function to get frequency for each word
df = pd.DataFrame([get_freq_doc(doc) for doc in corpus])

In [None]:
df = df.fillna(0)
df = df.astype(int)

In [None]:
df

In [None]:
# The isalnum() method returns True if all characters in the string are alphanumeric
"dog?2".isalnum()

In [None]:
words = [re.findall(r"\w+", chapter) for chapter in chapters]
words.pop(0)
words = [w for l in words for w in l] 

In [None]:
words[0:10]

In [None]:
words = [w.lower() for w in words]
#words = [w for w in words if w.isalnum()]
words[0:10]

In [None]:
len(set(words))

The `nltk.FreqDist` object `fq` stores the frequency of each word; see the [`FreqDist` documentation](https://tedboy.github.io/nlps/generated/generated/nltk.FreqDist.html).

In [None]:
fq = nltk.FreqDist(words)

In [None]:
fq

Frequency distribution objects have a few methods to provide summary information.

The `.most_common()` method returns the most common tokens and their frequencies:

In [None]:
fq.most_common(10)

A _hapax_ is a token that only occurs once within a document. The `.hapaxes()` method returns the hapaxes:

In [None]:
len(fq.hapaxes())

The `.plot()` method displays a plot of word frequencies, sorted from most to least frequent word.

The first parameter controls how many words to display. The second parameter controls whether the plot is cumulative.

In [None]:
%matplotlib inline

In [None]:
fq.plot(40, cumulative = True)

In [None]:
fq.plot(40)

Consider [Zipf's law](https://en.wikipedia.org/wiki/Zipf%27s_law): When the elements of a set (for example, the words of a text) are ordered by frequency, the probability $p$ of an element's occurrence is inversely proportional to its rank $n$ in the frequency list.
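
In formula form, let $f(n)$ denote the frequency of the $n$-th most frequent word, and assume the proportionality holds exactly (real text follows it only approximately). Then:

$$
f(n) = \frac{f(1)}{n} \quad \Longrightarrow \quad \log f(n) = \log f(1) - \log n
$$

On a log-log plot of frequency against rank, the law therefore predicts a straight line with slope $-1$ that starts at the log frequency of the most common word. This is the reference line that `logTheo` represents in the cells that follow.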

In [None]:
logFreq = [np.log(i[1]) for i in fq.most_common(2000)]
logRank = [np.log(1 + i) for i in range(0,2000)]

In [None]:
# Zipf's law predicts log f(n) = log f(1) - log n, a line of slope -1 on the log-log plot
logTheo = logFreq[0] - np.array(logRank)

In [None]:
import plotnine as p9

In [None]:
(
p9.ggplot() + p9.theme_minimal() + 
    p9.geom_line(p9.aes(x='logRank', y='logFreq')) + 
    p9.geom_line(p9.aes(x='logRank', y='logTheo'), color = 'red') 
)

Notice that when we use term frequencies as features, we lose information about the order of the words in each document.

The first and second document contain the same words, but in different orders. The word frequency features for these two documents are identical.
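
This can be checked with a short standard-library sketch. It uses a simple regular-expression tokenizer as a stand-in for `nltk.word_tokenize` and keeps stopwords, so the columns differ slightly from the data frame above.

```python
import re
from collections import Counter

corpus = ["The cat saw the dog was angry at the other cat.",
          "The dog saw the cat was angry at the other cat.",
          "The canary saw the iguana was sad."]

# One bag of lowercase tokens per document (regex tokenizer is an assumption)
bags = [Counter(re.findall(r"\w+", doc.lower())) for doc in corpus]

# Shared vocabulary, then one row of counts per document
vocab = sorted(set().union(*bags))
matrix = [[bag[term] for term in vocab] for bag in bags]

print(vocab)
for row in matrix:
    print(row)
print(matrix[0] == matrix[1])  # do documents 1 and 2 have identical features?
```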

The __scikit-learn__ package provides functions to help with feature engineering. The `sklearn.feature_extraction.text` submodule is specifically for extracting features from text documents.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

In [None]:
help(CountVectorizer)

In [None]:
vec = CountVectorizer(tokenizer = nltk.word_tokenize)
freq = vec.fit_transform(corpus)

In [None]:
freq

In [None]:
# .todense() converts a sparse matrix to a dense matrix
# Don't do this for a really large matrix!
freq.todense()

In [None]:
df

Use the `.get_feature_names_out()` method to see which term each column corresponds to:

In [None]:
vec.get_feature_names_out()

In [None]:
vec

One problem with term frequencies is that some terms have high frequencies simply because they appear frequently in the language. These terms can cause documents to appear similar even if they are otherwise different.

While removing stopwords takes care of some high-frequency words, there may also be high-frequency words that have meaning and need to be kept.

#### One-hot Encoding

We can avoid emphasis on high-frequency words by ignoring frequency altogether. Instead, we can create indicator variables for individual words. The indicator is 1 if the word appears in the document, and 0 otherwise.

In machine learning, an indicator variable is also called a _one-hot encoding_.

The `sklearn.preprocessing` submodule of __scikit-learn__ provides the `Binarizer` class, which turns counts into indicators by thresholding them (at 0 by default).

In [None]:
from sklearn.preprocessing import Binarizer
help(Binarizer)

In [None]:
freq

In [None]:
(freq > 0).todense()

In [None]:
binarizer = Binarizer()
ohot = binarizer.fit_transform(freq)
ohot.todense()

In [None]:
freq.todense()

In [None]:
corpus

In [None]:
vec.get_feature_names_out()

As with term frequencies, we lose information about the order of the words in the document.

One-hot encoding is an extreme transformation: every term is equally important. This means terms that are relatively rare or unique still might be underemphasized (this is also a problem for term frequencies).
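
The loss of frequency information is easy to see on a made-up example: two documents with very different term counts can end up with the same indicator vector. The documents and vocabulary below are hypothetical.

```python
from collections import Counter

# Made-up documents, for illustration only
docs = ["cat cat cat dog", "cat dog dog dog"]
vocab = ["cat", "dog"]

# Term counts per document, then indicators (1 if the term appears at all)
counts = [[Counter(doc.split())[term] for term in vocab] for doc in docs]
onehot = [[int(c > 0) for c in row] for row in counts]

print(counts)
print(onehot)
```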

#### Term Frequency-Inverse Document Frequency

_Term frequency-inverse document frequency_ (tf-idf) statistics put terms on approximately the same scale while also emphasizing relatively rare terms. There are [several different tf-idf statistics](https://en.wikipedia.org/wiki/Tf%E2%80%93idf).

The _smoothed tf-idf_, for a term $t$ and document $d$, is given by:

$$
\operatorname{tf-idf}(t, d) = \operatorname{tf}(t, d) \cdot \log \left( \frac{N}{1 + n_t} \right)
$$

where $N$ is the total number of documents and $n_t$ is the number of documents that contain $t$.
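
As a worked example, the sketch below applies this formula to the small corpus from earlier. It assumes that $\operatorname{tf}(t, d)$ is the relative frequency of $t$ in $d$ and uses a simple regular-expression tokenizer; both are choices made for this illustration.

```python
import math
import re

corpus = ["The cat saw the dog was angry at the other cat.",
          "The dog saw the cat was angry at the other cat.",
          "The canary saw the iguana was sad."]

tokenized = [re.findall(r"\w+", doc.lower()) for doc in corpus]
N = len(tokenized)

def tf_idf(term, tokens):
    """tf-idf with tf as relative frequency and idf = log(N / (1 + n_t))."""
    tf = tokens.count(term) / len(tokens)
    n_t = sum(term in doc for doc in tokenized)  # number of documents containing the term
    return tf * math.log(N / (1 + n_t))

print(tf_idf("canary", tokenized[2]))  # term found in 1 of 3 documents
print(tf_idf("cat", tokenized[0]))     # term found in 2 of 3 documents
print(tf_idf("saw", tokenized[0]))     # term found in all 3 documents
```

With this variant, a term that occurs in every document gets a negative weight, because $N / (1 + n_t) < 1$. Rare terms get the largest weights.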

The `sklearn.feature_extraction.text` submodule of __scikit-learn__ provides a function for computing tf-idf:

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
help(TfidfVectorizer)

In [None]:
vec = TfidfVectorizer(tokenizer = nltk.word_tokenize) 
tfidf = vec.fit_transform(corpus)

In [None]:
vec.get_feature_names_out()

In [None]:
tfidf.todense()

In [None]:
# tf-idf of '.' in document 1 by the formula above: 1 of 12 tokens, found in all 3 documents
(1 / 12) * np.log(3 / (1 + 3))
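
The values returned by `TfidfVectorizer` do not match this hand computation, because scikit-learn uses a different tf-idf variant. According to the scikit-learn documentation, the default settings use raw counts for tf, the idf $\ln\left(\frac{1 + N}{1 + n_t}\right) + 1$, and then scale each document vector to unit Euclidean length. A minimal pure-Python sketch of that weighting follows; the function name and the counts are hypothetical.

```python
import math

def sklearn_style_tfidf(counts_row, doc_freqs, n_docs):
    """Raw counts times idf = ln((1 + N) / (1 + n_t)) + 1, then L2-normalized."""
    weights = [c * (math.log((1 + n_docs) / (1 + df)) + 1)
               for c, df in zip(counts_row, doc_freqs)]
    norm = math.sqrt(sum(w * w for w in weights))
    # Leave an all-zero row unchanged to avoid dividing by zero
    return [w / norm for w in weights] if norm else weights

# Hypothetical counts for one document over three terms
# that occur in 3, 2, and 1 of the 3 documents, respectively
row = sklearn_style_tfidf([1, 2, 1], doc_freqs=[3, 2, 1], n_docs=3)
print(row)
```

Under this variant every term that appears in a document receives a positive weight, even if it appears in all documents.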

In [None]:
corpus

In long documents or documents with many high-frequency terms, we can further reduce the emphasis on these terms by taking the logarithm of the term frequency. To do this, set `sublinear_tf = True` in the `TfidfVectorizer()` function.
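
With sublinear scaling, scikit-learn replaces a raw count $\operatorname{tf}$ by $1 + \ln(\operatorname{tf})$. The sketch below shows how strongly this compresses large counts; the helper name `sublinear_tf` is hypothetical.

```python
import math

def sublinear_tf(count):
    """Replace a raw count with 1 + ln(count); zero counts stay zero."""
    return 1 + math.log(count) if count > 0 else 0.0

for count in [1, 10, 100]:
    print(count, round(sublinear_tf(count), 3))
```

A term that occurs 100 times ends up with less than six times the weight of a term that occurs once.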

### The Bag-of-words Model

The one-hot encoding, term frequencies, and TF-IDF scores all ignore word order.

The _bag-of-words model_ assumes that the order of words in a document doesn't matter. Imagine taking the words in each document and dumping them into a bag, where they get all mixed up. Note that in this case "model" means a way of thinking about a problem, not a statistical model.

While the order of words in a document might seem important, the bag-of-words model is surprisingly useful. The bag-of-words model is a good place to start if you want to use statistical methods on language data.

### Measuring Similarity

We can measure the _similarity_ of two documents by computing the distance between their term frequency vectors. There are many different ways we can measure distance and similarity:

* Minkowski distance, a family of distances that includes Euclidean distance ($\ell_2$-norm) and Manhattan distance ($\ell_1$-norm).
    * $\ell_2$-norm, $\|a - b\|_2 = \sqrt{\sum_{i=1}^n (a_i - b_i)^2}$
    * $\ell_1$-norm, $\|a - b\|_1 = \sum_{i=1}^n |a_i - b_i|$
    * $\ell_\infty$-norm (the limiting case), $\|a - b\|_\infty = \max_{1 \leq i \leq n} |a_i - b_i|$
    * Relation between these norms: $\|\cdot\|_1 \geq \|\cdot\|_2 \geq \cdots \geq \|\cdot\|_\infty$

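* Mahalanobis distance, a generalization of Euclidean distance that accounts for the covariance of the data. When the features are uncorrelated, it is the Euclidean distance between z-scores.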
* Cosine similarity, the cosine of the angle between two vectors. See [here](https://stats.stackexchange.com/a/235676/29695) for an explanation of how cosine similarity is related to correlation. Note that the range of cosine is $[-1, 1]$ and $\cos(0) = 1$, so vectors that are close together will have a cosine similarity close to 1, not 0.
* And others...
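
The three norm-based distances above can be computed in a few lines. The vectors `a` and `b` below are made-up term-count vectors, and the function names are chosen for this sketch.

```python
import math

def l1(a, b):
    """Manhattan distance: sum of absolute differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def l2(a, b):
    """Euclidean distance: square root of the sum of squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def linf(a, b):
    """Maximum absolute difference over all coordinates."""
    return max(abs(x - y) for x, y in zip(a, b))

a = [1, 2, 0, 3]
b = [0, 2, 2, 1]
print(l1(a, b), l2(a, b), linf(a, b))
```

For these vectors the distances are 5, 3, and 2, which agrees with the ordering $\|\cdot\|_1 \geq \|\cdot\|_2 \geq \|\cdot\|_\infty$.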

Cosine similarity often works well for language data. The cosine similarity between two vectors $a$ and $b$ is defined as:

$$
\frac{a'b}{\Vert a \Vert_2 \Vert b \Vert_2}.
$$

With its default setting `norm='l2'`, `TfidfVectorizer()` already divides the returned tf-idf vectors by their Euclidean norms, so we can compute cosine similarity as a simple dot product:

In [None]:
pd.DataFrame(tfidf.todense())

In [None]:
(tfidf @ tfidf.T).todense()
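
The equivalence between cosine similarity and the dot product of unit-length vectors can be verified on a small example. The vectors and helper names below are made up for this sketch.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(v):
    """Divide a vector by its Euclidean norm so that it has length 1."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

a = [3.0, 4.0, 0.0]
b = [4.0, 3.0, 0.0]

# Cosine similarity from the definition
cosine = dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Plain dot product after normalizing each vector first
unit_dot = dot(l2_normalize(a), l2_normalize(b))

print(cosine, unit_dot)
```

Both computations give 0.96 up to floating-point rounding.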

Part of the reason that cosine similarity is a good measure in NLP is that cosine similarity, like correlation, is not affected by the scale of the vector elements. For vectors that contain term frequencies (or functions of term frequencies), this means that the length of the original documents will not affect whether or not they are similar -- only their word content will.
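
A small sketch illustrates this scale invariance. It compares a made-up count vector with the vector for the same text repeated three times; the data and function names are hypothetical.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

short_doc = [2, 1, 0, 1]                # made-up term counts
long_doc = [3 * c for c in short_doc]   # the same text repeated three times

print(cosine_similarity(short_doc, long_doc))   # same direction, so essentially 1
print(euclidean_distance(short_doc, long_doc))  # clearly greater than 0
```

Euclidean distance treats the two documents as far apart only because one is longer, while cosine similarity treats them as identical in content.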

### Bigrams

### Summary 

- Standardize text first
- Engineer features depending on priorities