### Natural Language Processing

Run the cell below to import the required packages:

In [5]:
import pandas as pd
import numpy as np

import re
import nltk
import string
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import Normalizer
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import StandardScaler

from nltk.corpus import stopwords

Sources:

https://towardsdatascience.com/a-gentle-introduction-to-natural-language-processing-e716ed3c0863

https://www.freecodecamp.org/news/how-to-process-textual-data-using-tf-idf-in-python-cd2bbc0a94a3/


### Natural Language Processing

Natural language processing (NLP) is a branch of artificial intelligence that deals with analyzing, understanding, and generating the languages that humans use naturally. Its goal is to let people interface with computers, in both written and spoken contexts, using natural human languages instead of computer languages.

### Applications of NLP

- Machine translation (Google Translate)
- Natural language generation
- Web Search
- Spam filters
- Sentiment Analysis (positive or negative tone)
- Chatbots

… and many more

### Preprocessing of data

A **text corpus** is a large and structured set of texts. 

Here's an example of a corpus. This example is a document containing four sentences:

In [6]:
corpus = 'I like football. Football is one of my favorite sports. I play Fantasy Football with my friends. I am in several different fantasy football leagues.'

Before we apply our machine learning algorithms, we often preprocess the corpus in order to transform the raw data into a useful and efficient format. Here are some common types of preprocessing:

- **Normalization**: Making all the text lower case is one of the simplest and most effective forms of text preprocessing.

We'll do that now:

In [7]:
corpus = corpus.lower()
print(corpus)

i like football. football is one of my favorite sports. i play fantasy football with my friends. i am in several different fantasy football leagues.


- **Punctuation removal**: Punctuation can also be removed. The string library contains a list of (most) punctuation characters:

In [8]:
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

We'll remove the punctuation using regular expressions, which are available through Python's built-in `re` package (imported above). Regular expressions are a topic unto themselves, and there are many good tutorials online if you'd like to learn more. Don't worry about the details of the following cell for now; just run it to replace each punctuation character with a space:

In [9]:
# Match any single punctuation character
punc_re = re.compile('[%s]' % re.escape(string.punctuation))
# Replace every punctuation character with a space
corpus = punc_re.sub(' ', corpus)
print(corpus)

i like football  football is one of my favorite sports  i play fantasy football with my friends  i am in several different fantasy football leagues 


- **Stop words** are common words that do not contribute much of the information in a text document. Words like ‘the’, ‘is’, ‘a’ have less value and add noise to the text data. Here are the first ten contained in the nltk stopwords list:

In [10]:
from nltk.corpus import stopwords

stopwords.words('english')[:10]

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]

Let's remove them from our corpus now:

In [11]:
stop_words = set(stopwords.words('english'))  # build the set once for fast lookups
resultwords = [word for word in corpus.split() if word not in stop_words]
corpus = ' '.join(resultwords)
print(corpus)

like football football one favorite sports play fantasy football friends several different fantasy football leagues


- **Tokenization** is the process of breaking up a text document into individual words called tokens. Let's do that now:

In [12]:
corpus.split()

['like',
 'football',
 'football',
 'one',
 'favorite',
 'sports',
 'play',
 'fantasy',
 'football',
 'friends',
 'several',
 'different',
 'fantasy',
 'football',
 'leagues']
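The four steps above can be collected into a single helper. The sketch below is only an illustration and not part of the original notebook: the function name `preprocess` and the tiny `DEMO_STOP_WORDS` set are hypothetical stand-ins chosen so the example runs without downloading NLTK data. In practice you would pass in `set(stopwords.words('english'))` instead.

```python
import re
import string

# Hypothetical, deliberately tiny stop-word set (an assumption for this sketch);
# swap in set(stopwords.words('english')) from NLTK for real use.
DEMO_STOP_WORDS = {"i", "is", "of", "my", "with", "am", "in"}

PUNCTUATION_RE = re.compile("[%s]" % re.escape(string.punctuation))

def preprocess(text, stop_words=DEMO_STOP_WORDS):
    """Lowercase, strip punctuation, tokenize on whitespace, drop stop words."""
    text = text.lower()                     # normalization
    text = PUNCTUATION_RE.sub(" ", text)    # punctuation removal
    tokens = text.split()                   # tokenization
    return [t for t in tokens if t not in stop_words]  # stop-word removal

tokens = preprocess("I like football. Football is one of my favorite sports.")
print(tokens)
```

Keeping the steps in one function makes it easy to apply exactly the same preprocessing to new text later on.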

Some other types of preprocessing that we won't do now, but that are helpful to know about, are:
    
- **Stemming** is the process of reducing a word to its stem (root) by removing morphological affixes. It collapses inflected forms of a word (e.g. ‘help’, ‘helping’, ‘helped’, ‘helpful’) to a single stem (e.g. ‘help’). The stem may or may not be a valid word in the language. For example, ‘movi’ is the stem for ‘movie’ and ‘emot’ is the stem for ‘emotion’.

- **Lemmatization** does the same thing as stemming, converting a word to its root form, but with one difference: the root word in this case is always a valid word in the language. For example, the word ‘caring’ would map to ‘care’ and not to ‘car’, as a crude suffix-stripping stemmer might produce.

- **N-grams** are sequences of N consecutive words taken together. N-grams can be used when we want to preserve sequence information in the document, like what word is likely to follow a given one. For example, if we were making a Donald Trump chatbot, we might want the word "news" to always follow the word "fake".
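To make the idea of n-grams concrete, here is a minimal pure-Python sketch. The helper name `ngrams` is hypothetical and chosen only for this illustration; it slides a window of length `n` over a list of tokens.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["play", "fantasy", "football", "friends"]
bigrams = ngrams(tokens, 2)
print(bigrams)
```

If you would rather not roll your own, sklearn's vectorizers accept an `ngram_range` argument, e.g. `CountVectorizer(ngram_range=(1, 2))` to count both single words and bigrams.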

### Text Data Vectorization

Once the data is preprocessed, we can numerically represent text data. Here are some ways to do so.


- **Bag of Words**: we can think of this as creating a table where the columns are the set of unique words in the corpus and each row corresponds to a sentence (document). In the simplest version, we set the value to 1 if the word is present in the sentence and to 0 otherwise; more commonly, as in the example below, the value is the number of times the word occurs. Consider the list below as two documents:

In [13]:
corpus = ['The car is driven on the road.', 
          'The truck is driven on the highway.']

We can use the scikit-learn `CountVectorizer` to create a bag of words, where each row corresponds to a different document:

In [14]:
cv = CountVectorizer()

X = cv.fit_transform(corpus)

pd.DataFrame(X.toarray(), columns=cv.get_feature_names_out())

| | car | driven | highway | is | on | road | the | truck |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 0 | 1 | 1 | 1 | 2 | 0 |
| 1 | 0 | 1 | 1 | 1 | 1 | 0 | 2 | 1 |
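To see what the count vectorizer is doing under the hood, the same table can be built by hand. This is a rough sketch rather than sklearn's actual implementation: the `tokenize` helper is hypothetical, and its regular expression only approximates the vectorizer's default tokenization (lowercase, tokens of two or more word characters).

```python
import re
from collections import Counter

docs = ["The car is driven on the road.",
        "The truck is driven on the highway."]

def tokenize(doc):
    # Lowercase, then keep runs of two or more word characters
    return re.findall(r"\b\w\w+\b", doc.lower())

# One Counter per document: word -> number of occurrences
counts = [Counter(tokenize(d)) for d in docs]

# Columns are the sorted set of unique words across all documents
vocabulary = sorted(set().union(*counts))

count_matrix = [[c[word] for word in vocabulary] for c in counts]
binary_matrix = [[int(c[word] > 0) for word in vocabulary] for c in counts]

print(vocabulary)
print(count_matrix)
print(binary_matrix)
```

`count_matrix` holds word counts (note the 2 for "the"), while `binary_matrix` is the present/absent version. With sklearn, passing `binary=True` to `CountVectorizer` gives the 0/1 variant.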


- **TF-IDF** stands for Term Frequency - Inverse Document Frequency. It takes into account that we should weight rare words more highly than common words.

**Term Frequency** defines the probability of finding a word in the document. Let’s say we want to find what is the probability of finding $\text{word}_i$ in $\text{document}_j$:

$\text{TermFrequency}(\text{word}_i, \text{document}_j) = \frac{\text{Number of times } \text{word}_i \text{ occurs in } \text{document}_j}{\text{Total number of words in } \text{document}_j}$

**Inverse Document Frequency**: The intuition behind IDF is that a word is not of much use if it appears in all the documents. It defines how unique the word is in the total corpus:

$\text{InverseDocumentFrequency}(\text{word}_i, \text{All documents in corpus}) = \log\left(\frac{\text{Total number of documents}}{\text{Number of documents which contain } \text{word}_i}\right)$

If $\text{word}_i$ appears in many documents, the denominator grows and the IDF value decreases.

If $\text{word}_i$ appears in few documents, the denominator shrinks and the IDF value increases.

And finally, we obtain the formula for TF-IDF:

$\text{TF-IDF} = \text{TF}(\text{word}_i, \text{document}_j) \times \text{IDF}(\text{word}_i, \text{All documents in corpus})$

We can calculate the TF-IDF matrix in the above example:

<img src="images/td.png" width=500>

From the above table, we can see that the TF-IDF of the words that appear in both documents is zero, which shows they are not significant. On the other hand, the TF-IDF values of “car”, “truck”, “road”, and “highway” are non-zero. These words have more significance.
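The hand calculation can be checked with a few lines of plain Python that apply the TF and IDF formulas directly. This sketch assumes a base-10 logarithm, which is an inference from the 0.043 values rather than something stated in the text, and the function names are hypothetical.

```python
import math

# The two example documents, lowercased and with punctuation removed
docs = {"A": "the car is driven on the road".split(),
        "B": "the truck is driven on the highway".split()}

def tf(word, tokens):
    # Share of the document's words that are `word`
    return tokens.count(word) / len(tokens)

def idf(word, all_docs):
    # log10(total documents / documents containing the word)
    n_containing = sum(1 for tokens in all_docs.values() if word in tokens)
    return math.log10(len(all_docs) / n_containing)

def tf_idf(word, doc_name):
    return tf(word, docs[doc_name]) * idf(word, docs)

for word in ["the", "car", "road", "truck", "highway"]:
    print(word, round(tf_idf(word, "A"), 3), round(tf_idf(word, "B"), 3))
```

Words such as "the" get a score of zero because their IDF is $\log(2/2) = 0$, no matter how often they occur within a document.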

You will see the matrix written in this form:


In [15]:
data = [(0,0),(0.043,0),(0,0.043),(0,0),(0,0),(0,0),(0,0),(0.043,0),(0,0.043)]
(pd.DataFrame(data, columns=['A','B'], index=['the', 'car', 'truck', 'is', 'driven', 'on', 'the', 'road', 'highway'])).T

| | the | car | truck | is | driven | on | the | road | highway |
|---|---|---|---|---|---|---|---|---|---|
| A | 0.0 | 0.043 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.043 | 0.0 |
| B | 0.0 | 0.0 | 0.043 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.043 |


We will let Python create the TF-IDF matrix for us instead.

Nicely enough, the sklearn TF-IDF vectorizer can make the corpus lowercase and remove punctuation and stop words for us. We remove punctuation using the `token_pattern` argument, setting it equal to a regular expression that keeps only alphabetic tokens of two or more letters. Stop words are removed using the `stop_words` argument. We can also add a `min_df` argument, which ignores terms that have a document frequency strictly lower than the given threshold.

Don't be alarmed that sklearn's matrix looks quite a bit different from the one above. The values differ for three reasons: by default sklearn uses the raw word count as the term frequency, it uses a smoothed IDF with a natural log, $\ln\left(\frac{1 + \text{Total number of documents}}{1 + \text{Number of documents which contain } \text{word}_i}\right) + 1$, instead of just $\log\left(\frac{\text{Total number of documents}}{\text{Number of documents which contain } \text{word}_i}\right)$, and it then scales each row to unit length. The added 1 ensures that words which would otherwise have an IDF score of zero (i.e., words that occur in every document) don’t get suppressed entirely.

In [16]:
tf = TfidfVectorizer(lowercase=True, 
                     token_pattern="\\b[a-zA-Z][a-zA-Z]+\\b", 
                     stop_words=stopwords.words('english'),
                     min_df=1)

X = tf.fit_transform(corpus)

pd.DataFrame(X.toarray(), columns=tf.get_feature_names_out())

| | car | driven | highway | road | truck |
|---|---|---|---|---|---|
| 0 | 0.631667 | 0.449436 | 0.0 | 0.631667 | 0.0 |
| 1 | 0.0 | 0.449436 | 0.631667 | 0.0 | 0.631667 |
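The sklearn numbers can be reproduced by hand, which helps demystify them. The sketch below assumes sklearn's default settings: the raw count as term frequency, a smoothed IDF of $\ln\left(\frac{1 + n}{1 + \text{df}}\right) + 1$, and scaling of each row to unit (L2) length. It also assumes that stop-word removal leaves exactly the tokens listed, as the columns of the output suggest. The helper names are hypothetical.

```python
import math

# Tokens remaining after lowercasing and removing punctuation and stop words
docs = [["car", "driven", "road"],
        ["truck", "driven", "highway"]]
n_docs = len(docs)
vocabulary = sorted({word for doc in docs for word in doc})

def smoothed_idf(word):
    # Number of documents that contain the word
    df = sum(1 for doc in docs if word in doc)
    return math.log((1 + n_docs) / (1 + df)) + 1  # natural log

def tfidf_row(tokens):
    # Raw count times smoothed IDF, then scale the row to unit L2 length
    raw = [tokens.count(word) * smoothed_idf(word) for word in vocabulary]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]

rows = [tfidf_row(doc) for doc in docs]
print(vocabulary)
for row in rows:
    print([round(v, 6) for v in row])
```

The word "driven" occurs in both documents, so it gets the smallest weight in each row, but the smoothing keeps it from dropping to zero.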


### LSA

Latent Semantic Analysis (LSA) is just SVD applied to a document-term matrix. For example, if we have two documents:

- D1 = "I like databases"
- D2 = "I hate databases"

then the document-term matrix would be:

$$ \begin{matrix} \text{I} & \text{like} & \text{hate} & \text{databases} \\ 1 & 1 & 0 & 1 \\ 1 & 0 & 1 & 1  \end{matrix} $$

with each row being a different document and each column being a different word.

In this case our decomposition $X = U \Sigma V^T$ has a new interpretation:
- $\Sigma$ holds the importance of each of our topics
- $U$ is a transform from each document to the topics it is about
- $V$ is a transform from each word to the topics that word is most used in
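A quick way to get a feel for the decomposition is to run a plain SVD on the small matrix above. This is a sketch using NumPy's `np.linalg.svd` rather than the sklearn class used later in the notebook. With documents as rows, the rows of $U$ line up with documents and the columns of $V^T$ line up with words.

```python
import numpy as np

# Rows are documents D1, D2; columns are the words I, like, hate, databases
X = np.array([[1, 1, 0, 1],
              [1, 0, 1, 1]], dtype=float)

U, S, Vt = np.linalg.svd(X, full_matrices=False)

print("U   (documents x topics):", U.shape)
print("Sigma (topic strengths): ", S.round(3))
print("V^T (topics x words):    ", Vt.shape)

# Multiplying the three factors back together recovers the original matrix
X_rebuilt = U @ np.diag(S) @ Vt
print(np.allclose(X_rebuilt, X))
```

The first singular value is much larger than the second because the two documents share most of their words. Keeping only the largest singular values is what "truncated" SVD refers to.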

Let's perform LSA on the list of seven documents below. We'll first create a TF-IDF matrix:

In [17]:
example = ['Football baseball basketball',
            'baseball giants cubs redsox',
            'football broncos cowboys',
            'baseball redsox tigers',
            'pop stars hendrix prince',
            'hendrix prince jagger rock',
            'joplin pearl jam tupac rock',
          ]

vectorizer = TfidfVectorizer(lowercase=True, 
                     token_pattern="\\b[a-zA-Z][a-zA-Z]+\\b", 
                     stop_words=stopwords.words('english'),
                     min_df=1)

X = vectorizer.fit_transform(example)

pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())

| | baseball | basketball | broncos | cowboys | cubs | football | giants | hendrix | jagger | jam | joplin | pearl | pop | prince | redsox | rock | stars | tigers | tupac |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.479185 | 0.675356 | 0.0 | 0.0 | 0.0 | 0.560603 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0.397106 | 0.0 | 0.0 | 0.0 | 0.559675 | 0.0 | 0.559675 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.464579 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 | 0.609819 | 0.609819 | 0.0 | 0.506202 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 0.479185 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.560603 | 0.0 | 0.0 | 0.675356 | 0.0 |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.451635 | 0.0 | 0.0 | 0.0 | 0.0 | 0.544082 | 0.451635 | 0.0 | 0.0 | 0.544082 | 0.0 | 0.0 |
| 5 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.473977 | 0.570997 | 0.0 | 0.0 | 0.0 | 0.0 | 0.473977 | 0.0 | 0.473977 | 0.0 | 0.0 | 0.0 |
| 6 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.461804 | 0.461804 | 0.461804 | 0.0 | 0.0 | 0.0 | 0.383337 | 0.0 | 0.0 | 0.461804 |


We can view the words that are contained in the columns here:

In [18]:
list(vectorizer.get_feature_names_out())

['baseball',
 'basketball',
 'broncos',
 'cowboys',
 'cubs',
 'football',
 'giants',
 'hendrix',
 'jagger',
 'jam',
 'joplin',
 'pearl',
 'pop',
 'prince',
 'redsox',
 'rock',
 'stars',
 'tigers',
 'tupac']

We'll apply SVD using 2 components:

In [19]:
svd = TruncatedSVD(2)
X_svd = svd.fit_transform(X)

pd.DataFrame(svd.components_.round(5),
             index = ["component_1","component_2"],
             columns = vectorizer.get_feature_names_out())

| | baseball | basketball | broncos | cowboys | cubs | football | giants | hendrix | jagger | jam | joplin | pearl | pop | prince | redsox | rock | stars | tigers | tupac |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| component_1 | 0.59434 | 0.26389 | 0.10775 | 0.10775 | 0.25565 | 0.30849 | 0.25565 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.47627 | 0.0 | 0.0 | 0.31811 | 0.0 |
| component_2 | -0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.51977 | 0.33357 | 0.10539 | 0.10539 | 0.10539 | 0.29259 | 0.51977 | -0.0 | 0.36438 | 0.29259 | -0.0 | 0.10539 |


We notice that the first component accounts for about 14% of the explained variance and the second component for about 16%. It may be confusing that these are not in descending order. The reason is that `TruncatedSVD` does not center the data, so its components are ordered by singular value rather than by variance explained; had we centered the data first (for example with the standard scaler), the ratios would come out in descending order. However, we'll keep our data unscaled in order to more easily interpret later results:

In [20]:
svd.explained_variance_ratio_

array([0.14219813, 0.16486574])

We'll also want to scale our results using Normalizer. This ensures that each vector has a norm of 1. Vectors with a norm of 1 are easy to work with for calculating similarity.

In [21]:
dtm_svd = Normalizer(copy=False).fit_transform(X_svd)

Each document is a linear combination of the SVD components. We notice that the first four, sports-related documents are composed entirely of the first component. The three music-related documents are composed entirely of the second component:

In [22]:
pd.DataFrame(dtm_svd.round(5),
             index=example, 
             columns=["component_1","component_2"])

| | component_1 | component_2 |
|---|---|---|
| Football baseball basketball | 1.0 | -0.0 |
| baseball giants cubs redsox | 1.0 | 0.0 |
| football broncos cowboys | 1.0 | 0.0 |
| baseball redsox tigers | 1.0 | -0.0 |
| pop stars hendrix prince | 0.0 | 1.0 |
| hendrix prince jagger rock | 0.0 | 1.0 |
| joplin pearl jam tupac rock | 0.0 | 1.0 |


In summary, we have reduced a 19-dimensional space, corresponding to 19 unique words, down to 2 dimensions. Similar documents point in similar directions. Dissimilar documents have perpendicular (orthogonal) vectors.
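The link between unit-length vectors and similarity can be shown in a few lines. Once two vectors each have a norm of 1, their dot product is their cosine similarity: 1 for the same direction, 0 for perpendicular vectors. The three 2-D vectors below are made-up values in (component_1, component_2) space, chosen only for illustration, and the helper names are hypothetical.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (what Normalizer does to each row by default)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Hypothetical document vectors in (component_1, component_2) space
sports_doc = l2_normalize([0.8, 0.0])
music_doc = l2_normalize([0.0, 0.5])
mixed_doc = l2_normalize([0.3, 0.3])

print("sports vs music:", dot(sports_doc, music_doc))
print("sports vs mixed:", round(dot(sports_doc, mixed_doc), 4))
print("music vs mixed: ", round(dot(music_doc, mixed_doc), 4))
```

The purely sports and purely music vectors are orthogonal, while the mixed vector sits at 45 degrees to both and scores the same against each.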

Suppose we have a new sports-related document that is not already in our corpus, such as "baseball basketball broncos" and we want to see if it is more similar to component_1 or component_2. We will calculate the dot product of the new vector with each of the components and see which dot product is larger:

In [23]:
new_article = [1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0] # corresponds to "baseball basketball broncos" in the matrix above
print("Dot product with component 1: ", np.dot(svd.components_[0].round(5), new_article))
print("Dot product with component 2: ", np.dot(svd.components_[1].round(5), new_article))

Dot product with component 1:  0.9659800000000001
Dot product with component 2:  0.0


Not surprisingly, the document was much more similar to component 1. What about a new music-related document like "jagger jam joplin"?

In [24]:
new_article = [0,0,0,0,0,0,0,0,1,1,1,0,0,0,0,0,0,0,0] # corresponds to "jagger jam joplin" in the matrix above
print("Dot product with component 1: ", np.dot(svd.components_[0].round(5), new_article))
print("Dot product with component 2: ", np.dot(svd.components_[1].round(5), new_article))

Dot product with component 1:  0.0
Dot product with component 2:  0.54435


What about an article that is a combination of the two, such as "redsox hendrix"?

In [25]:
new_article = [0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0] # corresponds to "redsox hendrix" in the matrix above
print("Dot product with component 1: ", np.dot(svd.components_[0].round(5), new_article))
print("Dot product with component 2: ", np.dot(svd.components_[1].round(5), new_article))

Dot product with component 1:  0.47627
Dot product with component 2:  0.51977
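Typing the 0/1 vectors by hand is error-prone, so it can help to build them from the vocabulary instead. The sketch below is a hypothetical helper, not part of the original notebook. It hard-codes the vocabulary and the rounded component_1 loadings from the tables above so that it runs on its own.

```python
vocabulary = ['baseball', 'basketball', 'broncos', 'cowboys', 'cubs',
              'football', 'giants', 'hendrix', 'jagger', 'jam', 'joplin',
              'pearl', 'pop', 'prince', 'redsox', 'rock', 'stars',
              'tigers', 'tupac']

# component_1 loadings copied from the rounded table above
component_1 = [0.59434, 0.26389, 0.10775, 0.10775, 0.25565, 0.30849,
               0.25565, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.47627,
               0.0, 0.0, 0.31811, 0.0]

def encode(text):
    """Count how often each vocabulary word occurs in a whitespace-split text."""
    words = text.lower().split()
    return [words.count(term) for term in vocabulary]

new_article = encode("baseball basketball broncos")
score = sum(w * x for w, x in zip(component_1, new_article))
print(new_article)
print("Dot product with component 1:", round(score, 5))
```

With the fitted objects from earlier cells, `vectorizer.transform(["baseball basketball broncos"])` followed by `svd.transform(...)` should do the equivalent projection while also applying the TF-IDF weighting, so its numbers will differ somewhat from the raw-count dot products shown here.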
