In [3]:
import unicodedata
import pandas as pd
import string
import re
import numpy as np
from nltk.util import ngrams
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from nltk.stem.snowball import  SnowballStemmer
from nltk.stem import WordNetLemmatizer
from matplotlib import pyplot as plt

np.set_printoptions(linewidth=100)

# Natural Language Processing

### Objectives 

By the end of this lecture, students should be able to:

1. Define NLP, how it is used, and explain NLP's challenges.
2. Describe the NLP workflow.
3. Execute the NLP workflow with Python libraries.

### Requirements

You need to install the `nltk` module:

```
conda install nltk
```

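This module needs corpora that you have to download. Downloading everything can take a very long time, so for simplicity here are the minimal corpora for this lecture. In a terminal, open `ipython`, import nltk, and type the commands below. The `word_tokenize` calls later in the lecture also need the `punkt` tokenizer models (`punkt_tab` on recent NLTK releases), which can be downloaded the same way: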

```
nltk.download('wordnet')
nltk.download('stopwords')
```
<!--
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_treebank_pos_tagger')
-->


## Introduction to NLP

### What is Natural Language Processing (NLP)?

The machine learning models we create are dependent on numerical data. However text, which is non-numerical, is
diverse and plentiful. So, how can we get computers to understand the languages of humans? In NLP we are trying to program computers to process and analyze large amounts of natural language data.  To do this we have to use various techniques to convert words or language into a numerical representation so that machine learning algorithms can interpret language.

<!-- ![](https://miro.medium.com/max/1532/1*JFSVBcw7GDr7v-8hcqmPDA.png) -->

<img src='http://hoctructuyen123.net/wp-content/uploads/2019/07/Untitled.png' width=500 >

### Example Applications of NLP 

NLP is a huge field with many exciting applications.  Here are just a few: 
- **Gmail:** Classifying spam emails 
- **Chat bots:** Ability to interact with a chat bot on websites
- **Google:** Automatically group news articles by topic 
- **Amazon Reviews:** Is a review on Amazon positive, negative or neutral? 

<img src='https://www.pantechsolutions.net/blog/wp-content/uploads/2019/05/NLP.png' width=500 >

### It is difficult to understand human language! 

What is the meaning of the following sentence? 

> I made him duck. 

<br><br><br><br><br><br><br><br>

**There is a lot of ambiguity in language!**
<table>
   <td> <img src='https://bloximages.chicago2.vip.townnews.com/tucson.com/content/tncms/assets/v3/editorial/2/7b/27bbe6b5-3dc0-58ef-9f6e-ed4c88dbf224/4fb3c6fd2fc1a.image.jpg' width=400> 
   <td> <img src='https://www.nutritionadvance.com/wp-content/uploads/2018/11/duck-meat-is-it-a-healthy-choice.jpg' width=300> 
   <td> <img src='https://image.invaluable.com/housePhotos/PeachtreeandBennett/26/559426/H4390-L67817296.jpg' width=300> 
   <td> <img src ='https://upload.wikimedia.org/wikipedia/commons/thumb/d/d5/Grave_eend_maasmuur.jpg/1200px-Grave_eend_maasmuur.jpg' width=300> 
</table>

This problem of determining which sense was meant by a specific word is  formally known as **word sense disambiguation**.

<img src='images/nlp_python_cartoon.png' width=600>

Often we want to get at the semantic meaning of language, but language is structured in a way that makes this difficult. This is something to be aware of when using NLP.



### NLP Terminology 

Before we dive into the conversion of natural language to numerical vectors, we first need to introduce terminology commonly used in NLP.

| Term      | Meaning     | 
| :------------- | :----------: | 
|  *Corpus* | Collection of documents (collection of articles)   | 
|  *Document* | A single document (a tweet, an email, an article) | 
|  *Vocabulary*  | Set of words in your corpus, or maybe the entire English dictionary | 
|  *Bag of Words*  | Vector representation of words in a document | 
|  *Token*  | Single word | 
|  *Stop Words*  | Common ignored words because not useful in distinguishing text | 
|  *Vectorizing*  | Converting text into a bag-of-words | 
|  *n-gram*  | A sequence of n consecutive words treated as a single unit in the analysis (*e.g.* boy vs. little boy)  |
|  *stemming*  | Cuts off the beginning or end of words to get at the root form (*e.g.* studying --> study)  |
|  *lemmatizing*  | Similar to stemming, but maps each word to its dictionary form (*e.g.* studied --> study)  |


I will expand on this language throughout the lecture.  Let's start with how we convert language into a vector: **bag of words**.

### Bag of Words 

One of the most common ways to represent text numerically is using Bag of Words (BoW).

![](images/BOW.png)
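
As a rough illustration of the idea, the sketch below counts word occurrences per document using only the standard library. The two toy sentences and the helper name `bag_of_words` are assumptions made for this example, and it skips all of the cleaning steps covered later in the lecture.

```python
from collections import Counter

def bag_of_words(docs):
    """Return (vocabulary, count vectors) for a list of raw strings."""
    # naive tokenizer: lowercase and split on whitespace
    tokenized = [doc.lower().split() for doc in docs]
    # sorted vocabulary gives every word a stable column position
    vocabulary = sorted({tok for doc in tokenized for tok in doc})
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        # Counter returns 0 for words that do not occur in this document
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors

toy_docs = ["the cat sat", "the cat saw the dog"]
vocab, vectors = bag_of_words(toy_docs)
print(vocab)     # ['cat', 'dog', 'sat', 'saw', 'the']
print(vectors)   # [[1, 0, 1, 0, 1], [1, 1, 0, 1, 2]]
```

Each row is one document and each column is one vocabulary word, so word order inside a document is lost. That is the "bag" in bag of words.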

<!--<img src='images/bag_words1.png' width=800 />-->


# NLP Pipeline 

## NLP tools

Throughout this lecture we will be introducing several tools which can be used to help with the NLP pipeline.  These tools include:

- sklearn 
- nltk
- spaCy

Each of these tools has an important role to play.

- sklearn: great for a first pass; easy to implement, but you have the least amount of control
- nltk: solid NLP library; great NLP learning tool
- spaCy: good for production environments

With these notes in mind, I am going to start by using nltk to implement the entire NLP pipeline. That being said, please be aware of the other tools you have when working on NLP projects.

## Pipeline
![](images/pipeline-walkthrough1.png)

Below is a to do list when converting text into vector form: 

**Clean text and Create a Bag of Words (BoW)**
>1. Lowercase the text
2. Tokenize 
3. Strip out punctuation or undesirable text
4. Remove Stopwords 
5. Stemming or Lemmatizing
6. Compute N-Grams
7. Use this to create BoW

**Vectorize BoW**
>8. Term Frequencies
9. Document Frequencies
10. TF-IDF
11. Normalize vectors

Let's go through each of these steps and how to do them in Python, using a small corpus of comments about data science.

Below we define the text that we are going to vectorize in this notebook:

In [2]:
docs = [
    'I ~love~ ^love^ *love* to Study DATA and SCience...',
    '     a data science 😀🤯👍 is a 🔥   career field.',
    'Janet is studying the data science @ GALVANIZE!!!. Janet LOVES data aNd scieNCE.'
]

### Removing special characters 

First, we remove special characters by keeping only ASCII characters.
<br>
<details>
   <summary>
       <b>Click here</b> for a few things you may want to know:
    </summary>
<b>unicode:</b>
<ul>
<li>an information technology standard for the consistent encoding, representation, and handling of text expressed in most of the world's writing systems. -- wiki
<li>includes over 140,000 characters in various languages and emojis 
    </ul>
<b>ascii:</b> 
<ul>
<li> American Standard Code for Information Interchange, is a character encoding standard for electronic communication. -- wiki 
<li> limited to English characters, basic punctuation, and numbers (128 characters)
</ul>
    
Below we do the following: 
<ul>
<li> normalize the text using Unicode's 'NFKD' normalization form, which splits accented characters into a base letter plus a combining mark
<li> encode the text as ASCII and drop characters that cannot be represented (here we keep the English alphabet and basic punctuation and drop everything else)
<li> decode the resulting bytes back into a Python string (the UTF-8 codec works here because ASCII is a subset of UTF-8)
    </ul>
</details>

In [3]:
unicodedata.normalize('NFKD', docs[1]).encode('ASCII', 'ignore').decode('utf8')

'     a data science  is a    career field.'

In [4]:
# let's start by removing emojis and other special characters
for i,doc in enumerate(docs):
    docs[i] = unicodedata.normalize('NFKD', doc).encode('ASCII', 'ignore').decode('utf8')

In [5]:
docs

['I ~love~ ^love^ *love* to Study DATA and SCience...',
 '     a data science  is a    career field.',
 'Janet is studying the data science @ GALVANIZE!!!. Janet LOVES data aNd scieNCE.']
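
To see what the normalize / encode / decode chain does to accented text as well as emojis, here is a small sketch. The helper name `to_ascii` and the sample strings are made up for this illustration.

```python
import unicodedata

def to_ascii(text):
    # NFKD splits 'é' into 'e' plus a combining accent mark
    decomposed = unicodedata.normalize('NFKD', text)
    # encoding with errors='ignore' silently drops anything that is not ASCII
    return decomposed.encode('ascii', 'ignore').decode('ascii')

print(repr(to_ascii('café naïve 😀')))   # 'cafe naive '
print(repr(to_ascii('straße')))          # 'strae'
```

Accented letters keep their base letter, while characters with no ASCII decomposition (emojis, `ß`) disappear entirely. That is worth keeping in mind before applying this step to non-English text.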

## Clean text and Create a BoW

### Lowercase the text 

This one is pretty straightforward.

In [6]:
for i,doc in enumerate(docs):
    docs[i] = doc.lower()  

docs

['i ~love~ ^love^ *love* to study data and science...',
 '     a data science  is a    career field.',
 'janet is studying the data science @ galvanize!!!. janet loves data and science.']

### Tokenize

Split each document into a list of tokens.

In [7]:
for i,doc in enumerate(docs):
    docs[i] = word_tokenize(doc) # function from nltk
docs

[['i',
  '~love~',
  '^love^',
  '*',
  'love',
  '*',
  'to',
  'study',
  'data',
  'and',
  'science',
  '...'],
 ['a', 'data', 'science', 'is', 'a', 'career', 'field', '.'],
 ['janet',
  'is',
  'studying',
  'the',
  'data',
  'science',
  '@',
  'galvanize',
  '!',
  '!',
  '!',
  '.',
  'janet',
  'loves',
  'data',
  'and',
  'science',
  '.']]

### Strip out punctuation

In [8]:
pt = string.punctuation
pt

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [9]:
for i,doc in enumerate(docs):
    docs[i] = [word for word in doc if word not in pt]  # pt is a string, so this is a substring test
docs

[['i',
  '~love~',
  '^love^',
  'love',
  'to',
  'study',
  'data',
  'and',
  'science',
  '...'],
 ['a', 'data', 'science', 'is', 'a', 'career', 'field'],
 ['janet',
  'is',
  'studying',
  'the',
  'data',
  'science',
  'galvanize',
  'janet',
  'loves',
  'data',
  'and',
  'science']]

OK, so this didn't remove all of the punctuation. Let's do a bit more to eliminate the unneeded punctuation.

In [52]:
# use regular expression to remove additional punctuation
for i,doc in enumerate(docs):
    for j,word in enumerate(doc):
        docs[i][j] = re.sub(r'[~*.^]+', '', word)

# The r prefix makes this a raw string: backslashes are kept as literal
# characters instead of being treated as Python escape sequences.

In [11]:
docs

[['i', 'love', 'love', 'love', 'to', 'study', 'data', 'and', 'science', ''],
 ['a', 'data', 'science', 'is', 'a', 'career', 'field'],
 ['janet',
  'is',
  'studying',
  'the',
  'data',
  'science',
  'galvanize',
  'janet',
  'loves',
  'data',
  'and',
  'science']]

In [12]:
# drop every token that became an empty string after removing punctuation
for i,doc in enumerate(docs):
    docs[i] = [token for token in doc if token]

In [13]:
docs

[['i', 'love', 'love', 'love', 'to', 'study', 'data', 'and', 'science'],
 ['a', 'data', 'science', 'is', 'a', 'career', 'field'],
 ['janet',
  'is',
  'studying',
  'the',
  'data',
  'science',
  'galvanize',
  'janet',
  'loves',
  'data',
  'and',
  'science']]
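
The punctuation filter, the regular expression and the empty-string cleanup above can also be collapsed into a single pass. The sketch below is one possible alternative, not the approach used in this notebook: it relies on `str.strip`, which only removes punctuation from the two ends of a token. The token list is a made-up sample.

```python
import string

raw_tokens = ['i', '~love~', '^love^', '*', 'love', '...', 'long-term', '!']

# strip punctuation from both ends of each token; interior characters are kept
stripped = [tok.strip(string.punctuation) for tok in raw_tokens]
# tokens made only of punctuation are now empty strings, so drop them
clean_tokens = [tok for tok in stripped if tok]

print(clean_tokens)   # ['i', 'love', 'love', 'love', 'long-term']
```

Because only the ends are stripped, hyphenated words such as `long-term` survive intact.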

### Remove Stopwords 

Sometimes, some extremely common words which would appear to be of little value in understanding the content of language are excluded from the vocabulary entirely. These words are called **stop words**.

**Examples:**
1. the
2. is
3. to

NLTK comes with a default list of stop words you can remove from your corpus. 

In [14]:
sw = stopwords.words('english') # stopwords from nltk

In [15]:
sw;

In [16]:
for i,doc in enumerate(docs):
    docs[i] = [token for token in doc if token not in sw]
docs

[['love', 'love', 'love', 'study', 'data', 'science'],
 ['data', 'science', 'career', 'field'],
 ['janet',
  'studying',
  'data',
  'science',
  'galvanize',
  'janet',
  'loves',
  'data',
  'science']]

**Note:** 
I have used NLTK's default stop words. However, you can supply your own list of stop words! Customizing this process to your problem is an important part of the NLP process. 
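
As a sketch of what such a customization could look like, the example below extends a base list with corpus-specific words. To keep it self-contained, `base_stopwords` is a tiny stand-in for `stopwords.words('english')`, and the `domain_stopwords` are hypothetical choices for a corpus where every document is about data science.

```python
# stand-in for stopwords.words('english'); the real list is much longer
base_stopwords = {'the', 'is', 'to', 'a', 'and'}
# hypothetical additions: words so common in this corpus that they carry no signal
domain_stopwords = {'data', 'science'}
custom_sw = base_stopwords | domain_stopwords

tokens = ['janet', 'is', 'studying', 'the', 'data', 'science']
filtered = [tok for tok in tokens if tok not in custom_sw]
print(filtered)   # ['janet', 'studying']
```

Using a set rather than a list also makes each membership test fast, which matters once the corpus gets large.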

### Stemming or Lemmatizing

**Stemming:**<br>
- Removes suffixes 
    - speaking -> speak
    - actually -> actual
- Does it without context 
- Sometimes produces non-words (e.g. studies -> studi)
- Pretty fast 

**Lemmatizing** 
- Replaces a word with its dictionary form, or lemma 
   - better -> good 
   - mice -> mouse 
- Slower 
- Can do a better job of getting at the root meaning of a word

![](https://miro.medium.com/max/942/1*1bZumlAPUQwfEvQXbmWI_w.png)

### Stemming with NLTK 

There are several types of stemming algorithms.  Here are three commonly used Stemming algorithms:

1. Porter: Commonly used algorithm and one of the least aggressive (Less likely to remove letters from words)
2. Snowball: An improved version of Porter that is more efficient.
3. Lancaster: An aggressive and efficient method; however, it is difficult to interpret the root meaning of words. 

Overall, I would suggest Snowball as it is both efficient and interpretable. 

In [17]:
stemmer = SnowballStemmer('english')

In [18]:
docs1 = []
for i,doc in enumerate(docs):
    docs1.append([stemmer.stem(token) for token in doc])
docs1

[['love', 'love', 'love', 'studi', 'data', 'scienc'],
 ['data', 'scienc', 'career', 'field'],
 ['janet',
  'studi',
  'data',
  'scienc',
  'galvan',
  'janet',
  'love',
  'data',
  'scienc']]

### Lemmatizing with NLTK

First, let's look at how it works:

In [19]:
lemmatizer = WordNetLemmatizer()
text = "studies studying cries cry"
tokenization = word_tokenize(text)
print(tokenization)
for w in tokenization:
    print("Lemma for {} is {}".format(w, lemmatizer.lemmatize(w)))  

['studies', 'studying', 'cries', 'cry']
Lemma for studies is study
Lemma for studying is studying
Lemma for cries is cry
Lemma for cry is cry


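Note that `studying` is left unchanged: `lemmatize` treats every word as a noun unless told otherwise (for example, `lemmatizer.lemmatize('studying', pos='v')` returns `study`).

OK, now let's use it to transform our data science comments: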

In [20]:
for i,doc in enumerate(docs):
    docs[i] = [lemmatizer.lemmatize(token) for token in doc]
docs

[['love', 'love', 'love', 'study', 'data', 'science'],
 ['data', 'science', 'career', 'field'],
 ['janet',
  'studying',
  'data',
  'science',
  'galvanize',
  'janet',
  'love',
  'data',
  'science']]

### Compute N-Grams

N-grams of texts are extensively used in text mining and natural language processing tasks. They are a set of co-occurring words which are found via a sliding window.  

- Pro: They are helpful when pairs of words (or >2) give insight to the content of text
- Cons:  They increase the dimensionality of an already very high dimensional problem.

In [21]:
list(ngrams(['I','like','data','science'],2)) # ngrams returns a generator, so wrap it in list() to see the result

[('I', 'like'), ('like', 'data'), ('data', 'science')]
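
To make the sliding window explicit, here is a minimal pure-Python sketch of the same idea. The function name `sliding_ngrams` is invented for this example; in practice `nltk.util.ngrams` does the job.

```python
def sliding_ngrams(tokens, n):
    # zip n copies of the token list, each shifted one position further right;
    # zip stops at the shortest copy, which ends the window at the right place
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = ['I', 'like', 'data', 'science']
print(sliding_ngrams(tokens, 2))
# [('I', 'like'), ('like', 'data'), ('data', 'science')]
print(sliding_ngrams(tokens, 3))
# [('I', 'like', 'data'), ('like', 'data', 'science')]
```

A document with `k` tokens yields `k - n + 1` n-grams, which is why adding bigrams and trigrams inflates the vocabulary so quickly.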

In [22]:
for i,doc in enumerate(docs):
    ng = ['-'.join(tup) for tup in ngrams(doc, 2)]
    docs[i] = doc + ng

In [23]:
docs

[['love',
  'love',
  'love',
  'study',
  'data',
  'science',
  'love-love',
  'love-love',
  'love-study',
  'study-data',
  'data-science'],
 ['data',
  'science',
  'career',
  'field',
  'data-science',
  'science-career',
  'career-field'],
 ['janet',
  'studying',
  'data',
  'science',
  'galvanize',
  'janet',
  'love',
  'data',
  'science',
  'janet-studying',
  'studying-data',
  'data-science',
  'science-galvanize',
  'galvanize-janet',
  'janet-love',
  'love-data',
  'data-science']]

### Create the Bag of Words

In [24]:
# creating the vocabulary and initializing the BoW matrix
vocabulary = set()
for doc in docs:
    for token in doc:
        vocabulary.add(token)

# note: set iteration order is arbitrary, so the indices below can differ between runs
vocabulary_lookup = {word: i for i, word in enumerate(vocabulary)}
matrix = np.zeros((len(docs), len(vocabulary)))

In [25]:
vocabulary_lookup 

{'studying-data': 0,
 'janet-love': 1,
 'science-galvanize': 2,
 'career-field': 3,
 'science-career': 4,
 'study-data': 5,
 'love': 6,
 'love-love': 7,
 'galvanize': 8,
 'career': 9,
 'love-data': 10,
 'field': 11,
 'love-study': 12,
 'studying': 13,
 'data': 14,
 'data-science': 15,
 'janet': 16,
 'galvanize-janet': 17,
 'science': 18,
 'janet-studying': 19,
 'study': 20}

In [26]:
# creating the bag of words
for doc_id, doc in enumerate(docs):
    for token in doc:
        word_id = vocabulary_lookup[token]
        matrix[doc_id][word_id] += 1

In [27]:
# visualizing the bag of words
columns = sorted(vocabulary_lookup, key=lambda key: vocabulary_lookup[key])
df = pd.DataFrame(matrix.astype('int'), columns=columns); df

Unnamed: 0,studying-data,janet-love,science-galvanize,career-field,science-career,study-data,love,love-love,galvanize,career,...,field,love-study,studying,data,data-science,janet,galvanize-janet,science,janet-studying,study
0,0,0,0,0,0,1,3,2,0,0,...,0,1,0,1,1,0,0,1,0,1
1,0,0,0,1,1,0,0,0,0,1,...,1,0,0,1,1,0,0,1,0,0
2,1,1,1,0,0,0,1,0,1,0,...,0,0,1,2,2,2,1,2,1,0


### Review 

What were the steps we took to convert raw text into the BoW?

## Vectorize the BoWs

### Term Frequencies

The fraction of a document's tokens accounted for by a given term:

$tf(term,document) = \frac{\#\ \text{of times the term appears in the document}}{\#\ \text{of terms in the document}}$

Note: You can use the L2 norm as well for the denominator. 

In [28]:
# dividing each document vector by the sum of its counts, producing a term frequency vector
tf = df / df.sum(axis=1).values.reshape(-1, 1); tf.round(2)

Unnamed: 0,studying-data,janet-love,science-galvanize,career-field,science-career,study-data,love,love-love,galvanize,career,...,field,love-study,studying,data,data-science,janet,galvanize-janet,science,janet-studying,study
0,0.0,0.0,0.0,0.0,0.0,0.09,0.27,0.18,0.0,0.0,...,0.0,0.09,0.0,0.09,0.09,0.0,0.0,0.09,0.0,0.09
1,0.0,0.0,0.0,0.14,0.14,0.0,0.0,0.0,0.0,0.14,...,0.14,0.0,0.0,0.14,0.14,0.0,0.0,0.14,0.0,0.0
2,0.06,0.06,0.06,0.0,0.0,0.0,0.06,0.0,0.06,0.0,...,0.0,0.0,0.06,0.12,0.12,0.12,0.06,0.12,0.06,0.0
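
To check one row of this table by hand, the sketch below recomputes the term frequencies of the first document from its token list (the tokens are copied from the n-gram output earlier). It uses `collections.Counter` instead of the DataFrame so that the arithmetic is visible.

```python
from collections import Counter

doc0 = ['love', 'love', 'love', 'study', 'data', 'science',
        'love-love', 'love-love', 'love-study', 'study-data', 'data-science']

counts = Counter(doc0)
total = sum(counts.values())    # 11 tokens in this document
tf_doc0 = {term: count / total for term, count in counts.items()}

print(round(tf_doc0['love'], 2))        # 3 / 11 -> 0.27
print(round(tf_doc0['love-love'], 2))   # 2 / 11 -> 0.18
print(round(tf_doc0['data'], 2))        # 1 / 11 -> 0.09
```

The term frequencies of a document always sum to 1, so a long and a short document end up on the same scale.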


What would be the problems if we just used the term frequency as our numerical representation of text?



**Example:** Consider the three tweets: 

- Data science rocks
- Rock climbing data rocks
- Science rocks

Putting these documents into a BoW (after stemming) gives:

| Data | Science | Rock | Climb |
| ---- | ------- | ---- | ----- |
| 1    | 1       | 1    | 0     |
| 1    | 0       | 2    | 1     |
| 0    | 1       | 1    | 0     |

Here the term frequency for *rock* is high in every tweet, so it tells us little about how the tweets differ. The rarer words are more informative for telling the tweets apart.

### Document Frequency (DF)

The fraction of documents in the corpus that contain a given term:

$df(term,corpus) = \frac{\#\ \text{of documents that contain the term}}{\#\ \text{of documents in the corpus}}$

### Inverse Document Frequency (IDF)

The inverse document frequency is defined in terms of the document frequency as

$idf(term,corpus) = \log{\frac{1}{df(term,corpus)}}$.

In [29]:
# computing the inverse document frequency
idf = np.log(matrix.shape[0] / np.sum(matrix > 0, axis=0)); idf.round(2)

array([1.1 , 1.1 , 1.1 , 1.1 , 1.1 , 1.1 , 0.41, 1.1 , 1.1 , 1.1 , 1.1 , 1.1 , 1.1 , 1.1 , 0.  ,
       0.  , 1.1 , 1.1 , 0.  , 1.1 , 1.1 ])
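
As a worked example of the formula, the sketch below computes the idf of a few terms in the three "rock" tweets from the previous section. The stemmed token lists are written out by hand and the helper name `idf_of` is an assumption for this illustration.

```python
import math

tweets = [
    ['data', 'science', 'rock'],
    ['rock', 'climb', 'data', 'rock'],
    ['science', 'rock'],
]

def idf_of(term, docs):
    # number of documents containing the term (must be > 0 to avoid dividing by zero)
    containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / containing)

for term in ['rock', 'data', 'climb']:
    print(term, round(idf_of(term, tweets), 2))
# rock 0.0
# data 0.41
# climb 1.1
```

A term that appears in every document gets an idf of exactly 0, which is why `data`, `science` and `data-science` show 0 in the array above.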

### TF-IDF

TF-IDF is the product of two parts: the term frequency (tf) and the inverse document frequency (idf), both as defined above. Terms that are frequent within a document but rare across the corpus get the highest scores.

$tfidf(term,document,corpus) = tf(term,document) \times idf(term,corpus)$

In [30]:
# term frequency, inverse document frequency matrix
tfidf = tf * idf; tfidf.round(2)

Unnamed: 0,studying-data,janet-love,science-galvanize,career-field,science-career,study-data,love,love-love,galvanize,career,...,field,love-study,studying,data,data-science,janet,galvanize-janet,science,janet-studying,study
0,0.0,0.0,0.0,0.0,0.0,0.1,0.11,0.2,0.0,0.0,...,0.0,0.1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.1
1,0.0,0.0,0.0,0.16,0.16,0.0,0.0,0.0,0.0,0.16,...,0.16,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.06,0.06,0.06,0.0,0.0,0.0,0.02,0.0,0.06,0.0,...,0.0,0.0,0.06,0.0,0.0,0.13,0.06,0.0,0.06,0.0


![](images/tfidf.png)
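
To connect the formula back to the "rock" tweets, here is a small self-contained sketch that scores every term of the second tweet. The hand-stemmed token lists and the function name `tf_idf` are assumptions made for this example.

```python
import math
from collections import Counter

tweets = [
    ['data', 'science', 'rock'],
    ['rock', 'climb', 'data', 'rock'],
    ['science', 'rock'],
]

def tf_idf(doc, docs):
    """Score every term of `doc` against the corpus `docs`."""
    scores = {}
    for term, count in Counter(doc).items():
        tf = count / len(doc)                                   # share of this document
        df = sum(1 for d in docs if term in d) / len(docs)      # share of the corpus
        scores[term] = tf * math.log(1 / df)
    return scores

scores = tf_idf(tweets[1], tweets)
for term, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(term, round(score, 2))
# climb 0.27
# data 0.1
# rock 0.0
```

Even though *rock* is the most frequent word in the tweet, it scores 0 because it appears in every document, while the rare word *climb* comes out on top.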

### Normalize vectors

In [31]:
normalized = tfidf / np.linalg.norm(tfidf, axis=1, ord=2).reshape(-1, 1)
normalized

Unnamed: 0,studying-data,janet-love,science-galvanize,career-field,science-career,study-data,love,love-love,galvanize,career,...,field,love-study,studying,data,data-science,janet,galvanize-janet,science,janet-studying,study
0,0.0,0.0,0.0,0.0,0.0,0.348665,0.386045,0.697329,0.0,0.0,...,0.0,0.348665,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.348665
1,0.0,0.0,0.0,0.5,0.5,0.0,0.0,0.0,0.0,0.5,...,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.287051,0.287051,0.287051,0.0,0.0,0.0,0.105942,0.0,0.287051,0.0,...,0.0,0.0,0.287051,0.0,0.0,0.574101,0.287051,0.0,0.287051,0.0
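
The normalization above divides each row by its L2 norm (its Euclidean length), so every document vector ends up with length 1. A minimal sketch of the same operation on a plain list, with a made-up helper name `l2_normalize`:

```python
import math

def l2_normalize(vector):
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0:
        # an all-zero vector has no direction, so return it unchanged
        return list(vector)
    return [x / norm for x in vector]

unit_vector = l2_normalize([3.0, 4.0])
print(unit_vector)                                     # [0.6, 0.8]
print(math.sqrt(sum(x * x for x in unit_vector)))      # length is 1 (up to rounding)
```

The zero-vector guard matters in practice: a document whose terms all have an idf of 0 would otherwise cause a division by zero.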


# Another quick example: Bitcoin tweets

## Step 1: Get Corpus

In [32]:
corpus = [
    'Give me 12,000 #bitcoin',
    'The supply of gold is infinite in our galaxy, the supply of #Bitcoin will always be fixed. Long-term, there will be no comparison between these assets. Gold will be to bitcoin in the future like sand is to gold today.',
    'Someone inquired about my #bitcoin exit strategy. I said the exit occurred in the front end. It is a fiat exit.',   
    'Give ME 11,000!!!! #bitcoin'
]

## Step 2: Clean and Tokenize

In [33]:
def doc_to_tokens(doc, ngs=1):
    # lower case all words
    doc = doc.lower()
    # remove accents and other non-ASCII characters
    doc = unicodedata.normalize('NFKD', doc).encode('ASCII', 'ignore').decode('utf8')
    # remove odd punctuation
    doc = re.sub(r'[~*.^!#]+', '', doc)
    # turn document into a list of tokens
    doc = word_tokenize(doc)
    # remove stopwords (sw) and punctuation (pt)
    doc = [token for token in doc if token not in sw and token not in pt]
    # stem all tokens
    doc = [stemmer.stem(token) for token in doc]
    # append n-grams only when ngs > 1; 1-grams would just duplicate every token
    if ngs > 1:
        doc = doc + ['-'.join(tup) for tup in ngrams(doc, ngs)]
    return doc

## Step 3: Create Vocab

In [34]:
# creating the vocabulary and initializing the BoW matrix
vocabulary = set()
for doc in corpus:
    for token in doc_to_tokens(doc):
        vocabulary.add(token)

vocabulary_lookup = {word: i for i, word in enumerate(vocabulary)}
matrix = np.zeros((len(corpus), len(vocabulary)))

## Step 4: Create the Bag of Words (BoW)

In [35]:
# creating the bag of words
for doc_id, doc in enumerate(corpus):
    for token in doc_to_tokens(doc):
        word_id = vocabulary_lookup[token]
        matrix[doc_id][word_id] += 1

In [36]:
# visualizing the bag of words
columns = sorted(vocabulary_lookup, key=lambda key: vocabulary_lookup[key])
df = pd.DataFrame(matrix.astype('int'), columns=columns); df

Unnamed: 0,"12,000",infinit,strategi,someon,fiat,give,alway,fix,bitcoin,today,...,suppli,long-term,comparison,futur,said,"11,000",galaxi,end,front,asset
0,1,0,0,0,0,1,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
1,0,1,0,0,0,0,1,1,2,1,...,2,1,1,1,0,0,1,0,0,1
2,0,0,1,1,1,0,0,0,1,0,...,0,0,0,0,1,0,0,1,1,0
3,0,0,0,0,0,1,0,0,1,0,...,0,0,0,0,0,1,0,0,0,0


## Step 5: Compute TF-IDF

In [37]:
# dividing each document vector by the sum of its counts, producing a term frequency vector
tf = df / df.sum(axis=1).values.reshape(-1, 1)
# computing the inverse document frequency
idf = np.log(matrix.shape[0] / np.sum(matrix > 0, axis=0))
# term frequency, inverse document frequency matrix
tfidf = tf * idf; tfidf.round(2)

Unnamed: 0,"12,000",infinit,strategi,someon,fiat,give,alway,fix,bitcoin,today,...,suppli,long-term,comparison,futur,said,"11,000",galaxi,end,front,asset
0,0.46,0.0,0.0,0.0,0.0,0.23,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.08,0.0,0.0,0.0,0.0,0.08,0.08,0.0,0.08,...,0.15,0.08,0.08,0.08,0.0,0.0,0.08,0.0,0.0,0.08
2,0.0,0.0,0.12,0.12,0.12,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.12,0.0,0.0,0.12,0.12,0.0
3,0.0,0.0,0.0,0.0,0.0,0.23,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.46,0.0,0.0,0.0,0.0


## Step 6: Normalize Vector

In [38]:
# setting each tfidf vector to be a unit vector
normalized = tfidf / np.linalg.norm(tfidf, axis=1, ord=2).reshape(-1, 1); normalized.round(2)

Unnamed: 0,"12,000",infinit,strategi,someon,fiat,give,alway,fix,bitcoin,today,...,suppli,long-term,comparison,futur,said,"11,000",galaxi,end,front,asset
0,0.89,0.0,0.0,0.0,0.0,0.45,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.2,0.0,0.0,0.0,0.0,0.2,0.2,0.0,0.2,...,0.41,0.2,0.2,0.2,0.0,0.0,0.2,0.0,0.0,0.2
2,0.0,0.0,0.24,0.24,0.24,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.24,0.0,0.0,0.24,0.24,0.0
3,0.0,0.0,0.0,0.0,0.0,0.45,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.89,0.0,0.0,0.0,0.0


In [39]:
# all vectors have length 1
np.linalg.norm(normalized, axis=1, ord=2)

array([1., 1., 1., 1.])

# Comparing two documents / Similarity Measures

## Euclidean distance

We could try the Euclidean distance $||\vec{x}-\vec{y}||$  

The Euclidean distance grows with the length of a document. Intuitively, duplicating every word in a bag of words produces a vector that points in exactly the same direction, yet its Euclidean distance to other vectors changes. One solution is to normalize the vectors before computing the distance. After normalization, increasing the length of a document changes the Euclidean distance only if the direction of the term frequency vector changes.
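
The effect of document length can be seen on a toy example. In the sketch below, the three count vectors and the helper names `euclidean` and `unit` are invented for illustration: `long_doc` is `short_doc` with every word repeated twice.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def unit(v):
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]

short_doc = [1.0, 2.0, 0.0]
long_doc = [2.0, 4.0, 0.0]    # same words as short_doc, each repeated twice
other_doc = [0.0, 1.0, 3.0]

# raw counts: the duplicated document looks farther away from other_doc
print(round(euclidean(short_doc, other_doc), 3))   # 3.317
print(round(euclidean(long_doc, other_doc), 3))    # 4.69

# unit vectors: both versions are now the same distance from other_doc
print(round(euclidean(unit(short_doc), unit(other_doc)), 3))
print(round(euclidean(unit(long_doc), unit(other_doc)), 3))
```

After normalization the short and long versions of the document become the same vector, so the last two distances agree.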

In [40]:
normalized #.iloc[3].values

Unnamed: 0,"12,000",infinit,strategi,someon,fiat,give,alway,fix,bitcoin,today,...,suppli,long-term,comparison,futur,said,"11,000",galaxi,end,front,asset
0,0.894427,0.0,0.0,0.0,0.0,0.447214,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.204124,0.0,0.0,0.0,0.0,0.204124,0.204124,0.0,0.204124,...,0.408248,0.204124,0.204124,0.204124,0.0,0.0,0.204124,0.0,0.0,0.204124
2,0.0,0.0,0.242536,0.242536,0.242536,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.242536,0.0,0.0,0.242536,0.242536,0.0
3,0.0,0.0,0.0,0.0,0.0,0.447214,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.894427,0.0,0.0,0.0,0.0


In [41]:
for i in range(4):
    for j in range(i+1,4):
        print('Euclidean Distance between tweet {} and {} is {}'.format(i,j, 
                np.linalg.norm(normalized.iloc[i].values - normalized.iloc[j].values)) )

Euclidean Distance between tweet 0 and 1 is 1.4142135623730951
Euclidean Distance between tweet 0 and 2 is 1.4142135623730951
Euclidean Distance between tweet 0 and 3 is 1.2649110640673518
Euclidean Distance between tweet 1 and 2 is 1.4142135623730951
Euclidean Distance between tweet 1 and 3 is 1.4142135623730951
Euclidean Distance between tweet 2 and 3 is 1.4142135623730951


## Cosine Similarity
Recall that for two vectors $\vec{x}$ and $\vec{y}$, $\vec{x} \cdot \vec{y} = ||\vec{x}|| \, ||\vec{y}|| \cos{\theta}$. And so,

$\frac{\vec{x} \cdot \vec{y} }{||\vec{x}|| ||\vec{y}||} = \cos{\theta}$

$\theta$ can only range from 0 to 90 degrees, because tf-idf vectors are non-negative. Therefore cos $\theta$ ranges from 0 to 1. Documents that are exactly identical will have $\cos{\theta} = 1$.

In [42]:
for i in range(4):
    for j in range(i+1,4):
         print('Cosine Similarity between tweet {} and {} is {}'.format(i,j, 
                np.dot(normalized.iloc[i].values, normalized.iloc[j].values)) )

Cosine Similarity between tweet 0 and 1 is 0.0
Cosine Similarity between tweet 0 and 2 is 0.0
Cosine Similarity between tweet 0 and 3 is 0.19999999999999998
Cosine Similarity between tweet 1 and 2 is 0.0
Cosine Similarity between tweet 1 and 3 is 0.0
Cosine Similarity between tweet 2 and 3 is 0.0


Tweets 0 and 3 seem to be the most similar. Let's remember what they are:

In [43]:
corpus[0],corpus[3]

('Give me 12,000 #bitcoin', 'Give ME 11,000!!!! #bitcoin')
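
The two measures above are closely linked. For unit vectors, $||\vec{x}-\vec{y}||^2 = 2 - 2\cos{\theta}$, so ranking documents by Euclidean distance or by cosine similarity gives the same order once the vectors are normalized. The sketch below checks this numerically on two made-up vectors, chosen so that their cosine similarity is 0.2, like tweets 0 and 3.

```python
import math

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def cosine_similarity(x, y):
    return dot(x, y) / (math.sqrt(dot(x, x)) * math.sqrt(dot(y, y)))

def unit(v):
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]

x = [2.0, 1.0, 0.0]
y = [0.0, 1.0, 2.0]

cos_theta = cosine_similarity(x, y)                   # 1 / 5
ux, uy = unit(x), unit(y)
distance = math.sqrt(sum((a - b) ** 2 for a, b in zip(ux, uy)))

print(round(cos_theta, 4))                            # 0.2
print(round(distance, 4))                             # 1.2649
print(round(math.sqrt(2 - 2 * cos_theta), 4))         # 1.2649
```

This matches the notebook output: a cosine similarity of 0.2 corresponds to the Euclidean distance of about 1.2649 reported for tweets 0 and 3, and a cosine similarity of 0 corresponds to $\sqrt{2} \approx 1.4142$.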

# NLP Libraries 

## NLTK 

Above I have used NLTK, which is one of the popular NLP libraries. It is a good library for learning NLP, which is why I used it above. However, there are more efficient NLP libraries.

## Sklearn 

Sklearn is great and easy to use! It's good for a first try with NLP. However, if you want more fine-grained control, use NLTK or spaCy.  

### Demo 

*This demo is based on the demo in sklearn's docs.*

Sklearn has several modules that are helpful for NLP. The first is the `CountVectorizer`, which takes a collection of text documents and converts it to a matrix of token counts.

In [55]:
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
     'This is the first document.',
     'This document is the second document.',
     'And this is the third one.',
     'Is this the first document?',
]

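vectorizer = CountVectorizer(stop_words=['this', 'is', 'the', 'and'])  # or stop_words='english' for sklearn's built-in list
X = vectorizer.fit_transform(corpus)

# get_feature_names() was removed in scikit-learn 1.2; newer versions use get_feature_names_out()
vectorizer.get_feature_names(), X.toarray()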

(['document', 'first', 'one', 'second', 'third'],
 array([[1, 1, 0, 0, 0],
        [2, 0, 0, 1, 0],
        [0, 0, 1, 0, 1],
        [1, 1, 0, 0, 0]]))

- What's the difference between fit, transform and fit_transform?  
- Why would this matter for us when using machine learning? 
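
One way to think about the first question is with a stripped-down vectorizer. The class below, `TinyCountVectorizer`, is a hypothetical teaching stand-in and not sklearn's implementation: `fit` learns the vocabulary, `transform` counts words against a vocabulary that is already fixed, and `fit_transform` does both on the same data.

```python
class TinyCountVectorizer:
    """Hypothetical, simplified stand-in for a count vectorizer."""

    def fit(self, docs):
        # learn the vocabulary (and column order) from these documents only
        tokens = {tok for doc in docs for tok in doc.lower().split()}
        self.vocabulary_ = {tok: i for i, tok in enumerate(sorted(tokens))}
        return self

    def transform(self, docs):
        # count words using the learned vocabulary; unseen words are ignored
        rows = []
        for doc in docs:
            row = [0] * len(self.vocabulary_)
            for tok in doc.lower().split():
                if tok in self.vocabulary_:
                    row[self.vocabulary_[tok]] += 1
            rows.append(row)
        return rows

    def fit_transform(self, docs):
        return self.fit(docs).transform(docs)

train_docs = ['first document', 'second document']
test_docs = ['third document']

vec = TinyCountVectorizer()
X_train = vec.fit_transform(train_docs)   # vocabulary: document, first, second
X_test = vec.transform(test_docs)         # 'third' was never seen, so it is dropped

print(X_train)   # [[1, 1, 0], [1, 0, 1]]
print(X_test)    # [[1, 0, 0]]
```

In a machine learning setting this means calling `fit_transform` on the training set and only `transform` on the test set, so that both matrices share the same columns and nothing about the test data leaks into the features.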

In [45]:
vectorizer2 = CountVectorizer(ngram_range=(1, 2)) # (1, 2) means include both unigrams and bigrams
vectorizer2.fit_transform(corpus)
vectorizer2.get_feature_names()

['and',
 'and this',
 'document',
 'document is',
 'first',
 'first document',
 'is',
 'is the',
 'is this',
 'one',
 'second',
 'second document',
 'the',
 'the first',
 'the second',
 'the third',
 'third',
 'third one',
 'this',
 'this document',
 'this is',
 'this the']

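## TFIDF Vectorizer and Transformer

The `TfidfTransformer` converts a matrix of token counts into a normalized TF or TF-IDF representation.

`TfidfVectorizer` in sklearn is equivalent to `CountVectorizer` followed by `TfidfTransformer`. Note that sklearn smooths the idf by default (`smooth_idf=True`) and adds 1 to it, so its values differ from the hand-computed ones earlier in this lecture.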

In [46]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names())
print(X.toarray())

['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
[[0.         0.46979139 0.58028582 0.38408524 0.         0.         0.38408524 0.
  0.38408524]
 [0.         0.6876236  0.         0.28108867 0.         0.53864762 0.28108867 0.
  0.28108867]
 [0.51184851 0.         0.         0.26710379 0.51184851 0.         0.26710379 0.51184851
  0.26710379]
 [0.         0.46979139 0.58028582 0.38408524 0.         0.         0.38408524 0.
  0.38408524]]


In [47]:
vectorizer = TfidfVectorizer(max_features=5)  # maximum number of features
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names())
print(X.toarray())

['document', 'first', 'is', 'the', 'this']
[[0.46979139 0.58028582 0.38408524 0.38408524 0.38408524]
 [0.81614027 0.         0.33362407 0.33362407 0.33362407]
 [0.         0.         0.57735027 0.57735027 0.57735027]
 [0.46979139 0.58028582 0.38408524 0.38408524 0.38408524]]


In [48]:
vectorizer = TfidfVectorizer(max_df=3)    # ignore terms that appear in more than 3 documents (a float would mean a proportion)
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names())
print(X.toarray())

['and', 'document', 'first', 'one', 'second', 'third']
[[0.         0.62922751 0.77722116 0.         0.         0.        ]
 [0.         0.78722298 0.         0.         0.61666846 0.        ]
 [0.57735027 0.         0.         0.57735027 0.         0.57735027]
 [0.         0.62922751 0.77722116 0.         0.         0.        ]]


In [49]:
vectorizer = TfidfVectorizer(min_df=2)     # ignore terms that appear in fewer than 2 documents; an int is a count, a float is a proportion
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names())
print(X.toarray())

['document', 'first', 'is', 'the', 'this']
[[0.46979139 0.58028582 0.38408524 0.38408524 0.38408524]
 [0.81614027 0.         0.33362407 0.33362407 0.33362407]
 [0.         0.         0.57735027 0.57735027 0.57735027]
 [0.46979139 0.58028582 0.38408524 0.38408524 0.38408524]]


## spaCy 

[spaCy](https://spacy.io/) is an NLP library with functionality similar to NLTK, but it is more efficient and thus better suited to production environments. We suggest looking into spaCy if you are doing an NLP capstone project. 

<img src='https://upload.wikimedia.org/wikipedia/commons/thumb/8/88/SpaCy_logo.svg/1200px-SpaCy_logo.svg.png' width=400 >

# Review

1. What is a corpus? 
2. Given a corpus of tweets, how do I convert those tweets into a bag of words? 
3. How do you vectorize the bag of words? 

# Optional Content 

## Part-of-Speech tagging

Part-of-speech (POS) tagging is an optional extra step that uses a trained model to label each word in a sentence with its grammatical role (noun, verb, adjective, and so on). Libraries such as NLTK ship with tools for this. The tags produced depend on the corpus the tagger was trained on. In NLTK, the function `nltk.pos_tag()` uses the [Penn Treebank](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html) tag set.

### nltk.pos_tag

In [50]:
# Here I have some language where I have already tokenized the language
tokens = [['My','caring','mother','drove','me', 'to', 'the', 'airport', 'with', 'the','windows','rolled', 'down', '.'],
 ['It','was','seventy-five','degrees','in', 'Phoenix', ',', 'the', 'sky', 'a',  'perfect',',', 'cloudless',  'blue', '.']]

from nltk import pos_tag

sent_tags = list(map(pos_tag, tokens))

for sent in sent_tags:
    print("--- sentence tags: {}".format(sent))

--- sentence tags: [('My', 'PRP$'), ('caring', 'VBG'), ('mother', 'NN'), ('drove', 'VBD'), ('me', 'PRP'), ('to', 'TO'), ('the', 'DT'), ('airport', 'NN'), ('with', 'IN'), ('the', 'DT'), ('windows', 'NNS'), ('rolled', 'VBD'), ('down', 'RB'), ('.', '.')]
--- sentence tags: [('It', 'PRP'), ('was', 'VBD'), ('seventy-five', 'JJ'), ('degrees', 'NNS'), ('in', 'IN'), ('Phoenix', 'NNP'), (',', ','), ('the', 'DT'), ('sky', 'NN'), ('a', 'DT'), ('perfect', 'JJ'), (',', ','), ('cloudless', 'JJ'), ('blue', 'NN'), ('.', '.')]


Let's filter verbs!

In [51]:
for sent in sent_tags:
    tags_filtered = [t for t in sent if t[1].startswith('VB')]
    print("--- verbs:\n{}".format(tags_filtered))

--- verbs:
[('caring', 'VBG'), ('drove', 'VBD'), ('rolled', 'VBD')]
--- verbs:
[('was', 'VBD')]
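
The same prefix trick works for any family of Penn Treebank tags. The sketch below wraps it in a small helper, `words_with_tag_prefix` (a name made up for this example), and applies it to tagged tokens copied from the output above so that it runs without the tagger.

```python
# (word, tag) pairs copied from the first tagged sentence above
tagged = [('My', 'PRP$'), ('caring', 'VBG'), ('mother', 'NN'), ('drove', 'VBD'),
          ('me', 'PRP'), ('to', 'TO'), ('the', 'DT'), ('airport', 'NN'),
          ('with', 'IN'), ('the', 'DT'), ('windows', 'NNS')]

def words_with_tag_prefix(tagged_tokens, prefix):
    # 'NN' matches NN, NNS, NNP, ...; 'VB' matches VB, VBD, VBG, ...
    return [word for word, tag in tagged_tokens if tag.startswith(prefix)]

print(words_with_tag_prefix(tagged, 'NN'))   # ['mother', 'airport', 'windows']
print(words_with_tag_prefix(tagged, 'VB'))   # ['caring', 'drove']
```

Keeping only nouns, for instance, is a simple way to shrink the vocabulary before building a bag of words.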
