# Lesson 12 - Natural Language Processing I

> Analyse text with machine learning - part 1.

<p align="left">
<img src='./images/natural-language-processing-so-hot-right-now.jpg'  width=400>

Natural language processing (NLP) is the part of machine learning concerned with the analysis of digital, human-written texts. The topic is as old as machine learning itself and dates back to Alan Turing.

However, there has been exciting and rapid progress in the past couple of years. One of the outstanding achievements was the publication of OpenAI's GPT-2, a language model able to not only create realistic text samples but also solve tasks of many NLP benchmarks without special training. See the figure below for an example output of GPT-2.

If you want to try your own examples you can do so at [talktotransformer.com](https://talktotransformer.com/) or read the original article on [OpenAI's webpage](https://openai.com/blog/better-language-models/).

<p align="left">
<img src='./images/gpt2-example.png' width=400>

_Summary:_ In this notebook we explore how to transform text into a machine-readable format by using **vector encodings**. To get high-quality encodings it is often necessary to do some **pre-processing** on the raw texts. This usually requires some **string processing**, which is briefly explained and showcased. Finally, we use the encodings to build a rudimentary **search engine**. In summary, this lecture is structured in the following four parts:
* Dataset: 20 Newsgroup
* String Processing
* Vector Encodings
* Search Engine

_Created by:_ Leandro von Werra, Spring 2019

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
# sets how many digits of numpy objects are printed 
np.set_printoptions(precision=3)

## Dataset

In this lesson we will use the 20 Newsgroup dataset which is the `Hello World` example in NLP. It contains about 18'000 text snippets that belong to 20 topics. The goal is to assign the texts in the test set to the 20 topics.

This dataset can be loaded using the `Scikit-learn` API:

In [None]:
from sklearn.datasets import fetch_20newsgroups

In [None]:
newsgroups_train = fetch_20newsgroups(subset='all')

The resulting `newsgroups_train` variable is a dictionary-like object with several keys:

In [None]:
newsgroups_train.keys()

dict_keys(['data', 'filenames', 'target_names', 'target', 'DESCR'])

The `DESCR` contains a text with a description of the dataset:

In [None]:
print(newsgroups_train['DESCR'][25:394])



The 20 newsgroups text dataset
------------------------------

The 20 newsgroups dataset comprises around 18000 newsgroups posts on
20 topics split in two subsets: one for training (or development)
and the other one for testing (or for performance evaluation). The split
between the train and test set is based upon a messages posted before
and after a specific date.


The `data` entry contains a list of all texts. We can print the first entry:

In [None]:
print(newsgroups_train['data'][0])

From: Mamatha Devineni Ratnam <mr47+@andrew.cmu.edu>
Subject: Pens fans reactions
Organization: Post Office, Carnegie Mellon, Pittsburgh, PA
Lines: 12
NNTP-Posting-Host: po4.andrew.cmu.edu



I am sure some bashers of Pens fans are pretty confused about the lack
of any kind of posts about the recent Pens massacre of the Devils. Actually,
I am  bit puzzled too and a bit relieved. However, I am going to put an end
to non-PIttsburghers' relief with a bit of praise for the Pens. Man, they
are killing those Devils worse than I thought. Jagr just showed you why
he is much better than his regular season stats. He is also a lot
fo fun to watch in the playoffs. Bowman should let JAgr have a lot of
fun in the next couple of games since the Pens are going to beat the pulp out of Jersey anyway. I was very disappointed not to see the Islanders lose the final
regular season game.          PENS RULE!!!




The `target` entry contains the numerical labels, whereas `target_names` contains the text labels, as the name suggests. The label that belongs to the example above is:

In [None]:
newsgroups_train['target'][0]

10

In [None]:
newsgroups_train['target_names']

['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']

For the example text this means it belongs to the following category:

In [None]:
newsgroups_train['target_names'][newsgroups_train['target'][0]]

'rec.sport.hockey'

Now we store the data in a DataFrame, since it is easier to handle the data once it is in that form.

In [None]:
df = pd.DataFrame({'filenames': newsgroups_train['filenames'],
                   'text': newsgroups_train['data'],
                   'target': newsgroups_train['target'],})
df.head()

Unnamed: 0,filenames,text,target
0,/Users/leandro/scikit_learn_data/20news_home/2...,From: Mamatha Devineni Ratnam <mr47+@andrew.cm...,10
1,/Users/leandro/scikit_learn_data/20news_home/2...,From: mblawson@midway.ecn.uoknor.edu (Matthew ...,3
2,/Users/leandro/scikit_learn_data/20news_home/2...,From: hilmi-er@dsv.su.se (Hilmi Eren)\nSubject...,17
3,/Users/leandro/scikit_learn_data/20news_home/2...,From: guyd@austin.ibm.com (Guy Dawson)\nSubjec...,3
4,/Users/leandro/scikit_learn_data/20news_home/2...,From: Alexander Samuel McDiarmid <am2o+@andrew...,4


We also do a few necessary dataframe operations:
1. Remove the file path from the filename by taking the piece after the last slash.

In [None]:
df['filenames'] = df['filenames'].apply(lambda x: x.split('/')[-1])

2. Split off additional information from the text, such as email addresses. The header block ends at the first blank line, so we keep everything after it.

In [None]:
df['text'] = df['text'].apply(lambda x: x.split('\n\n', 1)[1])

3. Add the target names to the entries in the dataframe

In [None]:
df['target_name'] = df['target'].apply(lambda x: newsgroups_train['target_names'][x])

In [None]:
df.head()

Unnamed: 0,filenames,text,target,target_name
0,54367,\n\nI am sure some bashers of Pens fans are pr...,10,rec.sport.hockey
1,60215,My brother is in the market for a high-perfo...,3,comp.sys.ibm.pc.hardware
2,76120,"\n\n\n|>The student of ""regional killings"" ali...",17,talk.politics.mideast
3,60771,\nIn article <1993Apr19.034517.12820@julian.uw...,3,comp.sys.ibm.pc.hardware
4,51882,\n1) I have an old Jasmine drive which I c...,4,comp.sys.mac.hardware


**Exercise 1:** Create a new column called `text_length` that stores the length of each text, and then plot a histogram of the text lengths.
* To get the new column, use `apply` with the function `len`, which returns the length of a string. Example usage of `len`:
```python
len('test')
>>> 4
```
* Then use the `sns.distplot` function to plot the distribution. Set the bins argument of `distplot` to `bins=1000` and then use `plt.xlim([0,1000])`. (In recent seaborn versions `distplot` is deprecated; `sns.histplot` is the suggested replacement.)


## String Processing
In this section we have a look at the basics of string processing. Being able to filter/combine/manipulate strings is a crucial skill to do natural language processing.

In [None]:
string = 'This is a string!\n(But not a very interesting one)\n\n\tEnd.'
print(string)

This is a string!
(But not a very interesting one)

	End.


In Python, strings are sequences of characters, so one can iterate through them just like lists:

In [None]:
for character in string:
    print(character)

T
h
i
s
 
i
s
 
a
 
s
t
r
i
n
g
!


(
B
u
t
 
n
o
t
 
a
 
v
e
r
y
 
i
n
t
e
r
e
s
t
i
n
g
 
o
n
e
)




	
E
n
d
.


Check their length like lists:

In [None]:
len(string)

57

Check if they contain certain elements like lists:

In [None]:
'!' in string

True

In [None]:
'?' not in string

True

We can also check if a **substring** is present in a string:

In [None]:
'very' in string

True
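
Strings share more with lists than iteration and membership tests: they also support indexing and slicing. One difference worth knowing is that strings are immutable, so the string methods shown in this section return a new string instead of changing the original. A minimal sketch (the variable names are only illustrative):

```python
sample = 'This is a string!'

first_char = sample[0]     # indexing works like on a list
first_word = sample[:4]    # slicing returns a new string
last_char = sample[-1]     # negative indices count from the end

# Strings are immutable: item assignment raises a TypeError.
try:
    sample[0] = 't'
    assignment_worked = True
except TypeError:
    assignment_worked = False

# Methods such as lower() return a new string and leave the original untouched.
lowered = sample.lower()

print(first_char, first_word, last_char, assignment_worked)
print(sample, '|', lowered)
```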

**Capitalisation:**
There are different ways to manipulate the casing of strings:

In [None]:
'test'.upper()

'TEST'

In [None]:
'TEST'.lower()

'test'

In [None]:
'test'.capitalize()

'Test'

**Adding strings:**

In [None]:
result = 'a'+'b'
print(result)

ab


**Splitting strings:**

Often we need to split sentences into words or file paths into components. For this task we can use the `split()` function. By default a string is split at every run of whitespace (normal spaces, tabs `\t` or newlines `\n`).

In [None]:
string.split()

['This',
 'is',
 'a',
 'string!',
 '(But',
 'not',
 'a',
 'very',
 'interesting',
 'one)',
 'End.']

In [None]:
'path/to/file/image.jpg'.split('/')

['path', 'to', 'file', 'image.jpg']
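
The `split()` function also accepts an optional second argument that limits the number of splits. This is what the call `x.split('\n\n', 1)[1]` in the dataset section relies on to separate the header from the body. The sketch below uses a made-up mini-post (the address and the text are purely hypothetical) to show the difference:

```python
# Hypothetical mini-post: header lines, a blank line, then the body.
post = 'From: someone@example.com\nSubject: Test\n\nFirst paragraph.\n\nSecond paragraph.'

# Without a limit, every blank line is a split point.
all_parts = post.split('\n\n')

# With a limit of 1 only the first blank line is used, so the body stays intact.
header, body = post.split('\n\n', 1)

# partition() is a forgiving alternative: if the separator is missing,
# it returns the whole string followed by two empty strings.
head, sep, rest = 'no blank line here'.partition('\n\n')

print(len(all_parts))
print(repr(body))
print(repr(head), repr(sep), repr(rest))
```

Note that indexing the result with `[1]` raises an `IndexError` when a text contains no blank line at all, which is worth keeping in mind when applying the same trick to other datasets.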

**Stripping strings:**

Sometimes strings contain leading or trailing characters that we want to get rid of, such as whitespace or other unnecessary characters. We can remove them with the `strip()` function. Like `split()`, it works on whitespace by default, but we can pass any set of characters we want to remove:

In [None]:
'_path/to/file/image.jpg_'.strip('_')

'path/to/file/image.jpg'

In [None]:
'_-_path/to/file/image.jpg,_,'.strip(',_-')

'path/to/file/image.jpg'
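
Two details of `strip()` are easy to miss: the argument is treated as a set of characters rather than as a fixed prefix or suffix, and only the two ends of the string are trimmed. A short sketch with made-up strings:

```python
# The characters ',', '_' and '-' are removed from both ends in any order.
cleaned = '_-_path/to/my_file.jpg,_,'.strip(',_-')

# Only the ends are trimmed: the underscore inside 'my_file' survives.
# lstrip() and rstrip() trim one side only.
left_only = '__name__'.lstrip('_')
right_only = '__name__'.rstrip('_')

print(cleaned)
print(left_only, right_only)
```

To remove a character everywhere in a string, `replace()` is the better tool.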

**Replacing:**

With the `replace()` function one can replace substrings in a string.

In [None]:
'one plus one equals two!'.replace('two','three')

'one plus one equals three!'

**Joining strings**

Sometimes we split strings into a list of words for processing (like stemming or stop word removal) and then want to join them back into a single string. To do this we can use the `join()` function:

In [None]:
' '.join(['this', 'is', 'a', 'list', 'of', 'words'])

'this is a list of words'

In [None]:
'-'.join(['this', 'is', 'a', 'list', 'of', 'words'])

'this-is-a-list-of-words'

**Exercise 2:** Write a function that performs the following on `string_1` 
- split the string into words with spaces
- then strip the special character `/` from each word
- join the words back together with single spaces (`' '`)
- make the whole string lower-case

In [None]:
string_1 = 'This is a string!\n/(But not a very interesting one)/\n\n\tEnd.'
print(string_1)

This is a string!
/(But not a very interesting one)/

	End.


## Pre-processing

Now that we are armed with this arsenal of string processing tools, we can pre-process the texts in the dataset to bring them to a cleaner form.

One of the richest Python libraries to process texts is the Natural Language Toolkit (NLTK). To install it run the following command in your environment:
```bash
> pip install nltk
```


In [None]:
from nltk import word_tokenize
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer

Now we go through the following steps to clean-up the texts:
* Normalization
* Tokenization
* Remove Stop-Words
* Remove Non-Alphabetical Tokens
* Stemming

We'll do this on one text as an example and then build a function and apply it to all texts.

In [None]:
text = df.loc[0, 'text']
print(text)



I am sure some bashers of Pens fans are pretty confused about the lack
of any kind of posts about the recent Pens massacre of the Devils. Actually,
I am  bit puzzled too and a bit relieved. However, I am going to put an end
to non-PIttsburghers' relief with a bit of praise for the Pens. Man, they
are killing those Devils worse than I thought. Jagr just showed you why
he is much better than his regular season stats. He is also a lot
fo fun to watch in the playoffs. Bowman should let JAgr have a lot of
fun in the next couple of games since the Pens are going to beat the pulp out of Jersey anyway. I was very disappointed not to see the Islanders lose the final
regular season game.          PENS RULE!!!




### Normalize
This is the process of transforming the text to lower-case.

In [None]:
text = text.lower()
print(text)



i am sure some bashers of pens fans are pretty confused about the lack
of any kind of posts about the recent pens massacre of the devils. actually,
i am  bit puzzled too and a bit relieved. however, i am going to put an end
to non-pittsburghers' relief with a bit of praise for the pens. man, they
are killing those devils worse than i thought. jagr just showed you why
he is much better than his regular season stats. he is also a lot
fo fun to watch in the playoffs. bowman should let jagr have a lot of
fun in the next couple of games since the pens are going to beat the pulp out of jersey anyway. i was very disappointed not to see the islanders lose the final
regular season game.          pens rule!!!




### Tokenize
Now we split the text into words/tokens.

In [None]:
tokens = word_tokenize(text)
print(tokens)

['i', 'am', 'sure', 'some', 'bashers', 'of', 'pens', 'fans', 'are', 'pretty', 'confused', 'about', 'the', 'lack', 'of', 'any', 'kind', 'of', 'posts', 'about', 'the', 'recent', 'pens', 'massacre', 'of', 'the', 'devils', '.', 'actually', ',', 'i', 'am', 'bit', 'puzzled', 'too', 'and', 'a', 'bit', 'relieved', '.', 'however', ',', 'i', 'am', 'going', 'to', 'put', 'an', 'end', 'to', 'non-pittsburghers', "'", 'relief', 'with', 'a', 'bit', 'of', 'praise', 'for', 'the', 'pens', '.', 'man', ',', 'they', 'are', 'killing', 'those', 'devils', 'worse', 'than', 'i', 'thought', '.', 'jagr', 'just', 'showed', 'you', 'why', 'he', 'is', 'much', 'better', 'than', 'his', 'regular', 'season', 'stats', '.', 'he', 'is', 'also', 'a', 'lot', 'fo', 'fun', 'to', 'watch', 'in', 'the', 'playoffs', '.', 'bowman', 'should', 'let', 'jagr', 'have', 'a', 'lot', 'of', 'fun', 'in', 'the', 'next', 'couple', 'of', 'games', 'since', 'the', 'pens', 'are', 'going', 'to', 'beat', 'the', 'pulp', 'out', 'of', 'jersey', 'anyway',

### Stop Words
Next, we remove words that are too common and don't add to the content of sentences. These words are commonly called 'stop words'. NLTK provides a list of stop words:

In [None]:
stop_words = set(stopwords.words('english'))
print(stop_words)

{'will', 'itself', 'them', 'all', 'aren', "weren't", 'than', 'this', 'couldn', "shan't", "you'd", 'needn', 'wasn', 'myself', 'again', 'below', 'has', 'at', 'mightn', "it's", "needn't", "should've", 'while', "mustn't", 'ain', 'weren', 'shouldn', 'out', 'doesn', 'won', "hadn't", 'hadn', 'you', 'as', 'nor', 'so', 'for', 'once', 'such', 'they', 'my', 'being', "you'll", 'with', 'who', 'very', 'few', 'haven', "won't", 've', 'm', 'of', 'where', 'more', 'down', 'only', 'off', "mightn't", 'in', 'above', 'about', 'll', 'be', 'if', 'over', 'through', 'here', "doesn't", 'each', 'mustn', "hasn't", "you've", 'don', 'd', 'what', 'am', 'is', 'him', 'that', 'why', 'their', 'should', 'do', 'his', 'she', 'other', 'doing', "you're", 'from', 'under', "wouldn't", 'an', 'herself', 'most', 'a', "wasn't", 'her', 'we', 'ma', 'its', 'before', 'were', 'too', 'o', 'shan', 'wouldn', 'because', 'further', 'by', 'yours', 'after', 'some', 'been', 's', 'to', 'hasn', 'the', 'then', 'theirs', 't', 'just', 'those', "aren'

We keep only the words that are **not** in the list of stop words.

In [None]:
tokens = [i for i in tokens if i not in stop_words]
print(tokens)

['sure', 'bashers', 'pens', 'fans', 'pretty', 'confused', 'lack', 'kind', 'posts', 'recent', 'pens', 'massacre', 'devils', '.', 'actually', ',', 'bit', 'puzzled', 'bit', 'relieved', '.', 'however', ',', 'going', 'put', 'end', 'non-pittsburghers', "'", 'relief', 'bit', 'praise', 'pens', '.', 'man', ',', 'killing', 'devils', 'worse', 'thought', '.', 'jagr', 'showed', 'much', 'better', 'regular', 'season', 'stats', '.', 'also', 'lot', 'fo', 'fun', 'watch', 'playoffs', '.', 'bowman', 'let', 'jagr', 'lot', 'fun', 'next', 'couple', 'games', 'since', 'pens', 'going', 'beat', 'pulp', 'jersey', 'anyway', '.', 'disappointed', 'see', 'islanders', 'lose', 'final', 'regular', 'season', 'game', '.', 'pens', 'rule', '!', '!', '!']
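
The filtering step can be reproduced without NLTK on a toy example. The sketch below uses a tiny hand-picked stop-word set (an assumption for illustration, not NLTK's list). Storing the stop words in a `set`, as done above with `set(stopwords.words('english'))`, keeps the membership test fast even for long texts:

```python
# A tiny, hand-picked stop-word set for illustration only (not NLTK's list).
toy_stop_words = {'i', 'am', 'the', 'of', 'a', 'to'}

toy_tokens = ['i', 'am', 'going', 'to', 'praise', 'the', 'pens']

# Keep every token that is not a stop word; order and duplicates are preserved.
filtered = [token for token in toy_tokens if token not in toy_stop_words]

print(filtered)
```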


### Punctuation
We also want to get rid of all tokens that are not composed of letters (e.g. punctuation and numbers). We can check whether a word consists only of alphabetic characters with the `isalpha()` function and filter with it:

In [None]:
tokens = [i for i in tokens if i.isalpha()]
print(tokens)

['sure', 'bashers', 'pens', 'fans', 'pretty', 'confused', 'lack', 'kind', 'posts', 'recent', 'pens', 'massacre', 'devils', 'actually', 'bit', 'puzzled', 'bit', 'relieved', 'however', 'going', 'put', 'end', 'relief', 'bit', 'praise', 'pens', 'man', 'killing', 'devils', 'worse', 'thought', 'jagr', 'showed', 'much', 'better', 'regular', 'season', 'stats', 'also', 'lot', 'fo', 'fun', 'watch', 'playoffs', 'bowman', 'let', 'jagr', 'lot', 'fun', 'next', 'couple', 'games', 'since', 'pens', 'going', 'beat', 'pulp', 'jersey', 'anyway', 'disappointed', 'see', 'islanders', 'lose', 'final', 'regular', 'season', 'game', 'pens', 'rule']
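
Note that this filter is fairly strict: in the output above the token `'non-pittsburghers'` has disappeared as well, because the hyphen is not a letter. A small sketch with a few hand-picked example tokens shows which kinds of tokens pass the check:

```python
examples = ['pens', '!', '1993', 'non-pittsburghers', "don't", 'x11']

# isalpha() is True only if every character is a letter (and the string is not empty).
alpha_flags = {token: token.isalpha() for token in examples}

kept = [token for token in examples if token.isalpha()]

print(alpha_flags)
print(kept)
```

If hyphenated words or tokens containing digits matter for a task, a less strict filter would be needed.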


### Stemming
As a final step we want to trim the words to their stem. This drastically decreases the vocabulary size and maps different forms of the same word onto one token, e.g. plural/singular nouns or different forms of verbs. Note that a stem does not have to be a valid word itself:
* pen, pens --> pen
* game, games --> game
* go, going --> go
* pretty --> pretti

NLTK provides stemmers for several languages, since stemming is a **language-dependent process**:

In [None]:
print(SnowballStemmer.languages)

('arabic', 'danish', 'dutch', 'english', 'finnish', 'french', 'german', 'hungarian', 'italian', 'norwegian', 'porter', 'portuguese', 'romanian', 'russian', 'spanish', 'swedish')


Applied to the text sample this yields:

In [None]:
stemmer = SnowballStemmer("english")
tokens = [stemmer.stem(i) for i in tokens]
print(tokens)

['sure', 'basher', 'pen', 'fan', 'pretti', 'confus', 'lack', 'kind', 'post', 'recent', 'pen', 'massacr', 'devil', 'actual', 'bit', 'puzzl', 'bit', 'reliev', 'howev', 'go', 'put', 'end', 'relief', 'bit', 'prais', 'pen', 'man', 'kill', 'devil', 'wors', 'thought', 'jagr', 'show', 'much', 'better', 'regular', 'season', 'stat', 'also', 'lot', 'fo', 'fun', 'watch', 'playoff', 'bowman', 'let', 'jagr', 'lot', 'fun', 'next', 'coupl', 'game', 'sinc', 'pen', 'go', 'beat', 'pulp', 'jersey', 'anyway', 'disappoint', 'see', 'island', 'lose', 'final', 'regular', 'season', 'game', 'pen', 'rule']


**Exercise 3:** Put all preprocessing steps into a function `preprocessing(text)` that returns a list of clean tokens like above and apply it to all texts creating a new column `processed_text`.

Applying this function to the whole corpus takes a while (ca. 1min 30s on my machine). Sometimes it is handy to have a progress bar to see how well/fast we are doing. There is a cool library called `tqdm` which provides easy to use progress bars.

If you want to experiment with it, you can install it with:
```bash
> pip install tqdm
```

After importing you need to register it with pandas with `tqdm.pandas()`. If you then use `progress_apply` instead of `apply` you get a progress bar during the computation showing you the progress. If you have problems installing it or don't want to use it, just use the normal `apply` function.

In [None]:
from tqdm import tqdm
tqdm.pandas(desc="progress")

In [None]:
df['processed_text'] = df['text'].progress_apply(preprocessing)

progress: 100%|██████████| 18846/18846 [01:29<00:00, 209.89it/s]


In [None]:
df.head()

Unnamed: 0,filenames,text,target,target_name,processed_text
0,54367,\n\nI am sure some bashers of Pens fans are pr...,10,rec.sport.hockey,"[sure, basher, pen, fan, pretti, confus, lack,..."
1,60215,My brother is in the market for a high-perfo...,3,comp.sys.ibm.pc.hardware,"[brother, market, video, card, support, vesa, ..."
2,76120,"\n\n\n|>The student of ""regional killings"" ali...",17,talk.politics.mideast,"[student, region, kill, alia, davidian, davidi..."
3,60771,\nIn article <1993Apr19.034517.12820@julian.uw...,3,comp.sys.ibm.pc.hardware,"[articl, wlsmith, wayn, smith, write, articl, ..."
4,51882,\n1) I have an old Jasmine drive which I c...,4,comp.sys.mac.hardware,"[old, jasmin, drive, use, new, system, underst..."


## Text Encodings
Now that we have cleaned up and tokenized the text corpus, we are ready to encode the texts as vectors. In class we had a look at simple **one-hot encodings** that can be extended to count encodings and **TF-IDF encodings**.

Scikit-learn comes with functions to do both count and TF-IDF encodings on text. The interface is very similar to that of the classifiers, except that the `predict` step is replaced with `transform`:

```python
count_vectorizer = CountVectorizer(your_settings)
count_vectorizer.fit(your_dataset)
# transform expects a list (iterable) of documents, not a single string
vec = count_vectorizer.transform(['your_text'])
```

This creates a vectorizer that can transform texts to vectors.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

The scikit-learn vectorizers come with a set of rudimentary preprocessors and tokenizers by default, but since we have already done these steps we replace the defaults with an identity function that does nothing.

In [None]:
def identity(doc):
    return doc

We can also limit the number of words taken into account when building the vector. This limits the vector size and cuts off words that occur rarely.

**Example:** If you set `vocab_size=10000`, only the 10000 most frequent words are used to build the vector and all rarer words are excluded. The encoding vector then has a dimension of 10000.
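
The idea behind such a vocabulary limit can be sketched in a few lines of plain Python. This is only an illustration of the principle on a made-up toy corpus, not scikit-learn's implementation (for instance, ties between equally frequent words may be resolved differently there):

```python
from collections import Counter

toy_corpus = [
    ['buy', 'car', 'car'],
    ['buy', 'mustang'],
    ['sell', 'car'],
]

# Count how often each token occurs across the whole corpus.
term_counts = Counter(token for doc in toy_corpus for token in doc)

# Keep only the 2 most frequent tokens (hypothetical toy limit).
toy_vocab_size = 2
kept_terms = sorted(term for term, _ in term_counts.most_common(toy_vocab_size))

# Map each kept token to a vector index; all other tokens are ignored when encoding.
vocabulary = {term: index for index, term in enumerate(kept_terms)}

print(term_counts)
print(vocabulary)
```

Here `'mustang'` and `'sell'` occur only once and fall outside the vocabulary, so a document vector would have just two entries.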

For now we take all words (`vocab_size=None`):

In [None]:
vocab_size=None

In [None]:
tfidf_vec = TfidfVectorizer(analyzer='word', tokenizer=identity, preprocessor=identity, token_pattern=None, max_features=vocab_size)
count_vec = CountVectorizer(analyzer='word', tokenizer=identity, preprocessor=identity, token_pattern=None, max_features=vocab_size)

Let's test both vectorizers on a small, dummy dataset with **4 documents**:

In [None]:
corpus = [
    ['this','is','the','first','document','in','the','corpus'],
    ['this','document','is','the','second','document','in','the','corpus'],
    ['and','this','is','the','third','one','in','this','corpus'],
    ['is','this','the','first','document','in','this','corpus'],
]

Now we fit a count vectorizer to the data.

In [None]:
count_vec.fit(corpus)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1),
        preprocessor=<function identity at 0x116c73400>, stop_words=None,
        strip_accents=None, token_pattern=None,
        tokenizer=<function identity at 0x116c73400>, vocabulary=None)

Once a vectorizer is fitted, we can investigate its vocabulary. It is a dictionary that maps each word to the index of the vector entry it corresponds to. For example, the word `'this'` has index 10 and therefore corresponds to the 11th entry of the vector (we start counting at zero), and the word `'and'` corresponds to the first entry.

In [None]:
count_vec.vocabulary_

{'this': 10,
 'is': 5,
 'the': 8,
 'first': 3,
 'document': 2,
 'in': 4,
 'corpus': 1,
 'second': 7,
 'and': 0,
 'third': 9,
 'one': 6}

Now we can transform the corpus and get a list of vectors in the form of a matrix (each row corresponds to a document vector):

In [None]:
X = count_vec.transform(corpus)
print(X.toarray())

[[0 1 1 1 1 1 0 0 2 0 1]
 [0 1 2 0 1 1 0 1 2 0 1]
 [1 1 0 0 1 1 1 0 1 1 2]
 [0 1 1 1 1 1 0 0 1 0 2]]
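
To see that there is no magic involved, the count encoding of the dummy corpus can be rebuilt with the standard library alone. The sketch below is a simplified re-implementation for illustration (the helper `count_encode` is hypothetical and not part of scikit-learn); it sorts the vocabulary alphabetically, which gives the same indices as shown above:

```python
from collections import Counter

corpus = [
    ['this', 'is', 'the', 'first', 'document', 'in', 'the', 'corpus'],
    ['this', 'document', 'is', 'the', 'second', 'document', 'in', 'the', 'corpus'],
    ['and', 'this', 'is', 'the', 'third', 'one', 'in', 'this', 'corpus'],
    ['is', 'this', 'the', 'first', 'document', 'in', 'this', 'corpus'],
]

# Vocabulary: every unique token, sorted alphabetically, mapped to a column index.
terms = sorted({token for doc in corpus for token in doc})
vocabulary = {term: index for index, term in enumerate(terms)}

def count_encode(doc):
    """Return the count vector of one tokenized document."""
    counts = Counter(doc)
    return [counts[term] for term in terms]

count_matrix = [count_encode(doc) for doc in corpus]

for row in count_matrix:
    print(row)
```

The printed rows should match the matrix returned by `count_vec.transform(corpus)`.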


If we now do the same thing with the TF-IDF vectorizer we see that the output looks different:

In [None]:
tfidf_vec.fit(corpus)
X = tfidf_vec.transform(corpus)
print(X.toarray())

[[0.    0.291 0.356 0.44  0.291 0.291 0.    0.    0.583 0.    0.291]
 [0.    0.238 0.582 0.    0.238 0.238 0.    0.456 0.476 0.    0.238]
 [0.439 0.229 0.    0.    0.229 0.229 0.439 0.    0.229 0.439 0.459]
 [0.    0.291 0.356 0.44  0.291 0.291 0.    0.    0.291 0.    0.583]]


* The shape of the matrix is the same.
* Instead of integers (corresponding to counts) we have continuous values.
* Words that occur in multiple documents get lower scores than those appearing in fewer.
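
The numbers in the TF-IDF matrix can be traced by hand. The sketch below assumes scikit-learn's default settings, i.e. the smoothed inverse document frequency $\ln\frac{1+n}{1+\mathrm{df}(t)} + 1$ followed by L2 normalisation of each row, and recomputes the first row with the standard library only (the helper `tfidf_encode` is hypothetical):

```python
import math
from collections import Counter

corpus = [
    ['this', 'is', 'the', 'first', 'document', 'in', 'the', 'corpus'],
    ['this', 'document', 'is', 'the', 'second', 'document', 'in', 'the', 'corpus'],
    ['and', 'this', 'is', 'the', 'third', 'one', 'in', 'this', 'corpus'],
    ['is', 'this', 'the', 'first', 'document', 'in', 'this', 'corpus'],
]

terms = sorted({token for doc in corpus for token in doc})
n_docs = len(corpus)

# Document frequency: in how many documents does each term occur?
doc_freq = {term: sum(term in doc for doc in corpus) for term in terms}

# Smoothed inverse document frequency (assumed scikit-learn default formula).
idf = {term: math.log((1 + n_docs) / (1 + doc_freq[term])) + 1 for term in terms}

def tfidf_encode(doc):
    """Return the L2-normalised TF-IDF vector of one tokenized document."""
    counts = Counter(doc)
    weights = [counts[term] * idf[term] for term in terms]
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights]

first_row = [round(value, 3) for value in tfidf_encode(corpus[0])]
print(first_row)
```

Words such as `'corpus'` that appear in all four documents get an IDF of exactly 1, while `'first'` (two documents) is weighted up, which is why its entry is larger even though both words occur once in the first document.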

This should just illustrate how the count and TF-IDF vectorizers work. Now let's apply this to our dataset and create encodings with `50000` words:

In [None]:
vocab_size=50000
tfidf_vec = TfidfVectorizer(analyzer='word', tokenizer=identity, preprocessor=identity, token_pattern=None, max_features=vocab_size)
count_vec = CountVectorizer(analyzer='word', tokenizer=identity, preprocessor=identity, token_pattern=None, max_features=vocab_size)

In the example above we used the `fit` and `transform` functions separately. We can combine these two steps with the function `fit_transform`:

In [None]:
X_tfidf = tfidf_vec.fit_transform(df['processed_text'])

In [None]:
X_count = count_vec.fit_transform(df['processed_text'])

This yields a vocabulary with `50000` entries:

In [None]:
len(count_vec.vocabulary_)

50000

Looking at the shape of the returned matrix we see that it still has as many rows as the input but now has `50000` entries per row (the feature vector).

In [None]:
X_count.shape

(18846, 50000)

So we see that the texts were converted into vectors of size 50'000, the vocabulary limit we set with `vocab_size`. This feature space is **significantly larger** than what we saw so far: 50'000 vs. ~10-20 in the titanic dataset. This is one challenging aspect of NLP: a very large, yet sparse (most of the entries are zero) input matrix.

## Search Engine

We can use these encodings to build ourselves a rudimentary **search engine**. We will see that the TF-IDF encodings yield much better results than the count vectors.

In information retrieval jargon a question or term that is searched in a corpus of documents is called a `query`. Let's define an example query:

In [None]:
query = 'i want to buy a mustang'

We encode the query with both the count and the TF-IDF vectorizer. Note that we duplicate the query vector `n_documents` times. This is not strictly necessary, but it makes the comparison of the query vector with the documents easier, since both matrices then have the same shape.

In [None]:
tokens = preprocessing(query)
print('query tokens:', tokens)
n_documents = np.shape(X_count)[0]

query_vec_count = count_vec.transform([tokens]*n_documents)
query_vec_tfidf = tfidf_vec.transform([tokens]*n_documents)

print('query and corpus shapes:', np.shape(query_vec_tfidf), np.shape(X_count))

query tokens: ['want', 'buy', 'mustang']
query and corpus shapes: (18846, 50000) (18846, 50000)


To compare the encodings we use the `cosine_similarity`, which is a common choice for high-dimensional, sparse vectors:

In [None]:
from sklearn.metrics.pairwise import cosine_similarity
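
For reference, the cosine similarity of two vectors $a$ and $b$ is $\frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}$. The following dependency-free sketch (the function `cosine` and the toy vectors are made up for illustration) computes it for one query against several documents:

```python
import math

def cosine(a, b):
    """Cosine similarity of two non-zero vectors: dot product over the product of lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

query_vector = [1, 1, 0]
documents = [
    [1, 1, 0],    # same direction as the query
    [3, 3, 0],    # same direction, three times as long
    [0, 0, 2],    # no shared terms
    [1, 0, 0],    # shares one of the two query terms
]

# One query vector is compared against every document; no duplication needed.
similarities = [round(cosine(doc, query_vector), 3) for doc in documents]
print(similarities)
```

The sketch also shows that a single query vector is enough for the comparison. As far as I know, scikit-learn's `cosine_similarity` likewise accepts a query matrix with a single row, in which case the result has one column.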

Now we use this to calculate the similarity between each document and the query:

In [None]:
sim_count = cosine_similarity(X_count, query_vec_count)[:, 0]
sim_tfidf = cosine_similarity(X_tfidf, query_vec_tfidf)[:, 0]

Now we are almost done: we only need to find the index of the highest value in the similarity array with `np.argmax` and display the `text` in the dataframe at this position. We can display the best search result for both encodings:

In [None]:
print(df.loc[np.argmax(sim_count), 'text'])

I'm looking to buy a '92 Toyota Previa All-Trac with low miles.
If you are selling one, or want someone to buy out an existing lease,
please contact me by mail.

-- 
Will Estes		Internet: westes@netcom.com



In [None]:
print(df.loc[np.argmax(sim_tfidf), 'text'])

petebre@elof.iit.edu (BrentA. Peterson) writes:
>jmh@hopper.Virginia.EDU (Jeffrey Hoffmeister) writes:
>>jmm4h@Virginia.EDU ("The Bald Runner") writes:

>>>I just have got to remind all of you that this is it!  Yes,
>>>that's right, somtime this fall, Ford (the granddaddy of cars)
>>>will be introducing an all-new, mega-cool
>>>way-too-fast-for-Accord-drivers Mustang.  It's supposed to be
>>>100% streamlined, looking similar to the Mach III concept car
>>>Ford came out with around January.  I can't wait.  Anyone out
>>>there hear anything about it recently?

>>If everything I've read is correct, Ford is doing nothing but "re-
>>skinning" the existing Mustang, with MINOR suspension modifications.
>>And the pictures I've seen indicate they didn't do a very good job
>>of it.
>>The "new" mustang, is nothing but a re-cycle of a 20 year old car.

>gee.... is it 1999 already?
>Yes, it will still be on the fox program chasis, anything that will be differe
>nt on the new car as far as mechanica
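
A search engine usually returns more than one hit. As a possible extension, the sketch below ranks documents by their similarity and returns the indices of the best `k` results. The scores and texts are made-up stand-ins for `sim_tfidf` and the `text` column, and `top_k` is a hypothetical helper; with NumPy arrays, `np.argsort` could serve the same purpose.

```python
# Hypothetical similarity scores, one per document (stand-in for sim_tfidf).
toy_similarities = [0.05, 0.40, 0.00, 0.31, 0.12]
toy_texts = ['doc A', 'doc B', 'doc C', 'doc D', 'doc E']

def top_k(similarities, k):
    """Return the indices of the k highest scores, best first."""
    ranked = sorted(range(len(similarities)),
                    key=lambda index: similarities[index],
                    reverse=True)
    return ranked[:k]

best_three = top_k(toy_similarities, 3)
for rank, index in enumerate(best_three, start=1):
    print(rank, toy_texts[index], toy_similarities[index])
```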

**Exercise 4:** Can you explain why the count vectorizer chose the first text although it doesn't say anything about `'mustang'`? Why are longer documents generally penalized with the cosine-similarity score (_hint:_ look at the definition of cosine similarity)?