## NLP first step, vectorization: words to numbers

We want to eventually train a machine learning algorithm to take in a headline and tell us how many upvotes it would receive. However, machine learning algorithms only understand numbers, not words. How do we translate our headlines into something an algorithm can understand?

The first step is to create something called a **bag of words** matrix. A bag of words matrix gives us a numerical representation of which words are in which headlines.

To construct a bag of words matrix, we first find the unique words across the whole set of headlines. Then, we set up a matrix where each row is a headline and each column is one of the unique words. Finally, we fill in each cell with the number of times that word occurred in that headline.

Because each headline uses only a small fraction of the full vocabulary, most cells in this matrix will be zero, unless the headlines share most of their words.

In [7]:
from collections import Counter
import pandas

text = [
    "Programming language Python is simple programming language",
    "Machine Learning with Python is simple",
    "Machine Learning with Java is common",
    "Java is Object Oriented Programming language",
    "Machine Learning with R is old"
]

In [8]:
# Find all the unique words in the text.
unique_words = list(set(" ".join(text).lower().split()))
def make_matrix(text, vocab):
    matrix = []
    for sentence in text:
        s = sentence.lower().split()
        # Count each word in the text, and make a dictionary.
        counter = Counter(s)
        # Turn the dictionary into a matrix row using the vocab.
        row = [counter.get(w, 0) for w in vocab]
        matrix.append(row)
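    # Label the columns with the vocabulary that was passed in,
    # rather than relying on the global unique_words list.
    df = pandas.DataFrame(matrix, columns=vocab)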
    return df

print(make_matrix(text, unique_words))


   oriented  java  language  python  is  object  programming  machine  simple  \
0         0     0         2       1   1       0            2        0       1   
1         0     0         0       1   1       0            0        1       1   
2         0     1         0       0   1       0            0        1       0   
3         1     1         1       0   1       1            1        0       0   
4         0     0         0       0   1       0            0        1       0   

   r  common  learning  old  with  
0  0       0         0    0     0  
1  0       0         1    0     1  
2  0       1         1    0     1  
3  0       0         0    0     0  
4  1       0         1    1     1  
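To make the point about zero-valued cells concrete, here is a minimal sketch that counts how many cells of the toy matrix are zero. It rebuilds the rows with plain Python lists instead of pandas; the names `rows`, `zero_cells`, and `sparsity` are illustrative and not part of the original notebook.

```python
from collections import Counter

# Same toy corpus as in the cells above.
text = [
    "Programming language Python is simple programming language",
    "Machine Learning with Python is simple",
    "Machine Learning with Java is common",
    "Java is Object Oriented Programming language",
    "Machine Learning with R is old",
]

# Sorted so that the column order is reproducible between runs.
vocab = sorted(set(" ".join(text).lower().split()))

# One row of word counts per sentence, one column per vocabulary word.
rows = []
for sentence in text:
    counter = Counter(sentence.lower().split())
    rows.append([counter.get(w, 0) for w in vocab])

total_cells = len(rows) * len(vocab)
zero_cells = sum(1 for row in rows for cell in row if cell == 0)
sparsity = zero_cells / float(total_cells)

print("%d of %d cells are zero (%.1f%%)" % (zero_cells, total_cells, 100 * sparsity))
```

Even on five short sentences, more than half of the cells are zero. With thousands of headlines the proportion is usually far higher, which is why libraries tend to store these matrices in a sparse format.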


### Removing Stopwords
Certain words don’t help us discriminate between good and bad headlines. Words such as "the", "a", and "also" occur so commonly in all contexts that they tell us little about whether a headline is good or not. They are about equally likely to appear in good and bad headlines.

By removing these words, known as stopwords, we can reduce the size of the matrix and make training an algorithm faster.

```python
with open("stopwords_en.txt", 'r') as f:
    stopwords = f.read().split("\n")
```



In [9]:
# Find all the unique words in the text.
unique_words = list(set(" ".join(text).lower().split()))

# remove stopwords
with open("stopwords_en.txt", 'r') as f:
    stopwords = f.read().split("\n")

unique_words = [w for w in unique_words if w not in stopwords]

def make_matrix(text, vocab):
    matrix = []
    for sentence in text:
        s = sentence.lower().split()
        # Count each word in the text, and make a dictionary.
        counter = Counter(s)
        # Turn the dictionary into a matrix row using the vocab.
        row = [counter.get(w, 0) for w in vocab]
        matrix.append(row)
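    # Label the columns with the vocabulary that was passed in,
    # rather than relying on the global unique_words list.
    df = pandas.DataFrame(matrix, columns=vocab)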
    return df

print(make_matrix(text, unique_words))

   oriented  java  language  python  object  programming  machine  simple  \
0         0     0         2       1       0            2        0       1   
1         0     0         0       1       0            0        1       1   
2         0     1         0       0       0            0        1       0   
3         1     1         1       0       1            1        0       0   
4         0     0         0       0       0            0        1       0   

   common  learning  
0       0         0  
1       0         1  
2       1         1  
3       0         0  
4       0         1  
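If `stopwords_en.txt` is not at hand, the same filtering step can be tried with an inline stopword collection. The sketch below is an assumption-laden stand-in: the `stopwords` set is hypothetical and was chosen only so that it removes the same four columns ("is", "with", "r", "old") as the output above. It uses a `set` rather than a list so that each membership check is a constant-time lookup.

```python
from collections import Counter

# Hypothetical stand-in for the contents of stopwords_en.txt.
stopwords = {"is", "with", "r", "old", "the", "a", "also"}

text = [
    "Programming language Python is simple programming language",
    "Machine Learning with Python is simple",
    "Machine Learning with Java is common",
    "Java is Object Oriented Programming language",
    "Machine Learning with R is old",
]

# Unique words minus stopwords, sorted for a stable column order.
all_words = set(" ".join(text).lower().split())
vocab = sorted(w for w in all_words if w not in stopwords)

def make_rows(text, vocab):
    """Return one list of word counts per sentence, ordered by vocab."""
    rows = []
    for sentence in text:
        counter = Counter(sentence.lower().split())
        rows.append([counter.get(w, 0) for w in vocab])
    return rows

matrix = make_rows(text, vocab)

print(vocab)
for row in matrix:
    print(row)
```

Because the rows are built from `vocab`, words that were filtered out never get a column, so no separate step is needed to drop them from the matrix.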


### Generating a matrix 

Now that we know the basics, we can make a bag of words matrix for the whole set of headlines.

We don’t want to have to code everything out manually every time, so we’ll use a class from scikit-learn to do it automatically. Using the vectorizers from scikit-learn to construct your bag of words matrices will make the process much easier and faster.

In [10]:
from sklearn.feature_extraction.text import CountVectorizer

# Construct a bag of words matrix.
# This will lowercase everything, and ignore all punctuation by default.
# It will also remove stop words.
vectorizer = CountVectorizer(lowercase=True, stop_words="english")

matrix = vectorizer.fit_transform(text)
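# Note: get_feature_names() was removed in scikit-learn 1.2;
# on newer versions use vectorizer.get_feature_names_out() instead.
vocab = vectorizer.get_feature_names()

# We created our bag of words matrix with far fewer commands.
print("Shape of matrix")
print(matrix.shape)

print("Terms/Vocabularies")
print(vocab)

print("Matrix of text")
print(matrix.todense())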



Shape of matrix
(5, 11)
Terms/Vocabularies
[u'common', u'java', u'language', u'learning', u'machine', u'object', u'old', u'oriented', u'programming', u'python', u'simple']
Matrix of text
[[0 0 2 0 0 0 0 0 2 1 1]
 [0 0 0 1 1 0 0 0 0 1 1]
 [1 1 0 1 1 0 0 0 0 0 0]
 [0 1 1 0 0 1 0 1 1 0 0]
 [0 0 0 1 1 0 1 0 0 0 0]]
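The vocabulary above has 11 terms: "old" is present, but "r" is not. The likely explanation is tokenization rather than stopword removal, since the default `token_pattern` documented for `CountVectorizer` is `(?u)\b\w\w+\b`, which only keeps tokens of two or more characters. The sketch below imitates that step with the standard `re` module; the two-word `stop_words` set is a hypothetical stand-in for scikit-learn's much longer English list, not the list itself.

```python
import re

# Default token pattern documented for CountVectorizer:
# runs of two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

# Assumption: a tiny stand-in covering only the stop words
# that actually occur in this toy corpus.
stop_words = {"is", "with"}

text = [
    "Programming language Python is simple programming language",
    "Machine Learning with Python is simple",
    "Machine Learning with Java is common",
    "Java is Object Oriented Programming language",
    "Machine Learning with R is old",
]

# Lowercase, tokenize, then drop stop words, one token list per document.
tokens_per_doc = [
    [t for t in TOKEN_PATTERN.findall(doc.lower()) if t not in stop_words]
    for doc in text
]

# The vocabulary is the sorted set of surviving tokens.
vocab = sorted(set(t for doc in tokens_per_doc for t in doc))
print(len(vocab))
print(vocab)
```

The single-character token "r" never matches the pattern, so it is dropped before stop words are even considered. If single-letter terms matter for your data, the `token_pattern` parameter of `CountVectorizer` can be changed.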


In [12]:
list(vocab).index(u'language')

2

### Comparing Documents

With each headline represented as a row of word counts, we can measure how different two headlines are. Here we use cosine distance: one minus the cosine of the angle between the two count vectors. A distance of 0 means the rows point in the same direction, and a distance of 1 means the two headlines share no vocabulary terms.

In [15]:
import numpy as np 
from sklearn.metrics.pairwise import cosine_similarity

dist = 1 - cosine_similarity(matrix)
np.round(dist, 2)


array([[ 0.  ,  0.68,  1.  ,  0.43,  1.  ],
       [ 0.68,  0.  ,  0.5 ,  1.  ,  0.42],
       [ 1.  ,  0.5 ,  0.  ,  0.78,  0.42],
       [ 0.43,  1.  ,  0.78,  0.  ,  1.  ],
       [ 1.  ,  0.42,  0.42,  1.  ,  0.  ]])

In [17]:
dist[1,2]

0.5
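
To see where the value 0.5 comes from, the distance between rows 1 and 2 can be worked out by hand. The sketch below copies those two rows from the matrix printed earlier and applies the cosine distance formula with the standard library only; `cosine_distance` is an illustrative helper, not a scikit-learn function, and it assumes neither vector is all zeros.

```python
import math

# Rows 1 and 2 of the CountVectorizer matrix printed earlier.
doc1 = [0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1]  # "Machine Learning with Python is simple"
doc2 = [1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0]  # "Machine Learning with Java is common"

def cosine_distance(a, b):
    """Return 1 - cos(angle between a and b); assumes non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

distance = cosine_distance(doc1, doc2)
print(round(distance, 2))
```

The two headlines share two terms ("learning" and "machine"), so the dot product is 2. Each vector has four ones, so each norm is 2, giving a cosine of 2 / (2 * 2) = 0.5 and a distance of 0.5.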