<img src = 'images/thinking.jpeg'></img>

#### Let's return to our scenario: You work for an international political consultancy that you suspect is a little sketchy, but you are giving it a few more weeks. Scott, a friendly coworker, is trying to sort through news articles to quickly separate political from non-political ones. He has heard you have some solid NLP chops and has asked for your help automating the task.

In [101]:
## Let's quickly preprocess our texts.
import string
text_list = [f"{letter}.txt" for letter in string.ascii_uppercase[:12]]

In [102]:
## Initial cleaning: punctuation, lowercase, stopwords, stem

In [103]:
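from nltk import regexp_tokenize
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords

# Build the stopword set once; calling stopwords.words() for every token is slow.
stop_words = set(stopwords.words('english'))

def clean_text(file_name):
    """Tokenize, lowercase, drop stopwords from, and stem one text file."""
    with open(f'text_examples/{file_name}', 'r') as read_file:
        text = read_file.read().replace('\n', ' ')

    # Runs of letters, optionally followed by an apostrophe suffix (e.g. "don't")
    pattern = "([a-zA-Z]+(?:'[a-z]+)?)"
    tokens = regexp_tokenize(text, pattern)

    tokens = [token.lower() for token in tokens]
    tokens = [token for token in tokens if token not in stop_words]

    p_stemmer = PorterStemmer()
    tokens = [p_stemmer.stem(token) for token in tokens]
    return tokens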

clean_token_list = [clean_text(text) for text in text_list]
    

In [104]:
from nltk import FreqDist
FreqDist(clean_token_list[3]).most_common(10)

[('arrest', 9),
 ('bnp', 4),
 ('commit', 4),
 ('man', 4),
 ('griffin', 3),
 ('polic', 3),
 ('follow', 3),
 ('documentari', 3),
 ('suspicion', 3),
 ('racial', 3)]

### Term Frequency (TF)

$\begin{align}
 tf_{i,j} = \dfrac{n_{i,j}}{\displaystyle \sum_k n_{k,j} }
\end{align} $

In [105]:
## TF for two documents

In [106]:
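# Create a set of all words across all documents
from collections import Counter

docs = clean_token_list[:2]
all_tokens = set(docs[0]).union(docs[1])

def vanilla_tf(doc, all_words):
    """
    One possible from-scratch term frequency: for each token in
    all_words, its count in doc divided by the length of doc.
    """
    counts = Counter(doc)
    tf = [counts[word] / len(doc) for word in all_words]
    return tf

term_f = []
for doc in docs:
    term_f.append(vanilla_tf(doc, all_tokens))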

    

### Inverse Document Frequency (IDF)

$\begin{align}
idf(w) = \log \dfrac{N}{df_w}
\end{align} $
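
A quick sanity check of the formula on two made-up token lists can help build intuition. This is only a sketch: `toy_docs` and `toy_idf` are hypothetical names, and the base-10 log is an assumption (a different base only rescales the values).

```python
import math

# Made-up token lists for illustration only.
toy_docs = [["polic", "arrest", "man"], ["polic", "elect", "vote"]]
n_docs = len(toy_docs)

def toy_idf(word):
    # Number of documents that contain the word at least once.
    doc_freq = sum(word in doc for doc in toy_docs)
    return math.log10(n_docs / doc_freq)

print(toy_idf("polic"))   # in both documents: log10(2/2) = 0.0
print(toy_idf("arrest"))  # in one document: log10(2/1)
```

A word that shows up in every document gets an IDF of 0, so it contributes nothing to TF-IDF no matter how often it occurs.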

In [107]:
# Now create inverse document frequencies
import math

def vanilla_idf(all_words, all_bows):
    
    """
    Calculate inverse document frequency from scratch
    for demonstration.
    
    parameters:
    all_words: A set of all tokens across the texts
    all_bows: a list of all tokens in each text.
    
    returns:
    an inverse document frequency list with indices 
    corresponding to the tokens in all_words.
    """
    
    idf= []
    
    for word in all_words:
        count = 0
        for doc in all_bows:
            if word in doc:
                count += 1
        idf.append(math.log10(len(all_bows)/count))
    
    
    
    return idf

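# Store the result under a new name so the function is not overwritten
idf_values = vanilla_idf(all_tokens, docs)

In [108]:
vanilla_tf_idf = []
for doc in term_f:
    doc_tf_idf = []
    # Multiply each term frequency by the IDF of the same token
    for tf_value, idf_value in zip(doc, idf_values):
        doc_tf_idf.append(tf_value * idf_value)
    vanilla_tf_idf.append(doc_tf_idf)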


In [127]:
import pandas as pd
vanilla_df = pd.DataFrame(vanilla_tf_idf)
vanilla_df.columns = all_tokens
vanilla_df.shape

(2, 201)

![](https://media.giphy.com/media/Rl9Yqavfj2Ula/giphy.gif)

In [233]:
text_labels = [1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0]

In [358]:
#Modeling
from sklearn.model_selection import train_test_split

clean_token_strings = []
for doc in clean_token_list:
    clean_token_strings.append((' ').join(doc))

X_train, X_test, y_train, y_test = train_test_split(clean_token_strings, text_labels, test_size = .3)

In [3]:
from sklearn.feature_extraction.text import TfidfVectorizer


The vectorizer returns our data as a sparse matrix, which is a matrix with far more zero values than nonzero values.
More about it here: https://docs.scipy.org/doc/scipy/reference/sparse.html
To make better sense of it, we can put it back into a DataFrame.
Caution: converting a sparse matrix to a dense array takes much more memory.

In [4]:
#CV


OK results, but we can't be very confident in them because of the small sample size. Sorry, Scott. Next time give me more to work with!
<img src = 'images/thinking.jpeg'></img>

In [262]:
## let's try another, larger dataset

In [285]:
from sklearn.datasets import fetch_20newsgroups
cats = ['rec.sport.baseball','rec.sport.hockey']
newsgroups_train = fetch_20newsgroups(subset='train',categories=cats)
newsgroups_test = fetch_20newsgroups(subset='test',categories=cats)

In [365]:
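from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, confusion_matrix

# Fit the vocabulary on the training posts only
cv = CountVectorizer(stop_words='english')
X_train = cv.fit_transform(newsgroups_train.data)
y_train = newsgroups_train.target
mnb = MultinomialNB()
mnb.fit(X_train, y_train)
y_hat = mnb.predict(cv.transform(newsgroups_test.data))
print(accuracy_score(newsgroups_test.target, y_hat))
confusion_matrix(newsgroups_test.target, y_hat)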

0.9748743718592965


array([[386,  11],
       [  9, 390]])

In [362]:
from sklearn.ensemble import RandomForestClassifier
cv = CountVectorizer(stop_words='english')
X_train = cv.fit_transform(newsgroups_train.data)
y_train = newsgroups_train.target
rf = RandomForestClassifier()
rf.fit(X_train, y_train )
y_hat = rf.predict(cv.transform(newsgroups_test.data))
print(accuracy_score(newsgroups_test.target, y_hat))
confusion_matrix(newsgroups_test.target, y_hat)

0.8793969849246231




array([[376,  21],
       [ 75, 324]])

In [330]:
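# coef_ was removed from naive Bayes models in scikit-learn 1.1;
# for a two-class model it held the class-1 row of feature_log_prob_.
mnb.feature_log_prob_[1:]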

array([[ -7.00778466,  -8.22312122, -11.0853221 , ..., -11.0853221 ,
        -11.77846928, -11.77846928]])

In [369]:
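import numpy as np
# get_feature_names() was removed in scikit-learn 1.2
feature_names = np.array(cv.get_feature_names_out())
# Indices of the 20 tokens with the highest log probability under class 1 (hockey)
feature_importances = np.argsort(mnb.feature_log_prob_[1])[-20:]

for idx in feature_importances:
    print(feature_names[idx])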

11
55
games
year
article
25
season
writes
university
nhl
10
play
game
organization
lines
subject
hockey
ca
team
edu


In [None]:
# Weird stuff going on in here. Numbers.
# What can we do?
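
One possible answer, sketched under assumptions: the numeric tokens can be dropped with a stricter `token_pattern`. The `toy_posts` below are made up for illustration, and the letters-only pattern is just one choice among several.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up posts for illustration only; not taken from the newsgroups data.
toy_posts = [
    "The Leafs won 11 games and lost 10 this season",
    "He hit 25 home runs in 55 games last year",
]

# The default pattern keeps any token of 2+ word characters, digits included.
default_cv = CountVectorizer(stop_words='english')
default_cv.fit(toy_posts)

# Assumed alternative: keep only tokens made of 2 or more letters.
alpha_cv = CountVectorizer(stop_words='english',
                           token_pattern=r'(?u)\b[a-zA-Z]{2,}\b')
alpha_cv.fit(toy_posts)

print(sorted(default_cv.vocabulary_))
print(sorted(alpha_cv.vocabulary_))
```

Tokens such as `subject`, `lines`, `organization`, and `edu` most likely come from the message headers. `fetch_20newsgroups` accepts `remove=('headers', 'footers', 'quotes')` to strip those parts, which is worth trying, though accuracy may drop once those strong signals are gone.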

# Cosine Similarity

We can tell how similar two documents are to one another, normalizing for length, by taking the cosine similarity of their vectors.

For count vectors, which have no negative entries, this number ranges from 0 to 1, with 0 meaning the documents share no words and 1 meaning they use the same words in the same proportions. A potential application of cosine similarity is a basic recommendation engine. If you wanted to recommend the articles most similar to a given article, you could take its cosine similarity with every other article and return the highest-scoring ones.

<img src="images/cosine-similarity.png">
<img src="images/better_cos_similarity.png">
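
To make the formula concrete, here is a from-scratch sketch using only the standard library: the dot product of two count vectors divided by the product of their lengths. The helpers `count_tokens` and `cosine` are hypothetical names, and the regex only approximates `CountVectorizer`'s default tokenization.

```python
import math
import re
from collections import Counter

def count_tokens(text):
    # Approximates CountVectorizer defaults: lowercase, tokens of 2+ word characters.
    return Counter(re.findall(r"\b\w\w+\b", text.lower()))

def cosine(a, b):
    # Dot product over shared tokens, divided by the two vector lengths.
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

burger_text = "I ate a burger at burger queen and it was very good."
short = count_tokens(burger_text)
doubled = count_tokens(burger_text + " " + burger_text)
other = count_tokens("I drove a racecar through your kitchen door")

print(cosine(short, doubled))  # same proportions, twice the length
print(cosine(short, other))    # no shared tokens
```

Repeating a document doubles every count but leaves the direction of the vector unchanged, which is what "normalizing for length" means here.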

In [10]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()
samples = ['I ate a burger at burger queen and it was very good.',
           'I ate a hot dog at burger prince and it was bad',
          'I drove a racecar through your kitchen door',
          'I ate a hot dog at burger king and it was bad. I ate a burger at burger queen and it was very good']

cv.fit(samples)
text_data = cv.transform(samples)

In [15]:
from sklearn.metrics.pairwise import cosine_similarity
## documents 0 and 3 share most of their words, so we expect a number close to 1
cosine_similarity(text_data[0],text_data[3])


array([[0.91413793]])