## Text Mining and NLP

## Part 2

### Situation:

Priya works in the Europe division of an international PR firm. The firm's largest client has offices in Ibiza, Madrid, and Las Palmas. She needs to keep her boss up to date on current events by providing a short weekly list of articles about political events in Spain. The problem is that reviewing articles on the BBC takes hours every week, and Priya is very busy! She wonders whether she could use text mining to automate the process and save time.

### **Goal**: to internalize the steps, challenges, and methodology of text mining
- explore text analysis by hand
- apply text mining steps in Jupyter with the Python library NLTK
- classify documents correctly

## Refresher on cleaning text
![gif](https://www.nyfa.edu/student-resources/wp-content/uploads/2014/10/furious-crazed-typing.gif)


In [1]:
from __future__ import print_function
import nltk
import sklearn

from nltk.collocations import *
from nltk import FreqDist, word_tokenize
import string, re
import urllib.request
from nltk.stem.snowball import SnowballStemmer

url_a = "https://raw.githubusercontent.com/aapeebles/text_examples/master/Text%20examples%20folder/A.txt"
url_b = "https://raw.githubusercontent.com/aapeebles/text_examples/master/Text%20examples%20folder/B.txt"
article_a = urllib.request.urlopen(url_a).read()
article_a_st = article_a.decode("utf-8")


In [2]:
# tokens
pattern = "([a-zA-Z]+(?:'[a-z]+)?)"
arta_tokens_raw = nltk.regexp_tokenize(article_a_st, pattern)

# lower case
arta_tokens = [i.lower() for i in arta_tokens_raw]

# stop words (run nltk.download("stopwords") once if the corpus is not installed)
from nltk.corpus import stopwords

stop_words = set(stopwords.words("english"))
arta_tokens_stopped = [w for w in arta_tokens if w not in stop_words]

# stem words
stemmer = SnowballStemmer("english")
arta_stemmed = [stemmer.stem(word) for word in arta_tokens_stopped]

In [3]:
# repeat w second article
article_b = urllib.request.urlopen(url_b).read()
article_b_st = article_b.decode("utf-8")
artb_tokens_raw = nltk.regexp_tokenize(article_b_st, pattern)
artb_tokens = [i.lower() for i in artb_tokens_raw]
artb_tokens_stopped = [w for w in artb_tokens if w not in stop_words]
artb_stemmed = [stemmer.stem(word) for word in artb_tokens_stopped]

### Document statistics

What's wrong with the table from yesterday? What does it not consider?


### Term Frequency (TF)

$\begin{align}
 tf_{i,j} = \dfrac{n_{i,j}}{\displaystyle \sum_k n_{k,j} }
\end{align} $
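
To make the formula concrete, here is a minimal sketch that divides each term's count by the total number of tokens in one document. The function name `term_frequencies` and the toy token list are made up for illustration; they are not part of the articles used in this notebook.

```python
from collections import Counter

def term_frequencies(tokens):
    """Return {term: count / total number of tokens} for one document."""
    counts = Counter(tokens)
    total = len(tokens)  # assumes the document has at least one token
    return {term: count / total for term, count in counts.items()}

# Hypothetical, already-cleaned document (lowercased, stop words removed, stemmed)
toy_doc = ["patent", "law", "patent", "draft"]
tf = term_frequencies(toy_doc)
print(tf)
```

Because every count is divided by the same total, the term frequencies of one document add up to 1.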

### Inverse Document Frequency (IDF)

$\begin{align}
idf_i = \log \dfrac{N}{df_i}
\end{align} $
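
A small sketch of the IDF idea, assuming a base-10 logarithm (the formula itself does not fix the base). The function name `inverse_document_frequency` and the two toy documents are hypothetical.

```python
import math

def inverse_document_frequency(docs, base=10):
    """Return {term: log(N / df)}, where df is the number of documents containing the term."""
    n_docs = len(docs)
    doc_freq = {}
    for tokens in docs:
        for term in set(tokens):  # count each document at most once per term
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return {term: math.log(n_docs / df, base) for term, df in doc_freq.items()}

# Hypothetical cleaned documents: "would" appears in both, the other terms in only one
toy_docs = [["patent", "law", "would"], ["deal", "canada", "would"]]
idf = inverse_document_frequency(toy_docs)
print(idf)
```

A term that appears in every document gets an IDF of zero, so it contributes nothing to the final score no matter how often it occurs.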

### TF-IDF score

$ \begin{align}
w_{i,j} = tf_{i,j} \times \log \dfrac{N}{df_i} \\
tf_{i,j} = \text{number of occurrences of } i \text{ in } j \\
df_i = \text{number of documents containing } i \\
N = \text{total number of documents}
\end{align} $
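
As a quick worked example, assume a stem that occurs once among 184 cleaned tokens of a document and appears in only one of $N = 2$ documents. Using the normalized term frequency and a base-10 logarithm (both are assumptions for this illustration):

$ \begin{align}
w_{i,j} = \dfrac{1}{184} \times \log_{10} \dfrac{2}{1} \approx 0.005435 \times 0.30103 \approx 0.001636
\end{align} $

If the same stem appeared in both documents, the logarithm would be $\log_{10}(2/2) = 0$ and the score would drop to zero.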


### The from scratch method
![homemade](https://media2.giphy.com/media/LBZcXdG0eVBdK/giphy.gif?cid=3640f6095c2d7bb2526a424a4d97117c)


Please go through the code and comment what each section does

In [4]:
# creating a set of words from both documents (without duplicates)
wordSet = set(arta_stemmed).union(set(artb_stemmed))

# creating dictionaries with the items in wordSet as the keys and zero as the values
wordDictA = dict.fromkeys(wordSet, 0) 
wordDictB = dict.fromkeys(wordSet, 0) 

# Adding +1 to each key for each instance of the word in article A
for word in arta_stemmed:
    wordDictA[word]+=1
    
# Adding +1 to each key for each instance of the word in article B
for word in artb_stemmed:
    wordDictB[word]+=1    

# creates a dictionary with words as the keys and the number of times the word appears in the doc divided by the 
# total number of words in the document
def computeTF(wordDict, bow):
    tfDict = {}
    bowCount = len(bow)
    for word, count in wordDict.items():
        tfDict[word] = count/float(bowCount)
    return tfDict

tfbowA = computeTF(wordDictA,arta_stemmed)
tfbowB = computeTF(wordDictB,artb_stemmed)

In [5]:
tfbowA

{'chanc': 0.005434782608695652,
 'innov': 0.005434782608695652,
 'deal': 0.0,
 'hold': 0.005434782608695652,
 'agreement': 0.0,
 'exampl': 0.005434782608695652,
 'draft': 0.016304347826086956,
 'law': 0.02717391304347826,
 'invent': 0.02717391304347826,
 'small': 0.010869565217391304,
 'controversi': 0.005434782608695652,
 'patent': 0.02717391304347826,
 'rewrit': 0.005434782608695652,
 'hurt': 0.005434782608695652,
 'use': 0.005434782608695652,
 'shop': 0.005434782608695652,
 'hit': 0.0,
 'ineffici': 0.005434782608695652,
 'us': 0.016304347826086956,
 'fail': 0.005434782608695652,
 'fund': 0.0,
 'legal': 0.016304347826086956,
 'lobbi': 0.005434782608695652,
 'suspend': 0.0,
 'hope': 0.0,
 'two': 0.010869565217391304,
 'serv': 0.005434782608695652,
 'secretari': 0.0,
 'canada': 0.0,
 'even': 0.005434782608695652,
 'begun': 0.0,
 'amazon': 0.005434782608695652,
 'bring': 0.005434782608695652,
 'hammer': 0.0,
 'europ': 0.005434782608695652,
 'support': 0.010869565217391304,
 'commiss': 0

In [6]:
import math

def computeIDF(docList):
    N = len(docList)

    # start every word at zero; all dictionaries in docList share the same keys
    idfDict = dict.fromkeys(docList[0].keys(), 0)

    # document frequency: count how many documents contain each word
    for doc in docList:
        for word, val in doc.items():
            if val > 0:
                idfDict[word] += 1

    # IDF: base-10 log of (number of documents / document frequency)
    for word, val in idfDict.items():
        idfDict[word] = math.log10(N / float(val))

    return idfDict

In [7]:
idfs = computeIDF([wordDictA, wordDictB])

In [10]:
idfs

{'chanc': 0.3010299956639812,
 'innov': 0.3010299956639812,
 'deal': 0.3010299956639812,
 'hold': 0.3010299956639812,
 'agreement': 0.3010299956639812,
 'exampl': 0.3010299956639812,
 'draft': 0.3010299956639812,
 'law': 0.3010299956639812,
 'invent': 0.3010299956639812,
 'small': 0.3010299956639812,
 'controversi': 0.3010299956639812,
 'patent': 0.3010299956639812,
 'rewrit': 0.3010299956639812,
 'hurt': 0.3010299956639812,
 'use': 0.3010299956639812,
 'shop': 0.3010299956639812,
 'hit': 0.3010299956639812,
 'ineffici': 0.3010299956639812,
 'us': 0.3010299956639812,
 'fail': 0.3010299956639812,
 'fund': 0.3010299956639812,
 'legal': 0.3010299956639812,
 'lobbi': 0.3010299956639812,
 'suspend': 0.3010299956639812,
 'hope': 0.3010299956639812,
 'two': 0.3010299956639812,
 'serv': 0.3010299956639812,
 'secretari': 0.3010299956639812,
 'canada': 0.3010299956639812,
 'even': 0.3010299956639812,
 'begun': 0.3010299956639812,
 'amazon': 0.3010299956639812,
 'bring': 0.3010299956639812,
 'ham

In [8]:
def computeTFIDF(tfBow, idfs):
    tfidf = {}

    # multiply each word's term frequency by its IDF
    for word, val in tfBow.items():
        tfidf[word] = val * idfs[word]
    return tfidf

In [9]:
tfidfBowA = computeTFIDF(tfbowA, idfs)
tfidfBowB = computeTFIDF(tfbowB, idfs)

In [18]:
import pandas as pd
df = pd.DataFrame([tfidfBowA, tfidfBowB])
df

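| | abstain | achiev | action | adopt | affair | affect | agre | agreement | also | amazon | ... | vocal | vote | wealthi | week | welcom | without | word | world | would | year |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.001636 | 0.001636 | 0.001636 | 0.001636 | 0.001636 | 0.0 | 0.0 | 0.0 | 0.0 | 0.001636 | ... | 0.001636 | 0.001636 | 0.0 | 0.0 | 0.001636 | 0.001636 | 0.001636 | 0.0 | 0.0 | 0.0 |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.002923 | 0.002923 | 0.002923 | 0.005845 | 0.0 | ... | 0.0 | 0.0 | 0.002923 | 0.002923 | 0.0 | 0.0 | 0.0 | 0.002923 | 0.0 | 0.002923 |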


## But yes, there is an easier way

![big deal](https://media0.giphy.com/media/xUA7aQOxkz00lvCAOQ/giphy.gif?cid=3640f6095c2d7c51772f47644d09cc8b)


In [19]:
# create a string again
cleaned_a = ' '.join(arta_stemmed)
cleaned_b = ' '.join(artb_stemmed)


from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer()
response = tfidf.fit_transform([cleaned_a, cleaned_b])

import pandas as pd
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
df = pd.DataFrame(response.toarray(), columns=tfidf.get_feature_names_out())
print(df)

    abstain    achiev    action     adopt    affair    affect      agre  \
0  0.053285  0.053285  0.053285  0.053285  0.053285  0.000000  0.000000   
1  0.000000  0.000000  0.000000  0.000000  0.000000  0.084167  0.084167   

   agreement      also    amazon  ...     vocal      vote   wealthi      week  \
0   0.000000  0.000000  0.053285  ...  0.053285  0.053285  0.000000  0.000000   
1   0.084167  0.168334  0.000000  ...  0.000000  0.000000  0.084167  0.084167   

     welcom   without      word     world     would      year  
0  0.053285  0.053285  0.053285  0.000000  0.113738  0.000000  
1  0.000000  0.000000  0.000000  0.084167  0.059885  0.084167  

[2 rows x 200 columns]
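
The numbers differ from the from-scratch table because `TfidfVectorizer`, with its default settings, weights terms differently: it multiplies raw counts by a smoothed natural-log IDF, ln((1 + N) / (1 + df)) + 1, and then scales each document vector to unit L2 length. The sketch below imitates that weighting in plain Python on a made-up toy corpus. The function name `sklearn_style_tfidf` is hypothetical, and the sketch skips scikit-learn's own tokenization, so treat it as an approximation rather than the library's implementation.

```python
import math
from collections import Counter

def sklearn_style_tfidf(docs):
    """Approximate TfidfVectorizer defaults: raw counts, smoothed ln IDF, L2 norm."""
    n_docs = len(docs)
    vocab = sorted(set(term for tokens in docs for term in tokens))
    # number of documents that contain each term
    doc_freq = {t: sum(1 for tokens in docs if t in tokens) for t in vocab}
    # smoothed IDF: never zero, even for terms found in every document
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}
    rows = []
    for tokens in docs:
        counts = Counter(tokens)
        weights = {t: counts[t] * idf[t] for t in vocab}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({t: (w / norm if norm else 0.0) for t, w in weights.items()})
    return rows

# Hypothetical cleaned documents
toy_docs = [["patent", "law", "would"], ["deal", "would"]]
rows = sklearn_style_tfidf(toy_docs)
for row in rows:
    print(row)
```

Note that a word shared by both documents, such as `would`, keeps a non-zero score here, whereas the unsmoothed from-scratch IDF sets it to zero.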


## Corpus Statistics 

How many non-zero elements are there?
- Adapt the code below: wherever it says `DATA`, use the `df` version of the `response` object
- Interpret the findings


In [26]:
import numpy as np

In [30]:
# Edit code before running it

newval = np.array(df)

# average number of non-zero tf-idf scores per article (row)
non_zero_vals = np.count_nonzero(newval) / float(df.shape[0])
print("Average Number of Non-Zero Elements in Vectorized Articles: {}".format(non_zero_vals))

# share of the vocabulary (columns) that is zero in an average article
percent_sparse = 1 - (non_zero_vals / float(df.shape[1]))
print('Percentage of columns containing 0: {}'.format(percent_sparse))

Average Number of Non-Zero Elements in Vectorized Articles: 103.5
Percentage of columns containing 0: 0.48250000000000004


### Next Steps:
- Create the tf-idf for the **whole** corpus of 12 articles
- What are _on average_ the most important words in the whole corpus?
- Add a column named "Target" to the dataset
- Set Target to 1 if the article is "Politics" and 0 if it is "Not Politics"
- Do some exploratory analysis of the dataset
  - What are, on average, the most important words for the "Politics" articles?
  - What are, on average, the most important words for the "Not Politics" articles?
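
One possible way to approach the exploratory questions above is to average each term's tf-idf score within each class and then rank the terms. The sketch below uses plain Python on hypothetical scores and labels. The function name `top_terms_by_class` and the data are made up, and the real corpus of 12 articles would need to be vectorized first.

```python
from collections import defaultdict

def top_terms_by_class(rows, targets, k=3):
    """Average each term's score within each class and return the k highest per class."""
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(int)
    for row, target in zip(rows, targets):
        counts[target] += 1
        for term, score in row.items():
            sums[target][term] += score
    top = {}
    for target, term_sums in sums.items():
        means = {term: total / counts[target] for term, total in term_sums.items()}
        top[target] = sorted(means, key=means.get, reverse=True)[:k]
    return top

# Hypothetical tf-idf rows (one dict per article) and Target labels (1 = Politics)
rows = [
    {"vote": 0.4, "law": 0.3, "goal": 0.0},
    {"vote": 0.2, "law": 0.5, "goal": 0.0},
    {"vote": 0.0, "law": 0.1, "goal": 0.6},
]
targets = [1, 1, 0]
top = top_terms_by_class(rows, targets, k=2)
print(top)
```

With the scores in a pandas DataFrame that has a "Target" column, `df.groupby("Target").mean()` should give the same per-class averages in one line.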

## Let's talk classification
- How would you split into train and test? What would be the dataset?

In [None]:
# Sample code (X is the tf-idf feature matrix and y is the Target column, once both exist)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
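
With only 12 articles, a purely random split can easily leave the test set with articles from just one class. One option worth discussing is a stratified split, which samples the same fraction from each class. The sketch below is a hand-rolled, plain-Python illustration. The function name `stratified_split`, the 6/6 label balance, and the 25% test fraction are all assumptions. In scikit-learn, passing `stratify=y` to `train_test_split` serves the same purpose.

```python
import math
import random

def stratified_split(labels, test_fraction=0.25, seed=0):
    """Return (train_indices, test_indices), sampling test_fraction of each class."""
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = math.ceil(len(indices) * test_fraction)  # round up within each class
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Hypothetical labels: 6 "Politics" (1) and 6 "Not Politics" (0) articles
labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
train_idx, test_idx = stratified_split(labels)
print("train:", train_idx)
print("test:", test_idx)
```

The returned indices can then be used to pick the matching rows of the tf-idf matrix and the Target column.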