## Text Mining and NLP

## Part 2

### Situation:

Priya works in the Europe division of an international PR firm. The firm's largest client has offices in Ibiza, Madrid, and Las Palmas. She needs to keep her boss aware of current events and provide a weekly short list of articles concerning political events in Spain. The problem is that reviewing articles on the BBC takes hours every week, and Priya is very busy! She wonders whether she could automate this process with text mining to save time.

### **Goal**: to internalize the steps, challenges, and methodology of text mining
- explore text analysis by hand
- apply text mining steps in Jupyter with the Python library NLTK
- classify documents correctly

## Refresher on cleaning text
![gif](https://www.nyfa.edu/student-resources/wp-content/uploads/2014/10/furious-crazed-typing.gif)


In [21]:
from __future__ import print_function
import nltk
import sklearn

from nltk.collocations import *
from nltk import FreqDist, word_tokenize
import string, re
import urllib.request  # urlopen lives in the urllib.request submodule
from nltk.stem.snowball import SnowballStemmer

# One raw-text URL per article (A through L)
url_a = "https://raw.githubusercontent.com/aapeebles/text_examples/master/Text%20examples%20folder/A.txt"
url_b = "https://raw.githubusercontent.com/aapeebles/text_examples/master/Text%20examples%20folder/B.txt"
url_c = "https://raw.githubusercontent.com/aapeebles/text_examples/master/Text%20examples%20folder/C.txt"
url_d = "https://raw.githubusercontent.com/aapeebles/text_examples/master/Text%20examples%20folder/D.txt"
url_e = "https://raw.githubusercontent.com/aapeebles/text_examples/master/Text%20examples%20folder/E.txt"
url_f = "https://raw.githubusercontent.com/aapeebles/text_examples/master/Text%20examples%20folder/F.txt"
url_g = "https://raw.githubusercontent.com/aapeebles/text_examples/master/Text%20examples%20folder/G.txt"
url_h = "https://raw.githubusercontent.com/aapeebles/text_examples/master/Text%20examples%20folder/H.txt"
url_i = "https://raw.githubusercontent.com/aapeebles/text_examples/master/Text%20examples%20folder/I.txt"
url_j = "https://raw.githubusercontent.com/aapeebles/text_examples/master/Text%20examples%20folder/J.txt"
url_k = "https://raw.githubusercontent.com/aapeebles/text_examples/master/Text%20examples%20folder/K.txt"
url_l = "https://raw.githubusercontent.com/aapeebles/text_examples/master/Text%20examples%20folder/L.txt"

article_list = [url_a, url_b, url_c, url_d, url_e, url_f, url_g, url_h, url_i, url_j, url_k, url_l]

# Download article A and decode the bytes into a string
article_a = urllib.request.urlopen(url_a).read()
article_a_st = article_a.decode("utf-8")

In [23]:
def cleaning_code(article):
    # tokens: runs of letters, optionally followed by an apostrophe suffix
    pattern = "([a-zA-Z]+(?:'[a-z]+)?)"
    art_tokens_raw = nltk.regexp_tokenize(article, pattern)

    # lower case
    art_tokens = [i.lower() for i in art_tokens_raw]

    # stop words
    from nltk.corpus import stopwords

    stop_words = set(stopwords.words('english'))
    art_tokens_stopped = [w for w in art_tokens if w not in stop_words]

    # stem words
    stemmer = SnowballStemmer("english")
    art_stemmed = [stemmer.stem(word) for word in art_tokens_stopped]
    
    return art_stemmed

In [24]:
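cleaned_articles = []
# NOTE: article_list holds URL strings, so this loop cleans the URLs themselves
# (see the output below); download each article's text first to clean the articles.
for article in article_list:
    cleaned_articles.append(cleaning_code(article))
    
cleaned_articles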

[['https',
  'raw',
  'githubusercont',
  'com',
  'aapeebl',
  'text',
  'exampl',
  'master',
  'text',
  'exampl',
  'folder',
  'txt'],
 ['https',
  'raw',
  'githubusercont',
  'com',
  'aapeebl',
  'text',
  'exampl',
  'master',
  'text',
  'exampl',
  'folder',
  'b',
  'txt'],
 ['https',
  'raw',
  'githubusercont',
  'com',
  'aapeebl',
  'text',
  'exampl',
  'master',
  'text',
  'exampl',
  'folder',
  'c',
  'txt'],
 ['https',
  'raw',
  'githubusercont',
  'com',
  'aapeebl',
  'text',
  'exampl',
  'master',
  'text',
  'exampl',
  'folder',
  'txt'],
 ['https',
  'raw',
  'githubusercont',
  'com',
  'aapeebl',
  'text',
  'exampl',
  'master',
  'text',
  'exampl',
  'folder',
  'e',
  'txt'],
 ['https',
  'raw',
  'githubusercont',
  'com',
  'aapeebl',
  'text',
  'exampl',
  'master',
  'text',
  'exampl',
  'folder',
  'f',
  'txt'],
 ['https',
  'raw',
  'githubusercont',
  'com',
  'aapeebl',
  'text',
  'exampl',
  'master',
  'text',
  'exampl',
  'folder',
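
The output above consists of URL fragments (`https`, `githubusercont`, `txt`) because `article_list` holds URL strings and the loop hands those strings straight to the cleaner. A minimal sketch of keeping the download step separate from the cleaning step is shown here. The names `fetch_text`, `clean_from_urls`, and `toy_clean` are hypothetical and introduced only for illustration; `toy_clean` is a stand-in that skips stop-word removal and stemming so the sketch runs without NLTK.

```python
import re
import urllib.request


def fetch_text(url):
    """Download a URL and decode the body as UTF-8 (same calls the notebook uses)."""
    return urllib.request.urlopen(url).read().decode("utf-8")


def clean_from_urls(urls, clean, fetch=fetch_text):
    """Fetch each URL first, then pass the article text (not the URL) to `clean`."""
    return [clean(fetch(url)) for url in urls]


def toy_clean(text):
    """Hypothetical stand-in for cleaning_code: tokenize and lowercase only."""
    return [t.lower() for t in re.findall(r"[a-zA-Z]+(?:'[a-z]+)?", text)]


# Offline demonstration with a fake fetcher, so nothing is downloaded here.
fake_pages = {"http://example.com/A.txt": "Spain's parliament voted."}
demo = clean_from_urls(fake_pages, toy_clean, fetch=fake_pages.get)
print(demo)
```

In the notebook, the equivalent call would be `clean_from_urls(article_list, cleaning_code)`, assuming `cleaning_code` tokenizes the argument it is given.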
  

In [4]:
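# repeat with the second article; pattern, stop_words and stemmer are redefined
# here because they were local to cleaning_code
from nltk.corpus import stopwords
pattern = "([a-zA-Z]+(?:'[a-z]+)?)"
stop_words = set(stopwords.words('english'))
stemmer = SnowballStemmer("english")
article_b = urllib.request.urlopen(url_b).read()
article_b_st = article_b.decode("utf-8")
artb_tokens_raw = nltk.regexp_tokenize(article_b_st, pattern)
artb_tokens = [i.lower() for i in artb_tokens_raw]
artb_tokens_stopped = [w for w in artb_tokens if w not in stop_words]
artb_stemmed = [stemmer.stem(word) for word in artb_tokens_stopped]
# arta_stemmed (used below) is the same pipeline applied to article A
arta_stemmed = cleaning_code(article_a_st)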

### Document statistics

What's wrong with the table from yesterday? What does it not consider?


### Term Frequency (TF)

$\begin{align}
 tf_{i,j} = \dfrac{n_{i,j}}{\displaystyle \sum_k n_{k,j} }
\end{align} $

### Inverse Document Frequency (IDF)

$\begin{align}
idf_i = \log \dfrac{N}{df_i}
\end{align} $

### TF-IDF score

$ \begin{align}
w_{i,j} = tf_{i,j} \times \log \dfrac{N}{df_i} \\
tf_{i,j} = \text{occurrences of } i \text{ in } j \text{, as a share of all terms in } j \\
df_i = \text{number of documents containing } i \\
N = \text{total number of documents}
\end{align} $
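
To make the three formulas concrete, here is a small worked sketch on a toy corpus. The documents are hypothetical (they are not the BBC articles), and the base-10 logarithm is an assumption chosen to match the `math.log10` call used later in this notebook.

```python
import math
from collections import Counter

# Hypothetical toy corpus of already-cleaned tokens (not the BBC articles).
docs = [
    ["vote", "spain", "vote", "budget"],
    ["spain", "football", "goal", "goal"],
    ["spain", "budget", "tax", "tax"],
]

N = len(docs)
# df_i: number of documents that contain term i
df = Counter(term for doc in docs for term in set(doc))


def tf_idf(doc):
    """Return {term: tf * log10(N / df)} for one document."""
    counts = Counter(doc)
    total = len(doc)
    return {t: (n / total) * math.log10(N / df[t]) for t, n in counts.items()}


weights = [tf_idf(doc) for doc in docs]
for w in weights:
    print({t: round(v, 3) for t, v in w.items()})
```

A term that occurs in every document (`spain` here) gets a weight of zero because its idf is log(1) = 0, while a term concentrated in one document (`vote`) gets the largest weight in that document.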


### The from-scratch method
![homemade](https://media2.giphy.com/media/LBZcXdG0eVBdK/giphy.gif?cid=3640f6095c2d7bb2526a424a4d97117c)


Please go through the code and comment what each section does

In [5]:
# Creating a set of words from the documents without duplicates and dictionaries to count
wordSet = set(arta_stemmed).union(set(artb_stemmed))
wordDictA = dict.fromkeys(wordSet, 0) 
wordDictB = dict.fromkeys(wordSet, 0) 

# Counts words in A
for word in arta_stemmed:
    wordDictA[word]+=1
    
# Counts words in B
for word in artb_stemmed:
    wordDictB[word]+=1    

# Creates a function to return a ratio of each word to all the words in the article
def computeTF(wordDict, bow):
    tfDict = {}
    bowCount = len(bow)
    for word, count in wordDict.items():
        tfDict[word] = count/float(bowCount)
    return tfDict


tfbowA = computeTF(wordDictA,arta_stemmed)
tfbowB = computeTF(wordDictB,artb_stemmed)

In [6]:
tfbowA

{'chair': 0.0,
 'announc': 0.0,
 'financ': 0.0,
 'miss': 0.0,
 'submit': 0.005434782608695652,
 'action': 0.005434782608695652,
 'open': 0.005434782608695652,
 'draft': 0.016304347826086956,
 'one': 0.016304347826086956,
 'reach': 0.0,
 'fuller': 0.005434782608695652,
 'committe': 0.010869565217391304,
 'immens': 0.005434782608695652,
 'fund': 0.0,
 'permit': 0.005434782608695652,
 'innov': 0.005434782608695652,
 'europ': 0.005434782608695652,
 'issu': 0.005434782608695652,
 'put': 0.005434782608695652,
 'week': 0.0,
 'wealthi': 0.0,
 'gain': 0.005434782608695652,
 'internet': 0.005434782608695652,
 'repay': 0.0,
 'mr': 0.0,
 'biggest': 0.0,
 'controversi': 0.005434782608695652,
 'similar': 0.005434782608695652,
 'program': 0.005434782608695652,
 'sourc': 0.005434782608695652,
 'ineffici': 0.005434782608695652,
 'thursday': 0.0,
 'model': 0.005434782608695652,
 'welcom': 0.005434782608695652,
 'protect': 0.010869565217391304,
 'larg': 0.005434782608695652,
 'germani': 0.0,
 'mep': 0.01

In [7]:
def computeIDF(docList):
    '''Compute the inverse document frequency of each word across docList.
    docList: list of {word: count} dictionaries that share the same keys
    returns: dictionary mapping each word to its IDF
    '''
    import math
    N = len(docList)
    # Start every word's document count at zero
    idfDict = dict.fromkeys(docList[0].keys(), 0)
    # For each document, add one to a word's count if the word occurs in it
    for doc in docList:
        for word, val in doc.items():
            if val > 0:
                idfDict[word] += 1
    # Replace each document count with log10(N / document count)
    for word, val in idfDict.items():
        idfDict[word] = math.log10(N / float(val))
        
    return idfDict

In [8]:
idfs = computeIDF([wordDictA, wordDictB])

In [9]:
def computeTFIDF(tfBow, idfs):
    '''Multiply each word's term frequency by its IDF'''
    tfidf = {} # creates empty dictionary
    for word, val in tfBow.items(): # loops over the words (keys) and term frequencies (values) in tfBow
        tfidf[word] = val*idfs[word] # for each word in tfBow, the value is multiplied by the idf for that word
                                     # The word and resulting product are then added to the dictionary tfidf
    return tfidf

In [10]:
tfidfBowA = computeTFIDF(tfbowA, idfs)
tfidfBowB = computeTFIDF(tfbowB, idfs)

In [11]:
import pandas as pd
pd.DataFrame([tfidfBowA, tfidfBowB])

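| | abstain | achiev | action | adopt | affair | affect | agre | agreement | also | amazon | ... | vocal | vote | wealthi | week | welcom | without | word | world | would | year |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.001636 | 0.001636 | 0.001636 | 0.001636 | 0.001636 | 0.0 | 0.0 | 0.0 | 0.0 | 0.001636 | ... | 0.001636 | 0.001636 | 0.0 | 0.0 | 0.001636 | 0.001636 | 0.001636 | 0.0 | 0.0 | 0.0 |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.002923 | 0.002923 | 0.002923 | 0.005845 | 0.0 | ... | 0.0 | 0.0 | 0.002923 | 0.002923 | 0.0 | 0.0 | 0.0 | 0.002923 | 0.0 | 0.002923 |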


## But yes, there is an easier way

![big deal](https://media0.giphy.com/media/xUA7aQOxkz00lvCAOQ/giphy.gif?cid=3640f6095c2d7c51772f47644d09cc8b)


In [12]:
# create a string again
cleaned_a = ' '.join(arta_stemmed)
cleaned_b = ' '.join(artb_stemmed)


from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer()
response = tfidf.fit_transform([cleaned_a, cleaned_b])

import pandas as pd
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
df = pd.DataFrame(response.toarray(), columns=tfidf.get_feature_names_out())
print(df)

    abstain    achiev    action     adopt    affair    affect      agre  \
0  0.053285  0.053285  0.053285  0.053285  0.053285  0.000000  0.000000   
1  0.000000  0.000000  0.000000  0.000000  0.000000  0.084167  0.084167   

   agreement      also    amazon    ...        vocal      vote   wealthi  \
0   0.000000  0.000000  0.053285    ...     0.053285  0.053285  0.000000   
1   0.084167  0.168334  0.000000    ...     0.000000  0.000000  0.084167   

       week    welcom   without      word     world     would      year  
0  0.000000  0.053285  0.053285  0.053285  0.000000  0.113738  0.000000  
1  0.084167  0.000000  0.000000  0.000000  0.084167  0.059885  0.084167  

[2 rows x 200 columns]
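
The scikit-learn scores differ from the from-scratch ones (for example 0.053285 versus 0.001636 for `abstain`) because `TfidfVectorizer` uses different defaults: raw counts for tf, a smoothed natural-log idf of the form `ln((1 + N) / (1 + df)) + 1`, and a final step that scales each row to unit Euclidean length. The sketch below reproduces that recipe in plain Python on hypothetical toy documents; it illustrates the default formula and is not scikit-learn's actual implementation.

```python
import math
from collections import Counter

# Hypothetical toy documents; each is a list of cleaned tokens.
docs = [["would", "would", "would", "vote"], ["would", "week"]]

N = len(docs)
vocab = sorted({t for doc in docs for t in doc})
df = Counter(t for doc in docs for t in set(doc))

# Smoothed idf as used by TfidfVectorizer's defaults (smooth_idf=True).
idf = {t: math.log((1 + N) / (1 + df[t])) + 1 for t in vocab}


def tfidf_row(doc):
    """Raw count times idf, then L2-normalize the row (norm='l2')."""
    counts = Counter(doc)
    raw = [counts[t] * idf[t] for t in vocab]
    length = math.sqrt(sum(v * v for v in raw))
    return [v / length for v in raw]


rows = [tfidf_row(doc) for doc in docs]
for row in rows:
    print(dict(zip(vocab, (round(v, 4) for v in row))))
```

Because of the `+ 1` in the idf, a word shared by both documents (`would`) keeps a non-zero score here, whereas the from-scratch version assigns it zero.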


## Corpus Statistics 

How many non-zero elements are there?
- Adapt the code below, using the `df` version of the `response` object wherever the template says `DATA`
- Interpret the findings


In [18]:
# Edit code before running it
import numpy as np

newval = np.array(df)

non_zero_vals = np.count_nonzero(newval) / float(df.shape[0])
print("Average Number of Non-Zero Elements in Vectorized Articles: {}".format(non_zero_vals))

percent_sparse = 1 - (non_zero_vals / float(df.shape[1]))  # non_zero_cols was undefined
print('Percentage of columns containing 0: {}'.format(percent_sparse))

Average Number of Non-Zero Elements in Vectorized Articles: 103.5
Percentage of columns containing 0: 0.48250000000000004
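
The two printed numbers are tied together: 103.5 non-zero entries per article out of 200 columns gives 1 - 103.5/200 = 0.4825, so the second figure is the fraction of a typical row that is zero. A minimal sketch of the same calculation on a hypothetical 2 x 5 matrix, written without NumPy so each step is visible:

```python
# Hypothetical 2 x 5 tf-idf matrix standing in for the 2 x 200 `df`.
matrix = [
    [0.5, 0.0, 0.3, 0.0, 0.1],
    [0.0, 0.0, 0.7, 0.2, 0.0],
]

n_rows = len(matrix)
n_cols = len(matrix[0])

# Average count of non-zero entries per document (row).
avg_non_zero = sum(1 for row in matrix for v in row if v != 0) / n_rows

# Fraction of a typical row that is zero.
fraction_zero = 1 - avg_non_zero / n_cols

print(avg_non_zero, fraction_zero)
```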


### Next Steps:
- Create the tf-idf for the **whole** corpus of 12 articles
- What are _on average_ the most important words in the whole corpus?
- Add a column named "Target" to the dataset
- Target will be set to 1 if the article is "Politics" and 0 if it is "Not Politics"
- Do some exploratory analysis of the dataset
  - What are, on average, the most important words for the "Politics" articles?
  - What are, on average, the most important words for the "Not Politics" articles?
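
One way to approach the per-class questions above is to average each term's tf-idf score within each `Target` value and then rank the terms. The sketch below does this with plain dictionaries; the rows, labels, and the helper names `mean_by_class` and `top_terms` are hypothetical and only illustrate the idea.

```python
from collections import defaultdict

# Hypothetical tf-idf rows and labels; 1 = "Politics", 0 = "Not Politics".
rows = [
    {"vote": 0.4, "budget": 0.2, "goal": 0.0},
    {"vote": 0.3, "budget": 0.0, "goal": 0.1},
    {"vote": 0.0, "budget": 0.0, "goal": 0.6},
]
target = [1, 1, 0]


def mean_by_class(rows, target):
    """Average each term's tf-idf score within each Target value."""
    sums = defaultdict(lambda: defaultdict(float))
    sizes = defaultdict(int)
    for row, label in zip(rows, target):
        sizes[label] += 1
        for term, score in row.items():
            sums[label][term] += score
    return {
        label: {term: total / sizes[label] for term, total in terms.items()}
        for label, terms in sums.items()
    }


def top_terms(scores, n=2):
    """Terms sorted by descending average score."""
    return sorted(scores, key=scores.get, reverse=True)[:n]


means = mean_by_class(rows, target)
print(top_terms(means[1]), top_terms(means[0]))
```

With the tf-idf scores in a pandas DataFrame that has a `Target` column, `df.groupby("Target").mean()` performs the same per-class averaging.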

## Let's talk classification
- How would you split into train and test? What would be the dataset?

In [None]:
# Sample code
from sklearn.model_selection import train_test_split  
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)  
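
One plausible answer to the question above is that `X` holds the tf-idf features for the 12 articles and `y` holds the `Target` column. With so few articles, a purely random split can leave one class out of the test set, so it may help to sample each class separately. The sketch below is a hand-rolled illustration of that idea: `stratified_split`, the labels, and the 25% test fraction are all assumptions made for this example.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.25, seed=0):
    """Return (train_idx, test_idx), sampling each class separately."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Keep at least one test example per class.
        n_test = max(1, round(len(indices) * test_fraction))
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Hypothetical labels for 12 articles: 1 = "Politics", 0 = "Not Politics".
y = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
train_idx, test_idx = stratified_split(y)
print(train_idx, test_idx)
```

scikit-learn's `train_test_split` accepts a `stratify` argument (for example `stratify=y`) that serves the same purpose.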