## Text Mining and NLP

## Part 2

### Situation:

Priya works in the Europe division of an international PR firm. The firm's largest client has offices in Ibiza, Madrid, and Las Palmas. She needs to keep her boss aware of current events and provide a short weekly list of articles on political events in Spain. The problem is that reviewing articles on the BBC takes hours every week, and Priya is very busy! She wonders whether text mining could automate the process and save her time.

### **Goal**: to internalize the steps, challenges, and methodology of text mining
- explore text analysis by hand
- apply text mining steps in Jupyter with the Python library NLTK
- classify documents correctly

## Refresher on cleaning text
![gif](https://www.nyfa.edu/student-resources/wp-content/uploads/2014/10/furious-crazed-typing.gif)


In [2]:
import string, re
import urllib

import sklearn
from nltk import FreqDist, word_tokenize, regexp_tokenize
from nltk.collocations import *
from nltk.corpus import stopwords  # requires nltk.download('stopwords') once
from nltk.stem.snowball import SnowballStemmer

# read each article as UTF-8 text
with open("examples/A.txt", encoding="utf-8") as f:
    article_a_st = f.read()
with open("examples/B.txt", encoding="utf-8") as f:
    article_b_st = f.read()

In [3]:
# tokenize: keep alphabetic words, allowing an internal apostrophe (e.g. "don't")
pattern = "([a-zA-Z]+(?:'[a-z]+)?)"
arta_tokens_raw = regexp_tokenize(article_a_st, pattern)

# lower case
arta_tokens = [i.lower() for i in arta_tokens_raw]

# remove stop words
stop_words = set(stopwords.words('english'))
arta_tokens_stopped = [w for w in arta_tokens if w not in stop_words]

# stem words
stemmer = SnowballStemmer("english")
arta_stemmed = [stemmer.stem(word) for word in arta_tokens_stopped]

In [4]:
# repeat with the second article
artb_tokens_raw = regexp_tokenize(article_b_st, pattern)
artb_tokens = [i.lower() for i in artb_tokens_raw]
artb_tokens_stopped = [w for w in artb_tokens if w not in stop_words]
artb_stemmed = [stemmer.stem(word) for word in artb_tokens_stopped]
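
Because the same four steps run on both articles, they can be wrapped in one helper. The sketch below is one way to do that. `clean_text` and the tiny `DEMO_STOP_WORDS` set are hypothetical names, used here so the example runs without downloading the NLTK stop word corpus; in the notebook you would pass `set(stopwords.words('english'))` instead.

```python
from nltk import regexp_tokenize
from nltk.stem.snowball import SnowballStemmer

# Hypothetical, tiny stop word list so the sketch needs no corpus download.
DEMO_STOP_WORDS = {"the", "a", "is", "are", "in", "on", "and", "to", "of"}

PATTERN = "([a-zA-Z]+(?:'[a-z]+)?)"


def clean_text(text, stop_words=DEMO_STOP_WORDS):
    """Tokenize, lower-case, drop stop words, and stem a raw string."""
    stemmer = SnowballStemmer("english")
    tokens = [tok.lower() for tok in regexp_tokenize(text, PATTERN)]
    kept = [tok for tok in tokens if tok not in stop_words]
    return [stemmer.stem(tok) for tok in kept]


result = clean_text("The voters are voting in the elections")
print(result)
```

With a helper like this, each article needs only one call, and any change to the cleaning steps applies to every document at once.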

### TF-IDF score

$$
\begin{aligned}
w_{i,j} &= tf_{i,j} \times \log \dfrac{N}{df_i} \\
tf_{i,j} &= \text{number of occurrences of term } i \text{ in document } j \\
df_i &= \text{number of documents containing term } i \\
N &= \text{total number of documents}
\end{aligned}
$$
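
To make the formula concrete, here is a small hand calculation in plain Python. The two token lists are made-up data, and the helper names `tf`, `df`, and `tfidf` are hypothetical. Note that scikit-learn's `TfidfVectorizer` uses a smoothed idf and L2-normalizes each row by default, so its numbers will not match this textbook version exactly.

```python
import math

# Two tiny, already-cleaned "documents" (hypothetical example data).
docs = [
    ["vote", "catalonia", "vote", "spain"],
    ["amazon", "spain", "trade"],
]
N = len(docs)  # total number of documents


def tf(term, doc):
    """Number of occurrences of `term` in `doc`."""
    return doc.count(term)


def df(term):
    """Number of documents that contain `term`."""
    return sum(1 for doc in docs if term in doc)


def tfidf(term, doc):
    """w_{i,j} = tf_{i,j} * log(N / df_i), using the natural log."""
    return tf(term, doc) * math.log(N / df(term))


w_vote = tfidf("vote", docs[0])    # 2 * log(2 / 1)
w_spain = tfidf("spain", docs[0])  # 1 * log(2 / 2) = 0
print(w_vote, w_spain)
```

A term that appears in every document ("spain") gets a weight of zero, while a term concentrated in one document ("vote") gets a high weight.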


In [5]:
# create a string again
cleaned_a = ' '.join(arta_stemmed)
cleaned_b = ' '.join(artb_stemmed)


from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer()
response = tfidf.fit_transform([cleaned_a, cleaned_b])

import pandas as pd
df = pd.DataFrame(response.toarray(), columns=tfidf.get_feature_names_out())  # get_feature_names() was removed in scikit-learn 1.2
print(df)

    abstain    achiev    action     adopt    affair    affect      agre  \
0  0.053285  0.053285  0.053285  0.053285  0.053285  0.000000  0.000000   
1  0.000000  0.000000  0.000000  0.000000  0.000000  0.084167  0.084167   

   agreement      also    amazon  ...     vocal      vote   wealthi      week  \
0   0.000000  0.000000  0.053285  ...  0.053285  0.053285  0.000000  0.000000   
1   0.084167  0.168334  0.000000  ...  0.000000  0.000000  0.084167  0.084167   

     welcom   without      word     world     would      year  
0  0.053285  0.053285  0.053285  0.000000  0.113738  0.000000  
1  0.000000  0.000000  0.000000  0.084167  0.059885  0.084167  

[2 rows x 200 columns]


## Corpus Statistics 

How many non-zero elements are there?
- Adapt the code below, replacing `DATA` wherever it appears with `df`, the DataFrame version of the `response` object
- Interpret the findings


In [6]:
# Edit code before running it
import numpy as np

newval = np.array(df)

# average count of non-zero entries per article (row)
non_zero_vals = np.count_nonzero(newval) / float(df.shape[0])
print("Average Number of Non-Zero Elements in Vectorized Articles: {}".format(non_zero_vals))

# fraction (not percent) of the entries in an average row that are zero
percent_sparse = 1 - (non_zero_vals / float(df.shape[1]))
print('Percentage of columns containing 0: {}'.format(percent_sparse))

Average Number of Non-Zero Elements in Vectorized Articles: 103.5
Percentage of columns containing 0: 0.48250000000000004
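
One way to read these numbers: the two rows hold 207 non-zero entries between them (2 × 103.5), but there are only 200 columns. Assuming every column is non-zero in at least one row (the vocabulary was built from these two articles), 7 stemmed terms must appear in both. The sketch below runs the same calculation on a small made-up matrix (`toy` is hypothetical data, not the article vectors) and counts the shared terms directly.

```python
import numpy as np

# Hypothetical 2-document TF-IDF matrix with 5 vocabulary terms.
toy = np.array([
    [0.5, 0.0, 0.3, 0.0, 0.8],
    [0.0, 0.6, 0.3, 0.7, 0.0],
])

nonzero_total = np.count_nonzero(toy)
avg_nonzero_per_doc = nonzero_total / toy.shape[0]
fraction_zero = 1 - nonzero_total / toy.size

# A column that is non-zero in every row is a term shared by all documents.
shared_terms = int(np.all(toy != 0, axis=0).sum())

print(avg_nonzero_per_doc, fraction_zero, shared_terms)
```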


## Basic Machine Learning NLP Pipeline Example

Now that we've gone over the basics of NLP data, we can take a look at an example of how a pipeline might work.

In [8]:
from sklearn.datasets import fetch_20newsgroups
cats = ['rec.sport.baseball','rec.sport.hockey']
newsgroups_train = fetch_20newsgroups(subset='train',categories=cats)
newsgroups_test = fetch_20newsgroups(subset='test',categories=cats)

Downloading 20news dataset. This may take a few minutes.
Downloading dataset from https://ndownloader.figshare.com/files/5975967 (14 MB)


In [15]:
print(newsgroups_train.data[2])
print(newsgroups_train.target[2])

From: rudy@netcom.com (Rudy Wade)
Subject: Re: YANKKES 1 GAME CLOSER
Article-I.D.: netcom.rudyC52rBD.86w
Organization: Home of the Brave
Lines: 18

My god, hope we don't have to put up with this kind of junk all season!

In article <002251w.5.734117130@axe.acadiau.ca> 002251w@axe.acadiau.ca (JASON WALTER WORKS) writes:
>    The N.Y.Yankees, are now one game closer to the A.L.East pennant.  They 
>clobbered Cleveland, 9-1, on a fine pitching performance by Key, and two 
>homeruns by Tartabull(first M.L.baseball to go out this season), and a three 

How many home runs by Tartabull?  Just 1, right, you must be thinking
of Dean Palmer or Juan Gonzalez (both of Texas) who each had 2 homers.

>run homer by Nokes.  For all of you who didn't pick Boggs in your pools, 
>tough break, he had a couple hits, and drove in a couple runs(with many more 

I don't know how many to follow, but he was 1 for 4.

> GO YANKS., Mattingly for g.glove, and MVP, and Abbot for Cy Young.

Spare us, please!

0


In [17]:
from sklearn.feature_extraction.text import CountVectorizer
bow = CountVectorizer(stop_words='english')
X_train = bow.fit_transform(newsgroups_train.data)
y_train = newsgroups_train.target

In [18]:
X_train.shape  # sparse matrices expose .shape directly, so there is no need to densify

(1197, 18277)

In [29]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score
mnb = MultinomialNB()
mnb.fit(X_train,y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [32]:
mnb.score(X_train, y_train)

##### Now that we've fit our model, we can transform our X_test into vectorized form. To do this, we call the .transform( ) method on the previously fitted vectorizer. Why don't we use fit_transform? Because the model was trained on the columns (vocabulary) learned from the training set, and the test vectors must have exactly those columns. Calling fit_transform on the test set would learn a different vocabulary, so the columns would no longer line up with what the model expects. A word that appears only in the test set is simply ignored: the model never saw it, so it carries no information for prediction.
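
A minimal sketch of this behavior on made-up sentences (the documents and variable names here are hypothetical, not part of the newsgroups data):

```python
from sklearn.feature_extraction.text import CountVectorizer

train_docs = ["the pitcher threw a strike", "the goalie made a save"]
test_docs = ["the goalie threw a zamboni"]  # "zamboni" never appears in training

vec = CountVectorizer()
X_tr = vec.fit_transform(train_docs)  # learns the vocabulary
X_te = vec.transform(test_docs)       # reuses it; unseen words are dropped

print(sorted(vec.vocabulary_))
print(X_te.toarray())
```

The test vector has the same number of columns as the training matrix, and "zamboni" contributes nothing to it.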

In [33]:
X_test = bow.transform(newsgroups_test.data)
y_test = newsgroups_test.target
accuracy_score(y_test, mnb.predict(X_test))  # signature is accuracy_score(y_true, y_pred)

0.9748743718592965

Wow! Even without using n-grams or removing special characters, we can classify these two types of articles very accurately. Let's take a look at our features to get a better idea of which ones mattered most in determining our predictions. Of course, we should also look at a confusion matrix to gain a better understanding of how well our model is performing.
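
As a sketch of what the confusion matrix looks like, the example below uses made-up labels (`y_true` and `y_pred` are hypothetical). In the notebook, the equivalent call would be `confusion_matrix(y_test, mnb.predict(X_test))`.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 0 = baseball, 1 = hockey.
y_true = [0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 1, 0]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)
print(cm)
```

The diagonal counts correct predictions. The off-diagonal cells show which class is being mistaken for which, something a single accuracy number hides.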


In [37]:
### grabbing our feature names (each one of our tokenized words)
feature_names = np.array(bow.get_feature_names_out())  # get_feature_names() in scikit-learn < 1.0

In [38]:
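# MultinomialNB stores log P(word | class) in feature_log_prob_; for two classes,
# the old mnb.coef_[0] was the same array as feature_log_prob_[1] (class 1, hockey)

min(mnb.feature_log_prob_[1])

-11.778469284674996

In [39]:
# indices of the ten largest values, in ascending order
feature_importances = np.argsort(mnb.feature_log_prob_[1])[-10:]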


In [40]:
for idx in feature_importances:
    print(feature_names[idx])

10
play
game
organization
lines
subject
hockey
ca
team
edu


#### Clearly, some of the top-ranked features don't make much sense. The list is sorted in ascending order, so `edu` has the highest value and `10` the lowest of the ten shown. Tokens such as `subject`, `lines`, `organization`, `edu`, and `ca` most likely come from message headers and email addresses rather than from the article text, and a bare number like `10` tells us little about either sport. To filter out tokens like these, we can write custom tokenizers/preprocessors and pass them to our vectorizers.

Learn more about adding custom tokenizers, preprocessors, and analyzers here: https://towardsdatascience.com/hacking-scikit-learns-vectorizers-9ef26a7170af

Learn more about selecting features in Naive Bayes text classification problems here:
https://arxiv.org/pdf/1602.02850.pdf


Try making a custom tokenizer function and using it with sklearn's vectorizer classes:

In [45]:
def tokenizer_func(text):
    """Input: raw text - document to be tokenized
       Output: list - tokenized text """
    
    
    
    
    pass


count = CountVectorizer(tokenizer = tokenizer_func)
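
If you get stuck, here is one possible sketch, not the only answer. `letters_only_tokenizer` is a hypothetical name; it keeps alphabetic tokens (with an optional internal apostrophe) and so drops numeric tokens such as `10`. Header words like `lines` still get through, so those need separate handling, for example by passing `remove=('headers',)` to `fetch_20newsgroups`.

```python
import re
from sklearn.feature_extraction.text import CountVectorizer


def letters_only_tokenizer(text):
    """Input: raw text. Output: list of lower-case, letters-only tokens."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


# token_pattern=None avoids a warning, since the custom tokenizer replaces it.
count = CountVectorizer(tokenizer=letters_only_tokenizer, token_pattern=None)
X = count.fit_transform(["Lines: 18  The Yankees won 9-1!", "Don't miss game 10"])
print(sorted(count.vocabulary_))
```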

## Measuring the Similarity Between Documents

We can tell how similar two documents are to one another, normalizing for length, by taking the cosine similarity of their vectors.

For count or TF-IDF vectors, whose entries are never negative, this number ranges over [0, 1], with 0 meaning the documents share no terms and 1 meaning the vectors point in exactly the same direction (the same word proportions). A potential application of cosine similarity is a basic recommendation engine. If you wanted to recommend the articles most similar to a given article, you could take the cosine similarity between it and every other article and return the highest-scoring ones.

<img src="./resources/better_cos_similarity.png">
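
The calculation itself is short enough to write by hand. The sketch below uses made-up count vectors over a four-word vocabulary (`cosine` and the three vectors are hypothetical names and data) to show the length normalization at work.

```python
import math


def cosine(u, v):
    """cos(theta) = (u . v) / (||u|| * ||v||)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical word-count vectors over a 4-word vocabulary.
short_doc = [1, 2, 0, 0]
long_doc = [2, 4, 0, 0]    # same proportions, twice as long
other_doc = [0, 0, 3, 1]   # no words in common

sim_same = cosine(short_doc, long_doc)
sim_none = cosine(short_doc, other_doc)
print(sim_same, sim_none)
```

Doubling every count leaves the similarity at (approximately) 1, while documents with no words in common score 0.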

In [47]:
sample = CountVectorizer()
sunday_afternoon = ['I ate a burger at burger queen and it was very good.',
                    'I ate a hot dog at burger prince and it was bad',
                    'I drove a racecar through your kitchen door',
                    'I ate a hot dog at burger king and it was bad. I ate a burger at burger queen and it was very good']

sample.fit(sunday_afternoon)
text_data = sample.transform(sunday_afternoon)

In [51]:
text_data.toarray()

array([[1, 1, 1, 0, 2, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0],
       [1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0],
       [0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1],
       [2, 2, 2, 1, 3, 1, 0, 0, 1, 1, 2, 1, 0, 0, 1, 0, 0, 1, 2, 0]])

In [48]:
from sklearn.metrics.pairwise import cosine_similarity
## the 0th and 2nd index lines are very different, a number close to 0
cosine_similarity(text_data[0],text_data[2])


array([[0.]])

In [53]:
## the 0th and 3rd index lines are very similar, despite different lengths
cosine_similarity(text_data[0],text_data[3])

array([[0.91413793]])
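
Building on this, the recommendation idea mentioned earlier can be sketched in a few lines: compute all pairwise similarities at once and, for each article, pick the most similar other article. The variable names are hypothetical, and a real recommender would need more than raw word counts.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

articles = [
    "I ate a burger at burger queen and it was very good.",
    "I ate a hot dog at burger prince and it was bad",
    "I drove a racecar through your kitchen door",
]

counts = CountVectorizer().fit_transform(articles)
sims = cosine_similarity(counts)   # square matrix of pairwise similarities
np.fill_diagonal(sims, -1.0)       # never recommend an article to itself

best_match = sims.argmax(axis=1)   # index of the most similar other article
print(best_match)
```

The two food reviews recommend each other. The racecar sentence shares no words with either, so its similarity to both is 0 and its "best match" is arbitrary.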

## Bonus

### spaCy

spaCy is a powerful, efficient NLP library. Its trained pipelines use neural network models, and some include word vectors that capture the semantic meaning of words.
spaCy also provides syntactic features such as part-of-speech tags and dependency parses.

In [13]:
import spacy
# the 'en' shortcut was removed in spaCy 3; install the model first with
#   python -m spacy download en_core_web_sm
nlp = spacy.load('en_core_web_sm')

In [None]:
sample_sentence = """when data scientists are performing natural language processing analysis, they must take \
different verb tenses and singular versus plural words into account."""

In [None]:
tokenized = nlp(sample_sentence)

In [None]:
for word in tokenized:
    print(word, word.pos_)

It can also detect larger structures such as "noun chunks" (base noun phrases), in addition to the part-of-speech tags shown above.

In [None]:
for chunk in tokenized.noun_chunks:
    print(chunk)

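#### spaCy has built-in trained models that represent words as vectors. These word embeddings build on techniques such as word2vec. Note that the small English model (`en_core_web_sm`) does not ship with static word vectors; the medium and large models (`en_core_web_md`, `en_core_web_lg`) do.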

Read more about it here: https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf

In [None]:
for token in tokenized:
    print(token.text, token.has_vector, token.vector_norm, token.is_oov)