___

<a href='http://www.pieriandata.com'> <img src='../Pierian_Data_Logo.png' /></a>
___

## Introduction 
- Most classic ML algorithms cannot take in raw text. Instead, we need to perform __feature extraction__ on the raw text in order to pass numerical features to the ML algorithm.
- For example, we could count the occurrences of each word to map text to numbers.
- Let's discuss Count Vectorization along with Term Frequency and Inverse Document Frequency.   
  Let's say we have 3 messages:
  ```
  messages = ["Hey, Let's go to the game today!",
              "Call your sister.",
              "Want to go walk your dogs?"
              ]
  ```
  Now everything is in raw text. Using sklearn, we can apply `CountVectorizer`, which treats each unique word as a feature.
  ```
    from sklearn.feature_extraction.text import CountVectorizer
    count_vect = CountVectorizer()
  ```

  Each text message is treated as a document, and each unique word across the documents becomes a feature.

- __Document Term Matrix (DTM)__ - It has one row per document and one column per unique word in the vocabulary of all documents. Each cell counts the number of times that word shows up in that document.

- An alternative to CountVectorizer is the TF-IDF vectorizer. It also creates a DTM from our messages. However, instead of filling the DTM with token counts, it calculates the term frequency-inverse document frequency (TF-IDF) value for each word.
- __Term frequency__ `tf(t,d)`: the raw count of a term in a document, i.e., the number of times that term t occurs in document d. For the three messages above, the raw counts are:

| call | dogs | game | go | hey | let | sister | the | to | today | walk | want | your |
|------|------|------|----|-----|-----|--------|-----|----|-------|------|------|------|
| 0    | 0    | 1    | 1  | 1   | 1   | 0      | 1   | 1  | 1     | 0    | 0    | 0    |
| 1    | 0    | 0    | 0  | 0   | 0   | 1      | 0   | 0  | 0     | 0    | 0    | 1    |
| 0    | 1    | 0    | 1  | 0   | 0   | 0      | 0   | 1  | 0     | 1    | 1    | 1    |
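
A document-term matrix of raw counts for the three messages can be sketched in plain Python. This is a rough illustration, not scikit-learn's implementation: it assumes tokens are lowercase runs of two or more word characters (which mirrors the documented default token pattern of `CountVectorizer`), and the names `tokenize`, `terms` and `dtm` are made up for this example.

```python
import re
from collections import Counter

messages = ["Hey, Let's go to the game today!",
            "Call your sister.",
            "Want to go walk your dogs?"]

def tokenize(text):
    # Assumption: lowercase the text, then keep runs of 2+ word characters.
    return re.findall(r"\b\w\w+\b", text.lower())

# The vocabulary is the sorted set of unique tokens over all documents.
terms = sorted({tok for msg in messages for tok in tokenize(msg)})

# One row per document, one column per vocabulary term.
dtm = []
for msg in messages:
    counts = Counter(tokenize(msg))
    dtm.append([counts[term] for term in terms])

print(terms)
for row in dtm:
    print(row)
```

Note that under this tokenization "Let's" contributes the token `let`, because the single-character `s` after the apostrophe is dropped.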
  

- However, term frequency alone is not enough for a thorough analysis of the text. Consider stop words such as _a_ and _the_.
- Because the term _the_ is so common, term frequency will tend to incorrectly emphasize documents which happen to use the word _the_ more frequently, without giving enough weight to more meaningful terms like "red" and "dog".
- An inverse document frequency (IDF) factor is incorporated, which diminishes the weight of terms that occur very frequently in the document set and increases the weight of terms that occur rarely.
- IDF is the logarithmically scaled inverse fraction of the documents that contain the word (obtained by dividing the total number of documents by the number of documents containing the term, and then taking the logarithm of that quotient).  
  ![](images\1.PNG)
  

![](images\2.PNG)
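
In symbols, the verbal definition of IDF given above can be written as follows. The symbols $N$ (total number of documents) and $D$ (the document set) are introduced here for illustration, and the base of the logarithm is left unspecified because the text does not fix it:

$$
\mathrm{idf}(t, D) = \log\frac{N}{\left|\{d \in D : t \in d\}\right|},
\qquad
\mathrm{tfidf}(t, d, D) = \mathrm{tf}(t, d)\cdot\mathrm{idf}(t, D)
$$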


Let us take an example and calculate the term frequency and IDF:  
Document A: I do not like Vanilla Cake  
Document B: I do not like Vanilla Icecream    
No. of words in Document A: 6  
No. of words in Document B: 6

![](images\3.PNG)

![](images\4.PNG)



It is clear from the above approach that less frequent words like ‘cake’ and ‘icecream’ get more weight than more frequent words.
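
The calculation for Documents A and B can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: tf is taken as the word count divided by the document length, idf as the natural log of (number of documents / number of documents containing the word), and the helper names `tf` and `idf` are made up here. With a different log base the numbers change, but the ranking of the words does not.

```python
import math

docs = {
    "A": "I do not like Vanilla Cake",
    "B": "I do not like Vanilla Icecream",
}
tokens = {name: text.lower().split() for name, text in docs.items()}

def tf(term, doc_tokens):
    # Assumption: relative frequency of the term within one document.
    return doc_tokens.count(term) / len(doc_tokens)

def idf(term):
    # Assumption: natural log; only call this for terms that occur somewhere.
    n_containing = sum(term in toks for toks in tokens.values())
    return math.log(len(tokens) / n_containing)

for name, toks in tokens.items():
    for term in toks:
        print(name, term, round(tf(term, toks) * idf(term), 4))
```

Words shared by both documents get an idf of `log(2/2) = 0`, so only `cake` and `icecream` end up with a non-zero score.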

We can achieve the same task by importing `TfidfVectorizer` from the sklearn library.

Note that every library that calculates TF-IDF may use a slightly different formula. There are also parameters you can set to smooth the results. So if sklearn gives you a TF-IDF value that differs from the hand calculation, do not be confused; the basic idea behind the approach is the same.

For example, some techniques add 1 to the denominator when calculating the IDF values.

The code below builds the DTM for the messages.

In [1]:
messages = ["Hey, Let's go to the game today!",
              "Call your sister.",
              "Want to go walk your dogs?"
              ]

from sklearn.feature_extraction.text import TfidfVectorizer
vect = TfidfVectorizer()

In [2]:
dtm = vect.fit_transform(messages)

In [3]:
print(dtm)

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 16 stored elements and shape (3, 13)>
  Coords	Values
  (0, 4)	0.40301621080355077
  (0, 5)	0.40301621080355077
  (0, 3)	0.3065042162415877
  (0, 8)	0.3065042162415877
  (0, 7)	0.40301621080355077
  (0, 2)	0.40301621080355077
  (0, 9)	0.40301621080355077
  (1, 0)	0.6227660078332259
  (1, 12)	0.4736296010332684
  (1, 6)	0.6227660078332259
  (2, 3)	0.3494981241087058
  (2, 8)	0.3494981241087058
  (2, 12)	0.3494981241087058
  (2, 11)	0.45954803293870056
  (2, 10)	0.45954803293870056
  (2, 1)	0.45954803293870056
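
The values printed above can be reproduced by hand. The sketch below assumes scikit-learn's documented defaults for `TfidfVectorizer`: raw counts as tf, a smoothed idf of `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row. The helper names (`tokenize`, `smoothed_idf`, `tfidf_row`) are illustrative and not part of the library.

```python
import math
import re
from collections import Counter

messages = ["Hey, Let's go to the game today!",
            "Call your sister.",
            "Want to go walk your dogs?"]

def tokenize(text):
    # Lowercase, then keep runs of two or more word characters.
    return re.findall(r"\b\w\w+\b", text.lower())

docs = [Counter(tokenize(m)) for m in messages]
terms = sorted(set().union(*docs))   # column order of the DTM
n_docs = len(docs)

def smoothed_idf(term):
    df = sum(term in doc for doc in docs)          # document frequency
    return math.log((1 + n_docs) / (1 + df)) + 1   # smoothed idf, shifted by 1

def tfidf_row(doc):
    raw = {t: doc[t] * smoothed_idf(t) for t in doc}
    norm = math.sqrt(sum(v * v for v in raw.values()))  # L2 norm of the row
    return {t: v / norm for t, v in raw.items()}

rows = [tfidf_row(doc) for doc in docs]
for term in ("call", "sister", "your"):
    print((1, terms.index(term)), rows[1][term])
```

In the second message, `your` also appears in the third message, so it receives a lower weight than `call` and `sister`.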


TF-IDF helps us to understand the importance of words in the context of an entire corpus of documents, instead of just their relative importance in a single document.

This unit is divided into two sections:
* First, we'll find out what is necessary to build an NLP system that can turn a body of text into a numerical array of *features*.
* Next we'll show how to perform these steps using real tools.

# Building a Natural Language Processor From Scratch
In this section we'll use basic Python to build a rudimentary NLP system. We'll build a *corpus of documents* (two small text files), create a *vocabulary* from all the words in both documents, and then demonstrate a *Bag of Words* technique to extract features from each document.<br>
<div class="alert alert-info" style="margin: 20px">**This first section is for illustration only!**
<br>Don't bother memorizing the code - we'd never do this in real life.</div>

## Start with some documents:
For simplicity we won't use any punctuation.

In [4]:
%%writefile 1.txt
This is a story about cats
our feline pets
Cats are furry animals

Overwriting 1.txt


In [5]:
%%writefile 2.txt
This story is about surfing
Catching waves is fun
Surfing is a popular water sport

Overwriting 2.txt


## Build a vocabulary

Building a vocabulary is always the first step. Regardless of the methods you're going to use, you have to assign some sort of vocabulary across all your documents.

The goal here is to build a numerical array from all the words that appear across all the documents. Later we'll create instances (vectors) for each individual document.

To do this, we build a dictionary that maps each word to an ID, and keep an ID counter that increments whenever a new word is added:

In [6]:
vocab = {}
i = 1  # ID counter

with open('1.txt') as f:
    x = f.read().lower().split()

for word in x:
    if word in vocab:  # the word already has an ID, so skip it
        continue
    else:
        vocab[word]=i
        i+=1

print(vocab)

{'this': 1, 'is': 2, 'a': 3, 'story': 4, 'about': 5, 'cats': 6, 'our': 7, 'feline': 8, 'pets': 9, 'are': 10, 'furry': 11, 'animals': 12}


In [7]:
with open('2.txt') as f:
    x = f.read().lower().split()

for word in x:
    if word in vocab:
        continue
    else:
        vocab[word]=i
        i+=1

print(vocab)

{'this': 1, 'is': 2, 'a': 3, 'story': 4, 'about': 5, 'cats': 6, 'our': 7, 'feline': 8, 'pets': 9, 'are': 10, 'furry': 11, 'animals': 12, 'surfing': 13, 'catching': 14, 'waves': 15, 'fun': 16, 'popular': 17, 'water': 18, 'sport': 19}


Even though `2.txt` has 15 words, only 7 new words were added to the dictionary.
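
The two cells above repeat the same loop, so the idea can be wrapped in one helper. This is a sketch only: `build_vocab` is a hypothetical name, and the documents are held in memory as strings instead of being read from `1.txt` and `2.txt` so that the example is self-contained.

```python
docs = {
    "1.txt": "This is a story about cats\nour feline pets\nCats are furry animals",
    "2.txt": "This story is about surfing\nCatching waves is fun\nSurfing is a popular water sport",
}

def build_vocab(texts):
    """Map each unique lowercase word to an ID, in order of first appearance."""
    vocab = {}
    for text in texts:
        for word in text.lower().split():
            # setdefault only assigns an ID the first time a word is seen.
            vocab.setdefault(word, len(vocab) + 1)
    return vocab

vocab = build_vocab(docs.values())
print(len(vocab), vocab)
```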

## Feature Extraction
Now that we've encapsulated our "entire language" in a dictionary, let's perform *feature extraction* on each of our original documents:

In [8]:
# Create an empty vector with space for each word in the vocabulary:
one = ['1.txt']+[0]*len(vocab)
one

['1.txt', 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

In [9]:
# map the frequencies of each word in 1.txt to our vector:
with open('1.txt') as f:
    x = f.read().lower().split()
    
for word in x:
    one[vocab[word]]+=1
    
one

['1.txt', 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]

<font color=green>We can see that most of the words in 1.txt appear only once, although "cats" appears twice.</font>

In [10]:
# Do the same for the second document:
two = ['2.txt']+[0]*len(vocab)

with open('2.txt') as f:
    x = f.read().lower().split()
    
for word in x:
    two[vocab[word]]+=1

In [11]:
# Compare the two vectors:
print(f'{one}\n{two}')

['1.txt', 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
['2.txt', 1, 3, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 2, 1, 1, 1, 1, 1, 1]


By comparing the vectors we see that some words are common to both, some appear only in `1.txt`, others only in `2.txt`. Extending this logic to tens of thousands of documents, we would see the vocabulary dictionary grow to hundreds of thousands of words. Vectors would contain mostly zero values, making them *sparse matrices*.
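
One simple way to picture a sparse representation is to store only the non-zero counts. The sketch below is a toy illustration (the helper `to_sparse` is a made-up name, and real libraries use more compact formats): it converts the two vectors from above into dictionaries keyed by vocabulary ID.

```python
one = ['1.txt', 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
two = ['2.txt', 1, 3, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 2, 1, 1, 1, 1, 1, 1]

def to_sparse(vector):
    """Keep only the non-zero counts, keyed by vocabulary ID (position 0 is the file name)."""
    return {idx: count for idx, count in enumerate(vector[1:], start=1) if count != 0}

sparse_one = to_sparse(one)
sparse_two = to_sparse(two)
print(sparse_one)
print(sparse_two)
```

With only two short documents the saving is small, but with a vocabulary of hundreds of thousands of words each document would still need only a handful of entries.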

## Bag of Words and Tf-idf
In the above examples, each vector can be considered a *bag of words*. By themselves these may not be helpful until we consider *term frequencies*, or how often individual words appear in documents. A simple way to calculate term frequencies is to divide the number of occurrences of a word by the total number of words in the document. In this way, the number of times a word appears in large documents can be compared to that of smaller documents.

However, it may be hard to differentiate documents based on term frequency if a word shows up in a majority of documents. To handle this we also consider *inverse document frequency*, which is the total number of documents divided by the number of documents that contain the word. In practice we convert this value to a logarithmic scale, as described [here](https://en.wikipedia.org/wiki/Tf%E2%80%93idf#Inverse_document_frequency).

Together these terms become [**tf-idf**](https://en.wikipedia.org/wiki/Tf%E2%80%93idf).
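
Applied to the two count vectors built earlier, the description above might look like the following sketch. It assumes the natural log for idf and uses made-up helper names (`term_frequencies`, `inverse_document_frequencies`); other libraries use different variants of the formula.

```python
import math

one = ['1.txt', 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
two = ['2.txt', 1, 3, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 2, 1, 1, 1, 1, 1, 1]
counts = [one[1:], two[1:]]   # drop the file names, keep the counts

def term_frequencies(row):
    total = sum(row)          # total number of words in the document
    return [c / total for c in row]

def inverse_document_frequencies(rows):
    n_docs = len(rows)
    idf = []
    for column in zip(*rows):                     # one column per vocabulary word
        df = sum(1 for c in column if c > 0)      # documents containing the word
        idf.append(math.log(n_docs / df))
    return idf

tf_one, tf_two = (term_frequencies(row) for row in counts)
idf = inverse_document_frequencies(counts)
tfidf_two = [tf * w for tf, w in zip(tf_two, idf)]

# Vocabulary ID 2 is "is", ID 13 is "surfing" (list index = ID - 1).
print(tf_two[1], tfidf_two[1])
print(tf_two[12], tfidf_two[12])
```

Although "is" is the most frequent word in `2.txt`, it appears in both documents, so its tf-idf score drops to zero while "surfing" keeps a positive score.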

## Stop Words and Word Stems
Some words like "the" and "and" appear so frequently, and in so many documents, that we needn't bother counting them. Also, it may make sense to only record the root of a word, say `cat` in place of both `cat` and `cats`. This will shrink our vocab array and improve performance.
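
As a toy illustration of both ideas, the sketch below filters a small hand-picked stop word set and applies a deliberately naive stemming rule to the text of `1.txt`. Both `STOP_WORDS` and `naive_stem` are hypothetical and far cruder than what real NLP libraries provide.

```python
text = "This is a story about cats our feline pets Cats are furry animals"

# Hypothetical, hand-picked stop words for this example only.
STOP_WORDS = {"this", "is", "a", "about", "our", "are"}

def naive_stem(word):
    # Toy rule: drop a trailing "s" from words longer than three letters.
    return word[:-1] if len(word) > 3 and word.endswith("s") else word

tokens = text.lower().split()
kept = [naive_stem(word) for word in tokens if word not in STOP_WORDS]
reduced_vocab = sorted(set(kept))

print(len(set(tokens)), "unique words before")
print(len(reduced_vocab), "unique words after:", reduced_vocab)
```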

## Tokenization and Tagging
When we created our vectors the first thing we did was split the incoming text on whitespace with `.split()`. This was a crude form of *tokenization* - that is, dividing a document into individual words. In this simple example we didn't worry about punctuation or different parts of speech. In the real world we rely on some fairly sophisticated *morphology* to parse text appropriately.
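
To see why whitespace splitting is crude, compare it with a slightly more careful regular expression on a sentence that contains punctuation. The pattern below is an assumption chosen for this example (lowercase letters with an optional apostrophe part), not a general-purpose tokenizer.

```python
import re

message = "Hey, Let's go to the game today!"

by_whitespace = message.lower().split()
# Assumed pattern: runs of letters, optionally followed by an apostrophe part.
by_regex = re.findall(r"[a-z]+(?:'[a-z]+)?", message.lower())

print(by_whitespace)
print(by_regex)
```

The whitespace version keeps punctuation glued to the words (`hey,` and `today!`), so they would be counted as different vocabulary entries from `hey` and `today`.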

Once the text is divided, we can go back and *tag* our tokens with information about parts of speech, grammatical dependencies, etc. This adds more dimensions to our data and enables a deeper understanding of the context of specific documents. For this reason, vectors become ***high dimensional sparse matrices***.

<div class="alert alert-info" style="margin: 20px">**That's the end of the first section.**
<br>In the next section we'll use scikit-learn to perform a real-life analysis.</div>

___
# Feature Extraction from Text
In the **Scikit-learn Primer** lecture we applied a simple SVC classification model to the SMSSpamCollection dataset. We tried to predict the ham/spam label based on message length and punctuation counts. In this section we'll actually look at the text of each message and try to perform a classification based on content. We'll take advantage of some of scikit-learn's [feature extraction](https://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction) tools.

## Load a dataset

In [12]:
# Perform imports and load the dataset:
import numpy as np
import pandas as pd

df = pd.read_csv('../TextFiles/smsspamcollection.tsv', sep='\t')
df.head()

|   | label | message | length | punct |
|---|-------|---------|--------|-------|
| 0 | ham | Go until jurong point, crazy.. Available only ... | 111 | 9 |
| 1 | ham | Ok lar... Joking wif u oni... | 29 | 6 |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... | 155 | 6 |
| 3 | ham | U dun say so early hor... U c already then say... | 49 | 6 |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... | 61 | 2 |


## Check for missing values:
Always a good practice.

In [13]:
df.isnull().sum()

label      0
message    0
length     0
punct      0
dtype: int64

## Take a quick look at the *ham* and *spam* `label` column:

In [14]:
df['label'].value_counts()

label
ham     4825
spam     747
Name: count, dtype: int64

<font color=green>4825 out of 5572 messages, or 86.6%, are ham. This means that any text classification model we create has to perform **better than 86.6%** to beat a naive baseline that labels every message as ham.</font>
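
The 86.6% figure follows directly from the label counts above. A quick check, with the counts typed in by hand:

```python
counts = {"ham": 4825, "spam": 747}   # taken from value_counts() above

total = sum(counts.values())
majority_label = max(counts, key=counts.get)
baseline_accuracy = counts[majority_label] / total

print(f"Always predicting {majority_label!r} scores {baseline_accuracy:.1%}")
```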

## Split the data into train & test sets:

In [15]:
from sklearn.model_selection import train_test_split

X = df['message']  # this time we want to look at the text
y = df['label']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

In [16]:
X_train[:5]

3235                                  Yup ü not comin :-(
945     I sent my scores to sophas and i had to do sec...
5319                         Kothi print out marandratha.
5528    Its just the effect of irritation. Just ignore it
247                        I asked you to call him now ok
Name: message, dtype: object

## Scikit-learn's CountVectorizer
Text preprocessing, tokenizing and the ability to filter out stopwords are all included in [CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html), which builds a dictionary of features and transforms documents to feature vectors.

In [17]:
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()

# Fit the vectorizer to the data: this builds the vocabulary and counts the words
# count_vect.fit(X_train)

# Transform the original text messages to vectors
# X_train_counts = count_vect.transform(X_train)

# Instead, we can do both of the above steps together:
X_train_counts = count_vect.fit_transform(X_train)
X_train_counts.shape

(3733, 7082)

<font color=green>This shows that our training set is comprised of 3733 documents, and 7082 features.</font>

So across 3733 messages, there are a total of 7082 unique words.

In [18]:
X_train_counts

<Compressed Sparse Row sparse matrix of dtype 'int64'
	with 49992 stored elements and shape (3733, 7082)>

Notice that we cannot see the transformed data directly, as it's a huge sparse matrix.

`X_train_counts` is similar in structure to the count vectors (`one` and `two`) we built by hand in the first section.

The matrix contains a huge number of zeros. Because of this, scikit-learn and SciPy can compress it by storing only the non-zero entries, which saves memory on your computer. That is why the sparse matrix is not displayed directly.
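
A quick calculation from the numbers reported above shows just how sparse the matrix is. The figures are copied by hand from the output of `X_train_counts`:

```python
n_rows, n_cols = 3733, 7082   # shape of X_train_counts
stored = 49992                # stored (non-zero) elements

total_cells = n_rows * n_cols
density = stored / total_cells

print(f"{total_cells:,} cells in total")
print(f"{density:.4%} of the cells are non-zero")
print(f"about {stored / n_rows:.1f} distinct tokens per message on average")
```

If you do want to inspect a few rows in the notebook, converting a small slice to a dense array (for example `X_train_counts[:5].toarray()`) is usually safe, whereas converting the whole matrix can use a lot of memory.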

Next, we'll transform the counts to frequencies with TF-IDF. Then we'll combine the steps with a TF-IDF vectorizer, train a classifier, and build a pipeline.

## Transform Counts to Frequencies with Tf-idf
While counting words is helpful, longer documents will have higher average count values than shorter documents, even though they might talk about the same topics.

To avoid this we can simply divide the number of occurrences of each word in a document by the total number of words in the document: these new features are called **tf** for Term Frequencies.

Another refinement on top of **tf** is to downscale weights for words that occur in many documents in the corpus and are therefore less informative than those that occur only in a smaller portion of the corpus.




Both tf and tf–idf can be computed as follows using [TfidfTransformer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html). 

TfidfTransformer takes word counts (from CountVectorizer) and gives more weight to important words and less weight to common words.

In [19]:
from sklearn.feature_extraction.text import TfidfTransformer
tfidf_transformer = TfidfTransformer()

X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
X_train_tfidf.shape

(3733, 7082)

Note: the `fit_transform()` method actually performs two operations: it fits an estimator to the data and then transforms our count-matrix to a tf-idf representation.
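
To see the difference between **tf** and **tf-idf** on something small, the sketch below runs `TfidfTransformer` on a hypothetical 2x3 count matrix. It relies on two documented parameters: `use_idf=False` switches the idf weighting off, and `norm="l1"` divides each row by its total so that the values become plain term frequencies.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

# Hypothetical word counts: 2 documents, 3 vocabulary words.
counts = np.array([[2, 1, 0],
                   [1, 0, 1]])

# Term frequencies only: no idf weighting, each row sums to 1.
tf_only = TfidfTransformer(use_idf=False, norm="l1").fit_transform(counts).toarray()

# Defaults: idf weighting followed by L2 normalization of each row.
tfidf = TfidfTransformer().fit_transform(counts).toarray()

print(tf_only)
print(tfidf)
```

In the tf-idf result, the first word (present in both documents) is weighted down relative to the words that occur in only one document.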

## Combine Steps with TfidfVectorizer
In the future, we can combine the CountVectorizer and TfidfTransformer steps into one using [TfidfVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html):

In [20]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()

X_train_tfidf = vectorizer.fit_transform(X_train) # remember to use the original X_train set
X_train_tfidf.shape

(3733, 7082)
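
A small sanity check of that equivalence, using the three example messages from the introduction in place of the SMS dataset (all three classes are used with their default settings):

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfTransformer, TfidfVectorizer)

messages = ["Hey, Let's go to the game today!",
            "Call your sister.",
            "Want to go walk your dogs?"]

# Two steps: count the tokens, then re-weight the counts.
two_step = TfidfTransformer().fit_transform(
    CountVectorizer().fit_transform(messages))

# One step: TfidfVectorizer does both.
one_step = TfidfVectorizer().fit_transform(messages)

same = np.allclose(two_step.toarray(), one_step.toarray())
print(one_step.shape, same)
```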

## Train a Classifier
Here we'll introduce an SVM classifier that's similar to SVC, called [LinearSVC](https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html). LinearSVC handles sparse input better, and scales well to large numbers of samples.

In [21]:
from sklearn.svm import LinearSVC
clf = LinearSVC()
clf.fit(X_train_tfidf,y_train)

<font color=green>Earlier we named our SVC classifier **svc_model**. Here we're using the more generic name **clf** (for classifier).</font>

## Build a Pipeline
Remember that only our training set has been vectorized into a full vocabulary. In order to perform an analysis on our test set we'll have to submit it to the same procedures. Fortunately scikit-learn offers a [**Pipeline**](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) class that behaves like a compound classifier.
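
To make clear what a pipeline saves us from, here is roughly what the manual version looks like. The tiny dataset is hypothetical and only shows the order of the calls: the vectorizer is fitted on the training text only, and the test text is passed through `transform()` (not `fit_transform()`) so that it is mapped onto the same vocabulary.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Hypothetical toy data, not the SMS dataset.
train_texts = ["win a free prize now", "free entry to win cash",
               "are we still meeting today", "call me when you get home"]
train_labels = ["spam", "spam", "ham", "ham"]
test_texts = ["free cash prize", "see you at home today"]

vectorizer = TfidfVectorizer()
X_train_vec = vectorizer.fit_transform(train_texts)  # learn vocabulary and idf on train only
clf = LinearSVC().fit(X_train_vec, train_labels)

X_test_vec = vectorizer.transform(test_texts)         # reuse the fitted vocabulary
predictions = clf.predict(X_test_vec)

print(X_train_vec.shape, X_test_vec.shape, predictions)
```

Words in the test text that were never seen during training (such as "see") are simply ignored, which is why both matrices have the same number of columns.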

In [22]:
from sklearn.pipeline import Pipeline
# from sklearn.feature_extraction.text import TfidfVectorizer
# from sklearn.svm import LinearSVC

text_clf = Pipeline([('tfidf', TfidfVectorizer()),
                     ('clf', LinearSVC()),
])

# Feed the training data through the pipeline
text_clf.fit(X_train, y_train)  

## Test the classifier and display results

In [23]:
# Form a prediction set
predictions = text_clf.predict(X_test)

In [24]:
# Report the confusion matrix
from sklearn import metrics
print(metrics.confusion_matrix(y_test,predictions))

[[1586    7]
 [  12  234]]


In [25]:
# Print a classification report
print(metrics.classification_report(y_test,predictions))

              precision    recall  f1-score   support

         ham       0.99      1.00      0.99      1593
        spam       0.97      0.95      0.96       246

    accuracy                           0.99      1839
   macro avg       0.98      0.97      0.98      1839
weighted avg       0.99      0.99      0.99      1839



In [26]:
# Print the overall accuracy
print(metrics.accuracy_score(y_test,predictions))

0.989668297988037
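
The figures in the classification report can be recomputed from the confusion matrix alone. In scikit-learn's `confusion_matrix`, rows are the actual labels and columns are the predicted labels, in sorted label order (`ham`, then `spam`). The sketch below treats `spam` as the positive class; the variable names are chosen for this example.

```python
confusion = [[1586, 7],    # actual ham:  predicted ham, predicted spam
             [12, 234]]    # actual spam: predicted ham, predicted spam

(tn, fp), (fn, tp) = confusion

accuracy = (tn + tp) / (tn + fp + fn + tp)
spam_precision = tp / (tp + fp)   # of the messages flagged as spam, how many were spam
spam_recall = tp / (tp + fn)      # of the actual spam messages, how many were caught
spam_f1 = 2 * spam_precision * spam_recall / (spam_precision + spam_recall)

print(f"accuracy={accuracy:.4f} precision={spam_precision:.2f} "
      f"recall={spam_recall:.2f} f1={spam_f1:.2f}")
```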


In [27]:
# We can also pass a new message to check the prediction
text_clf.predict(["Hi, how are you doing today?"])

array(['ham'], dtype=object)

In [28]:
text_clf.predict(["Congratulations! You have won a lottery!"])

array(['spam'], dtype=object)

Using the text of the messages, our model performed exceedingly well; it correctly classified **98.97%** of the test messages!<br>
Now let's apply what we've learned to a text classification project involving positive and negative movie reviews.

## Next up: Text Classification Project