<img src="../images/cads-logo.png" style="height: 100px;" align=left>  <img src="../images/NLP.jpeg" style="height: 200px;" align=right, width="300">
# Natural Language Processing

# Contents:

- Introduction to Natural Language Processing
- Sentiment Analysis
    - Model Selection in scikit-learn
    - Extracting features
        - Bag-of-words
        - Exercise A
    - Logistic Regression classification
    - Tfidf
        - Exercise B
    - N-gram
- Text Classification
    - Using sklearn's NaiveBayes Classifier
        - Exercise C

# 1. Introduction to Natural Language Processing
NLP is a branch of data science that consists of systematic processes for analyzing, understanding, and deriving information from text data in a smart and efficient manner. By utilizing NLP and its components, one can organize massive chunks of text data, perform numerous automated tasks, and solve a wide range of problems such as automatic summarization, machine translation, named entity recognition, relationship extraction, sentiment analysis, speech recognition, and topic segmentation.

A popular use case, sentiment analysis of Amazon reviews, is a good way to understand how text can be parsed and processed into something useful for machine learning.


# 2. Case Study: Sentiment Analysis

We will be working on a large dataset of reviews of unlocked mobile phones sold on Amazon.com that has been collected by Crawlers et al. in December 2016. The dataset consists of about 400 thousand reviews and can be used to find insights about reviews, ratings, prices, and the relationships between them.

#### Dataset Content 

Given below are the fields:

- Product Title
- Brand
- Price
- Rating
- Review text
- Number of people who found the review helpful

Our end goal here is to learn how to extract meaningful information from a subset of these reviews to build a machine learning model that can predict whether a reviewer liked or disliked a mobile phone.

In [1]:
import pandas as pd
import numpy as np

In [2]:
# Read in the data
df = pd.read_csv('../data/Amazon_Unlocked_Mobile.csv', encoding="utf8")

# Randomly sample about 29% of the rows (the sample comes back in shuffled order);
# working on a subset keeps the notebook from hanging
df = df.sample(frac=0.292893, random_state=10)

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 121211 entries, 394349 to 101941
Data columns (total 6 columns):
Product Name    121211 non-null object
Brand Name      102163 non-null object
Price           119499 non-null float64
Rating          121211 non-null int64
Reviews         121188 non-null object
Review Votes    117609 non-null float64
dtypes: float64(2), int64(1), object(3)
memory usage: 6.5+ MB


In [4]:
# Drop every row that has a missing value in any column
df.dropna(inplace=True)

# Remove any 'neutral' ratings equal to 3
# (.copy() avoids a SettingWithCopyWarning when the new column is added below)
df = df[df['Rating'] != 3].copy()

# Encode 4s and 5s as 1 (rated positively)
# Encode 1s and 2s as 0 (rated poorly)
df['Positively Rated'] = np.where(df['Rating'] > 3, 1, 0)

In [5]:
df.head()

| | Product Name | Brand Name | Price | Rating | Reviews | Review Votes | Positively Rated |
|---|---|---|---|---|---|---|---|
| 34377 | Apple iPhone 5c 8GB (Pink) - Verizon Wireless | Apple | 194.99 | 1 | The phone needed a SIM card, would have been n... | 1.0 | 0 |
| 248521 | Motorola Droid RAZR MAXX XT912 M Verizon Smart... | Motorola | 174.99 | 5 | I was 3 months away from my upgrade and my Str... | 3.0 | 1 |
| 167661 | CNPGD [U.S. Office Extended Warranty] Smartwat... | CNPGD | 49.99 | 1 | an experience i want to forget | 0.0 | 0 |
| 73287 | Apple iPhone 7 Unlocked Phone 256 GB - US Vers... | Apple | 922.0 | 5 | GREAT PHONE WORK ACCORDING MY EXPECTATIONS. | 1.0 | 1 |
| 277158 | Nokia N8 Unlocked GSM Touch Screen Phone Featu... | Nokia | 95.0 | 5 | I fell in love with this phone because it did ... | 0.0 | 1 |


# Model Selection in scikit-learn

In [6]:
from sklearn.model_selection import train_test_split

# Split data into train and test subsets
X_train, X_test, y_train, y_test = train_test_split(df["Reviews"], df["Positively Rated"],random_state=0)

In [7]:
# Show the 10th review in the X_train set (position 9, zero-based)
X_train.iloc[9]

'BLU Life XL - LTE Smartphone - GSM Unlocked - 16GB +2GB RAM - Dark BlueI purchased this phone to replace a worn out Samsung Galaxy S2 that I dropped, causing EOL (End of Life) of that hand set. I had to call upon my old HTC LEO that had been NAND Flashed with ICS 4.4.2. My wife calls it my Franken-phone. I was certian I had become a "Samsung" guy, having never been short-changed by a Samsung device of any type. Seriously, how many can say they were still using a S2 until last month. However, it pains me to even think about spending $500+ on a new phone. Therefore, I set out on my quest to find a good as new S5, or something reasonable. I stumbled across this handset while searching for my \'new\' handset. After reading the glowing review by Armin Tamzarian (http://www.amazon.com/gp/pdp/profile/A2XCCN239AR1XK/ref=cm_cr_dp_pdp) on BLU Phones, my mission inadvertantly became a quest to see which BLU handset would please me. I don\'t get sidetracked, really I don\'t. I am a buyer, not a s

In [8]:
# X_train size
X_train.shape

(67669,)

In [9]:
# X_test size
X_test.shape

(22557,)
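
The sizes above are consistent with `train_test_split` holding out 25% of the rows when neither `test_size` nor `train_size` is given (22557 of 90226 rows). The following is a minimal sketch on made-up toy data (the lists `texts` and `labels` are hypothetical and not part of the Amazon dataset) that makes the split explicit and additionally stratifies it by label:

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 8 short reviews with balanced labels
texts = ["good", "bad", "great", "awful", "nice", "poor", "love it", "hate it"]
labels = [1, 0, 1, 0, 1, 0, 1, 0]

# test_size=0.25 spells out the default; stratify keeps the class ratio
# the same in the train and test subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=0
)

print(len(X_tr), len(X_te))
print(sorted(y_te))
```

Stratifying is optional here; it mainly helps when one class is much rarer than the other.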

# Extracting features from text files


A text document is an ordered sequence of words. To run machine learning algorithms on text, we need to convert the documents into numerical feature vectors. We will use the bag-of-words model for this.

## Bag-of-words (BOW)
The BOW model allows us to represent text as numerical feature vectors. The idea behind BOW is quite simple and can be summarized as follows:
1. Create a vocabulary of unique tokens (or words) from the entire set of documents.
2. Construct a feature vector from each document that contains the counts of how often each word occurs in that document.

Since the unique words in each document represent only a small subset of all the words in the bag-of-words vocabulary, the feature vectors will consist of mostly zeros, which is why we call them sparse. For this reason we say that bags of words are typically <b>high-dimensional sparse datasets</b>.

In our example, we segment each document into words (for English, by splitting on whitespace), assign each unique word an integer id, and count the number of times each word occurs in each document. Each unique word in the vocabulary corresponds to one descriptive feature.
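
The two steps can be sketched in plain Python. This is only an illustration under simplifying assumptions (lowercasing and whitespace splitting instead of a real tokenizer); the names `tokenized`, `vocabulary` and `to_count_vector` are made up for this sketch:

```python
from collections import Counter

docs = [
    "The sun is shining",
    "The weather is sweet",
    "The sun is shining and the weather is sweet",
]

# Tokenize: lowercase and split on whitespace (a simplification)
tokenized = [doc.lower().split() for doc in docs]

# Step 1: vocabulary of unique tokens, each mapped to an integer id
# (ids follow alphabetical order)
unique_words = sorted({word for tokens in tokenized for word in tokens})
vocabulary = {word: idx for idx, word in enumerate(unique_words)}

# Step 2: one count vector per document, one position per vocabulary word
def to_count_vector(tokens, vocabulary):
    counts = Counter(tokens)
    return [counts.get(word, 0) for word in vocabulary]

vectors = [to_count_vector(tokens, vocabulary) for tokens in tokenized]

print(vocabulary)
for vector in vectors:
    print(vector)
```

Each vector has one entry per vocabulary word, so its length grows with the vocabulary while most entries stay zero for short documents.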


### Transform words into vectors (CountVectorizer)
To construct a bag-of-words model based on the word counts in the respective documents, we can use <b>`CountVectorizer`</b>, a high-level component implemented in `scikit-learn`. As we will see in the following code, the `CountVectorizer` class takes an array of text data, which can be documents or just sentences, and constructs the bag-of-words model for us:

In [49]:
docs = np.array([
    'The sun is shining',
    'The weather is sweet',
    'The sun is shining and the weather is sweet'])

# Fit the CountVectorizer to the example documents (this learns the vocabulary)
from sklearn.feature_extraction.text import CountVectorizer
vect1 = CountVectorizer().fit(docs)

# Transform the documents into a sparse document-term matrix
# Each printed line reads: (document index, word index)  count
bag = vect1.transform(docs)
print(bag)

  (0, 1)	1
  (0, 2)	1
  (0, 3)	1
  (0, 5)	1
  (1, 1)	1
  (1, 4)	1
  (1, 5)	1
  (1, 6)	1
  (2, 0)	1
  (2, 1)	2
  (2, 2)	1
  (2, 3)	1
  (2, 4)	1
  (2, 5)	2
  (2, 6)	1


In [50]:
# the word counts for each sentence (one row per document, one column per word)
bag.toarray()

array([[0, 1, 1, 1, 0, 1, 0],
       [0, 1, 0, 0, 1, 1, 1],
       [1, 2, 1, 1, 1, 2, 1]], dtype=int64)
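
The printed `(row, column)  count` listing and the dense array are two views of the same matrix: the sparse form stores only the non-zero entries. A small sketch, assuming SciPy's `csr_matrix` (the variable names `dense`, `sparse` and `density` are chosen for illustration), that rebuilds the matrix above and measures how full it is:

```python
import numpy as np
from scipy.sparse import csr_matrix

# The dense document-term matrix shown above
dense = np.array([
    [0, 1, 1, 1, 0, 1, 0],
    [0, 1, 0, 0, 1, 1, 1],
    [1, 2, 1, 1, 1, 2, 1],
])

# Compressed sparse row format keeps only the non-zero entries
sparse = csr_matrix(dense)

n_rows, n_cols = sparse.shape
density = sparse.nnz / (n_rows * n_cols)  # fraction of non-zero cells

print(sparse.nnz, "non-zero entries out of", n_rows * n_cols)
print("density:", density)
```

This toy matrix is fairly dense because the three sentences share most of their words; with a vocabulary of tens of thousands of words and short reviews, almost every cell is zero, which is why the sparse format matters.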

In [51]:
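# the vocabulary words, in column order
# (scikit-learn >= 1.0 provides get_feature_names_out(); get_feature_names() was removed in 1.2)
vect1.get_feature_names()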

['and', 'is', 'shining', 'sun', 'sweet', 'the', 'weather']

In [52]:
# mapping from each word to its column index
vect1.vocabulary_

{'the': 5, 'sun': 3, 'is': 1, 'shining': 2, 'weather': 6, 'sweet': 4, 'and': 0}

Note: the learned vocabulary is available through `get_feature_names()` and the `vocabulary_` attribute, as shown above. Calls such as `vect.get_feature()`, `vect.get_vocabulary()` or `vect.features_` raise `AttributeError`, because a fitted vectorizer has no such attributes.

## <font color=green> Exercise A</font>

1) Apply `CountVectorizer` to the training data

2) Determine: 
- The number of features 
- The shape of the sparse matrix

Jamboard link: https://jamboard.google.com/d/1f416saFqPKj1pCztdm2r_o5Xm5FqpHY8CfQFCo-pcaQ/edit?usp=sharing
You may share your answers in the Jamboard.

In [14]:
# Your code here

In [15]:
from sklearn.feature_extraction.text import CountVectorizer
vect=CountVectorizer().fit(X_train)
X_train_vectorized = vect.transform(X_train)

In [16]:
# Print the first five words of the learned vocabulary
for i,key in enumerate(vect.vocabulary_,start=1):
    if i <=5:
        print(i,key)

1 would
2 not
3 accept
4 my
5 sim


In [17]:
len(vect.get_feature_names())

32490

In [18]:
X_train_vectorized.shape

(67669, 32490)

In [19]:
X_test_vectorized = vect.transform(X_test)
X_test_vectorized.shape

(22557, 32490)

# Logistic Regression classification

We will train a logistic regression model to classify the Amazon reviews as positive or negative, using the bag-of-words feature matrix.

In [20]:
from sklearn.linear_model import LogisticRegression

# Train the model
model = LogisticRegression(solver='lbfgs')
model.fit(X_train_vectorized, y_train)



LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [21]:
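from sklearn.metrics import roc_auc_score

# Predict class labels and class probabilities for the test documents
# (X_test_vectorized was computed above with the same fitted vectorizer)
predictions = model.predict(X_test_vectorized)
y_proba = model.predict_proba(X_test_vectorized)

# AUC is computed from the predicted probability of the positive class
print('AUC: ', roc_auc_score(y_test, y_proba[:,1]))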

AUC:  0.9698306229043259
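
Vectorizing and classifying can also be chained so that raw text goes in and predictions come out, which removes the need to call `transform` by hand on the test data. The following is a sketch using scikit-learn's `make_pipeline` on a handful of invented reviews (`train_texts`, `train_labels` and `pipe` are hypothetical names, and this is not the workflow used in the rest of the notebook):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy reviews; 1 = positive, 0 = negative
train_texts = [
    "great phone, love it",
    "excellent battery",
    "terrible screen",
    "worst phone ever",
    "love the camera",
    "junk, waste of money",
]
train_labels = [1, 1, 0, 0, 1, 0]

# The pipeline fits the vectorizer and the classifier in one call
pipe = make_pipeline(CountVectorizer(), LogisticRegression(solver="lbfgs"))
pipe.fit(train_texts, train_labels)

# Raw strings are vectorized with the fitted vocabulary before prediction
proba = pipe.predict_proba(["love this excellent phone", "terrible junk"])
print(proba[:, 1])  # predicted probability of the positive class
```

A pipeline also guarantees that the vocabulary is learned from the training texts only, which is the same rule the notebook follows by fitting on `X_train`.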


In [22]:
model.coef_

array([[-0.94946532, -0.23841458,  0.07210076, ...,  0.03628066,
         0.00927655,  0.00339736]])

In [23]:
model.coef_[0].argsort()

array([31966, 13012, 16318, ..., 11220, 11201, 11202], dtype=int64)

In [24]:
# get the feature names as numpy array
feature_names = np.array(vect.get_feature_names())

# Sort the coefficients from the model
sorted_coef_index = model.coef_[0].argsort()

# Find the 10 smallest and 10 largest coefficients

# the 10 words with the most negative coefficients (strongest negative signal)
print('Smallest Coefs:' )
print(feature_names[sorted_coef_index[:10]])

# the 10 words with the most positive coefficients;
# the slice [:-11:-1] walks backwards from the end of the sorted index
print('\n Largest Coefs:')      
print(feature_names[sorted_coef_index[:-11:-1]])

Smallest Coefs:
['worst' 'garbage' 'junk' 'freezes' 'unusable' 'useless' 'overheating'
 'waste' 'poor' 'crashed']

 Largest Coefs:
['excelente' 'excelent' 'excellent' 'loves' 'love' 'exelente' 'loving'
 'perfect' 'awesome' 'amazing']


# Tf-idf

When analyzing text data, we often encounter words that occur across multiple documents from both classes. Those frequently occurring words typically don't contain useful or discriminatory information. In this subsection, we will learn about a useful technique called **term frequency-inverse document frequency** (*tf-idf*) that can be used to downweight those frequently occurring words in the feature vectors. In other words, tf-idf reduces the weight of common words (such as "the", "is", "an") that occur in almost every document.

The *tf-idf* can be defined as the product of the term frequency and the inverse document frequency:

\begin{align}
\textit{tf-idf}(t,d) = tf(t,d) \times idf(t,d)
\end{align}

Here <font color=green><b> *tf(t,d)* </b></font> is the term frequency, which equals **the count of the term *t* in document *d* divided by the total number of words in *d***. The inverse document frequency *idf(t,d)* can be calculated as:

\begin{align}
idf(t,d) = \log\frac{n_d}{df(d,t)}
\end{align}

where <font color=green><b> $n_d$ </b></font> is **the total number of documents**, and <font color=green><b>*df(d,t)*</b></font> is **the number of documents *d* that contain the term *t***. Note that the log is used to ensure that low document frequencies are not given too much weight.
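
The definition can be worked through in plain Python on the three example sentences. This is a sketch of the textbook formula exactly as written above, with the natural logarithm assumed and whitespace tokenization as a simplification; the helper names `tf`, `df`, `idf` and `tf_idf` are made up for this illustration:

```python
import math

docs = [
    "The sun is shining",
    "The weather is sweet",
    "The sun is shining and the weather is sweet",
]
tokenized = [doc.lower().split() for doc in docs]
n_d = len(tokenized)  # total number of documents

def tf(term, tokens):
    # count of the term divided by the total number of words in the document
    return tokens.count(term) / len(tokens)

def df(term):
    # number of documents that contain the term
    return sum(1 for tokens in tokenized if term in tokens)

def idf(term):
    return math.log(n_d / df(term))

def tf_idf(term, tokens):
    return tf(term, tokens) * idf(term)

tfidf_is = tf_idf("is", tokenized[0])    # 'is' occurs in all 3 documents
tfidf_sun = tf_idf("sun", tokenized[0])  # 'sun' occurs in 2 of 3 documents

print("is :", tfidf_is)
print("sun:", tfidf_sun)
```

A word that occurs in every document, such as "is", gets an idf of log(3/3) = 0 under this formula and is removed entirely, while "sun" keeps a positive weight.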


scikit-learn implements yet another vectorizer, the TfidfVectorizer, that creates feature vectors as tf-idfs.


In [25]:
from sklearn.feature_extraction.text import TfidfVectorizer


docs = np.array([
    'The sun is shining',
    'The weather is sweet',
    'The sun is shining and the weather is sweet'])

vect2 = TfidfVectorizer().fit(docs)
bag2 = vect2.transform(docs)
bag2.toarray()

array([[0.        , 0.43370786, 0.55847784, 0.55847784, 0.        ,
        0.43370786, 0.        ],
       [0.        , 0.43370786, 0.        , 0.        , 0.55847784,
        0.43370786, 0.55847784],
       [0.40474829, 0.47810172, 0.30782151, 0.30782151, 0.30782151,
        0.47810172, 0.30782151]])

In [26]:
vect2.get_feature_names()

['and', 'is', 'shining', 'sun', 'sweet', 'the', 'weather']

In [27]:
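# Attempt to compute the tf-idf of 'is' in document 1 by hand:
# tf = 1/4, smoothed idf = ln((1 + 3) / (1 + 3)) + 1
# The result does not match the 0.4337 reported by TfidfVectorizer above
1/4* (np.log ( (1 + 3) / (1 + 3) )+1 )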

0.25

```python
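# textbook formula for 'is' in document 1:
# tf = 1/4, idf = log(n_d / df) = log(3 / 3), since 'is' occurs in all 3 documents
1/4 * np.log(3/3)
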
```
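
The hand calculation gives 0.25 rather than the 0.4337 in the `TfidfVectorizer` output because, with its default settings, scikit-learn uses the raw count as the term frequency, a smoothed idf of ln((1 + n_d) / (1 + df)) + 1, and then scales every document vector to unit L2 length. A sketch that reproduces the matrix step by step (the names `counts`, `doc_freq`, `weighted`, `manual` and `reference` are chosen for illustration):

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = [
    "The sun is shining",
    "The weather is sweet",
    "The sun is shining and the weather is sweet",
]

# Raw term counts serve as tf (no division by document length)
counts = CountVectorizer().fit_transform(docs).toarray()
n_d = counts.shape[0]

# Document frequency: in how many documents each term occurs
doc_freq = (counts > 0).sum(axis=0)

# Smoothed idf, as used by the default smooth_idf=True
idf = np.log((1 + n_d) / (1 + doc_freq)) + 1

# Weight the counts, then normalize each row to unit L2 length (norm='l2')
weighted = counts * idf
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

reference = TfidfVectorizer().fit_transform(docs).toarray()
print(np.allclose(manual, reference))
print(manual[0, 1])  # tf-idf of 'is' in document 1
```

For "is" in document 1 the unnormalized weight is 1 × 1 = 1, and dividing by the length of the document vector (about 2.306) gives 0.4337.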

In [28]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Fit the TfidfVectorizer to the training data,
# ignoring terms that appear in fewer than 5 documents
vect = TfidfVectorizer(min_df=5).fit(X_train)
X_train_vectorized = vect.transform(X_train)
X_test_vectorized = vect.transform(X_test)

model = LogisticRegression(solver="lbfgs")
model.fit(X_train_vectorized, y_train)

predictions = model.predict(X_test_vectorized)
y_proba = model.predict_proba(X_test_vectorized)

print('AUC: ', roc_auc_score(y_test, y_proba[:,1]))



AUC:  0.978878338653962


## <font color=green> Exercise B</font> 
- Predict whether the two reviews below are negative or positive using our model: 

      ['no an issue, phone is working', 'an issue, phone is not working']      

In [29]:
# Your code here

In [30]:
a = ['no an issue, phone is working', 'an issue, phone is not working']

# Reuse the vectorizer fitted on the training data; do not refit it on the new reviews
a_vectorized = vect.transform(a)
a_vectorized.toarray()


array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

In [31]:
model.predict(a_vectorized)

array([0, 0])

In [32]:
# Both reviews are predicted as negative because of the words 'no' and 'not':
# single-word features cannot capture what each of them negates

# n-grams

The bag-of-words model that we just created is also called the 1-gram or unigram model: each item or token in the vocabulary represents a single word. More generally, <b>a contiguous sequence of n items in NLP</b> (words, letters, or symbols) is called an n-gram. The choice of the number n in the n-gram model depends on the particular application. For instance, spam filtering applications tend to use n=3 or n=4 for good performance.
To summarize the concept of the n-gram representation, the 1-gram and 2-gram representations of our first document "the sun is shining" would be constructed as follows:
- 1-gram: "the", "sun", "is", "shining"
- 2-gram: "the sun", "sun is", "is shining"
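
The construction above can be sketched as a sliding window over the tokens. This is a simplified illustration (the helper `ngrams` is hypothetical and tokenizes by lowercasing and splitting on whitespace):

```python
def ngrams(text, n):
    """Return the word n-grams of a text as a list of strings."""
    tokens = text.lower().split()
    # Slide a window of n tokens over the sequence
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

unigrams = ngrams("the sun is shining", 1)
bigrams = ngrams("the sun is shining", 2)

print(unigrams)
print(bigrams)
```

A document with k tokens yields k - n + 1 n-grams, so a text shorter than n words produces none.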

The CountVectorizer class in scikit-learn allows us to use different n-gram models via its ngram_range parameter. By default, it uses a 1-gram representation.

In [33]:
# Try a representation that combines 1-grams and 2-grams: ngram_range=(1,2)
docs = np.array([
    'The sun is shining',
    'The weather is sweet',
    'The sun is shining and the weather is sweet'])

vect3=CountVectorizer(ngram_range=(1,2)).fit(docs)
bag3=vect3.transform(docs)
vect3.get_feature_names()

['and',
 'and the',
 'is',
 'is shining',
 'is sweet',
 'shining',
 'shining and',
 'sun',
 'sun is',
 'sweet',
 'the',
 'the sun',
 'the weather',
 'weather',
 'weather is']

In [34]:
vect = CountVectorizer(min_df=5, ngram_range=(1,2)).fit(X_train)
X_train_vectorized = vect.transform(X_train)

In [35]:
len(vect.get_feature_names())

72703

In [36]:
model = LogisticRegression(solver = "lbfgs")
model.fit(X_train_vectorized, y_train)

# Transform the test documents once and reuse the result
X_test_vectorized = vect.transform(X_test)
predictions = model.predict(X_test_vectorized)
y_proba = model.predict_proba(X_test_vectorized)

print('AUC: ', roc_auc_score(y_test, y_proba[:,1]))



AUC:  0.9806215599135434


In [37]:
feature_names = np.array(vect.get_feature_names())

sorted_coef_index = model.coef_[0].argsort()

print('Smallest Coefs:' )
print(feature_names[sorted_coef_index[:10]])
      
print('\n Largest Coefs:')      
print(feature_names[sorted_coef_index[:-11:-1]])

Smallest Coefs:
['no good' 'worst' 'junk' 'not good' 'poor' 'horrible' 'broken' 'garbage'
 'terrible' 'not happy']

 Largest Coefs:
['excelente' 'excelent' 'excellent' 'not bad' 'perfect' 'awesome'
 'no problems' 'great' 'love' 'amazing']


In [38]:
print(model.predict(vect.transform(['no an issue, phone is working',
                                    'an issue, phone is not working'])))

[1 0]


# Text Classification

## Using sklearn's NaiveBayes Classifier


### <font color=green> Exercise C</font> 
1. Classify the Amazon reviews dataset using a Naive Bayes classifier (`MultinomialNB`)
2. Evaluate your classifier

In [39]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import metrics

In [40]:
# Your code here

In [41]:
# Tf-idf features over 1-grams and 2-grams
vect = TfidfVectorizer(min_df=5,ngram_range=(1,2)).fit(X_train)
X_train_vectorized = vect.transform(X_train)
X_test_vectorized = vect.transform(X_test)

# Fit a multinomial Naive Bayes classifier and predict the test labels
model = MultinomialNB()
model.fit(X_train_vectorized, y_train)
pred = model.predict(X_test_vectorized)

In [42]:
print(X_train_vectorized.shape)
print(X_test_vectorized.shape)

(67669, 72703)
(22557, 72703)


In [43]:
# Compare the first five true labels with the predicted labels
tuple(zip(y_test[:5],pred[:5]))

((1, 1), (0, 0), (1, 1), (1, 1), (1, 1))

In [44]:
from sklearn import metrics
print("Accuracy:", metrics.accuracy_score(y_test,pred))
print("Precision: ", metrics.precision_score(y_test,pred))
print("Recall: ", metrics.recall_score(y_test,pred))
print("f1: ", metrics.f1_score(y_test,pred))

Accuracy: 0.9333687990424259
Precision:  0.9306448906538397
Recall:  0.984439711276772
f1:  0.9567867513872516
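
Precision, recall and F1 all derive from the four counts of the confusion matrix. A small sketch on invented labels (`y_true` and `y_hat` are hypothetical and unrelated to the results above) that computes them by hand and checks them against `sklearn.metrics`:

```python
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Hypothetical true and predicted labels; 1 = positive, 0 = negative
y_true = [1, 1, 1, 1, 0, 0, 0, 1]
y_hat = [1, 1, 1, 0, 0, 1, 0, 1]

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

precision = tp / (tp + fp)  # share of predicted positives that are correct
recall = tp / (tp + fn)     # share of actual positives that are found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(tn, fp, fn, tp)
print(precision, recall, f1)
print(precision_score(y_true, y_hat), recall_score(y_true, y_hat), f1_score(y_true, y_hat))
```

Read this way, the results above say that the Naive Bayes model finds nearly all positive reviews (recall of about 0.98) while a somewhat larger share of its positive predictions are wrong (precision of about 0.93).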


In [45]:
# model.score(X_test_vectorized,ytest) does not work here; note that the test labels are stored in y_test