## Tutorial 20. Sentiment analysis

Created by Emanuel Flores-Bautista 2019  All content contained in this notebook is licensed under a [Creative Commons License 4.0 BY NC](https://creativecommons.org/licenses/by-nc/4.0/). The code is licensed under a [MIT license](https://opensource.org/licenses/MIT).


In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.neural_network import MLPClassifier
from sklearn.decomposition import PCA
from sklearn.metrics import classification_report,accuracy_score,confusion_matrix
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from keras.datasets import imdb
import TCD19_utils as TCD

TCD.set_plotting_style_2()

%matplotlib inline
%config InlineBackend.figure_format = 'svg'

Using TensorFlow backend.


We will train a classifier movie for reviews in the IMDB data set.

In [9]:
import tensorflow as tf

#from tensorflow import keras as tf.keras

In [17]:
#import numpy as np
# save np.load
np_load_old = np.load

# modify the default parameters of np.load
np.load = lambda *a,**k: np_load_old(*a, allow_pickle=True, **k)

# call load_data with allow_pickle implicitly set to true
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=5000)

# restore np.load for future normal usage
np.load = np_load_old

from keras.datasets import imdb

(x_train, y_train), (x_test, y_test) = imdb.load_data(path="imdb.npz",
                                                      num_words=5000,
                                                      skip_top=0,
                                                      maxlen=None,
                                                      seed=113,
                                                      start_char=1,
                                                      oov_char=2,
                                                      index_from=3)


(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words = vocabulary_size)#,allow_pickle = True)
#print('Loaded dataset with {} training samples,{} test samples'.format(len(X_train), len(X_test)))

In [19]:
len(X_train[0])

218

In [20]:
print('---review---')
print(X_train[6])
print('---label---')
print(y_train[6])

---review---
[1, 2, 365, 1234, 5, 1156, 354, 11, 14, 2, 2, 7, 1016, 2, 2, 356, 44, 4, 1349, 500, 746, 5, 200, 4, 4132, 11, 2, 2, 1117, 1831, 2, 5, 4831, 26, 6, 2, 4183, 17, 369, 37, 215, 1345, 143, 2, 5, 1838, 8, 1974, 15, 36, 119, 257, 85, 52, 486, 9, 6, 2, 2, 63, 271, 6, 196, 96, 949, 4121, 4, 2, 7, 4, 2212, 2436, 819, 63, 47, 77, 2, 180, 6, 227, 11, 94, 2494, 2, 13, 423, 4, 168, 7, 4, 22, 5, 89, 665, 71, 270, 56, 5, 13, 197, 12, 161, 2, 99, 76, 23, 2, 7, 419, 665, 40, 91, 85, 108, 7, 4, 2084, 5, 4773, 81, 55, 52, 1901]
---label---
1


Note that the review is stored as a sequence of integers. From the [Keras documentation](https://keras.io/datasets/) we can see that these are words IDs that have been pre-assigned to individual words, and the label is an integer (0 for negative, 1 for positive). We can go ahead and access the words from each review with the `get_word_index()` method from the `imdb` object.

In [21]:
word2id = imdb.get_word_index()
id2word = {i: word for word, i in word2id.items()}
print('---review with words---')
print([id2word.get(i, ' ') for i in X_train[6]])
print('---label---')
print(y_train[6])

---review with words---
['the', 'and', 'full', 'involving', 'to', 'impressive', 'boring', 'this', 'as', 'and', 'and', 'br', 'villain', 'and', 'and', 'need', 'has', 'of', 'costumes', 'b', 'message', 'to', 'may', 'of', 'props', 'this', 'and', 'and', 'concept', 'issue', 'and', 'to', "god's", 'he', 'is', 'and', 'unfolds', 'movie', 'women', 'like', "isn't", 'surely', "i'm", 'and', 'to', 'toward', 'in', "here's", 'for', 'from', 'did', 'having', 'because', 'very', 'quality', 'it', 'is', 'and', 'and', 'really', 'book', 'is', 'both', 'too', 'worked', 'carl', 'of', 'and', 'br', 'of', 'reviewer', 'closer', 'figure', 'really', 'there', 'will', 'and', 'things', 'is', 'far', 'this', 'make', 'mistakes', 'and', 'was', "couldn't", 'of', 'few', 'br', 'of', 'you', 'to', "don't", 'female', 'than', 'place', 'she', 'to', 'was', 'between', 'that', 'nothing', 'and', 'movies', 'get', 'are', 'and', 'br', 'yes', 'female', 'just', 'its', 'because', 'many', 'br', 'of', 'overly', 'to', 'descent', 'people', 'time', 

Because we cannot feed the index matrix directly to the classifier, we need to perform some data wrangling and feature extraction abilities. We're going to write a couple of functions, in order to 

1. Get a list of reviews, consisting of full length strings. 
2. Perform TF-IDF feature extraction on the reviews documents. 

### Feature engineering

In [22]:
def get_joined_rvw(X):
    
    """
    
    Given an X_train or X_test dataset from the IMDB reviews
    of Keras, return a list of the reviews in string format. 
    
    """
    
    #Get word to index dictionary
    word2id = imdb.get_word_index()
    #Get index to word mapping dictionary
    id2word = {i: word for word, i in word2id.items()}
    
    #Initialize reviews list
    doc_list = []
    
    for review in X:
        #Extract review
        initial_rvw = [id2word.get(i) for i in review]
        
        #Join strings followed by spaces
        joined_rvw = " ".join(initial_rvw)
        
        #Append review to the doc_list
        doc_list.append(joined_rvw)
        
    return doc_list

In [23]:
train_rvw = get_joined_rvw(X_train)
test_rvw = get_joined_rvw(X_test)

In [31]:
tf_idf_vectorizer = TfidfVectorizer(max_df=0.95, min_df=2,
                                   max_features=vocabulary_size,
                                   stop_words='english')
    
tf_idf_train = tf_idf_vectorizer.fit_transform(train_rvw)

tf_idf_test = tf_idf_vectorizer.fit_transform(test_rvw)

#tf_idf_feature_names = tf_idf_vectorizer.get_feature_names() 

#tf_idf = np.vstack([tf_idf_train.toarray(), tf_idf_test.toarray()])

#X_new = pd.DataFrame(tf_idf, columns=tf_idf_feature_names)

X_train_new = tf_idf_train.toarray()

X_test_new = tf_idf_test.toarray()

In [38]:
X_test_new.shape

(25000, 4622)

In [24]:
def get_data_from_keras_imdb():
    
    """
    
    Extract TF-IDF matrices for the Keras IMDB dataset. 
    
    """
    vocabulary_size = 1000
    (X_train, y_train), (X_test, y_test) = imdb.load_data(num_words = vocabulary_size)
    
    #X = np.vstack([X_train[:, None], X_test[:, None]])
    
    X_train_docs = get_joined_rvw(X_train)
    X_test_docs = get_joined_rvw(X_test)
    
    
    tf_idf_vectorizer = TfidfVectorizer(max_df=0.95, min_df=2,
                                   max_features=vocabulary_size,
                                   stop_words='english')
    
    tf_idf_train = tf_idf_vectorizer.fit_transform(X_train_docs)
    
    tf_idf_test = tf_idf_vectorizer.fit_transform(X_test_docs)
    
    #tf_idf_feature_names = tf_idf_vectorizer.get_feature_names() 
    
    #tf_idf = np.vstack([tf_idf_train.toarray(), tf_idf_test.toarray()])
    
    #X_new = pd.DataFrame(tf_idf, columns=tf_idf_feature_names)
    
    X_train_new = tf_idf_train.toarray()
    
    X_test_new = tf_idf_test.toarray()

    
    return X_train_new, y_train, X_test_new, y_test 

X_train, y_train, X_test, y_test  = get_data_from_keras_imdb()

In [26]:
print('train dataset shape', X_train.shape)
print('test dataset shape', X_test.shape)

train dataset shape (25000,)
test dataset shape (25000,)


We can readily see that we are ready to train our classification algorithm with the TF-IDF matrices. 

### ML Classification: Model bulding and testing

In [39]:
model = RandomForestClassifier(n_estimators=200, max_depth=3, random_state=42)
model.fit(X_train_new[:, :-1], y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=3, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=200,
                       n_jobs=None, oob_score=False, random_state=42, verbose=0,
                       warm_start=False)

In [40]:
y_pred = model.predict(X_test_new)

In [41]:
print(classification_report(y_test, y_pred))
print('Accuracy score : ', accuracy_score(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.82      0.58      0.68     12500
           1       0.68      0.87      0.76     12500

    accuracy                           0.73     25000
   macro avg       0.75      0.73      0.72     25000
weighted avg       0.75      0.73      0.72     25000

Accuracy score :  0.72704


In [42]:
model = MLPClassifier()

model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print(classification_report(y_test, y_pred))
print('Accuracy score : ', accuracy_score(y_test, y_pred))

ValueError: setting an array element with a sequence.

In [None]:
from sklearn.model_selection import cross_val_score
cross_val_score(model, X_train_new[], y_train, cv=5)

In [None]:
import manu_utils as TCD
palette = TCD.palette(cmap = True)

In [None]:
C = confusion_matrix(y_test, y_pred)
c_normed = C / C.astype(np.float).sum(axis=1) [:, np.newaxis]

sns.heatmap(c_normed, cmap = palette,
            xticklabels=['negative', 'positive'], 
           yticklabels=['negative', 'positive'],
          annot= True, vmin = 0, vmax = 1, 
           cbar_kws = {'label': 'recall'})

#

plt.ylabel('True label')
plt.xlabel('Predicted label');

### Sci-kit learn pipelines

In [43]:
from sklearn.pipeline import make_pipeline

pipe = make_pipeline(TfidfVectorizer(max_df=0.95, min_df=2,
                                   max_features=vocabulary_size,
                                   stop_words='english'), MLPClassifier())

pipe.fit(train_rvw, y_train)
labels = pipe.predict(test_rvw)

In [44]:
targets = ['negative','positive']

In [45]:
def predict_category(s, model=pipe):
    pred = pipe.predict([s])
    return targets[pred[0]]


In [None]:
predict_category('this was a hell of a good movie')

In [None]:
predict_category('this was a freaking crappy time yo')