# Sentiment Analysis using Python and Pandas

## Preprocessing Textual Data using Pandas

In this first section, we look at how textual data can be processed by a computer. We use NumPy to hold the raw sentences, scikit-learn to vectorise them, and Pandas to inspect the result. First, we store some sentences in an array. Then we vectorise them, which means that each sentence is transformed into a numeric vector, and all of these vectors have the same length.

In [0]:
import numpy as np
# Create text
text_data = np.array(['I love Paris. Paris!',
                      'Japan is best',
                      'Germany is lovely',
                      'I do like Iceland',
                      'I love that movie',
                      'Cake is bad for you',
                      'Dont go to that restaurant',
                      'Very nice pies'])

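Below, we create a so-called Bag of Words: a matrix with one row per sentence and one column per distinct word, in which each entry counts how often that word occurs in that sentence.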

In [2]:
from sklearn.feature_extraction.text import CountVectorizer
# Create the bag of words feature matrix
count = CountVectorizer()
bag_of_words = count.fit_transform(text_data)

# Show feature matrix
bag_of_words.toarray()

array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 2, 0, 0, 0, 0, 0,
        0],
       [0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0],
       [0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
        0],
       [0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0,
        0],
       [1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        1],
       [0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0,
        0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1,
        0]])
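
To make the matrix above easier to read, here is a minimal pure-Python sketch of what a bag-of-words count does. It is not scikit-learn's implementation: the function name `bag_of_words_sketch` is made up for illustration, and the token pattern (lower-cased words of at least two characters) is assumed to mirror the defaults of `CountVectorizer`, which would explain why the one-letter word 'I' has no column.

```python
import re
from collections import Counter

def bag_of_words_sketch(documents):
    """Return a sorted vocabulary and one row of word counts per document."""
    # Assumption: words are lower-cased and must be at least two characters long
    token_pattern = re.compile(r"\b\w\w+\b")
    tokenised = [token_pattern.findall(doc.lower()) for doc in documents]

    # The vocabulary is the sorted set of all distinct tokens
    vocabulary = sorted({token for doc in tokenised for token in doc})

    # Each row counts how often every vocabulary word occurs in one document
    matrix = []
    for doc in tokenised:
        counts = Counter(doc)
        matrix.append([counts[word] for word in vocabulary])
    return vocabulary, matrix

vocab, rows = bag_of_words_sketch(['I love Paris. Paris!', 'Japan is best'])
print(vocab)  # ['best', 'is', 'japan', 'love', 'paris']
print(rows)   # [[0, 0, 0, 1, 2], [1, 1, 1, 0, 0]]
```

Every row has the same length as the vocabulary, which is what lets a model treat sentences of different lengths as vectors of equal size.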

If we want, we can inspect the features (in this case the words) and convert them into a pandas table. This allows us to easily inspect the data and see whether everything is there. Moreover, we can add the label (i.e. the positive or negative sentiment label) to this table.

In [3]:
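# Get feature names (get_feature_names_out replaces the older get_feature_names,
# which recent scikit-learn versions no longer provide)
feature_names = count.get_feature_names_out().tolist()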

# View feature names
feature_names

['bad',
 'best',
 'cake',
 'do',
 'dont',
 'for',
 'germany',
 'go',
 'iceland',
 'is',
 'japan',
 'like',
 'love',
 'lovely',
 'movie',
 'nice',
 'paris',
 'pies',
 'restaurant',
 'that',
 'to',
 'very',
 'you']

In [4]:
import pandas as pd

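# This array is listed again for reference only. bag_of_words was already computed
# from the earlier array, in which the fourth sentence reads 'I do like Iceland',
# so the table below still has a count in the 'do' column for that row.
text_data = np.array(['I love Paris. Paris!',
                      'Japan is best',
                      'Germany is lovely',
                      'I dont like Iceland',
                      'I love that movie',
                      'Cake is bad for you',
                      'Dont go to that restaurant',
                      'Very nice pies'])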

# Create data frame
data_frame = pd.DataFrame(bag_of_words.toarray(), columns=feature_names)
data_frame

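|  | bad | best | cake | do | dont | for | germany | go | iceland | is | japan | like | love | lovely | movie | nice | paris | pies | restaurant | that | to | very | you |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 5 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 6 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 |
| 7 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |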


In [5]:
# Add the sentiment labels (1 = positive, 0 = negative)
data_frame['Sentiment'] = [1,1,1,0,1,0,0,1]
data_frame

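|  | bad | best | cake | do | dont | for | germany | go | iceland | is | japan | like | love | lovely | movie | nice | paris | pies | restaurant | that | to | very | you | Sentiment |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 3 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
| 5 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 6 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 |
| 7 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 |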


## Preprocessing the Data for Machine Learning

In this section, we work towards the implementation of the machine learning algorithm. This is just a 'toy' example of what this looks like, as in reality we would need much more data. A later section uses a larger data set, but for now it is good to get a feeling for what implementing such a network might look like.

In [0]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier

First, we separate the labels from the features. After that, we split the table into two: a training set and a test set.

In [0]:
labels = data_frame['Sentiment']
data = data_frame.drop('Sentiment',axis=1)

In [8]:
#Check the shape of the data frame
data_frame.shape

(8, 24)

In [0]:
# Split the data into a training set and a test set (by default, 25% is held out for testing)
X_train, X_test, y_train, y_test = train_test_split(data, labels)
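
As a rough illustration of what such a split does, the sketch below shuffles the row indices and cuts off a test portion. It is not the scikit-learn implementation: `split_sketch`, `test_fraction` and `seed` are names chosen here, and the 25% test share is an assumption that matches the library default.

```python
import random

def split_sketch(rows, labels, test_fraction=0.25, seed=0):
    """Shuffle the row indices, then cut off a test portion (illustrative only)."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)  # a fixed seed makes the split repeatable

    n_test = max(1, round(len(rows) * test_fraction))
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def pick(sequence, chosen):
        return [sequence[i] for i in chosen]

    return (pick(rows, train_idx), pick(rows, test_idx),
            pick(labels, train_idx), pick(labels, test_idx))

toy_rows = [[i] for i in range(8)]        # eight toy feature rows
toy_labels = [1, 1, 1, 0, 1, 0, 0, 1]     # the eight sentiment labels
X_tr, X_te, y_tr, y_te = split_sketch(toy_rows, toy_labels)
print(len(X_tr), len(X_te))  # 6 2
```

With only eight rows, the test set ends up with two sentences, so any score computed on it can only be very coarse. In the real `train_test_split`, passing a `random_state` makes the split reproducible in the same way the seed does here.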

In [0]:
# Standardisation: rescale every feature to zero mean and unit variance
scaler = StandardScaler()

In [11]:
# Fit only to the training data
scaler.fit(X_train)

StandardScaler(copy=True, with_mean=True, with_std=True)

In [0]:
# Scale both sets with the statistics of the training set; transform() returns NumPy arrays instead of DataFrames
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
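
The reason for fitting on the training data only is that the test set should be scaled with statistics it did not contribute to. A minimal pure-Python sketch of that idea follows; the function names are hypothetical, and treating a constant column as having a scale of 1 is an assumption made here to avoid dividing by zero.

```python
import math

def fit_scaler_sketch(rows):
    """Compute per-column means and (population) standard deviations."""
    n = len(rows)
    columns = list(zip(*rows))
    means = [sum(col) / n for col in columns]
    stds = []
    for col, mean in zip(columns, means):
        std = math.sqrt(sum((x - mean) ** 2 for x in col) / n)
        stds.append(std if std > 0 else 1.0)  # constant column: leave the scale at 1
    return means, stds

def transform_sketch(rows, means, stds):
    """Subtract the mean and divide by the standard deviation, column by column."""
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]

train_rows = [[0, 1], [2, 1], [4, 1]]
test_rows = [[2, 0]]

means, stds = fit_scaler_sketch(train_rows)               # statistics from training data only
scaled_train = transform_sketch(train_rows, means, stds)
scaled_test = transform_sketch(test_rows, means, stds)    # reuses the training statistics
print(scaled_test)  # [[0.0, -1.0]]
```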

In [0]:
#type(X_train)

## Training the Neural Network

After preprocessing, the next step is to feed the preprocessed data to a machine learning algorithm and let it train on the training data.

In [0]:
from sklearn.metrics import classification_report,confusion_matrix

We define a model and store it (by assigning it to a variable) in order to work with it later on. Moreover, this allows us to compare it easily to other models and saves us from training the same model more than once. The model below is a very simple one.

In [0]:
# Define the model: three hidden layers of 13 units each, at most 500 training iterations
mlp = MLPClassifier(hidden_layer_sizes=(13,13,13),max_iter=500)

Having defined the model, we fit it to the training data. This is the 'actual' feeding: the model sees the training examples together with their labels.

In [16]:
#Fit the data 
mlp.fit(X_train,y_train)

MLPClassifier(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
              beta_2=0.999, early_stopping=False, epsilon=1e-08,
              hidden_layer_sizes=(13, 13, 13), learning_rate='constant',
              learning_rate_init=0.001, max_fun=15000, max_iter=500,
              momentum=0.9, n_iter_no_change=10, nesterovs_momentum=True,
              power_t=0.5, random_state=None, shuffle=True, solver='adam',
              tol=0.0001, validation_fraction=0.1, verbose=False,
              warm_start=False)

## Evaluating the Neural Network

Having trained a machine learning algorithm is one thing. We also want to evaluate its performance, because without an evaluation we cannot say whether it performs well on unseen examples. We test the performance of the model using the test set.

In [0]:
predictions = mlp.predict(X_test)

In [18]:
print(confusion_matrix(y_test,predictions))

[[0 2]
 [0 0]]


Finally, we can print the performance scores and see how our model did. In this case the scores are not meaningful, as the algorithm saw very little data and the test set holds only two sentences. Still, it is good to get a feeling for how evaluation works and what we evaluate on.

In [19]:
print(classification_report(y_test,predictions))

              precision    recall  f1-score   support

           0       0.00      0.00      0.00       2.0
           1       0.00      0.00      0.00       0.0

    accuracy                           0.00       2.0
   macro avg       0.00      0.00      0.00       2.0
weighted avg       0.00      0.00      0.00       2.0
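
To see where these zeros come from, the sketch below recomputes the confusion matrix and the per-class precision and recall by hand. The lists `y_true` and `y_pred` are inferred from the confusion matrix printed earlier, assuming rows stand for the true class and columns for the predicted class; the function names are made up for this illustration.

```python
def confusion_matrix_sketch(y_true, y_pred, classes=(0, 1)):
    """Count how often each true class (row) was predicted as each class (column)."""
    matrix = [[0 for _ in classes] for _ in classes]
    for truth, guess in zip(y_true, y_pred):
        matrix[classes.index(truth)][classes.index(guess)] += 1
    return matrix

def precision_recall_sketch(matrix, k):
    """Precision and recall for class k; 0.0 when the denominator is zero."""
    true_positives = matrix[k][k]
    predicted_as_k = sum(row[k] for row in matrix)   # column total
    actually_k = sum(matrix[k])                      # row total
    precision = true_positives / predicted_as_k if predicted_as_k else 0.0
    recall = true_positives / actually_k if actually_k else 0.0
    return precision, recall

y_true = [0, 0]   # both test sentences are negative
y_pred = [1, 1]   # both were predicted as positive

cm = confusion_matrix_sketch(y_true, y_pred)
accuracy = sum(cm[k][k] for k in range(len(cm))) / len(y_true)
print(cm)        # [[0, 2], [0, 0]]
print(accuracy)  # 0.0
```

Both test sentences were misclassified, so every score is zero; class 1 never occurs in the test set, which is why some of the scores are undefined and reported as 0.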





# IMDB Data Set

In this section, we use a sufficiently large data set, the IMDB review data set. Moreover, we use a more complex model. However, the procedure is the same as before.

In [20]:
import tensorflow as tf
import keras

# Only the data set module is imported here; the layers are reached through keras.layers below
from keras.datasets import imdb

Using TensorFlow backend.


Once again, we import the data set. It comes already split into a training set and a test set.

In [21]:
# Keep only the 10,000 most frequent words
NUM_WORDS = 10000
# Shift the word indices by 3 so that ids 0, 1 and 2 stay free for special tokens
INDEX_FROM = 3

train,test = keras.datasets.imdb.load_data(num_words=NUM_WORDS, index_from=INDEX_FROM)
train_x,train_y = train
test_x,test_y = test

#print(test_x)

Downloading data from https://s3.amazonaws.com/text-datasets/imdb.npz


The function `id_to_word_funct` below lets you see what the reviews in the data set look like. The imported data set is already encoded: each review is a sequence of integers, and every integer stands for one word. We can decode a review back into a readable sentence using this function.
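
Before turning to the real data, the decoding idea can be sketched with a toy word index. The three words and their ranks are invented for this illustration, and `INDEX_OFFSET` plays the role of the index shift that keeps ids 0, 1 and 2 free for the special tokens.

```python
INDEX_OFFSET = 3  # ids 0, 1 and 2 are reserved for special tokens

# Hypothetical word index: rank 1 is the most frequent word
toy_word_index = {"the": 1, "film": 2, "great": 3}

# Shift every rank by the offset, then add the special tokens
toy_word_to_id = {word: rank + INDEX_OFFSET for word, rank in toy_word_index.items()}
toy_word_to_id.update({"<PAD>": 0, "<START>": 1, "<UNK>": 2})

# Invert the mapping so that ids can be turned back into words
toy_id_to_word = {idx: word for word, idx in toy_word_to_id.items()}

def decode_sketch(sequence):
    """Turn a list of integer ids into a readable string."""
    return " ".join(toy_id_to_word.get(idx, "<UNK>") for idx in sequence)

decoded = decode_sketch([1, 4, 6, 5, 2])
print(decoded)  # <START> the great film <UNK>
```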

In [22]:
# Takes i, the index of a review in the training set, and returns that review as a string.

def id_to_word_funct(i):
    word_to_id = keras.datasets.imdb.get_word_index()
    # Shift every index by INDEX_FROM to make room for the special tokens below
    word_to_id = {k: (v + INDEX_FROM) for k, v in word_to_id.items()}
    word_to_id["<PAD>"] = 0
    word_to_id["<START>"] = 1
    word_to_id["<UNK>"] = 2
    id_to_word = {value: key for key, value in word_to_id.items()}
    # word_id avoids shadowing the built-in id(); unknown ids fall back to <UNK>
    return ' '.join(id_to_word.get(word_id, "<UNK>") for word_id in train_x[i])

# Labels: 1 is positive, 0 is negative
id_to_word_funct(10)
#print(train_y[7])

Downloading data from https://s3.amazonaws.com/text-datasets/imdb_word_index.json


"<START> french horror cinema has seen something of a revival over the last couple of years with great films such as inside and <UNK> romance <UNK> on to the scene <UNK> <UNK> the revival just slightly but stands head and shoulders over most modern horror titles and is surely one of the best french horror films ever made <UNK> was obviously shot on a low budget but this is made up for in far more ways than one by the originality of the film and this in turn is <UNK> by the excellent writing and acting that ensure the film is a winner the plot focuses on two main ideas prison and black magic the central character is a man named <UNK> sent to prison for fraud he is put in a cell with three others the quietly insane <UNK> body building <UNK> marcus and his retarded boyfriend daisy after a short while in the cell together they stumble upon a hiding place in the wall that contains an old <UNK> after <UNK> part of it they soon realise its magical powers and realise they may be able to use it

As before, we have to make a feature vector. Moreover, because every vector has to be of the same length, we need to add so-called padding. Padding means that reviews shorter than the chosen length (here 256) are filled up with 0's, in this case at the front; longer reviews are cut down to that length.

In [0]:
#Add padding to convert the data points to the same size
train_x_seq = keras.preprocessing.sequence.pad_sequences(train_x, maxlen=256, padding='pre', value=0.0)
test_x_seq = keras.preprocessing.sequence.pad_sequences(test_x, maxlen=256, padding='pre', value=0.0)
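
What padding does to a single sequence can be shown in a few lines of plain Python. This is a sketch, not the Keras implementation: `pad_pre_sketch` is a hypothetical name, and cutting long sequences at the front is an assumption that is meant to match the default behaviour of `pad_sequences`.

```python
def pad_pre_sketch(sequences, maxlen, value=0):
    """Left-pad short sequences with `value` and keep only the last `maxlen` items of long ones."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]                       # cut long sequences at the front
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded

padded_example = pad_pre_sketch([[5, 6], [1, 2, 3, 4, 5, 6]], maxlen=4)
print(padded_example)  # [[0, 0, 5, 6], [3, 4, 5, 6]]
```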

In [0]:
#type(NUM_WORDS)

We can now define our model, as we did before. The model is a bit more complicated, but the idea is the same: we define a model and add layers to it. Because this model stacks several layers, we call it a deep network.

In [25]:
# input shape is the vocabulary count used for the movie reviews (10,000 words)
vocab_size = NUM_WORDS

model = keras.Sequential()
model.add(keras.layers.Embedding(vocab_size, 16))
model.add(keras.layers.GlobalAveragePooling1D())
model.add(keras.layers.Dense(16, activation=tf.nn.relu))
model.add(keras.layers.Dense(1, activation=tf.nn.sigmoid))

model.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, None, 16)          160000    
_________________________________________________________________
global_average_pooling1d_1 ( (None, 16)                0         
_________________________________________________________________
dense_1 (Dense)              (None, 16)                272       
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 17        
Total params: 160,289
Trainable params: 160,289
Non-trainable params: 0
_________________________________________________________________


In [26]:
#Check the length 
print(len(train_x_seq[0]))

256


Here we compile the model. This means that we choose an optimizer, a loss function, which the training process minimises, and a metric, which is reported so that we can judge the model. In this case we choose accuracy ('acc') as the metric: the fraction of reviews that is classified correctly.

In [0]:
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['acc'])
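
To give a feeling for the difference between the loss and the metric, the sketch below computes both by hand for four made-up predictions. The probabilities are invented for illustration, and the clipping constant `eps` is an assumption that only guards against taking the logarithm of zero.

```python
import math

def binary_crossentropy_sketch(y_true, y_prob, eps=1e-7):
    """Average negative log-likelihood of the true labels under the predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)   # keep p strictly between 0 and 1
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

def accuracy_sketch(y_true, y_prob, threshold=0.5):
    """Fraction of examples whose thresholded prediction equals the label."""
    hits = sum(int(p > threshold) == y for y, p in zip(y_true, y_prob))
    return hits / len(y_true)

toy_labels = [1, 0, 1, 0]
toy_probs = [0.9, 0.2, 0.4, 0.6]   # hypothetical sigmoid outputs

toy_loss = binary_crossentropy_sketch(toy_labels, toy_probs)
toy_acc = accuracy_sketch(toy_labels, toy_probs)
print(round(toy_loss, 4), toy_acc)  # 0.5403 0.5
```

The loss reacts to how confident each prediction is, whereas accuracy only looks at which side of the threshold it falls on. That is why the loss is used for training and accuracy for reporting.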

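Below is an intermediate step which we haven't encountered before: validation. Instead of only training an algorithm and then testing it, we set aside part of the labelled training data and let the model make predictions on it after every epoch. The model does not learn from these examples; they show how well it generalises during training, which helps to spot overfitting while the test set stays untouched.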

In [0]:
# Hold out the first 10,000 training reviews as a validation set
x_val = train_x_seq[:10000]
partial_x_train = train_x_seq[10000:]

y_val = train_y[:10000]
partial_y_train = train_y[10000:]

In [29]:
history = model.fit(partial_x_train,
                    partial_y_train,
                    epochs=40,
                    batch_size=512,
                    validation_data=(x_val, y_val),
                    verbose=1)

  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "


Train on 15000 samples, validate on 10000 samples
Epoch 1/40
...
Epoch 40/40


Finally, we can test our model on the test data. The result holds the loss first, followed by the accuracy.

In [30]:
results = model.evaluate(test_x_seq, test_y)

print(results)

[0.3272354226446152, 0.8731600046157837]


# Let's write and predict our own review

Our model reaches an accuracy of about 0.87 on the test data. This means that roughly 87% of the test reviews were classified correctly. That is all well and good, but we would also like to see how the model handles a review of our own. For this reason, please write your own movie review (negative or positive) and see whether the prediction the model makes is correct.

In [0]:
import re 

#review = 'Then youll like also the movie! Of course if you see only the movie probably youll say this is a boring movie without an intense screenplay! And probably youll be right! But this is not a movie! This is a celebration of downton abbey and the characters weve loved! You will feel nostalgia! So if you want to see an extra long episode of your favorite tv show go to the cinema!'
#review = 'Saw a preview tonight, which thankfully gave a catch up or refresher which may have gone way too fast for someone who had never seen the series. It was a lovely film, costumes and cinematography worthy, but the storyline was dull, and the acting predictable. Prediction: no awards but a moneymaker'
#review = 'Sorry folks but this movie is a C movie and not an A. I cant believe that people like this movie so much. Yes I am biased, I really dont like Tim Robbins but he has made a decent movie or two but this aint one of them. This is your basic prison movie. What can I say it has the host of usual characters. The storyline is basic, the outcome I wont say but it is just stupid to say that this is a great movie. I didnt think it was realistic at all and anybody who falls for this crap is either delusional or needs some brains to differentiate between an Edsel and a Lamborghini. I wouldt watch this movie again unless I was paid to do so. One viewing is enough for this predictable laugher.'
#review = 'It is one of the best movies in the world that many people will join me. Watching very, very impressed. What it might be called the great stories of Stephen King anyway. Acting on cues can say is enormous. Theres no posturing. Especially when Andy Dufrense incredible music that review still stands wandering on everyones lips. Finale as you lose yourself in it is a movie you are getting into. Finals if he so exhilarating you a high-quality music. And finding that deserves really deserve. I watched this film very young age and it has taught me that the magic of cinema in a nice way. This film is one of the reasons I hate the Oscars. Will remain as the best film in more than 10 years, I believe wholeheartedly sites. Definitely one of the best films in the world.'
review = 'this is the best movie ive ever seen. my friend suggested it to me, to see it for 3 months and I finally did and Im absolutely happyt that I did.'

# Replace every non-word character by a space, then split on whitespace
list_of_words = re.sub(r"[^\w]", " ", review).split()

In [32]:
print(list_of_words)

['this', 'is', 'the', 'best', 'movie', 'ive', 'ever', 'seen', 'my', 'friend', 'suggested', 'it', 'to', 'me', 'to', 'see', 'it', 'for', '3', 'months', 'and', 'I', 'finally', 'did', 'and', 'Im', 'absolutely', 'happyt', 'that', 'I', 'did']


In [0]:
# The function below takes a review as a string of one or more words. The string is
# lower-cased and split into a list of words, and each word is looked up in the
# word index of the IMDB data set. Only words whose id is below NUM_WORDS
# (10,000 by default) are kept. The resulting list of ids is padded and passed
# to the model, which predicts the sentiment. The function returns that
# prediction as a sentence.

def prediction_review(review, NUM_WORDS=10000, model=model):
    # Rebuild the same word-to-id mapping that was used to decode the reviews
    word_to_id = keras.datasets.imdb.get_word_index()
    word_to_id = {k: (v + INDEX_FROM) for k, v in word_to_id.items()}
    word_to_id["<PAD>"] = 0
    word_to_id["<START>"] = 1
    word_to_id["<UNK>"] = 2

    # The word index is lower-case, so lower-case the review before the lookup
    word_list = re.sub(r"[^\w]", " ", review).lower().split()

    # Keep only known words whose id fits in the model's vocabulary (ids 0 to NUM_WORDS - 1)
    new_review = [word_to_id[word] for word in word_list
                  if word in word_to_id and word_to_id[word] < NUM_WORDS]

    # Pad the single review to the length the model was trained on
    try_out = keras.preprocessing.sequence.pad_sequences([new_review], maxlen=256, padding='pre', value=0.0)

    # The sigmoid output is a probability; 0.5 is the cut-off for the positive class.
    # (predict_classes is not available in recent Keras versions.)
    probability = float(model.predict(try_out)[0][0])

    if probability > 0.5:
        return 'The predicted sentiment in this review is positive.'
    return 'The predicted sentiment in this review is negative.'
    


In [34]:
#review = 'I would rather get a root canal threatment than watch this movie again.'

prediction_review(review)

'The predicted sentiment in this review is positive.'
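
The encoding step that turns a free-text review into model input can be tried out without the trained model. The sketch below uses a tiny, invented word index (`toy_word_to_id`) and hypothetical helper names; it only illustrates the lower-casing, the lookup, the vocabulary cut-off and the 0.5 threshold on the predicted probability.

```python
import re

# Hypothetical ids; "rare" lies outside the 10,000-word vocabulary
toy_word_to_id = {"this": 4, "movie": 5, "best": 6, "rare": 12000}
VOCAB_LIMIT = 10000

def encode_review_sketch(review, word_to_id, vocab_limit):
    """Lower-case and split the review, then keep the ids of known, in-vocabulary words."""
    words = re.sub(r"[^\w]", " ", review).lower().split()
    ids = [word_to_id[word] for word in words if word in word_to_id]
    return [idx for idx in ids if idx < vocab_limit]

def label_from_probability(probability, threshold=0.5):
    """Map a sigmoid output to a sentiment label."""
    return "positive" if probability > threshold else "negative"

encoded = encode_review_sketch("This is the BEST rare movie!", toy_word_to_id, VOCAB_LIMIT)
print(encoded)                      # [4, 6, 5]
print(label_from_probability(0.8))  # positive
```

Words that are missing from the index, or that fall outside the vocabulary, are silently dropped, so a review made up mostly of rare or misspelled words leaves the model with very little to go on.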