# Building sentiment classification using word vectors

Import the modules.

In [1]:
import re
import nltk
import numpy as np
import pandas as pd
from nltk.corpus import stopwords
from gensim.models import Word2Vec
from keras.models import Sequential
from keras.layers import Dense, Activation
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

Using TensorFlow backend.


In [2]:
nltk.download('stopwords')
stop = set(stopwords.words('english'))

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\maninaya\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Download the dataset.

In [3]:
data=pd.read_csv('https://www.dropbox.com/s/8yq0edd4q908xqw/airline_sentiment.csv?dl=1')

A sample of the dataset looks as follows.

In [4]:
data.head()

Unnamed: 0,airline_sentiment,text
0,1,@VirginAmerica plus you've added commercials t...
1,0,@VirginAmerica it's really aggressive to blast...
2,0,@VirginAmerica and it's a really big bad thing...
3,0,@VirginAmerica seriously would pay $30 a fligh...
4,1,"@VirginAmerica yes, nearly every time I fly VX..."


Preprocess the input text:
* Lowercase all words.
* Replace punctuation and other non-alphanumeric characters with spaces.
* Remove the stop words.

In [5]:
def preprocess(text):
    # Lowercase, replace each run of non-alphanumeric characters with a space,
    # then drop the stop words.
    text = text.lower()
    text = re.sub('[^0-9a-zA-Z]+', ' ', text)
    words = text.split()
    words = [word for word in words if word not in stop]
    return ' '.join(words)

In [6]:
data['text'] = data['text'].apply(preprocess)

After preprocessing, the dataset looks as follows.

In [7]:
data.head()

Unnamed: 0,airline_sentiment,text
0,1,virginamerica plus added commercials experienc...
1,0,virginamerica really aggressive blast obnoxiou...
2,0,virginamerica really big bad thing
3,0,virginamerica seriously would pay 30 flight se...
4,1,virginamerica yes nearly every time fly vx ear...
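To see the effect of each preprocessing step on a single string, here is a minimal sketch. The tweet text, the name `preprocess_demo`, and the tiny `demo_stop` set are made up for illustration (the notebook itself uses NLTK's full English stop word list), so the exact output depends on those assumptions.

```python
import re

# Hypothetical stand-in for NLTK's English stop word list, kept tiny so the
# example runs without downloading any corpus.
demo_stop = {"you", "ve", "the"}

def preprocess_demo(text, stop_words):
    """Lowercase, replace non-alphanumeric runs with spaces, drop stop words."""
    text = text.lower()
    text = re.sub(r"[^0-9a-z]+", " ", text)
    return " ".join(word for word in text.split() if word not in stop_words)

# Made-up tweet, not taken from the dataset.
sample = "@SomeAirline thanks, you've made the flight GREAT!!!"
cleaned = preprocess_demo(sample, demo_stop)
print(cleaned)  # someairline thanks made flight great
```

Note that the apostrophe in "you've" is replaced by a space, so the word splits into "you" and "ve" before the stop word filter runs.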


Convert the input text into a list of lists, with one list of words per tweet.

In [8]:
list_words = [text.split() for text in data['text']]

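Build a CBOW model (sg=0) with a context window size of 5 and a vector length of 100. Setting min_count=30 drops words that occur fewer than 30 times from the vocabulary. Note that in gensim 4.0 and later, the size argument is named vector_size.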

In [9]:
model = Word2Vec(size=100, window=5, min_count=30, sg=0)

  "C extension not loaded, training will be slow. "


Build the vocabulary from the tokenized tweets, then train the model.

In [10]:
model.build_vocab(list_words)
model.train(list_words, total_examples=model.corpus_count, epochs=100)

(6809487, 12585800)
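To make the window parameter concrete, the following sketch lists the context words that a CBOW model would use to predict each target word in one tokenized tweet. It is a conceptual illustration only: `cbow_pairs` is a hypothetical helper, a window of 2 is used for readability, and a real implementation such as gensim's may sample or shrink the window during training.

```python
def cbow_pairs(tokens, window):
    """Return (context_words, target_word) pairs for one tokenized sentence."""
    pairs = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]     # up to `window` words before
        right = tokens[i + 1:i + 1 + window]    # up to `window` words after
        pairs.append((left + right, target))
    return pairs

# Tokens of the third preprocessed tweet shown earlier.
tokens = ["virginamerica", "really", "big", "bad", "thing"]
pairs = cbow_pairs(tokens, window=2)
for context, target in pairs:
    print(context, "->", target)
```

CBOW averages the vectors of the context words to predict the target, which is why words that appear in similar contexts end up with similar vectors.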

Compute the average word vector of each tweet.

In [11]:
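features = []
for tokens in list_words:
    z = np.zeros((1, 100))  # running sum of the tweet's word vectors
    k = 0                   # number of in-vocabulary words found
    for token in tokens:
        try:
            # model.wv[...] is the supported lookup; model[...] is deprecated
            z = z + model.wv[token]
            k = k + 1
        except KeyError:
            # the word was dropped by min_count, so it has no vector
            continue
    # max(k, 1) avoids dividing by zero when no word is in the vocabulary
    features.append(z / max(k, 1))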

  


Print the first element of features.

In [12]:
print(features[0])

[[-0.9752347  -0.27016597  0.70575761  2.02127039  0.23110319 -0.58973696
  -1.58623463  0.76974845  0.24209128  1.18256747  0.10017001  0.9132099
   0.42699955 -1.32334952 -0.20311717  0.91257919 -1.30829659  0.24698428
   1.20008349 -0.63479377 -0.5038844   0.40085929  0.71319471 -0.30997811
  -1.19385314  0.69267646 -0.69100454 -0.28000922  0.36959147  1.18292203
  -1.13867408 -0.71333447  1.06983713  0.09232059 -0.27037012  0.77410247
  -0.7799902  -0.42059729  0.51543218 -0.3691845   0.29262251 -0.04271292
  -0.53452698  0.78014304  0.28128824  0.79487153  0.51127694  1.59902956
  -0.80974678  0.95499197  0.44596243 -0.63527008 -0.31866452  0.28614587
   0.36727148 -0.12961707  0.8757535  -0.89195958 -0.24674536 -0.48642722
  -0.67119573 -1.17169897 -0.35880532  0.6722302  -1.64867594 -0.07470235
  -0.35535427  1.0077929  -1.19501923 -0.03629139 -0.44627865  0.07526087
  -0.52079657  0.03860507 -0.65691359 -0.79518194  0.16919739 -1.75692097
   0.94030108  0.45052017  0.99923423 -

Each tweet is represented by the average of the word vectors of all the words it contains. Some words are not in the vocabulary (those that occur fewer than min_count=30 times), and looking up their vectors raises a KeyError. The try/except block skips those words.
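The averaging logic can be checked on a toy example. The sketch below uses a plain dictionary of made-up 3-dimensional vectors in place of the trained Word2Vec model; `toy_vectors` and `average_vector` are hypothetical names, and returning a zero vector when no word matches is an assumption of this sketch.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings standing in for trained word vectors.
toy_vectors = {
    "flight": np.array([1.0, 0.0, 2.0]),
    "late": np.array([3.0, 2.0, 0.0]),
}

def average_vector(tokens, vectors, dim):
    """Average the vectors of in-vocabulary tokens; return zeros if none match."""
    total = np.zeros(dim)
    count = 0
    for token in tokens:
        try:
            total += vectors[token]
            count += 1
        except KeyError:
            # Out-of-vocabulary word: skip it.
            continue
    return total / count if count else total

avg = average_vector(["flight", "xyzzy", "late"], toy_vectors, dim=3)
empty = average_vector(["xyzzy"], toy_vectors, dim=3)
print(avg)    # [2. 1. 1.]
print(empty)  # [0. 0. 0.]
```

The unknown word "xyzzy" is ignored, so the average is taken over the two known words only.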

Convert the features into an array, split the dataset into train and test sets, and reshape both sets so that they can be passed to the model.

In [13]:
features = np.array(features)
X_train, X_test, y_train, y_test = train_test_split(features, data['airline_sentiment'], test_size=0.30,random_state=10)
X_train = X_train.reshape(X_train.shape[0],100)
X_test = X_test.reshape(X_test.shape[0],100)
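The reshape is needed because each tweet's average was stored as a (1, 100) row vector, so stacking them gives a 3-dimensional array. A small sketch with three made-up rows (the values are placeholders, not real features) shows the shapes before and after:

```python
import numpy as np

# Three hypothetical tweets, each stored as a (1, 100) row vector.
rows = [np.zeros((1, 100)), np.ones((1, 100)), np.full((1, 100), 2.0)]

stacked = np.array(rows)
print(stacked.shape)  # (3, 1, 100)

# Drop the singleton middle axis so each sample is a flat 100-value vector.
flat = stacked.reshape(stacked.shape[0], 100)
print(flat.shape)  # (3, 100)
```

A dense layer with input_dim = 100 expects the second form, one flat 100-value vector per sample.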

Print the first element of X_train.

In [14]:
print(X_train[0])

[-0.64189334 -0.08338755 -0.8285354   0.82467048 -1.02868349 -0.70144425
 -0.15247449  1.08180206  0.41147404  0.4075205  -0.21401518 -0.27407975
 -0.11124514 -0.12706222  0.9429029   0.24867104 -0.92390297  0.33188284
 -0.21333201 -0.98242351  0.86084474  0.36444499  0.24323412 -0.08319756
 -0.09810477  0.62146915  0.94574724  0.77698617  0.41386133  0.27470819
 -0.09095638  0.52962661  0.66286149 -0.75652258  0.28187069  0.37262182
 -0.05396645  0.76291973 -0.38055352 -0.29811106 -0.88408418  0.35670871
 -0.25927967  0.36336555  0.04383947  0.21175945  0.70749708 -0.33536115
 -0.14764456  0.39910796 -0.02710901 -0.41864921 -0.81636152 -0.62623763
  0.0469867  -0.11417671  0.04190523  0.05552077  0.24671666  0.98658287
  0.69590224 -0.78081633 -0.29835591  1.0138798  -1.55551763  0.02205896
 -0.62944871  0.63498874 -0.36552272  0.15407907  0.41989354  0.7547316
 -0.27643165 -0.465283   -0.41696574  0.01783279 -0.17925515 -0.71615679
  0.39568957 -0.75882583  0.38768501  0.93064707  1.

Print y_train.

In [16]:
print(y_train)

9091     0
3846     1
4265     1
6918     0
7624     0
9641     0
2548     1
4059     0
3669     0
9199     0
7566     0
817      0
3688     0
5884     0
3851     0
1258     0
6793     0
7464     0
1146     0
1775     0
2570     0
9062     0
8170     0
1651     0
3651     1
21       0
4734     0
1863     0
7995     0
4797     1
        ..
3932     0
653      0
1406     0
409      0
6899     0
11318    0
9166     1
8036     0
574      0
7290     0
3416     0
2102     0
2443     0
239      0
4452     0
5648     0
10742    0
6400     0
9289     1
9224     0
10234    0
10141    1
1520     0
4829     1
10201    1
9372     0
7291     0
1344     0
7293     0
1289     0
Name: airline_sentiment, Length: 8078, dtype: int64


Build and compile the neural network to predict the sentiment of a tweet.

In [17]:
model = Sequential()
model.add(Dense(1000,input_dim = 100,activation='relu'))
model.add(Dense(1))
model.add(Activation('sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_1 (Dense)              (None, 1000)              101000    
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 1001      
_________________________________________________________________
activation_1 (Activation)    (None, 1)                 0         
Total params: 102,001
Trainable params: 102,001
Non-trainable params: 0
_________________________________________________________________


In the preceding model, a 1,000-unit hidden layer connects the 100 averaged word vector values to a single sigmoid output unit. Its output, a value between 0 and 1, is the predicted probability of positive sentiment (label 1 for positive, 0 for negative).
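The parameter counts in the model summary can be reproduced by hand: a dense layer has one weight per input-unit pair plus one bias per unit. The helper name `dense_params` below is made up for this sketch.

```python
def dense_params(n_inputs, n_units):
    """Weights plus one bias per unit in a fully connected layer."""
    return n_inputs * n_units + n_units

hidden = dense_params(100, 1000)   # 100 inputs -> 1,000 hidden units
output = dense_params(1000, 1)     # 1,000 hidden units -> 1 output unit
total = hidden + output
print(hidden, output, total)  # 101000 1001 102001
```

The activation layer adds no parameters, so the total matches the 102,001 reported by model.summary().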

In [18]:
# epochs replaces the deprecated nb_epoch argument
model.fit(X_train, y_train, batch_size=128, epochs=5, validation_data=(X_test, y_test), verbose=1)


Train on 8078 samples, validate on 3463 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x1a71668bef0>

The model predicts the sentiment of a tweet with roughly 90% accuracy on the test set.

Compute the confusion matrix of the predictions.

In [19]:
pred = model.predict(X_test)
pred2 = np.where(pred>0.5,1,0)
confusion_matrix(y_test, pred2)

array([[2639,  130],
       [ 235,  459]], dtype=int64)

In scikit-learn's confusion matrix, rows are the actual labels and columns are the predicted labels, ordered 0 (negative) then 1 (positive). So 2,639 tweets were predicted negative and are actually negative, 130 were predicted positive but are actually negative, 235 were predicted negative but are actually positive, and 459 were predicted positive and are actually positive.
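Accuracy, precision, and recall for the positive class can be derived directly from these four counts. The sketch below hard-codes the matrix printed above and assumes scikit-learn's layout, with rows as actual labels and columns as predicted labels in the order 0, 1.

```python
import numpy as np

# Confusion matrix from the cell above: rows = actual, columns = predicted.
cm = np.array([[2639, 130],
               [235, 459]])

# For a 2x2 matrix in this layout, ravel() yields TN, FP, FN, TP.
tn, fp, fn, tp = cm.ravel()

accuracy = (tn + tp) / cm.sum()   # share of all tweets classified correctly
precision = tp / (tp + fp)        # share of predicted positives that are positive
recall = tp / (tp + fn)           # share of actual positives that were found

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
```

Because negative tweets make up most of the test set, the recall on the positive class is noticeably lower than the overall accuracy, so accuracy alone overstates how well positive tweets are detected.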