**Sentiment Analysis:** the process of computationally identifying and categorizing opinions expressed in a piece of text, especially in order to determine whether the writer's attitude towards a particular topic, product, etc. is positive, negative, or neutral.

**What's New:** I have added a section on how to deal with class imbalance. Almost every classification task has this problem, because the number of examples differs from class to class. In the current dataset, there are fewer examples with positive sentiment than with negative sentiment.

**Solving class-imbalanced data**:
- upsampling 
- using class weighted loss function
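
As a rough illustration of the second option, the sketch below derives per-class weights from label counts using the common inverse-frequency heuristic `n_samples / (n_classes * class_count)`. The function name and the toy labels are hypothetical and not part of this notebook, which sets its weights by hand further down.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}

# Hypothetical toy labels: six negatives (0) and two positives (1)
toy_labels = [0, 0, 0, 0, 0, 0, 1, 1]
weights = balanced_class_weights(toy_labels)
print(weights)  # the rarer class 1 receives the larger weight
```

A dictionary of this shape can be passed as `class_weight` to `model.fit`, so that mistakes on the rarer class cost more during training.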

Using an LSTM to classify the tweets into positive (hate speech, `HS` = 1) and negative (`HS` = 0).


In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

from sklearn.feature_extraction.text import CountVectorizer
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense, Embedding, LSTM, SpatialDropout1D
from sklearn.model_selection import train_test_split
from keras.utils.np_utils import to_categorical
from sklearn.utils import resample
from sklearn.utils import shuffle
from sklearn.metrics import confusion_matrix,classification_report
import re

Only keeping the necessary columns.

In [19]:
data = pd.read_csv('./clean_dataset.csv')
# Keeping only the necessary columns
data = data[['Tweet','HS']]

Data preview

In [20]:
data = data.dropna()
data.head()

Unnamed: 0,Tweet,HS
0,di saat cowok usaha lacak perhati gue kamu lan...,1
1,telat beri tau kamu edan sarap gue gaul cigax ...,0
2,kadang pikir percaya tuhan jatuh kali kali kad...,0
3,tau mata sipit lihat,0
4,kaum cebong kafir sudah lihat dongok dungu haha,1


Next, I clean the tweets so that only valid text and words remain: everything is lowercased, special characters and punctuation are stripped, and the `rt` and `user` tokens are removed. Then, I set the maximum number of features to 2000 and use Tokenizer to convert the text into integer sequences that the network can take as input.

In [21]:
from string import punctuation

# Set all text to be lowercase
data['Tweet'] = data['Tweet'].apply(lambda x: x.lower())

# Remove special chars (the range must be A-Z; A-z would also keep [ \ ] ^ _ `)
data['Tweet'] = data['Tweet'].apply(lambda x: re.sub(r'[^a-zA-Z0-9\s]', '', x))

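# Strip any remaining punctuation, then remove the Twitter artifacts 'rt' and 'user'.
# Note: str.replace matches substrings, so 'rt' is also removed inside longer words.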
# data['Tweet'] = data['Tweet'].apply(lambda x: bytes(x, 'utf-8').decode('utf-8', 'ignore'))
data['Tweet'] = data['Tweet'].apply((lambda x: ''.join([c for c in x if c not in punctuation])))
data['Tweet'] = data['Tweet'].str.replace('rt', '')
data['Tweet'] = data['Tweet'].str.replace('user', '')
data['Tweet'] = data['Tweet'].str.replace('\n', '')
data['Tweet'] = data['Tweet'].str.strip()

data.head(10)
# print(new)

Unnamed: 0,Tweet,HS
0,di saat cowok usaha lacak perhati gue kamu lan...,1
1,telat beri tau kamu edan sarap gue gaul cigax ...,0
2,kadang pikir percaya tuhan jatuh kali kali kad...,0
3,tau mata sipit lihat,0
4,kaum cebong kafir sudah lihat dongok dungu haha,1
5,bani taplak dan kawan kawan,1
6,deklarasi pilih kepala daerah aman anti hoaks ...,0
7,gue saja selesai watch aldnoah zero kampret me...,0
8,admin belanja po baik nak makan ais kepal milo...,0
9,enak kalau sambil ngewe,0


In [22]:
# Count rows per class (DataFrame.size would count cells, i.e. rows x columns)
print(len(data), "Total")
print((data['HS'] == 1).sum(), "Hate speech")
print((data['HS'] == 0).sum(), "Non hate speech")

13116 Total
5553 Hate speech
7563 Non hate speech


In [23]:
max_fatures = 2000
tokenizer = Tokenizer(num_words=max_fatures, split=' ')
tokenizer.fit_on_texts(data['Tweet'].values)
X = tokenizer.texts_to_sequences(data['Tweet'].values)
X = pad_sequences(X)
X[:3]

array([[   0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,   57,  336,  165,  598,   10,   23,  598,
          10,   81,  139,   23,   23,  336,  211],
       [   0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0, 1890,  690,
          41,   23,  452,  386,   10, 1686,   58],
       [   0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,  599,
          94,  149,  198,  506,  129,  129,  599,  198,  216,    5,  482,
         982, 1687, 1011,   13,   32,   34, 1786]])
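
To make the output above easier to read, here is a simplified pure-Python sketch of what the tokenizing and padding step does: frequent words get small integer ids, words outside the vocabulary are dropped, and shorter sequences are padded with zeros on the left. It is only an approximation of the Keras `Tokenizer` and `pad_sequences` defaults as I understand them; the helper names and toy tweets are hypothetical.

```python
from collections import Counter

def fit_word_index(texts, num_words):
    """Map the most frequent words to ids starting at 1 (0 is reserved for padding)."""
    counts = Counter(word for text in texts for word in text.split())
    most_common = counts.most_common(num_words - 1)
    return {word: i + 1 for i, (word, _) in enumerate(most_common)}

def to_padded_sequences(texts, word_index):
    """Convert texts to id lists, dropping unknown words, and left-pad with zeros."""
    seqs = [[word_index[w] for w in t.split() if w in word_index] for t in texts]
    maxlen = max(len(s) for s in seqs)
    return [[0] * (maxlen - len(s)) + s for s in seqs]

# Hypothetical cleaned tweets
toy_texts = ["kawan kawan lihat", "lihat", "kawan makan"]
word_index = fit_word_index(toy_texts, num_words=3)  # keeps the 2 most frequent words
padded = to_padded_sequences(toy_texts, word_index)
print(word_index)
print(padded)
```

The leading zeros in `X[:3]` above are this kind of left padding: every row is brought up to the length of the longest tweet.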

Next, I compose the LSTM network. Note that **embed_dim**, **lstm_out**, **batch_size**, and the dropout rates are hyperparameters; their values were picked by intuition and should be tuned in order to achieve good results. Please also note that I am using softmax as the activation function of the output layer. The reason is that the network is trained with categorical crossentropy, and softmax is the natural output activation for that loss.
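
To make the softmax and categorical crossentropy pairing concrete, here is a small standalone sketch in plain Python. It is not the Keras implementation, and the logits are made-up numbers; it only shows that softmax turns the two output scores into probabilities, and that the loss is the negative log-probability of the true class.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def categorical_crossentropy(one_hot, probs):
    """Negative log-probability assigned to the true class."""
    return -sum(t * math.log(p) for t, p in zip(one_hot, probs))

probs = softmax([0.0, 2.0])  # hypothetical scores from the final Dense(2) layer
loss_if_class1 = categorical_crossentropy([0, 1], probs)
loss_if_class0 = categorical_crossentropy([1, 0], probs)
print(probs, loss_if_class1, loss_if_class0)
```

The loss is small when the true class is the one the model favours (class 1 here) and large otherwise, which is what pushes the network toward confident, correct predictions.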

In [24]:
embed_dim = 128
lstm_out = 196

model = Sequential()
model.add(Embedding(max_fatures, embed_dim,input_length = X.shape[1]))
model.add(SpatialDropout1D(0.4))
model.add(LSTM(lstm_out, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(2,activation='softmax'))
model.compile(loss = 'categorical_crossentropy', optimizer='adam',metrics = ['accuracy'])
print(model.summary())

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 40, 128)           256000    
                                                                 
 spatial_dropout1d (SpatialD  (None, 40, 128)          0         
 ropout1D)                                                       
                                                                 
 lstm (LSTM)                 (None, 196)               254800    
                                                                 
 dense (Dense)               (None, 2)                 394       
                                                                 
Total params: 511,194
Trainable params: 511,194
Non-trainable params: 0
_________________________________________________________________
None
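
The parameter counts in the summary above can be checked by hand. The sketch below uses the standard formulas for an embedding table, an LSTM layer (four gates, each with input weights, recurrent weights, and a bias), and a dense layer; the helper names are mine, for illustration only.

```python
def embedding_params(vocab_size, embed_dim):
    # One embed_dim-sized vector per vocabulary entry
    return vocab_size * embed_dim

def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights and a bias vector
    return 4 * ((input_dim + units) * units + units)

def dense_params(input_dim, units):
    # Weight matrix plus one bias per output unit
    return input_dim * units + units

layer_params = [
    embedding_params(2000, 128),  # max_fatures x embed_dim
    lstm_params(128, 196),        # embed_dim -> lstm_out
    dense_params(196, 2),         # lstm_out -> 2 classes
]
total = sum(layer_params)
print(layer_params, total)
```

Most of the capacity sits in the embedding table and the LSTM, so `max_fatures`, `embed_dim`, and `lstm_out` are the values that change the model size the most.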


Here I split the data into training and test sets.

In [25]:
Y = pd.get_dummies(data['HS']).values
X_train, X_test, Y_train, Y_test = train_test_split(X,Y, test_size = 0.20, random_state = 42)
print(X_train.shape,Y_train.shape)
print(X_test.shape,Y_test.shape)

(10492, 40) (10492, 2)
(2624, 40) (2624, 2)


Here we train the network for 15 epochs.

In [26]:
batch_size = 128
model.fit(X_train, Y_train, epochs = 15, batch_size=batch_size, verbose = 1)

Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15


<keras.callbacks.History at 0x22d71981a90>

Predicting on the test set and measuring precision, recall, and accuracy.

In [27]:
# Y_pred = model.predict_classes(X_test,batch_size = batch_size)
predict_x = model.predict(X_test) 
classes_x = np.argmax(predict_x,axis=1)

In [29]:
df_test = pd.DataFrame({'true': Y_test.tolist(), 'pred':classes_x})
df_test['true'] = df_test['true'].apply(lambda x: np.argmax(x))
print("confusion matrix",confusion_matrix(df_test.true, df_test.pred))
print(classification_report(df_test.true, df_test.pred))

confusion matrix [[1286  221]
 [ 276  841]]
              precision    recall  f1-score   support

           0       0.82      0.85      0.84      1507
           1       0.79      0.75      0.77      1117

    accuracy                           0.81      2624
   macro avg       0.81      0.80      0.80      2624
weighted avg       0.81      0.81      0.81      2624
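
The figures in this report follow directly from the confusion matrix, whose rows are the true classes and whose columns are the predicted classes. The sketch below recomputes them in plain Python; the helper name is hypothetical.

```python
cm = [[1286, 221],   # true class 0: predicted 0, predicted 1
      [276, 841]]    # true class 1: predicted 0, predicted 1

def per_class_metrics(cm):
    """Return {class: (precision, recall, f1, support)}, rounded to 2 decimals."""
    n = len(cm)
    metrics = {}
    for k in range(n):
        tp = cm[k][k]
        support = sum(cm[k])                        # all true members of class k
        predicted = sum(cm[r][k] for r in range(n)) # everything predicted as class k
        precision = tp / predicted
        recall = tp / support
        f1 = 2 * precision * recall / (precision + recall)
        metrics[k] = (round(precision, 2), round(recall, 2), round(f1, 2), support)
    return metrics

metrics = per_class_metrics(cm)
accuracy = round(sum(cm[k][k] for k in range(len(cm))) / sum(map(sum, cm)), 2)
print(metrics, accuracy)
```

Recall is the number to watch for the imbalance question: it says what fraction of each class's true members the model actually found.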



Finally, measuring the number of correct guesses. Finding negative tweets (**class 0**) goes well for the network (**recall 0.85**), but recognising positive tweets (**class 1**) works less well (**recall 0.75**). My educated guess here is that the positive class has fewer training examples than the negative class, hence the weaker results for positive tweets.

As expected, recall for the positive class is lower than for the negative class. Let's try to solve this problem.

**1. Up-sample Minority Class**

Up-sampling is the process of randomly duplicating observations from the minority class in order to reinforce its signal.
There are several heuristics for doing so, but the most common way is to simply resample with replacement.
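
Stripped of pandas and scikit-learn, resampling with replacement amounts to the following minimal sketch. The rows, the helper name, and the seed are hypothetical; the notebook itself uses `sklearn.utils.resample`.

```python
import random

def upsample(minority, target_size, seed=123):
    """Draw target_size rows from the minority class, with replacement."""
    rng = random.Random(seed)  # fixed seed for reproducible draws
    return rng.choices(minority, k=target_size)

# Hypothetical (text, label) rows: six majority rows, two minority rows
majority = [("tweet_%d" % i, 0) for i in range(6)]
minority = [("hs_%d" % i, 1) for i in range(2)]

upsampled = upsample(minority, target_size=len(majority))
balanced = majority + upsampled
print(len(majority), len(upsampled))
```

Because the draws are duplicates of existing rows, up-sampling should only be applied to the training split; duplicating rows before splitting would leak copies of training tweets into the test set.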

In [30]:
# Separate majority and minority classes
data_majority = data[data['HS'] == 0]
data_minority = data[data['HS'] == 1]

bias = data_minority.shape[0]/data_majority.shape[0]
# Split into train/test first, so that only the training data is upsampled later
train = pd.concat([data_majority.sample(frac=0.8,random_state=200),
         data_minority.sample(frac=0.8,random_state=200)])
test = pd.concat([data_majority.drop(data_majority.sample(frac=0.8,random_state=200).index),
        data_minority.drop(data_minority.sample(frac=0.8,random_state=200).index)])

train = shuffle(train)
test = shuffle(test)

In [31]:
print('positive data in training:',(train.HS == 1).sum())
print('negative data in training:',(train.HS == 0).sum())
print('positive data in test:',(test.HS == 1).sum())
print('negative data in test:',(test.HS == 0).sum())


positive data in training: 4442
negative data in training: 6050
positive data in test: 1111
negative data in test: 1513


In [32]:
# Separate majority and minority classes in training data for upsampling 
data_majority = train[train['HS'] == 0]
data_minority = train[train['HS'] == 1]

print("majority class before upsample:",data_majority.shape)
print("minority class before upsample:",data_minority.shape)

# Upsample minority class
data_minority_upsampled = resample(data_minority, 
                                 replace=True,     # sample with replacement
                                 n_samples= data_majority.shape[0],    # to match majority class
                                 random_state=123) # reproducible results
 
# Combine majority class with upsampled minority class
data_upsampled = pd.concat([data_majority, data_minority_upsampled])
 
# Display new class counts
print("After upsampling\n",data_upsampled.HS.value_counts(),sep = "")

max_fatures = 2000
tokenizer = Tokenizer(num_words=max_fatures, split=' ')
tokenizer.fit_on_texts(data['Tweet'].values) # training with whole data

X_train = tokenizer.texts_to_sequences(data_upsampled['Tweet'].values)
X_train = pad_sequences(X_train,maxlen=29)
Y_train = pd.get_dummies(data_upsampled['HS']).values
print('x_train shape:',X_train.shape)

X_test = tokenizer.texts_to_sequences(test['Tweet'].values)
X_test = pad_sequences(X_test,maxlen=29)
Y_test = pd.get_dummies(test['HS']).values
print("x_test shape", X_test.shape)

majority class before upsample: (6050, 2)
minority class before upsample: (4442, 2)
After upsampling
0    6050
1    6050
Name: HS, dtype: int64
x_train shape: (12100, 29)
x_test shape (2624, 29)


In [33]:
# model
embed_dim = 128
lstm_out = 192

model = Sequential()
model.add(Embedding(max_fatures, embed_dim,input_length = X_train.shape[1]))
model.add(SpatialDropout1D(0.4))
model.add(LSTM(lstm_out, dropout=0.4, recurrent_dropout=0.4))
model.add(Dense(2,activation='softmax'))
model.compile(loss = 'categorical_crossentropy', optimizer='adam',metrics = ['accuracy'])
print(model.summary())

Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_1 (Embedding)     (None, 29, 128)           256000    
                                                                 
 spatial_dropout1d_1 (Spatia  (None, 29, 128)          0         
 lDropout1D)                                                     
                                                                 
 lstm_1 (LSTM)               (None, 192)               246528    
                                                                 
 dense_1 (Dense)             (None, 2)                 386       
                                                                 
Total params: 502,914
Trainable params: 502,914
Non-trainable params: 0
_________________________________________________________________
None


Here we train the network. We should run many more than 15 epochs, but I would have to wait forever on Kaggle, so it is 15 for now.

In [34]:
batch_size = 128
# also adding weights
class_weights = {0: 1 ,
                1: 1.6/bias }
model.fit(X_train, Y_train, epochs = 15, batch_size=batch_size, verbose = 1,
          class_weight=class_weights)

Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15


<keras.callbacks.History at 0x22d7e6826d0>
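
For reference, the weight passed for class 1 above can be worked out from the class counts, since `bias` was computed on the full dataset before the split. The sketch below also shows the general effect of a class weight, which is to scale each sample's loss by the weight of its true class. The prediction values and the helper name are made up for illustration.

```python
import math

n_positive = 5553  # 4442 train + 1111 test, from the split printed earlier
n_negative = 7563  # 6050 train + 1513 test
bias = n_positive / n_negative
class_weights = {0: 1.0, 1: 1.6 / bias}

def weighted_loss(true_class, probs, class_weights):
    """Crossentropy for one sample, scaled by the weight of its true class."""
    return class_weights[true_class] * -math.log(probs[true_class])

probs = [0.6, 0.4]  # hypothetical softmax output for one tweet
loss_neg = weighted_loss(0, probs, class_weights)
loss_pos = weighted_loss(1, probs, class_weights)
print(round(class_weights[1], 2), loss_neg, loss_pos)
```

Note that the training set here is already balanced by up-sampling, so this extra weight pushes the model further toward predicting class 1; that may be part of why recall for class 0 drops in the results below.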

In [35]:
predict_x = model.predict(X_test) 
Y_pred = np.argmax(predict_x,axis=1)


df_test = pd.DataFrame({'true': Y_test.tolist(), 'pred':Y_pred})
df_test['true'] = df_test['true'].apply(lambda x: np.argmax(x))
print("confusion matrix",confusion_matrix(df_test.true, df_test.pred))
print(classification_report(df_test.true, df_test.pred))

confusion matrix [[1212  301]
 [ 236  875]]
              precision    recall  f1-score   support

           0       0.84      0.80      0.82      1513
           1       0.74      0.79      0.77      1111

    accuracy                           0.80      2624
   macro avg       0.79      0.79      0.79      2624
weighted avg       0.80      0.80      0.80      2624



So the effect of the class imbalance is reduced: recall for positive tweets (class 1) improved from 0.75 to 0.79. It is not always possible to remove it completely.

You may also have noticed that recall for negative tweets decreased from 0.85 to 0.80, but this can be improved by training the model for more epochs and tuning the hyperparameters.

In [36]:
# Optionally train for a few more epochs (left commented out, so the metrics below match the previous cell)
# model.fit(X_train, Y_train, epochs = 15, batch_size=batch_size, verbose = 1,
#           class_weight=class_weights)

predict_x = model.predict(X_test) 
Y_pred = np.argmax(predict_x,axis=1)

# Y_pred = model.predict_classes(X_test,batch_size = batch_size)
df_test = pd.DataFrame({'true': Y_test.tolist(), 'pred':Y_pred})
df_test['true'] = df_test['true'].apply(lambda x: np.argmax(x))
print("confusion matrix",confusion_matrix(df_test.true, df_test.pred))
print(classification_report(df_test.true, df_test.pred))

confusion matrix [[1212  301]
 [ 236  875]]
              precision    recall  f1-score   support

           0       0.84      0.80      0.82      1513
           1       0.74      0.79      0.77      1111

    accuracy                           0.80      2624
   macro avg       0.79      0.79      0.79      2624
weighted avg       0.80      0.80      0.80      2624



In [41]:
twt = ['ayo kita makan bersama, ayo cebong ikut dong']
# Vectorize the tweet with the pre-fitted tokenizer instance
twt = tokenizer.texts_to_sequences(twt)
# Pad the tweet to the same length as the model's input (29)
twt = pad_sequences(twt, maxlen=29, dtype='int32', value=0)
print(twt)
sentiment = model.predict(twt,batch_size=1,verbose = 2)[0]
print(sentiment)

if(np.argmax(sentiment) == 0):
    print("negative")
elif (np.argmax(sentiment) == 1):
    print("positive")

[[   0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0  217 1134   96  217   25  212
   723]]
1/1 - 0s - 21ms/epoch - 21ms/step
[0.002676 0.997324]
positive
