# UJM - Master DSC/MLDM - Deep Learning - TP3a
# Sentiment analysis with LSTM

This session is based on this source: https://www.kaggle.com/ngyptr/lstm-sentiment-analysis-keras

**Associated data on Claroline**: twitter.zip, which contains 'Sentiment.csv'


**Sentiment Analysis:** the process of computationally identifying and categorizing opinions expressed in a piece of text, especially in order to determine whether the writer's attitude towards a particular topic, product, etc. is positive, negative, or neutral.


In [8]:
# This notebook was prepared in a Python 3 environment that comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here are several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

from sklearn.feature_extraction.text import CountVectorizer
from tensorflow.keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense, Embedding, LSTM, SpatialDropout1D
from sklearn.model_selection import train_test_split
from keras.utils import to_categorical
import re



## First we process the data
**Only keeping the necessary columns.**

In [10]:
#We assume Sentiment.csv is in the current directory; change the path to match your environment
data = pd.read_csv('./Sentiment.csv')
# Keeping only the necessary columns
data = data[['text','sentiment']]

Next, the 'Neutral' tweets are dropped, since the goal here is only to tell positive tweets from negative ones. The text is then lowercased and stripped of everything except letters, digits and whitespace. The vocabulary size is capped at 2000 words (`max_fatures`), and a Tokenizer converts each tweet into a sequence of integer word indices that the network can take as input.

In [11]:
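data = data[data.sentiment != "Neutral"].copy() #copy so the assignments below do not act on a slice of the original frame
data['text'] = data['text'].apply(lambda x: x.lower()) #convert into lowercase letters
data['text'] = data['text'].apply((lambda x: re.sub(r'[^a-zA-Z0-9\s]','',x))) #keep only letters, digits and whitespace

#.size counts cells (rows x 2 columns), so each printed value is twice the number of tweets
print(data[ data['sentiment'] == 'Positive'].size) #2 x nb of positives
print(data[ data['sentiment'] == 'Negative'].size) #2 x nb of negatives

#Attempt to remove the 'rt' (retweet) marker. Note that iterrows() yields copies of the rows,
#so this assignment is generally not written back to data, and replace() would also hit 'rt' inside words.
for idx,row in data.iterrows():
    row['text'] = row['text'].replace('rt',' ')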

#We now turn each tweet into a sequence of integers with the Keras Tokenizer:
#each word is a token, and each token is replaced by its index in the vocabulary
max_fatures = 2000
tokenizer = Tokenizer(num_words=max_fatures, split=' ')
tokenizer.fit_on_texts(data['text'].values)
X = tokenizer.texts_to_sequences(data['text'].values)
X = pad_sequences(X)

#pad_sequences pads each sequence with zeros at the beginning
#so that all sequences have the length of the longest one (28 tokens here)
print(X[0])
print(X[1])


3498
12558
[   0    0    0    0    0    0    0    0    0    0    3  268   98    2
  563    1   17   27  292   32  136    6  150 1607   10  998 1061  673]
[   0    0    0    0    0    0    0    0    0    3   16  206  227    7
  757   78  115   25 1797   97    6    2  199   10    1  147  489   13]
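
The cleaning, tokenizing and padding steps above can be sketched in plain Python to make the printed sequences easier to read. This is a simplified illustration, not the Keras implementation: the helper names `clean`, `build_index`, `encode` and `pad_left` are hypothetical, and the toy tweets are made up. It assumes, as the Keras `Tokenizer` does, that index 1 goes to the most frequent word and that only words with an index below `num_words` are kept. The word-boundary removal of `rt` is an assumption about what the notebook intends.

```python
import re
from collections import Counter

def clean(text):
    """Lowercase, drop non-alphanumeric symbols, remove the standalone token 'rt'."""
    text = re.sub(r'[^a-z0-9\s]', '', text.lower())
    return re.sub(r'\brt\b', ' ', text)

def build_index(texts):
    """Map each word to a rank: 1 for the most frequent word, 2 for the next, ..."""
    counts = Counter(word for t in texts for word in t.split())
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def encode(text, index, num_words):
    """Replace words by their rank, skipping unknown words and ranks >= num_words."""
    return [index[w] for w in text.split() if w in index and index[w] < num_words]

def pad_left(seqs):
    """Left-pad with zeros so every sequence has the length of the longest one."""
    width = max(len(s) for s in seqs)
    return [[0] * (width - len(s)) + s for s in seqs]

# Made-up tweets, for illustration only
tweets = ["RT @user: the debate was great!", "The debate was boring", "Great party"]
cleaned = [clean(t) for t in tweets]
index = build_index(cleaned)
padded = pad_left([encode(t, index, num_words=5) for t in cleaned])
print(cleaned)
print(padded)
```

Rare words simply disappear from the encoded sequence, which is why a tweet can have fewer non-zero entries than it has words.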


Next, we design the LSTM network. Note that **embed_dim**, **lstm_out**, **batch_size** and the dropout rates are hyperparameters: the values below are reasonable starting points, and they should be tuned to achieve good results. Also note that softmax is used as the activation of the output layer. The network is trained with categorical cross-entropy, which expects a probability distribution over the classes, and softmax produces exactly that.
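
The link between softmax and categorical cross-entropy can be checked by hand on a tiny example. The sketch below uses NumPy only. The function names and the toy scores are illustrative assumptions, not part of the Keras API.

```python
import numpy as np

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1 along the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)  # for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def categorical_crossentropy(y_true, y_prob):
    """Mean of -sum(y_true * log(y_prob)) over the batch; y_true is one-hot."""
    return float(np.mean(-np.sum(y_true * np.log(y_prob), axis=-1)))

logits = np.array([[2.0, 0.0],    # leans towards class 0
                   [0.0, 0.0]])   # undecided
y_true = np.array([[1.0, 0.0],    # true class 0
                   [0.0, 1.0]])   # true class 1
probs = softmax(logits)
loss = categorical_crossentropy(y_true, probs)
print(probs)
print(loss)
```

With one-hot targets, the loss only looks at the probability assigned to the true class, so a confident correct prediction contributes little and an undecided one contributes log(2).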

In [12]:
embed_dim = 128
lstm_out = 196

model = Sequential()
model.add(Embedding(max_fatures, embed_dim,input_length = X.shape[1]))
model.add(SpatialDropout1D(0.4))
model.add(LSTM(lstm_out, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(2,activation='softmax'))
model.compile(loss = 'categorical_crossentropy', optimizer='adam',metrics = ['accuracy'])
model.summary() #summary() prints the table itself and returns None, so it is not wrapped in print()



We now build the train and test datasets.

In [13]:
Y = pd.get_dummies(data['sentiment']).values
X_train, X_test, Y_train, Y_test = train_test_split(X,Y, test_size = 0.33, random_state = 42)
print(X_train.shape,Y_train.shape)
print('\n')
print(X_test.shape,Y_test.shape)

(5378, 28) (5378, 2)


(2650, 28) (2650, 2)
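
The later cells treat column 0 of `Y` as negative and column 1 as positive. This relies on `pd.get_dummies` ordering its columns by sorted label, so 'Negative' comes before 'Positive'. The sketch below reproduces the same ordering with NumPy only, on made-up labels, as a way to see where that convention comes from.

```python
import numpy as np

# Made-up labels, for illustration only
labels = np.array(['Positive', 'Negative', 'Negative', 'Positive', 'Negative'])

# np.unique returns the distinct labels in sorted order plus, for each sample,
# the position of its label in that sorted list.
classes, class_ids = np.unique(labels, return_inverse=True)

# Row i of the identity matrix is the one-hot vector of class i.
one_hot = np.eye(len(classes), dtype='float32')[class_ids]

print(classes)
print(one_hot)
```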


Here we train the network. Many more than 2 epochs would be needed in practice, but to start, we set it to 2.

In [14]:
batch_size = 32
model.fit(X_train, Y_train, epochs = 2, batch_size=batch_size, verbose = 2)

Epoch 1/2
169/169 - 15s - 87ms/step - accuracy: 0.7873 - loss: 0.5017
Epoch 2/2
169/169 - 12s - 72ms/step - accuracy: 0.8561 - loss: 0.3457


<keras.src.callbacks.history.History at 0x7bf086c96bf0>

We now set aside the last 1500 test samples as a validation set, and measure the loss (score) and accuracy on the remaining test samples.

In [15]:
validation_size = 1500

X_validate = X_test[-validation_size:]
Y_validate = Y_test[-validation_size:]
X_test = X_test[:-validation_size]
Y_test = Y_test[:-validation_size]
score,acc = model.evaluate(X_test, Y_test, verbose = 2, batch_size = batch_size)
print("score: %.2f" % (score))
print("acc: %.2f" % (acc))

36/36 - 3s - 83ms/step - accuracy: 0.8609 - loss: 0.3629
score: 0.36
acc: 0.86


Finally, we measure the number of correct guesses per class on the validation set. The network finds negative tweets very well, but it recognizes positive tweets much less reliably. The poor results for positive tweets can be explained by the class imbalance in the data.

In [16]:
pos_cnt, neg_cnt, pos_correct, neg_correct = 0, 0, 0, 0
for x in range(len(X_validate)):

    result = model.predict(X_validate[x].reshape(1,X_test.shape[1]),batch_size=1,verbose = 0)[0]

    if np.argmax(result) == np.argmax(Y_validate[x]):
        if np.argmax(Y_validate[x]) == 0:
            neg_correct += 1
        else:
            pos_correct += 1

    if np.argmax(Y_validate[x]) == 0:
        neg_cnt += 1
    else:
        pos_cnt += 1



print("pos_acc", pos_correct/pos_cnt*100, "%")
print("neg_acc", neg_correct/neg_cnt*100, "%")

pos_acc 44.47592067988669 %
neg_acc 95.9023539668701 %
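
The same per-class accuracy can be computed without a Python loop. The sketch below is one possible vectorized version, shown on made-up arrays. The function name `per_class_accuracy` is hypothetical.

```python
import numpy as np

def per_class_accuracy(y_true_onehot, y_prob):
    """Return {class index: fraction of samples of that class predicted correctly}."""
    true = np.argmax(y_true_onehot, axis=1)
    pred = np.argmax(y_prob, axis=1)
    return {int(c): float(np.mean(pred[true == c] == c)) for c in np.unique(true)}

# Made-up targets and predicted probabilities: 3 negatives (class 0), 2 positives (class 1)
y_true = np.array([[1, 0], [1, 0], [1, 0], [0, 1], [0, 1]])
y_prob = np.array([[0.9, 0.1], [0.7, 0.3], [0.6, 0.4], [0.8, 0.2], [0.3, 0.7]])
acc = per_class_accuracy(y_true, y_prob)
print(acc)
```

With the notebook's arrays, this would presumably be called as `per_class_accuracy(Y_validate, model.predict(X_validate, verbose=0))`, which also replaces the 1500 single-sample calls to `predict` with one batched call.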


**We now predict some tweets**

In [17]:
twt = ['Meetings: Because none of us is as dumb as all of us.']
#vectorizing the tweet by the pre-fitted tokenizer instance
twt = tokenizer.texts_to_sequences(twt)
#padding the tweet to the same length as the training sequences (28, the Embedding input length)
twt = pad_sequences(twt, maxlen=28, dtype='int32', value=0)
print(twt)
sentiment = model.predict(twt,batch_size=1,verbose = 2)[0]
if(np.argmax(sentiment) == 0):
    print("negative")
elif (np.argmax(sentiment) == 1):
    print("positive")

[[  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0 168
  677   6  95   7  39 903  39  40   6  95]]
1/1 - 0s - 23ms/step
negative


Test your own tweets!
**Note**: this is a basic notebook. The model needs more epochs and special attention to class imbalance. Possible remedies are to use more data or to start from pretrained models.
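
Another common way to handle class imbalance, not used in this notebook, is to weight the classes in the loss. The sketch below computes inverse-frequency weights with NumPy on made-up labels. The function name `balanced_class_weights` is hypothetical, and the formula `n_samples / (n_classes * class_count)` is one common heuristic among several.

```python
import numpy as np

def balanced_class_weights(class_ids):
    """Weight each class by n_samples / (n_classes * class_count)."""
    classes, counts = np.unique(class_ids, return_counts=True)
    weights = len(class_ids) / (len(classes) * counts)
    return {int(c): float(w) for c, w in zip(classes, weights)}

# Made-up labels with the notebook's convention (0 = negative, 1 = positive):
# 6 negatives and 2 positives
class_ids = np.array([0, 0, 0, 0, 0, 0, 1, 1])
class_weight = balanced_class_weights(class_ids)
print(class_weight)
```

In the notebook, `class_ids` would be `np.argmax(Y_train, axis=1)`, and the dictionary could be tried as the `class_weight` argument of `model.fit`. Whether this actually improves the accuracy on positive tweets has to be checked experimentally.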

In [38]:
twts = ['This is dumb',
        'I love pudding',
        'But you are kind',
        "I don't think there are good and bad situations, if I had to sum up my life, I would say that this is before all some meetings",
        "This situation is bad",
        "I don't like country music"]
#vectorizing each tweet with the pre-fitted tokenizer instance
for twt in twts:
  #texts_to_sequences expects a list of texts; a bare string would be split into characters
  seq = tokenizer.texts_to_sequences([twt])
  #padding the tweet to the same length as the training sequences
  seq = pad_sequences(seq, maxlen=28, dtype='int32', value=0)
  sentiment = model.predict(seq,batch_size=1,verbose = 0)[0]
  if(np.argmax(sentiment) == 0):
      print("negative ->", sentiment[0])
  elif (np.argmax(sentiment) == 1):
      print("positive ->", sentiment[1])
