![Twitter](https://i.imgur.com/ELyYIf9.png)

# Twitter Sentiment Analysis

Using the Sentiment140 Twitter dataset, which was annotated automatically, we train a neural network that produces a positivity score for a tweet (a score of **1.0** is the maximum positiveness). This score can be used as an input to other machine learning models to improve their accuracy.

Tools used:

- TensorFlow 2.0 (code optimized for GPU)
- Pandas
- Scikit-learn
- tqdm
- GloVe pre-trained word embeddings (Twitter-specific)

Dataset: http://help.sentiment140.com/for-students

GloVe: https://nlp.stanford.edu/projects/glove/

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.utils import shuffle

import mmap
import tqdm

import tensorflow as tf

## Preprocessing:

- Load data
- Shuffle
- Out of 1,600,000 tweets, 800,000 are negative and the remaining are positive. The negative rows come first in the file.
- We shuffle the data so that any given sample contains roughly the same number of positive and negative tweets. This also makes it easy to use only part of the dataset to work around memory constraints.
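
The effect of shuffling can be illustrated on a toy label list. This is a minimal sketch using only the standard library; the list size, slice size, and seed are arbitrary assumptions, not values from the notebook.

```python
import random

# Toy stand-in for the target column: all negatives (0) first, then positives (1).
labels = [0] * 800 + [1] * 800

# Without shuffling, a slice taken from the front contains only negatives.
head_before = labels[:200]
frac_pos_before = sum(head_before) / len(head_before)

# Shuffle a copy with a fixed (arbitrary) seed so the result is reproducible.
rng = random.Random(42)
shuffled = labels[:]
rng.shuffle(shuffled)

# After shuffling, the same slice holds a mix of both classes.
head_after = shuffled[:200]
frac_pos_after = sum(head_after) / len(head_after)

print("positive fraction before shuffle:", frac_pos_before)
print("positive fraction after shuffle:", frac_pos_after)
```

After shuffling, the positive fraction of the slice should land near one half, which is why a partial dataset taken from the shuffled frame stays roughly balanced.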

In [2]:
cols = ["target","id","date","Query","username","tweet"]
train = pd.read_csv("/home/jagadeesh/Datasets/Twitter/train.csv",encoding = "ISO-8859-1",names=cols, header=None)
train = shuffle(train)
train.head()

| | target | id | date | Query | username | tweet |
|---|---|---|---|---|---|---|
| 623516 | 0 | 2229581957 | Thu Jun 18 15:55:21 PDT 2009 | NO_QUERY | xDrugFreeMattx | I'm soo fucking sore ughhh soo fucking worth ... |
| 232576 | 0 | 1979194550 | Sun May 31 02:00:54 PDT 2009 | NO_QUERY | stevieenglish | my tips have gone to shit, @mellalicious seem... |
| 523838 | 0 | 2193270699 | Tue Jun 16 08:13:46 PDT 2009 | NO_QUERY | lacebound | @twelfthminute Loser, I am doing E Math. I hat... |
| 892369 | 4 | 1691162833 | Sun May 03 17:14:20 PDT 2009 | NO_QUERY | K_MAE | just woke up from a wonderful nap. I feel bett... |
| 186310 | 0 | 1968292428 | Fri May 29 21:08:32 PDT 2009 | NO_QUERY | nomibear | @alwayskatharine It works for me... Maybe it'... |


In [3]:
#remove 'Query' and 'id'
train.drop(['Query','id'],axis=1,inplace=True)

#replace 4 (positive) with 1; assigning back avoids relying on in-place chained replacement
train['target'] = train['target'].replace(4, 1)

#Select 20k rows as validation set
val = train[100000:120000]

#first 100k as training set.
train = train[:100000]

In [4]:
print("Negative:",len(train[train.target == 0].index),"Positive:",len(train[train.target == 1].index))

Negative: 50164 Positive: 49836


In [5]:
#check for missing values

train.isna().sum()

target      0
date        0
username    0
tweet       0
dtype: int64

In [6]:
#set target as label
train_labels = train.target.values
val_labels = val.target.values

- 'Tokenizing' means splitting a sentence into a list of words, similar to ```.split()```. By default the Keras ```Tokenizer``` also lowercases the text and strips punctuation.
- Ex: ```"Hi, How are you?"``` --> *Tokenize()* ---> ```["hi","how","are","you"]```

- ```texts_to_sequences``` converts each tokenized list of words into a list of integer indices.

### Glossary

- ```word_index``` - Dictionary that maps every unique word in the dataset to an integer index, starting at 1.
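
To make the two steps concrete, here is a simplified pure-Python stand-in for tokenizing and index lookup. It is a sketch of the idea, not the Keras implementation: the regular expression and the helper names (`tokenize`, `build_word_index`, `texts_to_sequences`) are assumptions made for illustration.

```python
import re
from collections import Counter

def tokenize(text):
    # Lowercase, then keep runs of letters, digits and apostrophes.
    # This is a simplification of the Keras Tokenizer's punctuation filter.
    return re.findall(r"[a-z0-9']+", text.lower())

def build_word_index(texts):
    # More frequent words get smaller indices; index 0 is left free for padding.
    counts = Counter(tok for text in texts for tok in tokenize(text))
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def texts_to_sequences(texts, word_index):
    # Replace every known word by its index; unknown words are skipped.
    return [[word_index[tok] for tok in tokenize(text) if tok in word_index]
            for text in texts]

docs = ["Hi, how are you?", "How are you doing today?"]
word_index = build_word_index(docs)
sequences = texts_to_sequences(docs, word_index)
vocab_size = len(word_index) + 1  # +1 for the reserved padding index 0

print(word_index)
print(sequences)
print("vocab_size:", vocab_size)
```

The `+ 1` in `vocab_size` mirrors the notebook: index 0 never maps to a word, so the embedding layer needs one more row than there are words.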

In [7]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

t = Tokenizer()
t.fit_on_texts(pd.concat([train,val])['tweet'])
vocab_size = len(t.word_index) + 1
print("vocab_size:", vocab_size)

train_encoded_docs = t.texts_to_sequences(train.tweet)
val_encoded_docs = t.texts_to_sequences(val.tweet)

vocab_size: 112644


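- ```get_embedding_index``` loads the GloVe vectors into a dictionary that maps each word to its vector.
- ```create_weight_matrix``` builds the embedding weight matrix from that *embeddings_index*; words without a GloVe vector keep an all-zero row.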
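
The same two steps on a tiny scale, as a hedged sketch: the three "GloVe" lines, the 3-dimensional vectors, and the toy `word_index` are made up for illustration, and plain lists stand in for the NumPy arrays used in the notebook.

```python
# Three made-up lines in the GloVe text format: a word followed by its vector.
glove_lines = [
    "good 0.5 0.1 -0.2",
    "bad -0.4 0.3 0.9",
    "day 0.0 0.2 0.1",
]
embedding_dim = 3  # the notebook uses 100-dimensional vectors

# Step 1: build the word -> vector lookup table.
embeddings_index = {}
for line in glove_lines:
    values = line.split()
    embeddings_index[values[0]] = [float(v) for v in values[1:]]

# Step 2: a hypothetical tokenizer vocabulary; index 0 is reserved for padding.
word_index = {"good": 1, "day": 2, "soooo": 3}
vocab_size = len(word_index) + 1

# Step 3: one row per index; words missing from GloVe keep an all-zero row.
embedding_matrix = [[0.0] * embedding_dim for _ in range(vocab_size)]
for word, i in word_index.items():
    vector = embeddings_index.get(word)
    if vector is not None:
        embedding_matrix[i] = vector

for row in embedding_matrix:
    print(row)
```

Row `i` of the matrix lines up with tokenizer index `i`, which is what lets the embedding layer look up a vector directly from a token id.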

In [8]:
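def get_num_lines(file_path):
    # Count lines through a read-only memory map so tqdm can show a total.
    with open(file_path, "rb") as fp:
        with mmap.mmap(fp.fileno(), 0, access=mmap.ACCESS_READ) as buf:
            lines = 0
            while buf.readline():
                lines += 1
    return lines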

def get_embedding_index(path):
    # Map each word in the GloVe file to its vector.
    embeddings_index = dict()
    with open(path, encoding="utf-8") as f:
        for line in tqdm.tqdm_notebook(f, total=get_num_lines(path)):
            values = line.split()
            word = values[0]
            coefs = np.asarray(values[1:], dtype='float32')
            embeddings_index[word] = coefs
    print('Loaded %s word vectors.' % len(embeddings_index))
    return embeddings_index

def create_weight_matrix(embeddings_index):
    
    embedding_matrix = np.zeros((vocab_size, 100))
    for word, i in t.word_index.items():
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None:
            embedding_matrix[i] = embedding_vector
    return embedding_matrix

embeddings_index = get_embedding_index("/home/jagadeesh/Weights/glove_twitter/glove.twitter.27B.100d.txt")
embedding_matrix = create_weight_matrix(embeddings_index)

HBox(children=(IntProgress(value=0, max=1193514), HTML(value='')))


Loaded 1193514 word vectors.


!['Keras doc pad_seq'](https://i.imgur.com/bD07tF7.png)

In [9]:
max_length = 10

train_padded_docs = pad_sequences(train_encoded_docs, maxlen=max_length, padding='post')
val_padded_docs = pad_sequences(val_encoded_docs, maxlen=max_length, padding='post')
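
What padding to a fixed length does can be sketched in plain Python. The helper `pad_post` below is a hypothetical stand-in, not the Keras function; it assumes `padding='post'` together with Keras' default `truncating='pre'`, so tweets longer than `max_length` lose their first tokens rather than their last.

```python
def pad_post(sequences, maxlen, value=0):
    # Short sequences get `value` appended until they reach maxlen.
    # Long sequences keep only their LAST maxlen items (front truncation).
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]
        padded.append(seq + [value] * (maxlen - len(seq)))
    return padded

short = [5, 8, 2]
long = list(range(1, 13))  # 12 token ids, longer than maxlen

result = pad_post([short, long], maxlen=10)
print(result[0])
print(result[1])
```

With `max_length = 10`, any tweet of more than ten tokens is cut, so the choice of length trades memory against how much of each tweet the model sees.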

## Model Architecture

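- Embedding(100), initialized with the GloVe weights
- LSTM(50)
 - Implemented with ```CuDNNLSTM```, a GPU-only layer that speeds up training significantly.
- Dense(100)
- Flatten (a no-op here, since the Dense output is already flat)
- Dropout(0.2)
- Dense(1) (outputs a score in the range 0 to 1.0)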

In [10]:
model = tf.keras.models.Sequential()

model.add(tf.keras.layers.Embedding(vocab_size, 100,weights=[embedding_matrix], input_length=10))
model.add(tf.compat.v1.keras.layers.CuDNNLSTM(50))
model.add(tf.keras.layers.Dense(100, activation='relu'))
model.add(tf.keras.layers.Flatten())
model.add(tf.keras.layers.Dropout(0.2))


model.add(tf.keras.layers.Dense(1,activation='sigmoid'))

model.compile(optimizer='adam', loss='binary_crossentropy',metrics=['acc'])
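
The final layer and the loss can be written out by hand. This is a minimal sketch of the sigmoid activation and binary cross-entropy for a single example; the input value `2.0` and the clipping constant `eps` are arbitrary assumptions, and the framework's own implementation may differ in numerical details.

```python
import math

def sigmoid(z):
    # Maps any real number into the open interval (0, 1).
    return 1.0 / (1.0 + math.exp(-z))

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    # Clip the prediction to avoid log(0); eps is an arbitrary small constant.
    y_pred = min(max(y_pred, eps), 1.0 - eps)
    return -(y_true * math.log(y_pred) + (1 - y_true) * math.log(1.0 - y_pred))

score = sigmoid(2.0)  # hypothetical pre-activation value of the last layer
loss_if_positive = binary_crossentropy(1, score)
loss_if_negative = binary_crossentropy(0, score)

print("score:", score)
print("loss if the tweet is positive:", loss_if_positive)
print("loss if the tweet is negative:", loss_if_negative)
```

A high score is cheap when the label is 1 and expensive when the label is 0, which is what pushes the output toward the correct end of the 0 to 1.0 range.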

In [11]:
print(model.summary())

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 10, 100)           11264400  
_________________________________________________________________
cu_dnnlstm (CuDNNLSTM)       (None, 50)                30400     
_________________________________________________________________
dense (Dense)                (None, 100)               5100      
_________________________________________________________________
flatten (Flatten)            (None, 100)               0         
_________________________________________________________________
dropout (Dropout)            (None, 100)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 101       
Total params: 11,300,001
Trainable params: 11,300,001
Non-trainable params: 0
____________________________________________
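
The parameter counts in the summary can be reproduced by hand. The arithmetic below is a short check that assumes the layer sizes used in the model; the two-bias layout of the LSTM is inferred from the reported 30,400 figure rather than stated in the notebook.

```python
vocab_size = 112644   # printed by the tokenizer cell
embedding_dim = 100
lstm_units = 50
dense_units = 100

# One 100-dimensional vector per vocabulary index.
embedding_params = vocab_size * embedding_dim

# 4 gates, each with input weights, recurrent weights and two bias vectors.
lstm_params = 4 * (lstm_units * (embedding_dim + lstm_units) + 2 * lstm_units)

# Fully connected layers: weights plus one bias per output unit.
dense_params = lstm_units * dense_units + dense_units
output_params = dense_units * 1 + 1

total = embedding_params + lstm_params + dense_params + output_params
print(embedding_params, lstm_params, dense_params, output_params, total)
```

Almost all of the parameters sit in the embedding layer, and because the summary lists zero non-trainable parameters, the GloVe weights are fine-tuned during training rather than frozen.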

In [12]:
es = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=1)
checkpoint = tf.keras.callbacks.ModelCheckpoint("/home/jagadeesh/Weights/twitter/model",save_best_only=True)
model.fit(train_padded_docs, train_labels,validation_data=(val_padded_docs,val_labels), epochs=15, verbose=1,batch_size=10000,callbacks=[es,checkpoint])

Train on 100000 samples, validate on 20000 samples
Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15


<tensorflow.python.keras.callbacks.History at 0x7fe70018f550>
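
Training stopped after 6 of the 15 epochs, which is consistent with the early-stopping callback firing. The rule can be sketched as follows; this is a simplified model of the callback that ignores options such as a minimum improvement, and the validation losses are hypothetical, not taken from the actual run.

```python
def epochs_run(val_losses, patience=1):
    # Training halts once the validation loss has failed to improve
    # for `patience` epochs in a row; returns the last epoch that ran.
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return len(val_losses)

# Hypothetical validation losses for illustration only.
losses = [0.60, 0.52, 0.49, 0.47, 0.46, 0.465, 0.45]
stopped_at = epochs_run(losses, patience=1)
print("stopped after epoch", stopped_at)
```

With `patience=1` a single noisy epoch ends training, so the checkpoint with `save_best_only=True` matters: it keeps the weights from the best epoch rather than the last one.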

## Evaluation

In [13]:
predicted = model.predict(val_padded_docs)

In [14]:
infered = pd.DataFrame(val['tweet'])
infered['score'] = predicted
infered['result'] = infered.score.apply(lambda x: "+" if x > 0.51 else "-")

In [15]:
pd.set_option('display.max_colwidth', None)  # None shows full tweets; -1 is deprecated in newer pandas
infered.iloc[:50]

| | tweet | score | result |
|---|---|---|---|
| 570472 | @NatalieGear totez jel. Neither of my classes end for like 3 more weeks | 0.119479 | - |
| 799072 | It's soooo hot. I don't deal with heat well. | 0.182169 | - |
| 1381948 | Woke 07:23 to a rain-teem hard at work, now taking an extended break under two layers of patchwork clouds, not warm, not cold, not windy | 0.02642 | - |
| 456231 | brb for like 5 minutes i have to do the dishes | 0.219266 | - |
| 1574601 | @XtnDvla I am eating sushi. Home made. | 0.838343 | + |
| 909881 | @MissLaniSasha lmfao!!! Yea anyone?? Please?? We're really pretty!!! | 0.952968 | + |
| 37124 | Awake. Very very very tired and i think i have to move my ass out of the bed. Do some morning sport. | 0.473959 | - |
| 949242 | Apple store coming to Roseville! | 0.34165 | - |
| 1099968 | @WritingTravel thanks for the follow friday | 0.991828 | + |
| 1079704 | @CJay282 gotta sweat it out ma. the swine flu is receding tho. | 0.060083 | - |
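
The table shows individual predictions but no aggregate metric. One way to summarize them is sketched below, assuming 0/1 ground-truth labels such as `val_labels` and the same 0.51 threshold; the `accuracy` helper, the scores, and the labels are hypothetical and for illustration only.

```python
def accuracy(scores, labels, threshold=0.51):
    # A score above the threshold counts as positive (1), otherwise negative (0).
    predictions = [1 if s > threshold else 0 for s in scores]
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)

# Made-up scores and labels, not results from the trained model.
scores = [0.12, 0.84, 0.95, 0.47, 0.34, 0.99]
labels = [0, 1, 1, 0, 1, 1]

acc = accuracy(scores, labels)
print("accuracy:", acc)
```

Because the original labels were annotated automatically, an accuracy computed this way measures agreement with those noisy labels rather than with human judgment.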


In [18]:
#let's try it on our own sentences.

pos1 = "I'm feeling great today, thanks for asking!"
pos2 = "even though i "

neg1 = "very frustated with recent development in my country."
neg2 = ""

combined_arr = np.array([pos1, neg1])

sentences = t.texts_to_sequences(combined_arr)

sentences = pad_sequences(sentences,maxlen=max_length,padding='post')

res = model.predict(sentences)

res_df = pd.DataFrame(combined_arr)
res_df['analysis'] = res
res_df['result'] = res_df.analysis.apply(lambda x: "+" if x > 0.51 else "-")

In [19]:
res_df.head()

| | 0 | analysis | result |
|---|---|---|---|
| 0 | I'm feeling great today, thanks for asking! | 0.976719 | + |
| 1 | very frustated with recent development in my country. | 0.665812 | + |
