# TWITTER SENTIMENT ANALYSIS

# dataset information

The objective of this task is to detect hate speech in tweets. For the sake of simplicity, we say a tweet contains hate speech if it has a racist or sexist sentiment associated with it. So, the task is to distinguish racist or sexist tweets from other tweets.

Formally, given a training sample of tweets and labels, where label '1' denotes the tweet is racist/sexist and label '0' denotes the tweet is not racist/sexist, your objective is to predict the labels on the test dataset.

Hate speech is an unfortunately common occurrence on the Internet. Social media sites like Facebook and Twitter often face the problem of identifying and censoring problematic posts while weighing the right to freedom of speech. The importance of detecting and moderating hate speech is evident from the strong connection between hate speech and actual hate crimes. Early identification of users promoting hate speech could enable outreach programs that attempt to prevent an escalation from speech to action. Sites such as Twitter and Facebook have been seeking to actively combat hate speech. Despite this, NLP research on hate speech has been very limited, primarily due to the lack of a general definition of hate speech, an analysis of its demographic influences, and an investigation of the most effective features.

Evaluation Metric:
The metric used to evaluate the classification model is the F1 score.

The metric can be understood as -

True Positives (TP) - These are the correctly predicted positive values which means that the value of actual class is yes and the value of predicted class is also yes.

True Negatives (TN) - These are the correctly predicted negative values which means that the value of actual class is no and value of predicted class is also no.

False Positives (FP) - The actual class is no but the predicted class is yes.

False Negatives (FN) - The actual class is yes but the predicted class is no.

Precision = TP / (TP + FP)

Recall = TP / (TP + FN)

F1 Score = 2*(Recall * Precision) / (Recall + Precision)

F1 is usually more useful than accuracy, especially when the class distribution is uneven.
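To see why accuracy can mislead on an uneven class distribution, the sketch below scores a trivial classifier that always predicts 0. The labels are made up for illustration (93 negatives, 7 positives), `binary_scores` is a hypothetical helper rather than a library function, and treating an undefined precision or recall as 0 is an assumption of this sketch.

```python
def binary_scores(y_true, y_pred):
    """Return (accuracy, precision, recall, f1) for binary 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    # Assumption: a ratio with a zero denominator is reported as 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return accuracy, precision, recall, f1


# Illustrative labels only: 93 negatives and 7 positives
y_true = [0] * 93 + [1] * 7
always_zero = [0] * 100  # a classifier that never flags a tweet

acc, prec, rec, f1 = binary_scores(y_true, always_zero)
print(acc, f1)  # accuracy is high, F1 is zero
```

The always-negative classifier finds none of the positive tweets, yet its accuracy is 0.93; its F1 of 0.0 reflects that failure.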

# dataset

In [1]:
import pandas as pd
import numpy as np

In [2]:
df = pd.read_csv("D:/sentiment anlysis/train_E6oV3lV.csv")
df

Unnamed: 0,id,label,tweet
0,1,0,@user when a father is dysfunctional and is s...
1,2,0,@user @user thanks for #lyft credit i can't us...
2,3,0,bihday your majesty
3,4,0,#model i love u take with u all the time in ...
4,5,0,factsguide: society now #motivation
...,...,...,...
31957,31958,0,ate @user isz that youuu?ðððððð...
31958,31959,0,to see nina turner on the airwaves trying to...
31959,31960,0,listening to sad songs on a monday morning otw...
31960,31961,1,"@user #sikh #temple vandalised in in #calgary,..."


In [3]:
x=df.drop('label',axis=1)

In [4]:
y=df['label']

# module importing

Model used:
* Bidirectional LSTM (RNN)

In [5]:
import tensorflow as tf
from tensorflow.keras.layers import Embedding
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Bidirectional
from tensorflow.keras.preprocessing.text import one_hot 

In [6]:
voc_size=5000

In [7]:
messages=x.copy()

In [8]:
messages

Unnamed: 0,id,tweet
0,1,@user when a father is dysfunctional and is s...
1,2,@user @user thanks for #lyft credit i can't us...
2,3,bihday your majesty
3,4,#model i love u take with u all the time in ...
4,5,factsguide: society now #motivation
...,...,...
31957,31958,ate @user isz that youuu?ðððððð...
31958,31959,to see nina turner on the airwaves trying to...
31959,31960,listening to sad songs on a monday morning otw...
31960,31961,"@user #sikh #temple vandalised in in #calgary,..."


# text preprocessing

In [9]:
import nltk
import re
from nltk.corpus import stopwords

In [10]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\KAUSHIK\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [11]:
from nltk.stem.porter import PorterStemmer
ps = PorterStemmer()

In [12]:
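corpus = []
# Build the stop-word set once; calling stopwords.words() for every word is slow
stop_words = set(stopwords.words('english'))
for i in range(0, len(messages)):
    # Keep letters only, then lowercase and split on whitespace
    review = re.sub('[^a-zA-Z]', ' ', str(messages['tweet'][i]))
    review = review.lower()
    review = review.split()

    # Drop stop words and stem the remaining tokens
    review = [ps.stem(word) for word in review if word not in stop_words]
    review = ' '.join(review)
    corpus.append(review)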
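The effect of the cleaning loop on a single tweet can be sketched without NLTK. This is only an illustration: `clean_tweet` is a hypothetical helper, `STOP_WORDS` is a tiny made-up stand-in for NLTK's English list, the input tweet is invented, and Porter stemming is left out.

```python
import re

# Hypothetical, tiny stop-word set; the notebook uses NLTK's English list
STOP_WORDS = {"for", "i", "can", "t"}


def clean_tweet(text, stop_words=STOP_WORDS):
    """Replace non-letters with spaces, lowercase, and drop stop words."""
    letters_only = re.sub('[^a-zA-Z]', ' ', str(text))
    tokens = letters_only.lower().split()
    kept = [word for word in tokens if word not in stop_words]
    return ' '.join(kept)


print(clean_tweet("@user thanks for #lyft credit i can't use"))
# prints: user thanks lyft credit use
```

Because the regular expression replaces every non-letter with a space, `@user` becomes `user`, hashtags lose their `#`, and `can't` splits into `can` and `t` before the stop-word filter runs.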

In [13]:
corpus

['user father dysfunct selfish drag kid dysfunct run',
 'user user thank lyft credit use caus offer wheelchair van pdx disapoint getthank',
 'bihday majesti',
 'model love u take u time ur',
 'factsguid societi motiv',
 'huge fan fare big talk leav chao pay disput get allshowandnogo',
 'user camp tomorrow user user user user user user user danni',
 'next school year year exam think school exam hate imagin actorslif revolutionschool girl',
 'love land allin cav champion cleveland clevelandcavali',
 'user user welcom gr',
 'ireland consum price index mom climb previou may blog silver gold forex',
 'selfish orlando standwithorlando pulseshoot orlandoshoot biggerproblem selfish heabreak valu love',
 'get see daddi today day gettingf',
 'user cnn call michigan middl school build wall chant tcot',
 'comment australia opkillingbay seashepherd helpcovedolphin thecov helpcovedolphin',
 'ouch junior angri got junior yugyoem omg',
 'thank paner thank posit',
 'retweet agre',
 'friday smile around

# one hot encoding and padding

In [14]:
onehot_repr = [one_hot(words,voc_size)for words in corpus]
print(onehot_repr)

[[845, 2515, 4654, 2797, 1553, 3588, 4654, 2004], [845, 845, 1273, 1050, 135, 4824, 1980, 169, 2267, 3077, 4535, 4240, 3984], [2176, 329], [4435, 3275, 2520, 1615, 2520, 4289, 1395], [4389, 599, 3577], [635, 4712, 2311, 3164, 1015, 4039, 1744, 2698, 919, 550, 2454], [845, 4553, 3608, 845, 845, 845, 845, 845, 845, 845, 1877], [926, 4617, 1002, 1002, 4370, 777, 4617, 4370, 1580, 4088, 4314, 4288, 4427], [3275, 4752, 133, 3818, 4134, 3221, 3420], [845, 845, 561, 176], [2441, 105, 4255, 4562, 1494, 439, 295, 2854, 4010, 15, 3809, 1870], [2797, 4933, 2816, 227, 2965, 4390, 2797, 2442, 65, 3275], [550, 3711, 996, 2921, 3391, 4936], [845, 148, 283, 4442, 3836, 4617, 1861, 4060, 475, 2458], [4738, 2706, 4, 2992, 3481, 3202, 3481], [3055, 483, 3232, 3768, 483, 413, 3037], [1273, 992, 1273, 3470], [4225, 2262], [3114, 4456, 4987, 3872, 395, 845, 845, 3258, 2896, 3820], [179, 1235, 1714, 154, 2908], [2935, 3820, 3830, 2903, 2431, 2308, 1733, 208, 2903, 304, 4719, 3759, 179, 2654, 2446], [4491, 65
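Despite its name, Keras `one_hot` does not build one-hot vectors: it hashes each word to an integer index below `voc_size`, so two different words can share an index. By default it relies on Python's built-in `hash`, which for strings is not guaranteed to give the same values in a different interpreter session. The sketch below shows the same idea with MD5, which is stable across runs. `stable_word_index` and `encode` are hypothetical helpers for illustration, not Keras functions.

```python
import hashlib


def stable_word_index(word, n):
    """Map a word to an integer in [1, n - 1] using MD5 (stable across runs)."""
    digest = hashlib.md5(word.encode("utf-8")).hexdigest()
    return int(digest, 16) % (n - 1) + 1


def encode(text, n):
    """Encode a whitespace-separated string as a list of word indices."""
    return [stable_word_index(word, n) for word in text.split()]


voc_size = 5000
ids = encode("user user thank lyft", voc_size)
print(ids)  # the first two indices are equal because the word repeats
```

Index 0 is never produced, which leaves it free to act as the padding value in the next step.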

In [15]:
sent_length=20
embedded_docs = pad_sequences(onehot_repr,padding='pre',maxlen=sent_length)
print(embedded_docs)

[[   0    0    0 ... 3588 4654 2004]
 [   0    0    0 ... 4535 4240 3984]
 [   0    0    0 ...    0 2176  329]
 ...
 [   0    0    0 ...  213 4183 4491]
 [   0    0    0 ... 4127 4468 2974]
 [   0    0    0 ... 1273  845 4585]]
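With `padding='pre'`, short sequences are filled with zeros on the left; with the default truncation setting of `pad_sequences`, sequences longer than `maxlen` keep only their last `maxlen` items. A minimal pure-Python sketch of that behaviour follows; `pad_pre` is a hypothetical helper, not the Keras implementation.

```python
def pad_pre(seq, maxlen, value=0):
    """Left-pad with `value`, or keep only the last `maxlen` items if too long."""
    kept = list(seq)[-maxlen:]
    return [value] * (maxlen - len(kept)) + kept


print(pad_pre([2176, 329], maxlen=5))             # [0, 0, 0, 2176, 329]
print(pad_pre([1, 2, 3, 4, 5, 6, 7], maxlen=5))   # [3, 4, 5, 6, 7]
```

Every row then has the same length, which is what the embedding layer's fixed input length expects.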


# model training and prediction

In [16]:
embedding_vector_features=40
model=Sequential()
model.add(Embedding(voc_size,embedding_vector_features,input_length=sent_length))
model.add(Bidirectional(LSTM(100)))
model.add(Dense(1,activation='sigmoid'))
model.compile(loss='binary_crossentropy',optimizer='adam',metrics=['accuracy'])
print(model.summary())

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 20, 40)            200000    
                                                                 
 bidirectional (Bidirectiona  (None, 200)              112800    
 l)                                                              
                                                                 
 dense (Dense)               (None, 1)                 201       
                                                                 
Total params: 313,001
Trainable params: 313,001
Non-trainable params: 0
_________________________________________________________________
None
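The parameter counts in the summary can be checked by hand. The sketch below uses the standard LSTM count (four gates, each with input weights, recurrent weights, and a bias) and doubles it for the two directions; the variable names are chosen for illustration.

```python
voc_size, emb_dim, units = 5000, 40, 100

# Embedding: one vector of length emb_dim per vocabulary index
embedding_params = voc_size * emb_dim

# One LSTM: 4 gates x (input weights + recurrent weights + bias)
lstm_params = 4 * ((emb_dim + units) * units + units)

# Bidirectional wraps two independent LSTMs (forward and backward)
bilstm_params = 2 * lstm_params

# Dense: 2 * units inputs (concatenated directions) to 1 output, plus bias
dense_params = 2 * units * 1 + 1

total = embedding_params + bilstm_params + dense_params
print(embedding_params, bilstm_params, dense_params, total)
# prints: 200000 112800 201 313001
```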


In [17]:
X_final=np.array(embedded_docs)
y_final=np.array(y)

In [18]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X_final,y_final,test_size=0.33,random_state=42)

In [19]:
model.fit(X_train,y_train,validation_data=(X_test,y_test),epochs=10,batch_size=32)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x229a7bb1dc0>

In [20]:
y_pred=model.predict(X_test)



In [21]:
y_pred=np.where(y_pred>=0.5,1,0)

# score

In [22]:
from sklearn.metrics import confusion_matrix,accuracy_score,classification_report
confusion_matrix(y_test,y_pred)

array([[9527,  279],
       [ 365,  377]], dtype=int64)

In [23]:
matrix=confusion_matrix(y_test,y_pred)
print(matrix)
score=accuracy_score(y_test,y_pred)
print(score)
report=classification_report(y_test,y_pred)
print(report)

[[9527  279]
 [ 365  377]]
0.9389457717102768
              precision    recall  f1-score   support

           0       0.96      0.97      0.97      9806
           1       0.57      0.51      0.54       742

    accuracy                           0.94     10548
   macro avg       0.77      0.74      0.75     10548
weighted avg       0.94      0.94      0.94     10548
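The figures in the report can be reproduced from the confusion matrix printed above. scikit-learn lays the matrix out with rows as actual classes and columns as predicted classes, so for binary labels it reads `[[TN, FP], [FN, TP]]`. A short check, using only the numbers already shown:

```python
# Confusion matrix from the output above: [[TN, FP], [FN, TP]]
matrix = [[9527, 279], [365, 377]]
(tn, fp), (fn, tp) = matrix

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(round(precision, 2), round(recall, 2), round(f1, 2), round(accuracy, 4))
# prints: 0.57 0.51 0.54 0.9389
```

The 0.94 accuracy is driven mostly by the 9806 negative tweets; on the positive class, which is what the F1 evaluation metric targets, the model scores 0.54.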



# test data prediction

In [24]:
df1=pd.read_csv("D:/sentiment anlysis/test_tweets_anuFYb8.csv")
df1

Unnamed: 0,id,tweet
0,31963,#studiolife #aislife #requires #passion #dedic...
1,31964,@user #white #supremacists want everyone to s...
2,31965,safe ways to heal your #acne!! #altwaystohe...
3,31966,is the hp and the cursed child book up for res...
4,31967,"3rd #bihday to my amazing, hilarious #nephew..."
...,...,...
17192,49155,thought factory: left-right polarisation! #tru...
17193,49156,feeling like a mermaid ð #hairflip #neverre...
17194,49157,#hillary #campaigned today in #ohio((omg)) &am...
17195,49158,"happy, at work conference: right mindset leads..."


In [25]:
df1['tweet'][478:485]

478    @user the right 2 be   without #stress or #fea...
479    it is amazing what a day has #surprises #movem...
480    caffeine fix. a quick cup of coffee.  #positiv...
481       green #jamesmcavoy #atonement #mcavoy   #flim 
482    #pass #last #exam #made #new #lashes thanks to...
483                   @user   #easter! my new hero...   
484     @user peace an' love ooh ooh ...  #brighton #...
Name: tweet, dtype: object

In [26]:
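# Note: this slice keeps rows 0..17195 and drops the final row (df1 has 17197 rows);
# messages_t is not used below, since the preprocessing loop reads df1 directly
messages_t = df1[['tweet']][:17196]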

In [27]:
messages_t

Unnamed: 0,tweet
0,#studiolife #aislife #requires #passion #dedic...
1,@user #white #supremacists want everyone to s...
2,safe ways to heal your #acne!! #altwaystohe...
3,is the hp and the cursed child book up for res...
4,"3rd #bihday to my amazing, hilarious #nephew..."
...,...
17191,2_damn_tuff-ruff_muff__techno_city-(ng005)-web...
17192,thought factory: left-right polarisation! #tru...
17193,feeling like a mermaid ð #hairflip #neverre...
17194,#hillary #campaigned today in #ohio((omg)) &am...


In [38]:
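corpus_t = []
# Build the stop-word set once; calling stopwords.words() for every word is slow
stop_words = set(stopwords.words('english'))
for i in range(0, len(df1)):
    # Same cleaning as for the training tweets: letters only, lowercase, split
    review = re.sub('[^a-zA-Z]', ' ', str(df1['tweet'][i]))
    review = review.lower()
    review = review.split()

    # Drop stop words and stem the remaining tokens
    review = [ps.stem(word) for word in review if word not in stop_words]
    review = ' '.join(review)
    corpus_t.append(review)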

In [39]:
corpus_t

['studiolif aislif requir passion dedic willpow find newmateri',
 'user white supremacist want everyon see new bird movi',
 'safe way heal acn altwaystoh healthi heal',
 'hp curs child book reserv alreadi ye harrypott pottermor favorit',
 'rd bihday amaz hilari nephew eli ahmir uncl dave love miss',
 'choos momtip',
 'someth insid die eye ness smokeyey tire lone sof grung',
 'finish tattoo ink ink loveit thank aleee',
 'user user user never understand dad left young deep inthefeel',
 'delici food lovelif capetown mannaepicur restur',
 'dayswast narcosi infinit ep make awar grind neuro bass lifestyl',
 'one world greatest spo event leman teamaudi',
 'half way websit allgoingwel',
 'good food good life enjoy call garlic bread iloveit',
 'stand behind guncontrolpleas senselessshoot takethegun comicrelief stillsad',
 'ate ate ate jamaisasthi fish curri prawn hilsa foodfestiv foodi',
 'user got user limit edit rain shine set today user user user user',
 'amp love amp hug amp kiss keep babi 

In [40]:
onehot_repr_t = [one_hot(words,voc_size)for words in corpus_t]
print(onehot_repr_t)

[[4385, 403, 1691, 1552, 1138, 4114, 1855, 12], [845, 2689, 1160, 1530, 2888, 3711, 1628, 3626, 823], [2925, 3243, 2130, 681, 1231, 247, 2130], [107, 4191, 4934, 2867, 2258, 4105, 4673, 2936, 96, 2563], [2149, 2176, 2646, 3331, 1925, 3911, 4100, 501, 4169, 3275, 4796], [647, 3121], [2828, 4240, 30, 887, 901, 3252, 777, 4734, 3072, 4546], [3839, 4847, 1061, 1061, 2270, 1273, 2002], [845, 845, 845, 4158, 2185, 4771, 3533, 3508, 1113, 1971], [325, 3890, 262, 3793, 1673, 1076], [1134, 2444, 2960, 3814, 2896, 1294, 3281, 526, 935, 3671], [104, 1438, 1119, 1652, 2891, 2254, 1052], [4190, 3243, 2233, 1708], [638, 3890, 638, 4082, 227, 283, 3710, 4495, 4812], [3641, 88, 2566, 4064, 728, 835, 2250], [2290, 2290, 2290, 4738, 1648, 4424, 2940, 1305, 4156, 4777], [845, 3768, 845, 1766, 1814, 4526, 312, 4838, 2921, 845, 845, 845, 845], [3382, 3275, 3382, 1824, 3382, 48, 2088, 1100, 3236, 1245], [4427, 1645, 1336, 4672, 2858, 1516], [2551, 60, 4362, 4050, 2123, 1312, 631, 3741, 3146, 1804, 3688, 180

In [41]:
sent_length=20
embedded_docs_t = pad_sequences(onehot_repr_t,padding='pre',maxlen=sent_length)
print(embedded_docs_t)

[[   0    0    0 ... 4114 1855   12]
 [   0    0    0 ... 1628 3626  823]
 [   0    0    0 ... 1231  247 2130]
 ...
 [   0    0 3735 ... 4534 2913 2183]
 [   0    0    0 ... 2329 4183 2673]
 [   0    0    0 ... 4736 1072 1413]]


In [42]:
x_test_t = np.array(embedded_docs_t)

In [43]:
test = model.predict(x_test_t)



In [44]:
test=np.where(test>=0.5,1,0)

In [45]:
test.shape

(17197, 1)

In [46]:
df1.shape

(17197, 2)

In [47]:
df1['label_pred'] = test.ravel()  # flatten the (17197, 1) predictions to 1-D

In [49]:
df1.to_csv("submission.csv")