
# Twitter Sentiment Analysis

# 01 : Frame the Problem

#### Problem Statement Link :  https://datahack.analyticsvidhya.com/contest/practice-problem-twitter-sentiment-analysis/

# 02 : Obtain Data

### Import Statements

In [0]:
!mkdir twitter
%cd twitter
!ls


/content/twitter


In [0]:
!pip install missingno
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import missingno as ms
%matplotlib inline






### Reading the Train Data

In [0]:
!wget https://www.dropbox.com/s/p8fq1p6wan2g89a/train.csv -q

In [0]:
!ls -l

In [0]:
train = pd.read_csv('train.csv')
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 31962 entries, 0 to 31961
Data columns (total 3 columns):
id       31962 non-null int64
label    31962 non-null int64
tweet    31962 non-null object
dtypes: int64(2), object(1)
memory usage: 749.2+ KB


In [0]:
pd.set_option('max_colwidth', 240)

In [0]:
train.head()

Unnamed: 0,id,label,tweet
0,1,0,@user when a father is dysfunctional and is so selfish he drags his kids into his dysfunction. #run
1,2,0,@user @user thanks for #lyft credit i can't use cause they don't offer wheelchair vans in pdx. #disapointed #getthanked
2,3,0,bihday your majesty
3,4,0,#model i love u take with u all the time in urð±!!! ððððð¦ð¦ð¦
4,5,0,factsguide: society now #motivation


# 03 : Analyze Data

In [0]:
train.head(20)

Unnamed: 0,id,label,tweet
0,1,0,@user when a father is dysfunctional and is so selfish he drags his kids into his dysfunction. #run
1,2,0,@user @user thanks for #lyft credit i can't use cause they don't offer wheelchair vans in pdx. #disapointed #getthanked
2,3,0,bihday your majesty
3,4,0,#model i love u take with u all the time in urð±!!! ððððð¦ð¦ð¦
4,5,0,factsguide: society now #motivation
5,6,0,[2/2] huge fan fare and big talking before they leave. chaos and pay disputes when they get there. #allshowandnogo
6,7,0,@user camping tomorrow @user @user @user @user @user @user @user dannyâ¦
7,8,0,the next school year is the year for exams.ð¯ can't think about that ð­ #school #exams #hate #imagine #actorslife #revolutionschool #girl
8,9,0,we won!!! love the land!!! #allin #cavs #champions #cleveland #clevelandcavaliers â¦
9,10,0,@user @user welcome here ! i'm it's so #gr8 !


In [0]:
train.iloc[13]

id                                                                               14
label                                                                             1
tweet    @user #cnn calls #michigan middle school 'build the wall' chant '' #tcot  
Name: 13, dtype: object

In [0]:
train['label'].value_counts()

0    29720
1     2242
Name: label, dtype: int64

In [0]:
train[train['label']==1]['tweet'].head()

13                                  @user #cnn calls #michigan middle school 'build the wall' chant '' #tcot  
14       no comment!  in #australia   #opkillingbay #seashepherd #helpcovedolphins #thecove  #helpcovedolphins
17                                                                                      retweet if you agree! 
23                                                             @user @user lumpy says i am a . prove it lumpy.
34    it's unbelievable that in the 21st century we'd need something like this. again. #neverump  #xenophobia 
Name: tweet, dtype: object

## Label types
-   0 : Normal
-   1 : Hate

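The label counts above show that hate tweets are a small minority, which matters when reading the accuracy figures later on. A minimal sketch, using only the two counts printed by `value_counts()`, of how large the minority class is and what accuracy a trivial "always predict 0" baseline would reach (the variable names are illustrative):

```python
# Class counts copied from the value_counts() output above.
counts = {0: 29720, 1: 2242}
total = sum(counts.values())

# Share of tweets labelled as hate (label 1); roughly 7% of the data.
hate_share = counts[1] / total

# Accuracy of a trivial baseline that always predicts label 0.
majority_baseline_accuracy = counts[0] / total

print(f"hate share: {hate_share:.4f}")
print(f"always-0 accuracy: {majority_baseline_accuracy:.4f}")
```

Because the baseline already scores high on accuracy, precision, recall, and F1 on label 1 are the more informative numbers in the iterations that follow.
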
# 05 : Model Selection (1st Iteration)

## RandomForest without Preprocessing of Text Data

In [0]:
# Build a baseline model on the raw tweets, without any text preprocessing
unprocessed_data = pd.read_csv('train.csv')

In [0]:
from sklearn.model_selection import train_test_split


# Split the data into random train and test subsets (80% / 20%)
X_train, X_test, y_train, y_test = train_test_split(unprocessed_data["tweet"],
                                                    unprocessed_data["label"],
                                                    test_size = 0.2, random_state = 42)

In [0]:
# Pipeline chains the vectorizer, the TF-IDF transform, and the classifier into one estimator
from sklearn.pipeline import Pipeline

# Bag-of-words token counts and TF-IDF weighting
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer

from sklearn.ensemble import RandomForestClassifier

text_clf = Pipeline([('vect', CountVectorizer(stop_words='english')),
                     ('tfidf', TfidfTransformer()),
                     ('clf', RandomForestClassifier(n_estimators=50)),])

In [0]:
type(text_clf)

sklearn.pipeline.Pipeline

In [0]:
model = text_clf.fit(X_train,y_train)

In [0]:
predicted = model.predict(X_test)

In [0]:
from sklearn.metrics import precision_score,recall_score,f1_score, accuracy_score, confusion_matrix, classification_report

In [0]:
confusion_matrix(y_test,predicted)

array([[5916,   21],
       [ 253,  203]])

In [0]:
accuracy_score(y_test,predicted)

0.9571406225559206

In [0]:
precision_score(y_test,predicted) # True positives / predicted positives = TP/(TP+FP)

0.90625

In [0]:
recall_score(y_test,predicted) # True positives / actual positives = TP/(TP+FN): of the tweets that were actually hateful, how many did the model catch?

0.4451754385964912

In [0]:
f1_score(y_test,predicted)

0.5970588235294118

# 04 and 05 : Feature Engineering and Model Selection (2nd Iteration)

Preprocessing matters a great deal in text analysis. Before scikit-learn can learn from tweets, two steps are needed: tokenization and feature extraction (vectorization).
First, the text is parsed into individual words, which is called tokenization. Then the words are encoded as integer or floating-point values that a machine learning algorithm can take as input, which is called feature extraction (or vectorization).

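The two steps described above can be sketched on a tiny corpus. The three documents below are invented for illustration and do not come from `train.csv`; the vectorizers run with their default settings, which differ from the settings used later in this notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Toy corpus, invented for illustration (not taken from train.csv).
docs = ["love the land", "love love peace", "fight against peace"]

# Tokenization + counting: split each document into word tokens and
# build a vocabulary that maps each token to a column index.
vect = CountVectorizer()
counts = vect.fit_transform(docs)   # sparse matrix of raw term counts

# Feature weighting: rescale the counts by inverse document frequency.
tfidf = TfidfTransformer()
weights = tfidf.fit_transform(counts)

# Vocabulary terms ordered by their column index.
vocab = sorted(vect.vocabulary_, key=vect.vocabulary_.get)
print(vocab)
print(counts.toarray())
print(weights.toarray().round(2))
```

Each row of `counts` and `weights` is one document and each column is one vocabulary term, so the classifier only ever sees numbers, never raw text.
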

In [0]:
# Regular expressions
import re

# Lowercase the tweet, then replace @mentions and every character that is not
# a letter, digit, space, or tab (punctuation, '#', emoji bytes) with a space.
def process_tweet(tweet):
    return " ".join(re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])", " ", tweet.lower()).split())

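As a quick check of what the cleaning step does, the sketch below applies the same pattern to a shortened version of one of the tweets shown earlier. The function is repeated here only so the snippet runs on its own.

```python
import re

def process_tweet(tweet):
    # Same pattern as the cell above: drop @mentions, then every character
    # that is not a letter, digit, space, or tab.
    return " ".join(re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])", " ", tweet.lower()).split())

sample = "@user thanks for #lyft credit i can't use"
cleaned = process_tweet(sample)
print(cleaned)  # thanks for lyft credit i can t use
```

Two side effects are worth noting: the `#` is removed but the hashtag word is kept, and apostrophes are replaced by spaces, so contractions such as "can't" are split into "can t".
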
In [0]:
# Drop the given columns from a DataFrame in place
def drop_features(features,data):
    data.drop(features,inplace=True,axis=1)

In [0]:
# Apply process_tweet to every tweet in the training data
train['processed_tweets'] = train['tweet'].apply(process_tweet)

In [0]:
train.head()

Unnamed: 0,id,label,tweet,processed_tweets
0,1,0,@user when a father is dysfunctional and is so selfish he drags his kids into his dysfunction. #run,when a father is dysfunctional and is so selfish he drags his kids into his dysfunction run
1,2,0,@user @user thanks for #lyft credit i can't use cause they don't offer wheelchair vans in pdx. #disapointed #getthanked,thanks for lyft credit i can t use cause they don t offer wheelchair vans in pdx disapointed getthanked
2,3,0,bihday your majesty,bihday your majesty
3,4,0,#model i love u take with u all the time in urð±!!! ððððð¦ð¦ð¦,model i love u take with u all the time in ur
4,5,0,factsguide: society now #motivation,factsguide society now motivation


In [0]:
train[train['label']==1].head(20)

Unnamed: 0,id,label,tweet,processed_tweets
13,14,1,@user #cnn calls #michigan middle school 'build the wall' chant '' #tcot,cnn calls michigan middle school build the wall chant tcot
14,15,1,no comment! in #australia #opkillingbay #seashepherd #helpcovedolphins #thecove #helpcovedolphins,no comment in australia opkillingbay seashepherd helpcovedolphins thecove helpcovedolphins
17,18,1,retweet if you agree!,retweet if you agree
23,24,1,@user @user lumpy says i am a . prove it lumpy.,lumpy says i am a prove it lumpy
34,35,1,it's unbelievable that in the 21st century we'd need something like this. again. #neverump #xenophobia,it s unbelievable that in the 21st century we d need something like this again neverump xenophobia
56,57,1,@user lets fight against #love #peace,lets fight against love peace
68,69,1,ð©the white establishment can't have blk folx running around loving themselves and promoting our greatness,the white establishment can t have blk folx running around loving themselves and promoting our greatness
77,78,1,"@user hey, white people: you can call people 'white' by @user #race #identity #medâ¦",hey white people you can call people white by race identity med
82,83,1,how the #altright uses &amp; insecurity to lure men into #whitesupremacy,how the altright uses amp insecurity to lure men into whitesupremacy
111,112,1,@user i'm not interested in a #linguistics that doesn't address #race &amp; . racism is about #power. #raciolinguistics bringsâ¦,i m not interested in a linguistics that doesn t address race amp racism is about power raciolinguistics brings


In [0]:
drop_features(['id','tweet'],train)

In [0]:
train.head()

Unnamed: 0,label,processed_tweets
0,0,when a father is dysfunctional and is so selfish he drags his kids into his dysfunction run
1,0,thanks for lyft credit i can t use cause they don t offer wheelchair vans in pdx disapointed getthanked
2,0,bihday your majesty
3,0,model i love u take with u all the time in ur
4,0,factsguide society now motivation


In [0]:
#splitting the data into random train and test subsets
x_train, x_test, y_train, y_test = train_test_split(train["processed_tweets"],train["label"],
                                                    test_size = 0.2, random_state = 42)

Pipeline: sequentially applies a list of transforms followed by a final estimator. Every intermediate step must be a transformer, that is, it must implement the fit and transform methods. The final estimator only needs to implement fit.

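To make the description above concrete, here is a minimal sketch comparing a `Pipeline` with the same steps applied by hand. The toy documents and labels are invented for illustration, and `random_state=0` is an added assumption so that both forests are built identically.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline

# Toy data, invented for illustration.
train_docs = ["love and peace", "build the wall", "peace for all",
              "hate and wall", "love the land", "wall of hate"]
train_labels = [0, 1, 0, 1, 0, 1]
test_docs = ["peace and love", "hate wall"]

# Version 1: let the pipeline call fit/transform on each step in order.
pipe = Pipeline([("vect", CountVectorizer()),
                 ("tfidf", TfidfTransformer()),
                 ("clf", RandomForestClassifier(n_estimators=10, random_state=0))])
pipe.fit(train_docs, train_labels)
pipe_pred = pipe.predict(test_docs)

# Version 2: the same steps written out manually.
vect = CountVectorizer()
tfidf = TfidfTransformer()
clf = RandomForestClassifier(n_estimators=10, random_state=0)
X_train = tfidf.fit_transform(vect.fit_transform(train_docs))  # fit on train only
clf.fit(X_train, train_labels)
X_test = tfidf.transform(vect.transform(test_docs))            # transform only
manual_pred = clf.predict(X_test)

print((pipe_pred == manual_pred).all())
```

The pipeline form is shorter and makes it harder to fit a transformer on the test data by accident, since `predict` only ever calls `transform` on the intermediate steps.
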
In [0]:
text_clf = Pipeline([('vect', CountVectorizer(stop_words='english')),
                      ('tfidf', TfidfTransformer()),
                     ('clf', RandomForestClassifier(n_estimators=200)),])
text = text_clf.fit(x_train,y_train)

In [0]:
predicted = text.predict(x_test)

In [0]:
from sklearn.metrics import confusion_matrix, classification_report,precision_score

In [0]:
cm_m = confusion_matrix(y_test,predicted)
cm_m

array([[5902,   35],
       [ 215,  241]])

In [0]:
TN, FP = cm_m[0]
FN, TP = cm_m[1]

In [0]:
TP

241

In [0]:
float(TN+TP)/(TN+TP+FN+FP)

0.9608947286094166

In [0]:
p = TP/(TP+FP)
p

0.8731884057971014

In [0]:
precision_score(y_test,predicted)

0.8731884057971014

In [0]:
r = TP/(FN+TP)
r

0.5285087719298246

In [0]:
recall_score(y_test,predicted)

0.5285087719298246

In [0]:
f1 = 2*p*r/(p+r)
f1

0.6584699453551913

In [0]:
f1_score(y_test,predicted)

0.6584699453551913

# 04 and 05 : Feature Engineering and Model Selection (3rd Iteration)

In [0]:

count_vect = CountVectorizer(stop_words='english',ngram_range=(1,3),analyzer='word')
transformer = TfidfTransformer(norm='l2',sublinear_tf=True)

In [0]:
type(count_vect)

sklearn.feature_extraction.text.CountVectorizer

In [0]:
#splitting the data into random train and test subsets
x_train, x_test, y_train, y_test = train_test_split(train["processed_tweets"],train["label"],
                                                    test_size = 0.2, random_state = 42)

x_train_counts = count_vect.fit_transform(x_train)
x_train_tfidf = transformer.fit_transform(x_train_counts)
x_test_counts = count_vect.transform(x_test)
x_test_tfidf = transformer.transform(x_test_counts)

In [0]:
print(x_train_counts.shape)
print(x_train_tfidf.shape)
print(x_test_counts.shape)
print(x_test_tfidf.shape)

(25569, 272155)
(25569, 272155)
(6393, 272155)
(6393, 272155)


In [0]:
from sklearn.linear_model import SGDClassifier

model = SGDClassifier(loss="modified_huber", penalty="l1")
model.fit(x_train_tfidf,y_train)
predictions = model.predict(x_test_tfidf)



In [0]:
f1_score(y_test,predictions)

0.5798969072164948

In [0]:
recall_score(y_test,predictions)

0.4934210526315789

In [0]:
precision_score(y_test,predictions)

0.703125

# 05 : Model Selection

In [0]:
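# Try a different classifier: a linear support vector machine
from sklearn.svm import LinearSVC

# Note: this model is fit on the raw n-gram counts (x_train_counts), not on the TF-IDF features
model_svc = LinearSVC(C=2.0,max_iter=500,tol=0.0001,loss ='hinge')
model_svc.fit(x_train_counts,y_train)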

LinearSVC(C=2.0, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='hinge', max_iter=500, multi_class='ovr',
     penalty='l2', random_state=None, tol=0.0001, verbose=0)

In [0]:
predict_svc = model_svc.predict(x_test_counts)

In [0]:
f1_score(y_test,predict_svc)

0.6745718050065876

In [0]:
recall_score(y_test,predict_svc)

0.5614035087719298

# 06 : Tune the Model

In [0]:

# Tune the n-gram range and the SVM regularization strength C with a grid search
from sklearn.model_selection import GridSearchCV


params = {"tfidf__ngram_range": [(1, 2), (1,3), (1,4)],
          "svc__C": [.01, .1, 1, 10, 100]}

clf = Pipeline([("tfidf", TfidfVectorizer(sublinear_tf=True)),
                ("svc", LinearSVC(loss='hinge'))])

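# No `scoring` argument is given, so candidates are ranked by the classifier's default score (mean cross-validated accuracy)
gs = GridSearchCV(clf, params, verbose=4, n_jobs=-1)
gs.fit(x_train,y_train)
print("Best Estimator = ", gs.best_estimator_)
print("Best Score = ",gs.best_score_)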

Fitting 3 folds for each of 15 candidates, totalling 45 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done  21 tasks      | elapsed:   45.2s
[Parallel(n_jobs=-1)]: Done  45 out of  45 | elapsed:  2.0min finished


Best Estimator =  Pipeline(memory=None,
     steps=[('tfidf', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.float64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 2), norm='l2', preprocessor=None, smooth_idf=True,...e', max_iter=1000, multi_class='ovr',
     penalty='l2', random_state=None, tol=0.0001, verbose=0))])
Best Score =  0.9630411826821542

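The grid search above does not pass a `scoring` argument, so the reported best score is an accuracy figure, while the rest of the notebook judges models by F1 and recall. One possible variation, sketched here on an invented toy corpus with a smaller grid, is to rank candidates by F1 instead. The data, the grid values, and `cv=3` are assumptions made for illustration; this is not what the notebook ran.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Toy corpus, invented for illustration (six documents per class).
docs = ["love and peace", "peace for everyone", "we won love the land",
        "happy camping tomorrow", "thanks for the credit", "welcome here friends",
        "build the wall now", "hate and fear win", "wall of hate rises",
        "fear the hate chant", "hate chant at school", "the wall and the fear"]
labels = [0] * 6 + [1] * 6

clf = Pipeline([("tfidf", TfidfVectorizer(sublinear_tf=True)),
                ("svc", LinearSVC(loss="hinge"))])

# Smaller grid than in the notebook, to keep the sketch fast.
params = {"tfidf__ngram_range": [(1, 1), (1, 2)],
          "svc__C": [0.1, 1, 10]}

# scoring="f1" ranks candidates by F1 on the positive class (label 1)
# instead of the default accuracy.
gs_f1 = GridSearchCV(clf, params, scoring="f1", cv=3)
gs_f1.fit(docs, labels)

print(gs_f1.best_params_)
print(gs_f1.best_score_)
```

With a minority class as small as the one in this dataset, a search ranked by accuracy can favor settings that rarely predict label 1, so ranking by F1 may select different hyperparameters.
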

In [0]:
predicted = gs.predict(x_test)


In [0]:
predicted

array([0, 0, 0, ..., 0, 0, 1])

In [0]:
f1_score(y_test,predicted)

0.7245657568238213

In [0]:
recall_score(y_test,predicted)

0.6403508771929824

In [0]:
precision_score(y_test,predicted)

0.8342857142857143

# 07 : Predict on new cases

In [0]:
!wget https://www.dropbox.com/s/as2y6lpjsh6284l/test.csv

In [0]:
submission = pd.read_csv('test.csv')
submission.info()

In [0]:
submission['processed_tweet'] = submission['tweet'].apply(process_tweet)

In [0]:
submission.head()

In [0]:
drop_features(['tweet'],submission)

In [0]:
submission.head()

In [0]:
predicted = gs.predict(submission['processed_tweet'])

In [0]:
predicted

In [0]:
# Build the submission file: one row per test tweet with its id and predicted label
result = pd.DataFrame({'id': submission['id'], 'label': predicted})
result.to_csv('final_predictions.csv',index=False)

In [0]:
result['label'].value_counts()