# Twitter Sentiment Analysis
#### The aim of this project is to classify tweets posted on Twitter as negative, positive, or neutral.
#### Two datasets are used: a training set that contains the text of each tweet together with its classified sentiment, and a test set that contains only the tweets whose sentiment we need to predict.
#### Analyzing tweets falls under the domain of "Pattern Classification", which is defined as the process of discovering useful patterns in large sets of data. Natural Language Processing is used to extract significant patterns and features from the tweet text.
#### After feature extraction, classification algorithms from the scikit-learn library are applied and an accuracy score is calculated for each. The following algorithms were applied:
#### 1) Support Vector Machine (SVM)
#### 2) Random Forest Classifier
#### 3) Multinomial Naive Bayes Classifier
#### 4) Decision Tree Classifier

In [1]:
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings("ignore")

### Preparing Training Dataset
#### Importing Training Dataset

In [2]:
df_train=pd.read_csv("training_twitter_x_y_train.csv")

In [3]:
print(df_train.shape)
df_train.head()

(10980, 12)


Unnamed: 0,tweet_id,airline_sentiment,airline,airline_sentiment_gold,name,negativereason_gold,retweet_count,text,tweet_coord,tweet_created,tweet_location,user_timezone
0,567900433542488064,negative,Southwest,,ColeyGirouard,,0,"@SouthwestAir I am scheduled for the morning, ...",,2015-02-17 20:16:29 -0800,Washington D.C.,Atlantic Time (Canada)
1,569989168903819264,positive,Southwest,,WalterFaddoul,,0,@SouthwestAir seeing your workers time in and ...,,2015-02-23 14:36:22 -0800,"Indianapolis, Indiana; USA",Central Time (US & Canada)
2,568089179520954368,positive,United,,LocalKyle,,0,@united Flew ORD to Miami and back and had gr...,,2015-02-18 08:46:29 -0800,Illinois,Central Time (US & Canada)
3,568928195581513728,negative,Southwest,,amccarthy19,,0,@SouthwestAir @dultch97 that's horse radish 😤🐴,,2015-02-20 16:20:26 -0800,,Atlantic Time (Canada)
4,568594180014014464,negative,United,,J_Okayy,,0,@united so our flight into ORD was delayed bec...,,2015-02-19 18:13:11 -0800,,Eastern Time (US & Canada)


In [4]:
df_train=df_train[['text','airline_sentiment']]
df_train.head()

Unnamed: 0,text,airline_sentiment
0,"@SouthwestAir I am scheduled for the morning, ...",negative
1,@SouthwestAir seeing your workers time in and ...,positive
2,@united Flew ORD to Miami and back and had gr...,positive
3,@SouthwestAir @dultch97 that's horse radish 😤🐴,negative
4,@united so our flight into ORD was delayed bec...,negative


In [5]:
training_documents=df_train.values

### Splitting the text into words using NLTK

In [6]:
from nltk.tokenize import word_tokenize

tweets_train=[]
for i in range(len(training_documents)):
    tweets_train.append([word_tokenize(training_documents[i][0]),training_documents[i][1]])
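To make the tokenization step concrete, here is a rough standard-library approximation of what splitting a tweet into words and punctuation looks like. The helper `simple_tokenize` is hypothetical and is not how `word_tokenize` works internally; NLTK applies more elaborate rules (for example, it keeps a token such as `3-1/2` together), so treat this only as an illustration of the idea.

```python
import re

def simple_tokenize(text):
    """Split text into runs of word characters and single punctuation marks.

    Hypothetical helper for illustration; NLTK's word_tokenize is more refined.
    """
    return re.findall(r"\w+|[^\w\s]", text)

tokens = simple_tokenize("@united Flew ORD to Miami!")
print(tokens)
```

As in the notebook output, the `@` of a mention becomes its own token, which is why the later cleaning step can discard it.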

In [7]:
import random
random.shuffle(tweets_train)
tweets_train[0:3]

[[['@',
   'USAirways',
   'flight',
   '#',
   '654',
   'sitting',
   'at',
   'JFK',
   'with',
   'delays',
   'for',
   '3-1/2',
   'hrs',
   '!',
   'No',
   'employees',
   'to',
   'load',
   'bags',
   '?',
   '#',
   'terrible'],
  'negative'],
 [['@',
   'SouthwestAir',
   'only',
   '1',
   'guest',
   'needs',
   'to',
   'change',
   'a',
   'flight',
   'on',
   'a',
   'reservation',
   'of',
   '2',
   '.',
   'How',
   'can',
   'I',
   'do',
   'it',
   '?',
   'I',
   'NEED',
   'this',
   'to',
   'happen',
   '.',
   'Say',
   'it',
   'can',
   '...',
   'Please',
   'help',
   '!'],
  'neutral'],
 [['@',
   'USAirways',
   'Are',
   'you',
   'not',
   'even',
   'going',
   'to',
   'acknowledge',
   'that',
   'you',
   'bumped',
   'me',
   'from',
   'a',
   'flight',
   '(',
   'NOT',
   'BECAUSE',
   'OF',
   'WEATHER',
   'I',
   'AM',
   'IN',
   'ARIZONA',
   ')'],
  'negative']]

### Part-of-Speech Tag Mapping using NLTK

In [8]:
from nltk.corpus import wordnet

In [9]:
def get_simple_pos(tag):
    if tag.startswith('J'):
        return wordnet.ADJ
    elif tag.startswith('V'):
        return wordnet.VERB
    elif tag.startswith('N'):
        return wordnet.NOUN
    elif tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN
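The function above maps Penn Treebank tags (as returned by `pos_tag`) to the four part-of-speech constants that the WordNet lemmatizer accepts. The sketch below shows the same mapping on a few sample tags without needing the WordNet corpus. The constants are written out as the one-letter strings that `wordnet.ADJ`, `wordnet.VERB`, `wordnet.NOUN`, and `wordnet.ADV` hold; the name `simple_pos` and the sample tags are chosen for illustration only.

```python
# One-letter values held by wordnet.ADJ, wordnet.VERB, wordnet.NOUN, wordnet.ADV
ADJ, VERB, NOUN, ADV = "a", "v", "n", "r"

def simple_pos(tag):
    """Map a Penn Treebank tag to a WordNet POS, defaulting to noun."""
    prefix_map = {"J": ADJ, "V": VERB, "N": NOUN, "R": ADV}
    return prefix_map.get(tag[:1], NOUN)

# JJ = adjective, VBD = past-tense verb, NNS = plural noun,
# RB = adverb, DT = determiner (falls back to noun)
examples = ["JJ", "VBD", "NNS", "RB", "DT"]
mapped = {tag: simple_pos(tag) for tag in examples}
print(mapped)
```

Any tag that does not start with `J`, `V`, `N`, or `R` falls back to noun, which matches the `else` branch of `get_simple_pos`.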

### Identifying Stop Words along with Punctuation

In [10]:
import string
from nltk.corpus import stopwords
stops=set(stopwords.words("english"))
punctuations=list(string.punctuation)
stops.update(punctuations)
stops, string.punctuation

({'!',
  '"',
  '#',
  '$',
  '%',
  '&',
  "'",
  '(',
  ')',
  '*',
  '+',
  ',',
  '-',
  '.',
  '/',
  ':',
  ';',
  '<',
  '=',
  '>',
  '?',
  '@',
  '[',
  '\\',
  ']',
  '^',
  '_',
  '`',
  'a',
  'about',
  'above',
  'after',
  'again',
  'against',
  'ain',
  'all',
  'am',
  'an',
  'and',
  'any',
  'are',
  'aren',
  "aren't",
  'as',
  'at',
  'be',
  'because',
  'been',
  'before',
  'being',
  'below',
  'between',
  'both',
  'but',
  'by',
  'can',
  'couldn',
  "couldn't",
  'd',
  'did',
  'didn',
  "didn't",
  'do',
  'does',
  'doesn',
  "doesn't",
  'doing',
  'don',
  "don't",
  'down',
  'during',
  'each',
  'few',
  'for',
  'from',
  'further',
  'had',
  'hadn',
  "hadn't",
  'has',
  'hasn',
  "hasn't",
  'have',
  'haven',
  "haven't",
  'having',
  'he',
  'her',
  'here',
  'hers',
  'herself',
  'him',
  'himself',
  'his',
  'how',
  'i',
  'if',
  'in',
  'into',
  'is',
  'isn',
  "isn't",
  'it',
  "it's",
  'its',
  'itself',
  'just',
  'll',


### Cleaning the Words using WordNetLemmatizer available in NLTK

In [11]:
from nltk.stem import WordNetLemmatizer
from nltk import pos_tag
lemmatizer=WordNetLemmatizer()

In [12]:
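def clean_tweets(words):
    output_words=[]
    for w in words:
        # keep purely alphabetic tokens (drops numbers, punctuation, and mixed tokens)
        if w.isalpha():
            # skip stop words, compared case-insensitively
            if w.lower() not in stops:
                # tag the single word, then lemmatize with the matching WordNet POS
                pos=pos_tag([w])
                clean_word=lemmatizer.lemmatize(w,pos=get_simple_pos(pos[0][1]))
                output_words.append(clean_word.lower())
    return output_words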
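The filtering part of `clean_tweets` can be seen in isolation with a small sketch. It leaves out lemmatization so that it runs without any NLTK data, and it uses a tiny hand-picked stop set (`demo_stops`) as a stand-in for the full NLTK list; both the helper name and the stop set are assumptions made for this example.

```python
import string

# Small stand-in for the NLTK stop-word list plus punctuation (illustrative only)
demo_stops = {"at", "with", "for", "to", "no"} | set(string.punctuation)

def filter_tokens(words, stops):
    """Keep alphabetic tokens that are not stop words, lowercased."""
    return [w.lower() for w in words if w.isalpha() and w.lower() not in stops]

tokens = ['@', 'USAirways', 'flight', '#', '654', 'sitting', 'at', 'JFK',
          'with', 'delays', 'for', '3-1/2', 'hrs', '!']
kept = filter_tokens(tokens, demo_stops)
print(kept)
```

Note that `isalpha()` already rejects `@`, `#`, `654`, and `3-1/2`, so the punctuation entries in the stop set matter only for tokens that would otherwise pass that check.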

In [13]:
for i in range(len(tweets_train)):
    tweets_train[i]=(clean_tweets(tweets_train[i][0]),tweets_train[i][1])

In [14]:
y_train=[]
tweets=[]
for tweet,sentiment in tweets_train:
    tweets.append(" ".join(tweet))
    y_train.append(sentiment)

### Using Count Vectorizer to get the X Train Features

In [15]:
from sklearn.feature_extraction.text import CountVectorizer
count_vec=CountVectorizer(max_features=2000)
x_train_features=count_vec.fit_transform(tweets)
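To illustrate what the count vectorizer produces, the following plain-Python sketch builds a bag-of-words representation by hand. It is a simplification under stated assumptions, not scikit-learn's implementation: it splits on whitespace only, whereas `CountVectorizer` by default lowercases the text and keeps tokens of two or more alphanumeric characters. The function names `fit_vocabulary` and `transform` and the toy documents are made up for this example.

```python
from collections import Counter

def fit_vocabulary(docs, max_features):
    """Keep the max_features most frequent words and index them alphabetically."""
    totals = Counter(word for doc in docs for word in doc.split())
    top = [word for word, _ in totals.most_common(max_features)]
    return {word: index for index, word in enumerate(sorted(top))}

def transform(docs, vocab):
    """Turn each document into a list of word counts over the fitted vocabulary."""
    rows = []
    for doc in docs:
        row = [0] * len(vocab)
        for word in doc.split():
            if word in vocab:  # words outside the vocabulary are ignored
                row[vocab[word]] += 1
        rows.append(row)
    return rows

train_docs = ["flight delayed flight cancelled", "great flight delayed crew"]
vocab = fit_vocabulary(train_docs, max_features=2)
rows = transform(["flight delayed again flight"], vocab)
print(vocab, rows)
```

The word `again` never appears in the training documents, so it contributes nothing to the output row. The same thing happens when the fitted `count_vec` is later applied to the test tweets with `transform` rather than `fit_transform`.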

### Preparing Testing Dataset

In [16]:
df_test=pd.read_csv("test_twitter_x_test.csv")

In [17]:
df_test.head()

Unnamed: 0,tweet_id,airline,airline_sentiment_gold,name,negativereason_gold,retweet_count,text,tweet_coord,tweet_created,tweet_location,user_timezone
0,569682010270101504,American,,zsalim03,,0,@AmericanAir In car gng to DFW. Pulled over 1h...,,2015-02-22 18:15:50 -0800,Texas,Central Time (US & Canada)
1,569608307184242688,American,,sa_craig,,0,"@AmericanAir after all, the plane didn’t land ...",,2015-02-22 13:22:57 -0800,"College Station, TX",Central Time (US & Canada)
2,567879304593408001,Southwest,,DanaChristos,,1,@SouthwestAir can't believe how many paying cu...,,2015-02-17 18:52:31 -0800,CT,Eastern Time (US & Canada)
3,569757651539660801,US Airways,,rossj987,,0,@USAirways I can legitimately say that I would...,,2015-02-22 23:16:24 -0800,"Washington, D.C.",Eastern Time (US & Canada)
4,569900705852608513,American,,tranpham18,,0,@AmericanAir still no response from AA. great ...,,2015-02-23 08:44:51 -0800,New York City,Eastern Time (US & Canada)


In [18]:
testing_documents=np.array(df_test['text'])

In [19]:
tweets_test=[]
for t in testing_documents:
    t=clean_tweets(word_tokenize(t))
    tweets_test.append(" ".join(t))

In [20]:
x_test_features=count_vec.transform(tweets_test)

### Splitting Training Dataset in order to find Accuracy Score

In [21]:
from sklearn.model_selection import train_test_split
x_train_tweets,x_test_tweets,y_train_tweets,y_test_tweets=train_test_split(x_train_features,y_train,train_size=0.8,
                                                                           test_size=0.2,random_state=0)
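The split sizes follow directly from the 10980 training rows: 20% is held out for evaluation and 80% is used for fitting. The sketch below reproduces that arithmetic with a seeded shuffle from the standard library. The helper `holdout_split` is hypothetical and will not select the same rows as `train_test_split`; it only shows the idea of a reproducible hold-out split.

```python
import random

def holdout_split(items, test_fraction, seed):
    """Shuffle indices with a fixed seed and hold out a fraction for testing."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)
    n_test = round(len(items) * test_fraction)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return [items[i] for i in train_idx], [items[i] for i in test_idx]

rows = list(range(10980))  # stand-in for the 10980 training tweets
train_rows, test_rows = holdout_split(rows, 0.2, seed=0)
print(len(train_rows), len(test_rows))
```

The held-out size of 2196 is the total `support` that appears in each classification report in the following sections.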

### Applying SkLearn Classifiers
### SVM (Support Vector Machine)

In [22]:
from sklearn.svm import SVC
from sklearn.metrics import classification_report,confusion_matrix,accuracy_score
svc=SVC()
svc.fit(x_train_tweets,y_train_tweets)
y_test_pred=svc.predict(x_test_tweets)
print("Classification Report:")
print(classification_report(y_test_tweets,y_test_pred))
print("Confusion Matrix: ")
print(confusion_matrix(y_test_tweets,y_test_pred))
print("Accuracy Score: ")
print(accuracy_score(y_test_tweets,y_test_pred)*100,"%",sep=" ")

Classification Report:
              precision    recall  f1-score   support

    negative       0.81      0.92      0.86      1399
     neutral       0.64      0.49      0.56       441
    positive       0.75      0.56      0.64       356

    accuracy                           0.78      2196
   macro avg       0.73      0.66      0.69      2196
weighted avg       0.77      0.78      0.77      2196

Confusion Matrix: 
[[1292   76   31]
 [ 188  217   36]
 [ 112   44  200]]
Accuracy Score: 
77.82331511839709 %
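The figures in the classification report can be recomputed by hand from the confusion matrix printed above, which helps when reading the reports in the later sections. In the matrix, rows are true labels and columns are predicted labels, in the order negative, neutral, positive. This is a plain-Python sketch of the arithmetic, not a replacement for `classification_report`.

```python
labels = ["negative", "neutral", "positive"]
# SVM confusion matrix from the output above: rows = true, columns = predicted
cm = [[1292, 76, 31],
      [188, 217, 36],
      [112, 44, 200]]

total = sum(sum(row) for row in cm)
correct = sum(cm[i][i] for i in range(len(labels)))
accuracy = correct / total

recall = {}
precision = {}
for i, label in enumerate(labels):
    recall[label] = cm[i][i] / sum(cm[i])                    # correct / all true
    precision[label] = cm[i][i] / sum(row[i] for row in cm)  # correct / all predicted

print(round(accuracy * 100, 2))
print({k: round(v, 2) for k, v in precision.items()})
print({k: round(v, 2) for k, v in recall.items()})
```

The low neutral recall comes from the second row of the matrix: 188 of the 441 neutral tweets were predicted as negative.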


In [23]:
svc.fit(x_train_features,y_train)
y_pred_svc=svc.predict(x_test_features)

In [24]:
df=pd.DataFrame(y_pred_svc)
df.to_csv('predictions_tweets_svm.csv',index=False,header=False)

### Random Forest Classifier

In [25]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report,confusion_matrix,accuracy_score
rfc=RandomForestClassifier()
rfc.fit(x_train_tweets,y_train_tweets)
y_test_pred=rfc.predict(x_test_tweets)
print("Classification Report:")
print(classification_report(y_test_tweets,y_test_pred))
print("Confusion Matrix: ")
print(confusion_matrix(y_test_tweets,y_test_pred))
print("Accuracy Score: ")
print(accuracy_score(y_test_tweets,y_test_pred)*100,"%",sep=" ")

Classification Report:
              precision    recall  f1-score   support

    negative       0.83      0.89      0.86      1399
     neutral       0.59      0.51      0.55       441
    positive       0.70      0.61      0.65       356

    accuracy                           0.77      2196
   macro avg       0.70      0.67      0.68      2196
weighted avg       0.76      0.77      0.76      2196

Confusion Matrix: 
[[1241  111   47]
 [ 168  227   46]
 [  93   47  216]]
Accuracy Score: 
76.68488160291439 %


In [26]:
rfc.fit(x_train_features,y_train)
y_pred_rfc=rfc.predict(x_test_features)

In [27]:
df=pd.DataFrame(y_pred_rfc)
df.to_csv('predictions_tweets_rfc.csv',index=False,header=False)

### Multinomial Naive Bayes Classifier

In [28]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report,confusion_matrix,accuracy_score
mnv=MultinomialNB(alpha=1)
mnv.fit(x_train_tweets,y_train_tweets)
y_test_pred=mnv.predict(x_test_tweets)
print("Classification Report:")
print(classification_report(y_test_tweets,y_test_pred))
print("Confusion Matrix: ")
print(confusion_matrix(y_test_tweets,y_test_pred))
print("Accuracy Score: ")
print(accuracy_score(y_test_tweets,y_test_pred)*100,"%",sep=" ")

Classification Report:
              precision    recall  f1-score   support

    negative       0.85      0.87      0.86      1399
     neutral       0.57      0.53      0.55       441
    positive       0.69      0.69      0.69       356

    accuracy                           0.77      2196
   macro avg       0.70      0.70      0.70      2196
weighted avg       0.77      0.77      0.77      2196

Confusion Matrix: 
[[1212  129   58]
 [ 156  234   51]
 [  66   45  245]]
Accuracy Score: 
77.00364298724955 %
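The `alpha=1` argument passed to `MultinomialNB` is additive (Laplace) smoothing: every word count in a class is increased by `alpha` before the per-class word probabilities are estimated, so a word never seen in a class does not get probability zero. The sketch below shows that formula on made-up counts; the function name and the numbers are illustrative assumptions, not values taken from the tweet data.

```python
def smoothed_likelihoods(word_counts, alpha):
    """Estimate P(word | class) as (count + alpha) / (total + alpha * vocab_size)."""
    total = sum(word_counts.values())
    vocab_size = len(word_counts)
    return {word: (count + alpha) / (total + alpha * vocab_size)
            for word, count in word_counts.items()}

# Hypothetical word counts inside one class
negative_counts = {"delay": 3, "late": 1, "great": 0, "thanks": 0}
probs = smoothed_likelihoods(negative_counts, alpha=1)
print(probs)
```

With `alpha=0` the unseen words `great` and `thanks` would receive probability zero, and any tweet containing them could never be assigned to this class.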


In [29]:
mnv.fit(x_train_features,y_train)
y_pred_mnv=mnv.predict(x_test_features)

In [30]:
df=pd.DataFrame(y_pred_mnv)
df.to_csv('Predictions_tweets_mnv.csv',index=False,header=False)

### Decision Tree Classifier

In [31]:
from sklearn.metrics import classification_report,confusion_matrix,accuracy_score
from sklearn import tree
dt = tree.DecisionTreeClassifier()
dt.fit(x_train_tweets,y_train_tweets)
y_test_pred=dt.predict(x_test_tweets)
print("Classification Report:")
print(classification_report(y_test_tweets,y_test_pred))
print("Confusion Matrix: ")
print(confusion_matrix(y_test_tweets,y_test_pred))
print("Accuracy Score: ")
print(accuracy_score(y_test_tweets,y_test_pred)*100,"%",sep=" ")

Classification Report:
              precision    recall  f1-score   support

    negative       0.81      0.78      0.79      1399
     neutral       0.46      0.50      0.48       441
    positive       0.54      0.55      0.54       356

    accuracy                           0.69      2196
   macro avg       0.60      0.61      0.61      2196
weighted avg       0.69      0.69      0.69      2196

Confusion Matrix: 
[[1093  195  111]
 [ 164  222   55]
 [  95   66  195]]
Accuracy Score: 
68.76138433515483 %


In [32]:
dt.fit(x_train_features,y_train)
y_pred_dt=dt.predict(x_test_features)

In [33]:
df=pd.DataFrame(y_pred_dt)
df.to_csv('Predictions_tweets_dt.csv',index=False,header=False)
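The held-out accuracy scores printed in the four classifier sections above can be collected and ranked in a few lines. The values below are copied from those outputs. Because `RandomForestClassifier()` and `DecisionTreeClassifier()` were created without a fixed `random_state`, their scores may differ somewhat on a rerun, so the ranking is a snapshot of this run.

```python
# Accuracy (%) on the 20% hold-out split, copied from the outputs above
reported_accuracy = {
    "SVM": 77.82331511839709,
    "Random Forest": 76.68488160291439,
    "Multinomial Naive Bayes": 77.00364298724955,
    "Decision Tree": 68.76138433515483,
}

ranking = sorted(reported_accuracy.items(), key=lambda item: item[1], reverse=True)
best_name, best_score = ranking[0]

for name, score in ranking:
    print(f"{name:<25} {score:.2f} %")
```

The top three classifiers are within about 1.2 percentage points of each other, while the decision tree trails by roughly nine points.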

### SVM achieves the best accuracy score on the held-out split (77.82%), narrowly ahead of Multinomial Naive Bayes (77.00%).
### Below are the predicted sentiments according to the SVM Classifier

In [34]:
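import csv
# Read back the SVM predictions saved earlier (the file name was written in lowercase)
with open("predictions_tweets_svm.csv", newline="") as v:
    r = csv.reader(v)
    # Pair each cleaned test tweet with its predicted sentiment
    for i, item in enumerate(r):
        item.insert(0,tweets_test[i])
        print(item)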

['americanair car gng dfw pulled ago icy road aa since ca reach arpt wat', 'negative']
['americanair plane land identical bad condition grk accord metars', 'negative']
['southwestair ca believe many pay customer left high dry reason flight cancelled flightlations monday bdl wow', 'negative']
['usairways legitimately say would rather driven cross country flown us airways', 'negative']
['americanair still response aa great job guy', 'positive']
['united developer fly tmrw morn min layover earlier flight layover move', 'negative']
['usairways hello anyone', 'negative']
['usairways husainhaqqani husain u shld protest well one ur party member rehman malik delayed pia flight hour', 'negative']
['usairways likely flightaware say plane still durango depart', 'negative']
['americanair even give option hold say line busy plz try late flightr', 'negative']
['united announcement pre boarding address mobility disability require travel lot stuff preboard', 'negative']
['usairways really embarrass as

['southwestair possible book refundable trip willing pay extra would domestic round trip flight', 'negative']
['usairways great call fuck phone number hung', 'negative']
['southwestair yeah happen really piss kind thing expect us austrian', 'negative']
['united special promotion flight depart newark nj john antigua', 'neutral']
['united make come help anything service great', 'positive']
['virginamerica results handily exceed forecasts nytimes http', 'neutral']
['jetblue lolol', 'neutral']
['usairways understand ca get miss mile flight dividend miles help', 'negative']
['jetblue employee logan told finally get one meet daughter first solo flight rude rep', 'negative']
['jetblue whole plane flight lga pbi http', 'neutral']
['americanair two non rev right', 'neutral']
['southwestair terminal b lga makes want never fly airline', 'negative']
['southwestair go fly nashville deicer short plane tomorrow morning', 'neutral']
['americanair becomes get money passenger strict feedback reward', 'n