<a href="https://colab.research.google.com/github/mohsinziabutt/Applied-AI-Challenge-2/blob/main/NLP_Twitter_Sentiment_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Importing dataset**

In [75]:
import pandas as pd

tweets_data = pd.read_csv('https://raw.githubusercontent.com/mohsinziabutt/Applied-AI-Challenge-2/main/train.csv')
tweets_data.head()

Unnamed: 0,Id,Text,Sentiment
0,549e992a42,Sooo SAD I will miss you here in San Diego!!!,negative
1,088c60f138,my boss is bullying me...,negative
2,9642c003ef,what interview! leave me alone,negative
3,358bd9e861,"Sons of ****, why couldn`t they put them on t...",negative
4,6e0c6d75b1,2am feedings for the baby are fun when he is a...,positive


# **Checking whether the Sentiment column contains any label other than negative and positive**

In [76]:
tweets_data.pivot_table(columns=['Sentiment'], aggfunc='size')

Sentiment
negative    7781
positive    8582
dtype: int64
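
The same check can be written so that it reports any unexpected label directly. This is a minimal sketch that uses a small hypothetical list of labels in place of `tweets_data['Sentiment']`; the helper name `unexpected_labels` is an assumption and not part of the notebook.

```python
from collections import Counter

EXPECTED = {"negative", "positive"}

def unexpected_labels(labels, expected=EXPECTED):
    """Return a Counter of the labels that are not in the expected set."""
    return Counter(label for label in labels if label not in expected)

# Hypothetical sample; in the notebook this would be tweets_data['Sentiment']
sample = ["negative", "positive", "positive", "neutral"]

print(Counter(sample))            # counts per label, like the pivot table above
print(unexpected_labels(sample))  # empty when only the two expected labels occur
```

An empty result means the column is safe to treat as a binary target.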

# **Data Pre-Processing**

**1. Defining the columns we need**

In [77]:
X=tweets_data['Text']
y=tweets_data['Sentiment']

**2. Loading the stop words and the stemmer**

In [78]:
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
from nltk.stem import PorterStemmer

stop_words=stopwords.words('english')
stemmer=PorterStemmer()

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


**3. Cleaning the tweets**

In [79]:
import re

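stop_word_set = set(stop_words)  # set lookup is faster than scanning the list

cleaned_data = []
for text in X:
    # Replace everything except letters with spaces, then lowercase and tokenise
    words = re.sub('[^a-zA-Z]', ' ', text).lower().split()
    # Drop stop words and stem the remaining words
    words = [stemmer.stem(word) for word in words if word not in stop_word_set]
    cleaned_data.append(' '.join(words))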

cleaned_data

['sooo sad miss san diego',
 'boss bulli',
 'interview leav alon',
 'son put releas alreadi bought',
 'feed babi fun smile coo',
 'journey wow u becam cooler hehe possibl',
 'realli realli like song love stori taylor swift',
 'sharpi run danger low ink',
 'want go music tonight lost voic',
 'uh oh sunburn',
 'ok tri plot altern speak sigh',
 'sick past day thu hair look wierd didnt hat would look http tinyurl com mnf kw',
 'back home gonna miss everi one',
 'play ghost onlin realli interest new updat kirin pet metamorph third job wait dragon pet',
 'free fillin app ipod fun im addict',
 'sorri',
 'way malaysia internet access twit',
 'juss came backk berkeleyi omg madd fun havent minut whassqoodd',
 'went sleep power cut noida power back work',
 'go home seen new twitter design quit heavenli',
 'hope unni make audit fight dahy unni',
 'consol got bmi test hahaha say obes well much unhappi minut',
 'funni cute kid',
 'born rais nyc live texa past year still miss ny',
 'soooooo sleeeeepi
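
The cleaning steps can also be wrapped in a function so that exactly the same treatment is applied to any text later on. The sketch below is a simplification: it uses a tiny hypothetical stop-word set instead of NLTK's English list and leaves out stemming, so that it runs without any downloads. The name `clean_tweet` is an assumption.

```python
import re

# Tiny hypothetical stop-word set; the notebook uses NLTK's English list instead
STOP_WORDS = {"i", "will", "you", "here", "in", "is", "my", "me"}

def clean_tweet(text, stop_words=STOP_WORDS):
    """Keep letters only, lowercase, split on whitespace, drop stop words."""
    words = re.sub(r"[^a-zA-Z]", " ", text).lower().split()
    return " ".join(word for word in words if word not in stop_words)

print(clean_tweet("Sooo SAD I will miss you here in San Diego!!!"))
```

In the notebook, the stemming step would sit inside the same function, after the stop-word filter.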

**4. Bag of Words Approach using Count Vectorizer**

In [80]:
from sklearn.feature_extraction.text import CountVectorizer
cv=CountVectorizer(max_features=3000)
X_fin=cv.fit_transform(cleaned_data).toarray()
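
To make the idea behind the bag-of-words matrix concrete, here is a simplified pure-Python sketch. It is not how `CountVectorizer` is implemented: tokenisation is a plain whitespace split and ties between equally frequent words are broken by first appearance. The function name `bag_of_words` is an assumption.

```python
from collections import Counter

def bag_of_words(docs, max_features=None):
    """Return (vocabulary, matrix) of word counts for whitespace-tokenised docs."""
    totals = Counter(word for doc in docs for word in doc.split())
    # Keep the most frequent words, then sort them to fix the column order
    vocabulary = sorted(word for word, _ in totals.most_common(max_features))
    matrix = []
    for doc in docs:
        counts = Counter(doc.split())
        matrix.append([counts[word] for word in vocabulary])
    return vocabulary, matrix

docs = ["realli realli like song", "like song love stori", "boss bulli"]
vocabulary, matrix = bag_of_words(docs, max_features=3)

print(vocabulary)
for row in matrix:
    print(row)
```

Each row is one tweet and each column is one vocabulary word. Words outside the top `max_features` are simply not counted, which is why the third document becomes a row of zeros here.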

# **Converting the sentiments into numbers**

In [81]:
sentiment_ordering = ['negative', 'positive']
y = y.apply(lambda x: sentiment_ordering.index(x))

y.head()

0    0
1    0
2    0
3    0
4    1
Name: Sentiment, dtype: int64
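
A pair of dictionaries is an alternative way to express this encoding, and it makes the reverse mapping explicit. A minimal sketch, where `encode` and `decode` are hypothetical helper names:

```python
LABELS = ["negative", "positive"]

# Forward and reverse lookups built from the same ordered list
label_to_id = {label: i for i, label in enumerate(LABELS)}
id_to_label = {i: label for label, i in label_to_id.items()}

def encode(labels):
    """Map label strings to integers; raises KeyError for an unknown label."""
    return [label_to_id[label] for label in labels]

def decode(ids):
    """Map integers back to label strings."""
    return [id_to_label[i] for i in ids]

encoded = encode(["negative", "negative", "positive"])
print(encoded)
print(decode(encoded))
```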

# **Splitting the dataset into train and test data**

In [45]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test=train_test_split(X_fin, y, test_size=0.3, random_state=100)
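
The size of each split follows from `test_size=0.3`. The sketch below works through the arithmetic, taking the row count as the sum of the two class counts printed earlier (7781 + 8582). The rule that the test share is rounded up is stated here as an assumption about scikit-learn's behaviour.

```python
import math

n_samples = 7781 + 8582   # negative + positive rows in train.csv
test_size = 0.3

n_test = math.ceil(test_size * n_samples)   # test share, rounded up
n_train = n_samples - n_test                # the remainder is used for training

print(n_samples, n_train, n_test)
```

The resulting test size agrees with the total support of 4909 shown in the classification reports.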

# **Defining and fitting the models**

**Creating a list to hold the models**

In [48]:
models = []

**1. Multinomial Naive Bayes**

In [50]:
from sklearn.naive_bayes import MultinomialNB
models.append(MultinomialNB())
models[0].fit(X_train, y_train)

MultinomialNB()

**2. Decision Tree Classifier**

In [51]:
from sklearn.tree import DecisionTreeClassifier
models.append(DecisionTreeClassifier())
models[1].fit(X_train, y_train)

DecisionTreeClassifier()

**3. Support vector machine (SVM)**

In [52]:
from sklearn import svm
models.append(svm.SVC(kernel='linear'))
models[2].fit(X_train, y_train)

SVC(kernel='linear')

**4. Gaussian Naive Bayes**

In [53]:
from sklearn.naive_bayes import GaussianNB
models.append(GaussianNB())
models[3].fit(X_train, y_train)

GaussianNB()

# **Performing the predictions and generating the classification report**

In [65]:
from sklearn.metrics import classification_report

cf = []

for model in models:
    y_pred = model.predict(X_test)
    cf.append(classification_report(y_test, y_pred))
    print(model, "\n" + cf[-1] + "\n")

MultinomialNB() 
              precision    recall  f1-score   support

           0       0.86      0.84      0.85      2398
           1       0.85      0.87      0.86      2511

    accuracy                           0.85      4909
   macro avg       0.85      0.85      0.85      4909
weighted avg       0.85      0.85      0.85      4909


DecisionTreeClassifier() 
              precision    recall  f1-score   support

           0       0.82      0.77      0.80      2398
           1       0.80      0.84      0.82      2511

    accuracy                           0.81      4909
   macro avg       0.81      0.81      0.81      4909
weighted avg       0.81      0.81      0.81      4909


SVC(kernel='linear') 
              precision    recall  f1-score   support

           0       0.85      0.85      0.85      2398
           1       0.86      0.86      0.86      2511

    accuracy                           0.85      4909
   macro avg       0.85      0.85      0.85      4909
weighte
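
For reference, the columns of the report follow the usual definitions: precision is TP / (TP + FP), recall is TP / (TP + FN), and F1 is the harmonic mean of the two. The sketch below uses hypothetical counts for one class and then recomputes F1 from the rounded MultinomialNB figures for class 0 shown above.

```python
def precision(tp, fp):
    """Share of predicted positives that are correct."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Share of actual positives that were found."""
    return tp / (tp + fn)

def f1(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)

# Hypothetical counts: 80 true positives, 20 false positives, 10 false negatives
p = precision(80, 20)
r = recall(80, 10)
print(round(p, 2), round(r, 2), round(f1(p, r), 2))

# F1 from the rounded precision and recall reported for MultinomialNB, class 0
print(round(f1(0.86, 0.84), 2))
```

Because the report rounds every value to two decimals, recomputing F1 from the rounded precision and recall can differ from the printed F1 in the last digit for other rows.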

**As the reports show, SVM and MultinomialNB give the same accuracy (0.85), which is higher than that of the other models, so we will use the MultinomialNB model because it takes less time to run.**
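
The notebook does not print any timings, so here is one way the running time could be measured. This is a sketch with a hypothetical helper `timed`; the stand-in workload is a plain sort, and in the notebook the call would be something like `models[i].fit(X_train, y_train)`.

```python
import time

def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

# Stand-in workload; replace with a model's fit or predict call
result, seconds = timed(sorted, range(100_000, 0, -1))
print(f"{seconds:.4f} s")
```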

In [66]:
model = models[0]

# **Importing the test file to predict the sentiments of new tweets**

In [67]:
# Import the CSV file with the tweets to be labelled
new_tweets = pd.read_csv("https://raw.githubusercontent.com/mohsinziabutt/Applied-AI-Challenge-2/main/test.csv")
new_tweets = new_tweets[["Text"]].copy()  # copy so that adding a column does not touch a view

new_tweets["Sentiment"] = ""
new_tweets.head()

Unnamed: 0,Text,Sentiment
0,Shanghai is also really exciting (precisely -...,
1,"Recession hit Veronique Branquinho, she has to...",
2,happy bday!,
3,http://twitpic.com/4w75p - I like it!!,
4,that`s great!! weee!! visitors!,


# **Converting the tweets into Bag of Words using Count Vectorizer**

In [68]:
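# Reuse the CountVectorizer fitted on the training tweets: refitting on the new tweets
# would build a different vocabulary, so the columns would not match what the model learned.
# The new tweets get the same cleaning as the training tweets.
cleaned_new = [' '.join(stemmer.stem(w) for w in re.sub('[^a-zA-Z]', ' ', t).lower().split() if w not in stop_words)
               for t in new_tweets['Text']]
X_fin2 = cv.transform(cleaned_new).toarray()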
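
A model trained on bag-of-words counts expects the same columns it saw during training. That is why the vectorizer fitted on the training tweets is normally reused with `transform` rather than fitted again. The sketch below illustrates the difference with hypothetical two-word documents; the variable names are assumptions.

```python
from sklearn.feature_extraction.text import CountVectorizer

train_docs = ["good day", "bad day"]   # hypothetical training tweets
new_docs = ["good movie"]              # hypothetical unseen tweet

# Fit once on the training text, then only transform the new text
fitted = CountVectorizer().fit(train_docs)
reused = fitted.transform(new_docs).toarray()

# A second vectorizer fitted on the new text learns a different vocabulary
refitted = CountVectorizer().fit(new_docs)

print(sorted(fitted.vocabulary_))     # columns the model was trained on
print(reused.tolist())                # new tweet expressed in those columns
print(sorted(refitted.vocabulary_))   # columns a refit would produce instead
```

With the reused vectorizer, the unseen word is ignored and the known word lands in its original column. With a refit, the column positions no longer carry the meaning the model learned.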

# **Predicting the sentiments for new tweets**

In [69]:
new_sentiments = model.predict(X_fin2)
new_tweets["Sentiment"] = new_sentiments

# **Converting the Sentiments back from numeric to string**

In [70]:
sentiments = {1:'positive', 0:'negative'}
new_tweets["Sentiment"] = [sentiments[item] for item in new_tweets["Sentiment"]]

# **Saving final results as CSV file**

In [71]:
new_tweets = new_tweets[['Text', "Sentiment"]]
new_tweets.to_csv('new_tweets_predictions.csv', header=True, index=False)

# **Displaying the newly labelled dataset**

In [72]:
pd.set_option("display.max_rows", None)
new_tweets

Unnamed: 0,Text,Sentiment
0,Shanghai is also really exciting (precisely -...,negative
1,"Recession hit Veronique Branquinho, she has to...",negative
2,happy bday!,positive
3,http://twitpic.com/4w75p - I like it!!,negative
4,that`s great!! weee!! visitors!,positive
5,I THINK EVERYONE HATES ME ON HERE lol,positive
6,"soooooo wish i could, but im in school and my...",positive
7,My bike was put on hold...should have known th...,positive
8,"I`m in VA for the weekend, my youngest son tur...",negative
9,Its coming out the socket I feel like my phon...,negative
