# Introduction

### Problem Description
The dataset contains tweets directed at the Twitter handles of various airlines.

It contains 12 columns, one of which specifies the sentiment of the tweet. The other columns describe the tweet: its text, where it was posted from, when it was posted, how many times it was retweeted, etc.

My task was to build a machine learning / deep learning model that predicts the sentiment of a tweet using all or some of the other columns.

### Data Description
The columns of the dataset are described below:

tweet_id -- Id of the tweet

airline_sentiment -- Sentiment of the tweet (Target variable)

airline_sentiment_confidence -- Confidence with which the given sentiment was determined

negativereason_confidence -- Confidence with which the negative reason of tweet was predicted

name -- Name of the person who tweeted

retweet_count -- Number of retweets

text -- Text of the tweet whose sentiment has to be predicted

tweet_created -- Time at which the tweet was created

tweet_location -- Location from where the tweet was posted

user_timezone -- Time zone from where the tweet was posted

negativereason -- Reason for which user posted a negative tweet

airline -- Airline for which the tweet was posted

## Content
1. Introduction
2. Data Injection
3. Data Visualisation
4. Preprocessing
5. Training
   5.1. Logistic Regression
   5.2. Artificial Neural Network
6. Evaluation with Graph

# Data Injection

In [None]:
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os

In [None]:
data=pd.read_csv("../input/train.csv")
data.head()

# Visualizing Data

In [None]:
#fraction of null values in each column
(len(data)-data.count())/len(data)
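
The null-value check above can be illustrated on a tiny frame. This is a minimal sketch, assuming a hypothetical `toy` DataFrame that is not part of the airline data; it shows that the manual expression gives the same per-column fractions as `isnull().mean()`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame, used only to illustrate the calculation.
toy = pd.DataFrame({
    "text": ["a", "b", "c", "d"],
    "negativereason": ["Late Flight", np.nan, np.nan, "Lost Luggage"],
})

# Same expression as in the cell above: (rows - non-null count) / rows.
manual = (len(toy) - toy.count()) / len(toy)

# Equivalent shortcut: the mean of the boolean null mask per column.
builtin = toy.isnull().mean()

print(manual)
print(builtin)
```

Either form works; the shortcut is slightly easier to read.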

In [None]:
#Visualizing the Data
data.groupby(['airline_sentiment']).size()

In [None]:
data.groupby(['airline']).size()

## Visualizing with Graph

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
sns.countplot(x='airline_sentiment',data=data,order=['negative','neutral','positive'])
plt.show()

In [None]:
#Visualizing 'airline_sentiment' and 'airline'
#factorplot was renamed to catplot (and its size argument to height) in newer seaborn versions
sns.catplot(x = 'airline_sentiment',data=data,
            order = ['negative','neutral','positive'],kind = 'count',col_wrap=3,col='airline',height=4,aspect=0.6,sharex=False,sharey=False)
plt.show()

In [None]:
# Visualizing 'airline_sentiment' and 'retweet_count'
#factorplot was renamed to catplot (and its size argument to height) in newer seaborn versions
sns.catplot(x= 'airline_sentiment',data=data,
            order=['negative','neutral','positive'],kind = 'count',col_wrap=3,col='retweet_count',height=4,aspect=0.6,sharex=False,sharey=False)
plt.show()

In [None]:
#Visualizing 'negativereason' and 'airline'
#factorplot was renamed to catplot (and its size argument to height) in newer seaborn versions
sns.catplot(x = 'airline',data=data,
            order = ['Virgin America','United'],kind ='count',hue='negativereason',height=6,aspect=0.9)
plt.show()

# Preprocessing

In [None]:
import re
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer

In [None]:
data=data.drop(["tweet_id",
           "airline",
           "name",
           "retweet_count",
           "tweet_created",
           "tweet_location",
           "user_timezone"],axis=1)


In [None]:
#remove words that start with the @ symbol (raw strings avoid invalid-escape warnings)
data['text'] = data['text'].map(lambda x:re.sub(r'@\w*','',str(x)))
#remove links; note that this drops everything from the first "http" to the end of the line
data['text'] = data['text'].map(lambda x:re.sub(r'http.*','',str(x)))
#removing dates and times (all digits)
data['text'] = data['text'].map(lambda x:re.sub(r'[0-9]','',str(x)))
#removing the special characters # | * $ : &
data['text'] = data['text'].map(lambda x:re.sub(r'[#|*$:&]','',str(x)))
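
To see what the four substitutions do to a single tweet, here is a minimal sketch that wraps them in a helper. The function name `clean_tweet` and the sample tweet are hypothetical and only serve as an illustration; the patterns are the ones used in the cell above.

```python
import re

def clean_tweet(text):
    """Apply the same four substitutions as above, in the same order."""
    text = re.sub(r'@\w*', '', str(text))   # drop @mentions
    text = re.sub(r'http.*', '', text)       # drop from the first "http" to end of line
    text = re.sub(r'[0-9]', '', text)        # drop digits
    text = re.sub(r'[#|*$:&]', '', text)     # drop selected special characters
    return text

# Hypothetical sample tweet, not taken from the dataset.
sample = "@united flight 123 was late #fail http://t.co/abc"
cleaned = clean_tweet(sample)
print(repr(cleaned))
```

The substitutions leave the surrounding whitespace in place, so the cleaned text can contain leading, trailing, or doubled spaces. That is harmless here because the later stopword step splits on whitespace.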


## Getting important numeric and non-numeric data
1. Append the negative reason to the text.
2. In data['negativereason'], replace NaN values with 0 and every valid negative reason with 1.
3. data['negativereason_confidence'] takes values between 0 and 1: the higher the value, the more likely the tweet is 'negative'; the lower the value, the more likely it is 'positive' or 'neutral'. NaN values are therefore replaced with a value close to zero (0.3 in the code below).
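
The three steps above can also be sketched with vectorized pandas operations instead of row loops. This is only an illustration under stated assumptions: the `toy` frame and the `has_reason` variable are hypothetical, and the notebook's own cells below use explicit loops over the underlying array.

```python
import numpy as np
import pandas as pd

# Hypothetical two-row frame with the three columns involved.
toy = pd.DataFrame({
    "text": ["late again", "great crew"],
    "negativereason": ["Late Flight", np.nan],
    "negativereason_confidence": [0.7, np.nan],
})

# Rows that carry a negative reason.
has_reason = toy["negativereason"].notna()

# Step 1: append the negative reason to the text where one exists.
toy.loc[has_reason, "text"] = (
    toy.loc[has_reason, "text"] + " " + toy.loc[has_reason, "negativereason"]
)

# Step 2: turn the reason into a binary flag (1 = reason present, 0 = missing).
toy["negativereason"] = has_reason.astype(int)

# Step 3: fill missing confidences with the same low value the notebook uses (0.3).
toy["negativereason_confidence"] = toy["negativereason_confidence"].fillna(0.3)

print(toy)
```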

In [None]:
data.head()

In [None]:
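#appending negative reason to text
#column positions: 2 = negativereason, 4 = text (see the column names restored below)
data=data.values
for i in range(len(data)):   #loop over every row instead of a hard-coded count
    if not str(data[i][2])=="nan":
        data[i][4]=str(data[i][4])+" "+ str(data[i][2])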

In [None]:
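#Getting important numeric data
for i in range(len(data)):   #loop over every row instead of a hard-coded count
    #missing negativereason_confidence -> value close to zero
    if str(data[i][3])=="nan":
        data[i][3]=0.3
    #negativereason becomes a binary flag: 0 if missing, 1 otherwise
    data[i][2]=0 if str(data[i][2])=="nan" else 1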


In [None]:
data=pd.DataFrame(data=data,columns=["airline_sentiment","airline_sentiment_confidence","negativereason","negativereason_confidence","text"])
data.head()

In [None]:
#preparing train data
#removing stopwords and tokenizing it.
stop=set(stopwords.words('english'))   #build the set once for fast lookups
text=[' '.join(word for word in str(x).strip().split() if word not in stop)
      for x in data['text']]
tfid=TfidfVectorizer(strip_accents=None,lowercase=False,preprocessor=None)
x_features=tfid.fit_transform(text).toarray()
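
The shape of the resulting feature matrix is (number of tweets, vocabulary size). A minimal sketch on a toy corpus, assuming a hypothetical three-word stop list (`toy_stop`) in place of the NLTK English list:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stop list and documents, used only for illustration.
toy_stop = {"the", "was", "a"}
toy_docs = ["the flight was late", "a great flight", "lost the bag"]

# Same filtering idea as above: keep only the words not in the stop list.
filtered = [" ".join(w for w in doc.split() if w not in toy_stop)
            for doc in toy_docs]

vectorizer = TfidfVectorizer(lowercase=False)
matrix = vectorizer.fit_transform(filtered).toarray()

# One row per document, one column per distinct remaining term.
n_docs, n_terms = matrix.shape
print(filtered)
print(n_docs, n_terms, sorted(vectorizer.vocabulary_))
```

The number of columns is what later determines the input dimension of the neural network.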

In [None]:
#preparing target variable
y=data['airline_sentiment']
y=pd.DataFrame(y,columns=['airline_sentiment'])
y = y['airline_sentiment'].map({'neutral':1,'negative':2,'positive':0})

# Training

## Logistic Regression

In [None]:
#training with Logistic Regression
from sklearn.linear_model import LogisticRegression as lg
from sklearn.model_selection import cross_val_score

In [None]:
clf=lg()
acc=cross_val_score(estimator=clf,X=x_features,y=y,cv=5)
acc

In [None]:
#calculating accuracy after adding three more numerical features: 'airline_sentiment_confidence', 'negativereason', and 'negativereason_confidence'.
#Note that these were transformed earlier
#embedding numerical data in x_features
x_features=pd.DataFrame(x_features)
x_features.loc[:,'a']=data.iloc[:,1].values.astype(float)
x_features.loc[:,'b']=data.iloc[:,2].values.astype(float)
x_features.loc[:,'c']=data.iloc[:,3].values.astype(float)
#assumption: recent scikit-learn versions require all column names to be strings
x_features.columns=x_features.columns.astype(str)

In [None]:
#training our new data
clf=lg()
acc=cross_val_score(estimator=clf,X=x_features,y=y,cv=5)
acc

#### As you can clearly see, the accuracy increased by a decent margin
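
One way to make such a comparison concrete is to reduce each array of fold scores to a mean and a standard deviation. The sketch below runs on synthetic data from `make_classification`, not on the airline tweets, and the helper `summarize` is hypothetical; it only illustrates the comparison pattern and says nothing about the scores obtained above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data with three classes (NOT the tweet features).
X_demo, y_demo = make_classification(n_samples=300, n_features=10,
                                     n_informative=5, n_classes=3,
                                     random_state=0)

def summarize(scores):
    """Return (mean, standard deviation) of the cross-validation scores."""
    return float(np.mean(scores)), float(np.std(scores))

# Scores with a subset of the columns versus all columns.
base_scores = cross_val_score(LogisticRegression(max_iter=1000),
                              X_demo[:, :5], y_demo, cv=5)
full_scores = cross_val_score(LogisticRegression(max_iter=1000),
                              X_demo, y_demo, cv=5)

print("subset of features: mean=%.3f std=%.3f" % summarize(base_scores))
print("all features:       mean=%.3f std=%.3f" % summarize(full_scores))
```

Applied to the notebook, the two `acc` arrays from the cells above would take the place of `base_scores` and `full_scores`.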

##  Artificial Neural Network

In [None]:
#let's dig deeper and apply deep learning for better accuracy
from keras.models import Sequential
from keras.layers import Flatten, Dense
from keras import regularizers
from keras.layers import Dropout

In [None]:
# Transforming our target variable
from sklearn.preprocessing import OneHotEncoder

In [None]:
onehotencoder=OneHotEncoder()
target=y.values
target=target.reshape(-1,1)
target=onehotencoder.fit_transform(target).toarray()

In [None]:
target=pd.DataFrame(data=target,columns=['positive','neutral','negative'])
target.head()
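
The column labels above rely on `OneHotEncoder` ordering its output columns by sorted category value. With the mapping used earlier (positive = 0, neutral = 1, negative = 2), that gives the order positive, neutral, negative. A minimal sketch that checks this on a few hypothetical codes:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical label codes: negative (2), positive (0), neutral (1), negative (2).
codes = np.array([2, 0, 1, 2]).reshape(-1, 1)

encoder = OneHotEncoder()
encoded = encoder.fit_transform(codes).toarray()

# categories_ lists the values in the order of the output columns.
category_order = [int(c) for c in encoder.categories_[0]]
print(category_order)
print(encoded)
```

Checking `categories_` this way guards against mislabelled columns if the label mapping ever changes.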

In [None]:
clf=Sequential()
#adding layers to ANN
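#input_dim is taken from the feature matrix instead of being hard-coded (6212 in the original run)
clf.add(Dense(units=2048,activation="relu",kernel_initializer="uniform",kernel_regularizer=regularizers.l2(0.001),input_dim=x_features.shape[1]))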
clf.add(Dropout(0.5))
#adding two more hidden layers to ANN
clf.add(Dense(units=2048,activation="relu",kernel_initializer="uniform",kernel_regularizer=regularizers.l2(0.001)))
clf.add(Dropout(0.5))
clf.add(Dense(units=2048,activation="relu",kernel_initializer="uniform",kernel_regularizer=regularizers.l2(0.001)))
clf.add(Dropout(0.5))
#adding output layer
clf.add(Dense(units=3,activation="softmax",kernel_initializer="uniform"))
#compiling ANN
clf.compile(optimizer='adam',loss='categorical_crossentropy',metrics=['accuracy'])

#fitting ANN
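#passing plain float arrays avoids dtype issues with mixed DataFrame columns
hist=clf.fit(x_features.values.astype('float32'),target.values,batch_size=32,epochs=10)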


# Evaluation with Graph

In [None]:
# Plot the loss and accuracy curves for training
fig, ax = plt.subplots(2,1)
ax[0].plot(hist.history['loss'], color='b', label="Training loss")
legend = ax[0].legend(loc='best', shadow=True)

# the accuracy key is 'acc' in older Keras versions and 'accuracy' in newer ones
acc_key = 'acc' if 'acc' in hist.history else 'accuracy'
ax[1].plot(hist.history[acc_key], color='r', label="Training accuracy")
legend = ax[1].legend(loc='best', shadow=True)
plt.show()

### Thank you for visiting, and please upvote if you liked it.