### Sentiment Analysis
This notebook performs sentiment analysis on a collection of tweets, classifying each tweet as negative or positive.

### Introduction

#### Data Preparation

Before we start on the problem itself, we need to do a little data preparation.
First, let's import all the required libraries.

In [20]:
# Import the required libraries.
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import confusion_matrix, classification_report
import re
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer
nltk.download('punkt')
from nltk.tokenize import word_tokenize


[nltk_data] Downloading package stopwords to D:\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to D:\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.


We will now read the data.
The dataset is a CSV file, so we use the read_csv() function from pandas.

In [21]:
# Show the full text of each column instead of truncating long values
pd.set_option('display.max_colwidth', None)

# Read the data into a dataframe; the file has no header row, so we name the columns ourselves
data = pd.read_csv(
    'data/twitter.csv',
    encoding='latin-1',
    header=None,
    names=['target', 'id', 'date', 'flag', 'user', 'text'],
)

# Look at the top five rows of the dataframe
data.head()

Unnamed: 0,target,id,date,flag,user,text
0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer. You shoulda got David Carr of Third Day to do it. ;D"
1,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by texting it... and might cry as a result School today also. Blah!
2,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,@Kenichan I dived many times for the ball. Managed to save 50% The rest go out of bounds
3,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire
4,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,"@nationwideclass no, it's not behaving at all. i'm mad. why am i here? because I can't see you all over there."


From here on we only use the text and the label (the target column); the other columns are ignored, as we won't need any of that information. As a first step, we convert the text to lowercase.

In [26]:
# convert every tweet to lowercase
data['text'] = data['text'].str.lower()

## Cleaning and processing the data
### Text Cleaning
Our dataset is not clean: besides the uppercase letters we already lowercased above, it contains user mentions, links, brackets, punctuation, and other noise. We need to remove these from the data. Here, we use the re library to do so.

In [None]:
# preprocess the text data
# remove @mentions
data['text'] = data['text'].apply(lambda x: re.sub(r"@\S+", "", x))
# remove links
data['text'] = data['text'].apply(lambda x: re.sub(r"http\S+", "", x))
# replace every run of non-alphanumeric characters with a single space
data['text'] = data['text'].apply(lambda x: re.sub(r"[^a-zA-Z0-9]+", " ", x))
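
To see what the three substitutions do, here is a minimal, standalone sketch that applies them to a shortened version of the first tweet shown above. The helper name `clean_tweet` is hypothetical and used only for illustration; it lowercases first, as the notebook does in the previous cell.

```python
import re

def clean_tweet(text):
    """Apply the same cleaning steps as above to one string (illustrative helper)."""
    text = text.lower()
    text = re.sub(r"@\S+", "", text)            # drop @mentions
    text = re.sub(r"http\S+", "", text)         # drop links
    text = re.sub(r"[^a-zA-Z0-9]+", " ", text)  # turn every other run of symbols into one space
    return text

sample = "@switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer."
print(repr(clean_tweet(sample)))
```

Note that the apostrophe in "that's" is replaced as well, so the contraction splits into "that" and "s", and a leading or trailing space can remain in the cleaned string.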

In [28]:
# split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(data['text'], data['target'], test_size=0.2, random_state=42)


In [29]:
# tokenize the text
X_train = X_train.apply(lambda x: word_tokenize(x))
X_test = X_test.apply(lambda x: word_tokenize(x))

In [31]:
# create a count vectorizer
# CountVectorizer expects raw strings, so the token lists are joined back into text first
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(X_train.apply(lambda x: ' '.join(x)))
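
As a rough illustration of what the count matrix contains, the sketch below builds a bag-of-words representation by hand with the standard library. It is not how scikit-learn implements CountVectorizer (which stores a sparse matrix and supports many more options); the function name `bag_of_words` and the two sample sentences are made up for this example. The regular expression mirrors CountVectorizer's default token pattern, which keeps only tokens of two or more word characters.

```python
import re
from collections import Counter

def bag_of_words(docs):
    """Build a sorted vocabulary and a dense count matrix (illustration only)."""
    # Tokens of two or more word characters, like CountVectorizer's default pattern.
    tokenized = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        # One column per vocabulary term; terms absent from the document count as 0.
        rows.append([counts[term] for term in vocab])
    return vocab, rows

docs = ["awww that s a bummer", "that is a good day"]
vocab, matrix = bag_of_words(docs)
print(vocab)
for row in matrix:
    print(row)
```

Single-character tokens such as "s" and "a" do not make it into the vocabulary, which is also what happens to them under CountVectorizer's default settings.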


In [32]:
# create a TF-IDF transformer
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
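
For reference, with its default settings (smoothed IDF, L2 normalization) TfidfTransformer reweights each raw count roughly as follows. The symbols are introduced here for illustration: $n$ is the number of training documents, $\mathrm{df}(t)$ is the number of documents containing term $t$, and $\mathrm{tf}(t, d)$ is the count of $t$ in document $d$.

$$
\mathrm{idf}(t) = \ln\frac{1 + n}{1 + \mathrm{df}(t)} + 1,
\qquad
\text{tf-idf}(t, d) = \mathrm{tf}(t, d)\cdot \mathrm{idf}(t)
$$

Each document's row is then divided by its Euclidean norm, so terms that appear in almost every tweet contribute less than rarer, more distinctive terms.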


In [33]:
# train a Naive Bayes classifier
clf = MultinomialNB()
clf.fit(X_train_tfidf, y_train)


MultinomialNB()

In [35]:
# make predictions on the testing set
X_test_counts = count_vect.transform(X_test.apply(lambda x: ' '.join(x)))
X_test_tfidf = tfidf_transformer.transform(X_test_counts)
y_pred = clf.predict(X_test_tfidf)



In [36]:
# evaluate the classifier's performance
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

[[126633  32861]
 [ 39413 121093]]
              precision    recall  f1-score   support

           0       0.76      0.79      0.78    159494
           4       0.79      0.75      0.77    160506

    accuracy                           0.77    320000
   macro avg       0.77      0.77      0.77    320000
weighted avg       0.77      0.77      0.77    320000
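
The figures in the classification report can be recomputed from the confusion matrix by hand, which is a useful sanity check. The sketch below copies the matrix from the output above and assumes scikit-learn's convention that rows are true labels and columns are predicted labels. It also assumes the labels follow the usual convention for this kind of dataset (0 = negative, 4 = positive), which the notebook itself does not state.

```python
# Confusion matrix copied from the output above: rows are true labels, columns are predictions.
cm = [[126633, 32861],
      [39413, 121093]]
labels = [0, 4]

total = sum(sum(row) for row in cm)
accuracy = sum(cm[i][i] for i in range(len(cm))) / total
print(f"accuracy: {accuracy:.2f}")

metrics = {}
for i, label in enumerate(labels):
    true_pos = cm[i][i]
    predicted = sum(row[i] for row in cm)  # column sum: everything predicted as this label
    actual = sum(cm[i])                    # row sum: the support for this label
    precision = true_pos / predicted
    recall = true_pos / actual
    f1 = 2 * precision * recall / (precision + recall)
    metrics[label] = (precision, recall, f1)
    print(f"{label}: precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

The rounded values match the report: the classifier is slightly better at recovering class 0 tweets (higher recall) and slightly more precise when it predicts class 4.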

