### Sentiment Analysis
This notebook performs sentiment analysis on a collection of tweets, classifying the sentiment of each tweet as negative or positive.

### Introduction

#### Data Preparation

Before we start on the problem itself, we have to do a little data preparation.
First, let's import all the required libraries.

In [20]:
# Import the required libraries.
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import confusion_matrix, classification_report
import re
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer
nltk.download('punkt')
from nltk.tokenize import word_tokenize


[nltk_data] Downloading package stopwords to D:\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to D:\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.


We will now read the data.
The dataset is a CSV file so we are using the read_csv() function of Pandas.

In [21]:
# Show the full contents of each column instead of truncating them
pd.set_option('display.max_colwidth', None)

# Read the data into a dataframe
data = pd.read_csv(
    'data/twitter.csv',
    encoding='latin-1',
    header=None,
    names=['target', 'id', 'date', 'flag', 'user', 'text'],
)

# Look at the top five rows of the dataframe
data.head()

Unnamed: 0,target,id,date,flag,user,text
0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer. You shoulda got David Carr of Third Day to do it. ;D"
1,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by texting it... and might cry as a result School today also. Blah!
2,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,@Kenichan I dived many times for the ball. Managed to save 50% The rest go out of bounds
3,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire
4,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,"@nationwideclass no, it's not behaving at all. i'm mad. why am i here? because I can't see you all over there."


We will omit every column except the text and the label, as we won't need any of the other information. The tweet text is also converted to lowercase.

In [26]:
data['text'] = data['text'].apply(lambda x: x.lower())
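
Keeping only the two needed columns could look like the sketch below. It uses a tiny hand-made frame (the rows are illustrative, not taken from the dataset) with the same column names as above. Calling `.copy()` on the selection is one way to avoid pandas' `SettingWithCopyWarning` when the new frame is modified later.

```python
import pandas as pd

# Tiny stand-in for the real dataframe; the rows are made up for illustration.
sample = pd.DataFrame({
    'target': [0, 4],
    'id': [1, 2],
    'date': ['d1', 'd2'],
    'flag': ['NO_QUERY', 'NO_QUERY'],
    'user': ['user_a', 'user_b'],
    'text': ['So Tired today', 'What a GREAT day'],
})

# Keep only the label and the text; .copy() gives an independent frame.
subset = sample[['target', 'text']].copy()

# Lowercase the text with the vectorized string accessor.
subset['text'] = subset['text'].str.lower()
print(subset)
```

Because `subset` is an independent copy, the original `sample` frame keeps its mixed-case text.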

### Target visualization
Our target is the sentiment, and we have to predict whether it is positive or negative. Since the data set labels tweets with 0 and 4, we need to replace those values with negative and positive; later we will encode those values as 0 and 1.
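
A minimal sketch of that relabelling, assuming 0 means negative and 4 means positive as described above. The sample values are illustrative, and the column name `sentiment` is a choice made for this example.

```python
import pandas as pd

# Illustrative targets using the dataset's 0/4 convention.
df = pd.DataFrame({'target': [0, 4, 4, 0]})

# Replace the numeric codes with readable names.
df['sentiment'] = df['target'].map({0: 'negative', 4: 'positive'})

# Check the class balance before modelling.
counts = df['sentiment'].value_counts()
print(counts)
```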

In [27]:
# preprocess the text data
data['text'] = data['text'].apply(lambda x: re.sub(r"@\S+", "", x))
data['text'] = data['text'].apply(lambda x: re.sub(r"http\S+", "", x))
data['text'] = data['text'].apply(lambda x: re.sub(r"[^a-zA-Z0-9]+", " ", x))
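
To see what each substitution does, the sketch below applies the same three patterns to one lowercased tweet. The helper name `strip_noise` is hypothetical. Note that the last pattern also splits contractions ("that's" becomes "that s") and can leave stray spaces at the ends, so a final `strip()` may be worth adding.

```python
import re

def strip_noise(text):
    """Apply the three substitutions from the cell above, in the same order."""
    text = re.sub(r"@\S+", "", text)             # drop @mentions
    text = re.sub(r"http\S+", "", text)          # drop links
    text = re.sub(r"[^a-zA-Z0-9]+", " ", text)   # replace each run of other characters with one space
    return text

tweet = "@switchfoot http://twitpic.com/2y1zl - awww, that's a bummer."
cleaned = strip_noise(tweet)
print(repr(cleaned))           # leading and trailing spaces are still present
print(repr(cleaned.strip()))   # 'awww that s a bummer'
```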

## Cleaning and processing the data
### Text Cleaning
Our data set is not clean: it contains uppercase letters, brackets, links, punctuation, and other noise. We need to remove those things from our data. Here, we will use the re library to fix them.

In [10]:
import string

def clean_text(text):
    # Lowercase, then strip bracketed text, links, HTML tags,
    # punctuation, newlines, and any word containing a digit.
    text = str(text).lower()
    text = re.sub(r'\[.*?\]', '', text)
    text = re.sub(r'https?://\S+|www\.\S+', '', text)
    text = re.sub(r'<.*?>+', '', text)
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
    text = re.sub(r'\n', '', text)
    text = re.sub(r'\w*\d\w*', '', text)
    return text

In [11]:
data['text'] = data['text'].apply(clean_text)
data.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data['text'] = data['text'].apply(clean_text)


Unnamed: 0,label,text,sentiment
0,0,is upset that he cant update his facebook by texting it and might cry as a result school today also blah,negative
1,0,kenichan i dived many times for the ball managed to save the rest go out of bounds,negative
2,0,my whole body feels itchy and like its on fire,negative
3,0,nationwideclass no its not behaving at all im mad why am i here because i cant see you all over there,negative
4,0,kwesidei not the whole crew,negative


In [12]:
data.tail()

Unnamed: 0,label,text,sentiment
1599994,4,just woke up having no school is the best feeling ever,positive
1599995,4,thewdbcom very cool to hear old walt interviews â«,positive
1599996,4,are you ready for your mojo makeover ask me for details,positive
1599997,4,happy birthday to my boo of alll time tupac amaru shakur,positive
1599998,4,happy charitytuesday thenspcc sparkscharity,positive


### Remove Stopwords
In Natural Language Processing (NLP), stop words are commonly occurring words that are filtered out before or after processing text data. They usually contribute little to the meaning of a sentence or document, and are therefore not considered useful for text analysis. Examples include "the", "and", "a", "an", "in", "of", "is", "to", "that", and "it". Removing stop words can help reduce the dimensionality of the dataset, which can make analysis more efficient.

In [13]:
# A set makes the membership test below much faster than a list
stop_words = set(stopwords.words('english'))

def remove_stopwords(text):
    text = ' '.join(word for word in text.split(' ') if word not in stop_words)
    return text

data['text'] = data['text'].apply(remove_stopwords)
data.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data['text'] = data['text'].apply(remove_stopwords)


Unnamed: 0,label,text,sentiment
0,0,upset cant update facebook texting might cry result school today also blah,negative
1,0,kenichan dived many times ball managed save rest go bounds,negative
2,0,whole body feels itchy like fire,negative
3,0,nationwideclass behaving im mad cant see,negative
4,0,kwesidei whole crew,negative
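
One caveat for sentiment work is visible in row 3 above: "no" and "not" are on NLTK's English stop word list, so negations disappear along with the filler words. The sketch below shows one possible way to keep them. To stay self-contained it uses a small hand-written stop word set as a stand-in for the NLTK list, and `NEGATIONS` is a hypothetical name.

```python
# Stand-in for stopwords.words('english'); the real list is much longer.
STOP_WORDS = {'no', 'not', 'its', 'at', 'all', 'why', 'am', 'i', 'here'}

# Words we might want to keep because they can flip the sentiment (an assumption).
NEGATIONS = {'no', 'not'}

def remove_stopwords(text, stop_words):
    """Drop every whitespace-separated word found in stop_words."""
    return ' '.join(word for word in text.split() if word not in stop_words)

text = 'no its not behaving at all im mad'

without_negations = remove_stopwords(text, STOP_WORDS)
with_negations = remove_stopwords(text, STOP_WORDS - NEGATIONS)

print(without_negations)   # behaving im mad
print(with_negations)      # no not behaving im mad
```

Whether keeping negations actually helps the classifier is something to check empirically on the real data.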


### Stemming
In processing unstructured text, stemming is the process of converting multiple forms of the same word into one stem, to simplify the task of analyzing the processed text. For example, in the previous sentence, "processing," "process," and "processed" would all be converted to the single stem "process."

In [14]:
stemmer = nltk.SnowballStemmer("english")

def stemm_text(text):
    text = ' '.join(stemmer.stem(word) for word in text.split(' '))
    return text

data['text'] = data['text'].apply(stemm_text)
data.head()


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data['text'] = data['text'].apply(stemm_text)


Unnamed: 0,label,text,sentiment
0,0,upset cant updat facebook text might cri result school today also blah,negative
1,0,kenichan dive mani time ball manag save rest go bound,negative
2,0,whole bodi feel itchi like fire,negative
3,0,nationwideclass behav im mad cant see,negative
4,0,kwesidei whole crew,negative
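
A quick way to check the stemmer's behaviour is to run it on a few words directly. The sketch below uses the same `SnowballStemmer` as the cell above; the word list is only an illustration. Stems are not always dictionary words, which is why the table shows forms such as "updat" and "cri".

```python
from nltk.stem import SnowballStemmer

stemmer = SnowballStemmer('english')

# The three forms from the paragraph above, plus words from the sample tweets.
words = ['processing', 'process', 'processed', 'update', 'cry', 'body']
stems = {word: stemmer.stem(word) for word in words}

for word, stem in stems.items():
    print(f'{word} -> {stem}')
```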


### Tokenization

In [15]:
# Tokenizing the text
from nltk.tokenize import RegexpTokenizer

tokenizer = RegexpTokenizer(r'\w+')
data['text'] = data['text'].apply(tokenizer.tokenize)
data['text'].head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data['text'] = data['text'].apply(tokenizer.tokenize)


0    [upset, cant, updat, facebook, text, might, cri, result, school, today, also, blah]
1                       [kenichan, dive, mani, time, ball, manag, save, rest, go, bound]
2                                                 [whole, bodi, feel, itchi, like, fire]
3                                           [nationwideclass, behav, im, mad, cant, see]
4                                                                [kwesidei, whole, crew]
Name: text, dtype: object
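
After this step each entry of `text` is a list of tokens rather than a string, which matters for anything that expects plain text, such as `str.join` over the rows or scikit-learn's vectorizers with their default settings. A small sketch of converting back, using two of the rows above (the variable names are illustrative):

```python
from nltk.tokenize import RegexpTokenizer

tokenizer = RegexpTokenizer(r'\w+')

rows = ['whole bodi feel itchi like fire', 'kwesidei whole crew']

# Tokenize: each string becomes a list of word tokens.
tokenized = [tokenizer.tokenize(row) for row in rows]
print(tokenized[1])   # ['kwesidei', 'whole', 'crew']

# Join the tokens back into strings for tools that expect text.
rejoined = [' '.join(tokens) for tokens in tokenized]
print(rejoined[1])    # kwesidei whole crew
```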

### Exploratory Data Analysis using WordCloud

In [18]:
# visualize the frequent words
import matplotlib.pyplot as plt
from wordcloud import WordCloud

# each entry of data['text'] is a list of tokens, so join the tokens first
all_words = " ".join(" ".join(tokens) for tokens in data['text'])

wordcloud = WordCloud(width=800, height=500, random_state=42, max_font_size=100).generate(all_words)

# plot the graph
plt.figure(figsize=(15,8))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()

In [22]:
from sklearn.preprocessing import LabelEncoder

a = LabelEncoder()
a.fit(data['sentiment'])

data['sentiment'] = a.transform(data['sentiment'])
data.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data['sentiment'] = a.transform(data['sentiment'])


Unnamed: 0,label,text,sentiment
0,0,"[upset, cant, updat, facebook, text, might, cri, result, school, today, also, blah]",0
1,0,"[kenichan, dive, mani, time, ball, manag, save, rest, go, bound]",0
2,0,"[whole, bodi, feel, itchi, like, fire]",0
3,0,"[nationwideclass, behav, im, mad, cant, see]",0
4,0,"[kwesidei, whole, crew]",0
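
`LabelEncoder` assigns integer codes in sorted order of the class names, so here "negative" becomes 0 and "positive" becomes 1, matching the table above. A short sketch with made-up labels, including how to map the codes back:

```python
from sklearn.preprocessing import LabelEncoder

labels = ['negative', 'positive', 'positive', 'negative']

encoder = LabelEncoder()
codes = encoder.fit_transform(labels)

print(list(encoder.classes_))   # ['negative', 'positive']
print(list(codes))              # [0, 1, 1, 0]

# inverse_transform recovers the original names from the codes.
print(list(encoder.inverse_transform(codes)))
```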


In [23]:
from sklearn.model_selection import train_test_split
x = data['text']
y = data['sentiment']
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=42)

In [29]:
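from sklearn.feature_extraction.text import TfidfVectorizer

# TfidfVectorizer expects strings, so join token lists back into text
def to_text(tokens):
    return ' '.join(tokens) if isinstance(tokens, list) else tokens

x_train = x_train.apply(to_text)
x_test = x_test.apply(to_text)

vectoriser = TfidfVectorizer(ngram_range=(1,2), max_features=3891472)
vectoriser.fit(x_train)

x_train = vectoriser.transform(x_train)
x_test  = vectoriser.transform(x_test)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print('No. of feature_words: ', len(vectoriser.get_feature_names_out()))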

In [25]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

model = LogisticRegression()
model.fit(x_train, y_train)

y_pred = model.predict(x_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Test accuracy: {accuracy:.4f}")
print(classification_report(y_test, y_pred))

In [26]:
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

confusion_lg = confusion_matrix(y_test, y_pred)  # confusion matrix
sns.heatmap(confusion_lg, linewidths=0.01, annot=True, fmt='d', cmap='Reds')  # heat map of the counts
plt.show()

In [27]:
# Create a Multinomial Naive Bayes model
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report
nb = MultinomialNB()

# Train the model
nb.fit(x_train, y_train)
# Evaluate the model on the test set
y_p = nb.predict(x_test)
accuracy = accuracy_score(y_test, y_p)
print(f"Test accuracy: {accuracy:.4f}")
print(classification_report(y_test, y_p))

In [28]:
# split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(data['text'], data['target'], test_size=0.2, random_state=42)


In [29]:
# tokenize the text
X_train = X_train.apply(lambda x: word_tokenize(x))
X_test = X_test.apply(lambda x: word_tokenize(x))

In [31]:
# create a count vectorizer
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(X_train.apply(lambda x: ' '.join(x)))


In [32]:
# create a TF-IDF transformer
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)


In [33]:
# train a Naive Bayes classifier
clf = MultinomialNB()
clf.fit(X_train_tfidf, y_train)


MultinomialNB()

In [35]:
# make predictions on the testing set
X_test_counts = count_vect.transform(X_test.apply(lambda x: ' '.join(x)))
X_test_tfidf = tfidf_transformer.transform(X_test_counts)
y_pred = clf.predict(X_test_tfidf)
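
The count, TF-IDF, and classifier steps above can also be chained in a single scikit-learn `Pipeline`, so that the same transformations are applied at fit and predict time without managing the intermediate matrices by hand. This is an alternative arrangement, not what the notebook ran. The toy tweets and the name `sentiment_pipeline` are made up for illustration, with labels following the dataset's 0/4 convention.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy training data (illustrative only): 4 = positive, 0 = negative.
train_texts = ['good great happy', 'love nice good', 'bad sad awful', 'hate terrible bad']
train_labels = [4, 4, 0, 0]

sentiment_pipeline = Pipeline([
    ('counts', CountVectorizer()),    # raw text -> token counts
    ('tfidf', TfidfTransformer()),    # counts -> TF-IDF weights
    ('clf', MultinomialNB()),         # TF-IDF weights -> class prediction
])
sentiment_pipeline.fit(train_texts, train_labels)

# The pipeline accepts raw strings at prediction time as well.
predictions = sentiment_pipeline.predict(['good happy', 'bad awful'])
print(predictions)
```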



In [36]:
# evaluate the classifier's performance
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

[[126633  32861]
 [ 39413 121093]]
              precision    recall  f1-score   support

           0       0.76      0.79      0.78    159494
           4       0.79      0.75      0.77    160506

    accuracy                           0.77    320000
   macro avg       0.77      0.77      0.77    320000
weighted avg       0.77      0.77      0.77    320000
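
The figures in the report follow directly from the confusion matrix, where rows are the true classes (0, then 4) and columns are the predicted classes. As a sanity check, the sketch below recomputes accuracy and the per-class precision and recall from the four counts printed above; the variable names are illustrative.

```python
# Counts copied from the confusion matrix above.
# Rows are true classes (0, 4); columns are predicted classes (0, 4).
matrix = [[126633, 32861],
          [39413, 121093]]

total = sum(sum(row) for row in matrix)
accuracy = (matrix[0][0] + matrix[1][1]) / total

# Precision: correct predictions of a class / all predictions of that class.
precision_0 = matrix[0][0] / (matrix[0][0] + matrix[1][0])
precision_4 = matrix[1][1] / (matrix[0][1] + matrix[1][1])

# Recall: correct predictions of a class / all true members of that class.
recall_0 = matrix[0][0] / sum(matrix[0])
recall_4 = matrix[1][1] / sum(matrix[1])

print(f'accuracy: {accuracy:.2f}')
print(f'class 0: precision {precision_0:.2f}, recall {recall_0:.2f}')
print(f'class 4: precision {precision_4:.2f}, recall {recall_4:.2f}')
```

The rounded values agree with the classification report: 0.77 accuracy, 0.76/0.79 precision and recall for class 0, and 0.79/0.75 for class 4.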

