# Sentiment Analysis using Naive Bayes

In this assignment, we will label tweets with a sentiment (positive, neutral, or negative) using a Naive Bayes classifier. Naive Bayes is a very simple approach to this problem, but it sometimes gives surprisingly good accuracy.
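For reference, the decision rule behind multinomial Naive Bayes can be written as below. This is the standard textbook form rather than anything specific to this assignment, and the symbols ($c$, $w_i$, $n_i$, $N_{ci}$, $N_c$, $V$) are notation introduced here for illustration.

$$
\hat{c} = \arg\max_{c} \left[ \log P(c) + \sum_{i=1}^{V} n_i \log P(w_i \mid c) \right],
\qquad
P(w_i \mid c) = \frac{N_{ci} + \alpha}{N_c + \alpha V}
$$

Here $n_i$ is the weight of term $w_i$ in the tweet being classified, $N_{ci}$ is the total weight of term $w_i$ in the training tweets of class $c$, $N_c = \sum_i N_{ci}$, $V$ is the vocabulary size, and $\alpha$ is the additive smoothing parameter (the `alpha` argument of `MultinomialNB`). The "naive" part is the assumption that terms are conditionally independent given the class, which is what lets the per-term log-probabilities simply add up.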

**Fill in the Blanks**

## Importing required libraries

In [1]:
import pandas as pd
import re
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

## Reading dataset

In [2]:
data=pd.read_csv('C:/Users/navee/##END Program- SchoolOfAI/Dataset/tweets.csv')
data.head()

Unnamed: 0.1,Unnamed: 0,tweets,labels
0,0,Obama has called the GOP budget social Darwini...,1
1,1,"In his teen years, Obama has been known to use...",0
2,2,IPA Congratulates President Barack Obama for L...,0
3,3,RT @Professor_Why: #WhatsRomneyHiding - his co...,0
4,4,RT @wardollarshome: Obama has approved more ta...,1


In [3]:
data.drop(data.columns[0],axis=1,inplace=True)
data.head()

Unnamed: 0,tweets,labels
0,Obama has called the GOP budget social Darwini...,1
1,"In his teen years, Obama has been known to use...",0
2,IPA Congratulates President Barack Obama for L...,0
3,RT @Professor_Why: #WhatsRomneyHiding - his co...,0
4,RT @wardollarshome: Obama has approved more ta...,1


## Text processing for the tweets

In [4]:
import nltk 
nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\navee\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\navee\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [5]:
from nltk.tokenize import word_tokenize
from string import punctuation
from nltk.corpus import stopwords

# Tokens to drop: English stopwords, punctuation, and the two placeholders inserted below.
# A separate name avoids shadowing the imported `stopwords` module when the cell is re-run.
stop_words = set(stopwords.words('english') + list(punctuation) + ['AT_USER', 'URL'])

def processTweet(tweet):
    # tweet is the raw text to preprocess; convert it to lower case first
    tweet = str(tweet).lower()
    tweet = re.sub(r'((www\.[^\s]+)|(https?://[^\s]+))', 'URL', tweet)  # replace URLs with a placeholder
    tweet = re.sub(r'@[^\s]+', 'AT_USER', tweet)  # replace usernames with a placeholder
    tweet = re.sub(r'#([^\s]+)', r'\1', tweet)  # remove the # in #hashtag

    # use word_tokenize imported above to tokenize the tweet
    tokens = word_tokenize(tweet)
    return [word for word in tokens if word not in stop_words]
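To make the effect of the three regular expressions concrete, here is a minimal sketch that applies only the normalization steps, without the NLTK tokenization or stopword filtering. The helper name `normalize_tweet` and the sample tweet are made up for illustration.

```python
import re

def normalize_tweet(tweet):
    """Apply only the regex normalization steps from processTweet (no tokenization)."""
    tweet = str(tweet).lower()
    # URLs become the placeholder 'URL'
    tweet = re.sub(r'((www\.[^\s]+)|(https?://[^\s]+))', 'URL', tweet)
    # @mentions (up to the next whitespace) become the placeholder 'AT_USER'
    tweet = re.sub(r'@[^\s]+', 'AT_USER', tweet)
    # '#hashtag' keeps the word and loses the '#'
    tweet = re.sub(r'#([^\s]+)', r'\1', tweet)
    return tweet

# Hypothetical sample tweet
sample = "RT @Professor_Why: #WhatsRomneyHiding see http://example.com now"
normalized = normalize_tweet(sample)
print(normalized)  # rt AT_USER whatsromneyhiding see URL now
```

Note that the mention pattern consumes everything up to the next whitespace, so the trailing colon in `@Professor_Why:` is swallowed as well. The placeholders are inserted in upper case after lowercasing, which is why they can later be matched and dropped through the stopword set.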

## Process all tweets

In [6]:
processed=[]

for tweet in data['tweets']:
    
    # process all tweets using processTweet function above - store in variable 'cleaned' 
    cleaned=processTweet(tweet)
    processed.append(' '.join(cleaned))

In [7]:
data['processed'] = processed

## Create pipeline and define parameters for GridSearch

In [8]:
text_clf = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', MultinomialNB())])

tuned_parameters = {
    'vect__ngram_range': [(1, 1), (1, 2), (2, 2)],
    'tfidf__use_idf': (True, False),
    'tfidf__norm': ('l1', 'l2'),
    'clf__alpha': [1, 1e-1, 1e-2]
}
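The size of this grid determines how long the search takes. As a rough sketch, `ParameterGrid` from scikit-learn can count the combinations; the figure of 10 folds is taken from the `cv=10` setting used in the GridSearch cell of this notebook.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the cell above
tuned_parameters = {
    'vect__ngram_range': [(1, 1), (1, 2), (2, 2)],
    'tfidf__use_idf': (True, False),
    'tfidf__norm': ('l1', 'l2'),
    'clf__alpha': [1, 1e-1, 1e-2]
}

n_combinations = len(ParameterGrid(tuned_parameters))  # 3 * 2 * 2 * 3
n_folds = 10                                           # cv=10 in the GridSearch cell
n_fits = n_combinations * n_folds

print(n_combinations, n_fits)  # 36 360
```

With the default `refit=True`, `GridSearchCV` performs one additional fit of the best combination on the full training set after the cross-validated fits.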

## Split data into test and train

In [9]:
# split data into train (80%) and test (20%) sets
X = data.processed
y = data.labels

x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
print(x_train.shape, y_train.shape)
print(x_test.shape, y_test.shape)

(1104,) (1104,)
(276,) (276,)
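The split above is random and unseeded, so the share of each class in the test set can vary from run to run. One possible variation, not used in this notebook, is a stratified and seeded split. The sketch below uses synthetic labels with the same class counts as this dataset (947, 352, 81); the placeholder features and the seed `42` are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with the same class counts as the tweets dataset
y = np.repeat([0, 1, 2], [947, 352, 81])
X = np.arange(len(y))  # placeholder features, one per label

# stratify=y keeps the class proportions (approximately) equal in both parts;
# random_state makes the split reproducible
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

test_counts = np.bincount(y_te)
print(len(y_tr), len(y_te))
print(test_counts)
```

The train and test sizes match the shapes printed above (1104 and 276), and each class appears in the test set in roughly the same 20% proportion.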


## Perform classification (using GridSearch)

In [10]:
# perform GridSearch CV with 10 fold CV using pipeline and tuned_parameters defined above
clf = GridSearchCV(text_clf, tuned_parameters, cv=10)
clf.fit(x_train, y_train)

GridSearchCV(cv=10,
             estimator=Pipeline(steps=[('vect', CountVectorizer()),
                                       ('tfidf', TfidfTransformer()),
                                       ('clf', MultinomialNB())]),
             param_grid={'clf__alpha': [1, 0.1, 0.01],
                         'tfidf__norm': ('l1', 'l2'),
                         'tfidf__use_idf': (True, False),
                         'vect__ngram_range': [(1, 1), (1, 2), (2, 2)]})
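After fitting, the search object exposes which combination won and how it scored. The following is a minimal, self-contained sketch on a tiny made-up corpus (the texts, labels, reduced grid, and `cv=2` are assumptions chosen only to keep the example small); the same attributes are available on the `clf` object fitted above.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus: 1 = positive, 0 = negative
texts = [
    "good great happy", "great love nice", "happy nice good", "love good great",
    "bad awful sad", "awful hate bad", "sad bad hate", "hate awful sad",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

pipe = Pipeline([('vect', CountVectorizer()),
                 ('tfidf', TfidfTransformer()),
                 ('clf', MultinomialNB())])

# Reduced grid so the example runs quickly
small_grid = {
    'vect__ngram_range': [(1, 1), (1, 2)],
    'clf__alpha': [1, 1e-1],
}

search = GridSearchCV(pipe, small_grid, cv=2)
search.fit(texts, labels)

print(search.best_params_)             # best parameter combination found
print(search.best_score_)              # its mean cross-validated accuracy
print(len(search.cv_results_['params']))  # number of combinations tried
```

Because `refit=True` by default, calling `search.predict(...)` uses the best pipeline refitted on all of the training data, which is what the prediction cell below relies on.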

## Classification report 

In [11]:
y_res = clf.predict(x_test)

In [12]:
# print classification report after predicting on test set with best model obtained in GridSearch
print(classification_report(y_test, y_res))

              precision    recall  f1-score   support

           0       0.81      0.95      0.88       178
           1       0.79      0.60      0.68        80
           2       1.00      0.39      0.56        18

    accuracy                           0.81       276
   macro avg       0.87      0.65      0.71       276
weighted avg       0.82      0.81      0.80       276
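The two average rows in the report can be reproduced by hand from the per-class rows. The sketch below uses the rounded per-class values printed above, so it only recovers the averages to two decimals.

```python
# Per-class values copied from the classification report above
precision = [0.81, 0.79, 1.00]
recall = [0.95, 0.60, 0.39]
support = [178, 80, 18]
total = sum(support)

# Macro average: plain mean over classes, every class counts equally
macro_precision = sum(precision) / len(precision)
macro_recall = sum(recall) / len(recall)

# Weighted average: mean over classes weighted by support
weighted_precision = sum(p * s for p, s in zip(precision, support)) / total
weighted_recall = sum(r * s for r, s in zip(recall, support)) / total

# F1 for class 2: harmonic mean of its precision and recall
f1_class2 = 2 * precision[2] * recall[2] / (precision[2] + recall[2])

print(round(macro_precision, 2), round(macro_recall, 2))        # 0.87 0.65
print(round(weighted_precision, 2), round(weighted_recall, 2))  # 0.82 0.81
print(round(f1_class2, 2))                                      # 0.56
```

The support-weighted recall is the same quantity as overall accuracy, which is why both show 0.81. The macro recall of 0.65 is much lower because the two smaller classes, with recall 0.60 and 0.39, count as much as class 0 in that average.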



## Important:

In [13]:
counts = data.labels.value_counts()
print(counts)

0    947
1    352
2     81
Name: labels, dtype: int64


The counts above show that the class distribution is highly imbalanced: class 0 has 947 tweets, while class 2 has only 81. The classifier therefore sees few examples of the minority classes, which is consistent with the low recall for classes 1 and 2 in the classification report. For your learning, try using [SMOTE](https://imbalanced-learn.readthedocs.io/en/stable/api.html) to oversample the minority classes, then evaluate Naive Bayes again and compare the results.
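To put the imbalance in numbers, the short sketch below computes each class's share of the dataset from the counts printed above, along with the accuracy a trivial "always predict class 0" rule would reach on the full dataset. The variable names are illustrative only.

```python
# Class counts copied from the value_counts() output above
counts = {0: 947, 1: 352, 2: 81}
total = sum(counts.values())

# Share of each class in the dataset
shares = {label: n / total for label, n in counts.items()}

# Accuracy of always predicting the most frequent class
majority_baseline = max(counts.values()) / total

# Ratio between the largest and smallest class
imbalance_ratio = max(counts.values()) / min(counts.values())

for label, share in shares.items():
    print(label, round(share, 3))
print(round(majority_baseline, 3))  # 0.686
print(round(imbalance_ratio, 1))    # 11.7
```

A baseline of roughly 69% is a useful yardstick when reading the 0.81 accuracy in the report: accuracy alone hides how the minority classes fare, so per-class recall is the more informative number here.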

In [14]:
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
# imblearn's Pipeline replaces the sklearn Pipeline imported earlier; it accepts sampler steps such as SMOTE
from imblearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

In [16]:
text_clf = Pipeline([('sampling', SMOTE()),
                     ('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', MultinomialNB())])

tuned_parameters = {
    'vect__ngram_range': [(1, 1), (1, 2), (2, 2)],
    'tfidf__use_idf': (True, False),
    'tfidf__norm': ('l1', 'l2'),
    'clf__alpha': [1, 1e-1, 1e-2]
}

In [17]:
clf = GridSearchCV(text_clf, tuned_parameters, cv=10)
clf.fit(x_train, y_train)

Traceback (most recent call last):
  File "C:\Users\navee\anaconda3\envs\Pytorch-GPU\lib\site-packages\sklearn\model_selection\_validation.py", line 531, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "C:\Users\navee\anaconda3\envs\Pytorch-GPU\lib\site-packages\imblearn\pipeline.py", line 277, in fit
    Xt, yt, fit_params = self._fit(X, y, **fit_params)
  File "C:\Users\navee\anaconda3\envs\Pytorch-GPU\lib\site-packages\imblearn\pipeline.py", line 240, in _fit
    **fit_params_steps[name]
  File "C:\Users\navee\anaconda3\envs\Pytorch-GPU\lib\site-packages\joblib\memory.py", line 355, in __call__
    return self.func(*args, **kwargs)
  File "C:\Users\navee\anaconda3\envs\Pytorch-GPU\lib\site-packages\imblearn\pipeline.py", line 403, in _fit_resample_one
    X_res, y_res = sampler.fit_resample(X, y, **fit_params)
  File "C:\Users\navee\anaconda3\envs\Pytorch-GPU\lib\site-packages\imblearn\base.py", line 77, in fit_resample
    X, y, binarize_y = self._check_X

ValueError: could not convert string to float: 'click obama related trending topic racist bullshit every tweets arrested development dove macro'