# Sentiment Analysis using Naive Bayes

In this assignment, we will label tweets with a sentiment (positive, neutral, or negative) using a Naive Bayes classifier. Naive Bayes is a very simple approach to this problem, but it can give surprisingly good accuracy.
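To make the idea concrete before working with real tweets, here is a minimal sketch of a multinomial Naive Bayes classifier on a toy corpus. The four training sentences and the label coding (1 = positive, 0 = negative) are made up for illustration and are not taken from the tweet dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy corpus: 1 = positive, 0 = negative
train_texts = ["good great happy", "great love", "bad awful sad", "awful hate"]
train_labels = [1, 1, 0, 0]

# Turn each text into a vector of word counts
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_texts)

# alpha=1.0 is Laplace smoothing, so unseen words do not zero out a class
model = MultinomialNB(alpha=1.0)
model.fit(X_train, train_labels)

new_texts = ["great happy", "awful sad"]
predictions = model.predict(vectorizer.transform(new_texts))
print(predictions)
```

The classifier picks, for each text, the class whose training documents make its words most probable, treating the words as independent of one another.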

**Fill in the Blanks**

## Importing required libraries

In [1]:
import pandas as pd
import re
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

In [51]:
from sklearn.metrics import roc_curve, auc
from sklearn.metrics import confusion_matrix, roc_auc_score, recall_score, precision_score

## Reading dataset

In [4]:
data=pd.read_csv('https://raw.githubusercontent.com/gkdivya/NLP_Notebooks/main/NLP_Basics/assets/tweets.csv')
data.drop(data.columns[0],axis=1,inplace=True)
data.head()

| | tweets | labels |
|---|---|---|
| 0 | Obama has called the GOP budget social Darwini... | 1 |
| 1 | In his teen years, Obama has been known to use... | 0 |
| 2 | IPA Congratulates President Barack Obama for L... | 0 |
| 3 | RT @Professor_Why: #WhatsRomneyHiding - his co... | 0 |
| 4 | RT @wardollarshome: Obama has approved more ta... | 1 |


## Text processing for the tweets

In [7]:
import nltk 
nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [16]:
from nltk.tokenize import word_tokenize
from string import punctuation 
from nltk.corpus import stopwords 

# English stop words, punctuation, and the placeholder tokens inserted below
# (a separate name avoids shadowing the imported nltk `stopwords` module)
stop_words = set(stopwords.words('english') + list(punctuation) + ['AT_USER', 'URL'])

def processTweet(tweet):
    # tweet is the text we will pass for preprocessing
    # convert passed tweet to lower case
    tweet = str(tweet).lower()
    tweet = re.sub(r'((www\.[^\s]+)|(https?://[^\s]+))', 'URL', tweet)  # replace URLs with a placeholder
    tweet = re.sub(r'@[^\s]+', 'AT_USER', tweet)  # replace usernames with a placeholder
    tweet = re.sub(r'#([^\s]+)', r'\1', tweet)  # remove the # in #hashtag

    # use word_tokenize imported above to tokenize the tweet
    tokens = word_tokenize(tweet)
    # drop stop words, punctuation, and the URL / AT_USER placeholders
    return [word for word in tokens if word not in stop_words]
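The three regular-expression substitutions can be checked in isolation, without NLTK. The sketch below applies only those steps to a made-up tweet; `normalize_tweet` and the sample text are hypothetical names introduced for illustration.

```python
import re

def normalize_tweet(tweet):
    """Apply only the lower-casing and regex steps of the preprocessing."""
    tweet = str(tweet).lower()
    # Replace URLs with the placeholder token URL
    tweet = re.sub(r'((www\.[^\s]+)|(https?://[^\s]+))', 'URL', tweet)
    # Replace @mentions (up to the next whitespace) with AT_USER
    tweet = re.sub(r'@[^\s]+', 'AT_USER', tweet)
    # Keep the hashtag text but drop the leading '#'
    tweet = re.sub(r'#([^\s]+)', r'\1', tweet)
    return tweet

sample = "RT @user: Check https://example.com #Obama2012 www.site.org now"
normalized = normalize_tweet(sample)
print(normalized)
```

Because the placeholders are inserted after lower-casing, they stay upper-case and match the `'AT_USER'` and `'URL'` entries in the stop-word set, so the final token filter removes them.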

## Process all tweets

In [17]:
processed=[]

for tweet in data['tweets']:
    # process all tweets using processTweet function above - store in variable 'cleaned' 
    cleaned=processTweet(tweet)
    processed.append(' '.join(cleaned))

In [18]:
data['processed'] = processed

## Create pipeline and define parameters for GridSearch

In [36]:
text_clf = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', MultinomialNB())])

tuned_parameters = {
    'vect__ngram_range': [(1, 1), (1, 2), (2, 2)],
    'tfidf__use_idf': (True, False),
    'tfidf__norm': ('l1', 'l2'),
    'clf__alpha': [1, 1e-1, 1e-2]
}
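Each key in the grid has the form `<step name>__<parameter>`, which is how scikit-learn routes a value to the right pipeline step. As a rough sketch of the search cost, `ParameterGrid` can enumerate the combinations; the fold count of 10 below is an assumption matching the cross-validation used later in this notebook.

```python
from sklearn.model_selection import ParameterGrid

tuned_parameters = {
    'vect__ngram_range': [(1, 1), (1, 2), (2, 2)],
    'tfidf__use_idf': (True, False),
    'tfidf__norm': ('l1', 'l2'),
    'clf__alpha': [1, 1e-1, 1e-2]
}

grid = ParameterGrid(tuned_parameters)
n_candidates = len(grid)   # 3 * 2 * 2 * 3 combinations
n_folds = 10               # assumed number of cross-validation folds
n_fits = n_candidates * n_folds

print(n_candidates, n_fits)
print(grid[0])  # one concrete parameter combination
```

With the default `refit=True`, `GridSearchCV` performs one additional fit of the best combination on the whole training set.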

## Split data into test and train

In [37]:
# hold out 20% of the data as a test set (test_size=0.2)
X = data.processed
y = data.labels

x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
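The split above is purely random. With rare classes, one option worth trying is a stratified split, which keeps the class proportions roughly equal in both parts. The sketch below uses synthetic labels with the same class counts as this dataset (947 / 352 / 81); `stratify=` is a standard `train_test_split` argument, but using it here is a suggestion and not part of the original assignment.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Synthetic labels with the same class counts as the tweet dataset
labels = [0] * 947 + [1] * 352 + [2] * 81
indices = list(range(len(labels)))

# stratify=labels asks for the same class proportions in train and test
train_idx, test_idx, y_train_strat, y_test_strat = train_test_split(
    indices, labels, test_size=0.2, random_state=1, stratify=labels
)

test_counts = Counter(y_test_strat)
print(len(y_test_strat), test_counts)
```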

## Perform classification (using GridSearch)

In [45]:
# run GridSearchCV with 10-fold cross-validation over the pipeline and tuned_parameters defined above
# candidate parameter sets are ranked by one-vs-rest ROC AUC
clf = GridSearchCV(text_clf, tuned_parameters, cv=10, scoring='roc_auc_ovr')
clf.fit(x_train, y_train)

GridSearchCV(cv=10, error_score=nan,
             estimator=Pipeline(memory=None,
                                steps=[('vect',
                                        CountVectorizer(analyzer='word',
                                                        binary=False,
                                                        decode_error='strict',
                                                        dtype=<class 'numpy.int64'>,
                                                        encoding='utf-8',
                                                        input='content',
                                                        lowercase=True,
                                                        max_df=1.0,
                                                        max_features=None,
                                                        min_df=1,
                                                        ngram_range=(1, 1),
                                                        pre

## Classification report 

In [39]:
# print the best cross-validated score and parameters found by GridSearch
# (the classification report on the test set is printed further below)
print("Best Score: ", clf.best_score_)
print("Best Params: ", clf.best_params_)

Best Score:  0.8342669942669942
Best Params:  {'clf__alpha': 0.01, 'tfidf__norm': 'l1', 'tfidf__use_idf': True, 'vect__ngram_range': (1, 2)}
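Note that the best score above is a one-vs-rest ROC AUC (from `scoring='roc_auc_ovr'`), not an accuracy. As a minimal sketch of what that metric computes, the example below scores made-up class probabilities for six samples; the labels and probabilities are hypothetical.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical true labels and predicted class probabilities (rows sum to 1)
y_true = [0, 1, 2, 0, 1, 2]
y_proba = np.array([
    [0.8, 0.1, 0.1],
    [0.1, 0.8, 0.1],
    [0.1, 0.1, 0.8],
    [0.7, 0.2, 0.1],
    [0.2, 0.6, 0.2],
    [0.2, 0.2, 0.6],
])

# One-vs-rest: compute an AUC for each class against all others, then
# take the unweighted (macro) mean across classes
ovr_auc = roc_auc_score(y_true, y_proba, multi_class='ovr')
print(ovr_auc)
```

Because the metric depends only on how well each class is ranked above the rest, a model can reach a high AUC while still rarely predicting a minority class at the default decision threshold.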


## Important: class imbalance

In [40]:
counts = data.labels.value_counts()
print(counts)

0    947
1    352
2     81
Name: labels, dtype: int64
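To put these counts in perspective, the short calculation below converts them to shares and derives the accuracy of a trivial classifier that always predicts the most frequent class. The counts are copied from the output above; the variable names are illustrative.

```python
# Class counts copied from the value_counts() output above
class_counts = {0: 947, 1: 352, 2: 81}

total = sum(class_counts.values())
shares = {label: n / total for label, n in class_counts.items()}

# Always predicting the largest class gives this accuracy on the full data
majority_baseline = max(shares.values())

for label, share in shares.items():
    print(f"class {label}: {share:.1%}")
print(f"majority-class baseline accuracy: {majority_baseline:.1%}")
```

Any accuracy figure for this dataset is best read against that baseline.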


In [53]:
target_names = ['0', '1', '2']
y_pred = clf.best_estimator_.predict(x_test)
print(classification_report(y_test, y_pred, target_names=target_names))

              precision    recall  f1-score   support

           0       0.81      0.96      0.88       187
           1       0.82      0.59      0.69        71
           2       1.00      0.17      0.29        18

    accuracy                           0.82       276
   macro avg       0.88      0.57      0.62       276
weighted avg       0.83      0.82      0.79       276
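The `macro avg` row is the unweighted mean of the per-class scores, which is why it is pulled down so strongly by class 2. The sketch below recomputes it from the rounded per-class values in the report above, so small rounding differences are possible.

```python
# Per-class scores copied from the classification report above
precision = {"0": 0.81, "1": 0.82, "2": 1.00}
recall = {"0": 0.96, "1": 0.59, "2": 0.17}
support = {"0": 187, "1": 71, "2": 18}

# Macro average: every class counts equally, regardless of its size
macro_precision = sum(precision.values()) / len(precision)
macro_recall = sum(recall.values()) / len(recall)

# Approximate number of class-2 test tweets that were actually recovered
class2_hits = round(recall["2"] * support["2"])

print(round(macro_precision, 2), round(macro_recall, 2), class2_hits)
```

The results reproduce the 0.88 and 0.57 in the `macro avg` row, and the recall of 0.17 corresponds to roughly 3 of the 18 class-2 tweets in the test set.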



The class distribution above is highly imbalanced: class 0 has 947 tweets, while class 2 has only 81. The classifier therefore sees few examples of the minority classes, which shows up as a recall of just 0.17 for class 2. For your learning, try using [SMOTE](https://imbalanced-learn.readthedocs.io/en/stable/api.html) to oversample the minority classes, then re-evaluate the Naive Bayes model and compare the results.
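SMOTE creates synthetic minority samples by interpolating between existing numeric feature vectors, which is why it needs vectorized text and not raw strings. The NumPy sketch below illustrates only that interpolation step on made-up 2-D points. It is a simplification, not the imbalanced-learn implementation: it always uses the single nearest neighbour, whereas SMOTE picks one of the k nearest at random. The function name `smote_like_sample` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2-D minority-class samples (real tweets would be TF-IDF vectors)
minority = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.5]])

def smote_like_sample(samples, rng):
    """Create one synthetic point between a random sample and its nearest neighbour."""
    i = rng.integers(len(samples))
    base = samples[i]
    # Euclidean distance from the base point to every sample
    dists = np.linalg.norm(samples - base, axis=1)
    dists[i] = np.inf  # exclude the base point itself
    neighbour = samples[np.argmin(dists)]
    lam = rng.random()  # interpolation weight in [0, 1)
    return base + lam * (neighbour - base)

synthetic = smote_like_sample(minority, rng)
print(synthetic)
```

The synthetic point always lies on the line segment between two real minority samples, so it stays inside the region the minority class already occupies.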

In [54]:
!pip install imbalanced-learn



In [55]:
import imblearn

In [56]:
from imblearn.over_sampling import SMOTE

In [57]:
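from collections import Counter
from numpy import where
from matplotlib import pyplot
from sklearn.decomposition import TruncatedSVD

# SMOTE interpolates between numeric feature vectors, so passing raw text
# raises a ValueError; convert the tweets to TF-IDF features first
vectorizer = Pipeline([('vect', CountVectorizer()), ('tfidf', TfidfTransformer())])
X = vectorizer.fit_transform(data.processed)
y = data.labels

# note: when evaluating a model, oversample only the training split so that
# synthetic points do not leak into the test set
oversample = SMOTE(random_state=1)
X, y = oversample.fit_resample(X, y)
# summarize the new class distribution
counter = Counter(y)
print(counter)
# scatter plot of examples by class label, projected to 2-D for display
X_2d = TruncatedSVD(n_components=2, random_state=1).fit_transform(X)
for label, _ in counter.items():
    row_ix = where(y == label)[0]
    pyplot.scatter(X_2d[row_ix, 0], X_2d[row_ix, 1], label=str(label))
pyplot.legend()
pyplot.show()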