# Sentiment Analysis using Naive Bayes

In this assignment, we will label tweets with one of three sentiments (positive, neutral, or negative) using a Naive Bayes classifier. Naive Bayes is a very simple approach to this problem, but it can give surprisingly good accuracy.
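For reference, the standard multinomial Naive Bayes decision rule can be written as below. The symbols are notation introduced here for illustration and are not taken from the assignment itself.

$$
\hat{c} = \arg\max_{c} \left[ \log P(c) + \sum_{w \in V} x_w \log P(w \mid c) \right],
\qquad
P(w \mid c) = \frac{N_{wc} + \alpha}{N_c + \alpha \lvert V \rvert}
$$

Here $x_w$ is the feature value of term $w$ in the tweet, $N_{wc}$ is the total feature value of $w$ over the training tweets of class $c$, $N_c = \sum_{w} N_{wc}$, $\lvert V \rvert$ is the vocabulary size, and $\alpha$ is the additive smoothing parameter (the `alpha` argument of `MultinomialNB`).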

**Fill in the Blanks**

## Importing required libraries

In [None]:
import pandas as pd
import re
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

## Reading dataset

In [None]:
data = pd.read_csv('/content/drive/My Drive/NLP/1/tweets.csv')
# drop the first column of the CSV
data.drop(data.columns[0], axis=1, inplace=True)
data.head()

                                              tweets  labels
0  Obama has called the GOP budget social Darwini...       1
1  In his teen years, Obama has been known to use...       0
2  IPA Congratulates President Barack Obama for L...       0
3  RT @Professor_Why: #WhatsRomneyHiding - his co...       0
4  RT @wardollarshome: Obama has approved more ta...       1


## Text processing for the tweets

In [None]:
import nltk 
nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [None]:
from nltk.tokenize import word_tokenize
from string import punctuation
from nltk.corpus import stopwords

# Named stop_words so the nltk `stopwords` corpus module is not shadowed.
# Note: tweets are lowercased after the placeholders are inserted, so the
# tokens become 'url' and 'at_user'. The uppercase entries below therefore
# never match, and the placeholders remain in the processed text.
stop_words = set(stopwords.words('english') + list(punctuation) + ['AT_USER', 'URL'])

def processTweet(tweet):
    # tweet is the text we will pass for preprocessing
    tweet = re.sub(r'((www\.[^\s]+)|(https?://[^\s]+))', 'URL', str(tweet))  # replace URLs with the placeholder URL
    tweet = re.sub(r'@[^\s]+', 'AT_USER', tweet)  # replace usernames with the placeholder AT_USER
    tweet = re.sub(r'#([^\s]+)', r'\1', tweet)  # remove the # in #hashtag
    # convert the tweet to lower case
    tweet = tweet.lower()

    # use word_tokenize imported above to tokenize the tweet
    tokens = word_tokenize(tweet)
    cleaned_words = [word for word in tokens if word not in stop_words]
    return cleaned_words
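
To see what the three regular-expression substitutions do before tokenization, here is a minimal sketch that uses only `re`. The helper name `normalize_tweet` and the sample tweet are hypothetical and chosen only for illustration; tokenization and stop-word removal are left out.

```python
import re

def normalize_tweet(text):
    """Hypothetical helper that mirrors the substitutions in processTweet."""
    # Replace URLs with the placeholder URL.
    text = re.sub(r'((www\.[^\s]+)|(https?://[^\s]+))', 'URL', text)
    # Replace @mentions with the placeholder AT_USER (trailing punctuation
    # such as ':' is consumed because the pattern matches up to whitespace).
    text = re.sub(r'@[^\s]+', 'AT_USER', text)
    # Keep the hashtag text but drop the leading '#'.
    text = re.sub(r'#([^\s]+)', r'\1', text)
    # Lowercasing happens last, so the placeholders are lowercased too.
    return text.lower()

sample = "RT @Professor_Why: #WhatsRomneyHiding see https://example.com/x"
normalized = normalize_tweet(sample)
print(normalized)  # rt at_user whatsromneyhiding see url
```

Because lowercasing runs after the substitutions, the placeholders appear as `at_user` and `url`, which is why `at_user` still shows up in the processed tweets.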

## Process all tweets

In [None]:
processed=[]

for tweet in data['tweets']:
    
    # process all tweets using processTweet function above - store in variable 'cleaned' 
    cleaned=processTweet(tweet)
    processed.append(' '.join(cleaned))

In [None]:
data['processed'] = processed
print(data)

                                                 tweets  ...                                          processed
0     Obama has called the GOP budget social Darwini...  ...  obama called gop budget social darwinism nice ...
1     In his teen years, Obama has been known to use...  ...       teen years obama known use marijuana cocaine
2     IPA Congratulates President Barack Obama for L...  ...  ipa congratulates president barack obama leade...
3     RT @Professor_Why: #WhatsRomneyHiding - his co...  ...  rt at_user whatsromneyhiding connection suppor...
4     RT @wardollarshome: Obama has approved more ta...  ...  rt at_user obama approved targeted assassinati...
...                                                 ...  ...                                                ...
1375  @liberalminds Its trending idiot.. Did you loo...  ...  at_user trending idiot.. look tweets lol makin...
1376  RT @AstoldByBass: #KimKardashiansNextBoyfriend...  ...  rt at_user kimkardashiansnextboyfriend bar

## Create pipeline and define parameters for GridSearch

In [None]:
text_clf = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', MultinomialNB())])
print(text_clf)
tuned_parameters = {
    'vect__ngram_range': [(1, 1), (1, 2), (2, 2)],
    'tfidf__use_idf': (True, False),
    'tfidf__norm': ('l1', 'l2'),
    'clf__alpha': [1, 1e-3, 1e-4]
}


Pipeline(memory=None,
         steps=[('vect',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=1,
                                 ngram_range=(1, 1), preprocessor=None,
                                 stop_words=None, strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=None, vocabulary=None)),
                ('tfidf',
                 TfidfTransformer(norm='l2', smooth_idf=True,
                                  sublinear_tf=False, use_idf=True)),
                ('clf',
                 MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))],
         verbose=False)
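
As a rough sanity check on the cost of the search, the number of parameter combinations in `tuned_parameters` can be counted by hand. The sketch below does this with `itertools.product` from the standard library (scikit-learn's `ParameterGrid` provides the same enumeration); the variable names `combos` and `n_folds` are introduced here for illustration.

```python
from itertools import product

tuned_parameters = {
    'vect__ngram_range': [(1, 1), (1, 2), (2, 2)],
    'tfidf__use_idf': (True, False),
    'tfidf__norm': ('l1', 'l2'),
    'clf__alpha': [1, 1e-3, 1e-4]
}

# Build every combination of one value per parameter.
names = list(tuned_parameters)
combos = [dict(zip(names, values))
          for values in product(*tuned_parameters.values())]

n_folds = 10  # the assignment runs 10-fold cross-validation
print(len(combos))            # 3 * 2 * 2 * 3 = 36 candidate settings
print(len(combos) * n_folds)  # 360 pipeline fits during the search
```

With the default `refit=True`, `GridSearchCV` also fits the best setting once more on the full training set after the search.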


## Split data into test and train

In [None]:
# split data into train and test sets, holding out 20% of the rows for testing
X = data.processed
y = data.labels

x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(x_train.shape)
print(x_test.shape)



(1104,)
(276,)


## Perform classification (using GridSearch)

In [None]:
# perform GridSearchCV with 10-fold cross-validation using the pipeline and tuned_parameters defined above
clf = GridSearchCV(text_clf, tuned_parameters, cv=10)
clf.fit(x_train, y_train)

GridSearchCV(cv=10, error_score=nan,
             estimator=Pipeline(memory=None,
                                steps=[('vect',
                                        CountVectorizer(analyzer='word',
                                                        binary=False,
                                                        decode_error='strict',
                                                        dtype=<class 'numpy.int64'>,
                                                        encoding='utf-8',
                                                        input='content',
                                                        lowercase=True,
                                                        max_df=1.0,
                                                        max_features=None,
                                                        min_df=1,
                                                        ngram_range=(1, 1),
                                                        pre
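
Once a search has been fitted, the winning setting and its mean cross-validated score are available as `best_params_` and `best_score_`. The following is a minimal sketch on a made-up toy corpus, not the tweets dataset; the `toy_*` names, the texts, and the reduced grid are all assumptions for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus: 1 = positive, 0 = negative.
toy_texts = ["good great nice", "great good love",
             "nice love good", "love great nice",
             "bad awful hate", "awful bad worst",
             "hate worst bad", "worst awful hate"]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]

toy_pipeline = Pipeline([('vect', CountVectorizer()),
                         ('tfidf', TfidfTransformer()),
                         ('clf', MultinomialNB())])
toy_grid = {'clf__alpha': [1, 1e-3]}

# cv=2 only because the toy corpus is tiny; the assignment uses cv=10.
toy_search = GridSearchCV(toy_pipeline, toy_grid, cv=2)
toy_search.fit(toy_texts, toy_labels)

print(toy_search.best_params_)  # parameter setting with the best mean CV score
print(toy_search.best_score_)   # that mean cross-validated accuracy

# predict() uses the best pipeline, refitted on all the data passed to fit().
toy_predictions = toy_search.predict(["great nice love"])
print(toy_predictions)
```

The same attributes can be read from `clf` in this notebook after `clf.fit(x_train, y_train)` has run.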

## Classification report 

In [None]:
# print classification report after predicting on test set with best model obtained in GridSearch
print(classification_report(y_test, clf.predict(x_test), digits=2))

              precision    recall  f1-score   support

           0       0.83      0.97      0.89       188
           1       0.87      0.59      0.70        70
           2       0.62      0.28      0.38        18

    accuracy                           0.83       276
   macro avg       0.78      0.61      0.66       276
weighted avg       0.83      0.83      0.81       276
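
The `macro avg` and `weighted avg` rows follow from the per-class rows. The sketch below recomputes the f1 averages from the rounded values printed above, so the results are only approximate; it also computes the accuracy of a baseline that always predicts the largest class, which is included here as context and is not part of the original report.

```python
# Per-class rows copied from the report above: (precision, recall, f1, support).
report = {
    0: (0.83, 0.97, 0.89, 188),
    1: (0.87, 0.59, 0.70, 70),
    2: (0.62, 0.28, 0.38, 18),
}

total = sum(row[3] for row in report.values())

# Macro average: plain mean over classes, every class counts equally.
macro_f1 = sum(row[2] for row in report.values()) / len(report)

# Weighted average: mean over classes weighted by support.
weighted_f1 = sum(row[2] * row[3] for row in report.values()) / total

# Accuracy of always predicting the most frequent class in the test set.
majority_baseline = max(row[3] for row in report.values()) / total

print(f"macro f1:          {macro_f1:.2f}")
print(f"weighted f1:       {weighted_f1:.2f}")
print(f"majority baseline: {majority_baseline:.2f}")
```

The gap between the macro and weighted f1 scores reflects how much weaker the classifier is on the two smaller classes.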



## Important:

In [None]:
counts = data.labels.value_counts()
print(counts)

The counts above show that the class distribution is highly imbalanced, so the classifier sees far fewer examples of the minority classes. In the test set alone, the three classes have 188, 70, and 18 tweets. For your own learning, try using [SMOTE](https://imbalanced-learn.readthedocs.io/en/stable/api.html) to oversample the minority classes, then retrain Naive Bayes and compare its performance with the results above.
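
As a simpler stand-in for experimenting with oversampling, the sketch below duplicates minority-class rows at random until every class is as large as the biggest one. This is plain random oversampling, not SMOTE: SMOTE synthesizes new points in feature space rather than copying existing rows. The function `random_oversample` and the toy data are hypothetical.

```python
import random
from collections import Counter

def random_oversample(texts, labels, seed=42):
    """Duplicate minority-class rows until all classes match the largest one."""
    rng = random.Random(seed)
    by_class = {}
    for text, label in zip(texts, labels):
        by_class.setdefault(label, []).append(text)
    target = max(len(rows) for rows in by_class.values())

    out_texts, out_labels = [], []
    for label, rows in by_class.items():
        # Sample with replacement to fill the gap up to the target size.
        extra = [rng.choice(rows) for _ in range(target - len(rows))]
        for text in rows + extra:
            out_texts.append(text)
            out_labels.append(label)
    return out_texts, out_labels

# Hypothetical imbalanced toy data: four, two, and one example per class.
toy_texts = ["a", "b", "c", "d", "e", "f", "g"]
toy_labels = [0, 0, 0, 0, 1, 1, 2]

bal_texts, bal_labels = random_oversample(toy_texts, toy_labels)
print(Counter(bal_labels))
```

Oversampling should be applied to the training split only, so that the test set keeps the original class distribution.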