<a href="https://colab.research.google.com/github/gopal2812/mlblr/blob/master/NLPSentiment_Analysis_using_Naive_Bayesonimbalanceddata.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Sentiment Analysis using Naive Bayes

In this assignment, we will attempt to label tweets with sentiments (positive, neutral and negative) using Naive Bayes classifier. Naive Bayes is a very basic approach to this problem, but gives surprisingly good accuracy sometimes.

**Fill in the Blanks**

## Importing required libraries

In [1]:
import pandas as pd
import re
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

In [3]:
from google.colab import files
files.upload()


Saving tweets.csv to tweets.csv




## Reading dataset

In [17]:
data=pd.read_csv('tweets.csv').astype(str)
data.head()
data.drop(data.columns[0],axis=1,inplace=True)
data.head()

Unnamed: 0,tweets,labels
0,Obama has called the GOP budget social Darwini...,1
1,"In his teen years, Obama has been known to use...",0
2,IPA Congratulates President Barack Obama for L...,0
3,RT @Professor_Why: #WhatsRomneyHiding - his co...,0
4,RT @wardollarshome: Obama has approved more ta...,1


## Text processing for the tweets

In [18]:
import nltk 
nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [15]:
from nltk.tokenize import word_tokenize
from string import punctuation 
from nltk.corpus import stopwords 

stopwords = set(stopwords.words('english') + list(punctuation) + ['AT_USER','URL'])
    
def processTweet(tweet):
    # tweet is the text we will pass for preprocessing 
    # convert passed tweet to lower case 
    tweet.lower()
    tweet = re.sub('((www\.[^\s]+)|(https?://[^\s]+))', 'URL', tweet) # remove URLs
    tweet = re.sub('@[^\s]+', 'AT_USER', tweet) # remove usernames
    tweet = re.sub(r'#([^\s]+)', r'\1', tweet) # remove the # in #hashtag
    
    # use work_tokenize imported above to tokenize the tweet
    # instantiate tokenizer class
    tweet = word_tokenize(tweet)
    return [word for word in tweet if word not in stopwords]

## Process all tweets

In [19]:
processed=[]

for tweet in data['tweets']:
    
    # process all tweets using processTweet function above - store in variable 'cleaned' 
    cleaned=processTweet(tweet)
    processed.append(' '.join(cleaned))

In [20]:
data['processed'] = processed

## Create pipeline and define parameters for GridSearch

In [21]:
text_clf = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', MultinomialNB())])

tuned_parameters = {
    'vect__ngram_range': [(1, 1), (1, 2), (2, 2)],
    'tfidf__use_idf': (True, False),
    'tfidf__norm': ('l1', 'l2'),
    'clf__alpha': [1, 1e-1, 1e-2]
}

## Split data into test and train

In [31]:
# split data into train and test with split as 0.2 
X = data.processed
y = data.labels

x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

## Perform classification (using GridSearch)

In [34]:
# perform GridSearch CV with 10 fold CV using pipeline and tuned_paramters defined above 
clf = GridSearchCV(text_clf,
                   param_grid=tuned_parameters,
                   cv=10)
clf.fit(x_train, y_train)

GridSearchCV(cv=10, error_score=nan,
             estimator=Pipeline(memory=None,
                                steps=[('vect',
                                        CountVectorizer(analyzer='word',
                                                        binary=False,
                                                        decode_error='strict',
                                                        dtype=<class 'numpy.int64'>,
                                                        encoding='utf-8',
                                                        input='content',
                                                        lowercase=True,
                                                        max_df=1.0,
                                                        max_features=None,
                                                        min_df=1,
                                                        ngram_range=(1, 1),
                                                        pre

## Classification report 

In [35]:
# print classification report after predicting on test set with best model obtained in GridSearch
print(clf.best_params_)

{'clf__alpha': 0.1, 'tfidf__norm': 'l2', 'tfidf__use_idf': False, 'vect__ngram_range': (1, 2)}


## Important:

In [55]:
counts = data.labels.value_counts()
print(counts)
prediction = clf.predict(x_test)
scores = clf.score(x_test,y_test)

0    947
1    352
2     81
Name: labels, dtype: int64


In [37]:
	
!sudo pip install imbalanced-learn



In [53]:
from imblearn.over_sampling import SMOTE
import numpy as np
from imblearn.pipeline import make_pipeline
from matplotlib import pyplot
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import precision_score, recall_score, f1_score

tvec = TfidfVectorizer(stop_words=None, max_features=100000, ngram_range=(1, 3))
lr = MultinomialNB()

def lr_cv(splits, X, Y, pipeline, average_method):
    
    kfold = StratifiedKFold(n_splits=splits, shuffle=True, random_state=777)
    accuracy = []
    precision = []
    recall = []
    f1 = []
    for train, test in kfold.split(X, Y):
        lr_fit = pipeline.fit(X[train], Y[train])
        prediction = lr_fit.predict(X[test])
        scores = lr_fit.score(X[test],Y[test])
        
        accuracy.append(scores * 100)
        precision.append(precision_score(Y[test], prediction, average=average_method)*100)
        print('              negative    neutral     positive')
        print('precision:',precision_score(Y[test], prediction, average=None))
        recall.append(recall_score(Y[test], prediction, average=average_method)*100)
        print('recall:   ',recall_score(Y[test], prediction, average=None))
        f1.append(f1_score(Y[test], prediction, average=average_method)*100)
        print('f1 score: ',f1_score(Y[test], prediction, average=None))
        print('-'*50)

    print("accuracy: %.2f%% (+/- %.2f%%)" % (np.mean(accuracy), np.std(accuracy)))
    print("precision: %.2f%% (+/- %.2f%%)" % (np.mean(precision), np.std(precision)))
    print("recall: %.2f%% (+/- %.2f%%)" % (np.mean(recall), np.std(recall)))
    print("f1 score: %.2f%% (+/- %.2f%%)" % (np.mean(f1), np.std(f1)))

SMOTE_pipeline = make_pipeline(tvec, SMOTE(random_state=777) , lr)

#original_pipeline = Pipeline([
#    ('vectorizer', tvec),
#    ('SMOTE(random_state=777)', ),
#    ('classifier', lr)
#])



lr_cv(5, X, y, SMOTE_pipeline, 'macro')



              negative    neutral     positive
precision: [0.90728477 0.54022989 0.21052632]
recall:    [0.72486772 0.66197183 0.5       ]
f1 score:  [0.80588235 0.59493671 0.2962963 ]
--------------------------------------------------




              negative    neutral     positive
precision: [0.91390728 0.52631579 0.33333333]
recall:    [0.73015873 0.70422535 0.625     ]
f1 score:  [0.81176471 0.60240964 0.43478261]
--------------------------------------------------




              negative    neutral     positive
precision: [0.87248322 0.54945055 0.22222222]
recall:    [0.68421053 0.71428571 0.5       ]
f1 score:  [0.76696165 0.62111801 0.30769231]
--------------------------------------------------




              negative    neutral     positive
precision: [0.90789474 0.58695652 0.25      ]
recall:    [0.72631579 0.77142857 0.5       ]
f1 score:  [0.80701754 0.66666667 0.33333333]
--------------------------------------------------




              negative    neutral     positive
precision: [0.91304348 0.71621622 0.26829268]
recall:    [0.77777778 0.75714286 0.64705882]
f1 score:  [0.84       0.73611111 0.37931034]
--------------------------------------------------
accuracy: 71.67% (+/- 2.85%)
precision: 58.12% (+/- 3.05%)
recall: 66.83% (+/- 3.64%)
f1 score: 60.03% (+/- 3.27%)


We can see above that the class distribution is highly imbalanced, this would not lead to good sampling of the data for the classifier. For your learning, try using [SMOTE](https://imbalanced-learn.readthedocs.io/en/stable/api.html) to oversample the minority classes and then evaluate the performance with Naive Bayes and compare.

https://towardsdatascience.com/yet-another-twitter-sentiment-analysis-part-1-tackling-class-imbalance-4d7a7f717d44