# Naive Bayes in Sentiment Analysis
This notebook walks through using the `Naive Bayes` rule to classify a tweet as `Positive` or `Negative`. We will cover the following steps:

- Collecting the dataset (`twitter_samples`).
- Cleaning and processing the dataset.
- Training a Naive Bayes model on a sentiment analysis task.
- Testing the Naive Bayes model.
- Predicting the sentiment of our own tweet.

In [87]:
# Automatically reload changes from external files
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [88]:
# importing necessary libraries
import numpy as np
import nltk
import pprint
from utils import *
from model import *

### 1. Dataset Loading
Download the dataset and the required packages from **`nltk`** with the following script:
```Python
nltk.download('twitter_samples')
nltk.download('stopwords')
```
This dataset contains subsets of 5,000 positive tweets, 5,000 negative tweets, and the full set of 20,000 tweets.

In [8]:
# Loading datasets
from nltk.corpus import twitter_samples

file_ids = twitter_samples.fileids()
print(file_ids)

['negative_tweets.json', 'positive_tweets.json', 'tweets.20150430-223406.json']


In [21]:
# select the set of positive and negative tweets
positive_tweets = twitter_samples.strings('positive_tweets.json')
negative_tweets = twitter_samples.strings('negative_tweets.json')
print(f'''length of positive tweets: {len(positive_tweets)}\nlength of negative tweets: {len(negative_tweets)}''')

length of positive tweets: 5000
length of negative tweets: 5000


In [28]:
# pretty-print the first five tweets of each class
pprint.pprint(positive_tweets[:5])
print('\n')
pprint.pprint(negative_tweets[:5])

['#FollowFriday @France_Inte @PKuchly57 @Milipol_Paris for being top engaged '
 'members in my community this week :)',
 '@Lamb2ja Hey James! How odd :/ Please call our Contact Centre on 02392441234 '
 'and we will be able to assist you :) Many thanks!',
 '@DespiteOfficial we had a listen last night :) As You Bleed is an amazing '
 'track. When are you in Scotland?!',
 '@97sides CONGRATS :)',
 'yeaaaah yippppy!!!  my accnt verified rqst has succeed got a blue tick mark '
 'on my fb profile :) in 15 days']


['hopeless for tmr :(',
 "Everything in the kids section of IKEA is so cute. Shame I'm nearly 19 in 2 "
 'months :(',
 '@Hegelbon That heart sliding into the waste basket. :(',
 '“@ketchBurning: I hate Japanese call him "bani" :( :(”\n\nMe too',
 'Dang starting next week I have "work" :(']


In [92]:
# Split the dataset into training and test examples
train_positive = positive_tweets[:4000]
test_positive = positive_tweets[4000:]
train_negative = negative_tweets[:4000]
test_negative = negative_tweets[4000:]

train_x = train_positive + train_negative
test_x = test_positive + test_negative

# Labels: 1 for positive tweets, 0 for negative tweets
train_y = np.squeeze(np.append(np.ones((len(train_positive), 1)), np.zeros((len(train_negative), 1)), axis=0))
test_y = np.squeeze(np.append(np.ones((len(test_positive), 1)), np.zeros((len(test_negative), 1)), axis=0))

In [93]:
shapes = f'''
train_positive: {len(train_positive)}
train_negative: {len(train_negative)}
test_positive: {len(test_positive)}
test_negative: {len(test_negative)}
train_x: {len(train_x)}
train_y: {len(train_y)}
test_x: {len(test_x)}
test_y: {len(test_y)}
'''
print(shapes)


train_positive: 4000
train_negative: 4000
test_positive: 1000
test_negative: 1000
train_x: 8000
train_y: 8000
test_x: 2000
test_y: 2000



### 2. Data Cleaning and Processing
In this step, the helper function `process_tweet` (defined in `utils.py`) will:
- Lowercase the text
- Remove punctuation, URLs, and user names (`@handles`)
- Remove stop words
- Stem each word
- Tokenize each tweet

In [94]:
tweet = '#FollowFriday @France_Inte @PKuchly57 @Milipol_Paris for being top engaged '
process_tweet(tweet)

['followfriday', 'top', 'engag']
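
The real `process_tweet` lives in `utils.py` and is not reproduced in this notebook. As a rough illustration of the cleaning steps listed above, here is a minimal standard-library sketch. The function name `simple_process_tweet` and the tiny `STOP_WORDS` set are assumptions made for this example, and stemming is left out, so it yields `engaged` where the real helper yields `engag`.

```python
import re
import string

# Tiny illustrative stop-word list (an assumption for this sketch);
# NLTK's 'stopwords' corpus is much larger.
STOP_WORDS = {"i", "am", "a", "an", "the", "for", "in", "my", "this",
              "is", "are", "being", "and", "to", "of"}


def simple_process_tweet(tweet):
    """Hypothetical stand-in for process_tweet: clean, tokenize, filter."""
    tweet = re.sub(r"https?://\S+", " ", tweet)  # drop URLs
    tweet = re.sub(r"@\w+", " ", tweet)          # drop @handles
    tweet = tweet.replace("#", "")               # keep hashtag text, drop '#'
    tokens = []
    for raw in tweet.lower().split():
        if any(ch.isalnum() for ch in raw):
            # Ordinary word: trim punctuation from both ends.
            token = raw.strip(string.punctuation)
        else:
            # Pure punctuation: keep multi-character runs such as ':)'.
            token = raw if len(raw) > 1 else ""
        if token and token not in STOP_WORDS:
            tokens.append(token)
    return tokens


print(simple_process_tweet(
    '#FollowFriday @France_Inte @PKuchly57 @Milipol_Paris for being top engaged '
))
# ['followfriday', 'top', 'engaged']
```

Emoticons such as `:)` and `:(` are kept as tokens on purpose, since they carry much of the sentiment signal in this corpus.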

Next, we build a frequency table that counts how often each word occurs in positive and in negative tweets in the corpus. The `word_freq` function returns a dictionary that maps each `(word, label)` tuple to its count: `{(word, label): freq}`.

In [108]:
# Sample of the output
tweets = ['i am happy', 'i am tricked', 'i am sad', 'i am tired', 'i am tired']
ys = [1, 0, 0, 0, 0]
print(word_freq(tweets, ys))

{('happi', 1): 1, ('trick', 0): 1, ('sad', 0): 1, ('tire', 0): 2}
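
`word_freq` itself is not shown in this notebook. The sketch below is one plausible way to build such a table, assuming plain whitespace tokenization in place of `process_tweet`; the name `simple_word_freq` and its `tokenize` parameter are hypothetical.

```python
from collections import defaultdict


def simple_word_freq(tweets, labels, tokenize=str.split):
    """Count how often each (word, label) pair occurs (illustrative sketch)."""
    freqs = defaultdict(int)
    for tweet, label in zip(tweets, labels):
        for word in tokenize(tweet.lower()):
            # Cast the label to int so 1.0 and 1 produce the same key.
            freqs[(word, int(label))] += 1
    return dict(freqs)


counts = simple_word_freq(['i am happy', 'i am tired', 'i am tired'], [1, 0, 0])
print(counts[('tired', 0)])  # 2
```

Unlike the output above, this sketch keeps stop words such as `i` and `am`, because it skips the cleaning step.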


In [98]:
# Build the word frequency dictionary from the training set
freqs = word_freq(train_x, train_y)

### 3. Model Training
We will train a Naive Bayes model with `train_naive_bayes`, defined in `model.py`.

In [109]:
# train the model with train_naive_bayes from model.py
model = train_naive_bayes(freqs, train_x, train_y)
print("logprior", model[0])
print("loglikelihood", len(model[1]))

logprior 0.0
loglikelihood 9085
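
`train_naive_bayes` is defined in `model.py` and is not shown here. A common way to train Naive Bayes for this task, sketched below as an assumption, takes the log prior as log(D_pos / D_neg) over the number of positive and negative training tweets, and each word's log likelihood as the log ratio of its add-one (Laplace) smoothed probabilities in the two classes. The name `simple_train_naive_bayes` and its two-argument signature are hypothetical. With 4,000 tweets in each class, the log prior is log(1) = 0, which matches the `logprior 0.0` printed above.

```python
import math


def simple_train_naive_bayes(freqs, train_y):
    """Return (logprior, loglikelihood) from a {(word, label): count} table."""
    vocab = {word for word, _ in freqs}
    v = len(vocab)

    # Total word counts per class.
    n_pos = sum(c for (_, label), c in freqs.items() if label == 1)
    n_neg = sum(c for (_, label), c in freqs.items() if label == 0)

    # Number of documents per class.
    d_pos = sum(1 for y in train_y if y == 1)
    d_neg = len(train_y) - d_pos
    logprior = math.log(d_pos) - math.log(d_neg)

    loglikelihood = {}
    for word in vocab:
        # Add-one smoothing keeps unseen (word, class) pairs from giving log(0).
        p_pos = (freqs.get((word, 1), 0) + 1) / (n_pos + v)
        p_neg = (freqs.get((word, 0), 0) + 1) / (n_neg + v)
        loglikelihood[word] = math.log(p_pos / p_neg)
    return logprior, loglikelihood


toy_freqs = {('happi', 1): 1, ('trick', 0): 1, ('sad', 0): 1, ('tire', 0): 2}
logprior, loglikelihood = simple_train_naive_bayes(toy_freqs, [1, 0, 0, 0, 0])
print(loglikelihood['happi'] > 0, loglikelihood['tire'] < 0)  # True True
```

Under this formulation the log-likelihood dictionary has one entry per vocabulary word, so its length is the vocabulary size.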


In [111]:
my_tweet = 'She smiled.'
prediction = model_predict(model, my_tweet)
print(prediction)

1.5737244858565678
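
`model_predict` is also defined in `model.py`. A typical Naive Bayes scorer, sketched here as an assumption about what it does, adds the log prior to the log likelihood of every known word in the tweet; a score above 0 reads as positive and a score below 0 as negative. `simple_predict` and `toy_model` are hypothetical names, and plain whitespace splitting stands in for `process_tweet`.

```python
def simple_predict(model, tweet, tokenize=str.split):
    """Sum the log prior and per-word log likelihoods (illustrative sketch)."""
    logprior, loglikelihood = model
    score = logprior
    for word in tokenize(tweet.lower()):
        # Words the model has never seen contribute nothing to the score.
        score += loglikelihood.get(word, 0.0)
    return score


toy_model = (0.0, {'great': 2.0, 'bad': -1.5})
print(simple_predict(toy_model, 'great great'))  # 4.0
print(simple_predict(toy_model, 'so bad'))       # -1.5
```

Because the score is a plain sum, repeating a word adds its log likelihood each time. That would be consistent with the scores for `great`, `great great`, and so on in the testing section growing in roughly equal steps.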


### 4. Model Testing
We measure the accuracy of the model on the held-out test set.

In [114]:
# Accuracy of the model on the test set
accuracy = model_test(model, test_x, test_y)
print('accuracy:', accuracy)

accuracy: 0.994
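
`model_test` is not shown in this notebook. Assuming it thresholds each score at 0 and compares the result with the true label, the accuracy computation could look like the sketch below. `simple_accuracy` is a hypothetical name, and it takes precomputed scores rather than the model and raw tweets.

```python
def simple_accuracy(scores, labels):
    """Fraction of scores whose sign agrees with the 0/1 label (sketch)."""
    # A score above 0 is read as positive (1), anything else as negative (0).
    predictions = [1 if score > 0 else 0 for score in scores]
    correct = sum(int(p == int(y)) for p, y in zip(predictions, labels))
    return correct / len(labels)


print(simple_accuracy([2.1, -0.4, 0.3, -1.2], [1, 0, 0, 0]))  # 0.75
```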


In [115]:
# Score a few example tweets with the trained model
for tweet in ['I am happy', 'I am bad', 'this movie should have been great.', 'great', 'great great', 'great great great', 'great great great great']:
    p = model_predict(model, tweet)
    print(f'{tweet} -> {p:.2f}')

I am happy -> 2.15
I am bad -> -1.29
this movie should have been great. -> 2.14
great -> 2.14
great great -> 4.28
great great great -> 6.41
great great great great -> 8.55


### 5. Predict your own tweet

In [116]:
# Feel free to check the sentiment of your own tweet below
my_tweet = 'you are bad :('
model_predict(model, my_tweet)

-8.80222939347889