## Sentiment Analysis with Logistic Regression
This simple project focuses on identifying whether a **Tweet** is **Positive** or **Negative**.

In [1]:
# Automatically reload changes from imported modules
%load_ext autoreload
%autoreload 2

In [2]:
# import necessary libraries and packages
import nltk
import numpy as np
import pprint
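from utils import (build_freqs, extract_features, gradientDescent,
                   predict_tweet, process_tweet, test_logistic_regression)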

We are going to use the `twitter_samples` dataset from `nltk`. It contains a subset of 5,000 positive tweets, a subset of 5,000 negative tweets, and a full set of 20,000 tweets.

### Downloading the dataset and English stopwords
```Python
nltk.download('twitter_samples')
nltk.download('stopwords')
```

In [3]:
# importing datasets
from nltk.corpus import twitter_samples

In [4]:
# Exploring the dataset
print(twitter_samples.fileids())

['negative_tweets.json', 'positive_tweets.json', 'tweets.20150430-223406.json']


In [5]:
# select the set of positive and negative tweets
positive_tweets = twitter_samples.strings('positive_tweets.json')
negative_tweets = twitter_samples.strings('negative_tweets.json')
print(f'''length of positive tweets: {len(positive_tweets)}\nlength of negative tweets: {len(negative_tweets)}''')

length of positive tweets: 5000
length of negative tweets: 5000


In [6]:
# pretty print the first five positive tweets
pprint.pprint(positive_tweets[:5])

['#FollowFriday @France_Inte @PKuchly57 @Milipol_Paris for being top engaged '
 'members in my community this week :)',
 '@Lamb2ja Hey James! How odd :/ Please call our Contact Centre on 02392441234 '
 'and we will be able to assist you :) Many thanks!',
 '@DespiteOfficial we had a listen last night :) As You Bleed is an amazing '
 'track. When are you in Scotland?!',
 '@97sides CONGRATS :)',
 'yeaaaah yippppy!!!  my accnt verified rqst has succeed got a blue tick mark '
 'on my fb profile :) in 15 days']


In [7]:
pprint.pprint(negative_tweets[:5])

['hopeless for tmr :(',
 "Everything in the kids section of IKEA is so cute. Shame I'm nearly 19 in 2 "
 'months :(',
 '@Hegelbon That heart sliding into the waste basket. :(',
 '“@ketchBurning: I hate Japanese call him "bani" :( :(”\n\nMe too',
 'Dang starting next week I have "work" :(']


In [8]:
# Splitting the datasets into train and test examples
train_positive = positive_tweets[:4000]
test_positive = positive_tweets[4000:]
train_negative = negative_tweets[:4000]
test_negative = negative_tweets[4000:]

train_x = train_positive + train_negative
test_x = test_positive + test_negative

train_y = np.append(np.ones((len(train_positive), 1)), np.zeros((len(train_negative), 1)), axis=0)
test_y = np.append(np.ones((len(test_positive), 1)), np.zeros((len(test_negative), 1)), axis=0)

In [9]:
shapes = f'''
train_positive: {len(train_positive)}
train_negative: {len(train_negative)}
train_x: {len(train_x)}
train_y: {len(train_y)}
test_positive: {len(test_positive)}
test_negative: {len(test_negative)}
test_x: {len(test_x)}
test_y: {len(test_y)}
'''
print(shapes)


train_positive: 4000
train_negative: 4000
train_x: 8000
train_y: 8000
test_positive: 1000
test_negative: 1000
test_x: 2000
test_y: 2000



In [10]:
# Build the frequency of each word in positive and negative tweets
freqs = build_freqs(train_x, train_y)

In [11]:
pprint.pprint(type(freqs))
print(len(freqs.keys()))

<class 'dict'>
11340
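
The `utils` module is not shown here, so the following is only a minimal sketch of what a function like `build_freqs` might do, assuming it maps each `(word, label)` pair to the number of times that word appears in tweets with that label. The names `build_freqs_sketch` and `simple_tokenize` are hypothetical, and the whitespace tokenizer is a stand-in for whatever preprocessing `process_tweet` actually performs.

```python
from collections import defaultdict


def simple_tokenize(text):
    """Hypothetical stand-in tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


def build_freqs_sketch(tweets, labels, tokenize):
    """Count how often each (token, label) pair occurs across the tweets."""
    freqs = defaultdict(int)
    for tweet, label in zip(tweets, labels):
        for token in tokenize(tweet):
            freqs[(token, label)] += 1
    return dict(freqs)


# Toy data (not taken from the dataset): 1.0 = positive, 0.0 = negative.
toy_freqs = build_freqs_sketch(
    ["great day", "great great fun", "bad day"],
    [1.0, 1.0, 0.0],
    simple_tokenize,
)
print(toy_freqs)
```

In this toy example `"day"` appears once under each label, so it gets two separate keys. That is why the dictionary above can hold more entries than there are distinct words.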


### Training

In [12]:
# collect the features 'x' and stack them into a matrix 'X'
X = np.zeros((len(train_x), 3))

for i in range(len(train_x)):
    X[i, :] = extract_features(train_x[i], freqs)

# training labels corresponding to X
Y = train_y

# Apply gradient descent
J, w = gradientDescent(X, Y, np.zeros((3, 1)), 1e-9, 2000)
print(f"The cost after training is {J:.8f}.")
print(f"The resulting vector of weights is {[round(t, 8) for t in np.squeeze(w)]}")

The cost after training is 0.21085308.
The resulting vector of weights is [1e-07, 0.00062145, -0.000633]
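
Since `extract_features` and `gradientDescent` live in `utils` and are not shown, here is a hedged sketch of how such helpers could work. It assumes each tweet becomes a three-element row (a bias term, the summed positive counts of its tokens, and the summed negative counts) and that training uses plain batch gradient descent on the logistic (cross-entropy) cost. The function names ending in `_sketch` and all toy values are hypothetical and may differ from the notebook's actual implementation.

```python
import numpy as np


def sigmoid(z):
    """Logistic function, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


def extract_features_sketch(tokens, freqs):
    """Return [bias, summed positive counts, summed negative counts] as a (1, 3) row."""
    x = np.zeros((1, 3))
    x[0, 0] = 1.0  # bias term
    for word in tokens:
        x[0, 1] += freqs.get((word, 1.0), 0)
        x[0, 2] += freqs.get((word, 0.0), 0)
    return x


def gradient_descent_sketch(x, y, theta, alpha, num_iters):
    """Batch gradient descent for logistic regression; returns (final cost, weights)."""
    m = x.shape[0]
    for _ in range(num_iters):
        h = sigmoid(x @ theta)                           # predictions, shape (m, 1)
        theta = theta - (alpha / m) * (x.T @ (h - y))    # gradient step
    h = sigmoid(x @ theta)
    cost = (-(y.T @ np.log(h) + (1 - y).T @ np.log(1 - h)) / m).item()
    return cost, theta


# Toy frequency table and already-tokenized tweets (hypothetical values).
toy_freqs = {("happi", 1.0): 3, ("sad", 0.0): 3, ("day", 1.0): 1, ("day", 0.0): 1}
toy_tweets = [["happi"], ["happi", "day"], ["sad"], ["sad", "day"]]
toy_labels = np.array([[1.0], [1.0], [0.0], [0.0]])

toy_X = np.vstack([extract_features_sketch(t, toy_freqs) for t in toy_tweets])
toy_cost, toy_w = gradient_descent_sketch(toy_X, toy_labels, np.zeros((3, 1)), 0.1, 500)
print(toy_cost, toy_w.ravel())
```

The toy run uses a much larger learning rate than the `1e-9` above because its feature values are tiny. With counts summed over thousands of tweets, the real features are far larger, which is one plausible reason such a small step size is used in the notebook.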


In [13]:
for tweet in ['I am happy', 'I am bad', 'this movie should have been great.', 'great', 'great great', 'great great great', 'great great great great']:
    print(f'{tweet} -> {predict_tweet(tweet, freqs, w).item():f}')

I am happy -> 0.522150
I am bad -> 0.493630
this movie should have been great. -> 0.518388
great -> 0.518437
great great -> 0.536823
great great great -> 0.555110
great great great great -> 0.573249
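
The steady rise for repeated `great` can be checked by hand. Converting the printed probabilities back to logits shows that each extra `great` adds roughly the same amount, which is what one would expect if the features simply sum word counts over every token and the bias weight is negligible. This is a small worked check on the numbers printed above, not part of the original notebook; `logit` is a helper defined here.

```python
import math


def logit(p):
    """Inverse of the sigmoid: the linear score that yields probability p."""
    return math.log(p / (1.0 - p))


# Probabilities reported above for 'great' repeated 1 to 4 times.
reported = [0.518437, 0.536823, 0.555110, 0.573249]
logits = [logit(p) for p in reported]

# Increase in logit contributed by each additional 'great'.
steps = [logits[0]] + [b - a for a, b in zip(logits, logits[1:])]
for count, (score, step) in enumerate(zip(logits, steps), start=1):
    print(f"{count} x great: logit = {score:.5f}, step = {step:.5f}")
```

Every step comes out at about 0.074, so the model's score grows linearly with the number of repetitions while the probability follows the sigmoid curve.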


### Testing with user tweets

In [14]:
# Feel free to change the tweet below
my_tweet = 'The plot was terrible and I was sad until the ending!'
print(process_tweet(my_tweet))
y_hat = predict_tweet(my_tweet, freqs, w)
print(y_hat)
if y_hat > 0.5:
    print('Positive sentiment')
else:
    print('Negative sentiment')

['plot', 'terribl', 'sad', 'end']
[[0.47969763]]
Negative sentiment


### Evaluating the accuracy

In [15]:
tmp_accuracy = test_logistic_regression(test_x, test_y, freqs, w)
print(f"Logistic regression model's accuracy = {tmp_accuracy:.4f}")

Logistic regression model's accuracy = 0.9955
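
For reference, accuracy here presumably means the fraction of test tweets whose thresholded prediction matches the label. The sketch below illustrates that calculation under the assumption that a probability above 0.5 counts as positive, as in the user-tweet cell. `accuracy_sketch` is a hypothetical name, and the real `test_logistic_regression` in `utils` may be implemented differently.

```python
def accuracy_sketch(probabilities, labels, threshold=0.5):
    """Fraction of predictions that match the labels after thresholding."""
    predictions = [1.0 if p > threshold else 0.0 for p in probabilities]
    correct = sum(1 for pred, label in zip(predictions, labels) if pred == label)
    return correct / len(labels)


# Toy probabilities and labels (not from the dataset): one of four is wrong.
toy_accuracy = accuracy_sketch([0.52, 0.49, 0.61, 0.48], [1.0, 0.0, 0.0, 0.0])
print(toy_accuracy)
```

By the same definition, an accuracy of 0.9955 on the 2,000 test tweets corresponds to 1,991 correct predictions.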
