# Sentiment Analysis Demo

Notebook from [Eric Elmoznino](https://github.com/EricElmoznino/lighthouse_nlp_II).

![workflow](images/sentiment_workflow.jpg)

# Loading the dataset

The dataset was downloaded from [Kaggle](https://www.kaggle.com/code/kerneler/starter-imdb-master-csv-19b6829a-2/data).

In [1]:
import pandas as pd
imdb_df = pd.read_csv('data/imdb_sentiment.csv', encoding = "ISO-8859-1")
imdb_df.columns = ['review', 'label']
imdb_df.head()

| | review | label |
|---|---|---|
| 0 | One of the other reviewers has mentioned that ... | positive |
| 1 | A wonderful little production. <br /><br />The... | positive |
| 2 | I thought this was a wonderful way to spend ti... | positive |
| 3 | Basically there's a family where a little boy ... | negative |
| 4 | Petter Mattei's "Love in the Time of Money" is... | positive |


In [2]:
imdb_df['label'].value_counts()

positive    25000
negative    25000
Name: label, dtype: int64

In [3]:
# Only consider positive and negative reviews
imdb_df = imdb_df[imdb_df['label'].str.startswith(('pos','neg'))]

# Preprocessing to a numeric representation

The current data is in the form of movie reviews (text paragraphs) and their targets (`positive` or `negative`).
We need to encode the movie reviews as feature vectors so that we can train supervised machine learning models with `scikit-learn`.
How can we do this?

#### Create binarized word frequency counts (`X_binary`)
Turn the text into a sparse vector of word frequency counts using [CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) from `scikit-learn`.

When you reproduce this, explore the arguments of `CountVectorizer` (e.g., [`stop_words`](https://scikit-learn.org/stable/modules/feature_extraction.html#stop-words), `ngram_range`, `max_features`, `min_df`, `tokenizer`, and `binary`).  
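As a starting point for that exploration, here is a minimal sketch of how a few of these arguments change the vocabulary size. The three-sentence corpus and the `vocab_size` helper are made up for illustration and are not part of the original notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, made up for illustration (not taken from the IMDB data)
docs = [
    "the plot was great",
    "the acting was great",
    "the ending was weak",
]

def vocab_size(**kwargs):
    """Fit a CountVectorizer with the given arguments and return its vocabulary size."""
    vectorizer = CountVectorizer(**kwargs)
    vectorizer.fit(docs)
    return len(vectorizer.vocabulary_)

# Hypothetical settings chosen only to show the effect of each argument
sizes = {
    "default": vocab_size(),
    "stop_words='english'": vocab_size(stop_words="english"),
    "min_df=2": vocab_size(min_df=2),
    "ngram_range=(1, 2)": vocab_size(ngram_range=(1, 2)),
    "max_features=3": vocab_size(max_features=3),
}
for name, size in sizes.items():
    print(f"{name}: {size} features")
```

Removing stop words and raising `min_df` shrink the vocabulary, while adding bigrams with `ngram_range` grows it. On the real reviews the sizes will of course differ.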

The intuition behind using a binarized representation is that, for sentiment analysis, word occurrence may matter more than word frequency. For instance, the occurrence of the word _excellent_ tells us a lot, and the fact that it occurs four times may not tell us much more. This is just a hypothesis, however, and one that you could test.
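To make the difference concrete, the sketch below encodes two made-up sentences with and without `binary=True`. The corpus is an assumption for illustration only. It does not test the hypothesis, it only shows what each representation keeps.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, made up for illustration (not taken from the IMDB data)
docs = [
    "excellent excellent excellent excellent movie",
    "excellent movie",
]

count_vec = CountVectorizer()               # raw term counts
binary_vec = CountVectorizer(binary=True)   # 1 if the word occurs, else 0

X_counts = count_vec.fit_transform(docs).toarray()
X_binary = binary_vec.fit_transform(docs).toarray()

# Column order follows the learned vocabulary indices
vocab = sorted(count_vec.vocabulary_, key=count_vec.vocabulary_.get)
print(f"Vocabulary: {vocab}")
print(f"Counts:\n{X_counts}")
print(f"Binary:\n{X_binary}")
```

With raw counts the two documents differ in the _excellent_ column (4 versus 1). With `binary=True` they become identical, which is exactly the information the hypothesis says we can afford to drop.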

In [4]:
# For tokenization
import nltk

# For converting words into frequency counts
from sklearn.feature_extraction.text import CountVectorizer

In [5]:
# First step in pipeline
# Keep words that appear in at least 2 documents, then keep only the 5000 most frequent words
preprocessor = CountVectorizer(min_df=2, tokenizer=nltk.word_tokenize, max_features=5000, binary=True)

In [6]:
# Demo the preprocessor:
X_example = preprocessor.fit_transform(imdb_df['review'].iloc[:1000])
print(f'Preprocessing output shape: {X_example.shape}')

# Show the process for the first datapoint
first_datapoint = imdb_df['review'].iloc[0]
print(f'First datapoint: {first_datapoint[:100]}')

first_tokens = nltk.word_tokenize(first_datapoint)
print(f'First datapoint tokens: {first_tokens[:10]}')

first_bow = preprocessor.transform([first_datapoint])
first_bow.maxprint = 5  # Limit how many non-zero elements are printed so the notebook is not cluttered
print(f'First datapoint Binary Bag of Words (sparse) representation:\n{first_bow}')
print(f'First datapoint Binary Bag of Words (dense) representation:\n{first_bow.todense()}')

Preprocessing output shape: (1000, 5000)
First datapoint: One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. The
First datapoint tokens: ['One', 'of', 'the', 'other', 'reviewers', 'has', 'mentioned', 'that', 'after', 'watching']
First datapoint Binary Bag of Words (sparse) representation:
  (0, 8)	1
  (0, 14)	1
  :	:
  (0, 4962)	1
  (0, 4986)	1
  (0, 4989)	1
First datapoint Binary Bag of Words (dense) representation:
[[0 0 0 ... 0 0 0]]


### Question
What is the disadvantage of using something like a *Bag of Words* representation for the documents in sentiment analysis?

### Answer
You lose the word order, and therefore the syntax. For instance, the scope of negation is lost: a bag of words records that _not_ occurred, but not which word it negated.
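The loss of word order can be sketched with two made-up reviews that use the same words but have opposite meanings. The sentences are assumptions for illustration. Adding bigrams through `ngram_range=(1, 2)` is shown as one possible way to recover some local order, not as what the original notebook does.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up reviews with the same words and opposite sentiment
reviews = [
    "the movie was good , not bad",
    "the movie was bad , not good",
]

# Unigram bag of words: word order is discarded
unigram = CountVectorizer(binary=True)
X_uni = unigram.fit_transform(reviews).toarray()
same_unigram = bool((X_uni[0] == X_uni[1]).all())

# Unigrams plus bigrams: pairs such as "not bad" and "not good" become features
bigram = CountVectorizer(binary=True, ngram_range=(1, 2))
X_bi = bigram.fit_transform(reviews).toarray()
same_bigram = bool((X_bi[0] == X_bi[1]).all())

print(f"Identical with unigrams only: {same_unigram}")
print(f"Identical with unigrams and bigrams: {same_bigram}")
```

With unigrams alone the two reviews are indistinguishable to any model trained on these features. Bigrams separate them, at the cost of a much larger vocabulary.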

# Modeling

In [11]:
from sklearn.naive_bayes import BernoulliNB # Bernoulli because we have binary features
from sklearn.pipeline import Pipeline

pipeline = Pipeline([('preprocessing', preprocessor), 
                     ('model', BernoulliNB())])

In [12]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(imdb_df['review'], imdb_df['label'], test_size=0.20, random_state=27)

In [13]:
pipeline.fit(X_train, y_train)
train_accuracy = pipeline.score(X_train, y_train)
test_accuracy = pipeline.score(X_test, y_test)

print(f'Train accuracy:\t{train_accuracy}')
print(f'Test accuracy:\t{test_accuracy}')

Train accuracy:	0.851875
Test accuracy:	0.8432
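
Once fitted, a pipeline like this accepts raw text directly, because the vectorizer and the classifier are applied in sequence. The sketch below shows this on a tiny made-up training set so that it runs on its own. The reviews, labels, and the name `toy_pipeline` are assumptions for illustration, and the default tokenizer is used instead of `nltk.word_tokenize` to keep the example self-contained.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import Pipeline

# Tiny made-up training set (not the IMDB data)
train_reviews = [
    "great wonderful film",
    "wonderful acting great story",
    "terrible boring film",
    "boring plot terrible acting",
]
train_labels = ["positive", "positive", "negative", "negative"]

# Same structure as the notebook's pipeline: binary bag of words, then Bernoulli naive Bayes
toy_pipeline = Pipeline([
    ("preprocessing", CountVectorizer(binary=True)),
    ("model", BernoulliNB()),
])
toy_pipeline.fit(train_reviews, train_labels)

# Predict on unseen raw text; no manual vectorization is needed
new_reviews = ["great wonderful", "terrible boring"]
predictions = toy_pipeline.predict(new_reviews)
probabilities = toy_pipeline.predict_proba(new_reviews)

classes = toy_pipeline.named_steps["model"].classes_
for review, label, probs in zip(new_reviews, predictions, probabilities):
    print(f"{review!r} -> {label} ({dict(zip(classes, probs.round(3)))})")
```

The same `predict` and `predict_proba` calls should work on the `pipeline` object fitted above, using any list of review strings.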
