# Sentiment Prediction


### Preprocessing


- The dataset contains 50,000 reviews, each labeled with its corresponding sentiment: positive or negative.


- Our goal is to build a model that predicts the sentiment of future reviews.


- This notebook walks through preprocessing the dataset, specifically the **review** column.


- We'll use **spaCy** and **regular expressions** to clean the review text.


- Install the requirements from the **requirements.txt** file in the root folder first, then install the spaCy dependencies with the following commands.

    ```
    pip install spacy
    
    pip install -U spacy-lookups-data
    
    python -m spacy download en_core_web_sm
    ```

In [1]:
from warnings import filterwarnings
filterwarnings('ignore')

import re
import string
import pandas as pd
import spacy

# tqdm._tqdm_notebook is a deprecated private module; tqdm.notebook is the public one
from tqdm.notebook import tqdm_notebook

tqdm_notebook.pandas()
nlp = spacy.load('en_core_web_sm')

Let's first load the dataset into a pandas DataFrame and look at 5 randomly sampled rows.

In [2]:
df = pd.read_csv('../data/Dataset.csv')
df.sample(5)

Unnamed: 0,review,sentiment
6732,"This show is so full of action, and everything...",positive
49629,It took a long time until I could find the tit...,positive
49013,"""Jason Priestly stars as 'Breakfast', a psycho...",negative
21810,This film plays really well with an audience. ...,positive
11994,Keira Knightley and Sienna Miller stars in the...,positive


- The sampled reviews suggest that little cleaning is required, but inspecting the dataset in Excel shows that the reviews contain **<br />** tags, which need to be removed.


- We'll also remove all punctuation and any words that contain digits, as they contribute little to the models.


- Then we'll lemmatize the text so that the inflected forms of a word are reduced to a single base form.


- Finally, let's look at the number of data points with positive and negative labels, because we want a balanced dataset for classification.

In [3]:
df['sentiment'].value_counts()

positive    25000
negative    25000
Name: sentiment, dtype: int64

The dataset is already balanced, so nothing needs to be done here.
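
For a quick numeric check, relative frequencies can be easier to compare than raw counts. The following is a minimal sketch on a toy frame; the toy data and the 5% tolerance are assumptions made for illustration, not values from the notebook.

```python
import pandas as pd

# Toy stand-in for the real dataset (hypothetical data)
toy = pd.DataFrame({'sentiment': ['positive', 'negative', 'positive', 'negative']})

# Share of each label instead of the raw count
shares = toy['sentiment'].value_counts(normalize=True)

# Treat the classes as balanced when their shares differ by at most `tolerance`
tolerance = 0.05  # assumed threshold
is_balanced = bool(shares.max() - shares.min() <= tolerance)

print(shares.to_dict(), is_balanced)
```

On the real data the same check would run on `df['sentiment']`.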

In [4]:
def clean_text(text: str) -> str:
    """Strip HTML tags, punctuation, and digit-bearing words from a review."""
    text = re.sub(r'<[^>]*>', ' ', text)  # replace HTML tags such as <br /> with a space
    text = re.sub(r'[%s]' % re.escape(string.punctuation), '', text)  # remove punctuation
    text = re.sub(r'\w*\d\w*', '', text)  # remove words containing digits
    return text

def preprocess_text(text: str) -> str:
    """Lemmatize a review with spaCy and rejoin the lemmas with single spaces."""
    doc = nlp(text)
    return ' '.join(token.lemma_ for token in doc)
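
To see what the three substitutions do, here is a small sketch that applies the same regular expressions to a made-up review. The sample string and the final whitespace-collapsing step are assumptions added for illustration; they are not part of the notebook's pipeline.

```python
import re
import string

def clean_text(text: str) -> str:
    text = re.sub(r'<[^>]*>', ' ', text)  # HTML tags -> space
    text = re.sub(r'[%s]' % re.escape(string.punctuation), '', text)  # drop punctuation
    text = re.sub(r'\w*\d\w*', '', text)  # drop words containing digits
    return text

# Hypothetical review containing tags, punctuation, and digit-bearing words
sample = "Great movie!<br /><br />I watched it 3 times in 2020, loved the 2nd act."

cleaned = clean_text(sample)

# Removing tags and words leaves runs of spaces behind; collapsing them
# is an optional extra step, not something the notebook does
normalized = ' '.join(cleaned.split())

print(repr(cleaned))
print(normalized)  # Great movie I watched it times in loved the act
```

Note that punctuation is deleted rather than replaced with a space, so a contraction such as "I'm" becomes "Im" before it reaches the lemmatizer.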

- We'll apply the two functions above to the **review** column to clean and lemmatize the text.


- We'll store the processed text in a new column, which is good practice in case we need the original text later.

In [5]:
df['cleaned_review'] = df['review'].progress_apply(clean_text)


In [6]:
df['cleaned_review'] = df['cleaned_review'].progress_apply(preprocess_text)

df.tail()


Unnamed: 0,review,sentiment,cleaned_review
49995,I thought this movie did a down right good job...,positive,-PRON- think this movie do a down right good j...
49996,"Bad plot, bad dialogue, bad acting, idiotic di...",negative,bad plot bad dialogue bad act idiotic direct t...
49997,I am a Catholic taught in parochial elementary...,negative,-PRON- be a Catholic teach in parochial elemen...
49998,I'm going to have to disagree with the previou...,negative,-PRON- be go to have to disagree with the prev...
49999,No one expects the Star Trek movies to be high...,negative,no one expect the Star Trek movie to be high a...

The lemmatizer in spaCy 2.x replaces every pronoun with the placeholder **-PRON-**, which carries no sentiment information, so we'll remove it from the cleaned text.


In [7]:
df['cleaned_review'] = df['cleaned_review'].str.replace('-PRON-', '')
df.tail()

Unnamed: 0,review,sentiment,cleaned_review
49995,I thought this movie did a down right good job...,positive,think this movie do a down right good job be...
49996,"Bad plot, bad dialogue, bad acting, idiotic di...",negative,bad plot bad dialogue bad act idiotic direct t...
49997,I am a Catholic taught in parochial elementary...,negative,be a Catholic teach in parochial elementary s...
49998,I'm going to have to disagree with the previou...,negative,be go to have to disagree with the previous c...
49999,No one expects the Star Trek movies to be high...,negative,no one expect the Star Trek movie to be high a...
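
Deleting the placeholder can leave a leading space or a double space where the pronoun used to be. A minimal sketch of one way to tidy that up on a single string follows; the helper name `drop_pron` and the sample sentences are hypothetical.

```python
import re

def drop_pron(text: str) -> str:
    """Remove the -PRON- placeholder and collapse the whitespace left behind."""
    without_placeholder = text.replace('-PRON-', ' ')
    return re.sub(r'\s+', ' ', without_placeholder).strip()

print(drop_pron('-PRON- think this movie do a down right good job'))
print(drop_pron('-PRON- be go to disagree with -PRON- friend'))
```

With pandas, a function like this could be passed to `Series.apply` on the `cleaned_review` column.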


Now that the reviews are cleaned, we'll store the final result in a new DataFrame and save it as a new CSV file.

In [16]:
final_df = df.drop(columns=['review'])

In [17]:
final_df.tail()

Unnamed: 0,sentiment,cleaned_review
49995,positive,think this movie do a down right good job be...
49996,negative,bad plot bad dialogue bad act idiotic direct t...
49997,negative,be a Catholic teach in parochial elementary s...
49998,negative,be go to have to disagree with the previous c...
49999,negative,no one expect the Star Trek movie to be high a...


In [18]:
final_df.to_csv('../data/final_data.csv', index=False)
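
One thing that may be worth checking before modeling: a review made up entirely of punctuation and digits is cleaned down to an empty string, and pandas reads empty CSV fields back as missing values by default. The sketch below illustrates this with a toy frame and an in-memory buffer; the toy data is hypothetical, and dropping the affected rows is only one possible way to handle them.

```python
import io

import pandas as pd

# Toy frame: the second review was cleaned down to an empty string (hypothetical data)
toy = pd.DataFrame({
    'sentiment': ['positive', 'negative'],
    'cleaned_review': ['good movie', ''],
})

# Round-trip through CSV using an in-memory buffer instead of a file on disk
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)

# The empty string comes back as NaN
n_missing = int(restored['cleaned_review'].isna().sum())
print(n_missing)

# One option: drop rows that no longer contain any text
restored = restored.dropna(subset=['cleaned_review'])
```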

Preprocessing is done; next we'll move on to modeling.