## Sentiment Analysis Part 1: Data Preprocessing

Dataset from Kaggle: https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews

In [1]:
import pandas as pd

In [2]:
data = pd.read_csv("data/IMDB_Dataset.csv") 

In [3]:
data.head()

                                               review sentiment
0  One of the other reviewers has mentioned that ...  positive
1  A wonderful little production. <br /><br />The...  positive
2  I thought this was a wonderful way to spend ti...  positive
3  Basically there's a family where a little boy ...  negative
4  Petter Mattei's "Love in the Time of Money" is...  positive


In [4]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   review     50000 non-null  object
 1   sentiment  50000 non-null  object
dtypes: object(2)
memory usage: 781.4+ KB


In [5]:
data['sentiment'].value_counts()

sentiment
positive    25000
negative    25000
Name: count, dtype: int64

In [6]:
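# Named sentiment_map so the built-in dict is not shadowed; labels are stored as the strings '1' and '0'
sentiment_map = {'positive': '1', 'negative': '0'}
data = data.replace({"sentiment": sentiment_map})
print(data)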

                                                  review sentiment
0      One of the other reviewers has mentioned that ...         1
1      A wonderful little production. <br /><br />The...         1
2      I thought this was a wonderful way to spend ti...         1
3      Basically there's a family where a little boy ...         0
4      Petter Mattei's "Love in the Time of Money" is...         1
...                                                  ...       ...
49995  I thought this movie did a down right good job...         1
49996  Bad plot, bad dialogue, bad acting, idiotic di...         0
49997  I am a Catholic taught in parochial elementary...         0
49998  I'm going to have to disagree with the previou...         0
49999  No one expects the Star Trek movies to be high...         0

[50000 rows x 2 columns]
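
The cell above stores the labels as the strings `'1'` and `'0'`. If integer labels are preferred downstream, one option is `Series.map` followed by a check that every label was covered. This is only a minimal sketch on a toy Series; the names `labels`, `label_map`, and `encoded` are hypothetical and not part of the notebook.

```python
import pandas as pd

# Toy stand-in for data['sentiment'] (hypothetical values)
labels = pd.Series(["positive", "negative", "positive"], name="sentiment")

# Map text labels to integers; labels missing from the mapping become NaN
label_map = {"positive": 1, "negative": 0}
encoded = labels.map(label_map)

# Fail early if any label was not covered by the mapping
assert encoded.notna().all(), "found a sentiment label missing from label_map"

encoded = encoded.astype("int64")
print(encoded)
```

Unlike `replace`, `map` turns unknown labels into `NaN` instead of leaving them unchanged, which makes a misspelled or unexpected label easier to spot.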


In [7]:
def data_preprocessing(text):
    """Clean a pandas Series of raw review strings."""
    text = text.str.lower()  # convert the text to lowercase
    text = text.replace(r'<[^<]+?>', ' ', regex=True)  # remove HTML tags such as <br />
    text = text.replace(r'\d+', ' ', regex=True)  # remove numbers
    text = text.replace(r'[^a-zA-Z\'\s]', ' ', regex=True)  # keep only letters, apostrophes, and whitespace
    text = text.replace(r'\s\s+', ' ', regex=True)  # collapse runs of whitespace into a single space
    return text

data['review'] = data_preprocessing(data['review'])

data.insert(loc=0, column='id', value=range(len(data)))  # sequential id derived from the row count instead of a hard-coded 50000
data.to_csv('data/clean_dataset_id.csv', index=False)
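
To see what each cleaning step does, the same regex sequence can be run on a couple of short strings. This is a minimal sketch: the function body mirrors `data_preprocessing` from this cell, while the `sample` Series and its contents are made up for illustration.

```python
import pandas as pd

def data_preprocessing(text):
    """Same steps as the notebook cell, repeated here so the sketch is self-contained."""
    text = text.str.lower()                                  # lowercase
    text = text.replace(r'<[^<]+?>', ' ', regex=True)        # strip HTML tags
    text = text.replace(r'\d+', ' ', regex=True)             # strip digits
    text = text.replace(r'[^a-zA-Z\'\s]', ' ', regex=True)   # keep letters, apostrophes, whitespace
    text = text.replace(r'\s\s+', ' ', regex=True)           # collapse repeated whitespace
    return text

# Hypothetical sample reviews
sample = pd.Series([
    "A wonderful little production. <br /><br />The filming",
    "It's 10/10!!  Loved it.",
])

cleaned = data_preprocessing(sample)
for before, after in zip(sample, cleaned):
    print(repr(before), "->", repr(after))
```

Because punctuation is replaced by a space rather than deleted, a review that ends in punctuation keeps a single trailing space after cleaning. If that matters downstream, appending `.str.strip()` as a last step would be one way to remove it.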

In [8]:
test_data = pd.read_csv("data/testData.tsv", sep='\t') 

In [9]:
test_data['review'] = data_preprocessing(test_data['review'])

In [10]:
test_data.to_csv('data/clean_testdata.csv', index=False)

In [11]:
# Combine the train and test data so that features extracted later are built from both sets and share the same columns
output = pd.concat([data, test_data])

In [12]:
output.to_csv("data/train_test.csv", index=False)
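
When the two frames do not have identical columns, `pd.concat` keeps the union of the columns and fills the gaps with `NaN`. The sketch below illustrates this on toy frames, assuming (this is not confirmed by the notebook) that the test file has no `sentiment` column; the frame contents and the names `train`, `test`, and `combined` are hypothetical.

```python
import pandas as pd

# Toy stand-ins for the cleaned train and test frames (hypothetical contents)
train = pd.DataFrame({
    "id": [0, 1],
    "review": ["good film", "bad film"],
    "sentiment": ["1", "0"],
})
test = pd.DataFrame({
    "id": ["t1"],              # assumption: test ids may use a different format
    "review": ["unseen film"],
})

# ignore_index=True renumbers the rows 0..n-1 instead of repeating each frame's own index
combined = pd.concat([train, test], ignore_index=True)

print(combined)
print("rows without a label:", combined["sentiment"].isna().sum())
```

Rows with a missing `sentiment` can later be separated from the labelled rows with `combined["sentiment"].isna()`.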