***Imports***

In [1]:
from datasets import load_dataset
import pandas as pd
from nltk.corpus import stopwords

***Loading Dataset***

In [2]:
ds = load_dataset('imdb')

***Creating DataFrames***

In [13]:
train_set = pd.DataFrame(ds['train'])
test_set = pd.DataFrame(ds['test'])

In [4]:
train_set.head()

Unnamed: 0,text,label
0,I rented I AM CURIOUS-YELLOW from my video sto...,0
1,"""I Am Curious: Yellow"" is a risible and preten...",0
2,If only to avoid making this type of film in t...,0
3,This film was probably inspired by Godard's Ma...,0
4,"Oh, brother...after hearing about this ridicul...",0


***Exploring Dataset***

In [5]:
# shape of data
print(train_set.shape)
print(test_set.shape)

(25000, 2)
(25000, 2)


In [6]:
train_set.dtypes

text     object
label     int64
dtype: object

In [8]:
train_set['label'].value_counts()

label
0    12500
1    12500
Name: count, dtype: int64

**Insights:** The training set and the test set each contain `25000` rows. There are only 2 columns: `label`, which is `0` for a negative review and `1` for a positive one, and `text`, which holds the review as one long string. The training set is balanced, with `12500` rows for each label.

***Checking for Nulls***

In [101]:
print(train_set.isnull().sum())
print(test_set.isnull().sum())

text     0
label    0
dtype: int64
text     0
label    0
dtype: int64


**Insights:** Neither set contains null values, so we can continue with preprocessing and tokenization.

***Removing Punctuation***

In [102]:
def remove_punctuation(text): # Lowercases text and keeps only the letters a-z and spaces
    text = text.lower()
    alphabet = 'abcdefghijklmnopqrstuvwxyz '
    # note: digits and any other non-letter characters are dropped along with punctuation
    text = "".join([char for char in text if char in alphabet])
    return text

# applying function to data
train_set['text_processed'] = train_set['text'].apply(remove_punctuation)
test_set['text_processed'] = test_set['text'].apply(remove_punctuation)
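To make the effect of this filter concrete, the sketch below runs the same logic on a made-up sentence. The sample string and the `strip_to_letters_spaced` variant are assumptions added for illustration and are not part of the notebook. It shows that digits and HTML tag remnants are affected too, and that words joined only by punctuation merge into one token, which is how `brotherafter` ends up in the processed output.

```python
import re

ALPHABET = 'abcdefghijklmnopqrstuvwxyz '

def remove_punctuation(text):
    """Same logic as the notebook cell: lowercase, keep only a-z and spaces."""
    text = text.lower()
    return "".join(char for char in text if char in ALPHABET)

def strip_to_letters_spaced(text):
    """Hypothetical variant: replace each run of non-letters with one space."""
    return re.sub(r'[^a-z]+', ' ', text.lower()).strip()

# Made-up review fragment, not taken from the dataset
sample = "Oh, brother...after 2 hours!<br />It's bad."

print(remove_punctuation(sample))       # words around "..." and "!<" are glued together
print(strip_to_letters_spaced(sample))  # words stay separate, but "it's" splits into "it" and "s"
```

Neither behavior is strictly better: deleting characters merges neighbors, while replacing them with spaces breaks contractions apart. The notebook uses the first approach.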

***Tokenization***

In [104]:
def tokenize(text): # splits the text into a list of word tokens
    tokens = text.split() # splitting on whitespace also drops the empty strings left by repeated spaces
    return tokens

# applying function to data
train_set['text_processed'] = train_set['text_processed'].apply(tokenize)
test_set['text_processed'] = test_set['text_processed'].apply(tokenize)
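The empty-token filtering matters because removing characters can leave two spaces in a row. A small sketch on a made-up string (an assumption for illustration) shows the difference between splitting on a single space and splitting on any whitespace:

```python
def tokenize(text):
    """Split on runs of whitespace, so no empty tokens are produced."""
    return text.split()

# Made-up string with a double space, as left behind when a digit is removed
text = "oh brotherafter  hours"

with_empty = text.split(" ")   # splitting on one space keeps an empty string
without_empty = tokenize(text)

print(with_empty)
print(without_empty)
```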

In [105]:
# exploring new tokenized data
token_counts = train_set['text_processed'].apply(len) # number of tokens per review
print(token_counts.min())
print(token_counts.max())
print(token_counts.mean())

10
2460
231.74028


**Insights:** After tokenization, the longest review has `2460` words and the shortest has `10`. The mean is `231.7403`, far closer to the minimum than to the maximum, which suggests a small number of outlier reviews that are much longer than the rest. We will keep these outliers so the model also learns to handle long texts.
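One way to check the outlier suspicion is to compare the mean with the median and quartiles, since a few very long reviews pull the mean up but barely move the median. The sketch below uses only the standard library and a made-up set of token lists; the `length_summary` helper and the toy lengths are assumptions for illustration, not results from the dataset.

```python
import statistics

def length_summary(token_lists):
    """Summarize how many tokens each document has."""
    lengths = [len(tokens) for tokens in token_lists]
    q1, median, q3 = statistics.quantiles(lengths, n=4)
    return {
        "min": min(lengths),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": max(lengths),
        "mean": statistics.mean(lengths),
    }

# Toy data: four short documents and one very long one
toy_docs = [["w"] * n for n in (10, 20, 30, 40, 400)]
summary = length_summary(toy_docs)
print(summary)
```

On the real data, a comparable summary could be obtained with something like `train_set['text_processed'].apply(len).describe()`.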

***Stop Word Removal***

In [108]:
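# requires the NLTK stopwords corpus to be downloaded, e.g. with nltk.download('stopwords')
# stored under a new name so the imported `stopwords` module is not overwritten when the cell is re-run
stop_word_set = set(stopwords.words('english')) # a set makes the membership test fast

def stop_words(text): # removes common words 
    text = [word for word in text if word not in stop_word_set]
    return text

# applying function to data
train_set['text_processed'] = train_set['text_processed'].apply(stop_words)
test_set['text_processed'] = test_set['text_processed'].apply(stop_words)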

In [110]:
# exploring new data after stop word removal
token_counts = train_set['text_processed'].apply(len) # number of tokens per review
print(token_counts.min())
print(token_counts.max())
print(token_counts.mean())

4
1440
122.9484


**Insights:** After removing the stop words, the maximum decreased to `1440` words and the minimum to `4` words. The mean decreased to `122.9484` words, roughly half of the `231.7403` measured before. The range of review lengths narrowed from `2450` to `1436` words.
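The order of the steps has a side effect worth keeping in mind: punctuation is stripped before stop words are filtered, so a contraction such as `don't` becomes `dont` and no longer matches a stop-word entry that contains an apostrophe. The sketch below illustrates this with a tiny hand-written stop-word set standing in for the NLTK list; the set, the sample sentence, and both helper functions are assumptions for illustration only.

```python
LETTERS_AND_SPACE = 'abcdefghijklmnopqrstuvwxyz '

# Hypothetical stand-in for a real stop-word list
STOP_WORDS = {"i", "the", "this", "is", "a", "don't"}

def strip_then_filter(text):
    """Notebook order: remove non-letters first, then drop stop words."""
    cleaned = "".join(c for c in text.lower() if c in LETTERS_AND_SPACE)
    return [w for w in cleaned.split() if w not in STOP_WORDS]

def filter_then_strip(text):
    """Alternative order: drop stop words first, then remove non-letters."""
    kept = [w for w in text.lower().split() if w.strip(".,!?") not in STOP_WORDS]
    cleaned = ["".join(c for c in w if c in LETTERS_AND_SPACE) for w in kept]
    return [w for w in cleaned if w]

sample = "I don't like this film."  # made-up sentence

print(strip_then_filter(sample))  # "dont" survives the filter
print(filter_then_strip(sample))  # "don't" is removed as a stop word
```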

***Finished Preprocessed Data***

In [111]:
train_set.head()

Unnamed: 0,text,label,text_processed
0,I rented I AM CURIOUS-YELLOW from my video sto...,0,"[rented, curiousyellow, video, store, controve..."
1,"""I Am Curious: Yellow"" is a risible and preten...",0,"[curious, yellow, risible, pretentious, steami..."
2,If only to avoid making this type of film in t...,0,"[avoid, making, type, film, future, film, inte..."
3,This film was probably inspired by Godard's Ma...,0,"[film, probably, inspired, godards, masculin, ..."
4,"Oh, brother...after hearing about this ridicul...",0,"[oh, brotherafter, hearing, ridiculous, film, ..."


***Saving Data as CSVs***

In [112]:
train_set.to_csv('../data/train_set.csv', index=False)
test_set.to_csv('../data/test_set.csv', index=False)

Due to GitHub issues, I could not save the data into the repository folders as usual. However, the code is here and is easy to run to recreate the data.
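One thing to watch when reloading these files: CSV has no list type, so each `text_processed` cell is written as the string form of the Python list and comes back as a plain string. The sketch below shows the round trip with the standard-library `csv` module standing in for pandas and a single made-up row (both are assumptions for illustration), using `ast.literal_eval` to turn the string back into a list.

```python
import ast
import csv
import io

# Toy row, not taken from the dataset
rows = [{"label": 0, "text_processed": ["rented", "video", "store"]}]

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["label", "text_processed"])
writer.writeheader()
writer.writerows(rows)

# Read it back: every cell is now a string
buffer.seek(0)
loaded = list(csv.DictReader(buffer))
raw = loaded[0]["text_processed"]

# Safely parse the string form of the list back into a list of tokens
tokens = ast.literal_eval(raw)

print(type(raw).__name__, raw)
print(type(tokens).__name__, tokens)
```

With pandas, the same conversion could be applied at load time, for example with `pd.read_csv(path, converters={'text_processed': ast.literal_eval})`.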