This notebook preprocesses the IMDB CSV: it cleans and tokenizes the review text, then saves the result as PyTorch tensor files (`.pt`).
The CSV can be downloaded from [https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews](https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews)

In [1]:
import pandas as pd
import re
import nltk
from bs4 import BeautifulSoup
from nltk.corpus import stopwords
from transformers import BertTokenizer
from sklearn.model_selection import train_test_split
import torch


# Download stopwords if not already available
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

[nltk_data] Downloading package stopwords to
[nltk_data]     /users/eleves-b/2021/remi.grzeczkowicz/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Step 1: Load the CSV file, map the sentiment labels to integers (positive = 1, negative = 0), and show the first 5 rows.

In [2]:
data = pd.read_csv('IMDB_Dataset.csv')
data['sentiment'] = data['sentiment'].map({'positive': 1, 'negative': 0})

# Show the first 5 rows of the data
print(data.head())

                                              review  sentiment
0  One of the other reviewers has mentioned that ...          1
1  A wonderful little production. <br /><br />The...          1
2  I thought this was a wonderful way to spend ti...          1
3  Basically there's a family where a little boy ...          0
4  Petter Mattei's "Love in the Time of Money" is...          1


Step 2: Clean the text: strip HTML tags, remove punctuation and numbers, lowercase the text, and remove stop words.

In [3]:
def clean_text(text):
    # Remove HTML tags
    text = BeautifulSoup(text, "html.parser").get_text()
    
    # Remove punctuation and numbers
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    
    # Convert to lowercase
    text = text.lower()
    
    # Remove stopwords
    text = ' '.join([word for word in text.split() if word not in stop_words])
    
    return text

# Apply function to the dataframe
data['cleaned_review'] = data['review'].apply(clean_text)

print(data[['review', 'cleaned_review']].head())

                                              review  \
0  One of the other reviewers has mentioned that ...   
1  A wonderful little production. <br /><br />The...   
2  I thought this was a wonderful way to spend ti...   
3  Basically there's a family where a little boy ...   
4  Petter Mattei's "Love in the Time of Money" is...   

                                      cleaned_review  
0  one reviewers mentioned watching oz episode yo...  
1  wonderful little production filming technique ...  
2  thought wonderful way spend time hot summer we...  
3  basically theres family little boy jake thinks...  
4  petter matteis love time money visually stunni...  
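Row 3 of the output above contains `theres`: the apostrophe is stripped before the stop-word filter runs, so contractions no longer match their stop-word entries. The sketch below illustrates that ordering effect with the standard library only. The tiny `STOP_WORDS` set and both function names are hypothetical stand-ins, not the NLTK list or the notebook's `clean_text`.

```python
import re

# Hypothetical, tiny stop-word set for illustration; the notebook uses NLTK's English list.
STOP_WORDS = {"a", "the", "i", "don't", "you're"}

def strip_then_filter(text):
    """Same order as clean_text: drop non-letters, lowercase, then filter stop words."""
    text = re.sub(r"[^a-zA-Z\s]", "", text).lower()
    return [word for word in text.split() if word not in STOP_WORDS]

def filter_then_strip(text):
    """Alternative order: filter stop words first, then drop non-letters."""
    words = [word for word in text.lower().split() if word not in STOP_WORDS]
    return re.sub(r"[^a-zA-Z\s]", "", " ".join(words)).split()

sample = "I don't think you're right."
print(strip_then_filter(sample))  # ['dont', 'think', 'youre', 'right']
print(filter_then_strip(sample))  # ['think', 'right']
```

Whether the leftover fragments matter depends on the downstream model; the point is only that the two orders give different token lists.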


Step 3: Tokenize the text using the BERT tokenizer.

In [4]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

data['bert_tokens'] = data['cleaned_review'].apply(lambda x: tokenizer.tokenize(x))
data['bert_token_ids'] = data['cleaned_review'].apply(lambda x: tokenizer.encode(x, add_special_tokens=True))

print(data[['review', 'cleaned_review', 'bert_tokens', 'bert_token_ids']].head())

Token indices sequence length is longer than the specified maximum sequence length for this model (589 > 512). Running this sequence through the model will result in indexing errors


                                              review  \
0  One of the other reviewers has mentioned that ...   
1  A wonderful little production. <br /><br />The...   
2  I thought this was a wonderful way to spend ti...   
3  Basically there's a family where a little boy ...   
4  Petter Mattei's "Love in the Time of Money" is...   

                                      cleaned_review  \
0  one reviewers mentioned watching oz episode yo...   
1  wonderful little production filming technique ...   
2  thought wonderful way spend time hot summer we...   
3  basically theres family little boy jake thinks...   
4  petter matteis love time money visually stunni...   

                                         bert_tokens  \
0  [one, reviewers, mentioned, watching, oz, epis...   
1  [wonderful, little, production, filming, techn...   
2  [thought, wonderful, way, spend, time, hot, su...   
3  [basically, there, ##s, family, little, boy, j...   
4  [pet, ##ter, matt, ##eis, love, time, money
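The tokenizer warning above reports a 589-token sequence against the model's 512-token limit, so some reviews would need truncating before they can be fed to BERT. Below is a minimal pure-Python sketch of one way to cap and pad an ID list. The helper `truncate_and_pad` is hypothetical and not part of the notebook; the special-token IDs are those of `bert-base-uncased`, and the other IDs are toy values.

```python
# Special-token IDs for bert-base-uncased: [PAD] = 0, [CLS] = 101, [SEP] = 102.
PAD_ID, CLS_ID, SEP_ID = 0, 101, 102

def truncate_and_pad(ids, max_len=512):
    """Hypothetical helper: cap a [CLS] ... [SEP] ID list at max_len, then pad it."""
    if len(ids) > max_len:
        # Keep the first max_len - 1 IDs and re-append [SEP] so the sequence stays well-formed.
        ids = ids[:max_len - 1] + [SEP_ID]
    n_pad = max_len - len(ids)
    # 1 marks a real token, 0 marks padding.
    attention_mask = [1] * len(ids) + [0] * n_pad
    return ids + [PAD_ID] * n_pad, attention_mask

# Toy IDs (not real vocabulary lookups) with a small max_len to keep the output readable.
long_ids = [CLS_ID, 11, 12, 13, 14, 15, SEP_ID]
short_ids = [CLS_ID, 11, SEP_ID]
print(truncate_and_pad(long_ids, max_len=5))   # ([101, 11, 12, 13, 102], [1, 1, 1, 1, 1])
print(truncate_and_pad(short_ids, max_len=5))  # ([101, 11, 102, 0, 0], [1, 1, 1, 0, 0])
```

The attention mask lets a model ignore the padded positions. Hugging Face tokenizers can also truncate and pad themselves when called with `truncation=True`, `max_length=512`, and `padding='max_length'`, which may be simpler than a hand-written helper.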

Step 4: Pad the sequences to a common length, split the data into train, validation, and test sets, and save them.

In [11]:
# Pad every sequence to the length of the longest one
max_len = max(data['bert_token_ids'].apply(len))
# Extend each sequence to max_len with the padding token (ID 0 in bert-base-uncased)
bert_token_list = data['bert_token_ids'].tolist()
for i in range(len(bert_token_list)):
    bert_token_list[i] = bert_token_list[i] + [0]*(max_len-len(bert_token_list[i]))

In [12]:
# Convert to tensors
input_ids = torch.tensor(bert_token_list)
labels = torch.tensor(data['sentiment'].tolist())

# Split dataset into Train (80%), Validation (10%), Test (10%)
train_inputs, temp_inputs, train_labels, temp_labels = train_test_split(input_ids, labels, test_size=0.2, random_state=42)
val_inputs, test_inputs, val_labels, test_labels = train_test_split(temp_inputs, temp_labels, test_size=0.5, random_state=42)

# Save tensors for later use
torch.save(train_inputs, 'train_inputs.pt')
torch.save(train_labels, 'train_labels.pt')
torch.save(val_inputs, 'val_inputs.pt')
torch.save(val_labels, 'val_labels.pt')
torch.save(test_inputs, 'test_inputs.pt')
torch.save(test_labels, 'test_labels.pt')

print("Data successfully processed and saved for model training!")

Data successfully processed and saved for model training!
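The split above is done in two stages: 20% is held out first, then that held-out part is halved into validation and test sets, giving 80/10/10 overall. The sketch below checks that arithmetic with the standard library. It uses `random` shuffling rather than scikit-learn's `train_test_split`, so it does not reproduce the same partition. The function name `two_stage_split` is hypothetical, and the 50,000-row size is an assumption taken from the dataset's title.

```python
import random

def two_stage_split(n_samples, holdout_fraction=0.2, seed=42):
    """Hypothetical helper: shuffle indices, hold out a fraction, then halve the holdout."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_holdout = round(n_samples * holdout_fraction)
    n_train = n_samples - n_holdout
    train = indices[:n_train]
    holdout = indices[n_train:]
    n_val = len(holdout) // 2
    return train, holdout[:n_val], holdout[n_val:]

# Assumed dataset size: 50,000 reviews.
train_idx, val_idx, test_idx = two_stage_split(50_000)
print(len(train_idx), len(val_idx), len(test_idx))  # 40000 5000 5000
```

Because every index lands in exactly one of the three lists, no review is shared between the training, validation, and test sets.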


Step 5: Load the saved tensors.

In [13]:
train_inputs = torch.load('train_inputs.pt')
train_labels = torch.load('train_labels.pt')
val_inputs = torch.load('val_inputs.pt')
val_labels = torch.load('val_labels.pt')
test_inputs = torch.load('test_inputs.pt')
test_labels = torch.load('test_labels.pt')



In [14]:
# Print a sample of the data
print(train_inputs[0])
print(train_labels[0])

tensor([ 101, 2008, 2015,  ...,    0,    0,    0])
tensor(0)
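Since the saved files hold plain tensors, they can be loaded with `weights_only=True`, which restricts unpickling to tensors and other safe types. This is a minimal round-trip sketch and assumes a PyTorch version that supports that argument. The small tensors and temporary file names are stand-ins for the notebook's real `.pt` files.

```python
import os
import tempfile

import torch

# Stand-in tensors; the notebook saves the real padded token IDs and labels.
inputs = torch.tensor([[101, 2001, 102, 0], [101, 2002, 2003, 102]])
labels = torch.tensor([1, 0])

with tempfile.TemporaryDirectory() as tmp_dir:
    inputs_path = os.path.join(tmp_dir, "inputs.pt")
    labels_path = os.path.join(tmp_dir, "labels.pt")
    torch.save(inputs, inputs_path)
    torch.save(labels, labels_path)

    # weights_only=True refuses to unpickle arbitrary Python objects.
    loaded_inputs = torch.load(inputs_path, weights_only=True)
    loaded_labels = torch.load(labels_path, weights_only=True)

# The round trip should preserve values and shapes exactly.
print(torch.equal(inputs, loaded_inputs), loaded_inputs.shape, loaded_labels.tolist())
```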
