## What are we going to do?


1. Download the IMDB review dataset
2. Perform the necessary preprocessing
3. Convert the data into vectors (OHE, word2vec, BOW)
4. Create a model (RNN, LSTM, bidirectional LSTM, stacked LSTM, GRU) using Keras, TensorFlow, or PyTorch
5. Compare the performance: which model is best?

In [3]:
import tarfile
import os

In [4]:
path_tofile = r"/Users/praveensrivas/Documents/NLP_GEN_AI/Projects/Data/aclImdb_v1.tar.gz"
extract_to  = os.path.dirname(path_tofile)


if tarfile.is_tarfile(path_tofile):
  with tarfile.open(path_tofile) as f:
    f.extractall(path=extract_to)  # extract into the folder that holds the archive
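
As a hedged variant of the extraction cell above, the sketch below wraps the same `tarfile` calls in a helper. The name `extract_archive` is hypothetical. The `filter="data"` argument is only passed when the running Python version exposes `tarfile.data_filter`; on older versions the helper falls back to the plain `extractall` call used above.

```python
import os
import tarfile


def extract_archive(archive_path, dest_dir):
    """Extract a tar archive into dest_dir and return dest_dir's sorted entries."""
    if not tarfile.is_tarfile(archive_path):
        raise ValueError(f"{archive_path} is not a tar archive")

    with tarfile.open(archive_path) as tar:
        if hasattr(tarfile, "data_filter"):
            # Newer Python versions: reject absolute paths, '..' components, etc.
            tar.extractall(path=dest_dir, filter="data")
        else:
            # Older Python versions: same call as in the cell above
            tar.extractall(path=dest_dir)

    return sorted(os.listdir(dest_dir))
```

Calling `extract_archive(path_tofile, extract_to)` would mirror the cell above, and the returned listing gives a quick check that the `aclImdb` folder is present.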

In [5]:
import pandas as pd
import numpy as np

pd.set_option('display.max_columns', None)  # or 1000
pd.set_option('display.max_rows', None)  # or 1000
pd.set_option('display.max_colwidth', None)  # or 199

## 1. Reading the data and creating the train/test datasets

In [6]:
def load_data_from_folder(folder_path, label):
  # Read every .txt file in folder_path and pair its text with the given label
  data = []
  for filename in os.listdir(folder_path):
    if filename.endswith('.txt'):
      file_path = os.path.join(folder_path, filename)
      with open(file_path,'r',encoding='utf-8') as file:
        content = file.read()
        data.append((content,label))

  return data

def create_dataset(base_dir):
  # Build train and test DataFrames from the train/test and pos/neg folders under base_dir
  train_data = []
  test_data = []

  # Train paths
  train_pos_path = os.path.join(base_dir,'train','pos')
  train_neg_path = os.path.join(base_dir,'train','neg')

  # Test paths
  test_pos_path = os.path.join(base_dir,'test','pos')
  test_neg_path = os.path.join(base_dir,'test','neg')


  # Load train data
  train_data.extend(load_data_from_folder(train_pos_path,1))
  train_data.extend(load_data_from_folder(train_neg_path,0))

  # Load test data
  test_data.extend(load_data_from_folder(test_pos_path,1))
  test_data.extend(load_data_from_folder(test_neg_path,0))

  # Convert to DataFrames (note: the train column is spelled 'lables', the test column 'label')
  train_df = pd.DataFrame(train_data, columns=['review','lables'])
  test_df = pd.DataFrame(test_data, columns=['review','label'])

  return train_df, test_df
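
The loader above assumes a `base_dir/<split>/<pos|neg>/*.txt` layout. The sketch below illustrates that layout on a tiny stand-in directory tree built in a temporary folder, so it runs without the real dataset. The helper name `load_split`, the file name `0_1.txt`, and the sample texts are made up for illustration; it uses `pathlib` instead of `os.listdir`, which is a stylistic choice rather than the notebook's method.

```python
import tempfile
from pathlib import Path


def load_split(base_dir, split):
    """Return (text, label) pairs for one split; label is 1 for pos and 0 for neg."""
    rows = []
    for label_name, label in (("pos", 1), ("neg", 0)):
        folder = Path(base_dir) / split / label_name
        for path in sorted(folder.glob("*.txt")):
            rows.append((path.read_text(encoding="utf-8"), label))
    return rows


# Build a miniature stand-in for the aclImdb folder structure
with tempfile.TemporaryDirectory() as tmp:
    for split in ("train", "test"):
        for label_name in ("pos", "neg"):
            folder = Path(tmp) / split / label_name
            folder.mkdir(parents=True)
            (folder / "0_1.txt").write_text(f"{split} {label_name} review", encoding="utf-8")

    train_rows = load_split(tmp, "train")
    test_rows = load_split(tmp, "test")

print(train_rows)
print(test_rows)
```

Each list of pairs can be passed to `pd.DataFrame(rows, columns=[...])` in the same way as in `create_dataset`.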

In [None]:
base_dir = '/Users/praveensrivas/Documents/Datasets/NLP_Datasets/Data/aclImdb'

In [8]:
train_df, test_df = create_dataset(base_dir=base_dir)

In [9]:
train_df.sample(2)

Unnamed: 0,review,lables
7965,"From the first scene you are given clues as to what may be going on here. It becomes more and more obvious as the story rolls on. The acting is excellent throughout and these actors touch your soul. Even though I knew what was going to happen I was extremely puzzled by the motive. I'm still puzzled as to why Ben did what he did. We could see in his face ""second thoughts"", but the ultimate sacrifice seemed to go against his emotion and feelings. It was a very interesting and touching story but it left me confused. Maybe that was the point of the film. I did like the film and Wil Smith can wrack up another good film choice. This guy knows how to entertain an audience!",1
15778,"I bought Bloodsuckers on ebay a while ago. I watched parts and deemed it just too dumb to review again. The excessive amount of watery 'blood' at the beginning is just plain obsolete - not to mention the ""whip-around"" wind sounds. My friends and I made a super low budget movie, and the effects still exceeded this crap fest.<br /><br />As for the amount of mistakes in this movie, there are way too many to count. I knew one of the actors - believe it or not, he was my THEATRE teacher. HA! <br /><br />Final verdict: Don't bother with this ""horror"" flick. <br /><br />3 Stars (out of a possible 73)",0


## 2. Text preprocessing
1. Lowercasing  
2. Remove Punctuation  
3. Tokenization  
4. Remove Stopwords  
5. Remove Numbers  
6. Stemming  
7. Lemmatization  
8. Handling Special Characters  
9. Handling Whitespace

In [10]:
# Lower case
train_df['review'] = train_df['review'].apply(lambda x: x.lower())

In [11]:

# Remove punctuation marks, HTML tags, bracketed text, and URLs
import re
import string
train_df['review'] = train_df['review'].apply(lambda x: re.sub(r'<[^>]*?>|\[.*?\]|https?:\/\/\S+|www\.\S+|[^\w\s]', '', x))
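
To make the effect of this pattern concrete, here is a small sketch on a made-up sentence. The sample text and the `spaced` variant are illustrative assumptions, not part of the notebook. Because every match is replaced with an empty string, the words on either side of an HTML tag end up glued together, which is consistent with tokens such as `victimcomic` and `pointall` in the cleaned sample further down. Substituting a space and then collapsing whitespace is one possible alternative.

```python
import re

# Same pattern as in the cell above
PATTERN = re.compile(r'<[^>]*?>|\[.*?\]|https?:\/\/\S+|www\.\S+|[^\w\s]')

sample = "great movie!<br /><br />loved it. see https://example.com [spoiler] now"

# Behaviour of the cell above: matches are deleted outright
joined = PATTERN.sub('', sample)

# Hypothetical alternative: replace matches with a space, then collapse runs of whitespace
spaced = re.sub(r'\s+', ' ', PATTERN.sub(' ', sample)).strip()

print(repr(joined))
print(repr(spaced))
```

In this example `joined` contains `movieloved` as a single token, while `spaced` keeps `movie` and `loved` separate.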


In [12]:
# Tokenization
train_df['review_token'] = train_df['review'].apply(lambda x: x.split())

In [13]:
# Remove stopwords (requires the NLTK 'stopwords' corpus, e.g. nltk.download('stopwords'))
from nltk.corpus import stopwords
stopwords_ = set(stopwords.words('english'))
train_df['non_stopword_review'] = train_df['review_token'].apply(lambda x: [xi.strip() for xi in x if xi.strip() not in stopwords_])
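
One consequence of the step order used here: punctuation is stripped before stopwords are filtered, so contractions lose their apostrophe and no longer match stop-list entries that contain one. That would explain why tokens such as `dont` and `wasnt` survive in the cleaned sample further down. The sketch below demonstrates the effect with a tiny hand-written stop list standing in for NLTK's English list (an assumption, so that the snippet runs without downloading any corpus).

```python
import re

# Tiny hand-written stop list standing in for NLTK's English list (assumption)
stop_words = {"i", "the", "it", "don't", "wasn't"}

text = "i don't think it wasn't the worst"

# Order used in the notebook: strip punctuation first, then filter stopwords
stripped_tokens = re.sub(r"[^\w\s]", "", text).split()
punct_first = [t for t in stripped_tokens if t not in stop_words]

# Alternative order: filter stopwords on the raw tokens, then strip punctuation
raw_tokens = text.split()
stop_first = [re.sub(r"[^\w\s]", "", t) for t in raw_tokens if t not in stop_words]

print(punct_first)
print(stop_first)
```

Another option, if the current order is kept, would be to add apostrophe-free forms of the contractions to the stop list.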

In [14]:
train_df.drop(columns=['review','review_token'], inplace=True)

In [15]:
# Lemmatization (requires the NLTK 'wordnet' corpus)
from nltk.stem import WordNetLemmatizer
def word_lemma(token_list):
  # Lemmatize each token (as a noun, the default) and join the result into one string
  wl = WordNetLemmatizer()
  return " ".join([wl.lemmatize(w) for w in token_list])

In [16]:
train_df['clean_review'] = train_df['non_stopword_review'].apply(word_lemma)

In [17]:
train_df.drop(columns=['non_stopword_review'], inplace=True)
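
The cells above apply each step to `train_df` only. A minimal sketch of how the same steps could be bundled into one function, so that they can be reused on the test reviews, is shown below. The name `clean_review` and its parameters are hypothetical. The stop list and the lemmatizer are passed in as arguments, so the snippet itself has no NLTK dependency, and the default `lemmatize` leaves tokens unchanged.

```python
import re

# Same pattern as in the punctuation/HTML/URL removal cell above
CLEAN_PATTERN = re.compile(r'<[^>]*?>|\[.*?\]|https?:\/\/\S+|www\.\S+|[^\w\s]')


def clean_review(text, stop_words, lemmatize=lambda w: w):
    """Lowercase, strip markup and punctuation, drop stopwords, lemmatize, and rejoin."""
    text = text.lower()
    text = CLEAN_PATTERN.sub('', text)
    tokens = [t for t in text.split() if t not in stop_words]
    return " ".join(lemmatize(t) for t in tokens)


# Toy usage with a two-word stop list and no lemmatizer
example = clean_review("The movie was GREAT!<br />", {"the", "was"})
print(example)
```

With the objects defined earlier, something like `test_df['review'].apply(lambda x: clean_review(x, stopwords_, wl.lemmatize))`, where `wl = WordNetLemmatizer()`, would produce a `clean_review` column for the test set in the same way.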

In [18]:
train_df.sample(2)

Unnamed: 0,lables,clean_review
17765,0,movie female rape victimcomic book writer new york decides get away awful big city glamor move dirty run small town find refuge singlewide trailer dirt lot middle 12th nowhere townspeople mentally ill yet inviting crazy men trailer annoying fact ability exactly right thing place dangerous circumstance dangerous circumstance db sweeneys performance high school best he one kindacute young actor sweet grin unfortunately career kind mother nature right tow previous commentator stating acting real well agree actually wasnt acting two main character really pathetic weak incapable making mature healthy decision brief movie suck like rent laugh real crime scene atrocious wood paneling trailer enough make commit murder lastly shes artistwriter couldnt afford doublewide trailer something sunyellow chevy chevette love god
15284,0,entirely impressed film originally named sin eater stayed way considering talked last half film im even sure first 20 minute film rest slow picking robocop peter weller one main actor sad pointall would say check thing dealing catholic religion dont expect exorcist stigma film surely flop day word get
