Dataset - https://www.kaggle.com/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews

# Problem 1
Apply all the preprocessing techniques that you think are necessary.

In [1]:
import pandas as pd
import numpy as np

In [2]:
# Read the CSV file into the DataFrame 'full_df'
full_df = pd.read_csv("../archive/IMDB Dataset.csv")

df = full_df[:1000].copy() # Work on the first 1,000 reviews only, to save time

In [3]:
df

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive
...,...,...
995,Nothing is sacred. Just ask Ernie Fosselius. T...,positive
996,I hated it. I hate self-aware pretentious inan...,negative
997,I usually try to be professional and construct...,negative
998,If you like me is going to see this in a film ...,negative


## Lowercasing 

In [4]:
df['review'] = df['review'].str.lower()
df.sample(5)

Unnamed: 0,review,sentiment
615,it starts out like a very serious social comme...,negative
666,this was a fine example of how an interesting ...,positive
176,this movie took me by surprise. the opening cr...,positive
893,a good ol' boy film is almost required to have...,negative
0,one of the other reviewers has mentioned that ...,positive


## Removing HTML tags

In [5]:
import re 
def remove_html_tags(text):
    pattern = re.compile('<.*?>')
    return pattern.sub(r'',text)

In [6]:
df['review'] = df['review'].apply(remove_html_tags)
df.sample(5)

Unnamed: 0,review,sentiment
55,as someone has already mentioned on this board...,negative
147,francis ford coppola wrote and directed this s...,positive
517,is it a perfect movie? no. it is a weird adapt...,positive
422,"first and foremost, i loved the novel by ray b...",negative
838,"very typical almodóvar of the time and, in its...",positive
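
To make the behaviour of the `<.*?>` pattern concrete, here is a minimal sketch on a made-up sample string (the sample and the greedy pattern are illustrative assumptions, not part of the dataset or the notebook). It contrasts the non-greedy pattern used above with a greedy one, and shows that replacing a tag with an empty string can fuse the words on either side of it.

```python
import re

# Non-greedy pattern (as in the cell above): '.*?' stops at the first '>'
NON_GREEDY = re.compile(r'<.*?>')
# Greedy pattern, shown only for contrast: '.*' runs to the last '>'
GREEDY = re.compile(r'<.*>')

# Hypothetical sample text, not taken from the dataset
sample = "a <b>great</b> film.<br /><br />loved it"

non_greedy_result = NON_GREEDY.sub('', sample)
greedy_result = GREEDY.sub('', sample)

print(non_greedy_result)  # a great film.loved it
print(greedy_result)      # a loved it
```

The greedy variant deletes the text between the first and last tag, while the non-greedy one removes only the tags. Note that "film." and "loved" end up adjacent with no space; substituting a single space instead of `''` is one possible way to avoid that.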


## Removing URLs

In [7]:
import re
def remove_url(text):
    pattern = re.compile(r'https?://\S+|www\.\S+')
    return pattern.sub(r'',text)

In [8]:
df['review'] = df['review'].apply(remove_url)
df.sample(5)

Unnamed: 0,review,sentiment
607,"this film is, quite simply, brilliant. the cin...",positive
689,considering this film was released 8 years bef...,positive
287,i saw this movie last night and thought it was...,positive
984,this is a very noir kind of episode. it begins...,positive
5,"probably my all-time favorite movie, a story o...",positive
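
A small sketch of the URL pattern on a made-up sentence (the sample text is an assumption for illustration). It also shows why this step has to run before punctuation removal: once `://` and `.` are stripped, the pattern has nothing left to match.

```python
import re

URL_PATTERN = re.compile(r'https?://\S+|www\.\S+')

# Hypothetical sample text, not taken from the dataset
sample = "full cast at https://www.imdb.com/title/tt0111161/ and trivia on www.example.com today"

cleaned = URL_PATTERN.sub('', sample)
# Collapse the double spaces left behind where a URL was removed
cleaned = ' '.join(cleaned.split())

print(cleaned)  # full cast at and trivia on today
```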


## Removing Punctuation

In [9]:
import string
exclude = string.punctuation
def remove_punc1(text):
    return text.translate(str.maketrans('','',exclude))

df['review'] = df['review'].apply(remove_punc1)
df.sample(5)

Unnamed: 0,review,sentiment
337,i saw this movie on the hallmark channel and t...,positive
265,finally an iranian film that is not made by ma...,positive
616,this is my first deepa mehta film i saw the fi...,positive
71,honestly this short film sucks the dummy used...,negative
874,this is one of my all time favorite cheap corn...,positive
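
Deleting punctuation outright can glue neighbouring words together when a review has no space around the punctuation, which may explain vocabulary entries such as `wellwhether` later in this notebook. The sketch below compares the deletion approach used above with an alternative that maps each punctuation character to a space; the alternative and the sample string are assumptions for illustration, not the notebook's method.

```python
import string

# Variant 1 (as in the cell above): delete punctuation characters outright
delete_table = str.maketrans('', '', string.punctuation)
# Variant 2 (an alternative): map each punctuation character to a space
space_table = str.maketrans(string.punctuation, ' ' * len(string.punctuation))

# Hypothetical sample text
sample = "well...whether it's good-or-not"

deleted = sample.translate(delete_table)
# Collapse the runs of spaces that variant 2 produces
spaced = ' '.join(sample.translate(space_table).split())

print(deleted)  # wellwhether its goodornot
print(spaced)   # well whether it s good or not
```

Neither variant is strictly better: deletion keeps contractions in one piece ("its") but fuses words, while the space variant keeps words apart but splits contractions.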


## Removing Stopwords

In [10]:
from nltk.corpus import stopwords
# Build the stop-word set once; set membership tests are much faster than
# reloading and scanning the list for every word
stop_words_english = set(stopwords.words('english'))
def remove_stopwords(text):
    # Keep only the words that are not stop words
    return " ".join(word for word in text.split() if word not in stop_words_english)

In [11]:
df['review'] = df['review'].apply(remove_stopwords)
df.sample(5)

Unnamed: 0,review,sentiment
296,movie sucks ass something heatwave europe...,negative
971,possible spoilers perhaps must say cinderell...,negative
147,francis ford coppola wrote directed stunning...,positive
718,hard praise film much cgi dragon well d...,negative
433,1929 director walt disney animator ub iwerks...,positive
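
A self-contained sketch of the stop-word filtering step. It uses a hypothetical seven-word stop list so that it runs without the NLTK corpus download; the notebook itself relies on NLTK's English list.

```python
# Hypothetical mini stop-word list (an assumption for illustration only)
STOP_WORDS = {"the", "a", "is", "was", "this", "of", "and"}

def remove_stopwords(text, stop_words=STOP_WORDS):
    # Dropping stop words in a generator expression keeps single spaces
    # between the words that remain
    return " ".join(word for word in text.split() if word not in stop_words)

sample = "this movie was a waste of the budget"
filtered = remove_stopwords(sample)

print(filtered)  # movie waste budget
```

Storing the stop words in a `set` makes each `in` check a constant-time lookup, which matters when the function is applied to every word of every review.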


## Lemmatization 

In [12]:
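import nltk
from nltk.stem import WordNetLemmatizer
# Requires NLTK's tokenizer and WordNet data, e.g. nltk.download('punkt')
# (or 'punkt_tab' on newer NLTK releases) and nltk.download('wordnet')
wordnet_lemmatizer = WordNetLemmatizer()
punctuations = "?:!.,;"

def lemmatize(text):
    sentence_words = nltk.word_tokenize(text)
    # Drop stray punctuation tokens (mostly a no-op, since punctuation was removed earlier)
    filtered_words = [word for word in sentence_words if word not in punctuations]
    # Without a pos argument, every token is lemmatized as a noun
    lemmatized_words = [wordnet_lemmatizer.lemmatize(word) for word in filtered_words]

    return ' '.join(lemmatized_words)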

In [13]:
df['review'] = df['review'].apply(lemmatize)
df.sample(5)

Unnamed: 0,review,sentiment
119,greatly enjoyed margaret atwoods novel robber ...,negative
499,joyous world created u pixars bug life immerse...,positive
371,apparently second remake film filmed 1911 1918...,positive
794,saw film got screwed film foolish boring thoug...,negative
644,rented dvd video store alternative reading rep...,negative


# Problem 2
Find the number of words in the entire corpus and the number of unique words (the vocabulary) using only plain Python.

In [14]:

def get_corpus(series):
    corpus = ' '.join(series)
    corpus = corpus.split()
    total_words = len(corpus)
    corpus = set(corpus)
    unique_words = len(corpus)
    print("total words :",total_words)
    print("unique words :",unique_words)

    return list(corpus)
    

In [15]:
get_corpus(df['review'])

total words : 120127
unique words : 19372


['monkeylike',
 'balloon',
 'held',
 'pedophile',
 'golddigging',
 'portray',
 'snot',
 'ike',
 '1918',
 'dmytryk',
 'magdalene',
 'heal',
 'telling',
 'similarly',
 'warwick',
 'pta',
 'seendont',
 'soap',
 'madge',
 'facade',
 'diplomatically',
 'hunter',
 'clete',
 'performs',
 'oneeye',
 'dawson',
 'brainskyle',
 'lit',
 'croc',
 'progressed',
 'badness',
 'hanky',
 'sideshow',
 'lyric',
 'machine',
 'dicaprio',
 'ebullient',
 'revenue',
 'phenomenon',
 'filmstelevision',
 'resolutionas',
 'bunuel',
 'menacing',
 'reject',
 'owner',
 'toll',
 'reeked',
 'frame',
 'clifton',
 'brussels',
 'turgid',
 'parenthetically',
 'rememberto',
 'keeping',
 'valerie',
 '6000',
 'hoop',
 'strikesthat',
 'catchup',
 'ooze',
 'clause',
 'generic',
 'political',
 'klux',
 'unintelligent',
 'playbut',
 'snobbery',
 'hobgoblin',
 'paddle',
 'kg',
 'blowup',
 'wellwhether',
 'nickname',
 'siegels',
 'overits',
 'stood',
 'channeling',
 'gagging',
 'stopping',
 'belief',
 'additionally',
 'songits',
 ...]
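
As an alternative sketch in plain Python, `collections.Counter` gives the total word count, the vocabulary size, and the per-word frequencies in one pass. The three short reviews below are hypothetical stand-ins for `df['review']`.

```python
from collections import Counter

# Hypothetical stand-in for the preprocessed df['review'] column
reviews = [
    "great film great cast",
    "boring film weak plot",
    "great plot",
]

# Count every whitespace-separated token across all reviews
word_counts = Counter(word for review in reviews for word in review.split())

total_words = sum(word_counts.values())   # corpus size in tokens
unique_words = len(word_counts)           # vocabulary size

print("total words :", total_words)       # total words : 10
print("unique words :", unique_words)     # unique words : 6
print(word_counts.most_common(1))         # [('great', 3)]
```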

# Problem 3

Apply One Hot Encoding

In [16]:
def one_hot_encode_series(series, corpus_words):
    corpus_set = set(corpus_words)
    
    # Create a mapping of words to their respective indices in the corpus_words list
    word_to_index = {word: i for i, word in enumerate(corpus_words)}

    # Initialize the one-hot encoding matrix as an empty NumPy array
    num_rows = len(series)
    num_cols = len(corpus_words)
    one_hot_matrix = np.zeros((num_rows, num_cols), dtype=int)

    # Iterate through the rows in the textual column
    for i, row in enumerate(series):
        # Tokenize each row into words
        row_words = row.split()

        # Create a one-hot encoding vector for each row
        for word in row_words:
            if word in corpus_set:
                index = word_to_index[word]
                one_hot_matrix[i, index] = 1

    return one_hot_matrix


In [17]:
arr = one_hot_encode_series(df['review'],get_corpus(df['review']))

total words : 120127
unique words : 19372


In [18]:
arr = np.array(arr)

In [19]:
pd.DataFrame(arr)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,19362,19363,19364,19365,19366,19367,19368,19369,19370,19371
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
996,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
997,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
998,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
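
A dense matrix of this shape gets large quickly: with `dtype=int`, which is typically a 64-bit integer on Linux and macOS, 1000 × 19372 cells take roughly 155 MB. The sketch below is one possible variation, not the notebook's implementation: it stores the 0/1 flags as `np.uint8` (one byte per cell) and sorts the vocabulary so that the column order is the same on every run, which a list built from a `set` does not guarantee across Python sessions. The three reviews are hypothetical.

```python
import numpy as np

# Hypothetical stand-in for the preprocessed reviews
reviews = ["great film", "boring film", "great plot twist"]

# Sorting the vocabulary makes the column order reproducible between runs
vocabulary = sorted({word for review in reviews for word in review.split()})
word_to_index = {word: i for i, word in enumerate(vocabulary)}

# uint8 stores each 0/1 flag in a single byte
matrix = np.zeros((len(reviews), len(vocabulary)), dtype=np.uint8)
for row, review in enumerate(reviews):
    for word in review.split():
        matrix[row, word_to_index[word]] = 1

print(vocabulary)   # ['boring', 'film', 'great', 'plot', 'twist']
print(matrix)
print(matrix.nbytes, "bytes")  # 15 bytes
```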


# Problem 4

Apply bag of words, find the vocabulary, and find the number of times each word occurs.

In [20]:
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer()
bow = cv.fit_transform(df['review'])

In [21]:
print(cv.vocabulary_)



In [22]:
vocabulary = cv.get_feature_names_out()
word_frequency = bow.sum(axis=0).A1 

word_frequency_df = pd.DataFrame({'Word': vocabulary, 'Frequency': word_frequency})

word_frequency_df.sample(10)

Unnamed: 0,Word,Frequency
15040,scripted,6
9353,joy,9
6690,foremost,5
8321,houswives,1
7351,gory,11
3966,coverups,1
7090,generally,12
13134,portmans,1
17304,thoser,1
10425,magnified,1
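
Problem 2 counted 19372 unique whitespace-separated words, while the vectorizer-based outputs in this notebook have slightly fewer features (19331 columns in the tf-idf matrix). One likely contributor is that `CountVectorizer`'s default `token_pattern` only keeps tokens of two or more word characters. The sketch below applies that same regular expression directly with `re` to a made-up sample; it is an illustration, and other factors may also contribute to the gap.

```python
import re

# Regular expression that CountVectorizer uses by default (token_pattern):
# a token is a run of two or more word characters
DEFAULT_TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

# Hypothetical sample in the style of the preprocessed reviews
sample = "joyous world created u pixars bug life 8 year"

split_tokens = sample.split()
regex_tokens = DEFAULT_TOKEN_PATTERN.findall(sample)

# Tokens that plain str.split() keeps but the default pattern drops
dropped = [tok for tok in split_tokens if tok not in regex_tokens]
print(dropped)  # ['u', '8']
```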


# Problem 5

Apply bag of bi-grams and bag of tri-grams, and write down your observations about the dimensionality of the vocabulary.

### Bi-gram

In [23]:
# Initialize CountVectorizer to extract bigrams only
cv = CountVectorizer(
    ngram_range=(2, 2)  # (min_n, max_n) = (2, 2) keeps only bigrams
)

# Apply CountVectorizer to the 'review' column of the DataFrame 'df'
bigram = cv.fit_transform(df['review'])

# Printing the vocabulary
print(cv.vocabulary_)



In [24]:
print(len(cv.vocabulary_))

101936


### Tri-gram

In [25]:
# Initialize CountVectorizer to extract trigrams only
cv = CountVectorizer(
    ngram_range=(3, 3)  # (min_n, max_n) = (3, 3) keeps only trigrams
)

# Apply CountVectorizer to the 'review' column of the DataFrame 'df'
trigram = cv.fit_transform(df['review'])

# Print the size of the trigram vocabulary
print(len(cv.vocabulary_))

116458


In [26]:
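# Observation: the vocabulary grows from roughly 19,000 unigrams to 101,936 bigrams
# and 116,458 trigrams, which is already very high-dimensional for only 1,000 reviews.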
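
To see where the growth comes from, here is a plain-Python sketch of n-gram extraction on two hypothetical reviews (the `ngrams` helper and the sample texts are assumptions for illustration, not `CountVectorizer` internals). Single words repeat often, but most word pairs and triples occur only once, so the n-gram vocabulary approaches the number of n-gram windows in the corpus.

```python
def ngrams(tokens, n):
    # Slide a window of length n over the token list
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Hypothetical stand-in for the preprocessed reviews
reviews = [
    "great film great cast great plot",
    "great film weak cast weak plot",
]

vocab_sizes = {}
for n in (1, 2, 3):
    vocab = {gram for review in reviews for gram in ngrams(review.split(), n)}
    vocab_sizes[n] = len(vocab)

print(vocab_sizes)  # {1: 5, 2: 9, 3: 8}
```

In this toy corpus every trigram is unique, yet there are fewer trigrams than bigrams because each review yields one window fewer as `n` grows. The same ceiling may explain why the jump from bigrams to trigrams above is much smaller than the jump from unigrams to bigrams.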

# Problem 6

Apply tf-idf, find the idf scores of the words, and find the vocabulary.

In [27]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer()

In [28]:
tfidf_matrix = tfidf.fit_transform(df['review']).toarray()

In [29]:
tfidf_df = pd.DataFrame(tfidf_matrix,columns=tfidf.get_feature_names_out())

In [30]:
tfidf_df.sample(5)

Unnamed: 0,007,02,0510,10,100,1000,10000,100000,10002000,100th,...,zoom,zooming,zp,zu,zucker,zulu,zwick,zzzzzzzzzzzzzzzzzz,élan,ísnt
656,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
526,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
728,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
367,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
110,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [31]:
tfidf_df.shape

(1000, 19331)

In [32]:
tfidf.get_feature_names_out()

array(['007', '02', '0510', ..., 'zzzzzzzzzzzzzzzzzz', 'élan', 'ísnt'],
      dtype=object)