


#Data pre-processing

Data preprocessing techniques that are applied to prepare the data for training a natural language processing model. 
* Clean the data by removing any unnecessary information such as HTML tags. 
* Lemmatization for improved NLP performance. 
* Use pre-trained Glove Vector to convert the text into numerical representations, which are then used as input for the model.

In [31]:
import pandas as pd
import numpy as np
import io
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).





##Load dataset 



---








In [32]:
path = "/content/drive/My Drive/data/imdb-reviews.csv"
df = pd.read_csv(path, sep='\t') 

#Display randomly 10 rows of 2 last columns (rating, review)
df.iloc[:,-2:].sample(n=10)

Unnamed: 0,rating,review
20466,7.0,"algernon4's comment that Ms Paget's ""ultra lew..."
10806,10.0,I saw this movie in sixth grade around Christm...
7194,4.0,"This relatively obscure Hong Kong ""minorpiece""..."
17495,3.0,"The complaints are valid, to me the biggest pr..."
23685,2.0,I cannot believe the number of people referrin...
15862,3.0,This movie was so poorly written and directed ...
6197,2.0,The only interesting part of this movie was it...
39993,7.0,Asterix and the Vikings is the first animated ...
19108,9.0,After gorging myself on a variety of seemingly...
25681,7.0,------ Spoilers----- Spoilers----- Spoilers---...







##Data Cleaning


---



*Example before cleaning:*







In [33]:
df.review[2240]

'This series has recently been unearthed and excerpts can be seen, at least within Britain, via http://www.screenonline.org.uk/tv/id/527213/index.html Presumably there is some hope that the series may eventually become available more widely. The problem is that this series was followed by the series THE WARS OF THE ROSES that had a similarly stellar cast and which has been available to cable TV, or at least crowding the market. <br /><br />The two series are quite different in dramaturgy; THE WARS consolidates the plays through extensive rewriting and shifting of scenes; AN AGE OF KINGS follows Shakespeare more closely. Both series benefit from integral casting.'

*Cleaning the data..*

In [34]:
import re

tag_re = re.compile(r'<[^>]+>')

def remove_tags_and_links(text):
    clean = re.compile('<.*?>|http\S+')
    return re.sub(clean, '', text)
    

def textPreprocess(text):
  
  #remove numbers and Punctuations
  text = re.sub('[^A-Za-z ]', '', text) 
  text = remove_tags_and_links(text)
  #lowercase
  text = text.lower()

  return text

df.review = df.review.apply(textPreprocess)

*Example* after cleaning:

In [35]:
df.review[2240]

'this series has recently been unearthed and excerpts can be seen at least within britain via  presumably there is some hope that the series may eventually become available more widely the problem is that this series was followed by the series the wars of the roses that had a similarly stellar cast and which has been available to cable tv or at least crowding the market br br the two series are quite different in dramaturgy the wars consolidates the plays through extensive rewriting and shifting of scenes an age of kings follows shakespeare more closely both series benefit from integral casting'

##Tokenization and Lemmanization with removed stopwords
>* The process of tokenization and lemmatization is applied to break down the text into individual words and to reduce them to their base or root form. For example, the word "running" would be reduced to its base form "run", which will help to reduce the dimensionality of the data and improve the performance of the NLP model.
* To improve the accuracy of the lemmatization process:
  * POS tagging step added.The pos_tag() function uses the Averaged Perceptron Tagger, which is a machine learning based tagger trained on the Penn Treebank corpus, which is considered as a standard dataset for POS tagging.
  * Additionally, the code is using the WordNetLemmatizer function from NLTK, which uses the WordNet lexical database to determine the lemma of a word. WordNet is a large lexical database of English and it is used to help the lemmatization step to decide the correct base form of a word.

In [36]:
import nltk
nltk.download('omw-1.4')
from nltk.tokenize import WhitespaceTokenizer
from nltk.corpus import wordnet

nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('stopwords')
nltk.download('wordnet')

stop_words = nltk.corpus.stopwords.words('english')
lemmatizer = nltk.stem.WordNetLemmatizer()

def get_wordnet_pos(word):
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}
    return tag_dict.get(tag, wordnet.NOUN)

def tokenization_lemmatization(text):
    # Tokenize the text
    tokens = nltk.word_tokenize(text)
    #Remove stop words
    tokens = [token for token in tokens if token.lower() not in stop_words]
    # POS tagging
    pos_tokens = nltk.pos_tag(tokens)
    # Lemmatize the tokens
    lemmatized_tokens = [lemmatizer.lemmatize(token, get_wordnet_pos(token)) for token, pos in pos_tokens]
    return ' '.join(lemmatized_tokens)

df.review = df.review.apply(tokenization_lemmatization)


*Example after lemmatization:*

In [37]:
df.review[2240]

'series recently unearthed excerpt see least within britain via presumably hope series may eventually become available widely problem series follow series war rose similarly stellar cast available cable tv least crowd market br br two series quite different dramaturgy war consolidates play extensive rewrite shift scene age king follow shakespeare closely series benefit integral cast'

##Vectorization using Glove


---



###Load the GloVe vectors into a dictionary and map words to their represantions

In [38]:
#Each line of the file contains a word, followed by 50 numbers
path = "/content/drive/My Drive/data/glove.6B.50d.txt"
glove_vectors = {}
with open(path, 'r') as f:
  for line in f:
    parts = line.split()
    word = parts[0]
    vector = np.array(parts[1:], dtype=np.float32)
    glove_vectors[word] = vector

###Create the word embeddings using the GloVe vectors

In [40]:
#words_found = 0
def generate_review_embeddings(review, glove_vectors, vector_size):

  tokens = nltk.word_tokenize(review)
  #Set to zero in case of a word not present in glove_vectors
  review_embeddings = np.zeros((vector_size,))
  for token in tokens:
    if token in glove_vectors:
      review_embeddings += glove_vectors[token]
      #words_found +=1
  #Return the average of the review embeddings
  return review_embeddings / len(tokens)

#Generate word embeddings for each review in the dataset
df['embeddings'] = df['review'].apply(lambda x: generate_review_embeddings(x, glove_vectors, vector_size=50))

##Save the data


### Prepare the dataset
>Using the naming convention "text" for input and "label" for output, and remove all other columns to keep only this two columns for use in the NLP Model.

In [41]:
#Convert the list of embeddings to a numpy array
df['text'] = np.array(df['embeddings'])
#Convert the the rating column to positive/negative and save the processing data.
df['label'] = df['rating'].apply(lambda x: 'positive' if x >= 5 else 'negative')
#Drop all other columns
df = df[['text','label']]
df.head()
df['label'].value_counts()

positive    22508
negative    22500
Name: label, dtype: int64

### Saving preprocessed data

In [42]:
import pickle
df['label'] = df['label'].apply(lambda x: "positive" if x == 1 else "negative")
with open('/content/drive/My Drive/data/pre_data.pkl', 'wb') as f:
  pickle.dump(df, f)
f.close

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['label'] = df['label'].apply(lambda x: "positive" if x == 1 else "negative")


<function BufferedWriter.close>

In [43]:
df

Unnamed: 0,text,label
0,"[0.025679750523219507, 0.3064883647797008, -0....",negative
1,"[0.10050697626265724, 0.3647341652077089, -0.0...",negative
2,"[0.12732792873824134, 0.20976997845712195, -0....",negative
3,"[0.1396960578960716, 0.10605007202830166, -0.0...",negative
4,"[0.13701498769766207, 0.1530708866872038, -0.0...",negative
...,...,...
45003,"[0.19379462649076418, -0.07134693995479266, -0...",negative
45004,"[0.082541145356511, 0.10408829985214259, -0.07...",negative
45005,"[0.12884820096549535, -0.012841832808529336, 0...",negative
45006,"[0.03857534694852251, 0.035725512936937084, -0...",negative
