## Data Processing for ML models

Link to notebook: https://colab.research.google.com/drive/1sVTTF1lUFgPxVLaTiz3MdG-ITvrsJ6Ax?usp=sharing

In this notebook, we preprocess `train_data.csv` and `test_data.csv` before feeding them into our ML models. We also address a limitation of [this paper](https://ieeexplore.ieee.org/document/9084046): its authors did not describe any technique for handling class-imbalanced datasets, which can hurt model accuracy when some target classes are underrepresented in the data. We handle the imbalance by oversampling the training set with `RandomOverSampler` from the `imblearn` package.

In [1]:
!pip install contractions

Collecting contractions
  Downloading contractions-0.1.73-py2.py3-none-any.whl (8.7 kB)
Collecting textsearch>=0.0.21 (from contractions)
  Downloading textsearch-0.0.24-py2.py3-none-any.whl (7.6 kB)
Collecting anyascii (from textsearch>=0.0.21->contractions)
  Downloading anyascii-0.3.2-py3-none-any.whl (289 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 289.9/289.9 kB 5.4 MB/s eta 0:00:00
Collecting pyahocorasick (from textsearch>=0.0.21->contractions)
  Downloading pyahocorasick-2.1.0-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (110 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 110.7/110.7 kB 9.3 MB/s eta 0:00:00
Installing collected packages: pyahocorasick, anyascii, textsearch, contractions
Successfully installed anyascii-0.3.2 contractions-0.1.73 pyahocorasick-2.1.0 textsearch-0.0.24


In [2]:
import re

import contractions
import nltk
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from imblearn.over_sampling import RandomOverSampler
from collections import Counter
import seaborn as sns
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import TweetTokenizer
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from wordcloud import WordCloud, STOPWORDS


# Download necessary NLTK data
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

import warnings

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


## Reading the data

In [3]:
# This cell reads the files from Google Drive via Colab. If you are not using Colab, change the file paths accordingly.
from google.colab import drive
drive.mount('/content/drive')

train_df = pd.read_csv('/content/drive/MyDrive/Datasets/train_data.csv')
test_df = pd.read_csv('/content/drive/MyDrive/Datasets/test_data.csv')

Mounted at /content/drive


## Additional Processing

Remove punctuation marks

In [4]:
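# Compile once: matches any character that is neither a word character nor whitespace
PUNCTUATION_PATTERN = re.compile(r'[^\w\s]')

def remove_punctuation(text):
    # Note: apostrophes are removed too, so "isn't" becomes "isnt"
    return PUNCTUATION_PATTERN.sub('', text)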

In [5]:
train_df['Text'] = train_df['Text'].apply(remove_punctuation)
test_df['Text'] = test_df['Text'].apply(remove_punctuation)
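To make the effect of this step concrete, here is a minimal sketch that applies the same regular expression to a made-up review string. The sample text and the helper name `strip_punctuation` are assumptions for illustration and do not come from the dataset or the notebook.

```python
import re

# Same pattern as in the notebook: drop every character that is
# neither a word character (\w) nor whitespace (\s)
punctuation_pattern = re.compile(r'[^\w\s]')

def strip_punctuation(text):
    """Return text with punctuation removed (hypothetical helper name)."""
    return punctuation_pattern.sub('', text)

# Made-up sample, not taken from train_data.csv
sample = "Wow!!! This isn't bad, it's a 10/10 must_see."
cleaned = strip_punctuation(sample)
print(cleaned)
```

Two side effects are worth keeping in mind: `\w` also matches the underscore, so `must_see` survives unchanged, and apostrophes and slashes are dropped, so `isn't` becomes `isnt` and `10/10` becomes `1010`.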

Tokenization

In [6]:
tokenizer = TweetTokenizer()
train_df['Text'] = train_df['Text'].apply(tokenizer.tokenize)
test_df['Text'] = test_df['Text'].apply(tokenizer.tokenize)

Stopword Removal

In [7]:
# Print the stopword list to check that the keep_in_tokens entries used below are not redundant
stop_words = set(stopwords.words('english'))
print(stop_words)

{"she's", 'here', 'o', 'against', 'too', "that'll", 'no', 'most', 'than', 'will', 'not', 'own', 'on', 'from', 'weren', 'i', 'into', 'off', 'being', 'how', 'themselves', 'by', 'needn', 'under', "you've", 'my', 'does', 'after', "doesn't", "couldn't", 'itself', 're', 'other', 'a', "weren't", 'wasn', 'myself', 'yourselves', 'at', 'out', 'few', 'until', 'any', 'did', 'while', "isn't", 'aren', 'it', 'now', 'during', 'hers', 'was', 'more', 'further', 'should', 'she', "needn't", 'and', 'doing', 'who', 'shan', "wasn't", "you'll", 'each', 'such', 'between', 'why', "should've", 'above', 'our', 'of', 'both', 'has', "won't", 'he', "aren't", 'as', 'doesn', 'them', 'once', 'nor', "mightn't", 'don', 'its', "shan't", 'won', 'but', 'in', 'some', 'couldn', 'which', 'down', 'your', 'herself', 'having', 'about', 've', 'whom', 'didn', 'very', 'mustn', 'himself', "it's", 'ours', 'where', "hadn't", 'ourselves', 'that', 'be', 'haven', 'his', 'ma', 'me', 'him', 'hadn', 'is', 'so', 'am', 'again', 'just', 't', 'y

In [8]:
def build_stop_words():
    stop_words = set(stopwords.words('english'))

    # Our n-gram analysis showed that the <br> tag was not removed properly
    stop_words.add('br')

    # Keep these words in the text as they could correlate with sentiment.
    # Punctuation was stripped earlier, so the contracted forms (e.g. "isn't")
    # can no longer occur as tokens; they are kept here for completeness.
    keep_in_tokens = [
        "isn't", "is",
        "wasn't", "was",
        "aren't", "are",
        "doesn't", "does",
        "couldn't", "could",
        "won't", "will",
        "shouldn't", "should",
        "didn't", "did",
        "haven't", "have"
    ]
    stop_words.difference_update(keep_in_tokens)
    return stop_words

# Build the set once instead of once per row
STOP_WORDS = build_stop_words()

def filter_tokens(tokens):
    return [token for token in tokens if token not in STOP_WORDS]

In [9]:
train_df['Text'] = train_df['Text'].apply(filter_tokens)
test_df['Text'] = test_df['Text'].apply(filter_tokens)
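The filtering logic can be illustrated without downloading any NLTK data. The sketch below uses a tiny hand-written stopword set as a stand-in for NLTK's English list; the names `demo_stop_words`, `demo_keep`, and `filter_demo`, as well as the sample tokens, are assumptions made for this example.

```python
# Tiny stand-in for NLTK's English stopword list (the real list is much longer)
demo_stop_words = {"the", "a", "is", "was", "not", "this", "br"}

# Words exempted from removal because they may carry sentiment
# (a subset of the notebook's keep_in_tokens list)
demo_keep = {"is", "was"}

# Effective stopword set: everything in the list except the exempted words
effective_stop_words = demo_stop_words - demo_keep

def filter_demo(tokens):
    """Drop every token that is in the effective stopword set."""
    return [token for token in tokens if token not in effective_stop_words]

# Made-up token list, not taken from the dataset
tokens = ["this", "movie", "is", "not", "the", "best", "br"]
filtered = filter_demo(tokens)
print(filtered)
```

In this sketch `is` survives but `not` is removed. The printed NLTK list also contains `not` and `no`, and neither is in the keep list, so the same thing may happen to negations in the real data; whether that matters for the downstream models is something to check empirically.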

Lemmatization

In [10]:
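lemmatizer = WordNetLemmatizer()

def lemmatize_tokens(filtered_tokens):
    # lemmatize() treats every token as a noun by default (pos='n'),
    # which is likely why "was" shows up as "wa" in the output below
    return ' '.join(lemmatizer.lemmatize(token) for token in filtered_tokens)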

In [11]:
train_df['Text'] = train_df['Text'].apply(lemmatize_tokens)
test_df['Text'] = test_df['Text'].apply(lemmatize_tokens)

In [12]:
print("Deep cleaned train Dataset")
train_df.head(10)

Deep cleaned train Dataset


| Unnamed: 0 | Text | Sentiment |
|---|---|---|
| 0 | saw premiered rewatched ifc is great telling m... | 8 |
| 1 | movie is one alltime favorite think sean penn ... | 6 |
| 2 | describing stalingrad war film may bit inaccur... | 8 |
| 3 | tale two sister one creepiest film have seen r... | 8 |
| 4 | well notice imdb offered plot infothat is is p... | 1 |
| 5 | little picture succeeds many big picture fails... | 7 |
| 6 | will love child saddest movie have ever seen d... | 8 |
| 7 | wow 3d imagery time wa used nicely provide goo... | 2 |
| 8 | 24 is best television show is incredible tv se... | 8 |
| 9 | wa moved film 1981 went back theater four time... | 8 |


In [13]:
print("Deep cleaned test Dataset")
test_df.head(10)

Deep cleaned test Dataset


| Unnamed: 0 | Text | Sentiment |
|---|---|---|
| 0 | frank horrigan clint eastwood is harassed mitc... | 6 |
| 1 | carly jones elisha curtberth bad boy brother n... | 5 |
| 2 | dig would say anyone even like metallica see k... | 5 |
| 3 | is great premise movie overall plot is origina... | 4 |
| 4 | underground comedy movie is possibly worst tra... | 1 |
| 5 | plot nutshell duchess voice eva gabor is well ... | 5 |
| 6 | liked film lot is dark is bulletdodging carcha... | 7 |
| 7 | terrible movie represents perfectly state dege... | 1 |
| 8 | know story group plucky nohopers enter competi... | 5 |
| 9 | boat builder sleepy town maine is going busine... | 5 |


## Handling class imbalance

In our data preparation notebook, we found that the classes were imbalanced. We now apply random oversampling to the training data only, so the test set keeps its original class distribution.

In [14]:
train_df['Sentiment'].value_counts()

Sentiment
1    7981
8    7733
6    4672
4    4225
3    3922
5    3829
7    3666
2    3630
Name: count, dtype: int64
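The size of the imbalance can be quantified directly from the counts above. The following sketch copies those counts into a dictionary and works out the ratio between the largest and smallest class, plus how many rows would have to be added to bring every class up to the majority size. The variable names are chosen for illustration only.

```python
# Class counts copied from the value_counts() output above
class_counts = {1: 7981, 8: 7733, 6: 4672, 4: 4225,
                3: 3922, 5: 3829, 7: 3666, 2: 3630}

total_rows = sum(class_counts.values())
largest = max(class_counts.values())
smallest = min(class_counts.values())

# Ratio between the most and the least frequent class
imbalance_ratio = largest / smallest

# Bringing every class up to the majority size adds (largest - count) rows per class
rows_added = sum(largest - count for count in class_counts.values())
rows_after = total_rows + rows_added

print(f"rows before: {total_rows}")
print(f"imbalance ratio: {imbalance_ratio:.2f}")
print(f"rows added: {rows_added}, rows after: {rows_after}")
```

The most frequent class is roughly 2.2 times as large as the least frequent one, and balancing all eight classes at 7981 rows grows the training set from 39658 to 63848 rows.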

In [15]:
import numpy as np

In [16]:
# Fix the seed so the resampled set is reproducible (42 is an arbitrary choice)
ros = RandomOverSampler(random_state=42)
train_x, train_y = ros.fit_resample(
    train_df['Text'].to_numpy().reshape(-1, 1),  # features must be 2-D
    train_df['Sentiment'].to_numpy(),            # labels must be 1-D
)

# Get oversampled training data
train_oversampled_df = pd.DataFrame({'Text': train_x.ravel(), 'Sentiment': train_y})

In [17]:
train_oversampled_df['Sentiment'].value_counts()

Sentiment
8    7981
6    7981
1    7981
7    7981
2    7981
4    7981
3    7981
5    7981
Name: count, dtype: int64
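For readers unfamiliar with random oversampling, the idea can be sketched in a few lines of plain Python: every class smaller than the largest one is topped up with randomly chosen duplicates of its own rows. This is a simplified illustration under stated assumptions, not imblearn's actual implementation; the function name `random_oversample` and the toy data are hypothetical.

```python
import random
from collections import Counter

def random_oversample(texts, labels, seed=0):
    """Duplicate random rows of each minority class until all classes
    are as large as the biggest one (simplified illustration)."""
    rng = random.Random(seed)

    # Group the texts by their label
    by_label = {}
    for text, label in zip(texts, labels):
        by_label.setdefault(label, []).append(text)

    # Every class is grown to the size of the largest class
    target = max(len(items) for items in by_label.values())

    out_texts, out_labels = [], []
    for label, items in by_label.items():
        # Sample with replacement from the class's own rows
        extra = [rng.choice(items) for _ in range(target - len(items))]
        for text in items + extra:
            out_texts.append(text)
            out_labels.append(label)
    return out_texts, out_labels

# Toy data, not taken from the dataset
texts = ["good", "great", "fine", "bad", "awful", "meh"]
labels = [8, 8, 8, 1, 1, 5]

new_texts, new_labels = random_oversample(texts, labels)
counts = Counter(new_labels)
print(counts)
```

Because the added rows are exact copies, no new information enters the training set; models may overfit to the repeated minority examples, so it is worth comparing results against the unbalanced data.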

## Export the train and test data as csv

In [18]:
train_oversampled_df.to_csv("ML_train.csv", index=False)
test_df.to_csv("ML_test.csv", index=False)