## CSCI544: Homework Assignment No2
##### Due on February 8, 2024 (before class)
##### Name - Hrishikesh Thakur
##### Student ID - 5980681484

Note - Answers to questions are present at the bottom of .ipynb under the results and observation tab.

In [45]:
import pandas as pd
import numpy as np
import nltk
import re
from bs4 import BeautifulSoup
from collections import defaultdict

from gensim import utils
from gensim.test.utils import datapath
from gensim.models import Word2Vec
import gensim.downloader as api

import torch
from torch.utils.data import TensorDataset, DataLoader
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Perceptron
from sklearn import svm
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB

from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet as wn
from nltk import word_tokenize, pos_tag

nltk.download('wordnet')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('stopwords')

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/hrishikesh/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     /Users/hrishikesh/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/hrishikesh/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/hrishikesh/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [47]:
# ! pip install bs4 # in case you don't have it installed
# ! pip install gensim
# ! pip install torch
# ! pip install nltk

## Read Data

In [48]:
dataframe = pd.read_csv("https://web.archive.org/web/20201127142707if_/https://s3.amazonaws.com/amazon-reviews-pds/tsv/amazon_reviews_us_Office_Products_v1_00.tsv.gz", sep="\t", on_bad_lines="skip")
# dataframe.to_csv('dataframe.csv')



## 1. Dataset Generation
### Keep Ratings, Review headline and Review body 
- The code selects three columns ("star_rating", "review_headline", "review_body") from the original dataframe of reviews.

- It renames the "star_rating" column to "rating" for clarity and consistency.

- The code concatenates the "review_headline" and "review_body" columns to form a single "review" column, which contains the complete text of each review.

- It converts the "rating" column to numeric format with errors="coerce", so any value that cannot be parsed as a number becomes NaN instead of raising an error. These rows are removed in the next step, so only valid numeric ratings are retained.

- Any rows with missing values (NaNs) in the dataframe are dropped using dropna() function to ensure data cleanliness and consistency.

In [49]:
# Keep Reviews and Ratings
# dataframe = pd.read_csv('./dataframe.csv')
df = dataframe[["star_rating", "review_headline", "review_body"]].copy()  # Make a copy to avoid modifying the original DataFrame
df.rename(columns={"star_rating": "rating"}, inplace=True)
df["review"] = df.review_headline + ' ' + df.review_body
df["rating"] = pd.to_numeric(df.rating, errors="coerce")
df = df.dropna()
print("- Three sample reviews before data cleaning + preprocessing.\n")
df.head(3)

- Three sample reviews before data cleaning + preprocessing.



| | rating | review_headline | review_body | review |
|---|---|---|---|---|
| 0 | 5.0 | Five Stars | Great product. | Five Stars Great product. |
| 1 | 5.0 | Phffffffft, Phfffffft. Lots of air, and it's C... | What's to say about this commodity item except... | Phffffffft, Phfffffft. Lots of air, and it's C... |
| 2 | 5.0 | but I am sure I will like it. | Haven't used yet, but I am sure I will like it. | but I am sure I will like it. Haven't used yet... |


The cell below computes the frequency distribution of ratings in the DataFrame (df) and stores it in rating_stats. It then iterates over rating_stats and prints the number of reviews for each rating value.

In [50]:
rating_stats = df.rating.value_counts()
for i , j in rating_stats.items():
    print(f'Number of reviews having rating {i} : {j}')

Number of reviews having rating 5.0 : 1582682
Number of reviews having rating 4.0 : 418339
Number of reviews having rating 1.0 : 306962
Number of reviews having rating 3.0 : 193674
Number of reviews having rating 2.0 : 138380


###  We form three classes and select 50000 reviews randomly for every rating.

- The code snippet uses a lambda function to encode the ratings into three labels: 1 for ratings greater than 3 (positive), 2 for ratings less than 3 (negative), and 3 for ratings equal to 3 (neutral). This turns the star ratings into a categorical sentiment variable.

- The reviews are then sampled per star rating: 50,000 reviews are drawn for each rating from 1 to 5 using the sample method with a fixed random seed (random_state=42), giving 250,000 reviews in total. Under the label mapping above, this yields 100,000 reviews each for labels 1 and 2 and 50,000 for label 3, so the two classes used for binary classification are balanced while the neutral class is half their size.

- After sampling, the dataframes corresponding to each rating category are concatenated using pd.concat. This combines the sampled dataframes into a single dataframe called dataset, which will be used for training the sentiment analysis model.
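To check what class sizes this procedure yields, the mapping and the per-rating sampling can be mimicked on synthetic ratings. The sketch below is dependency-free; `to_label`, the pool of 10 reviews per rating, and the sample size of 4 are illustrative assumptions standing in for the lambda and the 50,000-row samples.

```python
import random
from collections import Counter

def to_label(rating):
    """Same mapping as the lambda: 1 = positive, 2 = negative, 3 = neutral."""
    if rating > 3:
        return 1
    if rating < 3:
        return 2
    return 3

# Synthetic pool: 10 reviews for each star rating 1..5 (assumption for illustration).
pool = [float(r) for r in range(1, 6) for _ in range(10)]

rng = random.Random(42)
per_rating = 4  # stands in for the 50,000 reviews drawn per star rating
sampled = []
for r in range(1, 6):
    candidates = [x for x in pool if x == r]
    sampled.extend(rng.sample(candidates, per_rating))

# Count how many sampled reviews fall under each sentiment label.
label_counts = Counter(to_label(x) for x in sampled)
print(dict(label_counts))
```

Because the sampling is done per star rating, labels 1 and 2 each collect two ratings' worth of reviews while label 3 collects one.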

In [51]:
df["labels"] = df.rating.apply(lambda x: 1 if x > 3 else (2 if x < 3 else 3))
ratings = [df[df["rating"] == i].sample(50000,random_state=42) for i in range(1, 6)]
dataset = pd.concat(ratings, ignore_index=True)
# filtered_dataset.to_csv('filtered_dataset.csv')
# filtered_dataset.head(20)
# dataset = pd.read_csv('./filtered_dataset.csv')
dataset.head()

| | rating | review_headline | review_body | review | labels |
|---|---|---|---|---|---|
| 0 | 1.0 | not worse of money | The keyboard is not sensitive enough and it ta... | not worse of money The keyboard is not sensit... | 2 |
| 1 | 1.0 | Price? | How come you can buy this on Sony's site for $... | Price? How come you can buy this on Sony's sit... | 2 |
| 2 | 1.0 | Black is light grey @ 60% full status | Well I was happy to save a lot of money *at fi... | Black is light grey @ 60% full status Well I w... | 2 |
| 3 | 1.0 | I am not please with my order. i replaced ... | I am not please with my order. i replaced the ... | I am not please with my order. i replaced ... ... | 2 |
| 4 | 1.0 | Does not work. | Bought a new (not refurbished) one. When I pla... | Does not work. Bought a new (not refurbished) ... | 2 |


### Custom function for Lemmatization

- The lemmatization process involves converting words to their base form to normalize variations of the same word. For example, "running" would be converted to "run", "better" to "good", etc. This helps in reducing the dimensionality of the vocabulary and improving the performance of natural language processing tasks.

- Part-of-Speech (POS) Tagging: In order to perform accurate lemmatization, it is important to provide the lemmatizer with the correct part-of-speech (POS) tag for each word. This is because a word may have different lemmas depending on its grammatical role in a sentence (e.g., verb, noun, adjective, adverb).

- POS Tagging with NLTK: The code snippet uses NLTK's pos_tag function to perform POS tagging on the input text. This function assigns a POS tag to each word in the sentence, indicating its grammatical category (e.g., noun, verb, adjective, adverb).

- Mapping POS Tags to WordNet POS Tags: The POS tags returned by NLTK are mapped to corresponding WordNet POS tags using a dictionary (tag_map). This mapping ensures that the lemmatizer understands the grammatical context of each word and selects the appropriate lemma.

- Lemmatization Process: Finally, the lemmatizer applies lemmatization to each word in the input text based on its POS tag. It returns a string containing the lemmatized version of the input text, where each word has been replaced with its base form.
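The tag-mapping step can be looked at in isolation. The following is a minimal sketch that avoids NLTK so it runs without any downloads: the single-letter constants are assumed to match the values of `wn.NOUN`, `wn.VERB`, `wn.ADJ` and `wn.ADV`, and the `(token, tag)` pairs are hand-written stand-ins for `pos_tag` output.

```python
from collections import defaultdict

# WordNet POS constants as single letters (assumed to mirror nltk.corpus.wordnet).
NOUN, VERB, ADJ, ADV = "n", "v", "a", "r"

def build_tag_map():
    """Map the first letter of a Penn Treebank tag to a WordNet POS; default to noun."""
    tag_map = defaultdict(lambda: NOUN)
    tag_map["J"] = ADJ   # JJ, JJR, JJS
    tag_map["V"] = VERB  # VB, VBD, VBG, ...
    tag_map["R"] = ADV   # RB, RBR, RBS
    return tag_map

tag_map = build_tag_map()

# Hand-written (token, tag) pairs standing in for pos_tag output (assumption).
tagged = [("printer", "NN"), ("running", "VBG"), ("better", "JJR"),
          ("quickly", "RB"), ("the", "DT")]

# Only the first letter of the tag is used, so every unlisted tag falls back to noun.
wordnet_pos = [(token, tag_map[tag[0]]) for token, tag in tagged]
print(wordnet_pos)
```

Each resulting pair is what would be handed to the lemmatizer, so "running" is treated as a verb and "better" as an adjective.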

In [52]:
# WordNet Lemmatizer
lemmatizer = WordNetLemmatizer()

def lemmatize_word_according_pos(statement):
    tag_map = defaultdict(lambda: wn.NOUN)
    tag_map['J'] = wn.ADJ
    tag_map['V'] = wn.VERB
    tag_map['R'] = wn.ADV
    tokens = word_tokenize(statement)
    return ' '.join([lemmatizer.lemmatize(token, tag_map[tag[0]]) for token, tag in pos_tag(tokens)])


### Data Cleaning & Preprocessing

- data_cleaning and data_preprocessing collectively perform the following tasks: 

    - The function converts the text to lowercase and removes any URLs using a regular expression.

    - BeautifulSoup Parsing: The BeautifulSoup library parses the remaining text as HTML and extracts only the text content, which removes any HTML tags.

    - Special Character Removal: Special characters, such as punctuation marks and symbols, are removed from the text using regular expressions.

    - Handling Contractions: A dictionary contraction_dict is defined to map common contractions to their expanded forms. The function replaces contractions with their expanded forms, ensuring uniformity in the text.

    - Custom Stopword List: We have created a custom list of stopwords by excluding certain negation words ('no', 'nor', 'not', etc.) from the standard English stopwords list. This helps in preserving the negation context in the text, which is important for sentiment analysis.

    - Stopword Removal: Stopwords, which are common words that do not carry significant meaning in the context of sentiment analysis, are removed from the text. This step helps in reducing noise and focusing on important keywords.

    - Return Cleaned Text: The function returns the cleaned text after applying all the specified preprocessing steps. This clean text can then be used for further analysis or modeling.
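The order of these steps affects the result, because a contraction lookup keyed on apostrophe forms only matches while the apostrophes are still in the text. The sketch below compares the two orderings on one sentence; the two-entry `CONTRACTIONS` dictionary, the toy `STOP_WORDS` set, and the helper names are assumptions made for illustration, not the notebook's full lists.

```python
import re

CONTRACTIONS = {"don't": "do not", "it's": "it is"}  # tiny subset for illustration
STOP_WORDS = {"it", "is", "do", "the", "a"}          # toy list; negations kept on purpose

def expand(text):
    """Replace whitespace-separated tokens found in the contraction dictionary."""
    return " ".join(CONTRACTIONS.get(w, w) for w in text.split())

def strip_non_letters(text):
    """Replace every non-letter character (including apostrophes) with a space."""
    return re.sub(r"[^a-zA-Z]", " ", text)

def drop_stop_words(text):
    return " ".join(w for w in text.split() if w not in STOP_WORDS)

raw = "It's great, don't return it! https://example.com"
text = re.sub(r"https?://\S+|www\.\S+", " ", raw.lower().strip())

# Ordering 1: expand contractions while apostrophes are still present.
expand_first = drop_stop_words(strip_non_letters(expand(text)))
# Ordering 2: strip special characters first, then attempt the lookup.
strip_first = drop_stop_words(expand(strip_non_letters(text)))

print(expand_first)  # great not return
print(strip_first)   # s great don t return
```

With the second ordering the negation is reduced to the fragments "don" and "t", which a downstream model has to learn as separate tokens.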

In [53]:
# data cleaning function
def data_cleaning (statement, stop_words):
    contraction_dict = {
        "i'm": "i am",
        "you're": "you are",
        "he's": "he is",
        "she's": "she is",
        "they're": "they are",
        "we're": "we are",
        "it's": "it is",
        "that's": "that is",
        "here's": "here is",
        "there's": "there is",
        "who's": "who is",
        "where's": "where is",
        "when's": "when is",
        "why's": "why is",
        "what's": "what is",
        "how's": "how is",
        "everybody's": "everybody is",
        "nobody's": "nobody is",
        "something's": "something is",
        "so's": "so is",
        "i'll": "i will",
        "you'll": "you will",
        "he'll": "he will",
        "she'll": "she will",
        "they'll": "they will",
        "it'll": "it will",
        "we'll": "we will",
        "that'll": "that will",
        "this'll": "this will",
        "these'll": "these will",
        "there'll": "there will",
        "where'll": "where will",
        "who'll": "who will",
        "what'll": "what will",
        "how'll": "how will",
        "i've": "i have",
        "you've": "you have",
        "he's": "he has",
        "she's": "she has",
        "we've": "we have",
        "they've": "they have",
        "should've": "should have",
        "could've": "could have",
        "would've": "would have",
        "might've": "might have",
        "must've": "must have",
        "what've": "what have",
        "what's": "what has",
        "where've": "where have",
        "where's": "where has",
        "there've": "there have",
        "there's": "there has",
        "these've": "these have",
        "who's": "who has",
        "don't": "do not",
        "can't": "cannot",
        "mustn't": "must not",
        "aren't": "are not",
        "couldn't": "could not",
        "wouldn't": "would not",
        "shouldn't": "should not",
        "isn't": "is not",
        "doesn't": "does not",
        "didn't": "did not",
        "hasn't": "has not",
        "hadn't": "had not",
        "haven't": "have not",
        "wasn't": "was not",
        "won't": "will not",
        "weren't": "were not",
        "ain't": "am not",
        "let's": "let us",
        "y'all": "you all",
        "where'd": "where did",
        "how'd": "how did",
        "why'd": "why did",
        "who'd": "who did",
        "when'd": "when did",
        "what'd": "what did",
        "g'day": "good day",
        "ma'am": "madam",
        "o'clock": "of the clock"
    }
    # Lowercase and remove URLs and html tags
    statement = re.sub(r'https?://\S+|www\.\S+', ' ', statement.lower().strip())
    # statement = re.sub(r'<[^<>]*>', ' ', statement)
    soup = BeautifulSoup(statement, 'html.parser')
    statement = soup.get_text(separator=' ', strip=True)
    
    # Remove special characters
    statement = re.sub(r'[^a-zA-Z]', ' ', statement)
    
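    # Handle contractions
    # Note: the previous step already replaced apostrophes with spaces, so tokens
    # reach this lookup as e.g. "don t" and the apostrophe-keyed entries do not match.
    statement = " ".join([contraction_dict[word] if word in contraction_dict else word for word in statement.split()])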

    # Remove stopwords
    statement = ' '.join([word for word in statement.split() if word not in stop_words])
    
    return statement

# Data preprocessing function
def data_preprocessing(data):
    stop_words = stopwords.words('english')
    stop_words = list(set(stop_words) - set(['no', 'nor','not', 'only', 'very', "don't", "ain't", "aren't", "couldn't", "didn't", "doesn't", "hadn't", "hasn't", "mightn't","mustn't", "isn't", "needn't", "shan't", "shouldn't", "wasn't", "weren't", "won't", "wouldn't"]))
    data["processed_review"] = data.review.apply(lambda x: data_cleaning(x, stop_words))
    data.processed_review = data.processed_review.apply(lambda x: lemmatize_word_according_pos(x))
    return data

Performing data cleaning and preprocessing 

In [54]:
dataset = data_preprocessing(dataset)
# dataset.to_csv('preprocessed_data.csv')
# dataset = pd.read_csv('./preprocessed_data.csv')



## 2. Word Embedding

- Pretrained Word2Vec Model:

    - Utilizing api.load('word2vec-google-news-300') to load the pre-trained Word2Vec model from the gensim library.

    - This specific model is the Google News Word2Vec model trained on a large corpus of Google News articles.

- Sample Vocabulary:

    - Iterating through the vocabulary of the loaded Word2Vec model to showcase some sample words. Printing the index and corresponding word for the first 5 entries in the vocabulary.

In [55]:
# Pretrained Word2Vec Model
pretrained_word2vec_model = api.load('word2vec-google-news-300')

# sample list of vocabulary from word2vec-google-news-300
for index, word in enumerate(pretrained_word2vec_model.index_to_key):
    if index == 5:
        break
    print(f"word #{index}/{len(pretrained_word2vec_model.index_to_key)} is {word}")

word #0/3000000 is </s>
word #1/3000000 is in
word #2/3000000 is for
word #3/3000000 is that
word #4/3000000 is is


In [56]:
# printing semantic similarities of the generated vectors
print("Pretrained semantic similarities of the generated vectors", pretrained_word2vec_model.most_similar(positive=['summer', 'cold'], negative=['hot'], topn=1))
print("Pretrained semantic similarities of the generated vectors", pretrained_word2vec_model.most_similar(positive=["worse", "good"], negative= ["bad"],topn=1))
print("Pretrained semantic similarities of the generated vectors", pretrained_word2vec_model.most_similar(positive=["dad", "daughter"], negative= ["mom"], topn=1))

Pretrained semantic similarities of the generated vectors [('winter', 0.6970129609107971)]
Pretrained semantic similarities of the generated vectors [('better', 0.735008955001831)]
Pretrained semantic similarities of the generated vectors [('son', 0.8507932424545288)]



- A second Word2Vec model is trained on the processed reviews from the dataset using the Word2Vec class from the Gensim library. Each review is first converted into a list of tokens with gensim's simple_preprocess function.

- Model Parameters: the call specifies the vector size (dimensionality of the word vectors, 300), the window size (maximum distance between the current and predicted word within a sentence, 11), the minimum word count (threshold for filtering infrequent words, 10), and the architecture (sg=1 selects Skip-gram rather than Continuous Bag of Words).

- Training: The Word2Vec model is trained on the processed review data to learn word embeddings. During training, the model iterates over the text corpus and adjusts the word vectors to maximize the likelihood of predicting context words given a target word (for the Skip-gram architecture).
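For reference, the skip-gram objective can be written out explicitly. The expression below is the standard textbook formulation rather than something derived from this notebook: for a corpus of $T$ tokens $w_1, \dots, w_T$ and a window size $c$, training maximizes the average log-probability of the context words around each target word.

$$
\frac{1}{T} \sum_{t=1}^{T} \; \sum_{\substack{-c \le j \le c \\ j \ne 0}} \log p\left(w_{t+j} \mid w_t\right)
$$

In the training call, window=11 plays the role of $c$ and sg=1 selects this skip-gram objective. How $p$ is approximated in practice (for example by negative sampling) depends on the library settings and is not shown here.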

In [57]:
# Own Word2Vec Model
own_word2vec_model = Word2Vec(sentences=dataset.processed_review.apply(lambda x : utils.simple_preprocess(str(x))), vector_size=300, window=11, min_count=10, sg=1)

# sample list of vocabulary
for index, word in enumerate(own_word2vec_model.wv.index_to_key):
    if index == 5:
        break
    print(f"word #{index}/{len(own_word2vec_model.wv.index_to_key)} is {word}")

word #0/13518 is not
word #1/13518 is work
word #2/13518 is use
word #3/13518 is printer
word #4/13518 is one


In [58]:
# Printing semantic similarities of the generated vectors
print("Own Model semantic similarities of the generated vectors", own_word2vec_model.wv.most_similar(positive=['summer', 'cold'], negative=['hot'], topn=1))
print("Own Model semantic similarities of the generated vectors", own_word2vec_model.wv.most_similar(positive=["worse", "good"], negative= ["bad"], topn=1))
print("Own Model semantic similarities of the generated vectors", own_word2vec_model.wv.most_similar(positive=["dad", "daughter"], negative= ["mom"], topn=1))

Own Model semantic similarities of the generated vectors [('winter', 0.46683269739151)]
Own Model semantic similarities of the generated vectors [('excellent', 0.4422725737094879)]
Own Model semantic similarities of the generated vectors [('son', 0.5418736338615417)]


#### Comparison of Semantic Similarities:

- The pretrained Word2Vec model tends to produce more intuitive semantic similarities between words than the Word2Vec model trained on the review dataset.

- For example, for "summer - hot + cold", both models return "winter", which matches expectations, but the pretrained model does so with a higher similarity score (0.697 versus 0.467).

- Similarly, for "worse - bad + good", the pretrained model returns "better" (0.735), whereas the model trained on the review dataset returns "excellent" with a lower similarity score (0.442).

- The same trend is observed for "dad - mom + daughter": both models return "son", but the pretrained model is more confident (0.851 versus 0.542).
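The analogy queries above boil down to vector arithmetic followed by a cosine-similarity ranking. The sketch below reproduces that idea in plain Python, in the spirit of `most_similar` but not as a copy of gensim's implementation. The two-dimensional vectors and the function names `unit`, `cosine` and `analogy` are made up for illustration.

```python
import math

def unit(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(a, b):
    return sum(x * y for x, y in zip(unit(a), unit(b)))

def analogy(vectors, positive, negative, topn=1):
    """Rank words by cosine similarity to sum(unit(positive)) - sum(unit(negative))."""
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    for w in positive:
        query = [q + x for q, x in zip(query, unit(vectors[w]))]
    for w in negative:
        query = [q - x for q, x in zip(query, unit(vectors[w]))]
    exclude = set(positive) | set(negative)  # query words are not valid answers
    scored = [(w, cosine(query, v)) for w, v in vectors.items() if w not in exclude]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]

# Toy embedding: axis 0 ~ "is a season", axis 1 ~ temperature (made-up values).
toy_vectors = {
    "summer": [1.0, 1.0], "winter": [1.0, -1.0], "spring": [1.0, 0.1],
    "hot": [0.0, 1.0], "cold": [0.0, -1.0], "printer": [-1.0, 0.2],
}

best = analogy(toy_vectors, positive=["summer", "cold"], negative=["hot"])
print(best[0][0])  # winter
```

The similarity score reported next to each answer is this cosine value, which is why a lower score signals a looser match even when the returned word is the same.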

#### Encoding Semantic Similarities:
- The pretrained Word2Vec model, trained on a large and diverse news corpus, encodes semantic similarities between words better. This is likely because it has seen far more data (a vocabulary of 3,000,000 entries versus 13,518 for the review-trained model), capturing richer semantic relationships.

- The Word2Vec model trained on the review dataset has been exposed to less diverse linguistic context, leading to less accurate semantic embeddings.

<ins> In summary, the pretrained Word2Vec model generally performs better in encoding semantic similarities between words, likely due to its training on a larger and more diverse dataset. However, the model trained on the review dataset can still be valuable if it captures domain-specific nuances or vocabulary not present in the pretrained model.</ins>

In [59]:
# evaluate both word2vec model for supporting our observation 
print(pretrained_word2vec_model.evaluate_word_pairs(datapath('wordsim353.tsv')))
print(own_word2vec_model.wv.evaluate_word_pairs(datapath('wordsim353.tsv')))

(PearsonRResult(statistic=0.6238773487289394, pvalue=1.7963224351224885e-39), SignificanceResult(statistic=0.6589215888009288, pvalue=2.534605645914962e-45), 0.0)
(PearsonRResult(statistic=0.42847274721379586, pvalue=4.32166669342491e-12), SignificanceResult(statistic=0.4206590906793289, pvalue=1.1477606232564647e-11), 32.29461756373937)


## 3. Simple models


- The model_train_and_test function streamlines training and evaluating a model (SVM, Perceptron, Logistic Regression, or Naive Bayes) on the training and testing datasets.

- Input Parameters:
    - model: The machine learning model to be trained and evaluated.

    - x_train: The feature matrix of the training dataset.
    
    - x_test: The feature matrix of the testing dataset.
    
    - y_train: The target labels of the training dataset.
    
    - y_test: The target labels of the testing dataset.
    
    - inputDatatype: A string indicating the type of input data (e.g., "Word Embeddings", "TF-IDF Vectors").
    
    - modelname: A string specifying the name of the model being used (e.g., "Logistic Regression", "Random Forest").
    
    - classification: A string indicating the type of classification task (e.g., "Binary", "Multiclass").
    
    - result: A list to store the results of model evaluation for later analysis.

- Model Training: The function fits the specified model to the training data (x_train, y_train) using the fit method.

- Training Accuracy: After training, the function calculates the accuracy of the model on the training data (x_train, y_train) using the accuracy_score function from scikit-learn. This accuracy score represents how well the model predicts the training data.

- Testing Accuracy: The function then uses the trained model to make predictions on the testing data (x_test) and evaluates its performance using the corresponding ground truth labels (y_test). The accuracy of the model on the testing data is calculated using the accuracy_score function.

- Result Logging: Appends the input data type, model name, classification type, training accuracy, and testing accuracy to the result list for further analysis or comparison with other models.
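The fit, predict, score and log pattern described above only assumes that the model exposes `fit` and `predict`. The sketch below illustrates this with a hypothetical `MajorityClassifier` stand-in and a hand-written `accuracy` helper; neither is part of the notebook or of scikit-learn, and the toy data is invented.

```python
class MajorityClassifier:
    """Hypothetical stand-in exposing fit/predict like a scikit-learn estimator."""

    def fit(self, x, y):
        # Remember the most frequent training label.
        self.label_ = max(set(y), key=list(y).count)
        return self

    def predict(self, x):
        return [self.label_ for _ in x]

def accuracy(y_true, y_pred):
    """Fraction of predictions that equal the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def train_and_log(model, x_train, x_test, y_train, y_test, tag, result):
    """Fit on the training split, score both splits, append one row to result."""
    model.fit(x_train, y_train)
    train_acc = accuracy(y_train, model.predict(x_train))
    test_acc = accuracy(y_test, model.predict(x_test))
    result.append([tag, train_acc, test_acc])
    return train_acc, test_acc

result_rows = []
x_train, y_train = [[0], [1], [2], [3]], [1, 1, 1, 2]
x_test, y_test = [[4], [5]], [1, 2]
train_acc, test_acc = train_and_log(
    MajorityClassifier(), x_train, x_test, y_train, y_test, "toy", result_rows)
print(result_rows)
```

Collecting one row per run in a shared list makes it straightforward to tabulate all models side by side at the end.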

In [60]:
def model_train_and_test (model, x_train, x_test, y_train, y_test,inputDatatype, modelname, classification, result):
    
    model.fit(x_train, y_train) 
    train_pred = model.predict(x_train)
    accuracy = accuracy_score(y_train, train_pred) 

    print(f'Training data Metrix ({modelname} - {inputDatatype} - {classification}):')
    print("Accuracy :", accuracy )

    y_pred = model.predict(x_test)
    accuracy_test = accuracy_score(y_test, y_pred)

    print(f'Testing data Metrix ({modelname} - {inputDatatype} - {classification}):')
    print("Accuracy :", accuracy_test )
    print()
    # storing result in a list for better comparison and visualisation
    result.append([inputDatatype, modelname,classification, accuracy, accuracy_test])

In [61]:
# Creating result List to store data related to model
result_column = ["Input Data", "Model Name", "Classification", "Training Accuracy", "Testing Accuracy"]
result = list()

##### This function performs TF-IDF vectorization on processed reviews, excluding those labeled as 3, utilizing a maximum of 5000 features, and returns a DataFrame containing TF-IDF values for each term in the reviews.

In [62]:
# TF-IDF Vectorization
def tfidf_vectorization(data):
    tfidf = TfidfVectorizer(max_features=5000)
    review_list = data[data.labels != 3].processed_review.tolist()
    result = tfidf.fit_transform(review_list)
    features = tfidf.get_feature_names_out()
    tfidf_dataset = pd.DataFrame(data=result.toarray(), columns=features)
    return tfidf_dataset
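To make the TF-IDF weights concrete, they can be computed by hand on a toy corpus. This is a simplified sketch assuming the documented TfidfVectorizer defaults (raw term counts, smoothed idf, L2-normalised rows). It tokenizes with a plain `split()` and ignores `max_features`, so it is not a drop-in replacement. `tfidf_rows` is a hypothetical helper name.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Return (vocabulary, rows) with smoothed idf and L2-normalised rows."""
    tokenised = [d.split() for d in docs]
    vocab = sorted({t for doc in tokenised for t in doc})
    n = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(t for doc in tokenised for t in set(doc))
    # Smoothed idf: ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for doc in tokenised:
        tf = Counter(doc)
        weights = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0
        rows.append([w / norm for w in weights])
    return vocab, rows

vocab, rows = tfidf_rows(["printer work great", "printer not work", "great price"])
for row in rows:
    print([round(w, 3) for w in row])
```

In the second document, "not" occurs in fewer documents than "printer", so it receives the larger weight.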

##### This function 'w2v_vectorize' takes a statement and a word embedding model, extracts word vectors for each word in the statement from the model, and returns the mean of these vectors as a single vector representation of the statement, or a zero vector if none of the words are found in the model.

In [63]:
# Vectorize function for Word Embedding
def w2v_vectorize(statement, model):
    words_vector = [model[word] for word in statement.split() if word in model]
    if len(words_vector) == 0:
        return np.zeros(300)
    words_vector = np.array(words_vector)
    return words_vector.mean(axis=0)

##### This function takes a statement and a word embedding model, extracts word vectors for up to 10 words from the statement using the model, pads with zeros if necessary, and returns a flattened array representing the vectorized statement with a fixed length of 300 * max_words.

In [64]:
# Vectorize function for Word Embedding with  max_words limit 10
def w2v_vectorize_withLimit(statement, model, max_words = 10):
    words_vector = [model[word] for word in statement.split() if word in model]
    if len(words_vector) == 0:
        return np.zeros(300 * max_words)
    
    words_vector = np.array(words_vector[:max_words])
    padding_size = max_words - words_vector.shape[0]

    if padding_size > 0:
        padding = np.zeros((padding_size, words_vector.shape[1]))
        words_vector = np.concatenate([words_vector, padding])

    return words_vector.flatten()
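The two vectorization strategies keep different information, which a toy example makes visible. The sketch below is a dependency-free illustration: `mean_pool`, `concat_pool`, and the two-dimensional `toy_model` dictionary are assumptions standing in for the NumPy functions and the 300-dimensional embeddings.

```python
def mean_pool(statement, model, dim):
    """Average the vectors of all in-vocabulary words; zero vector if none are found."""
    vectors = [model[w] for w in statement.split() if w in model]
    if not vectors:
        return [0.0] * dim
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def concat_pool(statement, model, dim, max_words=10):
    """Concatenate the first max_words in-vocabulary vectors, zero-padded to a fixed length."""
    vectors = [model[w] for w in statement.split() if w in model][:max_words]
    vectors += [[0.0] * dim] * (max_words - len(vectors))
    return [x for vec in vectors for x in vec]

# Made-up 2-dimensional embedding for illustration.
toy_model = {"good": [1.0, 0.0], "printer": [0.0, 1.0], "not": [-1.0, 0.0]}

sentence = "not good printer unknownword"
mean_vec = mean_pool(sentence, toy_model, dim=2)
concat_vec = concat_pool(sentence, toy_model, dim=2, max_words=4)

print(mean_vec)
print(concat_vec)
```

Averaging discards word order (here "not" and "good" cancel on the first axis), while concatenation preserves order but only for the first max_words in-vocabulary words and at a much larger input size.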

##### Creating TF-IDF Dataset train test split

This code first builds a TF-IDF representation of the binary subset of the dataset (labels 1 and 2). It then splits the data into training and testing sets, with 80% used for training and 20% for testing; the fixed random_state=42 makes the split reproducible.

In [65]:
#creating TF-IDF data set
tfidf_dataset = tfidf_vectorization(dataset)
# Train and Test split
tfidf_x_train, tfidf_x_test, tfidf_y_train, tfidf_y_test = train_test_split(tfidf_dataset, dataset[dataset.labels != 3].labels, test_size=0.2, random_state=42)

##### Running simple models on the TF-IDF dataset

In [66]:
# Perceptron Model call for TF-IDF data set for binary classification
perceptron = Perceptron(penalty='elasticnet', l1_ratio=0.8, alpha=1e-4, tol=1e-3)
model_train_and_test(perceptron, tfidf_x_train, tfidf_x_test, tfidf_y_train, tfidf_y_test, "TF-IDF", "Perceptron", "Binary", result)

# SVM Model call for TF-IDF data set for binary classification
svm_model = svm.LinearSVC(dual=True, loss="hinge", max_iter=200000, fit_intercept=True, tol=1e-5)
model_train_and_test(svm_model, tfidf_x_train, tfidf_x_test, tfidf_y_train, tfidf_y_test, "TF-IDF", "SVM", "Binary", result)

Training data Metrix (Perceptron - TF-IDF - Binary):
Accuracy : 0.8741375
Testing data Metrix (Perceptron - TF-IDF - Binary):
Accuracy : 0.8707

Training data Metrix (SVM - TF-IDF - Binary):
Accuracy : 0.92078125
Testing data Metrix (SVM - TF-IDF - Binary):
Accuracy : 0.90855



In [67]:
logistic_reg = LogisticRegression(max_iter=5000) 
model_train_and_test(logistic_reg, tfidf_x_train, tfidf_x_test, tfidf_y_train, tfidf_y_test, "TF-IDF", "Logistic", "Binary", result)

naiveBayes = MultinomialNB()
model_train_and_test(naiveBayes, tfidf_x_train, tfidf_x_test, tfidf_y_train, tfidf_y_test, "TF-IDF", "Naive Bayes", "Binary", result)

Training data Metrix (Logistic - TF-IDF - Binary):
Accuracy : 0.916825
Testing data Metrix (Logistic - TF-IDF - Binary):
Accuracy : 0.907975

Training data Metrix (Naive Bayes - TF-IDF - Binary):
Accuracy : 0.86921875
Testing data Metrix (Naive Bayes - TF-IDF - Binary):
Accuracy : 0.8654



In [68]:
# free up memory (because the kernel is crashing due to insufficient memory)
del(index, word, dataframe, df, i, j, lemmatizer,perceptron, rating_stats, ratings, stopwords, svm_model,tfidf_dataset, 
    tfidf_x_train, tfidf_x_test, tfidf_y_train, tfidf_y_test,wn)

##### Creating train test split for Binary classification.

This code snippet splits a dataset into train and test sets for binary classification, excluding class label 3. It then vectorizes the textual data using word embeddings (Word2Vec) from two different models, both with and without limiting the maximum number of words.

In [69]:
# Train and Test Split for Binary
word2vec_binary_x_train, word2vec_binary_x_test, word2vec_binary_y_train, word2vec_binary_y_test = train_test_split(
    dataset[dataset.labels != 3].processed_review, dataset[dataset.labels != 3].labels,
    test_size=0.2, random_state=42)

# Vectorize for Word Embedding
max_ = 10
own_word2vec_binary_x_train = np.asarray([w2v_vectorize(statement=statement, model=own_word2vec_model.wv) for statement in word2vec_binary_x_train])
own_word2vec_binary_x_test = np.asarray([w2v_vectorize(statement=statement, model=own_word2vec_model.wv) for statement in word2vec_binary_x_test])
own_word2vec_limit_binary_x_train = np.asarray([w2v_vectorize_withLimit(statement=statement, model=own_word2vec_model.wv) for statement in word2vec_binary_x_train])
own_word2vec_limit_binary_x_test = np.asarray([w2v_vectorize_withLimit(statement=statement, model=own_word2vec_model.wv) for statement in word2vec_binary_x_test])

pretrained_word2vec_binary_x_train = np.asarray([w2v_vectorize(statement=statement, model=pretrained_word2vec_model) for statement in word2vec_binary_x_train])
pretrained_word2vec_binary_x_test = np.asarray([w2v_vectorize(statement=statement, model=pretrained_word2vec_model) for statement in word2vec_binary_x_test])
pretrained_word2vec_limit_binary_x_train = np.asarray([w2v_vectorize_withLimit(statement=statement, model=pretrained_word2vec_model) for statement in word2vec_binary_x_train])
pretrained_word2vec_limit_binary_x_test = np.asarray([w2v_vectorize_withLimit(statement=statement, model=pretrained_word2vec_model) for statement in word2vec_binary_x_test])


#### Creating train test split for ternary classification

This code splits the full dataset (all three labels) into training and testing sets, then converts each review into Word2Vec features in two ways: the mean of all word vectors (a 300-dimensional vector) and the concatenation of the first 10 word vectors (a 3000-dimensional vector, zero-padded for shorter reviews). Both are computed with the self-trained and the pretrained Word2Vec models, preparing the data for ternary classification.

In [70]:
# Train and Test Split for Ternary
word2vec_ternary_x_train, word2vec_ternary_x_test, word2vec_ternary_y_train, word2vec_ternary_y_test = train_test_split(
    dataset.processed_review, dataset.labels,
    test_size=0.2, random_state=42)
# Vectorize for Word Embedding
max_ = 10
own_word2vec_ternary_x_train = np.asarray([w2v_vectorize(statement=statement, model=own_word2vec_model.wv) for statement in word2vec_ternary_x_train])
own_word2vec_ternary_x_test = np.asarray([w2v_vectorize(statement=statement, model=own_word2vec_model.wv) for statement in word2vec_ternary_x_test])
own_word2vec_limit_ternary_x_train = np.asarray([w2v_vectorize_withLimit(statement=statement, model=own_word2vec_model.wv) for statement in word2vec_ternary_x_train])
own_word2vec_limit_ternary_x_test = np.asarray([w2v_vectorize_withLimit(statement=statement, model=own_word2vec_model.wv) for statement in word2vec_ternary_x_test])

pretrained_word2vec_ternary_x_train = np.asarray([w2v_vectorize(statement=statement, model=pretrained_word2vec_model) for statement in word2vec_ternary_x_train])
pretrained_word2vec_ternary_x_test = np.asarray([w2v_vectorize(statement=statement, model=pretrained_word2vec_model) for statement in word2vec_ternary_x_test])
pretrained_word2vec_limit_ternary_x_train = np.asarray([w2v_vectorize_withLimit(statement=statement, model=pretrained_word2vec_model) for statement in word2vec_ternary_x_train])
pretrained_word2vec_limit_ternary_x_test = np.asarray([w2v_vectorize_withLimit(statement=statement, model=pretrained_word2vec_model) for statement in word2vec_ternary_x_test])


#### Running Perceptron, SVM and logistic regression models for Binary classification

The code defines and trains two Perceptron models and two linear Support Vector Machine (SVM) models for binary classification, one of each on the custom and on the pretrained Word2Vec embeddings (mean-pooled review vectors). Each model is evaluated with the model_train_and_test function, which stores the results for later comparison. Logistic regression is run in the following cell.

In [71]:
perceptron_own = Perceptron(penalty='elasticnet', l1_ratio=0.8, alpha=1e-4, tol=1e-3)
model_train_and_test(perceptron_own, own_word2vec_binary_x_train, own_word2vec_binary_x_test, word2vec_binary_y_train, word2vec_binary_y_test, "Word2Vec-custom", "Perceptron", "Binary", result)

svm_model_own = svm.LinearSVC(dual=True, loss="hinge", max_iter=200000, fit_intercept=True, tol=1e-5)
model_train_and_test(svm_model_own, own_word2vec_binary_x_train, own_word2vec_binary_x_test, word2vec_binary_y_train, word2vec_binary_y_test, "Word2Vec-custom", "SVM", "Binary", result)

perceptron_pretrained = Perceptron(penalty='elasticnet', l1_ratio=0.8, alpha=1e-4, tol=1e-3)
model_train_and_test(perceptron_pretrained, pretrained_word2vec_binary_x_train, pretrained_word2vec_binary_x_test, word2vec_binary_y_train, word2vec_binary_y_test, "Word2Vec-Pre-Trained", "Perceptron", "Binary", result)

svm_model_pretrained = svm.LinearSVC(dual=True, loss='hinge', max_iter=200000, fit_intercept=True, tol=1e-5)
model_train_and_test(svm_model_pretrained, pretrained_word2vec_binary_x_train, pretrained_word2vec_binary_x_test, word2vec_binary_y_train, word2vec_binary_y_test, "Word2Vec-Pre-Trained", "SVM", "Binary", result)

Training data Metrix (Perceptron - Word2Vec-custom - Binary):
Accuracy : 0.85066875
Testing data Metrix (Perceptron - Word2Vec-custom - Binary):
Accuracy : 0.8501

Training data Metrix (SVM - Word2Vec-custom - Binary):
Accuracy : 0.8824375
Testing data Metrix (SVM - Word2Vec-custom - Binary):
Accuracy : 0.880775

Training data Metrix (Perceptron - Word2Vec-Pre-Trained - Binary):
Accuracy : 0.83849375
Testing data Metrix (Perceptron - Word2Vec-Pre-Trained - Binary):
Accuracy : 0.838725

Training data Metrix (SVM - Word2Vec-Pre-Trained - Binary):
Accuracy : 0.86276875
Testing data Metrix (SVM - Word2Vec-Pre-Trained - Binary):
Accuracy : 0.858525



In [72]:
logistic_own = LogisticRegression(max_iter=5000) 
model_train_and_test(logistic_own, own_word2vec_binary_x_train, own_word2vec_binary_x_test, word2vec_binary_y_train, word2vec_binary_y_test, "Word2Vec-custom", "Logistic", "Binary", result)

logistic_pretrained = LogisticRegression(max_iter=5000) 
model_train_and_test(logistic_pretrained, pretrained_word2vec_binary_x_train, pretrained_word2vec_binary_x_test, word2vec_binary_y_train, word2vec_binary_y_test, "Word2Vec-Pre-Trained", "Logistic", "Binary", result)

Training data Metrix (Logistic - Word2Vec-custom - Binary):
Accuracy : 0.88164375
Testing data Metrix (Logistic - Word2Vec-custom - Binary):
Accuracy : 0.880675

Training data Metrix (Logistic - Word2Vec-Pre-Trained - Binary):
Accuracy : 0.861375
Testing data Metrix (Logistic - Word2Vec-Pre-Trained - Binary):
Accuracy : 0.8574



#### FNN Model

The mlp_classification function trains and evaluates a multi-layer perceptron (MLP) for classification tasks using PyTorch. In summary:

- Data Preparation: The function takes the training and testing data with their labels, plus the network architecture and training settings. Features are converted to float32 tensors, labels are shifted by 1 so that class indices start at 0, and both are wrapped in DataLoaders.

- Model Definition: Inside the function, an MLP model is defined using the nn.Module class. The architecture is specified by the input_size, hidden_sizes, and output_size parameters, with ReLU activation functions between layers.

- Training Loop: The function then trains the model using the provided training data. It utilizes the Adam optimizer and Cross Entropy Loss criterion for optimization.

- Testing Loop: After training, the function evaluates the model's performance on the testing data.

- Results: Finally, the function prints the training loss for each epoch and the testing accuracy. Additionally, it appends the results to a list called result.

In [75]:
def mlp_classification(X_train, y_train, X_test, y_test, input_size, hidden_sizes, output_size, input_data, model_name, classification, batch_size=64, epochs=25, learning_rate=0.001):
    # Convert data to PyTorch tensors (labels are shifted by 1 so class indices start at 0)
    X_train_tensor = torch.from_numpy(X_train).to(dtype=torch.float32)
    y_train_tensor = torch.tensor(y_train.values - 1, dtype=torch.long)
    X_test_tensor = torch.from_numpy(X_test).to(dtype=torch.float32)
    y_test_tensor = torch.tensor(y_test.values - 1, dtype=torch.long)

    # Create DataLoader
    train_dataset = TensorDataset(X_train_tensor, y_train_tensor)
    test_dataset = TensorDataset(X_test_tensor, y_test_tensor)

    train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
    test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)

    # Define the MLP model
    class MLP(nn.Module):
        def __init__(self, input_size, hidden_sizes, output_size):
            super(MLP, self).__init__()
            layers = []
            layers.append(nn.Flatten())
            for i in range(len(hidden_sizes)):
                layers.append(nn.Linear(input_size if i == 0 else hidden_sizes[i-1], hidden_sizes[i]))
                layers.append(nn.ReLU())
            layers.append(nn.Linear(hidden_sizes[-1], output_size))
            self.model = nn.Sequential(*layers)

        def forward(self, x):
            return self.model(x)

    # Create model, criterion, and optimizer
    model = MLP(input_size, hidden_sizes, output_size)
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(model.parameters(), lr=learning_rate)

    # Loss monitoring
    training_losses = []
    print(f'\n\nTraining ({model_name}) - ({input_data}) - ({classification}) : \n')
    print('Model configuration :')
    print( model )
    print(f'\nTraining Loss for each epoch :')
    # Training loop
    for epoch in range(epochs):
        model.train()
        epoch_loss = 0.0
        for inputs, labels in train_loader:
            optimizer.zero_grad()
            outputs = model(inputs)
            # print(outputs.size(), labels.size())
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()
            epoch_loss += loss.item()

        # Average loss for the epoch
        avg_loss = epoch_loss / len(train_loader)
        training_losses.append(avg_loss)
        print(f'Epoch [{epoch + 1}/{epochs}], Loss: {avg_loss:.4f}')

    # Testing loop
    model.eval()
    correct = 0
    total = 0
    with torch.no_grad():
        for inputs, labels in test_loader:
            outputs = model(inputs)
            # print(torch.max(outputs.data, 1))
            _, predicted = torch.max(outputs.data, 1)
            # print(_, predicted.size())
            total += labels.size(0)
            correct += (predicted == labels).sum().item()

    testing_accuracy = correct / total
    print(f'Testing Accuracy for ({model_name}) - ({input_data}) - ({classification}) : {testing_accuracy}')
    result.append([input_data, model_name, classification, "-", testing_accuracy])
    return model, testing_accuracy, training_losses

#### Running FNN model for binary classification

The cell below runs the multi-layer perceptron (MLP) binary classification experiments with both word embedding approaches (custom vs. pretrained Word2Vec embeddings) and both input sizes (300 for averaged vectors vs. 3000 for the first 10 vectors concatenated).
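
The two input sizes follow from how each review is turned into a feature vector. The sketch below is only an illustration and is not the notebook's `w2v_vectorize` / `w2v_vectorize_withLimit` functions, which are defined earlier. The helper names `average_vectors` and `concat_first_k`, the toy dictionary standing in for the Word2Vec model, and the zero-padding choice are all assumptions made for the example.

```python
import numpy as np

VECTOR_SIZE = 300  # dimensionality of each Word2Vec vector


def average_vectors(tokens, model, vector_size=VECTOR_SIZE):
    """Mean of the vectors of in-vocabulary tokens (zeros if none are known)."""
    vectors = [model[t] for t in tokens if t in model]
    if not vectors:
        return np.zeros(vector_size, dtype=np.float32)
    return np.mean(vectors, axis=0).astype(np.float32)


def concat_first_k(tokens, model, k=10, vector_size=VECTOR_SIZE):
    """Concatenate the first k in-vocabulary vectors, zero-padded to k * vector_size."""
    vectors = [model[t] for t in tokens if t in model][:k]
    out = np.zeros(k * vector_size, dtype=np.float32)
    if vectors:
        flat = np.concatenate(vectors)
        out[:flat.size] = flat
    return out


# Hypothetical stand-in for a KeyedVectors object: a plain dict of random vectors.
rng = np.random.default_rng(0)
toy_model = {w: rng.standard_normal(VECTOR_SIZE).astype(np.float32)
             for w in ["great", "product", "bad"]}

tokens = "great product unknownword".split()
avg_features = average_vectors(tokens, toy_model)
concat_features = concat_first_k(tokens, toy_model)
print(avg_features.shape, concat_features.shape)
```

Averaging always yields 300 features regardless of review length, while concatenating 10 vectors yields 10 x 300 = 3000 features and preserves word order for the first 10 known words.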

In [76]:
hidden_sizes = [50, 10]
output_size = 2
input_size = 300
model, accuracy, training_loss = mlp_classification(own_word2vec_binary_x_train, word2vec_binary_y_train, own_word2vec_binary_x_test, word2vec_binary_y_test,input_size, hidden_sizes, output_size, "Word2Vec-custom", "FNN - (Average)", "Binary" )
model, accuracy, training_loss = mlp_classification(pretrained_word2vec_binary_x_train, word2vec_binary_y_train, pretrained_word2vec_binary_x_test, word2vec_binary_y_test,input_size, hidden_sizes, output_size, "Word2Vec-Pretrained", "FNN - (Average)", "Binary")

input_size = 3000
model, accuracy, training_loss = mlp_classification(own_word2vec_limit_binary_x_train, word2vec_binary_y_train, own_word2vec_limit_binary_x_test, word2vec_binary_y_test,input_size, hidden_sizes, output_size, "Word2Vec-custom", "FNN - (Concat)", "Binary")
model, accuracy, training_loss = mlp_classification(pretrained_word2vec_limit_binary_x_train, word2vec_binary_y_train, pretrained_word2vec_limit_binary_x_test, word2vec_binary_y_test,input_size, hidden_sizes, output_size, "Word2Vec-Pretrained", "FNN - (Concat)", "Binary")



Training (FNN - (Average)) - (Word2Vec-custom) - (Binary) : 

Model configuration :
MLP(
  (model): Sequential(
    (0): Flatten(start_dim=1, end_dim=-1)
    (1): Linear(in_features=300, out_features=50, bias=True)
    (2): ReLU()
    (3): Linear(in_features=50, out_features=10, bias=True)
    (4): ReLU()
    (5): Linear(in_features=10, out_features=2, bias=True)
  )
)

Training Loss for each epoch :
Epoch [1/25], Loss: 0.3151
Epoch [2/25], Loss: 0.2786
Epoch [3/25], Loss: 0.2660
Epoch [4/25], Loss: 0.2583
Epoch [5/25], Loss: 0.2525
Epoch [6/25], Loss: 0.2479
Epoch [7/25], Loss: 0.2436
Epoch [8/25], Loss: 0.2392
Epoch [9/25], Loss: 0.2358
Epoch [10/25], Loss: 0.2325
Epoch [11/25], Loss: 0.2295
Epoch [12/25], Loss: 0.2271
Epoch [13/25], Loss: 0.2249
Epoch [14/25], Loss: 0.2218
Epoch [15/25], Loss: 0.2197
Epoch [16/25], Loss: 0.2185
Epoch [17/25], Loss: 0.2160
Epoch [18/25], Loss: 0.2147
Epoch [19/25], Loss: 0.2127
Epoch [20/25], Loss: 0.2108
Epoch [21/25], Loss: 0.2088
Epoch [22/25], 

The cell below runs the multi-layer perceptron (MLP) ternary classification experiments with both word embedding approaches (custom vs. pretrained Word2Vec embeddings) and both input sizes (300 vs. 3000).

In [77]:
hidden_sizes = [50, 10]
output_size = 3
input_size = 300
model, accuracy, training_loss = mlp_classification(own_word2vec_ternary_x_train, word2vec_ternary_y_train, own_word2vec_ternary_x_test, word2vec_ternary_y_test,input_size, hidden_sizes, output_size,"Word2Vec-Custom", "FNN - (Average)", "Ternary")
model, accuracy, training_loss = mlp_classification(pretrained_word2vec_ternary_x_train, word2vec_ternary_y_train, pretrained_word2vec_ternary_x_test, word2vec_ternary_y_test,input_size, hidden_sizes, output_size, "Word2Vec-Pretrained", "FNN - (Average)", "Ternary")

input_size = 3000
model, accuracy, training_loss = mlp_classification(own_word2vec_limit_ternary_x_train, word2vec_ternary_y_train, own_word2vec_limit_ternary_x_test, word2vec_ternary_y_test,input_size, hidden_sizes, output_size, "Word2Vec-Custom", "FNN - (Concat)", "Ternary")
model, accuracy, training_loss = mlp_classification(pretrained_word2vec_limit_ternary_x_train, word2vec_ternary_y_train, pretrained_word2vec_limit_ternary_x_test, word2vec_ternary_y_test,input_size, hidden_sizes, output_size, "Word2Vec-Pretrained", "FNN - (Concat)", "Ternary")



Training (FNN - (Average)) - (Word2Vec-Custom) - (Ternary) : 

Model configuration :
MLP(
  (model): Sequential(
    (0): Flatten(start_dim=1, end_dim=-1)
    (1): Linear(in_features=300, out_features=50, bias=True)
    (2): ReLU()
    (3): Linear(in_features=50, out_features=10, bias=True)
    (4): ReLU()
    (5): Linear(in_features=10, out_features=3, bias=True)
  )
)

Training Loss for each epoch :
Epoch [1/25], Loss: 0.6634
Epoch [2/25], Loss: 0.6043
Epoch [3/25], Loss: 0.5861
Epoch [4/25], Loss: 0.5768
Epoch [5/25], Loss: 0.5694
Epoch [6/25], Loss: 0.5634
Epoch [7/25], Loss: 0.5586
Epoch [8/25], Loss: 0.5540
Epoch [9/25], Loss: 0.5503
Epoch [10/25], Loss: 0.5470
Epoch [11/25], Loss: 0.5436
Epoch [12/25], Loss: 0.5411
Epoch [13/25], Loss: 0.5377
Epoch [14/25], Loss: 0.5355
Epoch [15/25], Loss: 0.5333
Epoch [16/25], Loss: 0.5317
Epoch [17/25], Loss: 0.5289
Epoch [18/25], Loss: 0.5271
Epoch [19/25], Loss: 0.5256
Epoch [20/25], Loss: 0.5242
Epoch [21/25], Loss: 0.5228
Epoch [22/25],

In [78]:
#free up memory
del(   training_loss, svm_model_pretrained, svm_model_own, pretrained_word2vec_ternary_x_train, pretrained_word2vec_limit_ternary_x_train, 
    pretrained_word2vec_ternary_x_test, pretrained_word2vec_limit_binary_x_train, pretrained_word2vec_binary_x_train, pretrained_word2vec_limit_binary_x_test,
    pretrained_word2vec_binary_x_test, perceptron_own, perceptron_pretrained, own_word2vec_ternary_x_test,
 own_word2vec_ternary_x_train, pretrained_word2vec_limit_ternary_x_test,accuracy, hidden_sizes, input_size, max_, model, output_size,own_word2vec_binary_x_test, own_word2vec_binary_x_train, 
    own_word2vec_limit_ternary_x_test, own_word2vec_limit_ternary_x_train, own_word2vec_limit_binary_x_train,own_word2vec_limit_binary_x_test
  )

In [79]:
def func(statement, model, max_review_length, vector_size):
    """Return a (max_review_length, vector_size) tensor of word vectors for one review."""
    # Look up every in-vocabulary word and keep at most max_review_length vectors
    words_vector = [model[word] for word in statement.split() if word in model][:max_review_length]
    if len(words_vector) == 0:
        words_vector = torch.zeros(max_review_length, vector_size)
    else:
        # Stack into one array first; building a tensor from a list of arrays is very slow
        words_vector = torch.tensor(np.asarray(words_vector))

    # Pad shorter reviews with zero vectors (longer reviews were truncated above)
    if words_vector.shape[0] < max_review_length:
        padding_size = max_review_length - words_vector.shape[0]
        padding = torch.zeros(padding_size, vector_size)
        words_vector = torch.cat([words_vector, padding])

    return words_vector

#### CNN Model

This Python function defines and trains a Convolutional Neural Network (CNN) model for text classification using PyTorch. Here's a summary of what the function does:

- Preprocessing: The function first limits the length of input text sequences to 50 words and converts them into PyTorch tensors.

- Model Architecture: It defines a CNN model using nn.Sequential, consisting of two convolutional layers followed by ReLU activation functions and a fully connected layer for classification.

- Training: The model is trained using the Adam optimizer and Cross Entropy Loss for a specified number of epochs. Training progress is monitored, and training losses for each epoch are printed.

- Testing: The trained model is evaluated on the test data to compute the accuracy. The function returns the trained model, testing accuracy, and training losses.

- Output: The function prints out the model configuration and training losses for each epoch during training. Finally, it appends the test accuracy to the provided result list along with input data details and model configuration for further analysis.
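
The size of the final fully connected layer follows from how each unpadded convolution shortens the sequence. The sketch below is a small arithmetic check, using the output-length formula given in the PyTorch `Conv1d` documentation. The helper name `conv1d_output_length` is introduced here for illustration and is not part of the notebook.

```python
def conv1d_output_length(length, kernel_size, stride=1, padding=0, dilation=1):
    """Output length of a 1-D convolution for the given input length."""
    return (length + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1


max_review_length = 50   # tokens kept per review
output_channels2 = 10    # channels produced by the second convolution

after_conv1 = conv1d_output_length(max_review_length, kernel_size=3)
after_conv2 = conv1d_output_length(after_conv1, kernel_size=3)
flattened_features = output_channels2 * after_conv2
print(after_conv1, after_conv2, flattened_features)
```

Each kernel of size 3 removes 2 positions, so 50 tokens become 48 and then 46. With 10 channels the flattened vector has 460 features, which is the `output_channels2 * (max_review_length - 4)` used for the linear layer.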

In [80]:
def CNN_Model(x_train, y_train, x_test, y_test, vector_size, output_channels1, output_channels2, num_classes, word2vec,
              input_data, model_name, classification, result, max_review_length=50, batch_size=32, epochs=10, learning_rate=0.001):
    # Build a (max_review_length, vector_size) matrix per review: truncate long reviews, zero-pad short ones
    x_train_limit = np.asarray([func(statement=str(statement), model=word2vec, max_review_length=max_review_length, vector_size=vector_size) for statement in x_train])
    x_test_limit = np.asarray([func(statement=str(statement), model=word2vec, max_review_length=max_review_length, vector_size=vector_size) for statement in x_test])
    
    # Convert data to PyTorch tensors
    x_train_tensor = torch.from_numpy(x_train_limit).to(dtype=torch.float32)
    y_train_tensor = torch.tensor(y_train.values - 1, dtype=torch.long)
    x_test_tensor = torch.from_numpy(x_test_limit).to(dtype=torch.float32)
    y_test_tensor = torch.tensor(y_test.values - 1, dtype=torch.long)

    # Create DataLoader
    train_dataset = TensorDataset(x_train_tensor, y_train_tensor)
    test_dataset = TensorDataset(x_test_tensor, y_test_tensor)

    train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
    test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)

    # Define the CNN model using Sequential
    model = nn.Sequential(
        nn.Conv1d(in_channels=vector_size, out_channels=output_channels1, kernel_size=3),
        nn.ReLU(),
        nn.Conv1d(in_channels=output_channels1, out_channels=output_channels2, kernel_size=3),
        nn.ReLU(),
        nn.Flatten(),
        nn.Linear(output_channels2 * (max_review_length - 4), num_classes)
    )

    # Loss and optimizer
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(model.parameters(), lr=learning_rate)

    # Loss monitoring
    training_losses = []
    print(f'\n\nTraining ({model_name}) - ({input_data}) - ({classification}) : \n')
    print('Model configuration :')
    print( model )
    print(f'\nTraining Loss for each epoch :')
    # Training loop
    for epoch in range(epochs):
        model.train()
        epoch_loss = 0.0
        for inputs, labels in train_loader:
            optimizer.zero_grad()
            # No need to check for sparse input, as Conv1d layer expects dense input
            outputs = model(inputs.permute(0, 2, 1))
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()
            epoch_loss += loss.item()

        # Average loss for the epoch
        avg_loss = epoch_loss / len(train_loader)
        training_losses.append(avg_loss)
        print(f'Epoch [{epoch + 1}/{epochs}], Loss: {avg_loss:.4f}')

    # Testing loop
    model.eval()
    correct = 0
    total = 0
    with torch.no_grad():
        for inputs, labels in test_loader:
            outputs = model(inputs.permute(0, 2, 1))
            _, predicted = torch.max(outputs.data, 1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()

    testing_accuracy = correct / total
    result.append([input_data, model_name, classification, "-", testing_accuracy])
    return model, testing_accuracy, training_losses


##### The snippet below uses the CNN_Model function to train a Convolutional Neural Network (CNN) model for binary classification using word embeddings generated by the custom Word2Vec model (own_word2vec_model):

- Model Configuration: The CNN model is configured with an input vector size of 300 (corresponding to the dimensionality of the Word2Vec embeddings), 50 output channels for the first convolutional layer, 10 output channels for the second convolutional layer, and an output size of 2 for binary classification.

- Training: The CNN_Model function is invoked with the training and testing data, along with the specified model configuration. The function trains the CNN model on the provided training data, monitors the training progress, and computes the testing accuracy on the test data.

- Output: After training, the code prints the test accuracy for binary classification. This accuracy reflects how well the CNN model can classify binary sentiment labels based on the provided Word2Vec embeddings.

In [81]:
vector_size = 300
output_channels1 = 50
output_channels2 = 10
output_size = 2

model_binary, accuracy_binary, training_losses_binary = CNN_Model(
    word2vec_binary_x_train, word2vec_binary_y_train, word2vec_binary_x_test, word2vec_binary_y_test, vector_size, output_channels1, output_channels2, output_size, own_word2vec_model.wv,
    "Word2Vec-custom", "CNN - (50 concat)", "Binary", result
)
print(f'Binary Classification Test Accuracy: {accuracy_binary * 100:.2f}%')



Training (CNN - (50 concat)) - (Word2Vec-custom) - (Binary) : 

Model configuration :
Sequential(
  (0): Conv1d(300, 50, kernel_size=(3,), stride=(1,))
  (1): ReLU()
  (2): Conv1d(50, 10, kernel_size=(3,), stride=(1,))
  (3): ReLU()
  (4): Flatten(start_dim=1, end_dim=-1)
  (5): Linear(in_features=460, out_features=2, bias=True)
)

Training Loss for each epoch :
Epoch [1/10], Loss: 0.2430
Epoch [2/10], Loss: 0.1913
Epoch [3/10], Loss: 0.1739
Epoch [4/10], Loss: 0.1585
Epoch [5/10], Loss: 0.1460
Epoch [6/10], Loss: 0.1338
Epoch [7/10], Loss: 0.1220
Epoch [8/10], Loss: 0.1126
Epoch [9/10], Loss: 0.1037
Epoch [10/10], Loss: 0.0956
Binary Classification Test Accuracy: 91.64%


##### The segment below uses the CNN_Model function to train a Convolutional Neural Network (CNN) model for binary classification using the pretrained Word2Vec embeddings:

- Model Configuration: The CNN model is configured with a word embedding size of 300 dimensions (vector_size). It consists of two convolutional layers with 50 and 10 output channels (output_channels1 and output_channels2, respectively), followed by a fully connected layer with an output size of 2 for binary classification (output_size).

- Training: The CNN_Model function is called with training and testing data (word2vec_binary_x_train, word2vec_binary_y_train, word2vec_binary_x_test, word2vec_binary_y_test) and the specified model configuration. It trains the CNN model using the provided Word2Vec embeddings (pretrained_word2vec_model), monitors the training process, and computes the testing accuracy on the test data.

- Output: After training, the code prints the test accuracy for binary classification, indicating how accurately the CNN model can predict binary sentiment labels based on the provided pre-trained Word2Vec embeddings.

In [82]:
vector_size = 300
output_channels1 = 50
output_channels2 = 10
output_size = 2

model_binary, accuracy_binary, training_losses_binary = CNN_Model(
    word2vec_binary_x_train, word2vec_binary_y_train, word2vec_binary_x_test, word2vec_binary_y_test, vector_size, output_channels1, output_channels2, output_size, pretrained_word2vec_model,
    "Word2Vec-Pretrained", "CNN - (50 concat)", "Binary", result
)
print(f'Binary Classification Test Accuracy: {accuracy_binary * 100:.2f}%')



Training (CNN - (50 concat)) - (Word2Vec-Pretrained) - (Binary) : 

Model configuration :
Sequential(
  (0): Conv1d(300, 50, kernel_size=(3,), stride=(1,))
  (1): ReLU()
  (2): Conv1d(50, 10, kernel_size=(3,), stride=(1,))
  (3): ReLU()
  (4): Flatten(start_dim=1, end_dim=-1)
  (5): Linear(in_features=460, out_features=2, bias=True)
)

Training Loss for each epoch :
Epoch [1/10], Loss: 0.2444
Epoch [2/10], Loss: 0.1913
Epoch [3/10], Loss: 0.1722
Epoch [4/10], Loss: 0.1570
Epoch [5/10], Loss: 0.1433
Epoch [6/10], Loss: 0.1306
Epoch [7/10], Loss: 0.1200
Epoch [8/10], Loss: 0.1102
Epoch [9/10], Loss: 0.1016
Epoch [10/10], Loss: 0.0933
Binary Classification Test Accuracy: 91.35%


##### The code segment below uses the CNN_Model function to train a Convolutional Neural Network (CNN) model for ternary classification using the pretrained Word2Vec embeddings:

- Model Configuration: The CNN model is configured with a word embedding size of 300 dimensions (vector_size). It comprises two convolutional layers with 50 and 10 output channels (output_channels1 and output_channels2, respectively), followed by a fully connected layer with an output size of 3 for ternary classification (output_size).

- Training: The CNN_Model function is invoked with training and testing data (word2vec_ternary_x_train, word2vec_ternary_y_train, word2vec_ternary_x_test, word2vec_ternary_y_test) and the specified model configuration. The CNN model is trained utilizing the provided pre-trained Word2Vec embeddings (pretrained_word2vec_model). It monitors the training process and computes the testing accuracy on the test data.

- Output: Upon completion of training, the code prints the test accuracy for ternary classification. This indicates how accurately the CNN model can predict ternary sentiment labels based on the provided pre-trained Word2Vec embeddings.

In [83]:
vector_size = 300
output_channels1 = 50
output_channels2 = 10
output_size = 3

model_ternary, accuracy_ternary, training_losses_ternary = CNN_Model(
    word2vec_ternary_x_train, word2vec_ternary_y_train, word2vec_ternary_x_test, word2vec_ternary_y_test, vector_size, output_channels1, output_channels2, output_size,  pretrained_word2vec_model,
    "Word2Vec-Pretrained", "CNN - (50 concat)", "Ternary", result
)
print(f'Ternary Classification Test Accuracy: {accuracy_ternary * 100:.2f}%')



Training (CNN - (50 concat)) - (Word2Vec-Pretrained) - (Ternary) : 

Model configuration :
Sequential(
  (0): Conv1d(300, 50, kernel_size=(3,), stride=(1,))
  (1): ReLU()
  (2): Conv1d(50, 10, kernel_size=(3,), stride=(1,))
  (3): ReLU()
  (4): Flatten(start_dim=1, end_dim=-1)
  (5): Linear(in_features=460, out_features=3, bias=True)
)

Training Loss for each epoch :
Epoch [1/10], Loss: 0.5784
Epoch [2/10], Loss: 0.5058
Epoch [3/10], Loss: 0.4847
Epoch [4/10], Loss: 0.4700
Epoch [5/10], Loss: 0.4568
Epoch [6/10], Loss: 0.4453
Epoch [7/10], Loss: 0.4357
Epoch [8/10], Loss: 0.4266
Epoch [9/10], Loss: 0.4182
Epoch [10/10], Loss: 0.4111
Ternary Classification Test Accuracy: 77.18%


##### This code segment employs the CNN_Model function to train a Convolutional Neural Network (CNN) model for ternary classification using custom Word2Vec embeddings. Here's a brief explanation:

- Model Configuration: The CNN model is configured with a word embedding size of 300 dimensions (vector_size). It consists of two convolutional layers with 50 and 10 output channels (output_channels1 and output_channels2, respectively), followed by a fully connected layer with an output size of 3 for ternary classification (output_size).

- Training: The CNN_Model function is called with training and testing data (word2vec_ternary_x_train, word2vec_ternary_y_train, word2vec_ternary_x_test, word2vec_ternary_y_test) and the specified model configuration. The CNN model is trained using the provided custom Word2Vec embeddings (own_word2vec_model.wv). It monitors the training process and computes the testing accuracy on the test data.

- Output: After training, the code prints the test accuracy for ternary classification. This represents how accurately the CNN model can predict ternary sentiment labels based on the provided custom Word2Vec embeddings.

In [84]:
vector_size = 300
output_channels1 = 50
output_channels2 = 10
output_size = 3

model_ternary, accuracy_ternary, training_losses_ternary = CNN_Model(
    word2vec_ternary_x_train, word2vec_ternary_y_train, word2vec_ternary_x_test, word2vec_ternary_y_test, vector_size, output_channels1, output_channels2, output_size,  own_word2vec_model.wv,
    "Word2Vec-Custom", "CNN - (50 concat)", "Ternary", result
)
print(f'Ternary Classification Test Accuracy: {accuracy_ternary * 100:.2f}%')



Training (CNN - (50 concat)) - (Word2Vec-Custom) - (Ternary) : 

Model configuration :
Sequential(
  (0): Conv1d(300, 50, kernel_size=(3,), stride=(1,))
  (1): ReLU()
  (2): Conv1d(50, 10, kernel_size=(3,), stride=(1,))
  (3): ReLU()
  (4): Flatten(start_dim=1, end_dim=-1)
  (5): Linear(in_features=460, out_features=3, bias=True)
)

Training Loss for each epoch :
Epoch [1/10], Loss: 0.5662
Epoch [2/10], Loss: 0.5003
Epoch [3/10], Loss: 0.4817
Epoch [4/10], Loss: 0.4665
Epoch [5/10], Loss: 0.4536
Epoch [6/10], Loss: 0.4420
Epoch [7/10], Loss: 0.4317
Epoch [8/10], Loss: 0.4223
Epoch [9/10], Loss: 0.4136
Epoch [10/10], Loss: 0.4051
Ternary Classification Test Accuracy: 78.28%


# Results and Observation

In [85]:
results_df = pd.DataFrame(result, columns=result_column).sort_values(['Model Name', 'Classification'], ascending=False, ignore_index=True)
results_df

| | Input Data | Model Name | Classification | Training Accuracy | Testing Accuracy |
|---|---|---|---|---|---|
| 0 | TF-IDF | SVM | Binary | 0.920781 | 0.90855 |
| 1 | Word2Vec-custom | SVM | Binary | 0.882437 | 0.880775 |
| 2 | Word2Vec-Pre-Trained | SVM | Binary | 0.862769 | 0.858525 |
| 3 | TF-IDF | Perceptron | Binary | 0.874138 | 0.8707 |
| 4 | Word2Vec-custom | Perceptron | Binary | 0.850669 | 0.8501 |
| 5 | Word2Vec-Pre-Trained | Perceptron | Binary | 0.838494 | 0.838725 |
| 6 | TF-IDF | Naive Bayes | Binary | 0.869219 | 0.8654 |
| 7 | TF-IDF | Logistic | Binary | 0.916825 | 0.907975 |
| 8 | Word2Vec-custom | Logistic | Binary | 0.881644 | 0.880675 |
| 9 | Word2Vec-Pre-Trained | Logistic | Binary | 0.861375 | 0.8574 |


#### Terminologies
- "Custom" -> Word2vec model trained on given data set
- "Pretrained" -> Word2vec model trained by 
- "Concat" -> concatenate the first 10 Word2Vec vectors for each review as the input feature 
- "50 concat" -> limit the maximum review length to 50 by truncating longer reviews and padding shorter reviews with zero vectors
- "Average" -> average the Word2Vec vectors of each review as the input feature: $x = \frac{1}{N}\sum_{i=1}^{N} W_i$ for a review with N words.


### Observations
- Comparison of Semantic Similarities and Encoding Semantic Similarities:

    - The pretrained Word2Vec model tends to produce more intuitive semantic similarities between words compared to the Word2Vec model trained on your own dataset.

    - For example, in the comparison of "summer - hot + cold = winter", the pretrained model produces a semantic similarity with "winter", which aligns well with our expectations. However, the similarity score is relatively lower in the model trained on your own dataset.

    - Similarly, in the comparison of "worse - bad + good = better", the pretrained model again produces a semantic similarity with "better", whereas the model trained on your dataset returns "excellent" with a lower similarity score.

    - The same trend is observed in the comparison of "dad - mom + daughter = son", where the pretrained model provides a more relevant semantic similarity.

- Encoding Semantic Similarities:

    - The pretrained Word2Vec model, trained on a large corpus of diverse texts, seems to encode semantic similarities between words better. This is likely because the pretrained model has been trained on a vast amount of data, capturing richer semantic relationships.

    - On the other hand, the Word2Vec model trained on your own dataset may not have been exposed to as much diverse linguistic context, leading to less accurate semantic embeddings.

<ins> In summary, the pretrained Word2Vec model generally performs better in encoding semantic similarities between words, likely due to its training on a larger and more diverse dataset. However, the model trained on your own dataset can still be valuable if it captures domain-specific nuances or vocabulary not present in the pretrained model.</ins>
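
The analogy checks above reduce to vector arithmetic followed by a nearest-neighbour search under cosine similarity. The sketch below illustrates the idea on hand-made 2-D toy vectors. The vectors, the extra word "spring", and the helper names `cosine` and `analogy` are assumptions for illustration only; with real embeddings, gensim's `most_similar(positive=..., negative=...)` plays the same role.

```python
import numpy as np


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def analogy(positive, negative, vectors):
    """Word closest to sum(positive) - sum(negative), excluding the query words."""
    target = sum(vectors[w] for w in positive) - sum(vectors[w] for w in negative)
    candidates = [w for w in vectors if w not in positive and w not in negative]
    return max(candidates, key=lambda w: cosine(vectors[w], target))


# Toy 2-D embeddings: first axis ~ "is a season", second axis ~ temperature.
toy_vectors = {
    "summer": np.array([1.0, 1.0]),
    "winter": np.array([1.0, -1.0]),
    "spring": np.array([1.0, 0.2]),
    "hot": np.array([0.0, 1.0]),
    "cold": np.array([0.0, -1.0]),
}

# summer - hot + cold
best = analogy(positive=["summer", "cold"], negative=["hot"], vectors=toy_vectors)
print(best)
```

In this toy space the result is "winter" by construction. With trained embeddings the quality of the answer depends on how well the training corpus covers the words involved, which is the difference observed between the two models.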



- Feature Type Performance Comparison :

    - TF-IDF features showcase strong performance, especially notable in the SVM model for binary classification (Testing Accuracy: 0.908550), highlighting their effectiveness in picking out the terms that matter for sentiment classification.
    
    - The custom Word2Vec embeddings (trained on the review dataset) outperform the pretrained variants across several models, indicating that embeddings trained on in-domain reviews capture more task-relevant semantic nuances. This is particularly evident in the SVM model (Testing Accuracy: 0.880775) and the FNN model utilizing average Word2Vec embeddings for binary classification (Testing Accuracy: 0.899075).
    
    - Pretrained Word2Vec embeddings, while beneficial, generally underperform in comparison to custom embeddings and TF-IDF features, likely due to the generic nature of their training data.

<ins>In summary, TF-IDF features give the highest accuracy among the linear models (0.908550 with SVM versus 0.880775 with custom Word2Vec), custom Word2Vec embeddings consistently outperform the pre-trained ones, and custom embeddings combined with the CNN yield the highest binary accuracy overall (91.64%).</ins>
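
The feature comparison can be read directly off the results table by pivoting it so that each feature type becomes a column. The sketch below is one way to do this, using only the binary testing accuracies of the linear models listed in the table above. Naive Bayes is left out because it was run on TF-IDF only, and the variable names are illustrative rather than taken from the notebook.

```python
import pandas as pd

# Binary testing accuracies copied from the results table above.
rows = [
    ("TF-IDF", "SVM", 0.90855),
    ("Word2Vec-custom", "SVM", 0.880775),
    ("Word2Vec-Pre-Trained", "SVM", 0.858525),
    ("TF-IDF", "Perceptron", 0.8707),
    ("Word2Vec-custom", "Perceptron", 0.8501),
    ("Word2Vec-Pre-Trained", "Perceptron", 0.838725),
    ("TF-IDF", "Logistic", 0.907975),
    ("Word2Vec-custom", "Logistic", 0.880675),
    ("Word2Vec-Pre-Trained", "Logistic", 0.8574),
]
binary_linear = pd.DataFrame(rows, columns=["Input Data", "Model Name", "Testing Accuracy"])

# One row per model, one column per feature type.
comparison = binary_linear.pivot(index="Model Name", columns="Input Data", values="Testing Accuracy")

# Feature type with the highest testing accuracy for each model.
best_feature = comparison.idxmax(axis=1)
print(comparison)
print(best_feature)
```

For all three linear models the ordering is the same: TF-IDF first, custom Word2Vec second, pretrained Word2Vec third.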


- Model Accuracy Comparison : 
    
    - Modeling local word order with a CNN over the sequence of the first 50 Word2Vec vectors notably improves model performance in both binary and ternary classification, with the CNN model on custom embeddings achieving a Testing Accuracy of 0.916350 for binary classification.
        
    - FNN models, particularly with the "Average" strategy and custom Word2Vec embeddings, demonstrate substantial effectiveness, achieving a Testing Accuracy of 0.899075 for binary classification and 0.762120 for ternary classification, highlighting the strength of neural network approaches in sentiment classification.
        
<ins> Overall, FNN models provide competitive accuracy values compared to SVM and Perceptron models, showcasing the potential of neural networks for text classification tasks.</ins>