# PA6: Sentiment Classification 
##### By: Jon Llamado | NLP1000, XX22


## Description

This notebook covers sentiment classification of tweet data. The data comes from a Kaggle competition: https://www.kaggle.com/competitions/nlp1000-ml-challenge-2/rules

I will first pre-process the data to be compatible with Multinomial Naive Bayes.

## Data Pre-processing

In [1]:
import pandas as pd
import numpy as np
import re

from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.feature_extraction.text import CountVectorizer

import nltk

nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\PC\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

### View of both the training and testing data

In [2]:
train_data = pd.read_csv('train.csv')
train_data.head()

Unnamed: 0,DocumentId,Text,IsPositive
0,2388179,never again LINK,True
1,657251,ma okey rang tanan promise naas lord dita nya ...,True
2,1730789,MENTION yawaaa gyud oh ?,True
3,868789,"bahalag mangalata ang all, dili jud ko mag gma...",False
4,1570427,makatambok gyud diay ang wholeday na tulog,False


In [3]:
test_data = pd.read_csv('test.csv')
test_data.head()

Unnamed: 0,DocumentId,Text
0,1450521,hi baby boy LINK
1,889575,MENTION hahahaha. nice ra tan awon pero patay ...
2,1168201,nadagdagan ang tropa simula nung napunta ako s...
3,81808,"sooo, i started cleaning my old room and found..."
4,188577,"and stop calling it ""good government"" dahil sa..."


### Initialization of X Data

A function cleans the feature column (Text) of both the training and testing data.
1. The Text column is converted to a NumPy array
2. Each text is lowercased
3. Using regex, only runs of letters are kept. This returns a list of words.
4. The words are joined back into a sentence
5. The cleaned sentence is appended to a new Python list; this repeats for every row
6. The list of cleaned sentences is returned

In [18]:
def clean_x_data(dataframe):
    data = []
    for text in dataframe['Text'].to_numpy(): # Make sure the text is numpy array first
        # Lowercase all text
        text = text.lower()

        # Get regex of only letters
        text = re.findall('[a-zA-Z]+', text)

        # Join the array of words separated by spaces
        cleaned_text = " ".join(text)

        # Append the cleaned text to the output list
        data.append(cleaned_text)

    return data
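
As a quick check of the steps above, here is a minimal sketch that applies the same lowercase, regex, and join steps to a single string. The helper name `clean_text` and the sample strings are assumptions for illustration only; they are not part of the notebook.

```python
import re

def clean_text(text):
    """Hypothetical single-string version of the cleaning steps."""
    # Lowercase first, then keep only runs of ASCII letters
    words = re.findall(r"[a-z]+", text.lower())
    # Join the surviving words back into one sentence
    return " ".join(words)

sample = "MENTION yawaaa gyud oh ? 123"
cleaned = clean_text(sample)
print(cleaned)  # mention yawaaa gyud oh

# Non-ASCII letters are not matched, so they split words apart
accented = clean_text("Niño!")
print(accented)  # ni o
```

Punctuation and digits are dropped, and so are non-ASCII letters such as "ñ". If such letters matter for the tweets, a wider pattern may be worth trying.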
    

A count vectorizer is initialized and fitted on the training data to learn its vocabulary.

The fitted vectorizer then transforms both the training and testing data into sparse matrices of word counts, which is the input format that Multinomial Naive Bayes expects.

In [None]:
X_train = clean_x_data(train_data)
X_test = clean_x_data(test_data)

# Initialize count vectorizer. X_train data is used here.
count_vect = CountVectorizer()
count_vect.fit(X_train)

# Optional: the vocabulary terms, in the same order as the matrix columns
# word_features = count_vect.get_feature_names_out()

# Get count sparse matrix for training and test set
X_train = count_vect.transform(X_train)
X_test = count_vect.transform(X_test)

As the shapes of X_train and X_test show, both matrices have the same number of columns (76,484 features), because both were transformed with the vocabulary learned from the training data.

In [19]:
X_train.shape

(105845, 76484)

In [20]:
X_test.shape

(59538, 76484)

### Initializing y_train

A label encoder will be used for the values in the "IsPositive" column. This is also so it is compatible with Multinomial Naive Bayes.

In [7]:
label_enc = preprocessing.LabelEncoder()
train_data["IsPositive"] = label_enc.fit_transform(train_data['IsPositive'])
train_data["IsPositive"]

0         1
1         1
2         1
3         0
4         0
         ..
105840    1
105841    1
105842    1
105843    1
105844    1
Name: IsPositive, Length: 105845, dtype: int64

Inspect the encoder's mapping to confirm that False is encoded as 0 and True as 1.

In [8]:
mapping = dict(zip(label_enc.classes_, label_enc.transform(label_enc.classes_)))

print("Mapping:", mapping)

Mapping: {False: 0, True: 1}
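
To make the round trip concrete, here is a minimal sketch of encoding and decoding boolean labels with `LabelEncoder`. The label values are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Toy boolean labels, assumed for illustration only
labels = np.array([True, True, False, True])

enc = LabelEncoder()
encoded = enc.fit_transform(labels)       # classes are sorted, so False -> 0, True -> 1
decoded = enc.inverse_transform(encoded)  # back to the original boolean values

print(encoded)  # [1 1 0 1]
print(decoded)  # [ True  True False  True]
```

For a boolean column, `labels.astype(int)` gives the same numbers. The encoder is mainly convenient because `inverse_transform` can turn predictions back into True/False later.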


In [9]:
train_data.head(5)

Unnamed: 0,DocumentId,Text,IsPositive
0,2388179,never again LINK,1
1,657251,ma okey rang tanan promise naas lord dita nya ...,1
2,1730789,MENTION yawaaa gyud oh ?,1
3,868789,"bahalag mangalata ang all, dili jud ko mag gma...",0
4,1570427,makatambok gyud diay ang wholeday na tulog,0


Initialize y_train by converting the "IsPositive" column to a NumPy array.

In [10]:
y_train = train_data['IsPositive'].to_numpy()
y_train.shape

(105845,)

### Check shapes of training and testing data

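X_train and y_train have the same number of rows, and X_train and X_test have the same number of columns, so the data is ready for the model.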

In [32]:
print(f"Shape of X_train: \t{X_train.shape}")
print(f"Shape of y_train: \t{y_train.shape}")

print(f"\nShape of X_test: \t{X_test.shape}")

Shape of X_train: 	(105845, 76484)
Shape of y_train: 	(105845,)

Shape of X_test: 	(59538, 76484)


## Training the Model

### Function for computing the accuracy of the model

This function computes the percentage of predictions that match the actual labels. It is used below on the training data.

In [12]:
def compute_accuracy(predictions, actual):
    """
    Compute accuracy given predicted and actual values.

    Parameters:
    - predictions: Numpy array of shape (N,) representing predicted values
    - actual: Numpy array of shape (N,) representing actual (target) values

    Returns:
    - accuracy: Scalar representing the percentage of matching values
    """
    # Ensure predictions and actual have the same length
    if len(predictions) != len(actual):
        raise ValueError("Lengths of predictions and actual must be the same.")

    # Calculate accuracy
    correct_predictions = np.sum(predictions == actual)
    total_samples = len(actual)
    accuracy = correct_predictions / total_samples

    # Convert to percentage
    accuracy_percentage = accuracy * 100.0

    return accuracy_percentage
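
As a sanity check on this calculation, the sketch below compares the same manual formula with scikit-learn's `accuracy_score` on toy arrays. The arrays are made up for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Toy predictions and labels, assumed for illustration only
predicted = np.array([1, 0, 1, 1, 0])
actual = np.array([1, 0, 0, 1, 1])

# Manual version: fraction of matching positions, as a percentage
manual_pct = np.mean(predicted == actual) * 100.0

# scikit-learn returns a fraction in [0, 1], so scale it to a percentage
sklearn_pct = accuracy_score(actual, predicted) * 100.0

print(manual_pct, sklearn_pct)
```

Three of the five positions match, so both values should come out to about 60%.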

### Initializing Multinomial Naive Bayes

In [13]:
naive_bayes = MultinomialNB()

Note: Training this model on my PC took 9 minutes. I am running an AMD Ryzen 7 5800X 8-Core Processor @ 3.80GHz with 32GB of RAM.

In [14]:
naive_bayes.fit(X_train, y_train)

## Testing the Model

Predict on the training data and compute the accuracy using the compute_accuracy function.

In [34]:
predictions = naive_bayes.predict(X_train)

In [35]:
print("Training accuracy: ", compute_accuracy(predictions, y_train), "%")

Training accuracy:  76.9493126741934 %
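
Training accuracy is measured on the rows the model was fitted on, so it can overstate performance on unseen tweets. One option, sketched below, is to hold out part of the labelled data with `train_test_split` (already imported above) and score on that part. The toy texts and labels are hypothetical stand-ins, not competition data, and this is a sketch rather than the method used in this notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data, not taken from the competition files
texts = [
    "nice day today", "love this song", "so happy now", "great food here",
    "bad day today", "hate this song", "so sad now", "awful food here",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Hold out 25% of the labelled rows, keeping the class balance
txt_fit, txt_val, y_fit, y_val = train_test_split(
    texts, labels, test_size=0.25, random_state=0, stratify=labels
)

# Fit the vocabulary on the fitting split only, then reuse it
vec = CountVectorizer()
X_fit = vec.fit_transform(txt_fit)
X_val = vec.transform(txt_val)

model = MultinomialNB()
model.fit(X_fit, y_fit)

# score() returns the fraction of correct predictions on the held-out rows
val_accuracy = model.score(X_val, y_val)
print(f"Validation accuracy: {val_accuracy:.2f}")
```

With data this small the number itself means little. The point is only the order of operations: split first, fit the vectorizer and model on one part, score on the other.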


## Submission

Prediction of testing data

In [21]:
predictions = naive_bayes.predict(X_test)

Concatenate the DocumentId column with the predictions, then use the label encoder fitted earlier on the "IsPositive" column to convert the predictions back to True/False. This is the format required by the competition.

In [26]:
submission = pd.concat([test_data["DocumentId"], pd.Series(predictions, name="IsPositive")], axis=1)
submission["IsPositive"] = label_enc.inverse_transform(submission["IsPositive"])
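
`pd.concat(..., axis=1)` aligns rows by index label, which works here because both pieces carry the default 0..N-1 index. If the test frame had been filtered or re-indexed, building the frame from plain arrays avoids any misalignment. The sketch below shows that variant on three rows. The frame and variable names are assumptions, and `astype(bool)` relies on the False -> 0, True -> 1 mapping printed earlier.

```python
import numpy as np
import pandas as pd

# Small stand-in for test_data, using the first three ids shown in the head() above
test_df = pd.DataFrame({"DocumentId": [1450521, 889575, 1168201]})
preds = np.array([1, 1, 0])  # assumed model output for these rows

# Build the frame positionally so the index plays no role
submission_sketch = pd.DataFrame({
    "DocumentId": test_df["DocumentId"].to_numpy(),
    "IsPositive": preds.astype(bool),  # assumes 0 = False, 1 = True
})

print(submission_sketch)
```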

In [30]:
submission

Unnamed: 0,DocumentId,IsPositive
0,1450521,True
1,889575,True
2,1168201,False
3,81808,True
4,188577,False
...,...,...
59533,493718,True
59534,748831,False
59535,1454668,True
59536,947591,False


Export the dataframe to a csv file, not including the index

In [31]:
submission.to_csv('submission.csv', index=False)