# **CSC461: Final Project**
## <u>Team Members:</u> *Javier Sin & Nicolás Pelegrín*
---

## 0. Data import and visualization

In [1]:
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('darkgrid')



The dataset contains 50k IMDB reviews for sentiment analysis. There are two columns: the first holds the review text and the second holds the sentiment of the review (`'negative'` or `'positive'`).

In [2]:
df_review = pd.read_csv("IMDB Dataset.csv")
df_review

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive
...,...,...
49995,I thought this movie did a down right good job...,positive
49996,"Bad plot, bad dialogue, bad acting, idiotic di...",negative
49997,I am a Catholic taught in parochial elementary...,negative
49998,I'm going to have to disagree with the previou...,negative


## 1. Metrics

Now we'll cite the [metrics](https://scikit-learn.org/stable/modules/model_evaluation.html#classification-metrics) we'll be using during the project:
* [Accuracy](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html#sklearn.metrics.accuracy_score)
* [F1 Score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html#sklearn.metrics.f1_score)


* **Accuracy**: the fraction of predictions that are correct. It is appropriate when the classes are relatively balanced. If the dataset is imbalanced, accuracy can be misleading, because a model that mostly predicts the majority class still scores well.
* **F1 score**: the harmonic mean of precision and recall, so it takes both false positives and false negatives into account. It is a better measure than accuracy for imbalanced datasets (which is not the case here).
* **ROC-AUC**: the area under the ROC curve, which plots the true positive rate against the false positive rate across decision thresholds. It suits binary classification problems such as this one and is less sensitive to class imbalance than accuracy.

In this case, as we're going to use balanced datasets, we'll be using <u>accuracy</u> as our main metric. We'll also be using the <u>F1 score</u> to have a better understanding of the model's performance.
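As a minimal sketch of how these three metrics behave, the snippet below computes them on a handful of made-up labels. The labels and scores are invented for illustration and are not taken from the IMDB data. Note that with string labels, `f1_score` needs an explicit `pos_label`, and that `roc_auc_score` expects scores or probabilities rather than hard predictions.

```python
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

# Toy labels, invented for illustration (not taken from the IMDB data)
y_true = ["positive", "negative", "positive", "negative", "positive", "negative"]
y_pred = ["positive", "negative", "negative", "negative", "positive", "positive"]

# Accuracy: fraction of predictions that match the true label
acc = accuracy_score(y_true, y_pred)

# F1: harmonic mean of precision and recall for the class named by pos_label.
# With string labels, pos_label must be passed explicitly (the default is 1).
f1 = f1_score(y_true, y_pred, pos_label="positive")

# ROC-AUC is computed from scores, not hard labels.
# y_score holds hypothetical predicted probabilities of the positive class.
y_true_bin = [1 if label == "positive" else 0 for label in y_true]
y_score = [0.9, 0.2, 0.4, 0.1, 0.8, 0.6]
auc = roc_auc_score(y_true_bin, y_score)

print(f"accuracy={acc:.3f}  f1={f1:.3f}  roc_auc={auc:.3f}")
```

In the notebook itself, scores for ROC-AUC would typically come from a fitted model's `predict_proba`, which is an assumption about how the metric would be wired in, since the original cells only compute accuracy.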

We'll also define a **pandas DataFrame** to store the results obtained for each model, so that we can compare them at the end of this notebook.

In [15]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, precision_recall_curve, roc_auc_score, confusion_matrix

df = pd.DataFrame() # Global variable with the table storing the results of the models

def add_to_df(y_train, y_train_prediction, y_test, y_test_prediction, method, comments):
    # Modifying a global variable inside a function must be done explicitly
    global df

    df_aux = pd.DataFrame(
        {
          "Method"    : [method],
          # "R2_train"  : [r2_score(y_train, y_train_prediction)],
          # "R2_test"   : [r2_score(y_test, y_test_prediction)],
          "accuracy_train" : [accuracy_score(y_train, y_train_prediction)],
          "accuracy_test" : [accuracy_score(y_test, y_test_prediction)],
          # "precision_train" : [precision_score(y_train, y_train_prediction)],
          # "precision_test" : [precision_score(y_test, y_test_prediction)],
          # "recall_train" : [recall_score(y_train, y_train_prediction)],
          # "recall_test" : [recall_score(y_test, y_test_prediction)],
          # "f1_train" : [f1_score(y_train, y_train_prediction)],
          # "f1_test" : [f1_score(y_test, y_test_prediction)],
          # "precision_recall_curve_train" : [precision_recall_curve(y_train, y_train_prediction)],
          # "precision_recall_curve_test" : [precision_recall_curve(y_test, y_test_prediction)],
          # "roc_auc_train" : [roc_auc_score(y_train, y_train_prediction)],
          # "roc_auc_test" : [roc_auc_score(y_test, y_test_prediction)],
          # "confusion_matrix_train" : [confusion_matrix(y_train, y_train_prediction)],
          # "confusion_matrix_test" : [confusion_matrix(y_test, y_test_prediction)],
          # "RMSE_train": [mean_squared_error(y_train, y_train_prediction, squared = False)],
          # "RMSE_test" : [mean_squared_error(y_test, y_test_prediction, squared = False)],
          "Comments": [comments]
        }
    )

    df = pd.concat([df, df_aux], ignore_index=True)
    return df_aux

---
## 2. Preparing the data

Now we'll prepare the data for the models. First we'll split the data into training and testing sets. Then we'll use the `TfidfVectorizer` to convert the text data into a matrix of TF-IDF features, which will be used as input for the models.

In [4]:
from sklearn.model_selection import train_test_split

X = df_review['review']
y = df_review['sentiment']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

y_train.value_counts() # Checking the balance of the classes

sentiment
negative    20039
positive    19961
Name: count, dtype: int64

In [5]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(stop_words='english')
X_train_vect = vectorizer.fit_transform(X_train)
X_test_vect = vectorizer.transform(X_test)

In [6]:
print(X_train.shape)
print(X_train_vect.shape)

(40000,)
(40000, 92692)
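To make the fit/transform split concrete, here is a small sketch on a toy corpus. The documents and the names `train_docs`, `test_docs` and `vec` are made up for illustration. The vocabulary is learned only from the training documents, so the test matrix has the same number of columns and words seen only at test time are ignored.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus, invented for illustration
train_docs = ["a great movie", "a terrible movie", "great acting and great plot"]
test_docs = ["great soundtrack"]

vec = TfidfVectorizer(stop_words="english")

# fit_transform learns the vocabulary and IDF weights from the training docs
train_matrix = vec.fit_transform(train_docs)

# transform reuses that vocabulary; nothing is learned from the test docs
test_matrix = vec.transform(test_docs)

print(sorted(vec.vocabulary_))                # terms kept from the training docs
print(train_matrix.shape, test_matrix.shape)  # same number of columns
print("soundtrack" in vec.vocabulary_)        # unseen test-only word is dropped
```

Calling `fit_transform` on the test set instead would build a different vocabulary, and the resulting columns would no longer line up with what the model was trained on.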


## 3. Models to be used

Our goal is to create and tune every model and then compare which one performs best on this particular dataset. Instead of just fitting the default models and keeping the one with the best accuracy, we're going to tune them all and compare the results.

### 3.1: Decision Tree

The first model we're going to test is the [Decision Tree](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier).

#### 3.1.1: Model creation

In [17]:
# Import the Decision Tree Classifier
from sklearn.tree import DecisionTreeClassifier

# Create the Decision Tree Classifier
dt = DecisionTreeClassifier(random_state=42)

# Train the model
dt.fit(X_train_vect, y_train)

# Predict the training set
y_train_prediction = dt.predict(X_train_vect)

# Predict the test set
y_test_prediction = dt.predict(X_test_vect)


In [16]:
# Add the results to the df
add_to_df(y_train, y_train_prediction, y_test, y_test_prediction, "Decision Tree", "Default parameters")

Unnamed: 0,Method,accuracy_train,accuracy_test,Comments
0,Decision Tree,1.0,0.7265,Default parameters
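A training accuracy of 1.0 next to a much lower test accuracy is typical of an unconstrained decision tree, which keeps splitting until it has memorized the training set. The sketch below reproduces that effect on synthetic noisy data, not on the IMDB reviews. All names (`X_syn`, `full_tree`, `shallow_tree`) and the choice of `max_depth=3` are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; flip_y randomly flips 20% of the labels (noise)
X_syn, y_syn = make_classification(
    n_samples=1000, n_features=20, flip_y=0.2, random_state=42
)
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=42
)

# Unconstrained tree: grows until every training sample is classified correctly
full_tree = DecisionTreeClassifier(random_state=42).fit(Xs_train, ys_train)

# Depth-limited tree: cannot memorize the noise
shallow_tree = DecisionTreeClassifier(max_depth=3, random_state=42).fit(Xs_train, ys_train)

for name, tree in [("unconstrained", full_tree), ("max_depth=3", shallow_tree)]:
    print(
        f"{name}: depth={tree.get_depth()}, "
        f"train={tree.score(Xs_train, ys_train):.3f}, "
        f"test={tree.score(Xs_test, ys_test):.3f}"
    )
```

Comparing the printed train and test scores of the two trees shows how much of the training accuracy comes from memorization rather than generalization.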


#### 3.1.2: Model tuning

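The default tree reaches a training accuracy of 1.0 but a test accuracy of only 0.7265, which is a clear sign of overfitting. We'll now tune the model to reduce that overfitting and, ideally, improve the test results.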

In [None]:
from sklearn.model_selection import GridSearchCV

# Define the parameter grid
param_grid = {
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 10, 20],
    'min_samples_leaf': [1, 5, 10]
}

# Create the Decision Tree Classifier
dt = DecisionTreeClassifier(random_state=42)

# Use GridSearchCV to find the best parameters
grid_search = GridSearchCV(dt, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train_vect, y_train)

# Get the best estimator. GridSearchCV already refits it on the full
# training set by default (refit=True), so no extra call to fit() is needed.
best_dt = grid_search.best_estimator_

# Predict the training set
y_train_prediction = best_dt.predict(X_train_vect)

# Predict the test set
y_test_prediction = best_dt.predict(X_test_vect)

# Evaluate the performance
train_accuracy = accuracy_score(y_train, y_train_prediction)
test_accuracy = accuracy_score(y_test, y_test_prediction)

print(f"Train Accuracy: {train_accuracy}")
print(f"Test Accuracy: {test_accuracy}")

Train Accuracy: 0.801775
Test Accuracy: 0.7434


In [21]:
# Print the best configuration of parameters obtained by GridSearchCV
print("Best parameters found by GridSearchCV:")
print(grid_search.best_params_)

Best parameters found by GridSearchCV:
{'max_depth': 20, 'min_samples_leaf': 10, 'min_samples_split': 2}
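Beyond `best_params_`, the fitted search object also keeps the score of every configuration in `cv_results_`, which can help to see how close the runner-up settings were. The following is a minimal sketch on small synthetic data with a reduced grid so that it runs quickly. The names `toy_grid`, `toy_search` and `cv_table` are hypothetical and the data is not the IMDB matrix.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Small synthetic stand-in for the TF-IDF matrix (illustration only)
X_toy, y_toy = make_classification(n_samples=200, n_features=10, random_state=42)

# Reduced grid: 2 x 2 = 4 configurations
toy_grid = {"max_depth": [None, 3], "min_samples_leaf": [1, 10]}

toy_search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    toy_grid,
    cv=3,
    scoring="accuracy",
    return_train_score=True,  # also record training-fold scores
)
toy_search.fit(X_toy, y_toy)

# One row per configuration, best-ranked first
cv_table = (
    pd.DataFrame(toy_search.cv_results_)[
        ["params", "mean_train_score", "mean_test_score", "rank_test_score"]
    ]
    .sort_values("rank_test_score")
    .reset_index(drop=True)
)
print(cv_table)
```

Here `mean_test_score` is the mean accuracy on the held-out cross-validation folds, not on the notebook's separate test set.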


In [None]:
# Plot the metrics

In [19]:
add_to_df(y_train, y_train_prediction, y_test, y_test_prediction, "Decision Tree", "Tuned parameters using GridSearchCV")

Unnamed: 0,Method,accuracy_train,accuracy_test,Comments
0,Decision Tree,0.801775,0.7434,Tuned parameters using GridSearchCV


### 3.2: Naive Bayes

#### 3.2.1: Model creation

#### 3.2.2: Model tuning

### 3.3: Logistic Regression

#### 3.3.1: Model creation

#### 3.3.2: Model tuning

## Model comparison

Finally, we print the table with the performance of every model so that the differences between them can be compared at a glance.
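One possible way to present that comparison is sketched below. It rebuilds a small table from the two Decision Tree results reported earlier in this notebook, adds a train/test gap column as a rough overfitting indicator, and sorts by test accuracy. The name `results` and the `train_test_gap` column are illustrative assumptions. In the notebook, the global `df` filled by `add_to_df` would take the place of `results`.

```python
import pandas as pd

# The two Decision Tree rows reported earlier in this notebook
results = pd.DataFrame(
    {
        "Method": ["Decision Tree", "Decision Tree"],
        "accuracy_train": [1.0, 0.801775],
        "accuracy_test": [0.7265, 0.7434],
        "Comments": ["Default parameters", "Tuned parameters using GridSearchCV"],
    }
)

# Gap between training and test accuracy: a larger gap suggests more overfitting
results["train_test_gap"] = results["accuracy_train"] - results["accuracy_test"]

# Best test accuracy first
ranked = results.sort_values("accuracy_test", ascending=False).reset_index(drop=True)
print(ranked)
```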