<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 5px; height: 50px">

# Project 3: Web APIs & NLP

### Project Title: Generative AI and Art - understanding and predicting chatter from online communities

**DSI-41 Group 2**: Muhammad Faaiz Khan, Lionel Foo, Gabriel Tan

## Part 4: Modeling


### 4.1 Imports
___

In [1]:
import pandas as pd
import numpy as np
from nltk.stem import WordNetLemmatizer
import nltk
nltk.download('wordnet')
from nltk.corpus import stopwords

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

# plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay replaces it
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from sklearn.metrics import accuracy_score, classification_report

# Import the vectorizers and scikit-learn's built-in English stop-word list.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, ENGLISH_STOP_WORDS

import time

# this setting widens how many characters pandas will display in a column:
pd.options.display.max_colwidth = 4000

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Aspire\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [2]:
# Load data:
reddit = pd.read_csv('../data/reddit_df_2.csv')

# Check data:
reddit.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6030 entries, 0 to 6029
Data columns (total 14 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   subr-def_ai         6030 non-null   int64  
 1   is_op               6030 non-null   int64  
 2   author              6030 non-null   object 
 3   post_id             6030 non-null   object 
 4   body                6030 non-null   object 
 5   upvotes             6030 non-null   int64  
 6   num_comments        6030 non-null   int64  
 7   post_length         6030 non-null   int64  
 8   post_word_count     6030 non-null   int64  
 9   neg                 6030 non-null   float64
 10  neu                 6030 non-null   float64
 11  pos                 6030 non-null   float64
 12  compound            6030 non-null   float64
 13  subjectivity_score  6030 non-null   float64
dtypes: float64(5), int64(6), object(3)
memory usage: 659.7+ KB


### 4.2 Preparing dataframe for Modelling
___

The following steps are performed to prepare the dataframe before modelling:

1. Create predictor and target variables (X & y).
2. Lemmatize the text in X.
3. Perform train-test split.

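We will also define the additional stop words discussed in Section 3 EDA. Several of them are fragments left behind by lemmatization: with its default noun setting, the WordNet lemmatizer turns 'was' into 'wa', 'has' into 'ha' and 'does' into 'doe'.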


In [3]:
# Creating X (features) and y (target)
X = reddit['body']  # Features
y = reddit['subr-def_ai']  # Target

In [4]:
# Check for distribution of class
y.value_counts()

subr-def_ai
1    3015
0    3015
Name: count, dtype: int64

In [5]:
# Create a lemmatizer object
lemmatizer = WordNetLemmatizer()

# Define a function to perform lemmatization on a text
def lemmatize_text(text):
    """Lemmatize each whitespace-separated token (treated as a noun, the WordNet default)."""
    return ' '.join(lemmatizer.lemmatize(word) for word in text.split())

In [6]:
# Apply lemmatization to all rows in X
X_lemmatized = X.apply(lemmatize_text)

In [7]:
# Perform train-test split with 20% test size and stratify with y
X_train, X_test, y_train, y_test = train_test_split(X_lemmatized, y, test_size=0.2, stratify=y, random_state=42)

In [8]:
# Additional stop words to remove
additional_stop_words = ['wa', 'ha', 'doe', 'did', 've', 'ca', 'll', 'gon', 'don', 'wan', 'na']

# Combine scikit-learn's built-in English stop words with the additional stop words
all_stop_words = list(set(ENGLISH_STOP_WORDS).union(additional_stop_words))

### 4.3 Performing Modelling
___

The goal of the classifier is to predict which subreddit a post comes from: r/DefendingAIArt (1) or r/ArtistHate (0). We will assess five different models and compare them on test accuracy, the gap between training and test accuracy, grid-search time, and the per-class precision and recall in the classification report.

We will run the following models on our training data:

1. Naive Bayes model (MultinomialNB)
    - An intuitive probabilistic model based on Bayes' theorem, particularly suitable for text classification tasks.
2. Logistic Regression model
    - A versatile linear model known for its simplicity and interpretability.
3. RandomForest Classifier
    - A robust ensemble model that leverages multiple decision trees for improved performance.
4. K-Nearest Neighbors (KNN)
    - A non-parametric method that classifies instances based on the majority class of their k-nearest neighbors.
5. AdaBoost Classifier
    - A boosting algorithm that combines weak learners to create a strong classifier.



### 4.3.1 Naive Bayes model (MultinomialNB)
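
The `nb__alpha` values searched in the cell below control additive (Laplace/Lidstone) smoothing of the per-class word likelihoods. As a minimal sketch of what that parameter does, the snippet here computes smoothed likelihoods by hand. The toy corpus, its labels, and the helper names `smoothed_log_likelihoods` and `predict` are invented for illustration and are not part of the project code or data.

```python
import math
from collections import Counter

# Toy corpus, invented purely for illustration (not project data).
docs = [
    ("ai art is great", 1),
    ("ai tools help artists", 1),
    ("ai art is theft", 0),
    ("artists hate theft", 0),
]

def smoothed_log_likelihoods(docs, alpha):
    """Return {class: {word: log P(word | class)}} with additive smoothing."""
    vocab = sorted({w for text, _ in docs for w in text.split()})
    counts = {}
    for text, label in docs:
        counts.setdefault(label, Counter()).update(text.split())
    table = {}
    for label, counter in counts.items():
        # alpha is added to every word count, so unseen words keep a small probability
        total = sum(counter.values()) + alpha * len(vocab)
        table[label] = {w: math.log((counter[w] + alpha) / total) for w in vocab}
    return table

def predict(text, table, log_prior):
    """Pick the class with the highest log posterior; out-of-vocabulary words are skipped."""
    scores = {}
    for label, word_logp in table.items():
        scores[label] = log_prior[label] + sum(
            word_logp[w] for w in text.split() if w in word_logp
        )
    return max(scores, key=scores.get)

table = smoothed_log_likelihoods(docs, alpha=0.2)
log_prior = {0: math.log(0.5), 1: math.log(0.5)}  # balanced classes, as in this dataset
print(predict("ai tools are great", table, log_prior))
print(predict("theft", table, log_prior))
```

A smaller `alpha` lets rare words pull the prediction harder towards the class they were seen in, while a larger `alpha` flattens the likelihoods; that trade-off is what the grid search tunes.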

In [11]:
# Create a pipeline with CountVectorizer and MultinomialNB
pipeline_nb = Pipeline([
    ('cvec', CountVectorizer(lowercase=True, stop_words=all_stop_words)),
    ('nb', MultinomialNB())
])

# Define parameter grid for grid search
param_grid_nb = {
    'cvec__max_features': [4000, 5000],
    'cvec__min_df': [2, 3, 4, 5],
    'cvec__max_df': [0.4, 0.6, 0.8],
    'cvec__ngram_range': [(1, 1), (1, 2), (1, 3)],
    'nb__alpha': [0.1, 0.2, 0.5, 0.8, 1.0],
}

# Perform grid search with cross-validation
grid_search_nb = GridSearchCV(pipeline_nb, param_grid_nb, cv=5, scoring='accuracy', verbose=1)
start_time = time.time()
grid_search_nb.fit(X_train, y_train)
end_time = time.time()

# Print best parameters
best_params_nb = grid_search_nb.best_params_
print("Best Parameters:")
print(best_params_nb)

# Print computational time
print(f"Grid Search took {end_time - start_time:.2f} seconds")

# Print accuracy score for the test set
y_pred_nb_test = grid_search_nb.best_estimator_.predict(X_test)
accuracy_nb_test = accuracy_score(y_test, y_pred_nb_test)
print("Accuracy Score on Test Set:", accuracy_nb_test)

# Print accuracy score for the training set
y_pred_nb_train = grid_search_nb.best_estimator_.predict(X_train)
accuracy_nb_train = accuracy_score(y_train, y_pred_nb_train)
print("Accuracy Score on Training Set:", accuracy_nb_train)

# Print classification report for the test set
classification_report_nb = classification_report(y_test, y_pred_nb_test)
print("Classification Report on Test Set:")
print(classification_report_nb)

Fitting 5 folds for each of 360 candidates, totalling 1800 fits
Best Parameters:
{'cvec__max_df': 0.4, 'cvec__max_features': 5000, 'cvec__min_df': 2, 'cvec__ngram_range': (1, 3), 'nb__alpha': 0.2}
Grid Search took 292.74 seconds
Accuracy Score on Test Set: 0.6846473029045643
Accuracy Score on Training Set: 0.8346129902469392
Classification Report on Test Set:
              precision    recall  f1-score   support

           0       0.70      0.64      0.67       603
           1       0.67      0.73      0.70       602

    accuracy                           0.68      1205
   macro avg       0.69      0.68      0.68      1205
weighted avg       0.69      0.68      0.68      1205
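
The "Fitting 5 folds for each of 360 candidates" line above follows directly from the size of the parameter grid. A minimal sketch of that arithmetic, re-declaring the Naive Bayes grid so the snippet runs on its own:

```python
from math import prod

# Same grid as param_grid_nb above, repeated so this snippet is standalone.
param_grid_nb = {
    'cvec__max_features': [4000, 5000],
    'cvec__min_df': [2, 3, 4, 5],
    'cvec__max_df': [0.4, 0.6, 0.8],
    'cvec__ngram_range': [(1, 1), (1, 2), (1, 3)],
    'nb__alpha': [0.1, 0.2, 0.5, 0.8, 1.0],
}

cv_folds = 5
# Every combination of values is one candidate
n_candidates = prod(len(values) for values in param_grid_nb.values())
# Each candidate is fitted once per fold (the final refit is not counted)
n_fits = n_candidates * cv_folds

print(f"{n_candidates} candidates x {cv_folds} folds = {n_fits} fits")
```

The same product explains the candidate counts reported for the other grids in this section, so adding one more value to any parameter multiplies the total search time accordingly.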



### 4.3.2 Logistic Regression model

In [None]:
# Create a pipeline with CountVectorizer and LogisticRegression
pipeline_lr = Pipeline([
    ('cvec', CountVectorizer(lowercase=True, stop_words=all_stop_words)),
    ('lr', LogisticRegression(max_iter=5000, random_state=42))
])

# Define parameter grid for grid search
param_grid_lr = {
    'cvec__max_features': [4000, 5000],
    'cvec__min_df': [2, 3, 4],
    'cvec__max_df': [0.40, 0.60, 0.80],
    'cvec__ngram_range': [(1, 1), (1, 2), (1, 3)],
    'lr__C': [0.001, 0.01, 0.1, 1, 10],
    'lr__penalty': [None, 'l2']  # C is ignored when penalty is None, so those candidates are redundant
}

# Perform grid search with cross-validation
grid_search_lr = GridSearchCV(pipeline_lr, param_grid_lr, cv=5, scoring='accuracy', verbose=1)
start_time_lr = time.time()
grid_search_lr.fit(X_train, y_train)
end_time_lr = time.time()

# Print best parameters
best_params_lr = grid_search_lr.best_params_
print("Best Parameters for Logistic Regression:")
print(best_params_lr)

# Print computational time
print(f"Grid Search took {end_time_lr - start_time_lr:.2f} seconds")

# Print accuracy score for the test set
y_pred_lr_test = grid_search_lr.best_estimator_.predict(X_test)
accuracy_lr_test = accuracy_score(y_test, y_pred_lr_test)
print("Accuracy Score on Test Set (Logistic Regression):", accuracy_lr_test)

# Print accuracy score for the training set
y_pred_lr_train = grid_search_lr.best_estimator_.predict(X_train)
accuracy_lr_train = accuracy_score(y_train, y_pred_lr_train)
print("Accuracy Score on Training Set (Logistic Regression):", accuracy_lr_train)

# Print classification report for the test set
classification_report_lr = classification_report(y_test, y_pred_lr_test)
print("Classification Report on Test Set (Logistic Regression):")
print(classification_report_lr)

Fitting 5 folds for each of 540 candidates, totalling 2700 fits
Best Parameters for Logistic Regression:
{'cvec__max_df': 0.4, 'cvec__max_features': 5000, 'cvec__min_df': 3, 'cvec__ngram_range': (1, 3), 'lr__C': 1, 'lr__penalty': 'l2'}
Grid Search took 526.61 seconds
Accuracy Score on Test Set (Logistic Regression): 0.7095435684647303
Accuracy Score on Training Set (Logistic Regression): 0.9626478522515045
Classification Report on Test Set (Logistic Regression):
              precision    recall  f1-score   support

           0       0.69      0.75      0.72       603
           1       0.73      0.67      0.70       602

    accuracy                           0.71      1205
   macro avg       0.71      0.71      0.71      1205
weighted avg       0.71      0.71      0.71      1205



In [16]:
# Retrieve the logistic regression model from the best estimator in the pipeline
lr_model = grid_search_lr.best_estimator_.named_steps['lr']

# Retrieve the CountVectorizer from the pipeline
cvec_model = grid_search_lr.best_estimator_.named_steps['cvec']

# Get feature names from CountVectorizer
feature_names = cvec_model.get_feature_names_out()

# Get coefficients and corresponding features
coefficients = lr_model.coef_.flatten()
coef_features = list(zip(coefficients, feature_names))

# Sort by absolute coefficient value (largest first) and keep the top 50
top_coef_features = sorted(coef_features, key=lambda x: abs(x[0]), reverse=True)[:50]

# Display the 50 largest coefficients by magnitude
# (positive -> pushes towards class 1, negative -> pushes towards class 0)
print("Top 50 Coefficients for Logistic Regression Model:")
for coef, feature in top_coef_features:
    print(f"{feature}: {coef:.4f}")

Top 50 Coefficients for Logistic Regression Model:
ai bros: -1.9346
aibros: -1.8121
copying: 1.7302
vaush: 1.7208
glaze: -1.6984
heart: -1.6333
voice: 1.5362
antis: 1.4486
bother: -1.4385
software: -1.4373
adobe: -1.4321
explained: -1.4202
nightshade: -1.4188
anti: 1.4163
sex: -1.4035
ai bro: -1.3768
film: -1.3737
furry: -1.3645
ubi: -1.3507
term ai: -1.3281
ml: -1.3247
thinking_face: 1.3052
break: 1.3026
wacom: -1.2914
crowd: 1.2775
creation: 1.2752
continue: 1.2742
anybody: 1.2620
issue ai: -1.2317
meaningless: 1.2287
religion: 1.2197
driven: 1.2136
pixel art: 1.2100
length: 1.2043
ignorance: 1.1945
people ai: -1.1928
fad: -1.1901
amazon: -1.1857
plagiarism: -1.1803
epic: 1.1792
heavily: 1.1792
luddite: 1.1762
left: 1.1721
incredible: 1.1450
porn: -1.1444
glad: -1.1443
sent: 1.1341
mad: 1.1272
campaign: 1.1245
ai user: -1.1164
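
Because logistic regression models the log-odds of class 1 (r/DefendingAIArt), each coefficient can be read as a multiplicative change in those odds per additional occurrence of the term, holding the other counts fixed. A short sketch of that reading, using a few coefficients copied from the output above:

```python
import math

# Coefficients copied from the printed output above.
coefs = {
    "ai bros": -1.9346,
    "copying": 1.7302,
    "glaze": -1.6984,
    "antis": 1.4486,
}

for term, coef in coefs.items():
    odds_ratio = math.exp(coef)  # change in odds of class 1 per extra occurrence
    leaning = "r/DefendingAIArt (1)" if coef > 0 else "r/ArtistHate (0)"
    print(f"{term!r}: odds x{odds_ratio:.2f} -> leans {leaning}")
```

An odds ratio below 1 means the term makes r/ArtistHate more likely, and one above 1 favors r/DefendingAIArt. The coefficients are fitted on raw counts, so they are not directly comparable across terms with very different frequencies.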


### 4.3.3 RandomForest Classifier model

In [13]:
# Create a pipeline with CountVectorizer and RandomForestClassifier
pipeline_rf = Pipeline([
    ('cvec', CountVectorizer(lowercase=True, stop_words=all_stop_words)),
    ('rf', RandomForestClassifier(random_state=42))
])

# Define parameter grid for grid search
param_grid_rf = {
    'cvec__max_features': [4000, 5000],
    'cvec__min_df': [2, 3],
    'cvec__max_df': [0.4, 0.6, 0.8],
    'cvec__ngram_range': [(1, 1), (1, 2)],
    'rf__n_estimators': [50, 100, 200],
    'rf__max_depth': [None, 10, 20],
    'rf__min_samples_split': [2, 5, 10],
    'rf__min_samples_leaf': [1, 2, 4]
}

# Perform grid search with cross-validation
grid_search_rf = GridSearchCV(pipeline_rf, param_grid_rf, cv=5, scoring='accuracy', verbose=1)
start_time_rf = time.time()
grid_search_rf.fit(X_train, y_train)
end_time_rf = time.time()

# Print best parameters
best_params_rf = grid_search_rf.best_params_
print("Best Parameters for Random Forest:")
print(best_params_rf)

# Print computational time
print(f"Grid Search took {end_time_rf - start_time_rf:.2f} seconds")

# Print accuracy score for the test set
y_pred_rf_test = grid_search_rf.best_estimator_.predict(X_test)
accuracy_rf_test = accuracy_score(y_test, y_pred_rf_test)
print("Accuracy Score on Test Set (Random Forest):", accuracy_rf_test)

# Print accuracy score for the training set
y_pred_rf_train = grid_search_rf.best_estimator_.predict(X_train)
accuracy_rf_train = accuracy_score(y_train, y_pred_rf_train)
print("Accuracy Score on Training Set (Random Forest):", accuracy_rf_train)

# Print classification report for the test set
classification_report_rf = classification_report(y_test, y_pred_rf_test)
print("Classification Report on Test Set (Random Forest):")
print(classification_report_rf)

Fitting 5 folds for each of 1944 candidates, totalling 9720 fits
Best Parameters for Random Forest:
{'cvec__max_df': 0.4, 'cvec__max_features': 4000, 'cvec__min_df': 3, 'cvec__ngram_range': (1, 1), 'rf__max_depth': None, 'rf__min_samples_leaf': 2, 'rf__min_samples_split': 5, 'rf__n_estimators': 100}
Grid Search took 3136.82 seconds
Accuracy Score on Test Set (Random Forest): 0.6464730290456432
Accuracy Score on Training Set (Random Forest): 0.9348412533720689
Classification Report on Test Set (Random Forest):
              precision    recall  f1-score   support

           0       0.67      0.58      0.62       603
           1       0.63      0.71      0.67       602

    accuracy                           0.65      1205
   macro avg       0.65      0.65      0.64      1205
weighted avg       0.65      0.65      0.64      1205



### 4.3.4 K-Nearest Neighbors (KNN) model

In [14]:
# Create a pipeline with CountVectorizer and KNeighborsClassifier
pipeline_knn = Pipeline([
    ('cvec', CountVectorizer(lowercase=True, stop_words=all_stop_words)),
    ('knn', KNeighborsClassifier())
])

# Define parameter grid for grid search with different values
param_grid_knn = {
    'cvec__max_features': [4000, 5000],
    'cvec__min_df': [2, 3, 4],
    'cvec__max_df': [0.4, 0.6, 0.8],
    'cvec__ngram_range': [(1, 1), (1, 2)],
    'knn__n_neighbors': [3, 5, 7, 10],  # Different values for n_neighbors
    'knn__weights': ['uniform', 'distance'],
    'knn__p': [1, 2]  # Different values for p (1: Manhattan, 2: Euclidean)
}

# Perform grid search with cross-validation
grid_search_knn = GridSearchCV(pipeline_knn, param_grid_knn, cv=5, scoring='accuracy', verbose=1)
start_time_knn = time.time()
grid_search_knn.fit(X_train, y_train)
end_time_knn = time.time()

# Print best parameters
best_params_knn = grid_search_knn.best_params_
print("Best Parameters for k-NN:")
print(best_params_knn)

# Print computational time
print(f"Grid Search took {end_time_knn - start_time_knn:.2f} seconds")

# Print accuracy score for the test set
y_pred_knn_test = grid_search_knn.best_estimator_.predict(X_test)
accuracy_knn_test = accuracy_score(y_test, y_pred_knn_test)
print("Accuracy Score on Test Set (k-NN):", accuracy_knn_test)

# Print accuracy score for the training set
y_pred_knn_train = grid_search_knn.best_estimator_.predict(X_train)
accuracy_knn_train = accuracy_score(y_train, y_pred_knn_train)
print("Accuracy Score on Training Set (k-NN):", accuracy_knn_train)

# Print classification report for the test set
classification_report_knn = classification_report(y_test, y_pred_knn_test)
print("Classification Report on Test Set (k-NN):")
print(classification_report_knn)

Fitting 5 folds for each of 576 candidates, totalling 2880 fits
Best Parameters for k-NN:
{'cvec__max_df': 0.4, 'cvec__max_features': 4000, 'cvec__min_df': 4, 'cvec__ngram_range': (1, 1), 'knn__n_neighbors': 3, 'knn__p': 2, 'knn__weights': 'distance'}
Grid Search took 524.49 seconds
Accuracy Score on Test Set (k-NN): 0.5502074688796681
Accuracy Score on Training Set (k-NN): 1.0
Classification Report on Test Set (k-NN):
              precision    recall  f1-score   support

           0       0.53      0.93      0.67       603
           1       0.70      0.17      0.28       602

    accuracy                           0.55      1205
   macro avg       0.61      0.55      0.48      1205
weighted avg       0.61      0.55      0.48      1205
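
The perfect training accuracy is expected rather than impressive. With `weights='distance'`, every training post is its own nearest neighbor at distance zero, so its own label decides the vote (barring duplicate posts with conflicting labels). A minimal pure-Python sketch of distance-weighted voting illustrates this; the 2-D points and labels are made up and stand in for the project's much larger count vectors.

```python
import math
from collections import defaultdict

# Made-up training points and labels, for illustration only.
train_X = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0), (6.0, 5.0)]
train_y = [0, 1, 1, 1, 0]

def knn_predict(query, k=3):
    """Distance-weighted k-NN vote using Euclidean distance (p=2)."""
    dists = sorted(
        (math.dist(query, x), label) for x, label in zip(train_X, train_y)
    )[:k]
    # A zero-distance neighbor would get infinite weight, so exact
    # matches decide the vote on their own.
    exact = [label for d, label in dists if d == 0.0]
    votes = defaultdict(float)
    if exact:
        for label in exact:
            votes[label] += 1.0
    else:
        for d, label in dists:
            votes[label] += 1.0 / d
    return max(votes, key=votes.get)

# Predicting on the training points themselves
train_preds = [knn_predict(x) for x in train_X]
train_accuracy = sum(p == y for p, y in zip(train_preds, train_y)) / len(train_y)
print("Training accuracy:", train_accuracy)
```

With uniform weights the point (0, 0) would be outvoted by its two class-1 neighbors. A training score of 1.0 therefore says little about generalization, and the test accuracy of about 0.55 is the figure to rely on.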



### 4.3.5 AdaBoost Classifier model
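
As a rough sketch of the boosting idea, the snippet below works through one round of the classic discrete AdaBoost update. This is the textbook formulation, not necessarily the exact variant scikit-learn runs internally, and the labels and weak-learner predictions are made up for illustration.

```python
import math

# Made-up labels and one weak learner's predictions (illustrative only).
y_true = [1, 1, 0, 0, 1]
y_weak = [1, 0, 0, 0, 1]  # the weak learner gets the second sample wrong

n = len(y_true)
weights = [1.0 / n] * n  # start with uniform sample weights

# Weighted error of the weak learner.
error = sum(w for w, t, p in zip(weights, y_true, y_weak) if t != p)

# Learner weight: larger when the weighted error is smaller.
alpha = 0.5 * math.log((1 - error) / error)

# Up-weight misclassified samples, down-weight correct ones, then normalize.
new_weights = [
    w * math.exp(alpha if t != p else -alpha)
    for w, t, p in zip(weights, y_true, y_weak)
]
total = sum(new_weights)
new_weights = [w / total for w in new_weights]

print(f"weighted error = {error:.2f}, learner weight = {alpha:.3f}")
print("updated sample weights:", [round(w, 3) for w in new_weights])
```

After the update the single misclassified sample carries half of the total weight, so the next weak learner is pushed to get it right. The `ada__learning_rate` values in the grid below shrink each learner's contribution.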

In [15]:
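# Create a pipeline with CountVectorizer and AdaBoostClassifier
# Note: `base_estimator` was renamed to `estimator` in scikit-learn 1.2 and removed in 1.4;
# on newer versions use `estimator=` and the grid key 'ada__estimator__max_depth'.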
pipeline_ada = Pipeline([
    ('cvec', CountVectorizer(lowercase=True, stop_words=all_stop_words)),
    ('ada', AdaBoostClassifier(
        base_estimator=DecisionTreeClassifier(random_state=42),
        random_state=42
    ))
])

# Define parameter grid for grid search with specified values
param_grid_ada = {
    'cvec__max_features': [4000, 5000],
    'cvec__min_df': [2, 3, 4],
    'cvec__max_df': [0.4, 0.6, 0.8],
    'cvec__ngram_range': [(1, 1), (1, 2)],
    'ada__n_estimators': [50, 75, 100],  # AdaBoost n_estimators
    'ada__base_estimator__max_depth': [1, 2],  # DecisionTree max_depth
    'ada__learning_rate': [0.8, 0.9, 1.0],  # AdaBoost learning_rate
}

# Perform grid search with cross-validation
grid_search_ada = GridSearchCV(pipeline_ada, param_grid_ada, cv=5, scoring='accuracy', verbose=1)
start_time_ada = time.time()
grid_search_ada.fit(X_train, y_train)
end_time_ada = time.time()

# Print best parameters
best_params_ada = grid_search_ada.best_params_
print("Best Parameters for AdaBoost:")
print(best_params_ada)

# Print computational time
print(f"Grid Search took {end_time_ada - start_time_ada:.2f} seconds")

# Print accuracy score for the test set
y_pred_ada_test = grid_search_ada.best_estimator_.predict(X_test)
accuracy_ada_test = accuracy_score(y_test, y_pred_ada_test)
print("Accuracy Score on Test Set (AdaBoost):", accuracy_ada_test)

# Print accuracy score for the training set
y_pred_ada_train = grid_search_ada.best_estimator_.predict(X_train)
accuracy_ada_train = accuracy_score(y_train, y_pred_ada_train)
print("Accuracy Score on Training Set (AdaBoost):", accuracy_ada_train)

# Print classification report for the test set
classification_report_ada = classification_report(y_test, y_pred_ada_test)
print("Classification Report on Test Set (AdaBoost):")
print(classification_report_ada)

Fitting 5 folds for each of 648 candidates, totalling 3240 fits
Best Parameters for AdaBoost:
{'ada__base_estimator__max_depth': 2, 'ada__learning_rate': 0.8, 'ada__n_estimators': 100, 'cvec__max_df': 0.4, 'cvec__max_features': 4000, 'cvec__min_df': 2, 'cvec__ngram_range': (1, 1)}
Grid Search took 1374.54 seconds
Accuracy Score on Test Set (AdaBoost): 0.674688796680498
Accuracy Score on Training Set (AdaBoost): 0.7943556754513384
Classification Report on Test Set (AdaBoost):
              precision    recall  f1-score   support

           0       0.65      0.74      0.70       603
           1       0.70      0.61      0.65       602

    accuracy                           0.67      1205
   macro avg       0.68      0.67      0.67      1205
weighted avg       0.68      0.67      0.67      1205
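
Collecting the figures printed in the five sections above gives a quick side-by-side view. The snippet below hard-codes those reported numbers (accuracies rounded to four decimal places) and derives the train-test gap as a rough overfitting indicator. It does not re-run any model.

```python
# Figures copied from the outputs above (accuracies rounded to 4 d.p.).
results = {
    # model: (test_acc, train_acc, grid_search_seconds)
    "Naive Bayes":         (0.6846, 0.8346, 292.74),
    "Logistic Regression": (0.7095, 0.9626, 526.61),
    "Random Forest":       (0.6465, 0.9348, 3136.82),
    "k-NN":                (0.5502, 1.0000, 524.49),
    "AdaBoost":            (0.6747, 0.7944, 1374.54),
}

# Print models sorted by test accuracy, highest first
print(f"{'Model':<20} {'Test':>6} {'Train':>6} {'Gap':>6} {'Time (s)':>9}")
for name, (test_acc, train_acc, seconds) in sorted(
    results.items(), key=lambda item: item[1][0], reverse=True
):
    gap = train_acc - test_acc  # larger gap suggests more overfitting
    print(f"{name:<20} {test_acc:>6.4f} {train_acc:>6.4f} {gap:>6.4f} {seconds:>9.2f}")

best_model = max(results, key=lambda name: results[name][0])
print("Highest test accuracy:", best_model)
```

On these numbers logistic regression has the highest test accuracy (about 0.71), while AdaBoost and Naive Bayes show the smallest gaps between training and test accuracy.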

