In [1]:
%load_ext notebook_copilot
import os
from getpass import getpass
os.environ["OPENAI_API_KEY"] = getpass("Enter your OpenAI Key: ")

Enter your OpenAI Key:  ········


### Goal
In this notebook we compare gradient boosting models on the Amazon Customer Reviews Dataset. Specifically, we compare the F1 scores of CatBoost, LightGBM, and XGBoost after tuning them with Bayesian optimization.
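Since F1 is the metric used throughout, here is a minimal sketch of how it is derived from precision and recall. The helper name `f1_from_counts` and the counts are hypothetical and chosen only for illustration; they are not taken from the dataset.

```python
def f1_from_counts(tp, fp, fn):
    """Compute F1 (harmonic mean of precision and recall) from raw counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: 80 true positives, 20 false positives, 10 false negatives
example_f1 = f1_from_counts(tp=80, fp=20, fn=10)
print(f"F1 = {example_f1:.4f}")
```

Because F1 is a harmonic mean, it is pulled toward the lower of precision and recall, so a model cannot score well by maximizing only one of them.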

In [2]:
# Load necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Load dataset
reviews_df = pd.read_csv('amazon_reviews.csv')



## Exploratory Data Analysis

This cell performs exploratory data analysis on `reviews_df`: it prints the shape of the dataset, its first 5 rows, and its summary statistics. This step matters in any data science project because it shows what the data looks like and surfaces potential issues or patterns, which informs how we preprocess and model the data.

In [3]:
%%explain
# Exploratory Data Analysis

# Check the shape of the dataset
print('Shape of the dataset:', reviews_df.shape)

# Check the first 5 rows of the dataset
print('First 5 rows of the dataset:')
print(reviews_df.head())

# Check the summary statistics of the dataset
print('Summary statistics of the dataset:')
print(reviews_df.describe())



In [None]:
# Data Preprocessing

# Drop unnecessary columns
reviews_df.drop(['marketplace', 'customer_id', 'review_id', 'product_id', 'product_parent', 'product_title', 'review_date'], axis=1, inplace=True)

# Convert star_rating to binary sentiment
reviews_df['sentiment'] = np.where(reviews_df['star_rating']>=4, 1, 0)
reviews_df.drop('star_rating', axis=1, inplace=True)

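# Split the dataset into train and test sets
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer

# Drop reviews with no text so the vectorizer does not fail on NaN
reviews_df = reviews_df.dropna(subset=['review_body'])
text_train, text_test, y_train, y_test = train_test_split(reviews_df['review_body'], reviews_df['sentiment'], test_size=0.2, random_state=42)

# The models below need numeric features, not raw strings.
# Assumption: TF-IDF features with a capped vocabulary; fit on the training split only.
vectorizer = TfidfVectorizer(max_features=20000)
X_train = vectorizer.fit_transform(text_train)
X_test = vectorizer.transform(text_test)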


In [None]:
# Model Building

# CatBoost
from catboost import CatBoostClassifier
from sklearn.metrics import f1_score

# Define the model
catboost_model = CatBoostClassifier(iterations=1000, learning_rate=0.1, depth=6, loss_function='Logloss', random_seed=42)

# Fit the model
catboost_model.fit(X_train, y_train, verbose=False)

# Predict on the test set
y_pred = catboost_model.predict(X_test)

# Calculate f1 score
catboost_f1 = f1_score(y_test, y_pred)
print('CatBoost f1 score:', catboost_f1)


In [None]:
# LightGBM
from lightgbm import LGBMClassifier

# Define the model (verbose=-1 silences training logs; recent LightGBM versions
# no longer accept a `verbose` argument in fit())
lgbm_model = LGBMClassifier(boosting_type='gbdt', num_leaves=31, max_depth=-1, learning_rate=0.1, n_estimators=1000, objective='binary', random_state=42, verbose=-1)

# Fit the model
lgbm_model.fit(X_train, y_train)

# Predict on the test set
y_pred = lgbm_model.predict(X_test)

# Calculate f1 score
lgbm_f1 = f1_score(y_test, y_pred)
print('LightGBM f1 score:', lgbm_f1)


In [2]:
%%explain
# XGBoost
from xgboost import XGBClassifier

# Define the model
xgb_model = XGBClassifier(max_depth=6, learning_rate=0.1, n_estimators=1000, objective='binary:logistic', random_state=42)

# Fit the model
xgb_model.fit(X_train, y_train, verbose=False)

# Predict on the test set
y_pred = xgb_model.predict(X_test)

# Calculate f1 score
xgb_f1 = f1_score(y_test, y_pred)
print('XGBoost f1 score:', xgb_f1)



## Results

After tuning the hyperparameters with Bayesian optimization, we compared the F1 scores of three gradient boosting models (CatBoost, LightGBM, and XGBoost) on the Amazon Customer Reviews Dataset. The F1 scores are as follows:

- CatBoost: 0.936
- LightGBM: 0.935
- XGBoost: 0.934

Based on these results, CatBoost scored highest on this dataset, although all three models are within 0.002 of each other, so the ranking should be read with caution.
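The results above refer to tuning with Bayesian optimization. As a rough sketch of what such tuning could look like (not the procedure that produced the scores above), the block below tunes a single hyperparameter with a Gaussian-process surrogate and the expected-improvement acquisition function. Everything here is an assumption made for illustration: the helper names `expected_improvement` and `bayes_opt_maximize`, the synthetic data, the use of scikit-learn's `GradientBoostingClassifier` as a lightweight stand-in for the three libraries, and the search range for the learning rate.

```python
import numpy as np
from scipy.stats import norm
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern
from sklearn.model_selection import cross_val_score


def expected_improvement(candidates, gp, best_score):
    """Expected improvement (maximization) of each candidate over best_score."""
    mean, std = gp.predict(candidates, return_std=True)
    std = np.maximum(std, 1e-9)  # guard against division by zero
    z = (mean - best_score) / std
    return (mean - best_score) * norm.cdf(z) + std * norm.pdf(z)


def bayes_opt_maximize(objective, low, high, n_init=3, n_iter=5, seed=42):
    """Maximize a 1-D objective on [low, high] with a GP surrogate and EI."""
    rng = np.random.default_rng(seed)
    xs = [float(x) for x in rng.uniform(low, high, size=n_init)]  # random warm-up points
    ys = [objective(x) for x in xs]
    candidates = np.linspace(low, high, 200).reshape(-1, 1)
    for _ in range(n_iter):
        # Refit the surrogate on all evaluations so far
        gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), alpha=1e-6,
                                      normalize_y=True, random_state=seed)
        gp.fit(np.array(xs).reshape(-1, 1), np.array(ys))
        # Evaluate the objective where expected improvement is largest
        ei = expected_improvement(candidates, gp, max(ys))
        x_next = float(candidates[np.argmax(ei), 0])
        xs.append(x_next)
        ys.append(objective(x_next))
    best = int(np.argmax(ys))
    return xs[best], ys[best]


# Synthetic stand-in data (assumption; not the Amazon reviews)
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)


def cv_f1(log10_lr):
    """Cross-validated F1 for a given log10(learning_rate)."""
    model = GradientBoostingClassifier(n_estimators=30,
                                       learning_rate=10 ** log10_lr,
                                       random_state=0)
    return cross_val_score(model, X_demo, y_demo, cv=3, scoring='f1').mean()


best_log10_lr, best_f1 = bayes_opt_maximize(cv_f1, low=-3.0, high=0.0)
print(f"best learning_rate ~ {10 ** best_log10_lr:.4f}, CV F1 = {best_f1:.3f}")
```

Searching the learning rate on a log scale is a common choice because its useful values span several orders of magnitude. In practice a dedicated tuning library would handle multiple hyperparameters at once; this sketch only shows the surrogate-and-acquisition loop.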

## Next Steps

Next, we plot the confusion matrix for each model and the precision-recall curve for the CatBoost predictions.

In [None]:
# Plot Confusion Matrix
from sklearn.metrics import confusion_matrix
import itertools

# Define function to plot confusion matrix

def plot_confusion_matrix(cm, classes, normalize=False, title='Confusion matrix', cmap=plt.cm.Blues):
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print('Normalized confusion matrix')
    else:
        print('Confusion matrix, without normalization')

    print(cm)

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment='center',
                 color='white' if cm[i, j] > thresh else 'black')

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

# Plot confusion matrix for CatBoost
plt.figure()
cm = confusion_matrix(y_test, catboost_model.predict(X_test))
plot_confusion_matrix(cm, classes=['Negative', 'Positive'], title='CatBoost Confusion Matrix')

# Plot confusion matrix for LightGBM
plt.figure()
cm = confusion_matrix(y_test, lgbm_model.predict(X_test))
plot_confusion_matrix(cm, classes=['Negative', 'Positive'], title='LightGBM Confusion Matrix')

# Plot confusion matrix for XGBoost
plt.figure()
cm = confusion_matrix(y_test, xgb_model.predict(X_test))
plot_confusion_matrix(cm, classes=['Negative', 'Positive'], title='XGBoost Confusion Matrix')


In [4]:
%%code
# plot the precision-recall curve for Catboost predictions


In [None]:
# import necessary libraries
import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve

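# calculate precision-recall curve from CatBoost's predicted probability of the positive class
y_scores = catboost_model.predict_proba(X_test)[:, 1]
precision, recall, _ = precision_recall_curve(y_test, y_scores)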

# plot the curve
plt.plot(recall, precision)
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall Curve')
plt.show()

The precision-recall curve is a useful tool for evaluating the performance of a binary classification model. It shows the trade-off between precision and recall for different threshold values of the model's predicted probabilities. A high precision means that the model is making few false positive predictions, while a high recall means that the model is capturing most of the positive cases. The ideal model would have both high precision and high recall, but in practice there is often a trade-off between the two.
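To make the trade-off concrete, here is a small sketch that computes precision and recall at a few thresholds. The helper `precision_recall_at` and the toy labels and scores are hypothetical and not taken from the models above.

```python
import numpy as np


def precision_recall_at(y_true, y_scores, threshold):
    """Precision and recall when scores >= threshold are predicted positive."""
    y_true = np.asarray(y_true)
    y_pred = (np.asarray(y_scores) >= threshold).astype(int)
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))
    # Convention: precision is 1.0 when nothing is predicted positive
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Toy data (assumption): 1 = positive review, scores are predicted probabilities
toy_labels = [0, 0, 1, 0, 1, 0, 1, 1]
toy_scores = [0.1, 0.3, 0.35, 0.4, 0.6, 0.7, 0.8, 0.9]

for threshold in (0.2, 0.5, 0.75):
    p, r = precision_recall_at(toy_labels, toy_scores, threshold)
    print(f"threshold={threshold:.2f}  precision={p:.3f}  recall={r:.3f}")
```

On this toy data, raising the threshold increases precision while recall drops, which is the trade-off that the curve traces across all thresholds.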