# Predictions, part IV
- drop columns: no
- scaler: yes
- hyperparameter tuning: yes
- one-hot encoding: yes, the dataset was already encoded
- **oversampling: yes**

In this session, I address class imbalance in the target column using three techniques:
1. oversampling,
2. undersampling,
3. SMOTE.

Target column "is_canceled" is split 37% vs 63% between canceled and not canceled bookings.\
To avoid redoing dozens of calculations, I took a shortcut.\
I created copies of notebooks 4 and 5, which contain the calculations for all the models on unscaled (notebook 4) and scaled (notebook 5) data.\
Then I oversampled the minority class of the training data and ran the models anew.\
Oversampling reduced every model's accuracy score, by up to 6%.\
I repeated the process with undersampling and SMOTE. The results confirmed the previous findings.\
**Reducing class imbalance reduces every model's accuracy score.**\
The notebook copies are available for reference in the folder "imbalanced".\
The code used is in this notebook.
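
To make the effect of each technique on the training-set size concrete, here is a small arithmetic sketch. The total of 1,000 rows is a hypothetical figure chosen for illustration, not the size of the actual dataset; only the 37% vs 63% split comes from the text above.

```python
# Hypothetical training-set size, chosen only to keep the arithmetic readable
n_rows = 1000

# Class shares taken from the text: 37% canceled, 63% not canceled
n_canceled = round(n_rows * 0.37)       # minority class
n_not_canceled = n_rows - n_canceled    # majority class

# Random oversampling: minority rows are duplicated until both classes match
oversampled_total = 2 * n_not_canceled

# Random undersampling: majority rows are discarded until both classes match
undersampled_total = 2 * n_canceled

# SMOTE with sampling_strategy=1.0: synthetic minority rows are added until
# both classes match, so the total equals the oversampled total
smote_total = 2 * n_not_canceled

print(n_canceled, n_not_canceled)
print(oversampled_total, undersampled_total, smote_total)
```

Oversampling and SMOTE enlarge the training set, while undersampling shrinks it to twice the size of the minority class.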

# preprocessing

In [None]:
# import libraries
%run common_imports.py

# load and split data
%run load_and_split_data.py
X_train, X_test, y_train, y_test = load_and_split_data()

# scale data
%run minmaxscaler.py
X_train_scaled, X_test_scaled = scale_data(X_train, X_test)

# oversampling

In [None]:
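# resample is assumed to be provided by common_imports.py; imported here so the cell also runs on its own
from sklearn.utils import resample

# Make a copy of X_train to avoid altering the original DataFrame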
X_train_cl = X_train.copy()

# Add the 'is_canceled' column to X_train_cl
X_train_cl["is_canceled"] = y_train.values

# Separate canceled and not_canceled instances
canceled = X_train_cl[X_train_cl["is_canceled"] == 1]
not_canceled = X_train_cl[X_train_cl["is_canceled"] == 0]

# Visualize class distribution
canceled_plt = X_train_cl["is_canceled"].value_counts()
canceled_plt.plot(kind="bar")
plt.show()

# Oversample the minority class
canceled_oversampled = resample(canceled,
                                replace=True,
                                n_samples=len(not_canceled),
                                random_state=0)

# Concatenate oversampled minority class with majority class
train_over = pd.concat([canceled_oversampled, not_canceled])
display(train_over, "")

# Update X_train and y_train with the oversampled data
X_train = train_over.drop(columns=["is_canceled"])
y_train = train_over["is_canceled"]

# Visualize class distribution after oversampling
canceled_plt = train_over["is_canceled"].value_counts()
canceled_plt.plot(kind="bar")
plt.show()

# undersampling

In [None]:
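# Note: this cell expects the original X_train and y_train.
# Re-run the preprocessing cell first if the oversampling cell has already overwritten them.

# Make a copy of X_train to avoid altering the original DataFrame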
X_train_copy = X_train.copy()

# Add the 'is_canceled' column to X_train_copy
X_train_copy["is_canceled"] = y_train.values

# Separate canceled and not_canceled instances
canceled = X_train_copy[X_train_copy["is_canceled"] == 1]
not_canceled = X_train_copy[X_train_copy["is_canceled"] == 0]

# Visualize class distribution
canceled_plt = X_train_copy["is_canceled"].value_counts()
canceled_plt.plot(kind="bar")
plt.show()

# Undersample the majority class (without replacement, so no majority row is drawn twice)
not_canceled_undersampled = resample(not_canceled,
                                      replace=False,
                                      n_samples=len(canceled),
                                      random_state=0)

# Concatenate undersampled majority class with minority class
train_under = pd.concat([not_canceled_undersampled, canceled])
display(train_under, "")

# Update X_train and y_train with the undersampled data
X_train = train_under.drop(columns=["is_canceled"])
y_train = train_under["is_canceled"]

# Visualize class distribution after undersampling
canceled_plt = train_under["is_canceled"].value_counts()
canceled_plt.plot(kind="bar")
plt.show()
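
The oversampling and undersampling cells both overwrite `X_train` and `y_train`, so running them one after another would undersample data that has already been oversampled. One way around this, sketched below, is a helper that returns new objects and leaves its inputs untouched. The function name `rebalance`, the column `lead_time`, and the toy data are hypothetical, and the sketch uses pandas' `DataFrame.sample` in place of `resample`, so it may differ in detail from what the copied notebooks do.

```python
import pandas as pd


def rebalance(X, y, method="over", target="is_canceled", random_state=0):
    """Return a balanced copy of (X, y) for a binary target; inputs are not modified."""
    data = X.copy()
    data[target] = y.values

    # value_counts sorts in descending order: first label is the majority class
    counts = data[target].value_counts()
    majority = data[data[target] == counts.index[0]]
    minority = data[data[target] == counts.index[-1]]

    if method == "over":
        # Duplicate minority rows (with replacement) up to the majority size
        minority = minority.sample(n=len(majority), replace=True,
                                   random_state=random_state)
    elif method == "under":
        # Keep a random subset of majority rows (without replacement)
        majority = majority.sample(n=len(minority), replace=False,
                                   random_state=random_state)
    else:
        raise ValueError("method must be 'over' or 'under'")

    # Shuffle so that the classes are not stored in two contiguous blocks
    balanced = pd.concat([minority, majority]).sample(frac=1,
                                                      random_state=random_state)
    return balanced.drop(columns=[target]), balanced[target]


# Toy data: 3 canceled vs 7 not canceled bookings
X_toy = pd.DataFrame({"lead_time": range(10)})
y_toy = pd.Series([1, 1, 1, 0, 0, 0, 0, 0, 0, 0], name="is_canceled")

X_over, y_over = rebalance(X_toy, y_toy, method="over")
X_under, y_under = rebalance(X_toy, y_toy, method="under")

print(y_over.value_counts().to_dict())
print(y_under.value_counts().to_dict())
print(len(X_toy))  # the original frame keeps its size
```

With a helper of this kind, each technique can start from the same untouched training split, which keeps the three comparisons independent of cell execution order.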


# SMOTE

In [None]:
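# Note: this cell expects the original X_train and y_train.
# Re-run the preprocessing cell first if an earlier cell has already overwritten them.
# SMOTE is assumed to be provided by common_imports.py; imported here so the cell also runs on its own
from imblearn.over_sampling import SMOTE

# Visualize class distribution before SMOTE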
canceled_plt_before = y_train.value_counts()
canceled_plt_before.plot(kind="bar")
plt.show()

# Apply SMOTE
sm = SMOTE(random_state=1, sampling_strategy=1.0)
X_train_resampled, y_train_resampled = sm.fit_resample(X_train, y_train)

# Visualize class distribution after SMOTE
canceled_plt_after = y_train_resampled.value_counts()
canceled_plt_after.plot(kind="bar")
plt.show()
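
Unlike the two resampling cells, SMOTE does not duplicate existing rows. It creates synthetic minority rows by interpolating between a minority sample and one of its nearest minority-class neighbours. The snippet below is a simplified, standard-library sketch of that idea, not imblearn's implementation; the function name `smote_like`, the toy points, and `k=2` are illustrative assumptions.

```python
import math
import random


def smote_like(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (excluding itself)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        # Random position on the segment between base and neighbour
        gap = rng.random()
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Toy minority class with two numeric features
minority_points = [(1.0, 1.0), (2.0, 1.5), (1.5, 2.0), (3.0, 3.0)]
new_points = smote_like(minority_points, n_new=5)

print(len(new_points))
for point in new_points:
    print(point)
```

Because every synthetic point lies on a segment between two real minority points, interpolation can produce fractional values in one-hot encoded columns. imblearn also provides `SMOTENC` for data with categorical features, which may be worth checking for this dataset.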