# Binary classification with unbalanced data

In this notebook, we look at the problems we run into when training a classifier to predict rare events. Many binary classification problems involve rare events, such as predicting that someone has a rare disease, or predicting that someone looking at an ad will buy the product. When one of the classes (by convention the positive class) makes up only a very small percentage of the data and most data points belong to the other class (by convention the negative class), we call the uncommon class a **rare event** and we say that we have **unbalanced data**. In this notebook we explore how unbalanced data affects the way that we evaluate a model.
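
To make the evaluation problem concrete, here is a minimal sketch with made-up labels (the 2% positive rate and all variable names are assumptions for illustration, not taken from the housing data). A "classifier" that always predicts the negative class looks excellent on accuracy while finding none of the rare events.

```python
import numpy as np

# Hypothetical labels: 1,000 cases, of which 20 (2%) are positive.
y_true = np.zeros(1000, dtype=bool)
y_true[:20] = True

# A "classifier" that always predicts the negative class.
y_pred = np.zeros(1000, dtype=bool)

accuracy = np.mean(y_true == y_pred)                # fraction of correct predictions
recall = np.sum(y_pred & y_true) / np.sum(y_true)   # fraction of positives that were found

print('Accuracy:', accuracy)  # 0.98
print('Recall:', recall)      # 0.0
```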

The Ames housing dataset has housing data including sale price. We create a binary label to flag houses in a price range and build a classifier to predict the likelihood of this price range for a house.  

https://www.kaggle.com/datasets/prevek18/ames-housing-dataset


In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np
from sklearn.preprocessing import OneHotEncoder

import warnings
warnings.simplefilter("ignore")

In [None]:
from sklearn.datasets import fetch_openml
house_prices = fetch_openml(name="house_prices", as_frame=True)
print('house_prices keys:', house_prices.keys())
ames_df = house_prices.frame # The value for the 'frame' key is a data frame with features and target

## Missing Data
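The Ames housing data has missing values. For simplicity's sake, we fill in the missing values of `Electrical` with `SBrkr` and then remove every column that still has missing data, which in practice removes any column with more than 0.1% missing values.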
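
A threshold such as 0.1% can also be applied explicitly. The sketch below uses a small made-up data frame (`toy_df` and its columns are hypothetical) to show one way of computing the fraction of missing values per column and keeping only the columns at or below the threshold.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: 'a' is complete, 'b' has 1 of 4 values missing (25%).
toy_df = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [1.0, np.nan, 3.0, 4.0]})

missing_fraction = toy_df.isna().mean()   # fraction of missing values per column
threshold = 0.001                         # 0.1%
columns_to_keep = missing_fraction[missing_fraction <= threshold].index
toy_df_reduced = toy_df[columns_to_keep]

print(missing_fraction)
print(list(toy_df_reduced.columns))  # ['a']
```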

In [None]:
original_columns = ames_df.columns
ames_df.loc[ames_df['Electrical'].isna(), 'Electrical'] = 'SBrkr'
ames_df.dropna(axis = 1, inplace = True)
print(original_columns.shape[0], 'columns reduced to', ames_df.columns.shape[0],'columns')

## Categorical Columns

In [None]:
original_columns = ames_df.columns
object_columns_before = ames_df.select_dtypes(include=object).columns # np.object was removed from NumPy; use the built-in object
numeric_columns = ames_df.select_dtypes(include=np.number).columns
print('The data frame has', original_columns.shape[0],
      'columns, including', object_columns_before.shape[0],'categorical columns and',
      numeric_columns.shape[0], 'numeric columns')

### Exercise (4 minutes)

- As one example, recall that earlier in the notebook we used `np.unique(...)` to get counts. Use it to get counts for each unique value of a categorical (object) column with more than 3 values.
- Also turn the counts into percentages.

- Since getting counts and turning them into percentages is such a common data-related task, there's got to be an easier way to do it. And there is. Search online to see if `pandas` offers a function for getting unique counts for a column in the data.  Use the pandas method to get the percentages of the categories for `Exterior2nd` and `Neighborhood`.

### End of exercise

### Reducing Categorical Complexity
- Category columns will need to be one-hot encoded
- Too many categories will lead to too many feature dimensions
- For simplicity we will drop category columns with too many categories.  Alternatively, we could bin or consolidate similar categories
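
As a sketch of the consolidation alternative mentioned in the last bullet, the example below keeps the most frequent categories of a made-up column and relabels the rest as `'Other'`. The series, the cut-off of 2 categories, and the label `'Other'` are all assumptions for illustration.

```python
import pandas as pd

# Hypothetical categorical column with a few rare categories.
neighborhood = pd.Series(['A', 'A', 'A', 'A', 'B', 'B', 'B', 'C', 'D', 'E'])

# Keep the 2 most frequent categories and relabel everything else as 'Other'.
top_categories = neighborhood.value_counts().nlargest(2).index
consolidated = neighborhood.where(neighborhood.isin(top_categories), 'Other')

print(consolidated.value_counts())
```

After consolidation the column has three categories instead of five, so one-hot encoding adds fewer feature dimensions.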

In [None]:
# Drop category columns with more than 5 categories
column_drop_list = []
for column in object_columns_before:
    if np.unique(ames_df[column]).shape[0] > 5:
        column_drop_list = column_drop_list + [column]
ames_df.drop(columns=column_drop_list, inplace=True)
object_columns_after = ames_df.select_dtypes(include=object).columns
print(object_columns_before.shape[0], 'category columns reduced to', object_columns_after.shape[0],'columns')

In [None]:
original_columns = ames_df.columns
ames_df_num = ames_df[numeric_columns].copy() # ames_df.select_dtypes(include=np.number).copy() # only select columns that are numeric
ames_df_cat = ames_df[object_columns_after].copy() # ames_df.select_dtypes('object').copy() # only select columns that have type 'object'
onehot = OneHotEncoder(sparse_output = False) # initialize one-hot-encoder (scikit-learn >= 1.2; older versions use sparse = False)
onehot.fit(ames_df_cat)
col_names = onehot.get_feature_names_out(ames_df_cat.columns) # this allows us to properly name columns
ames_df_onehot =  pd.DataFrame(onehot.transform(ames_df_cat), columns = col_names)
AmesFeatures = ames_df_num.join(ames_df_onehot)
print(object_columns_after.shape[0], 'category columns expanded to', ames_df_onehot.columns.shape[0],'columns')
print('Total of', AmesFeatures.columns.shape[0], 'columns')
# AmesFeatures.to_csv('AmesFeatures.csv', index=False)

Let's now visualize the target variable, housing price.

In [None]:
sns.histplot(AmesFeatures.SalePrice)
plt.show()
print(np.quantile(AmesFeatures.SalePrice, [0.05, 0.1, 0.5, 0.9, 0.95]))

Say we're interested in training a classification algorithm to predict whether or not a house is within a specified price range. So first we create a target variable that flags houses that sold in that price range.

In [None]:
y = (AmesFeatures['SalePrice'] > 200000) & (AmesFeatures['SalePrice'] < 230000)
X = AmesFeatures.drop(columns=['SalePrice'])
y.value_counts()

We start by splitting `X` and `y` into training data and testing data. The easiest way to do this is using the `train_test_split` function.

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.30, random_state = 0)
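
With rare events, a purely random split can leave the training and test sets with slightly different positive rates. As an optional variation (not what the cell above does), `train_test_split` accepts a `stratify` argument that keeps the class proportions the same in both parts. The sketch below uses made-up data; `X_toy`, `y_toy`, and the 5% positive rate are assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 1,000 rows, 5% positive.
X_toy = np.arange(1000).reshape(-1, 1)
y_toy = np.zeros(1000, dtype=bool)
y_toy[:50] = True

# stratify=y_toy keeps the share of positives (almost) identical in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.30, random_state=0, stratify=y_toy)

print('Positive rate in train:', y_tr.mean())
print('Positive rate in test: ', y_te.mean())
```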

### Exercise (6 minutes)

- Find counts for the target in training data

- Train a logistic regression classifier to predict whether the price of a house is within our specified range. Begin by importing the class: `from sklearn.linear_model import LogisticRegression`. Then create an instance of the algorithm and train it by invoking `.fit(X_train, y_train)`.

- Once the model is trained, pass it the testing data to see if we get predictions back. To do so, we invoke the `.predict(X_test)` method. We can also invoke the `.predict_proba(X_test)` method if we wish to get the raw probabilities instead of the final predictions.  Name the predictions `y_test_pred`. The cells after this exercise also expect `y_train_pred` and the positive-class probabilities `y_train_proba` and `y_test_proba`.

- Get the accuracy of the model by loading `from sklearn.metrics import accuracy_score` and calling the `accuracy_score` function. What two arguments do we pass to this function to evaluate the model's accuracy?

- Is accuracy a good metric for evaluating this model? Why or why not? To give some context, let's say you're a developer and want to predict house prices. You would rather bid low and lose a bid than bid high for a house that's not worth it.

### End of exercise
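
The cells that follow use hard predictions (`y_train_pred`, `y_test_pred`) and positive-class probabilities (`y_train_proba`, `y_test_proba`). The sketch below shows the general fit, predict, and score pattern on synthetic data so that it runs on its own. The data set and every name ending in `_toy` are hypothetical stand-ins, not the housing data or an official solution.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the housing data (about 10% positives).
X_toy, y_toy = make_classification(n_samples=1000, n_features=10,
                                   weights=[0.9, 0.1], random_state=0)
X_train_toy, X_test_toy, y_train_toy, y_test_toy = train_test_split(
    X_toy, y_toy, test_size=0.30, random_state=0)

logreg_toy = LogisticRegression(max_iter=1000)
logreg_toy.fit(X_train_toy, y_train_toy)

# Hard predictions are class labels; soft predictions are probabilities.
y_train_pred_toy = logreg_toy.predict(X_train_toy)
y_test_pred_toy = logreg_toy.predict(X_test_toy)
# Column 1 of predict_proba holds the probability of the positive class.
y_train_proba_toy = logreg_toy.predict_proba(X_train_toy)[:, 1]
y_test_proba_toy = logreg_toy.predict_proba(X_test_toy)[:, 1]

print('Test accuracy:', accuracy_score(y_test_toy, y_test_pred_toy))
```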

Let's find some more useful evaluation metrics. The most direct one to look at is the confusion matrix.

In [None]:
from sklearn import metrics

cm_train = metrics.confusion_matrix(y_train, y_train_pred)
print('Confusion matrix based on training data:')
print(cm_train)
print('\nConfusion matrix as an accuracy measure (test data):')
cm_test = metrics.confusion_matrix(y_test, y_test_pred)
print(cm_test)

From the confusion matrix, we can derive accuracy, precision, recall, and the F1-score, which is the harmonic mean of precision and recall. We don't have time to get into all of them in detail, but [here](http://www.win-vector.com/blog/2009/11/i-dont-think-that-means-what-you-think-it-means-statistics-to-english-translation-part-1-accuracy-measures/) is an excellent article explaining the differences between them in great detail.
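
As a small worked example of how these metrics follow from the four cells of the confusion matrix, the sketch below uses ten made-up labels and predictions (assumed values, not model output) and computes each metric by hand.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

# Hypothetical labels and predictions (True = positive class).
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1], dtype=bool)
y_pred = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 0], dtype=bool)

# For binary labels scikit-learn orders the matrix as [[TN, FP], [FN, TP]].
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)            # 7 / 10
precision = tp / (tp + fp)                            # 3 / 5
recall = tp / (tp + fn)                               # 3 / 4
f1 = 2 * precision * recall / (precision + recall)    # harmonic mean of the two

print('TN, FP, FN, TP:', tn, fp, fn, tp)
print('accuracy:', accuracy, 'precision:', precision, 'recall:', recall, 'F1:', round(f1, 3))
```

The hand-computed values should match `precision_score`, `recall_score`, and `f1_score` applied to the same arrays.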

In [None]:
from sklearn.metrics import classification_report
print(classification_report(y_train, y_train_pred))
print(classification_report(y_test, y_test_pred))

### ROC:  Gold Standard for Classification Accuracy

One way to visually evaluate a binary classification model is the ROC plot, which shows the true positive rate against the false positive rate at every possible decision threshold. Because it does not depend on one particular threshold, the ROC plot can be used by itself and keeps some meaning across different data sets. The area under the ROC curve is called AUC (area under the curve), and the closer it is to 1, the better the model. An AUC of 0.5 corresponds to a random model and is considered the worst-case scenario. Such a model is represented by the diagonal from the lower left to the upper right. (A model that reliably scores below 0.5 just needs its predictions reversed.)
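
To show where the points of an ROC curve come from, the sketch below computes one point by hand for a single threshold and then lets `roc_curve` sweep all thresholds. The four labels and scores are made-up values chosen for illustration.

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

# Hypothetical labels and predicted probabilities for the positive class.
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

# One ROC point by hand: classify as positive when the score is >= 0.35.
threshold = 0.35
y_hat = y_score >= threshold
tpr_manual = np.sum(y_hat & (y_true == 1)) / np.sum(y_true == 1)  # true positive rate
fpr_manual = np.sum(y_hat & (y_true == 0)) / np.sum(y_true == 0)  # false positive rate
print('TPR:', tpr_manual, 'FPR:', fpr_manual)  # TPR: 1.0 FPR: 0.5

# roc_curve repeats this for every distinct threshold; auc integrates the curve.
fpr, tpr, thresholds = roc_curve(y_true, y_score)
print('AUC:', auc(fpr, tpr))  # 0.75
```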

In [None]:
fpr, tpr, thresholds = metrics.roc_curve(y_train, y_train_proba)
roc_auc = metrics.auc(fpr, tpr)

import matplotlib.pyplot as plt
plt.title('Receiver Operating Characteristic')
plt.plot(fpr, tpr, 'b', label = 'AUC = %0.2f' % roc_auc)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([-0.1, 1.1])
plt.ylim([-0.1, 1.1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()

In [None]:
fpr, tpr, thresholds = metrics.roc_curve(y_test, y_test_proba)
roc_auc = metrics.auc(fpr, tpr)

import matplotlib.pyplot as plt
plt.title('Receiver Operating Characteristic')
plt.plot(fpr, tpr, 'b', label = 'AUC = %0.2f' % roc_auc)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([-0.1, 1.1])
plt.ylim([-0.1, 1.1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()

- In the above ROC plot for the test data, the ROC curve is only somewhat above the diagonal. This means that the predicted probabilities are not very useful for separating the two classes. The AUC is around 0.62 on a scale that effectively runs from 0.5 to 1, which indicates that the model is very weak.

#### Basics on AUC
- For a first understanding of AUC scores we can presume the following (obviously every situation is different and determined by the business use case):
    - A random binary classification model (like a flip of a coin) has an AUC of 0.5 (the worst case).
    - A barely usable model has an AUC above 0.72.
    - A good model has an AUC above 0.8.
    - A very good model has an AUC above 0.85.
    - A model with an AUC above 0.95 is often an indicator of target leakage.
- With the ROC we do not have to worry that the AUC differs between the positive case and the negative case, as recall, precision, and the F1 score do. Recall, precision, and the F1 score each require two values, namely one for the positive case and one for the negative case. The ROC of the positive case is the mirror image of the ROC of the negative case, so the AUC is the same regardless of which binary outcome is selected as positive.
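
The symmetry described in the last bullet can be checked numerically. The sketch below uses randomly generated labels and scores (assumptions for illustration only): swapping which class counts as positive, together with the matching probabilities, leaves the AUC unchanged, while reversing only the scores turns an AUC of `a` into `1 - a`.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y_true = rng.random(200) < 0.2                      # hypothetical labels, roughly 20% positive
y_score = 0.3 * y_true + 0.7 * rng.random(200)      # scores that are somewhat higher for positives

auc_pos = roc_auc_score(y_true, y_score)            # original positive class
auc_neg = roc_auc_score(~y_true, 1 - y_score)       # negative class treated as positive
auc_reversed = roc_auc_score(y_true, 1 - y_score)   # same labels, reversed scores

print('AUC, positive class:      ', auc_pos)
print('AUC, negative class:      ', auc_neg)
print('AUC, reversed predictions:', auc_reversed)
```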

# Solving Class Imbalance
Now that we have seen different approaches to evaluating machine learning models, let's look at data manipulation techniques like downsampling and upsampling to address the class imbalance problem. We need to make sure to **only apply downsampling and upsampling to training data**, so that the test data keeps the class distribution the model will face in practice.


https://en.wikipedia.org/wiki/Oversampling_and_undersampling_in_data_analysis

### Down-sampling
- An unbalanced data set is one where the classes have different numbers of cases/observations/rows.
- The class with the most observations is called the majority class.
- The class with the fewest observations is called the minority class.
- Down-sampling is a technique where the majority class is reduced in size to match the minority class.

In [None]:
print('The imbalanced class distribution for the train data was {}:{}'
      .format(y_train.shape[0] - np.sum(y_train), np.sum(y_train)))
# Extract positive cases
X_train_pos = X_train.loc[y_train, :]
numberOfRowsPerClass = X_train_pos.shape[0] # np.sum(y_train)
y_train_pos = np.ones(numberOfRowsPerClass, dtype=bool)
# Extract negative cases
X_train_neg = X_train.loc[~y_train, :]
# Down-sample negative cases
X_train_neg_downsample = X_train_neg.sample(n=numberOfRowsPerClass, random_state=0)
y_train_neg_downsample = np.zeros(numberOfRowsPerClass, dtype=bool)

# combine positive and negative cases
X_train_downsample = pd.concat([X_train_pos, X_train_neg_downsample], ignore_index=True) # DataFrame.append was removed in pandas 2.0
y_train_downsample = np.append(arr=y_train_pos, values=y_train_neg_downsample)

print('The balanced class distribution for the down-sampled train data was {}:{}'
      .format(y_train_neg_downsample.shape[0], y_train_pos.shape[0]))

In [None]:
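from sklearn.linear_model import LogisticRegression # already imported if you completed the exercise above
from sklearn.metrics import accuracy_score

logregDownSample = LogisticRegression()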
logregDownSample.fit(X_train_downsample, y_train_downsample)

y_train_down_pred = logregDownSample.predict(X_train_downsample)
y_train_down_proba = logregDownSample.predict_proba(X_train_downsample)[:,1]

y_test_down_pred = logregDownSample.predict(X_test)
y_test_down_proba = logregDownSample.predict_proba(X_test)[:,1]

print('Training "accuracy":', accuracy_score(y_true=y_train_downsample, y_pred=y_train_down_pred))
print('Accuracy (test data):', accuracy_score(y_true=y_test, y_pred=y_test_down_pred))

cm_down_train = metrics.confusion_matrix(y_train_downsample, y_train_down_pred)
print('Confusion matrix based on down-sampled training data:')
print(cm_down_train)
print('\nConfusion matrix as an accuracy measure (test data):')
cm_down_test = metrics.confusion_matrix(y_test, y_test_down_pred)
print(cm_down_test)

print(classification_report(y_train_downsample, y_train_down_pred))
print(classification_report(y_test, y_test_down_pred))

In [None]:
fpr, tpr, thresholds = metrics.roc_curve(y_train_downsample, y_train_down_proba)
roc_auc = metrics.auc(fpr, tpr)

import matplotlib.pyplot as plt
plt.title('Receiver Operating Characteristic')
plt.plot(fpr, tpr, 'b', label = 'AUC = %0.2f' % roc_auc)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([-0.1, 1.1])
plt.ylim([-0.1, 1.1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()

In [None]:
fpr, tpr, thresholds = metrics.roc_curve(y_test, y_test_down_proba)
roc_auc = metrics.auc(fpr, tpr)

import matplotlib.pyplot as plt
plt.title('Receiver Operating Characteristic')
plt.plot(fpr, tpr, 'b', label = 'AUC = %0.2f' % roc_auc)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([-0.1, 1.1])
plt.ylim([-0.1, 1.1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()

### Up-sampling (Data Augmentation)

https://en.wikipedia.org/wiki/Data_augmentation
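
The cells below use SMOTE, which creates synthetic minority-class rows by interpolating between a minority sample and one of its nearest minority-class neighbours. The sketch below illustrates only that interpolation step on made-up points. The helper `interpolate_sample` is hypothetical, the neighbour is picked by hand rather than by a nearest-neighbour search, and this is not the `imblearn` implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical minority-class samples with two features.
X_minority = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0]])

def interpolate_sample(X, i, j, rng):
    """Return a random point on the line segment between rows i and j of X."""
    gap = rng.random()                 # uniform in [0, 1)
    return X[i] + gap * (X[j] - X[i])

# Synthetic row between sample 0 and its (hand-picked) neighbour, sample 1.
synthetic = interpolate_sample(X_minority, 0, 1, rng)
print('Synthetic sample:', synthetic)
```

Because the new row lies between two existing minority rows, it adds variety to the minority class instead of repeating an existing row exactly.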


In [None]:
# pip install imbalanced-learn  # PyPI name of the package that provides imblearn

In [None]:
from imblearn.over_sampling import SMOTE

X_resampled, y_resampled = SMOTE(random_state=0).fit_resample(X_train, y_train) # fixed seed for reproducible results

In [None]:
y_resampled.value_counts()

In [None]:
logit_upsample = LogisticRegression(max_iter=5000)
logit_upsample.fit(X_resampled, y_resampled)

In [None]:
y_train_hat_augment = logit_upsample.predict(X_resampled)
print('Training accuracy score:', accuracy_score(y_true=y_resampled, y_pred=y_train_hat_augment))
cm = metrics.confusion_matrix(y_resampled, y_train_hat_augment)
print(cm)

In [None]:
y_test_upsample_hat_pred = logit_upsample.predict(X_test)
print('Accuracy score (based on test data):', accuracy_score(y_true=y_test, y_pred=y_test_upsample_hat_pred))
cm = metrics.confusion_matrix(y_test, y_test_upsample_hat_pred)
print(cm)

In [None]:
print(classification_report(y_test, y_test_upsample_hat_pred))

In [None]:
y_train_hat_augment_proba = logit_upsample.predict_proba(X_resampled)[:,1]
fpr, tpr, thresholds = metrics.roc_curve(y_resampled, y_train_hat_augment_proba)
roc_auc = metrics.auc(fpr, tpr)

import matplotlib.pyplot as plt
plt.title('Receiver Operating Characteristic')
plt.plot(fpr, tpr, 'b', label = 'AUC = %0.2f' % roc_auc)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([-0.1, 1.1])
plt.ylim([-0.1, 1.1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()

In [None]:
y_test_hat_augment_proba = logit_upsample.predict_proba(X_test)[:,1]
fpr, tpr, thresholds = metrics.roc_curve(y_test, y_test_hat_augment_proba)
roc_auc = metrics.auc(fpr, tpr)

import matplotlib.pyplot as plt
plt.title('Receiver Operating Characteristic')
plt.plot(fpr, tpr, 'b', label = 'AUC = %0.2f' % roc_auc)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([-0.1, 1.1])
plt.ylim([-0.1, 1.1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()

# Assignment

In this assignment, we want to apply cross-validation to logistic regression. Cross-validation is a powerful technique for model selection (such as when choosing the right hyper-parameters), especially when the data size is not very large. The goal of this assignment is to train logistic regression models with cross-validation and compare them to a baseline model (with no cross-validation).
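
As background, the sketch below shows what k-fold cross-validation does mechanically: the training data is split into k folds, and each fold serves once as validation data for a model trained on the remaining folds. It runs on synthetic data (`X_toy`, `y_toy`, and the other names are hypothetical) and uses `StratifiedKFold` directly, so it illustrates the idea rather than the assignment's `LogisticRegressionCV` solution.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

# Synthetic, imbalanced data (about 15% positives).
X_toy, y_toy = make_classification(n_samples=500, n_features=8,
                                   weights=[0.85, 0.15], random_state=0)

# Stratified folds keep the share of positives similar in every fold.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_aucs = []
for train_idx, valid_idx in skf.split(X_toy, y_toy):
    model = LogisticRegression(max_iter=1000)
    model.fit(X_toy[train_idx], y_toy[train_idx])            # train on 4 folds
    proba = model.predict_proba(X_toy[valid_idx])[:, 1]      # score the held-out fold
    fold_aucs.append(roc_auc_score(y_toy[valid_idx], proba))

print('AUC per fold:', np.round(fold_aucs, 3))
print('Mean AUC:', np.mean(fold_aucs))
```

Averaging the per-fold scores gives a more stable estimate than a single train/validation split, which is what makes it useful for comparing hyper-parameter settings.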

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.metrics import roc_curve
from sklearn.metrics import auc
import matplotlib.pyplot as plt
import warnings
warnings.simplefilter("ignore")
%matplotlib inline
random_state = 0

### Get Data
Read in `AmesFeatures.csv`, which contains the processed data from the lecture. This data file can be found in Canvas. If you want, you can generate this data file yourself by uncommenting the line `# AmesFeatures.to_csv('AmesFeatures.csv', index=False)` in file `Lesson_10_b_Student.ipynb`. You may need to change the path below.  <br/><span style="color:red" float:right>[0 point]</span>

In [None]:
AmesFeatures = pd.read_csv('AmesFeatures.csv')
y = (AmesFeatures['SalePrice'] > 200000) & (AmesFeatures['SalePrice'] < 230000)
X = AmesFeatures.drop(columns=['Id', 'SalePrice'])
display(y.value_counts())
display(X.shape)
display(X)

Use some of the code from the lecture. 
- Split X and y into X_train, X_test, y_train, and y_test using `test_size = 0.30`.  You may want to use `random_state = 0` to make your results the same as others 
- Present the counts for False and True in y_train and y_test to verify the imbalanced data in your train and test sets
<br/><span style="color:red" float:right>[0 point]</span>

In [None]:
# Add code here to split data into test and train sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.30, random_state = random_state)
# Add code to present the imbalance of the class labels for both test and training


1. Train basic `LogisticRegression` classifier (repeat what was done in class)
 1. Train a model with `X_train` and `y_train`
 2. predict (hard and soft) on training and test features with `.predict()` and `.predict_proba()`
 3. evaluate model using confusion matrix (`confusion_matrix`) and its metrics (`classification_report`)
 4. evaluate model using ROC and AUC of ROC
 5. comment on the model's usability. 
<br/><span style="color:red" float:right>[2 point]</span>

In [None]:
# Add code here to train the logistic regression


In [None]:
# Add code to predict (hard and soft) on training and test features


In [None]:
# Add code to evaluate predictions using confusion matrix and its metrics


In [None]:
# Add code to evaluate predictions using ROC


#### Add Comment on usability


2. Train `LogisticRegression` with balanced class weights 
 1. Read the documentation to see what `class_weight` does
 2. Train a new model with the same `X_train` and `y_train` setting `class_weight` so the weights are balanced
 3. predict (hard and soft) on training and test features with .predict() and .predict_proba()
 4. evaluate model using confusion matrix (confusion_matrix) and its metrics (classification_report)
 5. evaluate model using ROC
 6. How does balancing class weights change any of the results? Why?
<br/><span style="color:red" float:right>[3 point]</span>

In [None]:
# Add code here to train the logistic regression with balanced weights


In [None]:
# Add code to predict (hard and soft) on training and test features


In [None]:
# Add code to evaluate predictions using confusion matrix and its metrics


In [None]:
# Add code to evaluate predictions using ROC


#### Add Comment on the effect of weight balancing


3. Use `LogisticRegressionCV` to train a cross-validation logistic regression.  The CV stands for cross-validation.
 1. train a cross-validation logistic regression
    - Use the same `X_train` and `y_train`.
    - You may want to use `random_state = 0` to make your results the same as others.
    - Set the `cv` parameter to 5.  5 is the default value.
    - Set `class_weight` so the weights are balanced
 2. predict (hard and soft) on training and test features with `.predict()` and `.predict_proba()`
 3. evaluate test and training predictions using confusion matrix (`confusion_matrix`) and its metrics (`classification_report`)
 4. evaluate test and training predictions using ROC
 5. comment on whether cross-validation makes a difference in the results. 
<br/><span style="color:red" float:right>[3 point]</span>

In [None]:
# Add code to train logistic regression cross-validation (cv = 5)


In [None]:
# Add code to predict (hard and soft) on training and test features


In [None]:
# Add code to evaluate predictions using confusion matrix and its metrics


In [None]:
# Add code to evaluate predictions using ROC


#### Add comments on model evaluation and  cross-validation


4. Increase the number of folds and train the CV model again:
 1. train a cross-validation logistic regression
    - Use the same `X_train` and `y_train`.
    - You may want to use `random_state = 0` to make your results the same as others.
    - Set the `cv` parameter to 10 
    - Set `class_weight` so the weights are balanced
 2. predict (hard and soft) on training and test features with `.predict()` and `.predict_proba()`
 3. evaluate test and training predictions using confusion matrix (`confusion_matrix`) and its metrics (`classification_report`)
 4. evaluate test and training predictions using ROC
 5. comment on whether increasing the number of folds makes a difference in the results. 
<br/><span style="color:red" float:right>[2 point]</span>

In [None]:
# Add code to train logistic regression cross-validation (cv = 10)


In [None]:
# Add code to predict (hard and soft) on training and test features


In [None]:
# Add code to evaluate model using confusion matrix and its metrics


In [None]:
# Add code to evaluate model using ROC


#### Add comments on model evaluation and extended cross-validation


5. What was the cost of increasing the number of folds in terms of training run-time? <span style="color:red" float:right>[2 point]</span>

In [None]:
# Add code here to determine cost of increasing folds


#### Add comments on training run-time here


# End of assignment