In [1]:
import pandas as pd
url = (
    "http://biostat.mc.vanderbilt.edu/"
    "wiki/pub/Main/DataSets/titanic3.xls"
)
df = pd.read_excel(url)

In [7]:
df['pclass'].value_counts()

3    709
1    323
2    277
Name: pclass, dtype: int64

# Imbalanced Classes

If you are classifying data and the classes are not relatively balanced in size, the bias toward the more common classes can carry over into your model.

For example, if you have 1 positive case and 99 negative cases, you can get 99% accuracy simply by classifying everything as negative. There are various options for dealing with *imbalanced classes*.
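The arithmetic behind that example can be checked directly. The following is a minimal sketch using made-up labels (1 positive, 99 negative) and a stand-in "model" that always predicts the negative class; the variable names are illustrative.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# 1 positive case and 99 negative cases, as in the example above
y_true = np.array([1] + [0] * 99)

# A stand-in "model" that always predicts the negative class
y_pred = np.zeros_like(y_true)

accuracy = accuracy_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)

print(accuracy)  # 0.99
print(recall)    # 0.0
```

The accuracy looks excellent, yet the recall shows that the single positive case was missed.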

## Use a different metric


One hint is to use a measure other than accuracy for evaluating models (AUC is a good choice). Precision and recall are also better options when the target classes differ in size. However, there are other options to consider.
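As a rough illustration of these metrics, the sketch below scores a classifier on synthetic imbalanced data. The dataset (from `make_classification`), the logistic regression model, and all variable names are assumptions made for the example; they are not part of the Titanic workflow in this section.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic data with roughly 90% negative and 10% positive samples
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = model.predict(X_test)               # hard class labels
y_score = model.predict_proba(X_test)[:, 1]  # probability of the positive class

# AUC is computed from scores; precision and recall from predicted labels
auc = roc_auc_score(y_test, y_score)
precision = precision_score(y_test, y_pred, zero_division=0)
recall = recall_score(y_test, y_pred, zero_division=0)
print(f"AUC={auc:.3f} precision={precision:.3f} recall={recall:.3f}")
```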

## Tree-based Algorithms and Ensembles

Tree-based models may perform better depending on the distribution of the smaller class. If the minority samples tend to be clustered, they can be classified easily.

Ensemble methods can further aid in pulling out the minority classes. Bagging and boosting are options found in tree models like random forests and Extreme Gradient Boosting (XGBoost).
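A small sketch of comparing a bagging ensemble with a boosting ensemble follows. It assumes synthetic data and uses scikit-learn's `GradientBoostingClassifier` as a stand-in for XGBoost so that the example has no extra dependency; the scores it prints say nothing about how these models behave on other data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic data with roughly 90% negative and 10% positive samples
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)

models = {
    "random forest (bagging)": RandomForestClassifier(
        n_estimators=100, random_state=42
    ),
    # Stand-in for XGBoost: scikit-learn's own gradient boosting
    "gradient boosting": GradientBoostingClassifier(random_state=42),
}

# Compare the ensembles with cross-validated AUC rather than accuracy
ensemble_auc = {}
for name, model in models.items():
    ensemble_auc[name] = cross_val_score(
        model, X, y, scoring="roc_auc", cv=3
    ).mean()
    print(name, round(ensemble_auc[name], 3))
```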

## Penalize Models

Many scikit-learn classification models support the `class_weight` parameter. Setting this to `'balanced'` weights each class inversely proportional to its frequency, which incentivizes the model to classify minority classes correctly.

Alternatively, you can use a grid search and specify the weight options by passing in a dictionary mapping class to weight (give a higher weight to smaller classes).
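One way this grid search might look is sketched below. The synthetic data, the logistic regression estimator, and the candidate weights are all assumptions chosen for illustration, not recommended values.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic data with roughly 90% negative and 10% positive samples
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)

# Candidate settings: the built-in 'balanced' option plus explicit
# class-to-weight dictionaries that favor the smaller class (label 1)
param_grid = {
    "class_weight": ["balanced", {0: 1, 1: 2}, {0: 1, 1: 5}, {0: 1, 1: 10}]
}

search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid,
    scoring="roc_auc",  # score with AUC rather than accuracy
    cv=3,
)
search.fit(X, y)

best_weight = search.best_params_["class_weight"]
print(best_weight)
```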

The XGBoost library has the `max_delta_step` parameter, which can be set from 1 to 10 to make the update step more conservative. It also has the `scale_pos_weight` parameter that sets the ratio of negative to positive samples (for binary classes). The `eval_metric` should be set to `auc` rather than the default.
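The sketch below shows how these settings might be collected. It only builds a plain dictionary, so it runs without XGBoost installed. The class counts are the ones from the `survived` column of the Titanic data used in this section, and the `max_delta_step` value is an untuned placeholder.

```python
# Class counts from the 'survived' column of the Titanic data
n_negative = 809  # survived == 0
n_positive = 500  # survived == 1

# Ratio of negative to positive samples
scale_pos_weight = n_negative / n_positive

# Hypothetical parameter dictionary; values are illustrative, not tuned
xgb_params = {
    "max_delta_step": 1,
    "scale_pos_weight": scale_pos_weight,
    "eval_metric": "auc",
}
print(xgb_params)

# With xgboost installed, the parameters could be passed along like this:
#   import xgboost as xgb
#   model = xgb.XGBClassifier(**xgb_params)
```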

The KNN model has a `weights` parameter that can bias toward neighbors that are closer. If the minority class samples are close together, setting this parameter to `distance` may improve performance.
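A quick way to compare the two `weights` settings is sketched here. The data is synthetic and `n_neighbors=5` is an arbitrary choice for the example; whether `distance` helps depends on how the minority samples are distributed.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data with roughly 90% negative and 10% positive samples
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)

# 'uniform' gives every neighbor an equal vote;
# 'distance' gives closer neighbors a larger vote
knn_auc = {}
for weights in ["uniform", "distance"]:
    knn = KNeighborsClassifier(n_neighbors=5, weights=weights)
    knn_auc[weights] = cross_val_score(
        knn, X, y, scoring="roc_auc", cv=3
    ).mean()
    print(weights, round(knn_auc[weights], 3))
```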

## Upsampling Minority

You can upsample the minority class in a couple of ways. One is the `resample` utility from scikit-learn:


In [16]:
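from sklearn.utils import resample

# Split the rows by class: survivors are the minority
mask = df['survived'] == 1
surv_df = df[mask]
death_df = df[~mask]

# Sample survivors with replacement until they match the majority count
df_upsample = resample(surv_df, replace=True,
                       n_samples=len(death_df), random_state=42)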

In [17]:
df['survived'].value_counts()

0    809
1    500
Name: survived, dtype: int64

In [5]:
df2 = pd.concat([death_df, df_upsample])

In [7]:
df2['survived'].value_counts()

1    809
0    809
Name: survived, dtype: int64
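Another way to upsample, sketched here as an assumption rather than as the approach used above, is the `sample` method that pandas provides on data frames. The small `toy` frame and its column values are made up so that the example runs without downloading the Titanic data.

```python
import pandas as pd

# Toy frame standing in for the Titanic data: 8 negatives, 3 positives
toy = pd.DataFrame({"survived": [0] * 8 + [1] * 3, "fare": range(11)})

mask = toy["survived"] == 1
minority = toy[mask]
majority = toy[~mask]

# Draw minority rows with replacement until they match the majority count
minority_up = minority.sample(n=len(majority), replace=True, random_state=42)

balanced = pd.concat([majority, minority_up])
counts = balanced["survived"].value_counts()
print(counts)
```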

## Downsampling Majority

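Another method to balance classes is to downsample the majority class.

**Don't replace when downsampling.** Sampling without replacement (`replace=False`) ensures that no majority row appears more than once.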

In [9]:
# Don't replace when downsampling
df_downsample = resample(death_df,
                         replace=False, n_samples=len(surv_df),
                         random_state=42)

In [10]:
df3 = pd.concat([surv_df, df_downsample])

In [11]:
df3['survived'].value_counts()

1    500
0    500
Name: survived, dtype: int64