# Class-weighted Learning
- Class imbalance is a common problem. For instance, non-purchasers far outnumber purchasers among mobile app users, and non-criminals far outnumber criminals in society.
- However, if the class imbalance is too severe (i.e., the training set is highly skewed), it is likely to have undesirable effects.
    - For instance, the algorithm will tend to predict the majority class all the time.
    - This is risky because the model may then fail to detect the cases we care about, such as purchasers among mobile app users or criminals, which are relatively rare among the training instances.
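
The risk described above can be illustrated with a minimal sketch. The labels below are hypothetical toy data (950 majority instances, 50 minority instances), not taken from the dataset used in this notebook, and the "classifier" is simply a constant prediction of the majority class.

```python
# Hypothetical toy labels: 950 negatives (majority) and 50 positives (minority).
y_true = [0] * 950 + [1] * 50

# A "classifier" that ignores its input and always predicts the majority class.
y_pred = [0] * len(y_true)

# Overall accuracy looks high because most instances belong to the majority class.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall on the minority class: fraction of true positives that were detected.
detected = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
minority_recall = detected / y_true.count(1)

print("Accuracy:", accuracy)                # 0.95
print("Minority recall:", minority_recall)  # 0.0
```

An accuracy of 0.95 hides the fact that not a single minority instance is found, which is why accuracy alone can be misleading on skewed data.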

In [1]:
import numpy as np
from sklearn.utils import class_weight
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from collections import Counter
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import adam

Using TensorFlow backend.


## Load Dataset
- Breast cancer dataset from `sklearn`
- doc: http://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_breast_cancer.html

In [2]:
data = load_breast_cancer()
X_data = data.data.tolist()
y_data = data.target.tolist()

In [3]:
print("Number of malignant instances (0): ", Counter(y_data)[0])
print("Number of benign instances (1): ", Counter(y_data)[1])

Number of malignant instances (0):  212
Number of benign instances (1):  357


In [4]:
# drop the malignant instances among the first 200 rows to artificially create a class-imbalanced dataset
for i in range(200):
    if y_data[i] == 0:
        X_data[i] = None
        y_data[i] = None

In [5]:
# keep only the rows that were not marked for deletion
X_data = [x for x in X_data if x is not None]
y_data = [y for y in y_data if y is not None]
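
The two cells above mark rows with `None` and then filter them out. As an alternative, the same selection could be written with a NumPy boolean mask. This is only a sketch on hypothetical toy arrays (`X`, `y`, and `n_first` are illustrative names, with `n_first` playing the role of the 200 used above), not the notebook's actual data.

```python
import numpy as np

# Toy stand-ins for X_data / y_data (hypothetical values, not the real dataset).
X = np.arange(12).reshape(6, 2)
y = np.array([0, 1, 0, 0, 1, 0])

n_first = 4  # plays the role of the 200 in the cell above

# True for rows that should be dropped: index < n_first AND label == 0.
drop = (np.arange(len(y)) < n_first) & (y == 0)

# Keep everything that is not marked for dropping.
X_kept, y_kept = X[~drop], y[~drop]

print(y_kept)        # [1 1 0]
print(X_kept.shape)  # (3, 2)
```

The mask keeps features and labels aligned in a single step and avoids placing `None` placeholders in the lists.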

In [6]:
print("Number of malignant instances (0): ", Counter(y_data)[0])
print("Number of benign instances (1): ", Counter(y_data)[1])

Number of malignant instances (0):  108
Number of benign instances (1):  357


In [7]:
X_train, X_test, y_train, y_test = train_test_split(np.asarray(X_data), np.asarray(y_data), test_size = 0.2, random_state = 7) 

print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

(372, 30) (93, 30) (372,) (93,)


## Computing class weights
- We compute class weights from the training set and pass them to `fit` through the `class_weight` parameter

In [11]:
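# keyword arguments are used because recent scikit-learn releases no longer accept these positionally
weights = class_weight.compute_class_weight(class_weight = 'balanced', classes = np.unique(y_train), y = y_train)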

In [12]:
class_weights = dict(zip(np.unique(y_train), weights))
print("Computed class weights: ", class_weights)

Computed class weights:  {0: 2.2409638554216866, 1: 0.643598615916955}
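
The `'balanced'` heuristic in scikit-learn assigns each class the weight `n_samples / (n_classes * count_of_that_class)`. The sketch below recomputes the weights by hand using only the standard library. The class counts (83 malignant, 289 benign) are an assumption: they are chosen because they sum to the 372 training rows and reproduce the weights printed above.

```python
from collections import Counter

# Assumed training label counts: 83 malignant (0) and 289 benign (1).
y_train_toy = [0] * 83 + [1] * 289

counts = Counter(y_train_toy)
n_samples = len(y_train_toy)
n_classes = len(counts)

# "balanced" heuristic: n_samples / (n_classes * count of that class)
manual_weights = {c: n_samples / (n_classes * counts[c]) for c in sorted(counts)}
print(manual_weights)  # roughly {0: 2.241, 1: 0.644}
```

The rarer class receives the larger weight, so each malignant instance counts about 3.5 times as much as a benign one during training.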


## Naive Learning

In [13]:
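def simple_mlp():
    # one hidden layer (10 ReLU units) followed by a single sigmoid output for binary classification
    model = Sequential()
    model.add(Dense(10, input_shape = (X_train.shape[1],), activation = 'relu'))
    model.add(Dense(1, activation = 'sigmoid'))
    # note: newer Keras releases spell the optimizer as Adam(learning_rate = 0.001)
    model.compile(optimizer = adam(lr = 0.001), loss = 'binary_crossentropy', metrics = ['acc'])
    return model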

In [14]:
model = simple_mlp()
model.fit(X_train, y_train, epochs = 100)

Epoch 1/100
...
Epoch 100/100


<keras.callbacks.History at 0x12a386d30>

In [15]:
y_pred = model.predict(X_test).round()

In [16]:
print("% of predicted 1's: ", y_pred.sum()/len(y_pred))
print("Overall Accuracy Score: ", accuracy_score(y_test, y_pred))

% of predicted 1's:  0.0
Overall Accuracy Score:  0.26881720430107525
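
The output above shows that the naive model predicted no 1's at all, i.e. it assigned class 0 to every test instance. In that situation the accuracy is simply the share of class 0 in the test set, and 0.2688 × 93 rows corresponds to 25 rows. The sketch below reproduces this under that assumption (25 rows of class 0 and 68 of class 1, reconstructed from the printed numbers rather than read from the data) and adds per-class recall, which makes the collapse visible. The `recall` helper is a hypothetical name introduced for illustration.

```python
# Assumed test labels, reconstructed from the printed output:
# accuracy 0.2688... * 93 rows = 25 rows of class 0, leaving 68 of class 1.
y_test_toy = [0] * 25 + [1] * 68

# The naive model predicted no 1's at all, i.e. class 0 for every row.
y_pred_toy = [0] * len(y_test_toy)

def recall(y_true, y_pred, label):
    """Fraction of rows with the given true label that were predicted as that label."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    return hits / y_true.count(label)

accuracy = sum(t == p for t, p in zip(y_test_toy, y_pred_toy)) / len(y_test_toy)
print("Accuracy:", accuracy)
print("Recall for class 0:", recall(y_test_toy, y_pred_toy, 0))  # 1.0
print("Recall for class 1:", recall(y_test_toy, y_pred_toy, 1))  # 0.0
```

A recall of 0.0 on one class is a clearer warning sign than the overall accuracy score.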


## Class-weighted learning

In [17]:
model = simple_mlp()
model.fit(X_train, y_train, epochs = 100, class_weight = class_weights)

Epoch 1/100
...
Epoch 100/100


<keras.callbacks.History at 0x12a886780>
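
Conceptually, the `class_weight` argument passed to `fit` above scales each training sample's loss by the weight of its true class, so mistakes on the rare class cost more. The sketch below illustrates the idea with a hand-written weighted binary cross-entropy. The function name `weighted_bce`, the labels, the probabilities, and the rounded weights are all hypothetical, and the exact normalization Keras applies may differ between versions.

```python
import math

def weighted_bce(y_true, p_pred, class_weights):
    """Mean binary cross-entropy where each sample is scaled by its class weight."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        # Standard per-sample binary cross-entropy.
        loss = -(y * math.log(p) + (1 - y) * math.log(1 - p))
        # Scale by the weight of the sample's true class.
        total += class_weights[y] * loss
    return total / len(y_true)

# Hypothetical labels and predicted probabilities of class 1.
# The model leans towards class 1 everywhere, so the single class-0 sample is wrong.
y_true = [0, 1, 1, 1]
p_pred = [0.8, 0.8, 0.8, 0.8]

unweighted = weighted_bce(y_true, p_pred, {0: 1.0, 1: 1.0})
weighted = weighted_bce(y_true, p_pred, {0: 2.24, 1: 0.64})
print("Unweighted loss:", unweighted)
print("Weighted loss:  ", weighted)
```

With the weights applied, the misclassified minority sample dominates the loss, which pushes the optimizer to correct it instead of ignoring it.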

In [18]:
y_pred = model.predict(X_test).round()

In [19]:
print("% of predicted 1's: ", y_pred.sum()/len(y_pred))
print("Overall Accuracy Score: ", accuracy_score(y_test, y_pred))

% of predicted 1's:  0.7096774193548387
Overall Accuracy Score:  0.9354838709677419
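
The two printed numbers are enough to recover the full confusion matrix of the class-weighted model, given the test class counts. The sketch below works through the arithmetic. The counts of 25 rows of class 0 and 68 of class 1 are an assumption inferred from the naive run, where predicting class 0 for all 93 rows gave an accuracy of 25/93.

```python
n_test = 93
n_true_0, n_true_1 = 25, 68   # assumed test class counts (see the note above)

n_pred_1 = round(0.7096774193548387 * n_test)   # rows predicted as class 1
n_correct = round(0.9354838709677419 * n_test)  # rows predicted correctly
n_errors = n_test - n_correct

# Let tp = true 1 predicted 1, fp = true 0 predicted 1.
#   tp + fp = n_pred_1               (all rows predicted as 1)
#   fp + (n_true_1 - tp) = n_errors  (false positives + false negatives)
# Solving the two equations for fp and tp:
fp = (n_pred_1 + n_errors - n_true_1) // 2
tp = n_pred_1 - fp
fn = n_true_1 - tp
tn = n_true_0 - fp

print("tn, fp, fn, tp =", tn, fp, fn, tp)    # tn, fp, fn, tp = 23 2 4 64
print("Recall for class 0:", tn / n_true_0)  # 0.92
print("Recall for class 1:", tp / n_true_1)
```

Under these assumptions, both classes are now recognized most of the time, in contrast to the naive run where one class was never predicted.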
