## Dealing with Data Imbalance

### Context
The dataset consists of data collected from heavy Scania trucks in everyday usage. The system in focus is the Air Pressure System (APS), which generates pressurized air that is used in various functions of a truck, such as braking and gear changes. The dataset's positive class consists of failures of a specific component of the APS. The negative class consists of trucks with failures of components not related to the APS. The data is a subset of all available data, selected by experts.

### Content
The training set contains 60000 examples in total, of which 59000 belong to the negative class and 1000 to the positive class. The test set contains 16000 examples. There are 171 attributes per record.

The attribute names have been anonymized for proprietary reasons. The data consist of both single numerical counters and histograms made up of bins with different conditions. Typically the histograms have open-ended conditions at each end. For example, if we are measuring the ambient temperature "T", the histogram could be defined with 4 bins, where the lowest and highest bins are open-ended (everything below or above a threshold) and the middle bins cover bounded ranges.

The attributes are as follows: class, then anonymized operational data. The operational data have an identifier and a bin id, like "Identifier_Bin". In total there are 171 attributes, of which 7 are histogram variables. Missing values are denoted by "na".
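
As a rough illustration of the "Identifier_Bin" naming scheme, the sketch below groups column names by identifier and treats any identifier with more than one bin as a histogram variable. The helper name `group_histogram_columns` and the short column list are hypothetical, and the "more than one bin" rule is an assumption rather than something the dataset description states.

```python
from collections import defaultdict

def group_histogram_columns(columns):
    """Group 'Identifier_Bin' column names by identifier (hypothetical helper)."""
    groups = defaultdict(list)
    for name in columns:
        if name == "class":
            continue  # the label column carries no bin id
        identifier, _, bin_id = name.rpartition("_")
        groups[identifier].append(bin_id)
    # Assumption: an identifier with several bins is a histogram variable.
    return {ident: bins for ident, bins in groups.items() if len(bins) > 1}

# Toy column list; on the real data one could pass list(scania_df.columns).
sample_columns = ["aa_000", "ag_000", "ag_001", "ag_002", "ah_000", "class"]
histograms = group_histogram_columns(sample_columns)
print(histograms)
```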

### Acknowledgements
This file is part of APS Failure and Operational Data for Scania Trucks. It was imported from the UCI ML Repository.

### Inspiration
The total cost of a prediction model is the sum of Cost_1 multiplied by the number of instances with a type 1 failure and Cost_2 multiplied by the number of instances with a type 2 failure. Here Cost_1 is the cost of an unnecessary check by a mechanic at a workshop (a false positive), while Cost_2 is the cost of missing a faulty truck, which may cause a breakdown (a false negative). Cost_1 = 10 and Cost_2 = 500, so Total_cost = Cost_1 * No_Instances_type_1 + Cost_2 * No_Instances_type_2.
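
A minimal sketch of this cost metric in plain Python may help make the formula concrete. The function name `total_cost` and the toy label lists are hypothetical; the sketch assumes labels are the strings "pos" and "neg" as in this dataset.

```python
COST_1 = 10   # cost of an unnecessary workshop check (type 1, false positive)
COST_2 = 500  # cost of missing a faulty truck (type 2, false negative)

def total_cost(y_true, y_pred, positive="pos"):
    """Sum the misclassification costs over paired true/predicted labels."""
    false_pos = sum(1 for t, p in zip(y_true, y_pred)
                    if t != positive and p == positive)
    false_neg = sum(1 for t, p in zip(y_true, y_pred)
                    if t == positive and p != positive)
    return COST_1 * false_pos + COST_2 * false_neg

# Toy labels: one false positive and one false negative.
y_true = ["neg", "neg", "pos", "pos", "neg"]
y_pred = ["pos", "neg", "neg", "pos", "neg"]
print(total_cost(y_true, y_pred))
```

Because a missed failure costs 50 times as much as an unnecessary check, a model tuned for this metric should accept many false positives to avoid a single false negative.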

Can you create a model which accurately predicts and minimizes [the cost of] failures?

In [None]:
import pandas as pd
import numpy as np

In [None]:
scania_df = pd.read_csv('https://drive.google.com/uc?export=download&id=1iDFs7jt3sdEbRMXrBGLShGT4tORSEQMg')

In [None]:
scania_df.head(2)

Unnamed: 0,aa_000,ac_000,ae_000,af_000,ag_000,ag_001,ag_002,ag_003,ag_004,ag_005,ag_006,ag_007,ag_008,ag_009,ah_000,ai_000,aj_000,ak_000,al_000,am_0,an_000,ao_000,ap_000,aq_000,ar_000,as_000,at_000,au_000,av_000,ax_000,ay_000,ay_001,ay_002,ay_003,ay_004,ay_005,ay_006,ay_007,ay_008,ay_009,...,cs_009,dd_000,de_000,df_000,dg_000,dh_000,di_000,dj_000,dk_000,dl_000,dm_000,dn_000,do_000,dp_000,dq_000,dr_000,ds_000,dt_000,du_000,dv_000,dx_000,dy_000,dz_000,ea_000,eb_000,ec_00,ed_000,ee_000,ee_001,ee_002,ee_003,ee_004,ee_005,ee_006,ee_007,ee_008,ee_009,ef_000,eg_000,class
0,76698.0,2130706000.0,0.0,0.0,0.0,0.0,0.0,0.0,37250.0,1432864.0,3664156.0,1007684.0,25896.0,0.0,2551696.0,0.0,0.0,0.0,0.0,0.0,4933296.0,3655166.0,1766008.0,1132040.0,0.0,0.0,0.0,0.0,1012.0,268.0,0.0,0.0,0.0,0.0,0.0,469014.0,4239660.0,703300.0,755876.0,0.0,...,0.0,4732.0,1126.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,62282.0,85908.0,32790.0,0.0,0.0,202710.0,37928.0,14745580.0,1876644.0,0.0,0.0,0.0,0.0,2801180.0,2445.8,2712.0,965866.0,1706908.0,1240520.0,493384.0,721044.0,469792.0,339156.0,157956.0,73224.0,0.0,0.0,0.0,neg
1,33058.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,18254.0,653294.0,1720800.0,516724.0,31642.0,0.0,1393352.0,0.0,68.0,0.0,0.0,0.0,2560898.0,2127150.0,1084598.0,338544.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,71510.0,772720.0,1996924.0,99560.0,0.0,...,0.0,3312.0,522.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,33736.0,36946.0,5936.0,0.0,0.0,103330.0,16254.0,4510080.0,868538.0,0.0,0.0,0.0,0.0,3477820.0,2211.76,2334.0,664504.0,824154.0,421400.0,178064.0,293306.0,245416.0,133654.0,81140.0,97576.0,1500.0,0.0,0.0,neg


In [None]:
scania_df.shape

(60000, 147)

In [None]:
scania_df['class'].value_counts()

neg    59000
pos     1000
Name: class, dtype: int64

In [None]:
scania_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 60000 entries, 0 to 59999
Columns: 147 entries, aa_000 to class
dtypes: float64(146), object(1)
memory usage: 67.3+ MB


### Split Dataset

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
train_df, test_df = train_test_split(scania_df,
                                     train_size = 0.8,
                                     random_state = 100)

### Oversampling and undersampling

In [None]:
from sklearn.utils import resample

In [None]:
# Separate the negative (non-APS failure) and positive (APS failure) cases
train_neg = train_df[train_df['class'] == 'neg']
train_pos = train_df[train_df['class'] == 'pos']

In [None]:
train_pos.shape

(784, 147)

In [None]:
train_neg.shape

(47216, 147)


In [None]:
# Upsample the positive (minority) class.
scania_pos_upsampled = resample(train_pos,
                                replace=True,     # sample with replacement
                                n_samples=10000)

# Downsample the negative (majority) class.
scania_neg_downsampled = resample(train_neg,
                                  replace=False,     # sample without replacement
                                  n_samples=10000)

# Combine the upsampled minority class with the downsampled majority class
train_df_sampled = pd.concat([scania_pos_upsampled, scania_neg_downsampled])

In [None]:
train_df_sampled.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 20000 entries, 593 to 2472
Columns: 147 entries, aa_000 to class
dtypes: float64(146), object(1)
memory usage: 22.6+ MB


In [None]:
train_df_sampled['class'].value_counts()

neg    10000
pos    10000
Name: class, dtype: int64
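
To make the difference between sampling with and without replacement concrete, here is a small standard-library sketch of the same idea on toy data. It is an illustration only, not how `sklearn.utils.resample` is implemented; the toy rows, the seed, and the `target` size are made up.

```python
import random

rng = random.Random(100)  # fixed seed so the toy example is repeatable

minority = [("pos", i) for i in range(5)]    # toy rare class
majority = [("neg", i) for i in range(50)]   # toy common class
target = 10                                  # hypothetical per-class size

# Oversample: draw with replacement, so rows necessarily repeat.
minority_up = rng.choices(minority, k=target)
# Undersample: draw without replacement, so every row is distinct.
majority_down = rng.sample(majority, k=target)

balanced = minority_up + majority_down
print(len(minority_up), len(majority_down), len(set(minority_up)))
```

Since oversampling duplicates rows, it is applied here only to the training split; resampling before the split would let copies of the same row land in both train and test sets.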

### SMOTE sampling

In [None]:
!pip install imbalanced-learn



In [None]:
!pip install delayed



In [None]:
from imblearn.over_sampling import SMOTE

In [None]:
x_features = list(train_df.columns)
x_features.remove('class')

In [None]:
X_resampled, y_resampled = SMOTE().fit_resample(train_df[x_features], 
                                                train_df['class'])

In [None]:
X_resampled.shape

(94432, 146)

In [None]:
y_resampled.value_counts()

neg    47216
pos    47216
Name: class, dtype: int64
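
SMOTE balances the classes by generating synthetic minority rows, each placed on the line segment between a minority sample and one of its nearest minority-class neighbours. The sketch below shows only that interpolation step on toy 2-D points using the standard library. It is a simplified illustration, not imblearn's implementation, and the function name `synthetic_point` and the toy rows are hypothetical.

```python
import math
import random

rng = random.Random(0)  # fixed seed for a repeatable toy example

def synthetic_point(sample, neighbour, rng):
    """Return a point on the segment between sample and neighbour."""
    gap = rng.random()  # interpolation factor in [0, 1)
    return [s + gap * (n - s) for s, n in zip(sample, neighbour)]

minority = [[1.0, 2.0], [2.0, 3.0], [8.0, 9.0]]  # toy minority-class rows

sample = minority[0]
# Nearest other minority row by Euclidean distance (a single neighbour here;
# SMOTE proper picks at random among several nearest neighbours).
neighbour = min((p for p in minority if p is not sample),
                key=lambda p: math.dist(sample, p))
new_row = synthetic_point(sample, neighbour, rng)
print(new_row)
```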

### Undersampling using cluster centroids

In [None]:
from imblearn.under_sampling import ClusterCentroids

In [None]:
cc = ClusterCentroids(random_state=100)
X_resampled, y_resampled = cc.fit_resample(train_df[x_features], 
                                           train_df['class'])

In [None]:
y_resampled.value_counts()

pos    784
neg    784
Name: class, dtype: int64