# Missing Values

1. Remove missing values
  - pros: simple, straightforward
  - cons: discards too many rows when many values are missing
2. Fill in missing values
  - fill with mean, median, mode
  - predict missing values with other features
    - decision tree

In [1]:
import pandas as pd
import numpy as np

dataset_missing = pd.DataFrame({
    "A": [1, 2, 3, 5, np.nan, 6, 6, 7],
    "B": [1, 1, np.nan, 2, 3, 4, 1, np.nan]
})

dataset_missing

Unnamed: 0,A,B
0,1.0,1.0
1,2.0,1.0
2,3.0,
3,5.0,2.0
4,,3.0
5,6.0,4.0
6,6.0,1.0
7,7.0,


In [2]:
# drop missing values

dataset_missing.dropna()

Unnamed: 0,A,B
0,1.0,1.0
1,2.0,1.0
3,5.0,2.0
5,6.0,4.0
6,6.0,1.0


In [3]:
# fill missing values

dataset_missing.fillna(dataset_missing.mean())

Unnamed: 0,A,B
0,1.0,1.0
1,2.0,1.0
2,3.0,2.0
3,5.0,2.0
4,4.285714,3.0
5,6.0,4.0
6,6.0,1.0
7,7.0,2.0
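
The notes above also mention predicting missing values from other features with a decision tree. Here is a minimal sketch of that idea, assuming a hypothetical toy frame in which columns `A` and `C` are fully observed and only `B` has gaps. The column names, the data, and the `max_depth` setting are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Hypothetical toy data: "B" has gaps, "A" and "C" are fully observed.
df = pd.DataFrame({
    "A": [1, 2, 3, 5, 6, 6, 7, 8],
    "C": [0, 0, 1, 1, 1, 2, 2, 2],
    "B": [1.0, 1.0, np.nan, 2.0, 3.0, 4.0, 1.0, np.nan],
})

features = ["A", "C"]
known = df[df["B"].notna()]     # rows where the target column is observed
unknown = df[df["B"].isna()]    # rows whose "B" value must be predicted

# Fit a shallow tree on the observed rows only.
model = DecisionTreeRegressor(max_depth=2, random_state=0)
model.fit(known[features], known["B"])

# Fill the gaps with the tree's predictions; observed values stay untouched.
df_imputed = df.copy()
df_imputed.loc[unknown.index, "B"] = model.predict(unknown[features])
```

This only works directly when the predictor columns themselves have no missing values; otherwise they would need to be filled first (for example with the mean).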


# Feature Encoding

## One-hot Encoding
- transform one column into many columns of 0/1 indicators
- pros: can handle categorical variables
- cons: creates a sparse feature matrix (a lot of 0's)

## Discretization
- transform a continuous variable into discrete values
- when to use discretization:
  - linear model: many discrete features + simple model
    - pros: simple model
    - cons: requires heavy feature engineering
  - non-linear model (deep learning): fewer continuous features + complex model
    - pros: doesn't need complex feature engineering
    - cons: complex model

In [4]:
dataset_feature_encoding = pd.DataFrame({
    "Gender": ["M", "M", "F", "M", "F"],
    "Country": ["US", "US", "CHN", "KOR", "JPN"],
    "Age": [10, 15, 20, 30, 50]
})

dataset_feature_encoding

Unnamed: 0,Gender,Country,Age
0,M,US,10
1,M,US,15
2,F,CHN,20
3,M,KOR,30
4,F,JPN,50


In [5]:
# one-hot encoding in pandas
# dtype=int gives 0/1 columns (pandas >= 2.0 returns booleans by default)

pd.get_dummies(dataset_feature_encoding, dtype=int)

Unnamed: 0,Age,Gender_F,Gender_M,Country_CHN,Country_JPN,Country_KOR,Country_US
0,10,0,1,0,0,0,1
1,15,0,1,0,0,0,1
2,20,1,0,1,0,0,0
3,30,0,1,0,0,1,0
4,50,1,0,0,1,0,0


In [6]:
# discretization in pandas

pd.cut(x=dataset_feature_encoding['Age'], bins=2)

0    (9.96, 30.0]
1    (9.96, 30.0]
2    (9.96, 30.0]
3    (9.96, 30.0]
4    (30.0, 50.0]
Name: Age, dtype: category
Categories (2, interval[float64]): [(9.96, 30.0] < (30.0, 50.0]]
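
Equal-width bins are not always meaningful. As a sketch, `pd.cut` also accepts explicit bin edges and labels, and the resulting categories can be one-hot encoded so that a linear model can use them. The edges and label names below are assumptions picked only for illustration.

```python
import pandas as pd

ages = pd.Series([10, 15, 20, 30, 50], name="Age")

# Hypothetical bin edges and labels; intervals are (0, 18], (18, 40], (40, 120].
age_group = pd.cut(ages, bins=[0, 18, 40, 120], labels=["minor", "adult", "senior"])

# One-hot encode the discretized column: one 0/1 column per age group.
age_dummies = pd.get_dummies(age_group, prefix="Age", dtype=int)
```

Here ages 10 and 15 fall into `minor`, 20 and 30 into `adult`, and 50 into `senior`.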

# Standardization

## min-max
- obtain min, max from training set

$$\hat{x} = \frac{x - \min(x)}{\max(x) - \min(x)} \in [0, 1]$$

## z-score
- obtain mean, std from training set

$$\hat{x} = \frac{x - \mu(x)}{\sigma(x)}$$
$$E[\hat{x}] = 0, Var[\hat{x}] = 1$$



In [7]:
from sklearn.preprocessing import StandardScaler, MinMaxScaler

data_pre_standardization = [[-1, 2], [-0.5, 6], [0, 10], [1, 18], [3, 6], [2, 5]]

# min max scaler
min_max_scaler = MinMaxScaler()
min_max_scaler.fit_transform(data_pre_standardization)

array([[0.    , 0.    ],
       [0.125 , 0.25  ],
       [0.25  , 0.5   ],
       [0.5   , 1.    ],
       [1.    , 0.25  ],
       [0.75  , 0.1875]])

In [8]:
# z-score scaler
z_score_scaler = StandardScaler()
z_score_scaler.fit_transform(data_pre_standardization)

array([[-1.24393264, -1.14096739],
       [-0.88852332, -0.35858975],
       [-0.53311399,  0.42378789],
       [ 0.17770466,  1.98854317],
       [ 1.59934197, -0.35858975],
       [ 0.88852332, -0.55418416]])
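
The notes say the statistics (min, max, mean, std) come from the training set. A minimal sketch of what that means in practice: fit the scaler on the training rows only, then reuse the fitted scaler on new data. The train/test split of the six rows is a hypothetical one made up for this illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical split of the six rows used above into training and test parts.
X_train = np.array([[-1, 2], [-0.5, 6], [0, 10], [1, 18]], dtype=float)
X_test = np.array([[3, 6], [2, 5]], dtype=float)

# Learn min/max from the training set only, then apply them to the test set.
min_max_scaler = MinMaxScaler().fit(X_train)
X_test_min_max = min_max_scaler.transform(X_test)

# Learn mean/std from the training set only, then apply them to the test set.
z_score_scaler = StandardScaler().fit(X_train)
X_test_z = z_score_scaler.transform(X_test)
```

Because the test rows were not seen during fitting, scaled test values can fall outside $[0, 1]$: the first test value 3 maps to $(3 - (-1)) / (1 - (-1)) = 2$.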

# Imbalanced Data

## Downsampling / Undersampling
- sample the majority class without replacement
- ensure the sampled class keeps the same feature distribution before and after downsampling

## Upsampling / Oversampling
- sample the minority class with replacement
- ensure the sampled class keeps the same feature distribution before and after upsampling


In [9]:
from sklearn.utils import resample

data_imbalanced = pd.DataFrame({
    "Gender": ["M", "M", "F", "M", "F", "M", "F", "F", "M", "F", "M", "F"],
    "Age": [30, 40, 32, 19, 12, 35, 12, 31, 10, 21, 57, 39],
    "Target": [0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0]
})

data_imbalanced

Unnamed: 0,Gender,Age,Target
0,M,30,0
1,M,40,0
2,F,32,0
3,M,19,1
4,F,12,0
5,M,35,0
6,F,12,0
7,F,31,1
8,M,10,0
9,F,21,0
10,M,57,0
11,F,39,0


In [10]:
# downsampling

def downsampling(df, target_col):
    df_major = df[df[target_col] == 0]
    df_minor = df[df[target_col] == 1]

    df_major_downsampled = resample(df_major,
                                    replace=False,               # sample without replacement
                                    n_samples=len(df_minor),     # to match minority class
                                    random_state=123)

    df_downsampled = pd.concat([df_major_downsampled, df_minor])

    return df_downsampled

downsampling(data_imbalanced, 'Target')

Unnamed: 0,Gender,Age,Target
5,M,35,0
0,M,30,0
3,M,19,1
7,F,31,1


In [11]:
# upsampling

def upsampling(df, target_col):
    df_major = df[df[target_col] == 0]
    df_minor = df[df[target_col] == 1]
    
    df_minor_upsampled = resample(df_minor,
                                  replace=True,                # sample with replacement
                                  n_samples=len(df_major),     # to match majority class
                                  random_state=123)
    
    df_upsampled = pd.concat([df_minor_upsampled, df_major])
    
    return df_upsampled

upsampling(data_imbalanced, 'Target')

Unnamed: 0,Gender,Age,Target
3,M,19,1
7,F,31,1
3,M,19,1
3,M,19,1
3,M,19,1
3,M,19,1
3,M,19,1
7,F,31,1
7,F,31,1
3,M,19,1
0,M,30,0
1,M,40,0
2,F,32,0
4,F,12,0
5,M,35,0
6,F,12,0
8,M,10,0
9,F,21,0
10,M,57,0
11,F,39,0


## Bootstrap / Ensemble
1. sample from the majority class
2. combine the sample with the minority class to train a classifier
3. repeat the process to train multiple classifiers
4. average the results from all classifiers

<img src="https://www.kdnuggets.com/wp-content/uploads/imbalanced-data-2.png" alt="ensemble" width=500>
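
The four steps can be sketched roughly as follows. The synthetic data from `make_classification`, the choice of a shallow decision tree as the base classifier, the number of models, and the 0.5 threshold are all assumptions made for this illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import resample

# Synthetic imbalanced data (roughly 90% class 0, 10% class 1), for illustration only.
X, y = make_classification(n_samples=500, n_features=5,
                           weights=[0.9, 0.1], random_state=0)
X_major, X_minor = X[y == 0], X[y == 1]

n_models = 5  # hypothetical ensemble size
models = []
for seed in range(n_models):
    # Step 1: sample the majority class down to the minority class size.
    X_major_sample = resample(X_major, replace=False,
                              n_samples=len(X_minor), random_state=seed)
    # Step 2: combine with the full minority class and train one classifier.
    X_balanced = np.vstack([X_major_sample, X_minor])
    y_balanced = np.concatenate([np.zeros(len(X_major_sample), dtype=int),
                                 np.ones(len(X_minor), dtype=int)])
    clf = DecisionTreeClassifier(max_depth=3, random_state=seed)
    models.append(clf.fit(X_balanced, y_balanced))  # Step 3: repeat per seed.

# Step 4: average the predicted probability of class 1 over all classifiers.
proba = np.mean([m.predict_proba(X)[:, 1] for m in models], axis=0)
y_pred = (proba >= 0.5).astype(int)
```

Each classifier sees a different slice of the majority class, so the ensemble uses more of the majority data than a single downsampled model would.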

## Class Weights
- modify weights of different classes
- $w_k$: weight for class $k$; sample $i$ uses the weight of its own class, $w_{y_i}$

$$\mathcal{L} = \sum_{i=1}^n w_{y_i} \ell(x_i, y_i)$$
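
In scikit-learn, many estimators take a `class_weight` argument that plays the role of $w_k$. The sketch below uses the `"balanced"` heuristic, which sets each weight inversely proportional to the class frequency. The toy data (10 samples of class 0, 2 of class 1) and the choice of logistic regression are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical toy data: 10 samples of class 0 and 2 samples of class 1.
X = np.array([[30], [40], [32], [12], [35], [12],
              [10], [21], [57], [39], [19], [31]], dtype=float)
y = np.array([0] * 10 + [1] * 2)

# "balanced" weights: w_k = n_samples / (n_classes * n_samples_in_class_k)
weights = compute_class_weight(class_weight="balanced",
                               classes=np.array([0, 1]), y=y)
# class 0: 12 / (2 * 10) = 0.6, class 1: 12 / (2 * 2) = 3.0

# The same weighting is applied inside the loss when fitting the model.
clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)
```

A dictionary such as `class_weight={0: 1, 1: 5}` can be passed instead when the weights should be set by hand.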

## Evaluation Metrics

- precision
- recall
- F1 score
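
A small sketch of why these metrics matter on imbalanced data, using hypothetical labels and predictions: accuracy looks high because most samples are negatives, while precision, recall and F1 show that only half of the positives are handled correctly.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Hypothetical labels: 8 negatives, 2 positives.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]
# Confusion counts: TP = 1, FP = 1, FN = 1, TN = 7

accuracy = accuracy_score(y_true, y_pred)    # (TP + TN) / total = 0.8
precision = precision_score(y_true, y_pred)  # TP / (TP + FP) = 0.5
recall = recall_score(y_true, y_pred)        # TP / (TP + FN) = 0.5
f1 = f1_score(y_true, y_pred)                # harmonic mean of the two = 0.5
```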

# Sparse Data

$$Sparsity = \frac{\text{number of zero elements}}{\text{total number of elements}}$$

- L1 regularization to drop features: Lasso
- sparse, high-dimensional data is often linearly separable: linear SVM
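
As a sketch, the sparsity formula can be computed directly with NumPy, and the L1 idea can be seen with `Lasso` on synthetic data. The matrix, the synthetic regression problem (only the first 2 of 10 features carry signal), and `alpha=0.1` are assumptions chosen for illustration.

```python
import numpy as np
from scipy import sparse
from sklearn.linear_model import Lasso

# Sparsity = number of zero elements / total number of elements
X_dense = np.array([
    [1, 0, 0, 0],
    [0, 0, 2, 0],
    [0, 0, 0, 0],
    [0, 3, 0, 0],
])
sparsity = 1.0 - np.count_nonzero(X_dense) / X_dense.size  # 13 / 16 = 0.8125

# A compressed sparse row matrix stores only the 3 nonzero entries.
X_csr = sparse.csr_matrix(X_dense)

# L1 regularization: Lasso shrinks coefficients of uninformative features toward zero.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=200)

lasso = Lasso(alpha=0.1).fit(X, y)
kept_features = np.flatnonzero(lasso.coef_)  # indices of features with nonzero weight
```

Features whose coefficient ends up at exactly zero can be dropped from the model.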

# References

- 7 Techniques to Handle Imbalanced Dataset: https://www.kdnuggets.com/2017/06/7-techniques-handle-imbalanced-data.html
- A Gentle Introduction to Sparse Matrices for Machine Learning: https://machinelearningmastery.com/sparse-matrices-for-machine-learning/