This notebook is an example of using the EM algorithm to impute missing data. <br> It uses the Gas Sensor Drift Dataset, and the goal is to classify the samples with a stacking model after imputation.<br>
To simulate a dataset with missing values, we pick random rows and replace some of their features with NaN.<br>
In the end we compare the accuracy and F1 score obtained with the EM algorithm, the original data, and the baseline methods (KNN, MissForest, fuzzy clustering, and mean imputation).
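
The masking step described above can be sketched as follows. This is only an illustration under stated assumptions: the helper name `mask_random_cells` and its parameters `row_frac` and `n_features` are hypothetical, and the script that actually produced the `miss_*` folders is not shown in this notebook.

```python
import numpy as np
import pandas as pd


def mask_random_cells(df, row_frac, n_features, seed=0):
    """Return a copy of df where a random share of rows has n_features cells set to NaN."""
    rng = np.random.default_rng(seed)
    out = df.astype(float).copy()
    n_rows = int(round(row_frac * len(out)))
    # Pick distinct rows, then distinct columns inside each picked row.
    rows = rng.choice(len(out), size=n_rows, replace=False)
    for r in rows:
        cols = rng.choice(out.shape[1], size=n_features, replace=False)
        out.iloc[r, cols] = np.nan
    return out


# Toy frame, for illustration only.
toy = pd.DataFrame(np.arange(40, dtype=float).reshape(10, 4), columns=list("abcd"))
masked = mask_random_cells(toy, row_frac=0.5, n_features=2)
print(masked.isna().sum().sum())  # 5 rows x 2 features = 10 missing cells
```

The original frame is left untouched, so the complete data stays available as a reference for later comparison.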

In [2]:
# Add the parent directory to sys.path so the local packages (Imputers, DataQuality) can be imported.
import sys
import os
import time
parent_dir = os.path.dirname(os.getcwd())
print(parent_dir)
sys.path.append(parent_dir)

# Required packages
import numpy as np
import pandas as pd
import Imputers.utils as utils
import Imputers.em as em
import Imputers.MissForest as MissForest
import Imputers.fcm_impute as Fcm_imputer
import DataQuality.continuous as continuous
from sklearn.impute import KNNImputer, SimpleImputer
from datetime import datetime
import warnings
warnings.filterwarnings("ignore")

/Users/kogby/大學/ongoing/EDASH


Run this part first: generate the missing data before running the imputation algorithms.

Then read the missing data you generated.

In [3]:

miss_rate = 10  # assumption: any rate with a generated miss_{miss_rate} folder works here
m_X_train = pd.read_csv(f"./data/Gas Sensor Drift Dataset/miss_{miss_rate}/X_train.csv")
y_train = pd.read_csv(f"./data/Gas Sensor Drift Dataset/miss_{miss_rate}/y_train.csv")
m_X_test = pd.read_csv(f"./data/Gas Sensor Drift Dataset/miss_{miss_rate}/X_test.csv")
y_test = pd.read_csv(f"./data/Gas Sensor Drift Dataset/miss_{miss_rate}/y_test.csv")

Every row contains NaN values.


In [2]:
config = {
    'miss_rate' : [10,20,30,40,50,75],
}

result = {
    'em' : [],
    'knn' : [],
    'miss' : [],
    'fcm' : [],
    'mean' : [],
}

In [3]:
for miss_rate in config['miss_rate']:
    m_X_train = pd.read_csv(f"./data/Gas Sensor Drift Dataset/miss_{miss_rate}/X_train.csv")
    y_train = pd.read_csv(f"./data/Gas Sensor Drift Dataset/miss_{miss_rate}/y_train.csv")
    m_X_test = pd.read_csv(f"./data/Gas Sensor Drift Dataset/miss_{miss_rate}/X_test.csv")
    y_test = pd.read_csv(f"./data/Gas Sensor Drift Dataset/miss_{miss_rate}/y_test.csv")
    
    # XGBoost expects class labels starting at 0, so shift the 1-based labels down by one.
    y_train = y_train-1
    y_test = y_test-1
    # print(m_X_train.shape)
    # print(m_X_test.shape)
    missing_df = pd.concat([m_X_train, m_X_test], axis=0)
    # print(missing_df.shape)
    missing_df.reset_index(drop=True, inplace=True)
    
    # EM imputer (25 is the maximum number of iterations, as the "Iteration i/25" log shows)
    result_imputed = em.impute_em(missing_df, 25, 50000, True)
    x_train = result_imputed['X_imputed'][:11823]
    x_test = result_imputed['X_imputed'][11823:]
    accuracy, f1 = utils.generate_stack_prediction(x_train, y_train, x_test, y_test)
    result['em'].append([accuracy, f1, result_imputed['time']])
    
    # KNN
    t_start = datetime.now()
    knn = KNNImputer(n_neighbors=3)
    knn_X = knn.fit_transform(missing_df)
    time_s = datetime.now() - t_start
    knn_df = pd.DataFrame(knn_X, columns=missing_df.columns)
    
    x_train = knn_df[:11823]
    x_test = knn_df[11823:]
    accuracy, f1 = utils.generate_stack_prediction(x_train, y_train, x_test, y_test)
    result['knn'].append([accuracy, f1, time_s])

Iteration 1/25
Convergence Check: Mu:0.0000 | S:5805174052.8143
Iteration 2/25
Convergence Check: Mu:282.5408 | S:4409909385.5394
Iteration 3/25
Convergence Check: Mu:78.3394 | S:347805416.7972
Iteration 4/25
Convergence Check: Mu:30.9410 | S:92926202.8481
Iteration 5/25
Convergence Check: Mu:21.4832 | S:27960834.1280
Iteration 6/25
Convergence Check: Mu:15.7706 | S:10174902.1619
Iteration 7/25
Convergence Check: Mu:12.0635 | S:5244661.9660
Iteration 8/25
Convergence Check: Mu:9.6451 | S:2973511.8116
Iteration 9/25
Convergence Check: Mu:7.9165 | S:1580893.3171
Iteration 10/25
Convergence Check: Mu:6.4646 | S:1048638.5697
Iteration 11/25
Convergence Check: Mu:5.2568 | S:676926.4170
Iteration 12/25
Convergence Check: Mu:4.2991 | S:421972.8710
Iteration 13/25
Convergence Check: Mu:3.5505 | S:266412.1388
Iteration 14/25
Convergence Check: Mu:2.9604 | S:179426.1277
Iteration 15/25
Convergence Check: Mu:2.4886 | S:129357.8350
Iteration 16/25
Convergence Check: Mu:2.1062 | S:114597.1675
Itera
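
The `Mu` and `S` values in the log suggest that the imputer tracks how much the mean vector and the covariance matrix change between iterations. The source of `em.impute_em` is not shown here, so the following is a textbook sketch of EM imputation under a multivariate normal model, not the repository's implementation. The function name `em_impute_gaussian` and its `tol` parameter are assumptions.

```python
import numpy as np


def em_impute_gaussian(X, max_iter=50, tol=1e-6):
    """Fill NaNs in X with conditional means under a multivariate normal model."""
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    missing = np.isnan(X)
    mu = np.nanmean(X, axis=0)
    X_hat = np.where(missing, mu, X)  # start from mean imputation
    S = np.cov(X_hat, rowvar=False, bias=True) + 1e-6 * np.eye(d)

    for _ in range(max_iter):
        C_sum = np.zeros((d, d))  # accumulated conditional covariances
        # E-step: replace each missing block by its conditional expectation.
        for i in range(n):
            m = missing[i]
            if not m.any():
                continue
            o = ~m
            if not o.any():  # fully missing row: fall back to the current mean
                X_hat[i] = mu
                C_sum += S
                continue
            S_mo = S[np.ix_(m, o)]
            K = S_mo @ np.linalg.pinv(S[np.ix_(o, o)])  # regression coefficients
            X_hat[i, m] = mu[m] + K @ (X[i, o] - mu[o])
            C_sum[np.ix_(m, m)] += S[np.ix_(m, m)] - K @ S_mo.T
        # M-step: re-estimate the mean and covariance from the completed data.
        mu_new = X_hat.mean(axis=0)
        diff = X_hat - mu_new
        S_new = (diff.T @ diff + C_sum) / n
        converged = (np.linalg.norm(mu_new - mu) < tol
                     and np.linalg.norm(S_new - S) < tol)
        mu, S = mu_new, S_new
        if converged:
            break
    return X_hat, mu, S


# Synthetic, strongly correlated data, used only to exercise the sketch.
rng = np.random.default_rng(0)
full = rng.multivariate_normal([0, 0], [[1, 0.9], [0.9, 1]], size=500)
holed = full.copy()
holed[rng.random(500) < 0.3, 1] = np.nan
filled, mu, S = em_impute_gaussian(holed)
print(np.isnan(filled).sum())  # 0
```

Observed cells are never modified. Only the missing cells are overwritten in each E-step, which is why the changes in `mu` and `S` are natural quantities to monitor for convergence.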

## Classification on Original data

In [5]:
c_x_train = pd.read_csv("./data/Gas Sensor Drift Dataset/complete/X_train.csv")
c_x_test = pd.read_csv("./data/Gas Sensor Drift Dataset/complete/X_test.csv")
utils.generate_stack_prediction(c_x_train, y_train, c_x_test, y_test)

Stacking Model Test Accuracy: 0.9913751796837565
Stacking Model Test F1 Score: 0.9912982959465593


(0.9913751796837565, 0.9912982959465593)
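
The body of `utils.generate_stack_prediction` is not shown in this notebook. The printed output only indicates that it fits a stacking model and reports test accuracy and F1. A minimal sketch of that pattern with scikit-learn could look like the following. The base learners, the weighted F1 averaging, and the name `stack_prediction_sketch` are assumptions and may differ from the actual helper.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier


def stack_prediction_sketch(x_train, y_train, x_test, y_test):
    """Fit a small stacking classifier and return (accuracy, weighted F1) on the test set."""
    model = StackingClassifier(
        estimators=[
            ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
            ("knn", KNeighborsClassifier(n_neighbors=5)),
        ],
        final_estimator=LogisticRegression(max_iter=1000),
    )
    model.fit(x_train, y_train)
    pred = model.predict(x_test)
    return accuracy_score(y_test, pred), f1_score(y_test, pred, average="weighted")


# Synthetic data, used only to show the call pattern.
X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
accuracy, f1 = stack_prediction_sketch(X_tr, y_tr, X_te, y_te)
print(f"accuracy={accuracy:.3f}, f1={f1:.3f}")
```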

## Classification on imputed data (MissForest)

In [6]:
for miss_rate in [25,50,75]:
    m_X_train = pd.read_csv(f"./data/Gas Sensor Drift Dataset/miss_{miss_rate}/X_train.csv")
    m_X_test = pd.read_csv(f"./data/Gas Sensor Drift Dataset/miss_{miss_rate}/X_test.csv")

    missing_df = pd.concat([m_X_train, m_X_test], axis=0)
    missing_df.reset_index(drop=True, inplace=True)
    mf_imputer = MissForest.MissForest(max_iter = 15)
    start_time = datetime.now()
    mf_df = mf_imputer.fit_transform(missing_df, verbose=False)
    # Measure once, and avoid the name `time`, which would shadow the `time` module.
    duration = datetime.now() - start_time
    print(f"Execution time: {duration}")

    x_train = mf_df[:11823]
    x_test = mf_df[11823:]
    accuracy, f1 = utils.generate_stack_prediction(x_train, y_train, x_test, y_test)

    result['miss'].append([accuracy, f1, duration])

Execution time: 0:16:37.992924
Stacking Model Test Accuracy: 0.9923334930522281
Stacking Model Test F1 Score: 0.9920354347327498
Execution time: 0:12:28.276056
Stacking Model Test Accuracy: 0.9908960229995208
Stacking Model Test F1 Score: 0.9906800377967869
Execution time: 0:11:17.969443
Stacking Model Test Accuracy: 0.9789171058936272
Stacking Model Test F1 Score: 0.9777012944225629


In [7]:
display(missing_df.shape)

(13910, 128)
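
With 13910 stacked rows and a split at row 11823, the test part has 2087 rows. The cells in this notebook hard-code 11823. A hedged alternative is to derive the split point from the length of the training frame, as in this sketch (the helper name `split_back` is hypothetical):

```python
import numpy as np
import pandas as pd


def split_back(imputed, n_train):
    """Split a stacked, imputed frame back into its train and test parts."""
    imputed = pd.DataFrame(imputed)
    return imputed.iloc[:n_train], imputed.iloc[n_train:]


# Toy frames standing in for m_X_train and m_X_test.
train_part = pd.DataFrame(np.zeros((5, 3)))
test_part = pd.DataFrame(np.ones((2, 3)))
stacked = pd.concat([train_part, test_part], axis=0).reset_index(drop=True)

x_train, x_test = split_back(stacked, n_train=len(train_part))
print(x_train.shape, x_test.shape)  # (5, 3) (2, 3)
```

Wrapping the input in `pd.DataFrame` lets the same helper accept the NumPy arrays that some imputers return.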

## Classification on imputed data (fuzzy)

In [9]:
import time
for miss_rate in [25,50,75]:
    m_X_train = pd.read_csv(f"./data/Gas Sensor Drift Dataset/miss_{miss_rate}/X_train.csv")
    m_X_test = pd.read_csv(f"./data/Gas Sensor Drift Dataset/miss_{miss_rate}/X_test.csv")

    missing_df = pd.concat([m_X_train, m_X_test], axis=0)
    missing_df.reset_index(drop=True, inplace=True)
    c_X_train = pd.read_csv(f"./data/Gas Sensor Drift Dataset/complete/X_train.csv")
    c_X_test = pd.read_csv(f"./data/Gas Sensor Drift Dataset/complete/X_test.csv")
    x_train = pd.concat([c_X_train[0:1391], missing_df[1391:11823]], axis=0)
    x_test = missing_df[11823:]
    missing_df = pd.concat([x_train, x_test], axis=0)
    missing_df.reset_index(drop=True, inplace=True)

    fcmImputer = Fcm_imputer.FCMImputer(data = missing_df, num_clusters = 3)
    s_time = time.time()
    fuzzy_X = fcmImputer.impute()
    duration = time.time() - s_time

    x_train = fuzzy_X[:11823]
    x_test = fuzzy_X[11823:]
    accuracy, f1 = utils.generate_stack_prediction(x_train, y_train, x_test, y_test)

    result['fcm'].append([accuracy, f1, duration])

Stacking Model Test Accuracy: 0.9870627695256349
Stacking Model Test F1 Score: 0.9866270443215205
Stacking Model Test Accuracy: 0.9587925251557259
Stacking Model Test F1 Score: 0.9547934125831459
Stacking Model Test Accuracy: 0.8711068519405846
Stacking Model Test F1 Score: 0.8622686966881696


## Mean Impute

In [None]:
for miss_rate in [25,50,75]:
    m_X_train = pd.read_csv(f"./data/Gas Sensor Drift Dataset/miss_{miss_rate}/X_train.csv")
    m_X_test = pd.read_csv(f"./data/Gas Sensor Drift Dataset/miss_{miss_rate}/X_test.csv")

    missing_df = pd.concat([m_X_train, m_X_test], axis=0)
    missing_df.reset_index(drop=True, inplace=True)
    s_time = time.time()
    mean_imputer = SimpleImputer(strategy='mean')
    # fit_transform already fits the imputer, so a separate fit() call is not needed.
    mean_x = mean_imputer.fit_transform(missing_df)
    duration = time.time() - s_time
    
    x_train = mean_x[:11823]
    x_test = mean_x[11823:]
    accuracy, f1 = utils.generate_stack_prediction(x_train, y_train, x_test, y_test)
    result['mean'].append([accuracy, f1, duration])

In [None]:
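# Print the F1 score (index 1 of each [accuracy, f1, time] entry) for every imputer.
# Note: 'em' and 'knn' were run over config['miss_rate'], the other methods over [25, 50, 75],
# so index i does not refer to the same missing rate for all methods.
for i in range(3):
    for name in ('miss', 'em', 'knn', 'fcm', 'mean'):
        print(name, result[name][i][1])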
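
Because the methods were not all run over the same list of missing rates, it can help to store the rate next to each score before comparing. The following is a minimal sketch of that idea. The helper `results_table` and the `rates_by_method` mapping are hypothetical, and the numbers are toy values, not results from this notebook.

```python
import pandas as pd


def results_table(result, rates_by_method):
    """Flatten {method: [[accuracy, f1, time], ...]} into one long DataFrame."""
    rows = []
    for method, entries in result.items():
        for rate, (accuracy, f1, elapsed) in zip(rates_by_method[method], entries):
            rows.append({"method": method, "miss_rate": rate,
                         "accuracy": accuracy, "f1": f1, "time": elapsed})
    return pd.DataFrame(rows)


# Toy numbers for illustration only.
demo_result = {"knn": [[0.90, 0.89, 1.5], [0.80, 0.78, 1.7]],
               "mean": [[0.85, 0.84, 0.1]]}
demo_rates = {"knn": [10, 20], "mean": [25]}

table = results_table(demo_result, demo_rates)
print(table.pivot(index="miss_rate", columns="method", values="f1"))
```

Cells of the pivoted table stay empty (NaN) where a method was not run at a given missing rate, which makes mismatched runs visible instead of silently comparing different rates.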

In [None]:
# con = continuous()
# frame = con.comparison(knn_X_df, result_imputed['X_imputed'])