This notebook is an example of using the EM algorithm to impute missing data. <br> It uses the Gas Sensor Drift Dataset, and the downstream task is classification on the imputed features.<br>
To simulate a dataset with missing values, random rows are picked and some of their features are replaced with NaN; the resulting files are loaded from the `miss_25` directory.<br>
At the end we compare the test accuracy and F1 score obtained with EM-imputed data, the original (complete) data, and the baseline imputers (KNN and MissForest).
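Masking random cells is one way to simulate such a dataset. The snippet below is a minimal sketch of that idea, not the procedure that produced the `miss_25` files: the helper name `mask_random_cells` and its parameters `row_frac`, `cols_per_row` and `seed` are hypothetical.

```python
import numpy as np
import pandas as pd


def mask_random_cells(df, row_frac=0.25, cols_per_row=2, seed=0):
    """Return a copy of df where a fraction of rows has some cells set to NaN.

    Hypothetical helper for illustration only.
    """
    rng = np.random.default_rng(seed)
    out = df.astype(float).copy()
    n_rows = int(round(row_frac * len(out)))
    # Pick distinct rows, then distinct columns inside each picked row.
    rows = rng.choice(len(out), size=n_rows, replace=False)
    for r in rows:
        cols = rng.choice(out.shape[1], size=cols_per_row, replace=False)
        out.iloc[r, cols] = np.nan
    return out


demo = pd.DataFrame(np.arange(32, dtype=float).reshape(8, 4),
                    columns=["f0", "f1", "f2", "f3"])
masked = mask_random_cells(demo, row_frac=0.25, cols_per_row=2, seed=0)
print(masked.isna().sum().sum(), "cells masked")
```

Fixing the seed keeps the missingness pattern reproducible, so every imputer is evaluated on the same holes.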

In [5]:
import sys
import os
print(os.path.dirname(os.getcwd()))
sys.path.append(f'{os.path.dirname(os.getcwd())}')

/Users/kogby/大學/ongoing/EDASH


In [19]:
import numpy as np
import pandas as pd
import Imputers.utils as utils
import Imputers.em as em
import Imputers.MissForest as MissForest
import DataQuality.continuous as continuous
from sklearn.impute import KNNImputer
from datetime import datetime
import warnings
warnings.filterwarnings("ignore")

In [20]:
m_X_train = pd.read_csv("./data/Gas Sensor Drift Dataset/miss_25/X_train.csv")
y_train = pd.read_csv("./data/Gas Sensor Drift Dataset/miss_25/y_train.csv")
m_X_test = pd.read_csv("./data/Gas Sensor Drift Dataset/miss_25/X_test.csv")
y_test = pd.read_csv("./data/Gas Sensor Drift Dataset/miss_25/y_test.csv")
print(m_X_train.shape)
print(m_X_test.shape)

(11823, 128)
(2087, 128)


In [21]:
missing_df = pd.concat([m_X_train, m_X_test], axis=0)
print(missing_df.shape)
missing_df.reset_index(drop=True, inplace=True)

(13910, 128)


In [9]:
result_imputed = em.impute_em(missing_df, 30, 50000, False)


KeyboardInterrupt
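The internals of `em.impute_em` are not shown in this notebook. As a rough illustration of the general technique, here is a minimal sketch of EM imputation that assumes the rows follow a multivariate normal distribution. The function `em_impute_gaussian`, its parameters, and the `ridge` regularisation term are assumptions made for this example; only the returned keys `X_imputed` and `iteration` mirror the ones used in this notebook.

```python
import numpy as np


def em_impute_gaussian(X, max_iter=100, tol=1e-6, ridge=1e-6):
    """Illustrative EM imputation under a multivariate normal assumption."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    n, d = X.shape
    # Initialise with column means and the covariance of the mean-filled data.
    mu = np.nanmean(X, axis=0)
    X_hat = np.where(missing, mu, X)
    sigma = np.cov(X_hat, rowvar=False, bias=True) + ridge * np.eye(d)
    iteration = 0
    for iteration in range(1, max_iter + 1):
        extra_cov = np.zeros((d, d))
        for i in range(n):
            m = missing[i]
            if not m.any():
                continue
            o = ~m
            if not o.any():
                # Nothing observed: fall back to the current mean.
                X_hat[i] = mu
                extra_cov += sigma
                continue
            # E-step: conditional mean of the missing block given the observed one.
            s_oo = sigma[np.ix_(o, o)]
            s_mo = sigma[np.ix_(m, o)]
            coef = np.linalg.solve(s_oo, s_mo.T).T  # equals s_mo @ inv(s_oo)
            X_hat[i, m] = mu[m] + coef @ (X[i, o] - mu[o])
            # Conditional covariance of the missing block, needed in the M-step.
            extra_cov[np.ix_(m, m)] += sigma[np.ix_(m, m)] - coef @ s_mo.T
        # M-step: re-estimate mean and covariance from the completed data.
        new_mu = X_hat.mean(axis=0)
        centered = X_hat - new_mu
        new_sigma = (centered.T @ centered + extra_cov) / n + ridge * np.eye(d)
        shift = np.max(np.abs(new_mu - mu))
        mu, sigma = new_mu, new_sigma
        if shift < tol:  # stop once the mean estimate has stabilised
            break
    return {"X_imputed": X_hat, "iteration": iteration}


# Synthetic, strongly correlated data with about 20% of the cells removed.
rng = np.random.default_rng(0)
z = rng.normal(size=200)
full = np.column_stack([z, 2 * z, -z]) + 0.1 * rng.normal(size=(200, 3))
holes = rng.random(full.shape) < 0.2
observed = np.where(holes, np.nan, full)

result = em_impute_gaussian(observed)
print("iterations:", result["iteration"])
```

The per-row loop keeps the sketch readable; with 13910 rows and 128 features a practical implementation would likely group rows by missingness pattern to avoid solving one linear system per row.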



The imputed data

In [6]:
x_train = result_imputed['X_imputed'][:11823]
x_test = result_imputed['X_imputed'][11823:]
print(x_train.shape, x_test.shape)

(11823, 128) (2087, 128)
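The constant 11823 is the number of training rows printed earlier. A small sketch of the same concatenate-then-split pattern, using toy frames (the names `train`, `test` and `combined` are illustrative), derives the boundary from the training frame instead of hard-coding it:

```python
import pandas as pd

train = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [4.0, 5.0, 6.0]})
test = pd.DataFrame({"a": [7.0, 8.0], "b": [9.0, 10.0]})

# Stack the two parts so one imputer sees every row, as done with missing_df.
combined = pd.concat([train, test], axis=0).reset_index(drop=True)

# Split back at the boundary given by the training row count.
n_train = len(train)
train_part = combined.iloc[:n_train]
test_part = combined.iloc[n_train:].reset_index(drop=True)
print(train_part.shape, test_part.shape)
```

In this notebook the equivalent boundary would be `len(m_X_train)`.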


Check how many E-step and M-step iterations were run, and whether any NaN values remain.

In [7]:
# Confirm that no NaN values remain after imputation
print("Count of NaN values: \n", np.isnan(result_imputed['X_imputed']).sum())
print("The iterations count is: ", result_imputed['iteration'])

Count of NaN values: 
 feat_9     0
feat_65    0
feat_11    0
feat_12    0
feat_67    0
          ..
feat_32    0
feat_40    0
feat_29    0
feat_50    0
feat_58    0
Length: 128, dtype: int64
The iterations count is:  10


In [24]:
# Subtract 1 from every label (e.g. labels 1..k become 0..k-1)
y_train = y_train - 1
y_test = y_test - 1

## Classification on imputed data (EM)

In [15]:
utils.generate_stack_prediction(x_train, y_train, x_test, y_test)

  rf_model.fit(X_train, y_train)


Stacking Model Test Accuracy: 0.9908960229995208
Stacking Model Test F1 Score: 0.9906699626007768


  y = column_or_1d(y, warn=True)
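The body of `utils.generate_stack_prediction` is not shown here. The following is a minimal sketch of what a stacking evaluation of this kind could look like with scikit-learn, run on synthetic data. The choice of base learners, the logistic-regression meta-learner, the macro-averaged F1 score, and the name `stack_and_score` are all assumptions. Flattening the labels with `np.ravel` is the usual way to avoid scikit-learn's column-vector warning, which is what the `column_or_1d` fragments in the output point to.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier


def stack_and_score(X_train, y_train, X_test, y_test):
    """Fit a stacking classifier and return (accuracy, macro F1). Illustrative."""
    # Labels loaded from CSV arrive as a one-column frame; flatten them to 1-D.
    y_train = np.ravel(y_train)
    y_test = np.ravel(y_test)
    stack = StackingClassifier(
        estimators=[
            ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
            ("knn", KNeighborsClassifier(n_neighbors=5)),
        ],
        final_estimator=LogisticRegression(max_iter=1000),
        cv=3,
    )
    stack.fit(X_train, y_train)
    pred = stack.predict(X_test)
    return accuracy_score(y_test, pred), f1_score(y_test, pred, average="macro")


X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)
# Pass labels as column vectors to mimic what pd.read_csv returns.
acc, f1 = stack_and_score(X_tr, y_tr.reshape(-1, 1), X_te, y_te.reshape(-1, 1))
print(f"Stacking Model Test Accuracy: {acc}")
print(f"Stacking Model Test F1 Score: {f1}")
```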


## Classification on Original data

In [17]:
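c_x_train = pd.read_csv("./data/Gas Sensor Drift Dataset/complete/X_train.csv")
c_x_test = pd.read_csv("./data/Gas Sensor Drift Dataset/complete/X_test.csv")
# Note: this call trains on x_train (the EM-imputed features) and evaluates on the
# complete test set; pass c_x_train instead to train on the complete data as well.
utils.generate_stack_prediction(x_train, y_train, c_x_test, y_test)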

  rf_model.fit(X_train, y_train)


Stacking Model Test Accuracy: 0.9928126497364638
Stacking Model Test F1 Score: 0.9924020428382795


  y = column_or_1d(y, warn=True)


## Classification on imputed data (KNN)

In [24]:
t_start = datetime.now()
knn = KNNImputer(n_neighbors=3)
knn_X = knn.fit_transform(missing_df)
print(f"Time spent (KNN): {datetime.now() - t_start}")
knn_df = pd.DataFrame(knn_X, columns=missing_df.columns)
print(knn_df)

Time spent (KNN): 0:03:57.051372
              feat_9        feat_65    feat_11    feat_12    feat_67  \
0       80127.020500   68950.868967  24.049747  37.051157  25.634138   
1        3994.906200    1371.974600   0.855557   1.405853   0.402949   
2       12669.775000   12788.556200   2.405838   3.672876   2.281033   
3       21257.013667   21429.613300   5.033300   8.510600   5.328100   
4       41302.083500   32388.739200  11.882996  19.091307   8.124052   
...              ...            ...        ...        ...        ...   
13905   37239.475100   27690.390100  10.733777  16.006722   6.981733   
13906    5967.971100    2481.697000   1.231430   1.549247   0.451234   
13907  218099.800800  179740.490300  53.097933  76.838475  47.779912   
13908    2257.013700    2023.651400   0.282856   0.489725   0.211132   
13909   20108.225600   11606.244500   5.854643   8.722554   2.640724   

         feat_14    feat_68        feat_73    feat_69         feat_1  ...  \
0     -17.922769

In [25]:
x_train = knn_df[:11823]
x_test = knn_df[11823:]
utils.generate_stack_prediction(x_train, y_train, x_test, y_test)

  rf_model.fit(X_train, y_train)


Stacking Model Test Accuracy: 0.9880210828941064
Stacking Model Test F1 Score: 0.9875257391487245


  y = column_or_1d(y, warn=True)
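To see what `KNNImputer` does to a single missing cell, here is a small worked example on a 4x3 array. Each NaN is replaced by the mean of that feature over the nearest rows that have the feature observed, with distances computed on the coordinates both rows share. The array values are made up for illustration.

```python
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

imputer = KNNImputer(n_neighbors=2)
filled = imputer.fit_transform(X)
print(filled)
# Row 0, column 2: the two nearest donors hold 3.0 and 5.0, so the fill is 4.0.
# Row 2, column 0: the two nearest donors hold 3.0 and 8.0, so the fill is 5.5.
```

Because the distances are unscaled, features with large magnitudes dominate the neighbour search, so standardising the columns before imputation is worth considering on data like this.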


## Classification on imputed data (MissForest)

In [22]:
mf_imputer = MissForest.MissForest(max_iter = 20)
start_time = datetime.now()
mf_df = mf_imputer.fit_transform(missing_df, verbose=True)
print(f"Execution time: {datetime.now() - start_time}")

Iteration: 1/20
Continuous: feat_9 
Using LGBMRegressor(verbosity=-1)
Continuous: feat_65 
Using LGBMRegressor(verbosity=-1)
Continuous: feat_11 
Using LGBMRegressor(verbosity=-1)
Continuous: feat_12 
Using LGBMRegressor(verbosity=-1)
Continuous: feat_67 
Using LGBMRegressor(verbosity=-1)
Continuous: feat_14 
Using LGBMRegressor(verbosity=-1)
Continuous: feat_68 
Using LGBMRegressor(verbosity=-1)
Continuous: feat_73 
Using LGBMRegressor(verbosity=-1)
Continuous: feat_69 
Using LGBMRegressor(verbosity=-1)
Continuous: feat_1 
Using LGBMRegressor(verbosity=-1)
Continuous: feat_70 
Using LGBMRegressor(verbosity=-1)
Continuous: feat_75 
Using LGBMRegressor(verbosity=-1)
Continuous: feat_100 
Using LGBMRegressor(verbosity=-1)
Continuous: feat_3 
Using LGBMRegressor(verbosity=-1)
Continuous: feat_84 
Using LGBMRegressor(verbosity=-1)
Continuous: feat_116 
Using LGBMRegressor(verbosity=-1)
Continuous: feat_15 
Using LGBMRegressor(verbosity=-1)
Continuous: feat_77 
Using LGBMRegressor(verbosity

In [36]:
print(mf_df.shape)

In [25]:
x_train = mf_df[:11823]
x_test = mf_df[11823:]
utils.generate_stack_prediction(x_train, y_train, x_test, y_test)

Stacking Model Test Accuracy: 0.9918543363679924
Stacking Model Test F1 Score: 0.9914947844696913
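The project's `MissForest` class is not reproduced here. A comparable MissForest-style procedure can be sketched with scikit-learn's `IterativeImputer`, which repeatedly regresses each feature on the others and refills the missing cells. Note the substitutions: this sketch uses `RandomForestRegressor` rather than the `LGBMRegressor` reported in the log above, and the synthetic data and parameter values are assumptions.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import IterativeImputer

# Synthetic data with non-linear relations between the columns.
rng = np.random.default_rng(0)
z = rng.normal(size=150)
full = np.column_stack([z, z ** 2, np.sin(z)]) + 0.05 * rng.normal(size=(150, 3))
holes = rng.random(full.shape) < 0.15
with_nan = np.where(holes, np.nan, full)

# Each round fits one regressor per feature with missing values.
imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=30, random_state=0),
    max_iter=5,
    random_state=0,
)
filled_rf = imputer.fit_transform(with_nan)
print(filled_rf.shape, "NaN left:", int(np.isnan(filled_rf).sum()))
```

Tree-based regressors can capture non-linear dependencies that a Gaussian EM model cannot, at the cost of fitting one model per feature per iteration.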


## Classification on imputed data (fuzzy)

In [None]:
# con = continuous()
# frame = con.comparison(knn_X_df, result_imputed['X_imputed'])