Iterative Imputation and MissForest 

Iterative method with different regressors for missing data imputation
- Sklearn - Iterative Imputation for Tables. https://scikit-learn.org/stable/modules/impute.html#iterative-imputer
- MissForest - https://academic.oup.com/bioinformatics/article/28/1/112/219101


## (1) Initial Setup

In [8]:
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.model_selection import train_test_split
import random
from sklearn.impute import KNNImputer
from sklearn.metrics import r2_score
from sklearn import linear_model

### This part is for real data 

In [9]:
# Load the data arrays from local .npy files.
# Adjust the paths below so that they point to your own copy of the data directory.
X = np.load("/Users/jiaweizhang/research/data/X.npy")
Y = np.load("/Users/jiaweizhang/research/data/Y.npy")
Z = np.load("/Users/jiaweizhang/research/data/Z.npy")
U = np.load("/Users/jiaweizhang/research/data/U.npy")
M = np.load("/Users/jiaweizhang/research/data/M.npy")

#display(pd.DataFrame(X))
display(pd.DataFrame(Y))
#display(pd.DataFrame(Z))
#display(pd.DataFrame(U))
#display(pd.DataFrame(M))


Unnamed: 0,0,1,2
0,48.232187,64.290355,249.989860
1,-12.442706,-7.083779,40.446066
2,8.363526,-1.712051,4.156605
3,19.191582,5.420550,1.553134
4,9.187136,-1.146317,2.446474
...,...,...,...
19995,2.400543,3.147297,2.128416
19996,28.342436,24.370663,60.225367
19997,33.910725,66.446965,255.272269
19998,50.144414,51.241442,172.876204


## (2) Data Processing
- Simulated clinical data
- Train test split
- Data masking (to simulate missing values)

### Data masking for Y train and Y test

This code generates a new matrix Y_missing with the same shape as Y. For each entry of Y, a uniform random number in [0, 1) is drawn; if it falls below missing_value_threshold (default 0.5), the corresponding entry of Y_missing is set to NaN, otherwise it keeps the value from Y. The calls below pass missing_value_threshold=0.2, so roughly 20% of the entries of Y_train and Y_test are hidden.

In [7]:
# Hide about 20% of the entries of Y in both the training and the test set.
# Note: Y_train and Y_test come from the "Train test split" cell, so run that cell first.
def replace_values(Y, missing_value_threshold=0.5):
    # Work on a float copy so that np.nan can be stored and Y itself stays untouched
    Y_missing = np.array(Y, dtype=float)
    # Mask each entry independently with probability missing_value_threshold
    mask = np.random.random(Y_missing.shape) < missing_value_threshold
    Y_missing[mask] = np.nan
    return Y_missing

Y_train_missing = replace_values(Y_train, missing_value_threshold=0.2)
Y_test_missing = replace_values(Y_test, missing_value_threshold=0.2)

display(pd.DataFrame(Y_train_missing))
display(pd.DataFrame(Y_test_missing))

Unnamed: 0,0,1,2
0,-1.041859,,
1,17.707395,4.939858,1.46241
2,39.603196,85.183521,
3,-1.099649,-5.068775,5.568457
4,3.934051,8.996238,8.846162
...,...,...,...
15995,0.521284,-2.319886,2.293248
15996,2.109003,,20.19499
15997,14.893276,38.998803,116.564241
15998,-3.392623,-6.745171,27.081845


Unnamed: 0,0,1,2
0,9.049006,-2.254202,3.318803
1,6.195936,,
2,-12.909524,,32.401378
3,14.001708,38.280861,
4,,-1.04303,2.661235
...,...,...,...
3995,-32.33616,-12.087786,67.26905
3996,,9.152145,6.511489
3997,9.665732,23.81599,43.406062
3998,-4.063137,-6.830574,12.54507
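
As a quick sanity check of the masking rule used by replace_values, the sketch below applies the same threshold test to a synthetic matrix and measures the fraction of hidden entries. The array `Y_demo`, the seed, and the matrix size are assumptions made only for illustration; they are not the notebook's data.

```python
import numpy as np

rng = np.random.default_rng(0)           # fixed seed, chosen only for reproducibility
Y_demo = rng.normal(size=(10_000, 3))    # synthetic stand-in for Y

# Same rule as replace_values: hide an entry when a uniform draw falls below the threshold
threshold = 0.2
Y_demo_missing = Y_demo.copy()
Y_demo_missing[rng.random(Y_demo.shape) < threshold] = np.nan

# The realized fraction of NaN entries should be close to the threshold
missing_fraction = np.isnan(Y_demo_missing).mean()
print(f"fraction of masked entries: {missing_fraction:.3f}")
```

With independent draws the realized fraction fluctuates around the threshold, so it will be close to 0.2 but not exactly 0.2.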


### Train test split

In [10]:
# Split X, Y and Z jointly (80% train, 20% test) so that their rows stay aligned.
# Separate calls would shuffle each array independently and mismatch the rows.
X_train, X_test, Y_train, Y_test, Z_train, Z_test = train_test_split(
    X, Y, Z, test_size=0.2, random_state=0)
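
train_test_split shuffles rows by default, so row alignment between arrays is only guaranteed when they are passed in the same call. The sketch below illustrates this on two toy arrays (`A` and `B` are hypothetical stand-ins, not the notebook's data) in which row i of `B` is always 10 times row i of `A`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

ids = np.arange(10)
A = ids.reshape(-1, 1)           # toy "X": row i holds i
B = (ids * 10).reshape(-1, 1)    # toy "Y": row i holds 10 * i

# Joint call: the same shuffle is applied to both arrays
A_tr, A_te, B_tr, B_te = train_test_split(A, B, test_size=0.2, random_state=0)
joint_aligned = np.array_equal(A_tr * 10, B_tr)

# Separate calls with different seeds: each array gets its own shuffle
A_tr2, _ = train_test_split(A, test_size=0.2, random_state=1)
B_tr2, _ = train_test_split(B, test_size=0.2, random_state=2)
separate_aligned = np.array_equal(A_tr2 * 10, B_tr2)

print("joint call keeps rows aligned:", joint_aligned)
print("separate calls keep rows aligned:", separate_aligned)
```

When the arrays are later concatenated column-wise, misaligned rows would pair features from one sample with targets from another, which is why the joint call matters here.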


### Create dataframe


In [None]:
# Concatenate X, Y, and Z horizontally
XYZ_train = np.concatenate((X_train, Y_train_missing, Z_train), axis=1)
XYZ_train_true = np.concatenate((X_train, Y_train, Z_train), axis=1)
XYZ_test = np.concatenate((X_test, Y_test_missing, Z_test), axis=1)
XYZ_test_true = np.concatenate((X_test, Y_test, Z_test), axis=1)
df_train = pd.DataFrame(XYZ_train)
df_train_true = pd.DataFrame(XYZ_train_true)
df_test = pd.DataFrame(XYZ_test)
df_test_true = pd.DataFrame(XYZ_test_true)
print(df_train,df_train_true,df_test,df_test_true)


## (3) Data Imputation

### (i) MissForest
missForest is an algorithm for data imputation, the process of filling in missing values in a dataset. It is a popular method and turns out to be a particular instance of the sequential imputation algorithms that can all be implemented with IterativeImputer by passing in different regressors for predicting missing feature values. In the case of missForest, this regressor is a random forest. See "Imputing missing values with variants of IterativeImputer" in the scikit-learn documentation.

missForest uses random forests to impute missing data. It starts from an initial guess for the missing values (for example, the column mean) and then visits the variables one at a time: a random forest is fitted on the rows where that variable is observed, using the other variables as predictors, and its predictions replace the missing entries of that variable. The pass over all variables is repeated until a stopping criterion on the change between successive imputed datasets is met. The idea is that an ensemble of decision trees can model nonlinear relationships and interactions between the variables without assuming a particular distribution.

One advantage of the original missForest is that it handles missing values in both categorical and continuous variables. It produces a single imputed dataset rather than multiple imputations, and it reports an out-of-bag estimate of the overall imputation error rather than an uncertainty for each imputed value. Note that the IterativeImputer variant used here, with RandomForestRegressor as the estimator, applies to numeric features only.

source - 

In [None]:
missForest = IterativeImputer(estimator = RandomForestRegressor(), random_state=0)
missForest.fit(df_train)
display(pd.DataFrame(missForest.transform(df_train)))
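
To make the iterative idea concrete, here is a minimal hand-written sketch of a missForest-style loop on synthetic numeric data. It is not the reference implementation: the data, the seed, the forest size, and the fixed number of three passes (used instead of the stopping criterion of the original algorithm) are all assumptions chosen to keep the example short.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 300
x0 = rng.normal(size=n)
# Three related numeric columns (synthetic, for illustration only)
data = np.column_stack([x0, 2 * x0 + rng.normal(scale=0.1, size=n), x0 ** 2])

mask = rng.random(data.shape) < 0.2       # True where a value is hidden
data_missing = np.where(mask, np.nan, data)

# Step 1: initial guess, the mean of the observed values in each column
col_means = np.nanmean(data_missing, axis=0)
imputed = np.where(mask, col_means, data_missing)

# Step 2: repeatedly re-predict each column's missing entries from the other columns
for _ in range(3):                                  # fixed number of passes (assumption)
    for j in np.argsort(mask.sum(axis=0)):          # columns with fewest missing values first
        missing_rows = mask[:, j]
        if not missing_rows.any():
            continue
        others = np.delete(imputed, j, axis=1)      # predictors: all other columns
        forest = RandomForestRegressor(n_estimators=30, random_state=0)
        forest.fit(others[~missing_rows], imputed[~missing_rows, j])
        imputed[missing_rows, j] = forest.predict(others[missing_rows])

# Compare against plain mean imputation on the hidden entries
rmse_forest = np.sqrt(np.mean((imputed[mask] - data[mask]) ** 2))
rmse_mean = np.sqrt(np.mean((np.where(mask, col_means, data)[mask] - data[mask]) ** 2))
print("RMSE forest loop:", rmse_forest, "RMSE mean imputation:", rmse_mean)
```

Observed entries are never overwritten; only the positions marked in `mask` are updated on each pass.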

___
### (ii) KNN Imputation
The basic idea behind KNN for imputation is to replace missing values with the average of the k-nearest neighbors in the feature space. The value of k is determined by the user and can be set using cross-validation. KNN imputation is considered a simple and effective method for imputing missing data, particularly for small amounts of missing values. However, for larger amounts of missing data or for data with a large number of features, more advanced imputation methods may be needed.

Source - https://scikit-learn.org/stable/modules/generated/sklearn.impute.KNNImputer.html#sklearn.impute.KNNImputer

In [None]:
KNNimputer = KNNImputer(n_neighbors=15)
KNNimputer.fit(df_train)
display(pd.DataFrame(KNNimputer.transform(df_train)))
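
A tiny worked example may help to see what the neighbor average does. The sketch below uses a 4x3 toy matrix (an assumption for illustration, not the notebook's data) with n_neighbors=2; distances are computed on the coordinates that both rows have observed.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy matrix with two missing entries
X_demo = np.array([
    [1.0,    2.0, np.nan],
    [3.0,    4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0,    8.0, 7.0],
])

# Each missing entry becomes the mean of that column over the 2 nearest rows
X_filled = KNNImputer(n_neighbors=2).fit_transform(X_demo)
print(X_filled)
# The missing value in row 0 becomes 4.0 (mean of 3 and 5 from rows 1 and 2);
# the missing value in row 2 becomes 5.5 (mean of 3 and 8 from rows 1 and 3).
```

Because the distance depends on the raw feature scales, features with very different ranges are often standardized before KNN imputation.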

### (iii) BayesianRidge Imputation
BayesianRidge fits a linear regression whose coefficients are regularized through a prior, with the regularization strength estimated from the data. Inside IterativeImputer, each feature with missing values is regressed in turn on the other features, using the rows where it is observed, and the fitted model predicts its missing entries; this round-robin pass is repeated for several iterations. BayesianRidge can be a good choice for imputation when the relationship between the features is well approximated by a linear model. However, it may not perform well on datasets with more complex relationships between the features.

Source - https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.BayesianRidge.html#sklearn.linear_model.BayesianRidge

In [None]:
BayesianImputer = IterativeImputer(estimator = linear_model.BayesianRidge(), random_state=0)
BayesianImputer.fit(df_train)
display(pd.DataFrame(BayesianImputer.transform(df_train)))
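
The sketch below illustrates the case where a linear estimator is a good fit: a synthetic table whose third column is a noisy linear combination of the first two. The data, the seed, and the 20% masking of a single column are assumptions for illustration only.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables the import below)
from sklearn.impute import IterativeImputer
from sklearn.linear_model import BayesianRidge

rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
# Third column is linear in the first two, plus a little noise
table = np.column_stack([a, b, 3 * a - 2 * b + rng.normal(scale=0.05, size=500)])

hidden = rng.random(500) < 0.2          # hide about 20% of the third column
table_missing = table.copy()
table_missing[hidden, 2] = np.nan

imputer = IterativeImputer(estimator=BayesianRidge(), random_state=0)
table_filled = imputer.fit_transform(table_missing)

# Largest absolute imputation error on the hidden entries
max_abs_error = np.abs(table_filled[hidden, 2] - table[hidden, 2]).max()
print("largest absolute imputation error:", max_abs_error)
```

If the third column were instead a nonlinear function of the others (for example a square), a linear estimator would be expected to do noticeably worse, which is the trade-off described above.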

### Analysis of the result

The coefficient of determination $R^2$ is defined as $1 - \frac{u}{v}$, where $u = \sum_i (y_{true,i} - y_{pred,i})^2$ is the residual sum of squares and $v = \sum_i (y_{true,i} - \overline{y_{true}})^2$ is the total sum of squares. The best possible score is 1.0, and the score can be negative because a model can be arbitrarily worse than the constant baseline. A constant model that always predicts the mean of $y_{true}$, disregarding the input features, gets an $R^2$ score of 0.0. Below, the score is computed only over the entries that were masked.
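
As a small worked example of the definition, the sketch below computes $R^2$ by hand for four made-up values. The helper name `r2_manual` and the numbers are hypothetical, chosen only for illustration.

```python
y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]

def r2_manual(y_true, y_pred):
    """R^2 = 1 - u / v, following the definition in the text."""
    mean_true = sum(y_true) / len(y_true)
    u = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))   # residual sum of squares
    v = sum((t - mean_true) ** 2 for t in y_true)           # total sum of squares
    return 1 - u / v

score = r2_manual(y_true, y_pred)                 # u = 1.5, v = 29.1875
baseline = r2_manual(y_true, [2.875] * 4)         # always predict the mean of y_true

print(round(score, 4))   # 0.9486
print(baseline)          # 0.0
```

The second call shows why 0.0 is the natural reference point: predicting the mean everywhere makes $u$ equal to $v$.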

In [None]:
def getY(df_pred, df_missing, df_true):
  # Collect the true and the imputed values at the positions that are NaN in df_missing
  pred = np.asarray(df_pred, dtype=float)
  true = np.asarray(df_true, dtype=float)
  print(pred.shape, df_missing.shape, true.shape)
  mask = pd.isna(df_missing).to_numpy()
  return true[mask], pred[mask]

# missForest evaluation is left commented out; uncomment to run it
#Y_true, Y_pred = getY(missForest.transform(df_test), df_test, df_test_true)
#print("missForest test R^2:", r2_score(Y_true, Y_pred))

# R^2 on the training set, computed only over the masked entries
Y_true, Y_pred = getY(KNNimputer.transform(df_train), df_train, df_train_true)
r2 = r2_score(Y_true, Y_pred)
print("KNN train R^2:", r2)

Y_true, Y_pred = getY(BayesianImputer.transform(df_train), df_train, df_train_true)
r2 = r2_score(Y_true, Y_pred)
print("BayesianRidge train R^2:", r2)

In [None]:
# R^2 on the test set, computed only over the masked entries
Y_true, Y_pred = getY(KNNimputer.transform(df_test), df_test, df_test_true)
r2 = r2_score(Y_true, Y_pred)
print("KNN test R^2:", r2)

Y_true, Y_pred = getY(BayesianImputer.transform(df_test), df_test, df_test_true)
r2 = r2_score(Y_true, Y_pred)
print("BayesianRidge test R^2:", r2)