# Note

* Follow these [instructions](https://github.com/liguowang/ultra-impute/tree/master) to install `ultra-impute`.
* If you install `ultra-impute` into a [virtual environment](https://docs.python.org/3/library/venv.html), you will also need to install **[Jupyter Notebook](https://jupyter.org/)** into that same virtual environment and launch it from there. Otherwise, Jupyter Notebook may not be able to load the necessary Python modules for this tutorial.

# Load Python modules

In [11]:
import numpy as np
import pandas as pd
from ultra_impute import MissFiller
from util import calculate_metrics, toy_df

# Initialize the `MissFiller` object
`MissFiller` can be initialized with a **dictionary**, **NumPy ndarray**, or **pandas DataFrame**. Dictionaries and ndarrays will be converted to DataFrames upon initialization of the `MissFiller` class.

In [9]:
# d1 is a dictionary. `np.nan` represents a missing value.
d1 = {
    'A': [0, 5, 10, np.nan, 20],
    'B': [1, 6, np.nan, 16, 21],
    'C': [np.nan, 7, 12, 17, 22],
    'D': [3, 8, 13, 18, 23],
    'E': [4, 9, 14, 19, np.nan]
}

# d2 is a NumPy ndarray
d2 = np.array([
    [ 0.,  1., np.nan,  3., 4],
    [ 5.,  6.,  7.,  8., 9],
    [10., np.nan, 12., 13., 14],
    [np.nan, 16., 17., 18., 19],
    [20., 21., 22., 23., np.nan]
])

# d3 and d4 are DataFrames created from d1 and d2, respectively.
d3 = pd.DataFrame(d1)
d4 = pd.DataFrame(d2, columns=['A', 'B', 'C', 'D', 'E'])

# Create MissFiller objects, noting that mf1, mf2, mf3, and mf4 
# all contain the same data.
mf1 = MissFiller(d1)
mf2 = MissFiller(d2)
mf3 = MissFiller(d3)
mf4 = MissFiller(d4)
mf1.df


Unnamed: 0,A,B,C,D,E
0,0.0,1.0,,3,4.0
1,5.0,6.0,7.0,8,9.0
2,10.0,,12.0,13,14.0
3,,16.0,17.0,18,19.0
4,20.0,21.0,22.0,23,
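
For orientation, the conversion described above can be mimicked with plain pandas. This is a minimal sketch, not the `MissFiller` source; the `to_frame` helper is hypothetical and only illustrates that a dictionary and an ndarray holding the same values end up as the same DataFrame once the columns are named.

```python
import numpy as np
import pandas as pd

def to_frame(data, columns=None):
    """Hypothetical helper: coerce a dict, ndarray, or DataFrame to a float DataFrame."""
    if isinstance(data, pd.DataFrame):
        return data.astype(float)
    return pd.DataFrame(data, columns=columns).astype(float)

as_dict = {'A': [0, 5, np.nan], 'B': [1, np.nan, 11]}
as_array = np.array([[0, 1], [5, np.nan], [np.nan, 11]])

f1 = to_frame(as_dict)
f2 = to_frame(as_array, columns=['A', 'B'])
print(f1.equals(f2))  # NaNs in the same positions compare as equal
```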


# Create a toy dataframe (10 x 5) with 20% missing values

In [20]:
d5 = toy_df(n_rows=10, n_cols=5, missingness=0.2, min_val=0, max_val=1, rand_seed=123)
d5

10


array([[0.69646919, 0.28613933, 0.22685145,        nan, 0.71946897],
       [0.42310646,        nan, 0.68482974,        nan, 0.39211752],
       [0.34317802, 0.72904971, 0.43857224,        nan, 0.39804426],
       [0.73799541,        nan, 0.17545176, 0.53155137, 0.53182759],
       [0.63440096, 0.84943179, 0.72445532, 0.61102351, 0.72244338],
       [0.32295891, 0.36178866,        nan, 0.29371405, 0.63097612],
       [0.09210494, 0.43370117, 0.43086276, 0.4936851 , 0.42583029],
       [0.31226122, 0.42635131, 0.89338916, 0.94416002, 0.50183668],
       [0.62395295, 0.1156184 , 0.31728548,        nan, 0.86630916],
       [0.25045537, 0.48303426, 0.98555979,        nan, 0.61289453]])

# Get the indices of missing values

In [21]:
mf5 = MissFiller(d5)
na_locations = mf5.get_na_indices()
na_locations

array([[0, 3],
       [1, 1],
       [1, 3],
       [2, 3],
       [3, 1],
       [5, 2],
       [8, 3],
       [9, 3]])
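
The same kind of `[row, column]` index array can be obtained with NumPy alone. The sketch below uses the `d2` values from earlier in this tutorial; it shows one way to locate missing values and is not necessarily how `get_na_indices()` is implemented.

```python
import numpy as np

arr = np.array([
    [0.0, 1.0, np.nan, 3.0, 4.0],
    [5.0, 6.0, 7.0, 8.0, 9.0],
    [10.0, np.nan, 12.0, 13.0, 14.0],
    [np.nan, 16.0, 17.0, 18.0, 19.0],
    [20.0, 21.0, 22.0, 23.0, np.nan],
])

# One [row, column] pair per missing value, in row-major order.
na_idx = np.argwhere(np.isnan(arr))
print(na_idx)
```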

# Remove missing values

In [22]:
# remove rows with any missing values. Default is axis=0
mf5.remove_na()

Unnamed: 0,0,1,2,3,4
4,0.634401,0.849432,0.724455,0.611024,0.722443
6,0.092105,0.433701,0.430863,0.493685,0.42583
7,0.312261,0.426351,0.893389,0.94416,0.501837


In [23]:
# remove columns with any missing values
mf5.remove_na(axis=1)

Unnamed: 0,0,4
0,0.696469,0.719469
1,0.423106,0.392118
2,0.343178,0.398044
3,0.737995,0.531828
4,0.634401,0.722443
5,0.322959,0.630976
6,0.092105,0.42583
7,0.312261,0.501837
8,0.623953,0.866309
9,0.250455,0.612895
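
For comparison, plain pandas offers the same two behaviours through `DataFrame.dropna`. The sketch below reuses the `d1` values from earlier; it illustrates the row-wise and column-wise semantics only and makes no claim about how `remove_na()` is written.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': [0, 5, 10, np.nan, 20],
    'B': [1, 6, np.nan, 16, 21],
    'C': [np.nan, 7, 12, 17, 22],
    'D': [3, 8, 13, 18, 23],
    'E': [4, 9, 14, 19, np.nan],
})

complete_rows = df.dropna(axis=0)  # keep rows without any NaN
complete_cols = df.dropna(axis=1)  # keep columns without any NaN
print(complete_rows)
print(complete_cols)
```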


# Fill missing values with `mean`, `median`, `min`, `max`, `bfill` or `ffill`

**bfill** (back fill): Fills a missing value with the next value.

* If `axis=0`, "next" refers to the value directly below the missing value. As a result, missing values at the bottom of a column will not be filled.
* If `axis=1`, "next" refers to the value to the right of the missing value. Consequently, missing values at the rightmost end of a row will not be filled.

**ffill** (forward fill): Fills a missing value with the previous value.

* If `axis=0`, "previous" refers to the value directly above the missing value. Therefore, missing values at the top of a column will not be filled.
* If `axis=1`, "previous" refers to the value to the left of the missing value. Thus, missing values at the leftmost end of a row will not be filled.
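
The edge cases described above can be checked on a tiny frame with pandas' own `ffill` and `bfill`. This is an illustration of the fill semantics only, on made-up values; it does not call `fill_trend`.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': [np.nan, 2.0, np.nan, 4.0],
    'B': [1.0, np.nan, 3.0, np.nan],
})

down = df.ffill(axis=0)   # copy the value above; the top of column A stays NaN
up = df.bfill(axis=0)     # copy the value below; the bottom of column B stays NaN
left = df.ffill(axis=1)   # copy the value to the left; leftmost NaNs stay NaN
right = df.bfill(axis=1)  # copy the value to the right; rightmost NaNs stay NaN

print(down, up, left, right, sep="\n\n")
```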
                

In [24]:
# fill missing values with row mean
mf5.fill_trend(axis=1, method='mean')

Unnamed: 0,0,1,2,3,4
0,0.696469,0.286139,0.226851,0.482232,0.719469
1,0.423106,0.500018,0.68483,0.500018,0.392118
2,0.343178,0.72905,0.438572,0.477211,0.398044
3,0.737995,0.494207,0.175452,0.531551,0.531828
4,0.634401,0.849432,0.724455,0.611024,0.722443
5,0.322959,0.361789,0.402359,0.293714,0.630976
6,0.092105,0.433701,0.430863,0.493685,0.42583
7,0.312261,0.426351,0.893389,0.94416,0.501837
8,0.623953,0.115618,0.317285,0.480791,0.866309
9,0.250455,0.483034,0.98556,0.582986,0.612895


In [25]:
# fill missing values with column mean
mf5.fill_trend(axis=0, method='mean')

Unnamed: 0,0,1,2,3,4
0,0.696469,0.286139,0.226851,0.574827,0.719469
1,0.423106,0.460639,0.68483,0.574827,0.392118
2,0.343178,0.72905,0.438572,0.574827,0.398044
3,0.737995,0.460639,0.175452,0.531551,0.531828
4,0.634401,0.849432,0.724455,0.611024,0.722443
5,0.322959,0.361789,0.541918,0.293714,0.630976
6,0.092105,0.433701,0.430863,0.493685,0.42583
7,0.312261,0.426351,0.893389,0.94416,0.501837
8,0.623953,0.115618,0.317285,0.574827,0.866309
9,0.250455,0.483034,0.98556,0.574827,0.612895
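
As a cross-check, mean filling can be written in a few lines of pandas. The sketch below uses three rows of `d5` (rows 0, 1, and 4), so its column means are computed from that subset and will differ from the 10-row output above. It is not the `fill_trend` implementation.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([
    [0.69646919, 0.28613933, 0.22685145, np.nan, 0.71946897],
    [0.42310646, np.nan, 0.68482974, np.nan, 0.39211752],
    [0.63440096, 0.84943179, 0.72445532, 0.61102351, 0.72244338],
])

# Column mean: fillna aligns the Series of column means with the column labels.
col_filled = df.fillna(df.mean(axis=0))

# Row mean: transpose so that rows become columns, fill, and transpose back.
row_filled = df.T.fillna(df.mean(axis=1)).T

print(row_filled.round(6))
```

The row means for the first two rows come out as 0.482232 and 0.500018, the same values that appear in the row-mean output of `fill_trend(axis=1, method='mean')`.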


In [26]:
# back fill missing values by row
# When axis=1, "back fill" means a missing value will be filled
# by the value on its right.
mf5.fill_trend(axis=1, method='bfill')

Unnamed: 0,0,1,2,3,4
0,0.696469,0.286139,0.226851,0.719469,0.719469
1,0.423106,0.68483,0.68483,0.392118,0.392118
2,0.343178,0.72905,0.438572,0.398044,0.398044
3,0.737995,0.175452,0.175452,0.531551,0.531828
4,0.634401,0.849432,0.724455,0.611024,0.722443
5,0.322959,0.361789,0.293714,0.293714,0.630976
6,0.092105,0.433701,0.430863,0.493685,0.42583
7,0.312261,0.426351,0.893389,0.94416,0.501837
8,0.623953,0.115618,0.317285,0.866309,0.866309
9,0.250455,0.483034,0.98556,0.612895,0.612895


In [27]:
# back fill missing values by column
# When axis=0, "back fill" means a missing value will be filled
# by the value below it.
mf5.fill_trend(axis=0, method='bfill')

Unnamed: 0,0,1,2,3,4
0,0.696469,0.286139,0.226851,0.531551,0.719469
1,0.423106,0.72905,0.68483,0.531551,0.392118
2,0.343178,0.72905,0.438572,0.531551,0.398044
3,0.737995,0.849432,0.175452,0.531551,0.531828
4,0.634401,0.849432,0.724455,0.611024,0.722443
5,0.322959,0.361789,0.430863,0.293714,0.630976
6,0.092105,0.433701,0.430863,0.493685,0.42583
7,0.312261,0.426351,0.893389,0.94416,0.501837
8,0.623953,0.115618,0.317285,,0.866309
9,0.250455,0.483034,0.98556,,0.612895


# Fill missing values with random values chosen from the same row or column

In [28]:
# Fill with random values chosen from the same column
mf5.fill_rand(axis=0)

Unnamed: 0,0,1,2,3,4
0,0.696469,0.286139,0.226851,0.293714,0.719469
1,0.423106,0.72905,0.68483,0.493685,0.392118
2,0.343178,0.72905,0.438572,0.611024,0.398044
3,0.737995,0.286139,0.175452,0.531551,0.531828
4,0.634401,0.849432,0.724455,0.611024,0.722443
5,0.322959,0.361789,0.724455,0.293714,0.630976
6,0.092105,0.433701,0.430863,0.493685,0.42583
7,0.312261,0.426351,0.893389,0.94416,0.501837
8,0.623953,0.115618,0.317285,0.94416,0.866309
9,0.250455,0.483034,0.98556,0.611024,0.612895


In [29]:
# Fill with random values chosen from the same row
mf5.fill_rand(axis=1)

Unnamed: 0,0,1,2,3,4
0,0.696469,0.286139,0.226851,0.696469,0.719469
1,0.423106,0.68483,0.68483,0.68483,0.392118
2,0.343178,0.72905,0.438572,0.72905,0.398044
3,0.737995,0.531551,0.175452,0.531551,0.531828
4,0.634401,0.849432,0.724455,0.611024,0.722443
5,0.322959,0.361789,0.322959,0.293714,0.630976
6,0.092105,0.433701,0.430863,0.493685,0.42583
7,0.312261,0.426351,0.893389,0.94416,0.501837
8,0.623953,0.115618,0.317285,0.115618,0.866309
9,0.250455,0.483034,0.98556,0.98556,0.612895
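
The idea behind random filling can be sketched with NumPy. The function name, the `seed` argument, and sampling with replacement are assumptions made for this illustration; `fill_rand` may differ in all of these details.

```python
import numpy as np

def fill_random_by_column(arr, seed=0):
    """Replace each NaN with a randomly drawn observed value from its column."""
    rng = np.random.default_rng(seed)
    out = np.array(arr, dtype=float)  # work on a copy
    for j in range(out.shape[1]):
        col = out[:, j]               # view into `out`
        missing = np.isnan(col)
        observed = col[~missing]      # copy of the observed values
        if missing.any() and observed.size > 0:
            col[missing] = rng.choice(observed, size=int(missing.sum()))
    return out

x = np.array([[1.0, np.nan], [np.nan, 4.0], [3.0, 6.0]])
filled = fill_random_by_column(x)
print(filled)
```

Filling by row works the same way on the transposed array.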


# Fill missing values with the mean calculated from sliding windows

In [30]:
# By default (axis=0), the sliding window moves along the columns.
mf5.fill_mw()



Unnamed: 0,0,1,2,3,4
0,0.696469,0.286139,0.226851,0.544797,0.719469
1,0.423106,0.507595,0.68483,0.531551,0.392118
2,0.343178,0.72905,0.438572,0.558042,0.398044
3,0.737995,0.695359,0.175452,0.531551,0.531828
4,0.634401,0.849432,0.724455,0.611024,0.722443
5,0.322959,0.361789,0.449954,0.293714,0.630976
6,0.092105,0.433701,0.430863,0.493685,0.42583
7,0.312261,0.426351,0.893389,0.94416,0.501837
8,0.623953,0.115618,0.317285,,0.866309
9,0.250455,0.483034,0.98556,,0.612895


# Fill missing values using `k-nearest neighbours`

In [31]:
# fill missing values using the 'fKNN' algorithm (fast k-nearest neighbour).
# When axis = 1 (default), search columns for the nearest neighbours.
mf5.fill_fKNN()

Unnamed: 0,0,1,2,3,4
0,0.696469,0.286139,0.226851,0.3992,0.719469
1,0.423106,0.537525,0.68483,0.557274,0.392118
2,0.343178,0.72905,0.438572,0.514626,0.398044
3,0.737995,0.472949,0.175452,0.531551,0.531828
4,0.634401,0.849432,0.724455,0.611024,0.722443
5,0.322959,0.361789,0.382209,0.293714,0.630976
6,0.092105,0.433701,0.430863,0.493685,0.42583
7,0.312261,0.426351,0.893389,0.94416,0.501837
8,0.623953,0.115618,0.317285,0.402664,0.866309
9,0.250455,0.483034,0.98556,0.73068,0.612895


In [32]:
# When axis = 0, search rows for the nearest neighbours.
mf5.fill_fKNN(axis=0)

Unnamed: 0,0,1,2,3,4
0,0.696469,0.286139,0.226851,0.530151,0.719469
1,0.423106,0.540602,0.68483,0.554548,0.392118
2,0.343178,0.72905,0.438572,0.487227,0.398044
3,0.737995,0.321908,0.175452,0.531551,0.531828
4,0.634401,0.849432,0.724455,0.611024,0.722443
5,0.322959,0.361789,0.517105,0.293714,0.630976
6,0.092105,0.433701,0.430863,0.493685,0.42583
7,0.312261,0.426351,0.893389,0.94416,0.501837
8,0.623953,0.115618,0.317285,0.499448,0.866309
9,0.250455,0.483034,0.98556,0.657904,0.612895


In [33]:
# fill missing values using scikit-learn's 'KNNImputer'
# When axis = 1 (default), search columns for the nearest neighbours.
# the results of KNN (using the mean) might be slightly different 
# from fKNN (using the weighted mean)
mf5.fill_KNN()

Unnamed: 0,0,1,2,3,4
0,0.696469,0.286139,0.226851,0.482232,0.719469
1,0.423106,0.500018,0.68483,0.500018,0.392118
2,0.343178,0.72905,0.438572,0.477211,0.398044
3,0.737995,0.494207,0.175452,0.531551,0.531828
4,0.634401,0.849432,0.724455,0.611024,0.722443
5,0.322959,0.361789,0.402359,0.293714,0.630976
6,0.092105,0.433701,0.430863,0.493685,0.42583
7,0.312261,0.426351,0.893389,0.94416,0.501837
8,0.623953,0.115618,0.317285,0.480791,0.866309
9,0.250455,0.483034,0.98556,0.582986,0.612895
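
To make the nearest-neighbour idea concrete, here is a small NumPy sketch that fills each missing value with the unweighted mean of the `k` nearest rows. The distance definition (mean squared difference over the columns observed in both rows) and the function name are assumptions for this illustration; it is neither `fill_KNN` nor `fill_fKNN`, and a weighted variant would replace the plain mean with a distance-weighted one.

```python
import numpy as np

def knn_impute(X, k=2):
    """Fill each NaN with the mean of that column over the k nearest rows."""
    X = np.asarray(X, dtype=float)
    out = X.copy()
    observed = ~np.isnan(X)
    for i, j in np.argwhere(~observed):
        candidates = []
        for r in range(X.shape[0]):
            shared = observed[i] & observed[r]
            # A neighbour must be a different row, have column j observed,
            # and share at least one observed column with row i.
            if r == i or not observed[r, j] or not shared.any():
                continue
            dist = np.mean((X[i, shared] - X[r, shared]) ** 2)
            candidates.append((dist, X[r, j]))
        if candidates:
            candidates.sort(key=lambda pair: pair[0])
            out[i, j] = np.mean([value for _, value in candidates[:k]])
    return out

X = np.array([
    [1.0, 2.0, np.nan],
    [1.0, 2.0, 3.0],
    [1.1, 2.1, 5.0],
    [10.0, 20.0, 30.0],
])
print(knn_impute(X, k=2))
```

In this toy matrix the two rows closest to row 0 are rows 1 and 2, so the missing entry becomes the mean of 3.0 and 5.0.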


# Fill missing values using the Expectation-Maximization (EM) algorithm

In [34]:
# Imputes missing data using the Expectation-Maximization (EM) algorithm.
mf5.fill_EM()

Unnamed: 0,0,1,2,3,4
0,0.696469,0.286139,0.226851,0.723044,0.719469
1,0.423106,0.541666,0.68483,0.430225,0.392118
2,0.343178,0.72905,0.438572,0.669662,0.398044
3,0.737995,0.475712,0.175452,0.531551,0.531828
4,0.634401,0.849432,0.724455,0.611024,0.722443
5,0.322959,0.361789,0.43008,0.293714,0.630976
6,0.092105,0.433701,0.430863,0.493685,0.42583
7,0.312261,0.426351,0.893389,0.94416,0.501837
8,0.623953,0.115618,0.317285,0.008294,0.866309
9,0.250455,0.483034,0.98556,0.833682,0.612895


# Fill missing values using Buck's Method

In [35]:
# Imputes missing data using Buck's Method.
mf5.fill_Buck()

Unnamed: 0,0,1,2,3,4
0,0.696469,0.286139,0.226851,0.070144,0.719469
1,0.423106,0.460639,0.68483,1.097562,0.392118
2,0.343178,0.72905,0.438572,0.832934,0.398044
3,0.737995,0.460639,0.175452,0.531551,0.531828
4,0.634401,0.849432,0.724455,0.611024,0.722443
5,0.322959,0.361789,0.541918,0.293714,0.630976
6,0.092105,0.433701,0.430863,0.493685,0.42583
7,0.312261,0.426351,0.893389,0.94416,0.501837
8,0.623953,0.115618,0.317285,-0.281854,0.866309
9,0.250455,0.483034,0.98556,0.733134,0.612895
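
Buck's method estimates the mean vector and covariance matrix from the complete rows, then predicts each missing value by regressing it on the values observed in the same row. The sketch below shows that idea in NumPy under simplifying assumptions (enough complete rows, pseudo-inverse for stability); the function name is hypothetical and this is not the `fill_Buck` source.

```python
import numpy as np

def buck_impute(X):
    """Regression imputation using the mean and covariance of complete rows."""
    X = np.asarray(X, dtype=float)
    out = X.copy()
    has_na = np.isnan(X).any(axis=1)
    complete = X[~has_na]
    mu = complete.mean(axis=0)
    cov = np.cov(complete, rowvar=False)
    for i in np.where(has_na)[0]:
        m = np.isnan(X[i])  # missing columns in this row
        o = ~m              # observed columns in this row
        if not o.any():
            out[i] = mu     # nothing to regress on: fall back to the means
            continue
        # Regression coefficients of the missing columns on the observed ones.
        beta = np.linalg.pinv(cov[np.ix_(o, o)]) @ cov[np.ix_(o, m)]
        out[i, m] = mu[m] + (X[i, o] - mu[o]) @ beta
    return out

# Toy data following y = 2x + 1 exactly, with one missing y and one missing x.
X = np.array([
    [1.0, 3.0],
    [2.0, 5.0],
    [3.0, 7.0],
    [4.0, 9.0],
    [5.0, np.nan],
    [np.nan, 1.0],
])
print(buck_impute(X))
```

Because the predictions come from a linear regression, they can extrapolate beyond the observed range of a column.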


# Impute missing values using `NuclearNormMinimization`
## NOTE: this method can be very slow, especially for large datasets.

In [36]:
mf5.fill_NNM()

                                     CVXPY                                     
                                     v1.5.3                                    
(CVXPY) Oct 13 09:18:26 AM: Your problem has 50 variables, 50 constraints, and 0 parameters.
(CVXPY) Oct 13 09:18:26 AM: It is compliant with the following grammars: DCP, DQCP
(CVXPY) Oct 13 09:18:26 AM: (If you need to solve this problem multiple times, but with different data, consider using parameters.)
(CVXPY) Oct 13 09:18:26 AM: CVXPY will first compile your problem; then, it will invoke a numerical solver to obtain a solution.
(CVXPY) Oct 13 09:18:26 AM: Your problem is compiled with the CPP canonicalization backend.
-------------------------------------------------------------------------------
                                  Compilation                                  
-------------------------------------------------------------------------------
(CVXPY) Oct 13 09:18:26 AM: Compiling problem (target solver=CVXOPT).
(

Unnamed: 0,0,1,2,3,4
0,0.696469,0.286139,0.226851,0.376516,0.719469
1,0.423106,0.398286,0.68483,0.522861,0.392118
2,0.343178,0.72905,0.438572,0.382084,0.398044
3,0.737995,0.272199,0.175452,0.531551,0.531828
4,0.634401,0.849432,0.724455,0.611024,0.722443
5,0.322959,0.361789,0.333507,0.293714,0.630976
6,0.092105,0.433701,0.430863,0.493685,0.42583
7,0.312261,0.426351,0.893389,0.94416,0.501837
8,0.623953,0.115618,0.317285,0.416525,0.866309
9,0.250455,0.483034,0.98556,0.667208,0.612895


# Impute missing values using `SoftImpute`

In [37]:
# Matrix completion by iterative soft thresholding of SVD decompositions. 
# Similar to R softImpute package.
mf5.fill_SoftImpute()

[SoftImpute] Max Singular Value of X_init = 3.245500
[SoftImpute] Iter 1: observed MAE=0.016806 rank=5
[SoftImpute] Iter 2: observed MAE=0.016818 rank=5
[SoftImpute] Iter 3: observed MAE=0.016842 rank=5
[SoftImpute] Iter 4: observed MAE=0.016861 rank=5
[SoftImpute] Iter 5: observed MAE=0.016877 rank=5
[SoftImpute] Iter 6: observed MAE=0.016888 rank=5
[SoftImpute] Iter 7: observed MAE=0.016896 rank=5
[SoftImpute] Iter 8: observed MAE=0.016900 rank=5
[SoftImpute] Iter 9: observed MAE=0.016901 rank=5
[SoftImpute] Iter 10: observed MAE=0.016899 rank=5
[SoftImpute] Iter 11: observed MAE=0.016908 rank=5
[SoftImpute] Iter 12: observed MAE=0.016917 rank=5
[SoftImpute] Iter 13: observed MAE=0.016929 rank=5
[SoftImpute] Iter 14: observed MAE=0.016942 rank=5
[SoftImpute] Iter 15: observed MAE=0.016952 rank=5
[SoftImpute] Iter 16: observed MAE=0.016960 rank=5
[SoftImpute] Iter 17: observed MAE=0.016966 rank=5
[SoftImpute] Iter 18: observed MAE=0.016969 rank=5
[SoftImpute] Iter 19: observed MAE=0.0

Unnamed: 0,0,1,2,3,4
0,0.696469,0.286139,0.226851,0.365466,0.719469
1,0.423106,0.393501,0.68483,0.511446,0.392118
2,0.343178,0.72905,0.438572,0.384299,0.398044
3,0.737995,0.281268,0.175452,0.531551,0.531828
4,0.634401,0.849432,0.724455,0.611024,0.722443
5,0.322959,0.361789,0.342569,0.293714,0.630976
6,0.092105,0.433701,0.430863,0.493685,0.42583
7,0.312261,0.426351,0.893389,0.94416,0.501837
8,0.623953,0.115618,0.317285,0.398719,0.866309
9,0.250455,0.483034,0.98556,0.649914,0.612895
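
The "iterative soft thresholding of SVD decompositions" mentioned in the comment can be sketched in NumPy as follows. The function name, the `shrinkage` value, and the fixed iteration count are assumptions for this illustration; the library call above adds its own initialization, convergence checks, and logging.

```python
import numpy as np

def soft_impute(X, shrinkage=0.1, n_iter=100):
    """Alternate between a soft-thresholded SVD and restoring observed entries."""
    X = np.asarray(X, dtype=float)
    mask = ~np.isnan(X)
    Z = np.where(mask, X, 0.0)  # start with zeros in the missing cells
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(Z, full_matrices=False)
        s = np.maximum(s - shrinkage, 0.0)  # shrink the singular values
        low_rank = (U * s) @ Vt
        Z = np.where(mask, X, low_rank)     # keep observed values fixed
    return Z

M = np.outer([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])  # rank-1 toy matrix
X = M.copy()
X[1, 1] = np.nan
print(soft_impute(X, shrinkage=0.05, n_iter=50))
```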


# Impute missing values using `IterativeSVD`

In [39]:
# Matrix completion by iterative low-rank SVD decomposition. 
# The input dataframe must have at least 5 columns.

d6 = toy_df(n_rows=10, n_cols=10, missingness=0.2, min_val=0, max_val=1, rand_seed=123)
mf6 = MissFiller(d6)
mf6.fill_IterativeSVD()

20
[IterativeSVD] Iter 1: observed MAE=0.198822
[IterativeSVD] Iter 2: observed MAE=0.143338
[IterativeSVD] Iter 3: observed MAE=0.069606
[IterativeSVD] Iter 4: observed MAE=0.004337


Unnamed: 0,0,1,2,3,4,5,6,7,8,9
0,0.696469,0.286139,0.226851,0.551315,0.719469,0.423106,0.980764,0.68483,0.480932,0.392118
1,0.343178,0.72905,0.467525,0.354695,0.398044,0.368672,0.182492,0.238923,0.531551,0.372009
2,0.634401,0.849432,0.724455,0.611024,0.722443,0.322959,0.361789,0.228263,0.293714,0.630976
3,0.092105,0.433701,0.430863,0.493685,0.42583,0.312261,0.426351,0.58346,0.94416,0.501837
4,0.623953,0.115618,0.317285,0.414826,0.866309,0.250455,0.483034,0.98556,0.519485,0.612895
5,0.120629,0.826341,0.60306,0.406402,0.342764,0.304121,0.417022,0.681301,0.875457,0.510422
6,0.669314,0.585937,0.42053,0.674689,0.842342,0.083195,0.763683,0.243666,0.194223,0.572457
7,0.095713,0.885327,0.627249,0.257878,0.016129,0.594432,0.024888,0.15896,0.583866,0.297125
8,0.318766,0.398851,0.355699,0.35445,0.925132,0.84167,0.357398,0.043591,0.304768,0.357662
9,0.704959,0.356562,0.355915,0.339798,0.593177,0.691702,0.151127,0.398876,0.240856,0.343456


# Impute missing values using `IterativeImputer`

In [40]:
# A strategy for imputing missing values by modeling each feature with 
# missing values as a function of other features in a round-robin fashion.
# Similar to MICE (Multiple Imputation by Chained Equations) in R.
mf5.fill_IterativeImputer()

Unnamed: 0,0,1,2,3,4
0,0.696469,0.286139,0.226851,0.321996,0.719469
1,0.423106,0.509825,0.68483,0.964111,0.392118
2,0.343178,0.72905,0.438572,0.641431,0.398044
3,0.737995,0.440712,0.175452,0.531551,0.531828
4,0.634401,0.849432,0.724455,0.611024,0.722443
5,0.322959,0.361789,0.342973,0.293714,0.630976
6,0.092105,0.433701,0.430863,0.493685,0.42583
7,0.312261,0.426351,0.893389,0.94416,0.501837
8,0.623953,0.115618,0.317285,0.187013,0.866309
9,0.250455,0.483034,0.98556,0.822103,0.612895


# Impute missing values using `MatrixFactorization`

In [41]:
# Direct factorization of the incomplete matrix into low-rank U and V, 
# with per-row and per-column biases, as well as a global bias.
mf5.fill_MatrixFactorization()

[MatrixFactorization] Iter 10: observed MAE=0.164723 rank=40
[MatrixFactorization] Iter 20: observed MAE=0.156667 rank=40
[MatrixFactorization] Iter 30: observed MAE=0.152574 rank=40
[MatrixFactorization] Iter 40: observed MAE=0.151244 rank=40
[MatrixFactorization] Iter 50: observed MAE=0.150224 rank=40
[MatrixFactorization] Iter 60: observed MAE=0.149265 rank=40
[MatrixFactorization] Iter 70: observed MAE=0.148215 rank=40
[MatrixFactorization] Iter 80: observed MAE=0.146967 rank=40
[MatrixFactorization] Iter 90: observed MAE=0.145332 rank=40
[MatrixFactorization] Iter 100: observed MAE=0.143600 rank=40
[MatrixFactorization] Iter 110: observed MAE=0.141464 rank=40
[MatrixFactorization] Iter 120: observed MAE=0.139005 rank=40
[MatrixFactorization] Iter 130: observed MAE=0.136338 rank=40
[MatrixFactorization] Iter 140: observed MAE=0.133158 rank=40
[MatrixFactorization] Iter 150: observed MAE=0.129856 rank=40
[MatrixFactorization] Iter 160: observed MAE=0.126171 rank=40
[MatrixFactorizat

Unnamed: 0,0,1,2,3,4
0,0.696469,0.286139,0.226851,0.498693,0.719469
1,0.423106,0.544872,0.68483,0.639136,0.392118
2,0.343178,0.72905,0.438572,0.517894,0.398044
3,0.737995,0.318584,0.175452,0.531551,0.531828
4,0.634401,0.849432,0.724455,0.611024,0.722443
5,0.322959,0.361789,0.321836,0.293714,0.630976
6,0.092105,0.433701,0.430863,0.493685,0.42583
7,0.312261,0.426351,0.893389,0.94416,0.501837
8,0.623953,0.115618,0.317285,0.539327,0.866309
9,0.250455,0.483034,0.98556,0.789752,0.612895


# Impute missing values using `Random Forest`

In [42]:
mf5.fill_RF()

Iteration: 0
Iteration: 1
Iteration: 2
Iteration: 3
Iteration: 4
Iteration: 5
Iteration: 6


Unnamed: 0,0,1,2,3,4
0,0.696469,0.286139,0.226851,0.446378,0.719469
1,0.423106,0.619769,0.68483,0.651717,0.392118
2,0.343178,0.72905,0.438572,0.555057,0.398044
3,0.737995,0.45625,0.175452,0.531551,0.531828
4,0.634401,0.849432,0.724455,0.611024,0.722443
5,0.322959,0.361789,0.364974,0.293714,0.630976
6,0.092105,0.433701,0.430863,0.493685,0.42583
7,0.312261,0.426351,0.893389,0.94416,0.501837
8,0.623953,0.115618,0.317285,0.445658,0.866309
9,0.250455,0.483034,0.98556,0.681689,0.612895


# Impute the missing values using multi-output learning
In this example, we created a dataframe containing 10,000 CpGs and 30 samples, derived from real 450K DNA methylation data (GSE105018). We manually removed 10% of the values and compared the imputed values with the original ones to evaluate the accuracy of the imputation process.

The imputation process consists of three main steps:

* **Handling random missing values**: Buck’s method is applied to predict random missing values (if any). Available methods for this step include ['Buck', 'KNN', 'MICE', 'RF'].

* **Clustering based on missingness patterns**: K-means is used to group samples into two clusters based on the pattern of missing values. For instance, samples with data generated from 450K and 850K arrays are grouped into separate clusters. This step will be skipped if users provide the `group` variable (e.g., `group = {'A':["sample1", "sample2", "sample3"], 'B':["sample4", "sample5", "sample6"]}`).

* **Imputing systematic or block missing values**: A deep neural network (DNN) is trained using the available values to impute systematic or block missing data. Methods available for this step include ['RF', 'KNN', 'SVR', 'DNN'].
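
The clustering step can be pictured with a small NumPy sketch: each sample is represented by its missingness indicator vector, and a two-cluster k-means separates samples with different block patterns. The function name and the deterministic initialization are assumptions for this illustration; the package's own clustering code may differ.

```python
import numpy as np

def split_by_missingness(X, n_iter=20):
    """Two-cluster k-means on per-sample missingness indicators.

    X has features (e.g. CpGs) in rows and samples in columns.
    """
    pattern = np.isnan(np.asarray(X, dtype=float)).T.astype(float)  # samples x features
    # Deterministic start: the first sample and the sample farthest from it.
    first = pattern[0]
    far = pattern[np.argmax(((pattern - first) ** 2).sum(axis=1))]
    centers = np.stack([first, far])
    for _ in range(n_iter):
        dists = ((pattern[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for c in range(2):
            if (labels == c).any():
                centers[c] = pattern[labels == c].mean(axis=0)
    return labels

# Samples 0-1 lack features 0-1; samples 2-3 lack features 2-3.
X = np.ones((4, 4))
X[:2, :2] = np.nan
X[2:, 2:] = np.nan
print(split_by_missingness(X))
```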

In [43]:
# Original data
original_df = pd.read_csv("http://publicepidata.s3.amazonaws.com/GSE105018_N30_R10K.original.tsv", sep="\t", index_col=0, header=0)

# The same data with 10% of the values manually marked as missing (non-random block missing)
d7 = pd.read_csv("http://publicepidata.s3.amazonaws.com/GSE105018_N30_R10K.10per.tsv", sep="\t", index_col=0, header=0)
mf7 = MissFiller(d7)

# predict the missing values
filled_df = mf7.fill_morel(initial_model = 'Buck', second_model='DNN')

Binerize sample IDs into two groups using K-means ...
Group "0" contains 15 samples
	3442
	3311
	3312
	3341
	3342
	3411
	3412
	3461
	3462
	3471
	3472
	3501
	3502
	3451
	3452
Group "1" contains 15 samples
	2621
	2622
	2691
	2692
	3391
	3392
	2791
	2792
	3161
	3162
	3211
	3212
	3431
	3432
	3441
493 rows in group "0" are complete missing.
507 rows in group "1" are complete missing.
Predict missing values in group "0" using deep neural network
Split data into training and testing ...


71/71 ━━━━━━━━━━━━━━━━━━━━ 0s 771us/step - loss: 0.0212 - root_mean_squared_error: 0.0344
Test MAE loss, test RMSE: [0.020796336233615875, 0.032823145389556885]
16/16 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step


Predict missing values in group "1" using deep neural network
Split data into training and testing ...


71/71 ━━━━━━━━━━━━━━━━━━━━ 0s 597us/step - loss: 0.0233 - root_mean_squared_error: 0.0366
Test MAE loss, test RMSE: [0.02299191989004612, 0.0356442965567112]
16/16 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step


Re-order the index as the original dataframe ...


# Calculate the imputation accuracy

In [45]:
# the locations of 10% NA
na_locations = mf7.get_na_indices()

# calculate the imputation performance
MNAE, MDAE, RAE, MAPE, RMSE, R2 = calculate_metrics(original_df, filled_df, na_locations)

In [6]:
print("Mean Absolute Error is: %f" %  MNAE)

Mean Absolute Error is: 0.020967
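
For readers who want to see what such an evaluation involves, the sketch below computes three common error measures at the masked positions only. The function name and the returned keys are hypothetical; `calculate_metrics` from `util` returns more metrics and may define them differently.

```python
import numpy as np

def masked_errors(original, filled, na_idx):
    """Compare imputed and true values at the [row, column] pairs in na_idx."""
    rows, cols = np.asarray(na_idx).T
    truth = np.asarray(original, dtype=float)[rows, cols]
    pred = np.asarray(filled, dtype=float)[rows, cols]
    err = pred - truth
    return {
        'mean_abs_error': float(np.mean(np.abs(err))),
        'median_abs_error': float(np.median(np.abs(err))),
        'rmse': float(np.sqrt(np.mean(err ** 2))),
    }

original = np.array([[1.0, 2.0], [3.0, 4.0]])
filled = np.array([[1.0, 2.5], [2.0, 4.0]])
na_idx = np.array([[0, 1], [1, 0]])
scores = masked_errors(original, filled, na_idx)
print(scores)
```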


# Batch Effect Correction

Batch effects (for example, differences between 450K and 850K arrays) can significantly impact imputation accuracy. However, commonly used methods for correcting batch effects—such as [ComBat](https://pubmed.ncbi.nlm.nih.gov/16632515/)—do not handle missing values.

To address this limitation, we first apply K-nearest neighbors (KNN) imputation to fill in missing values. Next, we run the [ComBat](https://pubmed.ncbi.nlm.nih.gov/16632515/) algorithm to correct for batch effects. Finally, we restore the original missing values, allowing other imputation methods to be applied later on the batch-effect–corrected data. This procedure is implemented in the [beta_combat.py](https://cpgtools.readthedocs.io/en/latest/demo/beta_combat.html) script within our [CpGtools](https://pubmed.ncbi.nlm.nih.gov/31808791/) package.
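
The impute, correct, restore workflow can be sketched in pandas as follows. Both inner steps are deliberately simple stand-ins, labeled as such in the code: a row-mean fill replaces the KNN imputation, and per-batch mean centering replaces ComBat. The function name is hypothetical and this is not the `beta_combat.py` code; the point is only how the original missing positions are remembered and put back.

```python
import numpy as np
import pandas as pd

def correct_with_missing(df, batches):
    """Impute temporarily, adjust batches, then restore the NaN positions.

    df: features x samples. batches: one batch label per sample (column).
    """
    mask = df.isna()
    # Stand-in for the KNN step: fill with the feature (row) mean.
    temp = df.T.fillna(df.mean(axis=1)).T
    # Stand-in for ComBat: shift each batch to the overall per-feature mean.
    batches = pd.Series(list(batches), index=df.columns)
    grand_mean = temp.mean(axis=1)
    corrected = temp.copy()
    for label in batches.unique():
        cols = batches.index[(batches == label).to_numpy()]
        shift = temp[cols].mean(axis=1) - grand_mean
        corrected[cols] = temp[cols].sub(shift, axis=0)
    # Put the original missing values back.
    return corrected.mask(mask)

df = pd.DataFrame({
    's1': [1.0, 2.0],
    's2': [np.nan, 2.0],
    's3': [5.0, 6.0],
    's4': [5.0, np.nan],
})
out = correct_with_missing(df, ['a', 'a', 'b', 'b'])
print(out)
```

The returned frame has NaN in exactly the same cells as the input, so any imputation method can then be run on the adjusted values.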
