# Missing value imputation

This notebook demonstrates several third-party Python packages for missing value imputation.

Sources:
1. [fancyimpute](https://github.com/iskandr/fancyimpute)
1. [missingpy](https://github.com/epsilon-machine/missingpy)
1. [sklearn](https://scikit-learn.org/stable/modules/impute.html)

## Data preparation

In [10]:
import numpy as np
np.random.seed(1)  # seed the global generator for reproducibility
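
Seeding only takes effect when `np.random.seed` is called as a function. The following minimal sketch illustrates that two calls with the same seed reproduce the same draws; the local generator shown at the end (`np.random.default_rng`) is an alternative, not something the notebook itself uses.

```python
import numpy as np

# Calling the function resets the state of the global generator.
np.random.seed(1)
first = np.random.rand(3)

np.random.seed(1)
second = np.random.rand(3)

# Same seed, same draws.
assert np.array_equal(first, second)

# Alternative: a local generator that does not touch global state.
rng = np.random.default_rng(1)
sample = rng.random((200, 5))
```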

In [11]:
# prepare data without missing values

X = np.random.rand(200, 5)
X_orig = X.copy()
assert not np.isnan(X_orig.sum())  # no NaN

In [12]:
# artificially introduce some missing values into the data set

lines = np.random.choice(X.shape[0], 10)
cols = np.random.choice(X.shape[1], 10)
X[lines, cols] = np.nan
assert np.isnan(X.sum())  # at least one NaN in X
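
Because `np.random.choice` samples with replacement by default, the same (row, column) pair can be drawn more than once, so the number of NaN entries may be smaller than 10. A small sketch of how the missing entries could be counted; `X_demo`, `rows` and `missing_mask` are illustrative names, and the data is a stand-in for `X`.

```python
import numpy as np

rng = np.random.default_rng(0)   # local generator, used only in this sketch
X_demo = rng.random((200, 5))    # stand-in for X

rows = rng.choice(X_demo.shape[0], 10)
cols = rng.choice(X_demo.shape[1], 10)
X_demo[rows, cols] = np.nan

# Boolean mask: True where a value was removed.
missing_mask = np.isnan(X_demo)
n_missing = int(missing_mask.sum())

# Number of distinct (row, column) pairs that were drawn.
n_unique_pairs = len(set(zip(rows.tolist(), cols.tolist())))
```

Keeping `missing_mask` around is also useful later, for evaluating an imputer only on the entries it actually had to fill.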

## Scikit Learn IterativeImputer

In [13]:
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

In [14]:
imp = IterativeImputer(max_iter=10, random_state=0)
X_filled_sklearn = imp.fit_transform(X)
assert not np.isnan(X_filled_sklearn.sum())

In [15]:
# Measure the difference between the original data and the data obtained after filling in the missing values

np.linalg.norm(X_filled_sklearn-X_orig)

0.7614000100614554
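
The norm above is taken over the whole matrix. When an imputer leaves the observed entries unchanged, only the filled-in entries contribute to it, so an error restricted to the missing positions can be easier to interpret. A minimal sketch on toy data; `imputation_rmse` is a hypothetical helper, not part of any of the packages used here.

```python
import numpy as np

def imputation_rmse(X_true, X_filled, missing_mask):
    """RMSE computed only over the entries that were missing."""
    diff = (X_filled - X_true)[missing_mask]
    return float(np.sqrt(np.mean(diff ** 2)))

# Toy data: two entries are treated as missing and filled with guesses.
X_true = np.array([[1.0, 2.0], [3.0, 4.0]])
mask = np.array([[False, True], [True, False]])
X_filled = X_true.copy()
X_filled[mask] = [2.5, 2.0]   # errors of +0.5 and -1.0

rmse = imputation_rmse(X_true, X_filled, mask)
frob = float(np.linalg.norm(X_filled - X_true))
```

In this setting the two quantities are related by `frob == rmse * sqrt(number of missing entries)`, so the RMSE stays comparable across data sets with different amounts of missing data.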

## Scikit Learn KNNImputer

In [16]:
from sklearn.impute import KNNImputer
from sklearn.preprocessing import MinMaxScaler

# k-nearest neighbors relies on distances between vectors, so the data is scaled beforehand
scaler = MinMaxScaler(feature_range=(0, 1))
X_scaled = scaler.fit_transform(X)

# the scaler object is kept, since it is needed later for the inverse transform

In [17]:
# Create the object that fills in the missing values
knn_imp = KNNImputer(missing_values=np.nan, n_neighbors=10, weights='uniform', metric='nan_euclidean')
X_imputed = knn_imp.fit_transform(X_scaled)
# undo the scaling
X_imputed_unscaled = scaler.inverse_transform(X_imputed)
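
The `nan_euclidean` metric skips the coordinates that are missing in either vector and rescales the squared distance by the ratio between the total number of coordinates and the number of coordinates present in both. The sketch below reimplements that rule in plain NumPy for a single pair of vectors; the function name `nan_euclidean` is chosen here for illustration and is not the scikit-learn implementation.

```python
import numpy as np

def nan_euclidean(x, y):
    """Euclidean distance that ignores coordinates missing in x or y."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    present = ~(np.isnan(x) | np.isnan(y))   # coordinates available in both
    if not present.any():
        return float("nan")                  # nothing to compare
    sq_dist = np.sum((x[present] - y[present]) ** 2)
    weight = x.size / present.sum()          # total / present coordinates
    return float(np.sqrt(weight * sq_dist))

# Two of the four coordinates are usable: sqrt(4/2 * ((3-1)^2 + (6-5)^2))
d = nan_euclidean([3.0, np.nan, np.nan, 6.0], [1.0, np.nan, 4.0, 5.0])
```

Since every usable coordinate enters the sum with the same weight, features on larger scales would dominate the distance, which is why the data is scaled to a common range first.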

In [18]:
# Measure the difference between the original data and the data obtained after filling in the missing values

np.linalg.norm(X_imputed_unscaled-X_orig)

0.7111147808185497

## The fancyimpute package

In [19]:
# installation takes a few minutes; you can run the command in the conda prompt to see its progress

!pip install fancyimpute



**Note:** if the import below fails with an error of the form:
`RuntimeError: module compiled against API version 0xe but this version of numpy is 0xd`
then upgrade numpy with:
```
pip install -U numpy
```

Installing fancyimpute may downgrade the numpy package, in which case numpy has to be upgraded again.

You can also run the pip commands directly in the conda prompt, inside a dedicated virtual environment.

In [20]:
!pip install -U numpy



In [21]:
from fancyimpute import IterativeImputer, SoftImpute

### fancyimpute.SoftImpute

In [22]:
X_filled_softimpute = SoftImpute().fit_transform(X)
# check that no data is missing
assert not np.isnan(X_filled_softimpute.sum())

[SoftImpute] Max Singular Value of X_init = 16.518847
[SoftImpute] Iter 1: observed MAE=0.019398 rank=5
[SoftImpute] Iter 2: observed MAE=0.019419 rank=5
[SoftImpute] Iter 3: observed MAE=0.019438 rank=5
[SoftImpute] Iter 4: observed MAE=0.019454 rank=5
[SoftImpute] Iter 5: observed MAE=0.019468 rank=5
[SoftImpute] Iter 6: observed MAE=0.019480 rank=5
[SoftImpute] Iter 7: observed MAE=0.019491 rank=5
[SoftImpute] Iter 8: observed MAE=0.019500 rank=5
[SoftImpute] Iter 9: observed MAE=0.019509 rank=5
[SoftImpute] Iter 10: observed MAE=0.019516 rank=5
[SoftImpute] Iter 11: observed MAE=0.019523 rank=5
[SoftImpute] Iter 12: observed MAE=0.019528 rank=5
[SoftImpute] Iter 13: observed MAE=0.019533 rank=5
[SoftImpute] Iter 14: observed MAE=0.019538 rank=5
[SoftImpute] Iter 15: observed MAE=0.019541 rank=5
[SoftImpute] Iter 16: observed MAE=0.019545 rank=5
[SoftImpute] Iter 17: observed MAE=0.019548 rank=5
[SoftImpute] Iter 18: observed MAE=0.019550 rank=5
[SoftImpute] Iter 19: observed MAE=0.

In [23]:
# difference between the original set and the one with missing values filled in by estimation

np.linalg.norm(X_filled_softimpute-X_orig)

1.0134919822726245
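
The idea behind SoftImpute is to alternate between two steps: fill the missing entries with the current low-rank estimate, then compute an SVD and shrink the singular values (soft-thresholding). The code below is a simplified NumPy sketch of that loop, not fancyimpute's implementation; the function name `soft_impute_sketch` and the parameters `shrinkage` and `n_iter` are assumptions made for illustration.

```python
import numpy as np

def soft_impute_sketch(X, shrinkage=0.1, n_iter=100):
    """Fill NaNs by iterated SVD with soft-thresholded singular values."""
    missing = np.isnan(X)
    filled = np.where(missing, 0.0, X)            # start with zeros in the gaps
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(filled, full_matrices=False)
        s = np.maximum(s - shrinkage, 0.0)        # soft-threshold singular values
        low_rank = (U * s) @ Vt                   # shrunken reconstruction
        filled = np.where(missing, low_rank, X)   # observed entries stay fixed
    return filled

rng = np.random.default_rng(0)
X_demo = rng.random((50, 5))                      # stand-in data
X_demo[rng.choice(50, 8), rng.choice(5, 8)] = np.nan
X_out = soft_impute_sketch(X_demo)
```

Uniform random data like the matrix used in this notebook has no low-rank structure to exploit, so a method of this kind should not be expected to do particularly well on it.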

### fancyimpute.IterativeImputer

In [24]:
X_filled_iterative = IterativeImputer().fit_transform(X)
# check that no data is missing
assert not np.isnan(X_filled_iterative.sum())

In [25]:
# difference between the original set and the one with missing values filled in by estimation

np.linalg.norm(X_filled_iterative-X_orig)

0.7614000100614554

## The missingpy package

In [45]:
!pip install missingpy



In [26]:
# to avoid ModuleNotFoundError: No module named 'sklearn.neighbors.base' we use the trick below
# from https://stackoverflow.com/questions/60145652/no-module-named-sklearn-neighbors-base

import sys
import sklearn.neighbors._base
sys.modules['sklearn.neighbors.base'] = sklearn.neighbors._base
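
The workaround relies on the fact that Python's import system looks in `sys.modules` before searching for a module on disk, so registering an existing module object under the old name makes imports of that name succeed. A self-contained sketch with made-up module names (`newpkg_impl` and `oldpkg_legacy` are hypothetical and exist only in this example):

```python
import sys
import types

# Hypothetical module that exists only under its new name.
new_mod = types.ModuleType("newpkg_impl")
new_mod.answer = 42
sys.modules["newpkg_impl"] = new_mod

# Register the same module object under the old name.
sys.modules["oldpkg_legacy"] = new_mod

# The import of the old name is now served from the module cache.
import oldpkg_legacy as legacy
```

The alias has to be registered before the dependent package is imported, which is why the cell above runs before `from missingpy import MissForest`.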

In [27]:
from missingpy import MissForest

In [28]:
imputer = MissForest(criterion='squared_error', max_features=1)

In [29]:
X_filled_missingpy = imputer.fit_transform(X)

Iteration: 0
Iteration: 1
Iteration: 2
Iteration: 3


In [30]:
assert not np.isnan(X_filled_missingpy.sum())  # no NaN in X_filled_missingpy

In [31]:
# difference between the original set and the one with missing values filled in by estimation

np.linalg.norm(X_filled_missingpy-X_orig)

0.7886835353041497