# [sklearn.impute.SimpleImputer](https://scikit-learn.org/stable/modules/impute.html)

## Univariate imputation


Real-world datasets often contain missing values, encoded as blanks, NaNs or other placeholders.
Such datasets are incompatible with scikit-learn estimators, which assume that all values in an array are numerical
and that every value is present and meaningful.

The scikit-learn library provides several algorithms for imputing missing values:
- **Univariate imputation**<br>
    imputes values in the i-th feature dimension using only the non-missing values in that feature dimension (e.g. impute.SimpleImputer).
- **Multivariate imputation**<br>
    uses the entire set of available feature dimensions to estimate the missing values (e.g. impute.IterativeImputer).
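
To make the univariate idea concrete, here is a minimal pure-Python sketch of column-wise mean imputation. The helper name `impute_column_means` is hypothetical and not part of scikit-learn; the sketch assumes NaN marks a missing entry and that every column has at least one observed value.

```python
import math


def impute_column_means(rows):
    """Replace NaN entries with the mean of the observed values in the same column.

    Hypothetical helper for illustration only; assumes every column
    contains at least one non-missing value.
    """
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        # Only the non-missing values of column j contribute to its statistic.
        observed = [row[j] for row in rows if not math.isnan(row[j])]
        means.append(sum(observed) / len(observed))
    # Fill each missing cell with the mean of its own column.
    return [
        [means[j] if math.isnan(value) else value for j, value in enumerate(row)]
        for row in rows
    ]


data = [[1.0, 2.0], [float("nan"), 3.0], [7.0, 6.0]]
filled = impute_column_means(data)
print(filled)  # the missing cell in column 0 becomes (1 + 7) / 2 = 4.0
```

Each column is treated independently here, which is what distinguishes univariate from multivariate imputation.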

This sample covers imputing missing values using **univariate** imputation.

The **SimpleImputer** class provides basic strategies for imputing missing values.
Missing values can be imputed with a provided constant value, or using the statistics (mean, median or most frequent) of each column in which the missing values are located. The class also supports different encodings for missing values.

*class sklearn.impute.**SimpleImputer**(missing_values=nan, strategy='mean', fill_value=None, verbose=0, copy=True, add_indicator=False)*
##### Parameters:
- **missing_values** : number, string, np.nan (default) or None.<br>The placeholder for the missing values. All occurrences of missing_values will be imputed.
- **strategy** : string, optional (default="mean").<br>The imputation strategy.
 - **mean** replaces missing values using the mean along each column. Can only be used with numeric data.
 - **median** replaces missing values using the median along each column. Can only be used with numeric data.
 - **most_frequent** replaces missing values using the most frequent value along each column. Can be used with strings or numeric data.
 - **constant** replaces missing values with fill_value. Can be used with strings or numeric data.<br>
New in version 0.20: strategy="constant" for fixed value imputation.
- **fill_value** : string or numerical value, optional (default=None)<br>
When strategy == "constant", fill_value is used to replace all occurrences of missing_values.<br> If left at the default, fill_value will be 0 when imputing numerical data and "missing_value" for strings or object data types.
- **verbose** : integer, optional (default=0).<br>Controls the verbosity of the imputer.
- **copy** : boolean, optional (default=True).<br>If True, a copy of X will be created. <br>If False, imputation will be done in-place whenever possible. <br>***Note*** that, in the following cases, a new copy will always be made, even if copy=False:
 - If X is not an array of floating values;
 - If X is encoded as a CSR matrix;
 - If add_indicator=True.
- **add_indicator** : boolean, optional (default=False).<br>If True, a MissingIndicator transform is stacked onto the output of the imputer's transform. This allows a predictive estimator to account for missingness despite imputation. If a feature has no missing values at fit/train time, it won't appear in the missing indicator even if it has missing values at transform/test time.
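
As a rough illustration of the `strategy` and `fill_value` parameters described above, the sketch below applies each strategy to the same small array. The data and the choice of `fill_value=-1` are made up for this example.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Illustrative data: one missing value in each column.
X = np.array([
    [1.0, 10.0],
    [2.0, np.nan],
    [2.0, 10.0],
    [np.nan, 40.0],
    [7.0, 60.0],
])

results = {}
for strategy in ("mean", "median", "most_frequent"):
    imputer = SimpleImputer(missing_values=np.nan, strategy=strategy)
    results[strategy] = imputer.fit_transform(X)
    # statistics_ holds the per-column fill value learned during fit.
    print(strategy, imputer.statistics_)

# The constant strategy ignores column statistics and uses fill_value instead.
constant_imputer = SimpleImputer(strategy="constant", fill_value=-1)
results["constant"] = constant_imputer.fit_transform(X)
print(results["constant"])
```

For the second column the observed values are 10, 10, 40 and 60, so the three statistical strategies fill the gap with 30 (mean), 25 (median) and 10 (most frequent) respectively.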

The following snippet demonstrates how to replace missing values, encoded as np.nan, using the mean value of the columns (axis 0) that contain the missing values:

In [1]:
import numpy as np
from sklearn.impute import SimpleImputer

In [2]:
imp = SimpleImputer(missing_values=np.nan, strategy='mean')
imp.fit([[1, 2], [np.nan, 3], [7, 6]])

SimpleImputer(add_indicator=False, copy=True, fill_value=None,
              missing_values=nan, strategy='mean', verbose=0)

In [3]:
X = [[np.nan, 2], [6, np.nan], [7, 6]]
print(imp.transform(X))

[[4.         2.        ]
 [6.         3.66666667]
 [7.         6.        ]]
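
The `add_indicator` option from the parameter list can be sketched on the same data. This is a minimal example, assuming the default indicator behaviour in which only features with missing values at fit time receive an indicator column; the variable names are chosen for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

X_train = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, 6.0]])
X_test = np.array([[np.nan, 2.0], [6.0, np.nan], [7.0, 6.0]])

imputer = SimpleImputer(missing_values=np.nan, strategy="mean", add_indicator=True)

# Only feature 0 has a missing value at fit time, so a single
# indicator column is appended after the two imputed features.
train_out = imputer.fit_transform(X_train)
print(train_out.shape)

# Feature 1 is missing in the second test row, but it had no missing
# values during fit, so no indicator column exists for it.
test_out = imputer.transform(X_test)
print(test_out)
```

The last column of each output is 1 where feature 0 was originally missing and 0 elsewhere.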


The SimpleImputer class also supports **sparse matrices**:

In [5]:
import scipy.sparse as sp
X = sp.csc_matrix([[1, 2], [0, -1], [8, 4]])
imp = SimpleImputer(missing_values=-1, strategy='mean')
imp.fit(X)

SimpleImputer(add_indicator=False, copy=True, fill_value=None,
              missing_values=-1, strategy='mean', verbose=0)

In [6]:
X_test = sp.csc_matrix([[-1, 2], [6, -1], [7, 6]])
print(imp.transform(X_test).toarray())

[[3. 2.]
 [6. 3.]
 [7. 6.]]


**Note** *that the sparse format is not meant to store missing values implicitly, because imputing them would densify the matrix at transform time. Missing values encoded as 0 must be used with dense input.*
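
A small sketch of what the note means in practice, assuming a scikit-learn version in which fitting on sparse input with `missing_values=0` raises a `ValueError`; the exact error message may differ between versions. The array is illustrative.

```python
import numpy as np
import scipy.sparse as sp
from sklearn.impute import SimpleImputer

# Dense input where 0 marks a missing value.
X_dense = np.array([[1, 2], [0, 3], [7, 0]])

imputer = SimpleImputer(missing_values=0, strategy="mean")
dense_result = imputer.fit_transform(X_dense)
print(dense_result)  # zeros replaced by the column means 4.0 and 2.5

# The same data in sparse form is expected to be rejected, because the
# implicit zeros of the sparse matrix would all have to be filled in.
try:
    SimpleImputer(missing_values=0, strategy="mean").fit(sp.csc_matrix(X_dense))
    sparse_rejected = False
except ValueError as exc:
    sparse_rejected = True
    print("Sparse input rejected:", exc)
```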

The SimpleImputer class also supports **categorical data** represented as string values or pandas categoricals when using the ***most_frequent*** or ***constant*** strategy:

In [7]:
import pandas as pd
df = pd.DataFrame([["a", "x"], [np.nan, "y"], ["a", np.nan], ["b", "y"]], dtype="category")
imp = SimpleImputer(strategy="most_frequent")
print(imp.fit_transform(df))

[['a' 'x']
 ['a' 'y']
 ['a' 'y']
 ['b' 'y']]
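
The ***constant*** strategy can be sketched on similar string data. This example uses a NumPy object array rather than a pandas DataFrame to stay self-contained, and the `"unknown"` label is an arbitrary choice made for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# String data stored in an object array, with np.nan marking missing entries.
X_str = np.array(
    [["a", "x"], [np.nan, "y"], ["a", np.nan], ["b", "y"]],
    dtype=object,
)

# An explicit fill_value replaces every missing entry with that label.
explicit = SimpleImputer(strategy="constant", fill_value="unknown")
explicit_filled = explicit.fit_transform(X_str)
print(explicit_filled)

# With fill_value left at None, string/object data is filled
# with the default label "missing_value".
default = SimpleImputer(strategy="constant")
default_filled = default.fit_transform(X_str)
print(default_filled)
```

Using a dedicated label keeps the imputed entries distinguishable from real categories, which the ***most_frequent*** strategy does not.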
