# Simple imputer

Hey, welcome back to this book on machine learning with scikit-learn.

In *machine learning*, it's common to encounter datasets with missing information in some of their columns. In the scientific Python world, missing values are usually represented as `NaN` or as null values such as `None`.
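
As a quick illustration of how these markers look in practice, here is a minimal sketch using pandas. The series names `numbers` and `colors` are made up for this example and are not part of the dataset used later in the chapter.

```python
import numpy as np
import pandas as pd

# None in a numeric column is stored as NaN, and the column becomes float64.
numbers = pd.Series([1, None, 3])

# None in a text column is kept as a missing marker
# (its exact representation depends on the pandas version).
colors = pd.Series(['red', None, 'blue'])

# isna() flags both representations as missing.
numbers_missing = numbers.isna().tolist()
colors_missing = colors.isna().tolist()

print(numbers.dtype)
print(numbers_missing)
print(colors_missing)
```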

Missing information can occur due to various factors: there was an error in data collection, the data was corrupted somewhere, or it was simply never collected.

Under certain conditions, it's better to estimate replacements for these missing values so that they don't hurt your model's performance by causing biased or incorrect predictions. In fact, many of the algorithms offered by scikit-learn require that your dataset contain no null values.
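
To see what this requirement means in practice, the sketch below tries to train a model on data containing a `NaN`. `LinearRegression` is only an illustrative choice (the text does not single out a specific estimator), and the arrays `X` and `y` are made up.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# A single feature with one missing entry.
X = np.array([[1.0], [2.0], [np.nan], [4.0]])
y = np.array([1.0, 2.0, 3.0, 4.0])

try:
    LinearRegression().fit(X, y)
    fit_failed = False
except ValueError as error:
    # The estimator validates its input and rejects NaN values.
    fit_failed = True
    print(f"Fit failed: {error}")
```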

For this, scikit-learn offers a class called `SimpleImputer` that fills in missing values using different strategies.

Let's create a dataset with some missing data:

In [None]:
import pandas as pd

# None marks a missing value in any column.
data = pd.DataFrame([
    ('red', 1, 1.0, -1), ('blue', 2, None, -3), (None, 3, 3.0, -5),
    ('red', 4, 4.0, -2), ('red', None, 5.0, -5), ('blue', 6, 6.0, -1),
    ('red', 7, None, None), ('blue', 8, 8.0, None), ('green', 9, 9.0, None),
    ('red', 10, 10.0, None),
], columns=['color', 'number', 'value', 'other'])

data


To use `SimpleImputer`, first import it from `sklearn.impute`:

In [None]:
from sklearn.impute import SimpleImputer


First, we will work with numerical values and the class's default arguments, which fill each missing value with the column mean:

In [None]:

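imputer = SimpleImputer()  # default strategy='mean'
imputer.fit(data[['value']])  # learn the mean of the non-missing values
data['value'] = imputer.transform(data[['value']])  # fill the gaps with that mean
data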

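The cell above fits and transforms the same column. As a small sketch of what `fit` actually learns, the example below uses two made-up frames, `train` and `new`, and inspects the imputer's `statistics_` attribute, which holds the fill value computed for each column.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical data: one frame to learn from, one to fill afterwards.
train = pd.DataFrame({'value': [1.0, np.nan, 3.0, 5.0]})
new = pd.DataFrame({'value': [np.nan, 10.0]})

imputer = SimpleImputer()  # default strategy='mean'
imputer.fit(train)

# The mean of the non-missing training values (1, 3 and 5).
learned_mean = imputer.statistics_[0]

# Missing entries in the new data are filled with the training mean.
filled_new = imputer.transform(new)

print(learned_mean)
print(filled_new)
```

The fill value for `new` comes from the data the imputer was fitted on, not from `new` itself.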

Let's say that for another column you want to use the median instead of the mean:

In [None]:
imputer = SimpleImputer(strategy='median')
imputer.fit(data[['number']])
data['number'] = imputer.transform(data[['number']])
data

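One common reason to prefer the median is that it is less sensitive to extreme values. The following sketch, on a made-up array with one outlier, compares the value that each strategy fills in.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical column: three small values, one outlier, one missing entry.
values = np.array([[1.0], [2.0], [3.0], [100.0], [np.nan]])

mean_filled = SimpleImputer(strategy='mean').fit_transform(values)
median_filled = SimpleImputer(strategy='median').fit_transform(values)

# The last row held the missing entry.
mean_fill = mean_filled[-1, 0]      # pulled up by the outlier
median_fill = median_filled[-1, 0]  # stays close to the typical values

print(mean_fill, median_fill)
```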

It is also possible to fill in values with the most frequent element. This is a good choice for the missing value in the `color` column, since the two previous strategies only work with numerical data:

In [None]:
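# missing_values=pd.NA makes the imputer treat None in a text column as missing.
imputer = SimpleImputer(missing_values=pd.NA, strategy='most_frequent')
imputer.fit(data[['color']])
# transform returns a 2D array; squeeze() flattens it to one dimension.
data['color'] = imputer.transform(data[['color']]).squeeze()
data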

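The remark that the mean and median strategies only work with numerical data can be checked directly. In this sketch, the made-up frame `colors` holds text, and fitting with the default mean strategy is expected to raise a `ValueError`.

```python
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical text column with one missing entry.
colors = pd.DataFrame({'color': ['red', None, 'blue']}, dtype=object)

try:
    SimpleImputer().fit(colors)  # default strategy='mean'
    mean_failed = False
except ValueError as error:
    # Text cannot be converted to numbers, so no mean can be computed.
    mean_failed = True
    print(f"Mean strategy failed: {error}")
```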

The fourth strategy is to fill in a constant value, which is useful when you have decided on or calculated this value in advance. To do so, pass two arguments: `strategy='constant'` and `fill_value` with the value you want to set:

In [None]:
imputer = SimpleImputer(strategy='constant', fill_value=10)
imputer.fit(data[['other']])
data['other'] = imputer.transform(data[['other']])
data


And that's how we get a dataset without missing values, ready to be processed and used to train a machine learning model with scikit-learn.

I'll see you in the next chapter, where we'll explore other utilities that scikit-learn offers us to train our models more effectively.