# Impute missing data with Machine Learning and Optimus

Missing values are often the first obstacle in predictive modeling, so it is important to know how to handle them. Optimus makes missing data imputation (the process of replacing missing data with substituted values) easy with a function called `impute_missing()`.

In [1]:
# Import optimus
import optimus as op
# Import os for reading from local
import os

Deleting previous folder if exists...
Creation of checkpoint directory...
Done.


Let's create the utilities object and use it to read the CSV file `impute_data.csv` from the current folder.

In [2]:
tools = op.Utilities()

In [3]:
path = "file:///" + os.getcwd() + "/impute_data.csv"
df = tools.read_dataset_csv(path, delimiter_mark=",", header="true")

In [4]:
df.show()

+---+---+
|  a|  b|
+---+---+
|1.0|NaN|
|2.0|NaN|
|NaN|3.0|
|4.0|4.0|
|5.0|5.0|
+---+---+



We can see that our data has missing values in both columns, `a` and `b`. Let's use the DataFrameProfiler to run some visual analysis on our data.

In [5]:
profiler = op.DataFrameProfiler(df)

In [6]:
profiler.profiler()

0,1
Number of variables,2
Number of observations,5
Total Missing (%),10.0%
Total size in memory,0.0 B
Average record size in memory,0.0 B

0,1
Numeric,1
Categorical,0
Date,0
Text (Unique),0
Rejected,1

0,1
Distinct count,5
Unique (%),125.0%
Missing (%),20.0%
Missing (n),1
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,3
Minimum,1
Maximum,5
Zeros (%),0.0%

0,1
Minimum,1
5-th percentile,2
Q1,2
Median,2
Q3,4
95-th percentile,5
Maximum,5
Range,4
Interquartile range,2

0,1
Standard deviation,1.8257
Coef of variation,0.60858
Kurtosis,-1.64
Mean,3
MAD,1.5
Skewness,0
Sum,12
Variance,3.3333
Memory size,0.0 B

0,1
Correlation,1

Unnamed: 0,a,b
0,1.0,
1,2.0,
2,,3.0
3,4.0,4.0
4,5.0,5.0


We can see that the profiler found the missing value in the `a` column. It did not run the full analysis on the `b` column because it detected that both columns contain the same information (their correlation is 1).
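To make those two numbers concrete, here is a minimal plain-Python sketch that recomputes the missing fraction per column and a Pearson correlation over the rows where both values are present. The helper names `missing_fraction` and `pearson_complete_rows` are made up for this illustration, and using only complete rows is an assumption about how a correlation of 1 could arise, not a description of the profiler's internals.

```python
import math

nan = float("nan")
# The same five rows shown by df.show() above
a = [1.0, 2.0, nan, 4.0, 5.0]
b = [nan, nan, 3.0, 4.0, 5.0]


def missing_fraction(values):
    """Fraction of entries that are NaN."""
    return sum(math.isnan(v) for v in values) / len(values)


def pearson_complete_rows(xs, ys):
    """Pearson correlation using only rows where both values are present.

    Assumption: rows with a NaN in either column are dropped first.
    """
    pairs = [(x, y) for x, y in zip(xs, ys)
             if not (math.isnan(x) or math.isnan(y))]
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    return cov / math.sqrt(var_x * var_y)


print(missing_fraction(a))          # 0.2, i.e. 1 of 5 values missing
print(missing_fraction(b))          # 0.4, i.e. 2 of 5 values missing
print(pearson_complete_rows(a, b))  # 1.0, from the rows (4, 4) and (5, 5)
```

Only two rows have both values present, and they are identical in both columns, so a correlation of 1 here says very little about the rest of the data.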

## Data Imputing with Optimus 

We wrapped the `Imputer` estimator of the Apache Spark Machine Learning (ML) library to make your life much easier. If you want to see how to do this with vanilla Spark, have a look at the [Pyspark API](https://spark.apache.org/docs/latest/api/python/pyspark.ml.html?highlight=impute#pyspark.ml.feature.Imputer). You can choose to fill each missing value with either the mean or the median of the column it is in. The input columns should be of DoubleType or FloatType.
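As a rough illustration of what the two strategies compute, here is a minimal plain-Python sketch that works on lists instead of Spark DataFrames. The function `impute_column` is hypothetical and is not part of Optimus or Spark. For the median it assumes the lower of the two middle values is used when the number of observed values is even, which mimics a quantile that always returns an actual data point. That choice is an assumption made so the sketch agrees with the output tables in this notebook.

```python
import math
import statistics


def impute_column(values, strategy="mean"):
    """Replace NaN entries with the mean or median of the observed values."""
    observed = [v for v in values if not math.isnan(v)]
    if strategy == "mean":
        fill = statistics.mean(observed)
    elif strategy == "median":
        # Assumption: for an even count, use the lower middle value
        # rather than the average of the two middle values.
        fill = statistics.median_low(observed)
    else:
        raise ValueError("strategy must be 'mean' or 'median'")
    return [fill if math.isnan(v) else v for v in values]


nan = float("nan")
a = [1.0, 2.0, nan, 4.0, 5.0]
b = [nan, nan, 3.0, 4.0, 5.0]

print(impute_column(a, "mean"))    # [1.0, 2.0, 3.0, 4.0, 5.0]
print(impute_column(b, "mean"))    # [4.0, 4.0, 3.0, 4.0, 5.0]
print(impute_column(a, "median"))  # [1.0, 2.0, 2.0, 4.0, 5.0]
print(impute_column(b, "median"))  # [4.0, 4.0, 3.0, 4.0, 5.0]
```

Note that the fill value is computed from the observed values only, so the missing entries themselves never influence the mean or median.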

In Optimus you only have to do this:

In [7]:
transformer = op.DataFrameTransformer(df)

In [8]:
# Choose the input columns to impute and the names of the output columns; fill with the column mean
transformer.impute_missing(["a","b"],["out_a","out_B"],strategy="mean").show()

+---+---+-----+-----+
|  a|  b|out_a|out_B|
+---+---+-----+-----+
|1.0|NaN|  1.0|  4.0|
|2.0|NaN|  2.0|  4.0|
|NaN|3.0|  3.0|  3.0|
|4.0|4.0|  4.0|  4.0|
|5.0|5.0|  5.0|  5.0|
+---+---+-----+-----+



In [9]:
# Same columns as before, this time filling with the column median
transformer.impute_missing(["a","b"],["out_a","out_B"],strategy="median").show()

+---+---+-----+-----+
|  a|  b|out_a|out_B|
+---+---+-----+-----+
|1.0|NaN|  1.0|  4.0|
|2.0|NaN|  2.0|  4.0|
|NaN|3.0|  2.0|  3.0|
|4.0|4.0|  4.0|  4.0|
|5.0|5.0|  5.0|  5.0|
+---+---+-----+-----+



And voilà :)