## Problem Statement
- Load the data set
- Identify missing values
- Statistical imputation of missing values

## Description
- Real-world data often has missing values. 
- Values can be missing for a number of reasons, such as observations that were never recorded or data corruption.
- Handling missing data is important as many machine learning algorithms do not support data with missing values.



#### Load Python libraries and dataset

In [None]:
import pandas as pd
from numpy import set_printoptions, nan, isnan


In [None]:
data = pd.read_csv("../data/pima-indians-diabetes.csv")

#### Check Your Data

In [None]:
# check the first 5 rows of the dataset
print(data.head(5))

## Find Missing Values

#### Count the number of missing values for each column

In [None]:
# count the number of data points in each column where the value is zero
num_missing = (data[['plas', 'skin', 'test', 'mass', 'pedi']] == 0).sum()
# report the results
print(num_missing)

## Handle Missing Values
- Here we will see two methods to handle missing values:
- 1. Remove the missing values: whether this method is practical depends on the volume of data.
- 2. Impute missing values: replace each missing value with a statistical value (e.g. the mean).
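
The trade-off between the two methods can be seen on a small, made-up table. The sketch below assumes a hypothetical toy DataFrame (the column names `a` and `b` and their values are illustrative and do not come from the diabetes data) and compares dropping rows against filling with the column mean.

```python
import numpy as np
import pandas as pd

# hypothetical toy data, for illustration only
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": [10.0, 20.0, np.nan, 40.0],
})

# method 1: drop every row that contains at least one NaN
dropped = toy.dropna()

# method 2: replace each NaN with the mean of its own column
imputed = toy.fillna(toy.mean())

print(f"rows before: {len(toy)}, after dropna: {len(dropped)}, after fillna: {len(imputed)}")
print(imputed)
```

Dropping keeps only fully observed rows, so on a small dataset it can discard a large share of the data, while imputation keeps every row at the cost of introducing estimated values.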

## Remove rows with missing values

#### Replace '0' values with 'nan'

In [None]:
data[['plas', 'skin', 'test', 'mass', 'pedi']] = data[['plas', 'skin', 'test', 'mass', 'pedi']].replace(0, nan)
# count the number of nan values in each column
print(data.isnull().sum())

In [None]:
# check data with NaN values
print(data.head(20))

In [None]:
print(f'data shape before removing missing values\n=========================================\n{data.shape}')
# drop rows with missing values
data.dropna(inplace=True)
# summarize the shape of the data with missing rows removed
print(f'\ndata shape after removing missing values\n=========================================\n{data.shape}')

## Statistical Imputation of Missing Values

**What is missing data imputation?**

- It is the process of identifying missing values and replacing them with a substitute value before modeling the prediction task.

**Common approach for data imputation**
- Calculate a statistical value for each column (such as a mean) and replace all missing values for that column with the statistic.
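
As a worked illustration of this approach, the sketch below computes a per-column mean over the observed values and writes it into the missing positions. It assumes a hypothetical toy array `X_toy` (not the horse dataset) and uses plain NumPy rather than any particular imputation class.

```python
import numpy as np

# hypothetical toy matrix: rows are samples, columns are features
X_toy = np.array([
    [1.0, np.nan],
    [3.0, 4.0],
    [np.nan, 8.0],
])

# per-column mean computed over the observed (non-NaN) values only
col_means = np.nanmean(X_toy, axis=0)

# replace each NaN with the mean of the column it sits in
# (col_means is broadcast across the rows)
X_filled = np.where(np.isnan(X_toy), col_means, X_toy)

print(col_means)
print(X_filled)
```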

In [None]:
data_horse = pd.read_csv("../data/horse.csv", header=None, na_values='?')
print(data_horse.head(5))
print(data_horse.columns)
# summarize the number of rows with missing values for each column
for i in range(data_horse.shape[1]):
    # count number of rows with missing values
    n_miss = data_horse[i].isnull().sum()
    perc = n_miss / data_horse.shape[0] * 100
    print('> %d, Missing: %d (%.1f%%)' % (i, n_miss, perc))

In [None]:
# split into input and output elements
data1 = data_horse.values
ix = [i for i in range(data1.shape[1]) if i != 23]
#print(ix)
X, y = data1[:, ix], data1[:, 23]
#print(X)
#print(y)

## Statistical Imputation With SimpleImputer

The scikit-learn machine learning library provides the SimpleImputer class, which supports
statistical imputation. SimpleImputer is a data transform that is first configured with the type of statistic to
calculate for each column, e.g. the mean.

**Strategies available in SimpleImputer**

strategies = ['mean', 'median', 'most_frequent', 'constant']
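
To see how the strategies differ, the sketch below applies each one to a hypothetical single-column array with one missing entry. The array `X_toy` and the fill value `-1.0` used for the `'constant'` strategy are arbitrary choices made for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# hypothetical single-feature column whose last entry is missing
X_toy = np.array([[1.0], [2.0], [2.0], [7.0], [np.nan]])

filled = {}
for strategy in ["mean", "median", "most_frequent", "constant"]:
    # fill_value only matters for 'constant'; -1.0 is an arbitrary illustrative choice
    kwargs = {"fill_value": -1.0} if strategy == "constant" else {}
    imputer = SimpleImputer(strategy=strategy, **kwargs)
    # record the value that replaced the missing last entry
    filled[strategy] = imputer.fit_transform(X_toy)[-1, 0]
    print(f"{strategy}: {filled[strategy]}")
```

The mean is pulled upward by the outlying value 7, while the median and most frequent value are not, which is one reason to prefer them for skewed columns.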

In [None]:
# load Python libraries 
from sklearn.impute import SimpleImputer

In [None]:
# summarize total missing
print('Before Imputation\n==================\nMissing: %d' % sum(pd.isnull(X).flatten()))
imputer = SimpleImputer(strategy='mean')
# fit on the dataset
imputer.fit(X)
# transform the dataset
X_transformed = imputer.transform(X)
# summarize total missing after imputation
print('\nAfter Imputation\n==================\nMissing: %d' % sum(isnan(X_transformed).flatten()))