
# Exercise 4: Handling missing values

The goal of this exercise is to learn how to handle missing values. In the previous exercise we used a first technique: filtering out the rows that contain missing values. That worked because the proportion of missing values was low. In some cases, however, dropping the missing values is not an option because the filtered data set would be too small.
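
Before deciding whether dropping is acceptable, it can help to measure how much data is missing and how many rows would survive. The sketch below is a minimal illustration on a small made-up DataFrame (the `toy` frame and its values are assumptions, not the exercise's `iris.csv`).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the exercise's iris.csv
toy = pd.DataFrame({
    "sepal_length": [5.1, np.nan, 4.7, 4.6],
    "sepal_width": [3.5, 3.0, np.nan, np.nan],
})

# Fraction of missing values in each column
missing_ratio = toy.isna().mean()
print(missing_ratio)

# Rows that would survive dropping every row with at least one missing value
rows_left = len(toy.dropna())
print(f"{rows_left} of {len(toy)} rows remain after dropna()")
```

When only a small fraction of rows is lost, filtering is usually fine; when most rows disappear, as in this toy frame, filling the gaps becomes the more practical option.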

This article explains the different types of missing data and how they should be handled.

https://towardsdatascience.com/data-cleaning-with-python-and-pandas-detecting-missing-values-3e9c6ebcf78b

"**It’s important to understand these different types of missing data from a statistics point of view. The type of missing data will influence how you deal with filling in the missing values.**"

1. Drop the `flower` column

- Fill the missing values with a different "strategy" for each column:

    `sepal_length` -> `mean`

    `sepal_width` -> `median`

    `petal_length`, `petal_width` -> `0`

2. Explain why filling the missing values with 0 or the mean is a bad idea
3. Fill the missing values using the median
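
As a hint for questions 2 and 3, the following sketch compares the three fill values on a made-up series (the `lengths` values are hypothetical and include one deliberate data-entry error). It is only an illustration of the idea, not the expected solution.

```python
import numpy as np
import pandas as pd

# Hypothetical measurements in cm: one data-entry error (6900.0) and one missing value
lengths = pd.Series([5.0, 5.2, 4.8, 6900.0, np.nan])

mean_value = lengths.mean()      # pulled far away from typical values by the outlier
median_value = lengths.median()  # stays close to the typical values

filled_mean = lengths.fillna(mean_value)      # inserts an implausibly large length
filled_zero = lengths.fillna(0)               # inserts a length no flower can have
filled_median = lengths.fillna(median_value)  # inserts a typical length

print("mean:", mean_value, "median:", median_value)
```

Here the mean is in the thousands while the median stays near 5, so a mean fill would add another unrealistic value. A zero fill creates an artificial minimum and drags the column statistics down.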



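import pandas as pd
from pandas import DataFrame

df = pd.read_csv("iris.csv", sep=',')

# The flower (species) column is not numeric, so it is dropped first
df = df.drop('flower', axis=1)

def update_types(df: DataFrame) -> DataFrame:
    # Convert every column to a numeric dtype in place;
    # errors='coerce' turns entries that cannot be parsed into NaN
    for c in df:
        df[c] = pd.to_numeric(df[c], errors='coerce')
    return df

df = update_types(df)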

# testing
# for column in df:
#     print(type(df[column]))
#     print(type(df[column].values[0]))

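# Summary of the raw data, before any filling:
# a count below 150 means the column has missing values
print(df.describe())

# fillna() matches dictionary keys against column labels, so the keys must be
# column names; integer keys such as 0 or 2 match no column and fill nothing
bad = df.fillna({'sepal_length': df['sepal_length'].mean(),
'sepal_width': df['sepal_width'].median(),
'petal_length': 0,
'petal_width': 0})

# print(bad.describe())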

# print(df)

       Unnamed: 0  sepal_length  sepal_width  petal_length  petal_width
count  150.000000    146.000000   141.000000    120.000000   147.000000
mean    74.500000     56.907534    52.625532     15.529167    12.026531
std     43.445368    572.222221   417.127170    127.459631   131.873447
min      0.000000     -4.400000    -3.600000     -4.800000    -2.500000
25%     37.250000      5.100000     2.800000      2.725000     0.300000
50%     74.500000      5.750000     3.000000      4.500000     1.300000
75%    111.750000      6.400000     3.300000      5.100000     1.800000
max    149.000000   6900.000000  3809.000000   1400.000000  1600.000000
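
In the summary above, `sepal_length` has a mean of about 56.9 but a median of 5.75, with a maximum of 6900, which suggests that a few extreme values dominate the mean. One possible way to fill each column with its own median is sketched below on a small stand-in frame (the `sample` frame and its values are made up for illustration).

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the exercise's DataFrame (made-up values)
sample = pd.DataFrame({
    "sepal_length": [5.1, np.nan, 6900.0, 4.9],
    "petal_width": [0.2, 1.3, np.nan, 1.8],
})

# DataFrame.median() returns one median per column; fillna() aligns that
# Series on the column labels, so each column is filled with its own median
medians = sample.median()
filled = sample.fillna(medians)

print(filled)
```

The same two lines should carry over to the exercise's `df`, assuming all of its columns are numeric after the type conversion.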
