#### Random Sample Imputation
Aim: Random sample imputation takes random observations from the non-missing values of a variable and uses them to replace the NaN values.

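When should it be used? When the data can be assumed to be missing completely at random (MCAR), i.e. the probability that a value is missing does not depend on any observed or unobserved values.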
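As a minimal sketch of the idea, the snippet below fills the gaps in a small, made-up Series (the values are hypothetical, not taken from the Titanic data) with values drawn at random from its observed entries.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: three of the eight values are missing.
ages = pd.Series([22.0, np.nan, 26.0, 35.0, np.nan, 54.0, 2.0, np.nan], name="Age")
missing = ages.isnull()

# Draw one observed value per missing entry (without replacement by default).
draws = ages.dropna().sample(n=int(missing.sum()), random_state=0)

# Assign the raw values so the original labels of the draws are ignored.
filled = ages.copy()
filled.loc[missing] = draws.to_numpy()

print(filled)
```

Every imputed value is one that was actually observed, so the filled column cannot contain a value outside the observed range.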

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from pandas import DataFrame

In [2]:
df: DataFrame = pd.read_csv('../data/titanic.csv', usecols=['Age', 'Fare', 'Survived'])
df.head()

   Survived   Age     Fare
0         0  22.0   7.2500
1         1  38.0  71.2833
2         1  26.0   7.9250
3         1  35.0  53.1000
4         0  35.0   8.0500


In [3]:
df.isnull().sum()

Survived      0
Age         177
Fare          0
dtype: int64

In [4]:
df.isnull().mean()

Survived    0.000000
Age         0.198653
Fare        0.000000
dtype: float64

In [5]:
df['Age'].dropna().sample()

741    36.0
Name: Age, dtype: float64

In [6]:
df['Age'].isnull().sum()

177

In [7]:
df['Age'].dropna().sample(df['Age'].isnull().sum(), random_state=0)

423    28.00
177    50.00
305     0.92
292    36.00
889    26.00
       ...  
539    22.00
267    25.00
352    15.00
99     34.00
689    15.00
Name: Age, Length: 177, dtype: float64
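Sampling without replacement, as above, only works when there are at least as many observed values as missing ones. A minimal sketch of the other case, on a hypothetical column, passes `replace=True` so the same observed value can be drawn more than once:

```python
import numpy as np
import pandas as pd

# Hypothetical column with more missing values (4) than observed ones (2).
s = pd.Series([10.0, np.nan, np.nan, 20.0, np.nan, np.nan])
n_missing = int(s.isnull().sum())

# replace=True allows an observed value to be drawn repeatedly.
draws = s.dropna().sample(n=n_missing, replace=True, random_state=0)

print(draws)
```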

Examples for sample()
--------
>>> df = pd.DataFrame({'num_legs': [2, 4, 8, 0],
...                    'num_wings': [2, 0, 0, 0],
...                    'num_specimen_seen': [10, 2, 1, 8]},
...                   index=['falcon', 'dog', 'spider', 'fish'])
>>> df
        num_legs  num_wings  num_specimen_seen
falcon         2          2                 10
dog            4          0                  2
spider         8          0                  1
fish           0          0                  8

Extract 3 random elements from the ``Series`` ``df['num_legs']``:
Note that we use `random_state` to ensure the reproducibility of
the examples.

>>> df['num_legs'].sample(n=3, random_state=1)
fish      0
spider    8
falcon    2
Name: num_legs, dtype: int64


In [8]:
df[df['Age'].isnull()].index

Int64Index([  5,  17,  19,  26,  28,  29,  31,  32,  36,  42,
            ...
            832, 837, 839, 846, 849, 859, 863, 868, 878, 888],
           dtype='int64', length=177)

In [9]:
# df.loc[row_label, column_label] returns the value at that row label and column
# df.loc[:, :] selects all rows and all columns
df.loc[1, 'Age']

38.0

In [10]:
def impute_nan(df: DataFrame, variable: str, median: float) -> None:
    """Add median-imputed and random-sample-imputed copies of `variable` to df, in place."""
    df[variable + "_median"] = df[variable].fillna(median)
    df[variable + "_random"] = df[variable]
    # Draw one observed value for each missing entry
    random_sample = df[variable].dropna().sample(df[variable].isnull().sum(), random_state=0)
    # Relabel the sample with the index of the missing rows, because pandas aligns on index labels when assigning
    random_sample.index = df[df[variable].isnull()].index
    # Fill only the rows where the original variable is missing
    df.loc[df[variable].isnull(), variable + "_random"] = random_sample
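The index reassignment matters because pandas aligns a Series on its labels when it is assigned through `.loc`. The sketch below uses a hypothetical five-row frame to show what happens with and without that step.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: rows 1 and 3 are missing.
toy = pd.DataFrame({"Age": [22.0, np.nan, 26.0, np.nan, 35.0]})
mask = toy["Age"].isnull()

sample = toy["Age"].dropna().sample(n=int(mask.sum()), random_state=0)

# Without relabelling, the sample keeps labels of observed rows (from 0, 2, 4).
# None of them match the missing rows (1, 3), so alignment writes NaN.
unaligned = toy["Age"].copy()
unaligned.loc[mask] = sample

# After relabelling, the sample's labels match the missing rows exactly.
sample.index = toy.index[mask]
aligned = toy["Age"].copy()
aligned.loc[mask] = sample

print(unaligned.isnull().sum(), aligned.isnull().sum())
```

An alternative that avoids alignment altogether is to assign the raw values, for example with `sample.to_numpy()`.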

In [11]:
median = df.Age.median()
median

28.0

In [12]:
impute_nan(df, "Age", median)
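One way to compare the two new columns is to look at their spread. Because `titanic.csv` is not bundled here, the sketch below uses synthetic, normally distributed data with roughly 20% of values removed completely at random. The data and variable names are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in for a numeric column (not the Titanic data).
values = pd.Series(rng.normal(loc=30.0, scale=14.0, size=1000))

# Remove roughly 20% of the values completely at random.
mask = rng.random(1000) < 0.2
observed = values.mask(mask)

# Median imputation: every gap receives the same value.
median_filled = observed.fillna(observed.median())

# Random sample imputation: every gap receives a randomly drawn observed value.
draws = observed.dropna().sample(n=int(mask.sum()), random_state=0)
draws.index = observed.index[mask]
random_filled = observed.copy()
random_filled.loc[mask] = draws

print("observed std:     ", observed.std())
print("median-filled std:", median_filled.std())
print("random-filled std:", random_filled.std())
```

Under these assumptions, median imputation shrinks the standard deviation because it stacks every imputed value at a single point, while random sample imputation keeps it close to that of the observed data.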