# Imputation

In [1]:
import numpy as np
import pandas as pd

import os

import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split, cross_val_score, KFold
from sklearn.neighbors import KNeighborsRegressor
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.svm import SVR
from sklearn import tree
from sklearn.tree import DecisionTreeRegressor
from imblearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder

from imputation_utils import FamilyAgeImputer, GroupByImputer, Imputer
from feature_engineering_utils import Preprocessor

We now use data in which all features have already been created. This data was produced in the feature_engineering notebook.

In [2]:
_DATADIR = './data'
_ALL_FEATURE_DATA = os.path.join(_DATADIR, 'raw_data_all_features.csv')

In [3]:
all_feature_data = pd.read_csv(_ALL_FEATURE_DATA, index_col='PassengerId')
all_feature_data

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,...,Title_grouped,FamilySize,FamilySize_grouped,GroupSize,GroupSize_grouped,GroupRate,GroupRate_grouped,FamilyRate,FamilyRate_grouped,Fare_adjusted
1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,...,Mr,2,1,2,1,0.188908,0,0.000000,0,3.62500
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,...,Mrs,2,1,2,1,0.742038,1,0.742038,1,35.64165
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,...,Miss,1,0,1,0,0.742038,1,0.742038,1,7.92500
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,...,Mrs,2,1,2,1,0.500000,1,0.500000,1,26.55000
5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,...,Mr,1,0,1,0,0.188908,0,0.500000,1,8.05000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,...,Religious,1,0,1,0,0.188908,0,0.188908,0,13.00000
888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,...,Miss,1,0,1,0,0.742038,1,0.666667,1,30.00000
889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,...,Miss,4,1,4,1,0.000000,0,0.000000,0,5.86250
890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,...,Mr,1,0,1,0,0.188908,0,0.188908,0,30.00000


As we saw in the data_analysis notebook, most of the missing values come from the Age column. Missing entries in the Cabin column are not true missing values, and the other columns have only a few missing values in the train/test sets:

In [4]:
competition_data = pd.read_csv(os.path.join(_DATADIR, 'test.csv')) # not all features created here
print('Missing values in training data with all features: \n')
print(all_feature_data.isna().sum())
print('\nMissing values in competition/test data: \n')
print(competition_data.isna().sum())

Missing values in training data with all features: 

Survived                0
Pclass                  0
Name                    0
Sex                     0
Age                   177
SibSp                   0
Parch                   0
Ticket                  0
Fare                    0
Cabin                 687
Embarked                2
HasCabin                0
CabinType               0
Surname                 0
Title                   0
Title_grouped           0
FamilySize              0
FamilySize_grouped      0
GroupSize               0
GroupSize_grouped       0
GroupRate               0
GroupRate_grouped       0
FamilyRate              0
FamilyRate_grouped      0
Fare_adjusted           0
dtype: int64

Missing values in competition/test data: 

PassengerId      0
Pclass           0
Name             0
Sex              0
Age             86
SibSp            0
Parch            0
Ticket           0
Fare             1
Cabin          327
Embarked         0
dtype: int64


We will focus on imputing Age, as it is the most problematic column to impute. Before that, we impute the missing values in the 'Embarked' column of the training data. This imputation of 'Embarked' is done only for the sake of the analysis in this notebook. Once we have an imputer for Age, we can create a general imputer (pipeline component) that handles the other columns as well.

In [5]:
# Impute missing Embarked values with the most frequent port within each Pclass
embarked_groupby = all_feature_data.groupby(['Pclass'])['Embarked'].agg(lambda x: pd.Series.mode(x)[0])
embarked_missing = all_feature_data['Embarked'].isna()
all_feature_data.loc[embarked_missing, 'Embarked'] = all_feature_data.loc[embarked_missing, 'Pclass'].map(embarked_groupby)

We use the rows that have Age values as training and validation data:

In [6]:
age_missing_indices = all_feature_data['Age'].isna()
full_age_data = all_feature_data[~age_missing_indices]

We will then compare different imputation methods and choose the best-performing one as our final imputer. Some of the imputers need preprocessed data as input, so we use our Preprocessor class (see feature_engineering_utils.py for more details) to preprocess the data. It returns all the columns that we have specified as numerical, ordinal or categorical in their correct forms:

In [7]:
numerical = []
ordinal = []
categorical = ['Title_grouped', 'Pclass', 'FamilySize_grouped'] 

preprocessing_params = {'categorical_cols': categorical, 'numerical_cols': numerical, 'ordinal_cols': ordinal}
imputation_preprocessing = Preprocessor(**preprocessing_params) # to be used by KNeighborsRegressor and SVR
preprocessed_data_example = imputation_preprocessing.fit_transform(full_age_data)

In [8]:
preprocessed_data_example

,cat__Title_grouped_British_noble,cat__Title_grouped_Master,cat__Title_grouped_Miss,cat__Title_grouped_Mr,cat__Title_grouped_Mrs,cat__Title_grouped_Other,cat__Title_grouped_Religious,cat__Pclass_1,cat__Pclass_2,cat__Pclass_3,cat__FamilySize_grouped_0,cat__FamilySize_grouped_1,cat__FamilySize_grouped_2
0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
1,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0
2,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0
3,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0
4,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
709,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0
710,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0
711,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0
712,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0


Next, we will define imputation methods that we will try:
- groupby_imputer is a simple imputer that groups the data by some columns and returns the median age of each group as the imputed value. Here we use Title_grouped and Pclass as groupby_cols. See imputation_utils.py for more details about GroupByImputer.
- decision_tree_imputer, kneighbors_regressor_imputer and svr_imputer wrap standard sklearn regressors that take the preprocessed data as input.
- family_imputer (see FamilyAgeImputer in imputation_utils.py) tries to improve the predictions of its base_imputation_method by using the passenger's family info when available. We start with GroupByImputer as its base_imputation_method and see whether family info improves its performance.
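To make the group-by idea concrete, here is a minimal sketch of a group-median imputer on made-up data. SimpleGroupByImputer is a hypothetical stand-in written for this example. The real GroupByImputer in imputation_utils.py may behave differently, for instance in how it handles groups it has not seen during fitting.

```python
import pandas as pd


class SimpleGroupByImputer:
    """Hypothetical stand-in for GroupByImputer: predicts a per-group statistic."""

    def __init__(self, target_col, groupby_cols, agg_func=pd.Series.median):
        self.target_col = target_col
        self.groupby_cols = groupby_cols
        self.agg_func = agg_func

    def fit(self, X, y=None):
        # One value per group, plus a global fallback for unseen groups (an assumption).
        self.group_values_ = (
            X.groupby(self.groupby_cols)[self.target_col]
            .agg(self.agg_func)
            .rename('_prediction')
            .reset_index()
        )
        self.fallback_ = self.agg_func(X[self.target_col].dropna())
        return self

    def predict(self, X):
        # A left merge keeps the row order of X; unseen groups get the fallback.
        merged = X[self.groupby_cols].merge(
            self.group_values_, on=self.groupby_cols, how='left'
        )
        preds = merged['_prediction'].fillna(self.fallback_)
        preds.index = X.index
        return preds


train = pd.DataFrame({
    'Title_grouped': ['Mr', 'Mr', 'Mr', 'Miss', 'Miss', 'Master'],
    'Pclass': [3, 3, 1, 3, 3, 3],
    'Age': [20.0, 30.0, 50.0, 18.0, 22.0, 4.0],
})
test = pd.DataFrame(
    {'Title_grouped': ['Mr', 'Miss', 'Mrs'], 'Pclass': [3, 3, 1]},
    index=[10, 11, 12],
)

toy_imputer = SimpleGroupByImputer('Age', ['Title_grouped', 'Pclass']).fit(train)
toy_preds = toy_imputer.predict(test)
print(toy_preds)
```

The ('Mrs', 1) group does not occur in the toy training data, so that row receives the global median instead of a group median.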

In [9]:
def groupby_imputer(X_train, y_train, X_test):
    imputer = GroupByImputer('Age', ['Title_grouped', 'Pclass'], pd.Series.median)
    imputer.fit(X_train, y_train)
    return imputer.predict(X_test)

def decision_tree_imputer(X_train, y_train, X_test, preprocessing_params):
    pipeline = Pipeline(steps=[
    ('preprocessing', Preprocessor(**preprocessing_params)),
    ('regressor', DecisionTreeRegressor(max_depth=5))
    ])
    
    pipeline.fit(X_train, y_train)
     
    return pipeline.predict(X_test)

def kneighbors_regressor_imputer(X_train, y_train, X_test, preprocessing_params): 
    pipeline = Pipeline(steps=[
    ('preprocessing', Preprocessor(**preprocessing_params)),
    ('regressor', KNeighborsRegressor(n_neighbors=3))
    ])
    
    pipeline.fit(X_train, y_train)
    
    return pipeline.predict(X_test)


def svr_imputer(X_train, y_train, X_test, preprocessing_params):
    pipeline = Pipeline(steps=[
    ('preprocessing', Preprocessor(**preprocessing_params)),
    ('regressor', SVR())
    ])
    
    pipeline.fit(X_train, y_train)
    
    return pipeline.predict(X_test)
    

def family_imputer(X_train, y_train, X_test):
    base_imputation_method = GroupByImputer('Age', ['Title_grouped', 'Pclass'], pd.Series.median)
    imp = FamilyAgeImputer(base_imputation_method=base_imputation_method)
    imp.fit(X_train, y_train)

    X_test_age_missing = X_test.copy(deep=True)
    X_test_age_missing['Age'] = np.nan
    imputed_data, imputed_rows = imp.transform(X_test_age_missing)
    return imputed_data
    

Let's now initialize the preprocessor that some of the methods use.

In [10]:
imputation_preprocessing = Preprocessor(**preprocessing_params) # to be used by KNeighborsRegressor and SVR

Let's also define score metrics that we want to track:

In [11]:
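def mse_mad_mre_scores(y_pred, y):
    """Return mean squared, mean absolute and mean relative error of y_pred against y."""
    errors = y_pred - y
    mse = np.mean(errors**2)
    mad = np.mean(np.abs(errors))
    # use the function argument y rather than the global y_test
    mre = np.mean(np.abs(errors) / y)
    return mse, mad, mre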
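As a quick sanity check of the three metrics, here is a small worked example on made-up numbers. The arrays are invented for illustration only.

```python
import numpy as np

y_true = np.array([20.0, 40.0, 10.0])
y_hat = np.array([25.0, 30.0, 10.0])

errors = y_hat - y_true                    # [5, -10, 0]
mse = np.mean(errors ** 2)                 # (25 + 100 + 0) / 3
mad = np.mean(np.abs(errors))              # (5 + 10 + 0) / 3
mre = np.mean(np.abs(errors) / y_true)     # (0.25 + 0.25 + 0) / 3

print(mse, mad, mre)
```

Because the relative error divides by the true age, the same absolute error counts far more for a young passenger than for an old one, so MRE can rank methods differently from MSE and MAD.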

Now we are ready to test the methods. They are compared using standard 5-fold cross-validation (KFold() defaults to 5 splits without shuffling).
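The comparison loop can be sketched in miniature as follows. This toy version uses synthetic ages and compares only two simple baselines (a global median and a per-title median). The data, the shuffling and the fixed random seed are assumptions made for the example and differ from the notebook's actual setup.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

# Synthetic data: age depends on the title group plus some noise (made up).
rng = np.random.default_rng(0)
n = 200
titles = rng.choice(['Master', 'Miss', 'Mr'], size=n)
base_age = pd.Series(titles).map({'Master': 5.0, 'Miss': 22.0, 'Mr': 32.0})
toy = pd.DataFrame({'Title_grouped': titles, 'Age': base_age + rng.normal(0, 3, size=n)})

kf = KFold(n_splits=5, shuffle=True, random_state=0)
fold_mad = {'global median': [], 'median by title': []}

for train_idx, test_idx in kf.split(toy):
    train, test = toy.iloc[train_idx], toy.iloc[test_idx]

    # Baseline 1: the same median age for everyone.
    global_pred = np.full(len(test), train['Age'].median())

    # Baseline 2: median age of the passenger's title group.
    group_medians = train.groupby('Title_grouped')['Age'].median()
    group_pred = test['Title_grouped'].map(group_medians).fillna(train['Age'].median())

    fold_mad['global median'].append(np.mean(np.abs(global_pred - test['Age'])))
    fold_mad['median by title'].append(np.mean(np.abs(group_pred - test['Age'])))

mean_mad = {name: float(np.mean(vals)) for name, vals in fold_mad.items()}
print(mean_mad)
```

Each method is fitted on the training part of a fold and scored on the held-out part, and the per-fold scores are averaged at the end.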

In [12]:
scores = {'Group By Imputer': [], 'Decision Tree Regressor': [],
               'KNeighbor Regressor Imputer': [], 'SVR Imputer': [], 
               'Family Age Imputer': []}


kf = KFold()

for train_index, test_index in kf.split(full_age_data):
    X_train, X_test = full_age_data.iloc[train_index], full_age_data.iloc[test_index]
    y_train, y_test = full_age_data['Age'].iloc[train_index], full_age_data['Age'].iloc[test_index]
    
    gm_scores = mse_mad_mre_scores(groupby_imputer(X_train, y_train, X_test), y_test)
    scores['Group By Imputer'].append(gm_scores)
    
    dtree_scores = mse_mad_mre_scores(decision_tree_imputer(X_train, y_train, X_test, preprocessing_params), 
                                      y_test)
    scores['Decision Tree Regressor'].append(dtree_scores)
    
    kn_scores = mse_mad_mre_scores(kneighbors_regressor_imputer(X_train, y_train, X_test, preprocessing_params),
                                   y_test)
    scores['KNeighbor Regressor Imputer'].append(kn_scores)
    
    svr_scores = mse_mad_mre_scores(svr_imputer(X_train, y_train, X_test, preprocessing_params), y_test)
    scores['SVR Imputer'].append(svr_scores)
    
    family_scores = mse_mad_mre_scores(family_imputer(X_train, y_train, X_test), y_test)
    scores['Family Age Imputer'].append(family_scores)
      
for name, imputer_scores in scores.items():
    print(f'####### {name} #######')
    mse, mad, mre = np.mean(imputer_scores, axis=0)
    print(f'MSE: {mse}, MAD: {mad}, MRE: {mre}')


####### Group By Imputer #######
MSE: 137.88373216832466, MAD: 8.951257657835123, MRE: 0.5961989775196657
####### Decision Tree Regressor #######
MSE: 125.61921257175035, MAD: 8.603154226079546, MRE: 0.5168124175298973
####### KNeighbor Regressor Imputer #######
MSE: 161.12966244492594, MAD: 9.789553727962177, MRE: 0.5764195656206711
####### SVR Imputer #######
MSE: 142.3984128669165, MAD: 9.151694278934382, MRE: 0.8609015349408647
####### Family Age Imputer #######
MSE: 126.50254255288864, MAD: 8.419034183431565, MRE: 0.4299577940436937


Decision Tree Regressor and Family Age Imputer are the most accurate imputation methods on all three metrics. Using family info improved the Group By Imputer: MSE dropped from 137.9 to 126.5, MAD from 8.95 to 8.42 and MRE from 0.60 to 0.43. Next, we test whether family info can also improve the accuracy of the Decision Tree Regressor. We set Decision Tree Regressor as the base_imputation_method for Family Age Imputer and run the comparison again:

In [13]:
def family_imputer(X_train, y_train, X_test):
    pipeline = Pipeline(steps=[
        ('preprocessing', Preprocessor(**preprocessing_params)),
        ('regressor', DecisionTreeRegressor(max_depth=5))
        ])
    imp = FamilyAgeImputer(base_imputation_method=pipeline)
    imp.fit(X_train, y_train)

    X_test_age_missing = X_test.copy(deep=True)
    X_test_age_missing['Age'] = np.nan
    imputed_data, imputed_rows = imp.transform(X_test_age_missing)
    return imputed_data


In [14]:
scores = {'Group By Imputer': [], 'Decision Tree Regressor': [],
               'KNeighbor Regressor Imputer': [], 'SVR Imputer': [], 
               'Family Age Imputer': []}


kf = KFold()

for train_index, test_index in kf.split(full_age_data):
    X_train, X_test = full_age_data.iloc[train_index], full_age_data.iloc[test_index]
    y_train, y_test = full_age_data['Age'].iloc[train_index], full_age_data['Age'].iloc[test_index]
    
    gm_scores = mse_mad_mre_scores(groupby_imputer(X_train, y_train, X_test), y_test)
    scores['Group By Imputer'].append(gm_scores)
    
    dtree_scores = mse_mad_mre_scores(decision_tree_imputer(X_train, y_train, X_test, preprocessing_params), 
                                      y_test)
    scores['Decision Tree Regressor'].append(dtree_scores)
    
    kn_scores = mse_mad_mre_scores(kneighbors_regressor_imputer(X_train, y_train, X_test, preprocessing_params),
                                   y_test)
    scores['KNeighbor Regressor Imputer'].append(kn_scores)
    
    svr_scores = mse_mad_mre_scores(svr_imputer(X_train, y_train, X_test, preprocessing_params), y_test)
    scores['SVR Imputer'].append(svr_scores)
    
    family_scores = mse_mad_mre_scores(family_imputer(X_train, y_train, X_test), y_test)
    scores['Family Age Imputer'].append(family_scores)
      
for name, imputer_scores in scores.items():
    print(f'####### {name} #######')
    mse, mad, mre = np.mean(imputer_scores, axis=0)
    print(f'MSE: {mse}, MAD: {mad}, MRE: {mre}')

####### Group By Imputer #######
MSE: 137.88373216832466, MAD: 8.951257657835123, MRE: 0.5961989775196657
####### Decision Tree Regressor #######
MSE: 125.61921257175035, MAD: 8.603154226079546, MRE: 0.5168124175298973
####### KNeighbor Regressor Imputer #######
MSE: 161.12966244492594, MAD: 9.789553727962177, MRE: 0.5764195656206711
####### SVR Imputer #######
MSE: 142.3984128669165, MAD: 9.151694278934382, MRE: 0.8609015349408647
####### Family Age Imputer #######
MSE: 115.64672049589315, MAD: 8.271332555901521, MRE: 0.43732025052279566


Setting Decision Tree Regressor as base_imputation_method improved the accuracy further: MSE dropped from 126.5 to 115.6 and MAD from 8.42 to 8.27, while MRE stayed roughly the same (0.430 vs 0.437). This combination is now the best predictor of age on all three metrics! Let's see in which situations it fails to predict age accurately:

In [15]:
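# X_train, y_train, X_test and y_test still hold the last cross-validation fold from the loop above
y_pred = family_imputer(X_train, y_train, X_test)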

large_error_rows = np.abs(y_pred - y_test) > 15
print(f'Proportion of large errors (>15 years): {sum(large_error_rows) / len(y_pred)}')
X_test_copy = X_test.copy(deep=True)
X_test_copy.loc[:, 'Age_predicted'] = y_pred
X_test_copy.loc[large_error_rows, ['Age_predicted', 'Age', 'Pclass', 'Name', 'Sex', 'SibSp', 
                                   'Parch', 'Ticket', 'Embarked', 'Fare_adjusted', 'HasCabin',
                                   'CabinType', 'Title', 'FamilySize', 'GroupSize']]

Proportion of large errors (>15 years): 0.14084507042253522


PassengerId,Age_predicted,Age,Pclass,Name,Sex,SibSp,Parch,Ticket,Embarked,Fare_adjusted,HasCabin,CabinType,Title,FamilySize,GroupSize
724,33.247525,50.0,2,"Hodges, Mr. Henry Price",male,0,0,250643,S,13.0,False,,Mr,1,1
730,9.033333,25.0,3,"Ilmakangas, Miss. Pieta Sofia",female,1,0,STON/O2. 3101271,S,3.9625,False,,Miss,2,2
732,29.458738,11.0,3,"Hassan, Mr. Houssein G N",male,0,0,2699,C,9.39375,False,,Mr,1,2
749,39.344262,19.0,1,"Marvin, Mr. Daniel Warner",male,1,0,113773,S,26.55,True,D,Mr,2,2
758,33.247525,18.0,2,"Bailey, Mr. Percy Andrew",male,0,0,29108,S,11.5,False,,Mr,1,1
772,29.458738,48.0,3,"Jensen, Mr. Niels Peder",male,0,0,350047,S,7.8542,False,,Mr,1,1
773,33.247525,57.0,2,"Mack, Mrs. (Mary)",female,0,0,S.O./P.P. 3,S,5.25,True,E,Mrs,1,2
775,33.247525,54.0,2,"Hocking, Mrs. Elizabeth (Eliza Needs)",female,1,3,29105,S,4.6,False,,Mrs,5,5
778,23.464286,5.0,3,"Emanuel, Miss. Virginia Ethel",female,0,0,364516,S,6.2375,False,,Miss,1,2
783,46.233333,29.0,1,"Long, Mr. Milton Clyde",male,0,0,113501,S,30.0,True,D,Mr,1,1


The family imputer's age prediction error was below 15 years for about 86% of the held-out rows in the last cross-validation fold. Not too bad! Seven of the ten large-error passengers have a FamilySize of 1, and four of those also have a GroupSize of 1. For passengers travelling alone there is no family/group info to use, so the imputer falls back to the default values predicted by the Decision Tree Regressor.
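The fallback behaviour can be illustrated with a small sketch. The rule below (median known age of relatives sharing Surname, Ticket and Title_grouped) is an assumption invented for this example and is not the actual logic of FamilyAgeImputer. Only the general pattern, a family-based estimate where available and the base prediction otherwise, follows the text above. All values are made up.

```python
import numpy as np
import pandas as pd


def impute_age_with_family_fallback(df, base_predictions):
    """Toy rule (an assumption, not FamilyAgeImputer's logic)."""
    keys = ['Surname', 'Ticket', 'Title_grouped']
    # The median skips NaN, so groups without any known age yield NaN.
    family_estimate = df.groupby(keys)['Age'].transform('median')

    missing = df['Age'].isna()
    used_family = missing & family_estimate.notna()
    used_base = missing & ~used_family

    imputed = df['Age'].copy()
    imputed[used_family] = family_estimate[used_family]
    imputed[used_base] = base_predictions[used_base]
    return imputed, used_family


toy = pd.DataFrame({
    'Surname': ['Smith', 'Smith', 'Smith', 'Jones'],
    'Ticket': ['T1', 'T1', 'T1', 'T2'],
    'Title_grouped': ['Miss', 'Miss', 'Miss', 'Mr'],
    'Age': [4.0, 8.0, np.nan, np.nan],
}, index=[1, 2, 3, 4])

# Stand-in for the base model's predictions (e.g. from a regressor).
base = pd.Series([22.0, 22.0, 22.0, 29.0], index=toy.index)

toy_ages, toy_used_family = impute_age_with_family_fallback(toy, base)
print(toy_ages)
```

The passenger travelling alone (index 4) has no relatives to draw on and receives the base prediction, which mirrors the large-error cases in the table above.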

Finally, we can create a general imputer that imputes all specified columns with column-specific methods. We make it a scikit-learn pipeline component so that it can easily be used with different train/test datasets. To check that this general imputer works, we apply it to the data in raw_data_all_features.csv, which has missing values in both the Age and Embarked columns (the Cabin column does not need imputing):

In [16]:
all_feature_data = pd.read_csv(_ALL_FEATURE_DATA, index_col='PassengerId')
print(all_feature_data.isna().sum())

Survived                0
Pclass                  0
Name                    0
Sex                     0
Age                   177
SibSp                   0
Parch                   0
Ticket                  0
Fare                    0
Cabin                 687
Embarked                2
HasCabin                0
CabinType               0
Surname                 0
Title                   0
Title_grouped           0
FamilySize              0
FamilySize_grouped      0
GroupSize               0
GroupSize_grouped       0
GroupRate               0
GroupRate_grouped       0
FamilyRate              0
FamilyRate_grouped      0
Fare_adjusted           0
dtype: int64


The general imputer can be found in imputation_utils.py. It takes a dictionary with columns as keys and imputation methods as values, and applies the imputations to all specified columns. We use FamilyAgeImputer with DecisionTreeRegressor as the base method for the Age column and GroupByImputer for the other columns (the test data has a missing Fare value, so we specify an imputation method for Fare as well):
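The dictionary-driven design can be sketched as follows. DictImputer is a hypothetical, simplified stand-in for Imputer: here each dictionary value is a plain function rather than a fitted imputer object, and the bookkeeping of imputed rows as boolean masks is an assumption modelled on how get_last_imputed_values() is used later in this notebook. The toy values are made up.

```python
import numpy as np
import pandas as pd


class DictImputer:
    """Hypothetical stand-in for Imputer: maps column name -> fill function."""

    def __init__(self, col_imputation_methods):
        self.col_imputation_methods = col_imputation_methods
        self.last_imputed_ = {}

    def fit_transform(self, X):
        X = X.copy()
        self.last_imputed_ = {}
        for col, fill_func in self.col_imputation_methods.items():
            mask = X[col].isna()
            # fill_func receives the whole frame and returns one value per row.
            X.loc[mask, col] = fill_func(X)[mask]
            self.last_imputed_[col] = mask
        return X

    def get_last_imputed_values(self):
        return self.last_imputed_


def mode_by_pclass(df):
    # Most frequent port within each passenger class.
    modes = df.groupby('Pclass')['Embarked'].agg(lambda s: s.mode()[0])
    return df['Pclass'].map(modes)


def median_fare_by_pclass(df):
    return df['Pclass'].map(df.groupby('Pclass')['Fare'].median())


toy = pd.DataFrame({
    'Pclass': [1, 1, 1, 3, 3, 3],
    'Embarked': ['S', 'S', np.nan, 'Q', 'Q', 'C'],
    'Fare': [80.0, 60.0, 70.0, 8.0, np.nan, 10.0],
})

toy_imputer = DictImputer({'Embarked': mode_by_pclass, 'Fare': median_fare_by_pclass})
toy_imputed = toy_imputer.fit_transform(toy)
toy_masks = toy_imputer.get_last_imputed_values()
print(toy_imputed[toy_masks['Embarked'] | toy_masks['Fare']])
```

Keeping one mask per column makes it easy to inspect afterwards exactly which rows were filled in and with what values.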

In [17]:
# Age imputation parameters
numerical = []
ordinal = []
categorical = ['Title_grouped', 'Pclass', 'FamilySize_grouped'] 

preprocessing_params = {'categorical_cols': categorical, 'numerical_cols': numerical, 'ordinal_cols': ordinal}

age_imputation_base_method = Pipeline(steps=[
    ('preprocessing', Preprocessor(**preprocessing_params)),
    ('regressor', DecisionTreeRegressor(max_depth=5))
    ])

# Define imputation methods for other columns as well
col_imputations = {
                  'Embarked': GroupByImputer('Embarked', ['Pclass'], lambda x: pd.Series.mode(x)[0]),
                  'Fare': GroupByImputer('Fare', ['Pclass', 'HasCabin'], pd.Series.median),
                  'Age': FamilyAgeImputer(base_imputation_method=age_imputation_base_method)
                  }
imputer = Imputer(col_imputation_methods=col_imputations)
imputed_data = imputer.fit_transform(all_feature_data)
print(f'Missing values in imputed data: \n{imputed_data.isna().sum()}')

Missing values in imputed data: 
Survived                0
Pclass                  0
Name                    0
Sex                     0
Age                     0
SibSp                   0
Parch                   0
Ticket                  0
Fare                    0
Cabin                 687
Embarked                0
HasCabin                0
CabinType               0
Surname                 0
Title                   0
Title_grouped           0
FamilySize              0
FamilySize_grouped      0
GroupSize               0
GroupSize_grouped       0
GroupRate               0
GroupRate_grouped       0
FamilyRate              0
FamilyRate_grouped      0
Fare_adjusted           0
dtype: int64


All missing Embarked and Age values are now imputed. We can also check which rows were imputed using the get_last_imputed_values() method of Imputer:

In [18]:
imputed_rows = imputer.get_last_imputed_values()
print('Rows that have imputed Age values:\n')
imputed_data[imputed_rows['Age']]

Rows that have imputed Age values:



PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,...,Title_grouped,FamilySize,FamilySize_grouped,GroupSize,GroupSize_grouped,GroupRate,GroupRate_grouped,FamilyRate,FamilyRate_grouped,Fare_adjusted
6,0,3,"Moran, Mr. James",male,28.724891,0,0,330877,8.4583,,...,Mr,1,0,1,0,0.188908,0,0.333333,0,8.458300
18,1,2,"Williams, Mr. Charles Eugene",male,33.119048,0,0,244373,13.0000,,...,Mr,1,0,1,0,0.188908,0,0.250000,0,13.000000
20,1,3,"Masselmani, Mrs. Fatima",female,33.515152,0,0,2649,7.2250,,...,Mrs,1,0,1,0,0.742038,1,0.742038,1,7.225000
27,0,3,"Emir, Mr. Farred Chehab",male,28.724891,0,0,2631,7.2250,,...,Mr,1,0,1,0,0.188908,0,0.188908,0,7.225000
29,1,3,"O'Dwyer, Miss. Ellen ""Nellie""",female,22.263889,0,0,330959,7.8792,,...,Miss,1,0,1,0,0.742038,1,0.742038,1,7.879200
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
860,0,3,"Razi, Mr. Raihed",male,28.724891,0,0,2629,7.2292,,...,Mr,1,0,1,0,0.188908,0,0.188908,0,7.229200
864,0,3,"Sage, Miss. Dorothy Edith ""Dolly""",female,2.874444,8,2,CA. 2343,69.5500,,...,Miss,11,2,11,2,0.000000,0,0.000000,0,6.322727
869,0,3,"van Melkebeke, Mr. Philemon",male,28.724891,0,0,345777,9.5000,,...,Mr,1,0,1,0,0.188908,0,0.188908,0,9.500000
879,0,3,"Laleff, Mr. Kristo",male,28.724891,0,0,349217,7.8958,,...,Mr,1,0,1,0,0.188908,0,0.188908,0,7.895800


In [19]:
print('Rows that have imputed Embarked values:\n')
imputed_data[['Name', 'Embarked', 'Pclass', 'Age', 'Fare']][imputed_rows['Embarked']]

Rows that have imputed Embarked values:



PassengerId,Name,Embarked,Pclass,Age,Fare
62,"Icard, Miss. Amelie",S,1,38.0,80.0
830,"Stone, Mrs. George Nelson (Martha Evelyn)",S,1,62.0,80.0


The imputed values look reasonable, so for now we are happy with our imputer pipeline component! Let's store the imputed data in the data directory.

In [20]:
imputed_data.to_csv(os.path.join(_DATADIR, 'imputed_data_all_features.csv'))