## Train an imputer to iteratively impute the conservation scores using all features. Compare the performance of simple imputers (median and mean imputation) and the iterative imputer. Test the influence of the imputer on the RF model by doing the following:
1. Fit imputer on original dataset.
2. Test the imputer's influence on the following datasets:
    - Assess RF performance on the unimputed dataset
    - Assess RF performance by training the model on an imputed training set and applying the trained model to the unimputed test set
        - Do train-test split.
        - Create missing values in the training set following the fraction of missing values in each conservation type.
        - Transform the training set with the fitted imputer.
    - Assess RF performance by training the model on an imputed training set and applying the trained model to an imputed test set
        - Do train-test split.
        - Fit the imputer on the training set.
        - Create missing values in the training and test sets following the fraction of missing values in each conservation type.
        - Transform the training set and test set with the fitted imputer.
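
The last protocol in the list (fit on the training set, mask both sets, impute both, then train and score the RF) could be wrapped in one helper along the following lines. This is only a sketch: `mask_columns`, `evaluate_imputer`, the synthetic data from `make_classification` and the column names `f0`..`f5` are illustrative assumptions, not part of the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def mask_columns(X, columns, frac, rng):
    """Return a copy of X with roughly `frac` of the rows set to NaN in `columns`."""
    X_mv = X.copy()
    rows = rng.random(len(X_mv)) < frac
    X_mv.loc[rows, columns] = np.nan
    return X_mv


def evaluate_imputer(imputer, X, y, columns, frac, seed=0):
    """Fit the imputer on the clean training set, mask and impute both sets, score an RF."""
    rng = np.random.default_rng(seed)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, stratify=y, random_state=seed
    )
    imputer.fit(X_train)
    X_train_imp = imputer.transform(mask_columns(X_train, columns, frac, rng))
    X_test_imp = imputer.transform(mask_columns(X_test, columns, frac, rng))
    rf = RandomForestClassifier(n_estimators=100, random_state=0)
    rf.fit(X_train_imp, y_train)
    return accuracy_score(y_test, rf.predict(X_test_imp))


# Synthetic stand-in for the real feature table (hypothetical data).
X_arr, y_demo = make_classification(n_samples=400, n_features=6, random_state=0)
X_demo = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(6)])
acc = evaluate_imputer(SimpleImputer(strategy="median"), X_demo, y_demo, ["f0", "f1"], 0.05)
print(f"accuracy with median imputation: {acc:.2f}")
```

Passing a different imputer object (for example an `IterativeImputer`) to the same helper keeps the split, masking and RF settings identical across methods.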

In [26]:
import pandas as pd
import sklearn
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
# plot_confusion_matrix, plot_roc_curve and plot_precision_recall_curve were removed in scikit-learn 1.2 and are not used below
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, matthews_corrcoef, cohen_kappa_score, f1_score
from sklearn.model_selection import cross_val_score, GridSearchCV
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.impute import SimpleImputer

In [27]:
all_features= ['Probability', 'IUPredShort', 'Anchor', 'DomainOverlap', 'qfo_RLC', 'qfo_RLCvar', 'vertebrates_RLC',
               'vertebrates_RLCvar', 'mammalia_RLC', 'mammalia_RLCvar', 'metazoa_RLC', 'metazoa_RLCvar',
               'DomainEnrichment_pvalue', 'DomainEnrichment_zscore', 'DomainFreqbyProtein1', 'DomainFreqinProteome1']
all_features_renamed= ['Probability', 'IUPredShort', 'Anchor', 'DomainOverlap', 'qfo_RLC', 'qfo_RLCvar',
                       'vertebrates_RLC', 'vertebrates_RLCvar', 'mammalia_RLC', 'mammalia_RLCvar', 'metazoa_RLC',
                       'metazoa_RLCvar', 'DomainEnrichment_pvalue', 'DomainEnrichment_zscore', 'DomainFreqbyProtein',
                       'DomainFreqinProteome']
cons_features= ['qfo_RLC', 'qfo_RLCvar', 'vertebrates_RLC', 'vertebrates_RLCvar', 'mammalia_RLC',
                                 'mammalia_RLCvar', 'metazoa_RLC', 'metazoa_RLCvar']
def preprocessing_dataset(PRS_input, RRS_input): # reads the PRS and RRS, labels them 1 and 0, replaces the dummy value 88888, concatenates them and drops rows with NaNs in any feature.
    PRS= pd.read_csv(PRS_input, sep= '\t', index_col= 0)
    PRS['label']= 1
    RRS= pd.read_csv(RRS_input, sep= '\t', index_col= 0)
    RRS['label']= 0
    for df in [PRS, RRS]:
        df.replace(88888, df.DomainEnrichment_zscore.median(), inplace= True)
        for ind, row in df.iterrows():
            if pd.notna(row['DomainFreqbyProtein2']):
                df.loc[ind, 'DomainFreqbyProtein1'] = np.mean([row['DomainFreqbyProtein1'], row['DomainFreqbyProtein2']])
                df.loc[ind, 'DomainFreqinProteome1'] = np.mean([row['DomainFreqinProteome1'], row['DomainFreqinProteome2']])
    df= pd.concat([PRS, RRS], axis= 0, ignore_index= True)
    df.dropna(subset= all_features, inplace= True)
    df.rename(columns= {'DomainFreqbyProtein1': 'DomainFreqbyProtein', 'DomainFreqinProteome1': 'DomainFreqinProteome'}, inplace= True)
    X= df[all_features_renamed].copy()
    y= df['label']
    return df, X, y
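
The row-by-row averaging of the two domain frequency columns could also be written as a vectorised row mean. The sketch below uses a small made-up table, not the real data. One difference is worth noting: `DataFrame.mean(axis=1)` skips NaN, so a row whose first value is missing but whose second is present would take the second value, whereas the loop above would leave NaN there.

```python
import numpy as np
import pandas as pd

# Toy values for illustration only (hypothetical data).
toy = pd.DataFrame({
    "DomainFreqbyProtein1": [0.2, 0.4, 0.6],
    "DomainFreqbyProtein2": [0.4, np.nan, 0.2],
})

# Row-wise mean that skips NaN: rows without a second domain keep their original value.
toy["DomainFreqbyProtein1"] = toy[["DomainFreqbyProtein1", "DomainFreqbyProtein2"]].mean(axis=1)
print(toy["DomainFreqbyProtein1"].tolist())
```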

In [28]:
PRS_file= '/Users/chopyanlee/Coding/Python/DMI/PRS/PRS_v3_only_human_with_pattern_alt_iso_swapped_removed_20210413_slim_domain_features_annotated.tsv'
RRS_file= '/Users/chopyanlee/Coding/Python/DMI/RRS/RRSv4/RRSv4_3_20210428_slim_domain_features_annotated.tsv'
df, X, y= preprocessing_dataset(PRS_file, RRS_file)

In [29]:
df.loc[df['label'] == 0]

Unnamed: 0,Accession,Elm,Regex,Pattern,Probability,interactorElm,ElmMatch,IUPredLong,IUPredShort,Anchor,...,DomainMatchEvalue1,DomainFreqbyProtein,DomainFreqinProteome,DomainID2,DomainMatch2,DomainMatchEvalue2,DomainFreqbyProtein2,DomainFreqinProteome2,DMISource,label
898,ELME000232,DEG_APCC_KENBOX_2,.KEN.,cRKENLm,0.000184,P49888,84-88,0.113878,0.151720,0.168301,...,230.0|0.0033|3e-06|1.3|0.023|7e-09,0.012993,0.081438,,,,,,,0
899,ELME000197,LIG_BRCT_BRCA1_1,.(S)..F,tYSGQFv,0.001912,P49281,392-396,0.045085,0.051060,0.012547,...,2.3e-17|7.9e-07,0.000883,0.001814,,,,,,,0
900,ELME000428,MOD_CDK_SPxxK_3,...([ST])P..[RK],kTTTTPGRKp,0.001929,P17096,75-82,0.961519,0.857200,0.519405,...,9.1e-107,0.017209,0.017651,,,,,,,0
901,ELME000053,MOD_GSK3_1,...([ST])...[ST],eTTSTTTTTh,0.026787,Q9Y2J2,1013-1020,0.715710,0.511750,0.236566,...,5.3e-99,0.017209,0.017651,,,,,,,0
902,ELME000063,MOD_CK1_1,S..([ST])...,gSIITKCSi,0.017041,Q13895,310-316,0.016548,0.019414,0.005109,...,7.1e-69,0.017013,0.017896,,,,,,,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1893,ELME000443,MOD_Plk_2-3,[DE]..([ST])[EDILMVFWY](([DE].)|(.[DE])),eECGTDEYc,0.002175,O94907,90-96,0.299008,0.217457,0.119911,...,8.6e-77,0.017209,0.017651,,,,,,,0
1894,ELME000093,MOD_Cter_Amidation,(.)G[RK][RK],gRGRRr,0.001155,Q6RFH8,18-21,0.593735,0.516950,0.436073,...,9.8e-46,0.000196,0.000196,,,,,,,0
1895,ELME000388,DEG_SPOP_SBC_1,[AVP].[ST][ST][ST],pPVTTSs,0.000938,Q5T5P2,1114-1118,0.924984,0.793780,0.693042,...,2.6e-13,0.000588,0.000588,,,,,,,0
1896,ELME000239,DOC_USP7_MATH_1,[PA][^P][^FYWIL]S[^P],gAFNSKq,0.012388,Q93096,140-144,0.117379,0.204080,0.268205,...,3.4e-15,0.000588,0.000588,,,,,,,0


In [30]:
PRS= pd.read_csv('/Users/chopyanlee/Coding/Python/DMI/PRS/PRS_v3_only_human_with_pattern_alt_iso_swapped_removed_20210413_slim_domain_features_annotated.tsv', sep= '\t', index_col= 0)
PRS.isnull().sum()

Accession                               0
Elm                                     0
Regex                                   0
Pattern                                 0
Probability                             0
interactorElm                           0
ElmMatch                                0
IUPredLong                              0
IUPredShort                             0
Anchor                                  0
DomainOverlap                           0
qfo_RLC                                 1
qfo_RLCvar                              1
vertebrates_RLC                         0
vertebrates_RLCvar                      0
mammalia_RLC                            0
mammalia_RLCvar                         0
metazoa_RLC                            10
metazoa_RLCvar                         10
DomainEnrichment_pvalue                57
DomainEnrichment_zscore                57
TotalNetworkDegree                     57
vertex_with_domain_in_real_network     57
interactorDomain                  

Rows with missing values:
- in PRS = 898 - 830 = 68
- in RRS = 1000 - 984 = 16

In [31]:
X.reset_index(drop= True, inplace= True)
X_mv= X.copy()
X_mv

Unnamed: 0,Probability,IUPredShort,Anchor,DomainOverlap,qfo_RLC,qfo_RLCvar,vertebrates_RLC,vertebrates_RLCvar,mammalia_RLC,mammalia_RLCvar,metazoa_RLC,metazoa_RLCvar,DomainEnrichment_pvalue,DomainEnrichment_zscore,DomainFreqbyProtein,DomainFreqinProteome
0,0.000341,0.517880,0.556591,0.00,0.886889,0.214219,0.390192,0.258963,0.452688,0.230275,0.024562,0.230875,1.000,-0.044766,0.000049,0.000049
1,0.000341,0.328600,0.476941,0.00,0.387889,0.259872,0.045674,0.260415,0.069223,0.238319,0.003646,0.122818,1.000,-0.044766,0.000049,0.000049
2,0.000341,0.685640,0.916987,0.00,0.254642,0.211103,0.076006,0.250091,0.075856,0.241229,0.137134,0.122282,1.000,-0.044766,0.000049,0.000049
3,0.000341,0.577780,0.642643,0.00,0.029903,0.243912,0.008157,0.267088,0.127182,0.145867,0.001037,0.196769,1.000,-0.044766,0.000049,0.000049
4,0.000341,0.608380,0.558934,0.00,0.002668,0.303413,0.001596,0.309179,0.136781,0.297461,0.000706,0.301185,1.000,-0.044766,0.000049,0.000049
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1809,0.002175,0.217457,0.119911,1.00,0.396973,0.322000,0.453741,0.253503,0.856074,0.131559,0.471393,0.198536,1.000,-0.377436,0.017209,0.017651
1810,0.001155,0.516950,0.436073,0.75,0.106577,0.339308,0.696627,0.186380,0.241627,0.278057,0.093427,0.229418,1.000,-0.031639,0.000196,0.000196
1811,0.000938,0.793780,0.693042,0.00,0.958176,0.179219,0.572231,0.271791,0.967641,0.120793,0.687974,0.265692,0.004,5.987147,0.000588,0.000588
1812,0.012388,0.204080,0.268205,0.40,0.112624,0.440174,0.260990,0.007590,0.234920,0.007290,0.376097,0.369990,1.000,-0.169725,0.000588,0.000588


In [32]:
X_mv.shape[0]

1814

Fraction of each conservation that needs imputation: {'metazoa': 0.006, 'qfo': 0.034, 'vertebrates': 0.008, 'mammalia': 0.004, 'network': 0.327}
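
The notebook does not show how these fractions were obtained. One way to compute per-column missing fractions, assuming a DataFrame that still contains the missing values, is `isna().mean()`; the table below is a made-up example, not the real data.

```python
import numpy as np
import pandas as pd

# Toy table standing in for the unfiltered dataset (hypothetical values).
raw = pd.DataFrame({
    "qfo_RLC": [0.1, np.nan, 0.3, 0.4],
    "qfo_RLCvar": [0.2, np.nan, 0.1, 0.3],
    "metazoa_RLC": [0.5, 0.6, 0.7, 0.8],
    "metazoa_RLCvar": [0.1, 0.1, 0.2, 0.2],
})

# isna().mean() gives the fraction of missing entries per column.
missing_fraction = raw.isna().mean()
print(missing_fraction)
```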

In [33]:
# Fraction of rows to mask for each conservation level (from the missing-value fractions above)
mask_probs= {'metazoa': 0.006, 'qfo': 0.034, 'vertebrates': 0.008, 'mammalia': 0.004}

def mask_X(X):
    # Randomly set the RLC and RLCvar of each conservation level to NaN, masking both columns in the same rows.
    X_mv= X.copy()
    for level, prob in mask_probs.items():
        na= np.random.choice([True, False], size= X_mv.shape[0], p= [prob, 1 - prob])
        for col in [level + '_RLC', level + '_RLCvar']:
            X_mv[col]= X_mv[col].mask(na)
    return X_mv

## Median imputation

In [136]:
# Assess RF performance by training the model on the imputed training set and applying it to the unimputed test set

# do train test split and mask only the training set; median_imp is fitted on and applied to the training set only.
X_train, X_test, y_train, y_test= train_test_split(X, y, stratify= y)
X_train_mv= mask_X(X_train)

# impute on the masked training set and train RF on this imputed training set
median_imp= SimpleImputer(strategy= 'median')
X_train_mv_transformed= median_imp.fit_transform(X_train_mv)
rf= RandomForestClassifier(n_estimators= 1000, random_state= 0, oob_score= True)
rf.fit(X_train_mv_transformed, y_train)

# apply trained model on unimputed test set
print(classification_report(y_test, rf.predict(X_test)))
print(rf.oob_score_)
print(median_imp.statistics_[4:12])

              precision    recall  f1-score   support

           0       0.89      0.89      0.89       246
           1       0.87      0.87      0.87       208

    accuracy                           0.88       454
   macro avg       0.88      0.88      0.88       454
weighted avg       0.88      0.88      0.88       454

0.8764705882352941
[0.33245355 0.14934773 0.27207018 0.13469302 0.36021859 0.10897926
 0.28278625 0.14661517]


In [137]:
# Assess RF performance by training the model on the imputed training set and applying it to the imputed test set

# do train test split
X_train, X_test, y_train, y_test= train_test_split(X, y, stratify= y)

# fit imputer on the unmasked training set
median_imp= SimpleImputer(strategy= 'median')
median_imp.fit(X_train)

# mask training and test set
X_train_mv= mask_X(X_train)
X_test_mv= mask_X(X_test)

# impute the masked training and test sets with the imputer fitted on the training set
X_train_mv_transformed= median_imp.transform(X_train_mv)
X_test_mv_transformed= median_imp.transform(X_test_mv)

# train RF on this imputed training set
rf= RandomForestClassifier(n_estimators= 1000, random_state= 0, oob_score= True)
rf.fit(X_train_mv_transformed, y_train)

# apply trained model on the imputed test set
print(classification_report(y_test, rf.predict(X_test_mv_transformed)))
print(rf.oob_score_)
print(median_imp.statistics_[4:12])

              precision    recall  f1-score   support

           0       0.87      0.90      0.89       246
           1       0.88      0.85      0.86       208

    accuracy                           0.87       454
   macro avg       0.87      0.87      0.87       454
weighted avg       0.87      0.87      0.87       454

0.8845588235294117
[0.32623562 0.14991629 0.27417547 0.13379616 0.35295572 0.10882557
 0.27978777 0.14612682]


In [138]:
# What happens if all conservation scores (columns 4-11) in the test set are set to np.nan and imputed with the median imputer?
X_test= X_test.copy()  # explicit copy so the assignment below does not trigger SettingWithCopyWarning
X_test.iloc[:,4:12]= np.nan
X_test_mv_transformed= median_imp.transform(X_test)
print(classification_report(y_test, rf.predict(X_test_mv_transformed)))

              precision    recall  f1-score   support

           0       0.82      0.92      0.87       246
           1       0.89      0.76      0.82       208

    accuracy                           0.85       454
   macro avg       0.86      0.84      0.84       454
weighted avg       0.85      0.85      0.85       454


In [139]:
pd.DataFrame(data= {'Features': all_features_renamed, 'Importance': rf.feature_importances_}).sort_values(by= 'Importance', ascending= False)

Unnamed: 0,Features,Importance
1,IUPredShort,0.139997
10,metazoa_RLC,0.131876
6,vertebrates_RLC,0.096043
0,Probability,0.091823
2,Anchor,0.076732
13,DomainEnrichment_zscore,0.07638
15,DomainFreqinProteome,0.069431
14,DomainFreqbyProtein,0.056691
4,qfo_RLC,0.045041
12,DomainEnrichment_pvalue,0.03825


## Mean imputation

In [140]:
# Assess RF performance by training the model on the imputed training set and applying it to the unimputed test set

# do train test split and mask only the training set; mean_imp is fitted on and applied to the training set only.
X_train, X_test, y_train, y_test= train_test_split(X, y, stratify= y)
X_train_mv= mask_X(X_train)

# impute on the masked training set and train RF on this imputed training set
mean_imp= SimpleImputer(strategy= 'mean')
X_train_mv_transformed= mean_imp.fit_transform(X_train_mv)
rf= RandomForestClassifier(n_estimators= 1000, random_state= 0, oob_score= True)
rf.fit(X_train_mv_transformed, y_train)

# apply trained model on unimputed test set
print(classification_report(y_test, rf.predict(X_test)))
print(rf.oob_score_)
print(mean_imp.statistics_[4:12])

              precision    recall  f1-score   support

           0       0.88      0.90      0.89       246
           1       0.88      0.85      0.87       208

    accuracy                           0.88       454
   macro avg       0.88      0.88      0.88       454
weighted avg       0.88      0.88      0.88       454

0.8764705882352941
[0.39411564 0.15435413 0.3628692  0.14248946 0.41926298 0.13379158
 0.38406375 0.14751687]


In [141]:
# Assess RF performance by training the model on the imputed training set and applying it to the imputed test set

# do train test split
X_train, X_test, y_train, y_test= train_test_split(X, y, stratify= y)

# fit imputer on the unmasked training set
mean_imp= SimpleImputer(strategy= 'mean')
mean_imp.fit(X_train)

# mask training and test set
X_train_mv= mask_X(X_train)
X_test_mv= mask_X(X_test)

# impute the masked training and test sets with the imputer fitted on the training set
X_train_mv_transformed= mean_imp.transform(X_train_mv)
X_test_mv_transformed= mean_imp.transform(X_test_mv)

# train RF on this imputed training set
rf= RandomForestClassifier(n_estimators= 1000, random_state= 0, oob_score= True)
rf.fit(X_train_mv_transformed, y_train)

# apply trained model on the imputed test set
print(classification_report(y_test, rf.predict(X_test_mv_transformed)))
print(rf.oob_score_)
print(mean_imp.statistics_[4:12])

              precision    recall  f1-score   support

           0       0.89      0.88      0.89       246
           1       0.86      0.88      0.87       208

    accuracy                           0.88       454
   macro avg       0.88      0.88      0.88       454
weighted avg       0.88      0.88      0.88       454

0.8757352941176471
[0.39800029 0.1537958  0.36406186 0.14536145 0.42342329 0.13487377
 0.39234071 0.15032072]


In [142]:
# What happens if all conservation scores (columns 4-11) in the test set are set to np.nan and imputed with the mean imputer?
X_test= X_test.copy()  # explicit copy so the assignment below does not trigger SettingWithCopyWarning
X_test.iloc[:,4:12]= np.nan
X_test_mv_transformed= mean_imp.transform(X_test)
print(classification_report(y_test, rf.predict(X_test_mv_transformed)))

              precision    recall  f1-score   support

           0       0.81      0.90      0.85       246
           1       0.87      0.75      0.80       208

    accuracy                           0.83       454
   macro avg       0.84      0.83      0.83       454
weighted avg       0.84      0.83      0.83       454


In [143]:
pd.DataFrame(data= {'Features': all_features_renamed, 'Importance': rf.feature_importances_}).sort_values(by= 'Importance', ascending= False)

Unnamed: 0,Features,Importance
10,metazoa_RLC,0.129231
1,IUPredShort,0.124095
0,Probability,0.10287
6,vertebrates_RLC,0.09669
15,DomainFreqinProteome,0.082416
2,Anchor,0.071003
13,DomainEnrichment_zscore,0.070497
14,DomainFreqbyProtein,0.061654
4,qfo_RLC,0.048055
12,DomainEnrichment_pvalue,0.037745


## Iterative Imputation

In [144]:
# Assess RF performance by training the model on the imputed training set and applying it to the unimputed test set

# do train test split and mask only the training set; ite_imp is fitted on and applied to the training set only.
X_train, X_test, y_train, y_test= train_test_split(X, y, stratify= y)
X_train_mv= mask_X(X_train)

# impute on the masked training set and train RF on this imputed training set
ite_imp= IterativeImputer(random_state=0 , initial_strategy= 'median')
X_train_mv_transformed= ite_imp.fit_transform(X_train_mv)
rf= RandomForestClassifier(n_estimators= 1000, random_state= 0, oob_score= True)
rf.fit(X_train_mv_transformed, y_train)

# apply trained model on unimputed test set
print(classification_report(y_test, rf.predict(X_test)))
print(rf.oob_score_)

              precision    recall  f1-score   support

           0       0.88      0.93      0.90       246
           1       0.91      0.85      0.88       208

    accuracy                           0.89       454
   macro avg       0.89      0.89      0.89       454
weighted avg       0.89      0.89      0.89       454

0.8764705882352941


In [145]:
# Assess RF performance by training the model on the imputed training set and applying it to the imputed test set

# do train test split
X_train, X_test, y_train, y_test= train_test_split(X, y, stratify= y)

# fit imputer on the unmasked training set
ite_imp= IterativeImputer(random_state=0 , initial_strategy= 'median')
ite_imp.fit(X_train)

# mask training and test set
X_train_mv= mask_X(X_train)
X_test_mv= mask_X(X_test)

# impute the masked training and test sets with the imputer fitted on the training set
X_train_mv_transformed= ite_imp.transform(X_train_mv)
X_test_mv_transformed= ite_imp.transform(X_test_mv)

# train RF on this imputed training set
rf= RandomForestClassifier(n_estimators= 1000, random_state= 0, oob_score= True)
rf.fit(X_train_mv_transformed, y_train)

# apply trained model on the imputed test set
print(classification_report(y_test, rf.predict(X_test_mv_transformed)))
print(rf.oob_score_)

              precision    recall  f1-score   support

           0       0.90      0.88      0.89       246
           1       0.86      0.88      0.87       208

    accuracy                           0.88       454
   macro avg       0.88      0.88      0.88       454
weighted avg       0.88      0.88      0.88       454

0.8816176470588235


In [146]:
# What happens if all conservation scores (columns 4-11) in the test set are set to np.nan and imputed with the iterative imputer?
X_test= X_test.copy()  # explicit copy so the assignment below does not trigger SettingWithCopyWarning
X_test.iloc[:,4:12]= np.nan
X_test_mv_transformed= ite_imp.transform(X_test)
print(classification_report(y_test, rf.predict(X_test_mv_transformed)))

              precision    recall  f1-score   support

           0       0.87      0.89      0.88       246
           1       0.87      0.85      0.86       208

    accuracy                           0.87       454
   macro avg       0.87      0.87      0.87       454
weighted avg       0.87      0.87      0.87       454


In [147]:
pd.DataFrame(data= {'Features': all_features_renamed, 'Importance': rf.feature_importances_}).sort_values(by= 'Importance', ascending= False)

Unnamed: 0,Features,Importance
10,metazoa_RLC,0.133653
1,IUPredShort,0.121655
6,vertebrates_RLC,0.101968
13,DomainEnrichment_zscore,0.084273
0,Probability,0.079723
15,DomainFreqinProteome,0.079096
2,Anchor,0.07788
14,DomainFreqbyProtein,0.060387
4,qfo_RLC,0.045761
12,DomainEnrichment_pvalue,0.039548


## Even though the RF relies heavily on two RLC scores (metazoa_RLC and vertebrates_RLC rank among the four most important features in every run), masking all RLC scores in the test set and imputing them with medians from the unmasked training set costs little accuracy (0.87 to 0.85). Imputation hurts sensitivity more than specificity: with median imputation, recall for class 1 falls from 0.85 to 0.76 while recall for class 0 does not drop (0.90 to 0.92). Across the three imputation methods the differences are small, and since median imputation is the most straightforward way to fill in a missing feature, I prefer it. My interpretation is that imputing a missing value with the median makes it 'look' like the majority to the algorithm, which forces the algorithm to rely on the other features to decide the class of the data point.
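
Each number above comes from a single random split, so one way to judge whether the three imputers really differ is to repeat the masked-test-set experiment over several seeds and compare the spread. The sketch below does this on synthetic data; `masked_test_accuracy`, `masked_cols`, the five seeds and the dataset are illustrative assumptions, not the notebook's setup.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, SimpleImputer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real features (hypothetical data).
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)
masked_cols = [0, 1]  # columns blanked out in the test set


def masked_test_accuracy(imputer, seed):
    """Train on a clean split, blank `masked_cols` in the test set, impute, and score."""
    X_train, X_test, y_train, y_test = train_test_split(
        X_demo, y_demo, stratify=y_demo, random_state=seed
    )
    imputer.fit(X_train)
    rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)
    X_test = X_test.copy()
    X_test[:, masked_cols] = np.nan
    return accuracy_score(y_test, rf.predict(imputer.transform(X_test)))


imputers = {
    "median": lambda: SimpleImputer(strategy="median"),
    "mean": lambda: SimpleImputer(strategy="mean"),
    "iterative": lambda: IterativeImputer(random_state=0, initial_strategy="median"),
}
results = {
    name: [masked_test_accuracy(make(), seed) for seed in range(5)]
    for name, make in imputers.items()
}
for name, scores in results.items():
    print(f"{name}: mean={np.mean(scores):.3f}, sd={np.std(scores):.3f}")
```

If the gap between methods is smaller than the standard deviation across seeds, preferring the simplest imputer is easy to justify.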

## Since imputation gives encouraging results, try masking and imputing the domain enrichment p-value and z-score as well.

In [151]:
# What happens if the conservation scores and the domain enrichment p-value and z-score (columns 4-13) in the test set are set to np.nan and imputed with the median imputer?
# rf, X_test and y_test are reused from the cells above.
X_test= X_test.copy()  # explicit copy so the assignment below does not trigger SettingWithCopyWarning
X_test.iloc[:,4:14]= np.nan
# X_test.iloc[:, 13]= np.nan
X_test_mv_transformed= median_imp.transform(X_test)
print(classification_report(y_test, rf.predict(X_test_mv_transformed)))

              precision    recall  f1-score   support

           0       0.80      0.92      0.86       246
           1       0.89      0.73      0.80       208

    accuracy                           0.83       454
   macro avg       0.85      0.83      0.83       454
weighted avg       0.84      0.83      0.83       454


In [152]:
# What happens if the conservation scores and the domain enrichment p-value and z-score (columns 4-13) in the test set are set to np.nan and imputed with the mean imputer?
X_test= X_test.copy()  # explicit copy so the assignment below does not trigger SettingWithCopyWarning
X_test.iloc[:,4:14]= np.nan
# X_test.iloc[:, 13]= np.nan
X_test_mv_transformed= mean_imp.transform(X_test)
print(classification_report(y_test, rf.predict(X_test_mv_transformed)))

              precision    recall  f1-score   support

           0       0.83      0.89      0.86       246
           1       0.86      0.78      0.82       208

    accuracy                           0.84       454
   macro avg       0.84      0.84      0.84       454
weighted avg       0.84      0.84      0.84       454


In [153]:
# What happens if the conservation scores and the domain enrichment p-value and z-score (columns 4-13) in the test set are set to np.nan and imputed with the iterative imputer?
X_test= X_test.copy()  # explicit copy so the assignment below does not trigger SettingWithCopyWarning
X_test.iloc[:,4:14]= np.nan
# X_test.iloc[:, 13]= np.nan
X_test_mv_transformed= ite_imp.transform(X_test)
print(classification_report(y_test, rf.predict(X_test_mv_transformed)))

              precision    recall  f1-score   support

           0       0.84      0.83      0.83       246
           1       0.80      0.82      0.81       208

    accuracy                           0.82       454
   macro avg       0.82      0.82      0.82       454
weighted avg       0.82      0.82      0.82       454


## With the conservation scores and the domain enrichment p-value and z-score all masked and imputed with the median, sensitivity drops sharply (recall for class 1 is 0.73) but overall accuracy is still 0.83. Rather than training a separate predictor without the domain enrichment features, it is therefore easier to keep one predictor and one median imputer for cases with missing values.
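
One way to package "one predictor and one median imputer" is a scikit-learn `Pipeline`, so that the same fitted medians are applied automatically at prediction time. The sketch below runs on synthetic data; the step names `impute` and `rf`, the 200/100 split and the masked columns are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

# Synthetic stand-in for the real features (hypothetical data).
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)

# The imputer learns its medians from the training data when the pipeline is fitted.
model = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
])
model.fit(X_demo[:200], y_demo[:200])

# At prediction time, rows with missing features go through the same fitted imputer.
X_new = X_demo[200:].copy()
X_new[:, 3:5] = np.nan
pred = model.predict(X_new)
print(pred[:10])
```

Saving this single pipeline object keeps the imputer and the classifier in sync, which is the practical advantage over maintaining a second model trained without the domain enrichment features.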