Here, we will explore a range of methods that we can use to understand the underlying patterns in data and to diagnose machine learning models. We are interested in how features, individually and collectively, affect the model's predictions. Here are three model-agnostic methods we use to better understand the features.

    1. Permutation Feature Importance (ranks how important each feature is, independently and intuitively).
    2. Partial Dependence Plot (the marginal contribution of each feature towards the model's prediction).
    3. Individual Conditional Expectation (explains how the prediction for an individual data point changes when a feature changes).
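
As a rough illustration of the first method, the sketch below ranks features with scikit-learn's `permutation_importance`. The synthetic data, the linear model, and the names `X`, `y`, and `ranking` are assumptions made for this example and are not part of the housing analysis that follows.

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LinearRegression

# Synthetic data (assumed for illustration): feature 0 matters most,
# feature 1 matters a little, feature 2 is pure noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = 5.0 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(scale=0.1, size=300)

model = LinearRegression().fit(X, y)

# Shuffle each column in turn and record the drop in the model's score
# (R^2 for a regressor); a bigger drop means a more important feature.
result = permutation_importance(model, X, y, n_repeats=10, random_state=0)

# Feature indices ordered from most to least important.
ranking = np.argsort(result.importances_mean)[::-1]
print(ranking, result.importances_mean.round(3))
```

The importances are computed on the training data here to keep the sketch short; a held-out set is usually a better choice because it reflects what the model relies on when generalising.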

Notes 

## PDP
- Plot the partial dependence against a feature on a graph to interpret and analyse the model.
- The partial dependence function at a particular feature value represents the average prediction if we force all data points to assume that feature value.
- It shows how the average prediction in your dataset changes when the j-th feature is changed.
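
The definition above can be sketched by hand: overwrite one feature with a fixed value for every row, predict, and average. The helper `partial_dependence_1d`, the synthetic data, and the linear model below are assumptions made for illustration only, not the model used later in this notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression


def partial_dependence_1d(model, X, feature_idx, grid):
    """Average prediction when every row is forced to each grid value of one feature."""
    averages = []
    for value in grid:
        X_forced = X.copy()
        X_forced[:, feature_idx] = value  # overwrite feature j for all rows
        averages.append(model.predict(X_forced).mean())
    return np.array(averages)


# Synthetic data (assumed for illustration, not the housing data).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

model = LinearRegression().fit(X, y)
grid = np.linspace(-2, 2, 5)
pd_values = partial_dependence_1d(model, X, feature_idx=0, grid=grid)
print(pd_values)
```

For a linear model the resulting curve is a straight line whose slope equals the coefficient of the chosen feature, which makes this a convenient sanity check before applying the same idea to a more flexible model.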

Advantages
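- Intuitive and easy to implement.
- Has a causal interpretation with respect to the model (not necessarily the real world): we intervene on a feature and measure the change in the predictions.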


Disadvantages 
- A single plot can realistically show at most two features.
- The method assumes that the features being plotted are independent of the other features.
- PD plots do not show the feature distribution. The denser the distribution, the more reliable the PDP value is at that point of the feature.
- Heterogeneous effects might be hidden because PD plots only show the average marginal effects. Suppose that, for a feature, half of your data points have a positive association with the prediction (the larger the feature value, the larger the prediction) and the other half have a negative association (the smaller the feature value, the larger the prediction). The PD curve could be a horizontal line, since the effects of the two halves of the dataset could cancel each other out. You might then wrongly conclude that the feature has no effect on the prediction. By plotting the **individual conditional expectation curves** instead of the aggregated line, we can uncover heterogeneous effects.
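
The cancellation described in the last point can be reproduced on a small synthetic example. Everything in this sketch is an assumption made for illustration: the data, the `group` column that flips the sign of the effect, and the choice of a linear model with an interaction term so that the fit is exact.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Half of the rows respond positively to x0, the other half negatively.
rng = np.random.default_rng(0)
n = 200
x0 = rng.uniform(-2, 2, size=n)
group = np.repeat([1.0, -1.0], n // 2)
X = np.column_stack([x0, group])
y = group * x0

# A linear model on [x0, group, x0 * group] can represent this target exactly.
model = make_pipeline(
    PolynomialFeatures(degree=2, interaction_only=True, include_bias=False),
    LinearRegression(),
).fit(X, y)

# ICE: one curve per row, obtained by varying x0 while keeping the row's group.
grid = np.linspace(-2, 2, 9)
ice_curves = np.empty((n, grid.size))
for k, value in enumerate(grid):
    X_forced = X.copy()
    X_forced[:, 0] = value
    ice_curves[:, k] = model.predict(X_forced)

# The PD curve is the average of the ICE curves.
pd_curve = ice_curves.mean(axis=0)
print(pd_curve.round(3))   # the averaged curve
print(ice_curves[0])       # a row from the positive half
print(ice_curves[-1])      # a row from the negative half
```

The averaged curve hides what the individual curves show: rows in the first half rise with `x0` while rows in the second half fall. In scikit-learn 1.0 or later, `PartialDependenceDisplay.from_estimator(model, X, [0], kind="both")` should draw both kinds of curve in one plot.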

In [9]:
import pandas as pd
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn import metrics, linear_model, tree, discriminant_analysis,\
                    ensemble, neural_network, inspection
import matplotlib.pyplot as plt

In [10]:
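# Load the training data
house = pd.read_csv("src/train.csv")
# Drop the Id column and five other columns, then drop rows that still contain missing values
ml_dt = house.drop(["PoolQC", "Fence", "FireplaceQu", "Alley", "MiscFeature", "Id"], axis=1)
ml_dt = ml_dt.dropna()
ml_dt.info()

# Split the columns by dtype: categorical (object) vs numerical
object_features = ml_dt.select_dtypes(include=["object"])
numerical_features = ml_dt.select_dtypes(include=["int64", "float64"])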

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1094 entries, 0 to 1459
Data columns (total 75 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   MSSubClass     1094 non-null   int64  
 1   MSZoning       1094 non-null   object 
 2   LotFrontage    1094 non-null   float64
 3   LotArea        1094 non-null   int64  
 4   Street         1094 non-null   object 
 5   LotShape       1094 non-null   object 
 6   LandContour    1094 non-null   object 
 7   Utilities      1094 non-null   object 
 8   LotConfig      1094 non-null   object 
 9   LandSlope      1094 non-null   object 
 10  Neighborhood   1094 non-null   object 
 11  Condition1     1094 non-null   object 
 12  Condition2     1094 non-null   object 
 13  BldgType       1094 non-null   object 
 14  HouseStyle     1094 non-null   object 
 15  OverallQual    1094 non-null   int64  
 16  OverallCond    1094 non-null   int64  
 17  YearBuilt      1094 non-null   int64  
 18  YearRemo

In [13]:
object_features.columns

Index(['MSZoning', 'Street', 'LotShape', 'LandContour', 'Utilities',
       'LotConfig', 'LandSlope', 'Neighborhood', 'Condition1', 'Condition2',
       'BldgType', 'HouseStyle', 'RoofStyle', 'RoofMatl', 'Exterior1st',
       'Exterior2nd', 'MasVnrType', 'ExterQual', 'ExterCond', 'Foundation',
       'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2',
       'Heating', 'HeatingQC', 'CentralAir', 'Electrical', 'KitchenQual',
       'Functional', 'GarageType', 'GarageFinish', 'GarageQual', 'GarageCond',
       'PavedDrive', 'SaleType', 'SaleCondition'],
      dtype='object')