This notebook is based entirely on what I learned from the Machine Learning courses on Kaggle and from reading the sklearn and pandas documentation.  

Here I tried to create an automated pipeline for feature preparation, which:
- removes columns with a large amount of missing data,
- imputes missing data in categorical columns and in numerical columns where only a small part of the entries is missing,
- removes numerical features that have small mutual information with the target,
- performs One Hot Encoding on the categorical data.

Next, the preprocessed data is fed into a simple XGBRegressor model, which is trained with early stopping when it is run on the train/validation sets.



In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.feature_selection import mutual_info_regression
from sklearn.preprocessing import OneHotEncoder
from xgboost import XGBRegressor

# Read the data
X = pd.read_csv('../input/home-data-for-ml-course/train.csv', index_col='Id') 
X_test = pd.read_csv('../input/home-data-for-ml-course/test.csv', index_col='Id')

# Remove rows with missing target, separate target from predictors
X.dropna(axis=0, subset=['SalePrice'], inplace=True)
y = X.SalePrice
X.drop(['SalePrice'], axis=1, inplace=True)

# Break off validation set from training data
X_train, X_valid, y_train, y_valid = train_test_split(X, y,
                                                      train_size=0.8, test_size=0.2,
                                                      random_state=0)

First, I define the function that preprocesses the data. 
It takes as arguments: 
- a train set `X` 
- a test/validation set `X_test` 
- the target output as `y`.  

It works on copies of `X` and `X_test` and returns both datasets after completing all preprocessing steps, which include:
- dropping columns that have more than `missing_threshold` missing entries,
- imputing the remaining missing numerical values with the median (for `LotFrontage`) or with 0 (for all other numerical columns),
- imputing missing data in categorical columns with a new value (a NaN in categorical data often stands for a category of its own, e.g. "the absence of a garage"),
- removing numerical features whose mutual information with the target is smaller than `mutual_inf_threshold`,
- dropping categorical columns whose cardinality is `low_cardinality_threshold` or larger,
- performing One Hot Encoding on the remaining low-cardinality categorical columns.
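
The first three steps can be tried out on a tiny table before reading the full function. The following is only a minimal sketch: the values, the `toy` and `cleaned` names, and the threshold of 2 are made up for illustration, every numerical column is filled with its median (the pipeline does this only for `LotFrontage`), and the dtype check uses `pd.api.types.is_numeric_dtype` instead of a comparison with `"object"`.

```python
import pandas as pd

# Toy data with hypothetical values; the column names are borrowed from the housing data
toy = pd.DataFrame({
    "LotFrontage": [65.0, None, 80.0, 70.0],              # numerical, 1 missing entry
    "PoolArea": [None, None, None, 512.0],                # numerical, 3 missing entries
    "GarageType": ["Attchd", None, "Detchd", "Attchd"],   # categorical, 1 missing entry
})
missing_threshold = 2  # hypothetical value, chosen to fit this tiny table

missing_counts = toy.isnull().sum()

# Columns with too many gaps are dropped, the rest are split by type for imputation
cols_to_drop = [c for c in toy.columns if missing_counts[c] > missing_threshold]
to_impute = [c for c in toy.columns if 0 < missing_counts[c] <= missing_threshold]
num_to_impute = [c for c in to_impute if pd.api.types.is_numeric_dtype(toy[c])]
cat_to_impute = [c for c in to_impute if c not in num_to_impute]

cleaned = toy.drop(columns=cols_to_drop)
for c in num_to_impute:
    cleaned[c] = cleaned[c].fillna(cleaned[c].median())  # median of the observed values
for c in cat_to_impute:
    cleaned[c] = cleaned[c].fillna("N")                  # "N" acts as a new category

print(cols_to_drop)
print(cleaned)
```

Here `PoolArea` is dropped, the gap in `LotFrontage` is filled with the median of the observed values (70.0), and the missing `GarageType` becomes the new category `"N"`.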


In [2]:
def make_mi_scores(X, y, discrete_features):
    mi_scores = mutual_info_regression(X, y, discrete_features=discrete_features)
    mi_scores = pd.Series(mi_scores, name="MI Scores", index=X.columns)
    mi_scores = mi_scores.sort_values(ascending=False)
    return mi_scores

def preprocessing_pipeline(X, y, X_test, 
                           missing_threshold=100, 
                           mutual_inf_threshold=0.05, 
                           low_cardinality_threshold=15,
                           verbose = False):
    
    data = X.copy()
    data_test = X_test.copy()
    
    if verbose:
        print('**************************')
        print('\nNumber of features at the beginning:')
        print(len(data.columns))
        print(len(data_test.columns))
    
    #--- Impute missing values
    cols_with_missing = set([col for col in data.columns if data[col].isnull().any()])
    cols_with_missing.update([col for col in data_test.columns if data_test[col].isnull().any()])
    if verbose:
        print('Columns with missing values: ')
        print(cols_with_missing)
    
    # Columns above the threshold are dropped; all others (including those exactly at the threshold) are imputed
    cols_to_drop = [col for col in cols_with_missing if data[col].isnull().sum()>missing_threshold] 
    cols_with_missing_num = [col for col in cols_with_missing if data[col].isnull().sum()<=missing_threshold and data[col].dtype != "object"] 
    cols_with_missing_cat = [col for col in cols_with_missing if data[col].isnull().sum()<=missing_threshold and data[col].dtype == "object"]
    
    if verbose:
        print('\n Columns which will be dropped due to a lot of missing values: ')
        print(cols_to_drop)
    data = data.drop(cols_to_drop, axis=1)
    data_test = data_test.drop(cols_to_drop, axis=1)
    
    if verbose:
        print('\nNumber of features after dropping missing values:')
        print(len(data.columns))
        print(len(data_test.columns))
    
    for column in cols_with_missing_num:
        if column == 'LotFrontage':
            # Use the training median for both sets so that they are imputed consistently
            median = data[column].median()
            data[column] = data[column].fillna(median)
            data_test[column] = data_test[column].fillna(median)
        else:
            data[column] = data[column].fillna(0)
            data_test[column] = data_test[column].fillna(0)
    
    for column in cols_with_missing_cat:
        data[column] = data[column].fillna('N')
        data_test[column] = data_test[column].fillna('N')
    
    imputed = data.copy()
    imputed_test = data_test.copy()
    
    if verbose:
        print('\nNumber of features after imputation:')
        print(len(imputed.columns))
        print(len(imputed_test.columns))
    
    #--- Throw away numerical features which have very small mutual information with target
    num_cols = [col for col in imputed.columns if imputed[col].dtype != 'object']
    
    mi_scores = make_mi_scores(imputed[num_cols], y, 'auto')
    if verbose:
        print('\nNumerical features with highest mutual information with target: ')
        print(mi_scores)
    
    # Series.iteritems() was removed in pandas 2.0; items() works in old and new versions
    unimportant_columns = [index for index, score in mi_scores.items() if score < mutual_inf_threshold]
    if verbose:
        print('\nColumns which will be dropped: ')
        print(unimportant_columns)
    
    high_mi = imputed.drop(unimportant_columns, axis=1)
    high_mi_test = imputed_test.drop(unimportant_columns, axis=1)
    
    if verbose:
        print('\nNumber of features after dropping low MI features:')
        print(len(high_mi.columns))
        print(len(high_mi_test.columns))

    #--- Encode categorical features
    
    cat = (high_mi.dtypes == 'object')
    categorical_cols = list(cat[cat].index)
    low_cardinality_cols = [col for col in categorical_cols if high_mi[col].nunique() < low_cardinality_threshold]
    high_cardinality_cols = list(set(categorical_cols)-set(low_cardinality_cols))
    
    lc_X_train = high_mi.drop(high_cardinality_cols, axis=1)
    lc_X_valid = high_mi_test.drop(high_cardinality_cols, axis=1)
    
    # scikit-learn >= 1.2 calls this parameter sparse_output; older versions call it sparse
    OH_encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
    OH_cols_train = pd.DataFrame(OH_encoder.fit_transform(lc_X_train[low_cardinality_cols]))
    OH_cols_valid = pd.DataFrame(OH_encoder.transform(lc_X_valid[low_cardinality_cols]))
    
    OH_cols_train.index = high_mi.index
    OH_cols_valid.index = high_mi_test.index
    num_X_train = high_mi.drop(categorical_cols, axis=1)
    num_X_valid = high_mi_test.drop(categorical_cols, axis=1)
    OH_X = pd.concat([num_X_train, OH_cols_train], axis=1)
    OH_X_test = pd.concat([num_X_valid, OH_cols_valid], axis=1)
    
    if verbose:
        print('\nNumber of features after OHE:')
        print(len(OH_X.columns)) 
        print(len(OH_X_test.columns))
    
    return OH_X, OH_X_test
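
To get a feeling for the mutual information filter used in the function above, it can be run on synthetic data where the answer is known. This is a minimal sketch under stated assumptions: the `features` table, the `target`, and the column names `informative` and `noise` are invented for the illustration, and only `mutual_info_regression` and the threshold idea come from the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
n = 1000

# One feature that drives the target and one that is pure noise (both synthetic)
features = pd.DataFrame({
    "informative": rng.normal(size=n),
    "noise": rng.normal(size=n),
})
target = 3.0 * features["informative"] + rng.normal(scale=0.1, size=n)

# Same scoring as in make_mi_scores, with a fixed seed for the estimator
scores = pd.Series(
    mutual_info_regression(features, target, random_state=0),
    name="MI Scores",
    index=features.columns,
).sort_values(ascending=False)

threshold = 0.05  # plays the role of mutual_inf_threshold
to_drop = [name for name, score in scores.items() if score < threshold]

print(scores)
print(to_drop)
```

The informative column should score far above the threshold, while the noise column should score close to zero. Because the estimate is based on nearest neighbours and is itself noisy, a feature near the threshold can end up on either side of it from run to run.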

Next, the preprocessing pipeline is applied twice. In the first call (the parameter search in the next cell) it is applied to the train and validation sets. These two sets are used for validating the model and searching for the best preprocessing parameters.    

In the second call (the final cell) it is applied to the full training set and the test set. These two sets are used for the final training of the model with the best parameters and for generating the predictions for the submission.
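
In both calls the encoder is fitted on the first dataset only and then reused on the second one, so a category that appears only in the validation or test data has no column of its own. The following small sketch, with made-up data, shows what `handle_unknown='ignore'` does in that case. The `train` and `valid` tables are hypothetical, and the dense array is obtained with `.toarray()` so that the example does not depend on the name of the `sparse`/`sparse_output` parameter.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data: "CarPort" occurs only in the validation set
train = pd.DataFrame({"GarageType": ["Attchd", "Detchd", "N"]})
valid = pd.DataFrame({"GarageType": ["Attchd", "CarPort"]})

encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train)  # the categories are learned from the training data only

# The default output is a sparse matrix; .toarray() converts it to a dense array
encoded_valid = encoder.transform(valid).toarray()

print(encoder.categories_[0])  # ['Attchd' 'Detchd' 'N']
print(encoded_valid)           # first row: [1. 0. 0.], second row: [0. 0. 0.]
```

The unseen category is encoded as a row of zeros instead of raising an error, which also keeps the number of columns identical in both datasets.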

In [3]:
# Start from infinity so that best_params is always set by the first run
smallest_mae = float('inf')
best_params = None

for missing_threshold in [500,1000,1500]:
    for mutual_inf_threshold in [0,0.02,0.05,0.1]:
        for low_cardinality_threshold in [5,10,20]:
            print(f'running: {missing_threshold}, {mutual_inf_threshold}, {low_cardinality_threshold}')
            
            processed_X_train, processed_X_valid = preprocessing_pipeline(X_train, y_train, X_valid, 
                                                                          missing_threshold=missing_threshold,      
                                                                          mutual_inf_threshold=mutual_inf_threshold, 
                                                                          low_cardinality_threshold=low_cardinality_threshold)
            
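            # xgboost >= 1.6 takes early_stopping_rounds in the constructor;
            # older versions expect it as an argument of fit() instead
            model = XGBRegressor(n_estimators=5_000, learning_rate=0.01,
                                 early_stopping_rounds=10)
            model.fit(processed_X_train, y_train,
                      eval_set=[(processed_X_valid, y_valid)], verbose=False)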
            preds_test = model.predict(processed_X_valid)
            
            print('**********************************************')
            print(missing_threshold)
            print(mutual_inf_threshold)
            print(low_cardinality_threshold)
            mae = mean_absolute_error(y_valid, preds_test)
            print(mae)
            if mae < smallest_mae:
                smallest_mae = mae
                best_params = missing_threshold, mutual_inf_threshold, low_cardinality_threshold
                
print(smallest_mae)
print(best_params)

running: 500, 0, 5
**********************************************
500
0
5
17372.77334385702
running: 500, 0, 10
**********************************************
500
0
10
16748.630912885274
running: 500, 0, 20
**********************************************
500
0
20
16451.913848458906
running: 500, 0.02, 5
**********************************************
500
0.02
5
17060.511584974316
running: 500, 0.02, 10
**********************************************
500
0.02
10
16606.881969713184
running: 500, 0.02, 20
**********************************************
500
0.02
20
16392.254414597603
running: 500, 0.05, 5
**********************************************
500
0.05
5
17045.279350385274
running: 500, 0.05, 10
**********************************************
500
0.05
10
16729.266748715752
running: 500, 0.05, 20
**********************************************
500
0.05
20
16428.584305436645
running: 500, 0.1, 5
**********************************************
500
0.1
5
17419.006728916953
running: 500, 0.1, 

Finally, the best parameters are used to preprocess the entire training set and the test set, train the final model, and generate the predictions for the submission.

In [4]:
processed_X, processed_X_test = preprocessing_pipeline(X, y, X_test, 
                                                       missing_threshold=best_params[0],      
                                                       mutual_inf_threshold=best_params[1],
                                                       low_cardinality_threshold=best_params[2])
print(len(processed_X.columns), len(processed_X_test.columns))

model = XGBRegressor(n_estimators = 5_000, learning_rate = 0.01)
model.fit(processed_X, y)
preds_test = model.predict(processed_X_test)

output = pd.DataFrame({'Id': X_test.index,
                       'SalePrice': preds_test})
output.to_csv('submission.csv', index=False)
print('output saved')

252 252
output saved


The last remaining step would be to optimize the XGBRegressor parameters, e.g. using GridSearchCV.
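
A possible starting point for such a search is sketched below. It is not part of the notebook's results. The data is synthetic, the parameter grid is an arbitrary small example, and `GradientBoostingRegressor` is used as a stand-in so that the sketch depends only on scikit-learn. `XGBRegressor` follows the same estimator API and accepts the three parameter names used here, so it could be passed as `estimator` instead.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression data standing in for the preprocessed housing features
X_demo, y_demo = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

# Small, arbitrary grid; a real search would cover wider ranges
param_grid = {
    "n_estimators": [50, 100],
    "learning_rate": [0.05, 0.1],
    "max_depth": [2, 3],
}

search = GridSearchCV(
    estimator=GradientBoostingRegressor(random_state=0),
    param_grid=param_grid,
    scoring="neg_mean_absolute_error",  # same metric as above, negated so that higher is better
    cv=3,
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(-search.best_score_)  # cross-validated MAE of the best combination
```

Unlike the loop above, which scores each combination on a single validation split, GridSearchCV uses cross-validation, so its MAE values are not directly comparable to the ones printed earlier.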