# Kaggle Getting Started Competition: House Prices - Advanced Regression Techniques

This notebook is based on the [notebook](https://www.kaggle.com/code/ryanholbrook/feature-engineering-for-house-prices) provided for the [House Prices](https://www.kaggle.com/c/house-prices-advanced-regression-techniques) Kaggle competition. It builds on the hands-on exercises presented in the Kaggle Learn courses [Intermediate Machine Learning](https://www.kaggle.com/learn/intermediate-machine-learning) and [Feature Engineering](https://www.kaggle.com/learn/feature-engineering).

The following imports are required to build the Kubeflow pipeline and to pass data between its components.

In [1]:
# Install the Kubeflow Pipelines SDK (kfp)
# Note: `func_to_container_op`, used below, is part of the kfp v1 SDK
# !pip install kfp --upgrade

In [2]:
import kfp
from kfp.components import func_to_container_op
import kfp.components as comp

All packages required by the pipeline components are collected in a list, which is then passed to each component. This can become inefficient when there are many packages and dependencies; in that case, you can build a Docker image that contains them and pass that image to each pipeline component instead.

In [3]:
import_packages = ['pandas', 'sklearn', 'category_encoders', 'xgboost', 'numpy']

This Kubeflow pipeline is built from [lightweight Python function components](https://www.kubeflow.org/docs/components/pipelines/sdk/python-function-components/). Data is passed between component instances (tasks) using `InputPath` and `OutputPath`. No external volume has to be defined and attached to the tasks, because the system takes care of storing the data. Further details and examples can be found in this [notebook](https://github.com/Ark-kun/kfp_samples/blob/65a98da2d4d2bd27a803ee58213b4cfd8a84825e/2019-10%20Kubeflow%20summit/104%20-%20Passing%20data%20for%20python%20components/104%20-%20Passing%20data%20for%20python%20components.ipynb).
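
The file-based hand-off can be illustrated without Kubeflow. The following is a minimal sketch, assuming a temporary directory as a stand-in for the system-managed storage; the function names `produce` and `consume` and the row values are made up for illustration and are not part of the pipeline.

```python
import csv
import tempfile
from pathlib import Path


def produce(output_csv_path: str) -> None:
    """Stand-in for a component with an OutputPath parameter: writes to the given path."""
    with open(output_csv_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["Id", "SalePrice"])
        writer.writerow([1, 100000])  # made-up row, not from the dataset


def consume(input_csv_path: str) -> list:
    """Stand-in for a component with an InputPath parameter: reads from the given path."""
    with open(input_csv_path, newline="") as f:
        return list(csv.DictReader(f))


# Hypothetical stand-in for the storage that the pipeline system manages
with tempfile.TemporaryDirectory() as workdir:
    artifact = str(Path(workdir) / "data.csv")
    produce(artifact)         # the upstream task writes its output file
    rows = consume(artifact)  # the downstream task receives a path to the same data

print(rows)
```

Each function only ever sees a local file path, which is why the components in this notebook read and write plain CSV files.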

The pipeline is divided into five components:
1. Download the data zip file from a URL
2. Load and preprocess the data
3. Create data with features
4. Train the model
5. Evaluate the model

### Download Data

Here we use an existing YAML file from kubeflow/pipelines for the 'Download Data' component, which downloads data from a URL. In our case, the data comes from GitHub.

In [4]:
web_downloader_op = kfp.components.load_component_from_url(
    'https://raw.githubusercontent.com/kubeflow/pipelines/master/components/contrib/web/Download/component.yaml')

### Load and Preprocess Data

In [5]:
def load_and_preprocess_data(file_path: comp.InputPath(), train_output_csv: comp.OutputPath(), test_output_csv: comp.OutputPath()):
    
    # Read data
    import pandas as pd
    from pandas.api.types import CategoricalDtype
    from zipfile import ZipFile   
    
    # Extract the archive into the working directory
    # (avoid naming the handle `zip`, which would shadow the built-in)
    with ZipFile(file_path, 'r') as zip_file:
        zip_file.extractall()
        
    # Load the training and test data
    train_file_path = 'data/train.csv'
    test_file_path = 'data/test.csv'
    df_train = pd.read_csv(train_file_path, index_col="Id")
    df_test = pd.read_csv(test_file_path, index_col="Id")
    
    # Merge the splits so we can process them together
    df = pd.concat([df_train, df_test])
        
    # Clean data
    df["Exterior2nd"] = df["Exterior2nd"].replace({"Brk Cmn": "BrkComm"})
    # Some values of GarageYrBlt are corrupt, so we'll replace them
    # with the year the house was built
    df["GarageYrBlt"] = df["GarageYrBlt"].where(df.GarageYrBlt <= 2010, df.YearBuilt)
    # Names beginning with numbers are awkward to work with
    df.rename(columns={
        "1stFlrSF": "FirstFlrSF",
        "2ndFlrSF": "SecondFlrSF",
        "3SsnPorch": "Threeseasonporch",
    }, inplace=True,
    )
    
    # Encode data
    
    # Nominal categories
    # The numeric features are already encoded correctly (`float` for
    # continuous, `int` for discrete), but the categoricals we'll need to
    # do ourselves. Note in particular, that the `MSSubClass` feature is
    # read as an `int` type, but is actually a (nominative) categorical.

    # The nominative (unordered) categorical features
    features_nom = ["MSSubClass", "MSZoning", "Street", "Alley", "LandContour", "LotConfig", "Neighborhood", "Condition1", "Condition2", "BldgType", "HouseStyle", "RoofStyle", "RoofMatl", "Exterior1st", "Exterior2nd", "MasVnrType", "Foundation", "Heating", "CentralAir", "GarageType", "MiscFeature", "SaleType", "SaleCondition"]

    # Pandas calls the categories "levels"
    five_levels = ["Po", "Fa", "TA", "Gd", "Ex"]
    ten_levels = list(range(10))

    ordered_levels = {
        "OverallQual": ten_levels,
        "OverallCond": ten_levels,
        "ExterQual": five_levels,
        "ExterCond": five_levels,
        "BsmtQual": five_levels,
        "BsmtCond": five_levels,
        "HeatingQC": five_levels,
        "KitchenQual": five_levels,
        "FireplaceQu": five_levels,
        "GarageQual": five_levels,
        "GarageCond": five_levels,
        "PoolQC": five_levels,
        "LotShape": ["Reg", "IR1", "IR2", "IR3"],
        "LandSlope": ["Sev", "Mod", "Gtl"],
        "BsmtExposure": ["No", "Mn", "Av", "Gd"],
        "BsmtFinType1": ["Unf", "LwQ", "Rec", "BLQ", "ALQ", "GLQ"],
        "BsmtFinType2": ["Unf", "LwQ", "Rec", "BLQ", "ALQ", "GLQ"],
        "Functional": ["Sal", "Sev", "Maj1", "Maj2", "Mod", "Min2", "Min1", "Typ"],
        "GarageFinish": ["Unf", "RFn", "Fin"],
        "PavedDrive": ["N", "P", "Y"],
        "Utilities": ["NoSeWa", "NoSewr", "AllPub"],
        "CentralAir": ["N", "Y"],
        "Electrical": ["Mix", "FuseP", "FuseF", "FuseA", "SBrkr"],
        "Fence": ["MnWw", "GdWo", "MnPrv", "GdPrv"],
    }

    # Add a None level for missing values
    ordered_levels = {key: ["None"] + value for key, value in
                      ordered_levels.items()}


    for name in features_nom:
        df[name] = df[name].astype("category")
        # Add a None category for missing values
        if "None" not in df[name].cat.categories:
            # `inplace=True` was removed in pandas 2.0, so assign the result back
            df[name] = df[name].cat.add_categories("None")
    # Ordinal categories
    for name, levels in ordered_levels.items():
        df[name] = df[name].astype(CategoricalDtype(levels,
                                                    ordered=True))
        
    
    # Impute data
    for name in df.select_dtypes("number"):
        df[name] = df[name].fillna(0)
    for name in df.select_dtypes(include = ["category"]):
        df[name] = df[name].fillna("None")
        
    # Reform splits        
    df_train = df.loc[df_train.index, :]
    df_test = df.loc[df_test.index, :]
    
    # passing the data as csv files to outputs
    df_train.to_csv(train_output_csv)
    df_test.to_csv(test_output_csv)        


In [6]:
load_and_preprocess_data_op = func_to_container_op(load_and_preprocess_data,packages_to_install = import_packages)

### Creating data with features

In [7]:
def featured_data(train_path: comp.InputPath(), test_path: comp.InputPath(), feat_train_output_csv: comp.OutputPath(), feat_test_output_csv: comp.OutputPath()):
    
    import pandas as pd
    from pandas.api.types import CategoricalDtype
    from category_encoders import MEstimateEncoder
    from sklearn.feature_selection import mutual_info_regression
    from sklearn.cluster import KMeans
    from sklearn.decomposition import PCA
    from sklearn.model_selection import KFold, cross_val_score
    
    df_train = pd.read_csv(train_path, index_col="Id")
    df_test = pd.read_csv(test_path, index_col="Id")
    
    def make_mi_scores(X, y):
        X = X.copy()
        for colname in X.select_dtypes(["object","category"]):
            X[colname], _ = X[colname].factorize()
        # All discrete features should now have integer dtypes
        discrete_features = [pd.api.types.is_integer_dtype(t) for t in X.dtypes]
        mi_scores = mutual_info_regression(X, y, discrete_features=discrete_features, random_state=0)
        mi_scores = pd.Series(mi_scores, name="MI Scores", index=X.columns)
        mi_scores = mi_scores.sort_values(ascending=False)
        return mi_scores
    
    def drop_uninformative(df, mi_scores):
        return df.loc[:, mi_scores > 0.0]
    
    def label_encode(df):
        
        X = df.copy()   
        for colname in X.select_dtypes(["category"]):
            X[colname] = X[colname].cat.codes
        return X

    def mathematical_transforms(df):
        X = pd.DataFrame()  # dataframe to hold new features
        X["LivLotRatio"] = df.GrLivArea / df.LotArea
        X["Spaciousness"] = (df.FirstFlrSF + df.SecondFlrSF) / df.TotRmsAbvGrd
        return X

    def interactions(df):
        X = pd.get_dummies(df.BldgType, prefix="Bldg")
        X = X.mul(df.GrLivArea, axis=0)
        return X

    def counts(df):
        X = pd.DataFrame()
        X["PorchTypes"] = df[[
            "WoodDeckSF",
            "OpenPorchSF",
            "EnclosedPorch",
            "Threeseasonporch",
            "ScreenPorch",
        ]].gt(0.0).sum(axis=1)
        return X

    def break_down(df):
        X = pd.DataFrame()
        X["MSClass"] = df.MSSubClass.str.split("_", n=1, expand=True)[0]
        return X

    def group_transforms(df):
        X = pd.DataFrame()
        X["MedNhbdArea"] = df.groupby("Neighborhood")["GrLivArea"].transform("median")
        return X
    
    cluster_features = [
        "LotArea",
        "TotalBsmtSF",
        "FirstFlrSF",
        "SecondFlrSF",
        "GrLivArea",
        ]

    def cluster_labels(df, features, n_clusters=20):
        X = df.copy()
        X_scaled = X.loc[:, features]
        X_scaled = (X_scaled - X_scaled.mean(axis=0)) / X_scaled.std(axis=0)
        kmeans = KMeans(n_clusters=n_clusters, n_init=50, random_state=0)
        X_new = pd.DataFrame()
        X_new["Cluster"] = kmeans.fit_predict(X_scaled)
        return X_new

    def cluster_distance(df, features, n_clusters=20):
        X = df.copy()
        X_scaled = X.loc[:, features]
        X_scaled = (X_scaled - X_scaled.mean(axis=0)) / X_scaled.std(axis=0)
        kmeans = KMeans(n_clusters=n_clusters, n_init=50, random_state=0)
        X_cd = kmeans.fit_transform(X_scaled)
        # Label features and join to dataset
        X_cd = pd.DataFrame(
            X_cd, columns=[f"Centroid_{i}" for i in range(X_cd.shape[1])]
        )
        return X_cd
    
    def apply_pca(X, standardize=True):
        # Standardize
        if standardize:
            X = (X - X.mean(axis=0)) / X.std(axis=0)
        # Create principal components
        pca = PCA()
        X_pca = pca.fit_transform(X)
        # Convert to dataframe
        component_names = [f"PC{i+1}" for i in range(X_pca.shape[1])]
        X_pca = pd.DataFrame(X_pca, columns=component_names)
        # Create loadings
        loadings = pd.DataFrame(
            pca.components_.T,  # transpose the matrix of loadings
            columns=component_names,  # so the columns are the principal components
            index=X.columns,  # and the rows are the original features
        )
        return pca, X_pca, loadings

    def pca_inspired(df):
        X = pd.DataFrame()
        X["Feature1"] = df.GrLivArea + df.TotalBsmtSF
        X["Feature2"] = df.YearRemodAdd * df.TotalBsmtSF
        return X


    def pca_components(df, features):
        X = df.loc[:, features]
        _, X_pca, _ = apply_pca(X)
        return X_pca


    pca_features = [
        "GarageArea",
        "YearRemodAdd",
        "TotalBsmtSF",
        "GrLivArea",
    ]
    
    class CrossFoldEncoder:
        def __init__(self, encoder, **kwargs):
            self.encoder_ = encoder
            self.kwargs_ = kwargs  # keyword arguments for the encoder
            self.cv_ = KFold(n_splits=5)

        # Fit an encoder on one split and transform the feature on the
        # other. Iterating over the splits in all folds gives a complete
        # transformation. We also now have one trained encoder on each
        # fold.
        def fit_transform(self, X, y, cols):
            self.fitted_encoders_ = []
            self.cols_ = cols
            X_encoded = []
            for idx_encode, idx_train in self.cv_.split(X):
                fitted_encoder = self.encoder_(cols=cols, **self.kwargs_)
                fitted_encoder.fit(
                    X.iloc[idx_encode, :], y.iloc[idx_encode],
                )
                X_encoded.append(fitted_encoder.transform(X.iloc[idx_train, :])[cols])
                self.fitted_encoders_.append(fitted_encoder)
            X_encoded = pd.concat(X_encoded)
            X_encoded.columns = [name + "_encoded" for name in X_encoded.columns]
            return X_encoded

        # To transform the test data, average the encodings learned from
        # each fold.
        def transform(self, X):
            from functools import reduce

            X_encoded_list = []
            for fitted_encoder in self.fitted_encoders_:
                X_encoded = fitted_encoder.transform(X)
                X_encoded_list.append(X_encoded[self.cols_])
            X_encoded = reduce(
                lambda x, y: x.add(y, fill_value=0), X_encoded_list
            ) / len(X_encoded_list)
            X_encoded.columns = [name + "_encoded" for name in X_encoded.columns]
            return X_encoded
        
    X = df_train.copy()
    y = X.pop("SalePrice") 
    
    X_test = df_test.copy()
    X_test.pop("SalePrice")
    
    # Get the mutual information scores
    mi_scores = make_mi_scores(X, y)
    
    # Concat the training and test dataset before restoring categorical encoding
    X = pd.concat([X, X_test])
    
    # Restore the categorical encoding removed during csv conversion
    # The nominative (unordered) categorical features
    features_nom = ["MSSubClass", "MSZoning", "Street", "Alley", "LandContour", "LotConfig", "Neighborhood", "Condition1", "Condition2", "BldgType", "HouseStyle", "RoofStyle", "RoofMatl", "Exterior1st", "Exterior2nd", "MasVnrType", "Foundation", "Heating", "CentralAir", "GarageType", "MiscFeature", "SaleType", "SaleCondition"]

    # Pandas calls the categories "levels"
    five_levels = ["Po", "Fa", "TA", "Gd", "Ex"]
    ten_levels = list(range(10))

    ordered_levels = {
        "OverallQual": ten_levels,
        "OverallCond": ten_levels,
        "ExterQual": five_levels,
        "ExterCond": five_levels,
        "BsmtQual": five_levels,
        "BsmtCond": five_levels,
        "HeatingQC": five_levels,
        "KitchenQual": five_levels,
        "FireplaceQu": five_levels,
        "GarageQual": five_levels,
        "GarageCond": five_levels,
        "PoolQC": five_levels,
        "LotShape": ["Reg", "IR1", "IR2", "IR3"],
        "LandSlope": ["Sev", "Mod", "Gtl"],
        "BsmtExposure": ["No", "Mn", "Av", "Gd"],
        "BsmtFinType1": ["Unf", "LwQ", "Rec", "BLQ", "ALQ", "GLQ"],
        "BsmtFinType2": ["Unf", "LwQ", "Rec", "BLQ", "ALQ", "GLQ"],
        "Functional": ["Sal", "Sev", "Maj1", "Maj2", "Mod", "Min2", "Min1", "Typ"],
        "GarageFinish": ["Unf", "RFn", "Fin"],
        "PavedDrive": ["N", "P", "Y"],
        "Utilities": ["NoSeWa", "NoSewr", "AllPub"],
        "CentralAir": ["N", "Y"],
        "Electrical": ["Mix", "FuseP", "FuseF", "FuseA", "SBrkr"],
        "Fence": ["MnWw", "GdWo", "MnPrv", "GdPrv"],
    }

    # Add a None level for missing values
    ordered_levels = {key: ["None"] + value for key, value in
                      ordered_levels.items()}
    
    for name in features_nom:
        X[name] = X[name].astype("category")
        if "None" not in X[name].cat.categories:
            # `inplace=True` was removed in pandas 2.0, so assign the result back
            X[name] = X[name].cat.add_categories("None")
        
    # Ordinal categories
    for name, levels in ordered_levels.items():
        X[name] = X[name].astype(CategoricalDtype(levels,
                                                    ordered=True))
           
    # Drop features whose mutual information score is zero
    X = drop_uninformative(X, mi_scores)
    

    # Transformations
    X = X.join(mathematical_transforms(X))
    X = X.join(interactions(X))
    X = X.join(counts(X))
    # X = X.join(break_down(X))
    X = X.join(group_transforms(X))

    # Clustering
    # X = X.join(cluster_labels(X, cluster_features, n_clusters=20))
    # X = X.join(cluster_distance(X, cluster_features, n_clusters=20))

    # PCA
    X = X.join(pca_inspired(X))
    # X = X.join(pca_components(X, pca_features))
    # X = X.join(indicate_outliers(X))
    
    # Label encoding
    X = label_encode(X)
    
    # Reform splits
    X_test = X.loc[df_test.index, :]
    X.drop(df_test.index, inplace=True)

    # Target Encoder
    encoder = CrossFoldEncoder(MEstimateEncoder, m=1)
    X = X.join(encoder.fit_transform(X, y, cols=["MSSubClass"]))
    
    X_test = X_test.join(encoder.transform(X_test))
    
    X.to_csv(feat_train_output_csv)
    X_test.to_csv(feat_test_output_csv)

    

In [8]:
featured_data_op = func_to_container_op(featured_data, packages_to_install = import_packages)
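
The idea behind `CrossFoldEncoder` is that each training row is encoded by an encoder fitted on the other folds only, so a row's own target never leaks into its encoding. The following is a minimal pure-Python sketch of that idea, not the `category_encoders` implementation: the helper names, the simplified m-estimate formula `(sum + prior * m) / (count + m)`, the fallback to the prior for unseen categories, and the toy data are all assumptions made for illustration.

```python
from collections import defaultdict


def m_estimate_fit(categories, targets, m=1.0):
    """Return (encoding per category, prior) using (sum + prior * m) / (count + m)."""
    prior = sum(targets) / len(targets)
    sums, counts = defaultdict(float), defaultdict(int)
    for c, t in zip(categories, targets):
        sums[c] += t
        counts[c] += 1
    return {c: (sums[c] + prior * m) / (counts[c] + m) for c in sums}, prior


def cross_fold_encode(categories, targets, n_splits=2, m=1.0):
    """Encode each row with an encoder fitted on the other folds only."""
    n = len(categories)
    # Contiguous folds, in the spirit of KFold without shuffling
    bounds = [round(i * n / n_splits) for i in range(n_splits + 1)]
    encoded = [None] * n
    for k in range(n_splits):
        held_out = range(bounds[k], bounds[k + 1])
        fit_idx = [i for i in range(n) if i not in held_out]
        mapping, prior = m_estimate_fit(
            [categories[i] for i in fit_idx], [targets[i] for i in fit_idx], m
        )
        for i in held_out:
            # Categories unseen in the fitting folds fall back to their prior
            encoded[i] = mapping.get(categories[i], prior)
    return encoded


# Made-up toy data, not taken from the competition dataset
cats = ["A", "B", "A", "B"]
y = [10.0, 20.0, 30.0, 40.0]
encoded = cross_fold_encode(cats, y, n_splits=2, m=1.0)
print(encoded)  # [32.5, 37.5, 12.5, 17.5]
```

The first two rows are encoded from the targets of the last two rows and vice versa, which is the property the per-fold loop in `fit_transform` is designed to provide.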

### Train data

In [9]:
def train_data(train_path: comp.InputPath(), feat_train_path: comp.InputPath(), feat_test_path: comp.InputPath(), model_path: comp.OutputPath('XGBoostModel')):
    
    import pandas as pd
    import numpy as np
    from xgboost.sklearn import XGBRegressor
    from pathlib import Path
    
    df_train = pd.read_csv(train_path, index_col="Id")
    X_train = pd.read_csv(feat_train_path, index_col="Id")
    X_test = pd.read_csv(feat_test_path, index_col="Id")
    y_train = df_train.loc[:, "SalePrice"]
    
    xgb_params = dict(
        max_depth=6,           # maximum depth of each tree - try 2 to 10
        learning_rate=0.01,    # effect of each tree - try 0.0001 to 0.1
        n_estimators=1000,     # number of trees (that is, boosting rounds) - try 1000 to 8000
        min_child_weight=1,    # minimum number of houses in a leaf - try 1 to 10
        colsample_bytree=0.7,  # fraction of features (columns) per tree - try 0.2 to 1.0
        subsample=0.7,         # fraction of instances (rows) per tree - try 0.2 to 1.0
        reg_alpha=0.5,         # L1 regularization (like LASSO) - try 0.0 to 10.0
        reg_lambda=1.0,        # L2 regularization (like Ridge) - try 0.0 to 10.0
        num_parallel_tree=1,   # set > 1 for boosted random forests
    )

    xgb = XGBRegressor(**xgb_params)
    # XGB minimizes MSE, but competition loss is RMSLE
    # So, we need to log-transform y to train and exp-transform the predictions
    xgb.fit(X_train, np.log(y_train))

    Path(model_path).parent.mkdir(parents=True, exist_ok=True)
    xgb.save_model(model_path)
    

In [10]:
train_data_op = func_to_container_op(train_data, packages_to_install= import_packages)
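
The comment in `train_data` about log-transforming the target can be checked with a small calculation: the RMSE measured between log-prices is the same quantity as the RMSLE measured between prices. The sketch below uses only the standard library; the prices are made up for illustration and the helper names are hypothetical.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between two equally long sequences."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def rmsle(y_true, y_pred):
    """RMSE computed between the logarithms of the prices."""
    return rmse([math.log(t) for t in y_true], [math.log(p) for p in y_pred])


# Made-up prices, for illustration only
actual = [100_000.0, 200_000.0, 400_000.0]
# What a model trained on log(y) would output: predictions in log space
log_predictions = [math.log(110_000.0), math.log(190_000.0), math.log(420_000.0)]

# Exp-transform to get back to prices, as eval_data does with np.exp
predicted = [math.exp(p) for p in log_predictions]

# RMSE in log space equals RMSLE in price space
log_space_rmse = rmse([math.log(t) for t in actual], log_predictions)
print(math.isclose(log_space_rmse, rmsle(actual, predicted)))
```

This is why minimizing squared error on `np.log(y_train)` lines up with the competition metric, provided the predictions are exponentiated before they are reported.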

### Evaluate data

In [11]:
def eval_data(test_data_path: comp.InputPath(), model_path: comp.InputPath('XGBoostModel')):
    
    import pandas as pd
    import numpy as np
    from xgboost.sklearn import XGBRegressor
    
    X_test = pd.read_csv(test_data_path, index_col="Id")
    
    xgb = XGBRegressor()
    
    
    xgb.load_model(model_path)
    
    predictions = np.exp(xgb.predict(X_test))
    
    print(predictions)
       
#     output = pd.DataFrame({'Id': X_test.index, 'SalePrice': predictions})
#     output.to_csv('data/my_submission.csv', index=False)
#     print("Your submission was successfully saved!")
    

In [12]:
eval_data_op = func_to_container_op(eval_data, packages_to_install= import_packages)

### Defining function that implements the pipeline

In [13]:
def vanilla_pipeline(url):
    
    web_downloader_task = web_downloader_op(url=url)

    load_and_preprocess_data_task = load_and_preprocess_data_op(file = web_downloader_task.outputs['data'])

    featured_data_task = featured_data_op(train = load_and_preprocess_data_task.outputs['train_output_csv'], test = load_and_preprocess_data_task.outputs['test_output_csv'])
    
    train_eval_task = train_data_op(train = load_and_preprocess_data_task.outputs['train_output_csv'] , feat_train = featured_data_task.outputs['feat_train_output_csv'],
                                                 feat_test = featured_data_task.outputs['feat_test_output_csv'])
    
    eval_data_task = eval_data_op(test_data = featured_data_task.outputs['feat_test_output_csv'],model = train_eval_task.output)
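
Note that the keyword arguments used in the pipeline differ from the Python parameter names: `file_path` is passed as `file`, `train_path` as `train`, and `model_path` as `model`, while `train_output_csv` keeps its name. This suggests that a trailing `_path` is dropped from path-typed parameters. The helper below is a hypothetical sketch of that mapping, inferred from this notebook's own usage rather than taken from the SDK's implementation.

```python
def io_name(parameter_name: str) -> str:
    """Hypothetical helper: drop a trailing `_path`, as the pipeline's argument names suggest."""
    suffix = "_path"
    if parameter_name.endswith(suffix):
        return parameter_name[: -len(suffix)]
    return parameter_name


# Parameter names taken from the component functions in this notebook
params = ["file_path", "train_path", "feat_train_path", "model_path", "train_output_csv"]
names = [io_name(p) for p in params]
for param, name in zip(params, names):
    print(param, "->", name)
```

If a task rejects an argument name, checking the component's generated input and output names is a quick way to find the expected spelling.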
    

In [14]:
# Using kfp.Client() to run the pipeline from notebook itself
client = kfp.Client() # change arguments accordingly

# Running the pipeline
client.create_run_from_pipeline_func(
    vanilla_pipeline,
    arguments={
        # GitHub URL to fetch the data from. If you clone the repo, update the URL accordingly.
        'url': 'https://github.com/kubeflow/examples/raw/master/house-prices-kaggle-competition/data.zip'
    })

RunPipelineResult(run_id=66011ba0-a465-4d5b-beba-f081ab3002b4)