# Introduction
In version 3 of this notebook, I use a pipeline to solve the problem systematically. I also explore how to handle numerical and categorical variables.

The dataset contains two kinds of variables: numerical variables (dtype int64 or float64) and categorical variables (dtype object). This version does not deal with outliers or skew in the dataset; it only covers how to use a pipeline to manage the preprocessing and modelling steps.
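
As a minimal sketch of that distinction, the snippet below builds a small made-up frame (the values are illustrative and not taken from train.csv) and separates its columns by dtype with `select_dtypes`. The variable names `toy`, `numeric_cols` and `categorical_cols` are assumptions made for this example.

```python
import pandas as pd

# Toy frame standing in for the housing data (illustrative values only).
toy = pd.DataFrame({
    "LotArea": [8450, 9600, 11250],        # integer column
    "LotFrontage": [65.0, None, 68.0],     # float column (the missing value becomes NaN)
    "Street": ["Pave", "Pave", "Grvl"],    # text column, treated as categorical
})

# Numerical columns: every integer or float dtype.
numeric_cols = toy.select_dtypes(include=["number"]).columns.tolist()
# Categorical columns: everything that is not numeric.
categorical_cols = toy.select_dtypes(exclude=["number"]).columns.tolist()

print("numeric:", numeric_cols)
print("categorical:", categorical_cols)
```

Selecting on `"number"` rather than on exact dtype names is a design choice for the sketch: it also catches other integer and float widths.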

OK, let's go!


## STEP 1: Find the path of input and output

Knowing these paths is important for loading the dataset and saving the output. It also helps later, when more complex situations require working with multiple models.

In [1]:
# list the files under the input and output directories
import os
print('The path file input:')
for dirname, _,filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname,filename))

print('The path file output:')
for dirname1, _,filenames1 in os.walk('/kaggle/working'):
    for filename1 in filenames1:
        print(os.path.join(dirname1,filename1))

The path file input:
/kaggle/input/home-data-for-ml-course/train.csv
/kaggle/input/home-data-for-ml-course/data_description.txt
/kaggle/input/home-data-for-ml-course/train.csv.gz
/kaggle/input/home-data-for-ml-course/sample_submission.csv
/kaggle/input/home-data-for-ml-course/sample_submission.csv.gz
/kaggle/input/home-data-for-ml-course/test.csv.gz
/kaggle/input/home-data-for-ml-course/test.csv
The path file output:
/kaggle/working/__notebook_source__.ipynb


## STEP 2: Exploratory data analysis
In this step we need to:
1. Make preliminary observations: the size of the dataset, and how many columns are numerical and how many are categorical.
2. Read the files and split the columns into separate parts.

## 2.1. Preliminary observations

In [2]:
# Import the libraries
import pandas as pd

# Path of the file to read.
file_path_home = '/kaggle/input/home-data-for-ml-course/train.csv'
file_path_test = '/kaggle/input/home-data-for-ml-course/test.csv'
home_train_data = pd.read_csv(file_path_home,index_col = 'Id')
home_test_data = pd.read_csv(file_path_test, index_col = 'Id')
train_data = home_train_data.copy()  # work on copies so the original data stays unchanged
test_data = home_test_data.copy()

# display data
print(test_data.shape)
print(train_data.shape)
print('numeric columns {} '
      .format(train_data.select_dtypes(exclude=['object']).columns))
print('len of numeric columns {} '
      .format(len(train_data.select_dtypes(exclude=['object']).columns)))
print('categorical columns {} '
      .format(train_data.select_dtypes(include=['object']).columns))
print('len of categorical columns {} '
      .format(len(train_data.select_dtypes(include=['object']).columns)))

(1459, 79)
(1460, 80)
numeric columns Index(['MSSubClass', 'LotFrontage', 'LotArea', 'OverallQual', 'OverallCond',
       'YearBuilt', 'YearRemodAdd', 'MasVnrArea', 'BsmtFinSF1', 'BsmtFinSF2',
       'BsmtUnfSF', 'TotalBsmtSF', '1stFlrSF', '2ndFlrSF', 'LowQualFinSF',
       'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath',
       'BedroomAbvGr', 'KitchenAbvGr', 'TotRmsAbvGrd', 'Fireplaces',
       'GarageYrBlt', 'GarageCars', 'GarageArea', 'WoodDeckSF', 'OpenPorchSF',
       'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea', 'MiscVal',
       'MoSold', 'YrSold', 'SalePrice'],
      dtype='object') 
len of numeric columns 37 
categorical columns Index(['MSZoning', 'Street', 'Alley', 'LotShape', 'LandContour', 'Utilities',
       'LotConfig', 'LandSlope', 'Neighborhood', 'Condition1', 'Condition2',
       'BldgType', 'HouseStyle', 'RoofStyle', 'RoofMatl', 'Exterior1st',
       'Exterior2nd', 'MasVnrType', 'ExterQual', 'ExterCond', 'Foundation',
       'BsmtQual', '

**We can see that in this dataset we have:**
1. The size of the datasets: train_data: (1460, 80) and test_data: (1459, 79).
2. Numeric columns: 37 (in train_data, including the target SalePrice)
3. Categorical columns: 43

## 2.2. Split data
**In this step we will split the columns into 3 parts:**
1. Numerical columns
2. Categorical columns with fewer than 10 unique values: for one-hot encoding
3. Categorical columns with 10 or more unique values: for label (ordinal) encoding
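
To illustrate why the two categorical groups are treated differently, here is a small sketch on made-up values. One-hot encoding adds one column per category, which is fine for a column with few categories. Ordinal encoding always yields a single integer column, so it stays compact for columns with many categories. The sketch uses scikit-learn's `OrdinalEncoder` as a stand-in; the pipeline later in this notebook uses the one from `category_encoders`, and the exact integer codes it assigns may differ.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Made-up sample: "Street" has few categories, "Neighborhood" has many.
toy = pd.DataFrame({
    "Street": ["Pave", "Grvl", "Pave", "Grvl"],
    "Neighborhood": ["NAmes", "CollgCr", "OldTown", "Edwards"],
})

# One-hot: one output column per distinct category.
one_hot = OneHotEncoder(handle_unknown="ignore").fit_transform(toy[["Street"]]).toarray()

# Ordinal: a single output column holding one integer code per category.
ordinal = OrdinalEncoder().fit_transform(toy[["Neighborhood"]])

print(one_hot.shape)   # rows x number of Street categories
print(ordinal.shape)   # rows x 1
```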

In [3]:
numerical_col = [num for num in test_data.columns 
                 if test_data[num].dtype in ['int64','float64']]

categorical_low_col = [calow for calow in test_data.columns
                      if test_data[calow].dtype in ['object']
                      and test_data[calow].nunique() < 10]
categorical_high_col = [cahigh for cahigh in test_data.columns 
                       if test_data[cahigh].dtype in ['object']
                       and test_data[cahigh].nunique() >= 10]

# check three parts
print('numerical_col {}'.format(numerical_col))
print('len of numerical_col {}'.format(len(numerical_col)))
print('categorical_low_col {}'.format(categorical_low_col))
print('len of categorical_low_col {}'.format(len(categorical_low_col)))
print('categorical_high_col {}'.format(categorical_high_col))
print('len of categorical_high_col {}'.format(len(categorical_high_col)))

numerical_col ['MSSubClass', 'LotFrontage', 'LotArea', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd', 'MasVnrArea', 'BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', '1stFlrSF', '2ndFlrSF', 'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'TotRmsAbvGrd', 'Fireplaces', 'GarageYrBlt', 'GarageCars', 'GarageArea', 'WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea', 'MiscVal', 'MoSold', 'YrSold']
len of numerical_col 36
categorical_low_col ['MSZoning', 'Street', 'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig', 'LandSlope', 'Condition1', 'Condition2', 'BldgType', 'HouseStyle', 'RoofStyle', 'RoofMatl', 'MasVnrType', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2', 'Heating', 'HeatingQC', 'CentralAir', 'Electrical', 'KitchenQual', 'Functional', 'FireplaceQu', 'GarageType', 'GarageFinish', 'GarageQual'

## STEP 3: Processing data
Processing the data is an important step that helps the model predict more accurately. In this step we need to:
1. Create a new dataset from the columns that do not have too many missing values (the code below uses a cutoff of 1000 missing rows, roughly 70% of the 1460 training rows)
2. Check for missing data and drop the columns with too many missing values
3. Process the categorical and numerical variables
4. Create the input, output, and test sets for training and testing the model

**We will solve steps 2, 3, and 4 using a pipeline**
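
As a hedged sketch of the column filter from step 1, the snippet below expresses the cutoff as a fraction of missing values and not as an absolute count. The 0.7 threshold, the toy values, and the names `max_missing` and `kept_cols` are assumptions made for illustration; the notebook's own code compares `isnull().sum()` against 1000.

```python
import numpy as np
import pandas as pd

# Made-up frame: "Alley" is 75% missing, "Street" is 25% missing.
toy = pd.DataFrame({
    "LotArea": [8450, 9600, 11250, 9550],
    "Alley": [np.nan, np.nan, np.nan, "Grvl"],
    "Street": ["Pave", "Pave", np.nan, "Pave"],
})

max_missing = 0.7  # assumed cutoff: drop columns with 70% or more missing values

# isnull().mean() gives the fraction of missing values per column.
missing_fraction = toy.isnull().mean()
kept_cols = missing_fraction[missing_fraction < max_missing].index.tolist()

print(missing_fraction)
print("kept:", kept_cols)
```

Using a fraction keeps the rule meaningful when the train and test sets have different numbers of rows.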

In [4]:
from sklearn.model_selection import train_test_split
# drop the rows of train_data where the target SalePrice is missing
train_data.dropna(axis=0,subset=["SalePrice"],inplace=True)
# target (output) for training
y = train_data.SalePrice
# keep a column if it has fewer than 1000 missing values in train_data or in test_data
numerical_col_new = [numnew for numnew in numerical_col
                    if train_data[numnew].isnull().sum() < 1000
                    or test_data[numnew].isnull().sum() < 1000]
categorical_low_col_new = [calownew for calownew in categorical_low_col
                          if train_data[calownew].isnull().sum() < 1000
                          or test_data[calownew].isnull().sum() < 1000]
categorical_high_col_new = [cahighnew for cahighnew in categorical_high_col
                          if train_data[cahighnew].isnull().sum() < 1000
                          or test_data[cahighnew].isnull().sum() < 1000]
my_cols = numerical_col_new + categorical_low_col_new + categorical_high_col_new

# create new dataset
train_data_new = train_data[my_cols].copy()
test_data_new = test_data[my_cols].copy()

X_train, X_valid, y_train, y_valid = train_test_split(train_data_new,
                                                      y,train_size=0.8,
                                                      test_size=0.2,
                                                      random_state=0)

# check three parts
print('numerical_col_new {}'.format(numerical_col_new))
print('len of numerical_col_new {}'.format(len(numerical_col_new)))
print('categorical_low_col_new {}'.format(categorical_low_col_new))
print('len of categorical_low_col_new {}'.format(len(categorical_low_col_new)))
print('categorical_high_col_new {}'.format(categorical_high_col_new))
print('len of categorical_high_col_new {}'.format(len(categorical_high_col_new)))

numerical_col_new ['MSSubClass', 'LotFrontage', 'LotArea', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd', 'MasVnrArea', 'BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', '1stFlrSF', '2ndFlrSF', 'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'TotRmsAbvGrd', 'Fireplaces', 'GarageYrBlt', 'GarageCars', 'GarageArea', 'WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea', 'MiscVal', 'MoSold', 'YrSold']
len of numerical_col_new 36
categorical_low_col_new ['MSZoning', 'Street', 'LotShape', 'LandContour', 'Utilities', 'LotConfig', 'LandSlope', 'Condition1', 'Condition2', 'BldgType', 'HouseStyle', 'RoofStyle', 'RoofMatl', 'MasVnrType', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2', 'Heating', 'HeatingQC', 'CentralAir', 'Electrical', 'KitchenQual', 'Functional', 'FireplaceQu', 'GarageType', 'GarageFinish', 'GarageQu

**We can see**

Four columns with too many missing values were dropped from categorical_low_col_new.

**Now we will use a pipeline**

In [5]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
import category_encoders as ce
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

# preprocessing for numerical data: fill missing values with the column mean
numerical_transformer = Pipeline(verbose=False,steps=[
    ('imputer', SimpleImputer(strategy = 'mean')),
])
# preprocessing for low-cardinality categorical data:
# fill missing values with the most frequent value, then one-hot encode
categorical_low_transformer = Pipeline(verbose=False,steps=[
    ('imputer_one_hot', SimpleImputer(strategy = 'most_frequent')),
    ('one_hot',OneHotEncoder(handle_unknown = 'ignore'))
])
# preprocessing for high-cardinality categorical data:
# fill missing values with the most frequent value, then ordinal-encode
categorical_high_transformer = Pipeline(verbose = False, steps=[
    ('imputer_label',SimpleImputer(strategy = 'most_frequent')),
    ('label',ce.OrdinalEncoder())])

# bundle preprocessing for numerical and categorical data 
preprocessor = ColumnTransformer(verbose = False,transformers=[
    ('num',numerical_transformer,numerical_col_new),
    ('calow',categorical_low_transformer,categorical_low_col_new),
    ('cahigh',categorical_high_transformer,categorical_high_col_new)])

# chain preprocessing and the model, fit on all training rows, then predict the test set
my_pipeline = Pipeline(verbose = False, steps = [
    ('preprocessor',preprocessor),
    ('model',RandomForestRegressor(n_estimators=100,random_state=0))])
my_pipeline.fit(train_data_new,y)
test_preds = my_pipeline.predict(test_data_new)

# searching for the best number of trees (n_estimators) did not work well, so it is left commented out
# tryto = range(500)
# values = []
# for i in tryto:
#     values.append(i)
# del values[0:3]
# def get_mae(value, X_train, X_valid, y_train, y_valid):
#     model_test = RandomForestRegressor(n_estimators = value,random_state=0)
#     test_pipeline = Pipeline(verbose = False, steps = [
#         ('preprocessor',preprocessor),
#         ('model_test',model_test)])
#     test_pipeline.fit(X_train,y_train)
#     predict = test_pipeline.predict(X_valid)
#     mean = mean_absolute_error(y_valid, predict)
#     return mean
# compare = {value:get_mae(value,X_train, X_valid, y_train, y_valid) for value in values}    
# best_tree_size = min(compare, key = compare.get)
# print(best_tree_size)
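
The idea behind the commented-out search can be sketched on a much smaller scale: fit the same kind of pipeline on a training split, score it on a validation split with mean absolute error, and compare a handful of `n_estimators` values. Everything below is an assumption made for illustration: the data is synthetic (not the housing dataset), the column names `area` and `street` are invented, and only three candidate values are tried.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in for the housing data.
rng = np.random.default_rng(0)
n_rows = 200
X = pd.DataFrame({
    "area": rng.uniform(50, 250, n_rows),
    "street": rng.choice(["Pave", "Grvl"], n_rows),
})
X.iloc[::10, 0] = np.nan  # inject missing values into "area"
y = 1000 * X["area"].fillna(150) + rng.normal(0, 5000, n_rows)

# Same structure as the notebook's preprocessor, reduced to two columns.
preprocessor = ColumnTransformer(transformers=[
    ("num", SimpleImputer(strategy="mean"), ["area"]),
    ("cat", Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("one_hot", OneHotEncoder(handle_unknown="ignore")),
    ]), ["street"]),
])

X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, train_size=0.8, random_state=0)

def get_mae(n_estimators):
    """Fit a pipeline with the given forest size and return the validation MAE."""
    pipeline = Pipeline(steps=[
        ("preprocessor", preprocessor),
        ("model", RandomForestRegressor(n_estimators=n_estimators, random_state=0)),
    ])
    pipeline.fit(X_train, y_train)
    return mean_absolute_error(y_valid, pipeline.predict(X_valid))

candidates = [10, 50, 100]
scores = {n: get_mae(n) for n in candidates}
best_n_estimators = min(scores, key=scores.get)
print(scores)
print("best n_estimators:", best_n_estimators)
```

Trying a few well-spaced values keeps the run short; a single validation split is noisy, so small differences between candidates should not be over-interpreted.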



## STEP 4: Create submission.csv

In [6]:
output = pd.DataFrame({'Id': test_data_new.index,
                       'SalePrice': test_preds})
output.to_csv('submission.csv', index=False)
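
Before submitting, it can help to reload the file and check its shape. The sketch below does this with illustrative ids and prices written to a temporary directory; in the notebook the values come from `test_data_new.index` and `test_preds`, and the check names are assumptions made for this example.

```python
import os
import tempfile

import pandas as pd

# Illustrative ids and predictions (not real model output).
output = pd.DataFrame({"Id": [1461, 1462, 1463],
                       "SalePrice": [120000.0, 155000.5, 180250.0]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "submission.csv")
    output.to_csv(path, index=False)  # index=False keeps only the Id and SalePrice columns
    reloaded = pd.read_csv(path)

# Simple sanity checks on the reloaded file.
has_expected_columns = list(reloaded.columns) == ["Id", "SalePrice"]
has_no_missing = not reloaded.isnull().values.any()
has_unique_ids = reloaded["Id"].is_unique
print(has_expected_columns, has_no_missing, has_unique_ids)
```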