# Introduction
Machine learning competitions are a great way to improve your data science skills and measure your progress.

In this project, you will create and submit predictions for a Kaggle competition. You can then improve your model (e.g. by adding features) and see how you stack up against others taking part in the competition.

The steps in this notebook are:
1. Build a Random Forest model with all your data (**X** and **y**).
2. Read in the 'test' data, which doesn't include values for the target. Predict home values in the test data with your Random Forest model.
3. Submit those predictions to the competition and see your score.
4. Optionally, come back to see if you can improve your model by adding features or changing your model. Then you can resubmit and see how that stacks up on the competition leaderboard.

In [1]:
# List the files available in the input and output directories
import os
print('The path file input:')
for dirname, _,filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname,filename))

print('The path file output:')
for dirname1, _,filenames1 in os.walk('/kaggle/working'):
    for filename1 in filenames1:
        print(os.path.join(dirname1,filename1))

The path file input:
/kaggle/input/home-data-for-ml-course/train.csv
/kaggle/input/home-data-for-ml-course/data_description.txt
/kaggle/input/home-data-for-ml-course/train.csv.gz
/kaggle/input/home-data-for-ml-course/sample_submission.csv
/kaggle/input/home-data-for-ml-course/sample_submission.csv.gz
/kaggle/input/home-data-for-ml-course/test.csv.gz
/kaggle/input/home-data-for-ml-course/test.csv
The path file output:
/kaggle/working/__notebook__.ipynb
/kaggle/working/__output__.json


## Recap
Here is the code I've written so far. Start by running it again.

## Step 1: Read data
In this step we need to:
1. Import the libraries needed for training.
2. Read the train and test files.
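
As a minimal sketch of what reading with `index_col = 'Id'` does, the example below parses a tiny in-memory CSV instead of the real competition file. The rows in `csv_text` and the name `toy_train` are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical stand-in for train.csv: a tiny CSV held in memory.
csv_text = """Id,LotArea,Street,SalePrice
1,8450,Pave,208500
2,9600,Pave,181500
3,11250,Grvl,223500
"""

# index_col='Id' turns the Id column into the row index,
# so it is not treated as a feature later on.
toy_train = pd.read_csv(io.StringIO(csv_text), index_col='Id')

print(toy_train.shape)        # rows x remaining columns
print(list(toy_train.index))  # the Id values
```

Keeping `Id` as the index is also what lets the submission step later recover the test ids from `X_test.index`.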

In [2]:
# Import the libraries
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Path of the file to read.
file_path_home = '/kaggle/input/home-data-for-ml-course/train.csv'
file_path_test = '/kaggle/input/home-data-for-ml-course/test.csv'
home_train_data = pd.read_csv(file_path_home,index_col = 'Id')
home_test_data = pd.read_csv(file_path_test, index_col = 'Id')
train_data = home_train_data.copy()
test_data = home_test_data.copy()

In [3]:
print(test_data.shape)
print(train_data.shape)

(1459, 79)
(1460, 80)


## Step 2: Processing data
Processing the data is an important step that helps your model predict more accurately. In this step we need to:
1. Check for missing data and drop columns with too many missing values.
2. Process the categorical and numerical variables.
3. Create the input, output, and test sets used to train and test your model.
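
The first item, counting missing values and dropping columns that exceed a limit, can be sketched on a toy frame as follows. The data and the threshold `max_missing` are assumptions chosen so the effect is visible on four rows; they are not values from this notebook.

```python
import numpy as np
import pandas as pd

# Toy data: one complete column, one with a single gap, one mostly empty.
toy = pd.DataFrame({
    'LotArea': [8450, 9600, 11250, 9550],
    'LotFrontage': [65.0, np.nan, 68.0, 60.0],
    'PoolQC': [np.nan, np.nan, np.nan, 'Gd'],
})

# Number of missing values per column.
missing_counts = toy.isnull().sum()

# Hypothetical threshold: drop a column when more than 2 values are missing.
max_missing = 2
cols_to_drop = missing_counts[missing_counts > max_missing].index.tolist()
toy_reduced = toy.drop(columns=cols_to_drop)

print(missing_counts)
print(cols_to_drop)
```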

In [4]:
# Drop rows in train_data where the target SalePrice is missing
train_data.dropna(axis=0,subset=["SalePrice"],inplace=True)

# Classify numerical and categorical variables

# numerical columns
numeric_col = [name for name in train_data.columns 
               if train_data[name].dtype in ['int64','float64']]
numeric_col.remove("SalePrice")

# categorical columns
categoric_col = [name1 for name1 in train_data.columns
                if train_data[name1].dtype in ['object']]

In [5]:
print(numeric_col)
print(len(numeric_col))
print('--------------------------------------------------------------------------')
print(categoric_col)
print(len(categoric_col))

['MSSubClass', 'LotFrontage', 'LotArea', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd', 'MasVnrArea', 'BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', '1stFlrSF', '2ndFlrSF', 'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'TotRmsAbvGrd', 'Fireplaces', 'GarageYrBlt', 'GarageCars', 'GarageArea', 'WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea', 'MiscVal', 'MoSold', 'YrSold']
36
--------------------------------------------------------------------------
['MSZoning', 'Street', 'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig', 'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType', 'HouseStyle', 'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2', 'Heating', 'HeatingQC', 'CentralAir', 'Electrical', 'KitchenQual', 'Func

In [6]:
# create data of categorical variables and numerical variables
categorical_data_train = train_data.copy()
categorical_data_test = test_data.copy()
numerical_data_train = train_data.copy()
numerical_data_test = test_data.copy()

In [7]:
print('categorical_data_train: {}'.format(categorical_data_train.shape))
print('categorical_data_test: {}'.format(categorical_data_test.shape))
print('numerical_data_train: {}'.format(numerical_data_train.shape))
print('numerical_data_test: {}'.format(numerical_data_test.shape))

categorical_data_train: (1460, 80)
categorical_data_test: (1459, 79)
numerical_data_train: (1460, 80)
numerical_data_test: (1459, 79)


**The code below explains: *How to handle categorical variables***

1. We will work with categorical_data_train and categorical_data_test, which I created in the code above by copying train_data and test_data.

2. The categorical columns are listed in categoric_col.

3. We need to create a dataset that contains only categorical variables.
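
One way to isolate the categorical columns is sketched below on a toy frame. It uses `select_dtypes` rather than the dtype list comprehension used in this notebook; that swap is my choice for the illustration, and the column values are made up.

```python
import pandas as pd

toy = pd.DataFrame({
    'LotArea': [8450, 9600, 11250],
    'Street': ['Pave', 'Pave', 'Grvl'],
    'Neighborhood': ['CollgCr', 'Veenker', 'CollgCr'],
    'SalePrice': [208500, 181500, 223500],
})

# Remove the target first so it is not counted as a feature.
features = toy.drop(columns=['SalePrice'])

# Numeric columns are the ones pandas treats as numbers;
# everything else is handled as categorical here.
numeric_cols = features.select_dtypes(include='number').columns.tolist()
categoric_cols = features.select_dtypes(exclude='number').columns.tolist()

# A dataset that contains only the categorical variables.
categorical_only = features[categoric_cols]

print(numeric_cols)
print(categoric_cols)
```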

In [8]:
# processing with categorical variables

# Create a dataset with only categorical variables
categorical_data_train.drop(numeric_col,axis = 1, inplace = True)
categorical_data_train.drop(["SalePrice"],axis = 1, inplace = True)
categorical_data_test.drop(numeric_col,axis = 1, inplace = True)

# Find the columns with missing values in either dataset and drop them from both
missing_train_categorical_col = [co 
    for co in categorical_data_train.columns 
                           if categorical_data_train[co].isnull().sum()> 0]
missing_test_categorical_col = [co1 
    for co1 in categorical_data_test.columns 
                           if categorical_data_test[co1].isnull().sum()> 0]
a=[]
for ch in missing_train_categorical_col:
    a.append(ch)
for ch1 in missing_test_categorical_col:
    a.append(ch1)
    
missing_categorical_col = list(set(a))
categorical_data_train.drop(missing_categorical_col,axis = 1, inplace = True)
categorical_data_test.drop(missing_categorical_col,axis = 1, inplace = True)

# Columns with more than 10 unique values are label encoded; the rest are one-hot encoded
label_nunique_categorical = []
for i in categorical_data_train.columns:
    if categorical_data_train[i].nunique() >10:
        label_nunique_categorical.append(i)
onehot_nunique_categorical = list(set(categorical_data_train.columns)-set(label_nunique_categorical))

In [9]:
print('categorical_data_train: {}'.format(categorical_data_train.shape))
print('categorical_data_test: {}'.format(categorical_data_test.shape))
print(missing_train_categorical_col)
print(missing_test_categorical_col)
print(label_nunique_categorical)
print(len(label_nunique_categorical))
print(onehot_nunique_categorical)
print(len(onehot_nunique_categorical))

categorical_data_train: (1460, 20)
categorical_data_test: (1459, 20)
['Alley', 'MasVnrType', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2', 'Electrical', 'FireplaceQu', 'GarageType', 'GarageFinish', 'GarageQual', 'GarageCond', 'PoolQC', 'Fence', 'MiscFeature']
['MSZoning', 'Alley', 'Utilities', 'Exterior1st', 'Exterior2nd', 'MasVnrType', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2', 'KitchenQual', 'Functional', 'FireplaceQu', 'GarageType', 'GarageFinish', 'GarageQual', 'GarageCond', 'PoolQC', 'Fence', 'MiscFeature', 'SaleType']
['Neighborhood']
1
['LotConfig', 'LotShape', 'BldgType', 'SaleCondition', 'LandSlope', 'RoofStyle', 'RoofMatl', 'LandContour', 'ExterCond', 'HouseStyle', 'ExterQual', 'Street', 'HeatingQC', 'Foundation', 'PavedDrive', 'Heating', 'Condition1', 'CentralAir', 'Condition2']
19
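
The split by number of unique values, and the one-hot encoding it feeds, can be sketched on toy data. The cut-off `max_onehot_levels = 2` is an assumption that suits three rows (this notebook uses 10), and the category `'Dirt'` is invented to show what `handle_unknown='ignore'` does with a value that only appears in the test set. The sketch keeps the encoder's default sparse output and converts it with `.toarray()`.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

toy_train = pd.DataFrame({'Street': ['Pave', 'Grvl', 'Pave'],
                          'Neighborhood': ['CollgCr', 'Veenker', 'NoRidge']},
                         index=[1, 2, 3])
toy_test = pd.DataFrame({'Street': ['Pave', 'Dirt'],
                         'Neighborhood': ['Veenker', 'CollgCr']},
                        index=[4, 5])

# Hypothetical cut-off for this tiny example.
max_onehot_levels = 2
low_card = [c for c in toy_train.columns
            if toy_train[c].nunique() <= max_onehot_levels]
high_card = [c for c in toy_train.columns
             if toy_train[c].nunique() > max_onehot_levels]

# Fit on the training data only, then reuse the fitted encoder on the test data.
encoder = OneHotEncoder(handle_unknown='ignore')
oh_train = pd.DataFrame(encoder.fit_transform(toy_train[low_card]).toarray(),
                        index=toy_train.index)
oh_test = pd.DataFrame(encoder.transform(toy_test[low_card]).toarray(),
                       index=toy_test.index)

print(low_card, high_card)
print(oh_test)  # the unseen 'Dirt' row is encoded as all zeros
```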


In [10]:
# Label-encode the columns with many unique values
from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()
for lab in label_nunique_categorical:
    categorical_data_train[lab] = label_encoder.fit_transform(
        categorical_data_train[lab])
    categorical_data_test[lab] = label_encoder.transform(
        categorical_data_test[lab])

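# One-hot encode the remaining columns
from sklearn.preprocessing import OneHotEncoder
# Note: scikit-learn 1.2 renamed `sparse` to `sparse_output`, and 1.4 removed
# the old name; on those versions pass sparse_output=False instead.
OH_Encoder = OneHotEncoder(handle_unknown = 'ignore', sparse = False)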
OH_col_train = pd.DataFrame(OH_Encoder.fit_transform(
    categorical_data_train[onehot_nunique_categorical]))
OH_col_test = pd.DataFrame(OH_Encoder.transform(
    categorical_data_test[onehot_nunique_categorical]))

# OneHotEncoder drops the index, so put it back
OH_col_train.index = categorical_data_train.index
OH_col_test.index = categorical_data_test.index

In [11]:
print(categorical_data_train.shape)
print(categorical_data_test.shape)

(1460, 20)
(1459, 20)


In [12]:
# Drop the original columns that were one-hot encoded
num_categorical_data_train = categorical_data_train.drop(onehot_nunique_categorical,
                                                         axis = 1)
num_categorical_data_test = categorical_data_test.drop(onehot_nunique_categorical,
                                                       axis = 1)

# Add the new one-hot columns
complete_categorical_data_train = pd.concat((num_categorical_data_train,
                                             OH_col_train),axis = 1)
complete_categorical_data_test = pd.concat((num_categorical_data_test,
                                            OH_col_test),axis = 1)

In [13]:
print(num_categorical_data_train.shape)
print(num_categorical_data_test.shape)
print(complete_categorical_data_train.shape)
print(complete_categorical_data_test.shape)
# complete_categorical_data_train.head()
# complete_categorical_data_test.head()

(1460, 1)
(1459, 1)
(1460, 100)
(1459, 100)


In [14]:
# Here we have a problem: how do we combine the numerical and categorical
# columns when some of them have missing values?
# There are two ways to solve it:
# 1. Drop every column that has missing values
# 2. Fill the missing values with the column mean


# So we try both ways and choose the more accurate model.
# For way 2: I checked label_nunique_categorical and it can be used
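
A hedged sketch of how the two options could be compared on a validation split is shown below. Everything in it is synthetic: the features `area` and `age`, the price formula, and the 20% missing rate are assumptions for illustration, so the scores say nothing about the competition data.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic data with a known relationship between features and target.
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({'area': rng.uniform(50, 200, n),
                  'age': rng.uniform(0, 50, n)})
y = 1000 * X['area'] - 500 * X['age'] + rng.normal(0, 1000, n)

# Knock out roughly 20% of the 'age' values.
mask = rng.random(n) < 0.2
X.loc[mask, 'age'] = np.nan

X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.2, random_state=0)

def score(train_X, valid_X):
    """Fit a small forest and return the validation MAE."""
    model = RandomForestRegressor(n_estimators=20, random_state=0)
    model.fit(train_X, y_tr)
    return mean_absolute_error(y_va, model.predict(valid_X))

# Option 1: drop every column that has missing values.
cols_with_missing = [c for c in X.columns if X[c].isnull().any()]
mae_drop = score(X_tr.drop(columns=cols_with_missing),
                 X_va.drop(columns=cols_with_missing))

# Option 2: fill missing values with the training-set mean.
train_means = X_tr.mean()
mae_fill = score(X_tr.fillna(train_means), X_va.fillna(train_means))

print('MAE, drop columns:', mae_drop)
print('MAE, mean fill:   ', mae_fill)
```

Whichever option gives the lower validation MAE on the real data is the one to keep.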


**The code below explains: *How to handle numerical variables***

1. We will work with numerical_data_train and numerical_data_test, which I created in the code above by copying train_data and test_data.

2. The numerical columns are listed in numeric_col.

3. We need to create a dataset that contains only numerical variables.
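
Mean filling for numerical columns can be sketched as below. Note one deliberate difference from this notebook: the sketch fills the test set with the *training* means, a common alternative that keeps the two sets on the same footing, whereas the notebook fills each set with its own means. The numbers are made up.

```python
import numpy as np
import pandas as pd

num_train = pd.DataFrame({'LotFrontage': [60.0, np.nan, 80.0],
                          'GarageYrBlt': [2000.0, 1990.0, np.nan]})
num_test = pd.DataFrame({'LotFrontage': [np.nan, 70.0],
                         'GarageYrBlt': [np.nan, 2005.0]})

# mean() skips NaN values, so each mean is computed from the known entries.
train_means = num_train.mean()

# fillna with a Series fills each column with the matching entry.
filled_train = num_train.fillna(train_means)
filled_test = num_test.fillna(train_means)

print(train_means)
print(filled_test)
```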

In [15]:
print(numeric_col)
print(len(numeric_col))

['MSSubClass', 'LotFrontage', 'LotArea', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd', 'MasVnrArea', 'BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', '1stFlrSF', '2ndFlrSF', 'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'TotRmsAbvGrd', 'Fireplaces', 'GarageYrBlt', 'GarageCars', 'GarageArea', 'WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea', 'MiscVal', 'MoSold', 'YrSold']
36


In [16]:
print(numerical_data_train.shape)
print(numerical_data_test.shape)

(1460, 80)
(1459, 79)


In [17]:
print(categoric_col)
print(len(categoric_col))

['MSZoning', 'Street', 'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig', 'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType', 'HouseStyle', 'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2', 'Heating', 'HeatingQC', 'CentralAir', 'Electrical', 'KitchenQual', 'Functional', 'FireplaceQu', 'GarageType', 'GarageFinish', 'GarageQual', 'GarageCond', 'PavedDrive', 'PoolQC', 'Fence', 'MiscFeature', 'SaleType', 'SaleCondition']
43


In [18]:
# processing with numerical variables

# Create a dataset with only numerical variables
numerical_data_train.drop(categoric_col,axis = 1, inplace = True)
numerical_data_train.drop(["SalePrice"],axis = 1, inplace = True)
numerical_data_test.drop(categoric_col,axis = 1, inplace = True)

# Find the columns with more than 1000 missing values in either dataset
missing_train_numerical_col = [co2 
    for co2 in numerical_data_train.columns 
                           if numerical_data_train[co2].isnull().sum()> 1000]
missing_test_numerical_col = [co3 
    for co3 in numerical_data_test.columns 
                           if numerical_data_test[co3].isnull().sum()>1000]
b=[]
for ch2 in missing_train_numerical_col:
    b.append(ch2)
for ch3 in missing_test_numerical_col:
    b.append(ch3)
    
missing_numerical_col = list(set(b))
numerical_data_train.drop(missing_numerical_col,axis = 1, inplace = True)
numerical_data_test.drop(missing_numerical_col,axis = 1, inplace = True)

# Fill missing values with the mean of each column
complete_numerical_data_train = numerical_data_train.fillna(
    numerical_data_train.mean())
complete_numerical_data_test = numerical_data_test.fillna(
    numerical_data_test.mean())


In [19]:
print('numerical_data_train: {}'.format(numerical_data_train.shape))
print('numerical_data_test: {}'.format(numerical_data_test.shape))
print(missing_train_numerical_col)
print(len(missing_train_numerical_col))
print(missing_test_numerical_col)
print(len(missing_test_numerical_col))
print('complete_numerical_data_train: {}'
      .format(complete_numerical_data_train.shape))
print('complete_numerical_data_test: {}'
      .format(complete_numerical_data_test.shape))
# Count the remaining missing values in the data (not in the column labels)
print(complete_numerical_data_train.isnull().sum().sum())
print(complete_numerical_data_test.isnull().sum().sum())
# numerical_data_train.head()
# numerical_data_test.head()
# complete_numerical_data_train.head()
# complete_numerical_data_test.head()

numerical_data_train: (1460, 36)
numerical_data_test: (1459, 36)
[]
0
[]
0
complete_numerical_data_train: (1460, 36)
complete_numerical_data_test: (1459, 36)
0
0


**Now we have to combine the two datasets: numerical and categorical**
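
A minimal sketch of the column-wise combination is shown below, using two tiny frames that share the same `Id` index. The final step, casting the column labels to strings, is an addition of mine and not part of this notebook: one-hot columns come back with integer labels, and recent scikit-learn versions may reject a frame whose column names mix integers and strings.

```python
import pandas as pd

ids = pd.Index([1, 2], name='Id')

# Hypothetical numerical part and one-hot encoded part for the same rows.
numeric_part = pd.DataFrame({'LotArea': [8450, 9600]}, index=ids)
encoded_part = pd.DataFrame({0: [1.0, 0.0], 1: [0.0, 1.0]}, index=ids)

# axis=1 places the frames side by side, aligning rows on the index.
combined = pd.concat([numeric_part, encoded_part], axis=1)

# Make all column labels strings so the names have a single type.
combined.columns = combined.columns.astype(str)

print(combined)
```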

In [20]:
dataset_train = pd.concat((complete_numerical_data_train
                           ,complete_categorical_data_train)
                          ,axis = 1)
dataset_test = pd.concat((complete_numerical_data_test
                           ,complete_categorical_data_test)
                          ,axis = 1)

In [21]:
print('dataset_train: {}'.format(dataset_train.shape))
print('dataset_test: {}'.format(dataset_test.shape))
# dataset_test.head()
# dataset_train.head()

dataset_train: (1460, 136)
dataset_test: (1459, 136)


## Step 3: Build the model

Now that we have the full train and test datasets, we can build the model.
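
The general pattern for this step (hold out a validation set, compare a few `max_leaf_nodes` values by mean absolute error, then refit on all the data) can be sketched on synthetic data. The data, the candidate list, and `n_estimators=20` are assumptions made to keep the example small and fast.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic regression problem standing in for the housing data.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 3))
y = 3 * X[:, 0] + X[:, 1] ** 2 + rng.normal(0, 1, 300)

X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, train_size=0.8, test_size=0.2, random_state=0)

def get_mae(max_leaf_nodes):
    """Train on the training split and return the MAE on the validation split."""
    model = RandomForestRegressor(n_estimators=20,
                                  max_leaf_nodes=max_leaf_nodes,
                                  random_state=1)
    model.fit(X_train, y_train)
    return mean_absolute_error(y_valid, model.predict(X_valid))

# Hypothetical candidate sizes; a real search could cover a wider range.
candidates = [5, 25, 50, 100]
scores = {size: get_mae(size) for size in candidates}
best_tree_size = min(scores, key=scores.get)

# Refit on all available data with the chosen size before predicting.
final_model = RandomForestRegressor(n_estimators=20,
                                    max_leaf_nodes=best_tree_size,
                                    random_state=1)
final_model.fit(X, y)

print(scores)
print(best_tree_size)
```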

In [22]:
X = dataset_train.copy()
X_test = dataset_test.copy()
y = train_data.SalePrice

X_train, X_valid, y_train, y_valid = train_test_split(X, y,
                                                      train_size=0.8, test_size=0.2,
                                                      random_state=0)

In [23]:
print(y.shape)
print(X.shape)
print(X_test.shape)
y.head()

(1460,)
(1460, 136)
(1459, 136)


Id
1    208500
2    181500
3    223500
4    140000
5    250000
Name: SalePrice, dtype: int64

In [24]:
# tryto = range(1000)
# values = []
# for i in tryto:
#     values.append(i)
# del values[0:3]
# def get_mae(value,train_X, val_X, train_y, val_y):
#     model = RandomForestRegressor(max_leaf_nodes = value, random_state=1)
#     model.fit(train_X,train_y)
#     predict = model.predict(val_X)
#     mean = mean_absolute_error(val_y, predict)
#     return mean
# compare = {value:get_mae(value,X_train, X_valid, y_train, y_valid) 
#            for value in values}
# best_tree_size = min(compare, key = compare.get)
# print(best_tree_size)

In [25]:
rf_model_on_full_data = RandomForestRegressor(max_leaf_nodes=195, random_state=0)
rf_model_on_full_data.fit(X,y)
test_preds = rf_model_on_full_data.predict(X_test)



In [26]:
output = pd.DataFrame({'Id': X_test.index,
                       'SalePrice': test_preds})
output.to_csv('submission.csv', index=False)
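
To check what the submission file looks like, the sketch below writes a few made-up predictions to an in-memory buffer instead of `submission.csv`. The ids and prices are invented for illustration.

```python
import io
import pandas as pd

# Hypothetical test ids and predictions.
test_index = pd.Index([1461, 1462, 1463], name='Id')
test_preds = [120000.0, 155000.5, 180250.0]

output = pd.DataFrame({'Id': test_index, 'SalePrice': test_preds})

# index=False keeps the default 0..n-1 row index out of the file,
# so the file has exactly two columns: Id and SalePrice.
buffer = io.StringIO()
output.to_csv(buffer, index=False)
csv_lines = buffer.getvalue().splitlines()

print(csv_lines[0])    # header row
print(len(csv_lines))  # header plus one line per prediction
```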

In [27]:

# Check missing values and drop column with too much missing values
# missing_values = {col:train_data[col].isnull().sum()
#                   for col in train_data.columns if train_data[col].isnull().any()}
# missing_values_col = list(set(missing_values))
# for i in missing_values_col:
#     if missing_values[i] >= 1000:
# #         print('hell')
#         train_data.drop(i, axis= 1, inplace = True )
# #         print('llo')
#         test_data.drop(i, axis= 1, inplace = True )

# # Processing numerical variables
# if train_data['SalePrice'].isnull().sum() > 0:
#     print('the SalePrice has missing values')
#     train_data.dropna(axis = 0, subset= ['SalePrice'],inplace= True)
# numeric_col = [name for name in train_data.columns 
#               if train_data[name].dtype in ['int64','float64']]
# # create a array for check how many missing values in each columns
# missing_numeric_col = [co 
#     for co in numeric_col if train_data[co].isnull().any()]
# for drc in missing_numeric_col:
#     if train_data[drc].isnull().sum() >= 1000:
#         print('delete') # code to check loop
#         train_data.drop(drc,axis = 1 , inplace = True)
# # Processing categorical variables
# the_categorical_column = train_data.select_dtypes(include = ['object'])
# print(len(the_categorical_column.columns))
# missing_categorical_column=['{}: {}'.format(nam,the_categorical_column[nam].isnull().sum())
#                             for nam in the_categorical_column.columns 
#                             if the_categorical_column[nam].isnull().any()]
# print(missing_categorical_column)
# print(len(missing_categorical_column))

In [28]:
# print(numeric_col)
# train_data[numeric_col].head()