${\large \textbf{Predict House Price}}$ 

The data set contains information one might consider when buying a house, such as the year built, the number of bedrooms, and the overall quality of the house. The training set also contains the sale price of each house. We wish to predict the sale price of each house in the test set.

In [1]:
#import
import os
import pandas as pd
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

In [2]:
train_set = pd.read_csv("train.csv")
train_set.head()

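| | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 60 | RL | 65.0 | 8450 | Pave | | Reg | Lvl | AllPub | ... | 0 | | | | 0 | 2 | 2008 | WD | Normal | 208500 |
| 1 | 2 | 20 | RL | 80.0 | 9600 | Pave | | Reg | Lvl | AllPub | ... | 0 | | | | 0 | 5 | 2007 | WD | Normal | 181500 |
| 2 | 3 | 60 | RL | 68.0 | 11250 | Pave | | IR1 | Lvl | AllPub | ... | 0 | | | | 0 | 9 | 2008 | WD | Normal | 223500 |
| 3 | 4 | 70 | RL | 60.0 | 9550 | Pave | | IR1 | Lvl | AllPub | ... | 0 | | | | 0 | 2 | 2006 | WD | Abnorml | 140000 |
| 4 | 5 | 60 | RL | 84.0 | 14260 | Pave | | IR1 | Lvl | AllPub | ... | 0 | | | | 0 | 12 | 2008 | WD | Normal | 250000 |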


We separate the target variable (SalePrice) from the predictor variables and keep it as the labels.

In [3]:
housing_labels = train_set["SalePrice"].copy()

housing=train_set.drop("SalePrice",axis = 1)

In [4]:
def makeindex(housing):

    # The Id column is not related to the sale price,
    # so we remove it from the predictor variables and keep it separately.
    test_index = housing["Id"]
    housing = housing.drop("Id", axis=1)

    # Some features have missing values.
    # We drop every feature with fewer than 1000 non-missing values;
    # the missing values in the remaining features are imputed later.
    print("Deleted columns:")
    non_null = housing.count()
    sparse_columns = non_null[non_null < 1000].index
    for column in sparse_columns:
        print(column)
    housing = housing.drop(sparse_columns, axis=1)

    # Collect the text (object-dtype) features.
    objects = [column for column in housing.columns
               if housing[column].dtype == object]

    # Split the data into a numeric part and a categorical (text) part.
    housing_num = housing.drop(objects, axis=1)
    housing_cat = housing[objects]
    num_columns = housing_num.columns
    cat_columns = housing_cat.columns

    return [test_index, num_columns, cat_columns, housing_num, housing_cat, housing]

[test_index, num_columns, cat_columns, housing_num, housing_cat, housing]= makeindex(housing)

Deleted columns:
Alley
FireplaceQu
PoolQC
Fence
MiscFeature


For the numeric set, we compute the mean of each feature and replace missing values with that mean. Then, using ${\small \textbf{StandardScaler}}$, we standardize the features.
For the categorical set, we replace missing values with the most frequent value in each column. Then, using ${\small \textbf{OneHotEncoder}}$, we translate the categorical features into a sparse matrix of one-hot indicator columns. A sparse matrix stores only the positions of the non-zero values.
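
To make the two preprocessing recipes described above concrete, here is a minimal sketch on a toy frame. The values are hypothetical and are not taken from train.csv; only the column names are borrowed from the data set. It applies each step separately so the intermediate results can be inspected.

```python
import numpy as np
import pandas as pd
from scipy import sparse
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data (hypothetical values, not taken from train.csv).
toy = pd.DataFrame({
    "LotArea": [8000.0, np.nan, 12000.0, 10000.0],
    "Street": pd.Series(["Pave", "Grvl", np.nan, "Pave"], dtype=object),
})

# Numeric column: fill the gap with the column mean, then standardize.
num_filled = SimpleImputer(strategy="mean").fit_transform(toy[["LotArea"]])
num_scaled = StandardScaler().fit_transform(num_filled)

# Categorical column: fill the gap with the most frequent value,
# then one-hot encode into a sparse matrix.
cat_filled = SimpleImputer(strategy="most_frequent").fit_transform(toy[["Street"]])
encoder = OneHotEncoder()
cat_encoded = encoder.fit_transform(cat_filled)

print(num_filled.ravel())              # missing LotArea replaced by the mean
print(num_scaled.mean(), num_scaled.std())
print(encoder.categories_)             # one indicator column per category
print(sparse.issparse(cat_encoded))    # the encoder returns a sparse matrix
print(cat_encoded.toarray())           # dense view, for inspection only
```

Calling `toarray()` is only reasonable on a small example like this one; on the full data set the sparse form is what keeps the many indicator columns cheap to store.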

In [5]:
num_tran = Pipeline(steps =[
    ('imputer', SimpleImputer(strategy = 'mean')),
    ('standard', StandardScaler())
])


cat_tran = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy = 'most_frequent')),
    ('onehot', OneHotEncoder())
])

#combine numeric and categorical set
preprocessor = ColumnTransformer(
    transformers = [
        ('num', num_tran, num_columns),
        ('cat', cat_tran, cat_columns)
    ])

In [6]:
#fit a linear regression model on the training set

housing_prepared = preprocessor.fit_transform(housing)

lin_reg = LinearRegression()
lin_reg.fit(housing_prepared, housing_labels)

LinearRegression()

We evaluate the model on the training set with the root mean squared error (RMSE) between the logarithm of the predicted sale price and the logarithm of the observed sale price.

In [7]:
housing_predictions = lin_reg.predict(housing_prepared)
lin_mse = mean_squared_error(np.log(housing_labels), np.log(housing_predictions))
lin_rmse = np.sqrt(lin_mse)
lin_rmse

0.11162752289307662
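
Written out, the metric used above is $\sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\log \hat{y}_i - \log y_i\right)^2}$, where $y_i$ is the observed price and $\hat{y}_i$ the predicted one. The sketch below wraps it in a helper named `rmsle`, a hypothetical name that does not appear in the notebook. Because a linear model can in principle predict a non-positive price, for which the logarithm is undefined, the helper raises an error in that case; this check is an assumption added here, not something the notebook does.

```python
import numpy as np

def rmsle(observed, predicted):
    """Root mean squared error between log(observed) and log(predicted)."""
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    # The logarithm is undefined for non-positive values, so fail loudly.
    if np.any(observed <= 0) or np.any(predicted <= 0):
        raise ValueError("rmsle requires strictly positive prices")
    log_error = np.log(predicted) - np.log(observed)
    return float(np.sqrt(np.mean(log_error ** 2)))

# Hypothetical prices: every prediction is 10% too high.
observed = np.array([100000.0, 200000.0, 300000.0])
predicted = observed * 1.1
print(rmsle(observed, predicted))  # approximately log(1.1)
```

Working on the log scale means the error depends on the ratio between prediction and observation, so a 10% miss on a cheap house counts the same as a 10% miss on an expensive one.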

In [8]:
#predict on the test set
test_set = pd.read_csv("test.csv")

housing_test=test_set.copy()

In [9]:
[test_index, num_columns, cat_columns, housing_num, housing_cat, housing_result] = makeindex(housing_test)
housing_prepared_test = preprocessor.transform(housing_result)
test_predict = lin_reg.predict(housing_prepared_test)

Deleted columns:
Alley
FireplaceQu
PoolQC
Fence
MiscFeature


We pair each predicted sale price with the corresponding Id from the test set.

In [10]:
output = pd.DataFrame({'Id': test_index,
                       'SalePrice':test_predict})
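
The cell above only builds the table in memory. A minimal sketch of writing such a table to disk follows, assuming a CSV file is wanted; the file name `submission.csv`, the temporary directory, and the toy values are all assumptions made for illustration.

```python
import os
import tempfile
import pandas as pd

# Hypothetical stand-ins for test_index and test_predict.
ids = pd.Series([1, 2, 3], name="Id")
predictions = [120000.0, 155000.5, 180250.25]

output = pd.DataFrame({"Id": ids, "SalePrice": predictions})

# "submission.csv" is an assumed file name, written to a temporary directory.
# index=False leaves out the DataFrame's row index, keeping only Id and SalePrice.
path = os.path.join(tempfile.mkdtemp(), "submission.csv")
output.to_csv(path, index=False)

# Read the file back to confirm what was written.
saved = pd.read_csv(path)
print(saved)
```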