# Dive Deeper

**Additional material for Week 5: Introduction to Machine Learning 1**

📖 Version: Ultron Day Online - Jan 2023

## House Price Prediction using Regression 🏡💸
 
### Dataset Problems 
This dataset contains house sale prices for King County, which includes Seattle. It covers homes sold between May 2014 and May 2015. The goal is to build a simple regression model that predicts the house price.

🎯 We need to:
* Explore and preprocess the data set for regression models.
* Build regression models and run several experiments with Support Vector Regression (SVR).

**Data Sources**
- [House Sales in King County, USA](https://www.kaggle.com/datasets/harlfoxem/housesalesprediction).

### Import Library

In [23]:
# import library 
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sklearn

from sklearn.model_selection import train_test_split
from sklearn import svm

# set option
pd.set_option('display.float_format', lambda x: '%.2f' % x)  # display floats with 2 decimal places

## Load Data
*Read the file `house_price_adj.csv` and store it in a variable named `house_df`.*

In [2]:
# code here
house_df = pd.read_csv('house_price_adj.csv')

In [3]:
house_df.head()

| | id | date | price | bedrooms | bathrooms | sqft_lot | floors | waterfront | condition | grade | sqft_basement | yr_built | yr_renovated | zipcode |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 514500195 | 20141016T000000 | 556000.0 | 4 | 2.5 | 7200 | 1.0 | 0 | 4 | 7 | 1010 | 1957 | 0 | 98005 |
| 1 | 2255500125 | 20140716T000000 | 749950.0 | 3 | 2.5 | 2263 | 2.0 | 0 | 3 | 8 | 670 | 2014 | 0 | 98122 |
| 2 | 1223039235 | 20141114T000000 | 605000.0 | 5 | 2.75 | 13332 | 2.0 | 0 | 4 | 8 | 0 | 1940 | 1991 | 98146 |
| 3 | 1432400095 | 20141106T000000 | 175000.0 | 3 | 1.0 | 7572 | 1.0 | 0 | 4 | 6 | 0 | 1958 | 0 | 98058 |
| 4 | 6788200360 | 20140903T000000 | 727000.0 | 3 | 2.25 | 4200 | 1.5 | 0 | 5 | 8 | 660 | 1939 | 0 | 98112 |


There are **14 variables** in this data set:
*   **2 categorical** variables,
*   **10 continuous** variables,
*   **1** variable to store house ID, and
*   **1** variable to store date house sold.

The following is the **structure of the data set**.

* id: House ID ➡️ 7129300520; ...
* date: Date house sold ➡️ 20141013T000000; 20141209T000000; ...
* price: House price ➡️ 221900; 538000; ...
* bedrooms: Number of bedrooms ➡️ 3; 2; ...
* bathrooms: Number of bathrooms ➡️ 1; 2.25; ...
* sqft_lot: Lot size (in sqft) ➡️ 5650; 7242; ...
* floors: Number of floors ➡️ 1; 2; ...
* waterfront: Has access to waterway (0 = no; 1 = yes) ➡️ 0; 1; ...
* condition: House condition (1 = bad; 5 = perfect) ➡️ 1; 5; ...
* grade: House grade ➡️ 7; 6; ...
* sqft_basement: Basement size (in sqft) ➡️ 0; 400; ...
* yr_built: Year when house was built ➡️ 1955; 1951; ...
* yr_renovated: Year when house was renovated ➡️ 0; 1991; ...
* zipcode: House zipcode ➡️ 98178; 98125; ...

## Exploratory Data Analysis & Data Preprocessing

### Check Data Types
*Check the data type of each column in `house_df`. Are the data types appropriate?*

In [5]:
# code here 
house_df.dtypes


id                 int64
date              object
price            float64
bedrooms           int64
bathrooms        float64
sqft_lot           int64
floors           float64
waterfront         int64
condition          int64
grade              int64
sqft_basement      int64
yr_built           int64
yr_renovated       int64
zipcode            int64
dtype: object
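
The `date` column is read as `object` (a plain string). If you ever want to use it as a real date, one possible approach is to parse it with `pd.to_datetime`. The sketch below uses a small made-up frame named `sample_df` (a hypothetical name) that reuses two date strings from the preview above; it does not modify `house_df`.

```python
import pandas as pd

# Toy frame (hypothetical) reusing two `date` strings from the preview above.
sample_df = pd.DataFrame({"date": ["20141016T000000", "20140716T000000"]})

# Parse the compact "YYYYMMDDTHHMMSS" strings into a datetime column.
sample_df["date"] = pd.to_datetime(sample_df["date"], format="%Y%m%dT%H%M%S")

print(sample_df.dtypes)
print(sample_df["date"].dt.month.tolist())
```

In this exercise the column is dropped anyway, so the conversion is optional.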

### Check Missing Values & Duplicates

*Check whether the data contains any missing values or duplicated rows.*

In [6]:
# code here (check missing sum)
house_df.isna().sum()

id               0
date             0
price            0
bedrooms         0
bathrooms        0
sqft_lot         0
floors           0
waterfront       0
condition        0
grade            0
sqft_basement    0
yr_built         0
yr_renovated     0
zipcode          0
dtype: int64

In [7]:
# code here (check duplicates)
house_df.duplicated().sum()

0
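
This dataset turns out to be clean, so nothing needs fixing here. For reference, the sketch below shows how the same two checks behave on a small made-up frame (`toy_df`, a hypothetical name) that does contain one missing value and one duplicated row, and one simple way to remove them.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): row 1 duplicates row 0, row 3 has a missing value.
toy_df = pd.DataFrame({
    "bedrooms": [3, 3, 4, np.nan],
    "price": [200000.0, 200000.0, 350000.0, 410000.0],
})

n_missing = int(toy_df.isna().sum().sum())     # total number of missing cells
n_duplicates = int(toy_df.duplicated().sum())  # rows identical to an earlier row

# Simplest cleanup: drop repeated rows, then drop rows with missing values.
clean_df = toy_df.drop_duplicates().dropna()

print(n_missing, n_duplicates, len(clean_df))
```

Dropping rows is only one option; imputing missing values may be preferable when data is scarce.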

### Feature Selection

❓In your opinion, **which features will certainly not affect the house price?**

_⚠️ Note: in this exercise you do not need to create visualizations; simply pick the features that clearly have no influence at all._

> Answer: ID and Date

In [26]:
house_df.drop(['date'],axis=1).corr()[['price']]

| | price |
|---|---|
| id | -0.03 |
| price | 1.0 |
| bedrooms | 0.32 |
| bathrooms | 0.55 |
| sqft_lot | 0.14 |
| floors | 0.27 |
| waterfront | 0.26 |
| condition | 0.02 |
| grade | 0.67 |
| sqft_basement | 0.36 |


In [27]:
house_df.corr(numeric_only=True)  # skip the non-numeric `date` column



| | id | price | bedrooms | bathrooms | sqft_lot | floors | waterfront | condition | grade | sqft_basement | yr_built | yr_renovated | zipcode |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| id | 1.0 | -0.03 | -0.03 | -0.02 | -0.14 | -0.0 | -0.03 | -0.04 | -0.01 | 0.01 | 0.01 | 0.02 | 0.0 |
| price | -0.03 | 1.0 | 0.32 | 0.55 | 0.14 | 0.27 | 0.26 | 0.02 | 0.67 | 0.36 | 0.04 | 0.14 | -0.05 |
| bedrooms | -0.03 | 0.32 | 1.0 | 0.49 | 0.07 | 0.17 | -0.02 | 0.03 | 0.37 | 0.29 | 0.12 | 0.03 | -0.13 |
| bathrooms | -0.02 | 0.55 | 0.49 | 1.0 | 0.14 | 0.51 | 0.08 | -0.15 | 0.68 | 0.32 | 0.51 | 0.06 | -0.22 |
| sqft_lot | -0.14 | 0.14 | 0.07 | 0.14 | 1.0 | 0.03 | 0.01 | -0.03 | 0.18 | 0.05 | 0.08 | 0.02 | -0.16 |
| floors | -0.0 | 0.27 | 0.17 | 0.51 | 0.03 | 1.0 | 0.04 | -0.28 | 0.45 | -0.21 | 0.49 | -0.03 | -0.08 |
| waterfront | -0.03 | 0.26 | -0.02 | 0.08 | 0.01 | 0.04 | 1.0 | 0.0 | 0.09 | 0.11 | -0.02 | 0.21 | 0.03 |
| condition | -0.04 | 0.02 | 0.03 | -0.15 | -0.03 | -0.28 | 0.0 | 1.0 | -0.15 | 0.15 | -0.34 | -0.06 | 0.03 |
| grade | -0.01 | 0.67 | 0.37 | 0.68 | 0.18 | 0.45 | 0.09 | -0.15 | 1.0 | 0.21 | 0.44 | -0.01 | -0.2 |
| sqft_basement | 0.01 | 0.36 | 0.29 | 0.32 | 0.05 | -0.21 | 0.11 | 0.15 | 0.21 | 1.0 | -0.13 | 0.1 | 0.1 |


🔻Then use every column that is not in your answer as a predictor, and the `price` column as the target variable.

In [36]:
# code here
y = house_df['price']
X = house_df.drop(['id','date','price'], axis=1) # fill in the names of the columns that are not predictors

In [12]:
y

0       556000.00
1       749950.00
2       605000.00
3       175000.00
4       727000.00
          ...    
1995    170000.00
1996   1225000.00
1997    703300.00
1998    325000.00
1999    685100.00
Name: price, Length: 2000, dtype: float64

### Train-Test Splitting

Perform a **train-test split** on `X` and `y` with `random_state=42` and `test_size=0.2`, then store the result in the variables `X_train, X_test, y_train, y_test`.

In [32]:
# code here
X_train, X_test, y_train, y_test = train_test_split(X,y,
                                                   test_size = 0.2,
                                                   random_state = 42)

In [33]:
X_train

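| | bedrooms | bathrooms | sqft_lot | floors | waterfront | condition | grade | sqft_basement | yr_built | yr_renovated | zipcode |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 968 | 3 | 1.00 | 8910 | 1.00 | 0 | 4 | 7 | 0 | 1949 | 0 | 98117 |
| 240 | 3 | 2.50 | 8577 | 2.00 | 0 | 3 | 9 | 0 | 1987 | 0 | 98023 |
| 819 | 2 | 3.00 | 1073 | 2.00 | 0 | 3 | 7 | 280 | 2007 | 0 | 98144 |
| 692 | 5 | 1.75 | 48787 | 2.00 | 0 | 3 | 6 | 0 | 1922 | 0 | 98059 |
| 420 | 4 | 1.75 | 6000 | 1.00 | 0 | 4 | 6 | 780 | 1947 | 0 | 98125 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1130 | 3 | 1.75 | 6818 | 1.00 | 0 | 5 | 7 | 0 | 1972 | 0 | 98034 |
| 1294 | 4 | 2.00 | 5572 | 1.50 | 0 | 3 | 7 | 450 | 1911 | 0 | 98126 |
| 860 | 3 | 2.50 | 4077 | 2.00 | 0 | 3 | 8 | 0 | 2011 | 0 | 98038 |
| 1459 | 3 | 3.25 | 51177 | 2.00 | 0 | 3 | 9 | 0 | 2005 | 0 | 98042 |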
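
A quick sanity check after splitting is to look at the resulting shapes. The sketch below uses synthetic arrays (`X_demo`, `y_demo`, hypothetical names) with 2000 rows, the same row count as `y` above, to show what an 80/20 split should produce.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (hypothetical): 2000 rows, 2 features.
X_demo = np.arange(2000 * 2).reshape(2000, 2)
y_demo = np.arange(2000)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# With test_size=0.2, 20% of the rows go to the test set.
print(X_tr.shape, X_te.shape)
print(y_tr.shape, y_te.shape)
```

The same `.shape` checks can be run on `X_train`, `X_test`, `y_train`, and `y_test`.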


### **Model Fitting**: 

Create an SVR model for the `house_df` data and store it in the variable `model_svr`. Then train it on the training data with the `.fit()` method.

In [40]:
# code here
model_svr = svm.SVR()
model_svr.fit(X_train, y_train)

In [35]:
# additional: view the model parameters
model_svr.get_params()

{'C': 1.0,
 'cache_size': 200,
 'coef0': 0.0,
 'degree': 3,
 'epsilon': 0.1,
 'gamma': 'scale',
 'kernel': 'rbf',
 'max_iter': -1,
 'shrinking': True,
 'tol': 0.001,
 'verbose': False}

### **Model Prediction** 
Use the trained model `model_svr` to predict on the test data and store the result in `y_pred_test`.

In [45]:
y_pred_test = model_svr.predict(X_test)
y_pred_test[:3]

array([441747.74591892, 441747.78094545, 441747.66880208])

In [49]:
y_test.values[:3]  # actual values

array([535000., 583000., 349990.])

### **Model Evaluation** 
Check the model's `mean_absolute_error` on the test data to see whether the model has learned well.

In [38]:
from sklearn.metrics import mean_absolute_error

In [51]:
# code here - test mae
mean_absolute_error(y_test, y_pred_test)
# mean_absolute_error(y_test, model_svr.predict(X_test))

244058.54183195753
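
MAE is simply the average of the absolute differences between actual and predicted values. As a minimal sketch, the snippet below recomputes it by hand for only the first three test rows shown earlier (predictions rounded to two decimals), so the result differs from the full test MAE above. The names `y_true_demo` and `y_pred_demo` are hypothetical.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# First three test rows shown earlier (predictions rounded to 2 decimals).
y_true_demo = np.array([535000.0, 583000.0, 349990.0])
y_pred_demo = np.array([441747.75, 441747.78, 441747.67])

# MAE = mean of |actual - predicted|
mae_manual = np.mean(np.abs(y_true_demo - y_pred_demo))
mae_sklearn = mean_absolute_error(y_true_demo, y_pred_demo)

print(f"manual: {mae_manual:.2f}, sklearn: {mae_sklearn:.2f}")
```

Both values agree, which confirms that the metric is in the same unit as `price`.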

___

### ✨ Hyperparameters Tuning

Try tuning the hyperparameters to reduce the error.

In [52]:
# code here 
from sklearn.svm import SVR
model_svr_tuned = SVR(kernel='rbf', C=100).fit(X_train, y_train)
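
Besides `C`, feature scaling may also be worth experimenting with, because the RBF kernel is distance based and features such as `sqft_lot` are numerically much larger than `bedrooms`. The sketch below is not run on the house data: it uses synthetic data (all names are hypothetical) where a small-scale feature drives the target, and compares an SVR with and without a `StandardScaler` step.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(42)
n = 400
big_feature = rng.uniform(0, 10_000, n)  # large scale, unrelated to the target
small_feature = rng.uniform(1, 10, n)    # small scale, drives the target
X_syn = np.column_stack([big_feature, small_feature])
y_syn = 10 * small_feature + rng.normal(0, 1, n)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=42
)

# Same model settings; the only difference is the scaling step.
plain = SVR(kernel="rbf", C=100).fit(X_tr, y_tr)
scaled = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=100)).fit(X_tr, y_tr)

mae_plain = mean_absolute_error(y_te, plain.predict(X_te))
mae_scaled = mean_absolute_error(y_te, scaled.predict(X_te))
print(f"MAE without scaling: {mae_plain:.2f}")
print(f"MAE with scaling:    {mae_scaled:.2f}")
```

Whether scaling helps on the house data, and by how much, has to be verified by running the same comparison on `X_train` and `X_test`.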

In [None]:
# code here - test mae


✨ **Optional**: try expressing the error as a percentage with `mean_absolute_percentage_error`, to judge whether the error is large relative to the range of the target variable.
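
As a small illustration of what the metric computes, the sketch below applies it to the first three test rows shown earlier (predictions rounded to two decimals) and compares it with a manual calculation. The variable names are hypothetical, and the result covers only these three rows, not the whole test set.

```python
import numpy as np
from sklearn.metrics import mean_absolute_percentage_error

# First three test rows shown earlier (predictions rounded to 2 decimals).
y_true_demo = np.array([535000.0, 583000.0, 349990.0])
y_pred_demo = np.array([441747.75, 441747.78, 441747.67])

# MAPE = mean of |actual - predicted| / |actual|
mape_manual = np.mean(np.abs(y_true_demo - y_pred_demo) / np.abs(y_true_demo))
mape_sklearn = mean_absolute_percentage_error(y_true_demo, y_pred_demo)

# scikit-learn returns a fraction, so format it as a percentage for display.
print(f"MAPE: {mape_sklearn:.2%}")
```

Note that the function returns a fraction (for example 0.2 for 20%), not a value already multiplied by 100.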

In [None]:
# from sklearn.metrics import mean_absolute_percentage_error

In [None]:
# code here - train mae

# Good Job! 👍👍👍