## Business Understanding

**Problem**: A company plans to build housing developments in several locations, each with different area characteristics. The company needs predicted prices for the houses it will build.

**Goals**: Predict house prices across the different locations.

**Objective**: Build a machine learning model that predicts house prices.

## Exploratory Data Analysis

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

In [2]:
df = pd.read_csv('housing.csv')
df

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,ocean_proximity
0,-122.23,37.88,41.0,880.0,129.0,322.0,126.0,8.3252,452600.0,NEAR BAY
1,-122.22,37.86,21.0,7099.0,1106.0,2401.0,1138.0,8.3014,358500.0,NEAR BAY
2,-122.24,37.85,52.0,1467.0,190.0,496.0,177.0,7.2574,352100.0,NEAR BAY
3,-122.25,37.85,52.0,1274.0,235.0,558.0,219.0,5.6431,341300.0,NEAR BAY
4,-122.25,37.85,52.0,1627.0,280.0,565.0,259.0,3.8462,342200.0,NEAR BAY
...,...,...,...,...,...,...,...,...,...,...
20635,-121.09,39.48,25.0,1665.0,374.0,845.0,330.0,1.5603,78100.0,INLAND
20636,-121.21,39.49,18.0,697.0,150.0,356.0,114.0,2.5568,77100.0,INLAND
20637,-121.22,39.43,17.0,2254.0,485.0,1007.0,433.0,1.7000,92300.0,INLAND
20638,-121.32,39.43,18.0,1860.0,409.0,741.0,349.0,1.8672,84700.0,INLAND


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   longitude           20640 non-null  float64
 1   latitude            20640 non-null  float64
 2   housing_median_age  20640 non-null  float64
 3   total_rooms         20640 non-null  float64
 4   total_bedrooms      20433 non-null  float64
 5   population          20640 non-null  float64
 6   households          20640 non-null  float64
 7   median_income       20640 non-null  float64
 8   median_house_value  20640 non-null  float64
 9   ocean_proximity     20640 non-null  object 
dtypes: float64(9), object(1)
memory usage: 1.6+ MB


**Column descriptions**
1. longitude: How far west a block is; values in this dataset are negative, so more negative values are farther west

2. latitude: How far north a block is; higher values are farther north

3. housing_median_age: Median age of the houses in a block; lower numbers mean newer buildings

4. total_rooms: Total number of rooms in a block

5. total_bedrooms: Total number of bedrooms in a block

6. population: Total number of people living in a block

7. households: Total number of households (groups of people living in one housing unit) in a block

8. median_income: Median household income in a block (measured in tens of thousands of US dollars)

9. median_house_value: Median house value for households in a block (measured in US dollars)

10. ocean_proximity: Location of the block relative to the ocean

In [4]:
num = ["longitude", "latitude", "housing_median_age", "total_rooms", "total_bedrooms", "population", "households", "median_income", "median_house_value"]
cat = ["ocean_proximity"]

In [5]:
df[num].describe()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value
count,20640.0,20640.0,20640.0,20640.0,20433.0,20640.0,20640.0,20640.0,20640.0
mean,-119.569704,35.631861,28.639486,2635.763081,537.870553,1425.476744,499.53968,3.870671,206855.816909
std,2.003532,2.135952,12.585558,2181.615252,421.38507,1132.462122,382.329753,1.899822,115395.615874
min,-124.35,32.54,1.0,2.0,1.0,3.0,1.0,0.4999,14999.0
25%,-121.8,33.93,18.0,1447.75,296.0,787.0,280.0,2.5634,119600.0
50%,-118.49,34.26,29.0,2127.0,435.0,1166.0,409.0,3.5348,179700.0
75%,-118.01,37.71,37.0,3148.0,647.0,1725.0,605.0,4.74325,264725.0
max,-114.31,41.95,52.0,39320.0,6445.0,35682.0,6082.0,15.0001,500001.0


In [6]:
df[cat].describe()

Unnamed: 0,ocean_proximity
count,20640
unique,5
top,<1H OCEAN
freq,9136


In [7]:
print("Value counts for the ocean_proximity column:\n", 
      df["ocean_proximity"].value_counts())

Value counts for the ocean_proximity column:
 <1H OCEAN     9136
INLAND        6551
NEAR OCEAN    2658
NEAR BAY      2290
ISLAND           5
Name: ocean_proximity, dtype: int64


In [8]:
df.describe()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value
count,20640.0,20640.0,20640.0,20640.0,20433.0,20640.0,20640.0,20640.0,20640.0
mean,-119.569704,35.631861,28.639486,2635.763081,537.870553,1425.476744,499.53968,3.870671,206855.816909
std,2.003532,2.135952,12.585558,2181.615252,421.38507,1132.462122,382.329753,1.899822,115395.615874
min,-124.35,32.54,1.0,2.0,1.0,3.0,1.0,0.4999,14999.0
25%,-121.8,33.93,18.0,1447.75,296.0,787.0,280.0,2.5634,119600.0
50%,-118.49,34.26,29.0,2127.0,435.0,1166.0,409.0,3.5348,179700.0
75%,-118.01,37.71,37.0,3148.0,647.0,1725.0,605.0,4.74325,264725.0
max,-114.31,41.95,52.0,39320.0,6445.0,35682.0,6082.0,15.0001,500001.0


##### Univariate Analysis

In [9]:
plt.figure(figsize=(16, 7))
for i in range(len(num)):
    # subplot needs integer grid dimensions; a 3 x 3 grid fits the 9 numeric columns
    plt.subplot(3, 3, i + 1)
    sns.boxplot(x=df[num[i]], color='gray')
plt.tight_layout()
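
The subplot grid has to be given as whole numbers, which is why `len(num)/2` (4.5 for nine columns) is rejected. A minimal sketch of computing the grid instead of hard-coding it is shown here; `grid_shape` is a hypothetical helper name, not part of matplotlib.

```python
import math

def grid_shape(n_plots, n_cols=3):
    """Return (rows, cols) as integers with enough cells for n_plots subplots."""
    if n_plots < 1 or n_cols < 1:
        raise ValueError("n_plots and n_cols must be positive")
    # round up so a partially filled last row still gets its own row
    n_rows = math.ceil(n_plots / n_cols)
    return n_rows, n_cols

rows, cols = grid_shape(9)  # nine numeric columns in this notebook
print(rows, cols)
```

The result can be passed as `plt.subplot(rows, cols, i + 1)`, so the same loop keeps working if columns are added to or removed from `num`.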

In [None]:
plt.figure(figsize=(16, 7))
for i in range(len(num)):
    plt.subplot(3, 3, i + 1)
    # histplot with a KDE overlay replaces the deprecated distplot
    sns.histplot(df[num[i]], kde=True, color='gray')
plt.tight_layout()

In [None]:
plt.figure(figsize=(10,4))
sns.countplot(x='ocean_proximity', data=df, color='gray')

##### Bivariate Analysis

In [None]:
plt.figure(figsize=(10, 10))
# restrict to the numeric columns; ocean_proximity is a string column
sns.heatmap(df[num].corr(), cmap='Blues', annot=True, fmt='.2f')

In [None]:
# pairplot creates its own figure, so a separate plt.figure call is not needed
sns.pairplot(df, diag_kind='kde')

In [None]:
for i in range(0, len(num)):
    sns.catplot(data=df, x='ocean_proximity', y=num[i])
    plt.tight_layout()

In [None]:
plt.figure(figsize=(13, 8))
sns.boxplot(x='ocean_proximity',y='median_house_value',data=df, palette="rocket")

In [None]:
plt.figure(figsize=(13, 8))
sns.boxplot(x='ocean_proximity',y='housing_median_age',data=df, palette="rocket")

##### EDA Insights

* There are outliers in `total_rooms`, `total_bedrooms`, `population`, `households`, and `median_income`.
* `total_bedrooms`, `population`, `households`, and `median_income` appear positively skewed.
* Several variables are strongly correlated:
  * `total_bedrooms` and `total_rooms` have a correlation of 0.93,
  * `population` and `total_rooms` have a correlation of 0.86,
  * `population` and `total_bedrooms` have a correlation of 0.88,
  * `households` and `total_rooms` have a correlation of 0.92,
  * `households` and `total_bedrooms` have a correlation of 0.98, and
  * `households` and `population` have a correlation of 0.91.
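
The pairs listed above were read off the heatmap. As a rough sketch, the same list can be pulled from the correlation matrix programmatically; `strong_pairs`, the 0.8 threshold, and the synthetic `demo` frame are assumptions made for illustration, not part of the original notebook.

```python
import numpy as np
import pandas as pd

def strong_pairs(frame, threshold=0.8):
    """Return column pairs whose absolute correlation exceeds the threshold."""
    corr = frame.select_dtypes("number").corr()
    # keep only the upper triangle so each pair appears once and the diagonal is skipped
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack().dropna()
    return pairs[pairs.abs() > threshold].sort_values(ascending=False)

# synthetic stand-in data: "b" is built from "a", "c" is independent noise
rng = np.random.default_rng(0)
base = rng.normal(size=200)
demo = pd.DataFrame({
    "a": base,
    "b": base * 2 + rng.normal(scale=0.1, size=200),
    "c": rng.normal(size=200),
})
strong = strong_pairs(demo)
print(strong)
```

On the housing data the call would be `strong_pairs(df[num])`.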

## Data Preprocessing

In [None]:
# check for missing values
df.isna().sum()

In [None]:
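# fill missing values with the column median
# (assigning the result back avoids the chained inplace call on a column)
df['total_bedrooms'] = df['total_bedrooms'].fillna(df['total_bedrooms'].median())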

In [None]:
df.isna().sum()

In [None]:
# check for duplicate rows
print(df.duplicated().sum())

In [None]:
# handle outliers with z-scores: keep rows within 3 standard deviations in every numeric column
from scipy import stats

print(f"rows before filtering: {len(df)}")

filtered_entries = np.array([True] * len(df))
for col in num:
    zscore = abs(stats.zscore(df[col]))
    filtered_entries = (zscore < 3) & filtered_entries

# apply the combined mask once and copy, so later edits do not touch a view of df
df2 = df[filtered_entries].copy()

print(f"rows after filtering: {len(df2)}")
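
The loop above can also be written without an explicit mask accumulator. The sketch below is one way to do it in pandas alone; `drop_zscore_outliers` is a hypothetical helper, and `ddof=0` is chosen to match the default of `scipy.stats.zscore`. The toy frame is made up for the example.

```python
import pandas as pd

def drop_zscore_outliers(frame, columns, limit=3.0):
    """Keep rows whose absolute z-score is below `limit` in every listed column."""
    values = frame[columns]
    # population standard deviation (ddof=0), the same default scipy.stats.zscore uses
    z = (values - values.mean()) / values.std(ddof=0)
    keep = (z.abs() < limit).all(axis=1)
    return frame.loc[keep].copy()

# toy data: one extreme value in "x", nothing unusual in "y"
demo = pd.DataFrame({"x": [1.0] * 19 + [100.0], "y": range(20)})
clean = drop_zscore_outliers(demo, ["x", "y"])
print(len(demo), len(clean))
```

Note that a constant column has zero standard deviation, so its z-scores are undefined and every row would be dropped; such columns should be left out of `columns`.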

In [None]:
# check the outliers again after filtering
plt.figure(figsize=(16, 7))
for i in range(len(num)):
    plt.subplot(3, 3, i + 1)
    sns.boxplot(x=df2[num[i]], color='gray')
plt.tight_layout()

In [None]:
# drop the identifier-like variables (longitude and latitude), and among the highly
# correlated predictors keep only one (population)
df2 = df2.drop(['longitude', 'latitude', 'total_rooms', 'households', 'total_bedrooms'],
               axis=1)
df2.columns

In [None]:
df2.info()

In [None]:
# feature encoding: one-hot encode ocean_proximity
df_dummies = pd.get_dummies(df2['ocean_proximity'])
df2 = pd.concat([df_dummies, df2], axis = 1)
df2

In [None]:
df2 = df2.drop('ocean_proximity', axis=1)

In [None]:
df2.columns
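
The encoding cells above build the dummy columns, concatenate them, and then drop the original column. As a small sketch on made-up rows, `pd.get_dummies` can do all three in one call through its `columns` argument; `dtype=int` is used here so the dummies are 0/1 integers rather than booleans. Unlike the cells above, this variant prefixes the new column names with `ocean_proximity_`.

```python
import pandas as pd

# three made-up rows standing in for the filtered housing data
demo = pd.DataFrame({
    "ocean_proximity": ["INLAND", "NEAR BAY", "INLAND"],
    "median_income": [1.5, 8.3, 2.6],
})

# encode the listed column and drop the original in a single step
encoded = pd.get_dummies(demo, columns=["ocean_proximity"], dtype=int)
print(encoded.columns.tolist())
```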

## Modeling

In [None]:
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error
from sklearn.metrics import r2_score

In [None]:
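from sklearn.preprocessing import MinMaxScaler
# scale every column, the target included, to the [0, 1] range
# note: the scaler is fit on the full dataset before the train/test split,
# so the error metrics reported later are in scaled units
df2 = pd.DataFrame(MinMaxScaler().fit_transform(df2),
                   columns=df2.columns, index=df2.index)

df2.describe()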

In [None]:
df2

In [None]:
x = df2.drop("median_house_value", axis=1).values
y = df2["median_house_value"].values

In [None]:
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=1,test_size=0.2)
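
In this notebook the data is scaled before it is split, so the test rows contribute to the minimum and maximum used for scaling. A common alternative, sketched below on synthetic arrays, is to split first and fit the scaler on the training rows only. The names `X_demo`, `y_demo`, and the random data are assumptions for illustration; this is not what the cells above do.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# synthetic stand-in for the feature matrix and target
rng = np.random.default_rng(1)
X_demo = rng.uniform(0, 100, size=(50, 3))
y_demo = rng.uniform(0, 1, size=50)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1
)

# learn the min/max from the training rows only, then reuse them on the test rows
scaler = MinMaxScaler().fit(X_tr)
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)  # test values may fall slightly outside [0, 1]
print(X_tr_scaled.shape, X_te_scaled.shape)
```

For tree-based models such as LightGBM, feature scaling generally has little effect on the fit, so the main benefit of this ordering is a cleaner evaluation protocol.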

##### Lazy Regressor

In [None]:
!pip install lazypredict

In [None]:
import lazypredict
from lazypredict.Supervised import LazyRegressor

In [None]:
reg = LazyRegressor(predictions=True)
models, predictions = reg.fit(x_train, x_test, y_train, y_test)
models

##### LGBMRegressor

In [None]:
import lightgbm as ltb
model = ltb.LGBMRegressor()
model.fit(x_train, y_train)
print(model)

expected_y  = y_test
predicted_y = model.predict(x_test)

In [None]:
model_score = model.score(x_test,y_test)
print("R2 model {:.2%}".format(model_score))

In [None]:
model_df = pd.DataFrame({"Actual Data": expected_y, "Predicted Data": predicted_y})

In [None]:
plt.figure(figsize =(14,6))
model_df = model_df.reset_index()
model_df = model_df.drop(["index"], axis=1)
plt.plot(model_df[:50])
plt.legend(["Actual", "Predicted"])
plt.title("LGBMRegressor Model")
plt.show()

In [None]:
# model evaluation
mse = mean_squared_error(y_test, predicted_y)
print("MSE: %.2f" % mse)
print("MAE: %.2f" % mean_absolute_error(y_test, predicted_y))
# take the square root directly; the squared=False argument is gone from recent scikit-learn releases
print("RMSE: %.2f" % np.sqrt(mse))

# R-squared
print('R2 score: %.2f' % model_score)
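
For reference, the four metrics printed above follow directly from the residuals. The sketch below computes them with NumPy alone on a tiny made-up example; `regression_report` is a hypothetical helper name.

```python
import numpy as np

def regression_report(y_true, y_pred):
    """Compute MSE, MAE, RMSE and R2 from actual and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred
    mse = np.mean(err ** 2)
    ss_res = np.sum(err ** 2)                         # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # total sum of squares
    return {
        "MSE": mse,
        "MAE": np.mean(np.abs(err)),
        "RMSE": np.sqrt(mse),
        "R2": 1.0 - ss_res / ss_tot,
    }

report = regression_report([1, 2, 3, 4], [1, 2, 3, 6])
print(report)
```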

##### Modeling Summary

* The data was normalized so that all variables lie in the same range
* The data was split into training and test sets in an 80%/20% proportion
* According to Lazy Regressor, the best-performing model was LGBMRegressor
* The R2 score obtained for the LGBMRegressor model is 60.51%
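
Because the target was min-max scaled along with the features, MAE and RMSE come out in scaled units rather than dollars. Under min-max scaling an error is stretched back by the target's range, so a rough conversion looks like the sketch below. The error value 0.15 is a made-up placeholder, not a result from this notebook, and the minimum and maximum are taken from the `describe()` output on the unfiltered data, on the assumption that the outlier filter kept both extremes.

```python
def to_original_units(scaled_error, y_min, y_max):
    """Convert a MAE or RMSE measured on a min-max scaled target back to target units."""
    # min-max scaling divides by (y_max - y_min), so linear errors scale back by that range
    return scaled_error * (y_max - y_min)

# placeholder values for illustration only
rmse_scaled = 0.15
y_min, y_max = 14999.0, 500001.0  # assumed range of median_house_value

rmse_dollars = to_original_units(rmse_scaled, y_min, y_max)
print(round(rmse_dollars, 2))
```

MSE would need the square of the range instead, and R2 is unchanged by this kind of linear rescaling of the target.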