# Abstract

This study uses linear regression to predict house prices in Jakarta and Tebet, Indonesia. On the held-out test set, the model reaches a coefficient of determination (R²) of 65.64%, so it explains roughly two thirds of the variation in price and leaves clear room for improvement. The study examines which property attributes are associated with house prices in these areas and shows how linear regression can serve as a baseline tool for analyzing the real estate market. Future work could refine the model and explore additional factors that may affect house prices.

# Introduction

In recent years, the real estate industry in Jakarta and Tebet, Indonesia has experienced rapid growth, making it an interesting area for research and experimentation. One common task in this field is forecasting house prices, which can be challenging due to the complex and dynamic nature of the market. In this experiment, we will use linear regression, a widely used statistical method, to analyze data and make predictions about house prices in Jakarta and Tebet. By applying this technique, we aim to gain insights into the factors that affect house prices and improve our ability to forecast them accurately.

# Literature review

**Linear regression** is a statistical technique used to establish a relationship between a dependent variable and one or more independent variables. Here are some pros and cons of linear regression:

Pros:

* Simple to understand and easy to implement.
* Useful for making predictions and forecasting future outcomes.
* Provides a measure of the strength and direction of the relationship between the dependent and independent variables.
* Helps identify the most significant independent variables that affect the dependent variable.
* Allows for the examination of the effects of multiple independent variables on the dependent variable.

Cons:

* Assumes a linear relationship between the dependent and independent variables, which may not always hold true in reality.
* Cannot be used to establish cause-and-effect relationships between variables.
* Susceptible to outliers, which can significantly influence the outcome of the regression analysis.
* Assumes the independence of the observations, which may not be true in some cases.
* Cannot be used to analyze non-numeric data or categorical variables without some form of transformation.
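
To make the outlier sensitivity listed above concrete, here is a minimal sketch on synthetic data (the data, the `fit_line` helper, and the size of the corrupted value are assumptions for illustration, not part of the housing dataset). It fits a least-squares line twice, once on clean data and once after a single observation is shifted.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): y = 3 + 2*x + small noise
x = np.linspace(0, 10, 50)
y = 3 + 2 * x + rng.normal(scale=0.5, size=x.size)

def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line through (x, y)."""
    A = np.column_stack([np.ones_like(x), x])  # design matrix with a constant column
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef[0], coef[1]

intercept, slope = fit_line(x, y)

# Corrupt a single observation at the right edge with an extreme value
y_outlier = y.copy()
y_outlier[-1] += 100
intercept_o, slope_o = fit_line(x, y_outlier)

print(f"clean fit:        intercept={intercept:.2f}, slope={slope:.2f}")
print(f"with one outlier: intercept={intercept_o:.2f}, slope={slope_o:.2f}")
```

One shifted point out of fifty is enough to move the slope noticeably, which is why the outlier checks in the experiment below matter for this model.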

# Experiment

## Disable warning

In [None]:
import warnings
warnings.filterwarnings('ignore')

## Load data

In [None]:
import pandas as pd

df = pd.read_excel('/kaggle/input/daftar-harga-rumah/DATA RUMAH.xlsx', sheet_name='Sheet1')
df.info()

- Explanation of each attribute:
    1. Nama Rumah = House name
    2. LB = Total building area
    3. LT = Total land area
    4. KT = Number of bedrooms
    5. KM = Number of bathrooms
    6. GRS = Garage capacity (number of cars)
    7. Harga (column HARGA) = House price (IDR)

In [None]:
df.shape

In [None]:
df.head()

In [None]:
df = df.loc[:,['LB', 'LT', 'KT', 'KM', 'GRS', 'HARGA']]
df.head()

## Exploratory Data Analysis

### Data information
Statistical summary of each column: count, mean, standard deviation, minimum, quartiles, and maximum.

In [None]:
df.describe()

### Identify outliers

#### Identification of outlier attribute LB

In [None]:
import seaborn as sns

sns.boxplot(x='LB', data=df);

Compute Q1, Q3, the interquartile range (IQR = Q3 - Q1), and the upper and lower limits (Q3 + 1.5 × IQR and Q1 - 1.5 × IQR) to find which LB values are outliers.

In [None]:
import numpy as np

q11,q31=np.percentile(df['LB'], [25,75])  # Q1 and Q3
s1 = q31-q11                              # interquartile range (IQR)
ba1 = q31+(1.5*s1)                        # upper limit
bw1 = q11-(1.5*s1)                        # lower limit
print(q11)
print(q31)
print(s1)
print(ba1)
print(bw1)

In [None]:
# shows outlier data on the LB attribute
dt1 = df[(df['LB']<bw1) | (df['LB']>ba1)]
dt1.head()

#### Identification of outlier attribute LT

In [None]:
sns.boxplot(x='LT', data=df);

Compute Q1, Q3, the interquartile range (IQR = Q3 - Q1), and the upper and lower limits (Q3 + 1.5 × IQR and Q1 - 1.5 × IQR) to find which LT values are outliers.

In [None]:
q12,q32=np.percentile(df['LT'], [25,75])
s2 = q32-q12
ba2 = q32+(1.5*s2)
bw2 = q12-(1.5*s2)
print(q12)
print(q32)
print(s2)
print(ba2)
print(bw2)

In [None]:
# shows outlier data on the LT attribute
dt2 = df[(df['LT']<bw2) | (df['LT']>ba2)]
dt2.head()

#### Identification of outlier attribute KT

In [None]:
sns.boxplot(x='KT', data=df);

Compute Q1, Q3, the interquartile range (IQR = Q3 - Q1), and the upper and lower limits (Q3 + 1.5 × IQR and Q1 - 1.5 × IQR) to find which KT values are outliers.

In [None]:
q13,q33=np.percentile(df['KT'], [25,75])
s3 = q33-q13
ba3 = q33+(1.5*s3)
bw3 = q13-(1.5*s3)
print(q13)
print(q33)
print(s3)
print(ba3)
print(bw3)

In [None]:
# shows outlier data on the KT attribute
dt3 = df[(df['KT']<bw3) | (df['KT']>ba3)]
dt3.head()

#### Identification of outlier attribute KM

Compute Q1, Q3, the interquartile range (IQR = Q3 - Q1), and the upper and lower limits (Q3 + 1.5 × IQR and Q1 - 1.5 × IQR) to find which KM values are outliers.

In [None]:
sns.boxplot(x='KM', data=df);

In [None]:
q14,q34=np.percentile(df['KM'], [25,75])
s4 = q34-q14
ba4 = q34+(1.5*s4)
bw4 = q14-(1.5*s4)
print(q14)
print(q34)
print(s4)
print(ba4)
print(bw4)

In [None]:
# shows outlier data on the KM attribute
dt4 = df[(df['KM']<bw4) | (df['KM']>ba4)]
dt4.head()

#### Identification of outlier attribute GRS

In [None]:
sns.boxplot(x='GRS', data=df)

Compute Q1, Q3, the interquartile range (IQR = Q3 - Q1), and the upper and lower limits (Q3 + 1.5 × IQR and Q1 - 1.5 × IQR) to find which GRS values are outliers.

In [None]:
q15,q35=np.percentile(df['GRS'], [25,75])
s5 = q35-q15
ba5 = q35+(1.5*s5)
bw5 = q15-(1.5*s5)
print(q15)
print(q35)
print(s5)
print(ba5)
print(bw5)

In [None]:
# shows outlier data on the GRS attribute
dt5 = df[(df['GRS']<bw5) | (df['GRS']>ba5)]
dt5.head()

#### Identification of outlier attribute Harga

In [None]:
sns.boxplot(x='HARGA', data=df);

Compute Q1, Q3, the interquartile range (IQR = Q3 - Q1), and the upper and lower limits (Q3 + 1.5 × IQR and Q1 - 1.5 × IQR) to find which Harga values are outliers.

In [None]:
q16,q36=np.percentile(df['HARGA'], [25,75])
s6 = q36-q16
ba6 = q36+(1.5*s6)
bw6 = q16-(1.5*s6)
print(q16)
print(q36)
print(s6)
print(ba6)
print(bw6)

In [None]:
# shows outlier data on the Harga attribute
dt6 = df[(df['HARGA']<bw6) | (df['HARGA']>ba6)]
dt6.head()
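
The per-attribute cells above repeat the same computation. As a sketch of how it could be factored out, the helpers below apply the same 1.5 × IQR rule to any column. The names `iqr_bounds` and `iqr_outliers` and the `toy` frame are hypothetical and only stand in for the notebook's `df`.

```python
import numpy as np
import pandas as pd

def iqr_bounds(values, k=1.5):
    """Return (lower, upper) limits: Q1 - k*IQR and Q3 + k*IQR."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def iqr_outliers(frame, column, k=1.5):
    """Return the rows of `frame` whose `column` value lies outside the limits."""
    lower, upper = iqr_bounds(frame[column], k)
    return frame[(frame[column] < lower) | (frame[column] > upper)]

# Toy stand-in for the notebook's df (values are made up for illustration)
toy = pd.DataFrame({"LB": [90, 100, 110, 120, 130, 900],
                    "KT": [2, 3, 3, 4, 4, 5]})

for col in toy.columns:
    lower, upper = iqr_bounds(toy[col])
    n_out = len(iqr_outliers(toy, col))
    print(f"{col}: limits=({lower:.1f}, {upper:.1f}), outliers={n_out}")
```

With the real data, looping over `['LB', 'LT', 'KT', 'KM', 'GRS', 'HARGA']` would reproduce the six tables above in one cell.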

### Data distribution

#### Distribution of data from attribute LB

In [None]:
import matplotlib.pyplot as plt

f = plt.figure(figsize=(12,4))

f.add_subplot(1,2,1)
df['LB'].plot(kind='kde')

f.add_subplot(1,2,2 )
plt.boxplot(df['LB'])

plt.show()

- Most building areas are concentrated around 200
- The data has many outliers

#### Distribution of data from attribute LT

In [None]:
f = plt.figure(figsize=(12,4))

f.add_subplot(1,2,1)
df['LT'].plot(kind='kde')

f.add_subplot(1,2,2)
plt.boxplot(df['LT'])

plt.show()

- Most land areas are concentrated around 200
- The data has many outliers

#### Distribution of data from attribute KT

In [None]:
f = plt.figure(figsize=(12,4))

f.add_subplot(1,2,1)
sns.countplot(x=df['KT'])

f.add_subplot(1,2,2)
plt.boxplot(df['KT'])

plt.show()

- Most houses have 4 or 5 bedrooms
- The data has few outliers

#### Distribution of data from attribute KM

In [None]:
f = plt.figure(figsize=(12,4))

f.add_subplot(1,2,1)
sns.countplot(x=df['KM'])

f.add_subplot(1,2,2)
plt.boxplot(df['KM'])

plt.show()

- Most houses have 4 or 5 bathrooms
- The data has few outliers

#### Distribution of data from attribute GRS

In [None]:
f = plt.figure(figsize=(12,4))

f.add_subplot(1,2,1)
sns.countplot(x=df['GRS'])

f.add_subplot(1,2,2)
plt.boxplot(df['GRS'])

plt.show();

- Most garages hold 1 or 2 cars
- The data has few outliers

#### Distribution of data from attribute Harga

In [None]:
f = plt.figure(figsize=(12,4))

f.add_subplot(1,2,1)
df['HARGA'].plot(kind='kde')

f.add_subplot(1,2,2)
plt.boxplot(df['HARGA'])

plt.show()

### Correlation between independent and dependent variable

In [None]:
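# Bivariate analysis between independent variables and dependent variables
# pairplot creates its own figure, so no plt.figure() call is needed;
# the old `size` argument is called `height` in current seaborn releases
sns.pairplot(data=df, x_vars=['LT', 'LB', 'KM', 'KT', 'GRS'], y_vars=['HARGA'], height=5, aspect=0.75)
plt.show()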

In [None]:
# correlation of independent variable and dependent variable
# Styler.set_precision was removed in pandas 2.0; format(precision=...) replaces it
df.corr().style.background_gradient().format(precision=1)

## Classic assumption test

In [None]:
x = df.drop(columns=['HARGA'])
y = df['HARGA']

In [None]:
import statsmodels.api as sm

# fit OLS without an intercept (sm.OLS does not add a constant automatically)
model=sm.OLS(y,x).fit()
predictions=model.predict(x)
model.summary()

In [None]:
# adding a constant variable
X=sm.add_constant(x)
model=sm.OLS(y,X).fit()
model.summary()

### Normality test

For the normality test we use the Prob(JB) value from the regression summary above, which is the p-value of the Jarque-Bera test and equals 0.00. The test proceeds as follows:

- Determine the Hypothesis
     - H0 : Residuals are normally distributed
     - H1 : Residuals are not normally distributed
- Significance level
     - α = 5% (α = 0.05)
- Test Statistics
     - p-value = 0.00
- Critical area
     - Reject H0 if p-value < α
- Decision
     - Because the p-value is 0.00 and 0.00 < 0.05, reject H0.
- Conclusion
     - The residuals of this model are not normally distributed
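
As a sketch of what the Jarque-Bera p-value measures, the function below computes the statistic directly from the sample skewness S and kurtosis K as JB = n/6 · (S² + (K - 3)²/4). The helper name `jarque_bera` and the synthetic residuals are assumptions for illustration. The statistic is referred to a chi-square distribution with 2 degrees of freedom, whose upper tail probability is exactly exp(-JB/2).

```python
import numpy as np

def jarque_bera(residuals):
    """Return (JB statistic, p-value) for a 1-D array of residuals."""
    r = np.asarray(residuals, dtype=float)
    n = r.size
    d = r - r.mean()
    m2 = np.mean(d**2)
    skew = np.mean(d**3) / m2**1.5
    kurt = np.mean(d**4) / m2**2
    jb = n / 6.0 * (skew**2 + (kurt - 3.0)**2 / 4.0)
    # JB is asymptotically chi-square with 2 degrees of freedom,
    # whose survival function has the closed form exp(-x / 2).
    p_value = np.exp(-jb / 2.0)
    return jb, p_value

rng = np.random.default_rng(42)
normal_resid = rng.normal(size=2000)        # symmetric, light tails
skewed_resid = rng.exponential(size=2000)   # strongly right-skewed

alpha = 0.05
for name, resid in [("normal", normal_resid), ("skewed", skewed_resid)]:
    jb, p = jarque_bera(resid)
    decision = "reject H0" if p < alpha else "fail to reject H0"
    print(f"{name}: JB={jb:.2f}, p={p:.4f} -> {decision}")
```

A strongly skewed residual distribution, such as one produced by a few very expensive houses, drives the statistic up and the p-value toward 0.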

### Multicollinearity test

In [None]:
from patsy import dmatrices
import statsmodels.formula.api as smf
from statsmodels.stats.outliers_influence import variance_inflation_factor

lm = smf.ols(formula = "HARGA~LB+LT+KT+KM+GRS", data = df).fit()
Y,X = dmatrices ("HARGA~LB+LT+KT+KM+GRS", data = df, return_type ="dataframe")
vif = [variance_inflation_factor(X.values, i) for i in range (X.shape[1])]
print(vif)

The multicollinearity test checks whether the independent variables in the multiple linear regression are strongly correlated with one another. The VIF rule of thumb compares each value with a fixed threshold of 10, not with a significance level.
- Determine the Hypothesis
     - H0 : VIF < 10 means there is no multicollinearity.
     - H1 : VIF > 10 means that there is multicollinearity.
- Threshold
     - VIF = 10
- Test Statistics
     - VIF :
     - Constant = 10.500
     - LB = 2.690
     - LT = 2.410
     - KT = 1.928
     - KM = 2.122
     - GRS = 2.122
- Critical Area
     - Reject H0 if VIF > 10
- Decision
     - Because the VIF values of all predictors (LB = 2.690, LT = 2.410, KT = 1.928, KM = 2.122, and GRS = 2.122) are below 10, we fail to reject H0. The VIF of the constant term is not used to assess multicollinearity among the predictors.
- Conclusion
     - So, the data set does not have multicollinearity
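
To show where a VIF value comes from, here is a minimal sketch that computes it by hand: each predictor is regressed on the remaining predictors, and VIF = 1 / (1 - R²) of that auxiliary regression. The `vif` helper and the synthetic columns are assumptions for illustration. For the non-constant columns, the result should agree with `variance_inflation_factor` applied to a design matrix that includes a constant.

```python
import numpy as np

def vif(X):
    """Return the VIF of each column of X (predictors only, no constant column)."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])  # auxiliary regression with intercept
        coef, *_ = np.linalg.lstsq(A, target, rcond=None)
        resid = target - A @ coef
        r2 = 1.0 - resid @ resid / np.sum((target - target.mean()) ** 2)
        out[j] = 1.0 / (1.0 - r2)
    return out

rng = np.random.default_rng(1)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + 0.1 * rng.normal(size=500)  # nearly a copy of a, so strongly collinear

print(vif(np.column_stack([a, b])))     # independent columns: values near 1
print(vif(np.column_stack([a, b, c])))  # a and c inflate each other
```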

### Heteroscedasticity test

In [None]:
lm=smf.ols(formula="HARGA~LB+LT+KT+KM+GRS",data=df).fit()
resid=lm.resid
# residuals against fitted values
plt.scatter(lm.predict(),resid)

In [None]:
from statsmodels.stats.diagnostic import het_white

# White's test; returns (LM statistic, LM p-value, F statistic, F p-value)
het_white(resid, lm.model.exog)

In the residual plot above, the points show no clear pattern and are scattered both above and below 0 on the Y axis. On that basis, we conclude that the regression model has no heteroscedasticity problem.

## Partial test

The partial test (t-test) determines whether each independent variable (X) has a significant effect on the dependent variable (Y). From the regression summary above, the p-values are 0.064 for the constant, 0.000 for LB, 0.000 for LT, 0.000 for KT, 0.000 for KM, and 0.001 for GRS.
The hypotheses are:
- Hypothesis
     - H0 : βi = 0, i = 1,2,...,5 (Xi has no significant effect on Y)
     - H1 : βi ≠ 0, i = 1,2,...,5 (Xi has a significant effect on Y)
- Significance level
     - α = 5% = 0.05
- Critical area
     - If p-value ≤ α (0.05) → Reject H0
     - P-values : 0.000 (LB, LT, KT, KM) and 0.001 (GRS) ; α = 0.05
- Decision
     - Because the p-values for β1, β2, β3, β4, β5 are all < α, reject H0
- Conclusion
     - Each variable X (LB, LT, KT, KM, GRS) has a significant effect on variable Y (Harga).

## Modelling

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.4, random_state=4)
X_train.shape , X_test.shape, y_train.shape, y_test.shape

In [None]:
from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)

In [None]:
# show the slopes (coefficients) and the intercept (b)
print(lin_reg.coef_)
print(lin_reg.intercept_)

In [None]:
coef_dict = {
    'features':x.columns,
    'coef_value':lin_reg.coef_
}
coef = pd.DataFrame(coef_dict, columns = ['features', 'coef_value'])
coef

- Substituting the values above into the regression formula gives: Y = 1.057·X₁ + 2.569·X₂ - 6.260·X₃ + 4.280·X₄ + 2.648·X₅ - 482584002, where X₁ to X₅ are LB, LT, KT, KM, and GRS, in the column order of x
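
As a sketch of how the fitted formula relates to `predict`, the example below uses synthetic features (an assumption, not the housing data) and checks that the intercept plus the sum of coefficient times feature value gives the same number as the model's own prediction.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
# Synthetic stand-in for the five features (assumption; not the housing data)
X = rng.uniform(0, 10, size=(200, 5))
true_coef = np.array([1.5, -2.0, 0.5, 3.0, 0.0])
y = 10.0 + X @ true_coef + rng.normal(scale=0.1, size=200)

model = LinearRegression().fit(X, y)

new_row = np.array([[1.0, 2.0, 3.0, 4.0, 5.0]])
by_hand = model.intercept_ + new_row @ model.coef_  # b + sum(beta_i * x_i)
by_model = model.predict(new_row)

print(by_hand, by_model)
```

Working from `coef_` and `intercept_` directly avoids the rounding that comes from copying printed values into a formula.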

In [None]:
y_pred = lin_reg.predict(X_test)
# LinearRegression.score returns the coefficient of determination (R^2)
print("R^2 on the test set: %.2f%%" % (lin_reg.score(X_test, y_test)*100))
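
The value returned by `lin_reg.score` is the coefficient of determination, R² = 1 - SS_res / SS_tot. The sketch below computes it by hand next to two error measures expressed in the units of the target. The helper names and the four sample values are made up for illustration.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def mae(y_true, y_pred):
    """Mean absolute error, in the units of the target."""
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))

def rmse(y_true, y_pred):
    """Root mean squared error, in the units of the target."""
    return float(np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)))

# Made-up values for illustration
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 310.0, 390.0])

print("R^2 :", r_squared(y_true, y_pred))
print("MAE :", mae(y_true, y_pred))
print("RMSE:", rmse(y_true, y_pred))
```

Reporting MAE or RMSE in IDR alongside R² would show how far a typical prediction is from the actual price.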

### Case study

- Samuel wants to find a house in DKI Jakarta that meets the following criteria:
     1. Building Area (LB) = 100
     2. Land Area (LT) = 300
     3. Number of Bedrooms (KT) = 3
     4. Number of Bathrooms (KM) = 2
     5. Car capacity in garage (GRS) = 1

In [None]:
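import math

# one-row DataFrame with the training column names; predict returns an array, so take element 0
samuel = pd.DataFrame([[100, 300, 3, 2, 1]], columns=x.columns)
price = lin_reg.predict(samuel)[0]
print("Samuel's dream house costs approx IDR {:,} million".format(math.floor(price/1000000)))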

# Result

The linear regression model for house prices in Jakarta and Tebet reached an R² score of 65.64% on the test set, which means it explains about two thirds of the variation in price. This is a moderate fit with clear room for improvement. Further analysis of the data and refinement of the model could improve the predictions. Nonetheless, the experiment shows which attributes are associated with house prices in these areas and how linear regression can be used to analyze the real estate market.

# Conclusion

In conclusion, the experiment shows that linear regression is a workable baseline for analyzing house prices in Jakarta and Tebet and for predicting them. The model's R² score was 65.64%, so there is still potential for improvement through further refinement of the dataset and the model itself.

One recommendation for future experiments is to preprocess the dataset before the analysis. Rescaling the variables to a common range makes the coefficients easier to compare, although for ordinary least squares rescaling alone does not change the predictions, so treating the outliers found during the exploratory analysis may matter more. Additionally, it may be beneficial to explore additional factors that could impact house prices in Jakarta and Tebet, such as economic indicators and demographic data.
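
One way to try the normalization suggested above is a scikit-learn pipeline that standardizes the features before the regression. The sketch below uses synthetic features on very different scales (an assumption, not the housing data) and compares the test-set R² with and without scaling. For plain least squares the two scores are expected to match, so scaling here mainly helps when comparing coefficients or switching to a regularized model.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(4)
# Synthetic features on very different scales (assumption; not the housing data)
X = np.column_stack([rng.uniform(50, 500, 400), rng.integers(1, 6, 400)])
y = 1e6 * X[:, 0] + 5e7 * X[:, 1] + rng.normal(scale=5e7, size=400)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=4)

raw = LinearRegression().fit(X_train, y_train)
scaled = make_pipeline(StandardScaler(), LinearRegression()).fit(X_train, y_train)

score_raw = raw.score(X_test, y_test)
score_scaled = scaled.score(X_test, y_test)
print("R^2 raw:   ", score_raw)
print("R^2 scaled:", score_scaled)
```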

Overall, this experiment has provided valuable insights into the housing markets of Jakarta and Tebet and highlights the potential of linear regression for real estate forecasting. By continuing to refine and improve upon this method, researchers and industry professionals can gain a better understanding of market trends and make more accurate predictions about future prices.