**1) Dataset description:**

This dataset contains information about real estate properties, including various features of each house. Each row represents a single property. The columns in the dataset are:

* X1 transaction date - transaction date (year and month).
* X2 house age - age of the house in years.
* X3 distance to the nearest MRT station - distance to the nearest MRT station (in meters).
* X4 number of convenience stores - number of convenience stores nearby.
* X5 latitude - latitude of the house's location.
* X6 longitude - longitude of the house's location.
* Y house price of unit area - house price per unit area (the value to be predicted).
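
The transaction date is stored as a fractional year (for example `2012.917`). The sketch below shows one way to split such a value into a year and a month. It assumes the fraction encodes the month as month/12, so that `2013.250` means March 2013 and a whole number such as `2013.000` means December of the previous year. That encoding is an assumption, not something the notebook states, and `decode_transaction_date` is a hypothetical helper.

```python
def decode_transaction_date(value):
    """Split a fractional year into (year, month).

    Assumption: the fractional part is month / 12, so 2013.250 -> March 2013
    and 2013.000 -> December 2012.
    """
    months_total = round(value * 12)            # whole months counted from year 0
    year, month_index = divmod(months_total - 1, 12)
    return year, month_index + 1                # month in the range 1..12


# Values taken from the X1 column shown in this notebook.
for value in (2012.917, 2013.25, 2013.5, 2013.583):
    print(value, decode_transaction_date(value))
```

If the real encoding differs, only the arithmetic inside the helper needs to change.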


**2) Display a statistical description of the first 100 entries of the data set:**

In [1]:
import pandas as pd

# load the dataset from Google Drive:

realestate_url = 'https://drive.google.com/uc?id=1avGmngRt3pOS5i_BeEGBixIvMZkeywrh'

realestate_df = pd.read_csv(realestate_url)

print(realestate_df.head())

   No  X1 transaction date  X2 house age  \
0   1             2012.917          32.0   
1   2             2012.917          19.5   
2   3             2013.583          13.3   
3   4             2013.500          13.3   
4   5             2012.833           5.0   

   X3 distance to the nearest MRT station  X4 number of convenience stores  \
0                                84.87882                               10   
1                               306.59470                                9   
2                               561.98450                                5   
3                               561.98450                                5   
4                               390.56840                                5   

   X5 latitude  X6 longitude  Y house price of unit area  
0     24.98298     121.54024                        37.9  
1     24.98034     121.53951                        42.2  
2     24.98746     121.54391                        47.3  
3     24.98746     121.54391                        54.8
4     24.97937     121.54245                        43.1

In [2]:
# creating statistical description of the first 100 rows:

first_100_rows = realestate_df.head(100)
first_100_rows.describe()

| | No | X1 transaction date | X2 house age | X3 distance to the nearest MRT station | X4 number of convenience stores | X5 latitude | X6 longitude | Y house price of unit area |
|---|---|---|---|---|---|---|---|---|
| count | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 |
| mean | 50.5 | 2013.1617 | 18.472 | 1133.658832 | 4.17 | 24.969346 | 121.533141 | 38.44 |
| std | 29.011492 | 0.29955 | 11.516489 | 1308.867454 | 2.867635 | 0.012994 | 0.015743 | 13.481255 |
| min | 1.0 | 2012.667 | 0.0 | 23.38284 | 0.0 | 24.94155 | 121.48458 | 13.2 |
| 25% | 25.75 | 2012.917 | 9.85 | 319.389925 | 1.0 | 24.963045 | 121.529468 | 26.875 |
| 50% | 50.5 | 2013.083 | 16.45 | 533.4762 | 4.0 | 24.97349 | 121.53913 | 39.8 |
| 75% | 75.25 | 2013.417 | 29.775 | 1420.7725 | 6.0 | 24.977283 | 121.54391 | 48.025 |
| max | 100.0 | 2013.583 | 40.9 | 5512.038 | 10.0 | 25.01459 | 121.55282 | 70.1 |
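
To make the rows of this summary less of a black box, the sketch below recomputes the statistics for the `No` column, which for the first 100 rows is simply the integers 1 to 100. It uses only the standard library: `statistics.stdev` is the sample standard deviation (divisor n - 1), and the `"inclusive"` quantile method interpolates linearly between neighbouring values, which is also how pandas computes the `std` and percentile rows by default.

```python
import statistics

# The 'No' column of the first 100 rows is just 1, 2, ..., 100.
no_column = list(range(1, 101))

mean = statistics.mean(no_column)
std = statistics.stdev(no_column)  # sample standard deviation (n - 1)
q1, median, q3 = statistics.quantiles(no_column, n=4, method="inclusive")

print(mean, round(std, 6), q1, median, q3)
# 50.5 29.011492 25.75 50.5 75.25
```

The printed values match the `mean`, `std`, `25%`, `50%` and `75%` entries of the `No` column in the summary.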


**3) Defining the prediction target (price):**

In [3]:
y = realestate_df["Y house price of unit area"]
y.head()

0    37.9
1    42.2
2    47.3
3    54.8
4    43.1
Name: Y house price of unit area, dtype: float64

**4) Using all available features for prediction:**

In [4]:
# printing column names:

realestate_df.columns


Index(['No', 'X1 transaction date', 'X2 house age',
       'X3 distance to the nearest MRT station',
       'X4 number of convenience stores', 'X5 latitude', 'X6 longitude',
       'Y house price of unit area'],
      dtype='object')

In [5]:
# select the feature columns for prediction (the 'No' row counter is left out):
from sklearn.model_selection import train_test_split

realestate_features = [
    "X1 transaction date",
    "X2 house age",
    "X3 distance to the nearest MRT station",
    "X4 number of convenience stores",
    "X5 latitude",
    "X6 longitude",
]

X = realestate_df[realestate_features]


# split the data into training and testing sets:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

X_train.head()

| | X1 transaction date | X2 house age | X3 distance to the nearest MRT station | X4 number of convenience stores | X5 latitude | X6 longitude |
|---|---|---|---|---|---|---|
| 159 | 2012.667 | 15.5 | 815.9314 | 4 | 24.97886 | 121.53464 |
| 95 | 2012.917 | 8.0 | 104.8101 | 5 | 24.96674 | 121.54067 |
| 11 | 2013.333 | 6.3 | 90.45606 | 9 | 24.97433 | 121.5431 |
| 374 | 2013.25 | 5.4 | 390.5684 | 5 | 24.97937 | 121.54245 |
| 165 | 2012.917 | 13.7 | 1236.564 | 1 | 24.97694 | 121.55391 |
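
The shuffled row labels above (159, 95, 11, ...) come from `train_test_split`, which shuffles the rows before splitting them. The sketch below illustrates what `test_size=0.2` and `random_state=1` do, using a small synthetic table in place of the real data. The names `demo_X`, `demo_y`, `feature_a` and `feature_b` are made up for the illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real table: 50 rows, two made-up feature columns.
rng = np.random.default_rng(0)
demo_X = pd.DataFrame({
    "feature_a": rng.normal(size=50),
    "feature_b": rng.normal(size=50),
})
demo_y = pd.Series(rng.normal(size=50), name="target")

# Two calls with the same random_state produce the same shuffle.
X_tr, X_te, y_tr, y_te = train_test_split(demo_X, demo_y, test_size=0.2, random_state=1)
X_tr_again, _, _, _ = train_test_split(demo_X, demo_y, test_size=0.2, random_state=1)

print(len(X_tr), len(X_te))               # 80% / 20% of the 50 rows
print(X_tr.index.equals(X_tr_again.index))  # True: the split is reproducible
print(X_tr.index.equals(y_tr.index))        # True: features and target stay aligned
```

Fixing `random_state` matters here because the two models later in the notebook are split with the same seed, so both see the same training and test rows.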


**5) Building a model with DecisionTreeRegressor:**

In [6]:
from sklearn.tree import DecisionTreeRegressor

# create and train the model:

realestate_model = DecisionTreeRegressor(random_state=1)
realestate_model.fit(X_train, y_train)

**6) Predicting the property price for the first 5 rows:**

In [7]:
# make predictions for the first 5 rows:

first_5_rows = X.head(5)
predictions = realestate_model.predict(first_5_rows)

predictions

array([37.9, 42.2, 47.3, 54.8, 45.2])
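
The first five rows of `X` come from the full table, so any of them that landed in the training split will typically be reproduced exactly by a tree with no depth limit. A minimal sketch of the difference between training error and held-out error follows. It uses synthetic data rather than the real table, and the arrays `demo_X` and `demo_y` as well as the choice of mean absolute error as the metric are assumptions made for the illustration.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: the target depends on the first feature plus random noise.
rng = np.random.default_rng(0)
demo_X = rng.uniform(size=(200, 3))
demo_y = 10 * demo_X[:, 0] + rng.normal(scale=1.0, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(demo_X, demo_y, test_size=0.2, random_state=1)

# No depth limit: the tree keeps splitting until each leaf is pure.
tree = DecisionTreeRegressor(random_state=1).fit(X_tr, y_tr)

train_mae = mean_absolute_error(y_tr, tree.predict(X_tr))
test_mae = mean_absolute_error(y_te, tree.predict(X_te))

print(f"train MAE: {train_mae:.3f}")  # rows the tree has already seen
print(f"test MAE:  {test_mae:.3f}")   # rows the tree has never seen
```

In this notebook the same check would use `realestate_model.predict(X_test)` together with `y_test`.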

**7) Create a new model and predict the house price using only the house coordinates:**

In [8]:
# select features (only latitude and longitude):

realestate_features_2 = ["X5 latitude", "X6 longitude"]

X_coords = realestate_df[realestate_features_2]

X_coords.head()

| | X5 latitude | X6 longitude |
|---|---|---|
| 0 | 24.98298 | 121.54024 |
| 1 | 24.98034 | 121.53951 |
| 2 | 24.98746 | 121.54391 |
| 3 | 24.98746 | 121.54391 |
| 4 | 24.97937 | 121.54245 |


In [9]:
# split the data into training and testing sets:

X_train_coords, X_test_coords, y_train_coords, y_test_coords = train_test_split(X_coords, y, test_size=0.2, random_state=1)

In [10]:
# create and train the model:

model_coords = DecisionTreeRegressor(random_state=1)
model_coords.fit(X_train_coords, y_train_coords)

In [11]:
# make predictions for the first 5 rows using only coordinates:

first_5_rows_coords = X_coords.head(5)
predictions_coords = model_coords.predict(first_5_rows_coords)

predictions_coords

array([37.9, 42.2, 47.5, 47.5, 45.6])
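
Rows 2 and 3 receive the same prediction because they have identical coordinates, even though their actual prices differ (47.3 and 54.8). A tree cannot separate rows whose features are identical, so it returns the mean target of the training rows that share those features. The sketch below shows this on three rows copied from the tables above. It is a toy model, not the notebook's `model_coords`.

```python
from sklearn.tree import DecisionTreeRegressor

# Rows 2 and 3 share the same coordinates; row 0 is added as a third example.
coords = [
    [24.98746, 121.54391],  # row 2, price 47.3
    [24.98746, 121.54391],  # row 3, price 54.8
    [24.98298, 121.54024],  # row 0, price 37.9
]
prices = [47.3, 54.8, 37.9]

toy_tree = DecisionTreeRegressor(random_state=1).fit(coords, prices)

# The two identical rows end up in one leaf, whose value is their mean price.
shared_prediction = toy_tree.predict([[24.98746, 121.54391]])[0]
print(round(shared_prediction, 2))  # 51.05
```

The notebook's own prediction for these coordinates is 47.5 rather than 51.05, presumably because its training split holds a different set of rows at this location.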

**8) Conclusions about which prices were predicted incorrectly:**

In [12]:
# predictions from both models:

print("Predictions using all features:", predictions)
print("Predictions using only coordinates:", predictions_coords)

# comparison with actual prices:

actual_prices = y.head(5).values
print("Actual prices:", actual_prices)

# difference analysis:

differences_all_features = actual_prices - predictions
differences_coords = actual_prices - predictions_coords

print("Differences (all features):", differences_all_features)
print("Differences (coordinates only):", differences_coords)

Predictions using all features: [37.9 42.2 47.3 54.8 45.2]
Predictions using only coordinates: [37.9 42.2 47.5 47.5 45.6]
Actual prices: [37.9 42.2 47.3 54.8 43.1]
Differences (all features): [ 0.   0.   0.   0.  -2.1]
Differences (coordinates only): [ 0.   0.  -0.2  7.3 -2.5]


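On these five rows, the model trained on all six features reproduced four prices exactly and missed only row 4 (by 2.1), while the coordinates-only model missed rows 2, 3 and 4 (by 0.2, 7.3 and 2.5). Rows 2 and 3 share the same latitude and longitude but have different prices (47.3 and 54.8), so a model that sees only coordinates has to give both the same prediction (47.5). The model based on all features was therefore more accurate here than the model based only on geographic coordinates. Because the five rows come from the full table and most of them were probably seen during training, `X_test` and `y_test` would give a fairer estimate of each model's accuracy.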
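
The per-row differences printed above can be condensed into one number per model. The sketch below computes the mean absolute error over the five rows from the values printed in this section. The helper `mae` is a hypothetical name, and the choice of metric is an assumption, not something the notebook prescribes.

```python
# Values copied from the printed output in this section.
actual = [37.9, 42.2, 47.3, 54.8, 43.1]
pred_all_features = [37.9, 42.2, 47.3, 54.8, 45.2]
pred_coords_only = [37.9, 42.2, 47.5, 47.5, 45.6]


def mae(y_true, y_pred):
    """Mean absolute error: the average size of the prediction errors."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


mae_all = mae(actual, pred_all_features)
mae_coords = mae(actual, pred_coords_only)

print(round(mae_all, 2), round(mae_coords, 2))
# 0.42 2.0
```

On these five rows the coordinates-only model is off by about 2.0 on average, compared with about 0.42 for the model that uses all features.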