In [1]:
# Loading the Libraries
import pandas as pd


In [2]:
# Loading the Dataset
data = pd.read_csv('Real_Estate.csv')

In [3]:
# Inspect the first 5 rows of the Dataset
data.head()

,Transaction date,House age,Distance to the nearest MRT station,Number of convenience stores,Latitude,Longitude,House price of unit area
0,2012-09-02 16:42:30.519336,13.3,4082.015,8,25.007059,121.561694,6.488673
1,2012-09-04 22:52:29.919544,35.5,274.0144,2,25.012148,121.54699,24.970725
2,2012-09-05 01:10:52.349449,1.1,1978.671,10,25.00385,121.528336,26.694267
3,2012-09-05 13:26:01.189083,22.2,1055.067,5,24.962887,121.482178,38.091638
4,2012-09-06 08:29:47.910523,8.5,967.4,6,25.011037,121.479946,21.65471


In [4]:
# Check information about the Dataset
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 414 entries, 0 to 413
Data columns (total 7 columns):
 #   Column                               Non-Null Count  Dtype  
---  ------                               --------------  -----  
 0   Transaction date                     414 non-null    object 
 1   House age                            414 non-null    float64
 2   Distance to the nearest MRT station  414 non-null    float64
 3   Number of convenience stores         414 non-null    int64  
 4   Latitude                             414 non-null    float64
 5   Longitude                            414 non-null    float64
 6   House price of unit area             414 non-null    float64
dtypes: float64(5), int64(1), object(1)
memory usage: 22.8+ KB


#### Data Preprocessing


In [5]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

In [6]:
# convert "Transaction date" to datetime and extract year and month
data['Transaction date'] = pd.to_datetime(data['Transaction date'])
data['Transaction year'] = data['Transaction date'].dt.year
data['Transaction month'] = data['Transaction date'].dt.month

In [7]:
data.head()

,Transaction date,House age,Distance to the nearest MRT station,Number of convenience stores,Latitude,Longitude,House price of unit area,Transaction year,Transaction month
0,2012-09-02 16:42:30.519336,13.3,4082.015,8,25.007059,121.561694,6.488673,2012,9
1,2012-09-04 22:52:29.919544,35.5,274.0144,2,25.012148,121.54699,24.970725,2012,9
2,2012-09-05 01:10:52.349449,1.1,1978.671,10,25.00385,121.528336,26.694267,2012,9
3,2012-09-05 13:26:01.189083,22.2,1055.067,5,24.962887,121.482178,38.091638,2012,9
4,2012-09-06 08:29:47.910523,8.5,967.4,6,25.011037,121.479946,21.65471,2012,9
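
To see what the `.dt` accessor returns in isolation, here is a minimal sketch on two hand-written timestamps. The values are illustrative (written in the same format as the "Transaction date" column) and are not loaded from `Real_Estate.csv`; the name `sample` is hypothetical.

```python
import pandas as pd

# Two illustrative timestamps in the same format as the "Transaction date" column
sample = pd.DataFrame({
    "Transaction date": ["2012-09-02 16:42:30.519336", "2013-01-15 08:00:00.000000"]
})

# Parse the strings into datetime values, then pull out the calendar parts
sample["Transaction date"] = pd.to_datetime(sample["Transaction date"])
sample["Transaction year"] = sample["Transaction date"].dt.year
sample["Transaction month"] = sample["Transaction date"].dt.month

print(sample[["Transaction year", "Transaction month"]])
```

After parsing, the column is no longer of dtype `object`, which is what makes the `.dt` accessor available.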


In [8]:
# drop the original "Transaction date" as we've extracted relevant features
data.drop(columns='Transaction date', inplace = True)

In [9]:
data.head()

,House age,Distance to the nearest MRT station,Number of convenience stores,Latitude,Longitude,House price of unit area,Transaction year,Transaction month
0,13.3,4082.015,8,25.007059,121.561694,6.488673,2012,9
1,35.5,274.0144,2,25.012148,121.54699,24.970725,2012,9
2,1.1,1978.671,10,25.00385,121.528336,26.694267,2012,9
3,22.2,1055.067,5,24.962887,121.482178,38.091638,2012,9
4,8.5,967.4,6,25.011037,121.479946,21.65471,2012,9


In [10]:
# define features and target variable
X = data.drop(columns= 'House price of unit area')
y = data['House price of unit area']

In [12]:
# split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size= 0.2, random_state= 42)

In [13]:
# scale the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
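
The scaler is fitted on the training set only and then reused on the test set, so no test-set statistics leak into preprocessing. The sketch below, using small made-up arrays (`train` and `test` are hypothetical, not the real features), checks that `transform` applies the training mean and standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative feature matrices (hypothetical values, not from the dataset)
train = np.array([[1.0, 100.0], [2.0, 300.0], [3.0, 500.0]])
test = np.array([[4.0, 200.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean and std from train only
test_scaled = scaler.transform(test)        # reuses the training mean and std

# Manual equivalent: z = (x - mean) / std, with the population std (ddof=0)
mean = train.mean(axis=0)
std = train.std(axis=0)
manual_test_scaled = (test - mean) / std

print(np.allclose(test_scaled, manual_test_scaled))
```

Calling `fit_transform` on the test set instead would rescale it with its own statistics, so the model would see test features on a different scale from the one it was trained on.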

In [14]:
X_train_scaled.shape

(331, 7)

In [15]:
X_test_scaled.shape

(83, 7)
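
The two shapes follow from the split fraction. A short arithmetic check, assuming scikit-learn's behaviour of rounding the test count up when `test_size` is a fraction and giving the remainder to training:

```python
import math

n_samples = 414   # rows reported by data.info()
test_size = 0.2   # fraction passed to train_test_split

# With a fractional test_size, the test count is rounded up
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)  # 331 83
```

The 7 columns are the 6 original predictors minus the dropped "Transaction date", plus the derived year and month.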

#### Model Training and Comparison


###### Now, we’ll proceed with training multiple models and comparing their performance. We’ll start with a few commonly used models for regression tasks:

1. Linear Regression: A good baseline model for regression tasks.
2. Decision Tree Regressor: To see how a simple tree-based model performs.
3. Random Forest Regressor: An ensemble method to improve upon the decision tree’s performance.
4. Gradient Boosting Regressor: Another powerful ensemble method for regression.

In [16]:
# Let’s start with training these models and comparing their performance
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error, r2_score

In [17]:
# initialize the models
models = {
    "Linear Regression": LinearRegression(),
    "Decision Tree": DecisionTreeRegressor(random_state = 42),
    "Random Forest": RandomForestRegressor(random_state = 42),
    "Gradient Boosting": GradientBoostingRegressor(random_state = 42)
}


In [18]:
# dictionary to hold the evaluation metrics for each model
results = {}

In [20]:
# Train and evaluate each model
for name, model in models.items():
    # training the model
    model.fit(X_train_scaled, y_train)

    # making predictions on the test set
    predictions = model.predict(X_test_scaled)

    # calculating evaluation metrics
    mae = mean_absolute_error(y_test, predictions)
    r2 = r2_score(y_test, predictions)

    # storing the metrics
    results[name] = {'MAE':mae, 'R2': r2}
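
For reference, the two metrics stored in the loop can be computed by hand. This is a minimal sketch on hypothetical values (`y_true` and `y_pred` are made up to keep the arithmetic readable); it mirrors the standard definitions of MAE and R² rather than scikit-learn's internal code.

```python
# Hypothetical true values and predictions, chosen only for readable arithmetic
y_true = [10.0, 20.0, 30.0, 40.0]
y_pred = [12.0, 18.0, 33.0, 37.0]

n = len(y_true)

# MAE: the average absolute error, in the same units as the target
mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n

# R^2: 1 - (residual sum of squares / total sum of squares around the mean)
mean_true = sum(y_true) / n
ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
ss_tot = sum((t - mean_true) ** 2 for t in y_true)
r2 = 1 - ss_res / ss_tot

print(mae, r2)
```

MAE is read in the units of the target (here, price per unit area), while R² compares the model against always predicting the mean: 0 means no better than the mean, 1 means a perfect fit.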


In [26]:
# convert the results to a DataFrame for better readability
results_df = pd.DataFrame(results).T

In [27]:
results_df

,MAE,R2
Linear Regression,9.748246,0.529615
Decision Tree,11.760342,0.204962
Random Forest,9.887601,0.509547
Gradient Boosting,10.000117,0.476071
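
To make the ranking explicit, the table can be sorted by MAE. The sketch below rebuilds the DataFrame from the numbers printed above so that it runs on its own; in the notebook you would sort the existing `results_df` directly.

```python
import pandas as pd

# Metrics copied from the results table above
results_df = pd.DataFrame(
    {
        "MAE": [9.748246, 11.760342, 9.887601, 10.000117],
        "R2": [0.529615, 0.204962, 0.509547, 0.476071],
    },
    index=["Linear Regression", "Decision Tree", "Random Forest", "Gradient Boosting"],
)

# Lower MAE is better, so sort in ascending order
ranked = results_df.sort_values("MAE")
print(ranked)

# Gap between the best and second-best MAE
gap = ranked["MAE"].iloc[1] - ranked["MAE"].iloc[0]
print(round(gap, 3))
```

The gap between the top two models is about 0.14, which is small relative to MAE values near 10.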


###### Linear Regression has the lowest MAE (9.75) and the highest R² (0.53) on this test split, making it the best-performing model among those evaluated, though its lead over Random Forest (MAE 9.89, R² 0.51) is small. Despite its simplicity, Linear Regression is effective for this dataset.

###### Decision Tree Regressor shows the highest MAE (11.76) and the lowest R² (0.20), indicating it may be overfitting to the training data and performing poorly on the test data. On the other hand, Random Forest Regressor and Gradient Boosting Regressor have similar MAEs (9.89 and 10.00, respectively) and R² scores (0.51 and 0.48, respectively), performing slightly worse than the Linear Regression model but better than the Decision Tree.
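
The notebook reports test-set scores only, so the overfitting reading is a hypothesis. One way to check it is to compare training and test R² for the same model. The sketch below does this on synthetic data, which is an assumption for illustration and not the real-estate dataset; in the notebook you would use `X_train_scaled`, `X_test_scaled`, `y_train`, and `y_test` instead.

```python
import numpy as np
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic noisy data standing in for the real features (illustration only)
rng = np.random.default_rng(42)
X = rng.normal(size=(400, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=2.0, size=400)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# An unconstrained tree can grow until it memorises the training rows
tree = DecisionTreeRegressor(random_state=42).fit(X_train, y_train)

# A large gap between these two scores is the usual sign of overfitting
train_r2 = r2_score(y_train, tree.predict(X_train))
test_r2 = r2_score(y_test, tree.predict(X_test))
print(f"train R2 = {train_r2:.3f}, test R2 = {test_r2:.3f}")
```

If the same gap appears on the real data, limiting tree growth (for example with `max_depth` or `min_samples_leaf`) would be a reasonable next step.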

##### Summary
This is how you can train and compare multiple Machine Learning models using Python. Comparing several models helps you select the algorithm that offers the best balance of accuracy, complexity, and performance for your specific problem.