In this notebook I've created a Linear Regression model that predicts house prices. The dataset was obtained from kaggle.com: <br>
https://www.kaggle.com/datasets/tawfikelmetwally/chicago-house-price

Note: Because the dataset is small, it is possible that the model does not reach high accuracy.

In [25]:
import pandas as pd
import numpy as np
from sklearn import preprocessing
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

In [28]:
data = pd.read_csv('realest.csv')

data.describe()

Statistic,Price,Bedroom,Space,Room,Lot,Tax,Bathroom,Garage,Condition
count,156.0,156.0,146.0,156.0,146.0,147.0,156.0,156.0,156.0
mean,56.474359,3.166667,1097.246575,6.5,32.808219,911.707483,1.480769,0.846154,0.230769
std,12.875307,1.348037,462.540698,1.675247,8.457859,443.26343,0.529408,0.808454,0.422682
min,32.0,1.0,539.0,4.0,24.0,418.0,1.0,0.0,0.0
25%,46.0,2.0,805.25,5.0,25.0,652.5,1.0,0.0,0.0
50%,55.0,3.0,965.5,6.0,30.0,821.0,1.5,1.0,0.0
75%,65.0,4.0,1220.5,7.0,37.0,1012.5,2.0,1.5,0.0
max,90.0,8.0,2295.0,12.0,50.0,2752.0,3.0,2.0,1.0


We have the following variables:

Price: Price of the house <br>
Bedroom: Number of bedrooms <br>
Space: Size of the house (in square feet) <br>
Room: Number of rooms <br>
Lot: Width of a lot <br>
Tax: Amount of annual tax <br>
Bathroom: Number of bathrooms <br>
Garage: Number of garages <br>
Condition: Condition of the house (1 if good, 0 otherwise) <br>

In [35]:
data = data.dropna()
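
In the `describe()` output above, Space, Lot and Tax have fewer than 156 non-null values, so `dropna()` removes some rows. Before dropping, it can help to check how many rows are affected. A minimal sketch on a small made-up frame (the values are illustrative and not taken from `realest.csv`):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the real dataset
toy = pd.DataFrame({
    "Price": [53.0, 55.0, 56.0, 58.0],
    "Space": [967.0, np.nan, 1142.0, 1004.0],
    "Lot": [39.0, 33.0, np.nan, 40.0],
    "Tax": [652.0, 1000.0, 897.0, np.nan],
})

missing_per_column = toy.isna().sum()  # NaN count for each column
cleaned = toy.dropna()                 # drop every row with at least one NaN
rows_dropped = len(toy) - len(cleaned)

print(missing_per_column)
print("rows dropped:", rows_dropped)
```

On the real data the same two lines (`data.isna().sum()` and a row count before and after `dropna()`) would show how much of the already small dataset is lost.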

In [36]:
data.corr(numeric_only = True)['Price']

Price        1.000000
Bedroom      0.321623
Space        0.739074
Room         0.578310
Lot          0.467262
Tax          0.505958
Bathroom     0.567729
Garage       0.555242
Condition    0.137773
Name: Price, dtype: float64

Looking at the correlation between each feature and the price, most features show a moderate to strong positive correlation (from 0.32 for "Bedroom" up to 0.74 for "Space"). The exception is "Condition" (0.14), so we will build one model with that variable and one without it and compare their scores.

First we'll start with the model that includes the "Condition" variable.

In [61]:
X_data = data[['Bedroom', 'Space', 'Room', 'Lot', 'Tax', 'Bathroom', 'Garage', 'Condition']]
y_data = data['Price']

In [62]:
scaler = StandardScaler()
scaler.fit(X_data)
scaled_data = scaler.transform(X_data)

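Standardize the features (zero mean, unit variance). Note that the cells below split the unscaled X_data, so scaled_data is not actually used by the model.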
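
For ordinary least squares with an intercept, standardizing the features should change the coefficients but not the R² score. The sketch below checks this on synthetic data (made up for illustration, not the housing dataset) and uses a pipeline so that the scaler is fitted on the training split only. The array shapes, weights and noise level are assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 3 features on very different scales
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 3)) * np.array([1.0, 400.0, 8.0])
y = X @ np.array([2.0, 0.02, 0.5]) + rng.normal(scale=3.0, size=120)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Plain linear regression on the raw features
raw_model = LinearRegression().fit(X_train, y_train)
raw_r2 = raw_model.score(X_test, y_test)

# Scaler fitted on the training split only, followed by linear regression
scaled_model = make_pipeline(StandardScaler(), LinearRegression())
scaled_model.fit(X_train, y_train)
scaled_r2 = scaled_model.score(X_test, y_test)

print(raw_r2, scaled_r2)
```

The two scores should agree up to floating-point error. Scaling matters more for regularized models such as Ridge or Lasso, where the penalty depends on the size of the coefficients.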

In [63]:
max_acc = 0    # best R^2 score seen so far
best_size = 0  # test size that produced it

# Try several test sizes and keep the one with the highest test score
for size in [0.15, 0.2, 0.25, 0.3, 0.35]:
    X_train, X_test, y_train, y_test = train_test_split(X_data, y_data, random_state=42, test_size=size)

    reg = LinearRegression().fit(X_train, y_train)

    # For a regressor, score() returns R^2 (coefficient of determination)
    score = reg.score(X_test, y_test)

    if score > max_acc:
        max_acc = score
        best_size = size

print("The best accuracy is obtained with a test size of {} and is {}".format(best_size, max_acc))

The best accuracy is obtained with a test size of 0.25 and is 0.7013069999242456
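
Because the test size here is chosen by looking at the test score itself, the reported value may be somewhat optimistic, and with so few rows a single split can vary a lot. One common alternative is k-fold cross-validation, which averages the score over several splits. A minimal sketch on synthetic data (the data, the 5 folds and the seed are assumptions, not part of the original notebook):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (assumption): 140 rows, 4 features
rng = np.random.default_rng(1)
X = rng.normal(size=(140, 4))
y = X @ np.array([3.0, 1.5, 0.0, -2.0]) + rng.normal(scale=2.0, size=140)

# 5 shuffled folds; each row is used for testing exactly once
cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(LinearRegression(), X, y, cv=cv, scoring="r2")

print("R^2 per fold:", scores)
print("mean:", scores.mean(), "std:", scores.std())
```

With the real data, `X` and `y` would be replaced by `X_data` and `y_data`; the spread across folds gives a rough idea of how much the score depends on the split.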


Now we build the model again without the "Condition" variable.

In [64]:
X_data = data[['Bedroom', 'Space', 'Room', 'Lot', 'Tax', 'Bathroom', 'Garage']]
y_data = data['Price']

In [65]:
scaler = StandardScaler()
scaler.fit(X_data)
scaled_data = scaler.transform(X_data)

Standardize the features as before. As in the first model, the split below uses the unscaled X_data rather than scaled_data.

In [67]:
max_acc = 0    # best R^2 score seen so far
best_size = 0  # test size that produced it

# Try several test sizes and keep the one with the highest test score
for size in [0.15, 0.2, 0.25, 0.3, 0.35]:
    X_train, X_test, y_train, y_test = train_test_split(X_data, y_data, random_state=42, test_size=size)

    reg = LinearRegression().fit(X_train, y_train)

    # For a regressor, score() returns R^2 (coefficient of determination)
    score = reg.score(X_test, y_test)

    if score > max_acc:
        max_acc = score
        best_size = size

print("The best accuracy is obtained with a test size of {} and is {}".format(best_size, max_acc))

The best accuracy is obtained with a test size of 0.25 and is 0.6949236382366816


Despite the low correlation between the price and the "Condition" variable, the model scores slightly higher when it is included (R² of about 0.701 versus 0.695).
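
A gap of this size comes from a single train/test split, so it may be worth checking whether it holds across different splits. The sketch below repeats the comparison over several random seeds on synthetic data with one deliberately weak feature. The data, the helper `r2_for_seed` and the choice of 20 seeds are all hypothetical and only illustrate the idea.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): the third feature is only weakly related to y
rng = np.random.default_rng(2)
n = 140
X_full = rng.normal(size=(n, 3))
y = (4.0 * X_full[:, 0] + 2.0 * X_full[:, 1] + 0.3 * X_full[:, 2]
     + rng.normal(scale=2.0, size=n))
X_reduced = X_full[:, :2]  # same data without the weak feature


def r2_for_seed(X, seed):
    """Fit on one random 75/25 split and return the test R^2 (hypothetical helper)."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.25, random_state=seed
    )
    return LinearRegression().fit(X_tr, y_tr).score(X_te, y_te)


# The same seed gives the same row split for both feature sets
diffs = np.array([r2_for_seed(X_full, s) - r2_for_seed(X_reduced, s)
                  for s in range(20)])

print("mean difference:", diffs.mean())
print("std of difference:", diffs.std())
```

If the mean difference is small compared with its spread, the extra variable is probably not making a reliable difference.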