# <center>House price prediction with machine learning models 🤖</center>

![house](image.png)

## Contents

- [Getting the data](#getting-the-data)
- [Data exploration](#data-exploration)
- [Data cleaning](#data-cleaning)
- [Data visualization](#data-visualization)
- [Setting up the models](#setting-up-the-models)
- [Training the models](#training-the-models)

## Data description 📝

- `id` - Unique ID for every individual row entry  
- `date` - Date the house was sold  
- `price` - Price of the house  
- `bedrooms` - Number of bedrooms  
- `bathrooms` - Number of bathrooms  
- `sqft_living` - Square footage of the house  
- `sqft_lot` - Square footage of the lot  
- `floors` - Number of floors  
- `waterfront` - Whether the house has a view to a waterfront  
- `view` - How good the view of the property is  
- `condition` - Condition of the house  
- `grade` - Grade given to the house based on the overall construction and design  
- `sqft_above` - Square footage of house apart from the basement  
- `sqft_basement` - Square footage of the basement  
- `yr_built` - Year the house was built  
- `yr_renovated` - Year the house was renovated  
- `zipcode` - Zipcode of the house  
- `lat` - Latitude coordinate  
- `long` - Longitude coordinate  
- `sqft_living15` - Living room area in 2015 (implies some renovations); this may or may not have affected the lot size  
- `sqft_lot15` - Lot size in 2015 (implies some renovations)  

<hr>

In [35]:
import numpy as np
import pandas as pd
import seaborn as sn
import matplotlib.pyplot as plt

## Getting the data

In [36]:
# Create a dataframe from the csv file
df = pd.read_csv('kc_house_data.csv')

In [37]:
# Shape of the dataframe
df.shape

(21613, 21)

<hr>

## Data exploration

In [38]:
# Display the first 5 rows of the dataframe
df.head()

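|  | id | date | price | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront | view | ... | grade | sqft_above | sqft_basement | yr_built | yr_renovated | zipcode | lat | long | sqft_living15 | sqft_lot15 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 7129300520 | 20141013T000000 | 221900.0 | 3 | 1.0 | 1180 | 5650 | 1.0 | 0 | 0 | ... | 7 | 1180 | 0 | 1955 | 0 | 98178 | 47.5112 | -122.257 | 1340 | 5650 |
| 1 | 6414100192 | 20141209T000000 | 538000.0 | 3 | 2.25 | 2570 | 7242 | 2.0 | 0 | 0 | ... | 7 | 2170 | 400 | 1951 | 1991 | 98125 | 47.721 | -122.319 | 1690 | 7639 |
| 2 | 5631500400 | 20150225T000000 | 180000.0 | 2 | 1.0 | 770 | 10000 | 1.0 | 0 | 0 | ... | 6 | 770 | 0 | 1933 | 0 | 98028 | 47.7379 | -122.233 | 2720 | 8062 |
| 3 | 2487200875 | 20141209T000000 | 604000.0 | 4 | 3.0 | 1960 | 5000 | 1.0 | 0 | 0 | ... | 7 | 1050 | 910 | 1965 | 0 | 98136 | 47.5208 | -122.393 | 1360 | 5000 |
| 4 | 1954400510 | 20150218T000000 | 510000.0 | 3 | 2.0 | 1680 | 8080 | 1.0 | 0 | 0 | ... | 8 | 1680 | 0 | 1987 | 0 | 98074 | 47.6168 | -122.045 | 1800 | 7503 |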


In [39]:
# Display the last 5 rows of the dataframe
df.tail()

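|  | id | date | price | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront | view | ... | grade | sqft_above | sqft_basement | yr_built | yr_renovated | zipcode | lat | long | sqft_living15 | sqft_lot15 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 21608 | 263000018 | 20140521T000000 | 360000.0 | 3 | 2.5 | 1530 | 1131 | 3.0 | 0 | 0 | ... | 8 | 1530 | 0 | 2009 | 0 | 98103 | 47.6993 | -122.346 | 1530 | 1509 |
| 21609 | 6600060120 | 20150223T000000 | 400000.0 | 4 | 2.5 | 2310 | 5813 | 2.0 | 0 | 0 | ... | 8 | 2310 | 0 | 2014 | 0 | 98146 | 47.5107 | -122.362 | 1830 | 7200 |
| 21610 | 1523300141 | 20140623T000000 | 402101.0 | 2 | 0.75 | 1020 | 1350 | 2.0 | 0 | 0 | ... | 7 | 1020 | 0 | 2009 | 0 | 98144 | 47.5944 | -122.299 | 1020 | 2007 |
| 21611 | 291310100 | 20150116T000000 | 400000.0 | 3 | 2.5 | 1600 | 2388 | 2.0 | 0 | 0 | ... | 8 | 1600 | 0 | 2004 | 0 | 98027 | 47.5345 | -122.069 | 1410 | 1287 |
| 21612 | 1523300157 | 20141015T000000 | 325000.0 | 2 | 0.75 | 1020 | 1076 | 2.0 | 0 | 0 | ... | 7 | 1020 | 0 | 2008 | 0 | 98144 | 47.5941 | -122.299 | 1020 | 1357 |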


In [40]:
# Sample random 10 rows from the dataframe
df.sample(10)

|  | id | date | price | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront | view | ... | grade | sqft_above | sqft_basement | yr_built | yr_renovated | zipcode | lat | long | sqft_living15 | sqft_lot15 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 21082 | 8562900430 | 20140718T000000 | 800000.0 | 4 | 2.5 | 3691 | 11088 | 2.0 | 0 | 1 | ... | 8 | 3691 | 0 | 2013 | 0 | 98074 | 47.6122 | -122.059 | 3190 | 11270 |
| 10426 | 6746700565 | 20141023T000000 | 447000.0 | 2 | 1.0 | 850 | 2700 | 1.0 | 0 | 0 | ... | 6 | 850 | 0 | 1924 | 0 | 98105 | 47.6684 | -122.316 | 1630 | 3000 |
| 9693 | 8644300200 | 20140605T000000 | 555000.0 | 4 | 2.75 | 2020 | 10720 | 1.0 | 0 | 0 | ... | 8 | 1420 | 600 | 1976 | 0 | 98052 | 47.6373 | -122.104 | 2190 | 10164 |
| 12279 | 2313900810 | 20150402T000000 | 610000.0 | 4 | 2.0 | 2220 | 5821 | 1.5 | 0 | 0 | ... | 7 | 1380 | 840 | 1916 | 0 | 98116 | 47.5723 | -122.382 | 1850 | 5000 |
| 17248 | 725069102 | 20150330T000000 | 650000.0 | 3 | 2.25 | 2180 | 60112 | 2.0 | 0 | 0 | ... | 8 | 2180 | 0 | 1976 | 0 | 98053 | 47.6723 | -122.082 | 2060 | 120225 |
| 16567 | 7603100095 | 20141110T000000 | 1260000.0 | 3 | 3.0 | 3230 | 8625 | 2.0 | 0 | 3 | ... | 10 | 2220 | 1010 | 1998 | 0 | 98116 | 47.562 | -122.404 | 2330 | 6022 |
| 2079 | 1336800160 | 20140605T000000 | 875000.0 | 5 | 2.5 | 2920 | 5568 | 2.0 | 0 | 0 | ... | 8 | 2320 | 600 | 1906 | 0 | 98112 | 47.6265 | -122.312 | 2970 | 5568 |
| 584 | 5419800510 | 20141117T000000 | 268500.0 | 4 | 1.75 | 1420 | 7500 | 1.0 | 0 | 0 | ... | 7 | 1080 | 340 | 1981 | 0 | 98031 | 47.4025 | -122.176 | 1500 | 7260 |
| 19849 | 7708210070 | 20140617T000000 | 535000.0 | 4 | 2.75 | 3070 | 7201 | 2.0 | 0 | 0 | ... | 9 | 3070 | 0 | 2006 | 0 | 98059 | 47.4897 | -122.147 | 2880 | 8364 |
| 12868 | 4014400237 | 20140523T000000 | 132500.0 | 3 | 1.0 | 1080 | 10500 | 1.0 | 0 | 0 | ... | 7 | 1080 | 0 | 1967 | 0 | 98001 | 47.32 | -122.278 | 1200 | 9607 |


In [41]:
# Display the columns of the dataframe
df.columns

Index(['id', 'date', 'price', 'bedrooms', 'bathrooms', 'sqft_living',
       'sqft_lot', 'floors', 'waterfront', 'view', 'condition', 'grade',
       'sqft_above', 'sqft_basement', 'yr_built', 'yr_renovated', 'zipcode',
       'lat', 'long', 'sqft_living15', 'sqft_lot15'],
      dtype='object')

In [42]:
# Display info and the datatypes of the columns
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21613 entries, 0 to 21612
Data columns (total 21 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   id             21613 non-null  int64  
 1   date           21613 non-null  object 
 2   price          21613 non-null  float64
 3   bedrooms       21613 non-null  int64  
 4   bathrooms      21613 non-null  float64
 5   sqft_living    21613 non-null  int64  
 6   sqft_lot       21613 non-null  int64  
 7   floors         21613 non-null  float64
 8   waterfront     21613 non-null  int64  
 9   view           21613 non-null  int64  
 10  condition      21613 non-null  int64  
 11  grade          21613 non-null  int64  
 12  sqft_above     21613 non-null  int64  
 13  sqft_basement  21613 non-null  int64  
 14  yr_built       21613 non-null  int64  
 15  yr_renovated   21613 non-null  int64  
 16  zipcode        21613 non-null  int64  
 17  lat            21613 non-null  float64
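 18  long           21613 non-null  float64
 19  sqft_living15  21613 non-null  int64  
 20  sqft_lot15     21613 non-null  int64  
dtypes: float64(5), int64(15), object(1)
memory usage: 3.5+ MB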

### Data shows the following:

- There are 21,613 rows or entries in the dataset
- There are 21 columns or features in the dataset
- There are different data types in the dataset, including `int64`, `float64`, and `object`
- There are no missing values in the dataset
- There are 21,436 unique values in the `id` column, fewer than the 21,613 rows, so some IDs appear more than once
- Memory usage of the dataset is 3.5+ MB
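
Because the `id` column has fewer unique values than the dataset has rows, some IDs must repeat. The snippet below is a minimal sketch of how such repeats could be inspected. It uses a small made-up frame (`toy`, with illustrative values that are not taken from `kc_house_data.csv`) so it runs on its own; on the real data the same calls would be applied to `df`.

```python
import pandas as pd

# Toy data standing in for the real dataset (hypothetical values)
toy = pd.DataFrame({
    "id": [101, 102, 101, 103],
    "date": ["20140502T000000", "20140610T000000",
             "20150115T000000", "20140721T000000"],
    "price": [300000.0, 450000.0, 325000.0, 500000.0],
})

# keep=False marks every occurrence of a repeated id, not only the later ones
repeated = toy[toy["id"].duplicated(keep=False)]

n_unique_ids = toy["id"].nunique()
n_repeated_rows = len(repeated)
print(f"Unique ids: {n_unique_ids}, rows sharing an id: {n_repeated_rows}")
print(repeated.sort_values(["id", "date"]))
```

Whether repeated IDs should be kept (for example as separate sales of the same house) or deduplicated is a modelling decision that this notebook does not address.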

## Data cleaning

In [43]:
# Search for missing values
df.isnull().sum()

id               0
date             0
price            0
bedrooms         0
bathrooms        0
sqft_living      0
sqft_lot         0
floors           0
waterfront       0
view             0
condition        0
grade            0
sqft_above       0
sqft_basement    0
yr_built         0
yr_renovated     0
zipcode          0
lat              0
long             0
sqft_living15    0
sqft_lot15       0
dtype: int64

In [44]:
# Display the statistics of the dataframe. This method provides a summary of the numerical attributes like `count`, `mean`, `min`, `max` and `std`.
df.describe()

|  | id | price | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront | view | condition | grade | sqft_above | sqft_basement | yr_built | yr_renovated | zipcode | lat | long | sqft_living15 | sqft_lot15 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 21613.0 | 21613.0 | 21613.0 | 21613.0 | 21613.0 | 21613.0 | 21613.0 | 21613.0 | 21613.0 | 21613.0 | 21613.0 | 21613.0 | 21613.0 | 21613.0 | 21613.0 | 21613.0 | 21613.0 | 21613.0 | 21613.0 | 21613.0 |
| mean | 4580302000.0 | 540088.1 | 3.370842 | 2.114757 | 2079.899736 | 15106.97 | 1.494309 | 0.007542 | 0.234303 | 3.40943 | 7.656873 | 1788.390691 | 291.509045 | 1971.005136 | 84.402258 | 98077.939805 | 47.560053 | -122.213896 | 1986.552492 | 12768.455652 |
| std | 2876566000.0 | 367127.2 | 0.930062 | 0.770163 | 918.440897 | 41420.51 | 0.539989 | 0.086517 | 0.766318 | 0.650743 | 1.175459 | 828.090978 | 442.575043 | 29.373411 | 401.67924 | 53.505026 | 0.138564 | 0.140828 | 685.391304 | 27304.179631 |
| min | 1000102.0 | 75000.0 | 0.0 | 0.0 | 290.0 | 520.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 290.0 | 0.0 | 1900.0 | 0.0 | 98001.0 | 47.1559 | -122.519 | 399.0 | 651.0 |
| 25% | 2123049000.0 | 321950.0 | 3.0 | 1.75 | 1427.0 | 5040.0 | 1.0 | 0.0 | 0.0 | 3.0 | 7.0 | 1190.0 | 0.0 | 1951.0 | 0.0 | 98033.0 | 47.471 | -122.328 | 1490.0 | 5100.0 |
| 50% | 3904930000.0 | 450000.0 | 3.0 | 2.25 | 1910.0 | 7618.0 | 1.5 | 0.0 | 0.0 | 3.0 | 7.0 | 1560.0 | 0.0 | 1975.0 | 0.0 | 98065.0 | 47.5718 | -122.23 | 1840.0 | 7620.0 |
| 75% | 7308900000.0 | 645000.0 | 4.0 | 2.5 | 2550.0 | 10688.0 | 2.0 | 0.0 | 0.0 | 4.0 | 8.0 | 2210.0 | 560.0 | 1997.0 | 0.0 | 98118.0 | 47.678 | -122.125 | 2360.0 | 10083.0 |
| max | 9900000000.0 | 7700000.0 | 33.0 | 8.0 | 13540.0 | 1651359.0 | 3.5 | 1.0 | 4.0 | 5.0 | 13.0 | 9410.0 | 4820.0 | 2015.0 | 2015.0 | 98199.0 | 47.7776 | -121.315 | 6210.0 | 871200.0 |
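
The summary statistics hint at a few unusual rows: the minimum of `bedrooms` and `bathrooms` is 0, and the maximum of `bedrooms` is 33. The notebook does not act on this, but a minimal sketch of how such rows could be flagged for manual review might look like the following. The frame `toy` and the threshold `MAX_PLAUSIBLE_BEDROOMS` are assumptions made up for illustration, not values from the source.

```python
import pandas as pd

# Toy rows mimicking the extremes seen in describe() (hypothetical values)
toy = pd.DataFrame({
    "bedrooms": [3, 0, 33, 4],
    "bathrooms": [1.0, 0.0, 1.75, 2.5],
    "sqft_living": [1180, 290, 1620, 2570],
})

# Assumed cut-off for illustration only; choose it after inspecting the data
MAX_PLAUSIBLE_BEDROOMS = 15

# Flag rows with no bedrooms, no bathrooms, or an implausibly high bedroom count
suspicious = toy[
    (toy["bedrooms"] == 0)
    | (toy["bathrooms"] == 0)
    | (toy["bedrooms"] > MAX_PLAUSIBLE_BEDROOMS)
]
suspicious_index = suspicious.index.tolist()
print(suspicious)
```

Flagging rather than dropping keeps the decision explicit: some of these rows may be data-entry errors, others may be legitimate listings.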


In [45]:
df.price.describe()

count    2.161300e+04
mean     5.400881e+05
std      3.671272e+05
min      7.500000e+04
25%      3.219500e+05
50%      4.500000e+05
75%      6.450000e+05
max      7.700000e+06
Name: price, dtype: float64
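
The mean price (about 540,088) sits well above the median (450,000) and the maximum reaches 7.7 million, which suggests a right-skewed target. One common option, not used in this notebook, is to model the logarithm of the price instead. The sketch below shows the round trip on a handful of prices taken from the outputs above; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# A few prices from the outputs above, including the maximum
prices = pd.Series(
    [221900.0, 538000.0, 180000.0, 604000.0, 510000.0, 7700000.0],
    name="price",
)

# log1p compresses the long right tail; expm1 is its exact inverse
log_prices = np.log1p(prices)
restored = np.expm1(log_prices)

skew_before = prices.skew()
skew_after = log_prices.skew()
print(f"Skew before: {skew_before:.2f}, after log1p: {skew_after:.2f}")
```

If a model is trained on the transformed target, its predictions need to be passed through `np.expm1` to get back to dollar amounts.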

In [46]:
# Numeric features ('number' matches every integer and float dtype on any platform)
numeric_features = df.select_dtypes(include='number').columns
numeric_features

Index(['id', 'price', 'bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot',
       'floors', 'waterfront', 'view', 'condition', 'grade', 'sqft_above',
       'sqft_basement', 'yr_built', 'yr_renovated', 'zipcode', 'lat', 'long',
       'sqft_living15', 'sqft_lot15'],
      dtype='object')

In [47]:
# Categorical Features
categorical_features = df.select_dtypes('object').columns
categorical_features

Index(['date'], dtype='object')
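
The only `object` column is `date`, which holds strings such as `20141013T000000`. The notebook treats it as categorical and later drops it. As an alternative, the sketch below shows how those strings could be parsed into real timestamps and split into numeric parts. The column names `sale_year` and `sale_month` are hypothetical, and the format string assumes every value follows the pattern seen in the sample rows.

```python
import pandas as pd

# Date strings copied from the first rows of the dataset
toy = pd.DataFrame(
    {"date": ["20141013T000000", "20141209T000000", "20150225T000000"]}
)

# Year, month, day, a literal "T", then hours, minutes, seconds
toy["date"] = pd.to_datetime(toy["date"], format="%Y%m%dT%H%M%S")

# Hypothetical derived columns a model could use
toy["sale_year"] = toy["date"].dt.year
toy["sale_month"] = toy["date"].dt.month
print(toy)
```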

In [48]:
print(f'Number of `Numerical` features: {len(numeric_features)}')
print(f'Number of `Categorical` features: {len(categorical_features)}')
print(f'Total features: {len(numeric_features) + len(categorical_features)}')

Number of `Numerical` features: 20
Number of `Categorical` features: 1
Total features: 21


In [49]:
# Find unique values in dataframe
print(f"Total records in the dataframe: {len(df)}")
for col in numeric_features:
    print(f'Unique values in {col} are: {len(df[col].unique())}')

Total records in the dataframe: 21613
Unique values in id are: 21436
Unique values in price are: 4028
Unique values in bedrooms are: 13
Unique values in bathrooms are: 30
Unique values in sqft_living are: 1038
Unique values in sqft_lot are: 9782
Unique values in floors are: 6
Unique values in waterfront are: 2
Unique values in view are: 5
Unique values in condition are: 5
Unique values in grade are: 12
Unique values in sqft_above are: 946
Unique values in sqft_basement are: 306
Unique values in yr_built are: 116
Unique values in yr_renovated are: 70
Unique values in zipcode are: 70
Unique values in lat are: 5034
Unique values in long are: 752
Unique values in sqft_living15 are: 777
Unique values in sqft_lot15 are: 8689
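
The same per-column counts can be obtained without a loop through `DataFrame.nunique()`. The sketch below uses a small made-up frame so it runs on its own. Note one difference: `nunique()` ignores missing values by default, while `len(df[col].unique())` counts `NaN` as a value. For this dataset, which has no missing values, the two should agree.

```python
import pandas as pd

# Toy frame with hypothetical values
toy = pd.DataFrame({
    "bedrooms": [3, 3, 2, 4],
    "floors": [1.0, 2.0, 1.0, 1.0],
    "waterfront": [0, 0, 0, 0],
})

# One call returns a Series mapping each column to its number of distinct values
unique_counts = toy.nunique()
print(unique_counts.sort_values(ascending=False))
```

Columns with very few distinct values (such as `waterfront`, `view`, or `condition` in the output above) are candidates for being treated as categorical rather than continuous.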


In [50]:
# Remove columns that are not required
df.drop(['id', 'date'], axis=1, inplace=True)
df.head()

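|  | price | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront | view | condition | grade | sqft_above | sqft_basement | yr_built | yr_renovated | zipcode | lat | long | sqft_living15 | sqft_lot15 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 221900.0 | 3 | 1.0 | 1180 | 5650 | 1.0 | 0 | 0 | 3 | 7 | 1180 | 0 | 1955 | 0 | 98178 | 47.5112 | -122.257 | 1340 | 5650 |
| 1 | 538000.0 | 3 | 2.25 | 2570 | 7242 | 2.0 | 0 | 0 | 3 | 7 | 2170 | 400 | 1951 | 1991 | 98125 | 47.721 | -122.319 | 1690 | 7639 |
| 2 | 180000.0 | 2 | 1.0 | 770 | 10000 | 1.0 | 0 | 0 | 3 | 6 | 770 | 0 | 1933 | 0 | 98028 | 47.7379 | -122.233 | 2720 | 8062 |
| 3 | 604000.0 | 4 | 3.0 | 1960 | 5000 | 1.0 | 0 | 0 | 5 | 7 | 1050 | 910 | 1965 | 0 | 98136 | 47.5208 | -122.393 | 1360 | 5000 |
| 4 | 510000.0 | 3 | 2.0 | 1680 | 8080 | 1.0 | 0 | 0 | 3 | 8 | 1680 | 0 | 1987 | 0 | 98074 | 47.6168 | -122.045 | 1800 | 7503 |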
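
A practical note on the drop step: with `inplace=True`, running the cell a second time raises a `KeyError` because the columns are already gone. A minimal sketch of a re-runnable variant, shown on a made-up frame, assigns the result back and passes `errors="ignore"`. The name `cols_to_drop` is illustrative.

```python
import pandas as pd

# Toy frame with the two columns the notebook removes (values from the first rows)
toy = pd.DataFrame({
    "id": [7129300520, 6414100192],
    "date": ["20141013T000000", "20141209T000000"],
    "price": [221900.0, 538000.0],
})

cols_to_drop = ["id", "date"]

# errors="ignore" skips labels that are already absent
toy = toy.drop(columns=cols_to_drop, errors="ignore")

# Running the same line again is a harmless no-op instead of a KeyError
toy = toy.drop(columns=cols_to_drop, errors="ignore")

remaining = toy.columns.tolist()
print(remaining)
```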


## Data visualization

Content of the "Data visualization" section

## Setting up the models

Content of the "Setting up the models" section

## Training the models

Content of the "Training the models" section