# Multiple linear regression with sklearn

This notebook fits the linear regression model from the sklearn library as a benchmark, before we implement our own multiple linear regression model.
To evaluate and compare the models, we use the R-squared metric.

### Importing dataset

In [1]:
# Importing libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

In [2]:
# Importing cleaned dataset with apartments for sale in Belgrade
dataset = pd.read_csv('../../database/backup/belgrade_flats.csv')
dataset.head()

Unnamed: 0,price,location_city,location_city_district,area_property,num_floors_building,apartment_floor,registered,heating_type,num_rooms,num_bathrooms
0,109000.0,beograd,stari grad,71.0,9.0,7.0,False,Centralno,2.0,1.0
1,149500.0,beograd,stari grad,74.0,4.0,2.0,True,Centralno,3.0,1.0
2,108000.0,beograd,zvezdara,60.0,4.0,3.0,False,Etažno,2.5,1.0
3,145000.0,beograd,vracar,96.0,2.0,1.0,True,Etažno,4.5,1.0
4,180000.0,beograd,palilula,94.0,5.0,4.0,True,Centralno,4.0,1.0


### Preparing data for modeling

After several iterations of testing various column combinations and creating new columns, the best model performance was achieved with the following columns removed:
- num_floors_building
- num_bathrooms
- apartment_floor

The improvement from dropping these columns was not substantial, but in this example we use the combination that gave the best model performance. The code below also drops location_city, which has the value beograd in every row shown above.
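
The column-combination search described above is not shown in the notebook. A minimal sketch of how such a comparison could be run is given here, assuming each subset is scored by mean cross-validated R-squared. The `toy` data frame, its column names, and the helper `score_subsets` are all hypothetical and stand in for the real dataset and procedure.

```python
from itertools import combinations

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (NOT the Belgrade dataset): price depends on
# area and rooms, while the "noise" column carries no signal.
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "area": rng.uniform(30, 120, n),
    "rooms": rng.integers(1, 5, n).astype(float),
    "noise": rng.normal(size=n),
})
toy["price"] = 1500 * toy["area"] + 5000 * toy["rooms"] + rng.normal(0, 10000, n)


def score_subsets(df, target, candidates, cv=10):
    """Return {column subset: mean cross-validated R-squared} (hypothetical helper)."""
    results = {}
    for k in range(1, len(candidates) + 1):
        for subset in combinations(candidates, k):
            scores = cross_val_score(
                LinearRegression(), df[list(subset)], df[target],
                scoring="r2", cv=cv,
            )
            results[subset] = scores.mean()
    return results


results = score_subsets(toy, "price", ["area", "rooms", "noise"])
# Print the subsets from best to worst mean R-squared
for subset, r2 in sorted(results.items(), key=lambda kv: kv[1], reverse=True):
    print(subset, round(r2, 3))
```

An exhaustive search like this grows exponentially with the number of candidate columns, so it is only practical for a handful of features.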

In [3]:
# Removing columns that aren't going to be used in the regression model
dataset = dataset.drop(columns=['location_city', 'num_floors_building', 'num_bathrooms', 'apartment_floor'])
dataset.head()

Unnamed: 0,price,location_city_district,area_property,registered,heating_type,num_rooms
0,109000.0,stari grad,71.0,False,Centralno,2.0
1,149500.0,stari grad,74.0,True,Centralno,3.0
2,108000.0,zvezdara,60.0,False,Etažno,2.5
3,145000.0,vracar,96.0,True,Etažno,4.5
4,180000.0,palilula,94.0,True,Centralno,4.0


In [4]:
# Splitting dataset into features (input values) and output value 
X = dataset.iloc[:, 1:].values
y = dataset.iloc[:, 0].values

In [5]:
# Encoding categorical variables:
# 1. City district/municipality
# 2. Heating type

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0, 3])], remainder='passthrough')
# fit_transform may return a SciPy sparse matrix; keep it as returned, because
# np.array() would wrap a sparse matrix in a 0-d object array
X = ct.fit_transform(X)

In [6]:
# Splitting dataset into training and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
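
To make the encoding step in cell In [5] more concrete, here is a small sketch on three toy rows laid out in the same column order as `X` (district, area, registered, heating type, rooms). The names `X_toy`, `ct_toy`, and `X_encoded` are illustrative, and `sparse_threshold=0` is added only so the result is dense and easy to print; the notebook itself does not set it.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Toy rows in the same column order as X:
# district, area, registered, heating type, rooms (illustrative values).
X_toy = np.array([
    ["stari grad", 71.0, False, "Centralno", 2.0],
    ["zvezdara", 60.0, False, "Etažno", 2.5],
    ["vracar", 96.0, True, "Etažno", 4.5],
], dtype=object)

# sparse_threshold=0 forces a dense result (an assumption made for display only).
ct_toy = ColumnTransformer(
    transformers=[("encoder", OneHotEncoder(), [0, 3])],
    remainder="passthrough",
    sparse_threshold=0,
)
X_encoded = ct_toy.fit_transform(X_toy)

# 3 district columns + 2 heating columns + 3 passthrough columns = 8 columns.
# The one-hot columns come first, followed by the untouched columns.
print(X_encoded.shape)
print(X_encoded)
```

On the real dataset the number of one-hot columns depends on how many distinct districts and heating types appear in the data.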

### Creating Multiple Linear Regression Model

In [7]:
# Training multiple linear regression model
from sklearn.linear_model import LinearRegression
# The normalize parameter (False by default) was removed in scikit-learn 1.2
regressor = LinearRegression()
regressor.fit(X_train, y_train)

LinearRegression()
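
Since the goal is to implement our own model later, it may help to see what an ordinary least squares fit amounts to. The sketch below solves the least squares problem with `np.linalg.lstsq` on synthetic data. The function `fit_least_squares` and the demo arrays are hypothetical, and sklearn's internal procedure differs in its details, so treat this as an illustration rather than a description of the library's implementation.

```python
import numpy as np


def fit_least_squares(X, y):
    """Return (intercept, coefficients) minimising the sum of squared residuals."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Prepend a column of ones so the intercept is estimated with the slopes.
    X_design = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(X_design, y, rcond=None)
    return beta[0], beta[1:]


# Synthetic check (NOT the Belgrade data): y = 3 + 2*x1 - 1*x2, without noise.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 2))
y_demo = 3 + 2 * X_demo[:, 0] - 1 * X_demo[:, 1]

intercept, coefs = fit_least_squares(X_demo, y_demo)
print(intercept, coefs)
```

Because the synthetic target has no noise, the recovered intercept and coefficients should match the generating values up to floating-point error.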

In [8]:
# Creating predictions on testing dataset
y_pred = regressor.predict(X_test)
np.set_printoptions(precision=2)
# Left column: predicted price, right column: actual price
print(np.column_stack((y_pred, y_test)))

[[265682.76 215000.  ]
 [152704.58  89900.  ]
 [252579.94 249000.  ]
 ...
 [185172.39 204000.  ]
 [106774.    68500.  ]
 [105930.73  75000.  ]]
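
The printed pairs are easier to read with an explicit error column. The following sketch uses only the six (predicted, actual) rows visible in the truncated output above, so the resulting figures describe those rows alone and not the whole test set. The variable names are illustrative.

```python
import numpy as np

# The six (predicted, actual) pairs visible in the truncated output.
pairs = np.array([
    [265682.76, 215000.0],
    [152704.58, 89900.0],
    [252579.94, 249000.0],
    [185172.39, 204000.0],
    [106774.00, 68500.0],
    [105930.73, 75000.0],
])
pred, actual = pairs[:, 0], pairs[:, 1]

abs_error = np.abs(pred - actual)        # absolute error per row
pct_error = 100 * abs_error / actual     # error relative to the actual price

for p, a, e, pct in zip(pred, actual, abs_error, pct_error):
    print(f"{p:>10.2f} {a:>10.2f} {e:>9.2f} {pct:6.1f}%")
print("Mean absolute error over these six rows:", round(abs_error.mean(), 2))
```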


### Evaluating our model

In [9]:
# Using 10-fold cross-validation on the training set to evaluate the performance of our model

from sklearn.model_selection import cross_val_score
scores = cross_val_score(regressor, X_train, y_train, scoring='r2', cv=10)
print(scores)
print(scores.mean())

[0.58 0.65 0.56 0.58 0.61 0.52 0.7  0.55 0.54 0.67]
0.5974855592482461
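
To show what `cross_val_score` does with `cv=10` for a regressor, the sketch below repeats the procedure by hand with `KFold` and compares the fold scores with the library result. It runs on synthetic data, not on the Belgrade dataset, and the names `X_cv`, `y_cv`, and `manual_scores` are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data (NOT the Belgrade dataset).
rng = np.random.default_rng(0)
X_cv = rng.normal(size=(200, 3))
y_cv = X_cv @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.5, size=200)

# Manual 10-fold loop: fit on nine folds, score R-squared on the held-out fold.
manual_scores = []
for train_idx, val_idx in KFold(n_splits=10).split(X_cv):
    model = LinearRegression().fit(X_cv[train_idx], y_cv[train_idx])
    manual_scores.append(r2_score(y_cv[val_idx], model.predict(X_cv[val_idx])))
manual_scores = np.array(manual_scores)

# For a regressor, an integer cv uses unshuffled KFold, so the scores should agree.
library_scores = cross_val_score(LinearRegression(), X_cv, y_cv, scoring="r2", cv=10)
print(np.allclose(manual_scores, library_scores))
print(manual_scores.mean(), manual_scores.std())
```

Reporting the standard deviation next to the mean gives a sense of how much the score varies from fold to fold.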


In [10]:
# Calculating the R-squared metric on the test set

from sklearn.metrics import r2_score
r2_score(y_test, y_pred)

0.6885286721045634
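
For reference, R-squared is defined as 1 minus the ratio of the residual sum of squares to the total sum of squares around the mean of the actual values. A minimal sketch of that calculation follows; the function `r_squared` and the small demo arrays are illustrative and are not taken from the notebook.

```python
import numpy as np


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    return 1.0 - ss_res / ss_tot


# Small illustrative arrays (not from the dataset).
y_true_demo = np.array([3.0, -0.5, 2.0, 7.0])
y_pred_demo = np.array([2.5, 0.0, 2.0, 8.0])
print(r_squared(y_true_demo, y_pred_demo))
```

A value of 1 means the predictions match the actual values exactly, and a value of 0 means the model does no better than always predicting the mean.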