# Ordinary Least Squares Assumptions

Regression analysis with ordinary least squares (OLS) rests on several assumptions that must hold for the estimates to be reliable. These assumptions are:
- Linearity
- No Endogeneity
- Homoscedasticity
- No Autocorrelation
- No Multicollinearity
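
Two of the assumptions listed above, no multicollinearity and no autocorrelation, can be checked numerically. The block below is a minimal sketch on synthetic data, not on the Melbourne dataset: the variables `x1`, `x2`, `x3` and the helper functions are illustrative assumptions, and the diagnostics are computed by hand with NumPy. statsmodels ships equivalents (`variance_inflation_factor` in `statsmodels.stats.outliers_influence` and `durbin_watson` in `statsmodels.stats.stattools`).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Synthetic predictors (assumption for illustration): x2 is almost a copy of x1,
# which deliberately violates the "no multicollinearity" assumption.
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)
x3 = rng.normal(size=n)
y = 1.0 + 2.0 * x1 - 1.0 * x3 + rng.normal(size=n)


def ols_residuals(X, y):
    """Fit y = X b by least squares (X already holds a constant) and return the residuals."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ beta


def vif(X, j):
    """Variance inflation factor of column j: 1 / (1 - R^2) from regressing it on the other columns."""
    others = np.column_stack([np.ones(len(X)), np.delete(X, j, axis=1)])
    resid = ols_residuals(others, X[:, j])
    r2 = 1.0 - resid.var() / X[:, j].var()
    return 1.0 / (1.0 - r2)


def durbin_watson(resid):
    """Durbin-Watson statistic; values near 2 suggest no first-order autocorrelation."""
    return np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)


features = np.column_stack([x1, x2, x3])
design = np.column_stack([np.ones(n), features])
resid = ols_residuals(design, y)

vifs = [vif(features, j) for j in range(features.shape[1])]
dw = durbin_watson(resid)
print("VIF per feature:", np.round(vifs, 1))
print("Durbin-Watson:", round(dw, 2))
```

A common rule of thumb treats a VIF above roughly 5 to 10 as a sign of problematic collinearity; here `x1` and `x2` should exceed that by a wide margin while `x3` stays close to 1.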

## Importing the Required Libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
import seaborn as sns
import sklearn
from sklearn import preprocessing

sns.set()

In [2]:
import matplotlib
print("Pandas Version: ", pd.__version__)
print("NumPy Version: ", np.__version__)
print("Matplotlib Version: ", matplotlib.__version__)
print("Seaborn Version: ", sns.__version__)
print("Scikit-Learn Version: ", sklearn.__version__)

Pandas Version:  2.1.3
NumPy Version:  1.26.2
Matplotlib Version:  3.8.2
Seaborn Version:  0.13.0
Scikit-Learn Version:  1.3.2


## Load The Data

In [5]:
dataset = pd.read_csv('MELBOURNE_CLEANED_OLS.csv')
dataset.head()

Unnamed: 0,Suburb,Rooms,Type,Method,SellerG,Date,Distance,Postcode,Bathroom,Car,Landsize,BuildingArea,YearBuilt,CouncilArea,Lattitude,Longtitude,Regionname,Propertycount,log_price
0,Abbotsford,2,h,SS,Jellis,2016-09-03,2.5,3067.0,1.0,1.0,126.0,136.0,1970.0,Yarra City Council,-37.8014,144.9958,Northern Metropolitan,4019.0,13.676248
1,Abbotsford,2,h,S,Biggin,2016-12-03,2.5,3067.0,1.0,1.0,202.0,136.0,1970.0,Yarra City Council,-37.7996,144.9984,Northern Metropolitan,4019.0,14.207553
2,Abbotsford,2,h,S,Biggin,2016-02-04,2.5,3067.0,1.0,0.0,156.0,79.0,1900.0,Yarra City Council,-37.8079,144.9934,Northern Metropolitan,4019.0,13.849912
3,Abbotsford,3,u,VB,Rounds,2016-02-04,2.5,3067.0,2.0,1.0,0.0,136.0,1970.0,Yarra City Council,-37.8114,145.0116,Northern Metropolitan,4019.0,13.676248
4,Abbotsford,3,h,SP,Biggin,2017-03-04,2.5,3067.0,2.0,0.0,134.0,150.0,1900.0,Yarra City Council,-37.8093,144.9944,Northern Metropolitan,4019.0,14.197366


### Refining Variable Types
- Categorical Variables

In [6]:
# Identify the object columns and convert them to categorical variables
print(dataset.select_dtypes(["object"]).columns)

categorical = [
    "Suburb",
    "Type",
    "Method",
    "SellerG",
    "Date",
    "CouncilArea",
    "Regionname",
]

for cat_variables in categorical:
    dataset[cat_variables] = dataset[cat_variables].astype('category')

Index(['Suburb', 'Type', 'Method', 'SellerG', 'Date', 'CouncilArea',
       'Regionname'],
      dtype='object')


In [7]:
# Convert the Date column to datetime values (an explicit format makes dayfirst unnecessary)
dataset['Date'] = pd.to_datetime(dataset['Date'], format='%Y-%m-%d')
dataset['Date'].head(5)

0   2016-09-03
1   2016-12-03
2   2016-02-04
3   2016-02-04
4   2017-03-04
Name: Date, dtype: category
Categories (78, datetime64[ns]): [2016-01-28, 2016-02-04, 2016-04-16, 2016-04-23, ..., 2018-02-24, 2018-03-03, 2018-03-10, 2018-03-17]
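
The output above shows that `Date` keeps the `category` dtype, with the 78 distinct dates as its categories, because the column was cast to a category before parsing. If plain datetime values are wanted instead, one option is to parse the strings directly and derive numeric features from them. The sketch below uses a toy frame; the column names `sale_year` and `sale_month` are hypothetical and do not appear in the original notebook.

```python
import pandas as pd

# Toy frame with ISO-formatted date strings, mirroring the first rows of the Date column.
toy = pd.DataFrame({"Date": ["2016-09-03", "2016-12-03", "2016-02-04"]})

# Parse the strings directly; with an explicit format no dayfirst hint is needed.
toy["Date"] = pd.to_datetime(toy["Date"], format="%Y-%m-%d")

# Derive compact numeric features (hypothetical names) instead of
# one dummy column per distinct sale date.
toy["sale_year"] = toy["Date"].dt.year
toy["sale_month"] = toy["Date"].dt.month

print(toy.dtypes)
print(toy)
```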

In [8]:
# Postcodes are labels rather than quantities, so store them as a categorical variable
dataset['Postcode'] = dataset['Postcode'].astype('category')

In [9]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 29238 entries, 0 to 29237
Data columns (total 19 columns):
 #   Column         Non-Null Count  Dtype   
---  ------         --------------  -----   
 0   Suburb         29238 non-null  category
 1   Rooms          29238 non-null  int64   
 2   Type           29238 non-null  category
 3   Method         29238 non-null  category
 4   SellerG        29238 non-null  category
 5   Date           29238 non-null  category
 6   Distance       29238 non-null  float64 
 7   Postcode       29238 non-null  category
 8   Bathroom       29238 non-null  float64 
 9   Car            29238 non-null  float64 
 10  Landsize       29238 non-null  float64 
 11  BuildingArea   29238 non-null  float64 
 12  YearBuilt      29238 non-null  float64 
 13  CouncilArea    29236 non-null  category
 14  Lattitude      29238 non-null  float64 
 15  Longtitude     29238 non-null  float64 
 16  Regionname     29236 non-null  category
 17  Propertycount  29236 non-null  

## Dummy Variables

In [11]:
categories = dataset.select_dtypes(["category"])
categories.describe(include="all")

Unnamed: 0,Suburb,Type,Method,SellerG,Date,Postcode,CouncilArea,Regionname
count,29238,29238,29238,29238,29238,29238.0,29236,29236
unique,319,3,9,351,78,189.0,29,8
top,Reservoir,h,S,Nelson,2017-10-28 00:00:00,3073.0,Boroondara City Council,Southern Metropolitan
freq,764,18954,16893,2850,900,764.0,2751,9686
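
The `unique` row above determines how many dummy columns one-hot encoding will create. With `drop_first=True`, each categorical column contributes one column fewer than its number of levels, so the counts shown would give about (319 - 1) + (3 - 1) + (9 - 1) + (351 - 1) + (78 - 1) + (189 - 1) + (29 - 1) + (8 - 1) = 978 dummy columns, assuming every category actually occurs in the data. A minimal sketch of the same arithmetic on a toy frame (the values are made up for illustration):

```python
import pandas as pd

# Toy frame with two categorical columns and one numeric column.
toy = pd.DataFrame({
    "Type": pd.Categorical(["h", "u", "t", "h"]),
    "Method": pd.Categorical(["S", "SP", "S", "VB"]),
    "Rooms": [2, 3, 2, 4],
})

# Number of levels per categorical column.
levels = toy.select_dtypes(["category"]).nunique()

# With drop_first=True each column yields (levels - 1) dummy columns.
expected_dummies = int((levels - 1).sum())

encoded = pd.get_dummies(toy, drop_first=True, dtype=int)
print(levels)
print("expected dummy columns:", expected_dummies)
print("encoded shape:", encoded.shape)
```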


In [13]:
new_dataset = pd.get_dummies(dataset, drop_first=True, dtype=int)
new_dataset

Unnamed: 0,Rooms,Distance,Bathroom,Car,Landsize,BuildingArea,YearBuilt,Lattitude,Longtitude,Propertycount,...,CouncilArea_Wyndham City Council,CouncilArea_Yarra City Council,CouncilArea_Yarra Ranges Shire Council,Regionname_Eastern Victoria,Regionname_Northern Metropolitan,Regionname_Northern Victoria,Regionname_South-Eastern Metropolitan,Regionname_Southern Metropolitan,Regionname_Western Metropolitan,Regionname_Western Victoria
0,2,2.5,1.0,1.0,126.0,136.0,1970.0,-37.80140,144.99580,4019.0,...,0,1,0,0,1,0,0,0,0,0
1,2,2.5,1.0,1.0,202.0,136.0,1970.0,-37.79960,144.99840,4019.0,...,0,1,0,0,1,0,0,0,0,0
2,2,2.5,1.0,0.0,156.0,79.0,1900.0,-37.80790,144.99340,4019.0,...,0,1,0,0,1,0,0,0,0,0
3,3,2.5,2.0,1.0,0.0,136.0,1970.0,-37.81140,145.01160,4019.0,...,0,1,0,0,1,0,0,0,0,0
4,3,2.5,2.0,0.0,134.0,150.0,1900.0,-37.80930,144.99440,4019.0,...,0,1,0,0,1,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
29233,4,6.3,1.0,3.0,593.0,136.0,1970.0,-37.81053,144.88467,6543.0,...,0,0,0,0,0,0,0,0,1,0
29234,2,6.3,2.0,1.0,98.0,104.0,2018.0,-37.81551,144.88826,6543.0,...,0,0,0,0,0,0,0,0,1,0
29235,2,6.3,1.0,2.0,220.0,120.0,2000.0,-37.82286,144.87856,6543.0,...,0,0,0,0,0,0,0,0,1,0
29236,3,6.3,2.0,0.0,521.0,136.0,1970.0,0.00000,0.00000,6543.0,...,0,0,0,0,0,0,0,0,1,0
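
The `drop_first=True` argument used above ties back to the no-multicollinearity assumption: if every level of a category gets its own dummy, the dummies sum to one in each row and duplicate the regression constant. The sketch below illustrates this on a toy column; the helper `rank_with_constant` is a hypothetical name introduced only for this example.

```python
import numpy as np
import pandas as pd

# Toy categorical column with three levels.
toy = pd.DataFrame({"Type": ["h", "u", "t", "h", "u", "t"]})

full = pd.get_dummies(toy, drop_first=False, dtype=int)    # one dummy per level
reduced = pd.get_dummies(toy, drop_first=True, dtype=int)  # first level is the baseline


def rank_with_constant(frame):
    """Return (rank, number of columns) of the design matrix with an intercept column added."""
    X = np.column_stack([np.ones(len(frame)), frame.to_numpy()])
    return int(np.linalg.matrix_rank(X)), X.shape[1]


# Rank below the column count means the columns are perfectly collinear.
print("all dummies:", rank_with_constant(full))
print("drop_first: ", rank_with_constant(reduced))
```

When the rank equals the number of columns, the design matrix has full column rank and the OLS coefficients are uniquely determined.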


In [17]:
# index=False avoids writing the row index as an extra "Unnamed: 0" column
new_dataset.to_csv("MELBOURNE_CLEANED_DUMMY.csv", index=False)