<table class="table table-bordered">
    <tr>
        <th style="text-align:center; width:35%"><img src='https://dl.dropboxusercontent.com/s/hrpgwq7gqoxf4am/smu_scis.png' style="width: 300px; height: 60px; "></th>
    <th style="text-align:center;"><font size="4"> <br/>IS215 - Analytics in Python - Practical 1b Regression (Student)</font></th>
    </tr>
</table> 

This notebook illustrates the steps to create a regression model using a Singapore HDB house price dataset, extracted and adapted from https://github.com/valerielimyh/Predict_housing_prices/tree/master/data/raw. The dataset has 4 input variables and 1 target variable (Price); the goal is to predict the HDB price from the input variables.
The variable names and their types are as follows:
- Property Type (categorical)
- Price (numeric, target variable)
- Area_sqm (numeric)
- Corner Unit (categorical)
- Renovated (categorical)

### Step 1: import relevant libraries

In [2]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error


### Step 2: read in the data

In [3]:
#read in data file
df = pd.read_csv('hdb_house_price_processed.csv')

#check top few rows from data file
df.head()

Unnamed: 0,Property Type,Price,Area_sqm,Corner Unit,Renovated
0,HDB Executive,560000,142,1,1
1,HDB 4 Rooms,480000,92,1,0
2,HDB 5 Rooms,568000,113,1,1
3,HDB 4 Rooms,308888,84,0,1
4,HDB 5 Rooms,750000,120,0,1


In [4]:
#summary statistics
df.describe(include='all')

Unnamed: 0,Property Type,Price,Area_sqm,Corner Unit,Renovated
count,1888,1888.0,1888.0,1888.0,1888.0
unique,7,,,,
top,HDB 4 Rooms,,,,
freq,704,,,,
mean,,517027.4,102.768538,0.506356,0.523835
std,,171069.1,24.522659,0.500092,0.499564
min,,199000.0,31.0,0.0,0.0
25%,,395000.0,90.0,0.0,0.0
50%,,488000.0,104.0,1.0,1.0
75%,,610000.0,120.0,1.0,1.0


### Step 3: create input and target data

In [5]:
# Split into input (X) and target (y) dataframes.

X = df.drop('Price', axis=1)
y = df['Price']

print(X.shape,y.shape)

(1888, 4) (1888,)


### Step 4: preprocess data

In [6]:
def preprocess_data(X):
    # Work on a copy so that the caller's dataframe is not modified in place.
    X = X.copy()

    # Numeric columns: normalize using MinMaxScaler to constrain values to between 0 and 1.
    scaler = MinMaxScaler(feature_range=(0, 1))
    numeric_cols = X.select_dtypes(include=[np.number]).columns.tolist()

    # MinMaxScaler needs input of shape [n_samples, n_features] and scales each column
    # independently. Fitting once on all numeric columns keeps the returned scaler
    # consistent with every scaled column (refitting per column would leave it fitted
    # to the last column only).
    X[numeric_cols] = scaler.fit_transform(X[numeric_cols])

    print("---Successfully processed numeric column(s)")
    print(X.head(5))

    # Categorical columns: convert each column into a one-hot encoding.
    categorical_cols = X.select_dtypes(exclude=[np.number]).columns.tolist()

    X = pd.get_dummies(X, columns=categorical_cols)

    print("---Successfully processed categorical column(s)")
    print(X.head(5))

    return X, scaler

In [7]:
# preprocess data
X,scaler = preprocess_data(X)

---Successfully processed numeric column(s)
   Property Type  Area_sqm  Corner Unit  Renovated
0  HDB Executive  0.689441          1.0        1.0
1    HDB 4 Rooms  0.378882          1.0        0.0
2    HDB 5 Rooms  0.509317          1.0        1.0
3    HDB 4 Rooms  0.329193          0.0        1.0
4    HDB 5 Rooms  0.552795          0.0        1.0
---Successfully processed categorical column(s)
   Area_sqm  Corner Unit  Renovated  Property Type_HDB 1 Room  \
0  0.689441          1.0        1.0                         0   
1  0.378882          1.0        0.0                         0   
2  0.509317          1.0        1.0                         0   
3  0.329193          0.0        1.0                         0   
4  0.552795          0.0        1.0                         0   

   Property Type_HDB 2 Rooms  Property Type_HDB 3 Rooms  \
0                          0                          0   
1                          0                          0   
2                          0      
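
In the cell above, the scaler is fitted on the full dataset before the train/test split in Step 5. A common alternative is to fit the scaler on the training rows only and reuse it on the test rows, so that no information from the test set enters preprocessing. The sketch below illustrates that idea on a small made-up dataframe; the names `toy`, `toy_train`, `toy_test` and `toy_scaler` are hypothetical and not part of this practical.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Toy data (made up for illustration; not taken from the HDB dataset).
toy = pd.DataFrame({
    "Property Type": ["HDB 3 Rooms", "HDB 4 Rooms", "HDB 5 Rooms",
                      "HDB 4 Rooms", "HDB 3 Rooms", "HDB 5 Rooms"],
    "Area_sqm": [65.0, 92.0, 113.0, 84.0, 70.0, 120.0],
})

# One-hot encode before splitting so train and test share the same columns.
toy_encoded = pd.get_dummies(toy, columns=["Property Type"])

# Hold out 2 of the 6 rows as a test set.
toy_train, toy_test = train_test_split(toy_encoded, test_size=2, random_state=0)
toy_train, toy_test = toy_train.copy(), toy_test.copy()

# Fit the scaler on the training rows only, then apply the same scaling to the test rows.
toy_scaler = MinMaxScaler(feature_range=(0, 1))
toy_train[["Area_sqm"]] = toy_scaler.fit_transform(toy_train[["Area_sqm"]])
toy_test[["Area_sqm"]] = toy_scaler.transform(toy_test[["Area_sqm"]])

print(toy_train["Area_sqm"].min(), toy_train["Area_sqm"].max())
print(toy_test["Area_sqm"].tolist())
```

With this approach the training values span 0 to 1, while test values may fall slightly outside that range if the test set contains an area smaller or larger than anything seen in training.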

### Step 5: split data into training and testing data

In [8]:
# Split the feature and target sets into training and test sets (80-20); random_state is set for reproducibility.
# The stratify parameter is not relevant for regression, since the target is not a category/class.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 10)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

(1510, 10) (378, 10) (1510,) (378,)
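
To see what `random_state` does, the short sketch below splits the same toy arrays twice with the same seed and checks that both splits pick the same rows. The arrays `X_toy` and `y_toy` are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy arrays (made up for illustration).
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

# Two splits with the same random_state select the same rows.
_, X_test_a, _, y_test_a = train_test_split(X_toy, y_toy, test_size=0.2, random_state=10)
_, X_test_b, _, y_test_b = train_test_split(X_toy, y_toy, test_size=0.2, random_state=10)

print(np.array_equal(y_test_a, y_test_b))  # True: same seed, same split
print(len(y_test_a))                       # 2: 20% of 10 rows
```

Without a fixed `random_state`, each run may produce a different split, and the evaluation metrics in the later steps would vary from run to run.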


### Step 6: create the Linear Regression model

In [9]:
# Create Regression Model
#------------------------
model = LinearRegression()

# Train the model
#----------------
model.fit(X_train, y_train)

# Use model to make predictions
#------------------------------
y_pred = model.predict(X_test)

### Step 7: evaluate the Linear Regression model

In [10]:
# Evaluation - using relevant metrics
#------------------------------------
print("Model Features:", model.feature_names_in_)
print("Model Coefficients:", model.coef_)
print("Mean Absolute Error:", mean_absolute_error(y_test, y_pred))

Model Features: ['Area_sqm' 'Corner Unit' 'Renovated' 'Property Type_HDB 1 Room'
 'Property Type_HDB 2 Rooms' 'Property Type_HDB 3 Rooms'
 'Property Type_HDB 4 Rooms' 'Property Type_HDB 5 Rooms'
 'Property Type_HDB Executive' 'Property Type_HDB Jumbo']
Model Coefficients: [ 3.54986141e+04  1.42388945e+04  3.41375011e+04  8.73114914e-11
 -2.32589800e+05 -1.60653323e+05 -3.29799982e+04  7.04621490e+04
  1.52297540e+05  2.03463432e+05]
Mean Absolute Error: 101412.80515416089
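
The mean absolute error is the average size of the prediction errors, in the same unit as the target (here, price). Step 1 also imports `mean_squared_error`; taking its square root gives the root mean squared error (RMSE), which penalises large errors more heavily. The sketch below computes both by hand and with scikit-learn on made-up values; `y_true_toy` and `y_pred_toy` are hypothetical and are not outputs of the model above.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Toy prices (made up for illustration; not model output from this notebook).
y_true_toy = np.array([400000.0, 500000.0, 600000.0, 700000.0])
y_pred_toy = np.array([450000.0, 480000.0, 650000.0, 600000.0])

errors = y_true_toy - y_pred_toy  # [-50000, 20000, -50000, 100000]

# Manual computation from the definitions.
mae_manual = np.mean(np.abs(errors))         # (50000 + 20000 + 50000 + 100000) / 4
rmse_manual = np.sqrt(np.mean(errors ** 2))

# The same quantities via scikit-learn.
mae = mean_absolute_error(y_true_toy, y_pred_toy)
rmse = np.sqrt(mean_squared_error(y_true_toy, y_pred_toy))

print("MAE :", mae_manual, mae)   # 55000.0 both ways
print("RMSE:", rmse_manual, rmse)
```

RMSE is never smaller than MAE; a large gap between the two suggests that a few predictions are far off.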


In [11]:
# Coefficient of Determination of the model
#------------------------------------------
print("Coefficient of Determination or r-squared:", r2_score(y_test, y_pred))

Coefficient of Determination or r-squared: 0.3841640758969246
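
The value reported by `r2_score` is 1 minus the ratio of the residual sum of squares to the total sum of squares around the mean of the target. The sketch below reproduces this on made-up values (`y_true_toy` and `y_pred_toy` are hypothetical) and shows that always predicting the mean gives an r-squared of 0.

```python
import numpy as np
from sklearn.metrics import r2_score

# Toy prices (made up for illustration; not model output from this notebook).
y_true_toy = np.array([400000.0, 500000.0, 600000.0, 700000.0])
y_pred_toy = np.array([450000.0, 480000.0, 650000.0, 600000.0])

# Residual sum of squares: squared errors of the predictions.
ss_res = np.sum((y_true_toy - y_pred_toy) ** 2)
# Total sum of squares: squared deviations from the mean of the target.
ss_tot = np.sum((y_true_toy - y_true_toy.mean()) ** 2)

r2_manual = 1 - ss_res / ss_tot
print(round(r2_manual, 3), round(r2_score(y_true_toy, y_pred_toy), 3))  # 0.692 0.692

# A "model" that always predicts the mean explains none of the variability.
mean_pred = np.full_like(y_true_toy, y_true_toy.mean())
r2_mean = r2_score(y_true_toy, mean_pred)
print(r2_mean)
```

An r-squared can also be negative on test data when a model predicts worse than the mean would.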


## Conclusion:

In [12]:
#########################################################################################
# An r-squared of about 0.38 means that roughly 38% of the variability observed in the
# target variable (Price) in the test set is explained by the regression model.
#########################################################################################

# The model does not seem to work very well: ideally r-squared is as close to 1 as
# possible, although what counts as acceptable depends on the field.
#
# In some fields, such as the social sciences, even a relatively low r-squared such as 0.5
# could be considered relatively strong.
#
# The mean absolute error of about 101,000 is also large relative to the mean price of
# about 517,000 in the dataset.

# What can be used to improve the model?
