# Project 3: House Price Prediction

Objective: Predict house prices from features such as location, size, and other house
characteristics.

## Model Building: Train and evaluate at least TWO machine learning models to predict the target variable.

## Imports

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

## Reading Data

In [2]:
df = pd.read_csv("../otherSolution/cleaned_house_data.csv")
df = df.drop(['Unnamed: 0'],axis=1)
df.head()

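| | Dwell_Type | LotFrontage | LotArea | OverallQual | OverallCond | YearBuilt | YearRemodAdd | MasVnrArea | BsmtFinSF1 | BsmtFinSF2 | ... | SaleType_ConLw | SaleType_New | SaleType_Oth | SaleType_WD | SaleCondition_Abnorml | SaleCondition_AdjLand | SaleCondition_Alloca | SaleCondition_Family | SaleCondition_Normal | SaleCondition_Partial |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 60 | 65.0 | 8450 | 7 | 5 | 2003 | 2003 | 196.0 | 706 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 1 | 20 | 80.0 | 9600 | 6 | 8 | 1976 | 1976 | 0.0 | 978 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 2 | 60 | 68.0 | 11250 | 7 | 5 | 2001 | 2002 | 162.0 | 486 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 3 | 70 | 60.0 | 9550 | 7 | 5 | 1915 | 1970 | 0.0 | 216 | 0 | ... | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
| 4 | 60 | 84.0 | 14260 | 8 | 5 | 2000 | 2000 | 350.0 | 655 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |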


## Support Vector Regression Model

**Working with both the raw y target label and its natural-log transform**

**Separate out the data into X features and y target label**

In [3]:
X = df.drop(['Property_Sale_Price_natural_log','Property_Sale_Price'],axis=1)
y = df['Property_Sale_Price']
log_y = df['Property_Sale_Price_natural_log']

**Perform a Train|Test split on the data, with a 10% test size. Note: The solution uses a random state of 101**

In [4]:
from sklearn.model_selection import train_test_split

In [5]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=101)
X_train_log, X_test_log, log_y_train, log_y_test = train_test_split(X, log_y, test_size=0.1, random_state=101)
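
The two calls above rely on the shared `random_state` to produce the same row split for both targets. As a minimal sketch of an alternative, `train_test_split` accepts any number of arrays and splits them all with the same indices in one call. The `*_demo` data below is a hypothetical stand-in for `X`, `y`, and `log_y`.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data; in the notebook these would be X, y, and log_y.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(50, 3)), columns=["a", "b", "c"])
y_demo = pd.Series(rng.uniform(50_000, 500_000, size=50))
log_y_demo = np.log(y_demo)

# One call splits all three objects with the same shuffled row indices.
X_tr, X_te, y_tr, y_te, log_y_tr, log_y_te = train_test_split(
    X_demo, y_demo, log_y_demo, test_size=0.1, random_state=101
)

# The features, raw target, and log target stay aligned row for row.
same_rows = X_te.index.equals(y_te.index) and y_te.index.equals(log_y_te.index)
print(len(X_tr), len(X_te), same_rows)
```

With a single split there is only one `X_train`/`X_test` pair, so one fitted scaler can serve both the raw-target and log-target models.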


**Scale the X train and X test data.**

In [6]:
from sklearn.preprocessing import StandardScaler

In [7]:
scaler = StandardScaler()
scaled_X_train = scaler.fit_transform(X_train)
scaled_X_test = scaler.transform(X_test)

In [8]:
# Use the dedicated scaler for the log-target split instead of refitting `scaler`
scaler_log = StandardScaler()
scaled_X_train_log = scaler_log.fit_transform(X_train_log)
scaled_X_test_log = scaler_log.transform(X_test_log)
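
Scaling the whole training set before a cross-validated search means each validation fold has already contributed to the scaler's mean and variance. One common way to avoid this is to put the scaler and the model in a `Pipeline`, so the scaler is refit on the training part of every fold. The following is a minimal sketch on synthetic data, not the notebook's method; the small `demo_grid` and the step names `"scaler"` and `"svr"` are illustrative assumptions.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in for the house data (assumption for illustration only).
X_demo, y_demo = make_regression(n_samples=120, n_features=5, noise=10.0, random_state=101)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.1, random_state=101)

# The scaler is fit inside each CV fold, on that fold's training rows only.
pipe = Pipeline([("scaler", StandardScaler()), ("svr", SVR())])

# Pipeline parameters are addressed as "<step name>__<parameter>".
demo_grid = {"svr__C": [1, 10, 100], "svr__kernel": ["linear"]}

search = GridSearchCV(pipe, param_grid=demo_grid, cv=5, scoring="neg_mean_squared_error")
search.fit(X_tr, y_tr)

# predict() applies the fitted scaler to the unscaled test rows automatically.
preds = search.predict(X_te)
print(search.best_params_)
```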

**Use a GridSearchCV to run a grid search for the best SVR() parameters.**

In [9]:
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV

In [10]:
svr = SVR()

In [11]:
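# Note: 'degree' only affects the 'poly' kernel and 'gamma' is ignored by the
# 'linear' kernel, so some of these combinations fit identical models.
param_grid = {'C': [100, 200, 300],
              'kernel': ['linear', 'rbf'],
              'gamma': ['scale', 'auto'],
              'degree': [1, 2],
              'epsilon': [0.1, 1, 2, 3]}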

In [12]:
grid = GridSearchCV(svr,param_grid=param_grid, cv=5,
                    scoring='neg_mean_squared_error', n_jobs=-1)
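
Because `degree` is only used by the `poly` kernel and `gamma` is ignored by the `linear` kernel, the grid above fits many duplicate models. `GridSearchCV` also accepts a list of dictionaries, one per kernel, which is one way to search only the combinations that differ. The sketch below counts the candidates in both layouts with `ParameterGrid`; `split_grid` is a suggested alternative, not the grid the notebook ran.

```python
from sklearn.model_selection import ParameterGrid

# The grid as written in the notebook.
full_grid = {
    "C": [100, 200, 300],
    "kernel": ["linear", "rbf"],
    "gamma": ["scale", "auto"],
    "degree": [1, 2],
    "epsilon": [0.1, 1, 2, 3],
}

# Alternative layout (assumption): one dict per kernel, listing only the
# parameters that kernel actually uses.
split_grid = [
    {"kernel": ["linear"], "C": [100, 200, 300], "epsilon": [0.1, 1, 2, 3]},
    {"kernel": ["rbf"], "C": [100, 200, 300], "gamma": ["scale", "auto"],
     "epsilon": [0.1, 1, 2, 3]},
]

n_full = len(ParameterGrid(full_grid))
n_split = len(ParameterGrid(split_grid))
print(n_full, n_split)  # 96 candidates versus 36
```

With `cv=5` that is 480 fits versus 180. Note that `best_params_` from the split grid would no longer report `degree`, and would report `gamma` only when the `rbf` kernel wins.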


**Working with normal y**

In [13]:
grid.fit(scaled_X_train,y_train)


In [14]:
grid.best_params_

{'C': 200, 'degree': 1, 'epsilon': 3, 'gamma': 'scale', 'kernel': 'linear'}

In [15]:
grid_preds = grid.predict(scaled_X_test)

**Evaluate your model's performance on the unseen 10% scaled test set using MAE and RMSE.**

In [16]:
from sklearn.metrics import mean_absolute_error,mean_squared_error, r2_score

In [17]:
MAE = mean_absolute_error(y_test,grid_preds)
# MAE as a percentage of the mean sale price (hard-coded as 180149.242279)
(MAE *100)/180149.242279

np.float64(9.089936231792796)

In [18]:
MSE = mean_squared_error(y_test,grid_preds)
RMSE = np.sqrt(MSE)
# RMSE as a percentage of the mean sale price (hard-coded as 180149.242279)
(RMSE *100)/180149.242279

np.float64(15.035750609127419)

In [19]:
r2 = r2_score(y_test, grid_preds)
# R^2: proportion of variance in y_test explained by the predictions
r2

0.8937497600753737
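
The three cells above repeat the same steps and divide by a hard-coded mean. A small helper can compute all the metrics in one place and derive the mean from the data. This is a minimal sketch: `regression_report` is a hypothetical name, and it normalizes by the mean of the `y_true` values passed in, which is an assumption and may differ slightly from the constant used in the notebook.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def regression_report(y_true, y_pred):
    """Return MAE, RMSE, both as a percentage of mean(y_true), and R^2."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mae = mean_absolute_error(y_true, y_pred)
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    mean_target = y_true.mean()  # assumption: normalize by the mean of y_true
    return {
        "MAE": mae,
        "RMSE": rmse,
        "MAE_pct_of_mean": 100 * mae / mean_target,
        "RMSE_pct_of_mean": 100 * rmse / mean_target,
        "R2": r2_score(y_true, y_pred),
    }

# Tiny made-up example: every prediction is off by exactly 10.
report = regression_report([100.0, 200.0, 300.0], [110.0, 190.0, 310.0])
print(report)
```

In the notebook this could be called as `regression_report(y_test, grid_preds)`.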

**Working with the natural-log transform of y**

In [20]:
svr = SVR()

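# Note: scikit-learn requires C to be strictly positive, so the C=0 entry is
# invalid. Every fit that uses it fails (see the warning under grid.fit), and
# it should be dropped from the grid.
param_grid = {'C': [0, 0.001, 0.01],
              'kernel': ['linear', 'rbf'],
              'gamma': ['scale', 'auto'],
              'degree': [1],
              'epsilon': [0, 0.01, 0.1]}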

In [21]:
grid = GridSearchCV(svr,param_grid=param_grid, cv=5,
                    scoring='neg_mean_squared_error')

In [22]:
grid.fit(scaled_X_train_log,log_y_train)

60 fits failed out of a total of 180.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
60 fits failed with the following error:
Traceback (most recent call last):
  File "/home/ReemGamal/miniconda3/envs/ml/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 888, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/home/ReemGamal/miniconda3/envs/ml/lib/python3.9/site-packages/sklearn/base.py", line 1466, in wrapper
    estimator._validate_params()
  File "/home/ReemGamal/miniconda3/envs/ml/lib/python3.9/site-packages/sklearn/base.py", line 666, in _validate_params
    validate_parameter_constraints(
  File "/home/ReemGamal/miniconda3/envs/ml/lib/python3.9/site-packages/sklearn/utils/_param_vali
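
The traceback above is cut off before the error message, so the cause is not shown directly. The counts are consistent with the `C=0` entry in the grid: there are 36 parameter combinations, 12 of them use `C=0`, and 12 combinations times 5 folds gives the 60 failed fits out of 180. The sketch below checks that arithmetic and shows that fitting `SVR(C=0)` raises an error. It assumes `C=0` is the culprit, and `X_tiny` and `y_tiny` are made-up data.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid
from sklearn.svm import SVR

# The log-target grid as written in the notebook.
log_grid = {
    "C": [0, 0.001, 0.01],
    "kernel": ["linear", "rbf"],
    "gamma": ["scale", "auto"],
    "degree": [1],
    "epsilon": [0, 0.01, 0.1],
}

combos = list(ParameterGrid(log_grid))
n_zero_c = sum(1 for params in combos if params["C"] == 0)
total_fits = len(combos) * 5   # cv=5 folds per combination
failed_fits = n_zero_c * 5     # fits that use the invalid C=0
print(total_fits, failed_fits)

# Made-up data, only to trigger parameter validation during fit.
X_tiny = np.arange(20, dtype=float).reshape(10, 2)
y_tiny = np.linspace(11.0, 13.0, 10)

try:
    SVR(C=0).fit(X_tiny, y_tiny)
    c_zero_failed = False
except ValueError as err:
    # scikit-learn rejects C=0 because C must be strictly positive.
    c_zero_failed = True
    print(type(err).__name__)
```

Failed fits are scored as `nan` and cannot be selected, so the `best_params_` reported afterwards still come from a valid combination.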

In [23]:
grid.best_params_

{'C': 0.001,
 'degree': 1,
 'epsilon': 0.01,
 'gamma': 'scale',
 'kernel': 'linear'}

In [24]:
grid_preds_log = grid.predict(scaled_X_test_log)

**Evaluate your model's performance on the unseen 10% scaled test set using MAE and RMSE (on the log scale).**

In [25]:
from sklearn.metrics import mean_absolute_error,mean_squared_error, r2_score

In [26]:
MAE = mean_absolute_error(log_y_test,grid_preds_log)
# MAE as a percentage of the mean log sale price (hard-coded as 12.021984)
(MAE*100)/12.021984

np.float64(0.6334280167764837)

In [27]:
MSE = mean_squared_error(log_y_test,grid_preds_log)
RMSE = np.sqrt(MSE)
# RMSE as a percentage of the mean log sale price (hard-coded as 12.021984)
(RMSE*100)/12.021984

np.float64(0.9340084133928446)

In [29]:
r2 = r2_score(log_y_test,grid_preds_log)
# R^2: proportion of variance in log_y_test explained by the predictions
r2

0.9285713229448201
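
The log-target metrics are in log units, so they are not directly comparable with the raw-price metrics of the first model. For instance, the MAE above works out to about 0.076 log units, which corresponds to a multiplicative error of roughly exp(0.076) ≈ 1.08 rather than 0.6% of the price. To compare the two models like for like, one option is to exponentiate the log predictions and score them on the price scale. The sketch below uses made-up numbers in place of `log_y_test` and `grid_preds_log`.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical values standing in for log_y_test and grid_preds_log.
log_true = np.log(np.array([150_000.0, 200_000.0, 320_000.0, 95_000.0]))
log_pred = log_true + np.array([0.05, -0.03, 0.02, -0.04])

# Undo the natural-log transform to get back to prices.
price_true = np.exp(log_true)
price_pred = np.exp(log_pred)

# Score on the price scale, the same scale as the raw-target model.
mae_price = mean_absolute_error(price_true, price_pred)
rmse_price = np.sqrt(mean_squared_error(price_true, price_pred))
r2_price = r2_score(price_true, price_pred)
print(mae_price, rmse_price, r2_price)
```

In the notebook, the equivalent would be to score `np.exp(grid_preds_log)` against `y_test`, since both splits use the same rows.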

----