# Price Prediction Model

Regression models are reliable tools for identifying the patterns needed for prediction. They analyze relationships between variables to produce weighted predictions.

Linear regression is a powerful and common method for estimating values. This statistical model lets us leverage past measured behavior to predict future behavior. Simple linear regression is based on the line equation "y = mx + c", where m is the slope and c is the intercept. The goal of linear regression is to draw a trend line through the data and use it to predict the outcome.
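
As a minimal sketch of the line equation, the snippet below fits a slope and an intercept to a handful of made-up points with `numpy.polyfit` and then uses the fitted line to predict a new value. The sample points and variable names are illustrative assumptions and are not taken from the housing dataset.

```python
import numpy as np

# Hypothetical sample points that lie exactly on y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0

# A degree-1 polynomial fit returns the slope m and the intercept c
m, c = np.polyfit(x, y, deg=1)

# Use the trend line to predict the outcome for an unseen x
x_new = 5.0
y_pred = m * x_new + c
print(f"m={m:.2f}, c={c:.2f}, prediction at x={x_new}: {y_pred:.2f}")
```

With several predictors, as in the rest of this notebook, the same idea extends to one slope per feature plus a single intercept.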



## Background Information - Machine Learning
“Machine Learning – it’s a field of study that gives computers the ability to learn without being explicitly programmed” –  Arthur Samuel

**Types:**

**Supervised learning:** the goal is to train a model that can make predictions on unseen future data. For this to happen, the data must be labeled.

**Unsupervised learning:** this type of learning works with unlabeled data. Its goal is to find hidden patterns in the data and, where possible, meaningful information.

**Reinforcement learning:** the goal is to develop a system that learns and improves over time by interacting with its environment.

In [1]:
# HEADERS 
import pandas as pd # data processing
import numpy as np # working with arrays
import matplotlib.pyplot as plt # visualization
import seaborn as sb # visualization
from termcolor import colored as cl # text customization

from sklearn.model_selection import train_test_split # data split

from sklearn.linear_model import LinearRegression # OLS algorithm
from sklearn.linear_model import Ridge # Ridge algorithm
from sklearn.linear_model import Lasso # Lasso algorithm
from sklearn.linear_model import BayesianRidge # Bayesian algorithm
from sklearn.linear_model import ElasticNet # ElasticNet algorithm
from sklearn.svm import SVC 
from sklearn.metrics import classification_report, confusion_matrix

from sklearn.metrics import explained_variance_score as evs # evaluation metric
from sklearn.metrics import r2_score as r2 # evaluation metric

In [2]:
# FINAL DATA SOURCE

missing_values = ["N/A", "NaN"]

#READING FROM CSV FILE
dataSource = pd.read_csv(
    "/Users/rburgess/Documents/GitHub/Webscrapping/Pricing Model Prjct/Data/DS1_Cleaned.csv", na_values=missing_values)
print("Clean Data from Source 1: \n", dataSource.head(2))
print('\nNumber of Rows: ',len(dataSource))

# FORMATTING COLUMN DATA TYPES
dataSource = dataSource.astype({'Beds':'int', 'Baths':'int','Square Feet': 'int','Property Type': 'string','Listing Price': 'int'})

# DROP DUPLICATES
dataSource.drop_duplicates(subset=["Address"], keep="last", inplace=True)


print("Dataset Original Data Types : \n",cl(dataSource.dtypes, attrs = ['bold']))
print('\nNumber of Unique Rows: ',len(dataSource))

Clean Data from Source 1: 
                                              Address  Beds  Baths  \
0  ['1808','S','Bumby','Avenue','Orlando','FL','3...     3    2.0   
1  ['8429','Leeland','Archer','Blvd','Orlando','F...     4    3.0   

   Square Feet Property Type  Listing Price  
0         1246   Residential         450000  
1         1512   Residential         425000  

Number of Rows:  2312
Dataset Original Data Types : 
 Address          object
Beds              int64
Baths             int64
Square Feet       int64
Property Type    string
Listing Price     int64
dtype: object

Number of Unique Rows:  2262
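
The cleaning steps in this cell can be tried on a small stand-in dataset. The sketch below uses a made-up four-row CSV (the addresses and prices are hypothetical) and adds a `dropna()` step that the notebook itself does not use. That step is an assumption made here because casting a column that contains missing values to `int` raises an error.

```python
import io
import pandas as pd

# Hypothetical four-row CSV standing in for the real listings file
csv_text = """Address,Beds,Baths,Square Feet,Listing Price
1 Main St,3,2.0,1246,450000
1 Main St,3,2.0,1246,455000
2 Oak Ave,4,N/A,1512,425000
3 Pine Rd,2,1.0,900,300000
"""

# Strings listed in na_values are parsed as missing values (NaN)
toy = pd.read_csv(io.StringIO(csv_text), na_values=["N/A", "NaN"])

# Rows with missing values are dropped first (an assumption of this sketch),
# because NaN cannot be cast to an integer dtype
toy = toy.dropna()
toy = toy.astype({"Beds": "int", "Baths": "int", "Square Feet": "int", "Listing Price": "int"})

# keep="last" retains the final occurrence of each repeated address
toy = toy.drop_duplicates(subset=["Address"], keep="last")
print(toy)
```

Here the row with the missing bath count is removed, and of the two "1 Main St" rows only the later one survives.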


In [3]:
# FORMATTING COLUMN DATA TYPES
dataSource = dataSource.astype({'Beds':'int', 'Baths':'int','Square Feet': 'int','Property Type': 'string','Listing Price': 'int'})
print("Dataset Final Data Type : \n",cl(dataSource.dtypes, attrs = ['bold']))


Dataset Final Data Type : 
 Address          object
Beds              int64
Baths             int64
Square Feet       int64
Property Type    string
Listing Price     int64
dtype: object


In [4]:

# FEATURE SELECTION & DATA SPLIT

# Independent variables (predictors)
X_var = dataSource[['Beds', 'Baths', 'Square Feet']].values

# Dependent variable (the value being predicted)
Y_var = dataSource['Listing Price'].values

# Split the data into two subsets: a training set (70%) and a test set (30%)
X_train, X_test, y_train, y_test = train_test_split(X_var, Y_var, test_size = 0.3, random_state = 4)

print(cl('X_train samples : \n', attrs = ['bold']), X_train[0:5])
print(cl('X_test samples : \n', attrs = ['bold']), X_test[0:5])
print(cl('y_train samples : \n', attrs = ['bold']), y_train[0:5])
print(cl('y_test samples : \n', attrs = ['bold']), y_test[0:5])

X_train samples : 
 [[   1    1  930]
 [   3    2 3445]
 [   4    4 2616]
 [   2    1  713]
 [   2    3 1449]]
X_test samples : 
 [[   3    2 1198]
 [   2    2 1093]
 [   4    3 2724]
 [   3    2 1678]
 [   3    2 1328]]
y_train samples : 
 [ 195000 1200000  749000  220000  489995]
y_test samples : 
 [219000 115000 885837 388000 447000]


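Model selection sets up a blueprint for analyzing data and then uses it to measure new data. In scikit-learn, the model_selection module provides train_test_split, which is used above to hold out 30% of the rows (test_size = 0.3) as a test set, so that each model is evaluated on data it was not trained on. Selecting a suitable model and evaluating it this way helps produce accurate predictions.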


https://www.bitdegree.org/learn/train-test-split
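
To make the split concrete, here is a small sketch on made-up arrays. The array contents are hypothetical, while `test_size` and `random_state` mirror the values used in the cell above. It shows that 30% of the rows are held out and that each feature row stays paired with its target after shuffling.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical feature matrix (10 rows, 3 columns); row i is [3i, 3i+1, 3i+2]
X = np.arange(30).reshape(10, 3)
# Hypothetical target: the row index itself, so pairing is easy to check
y = np.arange(10)

# test_size=0.3 holds out 30% of the rows; random_state makes the shuffle repeatable
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=4)

print("train:", X_train.shape, "test:", X_test.shape)

# Each feature row is still aligned with its own target value
print(all(row[0] // 3 == target for row, target in zip(X_train, y_train)))
```

Fixing `random_state` means the same rows land in the test set on every run, which keeps the scores of different models comparable.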

In [5]:
# MODELING


'''
1. OLS
 Ordinary Least Squares regression (OLS) is a common technique for estimating coefficients 
 of linear regression equations which describe the relationship between one or more 
 independent quantitative variables and a dependent variable (simple or multiple linear regression).
'''
ols = LinearRegression()
ols.fit(X_train, y_train)
ols_yhat = ols.predict(X_test)

'''
2. Lasso
 Lasso regression is a type of linear regression that uses shrinkage.
 Shrinkage is where estimates are shrunk towards a central point; Lasso applies
 an L1 penalty that shrinks the coefficients towards zero.
'''
lasso = Lasso(alpha = 1)
lasso.fit(X_train, y_train)
lasso_yhat = lasso.predict(X_test)

'''
3. Ridge
 Ridge regression is a method of estimating the coefficients of multiple-regression models
 in scenarios where the independent variables are highly correlated.
'''
ridge = Ridge(alpha = 0.3)
ridge.fit(X_train, y_train)
ridge_yhat = ridge.predict(X_test)

'''
4. ElasticNet
 The Elastic-Net is a regularized regression method that linearly
 combines both penalties i.e. L1 and L2 of the Lasso and Ridge regression methods.
 It is useful when there are multiple correlated features.
'''
en = ElasticNet(alpha = 0.67)
en.fit(X_train, y_train)
en_yhat = en.predict(X_test)

'''
5. Bayesian
 Bayesian regression provides a natural mechanism to cope with insufficient
 data or poorly distributed data by formulating linear regression using
 probability distributions rather than point estimates.
'''
bayesian = BayesianRidge()
bayesian.fit(X_train, y_train)
bayesian_yhat = bayesian.predict(X_test)
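
The difference between the L1 (Lasso) and L2 (Ridge) penalties described in this cell can be seen on synthetic data. The sketch below is an illustration under stated assumptions: the data are randomly generated, only the first feature influences the target, and the `alpha` values are arbitrary choices rather than the ones used for the housing models.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Hypothetical data: only the first of three features drives the target
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=200)

lasso_demo = Lasso(alpha=0.5).fit(X, y)
ridge_demo = Ridge(alpha=0.5).fit(X, y)

# The L1 penalty can set coefficients exactly to zero;
# the L2 penalty shrinks them but typically leaves them nonzero
print("Lasso coefficients:", lasso_demo.coef_)
print("Ridge coefficients:", ridge_demo.coef_)
```

Both penalties act on coefficient size, so features on very different scales (such as bedrooms and square feet) are penalized unevenly unless they are standardized first.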




Fig 1: https://www.alchemer.com/resources/blog/regression-analysis/

In [6]:
# MODEL EVALUATION
'''
 For the ‘explained_variance_score’ metric, the score should not fall below 0.60 (60%).
 If it does, the model is not adequate for the data in this case.
 So, as a rule of thumb, the ‘explained_variance_score’ should lie between 0.60 and 1.0.
'''


# 1. Explained Variance Score

print('| EXPLAINED VARIANCE SCORE:               |')
print('===========================================')
print('| OLS Model  {}'.format(evs(y_test, ols_yhat)) ,'          |')
print('| Lasso Model {}'.format(evs(y_test, lasso_yhat)) ,'         |')
print('| Ridge Model {}'.format(evs(y_test, ridge_yhat)) ,'         |')
print('| ElasticNet Model {}'.format(evs(y_test, en_yhat)) ,'    |')
print('| Bayesian Model {}'.format(evs(y_test, bayesian_yhat)) ,'      |')


| EXPLAINED VARIANCE SCORE:               |
| OLS Model  0.6476347995032612           |
| Lasso Model 0.6476351478635625          |
| Ridge Model 0.6476395557526757          |
| ElasticNet Model 0.6455462244610374     |
| Bayesian Model 0.6326685483370438       |
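
For reference, the explained variance score can be reproduced by hand as one minus the ratio of the variance of the residuals to the variance of the true values. The sketch below checks that formula against scikit-learn on a few made-up prices; the numbers are hypothetical and unrelated to the results above.

```python
import numpy as np
from sklearn.metrics import explained_variance_score

# Hypothetical true prices and predictions
y_true = np.array([200000.0, 350000.0, 500000.0, 650000.0])
y_pred = np.array([220000.0, 330000.0, 520000.0, 600000.0])

# Explained variance = 1 - Var(residuals) / Var(y_true)
residuals = y_true - y_pred
evs_manual = 1.0 - np.var(residuals) / np.var(y_true)

# The manual value and the library value should agree
print(evs_manual, explained_variance_score(y_true, y_pred))
```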


In [7]:
'''
The r2_score (R-squared) metric measures how much of the variance of the dependent variable is explained by the independent variables.
It is one of the most popular evaluation metrics for regression models. As a rule of thumb, the r2_score of a model should be more than 0.70 (at least > 0.60).
For this experiment, all regression models were compared to see which had the best score.
'''

# 2. R-squared

print('| R-SQUARED:                               |')
print('============================================')
print('| OLS Model:  {}'.format(r2(y_test, ols_yhat)),'          |')
print('| Lasso Model:  {}'.format(r2(y_test, lasso_yhat)),'        |')
print('| Ridge Model:  {}'.format(r2(y_test, ridge_yhat)),'        |')
print('| ElasticNet Model:  {}'.format(r2(y_test, en_yhat)),'    |')
print('| Bayesian Model:  {}'.format(r2(y_test, bayesian_yhat)) ,'     |')

| R-SQUARED:                               |
| OLS Model:  0.6476271181314507           |
| Lasso Model:  0.6476274678140661         |
| Ridge Model:  0.6476318839222759         |
| ElasticNet Model:  0.645545215659568     |
| Bayesian Model:  0.6326658019846867      |
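
R-squared can also be computed by hand as one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below uses made-up prices with predictions that are deliberately 50,000 too high everywhere (a hypothetical scenario) to show how it differs from the explained variance score: a constant offset lowers R-squared but leaves the explained variance untouched.

```python
import numpy as np
from sklearn.metrics import explained_variance_score, r2_score

# Hypothetical true prices
y_true = np.array([200000.0, 350000.0, 500000.0, 650000.0])
# Hypothetical predictions that are all 50,000 too high
y_biased = y_true + 50000.0

# R^2 = 1 - SS_res / SS_tot
ss_res = np.sum((y_true - y_biased) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print("R^2 (manual vs sklearn):", r2_manual, r2_score(y_true, y_biased))
print("Explained variance:", explained_variance_score(y_true, y_biased))
```

When the two metrics nearly coincide, as they do for the models in this notebook, the predictions have almost no systematic offset on the test set.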
