### Regression and Regularization

**Scenario**
You've been provided the following data set on House Sales in King City. Your task is to build a regression model which can predict the price of a house, based on the features available.

Questions / Tasks:
1. Build a regression model to predict the price of a house.

You need to clean and transform the data, including feature engineering (creating dummy variables or using dimensionality reduction).
Be sure to explain why you chose the approach you did, and why it's the best approach for the data provided.

2. Evaluate the model using techniques covered in class and explain the results. How do you know this is the best model you can build, given the tools you have?

3. Explain the results to a business executive. What are the main drivers of house prices in King City? And how much do these drivers impact the price?

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
%matplotlib inline
from statsmodels.formula.api import ols
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression,ElasticNet,Ridge,Lasso
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler,StandardScaler
from sklearn.decomposition import PCA
from sklearn.metrics import mean_squared_error,r2_score
from sklearn.ensemble import RandomForestRegressor

In [2]:
house = pd.read_csv('https://raw.githubusercontent.com/delinai/schulich_ds1/main/Datasets/kc_house_data.csv')

In [3]:
house.head()

Unnamed: 0,id,date,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,...,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
0,7129300520,20141013T000000,221900.0,3,1.0,1180,5650,1.0,0,0,...,7,1180,0,1955,0,98178,47.5112,-122.257,1340,5650
1,6414100192,20141209T000000,538000.0,3,2.25,2570,7242,2.0,0,0,...,7,2170,400,1951,1991,98125,47.721,-122.319,1690,7639
2,5631500400,20150225T000000,180000.0,2,1.0,770,10000,1.0,0,0,...,6,770,0,1933,0,98028,47.7379,-122.233,2720,8062
3,2487200875,20141209T000000,604000.0,4,3.0,1960,5000,1.0,0,0,...,7,1050,910,1965,0,98136,47.5208,-122.393,1360,5000
4,1954400510,20150218T000000,510000.0,3,2.0,1680,8080,1.0,0,0,...,8,1680,0,1987,0,98074,47.6168,-122.045,1800,7503


In [4]:
# Dropping rows with any missing values
house = house.dropna()  

In [5]:
house.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21613 entries, 0 to 21612
Data columns (total 21 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   id             21613 non-null  int64  
 1   date           21613 non-null  object 
 2   price          21613 non-null  float64
 3   bedrooms       21613 non-null  int64  
 4   bathrooms      21613 non-null  float64
 5   sqft_living    21613 non-null  int64  
 6   sqft_lot       21613 non-null  int64  
 7   floors         21613 non-null  float64
 8   waterfront     21613 non-null  int64  
 9   view           21613 non-null  int64  
 10  condition      21613 non-null  int64  
 11  grade          21613 non-null  int64  
 12  sqft_above     21613 non-null  int64  
 13  sqft_basement  21613 non-null  int64  
 14  yr_built       21613 non-null  int64  
 15  yr_renovated   21613 non-null  int64  
 16  zipcode        21613 non-null  int64  
 17  lat            21613 non-null  float64
 18  long  

In [6]:
# Extracting month from date
house['sale_month'] = pd.to_datetime(house['date']).dt.month
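
As a minimal sketch of the month extraction in the cell above, the snippet below parses two timestamps in the same compact format as the 'date' column. The explicit `format` string and the extra `sale_year` value are assumptions added here for illustration; the notebook itself relies on pandas inferring the format.

```python
import pandas as pd

# Two raw timestamps in the same compact format as the 'date' column
raw = pd.Series(["20141013T000000", "20150225T000000"])

# Explicit format (an assumption for clarity; the notebook lets pandas infer it)
parsed = pd.to_datetime(raw, format="%Y%m%dT%H%M%S")

sale_month = parsed.dt.month  # calendar month, 1-12
sale_year = parsed.dt.year    # hypothetical extra feature, not used in the notebook

print(sale_month.tolist(), sale_year.tolist())
```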

In [7]:
# Creating new feature: total_sqft
house['total_sqft'] = house['sqft_living'] + house['sqft_lot']

In [8]:
# Dropping unnecessary/redundant columns
house = house.drop(['id', 'date', 'sqft_above', 'sqft_basement','lat','long'], axis=1)

In [9]:
house.head()

Unnamed: 0,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,grade,yr_built,yr_renovated,zipcode,sqft_living15,sqft_lot15,sale_month,total_sqft
0,221900.0,3,1.0,1180,5650,1.0,0,0,3,7,1955,0,98178,1340,5650,10,6830
1,538000.0,3,2.25,2570,7242,2.0,0,0,3,7,1951,1991,98125,1690,7639,12,9812
2,180000.0,2,1.0,770,10000,1.0,0,0,3,6,1933,0,98028,2720,8062,2,10770
3,604000.0,4,3.0,1960,5000,1.0,0,0,5,7,1965,0,98136,1360,5000,12,6960
4,510000.0,3,2.0,1680,8080,1.0,0,0,3,8,1987,0,98074,1800,7503,2,9760


In [10]:
# Performing PCA

# Standardizing the features
scaler = StandardScaler()
house_std = scaler.fit_transform(house)

pca = PCA()
house_pca = pca.fit_transform(house_std)

# The transformed data is an array, converting it back into a dataframe
house_pca = pd.DataFrame(house_pca, columns=[f'PC{i+1}' for i in range(len(house.columns))])

# Printing the explained variance ratio
print('Explained variance ratio:', pca.explained_variance_ratio_)

# Printing the cumulative explained variance ratio
cumsum_variance = np.cumsum(pca.explained_variance_ratio_)
print('Cumulative explained variance ratio:', cumsum_variance)

# Showing the first few rows of transformed dataframe
house_pca.head()

Explained variance ratio: [2.81414607e-01 1.49591190e-01 1.08426299e-01 7.34935994e-02
 6.17228618e-02 5.89300746e-02 5.27815618e-02 4.23256046e-02
 4.05148309e-02 3.17897076e-02 2.38891809e-02 2.20039216e-02
 1.89112892e-02 1.49349691e-02 1.11106174e-02 8.15968643e-03
 5.44678526e-31]
Cumulative explained variance ratio: [0.28141461 0.4310058  0.53943209 0.61292569 0.67464856 0.73357863
 0.78636019 0.8286858  0.86920063 0.90099034 0.92487952 0.94688344
 0.96579473 0.9807297  0.99184031 1.         1.        ]


Unnamed: 0,PC1,PC2,PC3,PC4,PC5,PC6,PC7,PC8,PC9,PC10,PC11,PC12,PC13,PC14,PC15,PC16,PC17
0,-2.620705,0.079954,0.275655,-0.730216,-0.553165,0.942206,-1.146867,-0.069407,1.041288,0.348613,-0.346628,-0.008296,0.282784,0.566879,-0.054596,-0.067387,3.752632e-14
1,-0.253938,-0.419124,1.27765,-1.698044,-4.20431,1.373333,1.73154,0.133527,-0.975012,-0.236241,-0.415627,-0.095258,0.226982,-0.30147,0.336054,-0.689077,-3.41203e-15
2,-2.445457,0.623121,-0.153334,-0.121243,0.850275,-1.392872,1.198713,-1.424185,0.61472,0.290752,0.68078,-0.759953,0.957869,-1.128264,0.16446,0.538108,2.325195e-15
3,-0.713767,-0.300262,1.111489,1.862395,-0.312785,1.715038,-1.070244,1.201214,-0.798353,-0.166101,-1.223352,0.820688,-0.618063,-0.232378,0.229767,0.361537,-2.929669e-14
4,-0.517742,-0.161518,-0.546317,-0.2835,0.490755,-1.456751,0.271566,-0.465233,0.495116,0.28679,-0.551504,0.50222,-0.227239,0.391351,-0.115527,0.20956,4.444239e-13


In [11]:
# Performing PCA again with 10 components, since the cumulative explained variance ratio
# reaches approximately 90% at that point, which is the chosen threshold

scaler = StandardScaler()
house_std = scaler.fit_transform(house)

pca = PCA(n_components=10)
house_pca = pca.fit_transform(house_std)

# The transformed data is an array, converting it back into a DataFrame
house_pca = pd.DataFrame(house_pca, columns=[f'PC{i+1}' for i in range(10)])

# Printing the explained variance ratio
print('Explained variance ratio:', pca.explained_variance_ratio_)

# Printing the cumulative explained variance ratio
cumsum_variance = np.cumsum(pca.explained_variance_ratio_)
print('Cumulative explained variance ratio:', cumsum_variance)

# Showing the first few rows of the transformed DataFrame
house_pca.head()

Explained variance ratio: [0.28141461 0.14959119 0.1084263  0.0734936  0.06172286 0.05893007
 0.05278156 0.0423256  0.04051483 0.03178971]
Cumulative explained variance ratio: [0.28141461 0.4310058  0.53943209 0.61292569 0.67464856 0.73357863
 0.78636019 0.8286858  0.86920063 0.90099034]


Unnamed: 0,PC1,PC2,PC3,PC4,PC5,PC6,PC7,PC8,PC9,PC10
0,-2.620705,0.079954,0.275655,-0.730216,-0.553165,0.942206,-1.146867,-0.069407,1.041288,0.348613
1,-0.253938,-0.419124,1.27765,-1.698044,-4.20431,1.373333,1.73154,0.133527,-0.975012,-0.236241
2,-2.445457,0.623121,-0.153334,-0.121243,0.850275,-1.392872,1.198713,-1.424185,0.61472,0.290752
3,-0.713767,-0.300262,1.111489,1.862395,-0.312785,1.715038,-1.070244,1.201214,-0.798353,-0.166101
4,-0.517742,-0.161518,-0.546317,-0.2835,0.490755,-1.456751,0.271566,-0.465233,0.495116,0.28679


In [12]:
# Defining numerical features (based on  PCA-generated dataframe)
numerical_features = ['PC1', 'PC2', 'PC3','PC4', 'PC5', 'PC6','PC7', 'PC8', 'PC9','PC10'] 

In [13]:
# Splitting training and testing data
X_train, X_test, y_train, y_test = train_test_split(house_pca[numerical_features], house['price'], test_size=0.2, random_state=1234)

In [14]:
# Defining numerical transformer
num_transformer = Pipeline(steps=[
    ('scales', MinMaxScaler())
])

In [15]:
# Building the pipeline for Linear Regression
linear_regression = Pipeline(steps=[
    ('preprocessor', num_transformer ),
    ('regressor', LinearRegression())
])

# Building the pipeline for Ridge Regression
ridge_regression = Pipeline(steps=[
    ('preprocessor', num_transformer ),
    ('regressor', Ridge(alpha=1.0))  # hyperparameter: the higher the alpha, the stronger the penalty
])

# Building the pipeline for Lasso Regression
lasso_regression = Pipeline(steps=[
    ('preprocessor', num_transformer ),
    ('regressor', Lasso(alpha=1.0))  # hyperparameter: the higher the alpha, the stronger the penalty
])

# Building the pipeline for Elastic Net Regression
elasticnet_regression = Pipeline(steps=[
    ('preprocessor', num_transformer ),
    ('regressor', ElasticNet(alpha=1.0))  # hyperparameter: the higher the alpha, the stronger the penalty
])

In [16]:
# Evaluating Model Performance

# Defining the models
models = {
    'Linear Regression': linear_regression,
    'Ridge Regression': ridge_regression,
    'Lasso Regression': lasso_regression,
    'ElasticNet Regression': elasticnet_regression
}

# Evaluating each model
for model_name, model in models.items():
    # Fitting the model to the training data
    model.fit(X_train, y_train)
    
    # Making predictions on the testing data
    y_pred = model.predict(X_test)
    
    # Calculating RMSE
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    
    # Calculating R-squared
    r2 = r2_score(y_test, y_pred)
    
    # Performing cross-validation and calculating the cross-validated RMSE
    cv_rmse = np.sqrt(-cross_val_score(model, X_train, y_train, cv=5, scoring='neg_mean_squared_error').mean())
    
    # Printing the evaluation metrics for each model
    print(f"Model: {model_name}")
    print(f"RMSE: {rmse}")
    print(f"R-squared: {r2}")
    print(f"Cross-validated RMSE: {cv_rmse}")
    print("------------------------------------")

Model: Linear Regression
RMSE: 145149.94286178125
R-squared: 0.8398031660093688
Cross-validated RMSE: 147806.72496467302
------------------------------------
Model: Ridge Regression
RMSE: 144951.31991364303
R-squared: 0.8402412922180001
Cross-validated RMSE: 147994.00514511173
------------------------------------


Model: Lasso Regression
RMSE: 145145.70992679766
R-squared: 0.8398125093529748
Cross-validated RMSE: 147806.97620036392
------------------------------------
Model: ElasticNet Regression
RMSE: 357911.2784792829
R-squared: 0.02597308341148774
Cross-validated RMSE: 363459.11184153985
------------------------------------


In [17]:
house.head()

Unnamed: 0,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,grade,yr_built,yr_renovated,zipcode,sqft_living15,sqft_lot15,sale_month,total_sqft
0,221900.0,3,1.0,1180,5650,1.0,0,0,3,7,1955,0,98178,1340,5650,10,6830
1,538000.0,3,2.25,2570,7242,2.0,0,0,3,7,1951,1991,98125,1690,7639,12,9812
2,180000.0,2,1.0,770,10000,1.0,0,0,3,6,1933,0,98028,2720,8062,2,10770
3,604000.0,4,3.0,1960,5000,1.0,0,0,5,7,1965,0,98136,1360,5000,12,6960
4,510000.0,3,2.0,1680,8080,1.0,0,0,3,8,1987,0,98074,1800,7503,2,9760


In [18]:
# To find the main drivers of house prices in King City, perform a feature importance analysis using a Random Forest Regressor.

# Separating the features (independent variables) and the target variable (house prices)
X = house.drop('price', axis=1)
y = house['price']

# Initializing the Random Forest model
rf = RandomForestRegressor()

# Fitting the model on the data
rf.fit(X, y)

# Getting feature importances
importances = rf.feature_importances_

# Sorting the importances in descending order
sorted_indices = np.argsort(importances)[::-1]

# Printing the feature names and their corresponding importances
for i, idx in enumerate(sorted_indices):
    print(f"{i+1}. {X.columns[idx]}: {importances[idx]}")

1. grade: 0.36235036323608916
2. sqft_living: 0.2618765135046039
3. zipcode: 0.0888707071850324
4. yr_built: 0.08518755102495147
5. sqft_living15: 0.04791083483175737
6. sqft_lot15: 0.02805444912440116
7. waterfront: 0.026586124132569718
8. view: 0.01881758441111751
9. bathrooms: 0.017309267615859072
10. total_sqft: 0.016022135753213484
11. sqft_lot: 0.015727159683011877
12. sale_month: 0.01195611436150886
13. bedrooms: 0.005687906383188479
14. condition: 0.005414518819859626
15. floors: 0.004869367317926294
16. yr_renovated: 0.003359402614909562


----------------------------------------------------------------------------------------------------------------------------------------

### Build a regression model to predict the price of a house. You may need to clean and transform the data, including feature engineering (creating dummy variables, or using dimensionality reduction). Be sure to explain why you chose the approach you did, and why it's the best approach for the data provided.


Before building the regression model, the usual data cleaning activities were done, which included:
1. Dropping rows with missing values using the dropna() function.

2. Extracting the month from the 'date' column in the 'house' DataFrame and assigning it to a new column called 'sale_month'. The raw values are strings in the '20141013T000000' format, so they cannot be used as dates directly. The pd.to_datetime() function converts the column to a datetime data type, which makes it easy to extract components such as the month.

3. Creating a new feature called 'total_sqft', calculated by adding the values from the 'sqft_living' and 'sqft_lot' columns. This feature represents the combined size of the house and the land it occupies and can be used to explore the relationship between the total square footage of a property and its price.

4. Eliminating unnecessary or redundant columns that are not relevant for modeling.

'id': This column represents an identification number for each house, which does not provide any meaningful information for modelling

'date': This column contains the date of sale, which may not be directly relevant for predicting house prices. It was used earlier to extract the 'sale_month' feature which is the feature of interest for our modelling.

'sqft_above' and 'sqft_basement': These columns represent the square footage of the house above ground level and in the basement, respectively. In the sample rows shown, they add up to 'sqft_living' (for example, 2170 + 400 = 2570), and 'sqft_living' is retained both on its own and within 'total_sqft', so these breakdowns are redundant.

'lat' and 'long': These columns represent the latitude and longitude coordinates of the house location which are not directly relevant for modeling especially when other location-related information like the 'zipcode' is available.

By dropping these columns, the 'house' DataFrame becomes more streamlined and focused on the relevant features that are more likely to contribute to modeling of house prices.

5. Dimensionality Reduction using PCA: this reduces the dimensionality of the dataset by extracting the most important patterns and capturing a significant portion of the variance in a smaller number of components. The number of components is set to 10 because the cumulative explained variance ratio reaches approximately 90% at that point. This threshold means that these 10 components retain around 90% of the variance in the standardized dataset while reducing the dimensionality.
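
The 90% rule can also be applied programmatically rather than by reading the printed array. The sketch below uses synthetic data (not the house data) and a hypothetical `threshold` variable. Its last column is an exact sum of two others, which mirrors how 'total_sqft' is built and is most likely why the 17th component in the notebook output has an explained variance ratio of essentially zero.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in: 5 independent columns plus one that is an exact sum of two others
base = rng.normal(size=(500, 5))
X = np.column_stack([base, base[:, 0] + base[:, 1]])

X_std = StandardScaler().fit_transform(X)
pca = PCA().fit(X_std)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio reaches the threshold
threshold = 0.90
k = int(np.searchsorted(cumulative, threshold) + 1)

print("components needed:", k)
print("ratio of last component:", pca.explained_variance_ratio_[-1])
```

The last ratio comes out numerically zero because a column that is a linear combination of others adds no new direction of variance.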


After completing these activities, pipelines were built for the PCA-generated dataframe using four models: Linear Regression, LASSO Regression, Ridge Regression, and Elastic Net Regression. The model performance metrics for each of these were then evaluated.
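
One caveat worth checking: cell In [10] standardizes and projects the whole `house` frame, which still contains `price`, so the components passed to the models carry information about the target. A hedged alternative, sketched here on synthetic data with hypothetical step names, keeps scaling and PCA inside the pipeline so that both are fit on training features only.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1234)

# Synthetic features and target; the target is never passed to the scaler or PCA
X = rng.normal(size=(400, 8))
y = (X @ rng.normal(size=8)) * 1000 + rng.normal(scale=500, size=400)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1234
)

model = Pipeline(steps=[
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=0.90)),   # keep enough components for 90% of variance
    ("regressor", Ridge(alpha=1.0)),
])
model.fit(X_train, y_train)

# Each CV fold refits the scaler and PCA on that fold's training part only
cv_mse = -cross_val_score(
    model, X_train, y_train, cv=5, scoring="neg_mean_squared_error"
)
cv_rmse = float(np.sqrt(cv_mse.mean()))
test_r2 = model.score(X_test, y_test)
n_kept = model.named_steps["pca"].n_components_

print(n_kept, round(cv_rmse, 2), round(test_r2, 4))
```

Applied to the house data, this would mean passing `house.drop('price', axis=1)` as the features; the scores would then be expected to differ from those reported here.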

----------------------------------------------------------------------------------------------------------------------------------------

### Evaluate the model using techniques covered in class and explain the results. How do you know this is the best model you can build, given the tools you have?

Based on the results from the code, we have evaluated the models as follows:

Model: Linear Regression - RMSE: 145,149.94; R-squared: 0.8398; Cross-validated RMSE: 147,806.72

Model: Ridge Regression - RMSE: 144,951.32; R-squared: 0.8402; Cross-validated RMSE: 147,994.01

Model: Lasso Regression - RMSE: 145,145.71; R-squared: 0.8398; Cross-validated RMSE: 147,806.98

Model: ElasticNet Regression - RMSE: 357,911.28; R-squared: 0.0260; Cross-validated RMSE: 363,459.11


To determine the best model, we consider all of these metrics:

RMSE: Lower RMSE values indicate better predictive performance, so models with lower RMSE values are generally preferred.

R-squared: Higher R-squared values indicate better goodness of fit, where a value of 1 represents a perfect fit. Models with higher R-squared values indicate better ability to explain the variation in the target variable.

Cross-validated RMSE: Cross-validation provides a more robust evaluation by assessing the model's performance on multiple folds of the training data. Lower cross-validated RMSE values indicate better generalization ability of the model.
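
To make the first two metrics concrete, here is a small worked computation on four made-up prices (illustrative values only, not taken from the data set). It follows the usual definitions: RMSE is the square root of the mean squared residual, and R-squared is one minus the ratio of the residual sum of squares to the total sum of squares.

```python
import numpy as np

# Made-up actual and predicted prices, for illustration only
y_true = np.array([200000.0, 300000.0, 400000.0, 500000.0])
y_pred = np.array([210000.0, 290000.0, 420000.0, 480000.0])

residuals = y_true - y_pred

# RMSE: typical size of a prediction error, in the same units as price
rmse = np.sqrt(np.mean(residuals ** 2))

# R-squared: share of the variance in y_true explained by the predictions
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(rmse, r2)
```

Because RMSE is expressed in price units, a value near 145,000 means the typical prediction error is of that order.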

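Based on these evaluation metrics, Linear, Ridge, and Lasso Regression perform almost identically: their test RMSE values lie within about 200 of one another, and their R-squared values agree to two decimal places. Ridge Regression has the lowest test RMSE (144,951) and the highest R-squared (0.8402), although its cross-validated RMSE (147,994) is marginally higher than that of Linear and Lasso Regression (about 147,807). ElasticNet with alpha=1.0 performs far worse (R-squared of about 0.03), which suggests that this penalty is too strong for the min-max scaled components. Ridge Regression therefore appears to be the best model among the options evaluated, by a narrow margin.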
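
All three regularized models above use a fixed `alpha=1.0`, so part of answering "is this the best model" is checking other penalty strengths. The following is a minimal sketch on synthetic data, assuming a small hypothetical grid of alpha values and scikit-learn's `GridSearchCV`; it is not a result for the house data.

```python
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)

# Synthetic features and a price-like target with a strong linear signal
X = rng.uniform(0, 100, size=(300, 3))
y = X @ np.array([3000.0, 1500.0, 500.0]) + rng.normal(scale=10000.0, size=300)

pipe = Pipeline(steps=[
    ("scale", MinMaxScaler()),
    ("regressor", ElasticNet(max_iter=10000)),
])

# Hypothetical grid; each alpha is scored by 5-fold cross-validated RMSE
grid = GridSearchCV(
    pipe,
    param_grid={"regressor__alpha": [0.001, 0.01, 0.1, 1.0]},
    scoring="neg_root_mean_squared_error",
    cv=5,
)
grid.fit(X, y)

best_alpha = grid.best_params_["regressor__alpha"]
rmse_by_alpha = {
    params["regressor__alpha"]: -score
    for params, score in zip(
        grid.cv_results_["params"], grid.cv_results_["mean_test_score"]
    )
}
print(best_alpha, rmse_by_alpha)
```

On min-max scaled inputs, a large alpha shrinks the coefficients heavily, which is one plausible explanation for the weak ElasticNet score reported above.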

----------------------------------------------------------------------------------------------------------------------------------------

###  Explain the results to a business executive. What are the main drivers of house prices in King City? And how much do these drivers impact the price?

Based on the feature importance analysis using Random Forest Regressor, the main drivers of house prices in King City are listed below. The exact scores, and the order of closely ranked features, vary slightly between runs because the forest is not seeded.


1. grade: 0.3624939716424652
2. sqft_living: 0.26436150434232153
3. zipcode: 0.08927293848354619
4. yr_built: 0.08442484153274278
5. sqft_living15: 0.048379914308035676
6. sqft_lot15: 0.02769424066536445
7. waterfront: 0.025178874075714557
8. bathrooms: 0.01784586221543381
9. view: 0.017170574666928142
10. total_sqft: 0.016136708176200932
11. sqft_lot: 0.015752813925044712
12. sale_month: 0.011885098996124813
13. condition: 0.005513969445161964
14. bedrooms: 0.005428627997143746
15. floors: 0.004770829720634885
16. yr_renovated: 0.0036892298071367307



Grade: The grade of the house has the highest importance (0.36) in predicting house prices. Grade refers to the construction quality and materials used, and houses with better grades generally command higher prices. Importance scores are relative shares that sum to 1 across all features, so they rank the drivers but do not express a dollar effect per unit of change.

Square footage of the living area: The size of the living area (sqft_living) is the second most important factor (0.26) affecting house prices. Larger living areas generally result in higher prices.

Zipcode: The specific location of the house, represented by the zipcode, also plays a significant role (0.09) in determining house prices. Different areas within King City may have varying desirability and demand, influencing the prices.

Year built: The year the house was built (yr_built) has a moderate impact (0.08) on the prices. Older houses may have lower prices compared to newer ones due to factors such as condition, amenities, and architectural styles.

Square footage of living area for the nearest 15 houses (sqft_living15): The average size of the living area for the 15 nearest houses has a relatively lower impact (0.05) on house prices.

These findings indicate that the quality of the house (grade), its size (sqft_living), location (zipcode), and age (yr_built) are the key drivers of house prices in King City. Other factors such as waterfront location, view, number of bathrooms, and overall condition also contribute to prices, but to a lesser extent.
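
The ranking logic can be illustrated on synthetic data where the true drivers are known. This is a sketch under stated assumptions: the feature names are hypothetical, and `random_state=1234` is added so that repeated runs return identical scores, which the notebook's unseeded forest does not guarantee.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(42)

# Synthetic data: one strong driver, one weak driver, one pure-noise column
X = rng.normal(size=(500, 3))
y = 5.0 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(scale=0.5, size=500)
feature_names = ["strong", "weak", "noise"]  # hypothetical names

# Seeding the forest makes the importances reproducible
rf = RandomForestRegressor(n_estimators=100, random_state=1234)
rf.fit(X, y)

importances = rf.feature_importances_  # impurity-based, normalized to sum to 1
ranking = [feature_names[i] for i in np.argsort(importances)[::-1]]

for name in ranking:
    print(f"{name}: {importances[feature_names.index(name)]:.3f}")
```

Impurity-based importances describe how much each feature helped the forest split the data, not the direction or dollar size of its effect.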

