<a href="https://colab.research.google.com/github/nitish6121999/US-HOME-PRICE-PREDICTION/blob/main/Home_LLC__Modelling_Assignmnet.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - U.S. Home Price Prediction



##### **Project Type**    - Supervised Machine Learning Regression
##### **Created by**      - Nitish N Naik

# **GitHub Link -**

https://github.com/nitish6121999/US-HOME-PRICE-PREDICTION

# **Problem Statement**



The aim of this study is to develop a robust predictive model for US home prices leveraging comprehensive datasets encompassing various economic, societal, and market-related factors. The objective is to accurately predict future home prices based on historical trends and the interplay of influential variables.

### Steps Involved:

**Data Collection**: Gather data from credible online sources encompassing US home pricing, real estate metrics, and a comprehensive set of factors known to influence housing markets.

**Dataset Merging and Understanding**: Consolidate and merge datasets to create a unified dataset. Understand the structure, contents, and relationships within the data.

**Focus on US Home Pricing and Real Estate**: Delve deep into the dynamics of US home pricing and real estate markets to comprehend patterns, trends, and influencing factors.

**Factor Identification and Understanding**: Identify, analyze, and understand the multitude of factors affecting home prices as potential independent variables.

**Exploratory Data Analysis (EDA)**: Perform exploratory data analysis to uncover correlations, trends, outliers, and patterns within the dataset. This step involves data cleaning, visualization, and statistical analysis.

**Model Development**: Utilize machine learning or statistical modeling techniques to build a predictive model. Train the model using historical data and the identified factors to predict future home prices accurately.

**Model Evaluation and Validation**: Validate the model's accuracy, robustness, and predictive capabilities using suitable metrics and validation techniques.
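The merging step listed above can be sketched as follows. This is only an illustration, not the notebook's actual merge: the column names (`DATE`, `price_index`, `mortgage_rate`, `homeownership_rate`) and all values are made up, and it assumes each source is a dated series where quarterly figures are carried forward to the months they cover.

```python
import pandas as pd

# Hypothetical monthly series (illustrative values only)
hpi = pd.DataFrame({
    "DATE": pd.date_range("2020-01-01", periods=6, freq="MS"),
    "price_index": [212.0, 213.1, 214.5, 215.2, 216.0, 217.4],
})
mortgage = pd.DataFrame({
    "DATE": pd.date_range("2020-01-01", periods=6, freq="MS"),
    "mortgage_rate": [3.62, 3.47, 3.45, 3.31, 3.23, 3.16],
})
# Hypothetical quarterly series
homeownership = pd.DataFrame({
    "DATE": pd.to_datetime(["2020-01-01", "2020-04-01"]),
    "homeownership_rate": [65.3, 67.9],
})

merged = (
    hpi.merge(mortgage, on="DATE", how="inner")
       .merge(homeownership, on="DATE", how="left")   # keeps every monthly row
       .sort_values("DATE")
       .set_index("DATE")
)
# Carry each quarterly value forward to the following months
merged["homeownership_rate"] = merged["homeownership_rate"].ffill()
print(merged)
```

A left join keeps the monthly rows intact, so the lower-frequency series never shrinks the dataset; whether forward-filling is appropriate depends on the series.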


# Factors considered

House price Index 2001-2023

Average Sales Price of Houses Sold for the United States

Consumer Price Index for All Urban Consumers Housing in U.S. City

Average Economic Policy Uncertainty Index for United States (USEPUINDXD)

Government subsidies Federal Housing L312051A027NBEA

Homeownership Rate in the United States RHORUSQ156N

Interest Rates and Price Indexes 2001-2023.csv

Monthly Supply of New Houses in the United States

Mortgage Average in the United States MORTGAGE30US

Net housing value added Subsidies

Public Transportation in U.S. City Average

Real Median Household Income in the United States 2001-2022

Unemployment rate 2002-2023

Working Age Population Aged 15-64

# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings('ignore')

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Path to the merged dataset on Google Drive
path = '/content/drive/MyDrive/Access file/Extras/HOME LLC/merged_dataset.csv'

df_hpi = pd.read_csv(path, index_col=0)

In [None]:
df_hpi.head(10)
#df_hpi.drop(columns='unnamed',inplace=True)

## ***4. Data Visualization, Storytelling & Experimenting with Charts: Understand the Relationships Between Variables***

In [None]:
sns.pairplot(df_hpi, diag_kind= 'kde')

1. Why did you pick the specific chart?

The pair plot shows how each variable is distributed (the KDE plots on the diagonal) and how every pair of variables relates to each other.

In [None]:
numeric_col= df_hpi.describe().columns

In [None]:
for col in numeric_col[1:]:
  fig = plt.figure(figsize=(9,6))
  ax = fig.gca()
  feature = df_hpi[col]
  correlation = feature.corr(df_hpi['Price-index'])
  plt.scatter(x=feature, y= df_hpi['Price-index'])
  plt.xlabel(col)
  plt.ylabel('Price-index')
  ax.set_title('Price-index vs '+col+ ' with the correlation value : '+str(correlation))
  z= np.polyfit(df_hpi[col],df_hpi['Price-index'],1)
  y_hat = np.poly1d(z)(df_hpi[col])
  plt.plot(df_hpi[col],y_hat,'r--',lw=1)

plt.show()

1. Why did you pick the specific chart?

To see how each numeric column correlates with the dependent variable, `Price-index`.

2. What is/are the insight(s) found from the chart?

The scatter plots with fitted trend lines show that some numeric features are positively correlated with the dependent variable and others are negatively correlated.

Features negatively correlated with the dependent variable: homeownership rate, unemployment rate, federal rate, and monthly supply of new houses.
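The same reading can be obtained numerically by ranking the features by their Pearson correlation with the target. The sketch below uses synthetic data; the feature names `mortgage_rate` and `median_income` and the coefficients are assumptions chosen only so that one feature is negatively and one positively related to `Price-index`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
rate = rng.normal(5.0, 1.0, n)       # hypothetical interest rate
income = rng.normal(60.0, 5.0, n)    # hypothetical household income
noise = rng.normal(0.0, 1.0, n)

df = pd.DataFrame({
    "mortgage_rate": rate,
    "median_income": income,
    "Price-index": 100 + 2.0 * income - 5.0 * rate + noise,
})

# Correlation of every feature with the target, most negative first
corr_with_target = (
    df.corr()["Price-index"]
      .drop("Price-index")
      .sort_values()
)
print(corr_with_target)
```

Sorting the signed values (rather than their absolute values) keeps the direction of each relationship visible.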

In [None]:
plt.figure(figsize=(14,8))
sns.heatmap(abs(df_hpi.corr()), cmap='coolwarm', annot=True)


### Multicollinear variables (using the VIF method to deal with them)

In [None]:
from statsmodels.stats.outliers_influence import variance_inflation_factor
def calc_vif(X):

   vif = pd.DataFrame()
   vif["variables"] = X.columns
   vif["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]

   return(vif)


calc_vif(df_hpi[[i for i in df_hpi.describe().columns if i not in ['Price-index','Population','Year','govt_subsidies']]])
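For intuition, the variance inflation factor of a column is 1 / (1 - R²), where R² comes from regressing that column on all the other columns. The NumPy sketch below computes it by hand on synthetic data. The helper name `vif_manual` is hypothetical, and the sketch includes an intercept in each auxiliary regression, which may differ from how the statsmodels helper above is called.

```python
import numpy as np

def vif_manual(X):
    """VIF per column of a 2-D array; each auxiliary regression includes an intercept."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    vifs = []
    for j in range(p):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1.0 - resid.var() / y.var()   # R^2 of column j explained by the others
        vifs.append(1.0 / (1.0 - r2))
    return np.array(vifs)

rng = np.random.default_rng(1)
a = rng.normal(size=300)
b = rng.normal(size=300)
c = a + 0.1 * rng.normal(size=300)   # nearly a copy of `a`
vifs = vif_manual(np.column_stack([a, b, c]))
print(vifs)
```

Columns `a` and `c` carry almost the same information, so both get a large VIF, while the independent column `b` stays close to 1.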

In [None]:
# Drop the columns excluded after the VIF check
df_hpi.drop(columns=['population', 'Year', 'govt_subsidies'], inplace=True)

In [None]:
df_hpi.shape

In [None]:
plt.figure(figsize=(14,8))
sns.heatmap(abs(df_hpi.corr()), cmap='coolwarm', annot=True)

# Model Building

In [None]:

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error,mean_absolute_error,r2_score
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Lasso
from sklearn.linear_model import Ridge
from sklearn.linear_model import ridge_regression
from sklearn.model_selection import GridSearchCV

from sklearn.preprocessing import PowerTransformer
from sklearn.preprocessing import OneHotEncoder

In [None]:
x= df_hpi.drop(columns=['Price-index'])

y=np.sqrt(df_hpi['Price-index'])


xtrain,xtest,ytrain,ytest=train_test_split(x,y,test_size=0.20,random_state=10)


xtrain.shape,xtest.shape,ytrain.shape,ytest.shape
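Because the target here is the square root of `Price-index`, predictions have to be squared before errors are reported in index units. The toy numbers below are illustrative only and show how different the same error looks on the two scales.

```python
import numpy as np

y_true = np.array([180.0, 200.0, 250.0, 300.0])   # illustrative index values
y_sqrt = np.sqrt(y_true)                           # scale the model is trained on

# Pretend the model is off by 0.5 on the square-root scale for every row
pred_sqrt = y_sqrt + 0.5
pred = pred_sqrt ** 2                              # back to index units

mae_sqrt_scale = np.mean(np.abs(y_sqrt - pred_sqrt))
mae_original_scale = np.mean(np.abs(y_true - pred))
print(mae_sqrt_scale, mae_original_scale)
```

An error of 0.5 on the square-root scale corresponds to roughly 15 index points in this example, which is why metrics for different models are only comparable when computed on the same scale.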

## Function to evaluate different algorithms

In [None]:


# Lists that collect the evaluation metrics of every model
mean_absolut_error = []
mean_sq_error=[]
root_mean_sq_error=[]
training_score =[]
r2_list=[]
adj_r2_list=[]
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import r2_score


def score_metrix (model,X_train,X_test,Y_train,Y_test):

  '''
    Trains the model, prints MAE, MSE, RMSE, R2 and adjusted R2 on the test set,
    and appends them to the module-level metric lists.
  '''
  #training the model
  model.fit(X_train,Y_train)

  # Training Score
  training  = model.score(X_train,Y_train)
  print("Training score  =", training)

  try:
    # Report the best parameters when the model is a grid search
    print(f"The best parameters found out to be :{model.best_params_} \nwhere model best score is:  {model.best_score_} \n")
  except AttributeError:
    # Plain estimators have no best_params_ / best_score_
    pass


  #predicting the Test set and evaluting the models

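  # Linear models are trained on sqrt(target), so their predictions are squared before scoring.
  # `model == LinearRegression()` is always False, so check the type instead (unwrapping GridSearchCV).
  if isinstance(getattr(model, 'estimator', model), (LinearRegression, Lasso, Ridge)):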
    Y_pred = model.predict(X_test)

    #finding mean_absolute_error
    MAE  = mean_absolute_error(Y_test**2,Y_pred**2)
    print("MAE :" , MAE)

    #finding mean_squared_error
    MSE  = mean_squared_error(Y_test**2,Y_pred**2)
    print("MSE :" , MSE)

    #finding root mean squared error
    RMSE = np.sqrt(MSE)
    print("RMSE :" ,RMSE)

    #finding the r2 score

    r2 = r2_score(Y_test**2,Y_pred**2)
    print("R2 :" ,r2)
    #finding the adjusted r2 score
    adj_r2=1-(1-r2_score(Y_test**2,Y_pred**2))*((X_test.shape[0]-1)/(X_test.shape[0]-X_test.shape[1]-1))
    print("Adjusted R2 : ",adj_r2,'\n')

  else:
    # tree-based models are trained on the untransformed target
    Y_pred = model.predict(X_test)

    #finding mean_absolute_error
    MAE  = mean_absolute_error(Y_test,Y_pred)
    print("MAE :" , MAE)

    #finding mean_squared_error
    MSE  = mean_squared_error(Y_test,Y_pred)
    print("MSE :" , MSE)

    #finding root mean squared error
    RMSE = np.sqrt(MSE)
    print("RMSE :" ,RMSE)

    #finding the r2 score

    r2 = r2_score(Y_test,Y_pred)
    print("R2 :" ,r2)
    #finding the adjusted r2 score
    adj_r2=1-(1-r2_score(Y_test,Y_pred))*((X_test.shape[0]-1)/(X_test.shape[0]-X_test.shape[1]-1))
    print("Adjusted R2 : ",adj_r2,'\n')

    try:
      # Plot the feature importances of the best estimator found by the grid search
      best = model.best_estimator_
      features = X_train.columns
      importances = best.feature_importances_
      indices = np.argsort(importances)[-5:]  # Selecting the top five most important features

      plt.figure(figsize=(10, 8))  # Adjust figure size as needed
      plt.title('Top 5 Feature Importance')
      plt.barh(range(len(indices)), importances[indices], color='red', align='center')
      plt.yticks(range(len(indices)), [features[i] for i in indices])
      plt.xlabel('Relative Importance')
      plt.show()

    except AttributeError:
      # Not a grid search, or the estimator has no feature_importances_
      pass

  # Append this model's metrics to the shared lists
  mean_absolut_error.append(MAE)
  mean_sq_error.append(MSE)
  root_mean_sq_error.append(RMSE)
  training_score.append(training)
  r2_list.append(r2)
  adj_r2_list.append(adj_r2)

  print('*'*80)
  # Print the coefficients and intercept for models that expose them; skip otherwise
  try:
    print("coefficient \n",model.coef_)
    print('\n')
    print("Intercept  = " ,model.intercept_)
  except AttributeError:
    pass
  print('\n')
  print('*'*20, 'Plotting actual vs. predicted values (first 80 observations)', '*'*20)

  # Line plot of predicted and actual values, limited to the first 80 observations for readability
  try:
    plt.figure(figsize=(15,7))
    plt.plot((Y_pred)[:80])
    plt.plot((np.array(Y_test)[:80]))
    plt.legend(["Predicted","Actual"])
    plt.show()
  except Exception:
    pass
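A note on branching by model family inside a helper like this one: scikit-learn estimators do not define value equality, so comparing a model with a freshly created instance using `==` is never true. A type check is one way to do it. The helper name `is_linear_model` below is hypothetical, and the sketch assumes a grid search exposes the wrapped model through its `estimator` attribute.

```python
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

LINEAR_TYPES = (LinearRegression, Lasso, Ridge)

def is_linear_model(model):
    """True for a linear estimator, or a GridSearchCV wrapping one (hypothetical helper)."""
    base = getattr(model, "estimator", model)  # GridSearchCV keeps the wrapped model here
    return isinstance(base, LINEAR_TYPES)

# Two separate instances are never equal, even with identical parameters
equal_by_value = LinearRegression() == LinearRegression()
print(equal_by_value)

print(is_linear_model(LinearRegression()))
print(is_linear_model(GridSearchCV(Ridge(), {"alpha": [0.1, 1.0]})))
print(is_linear_model(DecisionTreeRegressor()))
```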

1. Which Evaluation metrics did you consider for a positive business impact and why?

### **Interpretability**:

The R2 score provides a straightforward interpretation of the proportion of variance in the target variable explained by the model. It represents the model's ability to capture and explain the variation in the data. A higher R2 score indicates that the model can better predict the target variable, which is crucial for making informed business decisions.

### Stakeholder Communication:

The R2 score is a widely understood metric and can be easily communicated to stakeholders, including non-technical individuals. It allows you to explain the model's performance in a concise and intuitive manner, enabling effective communication of the value and effectiveness of the predictive model.

### Decision-making Support:

A high R2 score suggests that the model is capturing a significant amount of the underlying patterns and relationships in the data. This can provide valuable insights for decision-making processes within a business context.

### Model Comparison:

The R2 score facilitates the comparison of different regression models or variations of the same model. It allows you to assess the relative performance of different approaches or configurations and choose the one that provides the best predictive power.
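As a worked example of the two scores, R2 is 1 minus the ratio of the residual sum of squares to the total sum of squares, and adjusted R2 penalises it for the number of features. The numbers below are made up for illustration, and the function name `r2_and_adjusted` is hypothetical.

```python
import numpy as np

def r2_and_adjusted(y_true, y_pred, n_features):
    """Return (R2, adjusted R2) for a set of predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation
    r2 = 1.0 - ss_res / ss_tot
    n = y_true.size
    adj = 1.0 - (1.0 - r2) * (n - 1) / (n - n_features - 1)
    return r2, adj

y_true = [200, 210, 220, 230, 240, 250]
y_pred = [202, 208, 221, 229, 243, 247]
r2, adj_r2 = r2_and_adjusted(y_true, y_pred, n_features=2)
print(round(r2, 3), round(adj_r2, 3))   # 0.984 0.973
```

Here the residual sum of squares is 28 and the total sum of squares is 1750, so R2 = 1 - 28/1750 = 0.984; with 6 observations and 2 features the adjusted value drops to about 0.973.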

## Linear Regression

In [None]:
score_metrix (LinearRegression(),xtrain,xtest,ytrain,ytest)

## Ridge Regression

In [None]:
from sklearn.preprocessing import PowerTransformer
pt = PowerTransformer()
xtrain_trans = pt.fit_transform(xtrain)      # fit transform the training set
xtest_trans = pt.transform(xtest)

L2 = Ridge()
parameters = {'alpha': [1e-15,1e-10,1e-8,1e-5,1e-4,1e-3,1e-2,1,5,10,20,30,40,45,50,55,60,100,0.5,1.5,1.6,1.7,1.8,1.9]}      # giving parameters
L2_cv = GridSearchCV(L2, parameters, scoring='r2', cv=5)                                                                    #using gridsearchcv and cross validate the model
score_metrix(L2_cv,xtrain_trans,xtest_trans,ytrain,ytest)

## Lasso Regression

In [None]:
L1 = Lasso() #creating variable
parameters = {'alpha': [1e-15,1e-13,1e-10,1e-8,1e-5,1e-4,1e-3,1e-2,1e-1,1,5,10,20,30,40,45,50,55,60,100,0.0014]} #lasso parameters
lasso_cv = GridSearchCV(L1, parameters, cv=5) #using gridsearchcv and cross validate the model


score_metrix(lasso_cv,xtrain_trans,xtest_trans,ytrain,ytest)


## Random Forest Regression Model

In [None]:
new_x=df_hpi.drop(columns='Price-index')
new_y=df_hpi['Price-index']


new_xtrain, new_xtest, new_ytrain, new_ytest = train_test_split(new_x, new_y, test_size= 0.20, random_state = 10)


new_xtrain.shape,   new_xtest.shape,   new_ytrain.shape,  new_ytest.shape


In [None]:
from sklearn.ensemble import RandomForestRegressor


# parameters for random forest regression model

rf_param_grid ={"n_estimators":[50,100,150],                    # candidate values; adjust as needed
              'max_depth' : [10,15,20,25,None],                 # None (not the string 'none') means no depth limit
              'min_samples_split': [10,50,100],
              'max_features' :[24,35,40,49]}


# Using grid search cv
Ranom_forest_Grid_search = GridSearchCV(RandomForestRegressor(),param_grid=rf_param_grid,n_jobs=-1,verbose=2)

In [None]:
score_metrix(Ranom_forest_Grid_search, new_xtrain, new_xtest, new_ytrain, new_ytest)
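For reference, a small self-contained grid search on synthetic data might look like the sketch below. The data, the grid values, and the choice of fractional `max_features` are assumptions for illustration and are not the settings used in this notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(42)
X = rng.normal(size=(120, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=120)

param_grid = {
    "n_estimators": [20, 40],
    "max_depth": [3, None],        # None means the trees grow without a depth limit
    "max_features": [0.5, 1.0],    # fractions of the columns, valid for any feature count
}
search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid=param_grid,
    cv=3,
    scoring="r2",
)
search.fit(X, y)
print(search.best_params_)
print(search.best_score_)
```

Expressing `max_features` as a fraction avoids tying the grid to a particular number of columns in the dataset.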

## Decision Tree Regressor Model

In [None]:
from sklearn.tree import DecisionTreeRegressor


# Parameters for the Decision Tree model
param_grid = {'criterion' : ["squared_error"],
              'splitter' : ["best", "random"],
              'max_depth' : [10,15,25, None],   # None (not the string 'none') means no depth limit
              'min_samples_split': [10,50,100],
              'max_features' :[24,35,40,49]}

# Gridsearch CV
Dt_grid_search = GridSearchCV(DecisionTreeRegressor(),param_grid=param_grid,cv=2,n_jobs=-1)


score_metrix(Dt_grid_search,new_xtrain,new_xtest,new_ytrain,new_ytest)

### Conclusion

If **simplicity and interpretability** are important, linear models like Linear Regression, Ridge, or Lasso might be favorable.

If **predictive accuracy** is the priority and interpretability is less critical, Random Forest or Decision Tree might be considered.

**Regularization (Ridge/Lasso)** helps when you suspect overfitting.

The balance between bias and variance needs consideration. Linear models have lower variance but might have higher bias compared to ensemble methods like Random Forest.

Considering the overall balance between the given metrics, Ridge or Lasso Regression might be suitable as they offer a balance between good performance and regularization to mitigate overfitting.
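To make the regularization point concrete, the sketch below solves ridge regression in closed form on synthetic data and shows the coefficient vector shrinking as `alpha` grows. The data and the helper name `ridge_closed_form` are illustrative assumptions, and the intercept is omitted for brevity.

```python
import numpy as np

def ridge_closed_form(X, y, alpha):
    """Ridge coefficients (no intercept) from the regularized normal equations."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(p), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=50)

alphas = (0.0, 1.0, 10.0, 100.0)
norms = [np.linalg.norm(ridge_closed_form(X, y, a)) for a in alphas]
print(norms)
```

With `alpha = 0` this is ordinary least squares; larger values pull the coefficients toward zero, trading some bias for lower variance.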

### Model Explainability

In [None]:
model1=LinearRegression()


In [None]:
model1.fit(xtrain,ytrain)

In [None]:
%pip install shap

In [None]:
import shap
explainer = shap.KernelExplainer(model1.predict, xtrain)

In [None]:
shap_values = explainer.shap_values(xtest)

In [None]:
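# Use the columns of xtest as labels; numeric_col was built before columns were dropped and still includes the target
shap.summary_plot(shap_values, xtest, feature_names=xtest.columns)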
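For a linear model the attributions have a simple closed form when features are treated as independent: the contribution of feature j for a row is its coefficient times the row's deviation from the background mean of that feature. The sketch below checks this with NumPy only; the coefficients and data are made up, and it does not call the `shap` library.

```python
import numpy as np

rng = np.random.default_rng(0)
X_background = rng.normal(size=(100, 3))   # stands in for the training data
coef = np.array([1.5, -2.0, 0.0])          # hypothetical fitted coefficients
intercept = 4.0

def predict(X):
    return X @ coef + intercept

x = np.array([[0.5, 1.0, -3.0]])           # one row to explain

# Per-feature contribution: coefficient times deviation from the background mean
contributions = coef * (x - X_background.mean(axis=0))
base_value = predict(X_background).mean()

print(contributions)
print(base_value + contributions.sum(), predict(x)[0])
```

The contributions add up to the difference between the row's prediction and the average prediction, and a feature with a zero coefficient gets a zero contribution regardless of its value.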


In [None]:
import pickle

In [None]:
# Save the model to a pickle file for deployment
with open('model1.pkl', 'wb') as f:
    pickle.dump(model1, f)
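Loading the file back is the mirror image of saving it. The sketch below does a round trip with a small model fitted on synthetic data and writes to a temporary directory, so the data and the path are assumptions rather than the notebook's own.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))
y = X @ np.array([1.0, 2.0]) + 0.5

model = LinearRegression().fit(X, y)

# Save to a temporary location
path = os.path.join(tempfile.mkdtemp(), "model1.pkl")
with open(path, "wb") as f:
    pickle.dump(model, f)

# Load it back and confirm the predictions are unchanged
with open(path, "rb") as f:
    restored = pickle.load(f)

same = np.allclose(model.predict(X), restored.predict(X))
print(same)
```

Pickle files should only be loaded from trusted sources, and ideally with the same scikit-learn version that produced them.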



In the exploration of U.S. home price prediction spanning two decades, multiple regression models were deployed, including linear regression, Lasso and Ridge regularization, Random Forest Regressor, and Decision Tree Regressor. These models underwent a thorough evaluation based on diverse metrics.

Among the various models examined, Lasso regularization emerged as a standout performer in predicting U.S. home prices. Its consistent and robust performance underscores its ability to capture intricate patterns within the dataset effectively.

Additionally, the application of the SHAP (SHapley Additive exPlanations) tool for model interpretability illuminated the influential factors impacting U.S. home prices. This tool provided invaluable insights into understanding the factors that significantly influence the housing market.

An interesting revelation from this comprehensive analysis was the identification of certain features that hold minimal or negligible impact on U.S. home prices. This nuanced understanding of influential factors, coupled with the selection of the most accurate model, holds immense potential to guide informed decision-making processes within the dynamic landscape of real estate.