# Project: Predicting Housing Prices

Submitted by Murali Cheruvu

Presentation Link: https://drive.google.com/file/d/1eEE2YdKz0Z3FD4erokP5UY9pNYuZXkbf/view

## Project Domain and Data Sources

- The goal of this project is to **predict the sale prices** of residential homes listed in the test dataset as accurately as possible, which makes it a **regression problem**

- **Kaggle** provides the housing price data as separate training and test datasets, each with **80 features** describing various aspects of residential homes in the city of Ames, Iowa; the **snapshot was taken in 2010**.

- There are **1460 rows** in the training dataset and **1459 rows** in the test dataset

- **Out of the 80 variables - 23 are nominal, 23 are ordinal, 14 are discrete, and 20 are continuous**

## Exploratory Data Analysis (EDA) Outcome

Apply univariate, bivariate and multivariate analysis techniques, and produce statistical summaries and data visualizations for each feature.

- **Top 5 features** most **highly correlated with sale price** are: *Overall-Quality, Ground-Living-Area, Garage-Cars, Garage-Area and Total-Basement-Sq-Ft*

- **Top 5 features** that are **skewed by more than 75%** are: *Low-Quality-Fin-Sq-Ft, Ground-Living-Area, Kitchen-Above-Grade, Wood-Deck-SF, Basement-Half-Bath*

- **Top 5 features with the most null values**: *Pool-Quality, Misc-Features, Alley, Fence, Fireplace-Quality*

- **One-hot encoding** of the categorical features created about **160 new features**

- **Feature Engineering** added **23 new features** including: *Total-Area, High-Season, Age, Season-Sold, Remodeled*

- **Top 2 features with outliers** are: *Ground-Living-Area and Garage-Area*; **8 rows** are affected by outliers in these two features
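
The correlation, skewness and null-value rankings listed above can be produced with a few pandas calls. The following is a minimal sketch on a small synthetic frame, not the project's actual notebook: the column names follow the Kaggle dataset, but the values and the helper name `eda_summary` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def eda_summary(df: pd.DataFrame, target: str, top_n: int = 5) -> dict:
    """Rank columns by correlation with the target, skewness, and null share."""
    numeric = df.select_dtypes(include="number")
    # Absolute Pearson correlation of each numeric feature with the target.
    corr = numeric.corr()[target].drop(target).abs().sort_values(ascending=False)
    # Sample skewness of each numeric feature (target excluded).
    skew = numeric.drop(columns=target).skew().sort_values(ascending=False)
    # Fraction of missing values per column, numeric or not.
    nulls = df.isna().mean().sort_values(ascending=False)
    return {
        "top_corr": corr.head(top_n),
        "top_skew": skew.head(top_n),
        "top_nulls": nulls.head(top_n),
    }


# Synthetic stand-in for the training data (values are illustrative only).
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "GrLivArea": rng.lognormal(mean=7.3, sigma=0.3, size=n),
    "OverallQual": rng.integers(1, 11, size=n),
    "PoolQC": [None] * (n - 2) + ["Gd", "Ex"],  # almost entirely missing
})
toy["SalePrice"] = (
    100 * toy["GrLivArea"]
    + 15000 * toy["OverallQual"]
    + rng.normal(0, 5000, size=n)
)

summary = eda_summary(toy, target="SalePrice")
print(summary["top_nulls"])
```

Ranking by the fraction of nulls rather than the raw count keeps the result comparable between the training and test datasets, which differ in size by one row.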

## ML Workflow

1. Apply the ML pipeline to **clean, scale, encode and apply feature engineering** to the *training and test datasets* **separately**, so that neither dataset influences the other during preparation
2. Make sure every feature present in the training dataset is also present in the test dataset, since ML **algorithms expect the fit and predict methods to be applied to the same set of features**
3. Apply univariate feature selection to keep the top 20% of features based on their **statistical significance**, using scoring functions such as *f_regression, f_classif and chi-square* through **Select Percentile**
4. Create a **cross-validation** dataset by splitting the training dataset in a **70-30** ratio
5. Apply the ML pipeline of algorithms, **Ridge, Lasso, SVM, Random Forest** and **XGB**, through **K-Fold** cross-validation and collect performance metrics: **MAE, MSE, RMSE and R^2**
6. **Compare** the top 2 performing algorithms, Random Forest and XGB, against the third-best algorithm, SVM, as the **baseline**; run a **statistical significance** test to confirm the difference
7. Tune the **hyper-parameters** of both the Random Forest and XGB models using **Grid Search**; fit and predict on the training and cross-validation datasets and compare the predictions with the actual values; then **predict** the sale prices of the test dataset
8. **Ensemble** Random Forest and XGB to improve the predictions, and submit the predictions to **Kaggle** to get a better score
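
Steps 3 to 5 can be sketched with scikit-learn as follows. This is an illustrative outline under stated assumptions, not the project's code: the data comes from `make_regression` rather than the housing dataset, the hyper-parameters are placeholders, the helper `make_model` is hypothetical, and XGB is left out so the sketch depends only on scikit-learn (an `xgboost.XGBRegressor` could be added to the dictionary in the same way).

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectPercentile, f_regression
from sklearn.linear_model import Lasso, Ridge
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic regression data standing in for the prepared housing features.
X, y = make_regression(n_samples=300, n_features=50, n_informative=10,
                       noise=10.0, random_state=0)

# Step 4: hold out 30% of the training data for validation.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=0)


def make_model(estimator):
    """Scale, keep the top 20% of features by f_regression, then fit."""
    return make_pipeline(
        StandardScaler(),
        SelectPercentile(score_func=f_regression, percentile=20),  # step 3
        estimator,
    )


models = {
    "Ridge": make_model(Ridge(alpha=1.0)),
    "Lasso": make_model(Lasso(alpha=0.1)),
    "SVM": make_model(SVR(kernel="linear", C=10.0)),
    "Random Forest": make_model(
        RandomForestRegressor(n_estimators=50, random_state=0)),
}

# Step 5: K-Fold cross-validation on the training split, then hold-out metrics.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
results = {}
for name, model in models.items():
    cv_rmse = -cross_val_score(
        model, X_train, y_train, cv=cv,
        scoring="neg_root_mean_squared_error").mean()
    pred = model.fit(X_train, y_train).predict(X_val)
    mse = mean_squared_error(y_val, pred)
    results[name] = {
        "cv_rmse": cv_rmse,
        "mae": mean_absolute_error(y_val, pred),
        "mse": mse,
        "rmse": float(np.sqrt(mse)),
        "r2": r2_score(y_val, pred),
    }

for name, m in results.items():
    print(f"{name}: RMSE={m['rmse']:.2f}  R^2={m['r2']:.3f}")
```

Placing the selector inside each pipeline means it is refit on every training fold, so the held-out fold never influences which features are kept.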

## Conclusions

- The top 3 performing algorithms are XGB, Random Forest and SVM
- Ensembling the top 2, XGB and Random Forest, gave better performance
- Extract the feature importances of XGB to make sure they are meaningful and consistent with our analysis

![image.png](attachment:image.png)
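
The source does not state how the Random Forest and XGB predictions were combined. One common option is a weighted average, sketched below with made-up numbers; the helper names `rmse` and `blend`, the equal weights and all values are assumptions for illustration only.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    squared = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return math.sqrt(squared / len(actual))


def blend(pred_a, pred_b, weight_a=0.5):
    """Weighted average of two models' predictions."""
    return [weight_a * a + (1.0 - weight_a) * b
            for a, b in zip(pred_a, pred_b)]


# Made-up validation targets and model outputs (not project results).
actual = [200000.0, 150000.0, 320000.0, 180000.0]
rf_pred = [210000.0, 140000.0, 300000.0, 190000.0]
xgb_pred = [195000.0, 158000.0, 330000.0, 172000.0]

ensemble_pred = blend(rf_pred, xgb_pred, weight_a=0.5)

print("RF RMSE:      ", round(rmse(actual, rf_pred), 1))
print("XGB RMSE:     ", round(rmse(actual, xgb_pred), 1))
print("Ensemble RMSE:", round(rmse(actual, ensemble_pred), 1))
```

In this toy example the two models err in opposite directions on every row, so averaging cancels part of the error. Averaging helps less when the models' errors are strongly correlated.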