# House Prices - Advanced Regression Techniques
[Michael DiSanto](https://www.michaelpdisanto.com) - 2023

## Project Objective

Ask a home buyer to describe their dream house, and they probably won't begin with the height of the basement ceiling or the proximity to an east-west railroad. But this playground competition's dataset proves that much more influences price negotiations than the number of bedrooms or a white-picket fence.

With 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, this competition challenges you to predict the final price of each home.

In this project, I aim to predict the sale prices of homes using advanced regression techniques, including feature engineering, random forests, and gradient boosting. For each Id in the test set, I will predict the value of the SalePrice variable. Predictions are evaluated on the Root-Mean-Squared-Error (RMSE) between the logarithm of the predicted price and the logarithm of the observed sale price. (Taking logs means the score reflects relative rather than absolute error, so mistakes on expensive houses and cheap houses affect the result equally.)
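
The evaluation metric can be sketched in a few lines of NumPy. This is a minimal illustration, not the competition's scoring code: the function name `rmse_log` and the sample prices are assumptions made for the example.

```python
import numpy as np


def rmse_log(y_true, y_pred):
    """RMSE between the natural logs of observed and predicted prices."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((np.log(y_pred) - np.log(y_true)) ** 2)))


# Illustrative prices only: each prediction is 10% too high.
cheap = rmse_log([100_000], [110_000])
expensive = rmse_log([500_000], [550_000])
print(cheap, expensive)  # both are approximately log(1.1)
```

A 10% overestimate costs the same on a cheap house as on an expensive one, which is the property the log transform is meant to provide.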

## Understanding the Data
* Import necessary libraries.
* Load and inspect the dataset(s) you'll be working with.
* Display basic statistics, data types, and any initial observations.
* Clean and preprocess the data, handling missing values where necessary.

### Importing Libraries

In [1]:
import numpy as np
import pandas as pd

### Loading Data

In [2]:
data = pd.read_csv("data/train.csv")

### Data Summary

In [5]:
data.shape

(1460, 81)

In [6]:
data.head()

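| | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 60 | RL | 65.0 | 8450 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2008 | WD | Normal | 208500 |
| 1 | 2 | 20 | RL | 80.0 | 9600 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 5 | 2007 | WD | Normal | 181500 |
| 2 | 3 | 60 | RL | 68.0 | 11250 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 9 | 2008 | WD | Normal | 223500 |
| 3 | 4 | 70 | RL | 60.0 | 9550 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2006 | WD | Abnorml | 140000 |
| 4 | 5 | 60 | RL | 84.0 | 14260 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 12 | 2008 | WD | Normal | 250000 |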


In [7]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             1460 non-null   int64  
 1   MSSubClass     1460 non-null   int64  
 2   MSZoning       1460 non-null   object 
 3   LotFrontage    1201 non-null   float64
 4   LotArea        1460 non-null   int64  
 5   Street         1460 non-null   object 
 6   Alley          91 non-null     object 
 7   LotShape       1460 non-null   object 
 8   LandContour    1460 non-null   object 
 9   Utilities      1460 non-null   object 
 10  LotConfig      1460 non-null   object 
 11  LandSlope      1460 non-null   object 
 12  Neighborhood   1460 non-null   object 
 13  Condition1     1460 non-null   object 
 14  Condition2     1460 non-null   object 
 15  BldgType       1460 non-null   object 
 16  HouseStyle     1460 non-null   object 
 17  OverallQual    1460 non-null   int64  
 18  OverallC

In [13]:
null_pcts = round((data.isnull().sum() / len(data)) * 100, 2)
null_pcts[null_pcts > 0]

LotFrontage     17.74
Alley           93.77
MasVnrType       0.55
MasVnrArea       0.55
BsmtQual         2.53
BsmtCond         2.53
BsmtExposure     2.60
BsmtFinType1     2.53
BsmtFinType2     2.60
Electrical       0.07
FireplaceQu     47.26
GarageType       5.55
GarageYrBlt      5.55
GarageFinish     5.55
GarageQual       5.55
GarageCond       5.55
PoolQC          99.52
Fence           80.75
MiscFeature     96.30
dtype: float64

In [18]:
data.Alley.value_counts()

Grvl    50
Pave    41
Name: Alley, dtype: int64

In [19]:
data.FireplaceQu.value_counts()

Gd    380
TA    313
Fa     33
Ex     24
Po     20
Name: FireplaceQu, dtype: int64

In [20]:
data.PoolQC.value_counts()

Gd    3
Ex    2
Fa    2
Name: PoolQC, dtype: int64

In [23]:
data.Fence.value_counts()

MnPrv    157
GdPrv     59
GdWo      54
MnWw      11
Name: Fence, dtype: int64

In [22]:
data.MiscFeature.value_counts()

Shed    49
Gar2     2
Othr     2
TenC     1
Name: MiscFeature, dtype: int64

### Data Cleaning + Preprocessing
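
The columns with the highest share of nulls (Alley, FireplaceQu, PoolQC, Fence, MiscFeature) are all categorical. According to the competition's data description, a missing value there generally means the house lacks that feature, not that the value went unrecorded. One possible treatment is sketched below on a small made-up frame. The helper name `fill_absent` and the `"Absent"` label are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd

# Columns where NaN is taken to mean "feature not present" (an assumption
# based on the competition's data description, not on this notebook).
ABSENT_COLS = ["Alley", "FireplaceQu", "PoolQC", "Fence", "MiscFeature"]


def fill_absent(df, cols=ABSENT_COLS, label="Absent"):
    """Return a copy of df with NaN in the given columns replaced by a label."""
    out = df.copy()
    present = [c for c in cols if c in out.columns]
    out[present] = out[present].fillna(label)
    return out


# Tiny made-up frame standing in for train.csv.
sample = pd.DataFrame({
    "Alley": ["Grvl", None, None],
    "Fence": [None, "MnPrv", None],
    "LotFrontage": [65.0, None, 68.0],
})
cleaned = fill_absent(sample)
print(cleaned)
```

Numeric columns such as LotFrontage are left untouched here, since a missing measurement probably needs a different strategy (for example, imputation).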

## Exploratory Data Analysis (EDA)
* Visualizations and statistical summaries to gain insights into the data.
* Histograms, scatter plots, box plots, correlation matrices, etc.
* Identify patterns, trends, and potential outliers.

### Visualizations

### Statistical Summaries
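
One simple statistical summary is the correlation of each numeric feature with SalePrice, ranked by absolute strength. The sketch below runs on synthetic data so that it is self-contained: the column names mirror the Ames data, but every value is generated and says nothing about the real dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
# Synthetic stand-in for the training data; values are not from train.csv.
area = rng.uniform(800, 3000, n)
qual = rng.integers(1, 11, n)
toy = pd.DataFrame({
    "GrLivArea": area,
    "OverallQual": qual,
    "Noise": rng.normal(0, 1, n),
    "SalePrice": 50 * area + 15_000 * qual + rng.normal(0, 10_000, n),
})

# Correlation of each numeric column with the target, strongest first.
corr = (
    toy.select_dtypes("number")
    .corr()["SalePrice"]
    .drop("SalePrice")
    .sort_values(key=np.abs, ascending=False)
)
print(corr)
```

On the real frame, the same expression could be applied to `data` directly, because `select_dtypes("number")` drops the object columns first.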

## Data Preparation 
* Feature engineering: Create new features if needed.
* Data scaling, normalization, or encoding for machine learning models.
* Train-test split: Divide the data into training and testing sets.

### Feature Engineering
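
As an example of what feature engineering might look like here, the sketch below derives two columns from existing ones: combined square footage and the age of the house at sale. The feature names `TotalSF` and `HouseAge` and the helper `add_features` are hypothetical. The input columns are assumed to exist under these names in train.csv.

```python
import pandas as pd


def add_features(df):
    """Return a copy of df with two illustrative derived columns."""
    out = df.copy()
    # Combined basement, first-floor, and second-floor square footage.
    out["TotalSF"] = out["TotalBsmtSF"] + out["1stFlrSF"] + out["2ndFlrSF"]
    # Age of the house at the time of sale, in years.
    out["HouseAge"] = out["YrSold"] - out["YearBuilt"]
    return out


# Illustrative rows using column names from the Ames data.
sample = pd.DataFrame({
    "TotalBsmtSF": [856, 1262],
    "1stFlrSF": [856, 1262],
    "2ndFlrSF": [854, 0],
    "YearBuilt": [2003, 1976],
    "YrSold": [2008, 2007],
})
engineered = add_features(sample)
print(engineered[["TotalSF", "HouseAge"]])
```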

### Data Normalization/Encoding (for ML models)
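
Two encodings are likely to be useful for this dataset. Quality columns such as FireplaceQu use an ordered scale (Po, Fa, TA, Gd, Ex) that maps naturally to integers, while nominal columns such as MSZoning have no order and can be one-hot encoded. The sketch below is one possible approach on made-up rows. Mapping missing values to 0 assumes that NaN means "no fireplace".

```python
import pandas as pd

# Ordered quality codes as they appear in FireplaceQu, worst to best.
QUALITY_ORDER = {"Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5}

sample = pd.DataFrame({
    "FireplaceQu": ["Gd", None, "TA", "Ex"],
    "MSZoning": ["RL", "RM", "RL", "FV"],
})

encoded = sample.copy()
# Ordinal encoding: map the quality scale to integers, missing -> 0.
encoded["FireplaceQu"] = (
    encoded["FireplaceQu"].map(QUALITY_ORDER).fillna(0).astype(int)
)
# One-hot encoding for a nominal column with no natural order.
encoded = pd.get_dummies(encoded, columns=["MSZoning"], dtype=int)
print(encoded)
```

Tree-based models such as random forests and gradient boosting do not require feature scaling, so normalization matters mainly if a linear model is added to the comparison.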

### Splitting the Data
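
Because the competition's test set has no SalePrice column, a validation set has to be carved out of train.csv. A minimal sketch using only pandas follows. The helper name `split_frame`, the 80/20 ratio, and the seed are assumptions, and the approach relies on the frame having a unique index.

```python
import numpy as np
import pandas as pd


def split_frame(df, valid_frac=0.2, seed=42):
    """Randomly split df into training and validation frames."""
    valid = df.sample(frac=valid_frac, random_state=seed)
    train = df.drop(valid.index)  # requires a unique index
    return train, valid


# Stand-in frame with the same number of rows as train.csv (1460).
toy = pd.DataFrame({"Id": np.arange(1, 1461)})
train_part, valid_part = split_frame(toy)
print(train_part.shape, valid_part.shape)  # (1168, 1) (292, 1)
```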

## Modeling
* Select machine learning algorithms that are appropriate for your problem.
* Train and evaluate models.
* Hyperparameter tuning.
* Cross-validation if applicable.
* Performance metrics (for this regression task, RMSE on log-transformed prices).

### Model Selection

### Model Training and Evaluation

### Hyperparameter Tuning

### Cross-Validation (if applicable)
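
With only 1460 training rows, k-fold cross-validation gives a steadier estimate of the log-scale RMSE than a single split. The sketch below implements the fold logic in plain NumPy on synthetic data. Ordinary least squares is used purely as a stand-in model to keep the example self-contained, where the project itself would use random forests or gradient boosting. The function names are hypothetical.

```python
import numpy as np


def kfold_indices(n_samples, n_splits=5, seed=0):
    """Yield (train_idx, valid_idx) pairs covering every row exactly once."""
    order = np.random.default_rng(seed).permutation(n_samples)
    for fold in np.array_split(order, n_splits):
        yield np.setdiff1d(order, fold), fold


def cv_rmse(X, y_log, n_splits=5):
    """Mean validation RMSE of an ordinary least-squares fit on log prices."""
    X = np.column_stack([np.ones(len(X)), X])  # add an intercept column
    scores = []
    for train_idx, valid_idx in kfold_indices(len(X), n_splits):
        coef, *_ = np.linalg.lstsq(X[train_idx], y_log[train_idx], rcond=None)
        resid = X[valid_idx] @ coef - y_log[valid_idx]
        scores.append(np.sqrt(np.mean(resid ** 2)))
    return float(np.mean(scores))


# Synthetic data: log price is linear in two features plus noise (std 0.1).
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 2))
y_log = 12.0 + 0.3 * X[:, 0] - 0.1 * X[:, 1] + rng.normal(0, 0.1, 300)
score = cv_rmse(X, y_log)
print(round(score, 3))
```

The score should land close to the noise level of 0.1, since the stand-in model matches how the synthetic data was generated.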

### Performance Metrics

## Results and Discussion
* Present the results of your analysis and modeling.
* Interpret the model's performance and what it means for the project's goals.
* Discuss any challenges encountered and potential improvements.

### Analysis and Modeling Results

xxxxxx

### Performance Interpretation

xxxxxx

### Challenges and Potential Improvements

xxxxxx

## Conclusion
* Summarize the key findings and outcomes.
* Reiterate the project's objectives and whether they were achieved.
* Suggest possible extensions or future steps for the project.
* Highlight areas that could benefit from additional data or research.

### Key Findings and Outcomes

xxxxxx

### Future Work

xxxxxx

## References
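* [House Prices - Advanced Regression Techniques](https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques) (Kaggle competition)
* De Cock, D. (2011). [Ames, Iowa: Alternative to the Boston Housing Data as an End of Semester Regression Project](http://jse.amstat.org/v19n3/decock.pdf). Journal of Statistics Education, 19(3).
* [Fun with Real Estate Data](https://www.kaggle.com/code/skirmer/fun-with-real-estate-data) (Kaggle notebook by skirmer)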

## License

MIT License

Copyright (c) 2023 Michael DiSanto

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.