# Problem Statement

This model aims to predict the sale price (target variable) of each house from a given set of characteristics (predictor variables), such as the size of the house and its age. The specific output is a predicted value of the SalePrice variable for each Id in the test set. 


# Executive Summary: 

We were given two datasets (training and test). The training dataset was used to train the model that was later used to predict house sale prices on the test data set. 

The training dataset contained 81 variables and 2051 rows. This was eventually whittled down to 20 variables (of which 12 were dummy variables derived from the same variable, neighborhood) that formed the final set of features used to generate the model. 

I fitted the base model using linear regression, and subsequently tuned the regularisation parameters of both the lasso and ridge regression models using grid-search hyperparameter optimisation.  

The final model was fitted using a basic linear regression model, as this produced the best results of the lot. It had an adjusted R^2 value of 0.835 and an RMSE of approximately 29500 when evaluated on the test portion of the training dataset. 

This fitted model was then used to predict the values on the given test dataset, where it achieved an RMSE of 33500. 
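
For reference, the two metrics quoted above can be computed as in the minimal sketch below. The function names `rmse` and `adjusted_r2` are illustrative and not taken from the project code; `n_features` is the number of predictors in the model.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between actual and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def adjusted_r2(y_true, y_pred, n_features):
    """R^2 penalised for the number of predictors used by the model."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    r2 = 1 - ss_res / ss_tot
    n = len(y_true)
    return float(1 - (1 - r2) * (n - 1) / (n - n_features - 1))
```

Adjusted R^2 is the more useful of the two R^2 variants when comparing models with different numbers of features, since plain R^2 never decreases as features are added.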


**Overview of project flow** 

1) Project Objectives:
- Generate a regression model to predict housing prices; specifically, to achieve the lowest RMSE in the Kaggle challenge (which also serves as a benchmark of the effectiveness of the model)

2) Exploratory data analysis
- ensured that data points were cast to the correct datatypes 
- identified all null values in the dataset and carried out further inspection
- plotted a heatmap, pairplot, histograms and boxplots to understand the general distribution of the data and the presence of outliers 

3) Clean up dataset
- did not remove any null values at the initial stage as they were too numerous  
- renamed column names to standardize naming  
- split data set into numerical and categorical variables 

4) Transform variables
- split categorical variables into nominal, ordinal and binary  
- performed mean encoding on nominal variables, ordinal encoding on ordinal variables and one-hot encoding on binary variables
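
The three encodings in step 4 could look roughly like the sketch below. The toy data, the column names (`neighborhood`, `kitchen_qual`, `central_air`, `saleprice`) and the quality ordering are assumptions for illustration only; they are not the project's actual mappings.

```python
import pandas as pd

# Toy data; values are made up purely to illustrate the encodings.
df = pd.DataFrame({
    "neighborhood": ["NAmes", "NAmes", "StoneBr", "StoneBr"],  # nominal
    "kitchen_qual": ["TA", "Gd", "Ex", "Gd"],                  # ordinal
    "central_air": ["Y", "N", "Y", "Y"],                       # binary
    "saleprice": [100000, 120000, 300000, 320000],
})

# Mean encoding: replace each category with the mean target value of that category.
neighborhood_means = df.groupby("neighborhood")["saleprice"].mean()
df["neighborhood_mean"] = df["neighborhood"].map(neighborhood_means)

# Ordinal encoding: map ordered categories to integers (assumed ordering).
quality_order = {"Fa": 1, "TA": 2, "Gd": 3, "Ex": 4}
df["kitchen_qual_ord"] = df["kitchen_qual"].map(quality_order)

# One-hot encoding of the binary variable, dropping the first level to avoid redundancy.
df = pd.get_dummies(df, columns=["central_air"], drop_first=True)
```

Because mean encoding uses the target, the category means would normally be computed on the training rows only and then mapped onto any held-out rows, to avoid leaking test information into the features.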

5) Feature Selection 
- ran regressions (for numerical variables) and selected the most highly correlated variables
- ran regressions separately for ordinal and nominal vars and selected the most highly correlated ones 
- selected the top numerical, ordinal, nominal and all binary vars as the final feature set for the model
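
One simple way to rank candidate features as in step 5 is by their absolute Pearson correlation with the target. The sketch below assumes that approach; the helper `top_correlated` and the toy values are hypothetical and not the project's actual selection code.

```python
import pandas as pd

def top_correlated(df, target, k):
    """Return the k numeric columns most correlated (in absolute value) with the target."""
    corr = df.select_dtypes("number").corr()[target].drop(target)
    return corr.abs().sort_values(ascending=False).head(k).index.tolist()

# Toy data, made up for illustration.
toy = pd.DataFrame({
    "gr_liv_area": [1000, 1500, 2000, 2500, 3000],
    "overall_qual": [5, 6, 6, 8, 9],
    "noise": [3, 1, 4, 1, 5],
    "saleprice": [100000, 150000, 210000, 240000, 310000],
})

selected = top_correlated(toy, "saleprice", k=2)
```

Ranking by correlation with the target does not account for correlation between the selected features themselves, which is one reason the multicollinearity checks listed under next steps are worth doing.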

6) Train/test split 
- used sklearn's `train_test_split` to split the data, with 80% of the dataset used for training and the rest for testing

7) Scale variables
- used sklearn's `StandardScaler` for scaling, fitted on the training set only 
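
Steps 6 and 7 can be sketched as follows. The synthetic `X` and `y` and the `random_state` value are assumptions for illustration; the point is that the scaler is fitted on the training split and only applied to the test split.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the selected features and sale prices.
rng = np.random.default_rng(42)
X = rng.normal(loc=1500, scale=400, size=(100, 3))
y = X @ np.array([50.0, 20.0, 10.0]) + rng.normal(scale=1000, size=100)

# 80% of the rows for training, 20% for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit the scaler on the training split only, then reuse it on the test split.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```

Fitting the scaler on the training split alone keeps the test split's mean and variance out of the model, so the test score stays an honest estimate.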

8) Evaluate, test and fit model :
- K-fold cross-validation using 10 folds. The R^2 scores across folds were fairly consistent  
- the mean cross-validation R^2 score was also approximately the same as the R^2 score of the fitted linear regression model
- also fitted an OLS model (statsmodels) to generate additional statistics 
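
The 10-fold cross-validation in step 8 could be run roughly as below. The synthetic data, the shuffling and the `random_state` are assumptions; the project may have used different settings.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the scaled training features and target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = X @ np.array([3.0, -2.0, 1.0, 0.5]) + rng.normal(scale=0.1, size=200)

# 10 folds; each fold is held out once while the model trains on the other nine.
cv = KFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(LinearRegression(), X, y, cv=cv, scoring="r2")

mean_r2 = scores.mean()    # compare against the R^2 of the model fitted on all training data
spread_r2 = scores.std()   # a small spread suggests the model is stable across folds
```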


9) Regularize & tune hyperparameters:
- Ridge regression
- Lasso regression
- Hyperparameter tuning using grid search 
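
A grid search over the regularisation strength for both models, as in step 9, might look like the sketch below. The alpha grid, the 5-fold setting and the synthetic data are assumptions, not the values used in the project.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the scaled training data.
rng = np.random.default_rng(1)
X = rng.normal(size=(150, 5))
y = X @ np.array([4.0, 0.0, -3.0, 0.0, 2.0]) + rng.normal(scale=0.5, size=150)

# Candidate regularisation strengths (assumed grid).
param_grid = {"alpha": [0.01, 0.1, 1.0, 10.0, 100.0]}

results = {}
for name, model in [("ridge", Ridge()), ("lasso", Lasso(max_iter=10000))]:
    search = GridSearchCV(
        model, param_grid, cv=5, scoring="neg_root_mean_squared_error"
    )
    search.fit(X, y)
    # best_score_ is the negated RMSE, so flip the sign for reporting.
    results[name] = (search.best_params_["alpha"], -search.best_score_)
```

If the best alpha lands at the edge of the grid, that is usually a sign the grid should be extended in that direction.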

10) Final checks & final model selection:
- compared (adjusted) R^2 and RMSE scores across the different models and selected the one with the best (adjusted) R^2 score 

11) Generate predictions on general (test) dataset: 
- Performed the same set of transformations on the test set predictor variables as we did on the training set predictor variables
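
One practical detail when repeating the transformations on the test set is that dummy columns can differ between the two datasets. The sketch below shows one possible way to align them with `reindex`; the toy data and column names are made up, and this is not necessarily how the project handled it.

```python
import pandas as pd

# Toy training and test predictors; the test set has an unseen neighborhood
# and is missing two that appear in training.
train = pd.DataFrame({
    "gr_liv_area": [1500, 2000, 1200],
    "neighborhood": ["NAmes", "StoneBr", "Veenker"],
})
test = pd.DataFrame({
    "gr_liv_area": [1700, 900],
    "neighborhood": ["NAmes", "Sawyer"],
})

X_train = pd.get_dummies(train, columns=["neighborhood"])
X_test = pd.get_dummies(test, columns=["neighborhood"])

# Force the test columns to match the training columns exactly:
# missing dummies are filled with 0, extra dummies are dropped.
X_test = X_test.reindex(columns=X_train.columns, fill_value=0)
```

Any scaler or encoder fitted on the training data would then be applied to `X_test` with `transform` only, never refitted.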



# Data Dictionary: 

Source: The data set contains information from the Ames Assessor’s Office used in computing assessed values for individual residential properties sold in Ames, IA from 2006 to 2010.

The data dictionary here only contains the final set of features selected for the model. 

Go here for a description of the full dataset used: http://jse.amstat.org/v19n3/decock/DataDocumentation.txt


|Name|Type|Description|
|---|---|---|
|**overall_qual**|*int64*|ordinal scale from 1-10, 1 being the poorest. Rates the overall material of the house| 
|**gr_liv_area**|*int64*|Above grade(ground) living area square feet|
|**garage_area**|*float64*|Size of garage in square feet|
|**1st_flr_sf**|*int64*|First floor square feet|
|**exter_qual_ord**|*int64*|ordinal scale of 4. Evaluates the quality of the material on the exterior|
|**kitchen_qual_ord**|*int64*|ordinal scale of 4. Kitchen quality|
|**bsmt_qual_ord**|*float64*|ordinal scale of 5. Evaluates the height of the basement|
|**garage_finish_ord**|*float64*|ordinal scale of 3. Interior finish of the garage|
|**ClearCr**|*uint8*|dummy variable of neighborhoods. Clear Creek |
|**CollgCr**|*uint8*|dummy variable of neighborhoods. College Creek |
|**Crawfor**|*uint8*|dummy variable of neighborhoods. Crawford |
|**Mitchel**|*uint8*|dummy variable of neighborhoods. Mitchell |
|**NAmes**|*uint8*|dummy variable of neighborhoods. North Ames |
|**NoRidge**|*uint8*|dummy variable of neighborhoods. Northridge |
|**NridgHt**|*uint8*|dummy variable of neighborhoods. Northridge Heights |
|**Sawyer**|*uint8*|dummy variable of neighborhoods. Sawyer |
|**Somerst**|*uint8*|dummy variable of neighborhoods. Somerset |
|**StoneBr**|*uint8*|dummy variable of neighborhoods. Stone Brook |
|**Timber**|*uint8*|dummy variable of neighborhoods. Timberland |
|**Veenker**|*uint8*|dummy variable of neighborhoods. Veenker|


# Conclusions & Next Steps:

The model can be used to predict housing prices with an error of roughly +/- 20%. This error rate was approximated from the root mean squared error (RMSE) obtained on the test data. 

The model can be used by property investors to evaluate the investment potential of a particular house, i.e. whether they are over- or underpaying. Property owners can use the model to price their houses competitively, or to add or remove certain features of their houses to optimize the sale price. 

Note to self for potential future improvements to the model: 
- could run chi squared test (for categorical variables) and remove one of each highly correlated pair
- check for multicollinearity via the variance inflation factor (VIF)
- use sklearn's recursive feature elimination to select for final features
- could have re-run the lasso and ridge regressions after removing the features with high t-test p-values 
- do more extensive EDA (e.g. identifying and removing the outliers before running the correlations, attempting to normalize the variable distributions) 
- could write a function that automatically removes values more than a specified number of standard deviations from the mean (passed in as an argument, e.g. more than 2.5 standard deviations, or whatever threshold is used to classify outliers). Also need to consider what impact removing these outliers would have on the final interpretation of the model's results, and keep a note of which outliers were removed. 
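
A minimal sketch of the outlier-removal function described in the last note is given below. The name `remove_outliers`, its signature and the toy values are hypothetical; it returns the removed rows as well, so there is a record of what was dropped.

```python
import pandas as pd

def remove_outliers(df, column, n_std=2.5):
    """Split df into rows within n_std standard deviations of the column mean and the rest.

    Returns (kept, removed) so the removed outliers can be inspected and documented.
    """
    mean = df[column].mean()
    std = df[column].std()
    within = (df[column] - mean).abs() <= n_std * std
    return df[within], df[~within]

# Toy example: eleven ordinary values and one extreme value.
toy = pd.DataFrame({"gr_liv_area": list(range(1000, 2100, 100)) + [10000]})
kept, removed = remove_outliers(toy, "gr_liv_area", n_std=2.5)
```

Note that the mean and standard deviation are themselves pulled towards extreme values, so in small samples a z-score threshold can miss outliers; a median-based rule would be an alternative worth considering.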
