## EDA and Preprocessing
For the House Price dataset, preprocessing was fairly trivial. The dataset comes without outliers or null values, meaning nothing needs to be done here.

After running forward, backward, and stepwise selection, the features ranked in the following order of importance:
1. Square footage
2. Lot size
3. Number of bedrooms
4. Number of bathrooms
5. Year built
6. Garage Size
7. Neighborhood Quality

While every feature increased the R^2 value, Neighborhood Quality added only 0.00004, and stepwise selection left it out of its final list, so we drop this column. After rerunning the feature selection methods on the reduced dataset, stepwise selection also excluded Garage Size, which increased R^2 by only 0.0002. For now, though, we keep this feature. On a more complex dataset, dropping it could improve speed and reduce possible noise.
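
The selection logic described above can be sketched as a greedy forward pass that adds whichever feature raises R^2 the most and stops once the gain falls below a threshold. This is only an illustration: the synthetic data, the column names, and the `min_gain` cutoff are assumptions, and the selection criterion actually used for this report may differ (for example adjusted R^2 or AIC).

```python
import numpy as np

def r_squared(X, y):
    """R^2 of an OLS fit (with intercept) of y on the columns of X."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)

def forward_select(X, y, names, min_gain=1e-3):
    """Greedy forward selection; min_gain is a hypothetical stopping threshold."""
    chosen, remaining = [], list(range(X.shape[1]))
    best_r2, history = 0.0, []
    while remaining:
        # Score every candidate when added to the features chosen so far.
        r2, j = max((r_squared(X[:, chosen + [j]], y), j) for j in remaining)
        gain = r2 - best_r2
        if gain < min_gain:
            break  # the best remaining feature adds too little
        chosen.append(j)
        remaining.remove(j)
        history.append((names[j], float(gain)))
        best_r2 = r2
    return history

# Synthetic stand-in data (not the House Price dataset).
rng = np.random.default_rng(0)
n = 500
sqft = rng.uniform(500, 4000, n)
lot = rng.uniform(0.5, 5.0, n)
junk = rng.normal(size=n)  # unrelated to price by construction
price = 200 * sqft + 14000 * lot + rng.normal(0, 20000, n)

X = np.column_stack([sqft, lot, junk])
history = forward_select(X, price, ["sqft", "lot", "junk"])
for name, gain in history:
    print(f"{name}: R^2 gain {gain:.5f}")
```

A feature that adds almost nothing, like Neighborhood Quality here, is the kind of column such a threshold leaves out.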

## Linear Regression

In Linear Regression, we consistently see R^2 values of ~0.991 (for 5-fold cross-validation this is the average across folds), with some variation in the fourth decimal place. 5-fold cross-validation gives us the best idea of what to expect from our dataset here. The least accurate fold has an R^2 value of 0.9710, suggesting that this fold was the most difficult to predict. The highest fold was 0.9931, the best score of any model. Given these two values, we could expect our model to perform somewhere in this range on new data. 5-fold cross-validation is the most robust and realistic way to judge how good our model is. In-sample evaluation is prone to rewarding overfitting, and therefore usually isn't a great choice. Note that the following regressions were also run with both in-sample and 80/20 train-test split (TTS) evaluation, but since the evaluation method affects their quality-of-fit (QoF) measures in the same way as for Linear Regression, those results are not discussed. Below is a table of the results and some metrics:

| Metric | In-Sample (Full) | Train-Test Split (80/20) | 5-Fold Cross-Validation (Avg) |
| :--- | :--- | :--- | :--- |
| **$R^2$** | 0.991850 | 0.991850 | 0.991712 |
| **Adj. $R^2$** | 0.991809 | 0.991809 | 0.991678 |
| **RMSE** | 22880.23 | 22880.23 | 22889.37 |
| **MAPE** | 3.968% | 3.968% | ~4.00% |
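
For reference, the four metrics in the table and the fold-by-fold evaluation can be sketched in plain Python. This is a minimal single-feature illustration on made-up data. The helper names are hypothetical, RMSE is assumed to divide by n (some tools divide by the degrees of freedom), and the folds are assumed to be interleaved rather than shuffled.

```python
import math

def fit_line(xs, ys):
    """Least-squares intercept and slope for a single feature."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope

def metrics(y_true, y_pred, n_features):
    """R^2, adjusted R^2, RMSE, and MAPE (in percent)."""
    n = len(y_true)
    mean_y = sum(y_true) / n
    sse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    sst = sum((t - mean_y) ** 2 for t in y_true)
    r2 = 1 - sse / sst
    return {
        "r2": r2,
        "adj_r2": 1 - (1 - r2) * (n - 1) / (n - n_features - 1),
        "rmse": math.sqrt(sse / n),
        "mape": 100 * sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / n,
    }

def k_fold_r2(xs, ys, k=5):
    """Out-of-fold R^2 for each of k interleaved folds."""
    scores = []
    for fold in range(k):
        test_idx = set(range(fold, len(xs), k))  # every k-th row is held out
        train_x = [x for i, x in enumerate(xs) if i not in test_idx]
        train_y = [y for i, y in enumerate(ys) if i not in test_idx]
        test_x = [x for i, x in enumerate(xs) if i in test_idx]
        test_y = [y for i, y in enumerate(ys) if i in test_idx]
        b0, b1 = fit_line(train_x, train_y)
        preds = [b0 + b1 * x for x in test_x]
        scores.append(metrics(test_y, preds, 1)["r2"])
    return scores

# Made-up data: price is linear in square footage plus a deterministic wiggle.
xs = [500 + 35 * i for i in range(100)]
ys = [200 * x + 5000 * math.sin(i) for i, x in enumerate(xs)]

scores = k_fold_r2(xs, ys)
print("per-fold R^2:", [round(s, 4) for s in scores])
print("average R^2:", round(sum(scores) / len(scores), 4))
```

Reporting the minimum and maximum fold alongside the average, as done above for Linear Regression, shows how much the score depends on which rows are held out.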

## Ridge Regression
In Ridge Regression, we don't see much change. This is expected, as there is little multicollinearity in the dataset, which we suspected given the "coolness" (weak pairwise correlations) of the correlation heatmap. Lambda is also relatively small by default, so Ridge Regression should not have had a large impact on the coefficients anyway. With an R^2 of 0.991801, the results are more or less identical to those of Linear Regression.

| Variable | OLS (Previous Run) | Ridge (λ=0.1) | Change |
| :--- | :--- | :--- | :--- |
| **Square_Footage** | 199.0866 | 199.0867 | Negligible |
| **Num_Bedrooms** | 9650.4529 | 9650.0250 | -0.4279 |
| **Num_Bathrooms** | 7167.8387 | 7166.8555 | -0.9832 |
| **Year_Built** | -13.5254 | -13.5224 | +0.0030 |
| **Lot_Size** | 13676.0093 | 13675.2164 | -0.7929 |
| **Garage_Size** | 4344.7273 | 4344.1877 | -0.5396 |
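
The near-zero changes follow from the form of the ridge estimator. As a sketch, assuming the standard formulation (whether the software used here centers the data or leaves the intercept unpenalized is not stated in this report):

$$
\hat{\beta}_{\text{OLS}} = (X^\top X)^{-1} X^\top y,
\qquad
\hat{\beta}_{\text{ridge}} = (X^\top X + \lambda I)^{-1} X^\top y
$$

Writing $d_j$ for the singular values of $X$, ridge shrinks the fit along the $j$-th principal direction by the factor

$$
\frac{d_j^2}{d_j^2 + \lambda}
$$

so whenever $d_j^2 \gg \lambda$, as is plausible with $\lambda = 0.1$ and low multicollinearity, every factor is close to 1 and the ridge coefficients stay close to the OLS ones.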

## Lasso Regression

In Lasso Regression, we saw a couple of interesting things that both confirmed and contradicted what stepwise selection showed us. Lasso eliminated Year Built, which was one of the weaker features (ranked 5th originally). However, it did not eliminate Garage Size, which we might have expected given that stepwise selection suggested removing it. The other coefficients stayed essentially unchanged, showing they are stable and/or strong. See the table below for more detail. R^2 drops slightly to 0.9806: Lasso trades some training accuracy for a sparser model that is intended to generalize better.

| Variable | OLS/Ridge Estimate | Lasso Estimate | Status |
| :--- | :--- | :--- | :--- |
| **Square_Footage** | 199.086 | 199.087 | Strong |
| **Num_Bedrooms** | 9650.45 | 9650.45 | Strong |
| **Num_Bathrooms** | 7167.83 | 7167.84 | Stable |
| **Year_Built** | -13.522 | 0.000 | Dropped |
| **Lot_Size** | 13676.01 | 13676.01 | Strong |
| **Garage_Size** | 4344.72 | 4344.73 | Weak/Stable |
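
Unlike Ridge, Lasso can set a coefficient to exactly zero. The mechanism is the soft-thresholding operator, which in the idealized case of an orthonormal design maps each OLS coefficient directly to its Lasso value. The sketch below shows only that operator on made-up numbers. It is not the solver used for this report, and the function names and threshold are assumptions.

```python
def soft_threshold(z, t):
    """Shrink z toward zero by t, returning exactly 0.0 when |z| <= t."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def lasso_orthonormal(ols_coefs, lam):
    """Lasso solution for an orthonormal design: soft-threshold each OLS coefficient."""
    return [soft_threshold(b, lam) for b in ols_coefs]

# Hypothetical standardized coefficients: two strong, one weak.
ols = [2.0, -2.0, 0.3]
shrunk = lasso_orthonormal(ols, lam=0.5)
print(shrunk)  # the weak coefficient becomes exactly 0.0
```

Coefficients whose magnitude is below the threshold are removed outright, which is how a weak feature such as Year Built can drop out while stronger ones survive.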

## Transformed Regression

Because house price has a strong linear relationship with the features we are using, we do not expect transformed regressions to perform very well (and they don't).

### Sqrt Regression
Sqrt Regression gave an R^2 of 0.9828. This is a decent result in isolation, but worse than our baseline, untransformed regression models. This model predicts sqrt(house_price) as a linear function of the features, which implies that price itself grows quadratically. The lower R^2 shows that the way price responds to our features is better represented linearly than quadratically.
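
A small sketch illustrates why a square-root model loses accuracy when the true relationship is linear. It fits one model on the raw price and one on sqrt(price), squares the latter's predictions back to the price scale, and compares R^2. The data are made up and exactly linear, the helper names are hypothetical, and scoring on back-transformed predictions is an assumption (the RMSE values reported in price units below suggest it, but the report does not say so explicitly).

```python
import math

def fit_line(xs, ys):
    """Least-squares intercept and slope for a single feature."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope

def r2(y_true, y_pred):
    """Coefficient of determination on the original price scale."""
    mean_y = sum(y_true) / len(y_true)
    sse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    sst = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - sse / sst

# Made-up data in which price is exactly linear in square footage.
sqft = [500 + 50 * i for i in range(60)]
price = [50_000 + 200 * x for x in sqft]

# Baseline: regress price on sqft directly.
b0, b1 = fit_line(sqft, price)
r2_linear = r2(price, [b0 + b1 * x for x in sqft])

# Transformed: regress sqrt(price) on sqft, then square predictions back.
c0, c1 = fit_line(sqft, [math.sqrt(p) for p in price])
r2_sqrt = r2(price, [(c0 + c1 * x) ** 2 for x in sqft])

print(f"linear R^2: {r2_linear:.4f}")
print(f"sqrt   R^2: {r2_sqrt:.4f}")
```

The transformed model forces a curved fit onto straight-line data, so its score on the price scale can only be lower.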

### Log1p Regression
Log1p Regression gave an R^2 of 0.8494, by far the worst we have seen so far. Log transformations are typically used when data spans multiple orders of magnitude or has a heavy right skew. This is not the case for our data, which is relatively evenly distributed. The drop in accuracy here suggests that the relationship between our features and price is not multiplicative, but linear.

### Box-Cox Regression
Box-Cox Regression gave an R^2 of 0.9828. Box-Cox effectively searches over a family of power transformations to make the data as close to a Gaussian/Normal distribution as possible. Because its results were close to those of Sqrt, it likely found an optimal lambda close to 0.5. An important distinction is that Box-Cox does not optimize for R^2, but for the shape of the distribution. This is why we see a lower R^2 than when we use a linear relationship between features and price.

### Yeo-Johnson
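Yeo-Johnson serves the same purpose as Box-Cox, but can also handle zero and negative values. For non-negative values it is equivalent to applying Box-Cox to y + 1. Because our prices are all positive and large, that shift is negligible, so Yeo-Johnson behaves the same as Box-Cox and gives the same result.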
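
The two transforms can be written out directly to show how they relate. This is a sketch of the textbook definitions for a single value with a fixed lambda, not the fitting procedure used for this report (which also has to estimate lambda). The sample price is made up.

```python
import math

def box_cox(y, lam):
    """Box-Cox transform of a strictly positive value."""
    if y <= 0:
        raise ValueError("Box-Cox requires y > 0")
    if lam == 0:
        return math.log(y)
    return (y ** lam - 1) / lam

def yeo_johnson(y, lam):
    """Yeo-Johnson transform, defined for any real value."""
    if y >= 0:
        if lam == 0:
            return math.log1p(y)
        return ((y + 1) ** lam - 1) / lam
    if lam == 2:
        return -math.log1p(-y)
    return -(((-y + 1) ** (2 - lam)) - 1) / (2 - lam)

price = 250_000.0  # made-up house price

# With lambda = 0.5, Box-Cox is a shifted and scaled square root.
print(box_cox(price, 0.5), 2 * (math.sqrt(price) - 1))

# For non-negative input, Yeo-Johnson is Box-Cox applied to y + 1.
print(yeo_johnson(price, 0.5), box_cox(price + 1, 0.5))
```

At house-price magnitudes the extra 1 changes the transformed value only in roughly the sixth significant digit, which is consistent with the two methods producing matching metrics.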

### Table of results:
| Model | R2 | RMSE | MAPE | Verdict |
| :--- | :--- | :--- | :--- | :--- |
| **Baseline (OLS)** | **0.9918** | **22,880** | **3.97%** | **Best** |
| Sqrt Transform | 0.9828 | 33,242 | 5.69% | Worse |
| Box-Cox | 0.9828 | 33,249 | 5.69% | Worse |
| Yeo-Johnson | 0.9828 | 33,249 | 5.69% | Worse |
| Log1p | 0.8494 | 98,357 | 11.51% | Poor |

## SymRidge Regression

SymRidge Regression creates new inputs by transforming the existing ones, in this case with quadratic terms. A big find here is the change in the Year Built coefficient, which went from -1029.34 to +0.51. This suggests that house price follows a "U" shape in Year Built: very old and very new houses sell for more, while houses built toward the middle of the range sell for less. Capturing this raised R^2 to 0.9985, an increase of 0.0067 (0.67 percentage points) over the baseline model's 0.9918. SymRidge Regression is therefore the best choice for this dataset, since it explains the previously missing relationship between Year Built and Price. We can ignore the increase in VIF, which is expected because squared feature values are naturally correlated with the base feature values.
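
To see why a squared term can recover a "U"-shaped effect that a purely linear term misses, consider the following minimal sketch. It uses a single made-up feature centered on the middle of its range and plain least squares. The real SymRidge model expands all features and applies a ridge penalty, so this only illustrates the idea.

```python
def fit_line(xs, ys):
    """Least-squares intercept and slope for a single feature."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope

def r2(y_true, y_pred):
    """Coefficient of determination."""
    mean_y = sum(y_true) / len(y_true)
    sse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    sst = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - sse / sst

# Hypothetical data: decades from the midpoint build year, price in $1000s.
offsets = list(range(-5, 6))
price = [300 + 4 * d * d for d in offsets]  # cheapest in the middle, a "U" shape

# A linear term alone sees no trend in a symmetric "U".
b0, b1 = fit_line(offsets, price)
r2_linear = r2(price, [b0 + b1 * d for d in offsets])

# Adding the squared feature captures the shape.
squares = [d * d for d in offsets]
q0, q1 = fit_line(squares, price)
r2_quadratic = r2(price, [q0 + q1 * s for s in squares])

print(f"linear R^2:    {r2_linear:.4f}")
print(f"quadratic R^2: {r2_quadratic:.4f}")
```

The squared column is strongly correlated with the original one whenever the feature is not centered, which is the source of the VIF increase noted above.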