## Regularized Regression Project

Build a linear regression model that predicts the `price` column in the San Francisco apartment rentals dataset. Make sure to go through all the relevant steps of the modelling workflow.

1. Use the model you built for the prior project as the basis for comparison.
2. Engineer (or un-engineer previously engineered) features as needed.
3. Fit a Lasso, Ridge, and Elastic Net regression using the features in your original model.
4. Once you are ready, fit your final model and report a final performance estimate by scoring on the test data. Report both test R-squared and MAE.
5. What happens to your error if you only model apartments with `price` <= 6000? Should we do this?

Advice:

1. Remember, regularization doesn't always help!
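
To make the comparison against an unregularized baseline concrete, here is a minimal sketch on synthetic data. The arrays `X_demo` and `y_demo` are hypothetical stand-ins, not the rentals data, and the pipeline layout is an assumption rather than the approach used later in this notebook. When there are few features and plenty of rows, the two cross-validated scores tend to be nearly identical, which is one way regularization "doesn't help".

```python
import numpy as np
from sklearn.linear_model import LinearRegression, RidgeCV
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical synthetic data: 5 features, one of them irrelevant
rng = np.random.default_rng(2023)
X_demo = rng.normal(size=(300, 5))
true_coefs = np.array([3.0, -2.0, 1.5, 0.0, 0.5])
y_demo = X_demo @ true_coefs + rng.normal(scale=0.5, size=300)

# Unregularized baseline and a Ridge model tuned over a log-spaced alpha grid
baseline = make_pipeline(StandardScaler(), LinearRegression())
ridge = make_pipeline(StandardScaler(), RidgeCV(alphas=10 ** np.linspace(-3, 3, 50)))

# Score both with the same 5-fold cross-validation
baseline_r2 = cross_val_score(baseline, X_demo, y_demo, cv=5, scoring="r2").mean()
ridge_r2 = cross_val_score(ridge, X_demo, y_demo, cv=5, scoring="r2").mean()

print(f"Baseline CV R2: {baseline_r2:.4f}")
print(f"Ridge CV R2:    {ridge_r2:.4f}")
```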

In [0]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import statsmodels.api as sm
from sklearn.model_selection import KFold
from sklearn.metrics import r2_score as r2, mean_absolute_error as mae, mean_squared_error as mse

rentals_df = pd.read_csv("../Data/sf_clean.csv")

rentals_df.head()

### Data Dictionary

1. price: The price of the rental and our target variable
2. sqft: The area in square feet of the rental
3. beds: The number of bedrooms in the rental
4. bath: The number of bathrooms in the rental
5. laundry: Does the rental have a laundry machine inside the house, a shared laundry machine, or no laundry on site?
6. pets: Does the rental allow pets? Cats only, dogs only or both cats and dogs?
7. housing_type: Is the rental in a multi-unit building, a building with two units, or a stand-alone house?
8. parking: Does the apartment offer a parking space? No, protected in a garage, off-street in a parking lot, or valet service?
9. hood_district: In which part of San Francisco is the apartment located?

![image info](SFAR_map.png)

In [0]:
rentals_df.info()

## EDA

1. Based on the range of prices below, we may need to subset our data so that we model more "realistic" apartments, possibly by capping square footage.

2. The 'hood_district' feature was read in as an integer but is really a categorical feature. Let's fix that.


In [0]:
rentals_df["hood_district"] = rentals_df["hood_district"].astype("object") 

In [0]:
rentals_df.describe()

There are some very rare, expensive apartments that cost over 10k.

In [0]:
sns.histplot(rentals_df, x="price")

Most of our numeric features are positively correlated with each other, which could cause multicollinearity problems.

It's good to see that we have some strong correlations with our target here though.

In [0]:
# sns.heatmap(
#     rentals_df.corr(numeric_only=True), 
#     vmin=-1, 
#     vmax=1, 
#     cmap="coolwarm",
#     annot=True
# );

Based on the pairplot below, we may be able to slice off the most expensive apartments by subsetting to only apartments under 2500 sqft.

In [0]:
# sns.pairplot(rentals_df, corner=True)

Moving on to our categorical features, we have some rare categories that may need to be binned together.

We should consider:

1. Pets: Bin 'dogs' and 'both' into an 'allows_dogs' category.
2. Housing_type: Group 'multi' and 'double' together
3. Parking: Group 'protected', 'off-street', and 'valet' together
4. Hood district: Bin some of our lower-count neighborhoods with neighboring ones. Let's look at the average price for each and see which are similar.
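
For the neighborhood binning in particular, a count-and-mean table per district makes the candidates easy to spot. This is a minimal sketch on a hypothetical toy frame (the numbers are invented); on the real data the same `groupby` would run on `rentals_df`.

```python
import pandas as pd

# Hypothetical toy data, not the real rentals
toy = pd.DataFrame({
    "hood_district": [1, 1, 1, 2, 2, 3, 9, 9, 9, 9],
    "price": [3000, 3200, 3100, 2800, 2900, 2850, 4500, 4700, 4600, 4400],
})

# Row count and mean price per district, cheapest first
summary = (
    toy.groupby("hood_district")["price"]
    .agg(["count", "mean"])
    .sort_values("mean")
)
print(summary)
```

Districts with a low `count` and a `mean` close to a neighbor's are the natural ones to merge.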

In [0]:
# # Let's check the frequency of our categorical features

# def value_counter(dataframe):
#     for col in dataframe.select_dtypes(["object"]).columns:
#         print(dataframe[col].value_counts())


# value_counter(rentals_df)

In [0]:
# def cat_plotter(data, target):
#     for col in data.select_dtypes(["object"]).columns:
#         sns.barplot(data=data, x=col, y=target)
#         plt.xticks(rotation=45)
#         plt.show()
        
# cat_plotter(rentals_df, "price")

# Feature Engineering

1. Group Categories together
2. Try squared and cubed terms for beds, sqft, and bath, plus a beds-to-bath ratio

In [0]:
laundry_map = {
    "(a) in-unit": "in_unit",
    "(b) on-site": "not_in_unit",
    "(c) no laundry": "not_in_unit",
}

pet_map = {
    "(a) both": "allows_dogs",
    "(b) dogs": "allows_dogs",
    "(c) cats": "no_dogs",
    "(d) no pets": "no_dogs"
}


housing_type_map = {
    "(a) single": "single",
    "(b) double": "multi",
    "(c) multi": "multi",
}

district_map = {
    1.0: "west",
    2.0: "southwest",
    3.0: "southwest",
    4.0: "central",
    5.0: "central",
    6.0: "central",
    7.0: "marina",
    8.0: "north beach",
    9.0: "FiDi/SOMA",
    10.0: "southwest"
    
}

In [0]:
eng_df = rentals_df.assign(
#     hood_district = rentals_df["hood_district"].map(district_map),
#     housing_type = rentals_df["housing_type"].map(housing_type_map),
#     pets = rentals_df["pets"].map(pet_map),
#     laundry = rentals_df["laundry"].map(laundry_map),
    sqft2 = rentals_df["sqft"] ** 2,
    sqft3 = rentals_df["sqft"] ** 3,
    beds2 = rentals_df["beds"] ** 2,
    beds3 = rentals_df["beds"] ** 3,
    bath2 = rentals_df["bath"] ** 2,
    bath3 = rentals_df["bath"] ** 3,
    beds_bath_ratio = rentals_df["beds"] / rentals_df["bath"]
)

eng_df = pd.get_dummies(eng_df, drop_first=True)

In [0]:
eng_df.head()

In [0]:
from sklearn.model_selection import train_test_split

target = "price"
drop_cols = [
#     "pets_no_dogs",
#     "housing_type_single"
]

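# Note: StandardScaler (below) turns the constant column into all zeros, and
# scikit-learn's linear models fit their own intercept, so `const` only matters
# if this design matrix is also passed to statsmodels.
X = sm.add_constant(eng_df.drop([target] + drop_cols, axis=1))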

# Log transform slightly improves normality
y = np.log(eng_df[target])
# y = eng_df[target]

# Test Split
X, X_test, y, y_test = train_test_split(X, y, test_size=.2, random_state=2023)

# Scaling Data

In [0]:
from sklearn.preprocessing import StandardScaler

std = StandardScaler()
X_tr = std.fit_transform(X.values)
X_te = std.transform(X_test.values)

# Ridge

In [0]:
from sklearn.linear_model import RidgeCV

n_alphas = 100
alphas = 10 ** np.linspace(-3, 3, n_alphas)

ridge_model = RidgeCV(alphas=alphas, cv=5)

ridge_model.fit(X_tr, y)
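# score() and predict() are evaluated on the training rows here, so these are
# in-sample figures. R2 is on the log scale; MAE is converted back to price units.
print(f"Train R2: {ridge_model.score(X_tr, y)}")
print(f"Train MAE: {mae(np.exp(y), np.exp(ridge_model.predict(X_tr)))}")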
print(f"Alpha: {ridge_model.alpha_}")
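
The scores printed by the cell above are computed on the same rows the model was fit on. To get a genuinely cross-validated estimate, one option is to collect out-of-fold predictions. The sketch below is a minimal illustration on hypothetical synthetic data (`X_demo`, `log_y_demo`). It also wraps the scaler in a pipeline so that it is refit inside each fold, which is an assumption about how one might structure this, not what the notebook does.

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import KFold, cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical synthetic data with a log-scale target, like log(price)
rng = np.random.default_rng(2023)
X_demo = rng.normal(size=(400, 6))
coefs = np.array([0.3, 0.2, 0.1, 0.0, 0.0, 0.05])
log_y_demo = 8 + X_demo @ coefs + rng.normal(scale=0.1, size=400)

# Scaler and model live in one pipeline, so scaling is refit on each training fold
pipe = make_pipeline(StandardScaler(), RidgeCV(alphas=10 ** np.linspace(-3, 3, 50)))
folds = KFold(n_splits=5, shuffle=True, random_state=2023)

# Each row is predicted by a model that never saw it during fitting
oof_pred = cross_val_predict(pipe, X_demo, log_y_demo, cv=folds)

cv_r2 = r2_score(log_y_demo, oof_pred)                                # log scale
cv_mae = mean_absolute_error(np.exp(log_y_demo), np.exp(oof_pred))    # original units
train_r2 = pipe.fit(X_demo, log_y_demo).score(X_demo, log_y_demo)     # in-sample

print(f"Cross Val R2: {cv_r2:.4f}")
print(f"Cross Val MAE: {cv_mae:.2f}")
print(f"Train R2: {train_r2:.4f}")
```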

In [0]:
list(zip(X.columns, ridge_model.coef_))

# Lasso

In [0]:
from sklearn.linear_model import LassoCV

n_alphas = 200
alphas = 10 ** np.linspace(-2, 3, n_alphas)

lasso_model = LassoCV(alphas=alphas, cv=5)

lasso_model.fit(X_tr, y)

# In-sample figures: both are computed on the training rows.
print(f"Train R2: {lasso_model.score(X_tr, y)}")
print(f"Train MAE: {mae(np.exp(y), np.exp(lasso_model.predict(X_tr)))}")
print(f"Alpha: {lasso_model.alpha_}")

In [0]:
list(zip(X.columns, lasso_model.coef_))
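
With a long coefficient list it can be easier to split the features into those Lasso kept and those it shrank to exactly zero. A minimal sketch, using hypothetical synthetic data and made-up feature names `f0` to `f5`:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

# Hypothetical synthetic data: only f0 and f1 actually drive the target
rng = np.random.default_rng(1)
names = ["f0", "f1", "f2", "f3", "f4", "f5"]
X_demo = rng.normal(size=(300, 6))
y_demo = 2.0 * X_demo[:, 0] - 1.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=300)

# Fit Lasso on standardized features, letting LassoCV choose its own alpha grid
lasso = LassoCV(cv=5, random_state=1).fit(StandardScaler().fit_transform(X_demo), y_demo)

# Label the coefficients, then separate non-zero from exactly-zero ones
coefs = pd.Series(lasso.coef_, index=names)
kept = coefs[coefs != 0].sort_values(key=np.abs, ascending=False)
dropped = coefs[coefs == 0].index.tolist()

print("Kept:\n", kept)
print("Dropped:", dropped)
```

In this notebook the same pattern would apply to `pd.Series(lasso_model.coef_, index=X.columns)`.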

In [0]:
# print(mae(np.exp(y_test), np.exp(lasso_model.predict(X_te))))
# print(f"Test R2: {r2(y_test, lasso_model.predict(X_te))}")

# Elastic Net

In [0]:
from sklearn.linear_model import ElasticNetCV

alphas = 10 ** np.linspace(-2, 3, 200)
l1_ratios = np.linspace(.01, 1, 100)

enet_model = ElasticNetCV(alphas=alphas, l1_ratio=l1_ratios, cv=5)

enet_model.fit(X_tr, y)

# In-sample figures: both are computed on the training rows.
print(f"Train R2: {enet_model.score(X_tr, y)}")
print(f"Train MAE: {mae(np.exp(y), np.exp(enet_model.predict(X_tr)))}")
print(f"Alpha: {enet_model.alpha_}")
print(f"L1_Ratio: {enet_model.l1_ratio_}")

# Final Model Test

In [0]:
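# MAE is reported in price units (predictions are exponentiated back from the log scale);
# R2 is computed on the log scale the model was trained on.
print(f"Test MAE: {mae(np.exp(y_test), np.exp(ridge_model.predict(X_te)))}")
print(f"Test R2: {r2(y_test, ridge_model.predict(X_te))}")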
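
Step 5 of the brief asks what happens to the error if we only model apartments with `price` <= 6000. A minimal sketch of that comparison follows, on hypothetical synthetic data with a long right tail. The frame `toy`, the helper `holdout_mae`, and the single-feature model are all assumptions made for illustration, not results from the rentals data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import RidgeCV
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical synthetic rentals: price noise grows with size, giving a long right tail
rng = np.random.default_rng(2023)
n = 1000
sqft = rng.lognormal(mean=6.8, sigma=0.4, size=n)
price = 1500 + 2.0 * sqft * rng.lognormal(mean=0.0, sigma=0.25, size=n)
toy = pd.DataFrame({"sqft": sqft, "price": price})


def holdout_mae(df):
    """Fit Ridge on log(price) and return the hold-out MAE in price units."""
    X = df[["sqft"]]
    y = np.log(df["price"])
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=2023
    )
    model = make_pipeline(StandardScaler(), RidgeCV(alphas=10 ** np.linspace(-3, 3, 50)))
    model.fit(X_train, y_train)
    return mean_absolute_error(np.exp(y_test), np.exp(model.predict(X_test)))


capped = toy[toy["price"] <= 6000]
mae_all = holdout_mae(toy)
mae_capped = holdout_mae(capped)

print(f"Rows kept after cap: {len(capped)} of {len(toy)}")
print(f"Test MAE, all rentals:     {mae_all:.0f}")
print(f"Test MAE, price <= 6000:   {mae_capped:.0f}")
```

On data with a long right tail, the capped MAE is typically lower, but part of that drop comes from removing the hardest cases from the test set rather than from a better model. Filtering on the target also assumes we know the price before predicting it, so a cap on a feature such as `sqft` is usually easier to justify.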