<a href="https://colab.research.google.com/github/gptix/DS-Unit-2-Linear-Models/blob/master/module3-ridge-regression/ASSIGNMENT_Jud_Taylor_REVISED_assignment_regression_classification_3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Lambda School Data Science

*Unit 2, Sprint 1, Module 3*

---

# Ridge Regression

## Assignment

We're going back to our other **New York City** real estate dataset. Instead of predicting apartment rents, you'll predict property sales prices.

But not just for condos in Tribeca...

- [ ] Use a subset of the data where `BUILDING_CLASS_CATEGORY` == `'01 ONE FAMILY DWELLINGS'` and the sale price was more than 100 thousand and less than 2 million.
- [ ] Do train/test split. Use data from January — March 2019 to train. Use data from April 2019 to test.
- [ ] Do one-hot encoding of categorical features.
- [ ] Do feature selection with `SelectKBest`.
- [ ] Fit a ridge regression model with multiple features. Use the `normalize=True` parameter (or do [feature scaling](https://scikit-learn.org/stable/modules/preprocessing.html) beforehand — use the scaler's `fit_transform` method with the train set, and the scaler's `transform` method with the test set)
- [ ] Get mean absolute error for the test set.
- [ ] As always, commit your notebook to your fork of the GitHub repo.

The [NYC Department of Finance](https://www1.nyc.gov/site/finance/taxes/property-rolling-sales-data.page) has a glossary of property sales terms and NYC Building Class Code Descriptions. The data comes from the [NYC OpenData](https://data.cityofnewyork.us/browse?q=NYC%20calendar%20sales) portal.


## Stretch Goals

Don't worry, you aren't expected to do all these stretch goals! These are just ideas to consider and choose from.

- [ ] Add your own stretch goal(s)!
- [ ] Instead of `Ridge`, try `LinearRegression`. Depending on how many features you select, your errors will probably blow up! 💥
- [ ] Instead of `Ridge`, try [`RidgeCV`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.RidgeCV.html).
- [ ] Learn more about feature selection:
    - ["Permutation importance"](https://www.kaggle.com/dansbecker/permutation-importance)
    - [scikit-learn's User Guide for Feature Selection](https://scikit-learn.org/stable/modules/feature_selection.html)
    - [mlxtend](http://rasbt.github.io/mlxtend/) library
    - scikit-learn-contrib libraries: [boruta_py](https://github.com/scikit-learn-contrib/boruta_py) & [stability-selection](https://github.com/scikit-learn-contrib/stability-selection)
    - [_Feature Engineering and Selection_](http://www.feat.engineering/) by Kuhn & Johnson.
- [ ] Try [statsmodels](https://www.statsmodels.org/stable/index.html) if you’re interested in a more inferential statistical approach to linear regression and feature selection, looking at p-values and 95% confidence intervals for the coefficients.
- [ ] Read [_An Introduction to Statistical Learning_](http://faculty.marshall.usc.edu/gareth-james/ISL/ISLR%20Seventh%20Printing.pdf), Chapters 1-3, for more math & theory, but in an accessible, readable way.
- [ ] Try [scikit-learn pipelines](https://scikit-learn.org/stable/modules/compose.html).

# Assignment

## Set up for Import.

In [0]:
%%capture
import sys

# If you're on Colab:
if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Applied-Modeling/master/data/'
    !pip install category_encoders==2.*

# If you're working locally:
else:
    DATA_PATH = '../data/'
    
# Ignore this Numpy warning when using Plotly Express:
# FutureWarning: Method .ptp is deprecated and will be removed in a future version. Use numpy.ptp instead.
import warnings
warnings.filterwarnings(action='ignore', category=FutureWarning, module='numpy')

## Import useful code.

In [0]:
# Import useful code.

import pandas as pd
import pandas_profiling
import numpy as np
import matplotlib.pyplot as plt

from sklearn.linear_model import LinearRegression, Ridge, RidgeCV
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.feature_selection import SelectKBest, f_regression

import plotly.express as px

import category_encoders as ce

from IPython.display import display, HTML

## Import raw data.

In [0]:
# Read New York City property sales data
df_raw = pd.read_csv(DATA_PATH+'condos/NYC_Citywide_Rolling_Calendar_Sales.csv')

Make a working copy of the raw data.

In [39]:
df = df_raw.copy()
df.columns

Index(['BOROUGH', 'NEIGHBORHOOD', 'BUILDING CLASS CATEGORY',
       'TAX CLASS AT PRESENT', 'BLOCK', 'LOT', 'EASE-MENT',
       'BUILDING CLASS AT PRESENT', 'ADDRESS', 'APARTMENT NUMBER', 'ZIP CODE',
       'RESIDENTIAL UNITS', 'COMMERCIAL UNITS', 'TOTAL UNITS',
       'LAND SQUARE FEET', 'GROSS SQUARE FEET', 'YEAR BUILT',
       'TAX CLASS AT TIME OF SALE', 'BUILDING CLASS AT TIME OF SALE',
       'SALE PRICE', 'SALE DATE'],
      dtype='object')

## Clean data.

In [0]:
# Change column names: replace spaces with underscores
df.columns = [col.replace(' ', '_') for col in df]

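# SALE_PRICE was read as strings.
# Remove symbols, convert to integer.
# regex=False treats '$' as a literal character, not a regex end-of-string anchor.
df['SALE_PRICE'] = (
    df['SALE_PRICE']
    .str.replace('$', '', regex=False)
    .str.replace('-', '', regex=False)
    .str.replace(',', '', regex=False)
    .astype(int)
)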

# df.columns
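
A small sketch, on made-up price strings rather than the dataset itself, of why the `$` replacement deserves care. Read as a regular expression, `'$'` is an end-of-string anchor, so nothing visible is removed; passing `regex=False` to `str.replace` makes the intent explicit. The sample values are assumptions for illustration.

```python
import re

import pandas as pd

# Hypothetical price strings in the style of the raw SALE_PRICE column.
prices = pd.Series(['$550,000', '$1,250,000', '$0'])

# Literal replacement: strip the dollar sign and thousands separators.
cleaned = (
    prices
    .str.replace('$', '', regex=False)
    .str.replace(',', '', regex=False)
    .astype(int)
)
print(cleaned.tolist())

# As a regular expression, '$' matches only the empty position at the
# end of the string, so the dollar sign itself survives.
anchored = re.sub('$', '', '$550,000')
print(anchored)
```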

## Filter data.

In [41]:
# Use a subset of the data where BUILDING_CLASS_CATEGORY == 
# '01 ONE FAMILY DWELLINGS' and the sale price 
# was more than 100 thousand and less than 2 million.
# df_raw.shape
df = df[df['BUILDING_CLASS_CATEGORY']=='01 ONE FAMILY DWELLINGS']

df = df[(df['SALE_PRICE']<2e6) & (df['SALE_PRICE']>1e5)]

df.columns
# df.shape

Index(['BOROUGH', 'NEIGHBORHOOD', 'BUILDING_CLASS_CATEGORY',
       'TAX_CLASS_AT_PRESENT', 'BLOCK', 'LOT', 'EASE-MENT',
       'BUILDING_CLASS_AT_PRESENT', 'ADDRESS', 'APARTMENT_NUMBER', 'ZIP_CODE',
       'RESIDENTIAL_UNITS', 'COMMERCIAL_UNITS', 'TOTAL_UNITS',
       'LAND_SQUARE_FEET', 'GROSS_SQUARE_FEET', 'YEAR_BUILT',
       'TAX_CLASS_AT_TIME_OF_SALE', 'BUILDING_CLASS_AT_TIME_OF_SALE',
       'SALE_PRICE', 'SALE_DATE'],
      dtype='object')

## Engineer some more.

In [0]:
# BOROUGH is a numeric column, but arguably should be a categorical feature,
# so convert it from a number to a string
df['BOROUGH'] = df['BOROUGH'].astype(str)

In [43]:
print(df.shape)

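# np.percentile takes q on a 0-100 scale. Keep rows between the 0.5th and
# 99.5th percentiles of GROSS_SQUARE_FEET. The upper bound is an assumed
# intent: q=0.95 would be the 0.95th percentile, which drops nearly every row.
df = df[(df['GROSS_SQUARE_FEET'] >= np.percentile(df['GROSS_SQUARE_FEET'], 0.5)) &
        (df['GROSS_SQUARE_FEET'] <= np.percentile(df['GROSS_SQUARE_FEET'], 99.5))]

df.shape

(3151, 21)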
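
Percentile cut-offs are easy to get wrong because `np.percentile` takes `q` on a 0 to 100 scale, while `np.quantile` takes it on a 0 to 1 scale. A minimal sketch on synthetic values (a stand-in, not the sales data) that trims both tails:

```python
import numpy as np

# Synthetic stand-in for GROSS_SQUARE_FEET: the integers 1 through 1000.
values = np.arange(1, 1001)

# np.percentile expects q between 0 and 100.
p_low = np.percentile(values, 0.5)     # 0.5th percentile
p_high = np.percentile(values, 99.5)   # 99.5th percentile

# np.quantile expects q between 0 and 1, so these are the same cut points.
q_low = np.quantile(values, 0.005)
q_high = np.quantile(values, 0.995)

kept = values[(values >= p_low) & (values <= p_high)]
print(p_low, p_high, len(kept))

# q=0.95 passed to np.percentile is the 0.95th percentile, which sits
# near the bottom of the data rather than near the top.
near_bottom = np.percentile(values, 0.95)
print(near_bottom)
```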

In [0]:
# Reduce cardinality for NEIGHBORHOOD feature

# Get a list of the top 10 neighborhoods
top10 = df['NEIGHBORHOOD'].value_counts()[:10].index

# At locations where the neighborhood is NOT in the top 10, 
# replace the neighborhood with 'OTHER'
df.loc[~df['NEIGHBORHOOD'].isin(top10), 'NEIGHBORHOOD'] = 'OTHER'
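
The same cardinality-reduction pattern, sketched on a toy frame so the effect is easy to check by eye. The neighborhood labels `A` through `D` and the cut-off of two are made up for illustration.

```python
import pandas as pd

# Toy data: four neighborhoods with different frequencies.
toy = pd.DataFrame({'NEIGHBORHOOD': ['A', 'A', 'A', 'B', 'B', 'C', 'D']})

# value_counts sorts by frequency, so the first labels are the most common.
top2 = toy['NEIGHBORHOOD'].value_counts().index[:2]

# Everything outside the top two collapses into a single 'OTHER' level.
toy.loc[~toy['NEIGHBORHOOD'].isin(top2), 'NEIGHBORHOOD'] = 'OTHER'

counts = toy['NEIGHBORHOOD'].value_counts().to_dict()
print(counts)
```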

## Take a look at data.

In [0]:
# Review data.
# df
# df.describe()
df.describe().T
# df.head()
# df.tail()
# df.columns
# df.dtypes
# df.shape
# df.columns.isna()
# df.isna()

## Split for Train and Test


In [0]:
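# SALE_DATE was read as strings; convert to datetime so the comparisons
# below are chronological rather than character-by-character.
df['SALE_DATE'] = pd.to_datetime(df['SALE_DATE'])

# Use data from January through March 2019 to train.
df_train = df[(df['SALE_DATE'] >= '2019-01-01') & (df['SALE_DATE'] < '2019-04-01')]

# Use data from April 2019 to test.
df_test = df[(df['SALE_DATE'] >= '2019-04-01') & (df['SALE_DATE'] < '2019-05-01')]
df_train.shape, df_test.shape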
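
A hedged sketch of a date-based split on toy rows. It assumes the dates are `MM/DD/YYYY` strings, as the comparison strings in the split cell suggest; the prices and dates are invented. Comparing such strings directly is character-by-character, so a sale from another year can slip into the window. Parsing to datetime first avoids that.

```python
import pandas as pd

# Toy rows; the 2018 sale should land in neither split.
toy = pd.DataFrame({
    'SALE_DATE': ['01/15/2019', '03/31/2019', '04/01/2019', '04/30/2019', '02/15/2018'],
    'SALE_PRICE': [500000, 650000, 700000, 450000, 800000],
})

# As plain strings, the 2018 date passes both train-window checks.
text_check = ('02/15/2018' >= '01/01/2019', '02/15/2018' < '04/01/2019')
print(text_check)

# Parse once, then compare against real timestamps.
toy['SALE_DATE'] = pd.to_datetime(toy['SALE_DATE'], format='%m/%d/%Y')

train_start = pd.Timestamp('2019-01-01')
test_start = pd.Timestamp('2019-04-01')
test_end = pd.Timestamp('2019-05-01')

train = toy[(toy['SALE_DATE'] >= train_start) & (toy['SALE_DATE'] < test_start)]
test = toy[(toy['SALE_DATE'] >= test_start) & (toy['SALE_DATE'] < test_end)]
print(len(train), len(test))
```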

## Do one-hot encoding of categorical features.

In [0]:
# find categorical features.
df_train.describe(exclude='number').T.sort_values(by='unique')

In [0]:
target = 'SALE_PRICE'

# Columns with only one value add no analytical value.
only_one_value = ['BUILDING_CLASS_CATEGORY','APARTMENT_NUMBER']
high_cardinality = ['ADDRESS', 'LAND_SQUARE_FEET', 'SALE_DATE']

features = df_train.columns.drop([target] + high_cardinality + only_one_value)

df_train[features].describe().T

In [0]:
df_train.describe(include='number').T

In [0]:
df_train.columns[df_train.isna().any()].tolist()

## Drop columns that contain NaN from the feature list
These columns look unimportant.

In [0]:
# APARTMENT_NUMBER is already in only_one_value, so only EASE-MENT is added here.
features = df_train.columns.drop([target] +
                                 high_cardinality +
                                 only_one_value +
                                 ['EASE-MENT'])

features

df_train[features].describe().T
df_train.shape

## Split train and test sets into X matrix and y vector

In [0]:
X_train = df_train[features]
y_train = df_train[target]
X_test = df_test[features]
y_test = df_test[target]

## Apply one-hot encoding

In [0]:
encoder = ce.OneHotEncoder(use_cat_names=True)
X_train = encoder.fit_transform(X_train)

In [0]:
X_train

In [0]:
# X_train.columns

In [0]:
X_test = encoder.transform(X_test)
X_test
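
The fit-on-train, transform-on-test pattern matters because the test set can contain categories the training set never saw. The sketch below illustrates the idea with plain `pandas.get_dummies` plus `reindex`, a swapped-in technique rather than the `category_encoders` encoder used in this notebook. The borough values are made up.

```python
import pandas as pd

# Toy frames: borough '5' appears only in the test rows.
train = pd.DataFrame({'BOROUGH': ['1', '2', '3']})
test = pd.DataFrame({'BOROUGH': ['2', '5']})

# "Fit": the training columns define the encoded feature space.
train_enc = pd.get_dummies(train, columns=['BOROUGH'], dtype=int)

# "Transform": encode the test rows, then align them to the training columns.
# Unseen categories are dropped; missing ones are filled with 0.
test_enc = (
    pd.get_dummies(test, columns=['BOROUGH'], dtype=int)
    .reindex(columns=train_enc.columns, fill_value=0)
)
print(test_enc)
```

The encoded test frame ends up with exactly the training columns, which is what a model fitted on the training matrix expects.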

## Select K Best

In [0]:
# Select the 11 features that best correlate with the target.

# Instantiate the selector.
selector = SelectKBest(score_func=f_regression, k=11)

# IMPORTANT: call .fit_transform on the train set,
# and only .transform on the test set.
X_train_selected = selector.fit_transform(X_train, y_train)
X_test_selected = selector.transform(X_test)
X_train_selected.shape, X_test_selected.shape

In [0]:
# Identify features selected by SelectKBest
names = X_train.columns
selected_mask = selector.get_support()

print('Selected:')
for n in names[selected_mask]:
    print(n)
print()
print('NOT selected:')
for n in names[~selected_mask]:
    print(n)
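
To see what `SelectKBest` with `f_regression` does in isolation, here is a minimal sketch on synthetic data where the informative columns are known in advance. The data, seed, and coefficients are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic data: only columns 0 and 3 drive the target; the rest are noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + rng.normal(scale=0.1, size=200)

X_tr, X_te = X[:150], X[150:]
y_tr = y[:150]

# Fit the selector on the training rows only, then reuse it on the test rows.
selector = SelectKBest(score_func=f_regression, k=2)
X_tr_sel = selector.fit_transform(X_tr, y_tr)
X_te_sel = selector.transform(X_te)

selected = selector.get_support(indices=True)
print(selected, X_tr_sel.shape, X_te_sel.shape)
```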

In [0]:
# Try every possible feature count and compare test MAE.

for feature_count in range(1, len(X_train.columns)+1):
    print(f'{feature_count} features')

    selector = SelectKBest(score_func=f_regression, k=feature_count)
    X_train_selected = selector.fit_transform(X_train, y_train)
    X_test_selected = selector.transform(X_test)

    model = LinearRegression()
    model.fit(X_train_selected, y_train)
    y_pred = model.predict(X_test_selected)

    mae = mean_absolute_error(y_test, y_pred)
    print(f'Test MAE: ${mae:,.0f} \n')
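
Reading MAE values off a long printout is error-prone. One option, sketched here on synthetic data with the same loop structure, is to collect the scores in a `pandas.Series` and ask for the minimum. The data and variable names (`maes`, `results`, `best_k`) are illustrative assumptions. Note that picking `k` by test-set MAE, as this loop does, tends to give an optimistic error estimate.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

# Synthetic data: two informative columns out of six.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 6))
y = 5.0 * X[:, 0] + 2.0 * X[:, 1] + rng.normal(scale=0.5, size=300)
X_tr, X_te, y_tr, y_te = X[:200], X[200:], y[:200], y[200:]

maes = {}
for k in range(1, X.shape[1] + 1):
    selector = SelectKBest(score_func=f_regression, k=k)
    X_tr_sel = selector.fit_transform(X_tr, y_tr)
    X_te_sel = selector.transform(X_te)

    model = LinearRegression().fit(X_tr_sel, y_tr)
    maes[k] = mean_absolute_error(y_te, model.predict(X_te_sel))

# One row per feature count, indexed by k.
results = pd.Series(maes, name='test_mae')
best_k = results.idxmin()
print(results.round(3))
print('Lowest test MAE at k =', best_k)
```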

In [0]:
for alpha in [0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0]:
    
    # Fit Ridge Regression model
    display(HTML(f'Ridge Regression, with alpha={alpha}'))
    model = Ridge(alpha=alpha, normalize=True, random_state=17)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    # Get MAE for test
    mae = mean_absolute_error(y_test, y_pred)
    # display(HTML(f'Test Mean Absolute Error: ${mae:,.0f}'))
    print(f'Test Mean Absolute Error: ${mae:,.0f}')
    
    # Plot coefficients
    coefficients = pd.Series(model.coef_, X_train.columns)
    plt.figure(figsize=(16,8))
    coefficients.sort_values().plot.barh(color='grey')
    plt.xlim(-10000,10000)
    plt.show()
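
The assignment also allows scaling features beforehand instead of using `normalize=True`. That route is the only one on recent scikit-learn, where the `normalize` parameter was deprecated in 1.0 and removed in 1.2. A minimal sketch of it, assuming a `StandardScaler` and `Ridge` pipeline on synthetic data, follows. `StandardScaler` scales by standard deviation, while `normalize=True` scaled by the L2 norm, so `alpha` values are not directly comparable between the two.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (not the sales data): two features on very different scales.
rng = np.random.default_rng(2)
X = np.column_stack([rng.normal(0, 1, 300), rng.normal(0, 1000, 300)])
y = 4.0 * X[:, 0] + 0.003 * X[:, 1] + rng.normal(scale=0.5, size=300)
X_tr, X_te, y_tr, y_te = X[:200], X[200:], y[:200], y[200:]

coef_norms = {}
for alpha in [0.01, 1.0, 100.0]:
    # The pipeline fits the scaler on the training rows only, then reuses
    # those training statistics when transforming the test rows.
    model = make_pipeline(StandardScaler(), Ridge(alpha=alpha))
    model.fit(X_tr, y_tr)

    mae = mean_absolute_error(y_te, model.predict(X_te))
    coef_norms[alpha] = np.linalg.norm(model.named_steps['ridge'].coef_)
    print(f'alpha={alpha}: test MAE {mae:.3f}, coefficient norm {coef_norms[alpha]:.3f}')
```

Larger `alpha` values shrink the coefficient vector, which is the same effect the coefficient plots above are meant to show.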