<a href="https://colab.research.google.com/github/npgeorge/DS-Unit-2-Linear-Models/blob/master/Nicholas_George_Assignment_3%2C_Regression_Classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Lambda School Data Science

*Unit 2, Sprint 1, Module 3*

---

# Ridge Regression

## Assignment

We're going back to our other **New York City** real estate dataset. Instead of predicting apartment rents, you'll predict property sales prices.

But not just for condos in Tribeca...

Instead, predict property sales prices for **One Family Dwellings** (`BUILDING_CLASS_CATEGORY` == `'01 ONE FAMILY DWELLINGS'`). 

Use a subset of the data where the **sale price was more than \$100 thousand and less than \$2 million.**

The [NYC Department of Finance](https://www1.nyc.gov/site/finance/taxes/property-rolling-sales-data.page) has a glossary of property sales terms and NYC Building Class Code Descriptions. The data comes from the [NYC OpenData](https://data.cityofnewyork.us/browse?q=NYC%20calendar%20sales) portal.

- [ ] Do train/test split. Use data from January — March 2019 to train. Use data from April 2019 to test.
- [ ] Do one-hot encoding of categorical features.
- [ ] Do feature selection with `SelectKBest`.
- [ ] Do [feature scaling](https://scikit-learn.org/stable/modules/preprocessing.html).
- [ ] Fit a ridge regression model with multiple features.
- [ ] Get mean absolute error for the test set.
- [ ] As always, commit your notebook to your fork of the GitHub repo.


## Stretch Goals
- [ ] Add your own stretch goal(s) !
- [ ] Instead of `Ridge`, try `LinearRegression`. Depending on how many features you select, your errors will probably blow up! 💥
- [ ] Instead of `Ridge`, try [`RidgeCV`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.RidgeCV.html).
- [ ] Learn more about feature selection:
    - ["Permutation importance"](https://www.kaggle.com/dansbecker/permutation-importance)
    - [scikit-learn's User Guide for Feature Selection](https://scikit-learn.org/stable/modules/feature_selection.html)
    - [mlxtend](http://rasbt.github.io/mlxtend/) library
    - scikit-learn-contrib libraries: [boruta_py](https://github.com/scikit-learn-contrib/boruta_py) & [stability-selection](https://github.com/scikit-learn-contrib/stability-selection)
    - [_Feature Engineering and Selection_](http://www.feat.engineering/) by Kuhn & Johnson.
- [ ] Try [statsmodels](https://www.statsmodels.org/stable/index.html) if you’re interested in a more inferential statistical approach to linear regression and feature selection, looking at p-values and 95% confidence intervals for the coefficients.
- [ ] Read [_An Introduction to Statistical Learning_](http://faculty.marshall.usc.edu/gareth-james/ISL/ISLR%20Seventh%20Printing.pdf), Chapters 1-3, for more math & theory, but in an accessible, readable way.
- [ ] Try [scikit-learn pipelines](https://scikit-learn.org/stable/modules/compose.html).
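
The pipeline and `RidgeCV` stretch goals can be combined. The following is only a minimal sketch on synthetic data from `make_regression`, not the NYC dataset; the candidate `alphas` list, `k=5`, and the 150/50 split are illustrative assumptions.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import RidgeCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: not the property sales data)
X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)
X_train, X_test = X[:150], X[150:]
y_train, y_test = y[:150], y[150:]

# Candidate regularization strengths for RidgeCV to choose from (illustrative)
alphas = [0.01, 0.1, 1.0, 10.0, 100.0]

# Scaling, feature selection, and the model are fit together on training data only
pipeline = make_pipeline(
    StandardScaler(),
    SelectKBest(score_func=f_regression, k=5),
    RidgeCV(alphas=alphas),
)
pipeline.fit(X_train, y_train)

# make_pipeline names each step after its lowercased class name
best_alpha = pipeline.named_steps['ridgecv'].alpha_
test_predictions = pipeline.predict(X_test)
print('Chosen alpha:', best_alpha)
```

Because every step lives in one pipeline, `pipeline.predict(X_test)` applies the scaler and selector that were fit on the training rows, so no test-set information reaches the fit.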

In [0]:
%%capture
import sys

# If you're on Colab:
if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Applied-Modeling/master/data/'
    !pip install category_encoders==2.*

# If you're working locally:
else:
    DATA_PATH = '../data/'
    
# Ignore this Numpy warning when using Plotly Express:
# FutureWarning: Method .ptp is deprecated and will be removed in a future version. Use numpy.ptp instead.
import warnings
warnings.filterwarnings(action='ignore', category=FutureWarning, module='numpy')

In [0]:
import pandas as pd
import pandas_profiling

# Read New York City property sales data
df = pd.read_csv(DATA_PATH+'condos/NYC_Citywide_Rolling_Calendar_Sales.csv')

# Change column names: replace spaces with underscores
df.columns = [col.replace(' ', '_') for col in df]

# SALE_PRICE was read as strings.
# Remove symbols, convert to integer
df['SALE_PRICE'] = (
    df['SALE_PRICE']
    .str.replace('$','', regex=False)
    .str.replace('-','', regex=False)
    .str.replace(',','', regex=False)
    .astype(int)
)
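
The cleaning chain above can be checked on a few price strings. This is a small sketch with made-up values (the `raw` series is hypothetical, not rows from the dataset). Passing `regex=False` makes `$` a literal character rather than a regular-expression anchor, whatever the installed pandas version uses as its default.

```python
import pandas as pd

# Hypothetical price strings, made up for illustration
raw = pd.Series(['$550,000', '$1,100,000', '$-0'])

prices = (
    raw
    .str.replace('$', '', regex=False)   # literal dollar sign
    .str.replace('-', '', regex=False)   # literal dash
    .str.replace(',', '', regex=False)   # thousands separator
    .astype(int)
)
print(prices.tolist())
```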

In [0]:
# BOROUGH is a numeric column, but arguably should be a categorical feature,
# so convert it from a number to a string
df['BOROUGH'] = df['BOROUGH'].astype(str)

In [0]:
# Reduce cardinality for NEIGHBORHOOD feature

# Get a list of the top 10 neighborhoods
top10 = df['NEIGHBORHOOD'].value_counts()[:10].index

# At locations where the neighborhood is NOT in the top 10, 
# replace the neighborhood with 'OTHER'
df.loc[~df['NEIGHBORHOOD'].isin(top10), 'NEIGHBORHOOD'] = 'OTHER'

In [291]:
top10

Index(['FLUSHING-NORTH', 'UPPER EAST SIDE (59-79)', 'UPPER EAST SIDE (79-96)',
       'BEDFORD STUYVESANT', 'BOROUGH PARK', 'UPPER WEST SIDE (59-79)',
       'GRAMERCY', 'ASTORIA', 'FOREST HILLS', 'EAST NEW YORK'],
      dtype='object')

In [292]:
df.tail(5)

Unnamed: 0,BOROUGH,NEIGHBORHOOD,BUILDING_CLASS_CATEGORY,TAX_CLASS_AT_PRESENT,BLOCK,LOT,EASE-MENT,BUILDING_CLASS_AT_PRESENT,ADDRESS,APARTMENT_NUMBER,ZIP_CODE,RESIDENTIAL_UNITS,COMMERCIAL_UNITS,TOTAL_UNITS,LAND_SQUARE_FEET,GROSS_SQUARE_FEET,YEAR_BUILT,TAX_CLASS_AT_TIME_OF_SALE,BUILDING_CLASS_AT_TIME_OF_SALE,SALE_PRICE,SALE_DATE
23035,4,OTHER,01 ONE FAMILY DWELLINGS,1,10965,276,,A5,111-17 FRANCIS LEWIS BLVD,,11429.0,1.0,0.0,1.0,1800,1224.0,1945.0,1,A5,510000,04/30/2019
23036,4,OTHER,09 COOPS - WALKUP APARTMENTS,2,169,29,,C6,"45-14 43RD STREET, 3C",,11104.0,0.0,0.0,0.0,0,0.0,1929.0,2,C6,355000,04/30/2019
23037,4,OTHER,10 COOPS - ELEVATOR APARTMENTS,2,131,4,,D4,"50-05 43RD AVENUE, 3M",,11377.0,0.0,0.0,0.0,0,0.0,1932.0,2,D4,375000,04/30/2019
23038,4,OTHER,02 TWO FAMILY DWELLINGS,1,8932,18,,S2,91-10 JAMAICA AVE,,11421.0,2.0,1.0,3.0,2078,2200.0,1931.0,1,S2,1100000,04/30/2019
23039,4,OTHER,12 CONDOS - WALKUP APARTMENTS,2,1216,1161,,R2,"61-05 39TH AVENUE, F5",F5,11377.0,1.0,0.0,85.0,15151,854.0,1927.0,2,R2,569202,04/30/2019


In [293]:
df.shape

(23040, 21)

In [0]:
#keep only the one-family dwellings
df = df[df['BUILDING_CLASS_CATEGORY'] == '01 ONE FAMILY DWELLINGS']

In [295]:
df.shape

(5061, 21)

In [296]:
#filter on sale price: keep $100 thousand to $2 million (both bounds inclusive here)
condition = (df['SALE_PRICE'] >= 100000) & (df['SALE_PRICE'] <= 2000000)
df = df[condition]
df

Unnamed: 0,BOROUGH,NEIGHBORHOOD,BUILDING_CLASS_CATEGORY,TAX_CLASS_AT_PRESENT,BLOCK,LOT,EASE-MENT,BUILDING_CLASS_AT_PRESENT,ADDRESS,APARTMENT_NUMBER,ZIP_CODE,RESIDENTIAL_UNITS,COMMERCIAL_UNITS,TOTAL_UNITS,LAND_SQUARE_FEET,GROSS_SQUARE_FEET,YEAR_BUILT,TAX_CLASS_AT_TIME_OF_SALE,BUILDING_CLASS_AT_TIME_OF_SALE,SALE_PRICE,SALE_DATE
44,3,OTHER,01 ONE FAMILY DWELLINGS,1,5495,801,,A9,4832 BAY PARKWAY,,11230.0,1.0,0.0,1.0,6800,1325.0,1930.0,1,A9,550000,01/01/2019
61,4,OTHER,01 ONE FAMILY DWELLINGS,1,7918,72,,A1,80-23 232ND STREET,,11427.0,1.0,0.0,1.0,4000,2001.0,1940.0,1,A1,200000,01/01/2019
78,2,OTHER,01 ONE FAMILY DWELLINGS,1,4210,19,,A1,1260 RHINELANDER AVE,,10461.0,1.0,0.0,1.0,3500,2043.0,1925.0,1,A1,810000,01/02/2019
108,3,OTHER,01 ONE FAMILY DWELLINGS,1,5212,69,,A1,469 E 25TH ST,,11226.0,1.0,0.0,1.0,4000,2680.0,1899.0,1,A1,125000,01/02/2019
111,3,OTHER,01 ONE FAMILY DWELLINGS,1,7930,121,,A5,5521 WHITTY LANE,,11203.0,1.0,0.0,1.0,1710,1872.0,1940.0,1,A5,620000,01/02/2019
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
23029,4,OTHER,01 ONE FAMILY DWELLINGS,1,13215,3,,A2,244-15 135 AVENUE,,11422.0,1.0,0.0,1.0,3300,1478.0,1925.0,1,A2,635000,04/30/2019
23031,4,OTHER,01 ONE FAMILY DWELLINGS,1,11612,73,,A1,10919 132ND STREET,,11420.0,1.0,0.0,1.0,2400,1280.0,1930.0,1,A1,514000,04/30/2019
23032,4,OTHER,01 ONE FAMILY DWELLINGS,1,11808,50,,A0,135-24 122ND STREET,,11420.0,1.0,0.0,1.0,4000,1333.0,1945.0,1,A0,635000,04/30/2019
23033,4,OTHER,01 ONE FAMILY DWELLINGS,1,12295,23,,A1,134-34 157TH STREET,,11434.0,1.0,0.0,1.0,2500,1020.0,1935.0,1,A1,545000,04/30/2019


In [297]:
df.dtypes

BOROUGH                            object
NEIGHBORHOOD                       object
BUILDING_CLASS_CATEGORY            object
TAX_CLASS_AT_PRESENT               object
BLOCK                               int64
LOT                                 int64
EASE-MENT                         float64
BUILDING_CLASS_AT_PRESENT          object
ADDRESS                            object
APARTMENT_NUMBER                   object
ZIP_CODE                          float64
RESIDENTIAL_UNITS                 float64
COMMERCIAL_UNITS                  float64
TOTAL_UNITS                       float64
LAND_SQUARE_FEET                   object
GROSS_SQUARE_FEET                 float64
YEAR_BUILT                        float64
TAX_CLASS_AT_TIME_OF_SALE           int64
BUILDING_CLASS_AT_TIME_OF_SALE     object
SALE_PRICE                          int64
SALE_DATE                          object
dtype: object

In [0]:
#LAND_SQUARE_FEET was read as strings:
#remove the "," thousands separators and convert to integer
df['LAND_SQUARE_FEET'] = (
    df['LAND_SQUARE_FEET']
    .str.replace(',','')
    .astype(int)
)

#TODO: try grouping them into four subgroups

In [299]:
df.dtypes #land square feet is now an integer

BOROUGH                            object
NEIGHBORHOOD                       object
BUILDING_CLASS_CATEGORY            object
TAX_CLASS_AT_PRESENT               object
BLOCK                               int64
LOT                                 int64
EASE-MENT                         float64
BUILDING_CLASS_AT_PRESENT          object
ADDRESS                            object
APARTMENT_NUMBER                   object
ZIP_CODE                          float64
RESIDENTIAL_UNITS                 float64
COMMERCIAL_UNITS                  float64
TOTAL_UNITS                       float64
LAND_SQUARE_FEET                    int64
GROSS_SQUARE_FEET                 float64
YEAR_BUILT                        float64
TAX_CLASS_AT_TIME_OF_SALE           int64
BUILDING_CLASS_AT_TIME_OF_SALE     object
SALE_PRICE                          int64
SALE_DATE                          object
dtype: object

In [300]:
df.head(1)

Unnamed: 0,BOROUGH,NEIGHBORHOOD,BUILDING_CLASS_CATEGORY,TAX_CLASS_AT_PRESENT,BLOCK,LOT,EASE-MENT,BUILDING_CLASS_AT_PRESENT,ADDRESS,APARTMENT_NUMBER,ZIP_CODE,RESIDENTIAL_UNITS,COMMERCIAL_UNITS,TOTAL_UNITS,LAND_SQUARE_FEET,GROSS_SQUARE_FEET,YEAR_BUILT,TAX_CLASS_AT_TIME_OF_SALE,BUILDING_CLASS_AT_TIME_OF_SALE,SALE_PRICE,SALE_DATE
44,3,OTHER,01 ONE FAMILY DWELLINGS,1,5495,801,,A9,4832 BAY PARKWAY,,11230.0,1.0,0.0,1.0,6800,1325.0,1930.0,1,A9,550000,01/01/2019


In [301]:
#splitting the training data off
df['SALE_DATE'] = pd.to_datetime(df['SALE_DATE'])

#the assignment asks for January through March 2019 for train
start_date = '01-01-2019'
end_date = '03-31-2019'

#note: the strict '>' leaves out sales dated exactly on start_date (2019-01-01)
mask = (df['SALE_DATE'] > start_date) & (df['SALE_DATE'] <= end_date)

df_train = df.loc[mask]

#verify this worked: print the first and last rows of the training data
print(df_train.head(1))
print(df_train.tail(1))

   BOROUGH NEIGHBORHOOD  ... SALE_PRICE  SALE_DATE
78       2        OTHER  ...     810000 2019-01-02

[1 rows x 21 columns]
      BOROUGH NEIGHBORHOOD  ... SALE_PRICE  SALE_DATE
18147       4        OTHER  ...     104000 2019-03-30

[1 rows x 21 columns]


In [302]:
df_train.shape

(2515, 21)

In [303]:
#splitting the TEST data off
df['SALE_DATE'] = pd.to_datetime(df['SALE_DATE'])

#the assignment asks for April 2019 for test
start_date = '04-01-2019'
end_date = '04-30-2019'

#note: the strict '>' leaves out sales dated exactly on start_date (2019-04-01)
mask = (df['SALE_DATE'] > start_date) & (df['SALE_DATE'] <= end_date)

df_test = df.loc[mask]

#verify this worked: print the first and last rows of the test data
print(df_test.head(1))
print(df_test.tail(1))
print(df_test.shape)

      BOROUGH NEIGHBORHOOD  ... SALE_PRICE  SALE_DATE
18500       2        OTHER  ...     375000 2019-04-02

[1 rows x 21 columns]
      BOROUGH NEIGHBORHOOD  ... SALE_PRICE  SALE_DATE
23035       4        OTHER  ...     510000 2019-04-30

[1 rows x 21 columns]
(606, 21)
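
Both masks compare with `>` on the start date, which is why the first rows printed are dated 2019-01-02 and 2019-04-02. One way to include the boundary days is to split on a single cutoff. This is a minimal sketch on a small made-up frame (`toy` and its values are hypothetical):

```python
import pandas as pd

# Hypothetical rows covering the boundary dates
toy = pd.DataFrame({
    'SALE_DATE': pd.to_datetime(['2019-01-01', '2019-01-02', '2019-03-31',
                                 '2019-04-01', '2019-04-30']),
    'SALE_PRICE': [550000, 810000, 104000, 375000, 510000],
})

# Everything before April 1 is train; April 1 onward is test
cutoff = pd.Timestamp('2019-04-01')
toy_train = toy[toy['SALE_DATE'] < cutoff]
toy_test = toy[toy['SALE_DATE'] >= cutoff]

print(toy_train['SALE_DATE'].min(), toy_train['SALE_DATE'].max())
print(toy_test['SALE_DATE'].min(), toy_test['SALE_DATE'].max())
```

A single cutoff also guarantees that no row lands in both sets or in neither, as long as the data only spans January through April.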


In [304]:
#Back to Training Data
#numeric columns on TRAIN
df_train.select_dtypes(include='number').describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
BLOCK,2515.0,6750.89503,3978.561448,21.0,3817.5,6021.0,9888.5,16350.0
LOT,2515.0,75.387276,156.628356,1.0,21.0,42.0,69.5,2720.0
EASE-MENT,0.0,,,,,,,
ZIP_CODE,2515.0,10993.625447,494.153233,10301.0,10314.0,11234.0,11413.0,11697.0
RESIDENTIAL_UNITS,2515.0,0.987674,0.110358,0.0,1.0,1.0,1.0,1.0
COMMERCIAL_UNITS,2515.0,0.016302,0.129763,0.0,0.0,0.0,0.0,2.0
TOTAL_UNITS,2515.0,1.003976,0.171521,0.0,1.0,1.0,1.0,3.0
LAND_SQUARE_FEET,2515.0,3144.7666,1797.856858,0.0,2000.0,2600.0,4000.0,18906.0
GROSS_SQUARE_FEET,2515.0,1473.349901,599.270609,0.0,1144.0,1367.0,1683.0,7875.0
YEAR_BUILT,2515.0,1944.737575,27.063971,1890.0,1925.0,1940.0,1960.0,2018.0


In [305]:
#non-numeric columns
df_train.select_dtypes(exclude='number').describe().T 

Unnamed: 0,count,unique,top,freq,first,last
BOROUGH,2515,5,4,1208,NaT,NaT
NEIGHBORHOOD,2515,7,OTHER,2366,NaT,NaT
BUILDING_CLASS_CATEGORY,2515,1,01 ONE FAMILY DWELLINGS,2515,NaT,NaT
TAX_CLASS_AT_PRESENT,2515,2,1,2484,NaT,NaT
BUILDING_CLASS_AT_PRESENT,2515,13,A1,920,NaT,NaT
ADDRESS,2515,2505,216-29 114TH ROAD,2,NaT,NaT
APARTMENT_NUMBER,1,1,RP.,1,NaT,NaT
BUILDING_CLASS_AT_TIME_OF_SALE,2515,11,A1,920,NaT,NaT
SALE_DATE,2515,67,2019-01-31 00:00:00,78,2019-01-02,2019-03-30


In [0]:
#let's drop ADDRESS: its cardinality is too high
target = 'SALE_PRICE'
high_cardinality = ['ADDRESS']
features = df_train.columns.drop([target] + high_cardinality)

x_train = df_train[features]
y_train = df_train[target]
x_test = df_test[features]
y_test = df_test[target]

In [307]:
#check train
df_train.select_dtypes(exclude='number').describe().T 

Unnamed: 0,count,unique,top,freq,first,last
BOROUGH,2515,5,4,1208,NaT,NaT
NEIGHBORHOOD,2515,7,OTHER,2366,NaT,NaT
BUILDING_CLASS_CATEGORY,2515,1,01 ONE FAMILY DWELLINGS,2515,NaT,NaT
TAX_CLASS_AT_PRESENT,2515,2,1,2484,NaT,NaT
BUILDING_CLASS_AT_PRESENT,2515,13,A1,920,NaT,NaT
ADDRESS,2515,2505,216-29 114TH ROAD,2,NaT,NaT
APARTMENT_NUMBER,1,1,RP.,1,NaT,NaT
BUILDING_CLASS_AT_TIME_OF_SALE,2515,11,A1,920,NaT,NaT
SALE_DATE,2515,67,2019-01-31 00:00:00,78,2019-01-02,2019-03-30


In [308]:
#check test
df_test.select_dtypes(exclude='number').describe().T 

Unnamed: 0,count,unique,top,freq,first,last
BOROUGH,606,5,4,354.0,NaT,NaT
NEIGHBORHOOD,606,7,OTHER,562.0,NaT,NaT
BUILDING_CLASS_CATEGORY,606,1,01 ONE FAMILY DWELLINGS,606.0,NaT,NaT
TAX_CLASS_AT_PRESENT,606,2,1,598.0,NaT,NaT
BUILDING_CLASS_AT_PRESENT,606,10,A1,253.0,NaT,NaT
ADDRESS,606,605,46-12 30TH ROAD,2.0,NaT,NaT
APARTMENT_NUMBER,0,0,,,NaT,NaT
BUILDING_CLASS_AT_TIME_OF_SALE,606,10,A1,253.0,NaT,NaT
SALE_DATE,606,22,2019-04-15 00:00:00,43.0,2019-04-02,2019-04-30


In [309]:
!pip install category_encoders



In [310]:
#do one-hot encoding of categorical features
#note: this encodes the full frames (ADDRESS and SALE_PRICE included), not x_train/x_test
import category_encoders as ce
encoder = ce.OneHotEncoder(use_cat_names=True)
df_train = encoder.fit_transform(df_train)
df_test = encoder.transform(df_test)
df_train.head()

Unnamed: 0,BOROUGH_2,BOROUGH_3,BOROUGH_4,BOROUGH_5,BOROUGH_1,NEIGHBORHOOD_OTHER,NEIGHBORHOOD_FLUSHING-NORTH,NEIGHBORHOOD_EAST NEW YORK,NEIGHBORHOOD_BEDFORD STUYVESANT,NEIGHBORHOOD_FOREST HILLS,NEIGHBORHOOD_BOROUGH PARK,NEIGHBORHOOD_ASTORIA,BUILDING_CLASS_CATEGORY_01 ONE FAMILY DWELLINGS,TAX_CLASS_AT_PRESENT_1,TAX_CLASS_AT_PRESENT_1D,BLOCK,LOT,EASE-MENT,BUILDING_CLASS_AT_PRESENT_A1,BUILDING_CLASS_AT_PRESENT_A5,BUILDING_CLASS_AT_PRESENT_A0,BUILDING_CLASS_AT_PRESENT_A2,BUILDING_CLASS_AT_PRESENT_A3,BUILDING_CLASS_AT_PRESENT_A9,BUILDING_CLASS_AT_PRESENT_S1,BUILDING_CLASS_AT_PRESENT_A4,BUILDING_CLASS_AT_PRESENT_A6,BUILDING_CLASS_AT_PRESENT_A8,BUILDING_CLASS_AT_PRESENT_B2,BUILDING_CLASS_AT_PRESENT_S0,BUILDING_CLASS_AT_PRESENT_B3,ADDRESS_1260 RHINELANDER AVE,ADDRESS_469 E 25TH ST,ADDRESS_5521 WHITTY LANE,ADDRESS_1747 EAST 23RD STREET,ADDRESS_1582 EAST 15TH STREET,ADDRESS_201-08 50TH AVENUE,ADDRESS_85-11 57 ROAD,ADDRESS_53-19 198TH STREET,ADDRESS_208-03 HOLLIS AVENUE,...,ADDRESS_80-55 88TH ROAD,ADDRESS_55-01 32 AVE,ADDRESS_54-02 32ND AVENUE,ADDRESS_32-11 54TH STREET,ADDRESS_37 BARB STREET,ADDRESS_70 GARY STREET,ADDRESS_74 DAFFODIL COURT,ADDRESS_72 GOLD AVENUE,ADDRESS_265 ELVERTON AVENUE,ADDRESS_408 DOANE AVENUE,ADDRESS_404 COLON AVENUE,ADDRESS_120 YORK AVENUE,ADDRESS_10 SEAFOAM STREET,ADDRESS_74 MCVEIGH AVE,ADDRESS_479 VILLA AVENUE,ADDRESS_63 NUGENT AVENUE,ADDRESS_223-29 103RD AVENUE,APARTMENT_NUMBER_nan,APARTMENT_NUMBER_RP.,ZIP_CODE,RESIDENTIAL_UNITS,COMMERCIAL_UNITS,TOTAL_UNITS,LAND_SQUARE_FEET,GROSS_SQUARE_FEET,YEAR_BUILT,TAX_CLASS_AT_TIME_OF_SALE,BUILDING_CLASS_AT_TIME_OF_SALE_A1,BUILDING_CLASS_AT_TIME_OF_SALE_A5,BUILDING_CLASS_AT_TIME_OF_SALE_A0,BUILDING_CLASS_AT_TIME_OF_SALE_A2,BUILDING_CLASS_AT_TIME_OF_SALE_A3,BUILDING_CLASS_AT_TIME_OF_SALE_A9,BUILDING_CLASS_AT_TIME_OF_SALE_S1,BUILDING_CLASS_AT_TIME_OF_SALE_A4,BUILDING_CLASS_AT_TIME_OF_SALE_A6,BUILDING_CLASS_AT_TIME_OF_SALE_A8,BUILDING_CLASS_AT_TIME_OF_SALE_S0,SALE_PRICE,SALE_DATE
78,1,0,0,0,0,1,0,0,0,0,0,0,1,1,0,4210,19,,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,10461.0,1.0,0.0,1.0,3500,2043.0,1925.0,1,1,0,0,0,0,0,0,0,0,0,0,810000,2019-01-02
108,0,1,0,0,0,1,0,0,0,0,0,0,1,1,0,5212,69,,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,11226.0,1.0,0.0,1.0,4000,2680.0,1899.0,1,1,0,0,0,0,0,0,0,0,0,0,125000,2019-01-02
111,0,1,0,0,0,1,0,0,0,0,0,0,1,1,0,7930,121,,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,11203.0,1.0,0.0,1.0,1710,1872.0,1940.0,1,0,1,0,0,0,0,0,0,0,0,0,620000,2019-01-02
120,0,1,0,0,0,1,0,0,0,0,0,0,1,1,0,6806,72,,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,11229.0,1.0,0.0,1.0,4000,1932.0,1930.0,1,1,0,0,0,0,0,0,0,0,0,0,1150000,2019-01-02
121,0,1,0,0,0,1,0,0,0,0,0,0,1,1,0,6761,42,,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,11230.0,1.0,0.0,1.0,2000,1722.0,1920.0,1,1,0,0,0,0,0,0,0,0,0,0,836500,2019-01-02


In [311]:
df_train.shape

(2515, 2559)

In [312]:
df_test.shape

(606, 2559)

In [313]:
#feature selection, picking the top 15
from sklearn.feature_selection import f_regression, SelectKBest
selector = SelectKBest(score_func = f_regression, k=15)
x_train_selected = selector.fit_transform(x_train, y_train)
x_test_selected = selector.transform(x_test)

x_train_selected.shape

ValueError: ignored
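
The `ValueError` is most likely raised because `x_train` still holds the raw string and datetime columns: the encoder was applied to `df_train`, not to the feature matrix passed to `SelectKBest`. The sketch below shows one possible fix on made-up rows. It swaps in `pandas.get_dummies` for `category_encoders` to stay self-contained, and `toy_train`, `toy_test`, and `k=3` are illustrative assumptions.

```python
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

# Hypothetical rows standing in for the real train/test frames
toy_train = pd.DataFrame({
    'BOROUGH': ['2', '3', '3', '4', '4', '5'],
    'GROSS_SQUARE_FEET': [2043.0, 2680.0, 1872.0, 1224.0, 1478.0, 1280.0],
    'YEAR_BUILT': [1925.0, 1899.0, 1940.0, 1945.0, 1925.0, 1930.0],
    'SALE_PRICE': [810000, 125000, 620000, 510000, 635000, 514000],
})
toy_test = pd.DataFrame({
    'BOROUGH': ['4', '2'],
    'GROSS_SQUARE_FEET': [1333.0, 1020.0],
    'YEAR_BUILT': [1945.0, 1935.0],
    'SALE_PRICE': [635000, 545000],
})
target = 'SALE_PRICE'

# Encode the feature matrix only, never the target
X_train = pd.get_dummies(toy_train.drop(columns=target), dtype=float)
# Align test columns to train; categories absent from test become all-zero columns
X_test = (
    pd.get_dummies(toy_test.drop(columns=target), dtype=float)
    .reindex(columns=X_train.columns, fill_value=0.0)
)
y_train = toy_train[target]

# Select features from the encoded, all-numeric matrix
selector = SelectKBest(score_func=f_regression, k=3)
X_train_selected = selector.fit_transform(X_train, y_train)
X_test_selected = selector.transform(X_test)

selected_names = X_train.columns[selector.get_support()]
print(list(selected_names))
```

On the real data, a column that is entirely missing (such as `EASE-MENT`) and the datetime `SALE_DATE` column would likely also need to be dropped or converted first, since `f_regression` expects finite numeric input.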

In [0]:
#Ridge
#Ridge and lasso regression are simple regularization techniques that
#reduce model complexity and help prevent the overfitting that
#ordinary least-squares linear regression can produce.

#Ridge regression shrinks the coefficients toward zero, which reduces
#model complexity and mitigates the effects of multicollinearity.

#The weaker the penalty (low λ, called alpha in scikit-learn),
#the more closely the model resembles ordinary linear regression.
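
The effect of the penalty can be seen on synthetic data. This is only a sketch: the data, the true coefficients, and the three `alpha` values are made up for illustration. It compares the size of the ridge coefficient vector at different penalty strengths against ordinary least squares.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

# Synthetic data with known coefficients (assumption: illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([3.0, -2.0, 1.0]) + rng.normal(scale=0.5, size=50)

ols = LinearRegression().fit(X, y)

# Length of the coefficient vector for increasingly strong penalties
coef_norms = {}
for alpha in [1e-6, 1.0, 100.0]:
    ridge = Ridge(alpha=alpha).fit(X, y)
    coef_norms[alpha] = np.linalg.norm(ridge.coef_)

# With a near-zero penalty, ridge is practically ordinary least squares
nearly_ols = Ridge(alpha=1e-6).fit(X, y)
print(coef_norms)
print(np.allclose(nearly_ols.coef_, ols.coef_, atol=1e-4))
```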

In [0]:
#5 steps
import pandas as pd
#Step 1 - importing the appropriate estimator, in this case, Ridge
from sklearn.linear_model import Ridge

#choosing model hyperparameters by instantiating the class (defaults here) as "model"
model = Ridge()

#Arranging into features matrix and target vector.
features = ['YEAR_BUILT']
target = 'SALE_PRICE'
x=df_train[features]
y=df_train[target]

#check
#print(x.shape, y.shape)

#fit the model to the data
model.fit(x,y)
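
The remaining checklist items, feature scaling and test-set mean absolute error, could follow the fit roughly as below. This is a sketch on synthetic data: `make_toy`, its price formula, and `alpha=1.0` are assumptions for illustration and say nothing about the real sales data.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

def make_toy(n):
    """Synthetic stand-in: (square footage, year built) -> sale price."""
    sqft = rng.uniform(500, 4000, n)
    year = rng.uniform(1890, 2018, n)
    price = 200 * sqft + 1000 * (year - 1890) + rng.normal(0, 20000, n)
    return np.column_stack([sqft, year]), price

X_train, y_train = make_toy(200)
X_test, y_test = make_toy(50)

# Fit the scaler on the training data only, then apply it to both sets
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

model = Ridge(alpha=1.0)
model.fit(X_train_scaled, y_train)

# Mean absolute error on the held-out test set
mae = mean_absolute_error(y_test, model.predict(X_test_scaled))
print(f'Test MAE: ${mae:,.0f}')
```

Scaling matters for ridge because the penalty acts on coefficient size, and coefficient size depends on each feature's units.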

In [0]:
import matplotlib.pyplot as plt
import numpy as np 
import pandas as pd
import matplotlib

#scikit-learn
#from sklearn.model_selection import train_test_split  #sklearn.cross_validation was removed
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge