<a href="https://colab.research.google.com/github/ltoosaint24/DS-Unit-2-Linear-Models/blob/master/Loveline_Toussaint__LS_DS_213_assignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Lambda School Data Science

*Unit 2, Sprint 1, Module 3*

---

# Ridge Regression

## Assignment

We're going back to our other **New York City** real estate dataset. Instead of predicting apartment rents, you'll predict property sales prices.

But not just for condos in Tribeca...

- [ ] Use a subset of the data where `BUILDING_CLASS_CATEGORY` == `'01 ONE FAMILY DWELLINGS'` and the sale price was more than 100 thousand and less than 2 million.
- [ ] Do train/test split. Use data from January — March 2019 to train. Use data from April 2019 to test.
- [ ] Do one-hot encoding of categorical features.
- [ ] Do feature selection with `SelectKBest`.
- [ ] Fit a ridge regression model with multiple features. Use the `normalize=True` parameter (or do [feature scaling](https://scikit-learn.org/stable/modules/preprocessing.html) beforehand — use the scaler's `fit_transform` method with the train set, and the scaler's `transform` method with the test set)
- [ ] Get mean absolute error for the test set.
- [ ] As always, commit your notebook to your fork of the GitHub repo.

The [NYC Department of Finance](https://www1.nyc.gov/site/finance/taxes/property-rolling-sales-data.page) has a glossary of property sales terms and NYC Building Class Code Descriptions. The data comes from the [NYC OpenData](https://data.cityofnewyork.us/browse?q=NYC%20calendar%20sales) portal.


## Stretch Goals

Don't worry, you aren't expected to do all these stretch goals! These are just ideas to consider and choose from.

- [ ] Add your own stretch goal(s) !
- [ ] Instead of `Ridge`, try `LinearRegression`. Depending on how many features you select, your errors will probably blow up! 💥
- [ ] Instead of `Ridge`, try [`RidgeCV`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.RidgeCV.html).
- [ ] Learn more about feature selection:
    - ["Permutation importance"](https://www.kaggle.com/dansbecker/permutation-importance)
    - [scikit-learn's User Guide for Feature Selection](https://scikit-learn.org/stable/modules/feature_selection.html)
    - [mlxtend](http://rasbt.github.io/mlxtend/) library
    - scikit-learn-contrib libraries: [boruta_py](https://github.com/scikit-learn-contrib/boruta_py) & [stability-selection](https://github.com/scikit-learn-contrib/stability-selection)
    - [_Feature Engineering and Selection_](http://www.feat.engineering/) by Kuhn & Johnson.
- [ ] Try [statsmodels](https://www.statsmodels.org/stable/index.html) if you’re interested in more inferential statistical approach to linear regression and feature selection, looking at p values and 95% confidence intervals for the coefficients.
- [ ] Read [_An Introduction to Statistical Learning_](http://faculty.marshall.usc.edu/gareth-james/ISL/ISLR%20Seventh%20Printing.pdf), Chapters 1-3, for more math & theory, but in an accessible, readable way.
- [ ] Try [scikit-learn pipelines](https://scikit-learn.org/stable/modules/compose.html).

In [1]:
%%capture
import sys

# If you're on Colab:
if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Applied-Modeling/master/data/'
    !pip install category_encoders==2.*

# If you're working locally:
else:
    DATA_PATH = '../data/'
    
# Ignore this Numpy warning when using Plotly Express:
# FutureWarning: Method .ptp is deprecated and will be removed in a future version. Use numpy.ptp instead.
import warnings
warnings.filterwarnings(action='ignore', category=FutureWarning, module='numpy')

In [2]:
import pandas as pd
import pandas_profiling

# Read New York City property sales data
df = pd.read_csv(DATA_PATH+'condos/NYC_Citywide_Rolling_Calendar_Sales.csv')

# Change column names: replace spaces with underscores
df.columns = [col.replace(' ', '_') for col in df]

# SALE_PRICE was read as strings.
# Remove symbols, convert to integer
df['SALE_PRICE'] = (
    df['SALE_PRICE']
    .str.replace('$','')
    .str.replace('-','')
    .str.replace(',','')
    .astype(int)
)

In [3]:
# BOROUGH is a numeric column, but arguably should be a categorical feature,
# so convert it from a number to a string
df['BOROUGH'] = df['BOROUGH'].astype(str)

In [4]:
# Reduce cardinality for NEIGHBORHOOD feature

# Get a list of the top 10 neighborhoods
top10 = df['NEIGHBORHOOD'].value_counts()[:10].index

# At locations where the neighborhood is NOT in the top 10, 
# replace the neighborhood with 'OTHER'
df.loc[~df['NEIGHBORHOOD'].isin(top10), 'NEIGHBORHOOD'] = 'OTHER'

In [5]:
df


Unnamed: 0,BOROUGH,NEIGHBORHOOD,BUILDING_CLASS_CATEGORY,TAX_CLASS_AT_PRESENT,BLOCK,LOT,EASE-MENT,BUILDING_CLASS_AT_PRESENT,ADDRESS,APARTMENT_NUMBER,ZIP_CODE,RESIDENTIAL_UNITS,COMMERCIAL_UNITS,TOTAL_UNITS,LAND_SQUARE_FEET,GROSS_SQUARE_FEET,YEAR_BUILT,TAX_CLASS_AT_TIME_OF_SALE,BUILDING_CLASS_AT_TIME_OF_SALE,SALE_PRICE,SALE_DATE
0,1,OTHER,13 CONDOS - ELEVATOR APARTMENTS,2,716,1246,,R4,"447 WEST 18TH STREET, PH12A",PH12A,10011.0,1.0,0.0,1.0,10733,1979.0,2007.0,2,R4,0,01/01/2019
1,1,OTHER,21 OFFICE BUILDINGS,4,812,68,,O5,144 WEST 37TH STREET,,10018.0,0.0,6.0,6.0,2962,15435.0,1920.0,4,O5,0,01/01/2019
2,1,OTHER,21 OFFICE BUILDINGS,4,839,69,,O5,40 WEST 38TH STREET,,10018.0,0.0,7.0,7.0,2074,11332.0,1930.0,4,O5,0,01/01/2019
3,1,OTHER,13 CONDOS - ELEVATOR APARTMENTS,2,592,1041,,R4,"1 SHERIDAN SQUARE, 8C",8C,10014.0,1.0,0.0,1.0,0,500.0,0.0,2,R4,0,01/01/2019
4,1,UPPER EAST SIDE (59-79),15 CONDOS - 2-10 UNIT RESIDENTIAL,2C,1379,1402,,R1,"20 EAST 65TH STREET, B",B,10065.0,1.0,0.0,1.0,0,6406.0,0.0,2,R1,0,01/01/2019
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
23035,4,OTHER,01 ONE FAMILY DWELLINGS,1,10965,276,,A5,111-17 FRANCIS LEWIS BLVD,,11429.0,1.0,0.0,1.0,1800,1224.0,1945.0,1,A5,510000,04/30/2019
23036,4,OTHER,09 COOPS - WALKUP APARTMENTS,2,169,29,,C6,"45-14 43RD STREET, 3C",,11104.0,0.0,0.0,0.0,0,0.0,1929.0,2,C6,355000,04/30/2019
23037,4,OTHER,10 COOPS - ELEVATOR APARTMENTS,2,131,4,,D4,"50-05 43RD AVENUE, 3M",,11377.0,0.0,0.0,0.0,0,0.0,1932.0,2,D4,375000,04/30/2019
23038,4,OTHER,02 TWO FAMILY DWELLINGS,1,8932,18,,S2,91-10 JAMAICA AVE,,11421.0,2.0,1.0,3.0,2078,2200.0,1931.0,1,S2,1100000,04/30/2019


In [6]:
df.isnull().sum()

BOROUGH                               0
NEIGHBORHOOD                          0
BUILDING_CLASS_CATEGORY               0
TAX_CLASS_AT_PRESENT                  1
BLOCK                                 0
LOT                                   0
EASE-MENT                         23040
BUILDING_CLASS_AT_PRESENT             1
ADDRESS                               0
APARTMENT_NUMBER                  17839
ZIP_CODE                              1
RESIDENTIAL_UNITS                     1
COMMERCIAL_UNITS                      1
TOTAL_UNITS                           1
LAND_SQUARE_FEET                     53
GROSS_SQUARE_FEET                     1
YEAR_BUILT                           35
TAX_CLASS_AT_TIME_OF_SALE             0
BUILDING_CLASS_AT_TIME_OF_SALE        0
SALE_PRICE                            0
SALE_DATE                             0
dtype: int64

In [7]:
df.dropna

<bound method DataFrame.dropna of       BOROUGH             NEIGHBORHOOD  ... SALE_PRICE   SALE_DATE
0           1                    OTHER  ...          0  01/01/2019
1           1                    OTHER  ...          0  01/01/2019
2           1                    OTHER  ...          0  01/01/2019
3           1                    OTHER  ...          0  01/01/2019
4           1  UPPER EAST SIDE (59-79)  ...          0  01/01/2019
...       ...                      ...  ...        ...         ...
23035       4                    OTHER  ...     510000  04/30/2019
23036       4                    OTHER  ...     355000  04/30/2019
23037       4                    OTHER  ...     375000  04/30/2019
23038       4                    OTHER  ...    1100000  04/30/2019
23039       4                    OTHER  ...     569202  04/30/2019

[23040 rows x 21 columns]>

In [8]:
df[df['BUILDING_CLASS_CATEGORY']=='01 ONE FAMILY DWELLINGS']

Unnamed: 0,BOROUGH,NEIGHBORHOOD,BUILDING_CLASS_CATEGORY,TAX_CLASS_AT_PRESENT,BLOCK,LOT,EASE-MENT,BUILDING_CLASS_AT_PRESENT,ADDRESS,APARTMENT_NUMBER,ZIP_CODE,RESIDENTIAL_UNITS,COMMERCIAL_UNITS,TOTAL_UNITS,LAND_SQUARE_FEET,GROSS_SQUARE_FEET,YEAR_BUILT,TAX_CLASS_AT_TIME_OF_SALE,BUILDING_CLASS_AT_TIME_OF_SALE,SALE_PRICE,SALE_DATE
7,2,OTHER,01 ONE FAMILY DWELLINGS,1,4090,37,,A1,1193 SACKET AVENUE,,10461.0,1.0,0.0,1.0,3404,1328.0,1925.0,1,A1,0,01/01/2019
8,2,OTHER,01 ONE FAMILY DWELLINGS,1,4120,18,,A5,1215 VAN NEST AVENUE,,10461.0,1.0,0.0,1.0,2042,1728.0,1935.0,1,A5,0,01/01/2019
9,2,OTHER,01 ONE FAMILY DWELLINGS,1,4120,20,,A5,1211 VAN NEST AVENUE,,10461.0,1.0,0.0,1.0,2042,1728.0,1935.0,1,A5,0,01/01/2019
42,3,OTHER,01 ONE FAMILY DWELLINGS,1,6809,54,,A1,2601 AVENUE R,,11229.0,1.0,0.0,1.0,3333,1262.0,1925.0,1,A1,0,01/01/2019
44,3,OTHER,01 ONE FAMILY DWELLINGS,1,5495,801,,A9,4832 BAY PARKWAY,,11230.0,1.0,0.0,1.0,6800,1325.0,1930.0,1,A9,550000,01/01/2019
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
23029,4,OTHER,01 ONE FAMILY DWELLINGS,1,13215,3,,A2,244-15 135 AVENUE,,11422.0,1.0,0.0,1.0,3300,1478.0,1925.0,1,A2,635000,04/30/2019
23031,4,OTHER,01 ONE FAMILY DWELLINGS,1,11612,73,,A1,10919 132ND STREET,,11420.0,1.0,0.0,1.0,2400,1280.0,1930.0,1,A1,514000,04/30/2019
23032,4,OTHER,01 ONE FAMILY DWELLINGS,1,11808,50,,A0,135-24 122ND STREET,,11420.0,1.0,0.0,1.0,4000,1333.0,1945.0,1,A0,635000,04/30/2019
23033,4,OTHER,01 ONE FAMILY DWELLINGS,1,12295,23,,A1,134-34 157TH STREET,,11434.0,1.0,0.0,1.0,2500,1020.0,1935.0,1,A1,545000,04/30/2019


In [9]:
#the .loc select rows by multiple labels conditions 
df_condition = df.loc[(df['BUILDING_CLASS_CATEGORY'] == '01 ONE FAMILY DWELLINGS') & (df['SALE_PRICE']>= 100000) & (df['SALE_PRICE'] <=2000000)]
df_condition

Unnamed: 0,BOROUGH,NEIGHBORHOOD,BUILDING_CLASS_CATEGORY,TAX_CLASS_AT_PRESENT,BLOCK,LOT,EASE-MENT,BUILDING_CLASS_AT_PRESENT,ADDRESS,APARTMENT_NUMBER,ZIP_CODE,RESIDENTIAL_UNITS,COMMERCIAL_UNITS,TOTAL_UNITS,LAND_SQUARE_FEET,GROSS_SQUARE_FEET,YEAR_BUILT,TAX_CLASS_AT_TIME_OF_SALE,BUILDING_CLASS_AT_TIME_OF_SALE,SALE_PRICE,SALE_DATE
44,3,OTHER,01 ONE FAMILY DWELLINGS,1,5495,801,,A9,4832 BAY PARKWAY,,11230.0,1.0,0.0,1.0,6800,1325.0,1930.0,1,A9,550000,01/01/2019
61,4,OTHER,01 ONE FAMILY DWELLINGS,1,7918,72,,A1,80-23 232ND STREET,,11427.0,1.0,0.0,1.0,4000,2001.0,1940.0,1,A1,200000,01/01/2019
78,2,OTHER,01 ONE FAMILY DWELLINGS,1,4210,19,,A1,1260 RHINELANDER AVE,,10461.0,1.0,0.0,1.0,3500,2043.0,1925.0,1,A1,810000,01/02/2019
108,3,OTHER,01 ONE FAMILY DWELLINGS,1,5212,69,,A1,469 E 25TH ST,,11226.0,1.0,0.0,1.0,4000,2680.0,1899.0,1,A1,125000,01/02/2019
111,3,OTHER,01 ONE FAMILY DWELLINGS,1,7930,121,,A5,5521 WHITTY LANE,,11203.0,1.0,0.0,1.0,1710,1872.0,1940.0,1,A5,620000,01/02/2019
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
23029,4,OTHER,01 ONE FAMILY DWELLINGS,1,13215,3,,A2,244-15 135 AVENUE,,11422.0,1.0,0.0,1.0,3300,1478.0,1925.0,1,A2,635000,04/30/2019
23031,4,OTHER,01 ONE FAMILY DWELLINGS,1,11612,73,,A1,10919 132ND STREET,,11420.0,1.0,0.0,1.0,2400,1280.0,1930.0,1,A1,514000,04/30/2019
23032,4,OTHER,01 ONE FAMILY DWELLINGS,1,11808,50,,A0,135-24 122ND STREET,,11420.0,1.0,0.0,1.0,4000,1333.0,1945.0,1,A0,635000,04/30/2019
23033,4,OTHER,01 ONE FAMILY DWELLINGS,1,12295,23,,A1,134-34 157TH STREET,,11434.0,1.0,0.0,1.0,2500,1020.0,1935.0,1,A1,545000,04/30/2019


In [10]:
#Split data from January -March 2019 train and April 2019 test
df.dtypes

BOROUGH                            object
NEIGHBORHOOD                       object
BUILDING_CLASS_CATEGORY            object
TAX_CLASS_AT_PRESENT               object
BLOCK                               int64
LOT                                 int64
EASE-MENT                         float64
BUILDING_CLASS_AT_PRESENT          object
ADDRESS                            object
APARTMENT_NUMBER                   object
ZIP_CODE                          float64
RESIDENTIAL_UNITS                 float64
COMMERCIAL_UNITS                  float64
TOTAL_UNITS                       float64
LAND_SQUARE_FEET                   object
GROSS_SQUARE_FEET                 float64
YEAR_BUILT                        float64
TAX_CLASS_AT_TIME_OF_SALE           int64
BUILDING_CLASS_AT_TIME_OF_SALE     object
SALE_PRICE                          int64
SALE_DATE                          object
dtype: object

In [11]:
import datetime 

#First the date object had to be shifted to date time sequence for conditioning
dateti = [x for x in df_condition['SALE_DATE']]
for ix in dateti:
  df_condition['SALE_DATE'] = datetime.datetime.strptime(ix,'%m/%d/%Y')


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  


In [12]:
df_condition

Unnamed: 0,BOROUGH,NEIGHBORHOOD,BUILDING_CLASS_CATEGORY,TAX_CLASS_AT_PRESENT,BLOCK,LOT,EASE-MENT,BUILDING_CLASS_AT_PRESENT,ADDRESS,APARTMENT_NUMBER,ZIP_CODE,RESIDENTIAL_UNITS,COMMERCIAL_UNITS,TOTAL_UNITS,LAND_SQUARE_FEET,GROSS_SQUARE_FEET,YEAR_BUILT,TAX_CLASS_AT_TIME_OF_SALE,BUILDING_CLASS_AT_TIME_OF_SALE,SALE_PRICE,SALE_DATE
44,3,OTHER,01 ONE FAMILY DWELLINGS,1,5495,801,,A9,4832 BAY PARKWAY,,11230.0,1.0,0.0,1.0,6800,1325.0,1930.0,1,A9,550000,2019-04-30
61,4,OTHER,01 ONE FAMILY DWELLINGS,1,7918,72,,A1,80-23 232ND STREET,,11427.0,1.0,0.0,1.0,4000,2001.0,1940.0,1,A1,200000,2019-04-30
78,2,OTHER,01 ONE FAMILY DWELLINGS,1,4210,19,,A1,1260 RHINELANDER AVE,,10461.0,1.0,0.0,1.0,3500,2043.0,1925.0,1,A1,810000,2019-04-30
108,3,OTHER,01 ONE FAMILY DWELLINGS,1,5212,69,,A1,469 E 25TH ST,,11226.0,1.0,0.0,1.0,4000,2680.0,1899.0,1,A1,125000,2019-04-30
111,3,OTHER,01 ONE FAMILY DWELLINGS,1,7930,121,,A5,5521 WHITTY LANE,,11203.0,1.0,0.0,1.0,1710,1872.0,1940.0,1,A5,620000,2019-04-30
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
23029,4,OTHER,01 ONE FAMILY DWELLINGS,1,13215,3,,A2,244-15 135 AVENUE,,11422.0,1.0,0.0,1.0,3300,1478.0,1925.0,1,A2,635000,2019-04-30
23031,4,OTHER,01 ONE FAMILY DWELLINGS,1,11612,73,,A1,10919 132ND STREET,,11420.0,1.0,0.0,1.0,2400,1280.0,1930.0,1,A1,514000,2019-04-30
23032,4,OTHER,01 ONE FAMILY DWELLINGS,1,11808,50,,A0,135-24 122ND STREET,,11420.0,1.0,0.0,1.0,4000,1333.0,1945.0,1,A0,635000,2019-04-30
23033,4,OTHER,01 ONE FAMILY DWELLINGS,1,12295,23,,A1,134-34 157TH STREET,,11434.0,1.0,0.0,1.0,2500,1020.0,1935.0,1,A1,545000,2019-04-30


In [13]:
df_condition.dtypes

BOROUGH                                   object
NEIGHBORHOOD                              object
BUILDING_CLASS_CATEGORY                   object
TAX_CLASS_AT_PRESENT                      object
BLOCK                                      int64
LOT                                        int64
EASE-MENT                                float64
BUILDING_CLASS_AT_PRESENT                 object
ADDRESS                                   object
APARTMENT_NUMBER                          object
ZIP_CODE                                 float64
RESIDENTIAL_UNITS                        float64
COMMERCIAL_UNITS                         float64
TOTAL_UNITS                              float64
LAND_SQUARE_FEET                          object
GROSS_SQUARE_FEET                        float64
YEAR_BUILT                               float64
TAX_CLASS_AT_TIME_OF_SALE                  int64
BUILDING_CLASS_AT_TIME_OF_SALE            object
SALE_PRICE                                 int64
SALE_DATE           

In [14]:
df_train= df_condition[(df_condition['SALE_DATE'] >='01-01-2019') | (df_condition['SALE_DATE'] <= '03-31-2019')]
df_test = df_condition[(df_condition['SALE_DATE'] >= '04-01-2019') | (df_condition['SALE_DATE']<='04-29-2019')]

In [15]:
df_train.shape, df_test.shape

((3164, 21), (3164, 21))

In [16]:
from sklearn.model_selection import train_test_split

df_train

Unnamed: 0,BOROUGH,NEIGHBORHOOD,BUILDING_CLASS_CATEGORY,TAX_CLASS_AT_PRESENT,BLOCK,LOT,EASE-MENT,BUILDING_CLASS_AT_PRESENT,ADDRESS,APARTMENT_NUMBER,ZIP_CODE,RESIDENTIAL_UNITS,COMMERCIAL_UNITS,TOTAL_UNITS,LAND_SQUARE_FEET,GROSS_SQUARE_FEET,YEAR_BUILT,TAX_CLASS_AT_TIME_OF_SALE,BUILDING_CLASS_AT_TIME_OF_SALE,SALE_PRICE,SALE_DATE
44,3,OTHER,01 ONE FAMILY DWELLINGS,1,5495,801,,A9,4832 BAY PARKWAY,,11230.0,1.0,0.0,1.0,6800,1325.0,1930.0,1,A9,550000,2019-04-30
61,4,OTHER,01 ONE FAMILY DWELLINGS,1,7918,72,,A1,80-23 232ND STREET,,11427.0,1.0,0.0,1.0,4000,2001.0,1940.0,1,A1,200000,2019-04-30
78,2,OTHER,01 ONE FAMILY DWELLINGS,1,4210,19,,A1,1260 RHINELANDER AVE,,10461.0,1.0,0.0,1.0,3500,2043.0,1925.0,1,A1,810000,2019-04-30
108,3,OTHER,01 ONE FAMILY DWELLINGS,1,5212,69,,A1,469 E 25TH ST,,11226.0,1.0,0.0,1.0,4000,2680.0,1899.0,1,A1,125000,2019-04-30
111,3,OTHER,01 ONE FAMILY DWELLINGS,1,7930,121,,A5,5521 WHITTY LANE,,11203.0,1.0,0.0,1.0,1710,1872.0,1940.0,1,A5,620000,2019-04-30
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
23029,4,OTHER,01 ONE FAMILY DWELLINGS,1,13215,3,,A2,244-15 135 AVENUE,,11422.0,1.0,0.0,1.0,3300,1478.0,1925.0,1,A2,635000,2019-04-30
23031,4,OTHER,01 ONE FAMILY DWELLINGS,1,11612,73,,A1,10919 132ND STREET,,11420.0,1.0,0.0,1.0,2400,1280.0,1930.0,1,A1,514000,2019-04-30
23032,4,OTHER,01 ONE FAMILY DWELLINGS,1,11808,50,,A0,135-24 122ND STREET,,11420.0,1.0,0.0,1.0,4000,1333.0,1945.0,1,A0,635000,2019-04-30
23033,4,OTHER,01 ONE FAMILY DWELLINGS,1,12295,23,,A1,134-34 157TH STREET,,11434.0,1.0,0.0,1.0,2500,1020.0,1935.0,1,A1,545000,2019-04-30


In [17]:
df_train, val = train_test_split(df_train, random_state =42)

In [18]:
train_test_split(df_train, random_state = 42)

[      BOROUGH NEIGHBORHOOD  ... SALE_PRICE  SALE_DATE
 10355       4        OTHER  ...     520000 2019-04-30
 9741        3        OTHER  ...     632000 2019-04-30
 13575       4        OTHER  ...     400000 2019-04-30
 12798       5        OTHER  ...     270000 2019-04-30
 7687        4        OTHER  ...     555000 2019-04-30
 ...       ...          ...  ...        ...        ...
 15315       4        OTHER  ...     550000 2019-04-30
 1382        3        OTHER  ...     790000 2019-04-30
 15942       4        OTHER  ...     430000 2019-04-30
 429         5        OTHER  ...     926608 2019-04-30
 21066       2        OTHER  ...     507500 2019-04-30
 
 [1779 rows x 21 columns],
       BOROUGH NEIGHBORHOOD  ... SALE_PRICE  SALE_DATE
 15811       3        OTHER  ...     400000 2019-04-30
 17611       3        OTHER  ...    1525000 2019-04-30
 9712        3        OTHER  ...     474000 2019-04-30
 6689        4        OTHER  ...     580000 2019-04-30
 6300        4        OTHER  ...    

In [19]:
df_train.shape, val.shape, df_test.shape

((2373, 21), (791, 21), (3164, 21))

In [20]:
target = 'SALE_PRICE'
y_train = df_train[target]
y_train.value_counts(normalize =True)


500000    0.017278
550000    0.013064
450000    0.012221
400000    0.011799
525000    0.010957
            ...   
518500    0.000421
412000    0.000421
276822    0.000421
498000    0.000421
204800    0.000421
Name: SALE_PRICE, Length: 823, dtype: float64

In [21]:
majority_class = y_train.mode()[0]
y_pred = [majority_class] * len(y_train)

In [22]:
from sklearn.metrics import accuracy_score
accuracy_score(y_train, y_pred)

0.017277707543194267

In [23]:
y_val = val[target]
y_pred = [majority_class]* len(y_val)
accuracy_score(y_val, y_pred)

0.008849557522123894

In [24]:
df_train.describe()

Unnamed: 0,BLOCK,LOT,EASE-MENT,ZIP_CODE,RESIDENTIAL_UNITS,COMMERCIAL_UNITS,TOTAL_UNITS,GROSS_SQUARE_FEET,YEAR_BUILT,TAX_CLASS_AT_TIME_OF_SALE,SALE_PRICE
count,2373.0,2373.0,0.0,2373.0,2373.0,2373.0,2373.0,2373.0,2373.0,2373.0,2373.0
mean,6965.141172,74.674673,,11035.327012,0.986515,0.014749,1.001264,1462.09566,1943.328276,1.0,628049.7
std,3990.347073,153.822387,,480.347284,0.118962,0.12402,0.173006,569.422762,26.425364,0.0,297480.5
min,21.0,1.0,,10301.0,0.0,0.0,0.0,0.0,1890.0,1.0,100000.0
25%,4025.0,21.0,,10462.0,1.0,0.0,1.0,1144.0,1925.0,1.0,444558.0
50%,6367.0,41.0,,11236.0,1.0,0.0,1.0,1352.0,1938.0,1.0,565000.0
75%,10351.0,69.0,,11413.0,1.0,0.0,1.0,1680.0,1955.0,1.0,755000.0
max,16350.0,2720.0,,11697.0,2.0,2.0,3.0,7875.0,2018.0,1.0,2000000.0


In [48]:
#Linear Regression
from sklearn.linear_model import LinearRegression

linear_reg = LinearRegression()
features = ['GROSS_SQUARE_FEET','YEAR_BUILT','SALE_PRICE']
x_train = df_train[features]
x_val = val[features]

from sklearn.impute import SimpleImputer
imputer =SimpleImputer()
x_train_imputed = imputer.fit_transform(x_train)
x_val_imputed = imputer.transform(x_val)


linear_reg.fit(x_train_imputed, y_train)
linear_reg.predict(x_val_imputed)

array([ 390000.,  770000.,  859000.,  361000., 1048800.,  475000.,
        630000.,  395000.,  380000.,  800000.,  380000.,  530000.,
        240000.,  360000.,  800000.,  855000.,  700000.,  728888.,
        890000.,  318000.,  800000.,  485000.,  485000.,  470000.,
        607000.,  330000.,  695000., 1218300.,  539000., 1153000.,
        350000.,  100000.,  495000.,  499000.,  315000.,  499500.,
        895000.,  575000., 1400000.,  822500.,  669000.,  480000.,
        499931.,  806000.,  500000.,  517000.,  300000.,  675000.,
        450000., 1756000., 1100000.,  710000.,  180000.,  700000.,
        412000.,  695000.,  699000.,  499999.,  750000.,  290000.,
        680000.,  335000.,  412600.,  475000.,  130000.,  803800.,
        800000.,  960000.,  721939.,  750000.,  585000.,  380000.,
        550000.,  530000.,  508500.,  575000.,  490000., 1552831.,
        999999.,  480000.,  800000.,  912500., 1120413.,  805000.,
        940000.,  840000.,  679000.,  216489.,  522000.,  4500

In [26]:
pd.Series(linear_reg.coef_, features)

GROSS_SQUARE_FEET   -1.310189e-13
YEAR_BUILT          -3.727463e-12
SALE_PRICE           1.000000e+00
dtype: float64

In [27]:
test_case = [[6000,2016,550000]]
linear_reg.predict(test_case)

array([550000.])

In [28]:
#Logistic Regression
# Import the feature selector utility
from sklearn.feature_selection import SelectKBest, f_regression

# Create the selector object with the best k=1 features
selector = SelectKBest(score_func=f_regression, k=1)

# Run the selector on the training data
x_train_selected = selector.fit_transform(x_train, y_train)

# Find the features that was selected
selected_mask = selector.get_support()
all_features = x_train.columns
selected_feature = all_features[selected_mask]

print('The selected feature: ', selected_feature[0])


The selected feature:  GROSS_SQUARE_FEET


In [29]:
#Imports
import category_encoders as ce 
from sklearn.impute import SimpleImputer 
from sklearn.linear_model import LogisticRegressionCV
from sklearn.preprocessing import StandardScaler

  import pandas.util.testing as tm


In [30]:
x_train.head()
x_val.head()

Unnamed: 0,GROSS_SQUARE_FEET,YEAR_BUILT,SALE_PRICE
18857,1771.0,1950.0,390000
2590,1831.0,1950.0,770000
1381,1840.0,1920.0,859000
2630,930.0,1925.0,361000
10944,1249.0,1950.0,1048800


In [31]:
encoder = ce.OneHotEncoder(use_cat_names= True)
x_train_encoded = encoder.fit_transform(x_train)
x_val_encoded = encoder.transform(x_val)
x_train_encoded.head()

  elif pd.api.types.is_categorical(cols):


Unnamed: 0,GROSS_SQUARE_FEET,YEAR_BUILT,SALE_PRICE
10088,1360.0,1935.0,930000
10420,2025.0,1965.0,180000
5287,1321.0,1901.0,425000
20137,1995.0,1930.0,730000
6344,1296.0,1940.0,450000


In [32]:
x_val_encoded.head()

Unnamed: 0,GROSS_SQUARE_FEET,YEAR_BUILT,SALE_PRICE
18857,1771.0,1950.0,390000
2590,1831.0,1950.0,770000
1381,1840.0,1920.0,859000
2630,930.0,1925.0,361000
10944,1249.0,1950.0,1048800


In [40]:
imputer = SimpleImputer(strategy='mean')
x_train_imputed = imputer.fit_transform(x_train_encoded)
x_val_imputed = imputer.transform(x_val_encoded)

In [41]:
y_train_encoded = encoder.fit_transform(y_train)
y_val_encoded = encoder.transform(y_val)
y_train_encoded.head()

  elif pd.api.types.is_categorical(cols):


Unnamed: 0,SALE_PRICE
10088,930000
10420,180000
5287,425000
20137,730000
6344,450000


In [43]:
y_train_imputed = imputer.fit_transform(y_train_encoded)
y_val_imputed = imputer.transform(y_val_encoded)

In [34]:
x_train_imputed[:5]

array([[  1360.,   1935., 930000.],
       [  2025.,   1965., 180000.],
       [  1321.,   1901., 425000.],
       [  1995.,   1930., 730000.],
       [  1296.,   1940., 450000.]])

In [35]:
pd.DataFrame(x_train_imputed, columns = x_train_encoded.columns)

Unnamed: 0,GROSS_SQUARE_FEET,YEAR_BUILT,SALE_PRICE
0,1360.0,1935.0,930000.0
1,2025.0,1965.0,180000.0
2,1321.0,1901.0,425000.0
3,1995.0,1930.0,730000.0
4,1296.0,1940.0,450000.0
...,...,...,...
2368,1368.0,1960.0,580000.0
2369,1152.0,1950.0,780000.0
2370,1528.0,1920.0,562500.0
2371,1656.0,1940.0,994000.0


In [36]:
scaler = StandardScaler()
x_train_scaled = scaler.fit_transform(x_train_imputed)
x_val_scaled = scaler.transform(x_val_imputed)

In [37]:
x_train_scaled[:15]

array([[-0.17933456, -0.31522865,  1.01523943],
       [ 0.98876094,  0.82028355, -1.50646551],
       [-0.2478394 , -1.60214247, -0.68270856],
       [ 0.9360649 , -0.50448068,  0.34278478],
       [-0.29175277, -0.12597662, -0.59865173],
       [-0.79763473,  0.63103151, -0.45071171],
       [ 1.04672658, -0.88298475, -0.6658972 ],
       [ 0.80432481,  0.44177948, -1.00212452],
       [-0.69224265, -0.69373271, -0.80038813],
       [-1.01368848, -0.12597662, -1.42308113],
       [-0.24959594, -0.88298475, -0.22880167],
       [ 0.36167809, -0.69373271,  0.29235068],
       [ 1.70367051,  2.75065428,  4.30764129],
       [-2.56822158, -0.20167743, -1.02431553],
       [-0.82222621, -1.60214247,  1.06567353]])

In [50]:
y_train_scaled= scaler.fit_transform(y_train_imputed)
y_val_scaled = scaler.transform(y_val_imputed)

In [None]:
y_train_scaled = y_train_scaled[:15]

In [52]:
model = LogisticRegressionCV()
model.fit(x_train_scaled, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logist

ValueError: ignored

In [None]:
y_pred = model.predict(x_val_scaled)
accuracy_score(y_val, y_pred)