<a href="https://colab.research.google.com/github/james-caldwell1981/DS-Unit-2-Linear-Models/blob/master/module2-regression-2/LS_DS_212_assignment_Caldwell_James_TL_Hadi.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Lambda School Data Science

*Unit 2, Sprint 1, Module 2*

---

# Regression 2

## Assignment

You'll continue to **predict how much it costs to rent an apartment in NYC,** using the dataset from renthop.com.

- [ ] Do train/test split. Use data from April & May 2016 to train. Use data from June 2016 to test.
- [ ] Engineer at least two new features. (See below for explanation & ideas.)
- [ ] Fit a linear regression model with at least two features.
- [ ] Get the model's coefficients and intercept.
- [ ] Get regression metrics RMSE, MAE, and $R^2$, for both the train and test data.
- [ ] What's the best test MAE you can get? Share your score and features used with your cohort on Slack!
- [ ] As always, commit your notebook to your fork of the GitHub repo.


#### [Feature Engineering](https://en.wikipedia.org/wiki/Feature_engineering)

> "Some machine learning projects succeed and some fail. What makes the difference? Easily the most important factor is the features used." — Pedro Domingos, ["A Few Useful Things to Know about Machine Learning"](https://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf)

> "Coming up with features is difficult, time-consuming, requires expert knowledge. 'Applied machine learning' is basically feature engineering." — Andrew Ng, [Machine Learning and AI via Brain simulations](https://forum.stanford.edu/events/2011/2011slides/plenary/2011plenaryNg.pdf) 

> Feature engineering is the process of using domain knowledge of the data to create features that make machine learning algorithms work. 

#### Feature Ideas
- Does the apartment have a description?
- How long is the description?
- How many total perks does each apartment have?
- Are cats _or_ dogs allowed?
- Are cats _and_ dogs allowed?
- Total number of rooms (beds + baths)
- Ratio of beds to baths
- What's the neighborhood, based on address or latitude & longitude?

## Stretch Goals
- [ ] If you want more math, skim [_An Introduction to Statistical Learning_](http://faculty.marshall.usc.edu/gareth-james/ISL/ISLR%20Seventh%20Printing.pdf),  Chapter 3.1, Simple Linear Regression, & Chapter 3.2, Multiple Linear Regression
- [ ] If you want more introduction, watch [Brandon Foltz, Statistics 101: Simple Linear Regression](https://www.youtube.com/watch?v=ZkjP5RJLQF4)
(20 minutes, over 1 million views)
- [ ] Add your own stretch goal(s) !

In [1]:
%%capture
import sys

# If you're on Colab:
if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Applied-Modeling/master/data/'
    !pip install category_encoders==2.*

# If you're working locally:
else:
    DATA_PATH = '../data/'
    
# Ignore this Numpy warning when using Plotly Express:
# FutureWarning: Method .ptp is deprecated and will be removed in a future version. Use numpy.ptp instead.
import warnings
warnings.filterwarnings(action='ignore', category=FutureWarning, module='numpy')

In [2]:
import numpy as np
import pandas as pd

# Read New York City apartment rental listing data
df = pd.read_csv(DATA_PATH+'apartments/renthop-nyc.csv')
assert df.shape == (49352, 34)

# Remove the most extreme 1% prices,
# the most extreme .1% latitudes, &
# the most extreme .1% longitudes
df = df[(df['price'] >= np.percentile(df['price'], 0.5)) & 
        (df['price'] <= np.percentile(df['price'], 99.5)) & 
        (df['latitude'] >= np.percentile(df['latitude'], 0.05)) & 
        (df['latitude'] < np.percentile(df['latitude'], 99.95)) &
        (df['longitude'] >= np.percentile(df['longitude'], 0.05)) & 
        (df['longitude'] <= np.percentile(df['longitude'], 99.95))]

In [3]:
#import modules

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn import metrics
import plotly.express as px

In [4]:
#look at the data
print(df.shape)
df.head()

(48817, 34)


Unnamed: 0,bathrooms,bedrooms,created,description,display_address,latitude,longitude,price,street_address,interest_level,elevator,cats_allowed,hardwood_floors,dogs_allowed,doorman,dishwasher,no_fee,laundry_in_building,fitness_center,pre-war,laundry_in_unit,roof_deck,outdoor_space,dining_room,high_speed_internet,balcony,swimming_pool,new_construction,terrace,exclusive,loft,garden_patio,wheelchair_access,common_outdoor_space
0,1.5,3,2016-06-24 07:54:24,A Brand New 3 Bedroom 1.5 bath ApartmentEnjoy ...,Metropolitan Avenue,40.7145,-73.9425,3000,792 Metropolitan Avenue,medium,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,1.0,2,2016-06-12 12:19:27,,Columbus Avenue,40.7947,-73.9667,5465,808 Columbus Avenue,low,1,1,0,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,1.0,1,2016-04-17 03:26:41,"Top Top West Village location, beautiful Pre-w...",W 13 Street,40.7388,-74.0018,2850,241 W 13 Street,high,0,0,1,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,1.0,1,2016-04-18 02:22:02,Building Amenities - Garage - Garden - fitness...,East 49th Street,40.7539,-73.9677,3275,333 East 49th Street,low,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,1.0,4,2016-04-28 01:32:41,Beautifully renovated 3 bedroom flex 4 bedroom...,West 143rd Street,40.8241,-73.9493,3350,500 West 143rd Street,low,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [5]:
#check created dtype

df['created'].dtype

dtype('O')

In [6]:
#turn 'created' column into datetime object

df['created'] = pd.to_datetime(df['created'], infer_datetime_format=True)

In [7]:
#verify datetime object creation success

df['created'].dtype

dtype('<M8[ns]')

In [8]:
#get all entries from 2016

train_test_df = df[df['created'].dt.year == 2016]
train_test_df.head()

Unnamed: 0,bathrooms,bedrooms,created,description,display_address,latitude,longitude,price,street_address,interest_level,elevator,cats_allowed,hardwood_floors,dogs_allowed,doorman,dishwasher,no_fee,laundry_in_building,fitness_center,pre-war,laundry_in_unit,roof_deck,outdoor_space,dining_room,high_speed_internet,balcony,swimming_pool,new_construction,terrace,exclusive,loft,garden_patio,wheelchair_access,common_outdoor_space
0,1.5,3,2016-06-24 07:54:24,A Brand New 3 Bedroom 1.5 bath ApartmentEnjoy ...,Metropolitan Avenue,40.7145,-73.9425,3000,792 Metropolitan Avenue,medium,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,1.0,2,2016-06-12 12:19:27,,Columbus Avenue,40.7947,-73.9667,5465,808 Columbus Avenue,low,1,1,0,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,1.0,1,2016-04-17 03:26:41,"Top Top West Village location, beautiful Pre-w...",W 13 Street,40.7388,-74.0018,2850,241 W 13 Street,high,0,0,1,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,1.0,1,2016-04-18 02:22:02,Building Amenities - Garage - Garden - fitness...,East 49th Street,40.7539,-73.9677,3275,333 East 49th Street,low,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,1.0,4,2016-04-28 01:32:41,Beautifully renovated 3 bedroom flex 4 bedroom...,West 143rd Street,40.8241,-73.9493,3350,500 West 143rd Street,low,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [9]:
#get only entries from april, may, and june

april_2016 = train_test_df[train_test_df['created'].dt.month == 4]
may_2016 = train_test_df[train_test_df['created'].dt.month == 5]
june_2016 = train_test_df[train_test_df['created'].dt.month == 6]
train_test_df = april_2016.append(may_2016)
train_test_df = train_test_df.append(june_2016)

In [10]:
#verify data
print(train_test_df['created'].head())
print(train_test_df['created'].tail())

2   2016-04-17 03:26:41
3   2016-04-18 02:22:02
4   2016-04-28 01:32:41
5   2016-04-19 04:24:47
6   2016-04-27 03:19:56
Name: created, dtype: datetime64[ns]
49305   2016-06-16 04:20:46
49310   2016-06-21 06:25:35
49320   2016-06-02 13:24:18
49332   2016-06-06 01:22:44
49347   2016-06-02 05:41:05
Name: created, dtype: datetime64[ns]


In [11]:
corr_df = df.corr()

In [12]:
corr_df['price'].sort_values(ascending=False)

price                   1.000000
bathrooms               0.687296
bedrooms                0.535503
doorman                 0.276215
laundry_in_unit         0.271195
dining_room             0.242911
fitness_center          0.228775
dishwasher              0.223899
elevator                0.207169
terrace                 0.145973
outdoor_space           0.142146
balcony                 0.139140
swimming_pool           0.134513
no_fee                  0.132240
roof_deck               0.122929
garden_patio            0.103672
hardwood_floors         0.101503
high_speed_internet     0.090269
wheelchair_access       0.072517
new_construction        0.071431
dogs_allowed            0.060401
cats_allowed            0.051453
common_outdoor_space    0.011517
loft                    0.007100
exclusive              -0.013251
laundry_in_building    -0.019417
pre-war                -0.029122
latitude               -0.036286
longitude              -0.251004
Name: price, dtype: float64

In [13]:
y = train_test_df['price']

In [14]:
X = train_test_df[['bathrooms', 'bedrooms']]

In [15]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.25, random_state=42)

In [16]:
y_test[:5]

3391      2725
4045      2500
6384      1500
13624     2495
47700    13080
Name: price, dtype: int64

In [17]:
X_test[:5]

Unnamed: 0,bathrooms,bedrooms
3391,1.0,1
4045,1.0,2
6384,1.0,0
13624,1.0,2
47700,3.0,3


In [19]:
model = LinearRegression()

In [20]:
model.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [21]:
y_pred = model.predict(X_test)

In [22]:
[y_pred[i] for i in range(0, 5, 1)]

[2950.698854926981,
 3331.2412530281676,
 2570.1564568257936,
 3331.2412530281676,
 7935.060447776633]

In [23]:
mae = metrics.mean_absolute_error(y_test, y_pred)

In [24]:
mae

831.538317110766

In [25]:
average_price = y_pred.mean()
round(average_price, 2)

3596.44

In [26]:
df.head()

Unnamed: 0,bathrooms,bedrooms,created,description,display_address,latitude,longitude,price,street_address,interest_level,elevator,cats_allowed,hardwood_floors,dogs_allowed,doorman,dishwasher,no_fee,laundry_in_building,fitness_center,pre-war,laundry_in_unit,roof_deck,outdoor_space,dining_room,high_speed_internet,balcony,swimming_pool,new_construction,terrace,exclusive,loft,garden_patio,wheelchair_access,common_outdoor_space
0,1.5,3,2016-06-24 07:54:24,A Brand New 3 Bedroom 1.5 bath ApartmentEnjoy ...,Metropolitan Avenue,40.7145,-73.9425,3000,792 Metropolitan Avenue,medium,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,1.0,2,2016-06-12 12:19:27,,Columbus Avenue,40.7947,-73.9667,5465,808 Columbus Avenue,low,1,1,0,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,1.0,1,2016-04-17 03:26:41,"Top Top West Village location, beautiful Pre-w...",W 13 Street,40.7388,-74.0018,2850,241 W 13 Street,high,0,0,1,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,1.0,1,2016-04-18 02:22:02,Building Amenities - Garage - Garden - fitness...,East 49th Street,40.7539,-73.9677,3275,333 East 49th Street,low,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,1.0,4,2016-04-28 01:32:41,Beautifully renovated 3 bedroom flex 4 bedroom...,West 143rd Street,40.8241,-73.9493,3350,500 West 143rd Street,low,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [27]:
df.columns

Index(['bathrooms', 'bedrooms', 'created', 'description', 'display_address',
       'latitude', 'longitude', 'price', 'street_address', 'interest_level',
       'elevator', 'cats_allowed', 'hardwood_floors', 'dogs_allowed',
       'doorman', 'dishwasher', 'no_fee', 'laundry_in_building',
       'fitness_center', 'pre-war', 'laundry_in_unit', 'roof_deck',
       'outdoor_space', 'dining_room', 'high_speed_internet', 'balcony',
       'swimming_pool', 'new_construction', 'terrace', 'exclusive', 'loft',
       'garden_patio', 'wheelchair_access', 'common_outdoor_space'],
      dtype='object')

In [28]:
apt_feat_list = ['elevator',
                 'hardwood_floors',
                 'doorman',
                 'dishwasher',
                 'no_fee',
                 'laundry_in_building',
                 'fitness_center',
                 'pre-war',
                 'laundry_in_unit',
                 'roof_deck',
                 'outdoor_space',
                 'dining_room',
                 'high_speed_internet',
                 'balcony',
                 'swimming_pool',
                 'new_construction',
                 'terrace',
                 'exclusive',
                 'loft',
                 'garden_patio',
                 'wheelchair_access',
                 'common_outdoor_space']

In [29]:
df['features'] = df[apt_feat_list].sum(axis=1)

In [30]:
df['pets'] = df[['cats_allowed', 'dogs_allowed']].sum(axis=1)

In [31]:
df['features'].head()

0    0
1    3
2    3
3    2
4    1
Name: features, dtype: int64

In [32]:
df['pets'].head()

0    0
1    2
2    0
3    0
4    0
Name: pets, dtype: int64

In [33]:
df[df['pets'] > 0] = 1

In [34]:
df['pets'].head()

0    0
1    1
2    0
3    0
4    0
Name: pets, dtype: int64

In [35]:
X = df[['features', 'pets']]
y = df['price']

In [36]:
X_test, X_train, y_test, y_train = train_test_split(X, y, test_size = .25, random_state = 42)

In [37]:
model2 = LinearRegression()

In [38]:
model2.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [39]:
y_pred = model2.predict(X_test)

In [40]:
mae = metrics.mean_absolute_error(y_test, y_pred)

In [41]:
mae

564.3296151415583

In [42]:
print(model2.coef_)
print(model2.intercept_)

[  215.97655238 -2939.17194171]
2724.1953893278464


In [43]:
r2 = metrics.r2_score(y_test, y_pred)

In [44]:
r2

0.7099027763704916

In [45]:
px.scatter(data_frame=None, x=y_pred, y=y_test, )

In [46]:
X = df[['bathrooms', 'bedrooms', 'doorman', 'laundry_in_unit']]
y = df['price']

X_test, X_train, y_test, y_train = train_test_split(X, y, test_size = .25, random_state = 42)

model3 = LinearRegression()

model3.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [47]:
y_pred = model3.predict(X_test)

mae = metrics.mean_absolute_error(y_test, y_pred)
mae

816.087446777156

In [48]:
print(model3.coef_)
print(model3.intercept_)

[ 2552.92074938   600.12775624  -234.79065557 -1976.24887446]
-464.50658514898987


In [49]:
r2 = metrics.r2_score(y_test, y_pred)
r2

0.6753807042852598

In [50]:
px.scatter(data_frame=None, x=y_pred, y=y_test, )

Model 2, which incorporates all apartment features, seems to have the best result at just over .709 r2 versus model 3 which incorporated 4 features yielding an r2 metric of .675