<a href="https://colab.research.google.com/github/maximematerno/DS-Unit-2-Regression-Classification/blob/master/module2/assignment_regression_classification_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Lambda School Data Science, Unit 2: Predictive Modeling

# Regression & Classification, Module 2

## Assignment

You'll continue to **predict how much it costs to rent an apartment in NYC,** using the dataset from renthop.com.

- [x] Do train/test split. Use data from April & May 2016 to train. Use data from June 2016 to test.
- [x] Engineer at least two new features. (See below for explanation & ideas.)
- [x] Fit a linear regression model with at least two features.
- [x] Get the model's coefficients and intercept.
- [x] Get regression metrics RMSE, MAE, and $R^2$, for both the train and test data.
- [x] What's the best test MAE you can get? Share your score and features used with your cohort on Slack!
- [x] As always, commit your notebook to your fork of the GitHub repo.


#### [Feature Engineering](https://en.wikipedia.org/wiki/Feature_engineering)

> "Some machine learning projects succeed and some fail. What makes the difference? Easily the most important factor is the features used." — Pedro Domingos, ["A Few Useful Things to Know about Machine Learning"](https://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf)

> "Coming up with features is difficult, time-consuming, requires expert knowledge. 'Applied machine learning' is basically feature engineering." — Andrew Ng, [Machine Learning and AI via Brain simulations](https://forum.stanford.edu/events/2011/2011slides/plenary/2011plenaryNg.pdf) 

> Feature engineering is the process of using domain knowledge of the data to create features that make machine learning algorithms work. 

#### Feature Ideas
- Does the apartment have a description?
- How long is the description?
- How many total perks does each apartment have?
- Are cats _or_ dogs allowed?
- Are cats _and_ dogs allowed?
- Total number of rooms (beds + baths)
- Ratio of beds to baths
- What's the neighborhood, based on address or latitude & longitude?

## Stretch Goals
- [ ] If you want more math, skim [_An Introduction to Statistical Learning_](http://faculty.marshall.usc.edu/gareth-james/ISL/ISLR%20Seventh%20Printing.pdf),  Chapter 3.1, Simple Linear Regression, & Chapter 3.2, Multiple Linear Regression
- [ ] If you want more introduction, watch [Brandon Foltz, Statistics 101: Simple Linear Regression](https://www.youtube.com/watch?v=ZkjP5RJLQF4)
(20 minutes, over 1 million views)
- [ ] Do the [Plotly Dash](https://dash.plot.ly/) Tutorial, Parts 1 & 2.
- [ ] Add your own stretch goal(s) !

In [3]:
# If you're in Colab...
import os, sys
in_colab = 'google.colab' in sys.modules

if in_colab:
    # Install required python packages:
    # pandas-profiling, version >= 2.0
    # plotly, version >= 4.0
    !pip install --upgrade pandas-profiling plotly
    
    # Pull files from Github repo
    os.chdir('/content')
    !git init .
    !git remote add origin https://github.com/LambdaSchool/DS-Unit-2-Regression-Classification.git
    !git pull origin master
    
    # Change into directory for module
    os.chdir('module1')

Collecting pandas-profiling
[?25l  Downloading https://files.pythonhosted.org/packages/2c/2f/aae19e2173c10a9bb7fee5f5cad35dbe53a393960fc91abc477dcc4661e8/pandas-profiling-2.3.0.tar.gz (127kB)
[K     |██▋                             | 10kB 11.9MB/s eta 0:00:01[K     |█████▏                          | 20kB 1.8MB/s eta 0:00:01[K     |███████▊                        | 30kB 2.6MB/s eta 0:00:01[K     |██████████▎                     | 40kB 1.7MB/s eta 0:00:01[K     |████████████▉                   | 51kB 2.1MB/s eta 0:00:01[K     |███████████████▍                | 61kB 2.5MB/s eta 0:00:01[K     |██████████████████              | 71kB 2.9MB/s eta 0:00:01[K     |████████████████████▋           | 81kB 3.3MB/s eta 0:00:01[K     |███████████████████████▏        | 92kB 3.7MB/s eta 0:00:01[K     |█████████████████████████▊      | 102kB 2.8MB/s eta 0:00:01[K     |████████████████████████████▎   | 112kB 2.8MB/s eta 0:00:01[K     |██████████████████████████████▉ | 122kB 2.8MB/

Initialized empty Git repository in /content/.git/
remote: Enumerating objects: 94, done.[K
remote: Total 94 (delta 0), reused 0 (delta 0), pack-reused 94[K
Unpacking objects: 100% (94/94), done.
From https://github.com/LambdaSchool/DS-Unit-2-Regression-Classification
 * branch            master     -> FETCH_HEAD
 * [new branch]      master     -> origin/master


In [0]:
# Ignore this Numpy warning when using Plotly Express:
# FutureWarning: Method .ptp is deprecated and will be removed in a future version. Use numpy.ptp instead.
import warnings
warnings.filterwarnings(action='ignore', category=FutureWarning, module='numpy')

In [0]:
import numpy as np
import pandas as pd
import plotly.express as px
import matplotlib.pyplot as plt

# Read New York City apartment rental listing data
df = pd.read_csv('../data/renthop-nyc.csv')
assert df.shape == (49352, 34)

# Remove the most extreme 1% prices,
# the most extreme .1% latitudes, &
# the most extreme .1% longitudes
df = df[(df['price'] >= np.percentile(df['price'], 0.5)) & 
        (df['price'] <= np.percentile(df['price'], 99.5)) & 
        (df['latitude'] >= np.percentile(df['latitude'], 0.05)) & 
        (df['latitude'] < np.percentile(df['latitude'], 99.95)) &
        (df['longitude'] >= np.percentile(df['longitude'], 0.05)) & 
        (df['longitude'] <= np.percentile(df['longitude'], 99.95))]

In [0]:
from ipywidgets import interact
import pandas as pd
from sklearn.linear_model import LinearRegression

# Read New York City apartment rental listing data
import pandas as pd
import numpy as np
df = pd.read_csv('../data/renthop-nyc.csv')
assert df.shape == (49352, 34)


# Subset:
df = df[(df['latitude']<45) & (df['latitude']>39)]
df = df[(df['longitude']<-70) & (df['longitude']>-80)]
df = df.query('price < 1000000')
df = df[(df['bedrooms']>1) & (df['bedrooms'] <7 )] 


# Data now has 24111 rows, 34 columns
assert df.shape == (24111, 34)

## Import estimator class from Scikit-Learn
from sklearn.linear_model import LinearRegression


# Instantiate class
model = LinearRegression()


# Arrange features & target
features= ['bedrooms','bathrooms']
target= 'price'
X = df[features]
y= df[target]




In [0]:
# train/test split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=11)

In [35]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score


model.fit(X_train, y_train)
y_pred_train = model.predict(X_train)

#train data intercept and coefficient
print(f"Intercept and Coefficient: {model.intercept_} {model.coef_}")

#MAE,MSE, RMSE for train data
mse = mean_squared_error(y_train, y_pred_train)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_train, y_pred_train)
r2 = r2_score(y_train, y_pred_train)

# RMSE, MAE, R2
print(f"The features for the model train: {features}\nRMSE: {rmse}\nMAE: {mae}\nR^2: {r2}")

Intercept and Coefficient: -768.1702831525563 [ 427.9750228  2964.76473509]
The features for the model train: ['bedrooms', 'bathrooms']
RMSE: 2247.75722199512
MAE: 1212.1285353170524
R^2: 0.4434763320228253


In [22]:
model.fit(X_test, y_test)
y_pred_test = model.predict(X_test)

mse = mean_squared_error(y_test, y_pred_test)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_test, y_pred_test)
r2 = r2_score(y_test, y_pred_test)


# RMSE, MAE, R2
print(f"The features for the model Test: {features}\nRMSE: {rmse}\nMAE: {mae}\nR^2: {r2}")

The features for the model Test: ['bedrooms', 'bathrooms']
RMSE: 2419.792425549494
MAE: 1215.475834101489
R^2: 0.4189656895903311
