Lambda School Data Science

*Unit 2, Sprint 1, Module 1*

---

# Regression 1

## Assignment

You'll use another **New York City** real estate dataset. 

But now you'll **predict how much it costs to rent an apartment**, instead of how much it costs to buy a condo.

The data comes from renthop.com, an apartment listing website.

- [ ] Look at the data. Choose a feature, and plot its relationship with the target.
- [ ] Use scikit-learn for linear regression with one feature. You can follow the [5-step process from Jake VanderPlas](https://jakevdp.github.io/PythonDataScienceHandbook/05.02-introducing-scikit-learn.html#Basics-of-the-API).
- [ ] Define a function to make new predictions and explain the model coefficient.
- [ ] Organize and comment your code.

> [Do Not Copy-Paste.](https://docs.google.com/document/d/1ubOw9B3Hfip27hF2ZFnW3a3z9xAgrUDRReOEo-FHCVs/edit) You must type each of these exercises in, manually. If you copy and paste, you might as well not even do them. The point of these exercises is to train your hands, your brain, and your mind in how to read, write, and see code. If you copy-paste, you are cheating yourself out of the effectiveness of the lessons.

If your **Plotly** visualizations aren't working:
- You must have JavaScript enabled in your browser
- You probably want to use Chrome or Firefox
- You may need to turn off ad blockers
- [If you're using Jupyter Lab locally, you need to install some "extensions"](https://plot.ly/python/getting-started/#jupyterlab-support-python-35)

## Stretch Goals
- [ ] Do linear regression with two or more features.
- [ ] Read [The Discovery of Statistical Regression](https://priceonomics.com/the-discovery-of-statistical-regression/)
- [ ] Read [_An Introduction to Statistical Learning_](http://faculty.marshall.usc.edu/gareth-james/ISL/ISLR%20Seventh%20Printing.pdf), Chapter 2.1: What Is Statistical Learning?

In [1]:
import sys

# If you're on Colab:
if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Applied-Modeling/master/data/'

# If you're working locally:
else:
    DATA_PATH = '../data/'
    
# Ignore this Numpy warning when using Plotly Express:
# FutureWarning: Method .ptp is deprecated and will be removed in a future version. Use numpy.ptp instead.
import warnings
warnings.filterwarnings(action='ignore', category=FutureWarning, module='numpy')

In [2]:
# Read New York City apartment rental listing data
import pandas as pd
df = pd.read_csv(DATA_PATH+'apartments/renthop-nyc.csv')
assert df.shape == (49352, 34)

In [3]:
# Remove outliers: 
# the most extreme 1% prices,
# the most extreme .1% latitudes, &
# the most extreme .1% longitudes
df = df[(df['price'] >= 1375) & (df['price'] <= 15500) & 
        (df['latitude'] >=40.57) & (df['latitude'] < 40.99) &
        (df['longitude'] >= -74.1) & (df['longitude'] <= -73.38)]
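
The cutoffs above are hard-coded. As a rough sketch of where such numbers could come from, the snippet below trims a synthetic price column to its middle 99% with `np.percentile`. The toy data, the names `toy`, `low`, `high`, and `trimmed`, and the even split of the 1% across both tails are assumptions for illustration; none of them come from the renthop file.

```python
import numpy as np
import pandas as pd

# Synthetic, right-skewed "prices" (hypothetical data, not the renthop listings)
rng = np.random.default_rng(0)
toy = pd.DataFrame({'price': rng.lognormal(mean=8, sigma=0.5, size=10_000)})

# Assumption: "most extreme 1%" means 0.5% trimmed from each tail
low, high = np.percentile(toy['price'], [0.5, 99.5])

# Keep only the rows inside the two cutoffs
trimmed = toy[(toy['price'] >= low) & (toy['price'] <= high)]

print(len(toy), len(trimmed))
```

The same pattern could be repeated for latitude and longitude with tighter percentiles.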

In [4]:
# Import dependencies

import plotly.express as px
import numpy as np

from sklearn.preprocessing import OrdinalEncoder, StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

In [5]:
# Look at the data

df.head()

Unnamed: 0,bathrooms,bedrooms,created,description,display_address,latitude,longitude,price,street_address,interest_level,elevator,cats_allowed,hardwood_floors,dogs_allowed,doorman,dishwasher,no_fee,laundry_in_building,fitness_center,pre-war,laundry_in_unit,roof_deck,outdoor_space,dining_room,high_speed_internet,balcony,swimming_pool,new_construction,terrace,exclusive,loft,garden_patio,wheelchair_access,common_outdoor_space
0,1.5,3,2016-06-24 07:54:24,A Brand New 3 Bedroom 1.5 bath ApartmentEnjoy ...,Metropolitan Avenue,40.7145,-73.9425,3000,792 Metropolitan Avenue,medium,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,1.0,2,2016-06-12 12:19:27,,Columbus Avenue,40.7947,-73.9667,5465,808 Columbus Avenue,low,1,1,0,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,1.0,1,2016-04-17 03:26:41,"Top Top West Village location, beautiful Pre-w...",W 13 Street,40.7388,-74.0018,2850,241 W 13 Street,high,0,0,1,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,1.0,1,2016-04-18 02:22:02,Building Amenities - Garage - Garden - fitness...,East 49th Street,40.7539,-73.9677,3275,333 East 49th Street,low,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,1.0,4,2016-04-28 01:32:41,Beautifully renovated 3 bedroom flex 4 bedroom...,West 143rd Street,40.8241,-73.9493,3350,500 West 143rd Street,low,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [6]:
# Check for null values

df.isna().sum()

bathrooms                  0
bedrooms                   0
created                    0
description             1425
display_address          133
latitude                   0
longitude                  0
price                      0
street_address            10
interest_level             0
elevator                   0
cats_allowed               0
hardwood_floors            0
dogs_allowed               0
doorman                    0
dishwasher                 0
no_fee                     0
laundry_in_building        0
fitness_center             0
pre-war                    0
laundry_in_unit            0
roof_deck                  0
outdoor_space              0
dining_room                0
high_speed_internet        0
balcony                    0
swimming_pool              0
new_construction           0
terrace                    0
exclusive                  0
loft                       0
garden_patio               0
wheelchair_access          0
common_outdoor_space       0
dtype: int64

In [7]:
# Replace null values with 'Missing'

df = df.copy()  # df is a filtered slice; copy it before assigning to a column
df['display_address'] = df['display_address'].fillna('Missing')


# Drop columns that won't be used as features

df = df.drop(['created', 'description', 'street_address'], axis = 1)


# Create feature and target list

features = ['display_address']
target = ['price']


# Create feature matrix and target vector

X = df[features]
y = df[target]
X_train, X_test, y_train, y_test = train_test_split(X, y)
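
`train_test_split` shuffles the rows, so the split above changes on every run, and so does any score computed from it. A minimal sketch of a reproducible split follows. The toy arrays and the seed value 42 are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy feature matrix (10 rows, 2 columns) and target vector
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

# The same random_state gives the same shuffle, and therefore the same split
split_a = train_test_split(X_toy, y_toy, random_state=42)
split_b = train_test_split(X_toy, y_toy, random_state=42)

same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same_split, split_a[0].shape, split_a[1].shape)
```

With no `test_size` given, scikit-learn holds out 25% of the rows for testing.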

In [8]:
# Instantiate transformers

encoder = OrdinalEncoder()
scaler = StandardScaler()


# Create list of columns to encode (every non-numeric column)

encode_cols = X_train.select_dtypes(exclude = 'number').columns


# Encode categorical columns and scale all columns

X_train[encode_cols] = encoder.fit_transform(X_train[encode_cols])
X_train_scaled = pd.DataFrame(scaler.fit_transform(X_train), columns = X_train.columns)


# View the results

X_train_scaled.head()



A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy




Unnamed: 0,display_address
0,1.461414
1,-0.809316
2,-1.892724
3,0.049145
4,0.968419


In [9]:
# Plot feature 'display_address' against the target
# Include OLS trendline and color it black for contrast

fig = px.scatter(x = X_train_scaled['display_address'], y = y_train['price'], trendline = 'ols')
fig.data[1].update(line_color='black') 
fig.show()





In [10]:
# Instantiate the linear model

model = LinearRegression()


# Fit the model with transformed
# feature matrix

model.fit(X_train_scaled, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [11]:
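# Encode categorical columns and scale all columns
# NOTE: fit_transform refits the encoder and the scaler on the test set, so the
# integer codes and the scaling differ from those the model was trained on.
# Calling transform() with the transformers fitted on X_train would keep them
# consistent, but the default OrdinalEncoder raises a ValueError on addresses
# that did not appear in the training set.

X_test[encode_cols] = encoder.fit_transform(X_test[encode_cols])
X_test_scaled = pd.DataFrame(scaler.fit_transform(X_test), columns = X_test.columns)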


# View the results

X_test_scaled.head()



A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy






Unnamed: 0,display_address
0,-0.161924
1,-2.016162
2,-0.983463
3,1.386574
4,-0.376382
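
For comparison, here is a minimal sketch of transforming test data with transformers that were fitted on the training data only. It assumes scikit-learn 0.24 or later, where `OrdinalEncoder` accepts `handle_unknown='use_encoded_value'`. The two toy frames and the choice of -1 for unseen addresses are made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

# Hypothetical toy frames; 'Unseen Avenue' never appears in the training data
train = pd.DataFrame({'display_address': ['Broadway', 'Wall Street', 'Broadway']})
test = pd.DataFrame({'display_address': ['Wall Street', 'Unseen Avenue']})

# Assumption: unseen categories get the code -1 instead of raising an error
encoder = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)
scaler = StandardScaler()

# Fit on the training data only
train_encoded = encoder.fit_transform(train)
train_scaled = scaler.fit_transform(train_encoded)

# Reuse the fitted transformers on the test data (transform, not fit_transform)
test_encoded = encoder.transform(test)
test_scaled = scaler.transform(test_encoded)

print(test_encoded.ravel())
```

Here `Wall Street` keeps the code it received during training, and the unseen address maps to -1 rather than to an arbitrary new code.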


In [12]:
# Get the mean absolute error for the model
# on the testing data

y_pred = model.predict(X_test_scaled)

mae = mean_absolute_error(y_test, y_pred)
mae

1191.2842309171408
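
The assignment also asks for a function that makes new predictions and explains the model coefficient. The sketch below shows one possible shape for such a function. It uses a hypothetical toy dataset with `bedrooms` as the single feature, and `toy_model` and `predict_rent` are illustrative names rather than part of the notebook above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical toy data: rent rises by exactly $1,000 per bedroom
bedrooms = np.array([[0], [1], [2], [3]])
rent = np.array([2000, 3000, 4000, 5000])

toy_model = LinearRegression().fit(bedrooms, rent)

def predict_rent(n_bedrooms, model=toy_model):
    """Return the predicted rent and a plain-language reading of the coefficient."""
    estimate = model.predict([[n_bedrooms]])[0]
    coefficient = model.coef_[0]
    explanation = (
        f"${estimate:,.0f} estimated rent for a {n_bedrooms}-bedroom apartment. "
        f"In this model, each additional bedroom adds about ${coefficient:,.0f}."
    )
    return estimate, explanation

estimate, explanation = predict_rent(2)
print(explanation)
```

A coefficient fitted on an ordinal-encoded and scaled column such as `display_address` has no comparable plain-language reading, which is one reason a numeric feature is easier to explain.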

### Below is a multiple-feature regression with data prep functions.


In [37]:
def split_data(df, features, target):

    '''
    Splits a DataFrame into a feature matrix and target vector,
    replaces NaN values with 'Missing' in categorical columns
    and with 0 in numeric columns, and performs a train/test split.

    Arguments

    df: The dataframe for the model.
    features: A list of string representations of
              each feature column intended for use 
              in the model.
    target: A string representation of the intended
            target column of the model.

    Returns X_train, X_test, y_train, y_test.
    '''


    # Copy the data frame to avoid setting with copy warning
    df = df.copy()

    # Replace NaN values with 'Missing' in categorical columns
    # and with 0 in numeric columns.
    fill_values = {col: 0 if pd.api.types.is_numeric_dtype(df[col]) else 'Missing'
                   for col in df.columns}
    df = df.fillna(fill_values)

    # Split the Data Frame into feature matrix and target vector
    X = df[features]
    y = df[target]

    # Perform train test split
    X_train, X_test, y_train, y_test = train_test_split(X, y)

    # Print shapes of new dataframes
    print('Feature Train and Test Shapes: ', X_train.shape, X_test.shape)
    print('Target Train and Test Shapes: ', y_train.shape, y_test.shape)

    # Return the data
    return X_train, X_test, y_train, y_test

In [54]:
def prep(X, y, trans, train = None):

    '''
    Preprocess data for linear regression.
    Arguments assume pandas dataframe unless
    stated otherwise.

    Arguments
    
    X: The feature matrix
    y: The target vector
    trans: List of instantiated transformers for
           the feature matrix, in the order
           [encoder, scaler].
    train: True to fit the transformers on X
           (training data), False to reuse the
           already fitted transformers.
    
    Prints shapes of the feature matrix and target vector.

    Returns the feature matrix, the target vector,
    and the list of transformers.
    '''
    if train is None:
        raise ValueError('Please indicate training status.')
    

    # Copy data to avoid setting with copy warning.
    X = X.copy()

    # Create lists of numeric and non-numeric columns
    num_cols = X.select_dtypes(include = 'number').columns
    cat_cols = [i for i in X.columns if i not in num_cols]

    # Encode categorical columns and scale all columns
    if train:
        X[cat_cols] = trans[0].fit_transform(X[cat_cols])
        X = pd.DataFrame(trans[1].fit_transform(X), columns = X.columns, index = X.index)
    else:
        # With its default settings, OrdinalEncoder.transform raises
        # a ValueError when X contains a category that was not
        # present when the encoder was fitted.
        X[cat_cols] = trans[0].transform(X[cat_cols])
        X = pd.DataFrame(trans[1].transform(X), columns = X.columns, index = X.index)

    # Print shapes of dataframes
    print('Feature and Target Shapes: ', X.shape, y.shape)


    # Return feature matrix, target vector, and transformers
    return X, y, trans

In [47]:
# Instantiate transformers

encoder = OrdinalEncoder()
scaler = StandardScaler()


# Place transformers in list

t_list = [encoder, scaler]
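
Passing the transformers around in a list works, but scikit-learn can also chain them. The sketch below swaps in a different technique, a `Pipeline` built with `make_pipeline` and `make_column_transformer`, so that fitting and transforming always happen in the right order. The toy frames are hypothetical, and `handle_unknown='use_encoded_value'` assumes scikit-learn 0.24 or later.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

# Hypothetical toy data with one categorical and one numeric feature
train = pd.DataFrame({
    'interest_level': ['low', 'medium', 'high', 'low'],
    'bedrooms': [1, 2, 3, 4],
})
y = [2000, 3000, 4000, 5000]
test = pd.DataFrame({'interest_level': ['medium', 'unknown'], 'bedrooms': [2, 3]})

# Encode only the categorical column and pass the numeric column through
encode = make_column_transformer(
    (OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1),
     ['interest_level']),
    remainder='passthrough',
)

# fit() fits every step on the training data;
# predict() only transforms before predicting
pipeline = make_pipeline(encode, StandardScaler(), LinearRegression())
pipeline.fit(train, y)
predictions = pipeline.predict(test)

print(predictions.shape)
```

Because the pipeline owns the fitted transformers, there is no separate `train` flag to get wrong.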

In [48]:
# Create feature and target list

features = df.drop(['price'], axis = 1).columns
target = 'price'

In [49]:
# Split the data into train and test matrices and vectors

X_train, X_test, y_train, y_test = split_data(df, features, target)

Feature Train and Test Shapes:  (36613, 30) (12205, 30)
Target Train and Test Shapes:  (36613,) (12205,)


In [50]:
# Prepare training data for fitting

X_train_clean, y_train_clean, t_list = prep(X_train, y_train, t_list, True)

Index(['bathrooms', 'bedrooms', 'display_address', 'latitude', 'longitude',
       'interest_level', 'elevator', 'cats_allowed', 'hardwood_floors',
       'dogs_allowed', 'doorman', 'dishwasher', 'no_fee',
       'laundry_in_building', 'fitness_center', 'pre-war', 'laundry_in_unit',
       'roof_deck', 'outdoor_space', 'dining_room', 'high_speed_internet',
       'balcony', 'swimming_pool', 'new_construction', 'terrace', 'exclusive',
       'loft', 'garden_patio', 'wheelchair_access', 'common_outdoor_space'],
      dtype='object')
Index(['bathrooms', 'bedrooms', 'latitude', 'longitude', 'elevator',
       'cats_allowed', 'hardwood_floors', 'dogs_allowed', 'doorman',
       'dishwasher', 'no_fee', 'laundry_in_building', 'fitness_center',
       'pre-war', 'laundry_in_unit', 'roof_deck', 'outdoor_space',
       'dining_room', 'high_speed_internet', 'balcony', 'swimming_pool',
       'new_construction', 'terrace', 'exclusive', 'loft', 'garden_patio',
       'wheelchair_access', 'common_ou

In [51]:
# Instantiate the linear model

model = LinearRegression()


# Fit the model

model.fit(X_train_clean, y_train_clean)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [52]:
# Get the mean absolute error on the training data

y_pred = model.predict(X_train_clean)

mae = mean_absolute_error(y_train_clean, y_pred)
mae

694.6039802200468
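
The score above is measured on the same rows the model was fitted on, so it is likely to be optimistic compared with a score on held-out data. As a reminder of what the metric computes, here is a small sketch that works out the mean absolute error by hand and checks it against scikit-learn. The three true prices are taken from the `df.head()` output, and the three predictions are invented for illustration.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

y_true = np.array([3000, 5465, 2850])  # prices from the first rows of the data
y_hat = np.array([3200, 5000, 2900])   # hypothetical predictions

# MAE is the average of the absolute differences |y_true - y_hat|
manual_mae = np.mean(np.abs(y_true - y_hat))
sklearn_mae = mean_absolute_error(y_true, y_hat)

print(manual_mae, sklearn_mae)
```

The errors here are 200, 465, and 50 dollars, so both values come to 715 / 3.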

In [53]:
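# prep() raises a ValueError on the test set. The cause is probably not a
# dropped column: with default settings, OrdinalEncoder.transform rejects any
# category (such as a display_address) that was absent from the training data.
# Need to figure this issue out.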

# X_test_clean, y_test_clean, t_list = prep(X_test, y_test, t_list, False)
# 
# y_pred = model.predict(X_test_clean)
# 
# mae = mean_absolute_error(y_pred, y_test_clean)
# mae


ValueError: ignored

In [None]:
y_pred.shape, y_test_clean.shape, X_test_clean.shape