## New Notebook For Fitting a Model

There will be some more cleaning and exploring to do before finding the model, but the previous EDA notebook should provide a good start with a clean dataset. 

### Overall Goals
- **Find highest R-squared regression model that can predict prices for real estate agency.**
- Decide if any outliers need to be dropped.
- Decide what to do with skewed target variable data.
- Use one-hot encoding for categorical variables.
- Decide which variables to drop given multicollinearity.
- Decide which variables to use given stepwise selection methods & recursive feature elimination.
- Consider log transformations on data that is not normally distributed.
- Check other tests for linearity assumptions; drop variabels that don't meet standards
- Consider scaling or normalizing.
- Validation, test and training the model.

In [35]:
#Loading the needed packages, libraries, functions and variables from the EDA notebook.
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import warnings
warnings.filterwarnings("ignore")

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

In [36]:
#Original DataFrame
%store -r df_original

In [37]:
#Cleaned DataFrame
%store -r df_clean

In [38]:
# For consistent randomness
np.random.seed(42)

## Categorical Variables

In [64]:
df_clean_dumm = df_clean.copy()

In [65]:
df_clean_dumm.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21597 entries, 0 to 21596
Data columns (total 20 columns):
 #   Column         Non-Null Count  Dtype         
---  ------         --------------  -----         
 0   date           21597 non-null  datetime64[ns]
 1   price          21597 non-null  float64       
 2   bedrooms       21597 non-null  int64         
 3   bathrooms      21597 non-null  float64       
 4   sqft_living    21597 non-null  int64         
 5   sqft_lot       21597 non-null  int64         
 6   floors         21597 non-null  float64       
 7   waterfront     21597 non-null  object        
 8   view           21597 non-null  object        
 9   condition      21597 non-null  int64         
 10  grade          21597 non-null  int64         
 11  zipcode        21597 non-null  int64         
 12  lat            21597 non-null  float64       
 13  long           21597 non-null  float64       
 14  sqft_living15  21597 non-null  int64         
 15  sqft_lot15     2159

In [67]:
df_clean_dumm['zipcode'].value_counts()

98103    602
98038    589
98115    583
98052    574
98117    553
        ... 
98102    104
98010    100
98024     80
98148     57
98039     50
Name: zipcode, Length: 70, dtype: int64

In [124]:
# Get zipcode dummies
zipcode_dummies = pd.get_dummies(df_clean_dumm['zipcode'], drop_first=True)
df_clean_dumm = pd.concat([df_clean_dumm, zipcode_dummies], axis=1)

In [128]:
# Water front to binary
df_clean_dumm['waterfront'] = df_clean_dumm['waterfront'].replace('NO'= 0)

SyntaxError: expression cannot contain assignment, perhaps you meant "=="? (<ipython-input-128-b3c298986d97>, line 2)

In [126]:
df_clean_dumm['waterfront'].value_counts()

NO     21451
YES      146
Name: waterfront, dtype: int64

In [55]:
test_data = df_clean.loc[:,['price', 'bedrooms', 'floors', 'condition', 'sqft_living']]

## Modeling

In [122]:
from pandas.api.types import is_numeric_dtype
def only_numeric(data):
    '''returns a dataframe with only numeric values'''
    for column in data.columns:
        if is_numeric_dtype(data[column]) == False:
            data = data.drop(column, axis=1)
            print(column)
        else:
            continue
    return data

In [123]:
only_numeric(df_clean_dumm)

date
waterfront
view


Unnamed: 0,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,condition,grade,zipcode,lat,...,98146,98148,98155,98166,98168,98177,98178,98188,98198,98199
0,221900.0,3,1.00,1180,5650,1.0,2,7,98178,47.5112,...,0,0,0,0,0,0,1,0,0,0
1,538000.0,3,2.25,2570,7242,2.0,2,7,98125,47.7210,...,0,0,0,0,0,0,0,0,0,0
2,180000.0,2,1.00,770,10000,1.0,2,6,98028,47.7379,...,0,0,0,0,0,0,0,0,0,0
3,604000.0,4,3.00,1960,5000,1.0,4,7,98136,47.5208,...,0,0,0,0,0,0,0,0,0,0
4,510000.0,3,2.00,1680,8080,1.0,2,8,98074,47.6168,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
21592,360000.0,3,2.50,1530,1131,3.0,2,8,98103,47.6993,...,0,0,0,0,0,0,0,0,0,0
21593,400000.0,4,2.50,2310,5813,2.0,2,8,98146,47.5107,...,1,0,0,0,0,0,0,0,0,0
21594,402101.0,2,0.75,1020,1350,2.0,2,7,98144,47.5944,...,0,0,0,0,0,0,0,0,0,0
21595,400000.0,3,2.50,1600,2388,2.0,2,8,98027,47.5345,...,0,0,0,0,0,0,0,0,0,0


In [110]:
def my_train_test(data, target):
    data = only_numeric(data)
    y = data[target]
    X = data.drop(target, axis=1)
    X_train, X_test, y_train, y_test = train_test_split(X, y)

    return X_train, X_test, y_train, y_test

In [118]:
X_train, X_test, y_train, y_test = my_train_test(test_data, 'price')

In [119]:
print(X_train.shape)
print(X_test.shape)

print(X_train.shape[0] == y_train.shape[0])
print(X_test.shape[0] == y_test.shape[0])

(16197, 4)
(5400, 4)
True
True


In [120]:
def train_test_compare(X_tr, X_te, y_tr, y_te):
    lr = LinearRegression()
    lr.fit(X_tr, y_tr)
    train_score = lr.score(X_tr, y_tr)
    test_score = lr.score(X_te, y_te)
    return print(f'training data R2: {train_score}\n testing data R2: {test_score}')

In [121]:
train_test_compare(X_train, X_test, y_train, y_test)

training data R2: 0.5146417020649201
 testing data R2: 0.5166811231371914


## Business & Data Understanding
#### Revisiting our end goals with sombe EDA knowledge
- We want to create a tool for a real estate agency to estimate sales or purchase prices given housing info.
- This can be done with a regression model.

In [None]:
df_clean.info()

In [None]:
df_clean.corr().abs()['price'].sort_values()

high_corr_cols = ['sqft_living', 'sqft_above', 'sqft_living15', 'bathrooms', 'sqft_basement', 'bedrooms']

In [None]:
y = df_clean['price']
X = df_clean
    
reg = LinearRegression().fit(X, y)

plt.scatter(X, y, color='green')
plt.plot(X, reg.predict(X))
plt.xlabel('sqft_living')
plt.ylabel('Price');

In [None]:
for x in high_corr_cols:
    y = df_clean['price']
    X = df_clean[x]
    
    reg = LinearRegression().fit(X, y)

    plt.scatter(X, y, color='green')
    plt.plot(X, reg.predict(X))
    plt.xlabel(x)
    plt.ylabel('Price');