Please fill out:
* Student name: 
* Student pace: self paced / part time / full time
* Scheduled project review date/time: 
* Instructor name: 
* Blog post URL:



# Phase 2 Project

### Importing Data

In [None]:
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('seaborn')
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant
from statsmodels.formula.api import ols
from statsmodels.regression.linear_model import OLS
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split
import scipy.stats as stats
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

df = pd.read_csv('Data/kc_house_data.csv')
df.head()

#### Formatting Cell

In [None]:
pd.set_option('display.max_rows', 1000) #change the amount of rows displayed

## Glossary

### Column Names and descriptions for Kings County Data Set
(source: Data/column_names.md)
* **id** - unique identifier for a house
* **date** - date the house was sold
* **price** - price is the prediction target
* **bedrooms** - number of bedrooms per house
* **bathrooms** - number of bathrooms/bedrooms
* **sqft_living** - square footage of the home
* **sqft_lot** - square footage of the lot
* **floors** - total floors (levels) in house
* **waterfront** - House which has a view to a waterfront
* **view** - Has been viewed
* **condition** - How good the condition is ( Overall )
* **grade** - overall grade given to the housing unit, based on King County grading system
* **sqft_above** - square footage of house apart from basement
* **sqft_basement** - square footage of the basement
* **yr_built** - Built Year
* **yr_renovated** - Year when house was renovated
* **zipcode** - zip
* **lat** - Latitude coordinate
* **long** - Longitude coordinate
* **sqft_living15** - The square footage of interior housing living space for the nearest 15 neighbors
* **sqft_lot15** - The square footage of the land lots of the nearest 15 neighbors

### Building Condition Explanation
https://info.kingcounty.gov/assessor/esales/Glossary.aspx?type=r#d (accessed 12/6/2021)

Relative to age and grade. Coded 1-5.

1 = Poor- Worn out. Repair and overhaul needed on painted surfaces, roofing, plumbing, heating and numerous functional inadequacies. Excessive deferred maintenance and abuse, limited value-in-use, approaching abandonment or major reconstruction; reuse or change in occupancy is imminent. Effective age is near the end of the scale regardless of the actual chronological age.

2 = Fair- Badly worn. Much repair needed. Many items need refinishing or overhauling, deferred maintenance obvious, inadequate building utility and systems all shortening the life expectancy and increasing the effective age.

3 = Average- Some evidence of deferred maintenance and normal obsolescence with age in that a few minor repairs are needed, along with some refinishing. All major components still functional and contributing toward an extended life expectancy. Effective age and utility is standard for like properties of its class and usage.

4 = Good- No obvious maintenance required but neither is everything new. Appearance and utility are above the standard and the overall effective age will be lower than the typical property.

5 = Very Good- All items well maintained, many having been overhauled and repaired as they have shown signs of wear, increasing the life expectancy and lowering the effective age with little deterioration or obsolescence evident with a high degree of utility.


### Building Grade Explanation
https://info.kingcounty.gov/assessor/esales/Glossary.aspx?type=r#d (accessed 12/6/2021)


Represents the construction quality of improvements. Grades run from grade 1 to 13. Generally defined as:

1-3 Falls short of minimum building standards. Normally cabin or inferior structure.

4 Generally older, low quality construction. Does not meet code.

5 Low construction costs and workmanship. Small, simple design.

6 Lowest grade currently meeting building code. Low quality materials and simple designs.

7 Average grade of construction and design. Commonly seen in plats and older sub-divisions.

8 Just above average in construction and design. Usually better materials in both the exterior and interior finish work.

9 Better architectural design with extra interior and exterior design and quality.

10 Homes of this quality generally have high quality features. Finish work is better and more design quality is seen in the floor plans. Generally have a larger square footage.

11 Custom design and higher quality finish work with added amenities of solid woods, bathroom fixtures and more luxurious options.

12 Custom design and excellent builders. All materials are of the highest quality and all conveniences are present.

13 Generally custom designed and built. Mansion level. Large amount of highest quality cabinet work, wood trim, marble, entry ways etc.

# Data Cleaning

## Dropping Unnecessary Columns

In [None]:
df = df.drop(columns=['id', 'date', 'view', 'lat', 'long', 'yr_renovated'])
df.head()

## Checking Data Types

In [None]:
df.info()

### Fixing sqft_basement
- slicing out all records with a '?' and calculating the correct value using other known fields.

In [None]:
unknown_basements = df[df['sqft_basement'] == '?']
known_basements = df[df['sqft_basement'] != '?']

print('Unknown Basement:',(len(unknown_basements)))
print('Known Basement:',(len(known_basements)))

In [None]:
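#working on a copy avoids a SettingWithCopyWarning when assigning the new column
unknown_basements = unknown_basements.copy()
unknown_basements['sqft_basement'] = unknown_basements['sqft_living'] - unknown_basements['sqft_above']

#DataFrame.append was removed in pandas 2.0; pd.concat does the same job
cleaned_df = pd.concat([known_basements, unknown_basements])

#changing to float so that decimals are in the same format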
cleaned_df['sqft_basement'] = cleaned_df['sqft_basement'].astype(float)
cleaned_df['sqft_above'] = cleaned_df['sqft_above'].astype(float)

cleaned_df['sqft_basement'].value_counts().head()
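
The same repair can also be sketched in vectorized form, without splitting the frame in two. The small frame below is made-up data that only borrows the column names used above; `pd.to_numeric(..., errors='coerce')` turns each '?' into NaN, and the gaps are then filled with `sqft_living - sqft_above`.

```python
import pandas as pd

# Toy data (hypothetical values) with the same column names as the notebook
toy = pd.DataFrame({
    'sqft_living': [2000, 1500, 1800],
    'sqft_above': [1500, 1500, 1000],
    'sqft_basement': ['500.0', '?', '?'],
})

# '?' cannot be parsed as a number, so errors='coerce' turns it into NaN
basement = pd.to_numeric(toy['sqft_basement'], errors='coerce')

# Fill the gaps with living area minus above-ground area
toy['sqft_basement'] = basement.fillna(toy['sqft_living'] - toy['sqft_above'])

print(toy['sqft_basement'].tolist())  # [500.0, 0.0, 800.0]
```

This version also leaves the column as float, so no separate `astype(float)` step is needed for `sqft_basement`.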

### Changing Zip Code to Category

In [None]:
cleaned_df['zipcode'] = cleaned_df['zipcode'].astype(str)
cleaned_df['zipcode'].value_counts().head()

### Dropping Bedroom Outliers

In [None]:
#dropping outliers
cleaned_df = cleaned_df.sort_values('bedrooms', ascending=False).reset_index()
cleaned_df = cleaned_df.drop([0,1,2])
cleaned_df.head(5)

In [None]:
#dropping index
cleaned_df = cleaned_df.drop(['index'], axis=1)

In [None]:
cleaned_df.info()

### Exploring Data with Scatter Plot

In [None]:
#using scatter plot to look for linear relationships
pd.plotting.scatter_matrix(cleaned_df, figsize = [20,20]);
plt.show()

### Analysis:
At first glance, the following variables seem to have linear relationships:
- price with bedrooms, sqft_above, & sqft_basement.
    - price also seems to have a linear relationship with categorical variable 'grade'.
- bedrooms with bathrooms, sqft_living, sqft_above, & sqft_basement
- sqft_living and sqft_above have the closest linear relationship
    - They are very similar data points. I may need to eliminate one to prevent multicollinearity.

The following variables seem to be categorical:
- floors
- waterfront
- condition
- zip code (not shown because I have already made it an object)

Ordinal Variables:
- bedrooms
- bathrooms



### What To Do with the Ordinal Values

I am going to treat grade as a continuous variable, as it has a very linear relationship with many features, including price.

## Checking for Null Values

In [None]:
cleaned_df.isna().sum()

## Fixing Waterfront

In [None]:
waterfront_cleaned = cleaned_df['waterfront'].fillna(0) 
cleaned_df['waterfront'] = waterfront_cleaned
cleaned_df.isna().sum()

# Exploratory Data Analysis

In [None]:
cleaned_df.describe()

In [None]:
cleaned_df.hist(figsize = (20,18));

Analysis:
- Price is very skewed. I will need to fix this as it is my target variable.

## Analysis of key variables against the Target (price) using jointplots

In [None]:
sns.jointplot(x='bedrooms', y='price', data=cleaned_df, kind='reg');

<u>Bedrooms</u>: While this is an ordinal variable, it behaves more like a categorical than a continuous variable. 
7 bedrooms isn't necessarily better than 2 bedrooms, it all depends on the house itself. I should
one-hot-encode this as a categorical when I get to that step.

In [None]:
sns.jointplot(x='bathrooms', y='price', data=cleaned_df, kind='reg');

<u>Bathrooms</u>: Unlike bedrooms, bathrooms behave more like a continuous variable than a categorical one, so I will treat it as such.

In [None]:
sns.jointplot(x='sqft_living', y='price', data=cleaned_df, kind='reg');

<u>Sqft_living</u>: This seems to be a very linear relationship. This makes sense: the bigger the house is, the more likely it is to be expensive.

In [None]:
sns.jointplot(x='sqft_lot', y='price', data=cleaned_df, kind='reg');

<u>sqft_lot</u>: Lot size has a slight correlation with the price of a house, but there are a lot of outliers, especially with little to no lot size. It will be hard to use this as a predictor.

In [None]:
sns.jointplot(x='floors', y='price', data=cleaned_df, kind='reg');

<u>Floors</u>: Floors is another ordinal that behaves more like a categorical value than a continuous one and will be treated as such.

In [None]:
sns.jointplot(x='waterfront', y='price', data=cleaned_df, kind='reg');

<u>Waterfront</u>: There appears to be a slight linear relationship between price and being on the waterfront.

In [None]:
sns.jointplot(x='condition', y='price', data=cleaned_df, kind='reg');

<u>Condition</u>: If treated strictly as a continuous variable, condition doesn't have much of an effect on price, so I won't use it as a continuous predictor.

In [None]:
sns.jointplot(x='grade', y='price', data=cleaned_df, kind='reg');

<u>Grade</u>: Grade has a fairly linear relationship with price, with a little noise. I should keep it as a continuous variable. The relationship looks like it could be improved with some cleaning, removing outliers, etc.

In [None]:
sns.jointplot(x='sqft_above', y='price', data=cleaned_df, kind='reg');

<u>Sqft_Above</u>: Based on their descriptions in the glossary, this is almost exactly the same thing as sqft_living. I will almost definitely need to remove one of these two variables and use the other due to multicollinearity. I will determine which to use when I check for that.

In [None]:
sns.jointplot(x='sqft_basement', y='price', data=cleaned_df, kind='reg');

<u>Sqft Basement</u>: Basement size has a slight linear relationship with price. But I also see that there are many outliers that have very little size that are skewing the results.

In [None]:
sns.jointplot(x='yr_built', y='price', data=cleaned_df, kind='reg');

<u>Year Built</u>: Appears to have no relationship with Price and can likely be excluded from analysis.

In [None]:
sns.jointplot(x='sqft_living15', y='price', data=cleaned_df, kind='reg');

<u>sqft_living15</u>: The size of houses nearby does have a linear relationship with price. It looks fairly close to sqft_living and sqft_above, so there's a strong chance of multicollinearity here as well.

In [None]:
sns.jointplot(x='sqft_lot15', y='price', data=cleaned_df, kind='reg');

<u>sqft_lot15</u>: Looks identical to sqft_lot, which I likely won't end up using. This will likely be dropped as well. If I use either, it would be just that one, as they are very likely to be multicollinear.

## Drop Columns

In [None]:
#let's go ahead and remove the features that aren't useful, per my analysis of the jointplots.
#condition stays for now because the correlation check and one-hot encoding below still use it.
cleaned_df = cleaned_df.drop(['sqft_lot', 'yr_built', 'sqft_lot15'], axis=1)
cleaned_df.head(1)

In [None]:
feats = ['sqft_living', 'sqft_above', 'sqft_living15','bedrooms','bathrooms', 'grade', 'condition',
         'sqft_basement']
corr = cleaned_df[feats].corr()
corr

In [None]:
sns.heatmap(corr, center=0, annot=True);

# Initial Modeling 

In [None]:
list(cleaned_df.columns)

In [None]:
# Defining the problem
outcome = 'price'
x_cols = list(cleaned_df.columns)
x_cols.remove(outcome)

In [None]:
train, test = train_test_split(cleaned_df)

print(len(train), len(test))

In [None]:
print(len(train), len(test))
train.head()

In [None]:
test.head()

In [None]:
# Fitting the actual model
predictors = '+'.join(x_cols)
formula = outcome + '~' + predictors
model = ols(formula=formula, data=train).fit()
model.summary()

In [None]:
# Extract the p-value table from the summary and use it to subset our features
summary = model.summary()
p_table = summary.tables[1]
p_table = pd.DataFrame(p_table.data)
p_table.columns = p_table.iloc[0]
p_table = p_table.drop(0)
p_table = p_table.set_index(p_table.columns[0])
p_table['P>|t|'] = p_table['P>|t|'].astype(float)
x_cols = list(p_table[p_table['P>|t|'] < 0.05].index)
x_cols.remove('Intercept')
print(len(p_table), len(x_cols))
print(x_cols[:5])
p_table.head()

### Refining Model

In [None]:
#removing problem zipcodes
#df_1 = cleaned_df[cleaned_df['zipcode'] == '98002']
#df_2 = cleaned_df[cleaned_df['zipcode'] == '98003']
#df_3 = cleaned_df[cleaned_df['zipcode'] == '98004']
#df_4 = cleaned_df[cleaned_df['zipcode'] == '98005']


#print('df_1:', len(df_1))
#print('df_2:', len(df_2))
#print('df_3:', len(df_3))
#print('df_4:', len(df_4))

Removing the problem zipcodes removes 4% of the data from the data set.
- 964 records removed from 21,596

In [None]:
#problem_zips = pd.concat([df_1, df_2, df_3, df_4])
#problem_zips

In [None]:
#cleaned_df['zipcode'].value_counts().head()

In [None]:
#cleaned_df= cleaned_df.drop(problem_zips.index)
#cleaned_df.head(1)

In [None]:
# Defining the problem
#outcome = 'price'
#x_cols = list(cleaned_df.columns)
#x_cols.remove(outcome)

In [None]:
#train, test = train_test_split(cleaned_df)

In [None]:
#print(len(train), len(test))
#train.head()

In [None]:
# Fitting the actual model
#predictors = '+'.join(x_cols)
#formula = outcome + '~' + predictors
#model = ols(formula=formula, data=train).fit()
#model.summary()

### Cleaning & Encoding

In [None]:
#dropping sqft_lot15 because of its p-value
#encoded_df = cleaned_df_2.drop(cleaned_df_2[['sqft_lot15']], axis=1)
#encoded_df.head(1)
encoded_df = cleaned_df

In [None]:
subs = [(' ', '_'),('.',''),("'",""),('™', ''), ('®',''),
        ('+','plus'), ('½','half'), ('-','_')
       ]
def col_formatting(col):
    for old, new in subs:
        col = col.replace(old,new)
    return col

In [None]:
encoded_df.columns = [col_formatting(col) for col in encoded_df.columns]

In [None]:
list(encoded_df.columns)

In [None]:
#one-hot encoding
feats = ['bedrooms','floors', 'waterfront', 'condition','zipcode']
#feats = ['floors', 'waterfront','zipcode'] #treating bedrooms as a continuous variable helps the model
#feats = ['zipcode']
encoded_df[feats] = encoded_df[feats].astype(str)
encoded_df = pd.get_dummies(encoded_df, drop_first=True)
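
To see what `drop_first=True` does, here is a small sketch on made-up data (the values are hypothetical, only the column names come from this notebook). Each string column becomes one indicator column per category, minus the first category, which becomes the baseline absorbed by the intercept.

```python
import pandas as pd

# Toy data: one numeric column and one categorical column stored as strings
toy = pd.DataFrame({
    'price': [300000, 450000, 500000],
    'floors': ['1.0', '2.0', '1.5'],
})

# Only object/string columns are expanded; numeric columns pass through
encoded = pd.get_dummies(toy, drop_first=True)

print(list(encoded.columns))  # ['price', 'floors_1.5', 'floors_2.0']
```

The category '1.0' has no column of its own: a row with zeros in both `floors_1.5` and `floors_2.0` is a one-floor house.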

In [None]:
#encoded_df.info()

In [None]:
encoded_df.columns = [col_formatting(col) for col in encoded_df.columns]

In [None]:
list(encoded_df.columns)

### Normalizing Data

In [None]:
def norm_feat(series):
    return (series - series.mean())/series.std()

In [None]:
df_norm = norm_feat(encoded_df)
df_norm.head()
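
As a quick sanity check on what `norm_feat` does, the sketch below applies it to a tiny made-up frame and confirms that every column ends up with mean 0 and standard deviation 1. The values are hypothetical.

```python
import pandas as pd

def norm_feat(series):
    # z-score: subtract the mean, divide by the (sample) standard deviation
    return (series - series.mean()) / series.std()

# Toy frame with one continuous column and one dummy column
toy = pd.DataFrame({
    'sqft_above': [1000.0, 2000.0, 3000.0],
    'waterfront_1': [0.0, 0.0, 1.0],
})

scaled = norm_feat(toy)
print(scaled.mean().round(6).tolist())
print(scaled.std().round(6).tolist())
```

Because the function is applied to the whole frame, the dummy columns and the target are rescaled too, so the coefficients of a model fit on this data are in standard-deviation units rather than dollars.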

### Checking Multicollinearity with VIF scores

In [None]:
x_cols = list(df_norm.columns)
x_cols.remove(outcome)

In [None]:
X = df_norm[x_cols]
vif = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
list(zip(x_cols, vif))

### Analysis:
- sqft_living, sqft_above, and sqft_basement all have infinite VIF scores.
- Several bedroom values, and condition values also have very high VIF scores.
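
An infinite VIF is what you would expect if one column is an exact linear combination of others, for example if sqft_living equals sqft_above plus sqft_basement. The sketch below illustrates this on synthetic data with a hand-written VIF, defined as 1 / (1 - R²) from regressing each column on the rest. `vif_by_hand` is a hypothetical helper, not the statsmodels function, and it assumes an intercept in each auxiliary regression.

```python
import numpy as np

def vif_by_hand(X):
    """VIF per column: 1 / (1 - R^2) from regressing that column
    on all other columns plus an intercept."""
    X = np.asarray(X, dtype=float)
    out = []
    for i in range(X.shape[1]):
        y = X[:, i]
        others = np.delete(X, i, axis=1)
        A = np.column_stack([np.ones(len(y)), others])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1 - (resid @ resid) / ((y - y.mean()) ** 2).sum()
        # A perfect fit means the column is fully explained by the others
        out.append(np.inf if np.isclose(r2, 1.0) else 1 / (1 - r2))
    return out

# Synthetic data: living is exactly above + basement
rng = np.random.default_rng(0)
above = rng.normal(1800, 400, 200)
basement = rng.normal(300, 150, 200)
living = above + basement
other = rng.normal(2, 0.5, 200)  # unrelated column

X = np.column_stack([living, above, basement, other])
vifs = vif_by_hand(X)
print(vifs)
```

The three area columns come out infinite while the unrelated column stays near 1, which is the same pattern seen in the scores above.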

### Using old method to find collinear pairs

In [None]:
cc_df = df_norm.corr().abs().stack().reset_index().sort_values(0, ascending=False)

cc_df['pairs'] = list(zip(cc_df.level_0, cc_df.level_1))

cc_df.set_index(['pairs'], inplace = True)

cc_df.drop(columns=['level_1', 'level_0'], inplace = True)

# cc for correlation coefficient
cc_df.columns = ['cc']

cc_df.drop_duplicates(inplace=True)

cc_df[(cc_df.cc>.70) & (cc_df.cc<1)]
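
Since this pair search is repeated several times in the notebook, it could be wrapped in a helper. The sketch below is one possible version on made-up data; `high_corr_pairs` is a hypothetical name. It masks the correlation matrix to its strict upper triangle so that each pair shows up once, instead of relying on `drop_duplicates`, which removes rows by coefficient value and could in principle drop two different pairs that happen to share the same coefficient.

```python
import numpy as np
import pandas as pd

def high_corr_pairs(df, threshold=0.70):
    """Return column pairs whose absolute correlation exceeds threshold."""
    corr = df.corr().abs()
    # Strict upper triangle: excludes the diagonal and mirrored pairs
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    return pairs[pairs > threshold].sort_values(ascending=False)

# Toy data (hypothetical values)
toy = pd.DataFrame({
    'sqft_living': [1000, 1500, 2000, 2500, 3000],
    'sqft_above': [900, 1400, 2100, 2400, 3100],
    'bedrooms': [3, 2, 4, 2, 3],
})

pairs = high_corr_pairs(toy)
print(pairs)
```

Only the `('sqft_living', 'sqft_above')` pair passes the 0.70 threshold here.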

Analysis: sqft_living and sqft_living15 are causing a lot of multicollinearity. Dropping them will resolve most of the issues. I will also drop grade, as it is so highly correlated with sqft_above, which is going to be one of my most important predictors.

In [None]:
#df_norm = df_norm.drop(['sqft_living', 'grade', 'sqft_living15'], axis=1)
df_norm = df_norm.drop(['grade'], axis=1)
df_norm.head(1)

In [None]:
cc_df = df_norm.corr().abs().stack().reset_index().sort_values(0, ascending=False)

cc_df['pairs'] = list(zip(cc_df.level_0, cc_df.level_1))

cc_df.set_index(['pairs'], inplace = True)

cc_df.drop(columns=['level_1', 'level_0'], inplace = True)

# cc for correlation coefficient
cc_df.columns = ['cc']

cc_df.drop_duplicates(inplace=True)

cc_df[(cc_df.cc>.70) & (cc_df.cc<1)]

Condition_3 and Condition_4 mean 'average' and 'good' according to the glossary. Maybe merging them as a common value would be a good idea.

In [None]:
average = cleaned_df[cleaned_df['condition'] == '3']
good = cleaned_df[cleaned_df['condition'] == '4']

That took care of the correlated pairs. Now let's check the VIF scores again and see if it resolved the infinite values.

In [None]:
x_cols = list(df_norm.columns)
x_cols.remove(outcome)

In [None]:
X = df_norm[x_cols]
vif = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
list(zip(x_cols, vif))

I am happy with these VIF scores. There is still a decent correlation between sqft_above and bathrooms, but it is within the limits that I have set, and they are two predictors that I want to keep if at all possible.

### Running the model again

In [None]:
# Defining the problem
outcome = 'price'
x_cols = list(df_norm.columns)
x_cols.remove(outcome)

In [None]:
train, test = train_test_split(df_norm)

In [None]:
print(len(train), len(test))
train.head()

In [None]:
test.head()

In [None]:
# Fitting the actual model
predictors = '+'.join(x_cols)
formula = outcome + '~' + predictors
model = ols(formula=formula, data=train).fit()
model.summary()

Model Analysis: 
- R2 is 78.5%. I would ideally like to see it at 80% or above, but this is very close.
- Prob(F-statistic) is 0, which means the model as a whole is statistically significant.
- Kurtosis is still really high. I will need to refine it so that it is closer to normal (3)
- Model is skewed. Still need to fix that.

# Checking Assumptions

## Checking Normality

In [None]:
fig = sm.graphics.qqplot(model.resid, dist=stats.norm, line='45', fit=True)

There are more errors as price increases. This needs to be refined so that the model is accurate. This model cannot be used without further refinement.

## Checking Homoscedasticity 

In [None]:
plt.scatter(model.predict(train[x_cols]), model.resid)
plt.plot(model.predict(train[x_cols]), [0 for i in range(len(train))])

Funnel-shaped. Need to correct.
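
A funnel shape means the residual spread grows with the fitted value. One rough way to put a number on that, sketched here on synthetic data rather than the housing data, is to compare the residual standard deviation in the lower and upper halves of the fitted values. This is only an informal split comparison, not a formal test.

```python
import numpy as np

# Synthetic data where the noise grows with x (heteroscedastic by design)
rng = np.random.default_rng(1)
x = rng.uniform(1, 10, 500)
y = 3 * x + rng.normal(0, 1, 500) * x

# Ordinary least squares with an intercept
A = np.column_stack([np.ones_like(x), x])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
fitted = A @ coef
resid = y - fitted

# Split observations into low and high fitted values and compare spreads
order = np.argsort(fitted)
half = len(order) // 2
low_spread = resid[order[:half]].std()
high_spread = resid[order[half:]].std()
ratio = high_spread / low_spread

print(round(ratio, 2))
```

A ratio close to 1 would suggest roughly constant variance; a ratio well above 1, as in this synthetic example, corresponds to the funnel seen in the plot.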

## Dealing with Outliers

I want to switch back to encoded_df so that I can see what the actual price is, instead of the normalized price. I will drop the same columns that I dropped from df_norm so that they contain the same data.

In [None]:
encoded_df = encoded_df.drop(['sqft_living', 'grade', 'sqft_living15'], axis=1)
#encoded_df = encoded_df.drop(['sqft_living', 'grade'], axis=1)
encoded_df.head(1)

In [None]:
encoded_df.price.hist()

In [None]:
for i in range(80,100):
    q = i/100
    print("{} percentile: {}".format(q, encoded_df.price.quantile(q=q)))

In [None]:
for i in range(0,20):
    q = i/100
    print("{} percentile: {}".format(q, encoded_df.price.quantile(q=q)))
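
The two loops above can also be expressed with a single `quantile` call that takes a list of levels. The sketch below uses a made-up price series, and the 1st and 99th percentile cut-offs are purely illustrative, not the thresholds chosen in this notebook.

```python
import numpy as np
import pandas as pd

# Toy "prices": the integers 1..100
price = pd.Series(np.arange(1, 101))

# One call returns both cut-offs, indexed by the quantile level
cutoffs = price.quantile([0.01, 0.99])
low, high = cutoffs.loc[0.01], cutoffs.loc[0.99]

# Keep only the rows strictly between the two cut-offs
trimmed = price[(price > low) & (price < high)]
removed = (len(price) - len(trimmed)) / len(price)

print(low, high, removed)
```

Printing the removed share right away makes it easy to judge whether a given pair of cut-offs throws away too much data.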

In [None]:
df = encoded_df

orig_tot = len(df)
df = df[df.price < 1500000] # Subsetting to remove extreme outliers
df = df[df.price > 149000].copy() # .copy() avoids a SettingWithCopyWarning below
print('Percent removed:', (orig_tot - len(df))/orig_tot)
df['price'] = np.log(df['price']) # Applying a log transformation
train, test = train_test_split(df)

# Refit model using only the columns that are still present in df
x_cols = [col for col in df.columns if col != outcome]
predictors = '+'.join(x_cols)
formula = outcome + "~" + predictors
final_model = ols(formula=formula, data=train).fit()
final_model.summary()


Model Analysis: Removing some of the price outliers on each end improved the model. (R2 is now at 83%)
- This only removed 2.7% of the data, which is acceptable.

In [None]:
#making sure the changes are saved as final_df
final_df = df
final_df.head()

In [None]:
final_df.price.hist()

This histogram now looks to have a normal distribution. This is a good sign.
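
The effect of the log transformation can also be checked numerically with the sample skewness. The sketch below uses synthetic right-skewed "prices" drawn from a lognormal distribution, which is an assumption made for illustration and not a claim about the real data.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed prices (lognormal), not the King County data
rng = np.random.default_rng(42)
price = pd.Series(np.exp(rng.normal(13, 0.5, 5000)))

raw_skew = price.skew()          # clearly positive for a long right tail
log_skew = np.log(price).skew()  # close to 0 after the transformation

print(round(raw_skew, 2), round(log_skew, 2))
```

Keep in mind that a model fit on log(price) predicts log dollars; applying `np.exp` to a prediction brings it back to the dollar scale.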

## Normalizing final_df and running model on it

In [None]:
final_df.describe()

In [None]:
final_df_norm = norm_feat(final_df)
final_df_norm.head()

In [None]:
# Defining the problem
outcome = 'price'
x_cols = list(final_df_norm.columns)
x_cols.remove(outcome)

In [None]:
train, test = train_test_split(final_df_norm)

In [None]:
print(len(train), len(test))
train.head()

In [None]:
test.head()

In [None]:
# Fitting the actual model
predictors = '+'.join(x_cols)
formula = outcome + '~' + predictors
model = ols(formula=formula, data=train).fit()
model.summary()

## Checking Assumptions Again

In [None]:
fig = sm.graphics.qqplot(model.resid, dist=stats.norm, line='45', fit=True)

Normality is definitely improved, but isn't where it should be yet. 

In [None]:
plt.scatter(model.predict(train[x_cols]), model.resid)
plt.plot(model.predict(train[x_cols]), [0 for i in range(len(train))])

Homoscedasticity: Also improved, but still not fully looking like it needs to.

If I attempt to remove more outliers from price, the model R2 score drops, and there is no difference with the assumption checks. I will need to refine other variables to improve my model.

## Checking Multicollinearity

In [None]:
x_cols = list(final_df_norm.columns)
x_cols.remove(outcome)

X = final_df_norm[x_cols]
vif = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
list(zip(x_cols, vif))

# Alternative Approach: Building from the Ground Up

In [None]:
alt_df = encoded_df[['price', 'sqft_above', 'bedrooms', 'bathrooms', 'sqft_basement']]
alt_df

In [None]:
#price is the outcome, so it must not appear among the predictors
x_cols = ['sqft_above', 'bedrooms', 'bathrooms', 'sqft_basement']

In [None]:
train, test = train_test_split(alt_df)

print(len(train), len(test))

In [None]:
# Refit model with subset features
predictors = '+'.join(x_cols)
formula = outcome + "~" + predictors
alt_model = ols(formula=formula, data=train).fit()
alt_model.summary()

In [None]:
x_cols = list(alt_df.columns)
x_cols.remove(outcome)

In [None]:
X = alt_df[x_cols]
vif = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
list(zip(x_cols, vif))

In [None]:
alt_df = alt_df.drop(['bathrooms'], axis=1)

In [None]:
x_cols = ['price', 'sqft_above', 'bedrooms', 'sqft_basement']

In [None]:
x_cols = list(alt_df.columns)
x_cols.remove(outcome)

In [None]:
X = alt_df[x_cols]
vif = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
list(zip(x_cols, vif))

In [None]:
fig = sm.graphics.qqplot(alt_model.resid, dist=stats.norm, line='45', fit=True)

In [None]:
#fittedvalues are the model's predictions on the training data it was fit on
plt.scatter(alt_model.fittedvalues, alt_model.resid)
plt.axhline(0)

Maybe I should drop bedrooms. They can be covered in my outcome as just "extra square feet added to the house". A bathroom is a more unique feature that I would like to capture.
- The normality assumption looks great now, but the model is still not homoscedastic.
- Let me check multicolinear pairs before dropping anything...

## Normalizing and Re-running

In [None]:
alt_df_norm = norm_feat(alt_df)
alt_df_norm.head()

In [None]:
# Refit model on the normalized data
train, test = train_test_split(alt_df_norm)
predictors = '+'.join(x_cols)
formula = outcome + "~" + predictors
alt_model = ols(formula=formula, data=train).fit()
alt_model.summary()

In [None]:
cc_df = alt_df.corr().abs().stack().reset_index().sort_values(0, ascending=False)

cc_df['pairs'] = list(zip(cc_df.level_0, cc_df.level_1))

cc_df.set_index(['pairs'], inplace = True)

cc_df.drop(columns=['level_1', 'level_0'], inplace = True)

# cc for correlation coefficient
cc_df.columns = ['cc']

cc_df.drop_duplicates(inplace=True)

cc_df[(cc_df.cc>.5) & (cc_df.cc<1)]

It appears that bathrooms are the bigger problem.

In [None]:
#removing bathrooms to see what happens (no-op if it was already dropped above).
alt_df = alt_df.drop(columns=['bathrooms'], errors='ignore')
alt_df.head()

In [None]:
x_cols = ['sqft_above', 'bedrooms', 'sqft_basement']

In [None]:
#checking VIF again to see what effect removing bathrooms had.
X = alt_df[x_cols]
vif = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
list(zip(x_cols, vif))

In [None]:
cc_df = alt_df.corr().abs().stack().reset_index().sort_values(0, ascending=False)

cc_df['pairs'] = list(zip(cc_df.level_0, cc_df.level_1))

cc_df.set_index(['pairs'], inplace = True)

cc_df.drop(columns=['level_1', 'level_0'], inplace = True)

# cc for correlation coefficient
cc_df.columns = ['cc']

cc_df.drop_duplicates(inplace=True)

cc_df[(cc_df.cc>.6) & (cc_df.cc<1)]

In [None]:
#removing bedrooms (and restoring bathrooms) to see what happens.
alt_df = encoded_df[['price', 'sqft_above', 'bathrooms', 'sqft_basement']]
alt_df.head()

In [None]:
x_cols = ['sqft_above', 'bathrooms', 'sqft_basement']

In [None]:
#checking VIF again to see what effect removing bedrooms had.
X = alt_df[x_cols]
vif = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
list(zip(x_cols, vif))

That didn't make much of a difference. 

# Taking what I learned back to the old model
- I am going to try dropping bathrooms and/or bedrooms from the old model to see if it fixes some of my issues.

In [None]:
df_norm.head()

In [None]:
df_norm = df_norm.drop(['bathrooms'], axis=1)
df_norm.head(1)

In [None]:
train, test = train_test_split(df_norm)

In [None]:
print(len(train), len(test))
train.head()

In [None]:
x_cols = list(df_norm.columns)
x_cols.remove(outcome)

In [None]:
X = df_norm[x_cols]
vif = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
list(zip(x_cols, vif))

In [None]:
vif_scores = list(zip(x_cols, vif))
x_cols = [x for x,vif in vif_scores if vif < 5]
print(len(vif_scores), len(x_cols))