# Housing Data Analysis

## Overview
We analyzed housing data from King County, WA and built a linear regression model to predict the price of a house from a given set of inputs. Our analysis of the data and our intermediate models indicated that **variables** were the most important features for predicting the price of a house.

## Business Understanding
We were tasked with finding a way to predict the approximate price of a house from a list of its features, so that our client can avoid being overcharged when buying a house.

## Data Exploration and Cleaning

In [1]:
# Import relevant functions and libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import statsmodels.api as sm
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

In [2]:
df = pd.read_csv('data/final_clean_data.csv')
df = df.drop(columns=['Unnamed: 0','date'])

In [3]:
x = df[['sqft_living', 'sqft_lot', 'floors', 'sqft_basement', 'yr_built', 'bedrooms', 'bathrooms', 'grade']]
y = df['price']
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state = 23)

***Data exploration and cleaning here***

## Final Model

Our final model used **variables** and accounted for **r2**% of the variance within the data. This model is limited by the dataset, since we filtered to only include houses with between **lower bound** and **higher bound** bedrooms, and only houses outside of **filtered zipcodes**. ***Include other filters***

In [4]:
cat_cols = ['grade']
encoder = OneHotEncoder(handle_unknown='error',drop='first',categories='auto')
ct = ColumnTransformer(transformers=[('ohe', encoder, cat_cols)],remainder='passthrough',sparse_threshold=0)
ct.fit(x_train)
x_train_enc = ct.transform(x_train)

In [5]:
scaler = StandardScaler()
scaler.fit(x_train_enc)
x_train_scaled = scaler.transform(x_train_enc)

In [6]:
lr = LinearRegression()
lr = lr.fit(x_train_scaled,y_train)
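The share of variance the model explains is usually reported on the held-out split rather than on the training data. The following is a minimal sketch of that check, assuming the same encode, scale, and fit sequence as the cells above. It runs on small synthetic data (not the King County dataset), and every `demo_`-prefixed name is hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the housing data (assumption: NOT the real dataset).
rng = np.random.default_rng(23)
n_rows = 400
demo = pd.DataFrame({
    "sqft_living": rng.uniform(500, 4000, n_rows),
    "grade": rng.choice(["6 Low Average", "7 Average", "8 Good"], n_rows),
})
grade_bonus = demo["grade"].map(
    {"6 Low Average": 0, "7 Average": 50_000, "8 Good": 120_000}
)
demo["price"] = 200 * demo["sqft_living"] + grade_bonus + rng.normal(0, 20_000, n_rows)

x_demo_train, x_demo_test, y_demo_train, y_demo_test = train_test_split(
    demo[["sqft_living", "grade"]], demo["price"], test_size=0.25, random_state=23
)

# Fit the encoder, scaler, and model on the training split only.
demo_ct = ColumnTransformer(
    transformers=[("ohe", OneHotEncoder(drop="first"), ["grade"])],
    remainder="passthrough",
    sparse_threshold=0,
)
demo_scaler = StandardScaler()
x_demo_train_scaled = demo_scaler.fit_transform(demo_ct.fit_transform(x_demo_train))
demo_lr = LinearRegression().fit(x_demo_train_scaled, y_demo_train)

# Reuse the fitted transformers on the test split (transform only, no refit).
x_demo_test_scaled = demo_scaler.transform(demo_ct.transform(x_demo_test))
demo_pred = demo_lr.predict(x_demo_test_scaled)

demo_r2 = r2_score(y_demo_test, demo_pred)
demo_mae = mean_absolute_error(y_demo_test, demo_pred)
demo_rmse = np.sqrt(mean_squared_error(y_demo_test, demo_pred))
print(f"R^2: {demo_r2:.3f}  MAE: {demo_mae:,.0f}  RMSE: {demo_rmse:,.0f}")
```

In the notebook itself, the same three metric calls could be applied to `x_test` and `y_test` after passing `x_test` through the fitted `ct` and `scaler`.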

In [7]:
mean_living= x_train['sqft_living'].mean()
mean_lot = x_train['sqft_lot'].mean()
median_floors = x_train['floors'].median()
mean_basement = x_train['sqft_basement'].mean()
median_yr = x_train['yr_built'].median()
median_bedrooms = x_train['bedrooms'].median()
median_bathrooms = x_train['bathrooms'].median()
mode_grade = x_train['grade'].mode().values[0]

In [15]:
def take_inputs():
    # input() returns strings, so convert the numeric answers before they reach the model
    living_val = float(input("Enter the square foot living area of the house:\t"))
    lot_val = float(input("Enter the square foot lot area of the house:\t"))
    floors_val = float(input("Enter the number of floors the house has:\t"))
    basement_val = float(input("Enter the square foot basement area of the house:\t"))
    yr_val = float(input("Enter the year the house was built:\t"))
    bedrooms_val = float(input("Enter the number of bedrooms in the house:\t"))
    bathrooms_val = float(input("Enter the number of bathrooms in the house:\t"))
    grade_val = input("Enter the grade of the house:\t")
    
    
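    # Loose substring match against the known grade labels, most frequent first.
    # Note that a short entry such as "1" is read as "10 Very Good".
    known_grades = list(x_train['grade'].value_counts().index)
    for unique_grade in known_grades:
        if grade_val in unique_grade:
            print(f"Grade read as:\t{unique_grade}\n")
            grade_val = unique_grade
            break
    else:
        # No label matched: fall back to the most common grade in the training data
        grade_val = mode_grade
        print(f"Value not in model, using '{mode_grade}' in place.")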
    return [living_val, lot_val, floors_val, basement_val, yr_val, bedrooms_val, bathrooms_val, grade_val]

In [16]:
def predict_price(fitted_ct,
                  fitted_scaler,
                  fitted_lr,
                  sqft_living = mean_living,
                  sqft_lot = mean_lot,
                  floors = median_floors,
                  sqft_basement = mean_basement,
                  yr_built = median_yr,
                  bedrooms = median_bedrooms,
                  bathrooms = median_bathrooms,
                  grade = mode_grade
                 ):
    '''
    Takes in information about a house and uses the linear regression model, column transformer, and scaler passed to it
    to predict the value of a house matching the values passed in.
    If a value is not passed to the function it will use a measure of central tendency depending on the column.
    '''
    # create a single row dataframe to test the model on and get the price prediction
    test_df = pd.DataFrame({'sqft_living': [sqft_living],
                            'sqft_lot': [sqft_lot],
                            'floors': [floors],
                            'sqft_basement': [sqft_basement],
                            'yr_built': [yr_built],
                            'bedrooms': [bedrooms],
                            'bathrooms': [bathrooms],
                            'grade': [grade]
                           })
    print("Input data:")
    display(test_df)
    # encode categorical values
    test_df_enc = fitted_ct.transform(test_df)
    
    # scale data
    test_df_scaled = fitted_scaler.transform(test_df_enc)
    
    # run the linear regression and return the prediction
    prediction = fitted_lr.predict(test_df_scaled)
    
    print(F"\nPredicted price of this house:\t{int(prediction[0])}.")
    return
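Because `predict_price` needs the column transformer, scaler, and regression passed in together and applied in a fixed order, one possible alternative is to bundle them in a scikit-learn `Pipeline`, so a single object is fitted and then called for predictions. This is a sketch only, not the approach used in this notebook. The toy values and the names `toy_x`, `toy_y`, `price_pipeline`, and `new_house` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy training data (hypothetical values, not the King County dataset).
toy_x = pd.DataFrame({
    "sqft_living": [900.0, 1500.0, 2100.0, 2800.0, 1200.0, 3300.0],
    "grade": ["6 Low Average", "7 Average", "8 Good",
              "8 Good", "7 Average", "6 Low Average"],
})
toy_y = pd.Series([250_000, 400_000, 560_000, 700_000, 340_000, 720_000])

# One object that encodes, scales, and predicts in the same order as the cells above.
price_pipeline = Pipeline(steps=[
    ("encode", ColumnTransformer(
        transformers=[("ohe", OneHotEncoder(drop="first"), ["grade"])],
        remainder="passthrough",
        sparse_threshold=0,
    )),
    ("scale", StandardScaler()),
    ("model", LinearRegression()),
])
price_pipeline.fit(toy_x, toy_y)

# Predicting for a new single-row dataframe applies all three steps at once.
new_house = pd.DataFrame({"sqft_living": [1800.0], "grade": ["7 Average"]})
pipeline_prediction = price_pipeline.predict(new_house)
print(f"Predicted price: {int(pipeline_prediction[0])}")
```

With this layout, the prediction function would only need one fitted argument instead of three, which removes the risk of passing mismatched transformers.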

In [20]:
user_input = take_inputs()
function_input = [ct,scaler,lr,*user_input]
predict_price(*function_input)

Enter the square foot living area of the house:	1
Enter the square foot lot area of the house:	1
Enter the number of floors the house has:	1
Enter the square foot basement area of the house:	1
Enter the year the house was built:	1
Enter the number of bedrooms in the house:	1
Enter the number of bathrooms in the house:	1
Enter the grade of the house:	1
Grade read as:	10 Very Good

Input data:

| | sqft_living | sqft_lot | floors | sqft_basement | yr_built | bedrooms | bathrooms | grade |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 10 Very Good |

Predicted price of this house:	7076963.
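The sample run above passes a 1 square foot house built in year 1 and still receives a price, because a linear regression extrapolates to any input without complaint. One way to make such cases visible is to compare each numeric input with the range seen during training and warn when it falls outside. The helper below is a hypothetical sketch (`out_of_range_columns` and the toy ranges are not part of the notebook).

```python
import pandas as pd


def out_of_range_columns(train_df, row, numeric_cols):
    """Return {column: (value, train_min, train_max)} for inputs outside the training range."""
    flagged = {}
    for col in numeric_cols:
        low, high = train_df[col].min(), train_df[col].max()
        value = float(row[col])
        if not (low <= value <= high):
            flagged[col] = (value, low, high)
    return flagged


# Hypothetical training ranges, for illustration only.
toy_train = pd.DataFrame({
    "sqft_living": [800, 1500, 2400, 3100],
    "yr_built": [1950, 1978, 1999, 2012],
})
toy_row = {"sqft_living": 1, "yr_built": 1985}

flagged = out_of_range_columns(toy_train, toy_row, ["sqft_living", "yr_built"])
for col, (value, low, high) in flagged.items():
    print(f"Warning: {col}={value} is outside the training range [{low}, {high}]")
```

In the notebook, `x_train` and the list returned by `take_inputs()` could play the roles of `toy_train` and `toy_row`, with the warning printed before the prediction.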


## Visualizations
**To be filled in later**

## Conclusions