# King County House Prices Predictions

## Background
Homeowners often struggle to understand the many factors that determine the price of their most valuable asset, their home. Why are some homes more expensive than others? It is a bit like a puzzle, and this project is about solving it. Our goal is to give homeowners a clear understanding of the determinants of house prices. The project is a collaborative effort with our stakeholder, a prominent real estate agency that guides homeowners through the process of buying and selling homes.

## Business Understanding
In King County, Washington, many households aspire to purchase homes, but information asymmetry often leaves these potential buyers navigating the market blindly. To address this, our project analyzes King County house sales data from 2014 to 2015. Our aim is to provide consultation to a real estate agency that assists households in their pursuit of homeownership.

Through a comprehensive examination of this dataset, we aim to bridge the information gap in the real estate market. Our objective is to provide a robust method for predicting house prices, enabling prospective buyers to make well-informed decisions about their property investments. In doing so, we empower both homebuyers and the real estate agency with the knowledge and insights needed to navigate the competitive King County housing landscape effectively.

### Challenges
- For homeowners, the challenge is understanding why their homes are worth a certain amount. This knowledge can help them decide if they should sell, renovate, or just enjoy their home as it is.
- The real estate agency faces a challenge too: to give homeowners precise advice, it needs to understand what drives property prices.

### Solutions
- Our solution is rooted in data analysis. We explore house prices and their underlying determinants, aiming not only to identify the factors that influence home prices but also to quantify their influence, so that homeowners can act on the results.
- By clarifying how properties are priced, we aim to enhance the services offered by the real estate agency.

### Conclusion
This project reflects our shared aim of giving homeowners data-driven insight into their property's value. We are committed to providing homeowners, buyers, and our partner agency with a clear view of the determinants of property value, culminating in a robust predictive model.

### Problem Statement
In the dynamic real estate market of King County, Washington, prospective homebuyers often face challenges when attempting to understand the multifaceted factors that determine the prices of residential properties. This information gap can leave homeowners and potential buyers navigating the housing market without clear insights into the key determinants of property values. As a result, individuals struggle to make informed decisions when buying, selling, or investing in homes.

### Objectives
Our general objective is to investigate the effect of independent variables on the price of a house.

The specific objectives are:
1. To assess correlations between independent variables and house prices.
2. To investigate the impact of highly and lowly correlated variables on house prices.
3. To develop a robust multiple linear regression model for house price prediction.


## Preliminaries

We start by importing the modules used throughout the notebook.

In [1]:
# Import the required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
from scipy import stats
from statsmodels.formula.api import ols
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.metrics import mean_squared_error,mean_absolute_error,r2_score
from sklearn.linear_model import LinearRegression

%matplotlib inline

All helper functions are also defined in this section, which keeps them in one place for reuse throughout the notebook.

In [2]:
def read_file_to_dataframe(file):
    """
    Reads a file and returns its contents as a pandas DataFrame.

    Parameters:
    - file (str): The path to the file to be read.

    Returns:
    - df (pandas.DataFrame): The contents of the file as a DataFrame,
      or None if the file extension is not supported.

    Raises:
    - None

    Examples:
    >>> read_file_to_dataframe('data.csv')
         col1   col2
    0   value1  value2
    1   value3  value4
    """
    extension = file.split(".")[-1].lower()  # lower-case so ".CSV" is handled like ".csv"
    if extension == "csv":
        df = pd.read_csv(file)
    elif extension == "xlsx" or extension == "xls":
        df = pd.read_excel(file)
    elif extension == "json":
        df = pd.read_json(file)
    elif extension == "parquet":
        df = pd.read_parquet(file)
    elif extension == "tsv" or extension == "txt":
        df = pd.read_csv(file, sep="\t")
    elif extension == 'pkl':
        df = pd.read_pickle(file)
    else:
        print("File format not supported")
        return
    return df

def dataframe_detailed(df):
    """
    Print details of the dataframe.

    Parameters:
        df (DataFrame): The dataframe to be analyzed.

    Returns:
        None
    """
    print(f"DATAFRAME SHAPE: {df.shape}\n\n")
    df.info()  # info() prints directly and returns None, so call it outside print()
    print("\n")
    print(f"DATAFRAME HEAD:\n {df.head()}\n\n")
    print(f"DATAFRAME KEY STATISTIC DESCRIPTION:\n {df.describe()}\n\n")

def map_replace_values(df, column, replacement_dict):
    """
    Replace values in a DataFrame column using a dictionary mapping.

    :param df: The DataFrame to modify.
    :type df: pandas.DataFrame
    :param column: The name of the column to replace values in.
    :type column: str
    :param replacement_dict: A dictionary mapping old values to new values.
    :type replacement_dict: dict
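    :return: The modified DataFrame with replaced values. The input DataFrame
        is changed in place, and values that are missing from
        ``replacement_dict`` become NaN because ``Series.map`` is used.
    :rtype: pandas.DataFrame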
    """
    df[column] = df[column].map(replacement_dict)
    return df

def fit_simple_linear_regression(df, dependent_variable, independent_variable):
    """
    Fits a simple linear regression model to the given dataframe.

    :param df: The pandas dataframe containing the data.
    :param dependent_variable: The name of the dependent variable column.
    :param independent_variable: The name of the independent variable column.

    :return: A tuple containing the root mean squared error (rmse),
      the mean absolute error (mae), and the model summary.
    """

    # Create and fit the simple linear regression model
    y = df[dependent_variable]
    X = df[independent_variable]
    X = sm.add_constant(X)  # Add a constant for the intercept
    model = sm.OLS(y, X).fit()

    # Calculate RMSE and MAE
    predictions = model.predict(X)
    residuals = y - predictions
    rmse = np.sqrt(np.mean(residuals**2))
    mae = np.mean(np.abs(residuals))

    # Get the model summary
    model_summary = model.summary()

    return rmse, mae, model_summary

def fit_linear_regression(df, dependent_variable, independent_variables):
    """
    Fit a linear regression model to the given data.

    Args:
        df (pandas.DataFrame): The input dataframe containing the data.
        dependent_variable (str): The name of the dependent variable.
        independent_variables (list): The list of independent variables.

    Returns:
        tuple: A tuple containing the fitted model and its summary.

    Raises:
        None

    Example:
        >>> df = pd.DataFrame({'x': [1, 2, 3], 'y': [2, 4, 6]})
        >>> fit_linear_regression(df, 'y', ['x'])
        (model, model_summary)
    """
    if len(independent_variables) == 1:
        # Simple Linear Regression
        model_formula = f"{dependent_variable} ~ {independent_variables[0]}"
        model = ols(model_formula, data=df).fit()
    else:
        # Multiple Linear Regression
        X = df[independent_variables]
        X = sm.add_constant(X)  # Add a constant for the intercept
        y = df[dependent_variable]
        model = sm.OLS(y, X).fit()

    model_summary = model.summary()

    return model, model_summary

def model_metrics(df, dependent_variable, independent_variables):
    """
    Calculate the root mean squared error (RMSE) and mean absolute error (MAE) for a linear regression model.

    Parameters:
    - df (pandas.DataFrame): The input DataFrame containing the data.
    - dependent_variable (str): The name of the dependent variable column in the DataFrame.
    - independent_variables (list): A list of names of the independent variable columns in the DataFrame.

    Returns:
    - rmse (float): The root mean squared error.
    - mae (float): The mean absolute error.
    """
    # Create and fit the linear regression model
    X = df[independent_variables]
    X = sm.add_constant(X)  # Add a constant for the intercept
    y = df[dependent_variable]
    model = sm.OLS(y, X).fit()

    # Calculate RMSE
    y_pred = model.predict(X)
    rmse = (mean_squared_error(y, y_pred)) ** 0.5

    # Calculate MAE
    mae = mean_absolute_error(y, y_pred)

    return rmse, mae

def plot_residuals_ols_mlr(ols_model, mlr_model, X_ols, X_mlr, y):
    """
    Generate residual plots for the simple (single-predictor OLS) and
    multiple (MLR) linear regression models.

    Parameters:
    ols_model (object): The OLS model object.
    mlr_model (object): The MLR model object.
    X_ols (array-like): The input data for the OLS model.
    X_mlr (array-like): The input data for the MLR model.
    y (array-like): The target variable.

    Returns:
    None
    """
    # Calculate residuals for OLS and MLR models
    residuals_ols = y - ols_model.predict(sm.add_constant(X_ols))
    residuals_mlr = y - mlr_model.predict(sm.add_constant(X_mlr))

    # Create a scatter plot of residuals vs. predicted values for OLS
    plt.figure(figsize=(12, 6))
    plt.subplot(1, 2, 1)
    plt.scatter(ols_model.predict(sm.add_constant(X_ols)),
                residuals_ols, alpha=0.5)
    plt.title("OLS Residuals vs. Predicted Values")
    plt.xlabel("Predicted Values (OLS)")
    plt.ylabel("Residuals (OLS)")
    plt.axhline(y=0, color='r', linestyle='--')

    # Create a scatter plot of residuals vs. predicted values for MLR
    plt.subplot(1, 2, 2)
    plt.scatter(mlr_model.predict(sm.add_constant(X_mlr)),
                residuals_mlr, alpha=0.5)
    plt.title("MLR Residuals vs. Predicted Values")
    plt.xlabel("Predicted Values (MLR)")
    plt.ylabel("Residuals (MLR)")
    plt.axhline(y=0, color='r', linestyle='--')

    plt.tight_layout()  # tidy the scatter-plot figure before starting a new one

    # Create a histogram of residuals for OLS
    plt.figure(figsize=(12, 6))
    plt.subplot(1, 2, 1)
    plt.hist(residuals_ols, bins=30, alpha=0.75)
    plt.title("Histogram of OLS Residuals")
    plt.xlabel("Residuals (OLS)")
    plt.ylabel("Frequency")

    # Create a histogram of residuals for MLR
    plt.subplot(1, 2, 2)
    plt.hist(residuals_mlr, bins=30, alpha=0.75)
    plt.title("Histogram of MLR Residuals")
    plt.xlabel("Residuals (MLR)")
    plt.ylabel("Frequency")

    plt.tight_layout()
    plt.show()

def calculate_rmse_mae_multi(model, X, y):
    """
    Calculate RMSE and MAE for a multiple linear regression model.

    Args:
    model: The fitted multiple linear regression model.
    X: The feature matrix.
    y: The true target values.

    Returns:
    rmse (float): Root Mean Squared Error.
    mae (float): Mean Absolute Error.
    """
    # Predict the target values using the model
    y_pred = model.predict(X)

    # Calculate RMSE
    rmse = np.sqrt(mean_squared_error(y, y_pred))

    # Calculate MAE
    mae = mean_absolute_error(y, y_pred)

    return rmse, mae

def calculate_vif(df, target_variable):
    """
    Calculate the Variance Inflation Factor (VIF) for each independent variable in a dataframe.

    Parameters:
        df (DataFrame): The input dataframe containing the independent variables.
        target_variable (str): The name of the target variable.

    Returns:
        vif_data (DataFrame): A dataframe containing the variables and their corresponding VIF values.
    """
    # Separate the target variable
    independent_vars = df.drop(columns=[target_variable])

    vif_data = pd.DataFrame()
    vif_data["Variable"] = independent_vars.columns
    vif_data["VIF"] = [variance_inflation_factor(
        independent_vars.values, i) for i in range(independent_vars.shape[1])]

    # Handle cases with high VIF values (you can choose a threshold) by setting VIF to NaN
    threshold = 5  # You can adjust this threshold based on your analysis
    vif_data.loc[vif_data["VIF"] > threshold, "VIF"] = np.nan

    return vif_data
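
One caveat on `calculate_vif`: statsmodels' `variance_inflation_factor` regresses each column on the others exactly as they are passed in and does not add an intercept itself, so raw feature columns can yield values that differ from the textbook definition. The sketch below is an illustration only, not part of the original analysis. It computes the textbook VIF, 1 / (1 - R²), with an explicit intercept using NumPy alone. The helper name `vif_with_intercept` and the synthetic data are assumptions made for this example.

```python
import numpy as np
import pandas as pd


def vif_with_intercept(features: pd.DataFrame) -> pd.Series:
    """Hypothetical helper: VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from
    regressing column j on the remaining columns plus an intercept."""
    X = features.to_numpy(dtype=float)
    n_rows = X.shape[0]
    vifs = {}
    for j, name in enumerate(features.columns):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        # Design matrix: a column of ones (intercept) plus the other features
        design = np.column_stack([np.ones(n_rows), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residuals = target - design @ coef
        ss_res = float(residuals @ residuals)
        ss_tot = float(((target - target.mean()) ** 2).sum())
        r_squared = 1.0 - ss_res / ss_tot
        vifs[name] = 1.0 / (1.0 - r_squared)
    return pd.Series(vifs, name="VIF")


# Synthetic data (NOT the King County dataset): sqft_above is built to be
# strongly collinear with sqft_living, while floors is independent of both.
rng = np.random.default_rng(0)
sqft_living = rng.normal(2000, 500, size=500)
vif_input = pd.DataFrame({
    "sqft_living": sqft_living,
    "sqft_above": 0.8 * sqft_living + rng.normal(0, 50, size=500),
    "floors": rng.integers(1, 4, size=500).astype(float),
})

vif_demo = vif_with_intercept(vif_input)
print(vif_demo.round(2))
```

In this synthetic example the two collinear columns exceed the threshold of 5 used in `calculate_vif`, while the independent column stays close to 1. Keeping the raw values rather than replacing them with NaN makes it easier to see how far above the threshold a variable is.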

## Data Understanding

In this section, the identification, collection, and surface-level analysis of the data is done by:
- Collecting initial data (Has been compiled into a csv file).
- Describing the data we are working with.
- Exploring the data for any relationships and trends.
- Verifying the data quality.

In [3]:
raw_df = read_file_to_dataframe('data/kc_house_data.csv')

We will now examine the properties of the dataframe by calling the custom function `dataframe_detailed`.

In [4]:
dataframe_detailed(raw_df)

DATAFRAME SHAPE: (21597, 21)


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21597 entries, 0 to 21596
Data columns (total 21 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   id             21597 non-null  int64  
 1   date           21597 non-null  object 
 2   price          21597 non-null  float64
 3   bedrooms       21597 non-null  int64  
 4   bathrooms      21597 non-null  float64
 5   sqft_living    21597 non-null  int64  
 6   sqft_lot       21597 non-null  int64  
 7   floors         21597 non-null  float64
 8   waterfront     19221 non-null  object 
 9   view           21534 non-null  object 
 10  condition      21597 non-null  object 
 11  grade          21597 non-null  object 
 12  sqft_above     21597 non-null  int64  
 13  sqft_basement  21597 non-null  object 
 14  yr_built       21597 non-null  int64  
 15  yr_renovated   17755 non-null  float64
 16  zipcode        21597 non-null  int64  
 17  lat            2159

The summary shows that the columns do not all have the same number of non-null values, so we count exactly how many null values each column contains.

In [5]:
raw_df.isnull().sum()

id                  0
date                0
price               0
bedrooms            0
bathrooms           0
sqft_living         0
sqft_lot            0
floors              0
waterfront       2376
view               63
condition           0
grade               0
sqft_above          0
sqft_basement       0
yr_built            0
yr_renovated     3842
zipcode             0
lat                 0
long                0
sqft_living15       0
sqft_lot15          0
dtype: int64
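
Raw counts are easier to judge as a share of the 21597 rows: `waterfront` is missing in about 11.0% of rows, `yr_renovated` in about 17.8%, and `view` in about 0.29%. As a minimal sketch, the same table can be produced directly in pandas. The small `missing_demo` frame below is synthetic and only stands in for `raw_df`.

```python
import numpy as np
import pandas as pd

# Small synthetic frame standing in for raw_df (values are illustrative only)
missing_demo = pd.DataFrame({
    "price": [221900.0, 538000.0, 180000.0, 604000.0],
    "waterfront": [np.nan, "NO", "NO", np.nan],
    "view": ["NONE", np.nan, "NONE", "NONE"],
})

# Count and percentage of missing values per column
missing = pd.DataFrame({
    "missing_count": missing_demo.isnull().sum(),
    "missing_pct": (missing_demo.isnull().mean() * 100).round(2),
})

# Keep only columns that actually have missing values, largest first
missing = missing[missing["missing_count"] > 0].sort_values(
    "missing_count", ascending=False
)
print(missing)
```

Applied to the real data, `missing_demo` would simply be replaced by `raw_df`.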

We also check whether the dataframe contains duplicate values. Because many columns hold numerical data, repeated values within those columns are expected. Columns that should be unique, such as `id`, would ideally contain no duplicates.

Despite this, rows with duplicate IDs are kept in the dataframe. The assumption is that the ID is assigned to a particular house rather than to a particular transaction, so a repeated ID represents a house that was sold more than once, for example by people flipping houses.
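
A duplicate check of this kind might look like the sketch below. It separates fully identical rows from rows that only share an `id`. The `sales_demo` frame and its values are synthetic and used purely for illustration. With the real data, `raw_df` would take its place.

```python
import pandas as pd

# Synthetic sales records: house 101 appears twice with different dates/prices
sales_demo = pd.DataFrame({
    "id": [101, 102, 101, 103],
    "date": ["7/25/2014", "9/5/2014", "12/23/2014", "1/2/2015"],
    "price": [430000.0, 250000.0, 700000.0, 310000.0],
})

# Rows that are identical in every column (true duplicates)
full_row_duplicates = int(sales_demo.duplicated().sum())

# Rows whose id occurs more than once; keep=False marks every occurrence
repeated_id_rows = sales_demo[sales_demo.duplicated(subset="id", keep=False)]

print(f"Fully duplicated rows: {full_row_duplicates}")
print(repeated_id_rows.sort_values("id"))
```

If the repeated IDs carry different dates and prices, as in this toy example, that is consistent with the resale assumption. Identical rows would instead point to a data-entry duplicate.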

## Data Preparation

The preparation of the final dataset is done by:
- Removal of erroneous data.
- Handling of duplicate data.
- Removal of null data.


We begin by dropping the columns deemed unnecessary for the analysis: `date` and `yr_renovated`. Other columns may still be dropped later. This step only removes those that are clearly surplus to requirements.
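
A minimal sketch of this step, assuming a plain `DataFrame.drop` call is used. The `drop_demo` frame is synthetic and the variable names are illustrative rather than taken from the notebook.

```python
import pandas as pd

# Synthetic frame with the two columns to be removed
drop_demo = pd.DataFrame({
    "id": [1, 2],
    "date": ["10/13/2014", "12/9/2014"],
    "price": [221900.0, 538000.0],
    "yr_renovated": [0.0, 1991.0],
})

columns_to_drop = ["date", "yr_renovated"]

# drop() returns a new DataFrame, so the original frame stays untouched
prepared = drop_demo.drop(columns=columns_to_drop)
print(prepared.columns.tolist())
```

Returning a new frame rather than modifying the raw one in place keeps the original data available for later comparison.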