<a href="https://colab.research.google.com/github/naphtron/Phase-2-Group-Project/blob/main/student.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Final Project Submission



Chapter 1: Business Overview


1.1   Introduction


The real estate market is a complex and dynamic environment where accurately pricing houses is of paramount importance. In this ever-changing landscape, homeowners, buyers, and real estate agencies are often faced with the challenge of determining the fair market value of a property. The consequences of inaccurate pricing can be significant, ranging from houses languishing on the market for extended periods to missed opportunities for maximizing profit.
The quest for a precise and data-driven solution to this challenge has led us to explore a predictive model using linear regression. By leveraging the power of data analysis and predictive modeling, we aim to provide a practical tool that can revolutionize the way houses are priced, making the process more transparent, efficient, and informed.


1.2   Challenges


The challenges in the real estate market are multifaceted. Real estate agencies often grapple with two primary issues: overpricing and the lack of a robust decision framework. Overpricing can lead to properties remaining unsold for prolonged periods, incurring additional costs, and diminishing potential profits. On the other hand, prospective buyers face difficulties in determining which properties align with their budgets and desired features.


1.3    Problem Statement


The core problem that our project addresses is to provide a suburban house pricing model that considers the features of a house to determine its value. Overpricing can be detrimental to both sellers and buyers. The absence of a reliable decision framework means that clients with varying budgets and preferences lack guidance in their property search. As such, there is a clear need for a data-driven solution that can provide precise house price predictions and, in doing so, mitigate the challenges faced by stakeholders in the real estate market.


Objectives

Primary Objective: Develop a robust linear regression model to accurately predict suburban house prices in King County, Washington, utilizing relevant variables from the dataset. Therefore our objectives are:

a). Develop a regression model to predict suburban house prices based on their features.


b). Identify Key Factors Influencing House Prices in King County, Washington, to provide valuable insights for precise pricing strategies.


c). Analyze Model Performance using metrics such as mean squared error, R-squared values, and residual analysis to gauge the model's effectiveness.


d). Provide Actionable Recommendations to the Real Estate Agency for improving profitability and market presence, leveraging insights from the model.
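
Objective (c) names mean squared error and R-squared as evaluation metrics. The sketch below shows one way these quantities can be computed by hand. The helper name `regression_metrics` and the toy prices are illustrative assumptions, not values from the dataset.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return MSE, RMSE and R-squared for two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    mse = np.mean(residuals ** 2)
    ss_res = np.sum(residuals ** 2)                  # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation around the mean
    return {"mse": mse, "rmse": np.sqrt(mse), "r2": 1 - ss_res / ss_tot}

# Hypothetical actual and predicted prices, for illustration only
toy_actual = [300_000, 450_000, 500_000, 750_000]
toy_predicted = [320_000, 430_000, 520_000, 730_000]
metrics = regression_metrics(toy_actual, toy_predicted)
print(metrics)
```

RMSE is in the same units as price, which usually makes it easier to interpret than MSE.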






### 1.0 IMPORTING THE NECESSARY LIBRARIES AND LOADING THE DATASET

#### 1.2 IMPORTING THE NECESSARY LIBRARIES

In [1]:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno
import statsmodels.api as sm

# Project Libraries
# import functions as func

%matplotlib inline


#### 1.3 IMPORTING THE DATASET INTO A PANDAS DATAFRAME

In [2]:
df = pd.read_csv('kc_house_data.csv')
df.head(3)


In [None]:
df.tail(3)

In [None]:
df.shape

#####
    A brief overview of the dataset shows that it has 21 columns and 21,567 rows, all of which have been successfully loaded into the dataframe

### 2.0 DATA UNDERSTANDING

#### 2.1 UNDERSTANDING THE CHARACTERISTICS OF THE DATASET

In [None]:
df.info()

In [None]:
df.dtypes

In [None]:
df.columns

In [None]:
df.describe()

In [None]:
df.isnull().sum()

In [None]:
df.duplicated().any()

####
    From the above overview, we have established that the dataset does not have any duplicated rows. Three columns have missing values (waterfront, view and yr_renovated). The dataset contains both categorical and numerical columns

## 3.0 DATA CLEANING

### 3.1 HANDLING MISSING VALUES

    We'll start by visualizing our data to see which columns have missing values

In [None]:
# Visualise the missing values in the dataset
msno.bar(df, color='purple', figsize=(10, 5), fontsize=8)
plt.title('Missing Values Within Dataset')
plt.show()

    Let's find out how many missing values each column has

In [None]:
df.isnull().sum()

    We can see that the waterfront, view (albeit few) and yr_renovated columns have missing values.
    Dropping every row with a missing waterfront or view value would discard too much data, so instead
    we group the rows by zipcode and replace each missing value with the mode of its zipcode.
    This assumes that houses in the same zipcode tend to be similar as far as waterfront and view
    are concerned

#### 3.1.1. Missing values in categorical columns (waterfront and view)

In [None]:
df.info()

    Define a function that takes a dataframe, the column to group by and the target
     column as arguments, calculates the mode of the target column within each group,
     and fills the missing values in the target column with the mode of their group.

In [None]:
def fill_missing_with_mode(df, group_by_column, target_column):
    # Mode of the target column within each group (NaN when a group has no observed values)
    def group_mode(values):
        modes = values.mode()
        return modes.iloc[0] if not modes.empty else np.nan

    mode_by_group = df.groupby(group_by_column)[target_column].agg(group_mode)

    # Fill missing values in place with the mode of each row's group
    df[target_column] = df[target_column].fillna(df[group_by_column].map(mode_by_group))

# Example usage to fill missing 'waterfront' values based on 'zipcode' mode
# fill_missing_with_mode(df, 'zipcode', 'waterfront')


In [None]:
# Use the function to fill missing 'waterfront' values based on 'zipcode' mode
fill_missing_with_mode(df, 'zipcode', 'waterfront')

In [None]:
# Use the function to fill missing 'view' values based on 'zipcode' mode
fill_missing_with_mode(df, 'zipcode', 'view')

#### 3.1.2 Handling Missing Values in Numerical Columns (yr_renovated)

In [None]:
# Fill missing values in the 'yr_renovated' column with 0
# (assigning the result back avoids relying on inplace=True on a column selection)
df['yr_renovated'] = df['yr_renovated'].fillna(0)


####
    We have elected to fill all missing values in the yr_renovated column with '0'.
    This is based on the assumption that houses with a missing value have never been renovated.
    We will feature engineer a new column that records whether each house has been renovated or not.
    This limits the bias that could arise from filling the missing values with an arbitrary year

In [None]:
df.info()

#### 3.2 DETECTING DUPLICATES

In [None]:
df.duplicated().any()

    We do not have any duplicated values in the dataframe

#### 3.3 DETECTING OUTLIERS


#### 3.3.1 Numerical column Outliers

    In this section, we focus only on numerical columns.
    The id, lat and long columns are excluded
    because they are not expected to be used in model training.
    They will be dropped at a later time.

####  
    Box plots are drawn below for columns taken from the following list of numerical columns.
    columns = ['price', 'bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors',
            'yr_built', 'sqft_above', 'sqft_living15', 'sqft_lot15']

    
    


    We define a function to create a box plot for a specified column in the df with the necessary labels.

    the function will take the following Parameters:
    - data: The DataFrame containing the data (df).
    - column: The name of the column to create a box plot for.

    Returns:
    - The box plot as a Matplotlib Axes object.

In [None]:

def create_box_plot(data, column):

    # Create a single subplot
    fig, ax = plt.subplots(figsize=(10, 5))

    # Plot the box plot
    sns.boxplot(x=data[column], ax=ax, orient='h')

    # Set the title and x-label based on the column name
    ax.set_title(f'Box Plot of {column}')
    ax.set_xlabel(column)

    return ax


In [None]:
# Create a box plot for the 'price' column
create_box_plot(df, 'price')
plt.show()

In [None]:
# Create a box plot for the 'bedrooms' column
create_box_plot(df, 'bedrooms')
plt.show()

In [None]:
# Create a box plot for the 'sqft_living' column
create_box_plot(df, 'sqft_living')
plt.show()

In [None]:
# Create a box plot for the 'sqft_lot' column
create_box_plot(df, 'sqft_lot')
plt.show()

In [None]:
# Create a box plot for the 'floors' column
create_box_plot(df, 'floors')
plt.show()

In [None]:
# Create a box plot for the 'yr_built' column
create_box_plot(df, 'yr_built')
plt.show()

In [None]:
# Create a box plot for the 'sqft_above' column
create_box_plot(df, 'sqft_above')
plt.show()

In [None]:
# Keep only renovated houses (yr_renovated > 0) so that the zeros used for
# "never renovated" do not dominate the plot; df itself is left unchanged
df_filtered = df[df['yr_renovated'] > 0]

# Call the create_box_plot function with df_filtered as the data
create_box_plot(df_filtered, 'yr_renovated')
plt.show()

####
    There are several outliers in each of the columns plotted above.
    We need to drop some of the outliers to make sure we only deal with houses that are suburban.
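
The outliers visible in a box plot are, by seaborn's default, the points lying more than 1.5 × IQR beyond the quartiles. A minimal sketch of counting them per column follows. Note that this is not the filter applied later in this notebook, which uses quantile cut-offs; the helper name `count_iqr_outliers` and the toy data are assumptions made for illustration.

```python
import pandas as pd

def count_iqr_outliers(data, columns, k=1.5):
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] for each listed column."""
    counts = {}
    for column in columns:
        q1 = data[column].quantile(0.25)
        q3 = data[column].quantile(0.75)
        iqr = q3 - q1
        lower, upper = q1 - k * iqr, q3 + k * iqr
        outside = (data[column] < lower) | (data[column] > upper)
        counts[column] = int(outside.sum())
    return pd.Series(counts, name='n_outliers')

# Hypothetical toy data: one price is far above the rest
toy = pd.DataFrame({
    'price': [200, 220, 250, 260, 300, 2000],
    'bedrooms': [2, 3, 3, 3, 4, 4],
})
outlier_counts = count_iqr_outliers(toy, ['price', 'bedrooms'])
print(outlier_counts)
```

On the real data this could be called as `count_iqr_outliers(df, ['price', 'bedrooms', 'sqft_living'])` to see which columns contribute the most extreme values.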

## EXPLORATORY DATA ANALYSIS

### EXPLORING CATEGORICAL COLUMNS

    """
    Create a count plot for a specified categorical column in a given DataFrame.

    Parameters:
    - data: The DataFrame containing the data.
    - column: The name of the categorical column to create a count plot for.

    Returns:
    - The count plot as a Matplotlib Axes object.
    """

In [None]:

def create_count_plot(data, column):

    # Create a single subplot
    fig, ax = plt.subplots(figsize=(10, 5))

    # Create the count plot
    sns.countplot(x=data[column], ax=ax)

    # Set the title and x-label based on the column name
    ax.set_title(f'Value Counts of {column}')
    ax.set_xlabel(column)
    ax.tick_params(axis='x', rotation=45)

    # Add labels displaying the total value counts for each bar
    for p in ax.patches:
        ax.annotate(f'Total: {p.get_height()}', (p.get_x() + p.get_width() / 2., p.get_height()),
                    ha='center', va='center', fontsize=8, color='black', xytext=(0, 10),
                    textcoords='offset points')

    return ax


In [None]:
# Create a count plot for the 'grade' column
create_count_plot(df, 'grade')
plt.show()

In [None]:
# Create a count plot for the 'waterfront' column
create_count_plot(df, 'waterfront')
plt.show()

In [None]:
# Create a count plot for the 'view' column
create_count_plot(df, 'view')
plt.show()

In [None]:
# Create a count plot for the 'zipcode' column
create_count_plot(df, 'zipcode')
plt.show()

### EXPLORING NUMERICAL COLUMNS (using histplot and countplot)

    """
    Create a customized plot for a specified column in a given DataFrame.

    Parameters:
    - data: The DataFrame containing the data.
    - plot_type: Type of plot (e.g., 'histplot', 'countplot', etc.).
    - column: The name of the column to create the plot for.
    - figsize: Tuple specifying the figure size (width, height).

    Returns:
    - The plot as a Matplotlib Axes object.
    """

In [None]:


def create_custom_plot(data, plot_type, column, figsize=(10, 5)):

    # Create a single subplot
    fig, ax = plt.subplots(figsize=figsize)

    # Check the plot type and create the corresponding plot
    if plot_type == 'histplot':
        sns.histplot(data[column], kde=True, ax=ax)
    elif plot_type == 'countplot':
        sns.countplot(x=data[column], ax=ax)
    elif plot_type == 'rugplot':
        sns.rugplot(x=data[column], ax=ax)
    elif plot_type == 'ridgeplot':
        sns.kdeplot(x=data[column], ax=ax)
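    elif plot_type == 'beanplot':
        sns.violinplot(x=data[column], ax=ax)
    else:
        # Fail loudly instead of returning an empty figure for an unknown plot type
        raise ValueError(f"Unsupported plot_type: {plot_type!r}")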

    # Set the title and x-label based on the column name
    ax.set_title(f'{plot_type.capitalize()} of {column}')
    ax.set_xlabel(column)

    # Additional customization based on the plot type can be added here

    return ax


In [None]:
# Create a count plot for the 'bathrooms' column
create_custom_plot(df, 'countplot', 'bathrooms')
plt.show()

In [None]:
# Create a count plot for the 'bedrooms' column
create_custom_plot(df, 'countplot', 'bedrooms')
plt.show()

In [None]:
# Create a histogram for the 'yr_renovated' column (renovated houses only)
create_custom_plot(df_filtered, 'histplot', 'yr_renovated')
plt.show()

In [None]:
# Create a count plot for the 'floors' column
create_custom_plot(df, 'countplot', 'floors')
plt.show()

In [None]:
# Create a histogram for the 'price' column
create_custom_plot(df, 'histplot', 'price')
plt.show()

In [None]:
# Create a histogram for the 'sqft_living' column
create_custom_plot(df, 'histplot', 'sqft_living')
plt.show()

In [None]:
# Create a histogram for the 'sqft_lot' column
create_custom_plot(df, 'histplot', 'sqft_lot')
plt.show()

In [None]:
# Create a histogram for the 'sqft_above' column
create_custom_plot(df, 'histplot', 'sqft_above')
plt.show()

In [None]:
# Create a histogram for the 'sqft_basement' column
create_custom_plot(df, 'histplot', 'sqft_basement')
plt.show()

In [None]:
# Create a histogram for the 'yr_built' column
create_custom_plot(df, 'histplot', 'yr_built')
plt.show()

## DATA PREPARATION & FEATURE ENGINEERING

In [None]:
df.info()

#### DROPPING THE COLUMNS THAT WE BELIEVE WILL NOT BE NECESSARY TO OUR MODEL
    We want to drop all the columns that we do not believe contribute to our model's
    performance. We are focused on using features pertinent to each house, irrespective of its
    geographic location or the characteristics of neighbouring houses.
    The long and lat columns hold geographical coordinates; should we need to consider
    location, we can use zipcode instead.
    sqft_living15 and sqft_lot15 contain details of the nearest 15 neighbors. These are not
    directly features of each house in our dataset.
    Therefore, we'll drop
        sqft_living15
        sqft_lot15
        long
        lat


In [None]:
# Create a new df that we can drop columns and work with
new_df = df.copy()
# new_df.head(3)

In [None]:
# Create a new copy of the data without the location and neighbour columns;
# the original df is left untouched
new_df = df.drop(['lat', 'long', 'sqft_living15', 'sqft_lot15'], axis=1).copy()


In [None]:
df.info()

#####
    The original dataframe has 21 columns without feature engineering.
    This dataframe will remain accessible should we need to use any element from it in other tasks down the line


In [None]:
new_df.info()

####
    The new df (new_df) has 17 columns (the original 21 minus the 4 dropped), which are listed above.

#### ADDING NEW COLUMNS

####
    Add a new column to store the age of the houses

In [None]:
new_df['date'] = pd.to_datetime(new_df['date'])
new_df['age'] = new_df['date'].dt.year - new_df['yr_built']

# Drop the 'date' column
new_df = new_df.drop(columns=['date'])

###
    Filling any remaining null values in the 'yr_renovated' column with 0 and adding
    the 'renovated' column to show whether the house has been renovated or not

In [None]:
new_df.loc[new_df.yr_renovated.isnull(), 'yr_renovated'] = 0
new_df['renovated'] = new_df['yr_renovated'].apply(lambda x: 0 if x == 0 else 1)
# new_df.renovated

####
    Clean sqft_basement and create a binary has_basement column from it

In [None]:
# '?' marks an unknown basement size; treat it as 0 before converting to float
new_df['sqft_basement'] = new_df['sqft_basement'].replace('?', '0').astype('float')
new_df['has_basement'] = new_df['sqft_basement'].apply(lambda x: 0 if x == 0 else 1)


In [None]:
new_df.info()

### ORDINAL ENCODING

#####
    Create a function that maps ordinal values into a dataframe
    with the corresponding numerical values based on a provided dictionary

In [None]:
def map_ordinal_values(df, col_name, value_dict):
    # map the ordinal values to numerical values using the provided dictionary
    df[col_name] = df[col_name].map(value_dict).astype(int)
    return df

In [None]:
print(new_df.condition.unique())

In [None]:
condition_dict = {'Poor': 1, 'Fair': 2, 'Average': 3, 'Good': 4, 'Very Good': 5}
grade_dict = {'3 Poor': 3, '4 Low': 4, '5 Fair': 5, '6 Low Average': 6, '7 Average': 7, '8 Good': 8, '9 Better': 9, '10 Very Good': 10, '11 Excellent': 11, '12 Luxury': 12, '13 Mansion': 13}
view_dict = {'NONE': 0, 'FAIR': 1, 'AVERAGE': 2, 'GOOD': 3, 'EXCELLENT': 4}  # ordered from worst to best view
new_df = map_ordinal_values(new_df, 'condition', condition_dict)
new_df = map_ordinal_values(new_df, 'grade', grade_dict)
new_df = map_ordinal_values(new_df, 'view', view_dict)

# print(new_df[['condition', 'grade', 'view']])
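
A mapping dictionary that misses a label makes `.map` return NaN for that label, and the subsequent `.astype(int)` then raises an error. A small check before mapping can make such gaps visible. The sketch below is illustrative only: the helper name `find_unmapped_categories` and the toy series (including the label 'Excellent') are assumptions, not part of the dataset.

```python
import pandas as pd

def find_unmapped_categories(series, value_dict):
    """Return the sorted labels present in `series` that have no entry in `value_dict`."""
    observed = set(series.dropna().unique())
    return sorted(observed - set(value_dict))

# Hypothetical toy column containing one label the dictionary does not cover
toy_condition = pd.Series(['Good', 'Average', 'Very Good', 'Excellent', 'Average'])
toy_dict = {'Poor': 1, 'Fair': 2, 'Average': 3, 'Good': 4, 'Very Good': 5}

missing = find_unmapped_categories(toy_condition, toy_dict)
print(missing)
```

An empty list means every observed label has a numerical value, so the mapping is safe to apply.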

In [None]:
# def convert_column_data_type(df, column_name, new_data_type):
#     try:
#         df[column_name] = df[column_name].astype(new_data_type)
#     except ValueError:
#         print(f"Conversion to {new_data_type} failed. Check the data in the column.")
#         # You can handle the error as needed, e.g., return an error code or message

In [None]:
# Example usage:
# Assuming you have a DataFrame 'df' and want to convert the 'age' column to float
# convert_column_data_type(new_df, 'condition', int)

In [None]:
new_df.info()

#### ONE HOT ENCODING

####
    One-hot encoding will be done for the waterfront column (view has already been ordinal-encoded above).
    To avoid the 'dummy variable trap' we'll drop one of the created columns

In [None]:
new_df.waterfront.nunique()


    Create a function to do one-hot encoding on the specified column

In [None]:
def one_hot_encode(df, columns):
    if isinstance(columns, str):
        columns = [columns]  # Convert to a list if it's a string

    df = pd.get_dummies(df, columns=columns, prefix_sep='_', drop_first=True)
    return df

In [None]:
# columns_to_encode = ['waterfront', 'view']
# new_df = one_hot_encode(new_df, columns=['view'])
new_df = one_hot_encode(new_df, columns=['waterfront'])
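
The dummy variable trap can be seen on a toy column: when every level gets its own dummy, the dummies sum to 1 in every row and therefore duplicate the information carried by the regression intercept. The labels 'NO' and 'YES' below are assumed for illustration.

```python
import pandas as pd

# Hypothetical two-level categorical column
toy = pd.DataFrame({'waterfront': ['NO', 'YES', 'NO', 'NO']})

# One dummy per level: the columns are perfectly collinear with a constant term
all_levels = pd.get_dummies(toy, columns=['waterfront'])
row_sums = all_levels.sum(axis=1)

# drop_first=True keeps k-1 dummies; the dropped level becomes the baseline
dropped = pd.get_dummies(toy, columns=['waterfront'], drop_first=True)

print(list(all_levels.columns))
print(list(dropped.columns))
```

With the first level dropped, the remaining coefficient is read as the difference relative to the baseline level.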

In [None]:
new_df.head(3)

In [None]:
# Select columns with dtype 'bool' and convert them to int
bool_columns = new_df.select_dtypes(include=['bool'])
new_df[bool_columns.columns] = bool_columns.astype(int)


In [None]:
new_df.head(3)

In [None]:
new_df.shape

In [None]:
new_df.info()

In [None]:
# new_df.corr()

####
    We have established that there are a number of outliers in the dataset,
    especially in the numerical columns. The next steps remove all data points
     that lie above a chosen quantile of selected columns (the 99th percentile in the cells below)
     to ensure that our model is reflective of where the majority of the houses are.
     We will also create a new df to work with.


#### REMOVING OUTLIERS

#### Filtering DF to Remove outliers and creating a mask That returns a new df

##### CREATE A COPY OF THE DATAFRAME

In [None]:
df1 = new_df.copy()

In [None]:
df1.info()

In [None]:
correlations = df1.corr()['price']

# Sort the correlations in descending order
sorted_correlations = correlations.sort_values(ascending=False)

# Print or display the sorted correlations
print(sorted_correlations)

    Filter rows in a DataFrame based on the specified quantile for selected columns.

    Parameters:
    - df: The DataFrame containing the data.
    - quantile_dict: A dictionary where keys are column names, and values are the quantile levels.

    Returns:
    - A new DataFrame with rows eliminated above the specified quantiles for the selected columns.

In [None]:

def filter_rows_by_quantile(df, quantile_dict):

    masks = {}

    # Calculate masks for each column based on the specified quantiles
    for column, quantile in quantile_dict.items():
        quantiles = df[column].quantile(quantile)
        masks[column] = df[column] <= quantiles

    # Combine the masks using logical AND to select rows below the specified quantiles for all columns
    combined_mask = pd.concat(masks.values(), axis=1).all(axis=1)

    # Apply the combined mask to create a new DataFrame
    filtered_df = df[combined_mask]

    return filtered_df




In [None]:
df1.info()

In [None]:
# Update this dictionary as needed to adjust the parameters of the filter
# Define the quantile dictionary
quantiles_dict = {
    'price':0.99,
    'bathrooms': 0.99,
    'sqft_living': 0.99,
    'bedrooms': 0.99,
    'sqft_lot':0.99,
    'sqft_above':0.99,
    'sqft_basement':0.99,
}

# Filter rows above the specified quantiles for the specified columns
filtered_df = filter_rows_by_quantile(df1, quantiles_dict)
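
Because the masks are combined with a logical AND, a row is dropped if it exceeds the cut-off in any one column, so the share of rows removed is usually larger than the 1% implied by a single 0.99 quantile. The sketch below illustrates this on synthetic data; the column names `a` and `b` and the random values are assumptions.

```python
import numpy as np
import pandas as pd

# Synthetic data: two independent columns of 1,000 values each
rng = np.random.default_rng(0)
toy = pd.DataFrame({'a': rng.normal(size=1000), 'b': rng.normal(size=1000)})

# One-sided cut at the 99th percentile of each column
keep_a = toy['a'] <= toy['a'].quantile(0.99)
keep_b = toy['b'] <= toy['b'].quantile(0.99)

# Share of rows removed by each mask alone and by both combined
share_removed_a = 1 - keep_a.mean()
share_removed_combined = 1 - (keep_a & keep_b).mean()
print(share_removed_a, share_removed_combined)
```

Comparing `filtered_df.shape` with `df1.shape` gives the corresponding figure for the real data.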

In [None]:
filtered_df.shape

In [None]:
filtered_df.info()

####
    Now that we have a df with the most extreme values removed, we can start analyzing the data


In [None]:
# Calculate the correlation between 'price' and all other columns in the filtered df
correlations = filtered_df.corr()['price']

# Sort the correlations in descending order
sorted_correlations = correlations.sort_values(ascending=False)

# Print or display the sorted correlations
print(sorted_correlations)


### BI-VARIATE ANALYSIS

#### Scatter plot to show the relationship between price and other columns

    Create a bivariate plot (e.g., scatter plot) for specified columns.

    Parameters:
    - df: The DataFrame containing the data.
    - plot_type: The type of plot to create (e.g., 'scatter', 'line', etc.).
    - x_column: The column to use as the x-axis.
    - y_column: The column to use as the y-axis.

In [None]:

def create_bivariate_plot(df, plot_type, x_column, y_column):

    if plot_type == 'scatter':
        plt.scatter(df[x_column], df[y_column], alpha=0.5)
        plt.xlabel(x_column)
        plt.ylabel(y_column)
        plt.title(f'Scatter Plot for {y_column} vs. {x_column}')

    elif plot_type == 'heatmap':
        sns.heatmap(df[[x_column, y_column]].corr(), annot=True)
        plt.title(f'Heatmap for {y_column} vs. {x_column}')

    elif plot_type == 'contour':
        # Pass x and y as keywords; recent seaborn versions no longer accept them positionally
        sns.kdeplot(x=df[x_column], y=df[y_column], cmap='Blues', fill=True)
        plt.xlabel(x_column)
        plt.ylabel(y_column)
        plt.title(f'Contour Plot for {y_column} vs. {x_column}')

    elif plot_type == 'bubble':
        # Colour the points by living area ('sqft_living' is the square-footage column in this dataset)
        plt.scatter(df[x_column], df[y_column], c=df['sqft_living'], cmap='viridis', alpha=0.5)
        plt.xlabel(x_column)
        plt.ylabel(y_column)
        plt.title(f'Bubble Chart for {y_column} vs. {x_column}')
        plt.colorbar(label='Square Footage')

    elif plot_type == 'boxplot':
        sns.boxplot(x=x_column, y=y_column, data=df)
        plt.title(f'Box Plot for {y_column} vs. {x_column}')

    elif plot_type == 'histogram':
        plt.hist2d(df[x_column], df[y_column], bins=(30, 30), cmap='Blues')
        plt.colorbar()
        plt.xlabel(x_column)
        plt.ylabel(y_column)
        plt.title(f'2D Histogram for {y_column} vs. {x_column}')

    plt.show()


In [None]:

# Create a scatter plot for 'sqft_living' vs. 'price'
create_bivariate_plot(filtered_df, plot_type='scatter', x_column='sqft_living', y_column='price')



In [None]:
# Create a correlation heatmap for 'bathrooms' vs. 'price'
create_bivariate_plot(filtered_df, plot_type='heatmap', x_column='bathrooms', y_column='price')


In [None]:

# Create a correlation heatmap for 'bedrooms' vs. 'price'
create_bivariate_plot(filtered_df, plot_type='heatmap', x_column='bedrooms', y_column='price')


In [None]:

# Create a correlation heatmap for 'sqft_lot' vs. 'price'
create_bivariate_plot(filtered_df, plot_type='heatmap', x_column='sqft_lot', y_column='price')


In [None]:

# Create a correlation heatmap for 'floors' vs. 'price'
create_bivariate_plot(filtered_df, plot_type='heatmap', x_column='floors', y_column='price')


In [None]:

# Create a correlation heatmap for 'yr_built' vs. 'price'
create_bivariate_plot(filtered_df, plot_type='heatmap', x_column='yr_built', y_column='price')


In [None]:

# Create a correlation heatmap for 'yr_renovated' vs. 'price'
create_bivariate_plot(filtered_df, plot_type='heatmap', x_column='yr_renovated', y_column='price')


In [None]:

# Create a correlation heatmap for 'age' vs. 'price'
create_bivariate_plot(filtered_df, plot_type='heatmap', x_column='age', y_column='price')


In [None]:

# Create a correlation heatmap for 'renovated' vs. 'price'
create_bivariate_plot(filtered_df, plot_type='heatmap', x_column='renovated', y_column='price')


In [None]:

# Create a correlation heatmap for 'zipcode' vs. 'price'
create_bivariate_plot(filtered_df, plot_type='heatmap', x_column='zipcode', y_column='price')


In [None]:
# Calculate the correlation matrix
correlation_matrix = filtered_df.corr()

# Create the full heatmap (both triangles shown) with a continuous colormap
plt.figure(figsize=(14, 8))  # Adjust the figure size as needed
sns.heatmap(correlation_matrix, cmap="viridis", annot=True, fmt=".2f", linewidths=0.5)
plt.title('Correlation Heatmap for filtered_df')

plt.show()

In [None]:
# Calculate the correlation matrix
correlation_matrix = filtered_df.corr()

# Mask the upper triangle of the correlation matrix, which mirrors the lower one
mask = np.triu(np.ones_like(correlation_matrix, dtype=bool))

# Create the heatmap with a continuous colormap
plt.figure(figsize=(14, 8))  # Adjust the figure size as needed
sns.heatmap(correlation_matrix, cmap="viridis", annot=True, fmt=".2f", mask=mask, linewidths=0.5)
plt.title('Correlation Heatmap for filtered_df (lower triangle)')

plt.show()


In [None]:
# Calculate the correlation of 'price' with all numerical columns and sort them in descending order
price_corr = filtered_df.corr()['price'].sort_values(ascending=False)

print(price_corr)


In [None]:

# Calculate the correlation of 'price' with all numerical columns and sort them in descending order
price_corr = filtered_df.corr()['price'].sort_values(ascending=False)

# Create a bar plot
plt.figure(figsize=(10, 6))  # Adjust the figure size as needed
sns.barplot(x=price_corr.values, y=price_corr.index, palette='viridis')
plt.xlabel('Correlation')
plt.ylabel('Numerical Columns')
plt.title('Correlation of Price with Numerical Columns')
plt.show()


### REGRESSION MODELLING
#### Creating the base model

#### From the figure above, it is clear that sqft_living has the highest correlation to the price of the house:

In [None]:
import statsmodels.api as sm

# Add a constant to the independent variable for the intercept
X = sm.add_constant(filtered_df['grade'])

# Fit the OLS (Ordinary Least Squares) regression model
model = sm.OLS(filtered_df['price'], X).fit()

# Get model summary
summary = model.summary()

# Extract R-squared and F-statistic from the summary
r_squared = model.rsquared
f_statistic = model.fvalue

print(summary)
# print(f"R-squared: {r_squared}")
# print(f"F-statistic: {f_statistic}")


#### At the moment, the model explains about 49.3% of the variance in house prices (R-squared of roughly 0.493).
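
For a regression with a single predictor and an intercept, R-squared equals the squared Pearson correlation between the predictor and the target, which is why the correlation ranking is a reasonable guide for choosing the base feature. The sketch below checks this on synthetic numbers; the data and coefficients are assumptions, not the notebook's results.

```python
import numpy as np

# Synthetic predictor and target (hypothetical living area and price)
rng = np.random.default_rng(42)
x = rng.uniform(500, 4000, size=200)
y = 100_000 + 200 * x + rng.normal(0, 150_000, size=200)

# Least-squares line; np.polyfit returns the highest-degree coefficient first
slope, intercept = np.polyfit(x, y, deg=1)
y_hat = intercept + slope * x

# R-squared from its definition, and the Pearson correlation coefficient
r_squared = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
pearson_r = np.corrcoef(x, y)[0, 1]

print(r_squared, pearson_r ** 2)
```

This identity holds only for one predictor; with several predictors R-squared has to be read from the fitted model.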

In [None]:
filtered_df.columns

In [None]:
plt.figure(figsize=(10,8))
# sns.heatmap(filtered_df.corr(),annot=True,fmt='.2f',cmap='coolwarm')
price_corr_series = filtered_df.corr()['price'].sort_values(ascending=False)

threshold = 0.3

filtered_price_corr_series = price_corr_series[price_corr_series>threshold]
filtered_price_corr_series

In [None]:
sns.heatmap(filtered_df[list(filtered_price_corr_series.index)[1:]].corr(),fmt='.2f',annot=True,cmap='coolwarm')

In [None]:
'''
Feature with high correlation with grade
- sqft_living: .73
- sqft_above: .72
- bathrooms: .63

Feature with high correlation with sqft_living
- grade: .73
- sqft_above: .86
- bathrooms: .72
- bedrooms: .60
'''
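
Pairwise correlations like those noted above can be complemented by the variance inflation factor (VIF), which measures how well each feature is explained by all the others together. This is an additional diagnostic, not something the notebook computes. The helper `compute_vif` and the synthetic columns below are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def compute_vif(features):
    """VIF per column: 1 / (1 - R^2) from regressing that column on the other columns."""
    vifs = {}
    for column in features.columns:
        y = features[column].to_numpy(dtype=float)
        others = features.drop(columns=column).to_numpy(dtype=float)
        X = np.column_stack([np.ones(len(y)), others])   # add an intercept
        coef, *_ = np.linalg.lstsq(X, y, rcond=None)
        residuals = y - X @ coef
        r2 = 1 - (residuals @ residuals) / np.sum((y - y.mean()) ** 2)
        vifs[column] = 1 / (1 - r2)
    return pd.Series(vifs, name='VIF')

# Synthetic data: 'sqft_above' is built to track 'sqft_living', 'age' is independent
rng = np.random.default_rng(1)
living = rng.normal(2000, 500, size=300)
toy = pd.DataFrame({
    'sqft_living': living,
    'sqft_above': living * 0.8 + rng.normal(0, 100, size=300),
    'age': rng.normal(40, 15, size=300),
})
vif = compute_vif(toy)
print(vif)
```

A frequently quoted rule of thumb treats VIF values above roughly 5 to 10 as a sign of problematic multicollinearity.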

In [None]:
# consider creating new features
alt_df = filtered_df.copy()

alt_df['condition_living_product'] = filtered_df['grade'] * filtered_df['sqft_living']
alt_df['total_living_area'] = filtered_df['sqft_living'] + filtered_df['sqft_above']
alt_df['sqft_bathroom_ratio'] = filtered_df['sqft_living'] / filtered_df['bathrooms']
alt_df['sqft_bedroom_ratio'] = filtered_df['sqft_living'] / filtered_df['bedrooms']

sel_features = list(filtered_price_corr_series.index)[1:]
m_features = ['price'] + sel_features + ['sqft_bedroom_ratio','total_living_area','sqft_bathroom_ratio','condition_living_product']


plt.figure(figsize=(9,8))
print(alt_df.corr()['price'].sort_values(ascending=False)[alt_df.corr()['price'].sort_values(ascending=False)>0.5])
sns.heatmap(alt_df[m_features].corr(),fmt='.2f',annot=True,cmap='coolwarm')

In [None]:
alt_df['view'].value_counts().plot.pie(autopct='%.2f')
alt_df['view'].value_counts()

In [None]:
target_mean = alt_df.groupby('view')['price'].mean()

In [None]:
alt_df['view_encoded'] = alt_df['view'].map(target_mean)
alt_df
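
The cells above encode `view` with the mean price of each category, computed on the full dataset. When a model is later scored on a held-out split, a common precaution is to compute those means from the training rows only, so the test prices do not leak into the feature. A minimal sketch of that variant follows, with a hypothetical helper `target_encode` and toy data.

```python
import pandas as pd

def target_encode(train, test, column, target):
    """Encode `column` with per-category means of `target` learned from `train` only."""
    means = train.groupby(column)[target].mean()
    fallback = train[target].mean()   # used for categories unseen in training
    train_encoded = train[column].map(means)
    test_encoded = test[column].map(means).fillna(fallback)
    return train_encoded, test_encoded

# Hypothetical toy split; category 4 appears only in the test rows
toy_train = pd.DataFrame({'view': [0, 0, 1, 1], 'price': [100.0, 200.0, 300.0, 500.0]})
toy_test = pd.DataFrame({'view': [1, 0, 4], 'price': [999.0, 999.0, 999.0]})

train_enc, test_enc = target_encode(toy_train, toy_test, 'view', 'price')
print(test_enc.tolist())
```

The test prices never enter the encoding, and an unseen category falls back to the overall training mean.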

In [None]:
plt.figure(figsize=(13,10))
sns.heatmap(alt_df.corr(), annot=True, fmt='.2f', cmap='coolwarm')

In [None]:
alt_df.columns

In [None]:
from sklearn.metrics import mean_squared_error, make_scorer, r2_score
from statsmodels.formula.api import ols

#potential models

formula = 'price ~ condition_living_product + bedrooms + sqft_bathroom_ratio + view_encoded'
formula2 = 'price ~ grade + sqft_bedroom_ratio + sqft_bathroom_ratio + bathrooms + view_encoded'
formula3 = 'price ~ sqft_living + view + bedrooms'

model = ols(formula, alt_df).fit()
print(model.summary())

X = alt_df[['view_encoded','condition_living_product','bedrooms', 'sqft_bathroom_ratio']]
print(X)
y = alt_df['price']
# model.predict(X)
y_pred = model.predict(X)
MSE = mean_squared_error(y, y_pred)
RMSE = np.sqrt(MSE)
print(RMSE)


In [None]:
model = ols(formula3, alt_df).fit()
print(model.summary())

X = alt_df[['sqft_living','view','bedrooms']]
print(X)
y = alt_df['price']
# model.predict(X)
y_pred = model.predict(X)
MSE = mean_squared_error(y, y_pred)
RMSE = np.sqrt(MSE)
print(RMSE)

In [None]:
model = ols(formula2, alt_df).fit()
print(model.summary())

X = alt_df[['grade','sqft_bedroom_ratio','sqft_bathroom_ratio','bathrooms','view_encoded']]
print(X)
y = alt_df['price']
# model.predict(X)
y_pred = model.predict(X)
MSE = mean_squared_error(y, y_pred)
RMSE = np.sqrt(MSE)
print(RMSE)

In [None]:
import pandas as pd
from sklearn.model_selection import KFold
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import LinearRegression, RidgeCV, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score, recall_score, precision_score



In [None]:
formula = 'price ~ condition_living_product + bedrooms + sqft_bathroom_ratio + view_encoded'

X = alt_df[['sqft_living','condition_living_product', 'bedrooms', 'sqft_bathroom_ratio','view_encoded']]
y = alt_df['price']


# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit the scaler on the training data only, then apply the same transformation to the test data
scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)
X_test_std = scaler.transform(X_test)

In [None]:
model_simple = LinearRegression()
model_simple.fit(X_train, y_train)

In [None]:
y_pred = model_simple.predict(X_test)
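# Take the square root explicitly; the squared=False argument is not available in recent scikit-learn releases
rmse = np.sqrt(mean_squared_error(y_test, y_pred))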
r2 = r2_score(y_test, y_pred)
rmse, r2

In [None]:
poly = PolynomialFeatures(degree=3)
X_train_poly = poly.fit_transform(X_train)
X_test_poly = poly.transform(X_test)
model_poly = make_pipeline(StandardScaler(), Ridge())
model_poly.fit(X_train_poly, y_train)

y_pred_poly = model_poly.predict(X_test_poly)
rmse = np.sqrt(mean_squared_error(y_test, y_pred_poly))
r2 = r2_score(y_test, y_pred_poly)
rmse, r2

In [None]:
# Polynomial Ridge Regression
model_ridge = make_pipeline(StandardScaler(), Ridge(alpha=0.5))
model_ridge.fit(X_train_poly, y_train)

y_pred_ridge = model_ridge.predict(X_test_poly)
rmse_ridge = np.sqrt(mean_squared_error(y_test, y_pred_ridge))
r2_ridge = r2_score(y_test, y_pred_ridge)

print("Polynomial Ridge Regression Results:")
print(f"RMSE: {rmse_ridge}, R2 Score: {r2_ridge}")

In [None]:
model_poly_ridge = make_pipeline(PolynomialFeatures(degree=2), Ridge(alpha=0.5))

# Fit the model on the raw training features; the pipeline adds the polynomial terms itself
model_poly_ridge.fit(X_train, y_train)

# Predict on the test data
y_pred_poly_ridge = model_poly_ridge.predict(X_test)

# Calculate the metrics
rmse_poly_ridge = np.sqrt(mean_squared_error(y_test, y_pred_poly_ridge))
r2_poly_ridge = r2_score(y_test, y_pred_poly_ridge)

print(f"Ridge Regularization: RMSE = {rmse_poly_ridge}, R-squared = {r2_poly_ridge}")

In [None]:
# Ridge regression with alpha chosen by 5-fold cross-validation

model_cv = make_pipeline(StandardScaler(), RidgeCV(alphas=[0.1, 1.0, 10.0], cv=5))

# Fit the model
model_cv.fit(X_train, y_train)

# Predict the target values
y_pred_cv = model_cv.predict(X_test)

# Compute the metrics
rmse_cv = mean_squared_error(y_test, y_pred_cv, squared=False)
r2_cv = r2_score(y_test, y_pred_cv)

# Display the results
print("Cross-Validated Ridge Regression Results:")
print(f"RMSE: {rmse_cv}, R2: {r2_cv}")

In [None]:
# Implement KFold cross-validation with Ridge regression
kf = KFold(n_splits=5, shuffle=True, random_state=42)
rmse_values = []
r2_values = []

for train_index, test_index in kf.split(X):
    X_train_kf, X_test_kf = X.iloc[train_index], X.iloc[test_index]
    y_train_kf, y_test_kf = y.iloc[train_index], y.iloc[test_index]

    # Fitting the Ridge model
    model_kf = make_pipeline(StandardScaler(), Ridge(alpha=0.5))
    model_kf.fit(X_train_kf, y_train_kf)

    # Predicting the target values
    y_pred_kf = model_kf.predict(X_test_kf)

    # Computing the metrics
    rmse_kf = mean_squared_error(y_test_kf, y_pred_kf, squared=False)
    r2_kf = r2_score(y_test_kf, y_pred_kf)

    rmse_values.append(rmse_kf)
    r2_values.append(r2_kf)

# Averaging the RMSE and R2 values
avg_rmse = np.mean(rmse_values)
avg_r2 = np.mean(r2_values)

# Displaying the results
print("KFold Cross-Validation Results:")
print(f"Average RMSE: {avg_rmse}, Average R2: {avg_r2}")
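
The manual KFold loop above can also be written with scikit-learn's `cross_validate`, which handles the split, fit, and scoring steps in one call. The block below is a minimal sketch, not the notebook's method: `X_demo` and `y_demo` are synthetic stand-ins generated with `make_regression` (an assumption, since `alt_df` is not available in a standalone example), and the `_demo` names are hypothetical. Swapping in `X` and `y` should reproduce the same kind of averages as the loop.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 200 rows, 5 features, like the 5-column X above
X_demo, y_demo = make_regression(
    n_samples=200, n_features=5, noise=10.0, random_state=42
)

# Same splitter and pipeline settings as the manual loop
kf_demo = KFold(n_splits=5, shuffle=True, random_state=42)
pipe_demo = make_pipeline(StandardScaler(), Ridge(alpha=0.5))

# The scaler is refit inside each training fold, so no test-fold information leaks in
scores_demo = cross_validate(
    pipe_demo,
    X_demo,
    y_demo,
    cv=kf_demo,
    scoring=("neg_root_mean_squared_error", "r2"),
)

# scikit-learn reports RMSE negated so that higher is always better; flip the sign back
avg_rmse_demo = -scores_demo["test_neg_root_mean_squared_error"].mean()
avg_r2_demo = scores_demo["test_r2"].mean()

print("cross_validate Results:")
print(f"Average RMSE: {avg_rmse_demo}, Average R2: {avg_r2_demo}")
```

Keeping the scaler inside the pipeline is the main design point: each fold is standardized using only its own training rows.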

In [None]:
from sklearn.svm import SVR

model_svm = SVR()
model_svm.fit(X_train, y_train)

y_pred_svm = model_svm.predict(X_test)

print(y_pred_svm)
rmse_svm = mean_squared_error(y_test, y_pred_svm, squared=False)
r2_svm = r2_score(y_test, y_pred_svm)

print("SVM Regression Results:")
print(f"RMSE: {rmse_svm}, R2 Score: {r2_svm}")

In [None]:
#Random Forest
from sklearn.ensemble import RandomForestRegressor

model_rf = RandomForestRegressor()
model_rf.fit(X_train, y_train)

y_pred_rf = model_rf.predict(X_test)

rmse_rf = mean_squared_error(y_test, y_pred_rf, squared=False)
r2_rf = r2_score(y_test, y_pred_rf)


print("Random Forest Regression Results:")
print(f"RMSE: {rmse_rf}, R2 Score: {r2_rf}")

In [None]:
#Gradient Boosting Regression
from sklearn.ensemble import GradientBoostingRegressor

model_gb = GradientBoostingRegressor()
model_gb.fit(X_train, y_train)

y_pred_gb = model_gb.predict(X_test)

rmse_gb = mean_squared_error(y_test, y_pred_gb, squared=False)
r2_gb = r2_score(y_test, y_pred_gb)

print("Gradient Boosting Regression Results:")
print(f"RMSE: {rmse_gb}, R2 Score: {r2_gb}")

In [None]:
#Decision Tree Regression
from sklearn.tree import DecisionTreeRegressor

model_dt = DecisionTreeRegressor()
model_dt.fit(X_train, y_train)

y_pred_dt = model_dt.predict(X_test)

rmse_dt = mean_squared_error(y_test, y_pred_dt, squared=False)
r2_dt = r2_score(y_test, y_pred_dt)

print("Decision Tree Regression Results:")
print(f"RMSE: {rmse_dt}, R2 Score: {r2_dt}")


In [None]:
#Lasso
from sklearn.linear_model import Lasso

model_lasso = Lasso()
model_lasso.fit(X_train, y_train)

y_pred_lasso = model_lasso.predict(X_test)

rmse_lasso = mean_squared_error(y_test, y_pred_lasso, squared=False)
r2_lasso = r2_score(y_test, y_pred_lasso)

print("Lasso Regression Results:")
print(f"RMSE: {rmse_lasso}, R2 Score: {r2_lasso}")

In [None]:
from sklearn.linear_model import Ridge

model_ridge = Ridge()
model_ridge.fit(X_train, y_train)

y_pred_ridge = model_ridge.predict(X_test)

rmse_ridge = mean_squared_error(y_test, y_pred_ridge, squared=False)
r2_ridge = r2_score(y_test, y_pred_ridge)

print("Ridge Regression Results:")
print(f"RMSE: {rmse_ridge}, R2 Score: {r2_ridge}")

In [None]:
# Create the KNN model
from sklearn.neighbors import KNeighborsRegressor

model_knn = KNeighborsRegressor(n_neighbors=5)
model_knn.fit(X_train_std, y_train)

# Make predictions on the scaled test set (the model was fit on scaled features)
y_pred_knn = model_knn.predict(X_test_std)

# Compute metrics
rmse_knn = mean_squared_error(y_test, y_pred_knn, squared=False)
r2_knn = r2_score(y_test, y_pred_knn)

# Print results
print("KNN Regression Results:")
print(f"RMSE: {rmse_knn}, R2 Score: {r2_knn}")