# 04_hypothesis_analysis_and_insights.ipynb

**Project:** king_county_housing_data
**Author:** Johannes Gooth  
**Date:** April 13, 2024

---

## Introduction

The dataset has been reviewed and confirmed to contain valid and plausible data. In this notebook, we will proceed to address the business case of our client, William Rodriguez, by conducting hypothesis testing, building on the insights gained from our previous Exploratory Data Analysis (EDA). The primary objective here is to validate or refute key hypotheses related to the King County housing market, directly answering critical business questions for our client and guiding strategic decision-making.

### Key Steps:
1. **Introducing Our Client:** Understanding the requirements of our client, we will tailor our hypotheses and analysis to meet his specific needs.

2. **Formulating Hypotheses:** Based on EDA findings, we will define clear, testable hypotheses that address the main business questions, such as factors influencing property prices and the impact of location on value.

3. **Selecting Appropriate Tests:** For each hypothesis, we will choose suitable statistical methods to rigorously evaluate the evidence.

4. **Conducting Hypothesis Tests:** We will perform the statistical tests, interpreting the results to determine whether each hypothesis is supported or refuted.

5. **Deriving Business Insights:** From the test outcomes, we will extract actionable insights to inform recommendations for real estate investment, pricing strategies, and market positioning.

### Expected Outcome:
A set of validated or refuted hypotheses, along with data-driven business insights, providing clear guidance for strategic decisions in the King County housing market.

## Setting Up the Working Environment

In [13]:
import warnings
warnings.filterwarnings("ignore")

# Avoid restarting Kernel 
%load_ext autoreload
%autoreload 2

import sys
# setting path
sys.path.append('../')

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import json


from matplotlib.ticker import PercentFormatter
plt.rcParams.update({ "figure.figsize" : (8, 5),"axes.facecolor" : "white", "axes.edgecolor":  "black"})
plt.rcParams["figure.facecolor"]= "w"
pd.plotting.register_matplotlib_converters()
pd.set_option('display.float_format', lambda x: '%.3f' % x)

from src.basic_functions import *
from src.vizualization_functions import *

## Loading the Data

In [14]:
# Filepath of the CSV
file_path_csv = '../data/king_county_housing_data_cleaned_and_preprocessed.csv'

# Load the CSV data into a DataFrame
df = pd.read_csv(file_path_csv)

## Reapplying the Descriptive Label Order

In [15]:
# view_cat label
df = apply_category_order_from_json(df, 'view_cat', '../data/view_order.json')

# condition_cat label
df = apply_category_order_from_json(df, 'condition_cat', '../data/condition_order.json')

# grade_cat label
df = apply_category_order_from_json(df, 'grade_cat', '../data/grade_order.json')

## Introducing Our Client

Our client, William Rodriguez, is a buyer with specific and distinct needs in the King County housing market. William is looking to purchase two properties: one in the countryside and one in the city. Each property serves a different purpose, and therefore, each comes with its own set of requirements.

- **Countryside House:**  
  William is looking for a non-renovated house in a rural area that offers the best timing for purchase. This likely indicates a focus on finding a property with potential for renovation and value appreciation. The countryside house will be home for two people, and William is interested in finding a property that aligns with a strategic purchase timeline.

- **City House:**  
  For the city property, William’s priorities are speed and a central location. He seeks a house that is centrally located within the city and can be acquired quickly, likely to ensure proximity to urban amenities and reduce commute times.

In this analysis, we will focus on addressing William's needs by analyzing the market to identify the best opportunities for purchasing both a countryside and a city house that align with his criteria. Through hypothesis testing and data-driven insights, we aim to provide William with recommendations that will help him make informed decisions in his property search.

## Research Questions and Hypothesis Generation
In this section of the notebook, we will systematically address the research questions posed by our client, William Rodriguez, by testing a series of hypotheses related to the King County housing market. These hypotheses are designed to answer key business questions, specifically tailored to help William make informed decisions about purchasing properties in both city and countryside locations. Below is an overview of the research questions and the corresponding hypotheses that we will test:

| # | Question | Hypothesis | Indicators |
|:-:|:-|:-|:-|
| 1 | How does the location (country vs. city) affect house prices and availability? | *"Location Impact Hypothesis"*: <br> City houses command higher prices than country houses due to increased demand for central locations, but the availability may be lower. | a) *Location (zipcode, lat, long)*: <br> Differentiate between country and city houses and evaluate central locations. Leverage zipcode, lat, and long to compare pricing, availability, and features of country vs. city houses (Comparative Location Analysis). For city houses, calculate distances from key city center landmarks (requiring additional data or assumptions based on lat and long - define city center and calculate the distance) to analyze price gradients and availability (Proximity Evaluation). |
| 2 | How does the size of a house (in terms of bedrooms and bathrooms) correlate with its suitability and value for a two-person household looking to purchase in both country and city locations? | *"Size Hypothesis"*: <br> For a two-person household, houses with fewer bedrooms (e.g., 1-2 bedrooms) and bathrooms are more cost-effective than larger properties. | a) *Number of Bedrooms (bedrooms)*: <br> To analyze the correlation between the house size suitable for two people and its market value. <br> b) *Number of Bathrooms (bathrooms)*: <br> To further refine the suitability analysis based on common needs for a two-person household. <br> c) *Square Footage of the Home (sqft_living)*: <br> To consider the overall living space, which is particularly relevant for understanding comfort and suitability for the household size. |
| 3 | Do the age and the condition of a house affect the price? | *"Condition Hypothesis"*: <br> Newer and better-maintained houses command higher prices than older and less well-maintained ones. | a) *Age of the House (yr_built)*: <br> This can be directly used to assess the age of the house. Consider calculating the "actual age" of the house at the time it was sold, which could involve subtracting 'yr_built' from the year in the date column. <br> b) *Condition of the House (condition)*: <br> This column rates the overall condition of the house and can be used to evaluate how well-maintained the house is. <br> c) *Year Renovated (yr_renovated)*: <br> This can provide additional insights into the condition and up-to-dateness of the house. A recent renovation could significantly impact the house's perceived value, even if the house itself is older. Consider creating a binary indicator for whether the house has been renovated at all, or calculate how recently the renovation occurred (e.g., years since the last renovation). <br> d) *Grade (grade)*: <br> Although not directly mentioned in the hypothesis, the overall grade given to the housing unit could serve as a proxy for its quality, including factors related to its construction, design, and functionality, which are likely correlated with both age and condition. <br> e) *Price (price)*: <br> As the outcome variable, the price at which the house was sold will be the primary indicator of the market value we are trying to explain or predict. |
| 4 | What is the optimal timing to buy a house in the country to get the best deal? | *"Optimal Timing Hypothesis"*: <br> There are seasonal trends in house pricing and availability, with certain times of the year offering better deals, especially for country houses. | a) *Date of Sale (date)*: <br> Analyze seasonal trends and optimal buying times. (seasonal analysis). Use the 'date' column to identify patterns or trends in pricing and availability over different months or seasons. |

In the following sections, we will systematically address each hypothesis, using the data to evaluate and answer them step by step.
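
The derived indicators described in the table (age at sale, a renovation flag, and years since the last renovation) can be sketched as follows. This is a minimal illustration on toy rows, not the project's preprocessing code. It assumes that `yr_renovated` is 0 for houses that were never renovated, and the column name `years_since_renovation` is hypothetical. The names `age_at_sale` and `renovation_status` match columns used later in this notebook.

```python
import pandas as pd

# Toy rows standing in for the housing data (values are illustrative only)
toy = pd.DataFrame({
    "date": ["2014-10-13", "2015-02-25", "2014-12-09"],
    "yr_built": [1955, 1933, 1951],
    "yr_renovated": [0, 1991, 0],  # assumption: 0 means "never renovated"
})

# Age of the house at the time of sale: sale year minus construction year
toy["date"] = pd.to_datetime(toy["date"])
toy["age_at_sale"] = toy["date"].dt.year - toy["yr_built"]

# Binary renovation indicator
toy["renovation_status"] = (toy["yr_renovated"] > 0).map(
    {True: "renovated", False: "unrenovated"}
)

# Years since the last renovation (left missing for unrenovated houses)
toy["years_since_renovation"] = (
    toy["date"].dt.year - toy["yr_renovated"]
).where(toy["yr_renovated"] > 0)

print(toy[["age_at_sale", "renovation_status", "years_since_renovation"]])
```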

## Location Impact Hypothesis

### Mapping All Datapoints on the Map of Seattle
To visually analyze the impact of location on house prices and availability, we will start by mapping all the data points representing properties in the King County housing dataset onto a map of Seattle, including a marker for the city center. This visual representation will allow us to distinguish between city and countryside properties and observe patterns in their geographical distribution, which is crucial for understanding how location affects property values. By plotting the properties' locations, we can also begin to explore the proximity of city houses to central landmarks and how this proximity might influence their pricing.

In [69]:
import plotly.graph_objects as go
import plotly.express as px

# Plot every house on the Carto Positron basemap (built on OpenStreetMap data)
fig = px.scatter_mapbox(
    df,
    lat='lat',
    lon='long',
    hover_name='id',  # House identifier shown on hover
    color_discrete_sequence=['#72acae'],  # Point color for houses
    zoom=8.55,  # Zoom level to focus on Seattle
    center={"lat": 47.45, "lon": -122.10},  # Centers the map
    mapbox_style="carto-positron",
)

# Name the existing house trace and show it in the legend.
# This avoids adding a second trace that would draw every point twice.
fig.update_traces(marker=dict(size=5), name='Houses', showlegend=True)

# Coordinates for the center of Seattle
seattle_center = {"lat": 47.6062, "lon": -122.3321}

# Adding a point for the center of Seattle using Graph Objects
fig.add_trace(go.Scattermapbox(
    lat=[seattle_center['lat']],
    lon=[seattle_center['lon']],
    mode='markers',
    marker=go.scattermapbox.Marker(
        size=6,
        color='black',
        opacity=1
    ),
    name='Center of Seattle',  # Add a custom name for this trace in the legend
    showlegend=True,  # Ensure the trace appears in the legend
    text=['Center of Seattle'],  # Text to display on hover
))

# Customize the legend
fig.update_layout(
    width=1000, 
    height=700,
    legend=dict(
        title=None,  # Remove legend title
        font=dict(size=11, color='black'),  # Font for legend items
        bgcolor="rgba(255, 255, 255, 0.0)",  # Set background color for the legend (transparent white)
        borderwidth=0,  # Remove border around the legend (no frame)
        orientation="v",  # Vertical layout (default)
        yanchor="top",  # Align legend vertically at the top
        y=1,  # Position at the top of the plot
        xanchor="left",  # Align legend horizontally on the left
        x=0,  # Position at the left of the plot
    )
)

# Display the plot
fig.show()

### Visualizing City and Countryside Houses Alongside Seattle's Center on a Unified Map Based on Our Analysis
In this section, we will visualize both city and countryside houses on a single map, including a marker for Seattle's city center. This unified map will help us compare the geographical distribution of properties and how their proximity to the city center might impact pricing and availability. By overlaying these data points, we aim to gain a clearer understanding of the location dynamics that are central to our analysis of the King County housing market.

In [60]:
import plotly.express as px
import plotly.graph_objects as go

# Custom names for the legend categories
custom_legend_names = {
    'city': 'City Houses',  # Rename 'city' to 'City Houses'
    'countryside': 'Countryside Houses',  # Rename 'countryside' to 'Countryside Houses'
}

# Creating the plot with OpenStreetMap
fig = px.scatter_mapbox(
    df,
    lat='lat',
    lon='long',
    hover_name='id',  # Assuming the 'id' column is for house IDs or addresses
    color='location_type',  # Use the 'location_type' column for coloring
    color_discrete_map={'city': '#84a8cb', 'countryside': '#bd8585'},  # Assign custom colors
    category_orders={'location_type': ['city', 'countryside']},  # Ensure order of categories
    labels={'location_type': 'Area Type'},  # Custom label for the color legend
    size_max=15,
    zoom=8.6,  # Adjust the zoom level as needed
    center={"lat": 47.45, "lon": -122.10},  # Centers the map
    mapbox_style="carto-positron",
)

# Rename the legend entries generated by Plotly Express
for trace in fig.data:
    if trace.name in custom_legend_names:
        trace.name = custom_legend_names[trace.name]  # Rename 'city' to 'City Houses', etc.

# Coordinates for the center of Seattle
seattle_center = {"lat": 47.6062, "lon": -122.3321}

# Adding a point for the center of Seattle using Graph Objects
fig.add_trace(go.Scattermapbox(
    lat=[seattle_center['lat']],
    lon=[seattle_center['lon']],
    mode='markers',
    marker=go.scattermapbox.Marker(
        size=6,
        color='black',
        opacity=1
    ),
    name='Center of Seattle',  # Add a custom name for this trace in the legend
    showlegend=True,  # Ensure the trace appears in the legend
    text=['Center of Seattle'],  # Text to display on hover
))

# Customize the legend
fig.update_layout(
    width=1000, 
    height=700,
    legend=dict(
        title=None,  # Remove legend title
        font=dict(size=11, color='black'),  # Font for legend items
        bgcolor="rgba(255, 255, 255, 0.0)",  # Set background color for the legend (transparent white)
        borderwidth=0,  # Remove border around the legend (no frame)
        orientation="v",  # Vertical layout (default)
        yanchor="top",  # Align legend vertically at the top
        y=1,  # Position at the top of the plot
        xanchor="left",  # Align legend horizontally on the left
        x=0,  # Position at the left of the plot
    )
)

# Display the plot
fig.show()


### Comparative Location Analysis

In this section, we will compare basic metrics between city and countryside houses to assess the impact of location on property characteristics. Specifically, we will examine:

- **Price:** The average price of houses in the city versus the countryside.
- **Square Footage (sqft_living):** The average living space of homes, measured in square feet, across both locations.
- **Location Type (location_type):** A categorical variable indicating whether a property is located in the 'city' or the 'countryside'.

By analyzing these metrics, we aim to identify any significant differences between city and countryside properties, providing insights into how location influences both pricing and the size of homes. This comparative analysis will help us test the "Location Impact Hypothesis" and understand the distinct characteristics of urban and rural housing markets in King County.

**Price and Square Footage of Living Space**

These metrics provide insights into the typical property size and value in different locations.

The results are visualized in a bar chart with error bars, which represent the variability (standard deviation) in price and square footage within each location type. This visualization helps to compare:

- *Average Price:* How the average property price differs between city and countryside locations.
- *Average Square Footage:* How the average size of homes compares between these two settings.

The use of error bars provides additional context by showing the range of variability around the mean values, indicating how much prices and sizes vary within each group. This is particularly important for understanding the consistency of the data within each location type.

Visualization Details:

1. *Average Price ($):*  
   The first subplot shows the average price for city versus countryside houses, with standard deviation error bars to indicate the variation in prices.

2. *Average Sqft of Living (square feet):*  
   The second subplot illustrates the average living space for city versus countryside houses, again with error bars to highlight variability.
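
The group means and standard deviations behind such a bar chart with error bars can be computed with a single `groupby`. The sketch below uses a toy DataFrame with illustrative values. Only the column names `location_type`, `price`, and `sqft_living` are taken from this notebook.

```python
import pandas as pd

# Illustrative values only, not taken from the dataset
toy = pd.DataFrame({
    "location_type": ["city", "city", "countryside", "countryside"],
    "price": [500_000, 700_000, 400_000, 600_000],
    "sqft_living": [1500, 2100, 1800, 2600],
})

# Mean gives the bar height, standard deviation gives the error bar length
summary = toy.groupby("location_type")[["price", "sqft_living"]].agg(["mean", "std"])
print(summary)
```

If the intervals of mean plus or minus one standard deviation overlap between the two groups, the difference in means alone is not strong evidence of a real gap.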

The initial analysis suggests that the hypothesis (that city houses command significantly higher prices than countryside houses) is not strongly supported. Within one standard deviation, the prices of city and countryside properties are comparable, and the living spaces are also similar. In other words, on average, countryside houses are neither markedly cheaper nor markedly larger than city houses.

**Examine Price per Square Foot**

To further explore the potential differences in value between city and countryside houses, we will now examine the average price per square foot. This metric allows us to assess whether city properties, despite their similar overall prices and living spaces, are valued more highly on a per-unit basis compared to countryside properties. Analyzing price per square foot will provide deeper insights into how location impacts property value and whether city homes indeed command a premium when considering the efficiency of space utilization.

The box plots below show the distribution of price per square foot in both settings, allowing us to assess whether city properties are valued more highly on a per-unit basis than countryside properties.
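
The `price_per_sqft` column used in the next cell is assumed to come from the preprocessing notebook. A minimal sketch of how such a column could be derived and summarized per location type, using illustrative values, is:

```python
import pandas as pd

# Illustrative values only, not taken from the dataset
toy = pd.DataFrame({
    "location_type": ["city", "city", "countryside", "countryside"],
    "price": [450_000, 600_000, 420_000, 500_000],
    "sqft_living": [1500, 2000, 2100, 2500],
})

# Price per square foot: sale price divided by living area
toy["price_per_sqft"] = toy["price"] / toy["sqft_living"]

# The median is the center line that a box plot draws for each group
median_ppsf = toy.groupby("location_type")["price_per_sqft"].median()
print(median_ppsf)
```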

In [54]:
# Prepare data for box plot
city_data = df[df['location_type'] == 'city']['price_per_sqft']
countryside_data = df[df['location_type'] == 'countryside']['price_per_sqft']

# Creating subplots: 1 row, 2 columns for box plots of price per sqft for city and countryside
fig = make_subplots(rows=1, cols=2, subplot_titles=("Price per Sqft: City", "Price per Sqft: Countryside"))

# Adding city data to the first subplot
fig.add_trace(
    go.Box(y=city_data, name='City', marker_color='#B5838D'),
    row=1, col=1
)

# Adding countryside data to the second subplot
fig.add_trace(
    go.Box(y=countryside_data, name='Countryside', marker_color='#6D6875'),
    row=1, col=2
)

# Manually setting y-axis range for each subplot and moving y-axis labels further
fig.update_yaxes(title_text='Price per Sqft ($)', range=[0, 900], row=1, col=1, title_standoff=5)
fig.update_yaxes(title_text='Price per Sqft ($)', range=[0, 900], row=1, col=2, title_standoff=5)

# Update layout (optional adjustments)
fig.update_layout(
    title_text='Comparative Location Analysis: Price per Sqft by Location Type',
    height=400,
    width=800,
    yaxis_automargin=True,
    showlegend=False,
)

# Show the figure
fig.show()

While the typical price per square foot in the countryside is lower, the substantial spread within both groups means that prices for city and countryside homes overlap considerably. To gain further insights, we will now look from another angle and analyze how proximity to key city center landmarks affects property prices in the city.
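
A visual comparison can be backed by a formal test. One option, shown here as a sketch rather than as the method used in this project, is a two-sided permutation test for the difference in mean price per square foot. The function name `permutation_test_mean_diff` and the sample values are hypothetical.

```python
import numpy as np

def permutation_test_mean_diff(a, b, n_permutations=10_000, seed=0):
    """Two-sided permutation test for a difference in means between two samples."""
    rng = np.random.default_rng(seed)
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    observed = a.mean() - b.mean()
    pooled = np.concatenate([a, b])
    count = 0
    for _ in range(n_permutations):
        # Reassign group labels at random and recompute the difference
        shuffled = rng.permutation(pooled)
        diff = shuffled[:a.size].mean() - shuffled[a.size:].mean()
        if abs(diff) >= abs(observed):
            count += 1
    # Add-one correction keeps the estimated p-value strictly positive
    p_value = (count + 1) / (n_permutations + 1)
    return observed, p_value

# Illustrative price-per-sqft values, not taken from the dataset
city = [310, 295, 330, 280, 305, 320, 290, 315]
countryside = [210, 230, 190, 250, 220, 205, 240, 215]

observed_diff, p_value = permutation_test_mean_diff(city, countryside)
print(f"Observed difference: {observed_diff:.3f}, p-value: {p_value:.4f}")
```

On the real data, the two inputs would be the `city_data` and `countryside_data` series from the cell above. A permutation test makes no normality assumption, which suits skewed price data.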

**Proximity Evaluation for City Houses**

In this section, we will focus on city properties and evaluate how their price gradients change relative to their distance from central landmarks. This analysis will help us understand whether closer proximity to the city center and key amenities significantly increases property values. By examining the relationship between distance and price, we aim to determine if a price premium exists for homes located nearer to the city center, which could be a critical factor for urban buyers prioritizing location.

By categorizing distances into bins and calculating the average price for each bin, we can observe the price gradient as homes get closer or further from the city center.
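
The binning below relies on a precomputed `mile_dist_center` column. One way such a distance could be derived from `lat` and `long` is the haversine formula, sketched here under the assumption of a spherical Earth with a mean radius of 3958.8 miles. The function name `haversine_miles` and the sample coordinates are hypothetical, and the project's own helper may differ.

```python
import numpy as np

EARTH_RADIUS_MILES = 3958.8  # assumption: mean Earth radius

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between points given in decimal degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * np.arcsin(np.sqrt(a))

# City center coordinates as used for the map marker in this notebook
seattle_center = {"lat": 47.6062, "lon": -122.3321}

# Illustrative house coordinates (the first one is the center itself)
lats = np.array([47.6062, 47.5112, 47.7210])
lons = np.array([-122.3321, -122.2570, -122.3190])

distances = haversine_miles(lats, lons, seattle_center["lat"], seattle_center["lon"])
print(distances)
```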

In [56]:
# Create bins for distances
bins = np.linspace(0, 20, 21)  # Creating 20 bins for the range 0-20 miles
df['distance_bin'] = pd.cut(df['mile_dist_center'], bins, labels=bins[:-1].astype(str))

# Calculate the mean price for each bin and location type
binned_prices = df.groupby(['distance_bin', 'location_type'])['price'].mean().reset_index()

# Plotting
fig = px.bar(
    binned_prices,
    x='distance_bin',
    y='price',
    color='location_type',  # Differentiate bars by location type
    barmode='group',
    title='Average House Price vs. Distance from City Center',
    labels={'distance_bin': 'Distance from City Center (miles)', 'price': 'Average Price ($)'},
    color_discrete_map={'city': '#B5838D', 'countryside': '#6D6875'}  # Custom colors
)

# Adjust the layout if needed
fig.update_layout(xaxis_title='Distance from City Center (miles)',
                  yaxis_title='Average Price ($)',
                  xaxis={'categoryorder':'array', 'categoryarray':bins[:-1].astype(str)})

fig.show()

The proximity evaluation reveals an interesting trend: there is a middle range where countryside house prices significantly exceed those in the city. This anomaly could be attributed to desirable suburbs located near the city center, where the appeal of more spacious properties might drive up prices. However, it's also possible that the presence of larger lots in these suburban areas is influencing this trend.

To explore this further, we will revisit the average price per square foot. By focusing on the price per square foot, we can better understand whether these higher prices in the countryside are due to genuinely more valuable properties or simply a result of larger lot sizes inflating overall prices. This analysis will help us determine if the price premium in these middle-range areas is justified on a per-square-foot basis, offering deeper insights into the factors driving property values in these desirable suburban locations.

In [57]:
# Calculate the mean price per sqft for each bin and location type
binned_price_per_sqft = df.groupby(['distance_bin', 'location_type'])['price_per_sqft'].mean().reset_index()

# Plotting
fig = px.bar(
    binned_price_per_sqft,
    x='distance_bin',
    y='price_per_sqft',
    color='location_type',  # Differentiate bars by location type
    barmode='group',
    title='Average Price per Sqft vs. Distance from City Center',
    labels={'distance_bin': 'Distance from City Center (miles)', 'price_per_sqft': 'Average Price per Sqft ($)'},
    color_discrete_map={'city': '#B5838D', 'countryside': '#6D6875'}  # Custom colors
)

# Adjust the layout if needed
fig.update_layout(xaxis_title='Distance from City Center (miles)',
                  yaxis_title='Average Price per Sqft ($)',
                  xaxis={'categoryorder': 'array', 'categoryarray': bins[:-1].astype(str)})

fig.show()

City houses within 3 miles of the center have a slightly higher price per square foot than houses located 8 or more miles from the city center. Interestingly, there is an intermediate range where properties combine the appeal of countryside living with proximity to the city center, and this range shows the highest prices.

Let's create a Plotly scatter mapbox plot that includes only houses that are either city houses within 3 miles of the center or countryside houses that are more than 8 miles away from the center:

In [58]:
# Keep city houses within 3 miles and countryside houses beyond 8 miles of the center.
# This filter already excludes every house between 3 and 8 miles,
# so no second filtering step is needed.
filtered_df = df[
    ((df['location_type'] == 'city') & (df['mile_dist_center'] < 3)) |
    ((df['location_type'] == 'countryside') & (df['mile_dist_center'] > 8))
]

# Create a scatter mapbox plot
fig = px.scatter_mapbox(
    filtered_df,
    lat='lat',
    lon='long',
    hover_name='id',  # Assuming 'id' for house IDs or addresses
    color='location_type',  # Use the 'location_type' column for coloring
    color_discrete_map={'city': '#B5838D', 'countryside': '#6D6875'},  # Custom colors
    size_max=15,
    zoom=8.55,  # Adjust zoom level as needed
    center={"lat": 47.45, "lon": -122.10},  # Centers the map around a specific area
    title='City and Countryside Houses: Selective Distance Criteria',
    mapbox_style="open-street-map"
)

# Coordinates for the center of Seattle, adding as an additional point
seattle_center = {"lat": 47.6062, "lon": -122.3321}

# Adding a point for the center of Seattle using Graph Objects
fig.add_trace(go.Scattermapbox(
    lat=[seattle_center['lat']],
    lon=[seattle_center['lon']],
    mode='markers',
    marker=go.scattermapbox.Marker(
        size=10,
        color='black',  # Color for the center point
        opacity=1
    ),
    name='Center of Seattle',  # Legend entry (otherwise Plotly shows a generic trace name)
    text=['Center of Seattle'],  # Hover text
))

# Adjust the size of the plot
fig.update_layout(width=1000, height=700)

# Display the plot
fig.show()



### Conclusion
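Whether the hypothesis holds depends on the distance range considered. City houses within 3 miles of the center are slightly more expensive than countryside houses 8 or more miles out, while countryside houses in the intermediate range are the most expensive of all. Our client wants one central city house and one countryside house, so the relevant comparison is between the first two groups. For that comparison the hypothesis holds, and a countryside property located 8 miles or more from the city center is likely to be the more affordable option for him.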

## Size Hypothesis

To analyze the "Size Hypothesis" which suggests that houses with fewer bedrooms (1-2) and bathrooms are more cost-effective for a two-person household, and to understand the impact of these factors along with square footage on the market value, we can create plots for each indicator. These will include:

- Number of Bedrooms vs. Price: To observe how the market value correlates with the number of bedrooms, suitable for a two-person household.

- Number of Bathrooms vs. Price: To refine the analysis based on the number of bathrooms, addressing common needs for a two-person household.

- Square Footage of the Home vs. Price: To assess the impact of living space on comfort and suitability, and its relation to market value.

For this analysis, we'll use Plotly for visualization to create scatter plots, which are suitable for observing correlations and trends.
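
Scatter plots show the direction of a relationship; a rank correlation puts a number on it. The sketch below computes Spearman's rank correlation as the Pearson correlation of ranked values, which needs only pandas. The toy values are illustrative and not taken from the dataset.

```python
import pandas as pd

# Illustrative values only
toy = pd.DataFrame({
    "bedrooms": [1, 2, 3, 3, 4, 5],
    "bathrooms": [1.0, 1.0, 1.75, 2.25, 2.5, 3.0],
    "sqft_living": [700, 950, 1600, 1900, 2400, 3100],
    "price": [250_000, 310_000, 420_000, 480_000, 610_000, 790_000],
})

# Spearman correlation = Pearson correlation computed on ranks (ties get average ranks)
rank_corr = toy.rank().corr()["price"].drop("price")
print(rank_corr)
```

Rank correlation is a reasonable choice here because bedroom and bathroom counts are discrete and prices are typically skewed.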

In [None]:
# Creating subplots: 3 rows, 1 column
fig = make_subplots(rows=3, cols=1, subplot_titles=("Price vs. Number of Bedrooms", 
                                                     "Price vs. Number of Bathrooms", 
                                                     "Price vs. Square Footage"))

# Adding scatter plot for Number of Bedrooms vs. Price
fig.add_trace(
    go.Scatter(x=df['bedrooms'], y=df['price'], mode='markers', name='Bedrooms', marker_color='#E5989B'),
    row=1, col=1
)

# Adding scatter plot for Number of Bathrooms vs. Price
fig.add_trace(
    go.Scatter(x=df['bathrooms'], y=df['price'], mode='markers', name='Bathrooms', marker_color='#E5989B'),
    row=2, col=1
)

# Adding scatter plot for Square Footage vs. Price
fig.add_trace(
    go.Scatter(x=df['sqft_living'], y=df['price'], mode='markers', name='Sqft Living', marker_color='#E5989B'),
    row=3, col=1
)

# Update layout (optional adjustments)
fig.update_layout(
    height=1200, 
    width=600, 
    title_text="Size Hypothesis Analysis",
    showlegend=False
)

fig.update_xaxes(title_text="Bedrooms", row=1, col=1)
fig.update_xaxes(title_text="Bathrooms", row=2, col=1)
fig.update_xaxes(title_text="Square Footage of Living Space (sqft)", row=3, col=1)
fig.update_yaxes(title_text="Price ($)", row=1, col=1)
fig.update_yaxes(title_text="Price ($)", row=2, col=1)
fig.update_yaxes(title_text="Price ($)", row=3, col=1)

# Show the figure
fig.show()

To enhance the clarity of our analysis, especially since the x-axis values in the two upper panels are discrete, we supplement the scatter plots with bar plots. These bar plots give a more straightforward view of how the number of bedrooms and bathrooms relates to market value, making it easier to interpret trends in these discrete variables. This dual approach provides a more complete picture of the "Size Hypothesis" and its implications for our client's decision-making process.

In [None]:
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# Calculating the average price for each number of bedrooms and bathrooms
avg_price_per_bedroom = df.groupby('bedrooms')['price'].mean().reset_index()
avg_price_per_bathroom = df.groupby('bathrooms')['price'].mean().reset_index()

# Creating subplots: 2 rows, 1 column for Bedrooms and Bathrooms analysis
fig = make_subplots(rows=2, cols=1, subplot_titles=("Average Price vs. Number of Bedrooms", 
                                                     "Average Price vs. Number of Bathrooms"))

# Adding bar plot for Average Price vs. Number of Bedrooms
fig.add_trace(
    go.Bar(x=avg_price_per_bedroom['bedrooms'], y=avg_price_per_bedroom['price'], name='Bedrooms', marker_color='#E5989B'),
    row=1, col=1
)

# Adding bar plot for Average Price vs. Number of Bathrooms
fig.add_trace(
    go.Bar(x=avg_price_per_bathroom['bathrooms'], y=avg_price_per_bathroom['price'], name='Bathrooms', marker_color='#E5989B'),
    row=2, col=1
)

# Update layout (optional adjustments)
fig.update_layout(height=800, 
                  width=600, 
                  title_text="Size Hypothesis Analysis: Bedrooms & Bathrooms",
                  showlegend=False,
                  )

fig.update_xaxes(title_text="Number of Bedrooms", row=1, col=1)
fig.update_xaxes(title_text="Number of Bathrooms", type='category', row=2, col=1)  # Setting type as category for better display
fig.update_yaxes(title_text="Average Price ($)", row=1, col=1)
fig.update_yaxes(title_text="Average Price ($)", row=2, col=1)

# Show the figure
fig.show()


### Conclusion

- **Average Price vs. Number of Bedrooms:**  
  This bar plot visualizes the average price for homes with varying numbers of bedrooms, providing insights into cost-effectiveness trends for smaller homes, particularly those with 1-2 bedrooms, which are ideal for a two-person household. This analysis helps to affirm the "Size Hypothesis," suggesting that homes with fewer bedrooms may offer better value for smaller households.

- **Average Price vs. Number of Bathrooms:**  
  Similarly, this plot examines the average price across homes with different numbers of bathrooms, further refining our analysis of suitability for two-person households. By comparing market values in relation to the number of bathrooms, we can better understand the cost implications of this key factor, supporting the "Size Hypothesis" for properties with 1-2 bathrooms, which align with the needs of smaller households.

This method of analysis provides a direct comparison of market values based on the number of bedrooms and bathrooms, reinforcing the "Size Hypothesis" and emphasizing the cost-effectiveness of properties that are well-suited for a two-person household.
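
"Cost-effective" can refer to the total price or to the price per square foot, and the two need not agree. The sketch below compares both measures for 1-2 bedroom homes against larger ones. The values are illustrative, and the column name `size_group` is hypothetical.

```python
import numpy as np
import pandas as pd

# Illustrative values only, not taken from the dataset
toy = pd.DataFrame({
    "bedrooms": [1, 2, 2, 3, 4, 5],
    "price": [240_000, 300_000, 330_000, 450_000, 560_000, 700_000],
    "sqft_living": [600, 1000, 1100, 1800, 2240, 3500],
})
toy["price_per_sqft"] = toy["price"] / toy["sqft_living"]

# Split homes into those sized for a two-person household and larger ones
toy["size_group"] = np.where(toy["bedrooms"] <= 2, "1-2 bedrooms", "3+ bedrooms")

comparison = toy.groupby("size_group").agg(
    median_price=("price", "median"),
    median_price_per_sqft=("price_per_sqft", "median"),
)
print(comparison)
```

In this toy example the smaller homes have the lower total price but the higher price per square foot, so it is worth stating which measure a conclusion about cost-effectiveness refers to.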

## Condition Hypothesis


To analyze the "Condition Hypothesis," which suggests that newer and better-maintained houses command higher prices than older and less well-maintained ones, we can follow a structured approach that examines key indicators. Below is a detailed analysis and visualization strategy using Python and Plotly:

### Age of the House vs. Price
This analysis looks at the relationship between the age of the house at the time of sale and its market value. We calculate the average price for houses based on their age and visualize this relationship to see if younger houses tend to command higher prices.

In [None]:
# a) Age of the House vs. Price

# Calculating the average price for each age at sale
avg_price_by_age = df.groupby('age_at_sale')['price'].mean().reset_index()

# Sorting by age at sale for ordered plotting
avg_price_by_age_sorted = avg_price_by_age.sort_values(by='age_at_sale')

# Creating the bar plot
fig = px.bar(avg_price_by_age_sorted, 
             x='age_at_sale', 
             y='price',
             labels={'age_at_sale': 'Age at Sale', 'price': 'Average Price'},
             title='Average Sale Price vs. Age at Sale',
             color_discrete_sequence=['#E5989B'])

fig.show()
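
A bar chart with one bar per year of age can be noisy, so a single summary number may help when judging whether any monotonic trend exists. The following is a minimal sketch, not part of the original analysis: it computes Spearman's rank correlation between age and price on synthetic data. The column names `age_at_sale` and `price` follow the notebook; the `demo` DataFrame and its values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the notebook's DataFrame (all values are made up)
rng = np.random.default_rng(42)
demo = pd.DataFrame({
    "age_at_sale": rng.integers(0, 115, size=500),
    "price": rng.lognormal(mean=13, sigma=0.5, size=500),
})

# Spearman's rho equals the Pearson correlation of the ranked values
rho = demo["age_at_sale"].rank().corr(demo["price"].rank())
print(f"Spearman rho (age vs. price): {rho:.3f}")
```

A value of `rho` near zero would be consistent with the absence of a monotonic relationship between age and price; replacing `demo` with the notebook's `df` would apply the same check to the real data.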


### Condition of the House vs. Price
This part of the analysis evaluates how the condition of the house affects its price. To this end, the average price is calculated for each condition category.

In [None]:
# Calculate the average price for each condition label
avg_price_by_condition_label = df.groupby('condition_label')['price'].mean().reset_index()

# Creating the bar plot
fig = px.bar(avg_price_by_condition_label, 
             x='condition_label', 
             y='price',
             labels={'condition_label': 'Condition', 'price': 'Average Price ($)'},
             title='Average Sale Price vs. Condition',
             color_discrete_sequence=['#E5989B'])

fig.show()
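
Group averages can be misleading when a category contains only a few houses or a handful of very expensive ones. As a hedged sketch, the same `groupby` can report the median and the group size alongside the mean. The `demo` DataFrame, its labels, and its prices are hypothetical; only the column names `condition_label` and `price` come from the notebook.

```python
import pandas as pd

# Hypothetical mini-sample; labels and prices are invented for illustration
demo = pd.DataFrame({
    "condition_label": ["Average", "Average", "Good", "Good", "Very Good", "Poor"],
    "price": [400_000, 450_000, 520_000, 480_000, 610_000, 300_000],
})

# Mean, median, and number of houses per condition category
summary = (
    demo.groupby("condition_label")["price"]
        .agg(["mean", "median", "count"])
        .reset_index()
)
print(summary)
```

A category with a small `count` deserves a more cautious reading of its average than one backed by many sales.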

### Impact of Renovation
This analysis compares the prices of renovated versus non-renovated houses to see how renovations affect market value. A box plot is used to visualize the price distribution for both groups.

In [None]:
# Visualize price distribution for renovated vs. non-renovated houses.
fig = go.Figure()

# Adding traces for renovated and unrenovated houses
fig.add_trace(go.Box(
    y=df[df['renovation_status'] == 'renovated']['price'],
    name='Renovated',
    line=dict(color='#FFCDB2')  # Setting the color of the box outline
))

fig.add_trace(go.Box(
    y=df[df['renovation_status'] == 'unrenovated']['price'],
    name='Not Renovated',
    line=dict(color='#FFB4A2')  # Setting the color of the box outline
))

# Update layout with title, manually set y-axis range, and y-axis label
fig.update_layout(
    title='Price Distribution: Renovated vs. Not Renovated',
    yaxis=dict(
        range=[0, 2000000],  # Manually setting the y-axis limits; adjust as needed
        title='Price ($)'  # Adding a y-axis label
    )
)

# Show the figure
fig.show()
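
A box plot shows the difference between the two groups but not whether it could have arisen by chance. One possible way to check this, sketched here under stated assumptions, is a permutation test on the difference in median price. The data below is synthetic, and the group sizes, price levels, and the helper `median_gap` are invented for illustration; only the column names `renovation_status` and `price` and the two status values come from the notebook.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in data; group sizes and price levels are invented
demo = pd.DataFrame({
    "renovation_status": ["renovated"] * 60 + ["unrenovated"] * 240,
    "price": np.concatenate([
        rng.lognormal(mean=13.2, sigma=0.4, size=60),
        rng.lognormal(mean=13.0, sigma=0.4, size=240),
    ]),
})

is_renovated = (demo["renovation_status"] == "renovated").to_numpy()
prices = demo["price"].to_numpy()

def median_gap(values, mask):
    """Median of the masked group minus median of the remaining values."""
    return np.median(values[mask]) - np.median(values[~mask])

observed = median_gap(prices, is_renovated)

# Shuffle the group labels to simulate "renovation has no effect on price"
n_permutations = 2000
null_gaps = np.array([
    median_gap(prices, rng.permutation(is_renovated))
    for _ in range(n_permutations)
])

# One-sided p-value: share of shuffled gaps at least as large as the observed one
p_value = (np.sum(null_gaps >= observed) + 1) / (n_permutations + 1)
print(f"Observed median gap: {observed:,.0f}; p-value: {p_value:.4f}")
```

The median is used instead of the mean because house prices are typically right-skewed; a small p-value would indicate that the observed gap is unlikely under random label assignment.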


### Grade of the House vs. Price
The final analysis assesses how the grade of the house, which serves as a proxy for quality, correlates with its price. To this end, the average price for each grade category is visualized.

In [None]:
# Calculate the average price for each grade label
avg_price_by_grade_label = df.groupby('grade_label')['price'].mean().reset_index()

# Creating the bar plot
fig = px.bar(avg_price_by_grade_label, 
             x='grade_label', 
             y='price',
             labels={'grade_label': 'Grade Category', 'price': 'Average Price ($)'},
             title='Average House Price by Grade Category',
             color_discrete_sequence=['#E5989B'])

fig.show()

### Conclusion

- **Age at Sale:** The plot shows no clear trend toward higher prices for younger houses. The effect of age therefore needs to be viewed in conjunction with condition and renovations.
- **Condition and Grade:** Higher condition ratings and grades correlate with higher prices. These factors can serve as quality indicators.
- **Renovation's Impact:** The data shows that renovated houses tend to have higher prices, particularly if the renovation is recent. This underscores the hypothesis that well-maintained or updated homes command higher prices.

Taken together, these results support the "Condition Hypothesis" for the maintenance-related indicators (condition, grade, and renovation status), but not for age alone. Examining all four indicators gives a clearer picture of how each contributes to the market value of homes in King County.

## Optimal Timing Hypothesis

To analyze the "Optimal Timing Hypothesis" and explore seasonal trends in house pricing and availability, we will focus on the 'date' column to identify patterns across different months or seasons. Here's a step-by-step approach to this analysis:

1. **Extracting the Month or Season:**
We first extract the month from the 'date' column to analyze trends by month. Additionally, we can map these months to seasons if we want to explore broader seasonal trends.
2. **Calculating Average Prices and Availability:**
We calculate the average house prices for each month and each season to observe how pricing varies throughout the year.
3. **Creating Visualizations:**
We then create bar plots to visualize these trends, highlighting any patterns that might indicate the optimal time for purchasing a house.


### Extracting the Month and Season

In [None]:
# Convert the values of the 'date' column into datetime format
df['date'] = pd.to_datetime(df['date'], format='mixed', yearfirst=True)

# Extract month
df['month'] = df['date'].dt.month

# Optional: Map months to seasons if you prefer a seasonal analysis
def month_to_season(month):
    if month in [12, 1, 2]:
        return 'Winter'
    elif month in [3, 4, 5]:
        return 'Spring'
    elif month in [6, 7, 8]:
        return 'Summer'
    else:  # 9, 10, 11
        return 'Autumn'

df['season'] = df['month'].apply(month_to_season)

### Calculating Average Prices by Month and Season

In [None]:
# Calculate average price per month/season
avg_price_per_month = df.groupby('month')['price'].mean().reset_index()

# Or, for seasonal analysis
avg_price_per_season = df.groupby('season')['price'].mean().reset_index()
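
Availability can be approximated in the same way. The sketch below counts the number of sales per month; this is an assumption, since completed sales are only a proxy for the number of houses on the market. The `demo` DataFrame and the column name `n_sales` are hypothetical; `date` and `month` follow the notebook.

```python
import pandas as pd

# Invented sample with ISO-formatted sale dates
demo = pd.DataFrame({
    "date": ["2014-05-02", "2014-05-17", "2014-12-09", "2015-01-20", "2015-05-11"],
    "price": [450_000, 520_000, 390_000, 410_000, 480_000],
})
demo["date"] = pd.to_datetime(demo["date"])
demo["month"] = demo["date"].dt.month

# Number of recorded sales per calendar month (proxy for availability)
sales_per_month = (
    demo.groupby("month")
        .size()
        .reset_index(name="n_sales")
)
print(sales_per_month)
```

Read together with the average prices, the counts show whether a cheaper month also offers fewer houses to choose from.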

### Visualizing Trends

In [None]:
# Plotting the average price per month
fig_month = px.bar(avg_price_per_month, x='month', y='price',
                   labels={'month': 'Month', 'price': 'Average Price ($)'},
                   title='Average House Price by Month',
                   color_discrete_sequence=['#E5989B'])

fig_month.show()

# Plotting the average price per season
fig_season = px.bar(avg_price_per_season, x='season', y='price',
                    labels={'season': 'Season', 'price': 'Average Price ($)'},
                    title='Average House Price by Season',
                    category_orders={"season": ["Winter", "Spring", "Summer", "Autumn"]},
                    color_discrete_sequence=['#E5989B'])

fig_season.show()
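
Because the bars start at zero, small seasonal differences can be hard to read from the chart. A minimal sketch of one way to quantify them is to express each seasonal average as a percentage deviation from a baseline. The numbers below are invented, and the column name `pct_vs_baseline` is hypothetical; note that the baseline here is the unweighted mean of the four seasonal averages, not the overall mean price.

```python
import pandas as pd

# Invented seasonal averages, for illustration only
avg_demo = pd.DataFrame({
    "season": ["Winter", "Spring", "Summer", "Autumn"],
    "price": [500_000, 550_000, 540_000, 510_000],
})

# Unweighted mean of the seasonal averages serves as the baseline
baseline = avg_demo["price"].mean()

# Percentage by which each season lies above (+) or below (-) the baseline
avg_demo["pct_vs_baseline"] = (avg_demo["price"] / baseline - 1) * 100
print(avg_demo.round(2))
```

Applied to `avg_price_per_season`, the same two lines would show how large the winter discount is in relative terms.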

### Conclusion

- **Seasonal Trends:**
A trend is already evident from the seasonal analysis: purchasing during the winter appears to be optimal. This could be due to lower demand in colder months, leading to more competitive pricing for buyers.

This analysis provides valuable insights into the optimal timing for purchasing a house in the King County housing market. By identifying periods where prices are generally lower, such as in the winter months, our client can make more informed decisions about when to enter the market, potentially securing better deals. This aligns with the "Optimal Timing Hypothesis" and reinforces the importance of timing in real estate transactions.

## Closing
In this notebook, we thoroughly analyzed several key hypotheses related to the King County housing market to provide actionable insights for our client. Through detailed analysis and visualization, we examined how location, size, condition, and timing influence property prices.

- **Location Impact Hypothesis:** We found that properties closer to the city center generally command higher prices, especially within a 3-mile radius. However, certain desirable suburban areas located 8 miles or more from the city center also exhibit higher prices, likely due to their unique blend of countryside appeal and proximity to urban amenities.
  
- **Size Hypothesis:** Our analysis confirmed that smaller homes with 1-2 bedrooms and bathrooms are more cost-effective, particularly for a two-person household. This supports the hypothesis that size significantly impacts market value, with smaller homes offering better value for households with fewer occupants.

- **Condition Hypothesis:** The data showed that well-maintained and renovated houses command higher prices, whereas age at sale alone showed no clear price trend. Condition and grade emerged as strong indicators of quality and correlate with higher market value.

- **Optimal Timing Hypothesis:** Seasonal trends suggest that purchasing during the winter months could offer the best deals, as average prices tend to be lower during this period, possibly due to reduced demand.

In the next notebook, **05_Final_Recommendations_and_Strategy.ipynb**, we will build on these insights to formulate tailored recommendations for our client. This final phase will involve synthesizing all the analyses and findings into strategic advice, ensuring that every recommendation aligns with our client’s specific goals and preferences. We will also conduct any additional necessary validations to strengthen the foundation of our advice, guiding our client toward making informed, strategic real estate investments in the King County housing market.