# C3M4 Lesson 1 Practice Lab: London housing prices - Confidence intervals and hypothesis testing

London's housing market is dynamic and complex. You are working for a real estate agency and need to figure out a price for selling newly built houses in a London suburb.

In this module's practice labs you will be working with a reduced version of the London House Price Data dataset from Kaggle, which includes house prices from 2018 to October 2024. In this lab you will be working with the following columns: 

- `price`: The sale price of the property.
- `outcode`: First part of the postcode, grouping properties into broader geographic zones.
- `floorAreaSqM`: The area in square meters of the property.

## General instructions
- **Replace any instances of `None` with your own code**. All `None`s must be replaced.
- **Compare your results with the expected output** shown below the code.
- **Check the solution** using the expandable cell to verify your answer.

Happy coding!

<div style="background-color: #FAD888; padding: 10px; border-radius: 3px; box-shadow: 0 2px 4px rgba(0, 0, 0, 0.1); width:95%
">
<strong>Important note</strong>: Code blocks that still contain <code>None</code> will not run properly. If you run them before completing the exercise, you will likely get an error.
</div>

## Table of contents
- [Step 1: Import libraries](#import-libraries)
- [Step 2: Load the data](#load-the-data)
- [Step 3: Confidence Intervals](#confidence-intervals)
    - [Confidence Intervals for Means](#ci-for-means)
- [Step 4: Hypothesis Testing](#hypothesis-testing)

<a id="import-libraries"></a>

## Step 1: Import libraries
Begin by importing the libraries.

In [4]:
import pandas as pd
import scipy.stats as stats
import numpy as np

<a id="load-the-data"></a>

## Step 2: Load the data
Run the cell below to load the data and preview the first rows.

In [5]:
df = pd.read_csv("london_house_price_2018.csv")
df.head()

Unnamed: 0,fullAddress,postcode,outcode,latitude,longitude,bathrooms,bedrooms,floorAreaSqM,livingRooms,tenure,propertyType,currentEnergyRating,price
0,"Flat 1, White Rose Court, Widegate Street, Lon...",E1 7ES,E1,51.517972,-0.078028,2.0,2.0,73.0,1.0,Leasehold,Purpose Built Flat,D,623000
1,"Flat 5, White Rose Court, Widegate Street, Lon...",E1 7ES,E1,51.517972,-0.078028,1.0,2.0,50.0,1.0,Leasehold,Converted Flat,E,575000
2,"9A Petticoat Tower, Petticoat Square, London, ...",E1 7EE,E1,51.515798,-0.077081,1.0,2.0,72.0,2.0,Leasehold,Purpose Built Flat,C,385000
3,"Flat 11, Arcadia Court, 45 Old Castle Street, ...",E1 7NY,E1,51.516568,-0.074793,1.0,1.0,42.0,1.0,Leasehold,Purpose Built Flat,D,370000
4,"Flat 18, Arcadia Court, 45 Old Castle Street, ...",E1 7NY,E1,51.516568,-0.074793,1.0,1.0,39.0,1.0,Leasehold,Purpose Built Flat,C,364000


<a id="confidence-intervals"></a>

## Step 3: Confidence Intervals

<a id="ci-for-means"></a>

### Confidence Intervals for Means

The average price per square meter is crucial for understanding property values and for comparing homes of different sizes. A confidence interval gives you a range of plausible values for that average rather than a single number, offering clearer insight into how much houses might be worth.
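
For reference, the interval built in this step can be written as the usual normal-approximation interval for a mean. The notation here is an assumption introduced for illustration: $\bar{x}$ is the sample mean, $s$ the sample standard deviation, and $n$ the sample size.

$$
\bar{x} \pm z_{0.975}\,\frac{s}{\sqrt{n}}, \qquad z_{0.975} \approx 1.96
$$

The term $s/\sqrt{n}$ is the standard error of the mean, which is the quantity passed as the scale parameter further down.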

<div style="background-color: #C6E2FF; padding: 10px; border-radius: 3px; box-shadow: 0 2px 4px rgba(0, 0, 0, 0.1); width:95%
">

**▶▶▶ Directions**
1. Create a new column named `price_per_sqm` by dividing the `price` column by the `floorAreaSqM` column.
2. Calculate the sample mean of the price per square meter using the `.mean()` method.
3. Calculate the sample standard deviation of the price per square meter using the `.std()` method.
4. Find the sample size using the `.count()` method on the `price_per_sqm` column.
5. Calculate the scale parameter (the standard error of the mean) as the standard deviation divided by the square root of the sample size.
6. Calculate the lower and upper bounds of the 95% confidence interval around the mean price per square meter, using the scale parameter from the previous step.
</div>


In [7]:
### START CODE HERE ###

# Add the price_per_sqm column
df["price_per_sqm"] = df["price"] / df["floorAreaSqM"]

# calculate the sample mean
mean_price_sqm = df["price_per_sqm"].mean()

# calculate the sample standard deviation
std_price_sqm = df["price_per_sqm"].std()

# calculate the sample size
n = df["price_per_sqm"].count()

# calculate the scale
SEM = std_price_sqm / np.sqrt(n)

# Calculate the confidence interval using norm.interval
interval = stats.norm.interval(0.95, loc=mean_price_sqm, scale=SEM)

### END CODE HERE ###

print("Mean price per square meter:", mean_price_sqm)
print("95% confidence interval: (", interval[0], ",", interval[1], ")")

Mean price per square meter: 7502.891480118485
95% confidence interval: ( 7463.11248288455 , 7542.670477352419 )


<details open>
<summary style="background-color: #c6e2ff6c; padding: 10px; border-radius: 3px; box-shadow: 0 2px 4px rgba(0, 0, 0, 0.01); width: 95%; text-align: left; cursor: pointer; font-weight: bold;">
Expected output:</summary> 

```
Mean price per square meter: 7502.891480118485
95% confidence interval: ( 7463.11248288455 , 7542.670477352419 )
```
</details>


<details>
<summary style="background-color: #FDBFC7; padding: 10px; border-radius: 3px; box-shadow: 0 2px 4px rgba(0, 0, 0, 0.1); width: 95%; text-align: left; cursor: pointer; font-weight: bold;">
Click here to see the solution</summary> 

<ul style="background-color: #FFF8F8; padding: 10px; border-radius: 3px; margin-top: 5px; width: 95%; box-shadow: inset 0 2px 4px rgba(0, 0, 0, 0.1);">
   
Your solution should look something like this:

```python
# Add the price_per_sqm column
df["price_per_sqm"] = df["price"]/df["floorAreaSqM"]
# calculate the sample mean
mean_price_sqm = df["price_per_sqm"].mean()
# calculate the sample standard deviation
std_price_sqm = df["price_per_sqm"].std()
# calculate the sample size
n = df["price_per_sqm"].count()
# calculate the scale
SEM = std_price_sqm / np.sqrt(n)

# Calculate the confidence interval using norm.interval
interval = stats.norm.interval(0.95, loc=mean_price_sqm, scale=SEM)
```
</ul>
</details>
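
As a sanity check on the steps above, the sketch below rebuilds the interval by hand and compares it with `stats.norm.interval`. It uses synthetic numbers, not the lab data, and the variable names (`values`, `manual`, `sem`) are hypothetical. One missing value is included on purpose to show why the sample size is taken with `.count()`, which ignores missing entries, rather than with `len()`.

```python
import numpy as np
import pandas as pd
import scipy.stats as stats

# Synthetic stand-in for price per square meter (NOT the lab data).
# One value is set to NaN on purpose.
rng = np.random.default_rng(0)
values = pd.Series(rng.normal(loc=7500, scale=2000, size=500))
values.iloc[0] = np.nan

mean = values.mean()    # pandas skips NaN by default
std = values.std()      # sample standard deviation (ddof=1), NaN skipped
n = values.count()      # counts non-missing values only
sem = std / np.sqrt(n)  # standard error of the mean

# Interval from the formula: mean +/- z * SEM,
# where z is the 97.5th percentile of the standard normal distribution
z = stats.norm.ppf(0.975)
manual = (mean - z * sem, mean + z * sem)

# Interval from SciPy
interval = stats.norm.interval(0.95, loc=mean, scale=sem)

print("count:", n, "len:", len(values))
print("Manual and SciPy intervals agree:", np.allclose(manual, interval))
```

If the two intervals agree, `loc` and `scale` were passed correctly. With a small sample, a t-based interval (`stats.t.interval`) would be somewhat wider; with thousands of rows the difference is negligible.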

<a id="hypothesis-testing"></a>

## Step 4: Hypothesis Testing
The price per square meter reflects, among other things, location desirability and property value. Your company wants to build new houses in either the Dulwich or the Wimbledon area. Does the mean price per square meter differ significantly between these two areas? Answer the question using a two-sample t-test at the 5% significance level.

<div style="background-color: #C6E2FF; padding: 10px; border-radius: 3px; box-shadow: 0 2px 4px rgba(0, 0, 0, 0.1); width: 95%;">

▶▶▶ **Directions** 
1. Create a new DataFrame that includes only houses in Dulwich (`outcode` equal to "SE21").
2. Create another DataFrame that includes only houses in the Wimbledon area (`outcode` equal to "SW19").
3. Find the test statistic and p-value for the two-sample t-test. Remember to pass only the `"price_per_sqm"` column of the Dulwich and Wimbledon DataFrames.
4. What can you conclude at the 5% significance level?
</div>

In [None]:
### START CODE HERE ###

# create a new DataFrame that includes only Dulwich samples
dulwich_df = df[df["outcode"] == "SE21"]

# create another DataFrame that includes only Wimbledon samples
wimbledon_df = df[df["outcode"] == "SW19"]

# find the p-value and test statistic for this two-sample t-test
test_results = stats.ttest_ind(None, None)

### END CODE HERE ###

t_stat = test_results[0]
p_value = test_results[1]

print("T-statistic:", t_stat)
print("P-value:", p_value)

<details open>
<summary style="background-color: #c6e2ff6c; padding: 10px; border-radius: 3px; box-shadow: 0 2px 4px rgba(0, 0, 0, 0.01); width: 95%; text-align: left; cursor: pointer; font-weight: bold;">
Expected output:</summary> 

```
T-statistic: -5.772630665662497
P-value: 9.7521016289875e-09
```
</details>


<details>
<summary style="background-color: #FDBFC7; padding: 10px; border-radius: 3px; box-shadow: 0 2px 4px rgba(0, 0, 0, 0.1); width: 95%; text-align: left; cursor: pointer; font-weight: bold;">
Click here to see the solution</summary> 

<ul style="background-color: #FFF8F8; padding: 10px; border-radius: 3px; margin-top: 5px; width: 95%; box-shadow: inset 0 2px 4px rgba(0, 0, 0, 0.1);">
   
Your solution should look something like this:

```python
# create a new DataFrame that includes only Dulwich samples
dulwich_df = df[df["outcode"] == "SE21"]

# create another DataFrame that includes only Wimbledon samples
wimbledon_df = df[df["outcode"] == "SW19"]

# find the p-value and test statistic for this two-sample t-test
test_results = stats.ttest_ind(dulwich_df["price_per_sqm"], wimbledon_df["price_per_sqm"])
```
</ul>
</details>
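
To answer the last direction, compare the p-value with the significance level. The sketch below shows one way to do that. The two samples are synthetic, not the lab data, and the names `area_a`, `area_b`, and `decide` are hypothetical. It also runs Welch's variant (`equal_var=False`), which does not assume equal variances in the two groups; the lab itself uses SciPy's default.

```python
import numpy as np
import scipy.stats as stats

# Synthetic stand-ins for two areas (NOT the lab data)
rng = np.random.default_rng(1)
area_a = rng.normal(loc=7000, scale=1500, size=200)
area_b = rng.normal(loc=8000, scale=1500, size=300)

alpha = 0.05  # significance level


def decide(p, alpha=0.05):
    """Return the test decision for a given p-value (hypothetical helper)."""
    return "reject H0" if p < alpha else "fail to reject H0"


# Student's two-sample t-test (SciPy's default assumes equal variances)
t_stat, p_value = stats.ttest_ind(area_a, area_b)

# Welch's t-test drops the equal-variance assumption
t_welch, p_welch = stats.ttest_ind(area_a, area_b, equal_var=False)

print("Student:", t_stat, p_value, decide(p_value, alpha))
print("Welch:  ", t_welch, p_welch, decide(p_welch, alpha))

# Applying the same rule to the p-value in the expected output above
print("Lab expected output:", decide(9.7521016289875e-09, alpha))
```

The expected p-value of about 9.75e-09 is far below 0.05, so the null hypothesis of equal mean price per square meter in Dulwich and Wimbledon is rejected. The negative t-statistic indicates that the first sample passed to `ttest_ind` (Dulwich) has the lower sample mean.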

Congratulations on making it to the end of this lab. You will keep working with this dataset in Lesson 2. Hope you enjoyed it!