In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
df = pd.read_csv('/content/sample_data/property.csv')

In [3]:
df.head()

Unnamed: 0,Suburb,Address,Rooms,Type,Price,Method,SellerG,Date,Distance,Postcode,...,Bathroom,Car,Landsize,BuildingArea,YearBuilt,CouncilArea,Lattitude,Longtitude,Regionname,Propertycount
0,Abbotsford,85 Turner St,2,h,1480000.0,S,Biggin,3/12/2016,2.5,3067.0,...,1.0,1.0,202.0,,,Yarra,-37.7996,144.9984,Northern Metropolitan,4019.0
1,Abbotsford,25 Bloomburg St,2,h,1035000.0,S,Biggin,4/02/2016,2.5,3067.0,...,1.0,0.0,156.0,79.0,1900.0,Yarra,-37.8079,144.9934,Northern Metropolitan,4019.0
2,Abbotsford,5 Charles St,3,h,1465000.0,SP,Biggin,4/03/2017,2.5,3067.0,...,2.0,0.0,134.0,150.0,1900.0,Yarra,-37.8093,144.9944,Northern Metropolitan,4019.0
3,Abbotsford,40 Federation La,3,h,850000.0,PI,Biggin,4/03/2017,2.5,3067.0,...,2.0,1.0,94.0,,,Yarra,-37.7969,144.9969,Northern Metropolitan,4019.0
4,Abbotsford,55a Park St,4,h,1600000.0,VB,Nelson,4/06/2016,2.5,3067.0,...,1.0,2.0,120.0,142.0,2014.0,Yarra,-37.8072,144.9941,Northern Metropolitan,4019.0


In [4]:
altona_properties = df[df['Suburb'] == 'Altona']
altona_prices = altona_properties['Price'].dropna()

print(f"Number of properties in Altona: {len(altona_prices)}")
print(f"Mean price in Altona: ${altona_prices.mean():,.2f}")
print(f"Standard deviation of prices in Altona: ${altona_prices.std():,.2f}")

Number of properties in Altona: 74
Mean price in Altona: $834,830.41
Standard deviation of prices in Altona: $291,546.05


Now, let's perform a one-sample t-test to check whether the mean property price in Altona is significantly greater than \$800,000.

Null hypothesis: the mean price is less than or equal to \$800,000.

Alternative hypothesis: the mean price is greater than \$800,000 (one-tailed test).

Significance level: 5% (alpha = 0.05).

## Null Hypothesis

$$ H_0: \text{The mean price is less than or equal to \$800,000} $$

## Alternative Hypothesis
$$ H_1: \text{The mean price is greater than \$800,000} $$


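## Formula

Because the population standard deviation is unknown, we use the sample standard deviation $s$ and the t-statistic:

$$ t = \frac{\bar{x} - \mu}{\frac{s}{\sqrt{n}}} $$


$$ \bar{x} = 834830.41 $$

$$ \mu = 800000 $$

$$ s = 291546.05 $$

$$ n = 74 $$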

## Calculate T-statistic Manually

Calculate the t-statistic for the one-sample t-test manually using the formula: (sample_mean - hypothesized_mean) / (sample_standard_deviation / sqrt(sample_size)).

Let's first calculate the sample mean, sample standard deviation, and sample size from the `altona_prices` data. Then define the hypothesized mean, calculate the standard error of the mean (SEM), and finally compute the t-statistic using the given formula.

In [6]:
altona_properties = df[df['Suburb'] == 'Altona']
altona_prices = altona_properties['Price'].dropna()

sample_mean = altona_prices.mean()
sample_std = altona_prices.std()
sample_size = len(altona_prices)
hypothesized_mean = 800000

SEM = sample_std / np.sqrt(sample_size)
t_statistic_manual = (sample_mean - hypothesized_mean) / SEM

print(f"T-statistic: {t_statistic_manual:.4f}")

T-statistic: 1.0277


Now that the t-statistic has been calculated manually, the next step is to determine the critical t-value for a one-tailed test at a 5% significance level. This involves calculating the degrees of freedom and then using `scipy.stats.t.ppf` to find the critical value.

## Compare T-statistic with Critical T-value and Draw Conclusion

Compare the `t_statistic_manual` with the `critical_t_value` and determine whether to reject or fail to reject the null hypothesis. Summarize the conclusion.

#### Instructions:
1. Compare the calculated `t_statistic_manual` with the `critical_t_value`.
2. Print a statement indicating whether the null hypothesis is rejected or not, based on the comparison.
3. State the conclusion about the property prices in Altona.

In [9]:
from scipy.stats import t

degrees_freedom = sample_size - 1
alpha = 0.05

critical_t_value = t.ppf(1 - alpha, df=degrees_freedom)

print(f"Degrees of Freedom: {degrees_freedom}")
print(f"Critical T-value (one-tailed, alpha=0.05): {critical_t_value:.4f}")

print(f"\nComparing T-statistic ({t_statistic_manual:.4f}) with Critical T-value ({critical_t_value:.4f})")

if t_statistic_manual > critical_t_value:
    print("Conclusion: Reject the null hypothesis. \nThere is sufficient evidence to suggest that the typical property price in Altona is greater than $800,000.")
else:
    print("Conclusion: Fail to reject the null hypothesis. \nThere is not sufficient evidence to suggest that the typical property price in Altona is greater than $800,000.")


Degrees of Freedom: 73
Critical T-value (one-tailed, alpha=0.05): 1.6660

Comparing T-statistic (1.0277) with Critical T-value (1.6660)
Conclusion: Fail to reject the null hypothesis. 
There is not sufficient evidence to suggest that the typical property price in Altona is greater than $800,000.


## Summary:

The findings indicate that the calculated T-statistic is 1.0277, which is less than the critical T-value of 1.6660 for a one-tailed test at a 5% significance level with 73 degrees of freedom. Therefore, we fail to reject the null hypothesis, concluding that there is insufficient statistical evidence to suggest that the mean property price in Altona is significantly greater than \$800,000.

### Data Analysis Key Findings
*   The manually calculated T-statistic for Altona property prices was determined to be 1.0277.
*   With a sample size of 74, the degrees of freedom for the t-test were 73.
*   For a one-tailed test at a 5% significance level ($\alpha=0.05$), the critical T-value was found to be 1.6660.
*   The calculated T-statistic (1.0277) is less than the critical T-value (1.6660).
*   Based on this comparison, the null hypothesis (mean property price in Altona $\le$ \$800,000) was not rejected, indicating insufficient evidence to conclude that the mean price is significantly greater than \$800,000.
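
The same decision can also be reached through a p-value instead of a critical value. The following is a minimal sketch, assuming the rounded summary statistics printed earlier (so the result may differ from the notebook's value in the last decimal place). It uses `scipy.stats.t.sf` for the upper-tail probability; the variable names are illustrative and not part of the original cells.

```python
import math
from scipy.stats import t

# Rounded summary statistics taken from the Altona output above
sample_mean = 834830.41
sample_std = 291546.05
sample_size = 74
hypothesized_mean = 800000

# Standard error of the mean and t-statistic
sem = sample_std / math.sqrt(sample_size)
t_stat = (sample_mean - hypothesized_mean) / sem

# One-tailed (upper-tail) p-value with n - 1 degrees of freedom
p_value_one_tailed = t.sf(t_stat, df=sample_size - 1)

print(f"T-statistic: {t_stat:.4f}")
print(f"One-tailed p-value: {p_value_one_tailed:.4f}")
print("Reject H0" if p_value_one_tailed < 0.05 else "Fail to reject H0")
```

A p-value above 0.05 leads to the same conclusion as the critical-value comparison: we fail to reject the null hypothesis.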


### QUESTION 2. For the year 2016, is there any difference in the prices of properties sold in the  summer months vs winter months?

• Consider months from October till March as winter months and rest as summer months.

• Use a significance level of 5%.

## Prepare Data for 2016

Convert the 'Date' column to datetime objects, filter the DataFrame to include only properties sold in the year 2016, and extract the month for each sale.


To prepare the data for analysis, we will first convert the 'Date' column to datetime objects, then filter the DataFrame to retain only the records from the year 2016, and finally extract the month from the 'Date' column into a new 'Month' column.



In [10]:
df['Date'] = pd.to_datetime(df['Date'], format='%d/%m/%Y', errors='coerce')
df_2016 = df[df['Date'].dt.year == 2016].copy()
df_2016['Month'] = df_2016['Date'].dt.month

print(f"Shape of original DataFrame: {df.shape}")
print(f"Shape of 2016 DataFrame: {df_2016.shape}")
df_2016.head()

Shape of original DataFrame: (13580, 21)
Shape of 2016 DataFrame: (6336, 22)


Unnamed: 0,Suburb,Address,Rooms,Type,Price,Method,SellerG,Date,Distance,Postcode,...,Car,Landsize,BuildingArea,YearBuilt,CouncilArea,Lattitude,Longtitude,Regionname,Propertycount,Month
0,Abbotsford,85 Turner St,2,h,1480000.0,S,Biggin,2016-12-03,2.5,3067.0,...,1.0,202.0,,,Yarra,-37.7996,144.9984,Northern Metropolitan,4019.0,12
1,Abbotsford,25 Bloomburg St,2,h,1035000.0,S,Biggin,2016-02-04,2.5,3067.0,...,0.0,156.0,79.0,1900.0,Yarra,-37.8079,144.9934,Northern Metropolitan,4019.0,2
4,Abbotsford,55a Park St,4,h,1600000.0,VB,Nelson,2016-06-04,2.5,3067.0,...,2.0,120.0,142.0,2014.0,Yarra,-37.8072,144.9941,Northern Metropolitan,4019.0,6
5,Abbotsford,129 Charles St,2,h,941000.0,S,Jellis,2016-05-07,2.5,3067.0,...,0.0,181.0,,,Yarra,-37.8041,144.9953,Northern Metropolitan,4019.0,5
6,Abbotsford,124 Yarra St,3,h,1876000.0,S,Nelson,2016-05-07,2.5,3067.0,...,0.0,245.0,210.0,1910.0,Yarra,-37.8024,144.9993,Northern Metropolitan,4019.0,5
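
To illustrate why the explicit day-first format matters, here is a small sketch on a hypothetical four-row sample (not the real dataset). It assumes date strings in the same day/month/year style as the `Date` column and shows how `errors='coerce'` turns an unparseable value into `NaT`, which the year filter then drops.

```python
import pandas as pd

# Hypothetical mini-sample using day/month/year strings like those in the dataset
sample = pd.DataFrame({"Date": ["3/12/2016", "4/02/2016", "4/03/2017", "not a date"]})

# Explicit day-first format; unparseable values become NaT instead of raising
sample["Date"] = pd.to_datetime(sample["Date"], format="%d/%m/%Y", errors="coerce")
sample["Month"] = sample["Date"].dt.month

# Keep only 2016 rows; the NaT row drops out because its year is missing
sample_2016 = sample[sample["Date"].dt.year == 2016]
print(sample_2016)
```

With this format, `3/12/2016` is read as 3 December rather than 12 March, which is what places the first Abbotsford row in the October to March group.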


## Categorize Sales into Summer and Winter

Based on the defined criteria (October-March as winter, rest as summer), categorize the 2016 sales into two groups: 'Summer' and 'Winter' months. Extract the 'Price' for each of these groups, handling any missing price values.


To categorize the 2016 sales into 'Summer' and 'Winter' months, we will define the month ranges, then filter the `df_2016` DataFrame for each season, extract the 'Price' column, and remove any missing values. Finally, we will print the head and size of each resulting price series to verify the categorization.



In [11]:
winter_months = [10, 11, 12, 1, 2, 3]
summer_months = [4, 5, 6, 7, 8, 9]

summer_prices_2016 = df_2016[df_2016['Month'].isin(summer_months)]['Price'].dropna()
winter_prices_2016 = df_2016[df_2016['Month'].isin(winter_months)]['Price'].dropna()

print(f"Number of summer sales in 2016: {len(summer_prices_2016)}")
print("First 5 summer prices in 2016:\n", summer_prices_2016.head())
print(f"\nNumber of winter sales in 2016: {len(winter_prices_2016)}")
print("First 5 winter prices in 2016:\n", winter_prices_2016.head())

Number of summer sales in 2016: 4036
First 5 summer prices in 2016:
 4     1600000.0
5      941000.0
6     1876000.0
13    1172500.0
14     441000.0
Name: Price, dtype: float64

Number of winter sales in 2016: 2300
First 5 winter prices in 2016:
 0    1480000.0
1    1035000.0
7    1636000.0
8     300000.0
9    1097000.0
Name: Price, dtype: float64


In [14]:
summer_prices_2016.std()


621493.4398118115

In [15]:
summer_prices_2016.mean()

np.float64(1048054.7286917741)

In [16]:
winter_prices_2016.std()

695498.2768521258

In [17]:
winter_prices_2016.mean()

np.float64(1116647.5917391304)

## Perform Two-Sample T-test

Conduct an independent two-sample t-test to compare the mean property prices between the 'Summer' and 'Winter' months groups. This will involve defining the null and alternative hypotheses and using a significance level of 5%.

#### Instructions
1. Clearly state the null hypothesis ($H_0$) and the alternative hypothesis ($H_1$) for this two-sample t-test.
2. Import the `ttest_ind` function from the `scipy.stats` module.
3. Perform an independent two-sample t-test using `ttest_ind` on `summer_prices_2016` and `winter_prices_2016`.
4. Print the calculated t-statistic and p-value from the test.
5. State the significance level (alpha) for this test.

### Hypotheses:

*   **Null Hypothesis ($H_0$)**: There is no significant difference in the mean property prices between summer months and winter months in 2016. ($\mu_{\text{summer}} = \mu_{\text{winter}}$)
*   **Alternative Hypothesis ($H_1$)**: There is a significant difference in the mean property prices between summer months and winter months in 2016. ($\mu_{\text{summer}} \ne \mu_{\text{winter}}$)

In [18]:
from scipy.stats import ttest_ind

# Define significance level
alpha = 0.05

# Perform independent two-sample t-test
t_statistic, p_value = ttest_ind(summer_prices_2016, winter_prices_2016, equal_var=True)

print(f"T-statistic: {t_statistic:.4f}")
print(f"P-value: {p_value:.4f}")
print(f"Significance level (alpha): {alpha}")

T-statistic: -4.0434
P-value: 0.0001
Significance level (alpha): 0.05


## Interpret T-test Results and Conclude

Interpret the results of the two-sample t-test, comparing the p-value with the significance level, and draw a conclusion about the difference in mean property prices between summer and winter months in 2016.

### Calculations:
*   **Calculated T-statistic**: -4.0434
*   **Calculated P-value**: 0.0001
*   **Significance Level (alpha)**: 0.05

Since the P-value (0.0001) is less than the significance level (0.05), we reject the null hypothesis.

## Summary:

*   **Is there a significant difference in property prices between summer and winter months in 2016?**
   
    Yes, there is a statistically significant difference in the mean property prices between summer months (April to September) and winter months (October to March) in 2016. This conclusion is based on a two-sample t-test where the p-value (0.0001) was less than the significance level (0.05), leading to the rejection of the null hypothesis.

### Data Analysis Key Findings
*   The initial dataset was filtered to `df_2016`, containing 6336 property sales records for the year 2016.
*   Property sales were categorized into two groups: 4036 sales occurred during summer months (April-September) and 2300 sales occurred during winter months (October-March).
*   An independent two-sample t-test was performed to compare mean property prices between the two seasons.
    *   The calculated t-statistic was -4.0434.
    *   The calculated p-value was 0.0001.
    *   The significance level (alpha) for the test was 0.05.
*   Since the p-value (0.0001) is considerably smaller than the significance level (0.05), the null hypothesis—that there is no significant difference in mean property prices between summer and winter—was rejected.

### Insights or Next Steps
*   The analysis indicates a statistically significant difference in mean property prices between the two seasonal groups in 2016. On its own, the test does not show that season causes the difference. Further investigation could explore the magnitude of this difference and potential reasons, such as market demand, inventory changes, or the specific property types sold in each season.
*   To gain deeper insights, consider analyzing the median prices in addition to the mean, especially if the price distributions are skewed. Additionally, future analyses could explore the impact of specific months within seasons or compare seasonal trends across multiple years to identify consistent patterns.
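
As one possible way to gauge the magnitude of the seasonal difference, the sketch below computes the raw mean difference and Cohen's d from the rounded 2016 summary statistics shown earlier. Cohen's d is not used elsewhere in this notebook; it is introduced here only as an assumed, commonly used effect-size measure, and the variable names are illustrative.

```python
import math

# Rounded 2016 summary statistics from the outputs above
mean_summer, std_summer, n_summer = 1048054.73, 621493.44, 4036
mean_winter, std_winter, n_winter = 1116647.59, 695498.28, 2300

# Pooled standard deviation (equal-variance assumption, as in the t-test)
pooled_var = (
    (n_summer - 1) * std_summer**2 + (n_winter - 1) * std_winter**2
) / (n_summer + n_winter - 2)
pooled_std = math.sqrt(pooled_var)

# Raw difference in means and standardized effect size
mean_difference = mean_summer - mean_winter
cohens_d = mean_difference / pooled_std

print(f"Mean difference (summer - winter): ${mean_difference:,.2f}")
print(f"Cohen's d: {cohens_d:.4f}")
```

By Cohen's common rule of thumb, values around 0.2 are considered small, so a result can be statistically significant with a large sample while the standardized difference remains modest.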


Following the manual approach covered in the online lectures: for an independent two-sample t-test assuming equal variances (as in the calculation above), the t-statistic is calculated using the following formula:

$$ t = \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{\sqrt{s_p^2 \left(\frac{1}{n_1} + \frac{1}{n_2}\right)}} $$

Where:

$\bar{X}_1$: The sample mean of the first group (summer prices).

$\bar{X}_2$: The sample mean of the second group (winter prices).

$\mu_1 - \mu_2$: The hypothesized difference between the population means.

For our null hypothesis
($H_0: \mu_{\text{summer}} = \mu_{\text{winter}}$), this value is 0.

$s_p^2$: The pooled variance of the two samples.

$n_1$: The sample size of the first group (summer prices).

$n_2$: The sample size of the second group (winter prices).

First, we would calculate the pooled variance ($s_p^2$) using this formula:

$$ s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2} $$

Where:

$s_1^2$: The variance of the first group (summer prices).

$s_2^2$: The variance of the second group (winter prices).

Let's break down the values we already have from our previous steps for the 2016 data:

For Summer Prices (Group 1):

Sample Mean ($\bar{X}_1$): ~\$1,048,054.73

Sample Standard Deviation ($s_1$): ~\$621,493.44

Sample Size ($n_1$): 4036

For Winter Prices (Group 2):

Sample Mean ($\bar{X}_2$): ~\$1,116,647.59

Sample Standard Deviation ($s_2$): ~\$695,498.28

Sample Size ($n_2$): 2300

Steps for manual calculation:

Calculate the variance for summer prices ($s_1^2 = s_1 \times s_1$) and winter prices ($s_2^2 = s_2 \times s_2$).

Calculate the pooled variance ($s_p^2$) using the formula above with the sample sizes and variances.

Substitute all values ($\bar{X}_1$, $\bar{X}_2$, $s_p^2$, $n_1$, $n_2$) into the main t-statistic formula.

Performing these calculations would give us a t-statistic of approximately -4.0434, as obtained programmatically.
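
The manual steps listed above can be sketched in code as follows. This is a minimal illustration that assumes the rounded summary statistics from the earlier outputs, so the last decimal place may differ slightly from the `ttest_ind` result; the variable names are illustrative.

```python
import math
from scipy.stats import t

# Rounded 2016 summary statistics (group 1 = summer, group 2 = winter)
x1, s1, n1 = 1048054.73, 621493.44, 4036
x2, s2, n2 = 1116647.59, 695498.28, 2300

# Step 1: variances of each group
s1_sq = s1**2
s2_sq = s2**2

# Step 2: pooled variance
sp_sq = ((n1 - 1) * s1_sq + (n2 - 1) * s2_sq) / (n1 + n2 - 2)

# Step 3: t-statistic; the hypothesized difference under H0 is 0
standard_error = math.sqrt(sp_sq * (1 / n1 + 1 / n2))
t_manual = ((x1 - x2) - 0) / standard_error

# Two-tailed p-value with n1 + n2 - 2 degrees of freedom
dof = n1 + n2 - 2
p_two_tailed = 2 * t.sf(abs(t_manual), df=dof)

print(f"Pooled variance: {sp_sq:,.2f}")
print(f"T-statistic: {t_manual:.4f}")
print(f"Two-tailed p-value: {p_two_tailed:.6f}")
```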

### 3. For the suburb of Abbotsford, what is the probability that out of 10 properties sold, 3 will not have a car parking space?

• Use the column car in the dataset.

• Round off your answer to 3 decimal places.

# Steps
The steps involve:
1. Filtering the `df` DataFrame for properties in 'Abbotsford'.
2. Identifying the number of properties with and without a car parking space (where `Car` is 0 or NaN/missing).
3. Calculating the proportion of properties in Abbotsford that do not have a car parking space.
4. Using the binomial probability formula to determine the probability of exactly 3 out of 10 properties not having a car parking space, given the calculated proportion.

## Calculate Probability of No Car Parking Space in Abbotsford

Filter the DataFrame for properties in 'Abbotsford', handle missing values in the 'Car' column (assuming NaN means no car space), and calculate the probability that a property in Abbotsford does not have a car parking space. This will be our `p` value for the binomial distribution.

First, we will filter the original DataFrame `df` for entries where the `Suburb` is 'Abbotsford'. Then, we will inspect the 'Car' column for these properties. Since the question asks about properties *not having* a car parking space and some `Car` values might be `NaN`, we will treat `NaN` as no car space (0 car spaces). We will count properties where 'Car' is 0 or NaN and divide by the total number of properties in Abbotsford to get the probability `p`.

In [19]:
abbotsford_properties = df[df['Suburb'] == 'Abbotsford'].copy()

# Inspect the 'Car' column for unique values and their counts
# print(abbotsford_properties['Car'].value_counts(dropna=False))

# Count properties with 0 car spaces or NaN (assuming NaN means no car space)
no_car_space_count = abbotsford_properties[abbotsford_properties['Car'].fillna(0) == 0].shape[0]
total_abbotsford_properties = abbotsford_properties.shape[0]

# Calculate the probability 'p'
p_no_car_space = no_car_space_count / total_abbotsford_properties

print(f"Total properties in Abbotsford: {total_abbotsford_properties}")
print(f"Properties in Abbotsford with no car space: {no_car_space_count}")
print(f"Probability (p) of a property in Abbotsford having no car space: {p_no_car_space:.4f}")

Total properties in Abbotsford: 56
Properties in Abbotsford with no car space: 15
Probability (p) of a property in Abbotsford having no car space: 0.2679


## Apply Binomial Probability Formula

Now that we have the probability `p` of a single property in Abbotsford not having a car space, we can use the binomial probability mass function (PMF) to find the probability of exactly `k` successes (no car spaces) in `n` trials (properties sold). The binomial PMF is given by the formula: `P(X=k) = C(n, k) * p^k * (1-p)^(n-k)`, where `C(n, k)` is the binomial coefficient. We will use `scipy.stats.binom.pmf`, which evaluates this formula directly.

In [20]:
from scipy.stats import binom

# Define the parameters for the binomial distribution
n = 10  # Number of trials (properties sold)
k = 3   # Number of successes (properties with no car space)

# The probability 'p' was calculated in the previous step

# Calculate the binomial probability
probability = binom.pmf(k, n, p_no_car_space)

print(f"The probability that exactly {k} out of {n} properties in Abbotsford will not have a car parking space is: {probability:.3f}")

The probability that exactly 3 out of 10 properties in Abbotsford will not have a car parking space is: 0.260
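
As a cross-check, the binomial formula can also be evaluated by hand. The sketch below assumes the counts printed above (15 of 56 Abbotsford properties with no car space) and uses `math.comb` from the standard library for the binomial coefficient; the variable names are illustrative.

```python
import math

n = 10       # number of trials (properties sold)
k = 3        # number of successes (properties with no car space)
p = 15 / 56  # proportion with no car space, from the counts above

# C(n, k): number of ways to choose which k of the n properties lack a car space
binomial_coefficient = math.comb(n, k)

# P(X = k) = C(n, k) * p^k * (1 - p)^(n - k)
probability_manual = binomial_coefficient * p**k * (1 - p) ** (n - k)

print(f"C({n}, {k}) = {binomial_coefficient}")
print(f"P(X = {k}) = {probability_manual:.3f}")
```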


### Question 4. In the suburb of Abbotsford, what are the chances of finding a property with 3 rooms? Round your answer to 3 decimal places.

### Question 5. In the suburb of Abbotsford, what are the chances of finding a property with 2 bathrooms? Round your answer to 3 decimal places.

## 4. Probability of Finding a Property with 3 Rooms in Abbotsford


In [21]:
# Filter for properties with 3 rooms in Abbotsford
properties_with_3_rooms = abbotsford_properties[abbotsford_properties['Rooms'] == 3].shape[0]

# Calculate the probability
probability_3_rooms = properties_with_3_rooms / total_abbotsford_properties

print(f"Number of properties with 3 rooms in Abbotsford: {properties_with_3_rooms}")
print(f"Total properties in Abbotsford: {total_abbotsford_properties}")
print(f"Probability of finding a property with 3 rooms in Abbotsford: {probability_3_rooms:.3f}")

Number of properties with 3 rooms in Abbotsford: 20
Total properties in Abbotsford: 56
Probability of finding a property with 3 rooms in Abbotsford: 0.357


## 5. Probability of Finding a Property with 2 Bathrooms in Abbotsford

Calculate the probability of finding a property with exactly 2 bathrooms in the suburb of Abbotsford. Round the answer to 3 decimal places.

To find the probability of a property having 2 bathrooms in Abbotsford, I will filter the `abbotsford_properties` DataFrame to count properties where the 'Bathroom' column is equal to 2. It's important to consider `NaN` values in 'Bathroom' as they would not count towards having 2 bathrooms. Then, I will divide this count by the total number of properties in Abbotsford to get the probability.

In [22]:
# Filter for properties with 2 bathrooms in Abbotsford
properties_with_2_bathrooms = abbotsford_properties[abbotsford_properties['Bathroom'] == 2].shape[0]

# Calculate the probability
probability_2_bathrooms = properties_with_2_bathrooms / total_abbotsford_properties

print(f"Number of properties with 2 bathrooms in Abbotsford: {properties_with_2_bathrooms}")
print(f"Total properties in Abbotsford: {total_abbotsford_properties}")
print(f"Probability of finding a property with 2 bathrooms in Abbotsford: {probability_2_bathrooms:.3f}")

Number of properties with 2 bathrooms in Abbotsford: 19
Total properties in Abbotsford: 56
Probability of finding a property with 2 bathrooms in Abbotsford: 0.339


### 6. One-Sample Hypothesis Test (Industry Pricing) A real estate firm claims that the average property price in Richmond is `$1,000,000`. Using the dataset, test whether the actual average price is significantly different from this claim at a 5% significance level. Clearly state:

• Null and alternative hypotheses

• Test statistic

• p-value

• Final business conclusion

# Task
Perform a one-sample hypothesis test to determine if the average property price in Richmond is significantly different from $1,000,000, as claimed by a real estate firm, at a 5% significance level. Clearly state the null and alternative hypotheses, the test statistic, the p-value, and the final business conclusion.

First, we will filter the main DataFrame `df` to isolate properties within the 'Richmond' suburb. Then we will select the 'Price' column for these properties and remove any `NaN` values to ensure accurate statistical calculations. Finally, we will print basic descriptive statistics (count, mean, standard deviation) to get an overview of the data.

In [23]:
richmond_properties = df[df['Suburb'] == 'Richmond']
richmond_prices = richmond_properties['Price'].dropna()

print(f"Number of properties in Richmond: {len(richmond_prices)}")
print(f"Mean price in Richmond: ${richmond_prices.mean():,.2f}")
print(f"Standard deviation of prices in Richmond: ${richmond_prices.std():,.2f}")

Number of properties in Richmond: 260
Mean price in Richmond: $1,083,564.42
Standard deviation of prices in Richmond: $522,353.52


In [24]:
richmond_prices.mean().round(2)

np.float64(1083564.42)

## One-Sample T-test for Richmond Property Prices

### Subtask:
Perform a one-sample t-test to assess if the mean property price in Richmond is significantly different from $1,000,000, using a 5% significance level. Clearly state the hypotheses, test statistic, and p-value.

### Hypotheses:

*   **Null Hypothesis ($H_0$)**: The average property price in Richmond is \$1,000,000.

    ($\mu = \$1,000,000$)
*   **Alternative Hypothesis ($H_1$)**: The average property price in Richmond is significantly different from \$1,000,000.

    ($\mu \ne \$1,000,000$)

### Instructions:
1.  Set the hypothesized mean to $1,000,000 and the significance level (alpha) to 0.05.
2.  Use `scipy.stats.ttest_1samp` to perform a two-tailed one-sample t-test on `richmond_prices`.
3.  Print the calculated t-statistic and p-value.

We will import the `stats` module from `scipy` and define the hypothesized mean and significance level. Then we will use `stats.ttest_1samp` to conduct a two-tailed t-test on the `richmond_prices` data and print the t-statistic and p-value.

### Final Conclusion

We will compare the calculated p-value from the one-sample t-test with the significance level (alpha). If the p-value is less than alpha, the null hypothesis will be rejected; otherwise, we will fail to reject it. This comparison will directly lead to the conclusion about the real estate firm's claim.

In [25]:
from scipy import stats

hypothesized_mean_richmond = 1000000
alpha = 0.05

# Perform one-sample t-test (two-tailed as we are checking for 'significantly different')
t_statistic_richmond, p_value_richmond = stats.ttest_1samp(richmond_prices, hypothesized_mean_richmond)

print(f"T-statistic for Richmond: {t_statistic_richmond:.4f}")
print(f"P-value for Richmond: {p_value_richmond:.4f}")
print(f"Significance level (alpha): {alpha}")

print(f"\nComparing P-value ({p_value_richmond:.4f}) with Significance Level ({alpha})")

if p_value_richmond < alpha:
    print("\nBusiness Conclusion: Reject the null hypothesis. There is sufficient evidence to suggest that the average property price in Richmond is significantly different from $1,000,000.")
else:
    print("\nBusiness Conclusion: Fail to reject the null hypothesis. There is not sufficient evidence to suggest that the average property price in Richmond is significantly different from $1,000,000.")


T-statistic for Richmond: 2.5795
P-value for Richmond: 0.0104
Significance level (alpha): 0.05

Comparing P-value (0.0104) with Significance Level (0.05)

Business Conclusion: Reject the null hypothesis. There is sufficient evidence to suggest that the average property price in Richmond is significantly different from $1,000,000.


To manually calculate the t-statistic for the one-sample hypothesis test for Richmond property prices, we use the following formula:

$$ t = \frac{\bar{X} - \mu}{s / \sqrt{n}} $$

Where:
$\bar{X}$: The sample mean of the Richmond property prices.
   
$\mu$: The hypothesized population mean (the claimed average price by the real estate firm).

$s$: The sample standard deviation of the Richmond property prices.
   
$n$: The sample size (number of properties) in Richmond.

Let's plug in the values we have from our previous steps for Richmond:

Sample Mean ($\bar{X}$): \$1,083,564.42

Hypothesized Mean ($\mu$): \$1,000,000

Sample Standard Deviation ($s$): \$522,353.52

Sample Size ($n$): 260

Step-by-step calculation:

Calculate the Standard Error of the Mean (SEM):

$$ SEM = s / \sqrt{n} = 522,353.52 / \sqrt{260} $$

$$ SEM = 522,353.52 / 16.1245 $$

$$ SEM \approx 32,394.99 $$

Calculate the T-statistic:

$$ t = (\bar{X} - \mu) / SEM $$

$$ t = (1,083,564.42 - 1,000,000) / 32,394.99 $$

$$ t = 83,564.42 / 32,394.99 $$

$$ t \approx 2.5795 $$

This manually calculated t-statistic of approximately 2.5795 matches the value obtained programmatically.

Given our calculated t-statistic of approximately 2.5795 and 259 degrees of freedom (df), we can look up the value in a t-distribution table. For a two-tailed test with this many degrees of freedom, the table gives a p-value of approximately 0.01, which agrees with the 0.0104 obtained from the Python code.
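
Instead of a printed t-table, the p-value can be read from the t-distribution directly. The sketch below assumes the rounded Richmond summary statistics printed earlier and uses `scipy.stats.t.sf`; the variable names are illustrative and not part of the original cells.

```python
import math
from scipy.stats import t

# Rounded Richmond summary statistics from the output above
sample_mean = 1083564.42
sample_std = 522353.52
sample_size = 260
hypothesized_mean = 1000000

# Standard error of the mean and t-statistic
sem = sample_std / math.sqrt(sample_size)
t_stat = (sample_mean - hypothesized_mean) / sem

# Two-tailed p-value: twice the upper-tail area beyond |t|
dof = sample_size - 1
p_two_tailed = 2 * t.sf(abs(t_stat), df=dof)

print(f"SEM: {sem:,.2f}")
print(f"T-statistic: {t_stat:.4f}")
print(f"Two-tailed p-value: {p_two_tailed:.4f}")
```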

# Question 7. Independent Two-Sample T-Test (Feature Impact)
### Do properties with car parking sell at a higher average price than properties without car parking, across the entire dataset? Use a 5% significance level and justify:

• Choice of test

• Interpretation of p-value

• Business implications for developers

## Prepare Data: Properties with and without Car Parking

Separate the dataset into two groups: properties with car parking and properties without car parking. Handle any missing values in the 'Car' column appropriately. Extract the 'Price' for each group, dropping any missing price values.

To compare property prices based on car parking availability, we first need to divide the `df` DataFrame into two distinct groups: one for properties with car parking and another for properties without. For the 'Car' column, any `NaN` values will be treated as 'no car parking' (0 spaces). This is an assumption made in the absence of information about car spaces for those records. After grouping, we will extract the 'Price' for each group and remove any `NaN` prices to ensure the t-test is performed on valid numerical data.

In [26]:
# Treat NaN in 'Car' as 0 (no car parking space)
df_car = df.copy()
df_car['Car'] = df_car['Car'].fillna(0)

# Group 1: Properties with car parking (Car > 0)
properties_with_car = df_car[df_car['Car'] > 0]['Price'].dropna()

# Group 2: Properties without car parking (Car == 0)
properties_without_car = df_car[df_car['Car'] == 0]['Price'].dropna()

print(f"Number of properties with car parking: {len(properties_with_car)}")
print(f"Mean price for properties with car parking: ${properties_with_car.mean():,.2f}")
print(f"Standard deviation for properties with car parking: ${properties_with_car.std():,.2f}")

print(f"\nNumber of properties without car parking: {len(properties_without_car)}")
print(f"Mean price for properties without car parking: ${properties_without_car.mean():,.2f}")
print(f"Standard deviation for properties without car parking: ${properties_without_car.std():,.2f}")

Number of properties with car parking: 12492
Mean price for properties with car parking: $1,074,443.92
Standard deviation for properties with car parking: $649,143.82

Number of properties without car parking: 1088
Mean price for properties without car parking: $1,089,923.07
Standard deviation for properties without car parking: $513,113.16


## Perform Independent Two-Sample T-Test

Conduct an independent two-sample t-test to compare the mean property prices between the group with car parking and the group without car parking. Clearly state the null and alternative hypotheses and use a 5% significance level.

An independent two-sample t-test is appropriate here because we are comparing the means of two independent groups (properties with and without car parking) on a continuous variable (price). We want to see if one group's mean is significantly higher than the other, suggesting a directional hypothesis. We will state the null and alternative hypotheses, then apply `scipy.stats.ttest_ind`.

### Hypotheses:

*   **Null Hypothesis ($H_0$)**: Properties with car parking do not sell at a significantly higher average price than properties without car parking. ($\mu_{\text{with car}} \le \mu_{\text{without car}}$)
*   **Alternative Hypothesis ($H_1$)**: Properties with car parking sell at a significantly higher average price than properties without car parking. ($\mu_{\text{with car}} > \mu_{\text{without car}}$)

### Choice of Test:
An **independent two-sample t-test** is chosen because we are comparing the means of two independent groups (properties with car parking vs. properties without car parking) and the dependent variable (price) is continuous. Since the question asks if prices are *higher*, it is a one-tailed test. We will use `alternative='greater'` in `ttest_ind`.

In [27]:
from scipy.stats import ttest_ind

alpha = 0.05

# Perform independent two-sample t-test with alternative='greater' for a one-tailed test
t_statistic_car, p_value_car = ttest_ind(properties_with_car, properties_without_car, equal_var=False, alternative='greater')

print(f"T-statistic: {t_statistic_car:.4f}")
print(f"P-value: {p_value_car:.4f}")
print(f"Significance level (alpha): {alpha}")

T-statistic: -0.9322
P-value: 0.8243
Significance level (alpha): 0.05


## Interpret P-value and Business Implications

Interpret the calculated p-value in the context of the significance level and draw a final conclusion. Discuss the business implications of this finding for developers.

We will compare the calculated p-value with the significance level to determine whether to reject the null hypothesis. Then, we will translate this statistical conclusion into practical business implications for real estate developers, considering the impact of car parking on property prices.

In [28]:
print(f"\nComparing P-value ({p_value_car:.4f}) with Significance Level ({alpha})")

if p_value_car < alpha:
    print("\nConclusion: Reject the null hypothesis. There is sufficient evidence to suggest that properties with car parking sell at a significantly higher average price than properties without car parking.")
    print("\nBusiness Implications for Developers: This finding suggests that including car parking spaces in new developments or renovations can significantly increase the average selling price of properties. Developers should consider investing in car parking facilities as it likely yields a positive return on investment, aligning with buyer preferences and market value.")
else:
    print("\nConclusion: Fail to reject the null hypothesis. There is not sufficient evidence to suggest that properties with car parking sell at a significantly higher average price than properties without car parking.")
    print("\nBusiness Implications for Developers: This finding suggests that car parking may not be a primary driver for significantly higher average selling prices. Developers should carefully evaluate the cost of providing parking against potential market returns, considering other features or location benefits that might influence price more significantly.")


Comparing P-value (0.8243) with Significance Level (0.05)

Conclusion: Fail to reject the null hypothesis. There is not sufficient evidence to suggest that properties with car parking sell at a significantly higher average price than properties without car parking.

Business Implications for Developers: This finding suggests that car parking may not be a primary driver for significantly higher average selling prices. Developers should carefully evaluate the cost of providing parking against potential market returns, considering other features or location benefits that might influence price more significantly.


The same comparison can also be worked by hand from the summary statistics. The manual calculation below uses the pooled-variance (equal-variance) form of the t-statistic, whereas the `ttest_ind` call above uses `equal_var=False` (Welch's t-test). The two forms use different standard errors, so their t-statistics differ; the gap is a difference in method, not a rounding effect.

For Properties with Car Parking (Group 1):

*   Sample Mean ($\bar{X}_1$): `$1,074,443.92`
*   Sample Standard Deviation ($s_1$): `$649,143.82`
*   Sample Size ($n_1$): 12,492

For Properties without Car Parking (Group 2):

*   Sample Mean ($\bar{X}_2$): `$1,089,923.07`
*   Sample Standard Deviation ($s_2$): `$513,113.16`
*   Sample Size ($n_2$): 1,088

We are testing the null hypothesis ($H_0: \mu_{\text{with car}} \le \mu_{\text{without car}}$)

against the alternative hypothesis ($H_1: \mu_{\text{with car}} > \mu_{\text{without car}}$).

Step 1: Calculate the Variances ($s_1^2$ and $s_2^2$)

$s_1^2 = (649,143.82)^2 \approx 4.21388 \times 10^{11}$

$s_2^2 = (513,113.16)^2 \approx 2.63285 \times 10^{11}$

Step 2: Calculate the Pooled Variance ($s_p^2$)

The formula for pooled variance is: $$ s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2} $$

Plugging in our values:

$s_p^2 = \frac{(12,492 - 1) \times (649,143.82)^2 + (1,088 - 1) \times (513,113.16)^2}{12,492 + 1,088 - 2}$

$s_p^2 \approx \frac{12491 \times 4.21388 \times 10^{11} + 1087 \times 2.63285 \times 10^{11}}{13578}$

$s_p^2 \approx \frac{5.26355 \times 10^{15} + 2.86191 \times 10^{14}}{13578}$

$s_p^2 \approx \frac{5.54974 \times 10^{15}}{13578}$

$s_p^2 \approx 4.08731 \times 10^{11}$


Step 3: Calculate the T-statistic

The formula for the independent two-sample t-statistic (assuming equal variances) is:

$$ t = \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{\sqrt{s_p^2 \left(\frac{1}{n_1} + \frac{1}{n_2}\right)}} $$

For our null hypothesis, the hypothesized difference between population means ($\mu_1 - \mu_2$) is 0. So, the formula simplifies to:

$$ t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{s_p^2 \left(\frac{1}{n_1} + \frac{1}{n_2}\right)}} $$

Now, let's substitute the values:

$t = \frac{1,074,443.92 - 1,089,923.07}{\sqrt{4.08731 \times 10^{11} \left(\frac{1}{12492} + \frac{1}{1088}\right)}}$

$t = \frac{-15,479.15}{\sqrt{4.08731 \times 10^{11} \times (0.00008005 + 0.00091912)}}$

$t = \frac{-15,479.15}{\sqrt{4.08731 \times 10^{11} \times 0.00099917}}$

$t \approx \frac{-15,479.15}{\sqrt{4.08391 \times 10^{8}}}$

$t \approx \frac{-15,479.15}{20,208.7}$

$t \approx -0.7660$

Note: The pooled-variance t-statistic of approximately -0.7660 differs from the programmatic result (-0.9322) because the code used Welch's unpooled standard error. Repeating Step 3 with that standard error reproduces the programmatic value:

$$ t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}} \approx \frac{-15,479.15}{\sqrt{3.37326 \times 10^{7} + 2.41990 \times 10^{8}}} \approx \frac{-15,479.15}{16,604.9} \approx -0.9322 $$

Both versions lead to the same decision here: the t-statistic is negative, so there is no evidence that properties with car parking sell at a higher average price.
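
The hand calculation can be cross-checked directly from the summary statistics. The sketch below assumes the rounded means, standard deviations, and sample sizes listed above, and uses `scipy.stats.ttest_ind_from_stats` to run both the pooled-variance test and Welch's test, so the two t-statistics can be compared side by side. The variable names are illustrative.

```python
from math import sqrt
from scipy.stats import ttest_ind_from_stats

# Summary statistics as reported above (rounded to cents)
mean1, sd1, n1 = 1_074_443.92, 649_143.82, 12_492   # with car parking
mean2, sd2, n2 = 1_089_923.07, 513_113.16, 1_088    # without car parking

# Pooled-variance (Student) test: what the manual steps compute
pooled = ttest_ind_from_stats(mean1, sd1, n1, mean2, sd2, n2,
                              equal_var=True, alternative="greater")

# Welch test: what ttest_ind(..., equal_var=False) computes
welch = ttest_ind_from_stats(mean1, sd1, n1, mean2, sd2, n2,
                             equal_var=False, alternative="greater")

# Welch t-statistic by hand, using the unpooled standard error
se_welch = sqrt(sd1**2 / n1 + sd2**2 / n2)
t_welch_manual = (mean1 - mean2) / se_welch

print(f"Pooled t: {pooled.statistic:.4f}, p: {pooled.pvalue:.4f}")
print(f"Welch  t: {welch.statistic:.4f}, p: {welch.pvalue:.4f}")
print(f"Welch  t (manual): {t_welch_manual:.4f}")
```

Because the group without car parking is much smaller, its variance term dominates the Welch standard error, which is why the choice of method changes the t-statistic noticeably here.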

# 8. Two-Way ANOVA (Location & Property Type)
## Investigate whether property prices are influenced by:

• Suburb

• Type of property

• Interaction between suburb and property type

Use Two-Way ANOVA and explain which factors significantly affect price.

## Prepare Data for Two-Way ANOVA

### Subtask:
Select the 'Price', 'Suburb', and 'Type' columns from the DataFrame, handle any missing values in these columns, and filter to a manageable subset of suburbs if necessary.


To prepare the data for ANOVA, I will first select the required columns, drop rows with missing values in those columns, and then inspect the unique counts of 'Suburb' and 'Type' to understand the data's cardinality for further analysis.



In [29]:
df_anova = df[['Price', 'Suburb', 'Type']].copy()
df_anova.dropna(inplace=True)

unique_suburbs = df_anova['Suburb'].nunique()
unique_types = df_anova['Type'].nunique()

print(f"Original DataFrame shape: {df.shape}")
print(f"DataFrame for ANOVA (after selecting columns and dropping NA) shape: {df_anova.shape}")
print(f"Number of unique suburbs: {unique_suburbs}")
print(f"Number of unique property types: {unique_types}")
df_anova.head()

Original DataFrame shape: (13580, 21)
DataFrame for ANOVA (after selecting columns and dropping NA) shape: (13580, 3)
Number of unique suburbs: 314
Number of unique property types: 3


Unnamed: 0,Price,Suburb,Type
0,1480000.0,Abbotsford,h
1,1035000.0,Abbotsford,h
2,1465000.0,Abbotsford,h
3,850000.0,Abbotsford,h
4,1600000.0,Abbotsford,h


Given the large number of unique suburbs (314), filtering the `df_anova` DataFrame to include only the top N most frequent suburbs will make the Two-Way ANOVA computationally feasible and more interpretable. I will choose the top 10 most frequent suburbs to ensure a manageable dataset for ANOVA.



In [30]:
top_n_suburbs = 10
top_suburbs = df_anova['Suburb'].value_counts().nlargest(top_n_suburbs).index
df_anova_filtered = df_anova[df_anova['Suburb'].isin(top_suburbs)].copy()

print(f"DataFrame for ANOVA (after filtering to top {top_n_suburbs} suburbs) shape: {df_anova_filtered.shape}")
print(f"Unique suburbs after filtering: {df_anova_filtered['Suburb'].unique()}")
df_anova_filtered.head()

DataFrame for ANOVA (after filtering to top 10 suburbs) shape: (2327, 3)
Unique suburbs after filtering: ['Bentleigh East' 'Brunswick' 'Coburg' 'Essendon' 'Glen Iris' 'Hawthorn'
 'Preston' 'Reservoir' 'Richmond' 'South Yarra']


Unnamed: 0,Price,Suburb,Type
787,1410000.0,Bentleigh East,h
788,760000.0,Bentleigh East,t
789,825000.0,Bentleigh East,h
790,746000.0,Bentleigh East,u
791,580000.0,Bentleigh East,u
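
Before fitting a model with an interaction term, it can help to check how many observations fall in each Suburb and Type combination, since empty or very small cells make the interaction estimates unreliable. The following is a minimal sketch on a tiny made-up DataFrame (the `demo` data and the `min_cell` threshold are assumptions for illustration); the same `pd.crosstab` call could be applied to `df_anova_filtered`.

```python
import pandas as pd

# Tiny illustrative dataset (not the property data)
demo = pd.DataFrame({
    "Suburb": ["A", "A", "A", "B", "B", "B"],
    "Type":   ["h", "h", "u", "h", "h", "h"],
})

# Count observations in every Suburb x Type cell; missing combinations show as 0
counts = pd.crosstab(demo["Suburb"], demo["Type"])
print(counts)

# Flag cells below a chosen minimum size (threshold is an arbitrary example)
min_cell = 2
cell_sizes = counts.stack()
small_cells = cell_sizes[cell_sizes < min_cell]
print("\nCells with fewer than", min_cell, "observations:")
print(small_cells)
```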


## Perform Two-Way ANOVA

Perform a Two-Way ANOVA test to investigate whether property prices are influenced by 'Suburb', 'Type' of property, and their interaction. Clearly state the null and alternative hypotheses for each factor and their interaction, and define the significance level.

#### Instructions:
1.  State the null and alternative hypotheses for the main effects (Suburb, Type) and their interaction effect.
2.  Import `ols` (Ordinary Least Squares) from `statsmodels.formula.api` and `sm` (statsmodels.api).
3.  Fit an ANOVA model using `ols` with 'Price' as the dependent variable and 'Suburb', 'Type', and their interaction as independent variables. The formula will be `Price ~ C(Suburb) + C(Type) + C(Suburb):C(Type)`.
4.  Print the ANOVA table using `sm.stats.anova_lm` to display the F-statistic and p-values for each factor and interaction.

We will state the null and alternative hypotheses for the Two-Way ANOVA, import the necessary `statsmodels` libraries, fit the ANOVA model using the filtered data, and then display the ANOVA table to see the F-statistics and p-values.



In [31]:
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Define the significance level
alpha = 0.05

# State the hypotheses
print("### Null and Alternative Hypotheses for Two-Way ANOVA\n")

print("**Main Effect: Suburb**")
print(f"H0: There is no significant difference in mean property prices across different suburbs (\u03bc1 = \u03bc2 = ... = \u03bck)\n")
print(f"H1: At least one suburb has a significantly different mean property price (Not all \u03bc are equal)\n")

print("**Main Effect: Property Type**")
print(f"H0: There is no significant difference in mean property prices across different property types (\u03bc1 = \u03bc2 = ... = \u03bcl)\n")
print(f"H1: At least one property type has a significantly different mean property price (Not all \u03bc are equal)\n")

print("**Interaction Effect: Suburb * Type**")
print(f"H0: There is no significant interaction effect between suburb and property type on property prices.\n")
print(f"H1: There is a significant interaction effect between suburb and property type on property prices.\n")

print(f"Significance Level (alpha): {alpha}\n")

# Fit the ANOVA model
# C() is used to treat Suburb and Type as categorical variables
model = ols('Price ~ C(Suburb) + C(Type) + C(Suburb):C(Type)', data=df_anova_filtered).fit()

# Generate the ANOVA table
anova_table = sm.stats.anova_lm(model, typ=2)

print("### ANOVA Table")
print(anova_table)

### Null and Alternative Hypotheses for Two-Way ANOVA

**Main Effect: Suburb**
H0: There is no significant difference in mean property prices across different suburbs (μ1 = μ2 = ... = μk)

H1: At least one suburb has a significantly different mean property price (Not all μ are equal)

**Main Effect: Property Type**
H0: There is no significant difference in mean property prices across different property types (μ1 = μ2 = ... = μl)

H1: At least one property type has a significantly different mean property price (Not all μ are equal)

**Interaction Effect: Suburb * Type**
H0: There is no significant interaction effect between suburb and property type on property prices.

H1: There is a significant interaction effect between suburb and property type on property prices.

Significance Level (alpha): 0.05

### ANOVA Table
                         sum_sq      df           F         PR(>F)
C(Suburb)          2.021766e+14     9.0  145.988879  3.058266e-218
C(Type)            2.771113e+14     2.0

## Interpret Two-Way ANOVA Results and Conclude

Interpret the results of the Two-Way ANOVA table, comparing the p-values with the significance level (alpha = 0.05). Draw conclusions about which factors (Suburb, Type, and their interaction) significantly affect property prices.

#### Instructions:
1.  Analyze the `PR(>F)` (p-value) for each factor (`C(Suburb)`, `C(Type)`) and the interaction term (`C(Suburb):C(Type)`).
2.  Compare each p-value with the significance level (alpha = 0.05).
3.  For each factor and interaction, state whether the null hypothesis is rejected or failed to be rejected.
4.  Explain which factors significantly affect property prices based on the statistical findings.
5.  Summarize the overall business implications of these findings regarding property pricing.

Now, we will interpret the p-values from the ANOVA table to determine the significance of each factor and their interaction on property prices. Then, we will draw business conclusions based on these statistical findings.



In [32]:
print("### Interpretation of Two-Way ANOVA Results\n")

alph = 0.05 # Significance level

# Interpret C(Suburb)
p_value_suburb = anova_table.loc['C(Suburb)', 'PR(>F)']
if p_value_suburb < alph:
    print(f"C(Suburb) p-value: {p_value_suburb:.4f} < {alph}. Reject H0. Suburb significantly affects property prices.")
else:
    print(f"C(Suburb) p-value: {p_value_suburb:.4f} >= {alph}. Fail to reject H0. Suburb does not significantly affect property prices.")

# Interpret C(Type)
p_value_type = anova_table.loc['C(Type)', 'PR(>F)']
if p_value_type < alph:
    print(f"C(Type) p-value: {p_value_type:.4f} < {alph}. Reject H0. Property Type significantly affects property prices.")
else:
    print(f"C(Type) p-value: {p_value_type:.4f} >= {alph}. Fail to reject H0. Property Type does not significantly affect property prices.")

# Interpret C(Suburb):C(Type) interaction
p_value_interaction = anova_table.loc['C(Suburb):C(Type)', 'PR(>F)']
if p_value_interaction < alph:
    print(f"C(Suburb):C(Type) p-value: {p_value_interaction:.4f} < {alph}. Reject H0. There is a significant interaction effect between Suburb and Property Type on prices.")
else:
    print(f"C(Suburb):C(Type) p-value: {p_value_interaction:.4f} >= {alph}. Fail to reject H0. There is no significant interaction effect between Suburb and Property Type on prices.")

print("\n### Business Implications for Developers\n")
print("Based on the Two-Way ANOVA results:")

if p_value_suburb < alph:
    print("- **Suburb (Location) is a highly significant factor** influencing property prices. This means that property values vary considerably depending on the specific suburb. Developers should conduct thorough market research on different suburbs to identify areas with high demand and pricing potential for their target property types.")
else:
    print("- Suburb (Location) does not significantly influence property prices at a 5% significance level, suggesting other factors might be more dominant.")

if p_value_type < alph:
    print("- **Property Type is also a highly significant factor** affecting property prices. Different types of properties (e.g., houses, townhouses, units) command different price points. Developers should align their construction plans with property types that are most valued in a given market segment.")
else:
    print("- Property Type does not significantly influence property prices at a 5% significance level.")

if p_value_interaction < alph:
    print("- The **significant interaction effect between Suburb and Property Type** is crucial. This implies that the effect of property type on price is not consistent across all suburbs, and vice-versa. For example, a 'house' might be premium in one suburb but a 'unit' might fetch higher prices in another due to local demographics, infrastructure, or zoning. Developers must consider specific suburb-type combinations when planning projects to maximize profitability and market appeal. A 'one-size-fits-all' approach is unlikely to be optimal.")
else:
    print("- There is no significant interaction effect between Suburb and Property Type on prices, meaning the impact of property type on price is consistent across suburbs, and vice-versa.")

print("\nOverall, developers should adopt a nuanced strategy, carefully considering both the location and the type of property, as well as how these two factors combine to influence market value.")

### Interpretation of Two-Way ANOVA Results

C(Suburb) p-value: 0.0000 < 0.05. Reject H0. Suburb significantly affects property prices.
C(Type) p-value: 0.0000 < 0.05. Reject H0. Property Type significantly affects property prices.
C(Suburb):C(Type) p-value: 0.0000 < 0.05. Reject H0. There is a significant interaction effect between Suburb and Property Type on prices.

### Business Implications for Developers

Based on the Two-Way ANOVA results:
- **Suburb (Location) is a highly significant factor** influencing property prices. This means that property values vary considerably depending on the specific suburb. Developers should conduct thorough market research on different suburbs to identify areas with high demand and pricing potential for their target property types.
- **Property Type is also a highly significant factor** affecting property prices. Different types of properties (e.g., houses, townhouses, units) command different price points. Developers should align their constructio

## Summary:

Within the 10 most frequent suburbs analysed, the factors that significantly influence property prices are 'Suburb', 'Type' of property, and the interaction between 'Suburb' and 'Type'.

### Data Analysis Key Findings
*   After initial data cleaning, the dataset for ANOVA included 'Price', 'Suburb', and 'Type' columns, with missing values handled, resulting in 13,580 observations.
*   To manage the complexity of the Two-Way ANOVA, the analysis was focused on the top 10 most frequent suburbs, reducing the dataset to 2,327 observations.
*   The Two-Way ANOVA revealed that 'Suburb' significantly influences property prices, with a p-value of approximately $3.06 \times 10^{-218}$ (far below the $0.05$ significance level; it prints as $0.0000$ at four decimal places).
*   'Property Type' also significantly influences property prices, evidenced by a p-value that prints as $0.0000$ at four decimal places (below $0.0001$).
*   Crucially, there is a significant interaction effect between 'Suburb' and 'Property Type' on property prices, also with a p-value below $0.0001$. This means the effect of property type on price is not consistent across all suburbs, and vice-versa.

### Insights
*   Developers should adopt a highly nuanced strategy, considering both the specific location (suburb) and the property type, as their combined effect on market value is interactive, not merely additive.
*   Further analysis could involve exploring specific suburb-type combinations that yield the highest and lowest property prices to inform targeted development strategies and pricing models.
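
The follow-up analysis suggested in the last point could start from a table of mean prices per suburb-type combination. This is a minimal sketch on a small invented DataFrame (`demo` and its values are assumptions for illustration); with the notebook's data, `df_anova_filtered` would take its place.

```python
import pandas as pd

# Small illustrative dataset (not the property data)
demo = pd.DataFrame({
    "Suburb": ["A", "A", "A", "B", "B", "B"],
    "Type":   ["h", "u", "h", "h", "u", "u"],
    "Price":  [900_000, 500_000, 1_100_000, 700_000, 450_000, 550_000],
})

# Mean price for each Suburb x Type cell, laid out as a grid
cell_means = demo.pivot_table(index="Suburb", columns="Type",
                              values="Price", aggfunc="mean")
print(cell_means)

# The same information as a ranked list, with cell sizes for context
ranked = (demo.groupby(["Suburb", "Type"])["Price"]
              .agg(["mean", "count"])
              .sort_values("mean", ascending=False))
print(ranked)
```

Keeping the `count` column next to each mean makes it easier to avoid over-reading combinations that rest on only a few sales.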


# Question 9. p-Value Interpretation (Decision Making)
## A hypothesis test comparing prices across two suburbs results in a p-value of 0.032.

Answer:

• What does this p-value indicate?

• Should the null hypothesis be rejected at α = 0.05?

• How should a business stakeholder interpret this result?

Let's break down the interpretation of a p-value of 0.032 for a hypothesis test comparing prices across two suburbs, at a significance level ($\alpha$) of 0.05:

### What does this p-value indicate?
A p-value of 0.032 means that if there were truly no difference in property prices between the two suburbs (i.e., if the null hypothesis were true), there would be only a 3.2% chance of observing a difference as extreme as, or more extreme than, the one found in the sample data.

### Should the null hypothesis be rejected at $\alpha$ = 0.05?

Yes. Since the p-value (0.032) is less than the significance level ($\alpha = 0.05$), we should reject the null hypothesis. The null hypothesis typically states that there is no difference (or no effect). Rejecting it suggests there is a statistically significant difference.

### How should a business stakeholder interpret this result?

For a business stakeholder, this result indicates that the observed difference in property prices between the two suburbs is unlikely to be due to random chance alone. There is statistically significant evidence (at the 5% level) that the average property prices in these two suburbs differ. This finding could be crucial for:

*   **Investment Decisions:** Guiding where to invest or develop properties.
*   **Pricing Strategies:** Informing pricing for properties in each suburb.
*   **Marketing:** Tailoring marketing messages based on perceived value differences.

In practical terms, it means the suburbs should not be treated as interchangeable in terms of average property value. The p-value does not say how large the difference is, so the size of the price gap should be reviewed alongside it before acting.
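
The meaning of "a 3.2% chance if the null hypothesis were true" can be illustrated with a small simulation. In this sketch, both groups are drawn from the same synthetic distribution (the sample sizes, mean, and spread are arbitrary assumptions), so the null hypothesis holds by construction, and we count how often a two-sided Welch test still produces a small p-value.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(7)
n_sims, n_per_group = 2000, 50

# Two "suburbs" with identical price distributions: H0 is true in every simulation
suburb_a = rng.normal(loc=1_000_000, scale=300_000, size=(n_sims, n_per_group))
suburb_b = rng.normal(loc=1_000_000, scale=300_000, size=(n_sims, n_per_group))

# One Welch t-test per simulated pair of samples (row-wise)
_, p_values = ttest_ind(suburb_a, suburb_b, axis=1, equal_var=False)

false_positive_rate = float(np.mean(p_values < 0.05))
share_at_or_below_0032 = float(np.mean(p_values <= 0.032))

print(f"Share of simulations with p < 0.05:   {false_positive_rate:.3f}")
print(f"Share of simulations with p <= 0.032: {share_at_or_below_0032:.3f}")
```

Both shares should land close to their nominal values (about 5% and about 3.2%), which is what makes a p-value of 0.032 count as moderately surprising under the null hypothesis rather than as proof of a difference.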

# 10. Industry-Style Hypothesis Validation (Policy Decision)
A housing policy group believes that properties with more than 2 bathrooms
command a premium price.
Design and execute a statistical test to validate this claim:

• Identify the correct test

• State hypotheses

• Report p-value

• Give a clear recommendation to policymakers

# Task
The final goal is to validate a housing policy group's claim that properties with more than 2 bathrooms command a premium price using a statistical test on the `df` DataFrame. This involves preparing the data by separating properties into two groups based on bathroom count (more than 2 vs. 2 or fewer) and handling missing values, identifying and stating the hypotheses for an independent two-sample t-test, executing the test to report the p-value at a 5% significance level, and finally, providing a clear recommendation to policymakers based on the test's outcome.

## Prepare Data for Bathroom Count Comparison

Separate the dataset into two groups: properties with more than 2 bathrooms, and properties with 2 or fewer bathrooms. Handle any missing values in the 'Bathroom' column appropriately (e.g., treating NaN as 0 bathrooms or dropping). Extract the 'Price' for each group, dropping any missing price values.


To prepare the data for the hypothesis test, I will first create a copy of the original DataFrame and handle missing 'Bathroom' values by filling them with 0 (which places any property with an unknown bathroom count in the "2 or fewer" group). I will then separate properties into two groups based on whether they have more than 2 bathrooms or 2 or fewer bathrooms, extract and clean the 'Price' for each group, and print their descriptive statistics.



In [33]:
df_bathroom = df.copy()
df_bathroom['Bathroom'] = df_bathroom['Bathroom'].fillna(0)

# Group 1: Properties with more than 2 bathrooms
prices_more_than_2_bathrooms = df_bathroom[df_bathroom['Bathroom'] > 2]['Price'].dropna()

# Group 2: Properties with 2 or fewer bathrooms
prices_2_or_fewer_bathrooms = df_bathroom[df_bathroom['Bathroom'] <= 2]['Price'].dropna()

print(f"Number of properties with more than 2 bathrooms: {len(prices_more_than_2_bathrooms)}")
print(f"Mean price for properties with more than 2 bathrooms: ${prices_more_than_2_bathrooms.mean():,.2f}")
print(f"Standard deviation for properties with more than 2 bathrooms: ${prices_more_than_2_bathrooms.std():,.2f}")

print(f"\nNumber of properties with 2 or fewer bathrooms: {len(prices_2_or_fewer_bathrooms)}")
print(f"Mean price for properties with 2 or fewer bathrooms: ${prices_2_or_fewer_bathrooms.mean():,.2f}")
print(f"Standard deviation for properties with 2 or fewer bathrooms: ${prices_2_or_fewer_bathrooms.std():,.2f}")

Number of properties with more than 2 bathrooms: 1060
Mean price for properties with more than 2 bathrooms: $1,882,824.20
Standard deviation for properties with more than 2 bathrooms: $976,657.12

Number of properties with 2 or fewer bathrooms: 12520
Mean price for properties with 2 or fewer bathrooms: $1,007,347.94
Standard deviation for properties with 2 or fewer bathrooms: $550,277.34
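
The choice between filling missing bathroom counts with 0 and dropping those rows can change the group sizes. The sketch below shows the difference on a small invented DataFrame (`demo` and its values are assumptions for illustration); running the same `isna().sum()` check on `df['Bathroom']` would show whether the choice matters for the real data.

```python
import numpy as np
import pandas as pd

# Small illustrative dataset with two unknown bathroom counts
demo = pd.DataFrame({
    "Bathroom": [1, 3, np.nan, 2, 4, np.nan],
    "Price":    [500_000, 1_500_000, 2_000_000, 700_000, 1_800_000, 600_000],
})

n_missing = int(demo["Bathroom"].isna().sum())

# Option 1: fill unknowns with 0, which moves them into the "2 or fewer" group
filled = demo.assign(Bathroom=demo["Bathroom"].fillna(0))
n_low_filled = int((filled["Bathroom"] <= 2).sum())

# Option 2: drop rows where the bathroom count (or price) is unknown
known = demo.dropna(subset=["Bathroom", "Price"])
n_low_dropped = int((known["Bathroom"] <= 2).sum())

print(f"Missing bathroom values: {n_missing}")
print(f"'2 or fewer' group size with fillna(0): {n_low_filled}")
print(f"'2 or fewer' group size with dropna:    {n_low_dropped}")
```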


## Perform Independent Two-Sample T-Test

Conduct an independent two-sample t-test to compare the mean property prices between properties with more than 2 bathrooms and properties with 2 or fewer bathrooms. Clearly state the null and alternative hypotheses and use a 5% significance level.

### Identify the Correct Test:
An **independent two-sample t-test** is appropriate here because we are comparing the means of two independent groups (properties with >2 bathrooms vs. properties with <=2 bathrooms) and the dependent variable (price) is continuous. Since the claim is that properties with more than 2 bathrooms command a *premium price* (i.e., higher price), it will be a one-tailed test.

### Hypotheses:

*   **Null Hypothesis ($H_0$)**: Properties with more than 2 bathrooms do not command a significantly higher average price than properties with 2 or fewer bathrooms. ($\mu_{\text{>2 bathrooms}} \le \mu_{\text{<=2 bathrooms}}$)
*   **Alternative Hypothesis ($H_1$)**: Properties with more than 2 bathrooms command a significantly higher average price than properties with 2 or fewer bathrooms. ($\mu_{\text{>2 bathrooms}} > \mu_{\text{<=2 bathrooms}}$)

### Instructions:
1.  Set the significance level (alpha) to 0.05.
2.  Import the `ttest_ind` function from `scipy.stats`.
3.  Perform an independent two-sample t-test using `ttest_ind` on `prices_more_than_2_bathrooms` and `prices_2_or_fewer_bathrooms`. Remember to specify `alternative='greater'` for the one-tailed test.
4.  Print the calculated t-statistic and p-value from the test.

Now, I will perform the independent two-sample t-test using `scipy.stats.ttest_ind` with the specified `alternative='greater'` to test the hypothesis that properties with more than 2 bathrooms command a significantly higher average price.



In [34]:
from scipy.stats import ttest_ind

alpha = 0.05

# Perform independent two-sample t-test (one-tailed: alternative='greater')
t_statistic_bathroom, p_value_bathroom = ttest_ind(prices_more_than_2_bathrooms, prices_2_or_fewer_bathrooms, equal_var=False, alternative='greater')

print(f"T-statistic: {t_statistic_bathroom:.4f}")
print(f"P-value: {p_value_bathroom:.4f}")
print(f"Significance level (alpha): {alpha}")

T-statistic: 28.8002
P-value: 0.0000
Significance level (alpha): 0.05
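
A p-value printed as `0.0000` is only zero to four decimal places. One way to report such values more informatively is a small formatting helper; `format_p` below is a hypothetical helper written for this illustration, not part of SciPy.

```python
def format_p(p: float, floor: float = 1e-4) -> str:
    """Format a p-value, switching to scientific notation below `floor`."""
    if p < floor:
        # Report the bound plus the actual value in scientific notation
        return f"< {floor:g} ({p:.3e})"
    return f"{p:.4f}"


# Example values: one very small p-value and one ordinary p-value
print(format_p(3.058266e-218))
print(format_p(0.8243))
```

Calling `format_p(p_value_bathroom)` in place of the `:.4f` format would show the actual magnitude of the result.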


## Interpret T-Test Results and Provide Recommendation

Interpret the results of the independent two-sample t-test by comparing the calculated p-value with the significance level ($\alpha = 0.05$). Based on this comparison, draw a conclusion about the housing policy group's claim and provide a clear recommendation to policymakers.

### Instructions:
1.  Compare the `p_value_bathroom` with `alpha`.
2.  State whether to reject or fail to reject the null hypothesis.
3.  Based on the conclusion, provide a clear, actionable recommendation to policymakers regarding properties with more than 2 bathrooms and their pricing premium.

I will compare the calculated p-value from the t-test with the significance level (alpha) to determine whether to reject the null hypothesis. Based on this statistical conclusion, I will provide a clear business recommendation to policymakers regarding the claim that properties with more than 2 bathrooms command a premium price.



In [35]:
print(f"\nComparing P-value ({p_value_bathroom:.4f}) with Significance Level ({alpha})")

if p_value_bathroom < alpha:
    print("\nConclusion: Reject the null hypothesis. There is sufficient evidence to suggest that properties with more than 2 bathrooms command a significantly higher average price than properties with 2 or fewer bathrooms.")
    print("\nRecommendation for Policymakers: The statistical test strongly supports the claim that properties with more than 2 bathrooms command a premium price. Policymakers can leverage this finding to encourage the development of properties with higher bathroom counts, potentially through incentives or zoning adjustments, to meet market demand and possibly influence property values in certain areas.")
else:
    print("\nConclusion: Fail to reject the null hypothesis. There is not sufficient evidence to suggest that properties with more than 2 bathrooms command a significantly higher average price than properties with 2 or fewer bathrooms.")
    print("\nRecommendation for Policymakers: The statistical test does not provide sufficient evidence to support the claim that properties with more than 2 bathrooms command a premium price. Policymakers should be cautious about implementing policies based solely on this factor, and further research might be needed to understand other drivers of property value or to investigate this claim under different market conditions.")


Comparing P-value (0.0000) with Significance Level (0.05)

Conclusion: Reject the null hypothesis. There is sufficient evidence to suggest that properties with more than 2 bathrooms command a significantly higher average price than properties with 2 or fewer bathrooms.

Recommendation for Policymakers: The statistical test strongly supports the claim that properties with more than 2 bathrooms command a premium price. Policymakers can leverage this finding to encourage the development of properties with higher bathroom counts, potentially through incentives or zoning adjustments, to meet market demand and possibly influence property values in certain areas.


## Summarize the findings of the statistical test, including the identified test, hypotheses, p-value, and the recommendation provided to policymakers.


## Summary:

The findings of the statistical test are as follows:
*   **Identified Test:** An independent two-sample t-test (specifically, Welch's t-test due to `equal_var=False`) was used. It was a one-tailed test with `alternative='greater'`.
*   **Hypotheses:**
    *   **Null Hypothesis ($H_0$):** Properties with more than 2 bathrooms do not command a significantly higher average price than properties with 2 or fewer bathrooms. ($\mu_{\text{>2 bathrooms}} \le \mu_{\text{<=2 bathrooms}}$)
    *   **Alternative Hypothesis ($H_1$):** Properties with more than 2 bathrooms command a significantly higher average price than properties with 2 or fewer bathrooms. ($\mu_{\text{>2 bathrooms}} > \mu_{\text{<=2 bathrooms}}$)
*   **P-value:** The calculated p-value printed as 0.0000 at four decimal places, i.e. it is below 0.0001.
*   **Recommendation to Policymakers:** Based on the strong statistical evidence, policymakers can leverage this finding to encourage the development of properties with higher bathroom counts, potentially through incentives or zoning adjustments, to meet market demand and influence property values in specific areas.

### Data Analysis Key Findings
*   Properties with more than 2 bathrooms (1,060 properties) had a mean price of `$1,882,824.20` with a standard deviation of `$976,657.12`.
*   Properties with 2 or fewer bathrooms (12,520 properties) had a mean price of `$1,007,347.94` with a standard deviation of `$550,277.34`.
*   The independent two-sample t-test yielded a t-statistic of 28.8002 and a p-value below 0.0001 (printed as 0.0000).
*   Given that the p-value is less than the significance level ($\alpha = 0.05$), the null hypothesis was rejected.
*   There is sufficient evidence to conclude that properties with more than 2 bathrooms command a significantly higher average price than properties with 2 or fewer bathrooms.

### Insights:
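*   The housing policy group's claim that properties with more than 2 bathrooms command a premium price is strongly supported by the data. Because the test compares raw group means without controlling for other features such as rooms or land size, the gap is an association rather than the isolated effect of bathrooms.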
*   Policymakers could explore policies, such as incentives or zoning adjustments, to encourage the construction of properties with more bathrooms, aligning supply with market demand and potentially impacting property values positively.
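
To give policymakers a sense of the size of the premium and not only its significance, the difference in means can be reported with a confidence interval. The sketch below assumes the rounded summary statistics listed under the key findings and computes a two-sided 95% Welch interval by hand; this is a separate summary from the one-tailed test above, and the variable names are illustrative.

```python
from math import sqrt
from scipy.stats import t as t_dist

# Summary statistics as reported above (rounded to cents)
mean_hi, sd_hi, n_hi = 1_882_824.20, 976_657.12, 1_060    # more than 2 bathrooms
mean_lo, sd_lo, n_lo = 1_007_347.94, 550_277.34, 12_520   # 2 or fewer bathrooms

# Per-group variance of the sample mean
v_hi, v_lo = sd_hi**2 / n_hi, sd_lo**2 / n_lo

diff = mean_hi - mean_lo
se = sqrt(v_hi + v_lo)       # Welch (unpooled) standard error
t_stat = diff / se

# Welch-Satterthwaite degrees of freedom
dof = (v_hi + v_lo) ** 2 / (v_hi**2 / (n_hi - 1) + v_lo**2 / (n_lo - 1))

# Two-sided 95% confidence interval for the difference in means
crit = t_dist.ppf(0.975, dof)
ci_low, ci_high = diff - crit * se, diff + crit * se

print(f"Difference in mean price: ${diff:,.2f}")
print(f"Welch t-statistic: {t_stat:.4f} (df = {dof:.1f})")
print(f"95% CI for the difference: ${ci_low:,.0f} to ${ci_high:,.0f}")
```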
