In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
df = pd.read_csv('/content/sample_data/property.csv')

In [3]:
df.head()

Unnamed: 0,Suburb,Address,Rooms,Type,Price,Method,SellerG,Date,Distance,Postcode,...,Bathroom,Car,Landsize,BuildingArea,YearBuilt,CouncilArea,Lattitude,Longtitude,Regionname,Propertycount
0,Abbotsford,85 Turner St,2,h,1480000.0,S,Biggin,3/12/2016,2.5,3067.0,...,1.0,1.0,202.0,,,Yarra,-37.7996,144.9984,Northern Metropolitan,4019.0
1,Abbotsford,25 Bloomburg St,2,h,1035000.0,S,Biggin,4/02/2016,2.5,3067.0,...,1.0,0.0,156.0,79.0,1900.0,Yarra,-37.8079,144.9934,Northern Metropolitan,4019.0
2,Abbotsford,5 Charles St,3,h,1465000.0,SP,Biggin,4/03/2017,2.5,3067.0,...,2.0,0.0,134.0,150.0,1900.0,Yarra,-37.8093,144.9944,Northern Metropolitan,4019.0
3,Abbotsford,40 Federation La,3,h,850000.0,PI,Biggin,4/03/2017,2.5,3067.0,...,2.0,1.0,94.0,,,Yarra,-37.7969,144.9969,Northern Metropolitan,4019.0
4,Abbotsford,55a Park St,4,h,1600000.0,VB,Nelson,4/06/2016,2.5,3067.0,...,1.0,2.0,120.0,142.0,2014.0,Yarra,-37.8072,144.9941,Northern Metropolitan,4019.0


In [4]:
altona_properties = df[df['Suburb'] == 'Altona']
altona_prices = altona_properties['Price'].dropna()

print(f"Number of properties in Altona: {len(altona_prices)}")
print(f"Mean price in Altona: ${altona_prices.mean():,.2f}")
print(f"Standard deviation of prices in Altona: ${altona_prices.std():,.2f}")

Number of properties in Altona: 74
Mean price in Altona: $834,830.41
Standard deviation of prices in Altona: $291,546.05


Now, let's perform a one-sample t-test to check whether the mean property price in Altona is significantly greater than \$800,000.

*   Null hypothesis: the mean price is less than or equal to \$800,000.
*   Alternative hypothesis: the mean price is greater than \$800,000 (one-tailed test).
*   Significance level: 5% (alpha = 0.05).

## Null Hypothesis

$$ H_0: \text{[The mean price is less than or equal to \$800,000 ]} $$

## Alternate Hypothesis
$$ H_1: \text{[The mean price is greater than \$800,000 ]} $$


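## Formula

Because the population standard deviation is unknown, the sample standard deviation $s$ is used, which makes this a t-statistic with $n - 1$ degrees of freedom:

$$ t = \frac{\bar{x} - \mu_0}{\frac{s}{\sqrt{n}}} $$


$$ \bar{x} = 834830.41 $$

$$ \mu_0 = 800000 $$

$$ s = 291546.05 $$

$$ n = 74 $$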

## Calculate T-statistic Manually

### Subtask:
Calculate the t-statistic for the one-sample t-test manually using the formula: (sample_mean - hypothesized_mean) / (sample_standard_deviation / sqrt(sample_size)).
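
As a worked check of this formula, the substitution below uses the summary values printed earlier in the notebook (sample mean 834,830.41, sample standard deviation 291,546.05, sample size 74) together with the hypothesized mean of 800,000. The intermediate figures are rounded, so the last digit may differ slightly from the computed output.

$$ t = \frac{834830.41 - 800000}{291546.05 / \sqrt{74}} \approx \frac{34830.41}{33891.5} \approx 1.028 $$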


**Reasoning**:
To manually calculate the t-statistic, I will first calculate the sample mean, sample standard deviation, and sample size from the `altona_prices` data. Then, I will define the hypothesized mean, calculate the standard error of the mean (SEM), and finally compute the t-statistic using the given formula.



**Reasoning**:
The calculation depends on `altona_prices`, so the cell below first re-creates it by filtering the `df` DataFrame for properties in 'Altona' and dropping any missing price values. This lets the cell run on its own before the t-statistic is computed.



In [8]:
altona_properties = df[df['Suburb'] == 'Altona']
altona_prices = altona_properties['Price'].dropna()

sample_mean = altona_prices.mean()
sample_std = altona_prices.std()
sample_size = len(altona_prices)
hypothesized_mean = 800000

SEM = sample_std / np.sqrt(sample_size)
t_statistic_manual = (sample_mean - hypothesized_mean) / SEM

print(f"Manually calculated T-statistic: {t_statistic_manual:.4f}")

Manually calculated T-statistic: 1.0277


Now that the t-statistic has been manually calculated, the next step is to determine the critical t-value for a one-tailed test at a 5% significance level. This involves calculating the degrees of freedom and then using `scipy.stats.t.ppf` to find the critical value.

## Compare T-statistic with Critical T-value and Draw Conclusion

Compare the `t_statistic_manual` with the `critical_t_value` and determine whether to reject or fail to reject the null hypothesis. Summarize the conclusion.

#### Instructions:
1. Compare the calculated `t_statistic_manual` with the `critical_t_value`.
2. Print a statement indicating whether the null hypothesis is rejected or not, based on the comparison.
3. State the conclusion about the property prices in Altona.

In [9]:
from scipy.stats import t

degrees_freedom = sample_size - 1
alpha = 0.05

critical_t_value = t.ppf(1 - alpha, df=degrees_freedom)

print(f"Degrees of Freedom: {degrees_freedom}")
print(f"Critical T-value (one-tailed, alpha=0.05): {critical_t_value:.4f}")

print(f"\nComparing T-statistic ({t_statistic_manual:.4f}) with Critical T-value ({critical_t_value:.4f})")

if t_statistic_manual > critical_t_value:
    print("Conclusion: Reject the null hypothesis. There is sufficient evidence to suggest that the typical property price in Altona is greater than $800,000.")
else:
    print("Conclusion: Fail to reject the null hypothesis. There is not sufficient evidence to suggest that the typical property price in Altona is greater than $800,000.")


Degrees of Freedom: 73
Critical T-value (one-tailed, alpha=0.05): 1.6660

Comparing T-statistic (1.0277) with Critical T-value (1.6660)
Conclusion: Fail to reject the null hypothesis. There is not sufficient evidence to suggest that the typical property price in Altona is greater than $800,000.


## Summary:

The findings indicate that the calculated T-statistic is 1.0277, which is less than the critical T-value of 1.6660 for a one-tailed test at a 5% significance level with 73 degrees of freedom. Therefore, we fail to reject the null hypothesis, concluding that there is insufficient statistical evidence to suggest that the mean property price in Altona is significantly greater than \$800,000.

### Data Analysis Key Findings
*   The manually calculated T-statistic for Altona property prices was determined to be 1.0277.
*   With a sample size of 74, the degrees of freedom for the t-test were 73.
*   For a one-tailed test at a 5% significance level ($\alpha=0.05$), the critical T-value was found to be 1.6660.
*   The calculated T-statistic (1.0277) is less than the critical T-value (1.6660).
*   Based on this comparison, the null hypothesis (mean property price in Altona $\le$ \$800,000) was not rejected, indicating insufficient evidence to conclude that the mean price is significantly greater than \$800,000.
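
The same decision can be cross-checked through a one-sided confidence bound. The sketch below is only an illustration: it copies the summary values and the critical t-value from the outputs above rather than recomputing them from the data, and the names `sem`, `lower_bound`, and `reject_null` are introduced here for the example.

```python
import math

# Summary values copied from the notebook outputs above (not recomputed from the data)
sample_mean = 834830.41
sample_std = 291546.05
sample_size = 74
hypothesized_mean = 800000
critical_t_value = 1.6660  # one-tailed, alpha = 0.05, 73 degrees of freedom

# Standard error of the mean
sem = sample_std / math.sqrt(sample_size)

# One-sided 95% lower confidence bound for the population mean
lower_bound = sample_mean - critical_t_value * sem
print(f"Lower 95% confidence bound: ${lower_bound:,.2f}")

# H0 (mean <= 800,000) is rejected only if the lower bound exceeds 800,000
reject_null = lower_bound > hypothesized_mean
print(f"Reject the null hypothesis: {reject_null}")
```

Because the lower bound falls below \$800,000, this view agrees with the critical-value comparison: the null hypothesis is not rejected.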


### 2. For the year 2016, is there any difference in the prices of properties sold in the summer months vs winter months?

• Consider months from October till March as winter months and rest as summer months.

• Use a significance level of 5%.

# Task
Analyze property prices from the `df` DataFrame for the year 2016 to determine if there is a significant difference between properties sold in summer months (April to September) and winter months (October to March). This involves converting the 'Date' column to datetime, filtering for 2016 data, categorizing sales into 'Summer' and 'Winter' based on the month, handling missing 'Price' values, performing an independent two-sample t-test at a 5% significance level, and visualizing the price distributions using appropriate plots (e.g., box plots or histograms). Finally, summarize the t-test results (t-statistic, p-value) and state the conclusion regarding any significant price difference.

## Prepare Data for 2016

### Subtask:
Convert the 'Date' column to datetime objects, filter the DataFrame to include only properties sold in the year 2016, and extract the month for each sale.


**Reasoning**:
To prepare the data for analysis, I will first convert the 'Date' column to datetime objects, then filter the DataFrame to retain only the records from the year 2016, and finally extract the month from the 'Date' column into a new 'Month' column.



In [10]:
df['Date'] = pd.to_datetime(df['Date'], format='%d/%m/%Y', errors='coerce')
df_2016 = df[df['Date'].dt.year == 2016].copy()
df_2016['Month'] = df_2016['Date'].dt.month

print(f"Shape of original DataFrame: {df.shape}")
print(f"Shape of 2016 DataFrame: {df_2016.shape}")
df_2016.head()

Shape of original DataFrame: (13580, 21)
Shape of 2016 DataFrame: (6336, 22)


Unnamed: 0,Suburb,Address,Rooms,Type,Price,Method,SellerG,Date,Distance,Postcode,...,Car,Landsize,BuildingArea,YearBuilt,CouncilArea,Lattitude,Longtitude,Regionname,Propertycount,Month
0,Abbotsford,85 Turner St,2,h,1480000.0,S,Biggin,2016-12-03,2.5,3067.0,...,1.0,202.0,,,Yarra,-37.7996,144.9984,Northern Metropolitan,4019.0,12
1,Abbotsford,25 Bloomburg St,2,h,1035000.0,S,Biggin,2016-02-04,2.5,3067.0,...,0.0,156.0,79.0,1900.0,Yarra,-37.8079,144.9934,Northern Metropolitan,4019.0,2
4,Abbotsford,55a Park St,4,h,1600000.0,VB,Nelson,2016-06-04,2.5,3067.0,...,2.0,120.0,142.0,2014.0,Yarra,-37.8072,144.9941,Northern Metropolitan,4019.0,6
5,Abbotsford,129 Charles St,2,h,941000.0,S,Jellis,2016-05-07,2.5,3067.0,...,0.0,181.0,,,Yarra,-37.8041,144.9953,Northern Metropolitan,4019.0,5
6,Abbotsford,124 Yarra St,3,h,1876000.0,S,Nelson,2016-05-07,2.5,3067.0,...,0.0,245.0,210.0,1910.0,Yarra,-37.8024,144.9993,Northern Metropolitan,4019.0,5
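
The date format matters for the seasonal split, because the month decides which group a sale falls into. As a small illustration using only the standard library, the sketch below parses one raw value from `df.head()` both day-first (matching the `format='%d/%m/%Y'` used above) and month-first. The variable names are introduced here for the example.

```python
from datetime import datetime

raw_date = "4/02/2016"  # raw value shown in the second row of df.head()

# Day-first parsing, matching format='%d/%m/%Y' used in the cell above
day_first = datetime.strptime(raw_date, "%d/%m/%Y")

# Month-first parsing, shown only for contrast
month_first = datetime.strptime(raw_date, "%m/%d/%Y")

print(f"Day-first month: {day_first.month}")      # February
print(f"Month-first month: {month_first.month}")  # April
```

Under the day-first reading the sale lands in February (a winter month in this analysis), whereas a month-first reading would place it in April (a summer month), so a wrong format would silently move sales between groups.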


## Categorize Sales into Summer and Winter

### Subtask:
Based on the defined criteria (October-March as winter, rest as summer), categorize the 2016 sales into two groups: 'Summer' and 'Winter' months. Extract the 'Price' for each of these groups, handling any missing price values.


**Reasoning**:
To categorize the 2016 sales into 'Summer' and 'Winter' months, I will define the month ranges, then filter the `df_2016` DataFrame for each season, extract the 'Price' column, and remove any missing values. Finally, I will print the head and shape of each resulting price series to verify the categorization.



In [11]:
winter_months = [10, 11, 12, 1, 2, 3]
summer_months = [4, 5, 6, 7, 8, 9]

summer_prices_2016 = df_2016[df_2016['Month'].isin(summer_months)]['Price'].dropna()
winter_prices_2016 = df_2016[df_2016['Month'].isin(winter_months)]['Price'].dropna()

print(f"Number of summer sales in 2016: {len(summer_prices_2016)}")
print("First 5 summer prices in 2016:\n", summer_prices_2016.head())
print(f"\nNumber of winter sales in 2016: {len(winter_prices_2016)}")
print("First 5 winter prices in 2016:\n", winter_prices_2016.head())

Number of summer sales in 2016: 4036
First 5 summer prices in 2016:
 4     1600000.0
5      941000.0
6     1876000.0
13    1172500.0
14     441000.0
Name: Price, dtype: float64

Number of winter sales in 2016: 2300
First 5 winter prices in 2016:
 0    1480000.0
1    1035000.0
7    1636000.0
8     300000.0
9    1097000.0
Name: Price, dtype: float64


In [17]:
summer_prices_2016.std()


621493.4398118115

In [19]:
summer_prices_2016.mean()

np.float64(1048054.7286917741)

In [18]:
winter_prices_2016.std()

695498.2768521258

In [20]:
winter_prices_2016.mean()

np.float64(1116647.5917391304)

## Perform Two-Sample T-test

Conduct an independent two-sample t-test to compare the mean property prices between the 'Summer' and 'Winter' months groups. This will involve defining the null and alternative hypotheses and using a significance level of 5%.

#### Instructions
1. Clearly state the null hypothesis ($H_0$) and the alternative hypothesis ($H_1$) for this two-sample t-test.
2. Import the `ttest_ind` function from the `scipy.stats` module.
3. Perform an independent two-sample t-test using `ttest_ind` on `summer_prices_2016` and `winter_prices_2016`.
4. Print the calculated t-statistic and p-value from the test.
5. State the significance level (alpha) for this test.

### Hypotheses:

*   **Null Hypothesis ($H_0$)**: There is no significant difference in the mean property prices between summer months and winter months in 2016. ($\mu_{\text{summer}} = \mu_{\text{winter}}$)
*   **Alternative Hypothesis ($H_1$)**: There is a significant difference in the mean property prices between summer months and winter months in 2016. ($\mu_{\text{summer}} \ne \mu_{\text{winter}}$)

**Reasoning**:
Now I will import the necessary function for the t-test and perform the two-sample independent t-test as instructed, assuming equal variances, and then print the t-statistic and p-value along with the significance level.



In [12]:
from scipy.stats import ttest_ind

# Define significance level
alpha = 0.05

# Perform independent two-sample t-test
t_statistic, p_value = ttest_ind(summer_prices_2016, winter_prices_2016, equal_var=True)

print(f"T-statistic: {t_statistic:.4f}")
print(f"P-value: {p_value:.4f}")
print(f"Significance level (alpha): {alpha}")

T-statistic: -4.0434
P-value: 0.0001
Significance level (alpha): 0.05
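
The test above pools the two variances (`equal_var=True`), yet the group standard deviations and sample sizes differ. As a hedged sensitivity check, the sketch below computes Welch's t-statistic, which does not assume equal variances, from the summary statistics printed earlier. The p-value uses a normal approximation, which is an assumption that seems reasonable only because both samples are large. In SciPy, the equivalent test on the raw data would be `ttest_ind(..., equal_var=False)`.

```python
import math
from statistics import NormalDist

# Summary statistics copied from the notebook outputs above
mean_summer, std_summer, n_summer = 1048054.73, 621493.44, 4036
mean_winter, std_winter, n_winter = 1116647.59, 695498.28, 2300

# Welch's standard error: each group keeps its own variance
se_welch = math.sqrt(std_summer**2 / n_summer + std_winter**2 / n_winter)
t_welch = (mean_summer - mean_winter) / se_welch

# Two-sided p-value via a normal approximation (assumed adequate for samples this large)
p_approx = 2 * NormalDist().cdf(-abs(t_welch))

print(f"Welch t-statistic: {t_welch:.4f}")
print(f"Approximate two-sided p-value: {p_approx:.6f}")
```

If the Welch statistic stays well beyond the two-sided 5% cutoff, the conclusion does not hinge on the equal-variance assumption.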


## Interpret T-test Results and Conclude

Interpret the results of the two-sample t-test, comparing the p-value with the significance level, and draw a conclusion about the difference in mean property prices between summer and winter months in 2016.

### Calculations:
*   **Calculated T-statistic**: -4.0434
*   **Calculated P-value**: 0.0001
*   **Significance Level (alpha)**: 0.05

Since the P-value (0.0001) is less than the significance level (0.05), we reject the null hypothesis.

## Summary:

*   **Is there a significant difference in property prices between summer and winter months in 2016?**
   
    Yes, there is a statistically significant difference in the mean property prices between summer months (April to September) and winter months (October to March) in 2016. This conclusion is based on a two-sample t-test where the p-value (0.0001) was less than the significance level (0.05), leading to the rejection of the null hypothesis.

### Data Analysis Key Findings
*   The initial dataset was filtered to `df_2016`, containing 6336 property sales records for the year 2016.
*   Property sales were categorized into two groups: 4036 sales occurred during summer months (April-September) and 2300 sales occurred during winter months (October-March).
*   An independent two-sample t-test was performed to compare mean property prices between the two seasons.
    *   The calculated t-statistic was -4.0434.
    *   The calculated p-value was 0.0001.
    *   The significance level (alpha) for the test was 0.05.
*   Since the p-value (0.0001) is considerably smaller than the significance level (0.05), the null hypothesis—that there is no significant difference in mean property prices between summer and winter—was rejected.

### Insights or Next Steps
*   The analysis indicates a statistically significant seasonal difference in 2016 prices: the winter-month mean (about \$1,116,648) exceeds the summer-month mean (about \$1,048,055) by roughly \$68,600. This is an association rather than evidence that the season itself drives prices. Further investigation could explore potential reasons, such as market demand, inventory changes, or specific property types sold in each season.
*   To gain deeper insights, consider analyzing the median prices in addition to the mean, especially if the price distributions are skewed. Additionally, future analyses could explore the impact of specific months within seasons or compare seasonal trends across multiple years to identify consistent patterns.


For an independent two-sample t-test, assuming equal variances (which we did in our previous calculation), the t-statistic is calculated using the following formula:

$$ t = \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{\sqrt{s_p^2 \left(\frac{1}{n_1} + \frac{1}{n_2}\right)}} $$

Where:

*   $\bar{X}_1$: The sample mean of the first group (summer prices).
*   $\bar{X}_2$: The sample mean of the second group (winter prices).
*   $\mu_1 - \mu_2$: The hypothesized difference between the population means. For our null hypothesis ($H_0: \mu_{\text{summer}} = \mu_{\text{winter}}$), this value is 0.
*   $s_p^2$: The pooled variance of the two samples.
*   $n_1$: The sample size of the first group (summer prices).
*   $n_2$: The sample size of the second group (winter prices).

First, we calculate the pooled variance ($s_p^2$) using this formula:

$$ s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2} $$

Where:

*   $s_1^2$: The variance of the first group (summer prices).
*   $s_2^2$: The variance of the second group (winter prices).

Let's break down the values we already have from our previous steps for the 2016 data:

For Summer Prices (Group 1):

*   Sample Mean ($\bar{X}_1$): ~\$1,048,054.73
*   Sample Standard Deviation ($s_1$): ~\$621,493.44
*   Sample Size ($n_1$): 4036

For Winter Prices (Group 2):

*   Sample Mean ($\bar{X}_2$): ~\$1,116,647.59
*   Sample Standard Deviation ($s_2$): ~\$695,498.28
*   Sample Size ($n_2$): 2300

Steps for manual calculation:

1. Calculate the variance for summer prices ($s_1^2 = s_1 \times s_1$) and winter prices ($s_2^2 = s_2 \times s_2$).
2. Calculate the pooled variance ($s_p^2$) using the formula above with the sample sizes and variances.
3. Substitute all values ($\bar{X}_1$, $\bar{X}_2$, $s_p^2$, $n_1$, $n_2$) into the main t-statistic formula.

Performing these calculations would give us a t-statistic of approximately -4.0434, as obtained programmatically.
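
The manual steps can be sketched in a few lines of Python. This is only a check under stated assumptions: the inputs are the rounded summary statistics listed above rather than the raw price series, so the result should match the programmatic value only up to rounding, and the variable names are introduced here for the example.

```python
import math

# Rounded summary statistics from the earlier outputs
mean_1, std_1, n_1 = 1048054.73, 621493.44, 4036  # summer (group 1)
mean_2, std_2, n_2 = 1116647.59, 695498.28, 2300  # winter (group 2)

# Step 1: variances are the squared standard deviations
var_1 = std_1**2
var_2 = std_2**2

# Step 2: pooled variance, weighting each group by its degrees of freedom
pooled_var = ((n_1 - 1) * var_1 + (n_2 - 1) * var_2) / (n_1 + n_2 - 2)

# Step 3: t-statistic with a hypothesized difference of 0 under H0
t_pooled = (mean_1 - mean_2) / math.sqrt(pooled_var * (1 / n_1 + 1 / n_2))

print(f"Pooled variance: {pooled_var:,.2f}")
print(f"Manually calculated t-statistic: {t_pooled:.4f}")
```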

### 3. For the suburb of Abbotsford, what is the probability that out of 10 properties sold, 3 will not have a car parking space?

• Use the column car in the dataset.

• Round off your answer to 3 decimal places.

# Steps
The steps are:
1. Filtering the `df` DataFrame for properties in 'Abbotsford'.
2. Identifying the number of properties with and without a car parking space (where `Car` is 0 or NaN/missing).
3. Calculating the proportion of properties in Abbotsford that do not have a car parking space.
4. Using the binomial probability formula to determine the probability of exactly 3 out of 10 properties not having a car parking space, given the calculated proportion.

## Calculate Probability of No Car Parking Space in Abbotsford

Filter the DataFrame for properties in 'Abbotsford', handle missing values in the 'Car' column (assuming NaN means no car space), and calculate the probability that a property in Abbotsford does not have a car parking space. This will be our `p` value for the binomial distribution.

First, I will filter the original DataFrame `df` for entries where the `Suburb` is 'Abbotsford'. Then, I will inspect the 'Car' column for these properties. Since the question asks about properties *not having* a car parking space and some `Car` values might be `NaN`, I will treat `NaN` as no car space (0 car spaces). I will count properties where 'Car' is 0 or NaN and divide by the total number of properties in Abbotsford to get the probability `p`.

In [21]:
abbotsford_properties = df[df['Suburb'] == 'Abbotsford'].copy()

# Inspect the 'Car' column for unique values and their counts
# print(abbotsford_properties['Car'].value_counts(dropna=False))

# Count properties with 0 car spaces or NaN (assuming NaN means no car space)
no_car_space_count = abbotsford_properties[abbotsford_properties['Car'].fillna(0) == 0].shape[0]
total_abbotsford_properties = abbotsford_properties.shape[0]

# Calculate the probability 'p'
p_no_car_space = no_car_space_count / total_abbotsford_properties

print(f"Total properties in Abbotsford: {total_abbotsford_properties}")
print(f"Properties in Abbotsford with no car space: {no_car_space_count}")
print(f"Probability (p) of a property in Abbotsford having no car space: {p_no_car_space:.4f}")

Total properties in Abbotsford: 56
Properties in Abbotsford with no car space: 15
Probability (p) of a property in Abbotsford having no car space: 0.2679


## Apply Binomial Probability Formula

Now that we have the probability `p` of a single property in Abbotsford not having a car space, we can use the binomial probability mass function (PMF) to find the probability of exactly `k` successes (no car spaces) in `n` trials (properties sold). The binomial PMF is given by the formula: `P(X=k) = C(n, k) * p^k * (1-p)^(n-k)`, where `C(n, k)` is the binomial coefficient. The cell below uses `scipy.stats.binom.pmf`, which evaluates this formula directly.

In [22]:
from scipy.stats import binom

# Define the parameters for the binomial distribution
n = 10  # Number of trials (properties sold)
k = 3   # Number of successes (properties with no car space)

# The probability 'p' was calculated in the previous step

# Calculate the binomial probability
probability = binom.pmf(k, n, p_no_car_space)

print(f"The probability that exactly {k} out of {n} properties in Abbotsford will not have a car parking space is: {probability:.3f}")

The probability that exactly 3 out of 10 properties in Abbotsford will not have a car parking space is: 0.260
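
To make the formula `P(X=k) = C(n, k) * p^k * (1-p)^(n-k)` concrete, the sketch below evaluates it by hand with the standard library instead of SciPy. It assumes the counts reported above (15 of 56 Abbotsford properties without a car space), and `probability_manual` is a name introduced here for the example.

```python
import math

n = 10       # number of properties sold (trials)
k = 3        # number of properties with no car space (successes)
p = 15 / 56  # proportion with no car space, from the counts printed above

# Binomial coefficient C(n, k) times the probability of one specific arrangement
probability_manual = math.comb(n, k) * p**k * (1 - p)**(n - k)

print(f"Manual binomial probability: {probability_manual:.3f}")
```

The result should agree with `binom.pmf(k, n, p_no_car_space)` to three decimal places.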


### 4. In the suburb of Abbotsford, what are the chances of finding a property with 3 rooms? Round your answer to 3 decimal places.

### 5. In the suburb of Abbotsford, what are the chances of finding a property with 2 bathrooms? Round your answer to 3 decimal places.

## 4. Probability of Finding a Property with 3 Rooms in Abbotsford


To find the probability of a property having 3 rooms in Abbotsford, I will count the rows of the `abbotsford_properties` DataFrame where the 'Rooms' column equals 3 and divide this count by the total number of properties in Abbotsford. A missing 'Rooms' value would simply not match the filter, though this is typically a numeric, non-null field.

In [23]:
# Filter for properties with 3 rooms in Abbotsford
properties_with_3_rooms = abbotsford_properties[abbotsford_properties['Rooms'] == 3].shape[0]

# Calculate the probability
probability_3_rooms = properties_with_3_rooms / total_abbotsford_properties

print(f"Number of properties with 3 rooms in Abbotsford: {properties_with_3_rooms}")
print(f"Total properties in Abbotsford: {total_abbotsford_properties}")
print(f"Probability of finding a property with 3 rooms in Abbotsford: {probability_3_rooms:.3f}")

Number of properties with 3 rooms in Abbotsford: 20
Total properties in Abbotsford: 56
Probability of finding a property with 3 rooms in Abbotsford: 0.357


## 5. Probability of Finding a Property with 2 Bathrooms in Abbotsford

Calculate the probability of finding a property with exactly 2 bathrooms in the suburb of Abbotsford. Round the answer to 3 decimal places.

To find the probability of a property having 2 bathrooms in Abbotsford, I will filter the `abbotsford_properties` DataFrame to count properties where the 'Bathroom' column is equal to 2. It's important to consider `NaN` values in 'Bathroom' as they would not count towards having 2 bathrooms. Then, I will divide this count by the total number of properties in Abbotsford to get the probability.

In [24]:
# Filter for properties with 2 bathrooms in Abbotsford
properties_with_2_bathrooms = abbotsford_properties[abbotsford_properties['Bathroom'] == 2].shape[0]

# Calculate the probability
probability_2_bathrooms = properties_with_2_bathrooms / total_abbotsford_properties

print(f"Number of properties with 2 bathrooms in Abbotsford: {properties_with_2_bathrooms}")
print(f"Total properties in Abbotsford: {total_abbotsford_properties}")
print(f"Probability of finding a property with 2 bathrooms in Abbotsford: {probability_2_bathrooms:.3f}")

Number of properties with 2 bathrooms in Abbotsford: 19
Total properties in Abbotsford: 56
Probability of finding a property with 2 bathrooms in Abbotsford: 0.339
