# Correlation vs. Causation and Hypothesis Testing

In this lab, we will practice visualizing features from the Kaggle competition's data and perform a hypothesis test comparing the prices of different room types.

In [None]:
#Import Packages
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import math
import plotly.express as px

In [None]:
# Read CSV
df = pd.read_csv("synthetic_airbnb_data.csv")

In [None]:
df.head()

In [None]:
# Preprocessing

df = df[['NAME', 'host id', 'host_identity_verified', 'host name',
       'neighbourhood group', 'neighbourhood', 'lat', 'long', 'country',
       'country code', 'instant_bookable', 'cancellation_policy', 'room type',
       'Construction year', 'price', 'service fee', 'minimum nights',
       'number of reviews', 'last review']]

df.columns = df.columns.str.replace(' ','_').str.lower()

df.drop_duplicates(inplace=True)

df['price'] = df['price'].astype(str).str.strip(' $').str.replace(',','')
df['price'] = df['price'].astype(float)

df['service_fee'] = df['service_fee'].astype(str).str.strip(' $').str.replace(',','')
df['service_fee'] = df['service_fee'].astype(float)

convert_columns = ['construction_year', 'number_of_reviews', 'minimum_nights']
df[convert_columns] = df[convert_columns].astype('Int64')

df['last_review'] = pd.to_datetime(df['last_review'])

df['instant_bookable'] = df['instant_bookable'].apply(lambda x: 'Yes' if x is True else ('No' if x is False else np.nan))


df['neighbourhood_group'] = df['neighbourhood_group'].str.replace('manhatan','Manhattan').str.replace('brookln','Brooklyn')

df['host_identity_verified'] = df['host_identity_verified'].str.capitalize()
df['cancellation_policy'] = df['cancellation_policy'].str.capitalize()

df = df.dropna(subset = ['neighbourhood_group'])

In [None]:
df.head()

In [None]:
corr_matrix = df.corr(numeric_only=True)
corr_matrix

In [None]:
plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Correlation Matrix')
plt.show()
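
Each cell of the matrix above is a Pearson correlation coefficient, which is the default method for `df.corr`. As a minimal sketch of what that number measures, the coefficient can be computed by hand for two small made-up lists (the function name `pearson_r` and the sample values are illustrative and are not part of the lab data):

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the means (unnormalized covariance)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (unnormalized variances)
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up values for illustration only
nights = [1, 2, 3, 4, 5]
fees = [10, 20, 30, 40, 50]  # rises in lockstep with nights
print(pearson_r(nights, fees))        # perfectly increasing relationship
print(pearson_r(nights, fees[::-1]))  # perfectly decreasing relationship
```

A value near +1 or -1 indicates a strong linear relationship between two features, but by itself it does not show that one feature causes the other.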

In [None]:
subset = df[df['neighbourhood_group'] == 'Manhattan']

In [None]:

sns.boxplot(x='room_type', y='price', data=subset)

## Plotly Demo

In [None]:
fig = px.box(subset, x = "room_type", y = "price")
fig.show()

In [None]:
fig = px.scatter(df, x="number_of_reviews", y="price", facet_col="neighbourhood_group", facet_row="room_type")
fig.show()

## Hypothesis Testing

For this lab, we will test whether the mean price of an entire home or apartment differs from the mean price of a hotel room.

State the hypotheses:

Let $\mu_1$ be the mean price of entire homes or apartments and $\mu_2$ be the mean price of hotel rooms:

$$H_0: \mu_1 = \mu_2$$
$$H_a: \mu_1 \neq \mu_2$$

Let's use a significance level of $\alpha = 0.05$.

Find the mean, standard deviation, and sample size for both subsets of the data:

In [None]:
house = df[df['room_type'] == 'Entire home/apt']
hotel = df[df['room_type'] == 'Hotel room']

house_mean = house['price'].mean()
hotel_mean = hotel['price'].mean()

house_std = house['price'].std()
hotel_std = hotel['price'].std()

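# Count only non-missing prices so n matches the values used by mean() and std()
house_n = house['price'].count()
hotel_n = hotel['price'].count()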

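Calculate the test statistic. Because the population standard deviations are unknown, the sample standard deviations $s_1$ and $s_2$ stand in for them, which is reasonable when both samples are large:
$$z = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$$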

In [None]:
z = (house_mean - hotel_mean) / math.sqrt(house_std**2 / house_n + hotel_std**2 / hotel_n)
z
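
The same calculation can be wrapped in a small reusable function. The sketch below assumes two plain lists of prices with no missing values. The name `two_sample_z` and the sample numbers are hypothetical, chosen only so the arithmetic is easy to follow by hand:

```python
import math
import statistics

def two_sample_z(sample_1, sample_2):
    """Two-sample z statistic using sample variances in place of population variances."""
    mean_1 = statistics.mean(sample_1)
    mean_2 = statistics.mean(sample_2)
    # statistics.variance uses the n - 1 denominator, like pandas' default .std()
    var_1 = statistics.variance(sample_1)
    var_2 = statistics.variance(sample_2)
    standard_error = math.sqrt(var_1 / len(sample_1) + var_2 / len(sample_2))
    return (mean_1 - mean_2) / standard_error

# Hypothetical prices: means 3 and 4, sample variance 2.5 in each group
z_demo = two_sample_z([1, 2, 3, 4, 5], [2, 3, 4, 5, 6])
print(z_demo)
```

Here the standard error is $\sqrt{2.5/5 + 2.5/5} = 1$, so the statistic is simply the difference in means.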

Looking up the z statistic in a standard normal table, we find that $p \approx 0.00$ because the statistic lies so far from zero. For a two-sided test, it is the magnitude of z that matters, not its sign. Since this p-value is below our significance level of 0.05, we reject the null hypothesis and conclude that the mean price of an entire home or apartment differs significantly from the mean price of a hotel room.
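
Instead of reading a table, the two-sided p-value can also be computed directly from the z statistic. This is a minimal sketch using only the standard library, based on the identity $p = 2\,(1 - \Phi(|z|)) = \operatorname{erfc}(|z| / \sqrt{2})$. The function name `two_sided_p_value` and the example z values are illustrative:

```python
import math

def two_sided_p_value(z):
    """Two-sided p-value for a z statistic under the standard normal distribution."""
    # erfc(|z| / sqrt(2)) equals 2 * (1 - Phi(|z|)), the area in both tails
    return math.erfc(abs(z) / math.sqrt(2))

alpha = 0.05
for z_value in (0.5, 1.96, 5.0):  # illustrative z statistics
    p = two_sided_p_value(z_value)
    decision = "reject H0" if p < alpha else "fail to reject H0"
    print(z_value, p, decision)
```

Passing the notebook's own `z` to this function would give the p-value for the room-type comparison without any table lookup.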