<img width="200" src="https://raw.githubusercontent.com/lukwies/mid-bootcamp-project/main/data/img/bikes.png">                     
                     
# Bikesharing in Seoul / Hypothesis Testing

---

### Sources

 * Data: https://archive.ics.uci.edu/ml/datasets/Seoul+Bike+Sharing+Demand
 * Image: https://global.chinadaily.com.cn/a/201801/25/WS5a69cab3a3106e7dcc136a6d.html

---

### Hypotheses
 1. The average rental amount differs from 400 bikes/hour.
 2. The rental amount is higher on holidays.
 3. The average rental amount is lower in cold weather (< 10°C).
 4. The average rental amount is higher during the day (08:00-19:00) than at night.

In [1]:
import pandas as pd
import numpy as np
import scipy.stats as st
import yaml

#### Load YAML config file

In [2]:
with open('../params.yaml') as file:
    config = yaml.safe_load(file)

#### Read cleaned data to pandas

In [3]:
df = pd.read_csv(config['data']['csv_cleaned'])
df.head()

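| Unnamed: 0 | date | month | hour | daytime | seasons | holiday | functioning_day | temperature | temperature_type | humidity | wind_speed | visibility | solar_radiation | rainfall | snowfall | rented_bike_count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2017-12-01 | 12 | 0 | Night | Winter | No | Yes | -5.2 | Frost | 37 | 2.2 | 2000 | 0.0 | 0.0 | 0.0 | 254 |
| 1 | 2017-12-01 | 12 | 1 | Night | Winter | No | Yes | -5.5 | Frost | 38 | 0.8 | 2000 | 0.0 | 0.0 | 0.0 | 204 |
| 2 | 2017-12-01 | 12 | 2 | Night | Winter | No | Yes | -6.0 | Frost | 39 | 1.0 | 2000 | 0.0 | 0.0 | 0.0 | 173 |
| 3 | 2017-12-01 | 12 | 3 | Night | Winter | No | Yes | -6.2 | Frost | 40 | 0.9 | 2000 | 0.0 | 0.0 | 0.0 | 107 |
| 4 | 2017-12-01 | 12 | 4 | Morning | Winter | No | Yes | -6.0 | Frost | 36 | 2.3 | 2000 | 0.0 | 0.0 | 0.0 | 78 |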


#### For all tests, we use a significance level of 5% (i.e. a 95% confidence level).

In [4]:
alpha = 0.05
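
As a minimal sketch of the decision rule applied in the tests below: given a p-value, H0 is rejected only when the p-value falls below `alpha`. The helper name `decide` is hypothetical and does not appear in the original notebook.

```python
def decide(pval: float, alpha: float = 0.05) -> str:
    """Return the test decision for a p-value at significance level alpha.

    `decide` is a hypothetical helper, not part of the original notebook.
    """
    # Reject H0 only when the p-value is smaller than the significance level.
    return "Reject H0" if pval < alpha else "Fail to reject H0"


print(decide(0.03))  # p-value below alpha
print(decide(0.40))  # p-value above alpha
```

Comparing against `alpha` (not `1 - alpha`) is what keeps the false-rejection rate at 5%.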

<br><br>
### Hypothesis: The average rental amount differs from **400 bikes/hour**.

 + **H0**: sample_mean == 400
 + **H1**: sample_mean != 400

So let's perform a **two-sided** t-test at the **95%** confidence level.

In [5]:
sample = df['rented_bike_count']

In [6]:
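sample_mean = np.mean(sample)

# t-statistic for H0: mean == 400, using the sample standard deviation (ddof=1)
t_statistic    = (sample_mean - 400) / (np.std(sample,ddof=1) / np.sqrt(len(sample)))
# Critical values of the two-sided test at significance level alpha
lower_critical = st.t.ppf((alpha/2), df=len(sample)-1)
upper_critical = st.t.ppf(1-(alpha/2), df=len(sample)-1)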

In [7]:
print(f"Lower critical: {lower_critical}")
print(f"Statistic:      {t_statistic}")
print(f"Upper critical: {upper_critical}")

Lower critical: -1.9602348594690915
Statistic:      44.20046845157618
Upper critical: 1.960234859469091
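
The manual statistic can be cross-checked against `scipy.stats.ttest_1samp`, which also returns a p-value. The sketch below uses synthetic data as a stand-in for `df['rented_bike_count']` (the array `demo_sample` and its values are assumptions made for illustration only), so its numbers will not match the output above.

```python
import numpy as np
import scipy.stats as st

# Synthetic stand-in for df['rented_bike_count']; the values are illustrative only.
rng = np.random.default_rng(0)
demo_sample = rng.poisson(lam=700, size=500)

popmean = 400
alpha = 0.05

# Manual t-statistic, same formula as in the notebook cell.
manual_t = (np.mean(demo_sample) - popmean) / (
    np.std(demo_sample, ddof=1) / np.sqrt(len(demo_sample))
)

# SciPy's one-sample t-test (two-sided by default).
scipy_t, pval = st.ttest_1samp(demo_sample, popmean=popmean)

print(f"manual t: {manual_t}")
print(f"scipy t:  {scipy_t}")
print(f"p-value:  {pval}")
print("Reject H0" if pval < alpha else "Fail to reject H0")
```

Both routes should agree on the statistic; the p-value route avoids computing critical values by hand.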


In [8]:
print("Fail to reject H0" if lower_critical < t_statistic < upper_critical else "Reject H0")

Reject H0


#### Since the t-statistic (44.20) lies far above the upper critical value (1.96), we reject H0, so the average rental amount differs from 400 bikes/hour.
<br><br>

### Hypothesis: We have a higher rental amount on holidays.

We found that the mean number of rented bikes outside of holidays is 700.
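
The notebook does not show how the reference value of 700 was obtained. One plausible way, sketched here on a tiny made-up frame that only borrows the column names `holiday` and `rented_bike_count`, is to average the counts over all non-holiday rows. The frame `demo` and its values are assumptions for illustration.

```python
import pandas as pd

# Tiny illustrative frame with the same column names as the cleaned data;
# the values are made up for demonstration.
demo = pd.DataFrame({
    "holiday": ["No", "No", "Yes", "No", "Yes"],
    "rented_bike_count": [700, 800, 300, 600, 500],
})

# Mean rentals per hourly row on non-holidays.
non_holiday_mean = demo.loc[demo["holiday"] == "No", "rented_bike_count"].mean()
print(non_holiday_mean)
```

On the real data, the same expression applied to `df` would give the non-holiday mean to use as `popmean`.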

 + **H0**: sample_mean <= 700
 + **H1**: sample_mean >  700

Using a **right-sided** test at the **95%** confidence level.

In [9]:
sample = df[df['holiday']=='Yes']['rented_bike_count']

In [10]:
stat, pval = st.ttest_1samp(sample, popmean=700, alternative="greater")

print(f"statistic: {stat}")
print(f"p-value:   {pval}")

statistic: -7.291822564710907
p-value:   0.9999999999992621


In [11]:
print("Reject H0" if pval < alpha else "Fail to reject H0")

Fail to reject H0


#### Since the p-value is greater than our significance level, we fail to reject H0, so we cannot conclude that the average hourly rental amount on holidays exceeds 700 bikes.
<br><br>

### Hypothesis: The average rental amount is lower in cold weather (< 10°C)

Let's assume the mean number of rented bikes is 900 when the temperature is 10°C or higher.

 + **H0**: sample_mean >= 900
 + **H1**: sample_mean <  900

Using a **left-sided** test at the **95%** confidence level.

In [12]:
sample = df[df['temperature'] < 10]['rented_bike_count']

In [13]:
stat, pval = st.ttest_1samp(sample, popmean=900, alternative="less")

print(f"statistic: {stat}")
print(f"p-value:   {pval}")

statistic: -107.8023190503751
p-value:   0.0


In [14]:
print("Reject H0" if pval < alpha else "Fail to reject H0")

Reject H0


#### Since the p-value is less than our significance level, we reject H0, so the average rental amount is lower in cold weather.
<br><br>

###  Hypothesis: The average rental amount is higher during the day (08:00 - 19:00) than at night.

We assume the average nightly rental amount (19:00 - 08:00) to be 530.

 + **H0**: sample_mean <= 530
 + **H1**: sample_mean >  530

Using a **right-sided** test at the **95%** confidence level.

In [15]:
sample = df[(df['hour'] >= 8) & (df['hour'] <= 19)]['rented_bike_count']

In [16]:
stat, pval = st.ttest_1samp(sample, popmean=530, alternative="greater")

print(f"statistic: {stat}")
print(f"p-value:   {pval}")

statistic: 34.25095231042117
p-value:   2.5933691345497692e-228


In [17]:
print("Reject H0" if pval < alpha else "Fail to reject H0")

Reject H0


#### Since the p-value is less than our significance level, we reject H0, so the average rental amount is higher during the day.
<br><br>
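
The tests above compare one group against a fixed reference value. An alternative that is not used in this notebook is to compare the two groups directly with Welch's two-sample t-test, which accounts for the sampling variability of both means. The sketch below applies it to day versus night on synthetic data; the frame `demo`, the Poisson rates, and all resulting numbers are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
import scipy.stats as st

# Synthetic hourly data with the notebook's column names; values are illustrative only.
rng = np.random.default_rng(42)
hours = np.tile(np.arange(24), 50)
is_day = (hours >= 8) & (hours <= 19)
counts = rng.poisson(lam=np.where(is_day, 900, 500))
demo = pd.DataFrame({"hour": hours, "rented_bike_count": counts})

day = demo.loc[is_day, "rented_bike_count"]
night = demo.loc[~is_day, "rented_bike_count"]

# Welch's t-test (equal_var=False) with H1: mean(day) > mean(night).
stat, pval = st.ttest_ind(day, night, equal_var=False, alternative="greater")

alpha = 0.05
print(f"statistic: {stat}")
print(f"p-value:   {pval}")
print("Reject H0" if pval < alpha else "Fail to reject H0")
```

The same pattern would apply to the holiday and temperature hypotheses by swapping the boolean mask.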