In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from datetime import datetime 
import seaborn as sb
import scipy.stats as stat
%matplotlib inline
import random

In [2]:
random.seed(123)
bikeshare_full = pd.read_csv('less_bikeshare_lat_lon.csv', 
                        parse_dates=['Start date', 'End date', 'start_date_short', 'end_date_short'])
bikeshare_full.drop(columns='Unnamed: 0', inplace=True)  # drop the leftover CSV index column
print(bikeshare_full.shape)

bikeshare = bikeshare_full.sample(100000)
print(bikeshare.shape)

bikesharing = pd.read_csv('bikeshare_full.csv', 
                        parse_dates=['Start date', 'End date', 'start_date_short', 'end_date_short'])

(1219847, 24)
(100000, 24)
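
A note on reproducibility: `DataFrame.sample` draws from NumPy's random state rather than Python's `random` module, so `random.seed(123)` does not by itself fix which rows end up in the 100,000-row sample. A minimal sketch of pinning the draw with the `random_state` argument, using a small made-up frame (`toy` is a hypothetical stand-in, not the bikeshare data):

```python
import pandas as pd

# Hypothetical stand-in for the bikeshare frame; only the sampling call matters here.
toy = pd.DataFrame({'time_diff': range(1000)})

# Passing random_state makes the draw repeatable across runs and sessions.
first = toy.sample(100, random_state=123)
second = toy.sample(100, random_state=123)

print(first.index.equals(second.index))
```

Applying this to the notebook would change which rows are sampled, so the numbers printed below would shift slightly.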


## Registered Riders vs Casual Riders

The full dataset with outliers is used to show the difference between registered riders and casual riders.

In [3]:
casual_full = bikesharing[bikesharing['Member Type'] == 'Casual']['time_diff']
registered_full = bikesharing[bikesharing['Member Type'] == 'Registered']['time_diff']
casual_full_mean = np.mean(casual_full)
registered_full_mean = np.mean(registered_full)
difference_full = casual_full_mean - registered_full_mean
print('The difference in mean ride times between casual riders and registered riders is ' + 
      str(round(difference_full, 2)) + ' minutes.')

The difference in mean ride times between casual riders and registered riders is 34.46 minutes.


The difference in ride times with the outliers included is very large. We'll need a hypothesis test to determine whether this difference is due to random chance or whether the two groups actually differ.
<br>
**Null Hypothesis**: The mean ride times for Registered Riders and Casual Riders are equal.
<br>
**Alternative Hypothesis**: The mean ride times for Registered Riders and Casual Riders are different.
<br>
**Significance**: 0.05

In [4]:
time_test_full = stat.ttest_ind(casual_full, registered_full, equal_var=False)
p_val_full = time_test_full[1]
print(p_val_full)

0.0


With this result (p-value < 0.00001), we can reject the null hypothesis that the mean ride times of registered riders and casual riders are equal. We can argue that the means are different and that the type of rider might affect the ride time. Based on the 95% confidence interval, the mean ride time for casual riders is between 33.95 minutes and 34.97 minutes longer than for registered riders.

In [5]:
casual_std_full = casual_full.std()
registered_std_full = registered_full.std()
first_full = (casual_std_full ** 2) / len(casual_full)
second_full = (registered_std_full ** 2) / len(registered_full)
se_full = np.sqrt(first_full + second_full)
reg_cas_95ci_full = stat.t.interval(0.95, len(casual_full) - 1, loc=difference_full, scale=se_full)
print('The 95% confidence interval is ' + str(reg_cas_95ci_full) + ' minutes.')

The 95% confidence interval is (33.953965897804757, 34.973307613774885) minutes.
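
The interval above uses `len(casual_full) - 1` degrees of freedom. Because the t-test is run with `equal_var=False` (Welch's test), a matching interval would normally use the Welch-Satterthwaite degrees of freedom instead. With samples this large the two choices should give nearly the same interval. A sketch of that calculation on synthetic data (the function name `welch_ci` and the demo arrays are assumptions for illustration, not part of the original analysis):

```python
import numpy as np
from scipy import stats

def welch_ci(a, b, confidence=0.95):
    """Confidence interval for mean(a) - mean(b) without assuming equal variances."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    var_a = a.var(ddof=1) / len(a)   # squared standard error of mean(a)
    var_b = b.var(ddof=1) / len(b)   # squared standard error of mean(b)
    se = np.sqrt(var_a + var_b)
    # Welch-Satterthwaite approximation to the degrees of freedom
    dof = (var_a + var_b) ** 2 / (var_a ** 2 / (len(a) - 1) + var_b ** 2 / (len(b) - 1))
    diff = a.mean() - b.mean()
    low, high = stats.t.interval(confidence, dof, loc=diff, scale=se)
    return diff, dof, (low, high)

# Synthetic ride times in minutes (made-up numbers, not the bikeshare data)
rng = np.random.default_rng(0)
casual_demo = rng.normal(40, 30, size=500)
registered_demo = rng.normal(12, 8, size=2000)

diff, dof, (low, high) = welch_ci(casual_demo, registered_demo)
print(round(diff, 2), round(dof, 1), (round(low, 2), round(high, 2)))
```

In the notebook, the same function could be called as `welch_ci(casual_full, registered_full)`.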


The difference between the means was then tested for the dataset without the outliers. 

In [6]:
casual = bikeshare[bikeshare['Member Type'] == 'Casual']['time_diff']
registered = bikeshare[bikeshare['Member Type'] == 'Registered']['time_diff']
casual_mean = np.mean(casual)
registered_mean = np.mean(registered)
difference = casual_mean - registered_mean
print('The difference in mean ride times between casual riders and registered riders is ' + 
      str(round(difference, 2)) + ' minutes.')

The difference in mean ride times between casual riders and registered riders is 14.13 minutes.


The difference between the means is 14.13 minutes. Knowing that, we'll need a hypothesis test to determine whether this difference is due to random chance or whether the two groups actually differ.
<br>
**Null Hypothesis**: The mean ride times for Registered Riders and Casual Riders are equal.
<br>
**Alternative Hypothesis**: The mean ride times for Registered Riders and Casual Riders are different.
<br>
**Significance**: 0.05

In [7]:
time_test = stat.ttest_ind(casual, registered, equal_var=False)
p_val = time_test[1]
print(p_val)

0.0


With this result (p-value < 0.00001), we can reject the null hypothesis that the mean ride times of registered riders and casual riders are equal. We can argue that the means are different and that the type of rider might affect the ride time. Based on the 95% confidence interval, the mean ride time for casual riders is between 13.85 minutes and 14.41 minutes longer than for registered riders.

In [8]:
casual_std = casual.std()
registered_std = registered.std()
first = (casual_std ** 2) / len(casual)
second = (registered_std ** 2) / len(registered)
se = np.sqrt(first + second)
reg_cas_95ci = stat.t.interval(0.95, len(casual) - 1, loc=difference, scale=se)
print('The 95% confidence interval is ' + str(reg_cas_95ci) + ' minutes.')

The 95% confidence interval is (13.848473672919683, 14.412654359796294) minutes.


## Seasonal Ride Times
The next question to ask is whether the season affects the ride time. Comparing the mean ride times for the four seasons shows differences of up to roughly 2.6 minutes.

In [9]:
spring = bikeshare[bikeshare['season'] == 1]['time_diff']
summer = bikeshare[bikeshare['season'] == 2]['time_diff']
fall = bikeshare[bikeshare['season'] == 3]['time_diff']
winter = bikeshare[bikeshare['season'] == 4]['time_diff']

spring_mean = np.mean(spring)
summer_mean = np.mean(summer)
fall_mean = np.mean(fall)
winter_mean = np.mean(winter)

print('The mean ride time for the spring is ' + str(round(spring_mean, 2)) + ' minutes.')
print('The mean ride time for the summer is ' + str(round(summer_mean, 2)) + ' minutes.')
print('The mean ride time for the fall is ' + str(round(fall_mean, 2)) + ' minutes.')
print('The mean ride time for the winter is ' + str(round(winter_mean, 2)) + ' minutes.')

The mean ride time for the spring is 12.77 minutes.
The mean ride time for the summer is 15.41 minutes.
The mean ride time for the fall is 15.17 minutes.
The mean ride time for the winter is 13.53 minutes.


A hypothesis test using Analysis of Variance (ANOVA) is an appropriate way to determine whether these differences arose by chance or whether the means actually differ.
<br>
**Null Hypothesis**: The mean ride times for the 4 seasons are all equal.
<br>
**Alternative Hypothesis**: At least one season's mean ride time differs from the others.
<br>
**Significance**: 0.05

In [10]:
season = stat.f_oneway(spring, summer, fall, winter)
p_val_season = season[1]
p_val_season

1.1970832546529205e-205
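
For reference, the F statistic behind a one-way ANOVA is the ratio of the between-group mean square to the within-group mean square. The sketch below computes it by hand on synthetic data and compares the result with `scipy.stats.f_oneway`. The helper `one_way_f` and the demo groups are assumptions for illustration; the group means are only loosely modeled on the seasonal means above.

```python
import numpy as np
from scipy import stats

def one_way_f(*groups):
    """F statistic and p-value for one-way ANOVA, computed from sums of squares."""
    groups = [np.asarray(g, dtype=float) for g in groups]
    n_total = sum(len(g) for g in groups)
    k = len(groups)
    grand_mean = np.concatenate(groups).mean()
    # Variation of the group means around the grand mean
    ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
    # Variation of the observations around their own group mean
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    f_stat = ms_between / ms_within
    p_value = stats.f.sf(f_stat, k - 1, n_total - k)
    return f_stat, p_value

# Synthetic ride times (made-up data, not the bikeshare sample)
rng = np.random.default_rng(1)
demo_groups = [rng.normal(mu, 5, size=200) for mu in (12.8, 15.4, 15.2, 13.5)]

f_manual, p_manual = one_way_f(*demo_groups)
f_scipy, p_scipy = stats.f_oneway(*demo_groups)
print(np.isclose(f_manual, f_scipy), np.isclose(p_manual, p_scipy))
```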

The p-value for this ANOVA is very low (less than 0.00000001), so we can reject the null hypothesis and conclude that the mean ride times are not all equal across the 4 seasons. To determine which of the means differ, pairwise t-tests are needed. Because multiple t-tests are run, the usual significance level of 0.05 needs to be lowered to keep the overall chance of a false positive near 0.05 (a Bonferroni correction). With 4 seasons there are 6 pairwise comparisons, so the new significance level for these tests is 0.05 / 6 ≈ 0.00833.
<br>
<br>
**Null Hypothesis (For each test)**: The mean ride times for the 2 seasons are equal.
<br>
**Alternative Hypothesis (For each test)**: The mean ride times for the 2 seasons are different.
<br>
**Significance**: alpha / m = 0.00833, where m = k(k-1) / 2 = 6 is the number of pairwise comparisons among the k = 4 groups

In [37]:
spring_summer = stat.ttest_ind(spring, summer, equal_var=False)
spring_summer_p = spring_summer[1]

spring_fall = stat.ttest_ind(spring, fall, equal_var=False)
spring_fall_p = spring_fall[1]

spring_winter = stat.ttest_ind(spring, winter, equal_var=False)
spring_winter_p = spring_winter[1]

summer_fall = stat.ttest_ind(summer, fall, equal_var=False)
summer_fall_p = summer_fall[1]

summer_winter = stat.ttest_ind(summer, winter, equal_var=False)
summer_winter_p = summer_winter[1]

fall_winter = stat.ttest_ind(fall, winter, equal_var=False)
fall_winter_p = fall_winter[1]

print('The p-value for the difference between the ride times between spring and summer is ' + str(spring_summer_p) + '.')
print('The p-value for the difference between the ride times between spring and fall is ' + str(spring_fall_p) + '.')
print('The p-value for the difference between the ride times between spring and winter is ' + str(spring_winter_p) + '.')
print('The p-value for the difference between the ride times between summer and fall is ' + str(summer_fall_p) + '.')
print('The p-value for the difference between the ride times between summer and winter is ' + str(summer_winter_p) + '.')
print('The p-value for the difference between the ride times between fall and winter is ' + str(fall_winter_p) + '.')

The p-value for the difference between the ride times between spring and summer is 1.07209500254e-150.
The p-value for the difference between the ride times between spring and fall is 5.06978247663e-130.
The p-value for the difference between the ride times between spring and winter is 1.98597115087e-14.
The p-value for the difference between the ride times between summer and fall is 0.0249333191399.
The p-value for the difference between the ride times between summer and winter is 7.26744021469e-65.
The p-value for the difference between the ride times between fall and winter is 2.79848814524e-51.


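In five of the six pairwise tests, the p-values are far below the corrected significance level of 0.00833, so we can reject the null hypothesis for those pairs of seasons. The exception is summer versus fall: its p-value of 0.0249 is above 0.00833, so we cannot reject the null hypothesis that summer and fall have the same mean ride time.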
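
Comparing each reported p-value against the corrected threshold can be sketched as follows. The p-values are copied (rounded) from the output above; the variable names are chosen here for illustration.

```python
from itertools import combinations

alpha = 0.05
seasons = ['spring', 'summer', 'fall', 'winter']
pairs = list(combinations(seasons, 2))   # every unordered pair of seasons
threshold = alpha / len(pairs)           # Bonferroni-corrected level: 0.05 / 6

# p-values copied from the printed output above (rounded)
reported_p = {
    ('spring', 'summer'): 1.07e-150,
    ('spring', 'fall'): 5.07e-130,
    ('spring', 'winter'): 1.99e-14,
    ('summer', 'fall'): 0.0249,
    ('summer', 'winter'): 7.27e-65,
    ('fall', 'winter'): 2.80e-51,
}

for pair in pairs:
    decision = 'reject' if reported_p[pair] < threshold else 'fail to reject'
    print(pair, decision)
```

Only the summer and fall pair ends up above the threshold.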

## Seasonal Ride Times: Casual vs Registered
In the tests above, we found that the mean ride time differs by season. What happens when we factor in both season and member type (casual vs. registered)? Are the ride times different within each season when broken down by member type?

In [12]:
casual_spring = bikeshare[(bikeshare['Member Type'] == 'Casual') & (bikeshare['season'] == 1)]['time_diff']
casual_summer = bikeshare[(bikeshare['Member Type'] == 'Casual') & (bikeshare['season'] == 2)]['time_diff']
casual_fall = bikeshare[(bikeshare['Member Type'] == 'Casual') & (bikeshare['season'] == 3)]['time_diff']
casual_winter = bikeshare[(bikeshare['Member Type'] == 'Casual') & (bikeshare['season'] == 4)]['time_diff']
registered_spring = bikeshare[(bikeshare['Member Type'] == 'Registered') & (bikeshare['season'] == 1)]['time_diff']
registered_summer = bikeshare[(bikeshare['Member Type'] == 'Registered') & (bikeshare['season'] == 2)]['time_diff']
registered_fall = bikeshare[(bikeshare['Member Type'] == 'Registered') & (bikeshare['season'] == 3)]['time_diff']
registered_winter = bikeshare[(bikeshare['Member Type'] == 'Registered') & (bikeshare['season'] == 4)]['time_diff']

mean_casual_spring = np.mean(casual_spring)
mean_reg_spring = np.mean(registered_spring)
print('The difference in ride times in the spring for registered and casual riders is ' + 
      str(round(mean_casual_spring - mean_reg_spring, 2)) + ' minutes.')

mean_casual_summer = np.mean(casual_summer)
mean_reg_summer = np.mean(registered_summer)
print('The difference in ride times in the summer for registered and casual riders is ' + 
      str(round(mean_casual_summer - mean_reg_summer, 2)) + ' minutes.')

mean_casual_fall = np.mean(casual_fall)
mean_reg_fall = np.mean(registered_fall)
print('The difference in ride times in the fall for registered and casual riders is ' + 
      str(round(mean_casual_fall - mean_reg_fall, 2)) + ' minutes.')

mean_casual_winter = np.mean(casual_winter)
mean_reg_winter = np.mean(registered_winter)
print('The difference in ride times in the winter for registered and casual riders is ' + 
      str(round(mean_casual_winter - mean_reg_winter, 2)) + ' minutes.')

The difference in ride times in the spring for registered and casual riders is 14.92 minutes.
The difference in ride times in the summer for registered and casual riders is 14.47 minutes.
The difference in ride times in the fall for registered and casual riders is 13.0 minutes.
The difference in ride times in the winter for registered and casual riders is 13.68 minutes.


Within each season, the mean ride time for casual riders is 13 to 15 minutes higher than for registered riders, which is about the same as the overall difference between the two groups. However, looking at these differences alone does not show that they are significant. A hypothesis test, using a t-test, is needed to check that these differences are not just due to chance. 
<br>
<br>
**Null Hypothesis (For each test)**: The mean ride times for casual and registered riders within the same season are equal.
<br>
**Alternative Hypothesis (For each test)**: The mean ride times for casual and registered riders within the same season are different.
<br>
**Significance**: 0.05

In [13]:
reg_cas_spring = stat.ttest_ind(casual_spring, registered_spring, equal_var=False)
reg_cas_spring_p = reg_cas_spring[1]

reg_cas_summer = stat.ttest_ind(casual_summer, registered_summer, equal_var=False)
reg_cas_summer_p = reg_cas_summer[1]

reg_cas_fall = stat.ttest_ind(casual_fall, registered_fall, equal_var=False)
reg_cas_fall_p = reg_cas_fall[1]

reg_cas_winter = stat.ttest_ind(casual_winter, registered_winter, equal_var=False)
reg_cas_winter_p = reg_cas_winter[1]

print('The p-value for the t-test of mean ride times between registered and casual riders in spring is ' + 
      str(reg_cas_spring_p))
print('The p-value for the t-test of mean ride times between registered and casual riders in summer is ' + 
      str(reg_cas_summer_p))
print('The p-value for the t-test of mean ride times between registered and casual riders in fall is ' + 
      str(reg_cas_fall_p))
print('The p-value for the t-test of mean ride times between registered and casual riders in winter is ' + 
      str(reg_cas_winter_p))

The p-value for the t-test of mean ride times between registered and casual riders in spring is 0.0
The p-value for the t-test of mean ride times between registered and casual riders in summer is 0.0
The p-value for the t-test of mean ride times between registered and casual riders in fall is 0.0
The p-value for the t-test of mean ride times between registered and casual riders in winter is 1.65012382537e-291


The p-values for the tests were all roughly 0.0 so we can reject the null hypotheses that the means were equal. We can conclude that the means between registered riders and casual riders are different in each of the 4 seasons.
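
The same casual-versus-registered comparison is repeated for each grouping variable in the sections that follow, so it can be written once with `groupby`. This is a sketch on synthetic data; the helper `casual_vs_registered` and the frame `demo` are hypothetical, and only the column names (`season`, `Member Type`, `time_diff`) are taken from the notebook.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic stand-in with the same column names as the notebook's frame
rng = np.random.default_rng(2)
n = 4000
demo = pd.DataFrame({
    'season': rng.integers(1, 5, size=n),
    'Member Type': rng.choice(['Casual', 'Registered'], size=n),
})
# Made-up ride times: casual rides are longer on average
demo['time_diff'] = np.where(demo['Member Type'] == 'Casual',
                             rng.normal(26, 12, size=n),
                             rng.normal(12, 6, size=n))

def casual_vs_registered(frame, by):
    """Welch t-test p-value for casual vs registered ride times within each level of `by`."""
    p_values = {}
    for level, group in frame.groupby(by):
        casual = group.loc[group['Member Type'] == 'Casual', 'time_diff']
        registered = group.loc[group['Member Type'] == 'Registered', 'time_diff']
        p_values[level] = stats.ttest_ind(casual, registered, equal_var=False)[1]
    return p_values

season_p = casual_vs_registered(demo, 'season')
print(season_p)
```

Passing `'weathersit'`, `'holiday'`, or `'workingday'` as `by` on the real frame would cover the remaining breakdowns.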

## Weather Category
The next tests done were to determine if there were differences in mean ride times based on the type of weather. Categories for the weather are sunny, cloudy/misty, and rainy/stormy. 

In [14]:
sunny = bikeshare[bikeshare['weathersit'] == 1]['time_diff']
less_sunny = bikeshare[bikeshare['weathersit'] == 2]['time_diff']
lousy = bikeshare[bikeshare['weathersit'] == 3]['time_diff']

sunny_mean = np.mean(sunny)
less_sunny_mean = np.mean(less_sunny)
lousy_mean = np.mean(lousy)

print('The mean ride time for sunny weather is ' + str(round(sunny_mean, 2)) + ' minutes.')
print('The mean ride time for cloudy/misty weather is ' + str(round(less_sunny_mean, 2)) + ' minutes.')
print('The mean ride time for rainy/stormy weather is ' + str(round(lousy_mean, 2)) + ' minutes.')

The mean ride time for sunny weather is 14.48 minutes.
The mean ride time for cloudy/misty weather is 13.81 minutes.
The mean ride time for rainy/stormy weather is 11.32 minutes.


The differences in means across the weather categories range from about 0.7 to 3.2 minutes, which is roughly the same size as the differences between the seasonal means. A hypothesis test, using ANOVA, will be conducted to determine whether these are actual differences or whether they happened by chance.
<br>
<br>
**Null Hypothesis**: The mean ride times for the 3 weather categories are all equal.
<br>
**Alternative Hypothesis**: At least one weather category's mean ride time differs from the others.
<br>
**Significance**: 0.05

In [15]:
weather_anova = stat.f_oneway(sunny, less_sunny, lousy)
weather_anova_p = weather_anova[1]
print('The p-value for the ANOVA test of whether mean ride times differ across the 3 weather categories is ' +
      str(weather_anova_p))

The p-value for the ANOVA test of whether mean ride times differ across the 3 weather categories is 3.02713835484e-40


The p-value in this scenario is roughly equal to 0.0, so we can reject the null hypothesis that there are no differences between the weather categories. Additional pairwise t-tests are needed to determine which means are different.
<br>
<br>
**Null Hypothesis (For each test)**: The mean ride times for the 2 weather categories are equal.
<br>
**Alternative Hypothesis (For each test)**: The mean ride times for the 2 weather categories are different.
<br>
**Significance**: 0.016667 (alpha / m, where m = k(k-1) / 2 = 3 is the number of pairwise comparisons among the k = 3 groups)

In [16]:
sunny_less = stat.ttest_ind(sunny, less_sunny, equal_var=False)
sunny_less_p = sunny_less[1]

sunny_lousy = stat.ttest_ind(sunny, lousy, equal_var=False)
sunny_lousy_p = sunny_lousy[1]

less_lousy = stat.ttest_ind(less_sunny, lousy, equal_var=False)
less_lousy_p = less_lousy[1]

print('The p-value for the test between the mean ride times for sunny and cloudy/misty weather is ' + str(sunny_less_p))
print('The p-value for the test between the mean ride times for sunny and rainy/snowy/stormy weather is ' + str(sunny_lousy_p))
print('The p-value for the test between the mean ride times for cloudy and rainy/snowy/stormy weather is ' + str(less_lousy_p))

The p-value for the test between the mean ride times for sunny and cloudy/misty weather is 1.45715297149e-17
The p-value for the test between the mean ride times for sunny and rainy/snowy/stormy weather is 2.53986583283e-54
The p-value for the test between the mean ride times for cloudy and rainy/snowy/stormy weather is 1.48303117553e-33


The p-values for all three comparisons are far below the corrected significance level of 0.016667, so we can reject the null hypothesis in each case: the mean ride times differ between every pair of weather categories.

## Weather Category: Casual vs Registered
In the test above, we determined that there are differences between the ride times based on the different weather categories. The next set of tests show what happens when the member type is factored into the weather categories.

In [17]:
sunny_casual = bikeshare[(bikeshare['weathersit'] == 1) & (bikeshare['Member Type'] == 'Casual')]['time_diff']
sunny_reg = bikeshare[(bikeshare['weathersit'] == 1) & (bikeshare['Member Type'] == 'Registered')]['time_diff']
sunny_casual_mean = np.mean(sunny_casual)
sunny_reg_mean = np.mean(sunny_reg)
print('The difference between mean ride times between casual riders and registered riders on sunny days is ' + 
     str(round(sunny_casual_mean - sunny_reg_mean, 2)) + ' minutes.')

less_casual = bikeshare[(bikeshare['weathersit'] == 2) & (bikeshare['Member Type'] == 'Casual')]['time_diff']
less_reg = bikeshare[(bikeshare['weathersit'] == 2) & (bikeshare['Member Type'] == 'Registered')]['time_diff']
less_casual_mean = np.mean(less_casual)
less_reg_mean = np.mean(less_reg)
print('The difference between mean ride times between casual riders and registered riders on cloudy/misty days is ' + 
     str(round(less_casual_mean - less_reg_mean, 2)) + ' minutes.')

lousy_casual = bikeshare[(bikeshare['weathersit'] == 3) & (bikeshare['Member Type'] == 'Casual')]['time_diff']
lousy_reg = bikeshare[(bikeshare['weathersit'] == 3) & (bikeshare['Member Type'] == 'Registered')]['time_diff']
lousy_casual_mean = np.mean(lousy_casual)
lousy_reg_mean = np.mean(lousy_reg)
print('The difference between mean ride times between casual riders and registered riders on rainy/stormy/snowy days is ' + 
     str(round(lousy_casual_mean - lousy_reg_mean, 2)) + ' minutes.')

The difference between mean ride times between casual riders and registered riders on sunny days is 14.21 minutes.
The difference between mean ride times between casual riders and registered riders on cloudy/misty days is 14.02 minutes.
The difference between mean ride times between casual riders and registered riders on rainy/stormy/snowy days is 8.74 minutes.


The differences in mean ride times between member types within each weather category range from about 8.7 to 14.2 minutes. Hypothesis tests will be done to determine whether these differences are significant.
<br>
<br>
**Null Hypothesis (For each test)**: The mean ride time between casual and registered riders within the same weather category are not different.
<br>
**Alternative Hypothesis (For each test)**: The mean ride time between casual and registered riders within the same weather category are different.
<br>
**Significance**: 0.05

In [18]:
sunny_diff = stat.ttest_ind(sunny_casual, sunny_reg, equal_var=False)
sunny_diff_p = sunny_diff[1]

less_diff = stat.ttest_ind(less_casual, less_reg, equal_var=False)
less_diff_p = less_diff[1]

lousy_diff = stat.ttest_ind(lousy_casual, lousy_reg, equal_var=False)
lousy_diff_p = lousy_diff[1]

print('The p-value for the difference in mean ride times between casual and registered riders on sunny days is ' +
      str(sunny_diff_p))
print('The p-value for the difference in mean ride times between casual and registered riders on cloudy/misty days is ' +
      str(less_diff_p))
print('The p-value for the difference in mean ride times between casual and registered riders on rainy/stormy/snowy days is ' +
      str(lousy_diff_p))

The p-value for the difference in mean ride times between casual and registered riders on sunny days is 0.0
The p-value for the difference in mean ride times between casual and registered riders on cloudy/misty days is 0.0
The p-value for the difference in mean ride times between casual and registered riders on rainy/stormy/snowy days is 1.17121924155e-10


The p-values are all far below 0.05, so we can reject the null hypothesis in favor of the alternative in each case. There are differences in the mean ride times for casual and registered riders within every weather category.

## Holiday
Based on the visual exploratory analysis done, there seemed to be a difference in average ride times for holidays and non-holidays. The analysis below looks to determine if this difference is by chance or if there is actually a difference.

In [19]:
no_holiday = bikeshare[bikeshare['holiday'] == 0]['time_diff']
holiday = bikeshare[bikeshare['holiday'] == 1]['time_diff']

no_holiday_mean = np.mean(no_holiday)
holiday_mean = np.mean(holiday)
diff = holiday_mean - no_holiday_mean
print('The difference between the mean ride times for holidays and non-holidays is ' + str(round(diff, 2)) + ' minutes.')

The difference between the mean ride times for holidays and non-holidays is 1.1 minutes.


Although the difference in means is about 1.1 minutes, a hypothesis test needs to be conducted to determine whether there actually is a difference in the mean ride times between holidays and non-holidays.
<br>
<br>
**Null Hypothesis**: The mean ride times for holidays and non-holidays are equal.
<br>
**Alternative Hypothesis**: The mean ride times for holidays and non-holidays are different.
<br>
**Significance**: 0.05

In [20]:
holiday_t = stat.ttest_ind(no_holiday, holiday, equal_var=False)
holiday_t_p = holiday_t[1]
print('The p-value for the t-test regarding mean ride times on holidays and non-holidays is ' + 
      str(holiday_t_p))

The p-value for the t-test regarding mean ride times on holidays and non-holidays is 6.32698999154e-05


The p-value for this test is below the significance threshold so we can reject the null hypothesis in favor of the alternative. There is a difference in the mean ride times on holidays and non-holidays.

## Holiday: Casual vs Registered
How does the member type affect the ride time on holidays and non-holidays? The hypothesis tests below determine whether there are differences.

In [21]:
no_hol_cas = bikeshare[(bikeshare['holiday'] == 0) & (bikeshare['Member Type'] == 'Casual')]['time_diff']
no_hol_reg = bikeshare[(bikeshare['holiday'] == 0) & (bikeshare['Member Type'] == 'Registered')]['time_diff']
hol_cas = bikeshare[(bikeshare['holiday'] == 1) & (bikeshare['Member Type'] == 'Casual')]['time_diff']
hol_reg = bikeshare[(bikeshare['holiday'] == 1) & (bikeshare['Member Type'] == 'Registered')]['time_diff']

no_hol_cas_mean = np.mean(no_hol_cas)
no_hol_reg_mean = np.mean(no_hol_reg)
print('The difference in mean ride times between member types on non-holidays is ' + 
     str(round(no_hol_cas_mean - no_hol_reg_mean, 2)) + ' minutes.')

hol_cas_mean = np.mean(hol_cas)
hol_reg_mean = np.mean(hol_reg)
print('The difference in mean ride times between member types on holidays is ' + 
     str(round(hol_cas_mean - hol_reg_mean, 2)) + ' minutes.')

The difference in mean ride times between member types on non-holidays is 14.15 minutes.
The difference in mean ride times between member types on holidays is 13.76 minutes.


Based on the differences in means, the first impression is that registered riders have shorter rides than casual riders on both holidays and non-holidays. Hypothesis tests are needed to determine whether the data actually support this.
<br>
<br>
**Null Hypothesis (for both tests)**: The mean ride times for casual and registered riders are equal.
<br>
**Alternative Hypothesis (for both tests)**: The mean ride times for casual and registered riders are different.
<br>
**Significance**: 0.05

In [22]:
no_hol_t = stat.ttest_ind(no_hol_cas, no_hol_reg, equal_var=False)
no_hol_t_p = no_hol_t[1]

hol_t = stat.ttest_ind(hol_cas, hol_reg, equal_var=False)
hol_t_p = hol_t[1]

print('The p-value for the test of whether mean ride times differ between casual and registered riders on non-holidays is ' + 
      str(no_hol_t_p))
print('The p-value for the test of whether mean ride times differ between casual and registered riders on holidays is ' + 
      str(hol_t_p))


The p-value for the test of whether mean ride times differ between casual and registered riders on non-holidays is 0.0
The p-value for the test of whether mean ride times differ between casual and registered riders on holidays is 3.0308467326e-64


The p-values for both tests are roughly 0.0 so we can reject both null hypotheses that there are no differences in the mean ride times for casual riders and registered riders on both holidays and non-holidays.

## Workday
The last categorical variable to look at is the workday. This section looks at whether ride times differ between workdays and non-workdays (such as weekends).

In [23]:
workday = bikeshare[bikeshare['workingday'] == 1]['time_diff']
no_workday = bikeshare[bikeshare['workingday'] == 0]['time_diff']
workday_mean = np.mean(workday)
no_workday_mean = np.mean(no_workday)
print('The difference between mean ride times for workdays and non-workdays is ' + 
      str(round(no_workday_mean - workday_mean, 2)) + ' minutes')

The difference between mean ride times for workdays and non-workdays is 3.18 minutes


The sample means differ, but that alone is not enough evidence that the underlying mean ride times are actually different. 
<br>
<br>
**Null Hypothesis**: The mean ride times for workdays and non-workdays are equal.
<br>
**Alternative Hypothesis**: The mean ride times for workdays and non-workdays are different.
<br>
**Significance**: 0.05

In [24]:
workday_t = stat.ttest_ind(workday, no_workday, equal_var=False)
workday_t_p = workday_t[1]

print('The p-value for the test of whether mean ride times differ between workdays and non-workdays is ' +
      str(workday_t_p))

The p-value for the test of whether mean ride times differ between workdays and non-workdays is 2.80404324105e-260


The p-value is less than the significance level of 0.05 so we can reject the null hypothesis in favor of the alternative. This means that there is a difference in the mean ride times.

## Workday: Registered vs Casual
There do appear to be differences in ride times between workdays and non-workdays, but how does this change when you factor in registered and casual riders? The following analysis will determine whether the member type affects the ride times.

In [25]:
no_work_cas = bikeshare[(bikeshare['workingday'] == 0) & (bikeshare['Member Type'] == 'Casual')]['time_diff']
no_work_reg = bikeshare[(bikeshare['workingday'] == 0) & (bikeshare['Member Type'] == 'Registered')]['time_diff']
work_cas = bikeshare[(bikeshare['workingday'] == 1) & (bikeshare['Member Type'] == 'Casual')]['time_diff']
work_reg = bikeshare[(bikeshare['workingday'] == 1) & (bikeshare['Member Type'] == 'Registered')]['time_diff']

no_work_cas_mean = np.mean(no_work_cas)
no_work_reg_mean = np.mean(no_work_reg)
print('The difference between mean ride times between member types on a non-workday is ' + 
     str(round(no_work_cas_mean - no_work_reg_mean, 2)) + ' minutes.')

work_cas_mean = np.mean(work_cas)
work_reg_mean = np.mean(work_reg)
print('The difference between mean ride times between member types on workdays is ' + 
     str(round(work_cas_mean - work_reg_mean, 2)) + ' minutes.')

The difference between mean ride times between member types on a non-workday is 15.47 minutes.
The difference between mean ride times between member types on workdays is 12.62 minutes.


There seems to be a difference in the mean ride times between casual and registered riders. However, the difference in means alone is not enough evidence that this is not just due to chance.
<br>
<br>
**Null Hypothesis (for both tests)**: The mean ride times for casual and registered riders are equal.
<br>
**Alternative Hypothesis (for both tests)**: The mean ride times for casual and registered riders are different.
<br>
**Significance**: 0.05

In [26]:
no_working = stat.ttest_ind(no_work_cas, no_work_reg, equal_var=False)
no_working_p = no_working[1]

working_t = stat.ttest_ind(work_cas, work_reg, equal_var=False)
working_t_p = working_t[1]

print('The p-value for the test for non-workdays is ' + str(no_working_p))
print('The p-value for the test for workdays is ' + str(working_t_p))

The p-value for the test for non-workdays is 0.0
The p-value for the test for workdays is 0.0


The p-values for both tests are below the significance level so we can reject the null hypotheses that there are no differences in mean ride times for casual and registered riders.

## Ride Times vs. Temperature

In [27]:
time_temp = np.corrcoef(bikeshare['temp'], bikeshare['time_diff'])
time_corr = time_temp[0,1]
print('The correlation between temperature and ride time is ' + str(time_corr))

The correlation between temperature and ride time is 0.102763920058


The correlation between temperature and ride time is small (about 0.10); correlations close to 0 indicate little or no linear relationship. A hypothesis test is needed to determine whether the correlation is different from zero.
<br>
<br>
**Null Hypothesis**: There is no correlation between the temperature and ride times.
<br>
**Alternative Hypothesis**: There is a correlation between the temperature and ride times.
<br> 
**Significance level**: 0.05

In [28]:
t_num = time_corr * (np.sqrt(len(bikeshare) - 2))
t_denom = np.sqrt((1 - (time_corr ** 2)))
t_stat = t_num / t_denom
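# Note: the t statistic for a correlation has n - 2 degrees of freedom; n - 1 is used here,
# which makes a negligible difference at this sample size.
pval_temp = stat.t.sf(np.abs(t_stat), len(bikeshare) - 1)*2
print("The p-value for the t-test for the correlation between temperature and ride times is " + 
      str(pval_temp))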

The p-value for the t-test for the correlation between temperature and ride times is 7.21413596186e-233


The p-value is roughly 0.0, so we can reject the null hypothesis that the correlation is zero. However, the correlation coefficient itself is small (about 0.10), so the relationship between temperature and ride time does not seem to be strong.
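
The test statistic used above is t = r * sqrt(n - 2) / sqrt(1 - r^2), which is compared against a t distribution with n - 2 degrees of freedom. As a cross-check, `scipy.stats.pearsonr` returns the correlation and its two-sided p-value in one call. The sketch below runs both on synthetic data; `x` and `y` are made-up arrays, not columns from the bikeshare frame.

```python
import numpy as np
from scipy import stats

# Synthetic, weakly related variables (made-up data)
rng = np.random.default_rng(3)
x = rng.normal(size=200)
y = 0.2 * x + rng.normal(size=200)

# Manual t-test for a correlation coefficient
r = np.corrcoef(x, y)[0, 1]
n = len(x)
t_stat = r * np.sqrt(n - 2) / np.sqrt(1 - r ** 2)
p_manual = 2 * stats.t.sf(abs(t_stat), n - 2)   # two-sided, n - 2 degrees of freedom

# Same test via SciPy
r_scipy, p_scipy = stats.pearsonr(x, y)
print(np.isclose(r, r_scipy), np.isclose(p_manual, p_scipy))
```

On the notebook's data, the equivalent call would be `stat.pearsonr(bikeshare['temp'], bikeshare['time_diff'])`.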

## Ride Times vs. Humidity
The next test is to see if the humidity variable is correlated with the ride time variable.

In [29]:
time_hum = np.corrcoef(bikeshare['hum'], bikeshare['time_diff'])
hum_corr = time_hum[0,1]
print('The correlation between humidity and ride time is ' + str(hum_corr))

The correlation between humidity and ride time is 0.00628474960666


The correlation between humidity and ride time is very small (about 0.006); correlations close to 0 indicate little or no linear relationship. A hypothesis test is needed to determine whether the correlation is different from zero.
<br>
<br>
**Null Hypothesis**: There is no correlation between the humidity and ride times.
<br>
**Alternative Hypothesis**: There is a correlation between the humidity and ride times.
<br> 
**Significance level**: 0.05

In [30]:
t_num = hum_corr * (np.sqrt(len(bikeshare) - 2))
t_denom = np.sqrt((1 - (hum_corr ** 2)))
t_stat = t_num / t_denom
# Two-sided p-value. The exact degrees of freedom are n - 2; n - 1 is used here (negligible difference at this n).
pval_hum = stat.t.sf(np.abs(t_stat), len(bikeshare) - 1)*2
print("The p-value for the t-test for the correlation between humidity and ride times is " + 
      str(pval_hum))

The p-value for the t-test for the correlation between humidity and ride times is 0.0468773003628


The p-value (about 0.047) is just below the 0.05 significance level, so we can reject the null hypothesis and conclude that there is a correlation between humidity and ride times. However, this test only shows that the correlation is unlikely to be exactly 0. The correlation itself is very small (about 0.006), so the relationship is very weak.
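
To make the gap between statistical and practical significance concrete, one option is to look at a confidence interval for the correlation and at r squared. The sketch below uses the Fisher z-transformation, which is an approximation that was not used in the original analysis. The helper name `fisher_ci` is hypothetical, and the inputs are the correlation printed above and the sample size of 100,000.

```python
import numpy as np
from scipy import stats

def fisher_ci(r, n, alpha=0.05):
    """Approximate confidence interval for a Pearson correlation."""
    z = np.arctanh(r)              # Fisher z-transformation of r
    se = 1.0 / np.sqrt(n - 3)      # standard error on the z scale
    z_crit = stats.norm.ppf(1 - alpha / 2)
    # Transform the interval endpoints back to the correlation scale
    return np.tanh(z - z_crit * se), np.tanh(z + z_crit * se)

r_hum = 0.00628474960666  # correlation printed above
n = 100000                # size of the bikeshare sample

lo, hi = fisher_ci(r_hum, n)
r_squared = r_hum ** 2    # share of variance explained by a linear fit
print(lo, hi, r_squared)
```

The interval excludes zero only narrowly, and r squared is well under 0.01%, which agrees with the reading that the relationship is detectable but very weak.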

## Ride Times vs Wind Speed
The last weather-related test is to determine if wind speed is correlated with ride time.

In [31]:
time_wind = np.corrcoef(bikeshare['windspeed'], bikeshare['time_diff'])
wind_corr = time_wind[0,1]
print('The correlation between wind speed and ride time is ' + str(wind_corr))

The correlation between wind speed and ride time is -0.0295105218383


The correlation between wind speed and ride time is very small; a correlation close to 0 means there is little or no linear relationship. A hypothesis test is needed to determine whether the correlation is different from zero.
<br>
<br>
**Null Hypothesis**: There is no correlation between the wind speed and ride times.
<br>
**Alternative Hypothesis**: There is a correlation between the wind speed and ride times.
<br> 
**Significance level**: 0.05

In [32]:
t_num = wind_corr * (np.sqrt(len(bikeshare) - 2))
t_denom = np.sqrt((1 - (wind_corr ** 2)))
t_stat = t_num / t_denom
# Two-sided p-value. The exact degrees of freedom are n - 2; n - 1 is used here (negligible difference at this n).
pval_wind = stat.t.sf(np.abs(t_stat), len(bikeshare) - 1)*2
print("The p-value for the t-test for the correlation between wind speed and ride times is " + 
      str(pval_wind))

The p-value for the t-test for the correlation between wind speed and ride times is 1.01985189613e-20


The p-value is far below the significance level, so we can reject the null hypothesis: there is a correlation between wind speed and ride times. However, the correlation itself is very small (about -0.03), so the relationship between the two variables is not strong.

## Miles vs. Ride Time
The `miles` attribute was calculated with the Vincenty formula, which computes the distance between two points from their latitudes and longitudes.

In [33]:
time_miles = np.corrcoef(bikeshare['miles'], bikeshare['time_diff'])
miles_corr = time_miles[0,1]
print('The correlation between miles and ride time is ' + str(miles_corr))

The correlation between miles and ride time is 0.405867531255


The correlation between the distance in miles between two stations and ride time is fairly large compared to the correlations for the other three variables (humidity, temperature, and wind speed). A hypothesis test is needed to determine whether the correlation is different from zero.
<br>
<br>
**Null Hypothesis**: There is no correlation between the miles between two stations and ride times.
<br>
**Alternative Hypothesis**: There is a correlation between miles between two stations and ride times.
<br> 
**Significance level**: 0.05

In [34]:
t_num = miles_corr * (np.sqrt(len(bikeshare) - 2))
t_denom = np.sqrt((1 - (miles_corr ** 2)))
t_stat = t_num / t_denom
# Two-sided p-value. The exact degrees of freedom are n - 2; n - 1 is used here (negligible difference at this n).
pval_miles = stat.t.sf(np.abs(t_stat), len(bikeshare) - 1)*2
print("The p-value for the t-test for the correlation between miles between two points and ride times is " + 
      str(pval_miles))

The p-value for the t-test for the correlation between miles between two points and ride times is 0.0


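The p-value is printed as 0.0 because it is too small to represent as a floating-point number, not because it is exactly zero. We can reject the null hypothesis in favor of the alternative and conclude that there is a correlation between the two variables.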
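
The printed value of 0.0 can be reproduced from the numbers above. The following is a minimal sketch that plugs in the correlation printed earlier and the sample size of 100,000, then compares the result with the smallest positive normal double-precision value. It uses n - 2 degrees of freedom, which does not change the outcome here.

```python
import numpy as np
from scipy import stats

r_miles = 0.405867531255  # correlation printed above
n = 100000                # size of the bikeshare sample

# Same t statistic as in the cell above
t_stat = r_miles * np.sqrt(n - 2) / np.sqrt(1 - r_miles ** 2)
p = 2 * stats.t.sf(abs(t_stat), n - 2)

smallest = np.finfo(float).tiny  # smallest positive normal double
print(t_stat, p, smallest)
```

A t statistic this large puts the true p-value far below what a 64-bit float can hold, so the tail probability underflows.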

In [35]:
casual = bikeshare[bikeshare['Member Type'] == 'Casual']
registered = bikeshare[bikeshare['Member Type'] == 'Registered']

time_miles_cas = np.corrcoef(casual['miles'], casual['time_diff'])
miles_corr_cas = time_miles_cas[0,1]
print('The correlation between miles and ride time for casual riders is ' + str(miles_corr_cas))

The correlation between miles and ride time for casual riders is 0.144906869361


In [36]:
time_miles_reg = np.corrcoef(registered['miles'], registered['time_diff'])
miles_corr_reg = time_miles_reg[0,1]
print('The correlation between miles and ride time for registered riders is ' + str(miles_corr_reg))

The correlation between miles and ride time for registered riders is 0.646566974426


The sample size is large for each rider type. Based on the result of the hypothesis test above for all rides, I am assuming that a hypothesis test for each rider type would also show a nonzero correlation.
<br>
<br>
The correlation for registered riders is fairly large (about 0.65), showing a fairly strong relationship between ride time and the distance between the two stations. The same cannot be said for casual riders, whose correlation (about 0.14) is much closer to 0.
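
One way to check whether the two correlations really differ is a z-test on their Fisher-transformed values. This test was not run in the original analysis, so treat the sketch below as an assumption-laden illustration. The helper name `compare_correlations` is hypothetical, and the group sizes of 20,000 and 80,000 are placeholders because the actual counts are not shown above; in the notebook they would be `len(casual)` and `len(registered)`.

```python
import numpy as np
from scipy import stats

def compare_correlations(r1, n1, r2, n2):
    """Two-sided z-test for H0: two independent correlations are equal."""
    z1, z2 = np.arctanh(r1), np.arctanh(r2)   # Fisher z-transformations
    se = np.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    z = (z1 - z2) / se
    p = 2 * stats.norm.sf(abs(z))
    return z, p

# Correlations are the values printed above; group sizes are placeholders.
z, p = compare_correlations(0.144906869361, 20000, 0.646566974426, 80000)
print(z, p)
```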