Capstone Project 1: Statistical Data Analysis
 Students typically spend 4 - 12 Hours

At this point, you’ve obtained the dataset for your capstone project, cleaned it, and wrangled it into a form that's ready for analysis. It's now time to apply the inferential statistics techniques you’ve learned to explore the data.

Based on your dataset, the questions that interest you, and the results of the visualization techniques that you used previously, you should choose the most relevant statistical inference techniques. You aren’t expected to demonstrate all of them. Your specific situation determines how much time it’ll take you to complete this project. Talk to your mentor to determine the most appropriate approach to take for your project. You may find yourself revisiting the analytical framework that you first used to develop your proposal questions. It’s fine to refine your questions further as you get deeper into your data and find interesting patterns and answers. Remember to stay in touch with your mentor to remain focused on the scope of your project.

Think of the following questions and apply them to your dataset:

* Are there variables that are particularly significant in terms of explaining the answer to your project question?
* Are there significant differences between subgroups in your data that may be relevant to your project aim?
* Are there strong correlations between pairs of independent variables or between an independent and a dependent variable?
* What are the most appropriate tests to use to analyze these relationships?

Submission: Write a 1-2 page report on the steps and findings of your inferential statistical analysis. Upload this report to your GitHub and submit a link. Eventually, this report will get incorporated into your milestone report.

In [55]:
!ls -lh ../data/interim/data_by_day.pkl

-rw-r--r--  1 bethanys08  admin    62M Sep 25  2019 ../data/interim/data_by_day.pkl


In [56]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline

In [57]:
daily_data = pd.read_pickle('../data/interim/data_by_day.pkl')

In [58]:
daily_data['PercentOccupied'] = daily_data.PaidOccupancy / daily_data.ParkingSpaceCount
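# Assign the result back: an in-place replace on an attribute-accessed column may not update the DataFrame
daily_data['PercentOccupied'] = daily_data['PercentOccupied'].replace([np.inf, -np.inf], np.nan)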
daily_data.dropna(inplace=True)

In [59]:
daily_data.PercentOccupied.isna().sum()

0

In [60]:
daily_data.head()

SourceElementKey,OccupancyDateTime,PaidOccupancy,ParkingSpaceCount,PercentOccupied
1001,2012-01-03,2.072222,7.0,0.296032
1001,2012-01-04,1.336111,7.0,0.190873
1001,2012-01-05,1.836111,7.0,0.262302
1001,2012-01-06,2.268698,7.0,0.3241
1001,2012-01-07,1.683333,7.0,0.240476


In [61]:
blockface_ids = daily_data.index.unique(level=0).values

# How important is weather?

We can state a null hypothesis that weather does not affect parking availability. I will then test four alternative hypotheses: rain affects availability, snow affects availability, hot days affect availability, and cold days affect availability.
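The comparisons in this section call `scipy.stats.ttest_ind` with `equal_var=False`, which corresponds to Welch's t-test (no assumption of equal group variances). As a reference sketch, with $\bar{x}_i$, $s_i^2$, and $n_i$ denoting the sample mean, sample variance, and size of group $i$ (notation introduced here, not taken from the notebook), the test statistic and its approximate degrees of freedom are:

```latex
t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}},
\qquad
\nu \approx \frac{\left(\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}\right)^2}
                 {\dfrac{\left(s_1^2/n_1\right)^2}{n_1 - 1} + \dfrac{\left(s_2^2/n_2\right)^2}{n_2 - 1}}
```

Because the denominator shrinks as $n_1$ and $n_2$ grow, a very large sample can yield a large $|t|$ even when the difference in means is small.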

In [66]:
weather = pd.read_pickle('../data/processed/2010-2019_daily_weather.pkl')

In [67]:
weather.index.rename('OccupancyDateTime', inplace=True)
weather.head()

OccupancyDateTime,PRCP,SNOW,TAVG,TMAX,TMIN
2010-01-01,0.36,0.0,48.5,52.0,45.0
2010-01-02,0.03,0.0,46.0,50.0,42.0
2010-01-03,0.02,0.0,45.0,48.0,42.0
2010-01-04,0.71,0.0,46.0,48.0,44.0
2010-01-05,0.07,0.0,46.5,48.0,45.0


In [68]:
weather.PRCP = weather.PRCP.replace(np.nan, 0.0)

In [69]:
daily_with_weather = daily_data.join(weather)
daily_with_weather.head()

SourceElementKey,OccupancyDateTime,PaidOccupancy,ParkingSpaceCount,PercentOccupied,PRCP,SNOW,TAVG,TMAX,TMIN
1001,2012-01-03,2.072222,7.0,0.296032,0.02,0.0,48.0,53.0,43.0
1001,2012-01-04,1.336111,7.0,0.190873,0.65,0.0,46.0,53.0,39.0
1001,2012-01-05,1.836111,7.0,0.262302,0.04,0.0,44.0,49.0,39.0
1001,2012-01-06,2.268698,7.0,0.3241,0.05,0.0,40.0,42.0,38.0
1001,2012-01-07,1.683333,7.0,0.240476,0.0,0.0,42.0,45.0,39.0


In [104]:
from scipy.stats import t
import operator
import scipy.stats

In [131]:
# Each weather column maps to (comparison operator, threshold) defining the "condition met" group
parameters = {'PRCP': (operator.gt, .5),
              'SNOW': (operator.gt, .01),
              'TMIN': (operator.lt, 30.),
              'TMAX': (operator.gt, 70.)}

In [132]:
for key, val in parameters.items():
    val_operator, value = val
    # Boolean mask: True on rows where the weather condition is met
    mask = val_operator(daily_with_weather[key], value)
    true_values = daily_with_weather[mask]
    false_values = daily_with_weather[~mask]
    print('\n', key, val)
#     print('len', len(true_values), len(false_values))
    # Means are printed in the order (condition met, condition not met)
    print('mean', true_values.PercentOccupied.mean(), false_values.PercentOccupied.mean())
#     print('var', true_values.PercentOccupied.var(), false_values.PercentOccupied.var())

    # Welch's t-test (equal_var=False): does not assume equal group variances
    scipy_t, scipy_p = scipy.stats.ttest_ind(false_values.PercentOccupied, true_values.PercentOccupied, equal_var=False)
    print('ttest', np.abs(scipy_t))
    print('pvalue', scipy_p)


 PRCP (<built-in function gt>, 0.5)
mean 0.4524196864901656 0.4590367671990252
ttest 9.700247139559067
pvalue 3.0368429960104573e-22

 SNOW (<built-in function gt>, 0.01)
mean 0.31938216371853234 0.4594837066385767
ttest 72.41324750425517
pvalue 0.0

 TMIN (<built-in function lt>, 30.0)
mean 0.4302192387151238 0.4595918282052129
ttest 32.396311714938534
pvalue 3.318456391831045e-229

 TMAX (<built-in function gt>, 70.0)
mean 0.4660766099720753 0.4555768431159003
ttest 28.238149002371376
pvalue 2.1853972858872343e-175


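For each condition, the difference in mean occupancy between the two groups is statistically significant (p < 0.001 in every case). The difference is small for rain (about 0.7 percentage points), cold days (about 2.9 points), and hot days (about 1.0 point), and much larger for snow (about 14 points). Substantial rainfall, snow, and cold temperatures are associated with lower occupancy, and therefore more available parking spaces, while high temperatures are associated with slightly higher occupancy and fewer available spaces.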
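With a dataset of this size, even very small differences in means can produce tiny p-values, so it can help to report an effect size next to each test. The following is a minimal sketch, not part of the original analysis, of Cohen's d (the mean difference divided by the pooled standard deviation). The helper `cohens_d` and the synthetic `dry_days` and `wet_days` samples are assumptions made for illustration; they are not the project data.

```python
import math
import random
import statistics


def cohens_d(group_a, group_b):
    """Standardized mean difference using the pooled sample standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    var_a = statistics.variance(group_a)  # sample variance (n - 1 denominator)
    var_b = statistics.variance(group_b)
    pooled_sd = math.sqrt(((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2))
    return (statistics.mean(group_a) - statistics.mean(group_b)) / pooled_sd


# Synthetic occupancy fractions for illustration only (hypothetical, NOT the project data).
rng = random.Random(0)
dry_days = [rng.gauss(0.46, 0.20) for _ in range(5000)]
wet_days = [rng.gauss(0.45, 0.20) for _ in range(5000)]

d = cohens_d(dry_days, wet_days)
print(f"Cohen's d = {d:.3f}")
```

By the usual convention, values of d around 0.2 are considered small. With the notebook's groups, the same quantity could be computed from `false_values.PercentOccupied` and `true_values.PercentOccupied` using their `.mean()`, `.var()`, and `len()`.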

# How is temperature correlated with parking availability?

In [115]:
scipy.stats.pearsonr(daily_with_weather.PercentOccupied, daily_with_weather.TAVG)

(0.03047883034976372, 0.0)

In [116]:
scipy.stats.pearsonr(daily_with_weather.PercentOccupied, daily_with_weather.TMAX)

(0.02867653552506716, 0.0)

In [117]:
scipy.stats.pearsonr(daily_with_weather.PercentOccupied, daily_with_weather.TMIN)

(0.03050118986046407, 0.0)
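
Each call above returns a (correlation coefficient, p-value) pair. The p-values print as 0.0, but the coefficients are all about 0.03, which indicates a very weak positive linear relationship between temperature and occupancy. One way to make this concrete, sketched here with the coefficients copied from the outputs above, is to square r: for a simple linear relationship, r² is the share of variance in one variable that is linearly explained by the other. The names `correlations` and `r_squared` are used only for this illustration.

```python
# Pearson coefficients copied from the pearsonr outputs above.
correlations = {
    "TAVG": 0.03047883034976372,
    "TMAX": 0.02867653552506716,
    "TMIN": 0.03050118986046407,
}

# r squared: fraction of variance linearly explained by each temperature measure.
r_squared = {name: r ** 2 for name, r in correlations.items()}

for name, value in r_squared.items():
    print(f"{name}: r = {correlations[name]:.4f}, r^2 = {value:.6f}")
```

Each r² value is below 0.001, so each temperature measure accounts for less than 0.1% of the variance in `PercentOccupied`, even though the correlation is statistically distinguishable from zero.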