Project 2: Analyzing the NYC Subway Dataset
==============

-----------

Section 0. References
-----------

----------------

 Problems with understanding and interpretation of the Mann-Whitney U-test:
 
 http://stats.stackexchange.com/questions/124995/how-do-i-interpret-the-p-value-returned-in-scipys-mann-whitney-u-test
        
 https://discussions.udacity.com/t/problem-set-3-3-interpreting-mann-whitney-u-test-repost/25403/2   
 
 https://discussions.udacity.com/t/welchs-ttest-use-it-if-distribution-not-normal/21193
 
 Linear regression with Python:
 
 http://connor-johnson.com/2014/02/18/linear-regression-with-python/
 
 R squared one more time:
 
 https://en.wikipedia.org/wiki/Coefficient_of_determination#Interpretation
 
 Matplotlib: equal width of the bins in histogram (looks familiar):
 
 http://stackoverflow.com/questions/28101623/python-pyplot-histogram-adjusting-bin-width-not-number-of-bins

----------


Section 1. Statistical Test
------------

*********************

**1.1 Which statistical test did you use to analyze the NYC subway data? Did you use a one-tail or a two-tail P value? What is the null hypothesis? What is your p-critical value?**

I used the **two-sided Mann-Whitney U-test** (two-tailed p-value: 0.04999983) with a p-critical value of 0.05.

Null hypothesis $H_0$: the distribution of the number of entries is the same for rainy and non-rainy days.

Alternative hypothesis $H_1$: the distribution of the number of entries differs between rainy and non-rainy days.
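
To make the test concrete, here is a minimal hand-written sketch of a two-sided Mann-Whitney U-test using the normal approximation with a tie correction. It is not the scipy routine that produced the p-value in this report; the function names `midranks` and `mann_whitney_u` and the sample numbers are assumptions made for illustration only.

```python
import math
from collections import Counter


def midranks(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the current run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return ranks


def mann_whitney_u(x, y):
    """Two-sided test via the normal approximation (no continuity correction).

    Assumes both samples are non-empty. Returns (U, two-sided p-value).
    """
    n1, n2 = len(x), len(y)
    n = n1 + n2
    combined = list(x) + list(y)
    ranks = midranks(combined)
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2.0
    u = min(u1, n1 * n2 - u1)
    # Variance of U under H0, reduced by the tie correction term.
    tie_term = sum(t ** 3 - t for t in Counter(combined).values())
    var_u = n1 * n2 / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:  # all values identical: no evidence of a difference
        return u, 1.0
    z = (u - n1 * n2 / 2.0) / math.sqrt(var_u)
    return u, math.erfc(abs(z) / math.sqrt(2))


# Made-up hourly entry counts, NOT taken from the subway data set.
rainy = [120, 340, 560, 90, 1500]
dry = [100, 300, 610, 80, 1400]
u_stat, p_value = mann_whitney_u(rainy, dry)
print(u_stat, p_value)
```

For real work a library routine is preferable; recent versions of scipy's `mannwhitneyu` accept an `alternative='two-sided'` argument, which removes the ambiguity about one-tailed versus two-tailed p-values.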

--------------

**1.2 Why is this statistical test applicable to the dataset? In particular, consider the assumptions that the test is making about the distribution of ridership in the two samples.**

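The Mann-Whitney U-test is non-parametric: it makes no assumption about the shape of the underlying distributions (in particular, it does not require normality) and only needs independent observations that can be ranked. In our case both distributions are right-skewed, so a test that does not assume normality is a natural choice.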

------------

**1.3 What results did you get from this statistical test? These should include the following numerical values: p-values, as well as the means for each of the two samples under test.**

mean for the rainy days sample: $\mu_r = 1105.446$

mean for the non-rainy days sample: $\mu_{nr} = 1090.279$

p-value: p = 0.04999983

------------

**1.4 What is the significance and interpretation of these results?**

At the significance level α = 0.05 the null hypothesis is rejected in favour of the alternative, although it should be noted that the resulting p-value is almost equal to the critical value.
The distribution of the number of entries differs significantly between rainy and non-rainy days. Comparing the means suggests that ridership is slightly higher on rainy days.

-------------

Section 2. Linear Regression
--------------

------------

**2.1 What approach did you use to compute the coefficients theta and produce prediction for ENTRIESn_hourly in your regression model:**

**OLS using Statsmodels or Scikit Learn**

**Gradient descent using Scikit Learn**

**Or something different?**

I used OLS from Statsmodels: `sm.OLS(., .)`.

------------------

**2.2 What features (input variables) did you use in your model? Did you use any dummy variables as part of your features?**

I used 'Hour', 'maxpressurei', 'maxdewpti', 'minpressurei', 'meanpressurei', 'rain', 'meanwindspdi', 'mintempi' and dummy variables produced from the 'UNIT' variable.
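
As an illustration of what the dummy variables look like, the sketch below turns a list of UNIT labels into 0/1 columns. The helper name `one_hot`, the `unit_` prefix and the `drop_first` option are assumptions for this example; the report itself does not show how the dummies were built.

```python
def one_hot(values, prefix="unit_", drop_first=False):
    """Encode categorical labels as 0/1 dummy columns.

    Returns (column_names, rows). With drop_first=True the first category
    (in sorted order) becomes the baseline and gets no column of its own.
    """
    categories = sorted(set(values))
    kept = categories[1:] if drop_first else categories
    names = [prefix + c for c in kept]
    rows = [[1 if v == c else 0 for c in kept] for v in values]
    return names, rows


# Hypothetical sequence of turnstile units, one per observation.
units = ["R001", "R003", "R001", "R004"]
names, rows = one_hot(units)
for unit, row in zip(units, rows):
    print(unit, dict(zip(names, row)))
```

With pandas, `pd.get_dummies` produces the same kind of columns directly from a DataFrame column.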

----------

**2.3 Why did you select these features in your model? We are looking for specific reasons that lead you to believe that
the selected features will contribute to the predictive power of your model.
Your reasons might be based on intuition. For example, response for fog might be: “I decided to use fog because I thought that when it is very foggy outside people might decide to use the subway more often.”
Your reasons might also be based on data exploration and experimentation, for example: “I used feature X because as soon as I included it in my model, it drastically improved my R2 value.”**

First I tried several combinations based on intuition; then I printed the summary of the fitted linear regression and kept the features whose coefficients were significantly different from 0. The code that prints the summary is below:

    import statsmodels.api as sm

    def linear_regression(features, values):
        # Prepend an intercept column, then fit ordinary least squares.
        features = sm.add_constant(features, prepend=True)
        model = sm.OLS(values, features)
        results = model.fit()
        # With DataFrame input, params is a pandas Series: intercept first.
        intercept = results.params.iloc[0]
        params = results.params.iloc[1:]
        print(intercept, params, results.summary())

    # linear_regression(dataframe[nmes_clear], dataframe['ENTRIESn_hourly'])

Note: I had to get rid of the "thunder" variable, because there is a problem adding the constant (intercept) when a constant column already exists in the data set. I can't find a link right now, but it is a known problem.

------------

**2.4 What are the parameters (also known as "coefficients" or "weights") of the non-dummy features in your linear regression model?**

------------

**2.5 What is your model’s R2 (coefficients of determination) value?**

$R^2 = 0.481891290522$

----------

**2.6 What does this R2 value mean for the goodness of fit for your regression model? Do you think this linear model to predict ridership is appropriate for this dataset, given this R2  value?**

Around 48% of the variance in the response variable (number of entries per hour) is explained by the explanatory variables. As this is a real-world data set, this value is fair enough; I reckon the model fits reasonably well.
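
For reference, the "share of variance explained" reading of $R^2$ follows from its definition, one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below computes it from observed and predicted values; the function name `r_squared` and the numbers are illustrative assumptions, not values from the fitted model.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    # Residual sum of squares: what the model fails to explain.
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    # Total sum of squares: variation around the mean of the observations.
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


# Made-up hourly entries and model predictions.
observed = [100.0, 250.0, 400.0, 900.0]
predicted = [150.0, 200.0, 500.0, 800.0]
print(r_squared(observed, predicted))
```

A model that always predicts the mean of the observations scores 0, and a perfect model scores 1.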

-------------

Section 3. Visualization
----------

----------

**3.1 One visualization should contain two histograms: one of  ENTRIESn_hourly for rainy days and one of ENTRIESn_hourly for non-rainy days.**

<img src="Rain_no_rain.png">

_This histogram shows the skewness of the data (expected), but since it is not normalized by the number of rainy and non-rainy days, we can't use it to draw any further conclusions._

_Normalized version below:_

<img src="Rain_no_rain_norm.png">

_Here we can see only a slight difference, as expected from the result of the statistical test._
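
One way to normalize such histograms is to plot, for each bin, the share of observations rather than the raw count, so that samples of different sizes become comparable. The sketch below assumes this relative-frequency normalization; the report does not state which normalization was used for the figure, and `relative_frequencies` is a hypothetical helper.

```python
def relative_frequencies(values, bin_edges):
    """Share of values per bin [edge_i, edge_{i+1}); the last bin is closed.

    Values outside the overall range of bin_edges are ignored.
    """
    counts = [0] * (len(bin_edges) - 1)
    last = len(counts) - 1
    for v in values:
        for i in range(len(counts)):
            lo, hi = bin_edges[i], bin_edges[i + 1]
            if lo <= v < hi or (i == last and v == hi):
                counts[i] += 1
                break
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(counts)


# Made-up samples of different sizes but the same shape.
small_sample = [0, 600]
large_sample = [0, 0, 600, 600]
edges = [0, 500, 1000]
print(relative_frequencies(small_sample, edges))
print(relative_frequencies(large_sample, edges))
```

Both samples give the same bar heights after normalization, even though the raw counts differ by a factor of two.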

-------------

**3.2 One visualization can be more freeform. You should feel free to implement something that we discussed in class (e.g., scatter plots, line plots) or attempt to implement something more advanced if you'd like. Some suggestions are:
Ridership by time-of-day
Ridership by day-of-week**

These visualizations were made with Python ggplot. I decided to use bar charts, but any chart type can be equally informative in these cases.

<img src="By_hour.png">

_This bar chart shows the mean number of entries for each hour of the day. We can clearly see the peak hours: in the morning, at lunchtime and during the homegoing hours in the second part of the day. It remains a question, though, why the mean number of entries in the morning is so different from the evening. Maybe I should have separated weekdays from weekends._

<img src="By_day.png">

_This bar chart shows the mean number of entries for each day of the week. Here we can see perfectly normal weekly seasonality: more people use the subway during the workweek, fewer on weekends._

----------

Section 4. Conclusion
--------------

-----------

**4.1 From your analysis and interpretation of the data, do more people ride the NYC subway when it is raining or when it is not raining?**

_To analyze the NYC Subway dataset, the Mann-Whitney statistical test was applied. The results revealed a slight difference in ridership between rainy and non-rainy days: on rainy days slightly more people enter the NYC subway stations. The average number of entries on rainy days is approximately 1105; on non-rainy days it is approximately 1090._

----------

**4.2 What analyses lead you to this conclusion? You should use results from both your statistical
tests and your linear regression to support your analysis.**

The Mann-Whitney test, which assesses whether a value drawn at random from the rainy-day sample tends to differ from one drawn at random from the non-rainy-day sample, showed a statistically significant difference; comparing summary statistics such as means and medians revealed that rainy days have higher ridership. Also, in the regression model rain is a significant predictor, which means that this factor contributes to the prediction of the number of passengers.

------------

Section 5. Reflection
----------

-----------

**5.1 Please discuss potential shortcomings of the methods of your analysis, including:**

**Dataset,**

**Analysis, such as the linear regression model or statistical test.**

I noticed that for different UNITs the time at which the number of entries is registered can be shifted by one hour or be missing (e.g. R001 - hours 1, 5, 9, 13, 17, 21; R003 - hours 0, 4, 12, 16, 20), which can cause some issues.  
A second weakness is that rain is usually not a day-long characteristic: for example, a day on which it rained between 03 and 07 am can be marked as rainy, even though that rain did not change ridership.
Most of the predictors are weather conditions and so are correlated with each other (rain and pressure, for example), so using all of them in the regression model can cause multicollinearity problems. Also, the output of scipy's mannwhitneyu is [questionable](http://stats.stackexchange.com/questions/124995/how-do-i-interpret-the-p-value-returned-in-scipys-mann-whitney-u-test).
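
The multicollinearity concern can be checked quickly by looking at pairwise correlations between predictors before fitting. The sketch below computes the Pearson correlation coefficient by hand; `pearson_r` is a hypothetical helper and the pressure readings are made-up numbers, not values from the data set.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up daily readings for two pressure features.
min_pressure = [29.80, 29.95, 30.10, 29.70, 30.00]
max_pressure = [30.00, 30.12, 30.31, 29.92, 30.18]
print(pearson_r(min_pressure, max_pressure))
```

A coefficient close to 1 or -1 for a pair of predictors suggests that one of them adds little independent information and that the individual regression coefficients may be unstable.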