In [1]:
from __future__ import division, print_function  # must precede all other statements

import os
import sys
import json
import datetime as dt

import numpy as np
import pandas as pd
import scipy.stats
import statsmodels.api as sm
import statsmodels.formula.api as smf

import pylab as pl
import seaborn
%pylab inline

np.random.seed(999)  # seed the random number generator for reproducibility



Populating the interactive namespace from numpy and matplotlib




# Science Buzzwords

- Reproducibility: independent verification 

- Falsifiability: defining characteristic of science

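- Central Limit Theorem: For a large enough sample size, the distribution of sample means is approximately normal around the true mean, whatever the shape of the parent distribution (provided it has finite variance).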
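
The Central Limit Theorem can be checked with a quick simulation. This is a minimal sketch: the exponential parent distribution, the sample size of 50 and the number of repetitions are arbitrary choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(999)

# Skewed, clearly non-normal parent distribution: exponential with mean 2
# (for an exponential, the standard deviation equals the mean).
true_mean = 2.0
n_samples, sample_size = 5000, 50

samples = rng.exponential(scale=true_mean, size=(n_samples, sample_size))
sample_means = samples.mean(axis=1)

# The sample means cluster around the true mean,
# with a spread close to sigma / sqrt(sample_size).
expected_spread = true_mean / np.sqrt(sample_size)
print(sample_means.mean(), sample_means.std(), expected_spread)
```

A histogram of `sample_means` should look close to a Gaussian even though the parent distribution is strongly skewed.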

### Types of data
- qualitative: no ordering

- quantitative: ordering is meaningful
- - Continuous
- - Discrete




![1](flow.PNG)

# Errors

## Statistical: 

Stochastic and random error
 - unpredictable uncertainty in a measurement due to lack of sensitivity

- stochastic process can be completely random: Poisson process

Solution: larger sample size
 

## Systematic

Tendency to underestimate/overestimate the average difference between population and sample
- Survey Bias
- - Undercoverage Bias
- - Self Selection Bias
- - Social Desirability Bias
- - Publication Bias

Solution: Good experimental design, calibration, simulations


## Reporting Errors
- Add in quadrature (assuming Gaussian)
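
Adding in quadrature means taking the square root of the sum of the squared uncertainties. A minimal sketch follows; the function name and the two error values are hypothetical, and the errors are assumed independent and Gaussian.

```python
import numpy as np

def add_in_quadrature(*errors):
    """Combine independent uncertainties: sqrt(e1**2 + e2**2 + ...)."""
    return float(np.sqrt(np.sum(np.square(errors))))

stat_err = 3.0  # hypothetical statistical uncertainty
sys_err = 4.0   # hypothetical systematic uncertainty

total_err = add_in_quadrature(stat_err, sys_err)
print(total_err)  # 5.0
```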

## Type 1 vs Type 2
- Type 1: False Positive -- Reject null when shouldn't
- Type 2: False Negative -- Fail to reject null when should

# Data Munging

In [2]:
!head apidef.json
# load the API key from a local JSON file that is kept out of version control
with open("apidef.json") as f:
    myAPI = json.load(f)
myAPI["myAPI"]

{"myAPI":"XXXXXX-XXX-XXXX"}

'XXXXXX-XXX-XXXX'

In [1]:


# reading data
#pd.read_csv()

#url = "http://some.url.here"
#os.system("curl -O " + url)
#os.system("mv data.csv " + os.getenv("PUIDATA"))


#url = ("https://maps.googleapis.com/maps/api/geocode/json?latlng=" +
#          "%f,%f&key=%s"%(
#            latlon[0], latlon[1], os.getenv('GOOGLEAPI')))

#Never hard-code your API key in the code. Set an environment variable and read it with os.getenv()

#DFDATA = "/gws/open/NYCOpenData/nycopendata/data/"
#df_gas = pd.read_csv(DFDATA + "/uedp-fegm/1414245967/uedp-fegm")

#df.drop(..., axis = 1)  # drop is a DataFrame method, not a top-level pandas function

### Data Wrangling

https://github.com/fedhere/UInotebooks/blob/master/dataWrangling/PandasDataWrangling-Chap7.ipynb



# Distributions


### Normal -- Gaussian

![1](gauss.PNG)

In [3]:
# change mu and sigma2
# N(mu, sigma2) --> normally distributed with mean mu, variance sigma2 (sigma is stdev)
# stdev is sqrt(variance)

sigma = 25 # new standard deviation
mu = 100

g = sigma * np.random.randn(10) + mu
print(g)

[  74.34410066   51.42184615  108.38449537  158.13378052  140.85260894
   58.62244899  125.29692014  130.87713101  108.82553256   53.63564495]


### Poisson

Poisson: discrete variables, for counting, arrivals, pieces of mail, “queuing up”
- independent events
- lambda = mu = variance. Lambda is based on a historical average, very rarely given
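
The property lambda = mean = variance can be checked numerically. A small sketch; the rate of 4 events per interval is an arbitrary value chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(999)

lam = 4.0  # hypothetical average number of arrivals per interval
counts = rng.poisson(lam=lam, size=100_000)

# For a Poisson process both numbers should be close to lam.
print(counts.mean(), counts.var())
```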


![1](poisson.PNG)

![1](pd.PNG)

### Chi Squared

For continuous variables. The chi-squared distribution with k degrees of freedom is the distribution of the sum of the squares of k independent standard normal random variables.

![1](chisq.PNG)

# Statistical Tests

## Stating the Null Hypothesis:

#### Verbally:

Null Hypothesis: The mean of A is greater than or equal to the mean of B.  
Alternate Hypothesis: The mean of A is significantly less than the mean of B.

#### Mathematically:

$H_0$: A.mean() >= B.mean()

$H_a$: A.mean() < B.mean()

### $\alpha=0.05$


## Z - Test

In [4]:
# z = (mean_sample - mean_pop) / (std_pop / sqrt(N))   # population stdev is known
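
A possible implementation of the one-sample z-test is sketched below. It uses the convention of sample mean minus population mean in the numerator; the helper name `z_test`, the sample values and the population parameters are all made up for illustration.

```python
import numpy as np
from scipy import stats

def z_test(sample, mean_pop, std_pop):
    """One-sample z-test; returns (z, two-sided p-value)."""
    sample = np.asarray(sample, dtype=float)
    z = (sample.mean() - mean_pop) / (std_pop / np.sqrt(sample.size))
    p_value = 2 * stats.norm.sf(abs(z))  # two-sided tail probability
    return z, p_value

# hypothetical measurements, tested against a population with mean 100, stdev 15
sample = [102.0, 98.0, 110.0, 105.0, 95.0, 108.0, 101.0, 99.0, 107.0]
z, p_value = z_test(sample, mean_pop=100.0, std_pop=15.0)
print(z, p_value)
```

With $\alpha=0.05$, the null hypothesis is rejected only if `p_value` is below 0.05.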

## T - Test

In [5]:
# t = (mean_sample - mean_pop) / (std_sample / sqrt(N))   # stdev estimated from the sample
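
The t statistic can be computed by hand and cross-checked against `scipy.stats.ttest_1samp`. A minimal sketch with made-up sample values, using the convention of sample mean minus population mean; the sample standard deviation uses `ddof=1`.

```python
import numpy as np
from scipy import stats

# hypothetical measurements and hypothesized population mean
sample = np.array([102.0, 98.0, 110.0, 105.0, 95.0, 108.0, 101.0, 99.0, 107.0])
mean_pop = 100.0

# manual t statistic with the unbiased sample standard deviation (ddof=1)
t_manual = (sample.mean() - mean_pop) / (sample.std(ddof=1) / np.sqrt(sample.size))

# same test with scipy; the p-value is two-sided
t_scipy, p_value = stats.ttest_1samp(sample, popmean=mean_pop)
print(t_manual, t_scipy, p_value)
```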

## Test of difference of proportions

# $z = \frac{(p_0 - p_1)}{SE} $
# $p =\frac{p_0  n_0 + p_1  n_1}{n_0+n_1}$
# $SE = \sqrt{ p  ( 1 - p )  (\frac{1}{n_0} + \frac{1}{n_1}) }$
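
The three formulas above translate directly into code. This is a sketch under assumed inputs: the helper name and the proportions and sample sizes in the example are hypothetical.

```python
import numpy as np
from scipy import stats

def proportions_z_test(p0, n0, p1, n1):
    """z-test for the difference of two proportions; returns (z, two-sided p-value)."""
    p = (p0 * n0 + p1 * n1) / (n0 + n1)                # pooled proportion
    se = np.sqrt(p * (1 - p) * (1.0 / n0 + 1.0 / n1))  # standard error
    z = (p0 - p1) / se
    return z, 2 * stats.norm.sf(abs(z))

# hypothetical example: 60% of 200 versus 50% of 200
z, p_value = proportions_z_test(p0=0.60, n0=200, p1=0.50, n1=200)
print(z, p_value)
```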

## Pearson's Chi Squared

### Independence

Are unpaired observations of two variables independent?
For a contingency table that has r rows and c columns, the chi square test can be thought of as a test of independence.  
In a test of independence the null and alternative hypotheses are:  
Ho: The two categorical variables are independent.  
Ha: The two categorical variables are related.  
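
A test of independence on a contingency table can be run with `scipy.stats.chi2_contingency`. The 2x2 table of counts below is invented for illustration, and the Yates continuity correction is switched off so that the statistic matches the plain Pearson formula.

```python
import numpy as np
from scipy import stats

# hypothetical 2x2 contingency table of observed counts
observed = np.array([[30, 10],
                     [20, 40]])

chi2, p_value, dof, expected = stats.chi2_contingency(observed, correction=False)

# dof = (r - 1) * (c - 1); expected holds the counts implied by independence
print(chi2, p_value, dof)
print(expected)
```

A small p-value means Ho (independence) is rejected in favour of Ha.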

![1](chisq_i.PNG)

### Goodness of fit

Does observed frequency differ from theoretical distribution?

In [6]:
# x2p = sum_i( (oi - ei)**2 / ei )   # oi: observed counts, ei: expected counts
# df = number of categories - 1 - number of parameters estimated from the data
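
As an illustration of the goodness of fit statistic, the sketch below tests made-up die-roll counts against a uniform expectation. `scipy.stats.chisquare` uses k - 1 degrees of freedom by default for k categories, since no parameters are fitted here.

```python
import numpy as np
from scipy import stats

# hypothetical counts of each face in 120 rolls of a die
observed = np.array([18, 22, 16, 25, 19, 20])
expected = np.full(6, observed.sum() / 6)  # a fair die: 20 per face

# statistic computed by hand, then with scipy
x2p_manual = np.sum((observed - expected) ** 2 / expected)
x2p, p_value = stats.chisquare(observed, f_exp=expected)
print(x2p_manual, x2p, p_value)
```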

### Model evaluation chi-squared

chisq = $\sum_i \frac{(model(x_i) - data(x_i))^2 }{ error_i^2}$  

where, for Poisson-distributed counts, $error_i$ is $\sqrt{data(x_i)}$

# Goodness of fit tests

### KS
- Answers: are the samples likely to come from the same parent distribution
- The Kolmogorov-Smirnov test (KS-test) tries to determine if two datasets differ significantly. The KS-test has the advantage of making no assumption about the distribution of data. (Technically speaking it is non-parametric and distribution free.)
- - This test is used to decide if a sample comes from a hypothesized continuous distribution. 
- - It is based on the empirical cumulative distribution function (ECDF) 
- - Better for looking at the center of the data.
- Hypotheses:
- - Ho: The sample is drawn from a population that follows a particular distribution.
- - Ha: The data do not follow the specified distribution.
- - KS statistic (D) is based on the largest vertical difference between the hypothesized CDF F(x) and the empirical CDF Fn(x).
- - Rejecting the Null: if the p value is larger than the significance level $\alpha$ (equivalently, if D is smaller than the critical value), we cannot reject Ho.
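
Both flavours of the KS test are available in `scipy.stats`. The sketch below uses synthetic samples (sizes and distributions chosen arbitrarily): `kstest` compares one sample with a hypothesized distribution, `ks_2samp` compares two samples.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(999)
normal_sample = rng.normal(loc=0.0, scale=1.0, size=500)
uniform_sample = rng.uniform(low=-3.0, high=3.0, size=500)

# one-sample test: does normal_sample follow a standard normal?
d_norm, p_norm = stats.kstest(normal_sample, "norm")

# two-sample test: do the two samples share a parent distribution?
d_two, p_two = stats.ks_2samp(normal_sample, uniform_sample)

alpha = 0.05
print(d_norm, p_norm, "reject Ho" if p_norm < alpha else "cannot reject Ho")
print(d_two, p_two, "reject Ho" if p_two < alpha else "cannot reject Ho")
```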


![1](ks.PNG)


### AD

- Answers: are the samples likely to come from the same parent distribution

- The Anderson–Darling test is a statistical test of whether a given sample of data is drawn from a given probability distribution. In its basic form, the test assumes that there are no parameters to be estimated in the distribution being tested, in which case the test and its set of critical values is distribution-free.
- - The AD test is a modification of the KS test; it is often used to test for gaussianity.

- - Better for looking at extremes.

- - The AD procedure is a general test to compare the fit of an observed (empirical) CDF to an expected cumulative distribution function. This test gives more weight to the tails than the KS test.

- Hypotheses
- - Ho: The data follow the specified distribution (often a gaussian).
- - Ha: The data do not follow the specified distribution.
- The AD statistic integrates the squared difference between the empirical and the hypothesized CDF, weighted by 1/(F(x)(1 - F(x))); the weight grows where F(x) is close to 0 or 1, and that’s why it's more sensitive at the tails
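
A sketch of an AD test for gaussianity with `scipy.stats.anderson`, applied to a synthetic, deliberately skewed sample. This function returns the statistic and a table of critical values rather than a p-value, so Ho is rejected when the statistic exceeds the critical value at the chosen significance level.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(999)
skewed_sample = rng.exponential(scale=1.0, size=500)  # clearly not gaussian

result = stats.anderson(skewed_sample, dist="norm")

# significance_level is given in percent; pick the entry closest to 5%
idx_5pct = int(np.argmin(np.abs(result.significance_level - 5.0)))
crit_5pct = result.critical_values[idx_5pct]

reject_normality = result.statistic > crit_5pct
print(result.statistic, crit_5pct, reject_normality)
```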


![1](AD.PNG)

### KL Divergence
The Kullback-Leibler (KL) divergence is a non-symmetric measure of the difference between two probability distributions P and Q. Specifically, the KL divergence of Q from P, denoted $D_{KL}(P \| Q)$, is a measure of the information lost when Q is used to approximate P. Typically P represents the "true" distribution of data, observations, or a precisely calculated theoretical distribution. The measure Q typically represents a theory, model, description, or approximation of P.
- There is no Null for the KL divergence. 
- Can work on a PDF, doesn’t need a CDF
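
For discrete distributions the KL divergence is $\sum_i P_i \ln(P_i / Q_i)$. The sketch below uses two made-up probability vectors and shows the asymmetry; `scipy.stats.entropy(p, q)` computes the same quantity.

```python
import numpy as np
from scipy import stats

# hypothetical discrete distributions over the same three outcomes
p = np.array([0.5, 0.3, 0.2])  # "true" distribution
q = np.array([0.4, 0.4, 0.2])  # model / approximation

kl_pq_manual = np.sum(p * np.log(p / q))  # D_KL(P || Q) by hand
kl_pq = stats.entropy(p, q)               # D_KL(P || Q)
kl_qp = stats.entropy(q, p)               # D_KL(Q || P), generally different

print(kl_pq_manual, kl_pq, kl_qp)
```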

![1](kl.PNG)


# Correlation Tests


Used to test whether two datasets are correlated by looking at paired values. They return a correlation coefficient. Note, correlation does not imply causation!!


### Spearmans

- Computes the Pearson correlation between the ranks of the paired values  
- Measures how well the relationship is described by a monotonic function, linear or not  

Returns 2-d array of correlation values  
Output (correlation coef, p value)  
Correlation coefficient: between -1 and 1. 0 implies no correlation. Correlations of -1 or +1 imply an “exact monotonic relationship”. Positive correlations imply that as x increases, so does y. Negative correlations imply that as x increases, y decreases.

![1](spearman.PNG)

### Pearsons test
Measures the strength of the linear relationship between two variables: the covariance of x and y divided by the product of their standard deviations. The test is pairwise, so the two datasets must be ordered so that the pairs match.  
Returns: Output (correlation coef, p value)  
Correlation coefficient: between -1 and 1. 0 implies no linear correlation. Correlations of -1 or +1 imply an exact linear relationship. Positive correlations imply that as x increases, so does y. Negative correlations imply that as x increases, y decreases.
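
The two coefficients can be compared on synthetic data. In this sketch y is an exact but non-linear monotonic function of x, so the rank-based Spearman coefficient is 1 while the Pearson coefficient, which measures linear association, is lower. The data are invented for illustration.

```python
import numpy as np
from scipy import stats

x = np.arange(1, 21, dtype=float)
y = np.exp(x / 4.0)  # monotonic in x, but far from linear

rho, p_rho = stats.spearmanr(x, y)  # rank correlation
r, p_r = stats.pearsonr(x, y)       # linear correlation

print(rho, p_rho)
print(r, p_r)
```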

![1](pearson.PNG)

# Likelihood

A likelihood function (often simply the likelihood) is a function of the parameters of a statistical model. Likelihood functions play a key role in statistical inference, especially methods of estimating a parameter from a set of statistics. In informal contexts, "likelihood" is often used as a synonym for "probability." But in statistical usage, a distinction is made depending on the roles of the outcome or parameter. Probability is used when describing a function of the outcome given a fixed parameter value. For example, if a coin is flipped 10 times and it is a fair coin, what is the probability of it landing heads-up every time? Likelihood is used when describing a function of a parameter given an outcome. For example, if a coin is flipped 10 times and it has landed heads-up 10 times, what is the likelihood that the coin is fair?
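
The coin example can be made concrete with the binomial formula. In this sketch the helper name is hypothetical: the same function is read as a probability when p is fixed, and as a likelihood when the outcome is fixed and p varies.

```python
import numpy as np
from math import comb

def binomial_likelihood(p, n_flips, n_heads):
    """Binomial probability of n_heads in n_flips for heads-probability p."""
    p = np.asarray(p, dtype=float)
    return comb(n_flips, n_heads) * p ** n_heads * (1.0 - p) ** (n_flips - n_heads)

n_flips, n_heads = 10, 10

# Probability: fixed parameter (fair coin), outcome = 10 heads in 10 flips.
prob_all_heads = float(binomial_likelihood(0.5, n_flips, n_heads))

# Likelihood: fixed outcome, evaluated over a grid of candidate parameters.
p_grid = np.linspace(0.0, 1.0, 101)
likelihood = binomial_likelihood(p_grid, n_flips, n_heads)
p_mle = p_grid[np.argmax(likelihood)]  # parameter value that maximizes the likelihood

print(prob_all_heads, p_mle)
```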

### Likelihood ratio test

The likelihood ratio test is a statistical test used to compare the goodness of fit of two models, one of which (the null model) is a special case of the other (the alternative model). The test is based on the likelihood ratio, which expresses how many times more likely the data are under one model than the other. This likelihood ratio, or equivalently its logarithm, can then be used to compute a p-value, or compared to a critical value to decide whether to reject the null model in favour of the alternative model. When the logarithm of the likelihood ratio is used, the statistic is known as a log-likelihood ratio statistic.

Being a function of the data, the likelihood ratio is itself a statistic. The likelihood ratio test rejects the null hypothesis if the value of this statistic is too small. How small is too small depends on the significance level of the test, i.e., on what probability of Type I error is considered tolerable ("Type I" errors consist of the rejection of a null hypothesis that is true).
The numerator corresponds to the maximum likelihood of an observed outcome under the null hypothesis. The denominator corresponds to the maximum likelihood of an observed outcome varying parameters over the whole parameter space. Because the null model is a special case of the alternative, the numerator of this ratio is never greater than the denominator. The likelihood ratio hence is between 0 and 1. Low values of the likelihood ratio mean that the observed result was less likely to occur under the null hypothesis as compared to the alternative. High values of the statistic mean that the observed outcome was nearly as likely to occur under the null hypothesis as the alternative, and the null hypothesis cannot be rejected.

![1](likelihood.PNG)

### Likelihood Ratio Test
The likelihood ratio test (LRT) is a statistical test of the goodness-of-fit between two models. A relatively more complex model is compared to a simpler model to see if it fits a particular dataset significantly better. If so, the additional parameters of the more complex model are often used in subsequent analyses. The LRT is only valid if used to compare hierarchically nested models. That is, the more complex model must differ from the simple model only by the addition of one or more parameters. Adding additional parameters will always result in a higher likelihood score. However, there comes a point when adding additional parameters is no longer justified in terms of significant improvement in fit of a model to a particular dataset. The LRT provides one objective criterion for selecting among possible models.
The LRT begins with a comparison of the log-likelihood scores of the two models, where L1 is the maximum likelihood of the more complex model and L2 that of the simpler one:
LR = 2*(lnL1-lnL2)

This LRT statistic approximately follows a chi-square distribution. To determine if the difference in likelihood scores among the two models is statistically significant, we next must consider the degrees of freedom. In the LRT, degrees of freedom is equal to the number of additional parameters in the more complex model. Using this information we can then determine the critical value of the test statistic from standard statistical tables.

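The likelihood ratio test requires the difference in the number of free parameters (degrees of freedom) between the two models to be >=1 in order to compare its result with the chi square distribution

You test fits of lines with a LR: a higher LR indicates that the more complex model fits better relative to the simpler one, and the improvement is significant only if the LR exceeds the chi square critical value.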
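
The LRT recipe can be sketched as follows. The helper name and the two log-likelihood values are hypothetical; the p-value comes from the chi square survival function with as many degrees of freedom as there are additional parameters in the more complex model.

```python
from scipy import stats

def likelihood_ratio_test(lnL_complex, lnL_simple, extra_params):
    """Return (LR, p-value) for two nested models."""
    lr = 2.0 * (lnL_complex - lnL_simple)
    p_value = stats.chi2.sf(lr, df=extra_params)
    return lr, p_value

# hypothetical log-likelihoods; the complex model has one additional parameter
lr, p_value = likelihood_ratio_test(lnL_complex=-120.3, lnL_simple=-124.9, extra_params=1)
print(lr, p_value)
```

If `p_value` is below the chosen $\alpha$, the extra parameter is justified by the improvement in fit.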

### Nested models for line/curve fitting: 
Nested if one is a special case of another. 
For example: 
y = bx + c and 
y’ = ax**2 + bx + c
are nested: the line is the special case of the parabola with a = 0.
Fitting a curve and fitting a line to the same independent variable gives nested models.
Adding complexity never worsens the fit of the model, so keep adding complexity only while the improvement in fit is significant; beyond that point the model starts capturing noise.



# Regression

### Standard Errors
The standard error of the estimate is a measure of the accuracy of predictions. The regression line minimizes the sum of squared deviations of prediction.

![1](sterr.PNG)

### Leverage
The leverage of an observation is based on how much the observation's value on
the predictor variable differs from the mean of the predictor variable. The greater
an observation's leverage, the more potential it has to be an influential observation.

### Regression towards the mean
Regression toward the mean involves outcomes that are at least partly due to
chance… This tendency of subjects with high values on a measure that includes chance and skill to score closer to the mean on a retest is called “regression toward the mean.”

### Multiple Regression
In multiple regression, the criterion is predicted by two or more variables. As in the case of simple linear regression, we define the best predictions as the predictions that minimize the squared errors of prediction.

## Least square fits (OLS, WLS) → predictive models
The method of least squares is a standard approach in regression analysis to the approximate solution of overdetermined systems, i.e., sets of equations in which there are more equations than unknowns. "Least squares" means that the overall solution minimizes the sum of the squares of the errors made in the results of every single equation.

The most important application is in data fitting. The best fit in the least-squares sense minimizes the sum of squared residuals, a residual being the difference between an observed value and the fitted value provided by a model. 

Least squares problems fall into two categories: linear (ordinary) least squares and non-linear least squares, depending on whether or not the residuals are linear in all unknowns. The linear least-squares problem occurs in statistical regression analysis; it has a closed-form solution. 

### Ordinary least squares (OLS)
Ordinary least squares (OLS), or linear least squares, is a method for estimating the unknown parameters in a linear regression model, with the goal of minimizing the differences between the observed responses in a dataset and the responses predicted by the linear approximation of the data. Visually, this is the sum of the squared vertical distances between each data point and the corresponding point on the regression line: the smaller the differences, the better the model fits the data. The resulting estimator can be expressed by a simple formula: $\hat{\beta} = (X^T X)^{-1} X^T y$
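
The closed-form OLS solution $(X^T X)^{-1} X^T y$ can be sketched with plain numpy. The synthetic data (true slope 2, intercept 1, gaussian noise) are invented for illustration; a column of ones in the design matrix provides the intercept.

```python
import numpy as np

rng = np.random.default_rng(999)

# synthetic data: y = 2x + 1 plus gaussian noise
x = np.linspace(0.0, 10.0, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=x.size)

# design matrix with a column of ones for the intercept
X = np.column_stack([np.ones_like(x), x])

# solve the normal equations (X'X) beta = X'y
beta = np.linalg.solve(X.T @ X, X.T @ y)
intercept, slope = beta

residuals = y - X @ beta
print(intercept, slope, np.sum(residuals ** 2))
```

For real analyses a library routine such as `numpy.linalg.lstsq` is numerically safer than forming `X.T @ X` explicitly.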

### Weighted least squares
A special case of generalized least squares called weighted least squares occurs when all the off-diagonal entries of Ω (the correlation matrix of the residuals) are null; the variances of the observations (along the covariance matrix diagonal) may still be unequal (heteroskedasticity).
By contrast, ordinary least squares rests on the implicit assumption that the errors are uncorrelated with each other and with the independent variables and have equal variance. 
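
A minimal WLS sketch, assuming the per-point uncertainties are known: each observation is weighted by the inverse of its variance. The synthetic data and the noise model (standard deviation growing with x) are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(999)

# synthetic heteroskedastic data: the noise grows with x
x = np.linspace(1.0, 10.0, 40)
sigma = 0.2 * x
y = 3.0 * x - 2.0 + rng.normal(scale=sigma)

X = np.column_stack([np.ones_like(x), x])
W = np.diag(1.0 / sigma ** 2)  # diagonal weight matrix: inverse variances

# weighted normal equations (X'WX) beta = X'Wy
beta_wls = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
intercept, slope = beta_wls
print(intercept, slope)
```

`numpy.polyfit(x, y, 1, w=1 / sigma)` should give the same line, since its weights multiply the unsquared residuals.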

### Curve Fitting
Curve fitting is the process of constructing a curve, or mathematical function, that has the best fit to a series of data points, possibly subject to constraints... A related topic is regression analysis, which focuses more on questions of statistical inference such as how much uncertainty is present in a curve that is fit to data observed with random errors. Fitted curves can be used as an aid for data visualization, to infer values of a function where no data are available, and to summarize the relationships among two or more variables. 
Curve fitting can be done with a polynomial curve. 
If the order of the equation is increased to a second degree polynomial, the following results:
y = ax**2 + bx + c
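
A second degree polynomial fit can be sketched with `numpy.polyfit`, which returns the coefficients from the highest power down. The synthetic data (a = 1.5, b = -0.5, c = 2 plus gaussian noise) are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(999)

# synthetic data following y = ax**2 + bx + c with noise
x = np.linspace(-3.0, 3.0, 60)
y = 1.5 * x ** 2 - 0.5 * x + 2.0 + rng.normal(scale=0.3, size=x.size)

a, b, c = np.polyfit(x, y, deg=2)     # least-squares fit of a parabola
y_fit = np.polyval([a, b, c], x)      # fitted curve evaluated at x

print(a, b, c)
```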


