### Exploratory data analysis (EDA)

- The process of organizing, plotting, and summarizing a data set.

"Exploratory data analysis can never be the whole story, but nothing else can serve as the foundation stone." (John Tukey)

#### Plotting a histogram

- Always label your axes!

#### Seaborn

- An excellent Matplotlib-based statistical data visualization package written by Michael Waskom

~~~
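import matplotlib.pyplot as plt
import seaborn as sns

sns.set()  # Apply Seaborn's default aesthetic parameters.

# Histogram of the Democratic vote share per county.
_ = plt.hist(df_swing['dem_share'])
_ = plt.xlabel('percent of vote for Obama')
_ = plt.ylabel('number of counties')
plt.show()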
~~~

#### Bee swarm plot

~~~
_ = sns.swarmplot(x='state', y='dem_share', data=df_swing)

_ = plt.xlabel('state')
_ = plt.ylabel('percent of vote for Obama')

plt.show()
~~~

#### Empirical Cumulative Distribution Function (ECDF)

~~~
import numpy as np

x = np.sort(df_swing['dem_share'])

y = np.arange(1, len(x)+1) / len(x)

_ = plt.plot(x, y, marker='.', linestyle='none')

_ = plt.xlabel('percent of vote for Obama')
_ = plt.ylabel('ECDF')

plt.margins(0.02)

plt.show()
~~~
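The ECDF computation above can be wrapped in a small reusable function. This is a minimal sketch; the name `ecdf` and the three-point input are illustrative assumptions, not part of the original notes.

```python
import numpy as np

def ecdf(data):
    """Return the x and y arrays of the ECDF of a 1D data set (hypothetical helper)."""
    x = np.sort(data)                      # sorted observations
    y = np.arange(1, len(x) + 1) / len(x)  # fraction of points <= each x
    return x, y

# Made-up data, for illustration only.
x, y = ecdf([3.0, 1.0, 2.0])
print(x)
print(y)
```

Each `y` value is the fraction of observations less than or equal to the matching `x`, so the last `y` is always 1.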


#### Mean

$\bar{x} = \displaystyle\frac{1}{n} \displaystyle\sum_{i=1}^{n} x_i$

~~~
import numpy as np

print(np.mean(dem_share_PA))
~~~

- Heavily influenced by outliers.

#### Median

- Middle value of a data set

~~~
print(np.median(dem_share_PA))
~~~

#### Percentile

~~~
np.percentile(df_swing['dem_share'], [25, 50, 75])
~~~

#### Box plot

~~~
import matplotlib.pyplot as plt
import seaborn as sns

_ = sns.boxplot(x='east_west', y='dem_share', data=df_all_states)

_ = plt.xlabel('region')
_ = plt.ylabel('percent of vote for Obama')

plt.show()
~~~

- Middle of the box: median
- Edges of the box: 25th and 75th percentiles
- Total height of the box: IQR (middle 50% of data)
- Whiskers: extend to the most extreme data points that lie within 1.5 IQR of the box edges
- Outside of whiskers: outliers
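The quantities a box plot displays can also be computed directly with NumPy. The sketch below uses made-up numbers and assumes the usual 1.5 IQR whisker rule; the variable names are illustrative.

```python
import numpy as np

# Made-up values, for illustration only.
data = np.array([2.0, 4.0, 4.5, 5.0, 5.5, 6.0, 6.5, 7.0, 20.0])

# Box: 25th percentile, median, 75th percentile.
q1, median, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1

# Fences at 1.5 IQR beyond the box edges.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers end at the most extreme data points inside the fences.
inside = data[(data >= lower_fence) & (data <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()

# Points beyond the fences are drawn individually as outliers.
outliers = data[(data < lower_fence) | (data > upper_fence)]

print(q1, median, q3, iqr)
print(whisker_low, whisker_high, outliers)
```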


#### Variance

- Average squared distance from the mean.

$\sigma^2 = \displaystyle\frac{1}{n} \displaystyle\sum_{i=1}^{n} \big(x_i - \bar{x}\big)^2$

~~~
np.var(dem_share_FL)
~~~

#### Standard deviation

- Square root of the variance

$\sigma = \sqrt{\sigma^2}$

~~~
np.std(dem_share_FL)
~~~
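As a quick check, the variance and standard deviation formulas can be evaluated by hand and compared with NumPy. The data below are made up for illustration; note that `np.var` and `np.std` divide by $n$ by default, matching the formulas above.

```python
import numpy as np

# Made-up values, for illustration only.
data = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

mean = np.sum(data) / len(data)                    # x-bar
variance = np.sum((data - mean) ** 2) / len(data)  # average squared distance
std = np.sqrt(variance)                            # square root of the variance

print(mean, variance, std)
print(np.var(data), np.std(data))  # same values from NumPy
```

Passing `ddof=1` to `np.var` or `np.std` divides by $n - 1$ instead, which gives the sample (unbiased) variance.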

#### Covariance

- A measure of how two quantities vary *together*
- The mean of the product of the differences of each data point to the means (x and y)

$Cov(x,y) = \displaystyle\frac{1}{n} \displaystyle\sum_{i=1}^{n} \big(x_i - \bar{x}\big)\big(y_i - \bar{y}\big)$

~~~
np.cov(x, y)  # covariance MATRIX [[cov(x,x), cov(x,y)], [cov(y,x), cov(y,y)]]
# Divides by n - 1 by default; pass ddof=0 to match the 1/n formula above.
~~~
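The sketch below evaluates the covariance formula directly and compares it with the off-diagonal entry of `np.cov`. The arrays are made up; the `ddof=0` argument is needed because `np.cov` normalizes by $n - 1$ unless told otherwise.

```python
import numpy as np

# Made-up paired values, for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 5.0, 9.0])

# Mean of the products of the deviations from the means.
cov_formula = np.mean((x - np.mean(x)) * (y - np.mean(y)))

# Entry [0, 1] of the covariance matrix is cov(x, y).
cov_matrix = np.cov(x, y, ddof=0)  # normalize by n
cov_default = np.cov(x, y)[0, 1]   # default: normalize by n - 1

print(cov_formula, cov_matrix[0, 1], cov_default)
```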

#### Pearson correlation coefficient

$\rho_{x,y} = \displaystyle\frac{Cov(x,y)}{\sigma_{x}\ \sigma_{y}}$

- Variability of the data due to codependence (covariance) divided by the independent variability (std. deviations).
- ranges from -1 to 1

~~~
np.corrcoef(x,y) # correlation MATRIX
~~~
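Because `np.corrcoef` returns a matrix, a small wrapper is convenient. The helper name `pearson_r` and the data are assumptions for illustration; the second computation follows the formula above term by term.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation coefficient between two arrays (hypothetical helper)."""
    return np.corrcoef(x, y)[0, 1]  # off-diagonal entry of the 2x2 matrix

# Made-up paired values, for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Same quantity from the definition: covariance over the product of std. deviations.
cov_xy = np.mean((x - np.mean(x)) * (y - np.mean(y)))
rho_manual = cov_xy / (np.std(x) * np.std(y))

print(pearson_r(x, y), rho_manual)
```

The normalization convention ($n$ or $n - 1$) cancels in the ratio, so both routes agree.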

### Probability distribution

- A mathematical description of the probabilities of the possible outcomes

#### Poisson Distribution

- Limit of the Binomial distribution for low probability of success and large number of Bernoulli trials - that is, for rare events.
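The limit can be illustrated numerically by sampling from both distributions. This is a sketch under assumed values ($n = 10000$, $p = 0.0005$, so $np = 5$), using NumPy's `default_rng` generator with a fixed seed.

```python
import numpy as np

rng = np.random.default_rng(42)

# Assumed values: many Bernoulli trials with a rare success, n * p = 5.
n, p = 10_000, 0.0005

binom_samples = rng.binomial(n, p, size=100_000)
poisson_samples = rng.poisson(n * p, size=100_000)

# Both should have mean and variance close to n * p.
print(binom_samples.mean(), binom_samples.var())
print(poisson_samples.mean(), poisson_samples.var())
```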

#### The exponential distribution

- The waiting time between arrivals of a Poisson process is Exponentially distributed.
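The relationship can be checked in the other direction: drawing Exponential waiting times and counting arrivals per unit interval should give Poisson-like counts. The rate of 3 arrivals per unit time is an assumed value for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

rate = 3.0  # assumed average number of arrivals per unit time

# Exponentially distributed waiting times with mean 1 / rate.
waits = rng.exponential(scale=1 / rate, size=300_000)
arrival_times = np.cumsum(waits)

# Count arrivals in each complete unit-length interval.
n_intervals = int(arrival_times[-1])
counts = np.bincount(arrival_times.astype(int), minlength=n_intervals)[:n_intervals]

# For a Poisson distribution, mean and variance both equal the rate.
print(counts.mean(), counts.var())
```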

### Optimal parameters

- Parameter values that bring the model in closest agreement with the data

#### Packages to do statistical inference

- scipy.stats
- statsmodels
- numpy (hacker stats)

### Least Squares

- The process of finding the parameters for which the sum of the squares of the residuals is minimal.

#### Least squares with *np.polyfit()*

~~~
# Fit a first-degree polynomial (a straight line) by least squares.
slope, intercept = np.polyfit(total_votes, dem_share, 1)
~~~
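To see that the fitted parameters do minimize the sum of squared residuals, one can compare against slightly perturbed parameters. The synthetic data (true slope 2, intercept 1) and the helper `rss` are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data: a line with slope 2 and intercept 1 plus Gaussian noise.
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(0, 0.5, size=x.size)

slope, intercept = np.polyfit(x, y, 1)

def rss(a, b):
    """Sum of squared residuals for the line y = a * x + b (hypothetical helper)."""
    return np.sum((y - (a * x + b)) ** 2)

best = rss(slope, intercept)
print(slope, intercept, best)

# Any other choice of parameters gives a larger sum of squares.
print(rss(slope + 0.1, intercept) > best, rss(slope, intercept - 0.1) > best)
```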


### Resampling an array

Data:
~~~
[23.3, 27.1, 24.3, 25.7, 26.0] # Mean = 25.28
~~~

Resampled data (with replacement):
~~~
[27.1, 26.0, 23.3, 25.7, 23.3] # Mean = 25.08
~~~

### Bootstrapping

- The use of resampled data to perform statistical inference.
- Bootstrap sample: a resampled array of data.
- Bootstrap replicate: a statistic computed from a resampled array.

#### Resampling engine: *np.random.choice()*

~~~
import numpy as np

bs_sample = np.random.choice(data, size=len(data))

print(np.mean(bs_sample))
print(np.median(bs_sample))
print(np.std(bs_sample))
~~~

#### Bootstrap replicate function

~~~
def bootstrap_replicate_1d(data, func):
	"""Generate bootstrap replicate of 1D data."""
	bs_sample = np.random.choice(data,size=len(data))
	return func(bs_sample)
~~~

#### Many bootstrap replicates

~~~
N = 10000

bs_replicates = np.empty(N)

for i in range(N):
	bs_replicates[i] = bootstrap_replicate_1d(
			michelson_speed_of_light, np.mean)

# density=True normalizes the histogram (the old normed argument was removed from Matplotlib).
_ = plt.hist(bs_replicates, bins=30, density=True)
_ = plt.xlabel('mean speed of light (km/s)')
_ = plt.ylabel('PDF')

plt.show()
~~~
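The loop above can be wrapped in a helper so that drawing many replicates is a single call. The body of `draw_bs_reps` below is an assumed implementation built on `bootstrap_replicate_1d`; the five-point data set is the one from the resampling example.

```python
import numpy as np

def bootstrap_replicate_1d(data, func):
    """Generate one bootstrap replicate of 1D data."""
    bs_sample = np.random.choice(data, size=len(data))
    return func(bs_sample)

def draw_bs_reps(data, func, size=1):
    """Draw `size` bootstrap replicates of the statistic `func` (assumed implementation)."""
    return np.array([bootstrap_replicate_1d(data, func) for _ in range(size)])

np.random.seed(42)  # fixed seed so the sketch is reproducible

data = np.array([23.3, 27.1, 24.3, 25.7, 26.0])
bs_replicates = draw_bs_reps(data, np.mean, size=1000)

print(bs_replicates.mean(), bs_replicates.std())
```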

### Confidence interval of a statistic

- If we repeated the measurements over and over again, $p$% of the observed values of the statistic would lie within the $p$% confidence interval.

#### Bootstrap confidence interval

~~~
ci = np.percentile(bs_replicates, [2.5, 97.5]) # 95%
~~~

In fact, it can be shown theoretically that, under not-too-restrictive conditions, the mean of a sufficiently large sample is approximately Normally distributed. (This does not hold in general, just for the mean and a few other statistics.) The standard deviation of this distribution, called the standard error of the mean, or SEM, is given by the standard deviation of the data divided by the square root of the number of data points. I.e., for a data set, sem = np.std(data) / np.sqrt(len(data)).
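A short sketch comparing the SEM formula with the standard deviation of bootstrap replicates of the mean. The synthetic Normal data and the replicate count of 5000 are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic data set, for illustration only.
data = rng.normal(loc=10.0, scale=2.0, size=200)

# Standard error of the mean from the formula.
sem = np.std(data) / np.sqrt(len(data))

# Standard deviation of bootstrap replicates of the mean.
bs_means = np.array([
    np.mean(rng.choice(data, size=len(data)))  # resample with replacement
    for _ in range(5000)
])
bs_std = np.std(bs_means)

print(sem, bs_std)
```

The two numbers should agree to within a few percent, with the gap shrinking as the number of replicates grows.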

### Nonparametric inference

- Make no assumptions about the model or probability distribution underlying the data.

### Pairs bootstrap for linear regression

- Resample data in pairs.
- Compute slope and intercept from resampled data.
- Each slope and intercept is a bootstrap replicate.

#### Generating a pairs bootstrap sample

~~~
inds = np.arange(len(total_votes)) # obtaining indices

bs_inds = np.random.choice(inds,len(inds)) # resampling indices

bs_total_votes = total_votes[bs_inds]
bs_dem_share = dem_share[bs_inds]

bs_slope, bs_intercept = np.polyfit(bs_total_votes, bs_dem_share, 1)
~~~
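Repeating the steps above many times gives a distribution of slopes and intercepts, from which a confidence interval follows. The function name `draw_bs_pairs_linreg` and the synthetic data are assumptions for illustration.

```python
import numpy as np

def draw_bs_pairs_linreg(x, y, size=1):
    """Pairs bootstrap for linear regression (hypothetical helper)."""
    inds = np.arange(len(x))
    bs_slopes = np.empty(size)
    bs_intercepts = np.empty(size)
    for i in range(size):
        # Resample indices so each (x, y) pair stays together.
        bs_inds = np.random.choice(inds, size=len(inds))
        bs_slopes[i], bs_intercepts[i] = np.polyfit(x[bs_inds], y[bs_inds], 1)
    return bs_slopes, bs_intercepts

np.random.seed(0)

# Synthetic data: slope 3, intercept -2, unit Gaussian noise.
x = np.linspace(0, 10, 40)
y = 3.0 * x - 2.0 + np.random.normal(0, 1.0, size=x.size)

slopes, intercepts = draw_bs_pairs_linreg(x, y, size=500)
slope_ci = np.percentile(slopes, [2.5, 97.5])  # 95% confidence interval

print(slope_ci)
```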

### Hypothesis testing

- Assessment of how reasonable the observed data are assuming a hypothesis is true.
- Null hypothesis: another name for the hypothesis you are testing.

#### Permutation

- Random reordering of entries in an array.
- Generating a permutation sample:

~~~
import numpy as np

dem_share_both = np.concatenate((dem_share_PA, dem_share_OH))

dem_share_perm = np.random.permutation(dem_share_both)

perm_sample_PA = dem_share_perm[:len(dem_share_PA)]
perm_sample_OH = dem_share_perm[len(dem_share_PA):]
~~~

### Test statistic

- A single number that can be computed from observed data and from data you simulate under the null hypothesis.
- It serves as a basis of comparison between the two.

### p-value

- The probability of obtaining a value of your test statistic that is at least as extreme as what was observed, under the assumption the null hypothesis is true.

- NOT the probability that the null hypothesis is true.

- Statistical significance: determined by the smallness of a p-value -> Null hypothesis significance testing (NHST)

### Pipeline for hypothesis testing

1. Clearly state the null hypothesis.
2. Define your test statistic.
3. Generate many sets of simulated data assuming the null hypothesis is true.
4. Compute the test statistic for each simulated data set.
5. The p-value is the fraction of your simulated data for which the test statistic is at least as extreme as for the real data.
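The five steps can be sketched as a two-sample permutation test. Here the null hypothesis is that both groups come from the same distribution and the test statistic is the difference of means; the helper names and the synthetic groups are assumptions for illustration.

```python
import numpy as np

def permutation_sample(data_1, data_2):
    """Scramble the pooled data and split it back into two groups."""
    both = np.concatenate((data_1, data_2))
    perm = np.random.permutation(both)
    return perm[:len(data_1)], perm[len(data_1):]

def diff_of_means(data_1, data_2):
    """Test statistic: difference of the two group means."""
    return np.mean(data_1) - np.mean(data_2)

def draw_perm_reps(data_1, data_2, func, size=1):
    """Compute `size` permutation replicates of the statistic `func`."""
    perm_reps = np.empty(size)
    for i in range(size):
        sample_1, sample_2 = permutation_sample(data_1, data_2)
        perm_reps[i] = func(sample_1, sample_2)
    return perm_reps

np.random.seed(1)

# Synthetic groups whose true means differ by 5.
group_a = np.random.normal(50.0, 5.0, size=60)
group_b = np.random.normal(45.0, 5.0, size=60)

diff_obs = diff_of_means(group_a, group_b)
perm_reps = draw_perm_reps(group_a, group_b, diff_of_means, size=5000)

# Fraction of simulated statistics at least as extreme as the observed one.
p_value = np.sum(perm_reps >= diff_obs) / len(perm_reps)
print(diff_obs, p_value)
```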

### Example

#### Null hypothesis

- The true mean speed of light in Michelson's experiments was actually Newcomb's reported value.

#### Shifting the Michelson data

~~~
newcomb_value = 299860

michelson_shifted = michelson_speed_of_light \
	- np.mean(michelson_speed_of_light) + newcomb_value
~~~

#### Calculating the test statistic

~~~
def diff_from_newcomb(data, newcomb_value=299860):
	return np.mean(data) - newcomb_value

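diff_obs = diff_from_newcomb(michelson_speed_of_light)

# draw_bs_reps(data, func, size) is assumed to return `size` bootstrap replicates of func.
bs_replicates = draw_bs_reps(michelson_shifted, diff_from_newcomb, 10000)

# Fraction of replicates at least as extreme (as low) as the observed difference.
p_value = np.sum(bs_replicates <= diff_obs) / 10000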

print(p_value)
~~~

### One sample test

- Compare one set of data to a single number.

### Two sample test

- Compare two sets of data.

### A/B test

- Used by organizations to see if a strategy change gives a better result.
- Null hypothesis: the test statistic is impervious to the change.

### Hypothesis test of correlation

- Posit null hypothesis: the two variables are completely uncorrelated.
- Simulate data assuming null hypothesis is true.
- Use Pearson correlation $\rho$ as test statistic.
- Compute the p-value as the fraction of replicates that have $\rho$ at least as large as observed.
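One way to simulate the null hypothesis is to permute one variable while leaving the other fixed, which destroys any correlation between them. The sketch below assumes that approach and uses synthetic data; `pearson_r` is a hypothetical helper around `np.corrcoef`.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation coefficient between two arrays (hypothetical helper)."""
    return np.corrcoef(x, y)[0, 1]

np.random.seed(2)

# Synthetic, positively correlated data.
x = np.random.normal(size=100)
y = 0.5 * x + np.random.normal(size=100)

r_obs = pearson_r(x, y)

# Permute x only: the pairing with y is broken, so the null hypothesis holds.
perm_replicates = np.empty(5000)
for i in range(len(perm_replicates)):
    x_permuted = np.random.permutation(x)
    perm_replicates[i] = pearson_r(x_permuted, y)

# Fraction of replicates with a correlation at least as large as observed.
p_value = np.sum(perm_replicates >= r_obs) / len(perm_replicates)
print(r_obs, p_value)
```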
