### Exploratory data analysis (EDA)

- The process of organizing, plotting, and summarizing a data set.

"Exploratory data analysis can never be the whole story, but nothing else can serve as the foundation stone." (John Tukey)

#### Plotting a histogram

- Always label your axes!

#### Seaborn

- An excellent Matplotlib-based statistical data visualization package written by Michael Waskom

~~~
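import matplotlib.pyplot as plt
import seaborn as sns

# df_swing is assumed to be a pandas DataFrame that was loaded earlier.
sns.set() # Set Seaborn's default aesthetic parameters.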

_ = plt.hist(df_swing['dem_share'])
_ = plt.xlabel('percent of vote for Obama')
_ = plt.ylabel('number of counties')
plt.show()
~~~

#### Bee swarm plot

~~~
_ = sns.swarmplot(x='state', y='dem_share', data=df_swing)

_ = plt.xlabel('state')
_ = plt.ylabel('percent of vote for Obama')

plt.show()
~~~

#### Empirical Cumulative Distribution Function (ECDF)

~~~
import numpy as np

x = np.sort(df_swing['dem_share'])

y = np.arange(1, len(x)+1) / len(x)

_ = plt.plot(x, y, marker='.', linestyle='none')

_ = plt.xlabel('percent of vote for Obama')
_ = plt.ylabel('ECDF')

plt.margins(0.02)

plt.show()
~~~
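
The ECDF at a value x is the fraction of data points less than or equal to x. The steps above can be wrapped in a small helper; the function name `ecdf` and the sample values are assumptions made for illustration, not part of the original notes.

```python
import numpy as np

def ecdf(data):
    """Return the x and y arrays of the ECDF of a one-dimensional data set."""
    x = np.sort(data)                       # sorted values go on the x-axis
    y = np.arange(1, len(x) + 1) / len(x)   # fraction of points <= each x
    return x, y

# Made-up sample values, for illustration only.
sample = np.array([52.0, 47.5, 61.2, 47.5, 55.1])
x, y = ecdf(sample)
print(x)
print(y)
```

The returned arrays can be passed straight to `plt.plot(x, y, marker='.', linestyle='none')`.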


#### Mean

$\bar{x} = \displaystyle\frac{1}{n} \displaystyle\sum_{i=1}^{n} x_i$

~~~
import numpy as np

print(np.mean(dem_share_PA))
~~~

- Heavily influenced by outliers.

#### Median

- The middle value of a sorted data set (the 50th percentile).
- Much less sensitive to outliers than the mean.

~~~
print(np.median(dem_share_PA))
~~~
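
A minimal sketch of how differently the mean and the median react to a single outlier. The arrays are made-up values, not election data.

```python
import numpy as np

# Made-up values; the second array replaces the largest point with an outlier.
clean = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
with_outlier = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

mean_clean, median_clean = np.mean(clean), np.median(clean)
mean_out, median_out = np.mean(with_outlier), np.median(with_outlier)

print(mean_clean, median_clean)  # both statistics agree on the clean data
print(mean_out, median_out)      # the mean shifts, the median does not
```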

#### Percentile

~~~
np.percentile(df_swing['dem_share'], [25, 50, 75])
~~~
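
The p-th percentile is the value below which roughly p percent of the data fall, so the 50th percentile is the median. A small sketch on made-up data, which also computes the interquartile range (IQR) as the distance between the 25th and 75th percentiles:

```python
import numpy as np

# Made-up values, for illustration only.
data = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0])

p25, p50, p75 = np.percentile(data, [25, 50, 75])
iqr = p75 - p25  # spread of the middle 50% of the data

print(p25, p50, p75, iqr)
```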

#### Box plot

~~~
import matplotlib.pyplot as plt
import seaborn as sns

_ = sns.boxplot(x='east_west', y='dem_share', data=df_all_states)

_ = plt.xlabel('region')
_ = plt.ylabel('percent of vote for Obama')

plt.show()
~~~

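- Line in the middle of the box: median
- Edges of the box: 25th and 75th percentiles
- Total height of the box: interquartile range (IQR), the middle 50% of the data
- Whiskers: extend from the box edges to the most extreme data points that lie within 1.5 IQR of the edges
- Points beyond the whiskers: plotted individually as outliers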
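
The box plot elements can also be computed by hand. The sketch below assumes the common 1.5 IQR rule (Matplotlib's default `whis=1.5`), under which whiskers end at the most extreme data points still inside the fences. The data and variable names are made up for illustration.

```python
import numpy as np

# Made-up values; 30.0 is deliberately far from the rest.
data = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 30.0])

q1, q3 = np.percentile(data, [25, 75])  # edges of the box
iqr = q3 - q1                           # height of the box
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers stop at the most extreme points that are still inside the fences.
inside = data[(data >= lower_fence) & (data <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()

# Everything outside the fences is drawn as an individual outlier.
outliers = data[(data < lower_fence) | (data > upper_fence)]

print(whisker_low, whisker_high, outliers)
```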


#### Variance

- Average squared distance from the mean.

$\sigma^2 = \displaystyle\frac{1}{n} \displaystyle\sum_{i=1}^{n} \big(x_i - \bar{x}\big)^2$

~~~
np.var(dem_share_FL)
~~~

#### Standard deviation

- Square root of the variance

$\sigma = \sqrt{\sigma^2}$

~~~
np.std(dem_share_FL)
~~~
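
A quick check, on made-up data, that the two formulas agree with NumPy. By default `np.var` and `np.std` divide by n (`ddof=0`), which matches the definitions given here.

```python
import numpy as np

# Made-up values, for illustration only.
data = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

mean = np.mean(data)
variance = np.mean((data - mean) ** 2)  # average squared distance from the mean
std = np.sqrt(variance)                 # same units as the data

print(variance, np.var(data))
print(std, np.std(data))
```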

#### Covariance

- A measure of how two quantities vary *together*
- The mean of the products of each data point's deviations from the mean of x and from the mean of y

$Cov(x,y) = \displaystyle\frac{1}{n} \displaystyle\sum_{i=1}^{n} \big(x_i - \bar{x}\big)\big(y_i - \bar{y}\big)$

~~~
# Covariance MATRIX [[cov(x,x), cov(x,y)], [cov(y,x), cov(y,y)]]; normalizes by n-1 unless ddof=0 is passed.
np.cov(x, y)
~~~
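
A small sketch comparing the 1/n formula with `np.cov` on made-up data. Note that `np.cov` divides by n-1 by default, so `ddof=0` is passed to reproduce the formula exactly.

```python
import numpy as np

# Made-up paired values, for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# Mean of the products of the deviations from each mean.
cov_manual = np.mean((x - np.mean(x)) * (y - np.mean(y)))

cov_matrix = np.cov(x, y, ddof=0)  # 2x2 matrix, normalized by n
cov_xy = cov_matrix[0, 1]          # off-diagonal entry is cov(x, y)

print(cov_manual, cov_xy)
print(np.cov(x, y)[0, 1])          # default n-1 normalization gives a larger value
```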

#### Pearson correlation coefficient

$\rho_{x,y} = \displaystyle\frac{Cov(x,y)}{\sigma_{x}\ \sigma_{y}}$

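- Variability of the data due to codependence (covariance) divided by the independent variability (std. deviations).
- Dimensionless; ranges from -1 (perfect negative linear relationship) to 1 (perfect positive linear relationship), with 0 meaning no linear correlation.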

~~~
np.corrcoef(x, y) # correlation MATRIX; entry [0, 1] is the Pearson coefficient of x and y
~~~
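
A minimal sketch, on made-up data, that builds the coefficient from the covariance and the standard deviations and compares it with the off-diagonal entry of `np.corrcoef`. The normalization (n or n-1) cancels in the ratio, so both routes give the same number.

```python
import numpy as np

# Made-up paired values, for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

cov_xy = np.mean((x - x.mean()) * (y - y.mean()))  # covariance with 1/n
r_manual = cov_xy / (np.std(x) * np.std(y))        # divide by both std. deviations
r_numpy = np.corrcoef(x, y)[0, 1]                  # off-diagonal entry of the matrix

print(r_manual, r_numpy)
```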

### Probability distribution

- A mathematical description of the probabilities of the possible outcomes of a random process

#### Poisson Distribution

- The limit of the Binomial distribution for a large number of Bernoulli trials n and a low probability of success p (with the mean np held fixed), that is, for rare events.
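
The limit can be checked by simulation. The sketch below draws samples from a Binomial distribution with many trials and a small success probability, and from a Poisson distribution with the same mean; the parameter values and the seed are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)  # arbitrary seed, for reproducibility

n_trials = 10_000                # large number of Bernoulli trials
p_success = 0.0006               # low probability of success
lam = n_trials * p_success       # matching Poisson mean (about 6)

size = 100_000
binomial_samples = rng.binomial(n_trials, p_success, size=size)
poisson_samples = rng.poisson(lam, size=size)

# For a Poisson distribution the mean and the variance are both equal to lam.
print(binomial_samples.mean(), binomial_samples.var())
print(poisson_samples.mean(), poisson_samples.var())
```

In this regime the two samples should have nearly the same mean and variance, and their ECDFs should be hard to tell apart.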

#### The exponential distribution

- The waiting time between arrivals of a Poisson process is exponentially distributed.
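
The link between the two distributions can be sketched by simulation: draw exponential waiting times, accumulate them into arrival times, and count arrivals per unit-length window. If the statement holds, the counts should look Poisson, with mean and variance both close to the rate. The rate, the sample size, and the seed below are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility

rate = 3.0                      # assumed average number of arrivals per unit time
waits = rng.exponential(scale=1.0 / rate, size=300_000)  # waiting times
arrivals = np.cumsum(waits)     # arrival times of the simulated process

# Count the arrivals that fall in each complete unit-length window.
n_windows = int(arrivals[-1])
window_index = arrivals[arrivals < n_windows].astype(int)
counts = np.bincount(window_index, minlength=n_windows)

print(waits.mean())                # close to 1 / rate
print(counts.mean(), counts.var()) # both close to rate
```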