## Exploratory Data Analysis (EDA)

The process of organizing, plotting, and summarizing a data set.

## Histograms

Code for histogram:
```
import matplotlib.pyplot as plt

# df is a pandas DataFrame; 'column' and the label strings are placeholders
_ = plt.hist(df['column'])
_ = plt.xlabel('X Label')
_ = plt.ylabel('Y Label')
plt.show()
```

Manually specifying bins:
```
bin_edges = [0, 10, 20, 30, 40, 50]
_ = plt.hist(df['column'], bins=bin_edges)
```

Setting the number of bins to N:
```
_ = plt.hist(df['column'], bins=N)
```

Using Seaborn:
```
import matplotlib.pyplot as plt
import seaborn as sns

sns.set()  # set default Seaborn style
_ = plt.hist(df['column'])  # df and 'column' are placeholders
_ = plt.xlabel('X Label')
_ = plt.ylabel('Y Label')
plt.show()
```

## Bee Swarm Plots
Also known as a swarm plot.  Requires a DataFrame in which each column is a feature and each row is an observation.

Problems with histograms that a swarm plot avoids:
* binning bias - the same data can look different depending on the number of bins, which may lead to different interpretations
* loss of individual data points
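
Binning bias can be seen without drawing anything. The sketch below uses `np.histogram`, which computes the same counts that `plt.hist` would draw. The array `data` is synthetic and the bin counts (5 and 20) are arbitrary choices for illustration.

```python
import numpy as np

# Hypothetical sample: 200 draws from a normal distribution (illustration only)
rng = np.random.default_rng(42)
data = rng.normal(loc=50, scale=10, size=200)

# Same data, two different bin counts
counts_5, edges_5 = np.histogram(data, bins=5)
counts_20, edges_20 = np.histogram(data, bins=20)

print(counts_5)   # 5 coarse bars
print(counts_20)  # 20 finer bars: same data, different picture
```

Both arrays account for all 200 points, but the shape they suggest can differ noticeably.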

Using Seaborn to generate a bee swarm plot.
```
# 'x_column' and 'y_column' are placeholder column names in the DataFrame df
_ = sns.swarmplot(x='x_column', y='y_column', data=df)
_ = plt.xlabel('X Label')
_ = plt.ylabel('Y Label')
plt.show()
```

## Empirical Cumulative Distribution Function (ECDF)
In cases where classification categories are blurred, we can use an ECDF, which plots each data value against the fraction of data points less than or equal to it.

```
import numpy as np
x = np.sort(df['column'])              # sorted data values; df and 'column' are placeholders
y = np.arange(1, len(x) + 1) / len(x)  # fraction of points <= each x
_ = plt.plot(x, y, marker='.', linestyle='none')
_ = plt.xlabel('X Label')
_ = plt.ylabel('Y Label')
plt.margins(0.02)  # keeps data off plot edges
plt.show()
```
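
The ECDF computation can be wrapped in a small reusable function. This is a minimal sketch; the name `ecdf` and the four-point input are assumptions made for illustration.

```python
import numpy as np

def ecdf(data):
    """Return x, y arrays for the ECDF of a one-dimensional data set."""
    x = np.sort(data)                      # sorted observations
    y = np.arange(1, len(x) + 1) / len(x)  # fraction of points <= each x
    return x, y

x, y = ecdf([3.0, 1.0, 2.0, 4.0])
# x is 1, 2, 3, 4 and y is 0.25, 0.5, 0.75, 1.0
```

The largest observation always maps to 1.0, because every data point is less than or equal to it.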

## Summary Statistics

Mean (average):
```
import numpy as np
np.mean(Data)
```

Median (middle value):
```
import numpy as np
np.median(Data)
```

Percentiles:
```
np.percentile(Data, [25, 50, 75])  # second argument: array of percentiles
```
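
A small worked example of the three calls above. The data are made up and include one large outlier, to show how the mean and the median respond differently.

```python
import numpy as np

# Hypothetical data with one large outlier (illustration only)
data = np.array([1, 2, 3, 4, 100])

mean = np.mean(data)      # 22.0, pulled up by the outlier
median = np.median(data)  # 3.0, unaffected by the outlier
p25, p50, p75 = np.percentile(data, [25, 50, 75])  # 2.0, 3.0, 4.0
```

The 50th percentile is the median, so `p50` and `median` agree.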

Box plot (the box spans the interquartile range, from the 25th to the 75th percentile; the line inside the box is the 50th percentile, i.e. the median):
```
# 'x_column' and 'y_column' are placeholder column names in the DataFrame df
_ = sns.boxplot(x='x_column', y='y_column', data=df)
_ = plt.xlabel('X Label')
_ = plt.ylabel('Y Label')
plt.show()
```
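
The quantities a box plot draws can also be computed directly. The sketch below assumes the common default of whiskers reaching at most 1.5 times the interquartile range beyond the box, with points past that limit drawn as outliers. The data are invented for illustration.

```python
import numpy as np

# Hypothetical data with one extreme value (illustration only)
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])

q1, q2, q3 = np.percentile(data, [25, 50, 75])  # 3.25, 5.5, 7.75
iqr = q3 - q1                                   # 4.5, the height of the box

# Limits beyond which points are treated as outliers
lower = q1 - 1.5 * iqr  # -3.5
upper = q3 + 1.5 * iqr  # 14.5
outliers = data[(data < lower) | (data > upper)]  # only the value 30
```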

Variance: the average of the squared distance from the mean, which quantifies the spread of the data.

Standard deviation: the square root of the variance.
```
np.std(Data)
np.sqrt(np.var(Data))
```
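
The definitions can be checked by hand against NumPy. The eight data values below are chosen only because they give round numbers.

```python
import numpy as np

# Hypothetical data (illustration only)
data = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

mean = np.mean(data)                     # 5.0
variance = np.mean((data - mean) ** 2)   # 4.0, average squared distance from the mean
std = np.sqrt(variance)                  # 2.0

# np.var and np.std use the same definition by default (they divide by N)
print(variance == np.var(data), std == np.std(data))
```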

Scatter Plots:
```
_ = plt.plot(x, y, marker='.', linestyle='none')
_ = plt.xlabel('X Label')
_ = plt.ylabel('Y Label')
plt.show()
```

### Covariance
The mean of the product of each point's x and y differences from the x-mean and y-mean, respectively.  If x tends to be above its mean when y is above its mean (and below when y is below), the covariance is positive and the variables are positively correlated.  If x tends to be above its mean when y is below its mean, the covariance is negative and the variables are negatively correlated.
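
Covariance matrix, and extracting the covariance from it:
```
np.cov(X, Y)        # 2x2 matrix; the diagonal entries are the variances of X and Y
np.cov(X, Y)[0, 1]  # off-diagonal entry: the covariance of X and Y
```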
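
One detail worth checking: `np.cov` divides by N - 1 by default, while the definition above is a plain mean (division by N). Passing `ddof=0` makes the two agree. The arrays below are invented for illustration.

```python
import numpy as np

# Hypothetical data (illustration only)
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 6.0, 8.0])

# Definition from the text: mean of the products of the deviations
cov_manual = np.mean((x - np.mean(x)) * (y - np.mean(y)))  # 2.5

# ddof=0 divides by N, matching the definition
cov_numpy = np.cov(x, y, ddof=0)[0, 1]  # 2.5

# The default divides by N - 1, giving a slightly larger value
cov_default = np.cov(x, y)[0, 1]  # 10 / 3
```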

### Pearson Correlation Coefficient
A measure of how strongly two variables are linearly related: the covariance divided by the product of the standard deviation of x and the standard deviation of y.  It is dimensionless and ranges from -1 to +1.

* -1 : Anti Correlation
* +1 : Complete Correlation
* 0 : No Correlation
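
A minimal sketch of computing the coefficient, both with `np.corrcoef` and from the definition. The helper name `pearson_r` and the data are assumptions for illustration.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation coefficient between two 1-D arrays."""
    return np.corrcoef(x, y)[0, 1]  # off-diagonal entry of the 2x2 matrix

# Hypothetical data (illustration only)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 1.0, 4.0, 3.0, 5.0])

r = pearson_r(x, y)  # 0.8

# Same value from the definition: covariance / (std of x * std of y)
r_manual = np.cov(x, y, ddof=0)[0, 1] / (np.std(x) * np.std(y))
```

Calling `pearson_r(x, -x)` gives -1, the anti-correlated extreme from the list above.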

## Bernoulli Trials
An experiment whose result is either True (success) or False (failure).

```
import numpy as np
rng = np.random.default_rng(42)  # create a generator object; the seed (42 here) is optional
random_numbers = rng.random(size=4)  # array of 4 random numbers in [0, 1)
heads = random_numbers < 0.5  # True ("heads") with probability 0.5
```
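
The same idea generalizes to any number of trials and any success probability. This is a sketch; the function name `perform_bernoulli_trials` and its parameters are hypothetical.

```python
import numpy as np

def perform_bernoulli_trials(rng, n, p):
    """Run n Bernoulli trials with success probability p; return the number of successes."""
    random_numbers = rng.random(size=n)  # uniform on [0, 1)
    return int(np.sum(random_numbers < p))

rng = np.random.default_rng(42)
n_heads = perform_bernoulli_trials(rng, 100, 0.5)
print(n_heads)  # somewhere around 50
```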

## Probability Mass Function (PMF)
The set of probabilities of discrete outcomes.  For example, the outcomes of a fair die roll follow a discrete uniform PMF (a 1/6 chance for each result).
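
The die example can be approximated by simulation. The sketch below estimates the PMF from simulated rolls; the seed and the sample size are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(42)
rolls = rng.integers(1, 7, size=60_000)  # faces 1..6; the upper bound is excluded

# Count how often each face appears, then normalize to probabilities
counts = np.bincount(rolls, minlength=7)[1:]  # drop the unused index 0
pmf = counts / counts.sum()
print(pmf)  # six values, each close to 1/6
```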

## Binomial Distribution
```
# Arguments: number of Bernoulli trials, probability of success, number of samples
rng.binomial(N_TRIALS, P_SUCCESS, size=N_SAMPLES)
```
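
Each binomial sample is the number of successes in a fixed number of Bernoulli trials, so the sample mean should sit near the number of trials times the probability of success. A short sketch with made-up parameters:

```python
import numpy as np

rng = np.random.default_rng(42)

n, p = 100, 0.05  # hypothetical: 100 trials, 5% chance of success each
samples = rng.binomial(n, p, size=10_000)

print(samples.mean())  # close to n * p = 5
```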

## Poisson Process
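The timing of the next event is independent of the timing of the previous event.  The number of events in a fixed interval follows the Poisson distribution, which is also the limit of the binomial distribution for rare events (many trials, low probability of success).  Example: the number of hits a website receives in an hour, given an average of 6 hits per hour.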

```
samples = rng.poisson(MEAN_NUMBER_OF_OCCURRENCES, size=N_SAMPLES)
```
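
Using the website example (an average of 6 hits per hour), a quick simulation shows a characteristic property of the Poisson distribution: its mean and its variance are equal. The seed and the sample size are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulated hit counts for 100,000 one-hour windows, averaging 6 hits per hour
hits = rng.poisson(6, size=100_000)

print(hits.mean())  # close to 6
print(hits.var())   # also close to 6
```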

## Probability Density Functions
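Describes the probabilities of continuous outcomes.  Because any single exact value has probability zero, probabilities are given by areas under the PDF curve.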

## Normal Distribution
```
rng.normal(MEAN, STD, size=N_SAMPLES)
```
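
A short sketch of sampling from a normal distribution and checking the familiar rule that about 68% of values fall within one standard deviation of the mean. The parameters are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)

mean, std = 20.0, 3.0  # hypothetical parameters (illustration only)
samples = rng.normal(mean, std, size=100_000)

# Fraction of samples within one standard deviation of the mean
within_one_std = np.mean(np.abs(samples - mean) < std)
print(within_one_std)  # roughly 0.68
```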

## Exponential Distribution
```
rng.exponential(MEAN, size=N_SAMPLES)
```
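
The exponential distribution describes the waiting time between events of a Poisson process. Note that the first argument of `rng.exponential` is the scale (the mean waiting time), not the rate. The sketch below reuses the website example: 6 hits per hour corresponds to a mean wait of 10 minutes between hits.

```python
import numpy as np

rng = np.random.default_rng(42)

mean_wait = 10.0  # minutes between hits, i.e. 60 minutes / 6 hits
waits = rng.exponential(mean_wait, size=100_000)

print(waits.mean())  # close to 10
```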