## Exploratory Data Analysis (EDA)

The process of organizing, plotting, and summarizing a data set.

## Histograms

Code for histogram:
```
import matplotlib.pyplot as plt

# `df` is a placeholder DataFrame; replace 'column' and the labels with your own
_ = plt.hist(df['column'])
_ = plt.xlabel('X label')
_ = plt.ylabel('Y label')
plt.show()
```

Manually specifying bins:
```
bin_edges = [0, 10, 20, 30, 40, 50]
_ = plt.hist(df['column'], bins=bin_edges)
```

Setting the number of bins to N:
```
_ = plt.hist(df['column'], bins=N)
```

Using Seaborn:
```
import seaborn as sns
sns.set() #set default Seaborn style
_ = plt.hist(df['column'])
_ = plt.xlabel('X label')
_ = plt.ylabel('Y label')
plt.show()
```

## Bee Swarm Plots
Also known as a swarm plot.  Requires each point to have a feature and observation.

Problems with Histograms:
* binning bias - different numbers of bins yield different visual representations, which may lead to different interpretations of the same data
* loss of individual data points
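
As a rough illustration of binning bias, the sketch below bins the same data with two different bin counts. The data are synthetic and the seed is arbitrary (both are assumptions for illustration), and `np.histogram` stands in for `plt.hist` only so the counts can be inspected directly.

```python
import numpy as np

# Hypothetical data: 200 draws from a standard normal distribution
rng = np.random.default_rng(42)
data = rng.normal(0, 1, size=200)

# Bin the same data with 5 bins and with 20 bins
counts_5, edges_5 = np.histogram(data, bins=5)
counts_20, edges_20 = np.histogram(data, bins=20)

# Both histograms contain all 200 points, but their shapes differ
print(counts_5)
print(counts_20)
```

Neither view is wrong; the choice of bins simply changes which features of the data stand out.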

Using Seaborn to generate a bee swarm plot.
```
# `df` is a placeholder DataFrame; 'x_column' and 'y_column' are placeholder column names
_ = sns.swarmplot(x='x_column', y='y_column', data=df)
_ = plt.xlabel('X label')
_ = plt.ylabel('Y label')
plt.show()
```

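## Empirical Cumulative Distribution Function (ECDF)
In cases where classification categories are blurred, we can use an ECDF. The x-values are the sorted data, and each y-value is the fraction of data points less than or equal to the corresponding x-value.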

```
import numpy as np
x = np.sort(df['column'])  # `df` is a placeholder DataFrame
y = np.arange(1, len(x) + 1) / len(x)
_ = plt.plot(x, y, marker='.', linestyle='none')
_ = plt.xlabel('X label')
_ = plt.ylabel('Y label')
plt.margins(0.02)  # keeps data off plot edges
plt.show()
```
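
The ECDF computation can be wrapped in a small reusable function. This is a minimal sketch; the function name `ecdf` and the four-point input array are chosen here for illustration.

```python
import numpy as np

def ecdf(data):
    """Return x, y values for the ECDF of a 1D array."""
    n = len(data)
    x = np.sort(data)            # sorted data on the x-axis
    y = np.arange(1, n + 1) / n  # fraction of points <= each x value
    return x, y

x, y = ecdf(np.array([3.0, 1.0, 2.0, 4.0]))
print(x)  # sorted values: 1, 2, 3, 4
print(y)  # cumulative fractions: 0.25, 0.5, 0.75, 1.0
```

The last y-value is always 1, because every data point is less than or equal to the maximum.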

## Summary Statistics

Mean (average):
```
import numpy as np
np.mean(Data)
```

Median (middle value):
```
import numpy as np
np.median(Data)
```

Percentiles:
```
np.percentile(Data, percentiles)  # percentiles: array of percentiles, e.g. [25, 50, 75]
```
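
A small worked example, using a made-up array of nine values, shows how the percentiles relate to the median and the interquartile range:

```python
import numpy as np

# Hypothetical data: the integers 1 through 9
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])

# 25th, 50th, and 75th percentiles in one call
q25, q50, q75 = np.percentile(data, [25, 50, 75])

# The 50th percentile is the median
median = np.median(data)

# Interquartile range: distance between the 75th and 25th percentiles
iqr = q75 - q25

print(q25, q50, q75)  # 3.0 5.0 7.0
print(iqr)            # 4.0
```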

Box Plot (the box spans the interquartile range, from the 25th to the 75th percentile, and the middle line marks the 50th percentile, i.e. the median):
```
# `df` is a placeholder DataFrame; 'x_column' and 'y_column' are placeholder column names
_ = sns.boxplot(x='x_column', y='y_column', data=df)
_ = plt.xlabel('X label')
_ = plt.ylabel('Y label')
plt.show()
```

Variance: a measure of the spread of the data, computed as the average of the squared distances from the mean.

Standard Deviation: the square root of the variance.
```
np.var(Data)           # variance
np.std(Data)           # standard deviation
np.sqrt(np.var(Data))  # same value as np.std(Data)
```
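
To make the definitions concrete, the sketch below computes the variance by hand on a made-up array and compares it with NumPy's results. Note that `np.var` and `np.std` divide by the number of points by default (`ddof=0`).

```python
import numpy as np

# Hypothetical data with mean 5
data = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# Variance: average of the squared distances from the mean
mean = np.mean(data)
variance_manual = np.mean((data - mean) ** 2)

# Standard deviation: square root of the variance
std_manual = np.sqrt(variance_manual)

print(variance_manual, np.var(data))  # 4.0 4.0
print(std_manual, np.std(data))       # 2.0 2.0
```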

Scatter Plots:
```
_ = plt.plot(x, y, marker='.', linestyle='none')
_ = plt.xlabel('X label')
_ = plt.ylabel('Y label')
plt.show()
```

### Covariance
The mean of the product of each point's differences from the x-mean and the y-mean. If x tends to be above its mean when y is above its mean (and below when y is below), the covariance is positive and the variables are positively correlated. If x tends to be above its mean when y is below its mean, the covariance is negative and the variables are negatively correlated.

Covariance Matrix:
```
np.cov(X, Y)
```

Extracting the covariance of X and Y from the matrix: `np.cov(X, Y)[0, 1]`
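
The definition can be checked by hand on a small made-up data set. One detail worth knowing: `np.cov` divides by N - 1 by default, so `ddof=0` is passed below to match the "mean of the products" definition exactly.

```python
import numpy as np

# Hypothetical data: y is exactly twice x
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 6.0, 8.0])

# Mean of the products of the differences from the means
cov_manual = np.mean((x - np.mean(x)) * (y - np.mean(y)))

# Off-diagonal entry of the covariance matrix, normalized by N
cov_numpy = np.cov(x, y, ddof=0)[0, 1]

# Default normalization (N - 1) gives a slightly larger value
cov_default = np.cov(x, y)[0, 1]

print(cov_manual, cov_numpy)  # both 2.5
print(cov_default)            # 10 / 3
```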

### Pearson Correlation Coefficient
A measure of the linear dependence between two variables: the covariance divided by the product of the standard deviation of x and the standard deviation of y. It is dimensionless and ranges from -1 to +1.

* -1 : Complete Anti-Correlation
* +1 : Complete Correlation
* 0 : No Correlation
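
A minimal sketch of computing the coefficient with `np.corrcoef`, checked against the definition. The helper name `pearson_r` and the data are chosen here for illustration.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation coefficient between two 1D arrays."""
    # np.corrcoef returns a 2x2 matrix; entry [0, 1] is the x-y correlation
    return np.corrcoef(x, y)[0, 1]

# Hypothetical data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

r = pearson_r(x, y)

# Same quantity from the definition: covariance / (std_x * std_y)
cov_xy = np.mean((x - np.mean(x)) * (y - np.mean(y)))
r_manual = cov_xy / (np.std(x) * np.std(y))

print(r, r_manual)
print(pearson_r(x, 2 * x + 1))  # perfectly correlated, close to +1
print(pearson_r(x, -x))         # perfectly anti-correlated, close to -1
```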

## Bernoulli Trials
An experiment whose result is either True (success) or False (failure).

```
import numpy as np
rng = np.random.default_rng(SEED)  # Create generator object with optional SEED
random_numbers = rng.random(size=4)  # Generate an array of 4 random numbers in [0, 1)
heads = random_numbers < 0.5  # True marks a success (heads) with probability 0.5
```
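
The same idea generalizes to any number of trials and any success probability. The helper below is a sketch; its name and signature are assumptions made for this example.

```python
import numpy as np

def perform_bernoulli_trials(rng, n, p):
    """Run n Bernoulli trials with success probability p; return the number of successes."""
    # A trial succeeds when a uniform random number in [0, 1) falls below p
    successes = rng.random(size=n) < p
    return int(np.sum(successes))

rng = np.random.default_rng(0)  # arbitrary seed for reproducibility

# Number of heads in 100 fair coin flips
n_heads = perform_bernoulli_trials(rng, 100, 0.5)
print(n_heads)
```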

## Probability Mass Function (PMF)
Set of probabilities of discrete outcomes.  The probabilities associated with a fair die roll form a discrete uniform PMF (1/6 chance for each result).
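
The die example can be approximated by simulation. In this sketch the number of rolls and the seed are arbitrary choices; the estimated probabilities should each land near 1/6.

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed

# Simulate 60,000 rolls of a fair six-sided die (upper bound is exclusive)
rolls = rng.integers(1, 7, size=60000)

# Estimate the PMF: fraction of rolls that landed on each face
faces, counts = np.unique(rolls, return_counts=True)
pmf = counts / len(rolls)

for face, prob in zip(faces, pmf):
    print(face, round(prob, 3))
```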

## Binomial Distribution
```
# n: number of Bernoulli trials, p: probability of success, n_samples: number of samples
rng.binomial(n, p, size=n_samples)
```
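
As a quick sanity check, the sample mean of binomial draws should be close to n times p. The parameter values and seed below are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)  # arbitrary seed

n, p = 10, 0.5  # 10 coin flips per sample, fair coin

# Each sample is the number of successes in n Bernoulli trials
samples = rng.binomial(n, p, size=10000)

print(np.mean(samples))  # close to n * p = 5
```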

## Poisson Process
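The timing of the next event is completely independent of when the previous event happened.  The number of arrivals of a Poisson process in a fixed time interval follows the Poisson distribution, which is used for rare events: it is the limit of the binomial distribution for a low probability of success and a large number of trials.  Example of a scenario that uses the Poisson distribution: the number of hits on a website in one hour, given an average of 6 hits per hour.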

```
# mean_count: mean number of occurrences, n_samples: number of samples
samples = rng.poisson(mean_count, size=n_samples)
```
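
The website example can be simulated directly. For a Poisson distribution the mean and the variance are equal, which the sketch below checks approximately; the sample size and seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(3)  # arbitrary seed

# Hits per hour for 10,000 simulated hours, averaging 6 hits per hour
hits = rng.poisson(6, size=10000)

print(np.mean(hits))  # close to 6
print(np.var(hits))   # also close to 6

# Estimated probability of a quiet hour with 2 hits or fewer
p_quiet = np.sum(hits <= 2) / len(hits)
print(p_quiet)
```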

## Probability Density Functions
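Describes the relative likelihood of continuous outcomes.  The probability of observing a value within a given range is the area under the PDF over that range.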

## Normal Distribution
```
# mean, std: parameters of the distribution, n_samples: number of samples
rng.normal(mean, std, size=n_samples)
```
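
A short sketch with made-up parameters: for a Normal distribution, roughly 68% of samples fall within one standard deviation of the mean, which corresponds to the area under the PDF over that range.

```python
import numpy as np

rng = np.random.default_rng(4)  # arbitrary seed

mean, std = 10.0, 2.0  # hypothetical parameters
samples = rng.normal(mean, std, size=100000)

# Fraction of samples within one standard deviation of the mean
within_one_std = np.sum(np.abs(samples - mean) <= std) / len(samples)

print(np.mean(samples), np.std(samples))  # close to 10 and 2
print(within_one_std)                     # close to 0.68
```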

## Exponential Distribution
```
# mean: mean of the distribution (the scale parameter), n_samples: number of samples
rng.exponential(mean, size=n_samples)
```
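
The waiting time between arrivals of a Poisson process is Exponentially distributed. As an illustration under assumed numbers, a process averaging 6 events per hour has a mean waiting time of 10 minutes:

```python
import numpy as np

rng = np.random.default_rng(5)  # arbitrary seed

# Mean waiting time in minutes for a process averaging 6 events per hour
mean_wait = 60 / 6

# Simulated waiting times between consecutive events
waits = rng.exponential(mean_wait, size=10000)

print(np.mean(waits))  # close to 10 minutes
```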

## Linear Regression by Least Squares
The process of finding the parameters for which the sum of the squares of the residuals is minimized.
```
# Degree 1 fits a straight line and returns (slope, intercept)
slope, intercept = np.polyfit(x_data, y_data, 1)
```
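
To see what "minimized" means here, the sketch below computes the residual sum of squares (RSS) for the fitted line and for two slightly different lines. The data and the helper name `rss` are made up for illustration.

```python
import numpy as np

# Hypothetical, roughly linear data
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

def rss(slope, intercept):
    """Sum of squared residuals for the line y = slope * x + intercept."""
    residuals = y - (slope * x + intercept)
    return np.sum(residuals ** 2)

# Least-squares fit of a straight line
slope, intercept = np.polyfit(x, y, 1)

rss_best = rss(slope, intercept)
rss_steeper = rss(slope + 0.1, intercept)  # perturbed slope
rss_shifted = rss(slope, intercept + 0.1)  # perturbed intercept

# Any other line gives a larger sum of squared residuals
print(rss_best, rss_steeper, rss_shifted)
```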

Sample:
```
# Plot the illiteracy rate versus fertility
_ = plt.plot(illiteracy, fertility, marker='.', linestyle='none')
plt.margins(0.02)
_ = plt.xlabel('percent illiterate')
_ = plt.ylabel('fertility')

# Perform a linear regression using np.polyfit(): a, b
a, b = np.polyfit(illiteracy, fertility, 1)

# Print the results to the screen
print('slope =', a, 'children per woman / percent illiterate')
print('intercept =', b, 'children per woman')

# Make theoretical line to plot
x = np.array([0,100])
y = a * x + b

# Add regression line to your plot
_ = plt.plot(x, y)

# Draw the plot
plt.show()
```


## Bootstrapping
Using resampled data to perform statistical inference.
* bootstrap sample - the resampled data, stored as an array
* bootstrap replicate - a statistic computed from the resampled array

We can generate a bootstrap sample by drawing from the original sample with replacement: `np.random.choice(original_sample, size=len(original_sample))`
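
A minimal sketch of one bootstrap sample and its replicate, using a made-up array. It uses the `Generator.choice` method rather than the legacy `np.random.choice`; both sample with replacement by default.

```python
import numpy as np

rng = np.random.default_rng(6)  # arbitrary seed

# Hypothetical original sample
data = np.array([2.1, 3.4, 1.9, 5.0, 4.2])

# Bootstrap sample: same size as the original, drawn with replacement
bs_sample = rng.choice(data, size=len(data))

# Bootstrap replicate: a statistic (here the mean) of the bootstrap sample
bs_replicate = np.mean(bs_sample)

print(bs_sample)
print(bs_replicate)
```

Because the draw is with replacement, some original values typically appear more than once and others not at all.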

### Code to Draw a Single Bootstrap Replicate:
```
def bootstrap_replicate_1d(data, func):
    """Generate bootstrap replicate of 1D data."""
    bs_sample = np.random.choice(data, len(data))
    return func(bs_sample)
```

### Code to Draw and Store Multiple Bootstrap Replicates
```
def draw_bs_reps(data, func, size=1):
    """Draw bootstrap replicates."""

    # Initialize array of replicates: bs_replicates
    bs_replicates = np.empty(size)

    # Generate replicates
    for i in range(size):
        bs_replicates[i] = bootstrap_replicate_1d(data,func)

    return bs_replicates
```

### Sample
```
# Draw bootstrap replicates of the mean no-hitter time (equal to tau): bs_replicates
bs_replicates = draw_bs_reps(nohitter_times, np.mean, 10000)

# Compute the 95% confidence interval: conf_int
conf_int = (np.percentile(bs_replicates, 2.5), np.percentile(bs_replicates, 97.5))

# Print the confidence interval
print('95% confidence interval =', conf_int, 'games')

# Plot the histogram of the replicates
_ = plt.hist(bs_replicates, bins=50, density=True)
_ = plt.xlabel(r'$\tau$ (games)')
_ = plt.ylabel('PDF')

# Show the plot
plt.show()

```



## np.arange(n)
Generates an integer array from 0 to **n**-1.

## np.linspace(a,b,n)
Generates an array of **n** evenly spaced floats from **a** to **b** (both endpoints included).
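
A quick comparison of the two functions, with arbitrary arguments:

```python
import numpy as np

# Integers from 0 up to, but not including, 5
indices = np.arange(5)

# 5 evenly spaced floats from 0 to 1, endpoints included
grid = np.linspace(0, 1, 5)

print(indices)  # 0, 1, 2, 3, 4
print(grid)     # 0.0, 0.25, 0.5, 0.75, 1.0
```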

## Pairs Bootstrap
### Code to draw bootstrap pairs:
```
def draw_bs_pairs_linreg(x, y, size=1):
    """Perform pairs bootstrap for linear regression."""

    # Set up array of indices to sample from: inds
    inds = np.arange(len(x))

    # Initialize replicates: bs_slope_reps, bs_intercept_reps
    bs_slope_reps = np.empty(size)
    bs_intercept_reps = np.empty(size)

    # Generate replicates
    for i in range(size):
        bs_inds = np.random.choice(inds, size=len(inds))
        bs_x, bs_y = x[bs_inds], y[bs_inds]
        bs_slope_reps[i], bs_intercept_reps[i] = np.polyfit(bs_x, bs_y, 1)

    return bs_slope_reps, bs_intercept_reps
```

### Sample:
```
# Generate array of x-values for bootstrap lines: x
x = np.array([0,100])

# Plot the bootstrap lines
for i in range(100):
    _ = plt.plot(x, 
                 bs_slope_reps[i]*x + bs_intercept_reps[i],
                 linewidth=0.5, alpha=0.2, color='red')

# Plot the data
_ = plt.plot(illiteracy, fertility, marker='.', linestyle='none')

# Label axes, set the margins, and show the plot
_ = plt.xlabel('illiteracy')
_ = plt.ylabel('fertility')
plt.margins(0.02)
plt.show()
```


## Hypothesis Testing
To compare two groups under the null hypothesis that they come from the same distribution, concatenate the groups, randomly permute the combined values, then take the first len(a) elements of the permutation as group a and the remaining elements as group b.

### Code for creating a permutation sample:
```
def permutation_sample(data1, data2):
    """Generate a permutation sample from two data sets."""

    # Concatenate the data sets: data
    data = np.concatenate((data1, data2))

    # Permute the concatenated array: permuted_data
    permuted_data = np.random.permutation(data)

    # Split the permuted array into two: perm_sample_1, perm_sample_2
    perm_sample_1 = permuted_data[:len(data1)]
    perm_sample_2 = permuted_data[len(data1):]

    return perm_sample_1, perm_sample_2
```

## Test Statistics and p-values
A **test statistic** is a single number that can be computed from both the observed data and the data simulated under the null hypothesis, so that the two can be compared.

The **p-value** is the probability of obtaining a value of your test statistic that is at least as extreme as what was observed (assuming a true null-hypothesis).

### Code for drawing multiple permutation replicates:
```
def draw_perm_reps(data_1, data_2, func, size=1):
    """Generate multiple permutation replicates."""

    # Initialize array of replicates: perm_replicates
    perm_replicates = np.empty(size)

    for i in range(size):
        # Generate permutation sample
        perm_sample_1, perm_sample_2 = permutation_sample(data_1, data_2)

        # Compute the test statistic
        perm_replicates[i] = func(perm_sample_1, perm_sample_2)

    return perm_replicates
```

### Sample Code:
```
def diff_of_means(data_1, data_2):
    """Difference in means of two arrays."""

    # The difference of means of data_1, data_2: diff
    diff = np.mean(data_1)-np.mean(data_2)

    return diff

# Compute difference of mean impact force from experiment: empirical_diff_means
empirical_diff_means = diff_of_means(force_a, force_b)

# Draw 10,000 permutation replicates: perm_replicates
perm_replicates = draw_perm_reps(force_a, force_b,
                                 diff_of_means, size=10000)

# Compute p-value: p
p = np.sum(perm_replicates >= empirical_diff_means) / len(perm_replicates)

# Print the result
print('p-value =', p)
```

## Pipeline for Hypothesis Testing
1. State null hypothesis.
2. Define test statistic.
3. Generate multiple sets of simulated data assuming null hypothesis true.
4. Compute test statistic for each simulated data set.
5. The p-value is the fraction of the simulated data sets for which the test statistic is at least as extreme as for the real data.

### One Sample Test
Compares one set of data to a single number.
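
A rough sketch of a one-sample bootstrap test, assuming the null hypothesis is that the true mean equals a given number. The data, the hypothesized mean, and the seed are all made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(7)  # arbitrary seed

# Hypothetical data and hypothesized mean
data = np.array([0.52, 0.61, 0.48, 0.70, 0.66, 0.58, 0.63, 0.55])
hypothesized_mean = 0.55

# Shift the data so its mean matches the null hypothesis
shifted = data - np.mean(data) + hypothesized_mean

# Test statistic: sample mean minus the hypothesized mean
observed = np.mean(data) - hypothesized_mean

# Bootstrap replicates of the test statistic under the null hypothesis
n_reps = 10000
replicates = np.empty(n_reps)
for i in range(n_reps):
    bs_sample = rng.choice(shifted, size=len(shifted))
    replicates[i] = np.mean(bs_sample) - hypothesized_mean

# p-value: fraction of replicates at least as extreme as the observed value
p = np.sum(replicates >= observed) / n_reps
print('p-value =', p)
```

Shifting the data keeps its spread while forcing the null hypothesis to be true, so the replicates show how far the mean can stray from the hypothesized value by chance alone.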

### Two Sample Test
Compares two sets of data.

### Sample Code:
```
# Concatenate both data sets and compute the mean of all forces: mean_force
forces_concat = np.concatenate((force_a, force_b))
mean_force = np.mean(forces_concat)

# Generate shifted arrays
force_a_shifted = force_a - np.mean(force_a) + mean_force
force_b_shifted = force_b - np.mean(force_b) + mean_force 

# Compute 10,000 bootstrap replicates from shifted arrays
bs_replicates_a = draw_bs_reps(force_a_shifted, np.mean, 10000)
bs_replicates_b = draw_bs_reps(force_b_shifted, np.mean, 10000)

# Get replicates of difference of means: bs_replicates
bs_replicates = bs_replicates_a - bs_replicates_b

# Compute and print p-value: p
p = np.sum(bs_replicates >= empirical_diff_means) / len(bs_replicates)
print('p-value =', p)
```

## Correlation Testing
