# Statistics for Data Science

## You may need to install scipy

`pip install scipy`

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import norm, uniform, t

%matplotlib inline

## Descriptive Statistics

1. **Central Tendency Measures**

**Mean:** The average of the set of observations  
**Median:** The middle value of the sorted observations: half of the data lies at or below it and half at or above it.  
**Mode:** The observation with the highest frequency

2. **Variability Measures**

**Range**: The difference between the maximum and minimum values in the set of observations.  
**Variance**: The average squared deviation of the observations from the mean.  
**Standard Deviation**: The square root of the variance.
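
The central tendency measures and the range can be checked by hand on a small dataset. The sketch below uses a made-up array (`obs` is hypothetical, not part of the workshop data) and Python's `collections.Counter` for the mode, which is just one of several ways to compute it.

```python
import numpy as np
from collections import Counter

# Hypothetical sample, chosen so every measure is easy to verify by hand
obs = np.array([2, 4, 4, 5, 7, 8])

mean = obs.mean()                    # (2 + 4 + 4 + 5 + 7 + 8) / 6 = 5.0
median = np.median(obs)              # average of the two middle values: (4 + 5) / 2 = 4.5
mode = Counter(obs.tolist()).most_common(1)[0][0]  # 4 is the only value that appears twice
value_range = obs.max() - obs.min()  # 8 - 2 = 6

print(mean, median, mode, value_range)
```

With an even number of observations there is no single middle value, which is why the median here falls between two data points.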

In [None]:
# Mean and Median of an array
arr = np.random.rand(1, 10)
print("Contents: ", arr)
print("Mean: %f" % arr.mean())
print("Median: %f" % np.median(arr))

**Pop Quiz**: Calculate the range of the array *arr*. (Hint: You can use numpy's built in array.max() and array.min() functions)

In [None]:
# Range of the array
r = arr.max() - arr.min()
print('Range: ', r)

# Variance of the array
print('Variance: ', arr.var())

# Standard Deviation of the array
print('Standard Deviation: ', arr.std())
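
By default `var()` and `std()` divide by N (the population formulas). Passing `ddof=1` divides by N - 1 instead, which gives the sample variance. A minimal sketch on a made-up array (the values are illustrative only):

```python
import numpy as np

sample = np.array([2.0, 4.0, 4.0, 5.0, 7.0, 8.0])  # hypothetical values
n = len(sample)
squared_dev = (sample - sample.mean()) ** 2

pop_var = squared_dev.sum() / n           # matches sample.var(), i.e. ddof=0
sample_var = squared_dev.sum() / (n - 1)  # matches sample.var(ddof=1)

print(pop_var, sample.var())
print(sample_var, sample.var(ddof=1))
print(np.sqrt(pop_var), sample.std())     # standard deviation is the square root
```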

## Inferential Statistics

## Random Variables in Numpy

Note that `np.random.randint` returns integer values drawn from a discrete uniform distribution. The array below holds 10 elements sampled uniformly from the integers 0, 1, and 2.

In [None]:
arr = np.random.randint(3, size=10)
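
The upper bound passed to `randint` is exclusive, so the call above can only return 0, 1, or 2. The sketch below checks this with a larger draw; the seed and the sample size are arbitrary choices made for illustration.

```python
import numpy as np

np.random.seed(0)  # arbitrary seed, only so the run is repeatable
draws = np.random.randint(3, size=10000)

# Tabulate how often each distinct value occurs
values, counts = np.unique(draws, return_counts=True)
freqs = counts / draws.size

print(values)  # the distinct outcomes that were drawn
print(freqs)   # relative frequency of each outcome
```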

In [None]:
sns.distplot(arr, color='skyblue')

**Question**: What do you think would happen if we sampled 100 elements instead?

## Visualizing Sampling Distributions

## Plotting a Uniform distribution

Let's sample from scipy's built-in uniform distribution and plot the result.

In [None]:
# 10,000 draws from the uniform distribution on [loc, loc + scale] = [10, 30]
data_uniform = uniform.rvs(size=10000, loc=10, scale=20)
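
In scipy's parameterization, `loc` is the lower bound and `scale` is the width of the interval, so these samples lie between 10 and 30. A short check, assuming the same parameters as the cell above (the `random_state` value is an arbitrary choice for repeatability):

```python
from scipy.stats import uniform

loc, scale = 10, 20
sample = uniform.rvs(size=10000, loc=loc, scale=scale, random_state=0)

theoretical_mean = uniform.mean(loc=loc, scale=scale)  # loc + scale / 2
theoretical_var = uniform.var(loc=loc, scale=scale)    # scale**2 / 12

print(sample.min(), sample.max())      # both inside [10, 30]
print(sample.mean(), theoretical_mean)
print(sample.var(), theoretical_var)
```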

In [None]:
ax = sns.distplot(data_uniform,
                  bins=100,
                  kde=True,
                  color='skyblue',
                  hist_kws={"linewidth": 15,'alpha':1})
# With kde=True the histogram is normalized, so the y-axis shows density rather than counts
ax.set(xlabel='Uniform Distribution', ylabel='Density')

## Plotting a Normal distribution

Let's sample from scipy's built-in normal distribution and plot the result.

**Pop Quiz**: Using the plot of the uniform distribution as a reference, fill in the parameters of the normal distribution plot as follows:

1. Pass in *data_normal* as the data.
2. Set bins=100.
3. Set color='skyblue'.

In [None]:
# 10,000 draws from the standard normal distribution (mean 0, standard deviation 1)
data_normal = norm.rvs(size=10000, loc=0, scale=1)

In [None]:
ax = sns.distplot(data_normal,
                  bins=100,
                  kde=True,
                  color='skyblue',
                  hist_kws={"linewidth": 15,'alpha':1})
# With kde=True the histogram is normalized, so the y-axis shows density rather than counts
ax.set(xlabel='Normal Distribution', ylabel='Density')
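
For the normal distribution, `loc` is the mean and `scale` is the standard deviation. One way to connect the plot to numbers is to compute how much probability lies within 1, 2, and 3 standard deviations of the mean using `norm.cdf`. This is a small sketch; the `coverage` dictionary is just an illustrative name.

```python
from scipy.stats import norm

# Probability that a standard normal value falls within k standard deviations of the mean
coverage = {k: norm.cdf(k) - norm.cdf(-k) for k in (1, 2, 3)}

for k, prob in coverage.items():
    print("within %d sd: %.4f" % (k, prob))
```

These are the values behind the familiar 68-95-99.7 rule.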

## Visualizing the Central Limit Theorem

In [None]:
# calculate the mean of 50 dice rolls 1000 times
means = [np.mean(np.random.randint(1, 7, 50)) for _ in range(1000)]
# plot the distribution of sample means
plt.hist(means)
plt.show()

What do you notice about the shape of the distribution? What does it resemble?
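
To put numbers on the picture, the sketch below compares the simulated sample means with what theory predicts for a fair die: the means should center on 3.5, and their spread should be close to the die's standard deviation divided by the square root of the number of rolls. The seed and variable names are arbitrary choices for illustration.

```python
import numpy as np

np.random.seed(1)  # arbitrary seed, only so the run is repeatable
n_rolls, n_repeats = 50, 1000

# Each row is one experiment of 50 rolls; take the mean of every row
means = np.random.randint(1, 7, size=(n_repeats, n_rolls)).mean(axis=1)

faces = np.arange(1, 7)
die_mean = faces.mean()  # 3.5
die_std = faces.std()    # sqrt(35 / 12), roughly 1.708
predicted_spread = die_std / np.sqrt(n_rolls)

print(means.mean(), die_mean)
print(means.std(), predicted_spread)
```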

## Confidence Intervals

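This is a demonstration of calculating a 95% confidence interval for a population mean when the population standard deviation is unknown. In that case the sample standard deviation stands in for it, and the critical values come from the t-distribution with N - 1 degrees of freedom.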

In [None]:
data = np.array([63.5, 81.3, 88.9, 63.5, 76.2, 67.3, 66.0, 64.8, 74.9, 81.3, 76.2,
        72.4, 76.2, 81.3, 71.1, 80.0, 73.7, 74.9, 76.2, 86.4, 73.7, 81.3,
        68.6, 71.1, 83.8, 71.1, 68.6, 81.3, 73.7, 74.9])
mean = np.mean(data)

# Sample standard deviation: ddof=1 makes the divisor N - 1 instead of N
stddev = np.std(data, ddof=1)

# Critical values bounding the central 95% of a t-distribution
# with N - 1 degrees of freedom
t_bounds = t.interval(0.95, len(data) - 1)

# Shift and scale the critical values: mean +/- t * (stddev / sqrt(N))
ci = [mean + critval * stddev / np.sqrt(len(data)) for critval in t_bounds]

print("Mean: %f" % mean)
print("Confidence Interval 95%%: %f, %f" % (ci[0], ci[1]))
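
The same interval can be obtained in one call by passing the sample mean as `loc` and the standard error as `scale` to `t.interval`. The sketch below repeats the data so it runs on its own and cross-checks the two routes; `scipy.stats.sem` computes the standard error with `ddof=1` by default.

```python
import numpy as np
from scipy import stats

data = np.array([63.5, 81.3, 88.9, 63.5, 76.2, 67.3, 66.0, 64.8, 74.9, 81.3, 76.2,
                 72.4, 76.2, 81.3, 71.1, 80.0, 73.7, 74.9, 76.2, 86.4, 73.7, 81.3,
                 68.6, 71.1, 83.8, 71.1, 68.6, 81.3, 73.7, 74.9])
n = len(data)
mean = data.mean()
sem = stats.sem(data)  # sample standard deviation / sqrt(n)

# Manual route: mean +/- critical value * standard error
t_crit = stats.t.ppf(0.975, df=n - 1)
manual = (mean - t_crit * sem, mean + t_crit * sem)

# One-call route: let scipy shift and scale the t-distribution
builtin = stats.t.interval(0.95, n - 1, loc=mean, scale=sem)

print(manual)
print(builtin)
```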

# Resources and Additional Readings

[Think Stats by Downey](http://greenteapress.com/thinkstats/thinkstats.pdf)  
[Hypothesis Testing](https://www.investopedia.com/terms/h/hypothesistesting.asp)  
[5 Basic Statistics Concepts for Data Science](https://towardsdatascience.com/the-5-basic-statistics-concepts-data-scientists-need-to-know-2c96740377ae)  
[Statistics and Data Science Course](https://www.edx.org/micromasters/mitx-statistics-and-data-science)

# BONUS: Exploratory Data Analysis

We will apply the concepts we've learned to perform some EDA on SAT and ACT datasets from 2017 and 2018.

# TODO: Insert pop-quiz.

In [None]:
sat_17 = pd.read_csv('./data/sat_2017.csv')
sat_18 = pd.read_csv('./data/sat_2018.csv')
act_17 = pd.read_csv('./data/act_2017.csv')
act_18 = pd.read_csv('./data/act_2018.csv')

### 1. Initial Explorations and Cleaning Corrupted Data

In [None]:
# Understanding the size of our datasets using the shape attribute. Notice the inconsistency in the dimensions of each year's data!
print("SAT 2017 Shape: ", sat_17.shape)
print("SAT 2018 Shape: ", sat_18.shape)
print("ACT 2017 Shape: ", act_17.shape)
print("ACT 2018 Shape: ", act_18.shape)

In [None]:
# Getting a closer look at the structure of the data. NOTE: We are working with the 2018 ACT data as a point of reference.
act_18.head()

In [None]:
# Getting a look at the frequencies of the states in the 2018 ACT data
act_18['State'].value_counts()

In [None]:
# Check whether the entry for Maine is corrupted or duplicated.
act_18[act_18['State'] == 'Maine']

In [None]:
# Drop the duplicate entry
act_18.drop(act_18.index[52], inplace=True)

# Reset the indices to the updated DataFrame
act_18 = act_18.reset_index(drop=True)
act_18.shape
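
Dropping by position works when the row index is known. When it is not, pandas can locate and remove exact duplicate rows itself. The sketch below uses a hypothetical miniature table (`toy`, with made-up values), not the workshop data, and only removes rows that match in every column.

```python
import pandas as pd

# Hypothetical miniature of the ACT table; names and values are illustrative only
toy = pd.DataFrame({
    'State': ['Louisiana', 'Maine', 'Maine', 'Maryland'],
    'Composite': [19.2, 24.0, 24.0, 22.5],
})

# keep=False flags every row that takes part in a duplicate, not just the repeats
print(toy[toy.duplicated(keep=False)])

# Keep the first copy of each duplicated row and renumber the index
deduped = toy.drop_duplicates().reset_index(drop=True)
print(deduped.shape)
```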

# TODO: Pop quiz

In [None]:
def compare_values(act_col, sat_col):
    '''
    Takes two collections of ACT and SAT values and prints the values that appear in only one of them.
    '''
    # Copy each input into a plain list so membership tests check the values
    act_vals = list(act_col)
    sat_vals = list(sat_col)

    print('Values unique to ACT: ')
    for act in act_vals:
        if act not in sat_vals:
            print(act)

    print('----------------------')

    print('Values unique to SAT: ')
    for sat in sat_vals:
        if sat not in act_vals:
            print(sat)

In [None]:
compare_values(act_17['State'], sat_17['State'])

In [None]:
compare_values(act_18['State'], sat_18['State'])
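
The same comparison can be written with set differences, which return the results instead of printing them. This is an alternative sketch, not the function used above: `unique_to_each` and the sample lists are hypothetical names and values.

```python
def unique_to_each(act_col, sat_col):
    """Return (values only in act_col, values only in sat_col), each sorted."""
    act_vals, sat_vals = set(act_col), set(sat_col)
    return sorted(act_vals - sat_vals), sorted(sat_vals - act_vals)

# Hypothetical inputs for illustration
act_states = ['Alabama', 'Alaska', 'National']
sat_states = ['Alabama', 'Alaska', 'Arizona']

only_act, only_sat = unique_to_each(act_states, sat_states)
print(only_act)
print(only_sat)
```

Unlike the loop version, sets discard repeated values and the original ordering, so this form suits a quick check rather than a row-by-row audit.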

In [None]:
act_17[act_17['State'] == 'National']

In [None]:
# Remove the 'National' summary row by label rather than by position
act_17 = act_17[act_17['State'] != 'National'].reset_index(drop=True)
act_17.shape

In [None]:
act_18[act_18['State'] == 'National']

In [None]:
# Remove the 'National' summary row, if present, by label rather than by position
act_18 = act_18[act_18['State'] != 'National'].reset_index(drop=True)
act_18.shape

In [None]:
# Check which spelling of the capital's name appears in the 2017 ACT DataFrame.
print('------ Washington, D.C. --------')
print(act_17[act_17['State'] == 'Washington, D.C.'])
print('------ District of Columbia --------')
print(act_17[act_17['State'] == 'District of Columbia'])
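
If the same place turns out to be spelled differently across datasets, one possible fix is to map the variants to a single label with `Series.replace`. The sketch below runs on a made-up frame, and standardizing on 'District of Columbia' is an assumption for illustration, not a decision taken from the workshop.

```python
import pandas as pd

# Hypothetical frame with one non-standard label
toy = pd.DataFrame({'State': ['Delaware', 'Washington, D.C.', 'Florida']})

# Map each variant spelling to the chosen standard label
toy['State'] = toy['State'].replace({'Washington, D.C.': 'District of Columbia'})
print(toy)
```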

In [None]:
print('SAT 2017 Cols: ', sat_17.columns, '\n')
print('SAT 2018 Cols: ', sat_18.columns, '\n')
print('ACT 2017 Cols: ', act_17.columns, '\n')
print('ACT 2018 Cols: ', act_18.columns, '\n')

In [None]:
# Drop the per-section score columns from each dataset
sat_17.drop(columns=['Evidence-Based Reading and Writing', 'Math'], inplace=True)
act_17.drop(columns=['English', 'Math', 'Reading', 'Science'], inplace=True)
sat_18.drop(columns=['Evidence-Based Reading and Writing', 'Math'], inplace=True)

In [None]:
print('SAT 2017 Cols: ', sat_17.columns, '\n')
print('SAT 2018 Cols: ', sat_18.columns, '\n')
print('ACT 2017 Cols: ', act_17.columns, '\n')
print('ACT 2018 Cols: ', act_18.columns, '\n')

In [None]:
print('SAT 2017 Missing Data: ')
print(sat_17.isnull().sum(), '\n')
print('SAT 2018 Missing Data: ')
print(sat_18.isnull().sum(), '\n')
print('ACT 2017 Missing Data: ')
print(act_17.isnull().sum(), '\n')
print('ACT 2018 Missing Data: ')
print(act_18.isnull().sum(), '\n')
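
`isnull()` marks each missing cell with `True`, and `sum()` then counts those marks per column. A small sketch on a made-up frame (column names and values are hypothetical) shows what the output means:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value in each of two columns
toy = pd.DataFrame({
    'State': ['Alabama', 'Alaska', 'Arizona'],
    'Participation': ['100%', None, '62%'],
    'Composite': [19.2, 19.8, np.nan],
})

missing = toy.isnull().sum()  # number of missing cells per column
print(missing)
```

A column of zeros in this output means no cells are missing, though it says nothing about values that are present but wrong.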

In [None]:
print('SAT 2017 Data Types: ')
print(sat_17.dtypes, '\n')
print('SAT 2018 Data Types: ')
print(sat_18.dtypes, '\n')
print('ACT 2017 Data Types: ')
print(act_17.dtypes, '\n')
print('ACT 2018 Data Types: ')
print(act_18.dtypes, '\n')
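
A column that should be numeric but is stored as text usually points to a stray character in one of its cells. The sketch below reproduces that situation on made-up values and shows one way to locate the offending cell with `pd.to_numeric(..., errors='coerce')`, which turns unparseable entries into NaN. The data and the typo are hypothetical.

```python
import pandas as pd

# Hypothetical column in which one score carries a stray character
toy = pd.DataFrame({'State': ['Maryland', 'Wyoming'], 'Composite': ['23.6', '20.2x']})
print(toy.dtypes)  # Composite is stored as text, not as a number

# Entries that cannot be parsed become NaN, which makes them easy to find
converted = pd.to_numeric(toy['Composite'], errors='coerce')
bad_rows = toy[converted.isnull()]
print(bad_rows)
```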