# Data Analysis

## Why are statistical significance tests useful?

* They provide a formalized framework for comparing and evaluating data
* They enable us to evaluate whether perceived effects in our dataset reflect differences across the whole population

## Normal Distribution (Gaussian Distribution, Bell Curve)

### Two parameters associated:

* Mean $$\mu$$
* Standard deviation $$\sigma$$


These two parameters plug into the following probability density function, which describes a Gaussian distribution:

![title](img/normal-d.jpg)

$$f(x) = \frac{1}{{\sqrt {2\pi \sigma^2} }}e^{ - \frac{{(x - \mu)^2}}{2\sigma^2}}$$

* The <b>expected value</b> of a variable described by a Gaussian distribution is the <b>mean</b>, and its <b>variance</b> is the square of the <b>standard deviation</b>, $$\sigma^2$$.

* Normal distributions are also symmetric about their mean.
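
To make the density function concrete, here is a minimal sketch that evaluates it by hand and cross-checks the result against the standard library's `statistics.NormalDist`. The helper name `gaussian_pdf` and the values of `mu` and `sigma` are illustrative assumptions, not part of the original notes.

```python
import math
from statistics import NormalDist


def gaussian_pdf(x, mu, sigma):
    """Evaluate the Gaussian probability density function at x."""
    coeff = 1.0 / math.sqrt(2.0 * math.pi * sigma ** 2)
    exponent = -((x - mu) ** 2) / (2.0 * sigma ** 2)
    return coeff * math.exp(exponent)


mu, sigma = 10.0, 2.0  # illustrative values, chosen arbitrarily

# The density is highest at the mean and symmetric about it.
peak = gaussian_pdf(mu, mu, sigma)
left = gaussian_pdf(mu - 1.5, mu, sigma)
right = gaussian_pdf(mu + 1.5, mu, sigma)

# Cross-check against the standard library implementation.
reference = NormalDist(mu, sigma).pdf(mu)

print(peak, left, right, reference)
```

The two points placed equally far from the mean give the same density, which is the symmetry property stated above.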

## Statistical Significance Tests

## t-Test
One of the most common parametric tests that we can use to compare two sets of data.

* Aims at accepting or rejecting a <b>null hypothesis</b> (generally a statement that we are trying to disprove by running our test)

<b>TEST STATISTIC:</b> reduces the dataset to one number that helps to accept or reject the <b>null hypothesis</b>. When performing a t-Test, we compute a test statistic called <b>T</b>. 

$$ tTest \rightarrow t $$

Depending on the value of the test statistic T, we can decide whether or not to reject the null hypothesis.

### Two Sample t-Test
There are a few different versions, depending on the assumptions made:
* Equal sample size?
* Same variance?

The version used here is Welch's t-test, which assumes neither:

$$t = \frac{\mu_1 - \mu_2}{{\sqrt {\frac {\sigma_1^2}{N_1} + \frac {\sigma_2^2}{N_2} }}}$$

Where:
* Sample mean for the i'th sample: $$\mu_i$$
* Sample variance for the i'th sample: $$\sigma_i^2$$
* Sample size for the i'th sample: $$N_i$$

To estimate the number of degrees of freedom:
$$\nu \approx \frac{(\frac{\sigma_1^2}{N_1}+\frac{\sigma_2^2}{N_2})^2}{\frac{\sigma_1^4}{N_1^2 \nu_1}+\frac{\sigma_2^4}{N_2^2 \nu_2}}$$

Where:

$$\nu_i = N_i - 1$$

is the degrees of freedom associated with the i'th variance estimate.
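
The two formulas above can be computed directly. The following is a minimal sketch using only the standard library; the function name `welch_t_and_dof` is made up for illustration, and the two lists are the same toy data used in the SciPy example in these notes.

```python
from statistics import mean, variance


def welch_t_and_dof(sample1, sample2):
    """Return Welch's t statistic and the estimated degrees of freedom."""
    n1, n2 = len(sample1), len(sample2)
    m1, m2 = mean(sample1), mean(sample2)
    # statistics.variance is the sample variance (it divides by N - 1)
    v1, v2 = variance(sample1), variance(sample2)

    # Squared standard error contributed by each sample
    se1, se2 = v1 / n1, v2 / n2

    t = (m1 - m2) / (se1 + se2) ** 0.5
    # Degrees of freedom, following the approximation given above
    nu = (se1 + se2) ** 2 / (se1 ** 2 / (n1 - 1) + se2 ** 2 / (n2 - 1))
    return t, nu


a = [1, 2, 3, 4, 5, 6]
b = [5, 4, 3, 2, 6, 7, 8, 9, 10]
t, nu = welch_t_and_dof(a, b)
print(t, nu)  # t is about -2.10, nu is about 12.96
```

Note that the degrees of freedom are generally not an integer in this version of the test.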

With these two values we can estimate the P value, which is the probability of obtaining a test statistic at least as extreme as the one that was actually observed, assuming that the null hypothesis is true (the P value IS NOT the probability that the null hypothesis is true given the data).

* P-value: probability of obtaining a test statistic <b>at least</b> as extreme as ours if null hypothesis was true
* Set Pcritical -> if P < Pcritical: REJECT NULL HYPOTHESIS else CANNOT REJECT NULL HYPOTHESIS
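
As a rough sketch of this decision rule, the two-sided p-value can be read off Student's t distribution with `scipy.stats.t.sf`. The values of `t` and `nu` below are illustrative inputs (they correspond approximately to the toy lists used in the SciPy example in these notes), and `two_sided_p_value` is a hypothetical helper name.

```python
from scipy import stats


def two_sided_p_value(t, nu):
    """Probability of a statistic at least as extreme as t, in either tail."""
    # sf is the survival function, 1 - cdf; doubling it covers both tails
    return 2.0 * stats.t.sf(abs(t), nu)


t = -2.1004201260420148  # illustrative test statistic
nu = 12.96               # illustrative (approximate) degrees of freedom

p = two_sided_p_value(t, nu)
p_critical = 0.05

# Reject the null hypothesis only if p falls below the chosen threshold
reject_null = bool(p < p_critical)
print(p, reject_null)
```

With these inputs the p-value lands just above 0.05, so the null hypothesis cannot be rejected at that threshold.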

### t-Test in Python: SciPy

In [2]:
import scipy.stats

In [34]:
# two sets of data
lst1 = [1,2,3,4,5,6]
lst2 = [5,4,3,2,6,7,8,9,10]
# assumes a two-sided t-test
scipy.stats.ttest_ind(lst1, lst2, equal_var=False)
# returns a tuple: (t-value, p-value for a two-tailed test)

Ttest_indResult(statistic=-2.1004201260420148, pvalue=0.05583466515003168)

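#### For one-sided: half of the two-sided p-value (one side of the distribution)

To test whether the first mean is greater than the second, reject the null hypothesis when:

$$\frac{P}{2} < P_{critical}, \quad t > 0$$

To test whether the first mean is less than the second, reject the null hypothesis when:

$$\frac{P}{2} < P_{critical}, \quad t < 0$$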
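
The one-sided rule can be sketched on top of the two-sided result from `scipy.stats.ttest_ind`. The helper name `one_sided_welch` and the default threshold of 0.05 are assumptions made for illustration.

```python
from scipy import stats


def one_sided_welch(sample1, sample2, p_critical=0.05):
    """Test whether mean(sample1) > mean(sample2) using Welch's t-test."""
    t, p_two_sided = stats.ttest_ind(sample1, sample2, equal_var=False)
    p_one_sided = p_two_sided / 2
    # Reject only if the halved p-value is small AND t has the expected sign
    reject = bool(t > 0 and p_one_sided < p_critical)
    return reject, t, p_one_sided


lst1 = [1, 2, 3, 4, 5, 6]
lst2 = [5, 4, 3, 2, 6, 7, 8, 9, 10]

# Is the mean of lst2 greater than the mean of lst1?
print(one_sided_welch(lst2, lst1))
# The reverse question fails the sign check, whatever the p-value
print(one_sided_welch(lst1, lst2))
```

Checking the sign of t matters: halving the p-value alone would also "reject" when the difference points in the opposite direction.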

## Lesson 5 Quiz: Welch's t-Test Exercise
Perform a t-test on two sets of baseball data (left-handed and right-handed hitters).

The input is a csv file that has three columns: a player's name, handedness (L for left-handed or R for right-handed) and their career batting average (called 'avg').

Read the csv file into a pandas data frame, and run Welch's t-test on the two cohorts defined by handedness.

One cohort should be a data frame of right-handed batters. And the other cohort should be a data frame of left-handed batters.
    
* With a confidence level of 95% (a significance level of 5%), if there is no difference between the two cohorts, return a tuple consisting of True, and then the tuple returned by scipy.stats.ttest_ind.

* If there is a difference, return a tuple consisting of False, and then the tuple returned by scipy.stats.ttest_ind.
    
For example, the tuple that you return may look like:
* (True, (9.93570222, 0.000023))

In [85]:
import numpy as np
import pandas as pd
import scipy.stats as sps
import os

In [23]:
def input_dir():
    return os.getcwd() + '/data/input/'

def output_dir():
    return os.getcwd() + '/data/output/'

In [24]:
def read_csv_data(filename, input_dir):
    '''
    Receives a file name (csv)
    Returns a DataFrame
    '''
    data = pd.read_csv(input_dir + filename)
    
    #Rename the columns by replacing spaces with underscores and setting all characters to lowercase
    data.rename(columns = lambda x: x.replace(' ', '_').lower(), inplace=True)
    
    return data

In [96]:
def t_test_compare(lst1, lst2):
    """
    Compare averages
    Performs Welch's t-test on two sets of average data.
    Returns (True, (t, p)) if the null hypothesis of equal means cannot be
    rejected at the 5% significance level, otherwise (False, (t, p)).
    """
    t_test_result = sps.ttest_ind(lst1, lst2, equal_var=False)
    tvalue = t_test_result.statistic
    pvalue = t_test_result.pvalue
    # ex: pvalue = 5% -> if the null hypothesis is true, there is a 5% chance
    # of finding a difference as large as (or larger than) the one in our study.
    # A low P value suggests that the sample provides enough evidence that we can reject
    # the null hypothesis for the entire population.
    # pvalue tells the strength of the evidence

    # With a significance level of 5% (confidence level of 95%)
    if pvalue >= 0.05:
        # Cannot reject the null hypothesis: no significant difference
        return (True, (tvalue, pvalue))
    else:
        # Reject the null hypothesis: there is a difference
        return (False, (tvalue, pvalue))

In [87]:
baseball_data = read_csv_data('baseball-data.csv',input_dir())

In [88]:
print(baseball_data)

                    name handedness height weight    avg   hr
0           Brandon Hyde          R     75    210  0.000    0
1            Carey Selph          R     69    175  0.277    0
2           Philip Nastu          L     74    180  0.040    0
3             Kent Hrbek          L     76    200  0.282  293
4            Bill Risley          R     74    215  0.000    0
5                   Wood        NaN                0.000    0
6        Steve Gajkowski          R     74    200  0.000    0
7              Rick Schu          R     72    170  0.246   41
8              Tom Brown          R     73    170  0.000    0
9           Tom Browning          L     73    190  0.153    2
10           Tommy Brown          R     73    170  0.241   31
11             Tom Brown          B     73    190  0.147    1
12              Joe Burg          R     70    143  0.326    0
13             Tom Brown          L     70    168  0.265   64
14         Terry McGriff          R     74    190  0.206    3
15      

In [89]:
right_h = baseball_data[baseball_data['handedness'] == 'R']

In [90]:
left_h = baseball_data[baseball_data['handedness'] == 'L']

In [97]:
# Ignoring NaN handedness
t_test_result = t_test_compare(right_h['avg'], left_h['avg'])
print(t_test_result)

(False, (-9.9357022262420944, 3.8102742258887383e-23))


## Non-parametric Test
A statistical test that does not assume our data is drawn from any particular underlying probability distribution.

## Mann-Whitney U Test (Mann-Whitney-Wilcoxon Test)
This is a test of the null hypothesis that two populations are the same.

It tests whether or not these samples came from the same population, but it does not necessarily tell us which one has a higher mean or higher median or anything like that.

Because of this, it is usually useful to report Mann-Whitney U Test results along with some other information (like the two sample means, or the sample medians).
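
A small sketch of that reporting habit, using made-up numbers rather than the baseball data: run the test, then print the medians next to the result so the direction of the difference is visible. The `alternative='two-sided'` argument is passed explicitly so the meaning of the p-value does not depend on the SciPy default.

```python
import statistics
from scipy import stats

# Made-up samples for illustration only
group_a = [1.1, 2.3, 2.9, 3.4, 4.0, 4.8]
group_b = [3.9, 5.2, 6.1, 6.8, 7.5, 8.3, 9.0]

result = stats.mannwhitneyu(group_a, group_b, alternative='two-sided')

# The test alone does not say which group is larger, so report medians too
median_a = statistics.median(group_a)
median_b = statistics.median(group_b)

print('U statistic:', result.statistic)
print('p-value:', result.pvalue)
print('medians:', median_a, median_b)
```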

In [105]:
# u: Mann-Whitney test statistic
# p: p-value (one-sided by default in older SciPy releases, two-sided in newer ones)
u_test = sps.mannwhitneyu(right_h['avg'], left_h['avg'])
print(u_test)

MannwhitneyuResult(statistic=22523894.5, pvalue=3.7307870396512496e-45)


## Machine Learning
A branch of artificial intelligence focused on constructing systems that learn from large amounts of data to make predictions. 

### Statistics vs. Machine Learning
What is the difference? Not much: the two fields overlap heavily and differ mainly in emphasis.

* Statistics is focused on analyzing existing data, and drawing valid conclusions (care about how the data is collected and drawing conclusions about that existing data using probability models)
* Machine Learning is focused on making predictions

### Types of Machine Learning: Supervised and Unsupervised
Data -> MODEL -> Predictions

### Unsupervised Learning
We do not have any labeled training examples. Instead, we have a bunch of unlabeled data points and we are trying to understand the structure of the data, often by clustering similar data points together.
* Trying to understand structure of data
* Clustering
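
As one possible illustration of clustering, here is a bare-bones k-means sketch in NumPy. The notes do not name a specific clustering algorithm, so k-means is an assumption chosen for brevity, and the points and starting centroids are made up.

```python
import numpy as np


def k_means(points, centroids, n_iter=10):
    """Plain k-means: alternate between assigning points and moving centroids."""
    centroids = np.asarray(centroids, dtype=float).copy()
    for _ in range(n_iter):
        # Distance from every point to every centroid: shape (n_points, k)
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        # Each point joins the cluster of its nearest centroid
        labels = dists.argmin(axis=1)
        for k in range(len(centroids)):
            members = points[labels == k]
            if len(members) > 0:
                # Move the centroid to the mean of its members
                centroids[k] = members.mean(axis=0)
    return labels, centroids


# Two visually obvious groups of unlabeled 2-D points (made up)
points = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                   [5.0, 5.0], [5.1, 4.9], [4.8, 5.2]])

# Start from one point in each group to keep the example deterministic
labels, centroids = k_means(points, centroids=points[[0, 3]])
print(labels)
print(centroids)
```

No labels are supplied anywhere: the grouping comes only from the distances between the points.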

### Supervised Learning
There are labeled inputs that we train the model on. Training the model means teaching the model what the correct answer looks like.
* Have examples with input and output
* Predict output for future inputs
* Classification
* Regression

#### Linear Regression with Gradient Descent
Can we write an equation that takes a bunch of info (e.g., height, weight, birth year, position) and predicts the number of home runs? Yeah, regression!
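
A minimal sketch of the idea, assuming a mean squared error cost and a fixed learning rate `alpha`. The data here is synthetic (an exact straight line), not the baseball file, and the function and parameter names are illustrative.

```python
import numpy as np


def gradient_descent(features, values, alpha=0.5, n_iter=2000):
    """Fit values ~ features @ theta by minimizing mean squared error."""
    m, n = features.shape
    theta = np.zeros(n)
    for _ in range(n_iter):
        predictions = features @ theta
        # Gradient of (1 / 2m) * sum((predictions - values) ** 2) w.r.t. theta
        gradient = features.T @ (predictions - values) / m
        # Step against the gradient to reduce the cost
        theta -= alpha * gradient
    return theta


# Synthetic data: y = 2 + 3 * x exactly
x = np.linspace(0.0, 1.0, 20)
features = np.column_stack([np.ones_like(x), x])  # first column is the intercept
values = 2.0 + 3.0 * x

theta = gradient_descent(features, values)
print(theta)  # approximately [2, 3]
```

With real inputs such as height and weight, the features usually need to be normalized first, otherwise a single learning rate tends to be either too large to converge or too small to be practical.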