# Data Analysis

## Why are statistical significance tests useful?

* They provide a formalized framework for comparing and evaluating data
* They enable us to evaluate whether perceived effects in our dataset reflect differences across the whole population

## Normal Distribution (Gaussian Distribution, Bell Curve)

### Two parameters associated:

* Mean $$\mu$$
* Standard deviation $$\sigma$$


These two parameters plug in to the following probability density function, which describes a Gaussian distribution:

![title](img/normal-d.jpg)

$$f(x) = \frac{1}{{\sqrt {2\pi \sigma^2} }}e^{ - \frac{{(x - \mu)^2}}{2\sigma^2}}$$

* The <b>expected value</b> of a variable described by a Gaussian distribution is the <b>mean</b> $$\mu$$, and its <b>variance</b> is the square of the <b>standard deviation</b>, $$\sigma^2$$.

* Normal distributions are also symmetric about their mean.
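
As a quick numerical check of the density formula, here is a minimal sketch in plain Python. The function name `normal_pdf` and the sample inputs are illustrative choices, not part of the original notes.

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of a Gaussian with mean mu and standard deviation sigma."""
    coeff = 1.0 / math.sqrt(2.0 * math.pi * sigma ** 2)
    exponent = -((x - mu) ** 2) / (2.0 * sigma ** 2)
    return coeff * math.exp(exponent)

# Height of the standard normal curve at its mean
peak = normal_pdf(0.0)

# Two points at the same distance on either side of the mean
left = normal_pdf(-1.5, mu=0.0, sigma=2.0)
right = normal_pdf(1.5, mu=0.0, sigma=2.0)
print(peak, left, right)
```

The two evaluations at equal distance from the mean give the same density, which is the symmetry property noted above.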

## Statistical Significance Tests

## t-Test
One of the most common parametric tests for comparing two sets of data.

* Aims at rejecting, or failing to reject, a <b>null hypothesis</b> (generally a statement that we are trying to disprove by running our test).

<b>TEST STATISTIC:</b> reduces the dataset to one number that helps decide whether to reject the <b>null hypothesis</b>. When performing a t-Test, we compute a test statistic called <b>t</b>:

$$ tTest \rightarrow t $$

Depending on the value of the test statistic t, we can decide whether or not to reject the null hypothesis.

### Two Sample t-Test
A few different versions depending on assumptions:
* Equal sample size?
* Same variance?

$$t = \frac{\mu_1 - \mu_2}{{\sqrt {\frac {\sigma_1^2}{N_1} + \frac {\sigma_2^2}{N_2} }}}$$

Where:
* Sample mean for the i'th sample: $$\mu_i$$
* Sample variance for the i'th sample: $$\sigma_i^2$$
* Sample size for the i'th sample: $$N_i$$

To estimate the number of degrees of freedom:
$$\nu \approx \frac{(\frac{\sigma_1^2}{N_1}+\frac{\sigma_2^2}{N_2})^2}{\frac{\sigma_1^4}{N_1^2 \nu_1}+\frac{\sigma_2^4}{N_2^2 \nu_2}}$$

Where:

$$\nu_i = N_i - 1$$

is the degrees of freedom associated with the i'th variance estimate.
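
To make the two formulas concrete, the sketch below computes Welch's t statistic and the approximate degrees of freedom by hand, using only the standard library. The helper name `welch_t_and_dof` is hypothetical; the two samples are the same small lists used in the SciPy example in these notes.

```python
import math
import statistics

def welch_t_and_dof(sample1, sample2):
    """Return (t, nu) for Welch's unequal-variance t-test."""
    n1, n2 = len(sample1), len(sample2)
    mean1, mean2 = statistics.mean(sample1), statistics.mean(sample2)
    # statistics.variance is the sample variance (divides by N - 1)
    var1, var2 = statistics.variance(sample1), statistics.variance(sample2)

    # Each sample's contribution to the squared standard error
    se1, se2 = var1 / n1, var2 / n2
    t = (mean1 - mean2) / math.sqrt(se1 + se2)
    # Welch-Satterthwaite approximation for the degrees of freedom
    nu = (se1 + se2) ** 2 / (se1 ** 2 / (n1 - 1) + se2 ** 2 / (n2 - 1))
    return t, nu

t, nu = welch_t_and_dof([1, 2, 3, 4, 5, 6], [5, 4, 3, 2, 6, 7, 8, 9, 10])
print(t, nu)
```

For these lists the t value is about -2.10, which agrees with the statistic reported by `scipy.stats.ttest_ind(..., equal_var=False)`, and the degrees of freedom come out near 12.96.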

With these two values we can estimate the P value, which is the probability of obtaining a test statistic at least as extreme as the one that was actually observed, assuming that the null hypothesis is true (the P value IS NOT the probability that the null hypothesis is true given the data).

* P-value: probability of obtaining a test statistic <b>at least</b> as extreme as ours if the null hypothesis were true
* Set Pcritical -> if P < Pcritical: REJECT NULL HYPOTHESIS else CANNOT REJECT NULL HYPOTHESIS

### t-Test in Python: SciPy

In [11]:
import scipy.stats

In [12]:
# two sets of data
lst1 = [1,2,3,4,5,6]
lst2 = [5,4,3,2,6,7,8,9,10]
# assumes a two-sided t-test
scipy.stats.ttest_ind(lst1, lst2, equal_var=False)
# returns a tuple: (t-value, p-value for a two-tailed test)

Ttest_indResult(statistic=-2.1004201260420148, pvalue=0.05583466515003168)

#### For one-sided: half of two sided p-value (one side of the distribution)

$$ > Mean \rightarrow \frac{P}{2} < P_{critical}, t > 0$$

$$ < Mean \rightarrow \frac{P}{2} < P_{critical}, t < 0$$
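
The decision rules above can be written as a small helper. This is a sketch under the assumption that the two-sided p-value and t statistic come from a symmetric test such as `ttest_ind`; the function name `reject_null` and its `direction` argument are made up for illustration.

```python
def reject_null(t_stat, p_two_sided, p_critical=0.05, direction="two-sided"):
    """Decide whether to reject the null hypothesis.

    direction: "two-sided", "greater" (first mean > second) or "less".
    """
    if direction == "two-sided":
        return p_two_sided < p_critical
    # One-sided p-value is half of the two-sided one
    p_one_sided = p_two_sided / 2.0
    if direction == "greater":
        return t_stat > 0 and p_one_sided < p_critical
    if direction == "less":
        return t_stat < 0 and p_one_sided < p_critical
    raise ValueError("unknown direction: " + direction)

# Values from the SciPy example above
t_stat, p_value = -2.1004201260420148, 0.05583466515003168
two_sided = reject_null(t_stat, p_value)
less = reject_null(t_stat, p_value, direction="less")
greater = reject_null(t_stat, p_value, direction="greater")
print(two_sided, less, greater)
```

With these numbers the two-sided test does not reject at 0.05, while the one-sided test in the "less" direction does, because the halved p-value falls below the threshold and t is negative.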

## Lesson 5 Quiz: Welch's t-Test Exercise
Perform a t-test on two sets of baseball data (left-handed and right-handed hitters).

Receive a csv file that has three columns: a player's name, handedness (L for left-handed or R for right-handed) and their career batting average (called 'avg').

Read the csv file into a pandas data frame, and run Welch's t-test on the two cohorts defined by handedness.

One cohort should be a data frame of right-handed batters, and the other cohort should be a data frame of left-handed batters.

* At a 95% confidence level (5% significance level), if there is no difference between the two cohorts, return a tuple consisting of True, and then the tuple returned by scipy.stats.ttest_ind.

* If there is a difference, return a tuple consisting of False, and then the tuple returned by scipy.stats.ttest_ind.
    
For example, the tuple that you return may look like:
* (True, (9.93570222, 0.000023))

In [9]:
import numpy as np
import pandas as pd
import scipy.stats as sps
import os

In [6]:
def input_dir():
    return os.getcwd() + '/data/input/'

def output_dir():
    return os.getcwd() + '/data/output/'

In [7]:
def read_csv_data(filename, input_dir):
    '''
    Receives a file name (csv)
    Returns a DataFrame
    '''
    data = pd.read_csv(input_dir + filename)
    
    #Rename the columns by replacing spaces with underscores and setting all characters to lowercase
    data.rename(columns = lambda x: x.replace(' ', '_').lower(), inplace=True)
    
    return data

In [16]:
def t_test_compare(lst1, lst2 ):
    """
    Compare averages
    Performs a t-test on two sets of average data
    """
    t_test_tuple = sps.ttest_ind(lst1, lst2, equal_var=False)
    tvalue = t_test_tuple.statistic
    pvalue = t_test_tuple.pvalue
    # The p-value is the probability of finding a difference as large as
    # (or larger than) the one in our study, given that the null hypothesis is true.
    # It is not the probability of wrongly rejecting a true null hypothesis;
    # that is fixed by the significance level we choose.
    # A low p-value suggests that the sample provides enough evidence that we can reject
    # the null hypothesis for the entire population.
    # The p-value tells the strength of the evidence against the null hypothesis.


    # At the 5% significance level (95% confidence)
    if pvalue >= 0.05:
        # No difference
        return (True, (tvalue,pvalue))
    else:
        # There is a difference
        return (False, (tvalue,pvalue))

In [17]:
baseball_data = read_csv_data('baseball-data.csv',input_dir())

In [18]:
print(baseball_data)

                    name handedness height weight    avg   hr
0           Brandon Hyde          R     75    210  0.000    0
1            Carey Selph          R     69    175  0.277    0
2           Philip Nastu          L     74    180  0.040    0
3             Kent Hrbek          L     76    200  0.282  293
4            Bill Risley          R     74    215  0.000    0
5                   Wood        NaN                0.000    0
6        Steve Gajkowski          R     74    200  0.000    0
7              Rick Schu          R     72    170  0.246   41
8              Tom Brown          R     73    170  0.000    0
9           Tom Browning          L     73    190  0.153    2
10           Tommy Brown          R     73    170  0.241   31
11             Tom Brown          B     73    190  0.147    1
12              Joe Burg          R     70    143  0.326    0
13             Tom Brown          L     70    168  0.265   64
14         Terry McGriff          R     74    190  0.206    3
15      

In [19]:
right_h = baseball_data[baseball_data['handedness'] == 'R']

In [20]:
left_h = baseball_data[baseball_data['handedness'] == 'L']

In [21]:
# Rows with NaN handedness are ignored (they match neither 'R' nor 'L')
t_test_result = t_test_compare(right_h['avg'], left_h['avg'])
print(t_test_result)

(False, (-9.9357022262420944, 3.8102742258887383e-23))


## Non-parametric Test
A statistical test that does not assume our data is drawn from any particular underlying probability distribution.

## Mann-Whitney U Test (Mann-Whitney-Wilcoxon Test)
This is a test of the null hypothesis that two populations are the same.

It tests whether or not the two samples came from the same population, but it does not necessarily tell us which one has the higher mean or the higher median.

Because of this, it is usually useful to report Mann-Whitney U Test results along with some other information (such as the two sample means or the sample medians).
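
To show what the U statistic counts and how it can be reported next to the medians, here is a minimal sketch using only the standard library. The helper `mann_whitney_u` is a hypothetical name; it counts pairs directly, which is fine for small samples but is not how SciPy computes the statistic, and it does not compute a p-value.

```python
import statistics

def mann_whitney_u(sample1, sample2):
    """Return (U1, U2) by direct pair counting; ties count as one half."""
    u1 = 0.0
    for x in sample1:
        for y in sample2:
            if x > y:
                u1 += 1.0
            elif x == y:
                u1 += 0.5
    # U1 + U2 always equals the number of pairs
    u2 = len(sample1) * len(sample2) - u1
    return u1, u2

a = [1, 2, 3, 4, 5, 6]
b = [5, 4, 3, 2, 6, 7, 8, 9, 10]
u1, u2 = mann_whitney_u(a, b)

# Report the statistic together with a location summary of each sample
summary = {
    "U1": u1,
    "U2": u2,
    "median_a": statistics.median(a),
    "median_b": statistics.median(b),
}
print(summary)
```

Here U1 counts how often a value from the first sample exceeds a value from the second; a U1 far from half the number of pairs hints that the samples differ, and the medians say in which direction.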

In [22]:
# u: Mann-Whitney test statistic
# p: p-value (one-sided in older SciPy releases such as the one used here;
#    newer releases default to a two-sided test)
u_test = sps.mannwhitneyu(right_h['avg'], left_h['avg'])
print(u_test)

MannwhitneyuResult(statistic=22523894.5, pvalue=3.7307870396512496e-45)


## Machine Learning
A branch of artificial intelligence focused on constructing systems that learn from large amounts of data to make predictions. 

### Statistics vs. Machine Learning
Not much; the main difference is one of emphasis:

* Statistics is focused on analyzing existing data, and drawing valid conclusions (care about how the data is collected and drawing conclusions about that existing data using probability models)
* Machine Learning is focused on making predictions

### Types of Machine Learning: Supervised and Unsupervised
Data -> MODEL -> Predictions

### Unsupervised Learning
In unsupervised learning we do not have labeled training examples. Instead, we have a bunch of unlabeled data points and we are trying to understand the structure of the data, often by clustering similar data points together.
* Trying to understand structure of data
* Clustering

### Supervised Learning
There are labeled inputs that we train the model on. Training the model means teaching the model what the correct answer looks like.
* Have examples with input and output
* Predict output for future
* Classification
* Regression

#### Linear Regression with Gradient Descent
Can we write an equation that takes a bunch of info (e.g., height, weight, birth year, position) and predicts the number of home runs? Yeah, regression!

Each data point (1..m) has an output variable Y and n input variables:


$
(1)\begin{bmatrix}
    Y\\
    x_{1}\\
    .\\
    .\\
    .\\
    x_{n} 
\end{bmatrix}
$
$
(2)\begin{bmatrix}
    Y\\
    x_{1}\\
    .\\
    .\\
    .\\
    x_{n} 
\end{bmatrix}
$
$
(3)\begin{bmatrix}
    Y\\
    x_{1}\\
    .\\
    .\\
    .\\
    x_{n} 
\end{bmatrix}
$
$
\dots
$
$
(m)\begin{bmatrix}
    Y\\
    x_{1}\\
    .\\
    .\\
    .\\
    x_{n} 
\end{bmatrix}
$
$
\theta_1 \dots \theta_n
$



In the baseball example, Y is the lifetime number of home runs and x1..xn are input variables like height and weight; samples one through m might be different baseball players.

We try to predict the values of the output variable for each data point by multiplying the input variables by some set of coefficients (theta 1 through theta n). Each theta <b>tells how important</b> an input variable is when predicting a value for the output variable. If theta 1 is <b>very small</b> then x1 <b>must not be very important</b> in general when predicting Y (and if theta n is very large then xn is generally a big contributor to the value of Y), assuming the input variables are on comparable scales.

This model is built in such a way that we can multiply each x by the corresponding theta and sum them up to get Y. So the final equation will look something like this:

$\theta_1 x_1 + \theta_2 x_2 + \dots + \theta_n x_n = Y$

The best equation is the one that minimizes the difference, across all data points, between our predicted Y and our observed Y. To find this equation we need to find the thetas that produce the best predictions.

To create a value that describes the total error of the model, we sum the squared errors:

$\sum_{i=1}^{m}{(Y_{predicted} - Y_{actual})^2}$

The errors are squared because individual errors can be both negative and positive: if we simply summed them, the total error could be very close to 0 even if the model is very wrong. Squaring makes every error term non-negative, so the errors cannot cancel each other out.
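
A tiny made-up example shows the cancellation. The two predictions below are each off by 2, in opposite directions; the numbers are illustrative only.

```python
predicted = [3.0, 1.0]
actual = [1.0, 3.0]

errors = [p - a for p, a in zip(predicted, actual)]

# The raw errors are +2 and -2, so their sum hides the problem
sum_of_errors = sum(errors)

# Squaring removes the sign, so both mistakes are counted
sum_of_squared_errors = sum(e ** 2 for e in errors)
print(sum_of_errors, sum_of_squared_errors)
```

The plain sum is zero even though every prediction is wrong, while the sum of squares is clearly positive.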

##### How to find theta values: Gradient Descent
1) Define a cost function $J(\theta)$ where $\theta$ here is used to represent the entire set of thetas

The cost function is meant to provide a measure of how well the current set of thetas does at modeling the observed data so we want to minimize the cost function's value.

$J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}{(h(X^i) - Y^i)^2}$ 

where

$h(X^i) = \sum_{j=1}^{n}{\theta_jX_{j}^{i}} = \theta_1X_{1}^{i} + \theta_2X_{2}^{i} + \dots + \theta_nX_{n}^{i} = Y_{predicted}^{i}$


How do we find the values of theta that minimize the cost function $J(\theta)$? Use a <b>search algorithm</b> that takes some initial guess for theta and iteratively changes theta so that $J(\theta)$ keeps getting smaller until it converges on some minimum value (<b>gradient descent</b>).
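
The search can be sketched for the simplest case, a model with a single theta and a single input variable. This is an illustrative sketch, not the quiz solution: the function names, the toy data, and the values of `alpha` and `num_iterations` are assumptions chosen so the loop converges quickly.

```python
def cost(xs, ys, theta):
    """J(theta) = 1/(2m) * sum((theta * x - y)^2) for a one-feature model."""
    m = len(ys)
    return sum((theta * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

def gradient_descent_1d(xs, ys, theta=0.0, alpha=0.1, num_iterations=200):
    """Repeatedly step theta against the gradient of J and record the cost."""
    m = len(ys)
    cost_history = []
    for _ in range(num_iterations):
        # dJ/dtheta = 1/m * sum((theta * x - y) * x)
        gradient = sum((theta * x - y) * x for x, y in zip(xs, ys)) / m
        theta = theta - alpha * gradient
        cost_history.append(cost(xs, ys, theta))
    return theta, cost_history

xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]   # generated by y = 2x, so the best theta is 2
theta, cost_history = gradient_descent_1d(xs, ys)
print(theta, cost_history[0], cost_history[-1])
```

Starting from theta = 0, each step moves theta closer to 2 and the recorded cost shrinks. A learning rate that is too large would make the steps overshoot instead, so `alpha` usually needs some tuning.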

## Lesson 5 Quiz: Gradient Descent in Python
Perform gradient descent given a data set with an arbitrary number of features.

Useful formula:

$\theta_j = \theta_j + \frac{\alpha}{m} \sum_{i=1}^{m}{(Y^i - h(X^i))X^{i}_{j}}$

In [2]:
import numpy as np
import pandas as pd

In [3]:
def compute_cost(features, values, theta):
    """
    Compute the cost of a list of parameters, theta, given a list of features 
    (input data points) and values (output data points).
    """
    m = len(values)
    sum_of_square_errors = np.square(np.dot(features, theta) - values).sum()
    cost = sum_of_square_errors / (2*m)

    return cost

In [172]:
def gradient_descent(features, values, theta, alpha, num_iterations):
    """
    Perform gradient descent given a data set with an arbitrary number of features.
    """

    # Perform num_iterations updates to the elements of theta. After each
    # update, compute the cost for the new theta and append it to cost_history.

    m = len(values)
    cost_history = []

    for i in range(num_iterations):
        # Predictions for the current theta: dot product of features and theta
        predicted_values = np.dot(features, theta)
        # Update every theta_j at once using the batch update rule
        theta = theta - alpha / m * np.dot((predicted_values - values), features)

        cost = compute_cost(features, values, theta)
        cost_history.append(cost)
    
    return theta, pd.Series(cost_history)  

### Read data

In [47]:
baseball_data = read_csv_data('baseball-data.csv',input_dir())
print(baseball_data)

                    name handedness height weight    avg   hr
0           Brandon Hyde          R     75    210  0.000    0
1            Carey Selph          R     69    175  0.277    0
2           Philip Nastu          L     74    180  0.040    0
3             Kent Hrbek          L     76    200  0.282  293
4            Bill Risley          R     74    215  0.000    0
5                   Wood        NaN                0.000    0
6        Steve Gajkowski          R     74    200  0.000    0
7              Rick Schu          R     72    170  0.246   41
8              Tom Brown          R     73    170  0.000    0
9           Tom Browning          L     73    190  0.153    2
10           Tommy Brown          R     73    170  0.241   31
11             Tom Brown          B     73    190  0.147    1
12              Joe Burg          R     70    143  0.326    0
13             Tom Brown          L     70    168  0.265   64
14         Terry McGriff          R     74    190  0.206    3
15      

### Isolate features and values

In [125]:
features = baseball_data[['height', 'weight']]
values = baseball_data[['hr']]
# m = number of data points
m = len(values)
# print (features.values)

### Replace empty entries with np.NaN

In [126]:
def replace_df_empty_entries(df):
    # Match only cells that are empty or contain nothing but whitespace
    return df.replace(r'^\s*$', np.nan, regex=True)

### Transform series to numeric

In [127]:
def series_to_numeric(s):
    return pd.to_numeric(s)

### Normalize features

#### Scale to Unit Length

In [128]:
def normalize_unit_lenght(s):
    return s / (s.max() - s.min())

In [177]:
def mean_normalization(s):
    return (s - s.mean()) / (s.max() - s.min())
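
To see how the two scalings differ, here is a small sketch on a plain list rather than a pandas Series. The function names and the three sample heights are made up for illustration; the arithmetic mirrors the two functions above.

```python
def scale_by_range(values):
    """Divide each value by (max - min); the values are not centered."""
    spread = max(values) - min(values)
    return [v / spread for v in values]

def mean_normalize(values):
    """Subtract the mean, then divide by (max - min)."""
    spread = max(values) - min(values)
    mean = sum(values) / len(values)
    return [(v - mean) / spread for v in values]

heights = [60.0, 70.0, 80.0]
scaled = scale_by_range(heights)
centered = mean_normalize(heights)
print(scaled, centered)
```

Dividing by the range alone leaves the values offset from zero (here 3.0 to 4.0), whereas mean normalization centers them around zero (here -0.5 to 0.5). Both make the spread of each feature equal to 1.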

In [178]:
features = replace_df_empty_entries(features)
features['height'] = series_to_numeric(features['height'])
features['weight'] = series_to_numeric(features['weight'])
print(features)

values = normalize_unit_lenght(values)
features['height'] = normalize_unit_lenght(features['height'])
features['weight'] = normalize_unit_lenght(features['weight'])

print(features)

features['height'] = pd.Series(features['height'])
features['weight'] = pd.Series(features['weight'])
features = pd.concat([features['height'], features['weight']], axis = 1)
print(features)

       height    weight
0       1.875  0.823529
1       1.725  0.686275
2       1.850  0.705882
3       1.900  0.784314
4       1.850  0.843137
6       1.850  0.784314
7       1.800  0.666667
8       1.825  0.666667
9       1.825  0.745098
10      1.825  0.666667
11      1.825  0.745098
12      1.750  0.560784
13      1.750  0.658824
14      1.850  0.745098
15      1.825  0.745098
16      1.875  0.862745
17      1.825  0.713725
18      1.750  0.588235
19      1.775  0.686275
20      1.750  0.588235
21      1.800  0.725490
22      1.850  0.725490
23      1.875  0.784314
24      1.700  0.686275
26      1.650  0.588235
27      1.800  0.737255
28      1.800  0.784314
29      1.850  0.725490
30      1.800  0.666667
31      1.900  0.764706
...       ...       ...
18146   1.775  0.654902
18147   1.825  0.725490
18148   1.800  0.666667
18149   1.700  0.666667
18150   1.875  0.764706
18151   1.825  0.666667
18152   1.800  0.764706
18153   1.825  0.686275
18154   1.775  0.635294
18155   1.750  0

In [176]:
features = features[np.isfinite(features['height'])]
features = features[np.isfinite(features['weight'])]
#theta, cost_history = gradient_descent(features.values, values.values,[0.,0.], 0.01, 10)

# Coefficient of Determination ($R^2$)
A common way to evaluate how well the model fits the data.

data = $y_1 \dots y_n$

predictions = $f_1 \dots f_n$

average of data = $\bar{y}$

$R^2 \equiv 1 - \frac{\sum_{i=1}^{n} (y_i - f_i)^2 }{\sum_{i=1}^{n} (y_i - \bar{y})^2}$

The closer $R^2$ is to 1, the better the model. The closer it is to zero, the poorer the model.
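
The boundary cases of the formula are easy to check by hand. The sketch below uses plain Python lists and a hypothetical helper `r_squared`; the data values are made up.

```python
def r_squared(data, predictions):
    """1 - (residual sum of squares) / (total sum of squares)."""
    mean = sum(data) / len(data)
    ss_res = sum((y - f) ** 2 for y, f in zip(data, predictions))
    ss_tot = sum((y - mean) ** 2 for y in data)
    return 1 - ss_res / ss_tot

data = [1.0, 2.0, 3.0, 4.0]

perfect = r_squared(data, data)                    # predictions match the data
baseline = r_squared(data, [2.5] * 4)              # always predict the mean
worse = r_squared(data, [4.0, 3.0, 2.0, 1.0])      # predictions in reverse order
print(perfect, baseline, worse)
```

Perfect predictions give 1, always predicting the mean gives 0, and a model that does worse than predicting the mean gives a negative value.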

## Lesson 5 Quiz: Calculating $R^2$
Function to compute $R^2$

In [67]:
import numpy as np

In [69]:
def compute_r_squared(data, predictions):
    '''
    Receives numpy arrays (data and predictions) and returns the coefficient of determination for the model
    '''
    
    d_mean = np.mean(data)
    r_squared = 1 - np.sum((data - predictions)**2) / np.sum((data - d_mean)**2)
    # ((data - predictions)**2).sum() is also correct
    return r_squared

## Additional Considerations
Things to keep in mind when applying linear regression to real problems:

* Other types of linear regression
    * Ordinary Least Squares Regression (guaranteed to find the optimal solution when performing linear regression, whereas gradient descent is not)
* Parameter estimation
    * What are the confidence intervals on the parameters?
    * What is the likelihood we would calculate this parameter value if the parameter had no effect on the output variable?
* Under/Overfitting
    * Is not so much a problem with linear regression
    * Cross validation
* Multiple Local Minima
    * Use various different random initial thetas
    * Seed random values for repeatability (make results replicable)
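
The last point can be sketched as follows. The cost function `f` below is a hypothetical one-dimensional curve with two local minima, chosen only to illustrate random restarts; it is not the linear regression cost. The step size, iteration count, and number of starts are likewise assumptions.

```python
import random

def f(theta):
    """A hypothetical non-convex cost with two local minima."""
    return theta ** 4 - 3 * theta ** 2 + theta

def f_prime(theta):
    """Derivative of f, used as the gradient."""
    return 4 * theta ** 3 - 6 * theta + 1

def descend(theta, alpha=0.01, num_iterations=2000):
    """Plain gradient descent from a single starting point."""
    for _ in range(num_iterations):
        theta -= alpha * f_prime(theta)
    return theta

def best_of_random_starts(seed, num_starts=10):
    """Run gradient descent from several random thetas and keep the best."""
    rng = random.Random(seed)          # seeded so the run is repeatable
    starts = [rng.uniform(-2.0, 2.0) for _ in range(num_starts)]
    candidates = [descend(start) for start in starts]
    return min(candidates, key=f)

best = best_of_random_starts(seed=42)
print(best, f(best))
```

Each start slides into whichever minimum is nearest, so keeping the candidate with the lowest cost reduces the chance of settling for a poor local minimum. Because the generator is seeded, rerunning with the same seed reproduces the same result.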