In [24]:
import numpy
import scipy.stats
import pandas

def compare_averages(filename):
    """
    Performs a t-test on two sets of baseball data (left-handed and right-handed hitters).

    You will be given a csv file that has three columns.  A player's
    name, handedness (L for lefthanded or R for righthanded) and their
    career batting average (called 'avg'). You can look at the csv
    file by downloading the baseball_stats file from Downloadables below. 
    
    Write a function that will read the csv file into a pandas data frame,
    and run Welch's t-test on the two cohorts defined by handedness.
    
    One cohort should be a data frame of right-handed batters. And the other
    cohort should be a data frame of left-handed batters.
    
    We have included the scipy.stats library to help you write
    or implement Welch's t-test:
    http://docs.scipy.org/doc/scipy/reference/stats.html
    
    With a significance level of 5% (a 95% confidence level), if there is no
    significant difference between the two cohorts, return a tuple consisting of
    True, and then the tuple returned by scipy.stats.ttest_ind.
    
    If there is a difference, return a tuple consisting of
    False, and then the tuple returned by scipy.stats.ttest_ind.
    
    For example, the tuple that you return may look like:
    (True, (9.93570222, 0.000023))
    """
    
    baseballDF = pandas.read_csv(filename)
    
    avgRightHandedArray = baseballDF[baseballDF['handedness'] == 'R']['avg']
    avgLeftHandedArray  = baseballDF[baseballDF['handedness'] == 'L']['avg']
    
    resultsTTest = scipy.stats.ttest_ind(avgRightHandedArray, avgLeftHandedArray, equal_var=False)

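    confidenceLevel = 0.95
    alphaLevel      = 1.00 - confidenceLevel   # significance level (alpha) = 0.05
    pValue          = resultsTTest[1]

    # Reject the null hypothesis of equal means when the p-value is at or below alpha.
    if pValue <= alphaLevel: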
        return (False, resultsTTest)
    else:
        return (True, resultsTTest)    
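
To make the `equal_var=False` choice concrete, here is a minimal sketch that computes Welch's t statistic by hand and checks it against `scipy.stats.ttest_ind`. The two samples are synthetic (drawn from made-up normal distributions with a fixed seed), not the baseball data, and the helper name `welch_t_statistic` is hypothetical.

```python
import numpy as np
import scipy.stats

def welch_t_statistic(a, b):
    """Welch's t: difference in means over the unpooled standard error."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Sample variances (ddof=1) are kept separate rather than pooled.
    standard_error = np.sqrt(a.var(ddof=1) / a.size + b.var(ddof=1) / b.size)
    return (a.mean() - b.mean()) / standard_error

# Synthetic stand-ins for two cohorts with different spreads and sizes.
rng = np.random.default_rng(0)
sample_a = rng.normal(loc=0.25, scale=0.03, size=200)
sample_b = rng.normal(loc=0.30, scale=0.05, size=150)

manual_t = welch_t_statistic(sample_a, sample_b)
scipy_t, scipy_p = scipy.stats.ttest_ind(sample_a, sample_b, equal_var=False)

# The cohorts are judged different when the p-value is at or below alpha.
alpha = 0.05
means_differ = bool(scipy_p <= alpha)
```

Because the variances are not pooled, the statistic stays valid when the two cohorts differ in spread or size, which is why Welch's variant is a reasonable default here.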



In [25]:
compare_averages('/Users/jason/Desktop/baseball_stats.csv')

[True, (-9.9357022262420944, 3.8102742258887383e-23)]

In [4]:
baseballDF.head()

Unnamed: 0,name,handedness,height,weight,avg,HR
0,Brandon Hyde,R,75,210,0.0,0
1,Carey Selph,R,69,175,0.277,0
2,Philip Nastu,L,74,180,0.04,0
3,Kent Hrbek,L,76,200,0.282,293
4,Bill Risley,R,74,215,0.0,0


# Testing if your data comes from a Normal Distribution -
#   Shapiro-Wilk Test for Normality

    w, p = scipy.stats.shapiro(data)
    
    w = Shapiro-Wilk test statistic (W)
    p = p-value
    
    Null hypothesis: The data is drawn/sampled from a Normal distribution
    
    The p-value answers: given the null hypothesis that the data is drawn from
    a Normal distribution, what is the probability that we would observe a value
    of _w_ at least as extreme as the one actually observed?

In [27]:
scipy.stats.shapiro(avgRHandedArray)



(0.8317142724990845, 0.0)

The null hypothesis of this test is that the population is normally distributed. If the p-value is less than the chosen alpha level, the null hypothesis is rejected and there is evidence that the data tested are not from a normally distributed population. Conversely, if the p-value is greater than the chosen alpha level, the null hypothesis cannot be rejected; this does not prove the data are normal, only that the test found no evidence against normality. For example, at an alpha level of 0.05, a data set with a p-value of 0.02 rejects the null hypothesis that the data are from a normally distributed population. The p-value reported above is 0.0, which is below any usual alpha level, so normality is rejected for this sample.
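
The decision rule described above can be sketched as a small helper. The function name `rejects_normality` and the synthetic exponential sample are assumptions made for illustration; they are not part of the original notebook.

```python
import numpy as np
import scipy.stats

def rejects_normality(data, alpha=0.05):
    """Return True when the Shapiro-Wilk test rejects normality at level alpha."""
    w, p = scipy.stats.shapiro(data)
    return bool(p < alpha)

# A strongly right-skewed synthetic sample, used only to exercise the helper.
rng = np.random.default_rng(0)
skewed_sample = rng.exponential(scale=1.0, size=500)
skewed_rejected = rejects_normality(skewed_sample)
```

A `False` result from this helper would only mean the test found no evidence against normality at that alpha level, not that the data are confirmed normal.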
