# Hypothesis Testing Methodology and Concepts Explained

    Our salespeople dataset has the following fields:
    1. promoted — a binary value indicating if the salesperson was promoted or not in the recent promotion round
    2. sales — the recent sales made by the salesperson in thousands of dollars
    3. customer_rate — the recent average rating by customers of the salesperson on a scale of 1 to 5
    4. performance — the most recent performance rating of the salesperson where a rating of 1 is the lowest and 4 is the highest.

# Example 1: Welch's Test 
    Welch’s t-test is a hypothesis test for determining whether two populations have different means, without assuming that the two populations have equal variances. There are several varieties of this test, but we will look at the two-sample version and ask whether high performing salespeople generate higher sales than low performing salespeople in the population. We start by assuming our null hypothesis, which is that the difference in mean sales between high performers and low performers in the population is zero or less. We then calculate the difference in means statistic for our sample.

In [7]:
#Importing libraries
import pandas as pd
import numpy as np
from scipy import stats

In [2]:
#Getting the dataset
url = "http://peopleanalytics-regression-book.org/data/salespeople.csv"
salespeople = pd.read_csv(url)
salespeople.head()

,promoted,sales,customer_rate,performance
0,0,594.0,3.94,2.0
1,0,446.0,4.06,3.0
2,1,674.0,3.83,4.0
3,0,525.0,3.62,2.0
4,1,657.0,4.4,3.0


In [3]:
# get sales for top and bottom performers
perf1 = salespeople[salespeople.performance == 1].sales
perf4 = salespeople[salespeople.performance == 4].sales

In [9]:
# welch's independent t-test with unequal variance
ttest = stats.ttest_ind(perf4, perf1, equal_var=False)
print(ttest)

Ttest_indResult(statistic=4.629477606844271, pvalue=1.0932443461577038e-05)


# Analysis Results - P-value interpretation
    H0 (null hypothesis): the difference in mean sales between high performers and low performers in the population is zero or less.
    H1 (alternative hypothesis): high performing salespeople generate higher sales than low performing salespeople.
    
    Assumption 1: Sales is a random variable; that is, the sales of one salesperson are independent of the sales of another. Therefore we expect the difference in mean sales between the two groups to also be a random variable.
    
    We therefore expect the true population difference to follow a t-distribution centered on our sample statistic. The t-distribution is an estimate of a normal distribution based on our sample.
    
    To get the precise t-distribution, we need the degrees of freedom, which are 100.98 here. The area under this distribution at or below zero represents the maximum probability of our sample statistic occurring under the null hypothesis, and is known as the p-value of the hypothesis test.
    
    We also need the standard deviation of the mean difference, which we call the standard error. Dividing the sample difference in means by the standard error gives the t-statistic of 4.63 reported in the test output.
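
    The quantities named in this section (the standard error, the t-statistic and the degrees of freedom) can be computed by hand and checked against SciPy. The sketch below is illustrative only: `high` and `low` are hypothetical synthetic samples standing in for `perf4` and `perf1`, and the degrees of freedom use the Welch-Satterthwaite approximation.

```python
import numpy as np
from scipy import stats

# Hypothetical synthetic samples standing in for perf4 and perf1 (not the real data)
rng = np.random.default_rng(0)
high = rng.normal(loc=620, scale=190, size=55)
low = rng.normal(loc=465, scale=150, size=60)

# Sample difference in means
mean_diff = high.mean() - low.mean()

# Variance of each group mean: sample variance divided by group size
var_high = high.var(ddof=1) / high.size
var_low = low.var(ddof=1) / low.size

# Standard error of the difference in means
std_err = np.sqrt(var_high + var_low)

# t-statistic: difference in means measured in standard errors
t_stat = mean_diff / std_err

# Welch-Satterthwaite approximation of the degrees of freedom
dof = (var_high + var_low) ** 2 / (
    var_high ** 2 / (high.size - 1) + var_low ** 2 / (low.size - 1)
)

# Two-sided p-value from the t-distribution, for comparison with SciPy's default
p_two_sided = 2 * stats.t.sf(abs(t_stat), dof)

# Cross-check against SciPy's Welch test
result = stats.ttest_ind(high, low, equal_var=False)
print(t_stat, result.statistic)
print(p_two_sided, result.pvalue)
```

    The hand-computed statistic and p-value should agree with the values SciPy returns, which makes it easier to see where each number in the test output comes from.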
    
# Results interpretation
    The p-value printed by the test, about 0.00001, is two-sided. Because the t-statistic is positive, the one-sided p-value that matches our hypotheses is half of that. So we determine that the maximum probability of our sample statistic occurring under the null hypothesis is about 0.000005, much less than even a very stringent alpha. In most cases this would be considered too unlikely for the null hypothesis to hold, so we reject it in favour of the alternative hypothesis: that high performing salespeople generate higher sales than low performing salespeople.
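
    `stats.ttest_ind` reports a two-sided p-value by default, while the hypothesis tested here is one-sided. A minimal sketch of requesting the one-sided value directly, assuming SciPy 1.6 or later (where `ttest_ind` accepts the `alternative` argument) and using hypothetical synthetic samples in place of `perf4` and `perf1`:

```python
import numpy as np
from scipy import stats

# Hypothetical synthetic samples standing in for perf4 and perf1 (not the real data)
rng = np.random.default_rng(1)
high = rng.normal(loc=620, scale=190, size=55)
low = rng.normal(loc=465, scale=150, size=60)

# Default: two-sided test (H1: the means differ in either direction)
two_sided = stats.ttest_ind(high, low, equal_var=False)

# One-sided test (H1: mean of the first sample is greater than the second)
one_sided = stats.ttest_ind(high, low, equal_var=False, alternative="greater")

print(two_sided.pvalue, one_sided.pvalue)
```

    When the t-statistic is positive, the one-sided p-value is half the two-sided one, so either route leads to the same conclusion.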

# Example 2: Correlation test
    Another common hypothesis test is a test of whether two numeric variables have a non-zero correlation.
    Let’s ask if there is a non-zero correlation between sales and customer_rate in our salespeople data set. 
    As usual we assume the null hypothesis: that there is zero correlation between these variables. We then calculate the sample correlation. 
    We expect the true population correlation to lie in a distribution around this sample statistic. A simple correlation like this is expected to follow a t-distribution with n-2 degrees of freedom (348 in this case), and the standard error is approximately 0.05.

In [None]:
import numpy as np

In [None]:
#calculate correlation and p-value 
sales = salespeople.sales[~np.isnan(salespeople.sales)]

In [13]:
cust_rate = salespeople.customer_rate[~np.isnan(salespeople.customer_rate)]
cor = stats.pearsonr(sales, cust_rate)
print(cor)

(0.33780504485867807, 8.647952212091666e-11)
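
    The link between the correlation, its standard error and the t-distribution with n-2 degrees of freedom can be sketched as follows. This is an illustration under stated assumptions: `demo` is a hypothetical synthetic data frame standing in for `salespeople`, and missing values are dropped row by row (rather than column by column) so that each sales value stays paired with the rating from the same row.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical synthetic data standing in for salespeople (not the real data)
rng = np.random.default_rng(2)
sales = rng.normal(loc=500, scale=100, size=200)
rating = 3.5 + 0.002 * sales + rng.normal(loc=0, scale=0.5, size=200)
demo = pd.DataFrame({"sales": sales, "customer_rate": rating})
demo.loc[3, "sales"] = np.nan           # a missing sales value
demo.loc[10, "customer_rate"] = np.nan  # a missing rating in a different row

# Drop any row where either value is missing, keeping the pairs aligned
pairs = demo[["sales", "customer_rate"]].dropna()
n = len(pairs)

# Sample correlation and p-value from SciPy
r, p_value = stats.pearsonr(pairs["sales"], pairs["customer_rate"])

# Standard error of r and the t-statistic on n - 2 degrees of freedom
std_err = np.sqrt((1 - r ** 2) / (n - 2))
t_stat = r / std_err

# Two-sided p-value from the t-distribution
p_manual = 2 * stats.t.sf(abs(t_stat), n - 2)
print(r, std_err, t_stat)
print(p_value, p_manual)
```

    Plugging r = 0.338 and n = 350 into the same standard error formula gives roughly 0.05, the value quoted for this example.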


# Example 3: Chi-square test of difference in proportion
    This example is critical when dealing with categorical variables.
    A common question is whether there is a difference in proportion across the different categories of such a variable. A chi-square test is a hypothesis test designed for this purpose.
    
    QUESTION: is there a difference in the proportion of salespeople who are promoted between the different performance categories?
    H0(Null Hypothesis): the proportion of salespeople who are promoted is the same across all the performance categories.
    
    1. Let’s look at the proportion of salespeople who were promoted in each performance category by creating a contingency table or cross table for performance and promotion.
    
    2. Next we assume that there was perfect equality across the categories. We do this by calculating the overall proportion of promoted salespeople and then applying this proportion to the number of salespeople in each category.
    
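    3. Calculate the chi-square statistic: the statistic sums, over all cells of the table, the squared difference between the observed and expected counts divided by the expected count. As with our t-statistic earlier, the chi-square statistic has an expected distribution which depends on the degrees of freedom. The degrees of freedom are calculated by subtracting one from the number of rows and one from the number of columns of the contingency table and multiplying the results together, here (2 - 1) * (4 - 1) = 3.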

In [15]:
# create contingency table for promoted versus performance
contingency = pd.crosstab(salespeople.promoted, salespeople.performance)
contingency

promoted / performance,1.0,2.0,3.0,4.0
0,50,85,77,25
1,10,25,48,30


In [17]:
# perform chi-square test
chi2_test = stats.chi2_contingency(contingency)
chi2_test

(25.895405268094862,
 1.0030629464566802e-05,
 3,
 array([[40.62857143, 74.48571429, 84.64285714, 37.24285714],
        [19.37142857, 35.51428571, 40.35714286, 17.75714286]]))
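
    The three steps listed for this example can be reproduced by hand from the observed counts in the contingency table. The sketch below copies those counts into an array (the variable names are illustrative) and computes the expected counts, the chi-square statistic, the degrees of freedom and the p-value without calling `chi2_contingency`.

```python
import numpy as np
from scipy import stats

# Observed counts copied from the contingency table
# rows: promoted = 0, 1; columns: performance = 1, 2, 3, 4
observed = np.array([[50, 85, 77, 25],
                     [10, 25, 48, 30]])

# Row totals, column totals and grand total
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
total = observed.sum()

# Expected counts if promotion rates were identical in every category
expected = row_totals * col_totals / total

# Chi-square statistic: sum of (observed - expected)^2 / expected
chi2_stat = ((observed - expected) ** 2 / expected).sum()

# Degrees of freedom: (rows - 1) * (columns - 1)
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)

# p-value: area of the chi-square distribution beyond the statistic
p_value = stats.chi2.sf(chi2_stat, dof)
print(chi2_stat, dof, p_value)  # approx. 25.895, 3, 1.0e-05
```

    The expected counts match the array returned by `chi2_contingency`, and the statistic, degrees of freedom and p-value match the first three elements of its output.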

# Results
    The p-value of about 0.00001 is extremely small: it is the area under the chi-square distribution beyond our statistic. We therefore reject the null hypothesis in favour of the alternative hypothesis that there is a difference in promotion rates between performance categories.