# Hypothesis Testing notes

- standard deviation: 
   a measure of variability; it gives us an idea of how well the sample mean represents the data.
   
- standard error: a measure of variability on a larger scale: how variable the sampling distribution of the mean is.

SE = SD of sample means


In general, 
if you have a sample with more than 30 observations, you can treat its mean as coming from an approximately normal sampling distribution whose mean is close to the population mean (central limit theorem).

$SE = \frac {\sigma }  {\sqrt{n}}$

- SE gives us the level of certainty with which we can generalize from a sample to a population.
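
The two descriptions of SE above (the SD of sample means, and $\sigma / \sqrt{n}$) can be checked against each other with a small simulation. This is only a sketch: the population values `mu` and `sigma` and the sample size `n` are made-up assumptions, not taken from the data used in these notes.

```r
# Simulation sketch: the SD of many sample means approaches sigma / sqrt(n).
# mu, sigma and n are assumed values chosen for illustration only.
set.seed(42)
mu    <- 50000   # assumed population mean
sigma <- 15000   # assumed population standard deviation
n     <- 100     # size of each sample

# Draw 5000 samples of size n and record each sample mean
sample_means <- replicate(5000, mean(rnorm(n, mean = mu, sd = sigma)))

se_empirical   <- sd(sample_means)   # SD of the sample means
se_theoretical <- sigma / sqrt(n)    # formula value: 1500

c(empirical = se_empirical, theoretical = se_theoretical)
```

The two numbers should agree to within a few percent; increasing the number of replications tightens the match.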

## confidence interval

A range of values where the population mean is likely to fall.

( lower boundary, upper boundary )

 1. Define what you want the interval to tell you: a range in which the true value we want to estimate lies in a certain proportion of cases, usually *95%* or *99%*.
 
 2. Plug your data into a formula.
 

#### CI:

* __A large interval is uninformative because the true population mean can be anywhere in the range.__

* __If it is small, we can be fairly confident that in $95\%$ of cases the sample mean is a decent estimate of the population mean.__ 
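
As a sketch of the "plug your data into a formula" step, the interval for a mean with known $\sigma$ is $\bar{x} \pm z_{\alpha/2} \cdot \sigma / \sqrt{n}$. The helper `ci_known_sigma` and the simulated vector `x` below are hypothetical, introduced only for illustration.

```r
# Sketch: confidence interval for a mean when sigma is known.
# x is simulated (hypothetical) data; sigma is an assumed known population SD.
set.seed(1)
sigma <- 15000
x <- rnorm(50, mean = 50000, sd = sigma)

ci_known_sigma <- function(x, sigma, level = 0.95) {
  z  <- qnorm(1 - (1 - level) / 2)   # about 1.96 for level = 0.95
  se <- sigma / sqrt(length(x))      # standard error of the mean
  c(lower = mean(x) - z * se, upper = mean(x) + z * se)
}

ci_95 <- ci_known_sigma(x, sigma, level = 0.95)
ci_99 <- ci_known_sigma(x, sigma, level = 0.99)
ci_95
ci_99
```

The 99% interval is wider than the 95% one: asking for more confidence costs precision.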

## hypothesis

A hypothesis is “an idea that can be tested”

### The alternative hypothesis

The alternative hypothesis is the change or innovation that contests the status quo.
Usually the alternative is our own opinion. The idea is the following:
if the null is the status quo (i.e., what is generally believed), then the act of performing a test shows we have doubts about the truthfulness of the null. More often than not, the researcher’s opinion is contained in the alternative hypothesis.

<u>What the theory predicts to be true.</u>


### The Null hypothesis

<u>What the theory predicts to be false.</u>

### Testing: 

When testing, there are two decisions that can be made: to accept (more precisely, fail to reject) the null hypothesis or to reject the null hypothesis.

If we reject the null hypothesis, we can only support the alternative; we cannot actually prove it.

## Statistical errors (Type I Error and Type II Error)


In general, there are two types of errors we can make while testing:

- Type I error: rejecting a true null hypothesis (false positive). Its probability is $\alpha$, the significance level, i.e. the probability of rejecting a null hypothesis that is true. You choose $\alpha$ yourself, so responsibility for a Type I error falls on you.
- Type II error: accepting a false null hypothesis (false negative). Its probability is $\beta$, which depends on the sample size and the population variance.

Type II is considered the smaller problem.

$1-\beta$ is called the power of the test. Power of the test is increased by increasing the sample size.
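
A minimal sketch of how power grows with sample size for a two-sided z-test. The function `z_power` is a hypothetical helper, and `delta` (the true distance from the null mean) and `sigma` are assumed values, not estimates from the data in these notes.

```r
# Sketch: power of a two-sided z-test as a function of sample size.
# delta (true difference from the null mean) and sigma are assumed values.
z_power <- function(n, delta, sigma, alpha = 0.05) {
  z_crit <- qnorm(1 - alpha / 2)
  shift  <- delta * sqrt(n) / sigma   # mean of Z under the alternative
  # P(|Z| > z_crit) when Z ~ N(shift, 1)
  pnorm(-z_crit + shift) + pnorm(-z_crit - shift)
}

n_values <- c(10, 30, 100, 300)
power <- z_power(n_values, delta = 3000, sigma = 15000)
data.frame(n = n_values, power = round(power, 3), beta = round(1 - power, 3))
```

With `delta = 0` the function returns $\alpha$ itself, since rejecting a true null is exactly a Type I error.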

In [1]:
sal = read.csv('employee_data.csv')
head(sal,3)
summary(sal)

emp_no,first_name,last_name,birth_date,gender,title,salary,latest_start_date,end_of_contract_date
10001,Georgi,Facello,02/09/1953,M,Senior Engineer,60117,22/06/2010,01/01/9999
10002,Bezalel,Simmel,02/06/1964,F,Staff,65828,02/08/2001,01/01/9999
10003,Parto,Bamford,03/12/1959,M,Senior Engineer,40006,01/12/2001,01/01/9999


     emp_no          first_name     last_name        birth_date  gender 
 Min.   :10001   Sumali   :  5   Danley  :  4   13/11/1963:  4   F:390  
 1st Qu.:10251   Florina  :  4   Dredge  :  4   04/11/1960:  3   M:610  
 Median :10500   Hironoby :  4   Kaiser  :  4   05/11/1961:  3          
 Mean   :10500   Inderjeet:  4   Kalloufi:  4   10/11/1963:  3          
 3rd Qu.:10750   Munehiro :  4   Narahara:  4   11/09/1964:  3          
 Max.   :11000   Remko    :  4   Skafidas:  4   13/03/1963:  3          
                 (Other)  :975   (Other) :976   (Other)   :981          
                title         salary        latest_start_date
 Assistant Engineer: 50   Min.   : 40000   02/06/2002:  7    
 Engineer          :341   1st Qu.: 40000   04/10/2001:  7    
 Senior Engineer   :103   Median : 48585   07/01/2002:  7    
 Senior Staff      :295   Mean   : 53271   30/01/2002:  7    
 Staff             :148   3rd Qu.: 62115   03/04/2002:  6    
 Technique Leader  : 63   Max.   :106905   0

#### Known Variance populations

Example: in the data scientist salary example, the null would be: the mean
data scientist salary is \$113,000. We then try to reject the null with a
statistical test. So, usually, your personal opinion (e.g. data scientists don't
earn exactly that much) is the alternative hypothesis.



 Mean salary under the null hypothesis: $\mu = \$50,000$
 
 True standard deviation of the salary: $\sigma = \$15,000$
 
 $$ Z = \frac{\bar {x} - \mu} {SE} =  \frac{\bar {x} - \mu} {\sigma \ /\ \sqrt n } $$
 
 
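 For comparison, standardizing a single observation $x$ divides by $\sigma$ itself rather than by the standard error:

 $$ z = \frac{x - \mu} {\sigma} $$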


In [22]:
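# Write a function to perform a Z-test (population sd known)

z.test <- function(mu, sd, data){
    diff <- mean(data) - mu          # deviation of the sample mean from the hypothesized mean
    se <- sd / sqrt(length(data))    # standard error of the mean
    return(diff / se)                # Z statistic
}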

In [25]:
z.test(mu = 50000, sd=15000, data = sal$salary )

We choose:

$\alpha = 0.05$

$z_{\alpha/2} = 1.96$ (two-sided test)

Since $|Z| \approx 6.89$ is larger than $1.96$, we can reject the null hypothesis.

## P-Value

<u>The p-value is the smallest level of significance at which we can still reject the null hypothesis,
given the observed sample statistic</u>
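
As a sketch of this definition, the critical value and the two-sided p-value for a Z statistic can be obtained from the standard normal distribution with base R's `qnorm` and `pnorm`. The helper name `p_value_two_sided` is hypothetical; `z_obs` is the rounded Z statistic from the salary test above.

```r
# Sketch: critical value and two-sided p-value for a Z statistic.
alpha  <- 0.05
z_crit <- qnorm(1 - alpha / 2)   # about 1.96

# Probability of a value at least as extreme as z in either tail
p_value_two_sided <- function(z) 2 * pnorm(-abs(z))

z_obs <- 6.89                    # rounded Z statistic from the salary test
p_obs <- p_value_two_sided(z_obs)

p_obs < alpha                    # TRUE: reject the null at the 5% level
```

Because the p-value is far below any usual $\alpha$, the null would also be rejected at 1%.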


In [26]:
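denom = 0.5 / sqrt(40)   # standard error of the form sd / sqrt(n), presumably sd = 0.5 and n = 40

In [27]:
-.2 / denom              # Z statistic for a difference of -0.2 from the hypothesized mean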

In [28]:
1-0.979

## Test for the mean - population variance unknown

In [32]:
# Using the data from the lesson, solve the following tasks:

# What if the question was: is the competitor open rate EXACTLY 40%. What would be the decision then?

# 1. Test at 5% significance. Comment on the decision with the appropriate statistical jargon.
# 2. Test at 1% significance. Comment on the decision with the appropriate statistical jargon.

# Hint: Think about what type of test would be suitable here (one- or two-sided).

library(psych)

rate <- read.csv("ttest-a.csv") # load your data
describe(rate) # understand your data

# One-sample t statistic: (sample mean - hypothesized mean) / estimated standard error
my.t.test <- function(a, hmean){
  t <- (mean(a) - hmean)/(sd(a)/sqrt(length(a)))
  return(t)
}

my.t.test(rate$Open.rate, 0.4)

# H0: open rate is 40%
# H1: open rate is NOT 40%
# The problem is a two-sided test
# T = -0.53, so |T| = 0.53
# t1 = 2.26 (critical value at 5%, df = 9). |T| < t1: accept (fail to reject) the null. At the 5% significance level we cannot say that the competitor's open rate differs from 40%.
# t2 = 3.25 (critical value at 1%, df = 9). |T| < t2: accept (fail to reject) the null. At 1% significance the sample likewise gives no evidence that our competitor's open rate differs from 40%.


Unnamed: 0,vars,n,mean,sd,median,trimmed,mad,min,max,range,skew,kurtosis,se
X1,1,10,0.377,0.13736,0.345,0.36875,0.14826,0.23,0.59,0.36,0.3517242,-1.63805,0.04343706


In [30]:
install.packages("psych")



In [34]:
summary(rate)

   Open.rate     
 Min.   :0.2300  
 1st Qu.:0.2675  
 Median :0.3450  
 Mean   :0.3770  
 3rd Qu.:0.4725  
 Max.   :0.5900  
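
The critical values 2.26 and 3.25 quoted in the comments come from the t distribution with $n - 1 = 9$ degrees of freedom. The sketch below recomputes the statistic, the critical values, and a two-sided p-value from the summary numbers printed by `describe()` (n, mean, se), so it does not need the CSV file; the variable names are assumptions made for illustration.

```r
# Sketch: one-sample t-test from summary statistics only.
# n, x_bar and se are copied from the describe() output above.
n      <- 10
x_bar  <- 0.377
se     <- 0.04343706
t_stat <- (x_bar - 0.4) / se           # about -0.53
df     <- n - 1

t_crit_5 <- qt(1 - 0.05 / 2, df)       # about 2.26
t_crit_1 <- qt(1 - 0.01 / 2, df)       # about 3.25
p_value  <- 2 * pt(-abs(t_stat), df)   # two-sided p-value

abs(t_stat) > t_crit_5                 # FALSE: do not reject at 5%
abs(t_stat) > t_crit_1                 # FALSE: do not reject at 1%
```

With the raw data available, `t.test(rate$Open.rate, mu = 0.4)` should report the same statistic and p-value.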

## Comparing two means - dependent samples Example

A health guru on the internet designed a weight-loss program.
You are wondering if it is working. You are given a sample of some people who did the program.

You can find the data in kg if you prefer working with kg as a unit of measurement.

State the null hypothesis.

Calculate the appropriate statistic

Decide if this is a one-sided or a two-sided test. What is the p-value?

Based on the p-value, decide at 1%, 5% and 10% significance whether the program is working. Comment using the appropriate statistical jargon.

In [38]:

library(pastecs)
library(psych)

weight <- read.csv("weight_data_exercise_kg.csv")
describe(weight)

dep.t.test <- t.test(weight$before, weight$after, paired = TRUE, alternative = "greater") # one-sided: before > after
dep.t.test

# H0: The mean difference between the before and the after conditions is less than or equal to 0
# t = 2.01
# The test is one-sided. We want to know if people are actually losing weight. p = 0.038
# At 1% significance we accept (fail to reject) the null hypothesis: there is not enough evidence that the program is working.
# At 5% significance, we reject the null hypothesis. The data supports that the program is working.
# At 10% significance, we also reject the null: there is enough statistical evidence that the program is working.

Unnamed: 0,vars,n,mean,sd,median,trimmed,mad,min,max,range,skew,kurtosis,se
before,1,10,106.426,10.50876,107.775,106.8037,7.620564,88.84,120.99,32.15,-0.2885115,-1.234775,3.323163
after,2,10,105.2888,10.72571,104.9389,105.3591,8.309973,87.36389,122.6514,35.28749,-0.1708298,-1.057865,3.391766



	Paired t-test

data:  weight$before and weight$after
t = 2.0058, df = 9, p-value = 0.03792
alternative hypothesis: true difference in means is greater than 0
95 percent confidence interval:
 0.0979058       Inf
sample estimates:
mean of the differences 
               1.137196 
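
A paired t-test is equivalent to a one-sample t-test on the per-person differences. The sketch below shows this on hypothetical `before`/`after` vectors (made-up numbers, not the exercise data) and then applies the p-value decision rule at the three significance levels. The last line recomputes the p-value from the statistic reported in the output above.

```r
# Sketch: a paired t-test is a one-sample t-test on the differences.
# before/after are hypothetical weights, not the exercise data.
before <- c(90.1, 102.4, 110.3, 95.7, 120.2, 88.9, 105.5, 99.8)
after  <- c(89.0, 101.9, 108.8, 96.1, 118.4, 88.2, 104.1, 99.9)

paired <- t.test(before, after, paired = TRUE, alternative = "greater")
onesam <- t.test(before - after, mu = 0, alternative = "greater")

# Both calls give the same statistic and p-value
c(paired = unname(paired$statistic), one_sample = unname(onesam$statistic))

# Decision at several significance levels: reject H0 when p < alpha
alphas <- c(0.01, 0.05, 0.10)
data.frame(alpha = alphas, reject_H0 = paired$p.value < alphas)

# One-sided p-value for the reported t = 2.0058 with df = 9 (about 0.038)
p_reported <- pt(2.0058, df = 9, lower.tail = FALSE)
```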


In [40]:
install.packages('tidyverse')

