# Inferential statistics

Inferential statistics helps us draw inferences about a larger population from sample data, because we rarely have access to data for the full population.

Consider the following: the evidence linking cigarette smoking and lung disease is almost irrefutable. Given this evidence, what proportion of Americans have given up smoking? One way to answer this question would be to survey the entire population of the United States, but that would be impractical in terms of time and cost. Mathematicians have created methods for estimating population parameters from samples that adequately represent the larger target population.

In inferential statistics, we answer questions (test hypotheses) about populations based upon sample or prior data. Since parameters of populations are generally not available, we must rely on sampling techniques to estimate them. 

## Sampling Distributions

A sampling distribution is the theoretical distribution of a statistic (such as the mean) that would result if we took all possible samples of a given size from a population. Suppose we could draw every possible representative sample of 20 individuals (N = 20) from a population and compute the mean of each one; the distribution of those means would be a sampling distribution. In practice we draw far fewer samples, knowing that this introduces some error. This error can be calculated and is used in inference.

For example, how do we know that the average height of males in this country is 5'9" (69 inches)? If we drew one hundred different samples of 10 males at random, we would find some variation among the means and standard deviations of the samples. Imagine that the standard deviation of our sample means is 2.25 inches. This quantity is called the "standard error of the mean". It can be defined as the theoretical standard deviation of the means of samples of a given size drawn from a population.
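The idea of a standard error can be illustrated with a small simulation. The sketch below assumes a hypothetical population of heights that is normally distributed with a mean of 69 inches and a standard deviation of 3 inches (these values, and all variable names, are illustrative assumptions rather than real survey figures). It draws one hundred samples of 10 and compares the standard deviation of the sample means with the theoretical value.

```r
set.seed(42)  # fixed seed so the sketch is reproducible

# Hypothetical population: heights in inches, assumed normal (mean 69, sd 3)
pop_mu <- 69
pop_sigma <- 3
n_samples <- 100  # number of samples drawn
n <- 10           # size of each sample

# Draw 100 samples of 10 and record the mean of each sample
sample_means <- replicate(n_samples, mean(rnorm(n, mean = pop_mu, sd = pop_sigma)))

# Empirical standard error: standard deviation of the sample means
empirical_se <- sd(sample_means)

# Theoretical standard error: sigma / sqrt(n)
theoretical_se <- pop_sigma / sqrt(n)

print(c(empirical = empirical_se, theoretical = theoretical_se))
```

With only one hundred samples the two values will be close but not identical; drawing more samples should bring the empirical value closer to the theoretical one.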

When researchers make an inference, they do so knowing that it carries a calculable chance of error. They generally use one of two cutoff points for this error, based on the normal distribution, called significance levels. Some researchers conclude that if an event would occur by chance 5% of the time or less, it can be attributed to non-chance factors. In other settings, the threshold is 1% of the time or less. These are the 0.05 and 0.01 significance levels, respectively. The problem domain and other factors drive the significance level required for a particular statistical evaluation.
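To make the two cutoff points concrete, the following sketch uses R's `qnorm` to find the two-tailed critical z-values for the 5% and 1% levels. The variable names are illustrative only.

```r
# Two conventional significance levels
alpha_levels <- c(0.05, 0.01)

# Two-tailed critical values: alpha is split equally between both tails
z_crit <- qnorm(1 - alpha_levels / 2)

print(round(z_crit, 2))
# [1] 1.96 2.58
```

A result more extreme than these z-values would occur by chance less than 5% or 1% of the time, respectively, under a normal distribution.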

### Example

In the case of male heights, if we chose 10 males at random, their average height would fall within 69 $\pm$ 1.96 $\times$ 2.25 inches, in other words between 64.59 and 73.41 inches. The standard error of the mean is multiplied by 1.96 because a z-score of $\pm$ 1.96 encompasses 95% of the normal distribution. Here, using the 0.05 significance level, we would theoretically be correct 95% of the time. We would also expect 5% of the samples we chose to have a mean height either greater than 73.41 inches or less than 64.59 inches.
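The arithmetic in this example can be reproduced in a few lines. This is a minimal sketch using the mean of 69 inches and the standard error of 2.25 inches from the text; the variable names are assumptions made for illustration.

```r
mean_height <- 69   # average height in inches (from the example)
se_mean <- 2.25     # standard error of the mean in inches (from the example)

# z-value that leaves 2.5% in each tail of the normal distribution
z <- qnorm(0.975)

lower <- mean_height - z * se_mean
upper <- mean_height + z * se_mean
print(round(c(lower, upper), 2))
# [1] 64.59 73.41

# Share of a normal distribution that falls inside the interval
coverage <- pnorm(upper, mean_height, se_mean) - pnorm(lower, mean_height, se_mean)
print(coverage)
# [1] 0.95
```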

When dealing with two-tailed tests, alpha levels (significance levels) are sometimes described in terms of confidence levels. The 0.05 alpha level corresponds to the 95% confidence band. You might say that you are 95% confident in asserting your hypothesis.

To perform inferential statistics or parametric tests of significance, we have to use sampling distributions. To create a sampling distribution, we would need to draw all possible samples of size N from a given population. Once we have calculated the mean of each sample, the resulting distribution of those means is the sampling distribution of means.

Sampling distributions have 3 characteristics:
* The mean of the sampling distribution does not change with the sample size. If the mean of the sampling distribution of means is 20 when N = 10, it will remain 20 whether you increase or decrease the size of the samples. Simply put, the mean of the sampling distribution is equal to the mean of the population.
* As the sample size in the sampling distribution of means increases, the dispersion of the sample means becomes smaller. The larger the N, the more compact the distribution of sample means. As N increases, the standard error of the mean decreases.
* If the sampling distribution of means is taken from a normally distributed population, the distribution of sample means will also be bell-shaped (normal).
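The first two characteristics can be checked by simulation. The sketch below assumes a hypothetical normal population with a mean of 20 and a standard deviation of 5 (both values are illustrative assumptions) and approximates the sampling distribution with 5000 random samples per sample size, rather than all possible samples.

```r
set.seed(1)  # fixed seed so the sketch is reproducible

# Hypothetical normal population (assumed values)
pop_mu <- 20
pop_sigma <- 5
sizes <- c(10, 40, 160)

# For each sample size, draw 5000 samples and summarise their means
results <- t(sapply(sizes, function(n) {
  means <- replicate(5000, mean(rnorm(n, mean = pop_mu, sd = pop_sigma)))
  c(N = n,
    mean_of_means = mean(means),           # stays close to pop_mu
    se_empirical = sd(means),              # shrinks as N grows
    se_theoretical = pop_sigma / sqrt(n))  # sigma / sqrt(N)
}))

print(round(results, 3))
```

The `mean_of_means` column should stay near 20 for every N, while the standard error columns should roughly halve each time N is quadrupled.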

Building on these 3 characteristics, the Central Limit Theorem states: if random samples of fixed size N are drawn from any population with a finite variance, then as N becomes larger, the distribution of sample means approaches a normal distribution centered on the population mean $\mu$.
The standard error of the sample means is equal to

$$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{N}}$$

and the z-statistic is given by

$$Z = \frac{\bar{X}-\mu}{\sigma_{\bar{x}}}$$

where,
* $\bar{X}$ = sample mean,
* $\mu$ = population mean,
* $\sigma$ = population standard deviation, and
* $\sigma_{\bar{x}}$ = standard error of the mean.
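The Central Limit Theorem and the z-statistic can be explored together with a simulation. This sketch assumes a deliberately skewed, hypothetical population (exponential with rate 1, so $\mu = 1$ and $\sigma = 1$) and standardizes the sample means using the formulas above. The `skewness` helper is defined here for illustration and is not a base R function.

```r
set.seed(2024)  # fixed seed so the sketch is reproducible

# Hypothetical skewed population: exponential with rate 1 (mu = 1, sigma = 1)
mu <- 1
sigma <- 1

# Sample skewness: close to 0 for a symmetric distribution such as the normal
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

clt_summary <- t(sapply(c(2, 30, 200), function(n) {
  # 10000 sample means, each computed from a sample of size n
  means <- replicate(10000, mean(rexp(n, rate = 1)))
  # Z = (sample mean - mu) / (sigma / sqrt(n))
  z <- (means - mu) / (sigma / sqrt(n))
  c(N = n, mean_z = mean(z), sd_z = sd(z), skew_z = skewness(z))
}))

print(round(clt_summary, 3))
```

As N grows, the skewness of the standardized means should move toward 0 while their mean and standard deviation stay near 0 and 1, which is what the theorem leads us to expect.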


## Hypothesis

Null hypothesis: The null hypothesis specifies values for parameters. It is generally referred to as the "no significant difference" hypothesis. I personally call it the "dull" hypothesis. Most analyses are set up to reject or not reject the null hypothesis.

Alternate hypothesis: The alternative hypothesis states that the population parameters are something other than the values specified by the null hypothesis. A statement like "this class is different from other statistics classes" is an example of an alternate hypothesis.

### Statistical error (Type I, Alpha & Type II, Beta)

#### Type I: 
When we reject the null hypothesis even though it is really true, we make a Type I error. The probability of a Type I error is equal to the alpha level we set, so it is sometimes called the alpha error. If we set the alpha level at 0.05, our chance of making a Type I error is 5%.

#### Type II: 
When we fail to reject the null hypothesis even though it is actually false, we make a Type II error (beta error). In practice, a Type II error is often more likely than a Type I error. For a fixed sample size, the lower we set the alpha level, the less likely we are to make a Type I error and the more likely we are to make a Type II error.
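The trade-off between the two error types can be seen in a simulation. The sketch below assumes a hypothetical two-tailed z-test with a null mean of 50, a known standard deviation of 10, and samples of 25; all of these values, and the `rejection_rate` helper, are illustrative assumptions.

```r
set.seed(123)  # fixed seed so the sketch is reproducible

# Hypothetical test setup (assumed values)
mu0 <- 50     # mean under the null hypothesis
sigma <- 10   # known population standard deviation
n <- 25       # sample size

# Proportion of simulated samples in which a two-tailed z-test rejects H0
rejection_rate <- function(true_mu, alpha, n_trials = 10000) {
  z_crit <- qnorm(1 - alpha / 2)
  rejected <- replicate(n_trials, {
    x <- rnorm(n, mean = true_mu, sd = sigma)
    z <- (mean(x) - mu0) / (sigma / sqrt(n))
    abs(z) > z_crit
  })
  mean(rejected)
}

# Type I error rate: H0 is true, so every rejection is a mistake
type1_05 <- rejection_rate(mu0, alpha = 0.05)

# Type II error rate: H0 is false (true mean is 53), failing to reject is a mistake
type2_05 <- 1 - rejection_rate(mu0 + 3, alpha = 0.05)
type2_01 <- 1 - rejection_rate(mu0 + 3, alpha = 0.01)

print(c(type1_05 = type1_05, type2_05 = type2_05, type2_01 = type2_01))
```

The Type I rate should land near the chosen alpha, and lowering alpha from 0.05 to 0.01 should raise the Type II rate for the same true difference.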

Let's see what inferences we can draw from the body dimensions (bdims) dataset. The dataset contains body dimension data from 247 men and 260 women.

Suppose we want to test whether the variable sex is significant. Our assumption is that males (sex = 1) weigh more, on average, than the population as a whole.

To verify this assumption, let's use a z-test and see if males are actually heavier than the overall population.

H0 = Null Hypothesis = There is no significant difference between the mean weight of men and the mean weight of the overall population.

H1 = Alternative Hypothesis = Men are, on average, heavier than the overall population.

* Datasource = http://www.openintro.org/stat/data/bdims.RData
* Details = http://www.openintro.org/stat/data/bdims.php

In [1]:
download.file("http://www.openintro.org/stat/data/bdims.RData", destfile = "bdims.RData")

In [2]:
load("bdims.RData")

In [4]:
t(head(bdims))

|  | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|
| bia.di | 42.9 | 43.7 | 40.1 | 44.3 | 42.5 | 43.3 |
| bii.di | 26.0 | 28.5 | 28.2 | 29.9 | 29.9 | 27.0 |
| bit.di | 31.5 | 33.5 | 33.3 | 34.0 | 34.0 | 31.5 |
| che.de | 17.7 | 16.9 | 20.9 | 18.4 | 21.5 | 19.6 |
| che.di | 28.0 | 30.8 | 31.7 | 28.2 | 29.4 | 31.3 |
| elb.di | 13.1 | 14.0 | 13.9 | 13.9 | 15.2 | 14.0 |
| wri.di | 10.4 | 11.8 | 10.9 | 11.2 | 11.6 | 11.5 |
| kne.di | 18.8 | 20.6 | 19.7 | 20.9 | 20.7 | 18.8 |
| ank.di | 14.1 | 15.1 | 14.1 | 15.0 | 14.9 | 13.9 |
| sho.gi | 106.2 | 110.5 | 115.1 | 104.5 | 107.5 | 119.8 |


In [5]:
summary(bdims)

     bia.di          bii.di          bit.di          che.de     
 Min.   :32.40   Min.   :18.70   Min.   :24.70   Min.   :14.30  
 1st Qu.:36.20   1st Qu.:26.50   1st Qu.:30.60   1st Qu.:17.30  
 Median :38.70   Median :28.00   Median :32.00   Median :19.00  
 Mean   :38.81   Mean   :27.83   Mean   :31.98   Mean   :19.23  
 3rd Qu.:41.15   3rd Qu.:29.25   3rd Qu.:33.35   3rd Qu.:20.90  
 Max.   :47.40   Max.   :34.70   Max.   :38.00   Max.   :27.50  
     che.di          elb.di          wri.di          kne.di     
 Min.   :22.20   Min.   : 9.90   Min.   : 8.10   Min.   :15.70  
 1st Qu.:25.65   1st Qu.:12.40   1st Qu.: 9.80   1st Qu.:17.90  
 Median :27.80   Median :13.30   Median :10.50   Median :18.70  
 Mean   :27.97   Mean   :13.39   Mean   :10.54   Mean   :18.81  
 3rd Qu.:29.95   3rd Qu.:14.40   3rd Qu.:11.20   3rd Qu.:19.60  
 Max.   :35.60   Max.   :16.70   Max.   :13.30   Max.   :24.30  
     ank.di          sho.gi           che.gi           wai.gi      
 Min.   : 9.90   Min. 

Let's create 2 different datasets: male and female.

In [6]:
male <- subset(bdims, sex == 1)
female <- subset(bdims, sex == 0)

### Calculating Z-Score

A z-score measures how many standard deviations an observation lies above or below the population mean.

Variance is the average squared difference from the mean, and the standard deviation is the square root of the variance.
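These two definitions can be written out directly. The sketch below uses a small, made-up vector of values (an assumption for illustration). Note that R's built-in `var` and `sd` divide by n - 1 rather than n, so they return the sample versions; with a dataset of several hundred rows the difference is very small.

```r
# Hypothetical data, chosen so the arithmetic is easy to follow
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

# Population variance: average squared difference from the mean
pop_variance <- mean((x - mean(x))^2)
print(pop_variance)
# [1] 4

# Standard deviation: square root of the variance
pop_sd <- sqrt(pop_variance)
print(pop_sd)
# [1] 2

# R's var() uses the n - 1 denominator (sample variance)
sample_variance <- var(x)
print(sample_variance)
# [1] 4.571429
```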

In [9]:
sample_mean <- mean(male$wgt)
print(sample_mean)

[1] 78.14453


In [10]:
pop_mean <- mean(bdims$wgt)
print(pop_mean)

[1] 69.14753


In [11]:
pop_var <- var(bdims$wgt)
print(pop_var)

[1] 178.1094


In [12]:
zscore <- (sample_mean - pop_mean) / (sqrt(pop_var))
print(zscore)

[1] 0.6741466


In [14]:
# Helper function that wraps the z-score calculation above

z.score <- function(sam, pop){
    sample_mean <- mean(sam)
    pop_mean <- mean(pop)
    pop_var <- var(pop)
    zscore <- (sample_mean - pop_mean) / (sqrt(pop_var))
    return(zscore)    
}

In [15]:
z.score(male$wgt, bdims$wgt)

The z-score is 0.67 after rounding. Next, we need to work out what percentage of the population weighs less than the average man. We can look this value up in a standard normal distribution table (z-score table).

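A z-score of 0.67 corresponds to a cumulative probability of 0.7486, so about 74.86% of the population weighs less than the average man.

On its own, this percentage describes where the average male weight falls among individual weights; it is not yet a test of significance. A formal z-test divides the difference in means by the standard error $\sigma / \sqrt{N}$ (with N = 247 men) rather than by $\sigma$, which gives a z-statistic of roughly 10.6, far beyond the cutoff for the 0.05 significance level. This allows us to reject the null hypothesis and supports our alternative hypothesis that males tend to weigh more than the general population.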
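For a formal test of a sample mean, the difference is usually divided by the standard error $\sigma / \sqrt{N}$ rather than by $\sigma$ itself. The following is a minimal sketch of that calculation, using the summary values printed earlier in this notebook and the 247 men in the dataset. It treats the full dataset as the population, which is an assumption of this walkthrough, and the names `se`, `z_stat` and `p_value` are illustrative.

```r
# Summary values printed earlier in the notebook
sample_mean <- 78.14453   # mean(male$wgt)
pop_mean <- 69.14753      # mean(bdims$wgt)
pop_var <- 178.1094       # var(bdims$wgt)
n <- 247                  # number of men in the dataset

# Standard error of the mean: sigma / sqrt(N)
se <- sqrt(pop_var) / sqrt(n)

# z-statistic for the sample mean
z_stat <- (sample_mean - pop_mean) / se
print(z_stat)

# One-tailed p-value: probability of a z-statistic at least this large under H0
p_value <- pnorm(z_stat, lower.tail = FALSE)
print(p_value < 0.05)
# [1] TRUE
```

The z-statistic comes out at roughly 10.6, so the p-value is far below 0.05 and the null hypothesis is rejected under these assumptions.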

In [16]:
sample_mean <- mean(male$wgt)
pop_mean <- mean(bdims$wgt)
pop_sd <- sd(bdims$wgt)

In [18]:
# Use R's normal cumulative distribution function to get the probability of a
# value below the sample mean, given the population mean and standard deviation
p <- pnorm(sample_mean, mean = pop_mean, sd = pop_sd, lower.tail = TRUE)
print(p)

[1] 0.7498909


In [20]:
print(paste("Probability", p))
print(paste("Mean male weight", mean(male$wgt)))
print(paste("Mean pop weight", mean(bdims$wgt)))

[1] "Probability 0.749890930193527"
[1] "Mean male weight 78.1445344129555"
[1] "Mean pop weight 69.1475345167653"


### Chi-Square analysis

A nonparametric test of significance makes no assumption about the shape of the population distribution and is commonly referred to as a distribution-free test of significance. Nonparametric procedures are more suitable when the data are categorical and for research that compares groups.