# Frequentist Inference Case Study - Part B

## Learning objectives

Welcome to Part B of the Frequentist inference case study! The purpose of this case study is to help you apply the concepts associated with Frequentist inference in Python. In particular, you'll practice writing Python code to apply the following statistical concepts: 
* the _z_-statistic
* the _t_-statistic
* the difference and relationship between the two
* the Central Limit Theorem, including its assumptions and consequences
* how to estimate the population mean and standard deviation from a sample
* the concept of a sampling distribution of a test statistic, particularly for the mean
* how to combine these concepts to calculate a confidence interval

In the previous notebook, we used only data from a known normal distribution. **You'll now tackle real data, rather than simulated data, and answer some relevant real-world business problems using the data.**

## Hospital medical charges

Imagine that a hospital has hired you as their data scientist. An administrator is working on the hospital's business operations plan and needs you to help them answer some business questions. 

In this assignment notebook, you're going to use frequentist statistical inference on a data sample to answer the questions:
* has the hospital's revenue stream fallen below a key threshold?
* are patients with insurance really charged different amounts than those without?

Answering that last question with a frequentist approach makes some assumptions, and requires some knowledge, about the two groups.

We are going to use some data on medical charges obtained from [Kaggle](https://www.kaggle.com/easonlai/sample-insurance-claim-prediction-dataset). 

For the purposes of this exercise, assume the observations are the result of random sampling from our single hospital. Recall that in the previous assignment, we introduced the Central Limit Theorem (CLT), and its consequence that the distributions of sample statistics approach a normal distribution as $n$ increases. The amazing thing about this is that it applies to the sampling distributions of statistics that have been calculated from even highly non-normal distributions of data! Recall, also, that hypothesis testing is very much based on making inferences about such sample statistics. You're going to rely heavily on the CLT to apply frequentist (parametric) tests to answer the questions in this notebook.
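
The CLT claim above can be checked with a small simulation. The sketch below does not use the hospital data: it assumes a synthetic, strongly right-skewed lognormal population (the parameters and the helper `sample_means` are illustrative choices, not part of the assignment) and compares the skewness of the raw values with that of sample means for two sample sizes.

```python
import numpy as np
from scipy.stats import skew

# Synthetic, strongly right-skewed "population" (an assumption for illustration only).
rng = np.random.default_rng(47)
population = rng.lognormal(mean=9.0, sigma=1.0, size=100_000)

def sample_means(data, n, n_samples, rng):
    """Draw n_samples samples of size n (with replacement) and return their means."""
    draws = rng.choice(data, size=(n_samples, n), replace=True)
    return draws.mean(axis=1)

means_n5 = sample_means(population, n=5, n_samples=2_000, rng=rng)
means_n500 = sample_means(population, n=500, n_samples=2_000, rng=rng)

# Skewness is 0 for a normal distribution; it should shrink as n grows.
print("skewness of raw data:      ", skew(population))
print("skewness of means (n=5):   ", skew(means_n5))
print("skewness of means (n=500): ", skew(means_n500))
```

If the CLT applies, the skewness of the sample means should move toward zero as the sample size increases, even though the underlying values remain heavily skewed.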

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import t
from scipy.stats import norm
from scipy.stats import lognorm
from numpy.random import seed
from scipy.stats import ttest_ind
import os
os.getcwd()
medical = pd.read_csv('data/insurance2.csv')

In [None]:
medical.shape

In [None]:
medical.head()

__Q1:__ Plot the histogram of charges and calculate the mean and standard deviation. Comment on the appropriateness of these statistics for the data.

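__A:__ The distribution is strongly skewed to the right. The mean is pulled upward by the long tail of high charges, and the sample standard deviation is very large for the same reason, so neither statistic describes a typical charge well. They are still the quantities needed to make inferences about the population mean.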

In [None]:
plt.hist(medical.charges)

In [None]:
mean = np.mean(medical.charges)
stddev = np.std(medical.charges, ddof=1)  # ddof=1 gives the sample standard deviation
mean, stddev

__Q2:__ The administrator is concerned that the actual average charge has fallen below 12,000, threatening the hospital's operational model. On the assumption that these data represent a random sample of charges, how would you justify that these data allow you to answer that question? And what would be the most appropriate frequentist test, of the ones discussed so far, to apply?

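__A:__ The sample is large enough that, by the Central Limit Theorem, the sampling distribution of the mean is approximately normal even though the charges themselves are strongly skewed. That justifies calculating a 95% confidence interval for the mean and checking whether it contains values below 12,000. We can also find the probability of the mean being below 12,000 using the CDF of the sampling distribution. Because the population standard deviation is unknown and has to be estimated from the sample, the most appropriate test is a one-sample _t_-test.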
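
A one-sample, one-sided _t_-test of this kind can be sketched as follows. The data here are synthetic lognormal values standing in for charges, and the names `charges` and `threshold` are illustrative assumptions; the lower-tail p-value is computed by hand from the _t_ distribution, and `scipy.stats.ttest_1samp` is used only to cross-check the statistic.

```python
import numpy as np
from scipy.stats import t, ttest_1samp

# Hypothetical right-skewed charges; synthetic values, not the hospital data.
rng = np.random.default_rng(47)
charges = rng.lognormal(mean=9.0, sigma=1.0, size=1_000)
threshold = 12_000  # value of the mean under the null hypothesis

n = len(charges)
std_err = np.std(charges, ddof=1) / np.sqrt(n)
t_stat = (np.mean(charges) - threshold) / std_err

# H0: mean >= threshold, H1: mean < threshold, so the p-value is the lower tail.
p_lower = t.cdf(t_stat, df=n - 1)

# Cross-check the statistic against scipy's (two-sided) one-sample t-test.
result = ttest_1samp(charges, popmean=threshold)
print(t_stat, p_lower, result.statistic, result.pvalue)
```

A small `p_lower` would be evidence that the mean has fallen below the threshold; a large one means the data give no such evidence.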

__Q3:__ Given the nature of the administrator's concern, what is the appropriate confidence interval in this case? A ***one-sided*** or ***two-sided*** interval? (Refresh your understanding of this concept on p. 399 of the *AoS*). Calculate the critical value and the relevant 95% confidence interval for the mean, and comment on whether the administrator should be concerned.

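__A:__ A one-sided confidence interval is appropriate, because the administrator only cares whether the mean has fallen below 12,000. The lower bound of the 95% interval computed below lies above 12,000, so the administrator need not be concerned.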
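
The difference between the two kinds of interval can be sketched with a pair of small helper functions. Both `lower_confidence_bound` and `two_sided_interval` are hypothetical names introduced here, and the sample is synthetic; the sketch uses the _t_ distribution with the sample standard deviation rather than a bootstrap.

```python
import numpy as np
from scipy.stats import t

def lower_confidence_bound(sample, confidence=0.95):
    """One-sided lower confidence bound for the mean (t distribution)."""
    sample = np.asarray(sample, dtype=float)
    n = sample.size
    std_err = sample.std(ddof=1) / np.sqrt(n)
    # All of the (1 - confidence) tail probability sits on one side.
    critical_value = t.ppf(confidence, df=n - 1)
    return sample.mean() - critical_value * std_err

def two_sided_interval(sample, confidence=0.95):
    """Two-sided confidence interval for the mean (t distribution)."""
    sample = np.asarray(sample, dtype=float)
    n = sample.size
    std_err = sample.std(ddof=1) / np.sqrt(n)
    # The tail probability is split equally between the two sides.
    critical_value = t.ppf(1 - (1 - confidence) / 2, df=n - 1)
    margin = critical_value * std_err
    return sample.mean() - margin, sample.mean() + margin

# Synthetic right-skewed data, used only to exercise the functions.
rng = np.random.default_rng(47)
demo_sample = rng.lognormal(mean=9.0, sigma=1.0, size=1_000)

one_sided_lower = lower_confidence_bound(demo_sample)
two_sided_lower, two_sided_upper = two_sided_interval(demo_sample)
print(one_sided_lower, (two_sided_lower, two_sided_upper))
```

At the same confidence level the one-sided bound sits closer to the sample mean than the lower end of the two-sided interval, because its critical value is smaller.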

In [None]:
seed(47)
sampledist = [np.mean(np.random.choice(medical.charges, size = len(medical.charges), replace = True)) for i in range(100)]
sampledist[:5]

In [None]:
samplemean = np.mean(sampledist)
samplestddev = np.std(sampledist, ddof = 1)
samplemean, samplestddev

In [None]:
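# The standard deviation of the bootstrap means already estimates the standard
# error of the mean, so it must not be divided by sqrt(n) again.
stderr = samplestddev
# One-sided 95% interval: all 5% of the tail probability lies below the bound.
critical_val = norm.ppf(0.95)
margin_of_error = stderr * critical_val
confint = (samplemean - margin_of_error, np.inf)
confint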

In [None]:
# Probability that the mean is below 12,000, from the bootstrap sampling distribution of the mean
norm(samplemean, samplestddev).cdf(12000)

The administrator then wants to know whether people with insurance really are charged a different amount to those without.

__Q4:__ State the null and alternative hypothesis here. Use the _t_-test for the difference between means, where the pooled standard deviation of the two groups is given by:
\begin{equation}
s_p = \sqrt{\frac{(n_0 - 1)s^2_0 + (n_1 - 1)s^2_1}{n_0 + n_1 - 2}}
\end{equation}

and the *t*-test statistic is then given by:

\begin{equation}
t = \frac{\bar{x}_0 - \bar{x}_1}{s_p \sqrt{1/n_0 + 1/n_1}}.
\end{equation}

(If you need some reminding of the general definition of ***t-statistic***, check out the definition on p. 404 of *AoS*). 

What assumption about the variances of the two groups are we making here?

__A:__ Null hypothesis: there is no difference between the mean medical charges of insured and uninsured people.

Alternative hypothesis: there is a difference between the mean medical charges of insured and uninsured people.
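
The two formulas in Q4 translate into code almost term by term. The sketch below wraps them in a hypothetical helper, `pooled_t_test`, and runs it on synthetic groups (not the hospital data); the p-value is two-sided, matching the alternative hypothesis of a difference in either direction.

```python
import numpy as np
from scipy.stats import t, ttest_ind

def pooled_t_test(x0, x1):
    """Two-sample t-test with pooled standard deviation; returns (t, two-sided p)."""
    x0 = np.asarray(x0, dtype=float)
    x1 = np.asarray(x1, dtype=float)
    n0, n1 = x0.size, x1.size
    s0_sq, s1_sq = x0.var(ddof=1), x1.var(ddof=1)
    # Pooled standard deviation s_p
    s_p = np.sqrt(((n0 - 1) * s0_sq + (n1 - 1) * s1_sq) / (n0 + n1 - 2))
    # t-statistic for the difference between means
    t_stat = (x0.mean() - x1.mean()) / (s_p * np.sqrt(1 / n0 + 1 / n1))
    df = n0 + n1 - 2
    # Two-sided p-value: probability of a statistic at least this extreme in either tail
    p_value = 2 * t.sf(abs(t_stat), df=df)
    return t_stat, p_value

# Synthetic groups with equal variance, used only to exercise the function.
rng = np.random.default_rng(47)
group0 = rng.normal(loc=100.0, scale=15.0, size=120)
group1 = rng.normal(loc=105.0, scale=15.0, size=80)

manual_t, manual_p = pooled_t_test(group0, group1)
scipy_result = ttest_ind(group0, group1)  # equal_var=True by default
print(manual_t, manual_p, scipy_result.statistic, scipy_result.pvalue)
```

With the default `equal_var=True`, `ttest_ind` performs the same pooled test, so the two pairs of numbers should agree to floating-point precision.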

__Q5:__ Perform this hypothesis test both manually, using the above formulae, and then using the appropriate function from [scipy.stats](https://docs.scipy.org/doc/scipy/reference/stats.html#statistical-tests) (hint, you're looking for a function to perform a _t_-test on two independent samples). For the manual approach, calculate the value of the test statistic and then its probability (the p-value). Verify you get the same results from both.

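__A:__ The pooled _t_-test from Q4 assumes that the two groups have equal population variances. That assumption is what allows the standard deviations of both groups to be pooled into a single estimate, $s_p$, from which the _t_-statistic is calculated. The manual calculation and the scipy function below both rely on it.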
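
When the equal-variance assumption is doubtful, one common alternative is Welch's _t_-test, which `scipy.stats.ttest_ind` performs when called with `equal_var=False`. The sketch below uses synthetic groups with deliberately different variances and sizes (an assumption made for illustration) to show how much the two versions of the test can disagree.

```python
import numpy as np
from scipy.stats import ttest_ind

# Synthetic groups with very different spreads and sizes (illustrative only).
rng = np.random.default_rng(47)
group_a = rng.normal(loc=10.0, scale=1.0, size=30)
group_b = rng.normal(loc=10.5, scale=5.0, size=200)

# Pooled test: assumes both groups share one variance.
pooled = ttest_ind(group_a, group_b)
# Welch's test: estimates each group's variance separately.
welch = ttest_ind(group_a, group_b, equal_var=False)

print("pooled:", pooled.statistic, pooled.pvalue)
print("Welch: ", welch.statistic, welch.pvalue)
```

If the two results differ noticeably, that is a sign that the pooled test's assumption deserves a closer look before its p-value is trusted.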

In [None]:
medical.head()

In [None]:
insured = medical.charges[medical.insuranceclaim == 1]
uninsured = medical.charges[medical.insuranceclaim == 0]
insured, uninsured

In [None]:
insuredmean = np.mean(insured)
insuredstddev = np.std(insured, ddof = 1)
insuredmean, insuredstddev

In [None]:
uninsuredmean = np.mean(uninsured)
uninsuredstddev = np.std(uninsured, ddof = 1)
uninsuredmean, uninsuredstddev

In [None]:
# Pooled standard deviation, using the formula from Q4
n_insured, n_uninsured = len(insured), len(uninsured)
sp = np.sqrt(((n_insured - 1) * insuredstddev**2 + (n_uninsured - 1) * uninsuredstddev**2)
             / (n_insured + n_uninsured - 2))
sp

In [None]:
t_stat = (insuredmean - uninsuredmean) / (sp * np.sqrt(1/len(insured) + 1/len(uninsured)))
df = len(insured) + len(uninsured) - 2
# Two-sided p-value: the alternative hypothesis is "different", not "greater" or "less"
pval = 2 * t.sf(np.abs(t_stat), df=df)
t_stat, pval

In [None]:
ttest_ind(insured, uninsured)  # equal_var=True by default (pooled test); compare with the manual values above

Congratulations! Hopefully you got the exact same numerical results. This shows that you correctly calculated the numbers by hand. Secondly, you used the correct function and saw that it's much easier to use. All you need to do is pass your data to it.

__Q6:__ Conceptual question: look through the documentation for statistical test functions in scipy.stats. You'll see the above _t_-test for a sample, but can you see an equivalent one for performing a *z*-test from a sample? Comment on your answer.

__A:__ There is no equivalent _z_-test function in the scipy.stats documentation. A _z_-test from a sample is possible only if the population standard deviation is known. In most cases we work only with samples taken from the population, with no knowledge of the population distribution or its standard deviation, so the standard deviation has to be estimated from the sample. Student's _t_-distribution accounts for the extra uncertainty in that estimate and is preferred in these situations.
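
Although scipy.stats offers no ready-made function, a _z_-test is easy to write by hand when the population standard deviation really is known. The sketch below assumes a known `sigma` and a synthetic normal sample (both illustrative), and computes the _t_-test on the same data for comparison.

```python
import numpy as np
from scipy.stats import norm, t

rng = np.random.default_rng(47)
sigma = 2.0   # population standard deviation, assumed known (rarely true in practice)
mu0 = 10.0    # mean under the null hypothesis
sample = rng.normal(loc=10.3, scale=sigma, size=400)
n = sample.size

# z-test: uses the known population standard deviation and the normal distribution.
z_stat = (sample.mean() - mu0) / (sigma / np.sqrt(n))
p_z = 2 * norm.sf(abs(z_stat))

# t-test: uses the sample standard deviation and the t distribution.
t_stat = (sample.mean() - mu0) / (sample.std(ddof=1) / np.sqrt(n))
p_t = 2 * t.sf(abs(t_stat), df=n - 1)

print("z:", z_stat, p_z)
print("t:", t_stat, p_t)
```

The two statistics differ only by the ratio of the sample standard deviation to `sigma`, and for a sample this large the _t_ distribution is very close to the normal, so the two tests tend to give similar answers.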

## Learning outcomes

Having completed this project notebook, you now have good hands-on experience:
* using the central limit theorem to help you apply frequentist techniques to answer questions that pertain to very non-normally distributed data from the real world
* performing inference using such data to answer business questions
* forming a hypothesis and framing the null and alternative hypotheses
* testing this using a _t_-test