# Inference for Numerical Data
Copied and adapted from OpenStats Intro ["Inference for numerical data" lab](http://htmlpreview.github.io/?https://github.com/andrewpbray/oiLabs-base-R/blob/master/inf_for_numerical_data/inf_for_numerical_data.html), a product of OpenIntro that is released under a [Creative Commons Attribution-ShareAlike 3.0 Unported](http://creativecommons.org/licenses/by-sa/3.0) license. The original lab was adapted for OpenIntro by Mine Çetinkaya-Rundel from a lab written by the faculty and TAs of UCLA Statistics.

The [data set](https://www.openintro.org/stat/data/?data=nc) contains 1000 randomly selected births from the birth records released by the state of North Carolina in 2004. 

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import statsmodels.stats.weightstats
# from __future__ import print_function # Python 2 users, uncomment this statement

In [None]:
# load data into dataframe
ncbirths = pd.read_csv("https://www.openintro.org/stat/data/nc.csv")

In [None]:
ncbirths.head()

## Exploratory analysis

In [None]:
ncbirths.describe()

In [None]:
ncbirths.info()

### Exercise 1 - Exploratory
1. What are the cases in this data set? How many cases are there in our sample?
2. Which variables are numerical and which ones are categorical?
3. For the numerical variables, are there outliers? If you aren't sure or want to take a closer look at the data, make a graph.

### Exercise 2
Make side-by-side box plots of `weight`, one for each level of `habit`. What does the plot highlight about the relationship between these two variables?

The box plots show how the medians of the two distributions compare. We can also compare the means: the following cell uses `groupby` to split the `weight` variable into the `habit` groups, then takes the mean of each group with `np.mean`.

In [None]:
smoking = ncbirths.groupby("habit")
# select weight before aggregating so non-numeric columns are never averaged
smoking["weight"].agg(np.mean)

There is an observed difference, but is this difference statistically significant? To answer this question we will conduct a hypothesis test.

## Inference
### Exercise 3
Check if the conditions necessary for inference are satisfied. Note that you will need to obtain sample sizes to check the conditions. You can compute the group sizes using the same `groupby` command as above, replacing `np.mean` with `len`.
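As a minimal sketch of that substitution, the cell below counts rows per group on a small made-up table. The `toy` data frame and its values are hypothetical and stand in for `ncbirths`; only the `habit` and `weight` column names come from the lab.

```python
import pandas as pd

# Hypothetical stand-in for ncbirths, for illustration only
toy = pd.DataFrame({
    "habit": ["nonsmoker", "nonsmoker", "smoker", "nonsmoker", "smoker"],
    "weight": [7.6, 8.1, 6.9, 7.2, 6.4],
})

# Same groupby pattern as for the means, with len as the aggregation function
group_sizes = toy.groupby("habit")["weight"].agg(len)
print(group_sizes)
```

By default `groupby` leaves out rows whose grouping key is missing, so the group sizes may not add up to the total number of rows.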

### Exercise 4
Write the hypotheses for testing if the average weights of babies born to smoking and non-smoking mothers are different. Calculate the Z-score and p-value for the hypothesis test.

In [None]:
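# functions to visualize z-test
def _gauss(x, mu=0, sigma=1):
    """Normal probability density with mean mu and standard deviation sigma."""
    return 1/(sigma*np.sqrt(2*np.pi)) * np.exp(-0.5*pow((x-mu)/sigma,2))
gauss = np.vectorize(_gauss)

def plot_twosided_ztest(se):
    """Plot the null distribution of the difference in means and shade both tails.

    Reads the test statistic from the global `htest`, so run the hypothesis test first.
    """
    fig = plt.figure()
    g = fig.add_subplot(111)
    # x-axis: differences in means within 3.5 standard errors of zero
    dx = np.linspace(-3.5*se, 3.5*se)
    g.plot(dx, gauss(dx, sigma=se))
    # use |Z| so the shaded tails are also correct for a negative statistic
    zx = np.linspace(abs(htest[0])*se, max(dx))
    g.fill_between(zx, 0, gauss(zx, sigma=se))
    g.fill_between(-zx, 0, gauss(-zx, sigma=se))
    g.yaxis.set_visible(False)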

In [None]:
# hypothesis test from exercise 4
group1 = ncbirths[ncbirths["habit"]=="nonsmoker"].weight
group2 = ncbirths[ncbirths["habit"]=="smoker"].weight
d1 = statsmodels.stats.weightstats.DescrStatsW(group1)
d2 = statsmodels.stats.weightstats.DescrStatsW(group2)
cm = statsmodels.stats.weightstats.CompareMeans(d1, d2)
htest = cm.ztest_ind(usevar="unequal")
print("Test statistic: Z = {:n}".format(htest[0]))
print("p-value = {:n}".format(htest[1]))
plot_twosided_ztest(cm.std_meandiff_separatevar)
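To see what the test statistic and p-value mean, here is a hand computation using only the standard library. It is a sketch under the assumption that `ztest_ind(usevar="unequal")` uses the standard error $\sqrt{s_1^2/n_1 + s_2^2/n_2}$ and a standard normal reference distribution. The function name `two_sample_z` and the toy weights are hypothetical, and the toy samples are far too small to satisfy the conditions for inference.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def two_sample_z(sample1, sample2):
    """Return (z, two-sided p-value) for H0: mu1 - mu2 = 0, unequal variances."""
    n1, n2 = len(sample1), len(sample2)
    # Standard error of the difference in means; variance() divides by n - 1
    se = sqrt(variance(sample1) / n1 + variance(sample2) / n2)
    z = (mean(sample1) - mean(sample2)) / se
    # Two-sided p-value: probability of a statistic at least as extreme as |z|
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical birth weights in pounds, for illustration only
toy_nonsmoker = [7.6, 8.1, 7.2, 6.9, 7.9, 8.4]
toy_smoker = [6.4, 7.0, 6.8, 7.3, 6.1]
z, p_value = two_sample_z(toy_nonsmoker, toy_smoker)
print("Z = {:.3f}, p-value = {:.4f}".format(z, p_value))
```

Swapping the two samples flips the sign of Z but leaves the two-sided p-value unchanged.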

### Exercise 5
Calculate the 95% confidence interval for the difference in means $\mu_{nonsmoker} - \mu_{smoker}$. Read the documentation for `statsmodels.stats.weightstats.CompareMeans` (the current instance is `cm`) for help.
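For reference, a normal-based interval for a difference in means has the form (point estimate) ± z* × (standard error). The sketch below spells this out with the standard library on made-up data. The function name `diff_means_ci` and the toy weights are hypothetical; this is not the statsmodels method the exercise asks you to find.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def diff_means_ci(sample1, sample2, level=0.95):
    """Normal-based confidence interval for mu1 - mu2 (unequal variances)."""
    se = sqrt(variance(sample1) / len(sample1) + variance(sample2) / len(sample2))
    # Critical value z* leaves (1 - level) / 2 in each tail
    z_star = NormalDist().inv_cdf(1 - (1 - level) / 2)
    diff = mean(sample1) - mean(sample2)
    return diff - z_star * se, diff + z_star * se


# Hypothetical birth weights in pounds, for illustration only
toy_nonsmoker = [7.6, 8.1, 7.2, 6.9, 7.9, 8.4]
toy_smoker = [6.4, 7.0, 6.8, 7.3, 6.1]
lower, upper = diff_means_ci(toy_nonsmoker, toy_smoker)
print("95% CI: ({:.3f}, {:.3f})".format(lower, upper))
```

The interval is centered on the observed difference in sample means, and raising `level` widens it.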

### Exercise 6
See DescrStatsW [documentation](http://statsmodels.sourceforge.net/devel/generated/statsmodels.stats.weightstats.DescrStatsW.html) for useful methods.

1. Calculate a 95% confidence interval for the average length of pregnancies (`weeks`) and interpret it in context. Because this is inference on a single population parameter, there is no explanatory variable: a single `DescrStatsW` instance built from that one column is enough. Drop any missing values first, for example with `.dropna()`.

2. Calculate a new confidence interval for the same parameter at the 90% confidence level. The interval methods of `DescrStatsW` take a significance level `alpha` rather than a confidence level, so a 90% interval corresponds to `alpha=0.10`.

3. Conduct a hypothesis test evaluating whether the average weight gained by younger mothers is different from the average weight gained by mature mothers.

4. Now, a non-inference task: Determine the age cutoff for younger and mature mothers. Use a method of your choice, and explain how your method works.

5. Pick a pair of numerical and categorical variables and come up with a research question evaluating the relationship between these variables. Formulate the question in a way that it can be answered using a hypothesis test and/or a confidence interval. Answer your question using the statsmodels tools introduced above, report the statistical results, and also provide an explanation in plain language.
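For items 1 and 2, it helps to see how a confidence level maps to the `alpha` argument that statsmodels interval methods such as `DescrStatsW.tconfint_mean` expect: `alpha = 1 - confidence level`. The sketch below shows that mapping with a normal-based interval built from the standard library. The function name `mean_ci` and the toy pregnancy lengths are hypothetical, and the sample is too small for real inference.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def mean_ci(sample, alpha=0.05):
    """Normal-based (1 - alpha) confidence interval for a single mean."""
    se = stdev(sample) / sqrt(len(sample))
    # Critical value z* leaves alpha / 2 in each tail
    z_star = NormalDist().inv_cdf(1 - alpha / 2)
    center = mean(sample)
    return center - z_star * se, center + z_star * se


# Hypothetical pregnancy lengths in weeks, for illustration only
toy_weeks = [39, 40, 38, 41, 37, 39, 40, 36, 39, 38]
ci95 = mean_ci(toy_weeks, alpha=0.05)  # 95% confidence level
ci90 = mean_ci(toy_weeks, alpha=0.10)  # 90% confidence level
print("95% CI: ({:.2f}, {:.2f})".format(*ci95))
print("90% CI: ({:.2f}, {:.2f})".format(*ci90))
```

The 90% interval sits inside the 95% interval: lower confidence buys a narrower range around the same sample mean.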