<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Inferential Statistics Lab

_Author: Matt Brems (DC)_

Head to the `data` folder. There are two files:
- `crx_names.txt`: This summarizes the source of the data and provides information that is important to understanding what the data mean. **Be sure to read this first!**
- `crx_data.csv`: This will be the data itself. **Note that there are no column headers.**

A source of the data is [here](https://archive.ics.uci.edu/ml/datasets/Japanese+Credit+Screening) if you would like to learn more.

**Exercise 1**: Load the data in using any method that you choose.

In [21]:
import pandas as pd

df = pd.read_csv('./data/crx_data.csv', header=None)

In [22]:
df.head()

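| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | b | 30.83 | 0.0 | u | g | w | v | 1.25 | t | t | 1 | f | g | 202 | 0 | + |
| 1 | a | 58.67 | 4.46 | u | g | q | h | 3.04 | t | t | 6 | f | g | 43 | 560 | + |
| 2 | a | 24.5 | 0.5 | u | g | q | h | 1.5 | t | f | 0 | f | g | 280 | 824 | + |
| 3 | b | 27.83 | 1.54 | u | g | w | v | 3.75 | t | t | 5 | t | g | 100 | 3 | + |
| 4 | b | 20.17 | 5.625 | u | g | w | v | 1.71 | t | f | 0 | f | s | 120 | 0 | + |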


**Exercise 2**: Note that there are no meaningful column names. Why is this the case? Do you agree with this or disagree with this?

**Answer**: This is done for confidentiality purposes. I agree with this, as maintaining privacy is very important when doing data science. If we don't protect people's privacy, we open them up to harm and people will be less likely to share their information with us.

**Exercise 3:** You want to give names to each column. Read the following line of code:

```python
['X' + str(i) for i in range(1,17)]
```

Before running this line of code, what will this create? Go ahead and use this for the column names.

**Answer**: This should give us a list of values `X1`, `X2`, through `X16`.
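As a quick check, the comprehension can be run on its own before it touches the DataFrame. This is only a small sketch; the variable name `col_names` is my own and not part of the lab.

```python
# Build the column labels; range(1, 17) starts at 1 and stops before 17.
col_names = ['X' + str(i) for i in range(1, 17)]

print(len(col_names))                # how many labels were created
print(col_names[0], col_names[-1])   # first and last label
```

Because `range` excludes its upper bound, the list has 16 entries, one per column in the file.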

In [3]:
df.columns = ['X' + str(i) for i in range(1,17)]

In [4]:
df.head()

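| | X1 | X2 | X3 | X4 | X5 | X6 | X7 | X8 | X9 | X10 | X11 | X12 | X13 | X14 | X15 | X16 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | b | 30.83 | 0.0 | u | g | w | v | 1.25 | t | t | 1 | f | g | 202 | 0 | + |
| 1 | a | 58.67 | 4.46 | u | g | q | h | 3.04 | t | t | 6 | f | g | 43 | 560 | + |
| 2 | a | 24.5 | 0.5 | u | g | q | h | 1.5 | t | f | 0 | f | g | 280 | 824 | + |
| 3 | b | 27.83 | 1.54 | u | g | w | v | 3.75 | t | t | 5 | t | g | 100 | 3 | + |
| 4 | b | 20.17 | 5.625 | u | g | w | v | 1.71 | t | f | 0 | f | s | 120 | 0 | + |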


**Exercise 4**: Count the number of missing values in each column. (There are multiple ways to do this.)

In [5]:
df.isnull().sum()

X1     0
X2     0
X3     0
X4     0
X5     0
X6     0
X7     0
X8     0
X9     0
X10    0
X11    0
X12    0
X13    0
X14    0
X15    0
X16    0
dtype: int64

In [6]:
df.describe(include = 'all')

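| | X1 | X2 | X3 | X4 | X5 | X6 | X7 | X8 | X9 | X10 | X11 | X12 | X13 | X14 | X15 | X16 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 690 | 690 | 690.0 | 690 | 690 | 690 | 690 | 690.0 | 690 | 690 | 690.0 | 690 | 690 | 690.0 | 690.0 | 690 |
| unique | 3 | 350 | | 4 | 4 | 15 | 10 | | 2 | 2 | | 2 | 3 | 171.0 | | 2 |
| top | b | ? | | u | g | c | v | | t | f | | f | g | 0.0 | | - |
| freq | 468 | 12 | | 519 | 519 | 137 | 399 | | 361 | 395 | | 374 | 625 | 132.0 | | 383 |
| mean | | | 4.758725 | | | | | 2.223406 | | | 2.4 | | | | 1017.385507 | |
| std | | | 4.978163 | | | | | 3.346513 | | | 4.86294 | | | | 5210.102598 | |
| min | | | 0.0 | | | | | 0.0 | | | 0.0 | | | | 0.0 | |
| 25% | | | 1.0 | | | | | 0.165 | | | 0.0 | | | | 0.0 | |
| 50% | | | 2.75 | | | | | 1.0 | | | 0.0 | | | | 5.0 | |
| 75% | | | 7.2075 | | | | | 2.625 | | | 3.0 | | | | 395.5 | |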


This is odd, because the data dictionary told us that some attribute values are missing! However, `.isnull().sum()` shows zero missing values and `.describe()` reports a count of 690 for every variable.

Is it possible that our missing values aren't represented as missing values?

In [7]:
df['X1'].value_counts()

b    468
a    210
?     12
Name: X1, dtype: int64

Ahh! So our missing values seem to be represented as `?` rather than an `NA`. Let's replace question marks with `NA`.
- I consulted [StackOverflow](https://stackoverflow.com/questions/13445241/replacing-blank-values-white-space-with-nan-in-pandas) for this.

In [8]:
import numpy as np

In [9]:
df.replace('?', np.nan, inplace=True)
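To see what the replacement changes, here is a minimal sketch on a three-row stand-in for the data. The rows and the name `toy` are invented for illustration and are not taken from `crx_data.csv`. It recounts the missing values after the replacement and converts a column that was read as text (because of the `?`) into a numeric dtype.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical three-row stand-in for crx_data.csv (values invented for illustration).
raw = io.StringIO("b,30.83,+\na,?,-\n?,24.50,+\n")
toy = pd.read_csv(raw, header=None, names=['X1', 'X2', 'X16'])

# Same replacement as in the cell above, written without inplace.
toy = toy.replace('?', np.nan)

# The placeholders are now counted as missing.
missing_counts = toy.isnull().sum()
print(missing_counts)

# X2 was read as text because of the '?'; convert it to a numeric dtype.
toy['X2'] = pd.to_numeric(toy['X2'])
print(toy.dtypes)
```

Another option is to pass `na_values='?'` to `pd.read_csv`, which marks the placeholders as missing at load time.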

**Exercise 5**: Our goal is to learn about the population of interest. In this case, our population is all credit applications submitted to this company.

How would you describe the sample here?

**Answer**: Our sample is a set of 690 credit applications collected from the company we want to study. (Hopefully the applications were collected randomly, but this isn't written anywhere.)

**Exercise 6**: Our goal is to learn about the population of interest. In this case, our population is all credit applications submitted to this company. We specifically want to estimate the true proportion of approved applications for this company.

What is the parameter here? What is the statistic here? 
> Be sure to identify which is which!

**Answer**: Our parameter is the true proportion of approved applications for the company. Our statistic is the sample proportion of approved applications for the company.

**Exercise 7**: Recall that the formula for a confidence interval is:

$$
\text{[sample statistic] } \pm \text{[multiplier] } \times \text{[standard deviation of sampling distribution]}
$$

Calculate the:
- sample percentage of `+` applications
- sample standard deviation of `+` applications
- size of our sample.

Use these to generate a 95% confidence interval for the true proportion of approved applications for this company. Note that column `X16` identifies which applications in our sample were approved (`+`) and denied (`-`).

> Some data "munging" (cleaning/transforming) may be required!

First, I want to turn the `+` and `-` into `1`s and `0`s. This will make finding the sample average/percentage much easier!

In [10]:
df['X16_dummied'] = (df['X16'] == '+').astype(int)

In [11]:
df['X16'].value_counts()

-    383
+    307
Name: X16, dtype: int64

In [12]:
df['X16_dummied'].value_counts()

0    383
1    307
Name: X16_dummied, dtype: int64

Now that that worked, let's calculate the needed values.

In [13]:
sample_pct = np.mean(df['X16_dummied'])
sigma = np.std(df['X16_dummied'])
sample_size = len(df['X16_dummied'])

In [14]:
round(sample_pct - 1.96 * sigma / (sample_size ** 0.5), 4)

0.4078

In [15]:
round(sample_pct + 1.96 * sigma / (sample_size ** 0.5), 4)

0.482

**Answer**: My 95% confidence interval for the true population proportion of approved credit applications is **(40.78%, 48.2%)**.
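For 0/1 data, `np.std` (which divides by $n$ by default) equals $\sqrt{\hat{p}(1 - \hat{p})}$, so the interval above is the usual normal-approximation interval for a proportion. The sketch below rebuilds it from the two counts shown by `value_counts()`; the variable names are my own.

```python
import math

import numpy as np

# Counts taken from the value_counts() output above.
approved, denied = 307, 383
n = approved + denied

p_hat = approved / n

# Standard deviation of 0/1 data, computed two ways.
sd_formula = math.sqrt(p_hat * (1 - p_hat))
sd_numpy = float(np.std(np.array([1] * approved + [0] * denied)))

# Standard error of the sample proportion and the 95% interval.
se = sd_formula / math.sqrt(n)
ci = (round(p_hat - 1.96 * se, 4), round(p_hat + 1.96 * se, 4))
print(ci)  # (0.4078, 0.482)
```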

**Exercise 8**: Interpret the above interval.
> While you _could_ copy and paste text from the notes and fill in the blanks, you should practice interpreting the interval. Remember, this will come up in interviews!

**Answer**: I am 95% confident that the true population proportion of approved credit applications is between 40.78% and 48.2%.
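One way to unpack "95% confident" is to simulate many samples from a population whose proportion we know and count how often the interval captures it. The sketch below does that; the "true" proportion of 0.45, the seed, and the number of repetitions are arbitrary choices for illustration, not values from the data.

```python
import numpy as np

rng = np.random.default_rng(42)   # fixed seed so the run is repeatable

true_p = 0.45    # hypothetical "true" approval rate, chosen for illustration
n = 690          # same sample size as the lab data
n_reps = 2000    # number of simulated samples

# Number of approvals in each simulated sample, turned into a sample proportion.
p_hats = rng.binomial(n, true_p, size=n_reps) / n
se = np.sqrt(p_hats * (1 - p_hats) / n)

lower = p_hats - 1.96 * se
upper = p_hats + 1.96 * se

# Fraction of the simulated intervals that contain the true proportion.
coverage = np.mean((lower <= true_p) & (true_p <= upper))
print(coverage)
```

The printed fraction should land close to 0.95: the confidence level describes how often the procedure works across repeated samples, not the probability that any one computed interval is correct.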

**Exercise 9**: Define a function named `conf_int()` that accepts two arguments: 
- `data`, which should be an array or Series of data
- `conf_level`, which should be either `90`, `95`, or `99`.

Your function should return the 90% confidence interval, 95% confidence interval, or 99% confidence interval, depending on what value the user selected. **Set the default to be 95.**

> For a 90% confidence interval, the multiplier is 1.645.

> For a 95% confidence interval, the multiplier is 1.96.

> For a 99% confidence interval, the multiplier is 2.576.

In [16]:
def conf_int(data, conf_level = 95):
    
    if conf_level == 95:
        multiplier = 1.96
    elif conf_level == 90:
        multiplier = 1.645
    elif conf_level == 99:
        multiplier = 2.576
    else:
        return "Please provide a confidence level of 90, 95, or 99!"
    
    sample_pct = np.mean(data)
    sigma = np.std(data)
    sample_size = len(data)
    
    return (round(sample_pct - multiplier * sigma / (sample_size ** 0.5), 4),
            round(sample_pct + multiplier * sigma / (sample_size ** 0.5), 4))
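The three multipliers are quantiles of the standard normal distribution, so they can also be computed instead of hard-coded. The variant below is a sketch of that idea, not a replacement for the exercise answer: `conf_int_any` is a hypothetical name, and it raises an error on bad input rather than returning a message.

```python
import numpy as np
from scipy import stats

def conf_int_any(data, conf_level=95):
    """Normal-approximation interval for any confidence level in (0, 100)."""
    if not 0 < conf_level < 100:
        raise ValueError("conf_level must be between 0 and 100")
    alpha = 1 - conf_level / 100
    # Two-sided multiplier, e.g. about 1.96 when conf_level is 95.
    multiplier = stats.norm.ppf(1 - alpha / 2)
    data = np.asarray(data, dtype=float)
    center = data.mean()
    half_width = multiplier * data.std() / np.sqrt(len(data))
    return (round(float(center - half_width), 4),
            round(float(center + half_width), 4))

# 307 approvals and 383 denials, as in the X16 counts above.
example = [1] * 307 + [0] * 383
print(conf_int_any(example, 95))
```

Because 1.96 is itself a rounded quantile, results can differ from `conf_int()` in the last decimal place for some inputs.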

**Exercise 10**: Test your function to find the 99% confidence interval for the mean of `X3`. Your answer should be **(4.2709, 5.2466)**. Also interpret the interval.

In [17]:
print("The 90% confidence interval is: " + str(conf_int(df['X3'], 90)))
print("The 95% confidence interval is: " + str(conf_int(df['X3'], 95)))
print("The 99% confidence interval is: " + str(conf_int(df['X3'], 99)))

The 90% confidence interval is: (4.4472, 5.0703)
The 95% confidence interval is: (4.3875, 5.1299)
The 99% confidence interval is: (4.2709, 5.2466)


**Answer**: I am 99% confident that the true average value of $X_3$ is between 4.2709 and 5.2466.

**Exercise 11**: We want to test whether or not the mean of $X_3$ is equal to 5.

State the null and alternative hypotheses.

**Answer:**

$$
\begin{eqnarray*}
H_0: &\mu& = 5 \\
H_1: &\mu& \ne 5
\end{eqnarray*}
$$

**Exercise 12**: Use a one-sample $t$-test to test the above hypotheses at the $\alpha = 0.05$ significance level. Report and interpret your $p$-value.
> Hint: You might find [this link](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_1samp.html) to be helpful! Check out the `a` and `popmean` arguments.

In [18]:
import scipy.stats as stats

In [19]:
stats.ttest_1samp(df['X3'], popmean = 5)

Ttest_1sampResult(statistic=-1.2731172058046707, pvalue=0.20340583433417578)

**Answer**: My $p$-value is 0.2034. Because $p > \alpha$ ($0.2034 > 0.05$), there is insufficient evidence to reject $H_0$, and I cannot conclude that the mean of $X_3$ is different from 5.
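To see where the test statistic comes from, it can be rebuilt by hand from the summary values that `describe()` reported for `X3`. This is a sketch with my own variable names. Note that pandas reports the standard deviation with an $n - 1$ denominator, which is what the $t$-test uses.

```python
import math

from scipy import stats

# Summary values copied from the describe() output for X3.
n = 690
sample_mean = 4.758725
sample_sd = 4.978163   # pandas reports the n - 1 (ddof=1) standard deviation
mu_0 = 5               # value of the mean under the null hypothesis

# t = (sample mean - hypothesized mean) / standard error
t_stat = (sample_mean - mu_0) / (sample_sd / math.sqrt(n))

# Two-sided p-value from the t distribution with n - 1 degrees of freedom.
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 1)
print(round(t_stat, 4), round(float(p_value), 4))
```

Up to rounding in the copied summary values, this should agree with the `ttest_1samp` output above.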

**Exercise 13**: We want to test whether or not the true proportion of approved applications ($X_{16}$) is equal to 0.5.

State the null and alternative hypotheses.

**Answer**:

$$
\begin{eqnarray*}
H_0: &p& = 0.5 \\
H_1: &p& \ne 0.5
\end{eqnarray*}
$$

**Exercise 14**: Use a one-sample $t$-test to test the above hypotheses at the $\alpha = 0.05$ significance level. Report and interpret your $p$-value.
> Hint: You might find [this link](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_1samp.html) to be helpful! Check out the `a` and `popmean` arguments.

In [20]:
stats.ttest_1samp(df['X16_dummied'], 0.5)

Ttest_1sampResult(statistic=-2.9088721445109043, pvalue=0.0037441556293534564)

**Answer**: My $p$-value is 0.0037. Because $p < \alpha$ ($0.0037 < 0.05$), I reject $H_0$. Thus, I conclude that the true proportion of approved applications ($X_{16}$) is different from 0.5.
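The exercise asks for a $t$-test, but a one-sample $z$-test for a proportion is a common alternative for 0/1 data. The sketch below is that swapped-in technique, not the method used above. It computes the standard error from the hypothesized proportion $p_0 = 0.5$ rather than from the sample, and the variable names are my own.

```python
import math

from scipy import stats

# Counts taken from the X16 value_counts() output.
approved, n = 307, 690
p_hat = approved / n
p_0 = 0.5   # proportion under the null hypothesis

# Under H0 the standard error uses the hypothesized proportion, not p_hat.
se_0 = math.sqrt(p_0 * (1 - p_0) / n)
z_stat = (p_hat - p_0) / se_0

# Two-sided p-value from the standard normal distribution.
p_value = 2 * stats.norm.sf(abs(z_stat))
print(round(z_stat, 4), round(float(p_value), 4))
```

With a sample this large the two tests give very similar answers, and the conclusion at $\alpha = 0.05$ is the same.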