### Assignment 1

#### Inferential Statistics

In this exercise we want to draw conclusions, or inferences, about a population using only a subset of the data (a sample). Let's look at an example.

Consider that you are trying to decide if a coin is fair. How would you decide that?

Suppose you toss it 17 times and it comes up heads 7 times. Is the coin fair? Or you toss it 19 times and get heads 9 times. Is the coin fair? How do we decide? Even a fair coin could show 22 heads in 30 tosses. It might be just chance.
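
One way to get a feel for "just chance" is to simulate a fair coin many times and count how often a given result shows up. The sketch below is only an illustration: the helper `prob_at_least`, the number of trials, and the seed are assumptions made for this example and are not part of the assignment.

```python
import numpy as np

# Hypothetical helper: estimate how often a FAIR coin gives at least
# `min_heads` heads in `n_tosses` tosses, by simulating many experiments.
def prob_at_least(min_heads, n_tosses, n_trials=100_000, seed=0):
    rng = np.random.default_rng(seed)
    # Each entry is the number of heads in one experiment of n_tosses fair tosses
    heads = rng.binomial(n=n_tosses, p=0.5, size=n_trials)
    # Fraction of experiments with at least min_heads heads
    return np.mean(heads >= min_heads)

p_22_of_30 = prob_at_least(22, 30)
p_9_of_19 = prob_at_least(9, 19)
print("Share of fair-coin runs with at least 22 heads in 30 tosses:", p_22_of_30)
print("Share of fair-coin runs with at least 9 heads in 19 tosses:", p_9_of_19)
```

A result that a fair coin produces often tells us little, while a result that it produces only rarely is evidence against fairness. The same idea is used for the mosquito data further down.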

The classical method is to perform a hypothesis test. But here let's try a different and perhaps more intuitive approach.

Let's consider this more interesting problem: does beer consumption increase human attractiveness to malaria mosquitoes?

An experiment was performed using equipment like this:

![Mosquito](A1-img/mosquito.png)

Volunteers consumed either beer (n = 25 volunteers and a total of 2500 mosquitoes tested) or water (n = 18 volunteers and a total of 1800 mosquitoes). Batches of 50 mosquitoes were released into the downwind box of the olfactometer (figure 1C) and given a choice between outdoor air and human odour.  At the end of each test, the mosquitoes inside the two traps were removed with an aspirator and counted.

These are the results, showing the number of mosquitoes caught in the traps for each volunteer.


|  |  Beer  |  |
|------|------|----|
| 27 | 19 | 20 |
| 20 | 23 | 17 |
| 21 | 24 | 31 |
| 26 | 28 | 20 |
| 27 | 19 | 25 |
| 31 | 24 | 28 |
| 24 | 29 | 21 |        
| 21 | 18 | 27 |
| 20 |    |    |

|  |  Water  |  |
|----|----|----|
| 21 | 19 | 13 |
| 22 | 15 | 22 |
| 15 | 22 | 20 |
| 12 | 24 | 24 |
| 21 | 19 | 18 |
| 16 | 23 | 20 |

In [None]:
import pandas as pd

# beer = 1 for the 25 beer drinkers, 0 for the 18 water drinkers
# num  = number of mosquitoes caught for that volunteer
df = pd.DataFrame({'beer': [1] * 25 + [0] * 18,
                   'num':  [27, 19, 20, 20, 23, 17, 21, 24, 31, 26, 28, 20, 27, 19, 25, 31, 24, 28, 24, 29, 21, 21, 18, 27, 20] +
                           [21, 19, 13, 22, 15, 22, 15, 22, 20, 12, 24, 24, 21, 19, 18, 16, 23, 20]
                  })

In [None]:
df.head()


In [None]:
beer_mean = df[df.beer == 1].num.mean()
water_mean = df[df.beer == 0].num.mean()

print("Mean number of Mosquitoes (Beer): %2.1f" % beer_mean)
print("Mean number of Mosquitoes (Water): %2.1f" % water_mean)
print("Difference: %2.1f" % (beer_mean - water_mean))

Do the results indicate that beer drinkers are more attractive to mosquitoes? Skeptics argue that beer consumption has no effect: the difference of 4.4 is too small and could have happened by chance. Others argue that an additional 4.4 mosquitoes is too large a difference to be due to chance. What do you say?

Well, to determine whether this is a significant difference, we would normally perform a __t-test__ on our data to compute a p-value, and then check whether the p-value is below the chosen threshold of 0.05. This is hypothesis testing, which we will discuss later.

We could take a more intuitive approach. What we are trying to figure out is whether the difference of 4.4 is a large or a small effect. Let's think about it. If the skeptic is right, i.e. there is **no difference**, then the beer and water labels carry no information, and swapping the numbers around between the two groups should not matter. **If we do this swapping (or shuffling) many times, we can find out how often a difference as large as 4.4 occurs by chance alone.**


<img src="A1-img/beer_water_2.png" style="width: 400px;"/>
<img src="A1-img/beer_water_1.png" style="width: 400px;"/>
<img src="A1-img/beer_water_0.png" style="width: 400px;"/>


Let's plot the two groups of data first.


In [None]:
import seaborn as sns
%matplotlib inline
sns.swarmplot(x='beer',y='num',hue='beer',data=df)

In [None]:
import seaborn as sns

# histplot replaces the deprecated distplot(..., kde=False)
sns.histplot(df.num, bins=16)

Let us now do this shuffling in code, starting with 100 simulations.


In [None]:
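import numpy as np

num_simulations = 100

differences = []
for i in range(num_simulations):
    # Randomly reassign the beer/water labels; group sizes stay at 25 and 18.
    # Assigning a permuted copy avoids shuffling a DataFrame column in place.
    df['label'] = np.random.permutation(df['beer'].values)
    beer_mean = df[df.label == 1].num.mean()
    water_mean = df[df.label == 0].num.mean()
    # Difference in group means under the shuffled labels
    differences.append(beer_mean - water_mean)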

We can plot a few of these random swap simulations.
![](random-swap.png)

In [None]:
import seaborn as sns

# Histogram of the simulated differences in means
sns.histplot(differences, bins=50)


Increase the number of simulations to 10,000 and re-plot the distribution.
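
Reading a count off a histogram is imprecise, so it can help to count directly how many shuffles produce a difference at least as large as the observed one. The sketch below does this on made-up toy numbers, not the mosquito data. The function name `permutation_fraction`, the toy values, and the one-sided comparison (`>=`) are all assumptions chosen for illustration.

```python
import numpy as np

def permutation_fraction(values, labels, num_simulations=10_000, seed=0):
    """Return the observed mean difference (group 1 minus group 0) and the
    fraction of label shuffles whose difference is at least as large."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    labels = np.asarray(labels)
    observed = values[labels == 1].mean() - values[labels == 0].mean()
    count = 0
    for _ in range(num_simulations):
        # Shuffle the labels; the group sizes stay the same
        shuffled = rng.permutation(labels)
        diff = values[shuffled == 1].mean() - values[shuffled == 0].mean()
        if diff >= observed:
            count += 1
    return observed, count / num_simulations

# Made-up toy data for illustration only
toy_values = [12, 15, 14, 16, 13, 9, 10, 8, 11, 7]
toy_labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

observed, fraction = permutation_fraction(toy_values, toy_labels)
print("Observed difference:", observed)
print("Fraction of shuffles with a difference at least this large:", fraction)
```

A small fraction means that random relabelling rarely reproduces the observed difference. The same function could be called with `df['num']` and `df['beer']` from the cells above.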

### Question 
1. What is the reason for increasing the number of simulations?
2. How many times does a difference of 4.4 or more occur in the distribution plot?
3. Based on the above, can you explain whether there is a difference between the beer and water groups (that is, whether the mosquitoes prefer the beer group or the water group)?
4. Construct a different data set in which beer and water have no difference in effect. Show that there is no difference by repeating the experiment as above. Explain why you conclude there is no difference.



#### Hypothesis Testing 

We can also perform the t-test using the SciPy library. The t-statistic measures how many standard errors apart the two means are. The p-value is the probability of seeing a t-statistic at least that far from 0 if the null hypothesis were true. A low p-value leads to rejection of the null hypothesis.
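
To make the definition of the t-statistic concrete, the sketch below computes it by hand on made-up toy numbers and compares the result with `scipy.stats.ttest_ind`. It assumes equal variances in the two groups (a pooled standard error), which is the default for `ttest_ind`. The helper name `pooled_t_statistic` and the toy data are assumptions made for this example.

```python
import numpy as np
from scipy import stats

def pooled_t_statistic(a, b):
    """Two-sample t-statistic assuming equal variances in both groups."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n_a, n_b = len(a), len(b)
    # Pooled variance: weighted average of the two sample variances
    pooled_var = ((n_a - 1) * a.var(ddof=1) + (n_b - 1) * b.var(ddof=1)) / (n_a + n_b - 2)
    # Standard error of the difference between the two means
    std_err = np.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    # Number of standard errors separating the two means
    return (a.mean() - b.mean()) / std_err

# Made-up toy data for illustration only
toy_a = [12, 15, 14, 16, 13]
toy_b = [9, 10, 8, 11, 7]

t_manual = pooled_t_statistic(toy_a, toy_b)
t_scipy, p_value = stats.ttest_ind(toy_a, toy_b)
print("t (by hand):", t_manual)
print("t (SciPy):  ", t_scipy)
print("p-value:    ", p_value)
```

The two t values should agree up to floating-point rounding. If the groups may have unequal variances, `ttest_ind` accepts `equal_var=False`, which performs Welch's t-test instead.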

### Question
Is the result of the t-test significant? Explain why.


In [None]:
from scipy import stats

# Mosquito counts for each group
beer = df[df.beer == 1]['num']
water = df[df.beer == 0]['num']
# Two-sample t-test (assumes equal variances by default)
stats.ttest_ind(beer, water)

### Question

Use the data and code provided below. The data represent two groups: one was given treatment A (group=1), and the other is the control group, which received no treatment (group=0). The `days` column gives the number of days each patient took to recover. Find out whether treatment A has an effect on the patients.

```
import pandas as pd

df = pd.DataFrame({'group':  [1, 1, 1, 1, 1, 1, 1, 1] + 
                            [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
                   'days': [91, 99, 76, 72, 57, 46, 63, 75 ] +
                            [44, 62, 69,81, 69, 74, 61, 69, 65, 66, 56, 87,]})
```

Use the two methods above (random shuffling and hypothesis testing) to find out whether treatment A has an effect on the patients. Do the two methods lead to the same conclusion?