# Mosquito arms



In [None]:
# Don't change this cell; just run it.
import numpy as np  # The array library.

import pandas as pd
# Safe setting for Pandas.
pd.set_option('mode.copy_on_write', True)

import matplotlib.pyplot as plt
%matplotlib inline

## The data

You will analyze data from a study involving mosquitoes and beer. The paper describing the data is [Beer Consumption Increases Human
Attractiveness to Malaria
Mosquitoes](https://doi.org/10.1371/journal.pone.0009546).

The first author, Dr [Thierry Lefèvre](https://sites.google.com/site/thierryelefevre), kindly sent the original data.

He released the data and derivatives under the [CC-BY](https://creativecommons.org/licenses/by/4.0) license.  Specifically, you should attribute any copies of these data to Dr Thierry Lefèvre, and reference the paper above.

The processed data are in `../data/mosquito_beer.csv`.

Variables in that file are:

* `volunteer`: 43 levels corresponding to the id of the 43
  volunteers.
* `group`: 2 levels "beer" or "water" (volunteers were
  assigned to either the beer treatment (volunteers 1 to 25) or the water
  treatment (volunteers 26 to 43)).
* `test`: 2 levels "after" or "before"  (the attractiveness of
  each volunteer was tested twice: before drinking and 15 min
  after drinking either water or beer).
* `nb_relased`: number of released mosquitoes (n=50 for each test
  and group).
* `no_odour`: number of mosquitoes caught in the "no odour" control
  trap.
* `volunt_odour`: number of mosquitoes caught in the volunteer odour
  trap.
* `activated`: number of trapped mosquitoes (= `no_odour` +
  `volunt_odour`).
* `co2no`: CO2 concentration in the no odour trap.
* `co2od`: CO2 concentration in the volunteer odour trap.
* `temp`: body temperature of the volunteer.
* `trapside`: 2 levels (A or B); this is the side of the
  volunteer odour treatment in the Y-olfactometer (volunteer
  odour on the right side: A, or on the left side: B).
* `datetime`: date / time of the corresponding test run.

To read in the data:

In [None]:
mosquitoes = pd.read_csv("../data/mosquito_beer.csv")
mosquitoes.head()

## The experiment

These variables were derived from a full experimental setup that was quite sophisticated. Here is the graphic from the paper:

![](experimental_setup.png)

For each trial, there were two tents.

* One tent was empty (the *control* tent).
* The other tent contained a person (the *volunteer* tent).
* A tube led from each tent to a corresponding *trap* box. Thus, there was a
  *control trap* box and a *volunteer trap* box.
* A tube from each trap box fed into an arm of a Y connector.
* The remaining, third arm of the Y led to a *downwind box* containing 50
  mosquitoes.
* At the beginning of the trial, the experimenters opened the *downwind box*
  of mosquitoes, so the mosquitoes could fly out into the Y connector, and
  thence, into either of the *trap* boxes.
* The number of mosquitoes who flew into the *control trap* box gives the
  values for the `no_odour` column.
* The number of mosquitoes who flew into the *volunteer trap* box gives the
  values for the `volunt_odour` column.
* The total number of mosquitoes who flew into either the trap box gives
  the values for the `activated` column.

## Research question

The authors studied **whether people who had drunk beer were more attractive to mosquitoes.**

You too will study this. First, you will filter the data frame to contain only the "after" treatment rows. After filtering, each row corresponds to one person in the study. The number for each subject is the number of mosquitoes that flew towards them. The subjects were from two groups: people who had just drunk beer, and people who had just drunk water. There were 25 subjects who had drunk beer, and therefore 25 numbers of mosquitoes corresponding to the "beer" group. There were 18 subjects who had drunk water, and 18 numbers corresponding to the "water" group.

Get the numbers of mosquitoes flying towards the beer drinkers, and towards the water drinkers, after they had drunk their beer or water.
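
Selecting rows by the value in a column works by indexing the data frame with a Boolean Series. Here is a minimal sketch of that idea on a made-up data frame. The `toy` frame, its values, and the `toy_...` names are invented for illustration and are not part of the study data.

```python
import numpy as np
import pandas as pd

# A made-up data frame; the values are invented for illustration only.
toy = pd.DataFrame({
    'group': ['beer', 'water', 'beer', 'water'],
    'activated': [10, 20, 30, 40],
})
# Boolean Series: True where the row belongs to the 'water' group.
is_water = toy['group'] == 'water'
# Indexing with the Boolean Series keeps only the True rows.
toy_water_rows = toy[is_water]
# Pull out one column of the selected rows as a NumPy array.
toy_water_activated = np.array(toy_water_rows['activated'])
```

The same pattern applies to any column and any value you want to select on.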

In [None]:
after_rows = mosquitoes[mosquitoes['test'] == 'after']
beer_rows = ...
beer_activated = ...
water_rows = ...
water_activated = ...

Check that there are 25 values in the beer group, and 18 in the water group:

In [None]:
print('Number in beer group:', len(beer_activated))
print('Number in water group:', len(water_activated))

We are interested in the difference between the means of these numbers, which you can check here:

In [None]:
observed_difference = np.mean(beer_activated) - np.mean(water_activated)
observed_difference

## Testing for a difference

Your task is to conduct a relevant statistical test to address the research question "does drinking beer make people more attractive to mosquitoes?"

For this, you should:
1) state your hypotheses
2) specify any assumptions
3) conduct the test
4) report the results
5) draw a conclusion

To achieve full marks, you should also comment on how you could better answer the question, making use of variables already provided in the dataset.

## A check

From what we have said above, you might assume that the number of mosquitoes
who left their box (the `activated` number) would always equal the number who
flew to the `no_odour` arm plus the number who flew to the `volunt_odour` arm.

Check this by adding the values in `no_odour` to those in `volunt_odour`, then
comparing the sums for equality with the `activated` numbers.  Finally, count
the number of values where you got True for this comparison.  If this
relationship holds for all rows, the count should be the same as the number of
rows in the data frame.
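
As a hedged illustration of this kind of check, the sketch below uses a made-up data frame with invented column names (`a`, `b`, `total`), where one row deliberately breaks the relationship. `np.count_nonzero` counts the True values in the comparison.

```python
import numpy as np
import pandas as pd

# A made-up data frame; the last row deliberately does not add up.
toy = pd.DataFrame({
    'a': [1, 2, 3],
    'b': [4, 5, 6],
    'total': [5, 7, 10],
})
# Element-by-element comparison gives a Boolean Series.
sums_match = (toy['a'] + toy['b']) == toy['total']
# Count the True values.
n_match = np.count_nonzero(sums_match)
# The relationship holds for every row only if the count equals the row count.
all_match = n_match == len(toy)
```

Here `n_match` is 2 and `len(toy)` is 3, so `all_match` is False.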

In [None]:
n_same = ...
# Show the result
n_same

In [None]:
_ = ok.grade('q_n_same')

## Another test for beer

A comparison of interest to the authors was the difference between the number
of mosquitoes that flew towards the volunteer and the number that flew towards
the empty tent.

In the next cell, first select the rows corresponding to the trials for
volunteers after they had had their allocated drink.  Call this new DataFrame
`afters`.

Next make a new column in the resulting `afters` DataFrame. Call the column
`volunt_diff`.  The column should have the result of the subtraction of the
numbers in `no_odour` from those in `volunt_odour`.
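
Creating a column from a subtraction of two existing columns follows the pattern sketched here. The `toy` frame and its column names (`left`, `right`, `right_minus_left`) are made up for illustration.

```python
import pandas as pd

# A made-up data frame with two columns of counts.
toy = pd.DataFrame({'left': [3, 8, 5], 'right': [10, 6, 5]})
# Subtraction works element by element; assigning to a new column
# name adds that column to the data frame.
toy['right_minus_left'] = toy['right'] - toy['left']
```

Negative values in the new column mean the `left` value was the larger of the two.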

In [None]:
#- Select the rows corresponding to the "after" phase of the experiment.
afters = ...
#- Make a new column in "afters" called "volunt_diff" (see above).
afters...

In [None]:
_ = ok.grade('q_afters')

Make an array from the `volunt_diff` values for the rows corresponding to the
`beer` drinkers and another for the rows corresponding to the water drinkers.
Call these arrays `after_beer_vd` and `after_water_vd` respectively.

In [None]:
after_beer_vd = ...
after_water_vd = ...
# Show the results
print(after_beer_vd)
print(after_water_vd)

In [None]:
_ = ok.grade('q_after_arrs')

## More permutation

Consider the difference between the group means of these `volunt_diff` values:

In [None]:
beer_mean = np.mean(after_beer_vd)
water_mean = np.mean(after_water_vd)
beer_water_diff = beer_mean - water_mean
beer_water_diff

The number of beer values:

In [None]:
n_beer = len(after_beer_vd)
n_beer

The values, pooled into one array:

In [None]:
# The values, pooled.
pooled = np.concatenate([after_beer_vd, after_water_vd])
pooled

Your job is to do a *permutation* test, to see whether this observed mean
difference is plausible in an ideal (null) world, where there is no real
difference between the groups, and any observed difference is just due to
random sampling.

We simulate samples from such an ideal world by shuffling the 43 pooled values
randomly, allocating 25 shuffled values to a fake Beer group, and the rest to
a fake Water group, and calculating the mean difference for these fake groups.
We do this many times to build up the *sampling distribution* of these fake
differences.

To do this job, you may want to remind yourself of the [permutation idea](https://lisds.github.io/textbook/permutation/permutation_idea.html) notebook in the textbook.

You may well want to start with a cell that does one trial where you:

* shuffle the pooled values.
* split them into two groups of 25 and 18.
* calculate the difference in means.
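
The three steps above can be sketched on a small made-up array as follows. The values, the group size `n_first`, and the variable names are invented for illustration. This sketch assumes NumPy's `default_rng` random number generator, which may not be the approach your course materials use.

```python
import numpy as np

# Random number generator (assumed here; results differ on each run).
rng = np.random.default_rng()

# Made-up pooled values, with a made-up size for the first group.
toy_pooled = np.array([1, 2, 3, 4, 5])
n_first = 3

# Shuffle: returns a randomly reordered copy of the pooled values.
shuffled = rng.permutation(toy_pooled)
# Split into two fake groups.
fake_first = shuffled[:n_first]
fake_second = shuffled[n_first:]
# Difference in means for this one trial.
fake_diff = np.mean(fake_first) - np.mean(fake_second)
```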

In [None]:
#- You may want to simulate a single trial here.

Then finish up the cell below to build your sampling distribution, storing the values in the array `fake_diffs`.
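
One common way to collect the results of many trials is to create an array of zeros, then fill one element per loop iteration. The following is a sketch on made-up values, not the solution for the mosquito data. All names (`toy_pooled`, `n_first`, `n_trials`, `toy_fake_diffs`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng()

# Made-up pooled values and made-up first-group size.
toy_pooled = np.array([1, 2, 3, 4, 5])
n_first = 3
n_trials = 1000

# Array to hold one result per trial.
toy_fake_diffs = np.zeros(n_trials)
for i in np.arange(n_trials):
    shuffled = rng.permutation(toy_pooled)
    # Store the difference in means of the two fake groups.
    toy_fake_diffs[i] = (np.mean(shuffled[:n_first]) -
                         np.mean(shuffled[n_first:]))
```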

In [None]:
# Build up the sampling distribution from the ideal (null) world.
n_iters = 10000
fake_diffs = ...
# Show the first 10 values.
fake_diffs[:10]

In [None]:
_ = ok.grade('q_fake_diffs')

You might also like to review the histogram of these values, to compare by eye to the value in the real world, `beer_water_diff`.

In [None]:
#- Do a histogram of the sampling distribution here.

Calculate the proportion of the sampling distribution values that are greater than or equal to the observed difference in means.
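
A proportion of this kind is the count of values meeting the condition, divided by the total number of values. As a small worked sketch with made-up numbers (`toy_dist` and `toy_observed` are invented for illustration):

```python
import numpy as np

# Made-up "sampling distribution" and made-up observed value.
toy_dist = np.array([-2, -1, 0, 1, 2, 3, 3, 4])
toy_observed = 3

# Count the values greater than or equal to the observed value,
# then divide by the number of values.
toy_prop = np.count_nonzero(toy_dist >= toy_observed) / len(toy_dist)
```

Three of the eight values are at least 3, so `toy_prop` is 0.375.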

In [None]:
prop_ge = ...
# Show the proportion.
prop_ge

In [None]:
_ = ok.grade('q_prop_ge')

## Further details about the experiment

Each volunteer had two trials, one *before* they drank their allocated drink,
and one *after* they drank their allocated drink.  The `test` column records
the trial type of each row.  The study allocated 25 volunteers to drink beer,
and 18 volunteers to drink water.  The `group` column contains the allocation
of the corresponding volunteer on each trial.

## Done.

Congratulations, you're done with the assignment!  Be sure to:

- **run all the tests** (the next cell has a shortcut for that).
- **Save and Checkpoint** from the `File` menu.

In [None]:
# For your convenience, you can run this cell to run all the tests at once!
import os
_ = [ok.grade(q[:-3]) for q in os.listdir("tests") if q.startswith('q')]