These are the data from [Beer Consumption Increases Human Attractiveness to Malaria Mosquitoes](https://doi.org/10.1371/journal.pone.0009546).

The first author, Dr [Thierry Lefèvre](https://sites.google.com/site/thierryelefevre), kindly sent the original data.

He released the data and derivatives under the [CC-BY](https://creativecommons.org/licenses/by/4.0) license.

In [1]:
import os.path as op
import numpy as np
import pandas as pd

`beer.txt` is the original data file as provided by Dr Lefèvre.

It is in tab-separated value format.

In [2]:
df = pd.read_csv('beer.txt', sep='\t')
df.head()

Unnamed: 0,volunteer,group,test,nb_released,no_odour,volunt_odour,activated,co2no,co2od,temp,trapside,hour,date
0,subj1,beer,before,50,7,9,16,305.0,321.0,36.1,A,19.0,39320
1,subj2,beer,before,50,26,7,33,338.0,720.0,35.3,B,21.0,39320
2,subj3,beer,before,50,5,10,15,348.0,355.0,36.1,B,19.0,39338
3,subj4,beer,before,50,3,7,10,349.0,437.0,35.6,A,17.0,39348
4,subj5,beer,before,50,2,8,10,396.0,475.0,37.0,B,18.0,39348


Here are the variable descriptions sent by Dr Lefèvre.

* `volunteer`: 43 levels corresponding to the id of the 43
  volunteers.
* `group`: 2 levels, "beer" or "water" (volunteers were
  assigned to either the beer treatment (volunteers 1 to 25) or
  the water treatment (volunteers 26 to 43)).
* `test`: 2 levels "after" or "before"  (the attractiveness of
  each volunteer was tested twice: before drinking and 15 min
  after drinking either water or beer).
* `nb_released`: number of released mosquitoes (n=50 for each test
  and group).
* `no_odour`: number of mosquitoes caught in the "no odour" control
  trap.
* `volunt_odour`: number of mosquitoes caught in the volunteer odour
  trap.
* `activated`: number of trapped mosquitoes (= no_odour +
  volunt_odour).
* `co2no`: CO2 concentration in the no odour trap.
* `co2od`: CO2 concentration in the volunteer odour trap.
* `temp`: body temperature of the volunteer.
* `trapside`: 2 levels (A or B); this is the side of the
  volunteer odour treatment in the Y-olfactometer (A: volunteer
  odour on the right side; B: on the left side).
* `hour`: hour at which the test began.
* `date`: date of the test.
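
The descriptions above imply some simple consistency checks. As a minimal sketch, the block below rebuilds the five rows displayed earlier (a subset of the columns) and checks that `activated` equals `no_odour + volunt_odour`, and that no more mosquitoes were trapped than released. The names `sample`, `sums_match` and `within_released` are only illustrative; the same expressions could be applied to the full `df`.

```python
import pandas as pd

# First five rows of the table displayed above (subset of columns).
sample = pd.DataFrame({
    'volunteer': ['subj1', 'subj2', 'subj3', 'subj4', 'subj5'],
    'nb_released': [50, 50, 50, 50, 50],
    'no_odour': [7, 26, 5, 3, 2],
    'volunt_odour': [9, 7, 10, 7, 8],
    'activated': [16, 33, 15, 10, 10],
})

# `activated` should be the sum of the two trap counts.
sums_match = (sample['no_odour'] + sample['volunt_odour']
              == sample['activated']).all()
# No more mosquitoes can be trapped than were released.
within_released = (sample['activated'] <= sample['nb_released']).all()
print(sums_match, within_released)
```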

The `date` column looks like Excel serial dates, which count the
number of days from the start of 1900 - see [this
explanation](http://www.cpearson.com/Excel/datetime.htm).

The paper has:

> Experiments were conducted between September and October 2007.

The dates should be in this range.

In [3]:
dates = pd.to_datetime('1900-01-01') + pd.to_timedelta(df['date'], unit='D')
dates.describe()

count                      86
unique                     16
top       2007-10-04 00:00:00
freq                        8
first     2007-08-28 00:00:00
last      2007-10-31 00:00:00
Name: date, dtype: object
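
As a cross-check on the conversion above: Excel's 1900 date system numbers January 1st 1900 as day 1, and also counts a nonexistent February 29th 1900. For that reason, the conventional origin for converting modern serial numbers is 1899-12-30 rather than 1900-01-01. The sketch below assumes that `beer.txt` uses the 1900 date system (not confirmed by the source), and compares the two origins for the first serial number shown above. The helper `excel_serial_to_date` is a hypothetical name, not part of any library.

```python
from datetime import date, timedelta

# Assumption: the serials use Excel's default 1900 date system, for which
# the conventional conversion origin is 1899-12-30.
EXCEL_ORIGIN = date(1899, 12, 30)


def excel_serial_to_date(serial):
    """Convert an Excel 1900-system serial day number to a date."""
    return EXCEL_ORIGIN + timedelta(days=int(serial))


first_serial = 39320  # `date` value in the first row shown above

conventional = excel_serial_to_date(first_serial)
naive = date(1900, 1, 1) + timedelta(days=first_serial)
print(conventional)  # 2007-08-26
print(naive)         # 2007-08-28
```

Under this assumption, the dates produced by adding the serial number to 1900-01-01 are two days later than the conventional conversion. In pandas, the equivalent call would be `pd.to_datetime(df['date'], unit='D', origin='1899-12-30')`.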

Add the hours:

In [4]:
date_times = dates + pd.to_timedelta(df['hour'], unit='h')
date_times.head()

0   2007-08-28 19:00:00
1   2007-08-28 21:00:00
2   2007-09-15 19:00:00
3   2007-09-25 17:00:00
4   2007-09-25 18:00:00
dtype: datetime64[ns]
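
The `hour` column displays as floating point (`19.0`), which in pandas can be a sign that the column contains missing values; that is an assumption here, not something the output above confirms. The sketch below uses two made-up rows, not the real data, to show what the addition does in that case: a missing hour gives a missing timestamp (`NaT`) rather than an error.

```python
import numpy as np
import pandas as pd

# Hypothetical two-row example; the second hour is deliberately missing.
days = pd.Series(pd.to_datetime(['2007-08-28', '2007-09-15']))
hours = pd.Series([19.0, np.nan])

stamps = days + pd.to_timedelta(hours, unit='h')
print(stamps)

# Count the rows where no timestamp could be built.
n_missing = stamps.isna().sum()
print(n_missing)
```

Running `date_times.isna().sum()` on the real data would report how many rows, if any, are affected.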

In [5]:
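# Keep the columns up to and including 'trapside' (dropping 'hour' and
# 'date'); copy so that adding a column does not touch a view of `df`.
clean_df = df.loc[:, :'trapside'].copy()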
clean_df['datetime'] = date_times
clean_df.head()

Unnamed: 0,volunteer,group,test,nb_released,no_odour,volunt_odour,activated,co2no,co2od,temp,trapside,datetime
0,subj1,beer,before,50,7,9,16,305.0,321.0,36.1,A,2007-08-28 19:00:00
1,subj2,beer,before,50,26,7,33,338.0,720.0,35.3,B,2007-08-28 21:00:00
2,subj3,beer,before,50,5,10,15,348.0,355.0,36.1,B,2007-09-15 19:00:00
3,subj4,beer,before,50,3,7,10,349.0,437.0,35.6,A,2007-09-25 17:00:00
4,subj5,beer,before,50,2,8,10,396.0,475.0,37.0,B,2007-09-25 18:00:00


Check the number released is always 50:

In [6]:
np.all(clean_df['nb_released'] == 50)

True

Save without the index (row labels), and load to check it worked:

In [7]:
# The 'processed' directory must already exist.
out_path = op.join('processed', 'mosquito_beer.csv')
clean_df.to_csv(out_path, index=False)  # No row labels
# Read it back again.
pd.read_csv(out_path).head()

Unnamed: 0,volunteer,group,test,nb_released,no_odour,volunt_odour,activated,co2no,co2od,temp,trapside,datetime
0,subj1,beer,before,50,7,9,16,305.0,321.0,36.1,A,2007-08-28 19:00:00
1,subj2,beer,before,50,26,7,33,338.0,720.0,35.3,B,2007-08-28 21:00:00
2,subj3,beer,before,50,5,10,15,348.0,355.0,36.1,B,2007-09-15 19:00:00
3,subj4,beer,before,50,3,7,10,349.0,437.0,35.6,A,2007-09-25 17:00:00
4,subj5,beer,before,50,2,8,10,396.0,475.0,37.0,B,2007-09-25 18:00:00
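
One thing the round trip does not preserve is the type of the `datetime` column: CSV stores it as text, so by default `pd.read_csv` returns strings for that column. A minimal sketch of reading it back as timestamps follows. It uses a two-row in-memory file with a subset of the columns, so the names `csv_text`, `as_text` and `as_dates` are illustrative only.

```python
import io
import pandas as pd

# Two rows in the same layout as the saved file (subset of columns).
csv_text = (
    "volunteer,group,test,datetime\n"
    "subj1,beer,before,2007-08-28 19:00:00\n"
    "subj2,beer,before,2007-08-28 21:00:00\n"
)

# Default read: the datetime column comes back as text.
as_text = pd.read_csv(io.StringIO(csv_text))
# Ask pandas to parse the column into timestamps.
as_dates = pd.read_csv(io.StringIO(csv_text), parse_dates=['datetime'])

print(as_text['datetime'].dtype)
print(as_dates['datetime'].dtype)
```

For the saved file, the corresponding call would be `pd.read_csv(out_path, parse_dates=['datetime'])`.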
