1-clean.Rmd

```{r setup, include=FALSE}
# set global chunk options
library(knitr)
opts_chunk$set(cache=TRUE, echo=FALSE, warning=FALSE, message=FALSE)
# import funtions
source("functions.R")
```

Load and clean the data
-----------------------

<a id="data"></a>
## Data
Donors data were taken from the file `donors_view.csv` in the `data/` directory.
We excluded donors enrolled in the Novartis study, HIV positive and those for
which the number of vaccinations is not defined.

Data were loaded from several csv files (one per strain) and merged into two
dataframes, one for IC50 values and one for the EC50 values.

```{r load_data, echo=FALSE, warning=FALSE}
library(plyr)
library(reshape2)
# donors data
donors_view <- read.csv("../../data/donors_view.csv", sep=";", quote="\"",
                          header=TRUE, na.strings=c("NULL"))

# excludes those enrolled in novartis study or hiv positive
# drop phone, email, shcs
donors_view <- subset(donors_view,
                      (novartis_study == 'n' | novartis_study == 'k') &
                      hiv == 'n',
                      select=-c(phone, email, shcs))

# capitalize sex
donors_view$sex <- toupper(donors_view$sex)
print("dimension of donors_view")
dim(donors_view)

# EC data
filenames <- list.files(path="../../data/data_to_merge_EC50/", full.names=TRUE)
import.list <- llply(filenames, read.csv, sep=";")
ec_data <- Reduce(function(...) merge(..., by="donorID"), import.list)
# IC data
filenames <- list.files(path="../../data/data_to_merge_IC50/", full.names=TRUE)
import.list <- llply(filenames, read.csv, sep=";")
ic_data <- Reduce(function(...) merge(..., by="donorID"), import.list)
```

The age distribution per gender is shown in the next plot.

```{r age_dist}
p <- ggplot(donors_view, aes(x=age, fill=sex)) +
  geom_histogram(binwidth=5, alpha=0.9, position="dodge") +
  ggtitle("Age distribution per gender")
p
```

## Data cleaning
IC_50 could not be computed for H5 and H7. For these two strains, rather than
the inhibitory concentrations, the inhibition value at the first dilution is
reported, expressed in percentage.

EC50 data present two outliers (one in H2 and one in H4 titers): their value
for effective concentration was lower than -40 (in log scale) and for this
reason were removed.
```{r clean_data, echo=TRUE}
ec_data$H2_log[ec_data$H2_log < -40] <- NA
ec_data$H4_log[ec_data$H4_log < -40] <- NA
```

The following plots show the distribution of values, one boxplot per strain.
IC50 for H5 and H7 are shown separately because the scale is different
(percentage).

#### IC50
```{r ic_boxplots}
library(ggplot2)
library(reshape)
# boxplots on log scales
m <- melt(ic_data[, -c(1, 8, 9)])
p <- ggplot(m, aes(x=variable, y=value)) + geom_boxplot()
p
# boxplots of percentages
m <- melt(ic_data[, -c(1:7)])
p <- ggplot(m, aes(x=variable, y=value)) + geom_boxplot()
p
```
#### EC50
```{r ec_boxplots}
# boxplots
m <- melt(ec_data[, -1])
p <- ggplot(m, aes(x=variable, y=value)) + geom_boxplot()
p
```

### Merge with donors data
Exclude donoros for which the vaccinations number is not defined.
```{r merge_donors}
donors_view <- subset(donors_view, !is.na(flushot_avg))

# define a binary variable for vaccinated
donors_view$shot[donors_view$flushot_avg > 0] <- 'Vaccinated'
donors_view$shot[donors_view$flushot_avg == 0] <- 'Non vaccinated'

ic_data <- merge(ic_data, donors_view, by="donorID")
ec_data <- merge(ec_data, donors_view, by="donorID")

dim(ic_data)
dim(ec_data)
```
We are left with 273 donors, 142 females and 131 males, for which we can plot
the following characteristics.

```{r vaccine_sex}
p <- ggplot(ec_data, aes(x=sex, fill=shot)) + geom_histogram()
p <- p + ggtitle("Vaccinated and non vaccinated per gender")
p

table(ec_data$shot, ec_data$sex)

p <- ggplot(ec_data, aes(x=age, fill=shot))
p <- p + geom_histogram(binwidth=5, alpha=0.9, position="dodge")
p <- p + ggtitle("Age distribution per vaccination status")
p
```

It is important to point out that the groups are not homogeneous, vaccinated
are older than non vaccinated, as seen in the plot below.

```{r vaccine_age}
p <- ggplot(ec_data, aes(x=shot, y=age)) + geom_boxplot()
p <- p + ggtitle("Age distribution per vaccination status")
p
```


We show the boxplots again, limited to donors considered here.
#### IC50
```{r ic_boxplots_again}
# boxplots on log scales
m <- melt(ic_data[, 2:7])
p <- ggplot(m, aes(x=variable, y=value)) + geom_boxplot()
p
# boxplots of percentages
m <- melt(ic_data[, 8:9])
p <- ggplot(m, aes(x=variable, y=value)) + geom_boxplot()
p
```

H3 Brisbane has a few outliers: we remove those below zero.
```{r remove_outliers_ic, echo=TRUE}
ic_data$H3_brisbane_log[ic_data$H3_brisbane_log < 0] <- NA
```

#### EC50
```{r ec_boxplots_again}
# boxplots
m <- melt(ec_data[, 2:10])
p <- ggplot(m, aes(x=variable, y=value)) + geom_boxplot()
p
```

We observe a few strain with outliers: H1 pandemic 2009, H2 and H4. We remove
the lowest observation on each strain.

```{r remove_outliers_ec, echo=TRUE}
ec_data$H1_pdm09_log[ec_data$H1_pdm09_log < 0] <- NA
ec_data$H2_log[ec_data$H2_log < -3] <- NA
ec_data$H4_log[ec_data$H4_log < -1.5] <- NA
```


#### Count the valid data (those that are not reported as `NA`) on each strain.
For **IC** first
```{r count_valid_data_ic, echo=TRUE}
for(str in names(ic_data)[2:9]){
  n <- sum(!is.na(ic_data[[str]]))
  print(paste(str, "-->", n))
}
```

and then for **EC**
```{r count_valid_data_ec, echo=TRUE}
for(str in names(ec_data)[2:10]){
  n <- sum(!is.na(ec_data[[str]]))
  print(paste(str, "-->", n))
}
```
For all 273 donors the variable `shot` is defined (Vaccinated or Non
vaccinated).

----

Save everything, it will be loaded in other files.
```{r save_ic_ec, echo=TRUE}
save(ic_data, ec_data, file="ic_ec.Rdata")
```
----

#### _Latest update: 7 April 2014_.