-
Notifications
You must be signed in to change notification settings - Fork 0
/
1-clean.Rmd
executable file
·201 lines (167 loc) · 5.66 KB
/
1-clean.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
```{r setup, include=FALSE}
# set global chunk options
library(knitr)
opts_chunk$set(cache=TRUE, echo=FALSE, warning=FALSE, message=FALSE)
# import funtions
source("functions.R")
```
Load and clean the data
-----------------------
<a id="data"></a>
## Data
Donors data were taken from the file `donors_view.csv` in the `data/` directory.
We excluded donors enrolled in the Novartis study, HIV positive and those for
which the number of vaccinations is not defined.
Data were loaded from several csv files (one per strain) and merged into two
dataframes, one for IC50 values and one for the EC50 values.
```{r load_data, echo=FALSE, warning=FALSE}
library(plyr)
library(reshape2)
# donors data
donors_view <- read.csv("../../data/donors_view.csv", sep=";", quote="\"",
header=TRUE, na.strings=c("NULL"))
# excludes those enrolled in novartis study or hiv positive
# drop phone, email, shcs
donors_view <- subset(donors_view,
(novartis_study == 'n' | novartis_study == 'k') &
hiv == 'n',
select=-c(phone, email, shcs))
# capitalize sex
donors_view$sex <- toupper(donors_view$sex)
print("dimension of donors_view")
dim(donors_view)
# EC data
filenames <- list.files(path="../../data/data_to_merge_EC50/", full.names=TRUE)
import.list <- llply(filenames, read.csv, sep=";")
ec_data <- Reduce(function(...) merge(..., by="donorID"), import.list)
# IC data
filenames <- list.files(path="../../data/data_to_merge_IC50/", full.names=TRUE)
import.list <- llply(filenames, read.csv, sep=";")
ic_data <- Reduce(function(...) merge(..., by="donorID"), import.list)
```
The age distribution per gender is shown in the next plot.
```{r age_dist}
p <- ggplot(donors_view, aes(x=age, fill=sex)) +
geom_histogram(binwidth=5, alpha=0.9, position="dodge") +
ggtitle("Age distribution per gender")
p
```
## Data cleaning
IC_50 could not be computed for H5 and H7. For these two strains, rather than
the inhibitory concentrations, the inhibition value at the first dilution is
reported, expressed in percentage.
EC50 data present two outliers (one in H2 and one in H4 titers): their value
for effective concentration was lower than -40 (in log scale) and for this
reason were removed.
```{r clean_data, echo=TRUE}
ec_data$H2_log[ec_data$H2_log < -40] <- NA
ec_data$H4_log[ec_data$H4_log < -40] <- NA
```
The following plots show the distribution of values, one boxplot per strain.
IC50 for H5 and H7 are shown separately because the scale is different
(percentage).
#### IC50
```{r ic_boxplots}
library(ggplot2)
library(reshape)
# boxplots on log scales
m <- melt(ic_data[, -c(1, 8, 9)])
p <- ggplot(m, aes(x=variable, y=value)) + geom_boxplot()
p
# boxplots of percentages
m <- melt(ic_data[, -c(1:7)])
p <- ggplot(m, aes(x=variable, y=value)) + geom_boxplot()
p
```
#### EC50
```{r ec_boxplots}
# boxplots
m <- melt(ec_data[, -1])
p <- ggplot(m, aes(x=variable, y=value)) + geom_boxplot()
p
```
### Merge with donors data
Exclude donoros for which the vaccinations number is not defined.
```{r merge_donors}
donors_view <- subset(donors_view, !is.na(flushot_avg))
# define a binary variable for vaccinated
donors_view$shot[donors_view$flushot_avg > 0] <- 'Vaccinated'
donors_view$shot[donors_view$flushot_avg == 0] <- 'Non vaccinated'
ic_data <- merge(ic_data, donors_view, by="donorID")
ec_data <- merge(ec_data, donors_view, by="donorID")
dim(ic_data)
dim(ec_data)
```
We are left with 273 donors, 142 females and 131 males, for which we can plot
the following characteristics.
```{r vaccine_sex}
p <- ggplot(ec_data, aes(x=sex, fill=shot)) + geom_histogram()
p <- p + ggtitle("Vaccinated and non vaccinated per gender")
p
table(ec_data$shot, ec_data$sex)
p <- ggplot(ec_data, aes(x=age, fill=shot))
p <- p + geom_histogram(binwidth=5, alpha=0.9, position="dodge")
p <- p + ggtitle("Age distribution per vaccination status")
p
```
It is important to point out that the groups are not homogeneous, vaccinated
are older than non vaccinated, as seen in the plot below.
```{r vaccine_age}
p <- ggplot(ec_data, aes(x=shot, y=age)) + geom_boxplot()
p <- p + ggtitle("Age distribution per vaccination status")
p
```
We show the boxplots again, limited to donors considered here.
#### IC50
```{r ic_boxplots_again}
# boxplots on log scales
m <- melt(ic_data[, 2:7])
p <- ggplot(m, aes(x=variable, y=value)) + geom_boxplot()
p
# boxplots of percentages
m <- melt(ic_data[, 8:9])
p <- ggplot(m, aes(x=variable, y=value)) + geom_boxplot()
p
```
H3 Brisbane has a few outliers: we remove those below zero.
```{r remove_outliers_ic, echo=TRUE}
ic_data$H3_brisbane_log[ic_data$H3_brisbane_log < 0] <- NA
```
#### EC50
```{r ec_boxplots_again}
# boxplots
m <- melt(ec_data[, 2:10])
p <- ggplot(m, aes(x=variable, y=value)) + geom_boxplot()
p
```
We observe a few strain with outliers: H1 pandemic 2009, H2 and H4. We remove
the lowest observation on each strain.
```{r remove_outliers_ec, echo=TRUE}
ec_data$H1_pdm09_log[ec_data$H1_pdm09_log < 0] <- NA
ec_data$H2_log[ec_data$H2_log < -3] <- NA
ec_data$H4_log[ec_data$H4_log < -1.5] <- NA
```
#### Count the valid data (those that are not reported as `NA`) on each strain.
For **IC** first
```{r count_valid_data_ic, echo=TRUE}
for(str in names(ic_data)[2:9]){
n <- sum(!is.na(ic_data[[str]]))
print(paste(str, "-->", n))
}
```
and then for **EC**
```{r count_valid_data_ec, echo=TRUE}
for(str in names(ec_data)[2:10]){
n <- sum(!is.na(ec_data[[str]]))
print(paste(str, "-->", n))
}
```
For all 273 donors the variable `shot` is defined (Vaccinated or Non
vaccinated).
----
Save everything, it will be loaded in other files.
```{r save_ic_ec, echo=TRUE}
save(ic_data, ec_data, file="ic_ec.Rdata")
```
----
#### _Latest update: 7 April 2014_.