---
title: "Bioinformatics prac 1: Hello World!"
author: "Mark Ziemann"
date: "`r Sys.Date()`"
output:
  html_document:
    toc: true
    theme: cosmo
---
Source: https://github.com/markziemann/SLE712_files/blob/master/BioinfoPrac1.Rmd
## Introduction
What is bioinformatics?
It is the application of data science techniques to biological and biomedical problems.
This is important because biology and biomedicine are becoming more data-driven.
Being able to process large datasets, analyse them and create meaningful charts and tables
is therefore becoming an essential skill.
Bioinformatics is done in many languages including Perl, Python, Java, Matlab and more, so why R?

1. It is free and open source.
2. It works on Linux, Mac and Windows (and in the cloud).
3. It is a premier language for numerical analysis and statistics.
4. It has some of the best data visualisation options.
5. It has a vibrant and noob-friendly base of science users throughout the world.
6. There is a huge array of packages available on [CRAN](https://cran.r-project.org/) and [Bioconductor](https://www.bioconductor.org/).
7. Nearly everything you learn in R here can be applied to other fields of endeavour such as business, engineering and medicine.

Therefore all of the bioinformatics content covered in this unit will be related to using R.
We will be using the [Rstudio](https://rstudio.com/) application to help us manage the development process.
It is a graphical interface with different panels and tools which makes the code development process easier and more efficient.
You can log on to our Rstudio server here: http://118.138.234.73:8787/
The username and password were provided to you by email. If you missed out
This week, our goal is to learn:

* how to use the Rstudio interface
* about different data objects in R
* how to do basic math in R

## The Rstudio interface
Let's get to know the Rstudio interface a bit better.
It consists of a menu bar at the top of the browser window and four panels below:

| Menu Bar | |
|:----:|:----:|
| Top left: Script (type and save your scripts here) | Top right: Global Environment (R data objects you can work with) |
| Bottom left: Console (commands are executed here) | Bottom right: Files, Plots and help pages |
If you cannot see the Script panel, click "New Script" on the menu bar and it should appear.
Let’s watch a video together to learn the basics of the Rstudio environment.
<a href="http://www.youtube.com/watch?feature=player_embedded&v=5YmcEYTSN7k" target="_blank"><img src="http://img.youtube.com/vi/5YmcEYTSN7k/0.jpg"
alt="Video: the basics of the Rstudio environment" width="480" height="360" border="10" /></a>
## Let's get started with using R!
### Arithmetic
The first thing we will cover in R is the different types of data structures.
Let’s go through the different data types together: https://www.statmethods.net/input/datatypes.html
Open up your Rstudio and try these commands by typing in the script panel and hitting the “Run” button or Ctrl+Enter, observing the output.
The hash character `#` is used to include comments.
Comments help the reader understand the purpose of the commands, so it is a good idea to add them to your code.
Any characters that come after the `#` on a line are ignored.
We'll start with some arithmetic.
```{r,math1}
1+2
10000^2
sqrt(64)
```
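To see comments and arithmetic working together, here is a small sketch. The integer division (`%/%`) and remainder (`%%`) operators are standard in base R but are not used elsewhere in this prac, and the numbers are made up for illustration.

```{r,sketch_comments}
# Everything after the hash on a line is ignored by R
135 %/% 60    # integer division: whole hours in 135 minutes
135 %% 60     # remainder: the minutes left over
(1 + 2) * 3   # brackets are evaluated first
1 + 2 * 3     # multiplication happens before addition
```

Try removing a `#` and running the line again to see what R makes of the comment text.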
Notice that as you start typing a command like `sqrt`, Rstudio will give you suggestions. You can use the tab key to autocomplete.
### Working with variables
Use the `<-` back arrow or `=` to define variables.
```{r,math2}
s <- sqrt(64)
s
a <- 2
b <- 5
c <- a*b
c
```
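The sketch below shows both assignment styles side by side, and that a variable can be overwritten using its own old value. The names `width`, `height` and `area` are made up for this example.

```{r,sketch_assignment}
width <- 4          # assignment with the arrow
height = 2.5        # assignment with the equals sign also works
area <- width * height
area
width <- width + 1  # overwrite a variable using its old value
width * height      # area is not updated automatically, so we recalculate
```

Note that `area` still holds the old result after `width` changes. R stores the value, not the formula.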
### Vectors
We can also do operations on vectors of numbers.
Vectors are created with the `c()` function, like this: `c(1,2,3)`.
```{r,math3}
c(1,2,3)
sum(c(1,10,100))
mean(c(1,10,100))
median(c(1,10,100))
max(c(1,10,100))
min(c(1,10,100))
```
### Saving variables
To make things easier to read and write, we can save the vector as a variable `a`.
Then we can work on `a`.
We use the `<-` to save objects in R, but `=` also works.
`length()` tells us the number of elements in the vector, which is very useful.
```{r,math4}
a <- c(1,10,100)
a
2*a
a+1
sum(a)
mean(a)
median(a)
sd(a)
var(a)
length(a)
```
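Arithmetic also works between two vectors of the same length, one element at a time. Here is a minimal sketch with made-up values. The names `dose` and `response` are only for illustration.

```{r,sketch_vector_math}
dose <- c(1, 10, 100)       # made-up values
response <- c(2, 18, 150)   # made-up values, same length as dose
response / dose             # element-wise division
response - dose             # element-wise subtraction
length(dose) == length(response)
# The standard deviation is the square root of the variance
all.equal(sd(dose), sqrt(var(dose)))
```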
### R object types
Whenever we are using an R object we should check exactly what sort of object it is.
Using the wrong type of object is the most common source of errors you will encounter.
We use the `str()` command to check the structure.
Other commands we can use to investigate the data structure include `class()` and `typeof()`.
The colon `:` can be used to specify a series, for example `1:5` for the numbers 1 to 5.
Note the different object types here:

* A is a numeric vector
* B is an integer vector (series)
* C is a named numeric vector
* D is a character string
* E is a character vector, which becomes a named character vector once names are assigned

```{r,math5}
A <- c(1,10,100)
A
str(A)
class(A)
typeof(A)
B <- 1:10
B
str(B)
class(B)
typeof(B)
C <- c("prime1"=2, "prime2"=3, "prime3"=5, "prime4"=7)
C
str(C)
class(C)
typeof(C)
D <- "x1"
D
str(D)
class(D)
typeof(D)
E <- c("x1", "y2", "z3")
E
str(E)
names(E) <- c("code1","code2","code3")
E
str(E)
class(E)
typeof(E)
```
Note how the names can be added either when the vector is defined or afterwards.
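Names are handy because they let us look up values by label. The sketch below is a small illustration using square brackets, which are covered in more detail under subsetting. The vector name `primes` is made up for this example.

```{r,sketch_named_vectors}
primes <- c("prime1" = 2, "prime2" = 3, "prime3" = 5, "prime4" = 7)
names(primes)                    # retrieve the names
primes["prime3"]                 # look up one element by name
primes[c("prime1", "prime4")]    # look up several elements by name
unname(primes["prime3"])         # drop the name and keep only the value
```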
### Factors
Factors are an entirely different type of data in R.
They are displayed as non-numeric categories, but are stored internally as integer codes.
Typically factors are used in R for categorical data like biological sex.
The example here is an unordered category.
```{r,factors1}
x <- factor(c("single", "married", "married", "single","defacto","widowed"))
x
str(x)
levels(x)
```
But it is possible to have ordered factors as well.
Take this survey result as an example, where respondents rate the food at a restaurant between very poor and very good.
These responses have a natural order, so it makes sense to treat these as ordered factors.
```{r,factors2}
y <- c("good", "very good", "fair", "poor","good")
str(y)
yy <- factor(y, levels = c("very poor","poor","fair","good","very good"),ordered = TRUE)
yy
str(yy)
levels(yy)
```
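To see what the ordering gives us, here is a minimal sketch with the same survey responses. The variable name `rating` is made up for this example; `as.integer()`, `table()` and `max()` are base R functions.

```{r,sketch_ordered_factor}
rating <- factor(c("good", "very good", "fair", "poor", "good"),
  levels = c("very poor", "poor", "fair", "good", "very good"),
  ordered = TRUE)
as.integer(rating)   # the integer codes stored internally
table(rating)        # counts per level, including levels with no responses
rating > "fair"      # comparisons respect the order of the levels
max(rating)          # the highest rating given
```

Comparisons like `rating > "fair"` only work for ordered factors. On an unordered factor, R returns `NA` with a warning.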
### Logicals
R also uses `TRUE/FALSE` a lot, as a data type as well as an option when executing commands.
It is also possible to have vectors of logical values.
```{r,logicals}
myvariable1 <- 0
as.logical(myvariable1)
myvariable2 <- 1
as.logical(myvariable2)
vals <- c(0,1,0,1,0,0,0,1,1,0,1,0,1,0)
vals
str(vals)
as.logical(vals)
```
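In practice, logical vectors usually come from comparisons rather than from `as.logical()`. The following sketch uses made-up measurements. The names `counts` and `is_detected` are only for illustration.

```{r,sketch_logicals}
counts <- c(0, 12, 3, 0, 45, 7)   # made-up measurements
is_detected <- counts > 0         # a comparison returns a logical vector
is_detected
sum(is_detected)                  # TRUE counts as 1, so this counts the TRUE values
mean(is_detected)                 # proportion of TRUE values
!is_detected                      # the exclamation mark flips each value
is_detected & counts < 10         # element-wise AND of two conditions
```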
### Subsetting vectors
Next, we would like to subset vectors.
To do this, we use square brackets and inside the square bracket we indicate which values we want, using 1 for the first and 2 for second and so on.
See how it is possible to get the last and second last elements.
We can subset vectors, run arithmetic operations and save the results to a new variable in a single line.
```{r,vec_subset}
a <- c(2:22,98,124,3002)
length(a)
a[2]
a[3:4]
a[c(1,3)]
a[length(a)]
a[(length(a)-1)]
x <- a[10:(length(a)-1)] * 2
x
```
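Square brackets accept more than positive positions. As a short sketch, negative numbers drop elements and a logical vector keeps the elements where it is `TRUE`. The vector `v` is a copy of the one above under a made-up name, and `head()`, `tail()` and `which()` are base R helpers.

```{r,sketch_subsetting}
v <- c(2:22, 98, 124, 3002)
v[-1]          # everything except the first element
v[-(1:20)]     # drop the first 20 elements
v[v > 50]      # keep only the values greater than 50
head(v, 3)     # the first three elements
tail(v, 2)     # the last two elements
which(v > 50)  # the positions of the values greater than 50
```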
### Coercing R data into different types
As shown above with commands like `factor` and `as.logical`, it is possible to convert objects into different types.
Here are some further examples.
Note that some conversions don't make sense: they may return `NA` values (sometimes with a warning) or silently change the data, which can cause errors in your analysis.
```{r,conv1}
a <- c(1.9,2.7,3.3,5.1,9.9,0)
a
as.integer(a)
as.character(a)
as.logical(a)
as.factor(a)
b <- c("abc","def","ghi","jkl")
b
as.numeric(b)
as.logical(b)
as.integer(b)
my_factor <- as.factor(b)
as.numeric(my_factor)
```
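One conversion that often catches people out is turning a factor of number-like labels back into numbers. Here is a minimal sketch of the problem and a common workaround. The factor `f` and its values are made up for this example.

```{r,sketch_factor_to_numeric}
f <- factor(c("10", "5", "20", "5"))
levels(f)                              # levels are sorted as text, not as numbers
as.numeric(f)                          # gives the internal codes, not the original numbers
as.numeric(as.character(f))            # convert to character first to recover the numbers
suppressWarnings(as.numeric("abc"))    # text that is not a number becomes NA
```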
### Creating random and semi random data
Random sampling is used a lot in statistics and probability as well as in simulation analysis.
```{r,sample1}
nums <- 1:5
sample(x = nums, size = 3)
sample(x = nums, size = 5)
sample(x = nums ,size = 10, replace = TRUE)
```
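Because the results are random, you will get different numbers each time you run these commands. If you need the same "random" result every time, you can set a seed first with `set.seed()`. In this sketch the seed value 42 is an arbitrary choice, and `first_draw` and `second_draw` are made-up names.

```{r,sketch_set_seed}
set.seed(42)                               # 42 is an arbitrary seed
first_draw <- sample(x = 1:5, size = 3)
set.seed(42)                               # resetting the seed restarts the sequence
second_draw <- sample(x = 1:5, size = 3)
first_draw
second_draw
identical(first_draw, second_draw)         # the two draws match exactly
```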
We can also sample from distributions.
Here we are sampling 5 numbers from a normal distribution with a mean of 10 and a standard deviation of 2.
```{r,distributions1}
d <- rnorm(n = 5, mean = 10, sd = 2)
d
```
Here we are sampling 20 numbers from a binomial distribution with a size of 50 and probability of 0.5
```{r,distributions2}
b <- rbinom(n = 20, size = 50, prob = 0.5)
b
mean(b)
```
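With a larger sample we can check the simulation against theory: a binomial distribution has an expected mean of `size * prob` and a variance of `size * prob * (1 - prob)`. The sketch below assumes an arbitrary seed and a made-up variable name `draws`.

```{r,sketch_binomial_check}
set.seed(1)                                         # arbitrary seed
draws <- rbinom(n = 10000, size = 50, prob = 0.5)
mean(draws)    # should be close to 50 * 0.5 = 25
var(draws)     # should be close to 50 * 0.5 * 0.5 = 12.5
range(draws)   # every value lies between 0 and 50
```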
### Basic plots
Creating basic plots in R isn't very difficult.
Here are some simple ones: a dot plot and a line plot, followed by a plot with an extra line added in a different colour and a subheading.
```{r,dot1}
a <- (1:10)^2
a
plot(a)
plot(a,type="l")
plot(a,type="b")
plot(a,type="b")
lines(a/2, type="b",col="red")
mtext("Black:Growth of A. Red: growth of A/2")
```
Now for scatterplots.
We can change the point type (`pch`) and size (`cex`), as well as add a main heading (`main`).
We can also add additional series of points to the chart and adjust the axis limits with `xlim` and `ylim`.
```{r,scatter1}
# Simulate x values, then y values equal to x multiplied by a small random error
x_vals <- rnorm(n = 1000, mean = 10, sd = 2)
d_error <- rnorm(n = 1000, mean = 1, sd = 0.1)
y_vals <- x_vals * d_error
# Basic scatterplot with axis labels
plot(x=x_vals, y=y_vals, xlab="my x values", ylab="my y values")
# Solid, smaller points and a main heading
plot(x=x_vals, y=y_vals, xlab="my x values", ylab="my y values",pch=19, cex=0.5, main="Plot of X and Y values")
# The same plot with a second series of points added in blue
plot(x=x_vals, y=y_vals, xlab="my x values", ylab="my y values",pch=19, cex=0.5, main="Plot of X and Y values")
points(x=x_vals, y=y_vals/2, pch=19, cex=0.5,col="blue")
# Set the y-axis limits with ylim so the lower blue points fit on the chart
plot(x=x_vals, y=y_vals, xlab="my x values", ylab="my y values",pch=19,
  cex=0.5, main="Plot of X and Y values",
  ylim=c(0,17))
points(x=x_vals, y=y_vals/2, pch=19, cex=0.5,col="blue")
```
Let's now run a linear regression on the relationship of y with x.
```{r,scatter2}
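# Fit a straight line: y_vals = intercept + slope * x_vals
linear_regression_model <- lm(y_vals ~ x_vals)
summary(linear_regression_model)
# The first coefficient is the intercept and the second is the slope
INTERCEPT <- linear_regression_model$coefficients[1]
SLOPE <- linear_regression_model$coefficients[2]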
HEADER <- paste("Slope:",signif(SLOPE,4),"Intercept:",signif(INTERCEPT,4))
plot(x=x_vals, y=y_vals, xlab="my x values", ylab="my y values",pch=19, cex=0.5)
abline(linear_regression_model,col="red",lty=2,lwd=3)
mtext(HEADER)
```
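Coefficients can also be extracted by name rather than by position, which avoids mixing up the intercept and the slope. The sketch below is a minimal illustration using simulated data. The names `x_demo`, `y_demo` and `demo_model`, the seed, and the "true" values of 3 and 2 are all assumptions made for this example.

```{r,sketch_coef_by_name}
set.seed(7)                                                   # arbitrary seed
x_demo <- 1:50
y_demo <- 3 + 2 * x_demo + rnorm(n = 50, mean = 0, sd = 1)    # true intercept 3, true slope 2
demo_model <- lm(y_demo ~ x_demo)
coef(demo_model)                       # a named vector: "(Intercept)" then "x_demo"
demo_intercept <- coef(demo_model)[["(Intercept)"]]
demo_slope <- coef(demo_model)[["x_demo"]]
demo_intercept                         # should be close to 3
demo_slope                             # should be close to 2
```

The slope is always named after the predictor variable in the formula, so here it is `x_demo`.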
Barplots are also really useful.
The names of the vector are used as the bar labels, so the quantities should be named.
```{r,barplots1}
names(a) <- 1:length(a)
barplot(a)
barplot(a,horiz = TRUE,las=1, xlab = "Measurements")
```
Boxplots are relatively easy to create.
```{r,boxplots1}
boxplot(x_vals, y_vals/2)
boxplot(x_vals, y_vals/2, names=c("X values", "Y values"),ylab="Measurement (cm)")
```
Histograms are easily made.
And it is possible to place multiple charts on a single image.
```{r,histograms1}
hist(x_vals)
par(mfrow = c(2, 1))
hist(x_vals,main="")
hist(y_vals,main="")
```
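When you change the layout with `par()` in a script or in the console, the setting stays in place for later plots until you change it back. A common pattern, sketched below, is to save the old settings and restore them afterwards. The data and the names `hist_data` and `old_par` are made up for this example.

```{r,sketch_par_reset}
hist_data <- rnorm(n = 1000, mean = 10, sd = 2)   # made-up data
old_par <- par(mfrow = c(1, 2))    # par() returns the previous settings
hist(hist_data, main = "")         # left panel
hist(hist_data / 2, main = "")     # right panel
par(old_par)                       # restore the previous single-panel layout
par("mfrow")                       # check the current layout
```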
## Homework questions for group study

1. Calculate the sum of all integers between 500 and 600.
2. Calculate the sum of the square roots of all integers between 900 and 1000.
3. Create the following datasets and plot a boxplot:
    a. A sample of 10000 datapoints from a normal distribution with a mean of 50 and SD of 5
    b. A sample of 10000 datapoints from a normal distribution with a mean of 50 and SD of 10
4. Plot a and b above as a scatterplot, and plot the trend line.
5. Plot a and b above as histograms on the same chart.

## Next week we will
Learn:
* how to work with data frames and matrices
* how to read in files properly with R
* how to filter, subset, search and sort data
* more types of analysis like correlations
* how to generate more sophisticated data types
## Session information
For reproducibility.
```{r,sessioninfo}
sessionInfo()
```