### Hyun-Joon Yang
### yanghyun@usc.edu
### QBIO 401
### Assignment 7

On Blackboard, you can find two files: “beforerain.bracken” and “afterrain.bracken”. The first file contains a sample taken from the LA River before rain on a certain day, and the second contains a sample taken from the same river a few days later, after rain.

These files are outputs from the Bracken program.

In [1]:
rm(list=ls())

In [2]:
data1 <- read.table('beforerain.bracken', header=T, sep='\t', stringsAsFactors=F)
head(data1)
dim(data1)
data2 <- read.table('afterrain.bracken', header=T, sep='\t', stringsAsFactors=F)
head(data2)
dim(data2)

| name | taxonomy_id | taxonomy_lvl | kraken_assigned_reads | added_reads | new_est_reads | fraction_total_reads |
| --- | --- | --- | --- | --- | --- | --- |
| Streptococcus sp. oral taxon 431 | 712633 | S | 46 | 10 | 56 | 1e-05 |
| Neorhizobium sp. NCHU2750 | 1825976 | S | 581 | 132 | 713 | 0.00014 |
| Pseudomonas sp. MT-1 | 150396 | S | 109 | 233 | 342 | 7e-05 |
| Ahniella affigens | 2021234 | S | 489 | 6 | 495 | 0.0001 |
| Sinorhizobium sp. CCBAU 05631 | 794846 | S | 81 | 96 | 177 | 4e-05 |
| Cohnella sp. 18JY8-7 | 2480923 | S | 162 | 4 | 166 | 3e-05 |


| name | taxonomy_id | taxonomy_lvl | kraken_assigned_reads | added_reads | new_est_reads | fraction_total_reads |
| --- | --- | --- | --- | --- | --- | --- |
| Streptococcus sp. oral taxon 431 | 712633 | S | 20 | 4 | 24 | 1e-05 |
| Neorhizobium sp. NCHU2750 | 1825976 | S | 860 | 196 | 1056 | 0.00022 |
| Pseudomonas sp. MT-1 | 150396 | S | 100 | 195 | 295 | 6e-05 |
| Ahniella affigens | 2021234 | S | 549 | 7 | 556 | 0.00011 |
| Sinorhizobium sp. CCBAU 05631 | 794846 | S | 72 | 83 | 155 | 3e-05 |
| Cohnella sp. 18JY8-7 | 2480923 | S | 216 | 7 | 223 | 5e-05 |


In [3]:
rm(data1, data2)

#### 1. Write an R program that takes as input one of these files and a threshold, and returns all “names” (1st column) and “fractions” (7th column) of those rows where the fraction is greater than the threshold. Run the program for each file with threshold 0.01 [2pt].

In [4]:
get_by_threshold <- function(filename, threshold) {
    # read table
    mydata <- read.table(filename, header=T, sep='\t', stringsAsFactors=F)
    # keep names and fractions of rows whose fraction is greater than the threshold
    mydata <- mydata[mydata$fraction_total_reads > threshold, c('name', 'fraction_total_reads')]
    # rename columns
    mydata <- data.frame('names'=mydata$name, 'fractions'=mydata$fraction_total_reads)
    return(mydata)
}

In [5]:
threshold <- 0.01

before <- get_by_threshold('beforerain.bracken', threshold)
after <- get_by_threshold('afterrain.bracken', threshold)

In [6]:
before

| names | fractions |
| --- | --- |
| Polynucleobacter acidiphobus | 0.01528 |
| Homo sapiens | 0.01831 |
| Limnohabitans sp. 63ED37-2 | 0.11301 |
| Limnohabitans sp. 103DPR2 | 0.01286 |
| Polynucleobacter necessarius | 0.02691 |
| Hydrogenophaga sp. RAC07 | 0.01211 |
| Cloacibacterium normanense | 0.02526 |
| beta proteobacterium CB | 0.02466 |


In [7]:
after

| names | fractions |
| --- | --- |
| Curvibacter sp. AEP1-3 | 0.0117 |
| Polynucleobacter acidiphobus | 0.01099 |
| Homo sapiens | 0.01187 |
| Limnohabitans sp. 63ED37-2 | 0.05251 |
| Limnohabitans sp. 103DPR2 | 0.01398 |
| Polynucleobacter necessarius | 0.01792 |
| Hydrogenophaga sp. RAC07 | 0.01938 |
| Cloacibacterium normanense | 0.01177 |
| Acidovorax sp. T1 | 0.01068 |

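As a quick sanity check of the "greater than the threshold" rule, the filter can be tried on a small made-up table instead of a Bracken file. This is only a sketch: `filter_by_threshold` and the toy taxa are hypothetical names introduced for illustration, and the function takes a data frame rather than a filename so that the example runs without the assignment files.

```r
# Hypothetical helper: keep rows whose fraction is strictly greater
# than the threshold, returning the same two columns as in part 1.
filter_by_threshold <- function(df, threshold) {
    keep <- df$fraction_total_reads > threshold
    data.frame(names = df$name[keep],
               fractions = df$fraction_total_reads[keep],
               stringsAsFactors = FALSE)
}

# Made-up toy table (not taken from the Bracken files)
toy <- data.frame(
    name = c("taxon A", "taxon B", "taxon C"),
    fraction_total_reads = c(0.20, 0.01, 0.005),
    stringsAsFactors = FALSE
)

# "taxon B" sits exactly on the threshold, so a strict ">" excludes it
result <- filter_by_threshold(toy, 0.01)
print(result)
```

With a strict comparison, a row whose fraction equals the threshold is left out; using `>=` instead would include it.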

#### 2. Write an R program that takes as input both of these files and a number n, and returns the “names” and “fractions” (the fractions in both files) for the n names with the greatest absolute difference in fractions between the two files. Note: some names might be present in one file but absent (not even listed) in the other file. The fraction for the name not listed in the file is zero. Run the program with the number n equal to 10 [3pt].

In [8]:
get_by_difference <- function(file1, file2, n) {
    # read files
    data1 <- read.table(file1, header=T, sep='\t', stringsAsFactors=F)
    data2 <- read.table(file2, header=T, sep='\t', stringsAsFactors=F)
    # names present in both files (names found in only one file are dropped here)
    names <- intersect(data1$name, data2$name)
    # look up each name's fraction in both files
    x <- data1$fraction_total_reads[match(names, data1$name)]
    y <- data2$fraction_total_reads[match(names, data2$name)]
    # create data frame with the absolute difference in fractions
    output <- data.frame(names=names, diff=abs(x - y))
    # get entries with largest differences
    orders <- order(output$diff, decreasing=T)
    return(output[orders[1:n],])
}

In [9]:
greatest_difference <- get_by_difference('beforerain.bracken', 'afterrain.bracken', 10)
greatest_difference

| | names | diff |
| --- | --- | --- |
| 1103 | Limnohabitans sp. 63ED37-2 | 0.0605 |
| 2043 | Cloacibacterium normanense | 0.01349 |
| 1587 | Polynucleobacter necessarius | 0.00899 |
| 150 | Curvibacter sp. AEP1-3 | 0.00892 |
| 1986 | Hydrogenophaga sp. RAC07 | 0.00727 |
| 1038 | Homo sapiens | 0.00644 |
| 949 | Candidatus Planktophila sulfonica | 0.00544 |
| 876 | Polynucleobacter acidiphobus | 0.00429 |
| 490 | Polynucleobacter difficilis | 0.00399 |
| 1015 | Acidovorax sp. KKS102 | 0.00381 |

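`get_by_difference` compares only names that appear in both files and reports the difference alone, whereas the task asks for a missing name to count as a fraction of zero and for both fractions to be returned. One possible way to cover both points is an outer join with `merge(..., all=TRUE)`. The sketch below is an assumption-laden illustration, not the submitted solution: `top_differences`, the column names `before` and `after`, and the toy data are all hypothetical, and the function takes data frames rather than filenames so that it runs on its own.

```r
# Hypothetical variant: rank names by absolute difference in fraction,
# treating a name that is missing from one table as fraction 0.
top_differences <- function(df1, df2, n) {
    # outer join on name keeps rows that appear in only one table
    both <- merge(df1[, c("name", "fraction_total_reads")],
                  df2[, c("name", "fraction_total_reads")],
                  by = "name", all = TRUE,
                  suffixes = c("_before", "_after"))
    names(both) <- c("name", "before", "after")
    # a name absent from one table shows up as NA; count it as zero
    both$before[is.na(both$before)] <- 0
    both$after[is.na(both$after)] <- 0
    both$diff <- abs(both$before - both$after)
    # largest differences first; head() copes with n > nrow(both)
    head(both[order(both$diff, decreasing = TRUE), ], n)
}

# Made-up toy tables (not taken from the Bracken files)
toy_before <- data.frame(name = c("taxon A", "taxon B", "taxon C"),
                         fraction_total_reads = c(0.50, 0.30, 0.20),
                         stringsAsFactors = FALSE)
toy_after <- data.frame(name = c("taxon A", "taxon B", "taxon D"),
                        fraction_total_reads = c(0.40, 0.35, 0.25),
                        stringsAsFactors = FALSE)

top2 <- top_differences(toy_before, toy_after, 2)
print(top2)
```

In the toy data the two largest differences belong to "taxon D" and "taxon C", each of which is listed in only one table, so an intersection-based comparison would miss both.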

#### 3. Let $\{r_i\}$ be the “new_est_reads” numbers (6th column) in one of the files. Define $p_i = \frac{r_i}{\sum_j r_j}$. The Shannon diversity for the file is defined as $-\sum_i p_i \ln(p_i)$. Write an R program to compute the Shannon diversity. Run this program for both files and tell us what you find [3pt].

In [10]:
compute_diversity <- function(filename) {
    # read table
    mydata <- read.table(filename, header=T, sep='\t', stringsAsFactors=F)
    # find p
    r_total <- sum(mydata$new_est_reads)
    mydata$p <- mydata$new_est_reads / r_total
    # find H, skipping rows with zero reads (0 * log(0) evaluates to NaN in R)
    p <- mydata$p[mydata$p > 0]
    return(-sum(p * log(p)))
}

In [11]:
diversity_before <- compute_diversity('beforerain.bracken')
diversity_after <- compute_diversity('afterrain.bracken')

In [12]:
print(diversity_before)
print(diversity_after)

[1] 6.137442
[1] 6.723041


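The Shannon diversity is higher after the rain (6.72) than before it (6.14), so the sampled community appears more diverse after the rain. This is consistent with part 1, where the most abundant taxon, Limnohabitans sp. 63ED37-2, drops from a fraction of 0.113 to 0.053 and more taxa pass the 0.01 threshold (9 after the rain versus 8 before).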
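
To give a feel for how the Shannon diversity responds to the shape of the read distribution, the formula can be applied to a few made-up count vectors. This is a minimal sketch under stated assumptions: `shannon` is a hypothetical helper that takes a vector of read counts directly, and the counts are invented for illustration.

```r
# Hypothetical helper: Shannon diversity of a vector of read counts,
# using the natural logarithm as in the definition above.
shannon <- function(reads) {
    p <- reads / sum(reads)
    p <- p[p > 0]          # terms with p = 0 contribute nothing to the sum
    -sum(p * log(p))
}

# Made-up counts: four taxa with equal reads vs. one dominant taxon
even   <- shannon(c(25, 25, 25, 25))
skewed <- shannon(c(97, 1, 1, 1))
single <- shannon(c(10, 0))

print(even)     # equals log(4) for four equally abundant taxa
print(skewed)   # lower than `even`, because one taxon dominates
print(single)   # 0: all reads belong to one taxon
```

For a fixed number of taxa the value is largest when reads are spread evenly and shrinks as one taxon dominates, so a higher value can reflect more taxa, a more even spread of reads, or both.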