### Nichollas Tidow
### QBIO 401
### HW07: Metagenomics of LA River Bracken Before and After Rain
### October 14, 2020

In [332]:
# set working directory and read bracken files in
setwd("/Users/nicktidow/")
beforeData <- read.csv("Downloads/beforerain.bracken", sep="\t",stringsAsFactors = FALSE)
afterData <- read.csv("Downloads/afterrain.bracken",sep="\t",stringsAsFactors = FALSE)

## Function 01

Write an R program that takes as input one of these files and a threshold, and returns all “names” (1st column) and “fractions” (7th column) of those rows where the fraction is greater than the threshold. Run the program for each file with threshold 0.01.


In [None]:
# Name: function01
# Purpose: read through a Bracken file and return the names and fractions of rows whose fraction exceeds the threshold
# Input: Bracken data frame and threshold value
# Output: data frame of all names and fractions that exceed the threshold

In [333]:
function01 <- function(dataFile, threshold){
    # keep only the rows whose fraction exceeds the threshold (base R, so dplyr is not required)
    aboveThreshold <- dataFile[dataFile$fraction_total_reads > threshold, ]
    # collect the names and fractions of those rows in a data frame
    df01 <- data.frame(names = aboveThreshold$name,
                       fractions = aboveThreshold$fraction_total_reads)
    # return the data frame
    return(df01)
}

In [336]:
function01(beforeData,.01) 
function01(afterData,.01)

names,fractions
Polynucleobacter acidiphobus,0.01528
Homo sapiens,0.01831
Limnohabitans sp. 63ED37-2,0.11301
Limnohabitans sp. 103DPR2,0.01286
Polynucleobacter necessarius,0.02691
Hydrogenophaga sp. RAC07,0.01211
Cloacibacterium normanense,0.02526
beta proteobacterium CB,0.02466


names,fractions
Curvibacter sp. AEP1-3,0.0117
Polynucleobacter acidiphobus,0.01099
Homo sapiens,0.01187
Limnohabitans sp. 63ED37-2,0.05251
Limnohabitans sp. 103DPR2,0.01398
Polynucleobacter necessarius,0.01792
Hydrogenophaga sp. RAC07,0.01938
Cloacibacterium normanense,0.01177
Acidovorax sp. T1,0.01068
beta proteobacterium CB,0.0104


## Function 02

Write an R program that takes as input both of these files and a number n, and returns the “names” and “fractions” (the fractions in both files) for the n names with the greatest absolute difference in fractions between the two files. Note: some names might be present in one file but absent (not even listed) in the other file. The fraction for a name not listed in a file is zero. Run the program with the number n equal to 10.
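The rule for names that appear in only one file can be illustrated on toy data. The sketch below assumes two small, made-up data frames (`toyBefore` and `toyAfter`, with invented taxa and fractions that are not taken from the Bracken files). An outer merge (`all = TRUE`) keeps every name from either file and leaves `NA` where a name is missing, and those `NA` values are then set to 0.

```r
# Toy data, invented for illustration only (not taken from the Bracken files)
toyBefore <- data.frame(name = c("taxonA", "taxonB"),
                        fraction_total_reads = c(0.30, 0.10),
                        stringsAsFactors = FALSE)
toyAfter <- data.frame(name = c("taxonB", "taxonC"),
                       fraction_total_reads = c(0.25, 0.05),
                       stringsAsFactors = FALSE)

# all = TRUE makes this an outer merge, so taxonA and taxonC are both kept
toyMerged <- merge(toyBefore, toyAfter, by = "name", all = TRUE)

# a name that is not listed in a file gets fraction 0 for that file
toyMerged[is.na(toyMerged)] <- 0

# absolute difference between the before (.x) and after (.y) fractions
toyMerged$absolute_difference <- abs(toyMerged$fraction_total_reads.x -
                                     toyMerged$fraction_total_reads.y)
print(toyMerged)
```

Without `all = TRUE`, `merge` would keep only `taxonB`, the one name shared by both toy files, and the names unique to one file would be silently dropped.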

In [136]:
# Name: function02
# Purpose: read through the Bracken files and return the names with the greatest absolute difference in fractions
# Input: before and after Bracken data frames and number n
# Output: names and fractions (from both files) of the n names with the greatest absolute difference in fractions between the files

In [337]:
function02 <- function(before, after, n){
    # outer-merge the two data frames on name so names found in only one file are kept
    mergedDf <- merge(before, after, by = "name", all = TRUE)
    # keep only the name and the two fraction columns (.x = before, .y = after)
    df02 <- mergedDf[, c("name", "fraction_total_reads.x", "fraction_total_reads.y")]
    # a name that is missing from one file has fraction 0 in that file
    df02[is.na(df02)] <- 0
    # add a column with the absolute difference in fractions
    df02$absolute_difference <- abs(df02$fraction_total_reads.x - df02$fraction_total_reads.y)
    # sort the data frame by absolute difference from large to small
    sorted <- df02[order(df02$absolute_difference, decreasing = TRUE), ]
    # return the n rows with the greatest absolute difference
    return(head(sorted, n))
}


In [338]:
function02(beforeData,afterData, 10)

Unnamed: 0,name,fraction_total_reads.x,fraction_total_reads.y,absolute_difference
2190,Limnohabitans sp. 63ED37-2,0.11301,0.05251,0.0605
4631,beta proteobacterium CB,0.02466,0.0104,0.01426
1100,Cloacibacterium normanense,0.02526,0.01177,0.01349
3146,Polynucleobacter necessarius,0.02691,0.01792,0.00899
1279,Curvibacter sp. AEP1-3,0.00278,0.0117,0.00892
1909,Hydrogenophaga sp. RAC07,0.01211,0.01938,0.00727
52,Acidithiobacillus ferrivorans,9e-05,0.00706,0.00697
1894,Homo sapiens,0.01831,0.01187,0.00644
893,Candidatus Planktophila sulfonica,0.00999,0.00455,0.00544
3142,Polynucleobacter acidiphobus,0.01528,0.01099,0.00429


## Function 03

Let $\{r_i\}$ be the “new_est_reads” numbers (6th column) in one of the files. Define $p_i = r_i / \sum_j r_j$. The Shannon diversity for the file is defined as $-\sum_i p_i \ln(p_i)$. Write an R program to compute the Shannon diversity. Run this program for both files and tell us what you find.
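As a quick check of the definition, consider a hypothetical file with three taxa and read counts $r = (2, 1, 1)$. These numbers are made up for illustration and do not come from the river data.

```latex
p = \left(\tfrac{2}{4}, \tfrac{1}{4}, \tfrac{1}{4}\right), \qquad
-\sum_i p_i \ln(p_i)
  = \tfrac{1}{2}\ln 2 + \tfrac{1}{4}\ln 4 + \tfrac{1}{4}\ln 4
  = \tfrac{3}{2}\ln 2 \approx 1.040
```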

In [None]:
# Name: function03
# Purpose: compute shannon diversity for a bracken file
# Input: a bracken file
# Output: Shannon diversity result

In [339]:
function03 <- function(file){
    # estimated read counts r_i (6th column of the Bracken file)
    reads <- file$new_est_reads
    # drop missing or zero counts; they add nothing to the sum (0 * ln(0) is taken as 0)
    reads <- reads[!is.na(reads) & reads > 0]
    # proportions p_i = r_i / sum_j r_j
    proportions <- reads / sum(reads)
    # Shannon diversity: -sum_i p_i * ln(p_i)
    shannonDiversity <- -sum(proportions * log(proportions))
    return(shannonDiversity)
}

In [340]:
# Running function03 for the before rain bracken data of LA River
print("Metagenomic Shannon diversity in LA river before rain:")
function03(beforeData)

[1] "Metagenomic Shannon diversity in LA river before rain:"


In [341]:
# Running function03 for the after rain bracken data of LA River
print("Metagenomic Shannon diversity in LA river after rain:")
function03(afterData)

[1] "Metagenomic Shannon diversity in LA river after rain:"


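The Shannon diversity results indicate that the LA River had greater metagenomic diversity after a period of rainfall than before it. This is consistent with the Function 01 output, where the dominant taxon, Limnohabitans sp. 63ED37-2, dropped from 0.11301 to 0.05251 of total reads and more taxa exceeded the 0.01 threshold after the rain (10 versus 8).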
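One way to see why a higher Shannon value is read as more diversity: for a fixed number of taxa, the index is largest when reads are spread evenly and shrinks when a single taxon dominates. The sketch below uses a hypothetical helper, `shannonFromCounts`, and two made-up communities. Neither the helper name nor the counts come from the homework or the river data.

```r
# Hypothetical helper (not one of the homework functions):
# Shannon diversity computed from a vector of raw counts
shannonFromCounts <- function(counts){
    # zero counts contribute nothing, so drop them before taking logs
    counts <- counts[counts > 0]
    p <- counts / sum(counts)
    -sum(p * log(p))
}

# Made-up communities with the same four taxa and the same total of 100 reads
evenCommunity <- c(25, 25, 25, 25)    # reads spread evenly
skewedCommunity <- c(97, 1, 1, 1)     # one dominant taxon

evenH <- shannonFromCounts(evenCommunity)
skewedH <- shannonFromCounts(skewedCommunity)
print(c(even = evenH, skewed = skewedH))
```

The even community reaches the maximum possible value for four taxa, `log(4)`, while the skewed community scores well below it.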