# Pset 7

## Part 1
Use Python to run edgeR externally and collect the results.

In [1]:
#!Rscript analyze_W.r

In [2]:
## write my own R script, based on analyze_W.r, from Python
def runEdgeR (countsFile): 
    l1="library(edgeR)\n"
    l2="infile     <- '{}' \n".format(countsFile)
    l3="group      <- factor(c(1,1,1,2,2,2))\n"   
    l4="outfile    <- 'FCResult.out'\n"
    l5="x     <- read.table(infile, sep='\\t', row.names=1)\n"
    l6="y     <- DGEList(counts=x,group=group)\n"
    l7="y     <- estimateDisp(y)\n"
    l8="et    <- exactTest(y)\n"
    l9="tab   <- topTags(et, nrow(x))\n"
    l10="write.table(tab, file=outfile)\n"
    ## create the R script file and write the command strings to it
    with open('myanalysis.r','w') as out:
        out.writelines([l1, l2, l3, l4, l5, l6, l7, l8, l9, l10])
    
    ## run the R script from the command line
    !Rscript myanalysis.r
    
    ## initialize return data
    geneNames = []
    logFCs = []
    logCPMs = []
    pValues = []
    FDRs = []
    
    ## open the results file, parse each line, and store the fields in their corresponding lists
    with open ("FCResult.out", 'r') as infile: 
        next (infile)
        for line in infile:
            lineInfoList = line.split()
            geneNames.append(lineInfoList[0])
            logFCs.append(float(lineInfoList[1]))
            logCPMs.append(float(lineInfoList[2]))
            pValues.append(float(lineInfoList[3]))
            FDRs.append(float(lineInfoList[4]))
    
    return geneNames, logFCs, logCPMs, pValues, FDRs
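
One detail of the parsing step is worth illustrating: with its default settings, R's `write.table` quotes strings and writes a header that has no column for the row names, so the gene names read above keep their surrounding double quotes. The sketch below is a minimal, hypothetical variant of the parser (`parse_toptags` is not part of the pipeline above) that strips those quotes. It assumes the default space-separated layout and gene names without spaces, and it is exercised on a tiny made-up table rather than on real edgeR output.

```python
import os
import tempfile


def parse_toptags(path):
    """Parse a table in the layout written by R's write.table with default settings."""
    rows = []
    with open(path) as infile:
        next(infile)  # header line lists only logFC, logCPM, PValue, FDR
        for line in infile:
            fields = line.split()
            if len(fields) != 5:
                continue  # skip blank or malformed lines
            gene = fields[0].strip('"')  # write.table quotes strings by default
            log_fc, log_cpm, p_value, fdr = (float(v) for v in fields[1:])
            rows.append((gene, log_fc, log_cpm, p_value, fdr))
    return rows


# Made-up example table (not real results), only to exercise the parser.
example = (
    '"logFC" "logCPM" "PValue" "FDR"\n'
    '"geneA" 1.5 8.2 0.001 0.02\n'
    '"geneB" -0.3 5.1 0.4 0.6\n'
)
with tempfile.NamedTemporaryFile("w", suffix=".out", delete=False) as tmp:
    tmp.write(example)
example_rows = parse_toptags(tmp.name)
os.remove(tmp.name)
print(example_rows[0])
```

Returning one tuple per gene keeps the five fields of a gene together, which may be convenient when comparing genes across merged files.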

In [3]:
## join each data to form the merged data file
!join -t $'\t' w07-data.1 w07-data.2 > merged.12
!join -t $'\t' w07-data.1 w07-data.3 > merged.13
!join -t $'\t' w07-data.2 w07-data.3 > merged.23
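
For readers without the `join` utility, the merge above can be approximated in plain Python. This is only a sketch under assumptions: `join_counts` is a hypothetical helper, each line is taken to be a gene name followed by tab-separated counts, and the toy rows are invented. Unlike `join`, which expects both inputs to be sorted on the join field, the dictionary lookup here does not depend on row order.

```python
def join_counts(rows_a, rows_b):
    """Inner-join two lists of tab-separated lines on their first field."""
    table_b = {}
    for line in rows_b:
        key, _, rest = line.rstrip("\n").partition("\t")
        table_b[key] = rest
    merged = []
    for line in rows_a:
        key, _, rest = line.rstrip("\n").partition("\t")
        if key in table_b:  # keep only genes present in both inputs
            merged.append("\t".join([key, rest, table_b[key]]))
    return merged


# Invented rows standing in for two count files with three samples each.
file_a = ["gene1\t10\t12\t11\n", "gene2\t0\t3\t1\n"]
file_b = ["gene1\t20\t22\t21\n", "gene2\t5\t4\t6\n", "gene3\t1\t1\t1\n"]
merged_example = join_counts(file_a, file_b)
for row in merged_example:
    print(row)
```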

In [4]:
## run edgeR externally and get the results for each merged file
geneNames12, logFCs12, logCPMs12, pValues12, FDRs12 = runEdgeR ("merged.12")
geneNames13, logFCs13, logCPMs13, pValues13, FDRs13 = runEdgeR ("merged.13")
geneNames23, logFCs23, logCPMs23, pValues23, FDRs23 = runEdgeR ("merged.23")

Loading required package: limma
Using classic mode.
Loading required package: limma
Using classic mode.
Loading required package: limma
Using classic mode.


## Part 2
Reproduce Wiggins's result and assign the missing labels.

In [5]:
## count the p-values smaller than 0.05 and return the total
def checkPValues (pValueList): 
    count = 0
    for pValue in pValueList: 
        if pValue < 0.05: 
            count+=1
    return count 

## print the number of cases for each merged file
print ("Number of genes with p<0.05 in data files 1 & 2:  {0:6d}".format(checkPValues(pValues12)))
print ("Number of genes with p<0.05 in data files 1 & 3:  {0:6d}".format(checkPValues(pValues13)))
print ("Number of genes with p<0.05 in data files 2 & 3:  {0:6d}".format(checkPValues(pValues23)))

Number of genes with p<0.05 in data files 1 & 2:    2107
Number of genes with p<0.05 in data files 1 & 3:    1978
Number of genes with p<0.05 in data files 2 & 3:    1018


For the merged file of data files 1 and 2, 2107 genes are called differentially expressed at p < 0.05.

Merged files 12 and 13 have almost the same number of differentially expressed genes (2107 and 1978), whereas merged file 23 has only about 1000, far fewer than the other two. This suggests that data files 2 and 3 are the same type (wild type), while data file 1 is the mutant.
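
When reading these counts, it helps to know how many genes would pass p < 0.05 by chance alone. If no gene truly differs, p-values are roughly uniform on [0, 1], so about 5% of all tested genes are expected to fall below 0.05. The sketch below illustrates that baseline; `expected_null_hits` and `count_below` are hypothetical helpers, and the evenly spaced toy p-values stand in for a null data set, not for the real results.

```python
def expected_null_hits(n_tests, alpha=0.05):
    """Expected number of tests with p < alpha when every null hypothesis is true."""
    return n_tests * alpha


def count_below(values, threshold=0.05):
    """Count how many values fall strictly below the threshold."""
    return sum(1 for v in values if v < threshold)


# Toy p-values spread evenly over [0, 1), mimicking a uniform null distribution.
toy_pvalues = [i / 1000 for i in range(1000)]
observed = count_below(toy_pvalues)
expected = expected_null_hits(len(toy_pvalues))
print(observed, expected)
```

Calling `expected_null_hits(len(pValues23))` and comparing it with `checkPValues(pValues23)` would apply the same check to the merged file for data files 2 and 3.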

## Part 3

In [6]:
## count the FDR values smaller than 0.05 and return the total
def checkFDRs (FDRList): 
    count = 0
    for FDR in FDRList: 
        if FDR < 0.05: 
            count+=1
    return count

## print the number of cases for each merged file
print ("Number of genes with FDR<0.05 in data files 1 & 2:  {0:6d}".format(checkFDRs(FDRs12)))
print ("Number of genes with FDR<0.05 in data files 1 & 3:  {0:6d}".format(checkFDRs(FDRs13)))
print ("Number of genes with FDR<0.05 in data files 2 & 3:  {0:6d}".format(checkFDRs(FDRs23)))

Number of genes with FDR<0.05 in data files 1 & 2:      63
Number of genes with FDR<0.05 in data files 1 & 3:      63
Number of genes with FDR<0.05 in data files 2 & 3:       0


I don't agree with Wiggins's conclusion. He overlooked that edgeR's p-values are computed under the null hypothesis H0 that all samples come from the same type of data.

I believe FDR is a better measure than the p-value in this case. Because so many genes are tested at once, some will pass p < 0.05 by chance alone; the FDR instead controls the false discovery rate, the expected fraction of genes called differentially expressed for which the null hypothesis is actually true. With a threshold of FDR < 0.05, 63 genes pass in both merged file 12 and merged file 13, which suggests that 63 genes are differentially expressed between wild type and mutant. Merged file 23 has 0 such genes, which supports using FDR here: both data sets are the same type, so they should show no differentially expressed genes.
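
To make the FDR column less of a black box: `topTags` adjusts p-values with the Benjamini-Hochberg method by default. The sketch below is a plain-Python version of that adjustment, written for illustration and not taken from edgeR's source; `benjamini_hochberg` is a hypothetical helper and the four toy p-values are invented.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    # Indices of the p-values, sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, so adjusted values never increase
    # as the raw p-value decreases.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


toy_p = [0.01, 0.04, 0.03, 0.005]  # invented p-values
toy_fdr = benjamini_hochberg(toy_p)
print(toy_fdr)
```

Each raw p-value is scaled by the number of tests divided by its rank, which is why a gene needs a much smaller p-value to pass FDR < 0.05 than to pass p < 0.05 when thousands of genes are tested.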

## Part 4

In [7]:
## update the edgeR function with a normalization step added to the pipeline
def runEdgeRQ4 (countsFile): 
    l1="library(edgeR)\n"
    l2="infile     <- '{}' \n".format(countsFile)
    l3="group      <- factor(c(1,1,1,2,2,2))\n"   
    l4="outfile    <- 'FCResult2.out'\n"
    l5="x     <- read.table(infile, sep='\\t', row.names=1)\n"
    l6="y     <- DGEList(counts=x,group=group)\n"
    
    ## add normalization step into the pipeline
    l61="y <- calcNormFactors(y)\n"
    l62="design <- model.matrix(~group)\n"
    l7="y     <- estimateDisp(y, design)\n"  
    l8="et    <- exactTest(y)\n"
    l9="tab   <- topTags(et, nrow(x))\n"
    l10="write.table(tab, file=outfile)\n"
    ## create another R script and store the new commands in it
    with open('mynormalize.r','w') as out:
        out.writelines([l1, l2, l3, l4, l5, l6, l61, l62, l7, l8, l9, l10])
    
    ## run the updated R script file
    !Rscript mynormalize.r
    
    ## store the updated result
    geneNames = []
    logFCs = []
    logCPMs = []
    pValues = []
    FDRs = []
    
    ## open the new results file, parse each line, and store the updated results
    with open ("FCResult2.out", 'r') as infile: 
        next (infile)
        for line in infile:
            lineInfoList = line.split()
            geneNames.append(lineInfoList[0])
            logFCs.append(float(lineInfoList[1]))
            logCPMs.append(float(lineInfoList[2]))
            pValues.append(float(lineInfoList[3]))
            FDRs.append(float(lineInfoList[4]))
    
    return geneNames, logFCs, logCPMs, pValues, FDRs

In [8]:
## rerun edgeR with the updated pipeline and count the genes that pass the FDR threshold
geneNames12_4, logFCs12_4, logCPMs12_4, pValues12_4, FDRs12_4 = runEdgeRQ4 ("merged.12")
geneNames13_4, logFCs13_4, logCPMs13_4, pValues13_4, FDRs13_4 = runEdgeRQ4 ("merged.13")
geneNames23_4, logFCs23_4, logCPMs23_4, pValues23_4, FDRs23_4 = runEdgeRQ4 ("merged.23")

print ("Number of genes with FDR<0.05 in data files 1 & 2:  {0:6d}".format(checkFDRs(FDRs12_4)))
print ("Number of genes with FDR<0.05 in data files 1 & 3:  {0:6d}".format(checkFDRs(FDRs13_4)))
print ("Number of genes with FDR<0.05 in data files 2 & 3:  {0:6d}".format(checkFDRs(FDRs23_4)))

Loading required package: limma
Loading required package: limma
Loading required package: limma
Number of genes with FDR<0.05 in data files 1 & 2:      53
Number of genes with FDR<0.05 in data files 1 & 3:      53
Number of genes with FDR<0.05 in data files 2 & 3:       0


The important step missing from the edgeR pipeline is normalization. `calcNormFactors` computes per-sample scaling factors so that samples with unequal library sizes or composition are compared on a common scale.

After adding that step and rerunning the whole process, I found 53 genes that are differentially expressed between wild type and mutant.
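
As a rough illustration of why scaling matters, the sketch below converts raw counts to counts per million. This is a deliberate simplification and an assumption on my part: `calcNormFactors` uses the TMM method by default, which estimates factors that adjust the library sizes, and that method is not reproduced here. `counts_per_million` is a hypothetical helper and the toy counts are invented.

```python
def counts_per_million(counts_by_sample):
    """Scale each sample's counts by its library size (its total count)."""
    cpm = []
    for sample in counts_by_sample:
        library_size = sum(sample)
        cpm.append([c * 1e6 / library_size for c in sample])
    return cpm


# Invented counts for three genes in two samples.
toy_counts = [
    [100, 300, 600],    # sample A, library size 1000
    [200, 600, 1200],   # sample B, same proportions at twice the depth
]
toy_cpm = counts_per_million(toy_counts)
print(toy_cpm)
```

The raw counts of sample B are double those of sample A, yet after scaling the two samples are identical, so none of these toy genes would look differentially expressed just because of sequencing depth.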