
# JSC370 Assignment #2

## Due dates

**Presentation:** Feb. 11 in Tutorial, 18:00 - 20:00.

**Written Report, Jupyter Notebook/R Markdown, other files:** Feb. 13 before class, 16:30.


## Background


## Question

Using transcriptome sequencing data from The Cancer Genome Atlas (TCGA), can you identify distinct biological “subtypes” of patients with liver cancer that have different survival times? If you are able to identify subtypes of patients, what impact do age at diagnosis, tumor stage, and sex have on the relationship between survival and subgroup?


## Data

### Access

The data can be obtained by visiting <https://portal.gdc.cancer.gov> and issuing the following query. 

![](download_liver_data.png)

### Data Definitions

The data come from different studies conducted at different institutions and are submitted to the Genomic Data Commons (GDC). The data requirements are outlined [here](https://gdc.cancer.gov/content/selecting-common-cross-study-clinical-data-elements#dmwg), and a data dictionary can be found [here](https://docs.gdc.cancer.gov/Data_Dictionary/viewer/).  


### Linking Clinical Data and Genetic Expression Data

- *Clinical data* is in the file `clinical.tsv` in the folder beginning with the name `clinical.cart`. 

- *Gene expression data* is in the folder beginning with `gdc_download`.  There are 60483 genes measured on each patient.  An explanation of the measurement is [here](https://docs.gdc.cancer.gov/Encyclopedia/pages/HTSeq-FPKM/).

In [59]:
# see https://docs.python.org/3.8/library/glob.html
import glob
# get list of file names
genefilenames = glob.glob('gdc_download_20200130_160123.454228/**/*.gz', recursive = True)
generna = genefilenames[0]
generna

'gdc_download_20200130_160123.454228/abce9187-0834-49c4-998d-b44d2472458a/413446bc-5b88-4b58-8105-abd9cd7a5ddd.FPKM.txt.gz'

In [20]:
import pandas as pd
import numpy as np

genedat = pd.read_csv(generna,
                      sep = '\t', header = None, names = ['gene', 'rna'] )

# log-transform the rna (FPKM) values; log(x + 1) maps zero expression to zero
genedat['logrna'] = np.log1p(genedat['rna'])
print(genedat.count())
genedat.head()

gene      60483
rna       60483
logrna    60483
dtype: int64


| | gene | rna | logrna |
|---|---|---|---|
| 0 | ENSG00000242268.2 | 0.091032 | 0.087124 |
| 1 | ENSG00000270112.3 | 0.0 | 0.0 |
| 2 | ENSG00000167578.15 | 3.037096 | 1.395526 |
| 3 | ENSG00000273842.1 | 0.0 | 0.0 |
| 4 | ENSG00000078237.5 | 0.748311 | 0.55865 |
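
The cell above reads a single expression file. One possible way to extend it to every file returned by `glob` is sketched below. This is only a sketch under stated assumptions: the helper names `read_expression` and `build_matrix` are hypothetical, and the two small gzipped files are toy data written to a temporary folder so that the example runs on its own. With the real download, `paths` would be the `genefilenames` list.

```python
import gzip
import os
import tempfile

import numpy as np
import pandas as pd


def read_expression(path):
    """Read one two-column expression file; return log(rna + 1) indexed by gene."""
    dat = pd.read_csv(path, sep='\t', header=None, names=['gene', 'rna'])
    return np.log1p(dat.set_index('gene')['rna'])


def build_matrix(paths):
    """Stack several files into a table with one row per file, one column per gene."""
    series = {os.path.basename(p): read_expression(p) for p in paths}
    return pd.DataFrame(series).T


# Toy data (NOT real TCGA values): two files, three made-up gene IDs.
tmpdir = tempfile.mkdtemp()
genes = ['ENSG_A', 'ENSG_B', 'ENSG_C']
toy = {'a.FPKM.txt.gz': [0.0, 1.0, 3.0],
       'b.FPKM.txt.gz': [2.0, 0.0, 0.5]}

paths = []
for name, values in toy.items():
    path = os.path.join(tmpdir, name)
    with gzip.open(path, 'wt') as fh:
        for gene, value in zip(genes, values):
            fh.write(f'{gene}\t{value}\n')
    paths.append(path)

expr = build_matrix(paths)
print(expr)
```

Using the file name as the row label keeps each row traceable to the `File Name` column of the sample sheet.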


- The clinical data can be linked to the gene expression data using the `File Name` and `Case ID` columns in the file `gdc_sample_sheet.xxx.tsv`.

In [54]:
# Look up the link for the gene expression file through the File Name
links = pd.read_csv('gdc_sample_sheet.2020-01-30.tsv', sep = '\t')
filename = generna.split('/')[-1] # extract file name (last component of the file path)
links[links['File Name'] == filename] # find match in link file

| | File ID | File Name | Data Category | Data Type | Project ID | Case ID | Sample ID | Sample Type |
|---|---|---|---|---|---|---|---|---|
| 166 | abce9187-0834-49c4-998d-b44d2472458a | 413446bc-5b88-4b58-8105-abd9cd7a5ddd.FPKM.txt.gz | Transcriptome Profiling | Gene Expression Quantification | TCGA-LIHC | TCGA-DD-AAE6 | TCGA-DD-AAE6-01A | Primary Tumor |


In [53]:
# Find the clinical information for the Case ID above
clinical = pd.read_csv('clinical.tsv', sep = '\t')

# extract Case ID from link file
case_id = links[links['File Name'] == filename]['Case ID'].iloc[0]

# use Case ID to select clinical data
(clinical[clinical['submitter_id'] == case_id]
 [['submitter_id', 'age_at_index','gender','vital_status', 
   'days_to_death', 'days_to_last_follow_up', 'tumor_stage', 
   'tumor_grade']])

| | submitter_id | age_at_index | gender | vital_status | days_to_death | days_to_last_follow_up | tumor_stage | tumor_grade |
|---|---|---|---|---|---|---|---|---|
| 392 | TCGA-DD-AAE6 | 59 | female | Alive | -- | 141 | stage i | not reported |
| 393 | TCGA-DD-AAE6 | 59 | female | Alive | -- | 141 | stage i | not reported |
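
The lookup above handles one file and returns two rows for the same `submitter_id`. A possible way to link every file at once is sketched below. It is a sketch only: the data frames are small made-up stand-ins for the sample sheet and `clinical.tsv`, and keeping the first row per case is an assumption that you should check against the real data before relying on it.

```python
import pandas as pd

# Toy stand-ins (NOT real TCGA records) with the column names used above.
links = pd.DataFrame({
    'File Name': ['f1.FPKM.txt.gz', 'f2.FPKM.txt.gz'],
    'Case ID': ['CASE-1', 'CASE-2'],
})
clinical = pd.DataFrame({
    'submitter_id': ['CASE-1', 'CASE-1', 'CASE-2', 'CASE-2'],
    'age_at_index': [59, 59, 64, 64],
    'gender': ['female', 'female', 'male', 'male'],
})

# Assumption: repeated rows for a case carry the same values in the columns
# of interest, so keeping the first one loses nothing.
clinical_unique = clinical.drop_duplicates(subset='submitter_id', keep='first')

# Left join keeps every expression file; validate raises an error if a
# case still appears more than once on the clinical side.
linked = links.merge(clinical_unique,
                     left_on='Case ID', right_on='submitter_id',
                     how='left', validate='many_to_one')
print(linked[['File Name', 'Case ID', 'age_at_index', 'gender']])
```

The `validate='many_to_one'` argument is a cheap guard: if the de-duplication assumption fails, the merge stops instead of silently duplicating patients.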


# Issues to consider

- How will you define if a subject is censored?
- How will you derive the subgroups?
- How will you visualize the results of your analyses?
- How will you decide if the subgroups have different survival times?
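
As a starting point for the first and last questions, the sketch below shows one possible censoring rule and a hand-written Kaplan-Meier estimate. It is not a prescribed method. It assumes that a subject with `vital_status` equal to `Dead` has an observed event at `days_to_death`, that every other subject is censored at `days_to_last_follow_up`, and that `--` marks a missing value. The data frame is made up, and `kaplan_meier` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd

# Toy records (NOT real TCGA data) using the clinical column names shown earlier.
clinical = pd.DataFrame({
    'submitter_id': ['CASE-1', 'CASE-2', 'CASE-3', 'CASE-4'],
    'vital_status': ['Alive', 'Dead', 'Dead', 'Alive'],
    'days_to_death': ['--', '300', '120', '--'],
    'days_to_last_follow_up': ['141', '--', '--', '500'],
})

# Convert the day columns to numbers; '--' becomes NaN.
for col in ['days_to_death', 'days_to_last_follow_up']:
    clinical[col] = pd.to_numeric(clinical[col], errors='coerce')

# Assumed rule: event = 1 if the subject died, 0 if censored.
clinical['event'] = (clinical['vital_status'] == 'Dead').astype(int)
clinical['time'] = np.where(clinical['event'] == 1,
                            clinical['days_to_death'],
                            clinical['days_to_last_follow_up'])


def kaplan_meier(time, event):
    """Return the Kaplan-Meier survival estimate indexed by distinct time."""
    df = pd.DataFrame({'time': np.asarray(time),
                       'event': np.asarray(event)}).dropna()
    per_time = df.groupby('time')['event'].agg(deaths='sum', total='count')
    # Subjects at risk just before each time: everyone not yet removed.
    at_risk = len(df) - per_time['total'].cumsum().shift(fill_value=0)
    return (1 - per_time['deaths'] / at_risk).cumprod()


surv = kaplan_meier(clinical['time'], clinical['event'])
print(surv)
```

On this toy data the estimate drops to 0.75 after the death at day 120 and to 0.375 after the death at day 300. Calling the function once per subgroup gives curves that can be plotted and compared.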

# Preparation Lab Expectations

- Use this time to get familiar with the assignment expectations.
- Work in pairs.  It's OK (and encouraged) to share information.
- Develop strategies on how you plan to tackle the points in [Issues to consider](#Issues-to-consider) and other challenges such as data wrangling and statistical analysis.
- During the last part of the tutorial give a very short presentation on your plan.  
- By the end of the tutorial, submit a brief written plan (via Quercus) on how you will approach the assignment.  You will receive feedback from a TA a few days after submitting. The written plan can be short paragraphs or detailed bullet points.  It's not necessary to include code or data.

# Presentation Expectations

The time allotted for each presentation is 7 minutes plus 3 minutes for questions/discussion. This time limit will be enforced. If you exceed the time limit then you will be asked to stop the presentation. This means that you should rehearse your presentation timing before you present to the class.

## General Presentation Guidelines

The goal of the presentation is to effectively communicate your findings to a non-technical, but educated, audience (e.g., scientists, physicians, health care executives, company managers, etc.). This doesn’t mean that you shouldn’t include technical details, but you should aim to communicate the findings to an audience without a background in statistics, math, or computer science.

You will need to remind us about the project, but only tell us what we really need to know. We are curious about the results and how you present them, but they are not the only purpose of this presentation. So, what should you include? Examples of questions to consider as you prepare your presentation are:

- What problem did your group set out to solve?
- How did your group define the problem?
- What do your results mean in practice? Do your results suggest something should change or not change?

Your presentation will be graded using the [presentation rubric.](https://jsc370.github.io/assignment_rubrics.html#presentation_rubric)

Presentation slides should be uploaded to Quercus by Feb. 11, 18:00.


# Written Report Expectations

The written report should be done using a Jupyter Notebook or RMarkdown document so that it's reproducible (i.e., we should be able to run the `.ipynb` or `.Rmd` file to reproduce the report). The written report should be at most five pages. This means that you will have to be selective in what you choose to report, and which plots you choose to display.

## Answers to Some Common Questions about the Written Report


R/Python code chunks do not need to appear in the report unless some part of the code helps describe what you have done in the data analysis. To hide them, in R Markdown use the chunk options `echo=FALSE`, `warning = FALSE`, `message = FALSE`, and in Jupyter use the command line tool `nbconvert` <sup><a href="#fn1" id = "ref1">1</a></sup>. In particular, don’t submit a report that shows warning messages from a library you loaded.

Also, you will be submitting your R Markdown/Jupyter Notebook file so we can see all the gory details. This leads to …

What should be in the report? A high level description of what you have done. This leads to …

Who is the intended audience for the report and what do you mean by a “high level description”? The intended audience is an educated person who has taken at least one basic statistics course, but might be a bit rusty on the details. For example, imagine a supervisor at work who completed an MBA ten years ago and took a few statistics courses, but for whom the details are now a bit hazy.

## How will my writing be evaluated?

Your writing will be evaluated for clarity and conciseness.

**Title [1-5]:** There should be an appropriate title, adequate summary, and complete information including names and dates.

**Introduction [1-5]:** The purpose of the research should be clearly stated and the scope of what is considered in the report should be clear.

**Methods [1-5]:** The role of each method should be clearly stated. The description of the analyses should be clear and unambiguous so that another statistician or data scientist could easily re-construct it. The methods should be described accurately.


**Results [1-5]:** There should be appropriate tables and graphs. The results should be clearly stated in the context of the problem. The size and direction of significant results should be given. The results must be accurately stated. The research question should be adequately answered.


**Conclusion / Discussion [1-5]:** The results should be clearly and completely summarized. This section should also include discussion of limitations and/or concerns and/or suggestions for future consideration as appropriate.


**General Considerations [1-5]:** The ideas should be presented in logical order, with well-organized sections, no grammatical, spelling, or punctuation errors, an appropriate level of technical detail, and be clear and easy to follow.

## How will my data analysis and programming be evaluated?

Data analysis and programming will be evaluated according to the [data analysis](https://jsc370.github.io/assignment_rubrics.html#data_analysis_rubric) and [programming](https://jsc370.github.io/assignment_rubrics.html#programming_rubric) rubrics.


<hr></hr>

<sup id="fn1">1. For example, to convert foo.ipynb to an HTML document without code cells, the command line syntax for nbconvert is: `jupyter nbconvert --to html --TemplateExporter.exclude_input=True foo.ipynb`. For more information, see the <a href="https://nbconvert.readthedocs.io/en/latest/install.html">`nbconvert`</a> documentation. <a href="#ref1" title="Jump back to footnote 1 in the text."> ↩ </a></sup>

