# BIT homework


## Jupyter and conda for R

Jupyter, which grew out of the IPython project, is widely used by data scientists, researchers, and analysts. Its notebook interface mixes executable code with narrative text, equations, interactive visualizations, and images, which helps teams collaborate and supports reproducible research and training. Jupyter began with Python and now has kernels for 50 different languages; IRKernel is the native R kernel for Jupyter.

Data scientists, researchers, and analysts use the conda package manager to install and organize project dependencies. With conda they can easily build and share metapackages, which are downloadable bundles of packages. Conda works with Linux, OS X, and Windows, and is language agnostic, so we can use it with any programming language and with projects that depend on multiple languages.

Let’s use conda and Jupyter to start a data science project in R.

## “R Essentials” setup

The Anaconda team has created an “R Essentials” bundle with the IRKernel and over 80 of the most used R packages for data science, including dplyr, shiny, ggplot2, tidyr, caret, and nnet.

Downloading “R Essentials” requires conda. Miniconda includes conda, Python, and a few other necessary packages, while Anaconda includes all this and over 200 of the most popular Python packages for science, math, engineering, and data analysis. Users may install all of Anaconda at once, or they may install Miniconda first and then use conda to install any other packages they need, including any of the packages in Anaconda.

Once you have conda, you may install “R Essentials” into the current environment:



> conda install -c r r-essentials

## Jupyter

Jupyter provides a great notebook interface to write your analysis and share it with your peers. Open a shell and run this command to start the Jupyter notebook interface in your browser:

> jupyter notebook

Start a new R notebook:

![Creating a new R notebook in Jupyter](https://www.continuum.io/sites/default/files/conda-jupyter-irkernel-create-r-notebook.png)
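
Once the notebook opens, a first cell like the one below can confirm that the kernel is running R and show which of the packages named earlier are installed. This is a minimal sketch: the package list is copied from the “R Essentials” description, and `essentials` and `available` are illustrative names.

```r
# Packages named in the "R Essentials" description
essentials <- c("dplyr", "shiny", "ggplot2", "tidyr", "caret", "nnet")

# requireNamespace() returns TRUE or FALSE without attaching the package
available <- vapply(essentials, requireNamespace, logical(1), quietly = TRUE)

print(R.version.string)  # confirms the cell is executed by R
print(available)         # named logical vector, one entry per package
```

Any `FALSE` entry suggests that the package is missing from the active conda environment.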



# R Quiz

- Visit “www.ensembl.org”; download the human GTF file of the latest release 
    - a. Download program: wget or curl
    - b. Target file: Homo_sapiens.GRCh38.83.chr_patch_hapl_scaff.gtf.gz
    - c. Uncompress: gunzip

- Read the GTF file into an R session
    - read.table arguments
        - file
        - sep
        - comment.char
        - quote
- Tabulate the number of genes per chromosome
    - table for a two-way count (chromosome by feature)
- Assess the average number of transcripts per gene
- Restrict these counts to ‘protein-coding’ genes
    - grep for record searching
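
The download step can also be done from inside R instead of with wget or curl. The sketch below is hedged: the FTP directory is an assumption based on the usual Ensembl release layout and should be checked against the download page, and `fetch_gtf` is a hypothetical helper, not part of any package.

```r
# Assumed location of the release-83 human GTF files (verify before use)
base_url <- "ftp://ftp.ensembl.org/pub/release-83/gtf/homo_sapiens"
gtf_gz   <- "Homo_sapiens.GRCh38.83.chr_patch_hapl_scaff.gtf.gz"
gtf_url  <- paste(base_url, gtf_gz, sep = "/")

# Hypothetical helper: download the file once and reuse it afterwards
fetch_gtf <- function(url, dest = basename(url)) {
  if (!file.exists(dest)) {
    download.file(url, destfile = dest, mode = "wb")
  }
  dest
}

# Not run here because the file is large:
# gtf_path <- fetch_gtf(gtf_url)
```

R can usually read a gzip-compressed text file directly through `read.table`, so the `gunzip` step is optional when working this way.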
    
    
### GFF/GTF File Format - Definition and supported options
The GFF (General Feature Format) format consists of one line per feature, each containing 9 columns of data, plus optional track definition lines. The following documentation is based on the Version 2 specifications.

The GTF (General Transfer Format) is identical to GFF version 2.

More information: http://asia.ensembl.org/info/website/upload/gff.html?redirect=no#moreinfo


#### Fields

**Fields must be tab-separated. Also, all but the final field in each feature line must contain a value; "empty" columns should be denoted with a '.'**

- **seqname** - name of the chromosome or scaffold; chromosome names can be given with or without the 'chr' prefix. Important note: the seqname must be one used within Ensembl, i.e. a standard chromosome name or an Ensembl identifier such as a scaffold ID, without any additional content such as species or assembly. See the example GFF output below.
- **source** - name of the program that generated this feature, or the data source (database or project name)
- **feature** - feature type name, e.g. Gene, Variation, Similarity
- **start** - Start position of the feature, with sequence numbering starting at 1.
- **end** - End position of the feature, with sequence numbering starting at 1.
- **score** - A floating point value.
- **strand** - defined as + (forward) or - (reverse).
- **frame** - One of '0', '1' or '2'. '0' indicates that the first base of the feature is the first base of a codon, '1' that the second base is the first base of a codon, and so on.
- **attribute** - A semicolon-separated list of tag-value pairs, providing additional information about each feature.
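
To make the nine fields concrete, the sketch below splits one feature line on tabs and then breaks the attribute column into tag-value pairs. The line is made up for illustration (the identifiers are not real Ensembl IDs), and `parse_attributes` is a hypothetical helper that assumes no value contains a semicolon.

```r
# One illustrative feature line; every value here is invented
gtf_line <- paste("1", "havana", "gene", "1000", "5000", ".", "+", ".",
                  'gene_id "GENE0001"; gene_name "toyA"; gene_biotype "protein_coding";',
                  sep = "\t")

# Fields are tab-separated, so a fixed split on "\t" yields the nine columns
fields <- strsplit(gtf_line, "\t", fixed = TRUE)[[1]]
names(fields) <- c("seqname", "source", "feature", "start", "end",
                   "score", "strand", "frame", "attribute")

# Hypothetical helper: turn 'tag "value"; tag "value";' into a named vector
parse_attributes <- function(attribute) {
  pairs <- trimws(strsplit(attribute, ";", fixed = TRUE)[[1]])
  pairs <- pairs[nzchar(pairs)]
  tags   <- sub(" .*$", "", pairs)          # text before the first space
  values <- sub("^[^ ]+ +", "", pairs)      # text after the tag
  values <- gsub('"', "", values, fixed = TRUE)
  setNames(values, tags)
}

attrs <- parse_attributes(fields[["attribute"]])
print(fields[["feature"]])
print(attrs)
```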


### Read the GTF file into an R session
- read.table arguments
    - file
    - sep
    - comment.char
    - quote

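In [2]:
# Read the tab-separated GTF file.
# comment.char = "#": skip the header lines, which start with "#".
# quote = "\"": only double quotes act as quoting characters, so apostrophes in the attribute text are left alone.
test_gtf <- read.table("test.gtf", sep = "\t", comment.char = "#", quote = "\"")
# dim(test_gtf)

# Name the nine GTF columns; the last column, called "quote" here, holds the attribute field.
gtf_col <- c("chr", "source", "feature", "start", "end", "score", "strand", "frame", "quote")
colnames(test_gtf) <- gtf_col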
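
The effect of each `read.table` argument is easier to see on a tiny file. The sketch below writes a made-up two-feature GTF to a temporary file and reads it back; it uses `quote = ""`, which switches quote handling off so the attribute text arrives verbatim. That choice and the names `toy_lines` and `toy_gtf` are assumptions of this example, not the settings of the cell above.

```r
# A made-up GTF: one header line plus two feature lines
toy_lines <- c(
  "#!genome-build GRCh38",
  paste("1", "havana", "gene", "100", "900", ".", "+", ".",
        'gene_id "G1"; gene_biotype "pseudogene";', sep = "\t"),
  paste("1", "havana", "transcript", "100", "900", ".", "+", ".",
        'gene_id "G1"; transcript_id "T1"; gene_biotype "pseudogene";', sep = "\t")
)
tmp <- tempfile(fileext = ".gtf")
writeLines(toy_lines, tmp)

toy_gtf <- read.table(
  tmp,
  sep = "\t",               # fields are tab-separated
  comment.char = "#",       # drop the header line
  quote = "",               # no quote handling: keep the attribute text as-is
  stringsAsFactors = FALSE,
  col.names = c("chr", "source", "feature", "start", "end",
                "score", "strand", "frame", "attribute")
)
str(toy_gtf)
```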

In [3]:
# test :  way 1

# t <- subset(test_gtf$chr,test_gtf$feature == "exon")
# table(t)

### Tabulate the number of genes per chromosome
- table for a two-way count (chromosome by feature)

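In [4]:
library("data.table")

test <- as.data.table(test_gtf)
# Count the "gene" rows within each chromosome; .N is the number of rows in each group
test[feature == "gene", .(COUNT = .N), by = chr]

Attaching package: ‘data.table’

The following object is masked _by_ ‘.GlobalEnv’:

    .N

| chr | COUNT |
|-----|-------|
| 1   | 1     |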
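
The quiz hint suggests base R's `table` for a two-way count. A minimal sketch on made-up data follows; the `toy` data frame stands in for the first and third GTF columns, and its values are invented.

```r
# Made-up chromosome and feature columns
toy <- data.frame(
  chr     = c("1", "1", "1", "2", "2", "X"),
  feature = c("gene", "transcript", "transcript", "gene", "transcript", "gene"),
  stringsAsFactors = FALSE
)

# Two-way count: rows are chromosomes, columns are feature types
counts <- table(chr = toy$chr, feature = toy$feature)
print(counts)

# The "gene" column gives the number of genes per chromosome
genes_per_chr <- counts[, "gene"]
print(genes_per_chr)
```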


### Assess the average number of transcripts per gene

In [5]:
# Average transcripts per gene: number of "transcript" rows divided by number of "gene" rows
test.avg.trans_per_gene <- nrow(test[feature == "transcript"]) / nrow(test[feature == "gene"])
test.avg.trans_per_gene
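
The ratio above gives one overall number. As a cross-check, the sketch below counts transcripts gene by gene, pulling `gene_id` out of the attribute text. The data are made up, and the approach assumes every gene has at least one transcript row and that attributes keep their double quotes.

```r
# Made-up feature and attribute columns
toy <- data.frame(
  feature   = c("gene", "transcript", "transcript", "gene", "transcript"),
  attribute = c('gene_id "G1";',
                'gene_id "G1"; transcript_id "T1";',
                'gene_id "G1"; transcript_id "T2";',
                'gene_id "G2";',
                'gene_id "G2"; transcript_id "T3";'),
  stringsAsFactors = FALSE
)

# Keep transcript rows and extract the gene each one belongs to
tx <- toy[toy$feature == "transcript", ]
gene_id <- sub('.*gene_id "([^"]+)".*', "\\1", tx$attribute)

# Transcripts per gene, then their mean
tx_per_gene <- table(gene_id)
avg_tx_per_gene <- mean(as.vector(tx_per_gene))
print(tx_per_gene)
print(avg_tx_per_gene)
```

On this toy data the mean equals the simple ratio of transcript rows to gene rows, and the per-gene table additionally shows how the counts are distributed.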

### Restrict these counts to ‘protein-coding’ genes
- grep for record searching

In [7]:
# Keep the rows whose attribute text (column "quote") contains "protein_coding".
# fixed = TRUE matches the literal string; the match can occur anywhere in the attribute text.
test.protein_coding <- test[grepl("protein_coding", quote, fixed = TRUE), ]
test.protein_coding.avg.trans_per_gene <- nrow(test.protein_coding[feature == "transcript"]) / nrow(test.protein_coding[feature == "gene"])
test.protein_coding.avg.trans_per_gene
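
A search for the bare string `protein_coding` can also match a transcript-level biotype on a gene that is not itself protein-coding. The sketch below, on made-up data, compares that loose match with one anchored to the `gene_biotype` tag. It assumes the attribute text keeps its double quotes (for example, when read with `quote = ""`) and that the file uses the `gene_biotype` tag name.

```r
# Made-up rows: G1 is protein-coding; G2 is not, but has a protein_coding transcript
toy <- data.frame(
  feature   = c("gene", "transcript", "transcript", "gene", "transcript"),
  attribute = c(
    'gene_id "G1"; gene_biotype "protein_coding";',
    'gene_id "G1"; transcript_id "T1"; gene_biotype "protein_coding"; transcript_biotype "protein_coding";',
    'gene_id "G1"; transcript_id "T2"; gene_biotype "protein_coding"; transcript_biotype "retained_intron";',
    'gene_id "G2"; gene_biotype "polymorphic_pseudogene";',
    'gene_id "G2"; transcript_id "T3"; gene_biotype "polymorphic_pseudogene"; transcript_biotype "protein_coding";'
  ),
  stringsAsFactors = FALSE
)

loose  <- grepl("protein_coding", toy$attribute, fixed = TRUE)
strict <- grepl('gene_biotype "protein_coding"', toy$attribute, fixed = TRUE)

# The loose match also picks up the G2 transcript row
print(sum(loose))
print(sum(strict))

# Transcripts per gene among protein-coding genes only
pc <- toy[strict, ]
pc_avg <- sum(pc$feature == "transcript") / sum(pc$feature == "gene")
print(pc_avg)
```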