# Tutorial

## Installation

    conda install numpy scipy cython numba matplotlib scikit-learn h5py click

### cython
The Cython language makes writing C extensions for the Python language as easy as Python itself. Cython is a source code translator based on Pyrex, but supports more cutting edge functionality and optimizations.

The Cython language is a superset of the Python language (almost all Python code is also valid Cython code), but Cython additionally supports optional static typing to natively call C functions, operate with C++ classes and declare fast C types on variables and class attributes. This allows the compiler to generate very efficient C code from Cython code.

### numba

Numba translates Python functions to optimized machine code at runtime using the industry-standard LLVM compiler library. Numba-compiled numerical algorithms in Python can approach the speeds of C or FORTRAN.
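
A minimal sketch of what this looks like in practice, assuming Numba is installed. `njit` is Numba's decorator for compiling a function without falling back to the Python interpreter; the function `sum_of_squares` and the fallback that lets the snippet run (uncompiled) without Numba are illustrative only.

```python
try:
    from numba import njit
except ImportError:
    # Fallback so the sketch still runs, without compilation, when Numba is absent.
    def njit(func):
        return func


@njit
def sum_of_squares(n):
    """Sum i*i for i in range(n); compiled on first call when Numba is available."""
    total = 0
    for i in range(n):
        total += i * i
    return total


if __name__ == "__main__":
    print(sum_of_squares(10))
```

The first call pays the compilation cost; later calls with the same argument types reuse the compiled machine code.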

### h5py
The h5py package provides both a high- and low-level interface to the HDF5 library from Python. The low-level interface is intended to be a complete wrapping of the HDF5 API, while the high-level component supports access to HDF5 files, datasets and groups using established Python and NumPy concepts.

A strong emphasis on automatic conversion between Python (Numpy) datatypes and data structures and their HDF5 equivalents vastly simplifies the process of reading and writing data from Python.

### click
Click is a Python package for creating beautiful command line interfaces in a composable way with as little code as necessary. It’s the “Command Line Interface Creation Kit”. It’s highly configurable but comes with sensible defaults out of the box.

It aims to make the process of writing command line tools quick and fun while also preventing any frustration caused by the inability to implement an intended CLI API.

Click in three points:

- arbitrary nesting of commands
- automatic help page generation
- supports lazy loading of subcommands at runtime

### pysam (prefer to use pip)
pysam - a python module for reading, manipulating and writing genomic data sets.

pysam is a lightweight wrapper of the htslib C-API and provides facilities to read and write SAM/BAM/VCF/BCF/BED/GFF/GTF/FASTA/FASTQ files as well as access to the command line functionality of the samtools and bcftools packages. The module supports compression and random access through indexing.

This module provides a low-level wrapper around the htslib C-API, written using Cython, and a high-level API for convenient access to the data within standard genomic file formats.

velocyto itself is not available through conda; install it with pip:

    pip install velocyto

Installing R and rpy2 through conda is recommended:

    conda install R rpy2

## Tutorial

After you have velocyto correctly installed on your machine (see the Installation section), the velocyto command becomes available in the terminal. velocyto is a command line tool with subcommands. You can get quick info on all the available commands by typing velocyto --help.

The general purpose command to run the read counting pipeline is velocyto run.

### run10x - Run on 10X Chromium samples

https://bauercore.fas.harvard.edu/10x-chromium-system

10x Genomics' Chromium technology partitions reactions into nanoliter-scale droplets containing uniquely barcoded beads called GEMs (Gel Bead-In EMulsions). This core technology can be used to partition single cells, nuclei, or high molecular weight gDNA to prepare next generation sequencing libraries in parallel. We provide training and run samples as a service on the Chromium system. Please contact Claire Reardon to discuss your experimental plans. 

#### Single Cell 3' and 5' Workflow
10x single-cell 3' and 5' assays partition individual cells into GEMs that uniquely barcode hundreds to thousands of cells with a capture efficiency of 65%. 3’ or 5' end counting determines gene expression and characterizes cells of a heterogeneous population. In addition to expression profiling, the 5' assay enables immune profiling by enriching barcoded cDNA for V(D)J sequences of T or B cells. Both the 3' and 5' assays can be combined with "Feature Barcoding" technology which determines expression of cell-surface proteins through oligo-labelled antibodies.

#### Single Cell ATAC Workflow
The single cell ATAC assay allows for determination of the accessibility of chromatin on a single cell level. A sample of hundreds to thousands of nuclei undergoes a transposition reaction and the transposed nuclei are then partitioned into GEMs which uniquely barcode accessible DNA fragments. 

#### Genome Assay Workflow
The genome assay partitions high molecular weight (HMW) DNA into individual GEMs to capture long-range information using Illumina next-generation sequencing. The assay requires low input of high molecular weight DNA and produces "linked read" information which can be used to assemble genomes, assess structural variants, and phase across haplotype blocks >10 Mb.

### run_smartseq2 - Run on SmartSeq2 samples

Smart-Seq2
https://www.illumina.com/science/sequencing-method-explorer/kits-and-arrays/smart-seq2.html

Method Category: Transcriptome > RNA Low-Level Detection

Description: For Smart-Seq2, single cells are lysed in a buffer that contains free dNTPs and oligo(dT)-tailed oligonucleotides with a universal 5'-anchor sequence. RT is performed, which adds 2 to 5 untemplated nucleotides to the cDNA 3′ end. A template-switching oligo (TSO) is added, carrying 2 riboguanosines and a modified guanosine to produce a LNA as the last base at the 3′ end. After the first-strand reaction, the cDNA is amplified using a limited number of cycles. Next, tagmentation is used to construct sequencing libraries quickly and efficiently from the amplified cDNA.

Pros:

- As little as 50 pg of starting material can be used
- mRNA sequence does not have to be known
- Improved coverage across transcripts
- High level of mappable reads

Cons:

- Not strand-specific
- No early multiplexing
- Transcript length bias, with inefficient transcription of reads over 4 Kb
- Preferential amplification of high-abundance transcripts
- Purification step may lead to loss of material
- Could be subject to strand-invasion bias

### run_dropest - Run on DropSeq, InDrops and other techniques

Drop-Seq
https://www.illumina.com/science/sequencing-method-explorer/kits-and-arrays/drop-seq.html

Method Category: Transcriptome > RNA Low-Level Detection

Description: Drop-Seq analyzes mRNA transcripts from droplets of individual cells in a highly parallel fashion. This single-cell sequencing method uses a microfluidic device to compartmentalize droplets containing a single cell, lysis buffer, and a microbead covered with barcoded primers. Each primer contains: 1) a 30 bp oligo(dT) sequence to bind mRNAs; 2) an 8 bp molecular index to identify each mRNA strand uniquely; 3) a 12 bp barcode unique to each cell and 4) a universal sequence identical across all beads. Following compartmentalization, cells in the droplets are lysed and the released mRNA hybridizes to the oligo(dT) tract of the primer beads. Next, all droplets are pooled and broken to release the beads within. After the beads are isolated, they are reverse-transcribed with template switching. This generates the first cDNA strand with a PCR primer sequence in place of the universal sequence. cDNAs are PCR-amplified, and sequencing adapters are added using the Nextera XT Library Preparation Kit. The barcoded mRNA samples are ready for sequencing.

Similar methods: CEL-Seq, Quartz-Seq, MARS-Seq, CytoSeq, inDrop, Hi-SCL.

Pros:

- Analyzes sequences of single cells in a highly parallel manner
- Unique molecular and cell barcodes enable cell- and gene-specific identification of mRNA strands
- Reverse transcription with template-switching PCR produces high-yield reads from single cells
- Low cost ($0.07 per cell, $653 per 10,000 cells) and fast library prep (10,000 cells per day)

Cons:

- Requires a custom microfluidics device to perform droplet separation
- Low gene-per-cell sensitivity compared to other scRNA-Seq methods
- Limited to mRNA transcripts


### first runtime and parallelization

As one of its first steps, velocyto run tries to create a copy of the input .bam files sorted by cell barcode. The sorted .bam file is placed in the same directory as the original file and is named cellsorted_[ORIGINALBAMNAME]. The sorting uses samtools sort and is expected to be time-consuming; for this reason it is performed in parallel by default. The parallelization can be controlled with the parameters --samtools-threads and --samtools-memory.
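
The naming rule and the two tuning flags described above can be sketched in Python. This is only an illustration: the helper names `cellsorted_path` and `build_run_command` are hypothetical, the file names are made up, and the positional order of the BAM and GTF arguments as well as the example values `8` and `2048` are assumptions. Check `velocyto run --help` for the actual interface and for the units of `--samtools-memory`.

```python
from pathlib import Path


def cellsorted_path(bam_path):
    """Return where the barcode-sorted copy of bam_path is expected to appear."""
    bam = Path(bam_path)
    # Same directory as the original file, with a "cellsorted_" prefix on the name.
    return bam.with_name("cellsorted_" + bam.name)


def build_run_command(bam_path, gtf_path, threads, memory):
    """Assemble an argument list for `velocyto run` (argument order is an assumption)."""
    return [
        "velocyto", "run",
        "--samtools-threads", str(threads),
        "--samtools-memory", str(memory),
        str(bam_path),
        str(gtf_path),
    ]


if __name__ == "__main__":
    bam = "/data/sample1/possorted.bam"  # hypothetical input path
    print(cellsorted_path(bam))
    print(" ".join(build_run_command(bam, "genes.gtf", threads=8, memory=2048)))
```

Because the sorted copy is written next to the input, checking whether `cellsorted_path(bam)` already exists is a cheap way to tell whether a previous run got past the sorting step.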

### Requirements on the input files
velocyto assumes that the bam file passed to the CLI carries certain information and that some upstream analysis has already been performed on it. In particular, the bam file must:

- Be sorted by mapping position.
- Represent either a single sample (multiple cells prepared using a certain barcode set in a single experiment) or a single cell.
- Contain error-corrected cell barcodes as a TAG named CB or XC.
- Contain error-corrected molecular barcodes as a TAG named UB or XM.
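
The barcode requirements can be checked on individual records before starting a long run. The sketch below works on SAM text lines (for example the output of `samtools view`), where the first 11 tab-separated columns are mandatory and optional fields follow as `TAG:TYPE:VALUE`. The function names and the example record are hypothetical, and this is not how velocyto itself validates its input.

```python
def parse_tags(sam_line):
    """Return {tag: value} for the optional fields of one SAM text record."""
    fields = sam_line.rstrip("\n").split("\t")
    tags = {}
    # Columns 1-11 are mandatory; optional fields follow as TAG:TYPE:VALUE.
    for field in fields[11:]:
        tag, _type, value = field.split(":", 2)
        tags[tag] = value
    return tags


def has_required_barcodes(sam_line):
    """True if the record has a cell barcode (CB or XC) and a molecular barcode (UB or XM)."""
    tags = parse_tags(sam_line)
    has_cell = "CB" in tags or "XC" in tags
    has_umi = "UB" in tags or "XM" in tags
    return has_cell and has_umi


if __name__ == "__main__":
    # Made-up record for illustration only.
    record = "read1\t0\tchr1\t100\t255\t4M\t*\t0\t0\tACGT\tFFFF\tCB:Z:AAACCTGA-1\tUB:Z:GGTTAACC"
    print(has_required_barcodes(record))
```

Splitting each optional field with `split(":", 2)` keeps values that themselves contain colons intact.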


### About the output .loom file
The main result file is a 4-layered loom file: sample_id.loom.

A valid .loom file is simply an HDF5 file that contains specific groups representing the main matrix as well as row and column attributes. Because of this, .loom files can be created and read by any language that supports HDF5.

.loom files can be easily handled using the loompy package.

## Analysis

### Velocyto Loom

    import velocyto as vcy
    vlm = vcy.VelocytoLoom("YourData.loom")

The different steps of the analysis are carried out by calling the methods of this VelocytoLoom object. New variables, normalized versions of the data matrices and other parameters are stored as attributes of the “VelocytoLoom” object (method calls do not return any value). For example, normalization and log transformation can be performed by calling the normalize method:

    vlm.normalize("S", size=True, log=True)
    vlm.S_norm  # contains the log-normalized data

The docstring of every function specifies which attributes will be generated or modified at each method call.
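
To make the idea of size normalization followed by a log transformation more concrete, here is a small pure-Python sketch. It is not velocyto's implementation: the function name `size_normalize_and_log` is hypothetical, and both the rescaling target (the mean cell size) and the `log2(x + 1)` transform are assumptions chosen for illustration. It also assumes every cell has at least one count.

```python
import math


def size_normalize_and_log(counts):
    """counts[g][c] is the count of gene g in cell c; returns a log-normalized copy."""
    n_cells = len(counts[0])
    # Total number of molecules per cell (column sums).
    cell_sizes = [sum(row[c] for row in counts) for c in range(n_cells)]
    # Rescale every cell to the mean cell size (an assumption of this sketch).
    target = sum(cell_sizes) / n_cells
    normalized = []
    for row in counts:
        normalized.append([
            math.log2(row[c] / cell_sizes[c] * target + 1)
            for c in range(n_cells)
        ])
    return normalized


if __name__ == "__main__":
    # Two genes, two cells; the second cell has twice the depth of the first.
    counts = [[1, 2],
              [3, 6]]
    for row in size_normalize_and_log(counts):
        print(row)
```

In this toy input the two cells differ only in sequencing depth, so after normalization their columns become identical.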