# Generate a Drosophila reference genome with optional addition of transgenes

#### Download and save the transcriptome gene transfer format (GTF) file from the Ensembl database. This will be downloaded as a zipped file (with extension .gz)

In [None]:
~ users$ curl -O http://ftp.ensemblgenomes.org/pub/metazoa/release-61/gtf/drosophila_melanogaster/Drosophila_melanogaster.BDGP6.54.61.gtf.gz

#### Download and save the genomic FASTA file from the Ensembl database. This will also be downloaded as a zipped file (with extension .gz).

In [None]:
~ users$ curl -O http://ftp.ensemblgenomes.org/pub/metazoa/release-61/fasta/drosophila_melanogaster/dna/Drosophila_melanogaster.BDGP6.54.dna.toplevel.fa.gz

#### Unzip the reference GTF and FASTA files downloaded in steps 1 and 2.


In [None]:
~ users$ gunzip Drosophila_melanogaster.BDGP6.54.dna.toplevel.fa.gz
~ users$ gunzip Drosophila_melanogaster.BDGP6.54.61.gtf.gz

## Generate custom Drosophila FASTA and GTF files

#### If your sample contains transgenes or reporters whose expression needs to be tracked, their sequences must be added as additional “contigs” or “chromosomes” to the genomic FASTA. In addition, for quantitation, a GTF entry must be added for each transgene to define that region as belonging to the transgene. We have used GFP and RFP reporters. Note that most available GFP and RFP transgenes have been inserted upstream of an SV40 polyA cassette, so the 3’ UTR sequence is common to most GFP and RFP transgenes.

#### Generate a FASTA file for RFP-GFP (saved here as RFP_GFP.fa)

>RFP_GFP
TAATCCGCGGTAGATCATAATCAGCCATACCACATTTGTAGAGGTTTTACTTGCTTTAAAAAACCTCCCACACCTCCCCCTGAACCTGAAACATAAAATGAATGCAATTGTTGTTGTTAACTTGTTTATTGCAGCTTATAATGGTTACAAATAAAGCAATAGCATCACAAATTTCACAAATAAAGCATTTTTTTCACTGCATTCTAGTTGTGGTTTGTCCAAACTCATCAATGTATCTTAAGGCGTAAATTGTAAGCGTTAATACTAGTTGGCCACGTAATAAGTGTGCGTTGAATTTATTCGCAAAAACATTGCATATTTTCGGCAAAGTAAAATTTTGTTGCATACCTTATCAAAAAATAAGTGCTGCATACTTTTTAGAGAAACCAAATAATTTTTTATTGCATACCCGTTTTTAATAAAATACATTGCATACCCTCTTTTAATAAAAAATATTGCATACTTTGACGAAACAAATTTTCGTTGCATACCCAATAAAAGATTATTATATTGCATACCCGTTTTT

~ users$ cat RFP_GFP.fa | grep -v "^>" | tr -d "\n" | wc -c
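
The character count printed by the command above is the length of the transgene sequence, and it is the value that belongs in the end column of the transgene GTF entry. Below is a minimal sketch of the same count wrapped in a helper. The helper name `fasta_length` and the toy record are hypothetical and are not the real reporter sequence.

```shell
# Hypothetical helper: print the number of bases in a single-record FASTA file.
# Header lines (starting with ">") and line breaks are excluded from the count.
fasta_length() {
    grep -v "^>" "$1" | tr -d "\n\r" | wc -c | tr -d " "
}

# Toy demonstration with a made-up 12-base record (assumption, not real data).
demo_dir=$(mktemp -d)
printf ">toy_transgene\nACGTAC\nGTACGT\n" > "$demo_dir/toy.fa"
toy_len=$(fasta_length "$demo_dir/toy.fa")
echo "toy_transgene length: $toy_len"
```

With the real file, `fasta_length RFP_GFP.fa` should print the same number that is used as the end coordinate in the GTF entry.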

#### Generate a GTF file for RFP-GFP

In [None]:
~ users$ echo -e 'RFP_GFP\tunknown\texon\t1\t526\t.\t+\t.\tgene_id "RFP_GFP"; transcript_id "RFP_GFP"; gene_name "RFP_GFP"; gene_biotype "protein_coding";' > RFP_GFP.gtf
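
A GTF line must have exactly nine tab-separated fields, and `echo -e` does not expand `\t` in every shell, so it can be worth checking the file after writing it. The sketch below is one possible check and not part of the original workflow. The helper name `check_gtf` and the toy entry are hypothetical, and `printf` is used here only because it handles `\t` consistently across shells.

```shell
# Hypothetical helper: succeed only if every non-comment line has nine
# tab-separated fields, a start of at least 1, and an end not below the start.
check_gtf() {
    awk -F '\t' '
        /^#/ { next }
        NF != 9 || $4 + 0 < 1 || $5 + 0 < $4 + 0 {
            bad = 1
            print "bad line " NR ": " $0
        }
        END { exit bad + 0 }
    ' "$1"
}

# Toy demonstration with a made-up entry (assumption, not the real transgene).
gtf_dir=$(mktemp -d)
printf 'toy_transgene\tunknown\texon\t1\t12\t.\t+\t.\tgene_id "toy_transgene"; transcript_id "toy_transgene";\n' > "$gtf_dir/toy.gtf"
if check_gtf "$gtf_dir/toy.gtf"; then
    gtf_status=ok
else
    gtf_status=bad
fi
echo "toy GTF check: $gtf_status"
```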


### Combine the transgene FASTA with the reference FASTA

#### First, make a copy and rename it, so that the original is unchanged.

In [None]:
~ users$ cp Drosophila_melanogaster.BDGP6.54.dna.toplevel.fa dmel_GFP.fasta

#### Append RFP_GFP.fa to the end of the dmel_GFP.fasta file (Note: use ">>" to append. Do not use ">", which overwrites the file)

In [None]:
~ users$ cat RFP_GFP.fa >> dmel_GFP.fasta

#### To confirm that the transgene entry was added to the FASTA file, use the grep ">" command to list the header lines, which start with the > character:

In [None]:
~ users$ grep ">" dmel_GFP.fasta
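
Listing the headers shows whether the transgene is present, but it is easy to miss a duplicate if the append command was run more than once. Below is a small sketch that counts how many records carry a given name. The helper name `count_header` and the toy files are hypothetical.

```shell
# Hypothetical helper: count FASTA records whose name (first word of the
# header line) matches NAME exactly.
count_header() {  # usage: count_header NAME FASTA
    awk -v name="$1" '$1 == ">" name { n++ } END { print n + 0 }' "$2"
}

# Toy demonstration with made-up sequences (assumption, not real data).
fa_dir=$(mktemp -d)
printf ">2L\nACGT\n>2R\nTTGA\n" > "$fa_dir/toy_genome.fa"
printf ">toy_transgene\nACGTACGTACGT\n" > "$fa_dir/toy_transgene.fa"
cat "$fa_dir/toy_transgene.fa" >> "$fa_dir/toy_genome.fa"
n_copies=$(count_header toy_transgene "$fa_dir/toy_genome.fa")
echo "copies of toy_transgene: $n_copies"
```

For the files in this notebook the equivalent call would be `count_header RFP_GFP dmel_GFP.fasta`, and the expected count is 1.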

### Combine the transgene GTF with the reference GTF

#### First, make a copy and rename it, so that the original is unchanged.

In [None]:
~ users$ cp Drosophila_melanogaster.BDGP6.54.61.gtf dmel_GFP.gtf

~ users$ cat RFP_GFP.gtf >> dmel_GFP.gtf

#### Check the gtf file with the following command:

In [None]:
~ users$ tail dmel_GFP.gtf
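
Beyond inspecting the last lines, it can help to confirm that every sequence name used in the first column of the GTF also exists as a record in the combined FASTA, since the two files are used together to build the index. The following is a sketch of such a cross-check. The helper name `missing_contigs` and the toy files are hypothetical, and the toy GTF deliberately includes one name (`3R`) that is absent from the toy FASTA.

```shell
# Hypothetical helper: print each sequence name found in the GTF (column 1)
# that has no matching record name in the FASTA.
missing_contigs() {  # usage: missing_contigs GTF FASTA
    awk -F '\t' '
        NR == FNR {                      # first file: the FASTA
            if ($0 ~ /^>/) {
                name = substr($0, 2)     # drop the leading ">"
                sub(/[ \t].*/, "", name) # keep only the first word
                seen[name] = 1
            }
            next
        }
        /^#/ { next }                    # second file: skip GTF comments
        !($1 in seen) && !reported[$1]++ { print $1 }
    ' "$2" "$1"
}

# Toy demonstration with made-up files (assumption, not real data).
chk_dir=$(mktemp -d)
printf ">2L some description\nACGT\n>toy_transgene\nACGTACGTACGT\n" > "$chk_dir/toy.fa"
printf '2L\tsrc\texon\t1\t4\t.\t+\t.\tgene_id "g1";\n' > "$chk_dir/toy.gtf"
printf 'toy_transgene\tsrc\texon\t1\t12\t.\t+\t.\tgene_id "toy_transgene";\n' >> "$chk_dir/toy.gtf"
printf '3R\tsrc\texon\t1\t4\t.\t+\t.\tgene_id "g2";\n' >> "$chk_dir/toy.gtf"
missing=$(missing_contigs "$chk_dir/toy.gtf" "$chk_dir/toy.fa")
echo "contigs in GTF but not in FASTA: ${missing:-none}"
```

On the real files, `missing_contigs dmel_GFP.gtf dmel_GFP.fasta` printing nothing would indicate that the names agree.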

#### Use cellranger mkref to generate a genome index from the combined FASTA and the combined GTF.

In [None]:
~ users$ module load cellranger

#### Now use the dmel_GFP.gtf and dmel_GFP.fasta files as inputs to the cellranger mkref pipeline

~ users$ cellranger mkref --genome=dm6_with_transgenes --fasta=dmel_GFP.fasta --genes=dmel_GFP.gtf --nthreads 16 --memgb 16

## Demultiplex 10X run data from Illumina sequencers

In [None]:
##### Create a CSV sample sheet listing the lane, sample name, and index for each sample
~ users$ cat samplesheet_example.csv
Lane,Sample,Index
*,RP2_GFP,SI-GA-B4
*,isoW,SI-GA-A1

## Use the cellranger mkfastq function to demultiplex the samples.

In [None]:
~ users$ cellranger mkfastq --localcores=16 --localmem=16 --csv=samplesheet_example.csv --run=190605_D00762_0450_AH22GWBCX3 --id=demultiplexed_fastqs --qc

###### In this example, “190605_D00762_0450_AH22GWBCX3” is the name of the run folder automatically generated by the Illumina sequencer for the run that held our two samples. It contains all the raw BCL files from a single run, along with other files, some of which Cell Ranger uses to process the data. If samples have been sequenced repeatedly over multiple flow cells, the cellranger mkfastq pipeline should be repeated for each run, providing a separate “id” destination folder for each.
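
When there are several run folders, the repeated calls can be generated in a loop. The sketch below only prints the commands as a dry run, so nothing is executed. The second run folder name and the `fastqs_` prefix for the “id” folders are hypothetical placeholders.

```shell
# Run folders to demultiplex. The second name is a hypothetical placeholder
# and must be replaced with the real run folder name.
runs="190605_D00762_0450_AH22GWBCX3 hypothetical_second_run_folder"

# Build one mkfastq command per run, each with its own --id output folder.
mkfastq_cmds=$(
    for run in $runs; do
        echo "cellranger mkfastq --csv=samplesheet_example.csv --run=$run --id=fastqs_$run"
    done
)
echo "$mkfastq_cmds"
```

After reviewing the printed commands, each line can be run as is, or the `echo` can be removed so the loop calls cellranger directly.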

### Align data to the reference genome and generate read count matrices

###### The cellranger count function first aligns the single 3’ polyA-adjacent read to the indexed reference genome using the STAR aligner, producing an indexed BAM file. The BAM file contains the chromosome name and the coordinates for each read. Additional read metadata derived from the other FASTQ files (including the cell barcode, the UMI, and any overlapping exon gene definition) is attached to the BAM entry for the read at this step.

##### run cellranger count for sample “RP2_GFP” with genome index “dm6_with_transgenes” into folder “RP2_GFP”

In [None]:
~ users$ cellranger count --id=RP2_GFP --sample=RP2_GFP --transcriptome=dm6_with_transgenes --fastqs=demultiplexed_fastqs --localcores=16 --localmem=16
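
Once the count run has finished, one way to confirm that the transgene was included in quantitation is to look for its gene ID in the feature table of the output matrix. The sketch below assumes the output layout of recent Cell Ranger versions, where the table is a gzipped, tab-separated file at `RP2_GFP/outs/filtered_feature_bc_matrix/features.tsv.gz` with the gene ID in the first column. That path, the helper name `feature_present`, and the toy table are assumptions and should be checked against your own output folder.

```shell
# Hypothetical helper: succeed if GENE_ID appears in the first column of a
# gzipped, tab-separated feature table.
feature_present() {  # usage: feature_present GENE_ID FEATURES_TSV_GZ
    gzip -dc "$2" | awk -F '\t' -v id="$1" '$1 == id { found = 1 } END { exit !found }'
}

# Toy demonstration with a made-up feature table (assumption, not real output).
feat_dir=$(mktemp -d)
printf 'FBgn0000001\ttoy_gene\tGene Expression\ntoy_transgene\ttoy_transgene\tGene Expression\n' \
    | gzip > "$feat_dir/features.tsv.gz"
if feature_present toy_transgene "$feat_dir/features.tsv.gz"; then
    transgene_found=yes
else
    transgene_found=no
fi
echo "toy_transgene in feature table: $transgene_found"
```

For the run above, the corresponding call would be `feature_present RFP_GFP RP2_GFP/outs/filtered_feature_bc_matrix/features.tsv.gz`, under the stated assumption about the output path.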