<a href="https://colab.research.google.com/github/pachterlab/GRNP_2020/blob/master/notebooks/FASTQ_processing/ProcessPBMC_V3_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**BUTTERFLY - Processing of the FASTQ files for the PBMC_V3_2 dataset.**

1. Download and build kallisto and bustools from source.
2. Download the genome FASTA file and build a kallisto index.
3. Download the FASTQ files and process with kallisto.
4. Install the BUSpaRse R package and create a transcripts_to_genes file.
5. Process the output from kallisto with bustools (the butterfly branch).

**1. Download and build kallisto and bustools from source**

In [None]:
# Install the dependencies needed for the builds
!apt update
!apt install -y cmake
!apt-get install -y autoconf


In [None]:
# htslib must be downloaded and built before kallisto can be built
!cd /usr/bin && wget https://github.com/samtools/htslib/releases/download/1.9/htslib-1.9.tar.bz2 && tar -vxjf htslib-1.9.tar.bz2 && cd htslib-1.9 && make

In [None]:
# Clone the kallisto repo, build, and install
!rm -rf temporary # in case the code is run more than once
!mkdir temporary
!cd temporary && git clone https://github.com/pachterlab/kallisto.git
!cd temporary/kallisto && git checkout v0.46.2 && mkdir build && cd build && cmake .. && make
!chmod +x temporary/kallisto/build/src/kallisto
!mv temporary/kallisto/build/src/kallisto /usr/local/bin/

In [None]:
#clone the bustools repo, build and install
!cd temporary && rm -r *
!git clone https://github.com/johan-gson/bustools.git
!mv bustools/ temporary/
!cd temporary/bustools && git checkout butterfly && mkdir build && cd build && cmake .. && make
!chmod +x temporary/bustools/build/src/bustools
!mv temporary/bustools/build/src/bustools /usr/local/bin/

In [None]:
!kallisto version
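
Before moving on, it can help to confirm that both binaries built above are actually on the `PATH`. The following is a minimal sketch using only the Python standard library; the helper name `find_tools` is hypothetical and not part of kallisto or bustools.

```python
import shutil


def find_tools(names):
    """Map each executable name to its resolved path on PATH, or None if missing."""
    return {name: shutil.which(name) for name in names}


# Report where (or whether) each tool installed in step 1 was found.
for name, path in find_tools(["kallisto", "bustools"]).items():
    print(f"{name}: {path or 'NOT FOUND'}")
```

If either tool is reported as not found, the corresponding build cell above most likely failed and should be re-run.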

**2. Download the genome FASTA file and build a kallisto index**

In [None]:
# Download the human cDNA FASTA (Ensembl release 94) and build the kallisto index
!wget "ftp://ftp.ensembl.org/pub/release-94/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh38.cdna.all.fa.gz" -O human.fa.gz
!kallisto index -i Homo_sapiens.GRCh38.cdna.all.idx human.fa.gz

**3. Download the FASTQ files and process with kallisto**

In [None]:
#clean up a bit first
!rm -r sample_data
!rm -r temporary

In [None]:
#Download fastqs
!wget "http://s3-us-west-2.amazonaws.com/10x.files/samples/cell-exp/3.0.0/pbmc_10k_protein_v3/pbmc_10k_protein_v3_fastqs.tar"


In [None]:
# Stream from the tar directly into kallisto via named pipes
!rm -f A_R1.gz A_R2.gz B_R1.gz B_R2.gz # in case this cell is run several times
!mkfifo A_R1.gz A_R2.gz B_R1.gz B_R2.gz

!tar -O --to-stdout -xf pbmc_10k_protein_v3_fastqs.tar pbmc_10k_protein_v3_fastqs/pbmc_10k_protein_v3_gex_fastqs/pbmc_10k_protein_v3_gex_S1_L001_R1_001.fastq.gz > A_R1.gz & tar -O --to-stdout -xf pbmc_10k_protein_v3_fastqs.tar pbmc_10k_protein_v3_fastqs/pbmc_10k_protein_v3_gex_fastqs/pbmc_10k_protein_v3_gex_S1_L001_R2_001.fastq.gz > A_R2.gz & tar -O --to-stdout -xf pbmc_10k_protein_v3_fastqs.tar pbmc_10k_protein_v3_fastqs/pbmc_10k_protein_v3_gex_fastqs/pbmc_10k_protein_v3_gex_S1_L002_R1_001.fastq.gz > B_R1.gz &  tar -O --to-stdout -xf pbmc_10k_protein_v3_fastqs.tar pbmc_10k_protein_v3_fastqs/pbmc_10k_protein_v3_gex_fastqs/pbmc_10k_protein_v3_gex_S1_L002_R2_001.fastq.gz > B_R2.gz & kallisto bus -i Homo_sapiens.GRCh38.cdna.all.idx -o bus_output/ -x 10xv3 -t 2 A_R1.gz A_R2.gz B_R1.gz B_R2.gz
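
The cell above feeds each FASTQ from the tar archive through a named pipe, so the files are never unpacked to disk. The same idea can be used to peek at a read without extracting anything. This is a minimal sketch with the Python standard library; the function name `first_fastq_record` is hypothetical, and the example path in the comment is taken from the cell above.

```python
import gzip
import tarfile


def first_fastq_record(tar_path, member_name):
    """Return the first 4-line FASTQ record of a gzipped tar member,
    reading it as a stream instead of extracting it to disk."""
    with tarfile.open(tar_path, "r") as tar:
        raw = tar.extractfile(member_name)  # file-like object, or None for non-files
        if raw is None:
            raise ValueError(f"{member_name} is not a regular file in {tar_path}")
        with gzip.open(raw, "rt") as fastq:
            return [fastq.readline().rstrip("\n") for _ in range(4)]


# Example call (uncomment once the tar file has been downloaded):
# first_fastq_record(
#     "pbmc_10k_protein_v3_fastqs.tar",
#     "pbmc_10k_protein_v3_fastqs/pbmc_10k_protein_v3_gex_fastqs/"
#     "pbmc_10k_protein_v3_gex_S1_L001_R1_001.fastq.gz",
# )
```

For the 10xv3 chemistry, the R1 read is expected to hold the barcode and UMI, and R2 the cDNA sequence, which is why the files are passed to `kallisto bus` in R1/R2 pairs.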

**4. Install the BUSpaRse R package and create a transcripts_to_genes file**

In [None]:
# Load the rpy2 extension so that %%R cells can be used
%reload_ext rpy2.ipython


In [None]:
%%R
# Install BUSpaRse (the %%R cell magic must be the first line of the cell)
install.packages("BiocManager")
BiocManager::install(version = "3.10")
BiocManager::install("BUSpaRse")


In [None]:
%%R
# Create transcripts_to_genes.txt (the %%R cell magic must be the first line of the cell)

library("BUSpaRse")
tr2g <- transcript2gene(fasta_file = "human.fa.gz",
                        kallisto_out_path = "bus_output")
write.table(tr2g, "bus_output/transcripts_to_genes.txt", quote = FALSE,
            row.names = FALSE, col.names = FALSE, sep = "\t")
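
The transcript-to-gene mapping comes from the header lines of the Ensembl cDNA FASTA file. To illustrate what information those headers carry, here is a minimal sketch in Python; it is not BUSpaRse's implementation, and the function name `parse_cdna_header` is hypothetical. It assumes the usual Ensembl layout in which the transcript ID is the first token and later tokens are `key:value` pairs such as `gene:` and `gene_symbol:`.

```python
def parse_cdna_header(line):
    """Extract (transcript_id, gene_id, gene_symbol) from an Ensembl cDNA
    FASTA header line. Returns None if the line is not a usable header."""
    if not line.startswith(">"):
        return None
    tokens = line[1:].split()
    if not tokens:
        return None
    fields = {}
    for token in tokens[1:]:
        key, sep, value = token.partition(":")
        # Keep only the first occurrence of each key; free-text description
        # tokens at the end of the header may also contain colons.
        if sep and key not in fields:
            fields[key] = value
    if "gene" not in fields:
        return None
    return tokens[0], fields["gene"], fields.get("gene_symbol", "")


header = (">ENST00000632684.1 cdna chromosome:GRCh38:7:142786213:142786224:1 "
          "gene:ENSG00000282431.1 gene_biotype:TR_D_gene "
          "transcript_biotype:TR_D_gene gene_symbol:TRBD1 "
          "description:T cell receptor beta diversity 1")
print(parse_cdna_header(header))
```

Each tuple corresponds to one row of a tab-separated transcript-to-gene table like the one written above.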


**5. Process the output from kallisto with bustools (the butterfly branch)**

In [None]:
#get the whitelist
!rm -rf GRNP_2020 # in case the code is run several times
!git clone https://github.com/pachterlab/GRNP_2020.git
!cd GRNP_2020/whitelists && unzip 10xv3_whitelist.zip
!cd GRNP_2020/whitelists && ls

In [None]:
!bustools correct -w GRNP_2020/whitelists/10xv3_whitelist.txt -p bus_output/output.bus | bustools sort -T tmp/ -t 2 -o bus_output/sort.bus -
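
The cell above corrects barcodes against the 10xv3 whitelist and pipes the result into `bustools sort`. As an illustration of whitelist-based correction, the sketch below rescues a barcode that differs from exactly one whitelist entry by a single base. It is not the bustools implementation: the function name `correct_barcode` is hypothetical, and discarding ambiguous barcodes is an assumption made here for simplicity.

```python
def correct_barcode(barcode, whitelist):
    """Return the whitelist barcode matching `barcode` exactly or at Hamming
    distance 1. Returns None if there is no match or the match is ambiguous
    (assumption: ambiguous barcodes are discarded)."""
    if barcode in whitelist:
        return barcode
    candidates = set()
    for i, original in enumerate(barcode):
        for base in "ACGT":
            if base == original:
                continue
            candidate = barcode[:i] + base + barcode[i + 1:]
            if candidate in whitelist:
                candidates.add(candidate)
    return candidates.pop() if len(candidates) == 1 else None


whitelist = {"AAAA", "CCCC"}  # toy whitelist; real 10xv3 barcodes are 16 bases long
print(correct_barcode("AAAT", whitelist))  # one mismatch from "AAAA"
print(correct_barcode("GGGG", whitelist))  # no whitelist entry within distance 1
```

Storing the whitelist as a Python `set` keeps each lookup constant-time, so the cost per barcode grows only with the barcode length.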

In [None]:
# Collapse
!bustools collapse -o bus_output/coll -t bus_output/transcripts.txt -g bus_output/transcripts_to_genes.txt -e bus_output/matrix.ec bus_output/sort.bus

In [None]:
# umicorrect: this code is not optimized for speed in this branch and may take a while to run. It is much faster in the master branch.
!bustools umicorrect -o bus_output/umicorr.bus bus_output/coll.bus

In [None]:
#convert to text
!bustools text -o bus_output/bug.txt bus_output/umicorr.bus
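
Once the BUS file has been converted to text, it can be summarized with a few lines of Python. The sketch below counts distinct UMI sequences per barcode. It assumes that each line of `bug.txt` has at least four tab-separated columns with the barcode first and the UMI second; check this against the actual file before relying on it. The function name `umis_per_barcode` is hypothetical.

```python
from collections import defaultdict


def umis_per_barcode(lines):
    """Count distinct UMI sequences per barcode.

    Assumption: each line is tab-separated with at least four columns,
    of which the first is the barcode and the second is the UMI.
    Lines with fewer columns are skipped.
    """
    umis = defaultdict(set)
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) < 4:
            continue
        umis[parts[0]].add(parts[1])
    return {barcode: len(seen) for barcode, seen in umis.items()}


# Toy input in the assumed layout; for real data, pass an open file instead:
# with open("bus_output/bug.txt") as fh:
#     counts = umis_per_barcode(fh)
example = ["AAAC\tGGGT\t5\t2\n", "AAAC\tGGGA\t5\t1\n", "CCCA\tTTTT\t7\t3\n"]
print(umis_per_barcode(example))
```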


In [None]:
!ls -l
!cd bus_output && ls -l