# Breast Cancer Cell Lines: Preprocess Data
In this notebook we preprocess our data for downstream locus-based analysis. Data includes:
* smRNA seq data of the breast cancer (and normal) cell lines used in our Nature Med paper (Fish, 2018)
* smRNA seq data from TCGA: specifically all the BRCA samples and all the normal samples from all tissue types in TCGA. 
* smRNA seq data from exRNA Atlas along with other publicly available datasets from non-cancerous exosomal smRNA samples

In [1]:
import pandas as pd
import os
import pysam
import re
import pymongo
import json as js

## Preprocess Cell Lines Data
Dataset involves 21 samples:
* 3 samples from HMEC cell lines
* 18 samples from 

In [2]:
cell_lines = [f for f in os.scandir("data/IC/") if f.name.endswith(".srt.dd.bam")]
len(cell_lines)

21

### Dust

Since we are converting the bam files to bed format, we first remove the low-complexity sequences from the bam files.

In [3]:
"""Simple dust score adapted from Dust Algorithm Score: A fast and symmetric DUST implementation to mask
low-complexity DNA sequences (2006, Morgulis et al.)"""
def simpleDustScore(seq):
    assert len(seq) > 2
    if len(seq) == 3:
        return 0
    else:
        triplets = {}
        num_trip = len(seq) - 2
        for i in range(num_trip):
            subseq = seq[i:i+3]
            if subseq in triplets:
                triplets[subseq] += 1
            else:
                triplets[subseq] = 1
        sum_triplet = 0
        for triplet, count in triplets.items():
            sum_triplet += count * (count - 1) / 2
        return sum_triplet/(num_trip - 1)
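
As a quick sanity check of the score, the sketch below re-implements the same triplet-based formula with `collections.Counter` and applies it to two made-up sequences. The name `dust_score` and both example sequences are illustrative assumptions, not part of our pipeline; unlike `simpleDustScore`, this version returns 0 for very short inputs instead of asserting.

```python
from collections import Counter

def dust_score(seq):
    """Triplet-based dust score: sum over triplets of c*(c-1)/2, divided by (num_triplets - 1)."""
    num_trip = len(seq) - 2
    if num_trip <= 1:  # assumption: sequences of length <= 3 get a score of 0
        return 0.0
    counts = Counter(seq[i:i + 3] for i in range(num_trip))
    return sum(c * (c - 1) / 2 for c in counts.values()) / (num_trip - 1)

homopolymer = "A" * 20            # hypothetical low-complexity read
mixed = "ACGTTGCAAGCTAGGCTTAC"    # hypothetical higher-complexity read

print(dust_score(homopolymer))  # 9.0
print(dust_score(mixed) < 3)    # True
```

With the cutoff of 3 used in the next cell, the homopolymer would be dropped and the mixed read kept.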

In [4]:
for f in cell_lines: 
    name = f.name.split(".")[0]
    out = f"data/IC/{name}.srt.dd.dust.bam"
    infile = pysam.AlignmentFile(f, "rb")
    outfile = pysam.AlignmentFile(out, "wb", template=infile)
    for read in infile.fetch():
        seq = read.get_forward_sequence()
        # Check the length first so very short reads never reach the assert in simpleDustScore
        if seq is not None and len(seq) >= 15 and simpleDustScore(seq) < 3:
            outfile.write(read)
    outfile.close()
    infile.close()

### Bamtobed
Next we convert all the cell line bam alignment files to bed files.

In [5]:
%%bash
for f in data/IC/*.srt.dd.dust.bam;
do 
base=$(basename $f)
out=${base/.bam/.bed}
echo "bedtools bamtobed -i $f > data/IC/$out"
bedtools bamtobed -i $f > data/IC/$out
done &> log/cell_lines_bamtobed.out

### Filter Cell Lines Bed

In [6]:
for f in os.scandir("data/IC"):
    if f.name.endswith(".srt.dd.dust.bed"):
        name = f.name.split(".")[0]
        out_path = f"data/IC/{name}.filter.bed"
        with open(out_path, "wt") as out, open(f, "rt") as file:
            for line in file:
                if "chr" in line and "chrUn" not in line and "None" not in line:
                    out.write(line) 
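
To make the filter rule concrete, here is a minimal sketch that applies the same substring test to a few made-up BED records. The helper name `keep_bed_line` and the example records are hypothetical. Note that the test looks at the whole line, so a match anywhere in the record (for example in the read name) also affects the outcome.

```python
def keep_bed_line(line):
    """Same substring test as the cell above, applied to one BED line."""
    return "chr" in line and "chrUn" not in line and "None" not in line

# Hypothetical tab-separated BED records, for illustration only
lines = [
    "chr1\t100\t120\tread1\t255\t+\n",
    "chrUn_KI270742v1\t5\t30\tread2\t255\t-\n",
    "MT\t10\t40\tread3\t255\t+\n",
]
kept = [line for line in lines if keep_bed_line(line)]
print(len(kept))  # 1
```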

### Merge Cell Lines Loci
Here we merge all the loci seen in the cell lines data to create a locus feature map. No threshold is set for the degree of overlap.

In [7]:
%%bash
cat data/IC/*.filter.bed | sort -k1,1 -k2,2n | mergeBed -s -c 6 -o distinct -i stdin | \
awk '{print $1 "\t" $2 "\t" $3 "\t" $1":"$2"-"$3":"$4  "\t." "\t" $4  > "data/cell_lines_smRNAs_merged_loci.bed"}' 
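
To illustrate what the strand-aware merge and the awk renaming produce, the following is a rough pure-Python sketch on toy intervals. It is not how bedtools is implemented; `merge_loci` and the toy coordinates are assumptions for illustration, and we assume that book-ended intervals are merged as well, which matches our understanding of the default mergeBed behaviour.

```python
from collections import defaultdict

def merge_loci(intervals):
    """Merge overlapping or book-ended intervals separately per (chrom, strand)."""
    groups = defaultdict(list)
    for chrom, start, end, strand in intervals:
        groups[(chrom, strand)].append((start, end))

    merged = []
    for (chrom, strand), spans in groups.items():
        spans.sort()
        cur_start, cur_end = spans[0]
        for start, end in spans[1:]:
            if start <= cur_end:  # overlaps or touches the current locus
                cur_end = max(cur_end, end)
            else:
                merged.append((chrom, cur_start, cur_end, strand))
                cur_start, cur_end = start, end
        merged.append((chrom, cur_start, cur_end, strand))

    # Name each locus chrom:start-end:strand, mirroring the awk step
    return [(c, s, e, f"{c}:{s}-{e}:{st}", ".", st) for c, s, e, st in sorted(merged)]

# Hypothetical reads: two overlapping on "+", one on "-", one separate on "+"
toy = [("chr1", 100, 120, "+"), ("chr1", 110, 130, "+"),
       ("chr1", 115, 140, "-"), ("chr1", 200, 220, "+")]
for locus in merge_loci(toy):
    print("\t".join(map(str, locus)))
```

The two overlapping `+` reads collapse into `chr1:100-130:+`, while the `-` read stays a separate locus because merging is done per strand.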

### Non-cancerous exRNA Filter Feature Map
Here we filter out all the loci that are also seen in non-cancerous samples from the exRNA Atlas and other publicly available non-cancerous exosomal smRNAseq datasets.

In [8]:
myclient=pymongo.MongoClient(port=27027)
mydb = myclient["exRNA"]
rnacol = mydb["smRNA"]

In [9]:
cursor = rnacol.find(no_cursor_timeout=True)
with open("data/healthy_exRNA.bed", "wt") as out:
    for rna in cursor:
        ref = rna["ref_id"]
        start = rna["start"]
        end = rna["end"]
        strand = rna["strand"]
        bed = f"{ref}\t{start}\t{end}\t.\t.\t{strand}"
        if "chr" in bed and "chrUn" not in bed and "None" not in bed:
            out.write(bed + "\n")

In [10]:
myclient.close()

In [11]:
%%bash 
sortBed -i data/healthy_exRNA.bed | mergeBed -s -c 6 -o distinct -i stdin | \
awk '{print $1 "\t" $2 "\t" $3 "\t" $1":"$2"-"$3":"$4  "\t." "\t" $4  > "data/healthy_exRNA.srt.merge.bed"}' 

Next we filter the cell line locus feature map using the non-cancerous exRNA loci, eliminating any cell line locus that is also found in the non-cancerous exRNA dataset.
<br>
The `-v` option of intersectBed keeps the loci in the `-a` file that do not overlap any locus in the `-b` file. We add a requirement for what counts as an overlap: at least 90% of a locus in `-a` must be covered by a locus in `-b` on the same strand (`-f 0.9`, `-s`).
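
The overlap rule can be sketched in plain Python as follows. This is only an illustration of the criterion as we understand it, not the bedtools implementation; `overlap_fraction`, `keep_locus`, and the toy loci are hypothetical, and loci are assumed to have `end > start`.

```python
def overlap_fraction(a, b):
    """Fraction of locus a covered by locus b; 0 if chromosome or strand differ.

    Loci are (chrom, start, end, strand) tuples.
    """
    if a[0] != b[0] or a[3] != b[3]:
        return 0.0
    overlap = min(a[2], b[2]) - max(a[1], b[1])
    return max(overlap, 0) / (a[2] - a[1])

def keep_locus(a, b_loci, min_frac=0.9):
    """Keep a only if no single b locus covers at least min_frac of it."""
    return all(overlap_fraction(a, b) < min_frac for b in b_loci)

# Hypothetical non-cancerous exRNA loci
healthy = [("chr1", 105, 200, "+"), ("chr1", 350, 450, "+")]
print(keep_locus(("chr1", 100, 200, "+"), healthy))  # False: 95% covered
print(keep_locus(("chr1", 300, 400, "+"), healthy))  # True: only 50% covered
```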

In [12]:
%%bash
echo "intersectBed -s -f 0.9 -v -a data/cell_lines_smRNAs_merged_loci.bed -b data/healthy_exRNA.srt.merge.bed > data/cell_lines_smRNAs_filtered_loci.bed"
intersectBed -s -f 0.9 -v -a data/cell_lines_smRNAs_merged_loci.bed -b data/healthy_exRNA.srt.merge.bed > data/cell_lines_smRNAs_filtered_loci.bed

intersectBed -s -f 0.9 -v -a data/cell_lines_smRNAs_merged_loci.bed -b data/healthy_exRNA.srt.merge.bed > data/cell_lines_smRNAs_filtered_loci.bed


In [13]:
%%bash
wc -l data/cell_lines_smRNAs_merged_loci.bed

1731285 data/cell_lines_smRNAs_merged_loci.bed


In [14]:
%%bash
wc -l data/cell_lines_smRNAs_filtered_loci.bed

1297653 data/cell_lines_smRNAs_filtered_loci.bed


The number of loci under consideration went from 1731285 to 1297653 after filtering out loci present in non-cancerous exosomal RNA samples, so about 75% of loci remain after non-cancerous exRNA filtering.
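
The retained fraction follows directly from the two `wc -l` counts above; the variable names in this short check are ours.

```python
merged_loci = 1731285    # line count of cell_lines_smRNAs_merged_loci.bed
filtered_loci = 1297653  # line count of cell_lines_smRNAs_filtered_loci.bed

removed = merged_loci - filtered_loci
retained_pct = 100 * filtered_loci / merged_loci
print(removed)                 # 433632
print(round(retained_pct, 1))  # 75.0
```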

### Create Counts
The last preprocessing step for the cell lines is to create counts for each cell line sample for each smRNA locus feature.

In [15]:
%%bash 
for f in data/IC/*.filter.bed;
do
out=${f/.filter.bed/.intersect.bed}
echo "intersectBed -s -wo -a $f -b data/cell_lines_smRNAs_merged_loci.bed > $out"
intersectBed -s -wo -a $f -b data/cell_lines_smRNAs_merged_loci.bed > $out
done &> log/cell_lines_intersect.out

In [16]:
%%bash 
for f in data/IC/*.filter.bed;
do
out=${f/.filter.bed/.exRNA.intersect.bed}
echo "intersectBed -s -wo -a $f -b data/cell_lines_smRNAs_filtered_loci.bed > $out"
intersectBed -s -wo -a $f -b data/cell_lines_smRNAs_filtered_loci.bed > $out
done &> log/cell_lines_exRNA_intersect.out

In [17]:
exRNA_filtered_loci = {}
sample_loci = {}
cpm_map = {}
exRNA_filtered_cell_lines = [f for f in os.scandir("data/IC/") if f.name.endswith(".exRNA.intersect.bed")]
for f in exRNA_filtered_cell_lines:  
    #Counts of exRNA filtered loci
    ex_fil_bed = pd.read_csv(f, header=None, sep="\t")
    sample = f.name.split(".")[0]
    
    assert len(ex_fil_bed[3].unique()) == ex_fil_bed.shape[0] #Ensures number of unique reads (query_id) matches total number of reads in bed file, so that we are not over counting
    exRNA_filtered_loci[sample] = {}
    for locus in ex_fil_bed[9]: #This is the locus from feature annotation
        if locus in exRNA_filtered_loci[sample]:
            exRNA_filtered_loci[sample][locus] += 1
        else:
            exRNA_filtered_loci[sample][locus] = 1    
    
    bed_file = pd.read_csv(f"data/IC/{sample}.intersect.bed", header=None, sep="\t")
    assert len(bed_file[3].unique()) == bed_file.shape[0] #Ensures number of unique reads (query_id) matches total number of reads in bed file, so that we are not over counting
    #cpm calculations: total reads within pre-exRNA filtered bedfile. Note reads that had "chrUn" or low-complexity sequences were filtered out and not included in cpm normalization.
    cpm = 1000000/bed_file.shape[0]
    cpm_map[sample] = cpm
     
    #Counts of loci
    sample_loci[sample] = {}
    for locus in bed_file[9]: #This is the locus from feature annotation
        if locus in sample_loci[sample]:
            sample_loci[sample][locus] += 1
        else:
            sample_loci[sample][locus] = 1
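
For reference, the per-locus counting and the CPM scaling factor can be sketched on toy data as below. `locus_cpm` and the example locus names are hypothetical; in the cell above the factor is computed from the pre-exRNA-filter intersect file and stored in `cpm_map` rather than applied immediately.

```python
from collections import Counter

def locus_cpm(read_loci):
    """Count reads per locus and scale to counts per million reads."""
    counts = Counter(read_loci)
    scale = 1_000_000 / len(read_loci)  # same kind of per-sample factor as cpm_map
    return {locus: n * scale for locus, n in counts.items()}

# Hypothetical locus assignments for four reads of one sample
reads = ["chr1:100-130:+", "chr1:100-130:+", "chr1:200-220:+", "chr2:50-90:-"]
cpm = locus_cpm(reads)
print(cpm["chr1:100-130:+"])  # 500000.0
```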

In [18]:
with open('data/counts/cell_lines_loci_counts.json', 'w') as f:
    js.dump(sample_loci, f)

with open('data/counts/cell_lines_exRNA_filtered_loci_counts.json', 'w') as f:
    js.dump(exRNA_filtered_loci, f)

with open("data/counts/cell_lines_cpm_map.json", "w") as f:
    js.dump(cpm_map, f)

## Preprocess TCGA Data
Since the TCGA datasets are quite a bit larger, we run the following preprocessing steps as scripts instead of in this notebook. All scripts can be found in the `scripts` directory. <br>
The TCGA dataset involves 679 normal samples from varying tissues and 1103 BRCA cancer samples.
<br>

### Dust 
Eliminate low-complexity sequences

Command used in smRNA env: <br>
`python3 scripts/dust_filter.py &> log/dust_filter.out`

### Bamtobed
Convert all the filtered bam files to bed files.

Command used in smRNA env: <br>
`ls /rumi/shams/jwang/BRCA_oncRNA/data/TCGA/*.dust.bam | parallel -j 30 bash scripts/bamtobed.sh {} &> log/bamtobed.out`

### Filter TCGA Data
The TCGA bed data contains some entries that are not compatible with bedtools and are not of interest (e.g., chrUn). Here we filter our TCGA data before further processing.

Command used in smRNA env: <br>
`python3 scripts/filter_TCGA.py &> log/filter_TCGA.out`

# Conclusion
Finished preprocessing and creating counts for loci found in our cell lines and preprocessing TCGA data.