# 8B_make_rna_tracks

How to make RNA tracks: each track displays TPM values over exons in hg19.

Option 1: Wiggle (not used because wiggle files are not easy to process)
1. make wiggle file (https://genome.ucsc.edu/goldenPath/help/wiggle.html)
2. convert wig to BigWig (https://genome.ucsc.edu/goldenpath/help/bigWig.html)

Option 2: Bedgraph
1. make bedgraph (https://genome.ucsc.edu/goldenpath/help/bedgraph.html)
2. convert bedgraph to BigWig (https://genome.ucsc.edu/goldenpath/help/bigWig.html)
To create a bigWig track from a bedGraph file, follow these steps:

1. Create a bedGraph format file following the directions here. When converting a bedGraph file to a bigWig file, you are limited to one track of data in your input file; therefore, you must create a separate bedGraph file for each data track.
2. Remove any existing track or browser lines from your bedGraph file so that it contains only data.
3. Download the bedGraphToBigWig program from the binary utilities directory.
4. Use the fetchChromSizes script from the same directory to create the chrom.sizes file for the UCSC database with which you are working (e.g., hg19). If the assembly genNom is hosted by UCSC, chrom.sizes can be a URL like http://hgdownload.soe.ucsc.edu/goldenPath/genNom/bigZips/genNom.chrom.sizes
5. Use the bedGraphToBigWig utility to create a bigWig file from your bedGraph file:
    `bedGraphToBigWig in.bedGraph chrom.sizes myBigWig.bw`
(Note that the bedGraphToBigWig program DOES NOT accept gzipped bedGraph input files.)
6. Move the newly created bigWig file (myBigWig.bw) to a web-accessible http, https, or ftp location.
7. Paste the URL into the custom track entry form or construct a custom track using a single track line.
8. Paste the custom track line into the text box on the custom track management page.
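
Steps 7 and 8 can be sketched as a small helper that builds a single custom track line for a hosted bigWig. The `type=bigWig`, `name`, `description`, `visibility`, and `bigDataUrl` settings follow the UCSC bigWig help page linked above; the function name `make_track_line` and the example URL are hypothetical and only serve as an illustration.

```python
def make_track_line(name, url, description=None, visibility="full"):
    """Build a one-line UCSC custom track definition for a remote bigWig."""
    description = description or name
    return (
        'track type=bigWig name="{}" description="{}" '
        'visibility={} bigDataUrl={}'
    ).format(name, description, visibility, url)


# hypothetical hosting location; replace with wherever the .bw file is served
print(make_track_line("Astrocytes", "https://example.org/tracks/Astrocytes.bw"))
```

The printed line is what would be pasted into the custom track text box.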



In [1]:
import pandas as pd
import os, glob, re
import subprocess
import pybedtools

# 0. get files
- exon file
- rna seq tpm matrix file
- chrom size file

In [2]:
size_file = '../data/external/chrom_hg19.sizes'
exon_bed = '../data/external/gencode.v19.exon.bed'

In [3]:
exon_df = pybedtools.BedTool(exon_bed).to_dataframe()
exon_df[:5]

Unnamed: 0,chrom,start,end,name
0,chr1,11869,12227,DDX11L1
1,chr1,12613,12721,DDX11L1
2,chr1,13221,14409,DDX11L1
3,chr1,11872,12227,DDX11L1
4,chr1,12613,12721,DDX11L1


In [11]:
rna_file = '../data/interim/rna/tissue_tpm_sym_wHEK.csv'
rna_df = pd.read_csv(rna_file,index_col=0).reset_index()
print(rna_df.shape)
rna_df[:5]

(23686, 11)


Unnamed: 0,index,Astrocytes,SL_D0,SL_D2,SLC_D0,SLC_D2,H9_D0,H9_D2,H9_D10,H9_D28,HEK293T
0,A1BG,18.66,7.23,4.016667,6.993333,3.616667,12.485,7.47,5.27,5.475,22.843285
1,A1BG-AS1,0.43,2.04,2.71,1.656667,2.7,0.915,2.77,3.3,3.71,0.0
2,A1CF,0.01,0.015,0.026667,0.01,0.003333,0.0,0.006667,0.11,0.16,0.082062
3,A2M,50.67,0.055,2.483333,0.156667,6.546667,0.02,15.666667,3.74,2.97,1.258617
4,A2M-AS1,0.07,2.58,1.67,0.996667,2.16,0.285,0.68,5.34,7.985,0.0


In [5]:
tissues = ['Astrocytes', 'SL_D0', 'SL_D2', 'SLC_D0', 'SLC_D2', 'H9_D0', 'H9_D2',
       'H9_D10', 'H9_D28','HEK293T']

In [6]:
# # bedgraph definition lines
# definition_str = "track type=bedGraph name={track_label:s} description=center_label visibility=display_mode color=r,g,b altColor=r,g,b priority=priority autoScale=on|off alwaysZero=on|off gridDefault=on|off maxHeightPixels=max:default:min graphType=bar|points viewLimits=lower:upper yLineMark=real-value yLineOnOff=on|off windowingFunction=maximum|mean|minimum smoothingWindow=off|2-16"
# definition_str.format(track_label='test', )

In [7]:
# txt = "For only {price:.2f} dollars!"
# print(txt.format(price = 49))

In [8]:
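save_dir = '../data/processed/rna_bigwigs'
# also create the bedgraphs/ and bigwigs/ subfolders that later cells write to
for sub_dir in ['bedgraphs', 'bigwigs']:
    os.makedirs(os.path.join(save_dir, sub_dir), exist_ok=True)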

# 1A. make bedGraph (no browser or track lines) per tissue

space delimited
`chrom chromStart chromEnd dataValue`

example:

```
chr19 49302000 49302300 -1.0
chr19 49302300 49302600 -0.75
```
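
Before converting, it can help to confirm that a file contains only data lines in this four-column layout. The following is a minimal sketch, not part of the original pipeline; the helper name `check_bedgraph_lines` is hypothetical. It flags lines that do not have four fields (which also catches leftover `track` or `browser` lines), non-numeric coordinates or values, and intervals where start is not below end.

```python
def check_bedgraph_lines(lines):
    """Return a list of problems found in bedGraph data lines (empty if none)."""
    problems = []
    for i, line in enumerate(lines, start=1):
        fields = line.split()
        if len(fields) != 4:
            problems.append("line {}: expected 4 fields, got {}".format(i, len(fields)))
            continue
        chrom, start, end, value = fields
        try:
            start, end = int(start), int(end)
            float(value)
        except ValueError:
            problems.append("line {}: non-numeric start, end, or value".format(i))
            continue
        if start < 0 or start >= end:
            problems.append("line {}: need 0 <= start < end".format(i))
    return problems


example = [
    "chr19 49302000 49302300 -1.0",
    "chr19 49302300 49302600 -0.75",
]
print(check_bedgraph_lines(example))  # the two example lines above are well formed
```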

In [9]:
THRES=1

In [15]:
%%time
for tissue in tissues:
    print('*****', tissue)
    # `name` holds gene symbols, so the "exon ids" counted below are unique gene symbols
    print('# unique exon ids: ', exon_df.name.unique().shape[0])

    # attach this tissue's TPM to every exon row of the matching gene symbol
    exon_df_tissue = exon_df.merge(rna_df[['index',tissue]],how='inner',left_on='name',right_on='index')
    print('size exon_df_tissue', exon_df_tissue.shape)
    print('# unique exon ids with rna info: ', exon_df_tissue['index'].unique().shape[0])
    # keep only rows whose TPM is above THRES
    exon_df_tissue = exon_df_tissue[exon_df_tissue[tissue]>THRES]
    print('size exon_df_tissue THRES 1',exon_df_tissue.shape)
    print('# unique exon ids with rna info THRES 1: ',exon_df_tissue['index'].unique().shape[0])
    exon_df_tissue = exon_df_tissue[['chrom','start','end',tissue]]
    # sort, then merge overlapping exons; the TPM column (4) is summed over the merged rows
    exon_df_bed = pybedtools.BedTool.from_dataframe(exon_df_tissue).sort().merge(c=4,o='sum')
    exon_df_tissue = exon_df_bed.to_dataframe()
    print('size exon_df_tissue sorted and merged',exon_df_tissue.shape)
    # write space-delimited data lines only (no header, no track line)
    exon_df_tissue.to_csv(os.path.join(save_dir,'bedgraphs', tissue+'_unsorted.bedGraph'),index=False, header=False,sep=' ')

***** Astrocytes
# unique exon ids:  55765
size exon_df_tissue (1081192, 6)
# unique exon ids with rna info:  21985
size exon_df_tissue THRES 1 (704155, 6)
# unique exon ids with rna info THRES 1:  10226
size exon_df_tissue sorted and merged (138877, 4)
***** SL_D0
# unique exon ids:  55765
size exon_df_tissue (1081192, 6)
# unique exon ids with rna info:  21985
size exon_df_tissue THRES 1 (828697, 6)
# unique exon ids with rna info THRES 1:  12760
size exon_df_tissue sorted and merged (166033, 4)
***** SL_D2
# unique exon ids:  55765
size exon_df_tissue (1081192, 6)
# unique exon ids with rna info:  21985
size exon_df_tissue THRES 1 (849223, 6)
# unique exon ids with rna info THRES 1:  13116
size exon_df_tissue sorted and merged (170943, 4)
***** SLC_D0
# unique exon ids:  55765
size exon_df_tissue (1081192, 6)
# unique exon ids with rna info:  21985
size exon_df_tissue THRES 1 (814730, 6)
# unique exon ids with rna info THRES 1:  12472
size exon_df_tissue sorted and merged (162378, 4

In [16]:
exon_df_tissue[:5]

Unnamed: 0,chrom,start,end,name
0,chr1,14363,14829,33.858137
1,chr1,14970,15038,33.858137
2,chr1,15796,15947,40.629765
3,chr1,16607,16765,33.858137
4,chr1,16854,17055,33.858137
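
One thing worth checking in these merged values: the `exon_df` preview in step 0 lists the same interval more than once (rows 1 and 4 are both chr1 12613-12721 for DDX11L1, and rows 0 and 3 overlap), so merging with `o='sum'` adds the gene's TPM once per copy. The sketch below is a plain-Python stand-in for the merge, assuming the bedtools default of joining overlapping and book-ended intervals; `merge_intervals` is a hypothetical helper and the TPM of 2.0 is a made-up value. It compares summing with taking the maximum.

```python
def merge_intervals(rows, op=sum):
    """Merge overlapping or book-ended (chrom, start, end, value) rows.

    Plain-Python stand-in for `bedtools merge -c 4 -o <op>`; `op` reduces the
    values of all rows that end up in one merged interval.
    """
    merged = []
    for chrom, start, end, value in sorted(rows):
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            last = merged[-1]
            last[2] = max(last[2], end)
            last[3].append(value)
        else:
            merged.append([chrom, start, end, [value]])
    return [(c, s, e, op(v)) for c, s, e, v in merged]


# intervals copied from the exon_df preview; the TPM of 2.0 is a made-up value
toy = [
    ("chr1", 11869, 12227, 2.0),
    ("chr1", 12613, 12721, 2.0),
    ("chr1", 11872, 12227, 2.0),
    ("chr1", 12613, 12721, 2.0),
]
print(merge_intervals(toy, op=sum))  # each merged interval reports 4.0
print(merge_intervals(toy, op=max))  # each merged interval reports 2.0
```

If the track is meant to show the gene-level TPM itself, an operation such as `max` (or dropping duplicate rows before merging) may be closer to the intent than `sum`; which one is appropriate depends on how the track should be read.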


# 1B. sort bedGraphs

`sort -k1,1 -k2,2n unsorted.bedGraph > sorted.bedGraph`
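
The same ordering can be sketched in Python: sort by chromosome name as a string, then by start as a number. This is only an illustration with made-up values; `sort_bedgraph_lines` is a hypothetical helper. Python compares strings by code point, which should match `sort` run under the C locale, while the shell command's chromosome order can vary with the locale settings.

```python
def sort_bedgraph_lines(lines):
    """Sort bedGraph data lines by chromosome name, then numeric start."""
    def key(line):
        fields = line.split()
        return (fields[0], int(fields[1]))
    return sorted(lines, key=key)


# made-up data lines, deliberately out of order
unsorted_lines = [
    "chr2 500 600 1.5",
    "chr10 100 200 3.0",
    "chr1 900 950 2.0",
    "chr1 50 80 4.0",
]
for line in sort_bedgraph_lines(unsorted_lines):
    print(line)
```

Note that `chr10` sorts before `chr2` because the chromosome column is compared as text, which is the order the `-k1,1` key produces as well.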

In [17]:
%%time
for tissue in tissues:
    unsort_file = os.path.join(save_dir, 'bedgraphs',tissue+'_unsorted.bedGraph')
    sort_file = os.path.join(save_dir,'bedgraphs', tissue+'_sorted.bedGraph')
    sort_cmd = 'sort -k1,1 -k2,2n "{}" > "{}"'.format(unsort_file,sort_file)
    print(sort_cmd)
    subprocess.call(sort_cmd,shell=True)

sort -k1,1 -k2,2n "../data/processed/rna_bigwigs/bedgraphs/Astrocytes_unsorted.bedGraph" > "../data/processed/rna_bigwigs/bedgraphs/Astrocytes_sorted.bedGraph"
sort -k1,1 -k2,2n "../data/processed/rna_bigwigs/bedgraphs/SL_D0_unsorted.bedGraph" > "../data/processed/rna_bigwigs/bedgraphs/SL_D0_sorted.bedGraph"
sort -k1,1 -k2,2n "../data/processed/rna_bigwigs/bedgraphs/SL_D2_unsorted.bedGraph" > "../data/processed/rna_bigwigs/bedgraphs/SL_D2_sorted.bedGraph"
sort -k1,1 -k2,2n "../data/processed/rna_bigwigs/bedgraphs/SLC_D0_unsorted.bedGraph" > "../data/processed/rna_bigwigs/bedgraphs/SLC_D0_sorted.bedGraph"
sort -k1,1 -k2,2n "../data/processed/rna_bigwigs/bedgraphs/SLC_D2_unsorted.bedGraph" > "../data/processed/rna_bigwigs/bedgraphs/SLC_D2_sorted.bedGraph"
sort -k1,1 -k2,2n "../data/processed/rna_bigwigs/bedgraphs/H9_D0_unsorted.bedGraph" > "../data/processed/rna_bigwigs/bedgraphs/H9_D0_sorted.bedGraph"
sort -k1,1 -k2,2n "../data/processed/rna_bigwigs/bedgraphs/H9_D2_unsorted.bedGraph" > 

# 2. run `bedGraphToBigWig`
cmd: `bedGraphToBigWig in.bedGraph chrom.sizes myBigWig.bw`
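
A minimal sketch of issuing this command from Python with error checking, assuming the `bedGraphToBigWig` binary is on `PATH`. The helper names `bedgraph_to_bigwig_cmd` and `run_checked` are hypothetical. Passing the arguments as a list avoids shell quoting issues, and `check=True` makes a failed conversion raise instead of passing silently.

```python
import subprocess


def bedgraph_to_bigwig_cmd(tool, bedgraph, chrom_sizes, bigwig):
    """Return the argument list for one bedGraphToBigWig call."""
    return [tool, bedgraph, chrom_sizes, bigwig]


def run_checked(cmd):
    """Run a command without a shell and raise CalledProcessError on failure."""
    return subprocess.run(cmd, check=True)


cmd = bedgraph_to_bigwig_cmd(
    "bedGraphToBigWig",  # assumes the binary is on PATH
    "in.bedGraph",
    "chrom.sizes",
    "myBigWig.bw",
)
print(" ".join(cmd))
# run_checked(cmd)  # uncomment once the input files exist
```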

In [18]:
bedGraphToBigWig_cmd = '/Users/mguo123/Documents/genome_browser_tools/bedGraphToBigWig'


In [19]:
%%time
for tissue in tissues:
    bedgraph_file = os.path.join(save_dir,'bedgraphs', tissue+'_sorted.bedGraph')
    bigwig_file = os.path.join(save_dir,'bigwigs', tissue+'.bw')
    cmd = '{} {} {} {}'.format(bedGraphToBigWig_cmd,bedgraph_file, size_file,bigwig_file )
    print(cmd)
    subprocess.call(cmd,shell=True)

/Users/mguo123/Documents/genome_browser_tools/bedGraphToBigWig ../data/processed/rna_bigwigs/bedgraphs/Astrocytes_sorted.bedGraph ../data/external/chrom_hg19.sizes ../data/processed/rna_bigwigs/bigwigs/Astrocytes.bw
/Users/mguo123/Documents/genome_browser_tools/bedGraphToBigWig ../data/processed/rna_bigwigs/bedgraphs/SL_D0_sorted.bedGraph ../data/external/chrom_hg19.sizes ../data/processed/rna_bigwigs/bigwigs/SL_D0.bw
/Users/mguo123/Documents/genome_browser_tools/bedGraphToBigWig ../data/processed/rna_bigwigs/bedgraphs/SL_D2_sorted.bedGraph ../data/external/chrom_hg19.sizes ../data/processed/rna_bigwigs/bigwigs/SL_D2.bw
/Users/mguo123/Documents/genome_browser_tools/bedGraphToBigWig ../data/processed/rna_bigwigs/bedgraphs/SLC_D0_sorted.bedGraph ../data/external/chrom_hg19.sizes ../data/processed/rna_bigwigs/bigwigs/SLC_D0.bw
/Users/mguo123/Documents/genome_browser_tools/bedGraphToBigWig ../data/processed/rna_bigwigs/bedgraphs/SLC_D2_sorted.bedGraph ../data/external/chrom_hg19.sizes ../d

# 3. make hub.config.json

The hub config file was made manually.

final folder for upload: `/Users/mguo123/Documents/pan_omics_psych/data/processed/rna_bigwigs/bigwigs`

# 4. checking


In [21]:
# substring match, so this also returns HOXA10, HOXA11, HOXA13, etc.
rna_df[rna_df['index'].str.contains('HOXA1')]

Unnamed: 0,index,Astrocytes,SL_D0,SL_D2,SLC_D0,SLC_D2,H9_D0,H9_D2,H9_D10,H9_D28,HEK293T
8112,HOXA1,0.03,0.0,0.183333,14.44,15.2,0.0,0.11,9.1,152.435,5.263825
8113,HOXA10,0.23,0.0,0.083333,0.613333,5.913333,0.02,0.016667,0.0,0.0,100.344035
8114,HOXA10-HOXA9,0.0,0.0,0.0,0.0,0.123333,0.0,0.0,0.0,0.0,0.0
8115,HOXA11,0.085,0.0,0.0,0.026667,0.113333,0.02,0.0,0.12,0.0,29.814197
8116,HOXA11-AS,0.0,0.0,0.0,0.406667,0.113333,0.0,0.023333,0.0,0.0,0.0
8117,HOXA13,0.0,0.0,0.0,0.0,0.016667,0.02,0.0,0.0,0.0,10.697375
