## Merge *de novo* TSS calls

Jianheng Liu (Fox) @ Jaffrey Lab, May 31st, 2023

Concat: jil4026@med.cornell.edu

This notebook is about how to merge the site list for *denovo* calls. 

If you just have one sample, just make a site list consisting of the the 

`chromosome id \t 1-based position \t strand `

for downstream analysis.

### 1. Prepare a SampleSheet

Here, you need to provide two columns: 

* Column-1 : The name of the merged samples
* Column-2 : The `prefix` of the files. Fox example, you should have the following two files `HEK_WT`.called.ben.passed.bed and `HEK_WT`.called.ben.filtered_out.bed

In [1]:
!cat SampleSheet

HEK	HEK_WT
A549	A549_WT


### 2. Fetch all sites into a single csv file

If --no-ambiguous is set, no annotations start with a "*" will be used.

Note: there are some containment in the KO data. The artifacts 

In [2]:
!python Merge_TSS_1_fetch_sites.py -s SampleSheet -o TSS.csv --no-ambiguous

Date: 2023-05-31 16:25:19
Number of sites analyzing: 15265
Mode: ambiguous gene annotations not allowed? True
Running: HEK_WT
Sites passed filters: 7241
Sites failed to pass filters w/ signals: 384
Running: A549_WT
Sites passed filters: 9079
Sites failed to pass filters w/ signals: 483


### 3. Filter TSS

In [3]:
!python Merge_TSS_2_filter_sites.py -i TSS.csv -s SampleSheet -o TSS.filtered.csv --no-ambiguous

Date: 2023-05-31 16:25:23
Finished at: 2023-05-31 16:25:23


### 4. Merge replicates

(No replicate here, but the script still works.)

In [4]:
!python Mrege_TSS_3_merge_replicates.py -i TSS.filtered.csv -s SampleSheet -o TSS.merged

Date: 2023-05-31 16:25:24
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  count_sample["Gene"] = count_sample.index.get_level_values(3)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  xs_df_out["Base"] = xs_df_out.index.get_level_values(4)


You may have outputs like

* TSS.merged.merged.csv - The main table
* TSS.merged.xs_frac.csv - The table about the fraction of the TSS usage within a gene
* TSS.merged.xs_tpm.csv - TPM of the sitess
* TSS.merged.xs_count.csv - Number of the reads 
* TSS.merged.stats.csv - Number of A/C/G/U reads

### 5. Extract the site list

In [5]:
!sed '1d' TSS.merged.xs_count.csv | sed 's/,/\t/g' | awk '$5=="A"' | cut -f 1,3,6 | sort -u > CROWN_sites.txt

In [6]:
!head CROWN_sites.txt

10	100225993	-
10	100229624	-
10	100347247	+
10	100373843	+
10	100376883	+
10	100529806	-
10	101040893	+
10	101229529	+
10	101588291	+
10	101783395	-


In [7]:
!wc -l CROWN_sites.txt

5843 CROWN_sites.txt


### Now you are ready to measure m6Am 