# STAR-Fusion analyses

This Jupyter notebook reproduces the STAR-Fusion analyses on the ILC and B-ALL datasets, in which we aimed to identify endogenous gene fusions.

In [2]:
%reload_ext autoreload
%autoreload 2

%matplotlib inline

import sys
sys.path.append('../src')

import numpy as np
from matplotlib import pyplot as plt
import pandas as pd
import seaborn as sns

sns.set_style('white')

## ILC dataset

First, we load the fusions for the ILC dataset. To narrow down the candidates, we remove any fusions that involve *Foxf2* or *En2*, as these reflect fusions with the transposon (i.e. transposon insertions) rather than endogenous fusions. We also keep only fusions whose breakpoints fall on known reference splice sites (`ONLY_REF_SPLICE`), as these are most likely to reflect true fusions.

In [3]:
from pathlib import Path
from nbsupport.star_fusion import (read_fusions, expand_fusion_name,
                                   filter_sb_transposon)

# Read fusions for samples.
base_dir = Path('../data/interim/sb/star-fusion/fusions')
sample_dirs = [x for x in base_dir.iterdir() if x.is_dir()]

fusions_per_sample = (
    (sample_dir.name,
     read_fusions(sample_dir / 'star-fusion.fusion_candidates.final'))
    for sample_dir in sample_dirs)

# Merge into single frame.
sb_fusions = pd.concat((df.assign(sample=sample) 
                        for sample, df in fusions_per_sample),
                       axis=0, ignore_index=True)

# Filter likely transposon fusions and split gene names.
sb_fusions = (sb_fusions
              .pipe(filter_sb_transposon)
              .pipe(expand_fusion_name))

# Subset to REF splices.
sb_fusions = sb_fusions.query('splice_type == "ONLY_REF_SPLICE"')

sb_fusions.head()

| | fusion_name | support_junction | support_spanning | splice_type | left_gene | left_breakpoint | right_gene | right_breakpoint | junction_reads | spanning_reads | sample | gene_a | gene_b |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4 | Cnksr3--Cand1 | 2.0 | 0.0 | ONLY_REF_SPLICE | Cnksr3^ENSMUSG00000015202 | 10:7211632:- | Cand1^ENSMUSG00000020114 | 10:119227546:- | HWI-ST867:243:C579PACXX:7:1213:9265:59434,HWI-... | . | 2800_65_12SKA136-L2 | Cnksr3 | Cand1 |
| 42 | Igkv1-110--Igkj1 | 2.0 | 0.0 | ONLY_REF_SPLICE | Igkv1-110^ENSMUSG00000093861 | 6:68271267:+ | Igkj1^ENSMUSG00000076604 | 6:70722562:+ | HWI-ST867:240:C53C7ACXX:5:2103:15598:82385,HWI... | . | 2800_75_13SKA012-L2 | Igkv1-110 | Igkj1 |
| 191 | Frs2--Dtx3 | 22.0 | 0.0 | ONLY_REF_SPLICE | Frs2^ENSMUSG00000020170 | 10:117148392:- | Dtx3^ENSMUSG00000040415 | 10:127191553:- | HWI-ST383:195:D1G4DACXX:8:1111:14633:10859,HWI... | . | 2049_37_12SKA028 | Frs2 | Dtx3 |
| 193 | Frs2--Dtx3 | 12.0 | 0.0 | ONLY_REF_SPLICE | Frs2^ENSMUSG00000020170 | 10:117148392:- | Dtx3^ENSMUSG00000040415 | 10:127191039:- | HWI-ST383:195:D1G4DACXX:8:2102:8998:13444,HWI-... | . | 2049_37_12SKA028 | Frs2 | Dtx3 |
| 200 | Frs2--Arhgef25 | 3.0 | 0.0 | ONLY_REF_SPLICE | Frs2^ENSMUSG00000020170 | 10:117148392:- | Arhgef25^ENSMUSG00000019467 | 10:127187322:- | HWI-ST383:195:D1G4DACXX:8:2102:9088:75597,HWI-... | . | 2049_37_12SKA028 | Frs2 | Arhgef25 |
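
The `filter_sb_transposon` and `expand_fusion_name` helpers come from `nbsupport` and their implementation is not shown in this notebook. As a rough illustration of the filtering steps described above, the following sketch applies the same idea to a toy frame. The function names `split_fusion_name` and `drop_transposon_fusions`, the `--` separator handling, and the toy rows are assumptions made for illustration; the actual helpers may work differently.

```python
import pandas as pd

# Hypothetical stand-ins for the nbsupport helpers (assumed behaviour only).
TRANSPOSON_GENES = ['Foxf2', 'En2']


def split_fusion_name(fusions):
    """Add gene_a/gene_b columns by splitting 'GeneA--GeneB' fusion names."""
    genes = fusions['fusion_name'].str.split('--', n=1, expand=True)
    return fusions.assign(gene_a=genes[0], gene_b=genes[1])


def drop_transposon_fusions(fusions, genes=TRANSPOSON_GENES):
    """Drop fusions in which either partner is a transposon-derived gene."""
    named = split_fusion_name(fusions)
    is_transposon = named['gene_a'].isin(genes) | named['gene_b'].isin(genes)
    return fusions.loc[~is_transposon]


# Toy input (made-up rows, not taken from the dataset).
toy = pd.DataFrame({
    'fusion_name': ['Fgfr2--Myh9', 'Foxf2--Trps1', 'Ppp1r12a--En2',
                    'Cpq--Sdc2'],
    'splice_type': ['ONLY_REF_SPLICE', 'ONLY_REF_SPLICE',
                    'ONLY_REF_SPLICE', 'INCL_NON_REF_SPLICE'],
})

filtered = (toy
            .pipe(drop_transposon_fusions)
            .pipe(split_fusion_name)
            .query('splice_type == "ONLY_REF_SPLICE"'))
```

In this toy example only the `Fgfr2--Myh9` row survives: two rows are dropped because they involve *Foxf2* or *En2*, and one because its splice type is not `ONLY_REF_SPLICE`.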


Next, we try to identify recurrent fusions by ranking gene pairs by the number of samples in which they occur:

In [5]:
sb_ranking = (
    sb_fusions.groupby(['gene_a', 'gene_b'])
    ['sample'].nunique().reset_index()
    .rename(columns={'sample': 'n_samples'})
    .sort_values(['n_samples', 'gene_a'], ascending=[False, True]))

sb_ranking.to_excel('../reports/star_fusion.sanger.xlsx', index=False)

sb_ranking.head(n=20)

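| | gene_a | gene_b | n_samples |
|---|---|---|---|
| 0 | 4930448N21Rik | Stox2 | 3 |
| 1 | Atxn7l1os1 | Fam173b | 1 |
| 2 | Cnksr3 | Cand1 | 1 |
| 3 | Cnst | Smyd3 | 1 |
| 4 | Cpq | Sdc2 | 1 |
| 5 | Fgfr2 | Kif16b | 1 |
| 6 | Fgfr2 | Myh9 | 1 |
| 7 | Fgfr2 | Tbc1d1 | 1 |
| 8 | Frs2 | Arhgef25 | 1 |
| 9 | Frs2 | Dtx3 | 1 |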
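
The ranking counts distinct samples per gene pair (`nunique`) rather than rows. This matters because a single sample can report the same gene pair at several breakpoints, as *Frs2--Dtx3* does in sample `2049_37_12SKA028` above. The sketch below, a minimal illustration using three rows copied from the fusion table, contrasts the two counts:

```python
import pandas as pd

# Three rows from the ILC fusion table, all from the same sample.
toy = pd.DataFrame({
    'sample': ['2049_37_12SKA028'] * 3,
    'gene_a': ['Frs2', 'Frs2', 'Frs2'],
    'gene_b': ['Dtx3', 'Dtx3', 'Arhgef25'],
    'right_breakpoint': ['10:127191553:-', '10:127191039:-',
                         '10:127187322:-'],
})

grouped = toy.groupby(['gene_a', 'gene_b'])['sample']

n_rows = grouped.size()        # counts every breakpoint variant
n_samples = grouped.nunique()  # counts each sample once per gene pair
```

Here `n_rows` reports two entries for *Frs2--Dtx3*, whereas `n_samples` reports one, which is the value that reflects recurrence across tumors.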


Interestingly, this identifies three fusions involving *Fgfr2*, which was also a hit in the insertion analysis. Further investigation shows that they correspond to known oncogenic *Fgfr2* fusions, which supports their validity.

In [18]:
sel_columns = ['sample', 'gene_a', 'gene_b', 'left_breakpoint', 
               'right_breakpoint', 'support_junction']
fgfr2_fusions = sb_fusions.query('gene_a == "Fgfr2"')[sel_columns]

fgfr2_fusions

| | sample | gene_a | gene_b | left_breakpoint | right_breakpoint | support_junction |
|---|---|---|---|---|---|---|
| 213 | 2800_58_12SKA127-R3 | Fgfr2 | Kif16b | 7:130167703:- | 2:142834136:- | 5.0 |
| 222 | 2800_2_12SKA035-L3 | Fgfr2 | Myh9 | 7:130167703:- | 15:77767663:- | 3.0 |
| 342 | 1566_10_11KOU023 | Fgfr2 | Tbc1d1 | 7:130167703:- | 5:64256715:+ | 11.0 |
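
All three *Fgfr2* fusions above share the same left breakpoint (`7:130167703:-`) while joining to partners on different chromosomes. To check this programmatically, the breakpoint strings can be split into chromosome, position and strand. The helper `split_breakpoint` below is a hypothetical convenience function, not part of `nbsupport`, and assumes the `chrom:pos:strand` layout seen in the table:

```python
import pandas as pd


def split_breakpoint(series, prefix):
    """Split 'chrom:pos:strand' strings into three typed columns."""
    parts = series.str.split(':', expand=True)
    return pd.DataFrame({
        prefix + '_chrom': parts[0],
        prefix + '_pos': parts[1].astype(int),
        prefix + '_strand': parts[2],
    })


# Breakpoints copied from the Fgfr2 fusion table above.
fgfr2 = pd.DataFrame({
    'gene_b': ['Kif16b', 'Myh9', 'Tbc1d1'],
    'left_breakpoint': ['7:130167703:-'] * 3,
    'right_breakpoint': ['2:142834136:-', '15:77767663:-', '5:64256715:+'],
})

left = split_breakpoint(fgfr2['left_breakpoint'], 'left')
right = split_breakpoint(fgfr2['right_breakpoint'], 'right')

# A single remaining row means all fusions share one Fgfr2 breakpoint.
shared_left = left.drop_duplicates()
```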


## B-ALL dataset

Next, we perform the same analysis for the B-ALL dataset.

In [7]:
# Read fusions for samples.
base_dir = Path('../data/interim/sanger/star-fusion/fusions')
sample_dirs = [x for x in base_dir.iterdir() if x.is_dir()]

fusions_per_sample = (
    (sample_dir.name,
     read_fusions(sample_dir / 'star-fusion.fusion_candidates.final'))
    for sample_dir in sample_dirs)

# Merge into single frame.
sanger_fusions = pd.concat((df.assign(sample=sample) 
                           for sample, df in fusions_per_sample),
                          axis=0, ignore_index=True)

# Filter likely transposon fusions and split gene names.
sanger_fusions = (sanger_fusions
                  .pipe(filter_sb_transposon)
                  .pipe(expand_fusion_name))

# Subset to REF splices.
sanger_fusions = sanger_fusions.query('splice_type == "ONLY_REF_SPLICE"')

# Create ranking.
sanger_ranking = (
    sanger_fusions.groupby(['gene_a', 'gene_b'])
    ['sample'].nunique().reset_index()
    .rename(columns={'sample': 'n_samples'})
    .sort_values(['n_samples', 'gene_a'], ascending=[False, True]))

sanger_ranking.head(n=20)

| | gene_a | gene_b | n_samples |
|---|---|---|---|
| 71 | Etv6 | Runx1 | 16 |
| 115 | Hexdc | Khdrbs1 | 9 |
| 5 | A430104N18Rik | Mcm7 | 8 |
| 60 | Dnmt1 | Dyx1c1 | 5 |
| 165 | Mcm9 | Asf1a | 5 |
| 130 | Ighv12-3 | Ighm | 4 |
| 207 | RP23-32C18.5 | Ms4a4d | 4 |
| 4 | A430104N18Rik | 2010111I01Rik | 3 |
| 58 | Diap3 | Tdrd3 | 3 |
| 106 | Hba-a1 | Hbb-bt | 3 |


This identifies the *Etv6-Runx1* fusion, which was engineered in this mouse model, as the most recurrent fusion: it is detected in 16 samples, the majority of the dataset.
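
To express such a count as a fraction of the cohort, the denominator should be the full list of samples rather than the samples present in the fusion frame, since samples without any retained fusion do not appear there. The sketch below illustrates this with made-up sample names; `fraction_of_samples` is a hypothetical helper, and in this notebook the full sample list could be taken from the names of `sample_dirs`.

```python
import pandas as pd


def fraction_of_samples(fusions, gene_a, gene_b, all_samples):
    """Fraction of all samples in which the given gene pair was detected."""
    mask = (fusions['gene_a'] == gene_a) & (fusions['gene_b'] == gene_b)
    n_hits = fusions.loc[mask, 'sample'].nunique()
    return n_hits / len(set(all_samples))


# Toy data: sample 's4' has no fusions and is absent from the frame.
all_samples = ['s1', 's2', 's3', 's4']
toy = pd.DataFrame({
    'sample': ['s1', 's2', 's2', 's3'],
    'gene_a': ['Etv6', 'Etv6', 'Etv6', 'Hba-a1'],
    'gene_b': ['Runx1', 'Runx1', 'Runx1', 'Hbb-bt'],
})

frac = fraction_of_samples(toy, 'Etv6', 'Runx1', all_samples)
```

In this toy example the fusion is found in two of four samples, so `frac` is 0.5; dividing by the three samples present in the frame would overstate it.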