In [1]:
#| echo: false
#| output: false

%load_ext autoreload
%autoreload 2

In [2]:
from geneinfo.genelist import GeneListCollection
from geneinfo.genelist import GeneList as glist

## GeneList

Long lists of gene names do not work well visually:

In [3]:
dummy_genes = ['ABCB7', 'ACTRT1', 'AKAP4', 'ALG13', 'ARHGAP36', 'ATP7A', 'ATRX', 'BCLAF3', 'BRCC3', 'CAPN6', 'CCNB3', 'CFAP47', 'CLCN5', 'CMC4', 'CNKSR2', 'COX7B', 'CYBB', 'DCX', 'DKC1', 'DYNLT3', 'ENOX2', 'ENOX2-AS1', 'EZHIP', 'F8', 'F8A1', 'FAM120C', 'FGF16']
dummy_genes

['ABCB7',
 'ACTRT1',
 'AKAP4',
 'ALG13',
 'ARHGAP36',
 'ATP7A',
 'ATRX',
 'BCLAF3',
 'BRCC3',
 'CAPN6',
 'CCNB3',
 'CFAP47',
 'CLCN5',
 'CMC4',
 'CNKSR2',
 'COX7B',
 'CYBB',
 'DCX',
 'DKC1',
 'DYNLT3',
 'ENOX2',
 'ENOX2-AS1',
 'EZHIP',
 'F8',
 'F8A1',
 'FAM120C',
 'FGF16']

GeneList objects work just like normal lists but have some additional features that are useful for exploring sets of genes.

When displayed they render as Markdown in columns to make them easier to read:

In [4]:
#| classes: .gene-list

list_A = glist(dummy_genes)
list_A

0,1,2,3,4,5,6,7,8
ABCB7,ALG13,ATRX,CAPN6,CLCN5,COX7B,DKC1,ENOX2-AS1,F8A1
ACTRT1,ARHGAP36,BCLAF3,CCNB3,CMC4,CYBB,DYNLT3,EZHIP,FAM120C
AKAP4,ATP7A,BRCC3,CFAP47,CNKSR2,DCX,ENOX2,F8,FGF16
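
The layout above fills column by column: the first three genes run down column 0, the next three down column 1, and so on. A minimal sketch of that arrangement in plain Python, assuming a fixed number of columns (the `column_major_rows` helper is hypothetical and not part of geneinfo):

```python
from math import ceil

def column_major_rows(items, ncols):
    """Arrange items into rows so that reading down each column
    follows the original order (hypothetical helper, not geneinfo API)."""
    if ncols < 1:
        raise ValueError("ncols must be at least 1")
    nrows = ceil(len(items) / ncols)
    rows = []
    for r in range(nrows):
        # Column c holds items[c * nrows : (c + 1) * nrows].
        row = [items[c * nrows + r] for c in range(ncols)
               if c * nrows + r < len(items)]
        rows.append(row)
    return rows

genes = ['ABCB7', 'ACTRT1', 'AKAP4', 'ALG13', 'ARHGAP36', 'ATP7A']
for row in column_major_rows(genes, ncols=2):
    print(row)
```

With nine columns and the 27 dummy genes, this gives the same three rows as the rendered output.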


## Highlight genes

The `<<` operator is overloaded to highlight the genes in the left-hand list that are also present in the right-hand list:

In [5]:
#| classes: .gene-list

list_B = glist(dummy_genes[::2])
list_A << list_B

0,1,2,3,4,5,6,7,8
ABCB7,ALG13,ATRX,CAPN6,CLCN5,COX7B,DKC1,ENOX2-AS1,F8A1
ACTRT1,ARHGAP36,BCLAF3,CCNB3,CMC4,CYBB,DYNLT3,EZHIP,FAM120C
AKAP4,ATP7A,BRCC3,CFAP47,CNKSR2,DCX,ENOX2,F8,FGF16


In [6]:
#| classes: .gene-list

list_C = glist(dummy_genes[:12])
list_A << list_C

0,1,2,3,4,5,6,7,8
ABCB7,ALG13,ATRX,CAPN6,CLCN5,COX7B,DKC1,ENOX2-AS1,F8A1
ACTRT1,ARHGAP36,BCLAF3,CCNB3,CMC4,CYBB,DYNLT3,EZHIP,FAM120C
AKAP4,ATP7A,BRCC3,CFAP47,CNKSR2,DCX,ENOX2,F8,FGF16


You can apply the `<<` operator repeatedly to highlight genes from up to four other gene lists. Each additional list gets a new highlighting style, applied in the following sequence:

1. <span style="font-weight: bold;">Bold</span>
2. <span style="color:#1876D2;">Color</span>
3. <span style="text-decoration: underline;">Underline</span>
4. <span style="font-style: italic;">Italic</span>

Genes with all styles applied look like <span style="font-weight: bold; color:#1876D2; text-decoration: underline; font-style: italic;">this</span>.
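
The way the styles accumulate can be sketched in plain Python. This is only an illustration of the sequence above, not the geneinfo implementation: the `highlight_css` helper and the `STYLES` constant are hypothetical names, and the CSS strings are copied from the list of styles.

```python
# Styles in the order given above; the i-th list contributes the i-th style.
STYLES = [
    "font-weight: bold",
    "color:#1876D2",
    "text-decoration: underline",
    "font-style: italic",
]

def highlight_css(gene, other_lists):
    """Return the combined CSS for one gene given up to four other gene lists.
    Hypothetical helper, not part of geneinfo."""
    if len(other_lists) > len(STYLES):
        raise ValueError("at most four lists can be highlighted")
    parts = [style for style, other in zip(STYLES, other_lists)
             if gene in other]
    return "; ".join(parts)

list_B = ['ABCB7', 'AKAP4']
list_C = ['ABCB7', 'ACTRT1']
print(highlight_css('ABCB7', [list_B, list_C]))   # bold and colored
print(highlight_css('ACTRT1', [list_B, list_C]))  # colored only
print(highlight_css('ALG13', [list_B, list_C]))   # no highlighting
```

A gene found in several of the lists collects one style per list, which is why the order of the lists after `<<` changes the result.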

In [7]:
#| classes: .gene-list

list_D = glist(dummy_genes[::4])
list_E = glist(dummy_genes[2::10])

list_A << list_B << list_C << list_D << list_E

0,1,2,3,4,5,6,7,8
ABCB7,ALG13,ATRX,CAPN6,CLCN5,COX7B,DKC1,ENOX2-AS1,F8A1
ACTRT1,ARHGAP36,BCLAF3,CCNB3,CMC4,CYBB,DYNLT3,EZHIP,FAM120C
AKAP4,ATP7A,BRCC3,CFAP47,CNKSR2,DCX,ENOX2,F8,FGF16


The highlight color can be changed by passing a HEX color to `set_highlight_color`:

In [8]:
#| classes: .gene-list

glist.set_highlight_color('#009D2B')
list_A << list_E << list_D << list_C << list_B

0,1,2,3,4,5,6,7,8
ABCB7,ALG13,ATRX,CAPN6,CLCN5,COX7B,DKC1,ENOX2-AS1,F8A1
ACTRT1,ARHGAP36,BCLAF3,CCNB3,CMC4,CYBB,DYNLT3,EZHIP,FAM120C
AKAP4,ATP7A,BRCC3,CFAP47,CNKSR2,DCX,ENOX2,F8,FGF16


Reset the highlight color:

In [9]:
#| classes: .gene-list

glist.reset_highlight_color()
list_A << list_E << list_D << list_C << list_B

0,1,2,3,4,5,6,7,8
ABCB7,ALG13,ATRX,CAPN6,CLCN5,COX7B,DKC1,ENOX2-AS1,F8A1
ACTRT1,ARHGAP36,BCLAF3,CCNB3,CMC4,CYBB,DYNLT3,EZHIP,FAM120C
AKAP4,ATP7A,BRCC3,CFAP47,CNKSR2,DCX,ENOX2,F8,FGF16
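
Both `set_highlight_color` and `reset_highlight_color` are called on the class rather than on an instance, which suggests the color is shared by all gene lists. A minimal sketch of that pattern, assuming a class-level attribute and simple HEX validation (the `StyledList` class is hypothetical and the validation is an assumption, not documented geneinfo behavior):

```python
import re

class StyledList(list):
    """Hypothetical stand-in for a list class with a shared highlight color."""
    DEFAULT_COLOR = '#1876D2'   # the color used in the style list above
    highlight_color = DEFAULT_COLOR

    @classmethod
    def set_highlight_color(cls, color):
        # Accept only six-digit HEX colors such as '#009D2B'.
        if not re.fullmatch(r'#[0-9A-Fa-f]{6}', color):
            raise ValueError(f"not a HEX color: {color!r}")
        cls.highlight_color = color

    @classmethod
    def reset_highlight_color(cls):
        cls.highlight_color = cls.DEFAULT_COLOR

a, b = StyledList(['F8']), StyledList(['DCX'])
StyledList.set_highlight_color('#009D2B')
print(a.highlight_color, b.highlight_color)  # both instances see the new color
StyledList.reset_highlight_color()
print(a.highlight_color)                     # back to the default
```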


## Set operations

The bitwise operators `&`, `|`, and `^` are overloaded to allow set operations on gene lists.
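
For readers unfamiliar with operator overloading, the sketch below shows how a list subclass can map these operators onto set operations. `MiniGeneList` is a hypothetical class for illustration only, and returning a sorted result is an assumption rather than confirmed GeneList behavior.

```python
class MiniGeneList(list):
    """Hypothetical list subclass; not the geneinfo implementation."""

    def __and__(self, other):   # intersection
        return MiniGeneList(sorted(set(self) & set(other)))

    def __or__(self, other):    # union
        return MiniGeneList(sorted(set(self) | set(other)))

    def __xor__(self, other):   # symmetric difference
        return MiniGeneList(sorted(set(self) ^ set(other)))

b = MiniGeneList(['ABCB7', 'AKAP4', 'ARHGAP36'])
c = MiniGeneList(['ABCB7', 'ACTRT1', 'AKAP4'])
print(b & c)  # genes in both lists
print(b | c)  # genes in either list
print(b ^ c)  # genes in exactly one list
```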

Highlight in A the intersection of B and C:

In [10]:
#| classes: .gene-list

list_A << (list_B & list_C)

0,1,2,3,4,5,6,7,8
ABCB7,ALG13,ATRX,CAPN6,CLCN5,COX7B,DKC1,ENOX2-AS1,F8A1
ACTRT1,ARHGAP36,BCLAF3,CCNB3,CMC4,CYBB,DYNLT3,EZHIP,FAM120C
AKAP4,ATP7A,BRCC3,CFAP47,CNKSR2,DCX,ENOX2,F8,FGF16


Highlight in A the union of B and C:

In [11]:
#| classes: .gene-list

list_A << (list_B | list_C)

0,1,2,3,4,5,6,7,8
ABCB7,ALG13,ATRX,CAPN6,CLCN5,COX7B,DKC1,ENOX2-AS1,F8A1
ACTRT1,ARHGAP36,BCLAF3,CCNB3,CMC4,CYBB,DYNLT3,EZHIP,FAM120C
AKAP4,ATP7A,BRCC3,CFAP47,CNKSR2,DCX,ENOX2,F8,FGF16


Highlight in A the genes found in only one of B and C (symmetric difference):

In [12]:
#| classes: .gene-list

list_A << (list_B ^ list_C)

0,1,2,3,4,5,6,7,8
ABCB7,ALG13,ATRX,CAPN6,CLCN5,COX7B,DKC1,ENOX2-AS1,F8A1
ACTRT1,ARHGAP36,BCLAF3,CCNB3,CMC4,CYBB,DYNLT3,EZHIP,FAM120C
AKAP4,ATP7A,BRCC3,CFAP47,CNKSR2,DCX,ENOX2,F8,FGF16


Highlight in A the genes in B but not in C (set difference):

In [13]:
#| classes: .gene-list

list_A << (list_B ^ (list_B & list_C))

0,1,2,3,4,5,6,7,8
ABCB7,ALG13,ATRX,CAPN6,CLCN5,COX7B,DKC1,ENOX2-AS1,F8A1
ACTRT1,ARHGAP36,BCLAF3,CCNB3,CMC4,CYBB,DYNLT3,EZHIP,FAM120C
AKAP4,ATP7A,BRCC3,CFAP47,CNKSR2,DCX,ENOX2,F8,FGF16


Highlight in A the genes in C but not in B (set difference):

In [14]:
#| classes: .gene-list

list_A << (list_C ^ (list_C & list_B))

0,1,2,3,4,5,6,7,8
ABCB7,ALG13,ATRX,CAPN6,CLCN5,COX7B,DKC1,ENOX2-AS1,F8A1
ACTRT1,ARHGAP36,BCLAF3,CCNB3,CMC4,CYBB,DYNLT3,EZHIP,FAM120C
AKAP4,ATP7A,BRCC3,CFAP47,CNKSR2,DCX,ENOX2,F8,FGF16
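
The two set differences above are written as `X ^ (X & Y)`. This works because `X & Y` is a subset of `X`, so the symmetric difference simply removes the shared genes from `X`. The identity can be checked with Python's built-in sets (the gene names are taken from the dummy list; whether GeneList also supports `-` directly is not shown here):

```python
b = {'ABCB7', 'AKAP4', 'ARHGAP36', 'ATRX'}
c = {'ABCB7', 'ACTRT1', 'AKAP4', 'ALG13'}

# Removing the shared genes from b leaves exactly the genes unique to b.
assert b ^ (b & c) == b - c
# The same holds with the roles of b and c swapped.
assert c ^ (c & b) == c - b

print(sorted(b ^ (b & c)))
print(sorted(c ^ (c & b)))
```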


## GeneListCollection

Load a table of gene lists from a YAML file with the format below. Each gene list must have a unique `list_label`. The `genes` field is mandatory; the `description` field is optional. Additional fields are ignored.

```yaml
<list_label>:
  description: |
    <free text description of the gene list>
    <free text description of the gene list>
  genes: <gene_name>, <gene_name>, ...
<list_label>:
  description: |
    <free text description of the gene list>
    <free text description of the gene list>
  genes: <gene_name>, <gene_name>, ...
```
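
The field rules above can be sketched as a small normalization step applied to the mapping a YAML parser would return. The `normalize_gene_lists` helper is hypothetical and not part of geneinfo; accepting both a comma-separated string and a list for `genes` is an assumption made to cover the template above as well as list-style YAML.

```python
def normalize_gene_lists(raw):
    """Turn a parsed YAML mapping into {label: (description, genes)}.
    Hypothetical helper; field handling follows the rules stated above."""
    result = {}
    for label, entry in raw.items():
        if 'genes' not in entry:
            raise ValueError(f"gene list {label!r} has no 'genes' field")
        genes = entry['genes']
        if isinstance(genes, str):
            # Comma-separated string, as in the template above.
            genes = [g.strip() for g in genes.split(',') if g.strip()]
        # The description is optional; any other field is ignored.
        description = entry.get('description', '').strip()
        result[label] = (description, list(genes))
    return result

# A dict shaped like the output of a YAML parser for the format above.
raw = {
    'cool_genes': {'description': 'A list of cool genes\n',
                   'genes': 'TP53, BRCA1, EGFR',
                   'curator': 'ignored'},
    'target_genes': {'genes': ['AKT1', 'PTEN']},
}
print(normalize_gene_lists(raw))
```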

In [34]:
# Write a small example file; the variable is named yaml_text so it
# cannot be confused with the `yaml` module.
yaml_text = """
cool_genes:
    description: "A list of cool genes"
    genes: ['TP53', 'BRCA1', 'EGFR', 'VEGFA', 'MYC']
target_genes:
    description: "A list of other genes"
    genes: ['AKT1', 'PIK3CA', 'PTEN', 'TP53', 'BRCA1']
"""
with open('gene_lists.yaml', 'w') as f:
    f.write(yaml_text)

In [41]:
gene_lists = GeneListCollection('gene_lists.yaml')
gene_lists

Unnamed: 0,Label,Description
0,cool_genes,A list of cool genes
1,target_genes,A list of other genes


In [38]:
gene_lists.get('cool_genes')

0,1,2,3,4
TP53,BRCA1,EGFR,VEGFA,MYC


In [39]:
gene_lists.all_genes()

0,1,2,3,4,5,6,7
AKT1,BRCA1,EGFR,MYC,PIK3CA,PTEN,TP53,VEGFA
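
Judging from the output above, `all_genes()` returns each gene once, in alphabetical order. The same result can be reproduced with built-in sets; this is a sketch of the observed behavior, not the library's code:

```python
cool_genes = ['TP53', 'BRCA1', 'EGFR', 'VEGFA', 'MYC']
target_genes = ['AKT1', 'PIK3CA', 'PTEN', 'TP53', 'BRCA1']

# The union drops the duplicates (TP53, BRCA1); sorting gives the order shown above.
all_genes = sorted(set(cool_genes) | set(target_genes))
print(all_genes)
```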


Or from a Google Sheet using its ID and the sheet name:

In [25]:
# gene_lists = GeneListCollection(google_sheet='2JSjSLuto3jqdEnnG7JqzeC_1pUZw76n7XueVAYrUOpk')

See which neuron genes are also SFARI genes:

In [20]:
#| echo: false
#| output: false
#| classes: .gene-list

gene_lists = GeneListCollection(google_sheet='1JSjSLuto3jqdEnnG7JqzeC_1pUZw76n7XueVAYrUOpk')

In [21]:
gene_lists

AttributeError: 'GeneListCollection' object has no attribute 'names'



In [16]:
#| echo: false
#| output: false
#| classes: .gene-list

neuron_genes = glist(gene_lists.get('neuron_npx_proteome'))
sfari = glist(gene_lists.get('sfari_all_conf'))
neuron_genes << sfari

AttributeError: 'GeneListCollection' object has no attribute 'data'

In [37]:
(glist(gene_lists.get('cDEG'))
 << glist(gene_lists.get('Hama'))
 << glist(gene_lists.get('ech90_regions'))
 << glist(gene_lists.get('hum_nean_admix'))
 << glist(gene_lists.get('ari_nonPUR'))
)

0,1,2,3,4,5
CFAP47,EDA,HUWE1,PHF8,SCML1,UPF3B
DDX3X,EIF1AX,IQSEC2,PRICKLE3,SRPX2,VSIG1
DIAPH2,EMD,mc_ampl_SPANXN5,RBM41,SYP,
DYNLT3,HTR2C,OCRL,RTL4,SYTL5,
