# Pangenomes graph database construction

## Purpose
This notebook reproduces the construction of a Neo4j graph database containing 10 ESKAPE pangenomes. The underlying code was developed for the ICDE conference paper submission.

## Notebook organisation
This notebook is split into four parts:

1. Set up the notebook
2. Import pangenomes to a neo4j database
3. Graph Queries, experimental evaluation
4. Supplementary code

## Methodology
Quickly describe assumptions and processing steps.

## Results
With this code, multiple pangenomes and the similarities between their gene families can be loaded into a single graph database, which opens a new step in the field of comparative genomics.

## Suggested next steps
State suggested next steps, based on results obtained in this notebook.

# Setup
## Environment configuration
Before starting, you need an empty Neo4j DBMS (version 4.4.11) running and reachable, with the APOC plugin installed (version 4.4.0.10).
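
The version requirement can be checked from Python before loading anything. The sketch below is only an illustration: `check_versions` is a hypothetical helper, and it assumes an object exposing `run(query).data()`, as a py2neo `Graph` does. It queries `dbms.components()` and `apoc.version()` and compares the answers with the versions stated above.

```python
EXPECTED_NEO4J = "4.4.11"
EXPECTED_APOC = "4.4.0.10"


def check_versions(graph) -> dict:
    """Return the Neo4j and APOC versions reported by the server.

    `graph` is any object exposing run(query).data(), e.g. a py2neo Graph.
    Raises RuntimeError when a version differs from the expected one.
    """
    components = graph.run("CALL dbms.components() YIELD name, versions "
                           "RETURN name, versions").data()
    found = {
        "neo4j": components[0]["versions"][0],
        "apoc": graph.run("RETURN apoc.version() AS version").data()[0]["version"],
    }
    expected = {"neo4j": EXPECTED_NEO4J, "apoc": EXPECTED_APOC}
    mismatches = {key: (found[key], expected[key])
                  for key in found if found[key] != expected[key]}
    if mismatches:
        raise RuntimeError(f"Unexpected versions (found, expected): {mismatches}")
    return found
```

Once the `graph` object has been created in the parameter cell, `check_versions(graph)` should either return both versions or fail early with an explicit message.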

Running the following scripts requires some packages, which are listed in the conda environment file `conda-env.yml`. The *in development* version of PPanGGOLiN is required for some features and for pangenome compatibility.

To install the conda environment as a Jupyter kernel, copy and paste the following commands into your terminal:
```
conda update -n base -c defaults conda -y
conda env create --file conda-env.yml
conda init bash
conda activate pangraph
pip install --user ipykernel
python -m ipykernel install --user --name=pangraph
```

In [15]:
!git clone -b release1.3 https://github.com/labgem/PPanGGOLiN.git
!pip install PPanGGOLiN/.

Cloning into 'PPanGGOLiN'...
remote: Enumerating objects: 6007, done.
remote: Counting objects: 100% (175/175), done.
remote: Compressing objects: 100% (113/113), done.
remote: Total 6007 (delta 81), reused 104 (delta 50), pack-reused 5832
Receiving objects: 100% (6007/6007), 130.20 MiB | 25.96 MiB/s, done.
Resolving deltas: 100% (3446/3446), done.
Processing ./PPanGGOLiN
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: ppanggolin
  Building wheel for ppanggolin (setup.py) ... done
  Created wheel for ppanggolin: filename=ppanggolin-1.2.105-cp38-cp38-linux_x86_64.whl size=3538076 sha256=ed32c48b935a0cff6750f76eaa2bbea49108f4952a97d6615cb613c55b1c5d17
  Stored in directory: /tmp/pip-ephem-wheel-cache-73e3yfjr/wheels/c0/09/87/147f46fa9951bc20911b5efff11c297269a1f8cea4752a77e8
Successfully built ppanggolin
Installing collected packages: ppanggolin
  Attempting uninstall: ppanggolin
    Found existing

## Library import
We import the common Python libraries required to run the whole notebook.

In [1]:
# default libraries
import logging
import os
from pathlib import Path

# installed libraries
from py2neo import Graph
from tqdm import tqdm

# local libraries
from script.python.utils import check_tsv_sanity

# Parameter definition
We set all relevant parameters for our notebook. By convention, parameters are uppercase, while all the 
other variables follow Python's guidelines.

In [2]:
GBFF = "data/GBFF"  # Genomes used to construct the pangenomes
os.environ["GBFF"] = GBFF
PANGENOMES = "original_data/pangenomes"  # Directory with our pangenomes
os.environ["PANGENOMES"] = PANGENOMES
GF = "data/GF_fasta"  # Directory with the DNA sequences of all gene families, for every pangenome
SIMILARITY = "original_data/similarity"

PANGENOMES_TSV = Path(f"{PANGENOMES}/organism_eskape_2.list")
SIMILARITIES = Path(f"{SIMILARITY}/clustering2/clust_concat.tsv")

# Neo4j parameters
URI = "bolt://localhost:7687"
USER = "neo4j"
PWD = "PANORAMA2022"

# Execution parameters
CPU = 6
os.environ["CPU"] = str(CPU)
BATCH_SIZE = 1000

pangenomes = check_tsv_sanity(PANGENOMES_TSV)  # Create a dictionary with the file path and information of each pangenome.
graph = Graph(uri=URI, user=USER, password=PWD)
# graph.delete_all()  # make sure the graph is empty
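
`check_tsv_sanity` lives in the local `script` package and is not shown in this notebook. The loading code further down reads `pangenome_info["path"]` and `pangenome_info["taxid"]`, so the resulting dictionary presumably maps each pangenome name to those two fields. The sketch below shows what such a parser could look like; the function name `read_pangenome_list` and the three-column layout (name, path, taxid) are assumptions, not the actual implementation.

```python
import csv
from pathlib import Path


def read_pangenome_list(tsv: Path) -> dict:
    """Parse a TSV with (name, path, taxid) columns into {name: {"path", "taxid"}}.

    The column layout is an assumption; the real check_tsv_sanity may differ.
    """
    pangenomes = {}
    with open(tsv, newline="") as handle:
        for line_no, row in enumerate(csv.reader(handle, delimiter="\t"), start=1):
            if not row:
                continue  # skip blank lines
            if len(row) != 3:
                raise ValueError(f"{tsv}:{line_no}: expected 3 columns, got {len(row)}")
            name, path, taxid = row
            if name in pangenomes:
                raise ValueError(f"{tsv}:{line_no}: duplicate pangenome name {name!r}")
            pangenomes[name] = {"path": Path(path), "taxid": taxid}
    return pangenomes
```

Failing on malformed or duplicated lines at this stage is cheaper than discovering the problem after a long load.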

# Import pangenomes to a neo4j database
## Translate pangenomes into dictionaries for export to the graph database

The first step is to translate each pangenome into a data structure suited for export to our graph database.

In [6]:
from script.python import Pangenome
from script.python.translate import write_families, write_organisms, write_spot, write_rgp, write_modules

def create_dict(pangenome: Pangenome) -> dict:
    """Create a dictionary with the pangenome content to export one pangenome into a graph database
    :param pangenome: A pangenome built with PPanGGOLiN

    :return: Dictionary describing one pangenome, compatible with export to the graph database
    """

    translate_dict = {"Pangenome": {"name": pangenome.name, "taxid": pangenome.taxid,
                                  "Family": [],
                                  "Partition": [],
                                  "Module": [],
                                  "RGP": write_rgp(parent=pangenome),
                                  "Spot": [], "Genome": []}}
    write_families(pangenome, translate_dict)
    write_organisms(pangenome, translate_dict)
    write_spot(pangenome, translate_dict)
    write_modules(pangenome, translate_dict)
    return translate_dict
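
The nesting of this dictionary drives the graph layout. Judging by the relationship names overridden in the loader configuration (`PANGENOME_HAS_FAMILY`, `FAMILY_HAS_GENE`, and so on), each parent/child nesting seems to become a `<PARENT>_HAS_<CHILD>` relationship. The sketch below only mimics that naming rule on a toy dictionary: `implied_reltypes` is a hypothetical helper, not part of dict2graph, and the toy values are invented.

```python
from typing import Optional


def implied_reltypes(data: dict, parent: Optional[str] = None) -> set:
    """Collect the PARENT_HAS_CHILD names implied by nested, labelled dictionaries."""
    reltypes = set()
    for key, value in data.items():
        children = value if isinstance(value, list) else [value]
        for child in children:
            if isinstance(child, dict):
                if parent is not None:
                    reltypes.add(f"{parent.upper()}_HAS_{key.upper()}")
                # Recurse with the current key as the new parent label
                reltypes |= implied_reltypes(child, parent=key)
    return reltypes


# Toy data, invented for illustration only
toy = {"Pangenome": {"name": "toy", "taxid": 0,
                     "Family": [{"name": "fam_1", "Gene": [{"name": "gene_1"}]},
                                {"name": "fam_2", "Gene": []}],
                     "Genome": [{"name": "genome_1",
                                 "Contig": [{"name": "contig_1"}]}]}}
print(sorted(implied_reltypes(toy)))
# ['FAMILY_HAS_GENE', 'GENOME_HAS_CONTIG', 'PANGENOME_HAS_FAMILY', 'PANGENOME_HAS_GENOME']
```

Listing the implied names this way makes it easier to decide which relationships to rename and which to block in the loader configuration.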

# Loader configuration
To export pangenomes into our graph database, we use the dict2graph package, which is also employed by the CovidGraph framework.

In [7]:
from multiprocessing import Lock

from py2neo import Node
# installed libraries
from dict2graph import Dict2graph


def custom_post_func(node: Node):
    """Remove the temporary identifier (tmp_id) from Gene nodes"""
    if node is not None and node.__primarylabel__ == "Gene":
        del node["tmp_id"]
    return node


class PangenomeLoader:

    def __init__(self, pangenome_name: str, pangenome_data: dict, lock: Lock, batch_size: int = 1000):
        self.name = pangenome_name
        self.lock = lock
        self.data = pangenome_data
        self.batch_size = batch_size
        self._build_loader()

    def load(self, graph: Graph):
        assert self.lock is not None, "Lock not Initialized"
        try:
            with self.lock:
                logging.getLogger().debug("parse")
                self.loader.parse(self.data)
                logging.getLogger().debug("index")
                self.loader.create_indexes(graph)
                logging.getLogger().debug("merge")
                self.loader.merge(graph)
        except Exception as error:
            # Chain the original exception so that its traceback is kept
            raise Exception(f"Load to Neo4j failed because: {error}") from error

    def _build_loader(self):
        d2g = Dict2graph()
        d2g.config_dict_primarykey_generated_hashed_attrs_by_label = {
            "Pangenome": 'AllAttributes',  # Random id
            "Family": ["name"],
            "Partition": 'AllAttributes',
            "Gene": 'AllAttributes',
            "Module": "InnerContent",
            "Spot": "AllContent",
            "RGP": "InnerContent",
            "Genome": "InnerContent",
            "Contig": "AllContent"
        }
        d2g.config_str_primarykey_generated_attr_name = "hash_id"
        d2g.config_list_blocklist_collection_hubs = [
            "PangenomeCollection",
            "FamilyCollection",
            "PartitionCollection",
            "GeneCollection",
            "NeighborCollection",
            "ModuleCollection",
            "SpotCollection",
            "RGPCollection",
            "GenomeCollection",
            "ContigCollection",
        ]
        d2g.config_dict_node_prop_to_rel_prop = {"Family": {"weight": ["NEIGHBOR"]}} #,
                                                 # "Shell": {"weight": ["NEIGHBOR"]},
                                                 # "Cloud": {"weight": ["NEIGHBOR"]}}  # ,  "partition": ["IN_MODULE"]}}
        d2g.config_dict_primarykey_attr_by_label = {"Family": ["name"],
                                                    "Gene": ["name"],
                                                    "Partition": ["partition"]}
        d2g.config_dict_reltype_override = {"PANGENOME_HAS_FAMILY": "IS_IN_PANGENOME",
                                            "FAMILY_HAS_GENE": "IS_IN_FAMILY",
                                            "FAMILY_HAS_PARTITION": "HAS_PARTITION",
                                            "FAMILY_HAS_FAMILY": "NEIGHBOR",
                                            "MODULE_HAS_FAMILY": "IS_IN_MODULE",
                                            "SPOT_HAS_RGP": "IS_IN_SPOT",
                                            "RGP_HAS_GENE": "IS_IN_RGP",
                                            "GENOME_HAS_CONTIG": "IS_IN_GENOME",
                                            "CONTIG_HAS_GENE": "IS_IN_CONTIG"}
        d2g.config_list_blocklist_reltypes = ["PANGENOME_HAS_MODULE",
                                              "PANGENOME_HAS_RGP",
                                              "PANGENOME_HAS_SPOT",
                                              "PANGENOME_HAS_GENOME"]
        d2g.config_bool_capitalize_labels = False
        d2g.config_func_node_post_modifier = custom_post_func
        d2g.config_graphio_batch_size = self.batch_size
        self.loader = d2g

## Export to Graph Database
### Export pangenomes

In [8]:
# default libraries
from concurrent.futures import ProcessPoolExecutor
from multiprocessing import Manager, Lock
from time import time
from datetime import timedelta

# local libraries
from script.python import Pangenome
from script.python.export import give_gene_tmp_id
from script.python.utils import check_pangenome_info


db_loading_lock: Lock = None


def init_db_lock(lock: Lock):
    global db_loading_lock
    if db_loading_lock is None:
        db_loading_lock = lock


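def load_pangenome(pangenome_name, pangenome_info, batch_size: int = 1000):
    """Read one pangenome, translate it into a dictionary and load it into the graph database

    :param pangenome_name: Name of the pangenome
    :param pangenome_info: Dictionary with the pangenome file path ("path") and its taxid ("taxid")
    :param batch_size: Batch size passed to the loader
    """
    logging.getLogger().info(f"Add {pangenome_name} to load list")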
    pangenome = Pangenome(name=pangenome_name, taxid=pangenome_info["taxid"])
    pangenome.add_file(pangenome_info["path"])
    check_pangenome_info(pangenome, need_annotations=True, need_families=True, need_graph=True,
                         need_rgp=True, need_spots=True, need_modules=True, need_anntation_fam=True,
                         disable_bar=False)
    give_gene_tmp_id(pangenome)
    data = create_dict(pangenome)
    loader = PangenomeLoader(pangenome_name, data, db_loading_lock, batch_size=batch_size)
    loader.load(graph)


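def load_pangenome_mp(pangenomes: dict, cpu: int = 1, batch_size: int = 1000):
    """Load all pangenomes using several worker processes

    Pangenomes are read and translated in parallel; a shared lock ensures that
    only one process writes to the database at a time.

    :param pangenomes: Dictionary mapping each pangenome name to its information
    :param cpu: Number of worker processes
    :param batch_size: Batch size passed to each loader
    """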
    manager = Manager()
    lock = manager.Lock()
    with ProcessPoolExecutor(max_workers=cpu, initializer=init_db_lock, initargs=(lock,)) as executor:
        list(tqdm(executor.map(load_pangenome, pangenomes.keys(), pangenomes.values(),
                               [batch_size] * len(pangenomes)),
                  total=len(pangenomes), unit='pangenome'))

print("Begin pangenomes load")
begin_load_time = time()
load_pangenome_mp(pangenomes, CPU, BATCH_SIZE)
load_time = time() - begin_load_time
print(f"All pangenomes loaded in : {timedelta(seconds=load_time)}")

Begin pangenomes load


100%|███████████████████████████████████████████████████████████████████████████████████████████████| 664001/664001 [00:01<00:00, 338876.72gene/s]
100%|█████████████████████████████████████████████████████████████████████████████████████████████| 1070265/1070265 [00:03<00:00, 283936.29gene/s]
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████| 137/137 [00:11<00:00, 12.23organism/s]
100%|████████████████████████████████████████████████████████████████████████████████████████| 651827/651827 [00:03<00:00, 173287.53gene family/s]
100%|███████████████████████████████████████████████████████████████████████████████████████████| 22953/22953 [00:00<00:00, 84615.41gene family/s]
100%|███████████████████████████████████████████████████████████████████████████████████| 639082/639082 [00:04<00:00, 136840.02contig adjacency/s]
100%|███████████████████████████████████████████████████████████████████████████████████████████████| 98368/98368 [00:

All pangenomes loaded in : 0:25:42.722878


### Export similarities

In [9]:
# installed libraries
from graphio import RelationshipSet
import pandas as pd

def load_similarities(tsv: Path, batch_size: int = 1000):
    """Load gene family similarities from a TSV file as IS_SIMILAR relationships"""
    df = pd.read_csv(filepath_or_buffer=tsv, sep="\t", header=None,
                     names=["Family_1", "Family_2", "identity", "coverage"])

    is_similar_list = []
    is_similar_to = RelationshipSet('IS_SIMILAR', ['Family'], ['Family'], ['name'], ['name'])
    chunk_size = batch_size * 10
    for _, row in df.iterrows():
        # Start a new relationship set once the current one is full
        if len(is_similar_to.relationships) >= chunk_size:
            is_similar_list.append(is_similar_to)
            is_similar_to = RelationshipSet('IS_SIMILAR', ['Family'], ['Family'], ['name'], ['name'])
        is_similar_to.add_relationship(start_node_properties={"name": row['Family_1']},
                                       end_node_properties={"name": row['Family_2']},
                                       properties={"identity": row['identity'],
                                                   "coverage": row['coverage']})
    is_similar_list.append(is_similar_to)
    for sim in tqdm(is_similar_list, unit="similarities_batch", total=len(is_similar_list)):
        sim.merge(graph=graph, batch_size=batch_size)

print("Begin load of similarities...")
begin_sim_time = time()
load_similarities(SIMILARITIES, BATCH_SIZE)
sim_time = time() - begin_sim_time
print(f"All similarities loaded in : {timedelta(seconds=sim_time)}")

Begin load of similarities...


100%|██████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [07:06<00:00, 426.45s/similarities_batch]

All similarities loaded in : 0:07:06.837499
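
The batching logic used for the similarities can also be written without pandas. The following is a minimal sketch under stated assumptions: `chunked` and `read_similarities` are hypothetical helpers, and the input is assumed to hold four tab-separated columns (two family names, identity, coverage), as in the `read_csv` call above. Each chunk could then be turned into one `RelationshipSet`.

```python
import csv
from itertools import islice
from typing import Iterable, Iterator, List


def chunked(rows: Iterable, size: int) -> Iterator[List]:
    """Yield successive lists of at most `size` items from `rows`."""
    if size < 1:
        raise ValueError("size must be >= 1")
    iterator = iter(rows)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def read_similarities(lines: Iterable[str]) -> Iterator[dict]:
    """Parse tab-separated (family_1, family_2, identity, coverage) rows."""
    for fam_1, fam_2, identity, coverage in csv.reader(lines, delimiter="\t"):
        yield {"Family_1": fam_1, "Family_2": fam_2,
               "identity": float(identity), "coverage": float(coverage)}


# Toy rows, invented for illustration only
lines = ["famA\tfamB\t0.95\t0.90", "famA\tfamC\t0.81\t0.85", "famB\tfamC\t0.80\t0.99"]
for batch in chunked(read_similarities(lines), size=2):
    print(len(batch), "similarities in this batch")
```

Because both helpers are generators, rows are read and grouped lazily, which keeps memory use bounded by the chunk size rather than by the size of the file.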





### Invert edges

In [4]:
from script.python.export import invert_edges_query


def invert_edges(edge_label: str):
    query = invert_edges_query(edge_label)
    try:
        graph.run(query)
    except Exception as error:
        raise Exception(f"Invert edges failed because: {error}") from error

print("Invert edges...")
begin_invert_time = time()
labels2invert = ["IS_IN_PANGENOME", "IS_IN_MODULE", "IS_IN_FAMILY", "IS_IN_CONTIG",
                 "IS_IN_GENOME", "IS_IN_SPOT", "IS_IN_RGP"]
for edge_label in tqdm(labels2invert, unit='label'):
    logging.getLogger().debug(f"Invert: {edge_label}")
    invert_edges(edge_label)
invert_time = time() - begin_invert_time
print(f"All edges inverted in : {timedelta(seconds=invert_time)}")

Invert edges...


100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [03:20<00:00, 28.62s/label]

All edges inverted in : 0:03:20.377133
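
The inversion step is presumably needed because the loader renames relationships such as `PANGENOME_HAS_FAMILY` to `IS_IN_PANGENOME` without changing their direction, whereas the queries in the workflow match `(p:Pangenome)<-[:IS_IN_PANGENOME]-(f:Family)`. The content of `invert_edges_query` is not shown in this notebook; the sketch below is a hypothetical stand-in (`build_invert_query`) that relies on the APOC procedure `apoc.refactor.invert`, and the real query may well differ.

```python
import re

# Relationship types are interpolated into the query, so restrict them to
# plain identifiers to avoid building a malformed or unsafe statement.
_LABEL_RE = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def build_invert_query(edge_label: str) -> str:
    """Build a Cypher query that flips every relationship of one type.

    Hypothetical stand-in for invert_edges_query, based on apoc.refactor.invert.
    """
    if not _LABEL_RE.match(edge_label):
        raise ValueError(f"Unsafe relationship type: {edge_label!r}")
    return (f"MATCH ()-[r:{edge_label}]->() "
            "CALL apoc.refactor.invert(r) YIELD input, output "
            "RETURN count(*) AS inverted")


print(build_invert_query("IS_IN_FAMILY"))
```

On large graphs, a single transaction over all relationships of a type can be heavy, so batching the inversion may be worth considering.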





# Request workflow

In [8]:
from time import time
from datetime import timedelta

def launch_query(query):
    print(query)
    try:
        res = graph.run(query)
    except Exception as error:
        raise Exception(f"Query : '{query}' failed because of the following error\n{error}") from error
    else:
        return res

def launch_WF():
    from script.python.wf import WF
    from statistics import mean, median, stdev
    import pandas as pd

    nb_rep = 10
    stat_dict = {}
    for q_id, query in WF.items():
        list_q_time = []
        for i in range(nb_rep):
            launch_query("CALL apoc.warmup.run()")
            begin_time = time()
            res = launch_query(query)
            q_time = time() - begin_time
            list_q_time.append(q_time)
            launch_query("CALL db.clearQueryCaches()")
        stat_dict[q_id] = [mean(list_q_time), stdev(list_q_time), median(list_q_time),
                           min(list_q_time), max(list_q_time)]
    return pd.DataFrame.from_dict(stat_dict, orient='index', columns=["Mean", "Stdev", "Median", "Min", "Max"])

launch_WF()

CALL apoc.warmup.run()
MATCH (p:Pangenome)<-[:IS_IN_PANGENOME]-(f:Family) WHERE f.annotation IS NOT NULL RETURN p.name, count(f)
CALL db.clearQueryCaches()
CALL apoc.warmup.run()
MATCH (p:Pangenome)<-[:IS_IN_PANGENOME]-(f:Family) WHERE f.annotation IS NOT NULL RETURN p.name, count(f)
CALL db.clearQueryCaches()
CALL apoc.warmup.run()
MATCH (p:Pangenome)<-[:IS_IN_PANGENOME]-(f:Family) WHERE f.annotation IS NOT NULL RETURN p.name, count(f)
CALL db.clearQueryCaches()
CALL apoc.warmup.run()
MATCH (p:Pangenome)<-[:IS_IN_PANGENOME]-(f:Family) WHERE f.annotation IS NOT NULL RETURN p.name, count(f)
CALL db.clearQueryCaches()
CALL apoc.warmup.run()
MATCH (p:Pangenome)<-[:IS_IN_PANGENOME]-(f:Family) WHERE f.annotation IS NOT NULL RETURN p.name, count(f)
CALL db.clearQueryCaches()
CALL apoc.warmup.run()
MATCH (p:Pangenome)<-[:IS_IN_PANGENOME]-(f:Family) WHERE f.annotation IS NOT NULL RETURN p.name, count(f)
CALL db.clearQueryCaches()
CALL apoc.warmup.run()
MATCH (p:Pangenome)<-[:IS_IN_PANGENOME]-(

| Query | Mean (s) | Stdev (s) | Median (s) | Min (s) | Max (s) |
|-------|----------|-----------|------------|---------|---------|
| Q1 | 0.04305 | 0.003106 | 0.042875 | 0.037794 | 0.046781 |
| Q2 | 0.039839 | 0.002843 | 0.039303 | 0.036775 | 0.045423 |
| Q3 | 0.556492 | 0.041772 | 0.532297 | 0.521851 | 0.635195 |
| Q4 | 1.186968 | 0.115383 | 1.137031 | 1.100517 | 1.444945 |
| Q5 | 0.042844 | 0.00565 | 0.041675 | 0.03732 | 0.056931 |
| Q6 | 0.039438 | 0.007757 | 0.037768 | 0.032563 | 0.060707 |
| Q7 | 2.385238 | 0.348206 | 2.291893 | 2.050604 | 3.088853 |
| Q8 | 1.111279 | 0.025456 | 1.106841 | 1.073579 | 1.160115 |
| Q9 | 0.053641 | 0.005251 | 0.051907 | 0.048528 | 0.064753 |
| Q10 | 0.083611 | 0.005362 | 0.082206 | 0.078561 | 0.094062 |
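
The timing statistics can be reproduced for any callable, independently of Neo4j. The sketch below is a hypothetical helper (`benchmark`) that repeats a call, measures wall-clock time with `time.perf_counter`, and returns the same five summary values, in seconds.

```python
from statistics import mean, median, stdev
from time import perf_counter
from typing import Callable, Dict


def benchmark(func: Callable[[], object], repetitions: int = 10) -> Dict[str, float]:
    """Time `func` several times and summarise the durations in seconds."""
    if repetitions < 2:
        raise ValueError("stdev needs at least two repetitions")
    durations = []
    for _ in range(repetitions):
        start = perf_counter()
        func()
        durations.append(perf_counter() - start)
    return {"Mean": mean(durations), "Stdev": stdev(durations),
            "Median": median(durations), "Min": min(durations),
            "Max": max(durations)}


# Example with a CPU-bound toy function
print(benchmark(lambda: sum(range(100_000)), repetitions=5))
```

When timing database queries, note that a py2neo cursor may fetch records lazily; wrapping the call as `lambda: graph.run(query).data()` would make sure that reading the results is part of the measured time.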


# Supplementary code
## Genomes downloading
Steps to download the genomes needed to construct the pangenomes. Note that genome databases are constantly evolving, so your pangenomes could differ from ours.

In [None]:
!mkdir $GBFF
!genome_updater.sh -d genbank,refseq -M gtdb -T "s__Acinetobacter baumannii" -f assembly_report.txt,genomic.gbff.gz -l "complete genome" -o $GBFF/Acinetobacter.baumannii -t $CPU
!genome_updater.sh -d genbank,refseq -M gtdb -T "s__Enterobacter bugandensis" -f assembly_report.txt,genomic.gbff.gz -o $GBFF/Enterobacter.bugandensis -t $CPU
!genome_updater.sh -d genbank,refseq -M gtdb -T "s__Enterobacter cloacae" -f assembly_report.txt,genomic.gbff.gz -o $GBFF/Enterobacter.cloacae -t $CPU
!genome_updater.sh -d genbank,refseq -M gtdb -T "s__Enterobacter hormaechei_A" -f assembly_report.txt,genomic.gbff.gz -l "complete genome" -o $GBFF/Enterobacter.hormaechei_A -t $CPU
!genome_updater.sh -d genbank,refseq -M gtdb -T "s__Enterobacter kobei" -f assembly_report.txt,genomic.gbff.gz -o $GBFF/Enterobacter.kobei -t $CPU
!genome_updater.sh -d genbank,refseq -M gtdb -T "s__Enterobacter roggenkampii" -f assembly_report.txt,genomic.gbff.gz -o $GBFF/Enterobacter.roggenkampii -t $CPU
!genome_updater.sh -d genbank,refseq -M gtdb -T "s__Enterococcus_B faecium" -f assembly_report.txt,genomic.gbff.gz -l "complete genome" -o $GBFF/Enterococcus_B.faecium -t $CPU
!genome_updater.sh -d genbank,refseq -M gtdb -T "s__Klebsiella pneumoniae" -f assembly_report.txt,genomic.gbff.gz -l "complete genome" -A 600 -o $GBFF/Klebsiella.pneumoniae -t $CPU
!genome_updater.sh -d genbank,refseq -M gtdb -T "s__Pseudomonas aeruginosa" -f assembly_report.txt,genomic.gbff.gz -l "complete genome" -o $GBFF/Pseudomonas.aeruginosa -t $CPU
!genome_updater.sh -d genbank,refseq -M gtdb -T "s__Staphylococcus aureus" -f assembly_report.txt,genomic.gbff.gz -l "complete genome" -A 600 -o $GBFF/Staphylococcus.aureus -t $CPU

## Pangenome construction with PPanGGOLiN
### Generate list of genomes

In [None]:
! for sp in $(ls $GBFF); do for path in $(ls $GBFF/$sp/*/files/*.gbff.gz); do genome=$(echo $path | cut -d'/' -f6 | cut -d. -f1,2); echo $genome $(pwd)/$path | sed 's/\s/\t/'; done > $PANGENOMES/$sp.list; done
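
The same lists can be produced in Python, which may be easier to read and debug than the shell one-liner. This is a sketch under assumptions: `write_genome_lists` is a hypothetical helper, and the `<species>/<version>/files/*.gbff.gz` layout is taken from the glob used in the shell command.

```python
from pathlib import Path


def write_genome_lists(gbff_dir: Path, out_dir: Path) -> dict:
    """Write one '<species>.list' file per species directory found in gbff_dir.

    Each line holds a genome name and the absolute path of its .gbff.gz file,
    separated by a tab. Returns {species: number of genomes written}.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    counts = {}
    for species_dir in sorted(p for p in gbff_dir.iterdir() if p.is_dir()):
        paths = sorted(species_dir.glob("*/files/*.gbff.gz"))
        with open(out_dir / f"{species_dir.name}.list", "w") as handle:
            for path in paths:
                # Keep the first two dot-separated fields of the file name,
                # as `cut -d. -f1,2` does in the shell version.
                genome = ".".join(path.name.split(".")[:2])
                handle.write(f"{genome}\t{path.resolve()}\n")
        counts[species_dir.name] = len(paths)
    return counts
```

With the notebook parameters, a call such as `write_genome_lists(Path(GBFF), Path(PANGENOMES))` would write the lists next to the pangenomes, and the returned counts give a quick check of how many genomes were found per species.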

### Generate pangenomes

In [None]:
!ppanggolin all --anno data/pangenomes/Acinetobacter.baumannii.list -o data/pangenomes/Acinetobacter.baumannii --only_pangenome -c 4
!ppanggolin all --anno data/pangenomes/Enterobacter.bugandensis.list -o data/pangenomes/Enterobacter.bugandensis --only_pangenome -c 4
!ppanggolin all --anno data/pangenomes/Enterobacter.cloacae.list -o data/pangenomes/Enterobacter.cloacae --only_pangenome -c 4
!ppanggolin all --anno data/pangenomes/Enterobacter.hormaechei_A.list -o data/pangenomes/Enterobacter.hormaechei_A --only_pangenome -c 4
!ppanggolin all --anno data/pangenomes/Enterobacter.kobei.list -o data/pangenomes/Enterobacter.kobei --only_pangenome -c 4
!ppanggolin all --anno data/pangenomes/Enterobacter.roggenkampii.list -o data/pangenomes/Enterobacter.roggenkampii --only_pangenome -c 4
!ppanggolin all --anno data/pangenomes/Enterococcus_B.faecium.list -o data/pangenomes/Enterococcus_B.faecium --only_pangenome -c 4
!ppanggolin all --anno data/pangenomes/Klebsiella.pneumoniae.list -o data/pangenomes/Klebsiella.pneumoniae --only_pangenome -c 4
!ppanggolin all --anno data/pangenomes/Pseudomonas.aeruginosa.list -o data/pangenomes/Pseudomonas.aeruginosa --only_pangenome -c 4
!ppanggolin all --anno data/pangenomes/Staphylococcus.aureus.list -o data/pangenomes/Staphylococcus.aureus --only_pangenome -c 4
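
The ten commands above differ only by the species name, so they can be generated from a list. The sketch below builds the argument lists without running them; `ppanggolin_command` is a hypothetical helper, and it only reuses the options that appear in the commands above.

```python
import shlex

SPECIES = [
    "Acinetobacter.baumannii", "Enterobacter.bugandensis", "Enterobacter.cloacae",
    "Enterobacter.hormaechei_A", "Enterobacter.kobei", "Enterobacter.roggenkampii",
    "Enterococcus_B.faecium", "Klebsiella.pneumoniae", "Pseudomonas.aeruginosa",
    "Staphylococcus.aureus",
]


def ppanggolin_command(species: str, pangenome_dir: str = "data/pangenomes",
                       cpu: int = 4) -> list:
    """Return the argument list of the PPanGGOLiN command for one species."""
    return ["ppanggolin", "all",
            "--anno", f"{pangenome_dir}/{species}.list",
            "-o", f"{pangenome_dir}/{species}",
            "--only_pangenome", "-c", str(cpu)]


for species in SPECIES:
    print(shlex.join(ppanggolin_command(species)))
```

Each argument list could be passed to `subprocess.run(command, check=True)` to actually launch the construction, stopping at the first failure.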

## Pangenome AMR annotation with CARD database and RGI

# Clean to relaunch

In [None]:
!rm -rf PPanGGOLiN
!rm -rf $GBFF
!conda env remove -n pangraph

