## CBCG

The Curated Breast Cancer Genes (CBCG) database is an easy-to-use collection of manually curated breast cancer genes and variants. It is intended to expedite clinical diagnostics and to support ongoing efforts in managing breast cancer etiology. It can also serve as a reference repository when designing new breast cancer multigene panels.

More details about the database can be found [here](https://cbcg.dk/genes.html).


### Extract the gene-disease relations from the CBCG database

In [1]:
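# Fetch the HTML page https://cbcg.dk/genes.html
import requests

url = "https://cbcg.dk/genes.html"
response = requests.get(url, timeout=30)
response.raise_for_status()  # fail early on HTTP errors instead of parsing an error page
# Assumption: the page is UTF-8 encoded. Decoding it explicitly avoids
# mojibake such as "Â" or "â\x80\x93" in the extracted text.
response.encoding = "utf-8"
html_content = response.text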

# The HTML content is parsed with BeautifulSoup in the next cell.
# The table of interest has the following structure:
#   <table class="table table-bordered table-striped" id="ExportTable">
#     <thead>
#       <tr>
#         <th>Gene</th>
#         <th>SNP Identifiers</th>
#         <th>SNP location</th>
#         <th>Genomic loci</th>
#         <th>Disease or Protective allele</th>
#         <th>Allele Frequency (GnomAD)</th>
#         <th>Literatures</th>
#       </tr>
#     </thead>
#     <tbody id="GeneTable">

#           <tr>
#             <td>ABCB1</td>
#             <td>rs1045642</td>
#             <td>Exon</td>
#             <td>7q21.12</td>
#             <td>Disease</td>
#             <td>&gt; 0.01</td>
#             <td><a href="https://pubmed.ncbi.nlm.nih.gov/32042261/" target="_blank" rel="noopener noreferrer">Association between ABCB1, ABCG2 carrier protein and COX-2 enzyme gene polymorphisms and breast cancer risk in a Turkish population</a></td>
#           </tr>
#     </tbody>
#   </table>

In [2]:
# Extract the table data
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_content, 'html.parser')
table = soup.find('table', {'id': 'ExportTable'})

# Extract headers
headers = [th.text.strip() for th in table.find('thead').find_all('th')] + ['PMID']

# Extract rows
rows = []
for row in table.find('tbody').find_all('tr'):
    cells = row.find_all('td')
    row_data = [cell.text.strip().replace("Â", "") for cell in cells]
    print(row_data)
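    # The PMID is the last path segment of the PubMed link in the "Literatures" column
    link = cells[6].find('a')
    row_data.append(link['href'].strip("/").split("/")[-1] if link else None)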
    rows.append(row_data)

print(headers)
print(rows[0])

['ABCB1', 'rs1045642', 'Exon', '7q21.12', 'Disease', '> 0.01', 'Association between ABCB1, ABCG2 carrier protein and COX-2 enzyme gene polymorphisms and breast cancer risk in a Turkish population']
['ABCG2', 'rs2231142', 'Exon', '4q22.1', 'Protective', '> 0.01', 'Association between ABCB1, ABCG2 carrier protein and COX-2 enzyme gene polymorphisms and breast cancer risk in a Turkish population']
['ABHD8', 'rs10424198', 'Intron', '19p13.11', 'Disease', '> 0.01', 'Functional mechanisms underlying pleiotropic risk alleles at the 19p13.1 breast–ovarian cancer susceptibility locus']
['ABHD8', 'rs4808616', '3 Prime UTR Variant', '19p13.11', 'Disease', '> 0.01', 'Functional mechanisms underlying pleiotropic risk alleles at the 19p13.1 breast–ovarian cancer susceptibility locus']
['ABR', '-', 'CNV', '17p13.3', 'Protective', '< 0.01', 'Rare germline copy number variants (CNVs) and breast cancer risk']
['ABRAXAS1', '-', '-', '4q21.23', 'Disease', '-', 'Aggregation tests identify n

In [3]:
import pandas as pd

df = pd.DataFrame(rows, columns=headers)
df.to_csv("cbcg_data.tsv", index=False, sep="\t")
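
The `PMID` column above is derived by taking the last path segment of each PubMed link. A slightly more defensive variant is sketched below. It is only an illustration: the helper name `pmid_from_url` is hypothetical, and it assumes that a valid PMID is purely numeric.

```python
import re
from typing import Optional
from urllib.parse import urlparse


def pmid_from_url(url: str) -> Optional[str]:
    """Return the numeric PMID from a PubMed URL, or None if the URL does not end in one."""
    # Keep only the path and drop leading/trailing slashes
    path = urlparse(url).path.strip("/")
    last_segment = path.split("/")[-1] if path else ""
    # Assumption: PMIDs consist of digits only
    return last_segment if re.fullmatch(r"\d+", last_segment) else None


print(pmid_from_url("https://pubmed.ncbi.nlm.nih.gov/32042261/"))
```

Returning `None` for links that do not end in a number makes rows with non-PubMed references easy to spot later, rather than storing an arbitrary URL fragment as a PMID.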

### Reformat the CBCG data into the BioMedGPS format

More details on the data format can be found [here](https://open-prophetdb.github.io/biomedgps-data/graph_data_index/#knowledge-graph-file).

Examples:

| relation_type                  | resource | source_id | source_type | target_id   | target_type | source_name                    | target_name |
|--------------------------------|----------|-----------|-------------|-------------|-------------|--------------------------------|-------------|
| DGIDB::INHIBITOR::Gene:Compound| DGIDB    | ENTREZ:4311 | Gene        | MESH:D015244| Compound    | membrane metalloendopeptidase  | Thiorphan   |
| DGIDB::INHIBITOR::Gene:Compound| DGIDB    | ENTREZ:4311 | Gene        | MESH:C097292| Compound    | membrane metalloendopeptidase  | aladotrilat |
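
Before handing a file to the formatter, it can help to confirm that its header contains the columns shown in the example table. The check below is a minimal sketch: the list `REQUIRED_COLUMNS` is an assumption copied from that example (the linked format description remains the authoritative reference), and the function name `missing_columns` is hypothetical.

```python
import csv
import io

# Assumption: these eight columns, taken from the example table, are the required ones
REQUIRED_COLUMNS = [
    "relation_type",
    "resource",
    "source_id",
    "source_type",
    "target_id",
    "target_type",
    "source_name",
    "target_name",
]


def missing_columns(tsv_text: str) -> list:
    """Return the required columns that are absent from the TSV header."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    header = next(reader, [])  # an empty input yields an empty header
    return [column for column in REQUIRED_COLUMNS if column not in header]


# Header in the column order produced later in this notebook (with an extra "pmid" column)
example_header = (
    "source_name\tsource_type\ttarget_name\ttarget_type\t"
    "source_id\ttarget_id\tpmid\trelation_type\tresource\n"
)
print(missing_columns(example_header))
```

Column order does not matter for this check, and extra columns such as `pmid` are ignored.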

In [4]:
# We assume the entity file has already been generated and placed at
# ROOT_DIR/graph_data/entities.tsv, where ROOT_DIR is the root directory
# of the BioMedGPS Data Repository.
entity_file = "../../entities.tsv"

entity_df = pd.read_csv(entity_file, sep="\t", low_memory=False)


In [None]:
import os
import os.path as osp
import subprocess


def format_cbcg(filename):
    def get_project_root():
        try:
            return osp.dirname(osp.dirname(os.getcwd()))
        except Exception as e:
            raise RuntimeError(f"Failed to determine project root: {e}")

    try:
        root_dir = get_project_root()
        print(f"Project root directory: {root_dir}")
    except RuntimeError as e:
        print(e)
        exit(1)

    database = "customdb"
    relations_path = osp.join(
        root_dir,
        "relations",
        "cbcg",
        filename,
    )
    output_dir = osp.join(
        root_dir, "formatted_relations", "cbcg"
    )
    entities_path = osp.join(root_dir, "entities.tsv")
    log_file = osp.join(output_dir, "log.txt")

    command = [
        "graph-builder",
        "--database",
        database,
        "-d",
        relations_path,
        "-o",
        output_dir,
        "-f",
        entities_path,
        "-n",
        "20",
        "--download",
        "--skip",
        "-l",
        log_file,
        "--debug",
    ]

    print("Executing command:", " ".join(command))

    try:
        subprocess.run(command, check=True)
    except FileNotFoundError:
        print(
            "Error: 'graph-builder' command not found. Make sure it is installed and available in the PATH."
        )
        exit(1)
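    except subprocess.CalledProcessError as e:
        # Output is not captured by subprocess.run here, so e.output would be None;
        # point to the log file passed via "-l" instead.
        print(f"Error: Command execution failed with return code {e.returncode}")
        print(f"See the log file for details: {log_file}")
        exit(1)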
    except Exception as e:
        print(f"Unexpected error: {e}")
        exit(1)

In [5]:
# Filter out the rows marked as Protective alleles; all other rows are kept
df = df[df["Disease or Protective allele"] != "Protective"]

formatted_df = pd.DataFrame()
formatted_df["source_name"] = df["Gene"]
formatted_df["source_type"] = "Gene"
formatted_df["target_name"] = "Breast Cancer"
formatted_df["target_type"] = "Disease"

# Look up each gene's entity ID by name (first match wins);
# genes that are missing from the entity file get None.
gene_entities = entity_df[entity_df["label"] == "Gene"].drop_duplicates("name")
gene_id_by_name = dict(zip(gene_entities["name"], gene_entities["id"]))
source_ids = [gene_id_by_name.get(gene) for gene in df["Gene"]]

# The target is the same disease entity for every row, so resolve it once.
disease_match = entity_df.loc[
    (entity_df["name"] == "Breast Cancer") & (entity_df["label"] == "Disease"), "id"
]
target_id = disease_match.values[0] if not disease_match.empty else None

formatted_df["source_id"] = source_ids
formatted_df["target_id"] = target_id

formatted_df["pmid"] = df["PMID"]
formatted_df["relation_type"] = "GNBR::Y::Gene:Disease"
formatted_df["resource"] = "CBCG"

formatted_df.to_csv("formatted_cbcg.tsv", index=False, sep="\t")
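
Genes that are not found in the entity file end up with an empty `source_id` in the written TSV. The sketch below lists them so they can be reviewed before running the formatter. The function name `unmapped_genes` is hypothetical, and the sample rows are made-up placeholders rather than real CBCG or entity data.

```python
import csv
import io


def unmapped_genes(tsv_text: str) -> list:
    """Return the sorted, de-duplicated gene names whose source_id is empty."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return sorted({row["source_name"] for row in reader if not row["source_id"]})


# Made-up sample in the same layout as the relevant columns of formatted_cbcg.tsv
sample = (
    "source_name\tsource_id\n"
    "GENE_A\tENTREZ:0\n"
    "GENE_B\t\n"
    "GENE_B\t\n"
)
print(unmapped_genes(sample))
```

To run it on the real output, pass the file contents, for example `unmapped_genes(open("formatted_cbcg.tsv", encoding="utf-8").read())`.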

In [None]:
format_cbcg("formatted_cbcg.tsv")