# ChEMBL Evidence Data Download


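This notebook downloads ChEMBL evidence from the Open Targets Platform using rsync, together with the drug mechanism of action and drug molecule datasets. The evidence covers drug-target-disease associations derived from the ChEMBL database; the three datasets are then joined with PySpark and written to CSV.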


## 1. Download ChEMBL Evidence Data


In [61]:
%%bash
release="25.09"
source_id="chembl"
evidence_path="evidence/sourceId=${source_id}"

# Create output directory
mkdir -p ../tmp/evidence_chembl

# Download ChEMBL evidence data using rsync
echo "Downloading ChEMBL evidence data from Open Targets Platform release ${release}..."
rsync -rpltvz --delete rsync.ebi.ac.uk::pub/databases/opentargets/platform/${release}/output/${evidence_path} ../tmp/evidence_chembl

# Download mechanism of action data
echo "Downloading mechanism of action data from Open Targets Platform release ${release}..."
rsync -rpltvz --delete rsync.ebi.ac.uk::pub/databases/opentargets/platform/${release}/output/drug_mechanism_of_action ../tmp/

# Download molecule data
echo "Downloading molecule data from Open Targets Platform release ${release}..."
rsync -rpltvz --delete rsync.ebi.ac.uk::pub/databases/opentargets/platform/${release}/output/drug_molecule ../tmp/

echo "Download completed!"

Downloading ChEMBL evidence data from Open Targets Platform release 25.09...


Transfer starting: 101 files

sent 16 bytes  received 7031 bytes  70470000 bytes/sec
total size is 47051903  speedup is 6676.87
Downloading mechanism of action data from Open Targets Platform release 25.09...
Transfer starting: 3 files

sent 16 bytes  received 177 bytes  1930000 bytes/sec
total size is 478223  speedup is 2477.83
Downloading molecule data from Open Targets Platform release 25.09...
Transfer starting: 3 files

sent 16 bytes  received 166 bytes  1820000 bytes/sec
total size is 2275731  speedup is 12503.95
Download completed!
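
A quick way to confirm that the three transfers produced data is to count the Parquet files in each destination directory. The sketch below assumes the directory layout created by the commands above and that data files carry a `.parquet` suffix; the helper name `count_parquet_files` is illustrative only.

```python
from pathlib import Path


def count_parquet_files(directory: str) -> int:
    """Count .parquet files under `directory`, including subdirectories."""
    root = Path(directory)
    if not root.is_dir():
        # A missing directory is reported as zero files rather than an error.
        return 0
    return sum(1 for path in root.rglob("*.parquet") if path.is_file())


# Destination directories used by the rsync commands in this notebook.
for dataset_dir in (
    "../tmp/evidence_chembl",
    "../tmp/drug_mechanism_of_action",
    "../tmp/drug_molecule",
):
    print(dataset_dir, count_parquet_files(dataset_dir))
```

A count of zero for any directory suggests that the transfer failed or that the release path has changed.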


## 2. Python environment and Spark session

In [62]:
import pyspark.sql.functions as f
from pyspark.sql import SparkSession

# Starting a Spark session
spark = SparkSession.builder.config("spark.driver.memory", "8g").getOrCreate()

## 3. Read relevant datasets

In [None]:
# Target-disease-drug evidence from ChEMBL
chembl = spark.read.parquet("../tmp/evidence_chembl")
# chembl.printSchema()

# Drug molecule information
molecule = spark.read.parquet("../tmp/drug_molecule")
# molecule.printSchema()

# Drug mechanism of action
chembl_moa = spark.read.parquet("../tmp/drug_mechanism_of_action")
# chembl_moa.printSchema()

# One row per drug-target pair. The two arrays are exploded in separate steps
# because some Spark versions allow only one generator per select clause.
moas = (
    chembl_moa.withColumn("drugId", f.explode("chemblIds"))
    .withColumn("targetId", f.explode("targets"))
    .select("drugId", "mechanismOfAction", "actionType", "targetId")
    .distinct()
)
moas.show(truncate=False)

+-------------+-----------------------------------------------------------------------+-------------+---------------+
|drugId       |mechanismOfAction                                                      |actionType   |targetId       |
+-------------+-----------------------------------------------------------------------+-------------+---------------+
|CHEMBL2424780|Toll-like receptor 7 activator                                         |ACTIVATOR    |ENSG00000196664|
|CHEMBL1963249|Ghrelin receptor agonist                                               |AGONIST      |ENSG00000121853|
|CHEMBL4594542|IL22 Receptor agonist                                                  |AGONIST      |ENSG00000142677|
|CHEMBL1201540|Insulin receptor agonist                                               |AGONIST      |ENSG00000171105|
|CHEMBL2108973|Insulin receptor agonist                                               |AGONIST      |ENSG00000171105|
|CHEMBL1201777|Mu opioid receptor agonist               
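
To make the reshaping concrete, here is a plain-Python sketch of what exploding both `chemblIds` and `targets` does to a single mechanism-of-action record: every drug in the record is paired with every target. The record below is hypothetical and its identifiers are placeholders, not real ChEMBL or Ensembl IDs; the function name is illustrative.

```python
from itertools import product


def explode_drug_target_pairs(record: dict) -> list[dict]:
    """Expand one mechanism-of-action record into one row per drug-target pair."""
    return [
        {
            "drugId": drug_id,
            "mechanismOfAction": record["mechanismOfAction"],
            "actionType": record["actionType"],
            "targetId": target_id,
        }
        for drug_id, target_id in product(record["chemblIds"], record["targets"])
    ]


# Hypothetical record: identifiers are placeholders, not real ChEMBL/Ensembl IDs.
example = {
    "chemblIds": ["CHEMBL_A", "CHEMBL_B"],
    "targets": ["ENSG_X", "ENSG_Y"],
    "mechanismOfAction": "Example receptor agonist",
    "actionType": "AGONIST",
}
rows = explode_drug_target_pairs(example)
print(len(rows))  # 2 drugs x 2 targets
```

A record with an empty `chemblIds` or `targets` list yields no rows here, which mirrors how Spark's `explode` drops rows whose array is empty. The notebook additionally applies `distinct()` to remove duplicate pairs.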

## 4. Extract relevant information

## Column Descriptions

The following columns are selected from the DataFrame that joins ChEMBL evidence with drug mechanism of action and molecule information. The descriptions are based on the Open Targets Platform schema documentation, available in the [Croissant file](https://platform.opentargets.org/downloads):

### Core Identifiers
- **targetId**: Ensembl gene identifier for the target protein (e.g., ENSG00000000938)
- **diseaseId**: Disease identifier from EFO (Experimental Factor Ontology) (e.g., EFO_0000707)
- **drugId**: ChEMBL compound identifier (e.g., CHEMBL1421)

### Drug Information
- **name**: Drug name or compound name (e.g., DASATINIB)
- **mechanismOfAction**: Detailed description of how the drug acts on the target (e.g., "SRC inhibitor")
- **actionType**: Type of pharmacological action (e.g., INHIBITOR, AGONIST, ANTAGONIST, BLOCKER)
- **variantEffect**: Simplification of `actionType` indicating whether the drug's effect on the target can be interpreted as loss of function (LoF) or gain of function (GoF)

### Clinical Trial Information
- **clinicalPhase**: Clinical trial phase of the study, indicating the stage of clinical development (e.g., 2.0)
- **clinicalStatus**: Current status of the clinical trial (e.g., "Recruiting", "Completed", "Active, not recruiting")
- **studyStartDate**: Date when the clinical study started (YYYY-MM-DD format)
- **studyStopReason**: Reason why the study was stopped (if applicable)
- **studyStopReasonCategories**: Categorized reasons for study termination


### Additional Resources
- **urls**: Array of URLs providing additional information about the study, typically including ClinicalTrials.gov links
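
The identifier formats described above can be checked with a few regular expressions. This is a minimal sketch: the patterns are inferred only from the example values in this list, and the `diseaseId` pattern is deliberately loose on the assumption that prefixes other than `EFO` may appear. The names `ID_PATTERNS` and `invalid_fields` are illustrative.

```python
import re

# Patterns inferred from the example identifiers in the column descriptions.
ID_PATTERNS = {
    "targetId": re.compile(r"ENSG\d{11}"),
    # Assumption: ontology prefix, underscore, digits (e.g., EFO_0000707).
    "diseaseId": re.compile(r"[A-Za-z]+_\d+"),
    "drugId": re.compile(r"CHEMBL\d+"),
}


def invalid_fields(row: dict) -> list[str]:
    """Return the identifier columns whose values do not match the expected pattern."""
    return [
        column
        for column, pattern in ID_PATTERNS.items()
        if not pattern.fullmatch(str(row.get(column, "")))
    ]


example_row = {
    "targetId": "ENSG00000000938",
    "diseaseId": "EFO_0000707",
    "drugId": "CHEMBL1421",
}
print(invalid_fields(example_row))
```

An empty list means all three identifiers look well formed; any column name returned points at a value worth inspecting.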

In [75]:
out = (
    chembl.join(moas, on=["drugId", "targetId"], how="left")
    .join(molecule.withColumnRenamed("id", "drugId"), on="drugId", how="left")
    .select(
        "targetId",
        "diseaseId",
        "drugId",
        "name",
        "mechanismOfAction",
        "actionType",
        "clinicalPhase",
        "variantEffect",
        "clinicalStatus",
        "studyStartDate",
        "studyStopReason",
        "studyStopReasonCategories",
        "urls",
    )
)
out.show()

+---------------+-------------+-------------+--------------------+--------------------+--------------------+-------------+-------------+--------------------+--------------+---------------+-------------------------+--------------------+
|       targetId|    diseaseId|       drugId|                name|   mechanismOfAction|          actionType|clinicalPhase|variantEffect|      clinicalStatus|studyStartDate|studyStopReason|studyStopReasonCategories|                urls|
+---------------+-------------+-------------+--------------------+--------------------+--------------------+-------------+-------------+--------------------+--------------+---------------+-------------------------+--------------------+
|ENSG00000000938|  EFO_0000707|   CHEMBL1421|           DASATINIB|       SRC inhibitor|           INHIBITOR|          2.0|          LoF|          Recruiting|    2023-02-05|           NULL|                     NULL|[{ClinicalTrials,...|
|ENSG00000004468|  EFO_0000203|CHEMBL3545131|          I
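
The behaviour of the left join on `drugId` and `targetId` can be illustrated in plain Python. In this sketch, an evidence row is kept even when no mechanism of action matches, and it is repeated once per match when several mechanisms exist for the same drug-target pair. All rows are hypothetical placeholders and the function is illustrative, not part of the notebook.

```python
from collections import defaultdict


def left_join(evidence: list[dict], moas: list[dict]) -> list[dict]:
    """Left-join evidence rows to mechanism-of-action rows on (drugId, targetId)."""
    by_key = defaultdict(list)
    for moa in moas:
        by_key[(moa["drugId"], moa["targetId"])].append(moa)

    joined = []
    for row in evidence:
        matches = by_key.get((row["drugId"], row["targetId"]))
        if not matches:
            # No mechanism recorded: keep the evidence row with null MoA fields.
            joined.append({**row, "mechanismOfAction": None, "actionType": None})
            continue
        for moa in matches:
            # One output row per matching mechanism.
            joined.append({**row, **moa})
    return joined


# Hypothetical placeholder rows.
evidence_rows = [
    {"drugId": "CHEMBL_A", "targetId": "ENSG_X", "diseaseId": "EFO_1"},
    {"drugId": "CHEMBL_B", "targetId": "ENSG_Y", "diseaseId": "EFO_2"},
]
moa_rows = [
    {"drugId": "CHEMBL_A", "targetId": "ENSG_X",
     "mechanismOfAction": "Example inhibitor", "actionType": "INHIBITOR"},
    {"drugId": "CHEMBL_A", "targetId": "ENSG_X",
     "mechanismOfAction": "Example blocker", "actionType": "BLOCKER"},
]
result = left_join(evidence_rows, moa_rows)
print(len(result))  # first evidence row appears twice, second once
```

If the joined table has more rows than the evidence table, repeated matches of this kind are a likely cause.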

## 5. Write output

In [77]:
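# Write the output as CSV. The nested columns (urls, studyStopReasonCategories)
# are dropped because the CSV writer cannot store them, and coalesce(1)
# produces a single part file.
# The full DataFrame can instead be written as Parquet:
# out.write.parquet("../tmp/chembl_evidence_download_parquet", mode="overwrite")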
out.drop("urls", "studyStopReasonCategories").coalesce(1).write.csv(
    "../tmp/chembl_evidence_download_csv",
    mode="overwrite",
    header=True,
)
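
Spark writes the CSV as a directory that contains a `part-*.csv` file alongside marker files such as `_SUCCESS`. The sketch below locates that single part file and reads its header row using only the standard library; the function name `read_spark_csv_header` is illustrative.

```python
import csv
from pathlib import Path


def read_spark_csv_header(output_dir: str) -> list[str]:
    """Return the header row of the single part file in a Spark CSV output directory."""
    part_files = sorted(Path(output_dir).glob("part-*.csv"))
    if len(part_files) != 1:
        # coalesce(1) should leave exactly one part file; anything else is unexpected.
        raise RuntimeError(
            f"Expected exactly one part file in {output_dir}, found {len(part_files)}"
        )
    with part_files[0].open(newline="", encoding="utf-8") as handle:
        return next(csv.reader(handle))
```

Calling `read_spark_csv_header("../tmp/chembl_evidence_download_csv")` after the write should return the selected column names, without the two dropped columns.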
