# Exploration of Open Targets datasets

The purpose of this notebook is to explore the Open Targets datasets available at [Open Targets Data Downloads](https://platform.opentargets.org/downloads).

Open Targets provides a variety of datasets that can be used for drug target identification and prioritization.
The datasets include information on diseases, targets, evidence, and associations.

## Dataset sources

The datasets are available as parquet files on:

- EBI FTP: `ftp://ftp.ebi.ac.uk/pub/databases/opentargets/platform/{release}/output/{dataset_name}`
- Google Cloud Storage: `gs://open-targets-data-releases/{release}/output/{dataset_name}`

Here `{release}` is the release version of the Open Targets Platform (e.g., `25.09`, the release used throughout this notebook), and `{dataset_name}` is the name of the dataset (e.g., `disease`, `target`, `evidence`).
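As a minimal sketch, the two location templates can be filled in programmatically. The helper name `dataset_locations` is hypothetical (it is not part of any Open Targets tooling), and the templates assume the `{release}/output/{dataset_name}` layout used by the commands later in this notebook.

```python
# Location templates, assuming the {release}/output/{dataset_name} layout.
FTP_TEMPLATE = (
    "ftp://ftp.ebi.ac.uk/pub/databases/opentargets/platform/"
    "{release}/output/{dataset_name}"
)
GCS_TEMPLATE = "gs://open-targets-data-releases/{release}/output/{dataset_name}"


def dataset_locations(release: str, dataset_name: str) -> dict[str, str]:
    """Return the FTP and GCS locations of one dataset (hypothetical helper)."""
    return {
        "ftp": FTP_TEMPLATE.format(release=release, dataset_name=dataset_name),
        "gcs": GCS_TEMPLATE.format(release=release, dataset_name=dataset_name),
    }


# Print both locations of the credible_set dataset for release 25.09.
for source, url in dataset_locations("25.09", "credible_set").items():
    print(f"{source}: {url}")
```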


## Listing datasets

The available datasets can be found by:

- browsing the [Open Targets Data Downloads](https://platform.opentargets.org/downloads) page
- listing the FTP server
- listing the GCS bucket
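The FTP listing can also be done from Python with the standard-library `ftplib` module, which may be handy when `curl` is not available. This is only a sketch: `list_datasets` is a hypothetical helper, and it assumes the EBI server accepts anonymous FTP logins and exposes the same path as the download URL.

```python
import ftplib
import posixpath


def list_datasets(release: str, host: str = "ftp.ebi.ac.uk") -> list[str]:
    """Return the dataset directory names of one release (hypothetical helper)."""
    output_dir = f"/pub/databases/opentargets/platform/{release}/output"
    with ftplib.FTP(host) as ftp:
        ftp.login()  # anonymous login (assumed to be permitted)
        ftp.cwd(output_dir)
        # NLST may return bare names or full paths depending on the server,
        # so keep only the last path component of each entry.
        names = (posixpath.basename(entry.rstrip("/")) for entry in ftp.nlst())
        return sorted(names)
```

Calling `list_datasets("25.09")` needs network access and should return roughly the same names as the `curl -l` cell.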


In [12]:
## Listing available datasets

!curl -l ftp://ftp.ebi.ac.uk/pub/databases/opentargets/platform/25.09/output/


association_by_datasource_direct
association_by_datasource_indirect
association_by_datatype_direct
association_by_datatype_indirect
association_by_overall_indirect
association_overall_direct
biosample
colocalisation_coloc
colocalisation_ecaviar
credible_set
disease
disease_hpo
disease_phenotype
drug_indication
drug_mechanism_of_action
drug_molecule
evidence
expression
go
interaction
interaction_evidence
interval
known_drug
l2g_prediction
literature
literature_vector
mouse_phenotype
openfda_significant_adverse_drug_reactions
openfda_significant_adverse_target_reactions
pharmacogenomics
reactome
so
study
target
target_essentiality
target_prioritisation
variant


In [13]:
# !gcloud storage ls --billing-project=open-targets-genetics-dev gs://open-targets-data-releases/25.09/output/


## Downloading datasets

Datasets can be downloaded with `gcloud storage` from Google Cloud Storage or with `rsync` from the EBI FTP server.

Below you can find examples of how to download the `credible_set` dataset using both methods.


In [14]:
# rsync from EBI FTP
!mkdir -p ../tmp/
!rsync -rpltvz --delete rsync.ebi.ac.uk::pub/databases/opentargets/platform/25.09/output/credible_set ../tmp/.


Transfer starting: 27 files
credible_set/

sent 22 bytes  received 1835 bytes  18570000 bytes/sec
total size is 2588111695  speedup is 1393705.06


In [15]:
# gcloud storage
# !gcloud auth login
# !mkdir -p ../tmp/
# !gcloud storage rsync \
#     --billing-project=open-targets-genetics-dev \
#     --recursive \
#     --delete-unmatched-destination-objects \
#     gs://open-targets-data-releases/25.09/output/credible_set ../tmp/credible_set


Each dataset is a directory of Parquet partition files (written natively with PySpark).
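To check what a download produced, the partition files of a dataset directory can be listed with `pathlib`. The sketch below uses a hypothetical helper, `summarize_partitions`, and assumes the partitions sit directly inside the dataset directory with a `.parquet` suffix; other files, such as marker files a writer may leave behind, are ignored.

```python
from pathlib import Path


def summarize_partitions(dataset_dir):
    """Return the sorted partition files and their combined size in bytes."""
    parts = sorted(Path(dataset_dir).glob("*.parquet"))
    total_bytes = sum(part.stat().st_size for part in parts)
    return parts, total_bytes
```

For example, `parts, total_bytes = summarize_partitions("../tmp/credible_set")` gives the number of partitions as `len(parts)` and the on-disk size as `total_bytes`, which can be compared with the total size reported by `rsync`.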


## Loading datasets

To load the data one can use:

- pyspark (native for Open Targets)
- polars
- dask

Note:
Reading a full dataset with pandas may not be feasible because of its size; prefer a framework that supports lazy evaluation.
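As an illustration of lazy evaluation, polars provides `scan_parquet`, which builds a query plan instead of loading the files eagerly; data is read only when `collect()` is called. The helper `preview_lazily` below is hypothetical, and the column names in the usage note are taken from the `credible_set` schema printed further down in this notebook.

```python
def preview_lazily(dataset_glob, columns, n_rows=5):
    """Collect the first rows of selected columns without loading the whole dataset."""
    # Imported inside the function so the module loads even when polars is absent.
    import polars as pl

    return (
        pl.scan_parquet(dataset_glob)  # lazy: nothing is read yet
        .select(columns)               # keep only the requested columns
        .head(n_rows)                  # limit the number of rows
        .collect()                     # run the query and materialise the result
    )
```

For example, `preview_lazily("../tmp/credible_set/*.parquet", ["studyLocusId", "studyId", "variantId"])` returns a small polars `DataFrame`. Because the selection and row limit are part of the query plan, polars can avoid reading columns and rows that are not needed.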


#### Pyspark


In [16]:
### Pyspark - recommended!
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("Exploring OT Datasets")
    .config("spark.driver.memory", "10g")
    .getOrCreate()
)
spark


In [17]:
cs = spark.read.parquet("../tmp/credible_set")
cs.show(1)


+--------------------+--------------------+--------------+----------+--------+--------------------+--------+------+--------------+--------------+-------------------------------+-------------+-------------------+---------------+-----------------+----------------+------------------+------------+-----------+----------+--------+----------+-----+--------------------+--------------------+---------+----------+
|        studyLocusId|             studyId|     variantId|chromosome|position|              region|    beta|zScore|pValueMantissa|pValueExponent|effectAlleleFrequencyFromSource|standardError|subStudyDescription|qualityControls|finemappingMethod|credibleSetIndex|credibleSetlog10BF|purityMeanR2|purityMinR2|locusStart|locusEnd|sampleSize|ldSet|               locus|          confidence|studyType|isTransQtl|
+--------------------+--------------------+--------------+----------+--------+--------------------+--------+------+--------------+--------------+-------------------------------+---------

#### Polars


In [18]:
### Polars
import polars as pl

cs = pl.read_parquet(
    "../tmp/credible_set/*.parquet"
)  # Need to use glob for reading all parquet partitions
cs.head(1)


studyLocusId,studyId,variantId,chromosome,position,region,beta,zScore,pValueMantissa,pValueExponent,effectAlleleFrequencyFromSource,standardError,subStudyDescription,qualityControls,finemappingMethod,credibleSetIndex,credibleSetlog10BF,purityMeanR2,purityMinR2,locusStart,locusEnd,sampleSize,ldSet,locus,confidence,studyType,isTransQtl
str,str,str,str,i32,str,f64,f64,f32,i32,f32,f64,str,list[str],str,i32,f64,f64,f64,i32,i32,i32,list[struct[2]],list[struct[10]],str,str,bool
"""e62f70cd4aa982aad471ba67e3915e…","""gtex_tx_brain_cerebellum_enst0…","""1_14677_G_A""","""1""",14677,"""chr1:-826138-1173862""",1.17782,,9.527,-8,,0.212414,,[],"""SuSie""",1,6.214039,,,,,,,"[{true,true,14.308355,0.95,""1_14677_G_A"",9.527,-8,1.17782,0.212414,null}]","""SuSiE fine-mapped credible set…","""eqtl""",False


#### Dask


In [19]:
### Dask

from dask.distributed import Client

client = Client(n_workers=1, threads_per_worker=4, processes=True, memory_limit="10GB")


Perhaps you already have a cluster running?
Hosting the HTTP server on port 55292 instead


In [20]:
import dask.dataframe as dd

cs = dd.read_parquet("../tmp/credible_set/*.parquet", engine="pyarrow")
cs.head(1)


Unnamed: 0,studyLocusId,studyId,variantId,chromosome,position,region,beta,zScore,pValueMantissa,pValueExponent,...,purityMeanR2,purityMinR2,locusStart,locusEnd,sampleSize,ldSet,locus,confidence,studyType,isTransQtl
0,e62f70cd4aa982aad471ba67e3915ec3,gtex_tx_brain_cerebellum_enst00000491962,1_14677_G_A,1,14677,chr1:-826138-1173862,1.17782,,9.527,-8,...,,,,,,,"[{'is95CredibleSet': True, 'is99CredibleSet': ...",SuSiE fine-mapped credible set with in-sample LD,eqtl,False
