# Reading Open Targets Platform data hosted on AWS

Open Targets Platform datasets are available through Amazon's [Registry of Open Data](https://registry.opendata.aws/) initiative. The released datasets are located under `s3://open-targets-public-data-releases/platform/`. Users can interact with these datasets directly, without downloading a local copy. This notebook shows how to access the datasets with [pandas](https://pandas.pydata.org/), [polars](https://pola.rs/) and [pyspark](https://spark.apache.org/docs/latest/api/python/index.html).

Although these examples are Python-only, any programming language with a framework that supports the [Parquet format](https://en.wikipedia.org/wiki/Apache_Parquet) and can interact with Amazon S3 storage can achieve the same.
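The dataset paths used in this notebook follow a regular pattern, so they can be composed programmatically. The sketch below is illustrative only: `dataset_uri` is a hypothetical helper (not part of any Open Targets library), and the `platform/<release>/output/<dataset>` layout is an assumption taken from the paths shown in this notebook for release `25.12`. Other releases may be organised differently.

```python
# Bucket name as given in the notebook introduction.
BUCKET = "open-targets-public-data-releases"


def dataset_uri(dataset: str, release: str = "25.12") -> str:
    """Compose the S3 URI of a Platform dataset directory.

    Hypothetical helper: it only concatenates strings and does not
    check that the release or the dataset actually exists.
    """
    dataset = dataset.strip("/")
    if not dataset:
        raise ValueError("dataset name must not be empty")
    return f"s3://{BUCKET}/platform/{release}/output/{dataset}"


# Paths equivalent to the ones used in the examples of this notebook.
print(dataset_uri("association_overall_direct"))
print(dataset_uri("drug_mechanism_of_action"))
```

Keeping the release as a parameter makes it easier to re-run the same analysis against a different data release.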

## Accessing data via pandas

In [2]:
import pandas as pd

associations_path = "s3://open-targets-public-data-releases/platform/25.12/output/association_overall_direct"

pdf = pd.read_parquet(associations_path)

pdf.head()

|   | diseaseId | targetId | score | evidenceCount |
|---|---|---|---|---|
| 0 | DOID_0050890 | ENSG00000001084 | 0.031799 | 4 |
| 1 | DOID_0050890 | ENSG00000002549 | 0.001478 | 1 |
| 2 | DOID_0050890 | ENSG00000004142 | 0.002217 | 1 |
| 3 | DOID_0050890 | ENSG00000004478 | 0.002217 | 1 |
| 4 | DOID_0050890 | ENSG00000004948 | 0.002957 | 1 |


Once the data is read into a pandas dataframe, it can be used in any downstream analysis as usual. Note, however, that pandas has only limited support for nested data structures, so its applicability is limited when working with Platform data.
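One way to work within these limits is to read only the flat columns that are needed. The sketch below is a hedged example rather than the recommended approach: `read_public_parquet` is a hypothetical helper, and it assumes that pandas reads S3 paths through the `s3fs` package, in which case `storage_options={"anon": True}` requests unsigned (anonymous) access. This may be needed on machines with no AWS credentials configured.

```python
from typing import Optional, Sequence

# Assumption: pandas hands these options to s3fs, where `anon=True`
# means "do not sign requests" (suitable for public buckets).
ANONYMOUS_S3 = {"anon": True}


def read_public_parquet(path: str, columns: Optional[Sequence[str]] = None):
    """Read a public parquet dataset into pandas, optionally a column subset.

    Hypothetical helper. pandas is imported inside the function so that
    the helper can be defined even where pandas is not installed.
    """
    import pandas as pd

    return pd.read_parquet(
        path,
        columns=list(columns) if columns else None,
        storage_options=ANONYMOUS_S3,
    )


# Example call (needs network access, pandas, pyarrow and s3fs):
# pdf = read_public_parquet(
#     "s3://open-targets-public-data-releases/platform/25.12/output/association_overall_direct",
#     columns=["diseaseId", "targetId", "score"],
# )
```

Restricting the read to a few columns also reduces the amount of data transferred from S3, because parquet stores data column by column.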

## Accessing data via polars

Polars is a highly scalable framework for working with tabular data, with strong support for nested schemas. It can read directly from an S3 location. However, polars won't automatically expand a directory path in the background, and S3 has no real directories, so the files need to be listed explicitly, for example with a glob pattern:

In [3]:
import polars as pl

# s3_path = "s3://open-targets-public-data-releases/platform/25.12/output/target" <= This path definition fails!
s3_path = "s3://open-targets-public-data-releases/platform/25.12/output/drug_mechanism_of_action/*.parquet"

df = pl.read_parquet(
    s3_path,
    storage_options={
        "skip_signature": "true",
        "region": "eu-west-1"
    },
)

df.glimpse()

Rows: 6505
Columns: 7
$ actionType                    <str> 'ACTIVATOR', 'ACTIVATOR', 'ACTIVATOR', 'ACTIVATOR', 'ACTIVATOR', 'ACTIVATOR', 'ACTIVATOR', 'ACTIVATOR', 'ACTIVATOR', 'ACTIVATOR'
$ mechanismOfAction             <str> 'AMP-activated protein kinase, AMPK activator', 'Acetylcholinesterase activator', 'Antithrombin-III activator', 'Antithrombin-III activator', 'Antithrombin-III activator', 'Antithrombin-III activator', 'Antithrombin-III activator', 'Antithrombin-III activator', 'Antithrombin-III activator', 'Antithrombin-III activator'
$ chemblIds               <list[str]> ['CHEMBL1551724'], ['CHEMBL748', 'CHEMBL1420'], ['CHEMBL1200644', 'CHEMBL1201202'], ['CHEMBL1201414'], ['CHEMBL1201448'], ['CHEMBL1201460'], ['CHEMBL1201476'], ['CHEMBL1201513'], ['CHEMBL1201534'], ['CHEMBL1201657']
$ targetName                    <str> 'AMP-activated protein kinase, AMPK', 'Acetylcholinesterase', 'Antithrombin-III', 'Antithrombin-III', 'Antithrombin-III', 'Antithrombin-III', 'Antithrombin-III'
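The explicit file listing described above can be wrapped in a small helper, combined with a lazy scan so that only the requested columns and rows are fetched. This is a minimal sketch under stated assumptions: `as_parquet_glob` and `scan_public_dataset` are hypothetical names, the helper assumes all files sit directly in the dataset directory with a `.parquet` suffix (as in the glob used above), and the `storage_options` values are copied from the example above.

```python
def as_parquet_glob(dataset_path: str) -> str:
    """Turn a dataset directory path into a glob over its parquet files.

    Paths that already end in `.parquet` (a single file or an existing
    glob pattern) are returned unchanged.
    """
    if dataset_path.endswith(".parquet"):
        return dataset_path
    return dataset_path.rstrip("/") + "/*.parquet"


def scan_public_dataset(dataset_path: str):
    """Lazily scan a public dataset with polars (hypothetical helper).

    polars is imported inside the function so that `as_parquet_glob`
    remains usable where polars is not installed.
    """
    import polars as pl

    return pl.scan_parquet(
        as_parquet_glob(dataset_path),
        storage_options={"skip_signature": "true", "region": "eu-west-1"},
    )


# Example call (needs network access and polars):
# lf = scan_public_dataset(
#     "s3://open-targets-public-data-releases/platform/25.12/output/drug_mechanism_of_action"
# )
# lf.select("actionType", "chemblIds").head(5).collect()
```

Unlike `pl.read_parquet`, `pl.scan_parquet` returns a `LazyFrame`, and nothing is downloaded until `.collect()` is called.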

## Accessing data via pyspark

PySpark is a highly scalable data processing framework with strong support for nested data structures and parallelisation. To read the data directly, credential-less (anonymous) data access needs to be specified when starting the Spark session.

As the default pyspark installation doesn't contain the connectors needed to read from an AWS S3 location, these connectors have to be added when initialising the Spark session (unless they are permanently installed in the local environment).

Another important detail: Hadoop no longer supports the `s3` URI scheme; `s3a` should be used instead.
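Because the pandas and polars examples use `s3://` paths, it can be handy to convert them before passing them to Spark. The following is a small sketch of such a conversion; `to_s3a` is a hypothetical helper name, and it deliberately rejects anything that is not an `s3` or `s3a` URI instead of guessing.

```python
def to_s3a(uri: str) -> str:
    """Rewrite an `s3://` URI to the `s3a://` scheme expected by Hadoop.

    URIs that already use `s3a://` are returned unchanged; any other
    scheme (or a string without a scheme) raises ValueError.
    """
    scheme, separator, rest = uri.partition("://")
    if not separator:
        raise ValueError(f"not a URI with a scheme: {uri!r}")
    if scheme == "s3a":
        return uri
    if scheme == "s3":
        return f"s3a://{rest}"
    raise ValueError(f"unsupported scheme {scheme!r} in {uri!r}")


print(to_s3a("s3://open-targets-public-data-releases/platform/25.12/output/target"))
```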

In [1]:
from pyspark.sql import SparkSession

target_path = 's3a://open-targets-public-data-releases/platform/25.12/output/target'

spark = (
    SparkSession.builder
    # The default pyspark installation lacks the S3 connectors, so they are fetched at session start-up
    .config(
        "spark.jars.packages",
        ",".join([
            "org.apache.hadoop:hadoop-aws:3.3.4",
            "com.amazonaws:aws-java-sdk-bundle:1.12.626",
        ])
    )
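    .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    # Anonymous (credential-less) access; the `spark.hadoop.` prefix passes the option on to Hadoop
    .config("spark.hadoop.fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider")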
    .getOrCreate()
)


spark.read.parquet(target_path).show(1, vertical=True)

-RECORD 0------------------------------------
 id                   | ENSG00000000003      
 approvedSymbol       | TSPAN6               
 biotype              | protein_coding       
 transcriptIds        | [ENST00000496771,... 
 canonicalTranscript  | {ENST00000373020,... 
 canonicalExons       | [100632485, 10063... 
 genomicLocation      | {X, 100627108, 10... 
 alternativeGenes     | NULL                 
 approvedName         | tetraspanin 6        
 go                   | [{GO:0016020, GO_... 
 hallmarks            | NULL                 
 synonyms             | [{Tetraspanin-6, ... 
 symbolSynonyms       | [{TSPAN6, uniprot... 
 nameSynonyms         | [{Tetraspanin-6, ... 
 functionDescriptions | []                   
 subcellularLocations | [{Membrane, unipr... 
 targetClass          | NULL                 
 obsoleteSymbols      | [{TM4SF6, HGNC}]     
 obsoleteNames        | [{transmembrane 4... 
 constraint           | NULL                 
 tep                  | NULL      