# Autoimmune colocalisations

The purpose of this notebook is to extract colocalisations for GWAS credible set associated with autoimmune diseases, including additional metadata about the studies.

1. **Download datasets from Open Targets Platform**

The next snippet allows you to synch the datasets from the EBI FTP server to your local machine. It creates a temporary directory for each dataset and uses `rsync` to download the data. Alternatively, you could download the data directly from EBI FTP sever or the Google Cloud Storage bucket. More info on how to download data in the [Open Targets Platform documentation](https://platform-docs.opentargets.org/data-access/datasets).


In [14]:
%%bash
release="25.06"
datasets=("study" "credible_set" "colocalisation_coloc" "colocalisation_ecaviar" "target" "disease", "biosample")
for dataset in "${datasets[@]}"
    do mkdir -p ../tmp/"${dataset}"
    # Rsync the data from EBI FTP server
    rsync -rpltvz --delete rsync.ebi.ac.uk::pub/databases/opentargets/platform/${release}/output/${dataset} ../tmp/
done


receiving incremental file list

sent 29 bytes  received 161 bytes  380.00 bytes/sec
total size is 93,324,727  speedup is 491,182.77
receiving incremental file list

sent 29 bytes  received 1,843 bytes  1,248.00 bytes/sec
total size is 2,578,157,705  speedup is 1,377,220.00
receiving incremental file list

sent 29 bytes  received 1,853 bytes  3,764.00 bytes/sec
total size is 3,774,708,529  speedup is 2,005,689.97
receiving incremental file list

sent 29 bytes  received 1,847 bytes  3,752.00 bytes/sec
total size is 5,074,430,950  speedup is 2,704,920.55
receiving incremental file list

sent 29 bytes  received 764 bytes  528.67 bytes/sec
total size is 75,451,617  speedup is 95,147.06
receiving incremental file list


rsync: [sender] link_stat "/databases/opentargets/platform/25.06/output/disease," (in pub) failed: No such file or directory (2)



sent 8 bytes  received 149 bytes  314.00 bytes/sec
total size is 0  speedup is 0.00


rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1865) [Receiver=3.2.7]


receiving incremental file list
biosample/
biosample/_SUCCESS
biosample/part-00000-9a0782ce-d126-4164-9b2a-e32529959a6b-c000.snappy.parquet

sent 70 bytes  received 5,497,887 bytes  10,995,914.00 bytes/sec
total size is 6,296,680  speedup is 1.15


2. **Python environment and Spark session**

In [1]:
from pathlib import Path
from pyspark.sql import SparkSession
import pyspark.sql.functions as f

# Starting a Spark session
spark = SparkSession.builder.config("spark.driver.memory", "8g").getOrCreate()


Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/07/04 14:43:03 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


3. **Read downloaded datasets**

In [15]:
# Reading datasets
credible_set = spark.read.parquet(
    str(Path.cwd().joinpath("../tmp/credible_set").resolve())
)
# Union two colocalisation datasets "colocalisation_ecaviar" and "colocalisation_coloc"
colocalisation_coloc = spark.read.parquet(
    str(Path.cwd().joinpath("../tmp/colocalisation_coloc").resolve())
)
colocalisation_ecaviar = spark.read.parquet(
    str(Path.cwd().joinpath("../tmp/colocalisation_ecaviar").resolve())
)
colocalisation = colocalisation_coloc.unionByName(
    colocalisation_ecaviar, allowMissingColumns=True
)
study = spark.read.parquet(str(Path.cwd().joinpath("../tmp/study").resolve()))
target = spark.read.parquet(str(Path.cwd().joinpath("../tmp/target").resolve()))
disease = spark.read.parquet(str(Path.cwd().joinpath("../tmp/disease").resolve()))
biosample = spark.read.parquet(str(Path.cwd().joinpath("../tmp/biosample").resolve()))


4. **Finding all autoimmune diseases according to EFO**

In [3]:
autoimmune_efo = "EFO_0005140"
autoimmune_diseases = (
    disease.filter(f.col("id") == autoimmune_efo)
    .select(f.explode("descendants").alias("diseaseId"))
    .join(
        disease.select(f.col("id").alias("diseaseId"), "name"),
        on="diseaseId",
        how="left",
    )
)
autoimmune_diseases.show(truncate=False)


+-------------+--------------------------------------------------------------------------------+
|diseaseId    |name                                                                            |
+-------------+--------------------------------------------------------------------------------+
|MONDO_0012500|chilblain lupus 1                                                               |
|MONDO_0010894|maturity-onset diabetes of the young type 3                                     |
|MONDO_0024278|proctocolitis                                                                   |
|EFO_0005626  |pancolitis                                                                      |
|EFO_0008613  |pemphigus vegetans                                                              |
|EFO_0803379  |anti-GAD65 autoimmune neurological syndromes                                    |
|MONDO_0005301|multiple sclerosis                                                              |
|EFO_0008605  |IgG/IgA pemphig

5. **Finding all GWAS studies for autoimmune diseases**


In [4]:
auto_gwas_studies = study.withColumn("diseaseId", f.explode("diseaseIds")).join(
    autoimmune_diseases, on="diseaseId", how="inner"
)
auto_gwas_studies.show(1, vertical=True, truncate=False)


25/07/04 14:43:23 WARN SparkStringUtils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.

-RECORD 0--------------------------------------------------------------------------------------------------------------------------------------------------
 diseaseId                          | EFO_0002689                                                                                                          
 studyId                            | GCST004227                                                                                                           
 geneId                             | NULL                                                                                                                 
 projectId                          | GCST                                                                                                                 
 studyType                          | gwas                                                                                                                 
 traitFromSource                    | Obstetric antiphospholipid

                                                                                

6. **Extracting all credible sets in autoimmune studies**


In [5]:
auto_cs = auto_gwas_studies.join(credible_set, on="studyId", how="inner")
auto_cs.count()


                                                                                

7621

7. **Find colocalising molecular QTLs for each credible set**

The colocalisation dataset contains all GWAS - GWAS and GWAS - molQTL credible sets. This order is persisted, therefore all molQTL credible sets are in the right side of the colocalisation results. 

In [25]:
auto_cs_colocalisations = (
    auto_cs.withColumnRenamed("studyLocusId", "leftStudyLocusId")
    .alias("gwas")
    # Bring study information for the GWAS study
    .join(
        study.alias("gwas_study"),
        on=[f.col("gwas_study.studyId") == f.col("gwas.studyId")],
        how="inner",
    )
    # Bring colocalisation results
    .join(
        colocalisation.alias("colocalisation").filter(
            # Sensible filter for colocalisation results
            (f.col("clpp") > 0.01) | (f.col("h4") > 0.8)
        ),
        on="leftStudyLocusId",
        how="inner",
    )
    # ignore GWAS - GWAS colocalisations
    .filter(f.col("rightStudyType") != "gwas")
    # Bring molQTL credible set information
    .join(
        credible_set.alias("molQTL").withColumnRenamed(
            "studyLocusId", "rightStudyLocusId"
        ),
        on="rightStudyLocusId",
        how="inner",
    )
    # Bring study information for the molQTL study
    .join(
        study.alias("molQTL_study"),
        on=[f.col("molQTL_study.studyId") == f.col("molQTL.studyId")],
        how="inner",
    )
    # Add approved symbol for the molQTL gene
    .join(
        target.select("approvedSymbol", "id").alias("molQTL_target"),
        on=f.col("molQTL_target.id") == f.col("molQTL_study.geneId"),
        how="left",
    )
    # Add biosample information
    .join(
        biosample.select("biosampleId", "biosampleName").alias("molQTL_biosample"),
        on=f.col("molQTL_study.biosampleId") == f.col("molQTL_biosample.biosampleId"),
        how="left",
    )
)
# Count of moQTL colocalising with autoimmune disease credible sets across tissues/cell types
auto_cs_colocalisations.count()


                                                                                

208185

8. **Select columns of interest to print/write**

In [None]:
auto_cs_colocalisations_out = auto_cs_colocalisations.select(
    f.col("gwas.leftStudyLocusId").alias("gwas_credible_set"),
    f.col("gwas.studyId").alias("gwas_studyId"),
    f.col("gwas_study.traitFromSource").alias("gwas_trait_from_source"),
    f.col("gwas_study.publicationJournal").alias("gwas_publication_journal"),
    f.col("gwas_study.publicationDate").alias("gwas_publication_date"),
    f.col("gwas.variantId").alias("gwas_lead_variant_id"),
    f.col("gwas.beta").alias("gwas_beta"),
    f.col("gwas.pValueMantissa").alias("gwas_pValueMantissa"),
    f.col("gwas.pValueExponent").alias("gwas_pValueExponent"),
    f.col("colocalisation.colocalisationMethod").alias("colocalisation_method"),
    f.col("colocalisation.clpp").alias("colocalisation_clpp"),
    f.col("colocalisation.h4").alias("colocalisation_h4"),
    f.col("colocalisation.rightStudyLocusId").alias("molQTL_credible_set"),
    f.col("molQTL_study.studyId").alias("molQTL_studyId"),
    f.col("molQTL_study.geneId").alias("molQTL_gene_id"),
    f.col("molQTL_target.approvedSymbol").alias("molQTL_gene_symbol"),
    f.col("molQTL.studyType").alias("molQTL_study_type"),
    f.col("molQTL_biosample.biosampleId").alias("molQTL_biosample_id"),
    f.col("molQTL_biosample.biosampleName").alias("molQTL_biosample_name"),
    f.col("molQTL.variantId").alias("molQTL_lead_variant_id"),
    f.col("molQTL.beta").alias("molQTL_beta"),
    f.col("molQTL.pValueMantissa").alias("molQTL_pValueMantissa"),
    f.col("molQTL.pValueExponent").alias("molQTL_pValueExponent"),
)

auto_cs_colocalisations_out.show(truncate=False)

# This dataframe can be written to different formats including parquet file:
# auto_cs_colocalisations_out.write.parquet(
#     str(Path.cwd().joinpath("../tmp/autoimmune_colocalisations.parquet").resolve()),
#     mode="overwrite",
# )
# or csv:
# auto_cs_colocalisations_out.coalesce(1).write.csv(
#     str(Path.cwd().joinpath("../tmp/autoimmune_colocalisations.csv").resolve()),
#     mode="overwrite",
#     header=True,
# )




+--------------------------------+------------+---------------------------------+------------------------+---------------------+--------------------+--------------------+-------------------+-------------------+---------------------+--------------------+------------------+--------------------------------+-------------------------------------+---------------+------------------+-----------------+-------------------+---------------------+----------------------+--------------------+---------------------+---------------------+
|gwas_credible_set               |gwas_studyId|gwas_trait_from_source           |gwas_publication_journal|gwas_publication_date|gwas_lead_variant_id|gwas_beta           |gwas_pValueMantissa|gwas_pValueExponent|colocalisation_method|colocalisation_clpp |colocalisation_h4 |molQTL_credible_set             |molQTL_studyId                       |molQTL_gene_id |molQTL_gene_symbol|molQTL_study_type|molQTL_biosample_id|molQTL_biosample_name|molQTL_lead_variant_id|molQTL_beta 

                                                                                