# Autoimmune colocalisations

The purpose of this notebook is to extract colocalisations for GWAS credible sets associated with autoimmune diseases, including additional metadata about the studies.

1. **Download datasets from Open Targets Platform**

In [None]:
%%bash
release="25.06"
# l2g_prediction is read in step 3, so it is fetched here as well
datasets=("study" "credible_set" "colocalisation" "target" "disease" "l2g_prediction")
for dataset in "${datasets[@]}"; do
    mkdir -p "../tmp/${dataset}"
    # Rsync the data from the EBI FTP server
    rsync -rpltvz --delete "rsync.ebi.ac.uk::pub/databases/opentargets/platform/${release}/output/${dataset}" ../tmp/
done


receiving incremental file list
study/
study/_SUCCESS
study/part-00000-07bf8025-0ff8-4254-b65e-c2e2662cf4b1-c000.snappy.parquet

sent 70 bytes  received 75,150,372 bytes  2,385,728.32 bytes/sec
total size is 93,324,727  speedup is 1.24
receiving incremental file list
credible_set/
credible_set/_SUCCESS
credible_set/part-00000-33904b03-658b-460b-b4ca-e2facc0a1c98-c000.snappy.parquet
credible_set/part-00001-33904b03-658b-460b-b4ca-e2facc0a1c98-c000.snappy.parquet
credible_set/part-00002-33904b03-658b-460b-b4ca-e2facc0a1c98-c000.snappy.parquet
credible_set/part-00003-33904b03-658b-460b-b4ca-e2facc0a1c98-c000.snappy.parquet
credible_set/part-00004-33904b03-658b-460b-b4ca-e2facc0a1c98-c000.snappy.parquet
credible_set/part-00005-33904b03-658b-460b-b4ca-e2facc0a1c98-c000.snappy.parquet


rsync: [receiver] stat "/tmp/credible_set/.part-00005-33904b03-658b-460b-b4ca-e2facc0a1c98-c000.snappy.parquet.QIGrun" failed: No such file or directory (2)
rsync: [receiver] rename "/tmp/credible_set/.part-00005-33904b03-658b-460b-b4ca-e2facc0a1c98-c000.snappy.parquet.QIGrun" -> "credible_set/part-00005-33904b03-658b-460b-b4ca-e2facc0a1c98-c000.snappy.parquet": No such file or directory (2)


credible_set/part-00006-33904b03-658b-460b-b4ca-e2facc0a1c98-c000.snappy.parquet
credible_set/part-00007-33904b03-658b-460b-b4ca-e2facc0a1c98-c000.snappy.parquet
credible_set/part-00008-33904b03-658b-460b-b4ca-e2facc0a1c98-c000.snappy.parquet
credible_set/part-00009-33904b03-658b-460b-b4ca-e2facc0a1c98-c000.snappy.parquet
credible_set/part-00010-33904b03-658b-460b-b4ca-e2facc0a1c98-c000.snappy.parquet
credible_set/part-00011-33904b03-658b-460b-b4ca-e2facc0a1c98-c000.snappy.parquet
credible_set/part-00012-33904b03-658b-460b-b4ca-e2facc0a1c98-c000.snappy.parquet
credible_set/part-00013-33904b03-658b-460b-b4ca-e2facc0a1c98-c000.snappy.parquet
credible_set/part-00014-33904b03-658b-460b-b4ca-e2facc0a1c98-c000.snappy.parquet
credible_set/part-00015-33904b03-658b-460b-b4ca-e2facc0a1c98-c000.snappy.parquet
credible_set/part-00016-33904b03-658b-460b-b4ca-e2facc0a1c98-c000.snappy.parquet
credible_set/part-00017-33904b03-658b-460b-b4ca-e2facc0a1c98-c000.snappy.parquet
credible_set/part-00018-3390

rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1865) [generator=3.2.7]


receiving incremental file list


rsync: [sender] link_stat "/databases/opentargets/platform/25.06/output/colocalisation" (in pub) failed: No such file or directory (2)



sent 8 bytes  received 155 bytes  108.67 bytes/sec
total size is 0  speedup is 0.00


rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1865) [Receiver=3.2.7]


receiving incremental file list
target/
target/_SUCCESS
target/part-00000-32c6b8fb-df3a-488a-8fe2-2f214b81f4d9-c000.snappy.parquet
target/part-00001-32c6b8fb-df3a-488a-8fe2-2f214b81f4d9-c000.snappy.parquet
target/part-00002-32c6b8fb-df3a-488a-8fe2-2f214b81f4d9-c000.snappy.parquet
target/part-00003-32c6b8fb-df3a-488a-8fe2-2f214b81f4d9-c000.snappy.parquet
target/part-00004-32c6b8fb-df3a-488a-8fe2-2f214b81f4d9-c000.snappy.parquet
target/part-00005-32c6b8fb-df3a-488a-8fe2-2f214b81f4d9-c000.snappy.parquet
target/part-00006-32c6b8fb-df3a-488a-8fe2-2f214b81f4d9-c000.snappy.parquet
target/part-00007-32c6b8fb-df3a-488a-8fe2-2f214b81f4d9-c000.snappy.parquet
target/part-00008-32c6b8fb-df3a-488a-8fe2-2f214b81f4d9-c000.snappy.parquet
target/part-00009-32c6b8fb-df3a-488a-8fe2-2f214b81f4d9-c000.snappy.parquet

sent 241 bytes  received 59,240,400 bytes  1,221,456.52 bytes/sec
total size is 75,451,617  speedup is 1.27
receiving incremental file list
disease/
disease/disease.parquet

sent 51 bytes  rece
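
The log above reports rsync errors (code 23), so it may be worth confirming which dataset directories actually contain Parquet files before reading them. The following is a minimal sketch, assuming the `../tmp/<dataset>` layout used in the download cell; `missing_datasets` is a hypothetical helper name and not part of any library.

```python
from pathlib import Path


def missing_datasets(base_dir, names):
    """Return the dataset names that have no Parquet file under base_dir/<name>."""
    base = Path(base_dir)
    return [
        name
        for name in names
        # A dataset counts as present only if its directory holds a .parquet file
        if not ((base / name).is_dir() and any((base / name).rglob("*.parquet")))
    ]


# Dataset names read later in this notebook (step 3)
expected = ["study", "credible_set", "l2g_prediction", "target", "disease"]
print(missing_datasets("../tmp", expected))
```

An empty list means every expected dataset has at least one Parquet file; any name printed should be downloaded again before continuing.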

2. **Python environment and Spark session**

In [4]:
from pathlib import Path
import pyspark.sql.functions as f
from pyspark.sql import SparkSession

# Starting a Spark session
spark = SparkSession.builder.config("spark.driver.memory", "8g").getOrCreate()


Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/07/01 15:38:35 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


3. **Read downloaded datasets**

In [5]:
# Reading datasets
credible_set = spark.read.parquet(
    str(Path.cwd().joinpath("../tmp/credible_set").resolve())
)
l2g_prediction = spark.read.parquet(
    str(Path.cwd().joinpath("../tmp/l2g_prediction").resolve())
)
study = spark.read.parquet(str(Path.cwd().joinpath("../tmp/study").resolve()))
target = spark.read.parquet(str(Path.cwd().joinpath("../tmp/target").resolve()))
disease = spark.read.parquet(str(Path.cwd().joinpath("../tmp/disease").resolve()))


4. **Finding all autoimmune diseases according to EFO**

In [6]:
autoimmune_efo = "EFO_0005140"
autoimmune_diseases = (
    disease.filter(f.col("id") == autoimmune_efo)
    .select(f.explode("descendants").alias("diseaseId"))
    .join(
        disease.select(f.col("id").alias("diseaseId"), "name"),
        on="diseaseId",
        how="left",
    )
)
autoimmune_diseases.show(truncate=False)


+-------------+--------------------------------------------------------------------------------+
|diseaseId    |name                                                                            |
+-------------+--------------------------------------------------------------------------------+
|MONDO_0012500|chilblain lupus 1                                                               |
|MONDO_0010894|maturity-onset diabetes of the young type 3                                     |
|MONDO_0024278|proctocolitis                                                                   |
|EFO_0005626  |pancolitis                                                                      |
|EFO_0008613  |pemphigus vegetans                                                              |
|EFO_0803379  |anti-GAD65 autoimmune neurological syndromes                                    |
|MONDO_0005301|multiple sclerosis                                                              |
|EFO_0008605  |IgG/IgA pemphig

5. **Finding all GWAS studies for autoimmune diseases**


In [7]:
auto_gwas_studies = study.withColumn("diseaseId", f.explode("diseaseIds")).join(
    autoimmune_diseases, on="diseaseId", how="inner"
)
auto_gwas_studies.show(1, vertical=True, truncate=False)


25/07/01 15:38:47 WARN SparkStringUtils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.

-RECORD 0--------------------------------------------------------------------------------------------------------------------------------------------------
 diseaseId                          | EFO_0002689                                                                                                          
 studyId                            | GCST004227                                                                                                           
 geneId                             | NULL                                                                                                                 
 projectId                          | GCST                                                                                                                 
 studyType                          | gwas                                                                                                                 
 traitFromSource                    | Obstetric antiphospholipid

6. **Extracting all credible sets in autoimmune studies**


In [8]:
auto_cs = auto_gwas_studies.join(credible_set, on="studyId", how="inner")
auto_cs.count()

7621
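
Because `diseaseIds` is exploded before the join, a study annotated with several autoimmune diseases contributes one row per disease, so the count above may be larger than the number of distinct credible sets. The toy example below mirrors the explode and the two joins in plain Python; all identifiers are made up for illustration.

```python
# Hypothetical studies with their diseaseIds, and two credible sets
studies = {"study_1": ["disease_a", "disease_b"], "study_2": ["disease_a"]}
autoimmune = {"disease_a", "disease_b"}
credible_sets = [("cs_1", "study_1"), ("cs_2", "study_2")]

# Explode diseaseIds and keep autoimmune matches (explode + inner join)
study_disease = [
    (study_id, disease_id)
    for study_id, disease_ids in studies.items()
    for disease_id in disease_ids
    if disease_id in autoimmune
]

# Join credible sets to the exploded studies on studyId
joined = [
    (cs_id, study_id, disease_id)
    for cs_id, cs_study in credible_sets
    for study_id, disease_id in study_disease
    if study_id == cs_study
]

n_rows = len(joined)
n_distinct = len({cs_id for cs_id, _, _ in joined})
print(n_rows, n_distinct)  # → 3 2
```

In Spark, the analogous check would compare `auto_cs.count()` with the number of distinct `studyLocusId` values.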

7. **Including top L2G gene for each credible set**

In [9]:
auto_cs_gene = auto_cs.join(
    # Top scoring L2G gene for each studyLocusId
    l2g_prediction
    # Stricter L2G score filters can be applied here, before grouping
    # .filter(f.col("score") > 0.25)
    .groupBy("studyLocusId")
    # max over a (score, geneId) struct keeps the gene with the highest score;
    # f.first would return an arbitrary row of each group
    .agg(f.max(f.struct("score", "geneId")).alias("top"))
    .select(
        "studyLocusId",
        f.col("top.geneId").alias("topGeneId"),
        f.col("top.score").alias("top_L2G_score"),
    )
    .join(
        target.select(
            f.col("id").alias("topGeneId"),
            f.col("approvedSymbol").alias("topGeneSymbol"),
        ),
        on="topGeneId",
        how="left",
    ),
    on="studyLocusId",
    how="left",
)
auto_cs_gene.printSchema()


root
 |-- studyLocusId: string (nullable = true)
 |-- studyId: string (nullable = true)
 |-- diseaseId: string (nullable = true)
 |-- geneId: string (nullable = true)
 |-- projectId: string (nullable = true)
 |-- studyType: string (nullable = true)
 |-- traitFromSource: string (nullable = true)
 |-- traitFromSourceMappedIds: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- biosampleFromSourceId: string (nullable = true)
 |-- pubmedId: string (nullable = true)
 |-- publicationTitle: string (nullable = true)
 |-- publicationFirstAuthor: string (nullable = true)
 |-- publicationDate: string (nullable = true)
 |-- publicationJournal: string (nullable = true)
 |-- backgroundTraitFromSourceMappedIds: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- initialSampleSize: string (nullable = true)
 |-- nCases: integer (nullable = true)
 |-- nControls: integer (nullable = true)
 |-- nSamples: integer (nullable = true)
 |-- cohorts: array (null
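
Selecting the top L2G gene means keeping, within each `studyLocusId`, the prediction with the highest `score`. The toy example below illustrates that selection in plain Python on made-up rows; it sketches the logic only, not the Spark execution, and `top_gene_per_locus` is a hypothetical helper name.

```python
def top_gene_per_locus(rows):
    """Map each studyLocusId to its (geneId, score) pair with the highest score."""
    best = {}
    for locus, gene, score in rows:
        if locus not in best or score > best[locus][1]:
            best[locus] = (gene, score)
    return best


# Hypothetical predictions: (studyLocusId, geneId, score)
rows = [
    ("locus_a", "gene_1", 0.10),
    ("locus_a", "gene_2", 0.85),
    ("locus_b", "gene_3", 0.40),
]
top = top_gene_per_locus(rows)
print(top)
```

Here `locus_a` keeps `gene_2` because 0.85 is its highest score, regardless of the order in which the rows arrive.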

8. **Select columns of interest to print/write**

In [10]:
cs_out = auto_cs_gene.select(
    # About the study
    "studyId",
    "publicationJournal",
    "publicationDate",
    "nSamples",
    "ldPopulationStructure",
    "traitFromSource",
    "diseaseId",
    f.col("name").alias("diseaseName"),
    # About the credible set
    "studyLocusId",
    "pValueMantissa",
    "pValueExponent",
    "beta",
    "standardError",
    "fineMappingMethod",
    "hasSumstats",
    f.col("variantId").alias("leadVariant"),
    # About the gene
    "topGeneSymbol",
    "topGeneId",
    "top_L2G_score",
)

cs_out.show(truncate=False)

# This dataframe can be written to different formats, including Parquet:
# cs_out.write.parquet(
#     str(Path.cwd().joinpath("../tmp/autoimmune_credible_set_parquet").resolve()),
#     mode="overwrite",
# )
# or CSV (ldPopulationStructure is dropped because CSV cannot store nested columns):
# cs_out.drop("ldPopulationStructure").coalesce(1).write.csv(
#     str(Path.cwd().joinpath("../tmp/autoimmune_credible_set_csv").resolve()),
#     mode="overwrite",
#     header=True,
# )


+-------------------------------+-----------------------------+---------------+--------+-----------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------+-------------+----------------------------------+--------------------------------+--------------+--------------+---------------------+-------------+-----------------+-----------+---------------+-------------+---------------+-------------------+
|studyId                        |publicationJournal           |publicationDate|nSamples|ldPopulationStructure                                                              |traitFromSource                                                                                                                                    |diseaseId    |diseaseName                       |studyLocusId                    |pValueMantissa|pValueExponent|bet
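
The p-value of each credible set is stored in two columns, `pValueMantissa` and `pValueExponent`, presumably so that very small values remain representable. The sketch below shows one way to combine them, assuming p = mantissa × 10^exponent; `neg_log10_p` is a hypothetical helper name and the input values are made up.

```python
import math


def neg_log10_p(mantissa, exponent):
    """Return -log10(p) for p = mantissa * 10**exponent, without computing p itself."""
    return -(math.log10(mantissa) + exponent)


# Made-up example: p = 5e-8
print(round(neg_log10_p(5.0, -8), 3))  # → 7.301
```

Working on the log scale avoids underflow: an exponent of -400 is handled without problems, whereas `5.0 * 10**-400` would evaluate to 0.0 as a float.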