# Mapping EFOs for the FinnGen study index using the old study index from the previous prod release

This notebook adds EFOs from the previous prod version of study_index to the new FinnGen study_index, using the trait name as the matching key.

The resulting study index has 1542 rows with non-null EFOs (out of 2408 rows).
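
As a quick sanity check on these two counts, the coverage of the mapping can be worked out directly. This is only arithmetic on the numbers reported in this notebook; the variable names are illustrative.

```python
# Counts reported in this notebook.
total_rows = 2408      # rows in the new FinnGen study index
rows_with_efo = 1542   # rows that received an EFO via the trait-name match

coverage = rows_with_efo / total_rows
rows_without_efo = total_rows - rows_with_efo

print(f"coverage: {coverage:.1%}, unmapped: {rows_without_efo}")
# prints "coverage: 64.0%, unmapped: 866"
```

Roughly a third of the FinnGen studies therefore still have no EFO after this step.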

The new study index is saved here:
"gs://genetics-portal-dev-analysis/yt4/study_index_finngen_with_efo"

In [37]:
!gcloud auth application-default login

Your browser has been opened to visit:

    https://accounts.google.com/o/oauth2/auth?response_type=code&client_id=764086051850-6qr4p6gpi6hn506pt8ejuq83di341hur.apps.googleusercontent.com&redirect_uri=http%3A%2F%2Flocalhost%3A8085%2F&scope=openid+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fuserinfo.email+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fcloud-platform+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fsqlservice.login+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Faccounts.reauth&state=XHb8Uk43SsVjvFRqwgrX4Tgg2tTOHS&access_type=offline&code_challenge=OkiqDAkHXDGEgJQbX8r0ZYKfZ7gcgfXS8mfZc5a913Y&code_challenge_method=S256


Credentials saved to file: [/Users/yt4/.config/gcloud/application_default_credentials.json]

These credentials will be used by any library that requests Application Default Credentials (ADC).

Quota project "open-targets-genetics-dev" was added to ADC which can be used by Google client libraries for billing and quota. Note that some services may still bill the project owning

In [1]:
import os

import hail as hl
import pandas as pd

from gentropy.common.session import Session
from gentropy.dataset.study_index import StudyIndex

pd.set_option("display.max_colwidth", None)
pd.set_option("display.expand_frame_repr", False)

hail_dir = os.path.dirname(hl.__file__)
session = Session(
    hail_home=hail_dir,
    start_hail=True,
    extended_spark_conf={
        "spark.driver.memory": "12g",
        "spark.kryoserializer.buffer.max": "500m",
        "spark.driver.maxResultSize": "3g",
    },
)

24/04/14 16:03:28 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
pip-installed Hail requires additional configuration options in Spark referring
  to the path to the Hail Python module directory HAIL_DIR,
  e.g. /path/to/python/site-packages/hail:
    spark.jars=HAIL_DIR/backend/hail-all-spark.jar
    spark.driver.extraClassPath=HAIL_DIR/backend/hail-all-spark.jar
    spark.executor.extraClassPath=./hail-all-spark.jar
Running on Apache Spark version 3.3.4
SparkUI available at http://192.168.0.232:4040
Welcome to
     __  __     <>__
    / /_/ /__  __/ /
   / __  / _ `/ / /
  /_/ /_/\_,_/_/_/   version 0.2.127-bb535cd096c5
LOGGING: writing to /dev/null


In [2]:
path_si = "gs://genetics_etl_python_playground/releases/24.03/study_index/finngen/study_index"
path_si_old = "gs://genetics-portal-dev-analysis/yt4/study_index.csv"

In [3]:
# The old study index is a tab-separated file despite its .csv extension.
si_old = session.spark.read.csv(path_si_old, header=True, sep="\t")
si_new = StudyIndex.from_parquet(session=session, path=path_si)

                                                                                

In [4]:
si_old.show(5)

+--------------------+-------------------+--------------------+-------+---------+-------------+----+----------+----------+-----------+---------+------------+--------------+-------+--------------------+-----------------+--------------------+
|            study_id|   ancestry_initial|ancestry_replication|n_cases|n_initial|n_replication|pmid|pub_author|  pub_date|pub_journal|pub_title|has_sumstats|num_assoc_loci| source|      trait_reported|       trait_efos|      trait_category|
+--------------------+-------------------+--------------------+-------+---------+-------------+----+----------+----------+-----------+---------+------------+--------------+-------+--------------------+-----------------+--------------------+
|FINNGEN_R6_M13_MU...|['European=253458']|                  []|  108.0|   253458|          0.0|null|FINNGEN_R6|2022-01-24|       null|     null|        True|             0|FINNGEN|Multifocal fibros...|['MONDO_0009230']|immune system dis...|
|FINNGEN_R6_M13_MU...|['European=199

In [5]:
si_new_df = si_new.df
si_new_df.show(5)

+--------------------+-----------+---------+--------------------+------------------------+------+------------------+--------+----------------+----------------------+---------------+------------------+----------------------------------+--------------------+------+---------+--------+---------+---------------------+-------------------+------------------+---------------+-------------+--------------------+-----------+
|             studyId|  projectId|studyType|     traitFromSource|traitFromSourceMappedIds|geneId|tissueFromSourceId|pubmedId|publicationTitle|publicationFirstAuthor|publicationDate|publicationJournal|backgroundTraitFromSourceMappedIds|   initialSampleSize|nCases|nControls|nSamples|  cohorts|ldPopulationStructure|   discoverySamples|replicationSamples|qualityControls|analysisFlags|summarystatsLocation|hasSumstats|
+--------------------+-----------+---------+--------------------+------------------------+------+------------------+--------+----------------+----------------------+-

                                                                                

57246
2408


In [7]:
si_old = si_old.select("trait_reported", "trait_efos")
si_old.show(5)

+--------------------+-----------------+
|      trait_reported|       trait_efos|
+--------------------+-----------------+
|Multifocal fibros...|['MONDO_0009230']|
|Disorders of muscles|  ['EFO_0002970']|
|"""Muscle wasting...|  ['EFO_0009851']|
|Other specified d...|  ['EFO_0002970']|
|       Muscle strain|  ['EFO_0010686']|
+--------------------+-----------------+
only showing top 5 rows



In [8]:
from pyspark.sql.functions import lower

# Lower-case the trait names on both sides so that the join is case-insensitive.
si_old = si_old.withColumn("trait_reported_low", lower(si_old["trait_reported"])).select("trait_reported_low", "trait_efos")
si_new_df = si_new_df.withColumn("trait_reported_low", lower(si_new_df["traitFromSource"]))
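
Lower-casing only removes case differences. The `si_old` preview above shows at least one trait name that starts with stray quote characters (`"""Muscle wasting...`), which would not match a clean name in the new index. A minimal sketch of a stricter normalisation, written in plain Python for illustration: `normalize_trait` is a hypothetical helper, not part of gentropy, and the input string is made up.

```python
def normalize_trait(name: str) -> str:
    """Hypothetical helper: lower-case, strip surrounding quotes, collapse whitespace."""
    # Remove outer whitespace, then any run of double quotes at either end.
    cleaned = name.strip().strip('"').strip()
    # Lower-case and collapse internal runs of whitespace to single spaces.
    return " ".join(cleaned.lower().split())


# Made-up example value, only to show the effect of the helper.
print(normalize_trait('"""Example  Trait"""'))
# prints "example trait"
```

Whether this recovers additional matches was not checked here; to use it in this notebook, the same logic would have to be expressed with Spark column functions or a UDF.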

In [9]:
# Keep one row per trait name so that the left join cannot multiply rows of the new study index.
si_old = si_old.dropDuplicates(["trait_reported_low"])
joined_df = si_new_df.join(si_old, "trait_reported_low", how="left")
joined_df.count()

                                                                                

2408
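
The count stays at 2408 because the old index was deduplicated on the join key before the left join. The same logic can be sketched in plain Python on toy data. All IDs and trait names below are hypothetical placeholders, and keeping the first mapping per key is an assumption of this sketch: Spark's `dropDuplicates` keeps an arbitrary row per key, so if one trait name had several different EFO lists in the old index, the one that survives is not guaranteed.

```python
# Toy stand-ins for the new and old study indexes (hypothetical values).
new_rows = [
    {"studyId": "STUDY_1", "trait": "trait a"},
    {"studyId": "STUDY_2", "trait": "trait b"},
    {"studyId": "STUDY_3", "trait": "trait c"},
]
old_rows = [
    ("trait a", "['EFO_0000001']"),
    ("trait a", "['EFO_0000002']"),  # duplicate key, would double STUDY_1 in a plain join
    ("trait b", "['EFO_0000003']"),
]

# Deduplicate on the key: keep the first mapping seen for each trait name.
lookup = {}
for trait, efos in old_rows:
    lookup.setdefault(trait, efos)

# Left join: every new row is kept; unmatched traits get None.
joined = [{**row, "trait_efos": lookup.get(row["trait"])} for row in new_rows]

print(len(joined), sum(r["trait_efos"] is not None for r in joined))
# prints "3 2"
```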

In [10]:
joined_df.show()

                                                                                

+--------------------+--------------------+-----------+---------+--------------------+------------------------+------+------------------+--------+----------------+----------------------+---------------+------------------+----------------------------------+--------------------+------+---------+--------+---------+---------------------+-------------------+------------------+---------------+-------------+--------------------+-----------+--------------------+
|  trait_reported_low|             studyId|  projectId|studyType|     traitFromSource|traitFromSourceMappedIds|geneId|tissueFromSourceId|pubmedId|publicationTitle|publicationFirstAuthor|publicationDate|publicationJournal|backgroundTraitFromSourceMappedIds|   initialSampleSize|nCases|nControls|nSamples|  cohorts|ldPopulationStructure|   discoverySamples|replicationSamples|qualityControls|analysisFlags|summarystatsLocation|hasSumstats|          trait_efos|
+--------------------+--------------------+-----------+---------+-----------------

In [11]:
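# Count the rows of the new study index that received an EFO from the old one.
num_non_null_rows = joined_df.filter(joined_df.trait_efos.isNotNull()).count()
print(num_non_null_rows)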

                                                                                

1542




In [27]:
path_tmp = "gs://gwas_catalog_data/study_index"
tmp = StudyIndex.from_parquet(session=session, path=path_tmp)
tmp.df.show(50)

+------------+---------+---------+--------------------+------------------------+------+------------------+--------+--------------------+----------------------+---------------+--------------------+----------------------------------+--------------------+------+---------+--------+--------------------+---------------------+--------------------+--------------------+---------------+-------------+--------------------+-----------+
|     studyId|projectId|studyType|     traitFromSource|traitFromSourceMappedIds|geneId|tissueFromSourceId|pubmedId|    publicationTitle|publicationFirstAuthor|publicationDate|  publicationJournal|backgroundTraitFromSourceMappedIds|   initialSampleSize|nCases|nControls|nSamples|             cohorts|ldPopulationStructure|    discoverySamples|  replicationSamples|qualityControls|analysisFlags|summarystatsLocation|hasSumstats|
+------------+---------+---------+--------------------+------------------------+------+------------------+--------+--------------------+----------

In [17]:
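# Overwrite traitFromSourceMappedIds with the EFOs from the old study index (null where no match) and drop the helper columns.
joined_df = joined_df.withColumn("traitFromSourceMappedIds", joined_df["trait_efos"]).drop("trait_efos", "trait_reported_low")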

In [19]:
joined_df.show()

                                                                                

+--------------------+-----------+---------+--------------------+------------------------+------+------------------+--------+----------------+----------------------+---------------+------------------+----------------------------------+--------------------+------+---------+--------+---------+---------------------+-------------------+------------------+---------------+-------------+--------------------+-----------+
|             studyId|  projectId|studyType|     traitFromSource|traitFromSourceMappedIds|geneId|tissueFromSourceId|pubmedId|publicationTitle|publicationFirstAuthor|publicationDate|publicationJournal|backgroundTraitFromSourceMappedIds|   initialSampleSize|nCases|nControls|nSamples|  cohorts|ldPopulationStructure|   discoverySamples|replicationSamples|qualityControls|analysisFlags|summarystatsLocation|hasSumstats|
+--------------------+-----------+---------+--------------------+------------------------+------+------------------+--------+----------------+----------------------+-

In [24]:
column_type = dict(joined_df.dtypes)["traitFromSourceMappedIds"]
print(column_type)

string


In [25]:
from pyspark.sql.functions import from_json
from pyspark.sql.types import ArrayType, StringType

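# The EFOs were read from a text file, so each value is a string such as "['MONDO_0009230']".
# Parse it into array<string>, the type used for this column in the new study index.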
joined_df = joined_df.withColumn(
    "traitFromSourceMappedIds",
    from_json("traitFromSourceMappedIds", ArrayType(StringType()))
)
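
The stored values are list literals with single-quoted items. Spark's JSON parser accepts single quotes by default (the `allowSingleQuotes` option), which is presumably why `from_json` can parse them here. What the parsing step is expected to produce can be illustrated in plain Python with `ast.literal_eval`. This is an analogue for illustration, not the code path Spark uses, and `parse_efo_list` is a hypothetical helper.

```python
import ast

# Two values taken from the si_old preview above, plus a null for an unmatched trait.
raw_values = ["['MONDO_0009230']", "['EFO_0002970']", None]


def parse_efo_list(raw):
    """Hypothetical helper: turn a list-literal string into a list of strings."""
    if raw is None:
        return None  # unmatched traits stay null
    parsed = ast.literal_eval(raw)
    return [str(item) for item in parsed]


parsed_values = [parse_efo_list(value) for value in raw_values]
print(parsed_values)
# prints "[['MONDO_0009230'], ['EFO_0002970'], None]"
```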

In [26]:
joined_df.show()

                                                                                

+--------------------+-----------+---------+--------------------+------------------------+------+------------------+--------+----------------+----------------------+---------------+------------------+----------------------------------+--------------------+------+---------+--------+---------+---------------------+-------------------+------------------+---------------+-------------+--------------------+-----------+
|             studyId|  projectId|studyType|     traitFromSource|traitFromSourceMappedIds|geneId|tissueFromSourceId|pubmedId|publicationTitle|publicationFirstAuthor|publicationDate|publicationJournal|backgroundTraitFromSourceMappedIds|   initialSampleSize|nCases|nControls|nSamples|  cohorts|ldPopulationStructure|   discoverySamples|replicationSamples|qualityControls|analysisFlags|summarystatsLocation|hasSumstats|
+--------------------+-----------+---------+--------------------+------------------------+------+------------------+--------+----------------+----------------------+-

                                                                                

In [28]:
column_type = dict(joined_df.dtypes)["traitFromSourceMappedIds"]
print(column_type)

array<string>


In [29]:
# Wrap the DataFrame in a StudyIndex again, passing the StudyIndex schema.
si = StudyIndex(_df=joined_df, _schema=StudyIndex.get_schema())

In [30]:
si.df.show()

                                                                                

+--------------------+-----------+---------+--------------------+------------------------+------+------------------+--------+----------------+----------------------+---------------+------------------+----------------------------------+--------------------+------+---------+--------+---------+---------------------+-------------------+------------------+---------------+-------------+--------------------+-----------+
|             studyId|  projectId|studyType|     traitFromSource|traitFromSourceMappedIds|geneId|tissueFromSourceId|pubmedId|publicationTitle|publicationFirstAuthor|publicationDate|publicationJournal|backgroundTraitFromSourceMappedIds|   initialSampleSize|nCases|nControls|nSamples|  cohorts|ldPopulationStructure|   discoverySamples|replicationSamples|qualityControls|analysisFlags|summarystatsLocation|hasSumstats|
+--------------------+-----------+---------+--------------------+------------------------+------+------------------+--------+----------------+----------------------+-

In [31]:
si.df.count()

                                                                                

2408

In [34]:
si.df.write.parquet(path="gs://genetics-portal-dev-analysis/yt4/study_index_finngen_with_efo")

                                                                                

In [35]:
# Read the saved study index back to confirm that it loads as a StudyIndex.
path_to_study_index = "gs://genetics-portal-dev-analysis/yt4/study_index_finngen_with_efo"
si = StudyIndex.from_parquet(session=session, path=path_to_study_index)