# Leave-one-group-in analysis

## Reason

We want to understand how excluding one set of features affects the overall locus-to-gene (L2G) model predictions.

## How

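To conduct the analysis, we follow this protocol:

* (1) Obtain the feature matrix
* (2) Restrict the model to the requested feature combination, either by excluding the remaining features or by making them constant. The two approaches are equivalent because we use a gradient boosting model, which never splits on a constant feature
* (3) Run training
* (4) Run predictions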
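
Step (2) treats dropping a feature and holding it constant as interchangeable. The sketch below illustrates why this holds for tree-based learners. It is a toy split search over a single feature, not the L2G training code, and `best_split_gain` is a hypothetical helper introduced only for this illustration. A constant column offers no threshold between distinct values, so its best achievable gain is zero and a tree never selects it.

```python
from statistics import fmean


def sse(ys):
    """Sum of squared errors around the mean (0.0 for an empty list)."""
    if not ys:
        return 0.0
    mean = fmean(ys)
    return sum((y - mean) ** 2 for y in ys)


def best_split_gain(xs, ys):
    """Largest squared-error reduction from a single threshold on xs.

    Returns 0.0 when no threshold separates two distinct values of xs.
    """
    pairs = sorted(zip(xs, ys))
    parent = sse([y for _, y in pairs])
    best = 0.0
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # equal neighbours: no threshold fits between them
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        best = max(best, parent - sse(left) - sse(right))
    return best


labels = [0, 0, 1, 1, 0, 1]
informative = [0.1, 0.2, 0.8, 0.9, 0.3, 0.7]
constant = [0.0] * len(labels)

gain_informative = best_split_gain(informative, labels)
gain_constant = best_split_gain(constant, labels)
print(gain_informative, gain_constant)
```

Because the constant column can never win a split, a boosted ensemble of such trees ignores it in the same way it would ignore a column that is absent.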

## Setup

In [28]:
# Check the Java version (the run below reports OpenJDK 11)
!java -version


openjdk version "11.0.13" 2021-10-19
OpenJDK Runtime Environment JBR-11.0.13.7-1751.21-jcef (build 11.0.13+7-b1751.21)
OpenJDK 64-Bit Server VM JBR-11.0.13.7-1751.21-jcef (build 11.0.13+7-b1751.21, mixed mode)
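
The version check can also be done programmatically. This is a minimal sketch that assumes only the banner format shown in the output above. `java_major_version` is a hypothetical helper, and the acceptable version range is deliberately left out because it depends on the Spark release in use.

```python
import re


def java_major_version(banner: str) -> int:
    """Extract the major Java version from a `java -version` banner line."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', banner)
    if match is None:
        raise ValueError(f"Unrecognised java -version output: {banner!r}")
    major, minor = match.group(1), match.group(2)
    # Releases before Java 9 report themselves as 1.x (1.8.0_292 is Java 8).
    if major == "1" and minor is not None:
        return int(minor)
    return int(major)


banner = 'openjdk version "11.0.13" 2021-10-19'
print(java_major_version(banner))
```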


In [29]:
!rsync -rpltvz --delete rsync.ebi.ac.uk::pub/databases/opentargets/platform/25.03/intermediate/l2g_feature_matrix ../../data/.
!rsync -rpltvz --delete rsync.ebi.ac.uk::pub/databases/opentargets/platform/25.03/input/l2g/gold_standard.json ../../data/.


receiving incremental file list

sent 29 bytes  received 13.403 bytes  26.864,00 bytes/sec
total size is 836.131.251  speedup is 62.249,20
receiving incremental file list

sent 29 bytes  received 9.674 bytes  6.468,67 bytes/sec
total size is 21.402.425  speedup is 2.205,75


In [30]:
!gcloud auth application-default login


Your browser has been opened to visit:

    https://accounts.google.com/o/oauth2/auth?response_type=code&client_id=764086051850-6qr4p6gpi6hn506pt8ejuq83di341hur.apps.googleusercontent.com&redirect_uri=http%3A%2F%2Flocalhost%3A8085%2F&scope=openid+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fuserinfo.email+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fcloud-platform+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fsqlservice.login&state=0qh1QG9Q6SOwqCyDLrMSh5M6Bicmsa&access_type=offline&code_challenge=qTZ4lVEsItKrm4wdTYa-BbTzBhTQSzg3BydsfS96mC8&code_challenge_method=S256


Credentials saved to file: [/home/mindos/.config/gcloud/application_default_credentials.json]

These credentials will be used by any library that requests Application Default Credentials (ADC).

Quota project "open-targets-genetics-dev" was added to ADC which can be used by Google client libraries for billing and quota. Note that some services may still bill the project owning the resource.


In [31]:
from enum import Enum, StrEnum
from pprint import pprint

from gentropy.common.session import Session
from gentropy.dataset.l2g_feature_matrix import L2GFeatureMatrix
from gentropy.l2g import LocusToGeneStep


In [32]:
session = Session(extended_spark_conf={"spark.driver.memory": "40G", "spark.rpc.message.maxSize": "1024"})
fm_path = "../../data/l2g_feature_matrix"


25/04/25 14:19:37 WARN SparkSession: Using an existing Spark session; only runtime SQL configurations will take effect.


In [33]:
session.spark


## Data loading
We need to:
* prepare the feature combinations
* load the L2G feature matrix once for each feature combination

### Feature combination preparation

In [34]:
class EQTLFeatures(StrEnum):
    EQTLCOLOCCLPPMAXIMUM = "eQtlColocClppMaximum"
    EQTLCOLOCH4MAXIMUM = "eQtlColocH4Maximum"
    EQTLCOLOCCLPPMAXIMUMNEIGHBOURHOOD = "eQtlColocClppMaximumNeighbourhood"
    EQTLCOLOCH4MAXIMUMNEIGHBOURHOOD = "eQtlColocH4MaximumNeighbourhood"


class PQTLFeatures(StrEnum):
    PQTLCOLOCCLPPMAXIMUM = "pQtlColocClppMaximum"
    PQTLCOLOCH4MAXIMUM = "pQtlColocH4Maximum"
    PQTLCOLOCCLPPMAXIMUMNEIGHBOURHOOD = "pQtlColocClppMaximumNeighbourhood"
    PQTLCOLOCH4MAXIMUMNEIGHBOURHOOD = "pQtlColocH4MaximumNeighbourhood"


class SQTLFeatures(StrEnum):
    SQTLCOLOCCLPPMAXIMUM = "sQtlColocClppMaximum"
    SQTLCOLOCH4MAXIMUM = "sQtlColocH4Maximum"
    SQTLCOLOCCLPPMAXIMUMNEIGHBOURHOOD = "sQtlColocClppMaximumNeighbourhood"
    SQTLCOLOCH4MAXIMUMNEIGHBOURHOOD = "sQtlColocH4MaximumNeighbourhood"


class DistanceFeatures(StrEnum):
    DISTANCESENTINELFOOTPRINT = "distanceSentinelFootprint"
    DISTANCESENTINELFOOTPRINTNEIGHBOURHOOD = "distanceSentinelFootprintNeighbourhood"
    DISTANCEFOOTPRINTMEAN = "distanceFootprintMean"
    DISTANCEFOOTPRINTMEANNEIGHBOURHOOD = "distanceFootprintMeanNeighbourhood"
    DISTANCETSSMEAN = "distanceTssMean"
    DISTANCETSSMEANNEIGHBOURHOOD = "distanceTssMeanNeighbourhood"
    DISTANCESENTINELTSS = "distanceSentinelTss"
    DISTANCESENTINELTSSNEIGHBOURHOOD = "distanceSentinelTssNeighbourhood"


class VEPFeatures(StrEnum):
    VEPMAXIMUM = "vepMaximum"
    VEPMAXIMUMNEIGHBOURHOOD = "vepMaximumNeighbourhood"
    VEPMEAN = "vepMean"
    VEPMEANNEIGHBOURHOOD = "vepMeanNeighbourhood"


class OtherFeatures(StrEnum):
    GENECOUNT500KB = "geneCount500kb"
    PROTEINGENECOUNT500KB = "proteinGeneCount500kb"
    CREDIBLESETCONFIDENCE = "credibleSetConfidence"
    ISPROTEINCODING = "isProteinCoding"


class FeatureGroup(Enum):
    VEP = VEPFeatures
    DISTANCE = DistanceFeatures
    EQTL = EQTLFeatures
    PQTL = PQTLFeatures
    SQTL = SQTLFeatures
    OTHER = OtherFeatures


class FeatureCombination:
    def __init__(self, name: str, *feature_groups: FeatureGroup) -> None:
        self.name = name
        self.feature_groups = list(feature_groups)

    @property
    def features(self) -> list[str]:
        """Return feature names under the initialized groups."""
        features = []
        for g in self.feature_groups:
            features.extend([f.value for f in list(g.value)])
        return features


In [35]:
feature_combinations = [
    FeatureCombination("distance-other", FeatureGroup.DISTANCE, FeatureGroup.OTHER),
    FeatureCombination("distance-other-vep", FeatureGroup.DISTANCE, FeatureGroup.OTHER, FeatureGroup.VEP),
    FeatureCombination("distance-other-eqlt", FeatureGroup.DISTANCE, FeatureGroup.OTHER, FeatureGroup.EQTL),
    FeatureCombination("distance-other-pqtl", FeatureGroup.DISTANCE, FeatureGroup.OTHER, FeatureGroup.PQTL),
    FeatureCombination("distance-other-sqtl", FeatureGroup.DISTANCE, FeatureGroup.OTHER, FeatureGroup.SQTL),
]
named_combinations = {c.name: c.features for c in feature_combinations}
named_combinations


{'distance-other': ['distanceSentinelFootprint',
  'distanceSentinelFootprintNeighbourhood',
  'distanceFootprintMean',
  'distanceFootprintMeanNeighbourhood',
  'distanceTssMean',
  'distanceTssMeanNeighbourhood',
  'distanceSentinelTss',
  'distanceSentinelTssNeighbourhood',
  'geneCount500kb',
  'proteinGeneCount500kb',
  'credibleSetConfidence',
  'isProteinCoding'],
 'distance-other-vep': ['distanceSentinelFootprint',
  'distanceSentinelFootprintNeighbourhood',
  'distanceFootprintMean',
  'distanceFootprintMeanNeighbourhood',
  'distanceTssMean',
  'distanceTssMeanNeighbourhood',
  'distanceSentinelTss',
  'distanceSentinelTssNeighbourhood',
  'geneCount500kb',
  'proteinGeneCount500kb',
  'credibleSetConfidence',
  'isProteinCoding',
  'vepMaximum',
  'vepMaximumNeighbourhood',
  'vepMean',
  'vepMeanNeighbourhood'],
 'distance-other-eqlt': ['distanceSentinelFootprint',
  'distanceSentinelFootprintNeighbourhood',
  'distanceFootprintMean',
  'distanceFootprintMeanNeighbourhood',

### Loading feature matrix 

In [36]:
df = session.load_data(fm_path, format="parquet")
fms = {
    name: {
        "features_list": features,
        "fm": L2GFeatureMatrix(_df=df, features_list=features),
        "model_path": f"../../data/l2g_leave_one_in_tests/models/{name}/classifier.skops",
        "prediction_path": f"../../data/l2g_leave_one_in_tests/predictions/{name}",
    }
    for name, features in named_combinations.items()
}


In [37]:
# Sanity check: confirm that each feature matrix contains only the expected columns
for name, meta in fms.items():
    print(f"{name}: columns({', '.join(meta['fm']._df.columns)})")


distance-other: columns(studyLocusId, geneId, distanceSentinelFootprint, distanceSentinelFootprintNeighbourhood, distanceFootprintMean, distanceFootprintMeanNeighbourhood, distanceTssMean, distanceTssMeanNeighbourhood, distanceSentinelTss, distanceSentinelTssNeighbourhood, geneCount500kb, proteinGeneCount500kb, credibleSetConfidence, isProteinCoding)
distance-other-vep: columns(studyLocusId, geneId, distanceSentinelFootprint, distanceSentinelFootprintNeighbourhood, distanceFootprintMean, distanceFootprintMeanNeighbourhood, distanceTssMean, distanceTssMeanNeighbourhood, distanceSentinelTss, distanceSentinelTssNeighbourhood, geneCount500kb, proteinGeneCount500kb, credibleSetConfidence, isProteinCoding, vepMaximum, vepMaximumNeighbourhood, vepMean, vepMeanNeighbourhood)
distance-other-eqlt: columns(studyLocusId, geneId, distanceSentinelFootprint, distanceSentinelFootprintNeighbourhood, distanceFootprintMean, distanceFootprintMeanNeighbourhood, distanceTssMean, distanceTssMeanNeighbourhood
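
Reading the printed column lists by eye is error-prone, so the same check can be expressed as assertions. This is a sketch under two assumptions: the identifier columns are `studyLocusId` and `geneId`, as seen in the output above, and the helper names `unexpected_columns` and `missing_features` are hypothetical.

```python
ID_COLUMNS = ("studyLocusId", "geneId")  # identifier columns seen in the output above


def unexpected_columns(columns, features, id_columns=ID_COLUMNS):
    """Columns that are neither identifiers nor requested features."""
    allowed = set(features) | set(id_columns)
    return sorted(c for c in columns if c not in allowed)


def missing_features(columns, features):
    """Requested features that are absent from the column list."""
    present = set(columns)
    return sorted(f for f in features if f not in present)


# Toy input standing in for one feature matrix and its feature list.
columns = ["studyLocusId", "geneId", "vepMaximum", "vepMean"]
features = ["vepMaximum", "vepMean"]

assert not unexpected_columns(columns, features)
assert not missing_features(columns, features)
```

In the notebook, the two helpers could be called with `meta["fm"]._df.columns` and `meta["features_list"]` inside the loop over `fms`.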

## Running training and predictions

### Parameter estimation

The parameters are based on the following pipeline configuration:
```
  l2g_training:
    params:
      step: locus_to_gene
      step.session.write_mode: errorIfExists
      step.run_mode: train
      step.wandb_run_name: '{l2g_training}'
      step.cross_validate: false
      step.hf_hub_repo_id: opentargets/locus_to_gene
      step.hf_model_commit_message: 'chore: update model base model for {l2g_training} run'
      +step.session.extended_spark_conf: "{spark.kryoserializer.buffer.max:500m, spark.sql.autoBroadcastJoinThreshold:'-1'}"
      # INPUTS
      step.credible_set_path: '{release_uri}/output/credible_set'
      step.feature_matrix_path: '{release_uri}/intermediate/l2g_feature_matrix'
      step.gold_standard_curation_path: '{release_uri}/input/l2g/gold_standard.json'
      # OUTPUTS
      step.model_path: '{release_uri}/etc/model/locus_to_gene_model/classifier.skops'

  l2g_prediction:
    params:
      step: locus_to_gene
      step.run_mode: predict
      step.session.write_mode: errorIfExists
      step.session.output_partitions: 1
      step.l2g_threshold: 0.05
      step.download_from_hub: true
      step.explain_predictions: true
      +step.session.extended_spark_conf: "{spark.rpc.message.maxSize:'1024'}"
      # INPUTS
      step.hf_hub_repo_id: opentargets/locus_to_gene
      step.feature_matrix_path: '{release_uri}/intermediate/l2g_feature_matrix'
      step.credible_set_path: '{release_uri}/output/credible_set'
      # OUTPUTS
      step.predictions_path: '{release_uri}/output/l2g_prediction'
```
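
The `step.*` keys in this configuration correspond to the keyword arguments passed to `LocusToGeneStep` in the cells that follow. The sketch below shows one way to derive such a dictionary from the configuration keys. `step_kwargs` is a hypothetical helper, not part of gentropy, and it assumes that session-level keys (`step.session.*`) are handled separately through the `Session` object.

```python
def step_kwargs(config_params: dict) -> dict:
    """Turn 'step.<name>' configuration keys into keyword arguments."""
    kwargs = {}
    for key, value in config_params.items():
        key = key.lstrip("+")  # some keys carry a leading '+'; drop it
        if not key.startswith("step.") or key.startswith("step.session."):
            continue  # skip the bare 'step' selector and session-level settings
        kwargs[key.removeprefix("step.")] = value
    return kwargs


params = {
    "step": "locus_to_gene",
    "step.run_mode": "predict",
    "step.l2g_threshold": 0.05,
    "step.session.output_partitions": 1,
    "+step.session.extended_spark_conf": "{spark.rpc.message.maxSize:'1024'}",
}
print(step_kwargs(params))
```

Paths and Hub settings still need to be overridden by hand for a local run, as done in the parameter dictionaries of this notebook.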

In [38]:
train_parameters = {
    name: {
        "run_mode": "train",
        "wandb_run_name": f"run_250425_leave-one-in-group_{name}_train",
        "cross_validate": False,
        "hf_hub_repo_id": None,
        "hf_model_commit_message": f"perf: performance test for leave-one-in-group analysis with {name} features",
        "credible_set_path": "../../data/credible_set",
        "feature_matrix_path": "../../data/l2g_feature_matrix",
        "gold_standard_curation_path": "../../data/gold_standard.json",
        "features_list": meta["features_list"],
        "model_path": meta["model_path"],
        "hyperparameters": {
            "n_estimators": 100,
            "max_depth": 3,
            "ccp_alpha": 0,
            "learning_rate": 0.1,
            "min_samples_leaf": 1,
            "min_samples_split": 5,
            "subsample": 0.7,
        },
        "download_from_hub": False,
    }
    for name, meta in fms.items()
}
pprint(train_parameters["distance-other"])


{'credible_set_path': '../../data/credible_set',
 'cross_validate': False,
 'download_from_hub': False,
 'feature_matrix_path': '../../data/l2g_feature_matrix',
 'features_list': ['distanceSentinelFootprint',
                   'distanceSentinelFootprintNeighbourhood',
                   'distanceFootprintMean',
                   'distanceFootprintMeanNeighbourhood',
                   'distanceTssMean',
                   'distanceTssMeanNeighbourhood',
                   'distanceSentinelTss',
                   'distanceSentinelTssNeighbourhood',
                   'geneCount500kb',
                   'proteinGeneCount500kb',
                   'credibleSetConfidence',
                   'isProteinCoding'],
 'gold_standard_curation_path': '../../data/gold_standard.json',
 'hf_hub_repo_id': None,
 'hf_model_commit_message': 'perf: performance test for leave-one-in-group '
                            'analysis with distance-other features',
 'hyperparameters': {'ccp_alpha': 0,
      

In [39]:
pred_parameters = {
    name: {
        "run_mode": "predict",
        "l2g_threshold": 0.05,
        "download_from_hub": False,
        "explain_predictions": False,
        "features_list": meta["features_list"],
        "model_path": meta["model_path"].removesuffix("/classifier.skops"),
        "feature_matrix_path": "../../data/l2g_feature_matrix",
        "credible_set_path": "../../data/credible_set",
        "predictions_path": meta["prediction_path"],
        "hyperparameters": {
            "n_estimators": 100,
            "max_depth": 3,
            "ccp_alpha": 0,
            "learning_rate": 0.1,
            "min_samples_leaf": 1,
            "min_samples_split": 5,
            "subsample": 0.7,
        },
        "cross_validate": False,
        "wandb_run_name": f"run_250425_leave-one-in-group_{name}_pred",
    }
    for name, meta in fms.items()
}
pprint(pred_parameters["distance-other"])


{'credible_set_path': '../../data/credible_set',
 'cross_validate': False,
 'download_from_hub': False,
 'explain_predictions': False,
 'feature_matrix_path': '../../data/l2g_feature_matrix',
 'features_list': ['distanceSentinelFootprint',
                   'distanceSentinelFootprintNeighbourhood',
                   'distanceFootprintMean',
                   'distanceFootprintMeanNeighbourhood',
                   'distanceTssMean',
                   'distanceTssMeanNeighbourhood',
                   'distanceSentinelTss',
                   'distanceSentinelTssNeighbourhood',
                   'geneCount500kb',
                   'proteinGeneCount500kb',
                   'credibleSetConfidence',
                   'isProteinCoding'],
 'hyperparameters': {'ccp_alpha': 0,
                     'learning_rate': 0.1,
                     'max_depth': 3,
                     'min_samples_leaf': 1,
                     'min_samples_split': 5,
                     'n_estimators': 100,
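
Note that the prediction parameters receive the directory that holds the model, while the training parameters receive the full path to `classifier.skops`. The sketch below shows the same derivation with `pathlib` instead of string manipulation. `model_dir` is a hypothetical helper, and the claim that prediction expects a directory is an inference from the parameters above, not from the gentropy documentation.

```python
from pathlib import PurePosixPath


def model_dir(model_path: str) -> str:
    """Return the directory that contains the serialized classifier."""
    return str(PurePosixPath(model_path).parent)


train_model_path = "../../data/l2g_leave_one_in_tests/models/distance-other/classifier.skops"
print(model_dir(train_model_path))
```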


### L2G steps

For each feature combination, the test runs both:
* training, which uploads the training metrics to W&B
* predictions

Models and predictions are saved locally.
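
When the loop has to be re-run after an interruption, it helps to know which combinations already produced both outputs. The sketch below is one way to list the remaining ones. `pending_runs` is a hypothetical helper, and it assumes a dictionary with `model_path` and `prediction_path` entries such as `fms`.

```python
import tempfile
from pathlib import Path


def pending_runs(outputs):
    """Return the names whose model file or prediction directory is missing."""
    return [
        name
        for name, paths in outputs.items()
        if not (Path(paths["model_path"]).exists() and Path(paths["prediction_path"]).exists())
    ]


# Demonstration on a throwaway directory: "done" has both outputs, "todo" has none.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "models" / "done").mkdir(parents=True)
    (root / "models" / "done" / "classifier.skops").touch()
    (root / "predictions" / "done").mkdir(parents=True)
    demo = {
        name: {
            "model_path": str(root / "models" / name / "classifier.skops"),
            "prediction_path": str(root / "predictions" / name),
        }
        for name in ("done", "todo")
    }
    remaining = pending_runs(demo)

print(remaining)
```

In the notebook, `pending_runs(fms)` would give the names to iterate over in place of `train_parameters.keys()`.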


In [41]:
names = train_parameters.keys()
for name in names:
    LocusToGeneStep(session=session, **train_parameters[name])
    LocusToGeneStep(session=session, **pred_parameters[name])


[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /home/mindos/.netrc
25/04/25 14:23:53 WARN CacheManager: Asked to cache already cached data.


[34m[1mwandb[0m: 
[34m[1mwandb[0m: Plotting L2G-classifier.
[34m[1mwandb[0m: Logged feature importances.
[34m[1mwandb[0m: Logged confusion matrix.
[34m[1mwandb[0m: Logged summary metrics.
[34m[1mwandb[0m: Logged class proportions.
[34m[1mwandb[0m: Logged calibration curve.
[34m[1mwandb[0m: Logged roc curve.
[34m[1mwandb[0m: Logged precision-recall curve.
[34m[1mwandb[0m: Logged Shapley contributions.


0,1
accuracy,▁
areaUnderROC,▁
averagePrecision,▁
f1,▁
weightedPrecision,▁
weightedRecall,▁

0,1
accuracy,0.97616
areaUnderROC,0.96168
averagePrecision,0.67865
f1,0.97495
weightedPrecision,0.97517
weightedRecall,0.97616


ERROR:root:Training data set to none. Error: [Errno 2] No such file or directory: '../../data/l2g_leave_one_in_tests/models/distance-other/training_data.parquet'
25/04/25 14:27:18 WARN TaskSetManager: Stage 450 contains a task of very large size (30229 KiB). The maximum recommended task size is 1000 KiB.
wandb: Appending key for api.wandb.ai to your netrc file: /home/mindos/.netrc

wandb: Plotting L2G-classifier.
wandb: Logged feature importances.
wandb: Logged confusion matrix.
wandb: Logged summary metrics.
wandb: Logged class proportions.

lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression

wandb: Logged calibration curve.
wandb: Logged roc curve.
wandb: Logged precision-recall curve.
wandb: Logged Shapley contributions.

Run summary (`distance-other-vep`):

| Metric | Value |
|---|---|
| accuracy | 0.97639 |
| areaUnderROC | 0.9621 |
| averagePrecision | 0.68145 |
| f1 | 0.97517 |
| weightedPrecision | 0.97543 |
| weightedRecall | 0.97639 |


ERROR:root:Training data set to none. Error: [Errno 2] No such file or directory: '../../data/l2g_leave_one_in_tests/models/distance-other-vep/training_data.parquet'
25/04/25 14:31:29 WARN TaskSetManager: Stage 597 contains a task of very large size (30229 KiB). The maximum recommended task size is 1000 KiB.
wandb: Appending key for api.wandb.ai to your netrc file: /home/mindos/.netrc

wandb: Plotting L2G-classifier.
wandb: Logged feature importances.
wandb: Logged confusion matrix.
wandb: Logged summary metrics.
wandb: Logged class proportions.

lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression

wandb: Logged calibration curve.
wandb: Logged roc curve.
wandb: Logged precision-recall curve.
wandb: Logged Shapley contributions.

Run summary (`distance-other-eqlt`):

| Metric | Value |
|---|---|
| accuracy | 0.97721 |
| areaUnderROC | 0.96311 |
| averagePrecision | 0.69258 |
| f1 | 0.97613 |
| weightedPrecision | 0.9763 |
| weightedRecall | 0.97721 |


ERROR:root:Training data set to none. Error: [Errno 2] No such file or directory: '../../data/l2g_leave_one_in_tests/models/distance-other-eqlt/training_data.parquet'
25/04/25 14:35:30 WARN TaskSetManager: Stage 744 contains a task of very large size (30229 KiB). The maximum recommended task size is 1000 KiB.
wandb: Appending key for api.wandb.ai to your netrc file: /home/mindos/.netrc

wandb: Plotting L2G-classifier.
wandb: Logged feature importances.
wandb: Logged confusion matrix.
wandb: Logged summary metrics.
wandb: Logged class proportions.
wandb: Logged calibration curve.
wandb: Logged roc curve.
wandb: Logged precision-recall curve.
wandb: Logged Shapley contributions.

Run summary (`distance-other-pqtl`):

| Metric | Value |
|---|---|
| accuracy | 0.9778 |
| areaUnderROC | 0.966 |
| averagePrecision | 0.69958 |
| f1 | 0.97666 |
| weightedPrecision | 0.977 |
| weightedRecall | 0.9778 |


ERROR:root:Training data set to none. Error: [Errno 2] No such file or directory: '../../data/l2g_leave_one_in_tests/models/distance-other-pqtl/training_data.parquet'
25/04/25 14:39:30 WARN TaskSetManager: Stage 891 contains a task of very large size (30229 KiB). The maximum recommended task size is 1000 KiB.
wandb: Appending key for api.wandb.ai to your netrc file: /home/mindos/.netrc

wandb: Plotting L2G-classifier.
wandb: Logged feature importances.
wandb: Logged confusion matrix.
wandb: Logged summary metrics.
wandb: Logged class proportions.

lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression

wandb: Logged calibration curve.
wandb: Logged roc curve.
wandb: Logged precision-recall curve.
wandb: Logged Shapley contributions.

Run summary (`distance-other-sqtl`):

| Metric | Value |
|---|---|
| accuracy | 0.97371 |
| areaUnderROC | 0.96377 |
| averagePrecision | 0.64634 |
| f1 | 0.97218 |
| weightedPrecision | 0.97246 |
| weightedRecall | 0.97371 |


ERROR:root:Training data set to none. Error: [Errno 2] No such file or directory: '../../data/l2g_leave_one_in_tests/models/distance-other-sqtl/training_data.parquet'
25/04/25 14:43:32 WARN TaskSetManager: Stage 1038 contains a task of very large size (30229 KiB). The maximum recommended task size is 1000 KiB.
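
To compare the runs, the summary metrics logged above can be set against the `distance-other` baseline. The sketch below copies `areaUnderROC` and `averagePrecision` from the five run summaries. The assignment of each summary to a combination name is an assumption, inferred from the loop order and from the model paths in the error messages that follow each summary. `deltas_vs_baseline` is a hypothetical helper.

```python
# (areaUnderROC, averagePrecision) copied from the run summaries above.
# The name-to-summary mapping is inferred from the loop order (assumption).
summaries = {
    "distance-other": (0.96168, 0.67865),
    "distance-other-vep": (0.9621, 0.68145),
    "distance-other-eqlt": (0.96311, 0.69258),
    "distance-other-pqtl": (0.966, 0.69958),
    "distance-other-sqtl": (0.96377, 0.64634),
}


def deltas_vs_baseline(summaries, baseline):
    """Metric differences of every run relative to the baseline run."""
    base = summaries[baseline]
    return {
        name: tuple(round(value - ref, 5) for value, ref in zip(values, base))
        for name, values in summaries.items()
        if name != baseline
    }


deltas = deltas_vs_baseline(summaries, "distance-other")
for name, (d_auroc, d_ap) in deltas.items():
    print(f"{name}: delta areaUnderROC={d_auroc:+.5f}, delta averagePrecision={d_ap:+.5f}")
```

These are single training runs with `cross_validate` disabled, so small differences between combinations should be read with caution.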
                                                                                

In [None]:
# Upload to gcs
!gcloud storage rsync -r ../../data/l2g_leave_one_in_tests gs://genetics-portal-dev-analysis/ss60/gentropy-manuscript/chapters/leave_one_group_in_analysis
