# Motif

This notebook demonstrates how to run the full Motif pipeline:
data preprocessing, GRN inference, a single aggregation of the resulting weights,
and aggregation under different parameter settings defined in a parameter grid.
The results are evaluated in a separate notebook.

In [5]:
import pandas as pd
import sys
import os
from pathlib import Path
import glob
import json

In [6]:
# Determine the notebook’s current directory
current = Path().resolve()

# Traverse upward until we find a directory named "Motif"
for p in [current] + list(current.parents):
    if p.name == "Motif":
        os.chdir(p)
        print("Changed working directory to Motif root:", p)
        break
else:
    raise FileNotFoundError("Could not locate a parent directory named 'Motif'")

Changed working directory to Motif root: /Users/juliamarlene/Documents/GitHub/Motif


In [7]:
from src.motif_core import grn_inference
from src.motif_core import aggregation
from src.motif_core import aggregation_grid_search

Dask dataframe query planning is disabled because dask-expr is not installed.

You can install it with `pip install dask[dataframe]` or `conda install dask`.
This will raise in a future version.



In [4]:
# Define input file paths
cpgs_file = "data/sample_cpgs.tsv"
genes_file = "data/sample_genes.tsv"
seeds = [0, 42, 123, 2021, 77]

In [5]:
# Run inference several times
for seed in seeds:
    sys.argv = [
        "grn_inference.py",
        "--cpgs_file", cpgs_file,
        "--genes_file", genes_file,
        "--seed", str(seed)]
    
    grn_inference.main()

# Iterate over each seed and preview the corresponding GRN inference output
for seed in seeds:
    # Build the path to the results file for this seed
    result_path = f"results/grn_inference/grn_with_seed_{seed}.tsv"
    
    # Read the TSV output into a DataFrame
    df = pd.read_csv(result_path, sep="\t")
    
    # Print a header and show the first few rows
    print(f"GRN inference results for seed {seed}:")
    display(df.head())

Running GRNBoost2...


Network inference complete. Result saved to results/grn_inference.
Running GRNBoost2...


Network inference complete. Result saved to results/grn_inference.
Running GRNBoost2...


Network inference complete. Result saved to results/grn_inference.
Running GRNBoost2...


Network inference complete. Result saved to results/grn_inference.
Running GRNBoost2...


Network inference complete. Result saved to results/grn_inference.
GRN inference results for seed 0:


Unnamed: 0,TF,target,importance
0,ENSG00000060642,cg00002809,5.705554
1,ENSG00000057294,cg00000721,5.625646
2,ENSG00000166391,cg00000622,4.918205
3,ENSG00000070081,cg00000734,4.339199
4,ENSG00000060642,cg00002769,4.147474


GRN inference results for seed 42:


Unnamed: 0,TF,target,importance
0,ENSG00000057294,cg00000721,5.599803
1,ENSG00000264576,cg00002597,4.898498
2,ENSG00000206072,cg00002145,4.506045
3,ENSG00000064225,cg00000957,4.334503
4,ENSG00000215246,cg00001364,4.070592


GRN inference results for seed 123:


Unnamed: 0,TF,target,importance
0,ENSG00000227766,cg00004082,4.888872
1,ENSG00000057294,cg00001687,4.403597
2,ENSG00000127511,cg00002808,4.32935
3,ENSG00000215246,cg00001364,3.941459
4,ENSG00000206072,cg00002145,3.713339


GRN inference results for seed 2021:


Unnamed: 0,TF,target,importance
0,ENSG00000215246,cg00001364,6.079778
1,ENSG00000057294,cg00000721,5.6294
2,ENSG00000215246,cg00004067,5.600062
3,ENSG00000179833,cg00003173,5.205131
4,ENSG00000166391,cg00002597,5.047079


GRN inference results for seed 77:


Unnamed: 0,TF,target,importance
0,ENSG00000060642,cg00002809,5.882295
1,ENSG00000227766,cg00004082,5.196693
2,ENSG00000231105,cg00001854,5.030915
3,ENSG00000057294,cg00000721,4.837068
4,ENSG00000158486,cg00002464,4.302933
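
The inference loop above overwrites `sys.argv` globally before each call to `main()`, so the last argument list stays in place for the rest of the session. A minimal sketch of one way to keep that contained is a context manager that restores the original value afterwards. The names `patched_argv` and `build_inference_argv` are hypothetical helpers introduced here for illustration and are not part of Motif.

```python
import sys
from contextlib import contextmanager


@contextmanager
def patched_argv(argv):
    """Temporarily replace sys.argv and restore the original list on exit."""
    original = sys.argv
    sys.argv = list(argv)
    try:
        yield
    finally:
        # Runs even if the wrapped call raises, so later cells see the old argv.
        sys.argv = original


def build_inference_argv(cpgs_file, genes_file, seed):
    """Assemble the same argument list the loop above passes for one seed."""
    return [
        "grn_inference.py",
        "--cpgs_file", cpgs_file,
        "--genes_file", genes_file,
        "--seed", str(seed),
    ]


# Intended use inside the notebook (not executed here):
# for seed in seeds:
#     with patched_argv(build_inference_argv(cpgs_file, genes_file, seed)):
#         grn_inference.main()
```

Restoring `sys.argv` in a `finally` clause means a failed run does not leave stale arguments behind for the aggregation cells that follow.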


In [6]:
sys.argv = ["aggregation.py", "--method", "mean", "--nruns", "all"]

aggregation.main()

# Load and preview results
df = pd.read_csv("results/aggregation/aggregated_weights.tsv", sep="\t")
print("\nAggregation results: ")
df.head()

Using seeds: [0, 42, 77, 123, 2021]
Saved results to results/aggregation.

Aggregation results: 


Unnamed: 0,genes,mean_weight
0,ENSG00000057294,5.054988
1,ENSG00000215246,4.415996
2,ENSG00000060642,4.27093
3,ENSG00000227766,4.086239
4,ENSG00000127511,3.793127
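
To make the `mean` method more concrete, here is a minimal sketch of averaging edge importances across seed runs with pandas. It assumes that the method averages the `importance` of each TF-target pair over the runs; this is an illustration, not the code in `aggregation.py`. The toy values and the helper name `mean_aggregate` are made up.

```python
import pandas as pd

# Toy per-seed edge lists with made-up importances (not real pipeline output).
runs = {
    0: pd.DataFrame(
        {"TF": ["geneA", "geneB"], "target": ["cg1", "cg2"], "importance": [4.0, 2.0]}
    ),
    42: pd.DataFrame(
        {"TF": ["geneA", "geneB"], "target": ["cg1", "cg2"], "importance": [6.0, 1.0]}
    ),
}


def mean_aggregate(runs):
    """Average the importance of each (TF, target) edge over all seed runs."""
    # Stack the runs into one long table, tagging each row with its seed.
    stacked = pd.concat(
        [df.assign(seed=seed) for seed, df in runs.items()],
        ignore_index=True,
    )
    return (
        stacked.groupby(["TF", "target"], as_index=False)["importance"]
        .mean()
        .rename(columns={"importance": "mean_weight"})
        .sort_values("mean_weight", ascending=False)
        .reset_index(drop=True)
    )


aggregated = mean_aggregate(runs)
print(aggregated)
```

In this sketch an edge that is missing from some runs is averaged only over the runs that contain it. Whether Motif treats missing edges the same way or counts them as zero is not shown in this notebook.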


In [1]:
# Define parameter grid for aggregation
param_grid = {
    "should_normalize": [False],
    "should_group_by_gene": [False],
    "method": ["mean", "freq"],
    "alpha": [10],
    "beta": [2]
}

# Save to JSON file, creating the resources directory first if it is missing
Path("resources").mkdir(parents=True, exist_ok=True)
with open("resources/param_grid.json", "w") as f:
    json.dump(param_grid, f, indent=4)

print("Saved param_grid.json to resources.")

Saved param_grid.json to resources.
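
A parameter grid of this shape is conventionally expanded into the Cartesian product of its value lists, giving one setting per combination. The sketch below shows that expansion with `itertools.product`. That `aggregation_grid_search` expands the grid in exactly this way is an assumption, and `expand_grid` is a hypothetical helper.

```python
from itertools import product


def expand_grid(param_grid):
    """Return one dict per combination of the grid's value lists."""
    keys = list(param_grid)
    value_lists = (param_grid[key] for key in keys)
    return [dict(zip(keys, values)) for values in product(*value_lists)]


param_grid = {
    "should_normalize": [False],
    "should_group_by_gene": [False],
    "method": ["mean", "freq"],
    "alpha": [10],
    "beta": [2],
}

# 1 * 1 * 2 * 1 * 1 = 2 combinations, one per aggregation method.
combos = expand_grid(param_grid)
for combo in combos:
    print(combo)
```

Only `method` has more than one value here, so the grid yields two settings. A plain product keeps `alpha` and `beta` in every combination, whether or not the chosen method uses them.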

In [8]:
# Run grid search over aggregation parameters
sys.argv = ["aggregation_grid_search.py", "resources/param_grid.json"]
aggregation_grid_search.main(sys.argv[1])

param_files = sorted(glob.glob("results/aggregation_grid_search/params_*.tsv"))
result_files = sorted(glob.glob("results/aggregation_grid_search/aggregated_weights_*.tsv"))

# Load one example parameter/result pair
example_param = pd.read_csv(param_files[0], sep="\t")
example_result = pd.read_csv(result_files[0], sep="\t")
print("\nParameter setting:")
print(example_param.head())
print("\nAggregation results:")
print(example_result.head())

Loaded parameter grid from: resources/param_grid.json
Using seeds: [0, 42, 77, 123, 2021]

Parameter setting:
   should_normalize  should_group_by_gene method  alpha  beta
0             False                 False   mean    NaN   NaN

Aggregation results:
             genes        cpgs  mean_weight
0  ENSG00000057294  cg00000721     5.054988
1  ENSG00000215246  cg00001364     4.415996
2  ENSG00000060642  cg00002809     4.270930
3  ENSG00000227766  cg00004082     4.086239
4  ENSG00000127511  cg00002808     3.793127
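
The cell above pairs parameter and result files by their position in two sorted lists, which only lines up if both lists contain the same set of runs. A minimal sketch of a stricter pairing matches files on the part of the name that follows the prefix. It assumes that a `params_*.tsv` file and its `aggregated_weights_*.tsv` counterpart share that trailing identifier. The helper `pair_by_suffix` and the example file names are hypothetical.

```python
from pathlib import Path


def pair_by_suffix(param_files, result_files,
                   param_prefix="params_",
                   result_prefix="aggregated_weights_"):
    """Pair files whose names share the identifier that follows the prefix."""

    def identifier(path, prefix):
        # "dir/params_3.tsv" -> "3" (requires Python 3.9+ for removeprefix)
        return Path(path).stem.removeprefix(prefix)

    params = {identifier(p, param_prefix): p for p in param_files}
    results = {identifier(r, result_prefix): r for r in result_files}

    # Keep only identifiers present on both sides, in a stable order.
    shared = sorted(params.keys() & results.keys())
    return [(params[key], results[key]) for key in shared]


# Hypothetical file names, used only to exercise the helper.
pairs = pair_by_suffix(
    ["out/params_1.tsv", "out/params_0.tsv"],
    ["out/aggregated_weights_0.tsv", "out/aggregated_weights_1.tsv"],
)
print(pairs)
```

Matching on the identifier rather than on list position means a missing or extra file drops out of the pairing instead of silently shifting every later pair.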
