# 🧬 Hands-on: Linking Motifs to Variant Effects


> **Open Questions & Ideas**
>
> - **How do we extract motifs from model interpretation?**
> - **Should we filter for variants with measurements?**  
>   <span style="color:green">Yes, because unmeasured variants may not be significant.</span>
> - **TODO:** Reduce the number of variants to ~10 for clarity.
> - **IDEA:** GATA1 is an interesting candidate.
> - **IDEA:** For variants with multiple TFBS hits, compare motif difference scores in the window.
> - **How do we map HOCOMOCO IDs to TF gene names?**
> - **How do we handle differences between HOCOMOCO and JASPAR?**  
>   E.g., SP9 has totally different motifs in each.

<!-- ![Variant Effect Motif Illustration](resources/picture/MPRA_variant_effect_picture.png) -->
<img src="resources/picture/MPRA_variant_effect_picture.png" alt="Variant Effect Motif Illustration" width="850"/>

---

## 🧭 Introduction: Variants Tested with MPRAs

- In the following Workshop part we will use a subset of variants tested within a MPRA.
- It was a large-scale MPRA (~46000 variants) and ~80000 sequences were tested within WTC11-derived neurons.
- The original FIMO result table can be very large (>1GB).

---

### Assuming we already have run BCalm for the variant effects

---

## Motif Identification from Sequence

Let's explore how to identify transcription factor binding motifs in DNA sequences.

---

### TFBS Databases: JASPAR & HOCOMOCO

Transcription factors (TFs) regulate gene expression by binding to specific DNA motifs. Identifying these motifs is crucial for understanding gene regulation, cell differentiation, and disease.

**Key Databases:**
- [JASPAR](https://jaspar.elixir.no/)
- [HOCOMOCO](https://hocomoco13.autosome.org/)

Both provide position weight matrices (PWMs) for TFBSs.  
> **Note:**  
> Genome-wide PWM scanning can yield many false positives due to short motif lengths and genome size.

You already learned about JASPAR in the modeling MPRA activity part of this Workshop. Here, we'll focus on **HOCOMOCO v13 (H13)** in .meme format.

- **HOCOMOCO v13:**  
  - Curated for human and mouse TFBSs.
  - ~1,200 motifs for 1,000 TFs.
  - Includes ChIP-seq and computational predictions.
  - Quality control metrics (e.g., supporting sequences, conservation).
  - [Download HOCOMOCO v13](https://hocomoco13.autosome.org/downloads_v13)

![HOCOMOCO Website Screenshot](resources/picture/hocomocov13_website.png)

### 🛠️ TFBS Identification Tools

Several tools predict TFBSs using PWMs:

- **FIMO** (Find Individual Motif Occurrences, MEME suite)
- **HOMER** (Hypergeometric Optimization of Motif EnRichment)
- ...and more

We'll use **FIMO** for this tutorial.

**FIMO Usage:**

- Input: TFBS database (.meme format) + FASTA sequences
- Output: Table of motif matches (`motif_id`, sequence name, start/end, matched sequence, etc.)

We'll use the `.tsv` output for downstream analysis.

In [None]:
# Run FIMO to find TFBS in variant sequences
!fimo --o results/finding_tfbs_in_variants_H13 resources/hocomoco_v13/H13CORE_meme_format.meme resources/test_variants_fimo_input.fa
# Execution time: ~15 seconds locally

In [None]:
# # helpful functions for later
# imports
import ast
from Bio import motifs
import pandas as pd
import os


# # Function to safely evaluate string representations of lists
def safe_eval(x):
    if pd.isna(x):
        return None
    try:
        return ast.literal_eval(x)
    except (ValueError, SyntaxError):
        return x

---

## Merging FIMO Output with Variant Information

After reading the FIMO output, we'll merge it with variant metadata (for both alternative and reference sequences) to create a unified, filterable table.

---

In [None]:
# Checkout fimo output
fimo_input = "results/finding_tfbs_in_variants_H13"
fimo_result = os.path.join(fimo_input, 'fimo.tsv')
fimo_result_df = pd.read_csv(fimo_result, sep="\t", comment='#')

### FIMO Output Columns

| Column           | Description                                      |
|------------------|--------------------------------------------------|
| `motif_id`       | ID of the matched motif                          |
| `sequence_name`  | Name of the sequence with the motif              |
| `start`/`stop`   | Start/stop positions of the motif match          |
| `strand`         | Strand (+/-)                                     |
| `matched_sequence`| Sequence matching the motif                     |
| `p-value`        | Motif match p-value (default threshold: 1e-4)    |
| `q-value`        | FDR-corrected p-value (higher confidence = lower)|

> **Tip:** Lower q-values indicate higher confidence in motif matches.

In [None]:
# Preview the FIMO result table
fimo_result_df

Unnamed: 0,motif_id,motif_alt_id,sequence_name,start,stop,strand,score,p-value,q-value,matched_sequence
0,ZNF646.H13CORE.0.PSG.A,,REF_variant18644,138,162,+,39.01830,7.220000e-14,8.900000e-10,TGTGTGTGTGTGTGTGTGTGTGTGC
1,ZNF865.H13CORE.0.PSG.A,,REF_variant18644,137,157,+,36.02440,5.250000e-13,1.330000e-09,GTGTGTGTGTGTGTGTGTGTG
2,ZNF865.H13CORE.0.PSG.A,,ALT_variant1864419-10953395-T-C,137,157,+,36.02440,5.250000e-13,1.330000e-09,GTGTGTGTGTGTGTGTGTGTG
3,ZNF865.H13CORE.0.PSG.A,,ALT_variant1864419-10953395-T-C,139,159,+,36.02440,5.250000e-13,1.330000e-09,GTGTGTGTGTGTGTGTGTGTG
4,ZNF865.H13CORE.0.PSG.A,,REF_variant18644,139,159,+,36.02440,5.250000e-13,1.330000e-09,GTGTGTGTGTGTGTGTGTGTG
...,...,...,...,...,...,...,...,...,...,...
6482,ZN223.H13CORE.0.P.C,,REF_variant18644,41,56,-,9.45918,9.980000e-05,8.830000e-02,AGGCTGAGGCGGGCAG
6483,ZN100.H13CORE.0.P.B,,ALT_variant1799218-37394793-C-G,218,240,-,-6.68182,9.990000e-05,1.540000e-01,CCACTGCAGACATGGGGGGTGTC
6484,ZN100.H13CORE.0.P.B,,REF_variant17992,218,240,-,-6.68182,9.990000e-05,1.540000e-01,CCACTGCAGACATGGGGGGTGTC
6485,ZNF773.H13CORE.0.PG.A,,ALT_variant1799218-37394793-C-G,192,212,+,9.58716,9.990000e-05,4.220000e-02,CAGTCCTCCCACCTTCTCACC


---

## Load Variant Metadata and Effects

We'll now load the variant metadata (positions, sequences, effects) and element metadata for our example dataset.

---

In [None]:
input_variants_path = "/home/kisa/coding/ISMB-2025_IGVF-MPRA-Tutorial/06_variant_effects_and_motifs/resources/variant_bcalm_metadata_df_ISMB.tsv.gz"
variant_bcalm_metadata_df = pd.read_csv(input_variants_path, sep="\t", low_memory=False)

# list columns col_variant_class, col_variant_pos, col_SPDI, col_allele
list_columns = ["variant_pos"]

# Apply the safe_eval function to the specified columns
for col in list_columns:
    variant_bcalm_metadata_df[col] = variant_bcalm_metadata_df[col].apply(safe_eval)

In [427]:
# Preview variant metadata
variant_bcalm_metadata_df

Unnamed: 0,ID,REF,ALT,ref_sequence,alt_sequence,variant_pos,bcalm_variant_effect_data_exists,bcalm_variant_effect_adjusted_p_value,bcalm_variant_effect_log_ratio_activity,bcalm_reference_adjusted_p_value,bcalm_reference_log_ratio_activity,bcalm_alternative_adjusted_p_value,bcalm_alternative_log_ratio_activity
0,cardiac_neuro_cava_random:ALT_SKI|ENSG00000157...,REF_variant1,ALT_variant11-2192366-T-G,CCATGCGGTGGCCACAGCCTCGGGTGAGTTCCGGTTCCAAAGTACC...,CCATGCGGTGGCCACAGCCTCGGGTGAGTTCCGGTTCCAAAGTACC...,[116],True,0.944787,-0.043372,1.000000,-0.068107,1.000000,-0.072051
1,cardiac_neuro_cava_random:ALT_SKI|ENSG00000157...,REF_variant2,ALT_variant21-2193142-G-A,GGACTCCGGTGCCTTCGCATTCCCGAGCTGTTTTTGCTTCTGGAAG...,GGACTCCGGTGCCTTCGCATTCCCGAGCTGTTTTTGCTTCTGGAAG...,[205],True,0.926799,-0.036392,1.000000,0.001944,1.000000,-0.005889
2,cardiac_neuro_cava_random:ALT_SKI|ENSG00000157...,REF_variant3,ALT_variant31-2197934-C-A,CCTCCACTTGTCAGGAAGCCTGACCCCCAATCCCCTCCCGCCTGAC...,CCTCCACTTGTCAGGAAGCCTGACCCCCAATCCCCTCCCGCCTGAC...,[71],True,0.779710,-0.130556,1.000000,-0.053109,1.000000,-0.065999
3,cardiac_neuro_cava_random:ALT_SKI|ENSG00000157...,REF_variant4,ALT_variant41-2198007-C-G,CCTCCACTTGTCAGGAAGCCTGACCCCCAATCCCCTCCCGCCTGAC...,CCTCCACTTGTCAGGAAGCCTGACCCCCAATCCCCTCCCGCCTGAC...,[144],True,0.953076,-0.034817,1.000000,-0.053109,1.000000,-0.074918
4,cardiac_neuro_cava_random:ALT_SKI|ENSG00000157...,REF_variant5,ALT_variant51-2198046-G-A,CCTCCACTTGTCAGGAAGCCTGACCCCCAATCCCCTCCCGCCTGAC...,CCTCCACTTGTCAGGAAGCCTGACCCCCAATCCCCTCCCGCCTGAC...,[183],True,0.632794,-0.167928,1.000000,-0.053109,1.000000,-0.060033
...,...,...,...,...,...,...,...,...,...,...,...,...,...
37865,cardiac_neuro_cava_random:ALT_G6PD|ENSG0000016...,REF_variant37866,ALT_variant37866X-154545206-A-G,GGAGCTCTGCCTCACCCCACCTGGCCCCAATTGTCCAGCTTGTAGA...,GGAGCTCTGCCTCACCCCACCTGGCCCCAATTGTCCAGCTTGTAGA...,[64],True,0.911154,-0.076406,0.015993,0.100370,0.451154,0.065547
37866,cardiac_neuro_cava_random:ALT_G6PD|ENSG0000016...,REF_variant37867,ALT_variant37867X-154549923-T-G,ATGTCTGAATTCACCTCCAAATAATGGGAAAACTCCTAGGTATATA...,ATGTCTGAATTCACCTCCAAATAATGGGAAAACTCCTAGGTATATA...,[149],True,0.748684,-0.153448,1.000000,-0.080632,0.091747,-0.090558
37867,cardiac_neuro_cava_random:ALT_G6PD|ENSG0000016...,REF_variant37868,ALT_variant37868X-154552289-C-T,CCTCTGCCCTCCCTGGCTTCTTCCCCTGTCCCTCCTTTCCCTTCCC...,CCTCTGCCCTCCCTGGCTTCTTCCCCTGTCCCTCCTTTCCCTTCCC...,[126],True,0.989551,0.007846,0.409652,0.056209,0.436290,0.052674
37868,cardiac_neuro_cava_random:ALT_G6PD|ENSG0000016...,REF_variant37869,ALT_variant37869X-154552371-G-A,CCTCTGCCCTCCCTGGCTTCTTCCCCTGTCCCTCCTTTCCCTTCCC...,CCTCTGCCCTCCCTGGCTTCTTCCCCTGTCCCTCCTTTCCCTTCTC...,[44],True,0.514030,-0.243036,0.409652,0.056209,1.000000,0.016824


---

## Combine FIMO Output with Variant Information

Let's merge the FIMO results with variant metadata for both reference and alternative alleles.

---

In [None]:
important_cols_for_matching = ['ID', 'REF', 'ALT', 'ref_sequence', 'alt_sequence', 'variant_pos']

# Merge FIMO with variants based on alternative and reference sequence names
fimo_alt = fimo_result_df.merge(variant_bcalm_metadata_df[important_cols_for_matching], left_on="sequence_name", right_on="ALT", how="inner")
fimo_ref = fimo_result_df.merge(variant_bcalm_metadata_df[important_cols_for_matching], left_on="sequence_name", right_on="REF", how="inner")

# Rename columns to distinguish ALT and REF q-values
fimo_alt = fimo_alt.rename(columns={"q-value": "ALT_q-value", "sequence_name": "ALT_name"})
fimo_ref = fimo_ref.rename(columns={"q-value": "REF_q-value", "sequence_name": "REF_name"})

# Merge ALT and REF results on ID, motif_id, start, stop, strand
combined_fimo_df = pd.merge(
    fimo_alt[['ID', 'motif_id', 'start', 'stop', 'strand', 'ALT_name', 'ALT_q-value']],
    fimo_ref[['ID', 'motif_id', 'start', 'stop', 'strand', 'REF_name', 'REF_q-value']],
    on=['ID', 'motif_id', 'start', 'stop', 'strand'],
    how='outer'  # Ensure we include all motif occurrences
)

combined_fimo_df

We want to look into the variant overlapping TFBS which required the information of the variant position. This info can be added from the variant table.


In [None]:
# Preview variant metadata (again, for context)
variant_bcalm_metadata_df

Unnamed: 0,ID,REF,ALT,ref_sequence,alt_sequence,variant_pos,bcalm_variant_effect_data_exists,bcalm_variant_effect_adjusted_p_value,bcalm_variant_effect_log_ratio_activity,bcalm_reference_adjusted_p_value,bcalm_reference_log_ratio_activity,bcalm_alternative_adjusted_p_value,bcalm_alternative_log_ratio_activity
0,cardiac_neuro_cava_random:ALT_SKI|ENSG00000157...,REF_variant1,ALT_variant11-2192366-T-G,CCATGCGGTGGCCACAGCCTCGGGTGAGTTCCGGTTCCAAAGTACC...,CCATGCGGTGGCCACAGCCTCGGGTGAGTTCCGGTTCCAAAGTACC...,[116],True,0.944787,-0.043372,1.000000,-0.068107,1.000000,-0.072051
1,cardiac_neuro_cava_random:ALT_SKI|ENSG00000157...,REF_variant2,ALT_variant21-2193142-G-A,GGACTCCGGTGCCTTCGCATTCCCGAGCTGTTTTTGCTTCTGGAAG...,GGACTCCGGTGCCTTCGCATTCCCGAGCTGTTTTTGCTTCTGGAAG...,[205],True,0.926799,-0.036392,1.000000,0.001944,1.000000,-0.005889
2,cardiac_neuro_cava_random:ALT_SKI|ENSG00000157...,REF_variant3,ALT_variant31-2197934-C-A,CCTCCACTTGTCAGGAAGCCTGACCCCCAATCCCCTCCCGCCTGAC...,CCTCCACTTGTCAGGAAGCCTGACCCCCAATCCCCTCCCGCCTGAC...,[71],True,0.779710,-0.130556,1.000000,-0.053109,1.000000,-0.065999
3,cardiac_neuro_cava_random:ALT_SKI|ENSG00000157...,REF_variant4,ALT_variant41-2198007-C-G,CCTCCACTTGTCAGGAAGCCTGACCCCCAATCCCCTCCCGCCTGAC...,CCTCCACTTGTCAGGAAGCCTGACCCCCAATCCCCTCCCGCCTGAC...,[144],True,0.953076,-0.034817,1.000000,-0.053109,1.000000,-0.074918
4,cardiac_neuro_cava_random:ALT_SKI|ENSG00000157...,REF_variant5,ALT_variant51-2198046-G-A,CCTCCACTTGTCAGGAAGCCTGACCCCCAATCCCCTCCCGCCTGAC...,CCTCCACTTGTCAGGAAGCCTGACCCCCAATCCCCTCCCGCCTGAC...,[183],True,0.632794,-0.167928,1.000000,-0.053109,1.000000,-0.060033
...,...,...,...,...,...,...,...,...,...,...,...,...,...
37865,cardiac_neuro_cava_random:ALT_G6PD|ENSG0000016...,REF_variant37866,ALT_variant37866X-154545206-A-G,GGAGCTCTGCCTCACCCCACCTGGCCCCAATTGTCCAGCTTGTAGA...,GGAGCTCTGCCTCACCCCACCTGGCCCCAATTGTCCAGCTTGTAGA...,[64],True,0.911154,-0.076406,0.015993,0.100370,0.451154,0.065547
37866,cardiac_neuro_cava_random:ALT_G6PD|ENSG0000016...,REF_variant37867,ALT_variant37867X-154549923-T-G,ATGTCTGAATTCACCTCCAAATAATGGGAAAACTCCTAGGTATATA...,ATGTCTGAATTCACCTCCAAATAATGGGAAAACTCCTAGGTATATA...,[149],True,0.748684,-0.153448,1.000000,-0.080632,0.091747,-0.090558
37867,cardiac_neuro_cava_random:ALT_G6PD|ENSG0000016...,REF_variant37868,ALT_variant37868X-154552289-C-T,CCTCTGCCCTCCCTGGCTTCTTCCCCTGTCCCTCCTTTCCCTTCCC...,CCTCTGCCCTCCCTGGCTTCTTCCCCTGTCCCTCCTTTCCCTTCCC...,[126],True,0.989551,0.007846,0.409652,0.056209,0.436290,0.052674
37868,cardiac_neuro_cava_random:ALT_G6PD|ENSG0000016...,REF_variant37869,ALT_variant37869X-154552371-G-A,CCTCTGCCCTCCCTGGCTTCTTCCCCTGTCCCTCCTTTCCCTTCCC...,CCTCTGCCCTCCCTGGCTTCTTCCCCTGTCCCTCCTTTCCCTTCTC...,[44],True,0.514030,-0.243036,0.409652,0.056209,1.000000,0.016824


In [40]:
variant_bcalm_infos = ['ID', 'variant_pos', 'ref_sequence', 'alt_sequence', 'bcalm_variant_effect_data_exists', 'bcalm_variant_effect_adjusted_p_value', 'bcalm_variant_effect_log_ratio_activity', 'bcalm_reference_adjusted_p_value', 'bcalm_reference_log_ratio_activity', 'bcalm_alternative_adjusted_p_value', 'bcalm_alternative_log_ratio_activity']
variant_metadata_fimo_combined_df = combined_fimo_df.merge(variant_bcalm_metadata_df[variant_bcalm_infos], on="ID", how="inner")

In [41]:
def motif_overlaps_variant(row, fimo_start_col='start', fimo_stop_col='stop', col_variant_pos='variant_pos'):
    """
    Iterate over each variant pos and return true if the start stop coordinates are overlapping the variant position
    NOTE: start and stop are 1-based while variant position from the metadata format is 0-based
    """

    if isinstance(row[col_variant_pos], list):
        for variant_position in row[col_variant_pos]:
            # make the variant position 1-based
            variant_pos_1_based = int(variant_position) + 1
            if (int(row[fimo_start_col]) <= variant_pos_1_based & int(row[fimo_stop_col]) >= variant_pos_1_based):
                return True
    elif isinstance(row[col_variant_pos], int):
        # make the variant position 1-based
        variant_pos_1_based = row[col_variant_pos] + 1
        if (int(row[fimo_start_col]) <= variant_pos_1_based & int(row[fimo_stop_col]) >= variant_pos_1_based):
            return True
    else:
        print(row)
        print(type(row[col_variant_pos]))
        raise ValueError("Variant position has an unexpected data type")
    return False

variant_metadata_fimo_combined_df["is_variant_overlapping"] = variant_metadata_fimo_combined_df.apply(lambda row: motif_overlaps_variant(row, fimo_start_col='start', fimo_stop_col='stop', col_variant_pos='variant_pos'), axis=1)


In [64]:
sig_level = 0.1
variant_metadata_fimo_combined_df["is_significant"] = variant_metadata_fimo_combined_df["bcalm_variant_effect_adjusted_p_value"] < sig_level

if variant_metadata_fimo_combined_df.shape[0] == variant_metadata_fimo_combined_df['is_significant'].sum():
   print("All the variants of interest are significant")

All the variants of interest are significant


> **Result:**  
> Now we have a table of significant variants and their overlapping TFBS predictions. Let's explore the numbers!

In [None]:
# Preview combined variant-TFBS table
variant_metadata_fimo_combined_df

Unnamed: 0,ID,motif_id,start,stop,strand,ALT_name,ALT_q-value,REF_name,REF_q-value,variant_pos,...,alt_sequence,bcalm_variant_effect_data_exists,bcalm_variant_effect_adjusted_p_value,bcalm_variant_effect_log_ratio_activity,bcalm_reference_adjusted_p_value,bcalm_reference_log_ratio_activity,bcalm_alternative_adjusted_p_value,bcalm_alternative_log_ratio_activity,is_variant_overlapping,is_significant
0,cardiac_neuro_cava_random:ALT_ANKZF1|ENSG00000...,ANDR.H13CORE.0.P.B,186,202,+,ALT_variant211862-219267001-T-C,0.1870,REF_variant21186,0.1870,[63],...,CAGGTTCACTGTTGGGCTCTGATCCCACCTTCCCACCATGGGGACA...,True,1.257254e-50,1.457050,1.0,0.025425,2.299743e-50,0.400958,False,True
1,cardiac_neuro_cava_random:ALT_ANKZF1|ENSG00000...,AP2A.H13CORE.0.PSM.A,196,210,+,ALT_variant211862-219267001-T-C,0.1580,REF_variant21186,0.1580,[63],...,CAGGTTCACTGTTGGGCTCTGATCCCACCTTCCCACCATGGGGACA...,True,1.257254e-50,1.457050,1.0,0.025425,2.299743e-50,0.400958,False,True
2,cardiac_neuro_cava_random:ALT_ANKZF1|ENSG00000...,AP2B.H13CORE.0.SM.B,195,212,+,ALT_variant211862-219267001-T-C,0.1370,REF_variant21186,0.1370,[63],...,CAGGTTCACTGTTGGGCTCTGATCCCACCTTCCCACCATGGGGACA...,True,1.257254e-50,1.457050,1.0,0.025425,2.299743e-50,0.400958,False,True
3,cardiac_neuro_cava_random:ALT_ANKZF1|ENSG00000...,AP2C.H13CORE.0.PSM.A,193,207,+,ALT_variant211862-219267001-T-C,0.1990,REF_variant21186,0.1990,[63],...,CAGGTTCACTGTTGGGCTCTGATCCCACCTTCCCACCATGGGGACA...,True,1.257254e-50,1.457050,1.0,0.025425,2.299743e-50,0.400958,False,True
4,cardiac_neuro_cava_random:ALT_ANKZF1|ENSG00000...,AP2C.H13CORE.0.PSM.A,197,211,-,ALT_variant211862-219267001-T-C,0.1990,REF_variant21186,0.1990,[63],...,CAGGTTCACTGTTGGGCTCTGATCCCACCTTCCCACCATGGGGACA...,True,1.257254e-50,1.457050,1.0,0.025425,2.299743e-50,0.400958,False,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3519,cardiac_neuro_cava_random:ALT_WWOX|ENSG0000018...,ZSC21.H13CORE.0.P.C,92,102,-,ALT_variant1574716-79008835-G-T,0.0637,REF_variant15747,0.0637,[201],...,GACAGAGCTTGGGCATGCTCCCGACACCAGGTCATACTGATGTCAC...,True,8.353091e-02,-0.372359,1.0,0.003278,1.000000e+00,-0.041520,False,True
3520,cardiac_neuro_cava_random:ALT_WWOX|ENSG0000018...,ZSC21.H13CORE.0.P.C,137,147,+,ALT_variant1574716-79008835-G-T,0.0271,REF_variant15747,0.0271,[201],...,GACAGAGCTTGGGCATGCTCCCGACACCAGGTCATACTGATGTCAC...,True,8.353091e-02,-0.372359,1.0,0.003278,1.000000e+00,-0.041520,False,True
3521,cardiac_neuro_cava_random:ALT_WWOX|ENSG0000018...,ZSC22.H13CORE.0.P.C,66,82,-,ALT_variant1574716-79008835-G-T,0.0503,REF_variant15747,0.0503,[201],...,GACAGAGCTTGGGCATGCTCCCGACACCAGGTCATACTGATGTCAC...,True,8.353091e-02,-0.372359,1.0,0.003278,1.000000e+00,-0.041520,False,True
3522,cardiac_neuro_cava_random:ALT_WWOX|ENSG0000018...,ZSC31.H13CORE.0.P.C,215,231,+,ALT_variant1574716-79008835-G-T,0.2070,REF_variant15747,0.2070,[201],...,GACAGAGCTTGGGCATGCTCCCGACACCAGGTCATACTGATGTCAC...,True,8.353091e-02,-0.372359,1.0,0.003278,1.000000e+00,-0.041520,False,True


In [None]:
# Filter for variant-overlapping TFBS
variant_overlapping_tfbs = variant_metadata_fimo_combined_df.loc[variant_metadata_fimo_combined_df['is_variant_overlapping']].copy()

In [None]:
print("Number of Motif predictions within the given sequences", variant_overlapping_tfbs.shape[0])
print("Number of different variants we found at least on TFBS", variant_overlapping_tfbs['ID'].nunique())

Number of Motif predictions within the given sequences 513
Number of different variants we found at least on TFBS 13


### Motif Gain or Loss? Assessing TFBS Changes
We can compare if the motifs fit the reference or the alternative sequence better so we know if the motif more likely to bind at a specific allele but from just computational predictions we cannot tell if the motif is now bound or not bound anymore because of the SNP. But we can increase the confidence of our predictions which are prone to false positives.

> **Observation:**  
> For 13 variants, we see 513 TFBS hits!  
> - Short motifs → many chance matches (false positives).
> - Multiple TFBS can overlap the same variant.
>
> **How to prioritize?**
> - Use more stringent q-value thresholds.
> - Focus on TFs expressed in your cell type.
> - Use additional data (e.g., ChIP-seq).

### Filter by q-value (Stringency)

Let's filter for motif matches with q-value < 0.1 for higher confidence.

In [46]:
q_value_threshold = 0.1
variant_metadata_fimo_combined_overlap_filtered = variant_overlapping_tfbs.loc[(variant_overlapping_tfbs['ALT_q-value'] < q_value_threshold) | (variant_metadata_fimo_combined_df['ALT_q-value'].isna())].copy()
variant_metadata_fimo_combined_overlap_filtered = variant_metadata_fimo_combined_overlap_filtered.loc[(variant_metadata_fimo_combined_overlap_filtered['REF_q-value'] < q_value_threshold) | (variant_metadata_fimo_combined_overlap_filtered['REF_q-value'].isna())].copy()


In [47]:
print("Number of Motif predictions within the given sequences", variant_metadata_fimo_combined_overlap_filtered.shape[0])
print("Number of different variants we found at least on TFBS", variant_metadata_fimo_combined_overlap_filtered['ID'].nunique())

Number of Motif predictions within the given sequences 383
Number of different variants we found at least on TFBS 13


In [None]:
# Preview filtered motif-variant table
variant_metadata_fimo_combined_overlap_filtered

Unnamed: 0,ID,motif_id,start,stop,strand,ALT_name,ALT_q-value,REF_name,REF_q-value,variant_pos,...,bcalm_variant_effect_log_ratio_activity,bcalm_reference_adjusted_p_value,bcalm_reference_log_ratio_activity,bcalm_alternative_adjusted_p_value,bcalm_alternative_log_ratio_activity,is_variant_overlapping,is_significant,REF_bit_score,ALT_bit_score,max_bit_score
8,cardiac_neuro_cava_random:ALT_ANKZF1|ENSG00000...,ATMIN.H13CORE.0.P.B,62,82,-,ALT_variant211862-219267001-T-C,0.01070,REF_variant21186,0.00643,[63],...,1.457050,1.0,0.025425,2.299743e-50,0.400958,True,True,6.7980,6.7980,6.798
9,cardiac_neuro_cava_random:ALT_ANKZF1|ENSG00000...,ATMIN.H13CORE.0.P.B,64,84,-,ALT_variant211862-219267001-T-C,0.01370,REF_variant21186,0.01530,[63],...,1.457050,1.0,0.025425,2.299743e-50,0.400958,True,True,7.1130,7.1130,7.113
45,cardiac_neuro_cava_random:ALT_ANKZF1|ENSG00000...,KLF12.H13CORE.0.P.C,53,70,-,ALT_variant211862-219267001-T-C,0.05120,,,[63],...,1.457050,1.0,0.025425,2.299743e-50,0.400958,True,True,2.7452,3.5032,3.503
49,cardiac_neuro_cava_random:ALT_ANKZF1|ENSG00000...,KLF13.H13CORE.1.P.C,53,72,-,ALT_variant211862-219267001-T-C,0.03690,,,[63],...,1.457050,1.0,0.025425,2.299743e-50,0.400958,True,True,3.1717,3.8242,3.824
53,cardiac_neuro_cava_random:ALT_ANKZF1|ENSG00000...,KLF14.H13CORE.1.P.C,51,73,-,ALT_variant211862-219267001-T-C,0.02010,,,[63],...,1.457050,1.0,0.025425,2.299743e-50,0.400958,True,True,3.2665,3.8407,3.841
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3185,cardiac_neuro_cava_random:ALT_SOX5|ENSG0000013...,ZNF77.H13CORE.0.P.C,167,189,-,,,REF_variant8740,0.04240,[176],...,-0.362474,1.0,-0.052217,9.003175e-01,-0.080430,True,True,5.8137,4.9764,5.814
3187,cardiac_neuro_cava_random:ALT_SOX5|ENSG0000013...,ZNF813.H13CORE.0.SG.A,166,191,+,ALT_variant874012-24455997-A-G,0.02440,REF_variant8740,0.02480,[176],...,-0.362474,1.0,-0.052217,9.003175e-01,-0.080430,True,True,5.5517,5.5526,5.553
3192,cardiac_neuro_cava_random:ALT_SOX5|ENSG0000013...,ZSC22.H13CORE.0.P.C,161,177,+,ALT_variant874012-24455997-A-G,0.00112,REF_variant8740,0.00470,[176],...,-0.362474,1.0,-0.052217,9.003175e-01,-0.080430,True,True,7.5497,8.0673,8.067
3265,cardiac_neuro_cava_random:ALT_WWOX|ENSG0000018...,KLF16.H13CORE.1.P.B,181,203,+,,,REF_variant15747,0.01930,[201],...,-0.372359,1.0,0.003278,1.000000e+00,-0.041520,True,True,5.6166,5.0571,5.617


### 🤔 Which TFBS Fits Better?

The q-value ranks motif matches, but longer motifs tend to have lower p/q-values.  
Let's use a **bitscore** (normalized for motif length) to compare motif fits near the variant.

---

<img src="resources/picture/mutliple_tfbs_fitting_variant_example_long_vs_short.png" alt="Motif Comparison Example" width="600"/>

#### Which TFBS fits better by eye?
But by eye I would say that the bottom one matches more important nucleotides. So note it is known that the p- and q-values are not reliable when comparing motifs with different lengths, because longer motifs get easier a lower p-value.
But we can use a bitscore which is measured with a unified size around the variant position to make motifs of different sizes comparable.

In [51]:
# function used to compute the local bitscore
def extract_motif_names(meme_file):
    motif_names = []
    with open(meme_file, "r") as f:
        for line in f:
            if line.startswith("MOTIF"):
                motif_name = line.split(" ", 1)[1].strip()
                motif_names.append(motif_name)
    return motif_names


# read the used motif file and extract the motifs
meme_file = "resources/hocomoco_v13/H13CORE_meme_format.meme"
motif_names = extract_motif_names(meme_file)

# Add the name while loading the file using biopython
hocomoco_motif_dict = {}
with open(meme_file) as handle:
    motif_list = motifs.parse(handle, "pfm-four-columns")
    i = 0
    for motif in motif_list:
        motif.name = motif_names[i]
        hocomoco_motif_dict[motif.name] = motif
        i += 1


def reverse_complement(seq):
    complement = str.maketrans("ACGTacgt", "TGCAtgca")
    return seq.translate(complement)[::-1]


def local_bitscore_tfbs(motif, sequence, motif_start, motif_end, variant_pos, strand):
    """
    Calculate the local bitscore for a given motif and sequence, using a window of 5 bases around the variant position.
    """
    zero_based_motif_start = motif_start - 1 # start should be 0-based and end 1-based
    pwm = motif.pwm
    zero_based_variant_pos = variant_pos[0]


    # Get only subset of sequence which is overlapped with the motif
    sequence_of_interest_start = max(zero_based_motif_start, zero_based_variant_pos - 5)
    sequence_of_interest_end = min(motif_end, zero_based_variant_pos + 5)


    motif_related_start = max(0, sequence_of_interest_start - zero_based_motif_start)
    motif_related_end = min(motif_end - zero_based_motif_start, sequence_of_interest_end - zero_based_motif_start)

    # normalize the score for the length of the motif
    additional_score = 10 - (motif_related_end - motif_related_start)
    motif_sequence = sequence[zero_based_motif_start:motif_end]
    # use the reverse complement sequence if the hit is on the "-" strand
    if strand == "-":
        motif_sequence = reverse_complement(motif_sequence)
    motif_part = motif_sequence[motif_related_start:motif_related_end]
    # crop the pwm to the motif related part:
    modified_pwm = {nuc: scores[motif_related_start:motif_related_end] for nuc, scores in pwm.items()}

    return additional_score + round(sum(modified_pwm[nuc][i] for i, nuc in enumerate(motif_part)), 4)

In [61]:
variant_metadata_fimo_combined_overlap_filtered['REF_bit_score'] = variant_metadata_fimo_combined_overlap_filtered.apply(
    lambda row: local_bitscore_tfbs(
        motif=hocomoco_motif_dict[row["motif_id"]],
        sequence=row['ref_sequence'],
        motif_start=row['start'],
        motif_end=row['stop'],
        variant_pos=row['variant_pos'],
        strand=row['strand']
    ),
    axis=1
)
variant_metadata_fimo_combined_overlap_filtered['ALT_bit_score'] = variant_metadata_fimo_combined_overlap_filtered.apply(
    lambda row: local_bitscore_tfbs(
        motif=hocomoco_motif_dict[row["motif_id"]],
        sequence=row['alt_sequence'],
        motif_start=row['start'],
        motif_end=row['stop'],
        variant_pos=row['variant_pos'],
        strand=row['strand']
    ),
    axis=1
)

variant_metadata_fimo_combined_overlap_filtered['max_bit_score'] = variant_metadata_fimo_combined_overlap_filtered[['REF_bit_score', 'ALT_bit_score']].max(axis=1).round(3)

> **Result:**  
> The bitscore now correctly ranks motif fits.  
> - **Higher bitscore = better motif fit.**
> - Filter for a minimal bitscore to avoid weak matches.

In [None]:
interesting_columns =["motif_id", "max_bit_score"]
variant_metadata_fimo_combined_overlap_filtered.loc[(variant_metadata_fimo_combined_overlap_filtered['ID'] == "cardiac_neuro_cava_random:ALT_ANKZF1|ENSG00000163516.14|EH38E2076021_fwd_tile1-1_ANKZF1|ENSG00000163516.14|EH38E2076021|2-219267001-T-C")&(variant_metadata_fimo_combined_overlap_filtered['motif_id'].isin(["ZN263.H13CORE.0.PM.A", "PATZ1.H13CORE.1.P.C"]))].sort_values(by=['max_bit_score'], ascending=False).head(10)[interesting_columns]


Unnamed: 0,motif_id,max_bit_score
309,ZN263.H13CORE.0.PM.A,8.394
148,PATZ1.H13CORE.1.P.C,8.046


In [None]:
# optional stringent filtering based on the maximal bit score
# TODO: clearify why exactly this bitscore threshold (max score is 10) required? (wait for feedback)
bitscore_threshold = 8
variant_metadata_fimo_combined_overlap_bitscore_filtered = variant_metadata_fimo_combined_overlap_filtered.loc[variant_metadata_fimo_combined_overlap_filtered['max_bit_score'] >= bitscore_threshold].copy()

---
## Increase Confidence: Use Cell-Type Expression Data

To reduce false positives, filter motif hits by whether the TF is expressed in your cell type (e.g., neurons).  
We'll use a binarized matrix derived from GTEx to annotate TF expression. 
> **Note:** 
We will not cover the data processing of this data in detail here.


In [None]:
# Functions to get the information about considered expressed transcription factors

def from_masterlist_info_2_gene_symbol(masterlist_info):
    """Expects to get a dict as input and tried to read the human gene_symbol"""
    try:
        gene_symbol = masterlist_info['species']['HUMAN']['gene_symbol']
        uniprot_id =  masterlist_info['species']['HUMAN']['uniprot_id']
        uniprot_accession_num =  masterlist_info['species']['HUMAN']['uniprot_ac']
    except:
        print("Problem during finding gene symbol of the following masterlist info: ", masterlist_info)
        gene_symbol = pd.NA
    return pd.Series({'gene_symbol': gene_symbol, 'uniprot_id': uniprot_id, 'uniprot_accession_num': uniprot_accession_num})


def get_neuro_activity_info(variant_metadata_df):
    """
    Loads the hocomoco annotations and gets gene_symbol for each motif_id before it merged the information about which motif
    is expressed either in Excitatory neurons or Inhibitory Neurons
    """
    binary_gene_activity_matrix_path = "resources/gene_matrix_specificity_by_percentage.tsv"
    binary_gene_activity_matrix_df = pd.read_csv(binary_gene_activity_matrix_path, sep="\t", low_memory=False)
    # get all genes which are considered active within neurons:
    binary_gene_activity_matrix_df['neuron_active'] = (binary_gene_activity_matrix_df["Excitatory neurons"] == 1) | (binary_gene_activity_matrix_df["Inhibitory neurons"] == 1)

    hocomoco_annotation_info_path = "resources/hocomoco_v13/H13CORE-CLUSTERED_annotation.jsonl"
    hocomoco_annotation_info_all = pd.read_json(path_or_buf=hocomoco_annotation_info_path, lines=True)
    hocomoco_annotation_info_all
    # rename name
    hocomoco_annotation_info = hocomoco_annotation_info_all.rename(columns={"name": "motif_id"}).copy()

    hocomoco_annotation_info[['gene_symbol', 'uniprot_id', 'uniprot_accession_num']] = hocomoco_annotation_info["masterlist_info"].apply(from_masterlist_info_2_gene_symbol)
    hocomoco_annotation_info
    # only name and tf name
    hocomoco_annotation_info = hocomoco_annotation_info[['motif_id', 'gene_symbol', 'uniprot_id', 'uniprot_accession_num', 'cluster_motifs']].copy()

    # add gene info
    variant_metadata_fimo_combined_overlap_bitscore_filtered_gene_info_df = variant_metadata_fimo_combined_overlap_bitscore_filtered.merge(hocomoco_annotation_info[['motif_id', 'gene_symbol']], on="motif_id", how="inner").copy()
    # add gene activity
    variant_metadata_fimo_combined_overlap_filtered_seq_gene_info_activity = variant_metadata_fimo_combined_overlap_bitscore_filtered_gene_info_df.merge(binary_gene_activity_matrix_df[["gene_symbol", "neuron_active"]], on="gene_symbol", how="inner").copy()
    return variant_metadata_fimo_combined_overlap_filtered_seq_gene_info_activity

In [107]:
significant_variant_overlapping_tfbs_neuro_activity = get_neuro_activity_info(variant_metadata_fimo_combined_overlap_bitscore_filtered)
significant_variant_overlapping_tfbs_neuro_activity = significant_variant_overlapping_tfbs_neuro_activity.loc[significant_variant_overlapping_tfbs_neuro_activity['neuron_active']].copy()

In [108]:
significant_variant_overlapping_tfbs_neuro_activity

Unnamed: 0,ID,motif_id,start,stop,strand,ALT_name,ALT_q-value,REF_name,REF_q-value,variant_pos,...,is_significant,REF_bit_score,ALT_bit_score,max_bit_score,max_q_value,max_bit_score_normal,alt_ref_bit_score_diff,tfbs_change_class,gene_symbol,neuron_active
0,cardiac_neuro_cava_random:ALT_ANKZF1|ENSG00000...,VEZF1.H13CORE.0.P.C,61,71,-,ALT_variant211862-219267001-T-C,0.0333,,,[63],...,True,7.0591,8.033,8.033,0.0333,1.163825,0.974,gain,VEZF1,True
1,cardiac_neuro_cava_random:ALT_ANKZF1|ENSG00000...,ZN263.H13CORE.0.PM.A,61,70,-,ALT_variant211862-219267001-T-C,0.0069,,,[63],...,True,7.7611,8.3937,8.394,0.0069,1.409498,0.633,gain,ZNF263,True
2,cardiac_neuro_cava_random:ALT_CASK|ENSG0000014...,ZN770.H13CORE.0.P.B,79,90,-,ALT_variant37314X-41492211-G-A,0.00629,REF_variant37314,0.000688,[81],...,True,8.1954,8.1954,8.195,0.00629,1.274072,0.0,no_change,ZNF770,True
3,cardiac_neuro_cava_random:ALT_CELF4|ENSG000001...,VEZF1.H13CORE.0.P.C,116,126,+,ALT_variant1799218-37394793-C-G,0.0397,REF_variant17992,0.0333,[115],...,True,9.1832,8.5165,9.183,0.0397,1.94644,-0.667,loss,VEZF1,True
4,cardiac_neuro_cava_random:ALT_DOLK|ENSG0000017...,VEZF1.H13CORE.0.P.C,41,51,-,ALT_variant364869-128912406-T-G,0.0209,REF_variant36486,0.0333,[48],...,True,8.3534,8.3534,8.353,0.0333,1.381596,0.0,no_change,VEZF1,True
5,cardiac_neuro_cava_random:ALT_DOLK|ENSG0000017...,ZN148.H13CORE.0.P.B,39,57,-,ALT_variant364869-128912406-T-G,0.0053,REF_variant36486,0.0242,[48],...,True,7.554,8.534,8.534,0.0242,1.504773,0.98,gain,ZNF148,True
6,cardiac_neuro_cava_random:ALT_DOLK|ENSG0000017...,ZSCAN25.H13CORE.0.PSG.A,29,49,+,ALT_variant364869-128912406-T-G,0.0234,REF_variant36486,0.0234,[48],...,True,8.3799,8.1085,8.38,0.0234,1.399971,-0.271,loss,ZSCAN25,True
7,cardiac_neuro_cava_random:ALT_DPYSL2|ENSG00000...,VEZF1.H13CORE.0.P.C,189,199,+,,,REF_variant34764,0.0409,[196],...,True,8.7548,7.8318,8.755,0.0409,1.655171,-0.923,loss,VEZF1,True
8,cardiac_neuro_cava_random:ALT_FHL2|ENSG0000011...,ZN467.H13CORE.0.P.C,92,114,+,,,REF_variant20372,0.00636,[113],...,True,8.9567,8.4829,8.957,0.00636,1.792639,-0.474,loss,ZNF467,True
9,cardiac_neuro_cava_random:ALT_FHL2|ENSG0000011...,ZSC22.H13CORE.0.P.C,98,114,+,ALT_variant203722-105426528-C-G,0.0126,REF_variant20372,0.0204,[113],...,True,8.4402,8.9196,8.92,0.0204,1.767459,0.479,gain,ZSCAN22,True


### Incorporating the variant effect and the motif information
Now we know which predicted TFBS are overlapping significant variants. But we are still not sure if the variant is actually overlapping an important nucleotide of the TFBS so a potential next question can be how these tfbs are affected by the variants? Are these TFBS generated with respect to the SNP or are they destroyed? So we would like to know at which allele the TFBS prediction is fitting better.


In [110]:
def classify_tfbs_change(row, ref_bit_score_col='REF_bit_score', alt_bit_score_col='ALT_bit_score'):
    """
    Classifies the tfbs change based on the bitscore differences
    """
    if row[alt_bit_score_col] > row[ref_bit_score_col]:
        return 'gain'
    elif row[alt_bit_score_col] < row[ref_bit_score_col]:
        return 'loss'
    else:
        return 'no_change'

In [111]:
significant_variant_overlapping_tfbs_neuro_activity['tfbs_change_class'] = significant_variant_overlapping_tfbs_neuro_activity.apply(
    lambda row: classify_tfbs_change(row, ref_bit_score_col='REF_bit_score', alt_bit_score_col='ALT_bit_score'),
    axis=1
)

In [112]:
significant_variant_overlapping_tfbs_neuro_activity['tfbs_change_class'].value_counts()

tfbs_change_class
gain         12
loss         10
no_change     4
Name: count, dtype: int64

In [114]:
significant_variant_overlapping_tfbs_neuro_activity = significant_variant_overlapping_tfbs_neuro_activity.loc[significant_variant_overlapping_tfbs_neuro_activity['tfbs_change_class'] != 'no_change'].copy()
significant_variant_overlapping_tfbs_neuro_activity['tfbs_change_class'].value_counts()

tfbs_change_class
gain    12
loss    10
Name: count, dtype: int64

In [None]:
# ? Extend this notebook by showing how many variants we find an overlapping TFBS?

## Interpret Results: Motif Gain/Loss & Variant Effect
As presented there are 12 potential gains of motifs and 10 potential losses but the consequence of the variant effect direction is not yet covered. So in the final step we will interprete these loss and gains of motifs together with the variant effect. 

If a TFBS is more likely on the alternative allele and the variant can be measured to increase or decrease the transcriptional activity, the effect on TF level would be that the TF is potentially increasing or decreasing the transcription in the tested cell-type, respectively.

If a motif is fitting the reference allele better the increasing variant would be interpreted on TF level as a transcriptional decreasing TFBS in this cell-type while the decreasing variant effect suggests that the respective TF is driving transcriptional activity.

- **Motif Gain:**  
  - If the motif fits the alternative allele better, is the variant activating or repressing?
- **Motif Loss:**  
  - If the motif fits the reference allele better, what is the effect direction?

> **Interpretation Table:**
>
> | Motif Change | Variant Effect | TFBS Role Prediction |
> |--------------|---------------|----------------------|
> | Gain         | ↑             | Activating           |
> | Gain         | ↓             | Repressing           |
> | Loss         | ↑             | Repressing           |
> | Loss         | ↓             | Activating           |

In [None]:
def interprete_TFBS_findings(row, variant_effect_col="bcalm_variant_effect_log_ratio_activity", bit_score_difference_col="alt_ref_bitscore_diff", new_interpretation_col="TFBS_interpretation"):
    """Iterate over each motif variant association and interprete the effect of the identified TFBS as follows:
        Activating TFBS:
            - if REF binds better and negative effect logfc => motif is predicted to be activating in this cell type
            - if ALT binds better and positive effect logfc => motif is predicted to be activating in this cell type
        Repressing TFBS:
            - if ALT binds better and negative effect logfc => motif is predicted to be repressing in this cell type
            - if REF binds better and positive effect logfc => motif is predicted to be repressing in this cell type
    """
    bit_score_diff = row[bit_score_difference_col] # since alt - ref if alt > ref => alt binds better than ref => it is positive if ref binds better it is negative
    variant_effect = row[variant_effect_col] # assume alt - ref
    interpretation=""
    if bit_score_diff > 0: # alt binds better
        if variant_effect > 0: # variant is activating
            interpretation = "predicted to be a transcriptional increasing motif in neural progenitors derived from WTC11"
        elif variant_effect < 0: # variant is repressing
            interpretation = "predicted to be a transcriptional decreasing motif in neural progenitors derived from WTC11"
        else:
            raise ValueError("Did not expect a variant with 0 variant effect better binding to alternative: please sanity check")
    elif bit_score_diff < 0: # ref binds better
        if variant_effect > 0: # variant is activating
            interpretation = "predicted to be a transcriptional decreasing motif in neural progenitors derived from WTC11"
        elif variant_effect < 0: # variant is repressing
            interpretation = "predicted to be a transcriptional increasing motif in neural progenitors derived from WTC11"
        else:
            raise ValueError("Did not expect a variant with 0 variant effect: please sanity check")
    else:
        interpretation="Ref and Alt have same bit score => unimportant base hit => motif does not seem to be causal"
        print(f"{row['motif_id']} REF and ALT have same bit score => motif is not causal")
    row[new_interpretation_col] = interpretation
    return row

In [116]:
significant_variant_overlapping_tfbs_neuro_activity["bit_score_change"] = significant_variant_overlapping_tfbs_neuro_activity["ALT_bit_score"] - significant_variant_overlapping_tfbs_neuro_activity["REF_bit_score"]

In [118]:
significant_variant_overlapping_tfbs_neuro_activity = significant_variant_overlapping_tfbs_neuro_activity.apply(lambda row: interprete_TFBS_findings(row, variant_effect_col="bcalm_variant_effect_log_ratio_activity", bit_score_difference_col="bit_score_change", new_interpretation_col="TFBS_interpretation"), axis=1)

In [124]:
significant_variant_overlapping_tfbs_neuro_activity['TFBS_interpretation'].value_counts()

TFBS_interpretation
motif is predicted to be activating in neural progenitors derived from WTC11    12
motif is predicted to be repressing in neural progenitors derived from WTC11    10
Name: count, dtype: int64

> **Conclusion:**  
> The gain of a motif does **not always** lead to increased transcriptional activity.  
> Integrating motif and variant effect data allows more nuanced interpretation.

In [123]:
significant_variant_overlapping_tfbs_neuro_activity['TFBS_interpretation'].value_counts()

# test if gain of TFBS == TFBS is predicted to be activating in neural progenitors derived from WTC11
significant_variant_overlapping_tfbs_neuro_activity.groupby(['TFBS_interpretation', 'tfbs_change_class']).size().unstack(fill_value=0)

tfbs_change_class,gain,loss
TFBS_interpretation,Unnamed: 1_level_1,Unnamed: 2_level_1
motif is predicted to be activating in neural progenitors derived from WTC11,8,4
motif is predicted to be repressing in neural progenitors derived from WTC11,4,6
