# TF co-occurrences for WP1 - Data
### Outline of this notebook:
    1. Constants, Path and Interface Definitions 
    2. Market basket analysis with tf-comb for all clusters/cell types of a tissue
    3. Differential analysis of all market basket analyses (CombObjs) from step 2 for the clusters/cell types of a single tissue (one DiffCombObj per tissue)
    4. Analysis for biological questions 
    
The aim is to find transcription factor (TF) co-occurrences for the clusters/cell types of human tissues with the help of the Python library tf-comb (https://github.com/loosolab/TF-COMB), using the data of WP1. The data are based on the "cell atlas of chromatin accessibility across 25 adult human tissues" (https://doi.org/10.1101/2021.02.17.431699).

**Biological question that we want to answer with this notebook:**

1. Find transcription factor co-occurrences that occur only in one (or more) clusters/cell types of a tissue.
    Such co-occurrences might allow us to identify a cluster.

**How to use this notebook:**
1. Please adapt the paths in "Constants, Path and Interface Definitions" to your setup.
2. Please make sure you have installed the kernel as described in the ReadMe.
3. Check that the WP3 data structure is provided correctly (see ReadMe).
    - It contains one folder per tissue: ../OUPTPUT_FOLDER/
    - The data (open chromatin regions per cluster as .bed files) can be found at e.g. ../OUPTPUT_FOLDER/<tissue>/WP6/*cluster_x.bed
4. Execute each notebook cell from top to bottom, one after another.
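
The folder layout from step 3 can be checked before any analysis is started. The following is a minimal sketch, assuming the layout `<root>/<tissue>/<cluster folder>/*.bed`; the helper name `find_cluster_bed_files` and the demo folder and file names are hypothetical and not part of this notebook.

```python
import pathlib
import tempfile


def find_cluster_bed_files(root, cluster_folder="wp6"):
    """Map each tissue folder under root to the sorted .bed file names in its cluster folder."""
    found = {}
    for tissue_dir in sorted(pathlib.Path(root).iterdir()):
        if not tissue_dir.is_dir():
            continue
        bed_dir = tissue_dir / cluster_folder
        # A missing cluster folder yields an empty list instead of an error
        found[tissue_dir.name] = sorted(p.name for p in bed_dir.glob("*.bed")) if bed_dir.is_dir() else []
    return found


# Demo on a temporary folder tree (made-up tissue and file names)
with tempfile.TemporaryDirectory() as tmp:
    root = pathlib.Path(tmp)
    (root / "lung" / "wp6").mkdir(parents=True)
    (root / "lung" / "wp6" / "lung.0.B_cell.bed").touch()
    (root / "heart_lv").mkdir()  # tissue without a cluster folder
    layout = find_cluster_bed_files(root)

for tissue, beds in layout.items():
    status = "ok" if beds else "NO .bed FILES FOUND"
    print(f"{tissue}: {len(beds)} .bed file(s) - {status}")
```

Tissues that report no .bed files would otherwise only show up later as an empty loop in the market basket step.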

## 1. Constants, Path and Interface Definitions:

In [2]:
from tfcomb import CombObj, DiffCombObj, utils
import os
import pathlib
import pandas as pd
import numpy as np
'''
Constants for this script.

This cell contains all paths and constants that are used later in this Jupyter notebook.

Please adapt the paths or constants if you use other files.
For example, adapt the genome path if you use another genome.
'''

# Path to genome fasta file. Is used for the market basket analysis of tfcomb.
genome_path="/mnt/workspace_stud/allstud/homo_sapiens.104.mainChr.fa"

# Path to the jaspar file (contains transcription factors (TF) binding profiles
# as position frequency matrices (PFMs)). Is used for the market basket analysis of tfcomb
main_jaspar_file="../testdaten/JASPAR2020_CORE_vertebrates.meme" 

# Path where results of this notebook will be written to (eg. TF_COMB objects, .pkl).
result_path="./results/wp1/"

# Paths in the result folder:
# Path to folder, where the resulting market basket analysis for a cluster/celltype is put 
main_analysis_path=f"{result_path}main/"

# Path to folder, where the differential analysis for a tissue is put 
differential_analysis_path=f"{result_path}diff_analysis/"
# Path to folder where the answers to our biological questions are written
answers_path=f"{result_path}answers/"

# Path to folder (Interface folder) of wp1, where the clusters of each tissue can be found.
#path_to_tissues="/mnt/workspace_stud/stud3/WP2_OUTPUT/FINISHED/"
path_to_tissues="/mnt/workspace_stud/stud2/output/"

# Tag for WP6 data in WP1 Interface
cluster_folder_tag="wp6/"

# Path to Folder with celltype annotation tables of wp1, wp1 does the celltype annotation 
# and adds it to the 
#celltype_annotation_path = "/mnt/workspace_stud/stud4/celltype_assignment_tables/"

# The following lines initially check whether all files/paths are available.
# If a result folder does not exist, it is created automatically.
for folder in (result_path, main_analysis_path, differential_analysis_path, answers_path):
    pathlib.Path(folder).mkdir(parents=True, exist_ok=True)

if not os.path.exists(genome_path):
    print(f"ERROR: path {genome_path} does not exist")

if not os.path.exists(main_jaspar_file):
    print(f"ERROR: path {main_jaspar_file} does not exist")



### Helper functions for reading-in folders/files:

In [3]:
def get_folder_names_in_folder(rel_folder_path:str):
    ''' 
        Read in the names of the folders in a folder.(rel_folder_path)
        ---
        Parameters:
        
        rel_folder_path: String
            relative Path to the folder location that is read in.
        ---
        Return: a List of Strings (folder names)
    '''

    return [item for item in os.listdir(rel_folder_path) if os.path.isdir(os.path.join(rel_folder_path, item))]

def read_in_file_names_of_folder(rel_path:str):
    ''' 
        Read in the file names in a folder (rel_path).
        ---
        Parameters:
        
        rel_path: String
            relative Path to the folder location.
        ---
        Return: a List of Strings (file names)
    '''
    return [f for f in os.listdir(rel_path) if os.path.isfile(os.path.join(rel_path, f))]
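
`os.listdir` returns entries in arbitrary order, so the order in which tissues and clusters are processed can differ between machines. If a reproducible order is wanted, a sorted variant could look like the sketch below. The function name `list_names_sorted` is hypothetical and is not used elsewhere in this notebook.

```python
import pathlib
import tempfile


def list_names_sorted(folder, want_dirs):
    """Return the sorted names of sub-folders (want_dirs=True) or files (want_dirs=False)."""
    entries = pathlib.Path(folder).iterdir()
    return sorted(e.name for e in entries if (e.is_dir() if want_dirs else e.is_file()))


# Demo on a temporary folder (made-up names)
with tempfile.TemporaryDirectory() as tmp:
    base = pathlib.Path(tmp)
    (base / "tissue_b").mkdir()
    (base / "tissue_a").mkdir()
    (base / "notes.txt").touch()
    demo_dirs = list_names_sorted(base, want_dirs=True)
    demo_files = list_names_sorted(base, want_dirs=False)

print(demo_dirs)
print(demo_files)
```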

## 2. Market basket analysis with tf-comb

We run a market basket analysis with tfcomb for each cluster/cell type that was clustered by WP1 from the raw single-cell ATAC data of the cell atlas project. As a result, we get the transcription factor co-occurrences for each cluster. The transcription factor motifs are provided as position frequency matrices from https://jaspar.genereg.net/search?q=&collection=CORE&tax_group=vertebrates. The genome used is homo_sapiens.104.mainChr.fa.

Approach:

    1. Read in the tissue folders of WP1
    2. For each tissue: read in the individual .bed files (content: open chromatin regions) for each cluster/cell type
    3. Do a market basket analysis for each cluster/cell type
    4. The resulting CombObj .pkl files can be found under *{main_analysis_path}{tissue_name}/{cluster_name}.pkl*
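
The cluster file names printed in the output further down appear to follow the pattern `<tissue>.<cluster_number>.<celltype>.bed`, where the cell type itself may contain dots (e.g. `lung.0.Lake_et_al.Science.Ex8`). Assuming that pattern holds and that tissue names contain no dots, a file name can be split into its parts as sketched here. The helper `split_cluster_file_name` is hypothetical and not part of the notebook.

```python
def split_cluster_file_name(file_name):
    """Split '<tissue>.<cluster_number>.<celltype>.bed' into its three parts."""
    stem = file_name[:-len(".bed")] if file_name.endswith(".bed") else file_name
    # maxsplit=2 keeps dots inside the cell type name intact
    tissue, cluster_number, celltype = stem.split(".", 2)
    return tissue, int(cluster_number), celltype


examples = [
    "complete_liver.0.Oligodendrocyte.bed",
    "complete_liver.1.-.bed",
    "lung.0.Lake_et_al.Science.Ex8.bed",
]
parsed = [split_cluster_file_name(name) for name in examples]
for name, parts in zip(examples, parsed):
    print(name, "->", parts)
```

Splitting with a fixed `maxsplit` rather than on every dot is what keeps annotations such as `Lake_et_al.Science.Ex8` in one piece.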



In [3]:
def do_market_basket_analyses_for_cell_cluster(mb_file_name: str, cell_cluster_path:str, tissue:str):
    '''
        Does a market basket analysis with tfcomb. Saves the tfcomb object as a .pkl file to main_analysis_path.
        ---
        Parameters:
        
        mb_file_name: string
            Name for the result file of the market basket analysis. 
            e.g. "<tissue>_<cluster_number>_<celltype>".
        
        cell_cluster_path: string
            Path to the .bed file with the genome regions to check for TF co-occurrences.
         
        tissue: string
            Tissue name, origin of the cluster.
    '''    
    # Initialize the save path; if the folder for the tissue does not exist, it is created.
    save_path = f'{main_analysis_path}{tissue}/'
    if not os.path.exists(save_path):
         pathlib.Path(save_path).mkdir(parents=True, exist_ok=True)
    
    # TF-comb market basket analysis
    comb = CombObj()
    comb.TFBS_from_motifs(regions= cell_cluster_path,
                   motifs=main_jaspar_file,
                   genome=genome_path,
                   threads=4)
    
    print(f'Start market basket analyses for cell-cluster/type: {mb_file_name}')
    comb.market_basket(threads=10)
    
    # if rules are empty nothing is saved 
    if len(comb.rules) <= 0:
        print(f'Could not find TF-cooccurences for cell-cluster/type: {mb_file_name}')
        return
    print(f'Finished market basket analyses for cell-cluster/type: {mb_file_name}')
    print(f'Found rules: {len(comb.rules)}')
    
    # save tf-comb obj to .pkl
    comb.to_pickle(f'{save_path}{mb_file_name}.pkl')
    print(f'Saved: {save_path}{mb_file_name}.pkl')

In [4]:
# Load tissue folder names for wp1 data
# ['complete_liver', 'heart_lv', 'esophagus_muscularis', 'leg_skin_exposed', 'colon_transverse']
tissues=get_folder_names_in_folder(rel_folder_path=path_to_tissues)
print(tissues)

['complete_liver', 'heart_lv', 'esophagus_muscularis', 'leg_skin_exposed', 'artery_tibial', 'esophagus_mucosa', 'lung', 'lung_sample', 'colon_transverse', 'liver_complete']


In [5]:
def make_mb_for_clusters(path_to_clusters:str, tissue:str):
    '''
        Wrapper function that does the market basket analysis for all clusters/cell types of a tissue.
        The cluster file name (without the .bed extension) is used as the name of the result file.
        ---
        Parameters:
        
        path_to_clusters: string
            Path to the .bed files (clusters) of a tissue.
         
        tissue: string
            Name of the tissue the clusters belong to.
        --- 
        Exceptions are caught: if any error occurs in a market basket analysis, a message is printed
        and the program continues with the next cluster.
    '''
    # Read in the .bed files for each cluster of the specific tissue
    cluster_file_names = read_in_file_names_of_folder(rel_path=path_to_clusters)
    print(cluster_file_names)
    
    # Do a market basket analysis for each cluster of a tissue
    for file_name in cluster_file_names:
        # e.g JF1O6_body_of_pancreas.10_peaks.bed -> [JF1O6_body_of_pancreas.10_peaks] = clustername = JF1O6_body_of_pancreas.10_peaks
        cluster_name = file_name.split('.bed')[0]
       
        try:
            print(cluster_name)
            print(file_name)
            # Prepare names and paths
            cluster_path=f"{path_to_clusters}{file_name}"
            # e.g. JF1O6_body_of_pancreas_c3_fibroblast
            mb_file_name = f"{cluster_name}"
            print(mb_file_name)
            do_market_basket_analyses_for_cell_cluster(mb_file_name=mb_file_name, cell_cluster_path=cluster_path, tissue=tissue)
        except Exception as error:
            # Report the cause of the failure instead of silently discarding it
            print(f"ERROR: Market basket for cluster: {cluster_name} in tissue {tissue} did not work ({error})")
            continue
            

In [None]:
# Create a market basket analysis for each cluster of the tissue
for tissue in tissues:
    path_to_clusters = f"{path_to_tissues}{tissue}/{cluster_folder_tag}"
    make_mb_for_clusters(path_to_clusters=path_to_clusters, tissue=tissue)

print("DONE: Created market basket analysis for each cluster")  

['complete_liver.0.Oligodendrocyte.bed', 'complete_liver.6.Plasmacytoid_dendritic_cell.bed', 'complete_liver.2.B_cell.bed', 'complete_liver.1.-.bed', 'complete_liver.4.Natural_killer_T_(NKT)_cell.bed', 'complete_liver.3.Gonadal_endothelial_cell.bed', 'complete_liver.5.Natural_killer_T_(NKT)_cell.bed']
complete_liver.0.Oligodendrocyte
complete_liver.0.Oligodendrocyte.bed
complete_liver.0.Oligodendrocyte
INFO: Scanning for TFBS with 4 thread(s)...
INFO: Progress: 10%
INFO: Progress: 20%
INFO: Progress: 31%
INFO: Progress: 40%
INFO: Progress: 50%
INFO: Progress: 60%
INFO: Progress: 70%
INFO: Progress: 80%
INFO: Progress: 90%
INFO: Finished!
INFO: Processing scanned TFBS
INFO: Identified 5998756 TFBS (746 unique names) within given regions
Start market basket analyses for cell-cluster/type: complete_liver.0.Oligodendrocyte
INFO: Setting up binding sites for counting
INFO: Counting co-occurrences within sites
INFO: Counting co-occurrence within background
INFO: Progress: 10%
INFO: Progress:

INFO: Scanning for TFBS with 4 thread(s)...
INFO: Progress: 11%
INFO: Progress: 20%
INFO: Progress: 30%
INFO: Progress: 40%
INFO: Progress: 50%
INFO: Progress: 60%
INFO: Progress: 71%
INFO: Progress: 81%
INFO: Progress: 90%
INFO: Finished!
INFO: Processing scanned TFBS
INFO: Identified 5620963 TFBS (746 unique names) within given regions
Start market basket analyses for cell-cluster/type: complete_liver.5.Natural_killer_T_(NKT)_cell
INFO: Setting up binding sites for counting
INFO: Counting co-occurrences within sites
INFO: Counting co-occurrence within background
INFO: Progress: 10%
INFO: Progress: 20%
INFO: Progress: 30%
INFO: Progress: 40%
INFO: Progress: 50%
INFO: Progress: 60%
INFO: Progress: 70%
INFO: Progress: 80%
INFO: Progress: 90%
INFO: Finished!
INFO: Done finding co-occurrences! Run .market_basket() to estimate significant pairs
INFO: Market basket analysis is done! Results are found in <CombObj>.rules
Finished market basket analyses for cell-cluster/type: complete_liver.5.

INFO: Market basket analysis is done! Results are found in <CombObj>.rules
Finished market basket analyses for cell-cluster/type: heart_lv.5.Multilymphoid_progenitor_cell
Found rules: 549160
Saved: ./results/wp1/main/heart_lv/heart_lv.5.Multilymphoid_progenitor_cell.pkl
heart_lv.0.Natural_killer_T_(NKT)_cell
heart_lv.0.Natural_killer_T_(NKT)_cell.bed
heart_lv.0.Natural_killer_T_(NKT)_cell
INFO: Scanning for TFBS with 4 thread(s)...
INFO: Progress: 10%
INFO: Progress: 20%
INFO: Progress: 30%
INFO: Progress: 40%
INFO: Progress: 50%
INFO: Progress: 60%
INFO: Progress: 70%
INFO: Progress: 80%
INFO: Progress: 90%
INFO: Finished!
INFO: Processing scanned TFBS
INFO: Identified 11359649 TFBS (746 unique names) within given regions
Start market basket analyses for cell-cluster/type: heart_lv.0.Natural_killer_T_(NKT)_cell
INFO: Setting up binding sites for counting
INFO: Counting co-occurrences within sites
INFO: Counting co-occurrence within background
INFO: Progress: 10%
INFO: Progress: 20%
IN

Saved: ./results/wp1/main/heart_lv/heart_lv.8.Natural_killer_T_(NKT)_cell.pkl
heart_lv.9.Oligodendrocyte
heart_lv.9.Oligodendrocyte.bed
heart_lv.9.Oligodendrocyte
INFO: Scanning for TFBS with 4 thread(s)...
INFO: Progress: 11%
INFO: Progress: 20%
INFO: Progress: 30%
INFO: Progress: 40%
INFO: Progress: 50%
INFO: Progress: 60%
INFO: Progress: 70%
INFO: Progress: 80%
INFO: Progress: 90%
INFO: Finished!
INFO: Processing scanned TFBS
INFO: Identified 2738640 TFBS (746 unique names) within given regions
Start market basket analyses for cell-cluster/type: heart_lv.9.Oligodendrocyte
INFO: Setting up binding sites for counting
INFO: Counting co-occurrences within sites
INFO: Counting co-occurrence within background
INFO: Progress: 10%
INFO: Progress: 20%
INFO: Progress: 30%
INFO: Progress: 40%
INFO: Progress: 50%
INFO: Progress: 60%
INFO: Progress: 70%
INFO: Progress: 80%
INFO: Progress: 90%
INFO: Finished!
INFO: Done finding co-occurrences! Run .market_basket() to estimate significant pairs
IN

INFO: Scanning for TFBS with 4 thread(s)...
INFO: Progress: 10%
INFO: Progress: 20%
INFO: Progress: 30%
INFO: Progress: 40%
INFO: Progress: 50%
INFO: Progress: 60%
INFO: Progress: 70%
INFO: Progress: 81%
INFO: Progress: 90%
INFO: Finished!
INFO: Processing scanned TFBS
INFO: Identified 10820119 TFBS (746 unique names) within given regions
Start market basket analyses for cell-cluster/type: esophagus_muscularis.1.Ciliated_epithelial_cell
INFO: Setting up binding sites for counting
INFO: Counting co-occurrences within sites
INFO: Counting co-occurrence within background
INFO: Progress: 10%
INFO: Progress: 20%
INFO: Progress: 30%
INFO: Progress: 40%
INFO: Progress: 50%
INFO: Progress: 60%
INFO: Progress: 70%
INFO: Progress: 80%
INFO: Progress: 90%
INFO: Finished!
INFO: Done finding co-occurrences! Run .market_basket() to estimate significant pairs
INFO: Market basket analysis is done! Results are found in <CombObj>.rules
Finished market basket analyses for cell-cluster/type: esophagus_mus

Saved: ./results/wp1/main/esophagus_muscularis/esophagus_muscularis.7.Endothelial_cell.pkl
esophagus_muscularis.17.AXL+SIGLEC6+_dendritic_cell
esophagus_muscularis.17.AXL+SIGLEC6+_dendritic_cell.bed
esophagus_muscularis.17.AXL+SIGLEC6+_dendritic_cell
INFO: Scanning for TFBS with 4 thread(s)...
INFO: Progress: 11%
INFO: Progress: 20%
INFO: Progress: 30%
INFO: Progress: 40%
INFO: Progress: 50%
INFO: Progress: 60%
INFO: Progress: 70%
INFO: Progress: 80%
INFO: Progress: 90%
INFO: Finished!
INFO: Processing scanned TFBS
INFO: Identified 4048650 TFBS (746 unique names) within given regions
Start market basket analyses for cell-cluster/type: esophagus_muscularis.17.AXL+SIGLEC6+_dendritic_cell
INFO: Setting up binding sites for counting
INFO: Counting co-occurrences within sites
INFO: Counting co-occurrence within background
INFO: Progress: 10%
INFO: Progress: 20%
INFO: Progress: 30%
INFO: Progress: 40%
INFO: Progress: 50%
INFO: Progress: 60%
INFO: Progress: 70%
INFO: Progress: 80%
INFO: Progr

INFO: Market basket analysis is done! Results are found in <CombObj>.rules
Finished market basket analyses for cell-cluster/type: esophagus_muscularis.8.Astrocyte
Found rules: 547020
Saved: ./results/wp1/main/esophagus_muscularis/esophagus_muscularis.8.Astrocyte.pkl
esophagus_muscularis.3.Pyramidal_cell
esophagus_muscularis.3.Pyramidal_cell.bed
esophagus_muscularis.3.Pyramidal_cell
INFO: Scanning for TFBS with 4 thread(s)...
INFO: Progress: 10%
INFO: Progress: 20%
INFO: Progress: 30%
INFO: Progress: 40%
INFO: Progress: 50%
INFO: Progress: 60%
INFO: Progress: 70%
INFO: Progress: 80%
INFO: Progress: 90%
INFO: Finished!
INFO: Processing scanned TFBS
INFO: Identified 9598348 TFBS (746 unique names) within given regions
Start market basket analyses for cell-cluster/type: esophagus_muscularis.3.Pyramidal_cell
INFO: Setting up binding sites for counting
INFO: Counting co-occurrences within sites
INFO: Counting co-occurrence within background
INFO: Progress: 10%
INFO: Progress: 20%
INFO: Progr

Start market basket analyses for cell-cluster/type: leg_skin_exposed.10.Oligodendrocyte
INFO: Setting up binding sites for counting
INFO: Counting co-occurrences within sites
INFO: Counting co-occurrence within background
INFO: Progress: 10%
INFO: Progress: 20%
INFO: Progress: 30%
INFO: Progress: 40%
INFO: Progress: 50%
INFO: Progress: 60%
INFO: Progress: 70%
INFO: Progress: 80%
INFO: Progress: 90%
INFO: Finished!
INFO: Done finding co-occurrences! Run .market_basket() to estimate significant pairs
INFO: Market basket analysis is done! Results are found in <CombObj>.rules
Finished market basket analyses for cell-cluster/type: leg_skin_exposed.10.Oligodendrocyte
Found rules: 533965
Saved: ./results/wp1/main/leg_skin_exposed/leg_skin_exposed.10.Oligodendrocyte.pkl
leg_skin_exposed.3.Oocyte
leg_skin_exposed.3.Oocyte.bed
leg_skin_exposed.3.Oocyte
INFO: Scanning for TFBS with 4 thread(s)...
INFO: Progress: 10%
INFO: Progress: 20%
INFO: Progress: 30%
INFO: Progress: 40%
INFO: Progress: 50%
I

## 3. Analysis

### Differential Analysis
We use the differential analysis of tfcomb to identify the differences in TF co-occurrences between all clusters/cell types of a tissue. For this, we load all market basket analyses (tfcomb objects) of a tissue (see step 2) into a DiffCombObj. After the differential analysis, we filter the object in order to find TF co-occurrences that occur only in a single cluster of that tissue or only in a specific cell type.

- mb = market basket analysis (CombObj of tf comb)
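
To make the idea concrete, the sketch below mimics the comparison on toy data with plain pandas: two rule tables are combined with an outer join, missing scores are filled with 0, and a log2 fold change with a pseudocount is computed. It has the same form as the `np.log2((p1_values + pseudo) / (p2_values + pseudo))` expression that appears in the log output further down. This is an illustration only and not tfcomb's implementation; the TF names, scores and the pseudocount value are made up.

```python
import numpy as np
import pandas as pd

# Toy cosine scores per TF pair for two hypothetical clusters
cluster_a = pd.DataFrame({"cluster_a": [0.8, 0.1]}, index=["TF1-TF2", "TF1-TF3"])
cluster_b = pd.DataFrame({"cluster_b": [0.2, 0.4]}, index=["TF1-TF2", "TF2-TF3"])

# Outer join keeps pairs found in only one cluster; missing scores become 0
merged = cluster_a.join(cluster_b, how="outer").fillna(0)

pseudo = 0.05  # assumed pseudocount, avoids division by zero
merged["log2fc"] = np.log2((merged["cluster_a"] + pseudo) / (merged["cluster_b"] + pseudo))

# Pairs with a large positive log2fc are candidates for being specific to cluster_a
print(merged.sort_values("log2fc", ascending=False))
```

With an inner join, the pairs `TF1-TF3` and `TF2-TF3` would be dropped, although pairs present in only one cluster are exactly the ones of interest here.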

In [8]:
# get the tissue names by folder names of the market basket analysis 
mb_tissues = get_folder_names_in_folder(rel_folder_path=main_analysis_path)
print(mb_tissues)
print(f"Number of tissues: {len(mb_tissues)}")

['complete_liver', 'heart_lv', 'esophagus_muscularis', 'leg_skin_exposed', 'artery_tibial', 'esophagus_mucosa', 'lung', 'lung_sample', 'colon_transverse', 'liver_complete']
Number of tissues: 10


In [9]:
def do_differential_analysis_for_tissues(tissues=[]):
    '''
        Differential analysis between all clusters/cell types of a tissue.
        ---
        Parameters:
        
        tissues: list
            Tissue names (folder names of the market basket analyses).
        ---
        The DiffCombObjs are saved as .pkl files to differential_analysis_path.
    '''
    for tissue_folder in tissues:
        
        diff_save_path = f"{differential_analysis_path}{tissue_folder}/"
        # Check if  folder for differential_analysis already exists for tissue, if not create new one
        if not os.path.exists(diff_save_path):
             pathlib.Path(diff_save_path).mkdir(parents=True, exist_ok=True)
        
        # get file names of the market basket analysis
        tissue_mb_files = read_in_file_names_of_folder(rel_path=f"{main_analysis_path}{tissue_folder}/")
        
        # Holds the CombObjs of all clusters of the tissue
        tissue_mbs_to_compare = []
        for file in tissue_mb_files:
            print(file)
            file_name = file.split('.pkl')[0]
            # Load the CombObj (market basket analysis) for each cluster of a tissue
            obj = CombObj().from_pickle(f"{main_analysis_path}{tissue_folder}/{file}")
            obj.set_prefix(file_name)
            tissue_mbs_to_compare.append(obj)
        
        # Create a DiffCombObj with all CombObjs (market basket analyses) of the clusters in a tissue
        compare_obj = DiffCombObj(tissue_mbs_to_compare, measure="cosine", join="outer", fillna=True)
        # Save the raw (not yet normalized) DiffCombObj
        compare_obj.to_pickle(f'{diff_save_path}{tissue_folder}.pkl')
        # Normalize the DiffCombObj and calculate the fold changes between all clusters
        compare_obj.normalize()
        compare_obj.calculate_foldchanges()
        
        # Remove duplicated rules: of A-B and B-A, B-A is removed
        compare_obj.simplify_rules()
        # Save the normalized, fold-change-calculated and simplified DiffCombObj to .pkl
        compare_obj.to_pickle(f'{diff_save_path}{tissue_folder}_normalized.pkl')
        print(f"Done: Diff analysis for tissue {tissue_folder}")
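
The log output of the differential analysis lists one fold change per pair of clusters ("A / B"). For n clusters this gives n(n-1)/2 contrasts, so the runtime grows quadratically with the number of clusters per tissue. The small sketch below uses `itertools.combinations` to reproduce the pairing seen in the log for the first three `complete_liver` files; that tfcomb builds its contrasts this way internally is an assumption.

```python
import itertools

prefixes = [
    "complete_liver.0.Oligodendrocyte",
    "complete_liver.6.Plasmacytoid_dendritic_cell",
    "complete_liver.2.B_cell",
]

# Every unordered pair of clusters, in listing order
contrasts = list(itertools.combinations(prefixes, 2))
for first, second in contrasts:
    print(f"{first} / {second}")

# Number of contrasts for n clusters: n * (n - 1) / 2
n = 7  # complete_liver has 7 clusters in this run
n_contrasts = n * (n - 1) // 2
print(n_contrasts)
```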
        

In [10]:
# Do the differential analysis for all clusters in a tissue 
do_differential_analysis_for_tissues(tissues=mb_tissues)

complete_liver.0.Oligodendrocyte.pkl
complete_liver.6.Plasmacytoid_dendritic_cell.pkl
complete_liver.2.B_cell.pkl
complete_liver.1.-.pkl
complete_liver.4.Natural_killer_T_(NKT)_cell.pkl
complete_liver.3.Gonadal_endothelial_cell.pkl
complete_liver.5.Natural_killer_T_(NKT)_cell.pkl
INFO: Calculating foldchange for contrast: complete_liver.0.Oligodendrocyte / complete_liver.6.Plasmacytoid_dendritic_cell
INFO: Calculating foldchange for contrast: complete_liver.0.Oligodendrocyte / complete_liver.2.B_cell
INFO: Calculating foldchange for contrast: complete_liver.0.Oligodendrocyte / complete_liver.1.-
INFO: Calculating foldchange for contrast: complete_liver.0.Oligodendrocyte / complete_liver.4.Natural_killer_T_(NKT)_cell
INFO: Calculating foldchange for contrast: complete_liver.0.Oligodendrocyte / complete_liver.3.Gonadal_endothelial_cell
INFO: Calculating foldchange for contrast: complete_liver.0.Oligodendrocyte / complete_liver.5.Natural_killer_T_(NKT)_cell
INFO: Calculating foldchange fo

INFO: Calculating foldchange for contrast: heart_lv.5.Multilymphoid_progenitor_cell / heart_lv.6.Natural_killer_T_(NKT)_cell
INFO: Calculating foldchange for contrast: heart_lv.5.Multilymphoid_progenitor_cell / heart_lv.7.Multilymphoid_progenitor_cell
INFO: Calculating foldchange for contrast: heart_lv.5.Multilymphoid_progenitor_cell / heart_lv.4.Natural_killer_T_(NKT)_cell
INFO: Calculating foldchange for contrast: heart_lv.5.Multilymphoid_progenitor_cell / heart_lv.8.Natural_killer_T_(NKT)_cell
INFO: Calculating foldchange for contrast: heart_lv.5.Multilymphoid_progenitor_cell / heart_lv.9.Oligodendrocyte
INFO: Calculating foldchange for contrast: heart_lv.0.Natural_killer_T_(NKT)_cell / heart_lv.10.Plasmacytoid_dendritic_cell
INFO: Calculating foldchange for contrast: heart_lv.0.Natural_killer_T_(NKT)_cell / heart_lv.6.Natural_killer_T_(NKT)_cell
INFO: Calculating foldchange for contrast: heart_lv.0.Natural_killer_T_(NKT)_cell / heart_lv.7.Multilymphoid_progenitor_cell
INFO: Calcula

INFO: Calculating foldchange for contrast: esophagus_muscularis.0.Ciliated_epithelial_cell / esophagus_muscularis.17.AXL+SIGLEC6+_dendritic_cell
INFO: Calculating foldchange for contrast: esophagus_muscularis.0.Ciliated_epithelial_cell / esophagus_muscularis.11.Ciliated_epithelial_cell
INFO: Calculating foldchange for contrast: esophagus_muscularis.0.Ciliated_epithelial_cell / esophagus_muscularis.16.Leydig_precursor_cell
INFO: Calculating foldchange for contrast: esophagus_muscularis.0.Ciliated_epithelial_cell / esophagus_muscularis.19.Mitotic_arrest_phase_fetal_germ_cell
INFO: Calculating foldchange for contrast: esophagus_muscularis.0.Ciliated_epithelial_cell / esophagus_muscularis.13.FGFR1HighNME5-_epithelial_cell
INFO: Calculating foldchange for contrast: esophagus_muscularis.0.Ciliated_epithelial_cell / esophagus_muscularis.8.Astrocyte
INFO: Calculating foldchange for contrast: esophagus_muscularis.0.Ciliated_epithelial_cell / esophagus_muscularis.3.Pyramidal_cell
INFO: Calculati

INFO: Calculating foldchange for contrast: esophagus_muscularis.18.Natural_killer_T_(NKT)_cell / esophagus_muscularis.9.Multilymphoid_progenitor_cell
INFO: Calculating foldchange for contrast: esophagus_muscularis.18.Natural_killer_T_(NKT)_cell / esophagus_muscularis.15.Natural_killer_T_(NKT)_cell
INFO: Calculating foldchange for contrast: esophagus_muscularis.18.Natural_killer_T_(NKT)_cell / esophagus_muscularis.7.Endothelial_cell
INFO: Calculating foldchange for contrast: esophagus_muscularis.18.Natural_killer_T_(NKT)_cell / esophagus_muscularis.17.AXL+SIGLEC6+_dendritic_cell
INFO: Calculating foldchange for contrast: esophagus_muscularis.18.Natural_killer_T_(NKT)_cell / esophagus_muscularis.11.Ciliated_epithelial_cell
INFO: Calculating foldchange for contrast: esophagus_muscularis.18.Natural_killer_T_(NKT)_cell / esophagus_muscularis.16.Leydig_precursor_cell
INFO: Calculating foldchange for contrast: esophagus_muscularis.18.Natural_killer_T_(NKT)_cell / esophagus_muscularis.19.Mitot

  self.rules[log2_col] = np.log2((p1_values + pseudo) / (p2_values + pseudo))


INFO: Calculating foldchange for contrast: esophagus_muscularis.9.Multilymphoid_progenitor_cell / esophagus_muscularis.11.Ciliated_epithelial_cell
INFO: Calculating foldchange for contrast: esophagus_muscularis.9.Multilymphoid_progenitor_cell / esophagus_muscularis.16.Leydig_precursor_cell
INFO: Calculating foldchange for contrast: esophagus_muscularis.9.Multilymphoid_progenitor_cell / esophagus_muscularis.19.Mitotic_arrest_phase_fetal_germ_cell
INFO: Calculating foldchange for contrast: esophagus_muscularis.9.Multilymphoid_progenitor_cell / esophagus_muscularis.13.FGFR1HighNME5-_epithelial_cell
INFO: Calculating foldchange for contrast: esophagus_muscularis.9.Multilymphoid_progenitor_cell / esophagus_muscularis.8.Astrocyte
INFO: Calculating foldchange for contrast: esophagus_muscularis.9.Multilymphoid_progenitor_cell / esophagus_muscularis.3.Pyramidal_cell
INFO: Calculating foldchange for contrast: esophagus_muscularis.9.Multilymphoid_progenitor_cell / esophagus_muscularis.12.Natural_

INFO: Calculating foldchange for contrast: esophagus_muscularis.19.Mitotic_arrest_phase_fetal_germ_cell / esophagus_muscularis.10.Plasmacytoid_dendritic_cell
INFO: Calculating foldchange for contrast: esophagus_muscularis.13.FGFR1HighNME5-_epithelial_cell / esophagus_muscularis.8.Astrocyte
INFO: Calculating foldchange for contrast: esophagus_muscularis.13.FGFR1HighNME5-_epithelial_cell / esophagus_muscularis.3.Pyramidal_cell
INFO: Calculating foldchange for contrast: esophagus_muscularis.13.FGFR1HighNME5-_epithelial_cell / esophagus_muscularis.12.Natural_killer_T_(NKT)_cell
INFO: Calculating foldchange for contrast: esophagus_muscularis.13.FGFR1HighNME5-_epithelial_cell / esophagus_muscularis.14.Astrocyte
INFO: Calculating foldchange for contrast: esophagus_muscularis.13.FGFR1HighNME5-_epithelial_cell / esophagus_muscularis.10.Plasmacytoid_dendritic_cell
INFO: Calculating foldchange for contrast: esophagus_muscularis.8.Astrocyte / esophagus_muscularis.3.Pyramidal_cell
INFO: Calculating

INFO: Calculating foldchange for contrast: leg_skin_exposed.7.Multilymphoid_progenitor_cell / leg_skin_exposed.2.Astrocyte
INFO: Calculating foldchange for contrast: leg_skin_exposed.7.Multilymphoid_progenitor_cell / leg_skin_exposed.13.AXL+SIGLEC6+_dendritic_cell
INFO: Calculating foldchange for contrast: leg_skin_exposed.7.Multilymphoid_progenitor_cell / leg_skin_exposed.12.Multilymphoid_progenitor_cell
INFO: Calculating foldchange for contrast: leg_skin_exposed.0.Multilymphoid_progenitor_cell / leg_skin_exposed.11.Natural_killer_T_(NKT)_cell
INFO: Calculating foldchange for contrast: leg_skin_exposed.0.Multilymphoid_progenitor_cell / leg_skin_exposed.4.Purkinje_cell
INFO: Calculating foldchange for contrast: leg_skin_exposed.0.Multilymphoid_progenitor_cell / leg_skin_exposed.1.Mitotic_arrest_phase_fetal_germ_cell
INFO: Calculating foldchange for contrast: leg_skin_exposed.0.Multilymphoid_progenitor_cell / leg_skin_exposed.9.Natural_killer_T_(NKT)_cell
INFO: Calculating foldchange fo

INFO: Calculating foldchange for contrast: artery_tibial.1.- / artery_tibial.6.-
INFO: Calculating foldchange for contrast: artery_tibial.1.- / artery_tibial.0.Multilymphoid_progenitor_cell
INFO: Calculating foldchange for contrast: artery_tibial.1.- / artery_tibial.4.Natural_killer_T_(NKT)_cell
INFO: Calculating foldchange for contrast: artery_tibial.1.- / artery_tibial.3.Plasmacytoid_dendritic_cell
INFO: Calculating foldchange for contrast: artery_tibial.6.- / artery_tibial.0.Multilymphoid_progenitor_cell
INFO: Calculating foldchange for contrast: artery_tibial.6.- / artery_tibial.4.Natural_killer_T_(NKT)_cell
INFO: Calculating foldchange for contrast: artery_tibial.6.- / artery_tibial.3.Plasmacytoid_dendritic_cell
INFO: Calculating foldchange for contrast: artery_tibial.0.Multilymphoid_progenitor_cell / artery_tibial.4.Natural_killer_T_(NKT)_cell
INFO: Calculating foldchange for contrast: artery_tibial.0.Multilymphoid_progenitor_cell / artery_tibial.3.Plasmacytoid_dendritic_cell
INF

INFO: Calculating foldchange for contrast: esophagus_mucosa.13.Multilymphoid_progenitor_cell / esophagus_mucosa.3.Monocyte
INFO: Calculating foldchange for contrast: esophagus_mucosa.13.Multilymphoid_progenitor_cell / esophagus_mucosa.16.Gonadal_endothelial_cell
INFO: Calculating foldchange for contrast: esophagus_mucosa.13.Multilymphoid_progenitor_cell / esophagus_mucosa.15.Natural_killer_T_(NKT)_cell
INFO: Calculating foldchange for contrast: esophagus_mucosa.13.Multilymphoid_progenitor_cell / esophagus_mucosa.11.Natural_killer_T_(NKT)_cell
INFO: Calculating foldchange for contrast: esophagus_mucosa.13.Multilymphoid_progenitor_cell / esophagus_mucosa.17.-
INFO: Calculating foldchange for contrast: esophagus_mucosa.13.Multilymphoid_progenitor_cell / esophagus_mucosa.2.Multilymphoid_progenitor_cell
INFO: Calculating foldchange for contrast: esophagus_mucosa.13.Multilymphoid_progenitor_cell / esophagus_mucosa.6.-
INFO: Calculating foldchange for contrast: esophagus_mucosa.13.Multilympho

INFO: Calculating foldchange for contrast: esophagus_mucosa.15.Natural_killer_T_(NKT)_cell / esophagus_mucosa.11.Natural_killer_T_(NKT)_cell
INFO: Calculating foldchange for contrast: esophagus_mucosa.15.Natural_killer_T_(NKT)_cell / esophagus_mucosa.17.-
INFO: Calculating foldchange for contrast: esophagus_mucosa.15.Natural_killer_T_(NKT)_cell / esophagus_mucosa.2.Multilymphoid_progenitor_cell
INFO: Calculating foldchange for contrast: esophagus_mucosa.15.Natural_killer_T_(NKT)_cell / esophagus_mucosa.6.-
INFO: Calculating foldchange for contrast: esophagus_mucosa.15.Natural_killer_T_(NKT)_cell / esophagus_mucosa.14.Oocyte
INFO: Calculating foldchange for contrast: esophagus_mucosa.15.Natural_killer_T_(NKT)_cell / esophagus_mucosa.8.Oligodendrocyte
INFO: Calculating foldchange for contrast: esophagus_mucosa.15.Natural_killer_T_(NKT)_cell / esophagus_mucosa.9.B_cell
INFO: Calculating foldchange for contrast: esophagus_mucosa.15.Natural_killer_T_(NKT)_cell / esophagus_mucosa.7.Natural_k

INFO: Calculating foldchange for contrast: lung.8.Multilymphoid_progenitor_cell / lung.7.Gonadal_endothelial_cell
INFO: Calculating foldchange for contrast: lung.8.Multilymphoid_progenitor_cell / lung.10.Natural_killer_T_(NKT)_cell
INFO: Calculating foldchange for contrast: lung.8.Multilymphoid_progenitor_cell / lung.3.Multilymphoid_progenitor_cell
INFO: Calculating foldchange for contrast: lung.0.Lake_et_al.Science.Ex8 / lung.1.FGFR1HighNME5-_epithelial_cell
INFO: Calculating foldchange for contrast: lung.0.Lake_et_al.Science.Ex8 / lung.4.Natural_killer_T_(NKT)_cell
INFO: Calculating foldchange for contrast: lung.0.Lake_et_al.Science.Ex8 / lung.5.Natural_killer_T_(NKT)_cell
INFO: Calculating foldchange for contrast: lung.0.Lake_et_al.Science.Ex8 / lung.7.Gonadal_endothelial_cell
INFO: Calculating foldchange for contrast: lung.0.Lake_et_al.Science.Ex8 / lung.10.Natural_killer_T_(NKT)_cell
INFO: Calculating foldchange for contrast: lung.0.Lake_et_al.Science.Ex8 / lung.3.Multilymphoid_pr

colon_transverse.4.Natural_killer_T_(NKT)_cell.pkl
INFO: Calculating foldchange for contrast: colon_transverse.1.Oogenesis_phase_fetal_germ_cell / colon_transverse.0.Granulocyte-monocyte_progenitor
INFO: Calculating foldchange for contrast: colon_transverse.1.Oogenesis_phase_fetal_germ_cell / colon_transverse.7.Endothelial_cell
INFO: Calculating foldchange for contrast: colon_transverse.1.Oogenesis_phase_fetal_germ_cell / colon_transverse.2.Multilymphoid_progenitor_cell
INFO: Calculating foldchange for contrast: colon_transverse.1.Oogenesis_phase_fetal_germ_cell / colon_transverse.9.Purkinje_cell
INFO: Calculating foldchange for contrast: colon_transverse.1.Oogenesis_phase_fetal_germ_cell / colon_transverse.10.Pyramidal_cell
INFO: Calculating foldchange for contrast: colon_transverse.1.Oogenesis_phase_fetal_germ_cell / colon_transverse.3.Natural_killer_T_(NKT)_cell
INFO: Calculating foldchange for contrast: colon_transverse.1.Oogenesis_phase_fetal_germ_cell / colon_transverse.8.Lake_et

INFO: Calculating foldchange for contrast: colon_transverse.2.Multilymphoid_progenitor_cell / colon_transverse.12.Multilymphoid_progenitor_cell
INFO: Calculating foldchange for contrast: colon_transverse.2.Multilymphoid_progenitor_cell / colon_transverse.4.Natural_killer_T_(NKT)_cell
INFO: Calculating foldchange for contrast: colon_transverse.9.Purkinje_cell / colon_transverse.10.Pyramidal_cell
INFO: Calculating foldchange for contrast: colon_transverse.9.Purkinje_cell / colon_transverse.3.Natural_killer_T_(NKT)_cell
INFO: Calculating foldchange for contrast: colon_transverse.9.Purkinje_cell / colon_transverse.8.Lake_et_al.Science.In2
INFO: Calculating foldchange for contrast: colon_transverse.9.Purkinje_cell / colon_transverse.15.Airway_secretory_cell
INFO: Calculating foldchange for contrast: colon_transverse.9.Purkinje_cell / colon_transverse.5.Natural_killer_T_(NKT)_cell
INFO: Calculating foldchange for contrast: colon_transverse.9.Purkinje_cell / colon_transverse.13.Natural_killer

INFO: Calculating foldchange for contrast: colon_transverse.5.Natural_killer_T_(NKT)_cell / colon_transverse.12.Multilymphoid_progenitor_cell
INFO: Calculating foldchange for contrast: colon_transverse.5.Natural_killer_T_(NKT)_cell / colon_transverse.4.Natural_killer_T_(NKT)_cell
INFO: Calculating foldchange for contrast: colon_transverse.13.Natural_killer_T_(NKT)_cell / colon_transverse.11.Pyramidal_cell
INFO: Calculating foldchange for contrast: colon_transverse.13.Natural_killer_T_(NKT)_cell / colon_transverse.6.Meiotic_prophase_fetal_germ_cell
INFO: Calculating foldchange for contrast: colon_transverse.13.Natural_killer_T_(NKT)_cell / colon_transverse.14.FGFR1HighNME5-_epithelial_cell
INFO: Calculating foldchange for contrast: colon_transverse.13.Natural_killer_T_(NKT)_cell / colon_transverse.16.Multilymphoid_progenitor_cell
INFO: Calculating foldchange for contrast: colon_transverse.13.Natural_killer_T_(NKT)_cell / colon_transverse.17.Lake_et_al.Science.Ex8
INFO: Calculating foldc

INFO: Calculating foldchange for contrast: liver_complete.2.B_cell / liver_complete.0.B_cell
INFO: Calculating foldchange for contrast: liver_complete.2.B_cell / liver_complete.3.Oligodendrocyte
INFO: Calculating foldchange for contrast: liver_complete.2.B_cell / liver_complete.9.-
INFO: Calculating foldchange for contrast: liver_complete.2.B_cell / liver_complete.10.-
INFO: Calculating foldchange for contrast: liver_complete.2.B_cell / liver_complete.7.-
INFO: Calculating foldchange for contrast: liver_complete.0.Goblet_cell / liver_complete.7.Leydig_precursor_cell
INFO: Calculating foldchange for contrast: liver_complete.0.Goblet_cell / liver_complete.1.Natural_killer_T_(NKT)_cell
INFO: Calculating foldchange for contrast: liver_complete.0.Goblet_cell / liver_complete.6.Oligodendrocyte
INFO: Calculating foldchange for contrast: liver_complete.0.Goblet_cell / liver_complete.4.Multilymphoid_progenitor_cell
INFO: Calculating foldchange for contrast: liver_complete.0.Goblet_cell / liver_

INFO: Calculating foldchange for contrast: liver_complete.3.FGFR1HighNME5-_epithelial_cell / liver_complete.1.-
INFO: Calculating foldchange for contrast: liver_complete.3.FGFR1HighNME5-_epithelial_cell / liver_complete.5.Ciliated_epithelial_cell
INFO: Calculating foldchange for contrast: liver_complete.3.FGFR1HighNME5-_epithelial_cell / liver_complete.0.B_cell
INFO: Calculating foldchange for contrast: liver_complete.3.FGFR1HighNME5-_epithelial_cell / liver_complete.3.Oligodendrocyte
INFO: Calculating foldchange for contrast: liver_complete.3.FGFR1HighNME5-_epithelial_cell / liver_complete.9.-
INFO: Calculating foldchange for contrast: liver_complete.3.FGFR1HighNME5-_epithelial_cell / liver_complete.10.-
INFO: Calculating foldchange for contrast: liver_complete.3.FGFR1HighNME5-_epithelial_cell / liver_complete.7.-
INFO: Calculating foldchange for contrast: liver_complete.8.- / liver_complete.4.Plasmacytoid_dendritic_cell
INFO: Calculating foldchange for contrast: liver_complete.8.- / 

# 4. Specific analysis:
### Biological Question:
 Find transcription-factor co-occurrences that occur in only one (or a few) "cluster/celltype of a tissue".
 - Solution with a hard threshold (a, all clusters) or a weak threshold (b, celltypes in a tissue):
       - a) The TF-pair occurs in only a single cluster of a tissue:
            e.g. Cluster1 has TF1-TF1 and the other clusters in the tissue do not have TF1-TF1.
       - b) A TF-pair occurs significantly in only some clusters:
            e.g. A tissue has 3 endothelial cell clusters and these three clusters share a TF-pair
            that does not occur in the other clusters of the tissue. (needs celltype annotation)
          
###  Steps:
1. Read in the tfcomb DiffCombObjs of the differential analysis (Point 3)
2. Get the rules of the DiffCombObj
3. Filter the DiffCombObj rules for each cluster in it:
       - e.g. to investigate cluster1: get the log2fc columns of all contrasts
         that involve cluster1, plus the cosine column of cluster1
       - then filter the rules (TF-pairs) for TF-pairs with a significant log2fc
       - We want TF-pairs that show a significant log2fc in every contrast,
         because this indicates a TF-pair that is distinctive for the investigated cluster in that tissue.
         A high cosine additionally indicates that the TF-pair really
         co-occurs in that cluster.
       - To find a TF-pair that occurs in only a single cluster of a tissue,
         we check the log2fc columns of the significant TF-pairs: if all of them hold the same value,
         the TF-pair occurs only in that cluster of the tissue.
    

In [4]:
# Get the folder names of the tissues that have a differential analysis
diff_mb_tissues = get_folder_names_in_folder(rel_folder_path=differential_analysis_path)
print(diff_mb_tissues)

['complete_liver', 'heart_lv', 'esophagus_muscularis', 'leg_skin_exposed', 'artery_tibial', 'esophagus_mucosa', 'lung', 'lung_sample', 'colon_transverse', 'liver_complete']


In [5]:
def get_significant_rules(df:pd.DataFrame, cosine_col:str, cosine_threshold=0.001, log2fc_threshold_percent=0.10):
    '''
    Filter a rules DataFrame (TF-pairs) of a DiffCombObj by its log2foldchange columns and one cosine column
    to keep only significant TF-co-occurrences (TF-pairs).
    Only TF-pairs with a significant log2foldchange in every contrast are kept in the returned DataFrame.
    The same filtering is applied to the cosine column of the investigated cluster.
    
    Filtering: for each column (log2foldchange / cosine) the thresholds for significant values
    are calculated with tfcomb.utils.get_threshold. All rows (TF-pairs) whose value does not pass the threshold
    (e.g. beyond the 95% percentile for log2foldchange, or the 99.9% percentile for cosine) are removed.
    Thresholds are always calculated on the original DataFrame (df);
    rows are removed from a copy of it.
    
    This is done column by column:
    e.g. 1. The thresholds for column "log2change_cluster1/cluster2" are calculated on the original df.
         2. All rows (TF-pairs) whose value in that column lies between the lower and the upper threshold
            are removed from the copy.
         These steps are repeated for each column.

    ---
    Parameters:
    df: pd.DataFrame
        DiffCombObj rules, already reduced to the columns of the investigated cluster,
        e.g.:
        index    cosine_cluster1  log2change_cluster1/cluster2  log2change_cluster1/cluster3  log2change_cluster1/cluster4
        TF1-TF2  0.20              3.4                           1                             4
        TF3-TF4  0.30             -0.3                           2                             1
        TFX-TFY  0.15              8                             3                            -4
        ...
    
    cosine_col: str
        Name of the cosine column of the investigated cluster.
    
    cosine_threshold: float
        Value passed as `percent` to tfcomb.utils.get_threshold for the cosine column.
        Only rows above the resulting upper threshold are kept.
        
    log2fc_threshold_percent: float
        Value passed as `percent` to tfcomb.utils.get_threshold for each log2foldchange column.
        Only rows above the upper or below the lower threshold are kept.
    ---
    Returns:
        pd.DataFrame, the filtered copy of df.

    '''
    # Copy of the original DataFrame; rows (TF-pairs) are removed from it column by column.
    reduced_df = df.copy(deep=True)
    for col in df.columns:
        # Thresholds are always calculated on the original df, not on the reduced copy.
        if col == cosine_col:
            # For the cosine column only high values are of interest (there are no negative values),
            # so only the upper threshold is used.
            measure_threshold = utils.get_threshold(df[col], "both", percent=cosine_threshold)
            upper_threshold = measure_threshold[1]
            
            # Keep only TF-pairs (rows) whose cosine is above the upper threshold.
            reduced_df = reduced_df[(reduced_df[col] > upper_threshold)]
            # Log the cosine threshold that was used.
            print(upper_threshold)
        else:
            # log2foldchanges can be positive or negative, so both thresholds are needed.
            measure_threshold = utils.get_threshold(df[col], "both", percent=log2fc_threshold_percent)
            upper_threshold = measure_threshold[1]
            lower_threshold = measure_threshold[0]
            # Keep only TF-pairs (rows) above the upper or below the lower threshold.
            reduced_df = reduced_df[(reduced_df[col] > upper_threshold) | (reduced_df[col] < lower_threshold)]
    # Only TF-pairs that passed the filter of every column remain.
    return reduced_df
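The column-by-column filtering can be tried out without tfcomb. The sketch below is a stand-in, not the notebook's method: it replaces `tfcomb.utils.get_threshold` with plain empirical quantiles (the hypothetical helper `quantile_bounds`), so the cut-offs will differ from those tfcomb computes. It only shows the control flow: thresholds come from the full data, and a TF-pair survives only if it passes every column.

```python
from typing import Tuple

import numpy as np
import pandas as pd


def quantile_bounds(values: pd.Series, percent: float) -> Tuple[float, float]:
    """Stand-in for a threshold function: empirical cut-offs with `percent` of the values in each tail."""
    lower = float(np.quantile(values, percent))
    upper = float(np.quantile(values, 1.0 - percent))
    return lower, upper


# Random toy data with one cosine column and two contrast columns (hypothetical names).
rng = np.random.default_rng(0)
toy = pd.DataFrame(
    {
        "c1_cosine": rng.uniform(0.0, 0.1, 1000),
        "c1/c2_cosine_log2fc": rng.normal(0.0, 1.0, 1000),
        "c1/c3_cosine_log2fc": rng.normal(0.0, 1.0, 1000),
    }
)

keep = pd.Series(True, index=toy.index)
for col in toy.columns:
    # Thresholds are computed on the unfiltered column.
    lower, upper = quantile_bounds(toy[col], 0.10)
    if col == "c1_cosine":
        keep &= toy[col] > upper                          # only high cosine values pass
    else:
        keep &= (toy[col] > upper) | (toy[col] < lower)   # both tails pass

filtered = toy[keep]
print(len(toy), "->", len(filtered))
```

Because every column removes rows independently, the number of surviving TF-pairs shrinks quickly as the number of contrasts grows. This may help explain why many clusters in the output of this notebook end up with zero TF-pairs.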

In [6]:
def find_specific_tf_cos_for_cluster(df:pd.DataFrame, cluster_name:str) -> pd.DataFrame:
    '''
        Find TF-co-occurrences for a cluster/celltype (cluster_name) that are specific for that cluster
        within its tissue.
        
        ---
        Parameters:
            df: pd.DataFrame
                 Rules of a DiffCombObj (differential analysis).

            cluster_name: str
                Name of the cluster/celltype that is investigated.
        ---
        Returns:
            pd.DataFrame with TF-pairs that show a high difference (log2foldchange) to the other clusters in the tissue
            and a significant co-occurrence (cosine) in the investigated cluster (cluster_name).
        
    '''
    # Reduce the DataFrame (df) to the columns associated with the investigated cluster:
    # its cosine column and the log2fc column of every contrast it takes part in.
    cluster_cols = [col for col in df.columns if cluster_name in col]
    relevant_cluster_cols = []
    # The cosine column of the investigated cluster comes first.
    cluster_cosine_col_name = f"{cluster_name}_cosine"
    relevant_cluster_cols.append(cluster_cosine_col_name) 
    for entry in cluster_cols:
        if (f'{cluster_name}/' in entry) or (f'/{cluster_name}_cosine_log2fc' in entry):
            relevant_cluster_cols.append(entry)
    # Log the number of selected columns.
    print(len(relevant_cluster_cols))
    
    # Keep only the values of the investigated cluster: its cosine and the log2fc of each of its contrasts.
    reduced_df = df[relevant_cluster_cols]
    print(f'Initial TF-pairs-Count: {reduced_df.shape}')

    # Filter out TF-pairs that contain a 0.00.
    # Count the entries per row that are not zero (0.00),
    # e.g. all 15 cols are 0.00 => val_counts = 0, 10 cols are not 0.00 => val_counts = 10.
    val_counts = reduced_df[~reduced_df.isin([0])].count(axis=1).sort_values()
    
    # Minimum number of non-zero values per row; here every column must be non-zero (could be lowered).
    selection_threshold = len(relevant_cluster_cols)
    # Keep all TF-pairs whose number of non-zero values reaches the threshold.
    tfs_occ = val_counts[val_counts >= selection_threshold].index
    result = reduced_df.loc[tfs_occ]
    print(f'Zero filtered TF-pairs-Count: {result.shape}')
    
    # We want TF-pairs that show a significant log2fc in every contrast, because
    # this indicates a TF-pair that is distinctive for the investigated cluster in that tissue.
    # A high cosine additionally indicates that the TF-pair really co-occurs in that cluster.
    # Filtering: keep TF-pairs whose log2fc shows a high difference to all other clusters
    # in the tissue and whose cosine value is significant.
    significants = get_significant_rules(df=result, cosine_col=cluster_cosine_col_name, cosine_threshold=0.001, log2fc_threshold_percent=0.10)
    
    print(f'Cluster: {cluster_name}: {significants.shape} ,tf-pairs with significant log2fc-changes in comparison to all the other clusters in tissue')
    print(f'Done: Find specific tf co-occurences for cluster{cluster_name}.')
    
    return significants
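The zero filter inside `find_specific_tf_cos_for_cluster` is easier to follow on a small table. A minimal sketch with made-up values and hypothetical cluster names; it counts the non-zero, non-missing entries per row and keeps only rows in which every column carries a value:

```python
import pandas as pd

# Toy table: one cosine column and two contrast columns (hypothetical names and values).
toy = pd.DataFrame(
    {
        "c1_cosine": [0.20, 0.00, 0.15],
        "c1/c2_cosine_log2fc": [3.4, 1.2, 0.0],
        "c1/c3_cosine_log2fc": [1.0, 0.0, 0.0],
    },
    index=["TF1-TF2", "TF3-TF4", "TFX-TFY"],
)

# Number of entries per row that are neither zero nor missing.
non_zero_counts = (toy.ne(0) & toy.notna()).sum(axis=1)

# Require a value in every column; lowering this number loosens the filter.
min_non_zero = toy.shape[1]
kept = toy[non_zero_counts >= min_non_zero]

print(non_zero_counts.to_dict())  # {'TF1-TF2': 3, 'TF3-TF4': 1, 'TFX-TFY': 1}
print(list(kept.index))           # ['TF1-TF2']
```

Unlike the function above, this variant keeps the original row order because it does not sort the counts.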


In [7]:
def investigate_differential_analysis_per_tissue(tissues=()):
    '''
    Investigate the differential analysis of tf-comb for each tissue.
    ---
    Parameters:
    tissues: iterable of str
        Names of the tissues that have a differential analysis.
    ---
    Exceptions:
        An exception raised while analysing a cluster is caught and printed;
        the analysis then continues with the next cluster.
    
    '''
    for tissue in tissues:
        print(f"Start analyse cluster for {tissue}:")
        answer_save_path = f"{answers_path}{tissue}/"
        # Create the result folder for the tissue if it does not exist yet.
        pathlib.Path(answer_save_path).mkdir(parents=True, exist_ok=True)
        
        # Folder for error markers; created if it does not exist yet.
        error_path = f"{answers_path}{tissue}/errors/"
        pathlib.Path(error_path).mkdir(parents=True, exist_ok=True)

        # Load the DiffCombObj of the tissue.
        diff_obj = DiffCombObj().from_pickle(f"{differential_analysis_path}{tissue}/{tissue}_normalized.pkl")
        # Get the rules (TF-pairs) of the DiffCombObj.
        diff_rules = diff_obj.rules
        
        # Read in the file names of the original CombObjs with the executed market basket analysis.
        files_main_mb = read_in_file_names_of_folder(rel_path=f"{main_analysis_path}{tissue}")
        
        # Iterate over each cluster.
        for file in files_main_mb:
            cluster_name = file.split('.pkl')[0]
            print(f"Start Find specific tf-cos for: {cluster_name}")
            # Catch exceptions so that a failing cluster does not stop the analysis of the remaining clusters.
            try:
                # res is a pd.DataFrame with the TF-pairs that have a significant log2fc in every contrast
                # and a significant cosine value for the investigated cluster.
                # These TF-pairs could be interesting for further investigation,
                # because they differ significantly from all the other clusters in the tissue.
                res = find_specific_tf_cos_for_cluster(df=diff_rules, cluster_name=cluster_name)

                # Save res as .pkl.
                res.to_pickle(f"{answer_save_path}{cluster_name}.pkl")
            except Exception as err:
                print(f"ERROR: Could not proceed with analysis for cluster: {cluster_name}")
                print(f"{err}")
                df = pd.DataFrame()
                # Save an empty csv named after the cluster, so
                # that we later know that an error happened for this cluster.
                df.to_csv(f"{error_path}{cluster_name}.csv")
                print("Continue with next cluster.")
                continue
                
    print("Done investigating diff analysis!")

           
        
        

In [8]:
investigate_differential_analysis_per_tissue(tissues=diff_mb_tissues)

Start analyse cluster for complete_liver:
Start Find specific tf-cos for: complete_liver.0.Oligodendrocyte
7
Initial TF-pairs-Count: (268257, 7)
Zero filtered TF-pairs-Count: (267398, 7)
0.0404783703021167
Cluster: complete_liver.0.Oligodendrocyte: (4, 7) ,tf-pairs with significant log2fc-changes in comparison to all the other clusters in tissue
Done: Find specific tf co-occurences for clustercomplete_liver.0.Oligodendrocyte.
Start Find specific tf-cos for: complete_liver.6.Plasmacytoid_dendritic_cell
7
Initial TF-pairs-Count: (268257, 7)
Zero filtered TF-pairs-Count: (267392, 7)
0.045272859820941376
Cluster: complete_liver.6.Plasmacytoid_dendritic_cell: (2, 7) ,tf-pairs with significant log2fc-changes in comparison to all the other clusters in tissue
Done: Find specific tf co-occurences for clustercomplete_liver.6.Plasmacytoid_dendritic_cell.
Start Find specific tf-cos for: complete_liver.2.B_cell
7
Initial TF-pairs-Count: (268257, 7)
Zero filtered TF-pairs-Count: (266350, 7)
0.040411

Zero filtered TF-pairs-Count: (275289, 20)
0.0406969788568334
Cluster: esophagus_muscularis.0.Ciliated_epithelial_cell: (0, 20) ,tf-pairs with significant log2fc-changes in comparison to all the other clusters in tissue
Done: Find specific tf co-occurences for clusteresophagus_muscularis.0.Ciliated_epithelial_cell.
Start Find specific tf-cos for: esophagus_muscularis.5.Gonadal_endothelial_cell
20
Initial TF-pairs-Count: (275777, 20)
Zero filtered TF-pairs-Count: (274529, 20)
0.039790377127330516
Cluster: esophagus_muscularis.5.Gonadal_endothelial_cell: (0, 20) ,tf-pairs with significant log2fc-changes in comparison to all the other clusters in tissue
Done: Find specific tf co-occurences for clusteresophagus_muscularis.5.Gonadal_endothelial_cell.
Start Find specific tf-cos for: esophagus_muscularis.4.Leydig_precursor_cell
20
Initial TF-pairs-Count: (275777, 20)
Zero filtered TF-pairs-Count: (273957, 20)
0.03980948014825375
Cluster: esophagus_muscularis.4.Leydig_precursor_cell: (0, 20) ,

Zero filtered TF-pairs-Count: (272322, 14)
0.04143005377835264
Cluster: leg_skin_exposed.5.Pyramidal_cell: (0, 14) ,tf-pairs with significant log2fc-changes in comparison to all the other clusters in tissue
Done: Find specific tf co-occurences for clusterleg_skin_exposed.5.Pyramidal_cell.
Start Find specific tf-cos for: leg_skin_exposed.10.Oligodendrocyte
14
Initial TF-pairs-Count: (276480, 14)
Zero filtered TF-pairs-Count: (267217, 14)
0.043396847146409026
Cluster: leg_skin_exposed.10.Oligodendrocyte: (14, 14) ,tf-pairs with significant log2fc-changes in comparison to all the other clusters in tissue
Done: Find specific tf co-occurences for clusterleg_skin_exposed.10.Oligodendrocyte.
Start Find specific tf-cos for: leg_skin_exposed.3.Oocyte
14
Initial TF-pairs-Count: (276480, 14)
Zero filtered TF-pairs-Count: (273983, 14)
0.041342480440653744
Cluster: leg_skin_exposed.3.Oocyte: (0, 14) ,tf-pairs with significant log2fc-changes in comparison to all the other clusters in tissue
Done: Fi

0.042778260561328475
Cluster: artery_tibial.3.Plasmacytoid_dendritic_cell: (3, 7) ,tf-pairs with significant log2fc-changes in comparison to all the other clusters in tissue
Done: Find specific tf co-occurences for clusterartery_tibial.3.Plasmacytoid_dendritic_cell.
Start analyse cluster for esophagus_mucosa:
Start Find specific tf-cos for: esophagus_mucosa.0.Oligodendrocyte
18
Initial TF-pairs-Count: (275977, 18)
Zero filtered TF-pairs-Count: (275079, 18)
0.04009117871123277
Cluster: esophagus_mucosa.0.Oligodendrocyte: (0, 18) ,tf-pairs with significant log2fc-changes in comparison to all the other clusters in tissue
Done: Find specific tf co-occurences for clusteresophagus_mucosa.0.Oligodendrocyte.
Start Find specific tf-cos for: esophagus_mucosa.5.Natural_killer_T_(NKT)_cell
18
Initial TF-pairs-Count: (275977, 18)
Zero filtered TF-pairs-Count: (274340, 18)
0.04222187725137669
Cluster: esophagus_mucosa.5.Natural_killer_T_(NKT)_cell: (0, 18) ,tf-pairs with significant log2fc-changes i

Zero filtered TF-pairs-Count: (275181, 11)
0.041022213033017126
Cluster: lung.2.Oocyte: (0, 11) ,tf-pairs with significant log2fc-changes in comparison to all the other clusters in tissue
Done: Find specific tf co-occurences for clusterlung.2.Oocyte.
Start Find specific tf-cos for: lung.6.Lymphoid-primed_multipotent_progenitor_cell
11
Initial TF-pairs-Count: (276458, 11)
Zero filtered TF-pairs-Count: (273905, 11)
0.041608853037033114
Cluster: lung.6.Lymphoid-primed_multipotent_progenitor_cell: (10, 11) ,tf-pairs with significant log2fc-changes in comparison to all the other clusters in tissue
Done: Find specific tf co-occurences for clusterlung.6.Lymphoid-primed_multipotent_progenitor_cell.
Start Find specific tf-cos for: lung.8.Multilymphoid_progenitor_cell
11
Initial TF-pairs-Count: (276458, 11)
Zero filtered TF-pairs-Count: (274040, 11)
0.042574459012031614
Cluster: lung.8.Multilymphoid_progenitor_cell: (0, 11) ,tf-pairs with significant log2fc-changes in comparison to all the other

Zero filtered TF-pairs-Count: (270679, 18)
0.04230622242261372
Cluster: colon_transverse.7.Endothelial_cell: (0, 18) ,tf-pairs with significant log2fc-changes in comparison to all the other clusters in tissue
Done: Find specific tf co-occurences for clustercolon_transverse.7.Endothelial_cell.
Start Find specific tf-cos for: colon_transverse.2.Multilymphoid_progenitor_cell
18
Initial TF-pairs-Count: (275724, 18)
Zero filtered TF-pairs-Count: (274049, 18)
0.04324650469189761
Cluster: colon_transverse.2.Multilymphoid_progenitor_cell: (0, 18) ,tf-pairs with significant log2fc-changes in comparison to all the other clusters in tissue
Done: Find specific tf co-occurences for clustercolon_transverse.2.Multilymphoid_progenitor_cell.
Start Find specific tf-cos for: colon_transverse.9.Purkinje_cell
18
Initial TF-pairs-Count: (275724, 18)
Zero filtered TF-pairs-Count: (269353, 18)
0.04383020311770843
Cluster: colon_transverse.9.Purkinje_cell: (0, 18) ,tf-pairs with significant log2fc-changes in c

Zero filtered TF-pairs-Count: (259909, 18)
0.04983632802039784
Cluster: liver_complete.1.Natural_killer_T_(NKT)_cell: (0, 18) ,tf-pairs with significant log2fc-changes in comparison to all the other clusters in tissue
Done: Find specific tf co-occurences for clusterliver_complete.1.Natural_killer_T_(NKT)_cell.
Start Find specific tf-cos for: liver_complete.6.Oligodendrocyte
18
Initial TF-pairs-Count: (268355, 18)
Zero filtered TF-pairs-Count: (266992, 18)
0.06822783486014229
Cluster: liver_complete.6.Oligodendrocyte: (1, 18) ,tf-pairs with significant log2fc-changes in comparison to all the other clusters in tissue
Done: Find specific tf co-occurences for clusterliver_complete.6.Oligodendrocyte.
Start Find specific tf-cos for: liver_complete.4.Multilymphoid_progenitor_cell
18
Initial TF-pairs-Count: (268355, 18)
Zero filtered TF-pairs-Count: (263343, 18)
0.07465366791395776
Cluster: liver_complete.4.Multilymphoid_progenitor_cell: (0, 18) ,tf-pairs with significant log2fc-changes in com

### Answer for: 
a) The TF-pair occurs in only a single cluster of a tissue:
     e.g. Cluster1 has TF1-TF1 and the other clusters in the tissue do not have TF1-TF1.
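Why identical log2 fold changes point to a TF-pair that is absent from the other clusters can be illustrated with a toy calculation. This is a sketch under loud assumptions: it assumes the fold change has the form log2((a + pseudo) / (b + pseudo)) with one fixed pseudocount for all contrasts, and the pseudocount `PSEUDO` and the cosine values are made up. If the pseudocount differed between contrasts, the values would differ even for a pair that is absent everywhere else.

```python
import numpy as np

PSEUDO = 0.01  # hypothetical fixed pseudocount, the same for every contrast


def log2fc(a: float, b: float, pseudo: float = PSEUDO) -> float:
    """log2 fold change of two cosine values after adding a pseudocount."""
    return float(np.log2((a + pseudo) / (b + pseudo)))


cosine_cluster1 = 0.20

# Case 1: the TF-pair is absent (cosine 0) in all other clusters.
fc_absent = [log2fc(cosine_cluster1, other) for other in [0.0, 0.0, 0.0]]

# Case 2: the TF-pair also occurs in some of the other clusters.
fc_present = [log2fc(cosine_cluster1, other) for other in [0.0, 0.05, 0.10]]

print(fc_absent)   # three identical values
print(fc_present)  # three different values
```

In the first case the denominator is the bare pseudocount in every contrast, so all fold changes coincide. That is the pattern the check for identical values looks for.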

In [9]:
def find_rows_with_same_value(df:pd.DataFrame) -> pd.DataFrame:
    '''
    Keeps only the rows of a DataFrame whose absolute values are identical in every column
    (the sign is ignored, so 4 and -4 count as the same value) and returns the reduced DataFrame.
    '''
    return df[df.apply(lambda x: min(abs(x)) == max(abs(x)), 1)]
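The effect of comparing absolute values can be seen on a three-row toy table. The helper `rows_with_constant_value` below is a hypothetical variant written for this illustration (it uses `nunique` instead of `min`/`max`) and adds a switch to compare signed values instead:

```python
import pandas as pd


def rows_with_constant_value(df: pd.DataFrame, use_abs: bool = True) -> pd.DataFrame:
    """Keep rows that hold a single distinct (optionally absolute) value across all columns."""
    values = df.abs() if use_abs else df
    return df[values.nunique(axis=1) == 1]


# Toy log2fc table with hypothetical contrast names.
toy = pd.DataFrame(
    {"c1/c2": [4.0, 4.0, 1.0], "c1/c3": [4.0, -4.0, 2.0]},
    index=["TF1-TF2", "TF3-TF4", "TFX-TFY"],
)

print(list(rows_with_constant_value(toy).index))                 # ['TF1-TF2', 'TF3-TF4']
print(list(rows_with_constant_value(toy, use_abs=False).index))  # ['TF1-TF2']
```

With absolute values, a TF-pair whose fold change flips sign between contrasts (TF3-TF4) still counts as constant. Whether that is desired depends on how the contrasts are oriented in the rules table.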

In [10]:
#  Get folder names of the tissues.
answers_mb_tissues = get_folder_names_in_folder(rel_folder_path=answers_path)
print(answers_mb_tissues)

['complete_liver', 'heart_lv', 'esophagus_muscularis', 'leg_skin_exposed', 'artery_tibial', 'esophagus_mucosa', 'lung', 'lung_sample', 'colon_transverse', 'liver_complete']


In [11]:
'''
We now check the TF-pairs that have significant log2foldchanges
and a significant cosine value for the investigated clusters.
If the log2foldchange has the same value in every contrast (compared clusters in the tissue),
the TF-pair occurs only in the investigated cluster of that tissue.
Such a TF-pair could be interesting for further investigation.
'''
cluster_with_specials = []
all_cluster_results = []
for tissue in answers_mb_tissues:
    
    print(f"Start analysis {tissue}:")
    answer_specials_save_path = f"{answers_path}{tissue}/specials/"
    # Create the folder for the tissue if it does not exist yet.
    pathlib.Path(answer_specials_save_path).mkdir(parents=True, exist_ok=True)
    
    # File names of the pickled DataFrames with the significant TF-pairs.
    files = read_in_file_names_of_folder(rel_path=f"{answers_path}{tissue}")
    for file in files:
        cluster_name = file.split('.pkl')[0]
        
        # Read in the significant TF-pairs.
        df = pd.read_pickle(f"{answers_path}{tissue}/{file}")
        all_cluster_results.append(df)
        
        # Remove the cosine column (the first column) to keep only the log2foldchanges
        # of each contrast (compared clusters in the tissue), which are then checked for identical values.
        cosine_col_name = df.columns[0]
        cosine_col = df[cosine_col_name]
        df_only_log2fc = df.drop(columns=[cosine_col_name])
        
        # Find TF-pairs that have the same value in each log2foldchange column.
        # The copy avoids a pandas SettingWithCopyWarning when the cosine column is added back.
        df_res = find_rows_with_same_value(df_only_log2fc).copy()
        
        # Check if we found a TF-co-occurrence that is significant and occurs in only one cluster of the tissue.
        if df_res.shape[0] > 0: 
            print(f"Cluster: {cluster_name}, which has tf-pairs that only occur in this cluster for tissue {tissue}. Found: {df_res.shape[0]}")
            # Add the cosine values back to the DataFrame (aligned on the TF-pair index).
            df_res[cosine_col_name] = cosine_col
            cluster_with_specials.append(df_res)
            # Save the TF-pairs as csv.
            df_res.to_csv(f"{answer_specials_save_path}{cluster_name}.csv")
        else:
            # Report a miss only if nothing was found for this cluster.
            print(f"Could not find single occurences in cluster {cluster_name}.")
print("DONE")

Start analysis complete_liver:
Could not find single occurences in cluster complete_liver.0.Oligodendrocyte.
Could not find single occurences in cluster complete_liver.6.Plasmacytoid_dendritic_cell.
Could not find single occurences in cluster complete_liver.2.B_cell.
Could not find single occurences in cluster complete_liver.1.-.
Could not find single occurences in cluster complete_liver.4.Natural_killer_T_(NKT)_cell.
Could not find single occurences in cluster complete_liver.3.Gonadal_endothelial_cell.
Could not find single occurences in cluster complete_liver.5.Natural_killer_T_(NKT)_cell.
Start analysis heart_lv:
Could not find single occurences in cluster heart_lv.3.Natural_killer_T_(NKT)_cell.
Could not find single occurences in cluster heart_lv.2.Natural_killer_T_(NKT)_cell.
Could not find single occurences in cluster heart_lv.11.Ciliated_epithelial_cell.
Could not find single occurences in cluster heart_lv.1.Meiotic_prophase_fetal_germ_cell.
Could not find single occurences in c

Could not find single occurences in cluster lung_sample.5.Astrocyte.
Start analysis colon_transverse:
Could not find single occurences in cluster colon_transverse.1.Oogenesis_phase_fetal_germ_cell.
Could not find single occurences in cluster colon_transverse.0.Granulocyte-monocyte_progenitor.
Could not find single occurences in cluster colon_transverse.7.Endothelial_cell.
Could not find single occurences in cluster colon_transverse.2.Multilymphoid_progenitor_cell.
Could not find single occurences in cluster colon_transverse.9.Purkinje_cell.
Could not find single occurences in cluster colon_transverse.10.Pyramidal_cell.
Could not find single occurences in cluster colon_transverse.3.Natural_killer_T_(NKT)_cell.
Could not find single occurences in cluster colon_transverse.8.Lake_et_al.Science.In2.
Could not find single occurences in cluster colon_transverse.15.Airway_secretory_cell.
Could not find single occurences in cluster colon_transverse.5.Natural_killer_T_(NKT)_cell.
Could not find 

In [19]:
print(f'Number of all investigated clusters: {len(all_cluster_results)}')
# Clusters for which at least one TF-pair passed the significance filter.
cluster_with_high_changes = [res for res in all_cluster_results if res.shape[0] > 0]
print('Number of clusters that show significant changes for tf-pairs in comparison to the other clusters of their tissue.')
print(len(cluster_with_high_changes))

Number of all investigated clusters: 129
Number of clusters that show significant changes for tf-pairs in comparison to the other clusters of their tissue.
48


In [20]:
print('Number of clusters that have tf-pairs that only occur in the investigated cluster in that tissue. (filter thresholds: cosine=0.001 , log2fc=0.10)')
print(len(cluster_with_specials))

Number of clusters that have tf-pairs that only occur in the investigated cluster in that tissue. (filter thresholds: cosine=0.001 , log2fc=0.10)
0


# OLD ---------------------------------------------------------------

In [None]:
from tfcomb import CombObj
import os
import pathlib

'''
Constants for this script
'''

#genome_path="../testdaten/hg19_masked.fa"
genome_path="../testdaten/homo_sapiens.104.mainChr.fa"

main_jaspar_file="../testdaten/JASPAR2020_CORE_vertebrates.meme" 

# path where market basket analyses for cluster are put.
result_path="./results/wp1/"

# folder of wp1, where the clusters are
path_to_clusters="?"
#path_to_clusters="/mnt/workspace_stud/stud4/WP6_data/"


# create result folders 
if not os.path.exists(result_path):
     pathlib.Path(result_path).mkdir(parents=True, exist_ok=True)

if not os.path.exists(genome_path):
    print(f"ERROR: path {genome_path} does not exist")

if not os.path.exists(main_jaspar_file):
    print(f"ERROR: path {main_jaspar_file} does not exist")

if not os.path.exists(path_to_clusters):
    print(f"ERROR: path {path_to_clusters} does not exist")

In [None]:
def do_market_basket_analyses_for_cell_cluster(cell_cluster_name: str, cell_cluster_path:str):
    '''
        Does market basket analyses.
    '''
    comb = CombObj()
    comb.TFBS_from_motifs(regions= cell_cluster_path,
                   motifs=main_jaspar_file,
                   genome=genome_path,
                   threads=4)
    
    print(f'Start market basket analyses for cell-cluster/type: {cell_cluster_name}')
    comb.market_basket(threads=10)
    if len(comb.rules) <= 0:
        print(f'Could not find TF-cooccurences for cell-cluster/type: {cell_cluster_name}')
        return
    print(f'Finished market basket analyses for cell-cluster/type: {cell_cluster_name}')
    print(f'Found rules: {len(comb.rules)}')
    comb.to_pickle(f'{result_path}{cell_cluster_name}.pkl')
    print(f'Saved: {result_path}{cell_cluster_name}.pkl')

In [None]:
def read_in_file_names_of_folder(rel_path:str):
    return [f for f in os.listdir(rel_path) if os.path.isfile(os.path.join(rel_path, f))]

cluster_file_names = read_in_file_names_of_folder(rel_path=path_to_clusters)
print(cluster_file_names)




In [None]:
# Has to be tested as soon as wp1 generates .bed files
for file_name in cluster_file_names:
    cluster_name = file_name.split('.')[0]
    print(cluster_name)
    print(file_name)
    cluster_path=f"{path_to_clusters}{file_name}"
    do_market_basket_analyses_for_cell_cluster(cell_cluster_name=cluster_name, cell_cluster_path=cluster_path)
    break