# TF-co-occurrences for WP2 - Data
### Outline of this notebook:
    1. Constants, Path and Interface Definitions 
    2. Market Basket analysis with tf-comb, for all cluster/celltypes of a tissue
    3. Differential analysis with all market basket analyses (CombObjs) of step 2 for the clusters/celltypes of a single tissue. (One DiffCombObj for each tissue)
    4. Analysis for biological questions 
    
The aim is to find transcription factor co-occurrences for the clusters/celltypes of human tissues in the WP2 data, using the Python library tf-comb. The underlying data comes from the "cell atlas of chromatin accessibility across 25 adult human tissues" (https://doi.org/10.1101/2021.02.17.431699).

**Biological question, that we want to answer with this notebook:**

1. Find transcription factor co-occurrences that only occur in one (or more) "cluster/celltypes of a tissue".
    Maybe we can identify a cluster through these co-occurrences.

**How to use this notebook:**
1. Please adapt the paths in "Constants, Path and Interface Definitions" to your setup.
2. Please make sure you have installed the kernel as described in the ReadMe.
3. Check that the WP2 data structure is provided correctly (see ReadMe).
    - It has all tissues as folders in it: ../OUPTPUT_FOLDER/
    - You find the data (open chromatin regions per cluster, .bed files) at e.g. ../OUPTPUT_FOLDER/<tissue>/WP6/*cluster_x.bed
4. Execute each notebook window from top to bottom one after another.
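
The structure check in step 3 can be sketched as a small helper. This is only an illustration under assumptions: the function name `find_tissues_without_bed_files` and the demo folder names are hypothetical and not part of the notebook, and the default tag `WP6` simply mirrors the layout described above.

```python
import pathlib
import tempfile


def find_tissues_without_bed_files(tissue_root: str, cluster_tag: str = "WP6") -> list:
    """Return the names of tissue folders that have no .bed file in <tissue>/<cluster_tag>/."""
    missing = []
    for tissue_dir in sorted(pathlib.Path(tissue_root).iterdir()):
        if not tissue_dir.is_dir():
            continue
        cluster_dir = tissue_dir / cluster_tag
        # A missing cluster folder counts the same as an empty one.
        has_bed = cluster_dir.is_dir() and any(cluster_dir.glob("*.bed"))
        if not has_bed:
            missing.append(tissue_dir.name)
    return missing


# Demonstration on a throwaway folder tree (tissue names are made up).
with tempfile.TemporaryDirectory() as root:
    good = pathlib.Path(root) / "tissue_a" / "WP6"
    good.mkdir(parents=True)
    (good / "tissue_a.1_peaks.bed").touch()
    (pathlib.Path(root) / "tissue_b").mkdir()
    demo_missing = find_tissues_without_bed_files(root)

print(demo_missing)
```

Running such a check against the real interface folder before step 4 would list the tissues for which no market basket analysis can be started.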

## 1. Constants, Path and Interface Definitions:

In [None]:
from tfcomb import CombObj, DiffCombObj, utils
import os
import pathlib
import pandas as pd
import numpy as np
'''
Constants for this script.

This window contains all paths and constants that are used later in this Jupyter notebook.

Please adapt the paths or constants if you use other files. 
For example, adapt the genome path if you use another genome. 
'''

# Path to genome fasta file. Is used for the market basket analysis of tfcomb.
genome_path="/mnt/workspace_stud/allstud/homo_sapiens.104.mainChr.fa"

# Path to the jaspar file (contains transcription factors (TF) binding profiles
# as position frequency matrices (PFMs)). Is used for the market basket analysis of tfcomb
main_jaspar_file="../testdaten/JASPAR2020_CORE_vertebrates.meme" 

# Path where results of this notebook will be written to (eg. TF_COMB objects, .pkl).
result_path="./results/wp2/"

# Paths in the result folder:
# Path to folder, where the resulting market basket analysis for a cluster/celltype is put 
main_analysis_path=f"{result_path}main/"

# Path to folder, where the differential analysis for a tissue is put 
differential_analysis_path=f"{result_path}diff_analysis/"
# Path to folder, where answers of our results are put to
answers_path=f"{result_path}answers/"

# Path to folder (Interface folder) of wp2, where the clusters of each tissue can be found.
path_to_tissues="/mnt/workspace_stud/stud3/WP2_OUTPUT/FINISHED/"

# Tag for WP6 data in WP2 Interface
cluster_folder_tag="WP6/"

# Path to Folder with celltype annotation tables of wp2
celltype_annotation_path = "/mnt/workspace_stud/stud4/celltype_assignment_tables/"

# The following lines initially check whether all files/paths are available. 
# Result folders that do not exist yet are created automatically
# (mkdir with exist_ok=True does nothing if the folder already exists).
for folder in (result_path, main_analysis_path, differential_analysis_path, answers_path):
    pathlib.Path(folder).mkdir(parents=True, exist_ok=True)

if not os.path.exists(genome_path):
    print(f"ERROR: path {genome_path} does not exist")

if not os.path.exists(main_jaspar_file):
    print(f"ERROR: path {main_jaspar_file} does not exist")



### Helper functions for reading-in folders/files:

In [None]:
def get_folder_names_in_folder(rel_folder_path:str):
    ''' 
        Read in the names of the folders in a folder.(rel_folder_path)
        ---
        Parameters:
        
        rel_folder_path: String
            relative Path to the folder location that is read in.
        ---
        Return: a List of Strings (folder names)
    '''

    # Keep only the entries of the folder that are directories themselves
    return [item for item in os.listdir(rel_folder_path) if os.path.isdir(os.path.join(rel_folder_path, item))]

def read_in_file_names_of_folder(rel_path:str):
    ''' 
        Read in the file names in a folder (rel_path).
        ---
        Parameters:
        
        rel_path: String
            relative Path to the folder location.
        ---
        Return: a List of Strings (file names)
    '''
    return [f for f in os.listdir(rel_path) if os.path.isfile(os.path.join(rel_path, f))]


#cluster_file_names = read_in_file_names_of_folder(rel_path=path_to_clusters)


## 2. Market basket analysis with tf-comb
We do a market basket analysis with tfcomb for each cluster/celltype that has been clustered by WP2 from the raw single-cell ATAC data of the cell atlas project. As a result, we get the transcription factor co-occurrences for each cluster. The transcription factor motifs come as position frequency matrices from https://jaspar.genereg.net/search?q=&collection=CORE&tax_group=vertebrates. The genome used is **homo_sapiens.104.mainChr.fa**.

Approach:
1. Read-in tissue folders of WP2
2. For each tissue: read-in single .bed (Content= open-chromatin regions) files for each cluster/celltype
3. Do market basket analysis for each cluster/celltype
4. The resulting .pkl CombObj files can be found under **{main_analysis_path}{tissue_name}/{mb_file_name}.pkl** (main_analysis_path already contains result_path)
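
As a sketch of how a result file name is derived from a cluster .bed file name, the following mirrors the parsing done in the wrapper function further below. The helper name `build_result_name` is hypothetical, and the sketch assumes the tissue name itself contains no dot.

```python
def build_result_name(bed_file_name: str, tissue: str, celltype: str = "na") -> str:
    """Build '<tissue>_c<cluster_number>_<celltype>' from a file like '<tissue>.<n>_peaks.bed'."""
    # 'JF1O6_body_of_pancreas.10_peaks.bed' -> 'JF1O6_body_of_pancreas.10_peaks'
    cluster_name = bed_file_name.split(".bed")[0]
    # 'JF1O6_body_of_pancreas.10_peaks' -> '10_peaks' -> 10
    cluster_number = int(cluster_name.split(".")[1].split("_")[0])
    # Whitespace in celltype names is replaced so the result is a safe file name.
    return f"{tissue}_c{cluster_number}_{celltype.replace(' ', '_')}"


example_name = build_result_name(
    "JF1O6_body_of_pancreas.10_peaks.bed", "JF1O6_body_of_pancreas", "Stromal cells"
)
print(example_name)
```

With the default celltype the name ends in `_na`, which is what the notebook produces as long as no annotation table is loaded.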

In [None]:
def do_market_basket_analyses_for_cell_cluster(mb_file_name: str, cell_cluster_path:str, tissue:str):
    '''
        Does market basket analysis with tfcomb. Saves the tfcomb-Object as .pkl file to main_analysis_path.
        ---
        Parameters:
        
        mb_file_name: string
            Name for the result file of the market basket analyses. 
            e.g "<tissue>_<cluster_number>_<celltype>".
        
        cell_cluster_path: string
            Path to the .bed-File with genome regions to check for tf-co-occurences
         
        tissue: string
            Tissue name, origin of the cluster.
    '''    
    # Save path initialization: if the folder for the tissue does not exist, a new folder is created.
    save_path = f'{main_analysis_path}{tissue}/'
    if not os.path.exists(save_path):
         pathlib.Path(save_path).mkdir(parents=True, exist_ok=True)
    
    # TF-comb market basket analysis
    comb = CombObj()
    comb.TFBS_from_motifs(regions= cell_cluster_path,
                   motifs=main_jaspar_file,
                   genome=genome_path,
                   threads=4)
    
    print(f'Start market basket analyses for cell-cluster/type: {mb_file_name}')
    comb.market_basket(threads=10)
    
    # if rules are empty nothing is saved 
    if len(comb.rules) <= 0:
        print(f'Could not find TF-cooccurences for cell-cluster/type: {mb_file_name}')
        return
    print(f'Finished market basket analyses for cell-cluster/type: {mb_file_name}')
    print(f'Found rules: {len(comb.rules)}')
    
    # save tf-comb obj to .pkl
    comb.to_pickle(f'{save_path}{mb_file_name}.pkl')
    print(f'Saved: {save_path}{mb_file_name}.pkl')

In [None]:
# Load tissue folder names for wp2 data
tissues=get_folder_names_in_folder(rel_folder_path=path_to_tissues)
print(tissues)

### Functions for adding the celltype annotations of wp2 to the corresponding clusters

In [None]:
def get_celltype_annotation_table(celltype_annotation_path:str) -> pd.DataFrame:
    '''
    Read in the given annotation file (celltype_annotation_path) with the pandas read_table function
    ---
    Parameters:
    celltype_annotation_path : string
        Is the path to a file which has the celltype annotations for the cluster numbers.
        e.g. file content:
        6	Stromal cells
        3	Adipocyte progenitor cells
        4	Stromal cells
        7	Fibroblasts
    ---
    Returns a celltype annotation table as a pandas Dataframe.
    '''
    df = pd.read_table(celltype_annotation_path, header=None, index_col=None)
    return df

def get_celltype_for_cluster(cluster_number: int, celltype_annotation_table: pd.DataFrame):
    '''
    Find celltype annotation for cluster number in the corresponding celltype_annotation_table
    ---
    Parameters:
    cluster_number: integer
        number of the cluster
    
    celltype_annotation_table: pd.DataFrame
        table with the celltype annotations, e.g. DataFrame
           0	1 
        0  6	Stromal cells
        1  3	Adipocyte progenitor cells
        2  4	Stromal cells
        3  7	Fibroblasts
    ---
    Returns the celltype for a cluster_number. (If no celltype can be found, "na" is returned.) 
    '''
    # initialize celltype name with na (not available); if no celltype can be found, na is returned
    celltype_name = "na"
    try:
        # Select the rows whose first column matches the cluster number.
        # A missing column raises a KeyError, an empty selection raises an IndexError.
        matches = celltype_annotation_table[celltype_annotation_table[0] == int(cluster_number)]
        # Take the first match by position; .at[0, 1] would only work if the match had index label 0
        celltype_name = matches.iloc[0, 1]
        # replace whitespaces in celltype names, e.g. "Stromal cell" -> "Stromal_cell"
        celltype_name = celltype_name.replace(" ", "_")
    except (KeyError, IndexError):
        print(f"Could not find celltype annotation for cluster: {cluster_number}")
   
    return celltype_name

##remove
celltype_annotations=read_in_file_names_of_folder(rel_path=celltype_annotation_path)
print(celltype_annotations[0].split('_table_')[1])
df = pd.read_table(f"{celltype_annotation_path}{celltype_annotations[0]}", header=None, index_col=None)
print(celltype_annotations)

In [None]:
### remove 
try:
    name = df[df[0] == 6].at[0,1]
    name1 = name.replace(" ", "_")
    print(name)
    print(name1)
except KeyError:
    print(f"Error")



In [None]:
def make_mb_for_clusters(path_to_clusters:str, tissue:str):
    '''
        Wrapper function, that does the market basket analysis for all clusters/celltypes in a tissue.
        Also annotates the cluster with a celltype.
        ---
        Parameters:
        
        path_to_clusters: string
            Path to the .bed files (clusters) of a tissue.
         
        tissue: string
            Name of the tissue the clusters belong to.
        --- 
        Exceptions are caught: if any error occurs in the market basket analysis of a cluster, a message is printed
        and the program continues with the next cluster.
    '''
    # Read in the .bed files for each cluster of the specific tissue
    cluster_file_names = read_in_file_names_of_folder(rel_path=path_to_clusters)
    print(cluster_file_names)
    
    # Get the celltype annotation table. To add celltype for a cluster. e.g. cluster6 = fibroblast
    celltype_table = pd.DataFrame()
    # TODO:
    #celltype_table = get_celltype_annotation_table(
     #   celltype_annotation_path=f'{path_to_tissues}annotaion/annotation.txt') # pandas dataframe is returned
    
    # Do a market basket analysis for each cluster of a tissue
    for file_name in cluster_file_names:
        # e.g JF1O6_body_of_pancreas.10_peaks.bed -> [JF1O6_body_of_pancreas.10_peaks] = clustername = JF1O6_body_of_pancreas.10_peaks
        cluster_name = file_name.split('.bed')[0]
        #e.g JF1O6_body_of_pancreas.10_peaks -> [JF1O6_body_of_pancreas, 10_peaks], cluster_number_with_tag = 10_peaks
        cluster_number_with_tag = cluster_name.split('.')[1]
        # e.g 10_peaks -> [10 , peaks], cluster_number = 10
        cluster_number = int(cluster_number_with_tag.split('_')[0])
        
        celltype_name = get_celltype_for_cluster(cluster_number=cluster_number,
                                                 celltype_annotation_table=celltype_table)
        
        try:
            print(cluster_name)
            print(cluster_number)
            print(celltype_name)
            print(file_name)
            # Prepare names and paths
            cluster_path=f"{path_to_clusters}{file_name}"
            # e.g. JF1O6_body_of_pancreas_c3_fibroblast
            mb_file_name = f"{tissue}_c{str(cluster_number)}_{celltype_name}"
            
            do_market_basket_analyses_for_cell_cluster(mb_file_name=mb_file_name, cell_cluster_path=cluster_path, tissue=tissue)
        except Exception:
            print(f"ERROR: Market basket for cluster:{cluster_name} in tissue {tissue}, did not work")
            continue
            

    

In [None]:
# Create a market basket analysis for each cluster of the tissue
for tissue in tissues:
    path_to_clusters = f"{path_to_tissues}{tissue}/{cluster_folder_tag}"
    make_mb_for_clusters(path_to_clusters=path_to_clusters, tissue=tissue)

print("DONE: Created market basket analysis for each cluster")    

## 3. Analysis

### Differential Analysis
We use the differential analysis of tfcomb to identify the differences in tf-co-occurrences between all clusters/celltypes of a tissue. For this we load all market basket analyses (tfcomb objects) of a tissue (see point 2) into a DiffCombObj. After the differential analysis, we filter the object to find tf-co-occurrences that occur only in a single cluster of that tissue or only in a particular celltype.  

- mb = market basket analysis (CombObj of tf comb)
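
To make the idea of the comparison concrete, here is a toy sketch in plain pandas/NumPy. It is not tfcomb's implementation: the cosine values, the column names and the `pseudocount` are made-up assumptions, and tfcomb's own normalization step is left out. It only shows what an outer join with zero-filling and a log2 fold change between two clusters look like.

```python
import numpy as np
import pandas as pd

# Hypothetical cosine scores of TF-pairs in two clusters.
cluster1 = pd.Series({"TF1-TF2": 0.20, "TF3-TF4": 0.30}, name="cluster1_cosine")
cluster2 = pd.Series({"TF3-TF4": 0.15, "TF5-TF6": 0.10}, name="cluster2_cosine")

# Outer join: keep every TF-pair seen in any cluster, fill missing scores with 0.
merged = pd.concat([cluster1, cluster2], axis=1, join="outer").fillna(0)

# Assumed pseudocount to avoid division by zero (the value is arbitrary here).
pseudocount = 0.01
merged["cluster1/cluster2_log2fc"] = np.log2(
    (merged["cluster1_cosine"] + pseudocount) / (merged["cluster2_cosine"] + pseudocount)
)
print(merged)
```

A pair that exists only in cluster1 (TF1-TF2) gets a clearly positive fold change, a pair that exists only in cluster2 (TF5-TF6) a clearly negative one.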

In [None]:
# get the tissue names by folder names of the market basket analysis 
mb_tissues = get_folder_names_in_folder(rel_folder_path=main_analysis_path)
print(mb_tissues)

In [None]:
def do_differential_analysis_for_tissues(tissues=[]):
    '''
        Differential analysis between all clusters/celltypes of a tissue.
        ---
        Parameters:
        
        tissues: array
            tissue names by market basket analysis.
        ---
        DiffCombObj are saved as .pkl to differential_analysis_path
    '''
    for tissue_folder in tissues:
        
        diff_save_path = f"{differential_analysis_path}{tissue_folder}/"
        # Check if  folder for differential_analysis already exists for tissue, if not create new one
        if not os.path.exists(diff_save_path):
             pathlib.Path(diff_save_path).mkdir(parents=True, exist_ok=True)
        
        # get file names of the market basket analysis
        tissue_mb_files = read_in_file_names_of_folder(rel_path=f"{main_analysis_path}{tissue_folder}/")
        
        # holds the combobj´s
        tissue_mbs_to_compare = []
        for file in tissue_mb_files:
            print(file)
            file_name = file.split('.pkl')[0]
            # Load the CombObj (market basket analysis) for each cluster of a tissue
            obj = CombObj().from_pickle(f"{main_analysis_path}{tissue_folder}/{file}")
            obj.set_prefix(file_name)
            tissue_mbs_to_compare.append(obj)
        
        # Create DiffCombObj with all Combobj (market basket analysis) of the clusters in a tissue
        compare_obj = DiffCombObj(tissue_mbs_to_compare, measure="cosine", join="outer", fillna=True)
        # save diffcombj
        compare_obj.to_pickle(f'{diff_save_path}{tissue_folder}.pkl')
        # Normalize the DiffCombObj
        compare_obj.normalize()
        compare_obj.calculate_foldchanges()
        
        # Remove rules which are doubled, e.g. A-B, B-A; (B-A) is removed
        compare_obj.simplify_rules()
        # Save the normalized, foldchange calculated and simplified diff_comb_obj to .pkl
        compare_obj.to_pickle(f'{diff_save_path}{tissue_folder}_normalized.pkl')
        print(f"Done: Diff analysis for tissue {tissue_folder}")
        

In [None]:
# Do the differential analysis for all clusters in a tissue 
do_differential_analysis_for_tissues(tissues=mb_tissues)

## 4. Specific analysis:
### Biological Question:
 Find transcription factor co-occurrences that only occur in one (or more) "cluster/celltype of a tissue".
 - Solution with a hard threshold (a, all clusters) or a weak threshold (b, celltypes in a tissue):
       - a) The found TF-pair occurs only in a single cluster of a tissue:
            e.g. Cluster1 has TF1-TF2 and the other clusters in the tissue do not have TF1-TF2.
       - b) A TF-pair occurs significantly only in some clusters:
            e.g. We have 3 endothelial cell clusters in a tissue and these three clusters share a TF-pair
            which does not occur in the other clusters of the tissue. (needs celltype annotation)
          
###  Steps:
1. Read in the tfcomb DiffCombObjs of the differential analysis (point 3)
2. Get the rules of the DiffCombObj
3. Filter the DiffCombObj rules for each cluster in it:
       - e.g. investigate cluster1: get all log2change columns
         that cluster1 is associated with, plus its cosine column
       - then filter the rules (tf-pairs) for tf-pairs with a significant change in the log2changes
       - We would like to find TF-pairs that show a significant change in the log2changes in each contrast,
         because this indicates a TF-pair that is striking for the investigated cluster in that tissue.
         A high cosine is an indication that the found tf-pair is really 
         a tf-co-occurrence in that cluster.
       - To find a TF-pair that only occurs in a single cluster of a tissue,
         we check the log2changes in each column of the significant tf-pairs: if they have the same absolute value,
         the tf-pair only occurs in that cluster in the tissue. 
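
The filtering in step 3 can be sketched on toy data as follows. This is a simplified stand-in, not the notebook's function: it uses plain pandas quantiles instead of `tfcomb.utils.get_threshold`, and the names `filter_by_percentiles`, `cosine_top` and `log2fc_tail` as well as the random data are assumptions for illustration only. As described above, thresholds are computed on the full table while the selection accumulates over all columns.

```python
import numpy as np
import pandas as pd

# Toy rules table: one cosine column and two log2 fold change columns (random values).
rng = np.random.default_rng(0)
toy = pd.DataFrame(
    {
        "cluster1_cosine": rng.uniform(0, 1, 200),
        "cluster1/cluster2_log2fc": rng.normal(0, 1, 200),
        "cluster1/cluster3_log2fc": rng.normal(0, 1, 200),
    },
    index=[f"TF{i}-TF{i + 1}" for i in range(200)],
)


def filter_by_percentiles(df, cosine_col, cosine_top=0.2, log2fc_tail=0.25):
    """Keep rows with a high cosine and extreme fold changes in every contrast."""
    keep = pd.Series(True, index=df.index)
    for col in df.columns:
        if col == cosine_col:
            # Only the upper threshold matters for the cosine column.
            keep &= df[col] > df[col].quantile(1 - cosine_top)
        else:
            # Fold changes can be extreme in both directions.
            lower = df[col].quantile(log2fc_tail)
            upper = df[col].quantile(1 - log2fc_tail)
            keep &= (df[col] < lower) | (df[col] > upper)
    return df[keep]


filtered = filter_by_percentiles(toy, "cluster1_cosine")
print(filtered.shape)
```

Because every column has to pass its own threshold, the result shrinks quickly as the number of contrasts grows, which is why loose toy thresholds are used here.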
    

In [None]:
# Get the folder names of the tissues that have a differential analysis
diff_mb_tissues = get_folder_names_in_folder(rel_folder_path=differential_analysis_path)
print(diff_mb_tissues)

In [None]:
def get_significant_rules(df:pd.DataFrame, cosine_col:str, cosine_threshold=0.001, log2fc_threshold_percent=0.05):
    '''
    Filter a rules DataFrame (tf-pairs) by the log2foldchange columns and the cosine column of a DiffCombObj for significant tf-co-occurrences (tf-pairs).
    Only the TF-pairs with significant log2foldchanges per contrast are kept in the returned DataFrame.
    The same filtering is done for the cosine column of the investigated cluster.
    
    For filtering: for each column (foldchanges/cosine) the thresholds for significant values in that column
    are calculated by tfcomb.utils.get_threshold. All rows (tf-pairs) that do not pass the thresholds 
    (e.g. the 95% percentile for the log2foldchange columns or the 99.9% percentile for the cosine column) are removed from the DataFrame (df). 
    Thresholds are calculated on the original DataFrame (df).
    Filtering/reduction of the results is done on a copy of the original DataFrame.
    
    This is done step by step:
    e.g. 1. thresholds for column "log2change_cluster1/cluster2" are calculated on the original df.
         2. all rows (tf-pairs) of the dataframe,
            where the values of the investigated column (log2change_cluster1/cluster2) are not in the thresholds,
            are removed from the copy of the original dataframe.
        These steps are done for each column.

    ---
    Parameter:
    df: pd.DataFrame
        DiffCombObj Rules
        e.g.:
        index   cosine_cluster1 log2change_cluster1/cluster2 log2change_cluster1/cluster3 log2change_cluster1/cluster4
        TF1-TF2  0.20                        3.4                              1                        4
        TF3-TF4  0.30                       - 0.3                             2                        1
        TFX-TFY  0.15                        8                                3                        -4
        ...
    
    cosine_col: string
        Name of the cluster, which is investigated and name of the cosine col
    
    cosine_threshold: float,
        Threshold value for the cosine col. All values of the cosine col are investigated
        and the threshold value for filtering is calculated. 
        
    log2fc_threshold_percent: float
        Threshold value for the log2foldchanges col. All values of the log2foldchanges cols are investigated
        and the threshold value for filtering is calculated.
    ---
    Return pd.DataFrame, the filtered df         

    '''
    # copy of the original dataframe, which will be reduced (remove rows (TF-Pairs),  step by step for each col)
    reduced_df = df.copy(deep=True)
    for col in df.columns:
        # calculate thresholds for log2fc columns or cosine_column
        
        if col == cosine_col:
            # for the cosine column we only want to keep rows with a high value (no negative values),
            # so only the upper threshold is important
            measure_threshold = utils.get_threshold(df[col], "both", percent=cosine_threshold)
            upper_threshold = measure_threshold[1]
            
            # removes tf-pairs (rows) that are not above the upper_threshold from the df.
            reduced_df = reduced_df[(reduced_df[col] > upper_threshold)]
            print(upper_threshold)
        else:
            # calculates the thresholds for the log2foldchange cols,
            # positive and negative is possible, so both thresholds are needed.
            measure_threshold = utils.get_threshold(df[col], "both", percent=log2fc_threshold_percent)
            upper_threshold = measure_threshold[1]
            lower_threshold = measure_threshold[0]
            # keeps only tf-pairs (rows) that lie above the upper or below the lower threshold.
            reduced_df = reduced_df[(reduced_df[col] > upper_threshold) | (reduced_df[col] < lower_threshold)]
    # returns only tf-pairs, which are significant        
    return reduced_df

In [None]:
def find_specific_tf_cos_for_cluster(df:pd.DataFrame, cluster_name:str) -> pd.DataFrame:
    '''
        Find tf-co-occurences for a cluster/celltype (cluster_name), that are specific for that cluster
        in the associated tissue.
        
        ---
        Parameters:
            df:pd.DataFrame
                 rules of a DiffCombObj. (Differential analysis)

            cluster_name: string
                name of the cluster/celltype, that is investigated
        ---
        Return pd.DataFrame with tf-pairs that show a high difference (log2foldchange) to the other clusters in the tissue
        and occur significantly (cosine) in the investigated cluster (cluster_name).   
        
    '''
    # Get only the columns associated with the investigated cluster of the differential analysis,
    # that belongs to the tissue of the cluster.
    # reduce dataframe (df) to the relevant columns associated with the cluster
    cluster_cols = list(filter(lambda x: f'{cluster_name}' in x , df.columns))
    # print(cluster_cols)
    relevant_cluster_cols = []
    # Add cosine value for the investigated cluster
    cluster_cosine_col_name = f"{cluster_name}_cosine"
    relevant_cluster_cols.append(cluster_cosine_col_name) 
    for entry in cluster_cols:
        if (f'{cluster_name}/' in entry) or (f'/{cluster_name}_cosine_log2fc' in entry):
            relevant_cluster_cols.append(entry)
   # print(logfc_cluster_cols)
    print(len(relevant_cluster_cols))
    
    # Get only the values (cosine of cluster + log2fc with each contrast of to the cluster) for the cluster to investigate .
    reduced_df = df[relevant_cluster_cols]
    print(f'Initial TF-pairs-Count: {reduced_df.shape}')

   # Filter out rows with 0.00   
    # Count all entries in a row , which do not have a zero(0.00) in it.
    # e.g. 15 cols have 0.00 => val_counts = 0, 10 cols not have a 0.00 => val_counts = 10 
    val_counts = reduced_df[~reduced_df.isin([0])].count(axis=1).sort_values()
    
    # Set threshold 
    selection_threshold = len(relevant_cluster_cols) # e.g. 15, could be varied
    # Keep all entries, which have more/same values that are higher than the threshold
    tfs_occ = val_counts[val_counts >= selection_threshold].index
    result = reduced_df.loc[tfs_occ]
    print(f'Zero filtered TF-pairs-Count: {result.shape}')
    
    # We would like to find TF-pairs that show a significant change in the log2changes in each contrast, because
    # this indicates a TF-Pair, which is striking for the investigated cluster in that tissue.
    # And a high cosine would be a indication, that the found tf-pair is really a tf-co-occurence in that cluster.
    # Filtering: log2changes and cosine cols to get tf-pairs that have a log2change,
    # that shows a high difference to all other clusters in the tissue and a significant cosine value.
    significants = get_significant_rules(df=result, cosine_col=cluster_cosine_col_name, cosine_threshold=0.001, log2fc_threshold_percent=0.05)
    
    print(f'Cluster: {cluster_name}: {significants.shape} ,tf-pairs with significant log2fc-changes in comparison to all the other clusters in tissue')
    print(f'Done: Find specific tf co-occurences for cluster{cluster_name}.')
    
    return significants


In [None]:
def investigate_differential_analysis_per_tissue(tissues=[]):
    '''
    Investigate the differential analysis of tf-comb for each tissue.
    ---
    Parameters:
    tissues: []
    Tissues that have differential analysis.
    ---
    Exception:
        Catch exception for cluster analysis and continues with analysing the next cluster.
    
    '''
    for tissue in tissues:
        print(f"Start analyse cluster for {tissue}:")
        answer_save_path = f"{answers_path}{tissue}/"
        # Check if  folder for path already exists for tissue, if not create new one
        if not os.path.exists(answer_save_path):
             pathlib.Path(answer_save_path).mkdir(parents=True, exist_ok=True)
        
        # folder errors
        error_path = f"{answers_path}{tissue}/errors/"
        # Check if  folder for path already exists for tissue, if not create new one
        if not os.path.exists(error_path):
             pathlib.Path(error_path).mkdir(parents=True, exist_ok=True)
        
        # load the diffcombobj for the tissue 
        diff_obj = DiffCombObj().from_pickle(f"{differential_analysis_path}{tissue}/{tissue}_normalized.pkl")
        # get the rules (tf-pairs) of the diffcombobj.
        diff_rules = diff_obj.rules
        
        # read in the filenames of the original tfcomobj with the executed market basket analysis. 
        files_main_mb = read_in_file_names_of_folder(rel_path=f"{main_analysis_path}{tissue}")
        
        # Iterate over each cluster.
        for i, file in enumerate(files_main_mb):
            # print(file)
            cluster_name = file.split('.pkl')[0]
            print(f"Start Find specific tf-cos for: {cluster_name}")
            try:
                # res contains a pd.dataframe, that contains tf-pairs, 
                # that have significant log2changes/ and a significant cosine value for the investigated cluster
                # compared to the other clusters in the tissue. This TF-pairs could be interesting for further investigation
                # ,because they show a significant difference to all the other clusters in the tissue.
                res = find_specific_tf_cos_for_cluster(df=diff_rules, cluster_name=cluster_name)

                # save res as .pkl
                res.to_pickle(f"{answer_save_path}{cluster_name}.pkl")
            except Exception as err:
                print(f"ERROR: Could not proceed with analysis for cluster: {cluster_name}")
                print(f"{err}")
                df = pd.DataFrame()
                # just save a csv with the clustername as file name,so
                # that we later know that an error happened for this cluster.
                df.to_csv(f"{error_path}{cluster_name}.csv")
                print(f"Continue with next cluster.")
                continue
    print("Done investigating diff analysis!")

           
        
        

In [None]:
investigate_differential_analysis_per_tissue(tissues=diff_mb_tissues)

### Answer for: 
a) The found TF-pair occurs only in a single cluster of a tissue:
     e.g. Cluster1 has TF1-TF2 and the other clusters in the tissue do not have TF1-TF2.
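
The "same value in every contrast" check used for answer a) can be illustrated on a toy table. The values and column names below are made up; the comparison uses absolute values, as the helper function in this notebook does.

```python
import pandas as pd

# Hypothetical log2 fold changes of three TF-pairs against two other clusters.
log2fc = pd.DataFrame(
    {
        "cluster1/cluster2_log2fc": [4.4, 1.0, -4.4],
        "cluster1/cluster3_log2fc": [4.4, 2.0, 4.4],
    },
    index=["TF1-TF2", "TF3-TF4", "TF5-TF6"],
)

# A row passes if the smallest and largest absolute value in the row are equal.
abs_values = log2fc.abs()
same_magnitude = log2fc[abs_values.min(axis=1) == abs_values.max(axis=1)]
print(same_magnitude)
```

Note that TF5-TF6 also passes although its signs differ, because only magnitudes are compared. On real floating point data an exact equality test can also miss near-identical values, so a tolerance (for example via `numpy.isclose`) might be worth considering.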

In [None]:
def find_rows_with_same_value(df:pd.DataFrame) -> pd.DataFrame:
    '''
    Removes all rows of a DataFrame that do not have the same absolute value in each column and
    returns the reduced DataFrame.
    '''
    return df[df.apply(lambda x: min(abs(x)) == max(abs(x)), axis=1)]

In [None]:
#  Get folder names of the tissues.
answers_mb_tissues = get_folder_names_in_folder(rel_folder_path=answers_path)
print(answers_mb_tissues)

In [None]:
'''
We now check the TF-pairs that have significant log2changes 
and a significant cosine value for the investigated clusters.
If the log2foldchange for each contrast (compared clusters in the tissue) has the same absolute value,
the tf-pair only occurs in the investigated cluster. 
So the found TF-pair occurs only in a single cluster of that tissue. 
This tf-pair could be interesting for further investigations.
'''
cluster_with_specials= []
for tissue in answers_mb_tissues:
    
    print(f"Start analysis {tissue}:")
    answer_specials_save_path = f"{answers_path}{tissue}/specials/"
    # Check if  folder for path already exists for tissue, if not create new one
    if not os.path.exists(answer_specials_save_path):
         pathlib.Path(answer_specials_save_path).mkdir(parents=True, exist_ok=True)
    
    # file names of pd.dataframes with the significant tf-pairs in it.
    files = read_in_file_names_of_folder(rel_path=f"{answers_path}{tissue}")
    for i, file in enumerate(files):
        # print(file)
        cluster_name = file.split('.pkl')[0]
        
        # Read in significant tf-pairs
        df = pd.read_pickle(f"{answers_path}{tissue}/{file}")

        # has to remove cosine from dataframe, to get only the log2foldchanges
        # for each contrast (compared clusters in the tissue), to check if they have the same value in each column
        df_copy = df.copy()
        cosine_col_name = df_copy.columns[0]
        cosine_col = df_copy[cosine_col_name]
        df_only_log2fc = df_copy.drop(columns=[cosine_col_name])
        
        # find tf-pairs, that have the same value in each log2foldchange column
        df_res = find_rows_with_same_value(df_only_log2fc)
        
        # check if we found a tf-co-occurrence which is significant and only occurs in one cluster of a tissue
        if df_res.shape[0] > 0: 
            print(f"Cluster: {cluster_name}, which has tf-pairs that only occur in this cluster for tissue {tissue}. Found: {str(df_res.shape[0])}")
            # add cosine value back to dataframe (work on a copy, df_res is a slice of df_only_log2fc)
            df_res = df_res.copy()
            df_res[cosine_col_name] = cosine_col
            cluster_with_specials.append(df_res)
            # save the tf-pairs as csv, if we found some.
            df_res.to_csv(f"{answer_specials_save_path}{cluster_name}.csv")
        else:
            # only report a miss if nothing was found for this cluster
            print(f"Could not find single occurrences in cluster {cluster_name}.")
print("DONE")

### Checking - The found TF-pair occures only in a single cluster of a tissue

In [None]:
len(cluster_with_specials)
cluster_with_specials[0].T

In [None]:
comb = CombObj().from_pickle(f"{main_analysis_path}/C1PX3_thoracic_aorta/C1PX3_thoracic_aorta.7_peaks.pkl")
comb.rules.describe()

In [None]:
pd.DataFrame(comb.rules.loc['TBXT-GLI3'])

In [None]:
df = pd.read_pickle(f"{answers_path}/C1PX3_thoracic_aorta/C1PX3_thoracic_aorta.7_peaks.pkl")

In [None]:
df

In [None]:
df_copied = df.copy()

cosine_col_name = df_copied.columns[0]
cosine_col = df_copied[cosine_col_name]
dropped = df_copied.drop(columns=[cosine_col_name])
dropped
#df_copied.drop()

In [None]:

df1 = find_rows_with_same_value(dropped)
df1[cosine_col_name]= cosine_col
df1

In [None]:
df1.shape

In [None]:
combObj = CombObj().from_pickle(f"{main_analysis_path}/JF1O6_body_of_pancreas/JF1O6_body_of_pancreas.3_peaks.pkl")


In [None]:
combObj.rules

c = combObj.select_top_rules(n=100)

In [None]:
measure_threshold = utils.get_threshold(combObj.rules['cosine'], "both", percent=0.001)
measure_threshold

In [None]:
combObj.rules.describe()

In [None]:
#c.rules
r = combObj.rules
r[(r['cosine'] >= 0.06808819506421941)]

In [None]:
r.loc['OVOL1-OTX1']

# OLD
# -----------------------------------------------------------------------------------------
# OLD
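
The old cells below select the log2fc columns of one cluster by name. A plain substring match is not enough, because `right-lobe-of-liver.1` is also a prefix of `right-lobe-of-liver.10`. The following is a minimal sketch of that selection, assuming the column naming `<A>/<B>_cosine_log2fc` seen in the notebook; the function name `select_log2fc_columns` is hypothetical, and it uses `startswith`/`endswith` instead of the `in` checks of the original cells.

```python
def select_log2fc_columns(columns, cluster_name):
    """Pick the '<A>/<B>_cosine_log2fc' columns in which cluster_name is A or B."""
    selected = []
    for col in columns:
        # Including the '/' separator prevents '.1' from matching '.10'.
        is_numerator = col.startswith(f"{cluster_name}/")
        is_denominator = col.endswith(f"/{cluster_name}_cosine_log2fc")
        if is_numerator or is_denominator:
            selected.append(col)
    return selected


columns = [
    "right-lobe-of-liver.1_cosine",
    "right-lobe-of-liver.10_cosine",
    "right-lobe-of-liver.1/right-lobe-of-liver.2_cosine_log2fc",
    "right-lobe-of-liver.10/right-lobe-of-liver.1_cosine_log2fc",
    "right-lobe-of-liver.10/right-lobe-of-liver.2_cosine_log2fc",
]

picked = select_log2fc_columns(columns, "right-lobe-of-liver.1")
for col in picked:
    print(col)
```

The two plain `_cosine` columns and the contrast between clusters 10 and 2 are left out; only the contrasts that involve cluster 1 remain.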

In [None]:
# Analyses with the whole DiffCombObj
diff_file_names=read_in_file_names_of_folder(rel_path=differential_analysis_path)

normalized_diff_objects = []
diff_objects = []

for file in diff_file_names:
    obj = DiffCombObj().from_pickle(f"{differential_analysis_path}{file}")
    if "normalized" in file:
        normalized_diff_objects.append(obj)
    else:
        diff_objects.append(obj)

print(normalized_diff_objects)
print(diff_objects) 

normalized_dfs = []
for obj in normalized_diff_objects:
    normalized_dfs.append(obj.rules)
print("Done: Preparing rules of DiffObj")    

In [None]:
obj = normalized_diff_objects[0]
t = normalized_diff_objects[0].rules
t[t['right-lobe-of-liver.10_cosine'] > 0.7]
obj.rules

In [None]:
results_df = []
for df in normalized_dfs:

    for i, file in enumerate(files_main_mb):
        # print(file)
        cluster_name = file.split('.pkl')[0]
        print(cluster_name)
        
        # reduce to relevant columns of cluster
        cluster_cols = list(filter(lambda x: f'{cluster_name}' in x , df.columns))
        # NOT WORKING: logfc_cluster_cols = list(filter(lambda x: (f'{cluster_name}/' || f'/{cluster_name}') in x , cluster_cols)) 
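        # Important: match the name together with the '/' separator, so that e.g. right-lobe-of-liver.1
        # does not also select the columns of right-lobe-of-liver.10 (such as right-lobe-of-liver.10_cosine).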
        logfc_cluster_cols = []
        for entry in cluster_cols:
            if (f'{cluster_name}/' in entry) or (f'/{cluster_name}_cosine_log2fc' in entry):
                logfc_cluster_cols.append(entry)
        
        #print(logfc_cluster_cols)
        print(len(logfc_cluster_cols))
        #print(logfc_cluster_cols)
        
        reduced_df = df[logfc_cluster_cols]
        print(f'Initial Count: {reduced_df.shape}')
        
        # Count, per row, the entries that are not zero (0.00).
        # E.g. all 15 columns are 0.00 => val_counts = 0; 10 columns are non-zero => val_counts = 10
        val_counts = reduced_df[~reduced_df.isin([0])].count(axis=1).sort_values()
        # Set the threshold: here every column must be non-zero (e.g. 15); this could be varied
        selection_threshold = len(logfc_cluster_cols)
        
        # Keep all rows whose number of non-zero values reaches the threshold
        tfs_occ = val_counts[val_counts >= selection_threshold].index
        result = reduced_df.loc[tfs_occ]
        print(f'Zero filtered: {result.shape}')
        
        significants = get_significant_log2fc_rules(result, threshold_percent=0.05)
        results_df.append(significants)
        print(f'Cluster: {cluster_name}: {significants.shape} ,tf-pairs with significant log2fc-changes in comparison to all the other clusters in tissue: {tissue_name} ')
         

In [None]:
df1 = results_df[0]
print(df1.shape)
df1
# Found TF pairs that show significant changes in comparison to all the other clusters in the tissue.
# These 10 TF pairs of cluster 10 show significant changes in comparison to all the other clusters in the tissue right lobe of liver.

##### Only cosine values are comparable, so use only the cosine values to find specific TF co-occurrences.

In [None]:
test = compare_obj.rules
df = pd.DataFrame(test.loc['EN2-MYBL2'])
df.head(50)
test

# Only cosine values are comparable, so use only the cosine values to find specific TF co-occurrences.
np.percentile(test['right-lobe-of-liver.11_cosine'], 75)


In [None]:
test['right-lobe-of-liver.11_cosine'].describe()

In [None]:
# cluster 6
df1 = results_df[13]
print(df1.shape)
df1.head(50)
df1.loc['Sox17-Dlx4':'TP63-FOSJUNB']

In [None]:
# Filtering in case NaN values occur
filtered = df[logfc_cluster_cols]

val_counts = filtered.count(axis=1).sort_values()
#print(val_counts)
tfs_occ = val_counts[val_counts >=16].index
final = filtered.loc[tfs_occ]

In [None]:
df = normalized_dfs[0]
cluster_name = "right-lobe-of-liver.1"
cluster_cols = list(filter(lambda x: f'{cluster_name}' in x , df.columns))
# NOT WORKING: logfc_cluster_cols = list(filter(lambda x: (f'{cluster_name}/' || f'/{cluster_name}') in x , cluster_cols)) 
# Important: match the name together with the '/' separator, so that right-lobe-of-liver.1 does not also select right-lobe-of-liver.10
logfc_cluster_cols = []
for entry in cluster_cols:
    if (f'{cluster_name}/' in entry) or (f'/{cluster_name}_cosine_log2fc' in entry):
        logfc_cluster_cols.append(entry)
        
tmp = df[logfc_cluster_cols]
# Mask zeros as NaN, so that count() only counts the non-zero entries per row
tmp_val_counts = tmp[~tmp.isin([0])].count(axis=1).sort_values()
#print(tmp_val_counts)
tmp_tfs_occ = tmp_val_counts[tmp_val_counts == len(logfc_cluster_cols)].index
result = tmp.loc[tmp_tfs_occ]
print(result.shape)
result

In [None]:
df = normalized_dfs[0]
cluster_name = "right-lobe-of-liver.1"
cluster_cols = list(filter(lambda x: f'{cluster_name}' in x , df.columns))
# NOT WORKING: logfc_cluster_cols = list(filter(lambda x: (f'{cluster_name}/' || f'/{cluster_name}') in x , cluster_cols)) 
# Important: match the name together with the '/' separator, so that right-lobe-of-liver.1 does not also select right-lobe-of-liver.10
logfc_cluster_cols = []
for entry in cluster_cols:
    if (f'{cluster_name}/' in entry) or (f'/{cluster_name}_cosine_log2fc' in entry):
        logfc_cluster_cols.append(entry)
        
tmp = df[logfc_cluster_cols]
tmp

In [None]:
measure_threshold_1 = utils.get_threshold(result['right-lobe-of-liver.10/right-lobe-of-liver.1_cosine_log2fc']
                                        , "both", percent=0.05)
measure_threshold_2 = utils.get_threshold(result['right-lobe-of-liver.11/right-lobe-of-liver.1_cosine_log2fc']
                                        , "both", percent=0.05)
print(f'1: {measure_threshold_1}')
print(f'2: {measure_threshold_2}')

In [None]:
measure_threshold_1[1]

In [None]:
#t[t['right-lobe-of-liver.10_cosine'] > 0.7]
reduced_result_1 = result[(result['right-lobe-of-liver.10/right-lobe-of-liver.1_cosine_log2fc'] > measure_threshold_1[1]) | (result['right-lobe-of-liver.10/right-lobe-of-liver.1_cosine_log2fc'] < measure_threshold_1[0])]
reduced_result = reduced_result_1[(reduced_result_1['right-lobe-of-liver.11/right-lobe-of-liver.1_cosine_log2fc'] > measure_threshold_2[1]) | (reduced_result_1['right-lobe-of-liver.11/right-lobe-of-liver.1_cosine_log2fc'] < measure_threshold_2[0])]
reduced_result

In [None]:
f.describe()

In [None]:
m1 = pd.DataFrame(result.mean(axis=1), columns=['mean'])
m1['sum'] = result.sum(axis=1)
m1.plot.hist(by='mean', bins=100)
m1

In [None]:
tmp = filtered.fillna(0)
tmp_val_counts = tmp[~tmp.isin([0])].count(axis=1).sort_values()

print(tmp_val_counts)
tmp_tfs_occ = tmp_val_counts[tmp_val_counts >=16].index
result = filtered.loc[tmp_tfs_occ]

m1 = pd.DataFrame(result.mean(axis=1), columns=['mean'])
m1['sum'] = result.sum(axis=1)
m1.plot.hist(by='mean', bins=100)

## Old: Self implemented - Differential analysis - comparing each cluster
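
The old approach in this section compares every cluster with every other cluster of the tissue. As a minimal sketch of that enumeration (not the notebook's own code), `itertools.combinations` yields each unordered pair exactly once. The helper name `pairwise_comparisons` is hypothetical; the output file naming `<name_i>__<name_j>.pkl` follows the pattern used in the cells below.

```python
from itertools import combinations


def pairwise_comparisons(file_names):
    """List every unordered cluster pair together with its output file name."""
    # Strip the '.pkl' extension to get the cluster names.
    names = [name.split(".pkl")[0] for name in file_names]
    return [(a, b, f"{a}__{b}.pkl") for a, b in combinations(names, 2)]


files = [
    "right-lobe-of-liver.1.pkl",
    "right-lobe-of-liver.2.pkl",
    "right-lobe-of-liver.3.pkl",
]

pairs = pairwise_comparisons(files)
for name_a, name_b, out_file in pairs:
    print(name_a, name_b, out_file)
```

For n clusters this produces n * (n - 1) / 2 comparisons, which is why the number of pickled comparison files grows quickly with the number of clusters.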

In [None]:
# mb = market basket analysis
files_main_mb= read_in_file_names_of_folder(rel_path=main_analysis_path)
print(f"Count of Files: {len(files_main_mb)}")
print(f"Files: {files_main_mb}")


In [None]:
# Diff analysis for all clusters of a tissue:
# TODO: what should be compared ? All of a Tissue? or All? Naming?
tissue_name = 'right-lobe-of-liver'
combObj_to_compare = []
for i, file in enumerate(files_main_mb):
    print(file)
    name_i = file.split('.pkl')[0]
    tissue_name = file.split('.')[0]
    obj = CombObj().from_pickle(f"{main_analysis_path}{file}")
    obj.set_prefix(name_i)
    #print(obj)
    combObj_to_compare.append(obj)
    
compare_obj = DiffCombObj(combObj_to_compare, measure="cosine", join="outer", fillna=True)
compare_obj.to_pickle(f'{differential_analysis_path}{tissue_name}.pkl')
compare_obj.normalize()
compare_obj.calculate_foldchanges()
compare_obj.simplify_rules()
compare_obj.to_pickle(f'{differential_analysis_path}{tissue_name}_normalized.pkl')
#selection does not work?
#selected_std = compare_obj.select_rules()
#selected_std.to_pickle(f'{differential_analysis_selection_path}{tissue_name}.pkl')
print("Done differential analysis")

In [None]:
td = compare_obj.rules

tmp_val_counts = td[td.isin([0])].count(axis=1).sort_values()
#tmp_val_counts = td[td.isna()].count(axis=1).sort_values()

print(tmp_val_counts)
#tmp_tfs_occ = tmp_val_counts[tmp_val_counts >=16].index
#result = filtered.loc[tmp_tfs_occ]

In [None]:
# check original market basket analysis for count of tf-cooccurences
print(compare_obj.rules.shape)
count_of_all = 0
count_of_all_significant = 0
for obj in combObj_to_compare:
    print(obj.prefix)
    print(obj.rules.shape)
    obj.simplify_rules()
    print(obj.rules.shape)
    count_of_all = count_of_all + obj.rules.shape[0]
    sig = obj.select_significant_rules()
    print(sig.rules.shape)
    count_of_all_significant = count_of_all_significant + sig.rules.shape[0]

print("all simplified for tissue")
print(count_of_all)
print("all simplified for tissue and significant")
print(count_of_all_significant)
   
    

In [None]:
# Diff analysis between each cluster:
for i, file in enumerate(files_main_mb):
    print(file)
    name_i = file.split('.pkl')[0]
    
    for j in range(i + 1, len(files_main_mb), 1):
        file_j = files_main_mb[j]
        name_j = file_j.split('.pkl')[0]
        print(j)
        print(name_j)
        A = CombObj().from_pickle(f"{main_analysis_path}{file}")
        print(A)
        A.set_prefix(name_i)
        B = CombObj().from_pickle(f"{main_analysis_path}{file_j}")
        print(B)
        B.set_prefix(name_j)
        compare_obj = A.compare(B)
        compare_obj.to_pickle(f'{differential_analysis_path}{name_i}__{name_j}.pkl')
        
        selected_std = compare_obj.select_rules()
        
        # TODO: Save automatically generated thresholds
        # utils.get_threshold(new.rules.iloc[:,4], 'both', percent=0.05)
        # logfc threshold (-xxx , +xxx)
        #  utils.get_threshold(new.rules.iloc[:,2:4].mean(axis=1), 'upper', percent=0.05)
        # cosine threshold
        selected_std.to_pickle(f'{differential_analysis_selection_path}{name_i}__{name_j}.pkl')
        
        break  # NOTE: leaves the inner loop after the first comparison per cluster
        
        
print("Done differential analysis")

## OLD implementation
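
The old implementation below joins the rules tables of several comparisons on their index (the TF pair). A compact way to express such a chain of outer joins is `functools.reduce`; the following is only a sketch under the assumption that the column names are already unique per comparison. The function name `merge_on_index`, the column names and the TF pair names are made up for illustration.

```python
from functools import reduce

import pandas as pd


def merge_on_index(frames):
    """Outer-join a non-empty list of DataFrames on their index."""
    return reduce(
        lambda left, right: left.merge(
            right, how="outer", left_index=True, right_index=True
        ),
        frames,
    )


# Toy rules tables of three comparisons (values are made up).
a = pd.DataFrame({"log2fc_1_2": [1.0, -2.0]}, index=["TF_A-TF_B", "TF_C-TF_D"])
b = pd.DataFrame({"log2fc_1_3": [0.5]}, index=["TF_A-TF_B"])
c = pd.DataFrame({"log2fc_1_4": [3.0]}, index=["TF_E-TF_F"])

merged = merge_on_index([a, b, c])
print(merged)
```

With `how="outer"`, a TF pair that is missing from one comparison is kept and filled with NaN in that comparison's columns. `reduce` returns the single frame unchanged for a one-element list and raises a `TypeError` for an empty list, so the empty case needs its own guard.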

In [None]:
def prepare_diff_obj_dataframe(diff_obj: DiffCombObj) -> pd.DataFrame:
    
    # possible prefix names ['right-lobe-of-liver.10', 'right-lobe-of-liver.16']
    df = diff_obj.rules
    tissue_name_c1 , cluster_nr_c1  = diff_obj.prefixes[0].split('.')
    tissue_name_c2 , cluster_nr_c2  = diff_obj.prefixes[1].split('.')
    suff = ""
    if tissue_name_c1 == tissue_name_c2:
        suff += f"_{tissue_name_c1}"
    else:
        suff += f"_{tissue_name_c1}_{tissue_name_c2}"

    if cluster_nr_c1 == cluster_nr_c2:
        suff += f"_{cluster_nr_c1}"
    else:
        suff += f"_{cluster_nr_c1}_{cluster_nr_c2}"

    # column 4 (by position) holds the log2 fold change
    df['log2fc_class'] = df.apply(lambda x: 'negativ' if x.iloc[4] < 0 else 'positiv', axis=1)
    df.columns = [f'{x}{suff}' for x in df.columns]
    
    return df.copy(deep=True)


# Find the TF co-occurrences of a tissue that are unique to a specific cluster of that tissue.
# 1. Run the differential analysis
# 2. Read in the differential analyses for the specific cluster
# 3. Find the TF co-occurrences of the DiffCombObj that occur in each cluster
# Read in the file names of all analyses
files_main_mb= read_in_file_names_of_folder(rel_path=main_analysis_path)
print(f"Count of Files: {len(files_main_mb)}")
#print(f"Files: {files_main_mb}")

files_diff= read_in_file_names_of_folder(rel_path=differential_analysis_path)
print(f"Count of Files: {len(files_diff)}")

test = ""
for file_mb in files_main_mb:
    cluster_name = file_mb.split('.pkl')[0]
    print(cluster_name)
    diffs = list(filter(lambda x: cluster_name in x, files_diff))
    print(len(diffs))
    print(diffs)
    
    # Keeps the read in DiffCombObj diff_objects:
    diff_objects = []
    
    for diff in diffs:
        diff_obj = DiffCombObj().from_pickle(f"{differential_analysis_selection_path}{diff}")
        diff_objects.append(diff_obj)
    
    erg = None
    for i in range(len(diff_objects)-1):
        
        if erg is None:
            obj_1= diff_objects[i]
            obj_2 = diff_objects[i + 1]
            df1 = prepare_diff_obj_dataframe(diff_obj = obj_1)
            df2 = prepare_diff_obj_dataframe(diff_obj = obj_2)
            
            erg = df1.merge(df2, how='outer', left_index=True, right_index=True)
        else:
            obj_2 = diff_objects[i + 1] 
            df2 = prepare_diff_obj_dataframe(diff_obj = obj_2)
            erg = erg.merge(df2, how='outer', left_index=True, right_index=True)
       
    test = erg
    erg.to_pickle(path=f"{answers_path}{cluster_name}.pkl")
    
print("Done")    
test.columns

In [None]:
answer_file_names=read_in_file_names_of_folder(rel_path=answers_path)
print(answer_file_names)
cluster_dfs = []
df = None
for name in answer_file_names:
    df = pd.read_pickle(f"{answers_path}{name}")
    cluster_dfs.append(df)

    #df.groupby(['class', 'value']).count()
    break
filter_columns = list(filter(lambda x: 'log2fc_class' in x , df.columns))
#len(filter_columns)
filtered_df = df[df[filter_columns].notna().all(1)] #
filtered_df
df
#df3.iloc[:, 2:3]
#df = pd.read_pickle(f"{answers_path}right-lobe-of-liver.6.pkl")
#df = pd.read_pickle(f"{differential_analysis_selection_path}{right-lobe-of-liver.6.pkl}")


#original = CombObj().from_pickle(f"{main_analysis_path}right-lobe-of-liver.6.pkl")
#original.rules.loc[df.index]


In [None]:
df = pd.read_pickle(f"{answers_path}right-lobe-of-liver.6.pkl")
selection = DiffCombObj().from_pickle(f"{differential_analysis_selection_path}right-lobe-of-liver.10__right-lobe-of-liver.16.pkl")
selection_orig = DiffCombObj().from_pickle(f"{differential_analysis_path}right-lobe-of-liver.10__right-lobe-of-liver.16.pkl")
selection_orig
selection

original = CombObj().from_pickle(f"{main_analysis_path}right-lobe-of-liver.6.pkl")
original.rules.loc[df.index]
selection.prefixes

### Trial and Error section
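
Several cells in this section filter a dataframe with combined conditions. In pandas each comparison needs its own parentheses, because `&` binds more tightly than `==`. The toy data below is an assumption and only illustrates the pattern.

```python
import pandas as pd

# Toy table (values are made up).
df = pd.DataFrame(
    {"value1": [1, 1, 3], "value2": [1, 2, 1]},
    index=["my1", "my2", "my3"],
)

# Each comparison is wrapped in parentheses before combining with '&'.
both_one = df.loc[(df["value1"] == 1) & (df["value2"] == 1)]
print(both_one.index.tolist())
```

Only the row `my1` fulfils both conditions.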

In [None]:
#mb_obj = CombObj().from_pickle(f"{main_analysis_path}right-lobe-of-liver.10.pkl")
obj_1 = DiffCombObj().from_pickle(f"{differential_analysis_path}right-lobe-of-liver.16__right-lobe-of-liver.1.pkl")
obj_2 = DiffCombObj().from_pickle(f"{differential_analysis_path}right-lobe-of-liver.16__right-lobe-of-liver.2.pkl")
obj_3 = DiffCombObj().from_pickle(f"{differential_analysis_path}right-lobe-of-liver.16__right-lobe-of-liver.3.pkl")
obj_4 = DiffCombObj().from_pickle(f"{differential_analysis_path}right-lobe-of-liver.16__right-lobe-of-liver.4.pkl")
obj_5 = DiffCombObj().from_pickle(f"{differential_analysis_path}right-lobe-of-liver.16__right-lobe-of-liver.5.pkl")
obj_6 = DiffCombObj().from_pickle(f"{differential_analysis_path}right-lobe-of-liver.16__right-lobe-of-liver.6.pkl")
#mb_obj.rules
#type(diff_obj.rules)
#diff_obj2.rules
df_diff = pd.concat([obj_1.rules, obj_2.rules, obj_3.rules, obj_4.rules, obj_5.rules, obj_6.rules], join="inner")



#df_diff2 = pd.concat([diff_obj_1_1.rules,diff_obj2.rules, diff_obj1_2.rules])

unified_duplicates = df_diff[df_diff.duplicated(subset=['TF1', 'TF2'], keep='first')]

df_diff2 = pd.concat([unified_duplicates, diff_obj1_2.rules])

unified_duplicates2 = df_diff2[df_diff2.duplicated(subset=['TF1', 'TF2'], keep='first')]

#df_diff = df_diff.drop_duplicates(subset=['TF1', 'TF2'])
#unified_duplicates
#unified_duplicates
#unified_duplicates2
#diff_obj_1_1.rules
#unified_duplicates



In [None]:
obj_1 = DiffCombObj().from_pickle(f"{differential_analysis_selection_path}right-lobe-of-liver.16__right-lobe-of-liver.1.pkl")
obj_2 = DiffCombObj().from_pickle(f"{differential_analysis_selection_path}right-lobe-of-liver.16__right-lobe-of-liver.2.pkl")

obj_1.simplify_rules()
obj_2.simplify_rules()
obj3 = obj_1.rules.merge(obj_2.rules, left_index=True, right_index=True, suffixes=(f"_{obj_1.prefixes[0]}_{obj_1.prefixes[1]}", f"_{obj_2.prefixes[0]}_{obj_2.prefixes[1]}"))
obj3


In [None]:
#df['log2fc_class'] = df.apply(lambda x: 'negativ' if x[4] < 0 else 'positiv', axis=1)

#removedNAN = df[df.notna().all(1)]

#df2 = removedNAN[(removedNAN[filter_columns] > 0.0) | (removedNAN[filter_columns] < 0.0)]
#df2[df2.notna().all(1)]
#filtered_df = df[df[filter_columns].notna().all(1)]

In [None]:
diff_obj_1_1.rules.loc['Foxd3-ONECUT2']
diff_obj2.rules.loc['Foxd3-ONECUT2']

In [None]:
top30C = selectedC.select_top_rules(n=30)
top30C.rules.head(31)

In [None]:
df1 = pd.DataFrame({
                    'value1': [-1, -3.33, 3, 6,8,7],
                    'value2': [-1, 3.33, 3.4, 9,2,7],
                    'value3': [1, -3.33, 3, 2,4,7],
                    'value4': [1, 3.33, 3, 1,9,7]},
                   index=['my1', 'my2', 'my3', 'my4', 'my5', 'my6'])
df2 = pd.DataFrame({'rkey': ['foo', 'bar', 'baz', 'foo','test'],
                    'value': [5, 6, 7, 8, 9]},
                  index=['my1', 'not2', 'my3', 'not4', 'my5'])
df3 = pd.DataFrame({'rkey': ['new', 'lol'],
                    'value': [5, 6]},
                  index=['my1', 'not2'])
df1

#df1.merge(df2, left_on='lkey', right_on='rkey')
#erg = df1.merge(df2, left_index=True, right_index=True, suffixes=("_test", "_test2"))
#erg = df1.merge(df2, how='outer', left_index=True, right_index=True, suffixes=("_x", "_y"))
#erg = erg.merge(df3, how='outer', left_index=True, right_index=True, suffixes=("_x", "_"))
#erg


In [None]:
df1.loc[(df1['value1'] == 1) & (df1['value2'] == 1)]

In [None]:
df1[df1.apply(lambda x: min(abs(x)) == max(abs(x)), 1)]

In [None]:
df2.columns = [f'{x}_df2' for x in df2.columns]
df2

In [None]:
[x+ 1 for x in df2.columns]

In [None]:
df1['class'] = df1.apply(lambda x: 'niedrig' if x.iloc[1] < 5 else 'hoch', axis=1)

In [None]:
df1[df1['class'] == 'hoch']

In [None]:
df1.groupby(['class', 'value']).count()

In [None]:
df2

In [None]:
df3.iloc[:, 2:3]

In [None]:
top30C.plot_bubble()

In [None]:
top30C.plot_network()

In [None]:
selectedC.rules

In [None]:
#### OLD 
# Find the TF co-occurrences of a tissue that are unique to a specific cluster of that tissue.
# 1. Run the differential analysis
# 2. Read in the differential analyses for the specific cluster
# 3. Find the TF co-occurrences of the DiffCombObj that occur in each cluster
# Read in the file names of all analyses
files_main_mb= read_in_file_names_of_folder(rel_path=main_analysis_path)
print(f"Count of Files: {len(files_main_mb)}")
#print(f"Files: {files_main_mb}")

files_diff= read_in_file_names_of_folder(rel_path=differential_analysis_path)
print(f"Count of Files: {len(files_diff)}")
#print(f"Files: {files_diff}")
test = ""
for file_mb in files_main_mb:
    cluster_name = file_mb.split('.pkl')[0]
    print(cluster_name)
    diffs = list(filter(lambda x: cluster_name in x, files_diff))
    print(len(diffs))
    print(diffs)
    
    # Keeps the read in DiffCombObj diff_objects:
    diff_objects = []
    
    for diff in diffs:
        diff_obj = DiffCombObj().from_pickle(f"{differential_analysis_selection_path}{diff}")
        diff_objects.append(diff_obj)
    
    # first DiffCombObj dataframe
    initial_df = diff_objects[0].rules
    
    # has negative and positive fold changes
    cross_product_merged = initial_df
    
    # only pos foldchange
    pos_merged = initial_df[initial_df.iloc[:,4] > 0.00]
    
    # only neg foldchange
    neg_merged = initial_df[initial_df.iloc[:,4] < 0.00]
    for i in range(len(diff_objects)-1):
        obj_1= diff_objects[i]
        obj_2 = diff_objects[i + 1]
            
        # cross_product: merge the rules dataframes by index (TFs)
        cross_product = cross_product_merged.merge(obj_2.rules, left_index=True, right_index=True, suffixes=(f"_{obj_1.prefixes[0]}_{obj_1.prefixes[1]}", f"_{obj_2.prefixes[0]}_{obj_2.prefixes[1]}"))
        cross_product_merged = cross_product.copy(deep=True)
        
        # pos: merge the rules dataframes by index (TFs)
        obj2_df_pos = obj_2.rules[obj_2.rules.iloc[:,4] > 0.00]
        df_pos_merged = pos_merged.merge(obj2_df_pos, left_index=True, right_index=True, suffixes=(f"_{obj_1.prefixes[0]}_{obj_1.prefixes[1]}", f"_{obj_2.prefixes[0]}_{obj_2.prefixes[1]}"))
        pos_merged = df_pos_merged.copy(deep=True)
        
        # neg: merge the rules dataframes by index (TFs)
        obj2_df_neg = obj_2.rules[obj_2.rules.iloc[:,4] < 0.00]
        df_neg_merged = neg_merged.merge(obj2_df_neg, left_index=True, right_index=True, suffixes=(f"_{obj_1.prefixes[0]}_{obj_1.prefixes[1]}", f"_{obj_2.prefixes[0]}_{obj_2.prefixes[1]}"))
        neg_merged = df_neg_merged.copy(deep=True)
        
        
    cross_product_merged.to_pickle(path=f"{answers_path}{cluster_name}_cross.pkl")
    pos_merged.to_pickle(path=f"{answers_path}{cluster_name}_pos.pkl")
    neg_merged.to_pickle(path=f"{answers_path}{cluster_name}_neg.pkl")
    
print("Done")    
test
