# NetInteract
### Extension of NetSeed, NetCooperate, NetCmpt, and PhyloMint

Updated May 12<sup>th</sup>, 2021 <br>
<i>Jessica Hoban, The Russell Lab at Drexel University</i>


---
## Contents: <br>
> **[Setup](#Setup)** <br><br>
> **[Step 1: Get all combinations of networks](#Step1)** <br><br>
> **[Step 2: Extract required input](#Step2)** <br><br>
> **[Step 3: Define scoring algorithm](#Step3)** <br><br>
> **[Step 4: Compute scores (overall, by-category, by-pathway)](#Step4)** <br><br>
> **[Step 5: Compile result files](#Step5)** <br><br>
> **[Full Pipeline](#Full_Pipeline)** <br><br>
> **[Extracting Unique Facultative-Facultative Influence from Results (Removing Obligate Influence)](#Remove_Obligate)** <br><br>
> **[Averaging Across Obligate Symbionts](#Average_Across_Obligate)** <br><br>
> **[Compiling Compound-Level Scores](#Compiling_Compound_Scores)** <br><br>




<br><br><br><br><br><br><br><br><br>

<a id='Setup'></a>
## <span style="margin:auto;display:table">Setup</span>

----


#### Install libraries
   
All libraries should come pre-installed with Anaconda. If one is missing, search for ```conda install [library_name]```; the anaconda.org page for that library lists the exact install command.
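
To see which libraries are missing before running the notebook, each import can be probed first. The following is a minimal sketch using only the standard library; the helper name `find_missing` is hypothetical and not part of NetInteract.

```python
import importlib.util

# Third-party libraries used by the import cell of this notebook.
REQUIRED = ["pandas", "matplotlib", "seaborn"]

def find_missing(names):
    """Return the names that the import system cannot locate."""
    return [name for name in names if importlib.util.find_spec(name) is None]

missing = find_missing(REQUIRED)
if missing:
    print("Missing libraries:", ", ".join(missing))
else:
    print("All required libraries are available.")
```

Any name reported as missing can then be installed with the matching `conda install` command.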

#### Import libraries

In [1]:
from pprint import pprint # Displays data nicely
import json #JSON serialization & de-serialization
import glob # Gets all files in directory of a certain filetype
import itertools # Creates permutations of organism pairs
import pandas as pd # For manipulating tabular data
from copy import copy

import matplotlib
import matplotlib.pyplot as plt # Plotting
import seaborn as sns #Plotting
import os
import math #For determining NaN values
import time #For keeping track of how long each run takes

<br><br><br><br><br><br><br><br><br>

<a id='Step1'></a>
## <span style="margin:auto;display:table">Step 1: Get all combinations of networks</span>

#### Required input: 
> ```genome_information_filepath```  &emsp; *(default = "./Input/Genome Information.csv")*<br>

#### Overview of process:<br>

This notebook was designed to work in conjunction with the <b>Metabolic Networks</b> directory (see structural diagram below), created using the complementary **"Metabolic Network Reconstruction"** Jupyter Notebook.

<img src="./Images/Directory Structure.svg" style="width: 60%;">

The code below reads the file at ```genome_information_filepath``` and collects the following information for each network:
> ```Database```   &emsp; (KEGG, IMG, NCBI, User-Supplied) <br>
> ```Organism Name``` <br>
> ```Organism ID``` <br>
> ```Organism Type```   &emsp; (Host, Obligate Symbiont, or Facultative Symbiont) <br>
> ```Network Filepath``` <br>
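
As an illustration of how the ```Network Filepath``` follows from the other fields, here is a small sketch that mirrors the directory layout used in this step. The function name `network_filepath` and the example record are hypothetical; only the path pattern is taken from the notebook code.

```python
def network_filepath(organism_type, organism_name, id_column, organism_id,
                     root="./Metabolic Networks"):
    """Build the expected path to a network JSON file."""
    # The database name is the ID column header without its trailing " ID",
    # e.g. "KEGG ID" -> "KEGG".
    database_name = id_column[:-3]
    folder = "/".join([root, organism_type, organism_name,
                       f"{database_name} ({organism_id})", "Metabolic Network"])
    return f"{folder}/{database_name}_{organism_id}_metabolic_network.json"

# Hypothetical example record (not taken from the real input file).
path = network_filepath("Facultative Symbiont", "Example symbiont", "KEGG ID", "abc")
print(path)
```

A network is only included in later steps if a file actually exists at this path.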

----


In [2]:
"""
Steps:
1) Read in the CSV file of genome information as a pandas DataFrame
2) Populate 3 lists of dictionaries, one for each organism type (Host, Obligate Symbiont, Facultative Symbiont)
    - Each dictionary represents a completed metabolic network and contains the following information:
        - "Database" 
        - "Organism Name" 
        - "Organism ID" 
        - "Organism Type" 
        - "Network Filepath"
3) Find all unique combinations of these networks, such that:
    - There is 1 host, 1 obligate symbiont, and 1-2 facultative symbionts
4) Return these combinations of networks

Returns: 
A list of dictionaries in the form:

    [ 
        {
        "Host Network": 
            {
            "Database" :
            "Organism Name" :
            "Organism ID" :
            "Organism Type" :
            "Network Filepath" :
            }
         "Obligate Symbiont Network": 
            {
            "Database" :
            "Organism Name" :
            "Organism ID" :
            "Organism Type" :
            "Network Filepath" :
            }
         "Facultative Symbiont Network(s)": 
             [
                {
                "Database" :
                "Organism Name" :
                "Organism ID" :
                "Organism Type" :
                "Network Filepath" :
                }
             ]
         }
    ]
                                   

"""

def step1(genome_information_filepath = "./Input/Genome Information.csv"): #___CHANGED___ to ./
    
    #------------GET NETWORK INFORMATION------------
    # Read in "Genome Information.csv" as a pandas DataFrame
    genome_information = pd.read_csv(genome_information_filepath)

    # Initialize lists to hold network information
    host_networks = []
    obligate_networks = []
    facultative_networks = []
    
    # Iterate over rows in DataFrame
    for _, row in genome_information.iterrows():

        # Iterate over database columns
        for database in ["KEGG ID", "IMG ID", "NCBI ID","User-Supplied ID"]:

            # Check that the cell isn't NaN
            if not (isinstance(row[database], float) and  math.isnan(row[database])):

                # Convert IMG IDs to integers(otherwise a ".0" will be added to the end)
                if database == "IMG ID":
                    row[database] = int(row[database])
                    
                # Grab information into new variables (for readability)
                organism_ID = str(row[database])
                organism_name = row['Organism Name']
                organism_type = row['Organism Type']
                database_name = database[:-3]
                    
                # Define the genome's folderpath
                metabolic_network_folderpath = "./Metabolic Networks/" + organism_type + "/" + organism_name + "/" + database_name + " (" + organism_ID + ")/Metabolic Network"  #___CHANGED___ to ./
                # Define where we want future destination files to go (i.e., in the "Metabolic Network" folder for that genome)
                metabolic_network_filepath = metabolic_network_folderpath + "/" + database_name + "_" + organism_ID + "_metabolic_network.json"
                
                if os.path.isfile(metabolic_network_filepath):
                    
                    network_input = {"Database" : database_name,\
                                    "Organism Name" : organism_name,\
                                    "Organism ID" : organism_ID,\
                                    "Organism Type" : organism_type,\
                                    "Network Filepath" : metabolic_network_filepath}
                
                    if organism_type == "Host": host_networks.append(network_input)
                        
                    elif organism_type == "Obligate Symbiont": obligate_networks.append(network_input)
                
                    elif organism_type == "Facultative Symbiont": facultative_networks.append(network_input)
                        
    #------------GET NETWORK COMBINATIONS------------
    combinations = []

    for host_network in host_networks: # One
        for obligate_network in obligate_networks: #One

            for facultative_network in facultative_networks: #One
                combination = {"Host Network": host_network,\
                               "Obligate Symbiont Network": obligate_network,\
                               "Facultative Symbiont Network(s)": [facultative_network]}
                combinations.append(combination)

            if len(facultative_networks) >= 2: # Need at least 2
                for combo in itertools.combinations(facultative_networks, 2): # Two
                    combination = {"Host Network": host_network,\
                                   "Obligate Symbiont Network": obligate_network,\
                                   "Facultative Symbiont Network(s)": list(combo)}
                    combinations.append(combination)
                   
    return combinations
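
To get a feel for how many combinations this step produces, the facultative groupings can be enumerated on their own. This is a sketch with made-up network labels; `count_combinations` is a hypothetical helper that only reproduces the counting logic (one host, one obligate symbiont, and one or two facultative symbionts per combination).

```python
import itertools

def count_combinations(n_hosts, n_obligates, n_facultatives):
    """Number of combinations with 1 host, 1 obligate, and 1-2 facultative symbionts."""
    singles = n_facultatives
    pairs = n_facultatives * (n_facultatives - 1) // 2  # unordered pairs
    return n_hosts * n_obligates * (singles + pairs)

# Hypothetical facultative symbiont labels.
facultatives = ["F1", "F2", "F3"]

# Every single symbiont, followed by every unordered pair.
groups = [[f] for f in facultatives]
groups += [list(pair) for pair in itertools.combinations(facultatives, 2)]

print(len(groups))                  # facultative groupings per host/obligate pair
print(count_combinations(1, 2, 3))  # total combinations for 1 host, 2 obligates
```

Because the pairs are unordered, the number of combinations grows quadratically with the number of facultative symbionts.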

<br><br><br><br><br><br><br><br><br>

<a id='Step2'></a>
## <span style="margin:auto;display:table">Step 2: Extract required input</span>

#### Required input: 
> ```combination```<br>


#### Overview of process:<br>

To compute all scores, the following is needed for each organism type: <br>

>Host: 
>- ```All compounds```


>Symbiont (Obligate or Facultative):
>- ```All Seeds```
>- ```All SeedGroups```
>- ```All NonSeeds```
>- ```NonSeeds by Category```
>- ```NonSeeds by Pathway```
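
The by-category breakdown comes from walking the nested network JSON (categories, pathways, reactions, compounds) and keeping the compounds flagged as nonseeds. The sketch below shows that traversal on a toy network; the toy data and the helper name `nonseed_ids_by_category` are hypothetical, while the key names follow the network files read in this step.

```python
def nonseed_ids_by_category(org_dict):
    """Map each category name to the set of nonseed compound IDs it contains."""
    result = {}
    for category in org_dict["organism"]["categories"]:
        ids = set()  # a set removes compounds repeated across reactions
        for pathway in category["pathways"]:
            for reaction in pathway["reactions"]:
                for compound in reaction["compounds"]:
                    if not compound["isSeed"]:
                        ids.add(compound["compoundID"])
        result[category["categoryName"]] = ids
    return result

# Hypothetical toy network with one category and one pathway.
toy_network = {"organism": {"categories": [
    {"categoryName": "Amino acid metabolism",
     "pathways": [
         {"pathwayName": "P1", "pathwayID": "p1", "reactions": [
             {"compounds": [{"compoundID": "C1", "isSeed": True},
                            {"compoundID": "C2", "isSeed": False}]},
             {"compounds": [{"compoundID": "C2", "isSeed": False},
                            {"compoundID": "C3", "isSeed": False}]},
         ]},
     ]},
]}}

by_category = nonseed_ids_by_category(toy_network)
print(by_category)
```

The by-pathway breakdown works the same way, except that the ID set is reset for every pathway rather than for every category.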

---


In [3]:
"""
Steps:
1) Define inner function to extract nonseeds by category
2) Define inner function to extract nonseeds by category AND pathway
3) For each network, open JSON file and extract required information 
   (i.e., all nodes, all seeds, all seedgroups, all nonseeds, nonseeds by category, nonseeds by pathway)
4) Add this information to initial "combination" structure and return.

Returns:

Modified structure from step 1 in the form:

    {
    "Host Network": 
        {
        "Database" :
        "Organism Name" :
        "Organism ID" :
        "Organism Type" :
        "Network Filepath" :
        "All Nodes" :
        }
     "Obligate Symbiont Network": 
        {
        "Database" :
        "Organism Name" :
        "Organism ID" :
        "Organism Type" :
        "Network Filepath" :
        "All Seeds" :
        "All SeedGroups" :
        "All NonSeeds" :
        "Nonseeds by Category" :
            [
                {
                "categoryName" :
                "categoryNonSeeds"
                }
                ,...
            ]
        "Nonseeds by Pathway" :
            [
                {
                "categoryName" :
                "pathways" : 
                    [
                        {
                        "pathwayName"
                        "pathwayID"
                        "pathwayNonSeeds"
                        }
                        ,...
                    ]
                }
            ]
        }
     "Facultative Symbiont Network(s)": 
         [
            {
            "Database" :
            "Organism Name" :
            "Organism ID" :
            "Organism Type" :
            "Network Filepath" :
            "All Seeds" :
            "All SeedGroups" :
            "All NonSeeds" :
            "Nonseeds by Category" :
                [
                    {
                    "categoryName" :
                    "categoryNonSeeds"
                    }
                    ,...
                ]
            "Nonseeds by Pathway" :
                [
                    {
                    "categoryName" :
                    "pathways" : 
                        [
                            {
                            "pathwayName"
                            "pathwayID"
                            "pathwayNonSeeds"
                            }
                            ,...
                        ]
                    }
                ]
            }
         ]
     }
"""

def step2(combination):
    
    #---------------------------
    # Returns a list of dictionaries in the form:
    # [
    #    {
    #    "categoryName" :
    #    "categoryNonSeeds"
    #    }
    #    ,...
    # ]
    #
    def extract_nonseeds_by_category(org_dict):
        
        # Initialize list to hold nonseeds within categories
        nonseeds_by_category = []
        
        # Grab information for all nonseeds
        # (Not broken down by category to improve performance,
        # will go through each category to find IDs to filter by)
        all_nonseeds = org_dict["organism"]["allNonSeeds"]
    
        # Iterate over all metabolic network categories
        for cat_dict in org_dict["organism"]["categories"]:

            # Extract category name
            categoryName = cat_dict["categoryName"]

            # Initialize set to hold all nonseed IDs per category
            category_nonseed_IDs = set()

            # Iterate over all metabolic network pathways
            for path_dict in cat_dict["pathways"]:
                # Iterate over all metabolic network reactions
                for reaction_dict in path_dict["reactions"]:
                    # Iterate over all metabolic network compounds
                    for compound_dict in reaction_dict["compounds"]:
                        # If the compound is a non-seed
                        if compound_dict["isSeed"] == False:
                            # Add to category's nonseed IDs
                            category_nonseed_IDs.add(compound_dict["compoundID"])

            # Convert the (already de-duplicated) set to a list
            category_nonseed_IDs = list(category_nonseed_IDs)   

            # Based on nonseedIDs, filter all nonseeds for each category
            # (Based on IDs, grab all information about that nonseed)
            category_nonseeds = [i for i in all_nonseeds if i["compoundID"] in category_nonseed_IDs]
            
            # Initialize temporary dictionary to hold category information
            temp_dict1 = {"categoryName": categoryName,\
                         "categoryNonSeeds": category_nonseeds}

            # Add this to final categories dictionary
            nonseeds_by_category.append(temp_dict1)

        # Return
        return nonseeds_by_category
    
    #---------------------------
    # Returns a list of dictionaries in the form:
    # [
    #    {
    #    "categoryName" :
    #    "pathways" : 
    #        [
    #            {
    #            "pathwayName"
    #            "pathwayID"
    #            "pathwayNonSeeds"
    #            }
    #            ,...
    #        ]
    #    }
    #    ,...
    # ]
    #
    def extract_nonseeds_by_pathway(org_dict):

        # Initialize list to hold nonseeds within categories and pathways
        nonseeds_by_category_pathway = []
        
        # Grab information for all nonseeds
        # (Not broken down by category to improve performance,
        # will go through each category to find IDs to filter by)
        all_nonseeds = org_dict["organism"]["allNonSeeds"]
    
        # Iterate over all metabolic network categories
        for cat_dict in org_dict["organism"]["categories"]:

            # Extract category name
            categoryName = cat_dict["categoryName"]
            
            # Initialize temporary dictionary to hold category information
            temp_dict1 = {"categoryName": categoryName,\
                         "pathways": []}

            # Iterate over pathways
            for path_dict in cat_dict["pathways"]:

                # Initialize set to hold nonseedIDs per pathway
                pathway_nonseed_IDs = set()

                # Iterate over reactions
                for reaction_dict in path_dict["reactions"]:
                    # Iterate over compounds
                    for compound_dict in reaction_dict["compounds"]:
                        # If the compound is a non-seed
                        if compound_dict["isSeed"] == False:
                            # Add to pathway's nonseed IDs
                            pathway_nonseed_IDs.add(compound_dict["compoundID"])

                # Convert the (already de-duplicated) set to a list
                pathway_nonseed_IDs = list(pathway_nonseed_IDs)  

                # Based on nonseedIDs, filter all nonseeds for each pathway
                # (Based on IDs, grab all information about that nonseed)
                pathway_nonseeds = [i for i in all_nonseeds if i["compoundID"] in pathway_nonseed_IDs]

                # Create and populate temporary dictionary to hold pathway information
                temp_dict2 = {"pathwayName": path_dict["pathwayName"],\
                              "pathwayID": path_dict["pathwayID"],\
                              "pathwayNonSeeds" : pathway_nonseeds}

                # Append to temporary dictionary of category information
                temp_dict1["pathways"].append(temp_dict2)

            # Append category-pathway information to final list
            nonseeds_by_category_pathway.append(temp_dict1)

        # Return
        return nonseeds_by_category_pathway
      
        
    #---------------------------
    # Extract required host information
    with open(combination["Host Network"]["Network Filepath"], "r") as f:
        
        host_network = json.load(f)
        combination["Host Network"]["All Nodes"] = host_network["organism"]["allNodes"] # List of compound IDs
        
    # Extract required obligate information
    with open(combination["Obligate Symbiont Network"]["Network Filepath"], "r") as f:
        
        obligate_network = json.load(f)
        combination["Obligate Symbiont Network"]["All Seeds"] = obligate_network["organism"]["allSeeds"] #List of compound IDs
        combination["Obligate Symbiont Network"]["All SeedGroups"]  = obligate_network["organism"]["allSeedGroups"] #List of dictionaries, one for each seed group
        combination["Obligate Symbiont Network"]["All NonSeeds"]  = obligate_network["organism"]["allNonSeeds"] #List of dictionaries, one for each nonseed
        combination["Obligate Symbiont Network"]["Nonseeds by Category"] = extract_nonseeds_by_category(org_dict = obligate_network)
        combination["Obligate Symbiont Network"]["Nonseeds by Pathway"] = extract_nonseeds_by_pathway(org_dict = obligate_network)
        
    # Iterate over facultative symbionts (1-2)
    for facultative_symbiont in combination["Facultative Symbiont Network(s)"]:
        
        # Extract required facultative information 
        with open(facultative_symbiont["Network Filepath"], "r") as f:
            
            facultative_network = json.load(f)
            facultative_symbiont["All Seeds"] = facultative_network["organism"]["allSeeds"] #List of compound IDs
            facultative_symbiont["All SeedGroups"] = facultative_network["organism"]["allSeedGroups"] #List of dictionaries, one for each seed group
            facultative_symbiont["All NonSeeds"] = facultative_network["organism"]["allNonSeeds"] #List of dictionaries, one for each nonseed
            facultative_symbiont["Nonseeds by Category"] = extract_nonseeds_by_category(org_dict = facultative_network)
            facultative_symbiont["Nonseeds by Pathway"] = extract_nonseeds_by_pathway(org_dict = facultative_network)
        
    # Return
    return combination

<br><br><br><br><br><br><br><br><br>

<a id='Step3'></a>
## <span style="margin:auto;display:table">Step 3: Define scoring algorithm</span>

#### Required input: 
> ```all_seedgroups```<br>
> ```all_nonseeds```<br>
> ```interacting_nodes```<br>
> ```interaction_type```<br>

#### Overview of process:<br>

For each pair of networks, we calculate three different scores: Biosynthetic Support Score (BSS), Metabolic Complementarity Index (MCI), and Effective Metabolic Overlap (EMO). <br>

The BSS and MCI scores are potential indicators of complementation based on the NetCooperate tool, with the basic theory being that one organism supplies the “seeds” of another organism. The BSS score is designed for host-microbe interactions, where host compounds that overlap with microbe seeds are counted as exchanged metabolites. Similarly, the MCI score is designed for microbe-microbe interactions, where one microbe’s nonseeds that overlap with the other microbe’s seeds are counted as exchanged metabolites.<br>

The EMO score is a potential indicator of either competition or habitat filtration and is based on the aforementioned NetCmpt tool. This score computes the overlap between two microbes’ seeds (i.e., what they both require). <br>

Our implementation for these three scores is based on the same overall equation, just with different inputs for what we consider “interacting compounds”. For BSS, MCI, and EMO respectively, these interacting compounds are the first network’s total compounds, nonseeds, and seeds. <br>
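
The choice of interacting compounds per score can be sketched as a small lookup. This is an illustrative sketch, not the notebook's own code: the function name `interacting_nodes_for` is hypothetical, and it assumes the affecting network carries the keys added in Step 2 (```All Nodes```, ```All NonSeeds```, ```All Seeds```).

```python
def interacting_nodes_for(interaction_type, affecting_network):
    """Return the compound IDs of the affecting network that count as interacting."""
    if interaction_type == "BSS":   # host-microbe: all host compounds
        return set(affecting_network["All Nodes"])
    if interaction_type == "MCI":   # microbe-microbe: the other microbe's nonseeds
        return {n["compoundID"] for n in affecting_network["All NonSeeds"]}
    if interaction_type == "EMO":   # microbe-microbe: the other microbe's seeds
        return set(affecting_network["All Seeds"])
    raise ValueError(f"Unknown interaction type: {interaction_type}")

# Hypothetical toy networks.
host = {"All Nodes": ["C1", "C2", "C3"]}
symbiont = {"All Seeds": ["C1"],
            "All NonSeeds": [{"compoundID": "C2"}, {"compoundID": "C3"}]}

print(sorted(interacting_nodes_for("BSS", host)))
print(sorted(interacting_nodes_for("MCI", symbiont)))
print(sorted(interacting_nodes_for("EMO", symbiont)))
```

Returning sets keeps the later overlap checks against seed groups fast.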

Rather than go from the outside-in (i.e., looking at the affected seed groups), we chose to go from the inside-out of the network (i.e., looking at the affected nonseeds). The reason for this is that different seeds will have different levels of downstream impacts on the overall network and will impact nonseeds of varying importance. 

In simple terms:
- For each nonseed in Network X, we find its parent seed groups. 
> - For each parent seed group, we find the percent of affected seeds, i.e., the fraction of its seeds that overlap with Network Y’s interacting compounds. We multiply this by how close the nonseed and parent seed group are and by how important the nonseed is.<br><br>
> - We add this up over all parent seed groups and divide by the total number of parent seed groups for that nonseed.
- We do this for each nonseed, add the results together, and divide by the total number of nonseeds in the network to get the average nonseed score.
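
The steps above can be traced on a toy network. The following is a compact sketch under stated assumptions: the function name `interaction_score` and the toy data are hypothetical, the stored `distance` value is used directly as the closeness factor, and the scaling constant is left at 1 so the arithmetic is easy to check by hand.

```python
def interaction_score(seedgroups, nonseeds, interacting_nodes, scale=1.0):
    """Average nonseed score for one affected network."""
    # Fraction of each seed group's seeds that overlap with the interacting nodes.
    fraction = {
        g["seedGroupID"]:
            sum(c in interacting_nodes for c in g["seedCompoundIDs"])
            / len(g["seedCompoundIDs"])
        for g in seedgroups
    }
    total = 0.0
    for nonseed in nonseeds:
        parents = nonseed["seedGroupPredecessors"]
        if not parents:
            continue  # nonseeds without parent seed groups contribute nothing
        summed = sum(fraction[p["seedGroupID"]] * p["distance"] * nonseed["weight"]
                     for p in parents)
        total += summed / len(parents)
    return scale * total / len(nonseeds) if nonseeds else 0.0

# Hypothetical toy data: two seed groups and two nonseeds.
seedgroups = [{"seedGroupID": "G1", "seedCompoundIDs": ["A", "B"]},
              {"seedGroupID": "G2", "seedCompoundIDs": ["C"]}]
nonseeds = [
    {"compoundID": "N1", "weight": 0.5,
     "seedGroupPredecessors": [{"seedGroupID": "G1", "distance": 1.0},
                               {"seedGroupID": "G2", "distance": 0.5}]},
    {"compoundID": "N2", "weight": 1.0,
     "seedGroupPredecessors": [{"seedGroupID": "G1", "distance": 0.5}]},
]

score = interaction_score(seedgroups, nonseeds, interacting_nodes={"A", "C"})
print(score)
```

Here G1 is half affected and G2 fully affected, so N1 scores (0.25 + 0.25) / 2 = 0.25 and N2 scores 0.25, giving an average of 0.25.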


<img src="./Images/Step 3 - Algorithm.png" style="width: 60%;">
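
Written as a single expression, the score computed by the code in this step can be summarized as follows. The symbols are introduced here for illustration only: $N$ is the set of nonseeds passed in, $P_n$ the parent seed groups of nonseed $n$, $w_n$ its weight, $d_{n,g}$ the stored `distance` value for the pair (described above as how close the nonseed and seed group are), and $I$ the interacting compounds of the other network.

$$
S \;=\; \frac{10^{9}}{|N|} \sum_{\substack{n \in N \\ |P_n| > 0}} \frac{1}{|P_n|} \sum_{g \in P_n} a_g \, d_{n,g} \, w_n ,
\qquad
a_g \;=\; \frac{|\,\mathrm{seeds}(g) \cap I\,|}{|\,\mathrm{seeds}(g)\,|}
$$

Nonseeds without parent seed groups add nothing to the sum but still count toward $|N|$.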

---


In [9]:
"""
Steps:
1) Set total score to 0
2) For all nonseeds in the affected network:
    - Extract nonseed weight
    - Find all parent seed groups
    - For all of that nonseed's parent seed groups
        - Multiply (% Affected) * (Nonseed Weight) * (1/Distance)
        - Divide by the number of parent seed groups
        - Add to total score
3) Divide total score by total number of nonseeds
4) Multiply by 1000000000 for legibility
5) Return total score and the DataFrame of compound-level details

**Note: In addition to the score, this version builds a DataFrame of compound-level
 details (one row per affected seed in each parent seed group).**

Returns:
 - A tuple: (total interaction score, DataFrame of compound-level details)

"""
# all_seedgroups & all_nonseeds == from network BEING affected
# interacting_nodes == from network AFFECTING
def step3(all_seedgroups, all_nonseeds, interacting_nodes, interaction_type):
    
    #-------------------------------------
    # For each seed group:
    # - Find % affected by EMO/complementation (i.e. overlap)
    
    # Note: this annotates the all_seedgroups dictionaries in place; both keys are overwritten on every call
    for seedgroup in all_seedgroups: #list of dicts
        
        affected_seeds = [c for c in seedgroup["seedCompoundIDs"] \
                         if c in interacting_nodes]
        
        percent_affected = len(affected_seeds) / len(seedgroup["seedCompoundIDs"])
        seedgroup["affected_seeds"] = affected_seeds
        seedgroup["percent_affected"] = percent_affected
        
    #-------------------------------------
    # For each nonseed:
    # - Find average % affected in their parent seed groups:
    # - Multiply (weight) * (average % affected parent seed groups)
    
    total_interaction_score = 0.0
    
    results_df = pd.DataFrame() #___CHANGED_____
    
    for nonseed in all_nonseeds:
        
        if len(nonseed["seedGroupPredecessors"]) > 0:
            
            nonseed_score = 0.0
            
            nonseed_rows = [] #___CHANGED_____
            
            for seedgroup in nonseed["seedGroupPredecessors"]:
                seedgroup_predecessor_ID = seedgroup["seedGroupID"]
                all_seedgroups_location = next((i for i, j in enumerate(all_seedgroups)\
                                                if seedgroup_predecessor_ID == j["seedGroupID"]), None)
                percent_affected = all_seedgroups[all_seedgroups_location]["percent_affected"]
                distance = seedgroup["distance"]

                #seedgroup_predecessor_impact = (1000000000 * percent_affected * distance * nonseed["weight"]) / len(all_nonseeds)
                seedgroup_predecessor_impact = (1000000000 * percent_affected * distance * nonseed["weight"])

                nonseed_score += seedgroup_predecessor_impact

                for affected_seed in all_seedgroups[all_seedgroups_location]["affected_seeds"]: #___CHANGED_____>>
                    
                    results_df_row = {"Nonseed - Compound ID": nonseed["compoundID"],\
                                       "Nonseed - Weight": nonseed["weight"],\
                                       "Seed Group Predecessor - ID": seedgroup_predecessor_ID,\
                                       "Seed Group Predecessor - Distance": distance,\
                                       "Seed Group Predecessor - % Affected": percent_affected,\
                                       "Seed Group Predecessor - Group Size": len(all_seedgroups[all_seedgroups_location]["seedCompoundIDs"]),\
                                       "Seed Group Predecessor - Affected Seed Compound ID": affected_seed}
                    nonseed_rows.append(results_df_row) #___CHANGED_____ <<
                
            norm_nonseed_score = nonseed_score / len(nonseed["seedGroupPredecessors"])
            total_interaction_score += norm_nonseed_score
            
            for results_df_row in nonseed_rows: #___CHANGED_____ >>
                results_df_row["Nonseed - Interaction Type"] = interaction_type
                results_df_row["Nonseed - Interaction Score"] = norm_nonseed_score
                results_df_row["Nonseed - Number of Seed Group Predecessors"] = len(nonseed["seedGroupPredecessors"])
                results_df_row["Total Number of Nonseeds"] = len(all_nonseeds)
                
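                # DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
                results_df = pd.concat([results_df, pd.DataFrame([results_df_row])], ignore_index = True) #___CHANGED_____ <<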
        
    if len(all_nonseeds) > 0:
        total_interaction_score = total_interaction_score / len(all_nonseeds) #___CHANGED_____
    
    return total_interaction_score, results_df #___CHANGED_____


<br><br><br><br><br><br><br><br><br>

<a id='Step4'></a>
## <span style="margin:auto;display:table">Step 4: Compute scores (overall, by-category, by-pathway)</span>

#### Required input: 
> ```combination```<br>


#### Overview of process:<br>

Compute overall, by-category, and by-pathway scores for each interaction type (BSS, EMO, MCI) based on the extracted input from Step 2 and the scoring algorithm from Step 3.
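
The by-category part of this step amounts to calling the scoring function once per category and per interaction type, then storing the results under the matching key. The sketch below shows that loop with a stand-in scorer; `add_category_scores` and `overlap_fraction` are hypothetical names, and the stand-in is much simpler than the Step 3 algorithm.

```python
def add_category_scores(network, interacting_nodes_by_type, score_fn):
    """Store one list of category scores per interaction type on the network dict."""
    for interaction_type, nodes in interacting_nodes_by_type.items():
        network[f"{interaction_type} - By-Category Score"] = [
            {"categoryName": category["categoryName"],
             "categoryScore": score_fn(category["categoryNonSeeds"], nodes)}
            for category in network["Nonseeds by Category"]
        ]
    return network

def overlap_fraction(nonseeds, nodes):
    """Stand-in scorer (assumption): fraction of nonseeds found among the nodes."""
    if not nonseeds:
        return 0.0
    return sum(n["compoundID"] in nodes for n in nonseeds) / len(nonseeds)

# Hypothetical toy symbiont with a single category.
symbiont = {"Nonseeds by Category": [
    {"categoryName": "Energy metabolism",
     "categoryNonSeeds": [{"compoundID": "C1"}, {"compoundID": "C2"}]},
]}

scored = add_category_scores(symbiont,
                             {"BSS": {"C1"}, "EMO": {"C1", "C2"}},
                             overlap_fraction)
print(scored["BSS - By-Category Score"])
print(scored["EMO - By-Category Score"])
```

The by-pathway scores follow the same pattern with one extra level of nesting for the pathways inside each category.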

---


In [5]:
"""
Steps:
1) Define inner function that extracts scores by category
2) Define inner function that extracts scores by pathway
3) For each symbiont in the combination
    - Get all "interacting nodes" for each interaction type (BSS, MCI, EMO)
    - For each interaction type
        - Compute overall score
        - Compute by-category score
        - Compute by-pathway score

Returns:
    - The combination data structure populated with scores and with 
      the information from step 2 removed (e.g., nodes, nonseeds, seeds, seed groups)
      in the form:
    
    {
    "Host Network": 
        {
        "Database" :
        "Organism Name" :
        "Organism ID" :
        "Organism Type" :
        "Network Filepath" :
        }
     "Obligate Symbiont Network": 
        {
        "Database" :
        "Organism Name" :
        "Organism ID" :
        "Organism Type" :
        "Network Filepath" :
        "BSS - Overall Score" :
        "MCI - Overall Score" :
        "EMO - Overall Score" :
        "BSS - By-Category Score" :            
            [
                {
                "categoryName" :
                "categoryScore"
                }
                ,...
            ]
        "MCI - By-Category Score" : same as "BSS - By-Category Score"
        "EMO - By-Category Score" : same as "BSS - By-Category Score"
        "BSS - By-Pathway Score" :
            [
                {
                "categoryName" :
                "pathways" : 
                    [
                        {
                        "pathwayName"
                        "pathwayID"
                        "pathwayScore"
                        }
                        ,...
                    ]
                }
            ]
        "MCI - By-Pathway Score" : same as "BSS - By-Pathway Score" 
        "EMO - By-Pathway Score" : same as "BSS - By-Pathway Score"
        }
     "Facultative Symbiont Network(s)": 
         [
            {
            "Database" :
            "Organism Name" :
            "Organism ID" :
            "Organism Type" :
            "Network Filepath" :
            "BSS - Overall Score" :
            "MCI - Overall Score" :
            "EMO - Overall Score" :
            "BSS - By-Category Score" :  same as obligate symbiont          
            "MCI - By-Category Score" :  same as obligate symbiont 
            "EMO - By-Category Score" :  same as obligate symbiont 
            "BSS - By-Pathway Score" : same as obligate symbiont 
            "MCI - By-Pathway Score" : same as obligate symbiont 
            "EMO - By-Pathway Score" : same as obligate symbiont 
            }
         ]
     }

"""


def step4(combination):

    #---------------------------
    # Returns a list of dictionaries in the form:
    # [
    #    {
    #    "categoryName" :
    #    "categoryScore" :
    #    }
    #    ,...
    # ]
    #
    def get_scores_by_category(interacting_nodes, affected_network, interaction_type):
        
        # Initialize list to hold scores for each category
        by_category_scores = []
        
        # Initialize DataFrame to hold specific compound-level details for each category
        by_category_dfs = pd.DataFrame() #_____CHANGED_____
        
        # Iterate over categories
        for category_dict in affected_network["Nonseeds by Category"]:
            
            # Get nonseeds
            category_nonseeds = category_dict["categoryNonSeeds"]
            
            # Compute score
            by_category_score, by_category_df = step3(interacting_nodes = interacting_nodes, # ____CHANGED_____
                                                     interaction_type = interaction_type,\
                                                     all_seedgroups = affected_network["All SeedGroups"],\
                                                     all_nonseeds = category_nonseeds)
            
            #----------
            # Create temporary dictionary to hold category info and score
            by_category_score_dict = {"categoryName": category_dict["categoryName"],\
                                      "categoryScore": by_category_score}
            
            # Append to final list of scores
            by_category_scores.append(by_category_score_dict)
            
            #----------
            # Modify by_category_df to include categoryName, categoryScore, and organism identifiers
            by_category_df["Category Name"] = category_dict["categoryName"] #____CHANGED_____>>
            by_category_df["Category Score"] = by_category_score 
            by_category_df["Affected Organism ID"] = affected_network["Organism ID"]
            by_category_df["Affected Organism Name"] = affected_network["Organism Name"]
            by_category_df["Affected Organism Source"] = affected_network["Database"] #____CHANGED_____<<
            # Don't need to add information about affecting ones, because each combination will have its own file
            
            # Append to final list of DataFrames
            # DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
            by_category_dfs = pd.concat([by_category_dfs, by_category_df]) #_____CHANGED_____
            #-----------
        
        # Return final list of scores
        return by_category_scores, by_category_dfs #_____CHANGED_____
    
    #---------------------------
    # Returns a list of dictionaries in the form:
    # [
    #    {
    #    "categoryName" :
    #    "pathways" : 
    #        [
    #            {
    #            "pathwayName"
    #            "pathwayID"
    #            "pathwayScore"
    #            }
    #            ,...
    #        ]
    #    }
    #    ,...
    # ]
    #
    def get_scores_by_pathway(interacting_nodes, affected_network, interaction_type):
        
        # Initialize list to hold scores for each category
        by_pathway_scores = []
        
        # Initialize DataFrame to hold specific compound-level details for each pathway
        by_pathway_dfs = pd.DataFrame() #_____CHANGED_______
        
        # Iterate over categories
        for category_dict in affected_network["Nonseeds by Pathway"]:
            
            # Create temporary dictionary to hold category info and score
            temp_dict1 = {"categoryName": category_dict["categoryName"],\
                          "pathways": []}
            
            # Iterate over pathways
            for pathway_dict in category_dict["pathways"]:
            
                # Get nonseeds
                pathway_nonseeds = pathway_dict["pathwayNonSeeds"]

                # Compute score
                by_pathway_score, by_pathway_df = step3(interacting_nodes = interacting_nodes, #_____CHANGED______
                                                         interaction_type = interaction_type,\
                                                         all_seedgroups = affected_network["All SeedGroups"],\
                                                         all_nonseeds = pathway_nonseeds)

                #-------
                # Create temporary dictionary to hold pathway info and score
                by_pathway_score_dict = {"pathwayName": pathway_dict["pathwayName"],\
                                         "pathwayID": pathway_dict["pathwayID"],\
                                         "pathwayScore": by_pathway_score}

                # Append to temporary category dict
                temp_dict1["pathways"].append(by_pathway_score_dict)
                
                #-------
                by_pathway_df["Pathway Name"] = pathway_dict["pathwayName"]#____CHANGED____>>
                by_pathway_df["Pathway ID"] = pathway_dict["pathwayID"]
                by_pathway_df["Pathway Score"] = by_pathway_score
                by_pathway_df["Category Name"] = category_dict["categoryName"]
                by_pathway_df["Affected Organism ID"] = affected_network["Organism ID"]
                by_pathway_df["Affected Organism Name"] = affected_network["Organism Name"]
                by_pathway_df["Affected Organism Source"] = affected_network["Database"] #____CHANGED____<<
                # Don't need to add information about affecting ones, because each combination will have its own file
            
                # (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
                by_pathway_dfs = pd.concat([by_pathway_dfs, by_pathway_df]) #____CHANGED____
                #--------
            
            # Append to final dict
            by_pathway_scores.append(temp_dict1)
            
        # Return final list of scores
        return by_pathway_scores, by_pathway_dfs #____CHANGED____
    
    #-----------------------------
    # Combine all symbionts together
    symbionts = combination["Facultative Symbiont Network(s)"] + [combination["Obligate Symbiont Network"]]

    #---------------------------------
    # Make NetInteract Results Directory (no error if the directory already exists)
    all_IDs = "_".join([combination["Host Network"]["Organism ID"]] + [i["Organism ID"] for i in symbionts])
    compoundlevel_folderpath = "./NetInteract Results - Compound-Level/With Obligate/" + all_IDs + "/"
    os.makedirs(compoundlevel_folderpath, exist_ok = True)
    #---------------------------------
    
    # Iterate over symbionts
    for symbiont in symbionts:

        host = combination["Host Network"]
        other_symbionts = [s for s in symbionts if s != symbiont]
        
        # Extract interacting nodes for BSS
        # (i.e., all host compound IDs)
        BSS_interacting_nodes = host["All Nodes"]

        # Extract interacting nodes for MCI
        # (i.e., combine all nonseed compound IDs from other symbionts)
        MCI_interacting_nodes = [nonseed_dict["compoundID"] for sym in other_symbionts \
                                 for nonseed_dict in sym["All NonSeeds"]]

        # Extract interacting nodes for EMO
        # (i.e., combine all seed compound IDs from other symbionts)
        EMO_interacting_nodes = [seed_ID for sym in other_symbionts for seed_ID in sym["All Seeds"]]    
    
        
        #------------------
        # Initialize compound-level CSV files
        
        # OVERALL
        
        overall_df_filepath = compoundlevel_folderpath  + "Overall Effect on " + symbiont["Organism ID"] + ".csv"
        overall_columns = ["Affected Organism Name", "Affected Organism Source", "Affected Organism ID",
                          "Overall Score", "Total Number of Nonseeds", "Nonseed - Interaction Type",
                          "Nonseed - Compound ID", "Nonseed - Interaction Score", "Nonseed - Weight",
                          "Nonseed - Number of Seed Group Predecessors", "Seed Group Predecessor - ID",
                          "Seed Group Predecessor - % Affected", "Seed Group Predecessor - Distance",
                          "Seed Group Predecessor - Group Size", "Seed Group Predecessor - Affected Seed Compound ID"]
        
        overall_template = pd.DataFrame(columns = overall_columns)
        overall_template.to_csv(overall_df_filepath, index=False)
        
        # BY-CATEGORY
        by_category_df_filepath = compoundlevel_folderpath  + "By-Category Effect on " + symbiont["Organism ID"] + ".csv"#____CHANGED____ 
        by_category_columns = ["Affected Organism Name", "Affected Organism Source", "Affected Organism ID",
                              "Category Name", "Category Score", "Total Number of Nonseeds", "Nonseed - Interaction Type",
                              "Nonseed - Compound ID", "Nonseed - Interaction Score", "Nonseed - Weight",
                              "Nonseed - Number of Seed Group Predecessors", "Seed Group Predecessor - ID",
                              "Seed Group Predecessor - % Affected", "Seed Group Predecessor - Distance",
                              "Seed Group Predecessor - Group Size", "Seed Group Predecessor - Affected Seed Compound ID"]
            
        by_category_template = pd.DataFrame(columns = by_category_columns)
        by_category_template.to_csv(by_category_df_filepath, index=False)
        
        # BY-PATHWAY
        by_pathway_df_filepath = compoundlevel_folderpath  + "By-Pathway Effect on " + symbiont["Organism ID"] + ".csv"#____CHANGED____ 
        by_pathway_columns = ["Affected Organism Name", "Affected Organism Source", "Affected Organism ID",
                              "Category Name", "Pathway Name", "Pathway ID", "Pathway Score",
                              "Total Number of Nonseeds", "Nonseed - Interaction Type",
                              "Nonseed - Compound ID", "Nonseed - Interaction Score", "Nonseed - Weight",
                              "Nonseed - Number of Seed Group Predecessors", "Seed Group Predecessor - ID",
                              "Seed Group Predecessor - % Affected", "Seed Group Predecessor - Distance",
                              "Seed Group Predecessor - Group Size", "Seed Group Predecessor - Affected Seed Compound ID"]
        
        by_pathway_template = pd.DataFrame(columns = by_pathway_columns)
        by_pathway_template.to_csv(by_pathway_df_filepath, index=False)
        #----------------------
        
        # Put interaction type and nodes into dictionary
        interaction_type_and_nodes = {"BSS": BSS_interacting_nodes,\
                                     "MCI": MCI_interacting_nodes,\
                                     "EMO": EMO_interacting_nodes}
        
        #------------------
        # Iterate over each interaction type
        for interaction_type, interacting_nodes in interaction_type_and_nodes.items():
            
            #----------------
            # Compute overall score
            overall_score, overall_df = step3(interacting_nodes = interacting_nodes, #____CHANGED____
                                             interaction_type = interaction_type,\
                                             all_seedgroups = symbiont["All SeedGroups"],\
                                             all_nonseeds = symbiont["All NonSeeds"])
            
            symbiont[interaction_type + " - Overall Score"] = overall_score
            
            # Add more information to DataFrame
            overall_df["Affected Organism ID"] = symbiont["Organism ID"] #____CHANGED____ >>
            overall_df["Affected Organism Name"] = symbiont["Organism Name"]
            overall_df["Affected Organism Source"] = symbiont["Database"]
            overall_df["Overall Score"] = overall_score
            # Don't need to add information about affecting ones, because each combination will have its own file
            
            # Send compound-level DataFrame to file so system isn't overwhelmed
            overall_df[overall_columns].to_csv(overall_df_filepath, mode = 'a', index=False, header=False)#____CHANGED____<<
            
            #----------------
            # Compute by-category scores
            by_category_score, by_category_dfs = get_scores_by_category(interacting_nodes = interacting_nodes,#____CHANGED____ 
                                                                       affected_network = symbiont,\
                                                                       interaction_type = interaction_type)

            symbiont[interaction_type + " - By-Category Score"] = by_category_score #____CHANGED____ 

            # Send compound-level DataFrame to file so system isn't overwhelmed
            by_category_dfs[by_category_columns].to_csv(by_category_df_filepath, mode = 'a', index=False, header=False)#____CHANGED____ 
            
            #----------------
            # Compute by-pathway scores
            by_pathway_score, by_pathway_dfs = get_scores_by_pathway(interacting_nodes = interacting_nodes, #____CHANGED____ 
                                                                   affected_network = symbiont,\
                                                                   interaction_type = interaction_type)

            symbiont[interaction_type + " - By-Pathway Score"] = by_pathway_score

            # Send compound-level DataFrame to file so system isn't overwhelmed
            by_pathway_dfs[by_pathway_columns].to_csv(by_pathway_df_filepath, mode = 'a', index=False, header=False) #____CHANGED____ 
            
            
    #-------------------------
    # Remove seeds/nonseeds/seedgroups/nodes information before sending to JSON files
    del combination["Host Network"]["All Nodes"]
    del combination["Obligate Symbiont Network"]["All Seeds"]
    del combination["Obligate Symbiont Network"]["All SeedGroups"]
    del combination["Obligate Symbiont Network"]["All NonSeeds"]
    del combination["Obligate Symbiont Network"]["Nonseeds by Category"]
    del combination["Obligate Symbiont Network"]["Nonseeds by Pathway"]

    for facultative_symbiont in combination["Facultative Symbiont Network(s)"]:
        del facultative_symbiont["All Seeds"]
        del facultative_symbiont["All SeedGroups"]
        del facultative_symbiont["All NonSeeds"]
        del facultative_symbiont["Nonseeds by Category"]
        del facultative_symbiont["Nonseeds by Pathway"]
        
    #-------------------------
    return combination
    

<br><br><br><br><br><br><br><br><br>

<a id='Step5'></a>
## <span style="margin:auto;display:table">Step 5: Compile result files</span>

#### Required input: 
> ```combination```<br>
> ```step4_output```<br>

#### Overview of process:<br>

For each combination of networks, create rows which will be appended to DataFrames in the full pipeline step.
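
As a minimal sketch of how a single by-category row accumulates one score column per interaction type, the toy example below merges per-interaction scores into rows keyed by category name. The IDs, category name, and scores are made-up placeholders, and `merge_category_rows` is a hypothetical helper used only for illustration; it is not part of the pipeline.

```python
# Toy illustration only: all names and scores below are hypothetical placeholders.
def merge_category_rows(template, scores_by_interaction):
    """Build {categoryName: row}; each row gets one score column per interaction type."""
    rows = {}
    for interaction_type, category_scores in scores_by_interaction.items():
        for category_name, score in category_scores.items():
            # Start from the shared identifier columns the first time a category is seen
            row = rows.setdefault(category_name, {**template, "Category Name": category_name})
            row["Obligate Symbiont - " + interaction_type] = score
    return rows

template = {"Host - ID": "host_1", "Obligate Symbiont - ID": "ob_1"}
scores = {"BSS": {"Amino acid metabolism": 1.5},
          "MCI": {"Amino acid metabolism": 2.0},
          "EMO": {"Amino acid metabolism": 0.5}}

rows = merge_category_rows(template, scores)
print(rows["Amino acid metabolism"])
```

Keying the rows by category name is what lets the BSS, MCI, and EMO passes land in the same row rather than producing three separate rows per category.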

---


In [6]:
"""
Steps:
1) Make a template for all CSV rows with information on the host, obligate, and facultative networks 
   (e.g., Name, ID, Source)
2) Make a row for the overall DataFrame with the overall results for that community
3) Make a row for the by-category DataFrame with the overall results for that community, 
   broken down by category
4) Make a row for the by-pathway DataFrame with the overall results for that community,
   broken down by pathway

Returns:
    - A single row for the overall DataFrame (overall_row)
    - A dictionary of rows for the by-category DataFrame (category_rows)
    - A dictionary of rows for the by-pathway DataFrame (pathway_rows)

"""

def step5(combination, step4_output):

    #-------------------------
    # Make template row for all CSV files
    csv_row_template = {"Host - ID" : combination["Host Network"]["Organism ID"],\
                         "Host - Name" : combination["Host Network"]["Organism Name"],\
                         "Host - Source" : combination["Host Network"]["Database"],\
                         "Obligate Symbiont - ID" : combination["Obligate Symbiont Network"]["Organism ID"],\
                         "Obligate Symbiont - Name" : combination["Obligate Symbiont Network"]["Organism Name"],\
                         "Obligate Symbiont - Source" : combination["Obligate Symbiont Network"]["Database"],\
                         "Facultative Symbiont 1 - ID" : combination["Facultative Symbiont Network(s)"][0]["Organism ID"],\
                         "Facultative Symbiont 1 - Name" : combination["Facultative Symbiont Network(s)"][0]["Organism Name"],\
                         "Facultative Symbiont 1 - Source" : combination["Facultative Symbiont Network(s)"][0]["Database"]}

    if len(combination["Facultative Symbiont Network(s)"]) == 2:

        csv_row_template["Facultative Symbiont 2 - ID"] = combination["Facultative Symbiont Network(s)"][1]["Organism ID"]
        csv_row_template["Facultative Symbiont 2 - Name"] = combination["Facultative Symbiont Network(s)"][1]["Organism Name"]
        csv_row_template["Facultative Symbiont 2 - Source"] = combination["Facultative Symbiont Network(s)"][1]["Database"]

    #-------------------------
    # Make row for Overall CSV file
    overall_row = {}
    overall_row.update(csv_row_template)

    for interaction_type in ["BSS", "MCI", "EMO"]:
        overall_row["Obligate Symbiont - " + interaction_type] = combination["Obligate Symbiont Network"][interaction_type + " - Overall Score"]
        overall_row["Facultative Symbiont 1 - " + interaction_type] = combination["Facultative Symbiont Network(s)"][0][interaction_type + " - Overall Score"]
        if len(combination["Facultative Symbiont Network(s)"]) == 2:
            overall_row["Facultative Symbiont 2 - " + interaction_type] = combination["Facultative Symbiont Network(s)"][1][interaction_type + " - Overall Score"]

    #-------------------------
    # Make rows for By-Category CSV file

    # {CategoryName:row}
    category_rows = {}

    for interaction_type in ["BSS", "MCI", "EMO"]:

        for category_dict in combination["Obligate Symbiont Network"][interaction_type + " - By-Category Score"]:

            category_row = {}
            category_row.update(csv_row_template) 
            category_row["Category Name"] = category_dict["categoryName"]

            category_row["Obligate Symbiont - " + interaction_type] = category_dict["categoryScore"]

            fac_1_location = next((i for i, j in enumerate(combination["Facultative Symbiont Network(s)"][0][interaction_type + " - By-Category Score"])
                                   if category_dict["categoryName"] == j["categoryName"]), None)
            fac_1_category_dict = combination["Facultative Symbiont Network(s)"][0][interaction_type + " - By-Category Score"][fac_1_location]
            category_row["Facultative Symbiont 1 - " + interaction_type] = fac_1_category_dict["categoryScore"]

            if len(combination["Facultative Symbiont Network(s)"]) == 2:

                fac_2_location = next((i for i, j in enumerate(combination["Facultative Symbiont Network(s)"][1][interaction_type + " - By-Category Score"])
                                   if category_dict["categoryName"] == j["categoryName"]), None)
                fac_2_category_dict = combination["Facultative Symbiont Network(s)"][1][interaction_type + " - By-Category Score"][fac_2_location]
                category_row["Facultative Symbiont 2 - " + interaction_type] = fac_2_category_dict["categoryScore"]

            if category_row["Category Name"] not in category_rows.keys():
                category_rows[category_row["Category Name"]] = category_row

            else:
                category_rows[category_row["Category Name"]].update(category_row)

    #-------------------------
    # Make rows for By-Pathway CSV file

    pathway_rows = {}

    for interaction_type in ["BSS", "MCI", "EMO"]:

        for category_dict in combination["Obligate Symbiont Network"][interaction_type + " - By-Pathway Score"]:

            fac_1_cat_location = next((i for i, j in enumerate(combination["Facultative Symbiont Network(s)"][0][interaction_type + " - By-Pathway Score"])
                                       if category_dict["categoryName"] == j["categoryName"]), None)
            fac_1_category_dict = combination["Facultative Symbiont Network(s)"][0][interaction_type + " - By-Pathway Score"][fac_1_cat_location]

            if len(combination["Facultative Symbiont Network(s)"]) == 2:
                fac_2_cat_location = next((i for i, j in enumerate(combination["Facultative Symbiont Network(s)"][1][interaction_type + " - By-Pathway Score"])
                                       if category_dict["categoryName"] == j["categoryName"]), None)
                fac_2_category_dict = combination["Facultative Symbiont Network(s)"][1][interaction_type + " - By-Pathway Score"][fac_2_cat_location]

            for pathway_dict in category_dict["pathways"]:

                pathway_row = {}
                pathway_row["Category Name"] = category_dict["categoryName"]
                pathway_row["Pathway Name"] = pathway_dict["pathwayName"]
                pathway_row["Pathway ID"] = pathway_dict["pathwayID"]
                pathway_row.update(csv_row_template) 

                pathway_row["Obligate Symbiont - " + interaction_type] = pathway_dict["pathwayScore"]

                fac_1_path_location = next((i for i, j in enumerate(fac_1_category_dict["pathways"])
                                           if pathway_dict["pathwayName"] == j["pathwayName"]), None)
                fac_1_pathway_dict = fac_1_category_dict["pathways"][fac_1_path_location]
                pathway_row["Facultative Symbiont 1 - " + interaction_type] = fac_1_pathway_dict["pathwayScore"]

                if len(combination["Facultative Symbiont Network(s)"]) == 2:
                    fac_2_path_location = next((i for i, j in enumerate(fac_2_category_dict["pathways"])
                                           if pathway_dict["pathwayName"] == j["pathwayName"]), None)
                    fac_2_pathway_dict = fac_2_category_dict["pathways"][fac_2_path_location]
                    pathway_row["Facultative Symbiont 2 - " + interaction_type] = fac_2_pathway_dict["pathwayScore"]


                if pathway_row["Pathway Name"] not in pathway_rows.keys():
                    pathway_rows[pathway_row["Pathway Name"]] = pathway_row

                else:
                    pathway_rows[pathway_row["Pathway Name"]].update(pathway_row)


    #------------------------
    # Return
    return overall_row, category_rows, pathway_rows


<br><br><br><br><br><br><br><br><br>

<a id='Full_Pipeline'></a>
## <span style="margin:auto;display:table">Full Pipeline</span>

#### Required input: 
> N/A <br>


#### Overview of process:<br>

Complete all above steps and send results to final files.
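
The pipeline writes each result file incrementally: the header is written once, and each combination's rows are then appended. The sketch below shows that pattern in isolation, assuming a hypothetical three-column layout and a temporary output path; the real column lists and file paths are the ones defined in the cell that follows.

```python
import os
import tempfile
import pandas as pd

# Hypothetical column order and rows, for illustration only
column_names = ["Host - ID", "Facultative Symbiont 1 - ID", "Facultative Symbiont 2 - ID"]
results_path = os.path.join(tempfile.mkdtemp(), "Overall Results.csv")

# 1) Write the header once
pd.DataFrame(columns=column_names).to_csv(results_path, index=False)

# 2) Append each combination's rows without repeating the header
batches = [
    # Community with 1 facultative symbiont
    [{"Host - ID": "host_1", "Facultative Symbiont 1 - ID": "fac_A"}],
    # Community with 2 facultative symbionts
    [{"Host - ID": "host_1", "Facultative Symbiont 1 - ID": "fac_A",
      "Facultative Symbiont 2 - ID": "fac_B"}],
]
for batch_rows in batches:
    # reindex fixes the column order; columns missing from a batch become NaN
    batch_df = pd.DataFrame(batch_rows).reindex(columns=column_names)
    batch_df.to_csv(results_path, mode="a", index=False, header=False)

compiled = pd.read_csv(results_path)
print(compiled)
```

Re-indexing before every append keeps communities with 1 and 2 facultative symbionts aligned to the same columns, and only one combination's rows are held in memory at a time.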

---


In [7]:
"""
Steps:
1) Get all combinations of networks
2) Extract required input
3) Define scoring algorithm
4) Compute scores (overall, by-category, by-pathway)
5) Compile result files

Note: Doesn't return anything.

"""

def full_pipeline():
    
    # Get all combinations of communities
    combinations = step1()
    
    # Initialize tracker
    i = 1
    
    #-----
    # Make NetInteract Results Directories (no error if they already exist)
    os.makedirs("./NetInteract Results - Compiled/With Obligate/", exist_ok = True)
    os.makedirs("./NetInteract Results - Compiled/Without Obligate/", exist_ok = True)
    
    #----------
    # Initialize column order
    overall_column_names = ["Host - ID","Host - Name", "Host - Source",\
               "Obligate Symbiont - ID", "Obligate Symbiont - Name", "Obligate Symbiont - Source",\
               "Obligate Symbiont - BSS", "Obligate Symbiont - MCI", "Obligate Symbiont - EMO",\
               "Facultative Symbiont 1 - ID", "Facultative Symbiont 1 - Name", "Facultative Symbiont 1 - Source",\
               "Facultative Symbiont 1 - BSS", "Facultative Symbiont 1 - MCI", "Facultative Symbiont 1 - EMO",\
               "Facultative Symbiont 2 - ID", "Facultative Symbiont 2 - Name", "Facultative Symbiont 2 - Source",\
               "Facultative Symbiont 2 - BSS", "Facultative Symbiont 2 - MCI", "Facultative Symbiont 2 - EMO"]
        
    category_column_names = ["Category Name"] + overall_column_names
    pathway_column_names = ["Category Name", "Pathway Name", "Pathway ID"] + overall_column_names
        
    # OVERALL
    overall_template = pd.DataFrame(columns = overall_column_names)
    overall_template.to_csv("./NetInteract Results - Compiled/With Obligate/Overall Results.csv", index=False)

    # BY-CATEGORY
    by_category_template = pd.DataFrame(columns = category_column_names)
    by_category_template.to_csv("./NetInteract Results - Compiled/With Obligate/By-Category Results.csv", index=False)

    # BY-PATHWAY
    by_pathway_template = pd.DataFrame(columns = pathway_column_names)
    by_pathway_template.to_csv("./NetInteract Results - Compiled/With Obligate/By-Pathway Results.csv", index=False)

    #-------------------
    # Iterate over combinations
    for combination in combinations:
        
        # Capture the number of facultative symbionts
        num_facultative = len(combination["Facultative Symbiont Network(s)"])
        
        # Capture time
        start_time = time.time()
        
        # Print progress
        print("Working on combination " + str(i) + " of " + str(len(combinations)) + " combinations.")
        
        # Increment tracker
        i += 1
        
        # Extract required input for scores
        step2_output = step2(combination)
        
        # Compute scores
        # (Step 3 included in Step 4)
        step4_output = step4(combination = step2_output)
        
        #--------------------------
        # Build this combination's result rows
        overall_row, category_rows, pathway_rows = step5(combination = combination, step4_output = step4_output)
        
        # Collect rows in lists and build each DataFrame once
        # (DataFrame.append was removed in pandas 2.0)
        overall_records = [overall_row]
        category_records = []
        pathway_records = []
        
        #-----
        # Add swapped row to Overall records
        if num_facultative == 2:
            
            # Swap Facultative Symbionts 1 and 2 and add new row, so that it's mirrored
            swapped_row = {k.replace("1","3").replace("2","1").replace("3","2"): v for k, v in overall_row.items()}
            overall_records.append(swapped_row)
           
        #-----
        # Collect By-Category records
        for category_row in category_rows.values():
            category_records.append(category_row)
            
            if num_facultative == 2:
            
                # Swap Facultative Symbionts 1 and 2 and add new row, so that it's mirrored
                swapped_row = {k.replace("1","3").replace("2","1").replace("3","2"): v for k, v in category_row.items()}
                category_records.append(swapped_row)
            
        #-----
        # Collect By-Pathway records
        for pathway_row in pathway_rows.values():
            pathway_records.append(pathway_row)

            if num_facultative == 2:

                # Swap Facultative Symbionts 1 and 2 and add new row, so that it's mirrored
                swapped_row = {k.replace("1","3").replace("2","1").replace("3","2"): v for k, v in pathway_row.items()}
                pathway_records.append(swapped_row)

        #-----
        # Build the DataFrames from the collected rows
        overall_df = pd.DataFrame(overall_records)
        category_df = pd.DataFrame(category_records)
        pathway_df = pd.DataFrame(pathway_records)

        #------------SEND RESULTS TO CSV FILES------------
        
        # Re-index so that combinations with 1 or 2 facultative symbionts have the same columns
        overall_df = overall_df.reindex(columns = overall_column_names)
        category_df = category_df.reindex(columns = category_column_names)
        pathway_df = pathway_df.reindex(columns = pathway_column_names)
        
        overall_df.to_csv("./NetInteract Results - Compiled/With Obligate/Overall Results.csv", index = False,\
                         columns = overall_column_names, mode = 'a', header=False)
    
        category_df.to_csv("./NetInteract Results - Compiled/With Obligate/By-Category Results.csv", index = False,\
                          columns = category_column_names, mode = 'a', header=False)

        pathway_df.to_csv("./NetInteract Results - Compiled/With Obligate/By-Pathway Results.csv", index = False,\
                         columns = pathway_column_names, mode = 'a', header=False)

            
        #------------
        end_time = time.time()
           
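        # Report elapsed time for this combination
        print("Finished in " + str(round(end_time - start_time, 2)) + " seconds (" + str(num_facultative) + " facultative symbiont(s)).")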
     
    
    

In [None]:
%time full_pipeline()

<br><br><br><br><br><br><br><br><br>

<a id='Remove_Obligate'></a>
## <span style="margin:auto;display:table">Extracting Unique Facultative-Facultative Influence from Results</span>
### <span style="margin:auto;display:table">(Removing Obligate Influence)</span>

#### Required input: 
> N/A <br>

#### Overview of process:<br>

Using the files in the folder "NetInteract Results - Compiled/With Obligate/" (i.e., "Overall Results.csv", "By-Category Results.csv", and "By-Pathway Results.csv"), we extract the unique influence of the facultative symbionts on each other. In the original result files, the influence of the obligate symbiont is included in the final scores; to isolate strictly facultative-facultative interactions, we need to remove that obligate influence. The benefit of first scoring the whole community and then removing this influence (as opposed to computing pair-wise scores) is that we can find facultative-facultative interactions that are unique to that relationship and are not present in the obligate-facultative relationship.  

To do this, we first find community structures with 2 facultative symbionts, so that each community contains 1 obligate symbiont and 2 facultative symbionts. Then, for each of those facultative symbionts, we find the community structure that contains only it and the same obligate symbiont. 

For the overall score and for each category and pathway score, we take the MCI and EMO scores from the 3 community structures (i.e., the one 1 obligate/2 facultative structure and the two 1 obligate/1 facultative structures). From each facultative symbiont's score in the full 1 obligate/2 facultative structure, we subtract its corresponding score from the 1 obligate/1 facultative structure. 
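
The matching-and-subtracting step just described can be sketched with a left merge. This is an illustrative alternative to a row-by-row loop, not the notebook's own implementation; the IDs and scores are made-up placeholders, and it assumes at most one 1 obligate/1 facultative row per (obligate, facultative) pair.

```python
import pandas as pd

# Hypothetical toy scores; IDs and values are placeholders
two_fac = pd.DataFrame({
    "Obligate Symbiont - ID": ["ob_1"],
    "Facultative Symbiont 1 - ID": ["fac_A"],
    "Facultative Symbiont 1 - MCI": [10.0],
    "Facultative Symbiont 2 - ID": ["fac_B"],
    "Facultative Symbiont 2 - MCI": [7.0],
})
one_fac = pd.DataFrame({
    "Obligate Symbiont - ID": ["ob_1", "ob_1"],
    "Facultative Symbiont 1 - ID": ["fac_A", "fac_B"],
    "Facultative Symbiont 1 - MCI": [4.0, 2.5],
})

without_obligate = two_fac.copy()
for slot in ["1", "2"]:
    fac_id = "Facultative Symbiont " + slot + " - ID"
    score = "Facultative Symbiont " + slot + " - MCI"

    # Rename the 1-facultative columns so they line up with the slot being processed
    baseline = one_fac.rename(columns={"Facultative Symbiont 1 - ID": fac_id,
                                       "Facultative Symbiont 1 - MCI": "baseline"})

    # A left merge keeps the order of the 2-facultative rows; unmatched pairs give NaN
    merged = two_fac[["Obligate Symbiont - ID", fac_id]].merge(
        baseline, on=["Obligate Symbiont - ID", fac_id], how="left")

    without_obligate[score] = two_fac[score].to_numpy() - merged["baseline"].to_numpy()

print(without_obligate[["Facultative Symbiont 1 - MCI", "Facultative Symbiont 2 - MCI"]])
```

A pair with no matching 1 obligate/1 facultative community produces NaN rather than an unadjusted score, which makes missing baselines easy to spot in the output.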

---

<br>
Here is a practical example. Say these are the MCI and EMO scores for a 1 obligate/2 facultative community structure:

|Obligate Symbiont - Name|Facultative Symbiont 1 - Name|Facultative Symbiont 1 - MCI|Facultative Symbiont 1 - EMO|Facultative Symbiont 2 - Name|Facultative Symbiont 2 - MCI|Facultative Symbiont 2 - EMO|
|:-:|:-:|:-:|:-:|:-:|:-:|:-:|
|Buchnera Tuc7|Serratia|249771.5858|670045.1481|Regiella|622863.5923|920253.6706|

<br><br>
For each of the 2 facultative symbionts, here are the scores from its 1 obligate/1 facultative community structure:

|Obligate Symbiont - Name|Facultative Symbiont 1 - Name|Facultative Symbiont 1 - MCI|Facultative Symbiont 1 - EMO|
|:-:|:-:|:-:|:-:|
|Buchnera Tuc7|Serratia|96695.74654|276363.4575|

|Obligate Symbiont - Name|Facultative Symbiont 1 - Name|Facultative Symbiont 1 - MCI|Facultative Symbiont 1 - EMO|
|:-:|:-:|:-:|:-:|
|Buchnera Tuc7|Regiella|267033.984|450878.3158|

<br><br>
We subtract each of these from the corresponding original score:

|Obligate Symbiont - Name|Facultative Symbiont 1 - Name|Facultative Symbiont 1 - MCI|Facultative Symbiont 1 - EMO|Facultative Symbiont 2 - Name|Facultative Symbiont 2 - MCI|Facultative Symbiont 2 - EMO|
|:-:|:-:|:-:|:-:|:-:|:-:|:-:|
|Buchnera Tuc7|Serratia|249771.5858 - 96695.74654|670045.1481 - 276363.4575|Regiella|622863.5923 - 267033.984|920253.6706 - 450878.3158|

<br><br>
This evaluates to the following final scores:

|Obligate Symbiont - Name|Facultative Symbiont 1 - Name|Facultative Symbiont 1 - MCI|Facultative Symbiont 1 - EMO|Facultative Symbiont 2 - Name|Facultative Symbiont 2 - MCI|Facultative Symbiont 2 - EMO|
|:-:|:-:|:-:|:-:|:-:|:-:|:-:|
|Buchnera Tuc7|Serratia|153075.8393|393681.6906|Regiella|355829.6083|469375.3548|
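
The arithmetic in the tables above can be checked with a few lines of Python. The scores are copied from the example; the dictionary names are hypothetical and exist only for this check.

```python
# Scores copied from the worked example above, as (MCI, EMO) pairs
with_both = {"Serratia": (249771.5858, 670045.1481),
             "Regiella": (622863.5923, 920253.6706)}
obligate_only = {"Serratia": (96695.74654, 276363.4575),
                 "Regiella": (267033.984, 450878.3158)}

# Subtract the 1 obligate/1 facultative score from the 1 obligate/2 facultative score
unique_influence = {
    name: tuple(round(full - base, 2)
                for full, base in zip(with_both[name], obligate_only[name]))
    for name in with_both
}

for name, (mci, emo) in unique_influence.items():
    print(name, "MCI:", mci, "EMO:", emo)
```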

---

We then send the resulting files to the folder "NetInteract Results - Compiled/Without Obligate/".


---


In [None]:
def extract_unique_facultative_influence():
    
    #---------------------------------------------
    # Read in DataFrames
    
    # Read in overall DataFrame - Scores include influence from obligate symbionts
    overall_df = pd.read_csv("./NetInteract Results - Compiled/With Obligate/Overall Results.csv")
    
    #-------------
    # Create another DataFrame - Scores DO NOT include influence from obligate symbionts
    # (We use the communities with only 1 facultative symbiont to subtract out the influence
    # of the obligate symbiont on that 1 facultative symbiont)
    overall_df_1fac = overall_df[overall_df["Facultative Symbiont 2 - ID"].isnull()].copy()
    overall_df_no_ob = overall_df[overall_df["Facultative Symbiont 2 - ID"].notnull()].copy()
    
    # Iterate over the communities with 2 facultative symbionts

    for count, row in overall_df_no_ob.iterrows():
        
        for column in ["Facultative Symbiont 1 - MCI", "Facultative Symbiont 1 - EMO"]:
            
            same_fac_ID_mask = overall_df_1fac["Facultative Symbiont 1 - ID"] == row["Facultative Symbiont 1 - ID"]
            same_ob_ID_mask = overall_df_1fac["Obligate Symbiont - ID"] == row["Obligate Symbiont - ID"]
            
            value_to_subtract = overall_df_1fac.loc[same_fac_ID_mask & same_ob_ID_mask][column].mean()
            
            overall_df_no_ob.at[count,column] = row[column] - value_to_subtract
            
        for column in ["Facultative Symbiont 2 - MCI", "Facultative Symbiont 2 - EMO"]:
            
            same_fac_ID_mask = overall_df_1fac["Facultative Symbiont 1 - ID"] == row["Facultative Symbiont 2 - ID"]
            same_ob_ID_mask = overall_df_1fac["Obligate Symbiont - ID"] == row["Obligate Symbiont - ID"]
            
            value_to_subtract = overall_df_1fac.loc[same_fac_ID_mask & same_ob_ID_mask][column.replace("2", "1")].mean()
            
            overall_df_no_ob.at[count,column] = row[column] - value_to_subtract
            
    
    overall_df_no_ob.to_csv("./NetInteract Results - Compiled/Without Obligate/Overall Results (Without Obligate).csv", index=False)
         
    #-------------
    # Do the same thing for the by-category dataframe
    # Read in DataFrame - Scores include influence from obligate symbionts
    bycategory_df = pd.read_csv("./NetInteract Results - Compiled/With Obligate/By-Category Results.csv")

    #Create another DataFrame - Scores DO NOT include influence from obligate symbionts
    # (We use the communities with only 1 facultative symbiont to subtract out the influence
    # of the obligate symbiont on that 1 facultative symbiont)
    bycategory_df_1fac = bycategory_df[bycategory_df["Facultative Symbiont 2 - ID"].isnull()].copy()
    bycategory_df_no_ob = bycategory_df[bycategory_df["Facultative Symbiont 2 - ID"].notnull()].copy()

    # Iterate over the communities with 2 facultative symbionts

    for count, row in bycategory_df_no_ob.iterrows():

        for column in ["Facultative Symbiont 1 - MCI", "Facultative Symbiont 1 - EMO"]:

            same_fac_ID_mask = bycategory_df_1fac["Facultative Symbiont 1 - ID"] == row["Facultative Symbiont 1 - ID"]
            same_ob_ID_mask = bycategory_df_1fac["Obligate Symbiont - ID"] == row["Obligate Symbiont - ID"]
            same_cat_mask = bycategory_df_1fac["Category Name"] == row["Category Name"]

            value_to_subtract = bycategory_df_1fac.loc[same_fac_ID_mask & same_ob_ID_mask & same_cat_mask][column].mean()

            bycategory_df_no_ob.at[count, column] = row[column] - value_to_subtract

        for column in ["Facultative Symbiont 2 - MCI", "Facultative Symbiont 2 - EMO"]:

            same_fac_ID_mask = bycategory_df_1fac["Facultative Symbiont 1 - ID"] == row["Facultative Symbiont 2 - ID"]
            same_ob_ID_mask = bycategory_df_1fac["Obligate Symbiont - ID"] == row["Obligate Symbiont - ID"]
            same_cat_mask = bycategory_df_1fac["Category Name"] == row["Category Name"]

            value_to_subtract = bycategory_df_1fac.loc[same_fac_ID_mask & same_ob_ID_mask & same_cat_mask][column.replace("2", "1")].mean()

            bycategory_df_no_ob.at[count,column] = row[column] - value_to_subtract


    bycategory_df_no_ob.to_csv("./NetInteract Results - Compiled/Without Obligate/By-Category Results (Without Obligate).csv", index=False)

    #------------
    # Do the same thing for the by-pathway dataframe
    # Read in DataFrame - Scores include influence from obligate symbionts
    bypathway_df = pd.read_csv("./NetInteract Results - Compiled/With Obligate/By-Pathway Results.csv")

    #Create another DataFrame - Scores DO NOT include influence from obligate symbionts
    # (We use the communities with only 1 facultative symbiont to subtract out the influence
    # of the obligate symbiont on that 1 facultative symbiont)
    bypathway_df_1fac = copy(bypathway_df[bypathway_df["Facultative Symbiont 2 - ID"].isnull()])
    bypathway_df_no_ob = copy(bypathway_df[bypathway_df["Facultative Symbiont 2 - ID"].notnull()])

    # Iterate over the two-facultative communities and subtract the obligate-only baseline

    for count, row in bypathway_df_no_ob.iterrows():

        for column in ["Facultative Symbiont 1 - MCI", "Facultative Symbiont 1 - EMO"]:

            same_fac_ID_mask = bypathway_df_1fac["Facultative Symbiont 1 - ID"] == row["Facultative Symbiont 1 - ID"]
            same_ob_ID_mask = bypathway_df_1fac["Obligate Symbiont - ID"] == row["Obligate Symbiont - ID"]
            same_path_mask = bypathway_df_1fac["Pathway ID"] == row["Pathway ID"]

            value_to_subtract = bypathway_df_1fac.loc[same_fac_ID_mask & same_ob_ID_mask & same_path_mask][column].mean()

            bypathway_df_no_ob.at[count, column] = row[column] - value_to_subtract

        for column in ["Facultative Symbiont 2 - MCI", "Facultative Symbiont 2 - EMO"]:

            same_fac_ID_mask = bypathway_df_1fac["Facultative Symbiont 1 - ID"] == row["Facultative Symbiont 2 - ID"]
            same_ob_ID_mask = bypathway_df_1fac["Obligate Symbiont - ID"] == row["Obligate Symbiont - ID"]
            same_path_mask = bypathway_df_1fac["Pathway ID"] == row["Pathway ID"]

            value_to_subtract = bypathway_df_1fac.loc[same_fac_ID_mask & same_ob_ID_mask & same_path_mask][column.replace("2", "1")].mean()

            bypathway_df_no_ob.at[count,column] = row[column] - value_to_subtract


    bypathway_df_no_ob.to_csv("./NetInteract Results - Compiled/Without Obligate/By-Pathway Results (Without Obligate).csv", index=False)


In [None]:
extract_unique_facultative_influence()
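
As a rough illustration of the subtraction performed above, the sketch below applies the same idea to a toy table. The IDs (`ob1`, `facA`, `facB`), the scores, and the helper `obligate_baseline` are made up for illustration; only the column names are taken from the compiled result files.

```python
import pandas as pd

# Toy data (hypothetical IDs and scores, for illustration only)
toy = pd.DataFrame({
    "Obligate Symbiont - ID": ["ob1", "ob1", "ob1"],
    "Facultative Symbiont 1 - ID": ["facA", "facB", "facA"],
    "Facultative Symbiont 2 - ID": [None, None, "facB"],
    "Facultative Symbiont 1 - MCI": [0.20, 0.10, 0.50],
    "Facultative Symbiont 2 - MCI": [float("nan"), float("nan"), 0.40],
})

# Communities with one facultative symbiont give the obligate-only baseline
one_fac = toy[toy["Facultative Symbiont 2 - ID"].isnull()]
two_fac = toy[toy["Facultative Symbiont 2 - ID"].notnull()].copy()

def obligate_baseline(fac_id, ob_id, score_col="Facultative Symbiont 1 - MCI"):
    """Mean score of fac_id when it is paired only with ob_id (hypothetical helper)."""
    mask = (one_fac["Facultative Symbiont 1 - ID"] == fac_id) & \
           (one_fac["Obligate Symbiont - ID"] == ob_id)
    return one_fac.loc[mask, score_col].mean()

for idx, row in two_fac.iterrows():
    ob_id = row["Obligate Symbiont - ID"]
    for n in ("1", "2"):
        col = "Facultative Symbiont " + n + " - MCI"
        fac_id = row["Facultative Symbiont " + n + " - ID"]
        two_fac.at[idx, col] = row[col] - obligate_baseline(fac_id, ob_id)

print(two_fac[["Facultative Symbiont 1 - MCI", "Facultative Symbiont 2 - MCI"]])
```

Note that if no matching one-facultative community exists, the mean of the empty selection is `NaN`, so the adjusted score also becomes `NaN`.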

<br><br><br>
### The same process is then performed on all compound-level files

In [11]:
def extract_unique_facultative_influence_compounds():
    
    #---------------------------------------------
    # Find appropriate folders (2x 1-fac/1-ob, 1x 2-fac/1-ob)
    all_community_folders = glob.glob("./NetInteract Results - Compound-Level/With Obligate/*")

    #print(all_community_folders)
    # Remove "_" from some of them, so it doesn't confuse which are unique IDs
    # (Only applicable to this run, but shouldn't affect future runs)
    all_community_folders = [i.replace("New_", "New-") for i in all_community_folders]

    # Find unique IDs
    all_community_folders_dict = {}

    # Iterate
    for folder in all_community_folders:

        if folder.count("_") == 3: # 1 host, 1 ob, 2 fac

            # Remove host ID to only have the other 3
            folder_nohost = folder.split("_", 1)[1]

            fac2_unique_ids = folder_nohost.split("_")
            fac2_unique_ids_set = set(fac2_unique_ids)

            fac1_folders = []

            # Now to find the other ones with only 1 fac
            for subfolder in [i for i in all_community_folders if i.count("_") == 2]:

                subfolder_nohost = subfolder.split("_", 1)[1]

                fac1_unique_ids = subfolder_nohost.split("_")

                fac1_unique_ids_set = set(fac1_unique_ids)

                if fac1_unique_ids_set.issubset(fac2_unique_ids_set) == True:

                    fac1_folders.append(subfolder)

            all_community_folders_dict[folder] = fac1_folders

            
    #---------------------------------------------
    # Figure out which are obligate symbionts
    genome_information_df = pd.read_csv("./Input/Genome Information.csv")

    # Works only on this run because we know they're all NCBI
    obligate_IDs = list(genome_information_df.loc[genome_information_df["Organism Type"] == "Obligate Symbiont"]["NCBI ID"].unique())

    #---------------------------------------------
    # for each full community structure in all_community_folders_dict:
    for full_folder, fac1_folders in all_community_folders_dict.items():

        # Get all IDs
        full_folder_all_ids = full_folder.split("_", 1)[1].split("_")

        # Get obligate ID (not pretty but works)
        obligate_ID = [i for i in full_folder_all_ids if i in obligate_IDs][0]

        # Get all filepaths under that full folder
        full_folder_filepaths = glob.glob(full_folder.replace("New-", "New_") + "/*")

        # Remove any files that describe effects on the obligate symbionts
        full_folder_filepaths = [i for i in full_folder_filepaths if obligate_ID not in i.split("/")[-1]]

        # For each file, determine which facultative symbiont it's about and find that same file in the 1fac subfolder
        for filepath in full_folder_filepaths:

            # Find which facultative symbiont this file is about
            fac_ID = filepath.split("on ")[1].split(".csv")[0]

            # Find the folder for this symbiont (Not pretty but works)
            corr_fac1_folder = [i for i in fac1_folders if fac_ID in i.replace("New-", "New_")][0]

            # Find the file within that folder that matches this one
            corr_fac1_filepath = corr_fac1_folder.replace("New-", "New_") + "/" + filepath.split("/")[-1]

            # Read in both CSVs as DataFrames
            full_df = pd.read_csv(filepath)
            fac1_df = pd.read_csv(corr_fac1_filepath)

            # Make new directory next to the other original directory
            new_directory = full_folder.replace("Compound-Level/With Obligate/", "Compound-Level/Without Obligate/")

            #---------------------------------
            try:
                os.makedirs(new_directory)

            except FileExistsError:
                pass

            #----------------------------------

            if "Overall" in corr_fac1_filepath:

                #------------------------------
                # Grab only what's needed - Modify compound scores
                fac1_df_comp = fac1_df[["Affected Organism Name", "Affected Organism Source", "Affected Organism ID",
                                       "Nonseed - Interaction Type", "Nonseed - Compound ID", "Nonseed - Interaction Score"]]

                fac1_df_comp = fac1_df_comp.rename(columns = {"Nonseed - Interaction Score":"Nonseed - Interaction Score (Just Obligate Influence)"})

                # Merge in
                full_df = full_df.merge(fac1_df_comp, how = 'left', on = ["Affected Organism Name", "Affected Organism Source", "Affected Organism ID",
                                                                          "Nonseed - Interaction Type", "Nonseed - Compound ID"])

                # If it's not found in the obligate-only one, set that score to 0
                full_df = full_df.fillna({"Nonseed - Interaction Score (Just Obligate Influence)": 0})

                # Subtract scores
                full_df["Nonseed - Interaction Score (Just Facultative Influence)"] = full_df["Nonseed - Interaction Score"] - full_df["Nonseed - Interaction Score (Just Obligate Influence)"]

                # For BSS rows, set the two new columns to N/A (this overrides the 0 from fillna above)
                full_df.loc[full_df["Nonseed - Interaction Type"] == "BSS", "Nonseed - Interaction Score (Just Obligate Influence)"] = "N/A"
                full_df.loc[full_df["Nonseed - Interaction Type"] == "BSS", "Nonseed - Interaction Score (Just Facultative Influence)"] = "N/A"

                #------------------------------
                # Send to new file
                corr_fac1_filepath_new = new_directory + "/" + corr_fac1_filepath.split("/")[-1]

                # Rename columns for clarity
                full_df = full_df.rename(columns = {"Nonseed - Interaction Score":"Nonseed - Interaction Score (Influence From Both Symbionts)",
                                                   "Total Number of Nonseeds": "Total Number of Nonseeds in Network"})

                # Specify columns we want to keep
                cols = ["Affected Organism Name", "Affected Organism Source", "Affected Organism ID", 
                        "Total Number of Nonseeds in Network", "Nonseed - Interaction Type", 
                        "Nonseed - Compound ID", "Nonseed - Weight",
                        "Nonseed - Interaction Score (Influence From Both Symbionts)", 
                        "Nonseed - Interaction Score (Just Obligate Influence)",
                        "Nonseed - Interaction Score (Just Facultative Influence)"]

                full_df[cols].drop_duplicates().to_csv(corr_fac1_filepath_new, index = False)


            #------------------------------
            elif "Category" in corr_fac1_filepath:

                #------------------------------
                # Grab only what's needed - Modify compound scores
                fac1_df_comp = fac1_df[["Affected Organism Name", "Affected Organism Source", "Affected Organism ID",
                                       "Category Name", "Nonseed - Interaction Type", "Nonseed - Compound ID", 
                                       "Nonseed - Interaction Score"]]

                fac1_df_comp = fac1_df_comp.rename(columns = {"Nonseed - Interaction Score":"Nonseed - Interaction Score (Just Obligate Influence)"})

                # Merge in
                full_df = full_df.merge(fac1_df_comp, how = 'left', on = ["Affected Organism Name", "Affected Organism Source", "Affected Organism ID",
                                                                          "Category Name", "Nonseed - Interaction Type", "Nonseed - Compound ID"])

                # If it's not found in the obligate-only one, set that score to 0
                full_df = full_df.fillna({"Nonseed - Interaction Score (Just Obligate Influence)": 0})

                # Subtract scores
                full_df["Nonseed - Interaction Score (Just Facultative Influence)"] = full_df["Nonseed - Interaction Score"] - full_df["Nonseed - Interaction Score (Just Obligate Influence)"]

                # For BSS rows, set the two new columns to N/A (this overrides the 0 from fillna above)
                full_df.loc[full_df["Nonseed - Interaction Type"] == "BSS", "Nonseed - Interaction Score (Just Obligate Influence)"] = "N/A"
                full_df.loc[full_df["Nonseed - Interaction Type"] == "BSS", "Nonseed - Interaction Score (Just Facultative Influence)"] = "N/A"

                # Send to new file
                corr_fac1_filepath_new = new_directory + "/" + corr_fac1_filepath.split("/")[-1]

                # Rename columns for clarity
                full_df = full_df.rename(columns = {"Nonseed - Interaction Score":"Nonseed - Interaction Score (Influence From Both Symbionts)",
                                                   "Total Number of Nonseeds": "Total Number of Nonseeds in Category"})

                # Specify columns we want to keep
                cols = ["Affected Organism Name", "Affected Organism Source", "Affected Organism ID", 
                        "Category Name",
                        "Total Number of Nonseeds in Category", "Nonseed - Interaction Type", 
                        "Nonseed - Compound ID", "Nonseed - Weight",
                        "Nonseed - Interaction Score (Influence From Both Symbionts)", 
                        "Nonseed - Interaction Score (Just Obligate Influence)",
                        "Nonseed - Interaction Score (Just Facultative Influence)"]

                full_df[cols].drop_duplicates().to_csv(corr_fac1_filepath_new, index = False)

            #----------------------------------
            elif "Pathway" in corr_fac1_filepath:

                #------------------------------
                # Grab only what's needed - Modify compound scores
                fac1_df_comp = fac1_df[["Affected Organism Name", "Affected Organism Source", "Affected Organism ID",
                                       "Category Name", "Pathway Name", "Pathway ID", "Nonseed - Interaction Type", "Nonseed - Compound ID", 
                                       "Nonseed - Interaction Score"]]

                fac1_df_comp = fac1_df_comp.rename(columns = {"Nonseed - Interaction Score":"Nonseed - Interaction Score (Just Obligate Influence)"})

                # Merge in
                full_df = full_df.merge(fac1_df_comp, how = 'left', on = ["Affected Organism Name", "Affected Organism Source", "Affected Organism ID",
                                                                          "Category Name", "Pathway Name", "Pathway ID", "Nonseed - Interaction Type", "Nonseed - Compound ID"])

                # If it's not found in the obligate-only one, set that score to 0
                full_df = full_df.fillna({"Nonseed - Interaction Score (Just Obligate Influence)": 0})

                # Subtract scores
                full_df["Nonseed - Interaction Score (Just Facultative Influence)"] = full_df["Nonseed - Interaction Score"] - full_df["Nonseed - Interaction Score (Just Obligate Influence)"]

                # For BSS rows, set the two new columns to N/A (this overrides the 0 from fillna above)
                full_df.loc[full_df["Nonseed - Interaction Type"] == "BSS", "Nonseed - Interaction Score (Just Obligate Influence)"] = "N/A"
                full_df.loc[full_df["Nonseed - Interaction Type"] == "BSS", "Nonseed - Interaction Score (Just Facultative Influence)"] = "N/A"

                # Send to new file
                corr_fac1_filepath_new = new_directory + "/" + corr_fac1_filepath.split("/")[-1]

                # Rename columns for clarity
                full_df = full_df.rename(columns = {"Nonseed - Interaction Score":"Nonseed - Interaction Score (Influence From Both Symbionts)",
                                                   "Total Number of Nonseeds": "Total Number of Nonseeds in Pathway"})

                # Specify columns we want to keep
                cols = ["Affected Organism Name", "Affected Organism Source", "Affected Organism ID", 
                        "Category Name", "Pathway Name", "Pathway ID",
                        "Total Number of Nonseeds in Pathway", "Nonseed - Interaction Type", 
                        "Nonseed - Compound ID", "Nonseed - Weight",
                        "Nonseed - Interaction Score (Influence From Both Symbionts)", 
                        "Nonseed - Interaction Score (Just Obligate Influence)",
                        "Nonseed - Interaction Score (Just Facultative Influence)"]

                full_df[cols].drop_duplicates().to_csv(corr_fac1_filepath_new, index = False)




In [12]:
extract_unique_facultative_influence_compounds()
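
The core of the compound-level step is a left merge followed by a subtraction. The sketch below shows that pattern on toy data; the compound IDs and scores are invented, and only a subset of the real key columns is used to keep the example short.

```python
import pandas as pd

keys = ["Nonseed - Interaction Type", "Nonseed - Compound ID"]
score = "Nonseed - Interaction Score"
ob_only = score + " (Just Obligate Influence)"
fac_only = score + " (Just Facultative Influence)"

# Hypothetical scores for one facultative symbiont in the full community ...
full_df = pd.DataFrame({
    keys[0]: ["MCI", "MCI", "EMO"],
    keys[1]: ["C00001", "C00002", "C00003"],
    score: [0.8, 0.5, 0.3],
})
# ... and for the same symbiont paired with the obligate symbiont only
fac1_df = pd.DataFrame({
    keys[0]: ["MCI", "EMO"],
    keys[1]: ["C00001", "C00003"],
    score: [0.6, 0.3],
})

# Left merge keeps every compound from the full community
merged = full_df.merge(fac1_df.rename(columns={score: ob_only}), how="left", on=keys)

# Compounds absent from the obligate-only community get a baseline of 0
merged = merged.fillna({ob_only: 0})
merged[fac_only] = merged[score] - merged[ob_only]

print(merged)
```

In this toy case `C00002` has no obligate-only score, so its facultative-only score equals its full-community score, while `C00003` is fully explained by the obligate symbiont and drops to 0.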

<br><br><br><br><br><br><br><br><br>

<a id='Average_Across_Obligate'></a>
## <span style="margin:auto;display:table">Averaging Across Obligate Symbionts</span>


<br><br><br>
### First, all compiled files are averaged across the obligate symbionts:

In [None]:
def average_across_compiled_files(with_obligate):
    
    if with_obligate == "With Obligate":
        with_obligate_2 = ""
        
    elif with_obligate == "Without Obligate":
        with_obligate_2 = " (Without Obligate)"

    else:
        raise ValueError('with_obligate must be "With Obligate" or "Without Obligate"')
        
    #------------------------------------------------
    try:
        #os.makedirs("./NetInteract Results - Compiled/Without Obligate (Averaged Across Obligates)/")
        os.makedirs("./NetInteract Results - Compiled/" + with_obligate + " (Averaged Across Obligates)/")

    except FileExistsError:
        pass

    #------------------------------------------------
    # Get averages of overall scores

    #overall_df_no_ob = pd.read_csv("./NetInteract Results - Compiled/Without Obligate/Overall Results (Without Obligate).csv")
    overall_df_no_ob = pd.read_csv("./NetInteract Results - Compiled/" + with_obligate + "/Overall Results" + with_obligate_2 + ".csv")

    overall_df_no_ob_avg = overall_df_no_ob [[col for col in overall_df_no_ob.columns if "Obligate Symbiont" not in col]]

    non_value_cols = [col for col in overall_df_no_ob_avg.columns if any(scores in col for scores in ["BSS", "MCI", "EMO"]) == False]
    value_cols = [col for col in overall_df_no_ob_avg.columns if any(scores in col for scores in ["BSS", "MCI", "EMO"]) == True]

    overall_df_no_ob_avg = overall_df_no_ob_avg.groupby(non_value_cols).mean()

    #overall_df_no_ob_avg.to_csv("./NetInteract Results - Compiled/Without Obligate (Averaged Across Obligates)/Overall Results.csv")
    overall_df_no_ob_avg.to_csv("./NetInteract Results - Compiled/" + with_obligate + " (Averaged Across Obligates)/Overall Results.csv")

    #------------------------------------------------
    # Get averages of by-category scores

    #cat_df_no_ob = pd.read_csv("./NetInteract Results - Compiled/Without Obligate/By-Category Results (Without Obligate).csv")
    cat_df_no_ob = pd.read_csv("./NetInteract Results - Compiled/" + with_obligate + "/By-Category Results" + with_obligate_2 + ".csv")

    cat_df_no_ob_avg = cat_df_no_ob [[col for col in cat_df_no_ob.columns if "Obligate Symbiont" not in col]]

    non_value_cols = [col for col in cat_df_no_ob_avg.columns if any(scores in col for scores in ["BSS", "MCI", "EMO"]) == False]
    value_cols = [col for col in cat_df_no_ob_avg.columns if any(scores in col for scores in ["BSS", "MCI", "EMO"]) == True]

    cat_df_no_ob_avg = cat_df_no_ob_avg.groupby(non_value_cols).mean()

    #cat_df_no_ob_avg.to_csv("./NetInteract Results - Compiled/Without Obligate (Averaged Across Obligates)/By-Category Results.csv")
    cat_df_no_ob_avg.to_csv("./NetInteract Results - Compiled/" + with_obligate + " (Averaged Across Obligates)/By-Category Results.csv")
    #------------------------------------------------
    # Get averages of by-pathway scores

    #path_df_no_ob = pd.read_csv("./NetInteract Results - Compiled/Without Obligate/By-Pathway Results (Without Obligate).csv")
    path_df_no_ob = pd.read_csv("./NetInteract Results - Compiled/" + with_obligate + "/By-Pathway Results" + with_obligate_2 + ".csv")
    path_df_no_ob_avg = path_df_no_ob [[col for col in path_df_no_ob.columns if "Obligate Symbiont" not in col]]

    non_value_cols = [col for col in path_df_no_ob_avg.columns if any(scores in col for scores in ["BSS", "MCI", "EMO"]) == False]
    value_cols = [col for col in path_df_no_ob_avg.columns if any(scores in col for scores in ["BSS", "MCI", "EMO"]) == True]

    path_df_no_ob_avg = path_df_no_ob_avg.groupby(non_value_cols).mean()

    #path_df_no_ob_avg.to_csv("./NetInteract Results - Compiled/Without Obligate (Averaged Across Obligates)/By-Pathway Results.csv")
    path_df_no_ob_avg.to_csv("./NetInteract Results - Compiled/" + with_obligate + " (Averaged Across Obligates)/By-Pathway Results.csv")
    #------------------------------------------------




In [None]:
average_across_compiled_files(with_obligate = "Without Obligate")

In [None]:
average_across_compiled_files(with_obligate = "With Obligate")
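
A minimal sketch of the averaging pattern used above, on a toy table: drop the obligate columns, group by every remaining non-score column, and take the mean of the score columns. The IDs and scores are hypothetical.

```python
import pandas as pd

# The same pair of facultative symbionts scored under two obligate symbionts
toy = pd.DataFrame({
    "Obligate Symbiont - ID": ["ob1", "ob2"],
    "Facultative Symbiont 1 - ID": ["facA", "facA"],
    "Facultative Symbiont 2 - ID": ["facB", "facB"],
    "Facultative Symbiont 1 - MCI": [0.2, 0.4],
    "Facultative Symbiont 2 - MCI": [0.1, 0.3],
})

# Drop obligate columns so rows that differ only by obligate fall into one group
kept = toy[[c for c in toy.columns if "Obligate Symbiont" not in c]]

# Every column that is not a score column identifies the group
score_tags = ["BSS", "MCI", "EMO"]
id_cols = [c for c in kept.columns if not any(tag in c for tag in score_tags)]

averaged = kept.groupby(id_cols).mean().reset_index()
print(averaged)
```

One thing to keep in mind: by default `groupby` drops rows whose key columns contain `NaN`, so rows with an empty `Facultative Symbiont 2 - ID` would not appear in the averaged output unless `dropna=False` is passed.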

<br><br><br>
### A similar process is then performed on all compound-level files:

In [None]:
def average_across_compoundlevel_files(with_obligate):
    
    #---------------------------------------------
    # Figure out which are obligate symbionts
    genome_information_df = pd.read_csv("./Input/Genome Information.csv")

    # Works only on this run because we know they're all NCBI
    obligate_IDs = list(genome_information_df.loc[genome_information_df["Organism Type"] == "Obligate Symbiont"]["NCBI ID"].unique())
    
    #------------------------------------------------
    # Find appropriate folders
    #all_community_folders = glob.glob("./NetInteract Results - Compound-Level/Without Obligate/*")
    all_community_folders = glob.glob("./NetInteract Results - Compound-Level/" + with_obligate + "/*")
    
    # Remove "_" from some of them, so it doesn't confuse which are unique IDs
    # (Only applicable to this run, but shouldn't affect future runs)
    all_community_folders = [i.replace("New_", "New-") for i in all_community_folders]

    # Find unique IDs
    all_community_folders_dict = {}

    #--------
    # For with_obligate, make sure it's only full community folders
    if with_obligate == "With Obligate":
        
        all_community_folders = [fol for fol in all_community_folders if fol.count("_") == 3]
    
    #-------
    # Iterate
    for folder in all_community_folders:
        
        # Remove host ID to only have the other 3
        folder_nohost = folder.split("_", 1)[1]

        fac2_unique_ids = folder_nohost.split("_")
        fac2_unique_ids_set = set(fac2_unique_ids)

        # Get obligate ID (not pretty but works)
        obligate_ID = [i for i in list(fac2_unique_ids_set) if i in obligate_IDs][0]
        fac_IDs = [i for i in list(fac2_unique_ids_set) if i not in obligate_IDs]
        
        #print(fac2_unique_ids_set, obligate_ID, fac_IDs)
        
        
        # Get all folders that have that combination of 2 facultative symbionts 
        #(# folders will change based on the number of obligate symbionts)
        if frozenset(fac_IDs) not in all_community_folders_dict.keys():
            
            # Add to dict
            all_community_folders_dict[frozenset(fac_IDs)] = [folder]
        
        else:
            all_community_folders_dict[frozenset(fac_IDs)].append(folder)
            
    #pprint(all_community_folders_dict)
    
    # Iterate over each and find average
    for fac_combination, folders_list in all_community_folders_dict.items():
        
        #new_folderpath = "./NetInteract Results - Compound-Level/Without Obligate (Averaged Across Obligates)/" + "_".join(fac_combination) + "/"
        new_folderpath = "./NetInteract Results - Compound-Level/" + with_obligate + " (Averaged Across Obligates)/" + "_".join(fac_combination) + "/"
        
        #------------------------------------------------
        try:
            os.makedirs(new_folderpath)

        except FileExistsError:
            pass
        
        #------------------------------------------------
        # Restore the original "New_" in folder names so the paths exist on disk (With Obligate only)
        if with_obligate == "With Obligate":
            folders_list = [i.replace("New-", "New_") for i in folders_list]
        
        # Use first folder in folders_list as baseline:
        baseline_folder = folders_list[0]
        baseline_files = glob.glob(baseline_folder + "/*")
        
        #print(baseline_folder)
        
        # Remove obligate file when using with_obligate
        if with_obligate == "With Obligate":
            baseline_files = [i for i in baseline_files if any(ID in i.split("/")[-1] for ID in obligate_IDs) == False]
        
        #print(len(baseline_files))
        
        
        # Using the baseline_folder, find all files in it that describe the effect on the facultative symbionts
        # (i.e., "By-Pathway Affect on Org X")
        # There will be 6 files in here:
        for first_file in baseline_files:
            
            file_ending = first_file.split("/")[-1]
            first_df = pd.read_csv(first_file)
            list_of_dfs = [first_df]
            
            
            for file_index in range(1, len(folders_list)):
                
                next_file = folders_list[file_index] + "/" + file_ending
                next_df = pd.read_csv(next_file)
                list_of_dfs.append(next_df)
                
                
            #second_file = folders_list[1] + "/" + file_ending
            #third_file = folders_list[2] + "/" + file_ending
            
            #second_df = pd.read_csv(second_file)
            #third_df = pd.read_csv(third_file)
            #total_df = pd.concat([first_df, second_df, third_df])
            
            total_df = pd.concat(list_of_dfs)
            
            
            if "Overall" in first_file:
                to_average_cols = ["Affected Organism Name", "Affected Organism Source", "Affected Organism ID",
                                   "Nonseed - Interaction Type", "Nonseed - Compound ID"]
                num_nonseed_col = "Total Number of Nonseeds in Network"
                
            elif "Category" in first_file:
                to_average_cols = ["Affected Organism Name", "Affected Organism Source", "Affected Organism ID",
                                   "Category Name", "Nonseed - Interaction Type", "Nonseed - Compound ID"]
                num_nonseed_col = "Total Number of Nonseeds in Category"
            
            elif "Pathway" in first_file:
                to_average_cols = ["Affected Organism Name", "Affected Organism Source", "Affected Organism ID",
                                   "Category Name", "Pathway Name", "Pathway ID", "Nonseed - Interaction Type", 
                                   "Nonseed - Compound ID"]
                num_nonseed_col = "Total Number of Nonseeds in Pathway"
            
            # Filter columns
            if with_obligate == "With Obligate":
                
                total_df = total_df.rename(columns = {"Total Number of Nonseeds": num_nonseed_col})
                
                filter_columns = [num_nonseed_col, 
                                  "Nonseed - Weight",
                                  "Nonseed - Interaction Score"]
                
            elif with_obligate == "Without Obligate":
                
                filter_columns = [num_nonseed_col, 
                                  "Nonseed - Weight",
                                  "Nonseed - Interaction Score (Influence From Both Symbionts)",
                                  "Nonseed - Interaction Score (Just Obligate Influence)",
                                  "Nonseed - Interaction Score (Just Facultative Influence)"]
            
            filter_columns.extend(to_average_cols)
            total_df = total_df[filter_columns]
            
            # Group by the identifying columns (to_average_cols) and average the remaining numeric columns
            total_df = total_df.groupby(to_average_cols).mean().reset_index()
            
            # For BSS rows, set the obligate-only and facultative-only score columns to N/A
            if with_obligate == "Without Obligate":
                total_df.loc[total_df["Nonseed - Interaction Type"] == "BSS", "Nonseed - Interaction Score (Just Obligate Influence)"] = "N/A"
                total_df.loc[total_df["Nonseed - Interaction Type"] == "BSS", "Nonseed - Interaction Score (Just Facultative Influence)"] = "N/A"

            total_df.to_csv(new_folderpath + file_ending, index=False)
        
        

In [None]:
average_across_compoundlevel_files(with_obligate = "With Obligate")

In [None]:
average_across_compoundlevel_files(with_obligate = "Without Obligate")
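
The compound-level averaging boils down to concatenating the matching files from each obligate symbiont's community and averaging per compound. A small sketch under that assumption, with invented IDs and scores and a reduced set of key columns:

```python
import pandas as pd

keys = ["Affected Organism ID", "Nonseed - Interaction Type", "Nonseed - Compound ID"]
score = "Nonseed - Interaction Score"

# Hypothetical compound scores for facA in communities with two different obligates
under_ob1 = pd.DataFrame({
    keys[0]: ["facA", "facA"],
    keys[1]: ["MCI", "MCI"],
    keys[2]: ["C00001", "C00002"],
    score: [0.2, 0.6],
})
under_ob2 = pd.DataFrame({
    keys[0]: ["facA"],
    keys[1]: ["MCI"],
    keys[2]: ["C00001"],
    score: [0.4],
})

# Stack the per-obligate tables, then average within each compound
total = pd.concat([under_ob1, under_ob2])
averaged = total.groupby(keys).mean().reset_index()
print(averaged)
```

A compound that appears in only some communities (`C00002` here) is averaged over the communities where it appears, not over all of them.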

<br><br><br><br><br><br><br><br><br>

<a id='Compiling_Compound_Scores'></a>
## <span style="margin:auto;display:table">Compiling Compound-Level Scores</span>
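
The function in this section builds one table per score type (MCI, EMO) with a row per affected organism and compound, and a column per affecting facultative symbiont. As a rough picture of that target layout, the sketch below reshapes a toy long-format table with `pivot_table`. This is a stand-in technique, not the approach used in the function (which relies on repeated outer merges), and the `Affecting Organism ID` column, IDs, and scores are hypothetical.

```python
import pandas as pd

# Long format: one row per (affected organism, affecting organism, compound)
long_df = pd.DataFrame({
    "Affected Organism ID": ["facA", "facA", "facB"],
    "Affecting Organism ID": ["facB", "facC", "facA"],  # hypothetical column
    "Nonseed - Compound ID": ["C00001", "C00001", "C00001"],
    "Nonseed - Interaction Score": [0.2, 0.5, 0.1],
})

# Wide format: one column per affecting organism
wide = long_df.pivot_table(
    index=["Affected Organism ID", "Nonseed - Compound ID"],
    columns="Affecting Organism ID",
    values="Nonseed - Interaction Score",
    aggfunc="first",
).reset_index()

# Missing scores become 0, except where an organism would be affecting itself
for org in ["facA", "facB", "facC"]:
    not_self = wide["Affected Organism ID"] != org
    wide.loc[not_self & wide[org].isna(), org] = 0

print(wide)
```

The self-versus-self cells are left as `NaN` so they can be told apart from a genuine score of 0.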


In [None]:
def compile_compound_level_files(with_obligate, organism_order = None):
    
    #---------------------------------------------
    # Figure out which are facultative symbionts
    genome_information_df = pd.read_csv("./Input/Genome Information.csv")
    # Works only on this run because we know they're all NCBI
    all_fac_IDs = list(genome_information_df.loc[genome_information_df["Organism Type"] == "Facultative Symbiont"]["NCBI ID"].unique())
    all_fac_IDs = [i.replace("New_", "New-") for i in all_fac_IDs]
    
    genome_IDs_names_df = genome_information_df.loc[genome_information_df["Organism Type"] == "Facultative Symbiont"][["NCBI ID", "Organism Name"]]
    genome_IDs_names_map = {}
    
    for index, row in genome_IDs_names_df.iterrows():
        genome_IDs_names_map[row["NCBI ID"].replace("New_", "New-")] = row["Organism Name"].replace("New_", "New-")
    
    
    #------------------------------------------------
    # Find appropriate folders
    all_community_folders = glob.glob("./NetInteract Results - Compound-Level/" + with_obligate + " (Averaged Across Obligates)/*")
    
    full_MCI_df = pd.DataFrame(columns = ["Nonseed - Compound ID", "Affected Organism ID"] + all_fac_IDs)
    full_EMO_df = pd.DataFrame(columns = ["Nonseed - Compound ID", "Affected Organism ID"] + all_fac_IDs)
    
    #------------------------------------------------
    # Iterate over folders
    for folder in all_community_folders:
        
        # The folder name is the facultative symbiont IDs joined by "_"
        fac_IDs = list(folder.split("/")[-1].split("_"))
        
        overall_files = glob.glob(folder + "/Overall*.csv")
        
        # Find all overall files (should be 2 per folder)
        for file in overall_files:
            
            df = pd.read_csv(file)
            
            if with_obligate == "Without Obligate":
                df = df.rename(columns = {"Nonseed - Interaction Score (Just Facultative Influence)":"Nonseed - Interaction Score"})
            
            affected_organism = df["Affected Organism ID"].unique()[0]
            affected_organism = affected_organism.replace("New_", "New-")
            affecting_organism = [i for i in fac_IDs if i != affected_organism][0]
            
            MCI_df = df.loc[df["Nonseed - Interaction Type"] == "MCI"][["Nonseed - Compound ID", "Nonseed - Interaction Score", "Affected Organism ID"]].drop_duplicates()
            MCI_df = MCI_df.rename(columns = {"Nonseed - Interaction Score": affecting_organism})
            full_MCI_df = full_MCI_df.merge(MCI_df, on = ["Nonseed - Compound ID", "Affected Organism ID", affecting_organism], how = "outer")
            
            
            EMO_df = df.loc[df["Nonseed - Interaction Type"] == "EMO"][["Nonseed - Compound ID", "Nonseed - Interaction Score", "Affected Organism ID"]].drop_duplicates()
            EMO_df = EMO_df.rename(columns = {"Nonseed - Interaction Score": affecting_organism})
            full_EMO_df = full_EMO_df.merge(EMO_df, on = ["Nonseed - Compound ID", "Affected Organism ID", affecting_organism], how = "outer")
            
            
    #------------------------------------------------  
    for col in all_fac_IDs:
        # transform() returns a result aligned to the original index, regardless of pandas version
        full_MCI_df[col] = full_MCI_df.groupby(["Nonseed - Compound ID", "Affected Organism ID"], sort=False)[col].transform(lambda x: x.ffill().bfill())
        full_EMO_df[col] = full_EMO_df.groupby(["Nonseed - Compound ID", "Affected Organism ID"], sort=False)[col].transform(lambda x: x.ffill().bfill())

    #-----------------------
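    # Add compound names (looked up through the KEGG REST API; requires Biopython)
    from Bio.KEGG import REST # Local import in case REST was not imported earlier in the notebook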
    
    all_compound_names = {}
    compounds_str = REST.kegg_list("compound").read()
    glycans_str = REST.kegg_list("glycan").read()

    for line in compounds_str.rstrip().split("\n"):
        compound_ID, compound_name = line.split("\t")
        all_compound_names[compound_ID[4:]] = compound_name

    for line in glycans_str.rstrip().split("\n"):
        glycan_ID, glycan_name = line.split("\t")
        all_compound_names[glycan_ID[3:]] = glycan_name

    full_MCI_df["Nonseed - Compound Name"] = full_MCI_df["Nonseed - Compound ID"].map(all_compound_names)
    full_EMO_df["Nonseed - Compound Name"] = full_EMO_df["Nonseed - Compound ID"].map(all_compound_names)
    
    #-----------------------
    # Change from org IDs to org names in columns
    full_MCI_df = full_MCI_df.rename(columns = genome_IDs_names_map)
    full_EMO_df = full_EMO_df.rename(columns = genome_IDs_names_map)
    
    # Change from org IDs to org names in rows
    full_MCI_df = full_MCI_df.replace({"Affected Organism ID": {k.replace("New-", "New_"):v for k,v in genome_IDs_names_map.items()}}).rename(columns = {"Affected Organism ID": "Affected Organism Name"})
    full_EMO_df = full_EMO_df.replace({"Affected Organism ID": {k.replace("New-", "New_"):v for k,v in genome_IDs_names_map.items()}}).rename(columns = {"Affected Organism ID": "Affected Organism Name"})
    
    # Reorder
    cols = ["Affected Organism Name", "Nonseed - Compound ID", "Nonseed - Compound Name"] + [i for i in organism_order if i in genome_IDs_names_map.values()]
    full_MCI_df = full_MCI_df[cols]
    full_EMO_df = full_EMO_df[cols]
    
    #-----------------------
    # Fill nans with 0s
    # An organism's own column stays NaN in its rows (it does not affect itself);
    # every other missing score means "no interaction" and becomes 0
    for col in [i for i in organism_order if i in genome_IDs_names_map.values()]:
        
        full_MCI_df.loc[(full_MCI_df["Affected Organism Name"] != col) & full_MCI_df[col].isna(), col] = 0
        full_EMO_df.loc[(full_EMO_df["Affected Organism Name"] != col) & full_EMO_df[col].isna(), col] = 0
    
    #-----------------------
    full_MCI_df.drop_duplicates().to_csv("./NetInteract Results - Compound-Level/Compiled compound-level scores - MCI (" + with_obligate + " & Averaged across obligate).csv", index = False)
    full_EMO_df.drop_duplicates().to_csv("./NetInteract Results - Compound-Level/Compiled compound-level scores - EMO (" + with_obligate + " & Averaged across obligate).csv", index = False)
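
The group-wise fill in `compile_compound_level_files` can be hard to picture, so here is a minimal sketch on toy data. It assumes that, after the outer merges, each row for a given compound and affected organism carries only one affecting organism's score; the IDs `orgA`, `orgB`, and `orgC` and the scores are invented for illustration and are not pipeline output.

```python
import numpy as np
import pandas as pd

# Toy stand-in for full_MCI_df after the outer merges (hypothetical values):
# rows sharing a compound and affected organism each hold a single score.
toy = pd.DataFrame({
    "Nonseed - Compound ID": ["C00001", "C00001", "C00002"],
    "Affected Organism ID": ["orgA", "orgA", "orgA"],
    "orgB": [0.5, np.nan, 0.2],
    "orgC": [np.nan, 0.8, np.nan],
})

keys = ["Nonseed - Compound ID", "Affected Organism ID"]
for col in ["orgB", "orgC"]:
    # Within each (compound, affected organism) group, copy the known score
    # into the rows where it is missing
    toy[col] = toy.groupby(keys, sort=False)[col].transform(lambda s: s.ffill().bfill())

# The filled rows of a group are now identical, so they collapse to one row
filled = toy.drop_duplicates().reset_index(drop=True)
print(filled)
```

After the fill, the two `C00001` rows become identical and collapse into one, while the `C00002` row keeps a missing value for `orgC` because no score was ever recorded for it.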
                

In [None]:
compile_compound_level_files(
    with_obligate = "With Obligate",
    organism_order = [
        'Hamiltonella defensa 5AT (AA)',
        'Hamiltonella defensa NY26 (AA)',
        'Hamiltonella defensa MI47 (BB)',
        'Hamiltonella defensa 5D (BB)',
        'Hamiltonella defensa MI12 (CC)',
        'Hamiltonella defensa A2C (DD)',
        'Hamiltonella defensa AS3 (DD)',
        'Hamiltonella defensa ZA17 (EE)',
        'Regiella insecticola LSR1',
        'Rickettsiella viridis Ap-RA04',
        'Fukatsuia symbiotica 5D',
        'Serratia symbiotica Tuscon',
    ],
)
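
Each call to `compile_compound_level_files` queries KEGG for compound and glycan names, so it needs network access. The parsing itself can be checked offline. The sketch below uses a hypothetical helper, `parse_kegg_list`, which is not part of NetInteract or Biopython. It assumes the response is tab-separated text with one entry per line, and it accepts IDs with or without a database prefix such as `cpd:` or `gl:`. The sample lines are hand-written for illustration, not fetched from KEGG.

```python
def parse_kegg_list(text):
    """Map bare KEGG IDs to names from tab-separated list output (hypothetical helper)."""
    names = {}
    for line in text.rstrip().split("\n"):
        if not line.strip():
            continue  # skip blank lines
        entry_id, name = line.split("\t", 1)
        # Keep only what follows an optional "db:" prefix (e.g. "cpd:", "gl:")
        names[entry_id.split(":")[-1]] = name
    return names


# Hand-written sample lines (the glycan name is a placeholder)
sample = (
    "cpd:C00001\tH2O; Water\n"
    "C00002\tATP; Adenosine 5'-triphosphate\n"
    "gl:G00001\tExample glycan\n"
)
names = parse_kegg_list(sample)
print(names["C00001"])
```

Splitting on the last colon rather than slicing a fixed number of characters keeps the lookup keys correct if the prefix is absent or has a different length.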

In [None]:
compile_compound_level_files(
    with_obligate = "Without Obligate",
    organism_order = [
        'Hamiltonella defensa 5AT (AA)',
        'Hamiltonella defensa NY26 (AA)',
        'Hamiltonella defensa MI47 (BB)',
        'Hamiltonella defensa 5D (BB)',
        'Hamiltonella defensa MI12 (CC)',
        'Hamiltonella defensa A2C (DD)',
        'Hamiltonella defensa AS3 (DD)',
        'Hamiltonella defensa ZA17 (EE)',
        'Regiella insecticola LSR1',
        'Rickettsiella viridis Ap-RA04',
        'Fukatsuia symbiotica 5D',
        'Serratia symbiotica Tuscon',
    ],
)
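
When reading the compiled CSVs, it helps to know how missing scores are treated: a blank cell is left only where the column organism is the same as the row's affected organism, and every other missing score is written as 0. The sketch below reproduces that masking step on a toy table. The organism names `Org A`, `Org B`, and `Org C` and the scores are hypothetical.

```python
import numpy as np
import pandas as pd

organisms = ["Org A", "Org B", "Org C"]  # hypothetical organism names
scores = pd.DataFrame({
    "Affected Organism Name": ["Org A", "Org B"],
    "Nonseed - Compound ID": ["C00001", "C00001"],
    "Org A": [np.nan, 0.4],
    "Org B": [np.nan, np.nan],
    "Org C": [0.7, np.nan],
})

for col in organisms:
    # Rows where this column's organism is NOT the affected organism
    is_other = scores["Affected Organism Name"] != col
    # Missing score from another organism -> 0; the self column stays NaN
    scores.loc[is_other & scores[col].isna(), col] = 0

print(scores)
```

In the result, `Org A` keeps a missing value in its own column, while its missing score from `Org B` becomes 0; the same holds for the `Org B` row.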

<br><br><br><br><br><br><br><br><br><br>
# <span style="margin:auto;display:table;">NetInteract is complete!</span> <br>
## <span style="margin:auto;display:table;">Please open the folder titled "NetInteract Results" to view results. Compiled compound-level scores are in "NetInteract Results - Compound-Level".</span>
<br><br><br><br><br>

<br><br><br><br><br><br><br><br><br><br>