## Explanation of Taxonomic Data Processing Code

This code processes taxonomic data to gather and manage information about species and their classifications. Below is a detailed breakdown of its functionality:

1. **Creating Empty Columns**:
   - The code starts by adding several empty columns to the DataFrame `tax_hierar`. These columns store taxonomic information: `taxonID`, `taxon` (the taxonomic rank), `ID_pai` (parent taxon ID), `parent_taxon_index`, `ID_filho` and `index_filho` (child IDs and indices), and `scientificName_correct` (the name as returned by the taxonomy lookup).

2. **Splitting Scientific Names**:
   - The `scientificName` column is split into separate entries based on the delimiter ' | '. This allows each scientific name to be stored in a separate row, making it easier to manage and query.

3. **Removing Duplicates**:
   - Duplicate entries in the `scientificName` column are removed to ensure that each scientific name is unique within the DataFrame.

4. **Retrieving Taxonomic Information**:
   - The script iterates over the `scientificName` column, calling a function `get_taxonomy_id` to fetch taxonomic information for each name. If any name cannot be found, it logs a warning and adds the error to a list.

5. **Handling Missing Taxon IDs**:
   - After attempting to retrieve taxonomic data, rows where `taxonID` is still missing are dropped from the DataFrame. This step ensures that only valid entries remain.

6. **Fetching Additional Taxonomic Data**:
   - The script then makes additional API calls to a taxonomy database using the unique `taxonID` values obtained earlier. It collects further taxonomic data and adds it to a temporary list, which is subsequently combined with the main DataFrame.

7. **Error Handling**:
   - A list to track errors during the data retrieval process is maintained. After processing the initial list of scientific names, if there are errors, the code tries to retrieve taxonomic information for the names that generated errors.

8. **Assigning Parent-Child Relationships**:
   - The code constructs a parent-child relationship by mapping each species to its parent taxon. It utilizes the `ID_pai` (parent ID) to find the corresponding parent entry and stores the index of the parent in a new column.

9. **Building Taxonomic Tree**:
   - A taxonomic tree is constructed using the `construir_arvore_taxonomica` function, which organizes the data hierarchically. This tree is then converted to Newick format for visualization.

10. **Output and Saving Results**:
    - Finally, the results are saved in multiple formats (Newick, Nexus, and NeXML) for further analysis or visualization. The script also ensures that directories for saving the files are created if they do not already exist.

Overall, this code retrieves, processes, and organizes taxonomic information so that researchers can analyze taxonomic collections and the relationships among their taxa.


# Installing Required Libraries
This command installs the necessary Python libraries: `biopython`, `pandas`, and `requests`. These are essential for handling biological data, data manipulation, and making HTTP requests, respectively.


In [None]:
!pip install biopython pandas requests

# Importing Necessary Modules
In this section, we import the necessary Python libraries and custom functions:

- `tkinter`: Used for creating graphical user interfaces (GUIs). Modules such as `filedialog` and `messagebox` handle file selection dialogs and message pop-ups.
- `functions`: A custom module (assumed to be defined elsewhere in the project) that provides additional functionalities for the program.
- `Bio`: Part of the Biopython library. We specifically import `Entrez` for interacting with NCBI databases and `Phylo` for working with phylogenetic trees.
- `pandas`: A data manipulation library used for handling data structures such as DataFrames.
- `requests`: A library for sending HTTP requests.
- `logging`: Used for tracking events and debugging.
- `os`: A library for interacting with the operating system, useful for file handling.
- `io`: Provides tools for working with input/output operations in Python.


In [29]:
# Importing the required modules
from tkinter import filedialog, messagebox
from functions import *
from Bio import Entrez, Phylo
import tkinter as tk
import pandas as pd
import requests
import logging
import os
import io


# Function Definitions for Taxonomic Data Handling
This section relies on several functions for retrieving, processing, and manipulating taxonomic data from NCBI. Their definitions do not appear in this notebook; they are presumably provided by the custom `functions` module imported above. They cover configuring access to NCBI, searching and fetching records, and building a taxonomic tree.

## Functions:
- `configure_entrez`: Configures the NCBI Entrez API settings.
- `search_NCBI`: Performs a search in NCBI using the specified term.
- `efetch_NCBI`: Retrieves data from NCBI using the specified taxonomy ID.
- `search_tax_id`: Searches for the taxonomy ID of a given scientific name.
- `fetch_tax_info`: Fetches taxonomy information for a given taxonomy ID.
- `get_taxonomy_id`: Retrieves taxonomy information for a given scientific name.
- `salvar_arquivo`: Opens a save dialog and returns the path chosen by the user.
- `selecionar_arquivo`: Opens a file selection dialog for the user to choose an input file.
- `ler_arquivo_csv`: Reads a CSV file and checks for the presence of the 'scientificName' column.
- `get_children`: Recursive function to obtain the children of a node in the taxonomic hierarchy.
- `construir_arvore_taxonomica`: Constructs a taxonomic tree from the provided taxonomic hierarchy data.
- `arvore_para_newick`: Converts a tree represented as a dictionary into Newick format, which is commonly used to represent phylogenetic trees.


In [None]:
# Setting up the logging system to record messages in a log file
# 'salvar_arquivo' is a function that returns the path to save the log file with a '.log' extension
caminho_logfile = salvar_arquivo('.log')

# Configuring the logging system to write log messages to the specified file.
# The logging level is set to DEBUG, meaning all messages from DEBUG level and above (INFO, WARNING, ERROR, CRITICAL) will be captured.
# The log format includes the timestamp, the log level, and the log message.
logging.basicConfig(filename=caminho_logfile, level=logging.DEBUG, format='%(asctime)s - %(levelname)s - %(message)s')

# Configure access to the NCBI Entrez system; 'configure_entrez' is expected to come from the custom 'functions' module.
configure_entrez()


In [None]:
df = ler_arquivo_csv()

if not df.empty:  # Check if the DataFrame is not empty
    # Remove duplicate rows based on 'scientificName' and store the unique scientific names in df_scientificname
    df_scientificname = df.drop_duplicates(subset='scientificName')['scientificName']
    
    # Convert the 'scientificName' Series into a DataFrame
    df_scientificname = pd.DataFrame(df_scientificname)
    
    # Create a copy of df_scientificname and store it in tax_hierar
    tax_hierar = df_scientificname.copy()
else:
    # If df is empty, create an empty DataFrame with the required columns for the taxonomic hierarchy
    tax_hierar = pd.DataFrame(columns=['taxonID', 'taxon', 'ID_pai', 'parent_taxon_index', 
                                       'ID_filho', 'index_filho', 'scientificName_correct', 'scientificName'])


In [32]:
# Adding empty columns to store taxonomic information
tax_hierar['taxonID'] = None  # Initialize a column for taxon IDs
tax_hierar['taxon'] = None  # Initialize a column for the taxonomic rank (e.g. 'kingdom')
tax_hierar['ID_pai'] = None  # Initialize a column for parent IDs
tax_hierar['parent_taxon_index'] = None  # Initialize a column for parent taxon indices
tax_hierar['ID_filho'] = None  # Initialize a column for child IDs
tax_hierar['index_filho'] = None  # Initialize a column for child indices
tax_hierar['scientificName_correct'] = None  # Initialize a column for corrected scientific names

# Splitting the 'scientificName' column into lists separated by ' | '
tax_hierar['scientificName'] = tax_hierar['scientificName'].str.split(r' \| ')  # Raw string: '\|' is a regex-escaped literal pipe

# Expanding the lists into individual rows in the DataFrame
tax_hierar = tax_hierar.explode('scientificName')  # Explode lists into separate rows

# Removing duplicate entries from 'scientificName'
tax_hierar.drop_duplicates(subset='scientificName', inplace=True, ignore_index=True)  # Remove duplicate scientific names

# Adding a log message to indicate the completion of the operation
logging.info("Preparation of the 'tax_hierar' DataFrame completed successfully.")  # Log a message indicating successful preparation
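
The split, explode, and de-duplicate steps above can be tried on a small table. This is a minimal sketch with made-up input; the DataFrame name `nomes` is hypothetical, and `regex=False` is used here as an alternative to escaping the pipe character.

```python
import pandas as pd

# Toy input; the names are illustrative only.
nomes = pd.DataFrame({
    "scientificName": ["Panthera leo | Panthera onca", "Panthera leo", "Canis lupus"]
})

# Split on the literal ' | ' delimiter (regex=False avoids escaping the pipe).
nomes["scientificName"] = nomes["scientificName"].str.split(" | ", regex=False)

# One row per name, then drop repeats and renumber the index from 0.
nomes = nomes.explode("scientificName")
nomes = nomes.drop_duplicates(subset="scientificName", ignore_index=True)

print(nomes["scientificName"].tolist())
```

Renumbering the index matters for the next cell, which writes results back with `tax_hierar.loc[i, ...]` using positions produced by `enumerate`.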


In [33]:
# List to store names of species that could not be found in the taxonomy search
lista_erro = []

# Iterating over the values in the 'scientificName' column of the 'tax_hierar' DataFrame
for i, valor in enumerate(tax_hierar['scientificName']):
    # Calling the 'get_taxonomy_id' function to obtain taxonomy information for the current value
    tax_hierar.loc[i, 'scientificName_correct'], tax_hierar.loc[i, 'taxonID'], tax_hierar.loc[i, 'taxon'], tax_hierar.loc[i, 'ID_pai'], erro = get_taxonomy_id(valor)

    # Checking if an error occurred during the taxonomy search
    if erro:
        lista_erro.append(erro)  # Add the error to the error list
        logging.warning(f"Error retrieving taxonomy information for '{valor}': name not found in the search.")  # Log a warning message

# Creating a DataFrame of entries where 'taxonID' is None
tax_hierar_taxonid_none = tax_hierar[tax_hierar['taxonID'].isna()]

# Removing rows where 'taxonID' is empty, indicating that the taxonomy search was unsuccessful
tax_hierar.dropna(subset=['taxonID'], inplace=True)
logging.info(f"Taxonomy lookup finished; {len(tax_hierar)} names have a taxonID.")  # Log how many names were resolved

# Resetting the index of the DataFrame after removing rows
tax_hierar.reset_index(drop=True, inplace=True)

# Adding a log message to indicate the completion of the operation
logging.info("Retrieval of taxonomy information for the 'tax_hierar' DataFrame completed successfully.")  # Log a message indicating successful completion
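
The lookup loop can be illustrated without network access. The sketch below uses `fake_get_taxonomy_id`, a hypothetical stand-in that mimics the five-value return of `get_taxonomy_id` as it is unpacked above; the IDs are placeholders, and treating the error value as the unresolved name is an assumption based on how `lista_erro` is reused later.

```python
# Hypothetical stand-in for get_taxonomy_id. Returns
# (corrected_name, taxon_id, rank, parent_id, error); IDs are placeholders.
FAKE_DB = {"Panthera leo": ("Panthera leo", "1001", "species", "1000")}

def fake_get_taxonomy_id(name):
    if name in FAKE_DB:
        return (*FAKE_DB[name], None)
    # Assumption: on failure the error slot carries the unresolved name.
    return (None, None, None, None, name)

resultados = []
nao_encontrados = []
for nome in ["Panthera leo", "Nomen dubium"]:
    correto, txid, rank, pai, erro = fake_get_taxonomy_id(nome)
    if erro:
        nao_encontrados.append(erro)  # keep the name for a later retry
        continue
    resultados.append(
        {"scientificName": nome, "taxonID": txid, "taxon": rank, "ID_pai": pai}
    )

print(resultados)
print(nao_encontrados)
```

Collecting rows in a list and building the DataFrame once at the end avoids repeated `loc` writes, which is the same pattern the next cell uses with `temp_df`.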


In [34]:
# Temporary list to store taxonomy information for found species
temp_df = []

# Iterating over the values in the 'taxonID' column of the 'tax_hierar' DataFrame
for taxon_id in tax_hierar['taxonID']:
    # Parameters for the HTTP request ('rank_dwc' and 'url_api' are presumably provided by the 'functions' module)
    parametros = {'txid': taxon_id, 'rank': 'custom', 'srank': rank_dwc, 'format': 'json'}

    # Making a GET request to the taxonomy API
    response = requests.get(url_api, params=parametros)
    # Checking if the response was successful
    if response.status_code == 200:
        data = response.json()  # Parsing the JSON response
        dicta = list(data.values())[0]  # Getting the first value from the response dictionary

        # Iterating over the items returned by the API ('nome' avoids shadowing the outer loop variable)
        for chave, nome in dicta.items():
            # Checking if the name does not contain '_' and if it does not exist in 'temp_df' or 'tax_hierar'
            if '_' not in nome and not any(entry['scientificName'] == nome for entry in temp_df) and nome not in tax_hierar['scientificName'].tolist():
                # Calling the 'get_taxonomy_id' function to obtain taxonomy information for the current name
                temp_name, temp_txid, temp_rank, temp_pai, erro = get_taxonomy_id(nome)

                # Checking if an error occurred during the taxonomy search
                if erro:
                    lista_erro.append(erro)  # Add the error to the error list
                    logging.warning(f"Error retrieving taxonomy information for '{nome}': {erro}")  # Log a warning message

                # Adding the obtained information to 'temp_df'
                temp_df.append({'scientificName': nome, 'ID_filho': None, 'ID_pai': temp_pai, 'taxonID': temp_txid, 'taxon': temp_rank, 'scientificName_correct': temp_name})
    else:
        logging.error(f"Failed to access the taxonomy API for taxonID '{taxon_id}'. Status code: {response.status_code}")  # Log an error if the API request failed

# Concatenating 'tax_hierar' with 'temp_df'; ignore_index=True gives every row a unique label,
# so the index-based drop below cannot remove unrelated rows that happen to share a label
tax_hierar = pd.concat([tax_hierar, pd.DataFrame(temp_df)], ignore_index=True)
tax_hierar.drop(tax_hierar.loc[tax_hierar['taxonID'] == ''].index, inplace=True)  # Remove entries with empty 'taxonID'

# Removing duplicate entries based on 'taxonID' and resetting the index
tax_hierar.drop_duplicates(subset=['taxonID'], inplace=True, ignore_index=True)
logging.info("Taxonomy lookup and concatenation completed.")  # Log a message indicating the successful completion of the operation
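
A note on the concatenation step: by default `pd.concat` keeps each frame's own index labels, so the combined frame can contain repeated labels, and dropping rows by label then removes every row that shares it. The toy frames `a` and `b` below are hypothetical and only demonstrate that effect.

```python
import pandas as pd

a = pd.DataFrame({"taxonID": ["10", "20"]})
b = pd.DataFrame({"taxonID": ["", "30"]})

# Default concat keeps the original labels: 0, 1, 0, 1.
junto = pd.concat([a, b])
vazios = junto.loc[junto["taxonID"] == ""].index  # contains the label 0
perdeu = junto.drop(vazios)  # drops BOTH rows labelled 0, including "10"

# With ignore_index=True the labels are 0..3, so only the empty row is dropped.
junto_ok = pd.concat([a, b], ignore_index=True)
limpo = junto_ok.drop(junto_ok.loc[junto_ok["taxonID"] == ""].index)

print(perdeu["taxonID"].tolist())
print(limpo["taxonID"].tolist())
```

A boolean filter such as `junto[junto["taxonID"] != ""]` avoids the label lookup altogether and is another way to get the same result.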


In [35]:
# Removing duplicate and empty items from the error list, plus names already present in 'tax_hierar'
# (membership is tested against the column values; 'in' on a Series would test the index labels)
nomes_existentes = set(tax_hierar['scientificName'])
lista_erro = [item for item in set(lista_erro) if item and item not in nomes_existentes]

# Temporary list to store taxonomy information for species found with errors
temp_df_erro = []

# Retrying the lookup once for each remaining name
while len(lista_erro) > 0:
    # Taking the next name off the error list
    nome = lista_erro.pop(0)

    # Calling the 'get_taxonomy_id' function to obtain taxonomy information for the current name
    temp_name, temp_txid, temp_rank, temp_pai, erro = get_taxonomy_id(nome)

    # Adding the obtained information to 'temp_df_erro'
    temp_df_erro.append({'scientificName': nome, 'ID_filho': None, 'ID_pai': temp_pai, 'taxonID': temp_txid, 'taxon': temp_rank, 'scientificName_correct': temp_name})

# Concatenating 'tax_hierar' with 'temp_df_erro' (with fresh row labels) and removing rows where 'taxonID' is empty
tax_hierar = pd.concat([tax_hierar, pd.DataFrame(temp_df_erro)], ignore_index=True)
tax_hierar.drop(tax_hierar.loc[tax_hierar['taxonID'] == ''].index, inplace=True)

# Removing duplicate entries based on 'taxonID' and resetting the index
tax_hierar.drop_duplicates(subset=['taxonID'], inplace=True, ignore_index=True)

logging.info("Taxonomy lookup and concatenation with error handling completed.")  # Log a message indicating successful completion
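
When filtering names against a DataFrame column, it helps to remember that the `in` operator applied to a pandas Series checks the index labels, not the stored values. The short sketch below, with a made-up Series `nomes`, shows the difference and two ways to test the values instead.

```python
import pandas as pd

nomes = pd.Series(["Panthera leo", "Canis lupus"])

# 'in' on a Series looks at the index labels (here 0 and 1).
por_rotulo = "Panthera leo" in nomes
rotulo_zero = 0 in nomes

# Testing the stored values instead.
por_valor = "Panthera leo" in nomes.values
por_isin = bool(nomes.isin(["Panthera leo"]).any())

print(por_rotulo, rotulo_zero, por_valor, por_isin)
```

For many lookups against the same column, converting it once to a `set` keeps each membership test cheap.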


In [None]:
# Dictionary mapping each taxonID to its row position in the 'tax_hierar' DataFrame
pai_to_index = {value: index for index, value in enumerate(tax_hierar['taxonID'].unique())}

# Iterating over the values in the 'ID_pai' column of the 'tax_hierar' DataFrame
for i, valor in enumerate(tax_hierar['ID_pai']):
    # Maximum number of attempts to obtain taxonomy information
    num_tentativas = 3
    tentativa = 0

    # Checking if the value is present in the 'taxonID' column
    if valor in tax_hierar['taxonID'].values:
        # Obtaining the parent index and assigning it to 'parent_taxon_index'
        tax_hierar.loc[i, 'parent_taxon_index'] = tax_hierar.index[tax_hierar['taxonID'] == valor].tolist()[0]
    else:
        # Attempts to find the parent in the taxonomy hierarchy
        while tentativa < num_tentativas:
            # If the current taxon is 'kingdom', break the attempt
            if tax_hierar.loc[i, 'taxon'] == 'kingdom':
                break

            try:
                # Obtaining taxonomy information for the parent ID
                record = efetch_NCBI(str(tax_hierar.loc[i, 'ID_pai']))

                # Getting the parent's parent's ID
                taxid_sup_rank = record[0]['LineageEx'][-1]['TaxId']
                tax_hierar.loc[i, 'ID_pai'] = taxid_sup_rank

                # If the parent's rank is 'kingdom', break the attempt
                if record[0]['Rank'] == 'kingdom':
                    break

                # If the parent ID is mapped, assign the index to 'parent_taxon_index'
                if taxid_sup_rank in pai_to_index:
                    tax_hierar.loc[i, 'parent_taxon_index'] = pai_to_index[taxid_sup_rank]
                    break

            except Exception as e:
                # If an error occurs, log it and try again (only failed attempts count toward the limit)
                logging.error(f"Error on attempt {tentativa + 1}: {str(e)} {valor}")
                tentativa += 1

# Replacing missing values with the string 'NaN' in the 'parent_taxon_index' column
tax_hierar['parent_taxon_index'] = tax_hierar['parent_taxon_index'].fillna('NaN')

logging.info("Finished assigning parent indices to each taxon in the taxonomic hierarchy.")
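
The fallback branch above climbs the lineage one level at a time until it reaches a taxon that is already in the table. The idea can be sketched offline with a plain dictionary in place of the NCBI requests; `PARENT_OF` and `nearest_known_ancestor` are hypothetical names introduced only for this illustration.

```python
# Hypothetical parent table standing in for repeated NCBI lookups:
# taxon ID -> parent ID (None marks the top of the lineage).
PARENT_OF = {"sp1": "gen1", "gen1": "fam1", "fam1": "ord1", "ord1": None}

def nearest_known_ancestor(taxon_id, known_ids, parent_of, max_steps=50):
    """Climb from taxon_id toward the root and return the first
    ancestor found in known_ids, or None if there is none."""
    atual = parent_of.get(taxon_id)
    for _ in range(max_steps):  # hard cap so a malformed table cannot loop forever
        if atual is None or atual in known_ids:
            return atual
        atual = parent_of.get(atual)
    return None

ancestral = nearest_known_ancestor("sp1", {"ord1"}, PARENT_OF)
sem_ancestral = nearest_known_ancestor("sp1", {"outro"}, PARENT_OF)
print(ancestral, sem_ancestral)
```

Capping the number of steps, rather than counting only failed requests, guarantees termination even when every lookup succeeds but no known ancestor exists.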


In [54]:
# Grouping the indices by parent index
a = tax_hierar.groupby(['parent_taxon_index'], group_keys=True).groups

# Iterating over the groups and their values
for chave, valor in a.items():
    # Checking if the key is present in the DataFrame indices
    if chave in tax_hierar.index:
        # Converting the values to a comma-separated string and assigning it to 'index_filho'
        tax_hierar.loc[chave, 'index_filho'] = ','.join(map(str, valor.tolist()))

        # Obtaining the taxonIDs of the children and assigning them to 'ID_filho'
        ids_filhos = tax_hierar.loc[valor, 'taxonID']
        tax_hierar.loc[chave, 'ID_filho'] = ','.join(map(str, ids_filhos))

logging.info("Finished assigning child indices to each taxon in the taxonomic hierarchy.")
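
The child-assignment step can be reproduced on a four-row table. This sketch uses a hypothetical DataFrame `hier` with the same column names as above; it stores missing parents as NaN rather than the string 'NaN', which is a simplification so that `groupby` skips the root row.

```python
import pandas as pd

hier = pd.DataFrame({
    "taxonID": ["1", "2", "3", "4"],
    "parent_taxon_index": [float("nan"), 0, 0, 1],  # row 0 is the root
})

# Map each parent row label to the labels of its children (NaN keys are skipped).
filhos_por_pai = hier.groupby("parent_taxon_index").groups

hier["index_filho"] = None
hier["ID_filho"] = None
for pai, filhos in filhos_por_pai.items():
    pai = int(pai)  # group keys are floats because the column holds NaN
    hier.loc[pai, "index_filho"] = ",".join(map(str, filhos))
    hier.loc[pai, "ID_filho"] = ",".join(hier.loc[filhos, "taxonID"].astype(str))

print(hier[["taxonID", "index_filho", "ID_filho"]])
```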


In [55]:
# Renaming the columns 'scientificName' and 'scientificName_correct'
tax_hierar = tax_hierar.rename(columns={'scientificName': 'scientificName_search', 'scientificName_correct': 'scientificName'})

# Prompting the user to select a location to save the file
arquivo_salvar = salvar_arquivo()
if arquivo_salvar:
    # Saving the DataFrame to a CSV file
    tax_hierar.to_csv(arquivo_salvar, index=False)

    logging.info(f"DataFrame saved to '{arquivo_salvar}' successfully.")

else:
    logging.info("The file save operation was not completed.")


In [None]:
# Selecting the root node from the DataFrame where the taxon is 'kingdom'
root = tax_hierar[tax_hierar['taxon'] == 'kingdom']

# Constructing the taxonomic tree using the specified root
arvore_taxonomica = construir_arvore_taxonomica(tax_hierar, 'scientificName', 'index_filho', root=root['scientificName'])

# Converting the taxonomic tree into Newick format, starting from the node named 'Eukaryota'
arvore_newick = arvore_para_newick(arvore_taxonomica, 'Eukaryota')

# Reading the Newick formatted string into a Phylo tree object
tree = Phylo.read(io.StringIO(arvore_newick), "newick")

# Creating the 'arvore' output directory if it does not already exist
os.makedirs('arvore', exist_ok=True)

# Visualizing the phylogenetic tree
Phylo.draw(tree)

# Saving the tree in Newick format
Phylo.write(tree, "arvore/arvore_taxonomica.nwk", "newick")

# Saving the tree in Nexus format
Phylo.write(tree, "arvore/arvore_taxonomica.nex", "nexus")

# Saving the tree in NeXML format
Phylo.write(tree, "arvore/arvore_taxonomica.xml", "nexml")
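
For readers unfamiliar with Newick, the conversion from a parent-to-children mapping can be sketched in a few lines. This is not the notebook's `arvore_para_newick`; the function `to_newick` and the adjacency-dictionary layout are assumptions made for illustration, and the taxa are a toy example.

```python
def to_newick(children_of, node):
    """Return the Newick text for the subtree rooted at node.

    children_of maps a node name to a list of child names;
    names missing from the mapping are treated as leaves.
    """
    label = node.replace(" ", "_")  # unquoted Newick labels cannot contain spaces
    children = children_of.get(node, [])
    if not children:
        return label
    inner = ",".join(to_newick(children_of, child) for child in children)
    return f"({inner}){label}"

arvore = {"Eukaryota": ["Metazoa", "Fungi"], "Metazoa": ["Homo sapiens"]}
newick = to_newick(arvore, "Eukaryota") + ";"
print(newick)
```

A string built this way can be parsed with `Phylo.read(io.StringIO(newick), "newick")`, as in the cell above.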
