# Generating Comprehensive Annotation Reports

This Jupyter notebook generates a report file from the results of annotation tools such as BLAST or DIAMOND. It requires two inputs: the annotation results in TSV format and a table containing additional information about the transcripts (the cells below take the `transcript`, `log2FoldChange`, and `padj` columns from it). For each database, the notebook reads the hits, extracts the organism, gene, and protein product from the subject titles, merges them with the transcript table, and collects everything into a single report. The parameters and the report columns can be customized to fit specific research needs.

In [29]:
#Importing libraries
import pandas as pd
import matplotlib.pyplot as plt
import re
import numpy as np
pd.set_option('display.max_columns', None)

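def get_transcripts_from_id(transcripts, table):
    """Map each query ID to the table transcript whose name it starts with."""
    transcripts = transcripts.unique()

    dic = dict()

    for t in transcripts:
        for x in table.transcript:
            # Escape the name so characters such as "." are matched literally
            if re.match(re.escape(x), t):
                dic[t] = x
    return dic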

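def make_hyperlink(sseqid, database):
    """Return an Excel HYPERLINK formula pointing to the subject's database entry."""
    if database.lower() == 'nr':
        # Nr subject IDs: the accession is the first space-separated token
        protein_accession = sseqid.split(" ")[0]
        url = "https://www.ncbi.nlm.nih.gov/gene/?term={}"
    else:
        # UniProt subject IDs look like "db|accession|entry_name"
        protein_accession = sseqid.split("|")[1]
        url = "https://www.uniprot.org/uniprotkb/{}/entry"

    return '=HYPERLINK("%s", "%s")' % (url.format(protein_accession), protein_accession)

# Debugging helper: prints the database name
def prova(sseqid, database):
    print(database)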
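
The mapping step above can be illustrated in isolation. The following is a minimal sketch, not the notebook's own implementation: it assumes that each query ID is a transcript name followed by an optional suffix (for example a TransDecoder-style `.p1` tag) and compares names literally with `str.startswith`. The function name `map_queries_to_transcripts` and the example IDs are hypothetical.

```python
import pandas as pd

def map_queries_to_transcripts(query_ids, transcript_names):
    """Map each query ID to the transcript name it starts with.

    Assumption: a query ID is the transcript name plus an optional suffix.
    Names are compared literally, so "." or "|" are not treated as regex syntax.
    """
    # Try longer names first so that "Cluster-1.10" wins over "Cluster-1.1"
    names = sorted(set(transcript_names), key=len, reverse=True)
    mapping = {}
    for query in pd.Series(query_ids).unique():
        for name in names:
            if query.startswith(name):
                mapping[query] = name
                break
    return mapping

# Hypothetical IDs, for illustration only
queries = pd.Series(["Cluster-1.10.p1", "Cluster-1.1.p2", "Cluster-1.10.p1"])
table_names = pd.Series(["Cluster-1.1", "Cluster-1.10"])
mapping = map_queries_to_transcripts(queries, table_names)
print(mapping)
```

Trying the longest names first is a design choice for this sketch: when one transcript name is a prefix of another, the more specific name is kept.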

Before generating the report, customize the parameters in the cell below by editing the variables as described in the comments.

In [6]:
# Insert the names (or paths) of the tsv files
files = [
    "./bombina/bombina_corset_DEGS__unk_not-unk.fasta.transdecoder.cds_nr.tsv",
    "./bombina/bombina_corset_DEGS__unk_not-unk.fasta.transdecoder.cds_tr.tsv",
    "./bombina/bombina_corset_DEGS__unk_not-unk.fasta.transdecoder.cds_sp.tsv"
] 

# Insert the report title (also used as the output file name and in the 'row' column)
title = "bombina_pachypus_blastx"

# Insert the database names (the order must match the order of the result files)
databases_names = [
    "Nr", 
    "TrEMBL",
    "Swiss-Prot",
]

# Insert the path of the table with additional information
table_path = "./bombina/bombina_unref_vs_not_unkref_table_padj_0.05----log2fc_1.tsv"

# Insert the path of the report (without the file extension)
path = "./bombina/" + title

# Set the output format (the column names of the TSV files, in order)
# e.g. 
# outfmt = "qseqid qlen sseqid sallseqid slen qstart qend sstart send qseq full_qseq sseq full_sseq evalue bitscore score length pident nident mismatch positive gapopen gaps ppos qframe btop cigar staxids sscinames sskingdoms skingdoms sphylums stitle salltitles qcovhsp scovhsp qtitle qqual full_qqual qstrand"
# If the files already contain a header row with the column names, set outfmt = None
outfmt = "qseqid qlen sseqid sallseqid slen qstart qend sstart send qseq full_qseq sseq full_sseq evalue bitscore score length pident nident mismatch positive gapopen gaps ppos qframe btop cigar staxids sscinames sskingdoms skingdoms sphylums stitle salltitles qcovhsp scovhsp qtitle qqual full_qqual qstrand"

# Column names (modify this list to choose the columns of the report)
features = ["transcript", "row", "log2FoldChange", "padj", 
            "protein_accession", "sequence_identity", "alignment_length", 
            "evalue", "database", "gene", "locus_name", "sequence_description",
            "sequence_length", "organism", "protein_product"]
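
Because the order of `databases_names` must match the order of `files`, and the processing cell reads specific columns from each TSV file, it can help to check the settings before running the report. The following is a minimal sketch under those assumptions; `check_report_config` and `REQUIRED_COLUMNS` are hypothetical names that are not part of the notebook, and the example values are made up.

```python
# Columns that the processing cell reads from each annotation file
REQUIRED_COLUMNS = ("qseqid", "sseqid", "stitle", "slen", "pident", "length", "evalue")

def check_report_config(files, databases_names, outfmt, features):
    """Return a list of human-readable problems found in the settings."""
    problems = []
    if len(files) != len(databases_names):
        problems.append(
            f"{len(files)} result files but {len(databases_names)} database names"
        )
    # With outfmt = None the column names come from the file header,
    # so they cannot be checked here.
    if outfmt is not None:
        columns = outfmt.split()
        missing = [c for c in REQUIRED_COLUMNS if c not in columns]
        if missing:
            problems.append("outfmt lacks columns used later: " + ", ".join(missing))
    repeated = sorted({f for f in features if features.count(f) > 1})
    if repeated:
        problems.append("repeated report columns: " + ", ".join(repeated))
    return problems

# Hypothetical settings, for illustration only
problems = check_report_config(
    files=["hits_nr.tsv", "hits_sp.tsv"],
    databases_names=["Nr"],
    outfmt="qseqid sseqid evalue",
    features=["transcript", "evalue", "evalue"],
)
for problem in problems:
    print(problem)
```

An empty list means that none of these checks found a problem; it does not guarantee that the files themselves are well formed.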


In [30]:
df = pd.DataFrame()
table = pd.read_csv(table_path, sep='\t')
for i in range(len(files)):

    # Import the dataset (with outfmt = None, the column names are read from the file header)
    df_tmp = pd.read_csv(files[i], sep="\t", names=outfmt.split() if outfmt else None)

    df_tmp['transcript'] = df_tmp['qseqid'].map(get_transcripts_from_id(df_tmp['qseqid'], table))
    df_tmp['database'] = databases_names[i]
    df_tmp['row'] = title
    df_tmp['sequence_identity'] = df_tmp.pident
    df_tmp['alignment_length'] = df_tmp.length
    df_tmp['evalue'] = df_tmp.evalue
    df_tmp['sequence_description'] = df_tmp.stitle
    df_tmp['sequence_length'] = df_tmp.slen

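    if "OS=" not in df_tmp.stitle.iloc[0]:
        # Nr-style titles: "<accession> <description> [<organism>]"
        def get_sciname(x):

            # Negative offset of the first character after the last '['
            os_index = - x[::-1].index('[')

            return x[os_index:-1]

        def get_protein_function(x):

            x_l = x.split(" ")

            # Words between the accession and the first token starting with '['
            return ' '.join(x_l[1:x_l.index(next(w for w in x_l if w.startswith('[')))])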
        
        def get_locus_name(x):
            return None
        
        def get_gene(x):
            return None
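    else:
        # UniProt-style titles: "<id> <description> OS=<organism> OX=<taxon> GN=<gene> PE=... SV=..."
        def get_sciname(x):

            os_index = x.index('OS=')
            ox_index = x.index('OX=')

            return x[os_index+3:ox_index-1]

        def get_protein_function(x):

            x_l = x.split(" ")

            # Words between the ID and the token starting with 'OS='
            return ' '.join(x_l[1:x_l.index(next(w for w in x_l if w.startswith('OS=')))])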
        
        def get_locus_name(x):
            return x.split("|")[2]
        
        def get_gene(x):

            try:
                gn_index = x.index('GN=')
                pe_index = x.index('PE=')
            except ValueError:
                # The title has no gene name (or no PE field)
                return None
            return x[gn_index+3:pe_index-1]
        
    df_tmp['gene'] = df_tmp.stitle.apply(get_gene)
    df_tmp['organism'] = df_tmp.stitle.apply(get_sciname)
    df_tmp['protein_accession'] = df_tmp.apply(lambda x: make_hyperlink(x.sseqid, x.database), axis=1)
    df_tmp['protein_product'] = df_tmp.stitle.apply(get_protein_function)
    df_tmp['locus_name'] = df_tmp.sseqid.apply(get_locus_name)

    df_tmp = pd.merge(df_tmp, table, on='transcript')

    df = pd.concat([df, df_tmp[features]])

df.sort_values(['transcript', 'evalue'], inplace=True)

df.reset_index(drop=True, inplace=True)
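
The title parsing in the cell above distinguishes two layouts: UniProt-style titles with `OS=`, `OX=`, and `GN=` fields, and Nr-style titles that end with the organism in square brackets. The following is a minimal sketch of the same idea using regular expressions, which returns `None` for missing fields instead of raising an exception. It is an alternative illustration, not the notebook's code; `parse_title`, both patterns, and the example titles are hypothetical.

```python
import re

# "<id> <product> OS=<organism> OX=<taxon> [GN=<gene>] PE=... SV=..."
UNIPROT_TITLE = re.compile(
    r"^\S+\s+(?P<product>.*?)\s+OS=(?P<organism>.*?)\s+OX=\S+"
    r"(?:\s+GN=(?P<gene>.*?)(?=\s+PE=|$))?"
)
# "<accession> <product> [<organism>]"
NR_TITLE = re.compile(r"^\S+\s+(?P<product>.*)\s+\[(?P<organism>[^\[\]]+)\]$")

def parse_title(stitle):
    """Extract product, organism, and gene from a subject title."""
    fields = {"product": None, "organism": None, "gene": None}
    pattern = UNIPROT_TITLE if "OS=" in stitle else NR_TITLE
    match = pattern.search(stitle)
    if match:
        fields.update(match.groupdict())
    return fields

# Made-up titles, for illustration only
uniprot = parse_title(
    "tr|X0X0X0|X0X0X0_EXAMP Example protein OS=Bombina pachypus OX=0 GN=abc1 PE=4 SV=1"
)
nr = parse_title("XP_000000000.1 example protein [Bombina pachypus]")
print(uniprot)
print(nr)
```

A title that matches neither layout yields a dictionary whose values are all `None`, so a single malformed row does not stop the whole run.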

If you want to remove repeated information (e.g. the transcript name), run the cell below. You can also modify the given list of columns.

In [31]:
df.loc[df.duplicated(subset=['transcript', 'row', 'log2FoldChange', 'padj']), 'transcript':'padj'] = ''
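
The effect of this step is easier to see on a small table. The following is a minimal sketch on made-up data: rows that repeat the same transcript-level values keep their hit-specific columns, while the repeated values are replaced with empty strings. The explicit conversion to `object` dtype is an assumption of this sketch; it avoids mixing strings into numeric columns, which recent pandas versions warn about.

```python
import pandas as pd

# Made-up report rows: two hits for transcript "t1", one for "t2"
report = pd.DataFrame({
    "transcript": ["t1", "t1", "t2"],
    "row": ["demo", "demo", "demo"],
    "log2FoldChange": [1.5, 1.5, -2.0],
    "padj": [0.01, 0.01, 0.03],
    "evalue": [1e-50, 1e-20, 1e-30],
})

key_columns = ["transcript", "row", "log2FoldChange", "padj"]

# Use object dtype so empty strings can sit next to numeric values
report = report.astype({column: object for column in key_columns})

# Every row after the first with the same key values counts as a repeat
repeated = report.duplicated(subset=key_columns)
report.loc[repeated, key_columns] = ""

print(report)
```

Only the second `t1` row is blanked; its `evalue` is left untouched, so each hit remains readable under the transcript it belongs to.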

Run the cell below to save the report.

In [32]:
df.to_excel(path + '.xlsx', index=False)