# Annotation Report: Multi-database Annotation Summary

This Jupyter notebook generates a report from the results of annotation tools such as BLAST or Diamond. It needs two inputs: the annotation results in TSV format and a table with additional information about the transcripts. For each table, the notebook merges the hits from all databases with the transcript information, keeps up to 20 hits per transcript (lowest e-value first), and writes the result as a TSV file and an Excel workbook. Because it runs as a notebook, the results can also be inspected and visualized interactively, and the parameters can be adjusted to specific research needs.

First, run the following cell to load the libraries and define the helper functions.

In [80]:
#Importing libraries
import pandas as pd
from openpyxl.utils import get_column_letter
import os
import time
pd.set_option('display.max_columns', None)

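def get_transcripts_from_id(transcripts, table):
    # Map each query ID to the transcript name in `table` that is a prefix of it.
    # If several transcript names match, the last one in the table wins.
    transcripts = transcripts.unique()

    dic = dict()

    for t in transcripts:
        for x in table.transcript:
            if t.startswith(x):
                dic[t] = x
    return dic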

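def make_hyperlink(sseqid, database):
    # Build an Excel HYPERLINK formula pointing to the NCBI or UniProt page of the hit
    try:
        if database.lower() == 'nr':
            # Nr IDs: the accession is the first space-separated token
            protein_accession = sseqid.split(" ")[0]
            url = "https://www.ncbi.nlm.nih.gov/gene/?term={}"
        else:
            # UniProt IDs look like "db|ACCESSION|ENTRY_NAME"
            protein_accession = sseqid.split("|")[1]
            url = "https://www.uniprot.org/uniprotkb/{}/entry"
    except (AttributeError, IndexError):
        # Missing or malformed ID: report it and leave the cell empty
        print("Could not parse sseqid:", sseqid)
        return ""

    return '=HYPERLINK("%s", "%s")' % (url.format(protein_accession), protein_accession)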

def get_accession(sseqid, database):
    # Return the bare accession (no hyperlink); empty string if the ID cannot be parsed
    try:
        if database.lower() == 'nr':
            return sseqid.split(" ")[0]
        return sseqid.split("|")[1]
    except (AttributeError, IndexError):
        return ""
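To make the two identifier layouts concrete, the sketch below applies the same splitting rules to one illustrative identifier of each kind. The helper name `split_accession` and the sample identifiers are assumptions made for this example; it mirrors the rule used by `get_accession` and is not part of the notebook itself.

```python
def split_accession(sseqid, database):
    """Hypothetical helper mirroring the splitting rule of get_accession."""
    try:
        if database.lower() == "nr":
            # Nr identifiers: the accession is the first space-separated token
            return sseqid.split(" ")[0]
        # UniProt identifiers look like "db|ACCESSION|ENTRY_NAME"
        return sseqid.split("|")[1]
    except (AttributeError, IndexError):
        # Missing value (e.g. NaN) or unexpected identifier layout
        return ""


# Illustrative identifiers, not taken from the report data
examples = [
    ("XP_001234567.1", "Nr"),
    ("sp|P12345|AATM_RABIT", "Swiss-Prot"),
    ("malformed-id", "TrEMBL"),
]
for sseqid, database in examples:
    print(database, "->", repr(split_accession(sseqid, database)))
```

An identifier without a `|` separator yields an empty string instead of raising, so a single malformed hit does not interrupt the report generation.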

Next, customize the generation parameters by editing the variables in the cell below. The comments in the cell explain each one.

In [83]:
# Insert the names (or paths) of the tsv files
files = [
    "../culex_pipiens/blast_nr.tsv",
    "../culex_pipiens/blast_tr.tsv",
    "../culex_pipiens/blast_sp.tsv"
] 

# Insert the title of the graph
title = "Blast"

# Insert the database names (the order must match the order of the result files)
databases_names = [
    "Nr", 
    "TrEMBL",
    "Swiss-Prot",
]

# Insert the path of the directory containing the tables (with additional information)
table_path = "./culex_pipiens/tables"

# Insert the path prefix of the report files
path = "./results/" + title

# Set the output format, i.e. the column names of the result files
# e.g.
# outfmt = "qseqid qlen sseqid sallseqid slen qstart qend sstart send qseq full_qseq sseq full_sseq evalue bitscore score length pident nident mismatch positive gapopen gaps ppos qframe btop cigar staxids sscinames sskingdoms skingdoms sphylums stitle salltitles qcovhsp scovhsp qtitle qqual full_qqual qstrand"
# If the files already contain a header row with the column names, set outfmt = None
#outfmt = "qseqid qlen sseqid sallseqid slen qstart qend sstart send qseq full_qseq sseq full_sseq evalue bitscore score length pident nident mismatch positive gapopen gaps ppos qframe btop cigar stitle salltitles qcovhsp scovhsp qtitle qqual full_qqual qstrand"
outfmt = None
# Column names (modify this list by inserting the column names of the report)
features = ["transcript", "log2FoldChange", "padj", 
            "protein_accession", "sequence_identity", "alignment_length", 
            "evalue", "database", "gene", "locus_name", "sequence_description",
            "sequence_length", "organism", "protein_product"]

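Because the database names are paired with the result files by position, a mismatch between the two lists would only surface partway through the run. A minimal sketch of a pre-flight check is shown here; the function `check_config` and its messages are hypothetical and not part of the notebook.

```python
import os


def check_config(files, databases_names, table_path):
    """Return a list of human-readable problems; the list is empty if none are found."""
    problems = []
    # The two lists are paired by index, so their lengths must agree
    if len(files) != len(databases_names):
        problems.append(
            f"{len(files)} result files but {len(databases_names)} database names"
        )
    for f in files:
        if not os.path.isfile(f):
            problems.append(f"result file not found: {f}")
    # The tables directory is listed with os.listdir, so it must be a directory
    if not os.path.isdir(table_path):
        problems.append(f"table directory not found: {table_path}")
    return problems


# Example with deliberately mismatched, non-existent inputs (hypothetical names)
issues = check_config(
    ["missing_results_a.tsv", "missing_results_b.tsv"], ["Nr"], "no_such_directory"
)
for issue in issues:
    print("Config problem:", issue)
```

Calling `check_config(files, databases_names, table_path)` with the variables defined above, before building the table, would report such problems up front.
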
Now you can run the cell below to build the table. 

In [84]:
def generate_df(table):

    df = pd.DataFrame()
    table = pd.read_csv(table_path + "/" + table, sep='\t')
    table.index.name = 'transcript'
    tools = []
    table.reset_index(inplace=True)

    for i in range(len(files)):

        #Import the dataset
        if outfmt is None:
            df_tmp = pd.read_csv(files[i], sep="\t", low_memory=False)
        else:
            df_tmp = pd.read_csv(files[i], sep="\t", names=outfmt.split(), low_memory=False)

        df_tmp['qseqid'] = df_tmp['qseqid'].map(get_transcripts_from_id(df_tmp['qseqid'], table))
        df_tmp['database'] = databases_names[i]

        tools = df_tmp.row.unique().tolist()
        for row in tools:
            df_tmp[row] = (df_tmp["row"] == row).astype(int)
        df_tmp = df_tmp.groupby(["qseqid", "sseqid", "pident", "slen", "stitle", "length", "evalue", "database"]).sum().reset_index()
        for row in tools:
            df_tmp[row] = df_tmp[row].map(lambda x: row if x == 1 else "")
        df_tmp.drop("row", axis=1, inplace=True)

        df_tmp.rename(columns={'pident': 'sequence_identity',
                                'length': 'alignment_length',
                                'stitle': 'sequence_description',
                                'slen':   'sequence_length'
        }, inplace=True)

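        # Nr titles give the organism in trailing square brackets;
        # UniProt titles use OS=/OX=/GN=/PE= tags instead
        if "OS=" not in df_tmp.sequence_description[0]:
            def get_sciname(x):
                # Organism: text after the last '[' up to the closing bracket
                try:
                    os_index = - x[::-1].index('[')
                except (TypeError, ValueError):
                    return ""

                return x[os_index:-1]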

            # Protein product: the words between the leading ID and the bracketed organism
            def get_protein_function(x):

                x_l = x.split(" ")

                try:
                    nex = ' '.join(x_l[1:x_l.index(next(x for x in x_l if x.startswith('[')))])
                except:
                    return ""

                return nex
            
            def get_locus_name(x):
                return None
            
            def get_gene(x):
                return None
        else:
            def get_sciname(x):

                try:
                    os_index = x.index('OS=')
                    ox_index = x.index('OX=')
                except:
                    return ""

                return x[os_index+3:ox_index-1]

            # Protein product: the words between the leading ID and the OS= tag
            def get_protein_function(x):

                x_l = x.split(" ")

                try:
                    nex = ' '.join(x_l[1:x_l.index(next(x for x in x_l if x.startswith('OS=')))])
                except:
                    return ""

                return nex
            
            def get_locus_name(x):
                try:
                    return x.split("|")[1]
                except:
                    return ""
            
            def get_gene(x):

                try:
                    gn_index = x.index('GN=')
                    pe_index = x.index('PE=')
                except:
                    return None
                return x[gn_index+3:pe_index-1]
            
        df_tmp['gene'] = df_tmp.sequence_description.apply(lambda x: get_gene(x))
        df_tmp['organism'] = df_tmp.sequence_description.apply(lambda x: get_sciname(x))
        df_tmp['protein_accession'] = df_tmp.apply(lambda x: make_hyperlink(x.sseqid, x.database), axis=1)
        df_tmp['protein_product'] = df_tmp.sequence_description.apply(lambda x: get_protein_function(x))
        df_tmp['locus_name'] = df_tmp.sseqid.apply(lambda x: get_locus_name(x))

        df_tmp = pd.merge(df_tmp, table, left_on='qseqid', right_on='transcript', how='inner')

        df = pd.concat([df, df_tmp[["transcript"] + tools + features[1:]]])

        print("File", files[i], "done!")

    df.sort_values(['transcript', 'evalue'], inplace=True)

    df.reset_index(drop=True, inplace=True)

    df = df.groupby('transcript').head(20)

    df.loc[df.duplicated(subset=['transcript', 'log2FoldChange', 'padj']), ['transcript', 'log2FoldChange', 'padj']] = ''
    #unmatched_transcript = pd.Series(list(set(table.transcript) - set(df.transcript)), name='unmatched_transcript')

    df = df[["transcript"] + tools + features[1:]]

    return df

def generate_xlsx(df, path, t):
    # Use the openpyxl engine explicitly: the column-width code below relies on its API
    df_writer = pd.ExcelWriter(path + "_" + t.split("/")[-1] + '.xlsx', engine='openpyxl')

    # Write the DataFrame to the working ExcelWriter
    df.to_excel(df_writer, sheet_name='report', index=False)

    # Get the openpyxl worksheet object
    worksheet = df_writer.sheets['report']

    # Set the column width
    for column in range(df.shape[1]):
        column_length = max(df.iloc[:, column].astype(str).map(len).max(), len(df.columns[column])) + 3
        column_letter = get_column_letter(column + 1)
        worksheet.column_dimensions[column_letter].width = column_length

    # Save the worksheet
    df_writer.close()
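The UniProt branch of `generate_df` derives the protein product, organism, and gene from the tagged sequence title. The sketch below reproduces that slicing on a single title so the expected layout is easier to see. The function `parse_uniprot_title` and the sample title are assumptions for illustration (the title only follows the UniProt FASTA header style), and unlike the notebook's `get_gene`, this sketch returns an empty string rather than `None` when a tag is missing.

```python
def parse_uniprot_title(title):
    """Hypothetical helper: split a UniProt-style title into product, organism and gene."""
    tokens = title.split(" ")
    # The product is everything between the leading identifier and the "OS=" token
    os_pos = next((i for i, tok in enumerate(tokens) if tok.startswith("OS=")), None)
    product = " ".join(tokens[1:os_pos]) if os_pos is not None else ""

    def between(start_tag, end_tag):
        # Text after start_tag, up to the space that precedes end_tag
        start, end = title.find(start_tag), title.find(end_tag)
        if start == -1 or end == -1:
            return ""
        return title[start + len(start_tag):end - 1]

    return {
        "protein_product": product,
        "organism": between("OS=", "OX="),
        "gene": between("GN=", "PE="),
    }


# Illustrative title in UniProt FASTA header style, not taken from the report data
sample = (
    "sp|P12345|AATM_RABIT Aspartate aminotransferase, mitochondrial "
    "OS=Oryctolagus cuniculus OX=9986 GN=GOT2 PE=1 SV=2"
)
parsed = parse_uniprot_title(sample)
print(parsed)
```

Titles from Nr carry no `OS=` tag, which is why the notebook checks the first description and switches to the bracket-based parsing for that database.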

In [85]:
# Save the super-table for each table in the tables directory
for t in os.listdir(table_path):

    print("Analysing", t)

    # get initial time
    start_time = time.time()

    df = generate_df(t)

    df.head()

    df.to_csv(path + "_" + t + '.tsv', sep='\t', index=False)

    generate_xlsx(df, path, t)

    print("Table", t, "done in", "--- %s seconds ---" % (time.time() - start_time))

Analysing UP-reg_padj_0.05--log2fc_1_table_ResistantCOST____ResistantDFB.txt
File ./culex_pipiens/blast_nr.tsv done!
File ./culex_pipiens/blast_tr.tsv done!
File ./culex_pipiens/blast_sp.tsv done!
Table UP-reg_padj_0.05--log2fc_1_table_ResistantCOST____ResistantDFB.txt done in --- 75.82250189781189 seconds ---
Analysing UP-reg_padj_0.05--log2fc_1_table_SusceptibleCOST____SusceptibleDFB.txt
File ./culex_pipiens/blast_nr.tsv done!
File ./culex_pipiens/blast_tr.tsv done!
File ./culex_pipiens/blast_sp.tsv done!
Table UP-reg_padj_0.05--log2fc_1_table_SusceptibleCOST____SusceptibleDFB.txt done in --- 82.96428084373474 seconds ---
Analysing DOWN-reg_padj_0.05--log2fc_1_table_SusceptibleCOST____ResistantCOST.txt
File ./culex_pipiens/blast_nr.tsv done!
File ./culex_pipiens/blast_tr.tsv done!
File ./culex_pipiens/blast_sp.tsv done!
Table DOWN-reg_padj_0.05--log2fc_1_table_SusceptibleCOST____ResistantCOST.txt done in --- 15.278906345367432 seconds ---
Analysing DOWN-reg_padj_0.05--log2fc_1_table_