In [1]:
import pandas as pd

## The .daa.tsv file

This file contains the result of the [alignment](https://en.wikipedia.org/wiki/Sequence_alignment) step run in the DeepARG pipeline. Its columns are:

<table style="border-collapse:collapse;border-color:rgb(200,200,200);border-width:1px" cellspacing="0" bordercolor="#888" border="1">
<tbody>
<tr>
<td style="text-align:left;width:24px;height:20px">&nbsp;1.</td>
<td style="width:77px;height:20px">&nbsp;<code>qseqid</code></td>
<td style="width:307px;height:20px">&nbsp;query (e.g., unknown gene) sequence id</td>
</tr>
<tr>
<td style="text-align:left;width:24px;height:20px">&nbsp;2.</td>
<td style="width:77px;height:20px">&nbsp;<code>sseqid</code></td>
<td style="width:307px;height:20px">&nbsp;subject (e.g., reference genome) sequence id</td>
</tr>
<tr>
<td style="text-align:left;width:24px;height:20px">&nbsp;3.</td>
<td style="width:77px;height:20px">&nbsp;<code>pident</code></td>
<td style="width:307px;height:20px">&nbsp;percentage of identical matches</td>
</tr>
<tr>
<td style="text-align:left;width:24px;height:20px">&nbsp;4.</td>
<td style="width:77px;height:20px">&nbsp;<code>length</code></td>
<td style="width:307px;height:20px">&nbsp;alignment length (sequence overlap)<br></td>
</tr>
<tr>
<td style="text-align:left;width:24px;height:20px">&nbsp;5.</td>
<td style="width:77px;height:20px">&nbsp;<code>mismatch</code></td>
<td style="width:307px;height:20px">&nbsp;number of mismatches</td>
</tr>
<tr>
<td style="text-align:left;width:24px;height:20px">&nbsp;6.</td>
<td style="width:77px;height:20px">&nbsp;<code>gapopen</code></td>
<td style="width:307px;height:20px">&nbsp;number of gap openings</td>
</tr>
<tr>
<td style="text-align:left;width:24px;height:20px">&nbsp;7.</td>
<td style="width:77px;height:20px">&nbsp;<code>qstart</code></td>
<td style="width:307px;height:20px">&nbsp;start of alignment in query</td>
</tr>
<tr>
<td style="text-align:left;width:24px;height:20px">&nbsp;8.</td>
<td style="width:77px;height:20px">&nbsp;<code>qend</code></td>
<td style="width:307px;height:20px">&nbsp;end of alignment in query</td>
</tr>
<tr>
<td style="text-align:left;width:24px;height:20px">&nbsp;9.</td>
<td style="width:77px;height:20px">&nbsp;<code>sstart</code></td>
<td style="width:307px;height:20px">&nbsp;start of alignment in subject</td>
</tr>
<tr>
<td style="text-align:left;width:24px;height:20px">&nbsp;10.</td>
<td style="width:77px;height:20px">&nbsp;<code>send</code></td>
<td style="width:307px;height:20px">&nbsp;end of alignment in subject</td>
</tr>
<tr>
<td style="text-align:left;width:24px;height:20px">&nbsp;11.</td>
<td style="width:77px;height:20px">&nbsp;<code>evalue</code></td>
<td style="width:307px;height:20px">&nbsp;<a href="http://www.metagenomics.wiki/tools/blast/evalue">expect value</a></td>
</tr>
<tr>
<td style="text-align:left;width:24px;height:20px">&nbsp;12.</td>
<td style="width:77px;height:20px">&nbsp;<code>bitscore</code></td>
<td style="width:307px;height:20px">&nbsp;<a href="http://www.metagenomics.wiki/tools/blast/evalue"><b>bit score</b></a></td>
</tr>
</tbody>
</table>
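
To make the coordinate columns concrete, here is a minimal sketch that derives the span covered on the query and on the subject. It assumes the coordinates are 1-based and inclusive, as in BLAST-style tabular output; the `toy_coords` frame and the `query_span`/`subject_span` column names are hypothetical, and the values are taken from the first two rows of the sample shown in this notebook.

```python
import pandas as pd

# First two rows of the sample output (subset of columns).
toy_coords = pd.DataFrame(
    {
        "qstart": [15, 5],
        "qend": [382, 396],
        "sstart": [18, 2],
        "send": [384, 392],
        "length": [369, 392],
    }
)

# Assumption: coordinates are 1-based and inclusive, so span = end - start + 1.
toy_coords["query_span"] = toy_coords["qend"] - toy_coords["qstart"] + 1
toy_coords["subject_span"] = toy_coords["send"] - toy_coords["sstart"] + 1

print(toy_coords)
```

Under that assumption, `length` can exceed either span because the alignment length also counts gap positions.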

In [2]:
columns = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen", "qstart", "qend", "sstart", "send", "evalue", "bitscore"]

align_df = pd.read_csv("output/ORFs.align.daa.tsv", sep='\t', header=None, names=columns)

A sample of the alignments produced by the algorithm is shown below.

In [6]:
align_df.head()

Unnamed: 0,qseqid,sseqid,pident,length,mismatch,gapopen,qstart,qend,sstart,send,evalue,bitscore
0,WP_000003698.1,ALV80601.1|FEATURES|bcr-1|bicyclomycin|bcr-1,36.6,369,231,3,15,382,18,384,3.2e-61,231.9
1,WP_000003698.1,YP_001437159|FEATURES|bicyclomycin-multidrug_e...,31.6,392,267,1,5,396,2,392,9.799999999999999e-58,220.3
2,WP_000003698.1,YP_001569731|FEATURES|bicyclomycin-multidrug_e...,30.1,395,275,1,5,399,2,395,8.3e-57,217.2
3,WP_000003698.1,YP_002237398|FEATURES|bicyclomycin-multidrug_e...,30.6,395,273,1,5,399,2,395,5.4e-56,214.5
4,WP_000003698.1,BAH64410|FEATURES|bicyclomycin-multidrug_efflu...,30.4,395,274,1,5,399,2,395,2e-55,212.6


The longest alignment found by the alignment algorithm:

In [7]:
# Bracket access avoids any clash between the "length" column and DataFrame attributes
align_df[align_df["length"] == align_df["length"].max()]

Unnamed: 0,qseqid,sseqid,pident,length,mismatch,gapopen,qstart,qend,sstart,send,evalue,bitscore
17295,WP_001027064.1,gi:654923461:ref:WP_028373991.1:|FEATURES|mexF...,42.8,1079,580,8,3,1055,2,1069,7.1e-249,856.7


## Other ARG files

The rest of the output consists of two TSV files: one with ARGs and one with potential ARGs.

The threshold that decides whether a gene is classified as an ARG or a potential ARG is controlled by the program and defaults to 0.8.
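
As a rough illustration of that split, the sketch below partitions a toy table at 0.8. It assumes the threshold applies to the `probability` column and that the comparison is inclusive (`>=`); neither detail is confirmed here. The `toy_predictions` frame is hypothetical, with values copied from the outputs shown in this notebook.

```python
import pandas as pd

THRESHOLD = 0.8  # default value stated in the text

# A few (gene, probability) pairs copied from the notebook outputs.
toy_predictions = pd.DataFrame(
    {
        "#ARG": ["OMPR", "PATA", "ADEL", "TAEA"],
        "probability": [0.999959, 0.94717, 0.205386, 0.113474],
    }
)

# Assumption: a prediction counts as an ARG when probability >= THRESHOLD,
# and as a potential ARG otherwise.
is_arg = toy_predictions["probability"] >= THRESHOLD
toy_args = toy_predictions[is_arg]
toy_potential_args = toy_predictions[~is_arg]

print(toy_args)
print(toy_potential_args)
```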


In [21]:
args_df = pd.read_csv("output/ORFs.mapping.ARG", sep='\t')
pot_args_df = pd.read_csv("output/ORFs.mapping.potential.ARG", sep='\t')

In [22]:
args_df.head()

Unnamed: 0,#ARG,query-start,query-end,read_id,predicted_ARG-class,best-hit,probability,identity,alignment-length,alignment-bitscore,alignment-evalue,counts
0,OMPR,1,228,WP_000680577.1,multidrug,gi:446603231:ref:WP_000680577.1:|FEATURES|ompR...,0.999959,100.0,228,458.0,1.6e-129,1
1,ADEN,1,217,WP_024437117.1,multidrug,AGV28567.1|FEATURES|adeN|multidrug|adeN,0.99585,98.6,217,438.0,1.6e-123,1
2,ADES,1,357,WP_031975145.1,multidrug,ADM92606.1|FEATURES|adeS|multidrug|adeS,0.99582,96.1,357,677.6,1.9999999999999998e-195,1
3,PATA,71,450,WP_024437390.1,fluoroquinolone,NP_417544.5|FEATURES|patA|fluoroquinolone|patA,0.94717,34.7,383,197.6,6.899999999999999e-51,1
4,MEXT,1,329,WP_001047619.1,multidrug,NC_011595.7059912.p01|FEATURES|mexT|multidrug|...,0.999996,100.0,329,662.1,7.9e-191,1


In [19]:
pot_args_df.head()

Unnamed: 0,#ARG,query-start,query-end,read_id,predicted_ARG-class,best-hit,probability,identity,alignment-length,alignment-bitscore,alignment-evalue,counts
0,ADEL,1,295,WP_024437490.1,multidrug,ALH22601.1|FEATURES|adeL|multidrug|adeL,0.205386,33.9,298,156.4,1.2999999999999998e-38,1
1,ADEL,1,295,WP_024437490.1,peptide,undefined,0.270672,33.9,298,156.4,1.2999999999999998e-38,1
2,CAT_CHLORAMPHENICOL_ACETYLTRANSFERASE,13,199,WP_000380745.1,MLS,ZP_01950974|FEATURES|cat_chloramphenicol_acety...,0.271802,47.2,193,151.8,2.2e-37,1
3,CAT_CHLORAMPHENICOL_ACETYLTRANSFERASE,13,199,WP_000380745.1,phenicol,ZP_01950974|FEATURES|cat_chloramphenicol_acety...,0.728197,47.2,193,151.8,2.2e-37,1
4,TAEA,1,640,WP_000323468.1,multidrug,APB03219.1|FEATURES|TaeA|pleuromutilin|TaeA,0.113474,34.9,647,375.2,3.7e-104,1
