# TF sequence predictor
This notebook shows how, given a new transcription factor (TF) sequence as INPUT, to find the closest TFs and their ligands in the database.

Other alignment tools (such as Needle) can be used, but BLAST (blastp), which this notebook uses, is simple and fast and works well for querying an INPUT against a local database.

In [95]:
from Bio.Blast.Applications import NcbiblastpCommandline
import pandas as pd

First, I need to create the DATABASE FASTA file from the .csv file. The QUERY sequence is passed to blastp on standard input later on.

In [96]:
# Load the TF database and keep one row per unique amino-acid sequence.
database = pd.read_csv('./TF_DB_clean.csv')
sequences = database.astype(str).drop_duplicates(subset=['AA_sequence'])
# Write one FASTA record per sequence. The header is the NCBI accession,
# falling back to the UniProt ID when the accession is missing
# (after astype(str), a missing value is the string 'nan').
with open('database_file.fasta', 'w') as data_file:
    for _, row in sequences.iterrows():
        if row.NCBI_Accession != 'nan':
            data_file.write(f'>{row.NCBI_Accession}\n{row.AA_sequence}\n')
        else:
            data_file.write(f'>{row.UniProt}\n{row.AA_sequence}\n')
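Because duplicates are dropped by sequence and not by identifier, two different sequences could in principle share the same header. As far as I know, `makeblastdb -parse_seqids` rejects duplicate sequence IDs, so a quick check before building the database can save a confusing error. The sketch below uses only the standard library; the helper name `duplicate_fasta_ids` and the toy records are hypothetical.

```python
from collections import Counter


def duplicate_fasta_ids(fasta_text):
    """Return the sorted record IDs that occur more than once in FASTA text."""
    ids = []
    for line in fasta_text.splitlines():
        if line.startswith('>'):
            parts = line[1:].split()
            if parts:  # skip headers that carry no ID at all
                ids.append(parts[0])
    counts = Counter(ids)
    return sorted(seq_id for seq_id, n in counts.items() if n > 1)


# Toy FASTA text (made-up IDs and sequences) with one repeated header.
example = ">ID_A\nMRFK\n>ID_B\nMKLM\n>ID_A\nMTQL\n"
print(duplicate_fasta_ids(example))
```

On the real file, the same function can be called with `open('database_file.fasta').read()`; an empty list means every header is unique.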

```bash
makeblastdb -in database_file.fasta -parse_seqids -blastdb_version 5 -dbtype prot -out BLASTdb
```

In [106]:
query = 'MRFKGLDLNLLVALDALMTERNLTAAARKINLSQPAMSAAIARLRSYFRDELFTMRGRELVPTPGAEALAGPVREALLHIQLSIISRDAFDPTQSSRRFRVILSDFMTIVFFRRIVDRIAQEAPAVRFELLPFSDEPGELLRRGEVDFLILPELFMSSAHPKATLFDETLVCVGCRTNKQLLRPLTFEKYNSTGHVTAKFGRALRPNLEEWFLLEHGLKRRIEVVVQGFSLIPPMLLDTGRIGTMPLRLARHFEKRMPLRIVEPPLPLPTFTEAVQWPAFHNTDPASIWMRRILLEEATNMGSAHREIPTRRRC'  #your string from some external source
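# Build the blastp command. Output format 6 is tab-separated, here with
# subject ID, percent identity, E-value and bit score.
blastp_cline = NcbiblastpCommandline(db="BLASTdb", outfmt="6 sseqid pident evalue bitscore")
# The query sequence is passed on standard input.
out, err = blastp_cline(stdin=query)
#print(out)
#print(err)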
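Recent Biopython releases mark the command-line wrappers in `Bio.Blast.Applications` as deprecated and recommend calling the tools through `subprocess` instead. The following is a minimal sketch of an equivalent call, assuming the `blastp` executable is on the PATH and the `BLASTdb` database from above exists. The helper names `build_blastp_command` and `run_blastp` are hypothetical.

```python
import subprocess


def build_blastp_command(db, fields=("sseqid", "pident", "evalue", "bitscore")):
    """Build the blastp argument list for tabular (format 6) output."""
    return ["blastp", "-db", db, "-outfmt", "6 " + " ".join(fields)]


def run_blastp(query, db="BLASTdb"):
    """Run blastp with the query sequence on stdin and return its stdout text."""
    completed = subprocess.run(
        build_blastp_command(db),
        input=query,          # blastp reads the query from stdin by default
        capture_output=True,
        text=True,
        check=True,           # raise CalledProcessError on a non-zero exit
    )
    return completed.stdout


# Only the command is built here; run_blastp(query) needs a local BLAST+ install.
print(build_blastp_command("BLASTdb"))
```

With BLAST+ installed, `out = run_blastp(query)` should give the same tab-separated text that the following cells parse.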

In [107]:
results = [i.split('\t') for i in out.splitlines()]

In [99]:
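# Load the hits into a DataFrame, one column per requested BLAST field.
blast_df = pd.DataFrame(data=results, columns=['NCBI_Accession', 'id_pc','e_value','bit_score'])
# Keep only the accession: the second '|'-separated field of the subject ID.
blast_df['NCBI_Accession'] = blast_df['NCBI_Accession'].apply(lambda x : x.split('|')[1])
# Convert the bit score to a number so hits can be sorted by it.
blast_df['bit_score'] = pd.to_numeric(blast_df['bit_score'])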
blast_df

Unnamed: 0,NCBI_Accession,id_pc,e_value,bit_score
0,WP_194456231.1,100.000,0.0,632.0
1,WP_207159894.1,69.967,1.34e-159,444.0
2,WP_010967456,70.100,7.80e-157,437.0
3,WP_012172315.1,52.159,2.12e-102,299.0
4,CAA88827.1,51.827,8.81e-102,298.0
...,...,...,...,...
145,YP_093758.1,33.333,6.0,24.6
146,YP_001334511.1,36.364,6.2,24.3
147,ZP_01038496.1,52.941,7.2,24.6
148,YP_001423239.1,33.333,8.1,24.3
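The `x.split('|')[1]` step above assumes every subject ID contains at least one `|`; a bare identifier would raise an `IndexError`. The sketch below is a slightly more defensive version. The helper name `accession_from_sseqid` and the example `ref|...|` format are assumptions about what blastp returns for this database.

```python
def accession_from_sseqid(sseqid):
    """Extract the accession from a BLAST subject ID.

    For IDs such as 'ref|WP_194456231.1|' the second field is returned,
    as in the original lambda. IDs without a '|' are returned unchanged.
    """
    parts = [p for p in sseqid.split('|') if p]
    if len(parts) >= 2:
        return parts[1]
    return parts[0] if parts else ''


print(accession_from_sseqid('ref|WP_194456231.1|'))
print(accession_from_sseqid('WP_010967456'))
```

It can be dropped into the cell above as `blast_df['NCBI_Accession'].apply(accession_from_sseqid)`.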


In [100]:
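# Join the hits to the database on the shared NCBI_Accession column (inner join), best bit score first.
scored_db = database.merge(blast_df).sort_values(by=['bit_score'], ascending=False).reset_index(drop=True)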
scored_db

Unnamed: 0,Molecule,InChI,SMILES,Organism_exp,Organism_wt,Synthase_gene,TF,Bibliographic_ref,Database_ref,Comments,Source,Type,Operator_seq,NCBI_Accession,UniProt,AA_sequence,id_pc,e_value,bit_score
0,naringenin,InChI=1S/C15H12O5/c16-9-3-1-8(2-4-9)13-7-12(19...,O=C1CC(c2ccc(O)cc2)Oc2cc(O)cc(O)c21,,Rhizobium tropici,,nodD,PMID:8419293,RegTransBase v20120406 (20170227),,,,,WP_194456231.1,,MRFKGLDLNLLVALDALMTERNLTAAARKINLSQPAMSAAIARLRS...,100.000,0.0,632.0
1,naringenin,InChI=1S/C15H12O5/c16-9-3-1-8(2-4-9)13-7-12(19...,O=C1CC(c2ccc(O)cc2)Oc2cc(O)cc(O)c21,,R.leguminosarum,,nodD,PMID:12799442,RegTransBase v20120406 (20170227),,,,,WP_207159894.1,,MRFKGLDLNLLVALDALMTERKLTAAARSINLSQPAMSAAISRLRA...,69.967,1.34e-159,444.0
2,naringenin,InChI=1S/C15H12O5/c16-9-3-1-8(2-4-9)13-7-12(19...,O=C1CC(c2ccc(O)cc2)Oc2cc(O)cc(O)c21,,Sinorhizobium meliloti,,nodD1,1021/acssynbio.8b00326,,Chimeric LysR-Type Transcriptional Biosensors ...,,,,WP_010967456,,MRFRGLDLNLLVALDALMTERKLTAAARRINLSQPAMSAAIARLRT...,70.100,7.80e-157,437.0
3,naringenin,InChI=1S/C15H12O5/c16-9-3-1-8(2-4-9)13-7-12(19...,O=C1CC(c2ccc(O)cc2)Oc2cc(O)cc(O)c21,,Azorhizobium caulinodans,,nodD,PMID:7590297,RegTransBase v20120406 (20170227),,,,,WP_012172315.1,,MRFKGLDLNLLVALNALLSEHSVTSAAKSINLSQPAMSAAVQRLRI...,52.159,2.12e-102,299.0
4,naringenin,InChI=1S/C15H12O5/c16-9-3-1-8(2-4-9)13-7-12(19...,O=C1CC(c2ccc(O)cc2)Oc2cc(O)cc(O)c21,,Azorhizobium caulinodans ORS571,,nodD,PMID:2158977,RegTransBase v20120406 (20170227),,,,,CAA88827.1,,MRFKGLDLNLLVALNALLSEHSVTSAAKSINLSQPAMSAAVQRLRI...,51.827,8.81e-102,298.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
210,2-deoxy-5-keto-d-gluconate 6-phosphate,InChI=1S/C6H11O9P/c7-3(1-5(9)10)6(11)4(8)2-15-...,O=C([O-])C[C@@H](O)[C@H](O)C(=O)COP(=O)([O-])[O-],,Bacillus licheniformis DSM 13 = ATCC 14580,,iolr,,https://regprecise.lbl.gov/sites.jsp?regulog_i...,,RegPrecise,,,YP_093758.1,,MKLMRIKEMEDYILTNGTVSLDELCQVFNVSKNTVRRDINKLTEKG...,33.333,6.0,24.6
211,uracil,"InChI=1S/C4H4N2O2/c7-3-1-2-5-4(8)6-3/h1-2H,(H2...",Oc1ccnc(O)n1,,Roseovarius sp. 217,,rutr,,https://regprecise.lbl.gov/sites.jsp?regulog_i...,,RegPrecise,,,ZP_01038496.1,,MGLSRDLAKLVNGVMKNTPRKERKRMPQAPAAQAGRKPSRIQLRNR...,52.941,7.2,24.6
212,manganese(mn2+),InChI=1S/Mn/q+2,[Mn+2],,Klebsiella pneumoniae subsp. pneumoniae MGH 78578,,mntr,,https://regprecise.lbl.gov/sites.jsp?regulog_i...,,RegPrecise,,,YP_001334511.1,,MTQLVNVEEHVEGFRQVREAHRRELIDDYVELISDLINEVGEARQV...,36.364,6.2,24.3
213,2-deoxy-5-keto-d-gluconate 6-phosphate,InChI=1S/C6H11O9P/c7-3(1-5(9)10)6(11)4(8)2-15-...,O=C([O-])C[C@@H](O)[C@H](O)C(=O)COP(=O)([O-])[O-],,Bacillus amyloliquefaciens subsp. plantarum st...,,iolr,,https://regprecise.lbl.gov/sites.jsp?regulog_i...,,RegPrecise,,,YP_001423239.1,,MKLMRIQEMEEYILKHGATSLDELCEVFNVSKNTVRRDINKLAEKG...,33.333,8.1,24.3
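The last rows of the table have E-values between 6 and 8, which suggests chance matches and not true homologs, so it may help to filter the hits before merging. The sketch below does this on a toy subset of the values shown above, with an explicit merge key. The cutoff `MAX_E_VALUE = 1e-5` is an assumed, commonly used threshold and not a value taken from this notebook.

```python
import pandas as pd

# Toy subset of the database and BLAST hits shown above.
database = pd.DataFrame({
    "Molecule": ["naringenin", "uracil", "naringenin"],
    "NCBI_Accession": ["WP_194456231.1", "ZP_01038496.1", "WP_207159894.1"],
})
blast_df = pd.DataFrame({
    "NCBI_Accession": ["WP_194456231.1", "WP_207159894.1", "ZP_01038496.1"],
    "e_value": ["0.0", "1.34e-159", "7.2"],
    "bit_score": ["632.0", "444.0", "24.6"],
})

# BLAST fields arrive as strings; convert them before comparing or sorting.
for col in ("e_value", "bit_score"):
    blast_df[col] = pd.to_numeric(blast_df[col])

MAX_E_VALUE = 1e-5  # assumed cutoff, adjust to taste
significant = blast_df[blast_df["e_value"] <= MAX_E_VALUE]

# Explicit key and join type make the merge behavior easier to audit.
scored = (
    database.merge(significant, on="NCBI_Accession", how="inner")
    .sort_values("bit_score", ascending=False)
    .reset_index(drop=True)
)
print(scored[["Molecule", "NCBI_Accession", "bit_score"]])
```

Note that the merge key is `NCBI_Accession`, so database rows that were written to the FASTA file under their UniProt ID will not match a hit unless they are joined on `UniProt` separately.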
