# Example notebook: using the PseudomonasDotComScraper as a programmatic interface to the pseudomonas.com database

## Table of contents
[Load required python modules](#Load-required-python-modules)  
[Setting things up](#Setting-things-up)  
[Retrieve the data](#Retrieve-the-data)  
[Display the data](#Display-the-data)  
[List what data is in the results](#List-all-keys-in-the-results-dict)  
[Get the data for one queried gene](#Get-the-data-for-one-queried-gene)  
[Display one table](#Display-one-table)  
[Display a given table for all three genes](#Display-a-given-table-for-all-three-genes)  
[Select all rows with a given value in one column](#Select-all-rows-with-a-given-value-in-one-column)  
[Save to disk](#Save-to-disk)  
[Read from disk](#Read-results-from-disk)  
[Example with references](#Example-for-a-query-with-references)  
[Display references](#Display-references-with-proper-html-links)

## Load required python modules 

In [2]:
# The scraper
from GenDBScraper.PseudomonasDotComScraper import PseudomonasDotComScraper as scraper

# The query object (derived from collections.namedtuple)
from GenDBScraper.PseudomonasDotComScraper import pdc_query

# Regular expressions
import re

# pandas DataFrame, the workhorse data structure
import pandas

## Setting things up 

In [3]:
# We want to get data for three adjacent genes, pflu0915, pflu0916, pflu0917
queries = [pdc_query(strain='sbw25', feature=feat) for feat in ['pflu0915', 'pflu0916', 'pflu0917']]
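
Because `pdc_query` is described above as derived from `collections.namedtuple`, a small stand-in can illustrate how such a query object behaves. This is only a sketch: `DemoQuery` and `result_key` are hypothetical names, the real `pdc_query` may define more fields, and the `strain__feature` key format is inferred from the result keys printed further down in this notebook.

```python
from collections import namedtuple

# Hypothetical stand-in for pdc_query, limited to the two fields used in this notebook.
DemoQuery = namedtuple("DemoQuery", ["strain", "feature"])


def result_key(query):
    """Join strain and feature with a double underscore (format inferred, not confirmed)."""
    return f"{query.strain}__{query.feature}"


demo_queries = [DemoQuery(strain="sbw25", feature=feat)
                for feat in ["pflu0915", "pflu0916", "pflu0917"]]

# Fields are accessible by name, as with any namedtuple.
demo_keys = [result_key(q) for q in demo_queries]
print(demo_keys)
```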

In [4]:
# Set up the scraper. Note: this rebinds the name `scraper` from the class to the instance.
scraper = scraper(query=queries)

In [5]:
# Connect to the database
scraper.connect()

INFO: Good response from https://www.pseudomonas.com .


## Retrieve the data

In [6]:
results = scraper.run_query()

DEBUG: Will now open https://www.pseudomonas.com/primarySequenceFeature/list?c1=name&v1=pflu0915&e1=1&term1=sbw25&assembly=complete .
INFO: Good response from https://www.pseudomonas.com/primarySequenceFeature/list?c1=name&v1=pflu0915&e1=1&term1=sbw25&assembly=complete .
INFO: Good response from https://www.pseudomonas.com/feature/show?id=1459887 .
INFO: Good response from https://www.pseudomonas.com/feature/show?id=1459887&view=functions .
DEBUG: Will now open https://www.pseudomonas.com/primarySequenceFeature/list?c1=name&v1=pflu0916&e1=1&term1=sbw25&assembly=complete .
INFO: Good response from https://www.pseudomonas.com/primarySequenceFeature/list?c1=name&v1=pflu0916&e1=1&term1=sbw25&assembly=complete .
INFO: Good response from https://www.pseudomonas.com/feature/show?id=1459889 .
INFO: Good response from https://www.pseudomonas.com/feature/show?id=1459889&view=functions .
DEBUG: Will now open https://www.pseudomonas.com/primarySequenceFeature/list?c1=name&v1=pflu0917&e1=1&term1=sb
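
The search URLs in the log above follow a regular pattern, which can be reproduced with the standard library. The sketch below only rebuilds the URL as it appears in the log; `build_search_url` is a hypothetical helper, and the scraper itself may assemble its requests differently.

```python
from urllib.parse import urlencode

# Base address taken from the DEBUG lines in the log output.
BASE_URL = "https://www.pseudomonas.com/primarySequenceFeature/list"


def build_search_url(strain, feature):
    """Rebuild the search URL seen in the log for a given strain and feature."""
    params = {
        "c1": "name",
        "v1": feature,
        "e1": 1,
        "term1": strain,
        "assembly": "complete",
    }
    return f"{BASE_URL}?{urlencode(params)}"


search_url = build_search_url("sbw25", "pflu0915")
print(search_url)
```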

## Display the data

In [7]:
results

{'sbw25__pflu0915': {'Gene Feature Overview':                                               1
  0                                              
  Strain            Pseudomonas fluorescens SBW25
  Locus Tag                              PFLU0915
  Name                                        NaN
  Replicon                             chromosome
  Genomic location  1015020  - 1015640 (+ strand),
  'Cross-References':                      1
  0                     
  RefSeq  YP_002870577.1
  GI           229588458
  Entrez         7816630,
  'Product':                                                         1
  0                                                        
  Feature Type                                          CDS
  Coding Frame                                            1
  Product\tName                        hypothetical protein
  Synonyms                                              NaN
  Evidence for Translation                              NaN
  Charge (pH 7)             

The results object is a two-level nested dictionary:  
 results  
 |  
 +---sbw25__pflu0915  
      |  
      +---Gene Feature Overview  
      +---Cross-References  
      +---Orthologs/Comparative Genomics  
      .  
      .  
      .  
 +---sbw25__pflu0916  
      |  
      +---Gene Feature Overview  
      +---Cross-References  
      +---Orthologs/Comparative Genomics  
      .  
      .  
      .  
 +---sbw25__pflu0917  
      |  
      +---Gene Feature Overview  
      +---Cross-References  
      +---Orthologs/Comparative Genomics  
      .  
      .  
      .  
       
The lowest level of the hierarchy ("Gene Feature Overview", "Cross-References",  
etc.) holds the data tables downloaded from pseudomonas.com. They are  
instances of the pandas.DataFrame class, a versatile data  
structure that supports operations such as slicing,  
selection based on values and ranges, and much more.  
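
The operations mentioned above can be tried on a toy structure of the same shape. This is a sketch on made-up data: `toy_results`, the key `geneA`, and the table name `Domains` are illustrative only and do not come from the database.

```python
import pandas

# Toy stand-in for the scraper output: {query key: {table name: DataFrame}}.
toy_results = {
    "geneA": {
        "Domains": pandas.DataFrame({
            "Analysis": ["Pfam", "CDD", "Pfam"],
            "Amino Acid Start": [1, 80, 383],
            "Amino Acid Stop": [192, 264, 431],
        })
    },
}

domains = toy_results["geneA"]["Domains"]

# Slicing: the first two rows, by position.
first_two = domains.iloc[:2]

# Selection by value: rows where Analysis is "Pfam".
pfam_only = domains[domains["Analysis"] == "Pfam"]

# Selection by range: domains starting within the first 100 residues (bounds inclusive).
early = domains[domains["Amino Acid Start"].between(1, 100)]

print(len(first_two), len(pfam_only), len(early))
```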

### List all keys in the results dict

In [8]:
list(results.keys())

['sbw25__pflu0915', 'sbw25__pflu0916', 'sbw25__pflu0917']

### Get the data for one queried gene

In [9]:
pflu0915_data = results['sbw25__pflu0915']

In [10]:
# List the names of all tables available for the first gene.
list(pflu0915_data)

['Gene Feature Overview',
 'Cross-References',
 'Product',
 'Subcellular localization',
 'Pathogen Association Analysis',
 'Orthologs/Comparative Genomics',
 'Interactions',
 'References',
 'Gene Ontology',
 'Functional Classifications Manually Assigned by PseudoCAP',
 'Functional Predictions from Interpro']

### Display one table 

In [11]:
# Display the functional predictions from Interpro.
display(pflu0915_data['Functional Predictions from Interpro'])

| | Analysis | Accession | Description | Interpro Accession | Interpro Description | Amino Acid Start | Amino Acid Stop | E-value |
|---|---|---|---|---|---|---|---|---|
| 0 | Gene3D | G3DSA:3.90.1680.10 | | IPR036590 | SOS response associated peptidase-like | 1 | 206 | 3.4e-46 |
| 1 | SUPERFAMILY | SSF143081 | | IPR036590 | SOS response associated peptidase-like | 2 | 205 | 1.09e-47 |
| 2 | Pfam | PF02586 | SOS response associated peptidase (SRAP) | IPR003738 | SOS response associated peptidase (SRAP) | 1 | 192 | 1.6e-38 |


### Display a given table for all three genes 

In [12]:
# Display functional predictions from all three genes.
for f in results.keys():
    print("\n\n")
    print(f)
    display(results[f]['Functional Predictions from Interpro'])




sbw25__pflu0915


| | Analysis | Accession | Description | Interpro Accession | Interpro Description | Amino Acid Start | Amino Acid Stop | E-value |
|---|---|---|---|---|---|---|---|---|
| 0 | Gene3D | G3DSA:3.90.1680.10 | | IPR036590 | SOS response associated peptidase-like | 1 | 206 | 3.4e-46 |
| 1 | SUPERFAMILY | SSF143081 | | IPR036590 | SOS response associated peptidase-like | 2 | 205 | 1.09e-47 |
| 2 | Pfam | PF02586 | SOS response associated peptidase (SRAP) | IPR003738 | SOS response associated peptidase (SRAP) | 1 | 192 | 1.6e-38 |





sbw25__pflu0916


| | Analysis | Accession | Description | Interpro Accession | Interpro Description | Amino Acid Start | Amino Acid Stop | E-value |
|---|---|---|---|---|---|---|---|---|
| 0 | CDD | cd06225 | HAMP | IPR003660 | HAMP domain | 383 | 431 | 8.02722E-7 |
| 1 | Gene3D | G3DSA:1.10.287.950 | | | | 381 | 712 | 1.2E-78 |
| 2 | Gene3D | G3DSA:3.30.450.20 | | | | 64 | 72 | 2.7E-40 |
| 3 | Pfam | PF00672 | HAMP domain | IPR003660 | HAMP domain | 383 | 431 | 1.3E-8 |
| 4 | SUPERFAMILY | SSF58104 | | | | 403 | 712 | 8.89E-82 |
| 5 | Gene3D | G3DSA:3.30.450.20 | | | | 238 | 341 | 2.7E-40 |
| 6 | Pfam | PF02743 | Cache domain | IPR033479 | Double Cache domain 1 | 47 | 331 | 1.2E-16 |
| 7 | Coils | Coil | | | | 584 | 604 | - |
| 8 | Gene3D | G3DSA:3.30.450.20 | | | | 73 | 237 | 2.7E-40 |
| 9 | Pfam | PF00015 | Methyl-accepting chemotaxis protein (MCP) sign... | IPR004089 | Methyl-accepting chemotaxis protein (MCP) sign... | 495 | 678 | 2.6E-45 |





sbw25__pflu0917


| | Analysis | Accession | Description | Interpro Accession | Interpro Description | Amino Acid Start | Amino Acid Stop | E-value |
|---|---|---|---|---|---|---|---|---|
| 0 | Pfam | PF01435 | Peptidase family M48 | IPR001915 | Peptidase M48 | 75 | 259 | 9.3e-35 |
| 1 | ProSiteProfiles | PS51257 | Prokaryotic membrane lipoprotein lipid attachm... | | | 1 | 21 | 6.0 |
| 2 | CDD | cd07331 | M48C_Oma1_like | | | 80 | 264 | 6.94438e-83 |


### Select all rows with a given value in one column 

In [13]:
# Display all Pfam analyses

In [14]:
# Temporary list of per-gene tables
frames = []

# Iterate over all three results.
for q, r in results.items():
    # Take the functional predictions
    f = r['Functional Predictions from Interpro']
    # Select only rows where Analysis is "Pfam"; copy so the original table stays untouched.
    pfam = f[f['Analysis'] == 'Pfam'].copy()
    # Add a column to denote the gene.
    newcol = [q] * len(pfam)
    pfam.insert(0, column="Feature", value=newcol)
    
    # Append to the temporary holder.
    frames.append(pfam)

# Concatenate into one pandas DataFrame
tmp = pandas.concat(frames)

In [15]:
display(tmp)

| | Feature | Analysis | Accession | Description | Interpro Accession | Interpro Description | Amino Acid Start | Amino Acid Stop | E-value |
|---|---|---|---|---|---|---|---|---|---|
| 2 | sbw25__pflu0915 | Pfam | PF02586 | SOS response associated peptidase (SRAP) | IPR003738 | SOS response associated peptidase (SRAP) | 1 | 192 | 1.6000000000000001e-38 |
| 3 | sbw25__pflu0916 | Pfam | PF00672 | HAMP domain | IPR003660 | HAMP domain | 383 | 431 | 1.3e-08 |
| 6 | sbw25__pflu0916 | Pfam | PF02743 | Cache domain | IPR033479 | Double Cache domain 1 | 47 | 331 | 1.2e-16 |
| 9 | sbw25__pflu0916 | Pfam | PF00015 | Methyl-accepting chemotaxis protein (MCP) sign... | IPR004089 | Methyl-accepting chemotaxis protein (MCP) sign... | 495 | 678 | 2.6e-45 |
| 0 | sbw25__pflu0917 | Pfam | PF01435 | Peptidase family M48 | IPR001915 | Peptidase M48 | 75 | 259 | 9.3e-35 |
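
An alternative to building the "Feature" column by hand is to let `pandas.concat` label each table with its dictionary key. The following is a sketch on made-up data (the keys `geneA`, `geneB` and the accession strings are placeholders), not a replacement for the cell above.

```python
import pandas

# Toy per-gene tables; the accession values are placeholders, not real identifiers.
toy_tables = {
    "geneA": pandas.DataFrame({"Analysis": ["Pfam", "CDD"], "Accession": ["acc-1", "acc-2"]}),
    "geneB": pandas.DataFrame({"Analysis": ["Pfam"], "Accession": ["acc-3"]}),
}

# Passing a dict makes concat use the keys as the outer index level.
combined = pandas.concat(toy_tables, names=["Feature", "row"])

# Filter on the value, then turn the outer index level into a regular column.
pfam_rows = combined[combined["Analysis"] == "Pfam"].reset_index(level="Feature")

print(pfam_rows)
```

With this approach the gene label travels with each row from the start, so no copy or manual column insertion is needed.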


## Save to disk

In [16]:
scraper.to_json(results, outfile="sbw25.json")

'sbw25.json'

## Read results from disk

In [17]:
loaded = scraper.from_json('sbw25.json')

In [18]:
loaded

{'sbw25__pflu0915': {'Gene Feature Overview':                                               1
  Genomic location  1015020  - 1015640 (+ strand)
  Locus Tag                              PFLU0915
  Name                                       None
  Replicon                             chromosome
  Strain            Pseudomonas fluorescens SBW25,
  'Cross-References':                      1
  Entrez         7816630
  GI           229588458
  RefSeq  YP_002870577.1,
  'Product':                                                         1
  Charge (pH 7)                                        2.14
  Coding Frame                                            1
  Evidence for Translation                             None
  Feature Type                                          CDS
  Isoelectric Point (pI)                               8.48
  Kyte-Doolittle Hydrophobicity Value                -0.358
  Molecular Weight (kDa)                               23.2
  Product\tName                        hypo
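
To make the round trip more concrete, here is one way a nested dictionary of DataFrames could be written to and read back from JSON using only `json` and pandas. This is an assumption-laden sketch: `dump_nested` and `load_nested` are hypothetical helpers and are not how `to_json` and `from_json` of the scraper are necessarily implemented.

```python
import json

import pandas


def dump_nested(results):
    """Serialise {key: {table name: DataFrame}} into a single JSON string."""
    payload = {
        key: {name: json.loads(table.to_json(orient="split"))
              for name, table in tables.items()}
        for key, tables in results.items()
    }
    return json.dumps(payload)


def load_nested(text):
    """Rebuild the nested dictionary of DataFrames from the JSON string."""
    return {
        # Each stored table has the keys "columns", "index" and "data".
        key: {name: pandas.DataFrame(**spec) for name, spec in tables.items()}
        for key, tables in json.loads(text).items()
    }


# Toy data with one missing value.
toy = {"geneA": {"Overview": pandas.DataFrame(
    {"value": ["chromosome", float("nan")]}, index=["Replicon", "Name"])}}

restored = load_nested(dump_nested(toy))
print(restored["geneA"]["Overview"])
```

JSON has no NaN, so the missing value is stored as `null`; this is consistent with the `None` entries visible in the loaded output above.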

## Example for a query with references

In [19]:
results_pa = scraper.run_query(query=pdc_query(strain='UCBPP-PA14', feature='PA14_67210'))

DEBUG: Will now open https://www.pseudomonas.com/primarySequenceFeature/list?c1=name&v1=PA14_67210&e1=1&term1=UCBPP-PA14&assembly=complete .
INFO: Good response from https://www.pseudomonas.com/primarySequenceFeature/list?c1=name&v1=PA14_67210&e1=1&term1=UCBPP-PA14&assembly=complete .
INFO: Good response from https://www.pseudomonas.com/feature/show?id=1661780 .
INFO: Good response from https://www.pseudomonas.com/feature/show?id=1661780&view=functions .


### Display references with proper html links

In [23]:
results_pa["UCBPP-PA14__PA14_67210"]['References'].style.format({'pubmed_url': lambda x: '<a href="{0}">link</a>'.format(x)})

| | citation | pubmed_url |
|---|---|---|
| 0 | Allsopp LP, Wood TE, Howard SA, Maggiorelli F, Nolan LM, et al. (2017). RsmA and AmrZ orchestrate the assembly of all three type VI secretion systems in Pseudomonas aeruginosa. Proc Natl Acad Sci U S A 114(29): 7707-7712. | link |
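
The same link formatting can also be produced as a plain HTML string, which may be handy outside a notebook. The sketch below uses `DataFrame.to_html` on a toy table; the citation text and URL are placeholders, and `as_link` is a hypothetical helper.

```python
import pandas

# Toy references table; citation and URL are placeholders, not real entries.
refs = pandas.DataFrame({
    "citation": ["Placeholder citation"],
    "pubmed_url": ["https://example.org/ref"],
})


def as_link(url):
    """Wrap a URL in an HTML anchor with the fixed label 'link'."""
    return f'<a href="{url}">link</a>'


# escape=False keeps the anchor tags intact instead of escaping them.
html = refs.to_html(formatters={"pubmed_url": as_link}, escape=False, index=False)
print(html)
```

In a notebook, the resulting string could be rendered with `IPython.display.HTML(html)`.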
