# Keyword Search Demo

![pdbj](https://pdbj.org/content/default.svg)

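This demo runs a keyword search query against the PDBj Mine 2 relational database (RDB) and filters MMTF structures by the PDB IDs (pdbid) that the query returns.
The filter searches the 'keyword' column in the brief_summary table for a keyword and returns the selected columns (here pdbid, resolution, biol_species, db_uniprot, db_pfam, and hit_score) for the matching entries.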

[PDBj Mine Search Website](https://pdbj.org/mine)
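
The filtering idea described above can be sketched in plain Python, without Spark. This is only an illustration of the concept, not mmtfPyspark's implementation: `filter_by_pdb_ids`, the sample entries, and the case-insensitive comparison are all assumptions made for the example.

```python
def filter_by_pdb_ids(entries, matching_ids):
    """Keep (pdb_id, structure) pairs whose ID appears in matching_ids.

    IDs are compared case-insensitively (an assumption for this sketch),
    since query results and structure keys may differ in case.
    """
    wanted = {pdb_id.upper() for pdb_id in matching_ids}
    return [(pdb_id, structure) for pdb_id, structure in entries
            if pdb_id.upper() in wanted]


# Hypothetical stand-ins for the MMTF entries and the IDs returned by a query.
entries = [("3POR", "structure-a"), ("1ABC", "structure-b"), ("2OMF", "structure-c")]
matching_ids = ["3por", "2omf", "2por"]

kept = filter_by_pdb_ids(entries, matching_ids)
print([pdb_id for pdb_id, _ in kept])  # ['3POR', '2OMF']
```

Only entries present in both the local data and the query result survive, which is why the filtered count can be smaller than the number of rows the query returns.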

## Imports

In [1]:
from pyspark import SparkConf, SparkContext
from mmtfPyspark.webFilters import PdbjMine
from mmtfPyspark.datasets import pdbjMineService
from mmtfPyspark.io import mmtfReader

## Configure Spark Context

In [2]:
conf = SparkConf().setMaster("local[*]") \
                  .setAppName("keywordSearch")

sc = SparkContext(conf=conf)

## Read in MMTF files from local directory

In [3]:
path = "../../resources/mmtf_full_sample/"

pdb = mmtfReader.read_sequence_file(path, sc)

## Apply a SQL search on PDBj using a filter

In [4]:
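# Keyword search for 'porin', best-scoring hits first
sql = "select pdbid, resolution, biol_species, db_uniprot, db_pfam, hit_score from keyword_search('porin') order by hit_score desc"

# Build the filter from the SQL query and keep only the matching entries
search = PdbjMine(sql)
count = pdb.filter(search).keys().count()
print(f"Number of entries using sql to filter: {count}")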

Number of entries using sql to filter: 6
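
To search for a different keyword, the query string can be assembled from its parts. The helper below is a hypothetical convenience, not part of mmtfPyspark or PDBj Mine; it reuses the column list and the `keyword_search` call from the query above and doubles any single quotes in the keyword, which is the standard SQL way to escape them.

```python
# Columns selected in the query above.
COLUMNS = ["pdbid", "resolution", "biol_species", "db_uniprot", "db_pfam", "hit_score"]


def keyword_search_sql(keyword, columns=COLUMNS):
    """Build the keyword search query for an arbitrary keyword (hypothetical helper)."""
    # Double single quotes so the keyword cannot terminate the SQL string literal.
    escaped = keyword.replace("'", "''")
    return (f"select {', '.join(columns)} "
            f"from keyword_search('{escaped}') order by hit_score desc")


sql = keyword_search_sql("porin")
print(sql)
```

For the keyword 'porin' this produces the same string as the hand-written query, so it can be passed to `PdbjMine` in the same way.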


## Apply a SQL search on PDBj and get a dataset

In [5]:
sql = "select pdbid, resolution, biol_species, db_uniprot, db_pfam, hit_score from keyword_search('porin') order by hit_score desc"

# Run the query once and keep the result as a dataset
dataset = pdbjMineService.get_dataset(sql)
dataset.show(10)

# Reuse the same dataset to build the filter
search = PdbjMine(dataset = dataset)
count = pdb.filter(search).keys().count()
print(f"Number of entries using dataset to filter: {count}")


+-----+----------+--------------------+--------------------+-----------+---------+
|pdbid|resolution|        biol_species|          db_uniprot|    db_pfam|hit_score|
+-----+----------+--------------------+--------------------+-----------+---------+
| 3por|       2.5|Rhodobacter capsu...|['P31243', 'PORI_...|['PF13609']| 0.095809|
| 2omf|       2.4|Escherichia coli K12|['OMPF_ECOLI', 'P...|['PF00267']|0.0954989|
| 2por|       1.8|Rhodobacter capsu...|['P31243', 'PORI_...|['PF13609']|0.0951392|
| 1gfq|       2.8|    Escherichia coli|['OMPF_ECOLI', 'P...|['PF00267']| 0.094717|
| 1gfp|       2.7|    Escherichia coli|['OMPF_ECOLI', 'P...|['PF00267']| 0.094717|
| 1gfo|       3.3|    Escherichia coli|['OMPF_ECOLI', 'P...|['PF00267']| 0.094717|
| 1gfn|       3.1|    Escherichia coli|['OMPF_ECOLI', 'P...|         []| 0.094717|
| 1gfm|       3.5|    Escherichia coli|['OMPF_ECOLI', 'P...|['PF00267']| 0.094717|
| 1bt9|       3.0|    Escherichia coli|['OMPF_ECOLI', 'P...|['PF00267']| 0.094717|
| 1h
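
The db_uniprot and db_pfam values in the output above are displayed with quotes and brackets, which suggests they are stored as strings that look like Python list literals rather than as array columns. Assuming that is the case, a minimal sketch for turning one such value into a real list uses `ast.literal_eval` from the standard library; `parse_list_column` is a hypothetical helper name.

```python
import ast


def parse_list_column(value):
    """Parse a list-literal string such as "['PF13609']" into a Python list."""
    parsed = ast.literal_eval(value)
    if not isinstance(parsed, list):
        raise ValueError(f"expected a list literal, got {value!r}")
    return parsed


# Example values as they appear in the db_pfam column above.
print(parse_list_column("['PF13609']"))  # ['PF13609']
print(parse_list_column("[]"))           # []
```

`ast.literal_eval` only evaluates literals, so it is a safer choice than `eval` for values that come from an external service.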

## Terminate Spark Context

In [6]:
sc.stop()