# NVIDIA Gave Me a \$15K Data Science Workstation - here's what I did with it
### Creating a GPU Accelerated Pubmed Search Engine
This notebook is an adaptation of my Towards Data Science article, available [here](https://towardsdatascience.com/nvidia-gave-me-a-15k-data-science-workstation-heres-what-i-did-with-it-70cfb069fc35).

## 0. Download and Process XML Data
This step gives you example data to work with for the tutorial. For the actual post, I worked with all of Pubmed. For the sake of brevity, this notebook uses the abstracts from a single Pubmed file, although you could repeat the process for every file in the directory.

I explicitly chose a file that has newer abstracts. Make sure your computer has enough processing power to handle this part of the process.

In [12]:
# data dir
!mkdir data
# download single Pubmed XML from the directory
!wget https://mbr.nlm.nih.gov/Download/Baselines/2019/pubmed19n0972.xml.gz 
!mv pubmed19n0972.xml.gz  data/pubmed-data.xml.gz
# unzip it
!gunzip data/pubmed-data.xml

mkdir: cannot create directory ‘data’: File exists
--2020-03-03 16:24:58--  https://mbr.nlm.nih.gov/Download/Baselines/2019/pubmed19n0972.xml.gz
Resolving mbr.nlm.nih.gov (mbr.nlm.nih.gov)... 130.14.53.15, 2607:f220:41e:7053::15
Connecting to mbr.nlm.nih.gov (mbr.nlm.nih.gov)|130.14.53.15|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 8755134 (8.3M) [application/x-gzip]
Saving to: ‘pubmed19n0972.xml.gz.2’


2020-03-03 16:24:58 (20.6 MB/s) - ‘pubmed19n0972.xml.gz.2’ saved [8755134/8755134]
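
To repeat the download for several files, you could generate the URLs instead of typing them. This is only a sketch: it assumes the other baseline files follow the same naming pattern as the one above (a zero-padded, four-digit file number), and `baseline_url` and the number range are hypothetical names and values chosen for illustration.

```python
BASE_URL = "https://mbr.nlm.nih.gov/Download/Baselines/2019/"


def baseline_url(file_number, year_suffix="19"):
    """Build the URL of one baseline file.

    Assumption: every file is named like pubmed19n0972.xml.gz, i.e. a
    two-digit year suffix followed by a zero-padded four-digit number.
    """
    return f"{BASE_URL}pubmed{year_suffix}n{file_number:04d}.xml.gz"


# Hypothetical range of file numbers, ending at the file used in this notebook.
urls = [baseline_url(n) for n in range(970, 973)]

for url in urls:
    print(url)
```

Each URL can then be passed to `wget` exactly as in the cell above.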



Now we need to parse the XML to CSV.

In [13]:
import pandas as pd
from bs4 import BeautifulSoup

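def get_file_text(path):
    """Read the whole XML file into memory as a string."""
    with open(path, "r", encoding="utf-8") as f:
        return f.read()

def get_pubmed_articles_from_xml(text, field="PubmedArticle"):
    """Return every <PubmedArticle> element found in the XML text."""
    soup = BeautifulSoup(text, "xml")
    return soup.find_all(field)

def get_pubmed_article_fields(soup, fields=("AbstractText", "Year")):
    """Map each field name to the text of its first matching tag ('' if absent)."""
    d = {}

    for f in fields:
        tag = soup.find(f)  # look the tag up once instead of twice
        d[f] = '' if tag is None else tag.text

    return d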
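
The helpers above read the whole file into memory before parsing it. If that is too heavy for your machine, a streaming parser is one alternative. The sketch below is not what the article used: it swaps BeautifulSoup for the standard library's `xml.etree.ElementTree.iterparse`, and `iter_pubmed_records` plus the tiny inline sample are hypothetical stand-ins for the real file.

```python
import io
import xml.etree.ElementTree as ET


def iter_pubmed_records(source, fields=("AbstractText", "Year")):
    """Yield one dict per <PubmedArticle>, keeping the first match of each field."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "PubmedArticle":
            continue
        record = {}
        for name in fields:
            node = elem.find(f".//{name}")
            # itertext() also collects text nested inside inline markup
            record[name] = "" if node is None else "".join(node.itertext())
        yield record
        elem.clear()  # drop the subtree we just processed to save memory


# Made-up sample that mimics the structure of the real XML.
sample = io.BytesIO(
    b"<PubmedArticleSet>"
    b"<PubmedArticle><MedlineCitation>"
    b"<DateRevised><Year>2018</Year></DateRevised>"
    b"<Article><Abstract><AbstractText>First abstract.</AbstractText></Abstract></Article>"
    b"</MedlineCitation></PubmedArticle>"
    b"<PubmedArticle><MedlineCitation>"
    b"<DateRevised><Year>2019</Year></DateRevised>"
    b"</MedlineCitation></PubmedArticle>"
    b"</PubmedArticleSet>"
)

records = list(iter_pubmed_records(sample))
print(records)
```

Passing a file path such as `data/pubmed-data.xml` instead of the in-memory sample should work the same way, since `iterparse` accepts either a filename or a file object.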

In [9]:
RAW_DATA = "data/pubmed-data.xml"

text = get_file_text(RAW_DATA)
documents = get_pubmed_articles_from_xml(text)
print(documents[0])
lis = []
for doc in documents:
    fields = get_pubmed_article_fields(doc)
    lis.append(fields)
    
df = pd.DataFrame(lis)
df.head()

<PubmedArticle>
<MedlineCitation Owner="NLM" Status="Publisher">
<PMID Version="1">30516271</PMID>
<DateRevised>
<Year>2018</Year>
<Month>12</Month>
<Day>05</Day>
</DateRevised>
<Article PubModel="Print-Electronic">
<Journal>
<ISSN IssnType="Print">0012-9658</ISSN>
<JournalIssue CitedMedium="Print">
<PubDate>
<Year>2018</Year>
<Month>Dec</Month>
<Day>05</Day>
</PubDate>
</JournalIssue>
<Title>Ecology</Title>
<ISOAbbreviation>Ecology</ISOAbbreviation>
</Journal>
<ArticleTitle>Spatial scale modulates the inference of metacommunity assembly processes.</ArticleTitle>
<ELocationID EIdType="doi" ValidYN="Y">10.1002/ecy.2576</ELocationID>
<Abstract>
<AbstractText>The abundance and distribution of species across the landscape depend on the interaction between local, spatial and stochastic processes. However, empirical syntheses relating these processes to spatio-temporal patterns of structure in metacommunities remains elusive. One important reason for this lack of synthesis is that the relati

| | AbstractText | Year |
|---|---|---|
| 0 | The abundance and distribution of species acro... | 2018 |
| 1 | Several recent methods address the dimension r... | 2018 |
| 2 | Research on regime shifts has focused primaril... | 2018 |
| 3 | The diversity-invasibility hypothesis and ecol... | 2018 |
| 4 | Most studies consider aboveground plant specie... | 2018 |


There is a more intensive and particular way to process the documents that can lead to 

In [15]:
df.to_csv("data/pubmed.csv")

## 1. Reading in Data in an Accelerated Manner

The first thing we can do is read in the data using dask. The wildcard matching makes it super easy to read a ton of csv files in a directory utilizing all the GPUs on your system. To monitor these GPUs, use `watch -n 0.5 nvidia-smi` in another terminal.

In [18]:
from dask_cuda import LocalCUDACluster
import dask_cudf
from dask.distributed import Client
import time

if __name__ == '__main__':  # need this for the cluster
    cluster = LocalCUDACluster()  # one worker per visible GPU (two on this workstation)
    client = Client(cluster)
    t0 = time.time()
    gdf = dask_cudf.read_csv('data/*.csv') # read all csv files
    abstract = gdf.AbstractText.compute()  # column name as written to data/pubmed.csv
    t1 = time.time()

    print("Read %s abstracts in %f seconds" % (len(abstract), t1-t0))

OSError: Could not load shared object file: libllvmlite.so
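
If the GPU stack fails to load, as it did in the run above, the wildcard read itself can still be checked on the CPU. The sketch below is a plain standard-library stand-in, not a dask or cudf API: `read_column` is a hypothetical helper, and the temporary files are made-up data that only mimic the layout of `data/pubmed.csv`.

```python
import csv
import glob
import os
import tempfile


def read_column(pattern, column):
    """Collect one column from every CSV file matching a wildcard pattern."""
    values = []
    for path in sorted(glob.glob(pattern)):
        with open(path, newline="", encoding="utf-8") as f:
            for row in csv.DictReader(f):
                values.append(row[column])
    return values


with tempfile.TemporaryDirectory() as tmp:
    # Two small CSV files with the same columns as data/pubmed.csv
    for name, text in (("a.csv", "First abstract."), ("b.csv", "Second abstract.")):
        with open(os.path.join(tmp, name), "w", newline="", encoding="utf-8") as f:
            writer = csv.writer(f)
            writer.writerow(["", "AbstractText", "Year"])
            writer.writerow([0, text, 2018])
    # A non-CSV file that the wildcard should skip
    with open(os.path.join(tmp, "raw.xml"), "w", encoding="utf-8") as f:
        f.write("<PubmedArticleSet/>")

    abstracts = read_column(os.path.join(tmp, "*.csv"), "AbstractText")

print(abstracts)
```

The `*.csv` pattern matters here because the data directory also holds the raw XML file, which should not be read as CSV.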

Dask is extremely powerful, but cudf offers more pandas-like functionality. Using simple Python code and only one of our GPUs, here we:

- read in every csv in our data dir
- lowercase all strings in the abstract column
- remove all punctuation

This data cleaning operation could almost certainly be improved upon for greater efficiency. However, for our purposes here it serves the use case well.

In [19]:
import cudf
import os
import time
import string

PATH = "data/"
COLUMN = "AbstractText"  # column name as written to data/pubmed.csv

start = time.time()
i = 0

for f in os.listdir(PATH):
    if not f.endswith(".csv"):
        continue  # skip the raw XML file that also lives in data/
    t0 = time.time()
    df = cudf.read_csv(os.path.join(PATH, f))  # read using cudf instead of pandas
    length = len(df.dropna(subset=[COLUMN]))
    df[COLUMN] = df[COLUMN].str.lower()
    df[COLUMN] = df[COLUMN].str.translate(str.maketrans('', '', string.punctuation))
    t1 = time.time()
    print("Processed %i abstracts in %s seconds" % (length, t1-t0))
    i += 1

end = time.time()
print("Processed %i files in %s seconds" % (i, end-start))

OSError: Could not load shared object file: libllvmlite.so
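
For reference, the same two cleaning steps can be written on the CPU with plain Python strings. This is a minimal sketch for checking what the cleaned text should look like, not a replacement for the cudf version; `clean_abstract` is a hypothetical helper name.

```python
import string

# Translation table that deletes every ASCII punctuation character.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_abstract(text):
    """Lowercase the text, then strip ASCII punctuation."""
    return text.lower().translate(PUNCT_TABLE)


cleaned = clean_abstract("Spatial scale, re-examined: (A) test!")
print(cleaned)
```

Note that deleting punctuation also joins hyphenated words, so "re-examined" becomes "reexamined".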

## 2. GPU Accelerated Information Retrieval

If you read my article, you saw that I failed to find a truly good way to use TF-IDF vectors here, so I'll skip that and move on to cosine similarity. When NVIDIA's cuml library includes a TF-IDF implementation, I will also add that to this notebook!
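
As a reminder of what cosine similarity computes, it is the dot product of two vectors divided by the product of their lengths. The pure-Python sketch below is illustrative only: the function name and the toy vectors are assumptions, and this is not the GPU code from the article.

```python
import math


def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their Euclidean norms."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)


# Toy 2-D vectors standing in for document embeddings.
query = [1.0, 1.0]
doc_close = [2.0, 2.0]   # same direction as the query
doc_far = [1.0, -1.0]    # orthogonal to the query

print(cosine_similarity(query, doc_close))
print(cosine_similarity(query, doc_far))
```

Because the score depends only on direction, a document vector that is a scaled copy of the query scores as high as the query itself.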

Here, we start with deep learning vectors. Let's download BERT.

In [None]:
# download model
!wget https://storage.googleapis.com/bert_models/2019_05_30/wwm_uncased_L-24_H-1024_A-16.zip
# unzip model
!unzip wwm_uncased_L-24_H-1024_A-16.zip
# start service with both GPUs available - do this in another terminal
# bert-serving-start -model_dir wwm_uncased_L-24_H-1024_A-16 -num_worker=2

--2020-03-03 16:54:18--  https://storage.googleapis.com/bert_models/2019_05_30/wwm_uncased_L-24_H-1024_A-16.zip
Resolving storage.googleapis.com (storage.googleapis.com)... 172.217.8.16, 2607:f8b0:4004:803::2010
Connecting to storage.googleapis.com (storage.googleapis.com)|172.217.8.16|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1248381879 (1.2G) [application/zip]
Saving to: ‘wwm_uncased_L-24_H-1024_A-16.zip’


2020-03-03 16:54:24 (191 MB/s) - ‘wwm_uncased_L-24_H-1024_A-16.zip’ saved [1248381879/1248381879]

Archive:  wwm_uncased_L-24_H-1024_A-16.zip
   creating: wwm_uncased_L-24_H-1024_A-16/
  inflating: wwm_uncased_L-24_H-1024_A-16/bert_model.ckpt.meta  
  inflating: wwm_uncased_L-24_H-1024_A-16/bert_model.ckpt.data-00000-of-00001  