# NVIDIA Gave Me a \$15K Data Science Workstation - here's what I did with it
### Creating a GPU Accelerated Pubmed Search Engine
This notebook is an adaptation of my Towards Data Science article, available [here](https://towardsdatascience.com/nvidia-gave-me-a-15k-data-science-workstation-heres-what-i-did-with-it-70cfb069fc35).

![img1](https://www.nvidia.com/content/dam/en-zz/Solutions/deep-learning/deep-learning-solutions/data-science/data-science-laptop-workstation-4c25-p@2x.jpg)

## 0. Download and Process XML Data
This step gives you example data to work with for the tutorial. For the actual post, I worked with all of Pubmed. For the sake of brevity, here I use the abstracts from a single Pubmed file, although you could repeat this process for every file in the directory. I explicitly chose a file that has newer abstracts. Make sure your computer has enough processing power to handle this part of the process.

In [None]:
# data dir (-p avoids an error if it already exists)
!mkdir -p data
# download single Pubmed XML from the directory
!wget https://mbr.nlm.nih.gov/Download/Baselines/2019/pubmed19n0972.xml.gz 
!mv pubmed19n0972.xml.gz  data/pubmed-data.xml.gz
# unzip it (leaves data/pubmed-data.xml)
!gunzip data/pubmed-data.xml.gz

Now we need to parse the XML to CSV.

In [None]:
import pandas as pd
from bs4 import BeautifulSoup

def get_file_text(path):
    # read the whole XML file into memory; be explicit about the encoding
    with open(path, "r", encoding="utf-8") as f:
        text = f.read()
    return text

def get_pubmed_articles(text, field="PubmedArticle"):
    soup = BeautifulSoup(text,"xml")
    documents = soup.find_all(field)
    return documents

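def get_pubmed_article_fields(soup, fields=("AbstractText", "Year")):
    # tuple default avoids sharing one mutable list across calls
    d = {}
    for f in fields:
        node = soup.find(f)  # first matching tag, or None if the field is absent
        d[f] = '' if node is None else node.text
    return d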

In [None]:
RAW_DATA = "data/pubmed-data.xml"

text = get_file_text(RAW_DATA)
documents = get_pubmed_articles(text)
print(documents[0])

Now we can extract the `AbstractText` and `Year` fields from each of these XML objects.

In [None]:
lis = []
for doc in documents:
    fields = get_pubmed_article_fields(doc)
    lis.append(fields)
    
df = pd.DataFrame(lis)
df.head()
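
The cells above read the whole file into memory before parsing it. If that is too heavy for your machine, one alternative is to stream the XML one article at a time. The sketch below uses the standard library's `xml.etree.ElementTree.iterparse` rather than BeautifulSoup; `iter_pubmed_records` and the tiny sample document are hypothetical and only illustrate the idea.

```python
import io
import xml.etree.ElementTree as ET

def iter_pubmed_records(source, fields=("AbstractText", "Year")):
    """Yield one dict per PubmedArticle without keeping the whole tree in memory."""
    for _, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "PubmedArticle":
            continue
        record = {}
        for name in fields:
            node = elem.find(".//" + name)  # first descendant with this tag
            record[name] = "" if node is None else "".join(node.itertext())
        yield record
        elem.clear()  # release the children of the article we just finished

# Made-up sample with the same tag names as the Pubmed file.
sample = io.BytesIO(
    b"<PubmedArticleSet>"
    b"<PubmedArticle><MedlineCitation>"
    b"<DateCompleted><Year>2019</Year></DateCompleted>"
    b"<Article><Abstract><AbstractText>Heart study.</AbstractText></Abstract></Article>"
    b"</MedlineCitation></PubmedArticle>"
    b"<PubmedArticle><MedlineCitation>"
    b"<DateCompleted><Year>2018</Year></DateCompleted>"
    b"</MedlineCitation></PubmedArticle>"
    b"</PubmedArticleSet>"
)
records = list(iter_pubmed_records(sample))
print(records)
```

For the real file you would pass the path `data/pubmed-data.xml` instead of the in-memory sample, then build the DataFrame from the yielded dicts.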

There is a more intensive and particular way to process the documents that can lead to 

In [None]:
df.to_csv("data/pubmed.csv")

## 1. GPU Accelerated Data Load and Processing with NVIDIA RAPIDS and Dask

![img2](https://devblogs.nvidia.com/wp-content/uploads/2018/10/pipeline-1024x382.png)

The first thing we can do is read in the data using Dask. The wildcard matching makes it super easy to read a ton of CSV files in a directory while utilizing all the GPUs on your system. To monitor these GPUs, run `watch -n 0.5 nvidia-smi` in another terminal.

In [None]:
from dask_cuda import LocalCUDACluster
import dask_cudf
from dask.distributed import Client
import time

if __name__ == '__main__':  # need this for the cluster
    cluster = LocalCUDACluster()  # one worker per visible GPU (two on this workstation)
    client = Client(cluster)
    t0 = time.time()
    gdf = dask_cudf.read_csv('data/*.csv') # read all csv files
    abstract = gdf.AbstractText.compute()  # column name written by the parsing step above
    t1 = time.time()

    print("Read %s abstracts in %f seconds" % (len(abstract), t1-t0))

Dask is extremely powerful, but there is more pandas-like functionality in cudf. Using simple Python code and only one of our GPUs, here we:

- read in every csv in our data dir
- lowercase all strings in the `AbstractText` column
- remove all punctuation

This data cleaning operation could almost certainly be improved upon for greater efficiency. However, for our purposes here it serves the use case well.

In [None]:
import cudf
import os
import time
import string

PATH = "data/"
COLUMN = "AbstractText"

start = time.time()
i = 0

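for f in os.listdir(PATH):
    if not f.endswith(".csv"):
        continue  # data/ also holds the raw XML, which cudf.read_csv cannot parse
    t0 = time.time()
    df = cudf.read_csv(os.path.join(PATH, f)) # read using cudf instead of pandas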
    length = len(df.dropna(subset=[COLUMN]))
    df[COLUMN] = df[COLUMN].str.lower()
    df[COLUMN] = df[COLUMN].str.translate(str.maketrans('','',string.punctuation))
    t1 = time.time()
    print("Processed %i abstracts in %s seconds" % (length, t1-t0))
    i += 1

end = time.time()
print("Processed %i files in %s seconds" % (i, end-start))
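
To see what the two string operations do to a single abstract, here is the same transformation in plain Python on the CPU. This is only an illustration; `clean_abstract` is a hypothetical helper name and the input string is made up.

```python
import string

# Translation table that deletes every ASCII punctuation character.
PUNCTUATION_TABLE = str.maketrans('', '', string.punctuation)

def clean_abstract(text):
    """Lowercase the text, then strip punctuation (hypothetical helper)."""
    return text.lower().translate(PUNCTUATION_TABLE)

cleaned = clean_abstract("Heart-Rate, and (ECG)!")
print(cleaned)
```

Note that hyphenated words are joined rather than split, so "Heart-Rate" becomes a single token. Whether that is desirable depends on your search use case.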

## 2. Basic Information Retrieval

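As context for how to approach the information retrieval problem, here is a basic search class using sklearn. It is CPU bound and uses cosine similarity between TF-IDF vectors to find similar documents. Because `TfidfVectorizer` L2-normalizes its vectors by default, the plain dot product computed by `linear_kernel` is the cosine similarity.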

In [None]:
import pickle
import numpy as np
import pandas as pd
import string
import sklearn.feature_extraction
from sklearn.metrics.pairwise import linear_kernel

class PubmedTfidfTrainServe:

    def __init__(self):
        self.vectorizer = None
        self.vectorizer_name = "vectorizer.pickle"
        self.lowercase = True
        self.text_data = None

    @staticmethod
    def preprocess_data(text):
        text = str(text).lower()
        text = text.translate(str.maketrans('', '', string.punctuation))
        return text.strip()

    def load_data(self, path, column):
        df = pd.read_csv(path)
        self.text_data = df[column]
        self.text_data = [self.preprocess_data(x) for x in self.text_data]

    def train(self):
        self.vectorizer = sklearn.feature_extraction.text.TfidfVectorizer(lowercase=self.lowercase)  
        self.tfidf = self.vectorizer.fit_transform(self.text_data)
        self.save_model()

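    def save_model(self):
        # context manager closes the file handle once the pickle is written
        with open(self.vectorizer_name, "wb") as f:
            pickle.dump(self.vectorizer, f)

    def load_model(self, model_path):
        with open(model_path, "rb") as f:
            self.vectorizer = pickle.load(f)

    def search(self, text, n_results=10):

        if not self.vectorizer:
            self.load_model(self.vectorizer_name)
        if getattr(self, "tfidf", None) is None:
            # only the vectorizer is pickled; the document matrix comes from train()
            raise RuntimeError("call load_data() and train() before search()")

        vector = self.vectorizer.transform([str(text)])
        cosine_similarities = linear_kernel(vector, self.tfidf).flatten()
        # sort descending and keep exactly n_results indices
        related_doc_indices = cosine_similarities.argsort()[::-1][:n_results]
        return [self.text_data[i] for i in related_doc_indices]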
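
To make the ranking step concrete, here is a minimal pure-Python sketch of TF-IDF weighting followed by cosine similarity on three toy documents. It is a simplified illustration, not what sklearn runs internally: tokenization is a plain whitespace split, the idf formula is the smoothed form that, as far as I know, matches sklearn's default, and `tfidf_vectors`, `weigh`, and `search` are hypothetical names.

```python
import math
from collections import Counter

def weigh(tokens, idf):
    """Turn a token list into an L2-normalized TF-IDF dict, ignoring unknown terms."""
    counts = Counter(t for t in tokens if t in idf)
    vec = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
    return {t: w / norm for t, w in vec.items()}

def tfidf_vectors(docs):
    """Return (one TF-IDF dict per document, idf table)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed idf: rare terms get a larger weight than common ones.
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    return [weigh(tokens, idf) for tokens in tokenized], idf

def search(query, docs, n_results=2):
    vectors, idf = tfidf_vectors(docs)
    q = weigh(query.lower().split(), idf)
    # Dot product of unit-length vectors is the cosine similarity.
    scores = [sum(w * vec.get(t, 0.0) for t, w in q.items()) for vec in vectors]
    order = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    return [docs[i] for i in order[:n_results]]

toy_docs = [
    "heart failure and blood pressure",
    "brain imaging study",
    "heart surgery outcomes",
]
top = search("heart surgery", toy_docs)
print(top)
```

The document sharing both query terms ranks first, the one sharing only "heart" ranks second, and the unrelated document scores zero.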

At some point, these same APIs should be available via cuML in RAPIDS, and we will be able to run both training and inference on GPUs. At the moment, however, cuML does not cover all of the features in sklearn. Below is a hacky way to read the data and train the vectorizer on CPU, then do the inference itself on GPU.

In [None]:
import tensorflow as tf
import numpy as np

search_term = "heart"

pubmedTfidf = PubmedTfidfTrainServe()
pubmedTfidf.load_data("data/pubmed.csv","AbstractText")
pubmedTfidf.train()

def convert_sparse_matrix_to_sparse_tensor(X):
    coo = X.tocoo()
    # one (row, col) pair per non-zero entry; np.mat is deprecated, so stack instead
    indices = np.stack([coo.row, coo.col], axis=1).astype(np.int64)
    return tf.SparseTensor(indices, coo.data, coo.shape)

def create_search_vector(search_term):
    array = pubmedTfidf.vectorizer.transform([search_term]).toarray()
    tensor = tf.convert_to_tensor(array)
    return tensor 

## convert the sparse matrix to a tensor
tfidf = convert_sparse_matrix_to_sparse_tensor(pubmedTfidf.tfidf)
## vectorize a search term and do the same 
search_vector = create_search_vector(search_term)
## perform the matrix multiplication 
res = tf.sparse.sparse_dense_matmul(
    tfidf,
    tf.transpose(search_vector),
    adjoint_a=False,
    adjoint_b=False,
    name=None
    )
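
For intuition, the product above multiplies a sparse documents-by-vocabulary matrix with a dense query vector, which yields one similarity score per document. The sketch below does the same thing in plain Python on a made-up 3x3 matrix stored in COO form (index pairs, values, shape), mirroring the three arguments passed to `tf.SparseTensor`. `coo_matvec` is a hypothetical helper, not a TensorFlow function.

```python
def coo_matvec(indices, values, shape, vector):
    """Multiply a COO sparse matrix by a dense vector, touching only non-zeros."""
    scores = [0.0] * shape[0]
    for (row, col), value in zip(indices, values):
        scores[row] += value * vector[col]
    return scores

# Three toy documents over a three-word vocabulary.
indices = [(0, 0), (0, 2), (1, 1), (2, 0), (2, 1)]
values = [0.6, 0.8, 1.0, 0.8, 0.6]
shape = (3, 3)

query = [1.0, 0.0, 0.0]  # a query containing only word 0
scores = coo_matvec(indices, values, shape, query)
best = max(range(len(scores)), key=scores.__getitem__)  # same role as argmax
print(scores, best)
```

Only the stored non-zero entries are visited, which is why the sparse representation matters when the vocabulary is large.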

To sanity check our results, let's do one search with sklearn and one with GPU accelerated tensorflow!

In [None]:
print(pubmedTfidf.search("heart")[0])

In [None]:
x = tf.math.argmax(res)
best = int(tf.keras.backend.eval(x)[0])  # row index of the highest-scoring abstract
print(pubmedTfidf.text_data[best])

Tight, they both work. Since tensorflow is highly optimized for GPUs, inference should be faster, especially at scale. However, TF-IDF vectors quickly get massive since we have one position for every unique word or token.

Fortunately, with recent advances in NLP, we can represent text contextually in far fewer dimensions.

## 3. GPU Accelerated Information Retrieval

The naive TF-IDF approach used above runs into memory limits as the amount of data grows. Using newer NLP models like BERT, we can make our information retrieval both faster and contextual.
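
As a rough back-of-the-envelope comparison, the arithmetic below contrasts a fully dense TF-IDF matrix with fixed-size BERT vectors. The corpus size and vocabulary size are hypothetical round numbers chosen for illustration, not measurements from Pubmed; the 1024 dimensions come from the "H-1024" in the BERT large model name used in this section.

```python
# Hypothetical corpus and vocabulary sizes; neither number comes from the article.
n_abstracts = 1_000_000
vocab_size = 500_000
bert_dims = 1024  # hidden size of BERT large (the "H-1024" in the model name)

dense_tfidf_bytes = n_abstracts * vocab_size * 8  # float64, one slot per token
bert_bytes = n_abstracts * bert_dims * 4          # float32, as passed to faiss

print("dense TF-IDF: %.1f TB" % (dense_tfidf_bytes / 1e12))
print("BERT vectors: %.1f GB" % (bert_bytes / 1e9))
```

A sparse matrix stores far less than the dense figure, but its width still grows with the vocabulary, while the BERT vectors stay at a fixed width per abstract.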

Let's start by downloading one of the BERT large models and unzipping it.

In [None]:
# download model
!wget https://storage.googleapis.com/bert_models/2019_05_30/wwm_uncased_L-24_H-1024_A-16.zip
# unzip model
!unzip wwm_uncased_L-24_H-1024_A-16.zip
# start service with both GPUs available - do this in another terminal
# bert-serving-start -model_dir wwm_uncased_L-24_H-1024_A-16 -num_worker=2

In [None]:
import cudf
import os
import time
import numpy as np
from numpy import savetxt
import faiss
import pickle
from bert_serving.client import BertClient

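DATA_PATH = 'data/'  # relative path, same directory used in the earlier steps
d = 1024  # vector dimension: hidden size of the BERT large model (H-1024)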
bc = BertClient(check_length=False)

Now we can use faiss to create the GPU accelerated index. Here, we first create a CPU index and then convert it to the GPU counterpart.

In [None]:
ngpus = faiss.get_num_gpus()
print("number of GPUs:", ngpus)
cpu_index = faiss.IndexFlatL2(d)
gpu_index = faiss.index_cpu_to_all_gpus(  # build the index
  cpu_index
)
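
`IndexFlatL2` is an exact, brute-force index: every query is compared against every stored vector using squared L2 distance, and the `k` closest are returned. The plain-Python sketch below mimics that behavior on toy 2-D vectors so the shape of the result is clear. `flat_l2_search` is a hypothetical stand-in written for this notebook, not part of faiss.

```python
def flat_l2_search(stored, queries, k):
    """Return (distances, indices), each with one row of length k per query."""
    all_distances, all_indices = [], []
    for q in queries:
        # Squared L2 distance from the query to every stored vector.
        scored = sorted(
            (sum((a - b) ** 2 for a, b in zip(q, v)), i)
            for i, v in enumerate(stored)
        )[:k]
        all_distances.append([dist for dist, _ in scored])
        all_indices.append([i for _, i in scored])
    return all_distances, all_indices

stored = [[0.0, 0.0], [1.0, 0.0], [0.0, 3.0]]
distances, indices = flat_l2_search(stored, [[1.0, 1.0]], k=2)
print(distances, indices)
```

Both outputs are two-dimensional (one row per query), so with a single query the matches live in `indices[0]`.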

Finally, after all the touting of GPU parallelization and acceleration, I write a for loop and just process the content with brute force.

In [None]:
text = []

for f in os.listdir(DATA_PATH):
    if not f.endswith(".csv"):
        continue  # skip the raw XML and the saved index

    ## read data and drop rows with no abstract, since bc.encode expects strings
    df = cudf.read_csv(DATA_PATH+f).dropna(subset=['AbstractText'])
    abstracts = df['AbstractText'].to_pandas().tolist()
    text.extend(abstracts)  # keep text[i] aligned with vector i in the index

    ## vectorize and index in two lines 
    vectors = bc.encode(abstracts)
    gpu_index.add(vectors.astype('float32'))
    print("Total vectors: ", gpu_index.ntotal)
    print("Total abstracts: ", len(text))

    ## save index to disk if desired
    cpu_index = faiss.index_gpu_to_cpu(gpu_index)
    faiss.write_index(cpu_index,"data/vector.index")
    print("Saved index to disk")

Now we've read the text data in, vectorized it, and passed those vectors to our fast index. All that's left is to confirm and sanity check our search!

In [None]:
t0 = time.time()
search_vector = bc.encode(["this is a search query about the brain and blood"])
distances, indices = gpu_index.search(search_vector.astype('float32'), 3)  # 3 nearest neighbours
t1 = time.time()
print("Search query ran in %f seconds" % (t1-t0))
print([text[i] for i in indices[0]])  # indices has shape (n_queries, k)