# Patient Similarity

In this notebook we explore representing patients as vectors built from their ICD9 codes, then use vector operations to compute similarity between patients. These vectors are typically sparse, so we also explore using dictionaries to represent sparse vectors.
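
To make the dictionary idea concrete, here is a minimal sketch of a sparse vector stored as a `{dimension: value}` dictionary. The helper names `sparse_dot` and `sparse_norm` are hypothetical and chosen for illustration; they are not necessarily the functions provided by the `myla` modules imported below.

```python
# Hypothetical helpers for illustration; not the API of the myla modules.
def sparse_dot(a, b):
    """Dot product of two sparse vectors stored as {dimension: value} dicts."""
    # Iterate over the shorter dict; missing dimensions are implicitly zero.
    if len(a) > len(b):
        a, b = b, a
    return sum(value * b.get(dim, 0) for dim, value in a.items())

def sparse_norm(a):
    """Euclidean length of a sparse vector."""
    return sum(value * value for value in a.values()) ** 0.5

# Only the nonzero counts are stored, however large the vocabulary is.
p1 = {3: 1, 17: 2}
p2 = {17: 1, 42: 1}
print(sparse_dot(p1, p2))   # 2 (only dimension 17 is shared)
print(sparse_norm(p1))      # square root of 5
```

The cost of both operations depends on the number of nonzero entries rather than on the size of the vocabulary, which is the appeal of this representation when most entries are zero.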

In [None]:
import pymysql
import pandas as pd
import getpass
import seaborn as sns
import numpy as np
from collections import defaultdict
import itertools

In [None]:
# Three interchangeable vector implementations; uncomment exactly one.
#from myla.becsparsevec import *
#from myla.becvector import *
from myla.becvectornp import *

In [None]:
conn = pymysql.connect(host="mysql",
                       port=3306,user="jovyan",
                       passwd="jovyan",db='mimic2')
cursor = conn.cursor()

In [None]:
pd.read_sql('SELECT * from icd9',conn).head()


In [None]:
icd9_codes = pd.read_sql('SELECT subject_id, code, description from icd9',conn)
icd9_codes.head()

### We need to ...

1. Get the unique ICD9 codes.
2. Create a vocabulary that maps each code to a dimension in our vector space.
3. Create a map from each code to its description to make things more human-friendly.

In [None]:
icd9_codes.shape

In [None]:
voc_code = icd9_codes.code.unique()
voc_code.sort()

In [None]:
len(voc_code)

In [None]:
code_map = dict(zip(icd9_codes.code, icd9_codes.description))
len(code_map)

In [None]:
voc_map = {code: i for i, code in enumerate(voc_code)}
dim = len(voc_map)

### Get a List of ICD9 codes for each patient

In [None]:
demo = defaultdict(list)
demo

In [None]:
demo["Brian"].append("Chapman")
demo

In [None]:
patients = defaultdict(list)
for _,row in icd9_codes.iterrows():
    patients[row["subject_id"]].append(row["code"])

In [None]:
# Number of codes per patient: minimum, maximum, and mean.
counts = [len(codes) for codes in patients.values()]
min(counts), max(counts), np.mean(counts)

In [None]:
def patient2vec(p, vmap):
    """
    Take a patient p (a list of ICD9 codes) and a vocabulary vmap (code -> dimension)
    and return a count vector with one entry per code in the vocabulary.
    """
    pv = zero(len(vmap))
    for code in p:
        pv[vmap[code]] += 1
    return pv
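
As a small worked example of the count-vector idea, the sketch below runs the same logic on a made-up three-code vocabulary. It assumes `np.zeros` as a stand-in for the `zero()` function supplied by the imported vector module, and the names `patient2vec_np`, `toy_vmap`, and `toy_patient` are hypothetical.

```python
import numpy as np

def patient2vec_np(codes, vmap):
    """Count-vector sketch: np.zeros stands in for the notebook's zero()."""
    pv = np.zeros(len(vmap))
    for code in codes:
        pv[vmap[code]] += 1
    return pv

toy_vmap = {"250.00": 0, "401.9": 1, "428.0": 2}   # made-up vocabulary
toy_patient = ["401.9", "428.0", "401.9"]          # one code listed twice
vec = patient2vec_np(toy_patient, toy_vmap)
print(vec)  # [0. 2. 1.]
```

Each position holds how many times the corresponding code appears for the patient, so a code listed twice contributes 2 to its dimension.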

In [None]:
type(patient2vec(patients[56], voc_map))

In [None]:
patient_vectors = {p:patient2vec(patients[p], voc_map) for p in patients}

In [None]:
norm(patient2vec(patients[56], voc_map))

### Cosine Similarity
One of the simplest ways of comparing two patient vectors is the [cosine similarity measure](https://en.wikipedia.org/wiki/Cosine_similarity). The patients whose vectors have the smallest angle between them are the most similar.

![angle between two vectors](https://upload.wikimedia.org/wikipedia/commons/thumb/3/3e/Dot_Product.svg/200px-Dot_Product.svg.png)

---------------

$$\cos{\theta} = \frac{\vec{A}\cdot\vec{B}}{{\left|\left|\vec{A}\right|\right|}{\left|\left|\vec{B}\right|\right|}}$$
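
The formula can be checked numerically with a few toy vectors. This is a sketch using plain NumPy; `cosine_similarity` is a hypothetical stand-in for the `cos_sim` function used later in the notebook, and returning 0.0 for an all-zero vector is a convention chosen here, not something the notebook specifies.

```python
import numpy as np

def cosine_similarity(a, b):
    """cos(theta) = (a . b) / (||a|| ||b||); hypothetical stand-in for cos_sim."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return float(np.dot(a, b) / denom)

a = np.array([1.0, 1.0, 0.0])
b = np.array([1.0, 0.0, 0.0])
c = np.array([0.0, 0.0, 1.0])
print(cosine_similarity(a, a))  # approximately 1: identical direction
print(cosine_similarity(a, b))  # cos(45 degrees), about 0.707
print(cosine_similarity(a, c))  # 0.0: no shared dimensions
```

Because count vectors have no negative entries, the similarity between two patients falls between 0 (no codes in common) and 1 (same direction).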
    

In [None]:

import random
def compute_similarities(patient_vectors, num=50):
    similarities = {}
    keys = list(patient_vectors.keys())
    random.shuffle(keys)
    use_keys = keys[:num]
    for p1 in use_keys:
        v1 = patient_vectors[p1]
        for p2 in use_keys:
            v2 = patient_vectors[p2]
            similarities[(p1,p2)] = cos_sim(v1,v2)
    return similarities
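
The function above evaluates every ordered pair, although cosine similarity is symmetric. One possible variation, sketched below, evaluates each unordered pair once with `itertools.combinations_with_replacement` and mirrors the result. The names `compute_similarities_sym`, `toy_sim`, and `toy_vectors` are hypothetical, and the similarity function is passed in as a parameter so the sketch does not depend on `cos_sim`.

```python
import itertools
import random
import numpy as np

def compute_similarities_sym(patient_vectors, sim, num=50):
    """Variant that evaluates each unordered pair once, then mirrors it."""
    keys = random.sample(list(patient_vectors), min(num, len(patient_vectors)))
    similarities = {}
    for p1, p2 in itertools.combinations_with_replacement(keys, 2):
        s = sim(patient_vectors[p1], patient_vectors[p2])
        similarities[(p1, p2)] = s
        similarities[(p2, p1)] = s
    return similarities

def toy_sim(a, b):
    """Plain NumPy cosine similarity for nonzero vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

toy_vectors = {
    1: np.array([1.0, 0.0]),
    2: np.array([1.0, 1.0]),
    3: np.array([0.0, 2.0]),
}
sims_toy = compute_similarities_sym(toy_vectors, toy_sim, num=3)
print(len(sims_toy))  # 9 entries from 6 similarity evaluations
```

For `num` patients this performs `num * (num + 1) / 2` similarity evaluations instead of `num ** 2`, while producing a dictionary with the same keys.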

In [None]:
%timeit compute_similarities(patient_vectors, num=10)

## Performance for each model

```Python
%timeit compute_similarities(patient_vectors, num=10)
```
#### Python List Vector

`61.5 ms ± 688 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)`

#### Sparse Vector

`6.56 ms ± 114 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)`

#### Numpy Vector

`4.71 ms ± 43.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)`

In [None]:
num=250
sims = compute_similarities(patient_vectors, num=num)

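# Map each sampled patient ID to a row/column index of the similarity matrix.
ids = sorted({p1 for p1, _ in sims})
amap = {pid: i for i, pid in enumerate(ids)}

# Size the matrix from the sampled IDs so it stays correct if fewer than num patients exist.
sims_array = np.zeros((len(ids), len(ids)))
for (p1, p2), val in sims.items():
    sims_array[amap[p1], amap[p2]] = val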
sns.heatmap(sims_array)

In [None]:
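# distplot is deprecated in seaborn 0.11 and later; histplot with kde=True is its replacement.
sns.histplot(list(sims.values()), kde=True)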