# Patient Similarity

In this notebook we represent each patient as a vector built from their ICD9 codes, then use vector operations to compute the similarity between patients. These vectors are typically sparse, so we will explore using dictionaries to represent sparse vectors.

## Vector Norm
[reference](http://mathworld.wolfram.com/VectorNorm.html)
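
For convenience, the norm used in this notebook is the Euclidean ($L^2$) norm, which is what `numpy.linalg.norm` computes by default for a one-dimensional array. For a vector $\vec{A}$ with $n$ components $A_i$ (the index notation is introduced here for illustration):

$$\left|\left|\vec{A}\right|\right| = \sqrt{\sum_{i=1}^{n} A_i^2}$$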

In [6]:
import getpass
import itertools
from collections import defaultdict

import numpy as np
import pandas as pd
import pymysql
import seaborn as sns
from numpy.linalg import norm

In [5]:
from myla.becvector import *

In [7]:
conn = pymysql.connect(host="mysql",
                       port=3306,user="jovyan",
                       passwd=getpass.getpass("Enter MySQL passwd for jovyan"),db='mimic2')
cursor = conn.cursor()

Enter MySQL passwd for jovyan ······


In [8]:
pd.read_sql('SELECT * from icd9',conn).head()


Unnamed: 0,subject_id,hadm_id,sequence,code,description
0,56,28766,1,198.3,SECONDARY MALIGNANT NEOPLASM OF BRAIN AND SPIN...
1,56,28766,2,162.8,MALIGNANT NEOPLASM OF OTHER PARTS OF BRONCHUS ...
2,56,28766,3,531.4,CHRONIC OR UNSPECIFIED GASTRIC ULCER WITH HEMO...
3,56,28766,4,276.1,HYPOSMOLALITY AND/OR HYPONATREMIA
4,56,28766,5,428.0,CONGESTIVE HEART FAILURE UNSPECIFIED


In [9]:
icd9_codes = pd.read_sql('SELECT subject_id, code, description from icd9',conn)
icd9_codes.head()

Unnamed: 0,subject_id,code,description
0,56,198.3,SECONDARY MALIGNANT NEOPLASM OF BRAIN AND SPIN...
1,56,162.8,MALIGNANT NEOPLASM OF OTHER PARTS OF BRONCHUS ...
2,56,531.4,CHRONIC OR UNSPECIFIED GASTRIC ULCER WITH HEMO...
3,56,276.1,HYPOSMOLALITY AND/OR HYPONATREMIA
4,56,428.0,CONGESTIVE HEART FAILURE UNSPECIFIED


### We need to ...

1. Get the unique ICD9 codes.
2. Create a vocabulary that maps each code to a dimension in our vector space.
3. Create a map from each code to its description to make things more human-friendly.

### The shape of icd9_codes

In [17]:
icd9_codes.shape

(53486, 3)

### The number of patients

In [15]:
len(icd9_codes.subject_id.unique())

3951

In [19]:
voc_code = icd9_codes.code.unique()
voc_code.sort()

In [20]:
len(voc_code)

2719

In [23]:
code_map = dict(zip(icd9_codes.code, icd9_codes.description))
len(code_map)

2719

In [27]:
voc_map = {code: i for i, code in enumerate(voc_code)}
dim = len(voc_map)
dim

2719

### Get a List of ICD9 codes for each patient

In [30]:
demo = defaultdict(list)
demo

defaultdict(list, {})

In [34]:
demo["Brian"] = "Chapman"
demo["Brian1"].append("Chapman")

In [35]:
demo

defaultdict(list, {'Brian': 'Chapman', 'Brian1': ['Chapman']})

In [36]:
patients = defaultdict(list)
for _,row in icd9_codes.iterrows():
    patients[row["subject_id"]].append(row["code"])

In [40]:
lengths = [len(codes) for codes in patients.values()]
min(lengths), max(lengths), np.mean(lengths)

(1, 308, 13.53733232093141)

In [41]:
def patient2vec(p, vmap):
    """
    Takes a patient p (a list of ICD9 codes) and a vocabulary vmap (a dict mapping
    each code to a vector dimension) and returns a vector of code counts for p.
    """
    pv = zero(len(vmap))
    for code in p:
        pv[vmap[code]] += 1
    return pv
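
The introduction mentions dictionaries as a way to represent sparse vectors. As a minimal sketch of that idea (the name `patient2sparse` is hypothetical and not part of `myla`), a `collections.Counter` keyed by code stores only the non-zero entries, so no vocabulary map is needed:

```python
from collections import Counter

def patient2sparse(p):
    """Hypothetical helper: map each ICD9 code in p to the number of times it occurs."""
    return Counter(p)

# Toy patient built from two codes shown in the table above; one code is repeated
toy_patient = ["428.0", "276.1", "428.0"]
sv = patient2sparse(toy_patient)
print(dict(sv))  # {'428.0': 2, '276.1': 1}
```

Looking up a code the patient does not have (for example `sv["198.3"]`) returns 0, which matches the behavior of a dense vector with a zero in that dimension.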

In [46]:
v56 = patient2vec(patients[56],voc_map)
sum([1 for i in v56 if i != 0])

8

In [53]:
patient_vectors = {p:patient2vec(patients[p], voc_map) for p in patients}

In [51]:
len(patient_vectors)

3951

In [54]:
norm(patient2vec(patients[56],voc_map))

2.8284271247461903
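
This value is consistent with the earlier count of 8 non-zero entries for patient 56: a norm of $\sqrt{8}$ means the squared entries sum to 8, so each of those 8 codes must occur exactly once. A quick check of the arithmetic, using only the standard library:

```python
import math

# Eight codes, each with a count of 1 (as inferred above for patient 56)
counts = [1] * 8
euclidean_norm = math.sqrt(sum(c * c for c in counts))
print(euclidean_norm)  # agrees with the norm shown above up to floating-point precision
```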

### Cosine Similarity
One of the simplest ways of comparing two vectors is the [cosine similarity measure](https://en.wikipedia.org/wiki/Cosine_similarity). The patients whose vectors have the smallest angle between them (and therefore the largest cosine) are the most similar.

![angle between two vectors](https://upload.wikimedia.org/wikipedia/commons/thumb/3/3e/Dot_Product.svg/200px-Dot_Product.svg.png)

---------------

$$\cos{\theta} = \frac{\vec{A}\cdot\vec{B}}{{\left|\left|\vec{A}\right|\right|}{\left|\left|\vec{B}\right|\right|}}$$
    

In [3]:
def cos_sim(v1, v2):
    return dot(v1,v2) / (norm(v1) * norm(v2))

In [4]:
cos_sim([1,2], [2,1])

0.7999999999999998
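
Working this example by hand shows where the value comes from. The exact answer is 0.8; the printed result differs only by floating-point rounding:

$$\cos{\theta} = \frac{1\cdot 2 + 2\cdot 1}{\sqrt{1^2+2^2}\,\sqrt{2^2+1^2}} = \frac{4}{\sqrt{5}\,\sqrt{5}} = \frac{4}{5} = 0.8$$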

In [55]:
similarities = {}

In [None]:
for p1, v1 in patient_vectors.items():
    for p2, v2 in patient_vectors.items():
        similarities[(p1, p2)] = cos_sim(v1, v2)
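
The double loop above visits every ordered pair, including each patient paired with itself and both `(p1, p2)` and `(p2, p1)`, and it works on dense vectors of length `dim`. A possible alternative, sketched here with toy data and hypothetical names (`sparse_cos_sim`, `toy_patients`), stores each patient as a dictionary of code counts and uses `itertools.combinations` to visit each unordered pair once. This is an illustration under those assumptions, not the notebook's own method:

```python
import itertools
import math
from collections import Counter

def sparse_cos_sim(a, b):
    """Cosine similarity of two sparse vectors stored as {code: count} dicts."""
    # Iterate over the smaller dict; codes missing from the other contribute 0
    if len(a) > len(b):
        a, b = b, a
    dot_ab = sum(count * b.get(code, 0) for code, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: an empty vector is treated as similar to nothing
    return dot_ab / (norm_a * norm_b)

# Toy data: patient ids and code lists are invented for illustration
toy_patients = {
    1: ["428.0", "276.1"],
    2: ["428.0", "276.1"],
    3: ["198.3"],
}
toy_vectors = {pid: Counter(codes) for pid, codes in toy_patients.items()}

# Cosine similarity is symmetric, so each unordered pair is computed once
toy_similarities = {
    (p1, p2): sparse_cos_sim(toy_vectors[p1], toy_vectors[p2])
    for p1, p2 in itertools.combinations(toy_vectors, 2)
}
print(toy_similarities)
```

For n patients this computes n(n-1)/2 similarities instead of n². With the 3951 patients here, that is 7,803,225 pairs rather than 15,610,401, and each dot product touches only the codes a patient actually has rather than all 2719 dimensions.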
