## Exploration of AMB dataset

### Questions

- What does each row represent? A Dx or an Rx?
- How many unique patients are there under `P32c` and `P33`?
- How many diagnoses are there under `P32c` and `P33`?

## Data cleaning

We begin by creating a new tab-separated file that includes only rows with the `P32c` and `P33` HCGs.

In [1]:
import csv
import pandas as pd

## We filter the data and save it in a new file

In [2]:
SOURCE = 'AMB_ALL.tsv'
TARGET = 'SELECTED_HCG.tsv'

ALLOWED_HCG = ['p32c', 'p33']
HEADERS = ['year', 'quarter', 'patid', 'gender', 'agegrp', 'race', 'ethnicity', 'language', 'metro', 'paid', 'patpaid', 'payer', 'pos', 'urban', 'dx1', 'px1', 'hcg']

In [3]:
%time
# with open(SOURCE, 'rt') as f, open(TARGET, 'wt') as g:
#     csvreader = csv.DictReader(f, fieldnames=HEADERS, delimiter="\t", )
#     csvwriter = csv.DictWriter(g, fieldnames=HEADERS, delimiter="\t")
#     csvwriter.writerow(next(csvreader))  # copy the header row through unchanged
#     for row in csvreader:
#         if row['hcg'].lower() in ALLOWED_HCG:
#             csvwriter.writerow(row)

CPU times: user 3 µs, sys: 4 µs, total: 7 µs
Wall time: 7.15 µs
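The commented-out loop above streams the source file row by row. As a rough alternative, the same filter can be sketched with pandas chunks. The function name `filter_hcg` and the chunk size are illustrative assumptions, and the sketch assumes the source file has a header row with an `hcg` column.

```python
import pandas as pd

ALLOWED_HCG = ['p32c', 'p33']


def filter_hcg(source, target, allowed=ALLOWED_HCG, chunksize=100000):
    """Copy rows whose `hcg` value is in `allowed` from source to target (TSV)."""
    kept = 0
    first = True
    # Read everything as strings so codes such as '0340' keep their leading zero.
    for chunk in pd.read_csv(source, sep="\t", dtype=str, chunksize=chunksize):
        mask = chunk['hcg'].str.lower().isin(allowed)
        selected = chunk[mask]
        # Write the header once, then append the remaining chunks.
        selected.to_csv(target, sep="\t", index=False,
                        mode='w' if first else 'a', header=first)
        first = False
        kept += len(selected)
    return kept
```

Called as `filter_hcg(SOURCE, TARGET)`, this returns the number of rows kept. Reading in chunks keeps memory use bounded when the source file is large.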


In [None]:
x = pd.read_csv(TARGET, delimiter="\t")

## How many unique patients diagnosed under `P32c` and `P33`?

In [5]:
%time total_number_of_unique_patients = len(x['patid'].unique())
print("There are %d unique patients" % total_number_of_unique_patients)


CPU times: user 2.21 s, sys: 86.3 ms, total: 2.3 s
Wall time: 2.31 s
There are 2276316 unique patients


## How many unique Dx under `P32c` and `P33`?

In [6]:
%time total_number_of_unique_dx = len(x['dx1'].unique())
total_number_of_dx = len(x)
print("There are %d unique dx" % total_number_of_unique_dx)

CPU times: user 501 ms, sys: 34.6 ms, total: 535 ms
Wall time: 539 ms
There are 9837 unique dx


In [7]:
print("Each patient has on average %.2f diagnoses" % (len(x) / float(total_number_of_unique_patients)))

Each patient has on average 3.43 diagnoses
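The average above divides the number of rows by the number of patients, so repeated rows with the same code count more than once. A minimal sketch on toy data (the `toy` frame is made up for illustration and only stands in for `x`) shows how rows per patient and distinct codes per patient can be separated:

```python
import pandas as pd

# Toy stand-in for the filtered table `x`; values are invented for illustration.
toy = pd.DataFrame({
    'patid': [1, 1, 1, 2, 3, 3],
    'dx1': ['4019', '4019', '462', '25000', '462', '7862'],
})

# Number of rows per patient (what len(x) / n_patients averages).
rows_per_patient = toy.groupby('patid').size()
# Number of distinct dx1 codes per patient.
distinct_dx_per_patient = toy.groupby('patid')['dx1'].nunique()

mean_rows = rows_per_patient.mean()
mean_distinct = distinct_dx_per_patient.mean()
print("%.2f rows and %.2f distinct dx per patient" % (mean_rows, mean_distinct))
```

If the two means differ noticeably on the real data, the rows are probably not one-per-diagnosis, which bears on the open question of what each row represents.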


## Read the ICD-9 to module mapping

In [8]:
MODULES_FILE = 'Bright.md ICD-9 codes.xlsx - ICD codes.tsv'
modules = pd.read_csv(MODULES_FILE, delimiter="\t")

In [9]:
icd9 = list(modules['ICD-9'].dropna().unique())
print("Smartexam supports %d ICD9 diagnoses" % len(icd9))

Smartexam supports 60 ICD9 diagnoses


In [10]:
# Supported codes are dotted (e.g. '401.9'); keep the part before the dot.
unique_icd9_supercategories = {code.split('.')[0] for code in icd9}

### We strip `dx1` and save it back

In [11]:
x['dx1'] = x['dx1'].str.strip()
# x.to_csv(TARGET, sep="\t")

In [12]:
# Codes in the data carry no dot, so the first three characters give the category.
data_supercategories = {code[:3] for code in set(x['dx1'])}
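The two sides are reduced to categories in different ways: the supported codes are split at the dot, while the data codes are cut to three characters. A single helper could do both. The sketch below is an illustration rather than part of the original analysis; the name `icd9_category` is hypothetical, and it assumes that ICD-9 E codes use a four-character category (the letter `E` plus three digits).

```python
def icd9_category(code):
    """Return the ICD-9 category of a dotted or undotted code string."""
    code = code.strip().upper()
    if '.' in code:
        # Dotted form such as '401.9': the category is the part before the dot.
        return code.split('.')[0]
    # Undotted form such as '4019': E codes keep four characters, others three.
    width = 4 if code.startswith('E') else 3
    return code[:width]


print(icd9_category('401.9'), icd9_category('4019'), icd9_category('V700'))
```

Using one function for both sets avoids mismatches caused by the two formats being trimmed differently.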

In [13]:
rows_supported_by_brightmd = data_supercategories.intersection(unique_icd9_supercategories)
total_number_of_rows = len(x)

## How many total diagnoses are supported by Brightmd's modules?

In [14]:
# NOTE: the first number counts shared three-character ICD-9 categories, not rows.
print("Smartexam supports %d out of %d total diagnoses (%.2f%%)" % (
    len(rows_supported_by_brightmd),
    total_number_of_rows,
    100 * len(rows_supported_by_brightmd) / float(total_number_of_rows)))

Smartexam supports 36 out of 7803527 total diagnoses (0.00%)
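The 36 in this output is the number of three-character categories shared between the data and the supported list, while 7803527 is a row count, so the ratio mixes two units. A row-level share could be computed along the following lines. This is a sketch on toy data: `toy` and `supported_categories` are invented for illustration and the printed figures say nothing about the real dataset.

```python
import pandas as pd

# Toy stand-ins for `x` and for the set of supported categories.
toy = pd.DataFrame({'dx1': ['4019', '4019', '462', '25000', '7862']})
supported_categories = {'401', '462'}

# Flag each row whose three-character category is supported, then count rows.
row_is_supported = toy['dx1'].str[:3].isin(supported_categories)
supported_rows = int(row_is_supported.sum())
share = 100.0 * supported_rows / len(toy)
print("%d out of %d rows (%.2f%%)" % (supported_rows, len(toy), share))
```

On the real data the same pattern would use `x['dx1']` and the category intersection computed in the previous cell.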


In [15]:
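# Share of the distinct dx1 codes in the data that the supported ICD-9 list could cover at most.
print("%.2f%%" % (100 * len(icd9) / float(len(x['dx1'].unique()))))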

0.61%


In [42]:
# Row counts for the 50 most frequent dx1 codes.
y = x['dx1'].value_counts().head(50)
y

4019     207540
25000    200812
4659     173543
4011     171634
42731    122418
462      116306
7862     100743
7242      98311
2724      86240
5990      84871
4619      81053
3829      76761
7295      74826
78079     73627
30000     72955
71946     69988
4660      66367
311       66104
2449      63644
49390     62332
78900     61512
7245      55613
6929      53724
25002     52962
31401     52740
7881      47808
7840      47100
4779      45622
71941     45114
78650     43921
7821      43560
32723     42782
53081     42671
486       42446
496       42372
7231      40769
V700      39519
4739      38144
V5883     37923
490       36982
78791     36401
7804      36344
7244      32887
V5861     32710
7291      32299
7823      31887
71945     31165
0340      30105
78060     30068
31400     29624
Name: dx1, dtype: int64
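A frequency table like this one is often easier to read alongside a cumulative share, which shows how much of the data the most frequent codes account for. A minimal sketch on toy data (the `toy` series is invented for illustration):

```python
import pandas as pd

# Toy stand-in for x['dx1'].
toy = pd.Series(['4019', '4019', '4019', '462', '462', '7862'], name='dx1')

counts = toy.value_counts()                       # sorted from most to least frequent
cumulative_share = counts.cumsum() / float(counts.sum())
print(cumulative_share)
```

Applied to `x['dx1']`, the value at position 49 of `cumulative_share` would give the fraction of rows covered by the 50 codes listed above.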

In [45]:
import requests
from bs4 import BeautifulSoup

# Look up a description for each of the 50 most frequent codes on hipaaspace.com.
url = 'https://www.hipaaspace.com/Medical_Billing/Crosswalk.Services/ICD-9.to.ICD-10.Mapping/%s'
stats = []
for k in y.keys():
    r = requests.get(url % k)
    soup = BeautifulSoup(r.text)
    tables = soup.find_all("table")
    # The table, row and cell positions below depend on the layout of the page.
    rows = tables[1].findChildren('tr')
    td = rows[3].findChildren('td')
    stats.append({'total': y[k], 'description': td[0].text.replace('(Diagnosis) - ', '').strip()})

stats = pd.DataFrame(stats)
stats.head(50)

| | description | total |
|---|---|---|
| 0 | 4019 Hypertension NOS (Unspecified essential h... | 207540 |
| 1 | 25000 DMII wo cmp nt st uncntr (Diabetes melli... | 200812 |
| 2 | 4659 Acute uri NOS (Acute upper respiratory in... | 173543 |
| 3 | 4011 Benign hypertension (Benign essential hyp... | 171634 |
| 4 | 42731 Atrial fibrillation | 122418 |
| 5 | 462 Acute pharyngitis | 116306 |
| 6 | 7862 Cough | 100743 |
| 7 | 7242 Lumbago | 98311 |
| 8 | 2724 Hyperlipidemia NEC/NOS (Other and unspeci... | 86240 |
| 9 | 5990 Urin tract infection NOS (Urinary tract i... | 84871 |


In [65]:
100* (1 - len(data_supercategories.difference(unique_icd9_supercategories))/float(len(data_supercategories)))

3.5019455252918275

In [60]:
sorted(data_supercategories)

['*NU',
 '000',
 '002',
 '003',
 '004',
 '005',
 '006',
 '007',
 '008',
 '009',
 '010',
 '011',
 '012',
 '013',
 '014',
 '015',
 '016',
 '017',
 '018',
 '020',
 '021',
 '023',
 '024',
 '025',
 '026',
 '027',
 '030',
 '031',
 '032',
 '033',
 '034',
 '035',
 '036',
 '037',
 '038',
 '039',
 '040',
 '041',
 '042',
 '045',
 '046',
 '047',
 '048',
 '049',
 '050',
 '051',
 '052',
 '053',
 '054',
 '055',
 '056',
 '057',
 '058',
 '061',
 '063',
 '065',
 '066',
 '070',
 '072',
 '073',
 '074',
 '075',
 '076',
 '077',
 '078',
 '079',
 '081',
 '082',
 '083',
 '084',
 '085',
 '086',
 '087',
 '088',
 '090',
 '091',
 '092',
 '093',
 '094',
 '095',
 '096',
 '097',
 '098',
 '099',
 '100',
 '101',
 '102',
 '103',
 '104',
 '110',
 '111',
 '112',
 '114',
 '115',
 '116',
 '117',
 '118',
 '120',
 '121',
 '122',
 '123',
 '124',
 '125',
 '126',
 '127',
 '128',
 '129',
 '130',
 '131',
 '132',
 '133',
 '134',
 '135',
 '136',
 '137',
 '138',
 '139',
 '140',
 '141',
 '142',
 '143',
 '144',
 '145',
 '146',
 '147',


In [59]:
len(data_supercategories)

1028

In [62]:
len(unique_icd9_supercategories.difference(data_supercategories))

8

In [63]:
len(unique_icd9_supercategories)

44
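The set sizes reported in the cells above are consistent with each other, and the arithmetic can be checked directly. The sketch below only re-uses the numbers printed in this notebook; the variable names are illustrative.

```python
n_data = 1028        # len(data_supercategories)
n_supported = 44     # len(unique_icd9_supercategories)
n_missing = 8        # supported categories that never occur in the data

# Supported categories that do occur in the data.
n_shared = n_supported - n_missing
# Share of the categories seen in the data that are supported.
pct = 100.0 * n_shared / n_data
print("%d shared categories, %.2f%% of the categories in the data" % (n_shared, pct))
```

This reproduces both the 36 shared categories and the roughly 3.5% category-level coverage computed earlier from the set difference.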