### 2. PREPARE DEATH DATA FOR ANALYSIS AND NATURAL LANGUAGE PROCESSING

Data for this analysis are restricted to deaths that occurred in Washington State from January 1, 2016 through December 31, 2019.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from IPython.display import display
from importlib import reload
import nltk
from nltk.tokenize import word_tokenize
from nltk.tokenize import RegexpTokenizer
from collections import Counter
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.stem import PorterStemmer
from nltk.util import ngrams
import re
from gensim import corpora, models

In [2]:
pd.set_option('display.max_columns', 500)
pd.set_option('display.max_rows', None)

**Load raw data extracted from SQL database**

In [3]:
d1619 = pd.read_csv(r'Y:/DQSS/Death/MBG/py/capstone2/data/dth1619_raw.csv',
                  low_memory=False,
                  encoding = 'unicode_escape')

**Check data to make sure they are for the right time frame and that all deaths occurred in WA**

In [4]:
d1619['dody'].value_counts(dropna = False).sort_index()

2016    54821
2017    56992
2018    56950
2019    58234
Name: dody, dtype: int64

In [5]:
d1619.dstateFIPS.value_counts(dropna=False)

WA    226997
Name: dstateFIPS, dtype: int64

**Keep relevant variables** including the underlying cause code ('UCOD'), the multiple cause ICD-10 code fields ('MC1' to 'MC20'), the concatenated cause of death literal field, and the 'tobac' variable, which indicates whether the medical certifier believed that tobacco use contributed to the decedent's death. The working data set will contain records for all deaths occurring in Washington State, regardless of the decedent's state of residence.

In [6]:
d1619.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 226997 entries, 0 to 226996
Data columns (total 26 columns):
 #   Column      Non-Null Count   Dtype  
---  ------      --------------   -----  
 0   sfn         226997 non-null  int64  
 1   sex         226997 non-null  object 
 2   ageyrs      226997 non-null  float64
 3   dod         226997 non-null  object 
 4   dody        226997 non-null  int64  
 5   dcounty     226996 non-null  object 
 6   dstateFIPS  226997 non-null  object 
 7   bridgerace  226997 non-null  int64  
 8   hispno      226997 non-null  object 
 9   marital     226997 non-null  object 
 10  rstatefips  226997 non-null  object 
 11  manner      226955 non-null  object 
 12  tobac       226912 non-null  object 
 13  pg          166760 non-null  float64
 14  certdesig   225383 non-null  float64
 15  UCOD        226841 non-null  object 
 16  MC1         226825 non-null  object 
 17  MC2         188306 non-null  object 
 18  codAq       226996 non-null  object 
 19  co

In [8]:
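# .copy() makes ds an independent DataFrame, so the column assignments in later
# cells do not raise SettingWithCopyWarning.
ds = d1619.loc[:,['sfn', 'sex', 'ageyrs', 'dody', 'dcounty', 'bridgerace', 'hispno', 'manner', 'tobac', 'pg',
                'certdesig', 'UCOD', 'MC2','AllMC', 'MC2_20', 'codlit']].copy()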

In [9]:
ds.head()

Unnamed: 0,sfn,sex,ageyrs,dody,dcounty,bridgerace,hispno,manner,tobac,pg,certdesig,UCOD,MC2,AllMC,MC2_20,codlit
0,2017025187,M,84.0,2017,KING,1,Y,N,N,8.0,1.0,J841,I10,J841 I10 I219 I269 I48 I509 J690 J960 R090 ...,I10 I219 I269 I48 I509 J690 J960 R090,ACUTE RESPIRATORY FAILURE WITH HYPOXIA ASPIRAT...
1,2017025188,M,58.0,2017,KING,1,Y,N,N,8.0,7.0,I64,,I64,,ACUTE ISCHEMIC CEREBRAL VASCULAR ACCIDENT
2,2017025189,F,70.0,2017,KING,10,Y,N,N,8.0,1.0,C833,D619,C833 D619 E46 E883 R58 Y433,D619 E46 E883 R58 Y433,RETROPERITONEAL BLEEDING EXTENSIVE LARGE B CEL...
3,2017025190,F,85.0,2017,KING,1,Y,N,N,8.0,1.0,E86,N179,E86 N179 R13,N179 R13,ACUTE RENAL FAILURE DEHYDRATION DYSPHAGIA
4,2017025191,M,75.0,2017,SPOKANE,1,Y,N,Y,8.0,1.0,E870,F54,E870 F54 G729 I509 J969 R093 R418 R53 T179 W80...,F54 G729 I509 J969 R093 R418 R53 T179 W80 ...,PULSELESS ELECTRICAL ACTIVITY ARREST HYPOXEMIC...


I start with 226,997 death records for individuals who died in Washington State from 2016 through 2019.

**LABELING RECORDS THAT HAVE GARBAGE UNDERLYING CAUSE CODES**
<br> 
Foreman et al. published an article on death data quality that includes a comprehensive list of garbage codes divided into the nine groups listed below.  The code groups are mutually exclusive and together represent an exhaustive list of garbage codes.  This list has been referenced in subsequent articles that address death data quality.  In this step I label each record in the data set 1 through 9 if the underlying cause of death (UCOD) code is a garbage code, and 0 if the code is a 'valid' (non-garbage) code.

**Create lists of ICD-10 codes for each of the 9 garbage code categories.** These lists will be used to flag records where underlying cause code (UCOD) consists of a garbage code. The nine categories of garbage codes are as follows:

- g1 = Septicemia
- g2 = Heart failure
- g3 = Ill-defined cancer
- g4 = Volume depletion
- g5 = Ill-defined
- g6 = Ill-defined cardiovascular
- g7 = Ill-defined injury
- g8 = Undetermined intent
- g9 = Ill-defined infectious

In [10]:
g1 = ['A40', 'A400', 'A401', 'A402', 'A403', 'A408', 'A409', 'A41', 'A410', 'A411', 'A412', 'A413', 'A414',
      'A415', 'A418', 'A419']

g2 = ['I50', 'I500', 'I501', 'I509']

g3 = ['C759', 'C76', 'C760', 'C761', 'C762', 'C763', 'C764', 'C765', 'C767', 'C768', 'C80', 'D099', 'D489']

g4 = ['E86', 'E87', 'E870', 'E871', 'E872', 'E873', 'E874', 'E875', 'E876', 'E877', 'E878' ]

g5 = ['I46', 'I460', 'I461', 'I469', 'P95', 'R00', 'R000', 'R001', 'R002', 'R008', 'R01', 'R010', 'R011', 
      'R012', 'R02', 'R03', 'R030', 'R031', 'R04', 'R040', 'R041', 'R042', 'R048', 'R049', 'R05', 'R06', 
      'R060', 'R061', 'R062', 'R063', 'R064', 'R065', 'R066', 'R067', 'R068', 'R07', 'R070', 'R071', 
      'R072', 'R073', 'R074', 'R09', 'R090', 'R091', 'R092', 'R093', 'R098', 'R10', 'R100', 'R101', 
      'R102', 'R103', 'R104', 'R11', 'R12', 'R13', 'R14', 'R15', 'R16', 'R160', 'R161', 'R162', 'R17', 
      'R18', 'R19', 'R190', 'R191', 'R192', 'R193', 'R194', 'R195', 'R196', 'R198', 'R20', 'R200', 
      'R201', 'R202', 'R203', 'R208', 'R21', 'R22', 'R220', 'R221', 'R222', 'R223', 'R224', 'R227', 
      'R229', 'R23', 'R230', 'R231', 'R232', 'R233', 'R234', 'R238', 'R25', 'R250', 'R251', 'R252', 
      'R253', 'R258', 'R26', 'R260', 'R261', 'R262', 'R268', 'R27', 'R270', 'R278', 'R29', 'R290', 
      'R291', 'R292', 'R293', 'R294', 'R296', 'R298', 'R30', 'R300', 'R301', 'R309', 'R31', 'R32', 
      'R33', 'R34', 'R35', 'R36', 'R39', 'R390', 'R391', 'R392', 'R398', 'R40', 'R400', 'R401', 'R402', 
      'R41', 'R410', 'R411', 'R412', 'R413', 'R418', 'R42', 'R43', 'R430', 'R431', 'R432', 'R438', 'R44', 
      'R440', 'R441', 'R442', 'R443', 'R448', 'R45', 'R450', 'R451', 'R452', 'R453', 'R454', 'R455', 'R456', 
      'R457', 'R458', 'R46', 'R460', 'R461', 'R462', 'R463', 'R464', 'R465', 'R466', 'R467', 'R468', 'R47', 
      'R470', 'R471', 'R478', 'R48', 'R480', 'R481', 'R482', 'R488', 'R49', 'R490', 'R491', 'R492', 'R498', 
      'R50', 'R500', 'R501', 'R502', 'R508', 'R509', 'R51', 'R52', 'R520', 'R521', 'R522', 'R529', 'R53', 
      'R54', 'R55', 'R56', 'R560', 'R568', 'R57', 'R570', 'R571', 'R578', 'R579', 'R58', 'R59', 'R590', 
      'R591', 'R599', 'R60', 'R600', 'R601', 'R609', 'R61', 'R610', 'R611', 'R619', 'R62', 'R620', 'R628', 
      'R629', 'R63', 'R630', 'R631', 'R632', 'R633', 'R634', 'R635', 'R638', 'R64', 'R68', 'R680', 'R681', 
      'R682', 'R683', 'R688', 'R69', 'R70', 'R700', 'R701', 'R71', 'R72', 'R73', 'R730', 'R739', 'R74', 'R740',
      'R748', 'R749', 'R75', 'R76', 'R760', 'R761', 'R762', 'R768', 'R769', 'R77', 'R770', 'R771', 'R772', 'R778',
      'R779', 'R78', 'R780', 'R781', 'R782', 'R783', 'R784', 'R785', 'R786', 'R787', 'R788', 'R789', 'R79', 'R790',
      'R798', 'R799', 'R80', 'R81', 'R82', 'R820', 'R821', 'R822', 'R823', 'R824', 'R825', 'R826', 'R827', 'R828',
      'R829', 'R83', 'R830', 'R831', 'R832', 'R833', 'R834', 'R835', 'R836', 'R837', 'R838', 'R839', 'R84', 'R840',
      'R841', 'R842', 'R843', 'R844', 'R845', 'R846', 'R847', 'R848', 'R849', 'R85', 'R850', 'R851', 'R852',
      'R853', 'R854', 'R855', 'R856', 'R857', 'R858', 'R859', 'R86', 'R860', 'R861', 'R862', 'R863', 'R864',
      'R865', 'R866', 'R867', 'R868', 'R869', 'R87', 'R870', 'R871', 'R872', 'R873', 'R874', 'R875',
      'R876', 'R877', 'R878', 'R879', 'R89', 'R890', 'R891', 'R892', 'R893', 'R894', 'R895', 'R896', 'R897',
      'R898', 'R899', 'R90', 'R900', 'R908', 'R91', 'R92', 'R93', 'R930', 'R931', 'R932', 'R933', 'R934',
      'R935', 'R936', 'R937', 'R938', 'R94', 'R940', 'R941', 'R942', 'R943', 'R944', 'R945', 'R946', 'R947',
      'R948', 'R95', 'R96', 'R960', 'R961', 'R98', 'R99']

g6 = ['I10', 'I15', 'I150', 'I151', 'I152', 'I158', 'I159', 'I26', 'I260', 'I269', 'I49', 'I490', 'I491', 'I492',
      'I493', 'I494', 'I495', 'I498', 'I499', 'I51', 'I510', 'I511', 'I512', 'I513', 'I514', 'I515', 'I516', 'I517',
      'I518', 'I519', 'I70', 'I700', 'I701', 'I709', 'I74', 'I740', 'I741', 'I742', 'I743', 'I744', 'I745', 'I748',
      'I749', 'I99']

g7 = ['S00', 'S000', 'S001', 'S002', 'S003', 'S004', 'S005', 'S007', 'S008', 'S009', 'S01', 'S010', 'S011', 'S012',
      'S013', 'S014', 'S015', 'S017', 'S018', 'S019', 'S02', 'S020', 'S021', 'S022', 'S023', 'S024', 'S025', 'S026',
      'S027', 'S028', 'S029', 'S03', 'S030', 'S031', 'S032', 'S033', 'S034', 'S035', 'S04', 'S040', 'S041', 'S042',
      'S043', 'S044', 'S045', 'S046', 'S047', 'S048', 'S049', 'S05', 'S050', 'S051', 'S052', 'S053', 'S054', 'S055',
      'S056', 'S057', 'S058', 'S059', 'S06', 'S060', 'S061', 'S062', 'S063', 'S064', 'S065', 'S066', 'S067', 'S068',
      'S069', 'S07', 'S070', 'S071', 'S078', 'S079', 'S08', 'S080', 'S081', 'S088', 'S089', 'S09', 'S090', 'S091',
      'S092', 'S097', 'S098', 'S099', 'S10', 'S100', 'S101', 'S107', 'S108', 'S109', 'S11', 'S110', 'S111', 'S112',
      'S117', 'S118', 'S119', 'S12', 'S120', 'S121', 'S122', 'S127', 'S128', 'S129', 'S13', 'S130', 'S131', 'S132',
      'S133', 'S134', 'S135', 'S136', 'S14', 'S140', 'S141', 'S142', 'S143', 'S144', 'S145', 'S146', 'S15', 'S150',
      'S151', 'S152', 'S153', 'S157', 'S158', 'S159', 'S16', 'S17', 'S170', 'S178', 'S179', 'S18', 'S19', 'S197',
      'S198', 'S199', 'S20', 'S200', 'S201', 'S202', 'S203', 'S204', 'S207', 'S208', 'S21', 'S210', 'S211', 'S212',
      'S217', 'S218', 'S219', 'S22', 'S220', 'S221', 'S222', 'S223', 'S224', 'S225', 'S228', 'S229', 'S23', 'S230',
      'S231', 'S232', 'S233', 'S234', 'S235', 'S24', 'S240', 'S241', 'S242', 'S243', 'S244', 'S245', 'S246', 'S25',
      'S250', 'S251', 'S252', 'S253', 'S254', 'S255', 'S257', 'S258', 'S259', 'S26', 'S260', 'S268', 'S269', 'S27',
      'S270', 'S271', 'S272', 'S273', 'S274', 'S275', 'S276', 'S277', 'S278', 'S279', 'S28', 'S280', 'S281', 'S29',
      'S290', 'S297', 'S298', 'S299', 'S30', 'S300', 'S301', 'S302', 'S307', 'S308', 'S309', 'S31', 'S310', 'S311',
      'S312', 'S313', 'S314', 'S315', 'S317', 'S318', 'S32', 'S320', 'S321', 'S322', 'S323', 'S324', 'S325', 'S327',
      'S328', 'S33', 'S330', 'S331', 'S332', 'S333', 'S334', 'S335', 'S336', 'S337', 'S34', 'S340', 'S341', 'S342',
      'S343', 'S344', 'S345', 'S346', 'S348', 'S35', 'S350', 'S351', 'S352', 'S353', 'S354', 'S355', 'S357', 'S358',
      'S359', 'S36', 'S360', 'S361', 'S362', 'S363', 'S364', 'S365', 'S366', 'S367', 'S368', 'S369', 'S37', 'S370',
      'S371', 'S372', 'S373', 'S374', 'S375', 'S376', 'S377', 'S378', 'S379', 'S38', 'S380', 'S381', 'S382', 'S383',
      'S39', 'S390', 'S396', 'S397', 'S398', 'S399', 'S40', 'S400', 'S407', 'S408', 'S409', 'S41', 'S410', 'S411',
      'S417', 'S418', 'S42', 'S420', 'S421', 'S422', 'S423', 'S424', 'S427', 'S428', 'S429', 'S43', 'S430', 'S431',
      'S432', 'S433', 'S434', 'S435', 'S436', 'S437', 'S44', 'S440', 'S441', 'S442', 'S443', 'S444', 'S445', 'S447',
      'S448', 'S449', 'S45', 'S450', 'S451', 'S452', 'S453', 'S457', 'S458', 'S459', 'S46', 'S460', 'S461', 'S462',
      'S463', 'S467', 'S468', 'S469', 'S47', 'S48', 'S480', 'S481', 'S489', 'S49', 'S497', 'S498', 'S499', 'S50',
      'S500', 'S501', 'S507', 'S508', 'S509', 'S51', 'S510', 'S517', 'S518', 'S519', 'S52', 'S520', 'S521', 'S522',
      'S523', 'S524', 'S525', 'S526', 'S527', 'S528', 'S529', 'S53', 'S530', 'S531', 'S532', 'S533', 'S534', 'S54',
      'S540', 'S541', 'S542', 'S543', 'S547', 'S548', 'S549', 'S55', 'S550', 'S551', 'S552', 'S557', 'S558', 'S559',
      'S56', 'S560', 'S561', 'S562', 'S563', 'S564', 'S565', 'S567', 'S568', 'S57', 'S570', 'S578', 'S579', 'S58',
      'S580', 'S581', 'S589', 'S59', 'S597', 'S598', 'S599', 'S60', 'S600', 'S601', 'S602', 'S607', 'S608', 'S609',
      'S61', 'S610', 'S611', 'S617', 'S618', 'S619', 'S62', 'S620', 'S621', 'S622', 'S623', 'S624', 'S625', 'S626',
      'S627', 'S628', 'S63', 'S630', 'S631', 'S632', 'S633', 'S634', 'S635', 'S636', 'S637', 'S64', 'S640', 'S641',
      'S642', 'S643', 'S644', 'S647', 'S648', 'S649', 'S65', 'S650', 'S651', 'S652', 'S653', 'S654', 'S655', 'S657',
      'S658', 'S659', 'S66', 'S660', 'S661', 'S662', 'S663', 'S664', 'S665', 'S666', 'S667', 'S668', 'S669', 'S67',
      'S670', 'S678', 'S68', 'S680', 'S681', 'S682', 'S683', 'S684', 'S688', 'S689', 'S69', 'S697', 'S698', 'S699',
      'S70', 'S700', 'S701', 'S707', 'S708', 'S709', 'S71', 'S710', 'S711', 'S717', 'S718', 'S72', 'S720', 'S721',
      'S722', 'S723', 'S724', 'S727', 'S728', 'S729', 'S73', 'S730', 'S731', 'S74', 'S740', 'S741', 'S742', 'S747',
      'S748', 'S749', 'S75', 'S750', 'S751', 'S752', 'S757', 'S758', 'S759', 'S76', 'S760', 'S761', 'S762', 'S763',
      'S764', 'S767', 'S77', 'S770', 'S771', 'S772', 'S78', 'S780', 'S781', 'S789', 'S79', 'S797', 'S798', 'S799',
      'S80', 'S800', 'S801', 'S807', 'S808', 'S809', 'S81', 'S810', 'S817', 'S818', 'S819', 'S82', 'S820', 'S821',
      'S822', 'S823', 'S824', 'S825', 'S826', 'S827', 'S828', 'S829', 'S83', 'S830', 'S831', 'S832', 'S833', 'S834',
      'S835', 'S836', 'S837', 'S84', 'S840', 'S841', 'S842', 'S847', 'S848', 'S849', 'S85', 'S850', 'S851', 'S852',
      'S853', 'S854', 'S855', 'S857', 'S858', 'S859', 'S86', 'S860', 'S861', 'S862', 'S863', 'S867', 'S868', 'S869',
      'S87', 'S870', 'S878', 'S88', 'S880', 'S881', 'S889', 'S89', 'S897', 'S898', 'S899', 'S90', 'S900', 'S901',
      'S902', 'S903', 'S907', 'S908', 'S909', 'S91', 'S910', 'S911', 'S912', 'S913', 'S917', 'S92', 'S920', 'S921',
      'S922', 'S923', 'S924', 'S925', 'S927', 'S929', 'S93', 'S930', 'S931', 'S932', 'S933', 'S934', 'S935', 'S936',
      'S94', 'S940', 'S941', 'S942', 'S943', 'S947', 'S948', 'S949', 'S95', 'S950', 'S951', 'S952', 'S957', 'S958',
      'S959', 'S96', 'S960', 'S961', 'S962', 'S967', 'S968', 'S969', 'S97', 'S970', 'S971', 'S978', 'S98', 'S980',
      'S981', 'S982', 'S983', 'S984', 'S99', 'S997', 'S998', 'S999', 'T00', 'T000', 'T001', 'T002', 'T003', 'T006',
      'T008', 'T009', 'T01', 'T010', 'T011', 'T012', 'T013', 'T016', 'T018', 'T019', 'T02', 'T020', 'T021', 'T022',
      'T023', 'T024', 'T025', 'T026', 'T027', 'T028', 'T029', 'T03', 'T030', 'T031', 'T032', 'T033', 'T034', 'T038',
      'T039', 'T04', 'T040', 'T041', 'T042', 'T043', 'T044', 'T047', 'T048', 'T049', 'T05', 'T050', 'T051', 'T052',
      'T053', 'T054', 'T055', 'T056', 'T058', 'T059', 'T06', 'T060', 'T061', 'T062', 'T063', 'T064', 'T065', 'T068',
      'T07', 'T08', 'T09', 'T090', 'T091', 'T092', 'T093', 'T094', 'T095', 'T096', 'T098', 'T099', 'T10', 'T11',
      'T110', 'T111', 'T112', 'T113', 'T114', 'T115', 'T116', 'T118', 'T119', 'T12', 'T13', 'T130', 'T131', 'T132',
      'T133', 'T134', 'T135', 'T136', 'T138', 'T139', 'T14', 'T140', 'T141', 'T142', 'T143', 'T144', 'T145', 'T146',
      'T147', 'T148', 'T149', 'T15', 'T150', 'T151', 'T158', 'T159', 'T16', 'T17', 'T170', 'T171', 'T172', 'T173',
      'T174', 'T175', 'T178', 'T179', 'T18', 'T180', 'T181', 'T182', 'T183', 'T184', 'T185', 'T188', 'T189', 'T19',
      'T190', 'T191', 'T192', 'T193', 'T198', 'T199', 'T20', 'T200', 'T201', 'T202', 'T203', 'T204', 'T205', 'T206',
      'T207', 'T21', 'T210', 'T211', 'T212', 'T213', 'T214', 'T215', 'T216', 'T217', 'T22', 'T220', 'T221', 'T222',
      'T223', 'T224', 'T225', 'T226', 'T227', 'T23', 'T230', 'T231', 'T232', 'T233', 'T234', 'T235', 'T236', 'T237',
      'T24', 'T240', 'T241', 'T242', 'T243', 'T244', 'T245', 'T246', 'T247', 'T25', 'T250', 'T251', 'T252', 'T253',
      'T254', 'T255', 'T256', 'T257', 'T26', 'T260', 'T261', 'T262', 'T263', 'T264', 'T265', 'T266', 'T267', 'T268',
      'T269', 'T27', 'T270', 'T271', 'T272', 'T273', 'T274', 'T275', 'T276', 'T277', 'T28', 'T280', 'T281', 'T282',
      'T283', 'T284', 'T285', 'T286', 'T287', 'T288', 'T289', 'T29', 'T290', 'T291', 'T292', 'T293', 'T294', 'T295',
      'T296', 'T297', 'T30', 'T300', 'T301', 'T302', 'T303', 'T304', 'T305', 'T306', 'T307', 'T31', 'T310', 'T311',
      'T312', 'T313', 'T314', 'T315', 'T316', 'T317', 'T318', 'T319', 'T32', 'T320', 'T321', 'T322', 'T323', 'T324',
      'T325', 'T326', 'T327', 'T328', 'T329', 'T33', 'T330', 'T331', 'T332', 'T333', 'T334', 'T335', 'T336', 'T337',
      'T338', 'T339', 'T34', 'T340', 'T341', 'T342', 'T343', 'T344', 'T345', 'T346', 'T347', 'T348', 'T349', 'T35',
      'T350', 'T351', 'T352', 'T353', 'T354', 'T355', 'T356', 'T357', 'T36', 'T360', 'T361', 'T362', 'T363', 'T364',
      'T365', 'T366', 'T367', 'T368', 'T369', 'T37', 'T370', 'T371', 'T372', 'T373', 'T374', 'T375', 'T378', 'T379',
      'T38', 'T380', 'T381', 'T382', 'T383', 'T384', 'T385', 'T386', 'T387', 'T388', 'T389', 'T39', 'T390', 'T391',
      'T392', 'T393', 'T394', 'T398', 'T399', 'T40', 'T400', 'T401', 'T402', 'T403', 'T404', 'T405', 'T406', 'T407',
      'T408', 'T409', 'T41', 'T410', 'T411', 'T412', 'T413', 'T414', 'T415', 'T42', 'T420', 'T421', 'T422', 'T423',
      'T424', 'T425', 'T426', 'T427', 'T428', 'T43', 'T430', 'T431', 'T432', 'T433', 'T434', 'T435', 'T436', 'T438',
      'T439', 'T44', 'T440', 'T441', 'T442', 'T443', 'T444', 'T445', 'T446', 'T447', 'T448', 'T449', 'T45', 'T450',
      'T451', 'T452', 'T453', 'T454', 'T455', 'T456', 'T457', 'T458', 'T459', 'T46', 'T460', 'T461', 'T462', 'T463',
      'T464', 'T465', 'T466', 'T467', 'T468', 'T469', 'T47', 'T470', 'T471', 'T472', 'T473', 'T474', 'T475', 'T476',
      'T477', 'T478', 'T479', 'T48', 'T480', 'T481', 'T482', 'T483', 'T484', 'T485', 'T486', 'T487', 'T49', 'T490',
      'T491', 'T492', 'T493', 'T494', 'T495', 'T496', 'T497', 'T498', 'T499', 'T50', 'T500', 'T501', 'T502', 'T503',
      'T504', 'T505', 'T506', 'T507', 'T508', 'T509', 'T51', 'T510', 'T511', 'T512', 'T513', 'T518', 'T519', 'T52',
      'T520', 'T521', 'T522', 'T523', 'T524', 'T528', 'T529', 'T53', 'T530', 'T531', 'T532', 'T533', 'T534', 'T535',
      'T536', 'T537', 'T539', 'T54', 'T540', 'T541', 'T542', 'T543', 'T549', 'T55', 'T56', 'T560', 'T561', 'T562',
      'T563', 'T564', 'T565', 'T566', 'T567', 'T568', 'T569', 'T57', 'T570', 'T571', 'T572', 'T573', 'T578', 'T579',
      'T58', 'T59', 'T590', 'T591', 'T592', 'T593', 'T594', 'T595', 'T596', 'T597', 'T598', 'T599', 'T60', 'T600',
      'T601', 'T602', 'T603', 'T604', 'T608', 'T609', 'T61', 'T610', 'T611', 'T612', 'T618', 'T619', 'T62', 'T620',
      'T621', 'T622', 'T628', 'T629', 'T63', 'T630', 'T631', 'T632', 'T633', 'T634', 'T635', 'T636', 'T638', 'T639',
      'T64', 'T65', 'T650', 'T651', 'T652', 'T653', 'T654', 'T655', 'T656', 'T658', 'T659', 'T66', 'T67', 'T670',
      'T671', 'T672', 'T673', 'T674', 'T675', 'T676', 'T677', 'T678', 'T679', 'T68', 'T69', 'T690', 'T691', 'T698',
      'T699', 'T70', 'T700', 'T701', 'T702', 'T703', 'T704', 'T708', 'T709', 'T71', 'T73', 'T730', 'T731', 'T732',
      'T733', 'T738', 'T739', 'T74', 'T740', 'T741', 'T742', 'T743', 'T748', 'T749', 'T75', 'T750', 'T751', 'T752',
      'T753', 'T754', 'T758', 'T78', 'T780', 'T781', 'T782', 'T783', 'T784', 'T788', 'T789', 'T79', 'T790', 'T791',
      'T792', 'T793', 'T794', 'T795', 'T796', 'T797', 'T798', 'T799', 'T80', 'T800', 'T801', 'T802', 'T803', 'T804',
      'T805', 'T806', 'T808', 'T809', 'T81', 'T810', 'T811', 'T812', 'T813', 'T814', 'T815', 'T816', 'T817', 'T818',
      'T819', 'T82', 'T820', 'T821', 'T822', 'T823', 'T824', 'T825', 'T826', 'T827', 'T828', 'T829', 'T83', 'T830',
      'T831', 'T832', 'T833', 'T834', 'T835', 'T836', 'T838', 'T839', 'T84', 'T840', 'T841', 'T842', 'T843', 'T844',
      'T845', 'T846', 'T847', 'T848', 'T849', 'T85', 'T850', 'T851', 'T852', 'T853', 'T854', 'T855', 'T856', 'T857',
      'T858', 'T859', 'T86', 'T860', 'T861', 'T862', 'T863', 'T864', 'T868', 'T869', 'T87', 'T870', 'T871', 'T872',
      'T873', 'T874', 'T875', 'T876', 'T88', 'T880', 'T881', 'T882', 'T883', 'T884', 'T885', 'T886', 'T887', 'T888',
      'T889', 'T90', 'T900', 'T901', 'T902', 'T903', 'T904', 'T905', 'T908', 'T909', 'T91', 'T910', 'T911', 'T912',
      'T913', 'T914', 'T915', 'T918', 'T919', 'T92', 'T920', 'T921', 'T922', 'T923', 'T924', 'T925', 'T926', 'T928',
      'T929', 'T93', 'T930', 'T931', 'T932', 'T933', 'T934', 'T935', 'T936', 'T938', 'T939', 'T94', 'T940', 'T941',
      'T95', 'T950', 'T951', 'T952', 'T953', 'T954', 'T958', 'T959', 'T96', 'T97', 'T98', 'T980', 'T981', 'T982',
      'T983', 'Y89', 'Y899']
      
g8 = ['Y10', 'Y11', 'Y12', 'Y13', 'Y14', 'Y15', 'Y16', 'Y17', 'Y18', 'Y19', 'Y20', 'Y21', 'Y22', 'Y23',
      'Y24', 'Y241', 'Y242', 'Y243', 'Y244', 'Y249', 'Y25', 'Y26', 'Y27', 'Y28', 'Y29', 'Y30', 'Y31',
      'Y32', 'Y33', 'Y34', 'Y87', 'Y872']
      
g9 = ['B99']
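
Because the nine groups are described as mutually exclusive, a quick overlap check on the hand-typed lists can catch a code entered twice or placed in two groups. The following is a minimal sketch rather than part of the original analysis: `find_overlaps` is a hypothetical helper, and `toy_groups` is a small stand-in for the full `g1` to `g9` lists.

```python
from collections import Counter

def find_overlaps(groups):
    """Return codes listed more than once across (or within) the groups."""
    counts = Counter(code for codes in groups.values() for code in codes)
    return sorted(code for code, n in counts.items() if n > 1)

# Toy stand-ins for the real g1..g9 lists (assumption: abbreviated for illustration).
toy_groups = {
    "g1": ["A40", "A400", "A41"],
    "g2": ["I50", "I500"],
    "g9": ["B99"],
}
print(find_overlaps(toy_groups))   # [] -> the toy groups are mutually exclusive

toy_groups["g9"].append("I50")     # deliberately introduce a clash
print(find_overlaps(toy_groups))   # ['I50']
```

With the real lists, the same helper could be called on a dictionary such as `{'g1': g1, ..., 'g9': g9}`; an empty result would be consistent with the mutual exclusivity claim.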

**Flag records with any garbage code in UCOD field.** Combine sublists of garbage codes and flag row if underlying cause ICD-10 code (UCOD) is in the combined list.

In [11]:
gc_all = g1 + g2 + g3 + g4 + g5 + g6 + g7 + g8 + g9
%store gc_all


Stored 'gc_all' (list)


**Create a single flag variable** to indicate that the underlying cause code was one of the garbage codes listed above.

In [12]:
ds['gc_any'] = ds['UCOD'].isin(gc_all)

In [13]:
gc_table = ds['gc_any'].value_counts(dropna=False).to_frame('has_garbage_code')
gc_table['Percent'] = (gc_table['has_garbage_code']/gc_table['has_garbage_code'].sum()) * 100

gc_table

|  | has_garbage_code | Percent |
|---|---|---|
| False | 211924 | 93.359824 |
| True | 15073 | 6.640176 |


In [14]:
pd.crosstab(ds.gc_any, ds.dody)

| gc_any \ dody | 2016 | 2017 | 2018 | 2019 |
|---|---|---|---|---|
| False | 51127 | 53364 | 53265 | 54168 |
| True | 3694 | 3628 | 3685 | 4066 |


In [15]:
round(pd.crosstab(ds.gc_any, ds.dody, normalize="columns")*100,1)

| gc_any \ dody | 2016 | 2017 | 2018 | 2019 |
|---|---|---|---|---|
| False | 93.3 | 93.6 | 93.5 | 93.0 |
| True | 6.7 | 6.4 | 6.5 | 7.0 |


**Between 2016 and 2019 the number of death records that were assigned garbage codes** for the underlying cause ranged from 3,628 (6.4%) in 2017 to 4,066 (7.0%) in 2019 of the total death records for persons who died in Washington State.
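
The yearly percentages can be reproduced directly from the counts in the first crosstab. This is a small arithmetic check, not a replacement for the pandas output; the dictionary name `counts` is introduced here for illustration only.

```python
# (garbage-coded, valid) record counts per year, copied from the crosstab output.
counts = {
    2016: (3694, 51127),
    2017: (3628, 53364),
    2018: (3685, 53265),
    2019: (4066, 54168),
}

# Percent of each year's records whose UCOD is a garbage code.
pct = {year: round(100 * gc / (gc + valid), 1) for year, (gc, valid) in counts.items()}
print(pct)  # {2016: 6.7, 2017: 6.4, 2018: 6.5, 2019: 7.0}
```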

**Create a dictionary of garbage code flags for the nine categories (and one for valid codes) with corresponding ICD-10 codes to attach to records**


In [16]:

gcdict = {'1': g1, '2': g2, '3': g3, '4': g4, '5': g5, '6': g6, '7': g7, '8': g8, '9': g9}

gcdict_rev = {v: k for k in gcdict for v in gcdict[k]}

# the above is functionally equivalent to:

#gcdict_rev = {}
#for key in gcdict:
#    for value in gcdict[key]:
#        gcdict_rev[value] = key

gc_label_dict = {1 : '1-Septicemia', 
                 2 : '2-Heart failure', 
                 3 : '3-Ill-defined cancer', 
                 4 : '4-Volume depletion', 
                 5 : '5-Ill-defined', 
                 6 : '6-Ill-defined cardiovascular', 
                 7 : '7-Ill-defined injury', 
                 8 : '8-Undetermined intent', 
                 9 : '9-Ill-defined infectious'}

%store gc_label_dict

ds['gc_cat'] = ds['UCOD'].map(gcdict_rev).fillna(0).astype(int) # map GC category numeric labels (1-9) to record based on UCOD code.  Records with valid UCOD codes are labelled '0'.

ds['gc_cat_label'] = ds['gc_cat'].map(gc_label_dict).fillna('0-No GC').astype(str) # add word labels.

Stored 'gc_label_dict' (dict)


**Check to ensure all records are labelled**

In [17]:
ds['gc_cat'].value_counts().sort_index(ascending=True).to_frame()

|  | gc_cat |
|---|---|
| 0 | 211924 |
| 1 | 2149 |
| 2 | 2767 |
| 3 | 2376 |
| 4 | 534 |
| 5 | 2973 |
| 6 | 3868 |
| 8 | 368 |
| 9 | 38 |


In [18]:
ds['gc_cat_label'].value_counts().sort_index(ascending=True).to_frame()

|  | gc_cat_label |
|---|---|
| 0-No GC | 211924 |
| 1-Septicemia | 2149 |
| 2-Heart failure | 2767 |
| 3-Ill-defined cancer | 2376 |
| 4-Volume depletion | 534 |
| 5-Ill-defined | 2973 |
| 6-Ill-defined cardiovascular | 3868 |
| 8-Undetermined intent | 368 |
| 9-Ill-defined infectious | 38 |
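
One way to confirm that the labelling is complete is to check that the category counts add up to the totals reported earlier. The sketch below simply re-enters the counts from the output above (the variable names are illustrative); it also lists any category number that has no records.

```python
# Category counts copied from the value_counts() output above.
cat_counts = {0: 211924, 1: 2149, 2: 2767, 3: 2376, 4: 534,
              5: 2973, 6: 3868, 8: 368, 9: 38}

total_records = 226997   # rows in the working data set
flagged = 15073          # rows where gc_any is True

# Every record should carry exactly one label.
assert sum(cat_counts.values()) == total_records
# The non-zero categories should account for all flagged records.
assert sum(n for cat, n in cat_counts.items() if cat != 0) == flagged

# Categories 0-9 that do not appear in the output at all.
missing = [cat for cat in range(10) if cat not in cat_counts]
print(missing)  # [7]
```

The counts are consistent, and the check shows that no record in this data set has an underlying cause code in category 7 (Ill-defined injury).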


**Label records by type of medical certifier** - at this point the records only have numeric values indicating the type of medical certifier (health care provider) that certified each record.

In [19]:
certifier_dict = {1:'1-Physician',
                  2: '2-ME/Coroner',
                  3: '3-DO',
                  4: '4-Chiropractor',
                  5: '5-Sanipractor',
                  6: '6-PA', 
                  7: '7-ARNP',
                  8: '8-NA',
                  9: '9-Unknown'}

ds['cert_label'] = ds['certdesig'].map(certifier_dict).fillna('9-Unknown').astype(str)

In [20]:
ds['cert_label'].value_counts(dropna=False)

1-Physician       150407
2-ME/Coroner       35073
7-ARNP             23668
3-DO               11724
6-PA                4495
9-Unknown           1625
4-Chiropractor         3
8-NA                   2
Name: cert_label, dtype: int64

**CREATE AGE GROUPS**

In [21]:
ds['agegrp'] = pd.cut(ds.ageyrs, 
                        bins=[0,19,29,39,49,59,69,79,115], 
                        labels = ['0-19 yrs', '20-29 yrs', '30-39 yrs', '40-49 yrs',
                                 '50-59 yrs', '60-69 yrs', '70-79 yrs', '80+ yrs'])

In [22]:
ds['agegrp'].value_counts(dropna=False)

80+ yrs      101570
70-79 yrs     48729
60-69 yrs     37560
50-59 yrs     19747
40-49 yrs      7986
30-39 yrs      4964
20-29 yrs      3461
0-19 yrs       1518
NaN            1462
Name: agegrp, dtype: int64
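
The `NaN` group deserves a note. `ageyrs` has no missing values, so these 1,462 records must fall outside the bins. By default `pd.cut` builds right-closed intervals and excludes the left edge of the first bin, so an age of exactly 0 is not assigned to '0-19 yrs'. Whether all of the `NaN` records are in fact deaths at age 0 is an assumption that would need to be checked against the data. The sketch below uses made-up ages to show the default behavior and the effect of `include_lowest=True`.

```python
import pandas as pd

bins = [0, 19, 29, 39, 49, 59, 69, 79, 115]
labels = ['0-19 yrs', '20-29 yrs', '30-39 yrs', '40-49 yrs',
          '50-59 yrs', '60-69 yrs', '70-79 yrs', '80+ yrs']

# Made-up ages for illustration; 0.0 stands in for a death before the first birthday.
ages = pd.Series([0.0, 1.0, 19.0, 20.0, 84.0])

# Default: intervals are (0, 19], (19, 29], ... so 0.0 is left out.
default_cut = pd.cut(ages, bins=bins, labels=labels)
print(default_cut.isna().tolist())   # [True, False, False, False, False]

# include_lowest=True closes the first interval on the left.
inclusive_cut = pd.cut(ages, bins=bins, labels=labels, include_lowest=True)
print(inclusive_cut.tolist())
# ['0-19 yrs', '0-19 yrs', '0-19 yrs', '20-29 yrs', '80+ yrs']
```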

**RECODE AND LABEL RACE**

In [23]:
ds['race'] = ds['bridgerace']
ds['race'] = ds['race'].replace([1], "White")
ds['race'] = ds['race'].replace([2], "African Am.")
ds['race'] = ds['race'].replace([3], "AIAN")
ds['race'] = ds['race'].replace(list(range(4, 11)), "Asian")          # codes 4-10
ds['race'] = ds['race'].replace(list(range(11, 15)), "Pacific Isl.")  # codes 11-14
ds['race'] = ds['race'].replace([15, 21, 22, 23, 24], "Other/multirace")
ds['race'] = ds['race'].replace([99], "Unknown")
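
An alternative to the chain of `replace` calls is a single lookup dictionary built from the same code groupings. This is a sketch of that design choice, not the approach used in this notebook; `race_groups` and `race_lookup` are hypothetical names.

```python
# Same groupings as the replace() calls above, expressed as label -> codes.
race_groups = {
    "White": [1],
    "African Am.": [2],
    "AIAN": [3],
    "Asian": range(4, 11),          # codes 4-10
    "Pacific Isl.": range(11, 15),  # codes 11-14
    "Other/multirace": [15, 21, 22, 23, 24],
    "Unknown": [99],
}

# Invert to code -> label so each code is looked up once.
race_lookup = {code: label for label, codes in race_groups.items() for code in codes}

print(race_lookup[10])                  # Asian
print(race_lookup.get(16, "Unmapped"))  # Unmapped (16 is not in any group)
```

The dictionary could then be applied with `ds['bridgerace'].map(race_lookup)`. Unlike `replace`, `map` turns codes that are missing from the dictionary into `NaN`, which makes unexpected codes easy to spot.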

In [24]:
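# hispno appears to flag non-Hispanic decedents ('Y' = not Hispanic),
# so the values are inverted to build the 'hispanic' variable.
ds['hispanic'] = ds['hispno']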
ds['hispanic'] = ds['hispanic'].replace(['Y'], 'No')
ds['hispanic'] = ds['hispanic'].replace(['N'], 'Yes')

In [26]:
ds = ds.loc[:, ['sfn', 'sex', 'dody', 'dcounty','hispanic', 'race', 'agegrp', 'manner', 'tobac', 'pg',
                'gc_cat', 'gc_cat_label', 'UCOD', 'MC2','AllMC', 'MC2_20', 'codlit', 'cert_label']]
ds.head()

Unnamed: 0,sfn,sex,dody,dcounty,hispanic,race,agegrp,manner,tobac,pg,gc_cat,gc_cat_label,UCOD,MC2,AllMC,MC2_20,codlit,cert_label
0,2017025187,M,2017,KING,No,White,80+ yrs,N,N,8.0,0,0-No GC,J841,I10,J841 I10 I219 I269 I48 I509 J690 J960 R090 ...,I10 I219 I269 I48 I509 J690 J960 R090,ACUTE RESPIRATORY FAILURE WITH HYPOXIA ASPIRAT...,1-Physician
1,2017025188,M,2017,KING,No,White,50-59 yrs,N,N,8.0,0,0-No GC,I64,,I64,,ACUTE ISCHEMIC CEREBRAL VASCULAR ACCIDENT,7-ARNP
2,2017025189,F,2017,KING,No,Asian,70-79 yrs,N,N,8.0,0,0-No GC,C833,D619,C833 D619 E46 E883 R58 Y433,D619 E46 E883 R58 Y433,RETROPERITONEAL BLEEDING EXTENSIVE LARGE B CEL...,1-Physician
3,2017025190,F,2017,KING,No,White,80+ yrs,N,N,8.0,4,4-Volume depletion,E86,N179,E86 N179 R13,N179 R13,ACUTE RENAL FAILURE DEHYDRATION DYSPHAGIA,1-Physician
4,2017025191,M,2017,SPOKANE,No,White,70-79 yrs,N,Y,8.0,4,4-Volume depletion,E870,F54,E870 F54 G729 I509 J969 R093 R418 R53 T179 W80...,F54 G729 I509 J969 R093 R418 R53 T179 W80 ...,PULSELESS ELECTRICAL ACTIVITY ARREST HYPOXEMIC...,1-Physician


In [27]:
ds.to_csv(r'Y:/DQSS/Death/MBG/py/capstone2/data/d1619_clean.csv', index=None, header=True)

**TEXT PREPROCESSING**

In this section, I prepare a corpus of tokens for bag of words analysis and Latent Dirichlet Allocation. 
<br> 
<br> 
Text pre-processing is a series of steps that standardize the sentences or phrases in a body of text by converting them to 'tokens': single words or multi-word phrases stripped of punctuation, numbers, white space, and stop words (commonly occurring words).  The words are then converted to a uniform case, stemmed, and lemmatized to complete the standardization.  The end product of pre-processing is a list of words or multi-word phrases for each observation.  For example, classifying tweets by topic would involve pre-processing each tweet as described and converting it into a list of cleaned and stemmed words for use in the model.
<br> 
<br> 
Initially, I used the text of the medical terms entered into the cause of death fields by medical certifiers.  However, after some exploration I decided to use ICD-10 codes to create the corpus instead of words as it yielded better results in my models.
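
For reference, the pre-processing steps described above can be sketched for a cause of death literal as follows. This is a simplified illustration under stated assumptions: `STOPWORDS` is a tiny made-up list rather than NLTK's stop word corpus, `preprocess` is a hypothetical helper, and stemming and lemmatization are left out.

```python
import re

# Assumption: a tiny illustrative stop word list, not NLTK's English list.
STOPWORDS = {"with", "of", "the", "and", "to", "due"}

def preprocess(text):
    """Lowercase, strip punctuation and digits, split on white space, drop stop words."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)   # replace anything that is not a letter or space
    return [tok for tok in text.split() if tok not in STOPWORDS]

print(preprocess("ACUTE RESPIRATORY FAILURE WITH HYPOXIA"))
# ['acute', 'respiratory', 'failure', 'hypoxia']
```

A general-purpose stop word list removes 'with' here, but it would not remove frequent, low-information medical terms, which is part of the motivation for switching to ICD-10 codes.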

**Using ICD-10 codes instead of medical terms** for tokens has two primary advantages: 
(1) all words and phrases are standardized so that variations of a given cause of death will not show up as separate topics.  (2) conversion to codes means that I don't have to create a customized list of stopwords that are appropriate for medical terminology. Words like 'of', 'the', 'and' etc. are already removed from the corpus.

**Restrict data set** to records with a garbage code as the underlying cause code. I also remove records labelled 4-Volume depletion, 8-Undetermined intent, and 9-Ill-defined infectious because of the small number of records in each group.
<br>


In [28]:
keep_gc_cat = [1, 2, 3, 5, 6, 7]
ds = ds.loc[ds['gc_cat'].isin(keep_gc_cat), :]


In [29]:
print(ds.gc_cat.value_counts())
print(len(ds))

6    3868
5    2973
2    2767
3    2376
1    2149
Name: gc_cat, dtype: int64
14133


**NATURAL LANGUAGE PROCESSING USING CAUSE OF DEATH LITERAL TEXT VS. USING CAUSE OF DEATH ICD-10 CODES**

Initially, I intended to use cause of death literal text for natural language processing.  However, during text pre-processing it became clear that generally available tools for cleaning text, including stop word removal, are geared towards day-to-day language and are therefore less effective at removing medical words that occur frequently but offer little information. It was difficult to find a pre-existing list of stop words specialized for healthcare-related text.
<br> 
Eventually, I decided to conduct my analysis using the ICD-10 codes assigned to each death record as the UCOD code or as one of the multiple cause (MC) codes. The advantage is that much of the standardization and removal of extraneous language has already been accomplished by the process of assigning ICD-10 codes to represent the causes of death listed in the literal text fields.  I ended up using the UCOD code field together with a field that concatenates the multiple cause fields MC2 through MC20 (called "MC2_20" in this data set).
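
With codes instead of words, tokenization reduces to splitting the concatenated field on white space. The following is a minimal sketch assuming that `MC2_20` is a space-separated string and that a record with no additional codes has a missing value there; `codes_to_tokens` is a hypothetical helper, and the example values come from the records displayed earlier.

```python
def codes_to_tokens(ucod, mc2_20):
    """Combine the UCOD and the space-separated MC2_20 codes into one token list."""
    tokens = [ucod] if isinstance(ucod, str) and ucod else []
    # A missing MC2_20 value (None or NaN) is not a string and is skipped.
    if isinstance(mc2_20, str):
        tokens.extend(mc2_20.split())
    return tokens

print(codes_to_tokens("E86", "N179 R13"))  # ['E86', 'N179', 'R13']
print(codes_to_tokens("I64", None))        # ['I64']
```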

**TREATING GARBAGE CODES AS STOPWORDS**

Garbage codes (GCs), whether they appear as the UCOD or as one of the MCs, by definition contribute no meaningful information about the disease process or injury that led to an individual's death.  In this sense they are similar to stop words, which likewise help neither to cluster similar documents nor to separate dissimilar ones.  During text pre-processing of the concatenated MC field, I initially treated GCs as stop words and removed them from the final text corpus.  However, when I compared a corpus that included GCs with one that excluded them, the supervised learning methods performed slightly better when GCs were left intact.  For the unsupervised method, treating the GCs as stop words and removing them from the corpus yielded better results. For this reason, I decided to create two versions of the final corpus: with and without GCs.

**Use list of garbage ICD-10 codes as 'stop word' list.** In addition to removing records with garbage codes as underlying cause codes, I also remove the codes themselves from all multiple cause code positions. The presence of GCs in any multiple cause code position could obscure any useful information contained in the remaining multiple cause codes in the topic modeling step.
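As a minimal sketch of the stop-word treatment described above, the following uses plain Python lists. The set `gc_example` is a hypothetical, illustrative subset (two codes that are dropped from one of the sample records displayed later in this notebook), not the full garbage code list, and `remove_gc` is a helper name introduced here only for illustration.

```python
# Hypothetical subset of garbage codes, for illustration only;
# the notebook's real stop-word list is gc_all / gc_plus.
gc_example = {"I500", "T828"}


def remove_gc(codes, garbage):
    """Return the codes with garbage codes filtered out, order preserved."""
    return [c for c in codes if c not in garbage]


# One record's concatenated codes, split on whitespace
record = "I500 D469 I802 J189 T828 Y831".split()

with_gc = list(record)                       # corpus version that keeps GCs
without_gc = remove_gc(record, gc_example)   # corpus version that drops GCs

print(with_gc)
print(without_gc)
```

Keeping both lists side by side mirrors the decision to maintain two versions of the corpus, one for the supervised models and one for topic modeling.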


**Creating a corpus of ICD-10 full codes as unigrams**

ICD-10 mortality codes usually consist of four alphanumeric characters. The first position is always a letter indicating the larger family of conditions to which the code belongs. The second and third positions are digits indicating a condition or cause of death, and the fourth and final position is a digit indicating a specific location in the body or a sub-category of the disease represented by the first three characters. In this step I create a version of the corpus in which each full code is a unigram.

**Creating a corpus of ICD-10 short codes as unigrams**

The ICD-10 equivalent of stemming or lemmatizing words during pre-processing is the removal of the final digit, which provides granular information about the condition but may not be useful in the present effort to classify these deaths into broad categories. In this step I create a version of the corpus that consists of three-character codes as unigrams.
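The truncation step can be sketched as follows. This is an illustration under simple assumptions, not the notebook's own function: `shorten` is a hypothetical helper name, and the input codes are examples chosen to show that two distinct four-character codes can collapse to the same three-character category.

```python
from collections import Counter


def shorten(codes):
    """Truncate each ICD-10 code to its first three characters."""
    return [c[:3] for c in codes]


full_codes = ["I500", "I509", "J189", "F03"]
short_codes = shorten(full_codes)

print(short_codes)
# Counting the short codes shows how truncation merges related codes
print(Counter(short_codes))
```

Codes that already have only three characters, such as F03, pass through unchanged, so the same slice can be applied to every token.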


In [30]:
# add respiratory failure (J960, J961, J969) and tobacco use (F179) codes to the list of stopwords
respfail_tobac_codes = ['J960','J961','J969','F179']
gc_plus = gc_all + respfail_tobac_codes


**Remove rows with no values in MC2 through MC20**.  If MC2 is empty, the subsequent MC variables will also be blank.

In [32]:
ds.dropna(subset = ['MC2'], inplace=True)
len(ds)

10019

In [None]:
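'''%%writefile myfunction_2.py

from nltk.tokenize import word_tokenize


def make_corpi(data, stop_words):
    # Create the lists inside the function so that repeated calls
    # do not accumulate results from earlier calls.
    clean_mc = []    # full codes with stop words (garbage codes) removed
    short_mc = []    # three-character codes with stop words removed
    clean_mcgc = []  # three-character codes with garbage codes retained
    stop_words = set(stop_words)  # set gives faster membership tests

    for cod in data:
        cod_words = word_tokenize(cod)
        cod_nostop = [w for w in cod_words if w not in stop_words]
        clean_mc.append(cod_nostop)

        cod_short = [w[0:3] for w in cod_nostop]
        short_mc.append(cod_short)

        short_mcgc = [w[0:3] for w in cod_words]
        clean_mcgc.append(short_mcgc)
    return clean_mc, short_mc, clean_mcgc
'''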

In [33]:
ds.head()

Unnamed: 0,sfn,sex,dody,dcounty,hispanic,race,agegrp,manner,tobac,pg,gc_cat,gc_cat_label,UCOD,MC2,AllMC,MC2_20,codlit,cert_label
30,2017014601,F,2017,KING,No,White,70-79 yrs,N,N,8.0,1,1-Septicemia,A419,I469,A419 I469 R092,I469 R092,"RESPIRATORY AND CARDIAC ARREST SEVERE SEPSIS, ...",1-Physician
38,2017025200,F,2017,SNOHOMISH,No,White,80+ yrs,N,N,8.0,2,2-Heart failure,I500,D469,I500 D469 I802 J189 T828 Y831,D469 I802 J189 T828 Y831,"CONGESTIVE HEART FAILURE, CAUSE NOT FORMALLY W...",1-Physician
45,2018031044,M,2018,PIERCE,No,White,60-69 yrs,N,U,8.0,3,3-Ill-defined cancer,C80,B182,C80 B182 C786 N179,B182 C786 N179,METASTATIC MALIGNANCY WITH PERITONEAL CARCINOM...,7-ARNP
49,2018031047,F,2018,SNOHOMISH,No,White,80+ yrs,N,N,8.0,1,1-Septicemia,A419,C762,A419 C762 R99,C762 R99,SEPTIC SHOCK OF UNKNOWN ETIOLOGY ABDOMINAL CAR...,1-Physician
69,2017019359,F,2017,PIERCE,No,White,80+ yrs,N,P,8.0,2,2-Heart failure,I509,A310,I509 A310 F179 I120 I461 I48 J449 K922 Q600 ...,A310 F179 I120 I461 I48 J449 K922 Q600,"SUDDEN CARDIAC DEATH, PROBABLE ARRHYTHMIA ATRI...",1-Physician


In [34]:
from myfunction_2 import make_corpi

ds['clean_mc'], ds['short_mc'], ds['clean_mcgc'] = make_corpi(ds.loc[:,'AllMC'], gc_plus)

**Creating a corpus of bigrams** After trying various versions of the corpus with the Latent Dirichlet Allocation (LDA) model, I found that the best-performing model used a corpus of bigrams. Instead of individual ICD-10 codes, this corpus consists of pairs of codes that appear consecutively in a death record.

In [35]:
def make_bigrams(doc):
    bi = []
    for i in range(len(doc)-1):
        bigrm = doc[i] + "_" + doc[i+1]
        bi.append(bigrm)
    return bi
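
For comparison, the same pairing can be written with `zip`. This is only a sketch of an equivalent formulation; `make_bigrams_zip` is a hypothetical name and is not used elsewhere in this notebook.

```python
def make_bigrams_zip(doc):
    """Join each pair of consecutive codes with an underscore."""
    return [a + "_" + b for a, b in zip(doc, doc[1:])]


# A record with four codes yields three bigrams
print(make_bigrams_zip(["D469", "I802", "J189", "Y831"]))

# A record with a single code yields no bigrams at all
print(make_bigrams_zip(["I48"]))
```

Records with only one remaining code therefore contribute an empty list to the bigram corpus, which is worth keeping in mind when the corpus is passed to a topic model.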


In [36]:
ds['all_bigrams'] = ds['clean_mc'].apply(make_bigrams)

**Creating a corpus of unigrams from MC2 through MC20 without removing stop words** This corpus will be used in the supervised machine learning models in notebook 5 of this series. Supervised algorithms had slightly better classification accuracy when GCs were left in the corpus.

In [40]:
'''%%writefile myfunction_3.py

from nltk.tokenize import word_tokenize

def short_mc2_20(data):
    mc220_gc_toks = []
    for cod in data:
        cod_words = word_tokenize(cod)
        cod_short = [w[0:3] for w in cod_words]
        mc220_gc_toks.append(cod_short)
    return mc220_gc_toks'''


Writing myfunction_3.py


In [41]:
from myfunction_3 import short_mc2_20

ds['mc2_20_short_toks'] = short_mc2_20(ds.loc[:,'MC2_20'])

**Remove rows with empty lists in 'clean_mc' column**. After all the garbage codes are removed to create the new variable 'clean_mc', some cells contain empty lists, indicating that all of the ICD-10 codes for those records were garbage codes. After removing these records, the dataset is reduced to 7,163 records, meaning that roughly half of the 15,072 records with garbage underlying cause codes also had only GCs in the multiple cause positions and so provide no valuable information at all about the disease process or condition causing the death.

In [42]:
# keep only rows whose 'clean_mc' list is non-empty (an empty list is falsy)
ds = ds[ds.clean_mc.astype(bool)]
len(ds)

7163
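
The filter above relies on Python truthiness: an empty list evaluates to `False`, so converting the column to booleans marks exactly the rows to drop. A plain-Python sketch of the same idea, using made-up token lists rather than the notebook's data:

```python
# Made-up token lists; two of them are empty after stop-word removal
docs = [["D469", "I802"], [], ["I48"], []]

# bool([]) is False, so the condition keeps only non-empty lists
kept = [doc for doc in docs if doc]

print([bool(doc) for doc in docs])
print(len(kept))
```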

In [43]:
ds.head()

Unnamed: 0,sfn,sex,dody,dcounty,hispanic,race,agegrp,manner,tobac,pg,gc_cat,gc_cat_label,UCOD,MC2,AllMC,MC2_20,codlit,cert_label,clean_mc,short_mc,clean_mcgc,all_bigrams,mc2_20_short_toks
38,2017025200,F,2017,SNOHOMISH,No,White,80+ yrs,N,N,8.0,2,2-Heart failure,I500,D469,I500 D469 I802 J189 T828 Y831,D469 I802 J189 T828 Y831,"CONGESTIVE HEART FAILURE, CAUSE NOT FORMALLY W...",1-Physician,"[D469, I802, J189, Y831]","[D46, I80, J18, Y83]","[I50, D46, I80, J18, T82, Y83]","[D469_I802, I802_J189, J189_Y831]","[D46, I80, J18, T82, Y83]"
45,2018031044,M,2018,PIERCE,No,White,60-69 yrs,N,U,8.0,3,3-Ill-defined cancer,C80,B182,C80 B182 C786 N179,B182 C786 N179,METASTATIC MALIGNANCY WITH PERITONEAL CARCINOM...,7-ARNP,"[B182, C786, N179]","[B18, C78, N17]","[C80, B18, C78, N17]","[B182_C786, C786_N179]","[B18, C78, N17]"
69,2017019359,F,2017,PIERCE,No,White,80+ yrs,N,P,8.0,2,2-Heart failure,I509,A310,I509 A310 F179 I120 I461 I48 J449 K922 Q600 ...,A310 F179 I120 I461 I48 J449 K922 Q600,"SUDDEN CARDIAC DEATH, PROBABLE ARRHYTHMIA ATRI...",1-Physician,"[A310, I120, I48, J449, K922, Q600]","[A31, I12, I48, J44, K92, Q60]","[I50, A31, F17, I12, I46, I48, J44, K92, Q60]","[A310_I120, I120_I48, I48_J449, J449_K922, K92...","[A31, F17, I12, I46, I48, J44, K92, Q60]"
70,2017026057,F,2017,CLALLAM,No,White,80+ yrs,N,N,8.0,2,2-Heart failure,I500,F03,I500 F03 J189,F03 J189,PNEUMONIA SYSTOLIC CONGESTIVE HEART FAILURE ...,1-Physician,"[F03, J189]","[F03, J18]","[I50, F03, J18]",[F03_J189],"[F03, J18]"
90,2017022720,F,2017,KING,No,White,80+ yrs,N,N,8.0,6,6-Ill-defined cardiovascular,I10,I48,I10 I48,I48,"UNSPECIFIED NATURAL CAUSES HYPERTENSION, AT...",1-Physician,[I48],[I48],"[I10, I48]",[],[I48]


In [44]:
ds.to_csv(r'Y:/DQSS/Death/MBG/py/capstone2/data/d1619_clean_preproctxt.csv', index=None, header=True)

**Next step: 3_Exploratory data analysis**