#### TEXT CLASSIFIERS - SUPERVISED METHODS

In [44]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import gensim
import nltk
import re
import sre_yield
from verbalexpressions import VerEx
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

**STEPS:**

- select records without garbage codes to train and test models
- create labels for records (10 leading causes of death) by grouping ICD-10 codes
- label records
- preprocess text and ICD-10 codes
- train models
    - Linear SVM
    - logistic regression
    - Naive Bayes
- compare accuracy and select best model
- train, test
- apply to garbage code records
    - does this need to be done separately for each gc category? OR
    - remove labels if target category is implausible or impossible
- compare classification with LDA
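
The modelling steps listed above could be sketched roughly as follows. This is only a minimal illustration and not the final pipeline of this notebook: the texts and labels are made-up stand-ins for `codlit` and the cause-of-death labels, and the use of TF-IDF features with near-default hyperparameters is an assumption.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy data: short cause-of-death literals and their group labels.
texts = [
    "METASTATIC LUNG CANCER", "BREAST CANCER METASTATIC TO BONE",
    "PANCREATIC CANCER", "COLON CANCER WITH METASTASES",
    "CORONARY ARTERY DISEASE", "ACUTE MYOCARDIAL INFARCTION",
    "ATHEROSCLEROTIC HEART DISEASE", "ISCHEMIC HEART DISEASE",
    "CEREBROVASCULAR ACCIDENT", "ISCHEMIC STROKE",
    "INTRACEREBRAL HEMORRHAGE STROKE", "ACUTE CEREBROVASCULAR ACCIDENT",
]
labels = ["cancer"] * 4 + ["heart_disease"] * 4 + ["cerebrovascular_disease"] * 4

# Hold out a stratified test set so every class appears in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=0
)

# The three candidate classifiers named in the steps.
models = {
    "linear_svm": LinearSVC(),
    "logistic_regression": LogisticRegression(max_iter=1000),
    "naive_bayes": MultinomialNB(),
}

# Fit each model on the same TF-IDF features and record test accuracy.
scores = {}
for name, clf in models.items():
    pipe = make_pipeline(TfidfVectorizer(), clf)
    pipe.fit(X_train, y_train)
    scores[name] = pipe.score(X_test, y_test)

best_model = max(scores, key=scores.get)
print(scores, best_model)
```

Wrapping the vectorizer and the classifier in one pipeline keeps the vocabulary fitted on the training split only, which avoids leaking test text into the features.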
    



In [20]:
death_df = pd.read_csv('Y:/DQSS/Death/MBG/py/capstone2/data/d1619s.csv',
                       low_memory=False,
                       encoding = 'unicode_escape')

In [21]:
death_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 226988 entries, 0 to 226987
Data columns (total 23 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   Unnamed: 0    226988 non-null  int64  
 1   sex           226988 non-null  object 
 2   ageyrs        226988 non-null  float64
 3   dob           226988 non-null  object 
 4   dod           226988 non-null  object 
 5   dody          226988 non-null  int64  
 6   dstateFIPS    226988 non-null  object 
 7   marital       226988 non-null  object 
 8   dcounty       226987 non-null  object 
 9   rcounty       226988 non-null  object 
 10  rstatefips    226988 non-null  object 
 11  certdesig     226988 non-null  float64
 12  UCOD          226841 non-null  object 
 13  AllMC         226988 non-null  object 
 14  codlit        226987 non-null  object 
 15  pg            166752 non-null  float64
 16  manner        226946 non-null  object 
 17  tobac         226988 non-null  object 
 18  gc_a

In [22]:
death_df.head()

Unnamed: 0.1,Unnamed: 0,sex,ageyrs,dob,dod,dody,dstateFIPS,marital,dcounty,rcounty,...,AllMC,codlit,pg,manner,tobac,gc_any,gc_cat,gc_cat_label,agegrp,cert_label
0,0,F,71.0,04/15/1945,2017-03-03,2017,WA,D,KING,KING,...,I110 F019 I64,CEREBROVASCULAR ACCIDENT DEMENTIA VASCULAR CON...,8.0,N,U,False,0,0-No GC,70-79 yrs,7-ARNP
1,1,M,98.0,05/03/1918,2017-04-24,2017,WA,W,LEWIS,LEWIS,...,I516,CARDIOVASCULAR DISEASE,8.0,N,N,True,6,6-Ill-defined cardiovascular,80+ yrs,1-Physician
2,2,F,90.0,12/27/1926,2017-04-24,2017,WA,W,LEWIS,LEWIS,...,G419 F179,STATUS EPILEPTICUS,8.0,N,P,False,0,0-No GC,80+ yrs,1-Physician
3,3,M,89.0,09/22/1927,2017-04-23,2017,WA,M,SNOHOMISH,SNOHOMISH,...,J441 C61 I10 I250 R092 R263 S324 W18,RESPIRATORY ARREST CHRONIC OBSTRUCTIVE PULMONA...,8.0,A,U,False,0,0-No GC,80+ yrs,2-ME/Coroner
4,4,F,90.0,09/29/1926,2017-04-23,2017,WA,W,PIERCE,PIERCE,...,I509 A310 F179 I120 I461 I48 J449 K922 Q600 ...,"SUDDEN CARDIAC DEATH, PROBABLE ARRHYTHMIA ATRI...",8.0,N,P,True,2,2-Heart failure,80+ yrs,1-Physician


**Keep records with a valid underlying cause code**, i.e. records without garbage codes, and keep only the relevant variables.

In [26]:
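# .copy() gives an independent frame, so adding columns later cannot trigger SettingWithCopyWarning
df = death_df.loc[death_df['gc_cat'] == 0, ['gc_cat', 'UCOD', 'codlit', 'AllMC']].copy()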

In [27]:
df.head()

Unnamed: 0,gc_cat,UCOD,codlit,AllMC
0,0,I110,CEREBROVASCULAR ACCIDENT DEMENTIA VASCULAR CON...,I110 F019 I64
2,0,G419,STATUS EPILEPTICUS,G419 F179
3,0,J441,RESPIRATORY ARREST CHRONIC OBSTRUCTIVE PULMONA...,J441 C61 I10 I250 R092 R263 S324 W18
6,0,I251,CORONARY ARTERY DISEASE CONGESTIVE HEART FAILU...,I251 D151 I120 I500
7,0,I251,CARDIOVASCULAR CRISIS PROLAPSED MITRAL VALV...,I251 E149 I10 I341 I48 I516


**ATTACHING LABELS** The data frame now contains only records with non-garbage underlying cause codes. ICD-10 codes in the underlying cause variable are typically grouped into broader, more meaningful categories that are used to understand mortality patterns. For example, lung cancer deaths are assigned a code between C34.0 and C34.9 (based on the specific location of the cancer), but leading cause of death analyses typically group deaths due to any cancer into the "malignant neoplasm" category, which contains all codes from C00.0 through C97.9.

In the next step, all deaths in the data frame will be labeled with one of ten causes of death that are leading causes in Washington State (and most of the United States). Together, these ten causes account for ___ % of all deaths in the state.  It is very likely that the poorly coded records will belong to one of these groups.
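
As an alternative to enumerating every code, the grouping can also be expressed as one regular expression per category. The sketch below is only an illustration: the function name `label_ucod` and the label strings are hypothetical, the ranges are taken from the grouping comments in this notebook, and it assumes that `UCOD` stores codes without the decimal point (for example `I110`), as in the records shown above. The two injury groups are left out.

```python
import re

# Three-character ICD-10 categories per group, taken from the notebook comments.
GROUP_PATTERNS = {
    "cancer": r"C[0-8][0-9]|C9[0-7]",                    # C00-C97
    "heart_disease": r"I0[0-9]|I1[13]|I[2-4][0-9]|I5[01]",  # I00-I09, I11, I13, I20-I51
    "cerebrovascular_disease": r"I6[0-9]",               # I60-I69
    "diabetes": r"E1[0-4]",                              # E10-E14
    "alzheimers": r"G30",                                # G30
    "flu_pneumonia": r"J09|J1[0-8]",                     # J09-J18
    "clrd": r"J4[0-7]",                                  # J40-J47
    "liver_dis": r"K7[034]",                             # K70, K73-K74
}

# Allow one optional fourth character, so both "C34" and "C349" match.
COMPILED = {
    label: re.compile(rf"(?:{pattern})[0-9]?")
    for label, pattern in GROUP_PATTERNS.items()
}


def label_ucod(code):
    """Return the group label for one ICD-10 code, or None if no group fits."""
    if not isinstance(code, str):
        return None  # missing values arrive as NaN (a float) from pandas
    code = code.strip()
    for label, regex in COMPILED.items():
        if regex.fullmatch(code):
            return label
    return None


print(label_ucod("C349"), label_ucod("I110"), label_ucod("G419"))
```

With a data frame like the one above, a call such as `df['UCOD'].apply(label_ucod)` would produce a single label column instead of one flag per group.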

In [58]:

# all cancer (C00–C97)
cancer = []

for code in sre_yield.AllStrings(r'^C[0-8][0-9][0-9]{0,1}'):
    cancer.append(code)
    
for code in sre_yield.AllStrings(r'^C9[0-7][0-9]{0,1}'):
    cancer.append(code)

# all heart disease (I00–I09,I11,I13,I20–I51)

heart_disease = []

# I00–I09
for code in sre_yield.AllStrings(r'^I0[0-9][0-9]{0,1}'):
    heart_disease.append(code)

# I11 and I13
for code in sre_yield.AllStrings(r'^I1[13][0-9]{0,1}'):
    heart_disease.append(code)

# I20–I49
for code in sre_yield.AllStrings(r'^I[2-4][0-9][0-9]{0,1}'):
    heart_disease.append(code)

# I50–I51
for code in sre_yield.AllStrings(r'^I5[0-1][0-9]{0,1}'):
    heart_disease.append(code)

# Cerebrovascular disease (I60–I69)

cerebrovascular_disease = []

for code in sre_yield.AllStrings(r'^I6[0-9][0-9]{0,1}'):
    cerebrovascular_disease.append(code)

# Diabetes (E10–E14)

diabetes = []

for code in sre_yield.AllStrings(r'^E1[0-4][0-9]{0,1}'):
    diabetes.append(code)
    
# Alzheimer's disease (G30)

alzheimers = []

# G30 plus an optional fourth character (G300–G309)
for code in sre_yield.AllStrings(r'^G30[0-9]{0,1}'):
    alzheimers.append(code)
    
# Influenza and Pneumonia (J09–J18)
    
flu_pneumonia = []

for code in sre_yield.AllStrings(r'^J09[0-1]{0,1}'):
    flu_pneumonia.append(code)
    
for code in sre_yield.AllStrings(r'^J1[0-8][0-9]{0,1}'):
    flu_pneumonia.append(code)
    
    
# chronic lower respiratory disease (J40–J47)

clrd = []

for code in sre_yield.AllStrings(r'^J4[0-7][0-9]{0,1}'):
    clrd.append(code)
    
# Chronic liver disease and cirrhosis (K70,K73–K74)

liver_dis = []

# one character class covers K70, K73 and K74; the fourth character is optional
for code in sre_yield.AllStrings(r'^K7[034][0-9]{0,1}'):
    liver_dis.append(code)
    
# Suicide (*U03,X60–X84,Y87.0)

# Unintentional injury (V01–X59,Y85–Y86)
    
        

In [70]:
#t = re.compile(r'^(?:C[0-8][0-9]|C9[0-7])[0-9]{0,1}$')

# can I use this to flag records i.e. 'had cancer'

TypeError: 're.Pattern' object is not iterable
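
The error above presumably comes from passing the compiled pattern where a list of values is expected: a `re.Pattern` cannot be iterated, but it can be applied to each value. A minimal sketch of a "had cancer" flag for the multiple-cause field follows; the names `CANCER_RE` and `had_cancer` are hypothetical, and it assumes `AllMC` holds space-separated codes without decimal points, as in the records shown earlier.

```python
import re

# One alternation covers C00-C97; the optional digit is the fourth character.
CANCER_RE = re.compile(r"(?:C[0-8][0-9]|C9[0-7])[0-9]?")


def had_cancer(allmc):
    """Return True if any code in a space-separated AllMC string is a cancer code."""
    if not isinstance(allmc, str):
        return False  # missing values arrive as NaN (a float) from pandas
    return any(CANCER_RE.fullmatch(code) for code in allmc.split())


# AllMC strings copied from the first records shown in this notebook.
print(had_cancer("J441 C61 I10 I250 R092 R263 S324 W18"))
print(had_cancer("I110 F019 I64"))
```

On the data frame, this could be used as `df['AllMC'].apply(had_cancer).astype(int)`. Matching each code with `fullmatch` avoids false hits from codes that merely contain a cancer-like substring.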

In [66]:
df['test'] = df['UCOD'].isin(cancer).astype(int)

In [50]:
df.test.head()

0    0
2    0
3    0
6    0
7    0
Name: test, dtype: int32