We'll start by loading the data.

In [1]:
import numpy as np
import pandas as pd
import sklearn

pt_info = pd.read_csv("../data/interim/pt_info.csv")

We'll pick up where we left off last time. To refresh your memory, we found two main issues in our data set that need to be addressed.
1. The variable *diagnosis* can hold multiple values in a single cell, separated by semicolons.
2. Most variables are categorical, and they must be converted to numeric values for many machine learning algorithms to work.

The first order of business will be to fix the *diagnosis* variable.
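
Before splitting anything, it can help to measure how many cells actually hold more than one diagnosis. The following is a minimal sketch on a made-up Series (the values are illustrative and do not come from `pt_info`), assuming `;` is the only separator in use:

```python
import pandas as pd

# Toy stand-in for pt_info["diagnosis"]; the values are illustrative only.
toy = pd.Series(["SEPSIS", "SYNCOPE;TELEMETRY", "FEVER", None])

# True where a cell holds more than one ';'-separated entry.
# regex=False treats ';' literally; na=False treats missing cells as single-valued.
is_multi = toy.str.contains(";", regex=False, na=False)

n_multi = int(is_multi.sum())
# Number of entries in the widest cell (separators + 1); missing cells are skipped.
max_parts = int(toy.str.count(";").max()) + 1
print(n_multi, max_parts)
```

The widest cell tells you how many columns a later `str.split(";", expand=True)` would produce.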

In [2]:
pt_info["diagnosis"].value_counts()

SEPSIS                                               15
PNEUMONIA                                            12
FEVER                                                 5
SHORTNESS OF BREATH                                   4
CONGESTIVE HEART FAILURE                              3
                                                     ..
SYNCOPE;TELEMETRY                                     1
SUBDURAL HEMATOMA/S/P FALL                            1
ABDOMINAL PAIN                                        1
CHRONIC MYELOGENOUS LEUKEMIA;TRANSFUSION REACTION     1
BASAL GANGLIN BLEED                                   1
Name: diagnosis, Length: 95, dtype: int64

In [3]:
diagnosis = pt_info["diagnosis"].str.split(";", expand = True)
diagnosis

Unnamed: 0,0,1,2
0,SEPSIS,,
1,HEPATITIS B,,
2,SEPSIS,,
3,HUMERAL FRACTURE,,
4,ALCOHOLIC HEPATITIS,,
...,...,...,...
139,PERICARDIAL EFFUSION,,
140,ALTERED MENTAL STATUS,,
141,ACUTE RESPIRATORY DISTRESS SYNDROME,ACUTE RENAL FAILURE,
142,BRADYCARDIA,,


To gather insight from the data, we have to translate our information into a form that a machine can work with. We'll use [one-hot encoding](link) to create one numeric 0/1 column per distinct diagnosis.
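
To make the idea concrete, here is a small sketch of one-hot encoding a multi-valued column. The toy Series is invented for illustration; the split, stack and sum steps are one possible way to do it, shown next to the built-in `Series.str.get_dummies(sep=...)` shortcut.

```python
import pandas as pd

# Toy diagnosis column; the second patient has two diagnoses.
toy = pd.Series(["SEPSIS", "PNEUMONIA;FEVER", "FEVER"])

# Step 1: split each cell into separate columns.
parts = toy.str.split(";", expand=True)

# Step 2: stack() moves the split columns into a second index level,
# giving one row per (patient, position).
stacked = parts.stack()

# Step 3: one indicator column per distinct value.
dummies = stacked.str.get_dummies()

# Step 4: collapse back to one row per patient (index level 0).
encoded = dummies.groupby(level=0).sum()

# The same result in one call, splitting on ';' directly.
shortcut = toy.str.get_dummies(sep=";")

print(encoded)
```

Each row now has a 1 in every column that matches one of the patient's diagnoses and a 0 elsewhere.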

In [4]:
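# stack() gives one row per (patient, diagnosis); get_dummies() one-hot encodes them;
# groupby(level=0).sum() collapses the result back to one row per patient.
diagnosis2 = diagnosis.stack().str.get_dummies().groupby(level=0).sum()
# Merge the two spellings of the same diagnosis into a single column.
diagnosis2['UTI'] = diagnosis2[' UTI'] + diagnosis2['URINARY TRACT INFECTION']
diagnosis2 = diagnosis2.drop(columns=[' UTI', 'URINARY TRACT INFECTION'])
list(diagnosis2)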

[' MITRAL REGURGITATION',
 'ABDOMINAL PAIN',
 'ABSCESS',
 'ACUTE CHOLANGITIS',
 'ACUTE CHOLECYSTITIS',
 'ACUTE PULMONARY EMBOLISM',
 'ACUTE RENAL FAILURE',
 'ACUTE RESPIRATORY DISTRESS SYNDROME',
 'ACUTE SUBDURAL HEMATOMA',
 'ALCOHOLIC HEPATITIS',
 'ALTERED MENTAL STATUS',
 'ANEMIA',
 'AROMEGLEY',
 'ASTHMA',
 'ASTHMA/COPD FLARE',
 'BASAL GANGLIN BLEED',
 'BRADYCARDIA',
 'BRAIN METASTASES',
 'BRAIN METASTASIS',
 'BURKITTS LYMPHOMA',
 'CELLULITIS',
 'CEREBROVASCULAR ACCIDENT',
 'CHEST PAIN',
 'CHEST PAIN/ CATH',
 'CHOLANGITIS',
 'CHOLECYSTITIS',
 'CHRONIC MYELOGENOUS LEUKEMIA',
 'CHRONIC OBST PULM DISEASE',
 'CONGESTIVE HEART FAILURE',
 'CORONARY ARTERY DISEASE\\CORONARY ARTERY BYPASS GRAFT /SDA',
 'CORONARY ARTERY DISEASE\\CORONARY ARTERY BYPASS GRAFT WITH MVR  ? MITRAL VALVE REPLACEMENT /SDA',
 'CRITICAL AORTIC STENOSIS/HYPOTENSION',
 'ELEVATED LIVER FUNCTIONS',
 'ESOPHAGEAL CA/SDA',
 'ESOPHAGEAL CANCER/SDA',
 'FACIAL NUMBNESS',
 'FAILURE TO THRIVE',
 'FEVER',
 'GASTROINTESTINAL BLEED'
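
The column list above still contains a label with a leading space (`' MITRAL REGURGITATION'`) and near-duplicate spellings such as `'BRAIN METASTASES'` and `'BRAIN METASTASIS'`. One possible way to reduce these, sketched on made-up data, is to strip whitespace from each piece and map known variants to a single label before encoding. The `synonyms` mapping is hypothetical and would need review by someone who knows the clinical terms.

```python
import pandas as pd

# Toy diagnosis column; values are illustrative only.
toy = pd.Series([
    "CHEST PAIN; MITRAL REGURGITATION",
    "BRAIN METASTASES",
    "BRAIN METASTASIS",
])

# Hypothetical mapping from variant spellings to one canonical label.
synonyms = {"BRAIN METASTASIS": "BRAIN METASTASES"}

# Split into lists, give each piece its own row (the patient index is kept),
# strip stray whitespace, then apply the mapping.
pieces = toy.str.split(";").explode().str.strip().replace(synonyms)

# One-hot encode and collapse back to one row per patient.
# max() keeps the indicators at 0/1 even if a label repeats for a patient.
encoded = pieces.str.get_dummies().groupby(level=0).max()

print(list(encoded.columns))
```

Cleaning the labels first avoids having to merge columns by hand after encoding.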

In [5]:
pt_info_clean = pd.concat([pt_info, diagnosis2], axis=1)

pt_info_clean = pd.get_dummies(pt_info_clean, columns=['insurance', 'gender', 'ethnicity', 'admission_type', 'admission_location'])
pt_info_clean = pt_info_clean.drop(columns=['Unnamed: 0', 'diagnosis'])
pt_info_clean = pt_info_clean.fillna(0)
list(pt_info_clean)

['mrsa_positive',
 ' MITRAL REGURGITATION',
 'ABDOMINAL PAIN',
 'ABSCESS',
 'ACUTE CHOLANGITIS',
 'ACUTE CHOLECYSTITIS',
 'ACUTE PULMONARY EMBOLISM',
 'ACUTE RENAL FAILURE',
 'ACUTE RESPIRATORY DISTRESS SYNDROME',
 'ACUTE SUBDURAL HEMATOMA',
 'ALCOHOLIC HEPATITIS',
 'ALTERED MENTAL STATUS',
 'ANEMIA',
 'AROMEGLEY',
 'ASTHMA',
 'ASTHMA/COPD FLARE',
 'BASAL GANGLIN BLEED',
 'BRADYCARDIA',
 'BRAIN METASTASES',
 'BRAIN METASTASIS',
 'BURKITTS LYMPHOMA',
 'CELLULITIS',
 'CEREBROVASCULAR ACCIDENT',
 'CHEST PAIN',
 'CHEST PAIN/ CATH',
 'CHOLANGITIS',
 'CHOLECYSTITIS',
 'CHRONIC MYELOGENOUS LEUKEMIA',
 'CHRONIC OBST PULM DISEASE',
 'CONGESTIVE HEART FAILURE',
 'CORONARY ARTERY DISEASE\\CORONARY ARTERY BYPASS GRAFT /SDA',
 'CORONARY ARTERY DISEASE\\CORONARY ARTERY BYPASS GRAFT WITH MVR  ? MITRAL VALVE REPLACEMENT /SDA',
 'CRITICAL AORTIC STENOSIS/HYPOTENSION',
 'ELEVATED LIVER FUNCTIONS',
 'ESOPHAGEAL CA/SDA',
 'ESOPHAGEAL CANCER/SDA',
 'FACIAL NUMBNESS',
 'FAILURE TO THRIVE',
 'FEVER',
 'GASTR
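
For the single-valued categorical variables, `pd.get_dummies` with the `columns` argument does the encoding in one step. A minimal sketch on a made-up frame (the category values are invented; only the column names `mrsa_positive`, `gender` and `insurance` appear in this notebook):

```python
import pandas as pd

# Toy frame; the category values are illustrative only.
toy = pd.DataFrame({
    "mrsa_positive": [1, 0, 1],
    "gender": ["F", "M", "F"],
    "insurance": ["Medicare", "Private", None],
})

# Each listed column is replaced by one indicator column per category,
# named <column>_<category>. Columns not listed are left untouched.
encoded = pd.get_dummies(toy, columns=["gender", "insurance"], dtype=int)

print(encoded)
```

Note that a missing category yields 0 in every indicator column for that variable unless `dummy_na=True` is passed.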

In [11]:
# Dimensionality reduction (arguably belongs in a separate notebook)

from sklearn.decomposition import TruncatedSVD

# Note: TruncatedSVD is not PCA; it does not center the columns before decomposing.
pca = TruncatedSVD(n_components=2)

# fit_transform the cleaned data, then put the two components into a data frame
pca_demographics = pca.fit_transform(pt_info_clean)
df_pca_demographics = pd.DataFrame(pca_demographics, columns=["x", "y"])
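
The cell above names its object `pca` but builds a `TruncatedSVD`, which decomposes the data without centering it first. As a rough illustration of what a rank-2 truncated SVD computes, here is a NumPy-only sketch on a made-up 0/1 matrix. It is not how scikit-learn implements the solver, and the signs of the components may differ between implementations.

```python
import numpy as np

# Toy 0/1 matrix standing in for an encoded table (4 rows, 3 indicator columns).
X = np.array([
    [1, 0, 1],
    [1, 1, 0],
    [0, 1, 1],
    [1, 0, 0],
], dtype=float)

# Thin SVD: X = U @ diag(s) @ Vt, singular values sorted from largest to smallest.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

k = 2
# Coordinates of each row in the top-k singular directions.
coords = U[:, :k] * s[:k]

# Equivalent view: project the rows of X onto the top-k right singular vectors.
coords_alt = X @ Vt[:k].T

print(coords.shape)
```

Each row of `coords` is a two-dimensional summary of the corresponding row of `X`, which is the kind of array that gets wrapped in a data frame with columns `x` and `y`.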



In [None]:
pt_info_clean.to_csv("../data/interim/pt_info_clean.csv")
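
By default `to_csv` also writes the row index, which comes back as an `Unnamed: 0` column when the file is read again. This is probably where the `Unnamed: 0` column dropped earlier came from. A small in-memory sketch (toy data, using `io.StringIO` in place of a real file) shows the difference `index=False` makes:

```python
import io
import pandas as pd

# Toy frame; values are illustrative only.
toy = pd.DataFrame({"mrsa_positive": [1, 0], "FEVER": [0, 1]})

# Default behaviour: the index is written as an unnamed first column.
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
reloaded = pd.read_csv(with_index)

# With index=False only the data columns are written.
no_index = io.StringIO()
toy.to_csv(no_index, index=False)
no_index.seek(0)
reloaded_clean = pd.read_csv(no_index)

print(list(reloaded.columns))
print(list(reloaded_clean.columns))
```

Whether to pass `index=False` here depends on whether later notebooks expect that extra column.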