# Diabetes 130-US hospitals for years 1999-2008 Data Set

## Data Set Information:

The dataset represents 10 years (1999-2008) of clinical care at 130 US hospitals and integrated delivery networks. It includes over 50 features representing patient and hospital outcomes. Information was extracted from the database for encounters that satisfied the following criteria:

1. It is an inpatient encounter (a hospital admission).
2. It is a diabetic encounter, that is, one during which any kind of diabetes was entered into the system as a diagnosis.
3. The length of stay was at least 1 day and at most 14 days.
4. Laboratory tests were performed during the encounter.
5. Medications were administered during the encounter.

The data contains attributes such as patient number, race, gender, age, admission type, time in hospital, medical specialty of the admitting physician, number of lab tests performed, HbA1c test result, diagnosis, number of medications, diabetic medications, and the number of outpatient, inpatient, and emergency visits in the year before the hospitalization.
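
The five inclusion criteria can be read as a row filter over an encounter table. The sketch below applies them to a small made-up table; the column names (`encounter_type`, `has_diabetes_dx`, `time_in_hospital`, `num_lab_procedures`, `num_medications`) and all rows are assumptions for illustration, not the schema of the source database.

```python
import pandas as pd

# Hypothetical encounter table: column names and values are invented for illustration.
encounters = pd.DataFrame({
    "encounter_type": ["inpatient", "outpatient", "inpatient", "inpatient", "inpatient"],
    "has_diabetes_dx": [True, True, False, True, True],
    "time_in_hospital": [3, 2, 5, 20, 7],        # length of stay in days
    "num_lab_procedures": [41, 10, 12, 30, 0],
    "num_medications": [12, 3, 8, 15, 9],
})

# Combine the five criteria with logical AND.
mask = (
    (encounters["encounter_type"] == "inpatient")       # criterion 1: inpatient encounter
    & encounters["has_diabetes_dx"]                     # criterion 2: diabetes diagnosis recorded
    & encounters["time_in_hospital"].between(1, 14)     # criterion 3: 1 to 14 days, inclusive
    & (encounters["num_lab_procedures"] > 0)            # criterion 4: lab tests performed
    & (encounters["num_medications"] > 0)               # criterion 5: medications administered
)
selected = encounters[mask]
print(selected)
```

In this toy table only the first row passes all five checks; each of the other rows fails exactly one criterion.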

## Bibliography

Beata Strack, Jonathan P. DeShazo, Chris Gennings, Juan L. Olmo, Sebastian Ventura, Krzysztof J. Cios, and John N. Clore, “Impact of HbA1c Measurement on Hospital Readmission Rates: Analysis of 70,000 Clinical Database Patient Records,” BioMed Research International, vol. 2014, Article ID 781670, 11 pages, 2014.

In [3]:
# Import statements

import sklearn
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

## train models

In [4]:
train = pd.read_csv("numeric_data.csv").dropna(axis=0)

In [5]:
## preprocessing utilities and the logistic regression model
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

In [6]:
# No random_state is passed, so the split and all scores below vary between runs.
Xtrain, Xtest, ytrain, ytest = train_test_split(
    train.drop("readmitted", axis=1), train["readmitted"],
    test_size=0.20, train_size=0.80, shuffle=True)

In [18]:
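# Standardize features with training-split statistics. Note that the models in this
# notebook are fit on the unscaled Xtrain; the scaled arrays are not used in the cells shown.
scaler = StandardScaler().fit(Xtrain)
Xtrainscaled = scaler.transform(Xtrain)
Xtestscaled = scaler.transform(Xtest)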
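
One way to make sure scaled features actually reach a model is to bundle the scaler and the estimator in a scikit-learn pipeline. The following is a minimal sketch, not the notebook's method: it uses synthetic data from `make_classification` as a stand-in for `numeric_data.csv`, and the names `X_demo`, `y_demo`, `pipe`, and `demo_accuracy` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real data (assumption, not numeric_data.csv).
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.20, random_state=0)

# The pipeline fits the scaler on the training data only and
# reapplies the same transformation whenever it predicts or scores.
pipe = make_pipeline(
    StandardScaler(),
    LogisticRegression(solver="liblinear", penalty="l1"),
)
pipe.fit(Xtr, ytr)

demo_accuracy = pipe.score(Xte, yte)  # mean accuracy on the held-out split
print(demo_accuracy)
```

Because scaling lives inside the estimator, the same object can be passed to cross-validation or a grid search without leaking test-fold statistics into the scaler.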

In [21]:
## L1-penalized logistic regression; score() returns mean accuracy on the test split
reg = LogisticRegression(solver = "liblinear", penalty = "l1").fit(Xtrain, ytrain)
reg.score(Xtest, ytest)

0.6171550465925676

In [7]:
from sklearn.neural_network import MLPClassifier
mlp = MLPClassifier().fit(Xtrain, ytrain)
mlp.score(Xtest, ytest)

0.6082856180532166

In [9]:
from sklearn.svm import SVC
svc = SVC().fit(Xtrain, ytrain)
svc.score(Xtest, ytest)

0.6054788368698776

In [12]:
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier().fit(Xtrain, ytrain)
rfc.score(Xtest, ytest)

0.6018861569552038

In [11]:
from sklearn.tree import DecisionTreeClassifier
dtc = DecisionTreeClassifier().fit(Xtrain, ytrain)
dtc.score(Xtest, ytest)

0.5351970360390704

In [13]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier().fit(Xtrain, ytrain)
knn.score(Xtest, ytest)

0.5668575277871337
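
The cells above repeat the same fit-then-score pattern for each classifier. A loop over a dictionary of estimators expresses it more compactly. This is a sketch under stated assumptions: it runs on synthetic data from `make_classification` rather than the notebook's data, fixes `random_state` for repeatability, and uses a reduced set of models with hypothetical names (`models`, `scores`).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real data (assumption).
X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.20, random_state=0)

# Estimators to compare; hyperparameters here are illustrative defaults.
models = {
    "logreg": LogisticRegression(solver="liblinear", penalty="l1"),
    "tree": DecisionTreeClassifier(random_state=0),
    "forest": RandomForestClassifier(n_estimators=20, random_state=0),
    "knn": KNeighborsClassifier(),
}

# Fit each model on the training split and record test accuracy.
scores = {name: model.fit(Xtr, ytr).score(Xte, yte) for name, model in models.items()}

for name, acc in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {acc:.3f}")
```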

## GridSearch cross-validation across parameters

In [None]:
from sklearn.model_selection import GridSearchCV
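
The cell above only imports `GridSearchCV`. A minimal sketch of what a search could look like follows, assuming a logistic regression and a small, arbitrary grid over `C` and `penalty`. The grid values, the 5-fold setting, and the synthetic data are illustrative assumptions, not settings taken from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the real data (assumption).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# Hypothetical grid: 4 values of C times 2 penalties gives 8 candidates.
param_grid = {
    "C": [0.01, 0.1, 1.0, 10.0],
    "penalty": ["l1", "l2"],
}

search = GridSearchCV(
    LogisticRegression(solver="liblinear"),  # liblinear supports both l1 and l2
    param_grid,
    cv=5,                 # 5-fold cross-validation for every candidate
    scoring="accuracy",
)
search.fit(X_demo, y_demo)

best_params = search.best_params_
best_cv_accuracy = search.best_score_   # mean cross-validated accuracy of the best candidate
print(best_params, best_cv_accuracy)
```

With the default `refit=True`, `search` is refit on all of the data passed to `fit` using the best parameters, so it can be scored on a held-out test split afterwards.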

## testing