<a href="https://colab.research.google.com/github/premdub/AutoML/blob/main/autoML/autoML.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Install Packages

In [347]:
!pip install tpot mljar-supervised

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [348]:
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from supervised.automl import AutoML


# Options Available

- mode: the package ships with four built-in modes.
  - Explain is suited to exploring and understanding the data. It produces feature-importance plots and tree visualizations.
  - Perform is intended for building ML models for production.
  - Compete is intended for building models for machine learning competitions.
  - Optuna is used to search for highly tuned ML models.
- algorithms: the algorithms you would like to use, usually passed in as a list.
- results_path: the path where the results will be stored.
- total_time_limit: the total time, in seconds, allowed for training.
- train_ensemble: whether an ensemble is created at the end of the training process.
- stack_models: whether a stack of models is created.
- eval_metric: the metric to optimize. If set to "auto", logloss is used for classification problems and RMSE for regression problems.

In [349]:
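# Template only: uncomment and adjust the values before use.
# automl = AutoML(
#     mode="Explain",
#     algorithms=["Decision Tree", "Random Forest", "Xgboost"],
#     results_path="AutoML_22",
#     total_time_limit=30 * 60,  # seconds
#     train_ensemble=True,
#     stack_models=False,
#     eval_metric="auto",
# )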
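
The options listed above can also be collected in a plain dictionary and unpacked into the constructor. The following is a minimal sketch that only builds and checks such a dictionary; it does not import `supervised`, and the helper name `build_automl_options` and the example values are illustrative assumptions rather than part of the package.

```python
# Hypothetical helper (not part of mljar-supervised): gathers the options
# described above into keyword arguments after a basic sanity check.
VALID_MODES = {"Explain", "Perform", "Compete", "Optuna"}


def build_automl_options(mode="Explain", results_path="AutoML_22",
                         total_time_limit=30 * 60, eval_metric="auto"):
    """Return a dict of keyword arguments for the AutoML constructor."""
    if mode not in VALID_MODES:
        raise ValueError(f"mode must be one of {sorted(VALID_MODES)}, got {mode!r}")
    if total_time_limit <= 0:
        raise ValueError("total_time_limit must be a positive number of seconds")
    return {
        "mode": mode,
        "results_path": results_path,
        "total_time_limit": total_time_limit,
        "eval_metric": eval_metric,
    }


options = build_automl_options(mode="Perform", results_path="AutoML_demo")
print(options)
```

With mljar-supervised installed, the dictionary could then be passed along as `AutoML(**options)`.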

# Healthcare Dataset - Hospital Acquired Infection 

## Load in dataset

In [350]:
import pandas as pd
HAI = pd.read_csv('https://query.data.world/s/lcc3wkldiuom2xid5xlcn3bqpkk263')
HAI

Unnamed: 0,Provider ID,Hospital Name,Address,City,State,ZIP Code,County Name,Phone Number,Measure Name,Measure ID,Compared to National,Score,Footnote,Measure Start Date,Measure End Date,Location
0,10005,MARSHALL MEDICAL CENTER SOUTH,2505 U S HIGHWAY 431 NORTH,BOAZ,AL,35957,MARSHALL,2565938310,CLABSI: Lower Confidence Limit,HAI_1_CI_LOWER,No Different than National Benchmark,0.165,,01/01/2015,06/30/2015,"2505 U S HIGHWAY 431 NORTH\nBOAZ, AL 35957\n"
1,10005,MARSHALL MEDICAL CENTER SOUTH,2505 U S HIGHWAY 431 NORTH,BOAZ,AL,35957,MARSHALL,2565938310,CLABSI: Upper Confidence Limit,HAI_1_CI_UPPER,No Different than National Benchmark,3.249,,01/01/2015,06/30/2015,"2505 U S HIGHWAY 431 NORTH\nBOAZ, AL 35957\n"
2,10005,MARSHALL MEDICAL CENTER SOUTH,2505 U S HIGHWAY 431 NORTH,BOAZ,AL,35957,MARSHALL,2565938310,CLABSI: Number of Procedures,HAI_1_DOPC_DAYS,No Different than National Benchmark,1485,,01/01/2015,06/30/2015,"2505 U S HIGHWAY 431 NORTH\nBOAZ, AL 35957\n"
3,10005,MARSHALL MEDICAL CENTER SOUTH,2505 U S HIGHWAY 431 NORTH,BOAZ,AL,35957,MARSHALL,2565938310,CLABSI: Predicted Cases,HAI_1_ELIGCASES,No Different than National Benchmark,2.034,,01/01/2015,06/30/2015,"2505 U S HIGHWAY 431 NORTH\nBOAZ, AL 35957\n"
4,10005,MARSHALL MEDICAL CENTER SOUTH,2505 U S HIGHWAY 431 NORTH,BOAZ,AL,35957,MARSHALL,2565938310,CLABSI: Observed Cases,HAI_1_NUMERATOR,No Different than National Benchmark,2,,01/01/2015,06/30/2015,"2505 U S HIGHWAY 431 NORTH\nBOAZ, AL 35957\n"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
222859,351325,ST LUKE'S HOSPITAL,702 1ST ST SW,CROSBY,ND,58730,DIVIDE,7019656384,CAUTI: Observed Cases,HAI_2_NUMERATOR,Not Available,Not Available,5 - Results are not available for this reporti...,01/01/2015,06/30/2015,"702 1ST ST SW\nCROSBY, ND 58730\n(48.906708, -..."
222860,471303,NORTHEASTERN VERMONT REGIONAL HOSPITAL,1315 HOSPITAL DRIVE,SAINT JOHNSBURY,VT,5819,CALEDONIA,8027487400,C.diff Lower Confidence Limit,HAI_6_CI_LOWER,Not Available,Not Available,5 - Results are not available for this reporti...,07/01/2014,06/30/2015,"1315 HOSPITAL DRIVE\nSAINT JOHNSBURY, VT 05819..."
222861,150026,IU HEALTH GOSHEN HOSPITAL,200 HIGH PARK AVE,GOSHEN,IN,46526,ELKHART,5743641000,CLABSI: Number of Procedures,HAI_1_DOPC_DAYS,No Different than National Benchmark,1416,,01/01/2015,06/30/2015,"200 HIGH PARK AVE\nGOSHEN, IN 46526\n(41.56432..."
222862,50038,SANTA CLARA VALLEY MEDICAL CENTER,751 SOUTH BASCOM AVENUE,SAN JOSE,CA,95128,SANTA CLARA,4088855000,SSI: Colon Predicted Cases,HAI_3_ELIGCASES,Worse than the National Benchmark,3.246,,07/01/2014,06/30/2015,"751 SOUTH BASCOM AVENUE\nSAN JOSE, CA 95128\n(..."


In [351]:
HAI.columns

Index(['Provider ID', 'Hospital Name', 'Address', 'City', 'State', 'ZIP Code',
       'County Name', 'Phone Number', 'Measure Name', 'Measure ID',
       'Compared to National', 'Score', 'Footnote', 'Measure Start Date',
       'Measure End Date', 'Location'],
      dtype='object')

## Potential variables of interest

- Measure ID (categorical)
- Compared to National (categorical)
- Score (continuous, but stored as text because missing values are recorded as "Not Available")

In [352]:
HAI['Score'].describe()

count            222864
unique            21771
top       Not Available
freq             124717
Name: Score, dtype: object

In [353]:
HAI['Measure ID'].describe()

count             222864
unique                48
top       HAI_1_CI_LOWER
freq                4643
Name: Measure ID, dtype: object

In [354]:
HAI['Compared to National'].value_counts()

Not Available                           121398
No Different than National Benchmark     78324
Better than the National Benchmark       18108
Worse than the National Benchmark         5034
Name: Compared to National, dtype: int64

## Create some simplified binary versions

In [355]:
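# Convert Score to numeric; non-numeric entries such as "Not Available" become NaN
HAI['Score'] = pd.to_numeric(HAI['Score'], errors='coerce')
# Scores above 15 are labeled high. NaN > 15 is False, so missing scores are labeled low.
HAI['HAI_los'] = HAI['Score'].apply(lambda x: 'High-infection-rate' if x > 15 else 'Low-infection-rate')
# Drop Score so the target cannot be read directly from the features
HAI.drop('Score', axis=1, inplace=True)
HAI['HAI_los'].value_counts()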


Low-infection-rate     200205
High-infection-rate     22659
Name: HAI_los, dtype: int64
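
Because `errors='coerce'` turns "Not Available" into `NaN`, and `NaN > 15` evaluates to `False`, rows without a score end up in the low class. The sketch below reproduces this on a toy series and shows one possible alternative, dropping missing scores before labeling. The toy values and the `label` helper are assumptions for illustration only.

```python
import pandas as pd

# Toy scores mimicking the Score column: numbers stored as text plus a missing marker
scores = pd.Series(["0.165", "1485", "Not Available", "2.034"])
numeric = pd.to_numeric(scores, errors="coerce")  # "Not Available" -> NaN


def label(x):
    # Same rule as in the cell above
    return "High-infection-rate" if x > 15 else "Low-infection-rate"


# NaN > 15 is False, so the missing score is labeled low
labels_all = numeric.apply(label)

# Alternative (assumption): drop missing scores before labeling
labels_known = numeric.dropna().apply(label)

print(labels_all.tolist())
print(labels_known.tolist())
```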

# MLJar Examples

## Binary Classifier - Hospital Acquired Infection

### **Create new model**

In [372]:
X = HAI.drop(columns=['HAI_los'])

In [373]:
y = HAI["HAI_los"]

In [374]:
X

Unnamed: 0,Provider ID,Hospital Name,Address,City,State,ZIP Code,County Name,Phone Number,Measure Name,Measure ID,Compared to National,Footnote,Measure Start Date,Measure End Date,Location
0,10005,MARSHALL MEDICAL CENTER SOUTH,2505 U S HIGHWAY 431 NORTH,BOAZ,AL,35957,MARSHALL,2565938310,CLABSI: Lower Confidence Limit,HAI_1_CI_LOWER,No Different than National Benchmark,,01/01/2015,06/30/2015,"2505 U S HIGHWAY 431 NORTH\nBOAZ, AL 35957\n"
1,10005,MARSHALL MEDICAL CENTER SOUTH,2505 U S HIGHWAY 431 NORTH,BOAZ,AL,35957,MARSHALL,2565938310,CLABSI: Upper Confidence Limit,HAI_1_CI_UPPER,No Different than National Benchmark,,01/01/2015,06/30/2015,"2505 U S HIGHWAY 431 NORTH\nBOAZ, AL 35957\n"
2,10005,MARSHALL MEDICAL CENTER SOUTH,2505 U S HIGHWAY 431 NORTH,BOAZ,AL,35957,MARSHALL,2565938310,CLABSI: Number of Procedures,HAI_1_DOPC_DAYS,No Different than National Benchmark,,01/01/2015,06/30/2015,"2505 U S HIGHWAY 431 NORTH\nBOAZ, AL 35957\n"
3,10005,MARSHALL MEDICAL CENTER SOUTH,2505 U S HIGHWAY 431 NORTH,BOAZ,AL,35957,MARSHALL,2565938310,CLABSI: Predicted Cases,HAI_1_ELIGCASES,No Different than National Benchmark,,01/01/2015,06/30/2015,"2505 U S HIGHWAY 431 NORTH\nBOAZ, AL 35957\n"
4,10005,MARSHALL MEDICAL CENTER SOUTH,2505 U S HIGHWAY 431 NORTH,BOAZ,AL,35957,MARSHALL,2565938310,CLABSI: Observed Cases,HAI_1_NUMERATOR,No Different than National Benchmark,,01/01/2015,06/30/2015,"2505 U S HIGHWAY 431 NORTH\nBOAZ, AL 35957\n"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
222859,351325,ST LUKE'S HOSPITAL,702 1ST ST SW,CROSBY,ND,58730,DIVIDE,7019656384,CAUTI: Observed Cases,HAI_2_NUMERATOR,Not Available,5 - Results are not available for this reporti...,01/01/2015,06/30/2015,"702 1ST ST SW\nCROSBY, ND 58730\n(48.906708, -..."
222860,471303,NORTHEASTERN VERMONT REGIONAL HOSPITAL,1315 HOSPITAL DRIVE,SAINT JOHNSBURY,VT,5819,CALEDONIA,8027487400,C.diff Lower Confidence Limit,HAI_6_CI_LOWER,Not Available,5 - Results are not available for this reporti...,07/01/2014,06/30/2015,"1315 HOSPITAL DRIVE\nSAINT JOHNSBURY, VT 05819..."
222861,150026,IU HEALTH GOSHEN HOSPITAL,200 HIGH PARK AVE,GOSHEN,IN,46526,ELKHART,5743641000,CLABSI: Number of Procedures,HAI_1_DOPC_DAYS,No Different than National Benchmark,,01/01/2015,06/30/2015,"200 HIGH PARK AVE\nGOSHEN, IN 46526\n(41.56432..."
222862,50038,SANTA CLARA VALLEY MEDICAL CENTER,751 SOUTH BASCOM AVENUE,SAN JOSE,CA,95128,SANTA CLARA,4088855000,SSI: Colon Predicted Cases,HAI_3_ELIGCASES,Worse than the National Benchmark,,07/01/2014,06/30/2015,"751 SOUTH BASCOM AVENUE\nSAN JOSE, CA 95128\n(..."


In [375]:
y

0          Low-infection-rate
1          Low-infection-rate
2         High-infection-rate
3          Low-infection-rate
4          Low-infection-rate
                 ...         
222859     Low-infection-rate
222860     Low-infection-rate
222861    High-infection-rate
222862     Low-infection-rate
222863     Low-infection-rate
Name: HAI_los, Length: 222864, dtype: object

In [376]:
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.25)
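
Passing `stratify=y` keeps the class proportions the same in the training and test sets, which matters for an imbalanced target like this one. A small sketch on toy data follows; the toy frame and the `random_state=0` argument (added so the split is reproducible) are assumptions and are not used in the notebook cell.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 80 low and 20 high labels, mirroring an imbalanced target
y_toy = pd.Series(["Low-infection-rate"] * 80 + ["High-infection-rate"] * 20)
X_toy = pd.DataFrame({"feature": range(100)})

# random_state is an addition here so that the split can be reproduced
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, stratify=y_toy, test_size=0.25, random_state=0
)

# Both splits keep the 80/20 class ratio of the full toy target
print(y_tr.value_counts(normalize=True))
print(y_te.value_counts(normalize=True))
```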

In [377]:
X_test

Unnamed: 0,Provider ID,Hospital Name,Address,City,State,ZIP Code,County Name,Phone Number,Measure Name,Measure ID,Compared to National,Footnote,Measure Start Date,Measure End Date,Location
33531,331301,"CUBA MEMORIAL HOSPITAL, INC",140 WEST MAIN STREET,CUBA,NY,14727,ALLEGANY,5859612000,CAUTI Predicted Cases,HAI_2a_ELIGCASES,Not Available,5 - Results are not available for this reporti...,07/01/2014,06/30/2015,"140 WEST MAIN STREET\nCUBA, NY 14727\n(42.2133..."
86046,450845,EL PASO SPECIALTY HOSPITAL,1755 CURIE SUITE A,EL PASO,TX,79902,EL PASO,9155443636,SSI: Colon Observed Cases,HAI_3_NUMERATOR,Not Available,12 - This measure does not apply to this hospi...,07/01/2014,06/30/2015,"1755 CURIE SUITE A\nEL PASO, TX 79902\n(31.780..."
97139,250082,DELTA REGIONAL MEDICAL CENTER,1400 E UNION ST,GREENVILLE,MS,38704,WASHINGTON,6623783783,Surgical Site Infection from abdominal hystere...,HAI_4_SIR,Not Available,13 - Results cannot be calculated for this rep...,07/01/2014,06/30/2015,"1400 E UNION ST\nGREENVILLE, MS 38704\n(33.409..."
11431,390163,ACMH HOSPITAL,ONE NOLTE DRIVE,KITTANNING,PA,16201,ARMSTRONG,7245438500,MRSA Observed Cases,HAI_5_NUMERATOR,Not Available,13 - Results cannot be calculated for this rep...,07/01/2014,06/30/2015,"ONE NOLTE DRIVE\nKITTANNING, PA 16201\n"
92162,170194,DOCTORS HOSPITAL LLC,4901 COLLEGE BLVD,LEAWOOD,KS,66211,JOHNSON,9135291801,Surgical Site Infection from colon surgery (SS...,HAI_3_SIR,Not Available,12 - This measure does not apply to this hospi...,07/01/2014,06/30/2015,"4901 COLLEGE BLVD\nLEAWOOD, KS 66211\n(38.9274..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
159465,501307,ODESSA MEMORIAL HEALTHCARE CENTER,502 E AMENDE DRIVE,ODESSA,WA,99159,LINCOLN,5099822611,SSI: Abdominal Observed Cases,HAI_4_NUMERATOR,Not Available,5 - Results are not available for this reporti...,07/01/2014,06/30/2015,"502 E AMENDE DRIVE\nODESSA, WA 99159\n(47.3315..."
1585,50488,EDEN MEDICAL CENTER,20103 LAKE CHABOT ROAD,CASTRO VALLEY,CA,94546,ALAMEDA,5105371234,CLABSI: Upper Confidence Limit,HAI_1_CI_UPPER,No Different than National Benchmark,,01/01/2015,06/30/2015,"20103 LAKE CHABOT ROAD\nCASTRO VALLEY, CA 94546\n"
138579,500003,SKAGIT VALLEY HOSPITAL,1415 KINCAID STREET,MOUNT VERNON,WA,98274,SKAGIT,3604244111,CAUTI: Upper Confidence Limit,HAI_2_CI_UPPER,Better than the National Benchmark,,01/01/2015,06/30/2015,"1415 KINCAID STREET\nMOUNT VERNON, WA 98274\n(..."
166897,250138,MERIT HEALTH RIVER OAKS,1030 RIVER OAKS DRIVE,FLOWOOD,MS,39232,RANKIN,6019321030,SSI: Abdominal Observed Cases,HAI_4_NUMERATOR,Not Available,13 - Results cannot be calculated for this rep...,07/01/2014,06/30/2015,"1030 RIVER OAKS DRIVE\nFLOWOOD, MS 39232\n(32...."


In [378]:
automl = AutoML(results_path="HAI_binary_los", mode="Explain")

In [379]:
automl.fit(X_train, y_train)

Linear algorithm was disabled.
This model has already been fitted. You can use predict methods or select a new 'results_path' for a new 'fit()'.


In [380]:
pred = automl.predict(X_test)
pred

array(['Low-infection-rate', 'Low-infection-rate', 'Low-infection-rate',
       ..., 'Low-infection-rate', 'Low-infection-rate',
       'Low-infection-rate'], dtype=object)

In [381]:
automl.report()

Best model,name,model_type,metric_type,metric_value,train_time
,1_Baseline,Baseline,logloss,0.328712,3.9
,2_DecisionTree,Decision Tree,logloss,0.210423,36.63
the best,3_Default_Xgboost,Xgboost,logloss,0.0287939,161.83
,4_Default_NeuralNetwork,Neural Network,logloss,0.225101,48.14
,5_Default_RandomForest,Random Forest,logloss,0.160248,36.6
,Ensemble,Ensemble,logloss,0.0287939,2.85

Unnamed: 0,score,threshold
logloss,0.225101,
auc,0.871916,
f1,0.946449,0.621767
accuracy,0.898342,0.621767
precision,1.0,0.999993
recall,1.0,0.621767
mcc,0.415978,0.835621

Unnamed: 0,score,threshold
logloss,0.225101,
auc,0.871916,
f1,0.946449,0.621767
accuracy,0.898342,0.621767
precision,0.898342,0.621767
recall,1.0,0.621767
mcc,0.0,0.621767

Unnamed: 0,Predicted as High-infection-rate,Predicted as Low-infection-rate
Labeled as High-infection-rate,0,4248
Labeled as Low-infection-rate,0,37539

Unnamed: 0,score,threshold
logloss,0.160248,
auc,0.961891,
f1,0.958809,0.753138
accuracy,0.927322,0.753138
precision,1.0,0.921877
recall,1.0,0.0660418
mcc,0.659161,0.753138

Unnamed: 0,score,threshold
logloss,0.160248,
auc,0.961891,
f1,0.958809,0.753138
accuracy,0.927322,0.753138
precision,0.976679,0.753138
recall,0.941581,0.753138
mcc,0.659161,0.753138

Unnamed: 0,Predicted as High-infection-rate,Predicted as Low-infection-rate
Labeled as High-infection-rate,3404,844
Labeled as Low-infection-rate,2193,35346

Unnamed: 0,score,threshold
logloss,0.210423,
auc,0.887944,
f1,0.952547,0.680592
accuracy,0.912102,0.680592
precision,1.0,0.941225
recall,1.0,0.286134
mcc,0.421553,0.872935

Unnamed: 0,score,threshold
logloss,0.210423,
auc,0.887944,
f1,0.952547,0.680592
accuracy,0.912102,0.680592
precision,0.924769,0.680592
recall,0.982045,0.680592
mcc,0.398168,0.680592

Unnamed: 0,Predicted as High-infection-rate,Predicted as Low-infection-rate
Labeled as High-infection-rate,1249,2999
Labeled as Low-infection-rate,674,36865

Unnamed: 0,score,threshold
logloss,0.328712,
auc,0.5,
f1,0.946449,0.808493
accuracy,0.898342,0.808493
precision,0.898342,0.808493
recall,1.0,0.808493
mcc,0.0,0.808493

Unnamed: 0,score,threshold
logloss,0.328712,
auc,0.5,
f1,0.946449,0.808493
accuracy,0.898342,0.808493
precision,0.898342,0.808493
recall,1.0,0.808493
mcc,0.0,0.808493

Unnamed: 0,Predicted as High-infection-rate,Predicted as Low-infection-rate
Labeled as High-infection-rate,0,4248
Labeled as Low-infection-rate,0,37539

Model,Weight
3_Default_Xgboost,1

Unnamed: 0,score,threshold
logloss,0.0287939,
auc,0.998495,
f1,0.984664,0.942205
accuracy,0.972838,0.942205
precision,1.0,0.999999
recall,1.0,9.06675e-08
mcc,0.873128,0.942205

Unnamed: 0,score,threshold
logloss,0.0287939,
auc,0.998495,
f1,0.984664,0.942205
accuracy,0.972838,0.942205
precision,0.999068,0.942205
recall,0.970671,0.942205
mcc,0.873128,0.942205

Unnamed: 0,Predicted as High-infection-rate,Predicted as Low-infection-rate
Labeled as High-infection-rate,4214,34
Labeled as Low-infection-rate,1101,36438

Unnamed: 0,score,threshold
logloss,0.0287939,
auc,0.998495,
f1,0.984664,0.942205
accuracy,0.972838,0.942205
precision,1.0,0.999999
recall,1.0,9.06675e-08
mcc,0.873128,0.942205

Unnamed: 0,score,threshold
logloss,0.0287939,
auc,0.998495,
f1,0.984664,0.942205
accuracy,0.972838,0.942205
precision,0.999068,0.942205
recall,0.970671,0.942205
mcc,0.873128,0.942205

Unnamed: 0,Predicted as High-infection-rate,Predicted as Low-infection-rate
Labeled as High-infection-rate,4214,34
Labeled as Low-infection-rate,1101,36438
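
The scores in the report can be recomputed from a confusion matrix. The sketch below uses the counts from the last table above and assumes, as the reported precision and recall suggest, that `Low-infection-rate` is treated as the positive class.

```python
# Counts copied from the confusion matrix above
high_as_high, high_as_low = 4214, 34
low_as_high, low_as_low = 1101, 36438

total = high_as_high + high_as_low + low_as_high + low_as_low

# Assumption: Low-infection-rate is the positive class
tp, fp, fn, tn = low_as_low, high_as_low, low_as_high, high_as_high

accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.6f} precision={precision:.6f} "
      f"recall={recall:.6f} f1={f1:.6f}")
```

Under that assumption the values agree with the accuracy, precision, recall and f1 rows reported for this model.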


### **Test new (not really) data**

In [382]:
# Load the previously trained model from its results folder

automl_HAI_los = AutoML(results_path="HAI_binary_los")

In [383]:
# create a new dataset that follows the same data structure as the training set
X_withlos = HAI.sample(25)
X_withoutlos = X_withlos.drop(columns=['HAI_los'])

In [384]:
X_withlos

Unnamed: 0,Provider ID,Hospital Name,Address,City,State,ZIP Code,County Name,Phone Number,Measure Name,Measure ID,Compared to National,Footnote,Measure Start Date,Measure End Date,Location,HAI_los
17855,301304,NEW LONDON HOSPITAL,273 COUNTY ROAD,NEW LONDON,NH,3257,MERRIMACK,6035262911,C.diff Predicted Cases,HAI_6_ELIGCASES,Not Available,12 - This measure does not apply to this hospi...,07/01/2014,06/30/2015,"273 COUNTY ROAD\nNEW LONDON, NH 03257\n(43.420...",Low-infection-rate
142656,220108,BETH ISRAEL DEACONESS HOSPITAL-MILTON INC,199 REEDSDALE ROAD,MILTON,MA,2186,NORFOLK,6176964600,CLABSI: Upper Confidence Limit,HAI_1_CI_UPPER,No Different than National Benchmark,,01/01/2015,06/30/2015,"199 REEDSDALE ROAD\nMILTON, MA 02186\n(42.2523...",Low-infection-rate
29118,420055,CAROLINAS HOSPITAL SYSTEM MARION,2829 E HWY 76,MULLINS,SC,29574,MARION,8434312000,C.diff Predicted Cases,HAI_6_ELIGCASES,No Different than National Benchmark,,07/01/2014,06/30/2015,"2829 E HWY 76\nMULLINS, SC 29574\n(34.189232, ...",Low-infection-rate
16602,490013,SENTARA HALIFAX REGIONAL HOSPITAL,2204 WILBORN AVENUE,HALIFAX,VA,24558,HALIFAX,4345173100,"SSI: Colon, Number of Procedures",HAI_3_DOPC_DAYS,No Different than National Benchmark,,07/01/2014,06/30/2015,"2204 WILBORN AVENUE\nHALIFAX, VA 24558\n",High-infection-rate
175071,110028,UNIVERSITY HOSPITAL,1350 WALTON WAY,AUGUSTA,GA,30901,RICHMOND,7067229011,Surgical Site Infection from colon surgery (SS...,HAI_3_SIR,No Different than National Benchmark,,07/01/2014,06/30/2015,"1350 WALTON WAY\nAUGUSTA, GA 30901\n(33.473139...",Low-infection-rate
83126,450028,"VHS BROWNSVILLE HOSPITAL COMPANY, LLC",1040 W JEFFERSON ST,BROWNSVILLE,TX,78520,CAMERON,9565441400,SSI: Colon Observed Cases,HAI_3_NUMERATOR,No Different than National Benchmark,,07/01/2014,06/30/2015,"1040 W JEFFERSON ST\nBROWNSVILLE, TX 78520\n(2...",Low-infection-rate
96523,520004,MAYO CLINIC HLTH SYSTEM FRANCISCAN MED CTR,700 WEST AVENUE SOUTH,LA CROSSE,WI,54601,LA CROSSE,6087850940,CLABSI Central Line Days,HAI_1a_DOPC_DAYS,No Different than National Benchmark,8 - The lower limit of the confidence interval...,07/01/2014,06/30/2015,"700 WEST AVENUE SOUTH\nLA CROSSE, WI 54601\n(4...",High-infection-rate
167456,330259,MERCY MEDICAL CENTER,1000 NORTH VILLAGE AVENUE,ROCKVILLE CENTRE,NY,11570,NASSAU,5167052525,Clostridium difficile (C.diff.) Laboratory-ide...,HAI_6_SIR,Worse than the National Benchmark,,07/01/2014,06/30/2015,"1000 NORTH VILLAGE AVENUE\nROCKVILLE CENTRE, N...",Low-infection-rate
111256,330196,CONEY ISLAND HOSPITAL,2601 OCEAN PARKWAY,BROOKLYN,NY,11235,KINGS,7186163000,C.diff Patient Days,HAI_6_DOPC_DAYS,No Different than National Benchmark,,07/01/2014,06/30/2015,"2601 OCEAN PARKWAY\nBROOKLYN, NY 11235\n(40.58...",High-infection-rate
140866,460023,AMERICAN FORK HOSPITAL,170 NORTH 1100 EAST,AMERICAN FORK,UT,84003,UTAH,8018553305,CLABSI: Number of Procedures,HAI_1_DOPC_DAYS,Not Available,13 - Results cannot be calculated for this rep...,01/01/2015,06/30/2015,"170 NORTH 1100 EAST\nAMERICAN FORK, UT 84003\n...",Low-infection-rate


In [369]:
X_withoutlos

Unnamed: 0,Provider ID,Hospital Name,Address,City,State,ZIP Code,County Name,Phone Number,Measure Name,Measure ID,Compared to National,Footnote,Measure Start Date,Measure End Date,Location
59334,31304,PAGE HOSPITAL,501 NORTH NAVAJO DRIVE,PAGE,AZ,86040,COCONINO,9286452424,CLABSI Upper Confidence Limit,HAI_1a_CI_UPPER,Not Available,12 - This measure does not apply to this hospi...,07/01/2014,06/30/2015,"501 NORTH NAVAJO DRIVE\nPAGE, AZ 86040\n(36.91..."
32354,421303,WILLIAMSBURG REGIONAL HOSPITAL,500 NELSON BOULEVARD,KINGSTREE,SC,29556,WILLIAMSBURG,8433558888,CLABSI Observed Cases,HAI_1a_NUMERATOR,Not Available,3 - Results are based on a shorter time period...,07/01/2014,06/30/2015,"500 NELSON BOULEVARD\nKINGSTREE, SC 29556\n(33..."
177770,60016,CENTURA HEALTH-ST THOMAS MORE HOSPITAL,1338 PHAY AVE,CANON CITY,CO,81212,FREMONT,7192852000,CAUTI: Predicted Cases,HAI_2_ELIGCASES,No Different than National Benchmark,8 - The lower limit of the confidence interval...,01/01/2015,06/30/2015,"1338 PHAY AVE\nCANON CITY, CO 81212\n(38.45574..."
61096,210034,MEDSTAR HARBOR HOSPITAL,3001 SOUTH HANOVER STREET,BALTIMORE,MD,21225,BALTIMORE CITY,4103503201,Central line-associated blood stream infection...,HAI_1a_SIR,No Different than National Benchmark,8 - The lower limit of the confidence interval...,07/01/2014,06/30/2015,"3001 SOUTH HANOVER STREET\nBALTIMORE, MD 21225..."
458,21301,PROVIDENCE VALDEZ MEDICAL CENTER,PO BOX 550,VALDEZ,AK,99686,,9078352249,"SSI: Colon, Number of Procedures",HAI_3_DOPC_DAYS,Not Available,5 - Results are not available for this reporti...,07/01/2014,06/30/2015,"PO BOX 550\nVALDEZ, AK 99686\n"
111735,500039,HARRISON MEMORIAL HOSPITAL,2520 CHERRY AVENUE,BREMERTON,WA,98310,KITSAP,3603773911,"SSI: Colon, Number of Procedures",HAI_3_DOPC_DAYS,No Different than National Benchmark,,07/01/2014,06/30/2015,"2520 CHERRY AVENUE\nBREMERTON, WA 98310\n(47.5..."
97378,190065,BATON ROUGE GENERAL MEDICAL CENTER,3600 FLORIDA STREET,BATON ROUGE,LA,70806,EAST BATON ROUGE,2253877767,SSI: Abdominal Predicted Cases,HAI_4_ELIGCASES,Not Available,13 - Results cannot be calculated for this rep...,07/01/2014,06/30/2015,"3600 FLORIDA STREET\nBATON ROUGE, LA 70806\n(3..."
186435,281321,OAKLAND MERCY HOSPITAL,601 EAST SECOND ST,OAKLAND,NE,68045,BURT,4026855601,MRSA Predicted Cases,HAI_5_ELIGCASES,Not Available,5 - Results are not available for this reporti...,07/01/2014,06/30/2015,"601 EAST SECOND ST\nOAKLAND, NE 68045\n(41.834..."
83784,190026,RAPIDES REGIONAL MEDICAL CENTER,211 4TH STREET,ALEXANDRIA,LA,71301,RAPIDES,3187693000,SSI: Colon Lower Confidence Limit,HAI_3_CI_LOWER,No Different than National Benchmark,,07/01/2014,06/30/2015,"211 4TH STREET\nALEXANDRIA, LA 71301\n(31.3143..."
90558,30023,FLAGSTAFF MEDICAL CENTER,1200 NORTH BEAVER STREET,FLAGSTAFF,AZ,86001,COCONINO,9287732378,CLABSI: Number of Procedures,HAI_1_DOPC_DAYS,No Different than National Benchmark,,01/01/2015,06/30/2015,"1200 NORTH BEAVER STREET\nFLAGSTAFF, AZ 86001\..."


In [385]:
predict = automl_HAI_los.predict(X_withoutlos)
predict

array(['Low-infection-rate', 'Low-infection-rate', 'High-infection-rate',
       'High-infection-rate', 'Low-infection-rate', 'Low-infection-rate',
       'High-infection-rate', 'Low-infection-rate', 'High-infection-rate',
       'Low-infection-rate', 'Low-infection-rate', 'Low-infection-rate',
       'Low-infection-rate', 'Low-infection-rate', 'Low-infection-rate',
       'Low-infection-rate', 'Low-infection-rate', 'Low-infection-rate',
       'Low-infection-rate', 'Low-infection-rate', 'High-infection-rate',
       'Low-infection-rate', 'Low-infection-rate', 'Low-infection-rate',
       'Low-infection-rate'], dtype=object)

In [386]:
# actual values from X_withlos
values_actual = X_withlos['HAI_los'].values.tolist()
values_predicted = predict.tolist()
output = pd.DataFrame({'actual': values_actual, 'predicted': values_predicted})
output

Unnamed: 0,actual,predicted
0,Low-infection-rate,Low-infection-rate
1,Low-infection-rate,Low-infection-rate
2,Low-infection-rate,High-infection-rate
3,High-infection-rate,High-infection-rate
4,Low-infection-rate,Low-infection-rate
5,Low-infection-rate,Low-infection-rate
6,High-infection-rate,High-infection-rate
7,Low-infection-rate,Low-infection-rate
8,High-infection-rate,High-infection-rate
9,Low-infection-rate,Low-infection-rate
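
The agreement between the two columns can be summarized with `accuracy_score`, which is imported at the top of the notebook. The sketch below uses only the ten rows displayed above, typed in by hand. Since the sample is drawn from the full `HAI` frame, it may include rows the model was trained on, so the figure should be read as optimistic.

```python
from sklearn.metrics import accuracy_score

LOW, HIGH = "Low-infection-rate", "High-infection-rate"

# The ten actual/predicted pairs displayed above
actual = [LOW, LOW, LOW, HIGH, LOW, LOW, HIGH, LOW, HIGH, LOW]
predicted = [LOW, LOW, HIGH, HIGH, LOW, LOW, HIGH, LOW, HIGH, LOW]

# Fraction of rows where the prediction matches the actual label
acc = accuracy_score(actual, predicted)
print(acc)
```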


# Download outputs

In [387]:
# get current working directory
import os
os.getcwd()

'/content'

In [388]:
folders = os.listdir()
foldersML = [x for x in folders if x.startswith('HAI')]
print(foldersML)

['HAI_binary_los', 'HAI_binary_los.zip']


In [389]:
!zip -r /content/HAI_binary_los.zip /content/HAI_binary_los

  adding: content/HAI_binary_los/ (stored 0%)
  adding: content/HAI_binary_los/data_info.json (deflated 79%)
  adding: content/HAI_binary_los/ldb_performance_boxplot.png (deflated 22%)
  adding: content/HAI_binary_los/4_Default_NeuralNetwork/ (stored 0%)
  adding: content/HAI_binary_los/4_Default_NeuralNetwork/ks_statistic.png (deflated 7%)
  adding: content/HAI_binary_los/4_Default_NeuralNetwork/status.txt (stored 0%)
  adding: content/HAI_binary_los/4_Default_NeuralNetwork/permutation_importance.png (deflated 19%)
  adding: content/HAI_binary_los/4_Default_NeuralNetwork/learner_fold_0_importance.csv (deflated 35%)
  adding: content/HAI_binary_los/4_Default_NeuralNetwork/cumulative_gains_curve.png (deflated 6%)
  adding: content/HAI_binary_los/4_Default_NeuralNetwork/lift_curve.png (deflated 10%)
  adding: content/HAI_binary_los/4_Default_NeuralNetwork/calibration_curve_curve.png (deflated 13%)
  adding: content/HAI_binary_los/4_Default_NeuralNetwork/README.md (deflated 69%)
  adding:
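
As an alternative to the shell command, the results folder can be zipped from Python with the standard library. This is a minimal sketch that creates a stand-in folder in a temporary directory so it runs anywhere; in Colab the real folder would be `/content/HAI_binary_los`. The placeholder file is an assumption for illustration.

```python
import os
import shutil
import tempfile

# Stand-in results folder so the sketch runs outside Colab
workdir = tempfile.mkdtemp()
results_dir = os.path.join(workdir, "HAI_binary_los")
os.makedirs(results_dir)
with open(os.path.join(results_dir, "README.md"), "w") as fh:
    fh.write("placeholder report\n")

# make_archive appends ".zip" itself, so the base name is passed without an extension
archive_path = shutil.make_archive(
    base_name=os.path.join(workdir, "HAI_binary_los"),
    format="zip",
    root_dir=workdir,
    base_dir="HAI_binary_los",
)
print(archive_path)
```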

## Regression - Life expectancy dataset

In [390]:
import numpy as np
import pandas as pd
from supervised.automl import AutoML

df = pd.read_csv("/content/LifeExpectancyData.csv")
x_cols = [c for c in df.columns if c != "Life expectancy"]
X = df[x_cols]
y = df["Life expectancy"]

In [391]:
df

Unnamed: 0,Country,Year,Status,Life expectancy,Adult Mortality,infant deaths,Alcohol,percentage expenditure,Hepatitis B,Measles,...,Polio,Total expenditure,Diphtheria,HIV/AIDS,GDP,Population,thinness 1-19 years,thinness 5-9 years,Income composition of resources,Schooling
0,Afghanistan,2015,Developing,65.0,263.0,62,0.01,71.279624,65.0,1154,...,6.0,8.16,65.0,0.1,584.259210,33736494.0,17.2,17.3,0.479,10.1
1,Afghanistan,2014,Developing,59.9,271.0,64,0.01,73.523582,62.0,492,...,58.0,8.18,62.0,0.1,612.696514,327582.0,17.5,17.5,0.476,10.0
2,Afghanistan,2013,Developing,59.9,268.0,66,0.01,73.219243,64.0,430,...,62.0,8.13,64.0,0.1,631.744976,31731688.0,17.7,17.7,0.470,9.9
3,Afghanistan,2012,Developing,59.5,272.0,69,0.01,78.184215,67.0,2787,...,67.0,8.52,67.0,0.1,669.959000,3696958.0,17.9,18.0,0.463,9.8
4,Afghanistan,2011,Developing,59.2,275.0,71,0.01,7.097109,68.0,3013,...,68.0,7.87,68.0,0.1,63.537231,2978599.0,18.2,18.2,0.454,9.5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2933,Zimbabwe,2004,Developing,44.3,723.0,27,4.36,0.000000,68.0,31,...,67.0,7.13,65.0,33.6,454.366654,12777511.0,9.4,9.4,0.407,9.2
2934,Zimbabwe,2003,Developing,44.5,715.0,26,4.06,0.000000,7.0,998,...,7.0,6.52,68.0,36.7,453.351155,12633897.0,9.8,9.9,0.418,9.5
2935,Zimbabwe,2002,Developing,44.8,73.0,25,4.43,0.000000,73.0,304,...,73.0,6.53,71.0,39.8,57.348340,125525.0,1.2,1.3,0.427,10.0
2936,Zimbabwe,2001,Developing,45.3,686.0,25,1.72,0.000000,76.0,529,...,76.0,6.16,75.0,42.1,548.587312,12366165.0,1.6,1.7,0.427,9.8


In [392]:
x_cols

['Country',
 'Year',
 'Status',
 'Adult Mortality',
 'infant deaths',
 'Alcohol',
 'percentage expenditure',
 'Hepatitis B',
 'Measles ',
 ' BMI ',
 'under-five deaths ',
 'Polio',
 'Total expenditure',
 'Diphtheria ',
 ' HIV/AIDS',
 'GDP',
 'Population',
 ' thinness  1-19 years',
 ' thinness 5-9 years',
 'Income composition of resources',
 'Schooling']
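
Several of the column names listed above carry leading or trailing spaces (for example `'Measles '` and `' BMI '`), which makes them easy to mistype when selecting columns. One way to normalize them is sketched below on a toy frame; the toy values are assumptions, and the notebook itself keeps the original names.

```python
import pandas as pd

# Toy frame reusing a few of the column names listed above, stray spaces included
df_toy = pd.DataFrame({
    "Measles ": [1154],
    " BMI ": [19.1],
    " HIV/AIDS": [0.1],
    " thinness  1-19 years": [17.2],
    "GDP": [584.25921],
})

# Strip leading/trailing whitespace and collapse repeated inner spaces
df_toy.columns = df_toy.columns.str.strip().str.replace(r"\s+", " ", regex=True)
print(df_toy.columns.tolist())
```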

In [393]:
X

Unnamed: 0,Country,Year,Status,Adult Mortality,infant deaths,Alcohol,percentage expenditure,Hepatitis B,Measles,BMI,...,Polio,Total expenditure,Diphtheria,HIV/AIDS,GDP,Population,thinness 1-19 years,thinness 5-9 years,Income composition of resources,Schooling
0,Afghanistan,2015,Developing,263.0,62,0.01,71.279624,65.0,1154,19.1,...,6.0,8.16,65.0,0.1,584.259210,33736494.0,17.2,17.3,0.479,10.1
1,Afghanistan,2014,Developing,271.0,64,0.01,73.523582,62.0,492,18.6,...,58.0,8.18,62.0,0.1,612.696514,327582.0,17.5,17.5,0.476,10.0
2,Afghanistan,2013,Developing,268.0,66,0.01,73.219243,64.0,430,18.1,...,62.0,8.13,64.0,0.1,631.744976,31731688.0,17.7,17.7,0.470,9.9
3,Afghanistan,2012,Developing,272.0,69,0.01,78.184215,67.0,2787,17.6,...,67.0,8.52,67.0,0.1,669.959000,3696958.0,17.9,18.0,0.463,9.8
4,Afghanistan,2011,Developing,275.0,71,0.01,7.097109,68.0,3013,17.2,...,68.0,7.87,68.0,0.1,63.537231,2978599.0,18.2,18.2,0.454,9.5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2933,Zimbabwe,2004,Developing,723.0,27,4.36,0.000000,68.0,31,27.1,...,67.0,7.13,65.0,33.6,454.366654,12777511.0,9.4,9.4,0.407,9.2
2934,Zimbabwe,2003,Developing,715.0,26,4.06,0.000000,7.0,998,26.7,...,7.0,6.52,68.0,36.7,453.351155,12633897.0,9.8,9.9,0.418,9.5
2935,Zimbabwe,2002,Developing,73.0,25,4.43,0.000000,73.0,304,26.3,...,73.0,6.53,71.0,39.8,57.348340,125525.0,1.2,1.3,0.427,10.0
2936,Zimbabwe,2001,Developing,686.0,25,1.72,0.000000,76.0,529,25.9,...,76.0,6.16,75.0,42.1,548.587312,12366165.0,1.6,1.7,0.427,9.8


In [394]:
y

0       65.0
1       59.9
2       59.9
3       59.5
4       59.2
        ... 
2933    44.3
2934    44.5
2935    44.8
2936    45.3
2937    46.0
Name: Life expectancy, Length: 2938, dtype: float64

In [395]:
automl = AutoML(results_path="Life_expectancy_regression", mode="Explain")
automl.fit(X, y)

This model has already been fitted. You can use predict methods or select a new 'results_path' for a new 'fit()'.


There are samples with missing target values in the data which will be excluded for further analysis
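
The message above indicates that rows with a missing target are excluded during fitting. To make that step explicit, such rows can be counted and dropped before calling `fit()`. The following is a sketch on toy data; the toy frame and the names `df_clean`, `X_clean` and `y_clean` are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing target value
df_toy = pd.DataFrame({
    "Year": [2015, 2014, 2013, 2012],
    "Life expectancy": [65.0, np.nan, 59.9, 59.5],
})

# Count rows whose target is missing, then drop them
n_missing = int(df_toy["Life expectancy"].isna().sum())
df_clean = df_toy.dropna(subset=["Life expectancy"])

X_clean = df_clean.drop(columns=["Life expectancy"])
y_clean = df_clean["Life expectancy"]
print(n_missing, len(df_clean))
```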


In [396]:
df["predictions"] = automl.predict(X)


X has feature names, but StandardScaler was fitted without feature names


In [397]:
print("Predictions")
print(df[["Life expectancy", "predictions"]].head())

Predictions
   Life expectancy  predictions
0             65.0    62.556262
1             59.9    59.845463
2             59.9    60.000315
3             59.5    59.488580
4             59.2    59.255311


# Download outputs

In [398]:
# get current working directory
import os
os.getcwd()

'/content'

In [399]:
folders = os.listdir()
foldersML = [x for x in folders if x.startswith('Life_expectancy')]
print(foldersML)

['Life_expectancy_regression.zip', 'Life_expectancy_regression']


In [400]:
!zip -r /content/Life_expectancy_regression.zip /content/Life_expectancy_regression

updating: content/Life_expectancy_regression/ (stored 0%)
updating: content/Life_expectancy_regression/data_info.json (deflated 80%)
updating: content/Life_expectancy_regression/6_Default_RandomForest/ (stored 0%)
updating: content/Life_expectancy_regression/6_Default_RandomForest/learner_fold_0_shap_dependence.png (deflated 4%)
updating: content/Life_expectancy_regression/6_Default_RandomForest/learner_fold_0_shap_worst_decisions.png (deflated 7%)
updating: content/Life_expectancy_regression/6_Default_RandomForest/status.txt (stored 0%)
updating: content/Life_expectancy_regression/6_Default_RandomForest/permutation_importance.png (deflated 16%)
updating: content/Life_expectancy_regression/6_Default_RandomForest/learner_fold_0_shap_importance.csv (deflated 37%)
updating: content/Life_expectancy_regression/6_Default_RandomForest/learner_fold_0_importance.csv (deflated 37%)
updating: content/Life_expectancy_regression/6_Default_RandomForest/true_vs_predicted.png (deflated 5%)
updating: c