<a href="https://colab.research.google.com/github/premdub/AutoML/blob/main/autoML/autoML.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Install Packages

In [19]:
!pip install tpot mljar-supervised

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [20]:
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from supervised.automl import AutoML


# Options Available

- mode: selects one of the four built-in modes.
  - Explain mode is suited to exploring and understanding the data. It produces feature-importance plots and decision-tree visualizations.
  - Perform mode is intended for building ML models for production.
  - Compete mode is intended for building models for machine learning competitions.
  - Optuna mode searches for highly tuned ML models.
- algorithms: the algorithms to try, usually passed as a list of names.
- results_path: the directory where the results are stored.
- total_time_limit: the total time budget for training, in seconds.
- train_ensemble: whether an ensemble is built at the end of training.
- stack_models: whether a stack of models is built.
- eval_metric: the metric to optimize. With `auto`, logloss is used for classification problems and rmse for regression problems.

In [21]:
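# Template: uncomment and adjust the values before running
# automl = AutoML(
#     mode="Explain",
#     algorithms=["Decision Tree", "Random Forest", "Xgboost"],
#     results_path="AutoML_22",
#     total_time_limit=30 * 60,
#     train_ensemble=True,
#     stack_models=False,
#     eval_metric="logloss",
# )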

# Healthcare Dataset - SPARCS

## Load in dataset

In [30]:
import pandas as pd
sparcs = pd.read_csv('https://query.data.world/s/lcc3wkldiuom2xid5xlcn3bqpkk263')
sparcs

Unnamed: 0,Provider ID,Hospital Name,Address,City,State,ZIP Code,County Name,Phone Number,Measure Name,Measure ID,Compared to National,Score,Footnote,Measure Start Date,Measure End Date,Location
0,10005,MARSHALL MEDICAL CENTER SOUTH,2505 U S HIGHWAY 431 NORTH,BOAZ,AL,35957,MARSHALL,2565938310,CLABSI: Lower Confidence Limit,HAI_1_CI_LOWER,No Different than National Benchmark,0.165,,01/01/2015,06/30/2015,"2505 U S HIGHWAY 431 NORTH\nBOAZ, AL 35957\n"
1,10005,MARSHALL MEDICAL CENTER SOUTH,2505 U S HIGHWAY 431 NORTH,BOAZ,AL,35957,MARSHALL,2565938310,CLABSI: Upper Confidence Limit,HAI_1_CI_UPPER,No Different than National Benchmark,3.249,,01/01/2015,06/30/2015,"2505 U S HIGHWAY 431 NORTH\nBOAZ, AL 35957\n"
2,10005,MARSHALL MEDICAL CENTER SOUTH,2505 U S HIGHWAY 431 NORTH,BOAZ,AL,35957,MARSHALL,2565938310,CLABSI: Number of Procedures,HAI_1_DOPC_DAYS,No Different than National Benchmark,1485,,01/01/2015,06/30/2015,"2505 U S HIGHWAY 431 NORTH\nBOAZ, AL 35957\n"
3,10005,MARSHALL MEDICAL CENTER SOUTH,2505 U S HIGHWAY 431 NORTH,BOAZ,AL,35957,MARSHALL,2565938310,CLABSI: Predicted Cases,HAI_1_ELIGCASES,No Different than National Benchmark,2.034,,01/01/2015,06/30/2015,"2505 U S HIGHWAY 431 NORTH\nBOAZ, AL 35957\n"
4,10005,MARSHALL MEDICAL CENTER SOUTH,2505 U S HIGHWAY 431 NORTH,BOAZ,AL,35957,MARSHALL,2565938310,CLABSI: Observed Cases,HAI_1_NUMERATOR,No Different than National Benchmark,2,,01/01/2015,06/30/2015,"2505 U S HIGHWAY 431 NORTH\nBOAZ, AL 35957\n"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
222859,351325,ST LUKE'S HOSPITAL,702 1ST ST SW,CROSBY,ND,58730,DIVIDE,7019656384,CAUTI: Observed Cases,HAI_2_NUMERATOR,Not Available,Not Available,5 - Results are not available for this reporti...,01/01/2015,06/30/2015,"702 1ST ST SW\nCROSBY, ND 58730\n(48.906708, -..."
222860,471303,NORTHEASTERN VERMONT REGIONAL HOSPITAL,1315 HOSPITAL DRIVE,SAINT JOHNSBURY,VT,5819,CALEDONIA,8027487400,C.diff Lower Confidence Limit,HAI_6_CI_LOWER,Not Available,Not Available,5 - Results are not available for this reporti...,07/01/2014,06/30/2015,"1315 HOSPITAL DRIVE\nSAINT JOHNSBURY, VT 05819..."
222861,150026,IU HEALTH GOSHEN HOSPITAL,200 HIGH PARK AVE,GOSHEN,IN,46526,ELKHART,5743641000,CLABSI: Number of Procedures,HAI_1_DOPC_DAYS,No Different than National Benchmark,1416,,01/01/2015,06/30/2015,"200 HIGH PARK AVE\nGOSHEN, IN 46526\n(41.56432..."
222862,50038,SANTA CLARA VALLEY MEDICAL CENTER,751 SOUTH BASCOM AVENUE,SAN JOSE,CA,95128,SANTA CLARA,4088855000,SSI: Colon Predicted Cases,HAI_3_ELIGCASES,Worse than the National Benchmark,3.246,,07/01/2014,06/30/2015,"751 SOUTH BASCOM AVENUE\nSAN JOSE, CA 95128\n(..."


In [32]:
sparcs.columns

Index(['Provider ID', 'Hospital Name', 'Address', 'City', 'State', 'ZIP Code',
       'County Name', 'Phone Number', 'Measure Name', 'Measure ID',
       'Compared to National', 'Score', 'Footnote', 'Measure Start Date',
       'Measure End Date', 'Location'],
      dtype='object')

## Potential variables of interest

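The loaded table does not contain the SPARCS discharge fields (APR Risk of Mortality, Total Costs, Length of Stay). It holds hospital infection measures (CLABSI, CAUTI, SSI, MRSA, C.diff), so the columns explored below are:

- Score (measure value, stored as text)
- Measure ID (categorical)
- Compared to National (categorical)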

In [33]:
sparcs['Score'].describe()

count            222864
unique            21771
top       Not Available
freq             124717
Name: Score, dtype: object

In [34]:
sparcs['Measure ID'].describe()

count             222864
unique                48
top       HAI_1_CI_LOWER
freq                4643
Name: Measure ID, dtype: object

In [35]:
sparcs['Compared to National'].value_counts()

Not Available                           121398
No Different than National Benchmark     78324
Better than the National Benchmark       18108
Worse than the National Benchmark         5034
Name: Compared to National, dtype: int64

## Create some simplified binary versions

In [36]:
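# "Not Available" and any other non-numeric score becomes NaN
sparcs['Score'] = pd.to_numeric(sparcs['Score'], errors='coerce')
# NaN > 15 evaluates to False, so rows without a score are labeled 'low-infection-rate'
sparcs['sparcs_los'] = sparcs['Score'].apply(lambda x: 'high-infection-rate' if x > 15 else 'low-infection-rate')
# drop the raw score so it cannot leak the label into the features
sparcs.drop('Score', axis=1, inplace=True)
sparcs['sparcs_los'].value_counts()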


low-infection-rate     200205
high-infection-rate     22659
Name: sparcs_los, dtype: int64
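The `describe()` output earlier shows that `Score` is "Not Available" in 124717 of 222864 rows, and `errors='coerce'` turns those into NaN. The sketch below reproduces the labeling rule in plain Python to show where such rows end up. `label_score_keep_missing` and the `score-not-available` label are hypothetical additions for illustration, not part of the notebook.

```python
import math

THRESHOLD = 15  # same cut-off as the labeling cell


def label_score(score):
    """Mirror the notebook's lambda: only values strictly above 15 are 'high'."""
    return "high-infection-rate" if score > THRESHOLD else "low-infection-rate"


def label_score_keep_missing(score):
    """Hypothetical variant that reports missing scores as their own category."""
    if math.isnan(score):
        return "score-not-available"
    return label_score(score)


# Two values from the table preview plus one missing score
scores = [0.165, 1485.0, math.nan]

labels = [label_score(s) for s in scores]
labels_with_missing = [label_score_keep_missing(s) for s in scores]

print(labels)
print(labels_with_missing)
```

Because any comparison with NaN is false, the original rule places every unavailable score in the low class. Keeping a separate category, or dropping those rows, would be one way to avoid mixing "missing" with "low".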

# MLJar Examples

## Binary Classifier Example 1 - SPARCS

### **Create new model**

In [37]:
X = sparcs.drop(columns=['sparcs_los'])

In [38]:
y = sparcs["sparcs_los"]

In [39]:
X

Unnamed: 0,Provider ID,Hospital Name,Address,City,State,ZIP Code,County Name,Phone Number,Measure Name,Measure ID,Compared to National,Footnote,Measure Start Date,Measure End Date,Location
0,10005,MARSHALL MEDICAL CENTER SOUTH,2505 U S HIGHWAY 431 NORTH,BOAZ,AL,35957,MARSHALL,2565938310,CLABSI: Lower Confidence Limit,HAI_1_CI_LOWER,No Different than National Benchmark,,01/01/2015,06/30/2015,"2505 U S HIGHWAY 431 NORTH\nBOAZ, AL 35957\n"
1,10005,MARSHALL MEDICAL CENTER SOUTH,2505 U S HIGHWAY 431 NORTH,BOAZ,AL,35957,MARSHALL,2565938310,CLABSI: Upper Confidence Limit,HAI_1_CI_UPPER,No Different than National Benchmark,,01/01/2015,06/30/2015,"2505 U S HIGHWAY 431 NORTH\nBOAZ, AL 35957\n"
2,10005,MARSHALL MEDICAL CENTER SOUTH,2505 U S HIGHWAY 431 NORTH,BOAZ,AL,35957,MARSHALL,2565938310,CLABSI: Number of Procedures,HAI_1_DOPC_DAYS,No Different than National Benchmark,,01/01/2015,06/30/2015,"2505 U S HIGHWAY 431 NORTH\nBOAZ, AL 35957\n"
3,10005,MARSHALL MEDICAL CENTER SOUTH,2505 U S HIGHWAY 431 NORTH,BOAZ,AL,35957,MARSHALL,2565938310,CLABSI: Predicted Cases,HAI_1_ELIGCASES,No Different than National Benchmark,,01/01/2015,06/30/2015,"2505 U S HIGHWAY 431 NORTH\nBOAZ, AL 35957\n"
4,10005,MARSHALL MEDICAL CENTER SOUTH,2505 U S HIGHWAY 431 NORTH,BOAZ,AL,35957,MARSHALL,2565938310,CLABSI: Observed Cases,HAI_1_NUMERATOR,No Different than National Benchmark,,01/01/2015,06/30/2015,"2505 U S HIGHWAY 431 NORTH\nBOAZ, AL 35957\n"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
222859,351325,ST LUKE'S HOSPITAL,702 1ST ST SW,CROSBY,ND,58730,DIVIDE,7019656384,CAUTI: Observed Cases,HAI_2_NUMERATOR,Not Available,5 - Results are not available for this reporti...,01/01/2015,06/30/2015,"702 1ST ST SW\nCROSBY, ND 58730\n(48.906708, -..."
222860,471303,NORTHEASTERN VERMONT REGIONAL HOSPITAL,1315 HOSPITAL DRIVE,SAINT JOHNSBURY,VT,5819,CALEDONIA,8027487400,C.diff Lower Confidence Limit,HAI_6_CI_LOWER,Not Available,5 - Results are not available for this reporti...,07/01/2014,06/30/2015,"1315 HOSPITAL DRIVE\nSAINT JOHNSBURY, VT 05819..."
222861,150026,IU HEALTH GOSHEN HOSPITAL,200 HIGH PARK AVE,GOSHEN,IN,46526,ELKHART,5743641000,CLABSI: Number of Procedures,HAI_1_DOPC_DAYS,No Different than National Benchmark,,01/01/2015,06/30/2015,"200 HIGH PARK AVE\nGOSHEN, IN 46526\n(41.56432..."
222862,50038,SANTA CLARA VALLEY MEDICAL CENTER,751 SOUTH BASCOM AVENUE,SAN JOSE,CA,95128,SANTA CLARA,4088855000,SSI: Colon Predicted Cases,HAI_3_ELIGCASES,Worse than the National Benchmark,,07/01/2014,06/30/2015,"751 SOUTH BASCOM AVENUE\nSAN JOSE, CA 95128\n(..."


In [40]:
y

0          low-infection-rate
1          low-infection-rate
2         high-infection-rate
3          low-infection-rate
4          low-infection-rate
                 ...         
222859     low-infection-rate
222860     low-infection-rate
222861    high-infection-rate
222862     low-infection-rate
222863     low-infection-rate
Name: sparcs_los, Length: 222864, dtype: object

In [41]:
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.25)
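The `stratify=y` argument asks for a split in which each class keeps roughly the same share in the train and test parts. The block below is a minimal pure-Python sketch of that idea, not scikit-learn's implementation. The function name `stratified_split` and the toy label counts are assumptions made for illustration.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.25, seed=0):
    """Split row indices so every class contributes ~test_size of its rows to the test set."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels: 80% low, 20% high (illustrative, not the notebook's data)
toy_labels = ["low-infection-rate"] * 80 + ["high-infection-rate"] * 20
train_idx, test_idx = stratified_split(toy_labels)

test_high_share = sum(toy_labels[i] == "high-infection-rate" for i in test_idx) / len(test_idx)
print(len(train_idx), len(test_idx), test_high_share)
```

With an imbalanced target such as `sparcs_los`, this keeps the minority class from being under-represented in the test set by chance.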

In [42]:
X_test

Unnamed: 0,Provider ID,Hospital Name,Address,City,State,ZIP Code,County Name,Phone Number,Measure Name,Measure ID,Compared to National,Footnote,Measure Start Date,Measure End Date,Location
77717,220105,WINCHESTER HOSPITAL,41 HIGHLAND AVENUE,WINCHESTER,MA,1890,MIDDLESEX,7817299000,CAUTI: Number of Procedures,HAI_2_DOPC_DAYS,Better than the National Benchmark,8 - The lower limit of the confidence interval...,01/01/2015,06/30/2015,"41 HIGHLAND AVENUE\nWINCHESTER, MA 01890\n(42...."
83995,100072,FLORIDA HOSPITAL FISH MEMORIAL,1055 SAXON BLVD,ORANGE CITY,FL,32763,VOLUSIA,3869175000,C.diff Predicted Cases,HAI_6_ELIGCASES,No Different than National Benchmark,,07/01/2014,06/30/2015,"1055 SAXON BLVD\nORANGE CITY, FL 32763\n(28.91..."
79429,50674,KAISER FOUNDATION HOSP SO SACRAMENTO,6600 BRUCEVILLE ROAD,SACRAMENTO,CA,95823,SACRAMENTO,9166882000,MRSA Patient Days,HAI_5_DOPC_DAYS,No Different than National Benchmark,,07/01/2014,06/30/2015,"6600 BRUCEVILLE ROAD\nSACRAMENTO, CA 95823\n(3..."
21432,330144,"IRA DAVENPORT MEMORIAL HOSPITAL, INC",7571 STATE ROUTE 54,BATH,NY,14810,STEUBEN,6077768500,Catheter-Associated Urinary Tract Infections (...,HAI_2a_SIR,Not Available,12 - This measure does not apply to this hospi...,07/01/2014,06/30/2015,"7571 STATE ROUTE 54\nBATH, NY 14810\n(42.37297..."
52988,670054,FOUNDATION SURGICAL HOSPITAL OF SAN ANTONIO,9522 HUEBNER ROAD,SAN ANTONIO,TX,78240,BEXAR,2104785400,Catheter-Associated Urinary Tract Infections (...,HAI_2a_SIR,Not Available,12 - This measure does not apply to this hospi...,07/01/2014,06/30/2015,"9522 HUEBNER ROAD\nSAN ANTONIO, TX 78240\n(29...."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
178160,381311,LOWER UMPQUA HOSPITAL DISTRICT,600 RANCH ROAD,REEDSPORT,OR,97467,DOUGLAS,5412712171,CLABSI: Number of Procedures,HAI_1_DOPC_DAYS,Not Available,5 - Results are not available for this reporti...,01/01/2015,06/30/2015,"600 RANCH ROAD\nREEDSPORT, OR 97467\n(43.69513..."
62303,281310,REGIONAL WEST GARDEN COUNTY HOSPITAL,1100 WEST 2ND ST,OSHKOSH,NE,69154,GARDEN,3087723283,Surgical Site Infection from abdominal hystere...,HAI_4_SIR,Not Available,5 - Results are not available for this reporti...,07/01/2014,06/30/2015,"1100 WEST 2ND ST\nOSHKOSH, NE 69154\n(41.41249..."
67799,440181,BOLIVAR GENERAL HOSPITAL,650 NUCKOLLS ROAD,BOLIVAR,TN,38008,HARDEMAN,7316583100,CAUTI: Upper Confidence Limit,HAI_2_CI_UPPER,Not Available,12 - This measure does not apply to this hospi...,01/01/2015,06/30/2015,"650 NUCKOLLS ROAD\nBOLIVAR, TN 38008\n(35.2677..."
15719,450395,MEMORIAL MEDICAL CENTER LIVINGSTON,1717 HWY 59 BYPASS,LIVINGSTON,TX,77351,POLK,9363274381,CLABSI Central Line Days,HAI_1a_DOPC_DAYS,Not Available,13 - Results cannot be calculated for this rep...,07/01/2014,06/30/2015,"1717 HWY 59 BYPASS\nLIVINGSTON, TX 77351\n"


In [43]:
automl = AutoML(results_path="sparcs_binary_los", mode="Explain")

In [44]:
automl.fit(X_train, y_train)

Linear algorithm was disabled.
AutoML directory: sparcs_binary_los
The task is binary_classification with evaluation metric logloss
AutoML will use algorithms: ['Baseline', 'Decision Tree', 'Random Forest', 'Xgboost', 'Neural Network']
AutoML will ensemble available models
AutoML steps: ['simple_algorithms', 'default_algorithms', 'ensemble']
* Step simple_algorithms will try to check up to 2 models
1_Baseline logloss 0.328712 trained in 2.53 seconds




2_DecisionTree logloss 0.210459 trained in 35.77 seconds
* Step default_algorithms will try to check up to 3 models
3_Default_Xgboost logloss 0.02921 trained in 169.89 seconds
4_Default_NeuralNetwork logloss 0.131545 trained in 128.59 seconds
5_Default_RandomForest logloss 0.15751 trained in 33.49 seconds
* Step ensemble will try to check up to 1 model
Ensemble logloss 0.02921 trained in 2.42 seconds
AutoML fit time: 390.54 seconds
AutoML best model: 3_Default_Xgboost


AutoML(results_path='sparcs_binary_los')

In [45]:
pred = automl.predict(X_test)
pred

array(['low-infection-rate', 'low-infection-rate', 'low-infection-rate',
       ..., 'low-infection-rate', 'low-infection-rate',
       'low-infection-rate'], dtype=object)

In [46]:
automl.report()

Best model,name,model_type,metric_type,metric_value,train_time
,1_Baseline,Baseline,logloss,0.328712,3.69
,2_DecisionTree,Decision Tree,logloss,0.210459,37.36
the best,3_Default_Xgboost,Xgboost,logloss,0.0292104,171.5
,4_Default_NeuralNetwork,Neural Network,logloss,0.131545,130.02
,5_Default_RandomForest,Random Forest,logloss,0.15751,35.11
,Ensemble,Ensemble,logloss,0.0292104,2.42

Unnamed: 0,score,threshold
logloss,0.131545,
auc,0.96479,
f1,0.965053,0.684621
accuracy,0.938689,0.684621
precision,0.999514,1.0
recall,1.0,0.332318
mcc,0.730649,0.684621

Unnamed: 0,score,threshold
logloss,0.131545,
auc,0.96479,
f1,0.965053,0.684621
accuracy,0.938689,0.684621
precision,0.988874,0.684621
recall,0.942353,0.684621
mcc,0.730649,0.684621

Unnamed: 0,Predicted as high-infection-rate,Predicted as low-infection-rate
Labeled as high-infection-rate,3850,398
Labeled as low-infection-rate,2164,35375
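The thresholded metrics in the table just above can be recomputed from this confusion matrix. The sketch below does so in plain Python. Treating `low-infection-rate` as the positive class is an assumption, chosen because it is the reading under which the recomputed values agree with the reported ones.

```python
import math

# Counts copied from the confusion matrix above.
# Assumption: 'low-infection-rate' is the positive class.
tp = 35375  # labeled low, predicted low
fn = 2164   # labeled low, predicted high
fp = 398    # labeled high, predicted low
tn = 3850   # labeled high, predicted high

total = tp + fn + fp + tn
accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
mcc = (tp * tn - fp * fn) / math.sqrt(
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
)

for name, value in [
    ("accuracy", accuracy),
    ("precision", precision),
    ("recall", recall),
    ("f1", f1),
    ("mcc", mcc),
]:
    print(f"{name}: {value:.6f}")
```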

Unnamed: 0,score,threshold
logloss,0.15751,
auc,0.963772,
f1,0.946449,0.0278974
accuracy,0.898342,0.0278974
precision,0.989024,0.845295
recall,1.0,0.0278974
mcc,0.576954,0.845295

Unnamed: 0,score,threshold
logloss,0.15751,
auc,0.963772,
f1,0.946449,0.0278974
accuracy,0.898342,0.0278974
precision,0.898342,0.0278974
recall,1.0,0.0278974
mcc,0.0,0.0278974

Unnamed: 0,Predicted as high-infection-rate,Predicted as low-infection-rate
Labeled as high-infection-rate,0,4248
Labeled as low-infection-rate,0,37539

Unnamed: 0,score,threshold
logloss,0.210459,
auc,0.889831,
f1,0.946449,0.284146
accuracy,0.898342,0.284146
precision,0.967278,0.819466
recall,1.0,0.284146
mcc,0.426325,0.819466

Unnamed: 0,score,threshold
logloss,0.210459,
auc,0.889831,
f1,0.946449,0.284146
accuracy,0.898342,0.284146
precision,0.898342,0.284146
recall,1.0,0.284146
mcc,0.0,0.284146

Unnamed: 0,Predicted as high-infection-rate,Predicted as low-infection-rate
Labeled as high-infection-rate,0,4248
Labeled as low-infection-rate,0,37539

Unnamed: 0,score,threshold
logloss,0.328712,
auc,0.5,
f1,0.946449,0.808493
accuracy,0.898342,0.808493
precision,0.898342,0.808493
recall,1.0,0.808493
mcc,0.0,0.808493

Unnamed: 0,score,threshold
logloss,0.328712,
auc,0.5,
f1,0.946449,0.808493
accuracy,0.898342,0.808493
precision,0.898342,0.808493
recall,1.0,0.808493
mcc,0.0,0.808493

Unnamed: 0,Predicted as high-infection-rate,Predicted as low-infection-rate
Labeled as high-infection-rate,0,4248
Labeled as low-infection-rate,0,37539
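The block with logloss 0.328712 matches `1_Baseline` in the leaderboard. Assuming the baseline predicts the same class proportions for every row, its log loss is the entropy of the class distribution and its accuracy is the majority-class share. The sketch below checks this against the counts in the confusion matrix. That the Baseline model works this way is an assumption, not something the notebook states.

```python
import math

# Row totals from the Baseline confusion matrix above
n_high = 4248
n_low = 37539

p_low = n_low / (n_high + n_low)


def constant_logloss(p):
    """Average log loss when probability p is predicted for a class whose true rate is p."""
    return -(p * math.log(p) + (1 - p) * math.log(1 - p))


baseline_logloss = constant_logloss(p_low)
majority_accuracy = p_low  # accuracy of always predicting the majority class

print(f"logloss:  {baseline_logloss:.6f}")
print(f"accuracy: {majority_accuracy:.6f}")
```

Both values are useful reference points: with this class balance, an accuracy near 0.898 is what a model achieves without learning anything.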

Model,Weight
3_Default_Xgboost,1

Unnamed: 0,score,threshold
logloss,0.0292104,
auc,0.998449,
f1,0.946449,5.63428e-08
accuracy,0.898342,5.63428e-08
precision,1.0,0.999997
recall,1.0,5.63428e-08
mcc,0.337372,0.999997

Unnamed: 0,score,threshold
logloss,0.0292104,
auc,0.998449,
f1,0.946449,5.63428e-08
accuracy,0.898342,5.63428e-08
precision,0.898342,5.63428e-08
recall,1.0,5.63428e-08
mcc,0.0,5.63428e-08

Unnamed: 0,Predicted as high-infection-rate,Predicted as low-infection-rate
Labeled as high-infection-rate,0,4248
Labeled as low-infection-rate,0,37539

Unnamed: 0,score,threshold
logloss,0.0292104,
auc,0.998449,
f1,0.946449,5.63428e-08
accuracy,0.898342,5.63428e-08
precision,1.0,0.999997
recall,1.0,5.63428e-08
mcc,0.337372,0.999997

Unnamed: 0,score,threshold
logloss,0.0292104,
auc,0.998449,
f1,0.946449,5.63428e-08
accuracy,0.898342,5.63428e-08
precision,0.898342,5.63428e-08
recall,1.0,5.63428e-08
mcc,0.0,5.63428e-08

Unnamed: 0,Predicted as high-infection-rate,Predicted as low-infection-rate
Labeled as high-infection-rate,0,4248
Labeled as low-infection-rate,0,37539
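The table above lists a threshold of 5.63428e-08, and the confusion matrix shows every validation row predicted as `low-infection-rate`. The sketch below illustrates how a decision threshold turns probabilities into labels and why a threshold this close to zero yields a single class. The probabilities are hypothetical, and the `>=` comparison is an assumption about how the threshold is applied.

```python
def apply_threshold(probabilities, threshold,
                    positive="low-infection-rate",
                    negative="high-infection-rate"):
    """Label a row positive when its positive-class probability reaches the threshold."""
    return [positive if p >= threshold else negative for p in probabilities]


# Hypothetical probabilities of the positive class for four rows
probs = [0.02, 0.40, 0.75, 0.999]

labels_tiny_threshold = apply_threshold(probs, 5.63428e-08)
labels_half_threshold = apply_threshold(probs, 0.5)

print(labels_tiny_threshold)
print(labels_half_threshold)
```

A low logloss and a high AUC describe the ranking quality of the probabilities. The labels depend additionally on where the threshold sits, which is why the same model can show an MCC of 0.0 at one threshold.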


### **Test new (not really) data**

In [47]:
# load the trained model back from its results directory

automl_sparcs_los = AutoML(results_path="sparcs_binary_los")

In [48]:
# create a new dataset that follows the same data structure as the training set
X_withlos = sparcs.sample(25)
X_withoutlos = X_withlos.drop(columns=['sparcs_los'])

In [49]:
X_withlos

Unnamed: 0,Provider ID,Hospital Name,Address,City,State,ZIP Code,County Name,Phone Number,Measure Name,Measure ID,Compared to National,Footnote,Measure Start Date,Measure End Date,Location,sparcs_los
25016,240036,ST CLOUD HOSPITAL,1406 6TH AVE NORTH,SAINT CLOUD,MN,56303,STEARNS,3202512700,MRSA Predicted Cases,HAI_5_ELIGCASES,No Different than National Benchmark,,07/01/2014,06/30/2015,"1406 6TH AVE NORTH\nSAINT CLOUD, MN 56303\n(45...",low-infection-rate
186616,220024,HOLYOKE MEDICAL CENTER,575 BEECH STREET,HOLYOKE,MA,1040,HAMPDEN,4135342500,CLABSI: Upper Confidence Limit,HAI_1_CI_UPPER,No Different than National Benchmark,8 - The lower limit of the confidence interval...,01/01/2015,06/30/2015,"575 BEECH STREET\nHOLYOKE, MA 01040\n(42.19961...",low-infection-rate
91953,100319,FLORIDA HOSPITAL WESLEY CHAPEL,2600 BRUCE B DOWNS BLVD,WESLEY CHAPEL,FL,33544,PASCO,8139295490,Surgical Site Infection from abdominal hystere...,HAI_4_SIR,Not Available,13 - Results cannot be calculated for this rep...,07/01/2014,06/30/2015,"2600 BRUCE B DOWNS BLVD\nWESLEY CHAPEL, FL 335...",low-infection-rate
217637,50008,CALIFORNIA PACIFIC MEDICAL CTR-DAVIES CAMPUS HOSP,601 DUBOCE AVE,SAN FRANCISCO,CA,94117,SAN FRANCISCO,4156006000,SSI: Colon Upper Confidence Limit,HAI_3_CI_UPPER,Not Available,13 - Results cannot be calculated for this rep...,07/01/2014,06/30/2015,"601 DUBOCE AVE\nSAN FRANCISCO, CA 94117\n(37.7...",low-infection-rate
11432,390163,ACMH HOSPITAL,ONE NOLTE DRIVE,KITTANNING,PA,16201,ARMSTRONG,7245438500,Methicillin-resistant Staphylococcus Aureus (M...,HAI_5_SIR,Not Available,13 - Results cannot be calculated for this rep...,07/01/2014,06/30/2015,"ONE NOLTE DRIVE\nKITTANNING, PA 16201\n",low-infection-rate
89064,170104,SHAWNEE MISSION MEDICAL CENTER,9100 W 74TH STREET,SHAWNEE MISSION,KS,66204,JOHNSON,9136762151,Central line-associated blood stream infection...,HAI_1a_SIR,No Different than National Benchmark,,07/01/2014,06/30/2015,"9100 W 74TH STREET\nSHAWNEE MISSION, KS 66204\...",low-infection-rate
178608,140197,METHODIST HOSPITAL OF CHICAGO,5025 N PAULINA STREET,CHICAGO,IL,60640,COOK,7732719040,SSI: Abdominal Lower Confidence Limit,HAI_4_CI_LOWER,Not Available,12 - This measure does not apply to this hospi...,07/01/2014,06/30/2015,"5025 N PAULINA STREET\nCHICAGO, IL 60640\n(41....",low-infection-rate
190149,450148,TEXAS HEALTH HARRIS METHODIST HOSPITAL CLEBURNE,201 WALLS DRIVE,CLEBURNE,TX,76033,JOHNSON,8176412551,C.diff Lower Confidence Limit,HAI_6_CI_LOWER,No Different than National Benchmark,,07/01/2014,06/30/2015,"201 WALLS DRIVE\nCLEBURNE, TX 76033\n(32.33084...",low-infection-rate
100856,201315,STEPHENS MEMORIAL HOSPITAL,181 MAIN STREET,NORWAY,ME,4268,OXFORD,2077435933,MRSA Upper Confidence Limit,HAI_5_CI_UPPER,Not Available,13 - Results cannot be calculated for this rep...,07/01/2014,06/30/2015,"181 MAIN STREET\nNORWAY, ME 04268\n(44.209539,...",low-infection-rate
150623,360107,BELLEVUE HOSPITAL,1400 WEST MAIN STREET,BELLEVUE,OH,44811,SANDUSKY,4194834040,CLABSI Central Line Days,HAI_1a_DOPC_DAYS,Not Available,13 - Results cannot be calculated for this rep...,07/01/2014,06/30/2015,"1400 WEST MAIN STREET\nBELLEVUE, OH 44811\n(41...",low-infection-rate


In [50]:
X_withoutlos

Unnamed: 0,Provider ID,Hospital Name,Address,City,State,ZIP Code,County Name,Phone Number,Measure Name,Measure ID,Compared to National,Footnote,Measure Start Date,Measure End Date,Location
25016,240036,ST CLOUD HOSPITAL,1406 6TH AVE NORTH,SAINT CLOUD,MN,56303,STEARNS,3202512700,MRSA Predicted Cases,HAI_5_ELIGCASES,No Different than National Benchmark,,07/01/2014,06/30/2015,"1406 6TH AVE NORTH\nSAINT CLOUD, MN 56303\n(45..."
186616,220024,HOLYOKE MEDICAL CENTER,575 BEECH STREET,HOLYOKE,MA,1040,HAMPDEN,4135342500,CLABSI: Upper Confidence Limit,HAI_1_CI_UPPER,No Different than National Benchmark,8 - The lower limit of the confidence interval...,01/01/2015,06/30/2015,"575 BEECH STREET\nHOLYOKE, MA 01040\n(42.19961..."
91953,100319,FLORIDA HOSPITAL WESLEY CHAPEL,2600 BRUCE B DOWNS BLVD,WESLEY CHAPEL,FL,33544,PASCO,8139295490,Surgical Site Infection from abdominal hystere...,HAI_4_SIR,Not Available,13 - Results cannot be calculated for this rep...,07/01/2014,06/30/2015,"2600 BRUCE B DOWNS BLVD\nWESLEY CHAPEL, FL 335..."
217637,50008,CALIFORNIA PACIFIC MEDICAL CTR-DAVIES CAMPUS HOSP,601 DUBOCE AVE,SAN FRANCISCO,CA,94117,SAN FRANCISCO,4156006000,SSI: Colon Upper Confidence Limit,HAI_3_CI_UPPER,Not Available,13 - Results cannot be calculated for this rep...,07/01/2014,06/30/2015,"601 DUBOCE AVE\nSAN FRANCISCO, CA 94117\n(37.7..."
11432,390163,ACMH HOSPITAL,ONE NOLTE DRIVE,KITTANNING,PA,16201,ARMSTRONG,7245438500,Methicillin-resistant Staphylococcus Aureus (M...,HAI_5_SIR,Not Available,13 - Results cannot be calculated for this rep...,07/01/2014,06/30/2015,"ONE NOLTE DRIVE\nKITTANNING, PA 16201\n"
89064,170104,SHAWNEE MISSION MEDICAL CENTER,9100 W 74TH STREET,SHAWNEE MISSION,KS,66204,JOHNSON,9136762151,Central line-associated blood stream infection...,HAI_1a_SIR,No Different than National Benchmark,,07/01/2014,06/30/2015,"9100 W 74TH STREET\nSHAWNEE MISSION, KS 66204\..."
178608,140197,METHODIST HOSPITAL OF CHICAGO,5025 N PAULINA STREET,CHICAGO,IL,60640,COOK,7732719040,SSI: Abdominal Lower Confidence Limit,HAI_4_CI_LOWER,Not Available,12 - This measure does not apply to this hospi...,07/01/2014,06/30/2015,"5025 N PAULINA STREET\nCHICAGO, IL 60640\n(41...."
190149,450148,TEXAS HEALTH HARRIS METHODIST HOSPITAL CLEBURNE,201 WALLS DRIVE,CLEBURNE,TX,76033,JOHNSON,8176412551,C.diff Lower Confidence Limit,HAI_6_CI_LOWER,No Different than National Benchmark,,07/01/2014,06/30/2015,"201 WALLS DRIVE\nCLEBURNE, TX 76033\n(32.33084..."
100856,201315,STEPHENS MEMORIAL HOSPITAL,181 MAIN STREET,NORWAY,ME,4268,OXFORD,2077435933,MRSA Upper Confidence Limit,HAI_5_CI_UPPER,Not Available,13 - Results cannot be calculated for this rep...,07/01/2014,06/30/2015,"181 MAIN STREET\nNORWAY, ME 04268\n(44.209539,..."
150623,360107,BELLEVUE HOSPITAL,1400 WEST MAIN STREET,BELLEVUE,OH,44811,SANDUSKY,4194834040,CLABSI Central Line Days,HAI_1a_DOPC_DAYS,Not Available,13 - Results cannot be calculated for this rep...,07/01/2014,06/30/2015,"1400 WEST MAIN STREET\nBELLEVUE, OH 44811\n(41..."


In [51]:
# predict with the model reloaded from results_path rather than the in-memory one
predict = automl_sparcs_los.predict(X_withoutlos)
predict

array(['low-infection-rate', 'low-infection-rate', 'low-infection-rate',
       'low-infection-rate', 'low-infection-rate', 'low-infection-rate',
       'low-infection-rate', 'low-infection-rate', 'low-infection-rate',
       'low-infection-rate', 'low-infection-rate', 'low-infection-rate',
       'low-infection-rate', 'low-infection-rate', 'low-infection-rate',
       'low-infection-rate', 'low-infection-rate', 'low-infection-rate',
       'low-infection-rate', 'low-infection-rate', 'low-infection-rate',
       'low-infection-rate', 'low-infection-rate', 'low-infection-rate',
       'low-infection-rate'], dtype=object)

In [None]:
# actual values from X_withlos
values_actual = X_withlos['sparcs_los'].values.tolist()
values_predicted = predict.tolist()
output = pd.DataFrame({'actual': values_actual, 'predicted': values_predicted})
output

## Binary Classifier Example 2 - GENERIC

In [53]:
import pandas as pd
from supervised.automl import AutoML
import os

from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("https://raw.githubusercontent.com/hantswilliams/HHA-507-2022/main/autoML/datasets/data_binary_bank.csv")

X = df[df.columns[:-1]]
y = df["y"]

In [54]:
X

Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,duration,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed
0,56,housemaid,married,basic.4y,no,no,no,telephone,may,mon,261,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0
1,57,services,married,high.school,unknown,no,no,telephone,may,mon,149,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0
2,37,services,married,high.school,no,yes,no,telephone,may,mon,226,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0
3,40,admin.,married,basic.6y,no,no,no,telephone,may,mon,151,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0
4,56,services,married,high.school,no,no,yes,telephone,may,mon,307,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
41183,73,retired,married,professional.course,no,yes,no,cellular,nov,fri,334,1,999,0,nonexistent,-1.1,94.767,-50.8,1.028,4963.6
41184,46,blue-collar,married,professional.course,no,no,no,cellular,nov,fri,383,1,999,0,nonexistent,-1.1,94.767,-50.8,1.028,4963.6
41185,56,retired,married,university.degree,no,yes,no,cellular,nov,fri,189,2,999,0,nonexistent,-1.1,94.767,-50.8,1.028,4963.6
41186,44,technician,married,professional.course,no,no,no,cellular,nov,fri,442,1,999,0,nonexistent,-1.1,94.767,-50.8,1.028,4963.6


In [55]:
y

0         no
1         no
2         no
3         no
4         no
        ... 
41183    yes
41184     no
41185     no
41186    yes
41187     no
Name: y, Length: 41188, dtype: object

In [56]:
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.25)


In [57]:
automl = AutoML(
    # results_path="AutoML_22",
    # total_time_limit=30 * 60,
    # start_random_models=10,
    # hill_climbing_steps=3,
    # top_models_to_improve=3,
    # train_ensemble=True,
    mode="Explain"
)

automl.fit(X_train, y_train)

Linear algorithm was disabled.
AutoML directory: AutoML_1
The task is binary_classification with evaluation metric logloss
AutoML will use algorithms: ['Baseline', 'Decision Tree', 'Random Forest', 'Xgboost', 'Neural Network']
AutoML will ensemble available models
AutoML steps: ['simple_algorithms', 'default_algorithms', 'ensemble']
* Step simple_algorithms will try to check up to 2 models
1_Baseline logloss 0.352021 trained in 1.26 seconds
2_DecisionTree logloss 0.209722 trained in 11.37 seconds
* Step default_algorithms will try to check up to 3 models
3_Default_Xgboost logloss 0.168641 trained in 15.68 seconds
4_Default_NeuralNetwork logloss 0.188355 trained in 12.57 seconds
5_Default_RandomForest logloss 0.19204 trained in 16.51 seconds
* Step ensemble will try to check up to 1 model
Ensemble logloss 0.167882 trained in 2.18 seconds
AutoML fit time: 71.51 seconds
AutoML best model: Ensemble


AutoML()

In [58]:
pred = automl.predict(X_test)
pred
# predict() returns an array of labels, so it can be passed to accuracy_score directly
# print("Test accuracy", accuracy_score(y_test, pred))

array(['no', 'no', 'no', ..., 'no', 'no', 'no'], dtype=object)

In [59]:
automl.report()

Best model,name,model_type,metric_type,metric_value,train_time
,1_Baseline,Baseline,logloss,0.352021,2.27
,2_DecisionTree,Decision Tree,logloss,0.209722,12.86
,3_Default_Xgboost,Xgboost,logloss,0.168641,17.21
,4_Default_NeuralNetwork,Neural Network,logloss,0.188355,13.91
,5_Default_RandomForest,Random Forest,logloss,0.19204,18.07
the best,Ensemble,Ensemble,logloss,0.167882,2.18

Unnamed: 0,score,threshold
logloss,0.188355,
auc,0.942012,
f1,0.634774,0.304977
accuracy,0.91221,0.425716
precision,0.888,0.81257
recall,1.0,1.48557e-13
mcc,0.591978,0.148388

Unnamed: 0,score,threshold
logloss,0.188355,
auc,0.942012,
f1,0.602113,0.425716
accuracy,0.91221,0.425716
precision,0.615108,0.425716
recall,0.589655,0.425716
mcc,0.552956,0.425716

Unnamed: 0,Predicted as no,Predicted as yes
Labeled as no,6532,321
Labeled as yes,357,513

Unnamed: 0,score,threshold
logloss,0.19204,
auc,0.931891,
f1,0.634195,0.298047
accuracy,0.911563,0.499591
precision,0.804598,0.649208
recall,1.0,0.00985953
mcc,0.596314,0.228478

Unnamed: 0,score,threshold
logloss,0.19204,
auc,0.931891,
f1,0.560206,0.499591
accuracy,0.911563,0.499591
precision,0.636896,0.499591
recall,0.5,0.499591
mcc,0.51647,0.499591

Unnamed: 0,Predicted as no,Predicted as yes
Labeled as no,6605,248
Labeled as yes,435,435

Unnamed: 0,score,threshold
logloss,0.209722,
auc,0.900979,
f1,0.601156,0.300244
accuracy,0.907419,0.451926
precision,0.793333,0.677513
recall,1.0,0.0122761
mcc,0.556765,0.300244

Unnamed: 0,score,threshold
logloss,0.209722,
auc,0.900979,
f1,0.585026,0.451926
accuracy,0.907419,0.451926
precision,0.590856,0.451926
recall,0.57931,0.451926
mcc,0.532962,0.451926

Unnamed: 0,Predicted as no,Predicted as yes
Labeled as no,6504,349
Labeled as yes,366,504

Unnamed: 0,score,threshold
logloss,0.352021,
auc,0.5,
f1,0.20249,0.10139
accuracy,0.112651,0.10139
precision,0.112651,0.10139
recall,1.0,0.10139
mcc,0.0,0.10139

Unnamed: 0,score,threshold
logloss,0.352021,
auc,0.5,
f1,0.20249,0.10139
accuracy,0.112651,0.10139
precision,0.112651,0.10139
recall,1.0,0.10139
mcc,0.0,0.10139

Unnamed: 0,Predicted as no,Predicted as yes
Labeled as no,0,6853
Labeled as yes,0,870

Model,Weight
3_Default_Xgboost,4
4_Default_NeuralNetwork,1
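One common way to read such a weight table is as a weighted average of the member models' predicted probabilities. The sketch below applies the weights 4 and 1 to hypothetical probabilities. Both the averaging rule and the numbers in `model_probs` are assumptions for illustration, since the notebook does not show how the ensemble combines its members.

```python
# Weights copied from the table above
weights = {"3_Default_Xgboost": 4, "4_Default_NeuralNetwork": 1}

# Hypothetical probabilities of the "yes" class for three rows
model_probs = {
    "3_Default_Xgboost": [0.10, 0.60, 0.90],
    "4_Default_NeuralNetwork": [0.20, 0.40, 0.80],
}

total_weight = sum(weights.values())
n_rows = len(model_probs["3_Default_Xgboost"])

# Weighted average per row: (4 * xgboost + 1 * neural_network) / 5
ensemble_probs = [
    sum(weights[name] * model_probs[name][i] for name in weights) / total_weight
    for i in range(n_rows)
]

print([round(p, 2) for p in ensemble_probs])
```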

Unnamed: 0,score,threshold
logloss,0.167882,
auc,0.952093,
f1,0.667926,0.258864
accuracy,0.917649,0.456965
precision,0.895238,0.821598
recall,1.0,0.000125528
mcc,0.63174,0.193721

Unnamed: 0,score,threshold
logloss,0.167882,
auc,0.952093,
f1,0.617329,0.456965
accuracy,0.917649,0.456965
precision,0.647727,0.456965
recall,0.589655,0.456965
mcc,0.572093,0.456965

Unnamed: 0,Predicted as no,Predicted as yes
Labeled as no,6574,279
Labeled as yes,357,513

Unnamed: 0,score,threshold
logloss,0.168641,
auc,0.95191,
f1,0.662188,0.279629
accuracy,0.916483,0.464777
precision,0.85124,0.831752
recall,1.0,0.000156908
mcc,0.626582,0.208308

Unnamed: 0,score,threshold
logloss,0.168641,
auc,0.95191,
f1,0.607903,0.464777
accuracy,0.916483,0.464777
precision,0.645161,0.464777
recall,0.574713,0.464777
mcc,0.562517,0.464777

Unnamed: 0,Predicted as no,Predicted as yes
Labeled as no,6578,275
Labeled as yes,370,500


## Regression Example - GENERIC

In [None]:
import numpy as np
import pandas as pd
from supervised.automl import AutoML

df = pd.read_csv("https://raw.githubusercontent.com/hantswilliams/HHA-507-2022/main/autoML/datasets/data_regression_housing.csv")
x_cols = [c for c in df.columns if c != "MEDV"]
X = df[x_cols]
y = df["MEDV"]

In [None]:
df

In [None]:
x_cols

In [None]:
X

In [None]:
y

In [None]:
automl = AutoML()
automl.fit(X, y)

In [None]:
df["predictions"] = automl.predict(X)


In [None]:
print("Predictions")
print(df[["MEDV", "predictions"]].head())
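To put a number on how close `predictions` is to `MEDV`, the root mean squared error (the metric named earlier for regression) can be computed by hand. The sketch below uses hypothetical values rather than the notebook's output. The helper name `rmse` is an assumption.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical target values and predictions
medv_actual = [24.0, 21.6, 34.7]
medv_predicted = [25.0, 20.6, 33.7]

error = rmse(medv_actual, medv_predicted)
print(f"RMSE: {error:.3f}")
```

In the cell above the model is fitted on `X` and then predicts on the same `X`, so an error computed this way would be optimistic. A held-out split, as in the classification examples, gives a fairer estimate.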

## Multiclass Classifier - GENERIC

In [None]:
import warnings

import pandas as pd
from supervised.automl import AutoML

# Raise on chained-assignment warnings so they are not silently ignored.
# In pandas 1.5 and 2.x the warning class is exposed as pd.errors.SettingWithCopyWarning.
warnings.filterwarnings("error", category=pd.errors.SettingWithCopyWarning)

df = pd.read_csv("https://raw.githubusercontent.com/hantswilliams/HHA-507-2022/main/autoML/datasets/data_classes_iris.csv")
X = df[["feature_1", "feature_2", "feature_3", "feature_4"]]
y = df["class"]



In [None]:
df

In [None]:
X

In [None]:
y.value_counts()

In [None]:
automl = AutoML()


In [None]:
automl.fit(X, y)

In [None]:
predictions = automl.predict_all(X)


In [None]:
print(predictions.head())
print(predictions.tail())

print(X.shape)
print(predictions.shape)
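For a multiclass problem, output that holds one probability per class can be reduced to a label by taking the most probable class. The sketch below shows that step on hypothetical rows. The class names and the dictionary layout are assumptions, since the notebook does not show the columns returned by `predict_all`.

```python
# Hypothetical per-class probabilities for three rows
rows = [
    {"class_a": 0.90, "class_b": 0.07, "class_c": 0.03},
    {"class_a": 0.10, "class_b": 0.55, "class_c": 0.35},
    {"class_a": 0.05, "class_b": 0.25, "class_c": 0.70},
]

# The predicted label is the class with the highest probability in each row
predicted_labels = [max(row, key=row.get) for row in rows]

# Sanity check: probabilities in each row should sum to 1
row_sums = [sum(row.values()) for row in rows]

print(predicted_labels)
```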

# Download outputs

In [None]:
# get current working directory
import os
os.getcwd()

'/content'

In [None]:
folders = os.listdir()
foldersML = [x for x in folders if x.startswith('sparcs')]
print(foldersML)
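A results directory can be archived with the Python standard library alone, which does not depend on a `zip` binary being installed. The sketch below wraps `shutil.make_archive` and runs it on a temporary placeholder directory. The helper name `archive_results` and the `demo_results` directory are hypothetical.

```python
import os
import shutil
import tempfile


def archive_results(results_dir, archive_base):
    """Zip results_dir into archive_base + '.zip' and return the archive path."""
    results_dir = os.path.abspath(results_dir)
    parent = os.path.dirname(results_dir)
    name = os.path.basename(results_dir)
    # root_dir/base_dir keep the directory name as the top-level entry in the archive
    return shutil.make_archive(archive_base, "zip", root_dir=parent, base_dir=name)


# Demonstration on a temporary directory standing in for an AutoML results folder
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = os.path.join(tmp, "demo_results")
    os.makedirs(demo_dir)
    with open(os.path.join(demo_dir, "README.md"), "w") as fh:
        fh.write("placeholder report\n")

    archive_path = archive_results(demo_dir, os.path.join(tmp, "demo_results"))
    archive_exists = os.path.exists(archive_path)
    archive_name = os.path.basename(archive_path)

print(archive_name, archive_exists)
```

In the notebook's setting the call would take the form `archive_results("sparcs_binary_los", "sparcs_binary_los")`, run from the working directory that contains the results folder.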

In [None]:
!zip -r /content/sparcs_binary_los.zip /content/sparcs_binary_los

  adding: content/sparcs_binary_los/ (stored 0%)
  adding: content/sparcs_binary_los/Ensemble/ (stored 0%)
  adding: content/sparcs_binary_los/Ensemble/roc_curve.png (deflated 9%)
  adding: content/sparcs_binary_los/Ensemble/precision_recall_curve.png (deflated 11%)
  adding: content/sparcs_binary_los/Ensemble/predictions_ensemble.csv (deflated 62%)
  adding: content/sparcs_binary_los/Ensemble/lift_curve.png (deflated 7%)
  adding: content/sparcs_binary_los/Ensemble/confusion_matrix_normalized.png (deflated 18%)
  adding: content/sparcs_binary_los/Ensemble/calibration_curve_curve.png (deflated 14%)
  adding: content/sparcs_binary_los/Ensemble/cumulative_gains_curve.png (deflated 7%)
  adding: content/sparcs_binary_los/Ensemble/learning_curves.png (deflated 11%)
  adding: content/sparcs_binary_los/Ensemble/confusion_matrix.png (deflated 17%)
  adding: content/sparcs_binary_los/Ensemble/README.md (deflated 67%)
  adding: content/sparcs_binary_los/Ensemble/ensemble.json (deflated 44%)
  a