# Set up AutoGluon

To model drug use, we will use [AutoGluon](https://auto.gluon.ai/stable/index.html).

To install AutoGluon, follow [these instructions](https://auto.gluon.ai/stable/install.html). The commands below install the CPU build on Windows.


In [1]:
!pip install -U pip
!pip install -U setuptools wheel
!pip install torch==1.12+cpu torchvision==0.13.0+cpu torchtext==0.13.0 -f https://download.pytorch.org/whl/cpu/torch_stable.html
!pip install autogluon

Looking in links: https://download.pytorch.org/whl/cpu/torch_stable.html
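
Before moving on, it can help to confirm which versions actually got installed. The sketch below relies only on the standard library's `importlib.metadata`; the helper name `installed_version` is my own and is not part of AutoGluon.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string for a package, or None if it is missing."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# Report the packages installed in the cell above.
for name in ["autogluon", "torch", "torchvision", "torchtext"]:
    print(f"{name}: {installed_version(name) or 'not installed'}")
```

If any line reports `not installed`, rerun the matching `pip install` command before continuing.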


## Try the basic tutorial to ensure that AutoGluon is working

In [75]:
import autogluon
import pandas as pd
import pathlib
import numpy as np
import statsmodels.formula.api as smf
import statsmodels.api as sm

from autogluon.tabular import TabularDataset, TabularPredictor

In [4]:
train_data = TabularDataset('https://autogluon.s3.amazonaws.com/datasets/Inc/train.csv')
subsample_size = 500  # subsample subset of data for faster demo, try setting this to much larger values
train_data = train_data.sample(n=subsample_size, random_state=0)
train_data.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,class
6118,51,Private,39264,Some-college,10,Married-civ-spouse,Exec-managerial,Wife,White,Female,0,0,40,United-States,>50K
23204,58,Private,51662,10th,6,Married-civ-spouse,Other-service,Wife,White,Female,0,0,8,United-States,<=50K
29590,40,Private,326310,Some-college,10,Married-civ-spouse,Craft-repair,Husband,White,Male,0,0,44,United-States,<=50K
18116,37,Private,222450,HS-grad,9,Never-married,Sales,Not-in-family,White,Male,0,2339,40,El-Salvador,<=50K
33964,62,Private,109190,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,15024,0,40,United-States,>50K


In [5]:
label = 'class'
print("Summary of class variable: \n", train_data[label].describe())

Summary of class variable: 
 count        500
unique         2
top        <=50K
freq         365
Name: class, dtype: object
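
The `describe()` output gives the count of the most frequent class (`freq`) and the number of rows (`count`). Dividing one by the other gives the accuracy that a model would reach by always predicting `<=50K`, which is a useful floor when reading the scores later on. A minimal check, using the two numbers printed above:

```python
# Values copied from the describe() output above.
n_rows = 500
majority_count = 365

# Accuracy of a "model" that always predicts the most frequent class.
majority_share = majority_count / n_rows
print(f"Majority-class share: {majority_share:.2f}")
```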


In [6]:
# Set directory
output_f = pathlib.Path.cwd().parent.parent.joinpath('output')
output_f

WindowsPath('f:/mdi-workshop/output')

In [7]:
save_path = output_f.joinpath('Ag_TEST')  # specifies folder to store trained models
predictor = TabularPredictor(label=label, path=save_path).fit(train_data)

Beginning AutoGluon training ...
AutoGluon will save models to "f:\mdi-workshop\output\Ag_TEST\"
AutoGluon Version:  0.6.0
Python Version:     3.9.0
Operating System:   Windows
Platform Machine:   AMD64
Platform Version:   10.0.19041
Train Data Rows:    500
Train Data Columns: 14
Label Column: class
Preprocessing data ...
AutoGluon infers your prediction problem is: 'binary' (because only two unique label-values observed).
	2 unique label values:  [' >50K', ' <=50K']
	If 'binary' is not the correct problem_type, please manually specify the problem_type parameter during predictor init (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])
Selected class <--> label mapping:  class 1 =  >50K, class 0 =  <=50K
	Note: For your binary classification, AutoGluon arbitrarily selected which label-value represents positive ( >50K) vs negative ( <=50K) class.
	To explicitly set the positive_class, either rename classes to 1 and 0, or specify positive_class in Predictor in

In [10]:
test_data = TabularDataset('https://autogluon.s3.amazonaws.com/datasets/Inc/test.csv')
y_test = test_data[label]  # values to predict
test_data_nolab = test_data.drop(columns=[label])  # delete label column to prove we're not cheating
test_data_nolab.head()

Loaded data from: https://autogluon.s3.amazonaws.com/datasets/Inc/test.csv | Columns = 15 / 15 | Rows = 9769 -> 9769


Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country
0,31,Private,169085,11th,7,Married-civ-spouse,Sales,Wife,White,Female,0,0,20,United-States
1,17,Self-emp-not-inc,226203,12th,8,Never-married,Sales,Own-child,White,Male,0,0,45,United-States
2,47,Private,54260,Assoc-voc,11,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,1887,60,United-States
3,21,Private,176262,Some-college,10,Never-married,Exec-managerial,Own-child,White,Female,0,0,30,United-States
4,17,Private,241185,12th,8,Never-married,Prof-specialty,Own-child,White,Male,0,0,20,United-States


In [11]:
predictor = TabularPredictor.load(save_path)  # unnecessary, just demonstrates how to load previously-trained predictor from file

y_pred = predictor.predict(test_data_nolab)
print("Predictions:  \n", y_pred)
perf = predictor.evaluate_predictions(y_true=y_test, y_pred=y_pred, auxiliary_metrics=True)

Evaluation: accuracy on test data: 0.8183027945542021
Evaluations on test data:
{
    "accuracy": 0.8183027945542021,
    "balanced_accuracy": 0.7224828905188908,
    "mcc": 0.4725871539587948,
    "f1": 0.5851834540780556,
    "precision": 0.6384497705252422,
    "recall": 0.540120793787748
}


Predictions:  
 0        <=50K
1        <=50K
2         >50K
3        <=50K
4        <=50K
         ...  
9764     <=50K
9765     <=50K
9766     <=50K
9767     <=50K
9768     <=50K
Name: class, Length: 9769, dtype: object
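
The auxiliary metrics are related to each other. As a sanity check, the reported `f1` should be the harmonic mean of the reported `precision` and `recall`. The sketch below recomputes it from the values printed above; nothing here calls AutoGluon.

```python
# Values copied from the evaluation output above.
precision = 0.6384497705252422
recall = 0.540120793787748
reported_f1 = 0.5851834540780556

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
print("Recomputed F1:", f1)
print("Matches reported F1:", abs(f1 - reported_f1) < 1e-6)
```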


In [12]:
predictor.leaderboard(test_data, silent=True)


Unnamed: 0,model,score_test,score_val,pred_time_test,pred_time_val,fit_time,pred_time_test_marginal,pred_time_val_marginal,fit_time_marginal,stack_level,can_infer,fit_order
0,RandomForestGini,0.84287,0.84,0.150125,0.073062,0.551456,0.150125,0.073062,0.551456,1,True,5
1,CatBoost,0.842461,0.85,0.019017,0.006004,39.533127,0.019017,0.006004,39.533127,1,True,7
2,RandomForestEntr,0.84113,0.83,0.142116,0.06005,0.476976,0.142116,0.06005,0.476976,1,True,6
3,LightGBM,0.839799,0.85,0.064053,0.011009,0.292242,0.064053,0.011009,0.292242,1,True,4
4,XGBoost,0.837445,0.87,0.074061,0.007006,0.714366,0.074061,0.007006,0.714366,1,True,11
5,LightGBMXT,0.836421,0.83,0.014011,0.01201,1.490118,0.014011,0.01201,1.490118,1,True,3
6,ExtraTreesGini,0.834579,0.82,0.126105,0.065052,0.47139,0.126105,0.065052,0.47139,1,True,8
7,ExtraTreesEntr,0.83335,0.81,0.133108,0.063052,0.499333,0.133108,0.063052,0.499333,1,True,9
8,LightGBMLarge,0.828949,0.83,0.019015,0.014011,0.634495,0.019015,0.014011,0.634495,1,True,13
9,NeuralNetFastAI,0.823626,0.82,0.170141,0.023018,1.919977,0.170141,0.023018,1.919977,1,True,10
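
Note that `score_val` and `score_test` rank the models differently. The validation scores are all multiples of 0.01, which suggests they come from a validation set of roughly 100 rows, so they are likely to be noisy. The comparison below uses the values copied from the table above to show that the best model by validation score is not the best on the test data.

```python
# (model, score_test, score_val) copied from the leaderboard above.
rows = [
    ("RandomForestGini", 0.84287, 0.84),
    ("CatBoost", 0.842461, 0.85),
    ("RandomForestEntr", 0.84113, 0.83),
    ("LightGBM", 0.839799, 0.85),
    ("XGBoost", 0.837445, 0.87),
    ("LightGBMXT", 0.836421, 0.83),
    ("ExtraTreesGini", 0.834579, 0.82),
    ("ExtraTreesEntr", 0.83335, 0.81),
    ("LightGBMLarge", 0.828949, 0.83),
    ("NeuralNetFastAI", 0.823626, 0.82),
]

best_by_test = max(rows, key=lambda r: r[1])[0]
best_by_val = max(rows, key=lambda r: r[2])[0]
print("Best on test data:      ", best_by_test)
print("Best on validation data:", best_by_val)
```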


## Now we will load the real data

In [8]:
repo_f = pathlib.Path.cwd().parent.parent
code_f = repo_f.joinpath('code')
input_f = repo_f.joinpath('input')
output_f = repo_f.joinpath('output')

In [23]:
def get_data(dep_var):
    infile = input_f.joinpath('NSDUH/' + dep_var + '_test_2019_1pct.csv')
    df_test = pd.read_csv(infile)

    infile = input_f.joinpath('NSDUH/' + dep_var + '_train_2019_1pct.csv')
    df_train = pd.read_csv(infile)

    infile = input_f.joinpath('NSDUH/' + dep_var + '_validate_2019_1pct.csv')
    df_validate = pd.read_csv(infile)

    df_train = pd.concat([df_train, df_validate], ignore_index=True)

    infile = input_f.joinpath('NSDUH/' + dep_var + '_resampled_2019_1pct.csv')
    df_resampled = pd.read_csv(infile)

    # Drop 'Unnamed: 0' index-column artifacts where present
    for df in (df_train, df_test, df_resampled):
        df.drop(columns=['Unnamed: 0'], inplace=True, errors='ignore')

    return df_train, df_test, df_resampled
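
The four reads in `get_data` follow one naming pattern, `<dep_var>_<split>_2019_1pct.csv`, inside the `NSDUH` folder. One possible way to make that pattern explicit is to build all the paths in one place. `split_paths` and `SPLITS` below are hypothetical names of my own, and the `"input"` directory in the usage lines is only a placeholder.

```python
import pathlib

SPLITS = ("train", "validate", "test", "resampled")


def split_paths(input_dir, dep_var):
    """Map each split name to the CSV path that get_data reads for dep_var."""
    nsduh_dir = pathlib.Path(input_dir) / "NSDUH"
    return {split: nsduh_dir / f"{dep_var}_{split}_2019_1pct.csv" for split in SPLITS}


# Placeholder directory; in the notebook this would be input_f.
paths = split_paths("input", "nonmj")
for split, path in paths.items():
    print(split, "->", path.name)
```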

In [78]:
df_train, df_test, df_resampled = get_data('nonmj')

In [43]:
df_train.head()

Unnamed: 0,agegrp,marstat,male,edu,race,employ,income,famincome,poverty,actual
0,7,4,1,9,7,1,1,7,3,0
1,13,1,2,11,7,4,1,7,3,0
2,3,0,1,4,7,0,1,3,2,0
3,16,1,1,11,1,4,3,7,3,0
4,2,0,2,2,6,0,1,7,3,0


## Run naive logit

In [80]:
fx = 'actual ~ C(agegrp) + C(marstat) + C(male) + C(edu) + C(race) + C(employ) + C(income) + C(famincome) + C(poverty)'

log_reg = smf.logit(fx, data=df_train).fit()


Optimization terminated successfully.
         Current function value: 0.275251
         Iterations 8


In [81]:
# Predict probabilities on the test set and classify at a 0.5 threshold
yhat = log_reg.predict(df_test)
prediction = (yhat >= 0.5).astype(int)

from sklearn.metrics import confusion_matrix, accuracy_score

# Confusion matrix (rows: actual class, columns: predicted class)
cm = confusion_matrix(df_test['actual'], prediction)
print("Confusion Matrix : \n", cm)

# Accuracy of the model on the test set
print('Test accuracy = ', accuracy_score(df_test['actual'], prediction))

Confusion Matrix : 
 [[248442      0]
 [ 24353      0]]
Test accuracy =  0.9107278359207464
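
The second column of the confusion matrix is empty: at the 0.5 threshold the logit never predicts the positive class. Its accuracy is therefore just the share of negatives in the test set, and its recall on positives is zero. The check below reproduces both from the numbers printed above.

```python
# Confusion matrix printed above: rows are actual classes, columns are predicted classes.
cm = [[248442, 0],
      [24353, 0]]

n_total = sum(sum(row) for row in cm)
n_negative = sum(cm[0])
predicted_positive = cm[0][1] + cm[1][1]

majority_share = n_negative / n_total   # accuracy of always predicting 0
recall = cm[1][1] / sum(cm[1])          # share of actual positives that were found
print("Predicted positives:", predicted_positive)
print("Share of negatives: ", majority_share)
print("Recall on positives:", recall)
```

This is why accuracy alone is a poor yardstick for this data, and why a metric such as F1 is more informative here.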


## Run AutoGluon

In [82]:
## Make dummies for each categorical variable
cat_vars = ['agegrp', 'marstat', 'male', 'edu', 'race', 'employ', 'income', 'famincome', 'poverty']

# Cast the codes to strings so get_dummies treats every variable as categorical.
# After these casts no NaN remains, so dummy_na=True only adds an all-zero '<var>nan' column per variable.
df_train = df_train.astype(float).astype(int).astype(str)
df_train = pd.get_dummies(df_train, prefix_sep='', dummy_na=True, columns=cat_vars, sparse=False, drop_first=False)
df_train['actual'] = df_train['actual'].astype(int)  # restore the label as an integer

df_test = df_test.astype(float).astype(int).astype(str)
df_test = pd.get_dummies(df_test, prefix_sep='', dummy_na=True, columns=cat_vars, sparse=False, drop_first=False)
df_test['actual'] = df_test['actual'].astype(int)  # restore the label as an integer
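
`pd.get_dummies` is applied to the two frames separately, so a category that appears in only one of them produces a column the other lacks. One way to guard against that, shown here on toy data, is to reindex the test frame to the training columns. The toy frames and their values are invented for illustration and are not taken from the NSDUH files.

```python
import pandas as pd

# Toy frames (made-up values): category '3' appears only in train, '4' only in test.
train = pd.DataFrame({"edu": ["1", "2", "3"], "actual": [0, 1, 0]})
test = pd.DataFrame({"edu": ["1", "2", "4"], "actual": [1, 0, 0]})

train_d = pd.get_dummies(train, prefix_sep="", columns=["edu"])
test_d = pd.get_dummies(test, prefix_sep="", columns=["edu"])

# Give the test frame exactly the training columns: missing dummies are
# filled with 0 and columns unseen in training are dropped.
test_aligned = test_d.reindex(columns=train_d.columns, fill_value=0)

print(list(train_d.columns))
print(list(test_aligned.columns))
```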


In [85]:
## Run the model
label = 'actual'  # column to predict; what it measures depends on the dep_var passed to get_data
folder = 'agModels-predictnonmj_use'
save_path = output_f.joinpath(folder)  # folder in which to store trained models
time_limit = 0.25*60*60  # time limit for training in seconds (15 minutes)
metric = 'f1'  # metric that AutoGluon optimizes during training
predictor = TabularPredictor(label, eval_metric=metric, path=save_path).fit(df_train, time_limit=time_limit, presets='best_quality')


Presets specified: ['best_quality']
Stack configuration (auto_stack=True): num_stack_levels=0, num_bag_folds=8, num_bag_sets=20
Beginning AutoGluon training ... Time limit = 900.0s
AutoGluon will save models to "f:\mdi-workshop\output\agModels-predictnonmj_use\"
AutoGluon Version:  0.6.0
Python Version:     3.9.0
Operating System:   Windows
Platform Machine:   AMD64
Platform Version:   10.0.19041
Train Data Rows:    553854
Train Data Columns: 74
Label Column: actual
Preprocessing data ...
AutoGluon infers your prediction problem is: 'binary' (because only two unique label-values observed).
	2 unique label values:  [0, 1]
	If 'binary' is not the correct problem_type, please manually specify the problem_type parameter during predictor init (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])
Selected class <--> label mapping:  class 1 = 1, class 0 = 0
Using Feature Generators to preprocess the data ...
Fitting AutoMLPipelineFeatureGenerator...
	Available Memor

In [86]:
## Look at the leaderboard (comparison of all models trained)
leaderboard = predictor.leaderboard(df_test, silent=True)
outfile = output_f.joinpath('leaderboard_nonmj_use.csv')
leaderboard.to_csv(outfile)
# Pick the model with the highest test score; iloc[0] takes the first row after sorting
best_model = leaderboard.sort_values('score_test', ascending=False).iloc[0]['model']
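
A caution about selecting the top row after `sort_values`: `.loc[0]` looks up the row whose index label is 0, while `.iloc[0]` takes the first row in the sorted order. The two agree only when the row labelled 0 is also the top scorer, as appears to be the case for a leaderboard that already comes back sorted by `score_test`. The toy frame below, with made-up scores, shows how they can differ.

```python
import pandas as pd

# Toy leaderboard (made-up scores) whose best model is NOT in the row labelled 0.
toy = pd.DataFrame({"model": ["A", "B", "C"], "score_test": [0.70, 0.90, 0.80]})
ranked = toy.sort_values("score_test", ascending=False)

by_label = ranked.loc[0, "model"]       # row whose index label is 0
by_position = ranked.iloc[0]["model"]   # first row after sorting
print("loc[0]: ", by_label)
print("iloc[0]:", by_position)
```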


In [88]:
## Inspect confusion matrix
# Predicted probability of the positive class from the best model
df_test['nonmj_use_prob'] = predictor.predict_proba(df_test, as_multiclass=False, model=best_model)
df_test['pred'] = np.where(df_test['nonmj_use_prob'] >= 0.5, 1, 0)  # classify at a 0.5 threshold
from sklearn.metrics import confusion_matrix
conf_mat = confusion_matrix(y_true=df_test['actual'], y_pred=df_test['pred'])
print('Confusion matrix:\n', conf_mat)
outfile = output_f.joinpath('nonmj_confusion_matrix.csv')
conf_mat = pd.DataFrame(conf_mat)
conf_mat.to_csv(outfile)
# predictor.delete_models(models_to_keep='best', dry_run=False)  # Run this to save disk space (removes all models except the best)

Confusion matrix:
 [[245818   2624]
 [  7210  17143]]
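
To read this matrix in the same terms as the training metric, the standard summary scores can be derived from its four cells. The sketch below assumes scikit-learn's layout (rows are actual classes, columns are predicted classes) and uses only the numbers printed above.

```python
# Confusion matrix printed above: rows are actual classes, columns are predicted classes.
tn, fp = 245818, 2624
fn, tp = 7210, 17143

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # share of predicted positives that are correct
recall = tp / (tp + fn)      # share of actual positives that were found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
```

Unlike the naive logit, which predicted no positives at all, this model recovers a large share of the positive cases at the same 0.5 threshold.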
