<img src="../img/GTK_Logo_Social Icon.jpg" width=175 align="right" />

#  Worksheet 5.4: Automate it All! - Answers
This worksheet covers concepts relating to automating a machine learning model using the techniques we learned.  It should take no more than 20-30 minutes to complete.  Please raise your hand if you get stuck.  

In [6]:
!pip uninstall ydata-profiling

Found existing installation: ydata-profiling 0.0.dev0
Uninstalling ydata-profiling-0.0.dev0:
  Would remove:
    /opt/miniconda3/envs/gtk-bh-5-6v2/bin/pandas_profiling
    /opt/miniconda3/envs/gtk-bh-5-6v2/bin/ydata_profiling
    /opt/miniconda3/envs/gtk-bh-5-6v2/lib/python3.11/site-packages/pandas_profiling/*
    /opt/miniconda3/envs/gtk-bh-5-6v2/lib/python3.11/site-packages/ydata_profiling-0.0.dev0.dist-info/*
    /opt/miniconda3/envs/gtk-bh-5-6v2/lib/python3.11/site-packages/ydata_profiling/*
Proceed (Y/n)? ^C
ERROR: Operation cancelled by user

In [2]:
!pip install tpot

Collecting tpot
  Downloading TPOT-1.1.0-py3-none-any.whl.metadata (1.9 kB)
Collecting numpy>=1.26.4 (from tpot)
  Downloading numpy-2.3.2-cp311-cp311-macosx_14_0_arm64.whl.metadata (62 kB)
Collecting update-checker>=0.16 (from tpot)
  Downloading update_checker-0.18.0-py3-none-any.whl.metadata (2.3 kB)
Collecting stopit>=1.1.1 (from tpot)
  Downloading stopit-1.1.2.tar.gz (18 kB)
  Preparing metadata (setup.py) ... done
Collecting pandas>=2.2.0 (from tpot)
  Downloading pandas-2.3.1-cp311-cp311-macosx_11_0_arm64.whl.metadata (91 kB)
Collecting xgboost>=3.0.0 (from tpot)
  Downloading xgboost-3.0.3-py3-none-macosx_12_0_arm64.whl.metadata (2.1 kB)
Collecting lightgbm>=3.3.3 (from tpot)
  Downloading lightgbm-4.6.0-py3-none-macosx_12_0_arm64.whl.metadata (17 kB)
Collecting optuna>=3.0.5 (from tpot)
  Downloading optuna-4.4.0-py3-none-any.whl.metadata (17 kB)
Collecting dask>=2024.4.2 (from tpot)
  Downloading dask-2025.7.0-py3-none-any.whl.metadata (3.8 kB)
Collecting distrib

In [1]:
# Load Libraries - Make sure to run this cell!
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

from sklearn.metrics import classification_report, confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt
from tpot import TPOTClassifier
import joblib

## Step One:  Import the Data
In this example, we're going to use the dataset from Worksheet 5.3.  Run the following code to read in the data and extract the feature matrix and target vector.

In [2]:
df = pd.read_csv('../data/dga_features_final_df.csv')
target = df['isDGA']
feature_matrix = df.drop(['isDGA'], axis='columns')

Next, perform the test/train split in the conventional manner.

In [3]:
feature_matrix_train, feature_matrix_test, target_train, target_test = train_test_split(feature_matrix, 
                                                                                        target, 
                                                                                        test_size=0.25)
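
The split above is random, so the rows that land in the test set (and therefore the scores later on) will differ from run to run. The following is a minimal, standard-library-only sketch of why fixing a seed makes a shuffled split repeatable. The helper `shuffled_split` and the toy `rows` list are hypothetical names used for illustration; this is not how `train_test_split` is implemented internally.

```python
import random


def shuffled_split(rows, test_size=0.25, seed=None):
    """Shuffle a copy of rows and hold out the last test_size fraction as the test set."""
    rng = random.Random(seed)      # a fixed seed yields the same shuffle every time
    shuffled = list(rows)          # copy so the caller's data is left untouched
    rng.shuffle(shuffled)
    cut = len(shuffled) - round(len(shuffled) * test_size)
    return shuffled[:cut], shuffled[cut:]


# Toy data standing in for the rows of the feature matrix (hypothetical)
rows = list(range(8))

train_a, test_a = shuffled_split(rows, seed=42)
train_b, test_b = shuffled_split(rows, seed=42)

# Same seed, same split
print(train_a == train_b and test_a == test_b)
print(len(train_a), len(test_a))
```

In the worksheet itself, the equivalent would be passing an integer `random_state` to `train_test_split`; the particular integer is arbitrary.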

## Step Two:  Run the Optimizer
In the next step, use TPOT to create a classification pipeline for the DGA data set that we have been using.  `TPOTClassifier()` has many configuration options; in the interest of time, set the following when you instantiate the classifier.

* `max_eval_time_mins`:  Set this to 15 or 20.
* `verbosity`: Set to 1 or 2 so you can see what TPOT is doing.


**Note:  This step will take some time, so you might want to get some coffee or a snack while it is running.**  In the meantime, take a look at the other configuration options available here: http://epistasislab.github.io/tpot/api/.  

In [4]:
# Encode the string labels as integers; LabelEncoder sorts the classes, so dga -> 0 and legit -> 1
target_encoded = LabelEncoder().fit_transform(target_train)
print('orig target\n ', target_train[0:5])
print('\n')
print('encoded target\n ', target_encoded[0:5])

orig target
  588       dga
1653    legit
575       dga
225       dga
865       dga
Name: isDGA, dtype: object


encoded target
  [0 1 0 0 0]
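
To make the encoding step less of a black box, here is a rough pure-Python sketch of the idea behind it: collect the distinct labels, sort them, and number them in that order. The helper `fit_label_mapping` is a hypothetical name for illustration, not part of scikit-learn, and the five labels are the ones shown in the output above.

```python
def fit_label_mapping(labels):
    """Map each distinct label to an integer, numbering the labels in sorted order."""
    classes = sorted(set(labels))
    return {label: index for index, label in enumerate(classes)}


# The five training labels printed above
train_labels = ['dga', 'legit', 'dga', 'dga', 'dga']

mapping = fit_label_mapping(train_labels)
encoded = [mapping[label] for label in train_labels]

# Reversing the mapping turns integer predictions back into readable labels
inverse = {index: label for label, index in mapping.items()}
decoded = [inverse[code] for code in encoded]

print(mapping)
print(encoded)
print(decoded)
```

Because the model is trained on these integers, its predictions will also be integers, so the test labels need the same mapping before they can be compared with the predictions.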


In [5]:
optimizer = TPOTClassifier(n_jobs=-1, max_eval_time_mins=1)
optimizer.fit(feature_matrix_train, target_encoded)

TimeoutError: No valid workers found

## Step Three:  Evaluate the Performance
Now that you have a trained model, the next step is to evaluate its performance and see how TPOT did in comparison with the earlier models we created.  Use the techniques you've learned to evaluate your model.  Specifically, print out the `classification_report` and a confusion matrix. 

What is the accuracy of your model?  Is it significantly better than what you did in earlier labs?
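
As a reminder of how the numbers in the classification report relate to the confusion matrix, here is a small sketch that derives accuracy, precision, recall and F1 from the four cells of a binary confusion matrix. The function `binary_metrics` is a hypothetical helper, and the counts are made up for illustration; they are not results from this worksheet.

```python
def binary_metrics(tp, fp, fn, tn):
    """Compute common metrics from the four cells of a binary confusion matrix."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Guard against division by zero when a class is never predicted or never present
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {'accuracy': accuracy, 'precision': precision,
            'recall': recall, 'f1': f1}


# Hypothetical counts, NOT taken from the DGA model
metrics = binary_metrics(tp=45, fp=5, fn=10, tn=40)

for name, value in metrics.items():
    print(f'{name}: {value:.3f}')
```

Accuracy alone can look good on an imbalanced data set, which is why the worksheet asks for the full report and the confusion matrix as well.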

In [None]:
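# The model was trained on integer-encoded labels, so encode the test labels the same way.
# LabelEncoder sorts its classes, so this reproduces the training mapping (dga -> 0, legit -> 1).
target_test_encoded = LabelEncoder().fit_transform(target_test)
predictions = optimizer.predict(feature_matrix_test)

In [None]:
# classification_report expects the true labels first, then the predictions
print(classification_report(target_test_encoded, predictions))

In [None]:
conf_matrix = confusion_matrix(target_test_encoded, predictions, labels=optimizer.classes_)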
disp = ConfusionMatrixDisplay(confusion_matrix=conf_matrix,
                              display_labels=optimizer.classes_)

disp.plot(cmap='summer');

## Step Four:  Export your Pipeline
If you are happy with the results from `TPOT`, you can export the pipeline as Python code. The final step in this lab is to export the pipeline to a file called `automate_ml.py` and examine it.  What model and preprocessing steps did TPOT find?  Was this a surprise?

In [None]:
optimizer.export('automate_ml.py')