DGA-generated domain detection using deep learning models
I'm currently using Conda (Anaconda/Miniconda) for development, but you should be able to use Pipenv or virtualenv as well using the included requirements.txt.
conda env create -f environment.yml
conda activate degas
pipenv install -r requirements.txt
pipenv shell
Virtualenv is similar, but there's really no reason to use virtualenv instead of Pipenv anymore.
Retraining the model
There is a trained model checked into the models directory. If you'd like to train your own, you'll first need to download the training data from S3:
python degas/runner download-data
Process the data into the simple CSV form that the model builder expects:
python degas/runner process-data data/raw data/processed
Those steps only need to be run once, unless you change the training data.
To then retrain the model using the generated dataset, first install tensorflow-gpu using your package manager of choice (conda install tensorflow-gpu or pip install tensorflow-gpu) so that training is GPU-accelerated. Then run:
python degas/runner train-model data/processed
Run python degas/runner train-model --help to see the available tuning options.
This takes about an hour and a half on a GTX 1070. Training only runs about 9 epochs before it short-circuits; you could potentially cap it at, say, 5 epochs and still get good accuracy in roughly half the training time:
python degas/runner train-model --epochs 5 data/processed
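To make the trade-off between the short-circuit and an explicit --epochs cap concrete, here is a minimal sketch of patience-based early stopping on validation loss. This is an assumption about how the short-circuit behaves: the exact stopping criterion isn't described here. The function name stopping_epoch, the patience parameter, and the loss values are all made up for illustration.

```python
def stopping_epoch(val_losses, patience=2, max_epochs=None):
    """Return the 1-based epoch at which training would end.

    Assumed rule (not taken from this project): stop once the validation
    loss has failed to improve for `patience` consecutive epochs, or when
    `max_epochs` is reached, whichever comes first.
    """
    if max_epochs is not None:
        # An explicit cap, analogous to passing --epochs on the command line.
        val_losses = val_losses[:max_epochs]

    best = float("inf")
    stale = 0  # consecutive epochs without a new best loss
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, stale = loss, 0
        else:
            stale += 1
            if stale >= patience:
                return epoch
    return len(val_losses)


# Made-up validation losses, one per epoch, purely for illustration.
losses = [0.50, 0.40, 0.35, 0.33, 0.32, 0.31, 0.30, 0.31, 0.32, 0.30]
uncapped = stopping_epoch(losses)              # stops once the loss stalls
capped = stopping_epoch(losses, max_epochs=5)  # stops at the cap instead
```

With these illustrative numbers, the uncapped run stops after the loss has stalled for two epochs, while the capped run simply ends at epoch 5.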
Since this project uses TensorFlow as the underlying deep learning library, the recommended way to run inference is with TensorFlow Serving.
You should be able to serve it using:
docker run -p 8501:8501 \
  --mount type=bind,source="$(pwd)/models/degas",target=/models/degas \
  -e MODEL_NAME=degas -t tensorflow/serving
See the TensorFlow Serving docs for more information about the available options.
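Once the container is running, the model can be queried over TensorFlow Serving's REST API, which accepts a POST to /v1/models/<name>:predict with a JSON body of the form {"instances": [...]}. The sketch below builds and sends such a request using only the standard library. The helper names build_predict_request and predict are hypothetical, and the shape of each instance depends on the model's input signature, which isn't documented here, so treat the example input as a placeholder.

```python
import json
import urllib.request


def build_predict_request(instances, host="localhost", port=8501, model_name="degas"):
    """Build a POST request for TensorFlow Serving's REST predict endpoint.

    `instances` must match whatever input the served model expects; that
    format is an assumption left to the caller.
    """
    url = f"http://{host}:{port}/v1/models/{model_name}:predict"
    body = json.dumps({"instances": instances}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(instances, timeout=5.0, **kwargs):
    """Send the request and return the "predictions" field of the response."""
    request = build_predict_request(instances, **kwargs)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["predictions"]


# Example (requires the container above to be running; the input below is
# a placeholder, not the model's real input encoding):
# scores = predict([[1, 2, 3]])
```

Keeping request construction separate from the network call makes it easy to inspect the URL and payload before anything is sent.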
Why deep learning for this task? Because it works well, and it isn't hard to implement. From Byu et al., 2018:
"Deep neural networks have recently appeared in the literature on DGA detection Woodbridge et al. (2016); Saxe & Berlin (2017); Yu et al. (2017). They significantly outperform traditional machine learning methods in accuracy, at the price of increasing the complexity of training the model and requiring larger datasets."
Since there's plenty of data available to train with, creating a deep learning model is as easy as, or easier than, the alternatives.
- Because it's more fun working on a project with a name, rather than "DGA-detector" or something.
- Perhaps naming the project after an impressionist painter will make it sound more impressive?
- It was the first result from the classic "Samba naming algorithm" (egrep -i '^d.*g.*a.*' /usr/share/dict/words)
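For anyone without egrep or a system word list handy, the same "Samba naming algorithm" filter can be sketched in Python. The names SAMBA_PATTERN and candidate_names are made up for this example, and the sample word list is illustrative rather than the contents of /usr/share/dict/words.

```python
import re

# Same pattern as the egrep command: a "d", then a "g", then an "a",
# in that order, matched case-insensitively from the start of the word.
SAMBA_PATTERN = re.compile(r"^d.*g.*a.*", re.IGNORECASE)


def candidate_names(words):
    """Return the words that contain d, g, a in order, starting with d."""
    return [word for word in words if SAMBA_PATTERN.match(word)]


# Illustrative sample; to scan a real word list, pass in its lines, e.g.
# candidate_names(line.strip() for line in open("/usr/share/dict/words"))
sample = ["Degas", "dogma", "drag", "apple"]
matches = candidate_names(sample)
```

Here "drag" is rejected because no "a" follows its "g", and "apple" is rejected because it doesn't start with "d".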
For the record, I'm pronouncing it "de-gah", as in Edgar Degas, not "de-gas", as in "to remove all the gas."