Degas

DGA-generated domain detection using deep learning models

Running

I'm currently using Conda (Anaconda/Miniconda) for development, but you should be able to use Pipenv or virtualenv as well, via the included requirements.txt.

Conda:

conda env create -f environment.yml
conda activate degas

Pipenv:

pipenv install -r requirements.txt
pipenv shell

Virtualenv is similar, but there's really no reason to use virtualenv instead of Pipenv anymore.

Retraining the model

There is a trained model checked into the models directory. If you'd like to train your own, you'll first need to download the training data from S3:

python degas/runner download-data

Process the data into the simple CSV form that the model builder expects:

python degas/runner process-data data/raw data/processed

Those steps only need to be run once, unless you change the training data.

To then retrain the model using the generated dataset, first install tensorflow-gpu using your package manager of choice (conda install tensorflow-gpu or pip install tensorflow-gpu) so that training is GPU-accelerated.

Then, run:

python degas/runner train-model data/processed

Run python degas/runner train-model --help to see the available tuning options. Training takes about an hour and a half on a GTX 1070. It usually stops early after about 9 epochs; you could cap it at, say, 5 epochs and likely still get good accuracy in roughly half the training time:

python degas/runner train-model --epochs 5 data/processed
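The "short-circuit" after roughly 9 epochs is presumably some form of early stopping; this README doesn't say which rule Degas actually uses. As an illustration only, here is a minimal plain-Python sketch of a patience-based rule: stop once the validation loss has failed to improve for a set number of consecutive epochs. The function name, the patience value, and the loss numbers are all hypothetical and are not taken from the Degas code.

```python
def epochs_until_early_stop(val_losses, patience=2, min_delta=0.0):
    """Return how many epochs run before a patience-based rule stops training.

    val_losses: per-epoch validation losses (made-up numbers in this sketch).
    patience:   consecutive non-improving epochs to tolerate before stopping.
    min_delta:  minimum decrease in loss that counts as an improvement.
    """
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best - min_delta:
            # New best loss: remember it and reset the counter.
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    # Never ran out of patience: every requested epoch was used.
    return len(val_losses)


# ASSUMPTION: invented validation losses, for illustration only.
losses = [0.30, 0.21, 0.17, 0.15, 0.14, 0.135, 0.133, 0.134, 0.136, 0.140]
stopped_at = epochs_until_early_stop(losses, patience=2)
print(f"stopped after {stopped_at} of {len(losses)} epochs")
```

With a rule like this, passing --epochs 5 simply ends the run before the stopping condition would have triggered, which is why it trades a little accuracy for time.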

Making predictions

Since this project uses TensorFlow as its underlying deep learning library, the recommended way to run inference is with TensorFlow Serving.

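You should be able to serve it by running the following from the repository root (Docker bind mounts need an absolute source path, hence $(pwd)):

docker run -p 8501:8501 \
  --mount type=bind,source="$(pwd)/models/degas",target=/models/degas \
  -e MODEL_NAME=degas -t tensorflow/serving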

See the TensorFlow Serving docs for more information about the available options.
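Once the container is running, the model should be reachable through TensorFlow Serving's REST predict endpoint on port 8501. The sketch below builds such a request using only the Python standard library. The URL pattern and the {"instances": ...} body follow the TensorFlow Serving REST API, but the input encoding is an assumption: this README doesn't document what the model expects, so the integer sequence here is a made-up placeholder, and the helper names are hypothetical.

```python
import json
import urllib.request


def build_predict_request(instances, host="localhost", port=8501, model_name="degas"):
    """Build (but do not send) a TensorFlow Serving REST predict request."""
    url = f"http://{host}:{port}/v1/models/{model_name}:predict"
    body = json.dumps({"instances": instances}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(instances, **kwargs):
    """Send the request and return the "predictions" list from the response."""
    request = build_predict_request(instances, **kwargs)
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)["predictions"]


# ASSUMPTION: a made-up, already-encoded input. Check the model's signature
# for the real input shape and encoding before relying on this.
example_instances = [[4, 15, 13, 1, 9, 14]]
request = build_predict_request(example_instances)
print(request.full_url)
print(request.data.decode("utf-8"))
```

Calling predict(example_instances) would send the request to the running container; it is left uncalled here so the snippet runs without a server.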

About Degas

Why deep learning for this task? Because it works well, and it isn't hard to implement. From Byu et al., 2018:

"Deep neural networks have recently appeared in the literature on DGA detection Woodbridge et al. (2016); Saxe & Berlin (2017); Yu et al. (2017). They significantly outperform traditional machine learning methods in accuracy, at the price of increasing the complexity of training the model and requiring larger datasets."

Since there's plenty of data available to train with, creating a deep learning model is as easy as, or easier than, the alternatives.

References

https://openreview.net/forum?id=BJLmN8xRW&noteId=BJLmN8xRW
http://faculty.washington.edu/mdecock/papers/byu2018a.pdf

Why "Degas"?

Because it's more fun working on a project with a name, rather than "DGA-detector" or something.
Perhaps naming the project after an impressionist painter will make it sound more impressive?
It was the first result from the classic "Samba naming algorithm" (egrep -i '^d.*g.*a.*' /usr/share/dict/words)

For the record, I'm pronouncing it "de-gah", as in Edgar Degas, not "de-gas", as in "to remove all the gas."
