# Initial Research
## Docs and Existed Solutions
1. [Domain generation algorithm / Detection](https://en.wikipedia.org/wiki/Domain_generation_algorithm#Detection)
2. ["Character Level based Detection of DGA Domain Names"](http://faculty.washington.edu/mdecock/papers/byu2018a.pdf) - a perfect review to step into this problem.
3. GAN perspective:
  - ["DeepDGA: Adversarially-Tuned Domain Generation and Detection"](https://arxiv.org/pdf/1610.01969.pdf) It has a [code](https://github.com/roreagan/DeepDGA).
  - ["MaskDGA: A Black-box Evasion Technique Against DGA Classifiers and Adversarial Defenses"](https://arxiv.org/pdf/1902.08909.pdf)
4. ["Inline Detection of Domain Generation Algorithms with Context-Sensitive Word Embeddings"](https://arxiv.org/pdf/1811.08705.pdf) - non-typical approach with splitting DN to words and applying a pretrained ELMO model.
5. https://github.com/jayjacobs/dga On R :(
6. https://github.com/andrewaeva/DGA 5 years old :( Has a big DGA data (18 MB)
7. https://github.com/nickwallen/botnet-dga-classifier On R. 6 years old :(
8. https://github.com/Daniellee1990/Language-Model-based-Detection-Approach-of-Algorithmically-Generated-Malicious-Domain-Names old methods :(


# Initial Approach
I'm selecting one of the existed solutions for the start, based on these criteria: 
- it should be "good enough" and "near state-of-art" (up-to-date)
- it should be faster to start, not complex for implementation

A good candidate is the 4th solution based on a pretrained model.

## Pro & Cons
### Pro
- It takes several hours to get the first result.
- There are many points of improvement including many pretrained models.
- Fast training, which allows the broad hyperparameter tunning.

### Cons
- No source code to start right away (but there is no good code base for other candidates also)


## Open Questions with current solution and ways to experiments
1. word2vec vs. ELMO and BERT. Is there any "long text' correlation between "words" inside dn string? If no such correlation, then we don't need the long text context hence word2vec would work just fine.
1. Splitting DN to words. Does it really help? Verify this hypothesis. 
1. Word Embedding. The BPE encoding can be better. Verify with the FB FastText model (and BERT model).
1. Best Pretrained Model. The BERT or custom ELMO (flair pretrained) can be better. Verify with the pretrained ELMO, BERT and, maybe, train a custom ELMO model.
1. More complex trained classification layers. The paper didn't mention any hyperparameter tuning. If the proposed model trained fast, it makes sense to play with hyperparameters: a number of layers-units; normalization; dropout; batch size; learning rate progress; etc.
1. Classification objective. Do we really need the multilabel classification while business requires only the binary classification? Does multilabel classification improve the binary results?
1. Symmetrical Loss function. Are FPs and TPs important at the same rate? 
1. TLD. Is TLD helpfull? 

# Data Preparations
* download correct DNs:
  1. http://s3.amazonaws.com/alexa-static/top-1m.csv.zip
* download malignant DNs (clone these projects):
  1. https://github.com/baderj/domain_generation_algorithms
  1. https://github.com/liorsidi/Adversarial-DGA-Datasets
  
* download pretrained models (see code below): 
  1. word embedding, defined by the pretrained model. GloVe from [spaCy](https://spacy.io/models/en) : md model: 685k keys, 20k unique vectors (300 dimensions) or lg model: 685k keys, 685k unique vectors (300 dimensions); original GloVe. We will create an input data of fixed size (in number of words).
  2. BPE embedding: [flair](https://github.com/flairNLP/flair), [FastText](https://fasttext.cc/docs/en/python-module.html#text-classification-model); code examples: https://github.com/facebookresearch/fastText/tree/master/python/doc/examples, https://fasttext.cc/docs/en/english-vectors.html; models: https://fasttext.cc/docs/en/english-vectors.html. 
- Input Data: we convert the combined benign and malignant DNs into embedding from pretrained models
- Output Data: DGA code (multiclass probability) or Good/Bad (binary).

## Input raw data
Download http://s3.amazonaws.com/alexa-static/top-1m.csv.zip - 500K benign DNs.

Clone https://github.com/baderj/domain_generation_algorithms project with malignant DNs.

In [90]:
def read_data(file_name):
    ret = open(file_name, encoding='utf-8').read().splitlines()
    print(f'Load {len(ret):,} from file "{file_name}"')
    return ret
    
def remove_TLD(dns):
    return [el.split('.')[0] for el in dns]

file_name = 'data/Alex_top-0.5M.csv'
good_dn = read_data(file_name)

print(len(good_dn), good_dn[-5:])
good_dn = [el.split(',')[1] for el in good_dn]
print(len(good_dn), good_dn[-5:])

good_dn = remove_TLD(good_dn)
print(len(good_dn), good_dn[-5:])

good_dn = [{'dn': el, 'label': 'good'} for el in good_dn]
print(len(good_dn), good_dn[-5:])


Load 525,215 from file "data/Alex_top-0.5M.csv"
525215 ['525211,webtrics.ch', '525212,weshfee.com', '525213,yosofunny.com', '525214,yougogirlz.com', '525215,zehabesha.com']
525215 ['webtrics.ch', 'weshfee.com', 'yosofunny.com', 'yougogirlz.com', 'zehabesha.com']
525215 ['webtrics', 'weshfee', 'yosofunny', 'yougogirlz', 'zehabesha']
525215 [{'dn': 'webtrics', 'label': 'good'}, {'dn': 'weshfee', 'label': 'good'}, {'dn': 'yosofunny', 'label': 'good'}, {'dn': 'yougogirlz', 'label': 'good'}, {'dn': 'zehabesha', 'label': 'good'}]


In [92]:
# clone https://github.com/baderj/domain_generation_algorithms project then:

dr = 'domain_generation_algorithms/*/example_domains.txt'

import glob

# root_dir needs a trailing slash (i.e. /root/dir/)
files = glob.iglob(dr, recursive=True)
len(list(files))

31

In [93]:
# files = ['data/banjori.txt', 'data/bazarbackdoor.txt', 'data/chinad.txt', 'data/corebot.txt']
bad_dn = []
for f in files:
    label = f.split('/')[1].split('.')[0]
    bad_dn += [{'dn': el, 'label': label} for el in remove_TLD(read_data(f))]
    
    
print(len(bad_dn), bad_dn[-5:])


0 []


## splitting DN to words (Optional)
Calculate the max word of the test data, we can use these max for defining the fixed size of the input windows, if we use the word embeddings

In [32]:
import wordninja

bad_len_max, good_len_max = max([len(wordninja.split(el['dn'])) for el in bad_dn]), max([len(wordninja.split(el['dn'])) for el in good_dn])
bad_len_max, good_len_max # (19, 33)

# we can use these max for defining the fixed size of the input windows, if we use the word embeddings.

(19, 33)

## Embeddings

### GloVe

In [26]:
import spacy
from spacy.tokenizer import Tokenizer


class Model_vectors():
    def __init__(self, model_size='M'):
        models = {'S': 'en_core_web_sm', 'M': 'en_core_web_md', 'L': 'en_core_web_lg', 'XL': 'en_vectors_web_lg'}
        self.model = model_size
        self.nlp = spacy.load(models[self.model], disable=["tagger", "parser", 'ner'])
        self._vocab = self.nlp.vocab
        self._tokenizer = Tokenizer(self.nlp.vocab)
        self.get_vecs = self._get_spacy_vecs
        self.get_vec = self._get_spacy_vec
        print(f'Initialized "{models[self.model]}" model.')

    def is_oov(self, word):
        return False if word in self._vocab else True

    def remove_oov(self, text):
        """
        It takes text and remove (or replace) words that are out-of-vocabulary.
        Note: it splits text only with spaces.
        :param text: any text
        :return: text without oov words.
        """
        return (' '.join(str(w) for w in text.split(' ') if not self.is_oov(str(w)))).strip()

    def _get_spacy_vecs(self, texts):
        return [(doc.text, doc.vector) for doc in self._tokenizer.pipe(texts, batch_size=1000)]

    def _get_spacy_vec(self, text):
        return text, self._tokenizer(text).vector

    def doc_without_stop_words(self, text):
        return self.nlp(self.text_without_stop_words(text))

    def text_without_stop_words(self, text):
        return ' '.join([str(t).lower() for t in self.nlp(text) if not t.is_stop and not t.is_punct and not t.is_oov])


In [27]:
model_vectors = Model_vectors('M')

Initialized "en_core_web_md" model.


In [28]:
model_vectors.get_vecs(['text', 'title'])

[('text', array([ 0.037103 , -0.31259  , -0.17857  ,  0.30001  ,  0.078154 ,
          0.17958  ,  0.12048  , -0.11879  , -0.20601  ,  1.2849   ,
         -0.20409  ,  0.80613  ,  0.34344  , -0.19191  , -0.084511 ,
          0.17339  ,  0.042483 ,  2.0282   , -0.16278  , -0.60306  ,
         -0.53766  ,  0.35711  ,  0.22882  ,  0.1171   ,  0.42983  ,
          0.16165  ,  0.407    ,  0.036476 ,  0.52636  , -0.13524  ,
         -0.016897 ,  0.029259 , -0.079115 , -0.32305  ,  0.052255 ,
         -0.3617   , -0.18355  , -0.34717  , -0.3691   ,  0.16881  ,
          0.21018  , -0.38376  , -0.096909 , -0.36296  , -0.37319  ,
          0.0021152,  0.32512  ,  0.063977 ,  0.36249  , -0.26935  ,
         -0.59341  , -0.13625  ,  0.016425 , -0.2474   , -0.07498  ,
          0.034708 , -0.01476  , -0.11648  ,  0.25559  , -0.35002  ,
         -0.52707  ,  0.21221  ,  0.062456 ,  0.26184  ,  0.53149  ,
          0.34957  , -0.22692  ,  0.44076  ,  0.4438   ,  0.6335   ,
         -0.049757 , -0.08

### ELMO embedding

In [41]:
from flair.data import Sentence
from flair.embeddings import WordEmbeddings, FlairEmbeddings, StackedEmbeddings, CharacterEmbeddings
from flair.embeddings import BertEmbeddings, ELMoEmbeddings, CharLMEmbeddings

In [34]:
glove_emb = WordEmbeddings('glove')

I0915 17:48:32.024530 21016 utils.py:422] loading Word2VecKeyedVectors object from C:\Users\leo_g\.flair\embeddings\glove.gensim
I0915 17:48:33.275043 21016 utils.py:461] loading vectors from C:\Users\leo_g\.flair\embeddings\glove.gensim.vectors.npy with mmap=None
I0915 17:48:33.492771 21016 utils.py:494] setting ignored attribute vectors_norm to None
I0915 17:48:33.493755 21016 utils.py:428] loaded C:\Users\leo_g\.flair\embeddings\glove.gensim


In [36]:
sent = Sentence('This is a system integration.')
glove_emb.embed(sent)
_ = [print(t, t.embedding.shape) for t in sent]

Token: 1 This torch.Size([100])
Token: 2 is torch.Size([100])
Token: 3 a torch.Size([100])
Token: 4 system torch.Size([100])
Token: 5 integration. torch.Size([100])


In [38]:
sent = Sentence(' '.join(wordninja.split('igjyestnessbiophysicalohax')))
glove_emb.embed(sent)
_ = [print(t, t.embedding.shape) for t in sent]

Token: 1 ig torch.Size([100])
Token: 2 j torch.Size([100])
Token: 3 yest torch.Size([100])
Token: 4 ness torch.Size([100])
Token: 5 bio torch.Size([100])
Token: 6 physical torch.Size([100])
Token: 7 oha torch.Size([100])
Token: 8 x torch.Size([100])


In [46]:
sent = Sentence(' '.join(wordninja.split('igjyestnessbiophysicalohax')))
glove_emb.embed(sent)
sent[0].embedding.shape, sent[0].embedding

(torch.Size([100]),
 tensor([ 0.3876,  0.1173, -0.1786, -0.5518, -1.3703, -1.7363, -0.4303,  0.2215,
         -0.1897, -0.2305, -0.2904, -0.9843, -0.6379,  0.8521, -0.4181,  0.7708,
          0.1105, -0.6815, -0.1501,  0.0790,  1.1104, -0.1524, -0.0356, -0.5496,
          0.6761, -1.1164,  0.3294,  0.3358, -0.1761,  0.3174,  0.2326, -0.1749,
         -0.1531, -0.2616, -0.1671, -0.1590,  0.9442,  0.3235, -0.1124, -0.2102,
         -0.2055,  0.2575, -0.3094,  0.0053, -0.3699,  0.0685,  0.1430, -0.0455,
         -0.2702,  0.4782,  0.3559, -0.3921, -0.2094,  0.0851, -1.3836,  0.4357,
         -0.4373,  0.6922, -0.2560,  0.4856,  0.7953, -0.4402, -0.6072,  0.8645,
          0.6602, -0.1795, -0.2840,  0.6075,  1.1740,  0.0348,  0.0285,  0.1810,
         -0.8015, -1.0579, -0.3046, -0.5177, -0.3481, -1.2305, -0.9520,  0.0256,
          0.6460,  0.2348, -0.8135, -0.4275, -0.2730, -0.3802,  0.5882,  0.0488,
          0.7306, -0.4720,  0.7308, -0.0583,  0.1116,  0.0297,  1.1622,  0.2942,
        

In [43]:
flair_emb = FlairEmbeddings('d:/Program Files/flair/embeddings/lm-news-english-forward-v0.2rc.pt')

In [45]:
sent = Sentence(' '.join(wordninja.split('igjyestnessbiophysicalohax')))
flair_emb.embed(sent)
sent[0].embedding.shape, sent[0].embedding

(torch.Size([2048]),
 tensor([ 1.4106e-05,  3.7738e-02, -5.4646e-02,  ..., -1.2723e-04,
          1.2393e-01, -2.3951e-03], device='cuda:0'))

In [48]:
bert_base_emb = BertEmbeddings('bert-base-uncased')


I0915 19:33:33.334479 21016 tokenization_utils.py:380] loading file https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt from cache at C:\Users\leo_g\.cache\torch\transformers\26bc1ad6c0ac742e9b52263248f6d0f00068293b33709fae12320c0e35ccfbbb.542ce4285a40d23a559526243235df47c5f75c197f04f37d1a0c124c32c9a084
I0915 19:33:33.884137 21016 configuration_utils.py:157] loading configuration file https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-config.json from cache at C:\Users\leo_g\.cache\torch\transformers\4dad0251492946e18ac39290fcfe91b89d370fee250efe9521476438fe8ca185.7156163d5fdc189c3016baca0775ffce230789d7fa2a42ef516483e4ca884517
I0915 19:33:33.885138 21016 configuration_utils.py:174] Model config {
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "finetuning_task": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072

In [49]:
sent = Sentence(' '.join(wordninja.split('igjyestnessbiophysicalohax')))
bert_base_emb.embed(sent)
sent[0].embedding.shape, sent[0].embedding

(torch.Size([3072]),
 tensor([-0.4745,  0.2342,  0.8207,  ..., -0.2806,  0.5300,  0.3944],
        device='cuda:0'))

# flair models

## classification example

In [5]:
from flair.data import Corpus
from flair.datasets import TREC_6
from flair.embeddings import WordEmbeddings, FlairEmbeddings, DocumentRNNEmbeddings
from flair.models import TextClassifier
from flair.trainers import ModelTrainer

In [None]:
# # 1. get the corpus
# corpus: Corpus = TREC_6()

In [5]:
ds = corpus.test
dir(ds)

['__add__',
 '__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getitem__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__len__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_parse_line_to_sentence',
 'in_memory',
 'is_in_memory',
 'label_prefix',
 'max_chars_per_doc',
 'max_tokens_per_doc',
 'path_to_file',
 'sentences',
 'tokenizer',
 'total_sentence_count']

In [6]:
# 2. create the label dictionary
label_dict = corpus.make_label_dictionary()

2020-09-17 08:35:48,357 Computing label dictionary. Progress:


100%|██████████████████████████████████████████████████████████████████████████| 4907/4907 [00:00<00:00, 213334.54it/s]

2020-09-17 08:35:48,394 [b'ENTY', b'DESC', b'NUM', b'ABBR', b'LOC', b'HUM']





In [7]:
# len(label_dict), type(label_dict), dir(label_dict)
from torch.optim.adam import Adam

from flair.data import Corpus
from flair.datasets import TREC_6
from flair.embeddings import TransformerDocumentEmbeddings
from flair.models import TextClassifier
from flair.trainers import ModelTrainer


AttributeError: module 'flair' has no attribute 'nn'

In [9]:
# 3. make a list of word embeddings
word_embeddings = [WordEmbeddings('glove')]

I0917 08:37:05.399397 11616 utils.py:422] loading Word2VecKeyedVectors object from C:\Users\leo_g\.flair\embeddings\glove.gensim
I0917 08:37:06.428536 11616 utils.py:461] loading vectors from C:\Users\leo_g\.flair\embeddings\glove.gensim.vectors.npy with mmap=None
I0917 08:37:06.633977 11616 utils.py:494] setting ignored attribute vectors_norm to None
I0917 08:37:06.634947 11616 utils.py:428] loaded C:\Users\leo_g\.flair\embeddings\glove.gensim


In [15]:
# len(word_embeddings), type(word_embeddings), dir(word_embeddings)

In [11]:
# 4. initialize document embedding by passing list of word embeddings
# Can choose between many RNN types (GRU by default, to change use rnn_type parameter)
document_embeddings = DocumentRNNEmbeddings(word_embeddings, hidden_size=256)

In [14]:
# type(document_embeddings), dir(document_embeddings)

In [17]:
# 5. create the text classifier
classifier = TextClassifier(document_embeddings, label_dictionary=label_dict)

In [18]:
# 6. initialize the text classifier trainer
trainer = ModelTrainer(classifier, corpus)

In [19]:
# 7. start the training
trainer.train('resources/taggers/trec',
              learning_rate=0.1,
              mini_batch_size=32,
              anneal_factor=0.5,
              patience=5,
              max_epochs=150)

2020-09-17 08:40:57,074 ----------------------------------------------------------------------------------------------------
2020-09-17 08:40:57,076 Model: "TextClassifier(
  (document_embeddings): DocumentRNNEmbeddings(
    (embeddings): StackedEmbeddings(
      (list_embedding_0): WordEmbeddings('glove')
    )
    (word_reprojection_map): Linear(in_features=100, out_features=100, bias=True)
    (rnn): GRU(100, 256, batch_first=True)
    (dropout): Dropout(p=0.5, inplace=False)
  )
  (decoder): Linear(in_features=256, out_features=6, bias=True)
  (loss_function): CrossEntropyLoss()
)"
2020-09-17 08:40:57,077 ----------------------------------------------------------------------------------------------------
2020-09-17 08:40:57,094 Corpus: "Corpus: 4907 train + 545 dev + 500 test sentences"
2020-09-17 08:40:57,096 ----------------------------------------------------------------------------------------------------
2020-09-17 08:40:57,097 Parameters:
2020-09-17 08:40:57,098  - learning_r

2020-09-17 08:42:00,575 epoch 5 - iter 120/154 - loss 1.21609633 - samples/sec: 504.53
2020-09-17 08:42:01,496 epoch 5 - iter 135/154 - loss 1.21323682 - samples/sec: 542.16
2020-09-17 08:42:02,440 epoch 5 - iter 150/154 - loss 1.20384763 - samples/sec: 527.20
2020-09-17 08:42:02,625 ----------------------------------------------------------------------------------------------------
2020-09-17 08:42:02,626 EPOCH 5 done: loss 1.2023 - lr 0.1000
2020-09-17 08:42:03,387 DEV : loss 1.0862823724746704 - score 0.5817
2020-09-17 08:42:03,401 BAD EPOCHS (no improvement): 0
2020-09-17 08:42:07,851 ----------------------------------------------------------------------------------------------------
2020-09-17 08:42:07,914 epoch 6 - iter 0/154 - loss 1.17252481 - samples/sec: 7868.99
2020-09-17 08:42:08,923 epoch 6 - iter 15/154 - loss 1.07886096 - samples/sec: 491.81
2020-09-17 08:42:09,918 epoch 6 - iter 30/154 - loss 1.07490783 - samples/sec: 499.89
2020-09-17 08:42:10,947 epoch 6 - iter 45/154

2020-09-17 08:43:17,528 epoch 11 - iter 105/154 - loss 0.80230314 - samples/sec: 542.18
2020-09-17 08:43:18,528 epoch 11 - iter 120/154 - loss 0.80666312 - samples/sec: 498.53
2020-09-17 08:43:19,525 epoch 11 - iter 135/154 - loss 0.79973664 - samples/sec: 499.45
2020-09-17 08:43:20,575 epoch 11 - iter 150/154 - loss 0.79486089 - samples/sec: 480.01
2020-09-17 08:43:20,756 ----------------------------------------------------------------------------------------------------
2020-09-17 08:43:20,758 EPOCH 11 done: loss 0.7944 - lr 0.1000
2020-09-17 08:43:21,599 DEV : loss 1.0040833950042725 - score 0.6312
2020-09-17 08:43:21,613 BAD EPOCHS (no improvement): 2
2020-09-17 08:43:21,616 ----------------------------------------------------------------------------------------------------
2020-09-17 08:43:21,684 epoch 12 - iter 0/154 - loss 1.42439318 - samples/sec: 7382.20
2020-09-17 08:43:22,616 epoch 12 - iter 15/154 - loss 0.78031881 - samples/sec: 534.45
2020-09-17 08:43:23,537 epoch 12 - it

2020-09-17 08:44:33,036 epoch 17 - iter 75/154 - loss 0.64839264 - samples/sec: 545.74
2020-09-17 08:44:33,941 epoch 17 - iter 90/154 - loss 0.64458025 - samples/sec: 551.78
2020-09-17 08:44:34,951 epoch 17 - iter 105/154 - loss 0.64538896 - samples/sec: 492.52
2020-09-17 08:44:35,933 epoch 17 - iter 120/154 - loss 0.63559149 - samples/sec: 512.01
2020-09-17 08:44:36,927 epoch 17 - iter 135/154 - loss 0.63272227 - samples/sec: 504.44
2020-09-17 08:44:37,859 epoch 17 - iter 150/154 - loss 0.63573848 - samples/sec: 538.08
2020-09-17 08:44:38,048 ----------------------------------------------------------------------------------------------------
2020-09-17 08:44:38,049 EPOCH 17 done: loss 0.6340 - lr 0.1000
2020-09-17 08:44:38,944 DEV : loss 0.5324942469596863 - score 0.7963
2020-09-17 08:44:38,958 BAD EPOCHS (no improvement): 0
2020-09-17 08:44:43,235 ----------------------------------------------------------------------------------------------------
2020-09-17 08:44:43,325 epoch 18 - it

2020-09-17 08:45:43,780 epoch 23 - iter 45/154 - loss 0.53638160 - samples/sec: 491.58
2020-09-17 08:45:44,728 epoch 23 - iter 60/154 - loss 0.53681191 - samples/sec: 525.70
2020-09-17 08:45:45,714 epoch 23 - iter 75/154 - loss 0.53815041 - samples/sec: 504.57
2020-09-17 08:45:46,718 epoch 23 - iter 90/154 - loss 0.54550043 - samples/sec: 503.30
2020-09-17 08:45:47,680 epoch 23 - iter 105/154 - loss 0.54051415 - samples/sec: 520.20
2020-09-17 08:45:48,629 epoch 23 - iter 120/154 - loss 0.53782664 - samples/sec: 525.90
2020-09-17 08:45:49,584 epoch 23 - iter 135/154 - loss 0.54638279 - samples/sec: 521.81
2020-09-17 08:45:50,559 epoch 23 - iter 150/154 - loss 0.54693036 - samples/sec: 510.33
2020-09-17 08:45:50,750 ----------------------------------------------------------------------------------------------------
2020-09-17 08:45:50,751 EPOCH 23 done: loss 0.5485 - lr 0.1000
2020-09-17 08:45:51,588 DEV : loss 0.6499155759811401 - score 0.7523
2020-09-17 08:45:51,607 BAD EPOCHS (no impr

2020-09-17 08:46:56,641 epoch 29 - iter 15/154 - loss 0.44022097 - samples/sec: 537.52
2020-09-17 08:46:57,561 epoch 29 - iter 30/154 - loss 0.47939367 - samples/sec: 541.17
2020-09-17 08:46:58,498 epoch 29 - iter 45/154 - loss 0.48069310 - samples/sec: 536.69
2020-09-17 08:46:59,394 epoch 29 - iter 60/154 - loss 0.49674024 - samples/sec: 555.50
2020-09-17 08:47:00,391 epoch 29 - iter 75/154 - loss 0.50233355 - samples/sec: 496.94
2020-09-17 08:47:01,304 epoch 29 - iter 90/154 - loss 0.49602299 - samples/sec: 545.64
2020-09-17 08:47:02,224 epoch 29 - iter 105/154 - loss 0.48464133 - samples/sec: 540.71
2020-09-17 08:47:03,151 epoch 29 - iter 120/154 - loss 0.48655941 - samples/sec: 542.55
2020-09-17 08:47:04,095 epoch 29 - iter 135/154 - loss 0.48840677 - samples/sec: 526.18
2020-09-17 08:47:05,051 epoch 29 - iter 150/154 - loss 0.47931675 - samples/sec: 520.01
2020-09-17 08:47:05,241 ----------------------------------------------------------------------------------------------------
2

2020-09-17 08:48:05,184 epoch 35 - iter 0/154 - loss 0.57633495 - samples/sec: 6316.13
2020-09-17 08:48:06,062 epoch 35 - iter 15/154 - loss 0.43384211 - samples/sec: 570.25
2020-09-17 08:48:06,976 epoch 35 - iter 30/154 - loss 0.47151881 - samples/sec: 543.61
2020-09-17 08:48:07,937 epoch 35 - iter 45/154 - loss 0.46743700 - samples/sec: 515.09
2020-09-17 08:48:08,906 epoch 35 - iter 60/154 - loss 0.45566785 - samples/sec: 514.59
2020-09-17 08:48:09,872 epoch 35 - iter 75/154 - loss 0.45172675 - samples/sec: 512.00
2020-09-17 08:48:10,832 epoch 35 - iter 90/154 - loss 0.44153968 - samples/sec: 519.00
2020-09-17 08:48:11,764 epoch 35 - iter 105/154 - loss 0.43877675 - samples/sec: 537.88
2020-09-17 08:48:12,747 epoch 35 - iter 120/154 - loss 0.43347406 - samples/sec: 502.98
2020-09-17 08:48:13,684 epoch 35 - iter 135/154 - loss 0.43915027 - samples/sec: 530.69
2020-09-17 08:48:14,662 epoch 35 - iter 150/154 - loss 0.43874495 - samples/sec: 505.30
2020-09-17 08:48:14,871 ---------------

2020-09-17 08:49:23,766 BAD EPOCHS (no improvement): 4
2020-09-17 08:49:23,769 ----------------------------------------------------------------------------------------------------
2020-09-17 08:49:23,843 epoch 41 - iter 0/154 - loss 0.38837168 - samples/sec: 6667.77
2020-09-17 08:49:24,781 epoch 41 - iter 15/154 - loss 0.36806412 - samples/sec: 533.92
2020-09-17 08:49:25,698 epoch 41 - iter 30/154 - loss 0.36472470 - samples/sec: 540.83
2020-09-17 08:49:26,687 epoch 41 - iter 45/154 - loss 0.39301705 - samples/sec: 499.87
2020-09-17 08:49:27,606 epoch 41 - iter 60/154 - loss 0.39009325 - samples/sec: 542.95
2020-09-17 08:49:28,670 epoch 41 - iter 75/154 - loss 0.38945091 - samples/sec: 466.91
2020-09-17 08:49:29,589 epoch 41 - iter 90/154 - loss 0.39189105 - samples/sec: 551.67
2020-09-17 08:49:30,498 epoch 41 - iter 105/154 - loss 0.39144634 - samples/sec: 546.68
2020-09-17 08:49:31,484 epoch 41 - iter 120/154 - loss 0.39374552 - samples/sec: 504.71
2020-09-17 08:49:32,460 epoch 41 - 

2020-09-17 08:50:30,882 EPOCH 46 done: loss 0.3577 - lr 0.1000
2020-09-17 08:50:31,670 DEV : loss 0.39096799492836 - score 0.8734
2020-09-17 08:50:31,683 BAD EPOCHS (no improvement): 0
2020-09-17 08:50:36,800 ----------------------------------------------------------------------------------------------------
2020-09-17 08:50:36,859 epoch 47 - iter 0/154 - loss 0.46727926 - samples/sec: 8269.90
2020-09-17 08:50:37,907 epoch 47 - iter 15/154 - loss 0.37793587 - samples/sec: 471.51
2020-09-17 08:50:38,873 epoch 47 - iter 30/154 - loss 0.35528016 - samples/sec: 515.01
2020-09-17 08:50:39,792 epoch 47 - iter 45/154 - loss 0.35359808 - samples/sec: 541.30
2020-09-17 08:50:40,837 epoch 47 - iter 60/154 - loss 0.35843113 - samples/sec: 471.95
2020-09-17 08:50:41,764 epoch 47 - iter 75/154 - loss 0.35125349 - samples/sec: 537.01
2020-09-17 08:50:42,718 epoch 47 - iter 90/154 - loss 0.34349647 - samples/sec: 519.32
2020-09-17 08:50:43,657 epoch 47 - iter 105/154 - loss 0.34915640 - samples/sec: 

2020-09-17 08:51:46,528 ----------------------------------------------------------------------------------------------------
2020-09-17 08:51:46,530 EPOCH 52 done: loss 0.3340 - lr 0.1000
2020-09-17 08:51:47,314 DEV : loss 0.3762478828430176 - score 0.8771
2020-09-17 08:51:47,330 BAD EPOCHS (no improvement): 3
2020-09-17 08:51:47,333 ----------------------------------------------------------------------------------------------------
2020-09-17 08:51:47,419 epoch 53 - iter 0/154 - loss 0.48142138 - samples/sec: 5869.72
2020-09-17 08:51:48,400 epoch 53 - iter 15/154 - loss 0.27760891 - samples/sec: 515.94
2020-09-17 08:51:49,307 epoch 53 - iter 30/154 - loss 0.30837768 - samples/sec: 552.99
2020-09-17 08:51:50,258 epoch 53 - iter 45/154 - loss 0.32087717 - samples/sec: 525.13
2020-09-17 08:51:51,160 epoch 53 - iter 60/154 - loss 0.34035765 - samples/sec: 554.90
2020-09-17 08:51:52,137 epoch 53 - iter 75/154 - loss 0.33086195 - samples/sec: 509.01
2020-09-17 08:51:53,067 epoch 53 - iter 9

2020-09-17 08:52:50,439 epoch 58 - iter 135/154 - loss 0.28370266 - samples/sec: 355.67
2020-09-17 08:52:51,853 epoch 58 - iter 150/154 - loss 0.27889721 - samples/sec: 349.44
2020-09-17 08:52:52,055 ----------------------------------------------------------------------------------------------------
2020-09-17 08:52:52,056 EPOCH 58 done: loss 0.2840 - lr 0.0500
2020-09-17 08:52:52,945 DEV : loss 0.3527340888977051 - score 0.8807
2020-09-17 08:52:52,960 BAD EPOCHS (no improvement): 3
2020-09-17 08:52:52,962 ----------------------------------------------------------------------------------------------------
2020-09-17 08:52:53,017 epoch 59 - iter 0/154 - loss 0.20902500 - samples/sec: 9058.56
2020-09-17 08:52:54,010 epoch 59 - iter 15/154 - loss 0.27836555 - samples/sec: 499.45
2020-09-17 08:52:54,974 epoch 59 - iter 30/154 - loss 0.24628820 - samples/sec: 514.47
2020-09-17 08:52:55,934 epoch 59 - iter 45/154 - loss 0.26264143 - samples/sec: 519.48
2020-09-17 08:52:56,884 epoch 59 - iter

2020-09-17 08:54:11,803 epoch 64 - iter 105/154 - loss 0.24191262 - samples/sec: 291.70
2020-09-17 08:54:13,661 epoch 64 - iter 120/154 - loss 0.23917348 - samples/sec: 268.43
2020-09-17 08:54:15,365 epoch 64 - iter 135/154 - loss 0.24495526 - samples/sec: 295.80
2020-09-17 08:54:16,960 epoch 64 - iter 150/154 - loss 0.24768724 - samples/sec: 318.38
2020-09-17 08:54:17,242 ----------------------------------------------------------------------------------------------------
2020-09-17 08:54:17,244 EPOCH 64 done: loss 0.2483 - lr 0.0500
2020-09-17 08:54:18,296 DEV : loss 0.33853450417518616 - score 0.8899
2020-09-17 08:54:18,328 BAD EPOCHS (no improvement): 4
2020-09-17 08:54:18,330 ----------------------------------------------------------------------------------------------------
2020-09-17 08:54:18,427 epoch 65 - iter 0/154 - loss 0.38951007 - samples/sec: 5052.57
2020-09-17 08:54:19,504 epoch 65 - iter 15/154 - loss 0.30967177 - samples/sec: 467.02
2020-09-17 08:54:20,585 epoch 65 - i

2020-09-17 08:55:21,892 epoch 70 - iter 75/154 - loss 0.22276114 - samples/sec: 535.14
2020-09-17 08:55:22,806 epoch 70 - iter 90/154 - loss 0.21913063 - samples/sec: 544.94
2020-09-17 08:55:23,752 epoch 70 - iter 105/154 - loss 0.22461381 - samples/sec: 524.91
2020-09-17 08:55:24,667 epoch 70 - iter 120/154 - loss 0.22929669 - samples/sec: 542.47
2020-09-17 08:55:25,717 epoch 70 - iter 135/154 - loss 0.23195596 - samples/sec: 473.11
2020-09-17 08:55:26,640 epoch 70 - iter 150/154 - loss 0.22737986 - samples/sec: 538.45
2020-09-17 08:55:26,816 ----------------------------------------------------------------------------------------------------
2020-09-17 08:55:26,817 EPOCH 70 done: loss 0.2271 - lr 0.0250
2020-09-17 08:55:27,580 DEV : loss 0.35208263993263245 - score 0.8899
2020-09-17 08:55:27,594 BAD EPOCHS (no improvement): 4
2020-09-17 08:55:27,596 ----------------------------------------------------------------------------------------------------
2020-09-17 08:55:27,660 epoch 71 - i

2020-09-17 08:56:24,294 epoch 76 - iter 45/154 - loss 0.21318417 - samples/sec: 515.87
2020-09-17 08:56:25,212 epoch 76 - iter 60/154 - loss 0.20195785 - samples/sec: 543.11
2020-09-17 08:56:26,254 epoch 76 - iter 75/154 - loss 0.20341362 - samples/sec: 475.71
2020-09-17 08:56:27,235 epoch 76 - iter 90/154 - loss 0.20074117 - samples/sec: 512.02
2020-09-17 08:56:28,271 epoch 76 - iter 105/154 - loss 0.20272644 - samples/sec: 478.77
2020-09-17 08:56:29,225 epoch 76 - iter 120/154 - loss 0.20762871 - samples/sec: 522.72
2020-09-17 08:56:30,181 epoch 76 - iter 135/154 - loss 0.21053792 - samples/sec: 517.70
2020-09-17 08:56:31,171 epoch 76 - iter 150/154 - loss 0.21257777 - samples/sec: 504.61
2020-09-17 08:56:31,426 ----------------------------------------------------------------------------------------------------
2020-09-17 08:56:31,427 EPOCH 76 done: loss 0.2111 - lr 0.0125
2020-09-17 08:56:32,202 DEV : loss 0.35670655965805054 - score 0.8881
2020-09-17 08:56:32,215 BAD EPOCHS (no imp

2020-09-17 08:57:24,812 epoch 82 - iter 0/154 - loss 0.11023014 - samples/sec: 8217.71
2020-09-17 08:57:25,744 epoch 82 - iter 15/154 - loss 0.23517658 - samples/sec: 532.31
2020-09-17 08:57:26,748 epoch 82 - iter 30/154 - loss 0.24410558 - samples/sec: 493.62
2020-09-17 08:57:27,711 epoch 82 - iter 45/154 - loss 0.23347459 - samples/sec: 515.57
2020-09-17 08:57:28,733 epoch 82 - iter 60/154 - loss 0.23880031 - samples/sec: 485.10
2020-09-17 08:57:29,645 epoch 82 - iter 75/154 - loss 0.23812296 - samples/sec: 546.59
2020-09-17 08:57:30,545 epoch 82 - iter 90/154 - loss 0.22962674 - samples/sec: 551.09
2020-09-17 08:57:31,489 epoch 82 - iter 105/154 - loss 0.22737497 - samples/sec: 531.58
2020-09-17 08:57:32,449 epoch 82 - iter 120/154 - loss 0.21517936 - samples/sec: 518.33
2020-09-17 08:57:33,466 epoch 82 - iter 135/154 - loss 0.21439298 - samples/sec: 492.62
2020-09-17 08:57:34,367 epoch 82 - iter 150/154 - loss 0.21359836 - samples/sec: 552.38
2020-09-17 08:57:34,556 ---------------

2020-09-17 08:58:28,105 DEV : loss 0.35061216354370117 - score 0.8936
2020-09-17 08:58:28,119 BAD EPOCHS (no improvement): 3
2020-09-17 08:58:28,121 ----------------------------------------------------------------------------------------------------
2020-09-17 08:58:28,189 epoch 88 - iter 0/154 - loss 0.17264447 - samples/sec: 7272.55
2020-09-17 08:58:29,177 epoch 88 - iter 15/154 - loss 0.19665443 - samples/sec: 504.70
2020-09-17 08:58:30,073 epoch 88 - iter 30/154 - loss 0.19054690 - samples/sec: 556.37
2020-09-17 08:58:31,044 epoch 88 - iter 45/154 - loss 0.19325137 - samples/sec: 511.18
2020-09-17 08:58:31,950 epoch 88 - iter 60/154 - loss 0.19255885 - samples/sec: 550.49
2020-09-17 08:58:32,891 epoch 88 - iter 75/154 - loss 0.19657003 - samples/sec: 527.56
2020-09-17 08:58:33,838 epoch 88 - iter 90/154 - loss 0.20072648 - samples/sec: 525.20
2020-09-17 08:58:34,747 epoch 88 - iter 105/154 - loss 0.20646886 - samples/sec: 549.49
2020-09-17 08:58:35,693 epoch 88 - iter 120/154 - los

2020-09-17 08:59:30,078 ----------------------------------------------------------------------------------------------------
2020-09-17 08:59:30,079 EPOCH 93 done: loss 0.2036 - lr 0.0016
2020-09-17 08:59:30,850 DEV : loss 0.3528648614883423 - score 0.8936
2020-09-17 08:59:30,863 BAD EPOCHS (no improvement): 3
2020-09-17 08:59:30,866 ----------------------------------------------------------------------------------------------------
2020-09-17 08:59:30,936 epoch 94 - iter 0/154 - loss 0.20543510 - samples/sec: 7056.17
2020-09-17 08:59:31,920 epoch 94 - iter 15/154 - loss 0.18314623 - samples/sec: 506.37
2020-09-17 08:59:32,868 epoch 94 - iter 30/154 - loss 0.17561469 - samples/sec: 524.60
2020-09-17 08:59:33,805 epoch 94 - iter 45/154 - loss 0.18612487 - samples/sec: 530.38
2020-09-17 08:59:34,732 epoch 94 - iter 60/154 - loss 0.18909314 - samples/sec: 535.32
2020-09-17 08:59:35,679 epoch 94 - iter 75/154 - loss 0.19540384 - samples/sec: 523.97
2020-09-17 08:59:36,672 epoch 94 - iter 9

2020-09-17 09:00:34,573 epoch 99 - iter 135/154 - loss 0.19713857 - samples/sec: 509.07
2020-09-17 09:00:35,512 epoch 99 - iter 150/154 - loss 0.19641551 - samples/sec: 535.38
2020-09-17 09:00:35,705 ----------------------------------------------------------------------------------------------------
2020-09-17 09:00:35,706 EPOCH 99 done: loss 0.1975 - lr 0.0008
2020-09-17 09:00:36,527 DEV : loss 0.35381194949150085 - score 0.8954
2020-09-17 09:00:36,541 BAD EPOCHS (no improvement): 3
2020-09-17 09:00:36,542 ----------------------------------------------------------------------------------------------------
2020-09-17 09:00:36,611 epoch 100 - iter 0/154 - loss 0.18002027 - samples/sec: 7165.18
2020-09-17 09:00:37,543 epoch 100 - iter 15/154 - loss 0.22265865 - samples/sec: 535.60
2020-09-17 09:00:38,511 epoch 100 - iter 30/154 - loss 0.19932022 - samples/sec: 515.49
2020-09-17 09:00:39,437 epoch 100 - iter 45/154 - loss 0.19427421 - samples/sec: 541.71
2020-09-17 09:00:40,367 epoch 100 

2020-09-17 09:01:36,079 epoch 105 - iter 90/154 - loss 0.19189787 - samples/sec: 537.81
2020-09-17 09:01:37,046 epoch 105 - iter 105/154 - loss 0.19658792 - samples/sec: 521.93
2020-09-17 09:01:37,951 epoch 105 - iter 120/154 - loss 0.19845687 - samples/sec: 553.65
2020-09-17 09:01:38,932 epoch 105 - iter 135/154 - loss 0.19771562 - samples/sec: 505.76
2020-09-17 09:01:39,865 epoch 105 - iter 150/154 - loss 0.19875597 - samples/sec: 535.06
2020-09-17 09:01:40,045 ----------------------------------------------------------------------------------------------------
2020-09-17 09:01:40,047 EPOCH 105 done: loss 0.1999 - lr 0.0004
2020-09-17 09:01:40,881 DEV : loss 0.3521348536014557 - score 0.8954
2020-09-17 09:01:40,904 BAD EPOCHS (no improvement): 3
2020-09-17 09:01:40,906 ----------------------------------------------------------------------------------------------------
2020-09-17 09:01:40,966 epoch 106 - iter 0/154 - loss 0.29063112 - samples/sec: 8127.97
2020-09-17 09:01:41,887 epoch 

2020-09-17 09:02:37,509 epoch 111 - iter 45/154 - loss 0.18844842 - samples/sec: 537.79
2020-09-17 09:02:38,520 epoch 111 - iter 60/154 - loss 0.19923696 - samples/sec: 494.89
2020-09-17 09:02:39,437 epoch 111 - iter 75/154 - loss 0.19439198 - samples/sec: 541.82
2020-09-17 09:02:40,380 epoch 111 - iter 90/154 - loss 0.20310201 - samples/sec: 526.67
2020-09-17 09:02:41,390 epoch 111 - iter 105/154 - loss 0.20613366 - samples/sec: 495.86
2020-09-17 09:02:42,345 epoch 111 - iter 120/154 - loss 0.20856594 - samples/sec: 525.44
2020-09-17 09:02:43,284 epoch 111 - iter 135/154 - loss 0.21149184 - samples/sec: 531.02
2020-09-17 09:02:44,214 epoch 111 - iter 150/154 - loss 0.20643389 - samples/sec: 536.66
2020-09-17 09:02:44,399 ----------------------------------------------------------------------------------------------------
2020-09-17 09:02:44,401 EPOCH 111 done: loss 0.2079 - lr 0.0002
2020-09-17 09:02:45,193 DEV : loss 0.3528764247894287 - score 0.8936
2020-09-17 09:02:45,209 BAD EPOCHS

{'test_score': 0.916,
 'dev_score_history': [0.411,
  0.3743,
  0.4294,
  0.4147,
  0.5817,
  0.5358,
  0.6422,
  0.6349,
  0.7193,
  0.7046,
  0.6312,
  0.7394,
  0.7394,
  0.7358,
  0.7284,
  0.7706,
  0.7963,
  0.8,
  0.7725,
  0.778,
  0.7945,
  0.789,
  0.7523,
  0.8312,
  0.8257,
  0.8404,
  0.8367,
  0.8294,
  0.8092,
  0.8202,
  0.844,
  0.8385,
  0.822,
  0.8294,
  0.8587,
  0.8606,
  0.8385,
  0.8202,
  0.8606,
  0.8532,
  0.8697,
  0.8422,
  0.8569,
  0.855,
  0.8679,
  0.8734,
  0.8569,
  0.8606,
  0.8881,
  0.8716,
  0.8477,
  0.8771,
  0.8679,
  0.8789,
  0.8789,
  0.8789,
  0.8752,
  0.8807,
  0.8936,
  0.9009,
  0.8789,
  0.8807,
  0.8624,
  0.8899,
  0.8716,
  0.8862,
  0.8936,
  0.8826,
  0.8862,
  0.8899,
  0.8881,
  0.8899,
  0.8936,
  0.8972,
  0.8991,
  0.8881,
  0.8881,
  0.8972,
  0.8972,
  0.8954,
  0.8954,
  0.8991,
  0.8936,
  0.8954,
  0.8954,
  0.8991,
  0.8936,
  0.8954,
  0.8899,
  0.8917,
  0.8936,
  0.8936,
  0.8936,
  0.8936,
  0.8954,
  0.8954,
  0.89

In [23]:
classifier = TextClassifier.load('resources/taggers/trec/final-model.pt')

# create example sentence
from flair.data import Sentence

sentence = Sentence('Who built the Eiffel Tower ?')

# predict class and print
classifier.predict(sentence)

print(sentence.labels)

2020-09-17 09:08:26,787 loading file resources/taggers/trec/final-model.pt
[HUM (0.9996542930603027)]


## prepare data files

Malignant data spread between several sets with different size for different DGA algorithms. We need stratification for these algorithms between train, dev, and test samples. So we split each DGA set into train, dev, and test and ONLY after that compound samples of all DGA algos into the result train, dev, and test sample sets.

DNs split by pseudo-words by wordninja package. Using pseudo-words inplace of DN we can use the models with word embeddings.
We prepare two data sets: one with samples split by pseudo-words and the second without such spliting. Then we experiment with both data sets.

Samples compound of a lablel and a sample text (DN in form of pseudo-words), where label is in the FastText format (as "__label__<label> <sample text>").

In [3]:
import wordninja

def split(txt):
    return ' . '.join(' '.join(wordninja.split(t)) for t in txt.split('.'))

split('myshopify.biz.com')

'my shop if y . biz . com'

In [4]:
# benign data

def read_data(file_name):
    ret = open(file_name, encoding='utf-8').read().splitlines()
    print(f'Load {len(ret):,} from file "{file_name}"')
    return ret
    
def remove_TLD(dns):
    return [el.split('.')[0] for el in dns]


def add_label(in_file, out_file, label, split_by_words=True):
    lines = read_data(in_file)
    if split_by_words: lines = [split(el) for el in lines]
    lines = [f'__label__{label} {el}\n' for el in lines]
    with open(out_file, 'w') as f:
        f.writelines(lines)
        print(f'Saved {len(lines):,} into "{out_file}"')

in_file = 'data/Alex_top-0.5M.txt'
label = 'binign'
split_by_words=True
out_file = f'{in_file[:-4]}.{label}.{"split_by_words." if split_by_words else ""}txt'
add_label(in_file, out_file, 'binign', split_by_words=split_by_words)


In [101]:
# malitious data
#

def read_data(file_name, start, end):
    ret = open(file_name, encoding='utf-8').read().splitlines()
    print(f' Load {len(ret):,} from file "{file_name}"')
    st, en = int(start*len(ret)), int(end*len(ret))
    print(f'  Get: [{st:,}:{en:,}]')
    return ret[st:en]

def read(f, start, end):
    ret = []
    if type(f) == list:
        for ff in f:
            ret += read_data(ff, start=start, end=end)
    elif type(f) == str:
        ret = read_data(f, start=start, end=end)
    return ret
        
def add_label(in_file, out_file, label, split_by_words=True, start=0, end=0.8):
    lines = read(in_file, start=start, end=end)
    if split_by_words: lines = [split(el) for el in lines]
    lines = [f'__label__{label} {el}\n' for el in lines]
    with open(out_file, 'w') as f:
        f.writelines(lines)
        print(f'Saved {len(lines):,} into "{out_file}"')

dr = 'domain_generation_algorithms/*/example_domains.txt'

import glob

in_file = list(glob.iglob(dr, recursive=True))

print('malignant files:', len(in_file), )

for split_by_words in [True, False]:
    for label, start, end in [('train', 0, 0.8), ('dev', 0.8, 0.9), ('test', 0.9, 1)]:
        out_file = f'data/malignant.{"split_by_words." if split_by_words else ""}{label}.txt'
        add_label(in_file, out_file, label, split_by_words=split_by_words, start=start, end=end)

# Saved 18,119 into "data/malignant.train.txt"

# Saved 2,265 into "data/malignant.dev.txt"

# Saved 2,273 into "data/malignant.test.txt"

malignant files: 31
 Load 1,000 from file "domain_generation_algorithms\banjori\example_domains.txt"
  Get: [0:800]
 Load 2,160 from file "domain_generation_algorithms\bazarbackdoor\example_domains.txt"
  Get: [0:1,728]
 Load 256 from file "domain_generation_algorithms\chinad\example_domains.txt"
  Get: [0:204]
 Load 40 from file "domain_generation_algorithms\corebot\example_domains.txt"
  Get: [0:32]
 Load 30 from file "domain_generation_algorithms\dircrypt\example_domains.txt"
  Get: [0:24]
 Load 5 from file "domain_generation_algorithms\dnschanger\example_domains.txt"
  Get: [0:4]
 Load 12 from file "domain_generation_algorithms\gozi\example_domains.txt"
  Get: [0:9]
 Load 8 from file "domain_generation_algorithms\locky\example_domains.txt"
  Get: [0:6]
 Load 2,500 from file "domain_generation_algorithms\monerodownloader\example_domains.txt"
  Get: [0:2,000]
 Load 99 from file "domain_generation_algorithms\mydoom\example_domains.txt"
  Get: [0:79]
 Load 2,048 from file "domain_gener

Saved 2,273 into "data/malignant.split_by_words.test.txt"
 Load 1,000 from file "domain_generation_algorithms\banjori\example_domains.txt"
  Get: [0:800]
 Load 2,160 from file "domain_generation_algorithms\bazarbackdoor\example_domains.txt"
  Get: [0:1,728]
 Load 256 from file "domain_generation_algorithms\chinad\example_domains.txt"
  Get: [0:204]
 Load 40 from file "domain_generation_algorithms\corebot\example_domains.txt"
  Get: [0:32]
 Load 30 from file "domain_generation_algorithms\dircrypt\example_domains.txt"
  Get: [0:24]
 Load 5 from file "domain_generation_algorithms\dnschanger\example_domains.txt"
  Get: [0:4]
 Load 12 from file "domain_generation_algorithms\gozi\example_domains.txt"
  Get: [0:9]
 Load 8 from file "domain_generation_algorithms\locky\example_domains.txt"
  Get: [0:6]
 Load 2,500 from file "domain_generation_algorithms\monerodownloader\example_domains.txt"
  Get: [0:2,000]
 Load 99 from file "domain_generation_algorithms\mydoom\example_domains.txt"
  Get: [0:7

In [68]:
in_file = 'data/Alex_top-0.5M.txt'
label = 'benign'
split_by_words=True
out_file = f'{in_file[:-4]}.{label}.{"split_by_words." if split_by_words else ""}train.txt'
add_label(in_file, out_file, label, split_by_words=split_by_words, start=0, end=0.8)

Load 525,215 from file "data/Alex_top-0.5M.txt"
  Get: [0:420,172]
Saved 420,172 into "data/Alex_top-0.5M.binign.split_by_words.train.txt"


In [71]:
# compound
def compound_files(in_files, out_file):
    with open(out_file, 'w') as outfile:
        for fname in in_files:
            with open(fname) as infile:
                outfile.write(infile.read())
    print(f'Saved "{out_file}"')
                
vars = ['dev', 'test', 'train', 'split_by_words.dev', 'split_by_words.test', 'split_by_words.train']
_ = [compound_files([f'data/{el}.{var}.txt' for el in ['Alex_top-0.5M.benign', 'malignant']], f'data/data.{var}.txt') for var in vars]


Saved "data/data.dev.txt"
Saved "data/data.test.txt"
Saved "data/data.train.txt"
Saved "data/data.split_by_words.dev.txt"
Saved "data/data.split_by_words.test.txt"
Saved "data/data.split_by_words.train.txt"


## Prepare Corpuses

In [102]:
from flair.data import Corpus
from flair.datasets import ClassificationCorpus

corpus_no_split: Corpus = ClassificationCorpus('data',
                                      test_file='data.test.txt',
                                      dev_file='data.dev.txt',
                                      train_file='data.train.txt')

2020-09-17 14:05:55,290 Reading data from data
2020-09-17 14:05:55,292 Train: data\data.train.txt
2020-09-17 14:05:55,293 Dev: data\data.dev.txt
2020-09-17 14:05:55,295 Test: data\data.test.txt


In [1]:
from flair.data import Corpus
from flair.datasets import ClassificationCorpus

corpus_word_split: Corpus = ClassificationCorpus('data',
                                      test_file='data.split_by_words.test.txt',
                                      dev_file='data.split_by_words.dev.txt',
                                      train_file='data.split_by_words.train.txt')


2020-09-17 21:23:21,636 Reading data from data
2020-09-17 21:23:21,639 Train: data\data.split_by_words.train.txt
2020-09-17 21:23:21,640 Dev: data\data.split_by_words.dev.txt
2020-09-17 21:23:21,642 Test: data\data.split_by_words.test.txt


## Train models

In [2]:
from flair.data import Corpus
from flair.datasets import TREC_6
from flair.embeddings import WordEmbeddings, FlairEmbeddings, DocumentRNNEmbeddings
from flair.models import TextClassifier
from flair.trainers import ModelTrainer

### GloVe / no_split

In [103]:
# 2. create the label dictionary
label_dict = corpus_no_split.make_label_dictionary()

2020-09-17 14:44:26,883 Computing label dictionary. Progress:


100%|████████████████████████████████████████████████████████████████████████| 422936/422936 [02:02<00:00, 3460.80it/s]

2020-09-17 14:47:32,717 [b'benign', b'malignant']





In [77]:
# len(label_dict), type(label_dict), dir(label_dict)

In [104]:
word_embeddings = [WordEmbeddings('glove')]

document_embeddings = DocumentRNNEmbeddings(word_embeddings, hidden_size=256)
classifier = TextClassifier(document_embeddings, label_dictionary=label_dict)
trainer = ModelTrainer(classifier, corpus_no_split)

I0917 14:47:44.510528 11616 utils.py:422] loading Word2VecKeyedVectors object from C:\Users\leo_g\.flair\embeddings\glove.gensim
I0917 14:47:48.508447 11616 utils.py:461] loading vectors from C:\Users\leo_g\.flair\embeddings\glove.gensim.vectors.npy with mmap=None
I0917 14:47:48.596435 11616 utils.py:494] setting ignored attribute vectors_norm to None
I0917 14:47:48.598442 11616 utils.py:428] loaded C:\Users\leo_g\.flair\embeddings\glove.gensim


In [105]:
trainer.train('models/glove.no_split',
              learning_rate=0.1,
              mini_batch_size=256, # 32,
              anneal_factor=0.5,
              patience=3,
              max_epochs=30)

# 2020-09-17 15:23:05,197 loading file models\glove.no_split\best-model.pt
# 2020-09-17 15:24:17,229 0.9935	0.9935	0.9935
# 2020-09-17 15:24:17,230 
# MICRO_AVG: acc 0.987 - f1-score 0.9935
# MACRO_AVG: acc 0.4968 - f1-score 0.49835
# benign     tp: 52522 - fp: 346 - fn: 0 - tn: 0 - precision: 0.9935 - recall: 1.0000 - accuracy: 0.9935 - f1-score: 0.9967
# malignant  tp: 0 - fp: 0 - fn: 346 - tn: 52522 - precision: 0.0000 - recall: 0.0000 - accuracy: 0.0000 - f1-score: 0.0000


2020-09-17 14:48:48,172 ----------------------------------------------------------------------------------------------------
2020-09-17 14:48:48,174 Model: "TextClassifier(
  (document_embeddings): DocumentRNNEmbeddings(
    (embeddings): StackedEmbeddings(
      (list_embedding_0): WordEmbeddings('glove')
    )
    (word_reprojection_map): Linear(in_features=100, out_features=100, bias=True)
    (rnn): GRU(100, 256, batch_first=True)
    (dropout): Dropout(p=0.5, inplace=False)
  )
  (decoder): Linear(in_features=256, out_features=2, bias=True)
  (loss_function): CrossEntropyLoss()
)"
2020-09-17 14:48:48,177 ----------------------------------------------------------------------------------------------------
2020-09-17 14:48:48,179 Corpus: "Corpus: 422936 train + 52867 dev + 52868 test sentences"
2020-09-17 14:48:48,181 ----------------------------------------------------------------------------------------------------
2020-09-17 14:48:48,183 Parameters:
2020-09-17 14:48:48,185  - lear

2020-09-17 15:23:05,197 loading file models\glove.no_split\best-model.pt
2020-09-17 15:24:17,229 0.9935	0.9935	0.9935
2020-09-17 15:24:17,230 
MICRO_AVG: acc 0.987 - f1-score 0.9935
MACRO_AVG: acc 0.4968 - f1-score 0.49835
benign     tp: 52522 - fp: 346 - fn: 0 - tn: 0 - precision: 0.9935 - recall: 1.0000 - accuracy: 0.9935 - f1-score: 0.9967
malignant  tp: 0 - fp: 0 - fn: 346 - tn: 52522 - precision: 0.0000 - recall: 0.0000 - accuracy: 0.0000 - f1-score: 0.0000
2020-09-17 15:24:17,231 ----------------------------------------------------------------------------------------------------


{'test_score': 0.9935,
 'dev_score_history': [0.9935, 0.9935, 0.9935, 0.9935],
 'train_loss_history': [0.04442076490379541,
  0.03973632867696497,
  0.039671257708426756,
  0.0396350922786416],
 'dev_loss_history': [tensor(0.0511, device='cuda:0'),
  tensor(0.0512, device='cuda:0'),
  tensor(0.0512, device='cuda:0'),
  tensor(0.0513, device='cuda:0')]}

### GloVe / word_split

In [106]:
model_name = 'glove'
model_var = 'word_split'
corpus = corpus_word_split

import os

model_dir = f'models/{model_name}.{model_var}'
try:
    os.mkdir(model_dir)
except OSError:
    print (f"Creation of the directory '{model_dir}' failed.")
else:
    print (f"Created the directory '{model_dir}'")

label_dict = corpus.make_label_dictionary()
word_embeddings = [WordEmbeddings(model_name)]
document_embeddings = DocumentRNNEmbeddings(word_embeddings, hidden_size=256)
classifier = TextClassifier(document_embeddings, label_dictionary=label_dict)
trainer = ModelTrainer(classifier, corpus)

trainer.train(f'models/{model_name}.{model_var}',
              learning_rate=0.1,
              mini_batch_size=256, # 32,
              anneal_factor=0.5,
              patience=3,
              max_epochs=30)
# MICRO_AVG: acc 0.9947 - f1-score 0.9973
# MACRO_AVG: acc 0.7949 - f1-score 0.8713500000000001
# benign     tp: 52522 - fp: 141 - fn: 0 - tn: 205 - precision: 0.9973 - recall: 1.0000 - accuracy: 0.9973 - f1-score: 0.9986
# malignant  tp: 205 - fp: 0 - fn: 141 - tn: 52522 - precision: 1.0000 - recall: 0.5925 - accuracy: 0.5925 - f1-score: 0.7441
# 20

Created the directory models/glove.word_split 
2020-09-17 15:31:13,744 Computing label dictionary. Progress:


100%|████████████████████████████████████████████████████████████████████████| 422936/422936 [02:02<00:00, 3454.79it/s]

2020-09-17 15:34:06,947 [b'benign', b'malignant']



I0917 15:34:06.951880 11616 utils.py:422] loading Word2VecKeyedVectors object from C:\Users\leo_g\.flair\embeddings\glove.gensim
I0917 15:34:08.144447 11616 utils.py:461] loading vectors from C:\Users\leo_g\.flair\embeddings\glove.gensim.vectors.npy with mmap=None
I0917 15:34:08.233437 11616 utils.py:494] setting ignored attribute vectors_norm to None
I0917 15:34:08.234394 11616 utils.py:428] loaded C:\Users\leo_g\.flair\embeddings\glove.gensim


2020-09-17 15:34:08,343 ----------------------------------------------------------------------------------------------------
2020-09-17 15:34:08,345 Model: "TextClassifier(
  (document_embeddings): DocumentRNNEmbeddings(
    (embeddings): StackedEmbeddings(
      (list_embedding_0): WordEmbeddings('glove')
    )
    (word_reprojection_map): Linear(in_features=100, out_features=100, bias=True)
    (rnn): GRU(100, 256, batch_first=True)
    (dropout): Dropout(p=0.5, inplace=False)
  )
  (decoder): Linear(in_features=256, out_features=2, bias=True)
  (loss_function): CrossEntropyLoss()
)"
2020-09-17 15:34:08,346 ----------------------------------------------------------------------------------------------------
2020-09-17 15:34:08,347 Corpus: "Corpus: 422936 train + 52867 dev + 52868 test sentences"
2020-09-17 15:34:08,348 ----------------------------------------------------------------------------------------------------
2020-09-17 15:34:08,350 Parameters:
2020-09-17 15:34:08,351  - lear

{'test_score': 0.9973,
 'dev_score_history': [],
 'train_loss_history': [],
 'dev_loss_history': []}

In [107]:
model_name = 'glove'
corpus = corpus_word_split
lr = 0.1
batch_size = 32
model_var = f'word_split_{lr}_{batch_size}'

import os

model_dir = f'models/{model_name}.{model_var}'
try:
    os.mkdir(model_dir)
except OSError:
    print (f"Creation of the directory '{model_dir}' failed.")
else:
    print (f"Created the directory '{model_dir}'")

label_dict = corpus.make_label_dictionary()
word_embeddings = [WordEmbeddings(model_name)]
document_embeddings = DocumentRNNEmbeddings(word_embeddings, hidden_size=256)
classifier = TextClassifier(document_embeddings, label_dictionary=label_dict)
trainer = ModelTrainer(classifier, corpus)

trainer.train(model_dir,
              learning_rate=lr,
              mini_batch_size=batch_size, # 32,
              anneal_factor=0.5,
              patience=3,
              max_epochs=30)

#  models\glove.word_split_0.5_256\best-model.pt
# MICRO_AVG: acc 0.9947 - f1-score 0.9973
# MACRO_AVG: acc 0.7949 - f1-score 0.8713500000000001
# benign     tp: 52522 - fp: 141 - fn: 0 - tn: 205 - precision: 0.9973 - recall: 1.0000 - accuracy: 0.9973 - f1-score: 0.9986
# malignant  tp: 205 - fp: 0 - fn: 141 - tn: 52522 - precision: 1.0000 - recall: 0.5925 - accuracy: 0.5925 - f1-score: 0.7441


# ******************** The BEST ***************************************************************
# ... training not finished
# 2020-09-17 19:17:51,743 loading file models\glove.word_split_0.1_32\best-model.pt
# 2020-09-17 19:19:07,756 0.9998	0.9998	0.9998
# MICRO_AVG: acc 0.9996 - f1-score 0.9998
# MACRO_AVG: acc 0.9844 - f1-score 0.9921
# benign     tp: 52513 - fp: 2 - fn: 9 - tn: 344 - precision: 1.0000 - recall: 0.9998 - accuracy: 0.9998 - f1-score: 0.9999
# malignant  tp: 344 - fp: 9 - fn: 2 - tn: 52513 - precision: 0.9745 - recall: 0.9942 - accuracy: 0.9690 - f1-score: 0.9843


Created the directory 'models/glove.word_split_0.1_32'
2020-09-17 15:59:57,686 Computing label dictionary. Progress:


100%|████████████████████████████████████████████████████████████████████████| 422936/422936 [01:49<00:00, 3879.39it/s]

2020-09-17 16:02:37,984 [b'benign', b'malignant']



I0917 16:02:37.986963 11616 utils.py:422] loading Word2VecKeyedVectors object from C:\Users\leo_g\.flair\embeddings\glove.gensim
I0917 16:02:39.216480 11616 utils.py:461] loading vectors from C:\Users\leo_g\.flair\embeddings\glove.gensim.vectors.npy with mmap=None
I0917 16:02:39.284472 11616 utils.py:494] setting ignored attribute vectors_norm to None
I0917 16:02:39.285509 11616 utils.py:428] loaded C:\Users\leo_g\.flair\embeddings\glove.gensim


2020-09-17 16:02:39,303 ----------------------------------------------------------------------------------------------------
2020-09-17 16:02:39,305 Model: "TextClassifier(
  (document_embeddings): DocumentRNNEmbeddings(
    (embeddings): StackedEmbeddings(
      (list_embedding_0): WordEmbeddings('glove')
    )
    (word_reprojection_map): Linear(in_features=100, out_features=100, bias=True)
    (rnn): GRU(100, 256, batch_first=True)
    (dropout): Dropout(p=0.5, inplace=False)
  )
  (decoder): Linear(in_features=256, out_features=2, bias=True)
  (loss_function): CrossEntropyLoss()
)"
2020-09-17 16:02:39,306 ----------------------------------------------------------------------------------------------------
2020-09-17 16:02:39,308 Corpus: "Corpus: 422936 train + 52867 dev + 52868 test sentences"
2020-09-17 16:02:39,310 ----------------------------------------------------------------------------------------------------
2020-09-17 16:02:39,311 Parameters:
2020-09-17 16:02:39,312  - lear

2020-09-17 17:02:53,486 epoch 5 - iter 6605/13217 - loss 0.00119771 - samples/sec: 715.39
2020-09-17 17:03:56,394 epoch 5 - iter 7926/13217 - loss 0.00124413 - samples/sec: 722.50
2020-09-17 17:05:00,330 epoch 5 - iter 9247/13217 - loss 0.00116674 - samples/sec: 709.42
2020-09-17 17:06:03,429 epoch 5 - iter 10568/13217 - loss 0.00115461 - samples/sec: 721.75
2020-09-17 17:07:07,310 epoch 5 - iter 11889/13217 - loss 0.00112642 - samples/sec: 701.50
2020-09-17 17:08:12,959 epoch 5 - iter 13210/13217 - loss 0.00116030 - samples/sec: 698.27
2020-09-17 17:08:14,009 ----------------------------------------------------------------------------------------------------
2020-09-17 17:08:14,011 EPOCH 5 done: loss 0.0012 - lr 0.1000
2020-09-17 17:10:03,072 DEV : loss 0.0006250182632356882 - score 0.9998
2020-09-17 17:10:26,699 BAD EPOCHS (no improvement): 3
2020-09-17 17:10:31,861 ----------------------------------------------------------------------------------------------------
2020-09-17 17:11:0

2020-09-17 18:19:58,631 epoch 11 - iter 0/13217 - loss 0.00002681 - samples/sec: 812967.09
2020-09-17 18:20:58,485 epoch 11 - iter 1321/13217 - loss 0.00066178 - samples/sec: 758.56
2020-09-17 18:21:58,721 epoch 11 - iter 2642/13217 - loss 0.00069827 - samples/sec: 754.66
2020-09-17 18:22:58,852 epoch 11 - iter 3963/13217 - loss 0.00064905 - samples/sec: 748.66
2020-09-17 18:23:58,641 epoch 11 - iter 5284/13217 - loss 0.00061713 - samples/sec: 743.08
2020-09-17 18:24:58,994 epoch 11 - iter 6605/13217 - loss 0.00061299 - samples/sec: 753.66
2020-09-17 18:25:58,980 epoch 11 - iter 7926/13217 - loss 0.00058855 - samples/sec: 740.87
2020-09-17 18:26:58,778 epoch 11 - iter 9247/13217 - loss 0.00058982 - samples/sec: 761.88
2020-09-17 18:27:58,999 epoch 11 - iter 10568/13217 - loss 0.00060913 - samples/sec: 757.18
2020-09-17 18:28:59,406 epoch 11 - iter 11889/13217 - loss 0.00060015 - samples/sec: 745.92
2020-09-17 18:29:59,266 epoch 11 - iter 13210/13217 - loss 0.00059535 - samples/sec: 752

{'test_score': 0.9998,
 'dev_score_history': [0.9995,
  0.9998,
  0.9997,
  0.9998,
  0.9998,
  0.9998,
  0.9998,
  0.9998,
  0.9998,
  0.9998,
  0.9998,
  0.9998,
  0.9998,
  0.9998],
 'train_loss_history': [0.006704615072667729,
  0.0018764245461755271,
  0.001428390409741658,
  0.001281433457158759,
  0.001160885160813556,
  0.0009288391432520336,
  0.0008205886772528867,
  0.0007256656202444912,
  0.0006219205977933742,
  0.0007342968501146098,
  0.0005951481113686868,
  0.0006219105104267317,
  0.0005906871863142453,
  0.0006297557100612004],
 'dev_loss_history': [tensor(0.0020, device='cuda:0'),
  tensor(0.0008, device='cuda:0'),
  tensor(0.0010, device='cuda:0'),
  tensor(0.0009, device='cuda:0'),
  tensor(0.0006, device='cuda:0'),
  tensor(0.0005, device='cuda:0'),
  tensor(0.0005, device='cuda:0'),
  tensor(0.0005, device='cuda:0'),
  tensor(0.0005, device='cuda:0'),
  tensor(0.0005, device='cuda:0'),
  tensor(0.0004, device='cuda:0'),
  tensor(0.0005, device='cuda:0'),
  tens

### greedy conclusion
The GloVe experiments show that splitting DNs to words works much better. So, we will use only data sets with word-splitting in the next experiments.

### en (FastText) / word_split
[CLASSIC_WORD_EMBEDDINGS](https://github.com/flairNLP/flair/blob/master/resources/docs/embeddings/CLASSIC_WORD_EMBEDDINGS.md)

In [7]:
# emb = WordEmbeddings('en')

2020-09-17 20:48:51,533 https://flair.informatik.hu-berlin.de/resources/embeddings/token/en-fasttext-news-300d-1M.vectors.npy not found in cache, downloading to D:\Temp\tmp8nn7o21t


100%|██████████████████████████████████████████████████████████████| 1200000128/1200000128 [07:17<00:00, 2743514.45B/s]

2020-09-17 20:56:09,623 copying D:\Temp\tmp8nn7o21t to cache at C:\Users\leo_g\.flair\embeddings\en-fasttext-news-300d-1M.vectors.npy





2020-09-17 20:56:11,760 removing temp file D:\Temp\tmp8nn7o21t
2020-09-17 20:56:12,478 https://flair.informatik.hu-berlin.de/resources/embeddings/token/en-fasttext-news-300d-1M not found in cache, downloading to D:\Temp\tmpbf28rx8w


100%|██████████████████████████████████████████████████████████████████| 54600983/54600983 [00:19<00:00, 2804971.30B/s]

2020-09-17 20:56:32,619 copying D:\Temp\tmpbf28rx8w to cache at C:\Users\leo_g\.flair\embeddings\en-fasttext-news-300d-1M





2020-09-17 20:56:32,711 removing temp file D:\Temp\tmpbf28rx8w


I0917 20:56:32.719858 11228 utils.py:422] loading Word2VecKeyedVectors object from C:\Users\leo_g\.flair\embeddings\en-fasttext-news-300d-1M
I0917 20:56:35.019429 11228 utils.py:461] loading vectors from C:\Users\leo_g\.flair\embeddings\en-fasttext-news-300d-1M.vectors.npy with mmap=None
I0917 20:56:35.518429 11228 utils.py:494] setting ignored attribute vectors_norm to None
I0917 20:56:35.519431 11228 utils.py:428] loaded C:\Users\leo_g\.flair\embeddings\en-fasttext-news-300d-1M


In [3]:
import os

# 'en' (or 'en-news' or 'news')	English	FastText embeddings over news and wikipedia data

model_name = 'en'

corpus = corpus_word_split
lr = 0.1
batch_size = 32
model_var = f'word_split_{lr}_{batch_size}'


model_dir = f'models/{model_name}.{model_var}'
try:
    os.mkdir(model_dir)
except OSError:
    print (f"Creation of the directory '{model_dir}' failed.")
else:
    print (f"Created the directory '{model_dir}'")

label_dict = corpus.make_label_dictionary()
word_embeddings = [WordEmbeddings(model_name)]
document_embeddings = DocumentRNNEmbeddings(word_embeddings, hidden_size=256)
classifier = TextClassifier(document_embeddings, label_dictionary=label_dict)
trainer = ModelTrainer(classifier, corpus)

trainer.train(model_dir,
              learning_rate=lr,
              mini_batch_size=batch_size, # 32,
              anneal_factor=0.5,
              patience=3,
              max_epochs=30)

# 2020-09-18 02:26:11,258 loading file models\en.word_split_0.1_32\best-model.pt
# 2020-09-18 02:27:51,902 	0.9999
# 2020-09-18 02:27:52,074 
# Results:
# - F-score (micro) 0.9999
# - F-score (macro) 0.9956
# - Accuracy 0.9999

# By class:
#               precision    recall  f1-score   support

#       benign     0.9999    1.0000    0.9999     52522
#    malignant     0.9942    0.9884    0.9913       346

#    micro avg     0.9999    0.9999    0.9999     52868
#    macro avg     0.9971    0.9942    0.9956     52868
# weighted avg     0.9999    0.9999    0.9999     52868
#  samples avg     0.9999    0.9999    0.9999     52868


Creation of the directory 'models/en.word_split_0.1_32' failed.
2020-09-17 21:24:03,844 Computing label dictionary. Progress:


100%|████████████████████████████████████████████████████████████████████████| 475804/475804 [01:54<00:00, 4172.25it/s]

2020-09-17 21:27:03,865 [b'benign', b'malignant']



I0917 21:27:03.870359  7756 utils.py:422] loading Word2VecKeyedVectors object from C:\Users\leo_g\.flair\embeddings\en-fasttext-news-300d-1M
I0917 21:27:06.158385  7756 utils.py:461] loading vectors from C:\Users\leo_g\.flair\embeddings\en-fasttext-news-300d-1M.vectors.npy with mmap=None
I0917 21:27:08.727932  7756 utils.py:494] setting ignored attribute vectors_norm to None
I0917 21:27:08.728893  7756 utils.py:428] loaded C:\Users\leo_g\.flair\embeddings\en-fasttext-news-300d-1M


2020-09-17 21:27:13,074 ----------------------------------------------------------------------------------------------------
2020-09-17 21:27:13,076 Model: "TextClassifier(
  (document_embeddings): DocumentRNNEmbeddings(
    (embeddings): StackedEmbeddings(
      (list_embedding_0): WordEmbeddings('en')
    )
    (word_reprojection_map): Linear(in_features=300, out_features=300, bias=True)
    (rnn): GRU(300, 256, batch_first=True)
    (dropout): Dropout(p=0.5, inplace=False)
  )
  (decoder): Linear(in_features=256, out_features=2, bias=True)
  (loss_function): CrossEntropyLoss()
  (beta): 1.0
  (weights): None
  (weight_tensor) None
)"
2020-09-17 21:27:13,076 ----------------------------------------------------------------------------------------------------
2020-09-17 21:27:13,078 Corpus: "Corpus: 422936 train + 52867 dev + 52868 test sentences"
2020-09-17 21:27:13,079 ----------------------------------------------------------------------------------------------------
2020-09-17 21:2

2020-09-17 22:13:38,130 epoch 5 - iter 2642/13217 - loss 0.00075971 - samples/sec: 1074.83 - lr: 0.100000
2020-09-17 22:14:20,329 epoch 5 - iter 3963/13217 - loss 0.00091977 - samples/sec: 1103.07 - lr: 0.100000
2020-09-17 22:15:03,194 epoch 5 - iter 5284/13217 - loss 0.00095851 - samples/sec: 1085.33 - lr: 0.100000
2020-09-17 22:15:46,222 epoch 5 - iter 6605/13217 - loss 0.00097530 - samples/sec: 1086.41 - lr: 0.100000
2020-09-17 22:16:28,123 epoch 5 - iter 7926/13217 - loss 0.00095412 - samples/sec: 1110.44 - lr: 0.100000
2020-09-17 22:17:10,307 epoch 5 - iter 9247/13217 - loss 0.00099935 - samples/sec: 1104.07 - lr: 0.100000
2020-09-17 22:17:52,639 epoch 5 - iter 10568/13217 - loss 0.00103127 - samples/sec: 1099.68 - lr: 0.100000
2020-09-17 22:18:34,395 epoch 5 - iter 11889/13217 - loss 0.00100628 - samples/sec: 1114.95 - lr: 0.100000
2020-09-17 22:19:16,582 epoch 5 - iter 13210/13217 - loss 0.00099259 - samples/sec: 1104.27 - lr: 0.100000
2020-09-17 22:19:18,089 -------------------

2020-09-17 23:09:29,604 epoch 10 - iter 10568/13217 - loss 0.00068931 - samples/sec: 996.91 - lr: 0.050000
2020-09-17 23:10:13,025 epoch 10 - iter 11889/13217 - loss 0.00069817 - samples/sec: 1060.66 - lr: 0.050000
2020-09-17 23:10:54,865 epoch 10 - iter 13210/13217 - loss 0.00070031 - samples/sec: 1107.11 - lr: 0.050000
2020-09-17 23:10:56,346 ----------------------------------------------------------------------------------------------------
2020-09-17 23:10:56,348 EPOCH 10 done: loss 0.0007 - lr 0.0500000
2020-09-17 23:12:37,809 DEV : loss 0.0004569814773276448 - score 0.9998
2020-09-17 23:12:48,056 BAD EPOCHS (no improvement): 0
saving best model
2020-09-17 23:13:36,712 ----------------------------------------------------------------------------------------------------
2020-09-17 23:15:19,223 epoch 11 - iter 1321/13217 - loss 0.00104728 - samples/sec: 808.25 - lr: 0.050000
2020-09-17 23:16:02,256 epoch 11 - iter 2642/13217 - loss 0.00090430 - samples/sec: 1052.94 - lr: 0.050000
202

2020-09-18 00:01:44,060 BAD EPOCHS (no improvement): 1
2020-09-18 00:01:44,129 ----------------------------------------------------------------------------------------------------
2020-09-18 00:03:06,628 epoch 16 - iter 1321/13217 - loss 0.00054618 - samples/sec: 1113.32 - lr: 0.050000
2020-09-18 00:03:48,159 epoch 16 - iter 2642/13217 - loss 0.00048265 - samples/sec: 1093.09 - lr: 0.050000
2020-09-18 00:04:29,906 epoch 16 - iter 3963/13217 - loss 0.00046876 - samples/sec: 1116.56 - lr: 0.050000
2020-09-18 00:05:11,642 epoch 16 - iter 5284/13217 - loss 0.00048654 - samples/sec: 1114.03 - lr: 0.050000
2020-09-18 00:05:53,121 epoch 16 - iter 6605/13217 - loss 0.00046446 - samples/sec: 1096.79 - lr: 0.050000
2020-09-18 00:06:33,798 epoch 16 - iter 7926/13217 - loss 0.00056249 - samples/sec: 1118.52 - lr: 0.050000
2020-09-18 00:07:15,206 epoch 16 - iter 9247/13217 - loss 0.00063623 - samples/sec: 1097.11 - lr: 0.050000
2020-09-18 00:07:56,748 epoch 16 - iter 10568/13217 - loss 0.00067024 -

2020-09-18 00:53:39,625 epoch 21 - iter 6605/13217 - loss 0.00061899 - samples/sec: 1086.54 - lr: 0.050000
2020-09-18 00:54:20,433 epoch 21 - iter 7926/13217 - loss 0.00061459 - samples/sec: 1094.60 - lr: 0.050000
2020-09-18 00:55:02,420 epoch 21 - iter 9247/13217 - loss 0.00067146 - samples/sec: 1113.43 - lr: 0.050000
2020-09-18 00:55:43,288 epoch 21 - iter 10568/13217 - loss 0.00066082 - samples/sec: 1119.36 - lr: 0.050000
2020-09-18 00:56:25,058 epoch 21 - iter 11889/13217 - loss 0.00061890 - samples/sec: 1122.43 - lr: 0.050000
2020-09-18 00:57:07,032 epoch 21 - iter 13210/13217 - loss 0.00059348 - samples/sec: 1114.12 - lr: 0.050000
2020-09-18 00:57:08,043 ----------------------------------------------------------------------------------------------------
2020-09-18 00:57:08,044 EPOCH 21 done: loss 0.0006 - lr 0.0500000
2020-09-18 00:58:45,913 DEV : loss 0.000368967856047675 - score 0.9999
2020-09-18 00:58:55,489 BAD EPOCHS (no improvement): 0
saving best model
2020-09-18 00:59:19,

2020-09-18 01:45:32,241 ----------------------------------------------------------------------------------------------------
2020-09-18 01:45:32,243 EPOCH 26 done: loss 0.0005 - lr 0.0500000
2020-09-18 01:47:10,691 DEV : loss 0.0003236537449993193 - score 0.9999
2020-09-18 01:47:20,018 BAD EPOCHS (no improvement): 0
saving best model
2020-09-18 01:47:39,140 ----------------------------------------------------------------------------------------------------
2020-09-18 01:49:03,510 epoch 27 - iter 1321/13217 - loss 0.00077307 - samples/sec: 1118.38 - lr: 0.050000
2020-09-18 01:49:45,198 epoch 27 - iter 2642/13217 - loss 0.00060027 - samples/sec: 1089.18 - lr: 0.050000
2020-09-18 01:50:27,464 epoch 27 - iter 3963/13217 - loss 0.00075467 - samples/sec: 1108.30 - lr: 0.050000
2020-09-18 01:51:09,126 epoch 27 - iter 5284/13217 - loss 0.00071988 - samples/sec: 1094.30 - lr: 0.050000
2020-09-18 01:51:49,882 epoch 27 - iter 6605/13217 - loss 0.00067537 - samples/sec: 1120.54 - lr: 0.050000
2020

{'test_score': 0.9999,
 'dev_score_history': [0.9997,
  0.9998,
  0.9998,
  0.9997,
  0.9998,
  0.9998,
  0.9998,
  0.9998,
  0.9998,
  0.9998,
  0.9998,
  0.9998,
  0.9998,
  0.9998,
  0.9998,
  0.9998,
  0.9999,
  0.9998,
  0.9998,
  0.9998,
  0.9999,
  0.9998,
  0.9999,
  0.9999,
  0.9998,
  0.9999,
  0.9999,
  0.9999,
  0.9999,
  0.9999],
 'train_loss_history': [0.007005688341932229,
  0.0013865314515874637,
  0.0010426083367080436,
  0.0010144814260991692,
  0.0009920659119428336,
  0.0010587797197936935,
  0.0008621925794808552,
  0.0008075173992434479,
  0.000731073996334196,
  0.0007000347918036083,
  0.0007187440749044018,
  0.000748671631027228,
  0.000783193307360384,
  0.0006957890033008677,
  0.0006514761315378964,
  0.0006951858777576511,
  0.0005949204746242241,
  0.0006549592268721019,
  0.0006606985747614043,
  0.0005890207203470428,
  0.0005931733206512098,
  0.0005345358036626473,
  0.0006067134716228813,
  0.000516379041056304,
  0.0005498284672124913,
  0.000480606

# Further Development and Research
1. Additional Data Sources:
   1. Feedback data from the production systems: Predicted TPs and TNs. Can we discover simple heuristics/statistics?
1. The complexity of the existed DGAs and neural networks. Is there a correlation between them? Can we estimate the complexity of the DGA in terms of a number of parameters (or any other NN complexity measurements)?
1. Can we group the DGA by algo groups? If YES, can we train different NNs for different DGA groups and use an ensemble of the models?