Jing Conan Wang edited this page Nov 23, 2016 · 15 revisions

Welcome to the GAD wiki!

FAQ

Got error message fatal error: 'numpy/arrayobject.h' file not found

You may come across this error message when trying to run ./cmdgad detect -c ./example-configs/detect-config.py -d ./test/n0_flow.txt --method='mfmb' --pic_show

Some versions of GAD use Cython to compile performance-critical modules. This error message indicates that the Cython build could not find the NumPy header files.

Solution:

https://github.com/andersbll/cudarray/issues/25

You can create a symlink from within /usr/local/include:

cd /usr/local/include
ln -s /usr/local/lib/python2.7/site-packages/numpy/core/include/numpy numpy
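The site-packages path in the command above depends on the Python version and on where NumPy is installed. NumPy provides `numpy.get_include()`, which returns the directory that holds its C headers. The sketch below prints the path to use as the symlink target; the helper name `numpy_header_dir` is hypothetical and is not part of GAD.

```python
import os


def numpy_header_dir():
    """Return NumPy's include root, or None if NumPy is not importable."""
    try:
        import numpy
    except ImportError:
        return None  # NumPy is not installed for this interpreter
    # arrayobject.h lives in <include root>/numpy/
    return numpy.get_include()


include_root = numpy_header_dir()
if include_root is None:
    print("NumPy is not installed for this Python interpreter")
else:
    # This is the directory to use as the symlink target.
    print(os.path.join(include_root, "numpy"))
```

Run it with the same Python interpreter that GAD uses, so that the reported path matches the NumPy installation the build will see.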

About Configuration File

There are two versions of the configuration file.

Version 0

If the version is not explicitly specified, the configuration file is treated as version 0.

This version was designed to handle only two types of features:

  1. Numerical Values.
  2. IP Address. It can only handle IPs in one column, and the column name must be 'src_ip'. When loading data, it deletes the 'src_ip' column and adds two columns, 'cluster' and 'dist_to_center', in its place.

Example: https://github.com/hbhzwj/GAD/blob/master/example-configs/detect-config.py

DETECTOR_DESC is a dictionary that holds the detection configuration. In particular, DETECTOR_DESC.fea_option is a dictionary that specifies the quantization scheme.

Since version 0 was designed only for a specific usage, we encourage you to use version 1 if possible.

Version 1

Version 1 was introduced to support more complex features, e.g., categorical, numerical, ipv4_address, port. You can add the following line

VERSION = 1

to your configuration file to tell the program that the configuration file is version 1. Here is an example of a version 1 config: https://github.com/hbhzwj/GAD/blob/master/example-configs/detect-config-botnet.py

The main change from version 0 to version 1 is DETECTOR_DESC.fea_option. In v1, fea_option is a list of dictionaries. Each dictionary specifies one feature and looks like this:

        {
            'feature_name': 'Proto',
            'feature_type': 'categorical',
            ....
        },

feature_name is the name of the feature, i.e., the column name in the data. feature_type is the type of the feature. See the definition of FEATURE_PROCESSOR_MAP for the complete list of valid feature types: https://github.com/hbhzwj/GAD/blob/master/gad/Detector/DataHandler.py#L150
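As a quick sanity check before running the detector, a fea_option list can be scanned for entries that lack the two keys described above. This is a minimal sketch, not part of GAD: `check_fea_option` and `KNOWN_TYPES` are hypothetical names, and the set of types only contains the four mentioned on this page. The authoritative list is FEATURE_PROCESSOR_MAP.

```python
# Feature types named on this page (assumption: FEATURE_PROCESSOR_MAP may list more).
KNOWN_TYPES = {'categorical', 'numerical', 'ipv4_address', 'port'}


def check_fea_option(fea_option):
    """Return a list of human-readable problems found in a v1 fea_option list."""
    problems = []
    for i, entry in enumerate(fea_option):
        for key in ('feature_name', 'feature_type'):
            if key not in entry:
                problems.append('entry %d: missing %r' % (i, key))
        ftype = entry.get('feature_type')
        if ftype is not None and ftype not in KNOWN_TYPES:
            problems.append('entry %d: unknown feature_type %r' % (i, ftype))
    return problems


fea_option = [
    {'feature_name': 'Proto', 'feature_type': 'categorical'},
    {'feature_name': 'SrcBytes'},  # feature_type is missing on purpose
]
print(check_fea_option(fea_option))
```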

numerical

For a numerical feature, you need to specify 'quantized_number' and 'range':

       {
            'feature_name': 'SrcBytes',
            'feature_type': 'numerical',
            'quantized_number': 50,
            'range': [0, 251771542],
       }
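To illustrate how these two options could interact, the sketch below maps a value to one of 'quantized_number' equal-width bins over 'range'. Equal-width binning and clamping of out-of-range values are assumptions made for illustration; GAD's actual quantizer may behave differently. The function name `quantize` is hypothetical.

```python
def quantize(value, value_range, quantized_number):
    """Map a value to a bin index in [0, quantized_number - 1] using equal-width bins."""
    low, high = value_range
    # Clamp out-of-range values to the nearest edge (an assumption of this sketch).
    clamped = min(max(value, low), high)
    width = (high - low) / float(quantized_number)
    index = int((clamped - low) / width)
    # The upper edge itself falls into the last bin.
    return min(index, quantized_number - 1)


option = {'quantized_number': 50, 'range': [0, 251771542]}
print(quantize(0, option['range'], option['quantized_number']))
print(quantize(251771542, option['range'], option['quantized_number']))
```

Under these assumptions, the lower edge of the range lands in bin 0 and the upper edge in bin 49.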

categorical

For a categorical feature, you need to specify a dictionary 'symbol_index' that maps each categorical value to a numerical value. The dictionary should also hold a key 'DEFAULT' for unknown categorical values. For example:

        {
            'feature_name': 'Proto',
            'feature_type': 'categorical',
            'symbol_index': {
                'tcp': 1,
                'udp': 2,
                'DEFAULT': 0
            },
        },
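The role of the 'DEFAULT' key can be pictured as a dictionary lookup with a fallback. This is only a sketch of the idea, and `encode_category` is a hypothetical helper, not a GAD function.

```python
symbol_index = {'tcp': 1, 'udp': 2, 'DEFAULT': 0}


def encode_category(value, symbol_index):
    """Return the numeric code for value, falling back to the 'DEFAULT' entry."""
    return symbol_index.get(value, symbol_index['DEFAULT'])


# 'icmp' is not listed in symbol_index, so it maps to the DEFAULT code.
print([encode_category(p, symbol_index) for p in ['tcp', 'udp', 'icmp']])
```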

ipv4_address

ipv4_address is a special type of categorical feature for which the program can generate symbol_index automatically using a clustering algorithm. You can optionally add 'ip_columns' to specify the columns that contain IPs, or leave it unset. If you set ip_columns in the option (like the example below), the program searches for IP addresses in the columns listed in ip_columns. Otherwise, it searches for values in x.x.x.x format in the column specified by 'feature_name' and uses them for clustering (see this code).

After clustering, it creates a mapping from IP address to cluster ID. Each IP in the column is replaced by its cluster ID.
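The idea of turning IPs into cluster IDs can be sketched as follows. This is an illustration only: it uses a plain 1-D k-means over the integer value of each address, which is an assumption and not necessarily the algorithm GAD uses. The names `ip_to_int` and `cluster_ips` are hypothetical, and the sample column is made up.

```python
import re

# Matches values in x.x.x.x format.
IPV4_PATTERN = re.compile(r'^\d{1,3}(\.\d{1,3}){3}$')


def ip_to_int(ip):
    """Convert a dotted-quad string to its 32-bit integer value."""
    a, b, c, d = [int(part) for part in ip.split('.')]
    return (a << 24) | (b << 16) | (c << 8) | d


def cluster_ips(ips, cluster_num, iterations=20):
    """Return {ip: cluster_id} from a 1-D k-means over integer IP values."""
    unique = sorted(set(ips), key=ip_to_int)
    values = [ip_to_int(ip) for ip in unique]
    k = min(cluster_num, len(values))
    # Start with centers spread evenly over the sorted addresses.
    centers = [float(values[(2 * i + 1) * len(values) // (2 * k)]) for i in range(k)]
    labels = [0] * len(values)
    for _ in range(iterations):
        # Assign each address to its nearest center.
        labels = [min(range(k), key=lambda c: abs(v - centers[c])) for v in values]
        for c in range(k):
            members = [v for v, label in zip(values, labels) if label == c]
            if members:  # keep the old center if a cluster becomes empty
                centers[c] = sum(members) / float(len(members))
    return dict(zip(unique, labels))


column = ['10.0.0.1', '10.0.0.2', '192.168.1.5', '192.168.1.9', 'n/a']
ips = [v for v in column if IPV4_PATTERN.match(v)]
mapping = cluster_ips(ips, 2)
# Values that are not IPs fall back to a default ID of -1.
replaced = [mapping.get(v, -1) for v in column]
print(replaced)
```

With this sample column, the two 10.x addresses share one cluster ID and the two 192.168.x addresses share another.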

In practice, IP clustering usually takes a long time, so we suggest doing a test run and saving the symbol_index file. Later, you can set 'symbol_index' from the saved file.

Here is an example of a v1 configuration file: https://github.com/hbhzwj/GAD/blob/master/example-configs/detect-config-botnet-v1.py

To save the symbol_index file generated by IP clustering, uncomment the following part

        {
             'feature_name': 'SrcAddr',
             'feature_type': 'ipv4_address',
             'ip_cluster_num': 5,
             'DEFAULT': -1,
             'ip_columns': ['SrcAddr'],
             'save_symbol_index_path': './test-data/SrcAddrSymbolIndex.json'
        },

and comment out the 'SrcAddr' configuration below it. Then make a sample run using the following command:

./cmdgad detect -c ./example-configs/detect-config-botnet-v1.py -d test-data/capture20110816_test.binetflow -m mf

This performs a normal run and also saves the symbol_index file to the path given by save_symbol_index_path. Later, you can use the configuration below

        {
             'feature_name': 'SrcAddr',
             'feature_type': 'ipv4_address',
             'symbol_index': json.load(open('./test-data/SrcAddrSymbolIndex.json', 'r')),
        },

which will bypass the IP clustering step and use the symbol_index file generated in the previous step directly.
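Because the configuration file is a Python file, the json.load call above only works if the file also contains `import json`. The save-and-reload round trip can be sketched as below. The mapping contents and the temporary path are made up for illustration, and the exact layout of the file GAD writes is an assumption.

```python
import json
import os
import tempfile

# Made-up example mapping from IP address to cluster ID.
symbol_index = {'10.0.0.1': 0, '192.168.1.5': 1}

path = os.path.join(tempfile.mkdtemp(), 'SrcAddrSymbolIndex.json')

# Save the mapping; the with-block closes the file handle.
with open(path, 'w') as f:
    json.dump(symbol_index, f)

# Load it back, as a configuration file would.
with open(path, 'r') as f:
    loaded = json.load(f)

print(loaded == symbol_index)
```

Note that JSON object keys are always strings, so a mapping keyed by IP strings survives the round trip unchanged, while integer keys would come back as strings.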