Home

Welcome to the GAD wiki!

FAQ

Got error message fatal error: 'numpy/arrayobject.h' file not found

You may come across this error message when trying to run ./cmdgad detect -c ./example-configs/detect-config.py -d ./test/n0_flow.txt --method='mfmb' --pic_show

Some versions of GAD use Cython to compile performance-critical modules. This error message indicates that the build could not find the NumPy header files.

Solution:

https://github.com/andersbll/cudarray/issues/25

You can create a symlink to the NumPy headers from within /usr/local/include:

cd /usr/local/include
ln -s /usr/local/lib/python2.7/site-packages/numpy/core/include/numpy numpy
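The site-packages path in the command above belongs to one particular Python 2.7 install and may differ on your machine. As a minimal sketch, the helper below formats the same `ln -s` command for any header directory. The name `symlink_command` is hypothetical and not part of GAD. It only builds the command string and does not run it.

```python
import os


def symlink_command(include_dir, link_dir="/usr/local/include"):
    """Format the `ln -s` command that links NumPy's headers into link_dir.

    include_dir is the directory reported by numpy.get_include(); the header
    files themselves (arrayobject.h, ...) sit in its `numpy` subdirectory.
    """
    target = os.path.join(include_dir, "numpy")
    link = os.path.join(link_dir, "numpy")
    return "ln -s {} {}".format(target, link)


# Example with the path used on this page.
print(symlink_command("/usr/local/lib/python2.7/site-packages/numpy/core/include"))
```

With NumPy installed, `python -c "import numpy; print(numpy.get_include())"` prints the directory to pass as `include_dir`.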

About Configuration File

There are two versions of the configuration file format.

Version 0

If the version is not explicitly specified, the configuration file is treated as version 0.

This version was designed to handle only two types of features:

Numerical values.
IP addresses. It can only handle IPs in one column, and the column name must be 'src_ip'. When loading data, it deletes the 'src_ip' column and adds two columns, 'cluster' and 'dist_to_center', in its place.

Example: https://github.com/hbhzwj/GAD/blob/master/example-configs/detect-config.py

DETECTOR_DESC is a dictionary that holds the detection configuration. In particular, DETECTOR_DESC.fea_option is a dictionary that specifies the quantization schema.

Since version 0 was designed only for a specific usage, we encourage you to use version 1 if possible.

Version 1

Version 1 was introduced to support more complex feature types, e.g., categorical, numerical, ipv4_address, and port. You can add the following line

VERSION = 1

to your configuration file to tell the program that the configuration file is version 1. Here is an example of a version 1 config: https://github.com/hbhzwj/GAD/blob/master/example-configs/detect-config-botnet.py
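Because configuration files are plain Python, the version rule can be illustrated with a short sketch: a file that defines VERSION is read as that version, and a file without it falls back to version 0. The function `config_version` is hypothetical and is not GAD's actual loading code.

```python
def config_version(config_source):
    """Return the VERSION declared in a config file's source text, or 0 if absent."""
    namespace = {}
    exec(config_source, namespace)  # the config file is ordinary Python code
    return namespace.get("VERSION", 0)


v1_source = "VERSION = 1\nDETECTOR_DESC = {}\n"
v0_source = "DETECTOR_DESC = {}\n"

print(config_version(v1_source))
print(config_version(v0_source))
```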

The main change from version 0 to version 1 is DETECTOR_DESC.fea_option. In version 1, fea_option is a list of dictionaries. Each dictionary specifies one feature and looks like this:

        {
            'feature_name': 'Proto',
            'feature_type': 'categorical',
            ....
        },

feature_name is the name of the feature, i.e., the column name in the data. feature_type is the type of the feature. See the definition of FEATURE_PROCESSOR_MAP for the complete list of valid feature types: https://github.com/hbhzwj/GAD/blob/master/gad/Detector/DataHandler.py#L150

numerical

For a numerical feature, you need to specify 'quantized_number' and 'range':

        {
            'feature_name': 'SrcBytes',
            'feature_type': 'numerical',
            'quantized_number': 50,
            'range': [0, 251771542],
        },
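To give a feel for what these two settings control, here is a minimal sketch that assumes 'range' is split into 'quantized_number' equal-width bins and that out-of-range values are clipped. This is an assumption made for illustration; GAD's own quantizer may work differently, and the function name `quantize` is hypothetical.

```python
def quantize(value, value_range, quantized_number):
    """Map value to a bin index in [0, quantized_number - 1] using equal-width bins."""
    low, high = value_range
    clipped = min(max(value, low), high)            # clip values outside the range
    width = (high - low) / float(quantized_number)  # width of one bin
    index = int((clipped - low) / width)
    return min(index, quantized_number - 1)         # value == high goes in the last bin


for src_bytes in (0, 6000000, 251771542):
    print(src_bytes, quantize(src_bytes, [0, 251771542], 50))
```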

categorical

For a categorical feature, you need to specify a dictionary 'symbol_index' that maps each categorical value to a numerical value. The dictionary should also contain a 'DEFAULT' key for unknown categorical values. For example:

        {
            'feature_name': 'Proto',
            'feature_type': 'categorical',
            'symbol_index': {
                'tcp': 1,
                'udp': 2,
                'DEFAULT': 0
            },
        },
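The lookup that this dictionary describes can be sketched in a few lines: known values map to their number, and anything else falls back to the 'DEFAULT' entry. The function name `encode` is hypothetical and only illustrates the idea.

```python
symbol_index = {'tcp': 1, 'udp': 2, 'DEFAULT': 0}


def encode(value, symbol_index):
    """Return the number for a categorical value, or the DEFAULT entry if unknown."""
    return symbol_index.get(value, symbol_index['DEFAULT'])


protocols = ['tcp', 'icmp', 'udp']
encoded = [encode(p, symbol_index) for p in protocols]
print(encoded)  # [1, 0, 2]
```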

ipv4_address

ipv4_address is a special type of categorical feature for which the program can generate symbol_index automatically using a clustering algorithm. You can optionally add 'ip_columns' to specify the columns that contain IPs. If you set ip_columns (as in the example below), the program searches for IP addresses in the columns listed there. Otherwise, it searches the column named by 'feature_name' for values in x.x.x.x format and uses them for clustering.

After clustering, it creates a mapping from IP address to cluster ID. Each IP in the column is replaced by its cluster ID.
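The two steps around the clustering (collecting x.x.x.x values and replacing them with cluster IDs) can be sketched as follows. The clustering itself is omitted, and the `ip_to_cluster` mapping is hand-written to stand in for its result. The function names, the regular expression, and the default of -1 for unseen IPs are assumptions for illustration, not GAD's implementation.

```python
import re

# Matches the x.x.x.x shape only; it does not check that each part is <= 255.
IPV4_PATTERN = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def find_ips(values):
    """Collect the distinct x.x.x.x strings found in a column, in first-seen order."""
    seen = []
    for value in values:
        for ip in IPV4_PATTERN.findall(value):
            if ip not in seen:
                seen.append(ip)
    return seen


def replace_with_cluster(values, ip_to_cluster, default=-1):
    """Replace each IP with its cluster ID; IPs missing from the mapping get default."""
    return [ip_to_cluster.get(value, default) for value in values]


column = ["10.0.0.1", "10.0.0.2", "192.168.1.7", "10.0.0.1"]
ips = find_ips(column)

# Hand-written stand-in for the mapping a clustering step would produce.
ip_to_cluster = {"10.0.0.1": 0, "10.0.0.2": 0, "192.168.1.7": 1}
clustered = replace_with_cluster(column + ["8.8.8.8"], ip_to_cluster)
print(ips)
print(clustered)
```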

In practice, IP clustering usually takes a long time, so we suggest doing a test run and saving the symbol_index file. Later, you can set 'symbol_index' from that file instead of re-running the clustering.

Here is an example of a version 1 configuration file that uses this feature type: https://github.com/hbhzwj/GAD/blob/master/example-configs/detect-config-botnet-v1.py

To save the symbol_index file generated by IP clustering, uncomment the following part:

        {
             'feature_name': 'SrcAddr',
             'feature_type': 'ipv4_address',
             'ip_cluster_num': 5,
             'DEFAULT': -1,
             'ip_columns': ['SrcAddr'],
             'save_symbol_index_path': './test-data/SrcAddrSymbolIndex.json'
        },

and comment out the 'SrcAddr' configuration below it. Then make a sample run using the following command:

./cmdgad detect -c ./example-configs/detect-config-botnet-v1.py -d test-data/capture20110816_test.binetflow -m mf

This performs a normal run and also saves the symbol_index to the path given by 'save_symbol_index_path'. Later, you can use the configuration below (the configuration file needs to import json for this)

        {
             'feature_name': 'SrcAddr',
             'feature_type': 'ipv4_address',
             'symbol_index': json.load(open('./test-data/SrcAddrSymbolIndex.json', 'r')),
        },

which will bypass the IP clustering step and use the symbol_index file generated in the previous step directly.
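The save-then-reuse cycle can be sketched as a JSON round trip. The mapping below is made up for illustration, and the exact layout of the file GAD writes is an assumption. A temporary directory is used so the sketch does not touch ./test-data.

```python
import json
import os
import tempfile

# Hypothetical mapping standing in for a saved clustering result.
symbol_index = {"10.0.0.1": 0, "192.168.1.7": 1, "DEFAULT": -1}

path = os.path.join(tempfile.mkdtemp(), "SrcAddrSymbolIndex.json")

# First run: write the mapping to disk.
with open(path, "w") as handle:
    json.dump(symbol_index, handle)

# Later runs: load the mapping and place it in the feature description.
with open(path, "r") as handle:
    loaded = json.load(handle)

feature = {
    'feature_name': 'SrcAddr',
    'feature_type': 'ipv4_address',
    'symbol_index': loaded,
}
print(feature['symbol_index'] == symbol_index)
```

Using `with open(...)` closes the file handle promptly, which the one-line `json.load(open(...))` form leaves to the interpreter.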
