Home
Welcome to the GAD wiki!
You may come across this error message when trying to run ./cmdgad detect -c ./example-configs/detect-config.py -d ./test/n0_flow.txt --method='mfmb' --pic_show
Some versions of GAD use Cython to compile some performance-critical modules. This error message indicates that Cython could not find the NumPy header files.
Solution:
https://github.com/andersbll/cudarray/issues/25
You can create a symlink to the NumPy headers from within /usr/local/include:
cd /usr/local/include
ln -s /usr/local/lib/python2.7/site-packages/numpy/core/include/numpy numpy
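The site-packages path in the command above depends on how Python and NumPy were installed, so it may differ on your machine. As a rough sketch, NumPy's own `numpy.get_include()` can report where its headers live. The helper name `numpy_headers_path` is hypothetical and not part of GAD; it returns None when NumPy is not importable.

```python
import os


def numpy_headers_path():
    """Return the directory holding NumPy's C headers, or None without NumPy."""
    try:
        import numpy
    except ImportError:
        return None
    # get_include() returns the include root; the headers sit in its "numpy" subdirectory.
    return os.path.join(numpy.get_include(), "numpy")


# Use the printed path as the symlink target instead of the hard-coded one.
print(numpy_headers_path())
```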
There are two versions of the configuration file format.
If the version is not explicitly specified, the configuration file is treated as version 0.
This version was designed to handle only two types of features:
- Numerical Values.
- IP Address. It can only handle IPs in one column, and the column name must be 'src_ip'. When loading data, it deletes the 'src_ip' column and adds two columns, 'cluster' and 'dist_to_center', in its place.
Example: https://github.com/hbhzwj/GAD/blob/master/example-configs/detect-config.py
DETECTOR_DESC is a dictionary for detection configuration. In particular, DETECTOR_DESC.fea_option is a dictionary that specifies the quantization schema.
Since version 0 was designed only for a specific usage, we encourage you to use version 1 if possible.
Version 1 was introduced to support more complex features, e.g., categorical, numerical, ipv4_address, and port. You can add the following line
VERSION = 1
to your configuration file to tell the program that the configuration file is version 1. Here is an example of a version 1 config: https://github.com/hbhzwj/GAD/blob/master/example-configs/detect-config-botnet.py
The main change from version 0 to version 1 is DETECTOR_DESC.fea_option. In v1, fea_option is a list of dictionaries. Each dictionary specifies one feature and looks like this:
    {
        'feature_name': 'Proto',
        'feature_type': 'categorical',
        ....
    },
feature_name is the name of the feature, i.e., the column name in the data. feature_type is the type of the feature. Please see the definition of FEATURE_PROCESSOR_MAP for the complete list of valid feature types: https://github.com/hbhzwj/GAD/blob/master/gad/Detector/DataHandler.py#L150
For a numerical feature, you need to specify 'quantized_number' and 'range':
    {
        'feature_name': 'SrcBytes',
        'feature_type': 'numerical',
        'quantized_number': 50,
        'range': [0, 251771542],
    }
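To give a feel for what 'quantized_number' and 'range' might control, here is a minimal sketch of uniform quantization. It assumes that 'range' is split into 'quantized_number' equal-width bins and that out-of-range values are clamped. This is an illustrative assumption, not necessarily how GAD quantizes internally, and the function name `quantize` is hypothetical.

```python
def quantize(value, value_range, quantized_number):
    """Map value to an integer bin index in [0, quantized_number - 1]."""
    low, high = value_range
    if high <= low:
        raise ValueError("range must satisfy low < high")
    # Assumption: values outside the range fall into the first or last bin.
    clamped = min(max(value, low), high)
    width = (high - low) / float(quantized_number)
    # The upper bound itself belongs to the last bin.
    return min(int((clamped - low) / width), quantized_number - 1)


# With the 'SrcBytes' settings above, the smallest value lands in bin 0
# and the largest in bin 49.
print(quantize(0, [0, 251771542], 50), quantize(251771542, [0, 251771542], 50))
```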
For a categorical feature, you need to specify a dictionary 'symbol_index' that maps each categorical value to a numerical value. The dictionary should also hold a key 'DEFAULT' for unknown categorical values. For example:
    {
        'feature_name': 'Proto',
        'feature_type': 'categorical',
        'symbol_index': {
            'tcp': 1,
            'udp': 2,
            'DEFAULT': 0
        },
    },
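The role of the 'DEFAULT' key can be illustrated with a small lookup sketch. The function name `encode_categorical` is hypothetical and only mirrors the behaviour described above; GAD's own implementation may differ in detail.

```python
def encode_categorical(value, symbol_index):
    """Return the number for a categorical value, or the 'DEFAULT' entry if unknown."""
    return symbol_index.get(value, symbol_index['DEFAULT'])


symbol_index = {'tcp': 1, 'udp': 2, 'DEFAULT': 0}

# 'icmp' is not listed, so it maps to the 'DEFAULT' value.
encoded = [encode_categorical(p, symbol_index) for p in ['tcp', 'udp', 'icmp']]
print(encoded)  # [1, 2, 0]
```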
ipv4_address is a special type of categorical feature for which the program can generate symbol_index automatically using a clustering algorithm. You can optionally add 'ip_columns' to specify which columns contain IPs, or leave ip_columns unset. If you set ip_columns (as in the example below), the program searches for IP addresses in the columns listed there. Otherwise, it searches for values in x.x.x.x format in the column specified by 'feature_name' and uses them for clustering (see this code).
After the clustering, it creates a mapping from IP address to cluster ID. Each IP in the column is replaced with its cluster ID.
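The replacement step can be sketched as follows, assuming the clustering has already produced an IP-to-cluster mapping. The regular expression, the function name `replace_ips_with_cluster_ids`, and the choice to map unmatched values to a default ID are all assumptions made for illustration, not GAD's actual code.

```python
import re

# Loose match for the x.x.x.x format; it does not validate octet ranges.
IPV4_PATTERN = re.compile(r'^\d{1,3}(\.\d{1,3}){3}$')


def replace_ips_with_cluster_ids(column, ip_to_cluster, default=-1):
    """Replace each IP string in column with its cluster ID."""
    result = []
    for value in column:
        if IPV4_PATTERN.match(value):
            result.append(ip_to_cluster.get(value, default))
        else:
            # Assumption: values that are not IPs get the default ID.
            result.append(default)
    return result


ip_to_cluster = {'10.0.0.1': 0, '10.0.0.2': 0, '192.168.1.5': 1}
column = ['10.0.0.1', '10.0.0.2', '192.168.1.5', 'not-an-ip']
print(replace_ips_with_cluster_ids(column, ip_to_cluster))  # [0, 0, 1, -1]
```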
In practice, IP clustering usually takes a long time, so we suggest that you do a test run and save the symbol_index file. Later, you can simply set 'symbol_index' to the contents of that file.
Here is an example of a version 1 configuration file that uses IP clustering: https://github.com/hbhzwj/GAD/blob/master/example-configs/detect-config-botnet-v1.py
To save the symbol_index file generated by IP clustering, you can uncomment the following part of that file
    {
        'feature_name': 'SrcAddr',
        'feature_type': 'ipv4_address',
        'ip_cluster_num': 5,
        'DEFAULT': -1,
        'ip_columns': ['SrcAddr'],
        'save_symbol_index_path': './test-data/SrcAddrSymbolIndex.json'
    },
and comment out the 'SrcAddr' configuration below it. Then make a sample run using the following command:
./cmdgad detect -c ./example-configs/detect-config-botnet-v1.py -d test-data/capture20110816_test.binetflow -m mf
This will not only do a normal run but also save the symbol index to the path given by 'save_symbol_index_path'. Later, you can use the configuration below
    {
        'feature_name': 'SrcAddr',
        'feature_type': 'ipv4_address',
        'symbol_index': json.load(open('./test-data/SrcAddrSymbolIndex.json', 'r')),
    },
which will bypass the IP clustering step and use the symbol_index file generated in the previous step directly.
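Instead of commenting and uncommenting the two 'SrcAddr' configurations by hand, a configuration file could pick one based on whether the saved file exists. The sketch below is one possible way to do this. The helper `src_addr_feature` is hypothetical, while the dictionary keys are the ones shown above.

```python
import json
import os

SYMBOL_INDEX_PATH = './test-data/SrcAddrSymbolIndex.json'


def src_addr_feature(path=SYMBOL_INDEX_PATH):
    """Build the 'SrcAddr' feature entry, reusing a saved symbol index if present."""
    if os.path.exists(path):
        # A previous run saved the symbol index: load it and skip clustering.
        with open(path, 'r') as f:
            return {
                'feature_name': 'SrcAddr',
                'feature_type': 'ipv4_address',
                'symbol_index': json.load(f),
            }
    # No saved file yet: run IP clustering and save the result for next time.
    return {
        'feature_name': 'SrcAddr',
        'feature_type': 'ipv4_address',
        'ip_cluster_num': 5,
        'DEFAULT': -1,
        'ip_columns': ['SrcAddr'],
        'save_symbol_index_path': path,
    }
```

The returned dictionary would then be placed in the fea_option list like any other feature entry.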