This implementation was developed for my master's thesis, "Botnet detection in encrypted traffic - a machine learning approach".
Configuration is done in the config.py file. A template is provided in example_config.py.
Follow these steps:
- run features_extraction/MainBro.py to extract the features into results/features.csv
- run machine_learning/normalize_and_split.py to generate the data fed to the machine learning models
- run train.py to generate the models
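The three steps above can be chained from a small driver script. The following is a minimal sketch, assuming the scripts are run from the repository root with the paths listed above and take no command-line arguments (neither is confirmed here). `build_commands` and `run_pipeline` are hypothetical helper names, not part of the repository.

```python
import subprocess
import sys

# Script paths as listed in the steps above; running them from the
# repository root is an assumption.
PIPELINE_SCRIPTS = [
    "features_extraction/MainBro.py",
    "machine_learning/normalize_and_split.py",
    "train.py",
]


def build_commands(python_exe=sys.executable, scripts=PIPELINE_SCRIPTS):
    """Return one command (argument list) per pipeline step, in order."""
    return [[python_exe, script] for script in scripts]


def run_pipeline(commands, runner=subprocess.run):
    """Run each step in order; check=True stops at the first failing step."""
    for command in commands:
        runner(command, check=True)
```

Calling `run_pipeline(build_commands())` would then run the three steps in sequence and stop as soon as one of them exits with an error.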
Select the feature set to train on by passing its name as the second argument of:
Get_normalize_data.get_all_data("model_folder", "set_name")
set_name can take the values "all", "dns", "https", "reduced", "reduced_30", "reduced_40" and "enhanced_30". To create a new set of features, add an entry to the features_set dictionary in the get_all_data(...) function.
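As an illustration of how a dictionary of named feature sets can work, here is a minimal sketch. The actual structure of features_set inside get_all_data(...) is not shown in this README, so the mapping from set name to a list of column names, the column names themselves, and the `select_features` helper are all assumptions.

```python
# Hypothetical feature sets: each name maps to the columns it keeps.
# None stands for "keep every column".
features_set = {
    "all": None,
    "dns_example": ["dns_ttl_mean", "dns_query_count"],  # made-up columns
}

# Adding a new set is just a new entry in the dictionary.
features_set["my_new_set"] = ["dns_ttl_mean", "tls_version_ratio"]


def select_features(rows, set_name):
    """Keep only the columns of the chosen set in each row (a dict)."""
    if set_name not in features_set:
        raise ValueError("unknown feature set: %r" % set_name)
    columns = features_set[set_name]
    if columns is None:
        return [dict(row) for row in rows]
    return [{name: row[name] for name in columns} for row in rows]
```

With this layout, an unknown set name fails early with a clear error instead of silently training on the wrong columns.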
The enhanced feature set contains cipher suites from ClientHello packets. This information is not available in Bro logs by default, so it has to be extracted with an additional script. The tls_finger.bro script from securityartwork.es was used for this extraction. Moreover, to avoid recomputing the whole feature set (which is time- and resource-consuming), these features are computed separately and then added to the CSV containing all features.
Here are the steps to generate the enhanced features set:
- Install Bro (or Security Onion) and put the tls_finger.bro file into the "/usr/local/share/bro/site" folder
- Use extract_bro_ciphers.py to extract cipher suites from Bro logs
- Use features_extraction/compute_ciphersuites_features.ipynb to compute the features from Bro logs and store them in results/model/features_enhanced.csv
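Adding separately computed features to the CSV that holds all features amounts to a join on a shared key. The sketch below uses only the standard library. The key column name `key`, the other column names and the `merge_feature_rows` helper are assumptions, since the notebook's actual join logic is not described here.

```python
import csv
import io


def merge_feature_rows(base_rows, extra_rows, key="key"):
    """Left-join extra_rows onto base_rows using the shared key column.

    Rows of base_rows without a match are kept unchanged.
    """
    extra_by_key = {row[key]: row for row in extra_rows}
    merged = []
    for row in base_rows:
        combined = dict(row)
        combined.update(extra_by_key.get(row[key], {}))
        merged.append(combined)
    return merged


def read_csv_text(text):
    """Parse CSV text into a list of dicts (one per data row)."""
    return list(csv.DictReader(io.StringIO(text)))


# Tiny made-up example; real files would be read with open(path, newline="").
base = read_csv_text("key,flow_count\nA,10\nB,4\n")
extra = read_csv_text("key,cipher_count\nA,17\n")
merged = merge_feature_rows(base, extra)
```

A left join is used here so that rows without cipher suite information are not dropped. Whether such rows should be kept or discarded is a design choice this README does not specify.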
The repository is organized as follows:
- dataset_tools/ -> contains all the tools related to the datasets (download, collect infected IPs, label and discard datasets)
  - download_datasets.py: to download the desired datasets
  - discard_unuseful_datasets.py: to discard datasets that have no labelled flows
  - collect_infected_ips.py: to collect infected and normal IPs from the README.html files present in the dataset folders (uses a regex to parse the files)
  - label_normal_datasets.py: to label normal datasets
  - label_mcfp_datasets.py: to label MCFP datasets (excluding the "CTU-13 Dataset", which is already labelled)
- features_extraction/ -> contains the scripts that extract the features. Credits go to František Střasák for the HTTPS feature extraction.
- machine_learning/ -> contains the scripts to normalize the extracted features and train the models
- results/{graphs|logs|model} -> default folders for generated graphs, models and logs
- results_backup/ -> contains the backup results of the different experiments
- statistics/ -> contains the scripts to analyze the features extracted and the models generated
- tools/ -> various tools:
  - tls_finger.bro: Bro script to extract cipher suites
  - extract_bro_ciphers.py: Python script to extract logs and cipher suites from pcaps
  - backup_results.py: to back up the results folder (requires "results_folder_backup" to be set in the config file)
  - delete_results.py: to delete the results folder
  - split_alexa.py: to sort and split the Alexa top websites list into multiple files for quicker lookups
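To illustrate why sorting and splitting a large domain list speeds up lookups, here is a minimal in-memory sketch. How split_alexa.py actually partitions the list is not documented here. Bucketing by first character, binary search within a bucket, and the helper names are assumptions. A file-based version would write each bucket to its own file.

```python
from bisect import bisect_left
from collections import defaultdict


def split_domains(domains):
    """Group domains by first character and sort each group."""
    buckets = defaultdict(list)
    for domain in domains:
        domain = domain.strip().lower()
        if domain:
            buckets[domain[0]].append(domain)
    return {first: sorted(names) for first, names in buckets.items()}


def is_listed(buckets, domain):
    """Binary-search only the bucket that could contain the domain."""
    domain = domain.strip().lower()
    if not domain:
        return False
    names = buckets.get(domain[0], [])
    index = bisect_left(names, domain)
    return index < len(names) and names[index] == domain
```

Each lookup then touches a single, much smaller sorted list instead of scanning the full ranking.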
BotnetDetectorThesis is released under the MIT license. Credits go to František Střasák for some parts of the code (https://github.com/frenky-strasak/HTTPSDetector).