## Universidad del Valle de Guatemala
## CC3094 - Security Data Science - Section 10
## Final Project - DDoS Detection
## Team members:
## - 19402 - Julio Herrera
## - 19270 - Oliver De León
## - 18341 - Randy Venegas

## Phase 1: Project Proposal

The approved project proposal is in the document Avances Proyecto Security Data Science.pdf, which defines the motivation for choosing this project, the key questions to consider during development, the review of current literature on the topic, and the data collection.

In summary, the selected project is a system for detecting DDoS attacks, specifically for identifying requests to a server that are part of a DDoS attack. The aim is to let organizations and companies detect and mitigate attacks aimed at their servers, with precision and recall equal to or higher than those reported in the current literature.

## Phase 2: Implementation

This phase, which is developed in this Jupyter notebook, consists of the complete implementation of the model: data cleaning and selection of relevant features, preprocessing and exploratory analysis, implementation of the models, and evaluation.

### Preprocessing

For preprocessing, the required datasets are loaded, joined, and cleaned, and the relevant features are selected. The goal of this step is not to produce the final dataset but a manageable one, without features already known to be irrelevant, to support the exploration that follows.

#### Loading the datasets

The complete dataset is the DDoS Evaluation Dataset (CIC-DDoS2019) from the University of New Brunswick (UNB), available at https://www.unb.ca/cic/datasets/ddos-2019.html. Personal copies are also available: https://drive.google.com/file/d/1U9ccgLqrv36eLZuV0-pgUsGshgqJ73ma/view?usp=sharing contains the CSV-01-12 zip (hash b86b3553b1c5086222b27ea17a27e07a) and https://drive.google.com/file/d/1v88DSdQ-tmwICOQWZGXUpKIj9JJD3ji6/view?usp=sharing contains the CSV-03-11 zip (hash f43510dfa38483cb5851063807997baa). Each archive contains multiple CSV files. The first set (CSV-01-12) covers 12 types of DDoS attack recorded on the training day: NTP, DNS, LDAP, MSSQL, NetBIOS, SNMP, SSDP, UDP, UDP-Lag, WebDDoS, SYN and TFTP. The second set (CSV-03-11) covers 7 types of attack recorded on the testing day: PortScan, NetBIOS, LDAP, MSSQL, UDP, UDP-Lag and SYN.
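
The 32-character hexadecimal hashes listed above are consistent with MD5, although the notebook does not name the algorithm. A minimal sketch, under that assumption, of checking a downloaded archive before extracting it; the helper names `md5_of_file` and `verify` and the local file names are hypothetical.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Hashes quoted in the text above; the file names are assumed local paths.
EXPECTED = {
    "CSV-01-12.zip": "b86b3553b1c5086222b27ea17a27e07a",
    "CSV-03-11.zip": "f43510dfa38483cb5851063807997baa",
}


def verify(path, expected_hex):
    """True when the file's MD5 digest matches the expected value."""
    return md5_of_file(path) == expected_hex.lower()
```

Reading in chunks keeps memory use flat even for multi-gigabyte archives. A call such as `verify("CSV-01-12.zip", EXPECTED["CSV-01-12.zip"])` would return `False` for a corrupted or incomplete download.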

In [3]:
import pandas as pd
import os
import numpy as np
import seaborn as sns
from sklearn.preprocessing import MinMaxScaler
from sklearn import svm
from tqdm import tqdm_notebook as tqdm
from tqdm.notebook import tqdm_notebook

In [2]:
# train day
DrDoS_DNS = pd.read_csv('datasets/DrDoS_DNS.csv')
DrDoS_LDAP = pd.read_csv('datasets/DrDoS_LDAP.csv')
DrDoS_MSSQL = pd.read_csv('datasets/DrDoS_MSSQL.csv')
DrDoS_NetBIOS = pd.read_csv('datasets/DrDoS_NetBIOS.csv')
DrDoS_NTP = pd.read_csv('datasets/DrDoS_NTP.csv')
DrDoS_SNMP = pd.read_csv('datasets/DrDoS_SNMP.csv')
DrDoS_SSDP = pd.read_csv('datasets/DrDoS_SSDP.csv')
DrDoS_UDP = pd.read_csv('datasets/DrDoS_UDP.csv')
Syn = pd.read_csv('datasets/Syn1.csv')
TFTPdf = pd.read_csv('datasets/TFTP.csv')
UDPLag = pd.read_csv('datasets/UDPLag1.csv')

# test day
LDAPdf = pd.read_csv('datasets/LDAP.csv')
MSSQLdf = pd.read_csv('datasets/MSSQL.csv')
NetBIOS = pd.read_csv('datasets/NetBIOS.csv')
Portmap = pd.read_csv('datasets/Portmap.csv')
Syn_2 = pd.read_csv('datasets/Syn.csv')
UDPdf = pd.read_csv('datasets/UDP.csv')
UDPLag_2 = pd.read_csv('datasets/UDPLag.csv')



#### A first look at the datasets

A quick look at the datasets gives a sense of how much data and how many features are available.
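
Besides `head()`, the row count, the column count and the label distribution can be read directly from a frame. A small sketch on a toy stand-in frame (the values are illustrative; the ` Label` column name, with its leading space, is taken from the later cells of this notebook):

```python
import pandas as pd

# Toy stand-in for one of the loaded frames.
toy = pd.DataFrame({
    " Flow Duration": [28415, 2, 48549, 3],
    " Label": ["DrDoS_DNS", "DrDoS_DNS", "DrDoS_DNS", "BENIGN"],
})

n_rows, n_cols = toy.shape                    # (rows, columns)
label_counts = toy[" Label"].value_counts()   # rows per class

print(f"{n_rows} rows, {n_cols} columns")
print(label_counts)
```

The same two lines applied to each real frame would show how attack and `BENIGN` rows are balanced in every file.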

In [3]:
## check the data

DrDoS_DNS.head()

Unnamed: 0.1,Unnamed: 0,Flow ID,Source IP,Source Port,Destination IP,Destination Port,Protocol,Timestamp,Flow Duration,Total Fwd Packets,...,Active Std,Active Max,Active Min,Idle Mean,Idle Std,Idle Max,Idle Min,SimillarHTTP,Inbound,Label
0,425,172.16.0.5-192.168.50.1-634-60495-17,172.16.0.5,634,192.168.50.1,60495,17,2018-12-01 10:51:39.813448,28415,97,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,1,DrDoS_DNS
1,430,172.16.0.5-192.168.50.1-60495-634-17,192.168.50.1,634,172.16.0.5,60495,17,2018-12-01 10:51:39.820842,2,2,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,DrDoS_DNS
2,1654,172.16.0.5-192.168.50.1-634-46391-17,172.16.0.5,634,192.168.50.1,46391,17,2018-12-01 10:51:39.852499,48549,200,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,1,DrDoS_DNS
3,2927,172.16.0.5-192.168.50.1-634-11894-17,172.16.0.5,634,192.168.50.1,11894,17,2018-12-01 10:51:39.890213,48337,200,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,1,DrDoS_DNS
4,694,172.16.0.5-192.168.50.1-634-27878-17,172.16.0.5,634,192.168.50.1,27878,17,2018-12-01 10:51:39.941151,32026,200,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,1,DrDoS_DNS


In [4]:
DrDoS_LDAP.head()

Unnamed: 0.1,Unnamed: 0,Flow ID,Source IP,Source Port,Destination IP,Destination Port,Protocol,Timestamp,Flow Duration,Total Fwd Packets,...,Active Std,Active Max,Active Min,Idle Mean,Idle Std,Idle Max,Idle Min,SimillarHTTP,Inbound,Label
0,21010,172.16.0.5-192.168.50.1-0-0-0,172.16.0.5,0,192.168.50.1,0,0,2018-12-01 11:22:40.254769,9141643,85894,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,1,DrDoS_LDAP
1,20932,172.16.0.5-192.168.50.1-900-1808-17,172.16.0.5,900,192.168.50.1,1808,17,2018-12-01 11:22:40.255361,1,2,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,1,DrDoS_LDAP
2,27876,172.16.0.5-192.168.50.1-900-58766-17,172.16.0.5,900,192.168.50.1,58766,17,2018-12-01 11:22:40.255568,2,2,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,1,DrDoS_LDAP
3,24270,172.16.0.5-192.168.50.1-900-35228-17,172.16.0.5,900,192.168.50.1,35228,17,2018-12-01 11:22:40.256113,1,2,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,1,DrDoS_LDAP
4,5109,172.16.0.5-192.168.50.1-900-44969-17,172.16.0.5,900,192.168.50.1,44969,17,2018-12-01 11:22:40.256285,2,2,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,1,DrDoS_LDAP


In [5]:
DrDoS_MSSQL.head()

Unnamed: 0.1,Unnamed: 0,Flow ID,Source IP,Source Port,Destination IP,Destination Port,Protocol,Timestamp,Flow Duration,Total Fwd Packets,...,Active Std,Active Max,Active Min,Idle Mean,Idle Std,Idle Max,Idle Min,SimillarHTTP,Inbound,Label
0,139,172.16.0.5-192.168.50.1-0-0-0,172.16.0.5,0,192.168.50.1,0,0,2018-12-01 11:32:32.915441,119151083,60959,...,28536810.0,67834732.0,4024278.0,5975510.0,98.183502,5975622.0,5975358.0,0,1,DrDoS_MSSQL
1,38385,172.16.0.5-192.168.50.1-850-20345-17,172.16.0.5,850,192.168.50.1,20345,17,2018-12-01 11:32:32.915442,1,2,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,1,DrDoS_MSSQL
2,27033,172.16.0.5-192.168.50.1-851-21631-17,172.16.0.5,851,192.168.50.1,21631,17,2018-12-01 11:32:32.915578,1,2,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,1,DrDoS_MSSQL
3,34348,172.16.0.5-192.168.50.1-852-15332-17,172.16.0.5,852,192.168.50.1,15332,17,2018-12-01 11:32:32.915773,2,2,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,1,DrDoS_MSSQL
4,19225,172.16.0.5-192.168.50.1-853-41853-17,172.16.0.5,853,192.168.50.1,41853,17,2018-12-01 11:32:32.916114,2,2,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,1,DrDoS_MSSQL


In [6]:
DrDoS_NetBIOS.head()

Unnamed: 0.1,Unnamed: 0,Flow ID,Source IP,Source Port,Destination IP,Destination Port,Protocol,Timestamp,Flow Duration,Total Fwd Packets,...,Active Std,Active Max,Active Min,Idle Mean,Idle Std,Idle Max,Idle Min,SimillarHTTP,Inbound,Label
0,256565,172.16.0.5-192.168.50.1-34012-2334-17,172.16.0.5,34012,192.168.50.1,2334,17,2018-12-01 11:47:08.463789,1,2,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,1,DrDoS_NetBIOS
1,252918,172.16.0.5-192.168.50.1-34013-50170-17,172.16.0.5,34013,192.168.50.1,50170,17,2018-12-01 11:47:08.464316,1,2,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,1,DrDoS_NetBIOS
2,174257,172.16.0.5-192.168.50.1-34014-61534-17,172.16.0.5,34014,192.168.50.1,61534,17,2018-12-01 11:47:08.464472,1,2,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,1,DrDoS_NetBIOS
3,185193,172.16.0.5-192.168.50.1-34015-8930-17,172.16.0.5,34015,192.168.50.1,8930,17,2018-12-01 11:47:08.464520,1,2,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,1,DrDoS_NetBIOS
4,198671,172.16.0.5-192.168.50.1-34016-33040-17,172.16.0.5,34016,192.168.50.1,33040,17,2018-12-01 11:47:08.464925,2,2,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,1,DrDoS_NetBIOS


In [7]:
DrDoS_NTP.head()

Unnamed: 0.1,Unnamed: 0,Flow ID,Source IP,Source Port,Destination IP,Destination Port,Protocol,Timestamp,Flow Duration,Total Fwd Packets,...,Active Std,Active Max,Active Min,Idle Mean,Idle Std,Idle Max,Idle Min,SimillarHTTP,Inbound,Label
0,0,172.16.0.5-192.168.50.1-60675-80-6,172.16.0.5,60675,192.168.50.1,80,6,2018-12-01 09:17:11.183810,5220876,12,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,205.174.165.72/c.php,1,DrDoS_NTP
1,7,172.16.0.5-192.168.50.1-60676-80-6,172.16.0.5,60676,192.168.50.1,80,6,2018-12-01 09:17:11.205636,12644252,5,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,1,DrDoS_NTP
2,12858,192.168.50.7-65.55.163.78-50458-443-6,65.55.163.78,443,192.168.50.7,50458,6,2018-12-01 09:17:12.634569,3,2,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,1,BENIGN
3,10191,192.168.50.7-65.55.163.78-50465-443-6,65.55.163.78,443,192.168.50.7,50465,6,2018-12-01 09:17:13.458370,3,2,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,1,BENIGN
4,239,192.168.50.253-224.0.0.5-0-0-0,192.168.50.253,0,224.0.0.5,0,0,2018-12-01 09:17:13.470913,114329232,52,...,2.466441,15.0,6.0,9527428.0,248706.681286,9950741.0,9092248.0,0,0,BENIGN


In [8]:
DrDoS_SNMP.head()

Unnamed: 0.1,Unnamed: 0,Flow ID,Source IP,Source Port,Destination IP,Destination Port,Protocol,Timestamp,Flow Duration,Total Fwd Packets,...,Active Std,Active Max,Active Min,Idle Mean,Idle Std,Idle Max,Idle Min,SimillarHTTP,Inbound,Label
0,109857,172.16.0.5-192.168.50.1-528-47330-17,172.16.0.5,528,192.168.50.1,47330,17,2018-12-01 12:00:13.902782,2,2,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,1,DrDoS_SNMP
1,238192,172.16.0.5-192.168.50.1-529-26888-17,172.16.0.5,529,192.168.50.1,26888,17,2018-12-01 12:00:13.902785,1,2,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,1,DrDoS_SNMP
2,82963,172.16.0.5-192.168.50.1-530-3723-17,172.16.0.5,530,192.168.50.1,3723,17,2018-12-01 12:00:13.903230,1,2,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,1,DrDoS_SNMP
3,205787,172.16.0.5-192.168.50.1-531-38862-17,172.16.0.5,531,192.168.50.1,38862,17,2018-12-01 12:00:13.903311,1,2,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,1,DrDoS_SNMP
4,83411,172.16.0.5-192.168.50.1-532-35383-17,172.16.0.5,532,192.168.50.1,35383,17,2018-12-01 12:00:13.903490,1,2,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,1,DrDoS_SNMP


In [9]:
DrDoS_SSDP.head()

Unnamed: 0.1,Unnamed: 0,Flow ID,Source IP,Source Port,Destination IP,Destination Port,Protocol,Timestamp,Flow Duration,Total Fwd Packets,...,Active Std,Active Max,Active Min,Idle Mean,Idle Std,Idle Max,Idle Min,SimillarHTTP,Inbound,Label
0,72,172.16.0.5-192.168.50.1-0-0-0,172.16.0.5,0,192.168.50.1,0,0,2018-12-01 12:23:13.663425,119714230,49476,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,1,DrDoS_SSDP
1,55171,172.16.0.5-192.168.50.1-700-36081-17,172.16.0.5,700,192.168.50.1,36081,17,2018-12-01 12:23:13.663475,2,2,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,1,DrDoS_SSDP
2,39545,172.16.0.5-192.168.50.1-701-25269-17,172.16.0.5,701,192.168.50.1,25269,17,2018-12-01 12:23:13.663526,1,2,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,1,DrDoS_SSDP
3,20334,172.16.0.5-192.168.50.1-702-2533-17,172.16.0.5,702,192.168.50.1,2533,17,2018-12-01 12:23:13.663622,2,2,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,1,DrDoS_SSDP
4,18397,172.16.0.5-192.168.50.1-703-34942-17,172.16.0.5,703,192.168.50.1,34942,17,2018-12-01 12:23:13.663844,2,2,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,1,DrDoS_SSDP


In [10]:
DrDoS_UDP.head()

Unnamed: 0.1,Unnamed: 0,Flow ID,Source IP,Source Port,Destination IP,Destination Port,Protocol,Timestamp,Flow Duration,Total Fwd Packets,...,Active Std,Active Max,Active Min,Idle Mean,Idle Std,Idle Max,Idle Min,SimillarHTTP,Inbound,Label
0,101418,172.16.0.5-192.168.50.1-43443-6652-17,172.16.0.5,43443,192.168.50.1,6652,17,2018-12-01 12:36:57.628026,218395,6,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,1,DrDoS_UDP
1,21564,172.16.0.5-192.168.50.1-54741-9712-17,172.16.0.5,54741,192.168.50.1,9712,17,2018-12-01 12:36:57.628076,108219,4,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,1,DrDoS_UDP
2,23389,172.16.0.5-192.168.50.1-56589-4680-17,172.16.0.5,56589,192.168.50.1,4680,17,2018-12-01 12:36:57.628164,104579,4,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,1,DrDoS_UDP
3,48872,172.16.0.5-192.168.50.1-40233-2644-17,172.16.0.5,40233,192.168.50.1,2644,17,2018-12-01 12:36:57.628166,110967,4,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,1,DrDoS_UDP
4,80354,172.16.0.5-192.168.50.1-33989-16901-17,172.16.0.5,33989,192.168.50.1,16901,17,2018-12-01 12:36:57.628217,1,2,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,1,DrDoS_UDP


In [11]:
Syn.head()

Unnamed: 0.1,Unnamed: 0,Flow ID,Source IP,Source Port,Destination IP,Destination Port,Protocol,Timestamp,Flow Duration,Total Fwd Packets,...,Active Std,Active Max,Active Min,Idle Mean,Idle Std,Idle Max,Idle Min,SimillarHTTP,Inbound,Label
0,281052,172.16.0.5-192.168.50.1-53058-53058-6,172.16.0.5,53058,192.168.50.1,53058,6,2018-12-01 13:30:30.741451,115799309,19,...,646237.483665,1709809.0,1.0,14261170.0,3220326.0,21714933.0,11043464.0,0,1,Syn
1,450424,172.16.0.5-192.168.50.1-32237-32237-6,172.16.0.5,32237,192.168.50.1,32237,6,2018-12-01 13:30:30.741452,113973933,16,...,19.595918,49.0,1.0,16281980.0,2573891.0,20019405.0,11993631.0,0,1,Syn
2,182979,172.16.0.5-192.168.50.1-60495-9840-6,172.16.0.5,60495,192.168.50.1,9840,6,2018-12-01 13:30:30.741501,112,2,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,1,Syn
3,41540,172.16.0.5-192.168.50.1-59724-59724-6,172.16.0.5,59724,192.168.50.1,59724,6,2018-12-01 13:30:30.741563,105985004,16,...,17.705259,48.0,1.0,15140710.0,3077366.0,20954123.0,11120336.0,0,1,Syn
4,358711,172.16.0.5-192.168.50.1-60496-32538-6,172.16.0.5,60496,192.168.50.1,32538,6,2018-12-01 13:30:30.741565,1,2,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,1,Syn


In [12]:
TFTPdf.head()

Unnamed: 0.1,Unnamed: 0,Flow ID,Source IP,Source Port,Destination IP,Destination Port,Protocol,Timestamp,Flow Duration,Total Fwd Packets,...,Active Std,Active Max,Active Min,Idle Mean,Idle Std,Idle Max,Idle Min,SimillarHTTP,Inbound,Label
0,46970,172.16.0.5-192.168.50.1-64670-64670-6,172.16.0.5,64670,192.168.50.1,64670,6,2018-12-01 13:34:27.403713,13120080,4,...,0.0,1.0,1.0,13120078.0,0.0,13120078.0,13120078.0,0,1,TFTP
1,164775,172.16.0.5-192.168.50.1-16405-16405-6,172.16.0.5,16405,192.168.50.1,16405,6,2018-12-01 13:34:27.404124,3990342,4,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,1,TFTP
2,177893,172.16.0.5-192.168.50.1-45838-45838-6,172.16.0.5,45838,192.168.50.1,45838,6,2018-12-01 13:34:27.404125,11842236,4,...,0.0,1.0,1.0,11842234.0,0.0,11842234.0,11842234.0,0,1,TFTP
3,82756,172.16.0.5-192.168.50.1-27049-27049-6,172.16.0.5,27049,192.168.50.1,27049,6,2018-12-01 13:34:27.404177,3579644,4,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,1,TFTP
4,153782,172.16.0.5-192.168.50.1-43305-27049-6,172.16.0.5,43305,192.168.50.1,27049,6,2018-12-01 13:34:27.404228,90,2,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,1,TFTP


In [13]:
UDPLag.head()

Unnamed: 0.1,Unnamed: 0,Flow ID,Source IP,Source Port,Destination IP,Destination Port,Protocol,Timestamp,Flow Duration,Total Fwd Packets,...,Active Std,Active Max,Active Min,Idle Mean,Idle Std,Idle Max,Idle Min,SimillarHTTP,Inbound,Label
0,186059,172.16.0.5-192.168.50.1-58445-4463-17,172.16.0.5,58445,192.168.50.1,4463,17,2018-12-01 13:04:45.928673,1,2,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,1,UDP-lag
1,135692,172.16.0.5-192.168.50.1-36908-9914-17,172.16.0.5,36908,192.168.50.1,9914,17,2018-12-01 13:04:45.928913,1,2,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,1,UDP-lag
2,33822,172.16.0.5-192.168.50.1-41727-32361-17,172.16.0.5,41727,192.168.50.1,32361,17,2018-12-01 13:04:45.928915,2,2,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,1,UDP-lag
3,24498,172.16.0.5-192.168.50.1-55447-5691-17,172.16.0.5,55447,192.168.50.1,5691,17,2018-12-01 13:04:45.929024,2,2,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,1,UDP-lag
4,117372,172.16.0.5-192.168.50.1-58794-56335-17,172.16.0.5,58794,192.168.50.1,56335,17,2018-12-01 13:04:45.929096,1,2,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,1,UDP-lag


In [14]:
LDAPdf.head()

Unnamed: 0.1,Unnamed: 0,Flow ID,Source IP,Source Port,Destination IP,Destination Port,Protocol,Timestamp,Flow Duration,Total Fwd Packets,...,Active Std,Active Max,Active Min,Idle Mean,Idle Std,Idle Max,Idle Min,SimillarHTTP,Inbound,Label
0,13605,172.16.0.5-192.168.50.4-870-2908-17,172.16.0.5,870,192.168.50.4,2908,17,2018-11-03 10:09:00.565557,1,2,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,1,NetBIOS
1,62631,172.16.0.5-192.168.50.4-871-53796-17,172.16.0.5,871,192.168.50.4,53796,17,2018-11-03 10:09:00.565559,48,2,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,1,NetBIOS
2,143869,172.16.0.5-192.168.50.4-648-40660-17,172.16.0.5,648,192.168.50.4,40660,17,2018-11-03 10:09:00.565608,1,2,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,1,NetBIOS
3,16171,172.16.0.5-192.168.50.4-872-54308-17,172.16.0.5,872,192.168.50.4,54308,17,2018-11-03 10:09:00.565993,1,2,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,1,NetBIOS
4,80845,172.16.0.5-192.168.50.4-873-40653-17,172.16.0.5,873,192.168.50.4,40653,17,2018-11-03 10:09:00.565994,1,2,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,1,NetBIOS


In [15]:
MSSQLdf.head()

Unnamed: 0.1,Unnamed: 0,Flow ID,Source IP,Source Port,Destination IP,Destination Port,Protocol,Timestamp,Flow Duration,Total Fwd Packets,...,Active Std,Active Max,Active Min,Idle Mean,Idle Std,Idle Max,Idle Min,SimillarHTTP,Inbound,Label
0,98115,172.16.0.5-192.168.50.4-615-28754-17,172.16.0.5,615,192.168.50.4,28754,17,2018-11-03 10:29:52.072724,3,2,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,1,LDAP
1,137,172.16.0.5-192.168.50.4-0-0-0,172.16.0.5,0,192.168.50.4,0,0,2018-11-03 10:29:52.072729,117876168,25274,...,0.0,81408014.0,81408014.0,6258062.0,0.0,6258062.0,6258062.0,0,1,LDAP
2,98988,172.16.0.5-192.168.50.4-900-42364-17,172.16.0.5,900,192.168.50.4,42364,17,2018-11-03 10:29:52.072825,2,2,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,1,LDAP
3,35177,172.16.0.5-192.168.50.4-616-10537-17,172.16.0.5,616,192.168.50.4,10537,17,2018-11-03 10:29:52.073221,3,2,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,1,LDAP
4,55362,172.16.0.5-192.168.50.4-617-14928-17,172.16.0.5,617,192.168.50.4,14928,17,2018-11-03 10:29:52.073285,44,2,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,1,LDAP


In [16]:
NetBIOS.head()

Unnamed: 0.1,Unnamed: 0,Flow ID,Source IP,Source Port,Destination IP,Destination Port,Protocol,Timestamp,Flow Duration,Total Fwd Packets,...,Active Std,Active Max,Active Min,Idle Mean,Idle Std,Idle Max,Idle Min,SimillarHTTP,Inbound,Label
0,302291,172.16.0.5-192.168.50.4-648-16174-17,172.16.0.5,648,192.168.50.4,16174,17,2018-11-03 10:01:48.920574,1,2,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,1,NetBIOS
1,341625,172.16.0.5-192.168.50.4-861-34200-17,172.16.0.5,861,192.168.50.4,34200,17,2018-11-03 10:01:48.920625,1,2,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,1,NetBIOS
2,245313,172.16.0.5-192.168.50.4-862-4750-17,172.16.0.5,862,192.168.50.4,4750,17,2018-11-03 10:01:48.920685,1,2,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,1,NetBIOS
3,266106,172.16.0.5-192.168.50.4-863-4443-17,172.16.0.5,863,192.168.50.4,4443,17,2018-11-03 10:01:48.921008,1,2,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,1,NetBIOS
4,47182,172.16.0.5-192.168.50.4-864-48627-17,172.16.0.5,864,192.168.50.4,48627,17,2018-11-03 10:01:48.921010,48,2,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,1,NetBIOS


In [17]:
Portmap.head()

Unnamed: 0.1,Unnamed: 0,Flow ID,Source IP,Source Port,Destination IP,Destination Port,Protocol,Timestamp,Flow Duration,Total Fwd Packets,...,Active Std,Active Max,Active Min,Idle Mean,Idle Std,Idle Max,Idle Min,SimillarHTTP,Inbound,Label
0,24,192.168.50.254-224.0.0.5-0-0-0,192.168.50.254,0,224.0.0.5,0,0,2018-11-03 09:18:16.964447,114456999,45,...,28337.112288,98168.0,3.0,9529897.25,351582.631269,10001143.0,9048097.0,0,0,BENIGN
1,26,192.168.50.253-224.0.0.5-0-0-0,192.168.50.253,0,224.0.0.5,0,0,2018-11-03 09:18:18.506537,114347504,56,...,121314.911865,420255.0,4.0,9493929.75,351541.079539,9978130.0,8820294.0,0,0,BENIGN
2,176563,172.217.10.98-192.168.50.6-443-54799-6,192.168.50.6,54799,172.217.10.98,443,6,2018-11-03 09:18:18.610576,36435473,6,...,0.0,62416.0,62416.0,36373056.0,0.0,36373056.0,36373056.0,0,0,BENIGN
3,50762,172.217.7.2-192.168.50.6-443-54800-6,192.168.50.6,54800,172.217.7.2,443,6,2018-11-03 09:18:18.610579,36434705,6,...,0.0,62413.0,62413.0,36372291.0,0.0,36372291.0,36372291.0,0,0,BENIGN
4,87149,172.217.10.98-192.168.50.6-443-54801-6,192.168.50.6,54801,172.217.10.98,443,6,2018-11-03 09:18:18.610581,36434626,6,...,0.0,62409.0,62409.0,36372216.0,0.0,36372216.0,36372216.0,0,0,BENIGN


In [18]:
Syn_2.head()

Unnamed: 0.1,Unnamed: 0,Flow ID,Source IP,Source Port,Destination IP,Destination Port,Protocol,Timestamp,Flow Duration,Total Fwd Packets,...,Active Std,Active Max,Active Min,Idle Mean,Idle Std,Idle Max,Idle Min,SimillarHTTP,Inbound,Label
0,445444,172.16.0.5-192.168.50.4-9429-9429-6,172.16.0.5,9429,192.168.50.4,9429,6,2018-11-03 11:36:28.607338,36063894,7,...,29.444864,52.0,1.0,12021280.0,6253623.0,18628035.0,6193840.0,0,1,Syn
1,113842,172.16.0.5-192.168.50.4-60224-60224-6,172.16.0.5,60224,192.168.50.4,60224,6,2018-11-03 11:36:28.607339,44851366,8,...,0.0,1.0,1.0,20662680.0,11697830.0,28934293.0,12391060.0,0,1,Syn
2,176377,172.16.0.5-192.168.50.4-33827-11746-6,192.168.50.4,11746,172.16.0.5,33827,6,2018-11-03 11:36:28.607388,1,2,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,Syn
3,24777,172.16.0.5-192.168.50.4-33828-1431-6,172.16.0.5,33828,192.168.50.4,1431,6,2018-11-03 11:36:28.607391,0,2,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,1,Syn
4,85100,172.16.0.5-192.168.50.4-5311-5311-6,172.16.0.5,5311,192.168.50.4,5311,6,2018-11-03 11:36:28.607442,35731470,8,...,33.234019,48.0,1.0,11910470.0,1849493.0,13693985.0,10001398.0,0,1,Syn


In [19]:
UDPdf.head()

Unnamed: 0.1,Unnamed: 0,Flow ID,Source IP,Source Port,Destination IP,Destination Port,Protocol,Timestamp,Flow Duration,Total Fwd Packets,...,Active Std,Active Max,Active Min,Idle Mean,Idle Std,Idle Max,Idle Min,SimillarHTTP,Inbound,Label
0,15798,172.16.0.5-192.168.50.4-9401-15931-17,172.16.0.5,9401,192.168.50.4,15931,17,2018-11-03 10:42:57.176671,1,2,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,1,MSSQL
1,110891,172.16.0.5-192.168.50.4-9402-29997-17,172.16.0.5,9402,192.168.50.4,29997,17,2018-11-03 10:42:57.176673,0,2,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,1,MSSQL
2,66956,172.16.0.5-192.168.50.4-9403-29887-17,172.16.0.5,9403,192.168.50.4,29887,17,2018-11-03 10:42:57.176727,1,2,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,1,MSSQL
3,66144,172.16.0.5-192.168.50.4-9404-7393-17,172.16.0.5,9404,192.168.50.4,7393,17,2018-11-03 10:42:57.176729,1,2,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,1,MSSQL
4,72903,172.16.0.5-192.168.50.4-9405-57957-17,172.16.0.5,9405,192.168.50.4,57957,17,2018-11-03 10:42:57.177121,1,2,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,1,MSSQL


In [20]:
UDPLag_2.head()

Unnamed: 0.1,Unnamed: 0,Flow ID,Source IP,Source Port,Destination IP,Destination Port,Protocol,Timestamp,Flow Duration,Total Fwd Packets,...,Active Std,Active Max,Active Min,Idle Mean,Idle Std,Idle Max,Idle Min,SimillarHTTP,Inbound,Label
0,50880,172.16.0.5-192.168.50.4-35468-49856-17,172.16.0.5,35468,192.168.50.4,49856,17,2018-11-03 11:01:43.652742,47,2,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,1,UDP
1,83164,172.16.0.5-192.168.50.4-44167-44225-17,172.16.0.5,44167,192.168.50.4,44225,17,2018-11-03 11:01:43.653107,2,2,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,1,UDP
2,49389,172.16.0.5-192.168.50.4-36215-28771-17,172.16.0.5,36215,192.168.50.4,28771,17,2018-11-03 11:01:43.653383,2,2,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,1,UDP
3,34687,172.16.0.5-192.168.50.4-44168-43679-17,172.16.0.5,44168,192.168.50.4,43679,17,2018-11-03 11:01:43.653386,2,2,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,1,UDP
4,87466,172.16.0.5-192.168.50.4-52334-44960-17,172.16.0.5,52334,192.168.50.4,44960,17,2018-11-03 11:01:43.653387,880701,18,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,1,UDP


#### Joining the datasets

The training and testing datasets are joined into a single dataset so that preprocessing and data exploration can be carried out on it.

In [3]:
# Concatenate the DataFrames three at a time
Concatfull = pd.concat([DrDoS_DNS, DrDoS_LDAP, DrDoS_MSSQL])


In [4]:
# Concatenate the next three DataFrames (reassigns Concatfull)
Concatfull = pd.concat([DrDoS_NetBIOS, DrDoS_NTP, DrDoS_SNMP])



In [5]:
# Concatenate the next three DataFrames (reassigns Concatfull)
Concatfull = pd.concat([DrDoS_SSDP, DrDoS_UDP, Syn])


In [6]:
# Concatenate the next three DataFrames (reassigns Concatfull)
Concatfull = pd.concat([TFTPdf, UDPLag, LDAPdf])


In [7]:
# Concatenate the next three DataFrames (reassigns Concatfull)
Concatfull = pd.concat([MSSQLdf, NetBIOS, Portmap])


In [8]:
# Concatenate the last three DataFrames (reassigns Concatfull)
Concatfull = pd.concat([Syn_2, UDPdf, UDPLag_2])


In [9]:
# Save the complete DataFrame to a CSV file
Concatfull.to_csv('archivo_completo.csv', index=False)
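
As written, each cell above assigns a fresh result to `Concatfull`, so the value saved to `archivo_completo.csv` is the output of the last `pd.concat` call only. A minimal sketch, on toy stand-in frames, of two ways to keep every frame; the frame contents here are illustrative, not taken from the dataset.

```python
import pandas as pd

# Tiny stand-ins for the per-attack frames loaded earlier.
frames = [
    pd.DataFrame({" Label": ["DrDoS_DNS"] * 2}),
    pd.DataFrame({" Label": ["Syn"] * 3}),
    pd.DataFrame({" Label": ["UDP"]}),
]

# Option 1: one call over the whole list.
combined = pd.concat(frames, ignore_index=True)

# Option 2: grow an accumulator a few frames at a time, always
# passing the accumulator itself to the next call.
accumulated = pd.concat(frames[:2], ignore_index=True)
accumulated = pd.concat([accumulated, frames[2]], ignore_index=True)
```

`ignore_index=True` gives the result a fresh 0..n-1 index instead of repeating each source frame's own row numbers. Option 2 matches the "three at a time" approach when memory is tight, at the cost of copying the accumulator on each step.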

### Once everything is in a single dataset, preprocessing proceeds

In [3]:
filename = 'archivo_completo.csv'

data = pd.read_csv(filename, index_col=False)
# data = pd.read_csv(filename, index_col=False, nrows=10000)
print('Dataframe shape',data.shape)
data.head()



Dataframe shape (8827912, 88)


Unnamed: 0.1,Unnamed: 0,Flow ID,Source IP,Source Port,Destination IP,Destination Port,Protocol,Timestamp,Flow Duration,Total Fwd Packets,...,Active Std,Active Max,Active Min,Idle Mean,Idle Std,Idle Max,Idle Min,SimillarHTTP,Inbound,Label
0,445444,172.16.0.5-192.168.50.4-9429-9429-6,172.16.0.5,9429,192.168.50.4,9429,6,2018-11-03 11:36:28.607338,36063894,7,...,29.444864,52.0,1.0,12021280.0,6253623.0,18628035.0,6193840.0,0,1,Syn
1,113842,172.16.0.5-192.168.50.4-60224-60224-6,172.16.0.5,60224,192.168.50.4,60224,6,2018-11-03 11:36:28.607339,44851366,8,...,0.0,1.0,1.0,20662680.0,11697830.0,28934293.0,12391060.0,0,1,Syn
2,176377,172.16.0.5-192.168.50.4-33827-11746-6,192.168.50.4,11746,172.16.0.5,33827,6,2018-11-03 11:36:28.607388,1,2,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,Syn
3,24777,172.16.0.5-192.168.50.4-33828-1431-6,172.16.0.5,33828,192.168.50.4,1431,6,2018-11-03 11:36:28.607391,0,2,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,1,Syn
4,85100,172.16.0.5-192.168.50.4-5311-5311-6,172.16.0.5,5311,192.168.50.4,5311,6,2018-11-03 11:36:28.607442,35731470,8,...,33.234019,48.0,1.0,11910470.0,1849493.0,13693985.0,10001398.0,0,1,Syn


In [4]:
# Drop the columns that are not useful.
df = data.drop(columns=['Flow ID', 'SimillarHTTP',' Fwd IAT Min','Bwd IAT Total',' Bwd IAT Mean',' Bwd IAT Std',' Bwd IAT Max',' Bwd IAT Min','Fwd PSH Flags',' Bwd PSH Flags',' Fwd URG Flags',' Bwd URG Flags',' Bwd Header Length',' Min Packet Length',' Max Packet Length',' Packet Length Mean',' Packet Length Std',' Packet Length Variance','FIN Flag Count',' SYN Flag Count',' RST Flag Count',' PSH Flag Count',' ACK Flag Count',' URG Flag Count',' CWE Flag Count',' ECE Flag Count',' Down/Up Ratio','Fwd Avg Bytes/Bulk',' Fwd Avg Packets/Bulk',' Fwd Avg Bulk Rate',' Bwd Avg Bytes/Bulk',' Bwd Avg Packets/Bulk','Bwd Avg Bulk Rate',' Subflow Bwd Packets',' Subflow Bwd Bytes',' Active Std'])

In [5]:
# Replace every inf with NaN and drop rows with NaN to avoid problems later on
df.replace([np.inf, -np.inf], np.nan, inplace=True)
df.dropna(inplace=True)
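
A small sketch of what these two steps do, on a toy frame. Infinite values may appear in rate columns such as `Flow Bytes/s`, for instance when a flow's duration is zero; that cause is an assumption about how the features were computed, not something the notebook states.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Flow Bytes/s": [1200.0, np.inf, 64.0, np.nan],
    " Flow Packets/s": [10.0, 5.0, -np.inf, 2.0],
})

# Same two steps as the cell above, chained and without inplace=True:
# first turn +/-inf into NaN, then drop any row that contains a NaN.
cleaned = toy.replace([np.inf, -np.inf], np.nan).dropna()

print(cleaned)
```

Only the first row survives, because every other row holds an infinite or missing value in at least one column.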

In [6]:
# Check whether the DataFrame has any missing values
print(df.isnull().sum())

Unnamed: 0                      0
 Source IP                      0
 Source Port                    0
 Destination IP                 0
 Destination Port               0
 Protocol                       0
 Timestamp                      0
 Flow Duration                  0
 Total Fwd Packets              0
 Total Backward Packets         0
Total Length of Fwd Packets     0
 Total Length of Bwd Packets    0
 Fwd Packet Length Max          0
 Fwd Packet Length Min          0
 Fwd Packet Length Mean         0
 Fwd Packet Length Std          0
Bwd Packet Length Max           0
 Bwd Packet Length Min          0
 Bwd Packet Length Mean         0
 Bwd Packet Length Std          0
Flow Bytes/s                    0
 Flow Packets/s                 0
 Flow IAT Mean                  0
 Flow IAT Std                   0
 Flow IAT Max                   0
 Flow IAT Min                   0
Fwd IAT Total                   0
 Fwd IAT Mean                   0
 Fwd IAT Std                    0
 Fwd IAT Max  

In [7]:
for column in [' Source Port', ' Destination Port', ' Total Fwd Packets', 'Total Length of Fwd Packets', 'Fwd Packets/s', 'Flow Bytes/s', ' Flow Packets/s']:
    print(f"Column: {column}")
    print(f"Number of missing values: {df[column].isna().sum()}")

Column:  Source Port
Number of missing values: 0
Column:  Destination Port
Number of missing values: 0
Column:  Total Fwd Packets
Number of missing values: 0
Column: Total Length of Fwd Packets
Number of missing values: 0
Column: Fwd Packets/s
Number of missing values: 0
Column: Flow Bytes/s
Number of missing values: 0
Column:  Flow Packets/s
Number of missing values: 0


In [8]:
for column in [' Source Port', ' Destination Port', ' Total Fwd Packets', 'Total Length of Fwd Packets', 'Fwd Packets/s', 'Flow Bytes/s', ' Flow Packets/s']:
    print(f"Column: {column}")
    print(f"Maximum value: {df[column].max()}")
    print(f"Minimum value: {df[column].min()}")

Column:  Source Port
Maximum value: 65534
Minimum value: 0
Column:  Destination Port
Maximum value: 65535
Minimum value: 0
Column:  Total Fwd Packets
Maximum value: 15614
Minimum value: 1
Column: Total Length of Fwd Packets
Maximum value: 208524.0
Minimum value: 0.0
Column: Fwd Packets/s
Maximum value: 4000000.0
Minimum value: 0.0099627624815555
Column: Flow Bytes/s
Maximum value: 2944000000.0
Minimum value: 0.0
Column:  Flow Packets/s
Maximum value: 4000000.0
Minimum value: 0.0345645852106763


In [9]:
# Scale the relevant columns to the [0, 1] range
scale_columns = [' Source Port', ' Destination Port', ' Total Fwd Packets', 'Total Length of Fwd Packets', 'Fwd Packets/s', ' Flow Packets/s']
scaler = MinMaxScaler()
df[scale_columns] = scaler.fit_transform(df[scale_columns])
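
With its default `feature_range=(0, 1)`, `MinMaxScaler` rescales each column independently as (x - min) / (max - min). A small pure-Python sketch of that formula, for illustration only; `min_max_scale` is a hypothetical helper, not part of scikit-learn.

```python
def min_max_scale(values):
    """Scale values to [0, 1] with (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant column: map every value to 0.0 to avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Destination ports: the notebook reports a minimum of 0 and a maximum of 65535.
ports = [0, 80, 443, 65535]
scaled = min_max_scale(ports)

print(scaled)
```

Because the minimum and maximum come from the data the scaler was fitted on, the scaled port values shown later in this notebook (for example `0.715801`) are fractions of the observed port range.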

In [10]:
# Convert the 'Timestamp' column to a datetime type
df[' Timestamp'] = pd.to_datetime(df[' Timestamp'], format='%Y-%m-%d %H:%M:%S')
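
The timestamps shown in the tables above carry microseconds (for example `2018-11-03 11:36:28.607338`), which the format string `'%Y-%m-%d %H:%M:%S'` does not mention. Whether pandas accepts that mismatch may depend on the pandas version; the notebook does not say which version was used, so this is an assumption. A minimal sketch that spells out the fractional part with `%f`, on two sample values:

```python
import pandas as pd

stamps = pd.Series([
    "2018-11-03 11:36:28.607338",
    "2018-11-03 11:36:28.607339",
])

# %f matches the fractional-second part (up to six digits).
parsed = pd.to_datetime(stamps, format="%Y-%m-%d %H:%M:%S.%f")

print(parsed.dtype)
print(parsed.iloc[1] - parsed.iloc[0])
```

A row whose timestamp has no fractional part would not match this stricter format, so it is worth checking the column for such rows before relying on it.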


In [2]:
df.head()


In [12]:
# Save the cleaned .csv, ready for use with the models
df.to_csv('archivo_completo_limpio.csv', index=False)

### Once the data is clean, the models mentioned in the earlier phases are applied

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv("archivo_completo_limpio.csv")
print('Ready!')

Ready!


In [3]:
print(df.shape)
print(df.sample(5))

(8417087, 52)
         Unnamed: 0   Source IP   Source Port  Destination IP  \
3765475      388250  172.16.0.5      0.947493    192.168.50.4   
5109811       43569  172.16.0.5      0.689947    192.168.50.4   
1941056      117124  172.16.0.5      0.178228    192.168.50.4   
2099496      446815  172.16.0.5      0.603961    192.168.50.4   
6128142        2378  172.16.0.5      0.785775    192.168.50.4   

          Destination Port   Protocol                   Timestamp  \
3765475           0.025589          6  2018-11-03 11:36:43.240641   
5109811           0.406836         17  2018-11-03 10:56:12.473757   
1941056           0.772564          6  2018-11-03 11:33:09.943246   
2099496           0.450217          6  2018-11-03 11:33:28.795247   
6128142           0.905013         17  2018-11-03 10:58:12.666898   

          Flow Duration   Total Fwd Packets   Total Backward Packets  ...  \
3765475        30040179            0.000320                        2  ...   
5109811          108000   

In [31]:
# Keep only the features used for the SVM, plus the label.
# The explicit .copy() keeps later column assignments on df_new from touching df.
feature_cols = [
    ' Source IP', ' Destination IP', ' Destination Port', ' Protocol',
    ' Timestamp', ' Flow Duration', ' Total Fwd Packets',
    ' Total Backward Packets', 'Total Length of Fwd Packets',
    ' Total Length of Bwd Packets', ' Fwd Packet Length Max',
    ' Fwd Packet Length Min', ' Fwd Packet Length Mean',
    ' Fwd Packet Length Std', 'Bwd Packet Length Max',
    ' Bwd Packet Length Min', ' Bwd Packet Length Mean',
    ' Bwd Packet Length Std', 'Flow Bytes/s', ' Flow Packets/s', ' Label',
]
df_new = df.loc[:, feature_cols].copy()
print(df_new.columns)

Index([' Source IP', ' Destination IP', ' Destination Port', ' Protocol',
       ' Timestamp', ' Flow Duration', ' Total Fwd Packets',
       ' Total Backward Packets', 'Total Length of Fwd Packets',
       ' Total Length of Bwd Packets', ' Fwd Packet Length Max',
       ' Fwd Packet Length Min', ' Fwd Packet Length Mean',
       ' Fwd Packet Length Std', 'Bwd Packet Length Max',
       ' Bwd Packet Length Min', ' Bwd Packet Length Mean',
       ' Bwd Packet Length Std', 'Flow Bytes/s', ' Flow Packets/s', ' Label'],
      dtype='object')


In [32]:
print(df_new.sample(5))

          Source IP  Destination IP   Destination Port   Protocol  \
7772135  172.16.0.5    192.168.50.4           0.715801         17   
3753325  172.16.0.5    192.168.50.4           0.518776          6   
7603378  172.16.0.5    192.168.50.4           0.595895         17   
6782670  172.16.0.5    192.168.50.4           0.567529         17   
4445552  172.16.0.5    192.168.50.4           0.805295         17   

                          Timestamp   Flow Duration   Total Fwd Packets  \
7772135  2018-11-03 11:01:47.511280               1            0.000064   
3753325  2018-11-03 11:36:41.979273               1            0.000064   
7603378  2018-11-03 11:01:24.884331               1            0.000064   
6782670  2018-11-03 10:59:36.812458          216804            0.000320   
4445552  2018-11-03 10:54:56.275082          216008            0.000320   

          Total Backward Packets  Total Length of Fwd Packets  \
7772135                        0                     0.003673   
3753

In [33]:
max_valor = df_new[' Flow Packets/s'].max()
print(max_valor)

1.0


In [34]:

from sklearn.preprocessing import MinMaxScaler

# Min-max scale the remaining numeric columns. MinMaxScaler scales each
# column independently, so a single fit_transform call covers all of them.
cols_to_scale = [
    ' Flow Duration', ' Total Backward Packets', ' Total Length of Bwd Packets',
    ' Fwd Packet Length Max', ' Fwd Packet Length Min', ' Fwd Packet Length Mean',
    ' Fwd Packet Length Std', 'Bwd Packet Length Max', ' Bwd Packet Length Min',
    ' Bwd Packet Length Mean', ' Bwd Packet Length Std', 'Flow Bytes/s',
]
scaler = MinMaxScaler()
df_new[cols_to_scale] = scaler.fit_transform(df_new[cols_to_scale])



In [38]:
print(df_new.sample(5))

          Source IP  Destination IP   Destination Port   Protocol  \
8333794  172.16.0.5    192.168.50.4           0.956313          6   
7474002  172.16.0.5    192.168.50.4           0.356924         17   
1159742  172.16.0.5    192.168.50.4           0.160601          6   
6742956  172.16.0.5    192.168.50.4           0.314519         17   
7242477  172.16.0.5    192.168.50.4           0.397650         17   

                          Timestamp   Flow Duration   Total Fwd Packets  \
8333794  2018-11-03 11:30:15.269227    0.000000e+00            0.000064   
7474002  2018-11-03 11:01:06.670215    3.583408e-07            0.000064   
1159742  2018-11-03 11:31:45.033575    1.008354e-06            0.000064   
6742956  2018-11-03 10:59:32.883508    1.769637e-03            0.000320   
7242477  2018-11-03 11:00:35.318697    0.000000e+00            0.000064   

          Total Backward Packets  Total Length of Fwd Packets  \
8333794                 0.000000                     0.000058   
7474

In [39]:
# Quick Backup
df_new.to_csv('df_SVM.csv', index=False)
print('Ready!')

Ready!


In [4]:
df = pd.read_csv("df_SVM.csv")
print('Ready!')

Ready!


In [5]:
new_df = df.copy()
df.shape

(8417087, 21)

In [6]:
df = new_df.copy()

In [7]:
df = df.sample(frac=0.0001, random_state=42)

In [8]:
print(df.shape)

(842, 21)


In [9]:
# Encode each IPv4 address as an integer: zero-pad every octet to three
# digits and concatenate them, e.g. '172.16.0.5' -> 172016000005
def ip_to_int(ip):
    return int(''.join('{:03}'.format(int(octet)) for octet in ip.split('.')))

df[' Source IP'] = df[' Source IP'].apply(ip_to_int)
df[' Destination IP'] = df[' Destination IP'].apply(ip_to_int)
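
The zero-padded concatenation above produces values on the order of 10^11, which do not fit in 32 bits. An alternative, shown here only as a sketch and not what the notebook uses, is the standard 32-bit integer form of an IPv4 address from Python's `ipaddress` module. The helper name `ipv4_to_uint32` is hypothetical.

```python
import ipaddress

def ipv4_to_uint32(ip: str) -> int:
    """Return the 32-bit unsigned integer value of a dotted-quad IPv4 address."""
    return int(ipaddress.IPv4Address(ip))

# The two addresses that appear in the sampled rows of the notebook
src = ipv4_to_uint32('172.16.0.5')
dst = ipv4_to_uint32('192.168.50.4')
print(src, dst)
```

Either encoding imposes an artificial numeric ordering on addresses, so whether IPs should be model features at all remains a separate design decision.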

In [10]:
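# Parse the timestamps, then convert them to Unix epoch seconds
# (int64 nanoseconds since 1970-01-01, integer-divided by 10**9)
df[' Timestamp'] = pd.to_datetime(df[' Timestamp'])
df[' Timestamp'] = df[' Timestamp'].astype('int64') // 10**9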


In [11]:
from sklearn.preprocessing import LabelEncoder

# Create a LabelEncoder object
le = LabelEncoder()

# Encode the Label column as integer class codes
df[' Label'] = le.fit_transform(df[' Label'])

print(df[' Label'])
counts = df[' Label'].value_counts()



1980149    2
8020176    2
8387637    2
7376677    3
6776987    3
          ..
488245     2
1016506    2
8372878    2
830918     2
4830095    3
Name:  Label, Length: 842, dtype: int32
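
The encoded output only shows integer codes. Below is a minimal sketch of how the code-to-name mapping could be recovered from a fitted `LabelEncoder`. The label names are placeholders chosen for illustration, not necessarily the classes present in this sample, and `le_demo`, `example_labels`, and `label_mapping` are hypothetical names.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical label values, used only to illustrate the mapping
example_labels = ['UDP', 'Syn', 'BENIGN', 'Syn', 'MSSQL', 'UDP']

le_demo = LabelEncoder()
codes = le_demo.fit_transform(example_labels)

# classes_ holds the original labels in sorted order; the position is the code
label_mapping = {str(name): code for code, name in enumerate(le_demo.classes_)}
print(label_mapping)

# inverse_transform turns integer codes back into the original labels
decoded = [str(name) for name in le_demo.inverse_transform(codes)]
print(decoded)
```

In the notebook, the same expression applied to `le` would show which class each of the codes in the printed column stands for.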


In [12]:
df

Unnamed: 0,Source IP,Destination IP,Destination Port,Protocol,Timestamp,Flow Duration,Total Fwd Packets,Total Backward Packets,Total Length of Fwd Packets,Total Length of Bwd Packets,...,Fwd Packet Length Min,Fwd Packet Length Mean,Fwd Packet Length Std,Bwd Packet Length Max,Bwd Packet Length Min,Bwd Packet Length Mean,Bwd Packet Length Std,Flow Bytes/s,Flow Packets/s,Label
1980149,172016000005,192168050004,0.108446,6,1541244794,8.333506e-07,0.000064,0.000249,0.000058,9.307785e-07,...,0.002816,0.002816,0.000000,0.001654,0.00411,0.003245,0.0,0.000081,0.009901,2
8020176,172016000005,192168050004,0.921935,6,1541244573,0.000000e+00,0.000064,0.000000,0.000058,0.000000e+00,...,0.002816,0.002816,0.000000,0.000000,0.00000,0.000000,0.0,0.004076,0.500000,2
8387637,172016000005,192168050004,0.857923,6,1541244622,0.000000e+00,0.000064,0.000000,0.000058,0.000000e+00,...,0.002816,0.002816,0.000000,0.000000,0.00000,0.000000,0.0,0.004076,0.500000,2
7376677,172016000005,192168050004,0.450477,17,1541242853,1.822304e-03,0.000320,0.000000,0.010013,0.000000e+00,...,0.150634,0.163304,0.024971,0.000000,0.00000,0.000000,0.0,0.000003,0.000007,3
6776987,172016000005,192168050004,0.762646,17,1541242776,8.867600e-04,0.000192,0.000000,0.006704,0.000000e+00,...,0.154857,0.164008,0.016024,0.000000,0.00000,0.000000,0.0,0.000004,0.000009,3
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
488245,172016000005,192168050004,0.956115,6,1541244626,4.166753e-07,0.000064,0.000000,0.000058,0.000000e+00,...,0.002816,0.002816,0.000000,0.000000,0.00000,0.000000,0.0,0.000080,0.009804,2
1016506,172016000005,192168050004,0.624826,6,1541244690,0.000000e+00,0.000064,0.000000,0.000058,0.000000e+00,...,0.002816,0.002816,0.000000,0.000000,0.00000,0.000000,0.0,0.004076,0.500000,2
8372878,172016000005,192168050004,0.566064,6,1541244620,4.583428e-07,0.000064,0.000249,0.000058,9.307785e-07,...,0.002816,0.002816,0.000000,0.001654,0.00411,0.003245,0.0,0.000146,0.017857,2
830918,172016000005,192168050004,0.650919,6,1541244664,0.000000e+00,0.000064,0.000000,0.000058,0.000000e+00,...,0.002816,0.002816,0.000000,0.000000,0.00000,0.000000,0.0,0.004076,0.500000,2


In [13]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
print('Ready?')

Ready?


In [14]:
from sklearn.model_selection import train_test_split
y = df[' Label']  # target labels
X = df.drop(' Label', axis=1)  # features, with the target column removed
# Split into train+validation and test sets (70/30)
X_train_val, X_test, y_train_val, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Split into training and validation sets (55/15 of the original size)
X_train, X_val, y_train, y_val = train_test_split(X_train_val, y_train_val, test_size=0.15/0.7, random_state=42)
print('Finished')

Finished
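
The second split uses `test_size=0.15/0.7` because it is applied to the 70% that remains after the first split: 0.7 x (0.15/0.7) = 0.15 of the original data. The sketch below works through the resulting set sizes for the 842 sampled rows, assuming scikit-learn rounds a fractional test size up to the next integer. That assumption agrees with the totals of the confusion matrices reported further down (253 test rows and 127 validation rows).

```python
import math

n_samples = 842  # rows left after df.sample(frac=0.0001)

# First split: 30% held out for testing
n_test = math.ceil(0.3 * n_samples)
n_train_val = n_samples - n_test

# Second split: 0.15/0.7 of the remainder held out for validation
n_val = math.ceil((0.15 / 0.7) * n_train_val)
n_train = n_train_val - n_val

print(n_train, n_val, n_test)
print(round(n_train / n_samples, 3),
      round(n_val / n_samples, 3),
      round(n_test / n_samples, 3))
```

The resulting proportions are close to the intended 55/15/30 split; the small deviations come from rounding to whole rows.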


In [15]:
from sklearn.svm import SVC

In [16]:
svm = SVC(kernel='rbf', C=1, random_state=42)

In [17]:


# Train the SVM model
svm.fit(X_train, y_train)
print('Finished')


Finished
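
An RBF kernel depends on Euclidean distances between rows, so columns with very large magnitudes, such as the integer-encoded IPs and the epoch timestamps above, can dominate the min-max-scaled features. `StandardScaler` is imported earlier but never applied. The sketch below shows one way it could be combined with the SVM in a pipeline. The synthetic data and the names `X_demo`, `y_demo`, and `scaled_svm` are illustrative assumptions, not results from the notebook.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (NOT the CIC-DDoS2019 sample): one informative
# small-scale feature and one uninformative feature with an epoch-like magnitude
rng = np.random.default_rng(42)
n = 200
y_demo = rng.integers(0, 2, size=n)
small = y_demo + rng.normal(0, 0.1, size=n)
huge = rng.normal(1.5e9, 1e3, size=n)
X_demo = np.column_stack([small, huge])

# Standardize every column before the RBF SVM sees it
scaled_svm = make_pipeline(StandardScaler(), SVC(kernel='rbf', C=1, random_state=42))
scaled_svm.fit(X_demo, y_demo)

train_acc = scaled_svm.score(X_demo, y_demo)
print(train_acc)
```

In the notebook this would amount to fitting `scaled_svm` on `X_train` and `y_train` instead of the bare `SVC`; whether it improves the results on this dataset is untested here.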


In [18]:
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score, precision_score, recall_score, f1_score

# The target has more than two classes, so precision, recall and F1 all need
# an explicit averaging strategy; 'macro' weights every class equally.

# Predictions on the test set
y_pred = svm.predict(X_test)

# Predictions on the validation set
y_val_pred = svm.predict(X_val)

# Confusion matrix on the test set
cm_test = confusion_matrix(y_test, y_pred)
print("Confusion matrix on the test set:\n", cm_test)

# Confusion matrix on the validation set
cm_val = confusion_matrix(y_val, y_val_pred)
print("Confusion matrix on the validation set:\n", cm_val)

# Accuracy on the test set
acc_test = accuracy_score(y_test, y_pred)
print("Accuracy on the test set:", acc_test)

# Accuracy on the validation set
acc_val = accuracy_score(y_val, y_val_pred)
print("Accuracy on the validation set:", acc_val)

# Precision on the test set (zero_division=0 covers classes that are never predicted)
precision_test = precision_score(y_test, y_pred, average='macro', zero_division=0)
print("Precision on the test set:", precision_test)

# Precision on the validation set
precision_val = precision_score(y_val, y_val_pred, average='macro', zero_division=0)
print("Precision on the validation set:", precision_val)

# Recall on the test set
recall_test = recall_score(y_test, y_pred, average='macro')
print("Recall on the test set:", recall_test)

# Recall on the validation set
recall_val = recall_score(y_val, y_val_pred, average='macro')
print("Recall on the validation set:", recall_val)

# F1 on the test set
f1_test = f1_score(y_test, y_pred, average='macro', zero_division=0)
print("F1 on the test set:", f1_test)

# F1 on the validation set
f1_val = f1_score(y_val, y_val_pred, average='macro', zero_division=0)
print("F1 on the validation set:", f1_val)

Confusion matrix on the test set:
 [[  0   1   0]
 [  0 150   0]
 [  0 102   0]]
Confusion matrix on the validation set:
 [[ 0  1  0]
 [ 0 75  0]
 [ 0 51  0]]
Accuracy on the test set: 0.5928853754940712
Accuracy on the validation set: 0.5905511811023622
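
The test confusion matrix above has non-zero entries in a single column, meaning the model assigned every sample to the same class, so the reported accuracy is simply that class's share of the test set (150/253). As a hedged illustration, macro-averaged precision, recall, and F1 can be derived directly from that matrix. The helper `macro_metrics` is hypothetical and treats undefined ratios (0/0) as 0.

```python
def macro_metrics(cm):
    """Macro precision, recall and F1 from a square confusion matrix
    (rows = true classes, columns = predicted classes)."""
    n = len(cm)
    precisions, recalls, f1s = [], [], []
    for k in range(n):
        tp = cm[k][k]
        predicted_k = sum(cm[i][k] for i in range(n))  # column sum
        actual_k = sum(cm[k])                          # row sum
        p = tp / predicted_k if predicted_k else 0.0
        r = tp / actual_k if actual_k else 0.0
        f1 = 2 * p * r / (p + r) if (p + r) else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f1)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n

# Test-set confusion matrix reported in the notebook output
cm_test = [[0, 1, 0],
           [0, 150, 0],
           [0, 102, 0]]

accuracy = sum(cm_test[k][k] for k in range(3)) / sum(map(sum, cm_test))
precision, recall, f1 = macro_metrics(cm_test)
print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} f1={f1:.4f}")
```

With only one of three classes ever predicted, macro recall comes out at 1/3, which shows that the accuracy figure on its own overstates how well the model separates the classes.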