# Preprocessing and cleaning the data

The dataset used to train the model will most likely arrive with some issues that need to be corrected, so that the training process runs properly and without errors.

#### 1. Null values

Null values can appear in several rows or columns. The most common ways to handle them are:
* **Drop the rows with nulls** - If only a few rows of the dataset are affected, the rows with null values in some columns can be dropped.
* **Drop columns** - If too many rows are affected, it is better to drop the column that contains most of the null values.
* **Fill in with a specific value** - Depending on the column's data type, it can be a good idea to replace the null values with the mean of that column's values (if the column holds continuous values).


#### 2. Categorical columns

Categorical columns hold information classified by text values. Since most training algorithms only work with numerical values, these columns must be transformed into numerical equivalents, usually binary values, using one of several encoding techniques.


#### 3. Very different scales

When numerical columns have very different scales (e.g. 2786.2321564 vs. 0.0000046456), some algorithms can run into problems later on. This is solved with methods such as **min-max scaling** (which produces values only in the range 0 to 1) and **standardization** (which rescales values to zero mean and unit variance).
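
As a minimal sketch of the two methods named above, the following applies scikit-learn's `MinMaxScaler` and `StandardScaler` to a small made-up column. The values are illustrative only and do not come from the dataset.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Toy column with a very wide range of values (illustrative data only)
values = np.array([[0.0000046456], [12.5], [340.0], [2786.2321564]])

# Min-max scaling: (x - min) / (max - min), so results fall in [0, 1]
minmax_scaled = MinMaxScaler().fit_transform(values)

# Standardization: (x - mean) / std, so results have mean 0 and unit variance
standardized = StandardScaler().fit_transform(values)

print(minmax_scaled.ravel())
print(standardized.ravel())
```

Min-max scaling is sensitive to extreme values, because a single outlier defines the upper or lower bound of the range.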


#### 4. Imbalanced data

Some datasets contain a very large share of examples labeled with one class (for example, positive) and very few of the other (negative). This can hurt training and how well the model learns the context. Techniques to address it include:
*  Repeating the minority examples to balance the classes (oversampling).
*  Reducing the majority subset to balance the classes (undersampling).
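
The two balancing techniques listed above could be sketched with plain Pandas as follows. The toy frame, its column values and the variable names are assumptions made for illustration; they are not part of the notebook.

```python
import pandas as pd

# Toy dataset: 8 "normal" examples and 2 "anomaly" examples (illustrative only)
df_toy = pd.DataFrame({
    "src_bytes": range(10),
    "class": ["normal"] * 8 + ["anomaly"] * 2,
})

majority = df_toy[df_toy["class"] == "normal"]
minority = df_toy[df_toy["class"] == "anomaly"]

# Oversampling: sample the minority class with replacement up to the majority size
minority_up = minority.sample(n=len(majority), replace=True, random_state=42)
oversampled = pd.concat([majority, minority_up])

# Undersampling: sample the majority class without replacement down to the minority size
majority_down = majority.sample(n=len(minority), random_state=42)
undersampled = pd.concat([majority_down, minority])

print(oversampled["class"].value_counts())
print(undersampled["class"].value_counts())
```

Resampling of this kind is normally applied to the training subset only, so that validation and test data keep the original class distribution.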
  

## Helper functions

We keep using the *NSL-KDD* dataset of normal and anomalous traffic flows, along with the functions developed so far to:

1. Load the dataset from ARFF format into a Pandas DataFrame.
2. Split the dataset into 3 subsets (train, validation, test).

In [2]:
import arff
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

In [3]:
# Load a dataset in ARFF format into a Pandas DataFrame
def load_kdd_dataset(data_path):
    """Read the NSL-KDD dataset."""
    with open(data_path, 'r') as train_set:
        dataset = arff.load(train_set)
    attributes = [attr[0] for attr in dataset["attributes"]]
    return pd.DataFrame(dataset["data"], columns=attributes)

In [4]:
# Split the DataFrame into train, validation and test sets (60/20/20)
def train_val_test_split(df, rstate=42, shuffle=True, stratify=None):

    # Stratify only if a column name to stratify on is given
    strat = df[stratify] if stratify else None
    
    train_set, test_set = train_test_split(
        df,
        test_size=0.4,
        random_state=rstate, # Fixed random seed for reproducibility
        shuffle=shuffle, # Whether to shuffle before splitting
        stratify=strat # Column to stratify on, if any
    )

    # Repeat the process to obtain the validation set
    strat = test_set[stratify] if stratify else None
    
    val_set, test_set = train_test_split(
        test_set,
        test_size=0.5,
        random_state=rstate,
        shuffle=shuffle,
        stratify=strat
    )
    
    return (train_set, val_set, test_set)
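
To check the proportions this helper produces, it can be run on a small synthetic frame. The sketch below repeats the same splitting logic so that it is self-contained. The toy data (100 rows, 70% `tcp` and 30% `udp`) is made up for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def train_val_test_split(df, rstate=42, shuffle=True, stratify=None):
    """60/20/20 split, same logic as the helper above."""
    strat = df[stratify] if stratify else None
    train_set, test_set = train_test_split(
        df, test_size=0.4, random_state=rstate, shuffle=shuffle, stratify=strat
    )
    strat = test_set[stratify] if stratify else None
    val_set, test_set = train_test_split(
        test_set, test_size=0.5, random_state=rstate, shuffle=shuffle, stratify=strat
    )
    return train_set, val_set, test_set

# Synthetic frame: 100 rows, 70% "tcp" and 30% "udp" (illustrative only)
toy = pd.DataFrame({"x": range(100), "protocol_type": ["tcp"] * 70 + ["udp"] * 30})

# Unpack in the same order the function returns: train, validation, test
train_set, val_set, test_set = train_val_test_split(toy, stratify="protocol_type")

print(len(train_set), len(val_set), len(test_set))
print(train_set["protocol_type"].value_counts(normalize=True))
```

With `stratify` set, each subset should keep roughly the same share of each protocol as the full frame.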

## 1. Read the dataset

In [6]:
df = load_kdd_dataset('../datasets/NSL-KDD/KDDTrain+.arff')
df

Unnamed: 0,duration,protocol_type,service,flag,src_bytes,dst_bytes,land,wrong_fragment,urgent,hot,num_failed_logins,logged_in,num_compromised,root_shell,su_attempted,num_root,num_file_creations,num_shells,num_access_files,num_outbound_cmds,is_host_login,is_guest_login,count,srv_count,serror_rate,srv_serror_rate,rerror_rate,srv_rerror_rate,same_srv_rate,diff_srv_rate,srv_diff_host_rate,dst_host_count,dst_host_srv_count,dst_host_same_srv_rate,dst_host_diff_srv_rate,dst_host_same_src_port_rate,dst_host_srv_diff_host_rate,dst_host_serror_rate,dst_host_srv_serror_rate,dst_host_rerror_rate,dst_host_srv_rerror_rate,class
0,0.0,tcp,ftp_data,SF,491.0,0.0,0,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,2.0,2.0,0.0,0.0,0.0,0.0,1.00,0.00,0.00,150.0,25.0,0.17,0.03,0.17,0.00,0.00,0.00,0.05,0.00,normal
1,0.0,udp,other,SF,146.0,0.0,0,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,13.0,1.0,0.0,0.0,0.0,0.0,0.08,0.15,0.00,255.0,1.0,0.00,0.60,0.88,0.00,0.00,0.00,0.00,0.00,normal
2,0.0,tcp,private,S0,0.0,0.0,0,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,123.0,6.0,1.0,1.0,0.0,0.0,0.05,0.07,0.00,255.0,26.0,0.10,0.05,0.00,0.00,1.00,1.00,0.00,0.00,anomaly
3,0.0,tcp,http,SF,232.0,8153.0,0,0.0,0.0,0.0,0.0,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,5.0,5.0,0.2,0.2,0.0,0.0,1.00,0.00,0.00,30.0,255.0,1.00,0.00,0.03,0.04,0.03,0.01,0.00,0.01,normal
4,0.0,tcp,http,SF,199.0,420.0,0,0.0,0.0,0.0,0.0,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,30.0,32.0,0.0,0.0,0.0,0.0,1.00,0.00,0.09,255.0,255.0,1.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,normal
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
125968,0.0,tcp,private,S0,0.0,0.0,0,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,184.0,25.0,1.0,1.0,0.0,0.0,0.14,0.06,0.00,255.0,25.0,0.10,0.06,0.00,0.00,1.00,1.00,0.00,0.00,anomaly
125969,8.0,udp,private,SF,105.0,145.0,0,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,2.0,2.0,0.0,0.0,0.0,0.0,1.00,0.00,0.00,255.0,244.0,0.96,0.01,0.01,0.00,0.00,0.00,0.00,0.00,normal
125970,0.0,tcp,smtp,SF,2231.0,384.0,0,0.0,0.0,0.0,0.0,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,1.0,1.0,0.0,0.0,0.0,0.0,1.00,0.00,0.00,255.0,30.0,0.12,0.06,0.00,0.00,0.72,0.00,0.01,0.00,normal
125971,0.0,tcp,klogin,S0,0.0,0.0,0,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,144.0,8.0,1.0,1.0,0.0,0.0,0.06,0.05,0.00,255.0,8.0,0.03,0.05,0.00,0.00,1.00,1.00,0.00,0.00,anomaly


## 2. Split the dataset

In [7]:
train_set, val_set, test_set = train_val_test_split(
    df,
    stratify='protocol_type'
)

In [8]:
print(f'train_set size: {len(train_set)}')
print(f'val_set size: {len(val_set)}')
print(f'test_set size: {len(test_set)}')

train_set size: 75583
val_set size: 25195
test_set size: 25195


## 3. How to clean null values

In [9]:
# First, separate the label from the input features
# The label values have no errors to clean
x_train = train_set.drop('class', axis=1) # Input features
y_train = train_set['class'].copy() # Label

In [11]:
# The dataset is already clean, so we simulate some null values
# in order to demonstrate the cleaning
x_train.loc[(x_train["src_bytes"]>400) & (x_train["src_bytes"]<800), "src_bytes"] = np.nan
x_train.loc[(x_train["dst_bytes"]>500) & (x_train["dst_bytes"]<2000), "dst_bytes"] = np.nan
x_train

Unnamed: 0,duration,protocol_type,service,flag,src_bytes,dst_bytes,land,wrong_fragment,urgent,hot,num_failed_logins,logged_in,num_compromised,root_shell,su_attempted,num_root,num_file_creations,num_shells,num_access_files,num_outbound_cmds,is_host_login,is_guest_login,count,srv_count,serror_rate,srv_serror_rate,rerror_rate,srv_rerror_rate,same_srv_rate,diff_srv_rate,srv_diff_host_rate,dst_host_count,dst_host_srv_count,dst_host_same_srv_rate,dst_host_diff_srv_rate,dst_host_same_src_port_rate,dst_host_srv_diff_host_rate,dst_host_serror_rate,dst_host_srv_serror_rate,dst_host_rerror_rate,dst_host_srv_rerror_rate
113467,0.0,tcp,http,SF,,53508.0,0,0.0,0.0,0.0,0.0,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,1.0,5.0,0.0,0.0,0.0,0.0,1.00,0.00,0.40,9.0,255.0,1.00,0.00,0.11,0.03,0.00,0.00,0.0,0.0
31899,0.0,tcp,private,S0,0.0,0.0,0,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,200.0,4.0,1.0,1.0,0.0,0.0,0.02,0.05,0.00,255.0,4.0,0.02,0.05,0.00,0.00,1.00,1.00,0.0,0.0
108116,0.0,tcp,http,SF,304.0,,0,0.0,0.0,0.0,0.0,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,2.0,5.0,0.0,0.0,0.0,0.0,1.00,0.00,0.40,39.0,255.0,1.00,0.00,0.03,0.06,0.00,0.00,0.0,0.0
89913,0.0,tcp,private,S0,0.0,0.0,0,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,122.0,17.0,1.0,1.0,0.0,0.0,0.14,0.06,0.00,255.0,15.0,0.06,0.07,0.00,0.00,1.00,1.00,0.0,0.0
106319,0.0,icmp,eco_i,SF,8.0,0.0,0,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,1.0,13.0,0.0,0.0,0.0,0.0,1.00,0.00,1.00,2.0,7.0,1.00,0.00,1.00,0.57,0.00,0.00,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
64559,0.0,tcp,systat,S0,0.0,0.0,0,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,246.0,20.0,1.0,1.0,0.0,0.0,0.08,0.06,0.00,255.0,20.0,0.08,0.06,0.00,0.00,1.00,1.00,0.0,0.0
67272,0.0,tcp,http,SF,210.0,,0,0.0,0.0,0.0,0.0,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,3.0,5.0,0.0,0.0,0.0,0.0,1.00,0.00,0.40,119.0,255.0,1.00,0.00,0.01,0.02,0.02,0.01,0.0,0.0
32452,3.0,tcp,smtp,SF,889.0,328.0,0,0.0,0.0,0.0,0.0,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,1.0,1.0,0.0,0.0,0.0,0.0,1.00,0.00,0.00,111.0,155.0,0.64,0.04,0.01,0.01,0.01,0.00,0.0,0.0
112657,0.0,tcp,http,SF,284.0,444.0,0,0.0,0.0,0.0,0.0,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,2.0,2.0,0.0,0.0,0.0,0.0,1.00,0.00,0.00,255.0,255.0,1.00,0.00,0.00,0.00,0.00,0.00,0.0,0.0


#### Check for null values

In [12]:
x_train.isna().any()

duration                       False
protocol_type                  False
service                        False
flag                           False
src_bytes                       True
dst_bytes                       True
land                           False
wrong_fragment                 False
urgent                         False
hot                            False
num_failed_logins              False
logged_in                      False
num_compromised                False
root_shell                     False
su_attempted                   False
num_root                       False
num_file_creations             False
num_shells                     False
num_access_files               False
num_outbound_cmds              False
is_host_login                  False
is_guest_login                 False
count                          False
srv_count                      False
serror_rate                    False
srv_serror_rate                False
rerror_rate                    False
...

In [15]:
# Number of rows with null values
null_rows = x_train[ x_train.isnull().any(axis=1) ]
len(null_rows)

9886

### Option 1: Drop rows with nulls

In [16]:
# Copy the dataset so the original is not altered
x_train_copy = x_train.copy()

In [17]:
# Drop rows with null values (in the src_bytes and dst_bytes columns)
x_train_copy.dropna(subset=['src_bytes', 'dst_bytes'], inplace=True)
x_train_copy

Unnamed: 0,duration,protocol_type,service,flag,src_bytes,dst_bytes,land,wrong_fragment,urgent,hot,num_failed_logins,logged_in,num_compromised,root_shell,su_attempted,num_root,num_file_creations,num_shells,num_access_files,num_outbound_cmds,is_host_login,is_guest_login,count,srv_count,serror_rate,srv_serror_rate,rerror_rate,srv_rerror_rate,same_srv_rate,diff_srv_rate,srv_diff_host_rate,dst_host_count,dst_host_srv_count,dst_host_same_srv_rate,dst_host_diff_srv_rate,dst_host_same_src_port_rate,dst_host_srv_diff_host_rate,dst_host_serror_rate,dst_host_srv_serror_rate,dst_host_rerror_rate,dst_host_srv_rerror_rate
31899,0.0,tcp,private,S0,0.0,0.0,0,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,200.0,4.0,1.0,1.0,0.0,0.0,0.02,0.05,0.00,255.0,4.0,0.02,0.05,0.00,0.00,1.00,1.0,0.0,0.0
89913,0.0,tcp,private,S0,0.0,0.0,0,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,122.0,17.0,1.0,1.0,0.0,0.0,0.14,0.06,0.00,255.0,15.0,0.06,0.07,0.00,0.00,1.00,1.0,0.0,0.0
106319,0.0,icmp,eco_i,SF,8.0,0.0,0,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,1.0,13.0,0.0,0.0,0.0,0.0,1.00,0.00,1.00,2.0,7.0,1.00,0.00,1.00,0.57,0.00,0.0,0.0,0.0
98007,0.0,udp,domain_u,SF,46.0,139.0,0,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,65.0,145.0,0.0,0.0,0.0,0.0,1.00,0.00,0.01,255.0,254.0,1.00,0.01,0.00,0.00,0.00,0.0,0.0,0.0
16447,0.0,tcp,smtp,SF,1790.0,363.0,0,0.0,0.0,0.0,0.0,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,1.0,1.0,0.0,0.0,0.0,0.0,1.00,0.00,0.00,141.0,137.0,0.55,0.04,0.01,0.01,0.00,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
90665,0.0,tcp,ftp_data,S0,0.0,0.0,0,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,45.0,7.0,1.0,1.0,0.0,0.0,0.16,0.09,0.00,255.0,63.0,0.25,0.02,0.02,0.00,1.00,1.0,0.0,0.0
64559,0.0,tcp,systat,S0,0.0,0.0,0,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,246.0,20.0,1.0,1.0,0.0,0.0,0.08,0.06,0.00,255.0,20.0,0.08,0.06,0.00,0.00,1.00,1.0,0.0,0.0
32452,3.0,tcp,smtp,SF,889.0,328.0,0,0.0,0.0,0.0,0.0,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,1.0,1.0,0.0,0.0,0.0,0.0,1.00,0.00,0.00,111.0,155.0,0.64,0.04,0.01,0.01,0.01,0.0,0.0,0.0
112657,0.0,tcp,http,SF,284.0,444.0,0,0.0,0.0,0.0,0.0,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,2.0,2.0,0.0,0.0,0.0,0.0,1.00,0.00,0.00,255.0,255.0,1.00,0.00,0.00,0.00,0.00,0.0,0.0,0.0


In [19]:
# Number of rows dropped
len(x_train) - len(x_train_copy)
# This does not seem to be the best option, since a lot of data was lost

9886

### Option 2: Drop columns with nulls

In [20]:
# Make a copy of the original set
x_train_copy = x_train.copy()

In [21]:
# Drop the columns with nulls
x_train_copy.drop(['src_bytes', 'dst_bytes'], axis=1, inplace=True)
x_train_copy

Unnamed: 0,duration,protocol_type,service,flag,land,wrong_fragment,urgent,hot,num_failed_logins,logged_in,num_compromised,root_shell,su_attempted,num_root,num_file_creations,num_shells,num_access_files,num_outbound_cmds,is_host_login,is_guest_login,count,srv_count,serror_rate,srv_serror_rate,rerror_rate,srv_rerror_rate,same_srv_rate,diff_srv_rate,srv_diff_host_rate,dst_host_count,dst_host_srv_count,dst_host_same_srv_rate,dst_host_diff_srv_rate,dst_host_same_src_port_rate,dst_host_srv_diff_host_rate,dst_host_serror_rate,dst_host_srv_serror_rate,dst_host_rerror_rate,dst_host_srv_rerror_rate
113467,0.0,tcp,http,SF,0,0.0,0.0,0.0,0.0,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,1.0,5.0,0.0,0.0,0.0,0.0,1.00,0.00,0.40,9.0,255.0,1.00,0.00,0.11,0.03,0.00,0.00,0.0,0.0
31899,0.0,tcp,private,S0,0,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,200.0,4.0,1.0,1.0,0.0,0.0,0.02,0.05,0.00,255.0,4.0,0.02,0.05,0.00,0.00,1.00,1.00,0.0,0.0
108116,0.0,tcp,http,SF,0,0.0,0.0,0.0,0.0,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,2.0,5.0,0.0,0.0,0.0,0.0,1.00,0.00,0.40,39.0,255.0,1.00,0.00,0.03,0.06,0.00,0.00,0.0,0.0
89913,0.0,tcp,private,S0,0,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,122.0,17.0,1.0,1.0,0.0,0.0,0.14,0.06,0.00,255.0,15.0,0.06,0.07,0.00,0.00,1.00,1.00,0.0,0.0
106319,0.0,icmp,eco_i,SF,0,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,1.0,13.0,0.0,0.0,0.0,0.0,1.00,0.00,1.00,2.0,7.0,1.00,0.00,1.00,0.57,0.00,0.00,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
64559,0.0,tcp,systat,S0,0,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,246.0,20.0,1.0,1.0,0.0,0.0,0.08,0.06,0.00,255.0,20.0,0.08,0.06,0.00,0.00,1.00,1.00,0.0,0.0
67272,0.0,tcp,http,SF,0,0.0,0.0,0.0,0.0,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,3.0,5.0,0.0,0.0,0.0,0.0,1.00,0.00,0.40,119.0,255.0,1.00,0.00,0.01,0.02,0.02,0.01,0.0,0.0
32452,3.0,tcp,smtp,SF,0,0.0,0.0,0.0,0.0,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,1.0,1.0,0.0,0.0,0.0,0.0,1.00,0.00,0.00,111.0,155.0,0.64,0.04,0.01,0.01,0.01,0.00,0.0,0.0
112657,0.0,tcp,http,SF,0,0.0,0.0,0.0,0.0,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,2.0,2.0,0.0,0.0,0.0,0.0,1.00,0.00,0.00,255.0,255.0,1.00,0.00,0.00,0.00,0.00,0.00,0.0,0.0


In [22]:
# Number of attributes dropped
len(list(x_train)) - len(list(x_train_copy))

2

### Option 3: Fill nulls with default values

In [23]:
# Make a copy of the original set
x_train_copy = x_train.copy()

In [36]:
# Fill nulls with the mean of each column that has nulls
avg_src_bytes = x_train_copy['src_bytes'].mean()
avg_dst_bytes = x_train_copy['dst_bytes'].mean()

x_train_copy['src_bytes'] = x_train_copy['src_bytes'].fillna(avg_src_bytes)
x_train_copy['dst_bytes'] = x_train_copy['dst_bytes'].fillna(avg_dst_bytes)

x_train_copy

Unnamed: 0,duration,protocol_type,service,flag,src_bytes,dst_bytes,land,wrong_fragment,urgent,hot,num_failed_logins,logged_in,num_compromised,root_shell,su_attempted,num_root,num_file_creations,num_shells,num_access_files,num_outbound_cmds,is_host_login,is_guest_login,count,srv_count,serror_rate,srv_serror_rate,rerror_rate,srv_rerror_rate,same_srv_rate,diff_srv_rate,srv_diff_host_rate,dst_host_count,dst_host_srv_count,dst_host_same_srv_rate,dst_host_diff_srv_rate,dst_host_same_src_port_rate,dst_host_srv_diff_host_rate,dst_host_serror_rate,dst_host_srv_serror_rate,dst_host_rerror_rate,dst_host_srv_rerror_rate
113467,0.0,tcp,http,SF,66914.530762,53508.000000,0,0.0,0.0,0.0,0.0,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,1.0,5.0,0.0,0.0,0.0,0.0,1.00,0.00,0.40,9.0,255.0,1.00,0.00,0.11,0.03,0.00,0.00,0.0,0.0
31899,0.0,tcp,private,S0,0.000000,0.000000,0,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,200.0,4.0,1.0,1.0,0.0,0.0,0.02,0.05,0.00,255.0,4.0,0.02,0.05,0.00,0.00,1.00,1.00,0.0,0.0
108116,0.0,tcp,http,SF,304.000000,9181.334754,0,0.0,0.0,0.0,0.0,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,2.0,5.0,0.0,0.0,0.0,0.0,1.00,0.00,0.40,39.0,255.0,1.00,0.00,0.03,0.06,0.00,0.00,0.0,0.0
89913,0.0,tcp,private,S0,0.000000,0.000000,0,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,122.0,17.0,1.0,1.0,0.0,0.0,0.14,0.06,0.00,255.0,15.0,0.06,0.07,0.00,0.00,1.00,1.00,0.0,0.0
106319,0.0,icmp,eco_i,SF,8.000000,0.000000,0,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,1.0,13.0,0.0,0.0,0.0,0.0,1.00,0.00,1.00,2.0,7.0,1.00,0.00,1.00,0.57,0.00,0.00,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
64559,0.0,tcp,systat,S0,0.000000,0.000000,0,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,246.0,20.0,1.0,1.0,0.0,0.0,0.08,0.06,0.00,255.0,20.0,0.08,0.06,0.00,0.00,1.00,1.00,0.0,0.0
67272,0.0,tcp,http,SF,210.000000,9181.334754,0,0.0,0.0,0.0,0.0,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,3.0,5.0,0.0,0.0,0.0,0.0,1.00,0.00,0.40,119.0,255.0,1.00,0.00,0.01,0.02,0.02,0.01,0.0,0.0
32452,3.0,tcp,smtp,SF,889.000000,328.000000,0,0.0,0.0,0.0,0.0,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,1.0,1.0,0.0,0.0,0.0,0.0,1.00,0.00,0.00,111.0,155.0,0.64,0.04,0.01,0.01,0.01,0.00,0.0,0.0
112657,0.0,tcp,http,SF,284.000000,444.000000,0,0.0,0.0,0.0,0.0,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,2.0,2.0,0.0,0.0,0.0,0.0,1.00,0.00,0.00,255.0,255.0,1.00,0.00,0.00,0.00,0.00,0.00,0.0,0.0


In [37]:
# Number of rows with null values
null_rows = x_train_copy[ x_train_copy.isnull().any(axis=1) ]
len(null_rows)

0

However, the mean can give misleading results when the column contains values that are extremely high or low compared to the rest of the data.

As a solution, the nulls can be replaced with the median instead of the mean.
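
A small sketch of why the median is more robust here, using a made-up series with one extreme value and one missing value (the numbers are illustrative only):

```python
import pandas as pd

# Toy column with one extreme value and one missing value (illustrative only)
s = pd.Series([10.0, 12.0, 11.0, 13.0, 100000.0, None])

print(s.mean())    # pulled far up by the single extreme value
print(s.median())  # stays close to the typical values

# The value used to fill the gap differs a lot between the two strategies
filled_mean = s.fillna(s.mean())
filled_median = s.fillna(s.median())
print(filled_mean.iloc[-1], filled_median.iloc[-1])
```

Both `mean()` and `median()` skip missing values by default, so the fill value is computed from the five known entries.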

In [33]:
# Make a copy of the original set
x_train_copy = x_train.copy()

In [38]:
# Fill nulls with the median of each column that has nulls
med_src_bytes = x_train_copy['src_bytes'].median()
med_dst_bytes = x_train_copy['dst_bytes'].median()

x_train_copy['src_bytes'] = x_train_copy['src_bytes'].fillna(med_src_bytes)
x_train_copy['dst_bytes'] = x_train_copy['dst_bytes'].fillna(med_dst_bytes)

x_train_copy

Unnamed: 0,duration,protocol_type,service,flag,src_bytes,dst_bytes,land,wrong_fragment,urgent,hot,num_failed_logins,logged_in,num_compromised,root_shell,su_attempted,num_root,num_file_creations,num_shells,num_access_files,num_outbound_cmds,is_host_login,is_guest_login,count,srv_count,serror_rate,srv_serror_rate,rerror_rate,srv_rerror_rate,same_srv_rate,diff_srv_rate,srv_diff_host_rate,dst_host_count,dst_host_srv_count,dst_host_same_srv_rate,dst_host_diff_srv_rate,dst_host_same_src_port_rate,dst_host_srv_diff_host_rate,dst_host_serror_rate,dst_host_srv_serror_rate,dst_host_rerror_rate,dst_host_srv_rerror_rate
113467,0.0,tcp,http,SF,66914.530762,53508.000000,0,0.0,0.0,0.0,0.0,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,1.0,5.0,0.0,0.0,0.0,0.0,1.00,0.00,0.40,9.0,255.0,1.00,0.00,0.11,0.03,0.00,0.00,0.0,0.0
31899,0.0,tcp,private,S0,0.000000,0.000000,0,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,200.0,4.0,1.0,1.0,0.0,0.0,0.02,0.05,0.00,255.0,4.0,0.02,0.05,0.00,0.00,1.00,1.00,0.0,0.0
108116,0.0,tcp,http,SF,304.000000,9181.334754,0,0.0,0.0,0.0,0.0,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,2.0,5.0,0.0,0.0,0.0,0.0,1.00,0.00,0.40,39.0,255.0,1.00,0.00,0.03,0.06,0.00,0.00,0.0,0.0
89913,0.0,tcp,private,S0,0.000000,0.000000,0,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,122.0,17.0,1.0,1.0,0.0,0.0,0.14,0.06,0.00,255.0,15.0,0.06,0.07,0.00,0.00,1.00,1.00,0.0,0.0
106319,0.0,icmp,eco_i,SF,8.000000,0.000000,0,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,1.0,13.0,0.0,0.0,0.0,0.0,1.00,0.00,1.00,2.0,7.0,1.00,0.00,1.00,0.57,0.00,0.00,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
64559,0.0,tcp,systat,S0,0.000000,0.000000,0,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,246.0,20.0,1.0,1.0,0.0,0.0,0.08,0.06,0.00,255.0,20.0,0.08,0.06,0.00,0.00,1.00,1.00,0.0,0.0
67272,0.0,tcp,http,SF,210.000000,9181.334754,0,0.0,0.0,0.0,0.0,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,3.0,5.0,0.0,0.0,0.0,0.0,1.00,0.00,0.40,119.0,255.0,1.00,0.00,0.01,0.02,0.02,0.01,0.0,0.0
32452,3.0,tcp,smtp,SF,889.000000,328.000000,0,0.0,0.0,0.0,0.0,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,1.0,1.0,0.0,0.0,0.0,0.0,1.00,0.00,0.00,111.0,155.0,0.64,0.04,0.01,0.01,0.01,0.00,0.0,0.0
112657,0.0,tcp,http,SF,284.000000,444.000000,0,0.0,0.0,0.0,0.0,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,2.0,2.0,0.0,0.0,0.0,0.0,1.00,0.00,0.00,255.0,255.0,1.00,0.00,0.00,0.00,0.00,0.00,0.0,0.0


In [39]:
# Number of rows with null values
null_rows = x_train_copy[ x_train_copy.isnull().any(axis=1) ]
len(null_rows)

0

### Sklearn offers its own *imputer* class to apply option 3

This class fills in all the null values in the DataFrame automatically, but keep in mind that with the `median` strategy it only works on numerical columns.

In [40]:
# Make a copy of the original set
x_train_copy = x_train.copy()

In [41]:
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy='median')

In [42]:
# Extract only the numerical columns
x_train_copy_num = x_train_copy.select_dtypes(exclude=['object'])
x_train_copy_num.info()

<class 'pandas.core.frame.DataFrame'>
Index: 75583 entries, 113467 to 99030
Data columns (total 34 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   duration                     75583 non-null  float64
 1   src_bytes                    73696 non-null  float64
 2   dst_bytes                    67572 non-null  float64
 3   wrong_fragment               75583 non-null  float64
 4   urgent                       75583 non-null  float64
 5   hot                          75583 non-null  float64
 6   num_failed_logins            75583 non-null  float64
 7   num_compromised              75583 non-null  float64
 8   root_shell                   75583 non-null  float64
 9   su_attempted                 75583 non-null  float64
 10  num_root                     75583 non-null  float64
 11  num_file_creations           75583 non-null  float64
 12  num_shells                   75583 non-null  float64
 13  num_access_files

In [43]:
# Fit the imputer on the numerical columns so it computes the fill values
imputer.fit(x_train_copy_num)

In [44]:
# Fill in the null values
x_train_copy_num_nonull = imputer.transform(x_train_copy_num)

In [46]:
x_train_copy_num_nonull

array([[0.0000e+00, 4.3000e+01, 5.3508e+04, ..., 0.0000e+00, 0.0000e+00,
        0.0000e+00],
       [0.0000e+00, 0.0000e+00, 0.0000e+00, ..., 1.0000e+00, 0.0000e+00,
        0.0000e+00],
       [0.0000e+00, 3.0400e+02, 0.0000e+00, ..., 0.0000e+00, 0.0000e+00,
        0.0000e+00],
       ...,
       [3.0000e+00, 8.8900e+02, 3.2800e+02, ..., 0.0000e+00, 0.0000e+00,
        0.0000e+00],
       [0.0000e+00, 2.8400e+02, 4.4400e+02, ..., 0.0000e+00, 0.0000e+00,
        0.0000e+00],
       [0.0000e+00, 2.0900e+02, 3.1270e+03, ..., 0.0000e+00, 0.0000e+00,
        0.0000e+00]])

In [45]:
# Convert the result back into a Pandas DataFrame
# (the original index is not kept; pass index=x_train_copy_num.index to preserve it)
x_train_copy = pd.DataFrame(x_train_copy_num_nonull, columns=x_train_copy_num.columns)

In [47]:
x_train_copy.head(10)

Unnamed: 0,duration,src_bytes,dst_bytes,wrong_fragment,urgent,hot,num_failed_logins,num_compromised,root_shell,su_attempted,num_root,num_file_creations,num_shells,num_access_files,num_outbound_cmds,count,srv_count,serror_rate,srv_serror_rate,rerror_rate,srv_rerror_rate,same_srv_rate,diff_srv_rate,srv_diff_host_rate,dst_host_count,dst_host_srv_count,dst_host_same_srv_rate,dst_host_diff_srv_rate,dst_host_same_src_port_rate,dst_host_srv_diff_host_rate,dst_host_serror_rate,dst_host_srv_serror_rate,dst_host_rerror_rate,dst_host_srv_rerror_rate
0,0.0,43.0,53508.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,5.0,0.0,0.0,0.0,0.0,1.0,0.0,0.4,9.0,255.0,1.0,0.0,0.11,0.03,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,200.0,4.0,1.0,1.0,0.0,0.0,0.02,0.05,0.0,255.0,4.0,0.02,0.05,0.0,0.0,1.0,1.0,0.0,0.0
2,0.0,304.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,5.0,0.0,0.0,0.0,0.0,1.0,0.0,0.4,39.0,255.0,1.0,0.0,0.03,0.06,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,122.0,17.0,1.0,1.0,0.0,0.0,0.14,0.06,0.0,255.0,15.0,0.06,0.07,0.0,0.0,1.0,1.0,0.0,0.0
4,0.0,8.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,13.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,2.0,7.0,1.0,0.0,1.0,0.57,0.0,0.0,0.0,0.0
5,0.0,46.0,139.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,65.0,145.0,0.0,0.0,0.0,0.0,1.0,0.0,0.01,255.0,254.0,1.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0
6,0.0,1790.0,363.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,141.0,137.0,0.55,0.04,0.01,0.01,0.0,0.0,0.0,0.0
7,1.0,43.0,329.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,3.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,198.0,181.0,0.65,0.03,0.01,0.01,0.02,0.02,0.0,0.0
8,0.0,206.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,10.0,13.0,0.0,0.0,0.0,0.0,1.0,0.0,0.15,255.0,255.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,0.0,334.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,2.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,8.0,28.0,1.0,0.0,1.0,0.11,0.0,0.0,0.0,0.0


## 4. How to transform categorical columns into numerical ones

In [49]:
# First, separate the label from the input features
# The label values do not need to be transformed
x_train = train_set.drop('class', axis=1) # Input features
y_train = train_set['class'].copy() # Label

In [51]:
x_train.info()

<class 'pandas.core.frame.DataFrame'>
Index: 75583 entries, 113467 to 99030
Data columns (total 41 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   duration                     75583 non-null  float64
 1   protocol_type                75583 non-null  object 
 2   service                      75583 non-null  object 
 3   flag                         75583 non-null  object 
 4   src_bytes                    75583 non-null  float64
 5   dst_bytes                    75583 non-null  float64
 6   land                         75583 non-null  object 
 7   wrong_fragment               75583 non-null  float64
 8   urgent                       75583 non-null  float64
 9   hot                          75583 non-null  float64
 10  num_failed_logins            75583 non-null  float64
 11  logged_in                    75583 non-null  object 
 12  num_compromised              75583 non-null  float64
 13  root_shell      

### Option 1: Map each category to a number from 0 to n-1 with Pandas

This can become tedious, since it must be applied to each categorical column of the DataFrame.

In [52]:
protocol_type = x_train['protocol_type']
protocol_type_num, categorias = protocol_type.factorize()

In [53]:
# How the values look
for i in range(10):
    print( protocol_type.iloc[i], '=', protocol_type_num[i] )

tcp = 0
tcp = 0
tcp = 0
tcp = 0
icmp = 1
udp = 2
tcp = 0
tcp = 0
tcp = 0
tcp = 0


In [54]:
print(categorias)

Index(['tcp', 'icmp', 'udp'], dtype='object')


## Advanced transformations with sklearn

### Option 2: Ordinal encoding

Similar to the Pandas `factorize` method.

The problem with this kind of encoding is that it is a poor fit for algorithms that measure distances between values, such as clustering: the integer codes imply an order and a distance between categories that do not really exist. One-hot (binary) encoding, shown further below, avoids this.

In [64]:
from sklearn.preprocessing import OrdinalEncoder

protocol_type = x_train[['protocol_type']]

ordinal_enc = OrdinalEncoder()
protocol_type_num = ordinal_enc.fit_transform(protocol_type)

In [66]:
# How the values look
for i in range(10):
    print( protocol_type['protocol_type'].iloc[i], '=', protocol_type_num[i] )

tcp = [1.]
tcp = [1.]
tcp = [1.]
tcp = [1.]
icmp = [0.]
udp = [2.]
tcp = [1.]
tcp = [1.]
tcp = [1.]
tcp = [1.]
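
The distance issue with ordinal codes can be illustrated with a small sketch. It encodes the three protocol values both ways and compares Euclidean distances between categories. The helper `dist` and the three-row frame are assumptions made for this illustration.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

protocols = pd.DataFrame({"protocol_type": ["icmp", "tcp", "udp"]})

# Categories are sorted alphabetically: icmp=0, tcp=1, udp=2
ordinal = OrdinalEncoder().fit_transform(protocols)
# One column per category, converted from sparse to a dense array
one_hot = OneHotEncoder().fit_transform(protocols).toarray()

def dist(a, b):
    """Euclidean distance between two encoded rows (hypothetical helper)."""
    return float(np.linalg.norm(a - b))

# Ordinal codes: icmp looks twice as far from udp as from tcp
print(dist(ordinal[0], ordinal[1]), dist(ordinal[0], ordinal[2]))
# One-hot vectors: every pair of categories is equally far apart
print(dist(one_hot[0], one_hot[1]), dist(one_hot[0], one_hot[2]))
```

The artificial ordering in the ordinal codes is what a distance-based algorithm would pick up on, even though the protocols have no natural order.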


### Option 3: One-hot encoding

In [67]:
from sklearn.preprocessing import OneHotEncoder

oh_encoder = OneHotEncoder() 

protocol_type = x_train[['protocol_type']]
protocol_type_num = oh_encoder.fit_transform(protocol_type)
protocol_type_num

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 75583 stored elements and shape (75583, 3)>

In [69]:
# How the values look
dense = protocol_type_num.toarray()  # convert the sparse matrix once, outside the loop
for i in range(10):
    print( protocol_type['protocol_type'].iloc[i], '=', dense[i] )

tcp = [0. 1. 0.]
tcp = [0. 1. 0.]
tcp = [0. 1. 0.]
tcp = [0. 1. 0.]
icmp = [1. 0. 0.]
udp = [0. 0. 1.]
tcp = [0. 1. 0.]
tcp = [0. 1. 0.]
tcp = [0. 1. 0.]
tcp = [0. 1. 0.]


In [70]:
# If a new category, unknown to the fitted encoder, shows up in the future,
# the handle_unknown='ignore' parameter makes the encoder represent it
# as all zeros [0,0,0] instead of raising an error
oh_encoder = OneHotEncoder(handle_unknown='ignore')
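
A short sketch of that behaviour on made-up data: the encoder is fitted on the three known protocols and then asked to transform a value it has never seen. The category `sctp` and the small frames are hypothetical and only serve as an example.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Fit on the three protocols seen in training (toy data)
train = pd.DataFrame({"protocol_type": ["tcp", "udp", "icmp", "tcp"]})
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train)

# "sctp" is a hypothetical category that was not present during fit
new_data = pd.DataFrame({"protocol_type": ["tcp", "sctp"]})
encoded = encoder.transform(new_data).toarray()

print(encoder.categories_)  # learned categories, sorted alphabetically
print(encoded)              # the unknown category becomes a row of zeros
```

With the default `handle_unknown='error'`, the same `transform` call would raise an error instead.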

### Option 4: Get dummies encoding (Pandas)

Generates 3 new columns, one per category, indicating which category each row belongs to.

In [76]:
pd.get_dummies(x_train['protocol_type']).astype(int)

Unnamed: 0,icmp,tcp,udp
113467,0,1,0
31899,0,1,0
108116,0,1,0
89913,0,1,0
106319,1,0,0
...,...,...,...
64559,0,1,0
67272,0,1,0
32452,0,1,0
112657,0,1,0


## 5. How to handle very different scales between values

In [77]:
# First, separate the label from the input features
# The label values do not need to be transformed
x_train = train_set.drop('class', axis=1) # Input features
y_train = train_set['class'].copy() # Label

These scaling mechanisms are NOT applied to labels. In addition, the scaler is fitted on the training data only, and that same fitted scaler is then used to transform the validation and test data.

In [78]:
from sklearn.preprocessing import RobustScaler

scale_atts = x_train[['src_bytes', 'dst_bytes']] # Columns to be scaled

robust_scaler = RobustScaler()

x_train_scaled = robust_scaler.fit_transform(scale_atts)
x_train_scaled = pd.DataFrame(x_train_scaled, columns=['src_bytes', 'dst_bytes'])
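
As a rough sketch of how a scaler fitted on the training subset could be reused on another subset, the following uses two small made-up frames in place of the real splits. The names `train_toy` and `val_toy` and their values are assumptions. Passing `index=` when rebuilding the DataFrame keeps the original row labels.

```python
import pandas as pd
from sklearn.preprocessing import RobustScaler

# Toy stand-ins for the train and validation splits (illustrative values only)
train_toy = pd.DataFrame(
    {"src_bytes": [0.0, 100.0, 200.0, 300.0, 400.0]}, index=[10, 11, 12, 13, 14]
)
val_toy = pd.DataFrame({"src_bytes": [200.0, 600.0]}, index=[20, 21])

scaler = RobustScaler()
scaler.fit(train_toy)  # median and interquartile range come from the training split only

# Reuse the same fitted scaler on both splits, keeping the row labels
train_scaled = pd.DataFrame(
    scaler.transform(train_toy), columns=train_toy.columns, index=train_toy.index
)
val_scaled = pd.DataFrame(
    scaler.transform(val_toy), columns=val_toy.columns, index=val_toy.index
)
print(val_scaled)
```

Calling `fit` again on the validation or test data would scale each subset with different statistics, which makes the values inconsistent with what the model saw during training.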

In [79]:
x_train_scaled.head(10)

Unnamed: 0,src_bytes,dst_bytes
0,1.324818,101.92
1,-0.160584,0.0
2,0.948905,1.211429
3,-0.160584,0.0
4,-0.131387,0.0
5,0.007299,0.264762
6,6.372263,0.691429
7,2.5,0.626667
8,0.591241,2.841905
9,1.058394,0.0


In [80]:
x_train.head(10)

Unnamed: 0,duration,protocol_type,service,flag,src_bytes,dst_bytes,land,wrong_fragment,urgent,hot,num_failed_logins,logged_in,num_compromised,root_shell,su_attempted,num_root,num_file_creations,num_shells,num_access_files,num_outbound_cmds,is_host_login,is_guest_login,count,srv_count,serror_rate,srv_serror_rate,rerror_rate,srv_rerror_rate,same_srv_rate,diff_srv_rate,srv_diff_host_rate,dst_host_count,dst_host_srv_count,dst_host_same_srv_rate,dst_host_diff_srv_rate,dst_host_same_src_port_rate,dst_host_srv_diff_host_rate,dst_host_serror_rate,dst_host_srv_serror_rate,dst_host_rerror_rate,dst_host_srv_rerror_rate
113467,0.0,tcp,http,SF,407.0,53508.0,0,0.0,0.0,0.0,0.0,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,1.0,5.0,0.0,0.0,0.0,0.0,1.0,0.0,0.4,9.0,255.0,1.0,0.0,0.11,0.03,0.0,0.0,0.0,0.0
31899,0.0,tcp,private,S0,0.0,0.0,0,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,200.0,4.0,1.0,1.0,0.0,0.0,0.02,0.05,0.0,255.0,4.0,0.02,0.05,0.0,0.0,1.0,1.0,0.0,0.0
108116,0.0,tcp,http,SF,304.0,636.0,0,0.0,0.0,0.0,0.0,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,2.0,5.0,0.0,0.0,0.0,0.0,1.0,0.0,0.4,39.0,255.0,1.0,0.0,0.03,0.06,0.0,0.0,0.0,0.0
89913,0.0,tcp,private,S0,0.0,0.0,0,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,122.0,17.0,1.0,1.0,0.0,0.0,0.14,0.06,0.0,255.0,15.0,0.06,0.07,0.0,0.0,1.0,1.0,0.0,0.0
106319,0.0,icmp,eco_i,SF,8.0,0.0,0,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,1.0,13.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,2.0,7.0,1.0,0.0,1.0,0.57,0.0,0.0,0.0,0.0
98007,0.0,udp,domain_u,SF,46.0,139.0,0,0.0,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,65.0,145.0,0.0,0.0,0.0,0.0,1.0,0.0,0.01,255.0,254.0,1.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0
16447,0.0,tcp,smtp,SF,1790.0,363.0,0,0.0,0.0,0.0,0.0,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,141.0,137.0,0.55,0.04,0.01,0.01,0.0,0.0,0.0,0.0
64957,1.0,tcp,smtp,SF,729.0,329.0,0,0.0,0.0,0.0,0.0,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,1.0,3.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,198.0,181.0,0.65,0.03,0.01,0.01,0.02,0.02,0.0,0.0
100052,0.0,tcp,http,SF,206.0,1492.0,0,0.0,0.0,0.0,0.0,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,10.0,13.0,0.0,0.0,0.0,0.0,1.0,0.0,0.15,255.0,255.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
28800,0.0,tcp,ftp_data,SF,334.0,0.0,0,0.0,0.0,0.0,0.0,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,2.0,2.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,8.0,28.0,1.0,0.0,1.0,0.11,0.0,0.0,0.0,0.0
