# DATA PREPROCESSING<br>

In this section, preprocessing techniques are applied to the numeric data to adapt them to the machine learning algorithms. They are applied only to the data subsets obtained with the first feature-extraction method.<br>

- We import the required _libraries_:

In [1]:
import os
import numpy as np
import pandas as pd

from sklearn.preprocessing import MinMaxScaler

- __Directories__ used:

In [2]:
PROJECT_ROOT_PATH = "."
DATASETS_PATH = PROJECT_ROOT_PATH + os.sep + "datasets"
FINAL_DATASETS_PATH = PROJECT_ROOT_PATH + os.sep + "final_datasets"

***
## Reading the data<br>

- The following function reads the CSV file created in the previous step and stores it in a _DataFrame_:<br>

In [3]:
def load_data(filename, separator, folder, path=FINAL_DATASETS_PATH):
    file_path = os.path.join(path, folder, filename)
    return pd.read_csv(file_path, sep=separator)

In [4]:
SUBFOLDER_METHOD_1 = "4_extraccion_caracteristicas" + os.sep + "metodo_1"

df_train = load_data("4_1_train_features_dataset.csv",',', SUBFOLDER_METHOD_1)
df_test = load_data("4_1_test_features_dataset.csv",',', SUBFOLDER_METHOD_1)

print(len(df_train))
print(len(df_test))

483977
207419


***
## Removing variables after analysis<br>

The previous section identified several features that carried little information. They are therefore removed from both the training and test subsets:<br>

In [5]:
# `columns=` already selects the axis, so `axis=1` is redundant
df_train = df_train.drop(columns=["contains_'_'", "contains_'~'"])
df_test = df_test.drop(columns=["contains_'_'", "contains_'~'"])

***
## Scaling the data<br>
https://benalexkeen.com/feature-scaling-with-scikit-learn/

In [6]:
non_binary_features = [
    "total_length", 
    "hostname_depth", 
    "domain_length", 
    "hostname_length", 
    "hostname_digits", 
    "n_special",
    "vowel_consonant_ratio",
    "digit_character_ratio"
]

#### - Min-max scaler

In [8]:
df_train.describe()

Unnamed: 0,total_length,is_ip,hostname_depth,domain_length,hostname_length,hostname_digits,n_special,www_prefix,vowel_consonant_ratio,digit_character_ratio,contains_'@',contains_'-',contains_'//',percent_encoding,is_shorten,bad_tld,malicious_extension,label
count,483977.0,483977.0,483977.0,483977.0,483977.0,483977.0,483977.0,483977.0,483977.0,483977.0,483977.0,483977.0,483977.0,483977.0,483977.0,483977.0,483977.0,483977.0
mean,51.037186,0.176688,3.233647,10.733909,19.373627,2.344835,7.494819,0.545082,0.172132,0.150653,0.004436,0.067053,0.001374,0.018937,3.9e-05,0.0254,0.204435,0.500001
std,34.319346,0.381405,0.864659,4.502968,9.581623,4.773879,4.516625,0.497964,0.110152,0.304295,0.066457,0.250114,0.037043,0.136302,0.006266,0.157337,0.403289,0.500001
min,7.0,0.0,1.0,1.0,4.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,31.0,0.0,3.0,7.0,14.0,0.0,5.0,0.0,0.125,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,44.0,0.0,3.0,10.0,18.0,0.0,6.0,1.0,0.181818,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
75%,62.0,0.0,4.0,14.0,21.0,0.0,9.0,1.0,0.235294,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
max,2307.0,1.0,21.0,63.0,240.0,105.0,209.0,1.0,0.875,0.9,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [12]:
# Train set: fit the scaler on the training data only
min_max_scaler = MinMaxScaler()
df_train[non_binary_features] = min_max_scaler.fit_transform(df_train[non_binary_features])

# Test set: reuse the min/max learned from the training data
# (test values outside the training range fall outside [0, 1])
df_test[non_binary_features] = min_max_scaler.transform(df_test[non_binary_features])
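
A common scikit-learn convention is to learn the minimum and maximum from the training subset only and reuse them for the test subset, so that both subsets are mapped with the same transformation. The following is a minimal sketch on made-up values (the `toy_train` and `toy_test` frames are hypothetical and are not the project's data). It shows that a test value above the training maximum is mapped above 1:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy data, for illustration only
toy_train = pd.DataFrame({"total_length": [10.0, 20.0, 30.0]})
toy_test = pd.DataFrame({"total_length": [15.0, 40.0]})

scaler = MinMaxScaler()
scaler.fit(toy_train)                         # learns min=10 and max=30 from the train set
toy_test_scaled = scaler.transform(toy_test)  # reuses the train min/max

# 15 maps to (15 - 10) / (30 - 10); 40 lies above the train maximum
print(toy_test_scaled.ravel())
```

Whether out-of-range test values are acceptable depends on the downstream algorithm, so this is a point to check rather than a fixed rule.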

In [12]:
df_train.describe()

Unnamed: 0,total_length,is_ip,hostname_depth,domain_length,hostname_length,hostname_digits,n_special,www_prefix,vowel_consonant_ratio,digit_character_ratio,contains_'@',contains_'-',contains_'//',percent_encoding,is_shorten,bad_tld,malicious_extension,label
count,483977.0,483977.0,483977.0,483977.0,483977.0,483977.0,483977.0,483977.0,483977.0,483977.0,483977.0,483977.0,483977.0,483977.0,483977.0,483977.0,483977.0,483977.0
mean,0.019147,0.176688,0.111682,0.156999,0.065142,0.022332,0.031225,0.545082,0.196722,0.167392,0.004436,0.067053,0.001374,0.018937,3.9e-05,0.0254,0.204435,0.500001
std,0.014921,0.381405,0.043233,0.072629,0.0406,0.045466,0.021715,0.497964,0.125888,0.338106,0.066457,0.250114,0.037043,0.136302,0.006266,0.157337,0.403289,0.500001
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.010435,0.0,0.1,0.096774,0.042373,0.0,0.019231,0.0,0.142857,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,0.016087,0.0,0.1,0.145161,0.059322,0.0,0.024038,1.0,0.207792,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
75%,0.023913,0.0,0.15,0.209677,0.072034,0.0,0.038462,1.0,0.268908,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
max,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
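
The scaled statistics can be checked by hand against the min-max formula, x' = (x - min) / (max - min). As a small sketch (the helper `min_max_scale` is a hypothetical name used only for this check), the 25th percentile of `total_length` is 31 before scaling, with a minimum of 7 and a maximum of 2307 in the first `describe()` output:

```python
def min_max_scale(x, x_min, x_max):
    """Map x linearly from the range [x_min, x_max] to [0, 1]."""
    return (x - x_min) / (x_max - x_min)

# total_length: min=7, max=2307, 25th percentile=31 (first describe() output)
scaled_q1 = min_max_scale(31.0, 7.0, 2307.0)
print(round(scaled_q1, 6))  # 0.010435, matching the scaled 25% row

# hostname_depth: min=1, max=21, 25th percentile=3
scaled_depth_q1 = min_max_scale(3.0, 1.0, 21.0)
print(round(scaled_depth_q1, 6))  # 0.1
```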


***
## Saving the data<br>
- We __save__ the contents of the final _DataFrame_ so that the machine learning algorithms can be run in the next step:<br>

In [13]:
def save_data(dataframe, filename, separator, folder, path=FINAL_DATASETS_PATH):
    file_path = os.path.join(path, folder, filename)
    dataframe.to_csv(file_path, sep=separator, index=False)

- Method 1

In [14]:
SUBFOLDER_METHOD_1 = "6_preprocesado_datos"

#Training set
save_data(df_train, "6_train_dataset.csv", ',', SUBFOLDER_METHOD_1)
#Testing set
save_data(df_test, "6_test_dataset.csv", ',', SUBFOLDER_METHOD_1)