# FEATURE EXTRACTION<br>

In this section we extract features from each URL using two different techniques. The result is a feature vector that the machine learning algorithms will use as input.

- We import the required _libraries_:

In [1]:
import re
import time
import timeit
import numpy as np

import random
import tldextract
import ipaddress
from urllib.parse import urlparse

from urllib.request import urlopen
import requests 
from bs4 import BeautifulSoup

- __Directories__ used:

In [2]:
import os
PROJECT_ROOT_PATH = "."
DATASETS_PATH = PROJECT_ROOT_PATH + os.sep + "datasets"
FINAL_DATASETS_PATH = PROJECT_ROOT_PATH + os.sep + "final_datasets"
RESOURCES_PATH = PROJECT_ROOT_PATH + os.sep + "resources"

***
## Reading the data<br>

- The following function reads the CSV file created in the previous step and stores it in a _DataFrame_:

In [3]:
import pandas as pd

def load_data_csv(filename, separator, folder, path=FINAL_DATASETS_PATH):
    file_path = os.path.join(path, folder + os.sep + filename)
    return pd.read_csv(file_path, sep=separator)

In [4]:
#Train dataset
df_train = load_data_csv("3_train_dataset.csv",',', "3_preparacion_datos")
df_train.head()

Unnamed: 0,url,label
0,https://www.athletics.mta.ca/varsity/football/...,0
1,https://drive.google.com/file/d/1uis0mbfzg1vrx...,1
2,http://xmotor.ir/localization/closed_section/v...,1
3,https://www.en.wikipedia.org/wiki/papineau_(mo...,0
4,https://www.eddiecibrian-actor.blogspot.com/,0


In [5]:
#Test dataset
df_test = load_data_csv("3_test_dataset.csv",',', "3_preparacion_datos")
df_test.head()

Unnamed: 0,url,label
0,http://www.marketseg.com.br/wp-content/uploads...,1
1,http://157.230.113.199/lnkfmx,1
2,http://116.114.95.44:35618/mozi.m,1
3,http://ccglass.co.za/cgi-bin/hkgru-nf0sp820cqw...,1
4,https://wildzone.it/,0


***
## Method 1: features intrinsic to the URLs<br>

In this first method we extract features of each URL that can be represented numerically or as boolean values (whether the host is an IP address, the number of special characters, ...). We __define__ each of the __functions__ used to extract the features:<br>

__1.__ __Total length__ of the URL<br>

In [6]:
def url_total_length(url):
    return len(url)

***
__2.__ The __hostname__ is an __IP address__<br>
<br>
We check whether the hostname is an __IPv4__ or __IPv6__ address (IPv4 is accepted in dotted-decimal or dotted-hexadecimal notation, optionally followed by a port)<br>

In [7]:
def domain_is_ip(hostname):
    try:
        #Plain IPv4 or IPv6 address
        ipaddress.ip_address(hostname)
        return 1
    except ValueError:
        pass
    #IPv4 in dotted-hexadecimal notation, e.g. 0xC0.0xA8.0x00.0x01
    if re.match(r"^(0[xX][0-9a-fA-F]{2}\.){3}0[xX][0-9a-fA-F]{2}$", hostname):
        return 1
    try:
        #IPv4 followed by a port, e.g. 116.114.95.44:35618
        ipaddress.ip_address(hostname.split(':')[0])
        return 1
    except ValueError:
        return 0
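
As a complement to the function above, the following is a minimal sketch of the same check written directly against a full URL. It relies on the `hostname` attribute of `urllib.parse.urlparse`, which already drops the port, the user information and the IPv6 brackets. The name `host_is_ip` is hypothetical and is not used elsewhere in this notebook; dotted-hexadecimal IPv4 is deliberately left out of the sketch.

```python
import ipaddress
from urllib.parse import urlparse

def host_is_ip(url):
    """Return 1 if the host of `url` parses as an IPv4/IPv6 address, else 0."""
    # .hostname strips the port, any user info and the brackets around IPv6
    host = urlparse(url).hostname
    if host is None:
        return 0
    try:
        ipaddress.ip_address(host)
        return 1
    except ValueError:
        return 0

print(host_is_ip("http://116.114.95.44:35618/mozi.m"))  # 1
print(host_is_ip("https://wildzone.it/"))               # 0
```

Unlike splitting on `':'`, this variant also handles a bracketed IPv6 host followed by a port.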

***
__3.__ __Depth__ of the __hostname__

In [8]:
def hostname_depth(hostname):
    return len(hostname.split('.'))

***
__4.__ __Length__ of the __domain__

In [9]:
def domain_length(domain):
    return len(domain)

***
__5.__ __Length__ of the full __hostname__

In [10]:
def hostname_length(hostname):
    return len(hostname)

***
__6.__ __Number__ of __digits__ in the __hostname__<br>

In [11]:
def digits_number(hostname):
    return sum(c.isdigit() for c in hostname)

***
__7.__ __Number__ of __special__ characters in the URL<br>

In [12]:
def special_characters_number(url):
    return sum((not c.isdigit() and not c.isalpha()) for c in url)

***
__8.__ URL contains the __www prefix__

In [13]:
def contains_www_prefix(hostname):
    if hostname.startswith("www."):
        return 1
    return 0

***
__9.__ __Vowel-consonant ratio__: computed as the fraction of vowels in each token of the subdomain and domain (vowels / token length), averaged over all tokens<br>

In [14]:
def vowel_consonant_ratio(subdomain, domain):
    tokens = subdomain.split('.')
    tokens.extend(domain.split('.'))
    
    vowels = ['a', 'e', 'i', 'o', 'u', 'A', 'E', 'I', 'O', 'U']
    
    ratio = 0

    for t in tokens:
        n_ratio = 0
        if len(t) == 0:
            ratio += 0
        else:
            for c in t:
                if c in vowels:
                    n_ratio += 1
            ratio += (n_ratio / len(t))
    
    return (ratio / len(tokens))

***
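__10.__ __Digit-letter ratio__: computed as the fraction of digits in each token of the subdomain and domain (digits / token length), averaged over all tokens<br>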

In [15]:
def digit_character_ratio(subdomain, domain):
    tokens = subdomain.split('.')
    tokens.extend(domain.split('.'))
    
    digits = ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']
    
    ratio = 0

    for t in tokens:
        n_ratio = 0
        if len(t) == 0:
            ratio += 0
        else:
            for c in t:
                if c in digits:
                    n_ratio += 1
            ratio += (n_ratio / len(t))
    
    return (ratio / len(tokens))
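
To make features 9 and 10 concrete, here is a small sketch that reproduces the same arithmetic with a single helper. The name `mean_char_fraction` is hypothetical. The token lists are an assumption about how `tldextract` splits the hostnames `www.athletics.mta.ca` and `157.230.113.199` (an empty subdomain for the IP address), chosen because they reproduce the values 0.222222 and 0.8 that appear in the feature tables further down.

```python
def mean_char_fraction(tokens, charset):
    """Average, over all tokens, of the fraction of characters found in charset."""
    if not tokens:
        return 0.0
    total = 0.0
    for token in tokens:
        if token:  # empty tokens contribute 0 but still count in the average
            total += sum(c in charset for c in token) / len(token)
    return total / len(tokens)

VOWELS = set("aeiouAEIOU")
DIGITS = set("0123456789")

# www -> 0/3, athletics -> 3/9, mta -> 1/3; mean = 2/9
print(round(mean_char_fraction(["www", "athletics", "mta"], VOWELS), 6))  # 0.222222
# one empty token plus four all-digit tokens; mean = 4/5
print(mean_char_fraction(["", "157", "230", "113", "199"], DIGITS))      # 0.8
```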

***
__11.__ URL contains the __'@' character__

In [16]:
def contains_arroba_char(url):
    if url.count('@') > 0:
        return 1
    return 0

In a URL, everything between the scheme and the __@__ character is treated as user information rather than as part of the host, so it is ignored when deciding where to connect. For example, the URL [http://www.google.es@http://www.badsite.com](http://www.google.es@http://www.badsite.com) will lead to [http://www.badsite.com](http://www.badsite.com), discarding [http://www.google.es](http://www.google.es).
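
The effect of '@' can be observed with the standard library parser. The sketch below uses a made-up URL (`www.badsite.com/login` is purely illustrative) and shows that the text before '@' ends up in `username`, while `hostname` is what follows it.

```python
from urllib.parse import urlparse

# Illustrative URL: the part before '@' looks like a trusted site
parsed = urlparse("http://www.google.es@www.badsite.com/login")

print(parsed.netloc)    # www.google.es@www.badsite.com
print(parsed.username)  # www.google.es
print(parsed.hostname)  # www.badsite.com
```

Note that `netloc` still contains the user information, which is why the functions in this notebook that receive `netloc` as the hostname also see the text before '@'.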

***
- Hostname contains __unreserved characters__:<br>
__12:__ Character '-'<br>
__13:__ Character '\_'<br>
__14:__ Character '~'<br>

In [17]:
#Characters '-','_','~'
def contains_unreserved_char(hostname, unreserved_char):
    if hostname.count(unreserved_char) > 0:
        return 1
    return 0

Attackers often use these characters in the domain name to make malicious URLs look legitimate.

***
__15.__ URL contains the __characters ' // '__

In [18]:
def contains_double_slash(url):
    aux = url.split('://')
    for i in aux:
        if i.count('//') != 0:
            return 1
    return 0

***
__16.__ URL contains __hexadecimal-encoded characters__ (_"percent encoding"_)

Reference: https://developer.mozilla.org/en-US/docs/Glossary/percent-encoding<br>
The dictionary below maps the percent-encoded form of the reserved and special characters in a URL to the character each one represents.<br>

In [19]:
codified_characters = {
    '%3A': ':',
    '%2F': '/',
    '%23': '#',
    '%3F': '?',
    '%20': ' ',
    '%24': '$',
    '%40': '@',
    '%25': '%',
    '%2B': '+',
    '%3B': ';',
    '%3D': '=',
    '%26': '&',
    '%2C': ',',
    '%3C': '<',
    '%3E': '>',
    '%5E': '^',
    '%60': '`',
    '%5C': '\\',
    '%5B': '[',
    '%5D': ']',
    '%7B': '{',
    '%7D': '}',
    '%7C': '|',
    '%22': '"'
}

def contains_percent_encoding(url, codified_characters):
    n = 0
    for i in codified_characters.keys():
        x = re.findall('[a-zA-Z]', i)
        if (len(x)==0):
            n += url.count(i)
        else:
            n += url.count(str.upper(i))
            n += url.count(str.lower(i)) 
    if n!=0:
        return 1
    return 0
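
As an alternative to counting a fixed list of escapes, a regular expression can flag any `%XX` sequence. The sketch below is a broader check than the function above, not a replacement for it: it matches every percent escape, not only the ones in `codified_characters`. The name `has_percent_escape` is hypothetical and `example.com` is an illustrative host.

```python
import re
from urllib.parse import unquote

# '%' followed by exactly two hexadecimal digits, in either case
PERCENT_ESCAPE = re.compile(r"%[0-9A-Fa-f]{2}")

def has_percent_escape(url):
    """Return 1 if the URL contains at least one percent-encoded byte, else 0."""
    return 1 if PERCENT_ESCAPE.search(url) else 0

print(has_percent_escape("http://example.com/a%2fb"))  # 1
print(has_percent_escape("http://example.com/100%"))   # 0
print(unquote("a%2Fb%20c"))                            # a/b c
```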

***
__17.__ Use of __URL shortening__ services<br>

- We define a function that reads a text file and stores its lines in a list:

In [20]:
def load_data_txt(filename, path=RESOURCES_PATH):
    file_path = os.path.join(path, filename)
    data_list = []
    with open(file_path, 'r') as file:
        data_list = [line.rstrip() for line in file]
    return data_list

In [21]:
url_short_services = load_data_txt("url_shorteners.txt")

def is_shortened_url(hostname, url_short_services):
    #Detects whether the hostname belongs to a URL shortening service
    for s in url_short_services:
        if s == hostname:
            return 1
    return 0
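
Because this check runs once per URL, a set lookup is a natural alternative to the linear scan. The sketch below assumes the shortener list holds bare hostnames in lower case; the two entries shown are illustrative and are not taken from `url_shorteners.txt`. The name `is_shortened` is hypothetical.

```python
def is_shortened(hostname, shorteners):
    """Return 1 if hostname (ignoring case and a leading 'www.') is a known shortener."""
    host = hostname.lower()
    if host.startswith("www."):
        host = host[4:]
    return 1 if host in shorteners else 0

# Illustrative entries only
shorteners = frozenset(["bit.ly", "tinyurl.com"])

print(is_shortened("bit.ly", shorteners))              # 1
print(is_shortened("bit.ly.example.com", shorteners))  # 0
```

Normalising case and the `www.` prefix is a design choice of this sketch; the exact-match function above treats `www.bit.ly` and `bit.ly` as different hosts.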

***
__18.__ Contains a __frequently abused TLD__:<br>

In [22]:
most_abused_tlds = load_data_txt("most_abused_tld.txt")

def contains_abused_tld(hostname, most_abused_tlds):
    for tld in most_abused_tlds:
        if hostname.endswith(tld):
            return 1
    return 0
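
A suffix test with `endswith` can also match hostnames whose last label merely ends with the same letters. The sketch below compares the last label exactly instead. It assumes the TLD list stores entries without a leading dot, which may not match the format of `most_abused_tld.txt`; the entries and the name `has_abused_tld` are hypothetical.

```python
def has_abused_tld(hostname, abused_tlds):
    """Return 1 if the last label of hostname is exactly one of abused_tlds."""
    # Drop an optional port and a trailing dot, then keep the last label
    host = hostname.split(":")[0].rstrip(".").lower()
    last_label = host.rsplit(".", 1)[-1]
    return 1 if last_label in abused_tlds else 0

# Illustrative entries only, stored without the leading dot
abused = {"tk", "xyz"}

print(has_abused_tld("login.example.tk", abused))  # 1
print(has_abused_tld("example.ztk", abused))       # 0, although it ends with "tk"
```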

***
__19.__ Files with __malicious extensions__<br>

1- [https://its.uiowa.edu/support/article/1348](https://its.uiowa.edu/support/article/1348)

2- [https://www.file-extensions.org/filetype/extension/name/dangerous-malicious-files](https://www.file-extensions.org/filetype/extension/name/dangerous-malicious-files)

We read the file __malicious_extensions.txt__, which lists the file extensions considered potentially malicious

In [23]:
suspect_extensions = load_data_txt("malicious_extensions.txt")

def contains_malicious_extension(url, suspect_extensions):
    #Checks whether the URL ends with a file extension from the suspicious list
    for extension in suspect_extensions:
        if (url.endswith('.'+extension) or url.endswith('.'+extension+'/') or url.endswith('.'+extension+'%2F') or url.endswith('.'+extension+'%2f')):
            return 1
    return 0
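
Since the function above tests the end of the whole URL, a URL that is just a host, such as `https://www.eddiecibrian-actor.blogspot.com/`, can match an extension like `com` (row 4 of the training features further down has `malicious_extension` = 1, presumably for this reason). A possible refinement, sketched here under the assumption that only the path should be inspected, extracts the extension from the parsed path. The name `path_extension` is hypothetical and `example.com` is an illustrative host.

```python
import posixpath
from urllib.parse import urlparse, unquote

def path_extension(url):
    """Return the lower-cased extension of the last path segment, or '' if none."""
    # Decode %2F and similar escapes, then ignore trailing slashes
    path = unquote(urlparse(url).path).rstrip("/")
    return posixpath.splitext(path)[1].lstrip(".").lower()

print(path_extension("http://116.114.95.44:35618/mozi.m"))  # m
print(path_extension("http://example.com/setup.EXE%2F"))    # exe
print(path_extension("https://wildzone.it/"))               # (empty string)
```

The returned extension can then be looked up in the list read from `malicious_extensions.txt`.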

***
- __Main function__ that combines all the functions above:<br>

In [24]:
def extract_features(url, label):
    features = []
    
    #Prepend a scheme when it is missing so that urlparse can locate the hostname
    if not url.startswith('http://') and not url.startswith('https://'):
        url = "http://" + url
    
    #URL without its scheme (everything after the first '://')
    url_without_scheme = url.split('://', 1)[1]
    try:
        parsed_uri = urlparse(url)
    except ValueError:
        #urlparse raises ValueError for malformed URLs (e.g. an invalid IPv6 literal)
        print(url)
        raise
    hostname = parsed_uri.netloc
    subdomain, domain, tld = tldextract.extract('{uri.scheme}://{uri.netloc}/'.format(uri=parsed_uri))
    
    #1. Total length of the URL
    features.append(url_total_length(url_without_scheme))
    
    #2. The hostname is an IP address
    features.append(domain_is_ip(hostname))
    
    #3. Depth of the hostname
    features.append(hostname_depth(hostname))
    
    #4. Length of the domain
    features.append(domain_length(domain))
    
    #5. Length of the hostname
    features.append(hostname_length(hostname))
    
    #6. Number of digits in the hostname
    features.append(digits_number(hostname))
    
    #7. Number of special characters in the URL
    features.append(special_characters_number(url_without_scheme))
    
    #8. Hostname contains the www prefix
    features.append(contains_www_prefix(hostname))
    
    #9. Vowel-consonant ratio
    features.append(vowel_consonant_ratio(subdomain, domain))
    
    #10. Digit-letter ratio
    features.append(digit_character_ratio(subdomain, domain))
    
    #11. URL contains the '@' character
    features.append(contains_arroba_char(url_without_scheme))
    
    #12. Hostname contains the '-' character
    features.append(contains_unreserved_char(hostname, '-'))
    
    #13. Hostname contains the '_' character
    features.append(contains_unreserved_char(hostname, '_'))
    
    #14. Hostname contains the '~' character
    features.append(contains_unreserved_char(hostname, '~'))
    
    #15. URL contains the characters '//'
    features.append(contains_double_slash(url_without_scheme))
    
    #16. URL contains hexadecimal-encoded characters ("percent encoding")
    features.append(contains_percent_encoding(url, codified_characters))
    
    #17. Use of URL shortening services
    features.append(is_shortened_url(hostname, url_short_services))
    
    #18. Contains a frequently abused TLD
    features.append(contains_abused_tld(hostname, most_abused_tlds))
    
    #19. Files with malicious extensions
    features.append(contains_malicious_extension(url, suspect_extensions))
    
    features.append(label)
    
    return features

- _URLFeatureExtractor_ __class__:<br>

In [25]:
from sklearn.base import BaseEstimator, TransformerMixin

class URLFeatureExtractor(BaseEstimator, TransformerMixin):
    
    def __init__(self):
        #No hyperparameters to store. transform() reads the lookup tables
        #(url_short_services, most_abused_tlds, suspect_extensions and
        #codified_characters) from the module-level variables defined in the
        #cells above, so nothing needs to be loaded here.
        pass
    
    @staticmethod
    def append_http(url):
        if not url.startswith('http://') and not url.startswith('https://'):
            return "http://" + url
        return url
    
    @staticmethod    
    def obtain_url_without_scheme(url):
        url_without_scheme = url.split('://')
        url_without_scheme.pop(0)
        return '://'.join(str(e) for e in url_without_scheme)
    
    @staticmethod
    def parse_url(url, part):
        parsed_uri = urlparse(url)
        if part == 'scheme':
            return '{uri.scheme}://'.format(uri=parsed_uri)
        if part == 'hostname':
            return '{uri.netloc}'.format(uri=parsed_uri)
        if part == 'subdomain':
            return tldextract.extract('{uri.scheme}://{uri.netloc}/'.format(uri=parsed_uri))[0]
        if part == 'domain':
            return tldextract.extract('{uri.scheme}://{uri.netloc}/'.format(uri=parsed_uri))[1]
        if part == 'tld':
            return tldextract.extract('{uri.scheme}://{uri.netloc}/'.format(uri=parsed_uri))[2]
    
    @staticmethod
    def url_total_length(url):
        return len(url)
    
    @staticmethod
    def domain_is_ip(hostname):
        try:
            #Plain IPv4 or IPv6 address
            ipaddress.ip_address(hostname)
            return 1
        except ValueError:
            pass
        #IPv4 in dotted-hexadecimal notation, e.g. 0xC0.0xA8.0x00.0x01
        if re.match(r"^(0[xX][0-9a-fA-F]{2}\.){3}0[xX][0-9a-fA-F]{2}$", hostname):
            return 1
        try:
            #IPv4 followed by a port, e.g. 116.114.95.44:35618
            ipaddress.ip_address(hostname.split(':')[0])
            return 1
        except ValueError:
            return 0
    
    @staticmethod
    def hostname_depth(hostname):
        return len(hostname.split('.'))
    
    @staticmethod
    def domain_length(domain):
        return len(domain)
    
    @staticmethod
    def hostname_length(hostname):
        return len(hostname)
    
    @staticmethod
    def digits_number(hostname):
        return sum(c.isdigit() for c in hostname)
    
    @staticmethod
    def special_characters_number(url):
        return sum((not c.isdigit() and not c.isalpha()) for c in url)
    
    @staticmethod
    def contains_www_prefix(hostname):
        if hostname.startswith("www."):
            return 1
        return 0
    
    @staticmethod
    def onion_domain(hostname):
        if hostname.endswith("onion"):
            return 1
        return 0
    
    @staticmethod
    def contains_arroba_char(url):
        if url.count('@') > 0:
            return 1
        return 0
    
    @staticmethod
    def contains_unreserved_char(hostname, unreserved_char):
        if hostname.count(unreserved_char) > 0:
            return 1
        return 0
    
    @staticmethod
    def contains_double_slash(url):
        aux = url.split('://')
        for i in aux:
            if i.count('//') != 0:
                return 1
        return 0
    
    @staticmethod
    def contains_percent_encoding(url, codified_characters):
        n = 0
        for i in codified_characters.keys():
            x = re.findall('[a-zA-Z]', i)
            if (len(x)==0):
                n += url.count(i)
            else:
                n += url.count(str.upper(i))
                n += url.count(str.lower(i)) 
        if n!=0:
            return 1
        return 0
    
    @staticmethod
    def is_shortened_url(hostname, url_short_services):
        #Detects whether the hostname belongs to a URL shortening service
        for s in url_short_services:
            if s == hostname:
                return 1
        return 0
    
    @staticmethod
    def contains_abused_tld(hostname, most_abused_tlds):
        for tld in most_abused_tlds:
            if hostname.endswith(tld):
                return 1
        return 0
    
    @staticmethod
    def contains_malicious_extension(url, suspect_extensions):
        #Checks whether the URL ends with a file extension from the suspicious list
        for extension in suspect_extensions:
            if (url.endswith('.'+extension) or url.endswith('.'+extension+'/') or url.endswith('.'+extension+'%2F') or url.endswith('.'+extension+'%2f')):
                return 1
        return 0
    
    def fit(self, df, y=None):
        """Returns `self` unless something different happens in train and test"""
        return self
    
    def transform(self, df, y=None):
        """Builds the feature DataFrame from the 'url' and 'label' columns of `df`"""
        #Normalise into new Series so the caller's DataFrame is left untouched
        urls = df['url'].apply(self.append_http)
        no_scheme = urls.apply(self.obtain_url_without_scheme)
        hostname = urls.apply(self.parse_url, part = 'hostname')
        
        transformed_df = pd.DataFrame(index=df.index)
        transformed_df['total_length'] = no_scheme.apply(self.url_total_length)
        transformed_df['is_ip'] = hostname.apply(self.domain_is_ip)
        transformed_df['hostname_depth'] = hostname.apply(self.hostname_depth)
        transformed_df['domain_length'] = urls.apply(self.parse_url, part = 'domain').apply(self.domain_length)
        transformed_df['hostname_length'] = hostname.apply(self.hostname_length)
        transformed_df['hostname_digits'] = hostname.apply(self.digits_number)
        transformed_df['n_special'] = no_scheme.apply(self.special_characters_number)
        transformed_df['www_prefix'] = hostname.apply(self.contains_www_prefix)
        transformed_df['onion_domain'] = urls.apply(self.parse_url, part = 'tld').apply(self.onion_domain)
        transformed_df['contains_\'@\''] = no_scheme.apply(self.contains_arroba_char)
        transformed_df['contains_\'-\''] = hostname.apply(self.contains_unreserved_char, unreserved_char = '-')
        transformed_df['contains_\'_\''] = hostname.apply(self.contains_unreserved_char, unreserved_char = '_')
        transformed_df['contains_\'~\''] = hostname.apply(self.contains_unreserved_char, unreserved_char = '~')
        transformed_df['contains_\'//\''] = no_scheme.apply(self.contains_double_slash)
        #The lookup tables are the module-level variables defined in the cells above
        transformed_df['percent_encoding'] = no_scheme.apply(self.contains_percent_encoding, codified_characters = codified_characters)
        transformed_df['is_shorten'] = hostname.apply(self.is_shortened_url, url_short_services = url_short_services)
        transformed_df['bad_tld'] = hostname.apply(self.contains_abused_tld, most_abused_tlds = most_abused_tlds)
        transformed_df['malicious_extension'] = no_scheme.apply(self.contains_malicious_extension, suspect_extensions = suspect_extensions)
        transformed_df['label'] = df['label']
        
        return transformed_df


- We apply the feature extraction to the training and test sets:<br>

In [None]:
#Train dataset
ufe = URLFeatureExtractor()
%time df_train_features = ufe.transform(df_train)

#Test dataset
ufe = URLFeatureExtractor()
%time df_test_features = ufe.transform(df_test)

In [26]:
#Train dataset
start = timeit.default_timer()

list_urls = []
for i in range(len(df_train)):
    list_urls.append(extract_features(df_train["url"].loc[i], df_train["label"].loc[i]))

columns = ["total_length","is_ip","hostname_depth","domain_length","hostname_length","hostname_digits","n_special","www_prefix","vowel_consonant_ratio","digit_character_ratio", "contains_'@'","contains_'-'","contains_'_'","contains_'~'","contains_'//'","percent_encoding","is_shorten","bad_tld","malicious_extension","label"]

df_train_features = pd.DataFrame(list_urls, columns=columns)
stop = timeit.default_timer()
print('Time: ', stop - start)

df_train_features.head()

Time:  219.0090668


Unnamed: 0,total_length,is_ip,hostname_depth,domain_length,hostname_length,hostname_digits,n_special,www_prefix,vowel_consonant_ratio,digit_character_ratio,contains_'@',contains_'-',contains_'_',contains_'~',contains_'//',percent_encoding,is_shorten,bad_tld,malicious_extension,label
0,46,0,4,3,20,0,7,1,0.222222,0.0,0,0,0,0,0,0,0,0,0,0
1,57,0,3,6,16,0,5,0,0.45,0.0,0,0,0,0,0,0,0,0,0,1
2,76,0,2,6,9,0,9,0,0.166667,0.0,0,0,0,0,0,0,0,0,0,1
3,51,0,4,9,20,0,9,1,0.351852,0.0,0,0,0,0,0,0,0,0,0,0
4,36,0,4,8,35,0,5,1,0.231481,0.0,0,1,0,0,0,0,0,0,1,0


In [27]:
#Test dataset
start = timeit.default_timer()

list_urls = []
for i in range(len(df_test)):
    list_urls.append(extract_features(df_test["url"].loc[i], df_test["label"].loc[i]))

columns = ["total_length","is_ip","hostname_depth","domain_length","hostname_length","hostname_digits","n_special","www_prefix","vowel_consonant_ratio","digit_character_ratio","contains_'@'","contains_'-'","contains_'_'","contains_'~'","contains_'//'","percent_encoding","is_shorten","bad_tld","malicious_extension","label"]

df_test_features = pd.DataFrame(list_urls, columns=columns)
stop = timeit.default_timer()
print('Time: ', stop - start)

df_test_features.head()

Time:  128.04439560000003


Unnamed: 0,total_length,is_ip,hostname_depth,domain_length,hostname_length,hostname_digits,n_special,www_prefix,vowel_consonant_ratio,digit_character_ratio,contains_'@',contains_'-',contains_'_',contains_'~',contains_'//',percent_encoding,is_shorten,bad_tld,malicious_extension,label
0,57,0,4,9,20,0,10,1,0.166667,0.0,0,0,0,0,0,0,0,0,0,1
1,22,1,4,15,15,12,4,0,0.0,0.8,0,0,0,0,0,0,0,0,0,1
2,26,1,4,13,19,15,6,0,0.0,0.8,0,0,0,0,0,0,0,0,0,1
3,55,0,3,7,13,0,9,0,0.071429,0.0,0,0,0,0,0,0,0,0,0,1
4,12,0,2,8,11,0,2,0,0.1875,0.0,0,0,0,0,0,0,0,0,0,0


## Saving the data<br>

- We __save__ the final _DataFrames_ with the extracted features so that they can be loaded in the next step: <br>

In [28]:
import pandas as pd

def save_data(dataframe, filename, separator, folder, path=FINAL_DATASETS_PATH):
    file_path = os.path.join(path, folder + os.sep + filename)
    dataframe.to_csv(file_path, sep=separator, index=False)

In [29]:
SUBFOLDER_METHOD_1 = "4_extraccion_caracteristicas" + os.sep + "metodo_1"

#Train set
save_data(df_train_features, "4_1_train_features_dataset.csv", ',', SUBFOLDER_METHOD_1)

#Test set
save_data(df_test_features, "4_1_test_features_dataset.csv", ',', SUBFOLDER_METHOD_1)

***
## Method 2: Bag of Words<br>

We will __use__ the CountVectorizer and TfidfVectorizer techniques:<br>

- CountVectorizer<br>
https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer<br>
https://kavita-ganesan.com/how-to-use-countvectorizer/#Example-of-How-CountVectorizer-Works<br>
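
To illustrate what a binary bag of words produces, the following pure-Python sketch mimics the behaviour relevant to this notebook: tokens are runs of two or more letters, a term is kept only if it appears in at least `min_df` documents, and each document becomes a 0/1 vector. It is an illustration of the idea and not scikit-learn's implementation; the names `build_vocabulary` and `binary_vector` and the toy documents are hypothetical.

```python
import re
from collections import Counter

# Same token pattern that the cells below pass to CountVectorizer
TOKEN_RE = re.compile(r"[a-zA-Z]{2,}")

def build_vocabulary(docs, min_df=1):
    """Map each term present in at least min_df documents to a column index."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update({t.lower() for t in TOKEN_RE.findall(doc)})
    kept = sorted(term for term, n in doc_freq.items() if n >= min_df)
    return {term: idx for idx, term in enumerate(kept)}

def binary_vector(doc, vocabulary):
    """Return a 0/1 list marking which vocabulary terms occur in doc."""
    present = {t.lower() for t in TOKEN_RE.findall(doc)}
    vec = [0] * len(vocabulary)
    for term, idx in vocabulary.items():
        if term in present:
            vec[idx] = 1
    return vec

docs = ["https", "http", "http", "https", "ftp"]
vocab = build_vocabulary(docs, min_df=2)
print(vocab)                          # {'http': 0, 'https': 1}
print(binary_vector("https", vocab))  # [0, 1]
print(binary_vector("ftp", vocab))    # [0, 0]
```

In this toy example `ftp` is dropped because it appears in fewer than `min_df` documents, so a document containing only `ftp` is encoded as all zeros.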

In [11]:
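#KFold splits the data into k consecutive folds for cross-validation
from sklearn.model_selection import KFold
#cross_val_score evaluates an estimator by cross-validation and returns the score of each fold
from sklearn.model_selection import cross_val_score
#GridSearchCV runs an exhaustive, cross-validated search over a parameter grid
from sklearn.model_selection import GridSearchCV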

from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

from sklearn.pipeline import FeatureUnion

- Initially we have a single column (URL). We create a new DataFrame with six columns (protocol, subdomain, domain, TLD, path and query), each holding one part of the URL, plus the label. This lets us apply the Bag of Words model to each part independently of the others: <br>

In [8]:
import tldextract
from urllib.parse import urlparse

def split_url_parts(df):
    #Builds a DataFrame with one column per URL part, plus the label
    data = []
    for index, row in df.iterrows():
        parsed_uri = urlparse(row['url'])
        #tldextract splits the network location into subdomain, domain and TLD
        extracted = tldextract.extract('{uri.scheme}://{uri.netloc}/'.format(uri=parsed_uri))
        subdomain, domain, tld = extracted[0], extracted[1], extracted[2]
        data.append([parsed_uri.scheme, subdomain, domain, tld, parsed_uri.path, parsed_uri.query, row['label']])
    
    columns = ['protocolo','subdominio', 'dominio', 'tld', 'path', 'query','label']
    return pd.DataFrame(data, columns=columns)

#New DataFrame for the training set
parts_df_train = split_url_parts(df_train)

#New DataFrame for the test set
parts_df_test = split_url_parts(df_test)
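
For reference, this is how the standard library splits a URL into the parts used above. The sketch covers only the `urlparse` side; the further split of the network location into subdomain, domain and TLD is done by `tldextract` and is not reproduced here. The URL and the name `url_parts` are illustrative.

```python
from urllib.parse import urlparse

def url_parts(url):
    """Return the scheme, network location, path and query of a URL as a dict."""
    parsed = urlparse(url)
    return {
        "scheme": parsed.scheme,
        "netloc": parsed.netloc,
        "path": parsed.path,
        "query": parsed.query,
    }

print(url_parts("https://www.example.com/wiki/page?lang=en"))
# {'scheme': 'https', 'netloc': 'www.example.com', 'path': '/wiki/page', 'query': 'lang=en'}
```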

- We replace NaN values with the empty string (""): <br>

In [None]:
url_part_columns = ['protocolo', 'subdominio', 'dominio', 'tld', 'path', 'query']
parts_df_train[url_part_columns] = parts_df_train[url_part_columns].fillna('')
parts_df_test[url_part_columns] = parts_df_test[url_part_columns].fillna('')

- We apply _CountVectorizer_ to each column of the training subset: <br>

1. Protocol<br>

In [13]:
from sklearn.feature_extraction.text import CountVectorizer
cv_scheme = CountVectorizer(ngram_range=(1,1), min_df=50, token_pattern=r'[a-zA-Z]{2,}', max_features=2,binary=True)

X_train = cv_scheme.fit_transform(parts_df_train['protocolo'])
y_train = parts_df_train['label']

X_test = parts_df_test['protocolo']
y_test = parts_df_test['label']
X_test = cv_scheme.transform(X_test)

# creating and training logistic regression model
from sklearn.linear_model import LogisticRegression
logreg_1 = LogisticRegression(max_iter=1000)
logreg_1.fit(X_train, y_train)
y_predicted_1 = logreg_1.predict(X_test)

print('Puntuación: ' + str(accuracy_score(y_test, y_predicted_1)))
print(confusion_matrix(y_test, y_predicted_1))
print(classification_report(y_test, y_predicted_1))
tn, fp, fn, tp = confusion_matrix(y_test, y_predicted_1).ravel()
print(tp)
print(tn)
print(fp)
print(fn)

Puntuación: 0.9416205844209065
[[103709      1]
 [ 12108  91601]]
              precision    recall  f1-score   support

           0       0.90      1.00      0.94    103710
           1       1.00      0.88      0.94    103709

    accuracy                           0.94    207419
   macro avg       0.95      0.94      0.94    207419
weighted avg       0.95      0.94      0.94    207419

91601
103709
1
12108
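
The figures in the report above can be recomputed from the four confusion-matrix counts. The sketch below does this arithmetic for the protocol model, using the counts printed by the cell (tn=103709, fp=1, fn=12108, tp=91601); the helper name `metrics_from_confusion` is hypothetical.

```python
def metrics_from_confusion(tn, fp, fn, tp):
    """Accuracy plus per-class precision and recall for a binary confusion matrix."""
    total = tn + fp + fn + tp
    return {
        "accuracy": (tp + tn) / total,
        "precision_1": tp / (tp + fp),
        "recall_1": tp / (tp + fn),
        "precision_0": tn / (tn + fn),
        "recall_0": tn / (tn + fp),
    }

m = metrics_from_confusion(tn=103709, fp=1, fn=12108, tp=91601)
print(round(m["accuracy"], 4))     # 0.9416
print(round(m["recall_1"], 2))     # 0.88
print(round(m["precision_0"], 2))  # 0.9
```

The low recall for class 1 comes from the 12108 malicious URLs that the protocol alone cannot separate from benign ones.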


In [14]:
cv_scheme.vocabulary_

{'https': 1, 'http': 0}

In [15]:
cv_scheme.stop_words_

{'ftp'}

2. Subdomain: <br>

In [16]:
from sklearn.feature_extraction.text import CountVectorizer
cv_subdomain = CountVectorizer(ngram_range=(1,2), min_df=200, token_pattern=r'[a-zA-Z]{2,}', max_features=100,binary=True)

X_train = cv_subdomain.fit_transform(parts_df_train['subdominio'])
y_train = parts_df_train['label']

X_test = parts_df_test['subdominio']
y_test = parts_df_test['label']
X_test = cv_subdomain.transform(X_test)

# creating and training logistic regression model
from sklearn.linear_model import LogisticRegression
logreg_2 = LogisticRegression(max_iter=1000)
logreg_2.fit(X_train, y_train)
y_predicted_2 = logreg_2.predict(X_test)

print('Puntuación: ' + str(accuracy_score(y_test, y_predicted_2)))
print(confusion_matrix(y_test, y_predicted_2))
print(classification_report(y_test, y_predicted_2))
tn, fp, fn, tp = confusion_matrix(y_test, y_predicted_2).ravel()
print(tp)
print(tn)
print(fp)
print(fn)

Puntuación: 0.9536590187012761
[[103489    221]
 [  9391  94318]]
              precision    recall  f1-score   support

           0       0.92      1.00      0.96    103710
           1       1.00      0.91      0.95    103709

    accuracy                           0.95    207419
   macro avg       0.96      0.95      0.95    207419
weighted avg       0.96      0.95      0.95    207419

94318
103489
221
9391


In [17]:
cv_subdomain.vocabulary_

{'www': 78,
 'drive': 16,
 'en': 21,
 'www en': 84,
 'com': 10,
 'dictionary': 12,
 'www dictionary': 83,
 'cmail': 5,
 'oln': 49,
 'outbound': 53,
 'protection': 56,
 'sketchwefair': 62,
 'watduoliprudential': 73,
 'watchdogdns': 71,
 'cmail oln': 6,
 'oln outbound': 50,
 'outbound protection': 54,
 'protection sketchwefair': 57,
 'sketchwefair watduoliprudential': 63,
 'watduoliprudential com': 74,
 'com watchdogdns': 11,
 'ca': 3,
 'www ca': 82,
 'zajcmail': 98,
 'zajcmail oln': 99,
 'articles': 0,
 'www articles': 79,
 'local': 40,
 'www local': 90,
 'docs': 15,
 'folkbjnrwwww': 24,
 'folkbjnrwwww watchdogdns': 25,
 'wiki': 77,
 'www wiki': 97,
 'my': 47,
 'web': 75,
 'tracking': 68,
 'cocomputewww': 8,
 'web tracking': 76,
 'tracking cocomputewww': 69,
 'cocomputewww watchdogdns': 9,
 'uk': 70,
 'www uk': 96,
 'music': 46,
 'www music': 92,
 'freepages': 26,
 'genealogy': 28,
 'rootsweb': 58,
 'www freepages': 87,
 'freepages genealogy': 27,
 'genealogy rootsweb': 29,
 'duoliprude

In [18]:
cv_subdomain.stop_words_

{'social technet',
 'vaudreuil',
 'file tancyo',
 'fileonline',
 'daston',
 'www fairgozzisurgical',
 'killercoversoftheweek',
 'tradeshow',
 'renerussonude',
 'www cinf',
 'www kurtoskalacs',
 'inertiatours com',
 'verification accounts',
 'ejcdqrh',
 'union',
 'www fromtheaeroplaneoverthesea',
 'hexadl',
 'unistore',
 'www maryannsmanymusings',
 'www gonzalolira',
 'samoan',
 'constellation collectors',
 'serviceesssmaillling',
 'kids',
 'openinstall',
 'itunes store',
 'indianahighschoolfootballhuddle',
 'randy phillips',
 'bollywood',
 'industrialtyrelcompany',
 'www forums',
 'kleedkamersinkiness',
 'www ng',
 'www drewfriedman',
 'www bbk',
 'mortified',
 'download cryptonet',
 'nhaidet',
 'www education',
 'celebritybabies celebfan',
 'autoslalom',
 'newactdoconline',
 'www glenn',
 'www praguejazz',
 'de stba',
 'mizosiri web',
 'cthunter my',
 'barbearialumber',
 'www georgetown',
 'www postdoc',
 'athensastoria',
 'fantasy metal',
 'www billnelson',
 'www kathiebracy',
 'arie

3. Domain: <br>

In [23]:
from sklearn.feature_extraction.text import CountVectorizer
cv_domain = CountVectorizer(ngram_range=(1,2), min_df=500, token_pattern=r'[a-zA-Z]{2,}', max_features=1000,binary=True)

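#Fit on the 'dominio' column.
#NOTE: the output recorded below was obtained with the 'subdominio' column,
#so it mirrors step 2; rerun this cell to refresh it.
X_train = cv_domain.fit_transform(parts_df_train['dominio'])
y_train = parts_df_train['label']

X_test = parts_df_test['dominio']
y_test = parts_df_test['label']
X_test = cv_domain.transform(X_test)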

# creating and training logistic regression model
from sklearn.linear_model import LogisticRegression
logreg_3 = LogisticRegression(max_iter=1000)
logreg_3.fit(X_train, y_train)
y_predicted_3 = logreg_3.predict(X_test)

# Report accuracy, the confusion matrix and the per-class metrics
print('Accuracy: ' + str(accuracy_score(y_test, y_predicted_3)))
print(confusion_matrix(y_test, y_predicted_3))
print(classification_report(y_test, y_predicted_3))
# Print the individual counts in the order tp, tn, fp, fn
tn, fp, fn, tp = confusion_matrix(y_test, y_predicted_3).ravel()
print(tp)
print(tn)
print(fp)
print(fn)

Accuracy: 0.9536686610194823
[[103489    221]
 [  9389  94320]]
              precision    recall  f1-score   support

           0       0.92      1.00      0.96    103710
           1       1.00      0.91      0.95    103709

    accuracy                           0.95    207419
   macro avg       0.96      0.95      0.95    207419
weighted avg       0.96      0.95      0.95    207419

94320
103489
221
9389
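
The four numbers printed after the report are, in order, `tp`, `tn`, `fp` and `fn`. As a quick sanity check, the accuracy and the class-1 precision and recall can be recomputed from those counts. The helper name `metrics_from_counts` is only illustrative and is not part of the notebook:

```python
def metrics_from_counts(tp, tn, fp, fn):
    """Recompute accuracy, precision and recall for the positive class (label 1)."""
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
    }

# Counts printed by the cell above, in the order tp, tn, fp, fn.
metrics = metrics_from_counts(tp=94320, tn=103489, fp=221, fn=9389)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Rounded to two decimals, these values should agree with the row for class 1 in the classification report.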


In [24]:
cv_domain.vocabulary_

{'www': 55,
 'drive': 10,
 'en': 15,
 'www en': 59,
 'com': 6,
 'oln': 31,
 'outbound': 34,
 'protection': 37,
 'sketchwefair': 42,
 'watduoliprudential': 51,
 'watchdogdns': 49,
 'oln outbound': 32,
 'outbound protection': 35,
 'protection sketchwefair': 38,
 'sketchwefair watduoliprudential': 43,
 'watduoliprudential com': 52,
 'com watchdogdns': 7,
 'ca': 3,
 'www ca': 58,
 'articles': 0,
 'www articles': 56,
 'local': 25,
 'docs': 9,
 'folkbjnrwwww': 17,
 'folkbjnrwwww watchdogdns': 18,
 'wiki': 54,
 'www wiki': 68,
 'web': 53,
 'cocomputewww': 4,
 'cocomputewww watchdogdns': 5,
 'uk': 48,
 'www uk': 67,
 'music': 30,
 'www music': 64,
 'freepages': 19,
 'genealogy': 20,
 'rootsweb': 39,
 'www freepages': 61,
 'duoliprudential': 13,
 'duoliprudential com': 14,
 'rsmart': 40,
 'testsolutions': 46,
 'rsmart testsolutions': 41,
 'testsolutions watchdogdns': 47,
 'dl': 8,
 'blog': 1,
 'www blog': 57,
 'storage': 45,
 'duckdns': 11,
 'orgwatchdogdns': 33,
 'watchdogdns duckdns': 50,
 'd

In [25]:
cv_domain.stop_words_

{'social technet',
 'vaudreuil',
 'file tancyo',
 'fileonline',
 'daston',
 'www fairgozzisurgical',
 'killercoversoftheweek',
 'tradeshow',
 'renerussonude',
 'www cinf',
 'www kurtoskalacs',
 'inertiatours com',
 'verification accounts',
 'ejcdqrh',
 'union',
 'www fromtheaeroplaneoverthesea',
 'hexadl',
 'unistore',
 'www maryannsmanymusings',
 'www gonzalolira',
 'samoan',
 'constellation collectors',
 'serviceesssmaillling',
 'kids',
 'openinstall',
 'itunes store',
 'indianahighschoolfootballhuddle',
 'randy phillips',
 'bollywood',
 'industrialtyrelcompany',
 'www forums',
 'kleedkamersinkiness',
 'www ng',
 'www drewfriedman',
 'www bbk',
 'mortified',
 'download cryptonet',
 'nhaidet',
 'www education',
 'celebritybabies celebfan',
 'autoslalom',
 'newactdoconline',
 'www glenn',
 'www praguejazz',
 'de stba',
 'mizosiri web',
 'cthunter my',
 'barbearialumber',
 'www georgetown',
 'www postdoc',
 'athensastoria',
 'fantasy metal',
 'www billnelson',
 'www kathiebracy',
 'arie

4. TLD: <br>

In [28]:
from sklearn.feature_extraction.text import CountVectorizer
cv_tld = CountVectorizer(ngram_range=(1,1), min_df=500, token_pattern=r'[a-zA-Z]{2,}', max_features=200,binary=True)

X_train = cv_tld.fit_transform(parts_df_train['tld'])
y_train = parts_df_train['label']

X_test = parts_df_test['tld']
y_test = parts_df_test['label']
X_test = cv_tld.transform(X_test)

# creating and training logistic regression model
from sklearn.linear_model import LogisticRegression
logreg_4 = LogisticRegression(max_iter=1000)
logreg_4.fit(X_train, y_train)
y_predicted_4 = logreg_4.predict(X_test)

# Report accuracy, the confusion matrix and the per-class metrics
print('Accuracy: ' + str(accuracy_score(y_test, y_predicted_4)))
print(confusion_matrix(y_test, y_predicted_4))
print(classification_report(y_test, y_predicted_4))
# Print the individual counts in the order tp, tn, fp, fn
tn, fp, fn, tp = confusion_matrix(y_test, y_predicted_4).ravel()
print(tp)
print(tn)
print(fp)
print(fn)

Accuracy: 0.7675140657316832
[[101255   2455]
 [ 45767  57942]]
              precision    recall  f1-score   support

           0       0.69      0.98      0.81    103710
           1       0.96      0.56      0.71    103709

    accuracy                           0.77    207419
   macro avg       0.82      0.77      0.76    207419
weighted avg       0.82      0.77      0.76    207419

57942
101255
2455
45767


In [29]:
cv_tld.vocabulary_

{'ca': 5,
 'com': 9,
 'ir': 20,
 'org': 26,
 'in': 18,
 'pl': 27,
 'edu': 11,
 'br': 4,
 'ru': 29,
 'nl': 25,
 'it': 21,
 'info': 19,
 'co': 8,
 'uk': 33,
 'net': 24,
 'top': 30,
 'eu': 13,
 'ac': 0,
 'id': 17,
 'au': 2,
 'de': 10,
 'ua': 32,
 'vn': 35,
 'za': 37,
 'gov': 16,
 'fr': 15,
 'us': 34,
 'fm': 14,
 'ro': 28,
 'mx': 23,
 'cn': 7,
 'ar': 1,
 'jp': 22,
 'cl': 6,
 'xyz': 36,
 'es': 12,
 'tr': 31,
 'biz': 3}
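
The vocabulary above keeps only the TLD tokens that appear in at least `min_df=500` training URLs; scikit-learn records the discarded terms in `stop_words_` (together with any terms cut by `max_df` or `max_features`). The following is a simplified pure-Python sketch of that document-frequency filter, not scikit-learn's implementation. The function name `split_by_document_frequency` and the toy values are assumptions made for illustration:

```python
import re
from collections import Counter

# Same token pattern as the vectorizers in this notebook.
TOKEN_PATTERN = re.compile(r"[a-zA-Z]{2,}")

def split_by_document_frequency(documents, min_df):
    """Return (kept, dropped) terms based on how many documents contain each term."""
    doc_freq = Counter()
    for doc in documents:
        # Count each term once per document (document frequency, not raw count).
        doc_freq.update(set(TOKEN_PATTERN.findall(doc.lower())))
    kept = {term for term, freq in doc_freq.items() if freq >= min_df}
    dropped = set(doc_freq) - kept
    return kept, dropped

# Toy TLD column (illustrative values, not taken from the dataset).
toy_tlds = ["com", "com", "co.uk", "com", "org", "co.uk", "xyz"]
kept, dropped = split_by_document_frequency(toy_tlds, min_df=2)
print(sorted(kept))
print(sorted(dropped))
```

With a threshold of 2, the tokens seen in a single toy entry fall into the dropped set, which is the role `stop_words_` plays for the fitted vectorizer.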

In [30]:
cv_tld.stop_words_

{'ab',
 'academy',
 'accountant',
 'acf',
 'ad',
 'adm',
 'adult',
 'adv',
 'adxhks',
 'ae',
 'aero',
 'af',
 'africa',
 'ag',
 'agency',
 'agr',
 'ai',
 'ais',
 'ak',
 'al',
 'am',
 'amh',
 'ao',
 'app',
 'archi',
 'argyll',
 'arq',
 'art',
 'as',
 'asehdb',
 'asia',
 'asn',
 'asso',
 'aswg',
 'at',
 'audio',
 'auto',
 'av',
 'avg',
 'ax',
 'az',
 'ba',
 'band',
 'bb',
 'bc',
 'bd',
 'be',
 'berlin',
 'best',
 'bf',
 'bg',
 'bh',
 'bialystok',
 'bid',
 'bieszczady',
 'bio',
 'bj',
 'blog',
 'bm',
 'bo',
 'bolt',
 'bradesco',
 'brj',
 'bs',
 'build',
 'builders',
 'business',
 'bute',
 'buzz',
 'bw',
 'by',
 'bydgoszcz',
 'bz',
 'bzh',
 'cab',
 'capetown',
 'capital',
 'cardiff',
 'care',
 'careers',
 'casa',
 'cash',
 'cat',
 'catering',
 'catholic',
 'cc',
 'cd',
 'center',
 'ceo',
 'cf',
 'ch',
 'chat',
 'cheap',
 'chirurgiens',
 'church',
 'ci',
 'city',
 'ck',
 'click',
 'clinic',
 'cloud',
 'club',
 'cm',
 'cng',
 'cnt',
 'coffee',
 'college',
 'company',
 'consulting',
 'contrac

5. Path: <br>

In [31]:
from sklearn.feature_extraction.text import CountVectorizer
cv_path = CountVectorizer(ngram_range=(1,3), min_df=100, token_pattern=r'[a-zA-Z]{2,}', max_features=500,binary=True)

X_train = cv_path.fit_transform(parts_df_train['path'])
y_train = parts_df_train['label']

X_test = parts_df_test['path']
y_test = parts_df_test['label']
X_test = cv_path.transform(X_test)

# creating and training logistic regression model
from sklearn.linear_model import LogisticRegression
logreg_5 = LogisticRegression(max_iter=1000)
logreg_5.fit(X_train, y_train)
y_predicted_5 = logreg_5.predict(X_test)

# Report accuracy, the confusion matrix and the per-class metrics
print('Accuracy: ' + str(accuracy_score(y_test, y_predicted_5)))
print(confusion_matrix(y_test, y_predicted_5))
print(classification_report(y_test, y_predicted_5))
# Print the individual counts in the order tp, tn, fp, fn
tn, fp, fn, tp = confusion_matrix(y_test, y_predicted_5).ravel()
print(tp)
print(tn)
print(fp)
print(fn)

Accuracy: 0.8838679195252123
[[99188  4522]
 [19566 84143]]
              precision    recall  f1-score   support

           0       0.84      0.96      0.89    103710
           1       0.95      0.81      0.87    103709

    accuracy                           0.88    207419
   macro avg       0.89      0.88      0.88    207419
weighted avg       0.89      0.88      0.88    207419

84143
99188
4522
19566
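
To see where multi-word features such as `'wp content uploads'` come from, the sketch below reproduces in plain Python what `ngram_range=(1,3)`, `token_pattern=r'[a-zA-Z]{2,}'` and `binary=True` roughly do to a single path. It is an approximation written for illustration, not scikit-learn's code; `binary_ngrams` and the example path are hypothetical:

```python
import re

def binary_ngrams(text, ngram_range=(1, 3), token_pattern=r"[a-zA-Z]{2,}"):
    """Return the set of word n-grams present in text (presence only, like binary=True)."""
    # Lowercase first, then keep runs of two or more letters as tokens.
    tokens = re.findall(token_pattern, text.lower())
    min_n, max_n = ngram_range
    ngrams = set()
    for n in range(min_n, max_n + 1):
        for start in range(len(tokens) - n + 1):
            ngrams.add(" ".join(tokens[start:start + n]))
    return ngrams

# Hypothetical path, chosen only to illustrate the tokenization.
example_path = "/wp-content/uploads/2020/index.php"
path_terms = binary_ngrams(example_path)
for term in sorted(path_terms):
    print(term)
```

Because digits never match the token pattern, `2020` is skipped and `uploads` and `index` become neighbours, so `'uploads index'` is also produced as a bigram.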


In [32]:
cv_path.vocabulary_

{'football': 183,
 'html': 210,
 'file': 179,
 'area': 29,
 'wiki': 470,
 'montreal': 283,
 'images': 215,
 'admin': 15,
 'php': 342,
 'forums': 187,
 'archives': 28,
 'live': 255,
 'en': 161,
 'en en': 162,
 'dr': 154,
 'browse': 76,
 'mt': 291,
 'ca': 79,
 'corporation': 114,
 'movie': 284,
 'preview': 353,
 'music': 292,
 'de': 128,
 'details': 135,
 'de de': 129,
 'mozi': 286,
 'zehir': 495,
 'hir': 200,
 'arm': 30,
 'zehir hir': 496,
 'hir arm': 201,
 'zehir hir arm': 497,
 'vbc': 457,
 'exe': 169,
 'vbc exe': 458,
 'smallbusiness': 398,
 'system': 419,
 'login': 261,
 'htm': 209,
 'wp': 479,
 'content': 110,
 'uploads': 449,
 'wp content': 482,
 'content uploads': 113,
 'wp content uploads': 485,
 'players': 345,
 'player': 344,
 'kansas': 245,
 'tv': 438,
 'th': 425,
 'college': 101,
 'sports': 405,
 'open': 320,
 'profile': 356,
 'wp admin': 480,
 'index': 222,
 'index php': 225,
 'fc': 176,
 'aug': 42,
 'zaher': 494,
 'includes': 219,
 'dhl': 137,
 'wp includes': 486,
 'biz': 

In [33]:
cv_path.stop_words_

{'mozier',
 'tgmbfqj',
 'xkf zv',
 'local london',
 'newsid default',
 'military chemicals',
 'atlantic aviation',
 'taraji henson',
 'cfdd df',
 'vital en en',
 'tami',
 'kazmaier default',
 'menusl etrac',
 'usf html',
 'cdn qu qumlr',
 'oudovwcwwrn de',
 'th tour',
 'relais sainte marie',
 'irs confim index',
 'catholic church in',
 'robert groves website',
 'archive bvuln old',
 'tamilsongs aalavanthan songs',
 'aiopm',
 'ad jpg php',
 'movie film watch',
 'content aq gtbtopoj',
 'obrigatorio ibpflogin',
 'xm watch allan',
 'and bute young',
 'stat faculty alan',
 'teg fxlwp',
 'jlothiowrjota otu',
 'sploits py',
 'emusic dogmusic',
 'playsets brands dp',
 'dlink',
 'terry evanshen plp',
 'aug rechnung hilfestellung',
 'rip roarin rodent',
 'film view third',
 'segthjotijo',
 'ckf ua',
 'parti socialiste belgique',
 'people haber richard',
 'video cremona',
 'desert schedule',
 'sitlst',
 'doctor myocardial',
 'zypiv leudfajf ih',
 'zanon',
 'en biography denisjuneau',
 'verify inf

6. Query: <br>

In [35]:
from sklearn.feature_extraction.text import CountVectorizer
cv_query = CountVectorizer(ngram_range=(1,1), min_df=100, token_pattern=r'[a-zA-Z]{2,}', max_features=500,binary=True)

X_train = cv_query.fit_transform(parts_df_train['query'])
y_train = parts_df_train['label']

X_test = parts_df_test['query']
y_test = parts_df_test['label']
X_test = cv_query.transform(X_test)

# creating and training logistic regression model
from sklearn.linear_model import LogisticRegression
logreg_6 = LogisticRegression(max_iter=1000)
logreg_6.fit(X_train, y_train)
y_predicted_6 = logreg_6.predict(X_test)

# Report accuracy, the confusion matrix and the per-class metrics
print('Accuracy: ' + str(accuracy_score(y_test, y_predicted_6)))
print(confusion_matrix(y_test, y_predicted_6))
print(classification_report(y_test, y_predicted_6))
# Print the individual counts in the order tp, tn, fp, fn
tn, fp, fn, tp = confusion_matrix(y_test, y_predicted_6).ravel()
print(tp)
print(tn)
print(fp)
print(fn)

Accuracy: 0.5397480462252735
[[103292    418]
 [ 95047   8662]]
              precision    recall  f1-score   support

           0       0.52      1.00      0.68    103710
           1       0.95      0.08      0.15    103709

    accuracy                           0.54    207419
   macro avg       0.74      0.54      0.42    207419
weighted avg       0.74      0.54      0.42    207419

8662
103292
418
95047


In [36]:
cv_query.vocabulary_

{'sa': 193,
 'amp': 14,
 'source': 204,
 'cd': 51,
 'rh': 192,
 'url': 229,
 'http': 116,
 'nl': 154,
 'fid': 102,
 'db': 69,
 'oem': 158,
 'id': 118,
 'atclid': 27,
 'topic': 222,
 'page': 166,
 'qid': 178,
 'ec': 84,
 'cab': 43,
 'us': 230,
 'battle': 30,
 'net': 150,
 'login': 141,
 'en': 89,
 'ref': 183,
 'index': 124,
 'mv': 148,
 'export': 92,
 'download': 81,
 'abuse': 2,
 'com': 61,
 'aid': 10,
 'org': 164,
 'tar': 210,
 'rand': 180,
 'inboxlight': 122,
 'aspx': 24,
 'jehfuq': 129,
 'vjoxk': 239,
 'qwhtogydw': 179,
 'product': 175,
 'userid': 232,
 'cmd': 57,
 'submit': 208,
 'bc': 32,
 'fb': 96,
 'session': 198,
 'name': 149,
 'file': 103,
 'news': 152,
 'ef': 87,
 'pid': 173,
 'gid': 107,
 'usp': 233,
 'sharing': 199,
 'ee': 86,
 'email': 88,
 'tkn': 219,
 'dl': 77,
 'wp': 244,
 'go': 108,
 'restore': 190,
 'start': 206,
 'acess': 5,
 'tooken': 221,
 'fc': 97,
 'fa': 93,
 'cat': 44,
 'bd': 33,
 'cb': 47,
 'dc': 70,
 'title': 218,
 'of': 159,
 'sk': 202,
 'app': 18,
 'lob': 13

In [37]:
cv_query.stop_words_

{'eyjzijoinf',
 'bquel',
 'uqpzemuus',
 'ooss',
 'tmode',
 'vaudreuil',
 'cjhfxkdaqxf',
 'rcxiylrhpcim',
 'realize',
 'dnkepggqvepq',
 'rjblrr',
 'fcihbsbm',
 'cgsz',
 'fcampaign',
 'fond',
 'utfwaziavu',
 'zuooguzzov',
 'eigfa',
 'fyl',
 'celebid',
 'ocz',
 'union',
 'fbnxf',
 'jhido',
 'wiiwidii',
 'acpf',
 'coehvyp',
 'ruta',
 'edh',
 'jrehbey',
 'llwzg',
 'ype',
 'wqy',
 'nzmzmzma',
 'kkltxf',
 'fja',
 'rytptajrkosbemmlmqcyacc',
 'bosky',
 'sqawx',
 'uvuhzuevaec',
 'jeuczuq',
 'awa',
 'kids',
 'defyiwctu',
 'jdjfhfkdjfhhfdhfk',
 'vszv',
 'xhgx',
 'ucprfs',
 'bollywood',
 'xzaaeelj',
 'fow',
 'like',
 'uzfjcongxb',
 'womens',
 'yyr',
 'dqxmmckrwzo',
 'tnzdld',
 'bdewzye',
 'omktygolc',
 'hbwsxlsnwg',
 'ktq',
 'oso',
 'klty',
 'gna',
 'varydntiujshlavxnn',
 'acvatdesrdotphooydzauuueqw',
 'anchor',
 'nwgfn',
 'vhfsmrmpmyanvrod',
 'guyoungtech',
 'pqvfigu',
 'ambwkvu',
 'nhspm',
 'bjvgfbnj',
 'ofezicg',
 'bsku',
 'auston',
 'vdcgn',
 'lyvb',
 'twzhityaszxslp',
 'bztbs',
 'gbcpbzk',
 'y

***
#### Combining the six classifiers: <br>

- Combining the six previous models through _Hard Voting_ yields a new model. A URL is labeled 1 only when at least four of the six classifiers predict 1, so a 3-3 tie resolves to 0. The results are as follows: <br>

In [157]:
columns = [
    'resultado_protocolo',
    'resultado_subdominio',
    'resultado_dominio',
    'resultado_tld',
    'resultado_path',
    'resultado_query'
]
df_result = pd.DataFrame(columns=columns)

df_result['resultado_protocolo'] = y_predicted_1.tolist()
df_result['resultado_subdominio'] = y_predicted_2.tolist()
df_result['resultado_dominio'] = y_predicted_3.tolist()
df_result['resultado_tld'] = y_predicted_4.tolist()
df_result['resultado_path'] = y_predicted_5.tolist()
df_result['resultado_query'] = y_predicted_6.tolist()

In [161]:
# Hard voting: label a URL as 1 only when at least four of the six
# classifiers predict 1 (a 3-3 tie therefore resolves to 0).
df_result['voting'] = (df_result[columns].sum(axis=1) >= 4).astype(int)

In [164]:
y_test = new_df_test['label']
y_predicted = df_result['voting']

# Report accuracy, the confusion matrix and the per-class metrics
print('Accuracy: ' + str(accuracy_score(y_test, y_predicted)))
print(confusion_matrix(y_test, y_predicted))
print(classification_report(y_test, y_predicted))
# Print the individual counts in the order tp, tn, fp, fn
tn, fp, fn, tp = confusion_matrix(y_test, y_predicted).ravel()
print(tp)
print(tn)
print(fp)
print(fn)

Accuracy: 0.8972369937180297
[[103710      0]
 [ 21315  82394]]
              precision    recall  f1-score   support

           0       0.83      1.00      0.91    103710
           1       1.00      0.79      0.89    103709

    accuracy                           0.90    207419
   macro avg       0.91      0.90      0.90    207419
weighted avg       0.91      0.90      0.90    207419

82394
103710
0
21315
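
The voting rule used above can be checked on a few hand-made rows. This is a minimal sketch with toy predictions; `hard_vote` and `min_positive` are names introduced here for illustration only:

```python
def hard_vote(votes, min_positive=4):
    """Return 1 when at least min_positive of the 0/1 votes are 1, else 0."""
    return 1 if sum(votes) >= min_positive else 0

# Each row holds the six predictions for one URL (toy values).
toy_predictions = [
    [1, 1, 1, 1, 0, 0],  # four positive votes
    [1, 1, 1, 0, 0, 0],  # 3-3 tie
    [0, 0, 0, 0, 1, 0],  # one positive vote
]
toy_votes = [hard_vote(row) for row in toy_predictions]
print(toy_votes)
```

Requiring four positive votes makes the ensemble conservative about predicting class 1, which is consistent with the high precision and lower recall for class 1 reported above.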


***
## Saving the data<br>
- We __save__ the contents of the final feature _DataFrames_ so that the next step of the project can use them:<br>

In [10]:
import pandas as pd

def save_data(dataframe, filename, separator, folder, path=FINAL_DATASETS_PATH):
    file_path = os.path.join(path, folder + os.sep + filename)
    dataframe.to_csv(file_path, sep=separator, index=False)

In [None]:
# Assumption: method 1 uses a "metodo_1" subfolder, by analogy with "metodo_2".
SUBFOLDER_METHOD_1 = "4_extraccion_caracteristicas" + os.sep + "metodo_1"

# Train set
save_data(df_train_features, "4_1_train_features_dataset.csv", ',', SUBFOLDER_METHOD_1)

# Test set
save_data(df_test_features, "4_1_test_features_dataset.csv", ',', SUBFOLDER_METHOD_1)

- We also save the _DataFrame_ that holds each part of the URL separately, so it can be analyzed in the next step of the project: <br>

In [13]:
SUBFOLDER_METHOD_2 = "4_extraccion_caracteristicas" + os.sep + "metodo_2"

# Train set (use the method 2 subfolder defined above)
save_data(parts_df_train, "4_2_parts_train.csv", ',', SUBFOLDER_METHOD_2)

# Test set
save_data(parts_df_test, "4_2_parts_test.csv", ',', SUBFOLDER_METHOD_2)