# SVM (Support Vector Machine)

Starting from a dataset that labels URLs as normal or malicious, build an SVM model that classifies a given URL as normal or malicious.

The dataset covers the following categories:
* Benign URLs
* Spam URLs
* Malware URLs
* Fake (phishing) URLs

This example classifies between benign and phishing URLs.

In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.model_selection import train_test_split
import numpy as np
from sklearn.metrics import f1_score
from sklearn.preprocessing import StandardScaler, RobustScaler
from sklearn.pipeline import Pipeline

In [2]:
import warnings
# Ignore some warnings raised when the pipeline is called without feature names
warnings.filterwarnings('ignore', category=UserWarning, message='.*X does not have valid feature names.*')
warnings.filterwarnings('ignore', category=RuntimeWarning, message='.*invalid value encountered in subtract.*')

## Helper functions

In [3]:
# Split the DataFrame into training, validation and test sets (60/20/20)
def train_val_test_split(df, rstate=42, shuffle=True, stratify=None):

    # Stratify only if a column name to stratify on is given
    strat = df[stratify] if stratify else None

    train_set, test_set = train_test_split(
        df,
        test_size=0.4,
        random_state=rstate, # Fixed random seed for reproducibility
        shuffle=shuffle, # Whether to shuffle before splitting
        stratify=strat # Column to stratify on, if any
    )

    # Repeat the process to obtain the validation set
    strat = test_set[stratify] if stratify else None
    
    val_set, test_set = train_test_split(
        test_set,
        test_size=0.5,
        random_state=rstate,
        shuffle=shuffle,
        stratify=strat
    )
    
    return (train_set, val_set, test_set)
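
The two-step split above yields a 60/20/20 partition. The following is a minimal sketch of the same recipe on a toy frame, assuming hypothetical `feature` and `label` columns that are not part of the real dataset, and passing the label column for stratification:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame standing in for the real dataset (hypothetical columns).
toy = pd.DataFrame({
    "feature": range(100),
    "label": ["benign", "phishing"] * 50,
})

# Same two-step recipe as train_val_test_split: 60 % for training, then the
# remaining 40 % is halved into validation and test.
train, rest = train_test_split(
    toy, test_size=0.4, random_state=42, stratify=toy["label"]
)
val, test = train_test_split(
    rest, test_size=0.5, random_state=42, stratify=rest["label"]
)

print(len(train), len(val), len(test))  # 60 20 20
print(train["label"].value_counts().to_dict())
```

Stratifying keeps the benign/phishing proportion the same in every subset, which matters more when the classes are less balanced than they are here.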

In [4]:
# Plot the decision boundary and margins of a linear SVM
def plot_svc_decision_boundary(svm_clf, xmin, xmax):
    w = svm_clf.coef_[0]
    b = svm_clf.intercept_[0]

    # At the decision boundary, w0*x0 + w1*x1 + b = 0
    # => x1 = -w0/w1 * x0 - b/w1
    x0 = np.linspace(xmin, xmax, 200)
    decision_boundary = -w[0]/w[1] * x0 - b/w[1]

    margin = 1/w[1]
    gutter_up = decision_boundary + margin
    gutter_down = decision_boundary - margin

    svs = svm_clf.support_vectors_
    plt.scatter(svs[:, 0], svs[:, 1], s=180, facecolors='#FFAAAA')
    plt.plot(x0, decision_boundary, "k-", linewidth=2)
    plt.plot(x0, gutter_up, "k--", linewidth=2)
    plt.plot(x0, gutter_down, "k--", linewidth=2)
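
The boundary and margin formulas in the helper can be checked numerically. This is a small sketch on hypothetical, linearly separable 2-D points (not taken from the URL dataset): points placed on the computed boundary should score 0 in `decision_function`, and points shifted by `1/w[1]` should score 1.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical, linearly separable 2-D toy data (not from the dataset).
X = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.5],
              [4.0, 4.5], [5.0, 4.0], [4.5, 5.5]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear", C=1000).fit(X, y)
w, b = clf.coef_[0], clf.intercept_[0]

x0 = np.linspace(0, 6, 5)
boundary = -w[0] / w[1] * x0 - b / w[1]   # same formula as the helper
gutter_up = boundary + 1 / w[1]           # boundary shifted by the margin term

# Points on the boundary score 0; points on the shifted line score 1.
on_boundary = clf.decision_function(np.column_stack([x0, boundary]))
on_gutter = clf.decision_function(np.column_stack([x0, gutter_up]))
print(np.round(on_boundary, 6), np.round(on_gutter, 6))
```

Note that the helper only makes sense for a model trained on exactly two features, since it reads `w[0]` and `w[1]` and divides by `w[1]`.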

## 1. Read the dataset

In [5]:
df = pd.read_csv('../datasets/FinalDataset/Phishing.csv')

## 2. Visualize the dataset

In [6]:
df.head(10)

Unnamed: 0,Querylength,domain_token_count,path_token_count,avgdomaintokenlen,longdomaintokenlen,avgpathtokenlen,tld,charcompvowels,charcompace,ldl_url,...,SymbolCount_FileName,SymbolCount_Extension,SymbolCount_Afterpath,Entropy_URL,Entropy_Domain,Entropy_DirectoryName,Entropy_Filename,Entropy_Extension,Entropy_Afterpath,URL_Type_obf_Type
0,0,2,12,5.5,8,4.083334,2,15,7,0,...,-1,-1,-1,0.676804,0.860529,-1.0,-1.0,-1.0,-1.0,benign
1,0,3,12,5.0,10,3.583333,3,12,8,2,...,1,0,-1,0.715629,0.776796,0.693127,0.738315,1.0,-1.0,benign
2,2,2,11,4.0,5,4.75,2,16,11,0,...,2,0,1,0.677701,1.0,0.677704,0.916667,0.0,0.898227,benign
3,0,2,7,4.5,7,5.714286,2,15,10,0,...,0,0,-1,0.696067,0.879588,0.818007,0.753585,0.0,-1.0,benign
4,19,2,10,6.0,9,2.25,2,9,5,0,...,5,4,3,0.747202,0.8337,0.655459,0.829535,0.83615,0.823008,benign
5,0,2,10,5.5,9,4.1,2,15,11,0,...,-1,-1,-1,0.732981,0.860529,-1.0,-1.0,-1.0,-1.0,benign
6,0,2,12,4.5,6,5.333334,2,24,9,0,...,0,0,-1,0.692383,0.939794,0.910795,0.673973,0.0,-1.0,benign
7,0,2,11,3.5,4,3.909091,2,15,6,0,...,0,0,-1,0.707365,0.916667,0.916667,0.690332,0.0,-1.0,benign
8,0,2,9,2.5,3,4.555555,2,6,3,0,...,1,0,-1,0.742606,1.0,0.785719,0.808833,1.0,-1.0,benign
9,0,2,13,4.5,6,5.307692,2,16,9,1,...,-1,-1,-1,0.734633,0.939794,-1.0,-1.0,-1.0,-1.0,benign


In [7]:
df.describe()

Unnamed: 0,Querylength,domain_token_count,path_token_count,avgdomaintokenlen,longdomaintokenlen,avgpathtokenlen,tld,charcompvowels,charcompace,ldl_url,...,SymbolCount_Directoryname,SymbolCount_FileName,SymbolCount_Extension,SymbolCount_Afterpath,Entropy_URL,Entropy_Domain,Entropy_DirectoryName,Entropy_Filename,Entropy_Extension,Entropy_Afterpath
count,15367.0,15367.0,15367.0,15367.0,15367.0,15096.0,15367.0,15367.0,15367.0,15367.0,...,15367.0,15367.0,15367.0,15367.0,15367.0,15367.0,13541.0,15177.0,15364.0,15364.0
mean,3.446021,2.543698,8.477061,5.851956,10.027461,5.289936,2.543698,12.659986,8.398516,1.910913,...,2.120843,1.124618,0.500813,-0.158782,0.721684,0.854232,0.634859,0.682896,0.313617,-0.723793
std,14.151453,0.944938,4.66025,2.064581,5.28109,3.535097,0.944938,8.562206,6.329007,4.657731,...,2.777307,2.570246,2.261013,2.535939,0.049246,0.072641,0.510992,0.502288,0.57691,0.649785
min,0.0,2.0,0.0,1.5,2.0,0.0,2.0,0.0,0.0,0.0,...,-1.0,-1.0,-1.0,-1.0,0.41956,0.561913,-1.0,-1.0,-1.0,-1.0
25%,0.0,2.0,5.0,4.5,7.0,3.8,2.0,6.0,4.0,0.0,...,1.0,0.0,0.0,-1.0,0.687215,0.798231,0.709532,0.707165,0.0,-1.0
50%,0.0,2.0,8.0,5.5,9.0,4.5,2.0,11.0,7.0,0.0,...,2.0,0.0,0.0,-1.0,0.723217,0.859793,0.785949,0.814038,0.0,-1.0
75%,0.0,3.0,11.0,6.666666,12.0,5.571429,3.0,17.0,11.0,1.0,...,3.0,1.0,0.0,-1.0,0.757949,0.916667,0.859582,0.916667,1.0,-1.0
max,173.0,19.0,68.0,29.5,63.0,105.0,19.0,94.0,62.0,58.0,...,24.0,31.0,30.0,29.0,0.869701,1.0,0.962479,1.0,1.0,1.0


In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15367 entries, 0 to 15366
Data columns (total 80 columns):
 #   Column                           Non-Null Count  Dtype  
---  ------                           --------------  -----  
 0   Querylength                      15367 non-null  int64  
 1   domain_token_count               15367 non-null  int64  
 2   path_token_count                 15367 non-null  int64  
 3   avgdomaintokenlen                15367 non-null  float64
 4   longdomaintokenlen               15367 non-null  int64  
 5   avgpathtokenlen                  15096 non-null  float64
 6   tld                              15367 non-null  int64  
 7   charcompvowels                   15367 non-null  int64  
 8   charcompace                      15367 non-null  int64  
 9   ldl_url                          15367 non-null  int64  
 10  ldl_domain                       15367 non-null  int64  
 11  ldl_path                         15367 non-null  int64  
 12  ldl_filename      

In [6]:
df['URL_Type_obf_Type'].value_counts()

URL_Type_obf_Type
benign      7781
phishing    7586
Name: count, dtype: int64

### First, reduce the number of features to the most relevant ones

In [6]:
from sklearn.preprocessing import LabelEncoder
labelencoder = LabelEncoder()

df_corr = df.copy()
df_corr['URL_Type_obf_Type'] = labelencoder.fit_transform(df_corr['URL_Type_obf_Type'])

corr = df_corr.corr()['URL_Type_obf_Type'].sort_values(ascending=False)
corr

URL_Type_obf_Type          1.000000
domainUrlRatio             0.575217
tld                        0.493899
domain_token_count         0.493899
SymbolCount_Domain         0.493899
                             ...   
path_token_count          -0.452085
pathDomainRatio           -0.470423
delimeter_path            -0.522306
pathurlRatio              -0.556790
ISIpAddressInDomainName         NaN
Name: URL_Type_obf_Type, Length: 80, dtype: float64

In [7]:
# Set a correlation threshold
threshold = 0.05

# Keep features whose absolute correlation with the label is at or above the threshold
relevant_features = corr[corr.abs() >= threshold]

# Show the relevant features
print(relevant_features)

URL_Type_obf_Type          1.000000
domainUrlRatio             0.575217
tld                        0.493899
domain_token_count         0.493899
SymbolCount_Domain         0.493899
                             ...   
CharacterContinuityRate   -0.417122
path_token_count          -0.452085
pathDomainRatio           -0.470423
delimeter_path            -0.522306
pathurlRatio              -0.556790
Name: URL_Type_obf_Type, Length: 64, dtype: float64
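
The filter also discards columns whose correlation is `NaN`, such as `ISIpAddressInDomainName` above, because a comparison against `NaN` is always false. A minimal sketch on a hypothetical toy frame (the column names `signal` and `constant` are illustrative only) shows this behaviour:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: one informative column and one constant column.
label = np.array([0, 1] * 50)
toy = pd.DataFrame({
    "label": label,
    "signal": label + np.linspace(0.0, 0.1, 100),  # tracks the label closely
    "constant": np.ones(100),                       # zero variance
})

corr = toy.corr()["label"]
relevant = corr[corr.abs() >= 0.05]

print(corr)                  # the constant column has a NaN correlation
print(list(relevant.index))  # ['label', 'signal']
```

A zero-variance column carries no information for the classifier, so dropping it this way is usually harmless.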


In [8]:
# Filter the original DataFrame
df_relevant = df[relevant_features.index]

# Show the new DataFrame
df_relevant

Unnamed: 0,URL_Type_obf_Type,domainUrlRatio,tld,domain_token_count,SymbolCount_Domain,host_letter_count,domainlength,longdomaintokenlen,NumberofDotsinURL,Domain_LongestWordLength,...,Extension_LetterCount,sub-Directory_LongestWordLength,LongestPathTokenLength,charcompvowels,Entropy_Domain,CharacterContinuityRate,path_token_count,pathDomainRatio,delimeter_path,pathurlRatio
0,benign,0.150000,2,2,1,11,12,8,1,8,...,39,8,48,15,0.860529,0.750000,12,5.083334,7,0.762500
1,benign,0.217949,3,3,2,15,17,10,3,10,...,4,7,40,12,0.776796,0.647059,12,3.176471,8,0.692308
2,benign,0.126761,2,2,1,8,9,5,1,5,...,6,7,22,16,1.000000,0.666667,11,6.000000,3,0.760563
3,benign,0.156250,2,2,1,9,10,7,1,7,...,30,7,33,15,0.879588,0.800000,7,4.700000,3,0.734375
4,benign,0.191176,2,2,1,12,13,9,2,9,...,26,7,39,9,0.833700,0.769231,10,3.692308,4,0.705882
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
15362,phishing,0.459459,2,2,1,16,17,13,1,13,...,8,-1,8,3,0.884870,0.823529,3,0.764706,0,0.351351
15363,phishing,0.783784,3,3,2,27,29,16,2,16,...,-1,-1,0,0,0.813569,0.586207,0,0.034483,0,0.027027
15364,phishing,0.594595,3,3,2,20,22,10,2,10,...,-1,-1,6,3,0.801139,0.500000,2,0.363636,0,0.216216
15365,phishing,0.459459,2,2,1,16,17,13,1,13,...,6,-1,6,4,0.787659,0.823529,3,0.764706,0,0.351351


In [9]:
# Check for null values
is_null = df_relevant.isna().any()
is_null[is_null] # Columns that contain nulls

Entropy_Extension        True
Entropy_Filename         True
NumberRate_Extension     True
Entropy_DirectoryName    True
avgpathtokenlen          True
NumberRate_FileName      True
NumberRate_AfterPath     True
Entropy_Afterpath        True
dtype: bool

In [10]:
# Check for infinite values
is_inf = df_relevant.isin([np.inf, -np.inf]).any()
is_inf[is_inf]

argPathRatio    True
dtype: bool

## 3. Split the dataset

In [11]:
# Split into subsets
train_set, val_set, test_set = train_val_test_split(df_relevant)

In [12]:
# Separate the features from the label
x_train = train_set.drop('URL_Type_obf_Type', axis=1)
y_train = train_set['URL_Type_obf_Type'].copy()

x_val = val_set.drop('URL_Type_obf_Type', axis=1)
y_val = val_set['URL_Type_obf_Type'].copy()

x_test = test_set.drop('URL_Type_obf_Type', axis=1)
y_test = test_set['URL_Type_obf_Type'].copy()

## 4. Prepare the dataset

In [13]:
# Drop the column that contains infinite values (argPathRatio, flagged by the check above)
x_train = x_train.drop('argPathRatio', axis=1)
x_val = x_val.drop('argPathRatio', axis=1)
x_test = x_test.drop('argPathRatio', axis=1)
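
Dropping the column is the simplest option. An alternative, shown here only as a sketch on hypothetical toy data, is to mark infinite values as missing so that an imputer can fill them like any other gap:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy columns with one infinite ratio (e.g. from a division by zero).
toy = pd.DataFrame({"ratio": [0.5, 2.0, np.inf, 1.0],
                    "length": [3.0, 7.0, 5.0, 9.0]})

# Mark infinities as missing, then let the imputer fill them.
cleaned = toy.replace([np.inf, -np.inf], np.nan)
filled = SimpleImputer(strategy="median").fit_transform(cleaned)

print(filled[:, 0])  # the infinite entry becomes 1.0, the median of 0.5, 2.0, 1.0
```

Whether keeping the column helps the model is an empirical question; this notebook drops it instead.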

In [16]:
# Fill null values with the median
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy='median')

In [17]:
# Learn the medians from the training set only, then reuse them for the other splits
x_train_prep = imputer.fit_transform(x_train)
x_val_prep = imputer.transform(x_val)
x_test_prep = imputer.transform(x_test)
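
A common convention is to fit preprocessing steps on the training split only and reuse the learned statistics elsewhere, so that validation and test data do not influence the preparation. A minimal sketch with hypothetical single-feature arrays:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical toy splits with a single feature.
train = np.array([[1.0], [3.0], [np.nan], [5.0]])
val = np.array([[np.nan], [100.0]])

imputer = SimpleImputer(strategy="median")
train_filled = imputer.fit_transform(train)  # learns the median 3.0 from train
val_filled = imputer.transform(val)          # reuses it; does not refit on val

print(imputer.statistics_)  # [3.]
print(val_filled.ravel())   # the gap in val is filled with 3.0, not 100.0
```

Calling `fit_transform` on the validation array instead would fill its gap with that split's own median (100.0 here).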

In [18]:
# Convert the result to a Pandas DataFrame
x_train_prep = pd.DataFrame(x_train_prep, columns=x_train.columns, index=y_train.index)
x_val_prep = pd.DataFrame(x_val_prep, columns=x_val.columns, index=y_val.index)
x_test_prep = pd.DataFrame(x_test_prep, columns=x_test.columns, index=y_test.index)

In [18]:
x_train_prep.head(10)

Unnamed: 0,domainUrlRatio,tld,domain_token_count,SymbolCount_Domain,host_letter_count,domainlength,longdomaintokenlen,NumberofDotsinURL,Domain_LongestWordLength,avgdomaintokenlen,...,Extension_LetterCount,sub-Directory_LongestWordLength,LongestPathTokenLength,charcompvowels,Entropy_Domain,CharacterContinuityRate,path_token_count,pathDomainRatio,delimeter_path,pathurlRatio
2134,0.072464,2.0,2.0,1.0,4.0,5.0,2.0,1.0,2.0,2.0,...,21.0,18.0,27.0,17.0,0.827729,0.6,6.0,11.4,2.0,0.826087
9178,0.166667,4.0,4.0,3.0,13.0,16.0,5.0,7.0,5.0,3.25,...,15.0,7.0,20.0,18.0,0.82016,0.375,18.0,4.5625,5.0,0.760417
13622,0.511628,3.0,3.0,2.0,20.0,22.0,14.0,2.0,14.0,6.666666,...,0.0,-1.0,8.0,1.0,0.869991,0.681818,3.0,0.636364,0.0,0.325581
15182,0.315789,3.0,3.0,2.0,10.0,12.0,4.0,3.0,4.0,3.333333,...,3.0,5.0,5.0,5.0,0.79649,0.416667,5.0,1.583333,1.0,0.5
8013,0.107527,2.0,2.0,1.0,19.0,20.0,17.0,2.0,17.0,9.5,...,45.0,8.0,92.0,21.0,0.820569,0.9,13.0,7.95,4.0,0.854839
12408,0.509434,3.0,3.0,2.0,25.0,27.0,19.0,2.0,19.0,8.333333,...,2.0,1.0,12.0,5.0,0.789538,0.740741,4.0,0.703704,0.0,0.358491
509,0.088496,2.0,2.0,1.0,9.0,10.0,6.0,2.0,6.0,4.5,...,60.0,10.0,84.0,24.0,0.796658,0.7,13.0,9.6,1.0,0.849557
10714,0.314286,3.0,3.0,2.0,20.0,22.0,14.0,2.0,14.0,6.666666,...,0.0,10.0,14.0,11.0,0.869991,0.681818,8.0,1.863636,2.0,0.585714
3986,0.264151,2.0,2.0,1.0,13.0,14.0,10.0,1.0,10.0,6.5,...,19.0,7.0,21.0,7.0,0.798231,0.785714,6.0,2.285714,2.0,0.603774
748,0.126761,2.0,2.0,1.0,8.0,9.0,5.0,1.0,5.0,4.0,...,34.0,8.0,39.0,14.0,0.929897,0.666667,8.0,6.0,4.0,0.760563


In [19]:
# Check for null values in the training subset
is_null = x_train_prep.isna().any()
is_null[is_null] # Columns that contain nulls

Series([], dtype: bool)

## 5. Training with SVM

### 5.1 Linear classification with Soft Margin Classification

For this, the **kernel** parameter is set to *linear* and the **C** parameter to *50*.

In [20]:
from sklearn.svm import SVC

model_softm = SVC(kernel='linear', C=50)
model_softm.fit(x_train_prep, y_train)
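
The **C** parameter controls how soft the margin is: a small value tolerates more margin violations and gives a wider margin, while a large value penalises violations more heavily. The sketch below illustrates this on hypothetical overlapping 2-D blobs (not the URL dataset); the `widths` dictionary is introduced only for the illustration.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical overlapping 2-D blobs (not the URL dataset).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (100, 2)), rng.normal(2.0, 1.0, (100, 2))])
y = np.array([0] * 100 + [1] * 100)

widths = {}
for C in (0.01, 50):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # For a linear SVM the margin width is 2 / ||w||.
    widths[C] = 2 / np.linalg.norm(clf.coef_[0])
    print(f"C={C}: margin width {widths[C]:.2f}, "
          f"{len(clf.support_vectors_)} support vectors")
```

In practice C is usually tuned on the validation set rather than fixed in advance.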

In [21]:
# Predict on the validation set
y_pred = model_softm.predict(x_val_prep)

In [22]:
from sklearn.metrics import f1_score
f_score = f1_score(y_val, y_pred, pos_label='phishing')
print(f'F1 score of the SVM model with Soft Margin Classification: {round(f_score * 100, ndigits=2)} %')

F1 score of the SVM model with Soft Margin Classification: 95.59 %
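
The figure reported by `f1_score` is the F1 score, the harmonic mean of precision and recall for the positive class, which is not the same quantity as accuracy. A small sketch with hypothetical labels makes the difference concrete:

```python
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical labels: 3 phishing and 5 benign URLs.
y_true = ["phishing"] * 3 + ["benign"] * 5
y_pred = ["phishing", "phishing", "benign", "phishing",
          "benign", "benign", "benign", "benign"]

# For 'phishing': 2 true positives, 1 false negative, 1 false positive.
f1 = f1_score(y_true, y_pred, pos_label="phishing")
acc = accuracy_score(y_true, y_pred)
print(round(f1, 4), acc)  # 0.6667 0.75
```

`f1_score` expects the true labels first and the predictions second; the F1 value itself is unchanged if they are swapped, but precision and recall trade places.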


Here the linear soft-margin model reaches this score without any feature scaling, although SVMs are in general sensitive to feature scale.

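### 5.2 Nonlinear classification with a polynomial kernel

For this variant, the label values are represented numerically. SVC also accepts string labels; the numeric encoding simply lets f1_score use its default pos_label=1.

In [21]:
# Encode the labels numerically with one fixed mapping for every split
# (calling factorize() per split could assign different codes to each one)
y_train_num = (y_train == 'phishing').astype(int).to_numpy()
y_val_num = (y_val == 'phishing').astype(int).to_numpy()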

In [22]:
from sklearn.svm import SVC

# Create the model with sklearn's built-in implementation
model_kernelpoly = SVC(kernel='poly', degree=3, coef0=10, C=20)
model_kernelpoly.fit(x_train_prep, y_train_num)
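
With `kernel='poly'`, scikit-learn computes the similarity of two samples as `(gamma * <a, b> + coef0) ** degree`. The sketch below checks that formula against `polynomial_kernel` for two hypothetical vectors; `gamma` is fixed by hand here as an assumption, whereas `SVC` derives it from the data by default.

```python
import numpy as np
from sklearn.metrics.pairwise import polynomial_kernel

# Two hypothetical feature vectors.
a = np.array([[1.0, 2.0]])
b = np.array([[3.0, 0.5]])

# gamma is chosen by hand for the illustration; degree and coef0 match the model above.
gamma, coef0, degree = 0.1, 10, 3
by_hand = (gamma * (a @ b.T) + coef0) ** degree
by_sklearn = polynomial_kernel(a, b, degree=degree, gamma=gamma, coef0=coef0)

print(by_hand, by_sklearn)
```

A larger `coef0` gives more weight to the lower-degree terms of the expanded polynomial relative to the highest-degree ones.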

In [23]:
# Predict on the validation set
y_pred = model_kernelpoly.predict(x_val_prep)

In [24]:
# Check the model's F1 score
f_score = f1_score(y_val_num, y_pred)
print(f'F1 score of the SVM model with polynomial kernel: {round(f_score * 100, ndigits=2)} %')

F1 score of the SVM model with polynomial kernel: 96.99 %


### 5.3 Nonlinear classification with a Gaussian kernel

This variant needs the dataset values to be scaled; as before, the label values are numeric.

The Gaussian (RBF) kernel is a very common choice in practice and trains reasonably fast on a dataset of this size.

In [31]:
# Pipeline that scales the data and trains the model
model_kernelgauss = Pipeline([
    ('scaler', RobustScaler()),
    ('model_gauss', SVC( kernel='rbf', gamma=0.05, C=2000 ))
])

model_kernelgauss.fit(x_train_prep, y_train_num)
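
The Gaussian (RBF) kernel measures similarity as `exp(-gamma * ||a - b||^2)`, so it depends directly on distances between samples, which is why the features are scaled first. A minimal sketch with hypothetical points, reusing `gamma=0.05` from the pipeline (the `sims` dictionary exists only for the illustration):

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

# Hypothetical points: one reference, one close to it, one far away.
a = np.array([[0.0, 0.0]])
near = np.array([[0.5, 0.5]])
far = np.array([[5.0, 5.0]])

gamma = 0.05  # same value as in the pipeline
sims = {}
for name, other in (("near", near), ("far", far)):
    by_hand = np.exp(-gamma * np.sum((a - other) ** 2))
    sims[name] = rbf_kernel(a, other, gamma=gamma)[0, 0]
    print(name, round(float(by_hand), 4), round(float(sims[name]), 4))
```

A feature with a much larger range than the others would dominate the squared distance, and therefore the kernel value, if left unscaled.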

In [34]:
# Predict on the validation set
y_pred = model_kernelgauss.predict(x_val_prep)

In [35]:
# Check the model's F1 score
f_score = f1_score(y_val_num, y_pred)
print(f'F1 score of the SVM model with Gaussian kernel: {round(f_score * 100, ndigits=2)} %')

F1 score of the SVM model with Gaussian kernel: 96.95 %


### In summary

With our dataset of more than 9,000 training examples and more than 60 input features, the SVM variants gave the following F1 scores on the validation set:

* Soft Margin Classification: 95.59 %
* Polynomial kernel: 96.99 %
* Gaussian kernel: 96.95 %

We can therefore conclude that the **nonlinear** variants achieve a better F1 score on this dataset, by roughly 1.4 percentage points.
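
The scores above come from the validation set. A usual last step is to score the chosen model once on the held-out test split. The following is only a sketch of that step on synthetic data from `make_classification` (an assumption standing in for the URL features), with default `SVC` hyperparameters rather than the ones tuned above:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.svm import SVC

# Synthetic stand-in for the URL features (assumption: 20 numeric columns).
X, y = make_classification(n_samples=600, n_features=20, random_state=42)
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.4, random_state=42
)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=42
)

model = Pipeline([("scaler", RobustScaler()), ("svc", SVC(kernel="rbf"))])
model.fit(X_train, y_train)

# Validation guides the model choice; the test split is scored once at the end.
val_f1 = f1_score(y_val, model.predict(X_val))
test_f1 = f1_score(y_test, model.predict(X_test))
print(f"validation F1: {val_f1:.3f}, test F1: {test_f1:.3f}")
```

A test score close to the validation score suggests the model choice did not overfit the validation split.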