# LAB01: Detección de pishing

## Ingeniería de características

In [2]:
import pandas as pd
import numpy as np

### Exploración de datos

In [4]:
df = pd.read_csv("datasets/dataset_pishing.csv")

In [5]:
df.head(5)

Unnamed: 0,url,ip,nb_www,nb_com,nb_dslash,http_in_path,punycode,port,tld_in_path,tld_in_subdomain,...,domain_in_title,domain_with_copyright,whois_registered_domain,domain_registration_length,domain_age,web_traffic,dns_record,google_index,page_rank,status
0,http://www.crestonwood.com/router.php,0,1,0,0,0,0,0,0,0,...,0,1,0,45,-1,0,1,1,4,legitimate
1,http://shadetreetechnology.com/V4/validation/a...,1,0,0,0,0,0,0,0,0,...,1,0,0,77,5767,0,0,1,2,phishing
2,https://support-appleld.com.secureupdate.duila...,1,0,1,0,0,0,0,0,1,...,1,0,0,14,4004,5828815,0,1,0,phishing
3,http://rgipt.ac.in,0,0,0,0,0,0,0,0,0,...,1,0,0,62,-1,107721,0,0,3,legitimate
4,http://www.iracing.com/tracks/gateway-motorspo...,0,1,0,0,0,0,0,0,0,...,0,1,0,224,8175,8725,0,0,6,legitimate


#### ¿Está balanceado el dataset?

In [6]:
pd.Series(' '.join(df.status).split()).value_counts()[:2]

legitimate    5715
phishing      5715
dtype: int64

El dataset si está balanceado, al tener las mismas cantidades sobre legitimate y phishing.

### Derivación de características
En base al artículo “Towards Benchmark Datasets for ML Based Wensite Phishing Detection: An
Experimental Study” se derivan:

- f1 = Full URL length
- f2 = Hostname length
- f4 -> f20 =  Number of occurrences of the following characters: ’.’ (f4), ’-’ (f5), ’@’ (f6) , ’?’ (f7), ’&’ (f8), ’|’ (f9), ’=’ (f10), ’_’ (f11), ’ ̃’ (f12), ’%’ (f13), ’/’ (f14), ’*’ (f15), ’:’ (f16), ’,’ (f17), ’;’ (f18), '\$'(f19), ’%20’ or space (f20)
- f25 = HTTPS token
- f26 = Ratio of digits in full URLs 
- f27 = Ratio of digits in hostnames

In [7]:
# f1
df['url_len'] = df['url'].apply(len)

In [8]:
# f2
from urllib.parse import urlparse

def url_info(url):
    parsed_url = urlparse(url)
    return parsed_url.hostname, parsed_url.scheme

df['hostname'], df['scheme'] = zip(*df['url'].apply(url_info)) # scheme for f25
df['hostname_len'] = df['hostname'].apply(len)

In [9]:
#f4 - f20
chars = ['.','-','@','?','&','|','=','_',' ̃','%','/','*',':',',',';','$']
spaces = ['%20', ' '] # two options for spaces

def count_chars(chs):
    return lambda s: sum(s.count(ch) for ch in chs)

for ch in chars:
    df[f'{ch}_count'] = df['url'].apply(count_chars(ch))

df['space_count'] = df['url'].apply(count_chars(spaces))

In [10]:
# f25
df['is_https'] = df['scheme'].apply(lambda scheme: int(scheme.lower() == 'https'))

In [11]:
#f26
digits_ratio = lambda s: sum(c.isdigit() for c in s) / sum(not c.isdigit() for c in s)
df['digit_ratio_url'] = df['url'].apply(digits_ratio)
# f27
df['digit_ratio_hostname'] = df['url'].apply(digits_ratio)

In [12]:
# Drop added columns no longer needed
df = df.drop(['hostname', 'scheme'], axis=1)

In [13]:
# Ejemplo de cinco observaciones
df.head(5)

Unnamed: 0,url,ip,nb_www,nb_com,nb_dslash,http_in_path,punycode,port,tld_in_path,tld_in_subdomain,...,/_count,*_count,:_count,",_count",;_count,$_count,space_count,is_https,digit_ratio_url,digit_ratio_hostname
0,http://www.crestonwood.com/router.php,0,1,0,0,0,0,0,0,0,...,3,0,1,0,0,0,0,0,0.0,0.0
1,http://shadetreetechnology.com/V4/validation/a...,1,0,0,0,0,0,0,0,0,...,5,0,1,0,0,0,0,0,0.283333,0.283333
2,https://support-appleld.com.secureupdate.duila...,1,0,1,0,0,0,0,0,1,...,5,0,1,0,0,0,0,1,0.17757,0.17757
3,http://rgipt.ac.in,0,0,0,0,0,0,0,0,0,...,2,0,1,0,0,0,0,0,0.0,0.0
4,http://www.iracing.com/tracks/gateway-motorspo...,0,1,0,0,0,0,0,0,0,...,5,0,1,0,0,0,0,0,0.0,0.0


#### Preprocesamiento

In [14]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11430 entries, 0 to 11429
Data columns (total 89 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   url                         11430 non-null  object 
 1   ip                          11430 non-null  int64  
 2   nb_www                      11430 non-null  int64  
 3   nb_com                      11430 non-null  int64  
 4   nb_dslash                   11430 non-null  int64  
 5   http_in_path                11430 non-null  int64  
 6   punycode                    11430 non-null  int64  
 7   port                        11430 non-null  int64  
 8   tld_in_path                 11430 non-null  int64  
 9   tld_in_subdomain            11430 non-null  int64  
 10  abnormal_subdomain          11430 non-null  int64  
 11  nb_subdomains               11430 non-null  int64  
 12  prefix_suffix               11430 non-null  int64  
 13  random_domain               114

In [15]:
# Convertir la variable categórica status a una variable binaria
df['status'] = df['status'].apply(lambda status: int(status == 'legitimate'))

In [16]:
# Elimine la columna del dominio
del df['url']

In [17]:
df.head(5)

Unnamed: 0,ip,nb_www,nb_com,nb_dslash,http_in_path,punycode,port,tld_in_path,tld_in_subdomain,abnormal_subdomain,...,/_count,*_count,:_count,",_count",;_count,$_count,space_count,is_https,digit_ratio_url,digit_ratio_hostname
0,0,1,0,0,0,0,0,0,0,0,...,3,0,1,0,0,0,0,0,0.0,0.0
1,1,0,0,0,0,0,0,0,0,0,...,5,0,1,0,0,0,0,0,0.283333,0.283333
2,1,0,1,0,0,0,0,0,1,0,...,5,0,1,0,0,0,0,1,0.17757,0.17757
3,0,0,0,0,0,0,0,0,0,0,...,2,0,1,0,0,0,0,0,0.0,0.0
4,0,1,0,0,0,0,0,0,0,0,...,5,0,1,0,0,0,0,0,0.0,0.0


#### 5) Visualización de resultados

In [18]:
import pandas_profiling

# https://pandas-profiling.github.io/pandas-profiling/docs/master/rtd/pages/advanced_usage.html
profile = df.profile_report(
    title="Pandas Profiling Report",
    vars = {
        "num": {
            "low_categorical_threshold": 0
        }
    },
    interactions = {
        "continuous": False, # Avoid huge html file
    },
    correlations= { # enable just spearman to speed up
        "pearson": { 
            "calculate": True,
            "threshold": 0.3,
        },
        "spearman": { "calculate": False },
        "kendall": { "calculate": False },
        "phi_k": { "calculate": False },
        "cramers": { "calculate": False },
    }
)
profile.to_file("phishing_report.html")

Summarize dataset:   0%|          | 0/5 [00:00<?, ?it/s]

  ax1.set_xticklabels(ax1.get_xticklabels(), rotation=45, ha='right', fontsize=fontsize)
(using `df.profile_report(missing_diagrams={"Count": False}`)
If this is problematic for your use case, please report this as an issue:
https://github.com/pandas-profiling/pandas-profiling/issues
(include the error message: 'The number of FixedLocator locations (7), usually from a call to set_ticks, does not match the number of ticklabels (88).')


Generate report structure:   0%|          | 0/1 [00:00<?, ?it/s]

Render HTML:   0%|          | 0/1 [00:00<?, ?it/s]

Export report to file:   0%|          | 0/1 [00:00<?, ?it/s]

#### Selección de Características

Las características con mayor correlacion con status y no son constantes:
- ip
- nb_www
- phish_hints
- nb_hyperlinks
- domain_in_title
- domain_age
- google_index
- page_rank
- digit_ratio_url
- digit_ratio_hostname#### 6) Selección de características

In [19]:
# Eliminamos registros duplicados
df = df.drop_duplicates()

In [20]:
# Mantenemos solo los features que vamos a usar en el modelo
features = ['ip', 'nb_www', 'phish_hints', 'nb_hyperlinks', 'domain_in_title', 'domain_age', 'google_index', 'page_rank', 'digit_ratio_url', 'digit_ratio_hostname', 'status']
cols_to_remove = list(filter(lambda col: col not in features, df.columns))
df = df.drop(cols_to_remove, axis=1)

In [21]:
df.head(5)

Unnamed: 0,ip,nb_www,phish_hints,nb_hyperlinks,domain_in_title,domain_age,google_index,page_rank,status,digit_ratio_url,digit_ratio_hostname
0,0,1,0,17,0,-1,1,4,1,0.0,0.0
1,1,0,0,30,1,5767,1,2,0,0.283333,0.283333
2,1,0,0,4,1,4004,1,0,0,0.17757,0.17757
3,0,0,0,149,1,-1,0,3,1,0.0,0.0
4,0,1,0,102,0,8175,0,6,1,0.0,0.0


## Implementación del modelo

#### Separación de datos
- Datos de entrenamiento: 55%
- Datos de validación: 15%
- Datos de prueba: 30%

In [22]:
target = df['status']
features = df.drop(['status'], axis=1)

In [23]:
from sklearn.model_selection import train_test_split
feature_train, feature_test, target_train, target_test =\
    train_test_split(features, target, test_size=0.30, random_state=31)

feature_train, feature_validation, target_train, target_validation =\
    train_test_split(feature_train, target_train, test_size=0.21, random_state=31) # 0.21*0.7 = 0.15

In [24]:
# Dump data to csv
feature_train.to_csv('datasets/feature_train.csv')
feature_test.to_csv('datasets/feature_test.csv')
target_train.to_csv('datasets/target_train.csv')
target_test.to_csv('datasets/target_test.csv')
feature_validation.to_csv('datasets/feature_validation.csv')
target_validation.to_csv('datasets/target_validation.csv')

#### Implementación

Algoritmo de árboles de decisión para entrenar el modelo

In [25]:
from sklearn.tree import DecisionTreeClassifier
dtc = DecisionTreeClassifier()
dtc = dtc.fit(feature_train, target_train)

#### Métricas de desempeño del modelo

In [26]:
from sklearn.metrics import confusion_matrix, classification_report

##### Datos de pruebas

In [27]:
target_pred_test = dtc.predict(feature_test)

Matriz de confusión

In [28]:
confusion_matrix(target_test, target_pred_test)

array([[1568,  125],
       [ 143, 1541]])

Precision, Recall, F1 Score

In [29]:
classification_report_test = classification_report(target_test, target_pred_test, target_names=['legit', 'phishing'])
print(classification_report_test)

              precision    recall  f1-score   support

       legit       0.92      0.93      0.92      1693
    phishing       0.92      0.92      0.92      1684

    accuracy                           0.92      3377
   macro avg       0.92      0.92      0.92      3377
weighted avg       0.92      0.92      0.92      3377



##### Datos de validación

In [30]:
target_pred_validation = dtc.predict(feature_validation)

Matriz de confusión

In [31]:
confusion_matrix(target_validation, target_pred_validation)

array([[732,  62],
       [ 76, 785]])

Precision, Recall, F1 Score

In [32]:
classification_report_validation = classification_report(target_validation, target_pred_validation, target_names=['legit', 'phishing'])
print(classification_report_validation)

              precision    recall  f1-score   support

       legit       0.91      0.92      0.91       794
    phishing       0.93      0.91      0.92       861

    accuracy                           0.92      1655
   macro avg       0.92      0.92      0.92      1655
weighted avg       0.92      0.92      0.92      1655



## Resultados
1. **¿Cuál es el impacto de clasificar un sitio legítimo como Phishing?**

Dependiendo del contexto del sitio puede resultar en pérdidas de dinero (por ejemplo disminuir la exposición de un sitio de ecommerce a posibles clientes) o en riesgos que pueden afectar seriamente a los usuarios de los sitios (por ejemplo no permitir que notificaciones de sitios gubernamentales puedan ser expuestas a los usuarios).

2. **¿Cuál es el impacto de clasificar un sitio de Phishing como legítimo?**

Se corre el riesgo de filtrar información sensible a terceras partes con fines maliciosos. Además esto puede resultar en una mala reputación para el sitio legítimo, ya que la mayoría de personas que pueda ser afectado por los sitios de Phishing nunca se enteren que fue un sitio de Phishing. 

3. **En base a las respuestas anteriores, ¿Qué métrica elegiría para comparar modelos similares de clasificación de phishing?**

Recall, en estos casos podemos decir que en la mayoría de casos se detectan correctamente entre sitios legítimos y phishing.

4. **¿Es necesaria la intervención de una persona humana en la decisión final de clasificación?**

No debe de ser necesaria. Una de las características que hacen al Phishing tan efectivo es la dificultad para un humano en distinguir entre sitios legítimos y sitios de phishing. Las características que un humano busca en un sitio de phishing pueden ser bien detectadas por un modelo automatizado. 
