# Phishing Detector Using Naive Bayes

In this assignment you must implement a phishing detector using the Naive Bayes model. The model assumes conditional independence among all observed binary variables $\mathbf{x}$ (tokens or parts of a URL) given the class $c$ ('phishing' or 'not phishing'). The joint distribution therefore factorizes as

$p(\mathbf{x},c) = p(c)p(\mathbf{x} | c )$

where
$p( \mathbf{x}| c) = \prod_{i=1}^D p(x_i | c)$
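
As a minimal sketch of this factorization, the class-conditional likelihood of a binary vector is the product of one Bernoulli term per feature. The helper name `bernoulli_likelihood` and the parameter values are made up for illustration:

```python
def bernoulli_likelihood(x, theta):
    """Return prod_i theta_i**x_i * (1 - theta_i)**(1 - x_i) for binary x."""
    likelihood = 1.0
    for x_i, theta_i in zip(x, theta):
        # A present feature contributes theta_i, an absent one (1 - theta_i)
        likelihood *= theta_i if x_i == 1 else (1.0 - theta_i)
    return likelihood

# Hypothetical values of p(x_i = 1 | c) for three tokens of one class
theta_c = [0.5, 0.25, 0.5]
x = [1, 0, 1]

lik = bernoulli_likelihood(x, theta_c)  # 0.5 * (1 - 0.25) * 0.5
print(lik)
```

Note that absent tokens also contribute a factor, which is what distinguishes the Bernoulli model from the multinomial one.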


Given a set of URLs in binary format and their labels, $\mathcal D=\{(\mathbf{x},c)^j\}_{j=1}^N$, the maximum likelihood estimator of the Bernoulli parameters is obtained from the likelihood of the observed data under the class-conditional distributions:

$p(\mathcal D \mid \theta)= \prod_{j=1}^N p(c^j) \prod_{i=1}^D \theta_{c^j i}^{x_i^j} (1-\theta_{c^j i})^{1-x_i^j}$,

where $\theta_{ci}=p(x_i=1 \mid c)$ and $x_i^j$ is the value of feature $i$ in example $j$.

Maximizing this likelihood and adding a smoothing parameter $\alpha$ gives the estimator:

$\theta_{ci}=\frac{\textrm{number of examples of class }c\textrm{ where } x_i=1 + \alpha}{\textrm{number of examples of class }c + 2\alpha}$

With $\alpha=0$ this is the maximum likelihood estimate. Many tokens never appear in one of the classes, so their unsmoothed estimate would be exactly zero. Laplace (additive) smoothing avoids this by introducing the parameter $\alpha>0$ (https://en.wikipedia.org/wiki/Additive_smoothing).
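
The effect of $\alpha$ can be seen in a small worked example. This is a sketch with invented counts; the helper name `smoothed_theta` is hypothetical:

```python
def smoothed_theta(count_present, n_class, alpha):
    """Additive-smoothing estimate of p(x_i = 1 | c) for a binary feature."""
    return (count_present + alpha) / (n_class + 2 * alpha)

# Hypothetical counts: a token seen in 0 of the 100 URLs of a class
theta_unsmoothed = smoothed_theta(0, 100, alpha=0.0)  # exactly zero
theta_smoothed = smoothed_theta(0, 100, alpha=1.0)    # 1 / 102

print(theta_unsmoothed, theta_smoothed)
```

A zero estimate would force the whole product $p(\mathbf{x} | c)$ to zero for any URL containing that token, regardless of the other evidence.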

The class prior probability is:

$p(c)=\frac{\textrm{number of examples of class }c }{\textrm{total number of examples}}$

Once the parameters of the class-conditional Bernoulli distributions have been estimated, we can use the model to perform inference on new data $\mathbf{x^\ast}$:

$p(c | \mathbf{x^\ast})=\frac{p(c)p(\mathbf{x^\ast} | c )}{p(\mathbf{x^\ast})}$
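
The denominator $p(\mathbf{x^\ast})$ is the sum of the numerators over both classes, so every class must be divided by the same value. A minimal sketch with invented numbers (the helper `posterior` and the scores are hypothetical):

```python
def posterior(joint):
    """Normalize scores p(c) * p(x | c) into posterior probabilities p(c | x)."""
    evidence = sum(joint.values())  # p(x) = sum over classes of p(c) p(x | c)
    return {c: score / evidence for c, score in joint.items()}

# Hypothetical priors (0.7, 0.3) and likelihoods (0.02, 0.08)
joint = {"good": 0.7 * 0.02, "bad": 0.3 * 0.08}
post = posterior(joint)

print(post)
```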

C.D. Manning, P. Raghavan and H. Schuetze (2008). Introduction to Information Retrieval. Cambridge University Press, pp. 234-265. https://nlp.stanford.edu/IR-book/html/htmledition/the-bernoulli-model-1.html

A. McCallum and K. Nigam (1998). A comparison of event models for naive Bayes text classification. Proc. AAAI/ICML-98 Workshop on Learning for Text Categorization, pp. 41-48.


In [None]:
import pandas as pd
import numpy as np
import sklearn as sk


In [None]:
df=pd.read_csv('phishing_site_urls.csv')

In [None]:
df.shape

In [None]:
df.head()

First we obtain the prior distributions $p(c)$.

In [None]:
import matplotlib.pyplot as plt

def get_prior(df):
    # Relative frequency of each label. value_counts() sorts classes by
    # decreasing frequency, so index 0 is the most frequent class.
    counts=df['Label'].value_counts()
    prob_c=counts.values/df.shape[0]
    class_names=counts.index
    return prob_c,class_names

prob_c,class_names=get_prior(df)
plt.bar(class_names,prob_c)
plt.xlabel('Class')
plt.ylabel('Prior probability')
plt.title('Class distribution')
plt.show()

In [None]:
df.iloc[2]

In [None]:
from sklearn.utils import shuffle

df=shuffle(df, random_state=200)


In [None]:
train=df.sample(frac=0.8, random_state=200) #random state is a seed value
test=df.drop(train.index)

In [None]:
train

In [None]:
test

In [None]:
prob_c,class_names=get_prior(train)
plt.bar(class_names,prob_c)
plt.xlabel('Class')
plt.ylabel('Prior probability')
plt.title('Class distribution (training set)')
plt.show()

In [None]:
prob_c

Now we compute the class-conditional probabilities. We transform the URLs into a binary token matrix.
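
As a small illustration of the binary representation, the sketch below tokenizes two invented URLs with the same separators used in this notebook and encodes one of them against the resulting vocabulary. The helper names `tokenize` and `to_binary` and the URLs are hypothetical:

```python
import re

def tokenize(url):
    """Split a lowercased URL on '.', '/', '?' and '='."""
    return re.split(r'\.|/|\?|=', url.lower())

def to_binary(url, vocab):
    """Return 1 for each vocabulary token present in the URL, else 0."""
    present = set(tokenize(url))
    return [1 if t in present else 0 for t in vocab]

# Hypothetical URLs, not taken from the dataset
urls = ["example.com/login?id=1", "example.org/home"]
vocab = sorted(set(t for u in urls for t in tokenize(u)))

print(vocab)
print(to_binary("example.org/home", vocab))
```

Each position of the vector corresponds to one variable $x_i$ of the model, so repeated tokens within a URL count only once.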

In [None]:
import re 

tokens=set()

for i in range(len(train)):
    tokens.update(set(re.split(r'\.|/|\?|=',train['URL'].iloc[i].lower())))

In [None]:
D=len(tokens)
N=train.shape[0]
print("The total number of documents is {0}, the total number of tokens is {1}".format(N,D))

In [None]:
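import re 

df_good=train[train.Label=='good']

# Number of 'good' URLs in which each token appears at least once
token_freq_good={t:0 for t in tokens}

for i in range(len(df_good)):
    token_list=re.split(r'\.|/|\?|=',df_good['URL'].iloc[i].lower())
    # Iterate over a set so a token repeated within one URL is counted once;
    # otherwise the count can exceed the number of URLs and theta can exceed 1
    for t in set(token_list):
        token_freq_good[t]+=1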

In [None]:
sorted(token_freq_good.items(), key=lambda item: item[1],reverse=True)[:20]

In [None]:
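df_bad=train[train.Label=='bad']

# Number of 'bad' URLs in which each token appears at least once
token_freq_bad={t:0 for t in tokens}

for i in range(len(df_bad)):
    token_list=re.split(r'\.|/|\?|=',df_bad['URL'].iloc[i].lower())
    # Iterate over a set so a token repeated within one URL is counted once;
    # otherwise the count can exceed the number of URLs and theta can exceed 1
    for t in set(token_list):
        token_freq_bad[t]+=1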

In [None]:
sorted(token_freq_bad.items(), key=lambda item: item[1],reverse=True)[:20]

In [None]:
n_bad=df_bad.shape[0]
n_good=df_good.shape[0]
alpha_val=0.1
theta_bad={t:(freq+alpha_val)/(n_bad+2*alpha_val) for t,freq in token_freq_bad.items() }
theta_good={t:(freq+alpha_val)/(n_good+2*alpha_val) for t,freq in token_freq_good.items()}

In [None]:
sorted(theta_bad.items(), key=lambda item: item[1],reverse=True)[:20]

In [None]:
sorted(theta_good.items(), key=lambda item: item[1],reverse=True)[:20]

In [None]:
from scipy.stats import bernoulli

prob_x_good={t:bernoulli(p) for t,p in theta_good.items()}
prob_x_bad={t:bernoulli(p) for t,p in theta_bad.items()}

# Inference on Test Data

We can now obtain the class posterior probability for the test data:

$p(c | \mathbf{x^\ast})=\frac{p(c)p(\mathbf{x^\ast} | c )}{p(\mathbf{x^\ast})}$

In [None]:
test.iloc[2]

In [None]:
token_list=re.split(r'\.|/|\?|=',test['URL'].iloc[2].lower())


In [None]:
token_list

In [None]:
(theta_good['ro']**0)*(1-theta_good['ro'])**(1-0)

In [None]:
prob_x_good['000001992819278372818381291'].pmf(1)

In [None]:
prob_x_bad['000001992819278372818381291'].pmf(1)

In [None]:
x_star={t:(1 if t in token_list else 0) for t in tokens}

In [None]:
p_good=np.prod([prob_x_good[t].pmf(x_i) for t,x_i in x_star.items() if t in prob_x_good.keys()])*prob_c[0]

In [None]:
p_bad=np.prod([prob_x_bad[t].pmf(x_i) for t,x_i in x_star.items() if t in prob_x_bad.keys()])*prob_c[1]

In [None]:
p_good,p_bad

In [None]:
evidence=p_good+p_bad  # p(x*), computed before either value is overwritten
p_good=p_good/evidence
p_bad=p_bad/evidence
np.argmax([p_good,p_bad])


In [None]:
y_hat=list()
for j in range(test.shape[0]):
    token_list=re.split(r'\.|/|\?|=',test['URL'].iloc[j].lower())
    x_star={t:(1 if t in token_list else 0) for t in tokens}
    p_good=np.prod([prob_x_good[t].pmf(x_i) for t,x_i in x_star.items() if t in prob_x_good.keys()])*prob_c[0]
    p_bad=np.prod([prob_x_bad[t].pmf(x_i) for t,x_i in x_star.items() if t in prob_x_bad.keys()])*prob_c[1]
    evidence=p_good+p_bad  # normalize both classes by the same p(x*)
    p_good=p_good/evidence
    p_bad=p_bad/evidence
    y_hat.append(np.argmax([p_good,p_bad]))
    if j==10:
        break
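
Multiplying many probabilities can underflow to zero for URLs with many rare tokens, in which case the normalization divides zero by zero. A common remedy is to sum logarithms and normalize with the log-sum-exp trick. The following is a minimal sketch with toy parameters (`theta_good_toy`, `theta_bad_toy` and the priors are invented for illustration), not the notebook's own implementation:

```python
import math

def log_joint(present_tokens, theta, prior):
    """Return log p(c) + sum_i log p(x_i | c) over the whole vocabulary."""
    total = math.log(prior)
    for token, p in theta.items():
        # log(p) if the token is present, log(1 - p) if it is absent
        total += math.log(p) if token in present_tokens else math.log1p(-p)
    return total

def log_posterior(log_scores):
    """Normalize log joint scores using the log-sum-exp trick."""
    m = max(log_scores.values())
    log_evidence = m + math.log(sum(math.exp(s - m) for s in log_scores.values()))
    return {c: s - log_evidence for c, s in log_scores.items()}

# Hypothetical parameters for a two-token vocabulary
theta_good_toy = {"login": 0.1, "paypal": 0.05}
theta_bad_toy = {"login": 0.6, "paypal": 0.4}
present = {"login", "paypal"}

scores = {
    "good": log_joint(present, theta_good_toy, prior=0.7),
    "bad": log_joint(present, theta_bad_toy, prior=0.3),
}
log_post = log_posterior(scores)
predicted = max(log_post, key=log_post.get)
print(predicted, math.exp(log_post[predicted]))
```

Since the logarithm is monotonic, the predicted class is the same as with the product form whenever the product does not underflow.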

# Naive Bayes Using Scikit-Learn


In [7]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.utils import shuffle
import re 

df=pd.read_csv('phishing_site_urls.csv')
df=shuffle(df, random_state=200)

train=df.sample(frac=0.8, random_state=200) #random state is a seed value
test=df.drop(train.index)
n_data=train.shape[0]

tokens=set()
for i in range(len(train)):
    tokens.update(set(re.split(r'\.|/|\?|=',train['URL'].iloc[i].lower())))
    
vectorizer = CountVectorizer(vocabulary=tokens,min_df=1./n_data,max_df=1.0)


We fit the vectorizer on the training URLs.

In [8]:
vectorizer.fit(train['URL'])

We now fit the model with incremental learning, which lets the computation scale when the training data does not fit in memory.

https://scikit-learn.org/stable/computing/scaling_strategies.html?highlight=out+core
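
Incremental fitting suits this model because the Bernoulli estimates depend only on running counts, which can be accumulated one mini-batch at a time. The sketch below illustrates that idea in plain Python for a single class; `RunningBernoulliCounts` and the toy token sets are hypothetical and do not reproduce scikit-learn's implementation:

```python
from collections import Counter

class RunningBernoulliCounts:
    """Hypothetical helper that accumulates counts for one class per batch."""

    def __init__(self):
        self.n_docs = 0              # documents seen so far
        self.token_docs = Counter()  # documents containing each token

    def update(self, batch_token_sets):
        for tokens_in_doc in batch_token_sets:
            self.n_docs += 1
            self.token_docs.update(set(tokens_in_doc))

    def theta(self, token, alpha):
        # Smoothed estimate of p(token present | class) from the counts so far
        return (self.token_docs[token] + alpha) / (self.n_docs + 2 * alpha)

counts_bad = RunningBernoulliCounts()
counts_bad.update([{"login", "paypal"}, {"login"}])  # first mini-batch
counts_bad.update([{"secure", "login"}])             # second mini-batch

theta_login = counts_bad.theta("login", alpha=0.5)   # (3 + 0.5) / (3 + 1)
print(theta_login)
```

The estimate after the last batch is the same as if all documents had been processed at once, which is why no batch needs to be kept in memory.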

In [25]:
from sklearn.naive_bayes import BernoulliNB

clf = BernoulliNB(alpha=0.2)

batch_size=5000
n_batches=train.shape[0]//batch_size
for i in range(n_batches + 1):
    mini_batch = train[i*batch_size:(i+1)*batch_size]
    X_train = vectorizer.transform(mini_batch['URL'])
    y_train = (mini_batch['Label']=='bad').astype('int')
    if X_train.shape[0]>0:
        clf.partial_fit(X_train, y_train,classes=[0,1])

Once the model is trained, we can evaluate it on the test data and compare its predictions with the true labels.

In [21]:
batch_size=5000
n_batches=test.shape[0]//batch_size
y_hat=list()
y_true=list()
for i in range(n_batches + 1):
    mini_batch = test[i*batch_size:(i+1)*batch_size]
    X_test=vectorizer.transform(mini_batch['URL'])
    y_test=(mini_batch['Label']=='bad').astype('int')
    if X_test.shape[0]>0:
        y_pred=clf.predict(X_test)
        y_true.extend(y_test)
        y_hat.extend(y_pred)

In [22]:
target_names=['good','bad']  # order must match the label encoding: 0='good', 1='bad'

In [23]:
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report


print(classification_report(y_true, y_hat, target_names=target_names))

              precision    recall  f1-score   support

        good       0.97      0.99      0.98     78462
         bad       0.97      0.92      0.95     31407

    accuracy                           0.97    109869
   macro avg       0.97      0.96      0.96    109869
weighted avg       0.97      0.97      0.97    109869
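
The precision and recall columns of such a report follow from counts of true positives, false positives and false negatives. A minimal sketch with made-up labels (the helper `precision_recall` and the toy lists are hypothetical and not part of scikit-learn):

```python
def precision_recall(y_true, y_pred, positive=1):
    """Compute precision and recall for the class labelled `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Toy labels with 1 = 'bad' (phishing) and 0 = 'good'
y_true_toy = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred_toy = [1, 1, 0, 0, 0, 1, 0, 1]

precision_bad, recall_bad = precision_recall(y_true_toy, y_pred_toy)
print(precision_bad, recall_bad)
```

For a phishing detector, recall on the 'bad' class measures how many phishing URLs are caught, while precision measures how many flagged URLs are actually phishing.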

