# Phishing URL Machine Learning Model

This notebook walks through the process of creating our machine learning model, which will then be used to make predictions on a URL.

# Import Libraries

In [5]:
# Data Manipulation
try:
    import numpy as np
    import pandas as pd

    # Data Engineering / Machine learning
    from sklearn.preprocessing import FunctionTransformer
    from sklearn.feature_extraction import DictVectorizer
    from sklearn.metrics import classification_report
    from sklearn.pipeline import Pipeline
    from sklearn.compose import ColumnTransformer
    from sklearn.svm import LinearSVC
    from sklearn.model_selection import train_test_split

    # Create a pickle model
    import cloudpickle

    # Keep track when model was created
    import datetime
    
    # File Path
    from pathlib import Path

    print('[SUCCESS]')

# Re-raise with a clearer message if any library fails to import
except ImportError as ie:
    raise ImportError(f'[Error importing]: {ie}') from ie

[SUCCESS]


# Load Data

The data comes from the paper "PhishStorm: Detecting Phishing With Streaming Analytics". You can find the details [here](https://ieeexplore.ieee.org/abstract/document/6975177).

Our data consists of the following columns:

['domain', 'ranking', 'mld_res', 'mld.ps_res', 'card_rem', 'ratio_Rrem','ratio_Arem', 'jaccard_RR', 'jaccard_RA', 'jaccard_AR', 'jaccard_AA','jaccard_ARrd', 'jaccard_ARrem', 'label']
       
We only need two of these columns: `domain` (which holds the full URL) and `label`. We can go ahead and build our pandas DataFrame from just those two.

In [10]:
def load_data():
    """
    Check whether urlset.csv is in the current directory; if it is not, download the zip file and unzip it.
    Load the dataset into a pandas DataFrame.
    As noted above, we keep only the domain and label columns.
    Rows with NA values are dropped.
    """
    
    path = Path('urlset.csv')
    if not path.is_file():
        !wget 'https://research.aalto.fi/files/16859732/urlset.csv.zip'
        !unzip urlset.csv.zip
    
    
    df = pd.read_csv('urlset.csv', encoding_errors='ignore', on_bad_lines='skip')
    
    df = df[['domain','label']]
    df = df.dropna()

    
    return df

df = load_data() # Run Function
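
The `!wget` and `!unzip` lines in `load_data` are IPython shell escapes, so they only run inside a notebook and only where both tools are installed. If the loader ever needs to run as a plain Python script, the download step could be sketched with the standard library instead. This is a minimal sketch, not the notebook's own method: `ensure_dataset` and `DATASET_URL` are hypothetical names, and it assumes the archive holds `urlset.csv` at its top level, as the `!unzip` step implies.

```python
import tempfile
import urllib.request
import zipfile
from pathlib import Path

# Same URL the notebook passes to wget.
DATASET_URL = 'https://research.aalto.fi/files/16859732/urlset.csv.zip'


def ensure_dataset(csv_path=Path('urlset.csv'), url=DATASET_URL):
    """Download and unzip the dataset unless csv_path already exists (hypothetical helper)."""
    csv_path = Path(csv_path)
    if csv_path.is_file():
        return csv_path

    # Download the archive to a temporary directory so no zip file is left behind.
    with tempfile.TemporaryDirectory() as tmp:
        zip_path = Path(tmp) / 'urlset.csv.zip'
        urllib.request.urlretrieve(url, zip_path)

        # Assumption: the archive contains urlset.csv at its top level.
        with zipfile.ZipFile(zip_path) as archive:
            archive.extractall(csv_path.parent)

    return csv_path
```

With this in place, the `if not path.is_file():` branch could call `ensure_dataset(path)` in place of the two shell escapes.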



In [11]:
df.head()

| | domain | label |
|---|---|---|
| 0 | nobell.it/70ffb52d079109dca5664cce6f317373782/... | 1.0 |
| 1 | www.dghjdgf.com/paypal.co.uk/cycgi-bin/webscrc... | 1.0 |
| 2 | serviciosbys.com/paypal.cgi.bin.get-into.herf.... | 1.0 |
| 3 | mail.printakid.com/www.online.americanexpress.... | 1.0 |
| 4 | thewhiskeydregs.com/wp-content/themes/widescre... | 1.0 |


# URL Manipulation

The function below splits each URL into two parts:

the domain (everything before the first '/') and the path (everything after the first '/', up to but not including any query string).

In [15]:
def url_hacking(df):
    """
    Given a DataFrame whose 'domain' column contains the full URL, return a new DataFrame with two features: domain and path.
    """
    
    # Set the domain as everything before the first /
    col_domain = df['domain'].str.split('/').str[0]
    
    # Set the path as everything after first / but before any query string ?
    # n must be passed by keyword: pandas 2.x no longer accepts it positionally
    path = df['domain'].str.split('/', n=1).str[1].fillna('')
    col_path = path.str.split('?').str[0]
    
    # Build a new DataFrame from the two Series
    return pd.DataFrame(
        {'domain': col_domain,
         'path': col_path}
    )
# Let's take a look at our new data
test = url_hacking(df)
test.head(10)

| | domain | path |
|---|---|---|
| 0 | nobell.it | 70ffb52d079109dca5664cce6f317373782/login.SkyP... |
| 1 | www.dghjdgf.com | paypal.co.uk/cycgi-bin/webscrcmd=_home-custome... |
| 2 | serviciosbys.com | paypal.cgi.bin.get-into.herf.secure.dispatch35... |
| 3 | mail.printakid.com | www.online.americanexpress.com/index.html |
| 4 | thewhiskeydregs.com | wp-content/themes/widescreen/includes/temp/pro... |
| 5 | smilesvoegol.servebbs.org | voegol.php |
| 6 | premierpaymentprocessing.com | includes/boleto-2via-07-2012.php |
| 7 | myxxxcollection.com | v1/js/jih321/bpd.com.do/do/l.popular.php |
| 8 | super1000.info | docs |
| 9 | horizonsgallery.com | js/bin/ssl1/_id/www.paypal.com/fr/cgi-bin/webs... |
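
To sanity-check the splitting rule on individual URLs, the same logic can be sketched in plain Python for a single string. This is only an illustration under the rule described in this section (domain before the first '/', path after it and before any '?'); `split_url` is a hypothetical helper and is not part of the notebook, and the `example.com` URLs are made up for the edge cases.

```python
def split_url(url):
    """Split one scheme-less URL string into domain and path (hypothetical helper)."""
    # Everything before the first '/' is the domain; the rest is path plus query.
    domain, _, rest = url.partition('/')
    # Drop the query string, if any, by keeping only what precedes the first '?'.
    path = rest.partition('?')[0]
    return {'domain': domain, 'path': path}


print(split_url('mail.printakid.com/www.online.americanexpress.com/index.html'))
# {'domain': 'mail.printakid.com', 'path': 'www.online.americanexpress.com/index.html'}

print(split_url('example.com/a/b?x=1'))
# {'domain': 'example.com', 'path': 'a/b'}

print(split_url('example.com'))
# {'domain': 'example.com', 'path': ''}
```

The last case mirrors the `fillna('')` in `url_hacking`: a URL with no '/' gets an empty path rather than a missing value.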
