# Phishing URL Machine Learning Model

This notebook walks through the process of creating the machine learning model that will be used to make predictions on a URL.

<br>

The process for this project:
- Load Data
    - Only use 2 columns
    
- Manipulate the URL
    - Create Domain and Path
    
- Use sklearn's `Pipeline` to create a machine learning model.



# Import Libraries

In [1]:
# Data Manipulation
try:
    import numpy as np
    import pandas as pd

    # Data Engineering / Machine learning
    from sklearn.preprocessing import FunctionTransformer
    from sklearn.feature_extraction import DictVectorizer
    from sklearn.metrics import classification_report
    from sklearn.pipeline import Pipeline
    from sklearn.compose import ColumnTransformer
    from sklearn.svm import LinearSVC
    from sklearn.model_selection import train_test_split

    # Create a pickle model
    import cloudpickle

    # Keep track when model was created
    import datetime
    
    # File Path
    from pathlib import Path

    print('[SUCCESS]')

# Catch any error raised while importing a library
except ImportError as ie:
    raise ImportError(f'[Error importing]: {ie}')

[SUCCESS]


# Load Data

The data was gathered from PhishStorm: Detecting Phishing using Streaming Analytics. You can find the details [here](https://ieeexplore.ieee.org/abstract/document/6975177)

Our data consists of the following columns:

['domain', 'ranking', 'mld_res', 'mld.ps_res', 'card_rem', 'ratio_Rrem','ratio_Arem', 'jaccard_RR', 'jaccard_RA', 'jaccard_AR', 'jaccard_AA','jaccard_ARrd', 'jaccard_ARrem', 'label']
       
We only need the `domain` column (which holds the full URL) and the `label` column, so the pandas dataframe we create keeps just those two.

In [2]:
def load_data():
    """
    Check whether urlset.csv is in the current directory; if it is not, download the zip file and unzip it.
    Load the dataset into a pandas dataframe.
    As noted above, we only keep the domain and label columns.
    Rows with NA values are dropped.
    """
    
    path = Path('urlset.csv')
    if not path.is_file():
        !wget 'https://research.aalto.fi/files/16859732/urlset.csv.zip'
        !unzip urlset.csv.zip
    
    
    df = pd.read_csv('urlset.csv', encoding_errors='ignore', on_bad_lines='skip')
    
    df = df[['domain','label']]
    df = df.dropna()

    
    return df

df = load_data() # Run Function
df.head(10)



| | domain | label |
|---|---|---|
| 0 | nobell.it/70ffb52d079109dca5664cce6f317373782/... | 1.0 |
| 1 | www.dghjdgf.com/paypal.co.uk/cycgi-bin/webscrc... | 1.0 |
| 2 | serviciosbys.com/paypal.cgi.bin.get-into.herf.... | 1.0 |
| 3 | mail.printakid.com/www.online.americanexpress.... | 1.0 |
| 4 | thewhiskeydregs.com/wp-content/themes/widescre... | 1.0 |
| 5 | smilesvoegol.servebbs.org/voegol.php | 1.0 |
| 6 | premierpaymentprocessing.com/includes/boleto-2... | 1.0 |
| 7 | myxxxcollection.com/v1/js/jih321/bpd.com.do/do... | 1.0 |
| 8 | super1000.info/docs | 1.0 |
| 9 | horizonsgallery.com/js/bin/ssl1/_id/www.paypal... | 1.0 |
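
The `!wget` and `!unzip` lines in `load_data` are IPython shell escapes, so that cell only runs inside a notebook on a machine that has both tools installed. As a rough alternative, the download-and-extract step can be sketched with the standard library alone. The names `ensure_dataset` and `download` are hypothetical, and the sketch assumes the archive holds `urlset.csv` at its top level.

```python
import zipfile
from pathlib import Path
from urllib.request import urlretrieve

# URL taken from the load_data cell
DATA_URL = 'https://research.aalto.fi/files/16859732/urlset.csv.zip'


def ensure_dataset(csv_path='urlset.csv', url=DATA_URL, download=urlretrieve):
    """Return the path to the CSV, downloading and unzipping it first if needed.

    `download` is any callable taking (url, target_path); it defaults to
    urllib's urlretrieve and can be swapped out for offline testing.
    """
    csv_path = Path(csv_path)
    if csv_path.is_file():
        return csv_path  # already present, nothing to do

    # Save the archive next to the CSV, e.g. urlset.csv.zip
    zip_path = csv_path.with_name(csv_path.name + '.zip')
    download(url, zip_path)

    # Assumption: the CSV sits at the top level of the archive
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(csv_path.parent)
    return csv_path
```

Calling `ensure_dataset()` before `pd.read_csv('urlset.csv', ...)` would then stand in for the two shell commands.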


# URL Manipulation

The function below splits each URL into two parts: the domain (everything before the first '/') and the path (everything after the first '/', excluding any query string that starts at '?').
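
To make the splitting rule concrete, here is a minimal plain-Python sketch for a single URL string. The helper name `split_url` is hypothetical, and it assumes URLs without a scheme such as `http://`, which matches the rows shown in the dataset preview.

```python
def split_url(url):
    """Split a scheme-less URL into (domain, path), dropping any query string."""
    # Everything before the first '/' is the domain
    domain, _, rest = url.partition('/')
    # Everything before the first '?' in the remainder is the path
    path, _, _query = rest.partition('?')
    return domain, path


print(split_url('www.google.com/a/b/page.html?id=1'))  # ('www.google.com', 'a/b/page.html')
print(split_url('super1000.info'))                     # ('super1000.info', '')
```

A URL with no '/' yields an empty path, which mirrors the `fillna('')` step in the pandas version.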

In [3]:
def url_hacking(df):
    """
    Given a dataframe with a column 'domain' that contains the full url, return a new dataframe with two features: domain and path.
    """
    
    # Set the domain as everything before the first /
    col_domain = df['domain'].str.split('/').str[0]
    
    # Set the path as everything after the first / but before any query string ?
    # n must be passed by keyword: pandas 2.x no longer accepts it positionally
    path = df['domain'].str.split('/', n=1).str[1].fillna('')
    col_path = path.str.split('?').str[0]
    
    # Build a new dataframe from the two pandas series
    return pd.DataFrame(
        {'domain': col_domain,
         'path': col_path}
    )

url_transformer = FunctionTransformer(url_hacking)

# Feature Engineering

Using our Domain and Path columns, we will create different features associated with them.

### Domain Feature Engineering

In [4]:
def domain_feature_engineering(domain_list):
    """
    This function takes in the domain column as a list.
    
    We will have 4 features:
    - length of the domain
    - number of dots in the domain
    - top-level domain of the url (the text after the last dot)
    - sub domain of the url (the text before the first dot)
    
    Returns a list of dictionaries, one for each domain provided.
    """
    
    return [
        {
            "length_domain": len(domain),
            "count_of_dots": domain.count('.'), 
            "tld": domain.rsplit('.')[-1],
            "subdomain": domain.split('.')[0]
        } for domain in domain_list
    ]


In [5]:
def path_feature_engineering(path_list):
    """
    This function takes in the path column as a list.
    
    We will have 3 features:
    - length of path
    - number of slashes in path
    - number of dots in path
    """
    
    return [
        {
            'length_path': len(path),
            'count_of_slash': path.count('/'),
            'count_dots': path.count('.')
        } for path in path_list
    ]

#### For testing

In [None]:
"""
URL MANIPULATION TESTING

We want to check that the URL provided is split up appropriately.

Example:
url = www.google.com/skjgnfksjfnksjfn/ksjnfdkjsndf/lohf.sjkdfksjnf

Feed into url_hacking(df)

Output:

domain                path
www.google.com        skjgnfksjfnksjfn/ksjnfdkjsndf/lohf.sjkdfksjnf


"""
test = url_hacking(df)
test.head(5)

| | domain | path |
|---|---|---|
| 0 | nobell.it | 70ffb52d079109dca5664cce6f317373782/login.SkyP... |
| 1 | www.dghjdgf.com | paypal.co.uk/cycgi-bin/webscrcmd=_home-custome... |
| 2 | serviciosbys.com | paypal.cgi.bin.get-into.herf.secure.dispatch35... |
| 3 | mail.printakid.com | www.online.americanexpress.com/index.html |
| 4 | thewhiskeydregs.com | wp-content/themes/widescreen/includes/temp/pro... |


In [7]:
"""
DOMAIN FEATURE ENGINEERING TEST

We want to make sure that if a list of domains is provided to the function, it will return
length of domain, count of dots in domain, top-level domain, and subdomain.

Example
url = www.google.com

Output:
{'length_domain': 14, 'count_of_dots': 2, 'tld': 'com', 'subdomain': 'www'}


"""


domain_list = test['domain'].tolist()
dl = domain_list[:10]
output_test_domain = domain_feature_engineering(dl)
for i in output_test_domain:
    print(i)

{'length_domain': 9, 'count_of_dots': 1, 'tld': 'it', 'subdomain': 'nobell'}
{'length_domain': 15, 'count_of_dots': 2, 'tld': 'com', 'subdomain': 'www'}
{'length_domain': 16, 'count_of_dots': 1, 'tld': 'com', 'subdomain': 'serviciosbys'}
{'length_domain': 18, 'count_of_dots': 2, 'tld': 'com', 'subdomain': 'mail'}
{'length_domain': 19, 'count_of_dots': 1, 'tld': 'com', 'subdomain': 'thewhiskeydregs'}
{'length_domain': 25, 'count_of_dots': 2, 'tld': 'org', 'subdomain': 'smilesvoegol'}
{'length_domain': 28, 'count_of_dots': 1, 'tld': 'com', 'subdomain': 'premierpaymentprocessing'}
{'length_domain': 19, 'count_of_dots': 1, 'tld': 'com', 'subdomain': 'myxxxcollection'}
{'length_domain': 14, 'count_of_dots': 1, 'tld': 'info', 'subdomain': 'super1000'}
{'length_domain': 19, 'count_of_dots': 1, 'tld': 'com', 'subdomain': 'horizonsgallery'}


In [8]:
"""
PATH FEATURE ENGINEERING TEST

We want to make sure that if a list of paths is provided to the function, it will return
length of path, count of dots in path and count of slashes in path.

Example
path = lohf.sjkdfksjn

Output:
{'length_path': 14, 'count_of_slash': 0, 'count_dots': 1}


"""


path_list = test['path'].tolist()
pl = path_list[:10]
output_test_path = path_feature_engineering(pl)
for i in output_test_path:
    print(i)

{'length_path': 124, 'count_of_slash': 7, 'count_dots': 3}
{'length_path': 65, 'count_of_slash': 3, 'count_dots': 3}
{'length_path': 160, 'count_of_slash': 10, 'count_dots': 6}
{'length_path': 41, 'count_of_slash': 1, 'count_dots': 4}
{'length_path': 59, 'count_of_slash': 6, 'count_dots': 0}
{'length_path': 10, 'count_of_slash': 0, 'count_dots': 1}
{'length_path': 32, 'count_of_slash': 1, 'count_dots': 1}
{'length_path': 40, 'count_of_slash': 5, 'count_dots': 4}
{'length_path': 4, 'count_of_slash': 0, 'count_dots': 0}
{'length_path': 80, 'count_of_slash': 9, 'count_dots': 3}
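
The manual checks in this section can also be written as assertions, so a regression shows up as a failure instead of needing a visual comparison. This is a minimal sketch that restates the two feature functions so the block runs on its own; the expected values come from the examples in the test docstrings above.

```python
def domain_feature_engineering(domain_list):
    # Same logic as the notebook cell, repeated so this block is self-contained
    return [
        {
            'length_domain': len(domain),
            'count_of_dots': domain.count('.'),
            'tld': domain.rsplit('.')[-1],
            'subdomain': domain.split('.')[0],
        } for domain in domain_list
    ]


def path_feature_engineering(path_list):
    # Same logic as the notebook cell, repeated so this block is self-contained
    return [
        {
            'length_path': len(path),
            'count_of_slash': path.count('/'),
            'count_dots': path.count('.'),
        } for path in path_list
    ]


# Example from the domain test docstring
assert domain_feature_engineering(['www.google.com']) == [
    {'length_domain': 14, 'count_of_dots': 2, 'tld': 'com', 'subdomain': 'www'}
]

# Example from the path test docstring
assert path_feature_engineering(['lohf.sjkdfksjn']) == [
    {'length_path': 14, 'count_of_slash': 0, 'count_dots': 1}
]

# Edge case: a URL without a path produces an empty string
assert path_feature_engineering(['']) == [
    {'length_path': 0, 'count_of_slash': 0, 'count_dots': 0}
]

print('all checks passed')
```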
