# Manning LiveProject

## Detecting Phishing Websites using ML and Python
https://liveproject.manning.com/course/101/detecting-phishing-websites-using-machine-learning-and-python

In [26]:
# Setup
from collections import Counter

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split

In [8]:
# Load data and start exploring it
df = pd.read_csv('Phishing.csv')
df.shape

(11055, 31)

In [22]:
df.head(5).T

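| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| having_IP_Address | -1 | 1 | 1 | 1 | 1 |
| URL_Length | 1 | 1 | 0 | 0 | 0 |
| Shortining_Service | 1 | 1 | 1 | 1 | -1 |
| having_At_Symbol | 1 | 1 | 1 | 1 | 1 |
| double_slash_redirecting | -1 | 1 | 1 | 1 | 1 |
| Prefix_Suffix | -1 | -1 | -1 | -1 | -1 |
| having_Sub_Domain | -1 | 0 | -1 | -1 | 1 |
| SSLfinal_State | -1 | 1 | -1 | -1 | 1 |
| Domain_registeration_length | -1 | -1 | -1 | 1 | -1 |
| Favicon | 1 | 1 | 1 | 1 | 1 |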


In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11055 entries, 0 to 11054
Data columns (total 31 columns):
having_IP_Address              11055 non-null int64
URL_Length                     11055 non-null int64
Shortining_Service             11055 non-null int64
having_At_Symbol               11055 non-null int64
double_slash_redirecting       11055 non-null int64
Prefix_Suffix                  11055 non-null int64
having_Sub_Domain              11055 non-null int64
SSLfinal_State                 11055 non-null int64
Domain_registeration_length    11055 non-null int64
Favicon                        11055 non-null int64
port                           11055 non-null int64
HTTPS_token                    11055 non-null int64
Request_URL                    11055 non-null int64
URL_of_Anchor                  11055 non-null int64
Links_in_tags                  11055 non-null int64
SFH                            11055 non-null int64
Submitting_to_email            11055 non-null int64
Abnorma

# Workflow 

## Objective: Prepare a cleaned version of the dataset and split the dataset into two parts.

As defined in [Course](https://liveproject.manning.com/module/101_4_2/detecting-phishing-websites-using-machine-learning-and-python/3--cleaning-the-class-labels-and-inspecting-for-missing-values/3-2--submit-your-work?):

1. Detecting phishing websites is essentially a binary classification problem. In the dataset, phishing websites are labeled -1. Encoding labels as negative values is not good practice when building machine learning models, because it can affect model performance. So change the **-1** values to **0**.

2. A common problem in almost every real dataset is missing values. Missing values can hamper the predictive performance of machine learning models, so verify whether the dataset contains any. If it does, impute them. (**Note:** Missing values can appear in many different forms. Refer to [this article](https://www.datacamp.com/community/tutorials/parameter-optimization-machine-learning-models) for more on this.)

3. Keeping separate training and testing sets is important for building machine learning models, so split the dataset into train and test sets using methods from the `scikit-learn` library. Before you do this, segregate the predictors and the class labels of the dataset into separate variables. The split should be in the ratio **80:20**. But how should the dataset be split? Should the order of instances be preserved, or can the split be done at random? While splitting, also ensure the split is the same every time someone runs your code, and make sure that the labels of the data points do not get corrupted.
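The questions in step 3 about random versus ordered splitting and about repeatability can be explored on a small synthetic array. This is a minimal sketch, not the course's reference solution: the arrays `X` and `y` are made up for illustration, and `42` is an arbitrary seed.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 10 rows, 2 features, alternating labels (illustration only).
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)

# Two random 80:20 splits with the same seed should produce identical partitions.
split_a = train_test_split(X, y, test_size=0.2, random_state=42)
split_b = train_test_split(X, y, test_size=0.2, random_state=42)
same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))

# With shuffle=False the original order is kept: the last 20% of rows form the test set.
_, X_test_ordered, _, y_test_ordered = train_test_split(
    X, y, test_size=0.2, shuffle=False
)

print("same split with same seed:", same_split)
print("ordered test rows:", X_test_ordered.tolist())
```

A random split is usually the safer default when the rows have no meaningful order, since an ordered split can put rows from only one part of the file into the test set.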

In [10]:
df["Result"].value_counts()

 1    6157
-1    4898
Name: Result, dtype: int64

In [11]:
# Recode the phishing label in the "Result" column from -1 to 0, then verify the counts.
df["Result"] = df["Result"].replace({-1: 0})
df["Result"].value_counts()

1    6157
0    4898
Name: Result, dtype: int64

In [13]:
# Check for null values in dataframe
df.isnull().sum()

having_IP_Address              0
URL_Length                     0
Shortining_Service             0
having_At_Symbol               0
double_slash_redirecting       0
Prefix_Suffix                  0
having_Sub_Domain              0
SSLfinal_State                 0
Domain_registeration_length    0
Favicon                        0
port                           0
HTTPS_token                    0
Request_URL                    0
URL_of_Anchor                  0
Links_in_tags                  0
SFH                            0
Submitting_to_email            0
Abnormal_URL                   0
Redirect                       0
on_mouseover                   0
RightClick                     0
popUpWidnow                    0
Iframe                         0
age_of_domain                  0
DNSRecord                      0
web_traffic                    0
Page_Rank                      0
Google_Index                   0
Links_pointing_to_page         0
Statistical_report             0
Result    
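`isnull()` only catches values that pandas already treats as missing. Since the task notes that missing values can take other forms, one extra check is to look for cells outside the expected coding. The sketch below assumes every feature is coded as -1, 0, or 1, which matches the rows shown earlier but is not confirmed for the full dataset. The toy frame and the sentinel value `99` are invented for illustration.

```python
import pandas as pd

# Toy frame with invented values; 99 stands in for a hypothetical "unknown" sentinel.
toy = pd.DataFrame({
    "URL_Length": [1, 0, -1, 1],
    "Favicon": [1, 1, 99, -1],
})

# Assumed set of valid codes for every feature column.
allowed = [-1, 0, 1]

# True wherever a cell holds a value outside the allowed codes.
out_of_domain = ~toy.isin(allowed)

# Number of suspicious cells per column.
per_column = out_of_domain.sum()
print(per_column)
```

On the real data, the same check would be run with `df` in place of `toy`. Any non-zero count would point to a disguised missing value or an unexpected code worth inspecting.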

In [17]:
df.columns.size

31

In [23]:
df.iloc[:,0:30].head().T

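| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| having_IP_Address | -1 | 1 | 1 | 1 | 1 |
| URL_Length | 1 | 1 | 0 | 0 | 0 |
| Shortining_Service | 1 | 1 | 1 | 1 | -1 |
| having_At_Symbol | 1 | 1 | 1 | 1 | 1 |
| double_slash_redirecting | -1 | 1 | 1 | 1 | 1 |
| Prefix_Suffix | -1 | -1 | -1 | -1 | -1 |
| having_Sub_Domain | -1 | 0 | -1 | -1 | 1 |
| SSLfinal_State | -1 | 1 | -1 | -1 | 1 |
| Domain_registeration_length | -1 | -1 | -1 | 1 | -1 |
| Favicon | 1 | 1 | 1 | 1 | 1 |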


In [25]:
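# Split predictors (first 30 columns) and labels ("Result") into train and test sets, 80:20.
# random_state fixes the shuffle so the split is the same on every run.
x_train, x_test, y_train, y_test = train_test_split(
    df.iloc[:, 0:30], df["Result"], train_size=0.8, random_state=42
)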

In [31]:
print(y_train.value_counts())
print(y_test.value_counts())

1    4902
0    3942
Name: Result, dtype: int64
1    1255
0     956
Name: Result, dtype: int64
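The counts above can be turned into class shares to see how closely the two sets match. The sketch below uses the printed counts directly. It then shows, on synthetic labels with the same class totals, how the `stratify` argument of `train_test_split` keeps the shares closer. The stratified split is an optional variation and not what the notebook ran. The placeholder feature array `X` is invented for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Counts copied from the output above.
train_counts = {1: 4902, 0: 3942}
test_counts = {1: 1255, 0: 956}

def positive_share(counts):
    """Fraction of rows labeled 1."""
    return counts[1] / (counts[0] + counts[1])

train_share = positive_share(train_counts)  # about 0.554
test_share = positive_share(test_counts)    # about 0.568

# Synthetic labels with the same class totals as the dataset (6157 ones, 4898 zeros).
y = np.array([1] * 6157 + [0] * 4898)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder feature column

# stratify=y asks for the class ratio to be preserved in both parts.
_, _, y_tr, y_te = train_test_split(
    X, y, train_size=0.8, random_state=42, stratify=y
)

print(round(train_share, 3), round(test_share, 3))
print(round(y_tr.mean(), 3), round(y_te.mean(), 3))
```

The unstratified shares differ by a little over one percentage point, which is small here. Stratification matters more when the classes are heavily imbalanced.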
