# Manning LiveProject

## Detecting Phishing Websites using ML and Python
https://liveproject.manning.com/course/101/detecting-phishing-websites-using-machine-learning-and-python

In [47]:
# Setup
from collections import Counter

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.linear_model import SGDClassifier, LogisticRegression
from sklearn.model_selection import (
    RandomizedSearchCV,
    cross_val_score,
    cross_validate,
    train_test_split,
)
from sklearn.metrics import classification_report, confusion_matrix

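# Silence library warnings to keep cell output readable.
# Note: this also hides convergence and failed-fit warnings from scikit-learn.
import warnings
warnings.filterwarnings('ignore')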

In [6]:
# Load data and start exploring it
df = pd.read_csv('Phishing.csv')
df.shape

(11055, 31)

In [7]:
df.head(5).T

Unnamed: 0,0,1,2,3,4
having_IP_Address,-1,1,1,1,1
URL_Length,1,1,0,0,0
Shortining_Service,1,1,1,1,-1
having_At_Symbol,1,1,1,1,1
double_slash_redirecting,-1,1,1,1,1
Prefix_Suffix,-1,-1,-1,-1,-1
having_Sub_Domain,-1,0,-1,-1,1
SSLfinal_State,-1,1,-1,-1,1
Domain_registeration_length,-1,-1,-1,1,-1
Favicon,1,1,1,1,1


In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11055 entries, 0 to 11054
Data columns (total 31 columns):
having_IP_Address              11055 non-null int64
URL_Length                     11055 non-null int64
Shortining_Service             11055 non-null int64
having_At_Symbol               11055 non-null int64
double_slash_redirecting       11055 non-null int64
Prefix_Suffix                  11055 non-null int64
having_Sub_Domain              11055 non-null int64
SSLfinal_State                 11055 non-null int64
Domain_registeration_length    11055 non-null int64
Favicon                        11055 non-null int64
port                           11055 non-null int64
HTTPS_token                    11055 non-null int64
Request_URL                    11055 non-null int64
URL_of_Anchor                  11055 non-null int64
Links_in_tags                  11055 non-null int64
SFH                            11055 non-null int64
Submitting_to_email            11055 non-null int64
Abnorma

In [9]:
# Relabel the target: map -1 to 0 in the "Result" column, then verify the class counts.
df["Result"] = df["Result"].replace({-1: 0})
df["Result"].value_counts()

1    6157
0    4898
Name: Result, dtype: int64
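
The class counts above give a useful reference point: a classifier that always predicts the majority class would already be right a little over half the time. A minimal sketch of that arithmetic, using only the counts printed above (the variable names are illustrative):

```python
# Class counts copied from the value_counts() output above.
counts = {1: 6157, 0: 4898}

total = sum(counts.values())
majority_label = max(counts, key=counts.get)
baseline_accuracy = counts[majority_label] / total

print(f"total rows        : {total}")
print(f"majority label    : {majority_label}")
print(f"baseline accuracy : {baseline_accuracy:.4f}")
```

Any model trained later should be judged against this baseline rather than against 50%.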

In [11]:
df.iloc[:,0:30].head().T

Unnamed: 0,0,1,2,3,4
having_IP_Address,-1,1,1,1,1
URL_Length,1,1,0,0,0
Shortining_Service,1,1,1,1,-1
having_At_Symbol,1,1,1,1,1
double_slash_redirecting,-1,1,1,1,1
Prefix_Suffix,-1,-1,-1,-1,-1
having_Sub_Domain,-1,0,-1,-1,1
SSLfinal_State,-1,1,-1,-1,1
Domain_registeration_length,-1,-1,-1,1,-1
Favicon,1,1,1,1,1


In [17]:
# Split into training (80%) and test (20%) sets.
# The first 30 columns are the features; "Result" is the target.
x_train, x_test, y_train, y_test = train_test_split(
    df.iloc[:, 0:30], df.Result, train_size=0.8, random_state=42
)
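
To sanity-check the split sizes without the CSV file, the sketch below runs the same `train_test_split` call on placeholder arrays with the dataset's shape (11055 rows, 30 feature columns). The arrays `X` and `y` are stand-ins filled with zeros, not the real data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same shape as the phishing data;
# the values themselves are irrelevant for checking split sizes.
X = np.zeros((11055, 30))
y = np.zeros(11055, dtype=int)

x_tr, x_te, y_tr, y_te = train_test_split(X, y, train_size=0.8, random_state=42)

print("train:", x_tr.shape, "test:", x_te.shape)
```

The test portion comes out at 2211 rows, which matches the total `support` shown in the classification report later in the notebook.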

# Workflow 

## Objective: Build a Logistic Regression model to classify phishing websites, then use randomized hyperparameter search to improve its predictive performance.

As defined in the [course](https://liveproject.manning.com/module/101_5_1/detecting-phishing-websites-using-machine-learning-and-python/4--training-a-logistic-regression-model/4-1--training-a-logistic-regression-model?):

1. You are now ready to feed the data you prepared in the previous units to machine learning models. It is good to start with a linear classifier like Logistic Regression, particularly when the dataset contains only the values -1, 0, and 1 (you should have come across this by now). Construct a Logistic Regression model using the scikit-learn library. After the model is instantiated, fit it to the training dataset you prepared in the previous part. Although the task asks you to use Logistic Regression as a starting point, I encourage you to try out other models as well and see how they perform.

2. Evaluate the model on the test set. The evaluation report should include a precision/recall matrix.

3. To improve the Logistic Regression model, use methods from scikit-learn to tune the hyperparameters of the model so that they are searched in a randomized fashion. Remember the model was constructed using the default hyperparameter values as provided by scikit-learn. Construct the grid of hyperparameters using penalty, C, tol and max_iter. You can provide the following value ranges to the grid:

4. You now want to begin the random search process. Fit the RandomizedSearchCV model to the training data. After the model fitting is done, you will extract and print the best estimator and its hyperparameter values. The output should look like the following

5. You would want to evaluate the best estimator now. Evaluate it on the test data and print out the necessary evaluation metrics just like you did in the second task.

6. Don’t forget to log the necessary information (accuracy, model training time, the model’s hyperparameters, etc.) using the Python library `wandb`.
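
The logging step in item 6 is not carried out in this notebook. As a rough sketch of what could be collected, the block below times a fit and gathers accuracy, training time, and hyperparameters into a plain dictionary. It runs on synthetic data from `make_classification` as a stand-in for the phishing dataset. The `wandb` calls are left as comments because they need an installed package and a configured account, and the project name shown there is hypothetical.

```python
import time

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the phishing data so the sketch runs on its own.
X, y = make_classification(n_samples=500, n_features=30, random_state=42)
x_tr, x_te, y_tr, y_te = train_test_split(X, y, train_size=0.8, random_state=42)

model = LogisticRegression(max_iter=200)

# Measure wall-clock training time around the fit call.
start = time.perf_counter()
model.fit(x_tr, y_tr)
training_seconds = time.perf_counter() - start

run_summary = {
    "accuracy": float(model.score(x_te, y_te)),
    "training_seconds": training_seconds,
    "hyperparameters": model.get_params(),
}
print(run_summary)

# With wandb installed and configured, the same values could be sent like this
# (the project name is a placeholder):
#   import wandb
#   run = wandb.init(project="phishing-detection", config=run_summary["hyperparameters"])
#   wandb.log({"accuracy": run_summary["accuracy"],
#              "training_seconds": run_summary["training_seconds"]})
#   run.finish()
```

Keeping the summary in a dictionary first makes it easy to print, save, or log the same values to any tracking tool.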

In [39]:
lr_clf = LogisticRegression()
lr_clf.fit(x_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [43]:
preds = lr_clf.predict(x_test)
print('Accuracy = ', np.mean(preds == y_test))
print()
creport = classification_report(y_test, preds, target_names=['Phishing Websites', 'Normal Websites'])
print(creport)

Accuracy =  0.9240162822252375

                   precision    recall  f1-score   support

Phishing Websites       0.92      0.90      0.91       956
  Normal Websites       0.93      0.94      0.93      1255

         accuracy                           0.92      2211
        macro avg       0.92      0.92      0.92      2211
     weighted avg       0.92      0.92      0.92      2211
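
To make the columns of the report concrete, here is a small sketch of how per-class precision, recall, and F1 are derived from true and predicted labels. The two label lists are hypothetical toy values, not taken from the dataset, and `per_class_metrics` is an illustrative helper rather than a scikit-learn function.

```python
# Hypothetical toy labels (0 = phishing, 1 = normal); not taken from the dataset.
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 1, 1, 0, 1]


def per_class_metrics(y_true, y_pred, label):
    """Precision, recall and F1 for one class, treating `label` as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


for label, name in [(0, "Phishing Websites"), (1, "Normal Websites")]:
    precision, recall, f1 = per_class_metrics(y_true, y_pred, label)
    print(f"{name:>17}  precision={precision:.2f}  recall={recall:.2f}  f1={f1:.2f}")
```

Each row of the report treats one class as the positive class in turn, which is why phishing and normal websites get separate precision and recall values.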



In [54]:
# Hyper-parameter tuning
# Note: lr_clf uses the default 'lbfgs' solver, which does not support the 'l1' penalty,
# so candidates with penalty='l1' are expected to fail and be scored as NaN by the search.
penalty = ['l1', 'l2']
C = [0.8, 0.9, 0.1]
tol = [0.01, 0.001, 0.0001]
max_iter = [100, 150, 200, 250]

distributions = dict(C=C, penalty=penalty, tol=tol, max_iter=max_iter)
rscv_clf = RandomizedSearchCV(lr_clf, distributions, random_state=0)
search = rscv_clf.fit(x_train, y_train)

print('Best Score : ' + str(search.best_score_.round(4)*100) + '% using ' + str(search.best_params_))

Best Score : 92.84% using {'tol': 0.001, 'penalty': 'l2', 'max_iter': 250, 'C': 0.8}
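
The best score above is a cross-validated score on the training data. Step 5 of the workflow (evaluating the best estimator on the test set) is not shown in this notebook, so the block below is only a sketch of how it might look. It uses synthetic data as a stand-in for the phishing dataset and assumes the `liblinear` solver, chosen here because it accepts both `l1` and `l2` penalties. The original cell uses the default solver instead.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic stand-in for the phishing data so the sketch runs on its own.
X, y = make_classification(n_samples=600, n_features=30, random_state=0)
x_tr, x_te, y_tr, y_te = train_test_split(X, y, train_size=0.8, random_state=42)

# Same candidate values as the cell above.
param_distributions = {
    "penalty": ["l1", "l2"],
    "C": [0.8, 0.9, 0.1],
    "tol": [0.01, 0.001, 0.0001],
    "max_iter": [100, 150, 200, 250],
}

# Assumption: 'liblinear' is used so that both penalties can actually be fitted.
search = RandomizedSearchCV(
    LogisticRegression(solver="liblinear"),
    param_distributions,
    random_state=0,
)
search.fit(x_tr, y_tr)

# With the default refit=True, best_estimator_ is refitted on all of x_tr.
best_model = search.best_estimator_
test_preds = best_model.predict(x_te)
test_accuracy = float((test_preds == y_te).mean())

print("Best params  :", search.best_params_)
print("Test accuracy:", round(test_accuracy, 4))
print(classification_report(y_te, test_preds))
```

By default `RandomizedSearchCV` samples 10 of the 72 possible combinations in this grid and scores each with 5-fold cross-validation, so different `random_state` values can select different candidates.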
