# Phishing detection with logistic regression and decision trees

Before introducing our new project, let's take a closer look at the algorithms our software uses.

## Regression models

Regression models are among the most widely used learning algorithms. Originating in statistical analysis, they have quickly spread into ML and AI in general. The best known of these models is **Linear Regression**, thanks to its simple implementation and its wide applicability, for example estimating house prices as interest rates vary.

Another model is **Logistic Regression**, useful in more complex cases where the linear model can't fit the dataset.

### Linear Regression

Linear regression represents the data as a weighted sum of features, which gives a straight line in the Cartesian plane, described by the following formula.
$$ y = wX + \beta$$

In this formula, $y$ represents the predicted values, obtained as a linear combination of the features ($X$) weighted by the weight vector ($w$), while $\beta$ is the intercept, that is, the value predicted when all the features are 0. The model extends to n dimensions (one per feature): the formula generalizes, and instead of a line it defines a hyperplane that divides the feature space, which is what lets us use it to classify data.
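
As a minimal sketch of the formula above, the snippet below computes the predictions by hand with NumPy. The weights, the intercept, and the sample values are made up for illustration and do not come from any dataset used in this notebook.

```python
import numpy as np

# Hypothetical values chosen only for illustration.
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])   # 3 samples, 2 features each
w = np.array([0.5, -1.0])    # one weight per feature
beta = 2.0                   # intercept: the prediction when all features are 0

# y = wX + beta, computed for every sample at once
y = X @ w + beta
print(y)
```

Each entry of `y` is the weighted sum of one row of `X` plus the intercept. Adding a feature only adds a column to `X` and an entry to `w`; the formula itself does not change.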

Let's see how to implement linear regression.

## Linear regression with scikit-learn
For details on how this model works and how to use it, see the [scikit-learn documentation](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html?highlight=linear%20regression#sklearn.linear_model.LinearRegression).


In [None]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Read our dataset
df = pd.read_csv('./resources/datasets/spam/smsSpamPerceptron.csv')

# Define our features (X) and targets (y)
X = df.iloc[:, [2, 3]].values
y = df.iloc[:, 1].values
y = np.where(y == 'spam', -1, 1)  # encode labels: 'spam' -> -1, anything else -> 1

# Fit the linear regression model from scikit-learn
linear_regression = LinearRegression()
linear_regression.fit(X, y)

# score() returns the coefficient of determination (R^2) on the given data
print(linear_regression.score(X, y))

## Pros and cons of Linear regression

The simplicity of the implementation is an advantage, but the model has important limitations.

Linear regression is suited to quantitative data; when the value to predict is categorical, we have to choose another model (logistic regression). Another drawback of the linear model is that it assumes our features are mostly unrelated to each other.
When they are not, the model risks counting the same information several times and failing to predict the effect that the combination of the variables has on the final result.
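
One way to spot related features before fitting a linear model is to look at their pairwise correlations. The sketch below uses synthetic data generated only for this illustration (the names `x1`, `x2`, `x3` are hypothetical): `x2` is almost a copy of `x1`, while `x3` is independent.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

# Synthetic features, not taken from any dataset in this notebook.
x1 = rng.normal(size=200)
x2 = x1 + 0.01 * rng.normal(size=200)  # nearly duplicates x1
x3 = rng.normal(size=200)              # unrelated to x1

# Each row passed to corrcoef is treated as one variable.
corr = np.corrcoef([x1, x2, x3])
print(np.round(corr, 2))
```

A correlation close to 1 between `x1` and `x2` signals that the two columns carry almost the same information, which is the situation described above.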

## Logistic regression

The difference between logistic and linear regression is that, as we have seen, the linear model is not suited to classification problems.
The Perceptron, as a linear model, does not give good results in terms of classification accuracy, mainly because linear regression works better with continuous ranges of values than with classes of discrete values.
Logistic regression addresses this by estimating the probability that a sample belongs to a given class. Let's look at the math and try to explain it.

$$ P(y = c|x) = e^z / (1 + e^z) $$

Here $z = \sum_{i} w_{i}x_{i}$, and $P(y = c|x)$ measures the conditional probability that a given sample falls into the class $c$, given the features $x_{i}$.
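
The formula can be tried out directly in a few lines of Python. This is only a sketch: the weights and feature values are hypothetical, and turning the probability into a class by thresholding at 0.5 is a common convention assumed here, not something the formula itself prescribes.

```python
import math

def logistic(z):
    """Map a real-valued score z to a probability, as e^z / (1 + e^z)."""
    return math.exp(z) / (1.0 + math.exp(z))

# Hypothetical weights and feature values, for illustration only.
w = [0.8, -0.4]
x = [2.0, 1.0]

# z is the weighted sum of the features
z = sum(w_i * x_i for w_i, x_i in zip(w, x))

p = logistic(z)               # estimated P(y = c | x)
label = 1 if p >= 0.5 else 0  # assumed decision threshold of 0.5
print(z, p, label)
```

A score of $z = 0$ gives a probability of exactly 0.5; larger positive scores push the probability towards 1 and negative scores push it towards 0.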

After all this boring theory let's write some code.

## Phishing detector using logistic regression
We are going to use this [dataset](https://archive.ics.uci.edu/ml/datasets/Phishing+Websites).
It has been converted from .arff to .csv using **one-hot encoding** and contains 30 features that characterize phishing websites.



In [6]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load the dataset (the CSV file has no header row)
phishing_df = pd.read_csv("./resources/datasets/phishing/phishing_dataset.csv", header=None)


In [9]:
# Define our features and target values: the last column is the label
X = phishing_df.iloc[:, :-1]
y = phishing_df.iloc[:, -1]

# Use train_test_split to split the data into training and testing sets (50/50)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/2, random_state=0)

# Initialize the classifier
Lgr = LogisticRegression()

# Fit the model on the training set
Lgr.fit(X_train, y_train)

# Predict the labels of the testing set
y_pred = Lgr.predict(X_test)

print("Logistic Regression accuracy :", accuracy_score(y_test, y_pred)*100, "%")

Logistic Regression accuracy : 92.61939218523878 %


We've got a pretty good score: the logistic regression classifier correctly labels about 92.6% of the websites in the testing set.
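
Accuracy counts correct predictions over both classes, so it is not the same thing as the share of phishing sites that get caught (the recall on the phishing class). The sketch below makes the difference concrete with ten hand-made labels; the labels and the encoding (1 = phishing, 0 = legitimate) are hypothetical and are not taken from the dataset above.

```python
# Hypothetical ground truth and predictions: 1 = phishing, 0 = legitimate.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_hat = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]

pairs = list(zip(y_true, y_hat))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # phishing caught
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # phishing missed
correct = sum(1 for t, p in pairs if t == p)        # right on either class

accuracy = correct / len(pairs)
recall = tp / (tp + fn)
print("accuracy:", accuracy, "recall:", recall)
```

In this toy case the classifier is right on 8 samples out of 10, yet it catches only 2 of the 4 phishing sites. For real predictions, `sklearn.metrics` provides `recall_score` and `confusion_matrix` to compute the same quantities.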