# Phishing detection

## Dataset

### Loading the data


In [3]:
import pandas as pd

# Loads the Kaggle dataset
data = pd.read_csv("dataset_phishing.csv")

# Prints the first few rows
print(data.head())

# Displays dataset information (info() prints directly and returns None)
data.info()


                                                 url  length_url  \
0              http://www.crestonwood.com/router.php          37   
1  http://shadetreetechnology.com/V4/validation/a...          77   
2  https://support-appleld.com.secureupdate.duila...         126   
3                                 http://rgipt.ac.in          18   
4  http://www.iracing.com/tracks/gateway-motorspo...          55   

   length_hostname  ip  nb_dots  nb_hyphens  nb_at  nb_qm  nb_and  nb_or  ...  \
0               19   0        3           0      0      0       0      0  ...   
1               23   1        1           0      0      0       0      0  ...   
2               50   1        4           1      0      1       2      0  ...   
3               11   0        2           0      0      0       0      0  ...   
4               15   0        2           2      0      0       0      0  ...   

   domain_in_title  domain_with_copyright  whois_registered_domain  \
0                0                


### Editing the data
The `status` column (#88) indicates whether a URL is phishing or legitimate. To train the model, its values are mapped to integers:
- 0 = legitimate
- 1 = phishing

In [6]:
data['status'] = data['status'].replace({'legitimate': 0, 'phishing': 1})

print(data.head())

# Separates the independent variables (x) from the dependent variable (y)
x = data.drop(columns=['status'])
y = data['status']


                                                 url  length_url  \
0              http://www.crestonwood.com/router.php          37   
1  http://shadetreetechnology.com/V4/validation/a...          77   
2  https://support-appleld.com.secureupdate.duila...         126   
3                                 http://rgipt.ac.in          18   
4  http://www.iracing.com/tracks/gateway-motorspo...          55   

   length_hostname  ip  nb_dots  nb_hyphens  nb_at  nb_qm  nb_and  nb_or  ...  \
0               19   0        3           0      0      0       0      0  ...   
1               23   1        1           0      0      0       0      0  ...   
2               50   1        4           1      0      1       2      0  ...   
3               11   0        2           0      0      0       0      0  ...   
4               15   0        2           2      0      0       0      0  ...   

   domain_in_title  domain_with_copyright  whois_registered_domain  \
0                0                
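
One caveat with the encoding step: `replace` leaves any value it does not find in the dictionary unchanged, so a misspelled or unexpected label would pass through silently. A minimal sketch of a stricter variant, assuming a toy `status` Series (the four values below are made up for illustration and are not rows from the dataset), uses `map` and fails loudly on unknown labels:

```python
import pandas as pd

# Toy labels for illustration only, not rows from the real dataset
status = pd.Series(["legitimate", "phishing", "phishing", "legitimate"])

label_map = {"legitimate": 0, "phishing": 1}

# map() turns every label that is missing from label_map into NaN
encoded = status.map(label_map)

# Collect the original labels that could not be mapped
unknown = status[encoded.isna()].unique()
if len(unknown) > 0:
    raise ValueError(f"Unexpected status labels: {list(unknown)}")

# Safe to cast once we know there are no NaN values
encoded = encoded.astype("int64")
print(encoded.tolist())
```

Whether the extra check is worth it depends on how much the source file is trusted; for a dataset with exactly two known labels, the `replace` call in the cell above is sufficient.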

### Splitting the data
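The data is split into three parts. First, 20% is held out as the test set and the remaining 80% is kept for training and validation. That 80% is then divided into 70% train and 30% validation, which works out to roughly 56% train, 24% validation, and 20% test of the full dataset.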

The total dataset has 11,430 rows.
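
The expected size of each part can be worked out by hand before running the split. The sketch below is plain arithmetic, not a scikit-learn call; `split_sizes` is a hypothetical helper, and it assumes the held-out share is rounded up at each step, which is consistent with the row counts the notebook prints.

```python
import math


def split_sizes(n_rows, test_fraction, val_fraction):
    """Return (train, validation, test) row counts for a two-step split.

    Assumption: the held-out share is rounded up at each step.
    """
    n_test = math.ceil(n_rows * test_fraction)
    n_train_val = n_rows - n_test
    n_val = math.ceil(n_train_val * val_fraction)
    n_train = n_train_val - n_val
    return n_train, n_val, n_test


n_train, n_val, n_test = split_sizes(11430, test_fraction=0.2, val_fraction=0.3)
print(n_train, n_val, n_test)  # 6400 2744 2286

# Share of the full dataset that ends up in each part
for name, n in [("train", n_train), ("validation", n_val), ("test", n_test)]:
    print(f"{name}: {n / 11430:.1%}")
```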

In [14]:
from sklearn.model_selection import train_test_split

# Splits the data into train+validation (80%) and test (20%)
xTrainVal, xTest, yTrainVal, yTest = train_test_split(x, y, test_size=0.2, random_state=453890, stratify=y)

# Splits the train+validation data into train (70%) and validation (30%)
xTrain, xVal, yTrain, yVal = train_test_split(
    xTrainVal, yTrainVal, test_size=0.3, random_state=42, stratify=yTrainVal
)

print(f"Train set: {len(xTrain)} rows")
print(f"Validation set: {len(xVal)} rows")
print(f"Test set: {len(xTest)} rows\n")

print("Values distribution:")
print(f"Train:\n{yTrain.value_counts(normalize=True)}\n")
print(f"Validation:\n{yVal.value_counts(normalize=True)}\n")
print(f"Test:\n{yTest.value_counts(normalize=True)}\n")

Train set: 6400 rows
Validation set: 2744 rows
Test set: 2286 rows

Values distribution:
Train:
0    0.5
1    0.5
Name: status, dtype: float64

Validation:
0    0.5
1    0.5
Name: status, dtype: float64

Test:
0    0.5
1    0.5
Name: status, dtype: float64
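
The identical 50/50 class shares in all three parts are the effect of the `stratify` argument. As a rough illustration of the idea, the sketch below splits each class separately by the same fraction. It is a simplified stand-in written with the standard library, not scikit-learn's implementation; `stratified_split` is a hypothetical helper and the labels are toy data.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction, seed=0):
    """Return (train_idx, test_idx), splitting every class by the same fraction."""
    # Group row indices by their class label
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Take the same share of rows from each class for the test set
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx


labels = [0] * 50 + [1] * 50  # toy balanced labels, not the real dataset
train_idx, test_idx = stratified_split(labels, test_fraction=0.2)

# Fraction of phishing (label 1) rows in the test part
test_share = sum(labels[i] for i in test_idx) / len(test_idx)
print(len(train_idx), len(test_idx), test_share)  # 80 20 0.5
```

Because every class contributes the same proportion of rows to each part, the class balance of the full dataset carries over to the train, validation, and test sets.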

