<a href="https://colab.research.google.com/github/nyp-sit/aiup/blob/main/day1-am/Lab02a_Phishing_Prediction_Exercise.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<img src="https://www.nyp.edu.sg/content/dam/nypcorp/sg/en/common/logo-nyp.svg" width="238" height="70"/>

Welcome to the lab! Before we get started, here are a few pointers on Colab notebooks.

1. The notebook is composed of cells; cells can contain code which you can run, or they can hold text and/or images which are there for you to read.

2. You can execute code cells by clicking the ```Run``` icon in the menu, or via the following keyboard shortcuts: ```Shift-Enter``` (run and advance) or ```Ctrl-Enter``` (run and stay in the current cell).

3. To interrupt cell execution, click the ```Stop``` button, or open the ```Runtime``` menu and select ```Interrupt execution```.


# Phishing Prediction using K-Nearest Neighbour (Exercise)
In this lab, we will be working with a Phishing Dataset to train a K-Nearest Neighbour (KNN) model.

Some parts require your input, and some blanks marked with **None** are for you to fill in.

This lab is very similar to the Malware Prediction lab, except that we are using a different dataset.
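
To make the idea behind KNN concrete, here is a minimal sketch in plain Python. It is not the scikit-learn implementation used later in this lab, and the points, labels, and the helper name `knn_predict` are made up for illustration. A new point receives the majority label among its `k` closest training points.

```python
from collections import Counter
import math

def knn_predict(train_points, train_labels, query, k=3):
    """Return the majority label among the k training points closest to query."""
    # Euclidean distance from the query to every training point
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k nearest neighbours and take a majority vote on their labels
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: two features per sample, two made-up class names
toy_points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (1.0, 1.0), (0.9, 1.1), (1.1, 0.9)]
toy_labels = ["legit", "legit", "legit", "phish", "phish", "phish"]

print(knn_predict(toy_points, toy_labels, query=(0.15, 0.15)))
print(knn_predict(toy_points, toy_labels, query=(0.95, 1.0)))
```

The first query sits inside the first cluster and the second query inside the other, so each takes the label of the cluster around it.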

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

### Exercise 1: Read the csv file

In [None]:
!wget https://nyp-aicourse.s3-ap-southeast-1.amazonaws.com/aiup/day1-am/phishing_dataset.csv

**Exercise**

Change the code below to load the data from the csv file.

<details><summary>Click here for answer</summary>
<br/>

```
file_name = './phishing_dataset.csv'
df = pd.read_csv(file_name, index_col='index')
```

<br/>

With the correct file_name, the csv file is loaded as a dataframe using the *pandas.read_csv* function.

</details>
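
The effect of `index_col` is easiest to see on a small example. The sketch below uses a made-up three-row CSV held in memory (the column names `feature_a` and `Result` and the values are illustrative only, not taken from the lab dataset) and loads it with and without `index_col`.

```python
import io
import pandas as pd

# Hypothetical three-row CSV held in memory; the values are made up
csv_text = "index,feature_a,Result\n1,0,1\n2,1,-1\n3,1,1\n"

with_default_index = pd.read_csv(io.StringIO(csv_text))
with_index_col = pd.read_csv(io.StringIO(csv_text), index_col="index")

# Without index_col, 'index' stays an ordinary column and pandas adds its own row labels
print(with_default_index.columns.tolist())
# With index_col='index', that column becomes the row labels instead
print(with_index_col.columns.tolist())
print(with_index_col.index.tolist())
```

Moving the `index` column into the row labels keeps it out of the columns that are later used as features.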

In [None]:
# Ex1a: Load the data
file_name = None
# treat the column 'index' as index column in Pandas dataframe
df = pd.read_csv(None, index_col='index')

### Exercise 2: Preview and process the data

In [None]:
df.head(5)

In [None]:
df.tail(5)

In [None]:
# Print the shape (Get the number of rows and cols)
df.shape

**Exercise**

Fill in the code for Ex2a and Ex2b to get information on the dataset and display the correlation.

<details><summary>Click here for answer</summary>
<br/>

```
# Get the column names
df.columns

# Display the correlation of the dataset
df.corr()
```

<br/>

The above functions help to get information on the dataset.

</details>
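
As a rough illustration of how to read the output of `corr()`, the sketch below builds a tiny made-up dataframe (the column names are hypothetical). By default, `DataFrame.corr()` computes the pairwise Pearson correlation, which ranges from -1 (one column falls as the other rises) to 1 (both rise together).

```python
import pandas as pd

# Hypothetical toy data, not taken from the phishing dataset
toy = pd.DataFrame({
    "feature_a": [1, 2, 3, 4],
    "feature_b": [2, 4, 6, 8],   # rises together with feature_a
    "feature_c": [4, 3, 2, 1],   # falls as feature_a rises
})

corr = toy.corr()
print(corr.round(2))
```

On the lab dataset, looking at the `Result` column of the correlation matrix is one way to see which features move together with the label.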

In [None]:
# Ex2a: Get the column names


In [None]:
# Ex2b: display the correlation of the dataset


In [None]:
# Checking for duplicates and removing them
df.drop_duplicates(inplace = True)

In [None]:
# Show the new shape (number of rows & columns)
df.shape

In [None]:
# Show the number of missing (NAN, NaN, na) data for each column
df.isnull().sum()

In [None]:
# List the distinct values of Result and the number of records for each
df["Result"].value_counts()

**Exercise**

Fill in the code for Ex2c to visualise the data. <br/>
Hint: You may use Seaborn and Matplotlib.pyplot.

<details><summary>Click here for answer</summary>
<br/>

```
# Ex2c: Use a statistical graph to visualise the data above
sns.countplot(x=df["Result"])
plt.show()
```

<br/>
Note:
<br/>
Seaborn is a Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics. <br/>
Matplotlib is a comprehensive library for creating static, animated, and interactive visualizations in Python.

</details>

In [None]:
# Ex2c: Use a statistical graph to visualise the data above


### Exercise 3: Identify the features and label

<details><summary>Click here for answer</summary>
<br/>

```
# Ex3a: Define the features (x): every column except the label
x = df.drop(["Result"], axis=1)  # axis=0 drops rows (by index label), axis=1 drops columns
x

# Ex3b: Define the label (y)
y = df["Result"]
y
```

<br/>

</details>
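
The split into features and label can be checked on a small example. The sketch below uses a made-up dataframe (hypothetical column names and values) and separate variable names so that it does not overwrite `x` and `y` from the exercise.

```python
import pandas as pd

# Hypothetical toy data with two feature columns and one label column
toy = pd.DataFrame({
    "feature_a": [1, 0, 1],
    "feature_b": [0, 0, 1],
    "Result": [1, -1, 1],
})

toy_features = toy.drop(["Result"], axis=1)  # axis=1 drops a column, not a row
toy_label = toy["Result"]

print(toy_features.shape)  # (3, 2): three rows, two feature columns
print(toy_label.shape)     # (3,): one label per row
```

The features form a two-dimensional dataframe and the label a one-dimensional series with the same number of rows, which is the shape scikit-learn estimators expect for `fit`.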

In [None]:
# Ex3a: Define the features (x)


In [None]:
# Ex3b: Define the label (y)


### Exercise 4: Choose and train the model

We will be using K-Nearest Neighbour (KNN) for this exercise.

<details><summary>Click here for answer</summary>
<br/>

```
# Ex4a: split the data
x_train,x_test,y_train,y_test=train_test_split(x,y, shuffle=True, test_size=0.2, stratify=y)

# Ex4b: train the model
model=KNeighborsClassifier(n_neighbors=5)
model.fit(x_train,y_train)

# Ex4c: display the model score
model.score(x_test,y_test)
```

<br/>

</details>
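
The `stratify` argument of `train_test_split` keeps the class proportions the same in the training and test sets. The sketch below shows this on made-up, imbalanced labels. The `toy_` names and the data are hypothetical, and `random_state=0` is an addition here, used only to make the shuffle reproducible.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 80 samples of class 1 and 20 of class -1
toy_y = pd.Series([1] * 80 + [-1] * 20)
toy_x = pd.DataFrame({"feature_a": range(100)})

toy_x_train, toy_x_test, toy_y_train, toy_y_test = train_test_split(
    toy_x, toy_y, test_size=0.2, shuffle=True, stratify=toy_y, random_state=0
)

# 20% of 100 rows go to the test set
print(len(toy_x_train), len(toy_x_test))
# The 80/20 class ratio is preserved in both parts
print(toy_y_train.value_counts().to_dict())
print(toy_y_test.value_counts().to_dict())
```

Without `stratify`, a random split of a small or imbalanced dataset can leave the test set with too few samples of the minority class to evaluate it reliably.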

In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

In [None]:
# Ex4a: split the data
# code here


In [None]:
# Ex4b: train the model
# code here


In [None]:
pred=model.predict(x_test)
pred

In [None]:
# Ex4c: display the model score
# code here


In [None]:
result=pd.DataFrame({
    "Actual_Value":y_test,
    "Predict_Value":pred
})
result

### Exercise 5: Evaluate the model and display the reports

<details><summary>Click here for answer</summary>
<br/>

```
# Ex5a: Evaluate the model using the training data
pred = model.predict(x_train)

# Ex5b: Display classification report
print('Classification Report: \n',classification_report(y_train ,pred ))

# Ex5c: Display Confusion Matrix and accuracy
cm = confusion_matrix(y_train, pred)
print('Confusion Matrix: \n',cm)
print()
print('Accuracy: ', accuracy_score(y_train,pred))

# Ex5c (optional): Plot the Confusion Matrix for easy visualisation
from sklearn.metrics import ConfusionMatrixDisplay

ConfusionMatrixDisplay.from_estimator(model, x_train, y_train,
                                          labels=model.classes_,
                                          cmap=plt.cm.Blues,
                                          values_format='')

plt.title("Confusion matrix for training data")
plt.show()

# Ex5d: repeat Ex5a-5c on testing data
pred = model.predict(x_test)
print(classification_report(y_test ,pred ))
print('Confusion Matrix: \n',confusion_matrix(y_test,pred))
print()
print('Accuracy: ', accuracy_score(y_test,pred))
print()


from sklearn.metrics import ConfusionMatrixDisplay

ConfusionMatrixDisplay.from_estimator(model, x_test, y_test,
                                          labels=model.classes_,
                                          cmap=plt.cm.Blues,
                                          values_format='')

plt.title("Confusion matrix for test data")
plt.show()

```

<br/>

</details>
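
To see how the numbers in these reports relate to each other, the sketch below computes them by hand on ten made-up predictions (the labels 1 and -1 and the `toy_` names are hypothetical, not results from the lab model). Passing `labels=[-1, 1]` fixes the order of the confusion matrix so that rows are actual classes and columns are predicted classes.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score

# Hypothetical actual and predicted labels for ten samples
toy_true = [1, 1, 1, 1, -1, -1, -1, -1, -1, -1]
toy_pred = [1, 1, 1, -1, -1, -1, -1, -1, 1, 1]

cm = confusion_matrix(toy_true, toy_pred, labels=[-1, 1])
tn, fp, fn, tp = cm.ravel()  # treating 1 as the positive class

print(cm)
print("accuracy :", (tp + tn) / (tp + tn + fp + fn), accuracy_score(toy_true, toy_pred))
print("precision:", tp / (tp + fp), precision_score(toy_true, toy_pred))
print("recall   :", tp / (tp + fn), recall_score(toy_true, toy_pred))
```

Each hand-computed value matches the corresponding scikit-learn function. When comparing the training and test reports in this exercise, keep in mind that KNN predictions on the training data tend to look better than on unseen data, because each training point counts itself among its nearest neighbours.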


In [None]:
# Ex5a: Evaluate the model using the training data
# code here


In [None]:
from sklearn.metrics import classification_report,confusion_matrix, accuracy_score

In [None]:
# Ex5b: Display classification report
# code here


In [None]:
# Ex5c: Display Confusion Matrix and accuracy
# code here

In [None]:
# Ex5c (optional): Plot the Confusion Matrix for easy visualisation
# code here



In [None]:
# Ex5d: repeat Ex5a-5c on testing data
# code here
