In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report
from ast import literal_eval

# Connect to Google Drive

*   from google.colab import drive: imports the "drive" module from the "google.colab" package. This module provides functions for working with Google Drive, Google's cloud storage service, from a Colab notebook.
*   drive.mount('/content/drive'): calls the module's "mount" function with '/content/drive' as the mount point. This connects Google Drive to the Colab environment, so files stored in Drive can be read and written from the notebook under that path.
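
The mount step only works inside Colab. As a minimal sketch, assuming you may also want to run the notebook elsewhere, the import can be guarded so the cell does not fail outside Colab. The helper name `mount_drive_if_colab` is hypothetical and not part of the original notebook.

```python
import os


def mount_drive_if_colab(mount_point="/content/drive"):
    """Mount Google Drive when running in Colab; return True if it is mounted."""
    try:
        # google.colab is only importable inside a Colab runtime
        from google.colab import drive
    except ImportError:
        return False
    drive.mount(mount_point)
    return os.path.isdir(mount_point)


mounted = mount_drive_if_colab()
print("Drive mounted:", mounted)
```

Outside Colab the function returns False, and the rest of the notebook would then need a local path to the CSV file.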

In [2]:
from google.colab import drive
drive.mount('/content/drive')


Mounted at /content/drive


# Split dataset

The cell below loads the feature vectors and labels from the CSV file and then splits them into two subsets: a training set and a test set. This split is a standard step in machine learning, because a model must be evaluated on data it did not see during training. Here's what each part of the code does:

**X_train and y_train:** These variables will hold the features (X) and labels (y) of the training set, respectively. The training set is used to train the machine learning model.

**X_test and y_test:** These variables will hold the features (X) and labels (y) of the test set, respectively. The test set is used to evaluate the model's performance after it has been trained on the training data. It provides an independent dataset to assess how well the model generalizes to new, unseen data.

**train_test_split**(X, y, test_size=0.2, random_state=42): This function comes from scikit-learn (sklearn.model_selection) and splits the dataset into training and test sets. By default it shuffles the rows before splitting.

*   **X:** This should be your feature matrix, containing the input data.
*   **y:** This should be your target or label vector, containing the corresponding labels for each data point.
*   **test_size=0.2:** This parameter specifies the proportion of the dataset that should be allocated to the test set. In this case, it's set to 20%, meaning that 20% of the data will be used for testing, and the remaining 80% will be used for training.
*   **random_state=42:** This parameter is used to seed the random number generator for reproducibility. Setting it to a specific value (e.g., 42) ensures that you get the same split every time you run the code. It's helpful for debugging and ensuring consistent results.

After executing this code, your dataset (X and y) will be divided into a training set (X_train, y_train) and a test set (X_test, y_test), following the specified split ratio. You can then use these sets to train and evaluate your machine learning models.
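
The parsing and splitting steps described above can be tried on a small stand-in dataset before running them on the real file. The sketch below is illustrative only: the toy DataFrame and the names `toy`, `X_toy`, and `y_toy` are made up, and the `stratify` argument is an optional addition that the notebook itself does not use. It keeps the class proportions the same in both subsets.

```python
from ast import literal_eval

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the CSV: each vector is stored as a string, as in the real file
toy = pd.DataFrame({
    "Feature_Vector": [str([float(i), float(i) / 10]) for i in range(10)],
    "Email Type": ["Safe Email", "Phishing Email"] * 5,
})

# String -> list -> NumPy array, then stack the rows into a 2-D matrix
vectors = toy["Feature_Vector"].apply(literal_eval).apply(np.array)
X_toy = np.stack(vectors.values)
y_toy = toy["Email Type"].values

# 80/20 split; stratify is an assumption added here, not part of the notebook
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)
print(X_tr.shape, X_te.shape)  # (8, 2) (2, 2)
```

With ten rows and test_size=0.2, eight rows go to training and two to testing.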

In [3]:
# Load the CSV file into a DataFrame
df = pd.read_csv('/content/drive/My Drive/30016/new_file_with_vectors.csv')

# Parse each 'Feature_Vector' string into a list, then convert the list to a NumPy array
df['Feature_Vector'] = df['Feature_Vector'].apply(literal_eval).apply(np.array)

# Stack the per-row arrays into a 2-D feature matrix of shape (n_samples, n_features)
X = np.stack(df['Feature_Vector'].values)

# Use the 'Email Type' column as labels
y = df['Email Type'].values

# Split the dataset into a training set (80%) and a test set (20%)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Print both sets to inspect them
print("Training Set (X_train, y_train):")
print(X_train)
print(y_train)

print("\nTest Set (X_test, y_test):")
print(X_test)
print(y_test)




Training Set (X_train, y_train):
[[-0.06365381  0.33708285  0.25972635 ... -0.18531946  0.28610983
  -0.32812821]
 [-0.59569101  0.43483159  0.2401536  ... -0.55423564  0.44989784
  -0.46823569]
 [ 0.01979951  0.56385286  0.18379596 ... -0.28178019  0.32619213
  -0.12823096]
 ...
 [-0.18408333  0.37067542  0.16043905 ...  0.22429176 -0.03252939
  -0.25497596]
 [-0.15068387  0.2502145   0.11821755 ... -0.44579493  0.12311342
  -0.1493098 ]
 [ 0.28490031  0.03255535  0.28461433 ...  0.28652487  0.59897806
  -0.8500713 ]]
['Safe Email' 'Safe Email' 'Safe Email' ... 'Safe Email' 'Phishing Email'
 'Phishing Email']

Test Set (X_test, y_test):
[[ 0.04000142  0.14879549  0.20035583 ... -0.11181284  0.27778296
  -0.1077499 ]
 [-0.36007181  0.3893119   0.40374841 ... -0.26885068  0.22512174
  -0.20499668]
 [-0.10345544  0.41563723  0.14372655 ... -0.33273552  0.21169822
  -0.38920602]
 ...
 [ 0.25351475  0.63477441  0.02968802 ... -0.13958492  0.19093559
  -0.38318701]
 [-0.12648102  0.24100764

# Train the model

**clf = DecisionTreeClassifier()**: In this line, a decision tree classifier object (clf) is created. A decision tree classifier is a machine learning algorithm used for classification tasks. It builds a decision tree model based on the features and labels provided during training.

**clf.fit(X_train, y_train):** Here, the decision tree classifier (clf) is trained or "fit" using the training data (X_train and y_train). X_train contains the feature vectors (input data) from the training dataset, and y_train contains the corresponding labels (the correct classifications).

The classifier uses this training data to learn the patterns and relationships between features and labels, essentially constructing a decision tree that can make predictions based on the learned rules.
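
To see what "learned rules" means in practice, a tree can be fitted on a tiny dataset and printed as text. This is a sketch on made-up data, not the notebook's model: the feature name "score", the values, and random_state=0 are assumptions chosen for illustration. The notebook uses the default settings, under which tie-breaking between equally good splits can vary from run to run.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical toy data: a single feature separates the two classes
X_demo = np.array([[0.1], [0.2], [0.3], [0.7], [0.8], [0.9]])
y_demo = np.array(["Safe Email"] * 3 + ["Phishing Email"] * 3)

# random_state fixes the tie-breaking so repeated runs give the same tree
demo_clf = DecisionTreeClassifier(random_state=0)
demo_clf.fit(X_demo, y_demo)

# Print the learned rules and the size of the tree
print(export_text(demo_clf, feature_names=["score"]))
print("depth:", demo_clf.get_depth(), "leaves:", demo_clf.get_n_leaves())
```

On this data a single threshold is enough, so the tree has depth 1 and two leaves. On the real feature vectors the tree will be much deeper.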

In [4]:
# Create a decision tree classifier and train it.
clf = DecisionTreeClassifier()
clf.fit(X_train, y_train)


# Predict the result

**y_pred = clf.predict(X_test)**: After the classifier is trained, it is used to make predictions on a separate dataset, the test set (X_test). The predict method is called on the trained classifier (clf), passing in the test data (X_test). This line of code generates predicted labels (y_pred) for the test data based on the learned decision tree model.

The predicted labels (y_pred) represent the classifier's best guess at the class (category) of each item in the test set, based on what it learned during training.
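
Besides predict, a fitted scikit-learn tree also offers predict_proba, which returns one column of class probabilities per class in the order given by classes_. The sketch below uses made-up toy data and hypothetical names (`X_demo`, `demo_clf`, `X_new`), not the notebook's model.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: a single feature separates the two classes
X_demo = np.array([[0.1], [0.2], [0.3], [0.7], [0.8], [0.9]])
y_demo = np.array(["Safe Email"] * 3 + ["Phishing Email"] * 3)
demo_clf = DecisionTreeClassifier(random_state=0).fit(X_demo, y_demo)

# Two unseen points, one on each side of the learned threshold
X_new = np.array([[0.15], [0.85]])
labels = demo_clf.predict(X_new)
proba = demo_clf.predict_proba(X_new)

print(demo_clf.classes_)  # column order used by predict_proba
print(labels)
print(proba)
```

Class labels are stored in sorted order, so here the first probability column belongs to "Phishing Email" and the second to "Safe Email".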

In [None]:
# Make predictions on the test set.
y_pred = clf.predict(X_test)
print(y_pred)

['Phishing Email' 'Safe Email' 'Phishing Email' ... 'Phishing Email'
 'Safe Email' 'Safe Email']


# Evaluation

classification_report is a function in scikit-learn's sklearn.metrics module that builds a text summary of the main per-class metrics for a classification model. It takes two required parameters, in this order:

**y_true:** An array or list containing the true class labels, typically the actual labels from the test set.

**y_pred:** An array or list containing the predicted class labels, typically the model's predictions on the test set.
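
When the numbers are needed in code rather than as printed text, classification_report can return a nested dictionary through its output_dict parameter. The sketch below uses five made-up labels (`y_true_demo`, `y_pred_demo`), not the notebook's test set.

```python
from sklearn.metrics import classification_report

# Hypothetical labels for five emails
y_true_demo = ["Phishing Email", "Safe Email", "Safe Email", "Phishing Email", "Safe Email"]
y_pred_demo = ["Phishing Email", "Safe Email", "Phishing Email", "Phishing Email", "Safe Email"]

# True labels come first, predictions second
report_dict = classification_report(y_true_demo, y_pred_demo, output_dict=True)

# Per-class entries are keyed by the class label
phishing = report_dict["Phishing Email"]
print(phishing["precision"], phishing["recall"], report_dict["accuracy"])
```

Here two of the three emails predicted as phishing really are phishing, so the precision for that class is 2/3, while both actual phishing emails were caught, so its recall is 1.0.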

**Explanation of the report below**

**Precision**: For a given class, precision is the proportion of samples predicted as that class that actually belong to it. For the Phishing Email and Safe Email categories, the precision values are 0.89 and 0.95, respectively. This means that when the model predicts Phishing Email it is correct 89% of the time, and when it predicts Safe Email it is correct 95% of the time.

**Recall**: For a given class, recall is the proportion of samples that actually belong to that class and were predicted as that class by the model. For the Phishing Email and Safe Email categories, the recall values are 0.92 and 0.93, respectively. This indicates that the model correctly identifies 92% of the actual Phishing Emails and 93% of the actual Safe Emails.

**F1-Score**: The F1-Score is the harmonic mean of precision and recall and summarizes the two in a single number. For the Phishing Email and Safe Email categories, the F1-Scores are 0.91 and 0.94, respectively. The F1-Score is high only when both precision and recall are high.

**Support**: Support represents the number of samples for each category in the test dataset. In this report, there are 1428 samples for Phishing Email and 2298 samples for Safe Email.

**Accuracy**: Accuracy indicates the proportion of correctly classified samples in the entire test dataset. The accuracy is 0.93, meaning that 93% of the test samples are classified correctly.
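
Each row of the report treats one class as the "positive" class in turn. A minimal sketch of that idea, using scikit-learn's precision_score, recall_score, and f1_score with the pos_label parameter on five made-up labels (the names `y_true_demo`, `y_pred_demo`, and `scores` are hypothetical):

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Hypothetical labels for five emails
y_true_demo = ["Phishing Email", "Safe Email", "Safe Email", "Phishing Email", "Safe Email"]
y_pred_demo = ["Phishing Email", "Safe Email", "Phishing Email", "Phishing Email", "Safe Email"]

scores = {}
for label in ("Phishing Email", "Safe Email"):
    # pos_label selects which class counts as "positive" for this row
    scores[label] = {
        "precision": precision_score(y_true_demo, y_pred_demo, pos_label=label),
        "recall": recall_score(y_true_demo, y_pred_demo, pos_label=label),
        "f1": f1_score(y_true_demo, y_pred_demo, pos_label=label),
    }
    print(label, scores[label])
```

With string labels, pos_label has to be given explicitly, because the default positive label is 1.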








In [None]:
# Print the classification report for the test set:
# y_test holds the true labels, y_pred the model's predictions
# (classification_report was imported in the first cell)
print(classification_report(y_test, y_pred))

                precision    recall  f1-score   support

Phishing Email       0.89      0.92      0.91      1428
    Safe Email       0.95      0.93      0.94      2298

      accuracy                           0.93      3726
     macro avg       0.92      0.93      0.92      3726
  weighted avg       0.93      0.93      0.93      3726



**Example 1: classification_report**


In [5]:
from sklearn.metrics import classification_report

# Assuming you already have the model's predictions y_pred and the true labels y_true
y_true = [0, 1, 1, 0, 1, 0, 1, 1, 1, 0]  # True labels
y_pred = [0, 1, 0, 0, 1, 1, 1, 1, 0, 0]  # Model predictions

# Use the classification_report function to generate the report
report = classification_report(y_true, y_pred)

# Print the report
print(report)



              precision    recall  f1-score   support

           0       0.60      0.75      0.67         4
           1       0.80      0.67      0.73         6

    accuracy                           0.70        10
   macro avg       0.70      0.71      0.70        10
weighted avg       0.72      0.70      0.70        10



Let's break down the various sections of this report:

**precision**: Precision refers to the proportion of samples predicted as a specific class that truly belong to that class. For class 0, precision is 0.60, and for class 1, precision is 0.80. This means that 60% of the samples predicted as class 0 really are class 0, and 80% of the samples predicted as class 1 really are class 1.

**recall**: Recall is the proportion of actual samples of a specific class that were correctly predicted as that class. For class 0, recall is 0.75, and for class 1, recall is 0.67. This means that the model correctly captured 75% of class 0 samples and 67% of class 1 samples.

**f1-score**: The F1 score is the harmonic mean of precision and recall and is used to balance the two. It is high only when both precision and recall are high. For class 0, the F1 score is 0.67, and for class 1, it's 0.73.

**support**: Support indicates the number of samples belonging to each class in the true dataset. Class 0 has 4 samples, and class 1 has 6 samples.

**accuracy**: Accuracy represents the overall proportion of correctly predicted samples. In this case, the overall accuracy is 0.70, indicating that the model correctly predicted 70% of the samples.

**macro avg**: Macro average is the unweighted mean of the per-class metrics (precision, recall, F1-score). Every class counts equally, no matter how many samples it has, so class imbalance is not taken into account. In this case, the macro-averaged F1 score is 0.70.

**weighted avg**: Weighted average is the mean of the per-class metrics weighted by each class's support, so classes with more samples contribute more. In this case, the weighted-averaged F1 score is 0.70.
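
The figures in this report can be reproduced by hand from the confusion matrix. The following sketch recomputes them for the same ten labels. The variable names ending in `_ex` are chosen here to avoid overwriting the notebook's own y_true and y_pred.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true_ex = [0, 1, 1, 0, 1, 0, 1, 1, 1, 0]  # True labels
y_pred_ex = [0, 1, 0, 0, 1, 1, 1, 1, 0, 0]  # Model predictions

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true_ex, y_pred_ex)
correct = np.diag(cm)                  # correctly predicted samples per class

precision = correct / cm.sum(axis=0)   # divide by the number predicted as each class
recall = correct / cm.sum(axis=1)      # divide by the number actually in each class
f1 = 2 * precision * recall / (precision + recall)
support = cm.sum(axis=1)

macro_f1 = f1.mean()                           # every class counts equally
weighted_f1 = np.average(f1, weights=support)  # classes weighted by support

print(cm)
print(np.round(precision, 2), np.round(recall, 2), np.round(f1, 2))
print(round(macro_f1, 2), round(weighted_f1, 2))
```

The unrounded macro and weighted F1 values differ slightly (about 0.697 and 0.703), but both round to the 0.70 shown in the report.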

**Example 2: accuracy_score**


In [7]:
from sklearn.metrics import accuracy_score

# Assuming you already have the model's predictions y_pred and the true labels y_true
y_true = [0, 1, 1, 0, 1, 0, 1, 1, 1, 0]  # True labels
y_pred = [0, 1, 0, 0, 1, 1, 1, 1, 0, 0]  # Model predictions

# Calculate accuracy using the accuracy_score function
accuracy = accuracy_score(y_true, y_pred)

# Print the accuracy
print(f"Accuracy: {accuracy}")




Accuracy: 0.7


In this example, the accuracy_score function takes two arguments: the true labels y_true and the model's predicted labels y_pred. It calculates the accuracy of the model and returns a score that represents the proportion of correctly predicted samples.
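
As a small illustration of what "proportion of correctly predicted samples" means, the sketch below compares accuracy_score with a direct element-wise comparison in NumPy, and uses the normalize parameter to get the count of correct predictions instead of the fraction. The variable names are chosen here for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score

y_true_ex = np.array([0, 1, 1, 0, 1, 0, 1, 1, 1, 0])  # True labels
y_pred_ex = np.array([0, 1, 0, 0, 1, 1, 1, 1, 0, 0])  # Model predictions

fraction = accuracy_score(y_true_ex, y_pred_ex)                  # share of correct predictions
count = accuracy_score(y_true_ex, y_pred_ex, normalize=False)    # number of correct predictions
manual = np.mean(y_true_ex == y_pred_ex)                         # same share, computed by hand

print(fraction, count, manual)
```

Seven of the ten predictions match the true labels, which gives the accuracy of 0.7 printed above.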