# Phishing Classifier using Machine Learning

## 1. About the Project
Phishing is a type of cyber-attack where attackers send fake emails or messages that look genuine
in order to steal sensitive information such as passwords, bank details, or login credentials.

This project builds a **Machine Learning-based Phishing Classifier** that automatically
labels a given text message or email as **Phishing** or **Legitimate**.
The system uses **Natural Language Processing (NLP)** techniques to convert the text into
numerical features that a classifier can learn from.

Such a model can support online security by flagging suspicious messages before the recipient
acts on them.

---

## 2. Objective of the Project
The main objectives of this project are:
- To analyze text messages and emails for phishing patterns
- To convert text data into numerical form using NLP techniques
- To train a Machine Learning model for classification
- To predict whether a message is phishing or legitimate

---

## 3. Technology Used
- Python
- Google Colab
- Pandas and NumPy
- Scikit-learn
- Machine Learning
- Natural Language Processing (NLP)

---

## 4. Dataset Description
The dataset used in this project contains text messages along with their labels.

- **Text**: Email or message content
- **Label**:
  - 1 → Phishing
  - 0 → Legitimate

The dataset includes examples of both phishing and normal messages so that the model can
learn the difference between them.

---

## 5. Steps Involved in the Project

### Step 1: Import Required Libraries
In this step, all the necessary Python libraries such as Pandas, NumPy, and Scikit-learn
are imported. These libraries are used for data handling, feature extraction, model training,
and evaluation.
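
A minimal sketch of the import cell, assuming the libraries listed in the Technology Used section are installed (Google Colab ships with them). The version printout is an optional addition, not part of the original notebook.

```python
# Data handling
import numpy as np
import pandas as pd

# Feature extraction, model, splitting, and evaluation from scikit-learn
import sklearn
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split

# Optional: record library versions so results can be reproduced later.
print("pandas", pd.__version__)
print("numpy", np.__version__)
print("scikit-learn", sklearn.__version__)
```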

---

### Step 2: Load or Create Dataset
The dataset is loaded into the notebook or created manually for training and testing.
The data is stored in a tabular format using a Pandas DataFrame.
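
As an illustration, a manually created dataset might look like the sketch below. The five messages are toy examples for demonstration only; a real project would load a much larger labeled dataset instead.

```python
import pandas as pd

# Toy messages for illustration only; 1 = phishing, 0 = legitimate.
data = {
    "text": [
        "Verify your bank account now",
        "Meeting at 10 AM tomorrow",
        "Update your password immediately",
        "Lunch with friends",
        "Account suspended click link",
    ],
    "label": [1, 0, 1, 0, 1],
}
df = pd.DataFrame(data)

# Inspect the size and the class balance before going further.
print(df.shape)
print(df["label"].value_counts())
```

Checking the class balance early is useful because a heavily skewed dataset makes plain accuracy misleading.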

---

### Step 3: Data Preprocessing
The text data is separated into:
- **Input features (X)** – message text
- **Output labels (y)** – phishing or legitimate

This step prepares the data for model training.
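
The separation can be sketched as follows. Dropping rows with missing text is an extra precaution assumed here, not a step described above, and the three-row DataFrame is a made-up placeholder.

```python
import pandas as pd

# Placeholder data with one missing message to show the cleanup step.
df = pd.DataFrame({
    "text": ["Verify your bank account now", "Lunch with friends", None],
    "label": [1, 0, 0],
})

# Assumed cleanup: remove rows whose text is missing so that the
# vectorizer never receives a null value.
df = df.dropna(subset=["text"]).reset_index(drop=True)

X = df["text"]   # input features: raw message text
y = df["label"]  # output labels: 1 = phishing, 0 = legitimate

print(len(X), len(y))
```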

---

### Step 4: Train-Test Split
The dataset is divided into training and testing sets.
- Training data is used to train the model
- Testing data is used to evaluate the model performance

This helps in checking how well the model performs on unseen data.
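
A hedged sketch of the split, using ten placeholder messages so that both classes can appear in the test set. The `random_state` and `stratify` arguments are assumptions added here for reproducibility and class balance.

```python
from sklearn.model_selection import train_test_split

# Placeholder data: ten messages with alternating labels.
texts = [f"message {i}" for i in range(10)]
labels = [1, 0] * 5

# Hold out 20% for testing. random_state fixes the shuffle, and
# stratify keeps the phishing/legitimate ratio the same in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels
)

print(len(X_train), len(X_test))
print(sorted(y_test))
```

Note that stratification needs at least one test sample per class, so it cannot be used when the test set would contain a single message.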

---

### Step 5: Feature Extraction using TF-IDF
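TF-IDF (Term Frequency-Inverse Document Frequency) converts text data into numerical vectors.
Each word is weighted by how often it appears in a message and how rare it is across all
messages, so common filler words count less than distinctive ones. The vectorizer is fitted on
the training set only and then applied to the test set, which keeps test vocabulary from
leaking into training.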
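
The sketch below shows what the vectorizer produces, using a few made-up messages. It assumes scikit-learn's default `TfidfVectorizer` settings (lowercasing, tokens of two or more characters).

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up messages for illustration.
train_texts = [
    "Verify your bank account now",
    "Meeting at 10 AM tomorrow",
    "Account suspended click link",
]
test_texts = ["Update your bank password"]

vectorizer = TfidfVectorizer()

# Learn the vocabulary and IDF weights from the training texts only.
X_train_vec = vectorizer.fit_transform(train_texts)

# Reuse the learned vocabulary; words unseen in training are ignored.
X_test_vec = vectorizer.transform(test_texts)

print(X_train_vec.shape)               # (messages, vocabulary size)
print(sorted(vectorizer.vocabulary_))  # the learned vocabulary
print(X_test_vec.nnz)                  # non-zero features in the test message
```

In the test message, only "your" and "bank" were seen during fitting, so "update" and "password" contribute nothing to its vector.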

---

### Step 6: Model Training
A **Logistic Regression** algorithm is used to train the phishing classifier.
It is a simple and effective algorithm for binary classification problems.
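
A minimal training sketch on made-up messages, assuming default `LogisticRegression` settings. The fitted model holds one weight per vocabulary term, which is what makes it easy to inspect.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Made-up training data; 1 = phishing, 0 = legitimate.
train_texts = [
    "Verify your bank account now",
    "Meeting at 10 AM tomorrow",
    "Update your password immediately",
    "Lunch with friends",
    "Account suspended click link",
    "See you at the office",
]
train_labels = [1, 0, 1, 0, 1, 0]

vectorizer = TfidfVectorizer()
X_train_vec = vectorizer.fit_transform(train_texts)

# Fit a binary classifier on the TF-IDF vectors.
model = LogisticRegression()
model.fit(X_train_vec, train_labels)

print(model.classes_)     # the two labels the model distinguishes
print(model.coef_.shape)  # one row of weights, one column per term
```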

---

### Step 7: Model Evaluation
The trained model is evaluated using:
- Accuracy score
- Classification report

This step measures how correctly the model predicts phishing and legitimate messages.
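
Both metrics can be computed as sketched below. The label vectors are hypothetical values chosen to show the API, not results from this project's model.

```python
from sklearn.metrics import accuracy_score, classification_report

# Hypothetical true and predicted labels, for illustration only.
y_test = [1, 0, 1, 0, 1]
y_pred = [1, 0, 0, 0, 1]

# Accuracy: fraction of messages labeled correctly.
acc = accuracy_score(y_test, y_pred)
print("Accuracy:", acc)

# Per-class precision, recall, and F1. target_names follow the sorted
# label order: 0 = Legitimate, 1 = Phishing.
report = classification_report(
    y_test, y_pred, target_names=["Legitimate", "Phishing"]
)
print(report)
```

The classification report matters for phishing detection because a missed phishing message (low recall on class 1) is usually costlier than a false alarm.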

---

### Step 8: Prediction on New Input
The trained model is tested on new, unseen messages.
The system predicts whether the input message is **Phishing** or **Legitimate**.
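
One way to sketch this step is to wrap the vectorizer and the classifier in a scikit-learn pipeline, so a raw message can be passed in directly. The `classify` helper and the training messages are hypothetical, introduced only for this example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Made-up training data; 1 = phishing, 0 = legitimate.
train_texts = [
    "Verify your bank account now",
    "Meeting at 10 AM tomorrow",
    "Update your password immediately",
    "Lunch with friends",
    "Account suspended click link",
    "See you at the office",
]
train_labels = [1, 0, 1, 0, 1, 0]

# The pipeline applies TF-IDF and then the classifier in one call.
clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(train_texts, train_labels)


def classify(message):
    """Hypothetical helper: map the predicted label to a readable name."""
    label = clf.predict([message])[0]
    return "Phishing" if label == 1 else "Legitimate"


new_message = "Please verify your account password"
print(new_message, "->", classify(new_message))
```

With so little training data the prediction should be treated as a demonstration of the workflow rather than a trustworthy verdict.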

---

## 6. Result
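The pipeline runs end to end and labels each message as phishing or legitimate.
With the five-message demonstration dataset, however, the reported accuracy is not a reliable
measure: a 20% test split leaves a single test message, so the score can only be 0.0 or 1.0 and
changes from run to run (the recorded run printed 0.0). A larger labeled dataset is needed
before drawing conclusions about accuracy.
The project still demonstrates how Machine Learning and NLP can be applied to cybersecurity.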

---

## 7. Conclusion
This project shows how Machine Learning can be applied to phishing detection.
By using TF-IDF and Logistic Regression, the system is designed to learn patterns commonly
found in phishing texts.

The model can be further improved by using:
- Larger datasets
- Advanced algorithms like Random Forest or XGBoost
- Real-time deployment using web applications
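
As a rough sketch of the second suggestion, the classifier in a pipeline can be swapped for a Random Forest without changing the feature extraction. The data and the parameter values are assumptions for illustration; whether this improves results would have to be measured on a larger dataset.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Made-up training data; 1 = phishing, 0 = legitimate.
train_texts = [
    "Verify your bank account now",
    "Meeting at 10 AM tomorrow",
    "Update your password immediately",
    "Lunch with friends",
    "Account suspended click link",
    "See you at the office",
]
train_labels = [1, 0, 1, 0, 1, 0]

# Same TF-IDF features, different classifier. random_state makes the
# forest reproducible; n_estimators=100 is an arbitrary choice here.
rf_clf = make_pipeline(
    TfidfVectorizer(),
    RandomForestClassifier(n_estimators=100, random_state=42),
)
rf_clf.fit(train_texts, train_labels)

preds = rf_clf.predict(["Confirm your bank password now"])
print(preds)
```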

---

## 8. Future Scope
- Integration with email filtering systems
- Real-time phishing detection
- Deployment using Flask or Streamlit
- Support for URL-based phishing detection


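---

## 9. Implementation

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Small demonstration dataset: 1 = phishing, 0 = legitimate.
data = {
    "text": [
        "Verify your bank account now",
        "Meeting at 10 AM tomorrow",
        "Update your password immediately",
        "Lunch with friends",
        "Account suspended click link",
    ],
    "label": [1, 0, 1, 0, 1],
}

df = pd.DataFrame(data)

# Separate input text from output labels.
X = df["text"]
y = df["label"]

# Hold out 20% for testing. With five rows this is a single message, and
# no random_state is set, so the split and the score vary between runs.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Fit TF-IDF on the training text only, then transform the test text.
vectorizer = TfidfVectorizer()
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

# Train the classifier on the TF-IDF vectors.
model = LogisticRegression()
model.fit(X_train_vec, y_train)

# Evaluate on the held-out message.
y_pred = model.predict(X_test_vec)
print("Accuracy:", accuracy_score(y_test, y_pred))
```

Output from one run:

```
Accuracy: 0.0
```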
