<br>

<br>

<br>

# 👾 **SPAM LINK DETECTION SYSTEM** 👾

**NATURAL LANGUAGE PROCESSING**

<br>

## **INDEX**

- **STEP 1: PROBLEM DEFINITION AND DATA COLLECTION**
- **STEP 2: DATA EXPLORATION AND CLEANING**
- **STEP 3: DATA PROCESSING**
- **STEP 4: MODEL DEVELOPMENT: SUPPORT VECTOR MACHINE (SVM)**
- **STEP 5: MODEL OPTIMIZATION**
- **STEP 6: MODEL DEPLOYMENT AND SAVING**
- **STEP 7: CONCLUSION**

<br>

### **STEP 1: PROBLEM DEFINITION AND DATA COLLECTION**

- 1.1. Problem Definition
- 1.2. Library Importing
- 1.3. Data Collection

**1.1. PROBLEM DEFINITION**

The growing number of web pages created every day has been accompanied by a rise in spam and malicious URLs. These URLs often carry threats such as phishing, malware, and other forms of cyber-attack. The goal of this project is to build a **Spam Link Detection System** that classifies a URL as spam or legitimate based on its structure. By analyzing the patterns within URLs, we aim to automate detection, reducing the need for manual review and improving online security.

<br>

**What is Natural Language Processing (NLP)?**

**Natural Language Processing (NLP)** is a branch of Artificial Intelligence (AI) that focuses on the interaction between computers and human language. **NLP** techniques enable machines to read, understand, and derive meaning from text data.

In this project, URLs are treated as a form of text data, allowing us to leverage NLP techniques like tokenization, stopword removal, and lemmatization to preprocess and extract meaningful patterns from them.

<br>

**Data Processing**

In this project, **data processing** means transforming raw URLs into a format suitable for machine learning models. This includes:
- **Tokenization:** Breaking URLs into smaller components based on punctuation or special characters.
- **Stopword Removal:** Eliminating common yet uninformative tokens such as "www" or "http".
- **Lemmatization/Stemming:** Reducing words to their base or root forms.

These steps help highlight the key elements of URLs that are indicative of spam, ensuring that our model focuses on the most relevant features.
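
The first two steps above can be illustrated with a minimal sketch. It assumes a regex split on any run of non-alphanumeric characters and a small hand-picked stopword set; the names `tokenize_url`, `remove_stopwords`, and `URL_STOPWORDS` are hypothetical and are not the notebook's actual preprocessing code. Lemmatization is left out here because NLTK's `WordNetLemmatizer` needs the `wordnet` corpus to be downloaded first.

```python
import re

# Hypothetical, hand-picked stopword set for illustration only
URL_STOPWORDS = {"http", "https", "www", "com"}


def tokenize_url(url):
    """Lowercase a URL and split it on runs of non-alphanumeric characters."""
    tokens = re.split(r"[^a-z0-9]+", url.lower())
    # re.split leaves empty strings at the edges; drop them
    return [token for token in tokens if token]


def remove_stopwords(tokens, stopwords=URL_STOPWORDS):
    """Keep only the tokens that are not in the stopword set."""
    return [token for token in tokens if token not in stopwords]


tokens = tokenize_url("https://www.hvper.com/")
print(tokens)                    # ['https', 'www', 'hvper', 'com']
print(remove_stopwords(tokens))  # ['hvper']
```

After these two steps, only the tokens that distinguish one URL from another remain, which is what the vectorizer and classifier would later work with.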

<br>

**Methodology: SUPPORT VECTOR MACHINE (SVM)**

The **Support Vector Machine (SVM)** is a supervised learning algorithm widely used for classification problems. SVM works by finding the hyperplane that separates the classes with the largest possible margin. For this project:
- We will use an **initial SVM model** with default parameters to classify URLs as spam or legitimate.  
- **Hyperparameter optimization** will follow, refining the model for improved performance.  
- The final model will be saved and deployed for real-world application, enabling automated spam detection.
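
As a rough sketch of the first bullet, the snippet below fits an `SVC` with default parameters on a handful of made-up, already tokenized strings. The toy data, the integer labels (1 = spam, 0 = legitimate), and the use of `make_pipeline` with `CountVectorizer` are assumptions for illustration, not the notebook's actual training code or data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Made-up examples (not from the project's dataset): URL tokens joined by spaces
toy_texts = [
    "free prize click win",
    "win cash free offer",
    "cheap pills free click",
    "news article science",
    "docs python tutorial",
    "blog science article",
]
toy_labels = [1, 1, 1, 0, 0, 0]  # assumed encoding: 1 = spam, 0 = legitimate

# Bag-of-words counts feed an SVM left at its default settings
model = make_pipeline(CountVectorizer(), SVC())
model.fit(toy_texts, toy_labels)

predictions = model.predict(["free click offer", "python science article"])
print(predictions)
```

Wrapping the vectorizer and the classifier in one pipeline means the same object can later be tuned with `GridSearchCV` and saved with `joblib.dump`, which is one possible way to cover the remaining two bullets.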


<br>

**1.2. LIBRARY IMPORTING**

In [2]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import nltk
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix
from sklearn.svm import SVC 
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
import joblib
import os 
import re  # For working with regular expressions (e.g., to split URLs)
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)

nltk.download('stopwords')
nltk.download('wordnet')
# nltk.download('punkt')  # Not needed: we use "re" for tokenization instead.


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Jen\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Jen\AppData\Roaming\nltk_data...


True

<br>

**1.3. DATA COLLECTION** 

In [3]:

def load_dataset(url, required_columns=None):
    """
    Load a dataset from a given URL and validate its structure.
    
    Parameters:
    - url (str): The URL to the CSV dataset.
    - required_columns (list): List of required column names (optional).
    
    Returns:
    - pd.DataFrame or None: The loaded dataset if successful, otherwise None.
    """
    try:
        # Load the dataset from the URL
        data = pd.read_csv(url)

        # Display an initial summary of the dataset
        print("Dataset loaded successfully.")
        print(f"The dataset has {data.shape[0]} rows and {data.shape[1]} columns.\n")
        print("First 5 rows of the dataset:")
        print(data.head())

        # Check for required columns if specified
        if required_columns:
            missing_columns = [col for col in required_columns if col not in data.columns]
            if missing_columns:
                print(f"Error: The dataset is missing the following required columns: {missing_columns}")
                return None
        
        # Display missing values summary
        print("\nMissing values summary:")
        print(data.isnull().sum())
        
        return data

    except Exception as e:
        # Handle any errors that occur during loading
        print(f"An error occurred while loading the dataset: {e}")
        return None

# Dataset URL
url = "https://raw.githubusercontent.com/4GeeksAcademy/NLP-project-tutorial/main/url_spam.csv"

# Function call with validation for 'url' and 'is_spam' columns
required_columns = ['url', 'is_spam']
dataset = load_dataset(url, required_columns)

# Additional verification
if dataset is not None:
    print("\nDataset columns:")
    print(dataset.columns)


Dataset loaded successfully.
The dataset has 2999 rows and 2 columns.

First 5 rows of the dataset:
                                                 url  is_spam
0  https://briefingday.us8.list-manage.com/unsubs...     True
1                             https://www.hvper.com/     True
2                 https://briefingday.com/m/v4n3i4f3     True
3   https://briefingday.com/n/20200618/m#commentform    False
4                        https://briefingday.com/fan     True

Missing values summary:
url        0
is_spam    0
dtype: int64

Dataset columns:
Index(['url', 'is_spam'], dtype='object')

