### 1. Load Dataset

This step imports `pandas`, the standard Python library for tabular data manipulation. The dataset, `data.csv`, contains the comments and their toxicity labels and lives in the `data` directory one level above the notebook (`../data/data.csv`). `pd.read_csv()` reads the file into a DataFrame named `df`, and `df.head()` displays the first 5 rows for a quick inspection of the columns (such as `comment_text` and the toxicity label columns) and their content.

In [1]:
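import pandas as pd

# Path is relative to the notebook: the data directory sits one level up.
file_path = '../data/data.csv'
df = pd.read_csv(file_path)

# Show the first 5 rows for a quick sanity check.
df.head()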

| Unnamed: 0 | id | comment_text | toxic | severe_toxic | obscene | threat | insult | identity_hate |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0000997932d777bf | Explanation\nWhy the edits made under my usern... | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 000103f0d9cfb60f | D'aww! He matches this background colour I'm s... | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 000113f07ec002fd | Hey man, I'm really not trying to edit war. It... | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0001b41b1c6bb37e | "\nMore\nI can't make any real suggestions on ... | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0001d958c54c6e35 | You, sir, are my hero. Any chance you remember... | 0 | 0 | 0 | 0 | 0 | 0 |
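
If the `Unnamed: 0` column in the output above is a row index that was written into the CSV when it was saved (an assumption; the source does not say where the column comes from), it can be consumed as the DataFrame index at load time instead of being kept as a regular column. The sketch below uses a small in-memory CSV as a hypothetical stand-in for `data.csv`:

```python
import io

import pandas as pd

# Hypothetical two-row stand-in for data.csv; the first, unnamed column is a saved index.
csv_text = (
    ",id,comment_text,toxic\n"
    "0,0000997932d777bf,Some comment,0\n"
    "1,000103f0d9cfb60f,Another comment,0\n"
)

# Default read: the unnamed first column becomes a regular column called "Unnamed: 0".
df_default = pd.read_csv(io.StringIO(csv_text))

# index_col=0 uses that first column as the index instead.
df_indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(df_default.columns))
print(list(df_indexed.columns))
```

Either form works for the rest of the notebook, since only `comment_text` and `toxic` are used downstream.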


### 2. Setup Text Preprocessing with spaCy

This block imports `spacy` for NLP processing and `re` for regular-expression cleaning. It then loads spaCy's small pre-trained English model (`en_core_web_sm`) with the parser and Named Entity Recognition (NER) components disabled (`disable=["parser", "ner"]`), since neither is needed for this task and skipping them speeds up processing. A confirmation message is printed on success. If the model is not installed, the code prints the download command and sets `nlp` to `None`, in which case the preprocessing function falls back to regex cleaning only.

The core part defines the `preprocess_text_spacy` function, which encapsulates our text cleaning and normalization pipeline:
1.  **Input Validation:** Non-string input is converted to a string with `str()`.
2.  **Basic Cleaning:** The text is lowercased, then regular expressions (`re.sub`) remove URLs, user mentions (`@username`), HTML tags, and any character that is not a lowercase letter or whitespace. Finally, runs of whitespace are collapsed to a single space.
3.  **spaCy Processing:** The cleaned text is processed by the loaded `nlp` object, which performs tasks like tokenization and part-of-speech tagging internally.
4.  **Token Filtering & Lemmatization:** The code iterates through the spaCy tokens in the processed `doc`. It keeps only tokens that consist purely of alphabetic characters (`token.is_alpha`) and are *not* common English stopwords (`not token.is_stop`). For each kept token, its base or dictionary form (lemma) is retrieved using `token.lemma_`.
5.  **Output:** The selected lemmas are joined into a single space-separated string. This normalized text, containing only the lemmas of non-stopword tokens, is returned and used as input for the feature extraction stage.

In [2]:
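import re
import warnings

import spacy

try:
    # The parser and NER are not needed here, so they are disabled for speed.
    nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])
    print("spaCy 'en_core_web_sm' model loaded successfully.")
except OSError:
    print("spaCy model 'en_core_web_sm' not found. Please download it:")
    print("python -m spacy download en_core_web_sm")
    nlp = None

def preprocess_text_spacy(text):
    """Clean a comment with regexes, then lemmatize and drop stopwords with spaCy."""
    if not isinstance(text, str):
        text = str(text)
    text = text.lower()
    text = re.sub(r'https?://\S+|www\.\S+', '', text)  # URLs
    text = re.sub(r'@\w+', '', text)                   # user mentions
    text = re.sub(r'<.*?>', '', text)                  # HTML tags
    text = re.sub(r'[^a-z\s]', '', text)               # anything but lowercase letters/whitespace
    text = re.sub(r'\s+', ' ', text).strip()           # collapse whitespace
    if nlp is None:
        # warnings.warn reports once per call site by default, so applying the
        # function to a whole column does not repeat the message for every row.
        warnings.warn("spaCy model not loaded. Performing basic cleaning only.")
        return text
    doc = nlp(text)
    # Keep the lemmas of alphabetic, non-stopword tokens.
    processed_tokens = [
        token.lemma_ for token in doc if token.is_alpha and not token.is_stop
    ]
    return ' '.join(processed_tokens)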

spaCy 'en_core_web_sm' model loaded successfully.
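
To see what the regex stage does on its own, the sketch below reproduces only the "Basic Cleaning" step on a made-up sample comment. `basic_clean` and the sample string are illustrative names, not part of the notebook, and the spaCy lemmatization and stopword filtering are deliberately left out so the example runs without the language model.

```python
import re

def basic_clean(text):
    """Regex-only portion of the cleaning pipeline (no spaCy step)."""
    text = text.lower()
    text = re.sub(r'https?://\S+|www\.\S+', '', text)  # URLs
    text = re.sub(r'@\w+', '', text)                   # user mentions
    text = re.sub(r'<.*?>', '', text)                  # HTML tags
    text = re.sub(r'[^a-z\s]', '', text)               # digits, punctuation, symbols
    return re.sub(r'\s+', ' ', text).strip()           # collapse whitespace

sample = "Check <b>THIS</b> out @user42: https://example.com/page?x=1 It's 100% great!!"
print(basic_clean(sample))  # check this out its great
```

Because disallowed characters are deleted rather than replaced with a space, contractions merge into a single token (`it's` becomes `its`), which is worth keeping in mind when reading the processed text.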


### 3. Apply Preprocessing to Dataset

This code block applies the `preprocess_text_spacy` function (defined in the previous step) to the entire `comment_text` column of our DataFrame `df`.
* The `.fillna('')` method is used first to replace any missing values (NaN) in the comment column with empty strings. Without it, the function's `str(text)` fallback would turn a NaN into the literal string `"nan"`, which would then be treated as a real word.
* The `.apply()` method then executes `preprocess_text_spacy` for each comment.
* The results of this processing (the cleaned and lemmatized text strings) are stored in a new column named `comment_text_processed`.
* Finally, the DataFrame `df` is filtered to keep only the rows where `comment_text_processed` is not an empty string (`!= ""`). This step removes any comments that might have become empty after preprocessing (e.g., if they only contained stopwords, URLs, mentions, or symbols that were stripped out), ensuring that only entries with relevant textual content proceed to the feature extraction phase.

In [3]:
df['comment_text_processed'] = df['comment_text'].fillna('').apply(preprocess_text_spacy)
initial_rows = df.shape[0]
df = df[df['comment_text_processed'] != ""]
print(f"Rows after processing and removing empty: {df.shape[0]} (removed {initial_rows - df.shape[0]})")

Rows after processing and removing empty: 159434 (removed 137)
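
The fill, apply, and filter sequence can be tried on a toy DataFrame. In this sketch `stand_in_clean` is a hypothetical regex-only replacement for `preprocess_text_spacy` (so that no spaCy model is needed), and the four comments are made up:

```python
import re

import numpy as np
import pandas as pd

def stand_in_clean(text):
    # Hypothetical stand-in for preprocess_text_spacy: keeps lowercase letters only.
    text = re.sub(r'[^a-z\s]', '', str(text).lower())
    return re.sub(r'\s+', ' ', text).strip()

toy = pd.DataFrame({"comment_text": ["Nice edit!", np.nan, "12345 !!!", "Thanks a lot"]})

# Same three steps as the cell above: fill NaN, clean each comment, drop empty results.
toy["comment_text_processed"] = toy["comment_text"].fillna("").apply(stand_in_clean)
kept = toy[toy["comment_text_processed"] != ""]

print(f"kept {len(kept)} of {len(toy)} rows")
print(repr(stand_in_clean(np.nan)))  # what a NaN would become without fillna('')
```

The missing comment and the symbol-only comment both end up empty and are dropped, mirroring the 137 rows removed from the real dataset.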


### 4. Split Data into Training and Testing Sets

Before training the model, we need to split our dataset into two parts: one for training the model and another for evaluating its performance on unseen data.
* First, we import the `train_test_split` function from `scikit-learn`.
* We define our features `X` as the column containing the preprocessed comment text (`comment_text_processed`) and our target variable `y` as the `toxic` column, which holds the labels (0 for non-toxic, 1 for toxic).
* The `train_test_split` function is then used to divide `X` and `y` into training sets (`X_train`, `y_train`) and testing sets (`X_test`, `y_test`).
    * `test_size=0.2` allocates 20% of the data to the test set and the remaining 80% to the training set.
    * `random_state=42` ensures that the split is identical each time the code runs, allowing for reproducible results.
    * `stratify=y` keeps the proportion of toxic and non-toxic comments in the training and testing sets the same as in the original dataset. This matters most when the classes are imbalanced, because a purely random split could leave the test set with too few or too many toxic examples and skew the evaluation.
* The final print statement confirms the number of samples allocated to the training (`X_train`) and testing (`X_test`) feature sets.

In [4]:
from sklearn.model_selection import train_test_split

X = df['comment_text_processed']
y = df['toxic']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

print(f"X_train size: {len(X_train)}, X_test size: {len(X_test)}")

X_train size: 127547, X_test size: 31887
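
The effect of `stratify=y` is easiest to see on synthetic labels. The sketch below assumes a made-up 10% positive rate (not the real class ratio of the dataset) and compares a stratified split with a plain random one:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels: 10% positive (illustrative only).
y_toy = np.array([1] * 100 + [0] * 900)
X_toy = np.arange(len(y_toy))  # placeholder features

_, _, y_train_s, y_test_s = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)
_, _, y_train_r, y_test_r = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)

print(f"stratified: train={y_train_s.mean():.3f}, test={y_test_s.mean():.3f}")
print(f"random:     train={y_train_r.mean():.3f}, test={y_test_r.mean():.3f}")
```

With stratification both subsets keep the 10% positive rate, while the unstratified split is free to drift away from it.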


### 5. Hyperparameter Tuning using RandomizedSearchCV

This step aims to find the optimal hyperparameters for both the TF-IDF vectorizer and the LinearSVC classifier to potentially improve model performance.
* We import necessary classes including `Pipeline`, `RandomizedSearchCV`, and distributions from `scipy.stats`.
* A `Pipeline` is constructed, chaining the `TfidfVectorizer` and `LinearSVC` estimators. Tuning both steps inside one pipeline means the vectorizer is refit on the training portion of each cross-validation fold, so vocabulary and IDF statistics do not leak from the validation fold.
* A dictionary `param_distributions` defines the search space for the TF-IDF step (`max_features`, `min_df`, `max_df`) and the LinearSVC classifier (`C`), using the `step_name__parameter_name` syntax. Note that `scipy.stats.uniform(loc, scale)` samples from `[loc, loc + scale]`, so `uniform(0.85, 0.15)` covers 0.85 to 1.0 and `uniform(0.1, 10)` covers 0.1 to 10.1, while `randint(low, high)` excludes `high`.
* `RandomizedSearchCV` is initialized with the pipeline and the parameter distributions, and configured to try `n_iter=50` random combinations, using 3-fold cross-validation (`cv=3`) and optimizing for the `roc_auc` score. `n_jobs=-1` uses all available CPU cores, and `verbose=1` prints a brief progress summary.
* The search is executed by calling `.fit()` on the training data (`X_train`, `y_train`). `X_train` holds the preprocessed but not yet vectorized text; the pipeline handles TF-IDF vectorization internally during the search.
* Upon completion, the total time taken, the best set of parameters discovered (`.best_params_`), and the highest mean ROC AUC score achieved during cross-validation (`.best_score_`) are printed.
* Finally, the `best_estimator_` attribute, which contains the entire pipeline refit on the full training set with the best hyperparameters found, is stored in the `best_pipeline` variable.

In [5]:
import time
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint, uniform

print("\n--- Hyperparameter Tuning ---")
start_time = time.time()

pipeline = Pipeline([
        ('tfidf', TfidfVectorizer(ngram_range=(1, 2))),
        ('clf', LinearSVC(class_weight='balanced', random_state=42, max_iter=5000, dual=False))
    ])

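param_distributions = {
    'tfidf__max_features': randint(5000, 25000),  # integers in [5000, 25000)
    'tfidf__min_df': randint(1, 5),               # integers 1 to 4
    'tfidf__max_df': uniform(0.85, 0.15),         # uniform(loc, scale): 0.85 to 1.0
    'clf__C': uniform(0.1, 10)                    # uniform(loc, scale): 0.1 to 10.1
}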

random_search = RandomizedSearchCV(
    estimator = pipeline,
    param_distributions = param_distributions,
    n_iter = 50,
    cv=3,
    scoring='roc_auc',
    n_jobs=-1,
    random_state = 42,
    verbose = 1
)

print("Starting RandomizedSearchCV... This may take a while.")
random_search.fit(X_train, y_train)

end_time = time.time()
print(f"Hyperparameter search finished in {end_time - start_time:.2f} seconds.")

print("\nBest parameters found by RandomizedSearchCV:")
print(random_search.best_params_)
print(f"\nBest cross-validation ROC AUC: {random_search.best_score_:.4f}")

best_pipeline = random_search.best_estimator_


--- Hyperparameter Tuning ---
Starting RandomizedSearchCV... This may take a while.
Fitting 3 folds for each of 50 candidates, totalling 150 fits
Hyperparameter search finished in 254.27 seconds.

Best parameters found by RandomizedSearchCV:
{'clf__C': np.float64(0.1052037699531582), 'tfidf__max_df': np.float64(0.9028853284501254), 'tfidf__max_features': 14474, 'tfidf__min_df': 3}

Best cross-validation ROC AUC: 0.9651
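
The ranges that the search actually samples from are easy to misread, because `scipy.stats.uniform` takes `(loc, scale)` rather than `(low, high)`. The following sketch draws samples from the same three kinds of distribution used above and prints the observed ranges; the sample size and seed are arbitrary choices for illustration.

```python
from scipy.stats import randint, uniform

# scipy.stats.uniform(loc, scale) samples from [loc, loc + scale], not [loc, scale].
max_df_dist = uniform(0.85, 0.15)   # 0.85 to 1.0
c_dist = uniform(0.1, 10)           # 0.1 to 10.1
# scipy.stats.randint(low, high) samples integers in [low, high): high is excluded.
min_df_dist = randint(1, 5)         # 1, 2, 3 or 4

max_df_samples = max_df_dist.rvs(size=1000, random_state=0)
c_samples = c_dist.rvs(size=1000, random_state=0)
min_df_samples = min_df_dist.rvs(size=1000, random_state=0)

print(f"max_df: {max_df_samples.min():.3f} to {max_df_samples.max():.3f}")
print(f"C:      {c_samples.min():.3f} to {c_samples.max():.3f}")
print(f"min_df: {sorted(set(min_df_samples.tolist()))}")
```

The best `C` reported above (about 0.105) sits very close to the lower bound of 0.1. That may suggest extending the range downward or sampling `C` on a log scale, for example with `scipy.stats.loguniform`; this is a suggestion for a follow-up experiment, not something the notebook does.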


### 6. Evaluate Best Model on Test Set

After identifying the best hyperparameter combination using cross-validation on the training data, this step assesses the final performance of that optimized model (`best_pipeline`) on the held-out test set (`X_test`, `y_test`), which was not used during training or tuning.
* The `roc_auc_score` metric is imported from `sklearn.metrics`.
* The `.decision_function()` method of `best_pipeline` is called on the test text (`X_test`), which was already cleaned and lemmatized in step 3. The pipeline applies the TF-IDF transformation with the tuned parameters and then the tuned LinearSVC classifier to produce a decision score for each test comment. The spaCy preprocessing is not part of the pipeline.
* The `roc_auc_score` function calculates the final ROC AUC score by comparing the true labels (`y_test`) with the predicted decision scores (`y_decision_scores_best`). ROC AUC depends only on how the scores rank the examples, so the uncalibrated SVM margins can be used directly.
* The resulting score estimates the tuned model's generalization performance on previously unseen data, since the test set played no part in training or tuning.

In [6]:
from sklearn.metrics import roc_auc_score

print("\n--- Evaluating Best Model on Test Set ---")
y_decision_scores_best = best_pipeline.decision_function(X_test)
roc_auc_best = roc_auc_score(y_test, y_decision_scores_best)
print(f"Best model ROC AUC on TEST set: {roc_auc_best:.4f}")


--- Evaluating Best Model on Test Set ---
Best model ROC AUC on TEST set: 0.9661
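
ROC AUC can be read as the probability that a randomly chosen toxic comment receives a higher score than a randomly chosen non-toxic one, which is why raw `decision_function` margins are enough. The sketch below implements that pairwise definition directly on made-up labels and scores. `pairwise_auc` is an illustrative helper, not how scikit-learn computes the metric internally.

```python
import math

def pairwise_auc(y_true, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half. O(n_pos * n_neg), so only suitable for small examples.
    """
    pos = [s for s, y in zip(scores, y_true) if y == 1]
    neg = [s for s, y in zip(scores, y_true) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y_true = [0, 0, 1, 1]
decision_scores = [-1.2, 0.3, 0.1, 2.0]  # made-up margins, like decision_function output

# Squashing the margins through a sigmoid preserves their order, so AUC is unchanged.
probabilities = [1 / (1 + math.exp(-s)) for s in decision_scores]

print(pairwise_auc(y_true, decision_scores))  # 0.75
print(pairwise_auc(y_true, probabilities))    # 0.75
```

Three of the four positive/negative pairs are ordered correctly, giving 0.75 in both cases.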


### 7. Save the Optimized Pipeline

This final code block saves the best-performing pipeline (found through hyperparameter tuning and stored in the `best_pipeline` variable) to disk for later use, such as deploying it in the API.
* The `joblib` library (for efficient object serialization) and the `os` module (for interacting with the file system) are imported.
* An `output_dir` variable specifies the name of the directory where the pipeline will be saved.
* `os.makedirs(output_dir, exist_ok=True)` ensures this directory exists, creating it if necessary without raising an error if it's already there.
* `os.path.join` constructs the full, operating-system-independent path to the output file (`best_toxicity_pipeline.joblib`) within the specified directory.
* `joblib.dump()` serializes the `best_pipeline` object (containing both the optimized TF-IDF vectorizer and the tuned LinearSVC classifier) and writes it to the designated `.joblib` file.
* A confirmation message is printed with the path of the saved file. The file contains the fitted TF-IDF vectorizer and classifier but not the spaCy cleaning step, so any text scored with the loaded pipeline must first be passed through `preprocess_text_spacy`.

In [7]:
import joblib
import os

print("\n--- Saving Optimized Pipeline ---")
output_dir = 'saved_pipeline_spacy_svm'
os.makedirs(output_dir, exist_ok=True)

pipeline_path = os.path.join(output_dir, 'best_toxicity_pipeline.joblib')
joblib.dump(best_pipeline, pipeline_path)

print(f"Best pipeline saved to: {pipeline_path}")


--- Saving Optimized Pipeline ---
Best pipeline saved to: saved_pipeline_spacy_svm/best_toxicity_pipeline.joblib
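
Loading the artifact back is the mirror image of saving it. The sketch below is a self-contained round trip on a toy pipeline: the texts, labels, and file name are made up, and a temporary directory stands in for `saved_pipeline_spacy_svm`. It checks that the reloaded pipeline produces the same decision scores as the original.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Toy, already-preprocessed training texts (made up for illustration).
texts = ["awful stupid edit", "thank helpful edit", "stupid idiot", "great helpful article"]
labels = [1, 0, 1, 0]

toy_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LinearSVC(dual=False, random_state=42)),
])
toy_pipeline.fit(texts, labels)

new_texts = ["stupid article", "helpful edit"]
scores_before = toy_pipeline.decision_function(new_texts)

# Save to a temporary directory, then load the pipeline back from disk.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_pipeline.joblib")
    joblib.dump(toy_pipeline, path)
    loaded = joblib.load(path)

scores_after = loaded.decision_function(new_texts)
print(np.allclose(scores_before, scores_after))
```

In the real workflow, new comments would be run through `preprocess_text_spacy` before being passed to the loaded pipeline, and the file should be loaded with a scikit-learn version compatible with the one used to save it.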
