# Model Evaluation and Ranking Analysis

## Objective
Evaluate the performance of the multi-label Stack Overflow tag recommendation model using ranking-based evaluation metrics.  
This notebook measures how effectively the trained Linear SVM model recommends relevant tags for each question.

## Evaluation Metrics
The following metrics are used to assess model performance:

- **Micro F1 Score**  
  Measures overall multi-label classification performance across all tags.

- **Precision@5**  
  Measures the fraction of the top-5 recommended tags that are among the question's actual tags.  
  This metric reflects real-world tag recommendation quality.

- **Hamming Loss**  
  Measures the fraction of individual tag assignments (question-tag pairs) that are predicted incorrectly; lower is better.
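
As a rough illustration of how the three metrics are computed, the sketch below implements them with NumPy on a made-up example of 3 questions and 4 tags. The arrays, the helper names (`micro_f1`, `hamming`, `precision_at_k`), the zero threshold, and the choice of k=2 are assumptions picked for readability, not values from this notebook.

```python
import numpy as np

# Toy data (made up): 3 questions, 4 candidate tags.
Y_true = np.array([[1, 0, 1, 0],
                   [0, 1, 0, 0],
                   [1, 1, 0, 0]])
scores = np.array([[ 2.0, -1.0, -0.3, -2.0],
                   [-0.5,  1.5,  0.2, -1.0],
                   [ 1.0, -0.2, -0.4, -3.0]])

# Assumption: a tag is predicted when its decision score is positive.
Y_hat = (scores > 0).astype(int)

def micro_f1(y_true, y_pred):
    """F1 from true/false positives and false negatives pooled over all tags."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def hamming(y_true, y_pred):
    """Fraction of question-tag entries that differ."""
    return float(np.mean(y_true != y_pred))

def precision_at_k(y_true, scores, k):
    """Fraction of the k highest-scoring tags per question that are true tags."""
    top = np.argsort(scores, axis=1)[:, -k:]
    hits = np.take_along_axis(y_true, top, axis=1).sum()
    return hits / (y_true.shape[0] * k)

print("Micro F1:", micro_f1(Y_true, Y_hat))
print("Hamming loss:", hamming(Y_true, Y_hat))
print("Precision@2:", precision_at_k(Y_true, scores, 2))
```

On this toy data the thresholded predictions give Micro F1 = 2/3 and Hamming Loss = 0.25, while ranking the same scores gives Precision@2 = 5/6.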

## Steps Performed
1. Loaded trained model, encoder, and vectorizer artifacts
2. Generated prediction scores using the Linear SVM decision function
3. Selected Top-5 tags for each sample using ranking-based selection
4. Computed evaluation metrics including Micro F1, Precision@5, and Hamming Loss
5. Analyzed prediction quality to understand ranking performance

## Outcome
This evaluation provides insight into the model’s ability to recommend relevant tags and highlights areas for further ranking and feature improvements.


## Loading Model Artifacts

The trained classifier, TF-IDF vectorizer, and label encoder are loaded to reproduce the prediction pipeline exactly as used during training.  
This ensures evaluation results reflect the actual deployed model behavior.


In [22]:
import pandas as pd
import joblib

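# Dataset with the cleaned question text and filtered tag lists
df = pd.read_parquet("modeling.parquet")

# Artifacts saved at training time
mlb = joblib.load("mlb_encoder.pkl")              # multi-label tag encoder
model = joblib.load("tag_prediction_model.pkl")   # trained Linear SVM classifier
vectorizer = joblib.load("tfidf_vectorizer.pkl")  # fitted TF-IDF vectorizer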

## Preparing Evaluation Data

The cleaned question text is transformed into TF-IDF feature vectors using the saved vectorizer.  
Ground-truth labels are encoded into multi-label format to enable proper multi-label evaluation.


In [23]:
X = vectorizer.transform(df["clean_text"])
Y = mlb.transform(df["filtered_tags"])

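**Train/test split**

The split only yields a genuinely held-out test set if it matches the one used at training time (same row order, `test_size`, and `random_state`); otherwise some test rows may have been seen during training.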

In [24]:
from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=42
)

## Generating Tag Predictions

The trained Linear SVM model produces a decision score for each tag.  
`model.predict` turns these scores into binary tag assignments, which feed Micro F1 and Hamming Loss. Top-K ranking on the raw scores simulates real-world tag recommendation, where the system suggests the most relevant tags rather than making a yes/no decision per tag, and feeds Precision@5.
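
The difference between thresholded prediction and Top-K ranking can be sketched on a small example. The tag names, the scores, and the helper `top_k_tags` are hypothetical, and the sketch assumes that binary prediction corresponds to keeping tags with a positive decision score, as is typical for a one-vs-rest linear SVM.

```python
import numpy as np

def top_k_tags(scores, classes, k=5):
    """Return, for each row of scores, the k highest-scoring tag names, best first."""
    scores = np.asarray(scores)
    k = min(k, scores.shape[1])
    order = np.argsort(-scores, axis=1)[:, :k]  # indices sorted by descending score
    return [[str(classes[j]) for j in row] for row in order]

classes = np.array(["c#", "java", "php", "python"])  # hypothetical tag vocabulary
scores = np.array([[-0.9, -0.2, -1.5, -0.4],   # every score is negative
                   [ 1.2, -0.3, -0.8,  0.6]])

# Thresholding keeps only tags with a positive score (assumed decision rule).
thresholded = [classes[row > 0].tolist() for row in scores]
# Ranking always returns k suggestions, even when no score is positive.
ranked = top_k_tags(scores, classes, k=2)

print("Thresholded:", thresholded)
print("Top-2:", ranked)
```

The first question receives no tags under thresholding but still gets two ranked suggestions, which is the behavior a recommendation interface usually needs.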


In [25]:
Y_pred = model.predict(X_test)

print(Y_pred)

[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]


Calculating the top 5 tags

In [114]:
scores = model.decision_function(X_test)

In [115]:
import numpy as np

top_k = 5
Y_pred_topk = np.zeros_like(scores)

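for i in range(len(scores)):
    # rank positions (higher score → higher rank weight)
    order = np.argsort(scores[i])
    rank_weight = np.zeros(scores.shape[1])
    rank_weight[order] = np.linspace(0, 1, len(order))

    # combine margin + rank signal
    # Note: the rank weight increases with the score, so adding it preserves
    # the score ordering; the selected tags equal the top-k by raw score.
    combined_score = scores[i] + 0.25 * rank_weight

    # mark the top_k highest-scoring tags as predicted
    top_indices = np.argsort(combined_score)[-top_k:]
    Y_pred_topk[i, top_indices] = 1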
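
The rank bonus in the loop above grows with the score it is added to, so it should not change which tags end up in the top 5. A minimal check of that reasoning on synthetic scores is sketched below; the function names, the random scores, and the matrix size are assumptions for illustration only.

```python
import numpy as np

def top_k_rank_weighted(scores, k, alpha=0.25):
    """Top-k indicator matrix using score + alpha * normalized rank, mirroring the loop above."""
    out = np.zeros(scores.shape, dtype=int)
    for i in range(scores.shape[0]):
        order = np.argsort(scores[i])
        rank_weight = np.zeros(scores.shape[1])
        rank_weight[order] = np.linspace(0, 1, len(order))
        combined = scores[i] + alpha * rank_weight
        out[i, np.argsort(combined)[-k:]] = 1
    return out

def top_k_plain(scores, k):
    """Top-k indicator matrix using the raw scores only (vectorized)."""
    out = np.zeros(scores.shape, dtype=int)
    top = np.argsort(scores, axis=1)[:, -k:]
    np.put_along_axis(out, top, 1, axis=1)
    return out

# Synthetic decision scores: 200 questions, 50 tags (made-up sizes).
rng = np.random.default_rng(0)
toy_scores = rng.normal(size=(200, 50))

same = np.array_equal(top_k_rank_weighted(toy_scores, 5), top_k_plain(toy_scores, 5))
print("Identical selections:", same)
```

If the two selections agree on the real `scores` as well, the vectorized version could replace the loop without affecting Precision@5.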


## Evaluation Metrics

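Model performance is evaluated with two label-wise multi-label metrics and one ranking metric:

- **Micro F1 Score:** Harmonic mean of precision and recall, computed from true positives, false positives, and false negatives pooled across all tags
- **Precision@5:** Fraction of the five highest-scoring tags that are correct
- **Hamming Loss:** Fraction of individual tag assignments that are predicted incorrectly

Together they give a more realistic picture of a recommendation system than plain accuracy, which in the multi-label setting would require every tag of a question to be predicted exactly.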


Micro F1 Score

In [105]:
from sklearn.metrics import f1_score

micro_f1 = f1_score(Y_test, Y_pred, average="micro")

print("Micro F1:", micro_f1)


Micro F1: 0.6886113279555902


Precision@5

In [116]:
correct = 0
for i in range(len(scores)):
    pred_tags = set(np.where(Y_pred_topk[i] == 1)[0])
    true_tags = set(np.where(Y_test[i] == 1)[0])
    correct += len(pred_tags & true_tags)

precision_at_5 = correct / (len(scores) * top_k)
print("Precision@5:", precision_at_5)

Precision@5: 0.25417561075334766


Hamming loss

In [101]:
from sklearn.metrics import hamming_loss

print("Hamming loss:", hamming_loss(Y_test, Y_pred))

Hamming loss: 0.014678523970152305


## Results Interpretation

The trained multi-label tag recommendation model achieved a **Micro-F1 score of ~0.69** on the thresholded predictions, indicating that the classifier captures meaningful relationships between question text and associated tags across the selected tag space.

The **Precision@5 score of ~0.25** means that, on average, about 1.27 of the five suggested tags per question are correct. Precision@5 is bounded by the number of true tags: a question with two true tags can score at most 2/5 = 0.4, so the achievable maximum is well below 1 when most questions carry only a few tags. Because the metric is an average, it does not by itself show that every question receives at least one relevant suggestion.
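
To put a Precision@K value in context, it can help to compute the best achievable value for the given labels and the share of questions with at least one correct suggestion. The sketch below does this on made-up data with k=3; the helper names `precision_ceiling_at_k` and `hit_rate_at_k` and all array values are assumptions for illustration.

```python
import numpy as np

def precision_at_k(y_true, scores, k):
    """Fraction of the k highest-scoring tags per question that are true tags."""
    top = np.argsort(scores, axis=1)[:, -k:]
    return np.take_along_axis(y_true, top, axis=1).sum() / (y_true.shape[0] * k)

def precision_ceiling_at_k(y_true, k):
    """Best achievable Precision@k: each question yields at most min(#true tags, k) hits."""
    return np.minimum(y_true.sum(axis=1), k).sum() / (y_true.shape[0] * k)

def hit_rate_at_k(y_true, scores, k):
    """Share of questions with at least one true tag among the top k."""
    top = np.argsort(scores, axis=1)[:, -k:]
    hits = np.take_along_axis(y_true, top, axis=1).sum(axis=1)
    return float(np.mean(hits > 0))

# Toy data (made up): 3 questions, 6 candidate tags.
Y_toy = np.array([[1, 1, 0, 0, 0, 0],
                  [0, 0, 1, 0, 0, 0],
                  [1, 0, 0, 1, 1, 0]])
S_toy = np.array([[ 0.9, -0.5,  0.1, -0.2, -1.0, -0.7],
                  [ 0.3,  0.2, -0.9,  0.1, -0.4, -0.6],
                  [-0.1, -0.8, -0.3,  0.7,  0.4, -0.9]])

print("Precision@3:", precision_at_k(Y_toy, S_toy, 3))
print("Ceiling@3:  ", precision_ceiling_at_k(Y_toy, 3))
print("Hit rate@3: ", hit_rate_at_k(Y_toy, S_toy, 3))
```

Here Precision@3 is 4/9 against a ceiling of 6/9, and two of the three questions receive at least one correct tag. Applied to `Y_test` and `scores` with k=5, the same helpers would show how close 0.25 is to the attainable maximum.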

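The low **Hamming Loss (~0.015)** shows that about 1.5% of individual tag assignments are wrong. Because each question carries only a few of the available tags, most entries of the label matrix are zero and a low Hamming Loss is easy to obtain, so it should be read alongside Micro F1 rather than on its own.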

These results suggest that feature engineering and ranking strategy are the main levers for improving recommendation performance, and they provide a baseline for further improvements through enhanced text representations and ranking calibration.


## Prediction Analysis

Example predictions are examined to understand the qualitative performance of the model.  
This step helps identify common error patterns, tag confusion, and areas where feature representation may limit prediction accuracy.
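
Alongside reading individual examples, error patterns can be summarized per tag by counting false positives and false negatives. The sketch below shows one way to do this on made-up data; the helper `per_tag_errors`, the tag names, and the arrays are hypothetical.

```python
import numpy as np

def per_tag_errors(y_true, y_pred, classes):
    """Return (tag, false_positives, false_negatives) tuples, most error-prone tag first."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    fp = ((y_pred == 1) & (y_true == 0)).sum(axis=0)  # tag suggested but not present
    fn = ((y_pred == 0) & (y_true == 1)).sum(axis=0)  # tag present but not suggested
    order = np.argsort(-(fp + fn), kind="stable")     # sort by total errors, descending
    return [(str(classes[j]), int(fp[j]), int(fn[j])) for j in order]

# Toy data (made up): 3 questions, 4 candidate tags.
classes = np.array(["c#", "java", "php", "python"])
Y_true = np.array([[1, 0, 1, 0],
                   [0, 1, 0, 0],
                   [1, 1, 0, 0]])
Y_hat = np.array([[1, 0, 0, 0],
                  [0, 1, 1, 0],
                  [1, 0, 0, 0]])

errors = per_tag_errors(Y_true, Y_hat, classes)
for tag, n_fp, n_fn in errors:
    print(f"{tag:>8}  FP={n_fp}  FN={n_fn}")
```

Tags with many false negatives are candidates for better features or a lower decision threshold, while tags with many false positives may be getting confused with related tags.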


In [112]:
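sample_idx = 10  # position in the full DataFrame, so the row may belong to the training split

print("Text:\n", df.iloc[sample_idx]["clean_text"][:500])
print("\nActual tags:", df.iloc[sample_idx]["filtered_tags"])

# X[sample_idx] is a single-row feature matrix; decode the binary prediction back to tag names
pred_tags = mlb.inverse_transform(
    model.predict(X[sample_idx])
)

print("Predicted tags: ", pred_tags)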

Text:
 ios 6 apps - how to deal with iphone 5 screen size? 
possible duplicate:
how to develop or migrate apps for iphone 5 screen resolution? 

i was just wondering with how should we deal with the iphone 5 bigger screen size.
as it has more pixels in height, things like gcrectmake that use coordinates (and just doubled the pixels with the retina/non retina problem) won't work seamlessly between versions, as it happened when we got the retina.
and will we have to design two storyboards, just like for 

Actual tags: ['iphone' 'ios']
Predicted tags:  [('ios', 'iphone')]


More Random Prediction Examples

In [118]:
import random

# Generate a few random sample indices to show diverse examples
sample_indices = random.sample(range(len(df)), 5)

for sample_idx in sample_indices:
    print(f"\n--- Sample Index: {sample_idx} ---")
    print("Text:\n", df.iloc[sample_idx]["clean_text"][:500])
    print("\nActual tags:", df.iloc[sample_idx]["filtered_tags"])

    # X[sample_idx] is already a 2D single-row matrix, so it can be passed to predict directly
    pred_tags = mlb.inverse_transform(model.predict(X[sample_idx]))

    print("Predicted tags: ", pred_tags)
    print("-----------------------------------")


--- Sample Index: 41064 ---
Text:
 cakephp: using models in different controllers i have a controller/model for projects. so this controls the projects model, etc, etc. i have a homepage which is being controlled by the pages_controller. i want to show a list of projects on the homepage. is it as easy as doing:
function index() {
    $this->set('projects', $this->project->find('all'));        
}

i'm guessing not as i'm getting:
undefined property: pagescontroller::$project

can someone steer me in the right direction please,
jon

Actual tags: ['php']
Predicted tags:  [('php',)]
-----------------------------------

--- Sample Index: 38159 ---
Text:
 why illegal start of declaration in scala? for the following code:
package fileoperations
import java.net.url

object fileoperations {
    def processwindowspath(p: string): string {
        "file:///" + p.replaceall("\\", "/")
    }
}

compiler gives an error:
> scalac fileoperations.scala
fileoperations.scala:6: error: illegal start of d

## Summary and Key Insights

This project developed a multi-label NLP system to recommend relevant tags for Stack Overflow questions using TF-IDF text representations and a Linear SVM classifier. The evaluation demonstrates that the model is capable of capturing meaningful relationships between technical text patterns and associated programming tags.

### Key Insights
- Multi-label recommendation problems require **ranking-based evaluation metrics** rather than traditional accuracy, as multiple tags may be correct for a single question.
- **Feature engineering had a greater impact on performance than model selection**, highlighting the importance of informative text representations in NLP tasks.
- Tag recommendation performance is influenced by **label competition**, where several related tags may be simultaneously relevant, limiting achievable Precision@K scores.
- Decision-score ranking proved effective for selecting the most relevant predicted tags without requiring probability calibration.

Overall, the results provide a strong baseline tag recommendation system and demonstrate the importance of structured data pipelines, ranking-aware evaluation, and careful feature engineering in real-world NLP applications.
