kariemoorman/label-accuracy

Airline Label Accuracy Prediction

A logistic regression pipeline that predicts whether a candidate airline label is correct, partially correct, or incorrect for a given text snippet.


Requirements

  • Python Packages
joblib
matplotlib
numpy
onnxruntime
pandas
rapidfuzz
scikit-learn
skl2onnx
spacy
  • SpaCy Model
python -m spacy download en_core_web_md

Table of Contents

  1. Overview
  2. Dataset Generation
  3. NER Entity Extraction
  4. Feature Engineering
  5. Model Training & Evaluation
  6. Prediction
  7. File Reference

1. Overview

The problem: given a text snippet (e.g., a customer review, a tweet, a flight status message) and a candidate airline name, determine how accurately the airline label matches the text.

Three-class target:

| Class | Code | Meaning |
|---|---|---|
| correct | 2 | The full airline name accurately identifies the airline in the text |
| partial | 1 | A short form, abbreviation, or IATA code that refers to the right airline but isn't the canonical name |
| incorrect | 0 | The label is wrong -- either the wrong airline, or no airline is mentioned at all |

Pipeline:

    generate_data    ->    extract_entities    ->      train_model      ->    predict
   (labeled_data)         (ner_labeled_data)    (models, reports, images)    
                                             

2. Dataset Generation

Script: scripts/generate_data.py

2.1 Airline Corpus

30 airlines (domestic US + international) are defined with their common surface forms:

  • Full canonical name (e.g., "Scandinavian Airlines")
  • Common short name (e.g., "SAS")
  • IATA code (e.g., "SK")
  • Informal variants (e.g., "scandinavian", "Scandinavian Air")

2.2 Text Generation

Four categories of text templates produce the snippets:

| Template Type | Example | Purpose |
|---|---|---|
| Direct mention | "My flight on Delta Air Lines was delayed by 3 hours at Denver airport." | Airline name appears explicitly |
| Indirect mention | "The airline refunded my Denver booking without any hassle." | Refers to an airline experience but never names it |
| No mention | "The weather in Denver was beautiful today." | No airline content at all |
| Ambiguous | "SAS programming in Chicago", "AA meeting in Denver" | Uses a word that matches an airline short form but in a non-airline context |

2.3 Noise Injection

To prevent trivial substring matching, airline names in text are mutated via surface_form() and typo():

  • Casing: lowercase, UPPERCASE, original
  • Abbreviation: full name, short name, IATA code
  • Typos: character deletion, character doubling, truncation, suffix misspelling ("Airlnes"), dropped double letters
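A minimal sketch of how this kind of mutation could work. Only the function names surface_form() and typo() come from the script; the signatures, the forms dictionary, the typo_rate parameter, and the choice of three typo kinds are assumptions made for illustration.

```python
import random


def typo(name: str, rng: random.Random) -> str:
    """Apply one random typo: delete a character, double a character, or truncate."""
    if len(name) < 4:
        return name  # too short to mutate safely (e.g. IATA codes)
    kind = rng.choice(["delete", "double", "truncate"])
    i = rng.randrange(1, len(name) - 1)  # never touch the first or last character
    if kind == "delete":
        return name[:i] + name[i + 1:]
    if kind == "double":
        return name[:i] + name[i] + name[i:]
    return name[:-1]  # truncate: drop the final character


def surface_form(forms: dict, rng: random.Random, typo_rate: float = 0.3) -> str:
    """Pick an abbreviation level, optionally add a typo, then pick a casing."""
    text = rng.choice([forms["full"], forms["short"], forms["iata"]])
    if rng.random() < typo_rate:
        text = typo(text, rng)
    casing = rng.choice([str.lower, str.upper, lambda s: s])
    return casing(text)
```

Passing a seeded random.Random keeps the generated dataset reproducible between runs.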

2.4 Label Assignment

Each sample gets a (text, candidate_label, label_quality) triple. The dataset is balanced into thirds:

Correct (10,000 samples): The candidate label is the canonical full airline name. The text may contain the full name, a short form, an IATA code, a typo, or no explicit name at all (indirect reference).

Partial (10,000 samples): The candidate label is a short form / abbreviation / IATA code of the correct airline. Same text variation as above.

Incorrect (10,000 samples): Split across 12 sub-buckets to create diverse failure modes:

| Bucket | Description |
|---|---|
| 0 | Wrong full airline name on text that explicitly mentions a different airline |
| 1 | Confusable airline pair (e.g., Delta/American, Emirates/Qatar) |
| 2 | Airline label on text with zero airline content |
| 3 | Short form of the wrong airline |
| 4 | Ambiguous word usage (e.g., "delta variant" labeled as Delta Air Lines) |
| 5 | Indirect text (no name) with wrong airline label |
| 6 | IATA code mismatch (code for airline X, label for airline Y) |
| 7 | Short form of a confusable airline |
| 8 | Non-airline label on text that mentions an airline (e.g., city names, company names, airport codes) |
| 9 | Non-airline label on text with no airline mention |
| 10 | Non-airline label on indirect airline text |
| 11 | Near-miss typo as label (e.g., "Untied", "Delt", "Sprit") on text mentioning an airline |

The non-airline label pool includes 120+ entries: city names, airport codes (LAX, JFK, etc.), travel terms, non-airline companies (Boeing, Expedia, etc.), generic words (Air, Airways, Jet), and deliberate near-miss typos.

2.5 Output

data/labeled_data.csv -> 30,000 rows with columns: text, true_airline, candidate_label, label_quality


3. NER Entity Extraction

Script: scripts/extract_entities.py

An optional preprocessing step that automatically extracts airline name candidates from text using Named Entity Recognition, rather than relying on manually assigned labels.

3.1 spaCy NER

Each text is processed through spaCy's en_core_web_md model. Entities with the following NER labels are extracted as potential airline mentions:

| NER Label | Rationale |
|---|---|
| ORG | Airlines are organizations (primary signal) |
| PRODUCT | Sometimes airline names are tagged as products |
| GPE | Catches cases like "Alaska", "Hawaiian", "Singapore" |
| PERSON | Catches edge cases where names are misclassified |

3.2 Matching

Each extracted entity is matched against a lookup table of all known airline surface forms (canonical names + abbreviations + IATA codes):

  1. Exact match first: If the entity text (lowercased) exactly matches a known form, it maps directly to the canonical airline with 100% confidence
  2. Fuzzy fallback: If no exact match, rapidfuzz.fuzz.WRatio is used with a threshold of 70
  3. If no match exceeds the threshold, the result is "NONE"

Each match also records a ner_match_type:

| Match Type | Meaning |
|---|---|
| canonical | Matched the full airline name exactly |
| short_form | Matched a common abbreviation |
| iata_code | Matched a 2-letter IATA code |
| fuzzy | No exact form matched, resolved via fuzzy similarity |
| none | No airline entity found |

And a ner_label_source: spacy (exact match) or rapidfuzz (fuzzy fallback).
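The matching cascade can be sketched as below. This is an illustration under stated assumptions, not the script's code: the two-airline LOOKUP table is hypothetical, the threshold is assumed to be inclusive, and difflib.SequenceMatcher stands in for rapidfuzz.fuzz.WRatio so that the sketch runs on the standard library alone (the two scorers produce different numbers).

```python
from difflib import SequenceMatcher

# Hypothetical two-airline lookup: surface form -> (canonical name, match type).
LOOKUP = {
    "united airlines": ("United Airlines", "canonical"),
    "united": ("United Airlines", "short_form"),
    "ua": ("United Airlines", "iata_code"),
    "scandinavian airlines": ("Scandinavian Airlines", "canonical"),
    "sas": ("Scandinavian Airlines", "short_form"),
    "sk": ("Scandinavian Airlines", "iata_code"),
}
FUZZY_THRESHOLD = 70  # value from this README; inclusive comparison is assumed


def similarity(a: str, b: str) -> float:
    """Stand-in scorer on a 0-100 scale; the script uses rapidfuzz.fuzz.WRatio."""
    return 100.0 * SequenceMatcher(None, a, b).ratio()


def match_entity(entity: str) -> dict:
    key = entity.lower().strip()
    if key in LOOKUP:  # 1. exact match against a known surface form
        canonical, match_type = LOOKUP[key]
        return {"label": canonical, "confidence": 100.0,
                "match_type": match_type, "source": "spacy"}
    # 2. fuzzy fallback: score every known form and keep the best one
    best_form = max(LOOKUP, key=lambda form: similarity(key, form))
    score = similarity(key, best_form)
    if score >= FUZZY_THRESHOLD:
        return {"label": LOOKUP[best_form][0], "confidence": score,
                "match_type": "fuzzy", "source": "rapidfuzz"}
    # 3. nothing cleared the threshold
    return {"label": "NONE", "confidence": 0.0, "match_type": "none", "source": ""}
```

For example, match_entity("SAS") resolves through the exact path, while a misspelling such as "Untied Airlnes" falls through to the fuzzy path.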

3.3 Agreement & Label Quality Derivation

The script computes two additional columns:

ner_label_agrees -- Compares the existing candidate_label against ner_candidate_label:

  • 1.0 -- candidate resolves to the same canonical airline as NER
  • 0.5 -- candidate is a variant of the NER airline
  • 0.0 -- no agreement

ner_label_quality -- Compares ner_candidate_label against true_airline (ground truth from generation):

  • correct -- NER output matches the true airline (canonical key)
  • incorrect -- NER output doesn't match (wrong airline, NONE, or no true airline)

Note: There is no partial class for NER because the NER pipeline always resolves to canonical names.
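A sketch of how the two derived columns could be computed. The FORMS lookup and both function names are hypothetical, and the boundary between 1.0 and 0.5 (canonical name itself vs. any other known form of the same airline) is an assumption inferred from the examples in section 6.3.

```python
# Hypothetical lookup: lowercased surface form -> canonical airline name.
FORMS = {
    "united airlines": "United Airlines",
    "united": "United Airlines",
    "ua": "United Airlines",
    "delta air lines": "Delta Air Lines",
    "delta": "Delta Air Lines",
    "dl": "Delta Air Lines",
}


def label_agreement(candidate_label: str, ner_label: str) -> float:
    """1.0 for the canonical name itself, 0.5 for a known variant, else 0.0."""
    if ner_label == "NONE":
        return 0.0
    cand = candidate_label.lower().strip()
    if cand == ner_label.lower():
        return 1.0
    if FORMS.get(cand) == ner_label:
        return 0.5
    return 0.0


def ner_label_quality(ner_label: str, true_airline: str) -> str:
    """Two classes only: the NER output is always a canonical name or NONE."""
    if ner_label != "NONE" and true_airline and ner_label == true_airline:
        return "correct"
    return "incorrect"
```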

3.4 Output Columns

| Column | Description |
|---|---|
| ner_entities | JSON list of all extracted entities with their NER labels |
| ner_candidate_label | Best-match canonical airline name (or "NONE") |
| ner_confidence | Match score (0-100) |
| ner_match_type | How the match was made (canonical/short_form/iata_code/fuzzy/none) |
| ner_label_source | Which system provided the match (spacy/rapidfuzz) |
| ner_label_agrees | Agreement between candidate_label and ner_candidate_label (0.0/0.5/1.0) |
| ner_label_quality | NER-derived label quality (correct/incorrect) |

3.5 Diagnostics

The script prints:

  • Extraction rate: percentage of rows with at least one airline entity
  • Agreement rate: how often ner_candidate_label matches candidate_label
  • Confidence distribution: mean, median, min, max of non-zero confidence scores
  • Top mismatches: most common (assigned label, NER label) disagreement pairs
  • Match type distribution: counts per match type
  • Label source distribution: spacy vs rapidfuzz counts
  • Extraction rate by label quality: broken down by correct/partial/incorrect
  • NER label quality distribution and crosstab against original label_quality

3.6 Output File

data/ner_labeled_data.csv -- all original columns plus the NER columns above.


4. Feature Engineering

Script: scripts/train_model.py (feature extraction happens at training time)

Both models share a common feature base. The NER model adds extra features on top.

Common Features (both models)

Three feature groups are concatenated into a sparse matrix:

4.1 TF-IDF Text Features

  • TfidfVectorizer on the text column
  • max_features=500, ngram_range=(1, 2), English stop words removed
  • Captures which words and bigrams appear in each text
  • Produces a 500-dimensional sparse vector per sample

4.2 One-Hot Label Features

  • One-hot encoded candidate label (fitted fresh during training)
  • For the candidate model: encodes candidate_label (user-assigned)
  • For the NER model: encodes ner_candidate_label (NER-extracted canonical name)
  • Dimensionality equals the number of unique labels in training data
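The encoding can be illustrated in plain Python. This is a sketch of the behaviour, not the script's code, which relies on an encoder with handle_unknown="ignore" (see 6.6); fit_categories and one_hot are hypothetical helper names.

```python
def fit_categories(labels):
    """Sorted unique labels seen in training; the index is the one-hot column."""
    return {label: i for i, label in enumerate(sorted(set(labels)))}


def one_hot(label, categories):
    """Encode one label; labels unseen during training yield an all-zero row."""
    row = [0.0] * len(categories)
    idx = categories.get(label)
    if idx is not None:
        row[idx] = 1.0
    return row


categories = fit_categories(["United Airlines", "SAS", "Denver", "SAS"])
known = one_hot("SAS", categories)
unknown = one_hot("Some New Label", categories)
```

Here categories has three entries, so every row has three columns, and the unseen label produces a row of zeros instead of raising an error.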

4.3 Handcrafted Similarity Features

Five numeric features measuring how well the candidate label matches the text:

| Feature | Description |
|---|---|
| token_overlap | Fraction of the label's word tokens that also appear in the text (the intersection count divided by the number of label tokens, not by the union as in Jaccard) |
| bigram_sim | Character bigram Jaccard similarity between label and text |
| text_len | Character length of the text |
| label_len | Character length of the candidate label |
| len_ratio | label_len / text_len |
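The five features can be sketched as follows. Lowercasing, whitespace tokenization, and returning 0.0 for empty inputs are assumptions; the script may tokenize differently.

```python
def char_bigrams(s: str) -> set:
    """Set of overlapping two-character substrings, case-insensitive."""
    s = s.lower()
    return {s[i:i + 2] for i in range(len(s) - 1)}


def similarity_features(text: str, label: str) -> dict:
    text_tokens = set(text.lower().split())
    label_tokens = label.lower().split()
    # Fraction of label tokens that also occur in the text.
    overlap = (sum(t in text_tokens for t in label_tokens) / len(label_tokens)
               if label_tokens else 0.0)
    a, b = char_bigrams(label), char_bigrams(text)
    union = a | b
    bigram_sim = len(a & b) / len(union) if union else 0.0
    text_len, label_len = len(text), len(label)
    return {
        "token_overlap": overlap,
        "bigram_sim": bigram_sim,
        "text_len": text_len,
        "label_len": label_len,
        "len_ratio": label_len / text_len if text_len else 0.0,
    }
```

Note that token_overlap alone cannot reject a label such as "Denver" when the city appears in the text, which is one reason the model also sees the one-hot label and TF-IDF features.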

4.4 NER-Specific Features (NER Model only)

When training on ner_candidate_label, the model adds four extra feature groups:

| Feature | Type | Description |
|---|---|---|
| ner_match_type | One-hot (5 cols) | How the entity was matched: canonical, short_form, iata_code, fuzzy, none |
| ner_label_source | One-hot (3 cols) | Which system matched: spacy, rapidfuzz, or empty |
| ner_confidence | Numeric (1 col) | Match confidence score (0-100) |
| ner_label_agrees | Numeric (1 col) | Agreement between candidate_label and ner_candidate_label (0.0/0.5/1.0) |

4.5 Combined Feature Matrix

All groups are horizontally stacked via scipy.sparse.hstack:

Candidate Model:

X = [TF-IDF (500) | One-Hot Label | Handcrafted (5)]

NER Model:

X = [TF-IDF (500) | One-Hot Label | Handcrafted (5) | Match Type (5) | Label Source (3) | Confidence (1) | Agreement (1)]

5. Model Training & Evaluation

Script: scripts/train_model.py

5.1 Train/Test Split

  • 80/20 split with stratify=y to preserve class balance
  • random_state=42 for reproducibility
  • Train: 24,000 samples, Test: 6,000 samples

5.2 Two Models

Candidate model (--model candidate): Uses candidate_label as the feature, predicts label_quality (3-class: correct/partial/incorrect). Target: does the pre-assigned label match the text?

NER model (--model ner): Uses ner_candidate_label as the feature, predicts ner_label_quality (2-class: correct/incorrect). Target: does the NER-extracted label match the true airline? No partial class because NER always resolves to canonical names.

Both use Logistic Regression:

| Parameter | Value | Rationale |
|---|---|---|
| C | 1.0 | Default regularization strength (L2) |
| max_iter | 5000 | Sufficient for convergence on this feature set |
| solver | lbfgs | Default; handles multinomial well |

Use --compare to train both and generate a side-by-side comparison report.

5.3 Metrics

Classification report: Per-class precision, recall, F1-score.

Confusion matrix: Predictions vs true labels, 3x3 for the candidate model and 2x2 for the NER model.

AUC-ROC (One-vs-Rest): Each class is treated as binary (this class vs. all others). The ROC curve plots True Positive Rate vs False Positive Rate at varying probability thresholds. Reported per-class and as macro/weighted averages.

PR-AUC (One-vs-Rest): Area under the Precision-Recall curve. More informative than ROC when classes are imbalanced; the candidate model's three classes are balanced by construction, but the NER model's two classes need not be. Reported per-class and as macro/weighted averages.
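To make the one-vs-rest idea concrete, here is a plain-Python sketch of the definition. It is not the script's code: binary_auc and ovr_macro_auc are hypothetical helpers that compute ROC AUC from its rank interpretation (the probability that a positive sample scores higher than a negative one).

```python
def binary_auc(y_true, scores):
    """ROC AUC as the chance that a positive outranks a negative (ties count half)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def ovr_macro_auc(y_true, proba, classes):
    """One-vs-rest AUC per class plus the unweighted (macro) mean."""
    per_class = {}
    for k, cls in enumerate(classes):
        binary = [1 if y == cls else 0 for y in y_true]  # this class vs. all others
        per_class[cls] = binary_auc(binary, [row[k] for row in proba])
    return per_class, sum(per_class.values()) / len(classes)
```

The pairwise loop is quadratic in the number of samples, so it is only suitable for illustration or small checks.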

5.4 Model Serialization

The trained model is exported in two formats:

ONNX (primary, safe): The sklearn LogisticRegression is converted to ONNX via skl2onnx. ONNX is a language-agnostic, open standard that does not allow arbitrary code execution on load. The TF-IDF vectorizer vocabulary and IDF weights are saved separately as JSON. Each model produces its own files: model_candidate.onnx / model_ner.onnx and tfidf_config_candidate.json / tfidf_config_ner.json.

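Pickle (fallback): model_{tag}.pkl files are generated for quick iteration and contain the model, TF-IDF vectorizer, one-hot encoder, and label column name. Loading a pickle can execute arbitrary code, so these files should not be used in production or loaded from untrusted sources. Note that the candidate prediction path still reads the one-hot encoder from this file (see 6.2).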

5.5 Outputs

| File | Contents |
|---|---|
| models/model_{tag}.onnx | ONNX-format logistic regression model (safe to load) |
| models/tfidf_config_{tag}.json | TF-IDF vocabulary + IDF weights as JSON |
| models/model_{tag}.pkl | Pickle fallback with model + vectorizer + OHE |
| images/roc_pr_curves_{tag}.png | Side-by-side ROC and Precision-Recall curve plots per class |
| reports/error_report_{tag}.csv | Every misclassified test sample with text, label, true/predicted class, confidence |
| COMPARISON.md | Auto-generated model comparison report (when using --compare) |

5.6 Error Analysis

The console prints a breakdown of each confusion pair (e.g., True=incorrect -> Pred=correct) with sample count and 3 example errors showing the text, label, and model confidence. The full error set is in reports/error_report_{tag}.csv for manual review.


6. Prediction

Script: scripts/predict.py

6.1 Usage

# Candidate model (default, 3-class: correct / partial / incorrect)
python scripts/predict.py "<text>" "<candidate_label>"

# NER model (2-class: correct / incorrect)
python scripts/predict.py "<text>" "<candidate_label>" --model ner

# With pre-filter (reject non-airline labels before running the model)
python scripts/predict.py "<text>" "<candidate_label>" --filter

6.2 Candidate Model Process

  1. Load model_candidate.onnx via ONNX Runtime
  2. Reconstruct the TF-IDF vectorizer from tfidf_config_candidate.json
  3. Load the one-hot encoder from model_candidate.pkl
  4. Use the user's candidate label directly as the model input
  5. Compute TF-IDF features, one-hot encode the label, compute handcrafted similarity features
  6. Stack all features into a single dense float32 vector
  7. Run ONNX inference -> predicted class and probabilities
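Step 2 can be sketched without scikit-learn. The JSON key names ("vocabulary", "idf") and the sample config are assumptions, since the real file layout is not documented here; the token pattern, the (1, 2) n-gram range, and the L2 normalization follow TfidfVectorizer's documented defaults and the settings in 4.1.

```python
import json
import math
import re

TOKEN_RE = re.compile(r"(?u)\b\w\w+\b")  # scikit-learn's default token pattern


def tfidf_vector(text, config, stop_words=frozenset()):
    """Rebuild a dense, L2-normalized TF-IDF row from a saved vocabulary and IDF weights."""
    vocab, idf = config["vocabulary"], config["idf"]
    # Stop words are removed before bigrams are formed.
    tokens = [t for t in TOKEN_RE.findall(text.lower()) if t not in stop_words]
    ngrams = tokens + [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    row = [0.0] * len(idf)
    for gram in ngrams:
        col = vocab.get(gram)
        if col is not None:
            row[col] += idf[col]  # adding IDF once per occurrence gives count * IDF
    norm = math.sqrt(sum(v * v for v in row))
    return [v / norm for v in row] if norm else row


# Hypothetical config; the real file's key names and contents may differ.
config = json.loads(
    '{"vocabulary": {"united": 0, "airlines": 1, "united airlines": 2},'
    ' "idf": [1.0, 2.0, 2.0]}'
)
vec = tfidf_vector("I flew United Airlines to Denver", config, stop_words={"to"})
empty = tfidf_vector("", config)
```

An empty text yields an all-zero row, which matches the Empty Text edge case in 6.6.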

6.3 NER Model Process

The NER model uses a route-through-NER approach. Instead of feeding the user's candidate label directly into the model, it first runs the NER pipeline to independently extract an airline from the text, then compares that extraction against the user's label.

  1. Load model_ner.onnx via ONNX Runtime
  2. Run spaCy NER on the input text to extract entities
  3. Match extracted entities against the airline lookup (exact match first, then fuzzy fallback, as in 3.2) to resolve a canonical airline name (ner_resolved)
  4. Use ner_resolved (not the user's label) as the model's candidate label input
  5. Compute agreement between the user's candidate label and ner_resolved:
    • 1.0 -- user's label matches NER output (same canonical airline)
    • 0.5 -- user's label is a variant/abbreviation of the NER airline
    • 0.0 -- no match (wrong airline, non-airline label, or NER found nothing)
  6. Build feature vector: TF-IDF + one-hot of ner_resolved + handcrafted features + NER metadata (match type, label source, confidence, agreement)
  7. Run ONNX inference -> predicted class and probabilities

Hard rule: If NER extracts an airline but agreement with the user's label is 0.0, the prediction short-circuits to incorrect without running the model. This covers cases that the model's features alone cannot separate: for "Denver" vs "United Airlines", the model sees strong NER signals even though the label is completely wrong.

This means:

  • "Denver" as label when NER found "United Airlines" -> agreement 0.0 -> incorrect (hard rule)
  • "United" as label when NER found "United Airlines" -> agreement 0.5 -> runs through model
  • "United Airlines" as label when NER found "United Airlines" -> agreement 1.0 -> runs through model
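The control flow of the hard rule can be sketched as below. predict_ner and its run_model argument are hypothetical names; run_model stands for whatever wraps the ONNX inference call.

```python
def predict_ner(ner_resolved, agreement, run_model):
    """Apply the hard rule first; otherwise defer to the model.

    run_model is a hypothetical zero-argument callable wrapping ONNX inference.
    """
    if ner_resolved != "NONE" and agreement == 0.0:
        # NER found an airline and the user's label does not match it at all.
        return {"prediction": "incorrect", "source": "hard_rule"}
    return {"prediction": run_model(), "source": "model"}


calls = []


def fake_model():
    """Stub used only to show when the model is actually invoked."""
    calls.append(1)
    return "correct"


denver = predict_ner("United Airlines", 0.0, fake_model)
united = predict_ner("United Airlines", 0.5, fake_model)
nothing = predict_ner("NONE", 0.0, fake_model)
```

Only the first call is short-circuited. When NER finds nothing, agreement is also 0.0, but the rule does not fire and the model still runs.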

6.4 Example Output

Candidate Model:

python scripts/predict.py "I flew United Airlines to Denver" "United Airlines"
Model:           candidate (3-class model trained on pre-assigned candidate labels)
Text:            I flew United Airlines to Denver
Candidate label: United Airlines
Prediction:      correct
Probabilities:
   incorrect: 0.2977
     partial: 0.2776
     correct: 0.4235

python scripts/predict.py "I flew United Airlines to Denver" "United Air"
Model:           candidate (3-class model trained on pre-assigned candidate labels)
Text:            I flew United Airlines to Denver
Candidate label: United Air
Prediction:      partial
Probabilities:
   incorrect: 0.2177
     partial: 0.7881
     correct: 0.0002

python scripts/predict.py "I flew United Airlines to Denver" "Denver"
Model:           candidate (3-class model trained on pre-assigned candidate labels)
Text:            I flew United Airlines to Denver
Candidate label: Denver
Prediction:      incorrect
Probabilities:
   incorrect: 0.7039
     partial: 0.2976
     correct: 0.0005

NER model:

python scripts/predict.py "I flew United Airlines to Denver" "United Airlines" --model ner
Model:           ner (2-class model trained on NER-extracted labels)
Text:            I flew United Airlines to Denver
Candidate label: United Airlines
NER extracted:   United Airlines
Agreement:       1.0
Prediction:      correct
Probabilities:
   incorrect: 0.0083
     partial: 0.9917

python scripts/predict.py "I flew United Airlines to Denver" "United Air" --model ner
Model:           ner (2-class model trained on NER-extracted labels)
Text:            I flew United Airlines to Denver
Candidate label: United Air
NER extracted:   United Airlines
Agreement:       1.0
Prediction:      correct
Probabilities:
   incorrect: 0.0083
     partial: 0.9917

python scripts/predict.py "I flew United Airlines to Denver" "Denver" --model ner
Model:           ner (2-class model trained on NER-extracted labels)
Text:            I flew United Airlines to Denver
Candidate label: Denver
NER extracted:   United Airlines
Agreement:       0.0
Prediction:      incorrect
Reason:          NER found "United Airlines" but label "Denver" does not match

6.5 Pre-Filter (--filter)

An optional --filter flag runs a fuzzy match against the airline lookup before invoking the model. If the candidate label doesn't match any known airline form (threshold = 80, using token_set_ratio), it short-circuits to "incorrect" without loading the model. This is useful for fast rejection of obviously non-airline labels but is not required; both models can handle non-airline labels through their training data.

6.6 Edge Cases

  • Unknown Airline Label (Candidate model): The one-hot encoder outputs all zeros (handle_unknown="ignore"). The model relies on TF-IDF and similarity features. Training data includes non-airline labels (city names, company names, airport codes) as incorrect examples.
  • Unknown Airline Label (NER model): The NER pipeline runs independently of the label and resolves an airline from the text. Agreement with the unknown label is 0.0, so the hard rule in 6.3 short-circuits the prediction to incorrect.
  • NER Finds Nothing: ner_resolved = "NONE", confidence = 0, agreement = 0. The model treats this as low signal.
  • Empty Text: Produces a zero TF-IDF vector; prediction relies on label features and agreement.
