Airline Label Accuracy Prediction

A logistic regression pipeline that predicts whether a candidate airline label is correct, partially correct, or incorrect for a given text snippet.

Requirements

Python Packages

joblib
matplotlib
numpy
onnxruntime
pandas
rapidfuzz
scikit-learn
skl2onnx
spacy

SpaCy Model

python -m spacy download en_core_web_md

1. Overview

The problem: given a text snippet (e.g., a customer review, a tweet, a flight status message) and a candidate airline name, determine how accurately the airline label matches the text.

Three-class target:

Class	Code	Meaning
`correct`	2	The full airline name accurately identifies the airline in the text
`partial`	1	A short form, abbreviation, or IATA code that refers to the right airline but isn't the canonical name
`incorrect`	0	The label is wrong -- either the wrong airline, or no airline is mentioned at all

Pipeline:

    generate_data    ->    extract_entities    ->      train_model      ->    predict
   (labeled_data)         (ner_labeled_data)    (models, reports, images)

2. Dataset Generation

Script: scripts/generate_data.py

2.1 Airline Corpus

30 airlines (domestic US + international) are defined with their common surface forms:

Full canonical name (e.g., "Scandinavian Airlines")
Common short name (e.g., "SAS")
IATA code (e.g., "SK")
Informal variants (e.g., "scandinavian", "Scandinavian Air")

2.2 Text Generation

Four categories of text templates produce the snippets:

Template Type	Example	Purpose
Direct mention	"My flight on Delta Air Lines was delayed by 3 hours at Denver airport."	Airline name appears explicitly
Indirect mention	"The airline refunded my Denver booking without any hassle."	Refers to an airline experience but never names it
No mention	"The weather in Denver was beautiful today."	No airline content at all
Ambiguous	"SAS programming in Chicago", "AA meeting in Denver"	Uses a word that matches an airline short form but in a non-airline context

2.3 Noise Injection

To prevent trivial substring matching, airline names in text are mutated via surface_form() and typo():

Casing: lowercase, UPPERCASE, original
Abbreviation: full name, short name, IATA code
Typos: character deletion, character doubling, truncation, suffix misspelling ("Airlnes"), dropped double letters
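
The mutations listed above could look roughly like the following. This is a minimal sketch, not the code in scripts/generate_data.py: only the names surface_form() and typo() come from this README, while the function bodies, the `rng` parameter, the `forms` dictionary layout, and the subset of typo kinds shown are assumptions made for illustration.

```python
import random


def typo(name: str, rng: random.Random) -> str:
    """Apply one random character-level typo (hypothetical implementation)."""
    if len(name) < 3:
        return name  # too short to mutate safely
    kind = rng.choice(["delete", "double", "truncate"])
    i = rng.randrange(1, len(name) - 1)  # never touch the first or last character
    if kind == "delete":
        return name[:i] + name[i + 1:]
    if kind == "double":
        return name[:i] + name[i] + name[i:]
    return name[:-1]  # truncate: drop the final character


def surface_form(forms: dict, rng: random.Random) -> str:
    """Pick the full, short, or IATA form, then a casing variant (hypothetical)."""
    text = forms[rng.choice(["full", "short", "iata"])]
    casing = rng.choice(["lower", "upper", "original"])
    if casing == "lower":
        return text.lower()
    if casing == "upper":
        return text.upper()
    return text


rng = random.Random(0)  # seeded so repeated runs produce the same mutations
forms = {"full": "Scandinavian Airlines", "short": "SAS", "iata": "SK"}
print(surface_form(forms, rng), "|", typo("Delta Air Lines", rng))
```

Passing a seeded `random.Random` instance keeps generation reproducible without touching the global random state.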

2.4 Label Assignment

Each sample gets a (text, candidate_label, label_quality) triple. The dataset is balanced into thirds:

Correct (10,000 samples): The candidate label is the canonical full airline name. The text may contain the full name, a short form, an IATA code, a typo, or no explicit name at all (indirect reference).

Partial (10,000 samples): The candidate label is a short form / abbreviation / IATA code of the correct airline. Same text variation as above.

Incorrect (10,000 samples): Split across 12 sub-buckets to create diverse failure modes:

Bucket	Description
0	Wrong full airline name on text that explicitly mentions a different airline
1	Confusable airline pair (e.g., Delta/American, Emirates/Qatar)
2	Airline label on text with zero airline content
3	Short form of the wrong airline
4	Ambiguous word usage (e.g., "delta variant" labeled as Delta Air Lines)
5	Indirect text (no name) with wrong airline label
6	IATA code mismatch (code for airline X, label for airline Y)
7	Short form of a confusable airline
8	Non-airline label on text that mentions an airline (e.g., city names, company names, airport codes)
9	Non-airline label on text with no airline mention
10	Non-airline label on indirect airline text
11	Near-miss typo as label (e.g., "Untied", "Delt", "Sprit") on text mentioning an airline

The non-airline label pool includes 120+ entries: city names, airport codes (LAX, JFK, etc.), travel terms, non-airline companies (Boeing, Expedia, etc.), generic words (Air, Airways, Jet), and deliberate near-miss typos.

2.5 Output

data/labeled_data.csv -> 30,000 rows with columns: text, true_airline, candidate_label, label_quality

3. NER Entity Extraction

Script: scripts/extract_entities.py

An optional preprocessing step that automatically extracts airline name candidates from text using Named Entity Recognition, rather than relying on manually assigned labels.

3.1 spaCy NER

Each text is processed through spaCy's en_core_web_md model. Entities with the following NER labels are extracted as potential airline mentions:

NER Label	Rationale
`ORG`	Airlines are organizations (primary signal)
`PRODUCT`	Sometimes airline names are tagged as products
`GPE`	Catches cases like "Alaska", "Hawaiian", "Singapore"
`PERSON`	Catches edge cases where names are misclassified
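
As an illustration of the label filter in the table above, the sketch below keeps only entities whose NER label is one of the four listed. The function name `candidate_mentions` and the `Ent` stand-in type are hypothetical; the stand-in mimics the `.text` and `.label_` attributes of the spans in spaCy's `doc.ents`, so the example runs without downloading a model.

```python
from collections import namedtuple

# The four NER labels listed in the table above.
AIRLINE_NER_LABELS = {"ORG", "PRODUCT", "GPE", "PERSON"}


def candidate_mentions(ents):
    """Keep (text, label) pairs whose NER label is in the allow-list.

    `ents` is any iterable of objects exposing `.text` and `.label_`.
    """
    return [(e.text, e.label_) for e in ents if e.label_ in AIRLINE_NER_LABELS]


# Stand-in entities; with spaCy installed you would instead pass
# spacy.load("en_core_web_md")(text).ents
Ent = namedtuple("Ent", ["text", "label_"])
ents = [Ent("Delta Air Lines", "ORG"), Ent("3 hours", "TIME"), Ent("Denver", "GPE")]
print(candidate_mentions(ents))
```

Note that "Denver" survives this filter because GPE is allowed; it is the matching step in 3.2 that decides whether a surviving entity is actually an airline.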

3.2 Matching

Each extracted entity is matched against a lookup table of all known airline surface forms (canonical names + abbreviations + IATA codes):

Exact match first: If the entity text (lowercased) exactly matches a known form, it maps directly to the canonical airline with 100% confidence
Fuzzy fallback: If no exact match, rapidfuzz.fuzz.WRatio is used with a threshold of 70
If no match exceeds the threshold, the result is "NONE"

Each match also records a ner_match_type:

Match Type	Meaning
`canonical`	Matched the full airline name exactly
`short_form`	Matched a common abbreviation
`iata_code`	Matched a 2-letter IATA code
`fuzzy`	No exact form matched, resolved via fuzzy similarity
`none`	No airline entity found

Each match also records a ner_label_source: spacy (exact match) or rapidfuzz (fuzzy fallback). Rows with no match leave it empty.
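
The exact-then-fuzzy order described in this section can be sketched as follows. To stay dependency-free, the sketch swaps rapidfuzz.fuzz.WRatio for difflib.SequenceMatcher, so its scores will not equal WRatio scores. The `LOOKUP` table, the `match_entity` name, the returned dictionary keys, and the inclusive threshold comparison are all assumptions for illustration.

```python
from difflib import SequenceMatcher

# Hypothetical two-airline lookup: lowercased surface form -> (canonical, match type).
LOOKUP = {
    "scandinavian airlines": ("Scandinavian Airlines", "canonical"),
    "sas": ("Scandinavian Airlines", "short_form"),
    "sk": ("Scandinavian Airlines", "iata_code"),
    "delta air lines": ("Delta Air Lines", "canonical"),
    "delta": ("Delta Air Lines", "short_form"),
    "dl": ("Delta Air Lines", "iata_code"),
}


def similarity(a: str, b: str) -> float:
    """Stand-in scorer on a 0-100 scale (NOT rapidfuzz's WRatio)."""
    return 100.0 * SequenceMatcher(None, a, b).ratio()


def match_entity(entity: str, threshold: float = 70.0) -> dict:
    key = entity.lower().strip()
    if key in LOOKUP:  # exact match first, always 100% confidence
        canonical, match_type = LOOKUP[key]
        return {"label": canonical, "confidence": 100.0,
                "match_type": match_type, "source": "spacy"}
    # Fuzzy fallback: take the best-scoring known surface form.
    best_form = max(LOOKUP, key=lambda form: similarity(key, form))
    score = similarity(key, best_form)
    if score >= threshold:  # inclusive cutoff is an assumption
        return {"label": LOOKUP[best_form][0], "confidence": score,
                "match_type": "fuzzy", "source": "rapidfuzz"}
    return {"label": "NONE", "confidence": 0.0, "match_type": "none", "source": ""}


for entity in ["SAS", "Delta Air Lnes", "weather"]:
    print(entity, "->", match_entity(entity))
```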

3.3 Agreement & Label Quality Derivation

The script computes two additional columns:

ner_label_agrees -- Compares the existing candidate_label against ner_candidate_label:

1.0 -- candidate resolves to the same canonical airline as NER
0.5 -- candidate is a variant of the NER airline
0.0 -- no agreement

ner_label_quality -- Compares ner_candidate_label against true_airline (ground truth from generation):

correct -- NER output matches the true airline (canonical key)
incorrect -- NER output doesn't match (wrong airline, NONE, or no true airline)

Note: There is no partial class for NER because the NER pipeline always resolves to canonical names.
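
The two derived columns could be computed along these lines. This is one plausible reading of the 1.0 / 0.5 distinction (1.0 when the candidate is the canonical name itself, 0.5 when it is another known form of the same airline), not a confirmed description of the script. The `CANONICAL` lookup and both function bodies are hypothetical.

```python
# Hypothetical lookup: lowercased surface form -> canonical airline name.
CANONICAL = {
    "united airlines": "United Airlines",
    "united": "United Airlines",
    "ua": "United Airlines",
    "delta air lines": "Delta Air Lines",
    "delta": "Delta Air Lines",
}


def ner_label_agrees(candidate_label: str, ner_candidate_label: str) -> float:
    """Return 1.0 for the canonical name, 0.5 for a known variant, else 0.0."""
    if ner_candidate_label == "NONE":
        return 0.0
    if candidate_label.lower() == ner_candidate_label.lower():
        return 1.0
    if CANONICAL.get(candidate_label.lower()) == ner_candidate_label:
        return 0.5
    return 0.0


def ner_label_quality(ner_candidate_label: str, true_airline: str) -> str:
    """Return `correct` only when NER produced the true canonical airline."""
    if true_airline and ner_candidate_label == true_airline:
        return "correct"
    return "incorrect"  # wrong airline, NONE, or no true airline


print(ner_label_agrees("United", "United Airlines"))
print(ner_label_quality("NONE", "United Airlines"))
```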

3.4 Output Columns

Column	Description
`ner_entities`	JSON list of all extracted entities with their NER labels
`ner_candidate_label`	Best-match canonical airline name (or `"NONE"`)
`ner_confidence`	Match score (0-100)
`ner_match_type`	How the match was made (canonical/short_form/iata_code/fuzzy/none)
`ner_label_source`	Which system provided the match (spacy/rapidfuzz)
`ner_label_agrees`	Agreement between candidate_label and ner_candidate_label (0.0/0.5/1.0)
`ner_label_quality`	NER-derived label quality (correct/incorrect)

3.5 Diagnostics

The script prints:

Extraction rate: percentage of rows with at least one airline entity
Agreement rate: how often ner_candidate_label matches candidate_label
Confidence distribution: mean, median, min, max of non-zero confidence scores
Top mismatches: most common (assigned label, NER label) disagreement pairs
Match type distribution: counts per match type
Label source distribution: spacy vs rapidfuzz counts
Extraction rate by label quality: broken down by correct/partial/incorrect
NER label quality distribution and crosstab against original label_quality

3.6 Output File

data/ner_labeled_data.csv -- all original columns plus the NER columns above.

4. Feature Engineering

Script: scripts/train_model.py (feature extraction happens at training time)

Both models share a common feature base. The NER model adds extra features on top.

Common Features (both models)

Three feature groups are concatenated into a sparse matrix:

4.1 TF-IDF Text Features

TfidfVectorizer on the text column
max_features=500, ngram_range=(1, 2), English stop words removed
Captures which words and bigrams appear in each text
Produces a 500-dimensional sparse vector per sample

4.2 One-Hot Label Features

One-hot encoded candidate label (fitted fresh during training)
For the candidate model: encodes candidate_label (user-assigned)
For the NER model: encodes ner_candidate_label (NER-extracted canonical name)
Dimensionality equals the number of unique labels in training data

4.3 Handcrafted Similarity Features

Five numeric features measuring how well the candidate label matches the text:

Feature	Description
`token_overlap`	Fraction of label tokens found in the text tokens (word-level intersection size divided by the number of label tokens)
`bigram_sim`	Character bigram Jaccard similarity between label and text
`text_len`	Character length of the text
`label_len`	Character length of the candidate label
`len_ratio`	`label_len / text_len`
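
A minimal sketch of the five features, assuming lowercase `\w+` tokenization and zero as the fallback for empty inputs. Neither choice is confirmed by the training script, and the helper names are hypothetical.

```python
import re


def tokens(s: str) -> set:
    """Lowercased word tokens (tokenization rule is an assumption)."""
    return set(re.findall(r"\w+", s.lower()))


def char_bigrams(s: str) -> set:
    """Set of overlapping two-character substrings."""
    s = s.lower()
    return {s[i:i + 2] for i in range(len(s) - 1)}


def handcrafted_features(text: str, label: str) -> dict:
    label_tokens, text_tokens = tokens(label), tokens(text)
    # Share of label tokens that also occur in the text.
    token_overlap = len(label_tokens & text_tokens) / len(label_tokens) if label_tokens else 0.0
    # Jaccard similarity over character bigrams.
    a, b = char_bigrams(label), char_bigrams(text)
    bigram_sim = len(a & b) / len(a | b) if (a | b) else 0.0
    text_len, label_len = len(text), len(label)
    len_ratio = label_len / text_len if text_len else 0.0
    return {"token_overlap": token_overlap, "bigram_sim": bigram_sim,
            "text_len": text_len, "label_len": label_len, "len_ratio": len_ratio}


print(handcrafted_features("I flew United Airlines to Denver", "United Airlines"))
```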

4.4 NER-Specific Features (NER Model only)

When training on ner_candidate_label, the model adds four extra feature groups:

Feature	Type	Description
`ner_match_type`	One-hot (5 cols)	How the entity was matched: canonical, short_form, iata_code, fuzzy, none
`ner_label_source`	One-hot (3 cols)	Which system matched: spacy, rapidfuzz, or empty
`ner_confidence`	Numeric (1 col)	Match confidence score (0-100)
`ner_label_agrees`	Numeric (1 col)	Agreement between candidate_label and ner_candidate_label (0.0/0.5/1.0)

4.5 Combined Feature Matrix

All groups are horizontally stacked via scipy.sparse.hstack:

Candidate Model:

X = [TF-IDF (500) | One-Hot Label | Handcrafted (5)]

NER Model:

X = [TF-IDF (500) | One-Hot Label | Handcrafted (5) | Match Type (5) | Label Source (3) | Confidence (1) | Agreement (1)]
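
For the candidate model, the stacking step might look like the sketch below. The toy texts, labels, and the zero-filled handcrafted block are placeholders; on two short sentences the TF-IDF block has far fewer than 500 columns, since max_features is only an upper bound.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import OneHotEncoder

# Toy inputs standing in for the text and candidate_label columns.
texts = ["My flight on Delta Air Lines was delayed",
         "The weather in Denver was beautiful today"]
labels = ["Delta Air Lines", "Denver"]

# TF-IDF settings as listed in 4.1.
tfidf = TfidfVectorizer(max_features=500, ngram_range=(1, 2), stop_words="english")
X_text = tfidf.fit_transform(texts)

# One-hot label block; unknown labels at inference time become all zeros.
ohe = OneHotEncoder(handle_unknown="ignore")
X_label = ohe.fit_transform(np.array(labels).reshape(-1, 1))

# Placeholder values standing in for the five handcrafted features.
X_hand = csr_matrix(np.zeros((len(texts), 5), dtype=np.float32))

# Column-wise concatenation: [TF-IDF | One-Hot Label | Handcrafted (5)]
X = hstack([X_text, X_label, X_hand]).tocsr()
print(X.shape)
```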

5. Model Training & Evaluation

Script: scripts/train_model.py

5.1 Train/Test Split

80/20 split with stratify=y to preserve class balance
random_state=42 for reproducibility
Train: 24,000 samples, Test: 6,000 samples

5.2 Two Models

Candidate model (--model candidate): Uses candidate_label as the feature, predicts label_quality (3-class: correct/partial/incorrect). Target: does the pre-assigned label match the text?

NER model (--model ner): Uses ner_candidate_label as the feature, predicts ner_label_quality (2-class: correct/incorrect). Target: does the NER-extracted label match the true airline? No partial class because NER always resolves to canonical names.

Both use Logistic Regression:

Parameter	Value	Rationale
`C`	1.0	Default regularization strength (L2)
`max_iter`	5000	Sufficient for convergence on this feature set
`solver`	`lbfgs`	Default; handles multinomial well

Use --compare to train both and generate a side-by-side comparison report.
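
The split and hyperparameters from 5.1 and 5.2 can be reproduced on synthetic data as below. The feature matrix here comes from make_classification, which is purely a stand-in for the real stacked features; the sample count and feature count are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real feature matrix: three classes.
X, y = make_classification(n_samples=3000, n_features=20, n_informative=6,
                           n_classes=3, random_state=0)

# 80/20 stratified split with a fixed seed, as described in 5.1.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Hyperparameters from the table in 5.2.
clf = LogisticRegression(C=1.0, max_iter=5000, solver="lbfgs")
clf.fit(X_train, y_train)

print(X_train.shape[0], X_test.shape[0])
print("test accuracy:", round(clf.score(X_test, y_test), 3))
```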

5.3 Metrics

Classification report: Per-class precision, recall, F1-score.

Confusion matrix: Predictions vs true labels (3x3 for the candidate model, 2x2 for the NER model).

AUC-ROC (One-vs-Rest): Each class is treated as binary (this class vs. all others). The ROC curve plots True Positive Rate vs False Positive Rate at varying probability thresholds. Reported per-class and as macro/weighted averages.

PR-AUC (One-vs-Rest): Precision-Recall curve area. More informative than ROC when classes are imbalanced (though ours are balanced). Reported per-class and as macro/weighted averages.
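
A sketch of the one-vs-rest metrics using scikit-learn. The probabilities are synthetic (random logits with a boost on the true class), and average_precision_score is used here as the PR-AUC estimate, which is an assumption about how the training script computes it.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.preprocessing import label_binarize

rng = np.random.default_rng(0)
classes = np.array([0, 1, 2])  # incorrect, partial, correct
y_true = rng.integers(0, 3, size=600)

# Synthetic probabilities: random logits, boosted on the true class, then softmax.
logits = rng.normal(size=(600, 3))
logits[np.arange(600), y_true] += 1.5
proba = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

# One indicator column per class: "this class vs. all others".
y_bin = label_binarize(y_true, classes=classes)

per_class_roc = [roc_auc_score(y_bin[:, k], proba[:, k]) for k in range(3)]
per_class_pr = [average_precision_score(y_bin[:, k], proba[:, k]) for k in range(3)]

macro_roc = roc_auc_score(y_true, proba, multi_class="ovr", average="macro")
weighted_roc = roc_auc_score(y_true, proba, multi_class="ovr", average="weighted")
macro_pr = average_precision_score(y_bin, proba, average="macro")

print("per-class ROC-AUC:", np.round(per_class_roc, 3))
print("macro ROC-AUC:", round(macro_roc, 3), "macro PR-AUC:", round(macro_pr, 3))
```

The macro average is the unweighted mean of the per-class scores; the weighted average scales each class by its support, so the two coincide only when classes are perfectly balanced.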

5.4 Model Serialization

The trained model is exported in two formats:

ONNX (primary, safe): The sklearn LogisticRegression is converted to ONNX via skl2onnx. ONNX is a language-agnostic, open standard that does not allow arbitrary code execution on load. The TF-IDF vectorizer vocabulary and IDF weights are saved separately as JSON. Each model produces its own files: model_candidate.onnx / model_ner.onnx and tfidf_config_candidate.json / tfidf_config_ner.json.

Pickle (fallback): model_{tag}.pkl files are generated for quick iteration and contain the model, TF-IDF vectorizer, one-hot encoder, and label column name. Pickle can execute arbitrary code on load, so these files should not be used in production or loaded from untrusted sources.

5.5 Outputs

File	Contents
`models/model_{tag}.onnx`	ONNX-format logistic regression model (safe to load)
`models/tfidf_config_{tag}.json`	TF-IDF vocabulary + IDF weights as JSON
`models/model_{tag}.pkl`	Pickle fallback with model + vectorizer + OHE
`images/roc_pr_curves_{tag}.png`	Side-by-side ROC and Precision-Recall curve plots per class
`reports/error_report_{tag}.csv`	Every misclassified test sample with text, label, true/predicted class, confidence
`COMPARISON.md`	Auto-generated model comparison report (when using `--compare`)

5.6 Error Analysis

The console prints a breakdown of each confusion pair (e.g., True=incorrect -> Pred=correct) with sample count and 3 example errors showing the text, label, and model confidence. The full error set is in reports/error_report_{tag}.csv for manual review.

6. Prediction

Script: scripts/predict.py

6.1 Usage

# Candidate model (default, 3-class: correct / partial / incorrect)
python scripts/predict.py "<text>" "<candidate_label>"

# NER model (2-class: correct / incorrect)
python scripts/predict.py "<text>" "<candidate_label>" --model ner

# With pre-filter (reject non-airline labels before running the model)
python scripts/predict.py "<text>" "<candidate_label>" --filter

6.2 Candidate Model Process

Load model_candidate.onnx via ONNX Runtime
Reconstruct the TF-IDF vectorizer from tfidf_config_candidate.json
Load the one-hot encoder from model_candidate.pkl
Use the user's candidate label directly as the model input
Compute TF-IDF features, one-hot encode the label, compute handcrafted similarity features
Stack all features into a single dense float32 vector
Run ONNX inference -> predicted class and probabilities
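
To illustrate the "reconstruct the TF-IDF vectorizer from JSON" step, the sketch below rebuilds a TF-IDF vector from a vocabulary and IDF weights in plain Python. The JSON schema (`vocabulary`, `idf`) and the toy weights are hypothetical, since the real layout of tfidf_config_candidate.json is not shown here. Stop-word removal is omitted for brevity, so results can differ from the trained vectorizer on texts containing stop words.

```python
import json
import math
import re

# Hypothetical config; the real tfidf_config_candidate.json schema may differ.
config_json = json.dumps({
    "vocabulary": {"flew": 0, "united": 1, "airlines": 2,
                   "united airlines": 3, "denver": 4},
    "idf": [1.5, 1.2, 1.1, 1.8, 1.4],
})


def tfidf_vector(text: str, config: dict) -> list:
    vocab, idf = config["vocabulary"], config["idf"]
    words = re.findall(r"\b\w\w+\b", text.lower())  # tokens of 2+ characters
    terms = words + [" ".join(pair) for pair in zip(words, words[1:])]  # uni + bigrams
    vec = [0.0] * len(idf)
    for term in terms:
        if term in vocab:
            vec[vocab[term]] += idf[vocab[term]]  # accumulates term count * idf
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec  # L2 normalization


config = json.loads(config_json)
print(tfidf_vector("I flew United Airlines to Denver", config))
```

Terms outside the stored vocabulary are ignored, so an empty or unrelated text yields an all-zero vector.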

6.3 NER Model Process

The NER model uses a route-through-NER approach. Instead of feeding the user's candidate label directly into the model, it first runs the NER pipeline to independently extract an airline from the text, then compares that extraction against the user's label.

Load model_ner.onnx via ONNX Runtime
Run spaCy NER on the input text to extract entities
Fuzzy-match extracted entities against the airline lookup to resolve a canonical airline name (ner_resolved)
Use ner_resolved (not the user's label) as the model's candidate label input
Compute agreement between the user's candidate label and ner_resolved:
- 1.0 -- user's label matches NER output (same canonical airline)
- 0.5 -- user's label is a variant/abbreviation of the NER airline
- 0.0 -- no match (wrong airline, non-airline label, or NER found nothing)
Build feature vector: TF-IDF + one-hot of ner_resolved + handcrafted features + NER metadata (match type, label source, confidence, agreement)
Run ONNX inference -> predicted class and probabilities

Hard rule: If NER extracts an airline but agreement with the user's label is 0.0, the prediction short-circuits to incorrect without running the model. This handles cases that the model's features alone cannot distinguish (e.g., "Denver" vs "United Airlines": the model sees strong NER signals, but the label is completely wrong).

This means:

"Denver" as label when NER found "United Airlines" -> agreement 0.0 -> incorrect (hard rule)
"United" as label when NER found "United Airlines" -> agreement 0.5 -> runs through model
"United Airlines" as label when NER found "United Airlines" -> agreement 1.0 -> runs through model
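
The three cases above reduce to a small guard in front of the model call. In this sketch, `predict_with_hard_rule`, the `run_model` callable, and the returned dictionary keys are hypothetical names, not the interface of scripts/predict.py.

```python
def predict_with_hard_rule(label: str, ner_resolved: str, agreement: float, run_model):
    """Short-circuit to `incorrect` when NER found an airline the label disagrees with.

    `run_model` is a zero-argument callable standing in for ONNX inference.
    """
    if ner_resolved != "NONE" and agreement == 0.0:
        return {
            "prediction": "incorrect",
            "reason": f'NER found "{ner_resolved}" but label "{label}" does not match',
            "model_ran": False,
        }
    return {"prediction": run_model(), "reason": None, "model_ran": True}


# Stand-in for the real model; always answers "correct" here.
fake_model = lambda: "correct"

print(predict_with_hard_rule("Denver", "United Airlines", 0.0, fake_model))
print(predict_with_hard_rule("United", "United Airlines", 0.5, fake_model))
```

When NER finds nothing (`"NONE"`), the guard does not fire even though agreement is 0.0, so the model still runs in that case.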

6.4 Example Output

Candidate Model:

python scripts/predict.py "I flew United Airlines to Denver" "United Airlines"

Model:           candidate (3-class model trained on pre-assigned candidate labels)
Text:            I flew United Airlines to Denver
Candidate label: United Airlines
Prediction:      correct
Probabilities:
   incorrect: 0.2977
     partial: 0.2776
     correct: 0.4235

python scripts/predict.py "I flew United Airlines to Denver" "United Air"

Model:           candidate (3-class model trained on pre-assigned candidate labels)
Text:            I flew United Airlines to Denver
Candidate label: United Air
Prediction:      partial
Probabilities:
   incorrect: 0.2177
     partial: 0.7881
     correct: 0.0002

python scripts/predict.py "I flew United Airlines to Denver" "Denver"

Model:           candidate (3-class model trained on pre-assigned candidate labels)
Text:            I flew United Airlines to Denver
Candidate label: Denver
Prediction:      incorrect
Probabilities:
   incorrect: 0.7039
     partial: 0.2976
     correct: 0.0005

NER model:

python scripts/predict.py "I flew United Airlines to Denver" "United Airlines" --model ner

Model:           ner (2-class model trained on NER-extracted labels)
Text:            I flew United Airlines to Denver
Candidate label: United Airlines
NER extracted:   United Airlines
Agreement:       1.0
Prediction:      correct
Probabilities:
   incorrect: 0.0083
     correct: 0.9917

python scripts/predict.py "I flew United Airlines to Denver" "United Air" --model ner

Model:           ner (2-class model trained on NER-extracted labels)
Text:            I flew United Airlines to Denver
Candidate label: United Air
NER extracted:   United Airlines
Agreement:       1.0
Prediction:      correct
Probabilities:
   incorrect: 0.0083
     correct: 0.9917

python scripts/predict.py "I flew United Airlines to Denver" "Denver" --model ner

Model:           ner (2-class model trained on NER-extracted labels)
Text:            I flew United Airlines to Denver
Candidate label: Denver
NER extracted:   United Airlines
Agreement:       0.0
Prediction:      incorrect
Reason:          NER found "United Airlines" but label "Denver" does not match

6.5 Pre-Filter (--filter)

An optional --filter flag runs a fuzzy match against the airline lookup before invoking the model. If the candidate label doesn't match any known airline form (threshold = 80, using token_set_ratio), it short-circuits to "incorrect" without loading the model. This is useful for fast rejection of obviously non-airline labels but is not required; both models can handle non-airline labels through their training data.
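
The gate can be sketched as a check that runs before any model is loaded. The default scorer below is a difflib stand-in on a 0-100 scale, not rapidfuzz's token_set_ratio, so borderline labels may be accepted or rejected differently; the `scorer` parameter lets the real function be passed in. `KNOWN_FORMS`, `passes_prefilter`, and `predict` are hypothetical names.

```python
from difflib import SequenceMatcher

# Hypothetical subset of the airline lookup.
KNOWN_FORMS = ["United Airlines", "United", "UA", "Delta Air Lines", "Delta", "DL"]


def default_scorer(a: str, b: str) -> float:
    """Stand-in 0-100 similarity (NOT rapidfuzz.fuzz.token_set_ratio)."""
    return 100.0 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


def passes_prefilter(label: str, scorer=default_scorer, threshold: float = 80.0) -> bool:
    """True if the label is close enough to at least one known airline form."""
    return any(scorer(label, form) >= threshold for form in KNOWN_FORMS)


def predict(text: str, label: str, run_model, use_filter: bool = False) -> str:
    if use_filter and not passes_prefilter(label):
        return "incorrect"  # rejected early; run_model is never called
    return run_model(text, label)


print(predict("I flew United Airlines to Denver", "Denver",
              run_model=lambda t, l: "partial", use_filter=True))
```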

6.6 Edge Cases

Unknown Airline Label (Candidate model): The one-hot encoder outputs all zeros (handle_unknown="ignore"). The model relies on TF-IDF and similarity features. Training data includes non-airline labels (city names, company names, airport codes) as incorrect examples.
Unknown Airline Label (NER model): The NER pipeline runs independently of the label and resolves an airline from the text. Agreement with the unknown label is 0.0, so when NER finds an airline the hard rule in 6.3 returns incorrect without running the model.
NER Finds Nothing: ner_resolved = "NONE", confidence = 0, agreement = 0. The model treats this as low signal.
Empty Text: Produces a zero TF-IDF vector; prediction relies on label features and agreement.
