A logistic regression pipeline that predicts whether a candidate airline label is correct, partially correct, or incorrect for a given text snippet.
- Python Packages
joblib
matplotlib
numpy
onnxruntime
pandas
rapidfuzz
scikit-learn
skl2onnx
spacy
- SpaCy Model
python -m spacy download en_core_web_md
- Overview
- Dataset Generation
- NER Entity Extraction
- Feature Engineering
- Model Training & Evaluation
- Prediction
- File Reference
The problem: given a text snippet (e.g., a customer review, a tweet, a flight status message) and a candidate airline name, determine how accurately the airline label matches the text.
Three-class target:
| Class | Code | Meaning |
|---|---|---|
| correct | 2 | The full airline name accurately identifies the airline in the text |
| partial | 1 | A short form, abbreviation, or IATA code that refers to the right airline but isn't the canonical name |
| incorrect | 0 | The label is wrong -- either the wrong airline, or no airline is mentioned at all |
Pipeline:
generate_data -> extract_entities -> train_model -> predict
(labeled_data) (ner_labeled_data) (models, reports, images)
Script: scripts/generate_data.py
30 airlines (domestic US + international) are defined with their common surface forms:
- Full canonical name (e.g., "Scandinavian Airlines")
- Common short name (e.g., "SAS")
- IATA code (e.g., "SK")
- Informal variants (e.g., "scandinavian", "Scandinavian Air")
Four categories of text templates produce the snippets:
| Template Type | Example | Purpose |
|---|---|---|
| Direct mention | "My flight on Delta Air Lines was delayed by 3 hours at Denver airport." | Airline name appears explicitly |
| Indirect mention | "The airline refunded my Denver booking without any hassle." | Refers to an airline experience but never names it |
| No mention | "The weather in Denver was beautiful today." | No airline content at all |
| Ambiguous | "SAS programming in Chicago", "AA meeting in Denver" | Uses a word that matches an airline short form but in a non-airline context |
To prevent trivial substring matching, airline names in text are mutated via surface_form() and typo():
- Casing: lowercase, UPPERCASE, original
- Abbreviation: full name, short name, IATA code
- Typos: character deletion, character doubling, truncation, suffix misspelling ("Airlnes"), dropped double letters
Each sample gets a (text, candidate_label, label_quality) triple. The dataset is balanced into thirds:
Correct (10,000 samples): The candidate label is the canonical full airline name. The text may contain the full name, a short form, an IATA code, a typo, or no explicit name at all (indirect reference).
Partial (10,000 samples): The candidate label is a short form / abbreviation / IATA code of the correct airline. Same text variation as above.
Incorrect (10,000 samples): Split across 12 sub-buckets to create diverse failure modes:
| Bucket | Description |
|---|---|
| 0 | Wrong full airline name on text that explicitly mentions a different airline |
| 1 | Confusable airline pair (e.g., Delta/American, Emirates/Qatar) |
| 2 | Airline label on text with zero airline content |
| 3 | Short form of the wrong airline |
| 4 | Ambiguous word usage (e.g., "delta variant" labeled as Delta Air Lines) |
| 5 | Indirect text (no name) with wrong airline label |
| 6 | IATA code mismatch (code for airline X, label for airline Y) |
| 7 | Short form of a confusable airline |
| 8 | Non-airline label on text that mentions an airline (e.g., city names, company names, airport codes) |
| 9 | Non-airline label on text with no airline mention |
| 10 | Non-airline label on indirect airline text |
| 11 | Near-miss typo as label (e.g., "Untied", "Delt", "Sprit") on text mentioning an airline |
The non-airline label pool includes 120+ entries: city names, airport codes (LAX, JFK, etc.), travel terms, non-airline companies (Boeing, Expedia, etc.), generic words (Air, Airways, Jet), and deliberate near-miss typos.
data/labeled_data.csv -> 30,000 rows with columns: text, true_airline, candidate_label, label_quality
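The name mutation step described earlier in this section (surface_form() and typo()) can be sketched roughly as follows. This is a minimal illustration, not the code in scripts/generate_data.py: the signatures, the forms dictionary, and the choice to pass in a seeded random.Random are all assumptions, and only three of the listed typo kinds are shown.

```python
import random

def typo(name: str, rng: random.Random) -> str:
    """Apply one randomly chosen typo to a name (hypothetical signature)."""
    if len(name) < 4:
        return name  # too short to mutate safely
    kind = rng.choice(["delete", "double", "truncate"])
    i = rng.randrange(1, len(name) - 1)  # never touch the first or last character
    if kind == "delete":
        return name[:i] + name[i + 1:]       # character deletion
    if kind == "double":
        return name[:i] + name[i] + name[i:]  # character doubling
    return name[:-1]                          # truncation: drop the final character

def surface_form(forms: dict, rng: random.Random) -> str:
    """Pick an abbreviation level, then a casing (hypothetical signature)."""
    text = rng.choice([forms["full"], forms["short"], forms["iata"]])
    casing = rng.choice([str.lower, str.upper, lambda s: s])
    return casing(text)
```

Passing the random generator explicitly keeps the generated dataset reproducible for a fixed seed.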
Script: scripts/extract_entities.py
An optional preprocessing step that automatically extracts airline name candidates from text using Named Entity Recognition, rather than relying on manually assigned labels.
Each text is processed through spaCy's en_core_web_md model. Entities with the following NER labels are extracted as potential airline mentions:
| NER Label | Rationale |
|---|---|
| ORG | Airlines are organizations (primary signal) |
| PRODUCT | Sometimes airline names are tagged as products |
| GPE | Catches cases like "Alaska", "Hawaiian", "Singapore" |
| PERSON | Catches edge cases where names are misclassified |
Each extracted entity is matched against a lookup table of all known airline surface forms (canonical names + abbreviations + IATA codes):
- Exact match first: If the entity text (lowercased) exactly matches a known form, it maps directly to the canonical airline with 100% confidence
- Fuzzy fallback: If no exact match, rapidfuzz.fuzz.WRatio is used with a threshold of 70
- If no match exceeds the threshold, the result is "NONE"
Each match also records a ner_match_type:
| Match Type | Meaning |
|---|---|
| canonical | Matched the full airline name exactly |
| short_form | Matched a common abbreviation |
| iata_code | Matched a 2-letter IATA code |
| fuzzy | No exact form matched, resolved via fuzzy similarity |
| none | No airline entity found |
Each match also records a ner_label_source: spacy (exact match) or rapidfuzz (fuzzy fallback).
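The exact-then-fuzzy matching described above can be sketched as follows. To keep the example dependency-free, it uses difflib.SequenceMatcher from the standard library as a stand-in for rapidfuzz.fuzz.WRatio, so the scores will differ from the real pipeline. The LOOKUP table, the resolve() name, and the returned keys are illustrative assumptions.

```python
from difflib import SequenceMatcher

# Hypothetical lookup: lowercased surface form -> (canonical name, match type).
LOOKUP = {
    "scandinavian airlines": ("Scandinavian Airlines", "canonical"),
    "sas": ("Scandinavian Airlines", "short_form"),
    "sk": ("Scandinavian Airlines", "iata_code"),
    "delta air lines": ("Delta Air Lines", "canonical"),
    "delta": ("Delta Air Lines", "short_form"),
}

def resolve(entity: str, threshold: float = 70.0) -> dict:
    """Map one extracted entity to a canonical airline, or "NONE"."""
    key = entity.strip().lower()
    # 1) Exact match on a known surface form: confidence is 100.
    if key in LOOKUP:
        name, match_type = LOOKUP[key]
        return {"label": name, "confidence": 100.0,
                "match_type": match_type, "source": "spacy"}
    # 2) Fuzzy fallback: best-scoring known form (difflib stand-in, 0-100 scale).
    best_form, best_score = None, 0.0
    for form in LOOKUP:
        score = 100.0 * SequenceMatcher(None, key, form).ratio()
        if score > best_score:
            best_form, best_score = form, score
    if best_form is not None and best_score >= threshold:
        return {"label": LOOKUP[best_form][0], "confidence": best_score,
                "match_type": "fuzzy", "source": "rapidfuzz"}
    # 3) Nothing cleared the threshold.
    return {"label": "NONE", "confidence": 0.0, "match_type": "none", "source": ""}
```

For example, resolve("SK") takes the exact path with match type iata_code, while a misspelling such as "Delta Air Lnes" falls through to the fuzzy branch.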
The script computes two additional columns:
ner_label_agrees -- Compares the existing candidate_label against ner_candidate_label:
- 1.0 -- candidate resolves to the same canonical airline as NER
- 0.5 -- candidate is a variant of the NER airline
- 0.0 -- no agreement
ner_label_quality -- Compares ner_candidate_label against true_airline (ground truth from generation):
- correct -- NER output matches the true airline (canonical key)
- incorrect -- NER output doesn't match (wrong airline, NONE, or no true airline)
Note: There is no partial class for NER because the NER pipeline always resolves to canonical names.
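One possible reading of these two columns is sketched below. The boundary between 1.0 and 0.5 is an assumption here (1.0 when the candidate equals the canonical name, 0.5 when it is another known surface form of the same airline); the script may draw that line differently. The FORMS table and both function bodies are illustrative only.

```python
# Hypothetical table: canonical name -> other known surface forms (lowercased).
FORMS = {
    "United Airlines": {"united", "ua", "united air"},
    "Delta Air Lines": {"delta", "dl"},
}

def ner_label_agrees(candidate: str, ner_label: str) -> float:
    """Score agreement between the assigned label and the NER label."""
    if ner_label == "NONE":
        return 0.0
    cand = candidate.strip().lower()
    if cand == ner_label.lower():
        return 1.0  # assumed: same canonical name
    if cand in FORMS.get(ner_label, set()):
        return 0.5  # assumed: variant of the NER airline
    return 0.0

def ner_label_quality(ner_label: str, true_airline: str) -> str:
    """Two-class quality: there is no partial class for NER output."""
    if not true_airline or ner_label == "NONE":
        return "incorrect"
    return "correct" if ner_label == true_airline else "incorrect"
```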
| Column | Description |
|---|---|
| ner_entities | JSON list of all extracted entities with their NER labels |
| ner_candidate_label | Best-match canonical airline name (or "NONE") |
| ner_confidence | Match score (0-100) |
| ner_match_type | How the match was made (canonical/short_form/iata_code/fuzzy/none) |
| ner_label_source | Which system provided the match (spacy/rapidfuzz) |
| ner_label_agrees | Agreement between candidate_label and ner_candidate_label (0.0/0.5/1.0) |
| ner_label_quality | NER-derived label quality (correct/incorrect) |
The script prints:
- Extraction rate: percentage of rows with at least one airline entity
- Agreement rate: how often ner_candidate_label matches candidate_label
- Confidence distribution: mean, median, min, max of non-zero confidence scores
- Top mismatches: most common (assigned label, NER label) disagreement pairs
- Match type distribution: counts per match type
- Label source distribution: spacy vs rapidfuzz counts
- Extraction rate by label quality: broken down by correct/partial/incorrect
- NER label quality distribution and crosstab against original label_quality
data/ner_labeled_data.csv -- all original columns plus the NER columns above.
Script: scripts/train_model.py (feature extraction happens at training time)
Both models share a common feature base. The NER model adds extra features on top.
Three feature groups are concatenated into a sparse matrix:
TfidfVectorizer on the text column:
- max_features=500, ngram_range=(1, 2), English stop words removed
- Captures which words and bigrams appear in each text
- Produces a 500-dimensional sparse vector per sample
One-hot encoded candidate label (fitted fresh during training):
- For the candidate model: encodes candidate_label (user-assigned)
- For the NER model: encodes ner_candidate_label (NER-extracted canonical name)
- Dimensionality equals the number of unique labels in training data
Five numeric features measuring how well the candidate label matches the text:
| Feature | Description |
|---|---|
| token_overlap | Fraction of label tokens found in the text tokens (word-level Jaccard numerator) |
| bigram_sim | Character bigram Jaccard similarity between label and text |
| text_len | Character length of the text |
| label_len | Character length of the candidate label |
| len_ratio | label_len / text_len |
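The five features in the table can be computed along these lines. This is a sketch under assumptions: whitespace tokenization, lowercasing, and the zero-division fallbacks are choices made for the example and may not match scripts/train_model.py.

```python
def char_bigrams(s: str) -> set:
    """Set of overlapping two-character substrings, lowercased."""
    s = s.lower()
    return {s[i:i + 2] for i in range(len(s) - 1)}

def handcrafted_features(text: str, label: str) -> dict:
    """Compute the five label-vs-text similarity features."""
    text_tokens = set(text.lower().split())
    label_tokens = label.lower().split()
    # Fraction of label tokens that also appear in the text.
    hits = sum(1 for t in label_tokens if t in text_tokens)
    token_overlap = hits / len(label_tokens) if label_tokens else 0.0
    # Jaccard similarity over character bigrams: |A & B| / |A | B|.
    a, b = char_bigrams(label), char_bigrams(text)
    union = a | b
    bigram_sim = len(a & b) / len(union) if union else 0.0
    text_len, label_len = len(text), len(label)
    len_ratio = label_len / text_len if text_len else 0.0
    return {
        "token_overlap": token_overlap,
        "bigram_sim": bigram_sim,
        "text_len": text_len,
        "label_len": label_len,
        "len_ratio": len_ratio,
    }
```

Note that a label such as "Denver" also gets a token_overlap of 1.0 on a text that mentions Denver, which is why these features alone cannot tell an airline from a city name.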
When training on ner_candidate_label, the model adds four extra feature groups:
| Feature | Type | Description |
|---|---|---|
| ner_match_type | One-hot (5 cols) | How the entity was matched: canonical, short_form, iata_code, fuzzy, none |
| ner_label_source | One-hot (3 cols) | Which system matched: spacy, rapidfuzz, or empty |
| ner_confidence | Numeric (1 col) | Match confidence score (0-100) |
| ner_label_agrees | Numeric (1 col) | Agreement between candidate_label and ner_candidate_label (0.0/0.5/1.0) |
All groups are horizontally stacked via scipy.sparse.hstack:
Candidate Model:
X = [TF-IDF (500) | One-Hot Label | Handcrafted (5)]
NER Model:
X = [TF-IDF (500) | One-Hot Label | Handcrafted (5) | Match Type (5) | Label Source (3) | Confidence (1) | Agreement (1)]
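To make the column layout concrete, the sketch below computes where each block starts and ends in the stacked matrix. It does not perform the stacking itself (the pipeline uses scipy.sparse.hstack for that); the function name and the group names are illustrative, and the TF-IDF width is taken as exactly 500 for simplicity.

```python
def feature_layout(n_labels: int, ner: bool = False) -> dict:
    """Return {group: (start, end)} column ranges plus the total width."""
    groups = [("tfidf", 500), ("label_onehot", n_labels), ("handcrafted", 5)]
    if ner:
        # Extra NER metadata blocks, in the order shown above.
        groups += [("match_type", 5), ("label_source", 3),
                   ("confidence", 1), ("agreement", 1)]
    layout, start = {}, 0
    for name, width in groups:
        layout[name] = (start, start + width)  # half-open range [start, end)
        start += width
    layout["total"] = start
    return layout
```

With a hypothetical 30 unique labels, the candidate matrix would be 535 columns wide and the NER matrix 545. The same column order has to be reproduced at prediction time.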
Script: scripts/train_model.py
- 80/20 split with stratify=y to preserve class balance
- random_state=42 for reproducibility
- Train: 24,000 samples, Test: 6,000 samples
Candidate model (--model candidate): Uses candidate_label as the feature, predicts label_quality (3-class: correct/partial/incorrect). Target: does the pre-assigned label match the text?
NER model (--model ner): Uses ner_candidate_label as the feature, predicts ner_label_quality (2-class: correct/incorrect). Target: does the NER-extracted label match the true airline? No partial class because NER always resolves to canonical names.
Both use Logistic Regression:
| Parameter | Value | Rationale |
|---|---|---|
| C | 1.0 | Default regularization strength (L2) |
| max_iter | 5000 | Sufficient for convergence on this feature set |
| solver | lbfgs | Default; handles multinomial well |
Use --compare to train both and generate a side-by-side comparison report.
Classification report: Per-class precision, recall, F1-score.
Confusion matrix: Predictions vs true labels (3x3 for the candidate model, 2x2 for the NER model).
AUC-ROC (One-vs-Rest): Each class is treated as binary (this class vs. all others). The ROC curve plots True Positive Rate vs False Positive Rate at varying probability thresholds. Reported per-class and as macro/weighted averages.
PR-AUC (One-vs-Rest): Precision-Recall curve area. More informative than ROC when classes are imbalanced (though ours are balanced). Reported per-class and as macro/weighted averages.
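The one-vs-rest AUC-ROC computation can be illustrated without any library. The sketch below relies on the fact that ROC AUC equals the probability that a randomly chosen positive scores higher than a randomly chosen negative (ties count as half). It is a quadratic-time illustration with made-up function names, not the metric code the training script uses.

```python
def binary_auc(y_true: list, scores: list) -> float:
    """ROC AUC for 0/1 labels via pairwise positive-vs-negative comparison."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def ovr_auc(y_true: list, proba: list, classes: list) -> tuple:
    """Per-class one-vs-rest AUC and the macro average.

    proba[i][k] is the predicted probability of classes[k] for sample i.
    """
    per_class = {}
    for k, cls in enumerate(classes):
        binary = [1 if y == cls else 0 for y in y_true]  # this class vs the rest
        per_class[cls] = binary_auc(binary, [row[k] for row in proba])
    macro = sum(per_class.values()) / len(per_class)
    return per_class, macro
```

The weighted average follows the same pattern, with each class's AUC weighted by its share of the test samples.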
The trained model is exported in two formats:
ONNX (primary, safe): The sklearn LogisticRegression is converted to ONNX via skl2onnx. ONNX is a language-agnostic, open standard that does not allow arbitrary code execution on load. The TF-IDF vectorizer vocabulary and IDF weights are saved separately as JSON. Each model produces its own files: model_candidate.onnx / model_ner.onnx and tfidf_config_candidate.json / tfidf_config_ner.json.
Pickle (fallback): model_{tag}.pkl files are generated for quick iteration and contain the model, TF-IDF vectorizer, one-hot encoder, and label column name. Should not be used in production or with untrusted files.
| File | Contents |
|---|---|
| models/model_{tag}.onnx | ONNX-format logistic regression model (safe to load) |
| models/tfidf_config_{tag}.json | TF-IDF vocabulary + IDF weights as JSON |
| models/model_{tag}.pkl | Pickle fallback with model + vectorizer + OHE |
| images/roc_pr_curves_{tag}.png | Side-by-side ROC and Precision-Recall curve plots per class |
| reports/error_report_{tag}.csv | Every misclassified test sample with text, label, true/predicted class, confidence |
| COMPARISON.md | Auto-generated model comparison report (when using --compare) |
The console prints a breakdown of each confusion pair (e.g., True=incorrect -> Pred=correct) with sample count and 3 example errors showing the text, label, and model confidence. The full error set is in reports/error_report_{tag}.csv for manual review.
Script: scripts/predict.py
# Candidate model (default, 3-class: correct / partial / incorrect)
python scripts/predict.py "<text>" "<candidate_label>"
# NER model (2-class: correct / incorrect)
python scripts/predict.py "<text>" "<candidate_label>" --model ner
# With pre-filter (reject non-airline labels before running the model)
python scripts/predict.py "<text>" "<candidate_label>" --filter
The candidate model runs these steps:
- Load model_candidate.onnx via ONNX Runtime
- Reconstruct the TF-IDF vectorizer from tfidf_config_candidate.json
- Load the one-hot encoder from model_candidate.pkl
- Use the user's candidate label directly as the model input
- Compute TF-IDF features, one-hot encode the label, compute handcrafted similarity features
- Stack all features into a single dense float32 vector
- Run ONNX inference -> predicted class and probabilities
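The TF-IDF reconstruction step can be sketched in plain Python. The JSON key names ("vocabulary", "idf") are assumptions about the config file, not confirmed by the source. The weighting follows scikit-learn's TfidfVectorizer defaults (raw term count times IDF, then L2 normalization); stop-word removal is left as an optional argument.

```python
import math
import re
from collections import Counter

TOKEN = re.compile(r"\b\w\w+\b")  # words of two or more characters

def tfidf_vector(text: str, config: dict, stop_words=frozenset()) -> list:
    """Rebuild a dense TF-IDF row from a saved vocabulary and IDF weights.

    config["vocabulary"] maps n-gram -> column index (assumed key name).
    config["idf"] is a list of IDF weights by column (assumed key name).
    """
    vocab, idf = config["vocabulary"], config["idf"]
    tokens = [t for t in TOKEN.findall(text.lower()) if t not in stop_words]
    # Unigrams plus bigrams, matching ngram_range=(1, 2).
    grams = tokens + [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    counts = Counter(g for g in grams if g in vocab)
    vec = [0.0] * len(idf)
    for gram, count in counts.items():
        col = vocab[gram]
        vec[col] = count * idf[col]
    norm = math.sqrt(sum(v * v for v in vec))
    # Empty or out-of-vocabulary text yields an all-zero row.
    return [v / norm for v in vec] if norm else vec
```

Storing only the vocabulary and IDF weights keeps the text side of the pipeline loadable from JSON, so prediction does not need to unpickle the vectorizer.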
The NER model uses a route-through-NER approach. Instead of feeding the user's candidate label directly into the model, it first runs the NER pipeline to independently extract an airline from the text, then compares that extraction against the user's label.
- Load model_ner.onnx via ONNX Runtime
- Run spaCy NER on the input text to extract entities
- Fuzzy-match extracted entities against the airline lookup to resolve a canonical airline name (ner_resolved)
- Use ner_resolved (not the user's label) as the model's candidate label input
- Compute agreement between the user's candidate label and ner_resolved:
  - 1.0 -- user's label matches NER output (same canonical airline)
  - 0.5 -- user's label is a variant/abbreviation of the NER airline
  - 0.0 -- no match (wrong airline, non-airline label, or NER found nothing)
- Build feature vector: TF-IDF + one-hot of ner_resolved + handcrafted features + NER metadata (match type, label source, confidence, agreement)
- Run ONNX inference -> predicted class and probabilities
Hard rule: If NER extracts an airline but agreement with the user's label is 0.0, the prediction short-circuits to incorrect without running the model. This handles cases that the model's features alone can't separate (e.g., "Denver" vs "United Airlines": the model sees strong NER signals even though the label is completely wrong).
This means:
- "Denver" as label when NER found "United Airlines" -> agreement 0.0 -> incorrect (hard rule)
- "United" as label when NER found "United Airlines" -> agreement 0.5 -> runs through model
- "United Airlines" as label when NER found "United Airlines" -> agreement 1.0 -> runs through model
Candidate Model:
python scripts/predict.py "I flew United Airlines to Denver" "United Airlines"
Model: candidate (3-class model trained on pre-assigned candidate labels)
Text: I flew United Airlines to Denver
Candidate label: United Airlines
Prediction: correct
Probabilities:
incorrect: 0.2977
partial: 0.2776
correct: 0.4235
python scripts/predict.py "I flew United Airlines to Denver" "United Air"
Model: candidate (3-class model trained on pre-assigned candidate labels)
Text: I flew United Airlines to Denver
Candidate label: United Air
Prediction: partial
Probabilities:
incorrect: 0.2177
partial: 0.7881
correct: 0.0002
python scripts/predict.py "I flew United Airlines to Denver" "Denver"
Model: candidate (3-class model trained on pre-assigned candidate labels)
Text: I flew United Airlines to Denver
Candidate label: Denver
Prediction: incorrect
Probabilities:
incorrect: 0.7039
partial: 0.2976
correct: 0.0005
NER Model:
python scripts/predict.py "I flew United Airlines to Denver" "United Airlines" --model ner
Model: ner (2-class model trained on NER-extracted labels)
Text: I flew United Airlines to Denver
Candidate label: United Airlines
NER extracted: United Airlines
Agreement: 1.0
Prediction: correct
Probabilities:
incorrect: 0.0083
partial: 0.9917
python scripts/predict.py "I flew United Airlines to Denver" "United Air" --model ner
Model: ner (2-class model trained on NER-extracted labels)
Text: I flew United Airlines to Denver
Candidate label: United Air
NER extracted: United Airlines
Agreement: 1.0
Prediction: correct
Probabilities:
incorrect: 0.0083
partial: 0.9917
python scripts/predict.py "I flew United Airlines to Denver" "Denver" --model ner
Model: ner (2-class model trained on NER-extracted labels)
Text: I flew United Airlines to Denver
Candidate label: Denver
NER extracted: United Airlines
Agreement: 0.0
Prediction: incorrect
Reason: NER found "United Airlines" but label "Denver" does not match
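The hard rule behind the last example can be sketched as a small guard in front of the model. The function name, the result keys, and the run_model callback are hypothetical; the real scripts/predict.py may structure this differently.

```python
def predict_ner(user_label: str, ner_resolved: str, agreement: float, run_model) -> dict:
    """Apply the hard rule first; only call the model when it does not fire.

    run_model is a zero-argument callable returning the predicted class
    (hypothetical stand-in for the ONNX inference step).
    """
    if ner_resolved != "NONE" and agreement == 0.0:
        # NER found an airline and the label disagrees completely.
        reason = f'NER found "{ner_resolved}" but label "{user_label}" does not match'
        return {"prediction": "incorrect", "reason": reason, "used_model": False}
    return {"prediction": run_model(), "reason": None, "used_model": True}
```

When NER finds nothing (ner_resolved is "NONE"), agreement is also 0.0, but the rule does not fire and the model still runs.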
An optional --filter flag runs a fuzzy match against the airline lookup before invoking the model. If the candidate label doesn't match any known airline form (threshold = 80, using token_set_ratio), it short-circuits to "incorrect" without loading the model. This is useful for fast rejection of obviously non-airline labels but is not required; both models can handle non-airline labels through their training data.
- Unknown Airline Label (Candidate model): The one-hot encoder outputs all zeros (handle_unknown="ignore"). The model relies on TF-IDF and similarity features. Training data includes non-airline labels (city names, company names, airport codes) as incorrect examples.
- Unknown Airline Label (NER model): The NER pipeline runs independently and resolves the correct airline. Agreement with the unknown label will be 0.0, pushing the prediction toward incorrect.
- NER Finds Nothing: ner_resolved = "NONE", confidence = 0, agreement = 0. The model treats this as low signal.
- Empty Text: Produces a zero TF-IDF vector; prediction relies on label features and agreement.
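The unknown-label behaviour is easy to see in a hand-rolled encoder. The sketch below mimics what handle_unknown="ignore" does in scikit-learn's OneHotEncoder; the function and the category list are illustrative, not the pipeline's encoder.

```python
def one_hot_ignore_unknown(label: str, categories: list) -> list:
    """One-hot encode a label; unseen labels map to an all-zero row."""
    vec = [0.0] * len(categories)
    if label in categories:
        vec[categories.index(label)] = 1.0
    return vec

# Hypothetical categories seen during training.
CATEGORIES = ["Delta Air Lines", "SAS", "United Airlines"]
known = one_hot_ignore_unknown("SAS", CATEGORIES)
unknown = one_hot_ignore_unknown("Narnia Air", CATEGORIES)
```

Here known has a single 1.0 in the "SAS" column, while unknown is all zeros, so the label block contributes nothing and the prediction rests on the remaining features.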