# 🍷 Wine & Food NLP with spaCy

This project builds two lightweight NLP models using [spaCy](https://spacy.io) for handling natural language queries related to wine and food. It supports:

- **Intent Classification**: Distinguish wine-recommendation queries from food-recommendation queries.
- **Named Entity Recognition (NER)**: Extract key details like wine name, price, and tasting descriptions.

---
<br>

### Text Classification Model (Intent Detection)
Example Inputs:
- “Recommend a red wine under 300 HKD that pairs with grilled lamb.” → recommend_wine
- “What should I cook for dinner to go with a chilled bottle of Sancerre?” → recommend_food


Output Example
- {'recommend_wine': 0.01, 'recommend_food': 0.99}
<br>

---



In [26]:
import spacy
from spacy.training.example import Example

# prepare training data with annotations
TRAIN_DATA = [
    ("What wine goes well with spicy Thai green curry?", {"cats": {"recommend_wine": 1.0, "recommend_food": 0.0}}),
    ("Suggest a red wine under 300 HKD that pairs with grilled lamb.", {"cats": {"recommend_wine": 1.0, "recommend_food": 0.0}}),
    ("I have a bottle of Amarone — what foods would pair well with it?", {"cats": {"recommend_wine": 0.0, "recommend_food": 1.0}}),
    ("What should I cook for dinner to go with a chilled bottle of Sancerre?", {"cats": {"recommend_wine": 0.0, "recommend_food": 1.0}}),
    ("Suggest a celebratory wine that works with oysters and has high acidity.", {"cats": {"recommend_wine": 1.0, "recommend_food": 0.0}}),
    ("I'm cooking mushroom risotto and want something medium-bodied and earthy to go with it.", {"cats": {"recommend_wine": 1.0, "recommend_food": 0.0}}),
    ("Pair a bold Napa Cabernet Sauvignon with sushi.", {"cats": {"recommend_wine": 1.0, "recommend_food": 0.0}}),
    ("What are the best dishes to serve with a 2020 Puligny-Montrachet Chardonnay?", {"cats": {"recommend_wine": 0.0, "recommend_food": 1.0}}),
    ("Can you suggest a full-course meal to go with a vintage Champagne?", {"cats": {"recommend_wine": 0.0, "recommend_food": 1.0}}),
    ("What kind of food works well with a sweet Riesling from Mosel?", {"cats": {"recommend_wine": 0.0, "recommend_food": 1.0}})
]

# create blank NLP pipeline and add labels
nlp = spacy.blank("en")
textcat = nlp.add_pipe("textcat")
textcat.add_label("recommend_wine")
textcat.add_label("recommend_food")

# initialize the model weights, get an optimizer, and train
optimizer = nlp.initialize()
for i in range(20):
    losses = {}
    for text, annotation in TRAIN_DATA:
        example = Example.from_dict(nlp.make_doc(text), annotation)
        nlp.update([example], sgd=optimizer, losses=losses)
    print(f"Epoch {i+1}, Losses: {losses}")

# save text categorization model
nlp.to_disk("textcat_model")

Epoch 1, Losses: {'textcat': 1.0340981185436249}
Epoch 2, Losses: {'textcat': 0.7742716521024704}
Epoch 3, Losses: {'textcat': 0.4202875867486}
Epoch 4, Losses: {'textcat': 0.13618924655020237}
Epoch 5, Losses: {'textcat': 0.02502437774091959}
Epoch 6, Losses: {'textcat': 0.002704028840526007}
Epoch 7, Losses: {'textcat': 0.0002550006211095024}
Epoch 8, Losses: {'textcat': 3.303227640572004e-05}
Epoch 9, Losses: {'textcat': 6.782430546081741e-06}
Epoch 10, Losses: {'textcat': 2.1016941502693953e-06}
Epoch 11, Losses: {'textcat': 8.968619056304306e-07}
Epoch 12, Losses: {'textcat': 4.847795551654599e-07}
Epoch 13, Losses: {'textcat': 3.1036319469990303e-07}
Epoch 14, Losses: {'textcat': 2.2392674026150416e-07}
Epoch 15, Losses: {'textcat': 1.7591662349047965e-07}
Epoch 16, Losses: {'textcat': 1.4668297509956574e-07}
Epoch 17, Losses: {'textcat': 1.2749947941870232e-07}
Epoch 18, Losses: {'textcat': 1.1411040290454366e-07}
Epoch 19, Losses: {'textcat': 1.0419687868079563e-07}
Epoch 20, L

<br>

To use the trained model, load it and pass a query.

In [27]:
nlp_textcat = spacy.load("textcat_model")

query = "What should I cook to go with my Champagne?" # test a query out!
doc = nlp_textcat(query)
print(doc.cats)

{'recommend_wine': 0.1159258708357811, 'recommend_food': 0.8840740919113159}
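
Since `doc.cats` is a plain dictionary of label scores, a downstream application usually needs to reduce it to a single intent. The following is a minimal sketch of one way to do that; the helper name `top_intent` and the `threshold` of 0.6 are illustrative assumptions, not part of the project.

```python
def top_intent(cats, threshold=0.6):
    """Return the highest-scoring label, or None if no score reaches the threshold.

    `threshold` is a hypothetical cut-off; tune it on held-out queries.
    """
    if not cats:
        return None
    label, score = max(cats.items(), key=lambda item: item[1])
    return label if score >= threshold else None


# Scores copied from the prediction shown above.
scores = {"recommend_wine": 0.1159258708357811, "recommend_food": 0.8840740919113159}
print(top_intent(scores))  # recommend_food
```

Returning `None` for low-confidence predictions lets the caller ask a clarifying question instead of guessing.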


<br>

### Named Entity Recognition (NER) Model
Extracted Fields:
- wine_name: “Silver Oak 2019”
- price: “$120”
- description_phrase: “vanilla and cherries”

Output Example
- wine_name : 2020 Opus One
- price : $299
- description_phrase : chocolate, spice, and tobacco

---

**Quick Notes**:
- A common problem when training an NER model with multiple labels is annotation quality.
- Each annotation's character offsets (start index and end index) must line up exactly with the start and end of the token(s) it covers.
- For example, in "Try Château Margaux for $350.", "Château Margaux" is a valid token span from start=4 to end=19. With end=18 the span stops one character short of the token boundary, so spaCy cannot align it with the tokens.
- Keep this in mind when generating training data with LLMs, which are prone to inaccurate indexing. An annotation tool such as Prodigy can help with labelling training data.
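
One way to avoid hand-counted offsets is to compute them from the text itself. The sketch below uses plain `str.find`; the helper name `find_span` is an assumption made for illustration. It only guarantees that the offsets match the phrase, so a token-boundary check like `check_alignment` below is still worth running.

```python
def find_span(text, phrase, label):
    """Return (start, end, label) for the first occurrence of `phrase` in `text`."""
    start = text.find(phrase)
    if start == -1:
        raise ValueError(f"{phrase!r} not found in {text!r}")
    # The end offset is exclusive, matching spaCy's entity annotation format.
    return (start, start + len(phrase), label)


text = "Try Château Margaux for $350."
print(find_span(text, "Château Margaux", "wine_name"))  # (4, 19, 'wine_name')
```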

In [5]:
import spacy
from spacy.training.example import Example
from spacy.tokens import DocBin

nlp = spacy.blank("en")

In [6]:
# function for checking misaligned indices
def check_alignment(train_data):
    """Print, for every sample, whether each annotated span lands on token boundaries."""
    for i, (text, annots) in enumerate(train_data):
        doc = nlp.make_doc(text)  # tokenization only; no trained components needed
        errors = False

        print(f"Sample {i+1}: {text}")
        for start, end, label in annots["entities"]:
            # strict mode returns None unless (start, end) matches token boundaries exactly
            span = doc.char_span(start, end, label=label, alignment_mode="strict")
            if span is None:
                print(f"Misaligned span: '{text[start:end]}' -> ({start}, {end}) for label {label}")
                errors = True
            else:
                print(f"Valid span: {span.text} ({span.start_char}, {span.end_char})")

        if not errors:
            print("all spans aligned")

In [19]:
# Training data for NER
NER_DATA = [
    ("Can you recommend a red wine under $40 for a dinner party?",
     {"entities": [
         (20, 23, "wine_type"),          # "red"
         (36, 38, "max_price")           # "40" (the "$" is excluded)
     ]}),

    ("I need a white wine priced above 20 USD for seafood.",
     {"entities": [
         (9, 14, "wine_type"),          # "white"
         (33, 35, "min_price")           # "20" ("USD" is excluded)
     ]}),

    ("Looking for Champagne or Prosecco to celebrate a birthday.",
     {"entities": [
         (12, 21, "wine_name"),          # "Champagne"
         (25, 33, "wine_name")           # "Prosecco"
     ]}),

    ("Suggest a wine under $25 that goes well with grilled chicken.",
     {"entities": [
         (22, 24, "max_price")           # "25" (the "$" is excluded)
     ]}),

    ("Is there a full-bodied red under 100 HKD?",
     {"entities": [
         (23, 26, "wine_type"),          # "red"
         (33, 36, "max_price")           # "100" ("HKD" is excluded)
     ]}),

    ("I’m in the mood for an oaked Chardonnay.",
     {"entities": [
         (29, 39, "wine_name")           # "Chardonnay" ("oaked" is excluded)
     ]}),

    ("Any bold Syrah or Malbec options for under $60?",
     {"entities": [
         (9, 14, "wine_name"),           # "Syrah"
         (18, 24, "wine_name"),          # "Malbec"
         (44, 46, "max_price")           # "60" (the "$" is excluded)
     ]}),

    ("Looking for a wine that costs at least 30 USD, preferably red.",
     {"entities": [
         (39, 41, "min_price"),          # "30" ("USD" is excluded)
         (58, 61, "wine_type")           # "red"
     ]}),

    ("I’d like a sparkling wine around 50 dollars.",
     {"entities": [
         (11, 20, "wine_type"),          # "sparkling" ("wine" is excluded)
         (33, 35, "max_price")           # "50" ("dollars" is excluded)
     ]}),

    ("Do you have a Pinot Noir or Merlot from California?",
     {"entities": [
         (14, 24, "wine_name"),          # "Pinot Noir"
         (28, 34, "wine_name")           # "Merlot"
     ]}),
]
    



In [20]:
check_alignment(NER_DATA)

Sample 1: Can you recommend a red wine under $40 for a dinner party?
Valid span: red (20, 23)
Valid span: 40 (36, 38)
all spans aligned
Sample 2: I need a white wine priced above 20 USD for seafood.
Valid span: white (9, 14)
Valid span: 20 (33, 35)
all spans aligned
Sample 3: Looking for Champagne or Prosecco to celebrate a birthday.
Valid span: Champagne (12, 21)
Valid span: Prosecco (25, 33)
all spans aligned
Sample 4: Suggest a wine under $25 that goes well with grilled chicken.
Valid span: 25 (22, 24)
all spans aligned
Sample 5: Is there a full-bodied red under 100 HKD?
Valid span: red (23, 26)
Valid span: 100 (33, 36)
all spans aligned
Sample 6: I’m in the mood for an oaked Chardonnay.
Valid span: Chardonnay (29, 39)
all spans aligned
Sample 7: Any bold Syrah or Malbec options for under $60?
Valid span: Syrah (9, 14)
Valid span: Malbec (18, 24)
Valid span: 60 (44, 46)
all spans aligned
Sample 8: Looking for a wine that costs at least 30 USD, preferably red.
Valid span: 30 (39, 41)

In [21]:
labels = ["wine_name", "min_price", "max_price", "wine_type"]

# create blank NER pipeline and add labels
nlp_ner = spacy.blank("en")
ner = nlp_ner.add_pipe("ner")

for label in labels:
    ner.add_label(label)

# train model
optimizer = nlp_ner.initialize()
for i in range(30):
    losses = {}
    for text, annotations in NER_DATA:
        example = Example.from_dict(nlp_ner.make_doc(text), annotations)
        nlp_ner.update([example], sgd=optimizer, drop=0.1, losses=losses)
    print(f"Epoch {i+1}, losses: {losses}")

nlp_ner.to_disk("ner_model")

Epoch 1, losses: {'ner': np.float32(97.55092)}
Epoch 2, losses: {'ner': np.float32(34.74291)}
Epoch 3, losses: {'ner': np.float32(22.620691)}
Epoch 4, losses: {'ner': np.float32(23.715134)}
Epoch 5, losses: {'ner': np.float32(13.426863)}
Epoch 6, losses: {'ner': np.float32(15.221252)}
Epoch 7, losses: {'ner': np.float32(2.5267994)}
Epoch 8, losses: {'ner': np.float32(1.537985)}
Epoch 9, losses: {'ner': np.float32(1.9582396)}
Epoch 10, losses: {'ner': np.float32(2.3020358)}
Epoch 11, losses: {'ner': np.float32(0.52321017)}
Epoch 12, losses: {'ner': np.float32(0.8099005)}
Epoch 13, losses: {'ner': np.float32(0.0007082283)}
Epoch 14, losses: {'ner': np.float32(0.00020773779)}
Epoch 15, losses: {'ner': np.float32(4.023227e-05)}
Epoch 16, losses: {'ner': np.float32(1.7405722e-05)}
Epoch 17, losses: {'ner': np.float32(6.123233e-06)}
Epoch 18, losses: {'ner': np.float32(7.8020275e-06)}
Epoch 19, losses: {'ner': np.float32(8.219353e-06)}
Epoch 20, losses: {'ner': np.float32(1.988422e-06)}
Epoc

In [18]:
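def extract_wine_query_fields(doc):
    """Collect recognized entities from a processed doc into a dict of query fields."""
    # Only labels that appear as keys here are stored; any other label is printed but dropped.
    # Note: the NER model trained in this notebook uses the labels wine_name, wine_type,
    # min_price and max_price, so of those only wine_name is captured by this dict.
    fields = {
        "wine_name": None,
        "price": None,
        "description_phrase": None,
        "pairing_item": [],
        "region": None,
        "occasion": None
    }

    for ent in doc.ents:
        print(f"Detected: {ent.text} ({ent.label_})")
        if ent.label_ == "pairing_item":
            # a query can mention several pairing items, so these are collected in a list
            fields["pairing_item"].append(ent.text)
        elif ent.label_ in fields:
            # single-valued field: a later entity with the same label overwrites an earlier one
            fields[ent.label_] = ent.text

    # normalize "no pairing items found" to None, like the other fields
    if not fields["pairing_item"]:
        fields["pairing_item"] = None

    return fields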


In [22]:
text = "I’d like a sparkling wine around 50 dollars."
doc = nlp_ner(text)

print([(ent.text, ent.label_) for ent in doc.ents])


[('sparkling', 'wine_type'), ('50', 'max_price')]
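
A query can contain several entities with the same label (for example "Syrah or Malbec"), so it can be useful to group the `(text, label)` pairs by label rather than keep one value per label. The following is a small sketch under that assumption; the helper name `group_entities` is hypothetical.

```python
from collections import defaultdict


def group_entities(entities):
    """Group (text, label) pairs by label, keeping every value in order."""
    grouped = defaultdict(list)
    for text, label in entities:
        grouped[label].append(text)
    return dict(grouped)


# The pairs printed by the cell above.
print(group_entities([("sparkling", "wine_type"), ("50", "max_price")]))
# {'wine_type': ['sparkling'], 'max_price': ['50']}

# Pairs mirroring the annotations of training sample 7 (not a model prediction).
print(group_entities([("Syrah", "wine_name"), ("Malbec", "wine_name"), ("60", "max_price")]))
# {'wine_name': ['Syrah', 'Malbec'], 'max_price': ['60']}
```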


### Current Problems
- Label alignment issues lead to improper learning. To do: understand the root cause.
- Too few training samples for each field. To do: automate training data generation using spaCy projects.
- Possible issue to investigate: each combination of fields may be registered as unique.
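
As a rough illustration of how sample generation could be automated with offsets that are correct by construction, the sketch below fills `{label}` slots in a sentence template and records where each value lands. This is plain Python template filling, not spaCy projects; the function name `fill_template` and the template text are assumptions.

```python
import re


def fill_template(template, values):
    """Fill {label} slots and return (text, {"entities": [(start, end, label), ...]})."""
    parts = []
    entities = []
    pos = 0      # read position in the template
    out_len = 0  # length of the text produced so far
    for match in re.finditer(r"\{(\w+)\}", template):
        literal = template[pos:match.start()]
        parts.append(literal)
        out_len += len(literal)
        label = match.group(1)
        value = values[label]
        # The value starts at the current output length; end offset is exclusive.
        entities.append((out_len, out_len + len(value), label))
        parts.append(value)
        out_len += len(value)
        pos = match.end()
    parts.append(template[pos:])
    return "".join(parts), {"entities": entities}


text, annots = fill_template(
    "Can you recommend a {wine_type} wine under ${max_price}?",
    {"wine_type": "red", "max_price": "40"},
)
print(text)    # Can you recommend a red wine under $40?
print(annots)  # {'entities': [(20, 23, 'wine_type'), (36, 38, 'max_price')]}
```

Generated samples can still be passed through `check_alignment` to confirm the spans fall on token boundaries.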


### How to Use
- Train each model separately in Jupyter or with a Python script.
- Use spacy.load() to load and run predictions.
- You can optionally combine both models into a single pipeline or deploy via FastAPI.
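
A minimal sketch of using both models together is shown below. It runs the two pipelines one after the other instead of merging them into a single spaCy pipeline; the function name `analyze_query` and the shape of the returned dictionary are assumptions for illustration.

```python
def analyze_query(query, textcat_nlp, ner_nlp):
    """Run both pipelines on one query and merge their results.

    `textcat_nlp` and `ner_nlp` are assumed to be callables that return a
    spaCy Doc, such as the pipelines loaded from the saved model folders.
    """
    cats = textcat_nlp(query).cats
    ents = ner_nlp(query).ents
    return {
        # label with the highest score, or None if the classifier returned nothing
        "intent": max(cats, key=cats.get) if cats else None,
        "scores": dict(cats),
        "entities": [(ent.text, ent.label_) for ent in ents],
    }
```

With the models saved earlier, this could be called as `analyze_query(query, spacy.load("textcat_model"), spacy.load("ner_model"))`, and the returned dictionary could serve as the response body of a FastAPI or Flask endpoint.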

### Next Steps
- Add more examples for robust predictions
- Combine TextCat and NER in one spaCy pipeline
- Deploy with Flask or FastAPI for real-time use
- Use Label Studio or Prodigy to label at scale