# A few-shot NER cook-off!

Hola! I'm David and I love cooking 👨🏽‍🍳 and coding 👨🏽‍💻. 

This blog post introduces **few-shot learning** and **Named Entity Recognition (NER)** without relying on fancy deep learning models, which are often only available for English. I will also show how to apply this to recognize ingredients in recipes, and how to use Rubrix to analyze the results and start building a manually curated training set.

During my time at Pandora Intelligence, I lobbied for creating cool open-source stuff. So recently, I wrote a [spaCy](https://spacy.io/) package called `concise-concepts`, which can be used for few-shot NER.

Do you want to stay updated on me or Pandora Intelligence? Follow us on [GitHub](https://github.com/Pandora-Intelligence)!

## Concise Concepts

Generally, pre-trained NER models are trained to recognize a fixed set of 17 general entity labels like Person (PER), Location (LOC), and Organization (ORG). For our use case, we would like to identify ingredients like 'fruits', 'vegetables', 'meat', 'dairy', 'herbs' and 'carbs'.

Ideally, we would train a custom model, which requires enough high-quality labelled training data for the new set of labels. Rubrix comes in handy for annotating, improving, and managing this data. But starting data annotation from scratch is often costly or not feasible, which is why a way to "pre-annotate", or even skip the annotation process altogether, can be highly beneficial. Within `concise-concepts`, few-shot NER is done by relying on the reasoning behind word2vec.

Models like word2vec are built on the idea that words used in similar contexts have similar meanings. During training, words are mapped to a vector space, where similar words should end up in the same region. This allows us to do two things:

1. Retrieve the *n* most similar words. Based on a few examples per label, we can now create a list of other words that should also belong to that label. Using this list, we can then label new data by finding exact word matches.
2. Compare word vectors for similarity. After having found these exact word matches, we can compare their vectors against the vectors of the entire group. This comparison indicates how representative the word is for the group, which can be used to provide a confidence score for the recognized entity.
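
To make these two steps a bit more tangible, here is a minimal sketch using plain Python. The three-dimensional vectors and the helper names (`expand`, `confidence`) are invented for illustration, and averaging the group into a centroid is just one simple way to compare a word against a group. This is not the actual `concise-concepts` implementation.

```python
import math

# Toy 3-dimensional "embeddings", invented purely for illustration.
# Real word2vec vectors are learned from text and have hundreds of dimensions.
VECTORS = {
    "apple": [0.9, 0.1, 0.0],
    "pear": [0.8, 0.2, 0.1],
    "orange": [0.85, 0.15, 0.05],
    "banana": [0.8, 0.1, 0.1],
    "beef": [0.1, 0.9, 0.1],
    "pork": [0.15, 0.85, 0.1],
    "pan": [0.1, 0.1, 0.9],
}


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def centroid(words):
    """Average vector of a group of words."""
    vectors = [VECTORS[w] for w in words]
    return [sum(dim) / len(vectors) for dim in zip(*vectors)]


def expand(examples, n):
    """Step 1: the n words closest to the examples' centroid, excluding the examples."""
    center = centroid(examples)
    candidates = [w for w in VECTORS if w not in examples]
    return sorted(candidates, key=lambda w: cosine(VECTORS[w], center), reverse=True)[:n]


def confidence(word, group):
    """Step 2: similarity between a matched word and the group's centroid."""
    return cosine(VECTORS[word], centroid(group))


fruit_examples = ["apple", "pear", "orange"]
fruit_words = fruit_examples + expand(fruit_examples, n=1)
print(fruit_words)
print(round(confidence("banana", fruit_words), 2))
print(round(confidence("pan", fruit_words), 2))
```

With these toy vectors, "banana" gets pulled into the fruit group and scores much higher against it than "pan" does, which is the intuition behind the confidence score.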

Want to know more about `concise-concepts`? Feel free to watch this [YouTube video](https://www.youtube.com/watch?v=DnQKOuG-I_0).

## A proof of concepts (pun intended)

Before gathering recipe data and actually uploading it to Rubrix, I will first show you the capabilities of `concise-concepts`.

In [1]:
import spacy
from spacy import displacy

import concise_concepts

data = {
    "fruit": ["apple", "pear", "orange"],
    "vegetable": ["broccoli", "spinach", "tomato", "garlic", "onion", "beans"],
    "meat": ["beef", "pork", "fish", "lamb", "bacon", "ham", "meatball"],
    "dairy": ["milk", "butter", "eggs", "cheese", "cheddar", "yoghurt", "egg"],
    "herbs": ["rosemary", "salt", "sage", "basil", "cilantro"],
    "carbs": ["bread", "rice", "toast", "tortilla", "noodles", "bagel", "croissant"],
}

text = """
    Heat the oil in a large pan and add the Onion, celery and carrots. 
    Then, cook over a medium–low heat for 10 minutes, or until softened. 
    Add the courgette, garlic, red peppers and oregano and cook for 2–3 minutes.
    Later, add some oranges and chickens. """

nlp = spacy.load("en_core_web_lg", disable=["ner"])
nlp.add_pipe("concise_concepts", config={"data": data, "ent_score": True})
doc = nlp(text)

options = {
    "colors": {
        "fruit": "darkorange",
        "vegetable": "limegreen",
        "meat": "salmon",
        "dairy": "lightblue",
        "herbs": "darkgreen",
        "carbs": "burlywood",
    },
    "ents": ["fruit", "vegetable", "meat", "dairy", "herbs", "carbs"],
}

ents = doc.ents
for ent in ents:
    new_label = f"{ent.label_} ({float(ent._.ent_score):.0%})"
    options["colors"][new_label] = options["colors"].get(ent.label_.lower(), None)
    options["ents"].append(new_label)
    ent.label_ = new_label
doc.ents = ents

displacy.render(doc, style="ent", options=options)



![base demo displacy](img/concise_concepts//base_displacy.png)

## Gathering data

For gathering the data, I will be using the `feedparser` library, which can load and parse arbitrary RSS feeds. I found some interesting RSS feeds via a quick Google search on "recipe data rss", which led me to [FeedSpot](https://blog.feedspot.com/home_cooking_rss_feeds/). Note that I am also cleaning up some HTML formatting with BeautifulSoup.

In [33]:
import feedparser
from bs4 import BeautifulSoup as bs

rss_feeds = [
    "https://thestayathomechef.com/feed",
    "https://101cookbooks.com/feed",
    "https://spendwithpennies.com/feed",
    "https://barefeetinthekitchen.com/feed",
    "https://thesouthernladycooks.com/feed",
    "https://ohsweetbasil.com/feed",
    "https://panlasangpinoy.com/feed",
    "https://damndelicious.net/feed",
    "https://leitesculinaria.com/feed",
    "https://inspiredtaste.com/feed",
]

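summaries = []
for source in rss_feeds:
    result = feedparser.parse(source)
    for entry in result.get("entries", []):
        # Skip entries without a summary so BeautifulSoup never receives None
        if entry.get("summary"):
            summaries.append(entry["summary"])

# Strip the HTML markup and keep only the visible text
summaries = [bs(text, "html.parser").get_text() for text in summaries]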
summaries[7]


'Strawberry Spinach Salad is packed with nutrient-loaded fresh fruits and veggies. Leafy spinach, juicy strawberries, crunchy onions, and a touch of salty bacon and feta marry wonderfully with our sweet yet tangy homemade poppy seed dressing.\nThe post Strawberry Spinach Salad appeared first on thestayathomechef.com.'
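
The summary above still ends with a "The post ... appeared first on ..." line and contains a raw newline. If that gets in the way, a small cleaning step could look like the sketch below. The footer pattern is an assumption based on this single example, so other feeds may need a different pattern, and `clean_summary` is a hypothetical helper name.

```python
import re

# Matches a footer line such as "The post <title> appeared first on <site>."
# The wording is an assumption based on the one summary shown above.
POST_FOOTER = re.compile(r"^\s*The post .+ appeared first on .+$", re.MULTILINE)


def clean_summary(text):
    """Drop the feed footer line and collapse runs of whitespace into single spaces."""
    text = POST_FOOTER.sub("", text)
    return re.sub(r"\s+", " ", text).strip()


sample = (
    "Leafy spinach and feta.\n"
    "The post Strawberry Spinach Salad appeared first on thestayathomechef.com."
)
print(clean_summary(sample))
```

Applied to the whole list, this would be something like `summaries = [clean_summary(text) for text in summaries]`.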

In [22]:
doc = nlp(summaries[7])

ents = doc.ents
for ent in ents:
    new_label = f"{ent.label_} ({float(ent._.ent_score):.0%})"
    options["colors"][new_label] = options["colors"].get(ent.label_.lower(), None)
    options["ents"].append(new_label)
    ent.label_ = new_label
doc.ents = ents

displacy.render(doc, style="ent", options=options)

![rss demo displacy](img/concise_concepts//rss_displacy.png)
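
The loop that adds the score to each label appears twice above, so it could be wrapped in a small helper. The sketch below is one way to do that. The name `add_score_to_labels` is made up, and it uses a `SimpleNamespace` stand-in instead of a real spaCy `Span` so that it runs without a model.

```python
from types import SimpleNamespace


def add_score_to_labels(ents, options):
    """Append each entity's score to its label and register the new label in the displaCy options."""
    for ent in ents:
        new_label = f"{ent.label_} ({float(ent._.ent_score):.0%})"
        # Reuse the color of the original label for the decorated label
        options["colors"][new_label] = options["colors"].get(ent.label_.lower())
        if new_label not in options["ents"]:
            options["ents"].append(new_label)
        ent.label_ = new_label
    return ents


# Stand-in for a spaCy Span, purely illustrative.
fake_ent = SimpleNamespace(label_="fruit", _=SimpleNamespace(ent_score=0.873))
demo_options = {"colors": {"fruit": "darkorange"}, "ents": ["fruit"]}
add_score_to_labels([fake_ent], demo_options)
print(fake_ent.label_)  # fruit (87%)
```

With the real pipeline, the call would presumably become `doc.ents = add_score_to_labels(doc.ents, options)` right before `displacy.render`.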

## Calling upon Rubrix

We've gathered and cleaned our data and shown that the pipeline also works on new data. So now we can actually start logging it to Rubrix. Since I am running in my local environment, I decided to use the [docker-compose](https://rubrix.readthedocs.io/en/stable/getting_started/advanced_setup_guides.html#launching-the-web-app-via-docker-compose) deployment described in the Rubrix docs. In short, this deploys an Elasticsearch instance to store our data, and the Rubrix server, which also comes with a nice UI at [http://localhost:6900/](http://localhost:6900/).

For our use case, we are doing entity classification, which falls under the [Token Classification](https://rubrix.readthedocs.io/en/stable/guides/cookbook.html#Token-Classification) category. Nice! There is even a [tutorial](https://rubrix.readthedocs.io/en/stable/guides/cookbook.html#NER) specifically for spaCy.



In [34]:
import rubrix as rb
import spacy

# Creating spaCy doc
docs = nlp.pipe(summaries)

records = []
for doc in docs:
    # Creating the prediction entities as a list of tuples (label, start_char, end_char, score)
    prediction = [(ent.label_, ent.start_char, ent.end_char, ent._.ent_score) for ent in doc.ents]
    
    # Building TokenClassificationRecord
    record = rb.TokenClassificationRecord(
        text=doc.text,
        tokens=[token.text for token in doc],
        prediction=prediction,
        prediction_agent="David",
    )
    records.append(record)

# Logging into Rubrix
rb.log(records=records, name="concise-concepts", metadata=data, verbose=False)

BulkResponse(dataset='concise-concepts', processed=185, failed=0)

We can now go to [http://localhost:6900/](http://localhost:6900/), look at the results, and start with a base annotation for most labels. Even though the predictions are not perfect, it surely beats starting from scratch.

![rubrix ui](img/concise_concepts/rubrix_ui.png)
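
If the low-confidence predictions turn out to clutter the UI, one option is to filter on the entity score before logging. Here is a minimal sketch of that idea. The helper name `filter_by_score` and the 0.5 threshold are arbitrary choices of mine, not something prescribed by Rubrix or `concise-concepts`.

```python
def filter_by_score(prediction, min_score=0.5):
    """Keep only (label, start_char, end_char, score) tuples at or above min_score."""
    return [entity for entity in prediction if entity[3] >= min_score]


# Made-up predictions in the same tuple layout used for the Rubrix records.
prediction = [("vegetable", 40, 45, 0.91), ("meat", 60, 65, 0.32)]
print(filter_by_score(prediction))
```

In the record-building loop, this would wrap the `prediction` list before it is passed to `rb.TokenClassificationRecord`.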

Now let's take a look at some statistics using the Rubrix Metrics module. These metrics can help us refine the definition and examples of the entities given to `concise-concepts`. For example, the entity label metric gives us an overview of the number of occurrences of each entity label predicted by `concise-concepts`.

In [6]:
from rubrix.metrics.token_classification import entity_labels

entity_labels("concise-concepts").visualize()

From the above, we see that the `MEAT` concept is recognized more often than most other concepts, even though most meals contain only a single type of meat and multiple vegetables. This indicates that this concept might be a bit trigger-happy. After some investigation, it turns out that **"roast"** and **"sauce"** are the culprits, since both are being recognized as `MEAT`.

![meat is trigger happy](img/concise_concepts/trigger_happy.png)
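
One way to do this kind of investigation outside the UI is to count which surface forms are predicted for each label. The sketch below assumes you have kept the texts and prediction tuples around as plain Python objects. The function name and the two example records are invented.

```python
from collections import Counter, defaultdict


def surface_forms_per_label(records):
    """Count which text spans were predicted for each label.

    `records` is an iterable of (text, prediction) pairs, where prediction is a
    list of (label, start_char, end_char, score) tuples.
    """
    counts = defaultdict(Counter)
    for text, prediction in records:
        for label, start, end, _score in prediction:
            counts[label][text[start:end].lower()] += 1
    return counts


# Two invented records, just to show the shape of the output.
records = [
    ("Pot roast with gravy", [("meat", 4, 9, 0.61)]),
    ("Roast chicken and tomato sauce", [("meat", 0, 5, 0.58), ("meat", 25, 30, 0.47)]),
]
print(surface_forms_per_label(records)["meat"].most_common(2))
```

Words that dominate the counts for a label but clearly do not belong there are good candidates to remove from, or counterbalance in, the few-shot examples.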

After annotating and validating some data, we can assess the value of our few-shot model using scoring metrics. We can use these to decide whether to fine-tune our input data or the number of words to expand over within `concise-concepts`.

In [42]:
from rubrix.metrics.token_classification import f1
f1("concise-concepts").visualize()

Looking at these results, we can see that the predictions for fruits are very good, even though we used only 3 examples. The other labels perform worse, which indicates that we might benefit from fine-tuning these concepts. For `MEAT`, we could potentially split the concept into `MEAT` and `FISH`. Similarly, the poorly performing `CARBS` could be split into `BREAD` and `GRAINS`. And we could potentially split `HERBS` into `HERBS`, `CONDIMENTS` and `SPICES`.
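
For reference, the F1 score combines precision $P$ and recall $R$ as $F_1 = 2PR / (P + R)$. The sketch below computes it per label using a plain exact-match definition on `(label, start_char, end_char)` spans. It is only meant to show what the number means, and it is not necessarily how the Rubrix metric is computed internally. The spans are invented.

```python
def f1_per_label(predicted, annotated):
    """Exact-match precision, recall and F1 per label.

    Both arguments are sets of (label, start_char, end_char) tuples.
    """
    scores = {}
    for label in {span[0] for span in predicted | annotated}:
        pred = {s for s in predicted if s[0] == label}
        gold = {s for s in annotated if s[0] == label}
        true_positives = len(pred & gold)
        precision = true_positives / len(pred) if pred else 0.0
        recall = true_positives / len(gold) if gold else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores[label] = {"precision": precision, "recall": recall, "f1": f1}
    return scores


# Invented spans for illustration only.
predicted = {("fruit", 0, 5), ("meat", 10, 15), ("meat", 20, 25)}
annotated = {("fruit", 0, 5), ("meat", 10, 15)}
print(f1_per_label(predicted, annotated))
```

Here the over-eager `meat` label finds everything that was annotated but also adds a false positive, so its precision drops to 0.5 and its F1 to about 0.67, while `fruit` stays at 1.0.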

## Summary

There are many more things we could do, like fine-tuning the few-shot training data, annotating data, or training a model. However, I think I've made my point, so it is time for a short recap.

We have introduced **few-shot learning** and **Named Entity Recognition (NER)**. We have also shown how a simple package like `concise-concepts` can be used for easy data labelling and entity scoring for any language out there, and how to store and interpret these predictions with Rubrix. Lastly, I have shown how you can completely over-engineer your hobby as a nerdy professional home cook!

Don't forget, life is too short to eat bad food! 

Want to run this code yourself? Take a look at the [blog GitHub repo](https://github.com/Pandora-Intelligence/rubrix-blogs) or the [company GitHub repo](https://github.com/Pandora-Intelligence/rubrix-blogs).