<a href="https://colab.research.google.com/github/krumeto/oss_nlp_tools_demos/blob/main/notebooks/setfit_fewshot_classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Run this notebook on Google Colab if you do not have a suitable GPU (training on a CPU is very slow)

In [4]:
# Run this cell only on Colab
!pip install -q setfit
!git clone https://github.com/krumeto/oss_nlp_tools_demos.git
from oss_nlp_tools_demos.data import preprocess_data

fatal: destination path 'oss_nlp_tools_demos' already exists and is not an empty directory.


In [5]:
import numpy as np
import pandas as pd
from pprint import pprint

from datasets import Dataset
from setfit import SetFitModel

from sentence_transformers.losses import CosineSimilarityLoss
from setfit import SetFitTrainer

import torch

if torch.cuda.is_available():
    print("CUDA is available!")
else:
    print("CUDA is not available.")

try:
    from data.preprocess_data import combine_json_to_dataframe
except ModuleNotFoundError:
    pass

CUDA is available!


In [6]:
annotated_df = pd.read_parquet("https://raw.githubusercontent.com/krumeto/oss_nlp_tools_demos/main/data/recipe_classes.parquet")

train_dataset = Dataset.from_pandas(annotated_df)
train_dataset

Dataset({
    features: ['recipe', 'label'],
    num_rows: 99
})

In [7]:
model_id = "sentence-transformers/all-MiniLM-L12-v2"

model = SetFitModel.from_pretrained(model_id)
model.model_body[0].max_seq_length = 512


model_head.pkl not found on HuggingFace Hub, initialising classification head with random weights. You should TRAIN this model on a downstream task to use it for predictions and inference.


In [8]:
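trainer = SetFitTrainer(
    model=model,
    train_dataset=train_dataset,
    eval_dataset=train_dataset,  # no held-out split here; evaluation would reuse the training data
    loss_class=CosineSimilarityLoss,
    num_iterations=20,  # number of text pairs to generate per example for contrastive training
    batch_size=5,  # reduced to avoid memory issues
    column_mapping={"recipe": "text", "label": "label"},  # dataset column -> name SetFit expects
)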

In [9]:
trainer.train()

Applying column mapping to training dataset
***** Running training *****
  Num examples = 3960
  Num epochs = 1
  Total optimization steps = 792
  Total train batch size = 5
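
The numbers in this log can be reproduced from the trainer settings. The sketch below assumes that SetFit generates one positive and one negative sentence pair per training example in each iteration; that factor of 2 is my reading of the library's behaviour and is not stated anywhere in this notebook.

```python
import math

num_rows = 99          # rows in train_dataset
num_iterations = 20    # passed to SetFitTrainer
batch_size = 5         # passed to SetFitTrainer
pairs_per_example = 2  # assumption: one positive + one negative pair per iteration

num_pairs = num_rows * num_iterations * pairs_per_example
num_steps = math.ceil(num_pairs / batch_size)

print(num_pairs)  # 3960, matches "Num examples"
print(num_steps)  # 792, matches "Total optimization steps"
```

So even a 99-row dataset turns into several thousand contrastive pairs, which is why a GPU helps here.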



In [10]:
complicated_recipe = """Ingredients:
4 ounces pancetta, diced into 1/4 inch cubes
2 1/2 to 3 pounds veal shanks (4 to 6 pieces 2 to 3 inches thick)
1/2 cup diced onion
1/2 cup diced celery
1/2 cup diced carrot
3 garlic cloves , minced
1 1/2 cups canned chopped tomatoes
1 1/2 cups chicken broth
1/2 cup dry white wine
1 bay leaf
1 sprig fresh thyme
salt
freshly ground black pepper
all-purpose flour for dredging
2 tablespoons unsalted butter
2 tablespoons extra-virgin olive oil
4 3-inch strips of lemon zest

Directions:

Preheat oven to 375°F.
Heat the olive oil over medium heat in a large Dutch oven.
Cook pancetta until browned and crisp.
Remove pancetta with a slotted spoon and transfer to a paper towel-lined plate.
Season veal shanks with salt and pepper and dredge in flour.
Cook the veal until browned on all sides, working in batches if necessary, then transfer to a plate.
Add the onion, celery, carrot, garlic, and a pinch of salt to the Dutch oven and cook until softened.
Stir in the tomatoes, chicken broth, dry white wine, bay leaf, and thyme sprig.
Return the veal shanks and pancetta to the Dutch oven and bring the liquid to a simmer.
Cover the pot and place it in the oven to braise for 2-2 1/2 hours, until the veal is very tender.
Serve with gremolata and garnish with lemon zest strips.
Note: To make gremolata, finely chop 2 tablespoons fresh parsley, 1 tablespoon grated lemon zest, and 1 garlic clove. Mix together and sprinkle over the osso buco before serving."""

trainer.model.predict([complicated_recipe])

tensor([3])

In [11]:
simple_recipe = """Ingredients:
2 large eggs
Salt and pepper to taste
1 tablespoon unsalted butter
Instructions:
Crack the eggs into a bowl and whisk them with a fork until the whites and yolks are well combined.
Season with salt and pepper to taste."""

trainer.model.predict([simple_recipe])

tensor([1])
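
The predictions are integer class ids. Going by the score column names used when the results are saved at the end of this notebook (`pred_very_easy`, `pred_easy`, `pred_medium`, `pred_hard`), ids 0 to 3 appear to run from very easy to hard. A minimal sketch of turning ids into readable names, where `LABEL_NAMES` and `to_label_names` are illustrative helpers and not part of SetFit:

```python
# Assumed id -> name mapping, taken from the order of the score columns
# used later in this notebook. Both names below are hypothetical helpers.
LABEL_NAMES = ["very_easy", "easy", "medium", "hard"]

def to_label_names(pred_ids):
    """Map an iterable of class ids (a list or a 1-D tensor) to readable names."""
    return [LABEL_NAMES[int(i)] for i in pred_ids]

print(to_label_names([3, 1]))  # ['hard', 'easy']
```

Under that assumption, the braised veal recipe above lands in the hard class and the scrambled eggs in the easy class.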

In [12]:
# Pickles the whole trainer object, fitted model included
torch.save(trainer, 'setfit-recipe-cls.pt')

## Download the recipes, preprocess and classify with the trained SetFit model

In [15]:
# Run this cell only on Colab
!wget https://eightportions.com/recipes_raw.zip

--2023-04-09 15:34:24--  https://eightportions.com/recipes_raw.zip
Resolving eightportions.com (eightportions.com)... 172.67.131.221, 104.21.4.85, 2606:4700:3033::6815:455, ...
Connecting to eightportions.com (eightportions.com)|172.67.131.221|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 53355492 (51M) [application/zip]
Saving to: ‘recipes_raw.zip’


2023-04-09 15:34:26 (39.1 MB/s) - ‘recipes_raw.zip’ saved [53355492/53355492]



In [16]:
# The preprocessing helper is imported under a different name on Colab than locally, so try both
try:
    # when running on Colab
    recipe_data = preprocess_data.combine_json_to_dataframe("recipes_raw.zip")
except NameError:
    # when running locally
    recipe_data = combine_json_to_dataframe("../data/recipes_raw.zip")

### Classify the recipes

In [17]:
docs = recipe_data.full_text.tolist()

class_predictions = trainer.model.predict(docs)
class_probas = trainer.model.predict_proba(docs)

### Quick test of the probabilities

In [18]:
def get_max_index(tensor, col):
    # Row index of the highest score in column `col`
    _, max_idx = torch.max(tensor[:, col], 0)
    return max_idx.item()

hardest_recipe = get_max_index(class_probas, 3)  # recipe with the highest "hard" probability

pprint(docs[hardest_recipe])
print(class_probas[hardest_recipe])

("Recipe title: Mrs. Patmore's London Particular . Ingredients: 1 smoked ham "
 'hock, soaked overnight in cold water; 1 large onion, peeled and halved; 2 '
 'celery sticks, chopped; 4 peppercorns; 1 bay leaf; 3 sprigs fresh thyme; 1 '
 'handful parsley; 1 pound green split peas, soaked overnight; 1/2 cup '
 'unsalted butter; 1 medium yellow onion, chopped; 1 medium carrot, chopped; 6 '
 'cups ham stock from above ham; Kosher salt and freshly ground black pepper '
 'to taste; Leftover boiled ham. Instructions: 1. Rinse, then drain, soaked '
 'ham hock. Place ham hock, large onion, celery, peppercorns, bay leaf, and '
 'thyme in a large saucepan. Cover with water. Bring to a boil, then simmer, '
 'partially covered, for 2 1/2 hours or until tender. Cool. 2. Strain ham '
 'stock through a fine-mesh sieve into a Tupperware or glass bowl with lid. '
 'Reserve the stock, and shred ham into bite-sized pieces. If stock is too '
 'spicy, distill with some water. 3. Rinse soaked peas until wate

In [19]:
easiest_recipe = get_max_index(class_probas, 0)  # recipe with the highest "very easy" probability

pprint(docs[easiest_recipe])
print(class_probas[easiest_recipe])

('Recipe title: Rose Sangria. Ingredients: 1 bottle rose wine; 1/4 cup brandy; '
 '1/4 cup triple sec; 1 cup fresh orange juice; 1/4 cup simple syrup, or more '
 'to taste; Orange slices, lemon slices, lime slices, apple slices and '
 'blackberries. Instructions: Combine all ingredients in a large pitcher, '
 'cover and refrigerate for at least 8 hours or up to 24 hours. If you do not '
 'serve immediately, strain the fruit and add fresh when serving.')
tensor([0.8804, 0.0343, 0.0425, 0.0428], dtype=torch.float64)
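
As a sanity check on the row printed above: the four probabilities should sum to roughly 1, and the largest one should sit in column 0, the class this recipe was picked for. A small sketch with the printed values copied in by hand, in plain Python so that no model is needed:

```python
import math

# Probabilities printed above for the recipe selected via column 0.
probas = [0.8804, 0.0343, 0.0425, 0.0428]

total = sum(probas)
top_class = max(range(len(probas)), key=probas.__getitem__)

# The printed values are rounded to 4 decimals, so allow a small tolerance.
assert math.isclose(total, 1.0, abs_tol=1e-3)
print(top_class)  # 0
```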


### Save the scores

In [20]:
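# Column 0 holds the predicted class id, the remaining columns the per-class probabilities
scores_pd = pd.DataFrame(
    torch.cat((class_predictions.unsqueeze(1), class_probas), dim=1).numpy(),
    columns=['pred_class', 'pred_very_easy', 'pred_easy', 'pred_medium', 'pred_hard'],
)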

In [21]:
scores_pd.to_parquet("setfit_scores.parquet")