# AI Cooking Assistant - Model Fine-Tuning

Fine-tuning a pre-trained model on a recipe dataset

# 1. 🔧 Setting Up the Environment
First, install the necessary libraries.

`transformers`: For loading and fine-tuning pre-trained models from **Hugging Face**.

`datasets`: For handling datasets efficiently.

`peft`: For parameter-efficient fine-tuning (LoRA).

`accelerate`: For optimized training across GPUs.

`bitsandbytes`: For 8-bit and 4-bit quantization, which lowers memory use.

In [1]:
!pip install -q transformers datasets peft accelerate bitsandbytes chromadb sentence-transformers langchain langchain-core langchain-community openai anthropic google
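After the install cell finishes, it can help to confirm which versions were actually resolved, since these libraries change quickly. The following is a minimal sketch using only the standard library; the `installed_versions` helper and the `PACKAGES` list are illustrative names and are not part of the original notebook.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each package name to its installed version, or None if it is missing."""
    report = {}
    for name in packages:
        try:
            report[name] = version(name)
        except PackageNotFoundError:
            report[name] = None
    return report


# The five libraries described above (assumed to be the ones worth checking)
PACKAGES = ["transformers", "datasets", "peft", "accelerate", "bitsandbytes"]

for name, ver in installed_versions(PACKAGES).items():
    print(f"{name}: {ver or 'not installed'}")
```

A `None` entry means the package could not be found in the current environment, which usually points to a failed or skipped install.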


In [2]:
from datasets import load_dataset

dataset = load_dataset("csv", data_files="/kaggle/input/fooddotcom/RAW_recipes.csv")
dataset['train'][0]

{'name': 'arriba   baked winter squash mexican style',
 'id': 137739,
 'minutes': 55,
 'contributor_id': 47892,
 'submitted': '2005-09-16',
 'tags': "['60-minutes-or-less', 'time-to-make', 'course', 'main-ingredient', 'cuisine', 'preparation', 'occasion', 'north-american', 'side-dishes', 'vegetables', 'mexican', 'easy', 'fall', 'holiday-event', 'vegetarian', 'winter', 'dietary', 'christmas', 'seasonal', 'squash']",
 'nutrition': '[51.5, 0.0, 13.0, 0.0, 2.0, 0.0, 4.0]',
 'n_steps': 11,
 'steps': "['make a choice and proceed with recipe', 'depending on size of squash , cut into half or fourths', 'remove seeds', 'for spicy squash , drizzle olive oil or melted butter over each cut squash piece', 'season with mexican seasoning mix ii', 'for sweet squash , drizzle melted honey , butter , grated piloncillo over each cut squash piece', 'season with sweet mexican spice mix', 'bake at 350 degrees , again depending on size , for 40 minutes up to an hour , until a fork can easily pierce the skin',

In [3]:
import ast

def process_lists(example):
    # Convert stringified lists into actual lists
    ingredients = ast.literal_eval(example["ingredients"])
    steps = ast.literal_eval(example["steps"])

    # Create clean text inputs
    input_text = "Generate a recipe using: " + ", ".join(ingredients)
    target_text = " ".join(steps)

    return {
        "input_text": input_text,
        "target_text": target_text
    }

# Apply to dataset
dataset = dataset.map(process_lists)

# Drop the original columns that are no longer needed
dataset = dataset.remove_columns(["name", "id", "minutes", "submitted", "tags", "nutrition", "contributor_id", "n_steps", "steps", "description", "n_ingredients"])

# Keep only examples where both new fields are non-empty.
# Note: input_text always contains the prompt prefix, so in practice this only drops rows with an empty target_text.
dataset = dataset.filter(lambda x: bool(x["input_text"] and x["target_text"]))




Map:   0%|          | 0/231637 [00:00<?, ? examples/s]

KeyboardInterrupt: 
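`ast.literal_eval` raises an exception when a value is missing or malformed, which would abort the whole `map` call. A defensive variant might look like the sketch below. `parse_list` and `build_pair` are hypothetical helpers, and returning empty strings for bad rows is an assumption chosen so that the non-empty filter can drop them afterwards.

```python
import ast


def parse_list(value):
    """Parse a stringified Python list; return [] if the value is missing or malformed."""
    if not isinstance(value, str):
        return []
    try:
        parsed = ast.literal_eval(value)
    except (ValueError, SyntaxError):
        return []
    if not isinstance(parsed, list):
        return []
    return [str(item) for item in parsed]


def build_pair(example):
    """Build the input/target pair; both fields are empty strings when parsing fails."""
    ingredients = parse_list(example.get("ingredients"))
    steps = parse_list(example.get("steps"))
    if ingredients:
        input_text = "Generate a recipe using: " + ", ".join(ingredients)
    else:
        input_text = ""
    return {"input_text": input_text, "target_text": " ".join(steps)}


row = {"ingredients": "['squash', 'honey']", "steps": "['cut squash', 'bake']"}
print(build_pair(row))
```

Because a failed parse yields an empty `input_text` rather than the bare prefix, a filter on both fields then removes rows with unusable ingredients as well as rows with no steps.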

In [4]:
!git config --global user.email "arpit.vaghela@outlook.com"
!git config --global user.name "magnifiques"

## Hugging Face Login

In [5]:
from huggingface_hub import login
from kaggle_secrets import UserSecretsClient

user_secrets = UserSecretsClient()
HF_TOKEN = user_secrets.get_secret("HF_TOKEN")
# from google.colab import userdata
# HF_TOKEN=userdata.get('HF_TOKEN')


# login(HF_TOKEN)

if HF_TOKEN:
    login(HF_TOKEN)
    print("Successfully logged in to Hugging Face!")

Successfully logged in to Hugging Face!


In [6]:
import shutil
import os

src = "/kaggle/input/chroma-db-langchain-fooddotcom"  # original read-only
dst = "/kaggle/working/chroma-db-langchain-fooddotcom"  # writable location

if not os.path.exists(dst):
    shutil.copytree(src, dst)

In [None]:
import pandas as pd
import ast
import re
from datasets import load_dataset
from tqdm import tqdm

from langchain.embeddings import SentenceTransformerEmbeddings
from langchain.vectorstores import Chroma
from langchain.schema import Document

tqdm.pandas()

# Load dataset
dataset = load_dataset("csv", data_files="/kaggle/input/fooddotcom/RAW_recipes.csv")['train']

# Convert to DataFrame
df = pd.DataFrame({
    "title": dataset["name"],
    "ingredients": dataset["ingredients"],
    "instructions": dataset["steps"],
    "description": dataset['description']
})

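# Process ingredients and instructions
def process_text(row):
    # Parse the stringified ingredient list into a comma-separated string
    try:
        row["ingredients"] = ", ".join(ast.literal_eval(row["ingredients"]))
    except (ValueError, SyntaxError, TypeError):
        row["ingredients"] = "Unknown"
    # Parse the stringified steps into a dash-bulleted, capitalized list
    try:
        steps = ast.literal_eval(row["instructions"])
        cleaned_steps = []
        for step in steps:
            step = re.sub(r'\s+,', ',', step)  # remove whitespace before commas
            step = step.strip().capitalize()
            cleaned_steps.append(f"- {step}")
        row["instructions"] = "\n".join(cleaned_steps)
    except (ValueError, SyntaxError, TypeError):
        row["instructions"] = "Instructions unavailable"
    return row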

df = df.progress_apply(process_text, axis=1)
df.dropna(subset=["title", "ingredients", "instructions", 'description'], inplace=True)
df.reset_index(drop=True, inplace=True)

# Convert to LangChain Documents
docs = []
for idx, row in tqdm(df.iterrows(), total=len(df), desc="Converting to LangChain Documents"):
    content = f"Title: {row['title']}\nIngredients: {row['ingredients']}\nInstructions: {row['instructions']} \nDescription: {row['description']}"
    metadata = {
        "title": row['title'],
        "ingredients": row['ingredients'],
        "instructions": row['instructions'],
        "description": row['description']
    }
    docs.append(Document(page_content=content, metadata=metadata))

print(f"📄 Prepared {len(docs)} documents.")

# Set up embedding model
embedding_function = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")

# Create empty Chroma vectorstore
persist_directory = "/kaggle/working/chroma_db_langchain"
vectorstore = Chroma(
    embedding_function=embedding_function,
    persist_directory=persist_directory
)

# Ingest in batches
BATCH_SIZE = 5000
for i in tqdm(range(0, len(docs), BATCH_SIZE), desc="Batch Ingest"):
    batch = docs[i:i + BATCH_SIZE]
    vectorstore.add_documents(batch)
    vectorstore.persist()
    print(f"✅ Persisted batch {i//BATCH_SIZE + 1}")

print("✅ ChromaDB successfully created and saved with LangChain.")
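Two pure-Python pieces of the indexing cell can be tried in isolation: the per-step cleanup and the batch slicing. The sketch below reproduces both without LangChain or Chroma; `clean_step` and `batched` are illustrative names that do not appear in the notebook.

```python
import re


def clean_step(step):
    """Remove whitespace before commas, trim, and capitalize the first letter."""
    step = re.sub(r"\s+,", ",", step)
    return step.strip().capitalize()


def batched(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


print(clean_step("depending on size of squash , cut into half or fourths"))
print([len(batch) for batch in batched(list(range(12)), 5)])
```

Note that `str.capitalize()` also lowercases the rest of the string, so a step such as `bake at 350 F` becomes `Bake at 350 f`. The last batch is simply shorter when the document count is not a multiple of the batch size.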


In [None]:
import shutil

# Folder to zip, output zip file path (in /kaggle/working)
shutil.make_archive('/kaggle/working/chroma_db_langchain', 'zip', '/kaggle/working/chroma_db_langchain')


In [None]:
from IPython.display import FileLink
FileLink(r'chroma_db_langchain.zip')

In [7]:
from langchain.vectorstores import Chroma
from langchain.embeddings import HuggingFaceEmbeddings

# Same embedding model used during indexing
embedding_function = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)

# Load the existing Chroma collection
vectordb = Chroma(
    persist_directory="/kaggle/working/chroma-db-langchain-fooddotcom",  # your saved db
    embedding_function=embedding_function
)





In [9]:
import re

from langchain.prompts import PromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnableLambda, RunnableMap

# --- TEXT CLEANING ---
def clean_text(text):
    text = re.sub(r"[^\x00-\x7F]+", "", text)  # drop non-ASCII characters (including emoji)
    text = re.sub(r"\s+", " ", text)  # collapse whitespace runs, including newlines
    return text.strip()

def generate_answer_langchain(query, retriever, llm):
    # Step 1: Retrieve docs from ChromaDB
    docs = retriever.invoke(query)
    raw_context = "\n\n".join([doc.page_content for doc in docs])
    cleaned_context = clean_text(raw_context)

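    # Step 2: Create the chain. `prompt_template` is not defined in this cell;
    # it must already exist as a prompt that accepts the `query` and `recipes` keys.
    chain = (
        RunnableMap({
            "query": lambda x: query,
            "recipes": lambda x: cleaned_context
        })
        | prompt_template
        | llm
        | StrOutputParser()
        | RunnableLambda(clean_text)  # Optional post-cleaning
    )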

    # Step 3: Run the chain
    return chain.invoke({})
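The chain pipes its `query` and `recipes` keys into `prompt_template`, which is not defined in any cell shown here. As a hedged sketch of what such a template could contain, the text below reuses the wording of the prompt from the LLM cells later in this notebook. `RECIPE_PROMPT` and `render_prompt` are assumed names, and plain `str.format` stands in for LangChain so the example runs on its own.

```python
# Assumed template text; the placeholders match the keys produced by the chain's RunnableMap
RECIPE_PROMPT = """You are a helpful cooking assistant.
The user asked: {query}

Here are some relevant recipes from the database:
{recipes}

Please:
1. Summarize their ingredients clearly.
2. Rewrite the steps in a beginner-friendly, easy-to-follow way.
3. Keep the tone warm and encouraging, like you're guiding a new cook.
"""


def render_prompt(query, recipes):
    """Fill the template with the user query and the retrieved recipe text."""
    return RECIPE_PROMPT.format(query=query, recipes=recipes)


# In the notebook, the chain needs a Runnable rather than a plain string, for example:
# prompt_template = PromptTemplate.from_template(RECIPE_PROMPT)

print(render_prompt("I have chicken and onion. What can I cook?", "Title: chicken stew"))
```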

In [10]:
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain.storage import LocalFileStore
from langchain.embeddings import CacheBackedEmbeddings
from langchain_core.documents import Document

# Load the same embedding model you used when creating the DB
embedding_model = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")

# Load Chroma from persistence
chroma = Chroma(
    persist_directory="/kaggle/working/chroma-db-langchain-fooddotcom",  # replace this with your path
    embedding_function=embedding_model
)

# Create retriever from Chroma

retriever = chroma.as_retriever(search_kwargs={"k": 3})
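With `k` set to 3, the retriever returns the three stored documents whose embeddings are closest to the query embedding. The toy example below illustrates that idea with hand-written three-dimensional vectors and cosine similarity. It is only a conceptual sketch: the vectors, the `top_k` helper, and the choice of cosine similarity are assumptions, and it does not describe how Chroma indexes or scores documents internally.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, doc_vecs, k=3):
    """Return the titles of the k documents most similar to the query vector."""
    ranked = sorted(
        doc_vecs.items(),
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [title for title, _ in ranked[:k]]


# Made-up embeddings, for illustration only
toy_docs = {
    "chicken stew": [0.9, 0.1, 0.0],
    "onion soup": [0.6, 0.6, 0.1],
    "fruit salad": [0.0, 0.1, 0.9],
    "roast chicken": [1.0, 0.0, 0.1],
}

print(top_k([1.0, 0.2, 0.0], toy_docs, k=3))
```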


In [11]:
import re

def format_recipes(raw_text):
    # Split based on "Title:"
    recipe_chunks = re.split(r'(?=Title:)', raw_text)
    formatted = []

    intro = "👩‍🍳 Here are some delicious recipe suggestions based on your ingredients!\n"
    outro = "\n🍽️ Hope one of these hits the spot! Let me know if you'd like more recipes, substitutions, or tips."

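    # The lookahead split yields a leading empty string when the text starts with "Title:",
    # so drop empty chunks before limiting to 3 recipes and numbering them from 1
    chunks = [chunk for chunk in recipe_chunks if chunk.strip()]
    for i, chunk in enumerate(chunks[:3], start=1):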

        # Extract title
        title_match = re.search(r'Title:\s*(.*)', chunk)
        title = title_match.group(1).strip() if title_match else "N/A"

        # Extract ingredients
        ingredients_match = re.search(r'Ingredients:\s*(.*?)(?=\n\w+:|$)', chunk, re.DOTALL)
        ingredients_raw = ingredients_match.group(1).strip() if ingredients_match else "N/A"

        # Capitalize first letter of each word in each ingredient
        ingredients = ", ".join(
            word.strip().title() for word in ingredients_raw.split(",")
        )

        # Extract instructions
        instructions_match = re.search(r'Instructions:\s*(.*?)(?=\n\w+:|$)', chunk, re.DOTALL)
        instructions_raw = instructions_match.group(1).strip() if instructions_match else "N/A"
        
        # Split into lines or sentences while keeping phrases like 'medium-high' intact
        raw_steps = re.split(r'(?<!\w)(?=\n|•|\-|\.)', instructions_raw)
        
        # Clean up each step and remove non-letter characters
        instruction_steps = [step.strip(" -.()") for step in raw_steps if step.strip() and re.search(r'[a-zA-Z]', step)]
        
        # Format steps as a dash-bulleted list
        instructions = "\n".join(f"- {step}" for step in instruction_steps)

        # Format the recipe
        formatted.append(f"""🍲 Recipe {i}: {title}

🧂 Ingredients: {ingredients}
        
👨‍🍳 Instructions - Step-by-step to cook it right:\n
{instructions}\n
"""
)

    return f"{intro}\n\n" + "\n\n".join(formatted) + f"\n\n{outro}"
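The two regular expressions that drive `format_recipes` (the `Title:` lookahead split and the lazy field match) are easier to reason about on a tiny input. The sketch below isolates them; `split_recipes` and `extract_field` are hypothetical helper names, and the sample text is made up.

```python
import re


def split_recipes(raw_text, limit=3):
    """Split on each 'Title:' marker, drop empty chunks, and keep at most `limit` recipes."""
    chunks = [chunk.strip() for chunk in re.split(r"(?=Title:)", raw_text) if chunk.strip()]
    return chunks[:limit]


def extract_field(chunk, field):
    """Return the text after 'field:' up to the next 'Word:' line or the end of the chunk."""
    match = re.search(rf"{field}:\s*(.*?)(?=\n\w+:|$)", chunk, re.DOTALL)
    return match.group(1).strip() if match else "N/A"


sample = (
    "Title: chicken stew\nIngredients: chicken, onion\nInstructions: - Cut\n- Simmer \nDescription: hearty\n\n"
    "Title: onion soup\nIngredients: onion, stock"
)

for recipe in split_recipes(sample):
    print(extract_field(recipe, "Title"), "|", extract_field(recipe, "Ingredients"))
```

On Python 3.7 and later, splitting on a zero-width lookahead produces an empty first element when the text begins with `Title:`, which is why the empty chunks are filtered out before the limit is applied.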


In [13]:
docs = retriever.invoke("I have chicken and onion. What can I cook?")
raw_context = "\n\n".join([doc.page_content for doc in docs])

formatted_output = format_recipes(raw_context)
print(raw_context)
# print(formatted_output)

Title: chicken dish from netherlands antilles   original
Ingredients: chicken wings, oil, green bell pepper, onion, stewed tomatoes, tomato paste, garlic, sambal oelek, bouillon cube
Instructions: - Cut the wings at the joints so you have 3 pieces, or simply use drumsticks
- You can skin the chicken if you want to, we never do
- Cut the bell peppers and onions in strips
- Peel the garlic and mash it up
- Put some oil, butter or even pam in a medium hot skillet add the chicken and cook until the chicken seems to get done some
- Add the onions, garlic and bell peppers and stir-fry the whole lot for a while to add some color to the vegetables
- Cut the stewed tomatoes in smaller bits and add to the dish along with the tomato paste, fill the empty cans with water and add also
- Add some sambal or hot sauce
- Add stock cubes
- Let the whole thing simmer for about 15 minutes, which gives you time to cook some rice with it
- )
- Taste, add salt and sambal if needed
- Add a pinch of sugar if t

In [16]:
# Create the OpenAI client used for the GPT calls
from openai import OpenAI

OPENAI_API_KEY = user_secrets.get_secret("OPENAI_API_KEY")
client = OpenAI(api_key=OPENAI_API_KEY)

In [28]:
query = "I have chicken, tomatoes, and garlic. What can I cook?"
docs = retriever.invoke(query)
raw_context = "\n\n".join([doc.page_content for doc in docs])

prompt = f"""
You are a helpful cooking assistant.
The user asked: {query}

Here are some relevant recipes from the database:
{raw_context}

Please:
1. Summarize their ingredients clearly.
2. Rewrite the steps in a beginner-friendly, easy-to-follow way.
3. Keep the tone warm and encouraging, like you're guiding a new cook.
"""

# Re-create the OpenAI client here: the Anthropic cell rebinds `client` to a different SDK
from openai import OpenAI
OPENAI_API_KEY = user_secrets.get_secret("OPENAI_API_KEY")
client = OpenAI(api_key=OPENAI_API_KEY)

gpt_response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}],
    temperature=0.7
)

print(gpt_response.choices[0].message.content)

Sure! Let’s take a look at the two recipes that you can make with chicken, tomatoes, and garlic. I’ll summarize the ingredients and rewrite the steps in a beginner-friendly way. 

### Recipe 1: Mozzarella Topped Chicken with Roasted Tomato and Basil Sauce

#### Ingredients:
- Chicken breasts
- Cherry tomatoes
- Garlic cloves
- Olive oil
- Fresh basil
- Cream
- Mozzarella cheese

#### Instructions:
1. **Preheat the Oven:** Start by setting your oven to 200°C (about 400°F) to get it nice and hot.
2. **Prepare the Tomatoes and Garlic:** In an ovenproof dish, mix together the cherry tomatoes, garlic cloves, and a drizzle of olive oil. This will be roasted to bring out the flavors!
3. **Roast:** Place the dish in the oven and let it roast uncovered for about 20 minutes, or until the tomatoes are soft and bursting.
4. **Squeeze the Garlic:** Once the garlic is cool enough to handle, squeeze the soft garlic out of the skins and into a bowl.
5. **Cook the Chicken:** In a pan, heat a little oli

In [25]:
import anthropic

ANTHROPIC_API_KEY = user_secrets.get_secret("ANTHROPIC_API_KEY")
client = anthropic.Anthropic(api_key=ANTHROPIC_API_KEY)

query = "I have chicken and onion. What can I cook?"
docs = retriever.invoke(query)
raw_context = "\n\n".join([doc.page_content for doc in docs])

prompt = f"""
You are a helpful cooking assistant.
The user asked: {query}

Here are some relevant recipes from the database:
{raw_context}

Please:
1. Summarize their ingredients clearly.
2. Rewrite the steps in a beginner-friendly, easy-to-follow way.
3. Keep the tone warm and encouraging, like you're guiding a new cook.
"""

response = client.messages.create(
    model="claude-3-5-haiku-20241022",
    max_tokens=600,
    messages=[
        {"role": "user", "content": prompt}
    ]
)

print(response.content[0].text)


Hey there! Great news - with chicken and onion, you've got the start of something delicious! Let me help you create a fantastic meal. Based on these recipes, I'll guide you through a simple, flavorful chicken dish.

Ingredients You'll Need:
- Chicken (wings, drumsticks, or pieces)
- Onion
- Optional additions (if you have them):
  * Garlic
  * Bell pepper
  * Tomatoes
  * Broth/stock
  * Carrots
  * Herbs like thyme

Easy-Peasy Cooking Steps:
1. Prep Work (Super Important!)
   - Chop your onion into nice strips
   - If you have garlic, peel and mince it
   - Cut chicken into manageable pieces

2. Cooking Magic
   - Heat a little oil in a skillet over medium heat
   - Add chicken and cook until it starts looking golden
   - Toss in those onions and let them get soft and slightly caramelized
   - If you have extras like bell peppers or garlic, add them now!

3. Make It Saucy
   - If you have tomatoes or broth, pour them in
   - Season with salt, pepper, and any herbs you like
   - Let ev