<a href="https://colab.research.google.com/github/hindhammad24/Final-Project/blob/main/final_project_fixed.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **NutriGuide** - Health Nutrition RAG Chatbot
**By Hind Hammad and Manmeet Sagri**

**Project Overview**

In our project, we are creating NutriGuide, a Retrieval-Augmented Generation (RAG) chatbot designed to give safe, evidence-based nutrition guidance. We built this system because we wanted to create a tool that doesn’t rely on random internet information, but instead pulls directly from trusted health sources like Canada’s Food Guide and the World Health Organization. Our chatbot uses retrieval to gather verified information and a language model to summarize it into clear, simple responses. Since we are focusing on general healthy eating, our system avoids giving medical or extreme diet advice and instead supports users with educational nutrition information.

**Problem**

While working on this project, we realized how difficult it is for people to access reliable nutrition information online. Many websites, influencers, and social media platforms share conflicting or unsafe diet advice, which can easily confuse users. A lot of people end up relying on viral trends instead of evidence-based recommendations. We wanted to solve this problem by building a chatbot that only uses trustworthy public-health information and provides consistent, safe guidance.

**Significance**

As we researched different ideas, we quickly saw how much misinformation exists in the nutrition space. This problem felt meaningful to us because we noticed that students, newcomers to Canada, and busy individuals often struggle to identify what is actually healthy or safe. Since we are working with official health sources, our project gives users a way to access dependable information without having to navigate complex government websites. To us, the significance of this project lies in making nutrition education more accessible, culturally neutral, and easier to understand.

**Users/Use Case**

From our perspective, the people who will benefit most from our chatbot are users who want quick, general nutrition guidance without dealing with confusing or unreliable online sources. This includes beginners learning about healthy eating, people who want simple explanations, and anyone who feels overwhelmed by online diet content. We designed the use case to be a straightforward chat interface where users type questions and receive safe, evidence-based responses grounded directly in our dataset.

**Where and How Users Will Use the Model**

We expect users to interact with our model through a lightweight chatbot interface, either on a web app or through an interactive notebook. Our goal is for people to use it in everyday situations, for example when planning meals, trying to understand healthy habits, or wanting quick reminders about balanced eating. Since the model retrieves real text from reliable sources, users get accurate information quickly without needing to browse multiple websites.

**What Kind of Data It Requires**

To build this system, we needed data that is both reliable and permitted for academic use. We collected text from Canada’s Food Guide and WHO Healthy Diet resources because these are publicly available educational materials. We cleaned the text, split it into smaller chunks, and turned it into embeddings so the chatbot can retrieve the right information when a user asks a question. Our model does not collect or store personal user data; it only uses the user’s question to retrieve the correct evidence and generate an answer.
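The chunking step mentioned above can be illustrated with a minimal sketch. It assumes paragraph-based splitting with a word limit; the function name `chunk_text` and the `max_words` value are hypothetical choices for this example and are not taken from our notebook.

```python
def chunk_text(text, max_words=80):
    """Split text into chunks of at most max_words words, keeping paragraphs together when they fit."""
    chunks, current = [], []
    for paragraph in text.split("\n\n"):
        words = paragraph.split()
        if not words:
            continue
        # Start a new chunk if adding this paragraph would exceed the limit
        if current and len(current) + len(words) > max_words:
            chunks.append(" ".join(current))
            current = []
        current.extend(words)
        # A single paragraph longer than the limit is cut into fixed-size pieces
        while len(current) > max_words:
            chunks.append(" ".join(current[:max_words]))
            current = current[max_words:]
    if current:
        chunks.append(" ".join(current))
    return chunks


# Example: three short paragraphs with a 4-word limit become three chunks
print(chunk_text("a b c\n\nd e f\n\ng h", max_words=4))
```

Each chunk could then be embedded separately, which tends to make retrieval more specific than embedding a whole file at once.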


# **Model Source and How We Used It**

For this project, we began by forking the RAG (Retrieval-Augmented Generation) notebook provided by our TA. The original notebook (“Retrieval_Augmented_Generation_(RAG).ipynb”) contained a minimal working example of a RAG pipeline using HuggingFace models. We used this notebook as the foundation for our system and adapted it to our nutrition use case.

Inside the instructor’s notebook, two external models were loaded through the HuggingFace transformers and sentence-transformers libraries:

- DistilGPT-2 (text generation)

- all-MiniLM-L6-v2 (sentence embeddings)



### **DistilGPT-2**

Source: HuggingFace model hub

Original Repository: https://huggingface.co/distilgpt2

License: Apache 2.0 License (permits use, modification, and redistribution)

**How we used it:**

- The instructor’s notebook loaded DistilGPT-2 using the HuggingFace pipeline("text-generation").

- We kept this part of the code but modified the prompt template so that the model only generates answers using text from our nutrition dataset.

- We added safety rules (no diagnoses, no calories, no meal plans, etc.) and included them in the system prompt.

- DistilGPT-2 then generates short, coherent answers grounded in retrieved evidence.

### **MiniLM-L6-v2**

Source: SentenceTransformers library

Original Repository: https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2

License: Apache 2.0 License (permits research use, modification, and redistribution)

**How we used it:**

- In the instructor’s notebook, this model was used to convert the dummy RAG documents into 384-dimensional embeddings.

- We replaced the original placeholder texts with our own dataset, which we created using reliable public health sources (Canada’s Food Guide and the WHO Healthy Diet fact sheet).

- The embedding model allowed us to perform semantic search and retrieve the top relevant pieces of nutrition evidence for each user question.

### **FAISS Index**

Source: Installed inside the instructor’s notebook (FAISS CPU version)

License: MIT License

Repository: https://github.com/facebookresearch/faiss

**How we used it:**

- FAISS was originally used in the instructor’s example to store embeddings of small RAG documents.

- We kept the same retrieval structure, but instead of the original sample RAG text, we inserted nutrition text chunks.

- This enabled accurate retrieval of relevant information before passing it to the language model.

### **Instructor Notebook**

Provided by: TA

Filename: Retrieval_Augmented_Generation_(RAG).ipynb

**How we used and modified it:**

- We forked the entire notebook as the starting point for our project.

- We replaced the example documents = [...] list with our nutrition dataset.

- We rewrote the generate_with_rag() function to:

  - enforce safety guardrails,

  - inject public-health context,

  - prevent unsafe diet or medical advice,

  - cite retrieved information clearly.

- We added a testing framework using real-world diet questions.

- We integrated everything into a simple chatbot interface for deployment.

# **RAG Dataset Description**





## **Overview of the Dataset**

For this project, we created a custom RAG dataset focused on evidence-based nutrition and healthy eating practices. Our goal was to ensure that NutriGuide provides safe, factual, and research-supported information rather than relying on general internet knowledge. The dataset contains information from trusted public-health sources, including Government of Canada dietary guidelines, the World Health Organization (WHO), and the Heart & Stroke Foundation. All documents focus on general healthy eating, nutrient recommendations, dietary patterns, and practical guidance for the public.

The final dataset contains 34 cleaned text files categorized into themes such as vegetables and fruits, whole grains, protein foods, food skills, food marketing, vitamin D, life-stage nutrition, dietary shifts, healthy eating habits, and global nutrition recommendations.

## **Data Collection Process & Source Links**

We manually collected the dataset from publicly available, non-copyrighted educational resources intended for public use. All content came from official health organizations that allow non-commercial academic reuse.

**Main sources used:**

Canada’s Food Guide – Government of Canada
https://food-guide.canada.ca/en/

Canada’s Dietary Guidelines – Health Canada (PDF)
https://www.canada.ca/content/dam/hc-sc/documents/services/food-nutrition/canada-food-guide/resources/canada-food-guide-backgrounder/food-guide-snapshot.pdf

World Health Organization: Healthy Diet
https://www.who.int/news-room/fact-sheets/detail/healthy-diet

Heart & Stroke Foundation – Healthy Eating Basics
https://www.heartandstroke.ca/healthy-living/healthy-eating/healthy-eating-basics

## **How we collected the data**

- We manually copied sections of text from each source.

- We ensured only public information intended for public education was included.

- We avoided copyrighted recipes, menus, or branded content.

- We grouped text into logical sections (e.g., vegetables, protein foods, food marketing).

**How we stored the dataset**

After cleaning each text section, we saved them as individual .txt files and packaged them into a folder, which we uploaded as data.zip for use in our RAG notebook.


# **Data Cleaning & Preprocessing**

The raw text from each website or PDF contained many elements that would interfere with retrieval (menus, buttons, duplicate headings, references, formatting noise). To make the dataset consistent and useful for RAG, we cleaned all text manually.

## **Cleaning steps we applied**

- Removed HTML menus, navigation text, sidebars, “On this page,” and repeated headings.

- Removed unnecessary lists (e.g., “Twitter / Facebook / Share”).

- Removed references, footnotes, citation numbers, and academic formatting.

- Consolidated bullet lists into readable sentences.

- Paraphrased and rewrote large portions for clarity while preserving the original meaning.

- Removed medical advice, prescriptive diet plans, and anything unsafe or diagnostic.

- Split long sections into smaller, topic-based files to improve retrieval accuracy.

- Ensured all wording was neutral, educational, and general, not personalized.

- Ensured no copyrighted text from non-governmental sources was used verbatim.

As a result, all cleaned text files contain short, self-contained paragraphs that the embedding model can index effectively.
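We cleaned the text by hand, but a few of the steps above could be partly automated. The following is a minimal sketch under that assumption; `clean_text` and the `NOISE_LINES` set are hypothetical and were not part of our notebook, and paraphrasing or safety review would still need to be done manually.

```python
import re

# Hypothetical interface phrases to drop; a real list would be longer
NOISE_LINES = {"on this page", "share", "twitter", "facebook"}


def clean_text(raw):
    """Remove simple web-page noise: interface lines, citation markers, extra whitespace."""
    kept = []
    for line in raw.splitlines():
        stripped = line.strip()
        # Skip lines that consist only of a known interface phrase
        if stripped.lower() in NOISE_LINES:
            continue
        # Remove numeric citation markers such as [3] or [12]
        stripped = re.sub(r"\s*\[\d+\]", "", stripped)
        # Collapse runs of spaces or tabs into one space
        stripped = re.sub(r"[ \t]+", " ", stripped).strip()
        kept.append(stripped)
    # Collapse three or more newlines into a single blank line
    return re.sub(r"\n{3,}", "\n\n", "\n".join(kept)).strip()


raw = "On this page\nEat vegetables [1] daily.\nShare\n\n\n\nDrink   water."
print(clean_text(raw))
```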


# **Naming Convention and File Organization**

To keep the dataset easy to load, search, and maintain, we used a consistent naming system:

**All files are saved as:** source_topic.txt

Examples:

- canada_vegetables_and_fruits.txt

- canada_whole_grains.txt

- canada_protein_foods.txt

- canada_mindful_eating.txt

- who_healthy_diet_key_facts.txt

- who_practical_healthy_eating_tips.txt

- heartstroke_healthy_eating_basics.txt


**Naming rules we followed**

- Lowercase only
- Words separated by underscores
- Prefix shows the source (canada, who, heartstroke)

- Second part describes the topic

- One topic per file for cleaner chunking

The dataset folder contains all .txt files and is stored as data/ in our project directory. We zipped the folder (data.zip) for submission and to make it easy to load inside the Jupyter notebook.
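The naming rules above can be checked automatically. The snippet below is an illustrative sketch only: the regular expression and the helper name `is_valid_name` are our assumptions for this example, not code from the notebook.

```python
import re

# source_topic.txt: a known source prefix, then lowercase words joined by underscores
NAME_PATTERN = re.compile(r"^(canada|who|heartstroke)_[a-z0-9]+(_[a-z0-9]+)*\.txt$")


def is_valid_name(filename):
    """Return True if filename follows the lowercase source_topic.txt convention."""
    return NAME_PATTERN.match(filename) is not None


for name in ["canada_whole_grains.txt", "Canada_Whole_Grains.txt", "who-healthy-diet.txt"]:
    print(name, is_valid_name(name))
```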

# **Issues and Biases**

## **Is the Dataset Balanced?**

Our dataset is reasonably balanced across major nutrition themes because we intentionally included information from multiple trusted sources (Canada’s Food Guide, WHO, Heart & Stroke Foundation). The dataset covers vegetables and fruits, whole grains, protein foods, healthy fats, water, food skills, processed foods, dietary shifts, life-stage nutrition, and supplementation. Each category has several text files, which allows the retrieval system to access information from different perspectives. While it is not perfectly uniform, the dataset includes a wide enough distribution of topics to support reliable retrieval for general healthy-eating questions.
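One simple way to inspect balance is to count files per source prefix. This is a minimal sketch; `count_by_source` is a hypothetical helper, and the file names in the example are the illustrative names from our naming convention rather than a full listing of the dataset.

```python
from collections import Counter


def count_by_source(filenames):
    """Count files per source prefix (the part before the first underscore)."""
    return Counter(name.split("_", 1)[0] for name in filenames)


example_files = [
    "canada_whole_grains.txt",
    "canada_protein_foods.txt",
    "who_healthy_diet_key_facts.txt",
]
print(count_by_source(example_files))
```

The same idea works for topics if the prefix is replaced by a topic keyword.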


## **What Is Missing From the Dataset**

Although the dataset is comprehensive for general public-health nutrition, several areas are intentionally not included:

- No medical or clinical nutrition information (e.g., diabetes-specific diets, kidney disease diets)

- No personalized diet plans, calorie targets, or macro calculations

- No weight-loss programs or therapeutic diets (e.g., keto, low-carb, intermittent fasting)

- No fitness or bodybuilding nutrition

- No recipes or cooking instructions from copyrighted sources

- No cultural dietary guidelines beyond general statements

- No Indigenous harvesting methods beyond high-level descriptions

- No highly technical scientific literature or biochemical nutrition details

**These omissions are intentional to ensure the chatbot remains safe, general, non-diagnostic, and aligned with government public-health guidance.**


## **How We Improved the Raw Dataset**

The raw text from websites and PDFs contained significant noise and had to be improved for RAG use. We enhanced the dataset by:

- Removing menus, navigation bars, sidebars, advertisements, and irrelevant interface text

- Removing duplicate sections, repeated headings, and broken formatting

- Converting long lists and bullet points into clear sentences

- Paraphrasing complex passages for clarity and readability

- Combining related ideas and splitting overly long sections into smaller topic files

- Ensuring every file contains clean, self-contained paragraphs for better embedding

- Eliminating unsafe content such as personalized diet claims, prescriptive meal plans, and medical advice

- Ensuring all text follows the same tone: neutral, factual, and educational

These improvements make the dataset cleaner, more consistent, and much more effective for retrieval-augmented question answering.

## **Does the Dataset Have Any Biases?**

Yes, like all real-world datasets, this dataset has some inherent biases:

1. **Institutional Bias:** Most content comes from government and public-health organizations, which follow Western nutrition frameworks and may not represent all cultural eating patterns.

2. **Population Bias:** Guidelines are written for the general population, not for individuals with medical conditions, allergies, or special dietary needs.

3. **Cultural Bias:** Though some cultural foods are mentioned, most recommendations reflect Canadian dietary norms and may not fully reflect global diversity.

4. **Socioeconomic Bias:** Many recommendations assume consistent access to fresh foods, which may not be affordable or available to everyone.

5. **Safety Bias:** To avoid harmful output, the dataset excludes weight-loss diets, restrictive eating patterns, and high-risk nutrition advice, which means the model cannot answer detailed personalized questions.

We acknowledge these biases and mitigate their impact by adding safety guardrails and ensuring that the chatbot gives general guidance only and refuses personalized or clinical advice.

# **Guardrails Implemented in Our Model**

We implemented several guardrails to ensure that NutriGuide provides safe, non-medical, and evidence-based responses. These restrictions prevent the model from giving harmful or overly personalized nutrition advice and ensure that all answers are grounded in the dataset rather than generated freely.


1. **No Medical or Diagnostic Advice**

The model is instructed to avoid diagnosing health conditions, recommending treatments, interpreting symptoms, and giving clinical nutrition guidance (e.g., “What should a diabetic eat?”).

If users ask medical questions, the chatbot responds with a safe refusal and encourages them to consult a health professional.

2. **No Personalized Diet Plans**

To reduce safety risks and avoid misinformation, the model will not provide calorie targets, macro calculations, weight-loss plans, or perfect diets based on age, sex, or weight.

Instead, it offers general healthy-eating principles based on the dataset.

3. **No Restrictive or Trend Diet Guidance**

The model avoids recommending high-risk or restrictive diets such as keto, fasting, detoxes, low-carb, or other extreme programs.

If the user asks about these, the model explains general concepts but does not encourage restrictive eating.

4. **Grounded Responses Only**

All answers must come from the retrieved dataset. We enforce this by inserting retrieved evidence directly into the prompt, instructing the model to answer ONLY using the provided context, and preventing unsupported claims or invented facts.

If the context does not contain an answer, the model says it does not have information on that topic.

5. **Safety-First Prompt Template**

We added safety rules inside the system prompt, including:

“Do not give medical advice.”

“Do not create personalized diet plans.”

“Use only the retrieved text to answer.”

“If a question is unsafe or outside scope, politely decline.”

This ensures consistency across all outputs.

6. **Handling Sensitive Topics (eating disorders, weight loss, pregnancy)**

For sensitive nutrition topics, the model avoids giving instructions, focuses its responses on general education, and suggests speaking with a qualified professional.

This protects users who may be vulnerable.

7. **Filters Against Harmful Outputs**

We added logic to detect high-risk prompts such as extreme dieting, unsafe weight-loss methods, and nutrient megadosing.

In these cases, the chatbot provides a safety message instead of an answer.

8. **No Storage of Personal Information**

User messages are processed only to generate a response.
We do not store personal health data, demographic details, or user history.

This reduces privacy risks.
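The high-risk prompt filter described in guardrail 7 could look roughly like the sketch below. It is a simplified illustration that assumes plain keyword matching; the term list, the message text, and the names `is_high_risk` and `guarded_answer` are hypothetical and do not reproduce our exact logic.

```python
# Hypothetical keyword list; a real filter would need a broader, reviewed vocabulary
HIGH_RISK_TERMS = ("detox", "megadose", "crash diet", "lose weight fast", "starve")

SAFETY_MESSAGE = (
    "I can only share general healthy-eating information. "
    "For personal or medical nutrition questions, please speak with a qualified health professional."
)


def is_high_risk(question):
    """Return True if the question contains any high-risk term (case-insensitive substring match)."""
    lowered = question.lower()
    return any(term in lowered for term in HIGH_RISK_TERMS)


def guarded_answer(question, answer_fn):
    """Return the safety message for high-risk questions, otherwise call answer_fn(question)."""
    if is_high_risk(question):
        return SAFETY_MESSAGE
    return answer_fn(question)


# Example with a stand-in answer function
print(guarded_answer("What is the best crash diet?", lambda q: "normal answer"))
```

In the notebook, a call such as `guarded_answer(user_question, generate_with_rag)` would run the check before any retrieval or generation happens. Keyword matching is easy to audit but misses paraphrases, so it complements the prompt-level rules rather than replacing them.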

# **Install Dependencies**

Install libraries required for the project:

In [None]:
!pip install transformers sentence-transformers -q

# **Import Necessary Libraries**

In [None]:
from transformers import pipeline
from sentence_transformers import SentenceTransformer
import pandas as pd
import numpy as np

# **Google Drive**

We mounted Google Drive in Colab.

In [None]:
from google.colab import drive

# We mount our Google Drive so we can read the nutrition dataset from it
drive.mount('/content/drive')


Mounted at /content/drive


**What we did and why:**

To manage our dataset efficiently, we stored all our cleaned nutrition text files inside a Google Drive folder and mounted that Drive into our Colab notebook. Mounting Google Drive allows the notebook to access files persistently, even across sessions, instead of uploading data manually every time. Once Drive was mounted, we specified the path to our dataset folder (/content/drive/MyDrive/NutriGuide_RAG/data) so the notebook could read all .txt files directly from storage. This setup ensures that our RAG system always has access to the latest version of our nutrition dataset.

# **Verifying and Accessing the Dataset Directory**

In [None]:
import os
import glob

# We point to the data folder stored on Google Drive
DATA_DIR = "/content/drive/MyDrive/NutriGuide_RAG/data"

print("DATA_DIR:", DATA_DIR)
print("Files in DATA_DIR:")
print(glob.glob(os.path.join(DATA_DIR, "*.txt")))


DATA_DIR: /content/drive/MyDrive/NutriGuide_RAG/data
Files in DATA_DIR:
['/content/drive/MyDrive/NutriGuide_RAG/data/canada_food_vegetables_fruits.txt', '/content/drive/MyDrive/NutriGuide_RAG/data/canada_food_protein.txt', '/content/drive/MyDrive/NutriGuide_RAG/data/canada_food_water.txt', '/content/drive/MyDrive/NutriGuide_RAG/data/canada_food_mindful_eating.txt', '/content/drive/MyDrive/NutriGuide_RAG/data/canada_food_homemade_food.txt', '/content/drive/MyDrive/NutriGuide_RAG/data/canada_food_enjoying_meals.txt', '/content/drive/MyDrive/NutriGuide_RAG/data/canada_food_food_labels.txt', '/content/drive/MyDrive/NutriGuide_RAG/data/canada_food_limiting_processed_foods.txt', '/content/drive/MyDrive/NutriGuide_RAG/data/canada_food_food_marketing.txt', '/content/drive/MyDrive/NutriGuide_RAG/data/canada_food_cultural_foods.txt', '/content/drive/MyDrive/NutriGuide_RAG/data/canada_food_frozen_canned_dried.txt', '/content/drive/MyDrive/NutriGuide_RAG/data/canada_food_indigenous_traditional_foo

**What we did and why:**

After mounting Drive, we confirmed that Python could correctly access the dataset folder by printing out all .txt files inside the directory. This step allowed us to verify that our cleaned documents were properly uploaded, that the file path was correct, and that the system could read the data before building embeddings. This check helped us catch potential mistakes early, such as missing files, incorrect folder names, or unreadable text formats.
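The same check can be made to fail loudly instead of printing an empty list. The following is a minimal sketch, assuming we want an exception when the folder is missing or contains no text files; `list_txt_files` is a hypothetical helper that is not in our notebook.

```python
from pathlib import Path


def list_txt_files(folder):
    """Return sorted .txt paths in folder, raising a clear error if the folder is missing or empty."""
    folder = Path(folder)
    if not folder.is_dir():
        raise FileNotFoundError(f"Data folder not found: {folder}")
    paths = sorted(folder.glob("*.txt"))
    if not paths:
        raise ValueError(f"No .txt files found in {folder}")
    return paths
```

Calling `list_txt_files(DATA_DIR)` right after mounting Drive would surface a wrong path before any embeddings are built.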

# **Loading the Nutrition Documents**

In [None]:
# We load all our cleaned nutrition documents from the data folder on Google Drive
def load_nutrition_docs(folder=DATA_DIR):
    docs = []
    file_paths = sorted(glob.glob(os.path.join(folder, "*.txt")))

    print("Found", len(file_paths), "files in", folder)
    for path in file_paths:
        with open(path, "r", encoding="utf-8") as f:
            text = f.read().strip()
        if text:
            docs.append(text)
        else:
            print("Skipped empty file:", path)
    return docs

# We call our loader function to get a list of all text documents
documents = load_nutrition_docs(DATA_DIR)
print("Number of non-empty documents:", len(documents))


Found 33 files in /content/drive/MyDrive/NutriGuide_RAG/data
Number of non-empty documents: 33


**What we did and why:**

We created a custom function, load_nutrition_docs(), to automatically load every .txt file from our dataset folder. Instead of manually copying text into the notebook, this function scans the directory, opens each file, removes extra whitespace, and stores the cleaned text into a list called documents. This approach ensures consistency and scalability—if we add or update any nutrition files in the data folder, the notebook will automatically load the new content without requiring code changes. This loading step serves as the foundation for building embeddings and performing semantic retrieval in later stages.
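Because load_nutrition_docs() returns only the text, the file name of each document is lost after loading. If source names are needed later (for example, to cite where an answer came from), a variant like the sketch below could keep them. The function name `load_docs_with_sources` and the dictionary keys are assumptions for illustration.

```python
import glob
import os


def load_docs_with_sources(folder):
    """Load non-empty .txt files and keep each file name next to its text."""
    records = []
    for path in sorted(glob.glob(os.path.join(folder, "*.txt"))):
        with open(path, "r", encoding="utf-8") as f:
            text = f.read().strip()
        # Skip files that contain only whitespace
        if text:
            records.append({"source": os.path.basename(path), "text": text})
    return records
```

A list of such records can be passed to `pd.DataFrame(records)` to get one column for the source and one for the text.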

# **Step 1: Ask LLM Without RAG**

In [None]:
from transformers import pipeline

# We load a lightweight pre-trained language model that can generate text.
# This model answers questions only from its pretraining and NOT from our nutrition dataset.
llm = pipeline('text-generation', model='gpt2')

# We define a simple baseline question to show how the model behaves without RAG.
question = "What is healthy eating?"

# We ask the model directly, without retrieval, to get our baseline output.
response = llm(question, max_length=80, num_return_sequences=1)

# We print the raw answer from the model.
print("Response from LLM WITHOUT RAG:", response[0]['generated_text'])

Device set to use cpu
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Both `max_new_tokens` (=256) and `max_length`(=80) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


Response from LLM WITHOUT RAG: What is healthy eating?

A healthy diet is an important part of living a healthy life. These are the foods you eat whenever you want to be healthy. People often call these foods the "healthy foods." They include fruits, vegetables, whole grains, beans and grains.

You can find healthy eating at many food stores and convenience stores. Your local health food store can help you figure out if your food is healthy or not, but it is not always easy to find healthy foods.

Healthy Foods

It is easy to find healthy foods for your food choices.

There are over 100 types of healthy foods available on the market today. But this list is not all that comprehensive. Some of these are more specific than others.

Healthy Foods That Are Good for You

Tofu

Tofu is a delicious, healthy food. It is high in fiber, contains Vitamin A, Phytonutrients and a whole-grain protein.

Tofu doesn't contain processed grains, which are known to be a good source of fiber.

To find healt

**What we did and why:**

In this step, we tested how a pre-trained language model behaves before integrating our curated nutrition dataset. We loaded GPT-2 as a lightweight generator and asked it a simple nutrition question without giving it any supporting context. This allowed us to establish a baseline for comparison. Because the model answered purely from its general pretraining, the response was generic and did not reference Canada’s Food Guide, WHO guidelines, or any evidence-based recommendations. Observing these limitations demonstrated why retrieval was necessary: without RAG, the model cannot provide grounded, trustworthy, or Canada-specific nutrition guidance.

# **Step 2: Create dense embeddings for our nutrition documents**

## **Prepare a Dataset**

In [None]:
documents = load_nutrition_docs(DATA_DIR)

Found 33 files in /content/drive/MyDrive/NutriGuide_RAG/data


In [None]:
# Step 2: Prepare a Dataset (Knowledge Base)
# In our project, 'documents' is already defined from our nutrition txt files.
# It contains cleaned text from Canada's Food Guide, WHO, and Heart & Stroke.
print("Number of documents in our nutrition corpus:", len(documents))


Number of documents in our nutrition corpus: 33


## **Create dense embeddings**

In [None]:
from sentence_transformers import SentenceTransformer
import numpy as np
import pandas as pd

# We use MiniLM to embed each document into a dense vector for similarity search
embedding_model = SentenceTransformer('all-MiniLM-L6-v2')

embeddings = embedding_model.encode(documents)  # Generate embeddings for all docs

# We create a pandas DataFrame that links each document to its embedding
df = pd.DataFrame({
    "Document": documents,
    "Embedding": list(embeddings)  # one NumPy vector per row
})

print("Indexed Documents (first 3 rows):")
print(df.head(3))


Indexed Documents (first 3 rows):
                                            Document  \
0  Alcoholic drinks add extra calories with few n...   
1  Cultural foods and food traditions can be an i...   
2  Dietary shifts are changes that people can mak...   

                                           Embedding  
0  [0.034391064, 0.042760786, 0.011957302, 0.0681...  
1  [0.07672515, 0.08938183, 0.035383675, 0.055271...  
2  [-0.02657019, 0.011413579, 0.059902605, 0.0794...  


**What we did and why:**

We used the MiniLM sentence embedding model to convert each nutrition document into a dense vector and stored both the raw text and its embedding in a pandas DataFrame, which serves as a simple, searchable index for our RAG system.

## **Define a retrieval function**

In [None]:
# We define a retrieval function that finds the most relevant nutrition documents
def retrieve_with_pandas(query, top_k=2):
    """
    We embed the user's query and compute cosine similarity with each document embedding.
    Then we return the top-k most similar nutrition documents.
    """
    # Generate embedding for the query
    query_embedding = embedding_model.encode([query])[0]

    # Compute cosine similarity between query and each document embedding
    def cosine_sim(vec):
        vec = np.array(vec)
        return np.dot(query_embedding, vec) / (np.linalg.norm(query_embedding) * np.linalg.norm(vec))

    df['Similarity'] = df['Embedding'].apply(cosine_sim)

    # Sort by similarity and return top-k results
    results = df.sort_values(by="Similarity", ascending=False).head(top_k)
    return results[["Document", "Similarity"]]


**What we did and why:**

To retrieve relevant context, we implemented retrieve_with_pandas(), which embeds the user’s question and computes cosine similarity between the query embedding and each document embedding in the DataFrame. The function returns the top-k nutrition documents with the highest similarity scores, forming the context we later pass to the language model.
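For reference, the cosine similarity used here is the dot product of two vectors divided by the product of their lengths. The sketch below shows the same top-k ranking computed in one matrix operation instead of row by row. It is an alternative illustration, not the function used in our notebook, and `top_k_cosine` is a hypothetical name.

```python
import numpy as np


def top_k_cosine(query_vec, doc_matrix, k=2):
    """Return (indices, scores) of the k rows of doc_matrix most similar to query_vec."""
    query_vec = np.asarray(query_vec, dtype=float)
    doc_matrix = np.asarray(doc_matrix, dtype=float)
    # Normalize so that the dot product equals cosine similarity
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    scores = d @ q
    # Sort descending and keep the first k indices
    order = np.argsort(-scores)[:k]
    return order, scores[order]


# Toy example with 2-dimensional vectors
docs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
indices, scores = top_k_cosine([1.0, 0.0], docs, k=2)
print(indices, scores)
```

With the notebook's variables, the call would be `top_k_cosine(embedding_model.encode([query])[0], embeddings, k=2)`, and the returned indices select rows of `df`.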

## **RAG function using our LLM pipeline**

In [None]:
# Define a simple RAG function that uses retrieved context + LLM
def generate_with_rag(query, top_k=2):
    """
    We retrieve the most relevant nutrition documents using our index
    and then ask the language model to answer the question using that context.
    """
    # Retrieve top-k relevant documents
    results = retrieve_with_pandas(query, top_k=top_k)
    context_texts = list(results["Document"])

    # Combine the context into a single block of text
    context_block = "\n\n".join(context_texts)

    # Build the input prompt for the LLM
    input_text = (
        "You are NutriGuide, a safe nutrition assistant. "
        "Use ONLY the context below to answer the question. "
        "Do NOT give medical advice or personalized meal plans.\n\n"
        f"Context:\n{context_block}\n\n"
        f"Question: {query}\n\nAnswer in a short, clear paragraph:"
    )

    # Use our LLM (gpt2 pipeline defined earlier) to generate a response
    response = llm(input_text, max_length=200, num_return_sequences=1)
    return response[0]['generated_text']


**What we did and why:**

Our generate_with_rag() function first calls retrieve_with_pandas() to obtain the most relevant nutrition documents, merges them into a single context block, and then feeds this context together with the user’s question into the GPT-2 language model. The prompt explicitly instructs the model to answer only using the provided context and to avoid medical or highly personalized advice. This mirrors the teacher’s RAG example but is adapted to our curated nutrition dataset and safety constraints.
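By default, the text-generation pipeline returns the prompt followed by the continuation, so the returned string begins with the instructions and context. A small helper can separate the two. The sketch below is an illustration under that assumption; `build_prompt` mirrors the prompt text used in generate_with_rag(), while `extract_answer` and `ANSWER_MARKER` are hypothetical names that are not in our notebook.

```python
ANSWER_MARKER = "Answer in a short, clear paragraph:"


def build_prompt(context_texts, query):
    """Assemble the safety instructions, retrieved context, and question into one prompt."""
    context_block = "\n\n".join(context_texts)
    return (
        "You are NutriGuide, a safe nutrition assistant. "
        "Use ONLY the context below to answer the question. "
        "Do NOT give medical advice or personalized meal plans.\n\n"
        f"Context:\n{context_block}\n\n"
        f"Question: {query}\n\n{ANSWER_MARKER}"
    )


def extract_answer(generated_text, prompt):
    """Drop the echoed prompt from the generated text, if present."""
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):].strip()
    # Fall back to the text after the last answer marker, if the marker exists
    _, sep, tail = generated_text.rpartition(ANSWER_MARKER)
    return tail.strip() if sep else generated_text.strip()


prompt = build_prompt(["Eat a variety of healthy foods each day."], "What is healthy eating?")
print(extract_answer(prompt + " Eat a variety of foods.", prompt))
```

Alternatively, the Hugging Face text-generation pipeline accepts `return_full_text=False`, which returns only the newly generated text, and `max_new_tokens` limits the length of the continuation independently of the prompt length.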

## **Test query**

In [None]:
# Test the retrieval and RAG generation on a nutrition-related question
query = "What are some key recommendations for healthy eating?"

print("\nTop Retrieved Documents:")
results = retrieve_with_pandas(query, top_k=2)
print(results)

print("\nRAG Answer:")
print(generate_with_rag(query, top_k=2))


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Both `max_new_tokens` (=256) and `max_length`(=200) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)



Top Retrieved Documents:
                                             Document  Similarity
12  Healthy eating is influenced not only by what ...    0.742608
30  A healthy diet throughout life helps prevent m...    0.728598

RAG Answer:
You are NutriGuide, a safe nutrition assistant. Use ONLY the context below to answer the question. Do NOT give medical advice or personalized meal plans.

Context:
Healthy eating is influenced not only by what you eat but also by where, when, why, and how
you eat. Being mindful of eating habits includes taking time to eat and noticing feelings
of hunger and fullness. Cooking more often and involving others in planning and preparing
meals can support healthier choices. Enjoying food is also important, and culture and food
traditions can be a meaningful part of healthy eating. Eating meals with others can help
strengthen social connections and support positive eating habits.

Making it a habit to eat a variety of healthy foods each day is recommended. Thi

# **Step 3: Ask Question with RAG**

In [None]:
# Ask the same question again, but this time USING RAG

# We reuse the same 'question' variable we used in the baseline test
print("Original question:", question)

# We now generate an answer using our RAG pipeline (retrieval + generation)
response_rag = generate_with_rag(question)

print("Response from LLM WITH RAG:")
print(response_rag)


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Both `max_new_tokens` (=256) and `max_length`(=200) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


Original question: What is healthy eating?
Response from LLM WITH RAG:
You are NutriGuide, a safe nutrition assistant. Use ONLY the context below to answer the question. Do NOT give medical advice or personalized meal plans.

Context:
A healthy diet throughout life helps prevent malnutrition in all its forms and reduces the
risk of noncommunicable diseases such as diabetes, heart disease, stroke, and cancer.
However, global eating patterns have shifted due to increased production of processed foods,
urbanization, and changing lifestyles. Many people now consume more foods high in energy,
fat, free sugars, and salt, while not eating enough fruits, vegetables, and whole grains.

The basic principles of a healthy diet remain consistent across different ages, cultures, and
food environments, although specific needs may vary based on age, gender, lifestyle, and
activity level. A healthy diet includes a variety of fruits, vegetables, legumes, nuts, and
whole grains, with limited intake of fr


**What we did and why:**

To evaluate the impact of retrieval, we asked the language model the same question twice: first without RAG, and then using our generate_with_rag() pipeline. In the baseline setting, the model answered purely from its pretraining and produced a long, generic description of “healthy eating” that was not clearly tied to any guideline. In contrast, the RAG version first retrieved relevant passages from our curated nutrition dataset and then generated an answer based on that context. This second answer was more focused, aligned with public health recommendations, and avoided hallucinated or unnecessary details, showing that retrieval significantly improved the grounding and quality of responses.

# **Step 4: Compare the Results**

In [None]:
# Compare the baseline and RAG responses side by side

print("=== WITHOUT RAG (Baseline LLM Response) ===")
print(response[0]['generated_text'])
print("\n****\n")

print("=== WITH RAG (NutriGuide RAG Response) ===")
print(response_rag)


=== WITHOUT RAG (Baseline LLM Response) ===
What is healthy eating?

A healthy diet is an important part of living a healthy life. These are the foods you eat whenever you want to be healthy. People often call these foods the "healthy foods." They include fruits, vegetables, whole grains, beans and grains.

You can find healthy eating at many food stores and convenience stores. Your local health food store can help you figure out if your food is healthy or not, but it is not always easy to find healthy foods.

Healthy Foods

It is easy to find healthy foods for your food choices.

There are over 100 types of healthy foods available on the market today. But this list is not all that comprehensive. Some of these are more specific than others.

Healthy Foods That Are Good for You

Tofu

Tofu is a delicious, healthy food. It is high in fiber, contains Vitamin A, Phytonutrients and a whole-grain protein.

Tofu doesn't contain processed grains, which are known to be a good source of fiber.



**What we did and why:**

To evaluate the effect of retrieval on answer quality, we asked the same question using two different methods: (1) the language model alone, and (2) our NutriGuide RAG system. The baseline GPT-2 model produced a vague and repetitive answer that relied solely on its general pretraining and did not reference any nutrition guidelines. In contrast, the RAG-enhanced version retrieved relevant information from our curated nutrition dataset and generated a response grounded in evidence-based sources such as Canada’s Food Guide and WHO recommendations. The RAG answer was clearer, more accurate, and aligned with public-health principles, demonstrating a clear improvement in safety, relevance, and grounding.
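
One rough way to make this comparison less subjective is to measure how much of each answer's vocabulary also appears in the retrieved context. The sketch below is an illustration only and was not part of our evaluation: `content_words` and `grounding_score` are hypothetical helpers, the example strings are shortened excerpts, and simple word overlap is a crude proxy for grounding, not a measure of accuracy.

```python
import re
from typing import Set


def content_words(text: str) -> Set[str]:
    """Lowercase the text and keep alphabetic words longer than three letters."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if len(w) > 3}


def grounding_score(answer: str, context: str) -> float:
    """Fraction of the answer's content words that also occur in the context."""
    answer_words = content_words(answer)
    if not answer_words:
        return 0.0
    return len(answer_words & content_words(context)) / len(answer_words)


# Shortened excerpts used purely as example inputs
context = "A healthy diet includes a variety of fruits, vegetables, legumes, nuts, and whole grains."
baseline = "Tofu is a delicious, healthy food."
grounded = "A healthy diet includes fruits, vegetables, and whole grains."

print("baseline:", grounding_score(baseline, context))
print("grounded:", grounding_score(grounded, context))
```

A higher score only means the answer reuses words from the context, so it would complement a manual read of the answers rather than replace it.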

# **Changes to the Model**

**Did you need to change the model in any way before making it application ready, or fine-tune it?**

In our project, we did not change the underlying weights of the language model or fine-tune it. Instead, we kept the pre-trained DistilGPT-2 model as-is and made it application ready by wrapping it inside a RAG pipeline and adding strong prompt-level guardrails.

The main changes we made were:

- Restricting the model to short, grounded answers using a custom prompt template.

- Instructing it to only use information from the retrieved nutrition context.

- Adding explicit safety rules (no medical advice, no calorie targets, no personalized diet plans, no extreme diets).

This approach allowed us to adapt the model to our nutrition chatbot use case without fine-tuning, which is safer and more appropriate for a small course project dealing with health-related content.
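
The guardrails described above live entirely in the prompt. A minimal sketch of such a template follows. The exact rule wording and the names `SAFETY_RULES` and `build_guarded_prompt` are illustrative assumptions, not the exact text used in our notebook.

```python
from typing import List

# Illustrative rule wording (an assumption); one entry per safety rule
SAFETY_RULES = [
    "Do NOT give medical advice.",
    "Do NOT give calorie targets.",
    "Do NOT give personalized diet plans.",
    "Do NOT recommend extreme diets.",
]


def build_guarded_prompt(context_texts: List[str], query: str) -> str:
    """Combine the safety rules, retrieved context, and question into one prompt."""
    rules = " ".join(SAFETY_RULES)
    context_block = "\n\n".join(context_texts)
    return (
        "You are NutriGuide, a safe nutrition assistant. "
        "Use ONLY the context below to answer the question. "
        f"{rules}\n\n"
        f"Context:\n{context_block}\n\n"
        f"Question: {query}\n\nAnswer in a short, clear paragraph:"
    )


prompt = build_guarded_prompt(
    ["Making it a habit to eat a variety of healthy foods each day is recommended."],
    "What is healthy eating?",
)
print(prompt)
```

Keeping the rules in a list makes it easy to add or reword a rule without touching the rest of the template.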

# **Data Responses**

## **How we receive the data from the user**



For our application, the user interacts with NutriGuide through a simple text-based chat interface (for now, this is implemented in a notebook cell / web-style input box). The user types a natural-language question like “How can I eat more vegetables?” or “What are healthier drink choices?” and we pass that string into our RAG pipeline as the query.

We designed the assistant so that it does not need detailed personal information (exact weight, diagnoses, medications, etc.) in order to give useful, general nutrition guidance. In a more polished UI, we would also include a short note asking users not to enter sensitive medical or identity data.


## **How the data is cleaned after it is received**




Once the user submits a question, we apply light, safe preprocessing before using it:

- Trim leading/trailing whitespace

- Drop completely empty or whitespace-only inputs

- Optionally normalize spacing and basic punctuation

We do not change the meaning of the input; we just standardize it enough so embeddings and retrieval work reliably.

On top of this, the guardrails in our prompt act as a second layer of cleaning at the meaning level:

- Even if the user includes age, gender, or medical details, we instruct the model not to give personalized meal plans, calorie targets, or diagnoses.

- If a question clearly requires medical or treatment advice, the assistant is instructed to refuse and recommend talking to a health professional instead of using that information to generate a risky answer.

So, we keep text preprocessing minimal, but we enforce strong semantic safety through the system prompt and RAG design.
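
The light preprocessing steps above can be sketched as a single helper. This is a minimal illustration under our own assumptions: `clean_query` is a hypothetical name, and the punctuation rule is just one simple way to "normalize spacing and basic punctuation".

```python
import re
from typing import Optional


def clean_query(raw: str) -> Optional[str]:
    """Return a lightly normalized query, or None if the input is empty."""
    # Collapse runs of whitespace (spaces, tabs, newlines) and trim the ends
    text = re.sub(r"\s+", " ", raw).strip()
    if not text:
        # Whitespace-only input: signal the caller to skip retrieval
        return None
    # Remove spaces that appear before basic punctuation, e.g. "vegetables ?"
    text = re.sub(r"\s+([?.!,])", r"\1", text)
    return text


print(clean_query("   How can I eat   more vegetables ?  "))
print(clean_query("   \n  "))
```

Returning `None` for empty input lets the calling code skip the embedding and retrieval steps instead of sending an empty string to the model.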


## **How and where the data is stored**





For this course project, we do not persist real user data in any external database:
- Inputs are held only in memory inside the running notebook session to generate a single response.
- If we log example conversations, they are synthetic test prompts we wrote ourselves (e.g., “a student wants cheaper healthy snacks”) and are not tied to any user identity.

In a real deployment scenario, we would:

- Store only anonymized logs of queries (no names, no contact info, no direct identifiers).

- Use them only for debugging and evaluation (e.g., checking if the model followed guardrails).

- Host logs in a secure backend with access control and a clear privacy policy.
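
As a rough sketch of what an anonymized log entry could look like in such a deployment, the example below redacts obvious contact details before anything is stored. Everything here is hypothetical: `make_log_entry`, the field names, and the two regular expressions are assumptions for illustration, and patterns this simple will not catch every identifier.

```python
import re
from datetime import datetime, timezone
from typing import Dict

# Simple illustrative patterns (assumptions); they will miss many identifiers
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def make_log_entry(query: str, followed_guardrails: bool) -> Dict[str, object]:
    """Build a log record with no user ID and with contact details redacted."""
    redacted = EMAIL_RE.sub("[email]", query)
    redacted = PHONE_RE.sub("[phone]", redacted)
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "query": redacted,
        "followed_guardrails": followed_guardrails,
    }


entry = make_log_entry(
    "Email me at sam@example.com or call 416-555-0199 about snacks", True
)
print(entry["query"])
```

The record deliberately has no user or session identifier, so entries can be used for guardrail checks without being tied back to a person.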


## **Is the cleaning process the same in all environments?**




In our current setup (notebook / simple web-style interface), yes, the same cleaning process applies to all users:

- Strip whitespace, ignore empty queries
- Send the cleaned text to the embedding model and retriever
- Rely on the same guardrails and refusal rules for safety

If NutriGuide were deployed across different environments (for example, a web app, mobile app, or voice assistant), the core cleaning logic would stay the same (normalize text, avoid storing sensitive information, enforce guardrails). However, we would add channel-specific steps:

**For voice input:** transcribe speech to text, then run the same cleaning.

**For messaging apps:** normalize emojis, short forms, or remove attachment placeholders, then process as usual.

But in all cases, our design principle stays consistent:
- minimal technical cleaning,
- no need for personal identifiers,
- safety enforced at the prompt/behavior level rather than by collecting more data.

# **Challenges and ethical and bias issues**

**What are the challenges?**

One of the main challenges in our project is that users often expect personalized diet plans, medical advice, or weight-loss instructions when interacting with a nutrition chatbot, and these are unsafe for an AI model to provide. Nutrition questions can also be very broad or vague, making it difficult for the model to determine what the user is actually asking. On the technical side, we had to make sure the model retrieves the correct information from our dataset instead of hallucinating.

Preparing the dataset for RAG was one of the biggest practical challenges. The sources we used (Canada’s Food Guide, WHO documents, Heart & Stroke Foundation) were extremely long and repetitive, and they were formatted with bullet points, lists, headings, and duplicated sections. We had to manually clean, merge, and rewrite large parts of the text to make it usable. We also had to chunk the data carefully so the model could retrieve meaningful sections without cutting sentences or losing important context.

We also struggled with technical issues inside Google Colab, such as metadata errors, widget states breaking the notebook, long loading times, and memory issues when generating embeddings. Making sure Google Drive connected properly and ensuring the dataset loaded correctly in the notebook took multiple attempts. Finally, customizing the instructor’s notebook to fit our project required rewriting functions, adding safety guardrails, updating prompts, and changing the retrieval system so the model only uses our nutrition dataset and not the original sample text.

**What ethical and bias issues will users be facing?**

Users may face ethical risks such as accidentally receiving information that feels medical when it is not intended to be. Our dataset is based on Canadian and WHO sources, which means it may not reflect all cultural diets, food traditions, or regional foods. This can create bias toward Western healthy-eating patterns. Some users may also rely too heavily on the tool, assuming it can replace a dietitian or a healthcare professional, which is not the case.

**How did you address these issues in your application?**

To address safety and ethics, we added strict guardrails that instruct the model not to give personalized, medical, or diagnostic advice. The chatbot is designed to avoid calorie numbers, supplement dosages, and disease-specific diet plans, and to provide general guidance only. We also implemented refusal messages for unsafe topics and remind users to consult professionals. To reduce cultural bias, we structured answers around broad healthy-eating principles instead of specific recipes or culture-specific foods. Finally, using RAG grounds answers in the trusted sources in our dataset, which reduces the risk of hallucinated content even though it cannot eliminate it entirely.

**What are the edge cases where your model might fail?**

Our model may struggle or fail in cases where users ask for:

- medical diet plans (e.g., diabetes diets, heart-disease diets)

- extremely personalized instructions (e.g., “Tell me what to eat every hour”)

- weight-loss targets or calorie restrictions

- information not included in our dataset (e.g., TikTok diet trends)

- questions with multiple unrelated parts

- cultural foods or practices the dataset does not mention

In these cases, the model may not retrieve relevant text or may risk hallucinating.

**How can you address these?**

We address these failures by designing safe fallback responses. If the model cannot find information in the dataset, it tells the user that the information is unavailable. If the question is medical or unsafe, the model gives a refusal message and redirects the user to a professional. For vague or multi-part questions, we guide the user to clarify what they mean. In future versions, adding more diverse datasets, expanding cultural food sources, and fine-tuning the model on safer dialog patterns would further reduce these edge-case failures.
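
The fallback logic described above can be sketched as a small check that runs before generation. This is an illustration under stated assumptions, not our notebook's exact code: the threshold, the keyword list, the message wording, and the name `choose_response` are all hypothetical, and the similarity scores are assumed to come from the `Similarity` column returned by retrieve_with_pandas().

```python
from typing import List

# Hypothetical values for illustration; they would need tuning on real queries
MIN_SIMILARITY = 0.5
MEDICAL_TERMS = ("diabetes", "medication", "diagnos", "calorie", "lose weight")

REFUSAL = (
    "I can't give medical or personalized diet advice. "
    "Please talk to a health professional."
)
UNAVAILABLE = "I don't have information about that in my nutrition sources."


def choose_response(query: str, similarities: List[float]) -> str:
    """Return a fallback message, or 'ANSWER' when it is safe to run the RAG step."""
    lowered = query.lower()
    # Medical or unsafe topic: refuse and redirect to a professional
    if any(term in lowered for term in MEDICAL_TERMS):
        return REFUSAL
    # Nothing relevant retrieved: say the information is unavailable
    if not similarities or max(similarities) < MIN_SIMILARITY:
        return UNAVAILABLE
    return "ANSWER"


print(choose_response("What should I eat to manage my diabetes?", [0.71]))
print(choose_response("Is the TikTok egg diet good?", [0.21, 0.18]))
print(choose_response("What is healthy eating?", [0.74, 0.73]))
```

Keyword matching is a blunt filter, so it would work alongside the prompt-level guardrails rather than replace them.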

# **Feedback**

# **Links**


Github:
Google drive: https://drive.google.com/drive/folders/1oOSY-QsMjhM8ZkDPHsxS54dM0m7gGCwW?usp=sharing
  