# Cuisine Companion

**Introduction:**
>"Cuisine Companion: A Recipe Recommendation System Using Retrieval-Augmented Generation (RAG) with OpenAI Embeddings" is a practical implementation of recipe recommendation. It uses OpenAI's text-embedding-3-small embedding model inside a Retrieval-Augmented Generation (RAG) system to suggest recipes based on a user's preferences and available ingredients.


>The system is built on a collection of 255 Indian recipes from the Hugging Face dataset VishalMysore/indiandesert. Despite the dataset's name, it covers main courses as well as desserts. The recipe data is stored in MongoDB's Vector Database, which handles storage and retrieval for the recommendation system.


---


Throughout this project, we show how embedding models represent text as dense vectors. Combining retrieval over those vectors with a generation model yields recipe recommendations that are relevant to the user's preferences and culinary requirements.


The following sections walk through the implementation step by step, from data ingestion and cleaning to embedding generation, database setup, and a simple web interface. The aim is a tool that is capable yet easy to use.


## Installing the needed libraries and importing packages

* datasets: Hugging Face's library for downloading and loading datasets from the Hugging Face Hub. We use it to fetch the recipe dataset.

* pandas: Pandas is a powerful data manipulation library in Python. It provides data structures and functions for efficiently working with structured data, such as tabular data.

* openai: The official client for the OpenAI API. We use it to create embeddings and to generate responses with a GPT (Generative Pre-trained Transformer) chat model.

* pymongo: PyMongo is a Python library for interacting with MongoDB databases. Since our project involves storing and retrieving data from a MongoDB database, we'll need PyMongo to establish connections, perform queries, and manipulate data in the database.

Installing these packages up front gives the project everything it needs for data handling, embedding generation, text generation, and database interaction, and avoids import errors later in the notebook.

In [None]:
!pip install datasets pandas openai pymongo




In [None]:
#Importing libraries
from datasets import load_dataset
import pandas as pd

import warnings

# Ignore all warnings
warnings.filterwarnings("ignore")

In [None]:
# Using this dataset: https://huggingface.co/datasets/VishalMysore/indiandesert
dataset = load_dataset("VishalMysore/indiandesert")

# Convert the dataset to a pandas dataframe
dataset_df = pd.DataFrame(dataset['train'])



In [None]:
#display top 5 rows in the dataset
dataset_df.head(5)

Unnamed: 0,name,ingredients,diet,prep_time,cook_time,flavor_profile,course,state,region
0,Balu shahi,"Maida flour, yogurt, oil, sugar",vegetarian,45,25,sweet,dessert,West Bengal,East
1,Boondi,"Gram flour, ghee, sugar",vegetarian,80,30,sweet,dessert,Rajasthan,West
2,Gajar ka halwa,"Carrots, milk, sugar, ghee, cashews, raisins",vegetarian,15,60,sweet,dessert,Punjab,North
3,Ghevar,"Flour, ghee, kewra, milk, clarified butter, su...",vegetarian,15,30,sweet,dessert,Rajasthan,West
4,Gulab jamun,"Milk powder, plain flour, baking powder, ghee,...",vegetarian,15,40,sweet,dessert,West Bengal,East


## Data Cleaning and Preparation


### Checking the dataset columns

In [None]:
dataset_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 255 entries, 0 to 254
Data columns (total 9 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   name            255 non-null    object
 1   ingredients     255 non-null    object
 2   diet            255 non-null    object
 3   prep_time       255 non-null    int64 
 4   cook_time       255 non-null    int64 
 5   flavor_profile  255 non-null    object
 6   course          255 non-null    object
 7   state           255 non-null    object
 8   region          254 non-null    object
dtypes: int64(2), object(7)
memory usage: 18.1+ KB


### Shape of dataset

In [None]:
print("Columns:", dataset_df.columns)
print("\nNumber of rows and columns:", dataset_df.shape)

Columns: Index(['name', 'ingredients', 'diet', 'prep_time', 'cook_time',
       'flavor_profile', 'course', 'state', 'region'],
      dtype='object')

Number of rows and columns: (255, 9)


### Statistical summary of data

In [None]:
dataset_df.describe(include='all')

Unnamed: 0,name,ingredients,diet,prep_time,cook_time,flavor_profile,course,state,region
count,255,255,255,255.0,255.0,255,255,255,254
unique,255,252,2,,,5,4,25,7
top,Balu shahi,"Gram flour, ghee, sugar",vegetarian,,,spicy,main course,Gujarat,West
freq,1,2,226,,,133,129,35,74
mean,,,,31.105882,34.529412,,,,
std,,,,72.554409,48.26565,,,,
min,,,,-1.0,-1.0,,,,
25%,,,,10.0,20.0,,,,
50%,,,,10.0,30.0,,,,
75%,,,,20.0,40.0,,,,


### Checking for Missing Values

In [None]:
dataset_df.isnull().sum()

name              0
ingredients       0
diet              0
prep_time         0
cook_time         0
flavor_profile    0
course            0
state             0
region            1
dtype: int64

In [None]:
# Remove the row where the region column is missing
dataset_df = dataset_df.dropna(subset=['region'])

print("\nNumber of missing values in each column after removal:")
print(dataset_df.isnull().sum())



Number of missing values in each column after removal:
name              0
ingredients       0
diet              0
prep_time         0
cook_time         0
flavor_profile    0
course            0
state             0
region            0
dtype: int64


### Checking for Duplicates

In [None]:
dataset_df.duplicated().sum()

0

## Creating embeddings with OpenAI



### Creating Combined Text Column

In [None]:
dataset_df['combined_text'] = (
    'Name: ' + dataset_df['name'] + ' ' +
    'Ingredients: ' + dataset_df['ingredients'] + ' ' +
    'Diet: ' + dataset_df['diet'] + ' ' +
    'Flavor Profile: ' + dataset_df['flavor_profile'] + ' ' +
    'Course: ' + dataset_df['course'] + ' ' +
    'State: ' + dataset_df['state'] + ' ' +
    'Region: ' + dataset_df['region']
)
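
To make the shape of `combined_text` concrete, here is a minimal sketch that builds the same labelled string for a single record using plain Python. The helper name `build_combined_text` and the `FIELDS` list are hypothetical; the field order mirrors the cell above, and the sample row is taken from the dataset preview.

```python
# Hypothetical helper: builds the same labelled string as the DataFrame
# expression above, but for one record held in a plain dict.
FIELDS = [
    ("Name", "name"),
    ("Ingredients", "ingredients"),
    ("Diet", "diet"),
    ("Flavor Profile", "flavor_profile"),
    ("Course", "course"),
    ("State", "state"),
    ("Region", "region"),
]


def build_combined_text(record):
    """Join 'Label: value' pairs with single spaces, in FIELDS order."""
    return " ".join(f"{label}: {record[key]}" for label, key in FIELDS)


row = {
    "name": "Boondi",
    "ingredients": "Gram flour, ghee, sugar",
    "diet": "vegetarian",
    "flavor_profile": "sweet",
    "course": "dessert",
    "state": "Rajasthan",
    "region": "West",
}
print(build_combined_text(row))
```

Note that `prep_time` and `cook_time` are left out of the combined text, so the embeddings carry no timing information.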

In [None]:
dataset_df

Unnamed: 0,name,ingredients,diet,prep_time,cook_time,flavor_profile,course,state,region,combined_text
0,Balu shahi,"Maida flour, yogurt, oil, sugar",vegetarian,45,25,sweet,dessert,West Bengal,East,"Name: Balu shahi Ingredients: Maida flour, yog..."
1,Boondi,"Gram flour, ghee, sugar",vegetarian,80,30,sweet,dessert,Rajasthan,West,"Name: Boondi Ingredients: Gram flour, ghee, su..."
2,Gajar ka halwa,"Carrots, milk, sugar, ghee, cashews, raisins",vegetarian,15,60,sweet,dessert,Punjab,North,"Name: Gajar ka halwa Ingredients: Carrots, mil..."
3,Ghevar,"Flour, ghee, kewra, milk, clarified butter, su...",vegetarian,15,30,sweet,dessert,Rajasthan,West,"Name: Ghevar Ingredients: Flour, ghee, kewra, ..."
4,Gulab jamun,"Milk powder, plain flour, baking powder, ghee,...",vegetarian,15,40,sweet,dessert,West Bengal,East,"Name: Gulab jamun Ingredients: Milk powder, pl..."
...,...,...,...,...,...,...,...,...,...,...
250,Til Pitha,"Glutinous rice, black sesame seeds, gur",vegetarian,5,30,sweet,dessert,Assam,North East,"Name: Til Pitha Ingredients: Glutinous rice, b..."
251,Bebinca,"Coconut milk, egg yolks, clarified butter, all...",vegetarian,20,60,sweet,dessert,Goa,West,"Name: Bebinca Ingredients: Coconut milk, egg y..."
252,Shufta,"Cottage cheese, dry dates, dried rose petals, ...",vegetarian,-1,-1,sweet,dessert,Jammu & Kashmir,North,"Name: Shufta Ingredients: Cottage cheese, dry ..."
253,Mawa Bati,"Milk powder, dry fruits, arrowroot powder, all...",vegetarian,20,45,sweet,dessert,Madhya Pradesh,Central,"Name: Mawa Bati Ingredients: Milk powder, dry ..."


### Generating Embeddings for Text

In [None]:
import openai
from google.colab import userdata

openai.api_key = ""  # Set your OpenAI API key here; it is left blank in this notebook

EMBEDDING_MODEL = "text-embedding-3-small"

def get_embedding(text):
    """Generate an embedding for the given text using OpenAI's API."""

    # Check for valid input
    if not text or not isinstance(text, str):
        return None

    try:
        # Call OpenAI API to get the embedding
        embedding = openai.embeddings.create(input=text, model=EMBEDDING_MODEL).data[0].embedding
        return embedding
    except Exception as e:
        print(f"Error in get_embedding: {e}")
        return None
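
The function above makes one API request per text. As a possible alternative, the sketch below groups texts into batches so that each request can carry several inputs. The names `embed_in_batches`, `embed_batch`, and `fake_embed_batch` are hypothetical, and the batch size is an arbitrary choice. The embedding call is injected as a function so that the sketch runs offline with a stand-in.

```python
def embed_in_batches(texts, embed_batch, batch_size=100):
    """Embed texts chunk by chunk.

    embed_batch is any callable that maps a list of strings to a list of
    vectors in the same order.
    """
    embeddings = []
    for start in range(0, len(texts), batch_size):
        chunk = texts[start:start + batch_size]
        embeddings.extend(embed_batch(chunk))
    return embeddings


# Stand-in for the real API call so the sketch runs without network access.
# It returns a one-dimensional "vector" holding the length of each text.
def fake_embed_batch(chunk):
    return [[float(len(text))] for text in chunk]


vectors = embed_in_batches(
    ["ghee", "sugar", "milk powder"], fake_embed_batch, batch_size=2
)
print(vectors)

# With the OpenAI client, embed_batch could be written along these lines
# (an untested assumption, not code from this notebook):
#   lambda chunk: [item.embedding for item in
#                  openai.embeddings.create(input=chunk, model=EMBEDDING_MODEL).data]
```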



### Applying Embedding Generation to DataFrame

In [None]:
dataset_df["optimized_search"] = dataset_df['combined_text'].apply(get_embedding)

dataset_df.head()

Unnamed: 0,name,ingredients,diet,prep_time,cook_time,flavor_profile,course,state,region,combined_text,optimized_search
0,Balu shahi,"Maida flour, yogurt, oil, sugar",vegetarian,45,25,sweet,dessert,West Bengal,East,"Name: Balu shahi Ingredients: Maida flour, yog...","[0.004400425590574741, -0.03774718940258026, -..."
1,Boondi,"Gram flour, ghee, sugar",vegetarian,80,30,sweet,dessert,Rajasthan,West,"Name: Boondi Ingredients: Gram flour, ghee, su...","[-0.004895045887678862, -0.02972494624555111, ..."
2,Gajar ka halwa,"Carrots, milk, sugar, ghee, cashews, raisins",vegetarian,15,60,sweet,dessert,Punjab,North,"Name: Gajar ka halwa Ingredients: Carrots, mil...","[-0.027340682223439217, 0.0008313045254908502,..."
3,Ghevar,"Flour, ghee, kewra, milk, clarified butter, su...",vegetarian,15,30,sweet,dessert,Rajasthan,West,"Name: Ghevar Ingredients: Flour, ghee, kewra, ...","[-0.04140964895486832, -0.00465221656486392, 0..."
4,Gulab jamun,"Milk powder, plain flour, baking powder, ghee,...",vegetarian,15,40,sweet,dessert,West Bengal,East,"Name: Gulab jamun Ingredients: Milk powder, pl...","[-0.02119562216103077, -0.03431466594338417, 0..."


## Vector Database Setup and Data Ingestion



In [None]:
import pymongo
from google.colab import userdata

def get_mongo_client(mongo_uri):
  """Establish connection to the MongoDB."""
  try:
    client = pymongo.MongoClient(mongo_uri)
    print("Connection to MongoDB successful")
    return client
  except pymongo.errors.ConnectionFailure as e:
    print(f"Connection failed: {e}")
    return None

mongo_uri = ""  # Set your MongoDB connection string here; it is left blank in this notebook
if not mongo_uri:
  print("mongo_uri is empty; set it to your MongoDB connection string")

mongo_client = get_mongo_client(mongo_uri)



Connection to MongoDB successful


In [None]:
# Convert DataFrame to dictionary format
data = dataset_df.to_dict(orient='records')

In [None]:
# Ingest data into MongoDB
db = mongo_client['database_9e526']
collection = db['food_recipes']

In [None]:
# Delete any existing records in the collection
collection.delete_many({})

DeleteResult({'n': 254, 'ok': 1.0}, acknowledged=True)

In [None]:
documents = dataset_df.to_dict('records')
collection.insert_many(documents)

print("Data ingestion into MongoDB completed")

Data ingestion into MongoDB completed
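
The search pipeline in this notebook refers to an index named `vector_index` on the `optimized_search` field, but the notebook does not show how that index was created. As a sketch only, assuming a MongoDB Atlas deployment, an Atlas Vector Search index definition could look like the dictionary below. The `cosine` similarity setting is an assumption; 1536 is the default output dimension of `text-embedding-3-small`. Other MongoDB-compatible services may use a different mechanism.

```python
# Sketch of an Atlas Vector Search index definition (assumed, not taken
# from the notebook).
EMBEDDING_DIMENSIONS = 1536  # default vector size for text-embedding-3-small

vector_index_definition = {
    "fields": [
        {
            "type": "vector",
            "path": "optimized_search",  # field holding the embeddings
            "numDimensions": EMBEDDING_DIMENSIONS,
            "similarity": "cosine",  # assumption; not stated in the notebook
        }
    ]
}
print(vector_index_definition)
```

In Atlas, a definition like this can be supplied in the JSON editor when creating a Vector Search index named `vector_index` on the `food_recipes` collection.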


## Searching the MongoDB

This function executes a vector search within the MongoDB collection using the provided user query.

Parameters:
user_query (str): The query string submitted by the user.
collection (MongoCollection): The MongoDB collection to search.

Returns:
list: A list containing the matching documents retrieved from the collection.

In [None]:
def vector_search(user_query, collection):
    # Generate embedding for the user query
    query_embedding = get_embedding(user_query)

    if query_embedding is None:
        # Return an empty list so the return type always matches the description above
        print("Invalid query or embedding generation failed.")
        return []

    # Define the vector search pipeline
    pipeline = [
        {
            "$vectorSearch": {
                "index": "vector_index",
                "queryVector": query_embedding,
                "path": "optimized_search",
                "numCandidates": 150,  # Number of candidate matches to consider
                "limit": 5  # Return top 5 matches
            }
        },
        {
            "$project": {
                "_id": 0,
                "name": 1,
                "ingredients": 1,
                "diet": 1,
                "prep_time": 1,
                "cook_time": 1,
                "flavor_profile": 1,
                "course": 1,
                "state": 1,
                "region": 1,
                "score": {
                    "$meta": "vectorSearchScore"  # Include the search score
                }
            }
        }
    ]

    # Execute the search
    results = collection.aggregate(pipeline)
    return list(results)
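
To illustrate the idea behind vector search, the sketch below ranks a few toy documents by cosine similarity to a query vector. It is an exact, brute-force stand-in written for this explanation, not what `$vectorSearch` runs internally, and the scores it produces are not claimed to match `vectorSearchScore`. The names `cosine_similarity`, `top_k`, and `toy_docs` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vector, documents, k=5):
    """Score every document against the query and keep the k best."""
    scored = [
        {
            "name": doc["name"],
            "score": cosine_similarity(query_vector, doc["optimized_search"]),
        }
        for doc in documents
    ]
    return sorted(scored, key=lambda item: item["score"], reverse=True)[:k]


# Two-dimensional toy vectors; real embeddings have many more dimensions.
toy_docs = [
    {"name": "A", "optimized_search": [1.0, 0.0]},
    {"name": "B", "optimized_search": [0.0, 1.0]},
    {"name": "C", "optimized_search": [1.0, 1.0]},
]
print([hit["name"] for hit in top_k([1.0, 0.1], toy_docs, k=2)])
```

With a few hundred recipes a brute-force scan like this would be fast enough, but an indexed search scales better as the collection grows.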


## Handling User Query and Result



This function reads every document in the MongoDB collection and combines each recipe's attributes (name, ingredients, and so on) into a single context string. It then asks OpenAI's GPT-3.5 Turbo model to answer the user's query using that string as context, and returns both the generated response and the context for further processing or display. Note that it does not call `vector_search`; the full collection is sent as context.







In [None]:
def handle_user_query(query, collection):
    # Retrieve knowledge from the MongoDB collection
    search_result = ''
    for result in collection.find():
        search_result += f"Name: {result.get('name', 'N/A')}, Ingredients: {result.get('ingredients', 'N/A')}, Diet: {result.get('diet', 'N/A')}, Prep Time: {result.get('prep_time', 'N/A')}, Cook Time: {result.get('cook_time', 'N/A')}, Flavor Profile: {result.get('flavor_profile', 'N/A')}, Course: {result.get('course', 'N/A')}, State: {result.get('state', 'N/A')}, Region: {result.get('region', 'N/A')}\n"

    # Generate response using OpenAI's GPT-3.5 model
    completion = openai.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "You are a Food recommendation system."},
            {"role": "user", "content": "Answer this user query: " + query + " with the following context: " + search_result}
        ]
    )

    return (completion.choices[0].message.content), search_result
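
Because the function above sends the whole collection to the model, the context grows with the dataset. One possible refinement is to build the context only from the top matches of a retrieval step. The sketch below shows a formatting helper for that purpose; `format_context` and `sample_results` are hypothetical, and the sample record is illustrative rather than real query output.

```python
# Labels and keys in the same order as the context string built above.
CONTEXT_FIELDS = [
    ("Name", "name"),
    ("Ingredients", "ingredients"),
    ("Diet", "diet"),
    ("Prep Time", "prep_time"),
    ("Cook Time", "cook_time"),
    ("Flavor Profile", "flavor_profile"),
    ("Course", "course"),
    ("State", "state"),
    ("Region", "region"),
]


def format_context(results):
    """Turn a list of recipe dicts into one context line per recipe."""
    lines = []
    for result in results:
        parts = [
            f"{label}: {result.get(key, 'N/A')}" for label, key in CONTEXT_FIELDS
        ]
        lines.append(", ".join(parts))
    return "\n".join(lines)


# Illustrative record with some fields missing on purpose.
sample_results = [
    {"name": "Basundi", "ingredients": "Sugar, milk, nuts", "state": "Gujarat"}
]
print(format_context(sample_results))
```

Inside `handle_user_query`, `format_context(vector_search(query, collection))` could then take the place of the loop over `collection.find()`, provided `vector_search` returns a list of documents and the `vector_index` index exists.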


## Conduct query with retrieval of sources

In [None]:
query = "suggest me top 5 desserts to make with milk"
response, source_information = handle_user_query(query, collection)

print(f"Response: {response}")
print(f"Source Information: \n{source_information}")


Response: Here are the top 5 desserts to make with milk based on the context provided:

1. **Basundi** - A sweet dessert from Gujarat made with sugar, milk, and nuts.
2. **Doodhpak** - A sweet dessert from Gujarat made with milk, rice, and dry fruits.
3. **Malai curry** - A dessert from West Bengal made with coconut milk, lobster, and spices.
4. **Ras malai** - A sweet dessert from West Bengal made with chhena, reduced milk, and pistachio.
5. **Sohan papdi** - A sweet dessert from Maharashtra made with gram flour, ghee, sugar, and milk.

These desserts are sure to satisfy your sweet cravings with their delicious flavors and textures. Enjoy making and indulging in these delightful treats!
Source Information: 
Name: Balu shahi, Ingredients: Maida flour, yogurt, oil, sugar, Diet: vegetarian, Prep Time: 45, Cook Time: 25, Flavor Profile: sweet, Course: dessert, State: West Bengal, Region: East
Name: Chhena jalebi, Ingredients: Chhena, sugar, ghee, Diet: vegetarian, Prep Time: 10, Cook Time

## Web App Creation

Let's wrap the recommendation pipeline built above in a web app so that it can be queried interactively.

In [None]:
!npm install localtunnel
!pip install streamlit

npm WARN saveError ENOENT: no such file or directory, open '/content/package.json'
npm WARN enoent ENOENT: no such file or directory, open '/content/package.json'
npm WARN content No description
npm WARN content No repository field.
npm WARN content No README data
npm WARN content No license field.

+ localtunnel@2.0.2
updated 1 package and audited 36 packages in 0.572s

3 packages are looking for funding
  run `npm fund` for details

found 2 moderate severity vulnerabilities
  run `npm audit fix` to fix them, or `npm audit` for details


In [None]:
%%writefile app.py
import os

import openai
import pymongo
import streamlit as st

# Read credentials from the environment instead of hardcoding them in the file
openai.api_key = os.environ.get("OPENAI_API_KEY", "")

EMBEDDING_MODEL = "text-embedding-3-small"

def get_mongo_client(mongo_uri):
  """Establish connection to the MongoDB."""
  try:
    client = pymongo.MongoClient(mongo_uri)
    print("Connection to MongoDB successful")
    return client
  except pymongo.errors.ConnectionFailure as e:
    print(f"Connection failed: {e}")
    return None

mongo_uri = os.environ.get("MONGO_URI", "")
if not mongo_uri:
  print("MONGO_URI not set in environment variables")

mongo_client = get_mongo_client(mongo_uri)


# Connect to the collection populated earlier in the notebook
db = mongo_client['database_9e526']
collection = db['food_recipes']


def get_embedding(text):
    """Generate an embedding for the given text using OpenAI's API."""

    # Check for valid input
    if not text or not isinstance(text, str):
        return None

    try:
        # Call OpenAI API to get the embedding
        embedding = openai.embeddings.create(input=text, model=EMBEDDING_MODEL).data[0].embedding
        return embedding
    except Exception as e:
        print(f"Error in get_embedding: {e}")
        return None


def vector_search(user_query, collection):
    """
    Perform a vector search in the MongoDB collection based on the user query.

    Args:
    user_query (str): The user's query string.
    collection (MongoCollection): The MongoDB collection to search.

    Returns:
    list: A list of matching documents.
    """

    # Generate embedding for the user query
    query_embedding = get_embedding(user_query)

    if query_embedding is None:
        # Return an empty list so the return type always matches the docstring
        print("Invalid query or embedding generation failed.")
        return []

    # Define the vector search pipeline
    pipeline = [
        {
            "$vectorSearch": {
                "index": "vector_index",
                "queryVector": query_embedding,
                "path": "optimized_search",
                "numCandidates": 150,  # Number of candidate matches to consider
                "limit": 5  # Return top 5 matches
            }
        },
        {
            "$project": {
                "_id": 0,
                "name": 1,
                "ingredients": 1,
                "diet": 1,
                "prep_time": 1,
                "cook_time": 1,
                "flavor_profile": 1,
                "course": 1,
                "state": 1,
                "region": 1,
                "score": {
                    "$meta": "vectorSearchScore"  # Include the search score
                }
            }
        }
    ]

    # Execute the search
    results = collection.aggregate(pipeline)
    return list(results)


def handle_user_query(query, collection):
    # Retrieve knowledge from the MongoDB collection
    search_result = ''
    for result in collection.find():
        search_result += f"Name: {result.get('name', 'N/A')}, Ingredients: {result.get('ingredients', 'N/A')}, Diet: {result.get('diet', 'N/A')}, Prep Time: {result.get('prep_time', 'N/A')}, Cook Time: {result.get('cook_time', 'N/A')}, Flavor Profile: {result.get('flavor_profile', 'N/A')}, Course: {result.get('course', 'N/A')}, State: {result.get('state', 'N/A')}, Region: {result.get('region', 'N/A')}\n"

    # Generate response using OpenAI's GPT-3.5 model
    completion = openai.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "You are a Food recommendation system."},
            {"role": "user", "content": "Answer this user query: " + query + " with the following context: " + search_result}
        ]
    )

    return (completion.choices[0].message.content), search_result


# Conduct query with retrieval of sources

st.set_page_config(page_title="Recipe Recommendation")
st.header("Ask for a recipe recommendation based on ingredients, name, or cooking time.")
query = st.text_input("Query")
if query != "":
  response, source_information = handle_user_query(query, collection)
  print(source_information)
  print(response)
  st.write(f"Response: {response}\n\n")






Overwriting app.py


## Launch the web app

In [None]:
!npx localtunnel --port 8501 & streamlit run app.py & curl ipv4.icanhazip.com


35.196.245.172

Collecting usage statistics. To deactivate, set browser.gatherUsageStats to False.

  You can now view your Streamlit app in your browser.

  Network URL: http://172.28.0.12:8501
  External URL: http://35.196.245.172:8501

npx: installed 22 in 2.699s
your url is: https://crazy-jeans-post.loca.lt
  return get_validated_options(opts, warn)
Connection to MongoDB successful
Connection to MongoDB successful
Name: Malapua, Ingredients: Yoghurt, refined flour, ghee, fennel seeds, Diet: vegetarian, Prep Time: 10, Cook Time: 120, Flavor Profile: sweet, Course: dessert, State: Bihar, Region: North
Name: Mysore pak, Ingredients: Besan flour, semolina, mung bean, jaggery, coconut, skimmed milk powder, sugar, ghee, Diet: vegetarian, Prep Time: 5, Cook Time: 20, Flavor Profile: sweet, Course: dessert, State: Karnataka, Region: South
Name: Aloo tikki, Ingredients: Rice flour, potato, bread crumbs, garam masala, s