# Dietary Restriction AI

### Import Packages

In [5]:
pip install langchain_community pandas pyspark sentence_transformers tokenizers faiss-cpu textstat nltk

Note: you may need to restart the kernel to use updated packages.


In [6]:
from langchain_community.llms import Ollama
import os
import json
import pandas as pd 
import numpy as np
import tempfile 
from pyspark.sql import SparkSession

import pyspark.sql.functions as f
from pyspark.sql import Row
from sentence_transformers import SentenceTransformer
import faiss
import ollama

import re
from pyspark.sql.types import StringType, FloatType, MapType

### Initialize a Spark Session

In [7]:
spark = SparkSession.builder \
    .appName("Local Spark") \
    .config("spark.driver.memory", "4g") \
    .config("spark.executor.memory", "4g") \
    .getOrCreate()

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/04/20 19:10:40 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


### Data Ingestion
In order to be compatible with LLM that we will be creating, the data needs to be processed to be in an efficient retrieval format and stored in a searchable index. 

##### Recipe Data
Our recipe data is sourced from web-scraped data containing

In [8]:
all_files = os.listdir("./")
recipe_files = [file for file in all_files if "recipes_raw_nosource" in file]

df_list = []

for file_name in recipe_files: 
    temp_path = os.path.join(tempfile.gettempdir(), file_name)
    
    with open(file_name, "r", encoding="utf-8") as file:
        data = json.load(file)

    file_df = pd.DataFrame.from_dict(data, orient="index")
    df_list.append(file_df)  # Collect DataFrame
    
# Concatenate all dataframes
recipes_df = pd.concat(df_list)

# select only title, ingredient, instructions columns
recipes_df = recipes_df[['title', 'ingredients', 'instructions']]

# repartition the dataframe 
recipes_df = spark.createDataFrame(recipes_df)
recipes_df = recipes_df.repartition(100)

recipes_df.show()

25/04/20 19:10:44 WARN TaskSetManager: Stage 0 contains a task of very large size (16301 KiB). The maximum recommended task size is 1000 KiB.
                                                                                

+--------------------+--------------------+--------------------+
|               title|         ingredients|        instructions|
+--------------------+--------------------+--------------------+
|  Baked Greens Chips|[6 to 8 ounces he...|Watch how to make...|
|Sweet Potato-Chic...|[2 large sweet po...|To prepare the ha...|
|         Cali Burger|[1/4 cup mayonnai...|For the chipotle ...|
|Oatmeal Cream Che...|[2 sticks unsalte...|Preheat the oven ...|
|    Campari Spritzer|[1 (12-ounce) can...|Stir the orange j...|
|Seared Rack of La...|[1/2 cup pistachi...|Watch how to make...|
|         Cream Puffs|[6 tablespoons un...|Special equipment...|
|Italian Style Hot...|[Cooking oil, sui...|In a saucepan ove...|
|    Crab Cakes Salad|[2 tablespoons fi...|For the salad: Co...|
|      Hot Cross Buns|[2 ounces fresh y...|Crumble the yeast...|
|Strawberries with...|[4 pints (8 cups)...|Thirty minutes to...|
|       Curry Chicken|[2 pounds chicken...|Put the sliced ch...|
|Golden Squash Blo...|[1 

##### Cooking Literature Data

The cooking literature data was pre-processed from PDF text files into a usable format in another notebook.

### Data Chunking
##### Recipes Data
The data was chunked into recipe-level chunks, since the recipes will then be able toi be referenced individually when needed. Since this use case is about modifying recipes in their entirety, we want the model to be able to reference the recipes in their entirety during its retrieval process. 

In [10]:
recipes_df_chunk = recipes_df.withColumn("chunk_text", 
                                         f.concat_ws("\n", f.col("title"), f.col("ingredients"), f.col("instructions")))
recipes_df_chunk.show()

25/04/20 19:15:49 WARN TaskSetManager: Stage 4 contains a task of very large size (16301 KiB). The maximum recommended task size is 1000 KiB.


+--------------------+--------------------+--------------------+--------------------+
|               title|         ingredients|        instructions|          chunk_text|
+--------------------+--------------------+--------------------+--------------------+
|  Baked Greens Chips|[6 to 8 ounces he...|Watch how to make...|Baked Greens Chip...|
|Sweet Potato-Chic...|[2 large sweet po...|To prepare the ha...|Sweet Potato-Chic...|
|         Cali Burger|[1/4 cup mayonnai...|For the chipotle ...|Cali Burger\n1/4 ...|
|Oatmeal Cream Che...|[2 sticks unsalte...|Preheat the oven ...|Oatmeal Cream Che...|
|    Campari Spritzer|[1 (12-ounce) can...|Stir the orange j...|Campari Spritzer\...|
|Seared Rack of La...|[1/2 cup pistachi...|Watch how to make...|Seared Rack of La...|
|         Cream Puffs|[6 tablespoons un...|Special equipment...|Cream Puffs\n6 ta...|
|Italian Style Hot...|[Cooking oil, sui...|In a saucepan ove...|Italian Style Hot...|
|    Crab Cakes Salad|[2 tablespoons fi...|For the sal

                                                                                

In [11]:
recipes_df = recipes_df_chunk.select("chunk_text").withColumnRenamed("chunk_text", "recipe_text")
recipes_df.show()

25/04/20 19:15:52 WARN TaskSetManager: Stage 7 contains a task of very large size (16301 KiB). The maximum recommended task size is 1000 KiB.


+--------------------+
|         recipe_text|
+--------------------+
|Baked Greens Chip...|
|Sweet Potato-Chic...|
|Cali Burger\n1/4 ...|
|Oatmeal Cream Che...|
|Campari Spritzer\...|
|Seared Rack of La...|
|Cream Puffs\n6 ta...|
|Italian Style Hot...|
|Crab Cakes Salad\...|
|Hot Cross Buns\n2...|
|Strawberries with...|
|Curry Chicken\n2 ...|
|Golden Squash Blo...|
|Mexican Hot Choco...|
|Georgian Short Ri...|
|Chicken Milanese\...|
|Citrus Marinated ...|
|Beet-and-Potato L...|
|Green Tea and Gin...|
|Jalapeno Cheese B...|
+--------------------+
only showing top 20 rows



### Generate Embeddings

In [12]:
# Load chunked JSON data
with open("chunked_data.json", "r", encoding="utf-8") as f:
    chunked_data = json.load(f)

# load model and generate embeddings
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = np.array([embedding_model.encode(chunk["body"]) for chunk in chunked_data], dtype=np.float32)

# Store embeddings in chunked JSON
for i, chunk in enumerate(chunked_data):
    chunk["embedding"] = embeddings[i].tolist()
    
# Create and save FAISS index
embedding_dim = embeddings.shape[1]
index = faiss.IndexFlatL2(embedding_dim)
index.add(embeddings)

### Model Ingestion

##### Prompt Engineering
The prompt inputted by the user should only need to contain the necessary recipe that the user wants to modify. The following prompt engineering code adds additional, consistent language that does the following: 
- Specifies that the user wants to modify the recipe, retaining the original intention
- Provides the dietary framework to stick to, in this case the high-protein low-carb diet. In another phase of development, this could be changed to xspecify a diet of choice
- Requests a list of macronutrients based on the data

##### RAG Component

In [13]:
sampled_recipes = (
    recipes_df.sample(withReplacement=False, fraction=0.1, seed=42)
    .select("recipe_text")
    .toLocalIterator()
)

25/04/20 19:20:40 WARN TaskSetManager: Stage 10 contains a task of very large size (16301 KiB). The maximum recommended task size is 1000 KiB.
[Stage 10:=====>                                                   (1 + 9) / 10]

In [14]:
def retrieve_relevant_chunks(query, k=5):
    """Retrieve top-k most relevant chunks using FAISS."""
    query_embedding = embedding_model.encode(query).reshape(1, -1)  # Convert query to embedding
    distances, indices = index.search(query_embedding, k)  # Retrieve top-k chunks

    return [chunked_data[i] for i in indices[0]]  # Get original text chunks

def query_ollama_with_context(query):
    """Retrieve relevant context and query Ollama 3.2."""
    retrieved_chunks = retrieve_relevant_chunks(query)
    context = "\n".join([chunk["body"] for chunk in retrieved_chunks])  # Combine relevant chunks

    # Formulate prompt for LLaMA
    prompt = f"Context:\n{context}\n\nQuery: {query}\nAnswer:"

    # Query Ollama
    response = ollama.chat(model="llama3.2", messages=[{"role": "user", "content": prompt}])
    return response["message"]["content"]

In [15]:
# if __name__ == "__main__":
#     query = input("Enter your recipe: ")
#     query += " Modify this recipe so that it is more suited for a high-protein, low carb diet. Provide a list of macronutrients as a part of the analysis."
#     answer = query_ollama_with_context(query)
#     print("\nOllama's Answer:", answer)

In [16]:
sampled_recipes = (
    recipes_df.sample(withReplacement=False, fraction=0.1, seed=42)
    .limit(100)
    .select("recipe_text")
    .toLocalIterator()
)
results = []
for line_num, row in enumerate(sampled_recipes):
    raw_query = row.recipe_text
    query = f"""{raw_query}
    
    Determine whether this recipe is a **solid food dish** that is meant to be eaten (in whole or in part) as a standalone food 
    item — such as a snack, side dish, entrée, breakfast, lunch, or dinner item.
    
    If the recipe is a **non-food item**, or if it is a **drink, cocktail, smoothie, marinade, condiment, dressing, spice mix, sauce, 
    or ingredient prep**, respond with:
    "This recipe is not a solid food dish and does not require modification."  
    Only if the recipe is not a solid food dish **briefly explain why**, and stop. Do **not** modify anything. 
    
    ---
    
    Modify recipes it to fit a **high-protein, low-carbohydrate** dietary pattern using the following:
    
    ### Dietary Goals:
    - Protein: 30–50% of total calories (target ~40%)
    - Fats: 20–40% (target ~30%)
    - Carbohydrates: 10–30% (target ~20%)
    
    ### Substitution Guidelines:
    - Prioritize lean protein sources.
    - Avoid cream, butter, full-fat cheese, fatty meats, or excessive oils.
    - Use low-fat cooking methods (grilling, baking, steaming).
    - Favor vegetables and legumes over starchy carbs.
    - Use cooking spray or broth instead of oil where possible.
    - **Only make substitutions that are likely to still taste good and preserve the appeal of the dish.** Avoid swaps that would make 
    the texture or flavor unpleasant.

    Avoid high-fat ingredients such as:
    - oils (even olive oil or coconut oil)
    - cream, butter, or cheese (unless low-fat versions are used in moderation)
    - fatty meats (like bacon, sausage, ground beef)
    - full-fat dairy
    - avocados, nuts, and seeds (limit to small quantities only if essential)
    
    Use the following **low-fat flavor and texture enhancers** instead:
    - vinegar, lemon juice, mustard, garlic, spices
    - vegetable purées (e.g., carrot, cauliflower, zucchini)
    - low-fat dairy alternatives (e.g., non-fat Greek yogurt, fat-free cottage cheese)
    - broth-based sauces instead of cream sauces

    If fat content exceeds 35%, **revise the recipe to reduce fat** until it falls between 20–35%. 
    
    ---
    
    ### Output Requirements:
    1. Title of Recipe
    2. Explanation of any substitutions made.
    3. Modified ingredient list.
    4. Modified cooking instructions.
    5. Macronutrient breakdown in this format:
       - Protein: #.# g  
       - Fats: #.# g  
       - Carbohydrates: #.# g  
    6. Confirm whether the final recipe meets macro goals. If not, revise it again — but only show the final version.
    Provide only the above. Do not include any additional sections in the output.
    
    Do **not** show failed or intermediate attempts. Just the final, modified version.
    """

    try:
        answer = query_ollama_with_context(query)
        results.append({
            "original_recipe": raw_query,
            "modified_response": answer
        })
        print(f"Completed line {line_num}")
    except Exception as e:
        results.append({
            "original_recipe": raw_query,
            "modified_response": f"Error: {e}"
        })

results_df = spark.createDataFrame([Row(**r) for r in results])

25/04/20 19:20:41 WARN TaskSetManager: Stage 11 contains a task of very large size (16301 KiB). The maximum recommended task size is 1000 KiB.


Completed line 0
Completed line 1
Completed line 2
Completed line 3
Completed line 4
Completed line 5
Completed line 6
Completed line 7
Completed line 8
Completed line 9
Completed line 10
Completed line 11
Completed line 12
Completed line 13
Completed line 14
Completed line 15
Completed line 16
Completed line 17
Completed line 18
Completed line 19
Completed line 20
Completed line 21
Completed line 22
Completed line 23
Completed line 24
Completed line 25
Completed line 26
Completed line 27
Completed line 28
Completed line 29
Completed line 30
Completed line 31
Completed line 32
Completed line 33
Completed line 34
Completed line 35
Completed line 36
Completed line 37
Completed line 38
Completed line 39
Completed line 40
Completed line 41
Completed line 42
Completed line 43
Completed line 44
Completed line 45
Completed line 46
Completed line 47
Completed line 48
Completed line 49
Completed line 50
Completed line 51
Completed line 52
Completed line 53
Completed line 54
Completed line 55
Co

In [17]:
results_df.show()

+--------------------+--------------------+
|     original_recipe|   modified_response|
+--------------------+--------------------+
|Italian Style Hot...|Italian Style Hot...|
|Citrus Marinated ...|Scallop Ceviche\n...|
|Green Tea and Gin...|Green Tea and Gin...|
|Spicy Fried Oyste...|Spicy Fried Oyste...|
|Tres Leches\n2 cu...|Tres Leches Cake\...|
|Table of Polenta\...|Title of Recipe: ...|
|Hot Fudge Sauce\n...|Hot Fudge Sauce\n...|
|Tomato Salsa\n2 r...|Tomato Salsa (Hig...|
|Raspberry Lemonad...|Raspberry Lemonad...|
|Peppermint Red Ve...|This recipe is a ...|
|Chocolate Creme F...|Chocolate Creme F...|
|Apricot Clafouti\...|Apricot Clafouti ...|
|Crispy Szechuan-S...|Crispy Szechuan-S...|
|Dewey's Albuquerq...|Title of Recipe: ...|
|Sweet-and-Sour Ca...|# Sweet-and-Sour ...|
|Chocolate Chip Ca...|Chocolate Chip Ca...|
|Fruity Gum Flower...|Fruity Gum Flower...|
|Gingerbread House...|Gingerbread House...|
|Surf and Turf Rib...|Surf and Turf Cro...|
|Corn Doggies\n2 1...|Title of R

                                                                                

In [18]:
# 1. Save as a single Parquet file
results_df.coalesce(1).write.parquet("temp_parquet_output", mode="overwrite")

# 2. Move the part file to a final .parquet file
import os
import shutil

temp_folder = "temp_parquet_output"
final_file = "results_output.parquet"

# Find the part file and move it
for fname in os.listdir(temp_folder):
    if fname.startswith("part-") and fname.endswith(".parquet"):
        shutil.move(os.path.join(temp_folder, fname), final_file)
        break

# Optional: clean up
shutil.rmtree(temp_folder)

print(f"✅ DataFrame saved as '{final_file}' in: {os.getcwd()}")


✅ DataFrame saved as 'results_output.parquet' in: /Users/jenny/ollama-env


### Recipe Evaluator

In [19]:
def extract_macronutrients(recipe_text):
    # Define a regular expression pattern to search for protein, fat, and carbs
    protein_pattern = r"Protein:\s*.*(\d+\.?\d*?)g"
    fat_pattern = r"Fats:\s*.*(\d+\.?\d*?)g"
    carbs_pattern = r"Carbohydrates:\s*.*(\d+\.?\d*?)g"
    
    # Search the text for the respective macronutrients
    protein = re.search(protein_pattern, recipe_text)
    fat = re.search(fat_pattern, recipe_text)
    carbs = re.search(carbs_pattern, recipe_text)
    
    # Extract the values if found
    protein_value = float(protein.group(1)) if protein else None
    fat_value = float(fat.group(1)) if fat else None
    carbs_value = float(carbs.group(1)) if carbs else None
    
    # Return the extracted values
    return {
        "P": protein_value,
        "F": fat_value,
        "C": carbs_value
    }

In [20]:
from pyspark.sql import functions as f
extract_macronutrients_udf = f.udf(extract_macronutrients, MapType(StringType(), FloatType()))

In [21]:
def evaluate_recipe(protein_g, fat_g, carb_g):
    # Caloric values per gram
    PROTEIN_CAL = 4
    CARB_CAL = 4
    FAT_CAL = 9

    if protein_g is not None and fat_g is not None and carb_g is not None:
        # Calculate total calories
        total_calories = (protein_g * PROTEIN_CAL) + (fat_g * FAT_CAL) + (carb_g * CARB_CAL)
        
        if total_calories == 0:
            return "Invalid recipe: Total calories cannot be zero."
        
        # Calculate macronutrient percentage
        protein_pct = (protein_g * PROTEIN_CAL / total_calories) * 100
        fat_pct = (fat_g * FAT_CAL / total_calories) * 100
        carb_pct = (carb_g * CARB_CAL / total_calories) * 100
        
        # Define healthy ranges
        protein_range = (30, 50)
        fat_range = (20, 40)
        carb_range = (10, 30)
        
        # Check if recipe meets healthy criteria
        if (protein_range[0] <= protein_pct <= protein_range[1] and
            fat_range[0] <= fat_pct <= fat_range[1] and
            carb_range[0] <= carb_pct <= carb_range[1]):
            return "Meets Criteria"
        else:
            return "Does Not Meet Criteria"
    else: 
        return None
# Example usage
recipe_result = evaluate_recipe(protein_g=35.2, fat_g=14.2, carb_g=12.8)
print(recipe_result)

Meets Criteria


In [22]:
evaluate_recipe_udf = f.udf(evaluate_recipe, StringType())

In [23]:
results_df.show()

+--------------------+--------------------+
|     original_recipe|   modified_response|
+--------------------+--------------------+
|Italian Style Hot...|Italian Style Hot...|
|Citrus Marinated ...|Scallop Ceviche\n...|
|Green Tea and Gin...|Green Tea and Gin...|
|Spicy Fried Oyste...|Spicy Fried Oyste...|
|Tres Leches\n2 cu...|Tres Leches Cake\...|
|Table of Polenta\...|Title of Recipe: ...|
|Hot Fudge Sauce\n...|Hot Fudge Sauce\n...|
|Tomato Salsa\n2 r...|Tomato Salsa (Hig...|
|Raspberry Lemonad...|Raspberry Lemonad...|
|Peppermint Red Ve...|This recipe is a ...|
|Chocolate Creme F...|Chocolate Creme F...|
|Apricot Clafouti\...|Apricot Clafouti ...|
|Crispy Szechuan-S...|Crispy Szechuan-S...|
|Dewey's Albuquerq...|Title of Recipe: ...|
|Sweet-and-Sour Ca...|# Sweet-and-Sour ...|
|Chocolate Chip Ca...|Chocolate Chip Ca...|
|Fruity Gum Flower...|Fruity Gum Flower...|
|Gingerbread House...|Gingerbread House...|
|Surf and Turf Rib...|Surf and Turf Cro...|
|Corn Doggies\n2 1...|Title of R

In [24]:
results_evaluated_df = results_df.withColumn("macros", extract_macronutrients_udf(f.col("modified_response")))
results_evaluated_df = results_evaluated_df.withColumn("dietary_compliance", evaluate_recipe_udf(
    f.col("macros").getItem("P"), 
    f.col("macros").getItem("F"), 
    f.col("macros").getItem("C")))

results_evaluated_df = results_evaluated_df.withColumn("protein", f.col("macros").getItem("P"))\
                                            .withColumn("fat", f.col("macros").getItem("F"))\
                                            .withColumn("carbohydrates", f.col("macros").getItem("C"))\
                                            .withColumn("protein_calories", f.col("protein") * 4)\
                                            .withColumn("fat_calories", f.col("fat") * 9)\
                                            .withColumn("carb_calories", f.col("carbohydrates") * 4)\
                                            .withColumn("total_calories", f.col("protein_calories") + f.col("fat_calories") + f.col("carb_calories"))\
                                            .withColumn("percent_protein", f.col("protein_calories") / f.col("total_calories"))\
                                            .withColumn("percent_fat", f.col("fat_calories") / f.col("total_calories"))\
                                            .withColumn("percent_carb", f.col("carb_calories") / f.col("total_calories"))
results_evaluated_df.toPandas()

                                                                                

Unnamed: 0,original_recipe,modified_response,macros,dietary_compliance,protein,fat,carbohydrates,protein_calories,fat_calories,carb_calories,total_calories,percent_protein,percent_fat,percent_carb
0,"Italian Style Hot Dogs\nCooking oil, suitable ...",Italian Style Hot Dogs\n\nThis recipe is a sol...,"{'P': 2.0, 'F': 8.0, 'C': 1.0}",Does Not Meet Criteria,2.0,8.0,1.0,8.0,72.0,4.0,84.0,0.095238,0.857143,0.047619
1,Citrus Marinated Olives\n1 pound large green o...,Scallop Ceviche\n\nThis recipe is a solid food...,"{'P': 8.0, 'F': 5.0, 'C': 7.0}",Does Not Meet Criteria,8.0,5.0,7.0,32.0,45.0,28.0,105.0,0.304762,0.428571,0.266667
2,Green Tea and Ginger Granita\n3 cups water\n1/...,Green Tea and Ginger Granita\n\nThis recipe is...,"{'P': None, 'F': None, 'C': None}",,,,,,,,,,,
3,Spicy Fried Oysters on Jalapeno Caesar Salad\n...,Spicy Fried Oysters on Jalapeno Caesar Salad\n...,"{'P': 5.0, 'F': 5.0, 'C': 2.0}",Does Not Meet Criteria,5.0,5.0,2.0,20.0,45.0,8.0,73.0,0.273973,0.616438,0.109589
4,Tres Leches\n2 cups chilled Homemade Sweetened...,Tres Leches Cake\n\nThis recipe is a solid foo...,"{'P': 5.0, 'F': 5.0, 'C': 0.0}",Does Not Meet Criteria,5.0,5.0,0.0,20.0,45.0,0.0,65.0,0.307692,0.692308,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,The Way to a Man's Heart\n1 pound halibut ADVE...,The Way to a Man's Heart\n\nThis recipe is not...,"{'P': 2.0, 'F': 1.0, 'C': 7.0}",Does Not Meet Criteria,2.0,1.0,7.0,8.0,9.0,28.0,45.0,0.177778,0.200000,0.622222
96,Armenian Rice Pilaf\n1/2 cup butter ADVERTISEM...,Armenian Rice Pilaf\n\nThis recipe is a solid ...,"{'P': None, 'F': None, 'C': None}",,,,,,,,,,,
97,Ham and Chicken Casserole\n1/2 cup uncooked eg...,Ham and Chicken Casserole\nThis recipe is a so...,"{'P': None, 'F': None, 'C': None}",,,,,,,,,,,
98,Cannoli\n1 cup confectioners' sugar ADVERTISEM...,# Cannoli\n\nThis recipe is a solid food dish ...,"{'P': None, 'F': None, 'C': None}",,,,,,,,,,,


In [25]:
results_evaluated_df.groupBy("dietary_compliance").count().show()

+--------------------+-----+
|  dietary_compliance|count|
+--------------------+-----+
|                NULL|   44|
|Does Not Meet Cri...|   53|
|      Meets Criteria|    3|
+--------------------+-----+



In [26]:
results_evaluated_df.select("percent_protein").filter(results_evaluated_df["percent_protein"].isNotNull()).orderBy("percent_protein", ascending=False).show()

+-------------------+
|    percent_protein|
+-------------------+
|                1.0|
| 0.5405405405405406|
| 0.5245901639344263|
|                0.5|
| 0.4878048780487805|
|0.48484848484848486|
|0.47619047619047616|
|0.47058823529411764|
|0.41379310344827586|
|0.40816326530612246|
| 0.3870967741935484|
| 0.3764705882352941|
|0.36363636363636365|
|0.36363636363636365|
| 0.3333333333333333|
|0.32727272727272727|
| 0.3076923076923077|
| 0.3047619047619048|
| 0.2857142857142857|
| 0.2857142857142857|
+-------------------+
only showing top 20 rows



In [27]:
calorie_stats_df = results_evaluated_df.select("protein", "percent_protein", 
                                       "fat", "percent_fat",
                                       "carbohydrates", "percent_carb", 
                                       "total_calories")
calorie_stats_df.summary().toPandas()

25/04/20 19:44:28 WARN SparkStringUtils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.


Unnamed: 0,summary,protein,percent_protein,fat,percent_fat,carbohydrates,percent_carb,total_calories
0,count,57.0,56.0,57.0,56.0,57.0,56.0,56.0
1,mean,3.789473684210526,0.2575330438539912,4.157894736842105,0.5250887945376266,3.333333333333333,0.2173781616083821,65.51785714285714
2,stddev,2.1442351458975426,0.1703173145936162,2.596628815022956,0.2524683818077061,2.70801280154532,0.1853285643813395,25.336277312421963
3,min,0.0,0.0,0.0,0.0,0.0,0.0,16.0
4,25%,2.0,0.1403508771929824,2.0,0.2903225806451613,1.0,0.0606060606060606,44.0
5,50%,3.0,0.2162162162162162,4.0,0.5294117647058824,2.0,0.1666666666666666,65.0
6,75%,5.0,0.3333333333333333,6.0,0.7241379310344828,5.0,0.3278688524590163,84.0
7,max,9.0,1.0,9.0,1.0,9.0,0.6666666666666666,125.0


In [29]:
for num, answer in enumerate(results_df.select("modified_response").collect()): 
    print(f"--------------------------------------{num}---------------------------------------------")
    print(answer.modified_response[:500])

--------------------------------------0---------------------------------------------
Italian Style Hot Dogs

This recipe is a solid food dish and does not require modification.

### Modified Ingredient List:
- 2 medium potatoes
- 4 all-beef hot dogs
- 1 medium onion
- 1 green or red pepper
- 1 medium round loaf of Italian bread
- Mustard
- Ketchup
- Red pepper flakes

### Modified Cooking Instructions:
In a saucepan over medium-high heat, heat enough cooking spray to cover the potatoes to frying temperature (325 to 350 degrees F), and add the potato rounds. Turn them over when t
--------------------------------------1---------------------------------------------
Scallop Ceviche

This recipe is a solid food dish and does not require modification.

Title of Recipe: Modified Citrus Marinated Olives

Explanation of Substitutions:
- No substitutions were made to this recipe.

Modified Ingredient List:

1 pound large green olives with pits
1 clove garlic, minced
1/2 cup low-fat extra-virgin 

In [33]:
results_safe = results_df.select("*")
results_safe.toPandas()

Unnamed: 0,original_recipe,modified_response
0,"Italian Style Hot Dogs\nCooking oil, suitable ...",Italian Style Hot Dogs\n\nThis recipe is a sol...
1,Citrus Marinated Olives\n1 pound large green o...,Scallop Ceviche\n\nThis recipe is a solid food...
2,Green Tea and Ginger Granita\n3 cups water\n1/...,Green Tea and Ginger Granita\n\nThis recipe is...
3,Spicy Fried Oysters on Jalapeno Caesar Salad\n...,Spicy Fried Oysters on Jalapeno Caesar Salad\n...
4,Tres Leches\n2 cups chilled Homemade Sweetened...,Tres Leches Cake\n\nThis recipe is a solid foo...
...,...,...
95,The Way to a Man's Heart\n1 pound halibut ADVE...,The Way to a Man's Heart\n\nThis recipe is not...
96,Armenian Rice Pilaf\n1/2 cup butter ADVERTISEM...,Armenian Rice Pilaf\n\nThis recipe is a solid ...
97,Ham and Chicken Casserole\n1/2 cup uncooked eg...,Ham and Chicken Casserole\nThis recipe is a so...
98,Cannoli\n1 cup confectioners' sugar ADVERTISEM...,# Cannoli\n\nThis recipe is a solid food dish ...


In [35]:
# readability 
from pyspark.sql.functions import udf
from pyspark.sql.types import FloatType, StringType
import textstat

# UDF to calculate Flesch Reading Ease
@udf(FloatType())
def readability_score_udf(text):
    try:
        return textstat.flesch_reading_ease(text)
    except:
        return None

# UDF to calculate grade level
@udf(StringType())
def reading_grade_level_udf(text):
    try:
        return textstat.text_standard(text, float_output=False)
    except:
        return None

# Apply UDFs to your DataFrame
results_df = results_df.withColumn("readability_score", readability_score_udf("modified_response"))
results_df = results_df.withColumn("reading_grade_level", reading_grade_level_udf("modified_response"))


In [36]:
results_df.toPandas()

Unnamed: 0,original_recipe,modified_response,readability_score,reading_grade_level
0,"Italian Style Hot Dogs\nCooking oil, suitable ...",Italian Style Hot Dogs\n\nThis recipe is a sol...,54.220001,9th and 10th grade
1,Citrus Marinated Olives\n1 pound large green o...,Scallop Ceviche\n\nThis recipe is a solid food...,49.860001,11th and 12th grade
2,Green Tea and Ginger Granita\n3 cups water\n1/...,Green Tea and Ginger Granita\n\nThis recipe is...,65.930000,7th and 8th grade
3,Spicy Fried Oysters on Jalapeno Caesar Salad\n...,Spicy Fried Oysters on Jalapeno Caesar Salad\n...,53.099998,10th and 11th grade
4,Tres Leches\n2 cups chilled Homemade Sweetened...,Tres Leches Cake\n\nThis recipe is a solid foo...,32.160000,15th and 16th grade
...,...,...,...,...
95,The Way to a Man's Heart\n1 pound halibut ADVE...,The Way to a Man's Heart\n\nThis recipe is not...,52.360001,8th and 9th grade
96,Armenian Rice Pilaf\n1/2 cup butter ADVERTISEM...,Armenian Rice Pilaf\n\nThis recipe is a solid ...,65.120003,9th and 10th grade
97,Ham and Chicken Casserole\n1/2 cup uncooked eg...,Ham and Chicken Casserole\nThis recipe is a so...,46.470001,10th and 11th grade
98,Cannoli\n1 cup confectioners' sugar ADVERTISEM...,# Cannoli\n\nThis recipe is a solid food dish ...,47.689999,10th and 11th grade


In [48]:
import pandas as pd
import re
import nltk

# Ensure NLTK is ready
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger_eng')

# Convert Spark to Pandas
results_pd = results_df.toPandas()

# Add time_mentions column
def extract_time(text):
    if pd.isnull(text):
        return 0
    times = re.findall(r'\d+\s*(minutes|min|hours|hrs)', text.lower())
    return len(times)

results_pd['time_mentions'] = results_pd['modified_response'].apply(extract_time)

# Add verb_count column
def count_verbs(text):
    if pd.isnull(text):
        return 0
    tokens = nltk.word_tokenize(text)
    pos_tags = nltk.pos_tag(tokens)
    verbs = {word.lower() for word, tag in pos_tags if tag.startswith('VB')}
    return len(verbs)

results_pd['verb_count'] = results_pd['modified_response'].apply(count_verbs)

# Convert back to Spark
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
results_df = spark.createDataFrame(results_pd)
results_pd

[nltk_data] Downloading package punkt to /Users/jenny/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /Users/jenny/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger_eng.zip.
                                                                                

Unnamed: 0,original_recipe,modified_response,readability_score,reading_grade_level,time_mentions,verb_count
0,"Italian Style Hot Dogs\nCooking oil, suitable ...",Italian Style Hot Dogs\n\nThis recipe is a sol...,54.220001,9th and 10th grade,0,19
1,Citrus Marinated Olives\n1 pound large green o...,Scallop Ceviche\n\nThis recipe is a solid food...,49.860001,11th and 12th grade,2,19
2,Green Tea and Ginger Granita\n3 cups water\n1/...,Green Tea and Ginger Granita\n\nThis recipe is...,65.930000,7th and 8th grade,2,25
3,Spicy Fried Oysters on Jalapeno Caesar Salad\n...,Spicy Fried Oysters on Jalapeno Caesar Salad\n...,53.099998,10th and 11th grade,2,18
4,Tres Leches\n2 cups chilled Homemade Sweetened...,Tres Leches Cake\n\nThis recipe is a solid foo...,32.160000,15th and 16th grade,1,27
...,...,...,...,...,...,...
95,The Way to a Man's Heart\n1 pound halibut ADVE...,The Way to a Man's Heart\n\nThis recipe is not...,52.360001,8th and 9th grade,0,9
96,Armenian Rice Pilaf\n1/2 cup butter ADVERTISEM...,Armenian Rice Pilaf\n\nThis recipe is a solid ...,65.120003,9th and 10th grade,1,19
97,Ham and Chicken Casserole\n1/2 cup uncooked eg...,Ham and Chicken Casserole\nThis recipe is a so...,46.470001,10th and 11th grade,3,36
98,Cannoli\n1 cup confectioners' sugar ADVERTISEM...,# Cannoli\n\nThis recipe is a solid food dish ...,47.689999,10th and 11th grade,0,10


In [49]:
# faithfulness to the original
from sentence_transformers import SentenceTransformer, util

pandas_df = results_df.select('original_recipe', 'modified_response').toPandas()
model = SentenceTransformer('all-MiniLM-L6-v2')

original_emb = model.encode(pandas_df['original_recipe'].tolist(), convert_to_tensor=True)
modified_emb = model.encode(pandas_df['modified_response'].tolist(), convert_to_tensor=True)

sims = util.cos_sim(original_emb, modified_emb)
pandas_df['faithfulness_score'] = [sims[i][i].item() for i in range(len(pandas_df))]

# Assuming you're still in the same Spark session
faithfulness_scores = spark.createDataFrame(pandas_df[['faithfulness_score']])
results_df = results_df.withColumn("row_index", f.monotonically_increasing_id())
faithfulness_scores = faithfulness_scores.withColumn("row_index", f.monotonically_increasing_id())

results_df = results_df.join(faithfulness_scores, on="row_index").drop("row_index")
results_df.toPandas()

Unnamed: 0,original_recipe,modified_response,readability_score,reading_grade_level,time_mentions,verb_count,faithfulness_score
0,"Italian Style Hot Dogs\nCooking oil, suitable ...",Italian Style Hot Dogs\n\nThis recipe is a sol...,54.220001,9th and 10th grade,0,19,0.898883
1,"Tomato Salsa\n2 ripe tomatoes, cored and seede...","Tomato Salsa (High-Protein, Low-Carb Version)\...",36.419998,14th and 15th grade,0,13,0.688768
2,Hot Fudge Sauce\n2 cups sugar\n2 cups heavy cr...,Hot Fudge Sauce\nThis recipe is a non-food ite...,43.630001,11th and 12th grade,0,20,0.757719
3,Peppermint Red Velvet Whoopie Pie\n3 cups cake...,This recipe is a solid food dish and does not ...,51.340000,8th and 9th grade,0,6,0.485934
4,Table of Polenta\n4 cups heavy cream\nPinch of...,Title of Recipe: Balsamic Vegetable Grill with...,31.309999,14th and 15th grade,1,16,0.548858
...,...,...,...,...,...,...,...
95,Armenian Rice Pilaf\n1/2 cup butter ADVERTISEM...,Armenian Rice Pilaf\n\nThis recipe is a solid ...,65.120003,9th and 10th grade,1,19,0.908444
96,Old Fashioned \n1 scant teaspoon simple syrup\...,"This recipe is a drink, cocktail.",73.849998,4th and 5th grade,0,1,0.524474
97,It Won't Last Cake\n1 1/3 cups vegetable oil A...,It Won't Last Cake\n\nThis recipe is a solid f...,54.220001,9th and 10th grade,0,8,0.606982
98,Watermelon-Cucumber Margarita \n1 1/2 cups 1-i...,"Watermelon-Cucumber Margarita is a drink, cock...",41.360001,13th and 14th grade,0,20,0.869689


In [52]:
import re
import pandas as pd

# Optional: limit for testing to avoid long loops
# results_df = results_df.limit(50)

# Convert Spark to Pandas
results_pd = results_df.toPandas()

# Function to extract number from LLM response
def extract_score(text): 
    match = re.search(r'\b\d+\b', str(text))
    return int(match.group(0)) if match else None

# Taste evaluation function
def taste_score(text): 
    taste_prompt = f"""
    Based on the following recipe, how tasty do you think the result would be?
    Consider a balance of flavors, ingredients, and cooking method.
    Rate it on a scale from 1 to 10 as a digit.

    Recipe: 
    {text}
    """
    try: 
        response = ollama.chat(
            model="llama3.2", 
            messages=[{"role": "user", "content": taste_prompt}]
        )
        return extract_score(response["message"]["content"])
    except Exception as e:
        return None

# Apply the taste scoring
results_pd["taste_score"] = results_pd["modified_response"].apply(taste_score)

# Convert back to Spark
results_df = spark.createDataFrame(results_pd)
results_pd

Unnamed: 0,original_recipe,modified_response,readability_score,reading_grade_level,time_mentions,verb_count,faithfulness_score,taste_score
0,"Italian Style Hot Dogs\nCooking oil, suitable ...",Italian Style Hot Dogs\n\nThis recipe is a sol...,54.220001,9th and 10th grade,0,19,0.898883,8.0
1,"Tomato Salsa\n2 ripe tomatoes, cored and seede...","Tomato Salsa (High-Protein, Low-Carb Version)\...",36.419998,14th and 15th grade,0,13,0.688768,6.0
2,Hot Fudge Sauce\n2 cups sugar\n2 cups heavy cr...,Hot Fudge Sauce\nThis recipe is a non-food ite...,43.630001,11th and 12th grade,0,20,0.757719,6.0
3,Peppermint Red Velvet Whoopie Pie\n3 cups cake...,This recipe is a solid food dish and does not ...,51.340000,8th and 9th grade,0,6,0.485934,7.0
4,Table of Polenta\n4 cups heavy cream\nPinch of...,Title of Recipe: Balsamic Vegetable Grill with...,31.309999,14th and 15th grade,1,16,0.548858,8.0
...,...,...,...,...,...,...,...,...
95,Armenian Rice Pilaf\n1/2 cup butter ADVERTISEM...,Armenian Rice Pilaf\n\nThis recipe is a solid ...,65.120003,9th and 10th grade,1,19,0.908444,6.0
96,Old Fashioned \n1 scant teaspoon simple syrup\...,"This recipe is a drink, cocktail.",73.849998,4th and 5th grade,0,1,0.524474,6.0
97,It Won't Last Cake\n1 1/3 cups vegetable oil A...,It Won't Last Cake\n\nThis recipe is a solid f...,54.220001,9th and 10th grade,0,8,0.606982,8.0
98,Watermelon-Cucumber Margarita \n1 1/2 cups 1-i...,"Watermelon-Cucumber Margarita is a drink, cock...",41.360001,13th and 14th grade,0,20,0.869689,7.0
