# Homework10

Exercises with text processing and NLP modeling

## Goals

- Understand similarities and differences between the processes of working with text, images and tabular data
- Practice with different methods of encoding and modeling text data
- See different methods for extracting information or patterns from text datasets

### Setup

Run the following cells to import all necessary libraries and helpers for this homework.

In [None]:
pip install -r ./.devcontainer/requirements.txt

In [None]:
!wget -q https://github.com/PSAM-5020-2025S-A/5020-utils/raw/main/src/data_utils.py
!wget -q https://github.com/PSAM-5020-2025S-A/5020-utils/raw/main/src/text_utils.py

In [None]:
import matplotlib.cm as cm
import matplotlib.pyplot as plt

from sklearn.cluster import KMeans
from sklearn.decomposition import PCA, TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

from data_utils import display_silhouette_plots, object_from_json_url
from text_utils import get_top_words

You can tell it's gonna be a good homework from the number of imports.
# 🙃

## Have protein, need seasoning

Let's create a model to help us season our foods. In the end, what we want is a model that receives a short list of ingredients and returns a list of seasonings or complementary ingredients for our original ingredients list.

In order to do that we need a dataset of recipes. We'll load that into a text dataset where each recipe is a document and the ingredients are our document *tokens*.

Let's take a look at the recipe dataset and become familiar with the data and how it's organized.

We'll load our recipes and do a bit of exploratory data analysis to look for patterns first to see if this kind of modeling makes any sense.

### Load Data

Here's our dataset. Let's load it into an object for inspection:

In [None]:
DATAPATH = "https://raw.githubusercontent.com/PSAM-5020-2025S-A/5020-utils/refs/heads/main/datasets/text/recipes"
recipes_obj = object_from_json_url(f"{DATAPATH}/recipes_min16.json")

### Look at Data

How's the data organized?

How many recipes do we have?

Do all recipes have the same number of ingredients?

Anything else stand out about the data?

In [None]:
# TODO: Look at Data here
recipes_obj
print(recipes_obj[0])
print(recipes_obj[20])
print(recipes_obj[100])

# TODO: How many recipes
print(len(recipes_obj))

# TODO: How many ingredients do the shortest and longest recipes have?
min = len(recipes_obj[0]["ingredients"])
max = len(recipes_obj[0]["ingredients"])

for recipe in recipes_obj:
    num = len(recipe["ingredients"])
    if num < min:
        min = num
    if num > max:
        max = num

print("min", min)  
print("max", max)



### Create Input Features

Our dataset doesn't really have to be a `DataFrame` here. It can, but it doesn't have to be.

Each recipe right now is described as a list of ingredients, but what we really want is a list of *sentences*, where each *sentence* is a Python `string` with all of the ingredients for a given recipe.

Instead of:<br>```["salt", "baking soda", "water", "mushroom"]```,

we want:<br>```"salt baking soda water mushroom"```

The `join()` function might help.

Another thing to consider is whether we want to do anything special about multi-word ingredients, like *baking soda*.

Do we want to let our vectorizer (spoiler) split that into two tokens, or do we want to guarantee that *baking* and *soda* always stay together? 

In [None]:
def list_to_string(ingredients):
    # Join multi-word ingredients with "-" and separate ingredients with spaces
    words = []
    for word in ingredients:
        parts = word.split(" ")
        if len(parts) == 1:
            words.append(word)
        else:
            words.append("-".join(parts))
    return " ".join(words)

string1 = list_to_string(["baking soda", "salt", "spicy spicy pepper"])
print(string1)

string2 = list_to_string(['raisins', 'baking powder', 'egg', 'sugar', 'milk', 'flour'])
print(string2)

In [None]:
recipes = []
# TODO: turn list of objects into list of strings
for recipe in recipes_obj:
    string_ingredients = list_to_string(recipe["ingredients"])
    recipes.append(string_ingredients)


In [None]:
display(recipes[0])
display(recipes[20])
display(recipes[100])

### Encode Data

The fun part.

Let's vectorize our list of ingredient strings into a sparse document matrix using `CountVectorizer` or `TfidfVectorizer`.

The resulting matrix will have one row for each recipe, and the columns will encode the ingredients.

In [None]:
# TODO: Vectorize ingredients from our recipe list
tfid = TfidfVectorizer(stop_words="english", min_df=5, max_df=0.75, max_features=10_000)
ingredients_vectorized = tfid.fit_transform(recipes)

# TODO: How many words are in our vocabulary?
vocab = tfid.get_feature_names_out()
print(len(vocab))
display(vocab)

### Cluster Data

Now that we have our recipes/documents vectorized we can study them a little bit, and look for patterns.

What happens if we cluster our recipes ? What do the cluster centers represent ?

When might this be useful ?

In [None]:
# TODO: cluster recipes
mClust = KMeans(n_clusters=8, random_state=800)
ingredients_km = mClust.fit_predict(ingredients_vectorized)


In [None]:
ingredients_km

In [None]:
mClust.cluster_centers_

### Cluster Centers

Use the `get_top_words()` function to decode the `cluster_centers_` back into ingredients.
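
The idea behind decoding a cluster center can be sketched with plain `numpy`. The `top_terms()` function below is a hypothetical stand-in, not the actual `get_top_words()` helper from `text_utils`, and the vocabulary and center values are made up.

```python
import numpy as np

def top_terms(centers, vocab, n=3):
    """Hypothetical stand-in for a helper like get_top_words():
    return the n highest-weighted vocabulary terms for each center."""
    centers = np.asarray(centers)
    # Sort each row by descending weight and keep the first n column indices
    top_idx = np.argsort(-centers, axis=1)[:, :n]
    return [[vocab[i] for i in row] for row in top_idx]

# Made-up vocabulary and cluster centers, only for illustration
toy_vocab = ["butter", "chicken", "flour", "onion", "sugar"]
toy_centers = [
    [0.6, 0.0, 0.7, 0.0, 0.5],
    [0.1, 0.8, 0.0, 0.6, 0.0],
]

print(top_terms(toy_centers, toy_vocab, n=2))
```

Each center has one weight per vocabulary word, so the largest weights point to the ingredients that best describe that cluster.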

In [None]:
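# TODO: Look at cluster centers
# `min` here is the variable set in the "Look at Data" cell
# (fewest ingredients in any recipe), not Python's built-in min()
display(get_top_words(mClust.cluster_centers_, vocab, min)[0])
display(get_top_words(mClust.cluster_centers_, vocab, 12)[0])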

### Interpretation

<span style="color:hotpink">
What do these cluster centers represent ?<br>
Is there anything interesting about recipe cluster centers ?<br>
</span>

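Each cluster center is the average `tf-idf` vector of the recipes in that cluster, so its top words are the ingredients that are most characteristic of that group of recipes.
There are some ingredients like oil, salt and pepper that appear in nearly every one of them.

My attempt at joining multi-word ingredients with "-" doesn't seem to have worked in the vectorized list: the vectorizer's default tokenizer treats the hyphen as a word boundary, so *baking-soda* is still split into *baking* and *soda*.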

### Plot Clusters

Let's plot our clusters to see if we have to adjust any of the clustering parameters.

Since we can't plot in $500$ dimensions, we should use `TruncatedSVD` to look at our clusters in $2D$ and $3D$.
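
As a rough sketch of what the reduction does, here is `TruncatedSVD` applied to a small made-up `tf-idf` matrix. The recipes are assumptions for illustration. `TruncatedSVD` does not center the data, so it can work on the sparse matrix directly.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up recipes, only for illustration
toy_recipes = [
    "flour sugar egg butter",
    "flour sugar butter milk",
    "flour egg sugar milk",
    "chicken onion garlic pepper",
    "chicken garlic onion oil",
    "chicken onion pepper oil",
]

toy_matrix = TfidfVectorizer().fit_transform(toy_recipes)

# Project the sparse 10-column matrix down to 2 dense columns
toy_svd = TruncatedSVD(n_components=2, random_state=0)
toy_2d = toy_svd.fit_transform(toy_matrix)

print(toy_matrix.shape, "->", toy_2d.shape)
print("variance kept:", toy_svd.explained_variance_ratio_.sum())
```

The explained variance ratio gives a hint of how much structure is lost in the projection; a low value means the $2D$ plot can make clusters look more overlapped than they are in the full feature space.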

In [None]:
# TODO: TruncatedSVD to reduce the dimensions of our feature space
svd = TruncatedSVD(n_components=3, random_state=1010)
ingredients_svd = svd.fit_transform(ingredients_vectorized)
ingredients_svd[:4]

# TODO: plot clusters

In [None]:
# amended from wk08
def plot_ingredient_clusters(features_3d, clusters, title="Recipes"):
    x = features_3d[:, 0]  # SVD1
    y = features_3d[:, 1]  # SVD2
    z = features_3d[:, 2]  # SVD3

    # 2D: SVD1 vs SVD2
    plt.figure()
    plt.scatter(x, y, c=clusters, marker='o', alpha=0.5)
    plt.title(f"{title} clustering")
    plt.xlabel("SVD1")
    plt.ylabel("SVD2")
    plt.ylim([-2.2, 3])
    plt.show()

    # 2D: SVD1 vs SVD3
    plt.figure()
    plt.scatter(x, z, c=clusters, marker='o', alpha=0.5)
    plt.title(f"{title} clustering")
    plt.xlabel("SVD1")
    plt.ylabel("SVD3")
    plt.ylim([-2.2, 3])
    plt.show()

    # 3D
    fig = plt.figure(figsize=(8, 8))
    ax = fig.add_subplot(projection='3d')
    ax.scatter(x, y, z, c=clusters, marker='o', alpha=0.5)

    ax.set_title(f"{title} clustering")
    ax.set_xlabel("SVD1")
    ax.set_ylabel("SVD2")
    ax.set_zlabel("SVD3")
    ax.set_ylim(-2.5, 8)
    ax.set_zlim(-2.5, 2.5)
    plt.show()

In [None]:
plot_ingredient_clusters(ingredients_svd, ingredients_km)

### Interpretation

<span style="color:hotpink">
What does the graph look like ?<br>
Are the clusters well-separated ?
</span>

I originally tried 12 clusters and they didn't really look well separated. 8 looks marginally better.

### Plot Silhouette Plots

We can also check the quality of our clustering by looking at the silhouette plots that we get from calling:<br>
`display_silhouette_plots(vectors, clusters)`.
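
The plots are built from per-sample silhouette values. As a complement, a single average score can be computed with `silhouette_score` from `sklearn.metrics`. This is a sketch on made-up recipes, and it is not necessarily what `display_silhouette_plots()` does internally.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score

# Made-up recipes with two obvious groups, only for illustration
toy_recipes = [
    "flour sugar egg butter",
    "flour sugar butter milk",
    "flour egg sugar milk",
    "chicken onion garlic pepper",
    "chicken garlic onion oil",
    "chicken onion pepper oil",
]

toy_matrix = TfidfVectorizer().fit_transform(toy_recipes)
toy_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(toy_matrix)

# Average silhouette: close to 1 is well separated, near 0 is overlapping
toy_score = silhouette_score(toy_matrix, toy_labels)
print(toy_labels, round(float(toy_score), 3))
```

Computing this score for a few values of `n_clusters` is one way to compare clusterings without relying only on the $2D$ projection.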

In [None]:
display_silhouette_plots(ingredients_vectorized, ingredients_km)

### Interpretation

<span style="color:hotpink">
How many clusters did you end up with ?<br>
How do they look ?<br>
</span>

I'm going with 8 clusters. Aside from 2 clusters, the silhouettes are roughly similar.

## Recipe Completion

Ok. On to the main event.

Let's create some recipes.

We'll do this using a technique similar to what is used for movie/product recommendations. Given an initial set of ingredients, we'll look at recipes that have similar ingredients and "recommend" additional ingredients.

We already have all of the recipes in our dataset encoded as `tf-idf` vectors. The rest of our algorithm will be something like:
1. Start with an initial set of ingredients
2. Encode ingredients
3. Find a set of recipes that are similar to our list of ingredients
4. Find common ingredients that are in the similar recipes, but not in our list of ingredients
5. Pick representative ingredient to add to recipe
6. Repeat

Let's start.

### 1. Initial list of ingredients

This is just a string with ingredients:

In [None]:
recipe_seed_str = "chicken onion" # feel free to change this

### 2. Encode ingredients

Transform the string into a `tf-idf` vector:
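
Two details are easy to miss here: `transform()` expects a list of documents, even for a single string, and any word that the fitted vectorizer has never seen is silently dropped. A small sketch with a made-up vocabulary:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Vectorizer fitted on two made-up recipes, only for illustration
toy_vectorizer = TfidfVectorizer().fit(["chicken onion salt", "flour sugar salt"])

# "dragonfruit" is not in the fitted vocabulary, so it is ignored
seed_vct = toy_vectorizer.transform(["chicken onion dragonfruit"])

print(seed_vct.shape)
print(toy_vectorizer.inverse_transform(seed_vct))
```

If a seed ingredient seems to have no effect on the results, checking it against the vectorizer's vocabulary is a good first step.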

In [None]:
# TODO: transform string into sparse vector
recipe_seed_vct = tfid.transform([recipe_seed_str])

### 3. Find similar recipes

The meat of the algorithm. No pun intended.

In order to find similar recipes, we'll first calculate the distance between our current list of ingredients and all recipes in our dataset.

We can start with euclidean distance and later try other kinds, but the overall processing will be the same:

1. Start with an empty list to store distances
2. Loop over the `tf-idf` recipe vectors and for each vector:
   1. Subtract the ingredient list
   2. Square the difference (to square a sparse matrix `A`, use `A.multiply(A)`)
   3. Sum the terms of the result
   4. Take the square root of the sum
   5. Append to distance list
3. Find the indices of the smallest distances (this operation is called `argsort` and will give us the indices of the recipes that are most similar to our list of ingredients)
4. Check the recipes to see if they are indeed similar (`inverse_transform()` the vectors at the indices calculated above)
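
The steps above can be sketched on a toy dataset and compared against `euclidean_distances` from `sklearn.metrics.pairwise`, which computes all of the distances in one call. The recipes and variable names below are made up for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import euclidean_distances

# Made-up recipes, only for illustration
toy_recipes = [
    "chicken onion garlic",
    "flour sugar egg",
    "chicken rice onion pepper",
    "flour butter sugar",
]

toy_vectorizer = TfidfVectorizer()
toy_matrix = toy_vectorizer.fit_transform(toy_recipes)
seed_vct = toy_vectorizer.transform(["chicken onion"])

# Loop version: subtract, square, sum, square root
loop_dists = []
for i in range(toy_matrix.shape[0]):
    diff = toy_matrix[i] - seed_vct
    loop_dists.append(np.sqrt(diff.multiply(diff).sum()))

# Library version: one call for all recipes
lib_dists = euclidean_distances(toy_matrix, seed_vct).ravel()

# Indices of the recipes, from most to least similar to the seed
closest = np.argsort(lib_dists)
print(closest)
print([toy_recipes[i] for i in closest[:2]])
```

Both versions should agree up to floating point error. The library call is usually much faster on a large dataset because it avoids slicing the sparse matrix one row at a time.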

In [None]:
# argsort a list (get sequence of indices that would sort the list)
# https://stackoverflow.com/a/3382369
def argsort(L, reverse=False):
  return sorted(range(len(L)), key=L.__getitem__, reverse=reverse)

In [None]:
import numpy as np

# TODO: list to keep distances
recipe_dists = []

# print(ingredients_vectorized.shape[0])
# TODO: loop over vectors and append euclidean distances to list
for i in range(ingredients_vectorized.shape[0]):
    recipe_vct = ingredients_vectorized[i]
    # 1. Subtract the ingredient list
    diff = recipe_vct - recipe_seed_vct
    # 2. Square the difference (to square a sparse matrix `A`, use `A.multiply(A)`)
    diff_sq = diff.multiply(diff)
    # 3. Sum the terms of the result (named sq_sum to avoid shadowing the built-in sum)
    sq_sum = diff_sq.sum()
    # 4. Take the square root of the sum
    dist = np.sqrt(sq_sum)
    # 5. Append to distance list
    recipe_dists.append(dist)

# TODO: argsort list of distances to find indices of similar recipes
dist_list = argsort(recipe_dists)

# TODO: check first 4 recipes
print(dist_list[0])
print(dist_list[1])
print(dist_list[2])
print(dist_list[3])

In [None]:
for idx in dist_list[:4]:
    print(recipes[idx])

### 4. Find ingredients to recommend

We have a way to get a set of similar recipes with similar ingredients, and now want to find a *meaningful*, or *representative*, ingredient to add to our ingredients list.

Let's consider ingredients in the $16$ most similar recipes. What we are trying to do is find an ingredient that is in a lot of these recipes, but not yet in our list of ingredients.

There are many possible ways of doing this. We could count the number of times different ingredients show up in these $16$ recipes using Python dictionaries and/or sets, but what we're trying to do here is very similar to what a `TfidfVectorizer` does: calculate relative importance of terms in a series of documents.

Let's re-encode these $16$ recipes using their own separate `TfidfVectorizer`, then sum the importance of each ingredient and look at ingredients with the highest importance scores.

We could re-use the vectors/scores from the original `TfidfVectorizer`, but they're gonna be influenced by the relative frequencies of all of the ingredients that showed up in all of the recipes. Using a separate vectorizer is a little bit more precise.

The steps we need to take are:

1. Separate the $16$ recipes most similar to our list of ingredients
   1. We have lots of representations of our recipes, but `recipes` (list of strings) might be the easiest one to use here
2. Create a new `TfidfVectorizer` and encode the $16$ recipes
3. Sum the resulting vectors to get overall importance scores for each ingredient/token
4. Convert resulting vector to a list using `A.tolist()[0]`
5. `argsort` the importance scores to get sequence of ingredient indices ordered from most to least important
6. Find the most important ingredient that isn't on the ingredient list

In [None]:
# TODO: Get 16 most similar recipes
most_similar = []
for idx in dist_list[:16]:
    most_similar.append(recipes[idx])
print(most_similar)

In [None]:
# TODO: Encode the 16 recipes
tfid_similar = TfidfVectorizer(stop_words="english", max_features=10_000)
similar_vectorized = tfid_similar.fit_transform(most_similar)
# print(similar_vectorized)

vocab_similar = tfid_similar.get_feature_names_out()
print(len(vocab_similar))
display(vocab_similar)

In [None]:
# TODO: Sum the recipe vectors by column to get ingredient importance scores
# similar_vectorized.shape
importance = similar_vectorized.sum(axis=0)
print(importance)

In [None]:
# TODO: Convert sparse vector to regular list with A.tolist()[0]
importance_list = importance.tolist()[0]
print(len(importance_list))

In [None]:
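# TODO: argsort the importance scores
# reverse=True orders the indices from most to least important
dist_importance_list = argsort(importance_list, reverse=True)
dist_importance_list

In [None]:
# TODO: Find most important ingredient not yet on the list of ingredients
# Tokenize the seed the same way the vectorizer does, then skip those tokens
seed_tokens = set(tfid_similar.build_analyzer()(recipe_seed_str))
new_ingredient = next(
    (vocab_similar[i] for i in dist_importance_list if vocab_similar[i] not in seed_tokens),
    None,
)
new_ingredient

### 5. Add ingredient to recipe

This is simply adding a word to `recipe_seed_str`

In [None]:
# TODO: add the first important ingredient to list of ingredients
if new_ingredient is not None:
    recipe_seed_str += " " + new_ingredient
print(recipe_seed_str)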

### 6. Repeat (Optional)

Now we can repeat this process until we get an empty list of important ingredients: 
1. Encode current recipe
2. Find similar recipes
3. Find important ingredients
4. Add important ingredient

Might be helpful to define a couple of functions, like `find_similar_recipes()` and `find_important_ingredients()`...

Only do this step if you're really curious about experimenting with generating unconventional ingredient lists. It's not going to be graded.

In [None]:
# TODO: Create find_similar_recipes(ingredients, recipes, vectorizer)

# TODO: Create find_important_ingredients(recipes)

# TODO: Create recipe by repeating calls to find_similar_recipes() and find_important_ingredients()