# Recipe Recommendation System
Creating, Testing, and Tuning unsupervised learning methods to recommend relevant recipes based on ingredient and category preference

Workflow:
1. Load, aggregate, clean, and tokenize recipe text data
    - Identify Food Recipe Specific stop words that might be useful to ignore (e.g. measurements, numbers)
    - It's likely that recipe attributes will depend heavily on ingredients and cooking methods, not necessarily
    - Think twice about assigning words as stopwords; they might end up being useful.
    - You may want to lemmatize the data to reduce sparseness; check lemmatization process to see if it strips important words, foods, or ingredients. Also, remove punctuation: it won't help for keyword 
2. ~~Create Word Embeddings using Word2Vec or GloVe Models (Consider using pretrained word embeddings)~~
    - Discuss in detail the reason to choose one over the other for this context
    - Setup neural network locally and run remotely on google colab
    - I want to try TF-IDF and a GloVe model. TF-IDF doesn't take word order into account, which isn't much of a problem with recipes: the ingredients and cooking techniques matter more. However, GloVe word vectors may be able to produce new words outside of the corpus text when summarizing the documents
3. Compare Topic Extraction Methods
    - ~~LDA2Vec~~
    - LDA
    - NNMF
4. Generate keywords using keyword summarization and textrank
    - Create a methodology that selectively assigns generated categories (e.g. the LSA/NNMF score must be above a certain threshold)
    - Define metrics that evaluate the validity, breadth, and descriptive value of the assigned categories
    - Identify Food Recipe Specific stop words that passed that might be suitable in the filter.
5. Create extra features useful for search result ranking
    - ~~Difficulty (Time, Number of Ingredients, Servings (inverse relationship),~~
    - Import Unsupervised Generated Categories
    - Create ratings that calculate overall weight.
6. Find similarity scoring methods that would work best in this context. Some variation of Cosine Similarity will work best
7. Create an algorithm that uses similarity to sort recipes based on user-inputted queries, and sorts based on other features as well.

-----

Regarding the Available Recipe Images

Around 70,000 of the 125,000 recipes have corresponding images, so it's possible to use these images to improve the models or to create a separate, supplementary model.

Ideas:
- Training a neural network to identify/predict/generate categories of foods based on their images


### The Data
Although I can no longer find the direct download, the link and code for the scraper the original user used to collect the data set is [here](https://github.com/rtlee9/recipe-box). This user collected the title, ingredients, and instructions from recipes found on Allrecipes.com, epicurious.com, and foodnetwork.com.

For this project, the data was directly downloaded and uploaded for the creation of my own model. All the code present is either my own creation or significantly modified from the Thinkful curriculum. No other software sources were used verbatim within this project.

# \~~Putting it all together~~

Reinitialize the model with 'pseudo-optimized' parameters, more easily track the flow of data, and toggle parameters all in one place. The "database" will also be created so that user queries return results quickly.

Some pieces of code will be commented out with a triple ***\''' '''*** to indicate that the code takes too long to run and should only be run when the kernel has been shut down.

In [82]:
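import pandas as pd
import numpy as np
import re
import string
import multiprocessing as mp
from functools import reduce
from operator import add

import spacy
import networkx as nx  # used by generate_wordranks (TextRank)
from sklearn.feature_extraction.text import TfidfVectorizer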

### Below is all the code necessary to clean the data into useable form for modeling.
'''
# Loading Data
allrecipes_raw = pd.read_json('../__DATA__/recipes_raw/recipes_raw_nosource_ar.json')

allrecipes = allrecipes_raw.copy().T.reset_index().drop(columns = ['index'])
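# NOTE: `epicurious` and `foodnetwork` are not defined in this cell; they are assumed to be
# loaded from their own raw JSON files and transposed the same way as `allrecipes`.
recipes = pd.concat([allrecipes, epicurious, foodnetwork]).reset_index(drop=True) # Concat does not reset indices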

# Cleaning
null_recs = recipes.copy().drop(columns = 'picture_link').T.isna().any()
rows_to_drop = recipes[null_recs].index
recipes = recipes.drop(index = rows_to_drop).reset_index(drop = True)

nc_ingred_index = [index for i, index in zip(recipes['ingredients'], recipes.index) if all(j.isdigit() or j in string.punctuation for j in i)]
nc_title_index = [index for i, index in zip(recipes['title'], recipes.index) if all(j.isdigit() or j in string.punctuation for j in i)]
nc_instr_index = [index for i, index in zip(recipes['instructions'], recipes.index) if all(j.isdigit() or j in string.punctuation for j in i)]

index_list = [nc_ingred_index, nc_title_index, nc_instr_index]

inds_to_drop = set(reduce(add, index_list))
print(len(inds_to_drop))
recipes = recipes.drop(index=inds_to_drop).reset_index(drop=True)
recipes.shape

empty_instr_ind = [index for i, index in zip(recipes['instructions'], recipes.index) if len(i) < 20]
recipes = recipes.drop(index = empty_instr_ind).reset_index(drop=True)

ingredients = []
for ing_list in recipes['ingredients']:
    clean_ings = [ing.replace('ADVERTISEMENT','').strip() for ing in ing_list]
    if '' in clean_ings:
        clean_ings.remove('')
    ingredients.append(clean_ings)
recipes['ingredients'] = ingredients

recipes['ingredient_text'] = ['; '.join(ingredients) for ingredients in recipes['ingredients']]
recipes['ingredient_text'].head()

recipes['ingredient_count'] = [len(ingredients) for ingredients in recipes['ingredients']]

all_text = recipes['title'] + ' ' + recipes['ingredient_text'] + ' ' + recipes['instructions']

def clean_text(documents):
    cleaned_text = []
    for doc in documents:
        doc = doc.translate(str.maketrans('', '', string.punctuation)) # Remove Punctuation
        doc = re.sub(r'\d+', '', doc) # Remove Digits
        doc = doc.replace('\n',' ') # Remove New Lines
        doc = doc.strip() # Remove Leading White Space
        doc = re.sub(' +', ' ', doc) # Remove multiple white spaces
        cleaned_text.append(doc)
    return cleaned_text

cleaned_text = clean_text(all_text)

# Testing Strategies and Code
nlp = spacy.load('en_core_web_sm')  # spaCy >= 3 needs the full model name, not the 'en' shortcut
' '.join([token.lemma_ for token in nlp(cleaned_text[2]) if not token.is_stop])

def text_tokenizer_mp(doc):
    # Lemmatize one document and drop spaCy's default stop words
    tok_doc = ' '.join([token.lemma_ for token in nlp(doc) if not token.is_stop])
    return tok_doc

# Parallelizing the tokenizing process
with mp.Pool(mp.cpu_count()) as pool:
    tokenized_text = pool.map(text_tokenizer_mp, cleaned_text)
'''

# Creating TF-IDF Matrices and recalling text dependencies

'''import text_tokenized.csv here'''

# TF-IDF vectorizer instance
'''vectorizer = TfidfVectorizer(lowercase = True,
                            ngram_range = (1,1))'''

'''text_tfidf = vectorizer.fit_transform(tokenized_text)'''

'text_tfidf = vectorizer.fit_transform(tokenized_text)'
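
As a quick check of the cleaning steps used in `clean_text`, the sketch below applies the same operations to a single made-up string. `clean_document` and the sample text are hypothetical and exist only for illustration.

```python
import re
import string

def clean_document(doc):
    """Apply the clean_text steps to one document."""
    doc = doc.translate(str.maketrans('', '', string.punctuation))  # remove punctuation
    doc = re.sub(r'\d+', '', doc)                                   # remove digits
    doc = doc.replace('\n', ' ')                                    # newlines become spaces
    doc = doc.strip()                                               # trim both ends
    doc = re.sub(' +', ' ', doc)                                    # collapse repeated spaces
    return doc

sample = "Banana Bread\n2 cups flour; 1/2 tsp. salt\nBake at 350 degrees F."
print(clean_document(sample))
# -> Banana Bread cups flour tsp salt Bake at degrees F
```

Note that measurement words such as "cups" and "tsp" survive this step, which is why a separate recipe-specific stop word list is still needed.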

In [83]:
# Set All Recommendation Model Parameters
N_topics = 50             # Number of Topics to Extract from corpora
N_top_docs = 200          # Number of top documents within each topic to extract keywords
N_top_words = 25          # Number of keywords to extract from each topic
N_docs_categorized = 2000 # Number of top documents within each topic to tag 
N_neighbor_window = 4     # Length of word-radius that defines the neighborhood for
                          # each word in the TextRank adjacency table

# Query Similarity Weights
w_title = 0.2
w_text = 0.3
w_categories = 0.5
w_array = np.array([w_title, w_text, w_categories])

# Recipe Stopwords: for any high volume food recipe terminology that doesn't contribute
# to the searchability of a recipe. This list must be manually created.
recipe_stopwords = ['cup','cups','ingredient','ingredients','teaspoon','teaspoons','tablespoon',
                   'tablespoons','C','F']

In [84]:
# Renaming Data Dependencies
topic_transformed_matrix = text_nmf
root_text_data = cleaned_text

### Generating  tags (keywords/categories) and assigning to corresponding documents

In [85]:
from itertools import repeat

#recipes['tag_list'] = [[] for i in repeat(None, recipes.shape[0])]

def topic_docs_4kwsummary(topic_document_scores, root_text_data):
    '''Gathers and formats the top recipes in each topic'''
    text_index = pd.Series(topic_document_scores).sort_values(ascending = False)[:N_top_docs].index
    text_4kwsummary = pd.Series(root_text_data)[text_index]
    return text_4kwsummary

def generate_filter_kws(text_list):
    '''Filters out specific parts of speech and stop words from the list of potential keywords'''
    parsed_texts = nlp(' '.join(text_list)) 
    # Keep only nouns, adjectives, and verbs whose lemma is not a recipe stop word.
    # (The expression ('NOUN' or 'ADJ' or 'VERB') evaluates to just 'NOUN'.)
    kw_filts = set([str(word) for word in parsed_texts 
                if word.pos_ in ('NOUN', 'ADJ', 'VERB')
                and word.lemma_ not in recipe_stopwords])
    return list(kw_filts), parsed_texts

def generate_adjacency(kw_filts, parsed_texts):
    '''Tabulates counts of neighbors in the neighborhood window for each unique word'''
    adjacency = pd.DataFrame(columns=kw_filts, index=kw_filts, data = 0)
    for i, word in enumerate(parsed_texts):
        if any ([str(word) == item for item in kw_filts]):
            end = min(len(parsed_texts), i+N_neighbor_window+1) # Neighborhood Window Utilized Here
            nextwords = parsed_texts[i+1:end]
            inset = [str(x) in kw_filts for x in nextwords]
            neighbors = [str(nextwords[i]) for i in range(len(nextwords)) if inset[i]]
            if neighbors:
                adjacency.loc[str(word), neighbors] += 1
    return adjacency
                
def generate_wordranks(adjacency):
    '''Runs TextRank on adjacency table'''
    # from_numpy_array replaces from_numpy_matrix, which was removed in networkx 3.0
    nx_words = nx.from_numpy_array(adjacency.values)
    ranks=nx.pagerank(nx_words, alpha=.85, tol=.00000001)
    return ranks

def generate_tag_list(ranks, kw_filts):
    '''Uses TextRank ranks to return actual key words for each topic in rank order'''
    # ranks is keyed by node position, which follows the order of kw_filts
    rank_values = [i for i in ranks.values()]
    ranked = pd.DataFrame(zip(rank_values, list(kw_filts))).sort_values(by=0,axis=0,ascending=False)
    kw_list = ranked.iloc[:N_top_words,1].to_list()
    return kw_list

# Master Function utilizing all above functions
def generate_tags(topic_document_scores, root_text_data):
    text_4kwsummary = topic_docs_4kwsummary(topic_document_scores, root_text_data)
    kw_filts, parsed_texts = generate_filter_kws(text_4kwsummary)
    adjacency = generate_adjacency(kw_filts, parsed_texts)
    ranks = generate_wordranks(adjacency)
    kw_list = generate_tag_list(ranks, kw_filts)  # kw_filts is passed explicitly; it is not a global
    return kw_list

def generate_kw_index(topic_document_scores):
    kw_index = pd.Series(topic_document_scores).sort_values(ascending = False)[:N_docs_categorized].index
    return kw_index

    

In [90]:
def generate_adjacency(kw_filts, parsed_texts):
    '''Tabulates counts of neighbors in the neighborhood window for each unique word.
    Redefined so that every neighbor occurrence is counted individually.'''
    adjacency = pd.DataFrame(columns=kw_filts, index=kw_filts, data=0)
    for i, word in enumerate(parsed_texts):
        if str(word) in kw_filts:
            end = min(len(parsed_texts), i + N_neighbor_window + 1)  # Window of N_neighbor_window words
            nextwords = parsed_texts[i + 1:end]
            neighbors = [str(x) for x in nextwords if str(x) in kw_filts]
            # Increment one neighbor at a time so repeats inside a window are each counted
            for neighbor in neighbors:
                adjacency.loc[str(word), neighbor] += 1
    return adjacency
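
The TextRank idea behind `generate_adjacency` and `generate_wordranks` can be sketched without spaCy or networkx. This is an illustration under stated assumptions, not the notebook's implementation: it uses plain NumPy power iteration in place of `nx.pagerank`, it symmetrizes the co-occurrence counts by adding both directions, and the token list and helper names (`cooccurrence_matrix`, `pagerank`) are hypothetical.

```python
import numpy as np

def cooccurrence_matrix(tokens, keywords, window=4):
    """Count how often each keyword is followed by another keyword within `window` tokens."""
    index = {kw: i for i, kw in enumerate(keywords)}
    counts = np.zeros((len(keywords), len(keywords)))
    for i, token in enumerate(tokens):
        if token not in index:
            continue
        for neighbor in tokens[i + 1:i + 1 + window]:
            if neighbor in index:
                counts[index[token], index[neighbor]] += 1
    return counts

def pagerank(counts, alpha=0.85, tol=1e-8, max_iter=200):
    """Power iteration on the co-occurrence graph, treated as undirected (an assumption)."""
    weights = counts + counts.T
    n = weights.shape[0]
    degree = weights.sum(axis=1, keepdims=True)
    safe_degree = np.where(degree > 0, degree, 1)
    # Row-normalize; words with no edges fall back to a uniform transition.
    transition = np.where(degree > 0, weights / safe_degree, 1.0 / n)
    ranks = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        updated = (1 - alpha) / n + alpha * (transition.T @ ranks)
        converged = np.abs(updated - ranks).sum() < tol
        ranks = updated
        if converged:
            break
    return ranks

tokens = "bake banana bread with banana and cinnamon then bake bread".split()
keywords = ["bake", "banana", "bread", "cinnamon"]
counts = cooccurrence_matrix(tokens, keywords, window=4)
ranks = pagerank(counts)

# Keywords in descending rank order, as generate_tag_list does for each topic.
for kw, score in sorted(zip(keywords, ranks), key=lambda pair: -pair[1]):
    print(f"{kw}: {score:.3f}")
```

Words that co-occur with many other keywords inside the window accumulate rank, which is what makes them candidates for topic tags.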

In [94]:
# # Generating Tags and distributing to relevant documents
# for i in range(topic_transformed_matrix.shape[1]):
#     scores = topic_transformed_matrix[:, i]
#     topic_kws = generate_tags(scores, root_text_data)
#     kw_index_4df = generate_kw_index(scores)
#     
#     # Remove duplicates from kw_index_4df
#     kw_index_4df_unique = kw_index_4df.drop_duplicates()
#     
#     # Iterate over unique index values and update DataFrame
#     for idx in kw_index_4df_unique:
#         if idx in recipes.index:
#             if 'tag_list' not in recipes.columns:
#                 recipes['tag_list'] = ''  # Create the 'tag_list' column if it doesn't exist
#             recipes.at[idx, 'tag_list'] += ', '.join(topic_kws)  # Convert list to string and concatenate
#     
#     if i % 10 == 0:
#         print('Topic #{} Checkpoint'.format(i))
# 
# print('done!')

Topic #0 Checkpoint
Topic #10 Checkpoint
Topic #20 Checkpoint
Topic #30 Checkpoint
Topic #40 Checkpoint
done!


In [95]:
# Saving the precious dataframe so that I never have to calculate that again.
# recipes.to_csv('tagged_recipes_df.csv')


# load csv

In [None]:
# # Generating Tags and distributing to relevant documents
# for i in range(topic_transformed_matrix.shape[1]):
#     scores = topic_transformed_matrix[:, i]
#     topic_kws = generate_tags(scores, root_text_data)
#     kw_index_4df = generate_kw_index(scores)
#     
#     # Iterate over unique index values and update DataFrame
#     for idx in kw_index_4df:
#         if idx in recipes.index:
#             if 'tag_list' not in recipes.columns:
#                 recipes['tag_list'] = ''  # Create the 'tag_list' column if it doesn't exist
#             recipes.at[idx, 'tag_list'] = ', '.join([recipes.at[idx, 'tag_list']] + topic_kws)  # Concatenate strings
#     
#     if i % 10 == 0:
#         print('Topic #{} Checkpoint'.format(i))
# 
# print('done!')


Topic #0 Checkpoint


In [None]:
recipes.loc[:5,'tag_list']

In [None]:
# Join each recipe's list of tags into a single space-separated string
recipes['tags'] = [' '.join(tags) for tags in recipes['tag_list']]

In [None]:
recipes.loc[:5,'tags']

### Querying Algorithm
The final product is a search algorithm that takes a list of ingredients or categories as a query and returns relevant recipes: recipes that use those ingredients directly, or that are related to them through similar ingredients and categories.
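
The scoring idea can be sketched on toy data. This is a minimal illustration, assuming a three-recipe frame with made-up contents and a vectorizer fitted on all three fields (in the notebook the vectorizer is fitted on the tokenized recipe text only). Because `TfidfVectorizer` L2-normalizes rows by default, each dot product below is a cosine similarity.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical recipes; column names mirror the title / text / tags split.
toy = pd.DataFrame({
    "title": ["banana cream pie", "cinnamon rolls", "pork roast"],
    "text": ["banana cream sugar crust bake",
             "cinnamon flour butter sugar bake",
             "pork pepper garlic roast"],
    "tags": ["dessert pie", "dessert bread", "dinner meat"],
})

vectorizer = TfidfVectorizer()
vectorizer.fit(pd.concat([toy["title"], toy["text"], toy["tags"]]))

# Same weights as w_title, w_text, and w_categories.
field_weights = {"title": 0.2, "text": 0.3, "tags": 0.5}
query_vec = vectorizer.transform(["banana cream"])

scores = np.zeros(len(toy))
for column, weight in field_weights.items():
    field_matrix = vectorizer.transform(toy[column])
    # (n_recipes, n_terms) @ (n_terms, 1) gives one similarity per recipe
    scores += weight * (field_matrix @ query_vec.T).toarray().ravel()

ranking = np.argsort(scores)[::-1]
print(toy.loc[ranking, "title"].tolist())
```

A recipe with no tags contributes zero to the largest-weighted term, so it can only reach at most half of the maximum possible score.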

In [None]:
recipes.columns

In [None]:
# Creating TF-IDF Matrices and recalling text dependencies

'''import text_tokenized.csv here'''

# TF-IDF vectorizer instance
'''vectorizer = TfidfVectorizer(lowercase = True,
                            ngram_range = (1,1))'''

'''text_tfidf = vectorizer.fit_transform(tokenized_text)'''
# title_tfidf = vectorizer.transform(recipes['title'])
# text_tfidf    <== Variable with recipe ingredients and instructions
# tags_tfidf = vectorizer.transform(recipes['tags'])
# recipes   <== DataFrame; For indexing and printing recipes

# Query Similarity Weights
w_title = .2
w_text = .3
w_categories = .5


In [None]:
def qweight_array(query_length, qw_array=None):
    '''Returns descending weights for ranked query ingredients.
    Each step splits the last weight in half, so the weights always sum to 1.'''
    if qw_array is None:              # avoid sharing a mutable default between calls
        qw_array = [1]
    if query_length > 1:
        to_split = qw_array.pop()
        split = to_split/2
        qw_array.extend([split, split])
        return qweight_array(query_length - 1, qw_array)
    else:
        return np.array(qw_array)

def ranked_query(query):
    '''Called if query ingredients are ranked in order of importance.
    Weights and adds each ranked query ingredient vector.'''
    query = [[q] for q in query]      # place words in separate documents
    q_vecs = [vectorizer.transform(q) for q in query] 
    qw_array = qweight_array(len(query))
    # Scale each sparse query vector by its weight, then sum them into one vector
    q_weighted_vecs = [vec * float(w) for vec, w in zip(q_vecs, qw_array)]
    q_final_vector = reduce(add, q_weighted_vecs)
    return q_final_vector

def overall_scores(query_vector):
    '''Calculates Query Similarity Scores against recipe title, instructions, and keywords.
    Then returns weighted averages of similarities for each recipe.'''
    final_scores = title_tfidf*query_vector.T*w_title
    final_scores += text_tfidf*query_vector.T*w_text
    final_scores += tags_tfidf*query_vector.T*w_categories
    return final_scores

def print_recipes(index, query, recipe_range):
    '''Prints recipes according to query similarity ranks'''
    print('Search Query: {}\n'.format(query))
    # Ranks are 1-based and start at the beginning of the requested range
    for rank, recipe_idx in enumerate(index, recipe_range[0] + 1):
        print('Recipe Rank: {}\t'.format(rank),recipes.loc[recipe_idx, 'title'],'\n')
        print('Ingredients:\n{}\n '.format(recipes.loc[recipe_idx, 'ingredient_text']))
        print('Instructions:\n{}\n'.format(recipes.loc[recipe_idx, 'instructions']))

def Search_Recipes(query, query_ranked=False, recipe_range=(0,3)):
    '''Master Recipe Search Function'''
    if query_ranked:
        q_vector = ranked_query(query)
    else:
        q_vector = vectorizer.transform([' '.join(query)])
    recipe_scores = overall_scores(q_vector)
    # Flatten the (n_recipes, 1) score column, sort it, and keep the requested slice of ranks
    sorted_index = pd.Series(recipe_scores.toarray().T[0]).sort_values(ascending = False)[recipe_range[0]:recipe_range[1]].index
    return print_recipes(sorted_index, query, recipe_range)
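
To make the ranked-query weighting concrete, here is an iterative equivalent of the halving scheme in `qweight_array`. `halving_weights` is a hypothetical name used only for this sketch.

```python
import numpy as np

def halving_weights(n_terms):
    """Repeatedly split the last weight in half until there is one weight per term."""
    weights = [1.0]
    for _ in range(n_terms - 1):
        last = weights.pop()
        weights.extend([last / 2, last / 2])
    return np.array(weights)

for n in (1, 2, 3, 4):
    print(n, halving_weights(n))
# 1 -> [1.]
# 2 -> [0.5, 0.5]
# 3 -> [0.5, 0.25, 0.25]
# 4 -> [0.5, 0.25, 0.125, 0.125]
```

The weights always sum to 1, and the last two terms are always tied, so the ordering of the final two ingredients in a ranked query has no effect on the result.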
    

### Testing the Algorithm

In [None]:
query = ['cinnamon', 'cream', 'banana']
Search_Recipes(query, query_ranked=True, recipe_range=(0,3))

In [None]:
# Test Rank
query = ['wine', 'cilantro','butter']
Search_Recipes(query, query_ranked=False, recipe_range=(0,3))

### -- Conclusions and Model Outlook --

Overall, the Search_Recipes function works quite well. From experimenting with the weighting, it's clear to me that the original text of the recipes returns better results than the categories generated with TextRank. More topics need to be added; from looking at the food topic documents, it's clear that the level of granularity with which LDA and NNMF can cluster recipes is very good. Another fix for this issue is to use dense word embeddings that capture semantic similarities between words with more sophistication. The biggest issue with the current model is that the words that map to each topic or category are limited and discrete. Even if a query word is more closely related to a topic than the keywords extracted for it, the word won't be factored into the category search unless it was itself extracted from that topic.

Also, it does appear that some words are more heavily weighted than others, which biases the search results towards those ingredients, although this requires more rigorous testing. "Miso", for example, is heavily weighted in the TF-IDF matrices. One workaround is to simply rank this ingredient lower in the Search_Recipes function, but a global solution is preferable. It may be more beneficial to make use of the weights that TF-IDF creates rather than to find a way to get rid of them. Still, experimenting with different word embeddings would be interesting.
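
One plausible source of this imbalance is the inverse document frequency term: a word that appears in few recipes gets a much larger IDF than a common one. The sketch below uses the smoothed IDF formula that scikit-learn's `TfidfVectorizer` applies by default, `ln((1 + n) / (1 + df)) + 1`. The document frequencies are invented for illustration and were not measured on the recipe data.

```python
import math

def smoothed_idf(n_docs, doc_freq):
    """Smoothed IDF as used by scikit-learn when smooth_idf=True (the default)."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

n_docs = 125_000  # approximate corpus size mentioned above
# Hypothetical document frequencies, for illustration only.
for term, doc_freq in [("salt", 90_000), ("butter", 40_000), ("miso", 300)]:
    print(f"{term}: idf = {smoothed_idf(n_docs, doc_freq):.2f}")
```

In a multi-word query, the term with the largest IDF takes up most of the normalized query vector, so recipes matching that one term tend to outrank recipes matching the others.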

Another issue is that many recipes were not assigned categories because of the model parameters, which unfairly lowers their ranks relative to recipes that do have categories. Hopefully a future iteration of this model will allow all recipes to have associated categories.

Future Implementation and Changes for this model:

- Word2Vec or GloVe embeddings
- LDA2Vec topic extraction
- Negative Querying that decreases rank of matching queries
- Using real databases to store the data and creating a user interface through which this model can be easily used



In [None]:
# Test 
query = ['jelly','wine']
Search_Recipes(query, query_ranked=False, recipe_range=(0,3))

In [None]:
query = ['pepper','apple','pork']
Search_Recipes(query, query_ranked=False, recipe_range=(0,3))

In [None]:
recipes['tags'][122894]

--------
### Some notes:
List of Parameters and Evaluation Methods
- Number of Topics
- Number of Documents to pull keywords from
- Number of Keywords per topic
- Number of Documents to assign keywords to
- Neighbor Window Size
- Query Title Weight
- Query Description Weight
- Query Category Weight

In [None]:
### No Category Weight
query = ['cream','banana','cinnamon']
Search_Recipes(query, query_ranked=False, recipe_range=(0,3))

In [None]:
### Empty Query
query = []
Search_Recipes(query, query_ranked=False, recipe_range=(0,3))

In [None]:
### Only Category Weight
query = ['apple','blueberry']
Search_Recipes(query, query_ranked=False, recipe_range=(0,3))

In [None]:
### Only Category Weight
query = ['japanese']
Search_Recipes(query, query_ranked=False, recipe_range=(0,3))

Further Analysis:
- Generate Tag Count column in the Recipes data frame. Analyze distribution of tags.
- See if all of the topics are easily interpretable from the generated tags.
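
The tag-count idea could start from something like the following. It assumes `tags` holds one space-separated string per recipe, as built by the `' '.join(tags)` step earlier; the toy frame and the `tag_count` column name are hypothetical.

```python
import pandas as pd

# Toy stand-in for the recipes frame; the second recipe received no tags.
toy = pd.DataFrame({"tags": ["dessert pie banana", "", "dinner meat"]})

# Number of whitespace-separated tags per recipe (an empty string counts as 0).
toy["tag_count"] = toy["tags"].str.split().str.len()

print(toy["tag_count"].describe())                           # distribution summary
print("share untagged:", (toy["tag_count"] == 0).mean())     # fraction of recipes with no tags
```

On the real data, the share of untagged recipes would quantify how many recipes are affected by the missing-category disadvantage described in the conclusions.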

### Peering into the generated topics

In [None]:
recipes.tags

In [None]:
recipes.tags[13]

In [None]:
recipes.tags[122907]

In [None]:
recipes.tags[90708]

In [None]:
recipes.tags[50409]

In [None]:
recipes.tags[30234]

In [None]:
recipes.tags[23596]

In [None]:
recipes.tags[60457]

In [None]:
recipes.tags[110997]

In [None]:
recipes.head()

