<a href="https://colab.research.google.com/github/mustafajnedian/python-ml-nb/blob/main/ML_Vectorizing_Embedding.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

🎯 Which One Should You Use in the App?

✅ Use Count Vectorizer → If you need basic word frequency features.

✅ Use TF-IDF Vectorizer → If you need importance-based word weighting.

✅ Use Word2Vec → If you want semantic similarity between words (dense embeddings, commonly used as input to deep learning models).

📌 The App's Text Processing Pipeline
- Input: User searches for a recipe like ```"Spicy chicken curry"```
- Processing: Convert the input and stored recipes into numerical form using `Count Vectorizer`, `TF-IDF`, or `Word2Vec`
- Output: Find the most similar recipe from the database

In [119]:
recipes = [
    "Spicy chicken and curry with rice and potatoes herbs and apples",
    "Mild chicken stew with vegetables",
    "Hot and spicy beef stir fry",
    "Vegan lentil soup with coconut milk",
    "Grilled salmon with garlic butter"
]

query = "Spicy chicken vegetables rice"


1️⃣ Using Count Vectorizer for Recipe Similarity
📌 What happens?

- Each recipe is converted into a bag-of-words count vector.
- Cosine similarity is used to find the closest match.

In [120]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Convert text to Count Vectors
vectorizer = CountVectorizer()
recipe_vectors = vectorizer.fit_transform(recipes)
query_vector = vectorizer.transform([query])

# Compute similarity
similarities = cosine_similarity(query_vector, recipe_vectors)

# Find the most similar recipe
best_match_index = similarities.argmax()
print(f"Best Match (CountVectorizer): {recipes[best_match_index]}")


Best Match (CountVectorizer): Mild chicken stew with vegetables
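
This winner may look surprising, since the first recipe shares three words with the query and the second only two. A minimal plain-Python sketch of the same computation suggests why: the first recipe repeats "and" three times, which lengthens its vector and lowers its cosine score. The `tokenize` and `cosine` helpers are illustrative names, and the regex only approximates `CountVectorizer`'s default tokenization (lowercase, tokens of two or more word characters).

```python
import math
import re
from collections import Counter

recipes = [
    "Spicy chicken and curry with rice and potatoes herbs and apples",
    "Mild chicken stew with vegetables",
    "Hot and spicy beef stir fry",
    "Vegan lentil soup with coconut milk",
    "Grilled salmon with garlic butter",
]
query = "Spicy chicken vegetables rice"

def tokenize(text):
    # Approximates CountVectorizer defaults: lowercase, tokens of 2+ word characters
    return re.findall(r"\b\w\w+\b", text.lower())

def cosine(a, b):
    # a and b are Counters mapping word -> count
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

recipe_counts = [Counter(tokenize(r)) for r in recipes]
vocabulary = set().union(*recipe_counts)
# Words that never appeared in the recipes are dropped from the query
query_counts = Counter(w for w in tokenize(query) if w in vocabulary)

scores = [cosine(query_counts, rc) for rc in recipe_counts]
for recipe, score in zip(recipes, scores):
    print(f"{score:.3f}  {recipe}")
```

The first recipe scores about 0.36 and the second about 0.45, so the shorter recipe wins even though it shares fewer words with the query.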


2️⃣ Using TF-IDF for Recipe Similarity
📌 What happens?

- Words such as "spicy" and "chicken" are weighted by how rare they are across the recipes, so words that appear in many recipes count for less.
- This helps distinguish common words from distinctive ones.

In [121]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Convert text to TF-IDF Vectors
vectorizer = TfidfVectorizer()
recipe_vectors = vectorizer.fit_transform(recipes)
query_vector = vectorizer.transform([query])

# Compute similarity
similarities = cosine_similarity(query_vector, recipe_vectors)

# Find the best match
best_match_index = similarities.argmax()
print(f"Best Match (TF-IDF): {recipes[best_match_index]}")


Best Match (TF-IDF): Mild chicken stew with vegetables
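
TF-IDF picks the same recipe as the count-based approach. The sketch below recomputes the scores by hand, assuming scikit-learn's default settings (raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, cosine similarity on the weighted vectors). The helper names `tokenize`, `tfidf`, and `cosine` are illustrative, not part of scikit-learn.

```python
import math
import re
from collections import Counter

recipes = [
    "Spicy chicken and curry with rice and potatoes herbs and apples",
    "Mild chicken stew with vegetables",
    "Hot and spicy beef stir fry",
    "Vegan lentil soup with coconut milk",
    "Grilled salmon with garlic butter",
]
query = "Spicy chicken vegetables rice"

def tokenize(text):
    # Approximates the default tokenizer: lowercase, tokens of 2+ word characters
    return re.findall(r"\b\w\w+\b", text.lower())

docs = [Counter(tokenize(r)) for r in recipes]
n_docs = len(docs)

# Document frequency: number of recipes that contain each word
df = Counter(word for doc in docs for word in doc)
# Smoothed inverse document frequency
idf = {word: math.log((1 + n_docs) / (1 + count)) + 1 for word, count in df.items()}

def tfidf(counts):
    # Weight raw counts by IDF; words outside the fitted vocabulary are ignored
    return {w: c * idf[w] for w, c in counts.items() if w in idf}

def cosine(a, b):
    dot = sum(v * b.get(w, 0.0) for w, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

query_weights = tfidf(Counter(tokenize(query)))
scores = [cosine(query_weights, tfidf(doc)) for doc in docs]
best = max(range(n_docs), key=scores.__getitem__)
print(f"Best match: {recipes[best]} ({scores[best]:.3f})")
```

Under these assumptions the first recipe scores about 0.36 and the second about 0.46. The repeated "and" still counts three times in the first recipe, so IDF weighting alone does not change the ranking here.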


3️⃣ Using Word2Vec Embeddings for Recipe Similarity
📌 What happens?

- Instead of word counts, each word is represented by a dense embedding vector (here learned from the recipes themselves rather than pre-trained).
- With enough training data, related words such as "hot" and "spicy" end up with similar vectors.

In [122]:
recipes = [
    "spicy chicken curry with rice",
    "mild chicken stew with vegetables",
    "hot and spicy beef stir fry",
    "vegan lentil soup with coconut milk",
    "grilled salmon with garlic butter",
    "pasta with creamy tomato sauce",
    "chocolate cake with vanilla frosting"
]


🛠 Step 2: Tokenize Recipes
Word2Vec expects tokenized sentences, so split each recipe into a list of words.

In [123]:
import gensim
from gensim.models import Word2Vec

# Tokenize recipes into words
tokenized_recipes = [recipe.lower().split() for recipe in recipes]
print(tokenized_recipes)


[['spicy', 'chicken', 'curry', 'with', 'rice'], ['mild', 'chicken', 'stew', 'with', 'vegetables'], ['hot', 'and', 'spicy', 'beef', 'stir', 'fry'], ['vegan', 'lentil', 'soup', 'with', 'coconut', 'milk'], ['grilled', 'salmon', 'with', 'garlic', 'butter'], ['pasta', 'with', 'creamy', 'tomato', 'sauce'], ['chocolate', 'cake', 'with', 'vanilla', 'frosting']]


🧠 Step 3: Train the Word2Vec Model
Now, train the model on your recipe dataset.

In [124]:
# Train Word2Vec
word2vec = Word2Vec(sentences=tokenized_recipes, vector_size=100, window=5, min_count=1, workers=4, epochs=50)

# Save the model
word2vec.save("app_word2vec.model")


In [127]:
ls

app_word2vec.model  sample_data/


🔍 Step 4: Test Word Similarity
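Once trained, you can look up the nearest words to any word in the vocabulary. With only seven short recipes the similarity scores are low and mostly noise, so treat the output as a smoke test rather than evidence of learned meaning.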

In [128]:
# Load the trained model
word2vec = Word2Vec.load("app_word2vec.model")

# Find words similar to 'chocolate'
print(word2vec.wv.most_similar("chocolate"))


[('vegetables', 0.28531914949417114), ('spicy', 0.18995699286460876), ('vegan', 0.10841172933578491), ('with', 0.1029624417424202), ('stir', 0.09823568910360336), ('rice', 0.09716187417507172), ('salmon', 0.08844675868749619), ('cake', 0.06361950188875198), ('chicken', 0.05991499871015549), ('grilled', 0.04435737803578377)]


⚡ Step 5: Convert a Recipe to a Vector
To compare recipe similarity, convert an entire recipe into a vector.
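
The averaging idea can be illustrated with a few made-up vectors before applying it to the trained model. In this sketch `toy_embeddings` is a hypothetical three-dimensional lookup table chosen by hand, not output from Word2Vec, and `average_vector` is an illustrative helper. The point is only that unknown words are skipped and the remaining vectors are averaged dimension by dimension.

```python
# Hypothetical 3-dimensional embeddings, chosen by hand for illustration
toy_embeddings = {
    "spicy": [1.0, 0.0, 0.0],
    "chicken": [0.0, 1.0, 0.0],
    "curry": [1.0, 1.0, 0.0],
}

def average_vector(sentence, embeddings, size=3):
    # Keep only the words that have an embedding
    vectors = [embeddings[w] for w in sentence.lower().split() if w in embeddings]
    if not vectors:
        return [0.0] * size  # no known words: fall back to a zero vector
    # Average each dimension across the known words
    return [sum(column) / len(vectors) for column in zip(*vectors)]

print(average_vector("Spicy chicken tacos", toy_embeddings))  # [0.5, 0.5, 0.0]
print(average_vector("plain tofu", toy_embeddings))           # [0.0, 0.0, 0.0]
```

Averaging ignores word order, so two recipes containing the same words in a different order get the same vector.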

In [129]:
import numpy as np

# Function to get average word embeddings for a sentence
def recipe_vector(sentence, model):
    words = sentence.lower().split()
    vectors = [model.wv[word] for word in words if word in model.wv]
    return np.mean(vectors, axis=0) if vectors else np.zeros(model.vector_size)

# Example: Get vector for a new recipe
new_recipe = "spicy grilled chicken with garlic butter"
vector = recipe_vector(new_recipe, word2vec)
print(vector)


[-3.88542801e-04  3.89766810e-03  2.91527179e-03  6.40447438e-03
 -1.75613665e-03  2.51298683e-04  1.07374333e-03  2.48682522e-03
 -2.95183086e-03 -1.73458271e-03  3.83635605e-04 -2.10796646e-03
  3.89191764e-03  3.26596503e-03  1.77895778e-03  1.06846994e-04
  3.98073765e-03  1.70906133e-03 -3.56898992e-03 -2.26425356e-03
  1.95413406e-04  8.92759635e-05 -1.31530315e-03 -1.69266935e-03
  3.62985977e-03  2.38684868e-03 -2.27567367e-03  2.62030284e-03
  3.33555363e-04  2.77260318e-03  6.96676376e-04 -3.04600969e-03
  6.48959074e-04 -4.84899944e-03 -1.69743737e-03  1.09213416e-03
  2.24337936e-03 -2.47957348e-03 -3.96257819e-04  3.85728665e-04
 -5.33806102e-04  1.08142721e-03 -3.34651046e-03  6.85494859e-04
 -1.95841002e-03  5.61099849e-04 -2.70867051e-04 -1.30036846e-03
 -1.12414442e-03  9.16933583e-04  1.29569368e-03 -1.90785399e-03
 -6.40225224e-03  2.51590653e-04 -1.58424489e-04  1.02637417e-03
  1.95680954e-03 -1.59009581e-03 -1.44484977e-03  8.26469157e-04
 -3.18185310e-03 -5.87507

🔥 Step 6: Find Similar Recipes in the App
Now find the recipe in the app's database that is most similar to the query.
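
Picking the single highest score is the simplest approach, but a search feature usually shows several candidates. A minimal sketch of ranking recipes by score follows; the `scores` values are placeholders for illustration, not results from the model, and `top_k` is a hypothetical helper name.

```python
recipes = [
    "spicy chicken curry with rice",
    "mild chicken stew with vegetables",
    "hot and spicy beef stir fry",
]
# Placeholder similarity scores, one per recipe (illustrative only)
scores = [0.42, 0.31, 0.18]

def top_k(recipes, scores, k=2):
    # Pair each recipe with its score and sort from most to least similar
    ranked = sorted(zip(scores, recipes), reverse=True)
    return [(recipe, score) for score, recipe in ranked[:k]]

for recipe, score in top_k(recipes, scores):
    print(f"{score:.2f}  {recipe}")
```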

In [130]:
from sklearn.metrics.pairwise import cosine_similarity

# Convert all recipes into vectors
recipe_vectors = np.array([recipe_vector(recipe, word2vec) for recipe in recipes])

# Convert the new query into a vector (reshaped to 2D for cosine_similarity)
query_vector = recipe_vector("spicy grilled chicken with garlic butter", word2vec).reshape(1, -1)
print('Query Vector\n')
print(query_vector)

# Compute similarity scores
similarities = cosine_similarity(query_vector, recipe_vectors)

print('Similarities\n')

print(similarities)
print('\n')
# Find the best match
best_match_index = similarities.argmax()
print(f"Best Recipe Match: {recipes[best_match_index]}")


Query Vector

[[-3.88542801e-04  3.89766810e-03  2.91527179e-03  6.40447438e-03
  -1.75613665e-03  2.51298683e-04  1.07374333e-03  2.48682522e-03
  -2.95183086e-03 -1.73458271e-03  3.83635605e-04 -2.10796646e-03
   3.89191764e-03  3.26596503e-03  1.77895778e-03  1.06846994e-04
   3.98073765e-03  1.70906133e-03 -3.56898992e-03 -2.26425356e-03
   1.95413406e-04  8.92759635e-05 -1.31530315e-03 -1.69266935e-03
   3.62985977e-03  2.38684868e-03 -2.27567367e-03  2.62030284e-03
   3.33555363e-04  2.77260318e-03  6.96676376e-04 -3.04600969e-03
   6.48959074e-04 -4.84899944e-03 -1.69743737e-03  1.09213416e-03
   2.24337936e-03 -2.47957348e-03 -3.96257819e-04  3.85728665e-04
  -5.33806102e-04  1.08142721e-03 -3.34651046e-03  6.85494859e-04
  -1.95841002e-03  5.61099849e-04 -2.70867051e-04 -1.30036846e-03
  -1.12414442e-03  9.16933583e-04  1.29569368e-03 -1.90785399e-03
  -6.40225224e-03  2.51590653e-04 -1.58424489e-04  1.02637417e-03
   1.95680954e-03 -1.59009581e-03 -1.44484977e-03  8.2646915

In [131]:
from gensim.models import Word2Vec

# Sample dataset
tokenized_texts = [["spicy", "chicken", "curry"], ["hot", "beef", "stew"]]

# Train a small Word2Vec model
word2vec = Word2Vec(sentences=tokenized_texts, vector_size=100, min_count=1, window=5, workers=4, epochs=50)

# Example: Get the vector for "spicy"
print(word2vec.wv["spicy"])


[-8.7274825e-03  2.1301615e-03 -8.7354420e-04 -9.3190884e-03
 -9.4281426e-03 -1.4107180e-03  4.4324086e-03  3.7040710e-03
 -6.4986930e-03 -6.8730675e-03 -4.9994122e-03 -2.2868442e-03
 -7.2502876e-03 -9.6033178e-03 -2.7436293e-03 -8.3628409e-03
 -6.0388758e-03 -5.6709289e-03 -2.3441375e-03 -1.7069972e-03
 -8.9569986e-03 -7.3519943e-04  8.1525063e-03  7.6904297e-03
 -7.2061159e-03 -3.6668312e-03  3.1185520e-03 -9.5707225e-03
  1.4764392e-03  6.5244664e-03  5.7464195e-03 -8.7630618e-03
 -4.5171441e-03 -8.1401607e-03  4.5956374e-05  9.2636338e-03
  5.9733056e-03  5.0673080e-03  5.0610625e-03 -3.2429171e-03
  9.5521836e-03 -7.3564244e-03 -7.2703874e-03 -2.2653891e-03
 -7.7856064e-04 -3.2161034e-03 -5.9258583e-04  7.4888230e-03
 -6.9751858e-04 -1.6249407e-03  2.7443992e-03 -8.3591007e-03
  7.8558037e-03  8.5361041e-03 -9.5840869e-03  2.4462664e-03
  9.9049713e-03 -7.6658037e-03 -6.9669187e-03 -7.7365171e-03
  8.3959233e-03 -6.8133592e-04  9.1444086e-03 -8.1582209e-03
  3.7430846e-03  2.63504

In [132]:
import numpy as np

# Function to get average word embeddings for a sentence
def recipe_vector(sentence, model):
    words = sentence.lower().split()
    vectors = [model.wv[word] for word in words if word in model.wv]
    return np.mean(vectors, axis=0) if vectors else np.zeros(model.vector_size)

# Example: Get vector for a new recipe
new_recipe = "spicy grilled chicken with garlic butter"
vector = recipe_vector(new_recipe, word2vec)
print(vector)


[-7.93335121e-03  1.68572832e-03 -4.02526092e-03 -5.78183634e-03
 -2.85413582e-03  2.21132976e-03  2.81521725e-03  2.90354295e-03
 -5.30477986e-03  1.76225789e-04 -5.65334130e-03  1.18003122e-03
 -7.73494551e-03 -3.78333963e-03 -3.86010366e-03 -6.30535651e-03
 -4.57371678e-03 -7.53160566e-06  1.72686647e-03 -3.34095489e-03
 -4.09199018e-03 -4.61567007e-03  7.98116345e-03  8.47356394e-03
 -4.97396570e-03 -1.43339555e-03  1.93278119e-03 -2.04622792e-03
 -3.56501085e-03  3.55463196e-03  6.30816910e-03 -3.26597411e-03
 -1.69644202e-03 -8.73128977e-03  4.26398963e-03  1.49974902e-03
  1.49053673e-03  4.28068731e-03  2.14433484e-03 -9.15726298e-04
  5.66702848e-03 -7.09268451e-03 -8.49771220e-03  3.38777294e-03
  2.70983553e-03 -5.06432028e-03  1.40555215e-03  3.84735456e-03
  2.02801428e-03 -4.37246030e-03  3.38580227e-03 -2.00599339e-03
  8.90639424e-03  2.03099567e-03 -5.48693212e-03 -2.43560341e-03
  1.03608705e-04 -8.37306958e-03 -3.99500318e-03 -7.11993501e-03
  6.62285555e-03 -3.42274

In [133]:
sample_text = [
    "A quick brown fox jumps over the lazy dog.",
    "Birds of a feather flock together"
]


In [134]:
from sklearn.feature_extraction.text import CountVectorizer

sample_text = [
    "A quick brown fox jumps over the lazy dog.", "Machine Learning is Great"
]

# Initialize CountVectorizer
count_vectorizer = CountVectorizer()


# Fit & transform text
count_vectors = count_vectorizer.fit_transform(sample_text)

# Print feature names
print("Feature Names:", count_vectorizer.get_feature_names_out())

# Print count vector representation
print("Count Vectorizer Representation:\n", count_vectors.toarray())


Feature Names: ['brown' 'dog' 'fox' 'great' 'is' 'jumps' 'lazy' 'learning' 'machine'
 'over' 'quick' 'the']
Count Vectorizer Representation:
 [[1 1 1 0 0 1 1 0 0 1 1 1]
 [0 0 0 1 1 0 0 1 1 0 0 0]]


In [135]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize TF-IDF Vectorizer
tfidf_vectorizer = TfidfVectorizer()

# Fit & transform text
tfidf_vectors = tfidf_vectorizer.fit_transform(sample_text)

# Feature names (words)
print("Feature Names:", tfidf_vectorizer.get_feature_names_out())


# Convert to array
print("TF-IDF Representation:\n", tfidf_vectors.toarray())


Feature Names: ['brown' 'dog' 'fox' 'great' 'is' 'jumps' 'lazy' 'learning' 'machine'
 'over' 'quick' 'the']
TF-IDF Representation:
 [[0.35355339 0.35355339 0.35355339 0.         0.         0.35355339
  0.35355339 0.         0.         0.35355339 0.35355339 0.35355339]
 [0.         0.         0.         0.5        0.5        0.
  0.         0.5        0.5        0.         0.         0.        ]]
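
The values 0.35355339 and 0.5 follow from L2 normalization. In this tiny corpus every word appears in exactly one sentence, so all words share the same IDF and each nonzero entry reduces to 1/√k, where k is the number of distinct tokens in the sentence (the single-letter "A" is dropped by the default tokenizer). A quick check of that arithmetic:

```python
import math

# First sentence: 8 tokens after dropping "A"; second sentence: 4 tokens
print(round(1 / math.sqrt(8), 8))  # 0.35355339
print(1 / math.sqrt(4))            # 0.5
```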


In [136]:
from gensim.models import Word2Vec

# Tokenize text
tokenized_text = [sentence.lower().split() for sentence in sample_text]

# Train Word2Vec model
word2vec = Word2Vec(sentences=tokenized_text, vector_size=15, min_count=1, window=5)

# Get word vector for 'quick'
print("Word2Vec Representation for 'quick':\n", word2vec.wv['quick'])

# Average sentence embedding
import numpy as np

def sentence_embedding(sentence, model):
    words = sentence.lower().split()
    vectors = [model.wv[word] for word in words if word in model.wv]
    return np.mean(vectors, axis=0) if vectors else np.zeros(model.vector_size)

# Compute sentence vectors
word2vec_vectors = np.array([sentence_embedding(sentence, word2vec) for sentence in sample_text])
print("Word2Vec Sentence Representation:\n", word2vec_vectors)


Word2Vec Representation for 'quick':
 [ 0.03580226  0.05179676 -0.03844338  0.04955574  0.04416997 -0.024732
 -0.05830428  0.03624978  0.04339837 -0.00525033 -0.04473237 -0.0472395
 -0.01664707  0.03428836 -0.02443492]
Word2Vec Sentence Representation:
 [[ 0.00252982  0.01001862  0.01330935 -0.00651709  0.01192626  0.01457651
  -0.0199782  -0.00322744  0.01386847  0.00894945  0.00713015  0.01558428
  -0.00046466 -0.01658536  0.01551047]
 [-0.01693707 -0.00912823  0.04208317 -0.00269584 -0.01975534 -0.01650045
   0.04017171  0.0094373  -0.00563573  0.01055243 -0.01681479  0.01174436
  -0.02385937 -0.01485522  0.00094813]]


In [137]:
from gensim.models import Word2Vec
from sklearn.metrics.pairwise import cosine_similarity

# Tokenize text
tokenized_text = [sentence.lower().split() for sentence in sample_text]

print(tokenized_text)
# Train Word2Vec model
word2vec = Word2Vec(sentences=tokenized_text, vector_size=15, min_count=1, window=5, workers=4, epochs=50)

# Cosine similarity function
def cosine_sim(word1, word2, model):
    vec1 = model.wv[word1].reshape(1, -1)
    vec2 = model.wv[word2].reshape(1, -1)
    return cosine_similarity(vec1, vec2)[0][0]

# Find similarity
similarity = cosine_sim("quick", "fox", word2vec)
print(f"Cosine Similarity between 'quick' and 'fox': {similarity:.4f}")


[['a', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog.'], ['machine', 'learning', 'is', 'great']]
Cosine Similarity between 'quick' and 'fox': 0.1030
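
The tokenized output keeps the trailing period in 'dog.', so "dog" and "dog." would be treated as different words. A minimal sketch of a regex tokenizer that strips punctuation; `simple_tokenize` is an illustrative helper, not part of gensim.

```python
import re

def simple_tokenize(sentence):
    # Lowercase, then keep runs of letters, digits, and apostrophes
    return re.findall(r"[a-z0-9']+", sentence.lower())

sample_text = [
    "A quick brown fox jumps over the lazy dog.",
    "Machine Learning is Great",
]
tokenized_text = [simple_tokenize(s) for s in sample_text]
print(tokenized_text[0][-1])  # dog
```

The resulting lists can be passed to `Word2Vec(sentences=...)` in place of the `split()`-based version.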
