# Recommendation Using Embeddings (Step-by-Step)

This notebook demonstrates how to build a simple recommendation system using
text embeddings and cosine similarity.

We will:
1. Load a news dataset
2. Generate embeddings for article descriptions
3. Cache embeddings to avoid recomputation
4. Compute similarity using vector distance
5. Recommend similar articles
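
As a quick illustration of step 4: cosine distance is one minus the cosine of the angle between two vectors, so vectors pointing the same way score 0 and orthogonal vectors score 1. Below is a minimal sketch using only the standard library. The three-dimensional vectors are made up and stand in for real embeddings, which have far more dimensions.

```python
import math

def cosine_distance(a, b):
    """Return 1 - cos(angle between a and b); assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

# Toy vectors (illustrative only, not real embeddings).
v_query = [1.0, 0.0, 0.0]
v_same_direction = [2.0, 0.0, 0.0]
v_orthogonal = [0.0, 3.0, 0.0]

print(cosine_distance(v_query, v_same_direction))  # 0.0
print(cosine_distance(v_query, v_orthogonal))      # 1.0
```

Vector length does not affect the result, only direction, which is why `v_same_direction` is at distance 0 even though it is twice as long.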



In [None]:
!pip install --upgrade openai pandas scikit-learn tqdm




# Step 1: Install and Configure Environment
**What this step does**

* Installs required libraries

* Sets API credentials securely

* Prepares Colab environment


In [None]:
import os
from google.colab import userdata

os.environ["OPENAI_API_KEY"] = userdata.get("OPENAI_API_KEY")
os.environ["OPENAI_BASE_URL"] = userdata.get("OPENAI_BASE_URL")
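
Before making any API calls, it can help to confirm that both variables are actually set. A minimal sketch follows; `missing_env_vars` is a hypothetical helper written for this example, not part of any library.

```python
import os

def missing_env_vars(names, env=None):
    """Return the names that are unset or empty in the given environment mapping."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]

required = ["OPENAI_API_KEY", "OPENAI_BASE_URL"]
missing = missing_env_vars(required)
if missing:
    print("Missing configuration:", ", ".join(missing))
else:
    print("All required variables are set.")
```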


# Step 2: Embedding Utility Functions
**What this step does**

Defines helper functions that:

* Generate embeddings

* Compute cosine distances

* Rank nearest neighbors

Keeping these helpers in one module makes the logic reusable.

In [None]:
!mkdir -p utils


In [None]:
%%writefile utils/embeddings_utils.py
import numpy as np
from openai import OpenAI
from sklearn.metrics.pairwise import cosine_distances
import os

client = OpenAI(
    api_key=os.getenv("OPENAI_API_KEY"),
    base_url=os.getenv("OPENAI_BASE_URL"),
)

def get_embedding(text, model="text-embedding-3-small"):
    """Request an embedding from OpenAI."""
    response = client.embeddings.create(
        model=model,
        input=text
    )
    return response.data[0].embedding


def distances_from_embeddings(query_embedding, embeddings, distance_metric="cosine"):
    """Return the distance from the query to each embedding.

    Smaller values mean more similar. Only "cosine" is supported.
    """
    if distance_metric != "cosine":
        raise ValueError(f"Unsupported distance_metric: {distance_metric!r}")
    return cosine_distances([query_embedding], embeddings)[0]


def indices_of_nearest_neighbors_from_distances(distances):
    """Return the indices that sort distances ascending (nearest first)."""
    return np.argsort(distances)


Overwriting utils/embeddings_utils.py
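
To see what the two distance helpers return without calling the API, the same logic can be reproduced on toy vectors. The sketch below is a standard-library stand-in for `cosine_distances` and `np.argsort`, not the utility module itself. The two-dimensional vectors are made up, and `rank_by_distance` is a hypothetical name.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity; assumes non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return 1.0 - dot / (math.hypot(*a) * math.hypot(*b))

def rank_by_distance(query, candidates):
    """Return (indices ordered nearest first, distance to each candidate)."""
    distances = [cosine_distance(query, c) for c in candidates]
    order = sorted(range(len(candidates)), key=lambda i: distances[i])
    return order, distances

# Illustrative 2-D vectors.
candidates = [[0.0, 1.0], [1.0, 0.1], [1.0, 1.0]]
order, distances = rank_by_distance([1.0, 0.0], candidates)
print(order)  # [1, 2, 0]
```

The candidate pointing almost the same way as the query comes first and the orthogonal one comes last.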


# Step 3: Load Dataset
**What this step does**

* Loads a sample of the AG News dataset

* Verifies structure before processing

In [None]:
import pandas as pd

dataset_path = "/content/AG_news_samples.csv"
df = pd.read_csv(dataset_path)
df.head()


Unnamed: 0,title,description,label_int,label
0,World Briefings,BRITAIN: BLAIR WARNS OF CLIMATE THREAT Prime M...,1,World
1,Nvidia Puts a Firewall on a Motherboard (PC Wo...,PC World - Upcoming chip set will include buil...,4,Sci/Tech
2,"Olympic joy in Greek, Chinese press",Newspapers in Greece reflect a mixture of exhi...,2,Sports
3,U2 Can iPod with Pictures,"SAN JOSE, Calif. -- Apple Computer (Quote, Cha...",4,Sci/Tech
4,The Dream Factory,"Any product, any shape, any size -- manufactur...",4,Sci/Tech
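
`df.head()` gives a visual check. A programmatic check can also catch a missing column or an empty description before any API calls are made. The sketch below uses the standard-library `csv` module on an in-memory sample so that it runs without the dataset file. The sample rows are made up, and `check_rows` is a hypothetical helper.

```python
import csv
import io

REQUIRED_COLUMNS = {"title", "description", "label"}

def check_rows(csv_text):
    """Return (row_count, problems) for CSV text with a header row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        return 0, [f"missing columns: {sorted(missing)}"]
    problems = []
    count = 0
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        count += 1
        if not (row["description"] or "").strip():
            problems.append(f"line {line_no}: empty description")
    return count, problems

# Made-up sample rows for illustration.
sample = "title,description,label\nA,First story,World\nB,,Sports\n"
print(check_rows(sample))  # (2, ['line 3: empty description'])
```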


# Step 4: Create an Embedding Cache
**Why this step matters**

* Embedding requests cost money and are rate-limited

* A cache avoids recomputing embeddings for text that has already been seen

* Mimics real-world production pipelines
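
The idea can be sketched without touching the API: key the cache on the `(text, model)` pair and call the embedding function only on a miss. In this sketch, `fake_embed` is a stand-in that records each call and returns a dummy vector, and `make_cached_embedder` is a hypothetical helper.

```python
def make_cached_embedder(embed_fn):
    """Wrap embed_fn so each (text, model) pair is computed at most once."""
    cache = {}

    def cached(text, model="text-embedding-3-small"):
        key = (text, model)
        if key not in cache:
            cache[key] = embed_fn(text, model)
        return cache[key]

    return cached, cache

calls = []

def fake_embed(text, model):
    """Stand-in for an API call: records the request and returns a dummy vector."""
    calls.append(text)
    return [float(len(text)), 0.0]

embed, cache = make_cached_embedder(fake_embed)
embed("hello")
embed("hello")   # served from the cache, no second call
embed("world!")
print(len(calls))  # 2
```

Including the model name in the key keeps vectors from different models from being mixed up if the model is changed later.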

In [None]:
import pickle

embedding_cache_path = "data/recommendations_embeddings_cache.pkl"
!mkdir -p data

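# Load the cache if it exists; otherwise start with an empty one.
try:
    embedding_cache = pd.read_pickle(embedding_cache_path)
except FileNotFoundError:
    embedding_cache = {}

# Write the cache back so the file exists for later cells.
with open(embedding_cache_path, "wb") as f:
    pickle.dump(embedding_cache, f)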


# Step 5: Cached Embedding Function
**What this function does**

* Checks cache before calling API

* Saves new embeddings automatically

In [None]:
from utils.embeddings_utils import get_embedding

def embedding_from_string(string: str, model="text-embedding-3-small", embedding_cache=embedding_cache):
    """Return the cached embedding if present; otherwise compute and save it."""
    if (string, model) not in embedding_cache:
        embedding_cache[(string, model)] = get_embedding(string, model=model)
        # Persist only when a new embedding was added, not on every lookup.
        with open(embedding_cache_path, "wb") as f:
            pickle.dump(embedding_cache, f)
    return embedding_cache[(string, model)]
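
Each write replaces the cache file in place, so an interruption in the middle of a write could leave a truncated pickle. One common safeguard, shown here as a sketch rather than something this notebook does, is to write to a temporary file and rename it over the target. `save_cache_atomically` is a hypothetical helper, and the demonstration uses a throwaway directory.

```python
import os
import pickle
import tempfile

def save_cache_atomically(cache, path):
    """Write the cache to a temp file in the same directory, then rename it over path."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            pickle.dump(cache, f)
        os.replace(tmp_path, path)  # atomic when both paths are on the same filesystem
    except BaseException:
        # Clean up the partial temp file, then re-raise the original error.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise

# Demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "cache.pkl")
    save_cache_atomically({("text", "model"): [0.1, 0.2]}, demo_path)
    with open(demo_path, "rb") as f:
        restored = pickle.load(f)
print(restored)
```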


In [None]:
print("Example embedding:", embedding_from_string(df["description"][0])[:5])


Example embedding: [0.05458051711320877, -0.004273026250302792, 0.04776814952492714, 0.015847930684685707, -0.036448199301958084]


In [None]:
from utils.embeddings_utils import distances_from_embeddings, indices_of_nearest_neighbors_from_distances

def print_recommendations_from_strings(strings, index_of_source_string, k_nearest_neighbors=5, model="text-embedding-3-small"):
    """Print the k nearest neighbors of a given string, ranked by cosine distance."""
    # 1) Get embeddings for all strings (uses cache)
    embeddings = [embedding_from_string(s, model=model) for s in strings]

    # 2) Compute distances
    query_embedding = embeddings[index_of_source_string]
    distances = distances_from_embeddings(query_embedding, embeddings)

    # 3) Sort & print
    sorted_indices = indices_of_nearest_neighbors_from_distances(distances)
    query_string = strings[index_of_source_string]

    print(f"Source string: {query_string}\n")
    neighbors = []
    count = 0

    for i in sorted_indices:
        if strings[i] == query_string:
            # skip the source string itself (and any exact duplicates of it)
            continue
        if count >= k_nearest_neighbors:
            break
        neighbors.append(i)
        print(f"--- Recommendation #{count+1} ---")
        print(f"String: {strings[i]}")
        print(f"Distance: {distances[i]:.3f}\n")
        count += 1

    return neighbors


In [None]:
article_descriptions = df["description"].tolist()


# Step 6: Generate Embeddings for Dataset
**What this step does**

* Converts each article description into a vector

* Runs once as an offline step; later runs read from the cache

* Populates the cache

In [None]:
from tqdm import tqdm
tqdm.pandas()

df["embedding"] = df["description"].progress_apply(
    embedding_from_string  # cached
)



100%|██████████| 2000/2000 [11:59<00:00,  2.78it/s]
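
The progress bar above reflects one request per description. The OpenAI embeddings endpoint also accepts a list of strings as `input`, so sending descriptions in batches is one way to reduce the number of round trips. The sketch below shows only the batching logic. `embed_batch` is a hypothetical callable standing in for a function that would call `client.embeddings.create(model=..., input=batch)` and return one vector per input, and the batch size of 100 is an arbitrary choice.

```python
def batched(items, batch_size):
    """Yield consecutive slices of items with at most batch_size elements each."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def embed_all(texts, embed_batch, batch_size=100):
    """Embed texts batch by batch, preserving input order."""
    vectors = []
    for batch in batched(texts, batch_size):
        vectors.extend(embed_batch(batch))
    return vectors

# Fake batch embedder for demonstration: one dummy vector per input string.
requests_made = []

def fake_embed_batch(batch):
    requests_made.append(len(batch))
    return [[float(len(text))] for text in batch]

texts = [f"article {i}" for i in range(250)]
vectors = embed_all(texts, fake_embed_batch, batch_size=100)
print(len(vectors), requests_made)  # 250 [100, 100, 50]
```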


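# Step 7: Recommendation Logic
**What this step does**

* Uses vector similarity (cosine distance)

* Makes no API calls, because the embeddings are already stored in the DataFrame

* Fast at this scale: each query is a single pass over 2,000 precomputed vectors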
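
`np.argsort` orders every distance even though only the top few are needed. For small `k`, selecting just the `k` smallest is an alternative. Here is a sketch using the standard library's `heapq.nsmallest` on a made-up list of distances; `top_k_neighbors` is a hypothetical helper, not part of the notebook's utilities.

```python
import heapq

def top_k_neighbors(distances, source_index, k):
    """Return indices of the k smallest distances, excluding source_index."""
    candidates = ((d, i) for i, d in enumerate(distances) if i != source_index)
    return [i for _, i in heapq.nsmallest(k, candidates)]

# Made-up distances from article 0 to every article (index 0 is the article itself).
distances = [0.0, 0.42, 0.17, 0.88, 0.23]
print(top_k_neighbors(distances, source_index=0, k=3))  # [2, 4, 1]
```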

In [None]:
import numpy as np
from utils.embeddings_utils import (
    distances_from_embeddings,
    indices_of_nearest_neighbors_from_distances,
)

def print_recommendations_from_embeddings(
    df,
    index_of_source_string,
    k_nearest_neighbors=5,
):
    """Print the k articles closest to the source article, using the stored embeddings."""
    source_embedding = df.iloc[index_of_source_string]["embedding"]

    embeddings = np.vstack(df["embedding"].values)
    distances = distances_from_embeddings(source_embedding, embeddings)

    nearest_indices = indices_of_nearest_neighbors_from_distances(distances)

    print("Source article:\n")
    print(df.iloc[index_of_source_string]["description"])
    print("\n--- Similar articles ---\n")

    shown = 0
    for idx in nearest_indices:
        if idx == index_of_source_string:
            continue
        print(df.iloc[idx]["description"])
        print("-" * 80)
        shown += 1
        if shown >= k_nearest_neighbors:
            break


In [None]:
print_recommendations_from_embeddings(
    df=df,
    index_of_source_string=0,
    k_nearest_neighbors=5
)


Source article:

BRITAIN: BLAIR WARNS OF CLIMATE THREAT Prime Minister Tony Blair urged the international community to consider global warming a dire threat and agree on a plan of action to curb the  quot;alarming quot; growth of greenhouse gases.

--- Similar articles ---

The anguish of hostage Kenneth Bigley in Iraq hangs over Prime Minister Tony Blair today as he faces the twin test of a local election and a debate by his Labour Party about the divisive war.
--------------------------------------------------------------------------------
THE re-election of British Prime Minister Tony Blair would be seen as an endorsement of the military action in Iraq, Prime Minister John Howard said today.
--------------------------------------------------------------------------------
Israel is prepared to back a Middle East conference convened by Tony Blair early next year despite having expressed fears that the British plans were over-ambitious and designed 
------------------------------------

In [None]:
!pwd
!ls


/content
sample_data
