# **Text Recommendation System Documentation**

### 1. Project Initialization

> The pipeline begins by creating a client for the **OpenAI API** and loading a sample of the **AG News Dataset**. Each record in this dataset is a news article with a title, a description, and a category label.



In [1]:
import os
import numpy as np
import pandas as pd
import pickle
from openai import OpenAI
from google.colab import userdata
API_KEY = userdata.get("API_KEY")
BASE_URL = userdata.get("BASE_URL")
# pass the variables read from Colab secrets, not string literals
client = OpenAI(
    api_key=API_KEY,
    base_url=BASE_URL
)

EMBEDDING_MODEL = "text-embedding-3-small"
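The cell above only works inside Google Colab, because `google.colab` is not importable elsewhere. A minimal sketch of a fallback, assuming the same secrets are also available as environment variables when the notebook runs outside Colab; `read_setting` is a hypothetical helper, not part of the original notebook:

```python
import os
from typing import Optional


def read_setting(name: str, default: Optional[str] = None) -> Optional[str]:
    """Look up a setting in Colab secrets first, then in environment variables.

    Hypothetical helper for illustration only.
    """
    try:
        from google.colab import userdata  # only importable inside Colab
    except ImportError:
        # not running in Colab: fall back to the process environment
        return os.environ.get(name, default)
    try:
        return userdata.get(name)
    except Exception:
        # secret missing or notebook access denied: fall back as well
        return os.environ.get(name, default)


# usage: returns None when the setting is defined nowhere
API_KEY = read_setting("API_KEY")
BASE_URL = read_setting("BASE_URL")
```

Keeping the lookup in one function means the rest of the notebook does not need to know where the credentials came from.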

In [None]:
# load data (full dataset available at http://groups.di.unipi.it/~gulli/AG_corpus_of_news_articles.html)
dataset_path = "/content/AG_news_samples.csv"
df = pd.read_csv(dataset_path)

n_examples = 5
df.head(n_examples)


In [None]:
# print the title, description, and label of each example
for idx, row in df.head(n_examples).iterrows():
    print("")
    print(f"Title: {row['title']}")
    print(f"Description: {row['description']}")
    print(f"Label: {row['label']}")


### 2. Semantic Embedding & Local Caching

> To compare texts numerically, the system embeds each article description with the `text-embedding-3-small` model, which converts it into a high-dimensional vector. A local `.pkl` file caches these embeddings so that text that has already been processed is not sent to the API again.

In [5]:
# establish a cache of embeddings to avoid recomputing
# cache is a dict of tuples (text, model) -> embedding, saved as a pickle file

# set path to embedding cache
embedding_cache_path = "/content/recommendations_embeddings_cache.pkl"

# load the cache if it exists, and save a copy to disk
try:
    embedding_cache = pd.read_pickle(embedding_cache_path)
except FileNotFoundError:
    embedding_cache = {}
with open(embedding_cache_path, "wb") as embedding_cache_file:
    pickle.dump(embedding_cache, embedding_cache_file)

# define a function to retrieve embeddings from the cache if present, and otherwise request via the API
def embedding_from_string(
    string: str,
    model: str = EMBEDDING_MODEL,
    embedding_cache=embedding_cache
) -> list:
    """Return embedding of given string, using a cache to avoid recomputing."""
    if (string, model) not in embedding_cache:
        # cache miss: request the embedding from the API and store the vector
        response = client.embeddings.create(model=model, input=string)
        embedding_cache[(string, model)] = response.data[0].embedding
        # persist the updated cache so later runs can reuse it
        with open(embedding_cache_path, "wb") as embedding_cache_file:
            pickle.dump(embedding_cache, embedding_cache_file)
    return embedding_cache[(string, model)]


In [None]:
# as an example, take the first description from the dataset
example_string = df["description"].values[0]
print(f"\nExample string: {example_string}")

# print the first 10 dimensions of the embedding
example_embedding = embedding_from_string(example_string)
print(f"\nExample embedding: {example_embedding[:10]}...")
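The cache-or-compute pattern in this section can be exercised without any network access. The sketch below is an illustration under stated assumptions: `make_cached_embedder` and `fake_embed` are hypothetical names, and the "embedding" is a made-up two-number vector standing in for a real API response.

```python
from typing import Callable, Dict, List, Tuple

CacheKey = Tuple[str, str]  # (text, model), the same key shape the notebook uses


def make_cached_embedder(
    embed_fn: Callable[[str, str], List[float]],
    cache: Dict[CacheKey, List[float]],
) -> Callable[..., List[float]]:
    """Wrap embed_fn so each (text, model) pair is computed at most once."""

    def cached(text: str, model: str = "stub-model") -> List[float]:
        key = (text, model)
        if key not in cache:
            cache[key] = embed_fn(text, model)  # only runs on a cache miss
        return cache[key]

    return cached


calls: List[str] = []  # records every time the stub "API" is actually hit


def fake_embed(text: str, model: str) -> List[float]:
    """Stand-in for an API call: returns a made-up 2D vector."""
    calls.append(text)
    return [float(len(text)), float(text.count(" "))]


demo_cache: Dict[CacheKey, List[float]] = {}
embed = make_cached_embedder(fake_embed, demo_cache)

first = embed("hello world")
second = embed("hello world")  # served from the cache, no second call
print(f"stub calls made: {len(calls)}")
```

Passing the embedding function in as an argument keeps the caching logic testable and makes it easy to swap the stub for a real API call.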


### 3. Similarity Calculation (Cosine Distance)


> Once articles are converted into vectors, the system measures how related they are using **cosine distance**, defined as one minus the **cosine similarity**.

Cosine similarity is the cosine of the angle between two vectors. A smaller angle means higher similarity and therefore a smaller distance, so the nearest neighbors of an article are the ones with the lowest cosine distance.

In [7]:
import numpy as np

def cosine_distance(a, b):
    return 1 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def distances_from_embeddings(query_embedding, embeddings):
    return [cosine_distance(query_embedding, emb) for emb in embeddings]

def indices_of_nearest_neighbors_from_distances(distances):
    return np.argsort(distances)
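To make the distance scale concrete, the sketch below applies the same formula to three hand-picked 2D vectors (toy values chosen for illustration, not taken from the dataset). Vectors pointing the same way have distance 0, perpendicular vectors have distance 1, and opposite vectors have distance 2.

```python
import numpy as np


def cosine_distance(a, b):
    # 1 - cos(angle between a and b)
    return 1 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))


query = np.array([1.0, 0.0])
candidates = {
    "same direction": np.array([2.0, 0.0]),   # length differs, angle is 0
    "perpendicular": np.array([0.0, 3.0]),    # angle is 90 degrees
    "opposite": np.array([-1.0, 0.0]),        # angle is 180 degrees
}

toy_distances = {
    name: float(cosine_distance(query, vec)) for name, vec in candidates.items()
}
for name, dist in toy_distances.items():
    print(f"{name}: {dist:.1f}")
```

Because the formula divides by both norms, only the direction of each vector matters, not its length.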


In [12]:
def embedding_from_string(string: str, model=EMBEDDING_MODEL):
    key = (string, model)

    if key not in embedding_cache:
        response = client.embeddings.create(
            model=model,
            input=string
        )
        embedding_cache[key] = response.data[0].embedding

        with open(embedding_cache_path, "wb") as f:
            pickle.dump(embedding_cache, f)

    return embedding_cache[key]


In [13]:
def print_recommendations_from_strings(strings, index_of_source_string, k_nearest_neighbors=5):
    embeddings = [embedding_from_string(s) for s in strings]

    query_embedding = embeddings[index_of_source_string]

    distances = distances_from_embeddings(query_embedding, embeddings)
    sorted_indices = indices_of_nearest_neighbors_from_distances(distances)

    print("Source article:\n", strings[index_of_source_string], "\n")

    count = 0
    for i in sorted_indices:
        if i == index_of_source_string:
            continue
        if count >= k_nearest_neighbors:
            break

        count += 1
        print(f"--- Recommendation {count} ---")
        print(strings[i])
        print(f"Distance: {distances[i]:.4f}\n")

    return sorted_indices


In [14]:
article_descriptions = (
    df["description"]
    .dropna()
    .astype(str)
    .tolist()
)


In [None]:
chipset_security_articles = print_recommendations_from_strings(
    strings=article_descriptions,  # let's base similarity off of the article description
    index_of_source_string=1,  # let's look at articles similar to the second one about a more secure chipset
    k_nearest_neighbors=5,  # let's look at the 5 most similar articles
)


### 4. Dimensionality Reduction via t-SNE


> Raw embeddings have far too many dimensions to plot directly. To visualize how the articles cluster together, the system employs **t-SNE** (t-distributed Stochastic Neighbor Embedding) to project the high-dimensional data onto a **2D plane** while trying to keep neighboring points close to each other.

In [17]:
from sklearn.manifold import TSNE
import numpy as np

def tsne_components_from_embeddings(embeddings, n_components=2, perplexity=30, random_state=42):
    X = np.array(embeddings)
    tsne = TSNE(n_components=n_components, perplexity=perplexity, random_state=random_state)
    return tsne.fit_transform(X)
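As a quick sanity check of the projection step, the sketch below runs t-SNE on synthetic data instead of real embeddings: two Gaussian blobs in 50 dimensions, with all values made up for illustration. The `perplexity=10` setting is an assumption chosen for this small sample, since recent scikit-learn versions reject a perplexity that is not smaller than the number of samples.

```python
import numpy as np
from sklearn.manifold import TSNE

# two synthetic clusters of 30 points each in 50 dimensions (made-up data)
rng = np.random.default_rng(0)
blob_a = rng.normal(loc=0.0, scale=1.0, size=(30, 50))
blob_b = rng.normal(loc=10.0, scale=1.0, size=(30, 50))
toy_embeddings = np.vstack([blob_a, blob_b])

# project to 2D; perplexity is kept well below the 60 samples
toy_components = TSNE(
    n_components=2, perplexity=10, random_state=42
).fit_transform(toy_embeddings)

print(toy_components.shape)  # one (x, y) pair per input row
```

The same shape rule applies to the real data: the output has one row per article description and one column per requested component.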


In [18]:
import plotly.express as px
import pandas as pd

def chart_from_components(components, labels, strings, width=600, height=500, title="", category_orders=None):
    df_plot = pd.DataFrame({
        "x": components[:, 0],
        "y": components[:, 1],
        "label": labels,
        "text": strings
    })

    fig = px.scatter(
        df_plot,
        x="x",
        y="y",
        color="label",
        hover_data=["text"],
        title=title,
        width=width,
        height=height,
        category_orders=category_orders
    )
    fig.show()


In [19]:
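def nearest_neighbor_labels(list_of_indices, k_nearest_neighbors=5):
    """Label the source article and its k nearest neighbors; everything else is "Other".

    list_of_indices must be sorted by distance, so its first entry is the
    source article itself (distance 0 to its own embedding).
    """
    labels = ["Other"] * len(article_descriptions)

    source_index = list_of_indices[0]
    labels[source_index] = "Source"

    for i in range(1, k_nearest_neighbors + 1):
        idx = list_of_indices[i]
        labels[idx] = f"Nearest neighbor (top {k_nearest_neighbors})"

    return labels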
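The labeling logic is easier to follow on a tiny input. The sketch below is a self-contained variant that takes the number of items as a parameter instead of reading the global `article_descriptions`; `labels_from_ranking` and the toy ranking are hypothetical and used only for illustration.

```python
from typing import List, Sequence


def labels_from_ranking(
    ranked_indices: Sequence[int], n_items: int, k: int = 5
) -> List[str]:
    """Label the first ranked index as the source and the next k as neighbors."""
    labels = ["Other"] * n_items
    labels[ranked_indices[0]] = "Source"
    for idx in ranked_indices[1 : k + 1]:
        labels[idx] = f"Nearest neighbor (top {k})"
    return labels


# made-up ranking of 5 items: item 2 is the source, items 0 and 4 are closest
toy_ranking = [2, 0, 4, 1, 3]
toy_labels = labels_from_ranking(toy_ranking, n_items=5, k=2)
print(toy_labels)
```

The returned list is in the original item order, which is what the plotting function needs to color each point.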


In [None]:
embeddings = [embedding_from_string(s) for s in article_descriptions]

tsne_components = tsne_components_from_embeddings(embeddings)

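# keep labels aligned with article_descriptions, which drops rows with a missing description
labels = df.loc[df["description"].notna(), "label"].tolist()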

chart_from_components(
    components=tsne_components,
    labels=labels,
    strings=article_descriptions,
    title="t-SNE components of article descriptions"
)


In [None]:
tony_blair_articles = print_recommendations_from_strings(
    article_descriptions,
    index_of_source_string=0,
    k_nearest_neighbors=5
)

chipset_security_articles = print_recommendations_from_strings(
    article_descriptions,
    index_of_source_string=1,
    k_nearest_neighbors=5
)



### 5. Interactive Visualization


> The final stage uses **Plotly** to generate interactive scatter plots. Points are color-coded as the source article, its top 5 nearest neighbors, or other, so you can check visually whether the recommended articles sit close to the source in the 2D projection.


In [None]:
tony_blair_labels = nearest_neighbor_labels(tony_blair_articles, 5)
chipset_security_labels = nearest_neighbor_labels(chipset_security_articles, 5)

chart_from_components(
    components=tsne_components,
    labels=tony_blair_labels,
    strings=article_descriptions,
    title="Nearest neighbors of the Tony Blair article",
    category_orders={"label": ["Other", "Nearest neighbor (top 5)", "Source"]}
)

chart_from_components(
    components=tsne_components,
    labels=chipset_security_labels,
    strings=article_descriptions,
    title="Nearest neighbors of the chipset security article",
    category_orders={"label": ["Other", "Nearest neighbor (top 5)", "Source"]}
)


# **Conclusion**
This project demonstrates a **Content-Based Recommendation System** built on text embeddings. By comparing **Vector Embeddings** rather than matching keywords, the system can surface semantically related news articles even when they share few words.

---

## Key Takeaways

* **Efficiency via Caching:** A local pickle-based cache avoids redundant API calls for text that has already been embedded, which saves cost and time on repeated runs.
* **Distance-Based Ranking:** Using **cosine distance** (one minus cosine similarity), the engine ranks articles by how close their embeddings are in a high-dimensional vector space.
* **Visual Validation:** **t-SNE** projects the embeddings onto a 2D map. The projection distorts distances, so it serves as a qualitative check that similar topics (like "Tech" or "Politics") tend to land near each other rather than as proof.
* **Explainability:** The interactive charts show where the recommended articles sit relative to the source article, which makes it easier to inspect whether a given recommendation looks reasonable.

---

## Potential Next Steps

To further enhance this system, one could:

1. **Hybrid Filtering:** Combine these semantic embeddings with user behavior data (Collaborative Filtering) for more personalized results.
2. **Vector Database:** Transition from a local `.pkl` file to a dedicated vector database (like Pinecone or Milvus) to handle millions of articles in real-time.
3. **Cross-Lingual Support:** Leverage OpenAI's multi-lingual embedding capabilities to recommend related articles across different languages.
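As an intermediate step toward the vector database suggestion above, the per-article distance loop can be replaced by a single matrix operation. This is a minimal sketch, assuming all embeddings fit in memory as one NumPy array; `top_k_cosine` and the toy matrix are hypothetical and not part of the original notebook.

```python
import numpy as np


def top_k_cosine(query: np.ndarray, matrix: np.ndarray, k: int) -> np.ndarray:
    """Return indices of the k rows of `matrix` closest to `query` by cosine distance."""
    # normalize once so a dot product equals cosine similarity
    unit_rows = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    unit_query = query / np.linalg.norm(query)
    distances = 1.0 - unit_rows @ unit_query
    return np.argsort(distances)[:k]


# made-up 2D "embeddings": rows 0 and 1 point almost the same way as the query
toy_matrix = np.array([
    [1.0, 0.0],
    [0.9, 0.1],
    [0.0, 1.0],
    [-1.0, 0.0],
])
nearest = top_k_cosine(np.array([1.0, 0.0]), toy_matrix, k=2)
print(nearest)
```

This is still an exhaustive search over every row, so it does not replace the approximate indexes that dedicated vector databases provide for very large collections.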
