[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/pinecone-io/examples/blob/master/learn/recommendation/article-recommender/article_recommendations.ipynb) [![Open nbviewer](https://raw.githubusercontent.com/pinecone-io/examples/master/assets/nbviewer-shield.svg)](https://nbviewer.org/github/pinecone-io/examples/blob/master/learn/recommendation/article-recommender/article_recommendations.ipynb)

# Personalized Article Recommendation Engine (Example)

This notebook demonstrates how to use Pinecone's similarity search to create a simple personalized article or content recommender.

The goal is to create a recommendation engine that retrieves the best article recommendations for each user. When making recommendations with content-based filtering, we evaluate the user’s past behavior and the content items themselves. So in this example, users will be recommended articles that are similar to those they've already read.




## Install and Import Python Packages

In [1]:
!pip install --quiet \
    matplotlib==3.10.1 \
    pandas==2.2.3 \
    pinecone==6.0.2 \
    sentence-transformers==4.0.1 \
    tf-keras==2.19.0 \
    torch==2.6.0

In [2]:
import pandas as pd
import numpy as np
import re
import matplotlib.pyplot as plt
from statistics import mean

%matplotlib inline

In the following sections, we will use Pinecone to easily build an article recommendation engine. Pinecone will be responsible for storing embeddings for articles, maintaining a live index of those vectors, and returning recommended articles on-demand. 

## Pinecone Setup

Now we need a place to store these embeddings and enable a efficient vector search through them all. To do that we use Pinecone, we can get a [free API key](https://app.pinecone.io/) and enter it below where we will initialize our connection to Pinecone and create a new index.

In [3]:
import os
from pinecone import Pinecone

# Initialize connection to Pinecone (obtain API key at app.pinecone.io)
api_key = os.environ.get("PINECONE_API_KEY") or "PINECONE_API_KEY"

# Configure client
pc = Pinecone(api_key=api_key)

  from .autonotebook import tqdm as notebook_tqdm


Create the index:

In [4]:
index_name = "articles-recommendation"

In [5]:
from pinecone import ServerlessSpec

# Check if index already exists (it shouldn't if this is first time)
if not pc.has_index(name=index_name):
    # If does not exist, create index
    pc.create_index(
        index_name,
        dimension=300,
        metric="cosine",
        spec=ServerlessSpec(cloud="aws", region="us-east-1"),
    )

index_config = pc.describe_index(name=index_name)
index_config

{
    "name": "articles-recommendation",
    "metric": "cosine",
    "host": "articles-recommendation-dojoi3u.svc.aped-4627-b74a.pinecone.io",
    "spec": {
        "serverless": {
            "cloud": "aws",
            "region": "us-east-1"
        }
    },
    "status": {
        "ready": true,
        "state": "Ready"
    },
    "vector_type": "dense",
    "dimension": 300,
    "deletion_protection": "disabled",
    "tags": null
}

In [6]:
# Instantiate the Index client
index = pc.Index(host=index_config.host)

# View index stats
index.describe_index_stats()

{'dimension': 300,
 'index_fullness': 0.0,
 'metric': 'cosine',
 'namespaces': {'articles': {'vector_count': 189760}},
 'total_vector_count': 189760,
 'vector_type': 'dense'}

## Upload Articles
Next, we will prepare data for the Pinecone vector index, and insert it in batches.

This demonstration is built using the  [All the News 2 dataset](https://components.one/datasets/all-the-news-2-news-articles-dataset/) which is available for non-commercial research use. The full dataset contains 2.7 million news articles and essays from 27 American publications, but for expediency we will be working with a truncated set of 200k entries. 

First, let's define some functions to help with downloading the data we need.

In [7]:
import urllib.request
import zipfile


def progress_bar(block_num, block_size, total_size):
    downloaded = block_num * block_size
    if total_size > 0:
        percent = downloaded * 100 // total_size
        print(
            f"\rDownloading: [{'#'*percent}{' '*(100-percent)}] {percent:.2f}% ", end=""
        )
    else:
        print(f"\rDownloading: {downloaded} bytes", end="")


def download(url, filename):
    # Download the file
    print("Downloading dataset...")
    urllib.request.urlretrieve(url, filename, reporthook=progress_bar)


def unzip(zipfilename):
    # Unzip the file
    print("Extracting dataset...")
    with zipfile.ZipFile(zipfilename, "r") as zip_ref:
        zip_ref.extractall(".")

In [8]:
import os


def download_full_dataset():
    """The full dataset has 2.7M articles and may cause your notebook environment to crash"""
    zipfile = "all-the-news-2-1.zip"
    if not os.path.exists(zipfile):
        url = "https://storage.googleapis.com/examples-repo-public/article-recommender/all-the-news-2-1.zip"
        download(url, zipfile)
        unzip(zipfile)


def download_truncated_dataset():
    """This function downloads a sample of 200k articles from the full 2.7M article dataset"""
    zip_file = "all-the-news-200k.zip"
    if not os.path.exists(zip_file):
        url = "https://storage.googleapis.com/examples-repo-public/article-recommender/all-the-news-200k.zip"
        download(url, zip_file)
        unzip(zip_file)

Now let's download the dataset. This will take a few minutes to complete. 

If you prefer to work with the full data, you can call `download_full_dataset()` instead of `download_truncated_dataset()` below.

In [9]:
%time

# download_full_dataset() # Use this instead for full 2.7M record dataset
download_truncated_dataset()

CPU times: user 3 μs, sys: 1 μs, total: 4 μs
Wall time: 10.5 μs


### Create Vector Embeddings

The model used in this example is the [Average Word Embeddings Models](https://www.sbert.net/docs/sentence_transformer/pretrained_models.html). This model allows us to create vector embeddings for each article, using the content and title of each.

In [10]:
import torch
from sentence_transformers import SentenceTransformer

# set device to GPU if available
device = "cuda" if torch.cuda.is_available() else "cpu"
model = SentenceTransformer("average_word_embeddings_komninos", device=device)

In [11]:
BATCH_SIZE = 100  # batch size for upserting

In [12]:
# CSV_FILENAME = "all-the-news-2-1.csv" # use this if you opted to download the full dataset
CSV_FILENAME = "all-the-news-200k.csv"

Let's prepare data for upload.

In [13]:
import hashlib
from tqdm.auto import tqdm
from pprint import pprint


def hash_title(title: str) -> str:
    """Generate a SHA-256 hash of the title."""
    return hashlib.sha256(title.encode("utf-8")).hexdigest()


def metadata(row) -> dict:
    meta = {
        "title": row.title,
        "article": row.article[:400],
        "section": row.section,
        "publication": row.publication,
    }
    # pprint(meta)
    return meta


def create_embeddings(data) -> pd.DataFrame:
    "Preprocesses data and prepares it for upsert."

    # Drop records where 'title' or 'article' are empty string
    data["article"] = data["article"].fillna("")
    data["title"] = data["title"].fillna("")
    data["publication"] = data["publication"].fillna("")
    data["section"] = data["section"].fillna("")

    data = data[data["title"].str.strip() != ""].copy()
    data = data[data["article"].str.strip() != ""].copy()
    data = data.reset_index(drop=True)

    # Add an id column
    data["id"] = data["title"].apply(hash_title)

    # Truncate article text
    data["article"] = data.article.apply(
        lambda x: " ".join(re.split(r"(?<=[.:;])\s", x)[:4])
    )
    data["title_article"] = data["title"] + data["article"]

    # Create a vector embedding based on title and article columns
    encoded_articles = model.encode(data["title_article"])
    data["article_vector"] = pd.Series(encoded_articles.tolist())

    return data


def upload_items(data):
    """Uploads data to the Pinecone index."""

    # create a list of items for upload
    items_to_upload = [
        (str(row.id), row.article_vector, metadata(row)) for i, row in data.iterrows()
    ]

    # upsert
    index.upsert(vectors=items_to_upload, namespace="articles", async_req=True)


def process_file(filename: str) -> pd.DataFrame:
    "Reads csv files in chunks, prepares and uploads data."

    with tqdm(total=200000, desc="records processed") as pbar:
        for chunk in pd.read_csv(filename, chunksize=BATCH_SIZE):
            data = create_embeddings(chunk)
            upload_items(data)
            pbar.update(BATCH_SIZE)

In [14]:
process_file(filename=CSV_FILENAME)

records processed: 100%|██████████| 200000/200000 [02:41<00:00, 1239.07it/s]


In [15]:
# Print index statistics
index.describe_index_stats()

{'dimension': 300,
 'index_fullness': 0.0,
 'metric': 'cosine',
 'namespaces': {'articles': {'vector_count': 189760}},
 'total_vector_count': 189760,
 'vector_type': 'dense'}

## Query the Pinecone Index

We will query the index to get recommendations for different types of users. For each user we will select 10 articles matching keywords related to their areas of interest and use that to represent their reading history. Then we will average the embeddings for each of these articles to get a query vector we can use to find other similar articles to recommend.

To demonstrate, we will query Pinecone to get recommendations for users matching these descriptions:
* User who likes to read Sport News
* User who likes to read Entertainment News
* User who likes to read Business News

In [16]:
def display_recommendations(recommendations: pd.DataFrame):
    print("Recommended Articles")
    for i, row in recommendations.iterrows():
        print(f"  - {row.title}")

In [17]:
import pandas as pd


def get_simulated_reading_history(filter_chunk_fn):
    filtered_rows = []  # To store matching rows
    target_count = 10

    # Read the CSV in chunks. This adds complexity, but allows us
    # to work with large CSV files without loading everything at
    # once.
    for chunk in pd.read_csv(CSV_FILENAME, chunksize=10000):
        # Filter the chunk using the function above
        filtered_chunk = filter_chunk_fn(chunk)

        if not filtered_chunk.empty:
            filtered_rows.append(filtered_chunk)

        # If we've collected enough rows, break out of the loop early
        if sum(len(df) for df in filtered_rows) >= target_count:
            break

    # Concatenate the filtered results and select the first 10 rows
    if filtered_rows:
        result = pd.concat(filtered_rows).iloc[:target_count]
    else:
        result = pd.DataFrame()

    return result

In [18]:
def vector_average(articles_read):
    a = articles_read["article_vector"]
    user_vector = [*map(mean, zip(*a))]
    return user_vector


def recommendations(articles_read):
    query_vector = vector_average(articles_read)
    res = index.query(
        vector=query_vector, top_k=20, namespace="articles", include_metadata=True
    )
    df = pd.DataFrame([m.metadata for m in res.matches])

    # Remove articles the user has already read since it doesn't make sense to recommend these
    df = df[~df["title"].isin(articles_read["title"])]

    return df

Now that we have defined these utility functions, let's see some examples. 

## Recommendations for tennis fan

Let's choose 10 articles from our dataset about Tennis to use as a user's reading history. Each piece of content has an embedding associated with it. We can average these embeddings and use that average vector to find other similar content in the dataset.

In [19]:
def sports_filter(chunk):
    return chunk[
        ((chunk["section"] == "Sports News") | (chunk["section"] == "Sports"))
        & (chunk["article"].str.contains("Tennis", na=False))
    ]


sports_articles_read = get_simulated_reading_history(sports_filter)
sports_articles_read = create_embeddings(sports_articles_read)

print("Articles user has read:")
for i, row in sports_articles_read.iterrows():
    print(f"  - {row.title}")

Articles user has read:
  - Players and fans swelter in Melbourne, brace for worse on Friday
  - Australia's Hewitt says ATP Cup success put ITF under pressure
  - Tennis: Spanish player banned for eight months for betting offences
  - Tennis: Cecchinato in dreamland two years on from match-fixing ban
  - TIU bans Australian Lindahl for 2013 match-fixing scandal
  - U.S. draw Latvia in 2020 Fed Cup qualifiers
  - Bernard Tomic's Only Sin Was Being Honest About His Job
  - Tennis: U.S. Open to top $50 million in prize money
  - Tennis: New French Open court unveiled as prize money increases
  - Ukrainian twins banned for life for match-fixing: TIU


In [20]:
sports_recs = recommendations(sports_articles_read)
display_recommendations(sports_recs)

Recommended Articles
  - No French Open wild card for snubbed Sharapova
  - Sharapova handed wild card for Rogers Cup in Toronto
  - Wilander criticizes Davis Cup revamp as 'last' final looms
  - Roger Federer Survives a Scare, and Rafael Nadal Overcomes a Din
  - Novak Djokovic’s Invincibility Takes Another Hit, Opening a Door for Others
  - Golf: Confident Oosthuizen puts on his game face for Presidents Cup
  - Serena 'ready to go' says Fed Cup captain Rinaldi
  - French Federation extends Paire suspension
  - Tennis: Wawrinka feared for career during long layoff
  - Fleetwood prepares for special homecoming at British Open
  - Basketball : It looks like the U.S. and then all the rest
  - Russia back in the fray despite build-up distractions
  - Kvitova withdraws from Birmingham Classic
  - Everything to Know About U.S. Open Champion Naomi Osaka
  - Real Madrid could break spending record to rescue Neymar from PSG
  - In Women’s Tennis, Finesse Can Fight Power


### Recommendations for a video game fan

In [21]:
def xbox_filter(chunk):
    return chunk[
        (
            (chunk["section"] == "Entertainment")
            | (chunk["section"] == "Games")
            | (chunk["section"] == "Tech by VICE")
        )
        & (chunk["article"].str.contains("Xbox"))
    ]


xbox_articles_read = get_simulated_reading_history(xbox_filter)
xbox_articles_read = create_embeddings(xbox_articles_read)

print("Articles user has read:")
for i, row in xbox_articles_read.iterrows():
    print(f"  - {row.title}")

Articles user has read:
  - How a Horror Game About Death Expanded Its Audience By Letting Them Live
  - ‘Warframe’ Hacked, Details on 775,000 Players Traded
  - Mike Patton’s the Best Rock Star to Have Ever Rolled Into Video Games
  - We Asked Nintendo, Microsoft, and 12 Other Devs How They Deal With Crunch
  - Microsoft, Nintendo and Sony's Next Gen Consoles According to E3
  - 'PUBG' Has an NFL Red Zone Channel Skipping to the Best Part of Every Match
  - Reexamining ‘Silent Hill 3,’ Gaming’s Most Unfairly Overlooked Sequel
  - The Best Work Motherboard Published in 2017
  - 'Forza Horizon 4' Is a Living Impressionist Landscape
  - Why We’re Still Addicted To ‘Rocket League,' One Year Later


In [22]:
xbox_recs = recommendations(xbox_articles_read)
display_recommendations(xbox_recs)

Recommended Articles
  - Pre-Ordering ‘Battlefield 1’ for $150 Is a Bad Idea But I Did It Anyway
  - The best video games of 2019, from 'Apex Legends' to 'Zelda' — LIST
  - First Click: How much is an hour of entertainment worth? | The Verge
  - Fortnite is finally coming to Android this summer – TechCrunch
  - We Need More Games That Take Chances With Major Franchises
  - PS4’s exclusive zombie game Days Gone deserves a chance to be itself
  - Fortnite season 5: What we know so far
  - Dangerous, explosive hoverboards are still lurking, and here's another recall to remind you of it
  - 'The Last of Us Part II' delayed: New release date, trailer, gameplay
  - How to Explain Speedrunning to Your Parents
  - Watch the Xbox Scorpio unveiling live, right here
  - Looking Back on 2015's Biggest Tech News and Ahead to CES 2016
  - Nintendo is releasing a massive Splatoon 2 update in time for the holidays
  - 'Super Mario Maker 2' Gives You More Tools to Create Chaos
  - We Talk About Game Re

### Recommendations for Business User

In [23]:
def business_filter(chunk):
    return chunk[
        ((chunk["section"] == "Business News") | (chunk["section"] == "business"))
        & (chunk["article"].str.contains("Wall Street"))
    ]


business_articles_read = get_simulated_reading_history(business_filter)
business_articles_read = create_embeddings(business_articles_read)

print("Articles user has read:")
for i, row in business_articles_read.iterrows():
    print(f"  - {row.title}")

Articles user has read:
  - Wall St. falls as investors eye a united hawkish Fed
  - Why Cloudflare Let an Extremist Stronghold Burn
  - China Stocks Plunge as Coronavirus Fears Grow
  - Morgan Stanley elbows out rivals for plum role in $1.5 billion IPO relaunch: sources
  - New York City Pension Fund to Divest Itself of Gun Retailer Stock
  - Exxon quarterly profit falls 5.2% on weak refining, chemical margins
  - PepsiCo's mini-sized sodas boost quarterly results
  - Wall Street extends rally on U.S.-China trade optimism
  - DealBook Briefing: Could ‘Down Round’ I.P.O.s Hit the Tech Unicorns?
  - Caesars must face $11 billion in lawsuits: U.S. judge


In [24]:
business_recs = recommendations(business_articles_read)
display_recommendations(business_recs)

Recommended Articles
  - GLOBAL MARKETS-Stocks tumble as coronavirus cases rise outside China
  - GLOBAL MARKETS-Asia shares reassured by Powell pick, pause for US jobs test
  - GLOBAL MARKETS-Asia shares dragged lower by Japan, dollar still groggy
  - GLOBAL MARKETS-Carmakers drive Europe higher, Johnson batters sterling
  - Unilever touts new products, structure to fight rivals
  - GLOBAL MARKETS-Asian shares retreat from highs, markets take Trump impeachment in stride
  - UPDATE 1-South Africa's rand lifted by EM demand, manufacturing surprise
  - UPDATE 1-European shares edge higher as miners and banks gain
  - European shares edge higher on trade hopes, Imperial Brands slides
  - UPDATE 3-Australia's Blackmores changes tack in China as profit falls
  - UPDATE 3-Slump in toys and gaming sales spoils Sainsbury's Christmas
  - CORRECTED-European stocks flat ahead of U.S. payrolls
  - EMERGING MARKETS-Latam FX, stocks down; Argentine stocks surge ahead of primary vote
  - UPDATE 1-Rub

### Query Results

We can see that each user's recommendations have a high similarity to what the user actually reads. A user who likes Tennis news has plenty of Tennis news recommendations. A user who likes to read about Xbox has that kind of news. And a business user has plenty of Wall Street news that he or she can enjoy. 

Since we used only the title and the content of the article to define the embeddings, and we did not take publications and sections into account, a user may get recommendations from a publication/section that he does not regularly read. You may try adding this information when creating embeddings as well and check your query results then!

## Delete the index

Delete the index once you are sure that you do not want to use it anymore. Once it is deleted, you cannot use it again.



In [25]:
pc.delete_index(name=index_name)