[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/pinecone-io/examples/blob/master/docs/onboarding-recommender.ipynb) [![Open nbviewer](https://raw.githubusercontent.com/pinecone-io/examples/master/assets/nbviewer-shield.svg)](https://nbviewer.org/github/pinecone-io/examples/blob/master/docs/onboarding-recommender.ipynb)

# Recommendation Systems

Recommendation systems have revolutionized the way we discover and explore new things. These intelligent systems utilize sophisticated algorithms and data analysis to understand individual preferences and provide personalized recommendations.
<br><br>
By analyzing user data, such as **browsing history**, **purchase patterns**, and **social interactions**, recommendation systems can effectively predict and suggest items that align with users' interests. Whether it's suggesting a new movie to watch, a book to read, or a product to buy, these systems streamline decision-making and enhance the overall user experience.
<br>
With their ability to uncover hidden gems and introduce users to exciting possibilities, recommendation systems have become invaluable tools in navigating the overwhelming abundance of choices in today's digital landscape.

### Recommendation Systems and Vector Databases

In recommendation systems, understanding the similarity between users and items is crucial for generating accurate and personalized recommendations. By leveraging vector databases, these systems can store and organize user and item vectors, which capture the essential characteristics and preferences associated with each user and item.
<br><br>
The vector database employs <a href="https://www.pinecone.io/learn/vector-database/#:~:text=a%20vector%20database.-,Algorithms,-Several%20algorithms%20can">advanced indexing techniques</a> to enable fast retrieval of **similar users or items** based on their **vector representations**. This lets recommendation systems process *large-scale datasets* efficiently and identify meaningful connections, leading to more precise and relevant recommendations.
<br><br>
By harnessing the power of vector databases, such as **Pinecone**, recommendation systems can optimize their performance, enhance user satisfaction, and deliver tailored experiences that align with individual preferences.
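
To make the idea of vector similarity concrete, here is a minimal sketch that ranks a handful of items against a user vector by cosine similarity. The item names and three-dimensional vectors are made up for illustration, and the exhaustive search shown here is only a stand-in: a vector database uses an index precisely so that it does not have to compare the query against every stored vector.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k_similar(query, items, k=2):
    """Score every item against the query and keep the k best (exhaustive search)."""
    scored = [(item_id, cosine_similarity(query, vec)) for item_id, vec in items.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Hypothetical item vectors, invented for this illustration
items = {
    "action_movie": [0.9, 0.1, 0.0],
    "superhero_movie": [0.8, 0.3, 0.1],
    "documentary": [0.0, 0.2, 0.9],
}
user_vector = [1.0, 0.2, 0.0]

ranked = top_k_similar(user_vector, items, k=2)
print(ranked)
```

The two items whose vectors point in roughly the same direction as the user vector come out on top, while the documentary, which points elsewhere, is left out.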
<br><br><br>
Let's take a look at how we can implement one of those use cases!

We start by installing all necessary libraries.

In [1]:
!pip install -qU \
    pinecone-client==3.0.0 \
    git+https://github.com/pinecone-io/pinecone-datasets.git \
    transformers==4.30.2 \
    tensorflow==2.11.1


---

🚨 _Note: the above `pip install` is formatted for Jupyter notebooks. If running elsewhere you may need to drop the `!`._

---

## Data Preparation

<img alt="Onboarding recommender diagram" src="https://raw.githubusercontent.com/pinecone-io/examples/master/docs/assets/onboarding_recommender_data_flow.jpg"  width="70%">

#### Downloading the Dataset

We will download a pre-embedded dataset from `pinecone-datasets`, which lets us skip the embedding and other preprocessing steps.
<br><br>
When working with your own dataset you will need to perform this embedding step yourself; here the embeddings are prebuilt, so we can jump right to the action.

In [2]:
from pinecone_datasets import load_dataset

dataset_name = "movielens-user-ratings"
dataset = load_dataset(dataset_name)
dataset.head()

Unnamed: 0,id,values,sparse_values,metadata,blob
0,tt5027774,"[-0.12388430535793304, 0.23021861910820007, -0...",,,"{'imdb_id': 'tt5027774', 'movie_id': 6705, 'po..."
1,tt5463162,"[0.008479624055325985, 0.3665461540222168, -0....",,,"{'imdb_id': 'tt5463162', 'movie_id': 7966, 'po..."
2,tt4007502,"[-0.0022702165879309177, 0.5886886715888977, -...",,,"{'imdb_id': 'tt4007502', 'movie_id': 1614, 'po..."
3,tt4209788,"[0.08350061625242233, 0.4322584867477417, -0.2...",,,"{'imdb_id': 'tt4209788', 'movie_id': 7022, 'po..."
4,tt2948356,"[-0.1614755392074585, 0.41389355063438416, -0....",,,"{'imdb_id': 'tt2948356', 'movie_id': 3571, 'po..."


In [3]:
len(dataset)

970582

If you are on the Standard Tier of Pinecone, you can limit the number of records in `dataset`; paid users can index the full dataset.

In [4]:
dataset = dataset.head(10_000)
len(dataset)

10000

#### Reformatting the Dataset

A `pinecone-datasets` dataset always contains `id`, `values`, `sparse_values`, `metadata`, and `blob`. All we need are the IDs, the vector embeddings (stored in `values`), and some metadata (which is actually stored in `blob`). Let's reformat the dataset so it is ready to add to Pinecone: we drop `sparse_values` and the empty `metadata` column, which are not needed for this example, and then rename `blob` to `metadata`.


In [6]:
dataset.drop(['sparse_values', 'metadata'], axis=1, inplace=True)
dataset.rename(columns={'blob': 'metadata'}, inplace=True)

dataset.head()

Unnamed: 0,id,values,metadata
0,tt5027774,"[-0.12388430535793304, 0.23021861910820007, -0...","{'imdb_id': 'tt5027774', 'movie_id': 6705, 'po..."
1,tt5463162,"[0.008479624055325985, 0.3665461540222168, -0....","{'imdb_id': 'tt5463162', 'movie_id': 7966, 'po..."
2,tt4007502,"[-0.0022702165879309177, 0.5886886715888977, -...","{'imdb_id': 'tt4007502', 'movie_id': 1614, 'po..."
3,tt4209788,"[0.08350061625242233, 0.4322584867477417, -0.2...","{'imdb_id': 'tt4209788', 'movie_id': 7022, 'po..."
4,tt2948356,"[-0.1614755392074585, 0.41389355063438416, -0....","{'imdb_id': 'tt2948356', 'movie_id': 3571, 'po..."
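
The same reshaping can be pictured one record at a time. The sketch below uses plain dictionaries instead of a DataFrame; the function name `to_upsert_record` and the shortened example row are hypothetical and only illustrate the target layout of `id`, `values`, and `metadata`.

```python
def to_upsert_record(row):
    """Keep id and values, and promote the `blob` field to `metadata`."""
    return {
        "id": row["id"],
        "values": row["values"],
        "metadata": row["blob"],
    }

# Hypothetical row in the original five-column layout (vector shortened)
raw_row = {
    "id": "tt5027774",
    "values": [-0.12, 0.23],
    "sparse_values": None,
    "metadata": None,
    "blob": {"movie_id": 6705, "rating": 4.0, "user_id": 4556},
}

record = to_upsert_record(raw_row)
print(record)
```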


Here is an example of the metadata value.

In [7]:
from pprint import pp

pp(dataset['metadata'][0])

{'imdb_id': 'tt5027774',
 'movie_id': 6705,
 'poster': 'https://m.media-amazon.com/images/M/MV5BMjI0ODcxNzM1N15BMl5BanBnXkFtZTgwMzIwMTEwNDI@._V1_SX300.jpg',
 'rating': 4.0,
 'title': 'Three Billboards Outside Ebbing, Missouri (2017)',
 'user_id': 4556}


Now we move on to initializing our Pinecone vector database.

Before getting started, decide whether to use a serverless or pod-based index.

In [None]:
import os

use_serverless = os.environ.get("USE_SERVERLESS", "False").lower() == "true"
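
The cell above treats only the string `true` (in any letter case) as enabling serverless. If you prefer to accept a few more spellings, a small helper like the one below is one option. The helper name `env_flag`, the set of accepted values, and the variable name `DEMO_USE_SERVERLESS` are all assumptions made for this sketch.

```python
import os

def env_flag(name, default=False):
    """Read a boolean flag from the environment, accepting a few common spellings."""
    raw = os.environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "yes", "on"}

# Demo variable, used here so the real USE_SERVERLESS setting is left untouched
os.environ["DEMO_USE_SERVERLESS"] = "True"
print(env_flag("DEMO_USE_SERVERLESS"))
```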

## Creating an Index

Now that the data is ready, we can set up our index to store it.

We begin by initializing our connection to Pinecone. To do this we need a [free API key](https://app.pinecone.io).

In [None]:
from pinecone import Pinecone

# initialize connection to pinecone (get API key at app.pinecone.io)
api_key = os.environ.get('PINECONE_API_KEY') or 'PINECONE_API_KEY'
environment = os.environ.get('PINECONE_ENVIRONMENT') or 'PINECONE_ENVIRONMENT'

# configure client
pc = Pinecone(api_key=api_key)

Now we set up our index specification, which defines the cloud provider and region where we want to deploy our index. You can find a list of all [available providers and regions here](https://docs.pinecone.io/docs/projects).

In [None]:
from pinecone import ServerlessSpec, PodSpec

if use_serverless:
    spec = ServerlessSpec(cloud='aws', region='us-west-2')
else:
    spec = PodSpec(environment=environment)

In order to create a new index, we need to specify the index name, similarity metric, as well as the dimension of the vectors stored in that index.
<br>
We will assign these values here.
<br>
Note that the dimension parameter has to match the embedding dimensions provided in the dataset (or the model that outputs those embeddings).

In [11]:
# embedding dimensions
len(dataset['values'][0])

32
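
Because a mismatch between the index dimension and the vector length causes upserts to fail, it can be worth checking every vector rather than only the first one. The helper below is a minimal sketch with a hypothetical name, `check_dimensions`, shown on toy three-dimensional vectors; the dataset in this notebook uses 32 dimensions.

```python
def check_dimensions(vectors, expected_dim):
    """Return the positions of vectors whose length differs from expected_dim."""
    return [i for i, vec in enumerate(vectors) if len(vec) != expected_dim]

# Toy vectors invented for illustration; the last one is too short
vectors = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6], [0.7, 0.8]]

bad = check_dimensions(vectors, expected_dim=3)
print(bad)
```

With the notebook's data, the equivalent call would be `check_dimensions(dataset['values'], 32)`, which should return an empty list.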

In [12]:
index_name = 'onboarding-recommender'

First, we need to check whether the index already exists. In this example, if it does, we delete it and create a new one.

In [13]:
import time

# start from a clean slate: delete the index if it already exists
if index_name in pc.list_indexes().names():
    pc.delete_index(index_name)

pc.create_index(
    index_name,
    dimension=32,  # must match the embedding dimension checked above
    metric='cosine',
    spec=spec
)
# wait for the index to be fully initialized
while not pc.describe_index(index_name).status['ready']:
    time.sleep(1)
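
The wait loop above runs forever if the index never becomes ready. One way to guard against that is to poll with a timeout. The sketch below is a generic helper, not part of the Pinecone client: the name `wait_until` and the stand-in readiness check are assumptions made for illustration.

```python
import time

def wait_until(is_ready, timeout=60.0, interval=1.0):
    """Poll is_ready() until it returns True or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while not is_ready():
        if time.monotonic() >= deadline:
            raise TimeoutError(f"not ready after {timeout} seconds")
        time.sleep(interval)

# Stand-in readiness check that succeeds on the third call
calls = {"count": 0}

def fake_ready():
    calls["count"] += 1
    return calls["count"] >= 3

wait_until(fake_ready, timeout=5.0, interval=0.01)
print(calls["count"])
```

In this notebook the readiness check could be `lambda: pc.describe_index(index_name).status['ready']`, the same expression the loop above already uses.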

Next, we initialize an `index` variable so that we can use it later to describe the index and upsert vectors.

In [14]:
index = pc.Index(index_name)

Initially the index will be empty:

In [15]:
index.describe_index_stats()

{'dimension': 32,
 'index_fullness': 0.0,
 'namespaces': {'': {'vector_count': 0}},
 'total_vector_count': 0}

We upsert like so:

In [17]:
index.upsert_from_dataframe(dataset, batch_size=1000)

upserted_count: 10000
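
With `batch_size=1000`, the 10,000 rows are sent in batches rather than in one large request. The sketch below shows what such batching amounts to on a plain list of records. It is an illustration under our own assumptions (the `batched` helper and the toy records are hypothetical), not the client's actual implementation.

```python
def batched(records, batch_size):
    """Yield consecutive slices of at most batch_size records."""
    for start in range(0, len(records), batch_size):
        yield records[start:start + batch_size]

# Toy records invented for illustration
records = [{"id": f"item-{i}", "values": [0.0, 0.0]} for i in range(2500)]

batch_sizes = [len(batch) for batch in batched(records, batch_size=1000)]
print(batch_sizes)
```

If you build records yourself instead of using a DataFrame, each batch could be sent with `index.upsert(vectors=batch)`.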

Now we should see 10K vectors in our index (note that the `total_vector_count` may take some time to fully update, particularly when using the Pinecone *Standard Tier*).

In [21]:
index.describe_index_stats()

{'dimension': 32,
 'index_fullness': 0.02066,
 'namespaces': {'': {'vector_count': 2066}},
 'total_vector_count': 2066}

## Querying the Index

Now that the index is populated, we can query it to find the most relevant recommendations.
<br>
To do that, we need to instantiate our embedding models so that we can create vectors from our input user or input item objects.

### Getting the Model

We will download the models from the HuggingFace Hub. We will use one model to embed the *example user* and another model to embed the *example item*. <br>
This will allow us to retrieve the most relevant items for a specific user or find the most similar items to a specific item.

In [22]:
from huggingface_hub import from_pretrained_keras

user_model = from_pretrained_keras("pinecone/movie-recommender-user-model")
movie_model = from_pretrained_keras("pinecone/movie-recommender-movie-model")

config.json not found in HuggingFace Hub.



Before we proceed, we can create a `movies_details` DataFrame that we can use later on to print out the results.

In [23]:
import pandas as pd

movies_details = pd.DataFrame(
    dataset['metadata'].values.tolist()
)
movies_details.head()

Unnamed: 0,imdb_id,movie_id,poster,rating,title,user_id
0,tt5027774,6705,https://m.media-amazon.com/images/M/MV5BMjI0OD...,4.0,"Three Billboards Outside Ebbing, Missouri (2017)",4556
1,tt5463162,7966,https://m.media-amazon.com/images/M/MV5BMDkzNm...,3.5,Deadpool 2 (2018),20798
2,tt4007502,1614,https://m.media-amazon.com/images/M/MV5BMjY3YT...,4.5,Frozen Fever (2015),26543
3,tt4209788,7022,https://m.media-amazon.com/images/M/MV5BNTkzMz...,4.0,Molly's Game (2017),4106
4,tt2948356,3571,https://m.media-amazon.com/images/M/MV5BOTMyMj...,4.0,Zootopia (2016),15259


#### Item Similarity

First, let's see which movies the vector database returns as most similar when we query it with a movie vector created by the `movie_model` loaded above.

In [24]:
movie_id = 1263  # you can try experimenting with different movie ids to obtain different results, for example 3571
movie_vector = movie_model(movie_id).numpy().tolist()

In [25]:
movies_details[movies_details['movie_id'] == movie_id]['title'].tolist()[0]

'Avengers: Infinity War - Part I (2018)'

In [27]:
movie_query_results = index.query(
    vector=movie_vector,
    top_k=10,
    include_metadata=True
)

In [30]:
df = pd.DataFrame(
    {
        'movies': [record.metadata['title'] for record in movie_query_results.matches],
        'scores': [record.score for record in movie_query_results.matches]
    }
)
print("Recommendations: ")
display(df)

Recommendations: 


Unnamed: 0,movies,scores
0,Avengers: Infinity War - Part I (2018),0.998588
1,Avengers: Infinity War - Part II (2019),0.987156
2,Thor: Ragnarok (2017),0.981
3,Captain America: Civil War (2016),0.979533
4,Guardians of the Galaxy (2014),0.976064
5,Guardians of the Galaxy 2 (2017),0.960372
6,Avengers: Age of Ultron (2015),0.944494
7,Untitled Spider-Man Reboot (2017),0.944431
8,Logan (2017),0.936184
9,Spider-Man: Into the Spider-Verse (2018),0.930237


The results are closely related to the query: the query movie itself ranks first, followed by other superhero films, and the query returns quickly.

#### User Recommendations

Now, let's observe how our vector database behaves when we query it using the user vector.
<br>
We expect to receive movies that closely resemble the ones that the user rated highly.

In [91]:
user_id = 825
user_vector = user_model(user_id).numpy().tolist()

Here, we are defining a function that allows us to easily display the movies that the user rated in the past.

In [92]:
def top_movies_user_rated(user):
    # get list of movies that the user has rated
    user_movies = movies_details[movies_details["user_id"] == user]
    # order by their top rated movies
    top_rated = user_movies.sort_values(by=['rating'], ascending=False)
    # return the top 14 movies
    return pd.DataFrame(
        {
            'movies': top_rated['title'].tolist()[:14],
            'ratings': top_rated['rating'].tolist()[:14]
        }
    )

In [93]:
display(top_movies_user_rated(user_id))

Unnamed: 0,movies,ratings
0,Swiss Army Man (2016),4.0
1,Blackfish (2013),3.5
2,Raman Raghav 2.0 (2016),3.5
3,Loveless (2017),3.5
4,Ae Dil Hai Mushkil (2016),3.0
5,Ted 2 (2015),3.0
6,Inside Llewyn Davis (2013),3.0
7,Drinking Buddies (2013),2.5
8,Yamla Pagla Deewana 2 (2013),1.5


And now we can pass our `user_vector` to the query to get the recommendations.

In [94]:
query_results = index.query(
    vector=user_vector,
    top_k=10,
    include_metadata=True
)

In [95]:
df = pd.DataFrame(
    {
        'movies': [record.metadata['title'] for record in query_results.matches],
        'scores': [record.score for record in query_results.matches]
    }
)
print("Recommendations: ")
display(df)

Recommendations: 


Unnamed: 0,movies,scores
0,12 Years a Slave (2013),0.868902
1,Bridegroom (2013),0.850303
2,Capital C (2015),0.850294
3,Dangal (2016),0.843546
4,The Crew (2016),0.841317
5,Capernaum (2018),0.840019
6,Shaadi Mein Zaroor Aana (2017),0.83237
7,Tanu Weds Manu Returns (2015),0.829788
8,Spider-Man: Far from Home (2019),0.826584
9,Avengers: Infinity War - Part II (2019),0.824928


Using this method, we can identify movies similar to those the user rated highly, while avoiding those the user rated poorly.
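
In practice we usually do not want to recommend a movie the user has already rated, or list the same title twice. A post-processing step along the following lines is one way to handle that. The function name `drop_seen` and the example titles and scores are hypothetical and chosen only to illustrate the filtering.

```python
def drop_seen(matches, seen_titles, limit=10):
    """Keep the best-scoring unseen titles, each at most once, preserving order."""
    kept, used = [], set(seen_titles)
    for title, score in matches:
        if title in used:
            continue
        used.add(title)
        kept.append((title, score))
        if len(kept) == limit:
            break
    return kept

# Hypothetical (title, score) pairs, sorted by descending score as a query returns them
matches = [
    ("Dangal (2016)", 0.84),
    ("Swiss Army Man (2016)", 0.83),
    ("Dangal (2016)", 0.82),
    ("Capernaum (2018)", 0.81),
]
seen = {"Swiss Army Man (2016)"}

fresh = drop_seen(matches, seen, limit=2)
print(fresh)
```

With the query results from this notebook, the input pairs could be built as `[(m.metadata['title'], m.score) for m in query_results.matches]`, and the seen titles taken from `top_movies_user_rated(user_id)['movies']`. Requesting a larger `top_k` than the number of recommendations you want leaves room for the filtered-out entries.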

## Summary

The notebook demonstrated the step-by-step process of creating and populating an index in the vector database. It covered specifying the index name, similarity metric, and vector dimensions, as well as checking whether an index exists and deleting and recreating it when necessary.

Furthermore, the notebook illustrated the usage of embedding models to generate vector representations of both users and items. The results showed that the recommendations closely resembled the movies that the user rated highly, and dissimilar movies were not included.

Overall, this example showcased the power and efficiency of vector databases in recommendation systems. It is important to note that the benefits of vector databases extend beyond movies and can be applied to various types of items, making them a valuable tool in building effective recommendation systems.