# Vector Search with Cluster

## 1. Introduction
In the previous chapters, we searched for a movie from a single query, which corresponds to a single vector.

When there are multiple queries, how do we optimize the results?

For example, a user may want to find a movie based on a list of movies they have watched in the past.

In recommender terms: what can we recommend to a user who has recently watched or viewed a few items (movies, products)?

For example, if a client viewed a couple of products, we want to suggest products similar to the ones they viewed.


## 2. Approaches

**Approach A.** Find a center point (centroid) for the vectors of all past queries. The centroid is the mean of the vectors, so it is the point that minimizes the total squared distance to them.
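
As a minimal sketch of this idea, assuming the history is a small set of equal-length embedding vectors (the 3-dimensional values below are made up for illustration), the center point can be computed as the element-wise mean:

```python
import numpy as np

# Hypothetical toy embeddings standing in for the vectors of watched movies.
history = np.array([
    [1.0, 0.0, 2.0],
    [3.0, 2.0, 0.0],
    [2.0, 4.0, 1.0],
])

# The element-wise mean across rows gives the centroid of the history.
centroid = history.mean(axis=0)
print(centroid.tolist())  # [2.0, 2.0, 1.0]
```

The resulting vector can then be used as a single query vector, in the same way as one movie's vector.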

**Approach B.** What if the points are scattered? For example, a user watched a comedy and then a thriller, or viewed an electric toothbrush and then a garden chair. These items belong to different categories. In this case, we cluster the vectors and find recommendations for each cluster.

We can also mix the two approaches and rank the combined results by their distance (similarity) to the history.
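
One way to picture the ranking step is sketched below. It rests on assumptions that are not part of the demo: similarity is measured as cosine similarity against the centroid of the history, and `rank_candidates` plus the toy vectors are hypothetical names and values used only for illustration.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two 1-D vectors."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def rank_candidates(candidates, history_centroid):
    """Sort (title, vector) pairs by similarity to the history centroid, best first."""
    return sorted(
        candidates,
        key=lambda item: cosine_similarity(item[1], history_centroid),
        reverse=True,
    )

# Hypothetical candidates merged from approach A and approach B.
history_centroid = [1.0, 1.0]
candidates = [
    ("far", [-1.0, 0.5]),
    ("close", [0.9, 1.1]),
    ("medium", [1.0, 0.0]),
]
ranked_titles = [title for title, _ in rank_candidates(candidates, history_centroid)]
print(ranked_titles)  # ['close', 'medium', 'far']
```

Other choices are possible, for example ranking by the best similarity to any single watched item instead of the centroid.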

## 3. Demo
### Set up


In [1]:
import cohere
from cassandra import ConsistencyLevel
from cassandra.auth import PlainTextAuthProvider
from cassandra.cluster import Cluster
import numpy as np
import ast

# get free Trial API Key at https://cohere.ai/
from cred import API_key
co = cohere.Client(API_key)

from cred import (ASTRA_CLIENT_ID, ASTRA_CLIENT_SECRET,
                  SECURE_CONNECT_BUNDLE_PATH)

KEYSPACE_NAME = "demo"
TABLE_NAME = "movies_35K_vectorized"

cloud_config = {"secure_connect_bundle": SECURE_CONNECT_BUNDLE_PATH}
auth_provider = PlainTextAuthProvider(ASTRA_CLIENT_ID, ASTRA_CLIENT_SECRET)
cluster = Cluster(cloud=cloud_config, auth_provider=auth_provider, protocol_version=4)
session = cluster.connect()
session.execute(f"USE {KEYSPACE_NAME};")

<cassandra.cluster.ResultSet at 0x1e536469be0>

In [2]:
import struct
def convert_bytes_to_array_astra(bstring, nbyte=4):
    """Convert bytes from a Cassandra VECTOR column to a tuple of floats.

    Parameters
    ----------
    bstring : bytes
        Raw value of a Cassandra VECTOR column (from a SELECT statement).

    nbyte : int
        Number of bytes per element.
        The default of 4 corresponds to 'float32' ('f4').

    Returns
    -------
    tuple of float
        The decoded vector elements.
    """
    s = b''
    size = len(bstring)//nbyte
    for i in range(size):
        s += bstring[i*nbyte:(i+1)*nbyte][::-1]
    # np.frombuffer(s, dtype=np.dtype('f4')) # using numpy.frombuffer
    return struct.unpack(size*'f', s)

def findMovie(vec, method="plot_vector_1024", limit=2):
    """Find movies based on Vector Search"""
    data = []
    query = (
        f"SELECT year, title, wiki_link, plot FROM {KEYSPACE_NAME}.{TABLE_NAME} "
        f"ORDER BY {method} ANN OF %s LIMIT {limit}"
    )
    for row in session.execute(query, [vec]):
        data.append((row.year, row.title, row.wiki_link, row.plot))
        
    return data

def filter_movies(viewed, movies):
    """Filter out the watched movies"""
    viewed = set([(x[0], x[1]) for x in viewed])
    res = []
    for movie in movies:
        if (movie[0], movie[1].strip()) not in viewed:
            res.append(movie)
            
    return res
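
The conversion helper above reverses each 4-byte group and then unpacks with the machine's native byte order. On a little-endian machine this amounts to reading the values as big-endian `float32`. Under that assumption, the same decoding can be sketched with an explicit big-endian format string from the standard `struct` module. The function name `decode_big_endian_floats` is hypothetical, and the round trip uses made-up values rather than real data from the table.

```python
import struct

def decode_big_endian_floats(raw: bytes) -> tuple:
    """Decode a byte string of big-endian float32 values (assumed layout)."""
    count = len(raw) // 4
    # ">" selects big-endian; "f" is a 4-byte float.
    return struct.unpack(f">{count}f", raw)

# Round trip on made-up values that are exactly representable in float32.
values = (0.5, -1.25, 3.0)
raw = struct.pack(">3f", *values)
print(decode_big_endian_floats(raw))  # (0.5, -1.25, 3.0)
```

Unlike the helper above, this version does not depend on the byte order of the machine running the notebook.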

In [19]:
import pandas as pd
filename = "data/movies_35K_year_title_only.csv"
# a list of titles and years of movies
movie_list = pd.read_csv(filename)
movie_list.tail()

Unnamed: 0,year,title,genre
34881,2014,The Water Diviner,unknown
34882,2017,Çalgı Çengi İkimiz,comedy
34883,2017,Olanlar Oldu,comedy
34884,2017,Non-Transferable,romantic comedy
34885,2017,İstanbul Kırmızısı,romantic


### Input

In [4]:
viewed = [
    (2017, "Logan", "action"),
    (2015, "Inside Out", "comedy"),
    (2015, "Mad Max: Fury Road", "action"),
    (2014, "Interstellar", "science fiction"),
    (2010, "How to Train Your Dragon", "family, fantasy"),
    (2010, "Grown Ups", "comedy"),
    (2010, "Inception","science fiction"),
]
vector_size = 1024

In [5]:
# prepared = session.prepare() # Right now Astra does not support prepared statements on vector data types.
data_by_title = {}
for year, title, genre in viewed:
    rows = session.execute(f"SELECT plot_vector_{vector_size} as plot_vector, plot_summary_vector_{vector_size} as plot_summary_vector FROM {TABLE_NAME} WHERE year=%s AND title=%s ", (year, title))
    for row in rows:
        data_by_title[(year, title)] = (year, title, convert_bytes_to_array_astra(row.plot_vector), convert_bytes_to_array_astra(row.plot_summary_vector))

### A. Centroid of all points

In [6]:
centroid_plot = np.mean([data_by_title[key][2] for key in data_by_title], axis=0).tolist()
centroid_plot_summary = np.mean([data_by_title[key][3] for key in data_by_title], axis=0).tolist()

#### Based on full plot

In [7]:
movies = findMovie(centroid_plot, limit=15)
movies = filter_movies(viewed, movies)
for movie in movies:
    print(movie[1], movie[0], movie[2])

Jumper 2008 https://en.wikipedia.org/wiki/Jumper_(2008_film)
Oblivion 2013 https://en.wikipedia.org/wiki/Oblivion_(2013_film)
Superhero Movie 2008 https://en.wikipedia.org/wiki/Superhero_Movie
Doom 2005 https://en.wikipedia.org/wiki/Doom_(film)
Komodo 1999 https://en.wikipedia.org/wiki/Komodo_(film)
Explorers 1985 https://en.wikipedia.org/wiki/Explorers_(film)
Pulse 2006 https://en.wikipedia.org/wiki/Pulse_(2006_film)
Premonition 2007 https://en.wikipedia.org/wiki/Premonition_(2007_film)
Intermedio 2005 https://en.wikipedia.org/wiki/Intermedio_(film)
Meet Dave 2008 https://en.wikipedia.org/wiki/Meet_Dave
U.F.O. 2012 https://en.wikipedia.org/wiki/U.F.O._(2012_film)
Monsters vs. Aliens 2009 https://en.wikipedia.org/wiki/Monsters_vs._Aliens
Phenomenon 1996 https://en.wikipedia.org/wiki/Phenomenon_(film)
Godsend 2004 https://en.wikipedia.org/wiki/Godsend_(2004_film)


#### Based on summarized plot

In [8]:
movies = findMovie(centroid_plot_summary, limit=15)
movies = filter_movies(viewed, movies)
for movie in movies:
    print(movie[1], movie[0], movie[2])

Lucid 2005 https://en.wikipedia.org/wiki/Lucid_(film)
Destiny 1944 https://en.wikipedia.org/wiki/Destiny_(1944_film)
Digger 1993 https://en.wikipedia.org/wiki/Digger_(1993_film)
Beyond 2012 https://en.wikipedia.org/wiki/Beyond_(2012_film)
Arcadia 2012 https://en.wikipedia.org/wiki/Arcadia_(film)
The Outsider 2014 https://en.wikipedia.org/wiki/The_Outsider_(2014_film)
Childstar 2004 https://en.wikipedia.org/wiki/Childstar
Inside Out 2011 https://en.wikipedia.org/wiki/Inside_Out_(2011_film)
Flourish 2006 https://en.wikipedia.org/wiki/Flourish_(film)
A.C.O.D. 2013 https://en.wikipedia.org/wiki/A.C.O.D.
Trauma 2004 https://en.wikipedia.org/wiki/Trauma_(2004_film)
Spiral 2007 https://en.wikipedia.org/wiki/Spiral_(2007_film)
Anesthesia 2016 https://en.wikipedia.org/wiki/Anesthesia_(film)
Spike 2008 https://en.wikipedia.org/wiki/Spike_(2008_film)


### B. Clustering

There are several ways to cluster all the data points.

> i. Group by categories. For example, genre of movies, brand of products, or episodes of a series or TV show.
>
> ii. Automatic Clustering with KMeans (K-Means clustering). By setting the `k`-parameter, we can group the data into `k` clusters.

#### i. Group by categories

In [9]:
genres = set([v[2] for v in viewed])
genres

{'action', 'comedy', 'family, fantasy', 'science fiction'}

Group movies by their genre (category) and calculate the mean vector of each genre.

In [10]:
vector_by_genre = []
for genre in genres:
    movies_by_genre = [(movie[0], movie[1]) for movie in viewed if movie[2]==genre]
    vector_by_genre.append(np.mean([data_by_title[key][2] for key in data_by_title if key in movies_by_genre], axis=0).tolist())

print(len(vector_by_genre), len(vector_by_genre[0]))

4 1024
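
The list built above does not record which vector belongs to which genre. A possible variant that keeps the genre label next to its mean vector is sketched below. It runs on made-up stand-ins (`viewed_toy`, `vectors_toy`) instead of the real `viewed` and `data_by_title`, and it skips titles that have no vector so that no group is averaged over an empty list.

```python
from collections import defaultdict

import numpy as np

# Hypothetical stand-ins for `viewed` and the per-movie plot vectors.
viewed_toy = [
    (2017, "Movie A", "action"),
    (2015, "Movie B", "comedy"),
    (2015, "Movie C", "action"),
]
vectors_toy = {
    (2017, "Movie A"): [1.0, 0.0],
    (2015, "Movie B"): [0.0, 4.0],
    (2015, "Movie C"): [3.0, 2.0],
}

# Collect the vectors of each genre, then average each group.
grouped = defaultdict(list)
for year, title, genre in viewed_toy:
    if (year, title) in vectors_toy:  # skip titles that were not found
        grouped[genre].append(vectors_toy[(year, title)])

centroid_by_genre = {g: np.mean(vs, axis=0).tolist() for g, vs in grouped.items()}
print(centroid_by_genre)  # {'action': [2.0, 1.0], 'comedy': [0.0, 4.0]}
```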


For each genre, find a few movies that are close to the mean vector.

In [11]:
for vector_plot in vector_by_genre:
    movies = findMovie(vector_plot, limit=4)
    movies = filter_movies(viewed, movies)
    for movie in movies:
        print(movie[1], movie[0], movie[2])

Oblivion 2013 https://en.wikipedia.org/wiki/Oblivion_(2013_film)
Explorers 1985 https://en.wikipedia.org/wiki/Explorers_(film)
Grown Ups 2 2013 https://en.wikipedia.org/wiki/Grown_Ups_2
Loser 2000 https://en.wikipedia.org/wiki/Loser_(film)
Big Bully 1996 https://en.wikipedia.org/wiki/Big_Bully_(film)
Mad Max 2: The Road Warrior 1981 https://en.wikipedia.org/wiki/Mad_Max_2:_The_Road_Warrior
 X-Men 2000 https://en.wikipedia.org/wiki/X-Men_(film)
How to Train Your Dragon 2 2014 https://en.wikipedia.org/wiki/How_to_Train_Your_Dragon_2
Dragonheart 1996 https://en.wikipedia.org/wiki/Dragonheart
Dragon Nest: Warriors' Dawn 2014 https://en.wikipedia.org/wiki/Dragon_Nest:_Warriors%27_Dawn


#### ii. Automatic Clustering with KMeans

Given the parameter `k`, the algorithm groups the observations into `k` clusters and computes a centroid for each cluster.
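
To make the grouping step concrete, here is a simplified sketch of the assign-then-update loop behind K-Means (Lloyd's iteration) on four made-up 2-D points. It is an illustration only and not scikit-learn's implementation: `simple_kmeans` is a hypothetical helper, and it naively takes the first `k` points as initial centroids, whereas the library uses a more careful initialization.

```python
import numpy as np

def simple_kmeans(points, k, n_iter=10):
    """Simplified Lloyd's iteration; initial centroids are the first k points."""
    points = np.asarray(points, dtype=float)
    centroids = points[:k].copy()
    for _ in range(n_iter):
        # Assign each point to its nearest centroid (Euclidean distance).
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Move each centroid to the mean of its assigned points.
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(axis=0)
    return centroids, labels

toy = [[0.0, 0.0], [10.0, 10.0], [0.0, 1.0], [10.0, 11.0]]
centroids, labels = simple_kmeans(toy, k=2)
print(centroids.tolist())  # [[0.0, 0.5], [10.0, 10.5]]
print(labels.tolist())     # [0, 1, 0, 1]
```

Each resulting centroid plays the same role as a per-genre mean vector: it is used as one query vector for the search.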


In [12]:
from sklearn.cluster import KMeans

In [13]:
X = np.array([data_by_title[key][2] for key in data_by_title])
print(len(X), len(X[0]))

7 1024


In [14]:
def find_movies_by_auto_cluster(k):
    # fit data according to the number of clusters (k)
    kmeans = KMeans(n_clusters=k).fit(X)
    vector_auto_cluster = kmeans.cluster_centers_

    print("Total number of clusters =", len(vector_auto_cluster), " with dimension of" ,len(vector_auto_cluster[0]))
    for vector_plot in map(tuple,vector_auto_cluster):
        movies = findMovie(vector_plot, limit=5)
        movies = filter_movies(viewed, movies)
        for movie in movies:
            print(movie[1], movie[0], movie[2])

Group into 3 clusters (k=3)

In [15]:
k = 3
find_movies_by_auto_cluster(k)

Total number of clusters = 3  with dimension of 1024
 X-Men 2000 https://en.wikipedia.org/wiki/X-Men_(film)
X-Men Origins: Wolverine 2009 https://en.wikipedia.org/wiki/X-Men_Origins:_Wolverine
Loser 2000 https://en.wikipedia.org/wiki/Loser_(film)
They 2002 https://en.wikipedia.org/wiki/They_(2002_film)
Closet Monster 2015 https://en.wikipedia.org/wiki/Closet_Monster_(film)
Den 2001 https://en.wikipedia.org/wiki/Den_(film)
Dreamscape 1984 https://en.wikipedia.org/wiki/Dreamscape_(1984_film)
How to Train Your Dragon 2 2014 https://en.wikipedia.org/wiki/How_to_Train_Your_Dragon_2
Dragonheart 1996 https://en.wikipedia.org/wiki/Dragonheart
Dragon Nest: Warriors' Dawn 2014 https://en.wikipedia.org/wiki/Dragon_Nest:_Warriors%27_Dawn
He's a Dragon 2015 https://en.wikipedia.org/wiki/He%27s_a_Dragon


Group into 4 clusters (k=4)

In [16]:
k = 4
find_movies_by_auto_cluster(k)

Total number of clusters = 4  with dimension of 1024
Grown Ups 2 2013 https://en.wikipedia.org/wiki/Grown_Ups_2
Big Bully 1996 https://en.wikipedia.org/wiki/Big_Bully_(film)
Full Grown Men 2006 https://en.wikipedia.org/wiki/Full_Grown_Men
Loser 2000 https://en.wikipedia.org/wiki/Loser_(film)
Mindscape 2013 https://en.wikipedia.org/wiki/Mindscape_(film)
Oblivion 2013 https://en.wikipedia.org/wiki/Oblivion_(2013_film)
Extracted 2012 https://en.wikipedia.org/wiki/Extracted
Sphere 1998 https://en.wikipedia.org/wiki/Sphere_(1998_film)
Mad Max 2: The Road Warrior 1981 https://en.wikipedia.org/wiki/Mad_Max_2:_The_Road_Warrior
 X-Men 2000 https://en.wikipedia.org/wiki/X-Men_(film)
X-Men Origins: Wolverine 2009 https://en.wikipedia.org/wiki/X-Men_Origins:_Wolverine
How to Train Your Dragon 2 2014 https://en.wikipedia.org/wiki/How_to_Train_Your_Dragon_2
Dragonheart 1996 https://en.wikipedia.org/wiki/Dragonheart
Dragon Nest: Warriors' Dawn 2014 https://en.wikipedia.org/wiki/Dragon_Nest:_Warriors%

In [17]:
# Close connection to Cassandra
cluster.shutdown()