# Introduction
Here, we will use Lantern to implement a movie recommendation system. We will be able to search for movies similar to ones that a user has enjoyed so that we can show relevant recommendations.

We will use movie data from the [MovieLens 1M dataset](https://grouplens.org/datasets/movielens/1m/).


# Setup Postgres

We install Postgres and its development tools (needed to build Lantern from source). We then start Postgres, set the password of the default `postgres` user to `postgres`, and create a database called `ourdb`.




In [None]:
# We install postgres and its dev tools
!sudo apt-get -y -qq update
!sudo apt-get -y -qq install postgresql postgresql-server-dev-all
#  Start postgres
!sudo service postgresql start

# Create user, password, and db
!sudo -u postgres psql -U postgres -c "ALTER USER postgres PASSWORD 'postgres';"
!sudo -u postgres psql -U postgres -c 'DROP DATABASE IF EXISTS ourdb;'
!sudo -u postgres psql -U postgres -c 'CREATE DATABASE ourdb;'

debconf: unable to initialize frontend: Dialog
debconf: (No usable dialog-like program is installed, so the dialog based frontend cannot be used. at /usr/share/perl5/Debconf/FrontEnd/Dialog.pm line 78, <> line 26.)
debconf: falling back to frontend: Readline
debconf: unable to initialize frontend: Readline
debconf: (This frontend requires a controlling tty.)
debconf: falling back to frontend: Teletype
dpkg-preconfigure: unable to re-open stdin: 
Selecting previously unselected package logrotate.
(Reading database ... 120874 files and directories currently installed.)
Preparing to unpack .../00-logrotate_3.19.0-1ubuntu1.1_amd64.deb ...
Unpacking logrotate (3.19.0-1ubuntu1.1) ...
Selecting previously unselected package netbase.
Preparing to unpack .../01-netbase_6.3_all.deb ...
Unpacking netbase (6.3) ...
Selecting previously unselected package python3-yaml.
Preparing to unpack .../02-python3-yaml_5.4.1-1ubuntu1_amd64.deb ...
Unpacking python3-yaml (5.4.1-1ubuntu1) ...
Selecting previous

# Install Lantern and build it from source

In [None]:
!git clone --recursive https://github.com/lanterndata/lantern.git

Cloning into 'lantern'...
remote: Enumerating objects: 2562, done.
remote: Counting objects: 100% (1342/1342), done.
remote: Compressing objects: 100% (414/414), done.
remote: Total 2562 (delta 1068), reused 1003 (delta 922), pack-reused 1220
Receiving objects: 100% (2562/2562), 578.18 KiB | 4.45 MiB/s, done.
Resolving deltas: 100% (1698/1698), done.
Submodule 'third_party/hnswlib' (https://github.com/ngalstyan4/hnswlib) registered for path 'third_party/hnswlib'
Submodule 'third_party/usearch' (https://github.com/ngalstyan4/usearch) registered for path 'third_party/usearch'
Cloning into '/content/lantern/third_party/hnswlib'...
remote: Enumerating objects: 1723, done.        
remote: Counting objects: 100% (333/333), done.        
remote: Compressing objects: 100% (40/40), done.        
remote: Total 1723 (delta 306), reused 293 (delta 293), pack-reused 1390        
Receiving objects: 100% (1723/1723), 530.50 KiB | 8.29 MiB/s, done.
Resolving deltas: 100% (1097/1097), done.

In [None]:
# We build lantern from source
%cd lantern
!mkdir build
%cd build
!pwd
!cmake ..
!make install

/content/lantern
/content/lantern/build
/content/lantern/build
  Compatibility with CMake < 3.5 will be removed from a future version of
  CMake.

  Update the VERSION argument <min> value or use a ...<max> suffix to tell
  CMake that the project does not need compatibility with older versions.

-- The C compiler identification is GNU 11.4.0
-- The CXX compiler identification is GNU 11.4.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Build type: 
-- Found pg_config as /usr/bin/pg_config
-- Found postgres binary at /usr/lib/postgresql/14/bin/postgres
-- PostgreSQL version PostgreSQL 14.9 (Ubuntu 14.9-0ubuntu0.22.04.1) fou

# Gathering Movie Data
As we mentioned earlier, we will use the MovieLens 1M dataset. This dataset contains over 1 million anonymous ratings of approximately 3,900 movies made by 6,040 MovieLens users.

We use the following files:

    movies.dat: Contains movie information.
    movie_vectors.txt: Contains precomputed movie vectors (embeddings) that we will store in Postgres and search with Lantern.


In [None]:
# Download movie information
! wget -P movie_recommender https://paddlerec.bj.bcebos.com/aistudio/movies.dat --no-check-certificate
# Download movie vectors
! wget -P movie_recommender https://paddlerec.bj.bcebos.com/aistudio/movie_vectors.txt --no-check-certificate

--2023-10-24 23:17:19--  https://paddlerec.bj.bcebos.com/aistudio/movies.dat
Resolving paddlerec.bj.bcebos.com (paddlerec.bj.bcebos.com)... 103.235.46.61, 2409:8c04:1001:1002:0:ff:b001:368a
Connecting to paddlerec.bj.bcebos.com (paddlerec.bj.bcebos.com)|103.235.46.61|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 171308 (167K) [application/octet-stream]
Saving to: ‘movie_recommender/movies.dat’


2023-10-24 23:17:22 (166 KB/s) - ‘movie_recommender/movies.dat’ saved [171308/171308]

--2023-10-24 23:17:22--  https://paddlerec.bj.bcebos.com/aistudio/movie_vectors.txt
Resolving paddlerec.bj.bcebos.com (paddlerec.bj.bcebos.com)... 103.235.46.61, 2409:8c04:1001:1002:0:ff:b001:368a
Connecting to paddlerec.bj.bcebos.com (paddlerec.bj.bcebos.com)|103.235.46.61|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1095505 (1.0M) [text/plain]
Saving to: ‘movie_recommender/movie_vectors.txt’


2023-10-24 23:17:29 (186 KB/s) - ‘movie_recommender/movie_

# Create Postgres Table

Now that we have our movie data, let's connect to Postgres with `psycopg2` and enable the Lantern extension.


In [None]:
import psycopg2

# We use the dbname, user, and password that we specified above
conn = psycopg2.connect(
    dbname="ourdb",
    user="postgres",
    password="postgres",
    host="localhost",
    port="5432" # default port for Postgres
)

# Get a new cursor
cursor = conn.cursor()

# Execute the query to load the Lantern extension in
cursor.execute("CREATE EXTENSION IF NOT EXISTS lantern;")

conn.commit()
cursor.close()

We will make a table called `movies`, and it will have 4 columns: an id, the title of the film, a string denoting the genres of the film, and a vector that will be the embedding for the movie.

In [None]:
# Create the table
cursor = conn.cursor()

TABLE_NAME = "movies"

create_table_query = f"CREATE TABLE {TABLE_NAME} (id integer, title text, genres text, vector real[]);"

cursor.execute(create_table_query)

conn.commit()
cursor.close()

# Inserting Movie Data
Now that we have our table, let's insert our movie data into our database.

Let's first get our movie embeddings/vectors from `movie_vectors.txt` and set up a dictionary to easily get the embedding for a movie from its id.

Note that these embeddings have 32 dimensions.

In [None]:
import codecs

def get_vectors():
    with codecs.open("movie_recommender/movie_vectors.txt", "r", encoding='utf-8', errors='ignore') as f:
        lines = f.readlines()
    ids = [int(line.split(":")[0]) for line in lines]
    embeddings = []
    for line in lines:
        line = line.strip().split(":")[1][1:-1]
        str_nums = line.split(",")
        emb = [float(x) for x in str_nums]
        embeddings.append(emb)
    return ids, embeddings

ids, embeddings = get_vectors()

# make it easier to look up an embedding from a movie id
id_to_embedding = dict(zip(ids, embeddings))


print(f"Dimensionality: {len(embeddings[0])}")

Dimensionality: 32
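
To make the file format concrete, here is a minimal sketch of how a single line is parsed. Judging from the parsing code above, each line of `movie_vectors.txt` has the shape `id:[v1,v2,...]`. The sample line and the helper name `parse_vector_line` are hypothetical and only serve as illustration; the real file holds 32 values per movie.

```python
# Hypothetical sample line; real lines carry 32 comma-separated values.
sample_line = "42:[0.5,-1.25,3.0]\n"

def parse_vector_line(line):
    """Split 'id:[v1,v2,...]' into an integer id and a list of floats."""
    id_part, vector_part = line.strip().split(":", 1)
    values = vector_part.strip()[1:-1]  # drop the surrounding brackets
    return int(id_part), [float(x) for x in values.split(",")]

movie_id, embedding = parse_vector_line(sample_line)
print(movie_id, embedding)
```

The result is the same `(id, embedding)` pairing that `get_vectors()` builds for the whole file.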


Now we can process the data in `movies.dat` to get the metadata of each movie (id, title, and genre), and get the vector embedding from above. We'll insert each movie into our database:

In [None]:
cursor = conn.cursor()

def process_movie(lines):
    for line in lines:
        if len(line.strip()) == 0:
            continue
        tmp = line.strip().split("::")
        movie_id = int(tmp[0])
        title = tmp[1]
        genres = tmp[2]

        vector = id_to_embedding[movie_id]
        cursor.execute(f"INSERT INTO {TABLE_NAME} (id, title, genres, vector) VALUES (%s, %s, %s, %s);", (movie_id, title, genres, vector))


with codecs.open("movie_recommender/movies.dat", "r", encoding='utf-8', errors='ignore') as f:
    lines = f.readlines()
    process_movie(lines)


conn.commit()
cursor.close()
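
The parsing step can also be separated from the database writes, which makes it easy to check without a running Postgres instance. The sketch below is one possible variant, not the notebook's method: `build_rows`, `sample_lines`, and `sample_embeddings` are hypothetical names, and unlike the loop above it skips a movie that has no embedding instead of raising a `KeyError`.

```python
# Hypothetical stand-ins for the contents of movies.dat and id_to_embedding.
sample_lines = [
    "1::Toy Story (1995)::Animation|Children's|Comedy\n",
    "\n",
    "2::Jumanji (1995)::Adventure|Children's|Fantasy\n",
]
sample_embeddings = {1: [0.1, 0.2], 2: [0.3, 0.4]}

def build_rows(lines, id_to_embedding):
    """Turn 'id::title::genres' lines into (id, title, genres, vector) tuples."""
    rows = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines, as the original loop does
        id_str, title, genres = line.split("::", 2)
        movie_id = int(id_str)
        vector = id_to_embedding.get(movie_id)
        if vector is None:
            continue  # assumption: silently skip movies without an embedding
        rows.append((movie_id, title, genres, vector))
    return rows

rows = build_rows(sample_lines, sample_embeddings)
print(rows[0])
```

A list built this way could be handed to `cursor.executemany(...)` with the same `INSERT` statement and `%s` placeholders used above.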


# Creating an Index
Now that we have inserted the embeddings into our database, we need to construct an index in Postgres using Lantern. This is important because the index allows Postgres to use Lantern when performing vector search.

Note that we specify L2-squared distance as the distance metric. As a good practice, we also specify the dimension of the vectors (32, as mentioned above) in the index, although Lantern can infer it from the vectors we've already inserted.

In [None]:
cursor = conn.cursor()

cursor.execute(f"CREATE INDEX ON {TABLE_NAME} USING hnsw (vector dist_l2sq_ops) WITH (dim=32);")

conn.commit()
cursor.close()
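
For intuition about the metric, squared L2 distance is the sum of squared differences between corresponding components, that is, Euclidean distance without the final square root. The following is a plain-Python sketch of that formula, not Lantern's implementation; `l2_squared` is a name chosen here for illustration.

```python
def l2_squared(a, b):
    """Sum of squared component-wise differences between two vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return sum((x - y) ** 2 for x, y in zip(a, b))

# (1-1)^2 + (2-0)^2 + (3-0)^2 = 0 + 4 + 9
print(l2_squared([1.0, 2.0, 3.0], [1.0, 0.0, 0.0]))  # 13.0
```

Skipping the square root does not change which vectors are closest, because the square root is monotonic, so the ranking of neighbours matches that of ordinary Euclidean distance.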

# Getting Recommendations With Vector Search!

Let's pick a movie below and assume that a user has really liked this movie. Let's find movies that are similar to this movie so we can recommend movies that they'll also like.

In [None]:
query_movie_id = ids[69]
query_vector = str(id_to_embedding[query_movie_id])

cursor = conn.cursor()

cursor.execute(f"SELECT * FROM {TABLE_NAME} WHERE id = %s;", (query_movie_id,))

results = cursor.fetchall()
#print(results[0])
query_movie_title = results[0][1]
query_movie_genre = results[0][2]


print(f"The user really liked the movie: {query_movie_title}, genre: {query_movie_genre}")
print("Let's find similar movies they'll also like...")


conn.commit()
cursor.close()

The user really liked the movie: From Dusk Till Dawn (1996), genre: Action|Comedy|Crime|Horror|Thriller
Let's find similar movies they'll also like...


To find similar movies, we will perform a vector search using Lantern. We pull the 10 most similar movies (movies whose embeddings are closest to the embedding of our query movie) and exclude the query movie itself. We also turn off sequential scans for the session so that the planner uses the index we just built.
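
The search query splices `query_vector`, the string form of a Python list, directly after the `ARRAY` keyword. A small sketch of what that string looks like, using a hypothetical three-value vector in place of the real 32-dimensional one:

```python
# Hypothetical short vector; the real query vector has 32 components.
example_vector = [0.25, -1.5, 3.0]

query_vector = str(example_vector)       # "[0.25, -1.5, 3.0]"
array_literal = f"ARRAY{query_vector}"   # a Postgres array constructor
print(array_literal)  # ARRAY[0.25, -1.5, 3.0]
```

This works because Python's list formatting happens to match Postgres's `ARRAY[...]` syntax for plain numbers.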

In [None]:
cursor = conn.cursor()

cursor.execute("SET enable_seqscan = false;")
cursor.execute(f"SELECT id, title, genres FROM {TABLE_NAME} WHERE id != {query_movie_id} ORDER BY vector <-> ARRAY{query_vector} LIMIT 10;")

results = cursor.fetchall()

print(f"Recommendations if you liked '{query_movie_title}':\n")
for i, r in enumerate(results):
    print(f"#{i+1}. {r[1]}, Genre: {r[2]}")

conn.commit()
cursor.close()

Recommendations if you liked 'From Dusk Till Dawn (1996)':

#1. Strange Days (1995), Genre: Action|Crime|Sci-Fi
#2. Westworld (1973), Genre: Action|Sci-Fi|Thriller|Western
#3. Tank Girl (1995), Genre: Action|Comedy|Musical|Sci-Fi
#4. Thirteenth Floor, The (1999), Genre: Drama|Sci-Fi|Thriller
#5. Alien Nation (1988), Genre: Crime|Drama|Sci-Fi
#6. Puppet Master (1989), Genre: Horror|Sci-Fi|Thriller
#7. Village of the Damned (1960), Genre: Horror|Sci-Fi|Thriller
#8. Dog Day Afternoon (1975), Genre: Comedy|Crime|Drama
#9. Young Guns (1988), Genre: Action|Comedy|Western
#10. Sneakers (1992), Genre: Crime|Drama|Sci-Fi
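
To see what the `ORDER BY ... LIMIT 10` query computes conceptually, the same ranking can be sketched as an exact brute-force search in plain Python. This is an illustrative check under stated assumptions rather than how Lantern works internally: `nearest_movies` and the toy data are hypothetical, and because HNSW is an approximate index, its results may occasionally differ from an exact ranking.

```python
import heapq

def l2_squared(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest_movies(query_id, id_to_embedding, k=10):
    """Return the ids of the k movies closest to query_id, excluding itself."""
    query = id_to_embedding[query_id]
    candidates = (
        (l2_squared(query, vec), movie_id)
        for movie_id, vec in id_to_embedding.items()
        if movie_id != query_id
    )
    return [movie_id for _, movie_id in heapq.nsmallest(k, candidates)]

# Toy 2-dimensional embeddings standing in for the real 32-dimensional ones.
toy = {1: [0.0, 0.0], 2: [1.0, 0.0], 3: [0.0, 3.0], 4: [5.0, 5.0]}
print(nearest_movies(1, toy, k=2))  # [2, 3]
```

Running `nearest_movies(query_movie_id, id_to_embedding)` on the real embeddings would give a list to compare against the ids returned by the index.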


# Conclusion
As we can see, the search returns movies similar to the original that we can now recommend to the user. All ten recommendations share at least one genre with the original movie.

And that's how you can implement a movie recommendation system using Lantern.




### Cleanup

In [None]:
# Close the postgres connection
conn.close()