# Example: Manual Extract, Transform, Load (ETL) into PGVector Using All the Classes

In this notebook, we retrieve data on the movies released this week by extracting them from [sens-critique](https://www.senscritique.com/). We then transform the data and embed the reviews with the [intfloat/multilingual-e5-large](https://huggingface.co/intfloat/multilingual-e5-large/tree/main) model served by [Text Embeddings Inference](https://github.com/huggingface/text-embeddings-inference) (TEI). Finally, we load the data into a [PGVector database](https://github.com/pgvector/pgvector).

The classes used in this notebook come from this project's `etl` package, available [here](https://github.com/ilanaliouchouche/senscritique-weeklyreal-database/tree/main/etl).

<p align="left">
  <img src="res/sc.jpg" width="200">
</p>

## Setting up PGVector and TEI Containers

To set up the PGVector and TEI containers, follow the steps below:

<img src="res/hf.png" width="175"><img src="res/pg.png" width="175">

1. Run the following command to start the PGVector container:

In [13]:
!docker run --name db_sc -p 5432:5432 -e POSTGRES_PASSWORD=dw2 -v data:/var/lib/postgresql/data  -d ankane/pgvector 

0ac389bf5eeba3cce51ede1313b6fd1c1f38720de72bd5d72ac2cd52501d1e0b


2. Run the following command to start the TEI container:

In [14]:
!docker run --name tei_sc -p 8088:80 -v llm_data:/data --pull always -d ghcr.io/huggingface/text-embeddings-inference:cpu-0.6 --model-id intfloat/multilingual-e5-large --revision refs/pr/5 

cpu-0.6: Pulling from huggingface/text-embeddings-inference
Digest: sha256:220f70681c3f84dbd1c264200b888d0bd370be47c76ad78dbbfda2a9ce299c6f
Status: Image is up to date for ghcr.io/huggingface/text-embeddings-inference:cpu-0.6
e5185503f10a34cea06ff8f7ea2b7941b78e3b8677b73cf9710e93c7cc03dd40
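
Both containers start in the background (`-d`), so the next cells can run before the services accept connections. The sketch below is one possible way to wait for the two host ports published above (5432 and 8088). The helper names `wait_for_port` and `check_services` are hypothetical and not part of the project.

```python
import socket
import time

# Host ports published by the two `docker run` commands above.
SERVICES = {"PGVector": 5432, "TEI": 8088}


def wait_for_port(host: str, port: int, timeout: float = 60.0,
                  interval: float = 1.0) -> bool:
    """Return True once a TCP connection succeeds, False after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect only proves the port is open, not that the
            # service behind it has finished initialising.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)


def check_services(host: str = "localhost", timeout: float = 60.0) -> dict:
    """Map each service name to whether its port became reachable in time."""
    return {name: wait_for_port(host, port, timeout)
            for name, port in SERVICES.items()}
```

Calling `check_services()` returns a dictionary such as `{"PGVector": True, "TEI": True}` when both ports are reachable. An open port is only a first signal: on its first start the TEI container still has to download the model weights before it can serve requests.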


## Necessary Imports

In [19]:
from etl.extract import CurrentMovieExtractor
from etl.transform import FilmTransformer
from etl.load import FilmLoader
import os
import pandas as pd
from setup_vcb import SetupPGVector
import warnings
warnings.filterwarnings('ignore')

## Environment Variables Definition
To set up the environment variables, use the following code:

In [8]:
# PGVector environment variables
os.environ['PG_HOSTNAME'] = '0.0.0.0'
os.environ['PG_USERNAME'] = 'postgres'
os.environ['PG_HPORT'] = '5432'
os.environ['PG_CPORT'] = '5432'
os.environ['PG_DBNAME'] = 'postgres'
os.environ['PG_DBPASSWORD'] = 'dw2'
os.environ['PG_DATA'] = 'data'

# TEI environment variables
os.environ['TEI_HOSTNAME'] = '0.0.0.0'
os.environ['TEI_DATA'] = 'llm_data'
os.environ['TEI_HPORT'] = '8088'
os.environ['TEI_CPORT'] = '80'
os.environ['TEI_MODEL'] = 'intfloat/multilingual-e5-large'
os.environ['TEI_REVISION'] = 'refs/pr/5'


In [9]:
# Check PGVector environment variables
pg_hostname = os.environ.get('PG_HOSTNAME')
pg_username = os.environ.get('PG_USERNAME')
pg_hport = os.environ.get('PG_HPORT')
pg_cport = os.environ.get('PG_CPORT')
pg_dbname = os.environ.get('PG_DBNAME')
pg_dbpassword = os.environ.get('PG_DBPASSWORD')
pg_data = os.environ.get('PG_DATA')

print(f"PGVector environment variables:")
print(f"PG_HOSTNAME: {pg_hostname}")
print(f"PG_USERNAME: {pg_username}")
print(f"PG_HPORT: {pg_hport}")
print(f"PG_CPORT: {pg_cport}")
print(f"PG_DBNAME: {pg_dbname}")
print(f"PG_DBPASSWORD: {pg_dbpassword}")
print(f"PG_DATA: {pg_data}")

# Check TEI environment variables
tei_hostname = os.environ.get('TEI_HOSTNAME')
tei_data = os.environ.get('TEI_DATA')
tei_hport = os.environ.get('TEI_HPORT')
tei_cport = os.environ.get('TEI_CPORT')
tei_model = os.environ.get('TEI_MODEL')
tei_revision = os.environ.get('TEI_REVISION')

print(f"\nTEI environment variables:")
print(f"TEI_HOSTNAME: {tei_hostname}")
print(f"TEI_DATA: {tei_data}")
print(f"TEI_HPORT: {tei_hport}")
print(f"TEI_CPORT: {tei_cport}")
print(f"TEI_MODEL: {tei_model}")
print(f"TEI_REVISION: {tei_revision}")


PGVector environment variables:
PG_HOSTNAME: 0.0.0.0
PG_USERNAME: postgres
PG_HPORT: 5432
PG_CPORT: 5432
PG_DBNAME: postgres
PG_DBPASSWORD: dw2
PG_DATA: data

TEI environment variables:
TEI_HOSTNAME: 0.0.0.0
TEI_DATA: llm_data
TEI_HPORT: 8088
TEI_CPORT: 80
TEI_MODEL: intfloat/multilingual-e5-large
TEI_REVISION: refs/pr/5
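
As an illustration of how these variables can be consumed, the sketch below assembles a libpq-style connection string and the TEI endpoint URL from them. The helpers `build_pg_dsn` and `build_tei_url` are hypothetical names, and the `/embed` route is the one exposed by TEI; the project's own classes may build these values differently.

```python
import os
from typing import Mapping


def build_pg_dsn(env: Mapping[str, str] = os.environ) -> str:
    """Assemble a libpq-style connection string from the PG_* variables."""
    # Indexing with [] raises KeyError on a missing variable, whereas
    # os.getenv would silently return None.
    return (
        f"dbname={env['PG_DBNAME']} user={env['PG_USERNAME']} "
        f"password={env['PG_DBPASSWORD']} host={env['PG_HOSTNAME']} "
        f"port={env['PG_HPORT']}"
    )


def build_tei_url(env: Mapping[str, str] = os.environ,
                  route: str = "/embed") -> str:
    """Build the URL of a TEI route from the host name and the host port."""
    return f"http://{env['TEI_HOSTNAME']}:{env['TEI_HPORT']}{route}"
```

With the values defined above, `build_pg_dsn()` yields `dbname=postgres user=postgres password=dw2 host=0.0.0.0 port=5432`. Note that `0.0.0.0` is a bind address and may not be connectable from a client on every platform; `localhost` is a common alternative.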


## Extraction of the films of the week with `CurrentMovieExtractor`

1. In this step, we use the `extract_all_film_links` method of the `CurrentMovieExtractor` class to collect the links to the films of the week.


In [4]:
extractor = CurrentMovieExtractor()
extractor.extract_all_film_links()

Extracting all films links...
Done with all films links


In [5]:
print(f"{len(extractor.urls_films)} film links extracted for this week")

22 film links extracted for this week


2. Next, we will retrieve the details of the films using the `extract_all_film_data` method.


In [6]:
extractor.extract_all_film_data()

Extracting all films informations...


100%|██████████| 22/22 [10:25<00:00, 28.43s/it]


{'Shin Godzilla': {'Titre original': 'Shin Gojira', 'Aussi connu sous le nom de': 'シン・ゴジラ, Shin Godzilla, Godzilla : Resurgence', 'Godzilla': 'Resurgence', 'Genres': 'Action, Science-fiction', 'Groupe': 'Godzilla', 'Année': '2016', "Pays d'origine": 'Japon', 'Durée': '2 h', 'Date de sortie (Japon)': '29 juillet 2016', 'Date de sortie (France)': '11 janvier 2024', 'Réalisateurs': 'Hideaki Anno, Shinji Higuchi', 'Scénaristes': 'Hideaki Anno, Sean Whitley', 'Producteurs': 'Yoshihiro Satô, Taichi Ueda, Akihiro Yamauchi, Minami Ichikawa, Kazutoshi Wadakura, Masaya Shibusawa, Kensei Mori', 'Distributeur': 'Filmo', 'Budget': '15 000 000 $', 'Bande originale': 'Shin Godzilla Music Collection', 'url': 'https://www.senscritique.com//film/shin_godzilla/14101439', 'rate': 7.1, 'image': 'https://media.senscritique.com/media/000021838165/300/shin_godzilla.png', 'reviews': {'Positives': ['https://www.senscritique.com/film/shin_godzilla/critique/65450023', 'https://www.senscritique.com/film/shin_godzi

100%|██████████| 22/22 [01:10<00:00,  3.21s/it]

Done with all reviews





The longest step is now behind us; it normally takes about 15 minutes.
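
The `url` value in the output above contains a doubled slash after the domain (`https://www.senscritique.com//film/...`). As a minimal sketch, assuming film pages are addressed by a site-relative path, `urllib.parse.urljoin` builds the absolute URL without doubling the slash. `BASE_URL` and `absolute_film_url` are hypothetical names, not part of `CurrentMovieExtractor`.

```python
from urllib.parse import urljoin

# Assumed base URL of the site, taken from the links shown in this notebook.
BASE_URL = "https://www.senscritique.com/"


def absolute_film_url(path: str, base: str = BASE_URL) -> str:
    """Join a site-relative path to the base URL without doubling the slash."""
    return urljoin(base, path)
```

For example, `absolute_film_url("/film/shin_godzilla/14101439")` gives `https://www.senscritique.com/film/shin_godzilla/14101439`.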

## Data Transformation with the FilmTransformer Class

To transform the extracted film data, we use the `FilmTransformer` class. It provides methods to clean and preprocess the data before loading it into the database: it transforms the numerical and categorical fields and embeds the reviews with TEI.
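
To give an idea of what transforming the numerical fields can involve, here is a minimal sketch that parses three raw values seen in the extraction output (`'2 h'`, `'15 000 000 $'`, `'11 janvier 2024'`). These helpers are hypothetical illustrations and not the actual `FilmTransformer` methods; the `'1 h 45 min'` duration format is an assumption.

```python
import re
from datetime import date
from typing import Optional

FRENCH_MONTHS = {
    "janvier": 1, "février": 2, "mars": 3, "avril": 4, "mai": 5, "juin": 6,
    "juillet": 7, "août": 8, "septembre": 9, "octobre": 10,
    "novembre": 11, "décembre": 12,
}


def parse_duration_minutes(text: str) -> Optional[int]:
    """Convert strings such as '2 h' or '1 h 45 min' to a number of minutes."""
    match = re.fullmatch(r"\s*(?:(\d+)\s*h)?\s*(?:(\d+)\s*min)?\s*", text)
    if not match or not any(match.groups()):
        return None
    hours, minutes = (int(g) if g else 0 for g in match.groups())
    return hours * 60 + minutes


def parse_budget(text: str) -> Optional[int]:
    """Convert '15 000 000 $' to 15000000 by keeping only the digits."""
    digits = re.sub(r"\D", "", text)
    return int(digits) if digits else None


def parse_french_date(text: str) -> Optional[date]:
    """Convert '11 janvier 2024' (or '1er mars 2024') to a datetime.date."""
    match = re.fullmatch(r"\s*(\d{1,2})(?:er)?\s+(\S+)\s+(\d{4})\s*", text.lower())
    if not match or match.group(2) not in FRENCH_MONTHS:
        return None
    day, month_name, year = match.groups()
    try:
        return date(int(year), FRENCH_MONTHS[month_name], int(day))
    except ValueError:  # impossible calendar date, e.g. '31 février 2024'
        return None
```

Each helper returns `None` instead of raising when the input does not match, which keeps a single malformed field from interrupting a whole batch.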

Here's an example of how to use the `FilmTransformer` class:

In [17]:
transformer = FilmTransformer(extractor)

Computing embeddings...


100%|██████████| 124/124 [01:57<00:00,  1.06it/s]

Done with embeddings
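
For reference, the sketch below shows one way to request embeddings from the TEI container directly, through its `/embed` route on the host port 8088 configured above. The function names are hypothetical, and the `passage: ` prefix follows the E5 model card's recommendation for documents; whether `FilmTransformer` applies this prefix is an assumption.

```python
import json
import urllib.request
from typing import List

TEI_EMBED_URL = "http://localhost:8088/embed"  # host port from the docker command


def build_embed_request(texts: List[str],
                        url: str = TEI_EMBED_URL) -> urllib.request.Request:
    """Prepare a POST request whose JSON body carries the texts to embed."""
    # E5 models are trained with 'query: ' / 'passage: ' prefixes; applying
    # 'passage: ' to stored reviews is an assumption about this pipeline.
    payload = json.dumps({"inputs": [f"passage: {t}" for t in texts]})
    return urllib.request.Request(
        url,
        data=payload.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def embed_texts(texts: List[str], url: str = TEI_EMBED_URL) -> List[List[float]]:
    """Send the request and return one embedding (list of floats) per text."""
    with urllib.request.urlopen(build_embed_request(texts, url), timeout=120) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With the container running, `embed_texts(["Premier impact"])` would return a list containing a single vector.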





We can now view the data as dataframes. For example, let's display the reviews:

In [20]:
transformer.df_reviews

|  | film | is_negative | title | likes | comments | content | url | embedding |
|---|---|---|---|---|---|---|---|---|
| 0 | Shin Godzilla | False | Le colosse s'érode | 98 | 49 | Regarder la série des Godzilla, c’est admirer ... | https://www.senscritique.com/film/shin_godzill... | [-0.01608497, -0.006072742, 0.0033162965, -0.0... |
| 1 | Shin Godzilla | False | The Legend of Godzilla - A Link to the Past | 63 | 37 | [AVANT-PROPOS/AVERTISSEMENT : Voir ce film le ... | https://www.senscritique.com/film/shin_godzill... | [-0.004714112, -0.019787442, -0.02954265, -0.0... |
| 2 | Shin Godzilla | False | Godzilla contre les politiciens | 25 | 8 | Le dernier Godzilla japonais produit par la To... | https://www.senscritique.com/film/shin_godzill... | [0.0077194446, -0.009460152, -0.0062741246, -0... |
| 3 | Shin Godzilla | False | Premier impact | 18 | 2 | Ce qui frappe dès le début c'est la marque de ... | https://www.senscritique.com/film/shin_godzill... | [0.010712562, -0.044431668, -0.006084309, -0.0... |
| 4 | Shin Godzilla | False | Gojira's Rising | 15 | 4 | Après la version correcte de Gareth Edwards, v... | https://www.senscritique.com/film/shin_godzill... | [0.024141837, -0.03297161, -0.003320909, -0.06... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 119 | En plein vol | False | Un navet du vendredi soir qui s'est transformé... | 0 | 0 | J'arrive sur Netflix ce soir. | https://www.senscritique.com/film/en_plein_vol... | [-0.020022819, -0.015799869, -0.002310835, -0.... |
| 120 | En plein vol | False | Trop c'est trop | 0 | 0 | C'est très à la mode les équipes de voleurs sy... | https://www.senscritique.com/film/en_plein_vol... | [0.0148394285, -0.0019226844, -0.041255392, -0... |
| 121 | En plein vol | True | Navet | 1 | 0 | Encore du fric fout par la fenêtre pour un gro... | https://www.senscritique.com/film/en_plein_vol... | [0.013324497, 0.013040972, -0.016943585, -0.04... |
| 122 | En plein vol | True | Un navet du vendredi soir qui s'est transformé... | 0 | 0 | J'arrive sur Netflix ce soir. | https://www.senscritique.com/film/en_plein_vol... | [-0.020022819, -0.015799869, -0.002310835, -0.... |


Each row holds several pieces of information about a review, such as its number of likes, its content, and the embedding computed by the model.
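
A quick way to sanity-check the embeddings is to compare two of them with cosine similarity, the measure typically used with E5 vectors. The function below is a plain-Python sketch, not part of the project.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors of the same dimension."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)
```

In this notebook it could be applied as `cosine_similarity(transformer.df_reviews.loc[0, "embedding"], transformer.df_reviews.loc[1, "embedding"])`. Each vector produced by `intfloat/multilingual-e5-large` has 1024 dimensions.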

## `SetupPGVector` class to create the PGVector database schema

This class provides a method to set up the necessary tables and indexes for the ETL process.

Here's an example of how to use the `SetupPGVector` class:

In [None]:
# If the database schema does not exist yet, uncomment the following lines to create it

#    setup = SetupPGVector(dbname=os.getenv("PG_DBNAME"), user=os.getenv("PG_USERNAME"), password=os.getenv("PG_DBPASSWORD"), host=os.getenv("PG_HOSTNAME"), port=os.getenv("PG_HPORT"))
#    setup.setup_vdb()
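
To illustrate what such a setup step might create, here is a hypothetical schema for a `reviews` table whose columns mirror the dataframe shown earlier. Only the table name `reviews` and its `embedding` column appear in this notebook; the other column types, the `id` column and the index are assumptions, and the real `SetupPGVector` schema may differ.

```python
EMBEDDING_DIM = 1024  # output size of intfloat/multilingual-e5-large


def schema_statements(dim: int = EMBEDDING_DIM) -> list:
    """Return the SQL statements of an assumed, illustrative schema."""
    return [
        # Enables the `vector` column type provided by pgvector.
        "CREATE EXTENSION IF NOT EXISTS vector;",
        f"""CREATE TABLE IF NOT EXISTS reviews (
    id BIGSERIAL PRIMARY KEY,
    film TEXT NOT NULL,
    is_negative BOOLEAN NOT NULL,
    title TEXT,
    likes INTEGER,
    comments INTEGER,
    content TEXT,
    url TEXT,
    embedding vector({dim})
);""",
        # HNSW indexes require pgvector 0.5.0 or later.
        "CREATE INDEX IF NOT EXISTS reviews_embedding_idx "
        "ON reviews USING hnsw (embedding vector_cosine_ops);",
    ]


def create_schema(connection, dim: int = EMBEDDING_DIM) -> None:
    """Run the statements on an open DB-API connection (for example psycopg2)."""
    with connection.cursor() as cursor:
        for statement in schema_statements(dim):
            cursor.execute(statement)
    connection.commit()
```

The `IF NOT EXISTS` clauses make the function safe to call again on a database that is already set up.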

## Load data into the vector database with the `FilmLoader` class

This class connects to the vector database through the psycopg2 driver and inserts the documents with its `loading` method.

In [18]:
loader = FilmLoader(transformer, dbname=os.getenv("PG_DBNAME"), user=os.getenv("PG_USERNAME"), password=os.getenv("PG_DBPASSWORD"), host=os.getenv("PG_HOSTNAME"), port=os.getenv("PG_HPORT"))
loader.loading()

Connecting to database: dbname=postgres user=postgres password=dw2 host=0.0.0.0 port=5432
Connected to database
Loading data...
Done with loading
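
As a rough illustration of what loading a review involves, the sketch below inserts one row through a DB-API cursor (such as a psycopg2 cursor), passing the embedding in pgvector's text format and casting it with `::vector`. It assumes a `reviews` table whose columns mirror the dataframe; `insert_review` and `to_pgvector_literal` are hypothetical helpers, not `FilmLoader` methods.

```python
from typing import Mapping, Sequence

# Column list assumed from the dataframe columns shown earlier.
INSERT_REVIEW_SQL = (
    "INSERT INTO reviews "
    "(film, is_negative, title, likes, comments, content, url, embedding) "
    "VALUES (%s, %s, %s, %s, %s, %s, %s, %s::vector)"
)


def to_pgvector_literal(embedding: Sequence[float]) -> str:
    """Format floats as pgvector's text input, e.g. '[0.5,-1.0]'."""
    return "[" + ",".join(repr(float(x)) for x in embedding) + "]"


def insert_review(cursor, row: Mapping) -> None:
    """Insert one review; values are passed as parameters, never interpolated."""
    cursor.execute(
        INSERT_REVIEW_SQL,
        (
            row["film"], bool(row["is_negative"]), row["title"],
            int(row["likes"]), int(row["comments"]), row["content"],
            row["url"], to_pgvector_literal(row["embedding"]),
        ),
    )
```

Iterating over `transformer.df_reviews.to_dict("records")` and calling `insert_review` for each record, followed by a commit on the connection, would load the whole dataframe.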


## ETL Done

At this stage, the ETL has been run manually in a notebook, for those who prefer working this way. To inspect the data, open a psql session in the database container:

```bash 
docker exec -it db_sc psql -U postgres
```
Then run a query in the psql client:
```sql
SELECT embedding FROM reviews;
```
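
Beyond reading raw embeddings, the stored vectors can be searched by similarity. The sketch below, written against a DB-API cursor such as psycopg2's, orders reviews by pgvector's cosine distance operator `<=>`. The `film` and `title` columns are assumed from the dataframe shown earlier, and `nearest_reviews` is a hypothetical helper.

```python
from typing import List, Sequence, Tuple

NEAREST_REVIEWS_SQL = (
    "SELECT film, title, embedding <=> %s::vector AS cosine_distance "
    "FROM reviews ORDER BY cosine_distance LIMIT %s"
)


def nearest_reviews(cursor, query_embedding: Sequence[float],
                    limit: int = 5) -> List[Tuple]:
    """Return the `limit` reviews closest to `query_embedding`."""
    # pgvector accepts vectors in the text form '[x1,x2,...]'.
    literal = "[" + ",".join(repr(float(x)) for x in query_embedding) + "]"
    cursor.execute(NEAREST_REVIEWS_SQL, (literal, limit))
    return cursor.fetchall()
```

The query embedding has to come from the same model as the stored ones. For E5 models, the model card recommends prefixing search queries with `query: ` before embedding them.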