# Source Knowledge: prompt aumentado

Uma forma de incluir contexto e mais informações para o modelo LLM é por meio da técnica chamada de "source knowledge". Ela consiste em incluir informações relevantes para a pergunta no prompt da LLM.

In [1]:
llm_information = [
    "Australian Open 2024: Jannik Sinner, Aryna Sabalenka crowned as Grand Slam singles champions at Melbourne Park",
    "Sinner and Sabalenka took down Daniil Medvedev and Qinwen Zheng in their respective finals",
    "Sinner, Sabalenka win Australian Open singles titles",
    "Jannik Sinner came back from two sets down to beat Daniil Medvedev 3-6, 3-6, 6-4, 6-4, 6-3 in the Australian Open men's singles final, earning him his first ever Grand Slam title"
]

source_knowledge = "\n".join(llm_information)

# Embeddings

Um modelo de embedding funciona como um tradutor, convertendo palavras e frases em uma representação numérica que retém ao máximo o significado original. Imagine transformar uma passagem de livro em um conjunto de coordenadas no espaço – a distância entre os pontos transmite as relações entre as palavras.

Em vez de processar a linguagem pelo valor nominal, o embedding de texto permite que as máquinas analisem a semântica subjacente.

Para tal, será usado um modelo de embedding da OpenAI para a criação dos embedding e, em seguida, armazená-los na vector database. A criação do index no Pinecone deve ser criada seguindo a mesma configuração do modelo (incluir imagem disso no artigo do Medium).

In [2]:
%pip install -qU langchain-openai

Note: you may need to restart the kernel to use updated packages.


In [3]:
import getpass
import os

# Abre input para incluir a key da OpenAI
os.environ["OPENAI_API_KEY"] = getpass.getpass()

# Instanciação do modelo de embeddings da OpenAI (text-embedding-3-small)

In [4]:
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small", dimensions=512)

# Embeddings com inputs de teste

In [17]:
# texts = [
#     'this is the first chunk of text',
#     'then another second chunk of text is here'
# ]

data = [
    "Australian Open 2024: Jannik Sinner, Aryna Sabalenka crowned as Grand Slam singles champions at Melbourne Park"
    "Sinner and Sabalenka took down Daniil Medvedev and Qinwen Zheng in their respective finals",
    "Sinner, Sabalenka win Australian Open singles titles",
    "Jannik Sinner came back from two sets down to beat Daniil Medvedev 3-6, 3-6, 6-4, 6-4, 6-3 in the Australian Open men's singles final, earning him his first ever Grand Slam title",
    "Sinner was the champion in 2024"
]

data_embedded = embeddings.embed_documents(data)
len(data_embedded), len(data_embedded[0])

(4, 512)

In [6]:
%pip install psycopg2-binary

Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.


In [7]:
host = 'localhost'
database = 'vectordb'
user = 'testuser'
password = 'testpwd'

In [8]:
import psycopg2

def connect():
  conn = None
  try:
    conn = psycopg2.connect(
                   host=host,
                   database=database,
                   user=user,
                   password=password)
  except (Exception, psycopg2.DatabaseError) as error:
    print(error)
    return

  print('Conectado!')
  return conn

In [18]:
connection = connect()
cursor = connection.cursor()
try:
    for text, embedding in zip(data, data_embedded):
        cursor.execute(
            "INSERT INTO embeddings (embedding, text) VALUES (%s, %s)",
            (embedding, text)
        )
    connection.commit()
except (Exception, psycopg2.Error) as error:
    print("Error while writing to DB", error)
finally:
    if cursor:
        cursor.close()
    if connection:
        connection.close()

Conectado!


# Busca da resposta da query inicial apenas para a vector database

In [20]:
query = "Who was the Australian Open champion in 2024?"
embedding = embeddings.embed_documents(query)

In [21]:
embedding[0]

[0.04431451457507956,
 0.03508901947177311,
 0.014698833137424535,
 0.07476783498181075,
 0.000329353990118113,
 -0.009862331613295614,
 0.018255928419840778,
 -0.014446394429825793,
 -0.040321390829696546,
 -0.026437245147959462,
 0.014836527316595693,
 0.05778557771930317,
 -0.11235829374147285,
 -0.028892788859651585,
 0.0503501032199315,
 0.00022966926777225836,
 0.02377516203015574,
 -0.0014443525490652202,
 -0.04369489076880935,
 0.04463580223490902,
 -0.0721516530281393,
 0.011543346498472546,
 -0.068066730679782,
 -0.0003754312615274138,
 -0.006769954651243311,
 -0.03986240899069593,
 -0.01016640656940611,
 -0.016293788229296925,
 0.004876662015946318,
 0.019575495153370853,
 0.05746429005947371,
 -0.06131972335097584,
 0.04915675490887823,
 -0.048238794956167265,
 -0.04006895025945267,
 0.03240398670322582,
 -0.014366072514868428,
 0.031486026750514866,
 -0.016293788229296925,
 0.023270284614958257,
 0.010269676272461909,
 -0.023637467850984584,
 0.013849720274299156,
 0.00441

In [28]:
connection = connect()
cursor = connection.cursor()
try:
    cursor.execute(f"""
        SELECT text, cosine_distance(embedding, '{embedding[0]}') AS cosine_similarity
        FROM embeddings
        ORDER BY cosine_similarity desc
        LIMIT 3
    """)
    for r in cursor.fetchall():
        print(f"Text: {r[0]}; Similarity: {r[1]}")

except Exception as error:
    print("Error..", error)
finally:
    cursor.close()
    connection.close()

Conectado!
Text: Jannik Sinner came back from two sets down to beat Daniil Medvedev 3-6, 3-6, 6-4, 6-4, 6-3 in the Australian Open men's singles final, earning him his first ever Grand Slam title; Similarity: 0.9593651294708249
Text: Australian Open 2024: Jannik Sinner, Aryna Sabalenka crowned as Grand Slam singles champions at Melbourne ParkSinner and Sabalenka took down Daniil Medvedev and Qinwen Zheng in their respective finals; Similarity: 0.9417234393389894
Text: Sinner, Sabalenka win Australian Open singles titles; Similarity: 0.9277496231943592
