<a href="https://colab.research.google.com/github/luisgdelafuente/gnai/blob/main/embeddings_tests.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install openai
!pip install transformers

In [46]:
import os
import openai
import pandas as pd
import numpy as np
import pickle
from transformers import GPT2TokenizerFast
from typing import Dict, Tuple, List

# Read the API key from the environment instead of hardcoding it in the notebook.
openai.api_key = os.environ["OPENAI_API_KEY"]
COMPLETIONS_MODEL = "text-davinci-003"

# Making the model 'hallucinate'

In [6]:
# Let's make the model hallucinate

prompt = "Who won the 2020 Summer Olympics men's high jump?"

openai.Completion.create(
    prompt=prompt,
    temperature=0,
    max_tokens=300,
    top_p=1,
    frequency_penalty=0,
    presence_penalty=0,
    model=COMPLETIONS_MODEL
)["choices"][0]["text"].strip(" \n")

# Marcelo Chierighini is indeed a Brazilian Olympian, although he is a swimmer and came 8th in the 2020 Olympics...
# This is what is referred to as the model "hallucinating" an answer instead of just saying "I don't know", as a good AI should.

"Marcelo Chierighini of Brazil won the gold medal in the men's high jump at the 2020 Summer Olympics."

In [7]:
# We can address the hallucination problem with prompt engineering,
# that is, by being more explicit in the instructions we provide.

prompt = """Answer the question as truthfully as possible, and if you're unsure of the answer, say "Sorry, I don't know".

Q: Who won the 2020 Summer Olympics men's high jump?
A:"""

openai.Completion.create(
    prompt=prompt,
    temperature=0,
    max_tokens=300,
    top_p=1,
    frequency_penalty=0,
    presence_penalty=0,
    model=COMPLETIONS_MODEL
)["choices"][0]["text"].strip(" \n")

"Sorry, I don't know."

# Step 0: Preventing hallucinations by giving hints in the prompt

In [9]:
prompt = """Answer the question as truthfully as possible using the provided text, and if the answer is not contained within the text below, say "I don't know"

Context:
The men's high jump event at the 2020 Summer Olympics took place between 30 July and 1 August 2021 at the Olympic Stadium.
33 athletes from 24 nations competed; the total possible number depended on how many nations would use universality places 
to enter athletes in addition to the 32 qualifying through mark or ranking (no universality places were used in 2021).
Italian athlete Gianmarco Tamberi along with Qatari athlete Mutaz Essa Barshim emerged as joint winners of the event following
a tie between both of them as they cleared 2.37m. Both Tamberi and Barshim agreed to share the gold medal in a rare instance
where the athletes of different nations had agreed to share the same medal in the history of Olympics. 
Barshim in particular was heard to ask a competition official "Can we have two golds?" in response to being offered a 
'jump off'. Maksim Nedasekau of Belarus took bronze. The medals were the first ever in the men's high jump for Italy and 
Belarus, the first gold in the men's high jump for Italy and Qatar, and the third consecutive medal in the men's high jump
for Qatar (all by Barshim). Barshim became only the second man to earn three medals in high jump, joining Patrik Sjöberg
of Sweden (1984 to 1992).

Q: Who won the 2020 Summer Olympics men's high jump?
A:"""

openai.Completion.create(
    prompt=prompt,
    temperature=0,
    max_tokens=300,
    top_p=1,
    frequency_penalty=0,
    presence_penalty=0,
    model=COMPLETIONS_MODEL
)["choices"][0]["text"].strip(" \n")

# Here we got the model to give a correct answer, but at what cost? This approach is not practical because it does not scale:
# we cannot send the entire context every time we make a query.
# This is where embeddings come in.

'Gianmarco Tamberi and Mutaz Essa Barshim emerged as joint winners of the event.'
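
The prompt used in this step can also be assembled programmatically. A minimal sketch, assuming a hypothetical helper `build_prompt` (it is not part of the OpenAI library) that wraps any context and question in the same instruction header:

```python
# Instruction header copied from the prompt in this step.
HEADER = (
    "Answer the question as truthfully as possible using the provided text, "
    'and if the answer is not contained within the text below, say "I don\'t know"'
)


def build_prompt(question: str, context: str) -> str:
    """Hypothetical helper: join the instruction header, the context and the question."""
    return f"{HEADER}\n\nContext:\n{context.strip()}\n\nQ: {question.strip()}\nA:"


prompt = build_prompt(
    "Who won the 2020 Summer Olympics men's high jump?",
    "Gianmarco Tamberi and Mutaz Essa Barshim emerged as joint winners of the event.",
)
print(prompt)
```

The resulting string could be passed as `prompt` to `openai.Completion.create` exactly as in the cell above.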

# Using Embeddings

What do we do when we have a huge dataset of information that needs to be parsed? This is where the Embeddings API comes in.

Let's see how to supply contextual information so that the API gives us correct answers, following these steps:
- First, retrieve the information relevant to the question.
- Then, use the Completions API to answer correctly.

More specifically, the steps are:
- Pre-process the contextual information by splitting it into smaller chunks and creating an embedding vector for each one.
- When a query arrives, embed it in the same vector space as the chunks and find the context most relevant to it.
- Add the relevant context to the query prompt (as we did above).
- Send the prompt to GPT-3 and receive an answer that does take the contextual information into account.
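
The steps in this list can be sketched end to end without calling any API. This is only an illustration under stated assumptions: the three-dimensional vectors and the two sections in `TOY_SECTIONS` are invented stand-ins for real embeddings, and `most_relevant_section` is a hypothetical helper.

```python
from typing import Dict, List, Tuple

# Toy 3-dimensional "embeddings", invented for illustration.
# Real vectors would come from the Embeddings API.
TOY_SECTIONS: Dict[str, Tuple[List[float], str]] = {
    "high jump": ([0.9, 0.1, 0.0], "Tamberi and Barshim shared the gold medal in the men's high jump."),
    "host city": ([0.0, 0.2, 0.9], "The 2020 Summer Olympics were held in Tokyo."),
}


def dot(x: List[float], y: List[float]) -> float:
    """Dot product, used here as the similarity score."""
    return sum(a * b for a, b in zip(x, y))


def most_relevant_section(query_vec: List[float]) -> str:
    """Return the text of the section whose vector is most similar to the query."""
    best_vector, best_text = max(TOY_SECTIONS.values(), key=lambda item: dot(query_vec, item[0]))
    return best_text


question = "Who won the men's high jump?"
query_vec = [0.8, 0.2, 0.1]  # pretend this is the embedding of the question

context = most_relevant_section(query_vec)               # find the relevant context
prompt = f"Context:\n{context}\n\nQ: {question}\nA:"      # add it to the prompt
print(prompt)                                             # this is what would be sent to the model
```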

## Step 1: Processing the contextual information

We are going to process a document containing the data corpus and split it so that each section has enough information to answer a question, yet is small enough to be attached to one or more prompts.

In general, a sentence or a short paragraph works well, but the size may need to be adapted to each dataset.
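
One possible way to do this splitting is to pack whole sentences greedily into chunks of bounded size. A minimal sketch, assuming a hypothetical helper `split_into_chunks` and using the word count as a rough stand-in for the token count (a real tokenizer would give exact figures):

```python
import re
from typing import List


def split_into_chunks(text: str, max_words: int = 40) -> List[str]:
    """Greedily pack whole sentences into chunks of at most max_words words.

    A sentence longer than max_words is kept whole in its own chunk.
    """
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks: List[str] = []
    current: List[str] = []
    current_len = 0
    for sentence in sentences:
        n_words = len(sentence.split())
        # Close the current chunk if adding this sentence would exceed the budget.
        if current and current_len + n_words > max_words:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(sentence)
        current_len += n_words
    if current:
        chunks.append(" ".join(current))
    return chunks


text = (
    "The men's high jump took place at the Olympic Stadium. "
    "Tamberi and Barshim shared the gold medal. "
    "Maksim Nedasekau of Belarus took bronze."
)
for chunk in split_into_chunks(text, max_words=12):
    print(len(chunk.split()), chunk)
```

Raising `max_words` produces fewer, larger chunks; the right value depends on the dataset and on how much context fits in a prompt.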

In [47]:
# This Olympics dataset is a sample, to be customized for each client.
# Note that indexing this information involves a lot of friction: each client has large amounts of unstructured and duplicated data.

df = pd.read_csv('https://cdn.openai.com/API/examples/data/olympics_sections_text.csv')
df = df.set_index(["title", "heading"])
print(f"{len(df)} rows in the data.")


3964 rows in the data.


In [16]:
# Let's get some information about this dataframe:
print(df.shape)
df.info()
df.head()

(3964, 2)
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 3964 entries, ('2020 Summer Olympics', 'Summary') to ('Haiti at the 2020 Summer Olympics', 'Taekwondo')
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   content  3964 non-null   object
 1   tokens   3964 non-null   int64 
dtypes: int64(1), object(1)
memory usage: 120.5+ KB


| title | heading | content | tokens |
|---|---|---|---|
| 2020 Summer Olympics | Summary | The 2020 Summer Olympics (Japanese: 2020年夏季オリン... | 726 |
| 2020 Summer Olympics | Host city selection | The International Olympic Committee (IOC) vote... | 126 |
| 2020 Summer Olympics | Impact of the COVID-19 pandemic | In January 2020, concerns were raised about th... | 374 |
| 2020 Summer Olympics | Qualifying event cancellation and postponement | Concerns about the pandemic began to affect qu... | 298 |
| 2020 Summer Olympics | Effect on doping tests | Mandatory doping tests were being severely res... | 163 |


In [48]:
# Now we set up the model used to vectorize the document, for example the Curie model:

MODEL_NAME = "curie"

DOC_EMBEDDINGS_MODEL = f"text-search-{MODEL_NAME}-doc-001"
QUERY_EMBEDDINGS_MODEL = f"text-search-{MODEL_NAME}-query-001"

In [49]:
# We create the functions that compute the embeddings

#  Generates an embedding vector for a given text using the specified model
def get_embedding(text: str, model: str) -> List[float]:
    result = openai.Embedding.create(
      model=model,
      input=text)
    return result["data"][0]["embedding"]

# Computes the embedding vector for a document text using a specific model for document embeddings.
def get_doc_embedding(text: str) -> List[float]:
    return get_embedding(text, DOC_EMBEDDINGS_MODEL)

# Calculates the embedding vector for a query text using a designated model for query embeddings.
def get_query_embedding(text: str) -> List[float]:
    return get_embedding(text, QUERY_EMBEDDINGS_MODEL)

# Creates an embedding for each row in a DataFrame df by calling the OpenAI Embeddings API
# and returns a dictionary mapping each row's index to its embedding vector.
def compute_doc_embeddings(df: pd.DataFrame) -> Dict[Tuple[str, str], List[float]]:
    """
    Create an embedding for each row in the dataframe using the OpenAI Embeddings API.
    
    Return a dictionary that maps the (title, heading) index of each row to its embedding vector.
    """
    return {
        idx: get_doc_embedding(r.content.replace("\n", " ")) for idx, r in df.iterrows()
    }



In [50]:
# Now we create a load_embeddings function that reads the document embeddings and their keys from a CSV

def load_embeddings(fname: str) -> Dict[Tuple[str, str], List[float]]:
    """
    Read the document embeddings and their keys from a CSV.
    
    fname is the path to a CSV with exactly these named columns: 
        "title", "heading", "0", "1", ... up to the length of the embedding vectors.
    """
    
    df = pd.read_csv(fname, header=0)
    max_dim = max([int(c) for c in df.columns if c != "title" and c != "heading"])
    return {
           (r.title, r.heading): [r[str(i)] for i in range(max_dim + 1)] for _, r in df.iterrows()
    }
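
For completeness, embeddings could be written out in the same column layout that `load_embeddings` expects. A minimal sketch, assuming a hypothetical counterpart `save_embeddings` (it does not appear in the original notebook), a non-empty dictionary, and a toy three-dimensional vector invented for the demo:

```python
import csv
import io
from typing import Dict, List, TextIO, Tuple


def save_embeddings(embeddings: Dict[Tuple[str, str], List[float]], fileobj: TextIO) -> None:
    """Hypothetical helper: write columns "title", "heading", "0", "1", ... to a CSV.

    Assumes the dictionary is non-empty and all vectors have the same length.
    """
    dim = len(next(iter(embeddings.values())))
    writer = csv.writer(fileobj)
    writer.writerow(["title", "heading"] + [str(i) for i in range(dim)])
    for (title, heading), vector in embeddings.items():
        writer.writerow([title, heading] + list(vector))


# Demo with a toy vector; real vectors have far more dimensions.
buffer = io.StringIO()
save_embeddings({("2020 Summer Olympics", "Summary"): [0.25, -0.5, 0.125]}, buffer)
print(buffer.getvalue())
```

Writing to a real file would use `open(path, "w", newline="")` in place of the in-memory buffer.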

In [None]:
# Now we load the document embeddings. OpenAI publishes a precomputed embeddings file for this dataset;
# the commented-out line further down recalculates the embeddings from scratch instead.

document_embeddings = load_embeddings("https://cdn.openai.com/API/examples/data/olympics_sections_document_embeddings.csv")

# ===== OR, uncomment the line below to recalculate the embeddings from scratch. ========

# document_embeddings = compute_doc_embeddings(df)

# Note: recalculating failed here with API ERROR 500, 'The server had an error while processing your request'.
# Cause: an issue on OpenAI's servers. Solution: retry the request after a brief wait, contact OpenAI if the issue persists, and read the status page.
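
Since the suggested remedy for the 500 error is to retry after a brief wait, calls can be wrapped in a retry loop with exponential backoff. A minimal sketch, assuming a hypothetical wrapper `with_retries`; it retries on any exception for simplicity, whereas real code would catch only the errors known to be transient. The `flaky` function simulates a failing call so the demo runs without the API.

```python
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


def with_retries(
    fn: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, waiting base_delay * 2**attempt seconds between failed attempts."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)
    raise ValueError("max_attempts must be at least 1")


# Demo: a fake call that fails twice before succeeding.
calls = {"n": 0}


def flaky() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("simulated 500 error")
    return "ok"


delays: List[float] = []
result = with_retries(flaky, sleep=delays.append)  # record the waits instead of sleeping
print(result, delays)
```

In this notebook the wrapper could be applied as `with_retries(lambda: get_doc_embedding(text))`.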

In [41]:
# Let's look at a concrete example of an embedding

example_entry = list(document_embeddings.items())[0]
print(f"{example_entry[0]} : {example_entry[1][:5]}... ({len(example_entry[1])} entries)")

('2020 Summer Olympics', 'Summary') : [0.0037565305829048, -0.0061981128528714, -0.0087078781798481, -0.0071364338509738, -0.0025227521546185]... (1536 entries)


The document is now divided into sections, and each of these sections has been vectorized.

The next step is to use these embeddings to answer the user's questions!

## Step 2: Finding the document embeddings most relevant to the query

To answer the user's question, we first vectorize the question and then retrieve the relevant vectors from the document.

In this particular case the vectors are stored locally, but for larger datasets we should use a vector search engine such as Pinecone or Weaviate.

In [43]:
# First, let's define the similarity between vectors:

def vector_similarity(x: List[float], y: List[float]) -> float:
    """
    We could use cosine similarity or dot product to calculate the similarity between vectors.
    In practice, we have found it makes little difference. 
    """
    return np.dot(np.array(x), np.array(y))
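
One case in which the two measures coincide exactly is when both vectors have unit length, because cosine similarity is the dot product divided by the product of the norms. Whether a given embedding model returns unit-length vectors is an assumption worth checking. A small sketch in plain Python, with toy vectors invented for illustration:

```python
import math
from typing import List


def dot(x: List[float], y: List[float]) -> float:
    return sum(a * b for a, b in zip(x, y))


def norm(x: List[float]) -> float:
    return math.sqrt(dot(x, x))


def cosine_similarity(x: List[float], y: List[float]) -> float:
    return dot(x, y) / (norm(x) * norm(y))


def normalize(x: List[float]) -> List[float]:
    """Scale a vector to unit length."""
    n = norm(x)
    return [a / n for a in x]


a = [3.0, 4.0]
b = [4.0, 3.0]

# For unnormalized vectors the two scores differ...
print(dot(a, b), cosine_similarity(a, b))

# ...but after normalization the dot product equals the cosine similarity.
same = math.isclose(dot(normalize(a), normalize(b)), cosine_similarity(a, b))
print(same)
```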

In [44]:
# Now we compare the query vector with the document vectors to find the most relevant document sections:

def order_document_sections_by_query_similarity(query: str, contexts: Dict[Tuple[str, str], List[float]]) -> List[Tuple[float, Tuple[str, str]]]:
    """
    Find the query embedding for the supplied query, and compare it against all of the pre-calculated document embeddings
    to find the most relevant sections. 
    
    Return the list of document sections, sorted by relevance in descending order.
    """
    query_embedding = get_query_embedding(query)
    
    document_similarities = sorted([
        (vector_similarity(query_embedding, doc_embedding), doc_index) for doc_index, doc_embedding in contexts.items()
    ], reverse=True)
    
    return document_similarities
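
When only the few best matches are needed, sorting every section is not required. A possible alternative, sketched here with a hypothetical helper `top_k_sections` and toy two-dimensional vectors invented for illustration, uses `heapq.nlargest` from the standard library:

```python
import heapq
from typing import Dict, List, Tuple


def dot(x: List[float], y: List[float]) -> float:
    return sum(a * b for a, b in zip(x, y))


def top_k_sections(
    query_vec: List[float],
    contexts: Dict[Tuple[str, str], List[float]],
    k: int = 2,
) -> List[Tuple[float, Tuple[str, str]]]:
    """Return the k (similarity, key) pairs with the highest similarity, best first."""
    scored = ((dot(query_vec, vec), key) for key, vec in contexts.items())
    return heapq.nlargest(k, scored)


toy_contexts = {
    ("A", "x"): [1.0, 0.0],
    ("B", "y"): [0.0, 1.0],
    ("C", "z"): [0.5, 0.5],
}
top = top_k_sections([1.0, 0.2], toy_contexts, k=2)
for score, key in top:
    print(key, round(score, 3))
```

For a few thousand sections the difference from a full sort is negligible; the helper mainly makes the "top k" intent explicit.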


In [45]:
# For example, to find the sections most relevant to a query:

order_document_sections_by_query_similarity("Who won the men's high jump?", document_embeddings)[:5]


ValueError: ignored
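
The traceback above is truncated, so the cause of this `ValueError` is not certain. One plausible explanation, offered as an assumption, is a dimension mismatch: the stored document vectors have 1536 entries (see the example entry printed earlier), and `np.dot` raises a `ValueError` when its two one-dimensional arguments differ in length. A minimal sketch of a guard that makes such a mismatch explicit; `check_dimensions` is a hypothetical helper and the toy vectors are invented for the demo:

```python
from typing import Dict, List, Tuple


def check_dimensions(
    query_embedding: List[float],
    contexts: Dict[Tuple[str, str], List[float]],
) -> None:
    """Hypothetical guard: raise a descriptive error if any document vector's
    length differs from the length of the query embedding."""
    expected = len(query_embedding)
    for key, vector in contexts.items():
        if len(vector) != expected:
            raise ValueError(
                f"Section {key!r} has {len(vector)} dimensions, "
                f"but the query embedding has {expected}."
            )


# Toy data: a 4-dimensional query against a 3-dimensional document vector.
toy_contexts = {("2020 Summer Olympics", "Summary"): [0.1, 0.2, 0.3]}
try:
    check_dimensions([0.1, 0.2, 0.3, 0.4], toy_contexts)
    message = "dimensions match"
except ValueError as err:
    message = str(err)
print(message)
```

If the check fails on the real data, embedding the query and the documents with models from the same family would be the first thing to verify.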