**Overview**

An embedding is the result of a technique or algorithm that converts words or text into N-dimensional vectors. These vectors capture some of the semantic information in the word or text. For example, words with very similar meanings end up with vectors that lie close to each other.
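The idea that similar words get nearby vectors can be illustrated with cosine similarity, the same metric the Pinecone index in this notebook is configured with. This is a minimal sketch: the three-dimensional vectors below are invented for illustration and do not come from any real model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy "embeddings"; real models use hundreds of dimensions.
toy_vectors = {
    "gato": [0.9, 0.8, 0.1],
    "perro": [0.8, 0.9, 0.2],
    "coche": [0.1, 0.2, 0.9],
}

sim_gato_perro = cosine_similarity(toy_vectors["gato"], toy_vectors["perro"])
sim_gato_coche = cosine_similarity(toy_vectors["gato"], toy_vectors["coche"])

# The two animals score higher with each other than with the car.
print(sim_gato_perro > sim_gato_coche)
```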

Several models can embed text. In this notebook we use the multilingual sentence-transformers model paraphrase-multilingual-MiniLM-L12-v2, loaded through HuggingFace, which produces 384-dimensional vectors. OpenAI's gpt-3.5-turbo is used only to write the final answer. You will embed the text of a PDF, store the vectors in Pinecone, and then search that text with natural-language questions.

**Install Python dependencies**

In [None]:
!pip install pypdf==4.3.1 \
             cohere \
             tiktoken \
             langchain==0.2.14 \
             sentence-transformers==3.0.1 \
             openai==1.42.0 \
             pinecone-client==5.0.1 \
             langchain-community==0.2.12 \
             langchain-pinecone==0.1.3 \
             langchain_openai==0.1.22 \
             -q


**Download the file**

In [None]:
import requests

# PDF file
archivos_url = ["https://www.insp.mx/images/stories/Centros/nucleo/docs/pme_19.pdf"]

for url in archivos_url:
    doc_to_download = requests.get(url)
    doc_to_download.raise_for_status()  # Stop early if the download failed
    # Save the file under its original name; `with` closes it when done
    with open(url.split("/")[-1], "wb") as pdf_file:
        pdf_file.write(doc_to_download.content)
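A download can succeed at the HTTP level and still return something other than a PDF, such as an HTML error page. One cheap guard is to check that the payload starts with the `%PDF-` header before saving it. This is a sketch: `looks_like_pdf` is a hypothetical helper name, and the sample byte strings are made up.

```python
def looks_like_pdf(data: bytes) -> bool:
    """Return True if the payload begins with the PDF header marker."""
    return data.startswith(b"%PDF-")


# Made-up payloads for illustration.
sample_pdf = b"%PDF-1.4\n(rest of the file)"
sample_html = b"<html><body>Not found</body></html>"

print(looks_like_pdf(sample_pdf))
print(looks_like_pdf(sample_html))
```

In the download loop it could be applied to `doc_to_download.content` before the file is written.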

**Pinecone**

In [None]:
from pinecone import Pinecone
import os

# Read the API key from the PINECONE_API_KEY environment variable instead of hard-coding it
pinecone = Pinecone(api_key=os.environ["PINECONE_API_KEY"])
INDEX_NAME = 'cancermama'

In [None]:
pinecone.list_indexes()

{'indexes': [{'deletion_protection': 'disabled',
              'dimension': 384,
              'host': 'cancermama-ehrtzgx.svc.aped-4627-b74a.pinecone.io',
              'metric': 'cosine',
              'name': 'cancermama',
              'spec': {'serverless': {'cloud': 'aws', 'region': 'us-east-1'}},
              'status': {'ready': True, 'state': 'Ready'}}]}

In [None]:
# Describe Index
index_description = pinecone.describe_index(INDEX_NAME)

In [None]:
index_description.name
index_description.dimension
index_description.metric

'cosine'

**Upsert vectors**

In [None]:
import numpy as np

In [None]:
vector_1 = np.random.uniform(-1, 1, size=384).tolist()
vector_2 = np.random.uniform(-1, 1, size=384).tolist()

In [None]:
index = pinecone.Index(INDEX_NAME)

In [None]:
upsert_response = index.upsert(
   vectors=[
       {'id': "vec1", "values":vector_1, "metadata": {'genre': 'cine'}},
       {'id': "vec2", "values":vector_2, "metadata": {'genre': 'teatro'}},
   ]
)

**Query vectors**

In [None]:
vector_pregunta = np.random.uniform(-1, 1, size=384).tolist()

In [None]:
# `vector` takes a flat list of floats, so the query vector is passed as-is
result = index.query(vector=vector_pregunta, top_k=1)
print(result)

{'matches': [{'id': '55b0c511-6ed7-421b-8247-23935d81b9f1',
              'score': 0.113281116,
              'values': []}],
 'namespace': '',
 'usage': {'read_units': 5}}
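Conceptually, a `top_k` query with the cosine metric ranks the stored vectors by their similarity to the query vector and returns the best ones. The sketch below does this by brute force over an in-memory list. It is not how Pinecone is implemented, and `brute_force_query` and the three-dimensional records are hypothetical stand-ins for the 384-dimensional vectors used above.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def brute_force_query(records, vector, top_k=1):
    """Score every record against the query and keep the top_k best matches."""
    scored = [
        {
            "id": record["id"],
            "score": cosine_similarity(record["values"], vector),
            "metadata": record["metadata"],
        }
        for record in records
    ]
    scored.sort(key=lambda match: match["score"], reverse=True)
    return scored[:top_k]


# Toy records shaped like the upsert payload used earlier.
records = [
    {"id": "vec1", "values": [1.0, 0.0, 0.0], "metadata": {"genre": "cine"}},
    {"id": "vec2", "values": [0.0, 1.0, 0.0], "metadata": {"genre": "teatro"}},
]

matches = brute_force_query(records, [0.9, 0.1, 0.0], top_k=1)
print(matches[0]["id"])
```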


In [None]:
for ids in index.list(prefix='vec'):
    print(ids) # ['vec1', 'vec2']
    index.delete(ids=ids)

**Delete vectors**

In [None]:
# Delete all vectors
index.delete(delete_all=True)

{}

**Read the PDF**

In [None]:
from langchain_community.document_loaders import PyPDFLoader

In [None]:
# PDF file to load
FILE = "pme_19.pdf"

In [None]:
# Load the PDF as a list of Document objects (one per page)
loader = PyPDFLoader(FILE)
doc = loader.load()

**Create chunks**

In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

In [None]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=800,
    chunk_overlap=100,
    length_function=len
    )

In [None]:
# Split the document into chunks
chunks = text_splitter.split_documents(doc)

In [None]:
len(chunks)

31

In [None]:
def create_chunks(doc_to_chunk):
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=800,
        chunk_overlap=100,
        length_function=len
        )
    return text_splitter.split_documents(doc_to_chunk)
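To see what `chunk_size=800` and `chunk_overlap=100` mean, here is a simplified sliding-window splitter. It is only an approximation: `RecursiveCharacterTextSplitter` prefers to break on separators such as paragraph and line breaks, while this hypothetical `sliding_window_chunks` helper cuts at fixed character positions.

```python
def sliding_window_chunks(text, chunk_size=800, chunk_overlap=100):
    """Cut text into fixed-size pieces where each piece repeats the
    last `chunk_overlap` characters of the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    pieces = []
    for start in range(0, len(text), step):
        pieces.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # The last piece already reaches the end of the text
    return pieces


# Synthetic 2000-character text, used only to show the resulting sizes.
sample_text = "".join(str(i % 10) for i in range(2000))
sample_chunks = sliding_window_chunks(sample_text)
print([len(piece) for piece in sample_chunks])
```

The overlap means a sentence that straddles a cut still appears whole in at least one chunk, at the cost of storing some text twice.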

**Embeddings**

In [None]:
from langchain_community.embeddings import HuggingFaceEmbeddings

In [None]:
'sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2' # 471M
# 'sentence-transformers/paraphrase-multilingual-mpnet-base-v2' #1.11G

'sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2'

In [None]:
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")


In [None]:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2')
sentence_embeddings = model.encode("¿Qué significa BRCA 1 y 2?")

In [None]:
len(sentence_embeddings)

384
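The model's output size has to match the dimension of the Pinecone index, which is 384 in both cases here. A small guard like the following can catch a mismatch before any upsert. This is a sketch, and `check_dimension` is a hypothetical helper, not part of any library.

```python
def check_dimension(vector, expected_dimension=384):
    """Raise ValueError if the vector length differs from the index dimension."""
    if len(vector) != expected_dimension:
        raise ValueError(
            f"Expected {expected_dimension} values, got {len(vector)}"
        )
    return vector


# A placeholder vector with the right length passes through unchanged.
placeholder = [0.0] * 384
print(len(check_dimension(placeholder)))
```

In this notebook it could be called as `check_dimension(sentence_embeddings, index_description.dimension)`.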

**Ask the document**

In [None]:
import os
from langchain_pinecone import PineconeVectorStore

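# PineconeVectorStore reads the key from the PINECONE_API_KEY environment variable.
# Set it outside the notebook (for example, as a Colab secret) rather than hard-coding it.
if "PINECONE_API_KEY" not in os.environ:
    raise RuntimeError("Set the PINECONE_API_KEY environment variable first")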

PineconeVectorStore.from_documents(
    chunks,
    embedding=embeddings,
    index_name=INDEX_NAME)

<langchain_pinecone.vectorstores.PineconeVectorStore at 0x7b542463d450>

In [None]:
import os
os.environ["OPENAI_API_KEY"] = "***********"

In [None]:
from langchain_openai import ChatOpenAI
from langchain.chains.question_answering import load_qa_chain

In [None]:
vstore = PineconeVectorStore.from_existing_index(INDEX_NAME, embeddings)

llm = ChatOpenAI(model_name='gpt-3.5-turbo')
chain = load_qa_chain(llm, chain_type="stuff")



In [None]:
pregunta = "¿Qué significa BRCA 1 y 2?"

In [None]:
# Search for the 3 chunks most similar to the question
docs = vstore.similarity_search(pregunta, k=3)
# Pass the similar chunks to ChatGPT as context
respuesta = chain.run(input_documents=docs, question=pregunta)
print(f"Respuesta ChatGPT: {respuesta}")



Respuesta ChatGPT: BRCA 1 y 2 son genes asociados al cáncer familiar de mama y/o ovario. Cuando se presentan mutaciones en estos genes, se incrementa el riesgo de desarrollar cáncer de mama u ovario.


In [None]:
pregunta = "Cuántos juegos ha ganado el Cruz Azul?"

In [None]:
docs = vstore.similarity_search(pregunta, 3)
respuesta = chain.run(input_documents=docs, question=pregunta)
print(f"Respuesta ChatGPT: {respuesta}")

Respuesta ChatGPT: Lo siento, no tengo información sobre el número de juegos ganados por el Cruz Azul.
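The "stuff" chain type places all retrieved chunks into a single prompt together with the question. The sketch below shows that idea only. The prompt wording is hypothetical and is not LangChain's actual template, and the chunk texts are placeholders.

```python
def build_stuff_prompt(chunk_texts, question):
    """Concatenate every retrieved chunk into one context block
    and append the question after it."""
    context = "\n\n".join(chunk_texts)
    return (
        "Use the following context to answer the question. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Placeholder texts standing in for the page_content of the retrieved docs.
retrieved_texts = [
    "Texto del fragmento 1.",
    "Texto del fragmento 2.",
    "Texto del fragmento 3.",
]
prompt = build_stuff_prompt(retrieved_texts, "¿Qué significa BRCA 1 y 2?")
print(prompt)
```

Because everything goes into one request, the prompt grows with the number and size of the chunks, which is one reason to keep `k` and `chunk_size` modest. It also suggests why an off-topic question, like the one about Cruz Azul, gets an "I don't know" style answer: nothing in the retrieved context covers it.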


**Add vectors**

In [None]:
# Add the text to the existing store; add_texts returns the IDs of the new vectors
res = vstore.add_texts(
    ["El cáncer de mama (adenocarcinoma) es una enfermedad maligna"],
    metadatas=[{'doc':'pme.19'}])

**Check costs**

In [None]:
from langchain_community.callbacks import get_openai_callback

In [None]:
with get_openai_callback() as cb:
    response = chain.run(input_documents=docs, question=pregunta)
    print(cb)

Tokens Used: 768
	Prompt Tokens: 748
	Completion Tokens: 20
Successful Requests: 1
Total Cost (USD): $0.000404
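The reported total can be reproduced by hand from the token counts. The per-1K-token prices below are assumptions (USD 0.0005 for prompt tokens and USD 0.0015 for completion tokens), chosen because they match the total shown above. Actual pricing depends on the model version and may change.

```python
# Assumed prices in USD per 1,000 tokens; check current pricing before relying on them.
PROMPT_PRICE_PER_1K = 0.0005
COMPLETION_PRICE_PER_1K = 0.0015


def estimate_cost(prompt_tokens, completion_tokens):
    """Estimate the request cost in USD from the two token counts."""
    prompt_cost = prompt_tokens / 1000 * PROMPT_PRICE_PER_1K
    completion_cost = completion_tokens / 1000 * COMPLETION_PRICE_PER_1K
    return prompt_cost + completion_cost


cost = estimate_cost(prompt_tokens=748, completion_tokens=20)
print(f"${cost:.6f}")  # $0.000404
```

Almost all of the cost comes from the prompt, because the three retrieved chunks are sent along with every question.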
