* https://docs.mistral.ai/guides/rag/#rag-from-scratch


In [72]:
import requests
import nltk
import numpy as np
import pandas as pd
import langchain
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.schema import Document # Importing Document schema from Langchain
from langchain_huggingface import HuggingFaceEmbeddings
from langchain.vectorstores import Chroma
import google.generativeai as genai
from tqdm import tqdm

## Looking at the validation file:

In [2]:
df_qa = pd.read_csv('acquired-qa-evaluation.csv')
df_qa.head()

Unnamed: 0,question,human_answer,ai_answer_without_the_transcript,ai_answer_without_transcript_correctness,ai_answer_with_the_transcript,ai_answer_with_the_transcript_correctness,quality_rating_for_answer_with_transcript,post_url,file_name,Unnamed: 9
0,"When did Airbnb go public, what was the price ...","December 9,2020 at $68 per share","Airbnb went public on December 10, 2020. The i...",CORRECT,"Airbnb went public in 2020. However, the speci...",INCORRECT,4,https://www.acquired.fm/episodes/airbnb,airbnb,
1,Why did Wimdu unlike Airbnb not take off?,Wimdu gragmented the marketed focusing mostly ...,Wimdu faced challenges compared to Airbnb due ...,CORRECT,"Wimdu, similar to Airbnb, was a platform creat...",CORRECT,5,https://www.acquired.fm/episodes/airbnb,airbnb,
2,Why does market fragmentation work for airline...,Even though both the airline industry and airb...,Market fragmentation benefits the airline indu...,CORRECT,Market fragmentation can work for the airline ...,CORRECT,3,https://www.acquired.fm/episodes/airbnb,airbnb,
3,How many hot dogs does Costco currently sell p...,130 million,Costco sold just shy of 200 million hot dog an...,INCORRECT,Annual Hot Dog Sales: Costco sells 130 million...,CORRECT,5,https://www.acquired.fm/episodes/costco,costco,
4,"What store was created as ""the price club of h...",Home Depot,"The store created as the ""price club of hardwa...",CORRECT,"Store Created as ""the price club of hardware s...",CORRECT,5,https://www.acquired.fm/episodes/costco,costco,


In [3]:
df_qa.shape

(80, 10)

In [4]:
df_qa.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 80 entries, 0 to 79
Data columns (total 10 columns):
 #   Column                                     Non-Null Count  Dtype 
---  ------                                     --------------  ----- 
 0   question                                   80 non-null     object
 1   human_answer                               80 non-null     object
 2   ai_answer_without_the_transcript           80 non-null     object
 3   ai_answer_without_transcript_correctness   80 non-null     object
 4   ai_answer_with_the_transcript              80 non-null     object
 5   ai_answer_with_the_transcript_correctness  80 non-null     object
 6   quality_rating_for_answer_with_transcript  80 non-null     object
 7   post_url                                   80 non-null     object
 8   file_name                                  80 non-null     object
 9   Unnamed: 9                                 1 non-null      object
dtypes: object(10)
memory usage: 6.4+ KB


In [5]:
df_qa.file_name.value_counts()

file_name
qualcomm                                                6
enron                                                   5
spacex                                                  5
bitcoin                                                 4
airbnb                                                  3
costco                                                  3
disney_plus                                             3
whatsapp                                                3
berkshire_hathaway_part_i                               3
nvidia_part_iii_the_dawn_of_the_ai_era_20222023         3
nvidia_part_ii_the_machine_learning_company_20062022    3
ethereum_with_packy_mccormick                           3
nvidia_part_i_the_gpu_company_19932006                  3
walmart                                                 3
renaissance_technologies                                3
nintendos_origins                                       3
visa                                                    3
amaz

There are many files, so I keep only the first 5 file_names (the ones with the most questions):

In [6]:
file_names = ['qualcomm','enron','spacex','bitcoin','airbnb']
df_qa.file_name.isin(file_names).sum()

23

I create a new dataframe df_qf that keeps only the rows corresponding to these 5 file_names:

In [7]:
df_qf = df_qa.loc[df_qa.file_name.isin(file_names), ['question','file_name','human_answer']]
df_qf.head()

Unnamed: 0,question,file_name,human_answer
0,"When did Airbnb go public, what was the price ...",airbnb,"December 9,2020 at $68 per share"
1,Why did Wimdu unlike Airbnb not take off?,airbnb,Wimdu gragmented the marketed focusing mostly ...
2,Why does market fragmentation work for airline...,airbnb,Even though both the airline industry and airb...
9,"According to Information Theory, what is the i...",qualcomm,The more closely the actual communication is t...
10,Compare the impact on Qualcomm between the two...,qualcomm,Erwin Jacobs was a genius and visionary who pa...


In [54]:
len(file_names)

5

I have 5 files with 23 questions, each with its "correct" answer: the human answer 'human_answer' and the name of the file where the answer is found, 'file_name'. With these two pieces of data I will test my RAG. For each question I check:
* whether the RAG looks for the answer in the correct file.
* the cosine distance between the human answer and the LLM's answer.
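The two checks above can be sketched with plain NumPy. This is only a minimal illustration under my own assumptions: the helper names `cosine_distance` and `retrieval_hit_rate` are hypothetical, and the sources and vectors are toy values, not results from this notebook. In the real test the vectors would be embeddings of the human answer and the LLM answer (for example from `embeddings.embed_query(...)`).

```python
import numpy as np

def cosine_distance(u, v):
    """Return 1 - cosine similarity between two 1-D vectors."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return 1.0 - float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def retrieval_hit_rate(retrieved_sources, expected_sources):
    """Fraction of questions whose retrieved chunk comes from the expected file."""
    hits = [r == e for r, e in zip(retrieved_sources, expected_sources)]
    return sum(hits) / len(hits)

# Toy values for illustration only (hypothetical, not notebook results)
retrieved = ["spacex", "enron", "airbnb", "bitcoin"]
expected = ["spacex", "enron", "qualcomm", "bitcoin"]
hit_rate = retrieval_hit_rate(retrieved, expected)

# Identical vectors give a distance close to 0, orthogonal vectors a distance of 1
same_distance = cosine_distance([1.0, 0.0, 1.0], [1.0, 0.0, 1.0])
orthogonal_distance = cosine_distance([1.0, 0.0], [0.0, 1.0])
```

A lower cosine distance means the LLM answer is semantically closer to the human answer, and the hit rate summarizes how often retrieval lands in the right transcript.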

## Now I read the individual transcripts and split them

In [10]:
PATH = "acquired-individual-transcripts/"

docs = []

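for file_name in file_names:
    # read each transcript from disk
    with open(PATH+file_name+'.txt','r') as file:
        text = file.read()
    # text[113:] skips the first 113 characters of every file (presumably a header);
    # the file name is stored as metadata so each chunk keeps its source
    doc = Document(page_content=text[113:], metadata={"source": file_name})
    docs.append(doc)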

len(docs)

5

In [25]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = text_splitter.split_documents(docs)
print(f"Split {len(docs)} documents into {len(chunks)} chunks.")

Split 5 documents into 886 chunks.
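To get a feel for what `chunk_size` and `chunk_overlap` mean, here is a simplified sketch of fixed-size windows with overlap. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter tries a list of separators recursively before cutting); the function `split_with_overlap` is a hypothetical helper written only for illustration.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Cut text into windows of chunk_size characters; consecutive windows share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    pieces = []
    for start in range(0, len(text), step):
        pieces.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return pieces

# Toy example with small numbers so the overlap is easy to see
toy_text = "abcdefghijklmnopqrstuvwxyz"
toy_chunks = split_with_overlap(toy_text, chunk_size=10, chunk_overlap=3)
print(toy_chunks)
```

The overlap means a sentence cut at a chunk boundary still appears partly in the next chunk, which can help retrieval when the answer straddles two chunks.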


In [69]:
chunks[0]

Document(metadata={'source': 'airbnb'}, page_content="Ben: Welcome to season 7, episode 8, the season finale of Acquired, the podcast about great technology companies, and the stories and playbooks behind them. I'm Ben Gilbert and I'm the co-founder of Pioneer Square Labs, a startup studio and venture capital firm in Seattle. David: And I'm David Rosenthal and I am an angel investor and startup advisor based in San Francisco. Ben: We are your hosts. Today, we cover the hottest and most anticipated company to IPO in 2020. Oddly, in a year marred by the global pandemic and just this month an all-time high number of stay-at-home orders, this hot IPO is a travel company. Airbnb—originally known as AirBed and Breakfast Incorporate—is going public today raising over $3.5 billion, and initially valued at over $47 billion. The company is insanely impressive. They operate in 220 countries and 100,000 cities. Last year, there were $38 billion of bookings made on the platform. There are over 50 m

## Creating the database with the chunks and embeddings

In [71]:
# load the embeddings model from a local path
model_name = "/home/lucas/Documentos/Lucas/ML/embeddings/Sentence Transformers/"
model_kwargs = {'device': 'cpu'}
encode_kwargs = {'normalize_embeddings': True}
embeddings = HuggingFaceEmbeddings(model_name=model_name,model_kwargs=model_kwargs,encode_kwargs=encode_kwargs)
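A small aside on `normalize_embeddings: True`: when vectors have unit length, their dot product equals their cosine similarity. The sketch below checks this with toy 2-D vectors; `l2_normalize` is a hypothetical helper and the numbers are not real embeddings.

```python
import numpy as np

def l2_normalize(v):
    """Scale a vector to unit L2 norm."""
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v)

# Toy vectors standing in for embeddings (illustration only)
a = l2_normalize([3.0, 4.0])
b = l2_normalize([4.0, 3.0])

dot = float(np.dot(a, b))
cosine = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```

Here `dot` and `cosine` agree up to floating-point error, which is why normalized embeddings make similarity scores easier to compare.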

In [37]:
# create the Chroma database from the chunks and embeddings
db = Chroma.from_documents(collection_name='chroma_lc_rag',documents=chunks, embedding=embeddings, persist_directory='chroma_db')

In [64]:
# check that the database works properly
question = 'what is the idea about mars oasis'
sim = db.similarity_search(question, k=1)
sim[0].page_content

"this is their charter, so he makes a $100,000 donation to The Mars Society. He joins the board and he starts meeting all of these aerospace people in LA. Not just in LA, of course back up in Silicon Valley there's NASA's Jet Propulsion Lab in Mountain View. \xa0 Elon's mostly down in LA but he's going back and forth. He starts organizing these “Saturday salons,” he calls them, where he's just getting together industry leaders in aerospace and at JPL, both in LA and Palo Alto. There's no agenda, but he lets it be known to all of them that he's got some resources. He's a dot-com-rich guy and he wants to make a gesture. What could be done on the order of $10–$20 million.  They start to coalesce the group on this idea of building a “Mars Oasis” and the idea behind a Mars Oasis is that they're going to buy a rocket, and they're going to put a plant on it, and they're also going to put a robot on it, and they're going to shoot this rocket to mars. I can't remember if the Mars Rover had land

## Query function:

In [55]:
# configure the Gemini API (GOOGLE_API_KEY must already be defined; it is not set in the cells shown here)
genai.configure(api_key=GOOGLE_API_KEY)
gemini = genai.GenerativeModel('gemini-1.5-flash')

In [62]:
# function that retrieves context from Chroma and queries the Gemini API
def consulta(question):
    # retrieve the single most similar chunk (k=1) to use as context
    sim = db.similarity_search(question, k=1)
    retrieved_chunk = sim[0]
    
    prompt = f"""
    Given the context information and not prior knowledge, answer the query.
    Context information:
    ---------------------
    {retrieved_chunk.page_content}
    ---------------------
    Query: {question}
    Answer:
    """
    
    # note: this uses 'gemini-1.5-pro', not the 'gemini-1.5-flash' object created in the previous cell
    model = genai.GenerativeModel('gemini-1.5-pro')
    response = model.generate_content(prompt)
    
    return retrieved_chunk, response.text

And I use it to ask a new question:

In [63]:
question_number=9
question = df_qf['question'].iloc[question_number]
print('question: ',question)
print('------')
chunk, response = consulta(question)
print('response: ',response)
print('chunk_source: ',chunk.metadata['source'])
print('check real chunk_source: ',df_qf['file_name'].iloc[question_number])

question:  what is the idea about mars oasis
------
response:  The Mars Oasis idea involves buying a rocket, putting a plant and a robot on it, and sending it to Mars.

chunk_source:  spacex
check real chunk_source:  spacex
