<a href="https://colab.research.google.com/github/marinandres/Episode-1/blob/main/main.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Using the RAG (Retrieval-Augmented Generation) technique on large language models (LLMs), can we determine how LLMs respond to our questions?**

Who were the scorers in the Panama vs USA game in the Copa America 2024?

*LangChain is a framework that, among other things, makes it easy to scrape websites with Python. For a proof of concept there is no need to spend much time building a sophisticated scraper. The main focus of this example is to demonstrate how the RAG (Retrieval-Augmented Generation) technique supplies additional context to large language models (LLMs). LangChain handles the retrieval step, which here means extracting the text of a web page, and sets the stage for the next step: augmentation.*

In [None]:
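# Recent LangChain releases ship these classes in langchain_community;
# older releases exposed them under langchain.document_loaders / langchain.document_transformers
from langchain_community.document_loaders import AsyncChromiumLoader
from langchain_community.document_transformers import Html2TextTransformer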
import pandas as pd

urls = [
    'https://copaamerica.com/en/news/united-states-panama-match-recap-copa-america-2024'
]

scraped_data = {}

def scrape_and_transform(url):
    try:
        loader = AsyncChromiumLoader([url])
        docs = loader.load()

        text_transformer = Html2TextTransformer()
        scrap_documents = text_transformer.transform_documents(docs)

        scrap_text = "\n\n".join([doc.page_content for doc in scrap_documents])

        scraped_data[url] = scrap_text
    except Exception as error:
        print(f"Error processing {url}: {error}")
        scraped_data[url] = f"Error: {error}"

for url in urls:
    scrape_and_transform(url)

# One row per URL; quoting=1 is csv.QUOTE_ALL, so every field is wrapped in quotes
df = pd.DataFrame(list(scraped_data.items()), columns=['url', 'text'])
df.to_csv('scraped_data.csv', index=False, quoting=1)

print("Scraped data has been saved to 'scraped_data.csv'.")


*This step is included for reference only. Run the code as a standalone .py file; running it inside a notebook raises an async-related error.*
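
The async error mentioned in the note above is most likely the one Python raises when code tries to start a new event loop while another is already running, which is the situation inside a notebook kernel. The sketch below reproduces that behaviour with the standard library only. `fetch_page` is a hypothetical stand-in for an asynchronous page load and does not touch the network, and whether the loader fails in exactly this way is an assumption that depends on the LangChain version.

```python
import asyncio

async def fetch_page():
    """Hypothetical stand-in for an asynchronous page load (no network access)."""
    await asyncio.sleep(0)
    return "<html>page</html>"

def load_sync():
    """Synchronous wrapper that starts its own event loop, as a plain script can."""
    coroutine = fetch_page()
    try:
        return asyncio.run(coroutine)
    except RuntimeError:
        # Close the unused coroutine so Python does not warn about it
        coroutine.close()
        raise

async def simulate_notebook_cell():
    """Call the synchronous wrapper while an event loop is already running."""
    try:
        load_sync()
    except RuntimeError as error:
        return str(error)
    return "no error"

# In a script there is no running loop, so the wrapper works
script_result = load_sync()
# Inside a running loop (as in a notebook) the same call fails
notebook_error = asyncio.run(simulate_notebook_cell())

print(script_result)
print(notebook_error)
```

Running the scraper from a .py file, as the note recommends, avoids the problem because no event loop is active when the script starts.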

**Data Cleaning**

*A simple cleaning step splits the full text into paragraphs so that we can check which ones contain relevant information. This prepares the data for the next phase, embedding and augmentation, which relies on the vector relationships between pieces of text.*

In [76]:
import pandas as pd
import re

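# Load the scraped text; adjust the path to wherever scraped_data.csv was saved or uploaded
df = pd.read_csv('/scraped_data.csv')
# The URL column is not needed for the paragraph-level analysis
df = df.drop(columns=['url'])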

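new_rows = []

for index, row in df.iterrows():
    # Paragraphs in the scraped text are separated by blank lines
    paragraphs = row['text'].split('\n\n')

    for paragraph in paragraphs:
        paragraph = paragraph.strip()
        # Skip empty fragments left over from the split
        if paragraph:
            # row.copy() keeps the original row index, so every paragraph
            # from the same page shares that index value
            new_row = row.copy()
            new_row['text'] = paragraph
            new_rows.append(new_row)

# One row per paragraph
df = pd.DataFrame(new_rows)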


df.head()

Unnamed: 0,text
0,Fechar
0,* Tickets\n * Schedule\n * Standings\n * Te...
0,espten
0,ESPTEN
0,Back
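
The first rows above are navigation fragments such as `Fechar` and `Back` rather than match text. One possible way to drop them before embedding is a simple word-count filter. This is a minimal sketch, not part of the original notebook: the helper names and the `min_words=8` threshold are assumptions chosen for illustration.

```python
def is_informative(paragraph, min_words=8):
    """Return True when a paragraph has at least `min_words` whitespace-separated words."""
    return len(paragraph.split()) >= min_words

def keep_informative(paragraphs, min_words=8):
    """Keep only the paragraphs that pass the word-count check, preserving order."""
    return [p for p in paragraphs if is_informative(p, min_words)]

sample = [
    "Fechar",
    "Back",
    "Panama however would respond four minutes later through Cesar Blackman.",
]

# Only the full sentence survives; the two menu labels are dropped
filtered = keep_informative(sample)
print(filtered)
```

With the DataFrame from the cell above, the same check could be applied as `df = df[df['text'].map(is_informative)]`. A threshold that is too high would also remove short but useful lines, so it should be tuned against the actual pages.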


**Load an LLM**

*The following code downloads an open-source LLM from the Hugging Face Hub using the `huggingface_hub` and `transformers` packages.*

In [2]:
import os
from huggingface_hub import login

# Read the Hugging Face token from an environment variable instead of hardcoding it.
# The model loaded below is public, so authentication is optional.
token = os.environ.get('HUGGINGFACE_HUB_TOKEN')

# Authenticate with the Hugging Face Hub only when a token is available
if token:
    login(token=token)

In [3]:
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("NousResearch/Llama-2-7b-chat-hf")
model = AutoModelForCausalLM.from_pretrained("NousResearch/Llama-2-7b-chat-hf")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



In [4]:
%pip install sentence-transformers



**The Augmentation**

*To give a large language model (LLM) access to up-to-date or private information, such as news or patient records, those records must be made available to the model when it answers. The embedding process supports this by converting each record into a numeric vector, so that the records most relevant to a question can be found and passed to the model.*


In [77]:
%%time
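from sentence_transformers import SentenceTransformer

# Note: this rebinds `model`, which previously held the Llama 2 model
model = SentenceTransformer('all-MiniLM-L6-v2')
# Encode each paragraph into an embedding vector, stored as a plain Python list
df['embeddings'] = df['text'].apply(lambda x: model.encode(x).tolist())

# Save the embeddings, then reload them from the pickle file
df.to_csv('/website-text-embeddings.csv', index=False)
df.to_pickle('/website-text-embeddings.pkl')
df = pd.read_pickle('/website-text-embeddings.pkl')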

df.head()



CPU times: user 9.63 s, sys: 29.8 ms, total: 9.66 s
Wall time: 4.32 s


Unnamed: 0,text,embeddings
0,Fechar,"[-0.006289033219218254, -0.011221647262573242,..."
0,* Tickets\n * Schedule\n * Standings\n * Te...,"[0.041273243725299835, 0.021442987024784088, -..."
0,espten,"[-0.03019797056913376, -0.00401907367631793, -..."
0,ESPTEN,"[-0.03019797056913376, -0.00401907367631793, -..."
0,Back,"[-0.029907789081335068, -0.016124457120895386,..."
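
The cell above writes the embeddings to both CSV and pickle but reloads only the pickle. A likely reason, stated here as an assumption, is that a CSV file stores each embedding list as text, so it has to be parsed again after loading. The sketch below shows that round trip with the standard library and a made-up three-value embedding.

```python
import csv
import io
import json

# Made-up embedding used only for illustration
embedding = [0.25, -0.5, 0.125]

# Write one row to an in-memory CSV file, serialising the list as JSON text
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["text", "embeddings"])
writer.writerow(["Back", json.dumps(embedding)])

# Read the row back: the embedding column is now a string, not a list
buffer.seek(0)
row = next(csv.DictReader(buffer))
loaded_type = type(row["embeddings"]).__name__

# Parsing the string restores the original list of floats
restored = json.loads(row["embeddings"])

print(loaded_type)
print(restored == embedding)
```

Pickle keeps the Python objects as they are, so no parsing step is needed, at the cost of a file that is only readable from Python.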


In [78]:
question = 'Which players scored a goal in the Panama vs USA game in the Copa America 2024?'

question_embedding = model.encode(question)
question, question_embedding[0:10], "..."

('Which players scored a goal in the Panama vs USA game in the Copa America 2024?',
 array([-0.06710075,  0.09661483, -0.11555447,  0.00029354,  0.08159959,
         0.09219865,  0.02795403,  0.07493558,  0.09490672,  0.11036059],
       dtype=float32),
 '...')

In [87]:
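import numpy as np

def similarity_to_question(page_embedding):
  # Dot product of a paragraph embedding with the question embedding;
  # a larger value means the paragraph is more similar to the question
  return np.dot(page_embedding, question_embedding)

# Despite the column name, 'distance' holds a similarity score, so sort in descending order
df['distance'] = df['embeddings'].apply(similarity_to_question)
df.sort_values(by='distance', ascending=False, inplace=True)
df.head(10)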

Unnamed: 0,text,embeddings,distance
0,"14 minutes later, the United States would go d...","[-0.021768538281321526, 0.09114903211593628, -...",0.729738
0,"Into the second half, the United States made t...","[-0.014293486252427101, 0.058961715549230576, ...",0.707263
0,# Panama Earn Famous Victory Over United State...,"[-0.014628256671130657, 0.02368212677538395, -...",0.689201
1,# Shorthanded U.S. Men’s National Team Falls 2...,"[-0.0406656339764595, 0.0182965025305748, -0.0...",0.67782
1,Before both teams had time to fully adjust to ...,"[0.013789798133075237, 0.07313454896211624, -0...",0.674448
1,* Folarin Balogun scored the fifth goal of his...,"[-0.023115074262022972, 0.05543981492519379, -...",0.66967
0,The best chance for the United States in the s...,"[-0.03482664376497269, 0.0441412627696991, -0....",0.664106
1,"**ATLANTA (June 27, 2024) -** Competing with 1...","[-0.05612802132964134, -0.012843657284975052, ...",0.664041
1,"**PAN — José Fajardo (Abdiel Ayarza), 83rd min...","[-0.01878231391310692, 0.07335130125284195, -0...",0.662175
1,"While Panama had almost all of the ball, the U...","[-0.02550629898905754, 0.07961896061897278, -0...",0.661744
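
The ranking above uses a plain dot product. For vectors of length 1 the dot product equals the cosine similarity, and `all-MiniLM-L6-v2` is generally understood to return normalized embeddings, although that is worth verifying for the installed version. The sketch below illustrates the relationship with made-up two-dimensional vectors; all names in it are hypothetical.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    """Scale a vector to length 1."""
    length = math.sqrt(dot(v, v))
    return [x / length for x in v]

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors, independent of their lengths."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Made-up vectors standing in for a question and two paragraphs
question_vec = [3.0, 4.0]
related_vec = [4.0, 3.0]
unrelated_vec = [-4.0, 3.0]

# After normalization, the dot product matches the cosine similarity
related_score = dot(normalize(question_vec), normalize(related_vec))
unrelated_score = dot(normalize(question_vec), normalize(unrelated_vec))

print(related_score, cosine_similarity(question_vec, related_vec))
print(unrelated_score)
```

If the embeddings were not normalized, longer vectors would score higher regardless of direction, and dividing by the vector lengths as in `cosine_similarity` would be the safer choice.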


In [89]:
# Concatenate the hand-picked top-ranked paragraphs; no separator is inserted between them
context = df.iloc[0]['text'] + df.iloc[1]['text'] + df.iloc[2]['text'] + df.iloc[3]['text'] + df.iloc[4]['text'] + df.iloc[9]['text']
print(context)

14 minutes later, the United States would go down to 10-men following a red
card to forward Tim Weah. It wouldn’t slow the Americans down however as
Florian Balogun scored his second CONMEBOL Copa Amércia™ goal in the 22’. The
Monaco man hit a screamer from the edge of the Panama box to give his country
the early lead. Panama however would respond four minutes later through Cesar
Blackman, who scored when his shot from the top of the box sneaked past Turner
in goal.Into the second half, the United States made three changes including in goal
with Ethan Horvath coming in for Turner. In the 63’, Panama thought they had
won a penalty when Jose Fajardo went down in the United States box, but after
consulting VAR, referee Iván Barton deemed there to be no contact and the
match continued at 1-1. From there, the match turned up an even higher notch.# Panama Earn Famous Victory Over United States at CONMEBOL Copa América™# Shorthanded U.S. Men’s National Team Falls 2-1 to Panama in Second Match

Now that we have retrieved context by comparing the embedding of the question with the embeddings of our data, we can use basic prompt engineering to instruct the LLM on how to respond to our questions.

In [92]:
question = 'Which players scored a goal in the Panama vs USA game in the Copa America 2024 and how was the result?'

prompt = (
    {'role': 'system', 'content': 'You are soccer analyst that verify in which minute and how many goals scored for each player with his name'},
    {'role': 'user', 'content': question},
    {'role': 'assistant', 'content': f'Use this information to answer find the name of player who score a goal in that game: {context}. Stick to this context to provide the answer try to provide the result of the game.'}
)
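
The prompt above is a tuple of role dictionaries that is later formatted into a single string as-is. As an alternative sketch, the commonly documented Llama 2 chat layout wraps the system message in `<<SYS>>` markers inside one `[INST]` block, and the retrieved context can go in the user turn instead of the assistant turn. The helper below is hypothetical and only builds the string; whether it improves the answers for this model has not been tested here.

```python
def build_llama2_prompt(system, question, context):
    """Build a single-turn prompt in the Llama 2 chat layout.

    The tokenizer normally adds the beginning-of-sequence token itself,
    so it is not included in the string.
    """
    user_message = f"Context:\n{context}\n\nQuestion: {question}"
    return f"[INST] <<SYS>>\n{system}\n<</SYS>>\n\n{user_message} [/INST]"

# Shortened sample values used only for illustration
sketch_prompt = build_llama2_prompt(
    system="You are a soccer analyst. Answer only from the given context.",
    question="Which players scored in the Panama vs USA game?",
    context="Panama however would respond four minutes later through Cesar Blackman.",
)
print(sketch_prompt)
```

The resulting string could be passed directly to the text-generation pipeline in place of the f-string that wraps the tuple.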

In [10]:
from transformers import pipeline
pipe = pipeline(task="text-generation", model="NousResearch/Llama-2-7b-chat-hf", tokenizer=tokenizer, max_length=1000, truncation=True)


Here is the response from our model. Because we choose the context, we can now decide which information the model uses to generate its response. The more up to date that information is, the more effective this process becomes.

In [93]:
%%time
# Generate the response (the %%time cell magic must be the first line of the cell)
result = pipe(f"[INST] {prompt} [/INST]")

# Print the generated text
print(result[0]['generated_text'])

[INST] ({'role': 'system', 'content': 'You are soccer analyst that verify in which minute and how many goals scored for each player with his name'}, {'role': 'user', 'content': 'Which players scored a goal in the Panama vs USA game in the Copa America 2024 and how was the result?'}, {'role': 'assistant', 'content': 'Use this information to answer find the name of player who score a goal in that game: 14 minutes later, the United States would go down to 10-men following a red\ncard to forward Tim Weah. It wouldn’t slow the Americans down however as\nFlorian Balogun scored his second CONMEBOL Copa Amércia™ goal in the 22’. The\nMonaco man hit a screamer from the edge of the Panama box to give his country\nthe early lead. Panama however would respond four minutes later through Cesar\nBlackman, who scored when his shot from the top of the box sneaked past Turner\nin goal.Into the second half, the United States made three changes including in goal\nwith Ethan Horvath coming in for Turner. I