# RAG from the web - ENGLISH

## 1. Code

In [2]:
running_urls=[
    'https://www.runnersworld.com/uk/gear/shoes/a776671/best-running-shoes/',
    'https://www.wired.com/gallery/best-running-shoes/',
    'https://runrepeat.com/guides/best-running-shoes',
    'https://runrepeat.com/catalog/cheap-running-shoes',
    'https://believeintherun.com/shoe-reviews/type/best-of-road/',
    'https://www.womensrunning.co.uk/gear/best-womens-running-shoes/',
    'https://www.telegraph.co.uk/recommended/leisure/fitness/best-running-shoes-men/',
    'https://www.runnersworld.com/gear/a24228881/affordable-running-shoes/',
    'https://www.verywellfit.com/best-cheap-running-shoes-8598713',
    'https://www.coachweb.com/running-shoes/7827/the-best-budget-running-shoes',
    'https://www.outdoorgearlab.com/topics/shoes-and-boots/best-running-shoes-womens',
    'https://www.telegraph.co.uk/recommended/leisure/fitness/best-running-shoes-women/',
    'https://runrepeat.com/guides/best-womens-running-shoes',
    'https://runrepeat.com/guides/best-mens-running-shoes',
    'https://www.gift-sprint.com/top-5-running-shoes-to-buy-2024',
    'https://nymag.com/strategist/article/best-running-shoes-workout-shoes-women.html',
    'https://www.cosmopolitan.com/uk/body/fitness-workouts/g30441126/best-running-shoes/',
    'https://believeintherun.com/shoe-reviews/best-running-shoes-of-2024-so-far/',
    'https://www.gq.com/story/best-running-shoes-for-every-budget',
    'https://www.theguardian.com/thefilter/2025/jan/02/best-running-shoes',
    'https://www.outsideonline.com/outdoor-gear/run/best-running-shoes'
]
len(running_urls)

18
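
Python silently concatenates adjacent string literals, so a missing comma in a list like `running_urls` fuses two URLs into one entry without raising an error. The sketch below is one possible sanity check. The helper name `find_suspicious_urls` and the `example.com` URLs are hypothetical and do not appear in the original notebook.

```python
from urllib.parse import urlparse


def find_suspicious_urls(urls):
    """Return entries that fail to parse or look like two URLs fused together."""
    suspicious = []
    for url in urls:
        parsed = urlparse(url)
        # A well-formed entry has a scheme and a host, and contains "://" exactly once.
        if not (parsed.scheme and parsed.netloc) or url.count("://") != 1:
            suspicious.append(url)
    return suspicious


# Hypothetical example: the missing comma fuses the last two literals into one string.
example_urls = [
    "https://example.com/a",
    "https://example.com/b"
    "https://example.com/c",
]
print(len(example_urls))
print(find_suspicious_urls(example_urls))
```

Running a check like this right after defining the list would flag a fused entry before any page is requested.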

In [11]:
# Load the "external" document
import requests
from bs4 import BeautifulSoup

web_link='https://towardsdatascience.com/3-business-skills-you-need-to-progress-your-data-science-career-in-2025-146f841d1a1e'

response = requests.get(web_link, timeout=30)
response.raise_for_status()  # fail loudly instead of leaving `text` undefined
soup = BeautifulSoup(response.content, 'html.parser')
text = soup.get_text(separator="\n", strip=True)
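
Calling `get_text` on the whole page also captures navigation and footer links, which may be why the first answer in section 2 lists items such as "About" and "Careers". The sketch below shows the idea of skipping boilerplate tags before collecting text. It uses the standard-library `html.parser` rather than BeautifulSoup so that it runs on its own. The class `MainTextExtractor`, the function `extract_main_text`, the choice of tags to skip, and the sample HTML are all assumptions for illustration.

```python
from html.parser import HTMLParser


class MainTextExtractor(HTMLParser):
    """Collect visible text, skipping tags that usually hold boilerplate."""

    # Assumed set of boilerplate tags; adjust for the site being scraped.
    SKIP_TAGS = {"script", "style", "nav", "header", "footer"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # greater than 0 while inside a skipped tag
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        stripped = data.strip()
        if stripped and self._skip_depth == 0:
            self.parts.append(stripped)


def extract_main_text(html):
    """Return the non-boilerplate text of an HTML string, one fragment per line."""
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


sample_html = """
<html><body>
  <nav><a href="/about">About</a> <a href="/careers">Careers</a></nav>
  <article><h1>Title</h1><p>First paragraph.</p></article>
  <footer>Privacy Terms</footer>
  <script>console.log("ignored");</script>
</body></html>
"""
print(extract_main_text(sample_html))
```

With BeautifulSoup, a comparable effect can likely be achieved by calling `decompose()` on those tags before `get_text`.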

In [12]:
# Split the retrieved text
from langchain.text_splitter import CharacterTextSplitter

text_splitter = CharacterTextSplitter(
    separator="\n",  
    chunk_size=800,
    chunk_overlap=50
)

splits = text_splitter.split_text(text)
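
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a simplified, pure-Python sketch of greedy chunking with overlap. It is not the `CharacterTextSplitter` implementation, only an approximation of the idea. The function name `split_with_overlap` and the toy input are hypothetical.

```python
def split_with_overlap(text, separator="\n", chunk_size=800, chunk_overlap=50):
    """Pack separator-delimited pieces into chunks of at most chunk_size characters,
    carrying trailing pieces (up to chunk_overlap characters) into the next chunk.
    A single piece longer than chunk_size is emitted as its own oversized chunk."""
    pieces = [p for p in text.split(separator) if p]
    chunks, current = [], []
    for piece in pieces:
        if current and len(separator.join(current + [piece])) > chunk_size:
            chunks.append(separator.join(current))
            # Drop leading pieces until the carried-over tail fits the overlap
            # budget and still leaves room for the new piece.
            while current and (
                len(separator.join(current)) > chunk_overlap
                or len(separator.join(current + [piece])) > chunk_size
            ):
                current.pop(0)
        current.append(piece)
    if current:
        chunks.append(separator.join(current))
    return chunks


toy_text = "aaaa\nbbbb\ncccc\ndddd"
toy_chunks = split_with_overlap(toy_text, chunk_size=10, chunk_overlap=4)
print(toy_chunks)
```

In the toy run each chunk repeats the last line of the previous one, which is the effect the overlap is meant to have: context that straddles a chunk boundary is not lost.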


In [16]:
# Vectorize
from langchain_mongodb import MongoDBAtlasVectorSearch
from pymongo import MongoClient
from langchain_huggingface import HuggingFaceEmbeddings
from langchain.schema import Document
from dotenv import load_dotenv
import os 

load_dotenv()

MONGODB_ATLAS_CLUSTER_URI = os.getenv("MONGODB_ATLAS_CLUSTER_URI")
DB_NAME = os.getenv("DB_NAME")
COLLECTION_NAME = os.getenv("COLLECTION_NAME")
ATLAS_VECTOR_SEARCH_INDEX_NAME = os.getenv("ATLAS_VECTOR_SEARCH_INDEX_NAME")
client = MongoClient(MONGODB_ATLAS_CLUSTER_URI)
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")
MONGODB_COLLECTION = client[DB_NAME][COLLECTION_NAME]

vector_store = MongoDBAtlasVectorSearch(
    collection=MONGODB_COLLECTION,
    embedding=embeddings,
    index_name=ATLAS_VECTOR_SEARCH_INDEX_NAME,
    relevance_score_fn="cosine",
)

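# add_documents embeds each chunk with the store's embedding model, so the
# vectors do not need to be computed by hand. (The earlier loop passed a single
# string to embed_documents, which expects a list of texts.)
documents = [Document(page_content=split) for split in splits]
vector_store.add_documents(documents)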


try:
    vector_store.create_vector_search_index(dimensions=768)
except Exception as e:
    print("Index has already been created")

retriever = vector_store.as_retriever()


Index has already been created
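
The vector store above is configured with `relevance_score_fn="cosine"` over 768-dimensional embeddings. As a rough illustration of what that score measures, the sketch below ranks toy vectors by cosine similarity with a brute-force scan. Atlas Vector Search uses its own index rather than a scan like this. The function names `cosine_similarity` and `top_k` and the 3-dimensional vectors are assumptions for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=2):
    """Return the indices of the k vectors most similar to query_vec."""
    scored = [(cosine_similarity(query_vec, vec), i) for i, vec in enumerate(doc_vecs)]
    scored.sort(reverse=True)
    return [i for _, i in scored[:k]]


# Toy 3-dimensional "embeddings"; real ones here would have 768 dimensions.
query = [1.0, 0.0, 0.0]
docs = [[0.0, 1.0, 0.0], [2.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
print(top_k(query, docs, k=2))
```

Cosine similarity ignores vector length, so `[2.0, 0.0, 0.0]` scores the same against the query as `[1.0, 0.0, 0.0]` would.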


In [17]:
# Ollama 
from langchain_core.output_parsers import StrOutputParser
from langchain.prompts import ChatPromptTemplate
from langchain_ollama.chat_models import ChatOllama
from langchain_core.runnables import RunnablePassthrough

template = """Answer the question based only on the following context:
{context}

Question: {question}
"""
prompt = ChatPromptTemplate.from_template(template)

ollama_llm = "llama3.2"
model_local = ChatOllama(model=ollama_llm)

chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | model_local
    | StrOutputParser()
)
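
In the chain above the retriever's output, a list of document objects, is placed directly into `{context}`, so the prompt receives their string representation rather than clean text. A common pattern is to join the `page_content` of each document first. The sketch below shows that step in isolation. `FakeDocument` is a hypothetical stand-in for a retrieved document, and `format_docs` is an assumed helper name.

```python
from dataclasses import dataclass


@dataclass
class FakeDocument:
    """Hypothetical stand-in for a retrieved document; only page_content is used."""
    page_content: str


def format_docs(docs):
    """Join the text of the retrieved documents into a single context string."""
    return "\n\n".join(doc.page_content for doc in docs)


retrieved = [FakeDocument("Chunk one."), FakeDocument("Chunk two.")]
print(format_docs(retrieved))
```

In the chain, this would probably be wired in as `"context": retriever | format_docs`, since LCEL coerces plain functions into runnables. That wiring has not been tested against this notebook.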

## 2. Questions

In [18]:
chain.invoke("What would you do if you would like to progress in your data science career?")

'While the provided context doesn\'t directly answer the question, it does provide some information about Towards Data Science and Dr. Varshita Sher. Based on this context, I can make an educated guess.\n\nTo progress in a data science career, one might consider:\n\n1. Following Towards Data Science to stay up-to-date with the latest trends and developments in the field.\n2. Reading articles by Dr. Varshita Sher, who is described as an FTSE 100 Tech Leader, Data Science & Generative AI expert, Oxford Alumni, and a top writer on Medium.\n3. Possibly engaging with her responses (although only one response is available in this context).\n4. Exploring the "About", "Careers", "Press", "Blog", "Privacy", "Terms", "Text to speech", and "Teams" sections of the website for more information.\n\nHowever, without more specific information about Dr. Sher\'s advice or guidance on data science career development, it is difficult to provide a more concrete answer.'

In [19]:
chain.invoke("Would you insult your boss if you wanted to improve in your data science career?")

"No, the context does not suggest that insulting one's boss is a viable or recommended way to improve in their data science career. In fact, the text emphasizes the benefits of understanding numbers and contributing to the company's financial success through open communication, which implies building positive relationships with supervisors. Insulting one's boss would likely have the opposite effect."

In [20]:
chain.invoke("Is it a good idea to be a good communicator?")

"Yes, according to the context provided, being a good communicator is crucial for success in a leadership role that involves managing teams, data storytelling, and cross-team collaboration. It's implied that having strong soft communication skills can make a big difference in pitching ideas, communicating insights, motivating team members, and succeeding in general."

## 3. Creating the GUI

In [21]:
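def search_chroma(query, top_k):
    # Despite the name, this queries the MongoDB Atlas vector store defined above.
    try:
        # Retrieve the top_k most similar chunks and use them as the prompt context
        results = vector_store.similarity_search(query, k=int(top_k))
        context = "\n\n".join(doc.page_content for doc in results)
        answer_chain = prompt | model_local | StrOutputParser()
        return answer_chain.invoke({"context": context, "question": query})
    except Exception as e:
        return f"Error: {e}"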

In [22]:
import gradio as gr
with gr.Blocks() as demo:
    gr.Markdown("### MongoDB Atlas Vector Search")
    
    with gr.Row():
        query_input = gr.Textbox(label="Enter Your Query", placeholder="Type your question here...")
        top_k_input = gr.Slider(1, 10, step=1, value=5, label="Number of Results")

    search_button = gr.Button("Search")
    output_box = gr.Textbox(label="Search Results", lines=15)
    search_button.click(fn=search_chroma, inputs=[query_input, top_k_input], outputs=output_box)

demo.launch()

* Running on local URL:  http://127.0.0.1:7860

To create a public link, set `share=True` in `launch()`.


