# Implement RAG using a website and watsonx.ai 

In the lab **"Implement RAG Use Cases in watsonx.ai"** we looked at how to implement a RAG use case with `.pdf` and `.txt` files as our source. In this example we instead source our content by scraping a given website URL.

The main difference from the previous example is how data is sourced for our embeddings. We'll use the open source libraries [BeautifulSoup](https://beautiful-soup-4.readthedocs.io/en/latest/) and [spaCy](https://spacy.io/) to get data from a web page.

To get started we'll first make sure that the necessary dependencies are installed to run this notebook.
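
If you want to see which of the pinned packages are already present before installing, the following is a minimal sketch that uses only the standard library. The helpers `installed_version` and `check_pins` are hypothetical names introduced for this illustration, and the version pins simply mirror the install cell in this notebook.

```python
from importlib import metadata

# Pinned versions copied from this notebook's install cell.
PINNED = {
    "chromadb": "0.4.22",
    "ibm_watson_machine_learning": "1.0.342",
    "langchain": "0.1.3",
    "langchain_community": "0.0.15",
    "beautifulsoup4": "4.12.3",
    "spacy": "3.7.2",
}

def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None

def check_pins(pins):
    """Map each package to 'ok', 'missing', or 'found X, want Y'."""
    report = {}
    for package, wanted in pins.items():
        found = installed_version(package)
        if found is None:
            report[package] = "missing"
        elif found == wanted:
            report[package] = "ok"
        else:
            report[package] = f"found {found}, want {wanted}"
    return report

for package, status in check_pins(PINNED).items():
    print(f"{package}: {status}")
```

Anything reported as `missing` or with a version mismatch will be handled by the install cell.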

Go ahead and run the following code cell. **This may take a few seconds to complete.**

In [None]:
# Install dependencies
import sys
!{sys.executable} -m pip install -q chromadb==0.4.22
!{sys.executable} -m pip install -q ibm_watson_machine_learning==1.0.342
!{sys.executable} -m pip install -q langchain==0.1.3
!{sys.executable} -m pip install -q langchain_community==0.0.15
!{sys.executable} -m pip install -q beautifulsoup4==4.12.3
!{sys.executable} -m pip install -q spacy==3.7.2

!{sys.executable} -m spacy download en_core_web_md


## Bring in dependencies

In this next code cell we'll bring in all the dependencies we'll need for later use.

Go ahead and run the following code cell. **There should be no output.**

In [27]:
# Bring in dependencies
# SQLite fix: https://docs.trychroma.com/troubleshooting#sqlite
# __import__('pysqlite3')
# import sys
# sys.modules['sqlite3'] = sys.modules.pop('pysqlite3')

# WML python SDK
from ibm_watson_machine_learning.foundation_models import Model
from ibm_watson_machine_learning.metanames import GenTextParamsMetaNames as GenParams
from ibm_watson_machine_learning.foundation_models.utils.enums import ModelTypes, DecodingMethods

import requests
from bs4 import BeautifulSoup
import spacy
import chromadb
import en_core_web_md

nlp = spacy.load("en_core_web_md")


## Some important variables

In this next code cell you'll define some variables that are used to interact with your instance of watsonx.ai.
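
Hard-coding an API key in a notebook makes it easy to leak when the notebook is shared. One alternative, sketched below, reads the same three values from environment variables and falls back to the placeholder strings. The variable names `WATSONX_PROJECT_ID`, `WATSONX_API_KEY`, and `WATSONX_URL` and the helpers `load_watsonx_settings` and `missing_settings` are assumptions chosen for this illustration, not names defined by watsonx.ai or its SDK.

```python
import os

def load_watsonx_settings(env=None):
    """Read connection settings from environment variables (hypothetical names)."""
    env = os.environ if env is None else env
    return {
        "project_id": env.get("WATSONX_PROJECT_ID", "PASTE_PROJECT_ID_HERE"),
        "api_key": env.get("WATSONX_API_KEY", "PASTE_API_KEY_HERE"),
        "url": env.get("WATSONX_URL", "https://us-south.ml.cloud.ibm.com"),
    }

def missing_settings(settings):
    """List the settings that still hold a PASTE_..._HERE placeholder."""
    return [name for name, value in settings.items() if value.startswith("PASTE_")]

settings = load_watsonx_settings()
unset = missing_settings(settings)
if unset:
    print(f"Still using placeholders for: {', '.join(unset)}")
```

The returned values could then be assigned to `watsonx_project_id`, `api_key`, and `instance_url`.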

Go ahead and run the following code cell. **There should be no output.**

In [28]:
# Update the global variables that will be used for authentication in another function
watsonx_project_id = "PASTE_PROJECT_ID_HERE"
api_key = "PASTE_API_KEY_HERE"
instance_url = "https://us-south.ml.cloud.ibm.com"


## Understanding the code

In this next code cell we'll create some functions that make it easier to interact with watsonx.ai. These functions are `get_model`, `extract_text`, `split_text_into_sentences`, `create_embedding`, and `create_prompt`:

- `get_model`: Creates a model object that will be used to invoke the LLM
- `extract_text`: Pulls the text from a given web page so that it can be embedded
- `split_text_into_sentences`: Splits the extracted text into individual sentences and strips surrounding whitespace from each
- `create_embedding`: Loads text data from a given URL into the in-memory `chromadb` instance
- `create_prompt`: Generates the prompt that is sent to the watsonx.ai API
   - Notice that in the beginning of the function we query the vector database to retrieve information that’s related to our question (semantic search).
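
To make the semantic search step less abstract, here is a toy sketch of what a vector query does: turn each sentence and the question into vectors, then rank the sentences by cosine similarity to the question. It deliberately swaps the learned embedding model that Chroma uses for simple word counts, so it runs with the standard library alone. The function names and sample sentences are invented for the illustration, and real embeddings match on meaning rather than only on shared words.

```python
import math
import re
from collections import Counter

def vectorize(text):
    """Toy 'embedding': lowercase word counts (a stand-in for a learned model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_matches(question, sentences, n_results=2):
    """Return the n_results sentences most similar to the question."""
    query_vector = vectorize(question)
    ranked = sorted(
        sentences,
        key=lambda s: cosine_similarity(query_vector, vectorize(s)),
        reverse=True,
    )
    return ranked[:n_results]

# Invented sample sentences, used only to exercise the ranking.
sentences = [
    "Prompt Lab lets you experiment with prompts for foundation models.",
    "The studio also includes tools for tuning models.",
    "Pricing depends on the plan you choose.",
]
print(top_matches("What is Prompt Lab?", sentences, n_results=1))
```

In the notebook, `collection.query(query_texts=[question], n_results=5)` plays the role of `top_matches`, with the vector database handling both the embedding and the ranking.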

Go ahead and run the following code cell. **There should be no output.**

In [29]:
def get_model(model_type, max_tokens, min_tokens, decoding, temperature, top_k, top_p):
    generate_params = {
        GenParams.MAX_NEW_TOKENS: max_tokens,
        GenParams.MIN_NEW_TOKENS: min_tokens,
        GenParams.DECODING_METHOD: decoding,
        GenParams.TEMPERATURE: temperature,
        GenParams.TOP_K: top_k,
        GenParams.TOP_P: top_p,
    }

    model = Model(
        model_id=model_type,
        params=generate_params,
        credentials={
            "apikey": api_key,
            "url": instance_url
        },
        project_id=watsonx_project_id
    )
    
    return model

def extract_text(url):
    try:
        # Send an HTTP GET request to the URL
        # A timeout keeps the cell from hanging if the site never responds
        response = requests.get(url, timeout=30)

        # Check if the request was successful
        if response.status_code == 200:
            # Parse the HTML content of the page using BeautifulSoup
            soup = BeautifulSoup(response.text, 'html.parser')

            # Extract contents of <p> elements
            p_contents = [p.get_text() for p in soup.find_all('p')]

            # Print the contents of <p> elements
            print("\nContents of <p> elements: \n")
            for content in p_contents:
                print(content)
            raw_web_text = " ".join(p_contents)
            # Remove \xa0 (non-breaking space), which HTML uses to keep words from breaking across lines.
            cleaned_text = raw_web_text.replace("\xa0", " ")
            return cleaned_text

        else:
            print(f"Failed to retrieve the page. Status code: {response.status_code}")

    except Exception as e:
        print(f"An error occurred: {str(e)}")

def split_text_into_sentences(text):
    doc = nlp(text)
    sentences = [sent.text for sent in doc.sents]
    # Strip surrounding whitespace and drop sentences that end up empty
    cleaned_sentences = [s.strip() for s in sentences if s.strip()]
    return cleaned_sentences

def create_embedding(url, collection_name):
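    cleaned_text = extract_text(url)
    # extract_text returns None when the request fails, so stop early with a clear message
    if not cleaned_text:
        raise ValueError(f"No text could be extracted from {url}")
    cleaned_sentences = split_text_into_sentences(cleaned_text)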

    client = chromadb.Client()

    collection = client.get_or_create_collection(collection_name)

    # Upload text to chroma
    collection.upsert(
        documents=cleaned_sentences,
        metadatas=[{"source": str(i)} for i in range(len(cleaned_sentences))],
        ids=[str(i) for i in range(len(cleaned_sentences))],
    )

    return collection

def create_prompt(url, question, collection_name):
    # Create embeddings for the text scraped from the URL
    collection = create_embedding(url, collection_name)

    # query relevant information
    relevant_chunks = collection.query(
        query_texts=[question],
        n_results=5,
    )
    context = "\n\n\n".join(relevant_chunks["documents"][0])
    # Please note that this is a generic format. You can change this format to be specific to llama
    prompt = (f"{context}\n\nPlease answer the following question in one sentence using this "
              "text. "
              "If the question is unanswerable, say \"unanswerable\". "
              "Do not include information that's not relevant to the question.\n\n"
              f"Question: {question}")

    return prompt


## Gluing it together

The next function we create, `answer_questions_from_web`, combines the previous five functions. This is the wrapper that we will call when we want to interact with watsonx.ai.
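
One detail worth knowing when choosing parameters for this wrapper: with greedy decoding the model always picks the most likely next token, so sampling controls such as temperature, top-k, and top-p generally have no effect. The sketch below builds a parameter dictionary that includes those controls only in sampling mode. The plain string keys and the values `"greedy"` and `"sample"` are assumptions standing in for the `GenParams` and `DecodingMethods` constants, and `build_generate_params` is a hypothetical helper, not part of the SDK.

```python
def build_generate_params(decoding, max_tokens, min_tokens,
                          temperature=None, top_k=None, top_p=None):
    """Build generation parameters, adding sampling controls only when sampling."""
    if decoding not in ("greedy", "sample"):
        raise ValueError(f"Unknown decoding method: {decoding!r}")
    if min_tokens > max_tokens:
        raise ValueError("min_tokens must not exceed max_tokens")

    # Keys are assumed to match the string values behind the GenParams constants.
    params = {
        "decoding_method": decoding,
        "max_new_tokens": max_tokens,
        "min_new_tokens": min_tokens,
    }
    if decoding == "sample":
        sampling = {"temperature": temperature, "top_k": top_k, "top_p": top_p}
        params.update({k: v for k, v in sampling.items() if v is not None})
    return params

print(build_generate_params("greedy", 100, 50, temperature=0.7, top_k=50, top_p=1))
print(build_generate_params("sample", 100, 50, temperature=0.7, top_k=50, top_p=1))
```

Note too that a minimum of 50 new tokens asks for a fairly long answer even though the prompt requests a single sentence; lowering `min_tokens` may give tighter answers.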

Go ahead and run the following code cell. **There should be no output.**


In [30]:
def answer_questions_from_web(url, question, collection_name):
    # Specify model parameters
    model_type = "meta-llama/llama-2-70b-chat"
    max_tokens = 100
    min_tokens = 50
    top_k = 50
    top_p = 1
    decoding = DecodingMethods.GREEDY
    temperature = 0.7

    # Get the watsonx.ai model
    model = get_model(model_type, max_tokens, min_tokens, decoding, temperature, top_k, top_p)

    # Get the prompt
    complete_prompt = create_prompt(url, question, collection_name)

    generated_response = model.generate(prompt=complete_prompt)
    response_text = generated_response['results'][0]['generated_text']

    # Remove leading and trailing whitespace
    response_text = response_text.strip()

    # Print the model response
    print("--------------------------------- Generated response -----------------------------------")
    print(response_text)
    print("*********************************************************************************************")

    return response_text


## Answering some questions

The next code cell uses all the code we've created so far to source information from the web page and ask a question about it using watsonx.ai (notice that the cell also displays the value returned by `answer_questions_from_web`).

To do so we'll pass in the web URL we want to reference, the question we want to ask about it, and finally the name of the collection where the embeddings are stored.
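
Because `create_embedding` numbers its sentences `0`, `1`, `2`, and so on and writes them with `upsert`, loading a second URL into the same collection name within one notebook session is likely to overwrite the entries with matching IDs and leave a mix of both pages behind. One way to avoid that, sketched here, is to derive the collection name or the IDs from the URL. The helpers `url_digest`, `collection_name_for`, and `chunk_ids_for` are hypothetical, the second URL is a placeholder, and the 3 to 63 character bound checked in the usage is an assumption about Chroma's collection naming rules.

```python
import hashlib

def url_digest(url, length=16):
    """Short, stable hex digest of a URL."""
    return hashlib.sha256(url.encode("utf-8")).hexdigest()[:length]

def collection_name_for(url, prefix="web"):
    """Derive a per-URL collection name made of the prefix and a hex digest."""
    return f"{prefix}_{url_digest(url)}"

def chunk_ids_for(url, count):
    """IDs that stay unique when several pages share one collection."""
    digest = url_digest(url)
    return [f"{digest}-{i}" for i in range(count)]

url_a = "https://www.ibm.com/products/watsonx-ai"
url_b = "https://example.com/other-page"  # placeholder URL for the illustration

print(collection_name_for(url_a))
print(chunk_ids_for(url_a, 3))
```

Either approach keeps the sentences from different pages apart; a separate collection per URL is the simpler option when each question targets a single page.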

Go ahead and run the next code cell. **You will see output from this cell**

In [None]:
# Try different URLs and questions
web_url = "https://www.ibm.com/products/watsonx-ai"
question = "What is Prompt Lab?"
collection_name = "test_web_RAG"

answer_questions_from_web(web_url, question, collection_name)
