# News of the Day

This notebook is a modified version of the [example](https://github.com/Unstructured-IO/unstructured/blob/main/examples/chroma-news-of-the-day/news-of-the-day.ipynb) published by Unstructured.IO. The intent is to show how that example can be easily integrated with the services offered by [watsonx](https://www.ibm.com/watsonx) to create a robust RAG pipeline.

In this notebook, we'll show how to use [Unstructured.IO](https://unstructured.io/), [ChromaDB](https://www.trychroma.com/), [LangChain](https://github.com/langchain-ai/langchain), Hugging Face and IBM [watsonx](https://www.ibm.com/watsonx) to summarize topics from the front page of CNN Lite and use them as context for a grounded LLM interaction.

To get more relevant answers, we retrieve the full set of the latest news from the site and then run a specific retriever implementation named [Contextual Compression](https://python.langchain.com/docs/modules/data_connection/retrievers/contextual_compression). Instead of passing the documents most similar to the question to the LLM as-is, we first ask the LLM to compress them using the context of the question, so that only the relevant information is used to generate the answer.
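
To make the idea concrete, here is a toy sketch of contextual compression in plain Python. It is not LangChain's implementation: the keyword-overlap filter `compress` is a hypothetical stand-in for the LLM call, and the sample documents are made up. It only illustrates that each retrieved document is reduced to the parts relevant to the question before the answer is generated.

```python
import re

def keywords(text):
    """Lowercased words longer than three characters, as a crude relevance signal."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if len(w) > 3}

def compress(document, question):
    """Keep only the sentences that share at least one keyword with the question."""
    wanted = keywords(question)
    sentences = re.split(r"(?<=[.!?])\s+", document.strip())
    return " ".join(s for s in sentences if keywords(s) & wanted)

docs = [
    "Kyle Larson won the NASCAR race in Kansas. The weather was sunny all weekend.",
    "Stocks closed higher on Friday. Analysts expect more gains.",
]
question = "Who won the NASCAR race?"

# Documents with no relevant sentence compress to an empty string and are dropped.
compressed = [c for c in (compress(d, question) for d in docs) if c]
print(compressed)
```

In the notebook itself, the LLM plays the role of `compress`, which lets it judge relevance by meaning rather than by shared words.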

## Document printing helper

We start by defining a print helper function. It will come in handy throughout the notebook for examining the documents extracted from the site and how they are modified.

In [23]:
def pretty_print_docs(docs):
    print(
        f"\n{'-' * 100}\n".join(
            [f"Document {i+1}: \n\n" + d.page_content for i, d in enumerate(docs)]
        )
    )

## STEP 1: Gather links with `unstructured`

First, we'll gather links from the [CNN Lite](https://lite.cnn.com/) homepage using the `partition_html` function from `unstructured`. When `unstructured` partitions HTML pages, links are included in the metadata for each element, making link collection a simple task. 

**NOTE: if you are on a Mac and encounter the following error message when calling `partition_html`:**

```
[nltk_data] Error loading url: <urlopen error [SSL:
[nltk_data] CERTIFICATE_VERIFY_FAILED] certificate verify failed:
[nltk_data] unable to get local issuer certificate (_ssl.c:992)>
```

You might want to try issuing the following command on a terminal (you should point to the Python version you are using):

```
sh "/Applications/Python 3.10/Install Certificates.command"
```


In [24]:
from unstructured.partition.html import partition_html
cnn_lite_url = "https://lite.cnn.com/"
elements = partition_html(url=cnn_lite_url)

links = []

for element in elements:
    if element.metadata.link_urls:
        relative_link = element.metadata.link_urls[0][1:]
        if relative_link.startswith("2024"):
            links.append(f"{cnn_lite_url}{relative_link}")

print(f"We retrieved {len(links)} links to documents from {cnn_lite_url}")

We retrieved 95 links to documents from https://lite.cnn.com/
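
The cell above hard-codes `"2024"` as the link prefix, so it stops matching once the year changes. A minimal sketch of one way around that follows. It assumes CNN Lite article paths keep starting with the publication year, as they do in the run shown here; `article_links` is a hypothetical helper and the sample paths are made up for illustration.

```python
from datetime import date

def article_links(base_url, link_urls, year=None):
    """Build absolute article URLs from relative links such as '/2024/05/06/...'."""
    prefix = f"/{year if year is not None else date.today().year}/"
    base = base_url.rstrip("/")
    seen, result = set(), []
    for link in link_urls:
        # Keep only article paths for the chosen year, skipping duplicates.
        if link.startswith(prefix) and link not in seen:
            seen.add(link)
            result.append(f"{base}{link}")
    return result

sample = ["/2024/05/06/sport/nascar-finish", "/terms", "/2024/05/06/sport/nascar-finish"]
print(article_links("https://lite.cnn.com/", sample, year=2024))
```

When `year` is omitted, the helper falls back to the current year, which is the behavior you would likely want when rerunning the notebook later.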


## Ingest individual articles with `UnstructuredURLLoader`

Now that we have the links, we can preprocess individual news articles with `UnstructuredURLLoader`. `UnstructuredURLLoader` fetches content from the web and then uses the `unstructured` `partition` function to extract content and metadata. In this example we preprocess HTML files, but it works with other response types such as `application/pdf` as well. After calling `.load()`, the result is a list of `langchain` `Document` objects.

**NOTE: `UnstructuredURLLoader` requires the binary packages for the Python libmagic library to be present in your OS.** 
If you encounter an error related to missing libmagic packages, try the following:

#### Windows

Uninstall preinstalled python-magic libraries:

    pip uninstall python-libmagic
    pip uninstall python-magic 

Install python-magic-bin instead:

    pip install python-magic-bin

#### macOS

Install libmagic with MacPorts from a terminal:

    sudo port install libmagic

If you don't have MacPorts installed on your Mac, install it from the official site: https://www.macports.org/install.php.


In [25]:
from langchain.document_loaders import UnstructuredURLLoader
loaders = UnstructuredURLLoader(urls=links[:200], show_progress_bar=True)

docs = loaders.load()

100%|██████████| 95/95 [00:18<00:00,  5.13it/s]


In [26]:
print(f"We retrieved {len(docs)} links to documents from {cnn_lite_url}")

docs[0]

We retrieved 95 links to documents from https://lite.cnn.com/


Document(page_content='CNN\n\n5/6/2024\n\nA 10-month-old girl is missing after police discovered two women dead and a 5-year-old injured in a New Mexico park\n\nBy Paradise Afshar, CNN\n\nUpdated: \n        7:06 AM EDT, Mon May 6, 2024\n\nSource: CNN\n\nAuthorities in New Mexico are searching for a 10-month-old girl they say was kidnapped from a park where her mother and another women were found dead and the infant’s 5-year-old sister was found injured.\n\n“Investigators believe Eleia Maria Torres has been abducted by the perpetrator of this crime and is in immediate danger,” the Clovis Police Department said in a news release.\n\nEleia has brown hair and brown eyes, according to an Amber Alert notice.\n\nPolice discovered the infant was missing after responding to a call shortly before 4:30 p.m. Friday about two women found dead at Ned Houk Park near Clovis, a city in eastern New Mexico that is more than 200 miles east of Albuquerque and about 100 miles southwest of Amarillo, Texas.\n

## Import credentials and helper functions
To access watsonx.ai, we first load the externally stored credentials.
We also import the HAP (hate, abuse, and profanity) helper functions that we use to filter HAP content out of our RAG system.

N.B. To use models within watsonx, you need to set up the following environment variables for your watsonx.ai instance:
- <b>WATSONX_URL_LLM</b> and <b>WATSONX_URL_LANGCHAIN</b>: URLs for accessing the watsonx platform. We need two separate specifications of the URL: one for the calls managed by the watsonx library and one for the calls managed by the LangChain library.
- <b>WATSONX_API_KEY</b>: the API key from your IBM Cloud account. A detailed procedure for creating an API key can be found at the link provided at the end of this cell.
- <b>PRJ_ID</b>: the ID of the project created on the watsonx.ai platform to run this notebook.

Information on how to find/create these variables can be found here: https://dataplatform.cloud.ibm.com/docs/content/wsj/analyze-data/fm-credentials.html?context=wx&audience=wdp.

In [None]:
#Import credentials
from dotenv import load_dotenv
load_dotenv()

from os import environ
credentials = {
    "url_LLM": environ.get("WATSONX_URL_LLM"),
    "url_LANGCHAIN": environ.get("WATSONX_URL_LANGCHAIN"),
    "apikey": environ.get("WATSONX_API_KEY")
}
project_id = environ.get("PRJ_ID")
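
Because `environ.get` silently returns `None` for an unset variable, a missing credential only surfaces later as a less obvious error. The sketch below shows one way to check up front. `missing_env_vars` is a hypothetical helper, and the variable names are the ones read by the cell above.

```python
import os

REQUIRED_VARS = ("WATSONX_URL_LLM", "WATSONX_URL_LANGCHAIN", "WATSONX_API_KEY", "PRJ_ID")

def missing_env_vars(names=REQUIRED_VARS, env=None):
    """Return the names that are unset or empty in the given mapping."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]

# Example with an explicit mapping instead of the real environment.
print(missing_env_vars(env={"WATSONX_API_KEY": "secret", "PRJ_ID": ""}))
```

In the notebook, calling `missing_env_vars()` right after `load_dotenv()` and raising an error when the list is non-empty would fail fast with a clear message.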

To keep HAP language out of our knowledge base, we use the helper function defined in hap_utilities.py, which uses IBM Granite Guardian to identify potentially harmful language and IBM Granite v2 in watsonx.ai to rewrite it. This approach preserves as much information as possible in our knowledge base while keeping it safe.

In [None]:
# Import HAP functions
from hap_utilities import clean_hap_content
clean_hap_content(docs, credentials, project_id)

## Generate embeddings for the extracted articles

Before loading our knowledge base into a suitable vector DB, we need to generate embeddings for our articles.
Embeddings are a type of transformation that helps computers understand the meaning of words and phrases in a text by converting them into a continuous vector space. This makes it easier for computers to learn and make sense of complex relationships between different concepts.

In generative AI pipelines, embeddings are essential because they allow models to capture the semantic meaning of text data. By mapping words and phrases to vectors, embeddings help models maintain context across sentences and documents, making it possible to generate new text that is both coherent and relevant.
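
As a small illustration of how vectors capture closeness in meaning, the sketch below compares made-up three-dimensional "embeddings" with cosine similarity. The numbers are invented for the example. Real embedding models produce vectors with hundreds of dimensions, and the vector store may use a different distance measure.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Invented vectors: the first two point in a similar direction, the third does not.
race_report = [0.9, 0.1, 0.0]
race_question = [0.8, 0.2, 0.1]
stock_report = [0.0, 0.2, 0.9]

closer_to_race = (
    cosine_similarity(race_question, race_report)
    > cosine_similarity(race_question, stock_report)
)
print(closer_to_race)
```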

To generate embeddings for our articles, we use the `IBM_SLATE_30M_ENG` model from the watsonx.ai model library.

In [27]:
from langchain_ibm import WatsonxEmbeddings
from ibm_watsonx_ai.foundation_models.utils.enums import EmbeddingTypes

embeddings = WatsonxEmbeddings(
    model_id=EmbeddingTypes.IBM_SLATE_30M_ENG.value,
    url=credentials["url_LANGCHAIN"],
    apikey=credentials["apikey"],
    project_id=project_id
    )

## Load documents into ChromaDB

With the documents preprocessed and the embedding model configured, we're now ready to load them into ChromaDB using the Chroma integration in LangChain. We first split the documents into chunks of up to 1,000 characters. Once the chunks are in Chroma, we can perform a similarity search to retrieve those related to our topic of interest. In the search further down, we limit the results to 10.

In [28]:
# Split the documents into chunks
from langchain.text_splitter import CharacterTextSplitter
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_documents(docs)

from langchain.vectorstores.chroma import Chroma
vectorstore = Chroma.from_documents(texts, embeddings)
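
To give a feel for what the splitting step does, here is a simplified approximation in plain Python. It is not the `CharacterTextSplitter` source: it assumes a blank-line separator, ignores overlap, and `split_on_paragraphs` is a hypothetical name. The idea is to greedily pack paragraphs into chunks that stay within a character budget.

```python
def split_on_paragraphs(text, chunk_size=1000, separator="\n\n"):
    """Greedily pack paragraphs into chunks of at most chunk_size characters.

    A paragraph longer than chunk_size is kept whole, so a chunk can exceed the limit.
    """
    chunks, current = [], ""
    for paragraph in text.split(separator):
        candidate = paragraph if not current else current + separator + paragraph
        if len(candidate) <= chunk_size or not current:
            current = candidate
        else:
            # The next paragraph does not fit: close the chunk and start a new one.
            chunks.append(current)
            current = paragraph
    if current:
        chunks.append(current)
    return chunks

article = "CNN\n\n5/6/2024\n\nKyle Larson wins by 0.001 seconds\n\nBy Thomas Schlachter, CNN"
print(split_on_paragraphs(article, chunk_size=20))
```

With the tiny budget of 20 characters used here, the two short header paragraphs share a chunk, while each longer paragraph ends up in its own chunk.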

We are now ready to receive a question from the user and store it in the variable named `text`. Note that the ability to answer a specific question depends on the content retrieved from the CNN Lite website during a given run, so you may need to change the question.

In [29]:
#text = input("Type what you want to search:")
text = "Who won the cup race today at NASCAR?"

We now perform a first similarity search, retrieve the 10 most similar chunks, and print them. We expect this search to also return documents that are not directly related to the user's question, because the results are only ranked by their distance from the query.

In [30]:
query_docs = vectorstore.similarity_search(text, k=10)

pretty_print_docs(query_docs)

Document 1: 

CNN

5/6/2024

Kyle Larson wins by 0.001 seconds in closest finish in NASCAR Cup Series history

By Thomas Schlachter, CNN

Updated: 
        5:54 AM EDT, Mon May 6, 2024

Source: CNN

Capping off a weekend of photo finishes, Kyle Larson won NASCAR’s AdventHealth 400 by a staggering 0.001 seconds in Kansas on Sunday.

Unlike the Kentucky Derby won by Mystik Dan on Saturday, not even the naked eye could separate who finished first out of Larson and eventual second-place finisher Chris Buescher in a photo finish.

The finish was so close that Roush Fenway Keselowski Racing (RFK Racing), Buescher’s team, and commentators on the broadcast originally believed Buescher had done enough to secure victory.

The RFK team could be seen on the broadcast jubilantly celebrating before the official result was finalized.

Eventually, Hendrick Motorsports’ Larson was declared winner in the closest finish in NASCAR Cup Series history, per NASCAR.
-------------------------------------------
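
The reason unrelated documents show up is that a top-k search always returns k results, however far away they are. The toy sketch below illustrates this with invented two-dimensional vectors and squared Euclidean distance. The distance measure is an assumption made for the example, and `top_k` is a hypothetical helper, not a Chroma API.

```python
import heapq

def top_k(query_vec, doc_vecs, k):
    """Return the ids of the k documents closest to the query."""
    def distance(item):
        _, vec = item
        return sum((q - d) ** 2 for q, d in zip(query_vec, vec))
    return [doc_id for doc_id, _ in heapq.nsmallest(k, doc_vecs.items(), key=distance)]

# Invented vectors: only "nascar" is really close to the query.
doc_vecs = {"nascar": [0.9, 0.1], "derby": [0.6, 0.4], "stocks": [0.0, 1.0]}
nearest = top_k([1.0, 0.0], doc_vecs, k=2)
print(nearest)
```

Asking for two results returns the second-closest document as well, even though it is much further from the query than the first.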

## Compress and Summarize the Documents

After retrieving relevant documents from Chroma, we're ready to compress and then summarize them! There are multiple ways to accomplish this in `langchain`, but `ContextualCompressionRetriever` and `load_summarize_chain` are quite straightforward. In this case we use the IBM Granite model through the available LangChain wrapper so that it integrates easily with our summarization chain. Here we limit the summary to snippets related to our topic of choice.

To use the IBM Granite model within watsonx, we rely on the environment variables set earlier in the notebook.

In [31]:
from ibm_watsonx_ai.metanames import GenTextParamsMetaNames as GenParams
from ibm_watsonx_ai.foundation_models.utils.enums import DecodingMethods
from langchain_ibm import WatsonxLLM

granite = WatsonxLLM(
    model_id='ibm/granite-13b-chat-v2',
    url=credentials["url_LANGCHAIN"],  # the credentials dict defines no plain "url" key
    apikey=credentials["apikey"],
    project_id=project_id,
    params= {
        GenParams.DECODING_METHOD: DecodingMethods.SAMPLE.value,
        GenParams.MAX_NEW_TOKENS: 1024,
        GenParams.MIN_NEW_TOKENS: 1,
        GenParams.TEMPERATURE: 0.5,
        GenParams.TOP_K: 50,
        GenParams.TOP_P: 1
    }
)

Here we compress the retrieved documents by leveraging the LLM again before passing them into the chain that generates the answer.

In [32]:
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor   

compressor = LLMChainExtractor.from_llm(granite)
compression_retriever = ContextualCompressionRetriever(base_compressor=compressor, base_retriever=vectorstore.as_retriever())

compressed_docs = compression_retriever.get_relevant_documents(text)
pretty_print_docs(compressed_docs)



Document 1: 

Kyle Larson wins by 0.001 seconds in closest finish in NASCAR Cup Series history

By Thomas Schlachter, CNN

Updated: 
        5:54 AM EDT, Mon May 6, 2024

Source: CNN

Capping off a weekend of photo finishes, Kyle Larson won NASCAR’s AdventHealth 400 by a staggering 0.001 seconds in Kansas on Sunday.

Unlike the Kentucky Derby won by Mystik Dan on Saturday, not even the naked eye could separate who finished first out of Larson and eventual second-place finisher Chris Buescher in a photo finish.

The finish was so close that Roush Fenway Keselowski Racing (RFK Racing), Buescher’s team, and commentators on the broadcast originally believed Buescher had done enough to secure victory.

The RFK team could be seen on the broadcast jubilantly celebrating before the official result was finalized.

Eventually, Hendrick Motorsports’ Larson was declared winner in the closest finish in NASCAR Cup Series history, per NASCAR.
----------------------------------------------------------

Now we need to set up our summarization chain. The basic chain types are "stuff" (all documents are provided as context in a single prompt that is passed to the LLM) and "map_reduce" (each document is summarized separately, and those summaries are then provided as context to the LLM).

To avoid exceeding the number of tokens the LLM can process, we could opt for a map-reduce chain (more information on summarization in LangChain can be found at https://python.langchain.com/docs/use_cases/summarization). Since we already applied compression, we can stick with the "stuff" type.
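
The trade-off above can be sketched as a simple size check. This is a rough heuristic under stated assumptions, not part of LangChain: `choose_chain_type` is a hypothetical helper, the four-characters-per-token ratio is only a rule of thumb, and the token budget is a placeholder that you would replace with the limit of the model you use.

```python
def choose_chain_type(texts, max_context_tokens, chars_per_token=4):
    """Pick "stuff" when the rough token estimate fits the budget, else "map_reduce"."""
    estimated_tokens = sum(len(t) for t in texts) // chars_per_token
    return "stuff" if estimated_tokens <= max_context_tokens else "map_reduce"

# A handful of short, compressed snippets fits comfortably in a single prompt.
small = choose_chain_type(["short compressed snippet"] * 4, max_context_tokens=8192)
# Many long, uncompressed documents do not.
large = choose_chain_type(["x" * 10_000] * 10, max_context_tokens=8192)
print(small, large)
```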

We also define a prompt to pass to the invocation along with the input documents produced by the retriever and the compressor.

In [33]:
from langchain.chains.summarize import load_summarize_chain
chain = load_summarize_chain(granite, chain_type="stuff")

# Named chain_input to avoid shadowing Python's built-in input()
chain_input = {
    "prompt" : "You are a AI language model designed to function as a specialized Retrieval Augmented Generation (RAG) assistant. When generating responses, prioritize correctness, i.e., ensure that your response is correct given the context and user query, and that it is grounded in the context. Furthermore, make sure that the response is supported by the given document or context. When the question cannot be answered using the context or document, output the following response: 'I'm sorry, I don't know.' Always make sure that your response is relevant to the question. If an explanation is needed, first provide the explanation or reasoning, and then give the final answer.",
    "input_documents" : compressed_docs
}
print(chain.invoke(chain_input)['output_text'])

 Kyle Larson won the NASCAR AdventHealth 400 by a margin of only 0.001 seconds over Chris Buescher, the closest finish in the history of the NASCAR Cup Series. The race was filled with photo finishes, and Buescher's team initially celebrated believing they had won. Larson, however, was declared the winner after the official results were finalized.


Check out IBM _<a href="https://ibm.github.io/watsonx-ai-python-sdk/samples.html" target="_blank" rel="noopener noreferrer">Online Documentation</a>_ for more samples, tutorials, documentation, how-tos, and blog posts.