# News of the Day

In this notebook, we'll show how to use [Unstructured.IO](https://unstructured.io/), [ChromaDB](https://www.trychroma.com/), and [LangChain](https://github.com/langchain-ai/langchain) to summarize topics from the front page of CNN Lite. Without tooling from the modern LLM stack, this would have been a time-consuming project. With Unstructured, Chroma, and LangChain, the entire workflow takes fewer than two dozen lines of code.

## Gather links with `unstructured`

First, we'll gather links from the [CNN Lite](https://lite.cnn.com/) homepage using the `partition_html` function from `unstructured`. When `unstructured` partitions HTML pages, links are included in the metadata for each element, making link collection a simple task.

In [1]:
from unstructured.partition.html import partition_html

In [2]:
cnn_lite_url = "https://lite.cnn.com/"

In [3]:
elements = partition_html(url=cnn_lite_url)

In [4]:
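links = []

for element in elements:
    # link_urls holds the hrefs found in the element, if any
    if element.metadata.link_urls:
        # Drop the leading "/" so the path can be appended to cnn_lite_url
        relative_link = element.metadata.link_urls[0][1:]
        # Article paths start with the publication year; skip everything else
        if relative_link.startswith("2024"):
            links.append(f"{cnn_lite_url}{relative_link}")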

In [5]:
len(links)

98

In [6]:
links[:20]

['https://lite.cnn.com/2024/04/23/politics/biden-abortion-rights-florida/index.html',
 'https://lite.cnn.com/2024/04/23/health/cdc-health-advisory-botox-injections/index.html',
 'https://lite.cnn.com/2024/04/23/opinions/columbia-university-protests-greenblatt/index.html',
 'https://lite.cnn.com/2024/04/23/politics/overtime-pay-salaried-workers-biden/index.html',
 'https://lite.cnn.com/2024/04/20/us/2-children-killed-michigan-birthday-party/index.html',
 'https://lite.cnn.com/2024/04/23/politics/senate-vote-foreign-aid/index.html',
 'https://lite.cnn.com/2024/04/23/politics/iranians-charged-hacking-scheme/index.html',
 'https://lite.cnn.com/2024/04/23/politics/us-ukraine-military-aid-package/index.html',
 'https://lite.cnn.com/2024/04/22/us/columbia-university-protests-hybrid-classes-passover-tuesday/index.html',
 'https://lite.cnn.com/2024/04/11/style/zendaya-challengers-tennis-press-tour-fashion-cec/index.html',
 'https://lite.cnn.com/2024/04/23/us/cleveland-police-chase-settlement/in
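The loop above hard-codes the `2024` prefix and would keep any link that appears more than once on the page. Below is a minimal sketch of a slightly more defensive variant. It assumes article paths follow the `/YYYY/MM/DD/` pattern seen in the output above. `collect_article_links` is a hypothetical helper, not part of `unstructured`, and `/terms` is a made-up non-article path used only for illustration.

```python
import re
from urllib.parse import urljoin

# Assumption: article paths look like /YYYY/MM/DD/..., as in the links listed above.
ARTICLE_PATH = re.compile(r"^/\d{4}/\d{2}/\d{2}/")


def collect_article_links(link_urls, base_url):
    """Return absolute article URLs in first-seen order, without duplicates."""
    seen = set()
    article_links = []
    for href in link_urls:
        if not ARTICLE_PATH.match(href):
            continue  # skip navigation, footer, and other non-article links
        absolute = urljoin(base_url, href)
        if absolute not in seen:
            seen.add(absolute)
            article_links.append(absolute)
    return article_links


# Illustrative input only; "/terms" is a hypothetical non-article path.
sample_hrefs = [
    "/2024/04/23/politics/senate-vote-foreign-aid/index.html",
    "/terms",
    "/2024/04/23/politics/senate-vote-foreign-aid/index.html",
]
print(collect_article_links(sample_hrefs, "https://lite.cnn.com/"))
```

Using `urljoin` avoids slicing off the leading slash by hand, and the date pattern keeps working when the year changes.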

## Ingest individual articles with `UnstructuredURLLoader`

Now that we have the links, we can preprocess individual news articles with `UnstructuredURLLoader`. `UnstructuredURLLoader` fetches content from the web and then uses the `unstructured` `partition` function to extract content and metadata. In this example we preprocess HTML files, but it works with other response types such as `application/pdf` as well. After calling `.load()`, the result is a list of `langchain` `Document` objects.

In [7]:
from langchain_community.document_loaders import UnstructuredURLLoader

loaders = UnstructuredURLLoader(urls=links[:20], show_progress_bar=True)

In [8]:
loaders

<langchain_community.document_loaders.url.UnstructuredURLLoader at 0x131164550>

In [9]:
docs = loaders.load()

100%|██████████| 20/20 [00:04<00:00,  4.45it/s]


In [10]:
docs

[Document(page_content='CNN\n\n4/23/2024\n\nBiden looks to use abortion rights to put Florida in play in November\n\nBy Priscilla Alvarez and Michael Williams, CNN\n\nUpdated: \n        3:53 PM EDT, Tue April 23, 2024\n\nSource: CNN\n\nPresident Joe Biden visited his rival’s home turf in Florida on Tuesday, where his team is seeking to leverage a restrictive abortion law to visit put the state in play for Democrats, seeing reproductive rights as a galvanizing issue for voters one week before a restrictive abortion ban in that state goes into effect.\n\nDemocrats have seized on abortion ahead of November,\xa0hoping it\xa0could spur moderate voters – particularly women – to turn out in droves against former President Donald Trump by tying the abortion bans directly to him.\n\n“Donald Trump stripped away the rights and freedoms of women in America,” Biden said in Tampa on Tuesday, adding it will be “on all of us” to restore those rights.\n\n“And when you do that, it will teach Donald Trum

In [11]:
import requests
response = requests.get("https://lite.cnn.com/2024/04/10/us/tennessee-teachers-gun-carry-bill/index.html")
print(response.headers['Content-Type'])
print(response.content[:500])  # Print the first 500 bytes of the response


text/html; charset=utf-8
b'  <!DOCTYPE html>\n<html lang="en" data-layout-uri="cms.cnn.com/_layouts/layout-with-rail/instances/us-article-v1@published">\n  <head><style>body,h1,h2,h3,h4,h5{font-family:cnn_sans_display,helveticaneue,Helvetica,Arial,Utkal,sans-serif}h1,h2,h3,h4,h5{font-weight:700}:root{--theme-primary:#cc0000;--theme-background:#0c0c0c;--theme-divider:#404040;--theme-copy:#404040;--theme-copy-accent:#e6e6e6;--theme-copy-accent-hover:#ffffff;--theme-icon-color:#e6e6e6;--theme-icon-color-hover:#ffffff;--theme-a'
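The check above prints the raw `Content-Type` header. Because the loader can also handle other response types such as `application/pdf`, it can help to separate the media type from its parameters before deciding how to treat a response. Here is a minimal sketch using only the standard library; `parse_content_type` is a hypothetical helper name.

```python
from email.message import Message


def parse_content_type(header_value):
    """Split a Content-Type header into (media_type, charset)."""
    message = Message()
    message["Content-Type"] = header_value
    # get_content_type() lowercases the media type and drops parameters;
    # get_content_charset() returns None when no charset is given.
    return message.get_content_type(), message.get_content_charset()


print(parse_content_type("text/html; charset=utf-8"))
print(parse_content_type("application/pdf"))
```

In the notebook, the header value would come from `response.headers['Content-Type']`.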


## Load documents into ChromaDB

With the documents preprocessed, we're now ready to load them into ChromaDB. We accomplish this with the OpenAI embeddings and the Chroma vectorstore from `langchain`. This workflow vectorizes the documents using the OpenAI embeddings endpoint, then loads the documents and their associated vectors into Chroma. Once the documents are in Chroma, we can perform a similarity search to retrieve documents related to our topic of interest.

In [12]:
from langchain.vectorstores.chroma import Chroma
from langchain.embeddings import OpenAIEmbeddings
# Alternative import on newer LangChain releases (requires the langchain-openai package):
# from langchain_openai import OpenAIEmbeddings

In [14]:
import os
from pathlib import Path
from dotenv import load_dotenv

# Load environment variables from a local .env file
env_path = Path('.') / '.env'
load_dotenv(dotenv_path=env_path)
# Expects an `openai_api_key` entry in the environment or the .env file
openai_api_key = os.environ['openai_api_key']

In [15]:
embeddings = OpenAIEmbeddings(openai_api_key=openai_api_key)
# Requires the chromadb package: pip install chromadb
vectorstore = Chroma.from_documents(docs, embeddings)


In [16]:
query_docs = vectorstore.similarity_search(
    "What is the news in Europe?", k=1
)
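Conceptually, a similarity search embeds the query and returns the `k` stored documents whose vectors are closest to it. The sketch below illustrates the idea with cosine similarity over tiny made-up vectors. It is not how Chroma is implemented, and the distance metric and index Chroma actually uses depend on its configuration. All names and numbers here are illustrative.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vector, indexed_vectors, k=1):
    """Return the k (doc_id, score) pairs most similar to the query."""
    scored = [
        (doc_id, cosine_similarity(query_vector, vector))
        for doc_id, vector in indexed_vectors.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 3-dimensional "embeddings"; real embedding vectors are much longer.
toy_index = {
    "doc_a": [0.9, 0.1, 0.0],
    "doc_b": [0.1, 0.8, 0.2],
    "doc_c": [0.0, 0.2, 0.9],
}
best_id, best_score = top_k([0.2, 0.9, 0.1], toy_index, k=1)[0]
print(best_id, round(best_score, 3))
```

With `k=1`, as in the query above, only the single closest document is returned.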

## Summarize the Documents

After retrieving relevant documents from Chroma, we're ready to summarize them! There are multiple ways to accomplish this in `langchain`, but `load_summarize_chain` is the easiest. Simply choose an LLM, load the summarization chain, and you're ready to summarize the documents. Here we limit the summary to snippets related to our topic of choice.

In [17]:
from langchain.chat_models import ChatOpenAI
from langchain.chains.summarize import load_summarize_chain

In [18]:
llm = ChatOpenAI(temperature=0, model_name="gpt-3.5-turbo", openai_api_key=openai_api_key)
chain = load_summarize_chain(llm, chain_type="stuff")


In [19]:
print(chain.run(query_docs))


Pro-Palestinian protests have spread to major American universities, leading to arrests, campus closures, and tensions between students, faculty, and administrators. Columbia University, NYU, Yale, and other schools have been affected, with demands for divestment from companies profiting from Israel's conflict with Gaza. Jewish students have expressed concerns for their safety, and lawmakers have visited campuses to address the unrest. Outside agitators have been identified as causing disruptions at protests, and the situation remains volatile as demonstrations continue.
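
The `stuff` chain type used above places the text of every retrieved document into a single prompt and makes one LLM call, so it only works when the combined text fits in the model's context window. Here is a rough sketch of that idea in plain Python. The template wording and the `build_stuff_prompt` helper are illustrative assumptions, not LangChain's actual prompt or API.

```python
# Illustrative template; the wording is an assumption, not LangChain's own prompt.
PROMPT_TEMPLATE = (
    "Write a concise summary of the following:\n\n{text}\n\nCONCISE SUMMARY:"
)


def build_stuff_prompt(page_contents, separator="\n\n"):
    """Join all document texts and place them into one summarization prompt."""
    combined = separator.join(page_contents)
    return PROMPT_TEMPLATE.format(text=combined)


prompt = build_stuff_prompt(["First article text.", "Second article text."])
print(prompt)
```

Because the similarity search above used `k=1`, only one document would be stuffed into the prompt in this notebook. For larger sets of documents, a chain type that summarizes in several passes may be a better fit.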
