## Create a list containing the links of all articles (including the featured article)

In [1]:
import re, os, time
from bs4 import BeautifulSoup

from utils.utils import get_website_html
from scraper import get_featured_article_link, get_articles_links, get_article_titles, get_formatted_article_text

BASE_URL = "https://www.deeplearning.ai"
THE_BATCH_URL = "https://www.deeplearning.ai/the-batch/"

In [2]:
home_page_html = get_website_html(THE_BATCH_URL)

In [3]:
unfeatured_articles_links = get_articles_links(home_page_html)

for index, link in enumerate(unfeatured_articles_links):
    print(f"{index} --> {link}")


Invalid article link (It is probably a featured article)...continuing
0 --> https://www.deeplearning.ai/the-batch/issue-280/
1 --> https://www.deeplearning.ai/the-batch/issue-279/
2 --> https://www.deeplearning.ai/the-batch/issue-278/
3 --> https://www.deeplearning.ai/the-batch/issue-277/
4 --> https://www.deeplearning.ai/the-batch/issue-276/
5 --> https://www.deeplearning.ai/the-batch/issue-275/
6 --> https://www.deeplearning.ai/the-batch/issue-274/
7 --> https://www.deeplearning.ai/the-batch/issue-273/
8 --> https://www.deeplearning.ai/the-batch/issue-272/
9 --> https://www.deeplearning.ai/the-batch/issue-271/
10 --> https://www.deeplearning.ai/the-batch/issue-270/
11 --> https://www.deeplearning.ai/the-batch/issue-269/
12 --> https://www.deeplearning.ai/the-batch/issue-268/
13 --> https://www.deeplearning.ai/the-batch/issue-267/
14 --> https://www.deeplearning.ai/the-batch/issue-266/


In [4]:
featured_article_link = get_featured_article_link(home_page_html)
print(featured_article_link)

https://www.deeplearning.ai/the-batch/issue-281/


In [5]:
all_articles_links = [featured_article_link] + unfeatured_articles_links
print("Total number of articles: ", len(all_articles_links))
for i, article in enumerate(all_articles_links):
    print(f"{i} --> {article}")

Total number of articles:  16
0 --> https://www.deeplearning.ai/the-batch/issue-281/
1 --> https://www.deeplearning.ai/the-batch/issue-280/
2 --> https://www.deeplearning.ai/the-batch/issue-279/
3 --> https://www.deeplearning.ai/the-batch/issue-278/
4 --> https://www.deeplearning.ai/the-batch/issue-277/
5 --> https://www.deeplearning.ai/the-batch/issue-276/
6 --> https://www.deeplearning.ai/the-batch/issue-275/
7 --> https://www.deeplearning.ai/the-batch/issue-274/
8 --> https://www.deeplearning.ai/the-batch/issue-273/
9 --> https://www.deeplearning.ai/the-batch/issue-272/
10 --> https://www.deeplearning.ai/the-batch/issue-271/
11 --> https://www.deeplearning.ai/the-batch/issue-270/
12 --> https://www.deeplearning.ai/the-batch/issue-269/
13 --> https://www.deeplearning.ai/the-batch/issue-268/
14 --> https://www.deeplearning.ai/the-batch/issue-267/
15 --> https://www.deeplearning.ai/the-batch/issue-266/
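Before scraping, it can help to confirm that the combined list has no duplicates and runs from newest to oldest. The following is a minimal sketch, not part of the notebook: `issue_number` and `check_links` are hypothetical helpers, and the regex assumes every link ends in `issue-<number>/`, as the links printed above do.

```python
import re


def issue_number(link: str) -> int:
    """Return the issue number at the end of a link (hypothetical helper)."""
    match = re.search(r"/issue-(\d+)/?$", link)
    if match is None:
        raise ValueError(f"No issue number found in link: {link}")
    return int(match.group(1))


def check_links(links: list[str]) -> list[int]:
    """Raise if links repeat or are not newest-first; return the issue numbers."""
    if len(set(links)) != len(links):
        raise ValueError("The list contains duplicate links")
    numbers = [issue_number(link) for link in links]
    if numbers != sorted(numbers, reverse=True):
        raise ValueError("The links are not ordered from newest to oldest")
    return numbers


# Sample data in the same shape as all_articles_links
sample_links = [
    f"https://www.deeplearning.ai/the-batch/issue-{n}/" for n in (281, 280, 279)
]
print(check_links(sample_links))
```

In the notebook itself, the call would be `check_links(all_articles_links)`.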


## Creating a list containing the titles of all articles
#### The titles are in the same order as the links in all_articles_links

In [6]:
all_articles_titles = get_article_titles(home_page_html)
print(f"Total titles: {len(all_articles_titles)}")
for i, title in enumerate(all_articles_titles):
    print(f"{i} --> {title}")

Total titles: 16
0 --> Top AI Stories of 2024! Agents Rise, Prices Fall, Models Shrink, Video Takes Off, Acquisitions Morph
1 --> Phi-4 Breaks Size Barrier, HunyuanVideo Narrows Open Source Gap, Gemini 2.0 Flash Accelerates Multimodal Modeling, LLMs Propose Research Ideas
2 --> Amazon Nova’s Competitive Price/Performance, OpenAI o1 Pro’s High Price/Performance, Google’s Game Worlds on Tap, Factual LLMs
3 --> AI Agents Spend Real Money, Breaking Jailbreaks, Mistral Goes Big and Multimodal, AI’s Growing E-Waste Problem
4 --> DeepSeek Takes On OpenAI, Robots Fold Laundry, Amazon and Anthropic Expand Partnership, More Efficient Object Detection
5 --> Next-Gen Models Show Limited Gains, Real-Time Video Generation, China AI Chips Blocked, Transformer Training Streamlined
6 --> Llama On the Battlefield, Mixture of Experts Pulls Ahead, Open Agentic Platform, Voter Support Chatbot
7 --> AI Controls Desktops, Agents Train Algorithms, Does Anyone Comply With the EU’s AI Act?, Robots on the Loadin
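Because later steps look up a title by the position of its link, a length mismatch between the two lists would silently attach the wrong title to an article. One way to guard against that is sketched below; `pair_links_with_titles` is a hypothetical helper, not part of `scraper`, and the sample values are shortened placeholders.

```python
def pair_links_with_titles(links: list[str], titles: list[str]) -> list[tuple[str, str]]:
    """Pair each link with the title at the same position, failing on a mismatch."""
    if len(links) != len(titles):
        raise ValueError(
            f"Got {len(links)} links but {len(titles)} titles; the lists must align"
        )
    return list(zip(links, titles))


# Shortened placeholder data in the same shape as the notebook's lists
sample_links = [
    "https://www.deeplearning.ai/the-batch/issue-281/",
    "https://www.deeplearning.ai/the-batch/issue-280/",
]
sample_titles = ["Top AI Stories of 2024!", "Phi-4 Breaks Size Barrier"]

for link, title in pair_links_with_titles(sample_links, sample_titles):
    print(f"{title} --> {link}")
```

In the notebook, the call would be `pair_links_with_titles(all_articles_links, all_articles_titles)`.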

## Creating the Chunks of each Article (Docs)

#### Storing each article's text in a text file

In [7]:
def save_all_articles_text(all_articles_links, dir_name):
    """Fetch every article and save its formatted text as a file in dir_name."""
    os.makedirs(dir_name, exist_ok=True)
    for index, link in enumerate(all_articles_links):
        article_text = get_formatted_article_text(link)
        slug = re.sub(r'https?://[^/]+/', '', link)  # remove scheme and domain
        slug = slug.strip("/").replace("/", "_")     # turn 'the-batch/issue-281/' -> 'the-batch_issue-281'
        file_name = f"article_{index}_{slug}.txt"
        file_path = os.path.join(dir_name, file_name)
        with open(file_path, "w", encoding="utf-8") as f:
            f.write(article_text)
        print(f"Saved: {file_path}\n")
        time.sleep(1)  # Delay so the server is not overwhelmed


# dir_name is required; the output below shows the files going to formatted_articles
save_all_articles_text(all_articles_links, "formatted_articles")

Fetching article text with link:  https://www.deeplearning.ai/the-batch/issue-281/
Saved: formatted_articles/article_0_the-batch_issue-281.txt

Fetching article text with link:  https://www.deeplearning.ai/the-batch/issue-280/
Saved: formatted_articles/article_1_the-batch_issue-280.txt

Fetching article text with link:  https://www.deeplearning.ai/the-batch/issue-279/
Saved: formatted_articles/article_2_the-batch_issue-279.txt

Fetching article text with link:  https://www.deeplearning.ai/the-batch/issue-278/
Saved: formatted_articles/article_3_the-batch_issue-278.txt

Fetching article text with link:  https://www.deeplearning.ai/the-batch/issue-277/
Saved: formatted_articles/article_4_the-batch_issue-277.txt

Fetching article text with link:  https://www.deeplearning.ai/the-batch/issue-276/
Saved: formatted_articles/article_5_the-batch_issue-276.txt

Fetching article text with link:  https://www.deeplearning.ai/the-batch/issue-275/
Saved: formatted_articles/article_6_the-batch_issue-2
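The file names above come from two string operations on each link. The sketch below isolates that step so it can be tried without fetching anything; `link_to_file_name` is a hypothetical wrapper around the same two lines used in `save_all_articles_text`.

```python
import re


def link_to_file_name(index: int, link: str) -> str:
    """Build the file name used for an article from its position and link."""
    slug = re.sub(r"https?://[^/]+/", "", link)  # drop the scheme and domain
    slug = slug.strip("/").replace("/", "_")     # 'the-batch/issue-281/' -> 'the-batch_issue-281'
    return f"article_{index}_{slug}.txt"


print(link_to_file_name(0, "https://www.deeplearning.ai/the-batch/issue-281/"))
# prints: article_0_the-batch_issue-281.txt
```

Keeping the index in the name is what lets a later step recover the article's position in `all_articles_links` from the file name alone.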

#### Splitting each article into chunks using MarkdownTextSplitter and saving to ChromaDB

In [8]:
from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma
from dotenv import load_dotenv

load_dotenv()

embeddings = OpenAIEmbeddings(model = "text-embedding-3-small")

vector_store = Chroma(collection_name="articles_collection", embedding_function=embeddings)


In [10]:
from utils.utils import get_article_chunks, parse_article_index, create_metadata

for filename in os.listdir("formatted_articles"):
    # Only the saved article text files are indexed
    if not filename.endswith(".txt"):
        continue
    print(filename)
    filepath = os.path.join("formatted_articles", filename)
    # Read with the same encoding the files were written with
    with open(filepath, "r", encoding="utf-8") as myfile:
        print(myfile.name)
        article_text = myfile.read()
        docs = get_article_chunks(article_text)
        article_index = parse_article_index(myfile.name)
        print(article_index)
        print(f"Length of docs of article with index {article_index} = {len(docs)}")
        metadatas, ids = create_metadata(docs, article_index, all_articles_links, all_articles_titles)
        print(ids)
    # Embed the chunks and store them with their IDs and metadata
    vector_store.add_texts(texts=docs, ids=ids, metadatas=metadatas)
    time.sleep(0.5)

article_12_the-batch_issue-269.txt
formatted_articles/article_12_the-batch_issue-269.txt
12
Length of docs of article with index 12 = 12
['id12_0', 'id12_1', 'id12_2', 'id12_3', 'id12_4', 'id12_5', 'id12_6', 'id12_7', 'id12_8', 'id12_9', 'id12_10', 'id12_11']
article_14_the-batch_issue-267.txt
formatted_articles/article_14_the-batch_issue-267.txt
14
Length of docs of article with index 14 = 12
['id14_0', 'id14_1', 'id14_2', 'id14_3', 'id14_4', 'id14_5', 'id14_6', 'id14_7', 'id14_8', 'id14_9', 'id14_10', 'id14_11']
article_4_the-batch_issue-277.txt
formatted_articles/article_4_the-batch_issue-277.txt
4
Length of docs of article with index 4 = 13
['id4_0', 'id4_1', 'id4_2', 'id4_3', 'id4_4', 'id4_5', 'id4_6', 'id4_7', 'id4_8', 'id4_9', 'id4_10', 'id4_11', 'id4_12']
article_2_the-batch_issue-279.txt
formatted_articles/article_2_the-batch_issue-279.txt
2
Length of docs of article with index 2 = 15
['id2_0', 'id2_1', 'id2_2', 'id2_3', 'id2_4', 'id2_5', 'id2_6', 'id2_7', 'id2_8', 'id2_9', 'i
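The notebook does not show the helpers in `utils.utils`, so the following is only a guess at how the article index and chunk IDs printed above could be derived. `parse_article_index_sketch` and `make_chunk_ids` are hypothetical names. The sketch also sorts the file names by index, since `os.listdir` returns them in arbitrary order, as the output above shows.

```python
import os
import re


def parse_article_index_sketch(path: str) -> int:
    """Read the index from a name like 'article_12_the-batch_issue-269.txt' (assumed format)."""
    name = os.path.basename(path)
    match = re.match(r"article_(\d+)_", name)
    if match is None:
        raise ValueError(f"Unexpected file name: {name}")
    return int(match.group(1))


def make_chunk_ids(article_index: int, chunk_count: int) -> list[str]:
    """Build IDs in the 'id<article>_<chunk>' pattern seen in the output above."""
    return [f"id{article_index}_{i}" for i in range(chunk_count)]


file_names = [
    "article_12_the-batch_issue-269.txt",
    "article_4_the-batch_issue-277.txt",
    "article_2_the-batch_issue-279.txt",
]
# Sort numerically by index; a plain string sort would put 12 before 2
for name in sorted(file_names, key=parse_article_index_sketch):
    index = parse_article_index_sketch(name)
    print(index, make_chunk_ids(index, 3))
```

IDs that combine the article index and the chunk position stay unique across the collection, which matters because they are passed as `ids` to `add_texts`.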

In [11]:
results = vector_store.similarity_search(
    "What was the law that prevented the advancements in open source AI?",
    k = 2
)

for res in results:
    print(f"* {res.page_content}")

* ## Innovation Can’t Win
Politicians and pundits have conjured visions of doom to convince lawmakers to clamp down on AI. What if terrified legislators choke off innovation in AI?
The fear: Laws and treaties that purportedly were intended to prevent harms wrought by AI are making developing new models legally risky and prohibitively expensive. Without room to experiment, AI’s benefits will be strangled by red tape.
Horror stories: At least one law that would have damaged AI innovation and open source has been blocked, but another is already limiting access to technology and raising costs for companies, developers, and users worldwide. More such efforts likely are underway.
- California SB 1047 would have held developers of models above a certain size (requiring 10 26 floating-point operations or cost $100 million to train) liable for unintended harms caused by their models, such as helping to perpetrate thefts, cyberattacks, or design weapons of mass destruction. The bill required suc

In [12]:
print(results[0].metadata)

{'article_link': 'https://www.deeplearning.ai/the-batch/issue-273/', 'article_title': 'Trick or treat! AI Devours Energy, Innovation Can’t Win, Models Collapse, Benchmark Tests Are Meaningless, No Work for Coders', 'chunk_heading': 'Innovation Can’t Win', 'source': 5}
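The metadata above pairs each chunk with its article link, article title, and section heading. A rough sketch of how such records could be built is shown below. It is not the notebook's `get_article_chunks` or `create_metadata`: it swaps in a plain split on `## ` headings instead of `MarkdownTextSplitter`, and it stores the chunk's position under `source`, which is only a guess at what that field holds.

```python
def split_on_headings(markdown_text: str) -> list[str]:
    """Split text into chunks that each start at a '## ' heading (simplified stand-in)."""
    chunks, current = [], []
    for line in markdown_text.splitlines():
        if line.startswith("## ") and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return [chunk for chunk in chunks if chunk]


def build_metadata(chunk: str, chunk_position: int, link: str, title: str) -> dict:
    """Build one metadata record; 'source' as chunk position is an assumption."""
    first_line = chunk.splitlines()[0]
    heading = first_line[3:].strip() if first_line.startswith("## ") else ""
    return {
        "article_link": link,
        "article_title": title,
        "chunk_heading": heading,
        "source": chunk_position,
    }


sample_text = "## First Story\nSome text.\n## Second Story\nMore text."
sample_chunks = split_on_headings(sample_text)
for position, chunk in enumerate(sample_chunks):
    print(build_metadata(chunk, position, "https://example.com/issue/", "Sample title"))
```

Storing the heading separately from the chunk text makes it possible to show a reader which section an answer came from without parsing `page_content` again.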


## Querying using Chains

In [13]:
from langchain.chains.qa_with_sources.retrieval import RetrievalQAWithSourcesChain
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini", max_tokens=256, temperature=0.5)

chain = RetrievalQAWithSourcesChain.from_llm(llm=llm, retriever=vector_store.as_retriever())

query = "What was the role of AI in the elections?"

response = chain.invoke({"question": query}, return_only_outputs=True)

In [14]:
print(response)

{'answer': "AI played a role in the recent United States elections by providing voters with access to verified, nonpartisan information through platforms like Perplexity, which launched an Election Information Hub. This AI-powered website offered live updates, summaries, and explanations of key issues related to the elections. Additionally, AI chatbots were utilized by various search engines to enhance user experience, although some, like Microsoft Copilot and OpenAI's SearchGPT, opted to refer users to other sources for election-related inquiries. While there were concerns about AI creating misleading content, it appears that generative AI was not the primary method of manipulation in this election cycle; instead, the amplification effect of software bots on social media was identified as a more significant concern. Overall, fears that AI would disrupt the democratic process in the 2024 elections were largely unfounded.\n\n", 'sources': '8, 9, 0'}
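The chain returns `sources` as a single comma-separated string rather than a list. A small sketch for turning it into separate values follows; `parse_sources` is a hypothetical helper, and the values are kept as strings because the notebook does not guarantee that they are always numeric.

```python
def parse_sources(sources: str) -> list[str]:
    """Split a comma-separated sources string into trimmed, non-empty values."""
    return [part.strip() for part in sources.split(",") if part.strip()]


# A response in the same shape as the chain output above, with the answer shortened
sample_response = {"answer": "AI played a role in the elections.\n\n", "sources": "8, 9, 0"}

print(sample_response["answer"].strip())
print(parse_sources(sample_response["sources"]))
# prints: ['8', '9', '0']
```

An empty `sources` string yields an empty list, so the caller can check whether the model cited anything at all.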


In [16]:
results = vector_store.similarity_search(
    "What was the role of AI in the elections?",
    k = 2
)
print(results)
for res in results:
    print(f"* {res.page_content}")

[Document(metadata={'article_link': 'https://www.deeplearning.ai/the-batch/issue-275/', 'article_title': 'Llama On the Battlefield, Mixture of Experts Pulls Ahead, Open Agentic Platform, Voter Support Chatbot', 'chunk_heading': 'Voter’s Helper', 'source': 8}, page_content='## Voter’s Helper\nSome voters navigated last week’s United States elections with help from a large language model that generated output based on verified, nonpartisan information.\nWhat’s new: Perplexity, an AI-powered search engine founded in 2022 by former OpenAI and Meta researchers, launched its Election Information Hub , an AI-enhanced website that combines AI-generated analysis with real-time data. The model provided live updates, summaries, and explanations of key issues in the recent national, state, and local elections in the U.S. (The hub remains live, but it no longer displays information about local contests or delivers detailed results for election-related searches.)\nHow it works: Perplexity partnered 