<a href="https://colab.research.google.com/github/towardsai/ragbook-notebooks/blob/main/notebooks/Chapter%2006%20-%20Improving_Our_News_Articles_Summarizer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
#!pip install -q langchain==0.0.208 openai==0.27.8 python-dotenv newspaper3k

In [1]:
from dotenv import load_dotenv

load_dotenv()

True

In [2]:
import requests
from newspaper import Article

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.82 Safari/537.36'
}

article_url = "https://www.artificialintelligence-news.com/2022/01/25/meta-claims-new-ai-supercomputer-will-set-records/"

session = requests.Session()


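try:
    # Fetch the page ourselves so timeouts and HTTP errors surface explicitly
    response = session.get(article_url, headers=headers, timeout=10)

    if response.status_code == 200:
        article = Article(article_url)
        # Reuse the HTML we already fetched instead of downloading it a second time
        article.download(input_html=response.text)
        article.parse()

        print(f"Title: {article.title}")
        print(f"Text: {article.text}")
    else:
        print(f"Failed to fetch article at {article_url} (status {response.status_code})")
except Exception as e:
    print(f"Error occurred while fetching article at {article_url}: {e}")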

Title: Meta claims its new AI supercomputer will set records
Text: Ryan Daws is a senior editor at TechForge Media with over a decade of experience in crafting compelling narratives and making complex topics accessible. His articles and interviews with industry leaders have earned him recognition as a key influencer by organisations like Onalytica. Under his leadership, publications have been praised by analyst firms such as Forrester for their excellence and performance. Connect with him on X (@gadget_ry) or Mastodon (@gadgetry@techhub.social)

Meta (formerly Facebook) has unveiled an AI supercomputer that it claims will be the world’s fastest.

The supercomputer is called the AI Research SuperCluster (RSC) and is yet to be fully complete. However, Meta’s researchers have already begun using it for training large natural language processing (NLP) and computer vision models.

RSC is set to be fully built in mid-2022. Meta says that it will be the fastest in the world once complete and 

In [3]:
from langchain.schema import (
    HumanMessage
)

# we get the article data from the scraping part
article_title = article.title
article_text = article.text

# prepare template for prompt
template = """
As an advanced AI, you've been tasked to summarize online articles into bulleted points. Here are a few examples of how you've done this in the past:

Example 1:
Original Article: 'The Effects of Climate Change'
Summary:
- Climate change is causing a rise in global temperatures.
- This leads to melting ice caps and rising sea levels.
- Resulting in more frequent and severe weather conditions.

Example 2:
Original Article: 'The Evolution of Artificial Intelligence'
Summary:
- Artificial Intelligence (AI) has developed significantly over the past decade.
- AI is now used in multiple fields such as healthcare, finance, and transportation.
- The future of AI is promising but requires careful regulation.

Now, here's the article you need to summarize:

==================
Title: {article_title}

{article_text}
==================

Please provide a summarized version of the article in a bulleted list format.
"""

# format prompt
prompt = template.format(article_title=article_title, article_text=article_text)

messages = [HumanMessage(content=prompt)]
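
The few-shot prompt above hard-codes its two examples inside the template string. As a minimal sketch (plain Python, no LangChain), the same prompt could be assembled from a list of examples so that adding or swapping one does not require editing the template. The names `EXAMPLES`, `format_examples`, and `build_few_shot_prompt` are hypothetical helpers introduced for illustration only.

```python
# Hypothetical example data; titles and bullets mirror the hard-coded prompt.
EXAMPLES = [
    {
        "title": "The Effects of Climate Change",
        "bullets": [
            "Climate change is causing a rise in global temperatures.",
            "This leads to melting ice caps and rising sea levels.",
            "Resulting in more frequent and severe weather conditions.",
        ],
    },
    {
        "title": "The Evolution of Artificial Intelligence",
        "bullets": [
            "Artificial Intelligence (AI) has developed significantly over the past decade.",
            "AI is now used in multiple fields such as healthcare, finance, and transportation.",
            "The future of AI is promising but requires careful regulation.",
        ],
    },
]


def format_examples(examples):
    """Render each example as a numbered 'Original Article / Summary' block."""
    blocks = []
    for i, ex in enumerate(examples, start=1):
        bullets = "\n".join(f"- {b}" for b in ex["bullets"])
        blocks.append(
            f"Example {i}:\nOriginal Article: '{ex['title']}'\nSummary:\n{bullets}"
        )
    return "\n\n".join(blocks)


def build_few_shot_prompt(examples, article_title, article_text):
    """Assemble instructions, examples, and the target article into one prompt."""
    return (
        "As an advanced AI, you've been tasked to summarize online articles "
        "into bulleted points. Here are a few examples of how you've done "
        "this in the past:\n\n"
        f"{format_examples(examples)}\n\n"
        "Now, here's the article you need to summarize:\n\n"
        "==================\n"
        f"Title: {article_title}\n\n"
        f"{article_text}\n"
        "==================\n\n"
        "Please provide a summarized version of the article in a bulleted list format."
    )
```

Calling `build_few_shot_prompt(EXAMPLES, article_title, article_text)` should yield a string that can be wrapped in `HumanMessage(content=...)` exactly like `prompt` above.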

In [4]:
from langchain.chat_models import ChatOpenAI

# load the model
chat = ChatOpenAI(model_name="gpt-4-turbo", temperature=0)

In [5]:
# generate summary
summary = chat(messages)
print(summary.content)

Summary:
- Meta (formerly Facebook) has announced the development of the AI Research SuperCluster (RSC), which it claims will be the world's fastest AI supercomputer upon completion.
- The RSC is designed to train large natural language processing (NLP) and computer vision models, with full construction expected by mid-2022.
- Meta aims for the RSC to support advanced AI applications, such as real-time voice translations for large, multilingual groups, enhancing collaboration in research or augmented reality (AR) gaming.
- The RSC is projected to be significantly faster than Meta's current systems, being 20x faster than their V100-based clusters and capable of training models with tens of billions of parameters in three weeks, down from nine weeks.
- Unlike previous systems that used only open-source or publicly available datasets, the RSC incorporates enhanced security and privacy controls to utilize real-world data from Meta's production systems for training.
- This advancement will 

# ======

In [7]:
from typing import List

from langchain.output_parsers import PydanticOutputParser
from pydantic import BaseModel, Field, field_validator


# create output parser class
class ArticleSummary(BaseModel):
    title: str = Field(description="Title of the article")
    summary: List[str] = Field(description="Bulleted list summary of the article")

    # validate that the generated summary has at least three bullet points
    @field_validator('summary')
    @classmethod
    def has_three_or_more_lines(cls, list_of_lines):
        if len(list_of_lines) < 3:
            raise ValueError("Generated summary has fewer than three bullet points!")
        return list_of_lines

# set up output parser
parser = PydanticOutputParser(pydantic_object=ArticleSummary)
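
To see what the validator does on its own, independent of LangChain and the model, here is a small sketch that builds the same kind of Pydantic model directly. `ArticleSummaryDemo` and `try_build` are hypothetical names used only for this illustration, and the `ImportError` fallback is an assumption for environments that still run Pydantic v1.

```python
from typing import List

from pydantic import BaseModel, Field, ValidationError

try:
    from pydantic import field_validator  # Pydantic v2
except ImportError:
    # Assumption: fall back to the v1 decorator on older installs
    from pydantic import validator as field_validator


class ArticleSummaryDemo(BaseModel):
    title: str = Field(description="Title of the article")
    summary: List[str] = Field(description="Bulleted list summary of the article")

    @field_validator("summary")
    @classmethod
    def has_three_or_more_lines(cls, list_of_lines):
        # Reject summaries that are too short to be useful
        if len(list_of_lines) < 3:
            raise ValueError("Generated summary has fewer than three bullet points!")
        return list_of_lines


def try_build(title, bullets):
    """Return a validated model, or None when validation fails."""
    try:
        return ArticleSummaryDemo(title=title, summary=bullets)
    except ValidationError as err:
        print(f"Rejected: {err}")
        return None


too_short = try_build("Demo", ["one", "two"])
valid = try_build("Demo", ["one", "two", "three"])
```

Here `too_short` ends up as `None` because the validator raises, while `valid` is a populated model. The output parser relies on this same validation when it converts the model's reply into an `ArticleSummary`.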

In [9]:
from langchain.prompts import PromptTemplate


# create prompt template
# "partial_variables" pre-fills {format_instructions} with the parser's
# instructions, so only the article title and text are supplied at run time
template = """
You are a very good assistant that summarizes online articles.

Here's the article you want to summarize.

==================
Title: {article_title}

{article_text}
==================

{format_instructions}
"""

prompt_template = PromptTemplate(
    template=template,
    input_variables=["article_title", "article_text"],
    partial_variables={"format_instructions": parser.get_format_instructions()}
)
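
The idea behind `partial_variables` can be sketched in a few lines of plain Python. This is a simplified stand-in, not LangChain's implementation: the hypothetical `PartialPrompt` class stores the pre-filled values and merges them with the caller's values in a single `str.format` call. Formatting in one pass matters here because format instructions typically contain literal braces (a JSON schema), which would break a second round of formatting.

```python
class PartialPrompt:
    """Hypothetical stand-in for a prompt template with pre-filled variables."""

    def __init__(self, template, partial_variables=None):
        self.template = template
        self.partial_variables = dict(partial_variables or {})

    def format(self, **kwargs):
        # Merge pre-filled and caller-supplied values, then format once.
        # Substituted values are not re-parsed, so braces inside them are safe.
        return self.template.format(**{**self.partial_variables, **kwargs})


demo_prompt = PartialPrompt(
    "Title: {article_title}\n{format_instructions}",
    partial_variables={"format_instructions": 'Return JSON like {"title": "..."}'},
)
rendered = demo_prompt.format(article_title="X")
```

Only `article_title` is passed at call time; the instructions were fixed when the prompt object was created.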

In [10]:
from langchain.chat_models import ChatOpenAI
from langchain.chains import LLMChain

# instantiate model class
model = ChatOpenAI(model_name="gpt-4-turbo", temperature=0.0)

chain = LLMChain(llm=model, prompt=prompt_template)

# Run the LLMChain to get the AI-generated answer
output = chain.run({"article_title": article_title, "article_text": article_text})

# Parse the output into the Pydantic model
parsed_output = parser.parse(output)
print(parsed_output)

title='Meta claims its new AI supercomputer will set records' summary=["Meta (formerly Facebook) has announced the development of the AI Research SuperCluster (RSC), which it claims will be the world's fastest AI supercomputer once completed in mid-2022.", 'The RSC is designed for training extensive natural language processing (NLP) and computer vision models, with capabilities to handle models with trillions of parameters.', 'Meta aims to use RSC for advanced applications such as real-time voice translations for large groups and enhancing the metaverse with AI-driven technologies.', "The supercomputer is expected to be 20x faster than Meta's current systems and significantly quicker in specific computational tasks compared to previous technologies.", 'RSC incorporates enhanced security and privacy controls, allowing Meta to utilize real-world data from its production systems for training, particularly for identifying harmful content on its platforms.']


In [11]:
parsed_output

ArticleSummary(title='Meta claims its new AI supercomputer will set records', summary=["Meta (formerly Facebook) has announced the development of the AI Research SuperCluster (RSC), which it claims will be the world's fastest AI supercomputer once completed in mid-2022.", 'The RSC is designed for training extensive natural language processing (NLP) and computer vision models, with capabilities to handle models with trillions of parameters.', 'Meta aims to use RSC for advanced applications such as real-time voice translations for large groups and enhancing the metaverse with AI-driven technologies.', "The supercomputer is expected to be 20x faster than Meta's current systems and significantly quicker in specific computational tasks compared to previous technologies.", 'RSC incorporates enhanced security and privacy controls, allowing Meta to utilize real-world data from its production systems for training, particularly for identifying harmful content on its platforms.'])
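
For intuition about the parsing step, here is a rough standard-library sketch of turning a model reply into structured data: find the JSON object in the text (models often wrap it in a Markdown code fence), decode it, and check the same "at least three bullets" rule. This is an assumption-laden approximation, not how `PydanticOutputParser` is implemented; `extract_json` and `parse_summary` are hypothetical helpers.

```python
import json
import re

# Built indirectly so this example can itself sit inside a fenced block
FENCE = "`" * 3


def extract_json(text):
    """Return the JSON object spanning the first '{' to the last '}' in text."""
    match = re.search(r"\{.*\}", text, flags=re.DOTALL)
    if match is None:
        raise ValueError("No JSON object found in model output")
    return json.loads(match.group(0))


def parse_summary(text):
    """Decode a model reply and apply minimal structural checks."""
    data = extract_json(text)
    title = data.get("title")
    summary = data.get("summary")
    if not isinstance(title, str):
        raise ValueError("'title' must be a string")
    if not isinstance(summary, list) or len(summary) < 3:
        raise ValueError("'summary' must be a list with at least three bullet points")
    return {"title": title, "summary": [str(item) for item in summary]}


# Simulated model reply wrapped in a Markdown code fence
raw_reply = FENCE + 'json\n{"title": "Demo", "summary": ["a", "b", "c"]}\n' + FENCE
parsed = parse_summary(raw_reply)

# Render the structured result back into a bulleted list
for bullet in parsed["summary"]:
    print(f"- {bullet}")
```

With a real `ArticleSummary` instance, the same loop works over `parsed_output.summary`.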