# News Articles Summarizer (version 2)

STEPS:
1. Install libraries: `requests`, `newspaper3k`, and `langchain`.
2. Scrape articles: fetch the content of the target news articles from their URLs using the `requests` library.
3. Extract titles and text: parse the scraped HTML and extract each article's title and text using `newspaper3k`.
4. Preprocess the text with few-shot learning: provide a few examples to the LLM to guide it toward summaries in the desired format, a bulleted list.
5. Generate summary: use the model with the prepared prompt to generate a concise summary of the extracted article text in the desired format.
6. Use output parsers: interpret the output from the language model and ensure it matches the desired structure and format.
7. Output the results: present the bulleted summaries along with the original titles, so users can quickly grasp the main points of each article in a structured form.

### 1. Installing and Importing Libraries

In [1]:
# Installing the required libraries
# !pip install langchain==0.0.208 deeplake openai tiktoken
# !pip install -q newspaper3k python-dotenv

In [2]:
import requests
from newspaper import Article
from langchain.schema import HumanMessage
from langchain.chat_models import ChatOpenAI

import os
import json 

#from dotenv import load_dotenv
#load_dotenv()
from keys import OPENAI_API_KEY
os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY

### 2. Scrape Article

In [3]:
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.82 Safari/537.36'
}
article_url = "https://www.artificialintelligence-news.com/2022/01/25/meta-claims-new-ai-supercomputer-will-set-records/"

session = requests.Session()
try:
    # Fetch article from the URL using the requests library with a custom User-Agent header
    response = session.get(article_url, headers=headers, timeout=10)
    
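    # Extract the title and text of the article using the newspaper library
    if response.status_code == 200:
        article = Article(article_url)
        # Reuse the HTML already fetched above instead of downloading the page a second time
        article.download(input_html=response.text)
        article.parse()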
        
        print(f"Title: {article.title}")
        print(f"Text: {article.text}")
        
    else:
        print(f"Failed to fetch article at {article_url}")
except Exception as e:
    print(f"Error occurred while fetching article at {article_url}: {e}")

Title: Meta claims its new AI supercomputer will set records
Text: Ryan is a senior editor at TechForge Media with over a decade of experience covering the latest technology and interviewing leading industry figures. He can often be sighted at tech conferences with a strong coffee in one hand and a laptop in the other. If it's geeky, he’s probably into it. Find him on Twitter (@Gadget_Ry) or Mastodon (@gadgetry@techhub.social)

Meta (formerly Facebook) has unveiled an AI supercomputer that it claims will be the world’s fastest.

The supercomputer is called the AI Research SuperCluster (RSC) and is yet to be fully complete. However, Meta’s researchers have already begun using it for training large natural language processing (NLP) and computer vision models.

RSC is set to be fully built in mid-2022. Meta says that it will be the fastest in the world once complete and the aim is for it to be capable of training models with trillions of parameters.

“We hope RSC will help us build entire

### 3. Extract titles and text

In [4]:
# Get the article data from the scraping part
article_title = article.title
article_text = article.text

### 4. Preprocess the text (use Few-Shot Learning Technique)

In [5]:
from langchain.schema import HumanMessage


# Prepare template for prompt (guiding the model to generate a bulleted list summarizing the article)
template = """
As an advanced AI, you've been tasked to summarize online articles into bulleted points. Here are a few examples of how you've done this in the past:

Example 1:
Original Article: 'The Effects of Climate Change'
Summary:
- Climate change is causing a rise in global temperatures.
- This leads to melting ice caps and rising sea levels.
- Resulting in more frequent and severe weather conditions.

Example 2:
Original Article: 'The Evolution of Artificial Intelligence'
Summary:
- Artificial Intelligence (AI) has developed significantly over the past decade.
- AI is now used in multiple fields such as healthcare, finance, and transportation.
- The future of AI is promising but requires careful regulation.

Now, here's the article you need to summarize:

==================
Title: {article_title}

{article_text}
==================

Please provide a summarized version of the article in a bulleted list format.
"""

# Format the Prompt
prompt = template.format(article_title=article_title, article_text=article_text)

messages = [HumanMessage(content=prompt)]
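The few-shot prompt above hard-codes its examples inside the template string. As a minimal sketch of the same technique, the examples could be kept as data and the prompt assembled from them, which makes it easier to swap or add examples. The names `FEW_SHOT_EXAMPLES` and `build_few_shot_prompt` are hypothetical and not part of the notebook or of LangChain; only the standard library is used.

```python
from typing import Any, Dict, List

# Hypothetical example data; titles and bullets mirror the template above.
FEW_SHOT_EXAMPLES: List[Dict[str, Any]] = [
    {
        "title": "The Effects of Climate Change",
        "bullets": [
            "Climate change is causing a rise in global temperatures.",
            "This leads to melting ice caps and rising sea levels.",
            "Resulting in more frequent and severe weather conditions.",
        ],
    },
    {
        "title": "The Evolution of Artificial Intelligence",
        "bullets": [
            "Artificial Intelligence (AI) has developed significantly over the past decade.",
            "AI is now used in multiple fields such as healthcare, finance, and transportation.",
            "The future of AI is promising but requires careful regulation.",
        ],
    },
]


def build_few_shot_prompt(
    examples: List[Dict[str, Any]], article_title: str, article_text: str
) -> str:
    """Assemble a few-shot summarization prompt from example data (hypothetical helper)."""
    parts = [
        "As an advanced AI, you've been tasked to summarize online articles into bulleted points. "
        "Here are a few examples of how you've done this in the past:",
        "",
    ]
    # One numbered block per example: title, then its bulleted summary
    for index, example in enumerate(examples, start=1):
        parts.append(f"Example {index}:")
        parts.append(f"Original Article: '{example['title']}'")
        parts.append("Summary:")
        parts.extend(f"- {bullet}" for bullet in example["bullets"])
        parts.append("")
    # The article to summarize, delimited the same way as in the template above
    parts.extend(
        [
            "Now, here's the article you need to summarize:",
            "",
            "==================",
            f"Title: {article_title}",
            "",
            article_text,
            "==================",
            "",
            "Please provide a summarized version of the article in a bulleted list format.",
        ]
    )
    return "\n".join(parts)
```

The returned string could be passed to `HumanMessage(content=...)` in the same way as `prompt` above.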

### 5. Generate Summary

In [6]:
from langchain.chat_models import ChatOpenAI

# Load the model
chat = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0.0)  # model_name = "gpt-4"

# Generate summary
summary = chat(messages)
print(summary.content)

- Meta (formerly Facebook) has unveiled an AI supercomputer called the AI Research SuperCluster (RSC).
- The RSC is expected to be the world's fastest supercomputer once complete.
- It will be capable of training models with trillions of parameters.
- Meta aims to use the RSC to build new AI systems for real-time voice translations and applications in the metaverse.
- The RSC is estimated to be 20x faster than Meta's current clusters and 9x faster at running the NVIDIA Collective Communication Library (NCCL).
- It will also be 3x faster at training large-scale natural language processing (NLP) workflows.
- Meta designed the RSC with security and privacy controls to use real-world examples from its production systems.
- This will allow Meta to advance research for tasks such as identifying harmful content on its platforms.
- The RSC is expected to significantly improve training times for models with tens of billions of parameters.
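The chat model returns the summary as one block of text. A small sketch of how that text could be turned into a Python list of bullet strings is shown below. `parse_bullets` is a hypothetical helper, not a LangChain function, and it assumes each bullet sits on its own line starting with `- `, `* `, or `• `.

```python
from typing import List


def parse_bullets(text: str) -> List[str]:
    """Collect the lines that look like bullets and strip their markers (hypothetical helper)."""
    bullets = []
    for line in text.splitlines():
        stripped = line.strip()
        # Assumption: a bullet is a line that starts with one of these markers
        if stripped.startswith(("- ", "* ", "• ")):
            bullets.append(stripped[2:].strip())
    return bullets
```

With the notebook's variables this would be called as `parse_bullets(summary.content)`.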


### 6. Use the Output Parsers

In [7]:
from langchain.output_parsers import PydanticOutputParser
from pydantic import BaseModel, Field, validator
from typing import List


# Create data structure model - ArticleSummary 
# This model will serve as a blueprint for the desired structure of the generated article summary
class ArticleSummary(BaseModel):
    title: str = Field(description="Title of the article")
    summary: List[str] = Field(description="Bulleted list summary of the article")

    # Validate that the generated summary has at least three bullet points
    @validator('summary', allow_reuse=True)
    def has_three_or_more_lines(cls, list_of_lines):
        if len(list_of_lines) < 3:
            raise ValueError("Generated summary has fewer than three bullet points!")
        return list_of_lines

    
# Instantiate a parser object 
parser = PydanticOutputParser(pydantic_object=ArticleSummary)
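To make the check enforced by `ArticleSummary` concrete, here is a standard-library stand-in that parses a JSON string and rejects summaries with fewer than three bullets. It is a sketch only: `SummaryRecord` and `parse_summary_json` are hypothetical names, and this is not how `PydanticOutputParser` is implemented. It only illustrates the shape of the data and the validation rule.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass
class SummaryRecord:
    """Hypothetical stdlib stand-in for the ArticleSummary model."""

    title: str
    summary: List[str]

    def __post_init__(self) -> None:
        # Same rule as the Pydantic validator: at least three bullet points
        if len(self.summary) < 3:
            raise ValueError("Generated summary has fewer than three bullet points!")


def parse_summary_json(raw: str) -> SummaryRecord:
    """Parse a JSON object with 'title' and 'summary' keys into a SummaryRecord."""
    data = json.loads(raw)
    return SummaryRecord(
        title=str(data["title"]),
        summary=[str(item) for item in data["summary"]],
    )
```

A model response with only one or two bullets would raise `ValueError` here, which is the failure the validator in `ArticleSummary` is meant to catch.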

In [8]:
from langchain.prompts import PromptTemplate

# Create the prompt template; "partial_variables" pre-fills the parser's format instructions
# The template tells the model how to act and which output structure the parser expects
template = """
You are a very good assistant that summarizes online articles.

Here's the article you want to summarize.

==================
Title: {article_title}

{article_text}
==================

{format_instructions}
"""

prompt = PromptTemplate(
    template=template,
    input_variables=["article_title", "article_text"],
    partial_variables={"format_instructions": parser.get_format_instructions()}
)

# Format the prompt using the article title and text obtained from scraping
formatted_prompt = prompt.format_prompt(article_title=article_title, article_text=article_text)

### 7. Output the results

In [9]:
from langchain.llms import OpenAI

# Instantiate model class
model = OpenAI(model_name="text-davinci-003", temperature=0.0)

# Use the model to generate a summary
output = model(formatted_prompt.to_string())

# Parse the output into the Pydantic model
parsed_output = parser.parse(output)
print(parsed_output)

title='Meta claims its new AI supercomputer will set records' summary=['Meta (formerly Facebook) has unveiled an AI supercomputer that it claims will be the world’s fastest.', 'The supercomputer is called the AI Research SuperCluster (RSC) and is yet to be fully complete.', 'Meta says that it will be the fastest in the world once complete and the aim is for it to be capable of training models with trillions of parameters.', 'For production, Meta expects RSC will be 20x faster than Meta’s current V100-based clusters.', 'Meta says that its previous AI research infrastructure only leveraged open source and other publicly-available datasets.', 'What this means in practice is that Meta can use RSC to advance research for vital tasks such as identifying harmful content on its platforms—using real data from them.']
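Printing the parsed object shows its raw representation. To present the title with a bulleted summary, as step 7 describes, a small formatting helper could be used. `render_summary` below is a hypothetical function, sketched with the standard library only.

```python
from typing import List


def render_summary(title: str, bullets: List[str]) -> str:
    """Format a title and its bullet points as readable text (hypothetical helper)."""
    lines = [f"Title: {title}", "Summary:"]
    lines.extend(f"- {bullet}" for bullet in bullets)
    return "\n".join(lines)
```

With the notebook's variables this would be `print(render_summary(parsed_output.title, parsed_output.summary))`.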
