## Trending News Summary

In this small project, we scrape the 50 latest trending articles from [CNBC](https://www.cnbc.com/#Homepage-TrendingNowBreaker-19), group them into a small number of topics (e.g., 10), and then summarize the most important information in each topic using ChatGPT.

### Install and import libraries

In [1]:
pip install -q openai ycnbc torch python-dotenv numpy==1.26.4

Note: you may need to restart the kernel to use updated packages.


In [2]:
import os
import requests
import json
import torch
from openai import OpenAI
from dotenv import load_dotenv

import ycnbc
from ycnbc.news.news_utils import CNBCNewsUtils

In [3]:
# from IPython.display import HTML, display

# def set_css():
#   display(HTML('''
#   <style>
#     pre {
#         white-space: pre-wrap;
#     }
#   </style>
#   '''))
# get_ipython().events.register('pre_run_cell', set_css)

In [4]:
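# Load API keys from a local .env file (if present) into the environment
load_dotenv()
# os.getenv does not raise on a missing variable, so the placeholder is passed as the default
OPENAI_KEY = os.getenv("OPENAI_KEY", "your_openai_key")
RAPID_API_KEY = os.getenv("RAPID_API_KEY", "your_rapidAPI_key")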
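
A missing key does not raise at this point; it only surfaces later as an authentication error from the API. A minimal sketch of a fail-fast check, assuming the variable names used above. `require_env` is a hypothetical helper for illustration, not part of `python-dotenv`:

```python
import os


def require_env(name):
    """Return the value of an environment variable or raise a clear error.

    Hypothetical helper for illustration; not part of python-dotenv.
    """
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Environment variable {name!r} is not set")
    return value
```

It could be called right after `load_dotenv()`, for example `OPENAI_KEY = require_env("OPENAI_KEY")`.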

### Scraping news from CNBC

In this step, we fetch the 50 latest trending articles from CNBC.

In [5]:
def get_cnbc_content(link):
    newUtil = CNBCNewsUtils()
    try:
        # Get HTML content from an article url
        tree = newUtil._fetch_page(link.replace("https://www.cnbc.com/", ""))
        if "error" in tree:
            return tree
        # Select the main content container by its data-module attribute
        article_content = tree.xpath("//div[contains(@data-module, 'ArticleBody') or contains(@data-module, 'featuredContent')]")
        if not article_content:
            return None
        # Extract only the text from the HTML
        content = article_content[0].text_content()
        return str(content)
    except Exception:
        return None


def get_trending_articles(count=50):
    url = "https://cnbc.p.rapidapi.com/news/v2/list-trending"
    querystring = {"tag" : "Articles", "count" : str(count)}
    # API Key for rapidapi
    headers = {
        "x-rapidapi-key": RAPID_API_KEY,
        "x-rapidapi-host": "cnbc.p.rapidapi.com"
    }
    # Using RapidAPI to get trending news from CNBC
    response = requests.get(url, headers=headers, params=querystring)
    json_data = response.json()
    data = []
    for element in json_data["data"]["mostPopularEntries"]["assets"]:
        # Fetch the full text of the article from its URL
        content = get_cnbc_content(element["url"])
        # Keep only articles whose text was extracted successfully
        if isinstance(content, str):
            data.append({
                "title": element["headline"],
                "description": element["description"],
                "link": element["url"],
                "content": content,
            })

    return data
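
The loop above indexes straight into `json_data["data"]["mostPopularEntries"]["assets"]`, so an error response (for example an invalid key or an exhausted quota) would raise a `KeyError`. A defensive sketch, assuming only the response shape implied by the code above. `extract_assets` is a hypothetical helper, and the sample payload is made up for illustration:

```python
def extract_assets(json_data):
    """Return the list of trending assets, or [] if the payload is malformed."""
    node = json_data
    for key in ("data", "mostPopularEntries", "assets"):
        if not isinstance(node, dict) or key not in node:
            return []
        node = node[key]
    return node if isinstance(node, list) else []


# Made-up payload mirroring the fields read by get_trending_articles
sample = {
    "data": {
        "mostPopularEntries": {
            "assets": [
                {
                    "headline": "Example headline",
                    "description": "Example description",
                    "url": "https://www.cnbc.com/example.html",
                }
            ]
        }
    }
}

print(len(extract_assets(sample)))                     # 1
print(extract_assets({"message": "Invalid API key"}))  # []
```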

### Text embedding with the OpenAI API
We use the OpenAI embeddings API to get an embedding vector for each article. By default, the `text-embedding-3-small` model returns a 1536-dimensional vector.

In [6]:
def get_embedding(data, client):
    new_data = []
    for i, article in enumerate(data):
        try:
            response = client.embeddings.create(
                input=article["content"],
                model="text-embedding-3-small"
            )
            article["embedding"] = response.data[0].embedding
            new_data.append(article)
        except Exception as e:
            # Skip articles that cannot be embedded instead of failing the whole run
            print(f"Skipping article {i}: {e}")
    return new_data
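
Articles on related subjects should get embeddings that point in similar directions. One quick sanity check before clustering is the cosine similarity between two vectors. A minimal pure-Python sketch; the three-dimensional vectors are toy values, not real embeddings:

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


v1 = [1.0, 0.0, 0.0]
v2 = [2.0, 0.0, 0.0]  # same direction as v1
v3 = [0.0, 3.0, 0.0]  # orthogonal to v1

print(cosine_similarity(v1, v2))  # 1.0
print(cosine_similarity(v1, v3))  # 0.0
```

With real data, the inputs would be two entries such as `data[0]["embedding"]` and `data[1]["embedding"]`.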

### Clustering all news into sub-topics

We use a simple k-means algorithm to cluster the article embeddings.

In [7]:
def clustering_data(data, topic_number=10):
    embeddings = torch.FloatTensor(list(map(lambda x: x["embedding"], data)))
    centroids = embeddings[torch.randperm(embeddings.size(0))[:topic_number]]
    # centroids = embeddings[:topic_number]
    num_iterations = 500

    for _ in range(num_iterations):
        # Calculate distances from data points to centroids
        distances = torch.cdist(embeddings, centroids)

        # Assign each data point to the closest centroid
        _, labels = torch.min(distances, dim=1)

        # Update centroids by taking the mean of data points assigned to each centroid
        for i in range(topic_number):
            if torch.sum(labels == i) > 0:
                centroids[i] = torch.mean(embeddings[labels == i], dim=0)

    # Assign a topic ID to each article (as a plain int rather than a 0-d tensor)
    for i in range(len(data)):
        data[i]["topic_id"] = labels[i].item()

    # Group data into sub-topics
    topics = []
    for i in range(topic_number):
        topic_data = list(filter(lambda x: x["topic_id"] == i, data))

        if len(topic_data) > 0:
            full_content = " ".join(list(map(lambda x: x["content"], topic_data)))
            links = list(map(lambda x: x["link"], topic_data))
            topics.append({
                "data": topic_data,
                "links": links,
                "full_content": full_content
            })

    return topics
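
To see the assignment and update steps in isolation, here is a pure-Python sketch of the same loop on four 2-D points. It is an illustration rather than a replacement for the function above: the initial centroids are passed in explicitly instead of being sampled at random, and `kmeans_toy` is a hypothetical name.

```python
import math


def kmeans_toy(points, centroids, num_iterations=10):
    """Run a fixed number of k-means iterations on small lists of tuples."""
    centroids = [tuple(c) for c in centroids]
    labels = []
    for _ in range(num_iterations):
        # Assignment step: index of the nearest centroid for each point
        labels = [
            min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members
        for j in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:  # keep the old centroid if the cluster is empty
                centroids[j] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels, centroids


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
labels, centroids = kmeans_toy(points, centroids=[points[0], points[2]])
print(labels)     # [0, 0, 1, 1]
print(centroids)  # [(0.0, 0.5), (10.0, 10.5)]
```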

### Summarize the content of each topic (using ChatGPT)

In [8]:
def summary_topics(topics, client):
    for i, topic in enumerate(topics):
        full_content = topic["full_content"]

        response = client.chat.completions.create(
            # Maybe gpt-4-*-preview works better for this task, but the cost is too high
            model="gpt-4o-mini",
            messages=[
                {
                    "role": "system",
                    "content": "In this task, you will read several news articles on the same topic, then summarize the most important information on that topic.",
                },
                {
                    "role": "user",
                    "content": full_content
                },
                {
                    "role": "user",
                    "content": "Write a 100-word summary of all the important points in these articles."
                }]
        )
        topics[i]["summary"] = response.choices[0].message.content
    return topics
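
Because `full_content` concatenates every article in a topic, a large cluster can produce a very long prompt. One possible guard, sketched below, is to cap each article before joining. The character budget is an arbitrary, hypothetical value (characters are only a rough proxy for tokens), and `build_topic_content` is not part of the original notebook:

```python
def build_topic_content(articles, max_chars_per_article=4000):
    """Join article texts, keeping at most max_chars_per_article of each."""
    parts = []
    for article in articles:
        text = article["content"].strip()
        parts.append(text[:max_chars_per_article])
    return "\n\n".join(parts)


# Toy articles; real entries would come from the clustered topic data
articles = [{"content": "a" * 10}, {"content": "bbb"}]
short_content = build_topic_content(articles, max_chars_per_article=5)
print(repr(short_content))
```

Under this assumption, the result could stand in for the `" ".join(...)` expression that builds `full_content` in the clustering step.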

### Main Function

In [9]:
# Global variables
client = OpenAI(api_key=OPENAI_KEY)
topic_number = 7

# Get trending articles from CNBC
trending_news = get_trending_articles(count=50)

# Get an embedding vector for each article's content
data = get_embedding(trending_news, client)

# Use the k-means clustering algorithm to group the articles into sub-topics
topics = clustering_data(data, topic_number=topic_number)

In [10]:
# Summarize the most important information in each topic
topics = summary_topics(topics, client)

### Result:
Print all topics with the summary and links:

In [11]:
for index, topic in enumerate(topics):
    print(f"Topic {index+1}:")
    print(topic["summary"])
    print(topic["links"])
    print()

Topic 1:
Walgreens reported fiscal fourth-quarter sales of $37.55 billion, exceeding expectations, while adjusting profit was 39 cents per share. To address financial challenges, the company plans to close 1,200 stores, including 500 by fiscal 2025, aiming for a healthier store base. Amid a net loss of $3 billion attributed to opioid settlements, Walgreens surpassed its $1 billion cost-cutting target for 2024. Despite challenges in pharmacy margins and a soft retail environment, growth occurred in its U.S. healthcare and international segments. The company anticipates adjusted earnings of $1.40 to $1.80 per share for fiscal 2025, with revenue projected between $147 billion and $151 billion.
['https://www.cnbc.com/2024/10/15/walgreens-wba-earnings-q4-2024.html', 'https://www.cnbc.com/2024/10/15/goldman-sachs-gs-earnings-q3-2024.html']

Topic 2:
A recent analysis by GOBankingRates reveals significant variations in retirement expenses across U.S. states, with Hawaii needing the highest av