# News Summary Generator (NSG)

### Dataset

In [5]:
!pip install datasets --quiet

In [6]:
!pip install --upgrade ipywidgets --quiet

In [7]:
from datasets import load_dataset
dataset = load_dataset("cnn_dailymail", "3.0.0")  # Version 3.0.0 includes highlights (summaries)


In [4]:
dataset.shape

{'train': (287113, 3), 'validation': (13368, 3), 'test': (11490, 3)}

In [11]:
dataset["train"][0]['highlights']

"Harry Potter star Daniel Radcliffe gets £20M fortune as he turns 18 Monday .\nYoung actor says he has no plans to fritter his cash away .\nRadcliffe's earnings from first five Potter films have been held in trust fund ."

### Data from scraping (Please see scraper folder)

In [17]:
import json

In [18]:
filename = "/Users/chutszkan/Desktop/Files/FYP/LLM-News-Gen/scaprer/demo_data.json"
with open(filename, "r", encoding="utf-8") as f:
    data = json.load(f)

In [19]:
data

[{'date': '2025/02/06',
  'title': 'Elon Musk promises to Make Tesla Great Again. Investors are buying it – for now',
  'text': 'These should be tough days for Tesla. Sales are down. It’s making less money on every car it does manage to sell. Competitors are eating into its market share, especially internationally. The end of various federal EV policies could cost the company billions. And then there’s the CEO. Elon Musk made the company what it is, turning it from a humble startup into a financial juggernaut. But now there’s near-constant controversy swirling around him as he upends the US government, inserts himself into international politics, plays video games and posts on his social media site, X, drawing his attention away from Tesla in what he himself describes as a “pivotal year” for the company. But you wouldn’t know that by the value of Tesla shares (TSLA), which still dwarfs any other automaker’s market value. Musk took up most of a recent investor call promising imminent, m

### Next step: how to convert this into an "Event Card" (event pattern and attributes)

#### Event Extraction

- Trigger Identification
- Event Argument Identification and Classification

Use an LLM (fine-tuning may be needed).

We use the event patterns as chromosomes, select and cross over chromosomes to evolve the pool of event patterns, and evaluate the chromosomes in the pool with the fitness function.
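The evolutionary loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the chromosome encoding (an event type plus a tuple of argument roles), the single-point crossover operator, the keep-the-fitter-half selection rule, and `toy_fitness` are all made up for the example. The real fitness function is discussed later in this notebook.

```python
import random

# Hypothetical encoding: a chromosome is an event pattern, i.e. an event
# type plus an ordered tuple of argument roles.
def make_pattern(event_type, roles):
    return {"type": event_type, "roles": tuple(roles)}

def crossover(parent_a, parent_b, rng):
    """Single-point crossover on the role tuples of two patterns (assumed operator)."""
    roles_a, roles_b = parent_a["roles"], parent_b["roles"]
    cut_a = rng.randint(0, len(roles_a))
    cut_b = rng.randint(0, len(roles_b))
    # Keep the head of A and the tail of B, dropping duplicate roles in order.
    child_roles = list(dict.fromkeys(roles_a[:cut_a] + roles_b[cut_b:]))
    if not child_roles:  # never produce an empty pattern
        child_roles = list(roles_a)
    return make_pattern(parent_a["type"], child_roles)

def evolve(pool, fitness, generations=10, seed=0):
    """Evolve a pool of patterns; the pool size stays constant."""
    rng = random.Random(seed)
    pool = list(pool)
    size = len(pool)
    for _ in range(generations):
        # Selection: keep the fitter half (at least two parents).
        pool.sort(key=fitness, reverse=True)
        parents = pool[:max(2, size // 2)]
        # Crossover: refill the pool with children of random parent pairs.
        children = [crossover(*rng.sample(parents, 2), rng)
                    for _ in range(size - len(parents))]
        pool = parents + children
    return sorted(pool, key=fitness, reverse=True)

# Toy fitness for demonstration only: reward patterns that cover more target roles.
TARGET_ROLES = {"agent", "time", "place", "result"}

def toy_fitness(pattern):
    return len(TARGET_ROLES & set(pattern["roles"]))

pool = [
    make_pattern("criminality", ["the injured", "time"]),
    make_pattern("criminality", ["agent", "place", "time"]),
    make_pattern("criminality", ["suspect", "crime", "place", "result"]),
    make_pattern("criminality", ["agent", "result"]),
]
evolved = evolve(pool, toy_fitness)
best = evolved[0]
print(best, toy_fitness(best))
```

Because the parents survive each generation, the best fitness in the pool never decreases; whether that kind of elitism is appropriate here is a design choice to revisit once the real fitness function is in place.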

In [34]:
!pip install openai python-dotenv --quiet

In [104]:
EXAMPLE_CASE = """

We use an LLM to generate news summaries. News paragraphs often have long text segments and contain multiple events, each with an implied event pattern, as in the example below. Event pattern evolution is used to enhance the summary generation capability of the LLM.

***Hint***
In the arguments, include the time; if there is no time information, please state it as `Unknown`.

# EXAMPLE #

### Article ###
A 48-year-old farmer was arrested on the evening of April 1 on the Greek island of Chios for allegedly robbing and injuring two underage refugees that day. According to reports, the farmer who settled on Chios Island also fired shots at the sky in 2017 to deter refugees. The two refugees who were injured this time were only 16 years old and both came from Syria. Fortunately, their injuries were not serious and they have been sent to the local hospital for treatment.

### GENERATED EVENT CARD ####

[
    {
        'event_no': 1
        'type': criminality
        'arguments': [
            'the injured': two refugees
            'time': April 1
            'nationality of the injured': Syria
        ]
    },
    {}
    ...
    
    
    
]

### GENERATED EVENT CARD ####

[
  {
    "event_no": 1,
    "type": "criminality",
    "arguments": {
      "the injured": "two refugees",
      "time": "April 1",
      "nationality of the injured": "Syria"
    }
  },
  {
    "event_no": 2,
    "type": "criminality",
    "arguments": {
      "agent": "48-year-old farmer",
      "place": "Greek island of Chios",
      "time": "April 1"
    }
  },
  {
    "event_no": 3,
    "type": "criminality",
    "arguments": {
      "suspect": "48-year-old farmer",
      "crime": "shooting",
      "place": "Chios",
      "result": "arrested"
    }
  },
  ...
]


"""

In [105]:
content = data[0]['text']

In [106]:
TEMPERATURE=0.1

MESSAGES = [
    {"role": "system", "content": f"You are a helpful assistant to create the EVENT CARD based on the content as below: \n {EXAMPLE_CASE}"},
    {"role": "user", "content": f"Generate the event cards based on the content below \n {content}"},
]

In [117]:
content

'These should be tough days for Tesla. Sales are down. It’s making less money on every car it does manage to sell. Competitors are eating into its market share, especially internationally. The end of various federal EV policies could cost the company billions. And then there’s the CEO. Elon Musk made the company what it is, turning it from a humble startup into a financial juggernaut. But now there’s near-constant controversy swirling around him as he upends the US government, inserts himself into international politics, plays video games and posts on his social media site, X, drawing his attention away from Tesla in what he himself describes as a “pivotal year” for the company. But you wouldn’t know that by the value of Tesla shares (TSLA), which still dwarfs any other automaker’s market value. Musk took up most of a recent investor call promising imminent, major advances on self-driving cars, robotaxis and humanoid robots and promising that next year would be “epic,” rather than addr

In [107]:
import openai
from dotenv import load_dotenv
import os

In [None]:
load_dotenv(override=True)
MODEL_NAME = os.getenv('MODEL_NAME')
OPENAI_API_KEY = os.getenv('OPENAI_API_KEY')
OPENAI_ORGANIZATION = os.getenv('OPENAI_ORGANIZATION')
BASE_URL = os.getenv('OPENAI_BASE_URL')
# TWITTER_API_KEY = os.getenv('TWITTER_API_KEY')
# TWITTER_API_SECRET = os.getenv('TWITTER_API_SECRET')
# TWITTER_ACCESS_TOKEN = os.getenv('TWITTER_ACCESS_TOKEN')
# TWITTER_ACCESS_TOKEN_SECRET = os.getenv('TWITTER_ACCESS_TOKEN_SECRET')
# TWITTER_BEARER_TOKEN = os.getenv('TWITTER_BEARER_TOKEN')

In [109]:
response = openai.chat.completions.create(
    model=MODEL_NAME,
    messages=MESSAGES,
    temperature=TEMPERATURE
)

print(response.choices[0].message.content)

[
    {
        "event_no": 1,
        "type": "business_challenge",
        "arguments": {
            "company": "Tesla",
            "challenge": "Sales are down",
            "time": "Unknown"
        }
    },
    {
        "event_no": 2,
        "type": "business_challenge",
        "arguments": {
            "company": "Tesla",
            "challenge": "Making less money on each car sold",
            "time": "Unknown"
        }
    },
    {
        "event_no": 3,
        "type": "market_competition",
        "arguments": {
            "company": "Tesla",
            "competitor_effect": "Competitors are eating into its market share, especially internationally",
            "time": "Unknown"
        }
    },
    {
        "event_no": 4,
        "type": "policy_impact",
        "arguments": {
            "company": "Tesla",
            "policy": "The end of various federal EV policies could cost the company billions",
            "time": "Unknown"
        }
    },
    {
        "e

In [114]:
import json
json_string = response.choices[0].message.content
# Use a separate name so the scraped articles in `data` are not overwritten
event_cards = json.loads(json_string)

In [116]:
event_cards[0]

{'event_no': 1,
 'type': 'business_challenge',
 'arguments': {'company': 'Tesla',
  'challenge': 'Sales are down',
  'time': 'Unknown'}}
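`json.loads` raises an error if the model wraps its answer in a Markdown code fence or adds a sentence before the array. A defensive parser could look like the sketch below. `parse_event_cards` is a hypothetical helper, and the rule "slice from the first `[` to the last `]`" is an assumption that breaks if the surrounding commentary itself contains square brackets.

```python
import json

FENCE = "`" * 3  # Markdown code-fence marker, used only to build the demo reply

def parse_event_cards(raw):
    """Parse a model reply into a list of event-card dicts (hypothetical helper)."""
    text = raw.strip()
    # Slice from the first "[" to the last "]" to drop fences or commentary.
    start, end = text.find("["), text.rfind("]")
    if start == -1 or end < start:
        raise ValueError("no JSON array found in model reply")
    cards = json.loads(text[start:end + 1])
    if not isinstance(cards, list):
        raise ValueError("expected a JSON array of event cards")
    # Keep only cards that have the three keys used in the prompt example.
    required = {"event_no", "type", "arguments"}
    return [c for c in cards if isinstance(c, dict) and required <= c.keys()]

# Made-up reply that imitates a fenced answer with one malformed entry.
reply = (
    FENCE + "json\n"
    '[{"event_no": 1, "type": "business_challenge", '
    '"arguments": {"company": "Tesla", "time": "Unknown"}}, '
    '{"note": "malformed entry"}]\n' + FENCE
)
cards = parse_event_cards(reply)
print(cards)
```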

### Next: fitness (adaptation) assessment (accuracy / comprehensiveness?)

- Need to find evaluation methods to calculate the fitness score
  - Comprehensiveness (a higher score means more events are covered; depends on the attributes)
  - Accuracy (can be measured by word frequency in the original article, cosine similarity between vectors, and source reliability)
    - In this paper, we evaluate the salience and reliability of the thesis meta-role in terms of frequency of occurrence and importance, respectively, and design the fitness function. The salience and reliability of the thesis meta-role determine the accuracy and comprehensiveness of the generated summary.
    - The fitness function Q is defined as the weighting of the TFIDF score and TextRank score of the argumentative meta-role in the event pattern. (paper, page 7)
- Then use the higher-scoring event patterns to generate the summary (crossover)
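Two of the accuracy signals listed above, word frequency in the original article and cosine similarity, can be prototyped with the standard library alone. The sketch below is not the paper's fitness function Q; the function names, the lowercase alphanumeric tokenization, and the choice to skip `Unknown` values are assumptions made for illustration.

```python
import math
import re
from collections import Counter

def tokens(text):
    """Lowercase alphanumeric tokens (a simplifying assumption)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def card_text(card):
    """Join the argument values of an event card, skipping 'Unknown' placeholders."""
    return " ".join(str(v) for v in card["arguments"].values() if v != "Unknown")

def cosine_similarity(text_a, text_b):
    """Cosine similarity between the bag-of-words vectors of two texts."""
    a, b = Counter(tokens(text_a)), Counter(tokens(text_b))
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def grounded_ratio(card, article):
    """Share of argument-value tokens that literally occur in the article."""
    article_vocab = set(tokens(article))
    card_tokens = tokens(card_text(card))
    if not card_tokens:
        return 0.0
    return sum(t in article_vocab for t in card_tokens) / len(card_tokens)

article = "These should be tough days for Tesla. Sales are down."
card = {
    "event_no": 1,
    "type": "business_challenge",
    "arguments": {"company": "Tesla", "challenge": "Sales are down", "time": "Unknown"},
}
ratio = grounded_ratio(card, article)
similarity = cosine_similarity(card_text(card), article)
print(f"grounded ratio: {ratio:.2f}, cosine similarity: {similarity:.4f}")
```

A low grounded ratio flags argument values that the model paraphrased or invented, which is one cheap way to catch unsupported content before the event pattern is scored.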

### Why not generate directly with the LLM

- Challenges such as accuracy and coverage.
- Too much content from a large number of event representations.
  - Since a piece of news text often has a large amount of data, direct generation of event patterns may introduce noise and phantom data due to too many argumentative roles, and may cause important argumentative elements to be missing. So we extract different patterns from the different event contents contained in it, use a genetic algorithm to evolve the extracted event patterns, and evaluate the final patterns by the fitness function. The highest-scoring patterns are input to the LLM to generate news summaries.

### Pros of this model

- Can explain how the LLM generates the summary

#### Formula:

![Remote Image](../image/formula.jpg)

In [120]:
!pip install numpy networkx scikit-learn nltk --quiet

In [126]:
!pip install transformers --quiet

In [123]:
import nltk
nltk.download('punkt', download_dir='/Users/chutszkan/miniforge3/envs/news-summarizer/nltk_data')

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/chutszkan/miniforge3/envs/news-
[nltk_data]     summarizer/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [127]:
import numpy as np
import networkx as nx
from sklearn.feature_extraction.text import TfidfVectorizer
from transformers import AutoTokenizer

# Load the transformer-based tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Given event
event = {
    "event_no": 1,
    "type": "business_challenge",
    "arguments": {
        "company": "Tesla",
        "challenge": "Sales are down",
        "time": "Unknown"
    }
}

# Given content
content = """These should be tough days for Tesla. Sales are down. It’s making less money on every car it does manage to sell. 
Competitors are eating into its market share, especially internationally. The end of various federal EV policies could cost the company billions.
And then there’s the CEO. Elon Musk made the company what it is, turning it from a humble startup into a financial juggernaut. 
But now there’s near-constant controversy swirling around him as he upends the US government, inserts himself into international politics, 
plays video games and posts on his social media site, X, drawing his attention away from Tesla in what he himself describes as a “pivotal year” for the company.
"""

# Sentence splitting: a naive split on ". ". The BERT tokenizer produces
# word pieces, not sentence boundaries, so it is not used for this step.
def bert_sentence_tokenizer(text):
    return text.split('. ')

sentences = bert_sentence_tokenizer(content)

# Extract sentences related to the event
keywords = ["sales", "down", "decline", "market share"]
event_sentences = [s for s in sentences if any(kw in s.lower() for kw in keywords)]

# Compute TF-IDF Scores
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(event_sentences)
tfidf_scores = np.mean(tfidf_matrix.toarray(), axis=1)

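# Compute TextRank scores: PageRank over a sentence-similarity graph
dense_tfidf = tfidf_matrix.toarray()  # convert the sparse matrix once, not on every loop iteration
graph = nx.Graph()
graph.add_nodes_from(range(len(event_sentences)))  # keep every sentence index, even with a single sentence
for i in range(len(event_sentences)):
    for j in range(i + 1, len(event_sentences)):
        # TF-IDF rows are L2-normalized by default, so the dot product is the cosine similarity
        graph.add_edge(i, j, weight=np.dot(dense_tfidf[i], dense_tfidf[j]))

pagerank_scores = nx.pagerank(graph)
textrank_scores = np.array([pagerank_scores[i] for i in range(len(event_sentences))])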

# Compute the final fitness function
alpha, beta = 0.5, 0.5  # Hyperparameters
fitness_scores = alpha * tfidf_scores + beta * textrank_scores

# Print results
for i, sentence in enumerate(event_sentences):
    print(f"Sentence: {sentence}")
    print(f"TF-IDF Score: {tfidf_scores[i]:.4f}, TextRank Score: {textrank_scores[i]:.4f}, Fitness Score: {fitness_scores[i]:.4f}\n")


None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.


Sentence: Sales are down
TF-IDF Score: 0.1557, TextRank Score: 0.5000, Fitness Score: 0.3279

Sentence: 
Competitors are eating into its market share, especially internationally
TF-IDF Score: 0.2715, TextRank Score: 0.5000, Fitness Score: 0.3858
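The second sentence in the output above starts with a line break because splitting on `'. '` neither trims whitespace nor handles a period followed by a newline. A slightly more careful splitter is sketched below. `split_sentences` is a hypothetical helper and still naive (abbreviations such as "U.S." would be split incorrectly); a trained sentence tokenizer, such as the punkt model downloaded earlier, is likely the more robust option.

```python
import re

def split_sentences(text):
    """Naive splitter: break after ., ! or ? followed by whitespace, then trim."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p.strip() for p in parts if p.strip()]

sample = (
    "Sales are down. It's making less money on every car it does manage to sell. \n"
    "Competitors are eating into its market share, especially internationally."
)
result = split_sentences(sample)
for sentence in result:
    print(repr(sentence))
```

Unlike the split on `'. '`, this version keeps the terminating period on each sentence, which does not affect the TF-IDF vocabulary because the vectorizer ignores punctuation.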



#### One more step is needed: our event patterns come from more than one source, so we may need to classify the events and use a type-specific algorithm to sort them
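One possible shape for this step is sketched below: merge event cards from several sources, group them by `type`, collapse exact duplicates, and rank within each group. The `source` and `fitness` fields, the function name, the exact-match duplicate rule, and the ranking key are all assumptions; the sample values are made up. Which sorting algorithm suits each event type is still open.

```python
from collections import defaultdict

def group_and_rank(cards):
    """Group event cards by type, merge exact duplicates, and rank each group.

    Assumes each card carries hypothetical 'source' and 'fitness' fields in
    addition to the 'type' and 'arguments' fields of an event card.
    """
    groups = defaultdict(dict)
    for card in cards:
        # Two cards count as the same event if type and arguments match exactly.
        key = tuple(sorted(card["arguments"].items()))
        seen = groups[card["type"]]
        if key in seen:
            seen[key]["sources"].add(card["source"])
            seen[key]["fitness"] = max(seen[key]["fitness"], card["fitness"])
        else:
            seen[key] = {
                "type": card["type"],
                "arguments": card["arguments"],
                "sources": {card["source"]},
                "fitness": card["fitness"],
            }
    # Rank: events reported by more sources first, then by higher fitness.
    return {
        etype: sorted(seen.values(),
                      key=lambda c: (len(c["sources"]), c["fitness"]),
                      reverse=True)
        for etype, seen in groups.items()
    }

# Made-up cards from two hypothetical sources.
sample_cards = [
    {"type": "business_challenge", "source": "source_a", "fitness": 0.33,
     "arguments": {"company": "Tesla", "challenge": "Sales are down"}},
    {"type": "business_challenge", "source": "source_b", "fitness": 0.40,
     "arguments": {"company": "Tesla", "challenge": "Sales are down"}},
    {"type": "business_challenge", "source": "source_a", "fitness": 0.45,
     "arguments": {"company": "Tesla", "challenge": "Making less money on each car sold"}},
    {"type": "market_competition", "source": "source_b", "fitness": 0.39,
     "arguments": {"company": "Tesla", "competitor_effect": "Losing market share"}},
]
ranked = group_and_rank(sample_cards)
for etype, events in ranked.items():
    print(etype, [(e["arguments"], sorted(e["sources"]), e["fitness"]) for e in events])
```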