<a href="https://colab.research.google.com/github/maggoatt/Grounded-Text-Summarization-of-Research-Papers/blob/main/Summarization_Model_Pipeline.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Summarization Model Workflow

- Baseline: TextRank
- Advanced: Facebook BART (Large-CNN)

High-level pipeline:

1. Take in the selected paper (i.e., from the `streamlit` file).
2. Chunk the paper with a sliding window (roughly 1k tokens), noting the section titles covered by each chunk.
3. Generate a summary per chunk with each model and stitch the summaries together.
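
Step 2 could be sketched roughly as follows. This is only an illustration under stated assumptions: whitespace-separated words stand in for model tokens, the input is assumed to be a list of `(section_title, text)` pairs, and the names `sliding_window_chunks`, `window`, and `stride` are hypothetical rather than part of the notebook.

```python
from typing import Dict, List, Tuple


def sliding_window_chunks(
    sections: List[Tuple[str, str]],
    window: int = 1000,
    stride: int = 800,
) -> List[Dict[str, object]]:
    """Chunk a paper into overlapping windows and record each chunk's section titles.

    Tokens are whitespace-separated words here (an assumption standing in
    for the model tokenizer). Consecutive chunks overlap by window - stride tokens.
    """
    if window <= 0 or not 0 < stride <= window:
        raise ValueError("need window > 0 and 0 < stride <= window")

    # Flatten the paper into one token stream, remembering each token's section.
    tokens: List[str] = []
    titles: List[str] = []
    for title, text in sections:
        for tok in text.split():
            tokens.append(tok)
            titles.append(title)

    chunks: List[Dict[str, object]] = []
    start = 0
    while start < len(tokens):
        end = min(start + window, len(tokens))
        # Keep section titles in first-seen order, without duplicates.
        seen = list(dict.fromkeys(titles[start:end]))
        chunks.append({"text": " ".join(tokens[start:end]), "sections": seen})
        if end == len(tokens):
            break
        start += stride
    return chunks


demo = [("Intro", "a b c d e f"), ("Method", "g h i j")]
for chunk in sliding_window_chunks(demo, window=4, stride=3):
    print(chunk["sections"], chunk["text"])
```

A stride smaller than the window gives neighbouring chunks some shared context, at the cost of summarizing the overlapping tokens twice.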

### Citations/references:

1. Workflow to implement TextRank: 

Adapted from: Erraji, Yassine (June 19, 2025). ["Understanding TextRank: A Deep Dive into Graph-Based Text Summarization and Keyword Extraction"](https://medium.com/@yassineerraji/understanding-textrank-a-deep-dive-into-graph-based-text-summarization-and-keyword-extraction-905d1fb5d266).
Medium Article.

2. Workflow to implement Facebook BART:

Adapted from: Lewis, Mike _et al._ (Accessed February 2026). ["BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension"](https://huggingface.co/facebook/bart-large-cnn).
Hugging Face Documentation.

Adapted from: baksapeter (April 11, 2025). ["Maximum number of input tokens"](https://huggingface.co/facebook/bart-large-cnn/discussions/83). Hugging Face Discussion.

3. Misc. syntax: scikit-learn documentation



In [2]:
# installing dependencies

%pip install scikit-learn networkx transformers torch # TextRank (scikit-learn, networkx) and BART (transformers, torch)


Note: you may need to restart the kernel to use updated packages.


In [8]:
# imports

# TextRank
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import networkx as nx
import json

# BART
import torch
from transformers import AutoTokenizer, BartForConditionalGeneration

## TextRank Pipeline:
1. Extract + concatenate text from the selected paper (to be referenced from the JSON object created by the UI/API request)
2. Split the extracted + concatenated text into sentences
3. Create a similarity graph of the sentences (TF-IDF vectors compared by cosine similarity)
4. Run PageRank on the graph
5. Rank the sentences by score, keep the top k, and output the final summary

Additionally, preserve which section each sentence originated from (for later analysis/retrieval purposes).
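
To make steps 3 and 4 concrete, here is a minimal standard-library sketch. It is not the notebook's implementation, which relies on scikit-learn and networkx: this version uses raw word counts instead of TF-IDF weights, drops self-loops, and runs a fixed number of power iterations. The names `cosine` and `textrank` are illustrative.

```python
import math
import re
from collections import Counter
from typing import List


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def textrank(sentences: List[str], damping: float = 0.85, iters: int = 50) -> List[float]:
    """Score sentences with weighted PageRank over a sentence-similarity graph."""
    n = len(sentences)
    if n == 0:
        return []
    bags = [Counter(re.findall(r"\w+", s.lower())) for s in sentences]
    # Weighted adjacency matrix: edge weight = similarity, no self-loops.
    w = [[cosine(bags[i], bags[j]) if i != j else 0.0 for j in range(n)] for i in range(n)]
    out = [sum(row) for row in w]  # total outgoing weight per sentence
    scores = [1.0 / n] * n
    for _ in range(iters):
        new_scores = []
        for i in range(n):
            # Each neighbour j passes on its score in proportion to the edge weight.
            rank = sum(w[j][i] / out[j] * scores[j] for j in range(n) if out[j] > 0)
            new_scores.append((1 - damping) / n + damping * rank)
        scores = new_scores
    return scores
```

The top-k sentence indices can then be read off with `sorted(range(len(scores)), key=scores.__getitem__, reverse=True)[:k]`. A sentence that shares no words with any other only ever receives the base score `(1 - damping) / n`.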

In [9]:
# extracting content from paper
test_path = "data/249953535.json"
k = 5 # number of sentences to keep in the summary

with open(test_path, 'r', encoding='utf-8') as f:
    paper = json.load(f) # load the selected paper's json file
    
cid = paper["corpusid"]
body_text = []
section_map = {} # preserving sentences' og section

# current method: concatenate all paragraphs from just the body section together. no splitting by section
for section in paper["sections"]: # (1) extract and concatenate text from selected paper
    section_title = section["section_title"]
    # naive sentence split on '.', '?' and '!' (also splits on abbreviations and decimals), then strip surrounding whitespace
    sentences = [s.strip() for s in section["text"].replace('?', '.').replace('!', '.').split('.')]

    for sentence in sentences:
        if sentence:
            section_map[len(body_text)] = section_title  # track section of sentence based on index of sentence
            body_text.append(sentence)
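
Splitting on every period, as the cell above does, also breaks text at abbreviations such as "et al." and at decimal points. One possible alternative is sketched below with the standard `re` module. It is an assumption-laden heuristic, not a full sentence tokenizer: the `ABBREVIATIONS` tuple and the name `split_sentences` are hypothetical, and a sentence that genuinely ends in a listed abbreviation gets merged with the one after it.

```python
import re
from typing import List

# Hypothetical abbreviation list; extend it for the corpus at hand.
ABBREVIATIONS = ("et al.", "e.g.", "i.e.", "Fig.", "Eq.", "vs.")

# A boundary is sentence-final punctuation, then whitespace, then an uppercase letter.
_BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")


def split_sentences(text: str) -> List[str]:
    """Split text into sentences, keeping the terminal punctuation."""
    sentences: List[str] = []
    for piece in _BOUNDARY.split(text.strip()):
        if sentences and sentences[-1].endswith(ABBREVIATIONS):
            # The previous piece ended in an abbreviation, so this is not a real boundary.
            sentences[-1] += " " + piece
        else:
            sentences.append(piece)
    return [s for s in sentences if s]
```

Because this variant keeps the terminal punctuation, a summary built from its output would be joined with `" ".join(...)` rather than `". ".join(...)`.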

In [6]:
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(body_text) # (2) grab doc-term mtx, treating each sentence as a document in body_text corpus
similarity_mtx = cosine_similarity(X) # (3) cosine sim on sentences based on word importance
graph = nx.from_numpy_array(similarity_mtx)

scores = nx.pagerank(graph) # (4) score sentences via PageRank

ranked = sorted(((scores[i], s, section_map[i]) for i, s in enumerate(body_text)), reverse=True) # sentences and section name ranked by highest scores

summary = ". ".join([s for _, s, _ in ranked[:k]]) + "."
print(summary) # (5)

We thus use only the standard episodes from the base NovelCraft dataset for training and validation to maximize diversity in this task's test set, which contains standard episodes and all available gameplay novelty episodes. In our evaluation we assume prior knowledge of the number of new classes in the unlabeled set, but future evaluations may utilize the method in Vaze et al. To handle this, we only score images in the filtered subset of the test set, where we are more confident in the label. In the event of a novel change in the environment, such as some trees not producing rubber, the agent must be able to change multiple steps in the solution, such as breaking those trees for wood and not using the tree tap on them, to successfully complete the modified task. In addition to the Fence and Tree novelties, we add new Supplier and Thief gameplay novelties, which are exclusive to this task due to only being labeled at the episode level instead of at the frame level.


In [10]:
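# export summary to .txt document for benchmarking analysis:
import os

os.makedirs("./summaries", exist_ok=True) # create the output folder if it does not exist yet
file_path = f"./summaries/{cid}_textrank_summary.txt"

with open(file_path, 'w', encoding='utf-8') as file:
    file.write(summary)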

## Facebook BART Pipeline:
1. Load the tokenizer and model for Facebook BART (large-CNN)
2. Extract + concatenate text from the selected paper
3. Check whether the token count exceeds Facebook BART's maximum input length (1024 tokens)
4. If the token count is > 1024, summarize each section separately, then regroup and re-summarize the section summaries until the combined text fits within 1024 tokens. Otherwise, summarize the entire input directly
5. Output the final summary
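
The regrouping used when the combined summaries are still too long can be isolated as a small helper. The sketch below mirrors the greedy grouping loop in the cell that follows, under one assumption: a whitespace word count stands in for the tokenizer so the example runs on its own. The name `group_by_budget` is hypothetical; in the notebook, `get_token_count` would be passed as `count_tokens`.

```python
from typing import Callable, List


def group_by_budget(
    texts: List[str],
    budget: int,
    count_tokens: Callable[[str], int] = lambda t: len(t.split()),
) -> List[str]:
    """Greedily pack consecutive texts into groups whose token total stays within budget.

    A single text longer than the budget still gets a group of its own.
    """
    groups: List[str] = []
    current: List[str] = []
    used = 0
    for text in texts:
        n = count_tokens(text)
        if current and used + n > budget:
            # Adding this text would overflow the budget: close the current group.
            groups.append(" ".join(current))
            current, used = [], 0
        current.append(text)
        used += n
    if current:
        groups.append(" ".join(current))
    return groups


print(group_by_budget(["a b", "c d", "e f g", "h"], budget=4))
```

Summing per-text counts only approximates the token count of the joined group, since the tokenizer may add special tokens per call, so a group can land slightly above or below the budget.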

In [57]:
model_name = "facebook/bart-large-cnn" # (1)

full_body_text = ". ".join(body_text) # (2) turn the list of sentences into string

tokenizer = AutoTokenizer.from_pretrained(model_name)
bart_model = BartForConditionalGeneration.from_pretrained(model_name)
max_token_count = 1024 # BART's actual positional encoding limit

def summarize(text, max_new_tokens=300, min_new_tokens=20):
    """Summarize a single chunk of text using BART (input beyond 1024 tokens is truncated)."""
    inputs = tokenizer(text, return_tensors="pt", max_length=max_token_count, truncation=True)
    summary_ids = bart_model.generate(
        inputs["input_ids"],
        attention_mask=inputs["attention_mask"], # pass the mask explicitly instead of relying on it being inferred
        max_new_tokens=max_new_tokens,
        min_new_tokens=min_new_tokens,
        num_beams=4,
        length_penalty=2.0,
        forced_bos_token_id=0
    )
    return tokenizer.decode(summary_ids[0], skip_special_tokens=True)

def get_token_count(text):
    return len(tokenizer.encode(text, truncation=False))

def reduce_summaries(texts, round_num=1):
    """
    Recursively summarize until the combined text fits within 1024 tokens.
    
    1. Summarize each chunk individually
    2. Concatenate the summaries
    3. If still > 1024 tokens, group the summaries into chunks of at most 1024 tokens and repeat
    4. Once <= 1024 tokens, return the concatenated summaries as the final summary
    """
    print(f"--- Round {round_num}: summarizing {len(texts)} chunks ---")
    
    chunk_summaries = []
    for i, text in enumerate(texts):
        tc = get_token_count(text)
        summary = summarize(text)
        print(f"  chunk {i+1}/{len(texts)}: {tc} tokens -> {get_token_count(summary)} tokens")
        chunk_summaries.append(summary)
    
    # combine all summaries into one text
    combined = " ".join(chunk_summaries)
    combined_tokens = get_token_count(combined)
    print(f"  combined result: {combined_tokens} tokens")
    
    if combined_tokens <= max_token_count:
        # fits within limit — concatenate and return as-is
        print(f"  fits within {max_token_count} tokens, concatenating summaries")
        return combined
    else:
        # still too long — group summaries into 1024-token chunks and recurse
        print(f"  still > {max_token_count} tokens, splitting again...\n")
        groups = []
        current_group = []
        current_tokens = 0
        for s in chunk_summaries:
            s_tokens = get_token_count(s)
            if current_tokens + s_tokens > max_token_count and current_group:
                groups.append(" ".join(current_group))
                current_group = [s]
                current_tokens = s_tokens
            else:
                current_group.append(s)
                current_tokens += s_tokens
        if current_group:
            groups.append(" ".join(current_group))
        
        return reduce_summaries(groups, round_num + 1)

# --- run the pipeline ---
token_count = get_token_count(full_body_text)
print(f"total tokens: {token_count}\nmax allowed tokens: {max_token_count}\n")

if token_count > max_token_count:
    # step 1: summarize each section individually (preserves 1:1 mapping with section titles)
    section_texts = [section["text"] for section in paper["sections"]]
    summaries = []
    print(f"--- Summarizing {len(section_texts)} sections ---")
    for i, text in enumerate(section_texts):
        tc = get_token_count(text)
        s = summarize(text)
        print(f"  section {i+1}/{len(section_texts)}: {tc} tokens -> {get_token_count(s)} tokens")
        summaries.append(s)
    
    # step 2: combine section summaries and reduce until it fits in 1024 tokens
    combined = " ".join(summaries)
    combined_tokens = get_token_count(combined)
    print(f"\ncombined section summaries: {combined_tokens} tokens")
    
    if combined_tokens <= max_token_count:
        # already fits — concatenate and use as final summary
        print(f"  fits within {max_token_count} tokens, concatenating summaries")
        summary_text = combined
    else:
        # need more reduction rounds
        print(f"  still > {max_token_count} tokens, entering reduction loop...\n")
        groups = []
        current_group = []
        current_tokens = 0
        for s in summaries:
            s_tokens = get_token_count(s)
            if current_tokens + s_tokens > max_token_count and current_group:
                groups.append(" ".join(current_group))
                current_group = [s]
                current_tokens = s_tokens
            else:
                current_group.append(s)
                current_tokens += s_tokens
        if current_group:
            groups.append(" ".join(current_group))
        
        summary_text = reduce_summaries(groups, round_num=2)
else:
    # small enough to summarize directly
    summary_text = summarize(full_body_text)
    summaries = [summary_text]

print(f"\n{'='*80}")
print("FINAL SUMMARY:")
print(f"{'='*80}")
print(summary_text)

total tokens: 8272
max allowed tokens: 1024

--- Summarizing 29 sections ---
  section 1/29: 708 tokens -> 60 tokens
  section 2/29: 319 tokens -> 77 tokens
  section 3/29: 553 tokens -> 52 tokens
  section 4/29: 617 tokens -> 47 tokens
  section 5/29: 474 tokens -> 56 tokens
  section 6/29: 274 tokens -> 74 tokens
  section 7/29: 140 tokens -> 46 tokens
  section 8/29: 506 tokens -> 68 tokens
  section 9/29: 76 tokens -> 45 tokens
  section 10/29: 522 tokens -> 65 tokens
  section 11/29: 399 tokens -> 55 tokens
  section 12/29: 170 tokens -> 62 tokens
  section 13/29: 152 tokens -> 69 tokens
  section 14/29: 437 tokens -> 51 tokens
  section 15/29: 304 tokens -> 48 tokens
  section 16/29: 152 tokens -> 66 tokens
  section 17/29: 300 tokens -> 64 tokens
  section 18/29: 265 tokens -> 86 tokens
  section 19/29: 318 tokens -> 60 tokens
  section 20/29: 99 tokens -> 54 tokens
  section 21/29: 25 tokens -> 25 tokens
  section 22/29: 366 tokens -> 50 tokens
  section 23/29: 187 tokens -> 72

In [58]:
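# export summary to .txt document for benchmarking analysis:
import os

os.makedirs("./summaries", exist_ok=True) # create the output folder if it does not exist yet
file_path = f"./summaries/{cid}_bart_summary.txt"

with open(file_path, 'w', encoding='utf-8') as file:
    file.write(summary_text)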

## Benchmarking Analysis

- For grammar, readability, and clarity: 
  - [LanguageTool API](https://languagetool.org/http-api/) - Grammar and style checking
  - [Textstat](https://textstat.org/) - Readability scores (Flesch-Kincaid, SMOG, etc.)
  - [Perplexity (Hugging Face)](https://huggingface.co/docs/transformers/en/perplexity) - Fluency proxy via GPT-2
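
As a rough idea of what the readability and fluency metrics compute, here is a standard-library sketch. It is an approximation under stated assumptions, not Textstat's or Hugging Face's implementation: the syllable counter is a crude vowel-group heuristic, so its scores will differ from Textstat's, and `perplexity` expects per-token natural-log probabilities that would in practice come from a language model such as GPT-2.

```python
import math
import re
from typing import List


def count_syllables(word: str) -> int:
    """Rough syllable estimate: number of consecutive-vowel groups, at least 1."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def flesch_kincaid_grade(text: str) -> float:
    """Flesch-Kincaid grade: 0.39 * words/sentences + 11.8 * syllables/words - 15.59."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * len(words) / len(sentences) + 11.8 * syllables / len(words) - 15.59


def perplexity(token_log_probs: List[float]) -> float:
    """Perplexity = exp of the mean negative log-probability per token."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))
```

Lower values on both metrics point to simpler or more fluent text, which makes them convenient for comparing the TextRank and BART summaries side by side.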
