<a href="https://colab.research.google.com/github/maggoatt/Grounded-Text-Summarization-of-Research-Papers/blob/main/Summarization_Model_Pipeline.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Summarization Model Workflow

- Baseline: TextRank
- Advanced: Facebook BART (Large-CNN)

High-level pipeline:
1. Take in the selected paper (i.e. from the `streamlit` file)
2. Chunk the paper with a sliding window (roughly 1k tokens), noting the section titles covered by each chunk
3. Generate a summary per chunk per model and stitch the summaries together
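
Step 2 can be sketched as follows. This is a minimal illustration, not the notebook's implementation: it packs whole paragraphs greedily instead of sliding an overlapping window, and it counts whitespace-separated words as a stand-in for model tokens. The function name `chunk_by_budget` and the chunk dictionary layout are hypothetical.

```python
def chunk_by_budget(paragraphs, max_tokens=1000, count_tokens=None):
    """Greedily pack paragraphs into chunks of at most `max_tokens` tokens.

    Each chunk records the section titles it covers, in order of appearance.
    A single paragraph longer than the budget becomes its own oversized chunk.
    """
    if count_tokens is None:
        # Assumption: whitespace-separated words approximate model tokens.
        count_tokens = lambda text: len(text.split())

    chunks = []
    current = {"sections": [], "texts": [], "tokens": 0}
    for para in paragraphs:
        n = count_tokens(para["text"])
        # Start a new chunk when adding this paragraph would exceed the budget.
        if current["texts"] and current["tokens"] + n > max_tokens:
            chunks.append(current)
            current = {"sections": [], "texts": [], "tokens": 0}
        if para["section"] not in current["sections"]:
            current["sections"].append(para["section"])
        current["texts"].append(para["text"])
        current["tokens"] += n
    if current["texts"]:
        chunks.append(current)
    return chunks


# Toy paragraphs shaped like S2ORC `body_text` entries.
body = [
    {"section": "Introduction", "text": "one two three"},
    {"section": "Introduction", "text": "four five"},
    {"section": "Methodology", "text": "six seven eight nine"},
]
chunks = chunk_by_budget(body, max_tokens=6)
for chunk in chunks:
    print(chunk["sections"], chunk["tokens"])
```

Passing a real tokenizer-based counter as `count_tokens` would make the budget match the model's own token count.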

### Citations/references:

1. Workflow to implement TextRank: 

Adapted from: ERRAJI, Yassine (June 19 2025). ["Understanding TextRank: A Deep Dive into Graph-Based Text Summarization and Keyword Extraction"](https://medium.com/@yassineerraji/understanding-textrank-a-deep-dive-into-graph-based-text-summarization-and-keyword-extraction-905d1fb5d266).
Medium Article.

2. Workflow to implement Facebook BART:

Adapted from: Lewis, Mike _et al._ (Accessed February 2026). ["BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension"](https://huggingface.co/facebook/bart-large-cnn).
Hugging Face Documentation.

Adapted from: baksapeter (April 11, 2025). ["Maximum number of input tokens"](https://huggingface.co/facebook/bart-large-cnn/discussions/83). Hugging Face Discussion.

3. Misc. syntax: scikit-learn documentation



### Sample S2ORC JSON formatting
```json
{
  "paper_id": "553755490",
  "header": {
    "title": "Example Title of a Scientific Paper",
    "authors": [
      {"first": "John", "middle": ["A"], "last": "Doe", "affiliation": {"name": "University of Example"}},
      {"first": "Jane", "middle": [], "last": "Smith", "affiliation": {"name": "Example Institute"}}
    ]
  },
  "abstract": [
    {
      "text": "This is an example of the abstract text in the S2ORC corpus.",
      "cite_spans": [],
      "ref_spans": []
    }
  ],
  "body_text": [
    {
      "section": "Introduction",
      "text": "This paragraph represents the body text. It can contain citations like (Doe et al., 2020) and references to figures or tables.",
      "cite_spans": [
        {
          "start": 50,
          "end": 65,
          "text": "(Doe et al., 2020)",
          "ref_id": "BIBREF0"
        }
      ],
      "ref_spans": []
    },
    {
      "section": "Methodology",
      "text": "Details of the method, including reference to Figure 1.",
      "cite_spans": [],
      "ref_spans": [
        {
          "start": 40,
          "end": 48,
          "text": "Figure 1",
          "ref_id": "FIGREF0"
        }
      ]
    }
  ],
  "bib_entries": {
    "BIBREF0": {
      "title": "A seminal paper on the subject",
      "authors": ["Doe", "J.", "Smith", "J."],
      "year": 2020,
      "venue": "Journal of Examples",
      "volume": "1",
      "pages": "100-110"
    }
  },
  "ref_entries": {
    "FIGREF0": {
      "num": "1",
      "type": "figure",
      "text": "A description of Figure 1"
    }
  }
}
```
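
To show how these fields fit together, the sketch below groups paragraph text by section and resolves each citation span's `ref_id` against `bib_entries`. It assumes only the field names shown in the sample; the helper names `text_by_section` and `cited_titles` and the trimmed-down paper are illustrative.

```python
import json

# A trimmed-down paper using the same field names as the sample above.
paper_json = """
{
  "body_text": [
    {"section": "Introduction", "text": "First paragraph.",
     "cite_spans": [{"start": 0, "end": 5, "text": "First", "ref_id": "BIBREF0"}]},
    {"section": "Introduction", "text": "Second paragraph.", "cite_spans": []},
    {"section": "Methodology", "text": "Third paragraph.", "cite_spans": []}
  ],
  "bib_entries": {"BIBREF0": {"title": "A seminal paper on the subject"}}
}
"""
sample_paper = json.loads(paper_json)


def text_by_section(paper):
    """Map each section title to its paragraphs joined into one string."""
    sections = {}
    for para in paper.get("body_text", []):
        sections.setdefault(para["section"], []).append(para["text"])
    return {title: " ".join(texts) for title, texts in sections.items()}


def cited_titles(paper):
    """Collect the titles of bibliography entries cited in the body text."""
    bib = paper.get("bib_entries", {})
    titles = []
    for para in paper.get("body_text", []):
        for span in para.get("cite_spans", []):
            entry = bib.get(span.get("ref_id"))
            if entry is not None:
                titles.append(entry["title"])
    return titles


sections = text_by_section(sample_paper)
titles = cited_titles(sample_paper)
print(sections)
print(titles)
```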


In [None]:
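# installing dependencies: scikit-learn and networkx for TextRank, transformers for BART
# note: transformers also needs a deep learning backend such as PyTorch (`torch`) to load BART;
# Colab ships with one, but a fresh local environment may not

%pip install scikit-learn networkx transformers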



In [4]:
# imports

# TextRank
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import networkx as nx
import json

# BART
from transformers import pipeline, AutoTokenizer, AutoModelForSeq2SeqLM

## TextRank Pipeline:
1. Extract + concatenate text from selected paper (to be referenced from JSON object created by UI/API request)
2. Split the extracted text into sentences
3. Create a similarity graph of sentences (TF-IDF vectors, with cosine similarity as edge weights)
4. Run PageRank
5. Take the top-k sentences by score and output the final summary

Additionally, preserve which section each sentence originated from (for later analysis/retrieval purposes).
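
As a rough illustration of what step 4 computes, here is a dependency-free power iteration over a weighted similarity matrix. It is a simplified sketch, not the `networkx` implementation the notebook calls; the function name `pagerank_power_iteration`, the damping factor of 0.85, and the toy weights are assumptions made for the example.

```python
def pagerank_power_iteration(weights, damping=0.85, iterations=100, tol=1e-9):
    """Score nodes of a weighted graph given as a square list-of-lists matrix."""
    n = len(weights)
    rank = [1.0 / n] * n
    row_sums = [sum(row) for row in weights]
    for _ in range(iterations):
        # Rank held by nodes with no outgoing weight is spread uniformly.
        dangling = sum(rank[i] for i in range(n) if row_sums[i] == 0)
        new_rank = []
        for j in range(n):
            # Each node passes its rank to neighbours in proportion to edge weight.
            incoming = sum(
                rank[i] * weights[i][j] / row_sums[i]
                for i in range(n)
                if row_sums[i] > 0
            )
            new_rank.append((1 - damping) / n + damping * (incoming + dangling / n))
        converged = sum(abs(a - b) for a, b in zip(new_rank, rank)) < tol
        rank = new_rank
        if converged:
            break
    return rank


# Toy similarity matrix: sentence 0 is similar to both others,
# while sentences 1 and 2 are only weakly similar to each other.
toy_weights = [
    [0.0, 0.5, 0.5],
    [0.5, 0.0, 0.1],
    [0.5, 0.1, 0.0],
]
toy_ranks = pagerank_power_iteration(toy_weights)
print(toy_ranks)
```

The sentence most strongly connected to the rest of the text ends up with the highest score, which is why the top-ranked sentences tend to work as a summary.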

In [None]:
# current method: concatenate all paragraphs from just the body section together. no splitting by section

k = 5 # number of sentences in the summary

paper = json.loads(...) # load the selected paper's json file
body_text = []
section_map = {} # preserving sentences' og section

for section in paper["body_text"]: # (1) extract and concatenate text from selected paper
    section_title = section["section"]
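    # split on sentence-ending punctuation, then strip surrounding whitespace
    # (naive: this also splits on abbreviations such as "et al." and on decimals such as "0.5")
    sentences = [s.strip() for s in section["text"].replace('?', '.').replace('!', '.').split('.')]
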
    for sentence in sentences:
        if sentence:
            section_map[len(body_text)] = section_title  # track section of sentence based on index of sentence
            body_text.append(sentence)

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(body_text) # (2) grab doc-term mtx, treating each sentence as a document in body_text corpus
similarity_mtx = cosine_similarity(X) # (3) cosine sim on sentences based on word importance
graph = nx.from_numpy_array(similarity_mtx)

scores = nx.pagerank(graph) # (4) score sentences via PageRank

ranked = sorted(((scores[i], s, section_map[i]) for i, s in enumerate(body_text)), reverse=True) # sentences and section name ranked by highest scores

summary = ". ".join([s for _, s, _ in ranked[:k]]) + "."
print(summary) # (5)
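
The cell above joins the top-k sentences in score order. If the summary should instead read in the order the sentences appear in the paper, with section labels kept for later retrieval, one option is sketched below. The helper `top_k_in_document_order` is hypothetical; it assumes `scores` is keyed by sentence index, as in the cell above.

```python
def top_k_in_document_order(scores, sentences, section_map, k):
    """Return the k best-scoring sentences as (index, section, sentence) tuples,
    sorted by their position in the paper rather than by score."""
    # Highest score first; ties go to the earlier sentence.
    top = sorted(range(len(sentences)), key=lambda i: (-scores[i], i))[:k]
    return [(i, section_map[i], sentences[i]) for i in sorted(top)]


# Toy inputs with the same shapes as `scores`, `body_text`, and `section_map`.
toy_sentences = ["We study X", "X is hard", "We propose Y", "Y beats Z"]
toy_scores = {0: 0.1, 1: 0.4, 2: 0.2, 3: 0.3}
toy_section_map = {0: "Introduction", 1: "Introduction", 2: "Methodology", 3: "Methodology"}

selected = top_k_in_document_order(toy_scores, toy_sentences, toy_section_map, k=2)
for index, section, sentence in selected:
    print(f"[{section}] {sentence}.")
```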

## Facebook BART Pipeline:
1. Create summarization pipeline, specifying Facebook BART (large-CNN model)
2. Extract + concatenate text from selected paper
3. Check if token count exceeds Facebook BART max input token count (1024)
4. If token count > 1024, implement sliding window. Else, summarize entire input
5. Output the final summary

In [None]:
model="facebook/bart-large-cnn" # (1)

full_body_text = ". ".join(body_text) # (2) turn the list of sentences into string

tokenizer = AutoTokenizer.from_pretrained(model)

tokens = tokenizer.encode(full_body_text, truncation=False)
token_count = len(tokens)
max_token_count = tokenizer.model_max_length

print(f"total tokens: {token_count}\nmax allowed tokens: {max_token_count}")
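if token_count > max_token_count: # (3)
    # (4) TODO: sliding window technique
    ...
else: # can pass in text from entire body of paper
    summarizer = pipeline("summarization", model=model)
    # max_length/min_length are counted in tokens, not words; sentences are usually 15-20 words long
    summary = summarizer(full_body_text, max_length=k*20, min_length=k*10, do_sample=False)
    # (5) printed only in this branch: until the sliding window is implemented,
    # `summary` would otherwise still hold the TextRank string from the earlier cell
    print(summary[0]["summary_text"])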
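
One possible shape for the sliding window in step 4 is sketched below. It operates on a plain list of token ids so it can run without downloading the model; the function name `sliding_windows` and the default `window`/`stride` values are assumptions, not settings taken from the notebook.

```python
def sliding_windows(token_ids, window=1024, stride=768):
    """Split `token_ids` into windows of at most `window` ids.

    Consecutive windows start `stride` ids apart, so they overlap by
    `window - stride` ids. The last window may be shorter than `window`.
    """
    if window <= 0 or not 0 < stride <= window:
        raise ValueError("need window > 0 and 0 < stride <= window")
    if not token_ids:
        return []
    windows = []
    start = 0
    while True:
        windows.append(token_ids[start:start + window])
        if start + window >= len(token_ids):
            break  # this window already reaches the end of the input
        start += stride
    return windows


# Toy example: 10 ids, windows of 4 that overlap by 1.
toy_ids = list(range(10))
toy_windows = sliding_windows(toy_ids, window=4, stride=3)
print(toy_windows)
```

With the notebook's tokenizer, the ids could come from `tokenizer.encode(full_body_text, add_special_tokens=False)`, with `window` set a couple of tokens below `max_token_count` to leave room for the special tokens added at summarization time. Each window could then be turned back into text with `tokenizer.decode(ids, skip_special_tokens=True)`, summarized separately, and the partial summaries stitched together.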