<a href="https://colab.research.google.com/github/maggoatt/Grounded-Text-Summarization-of-Research-Papers/blob/main/Data_and_Model_Pipeline.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Importing S2ORC Corpus

Step 1: Two Academic Graph API endpoints use Semantic Scholar’s custom-trained ranker to perform keyword searches: the paper relevance search endpoint and the paper bulk search endpoint.

https://api.semanticscholar.org/api-docs/#tag/Paper-Data/operation/get_graph_paper_relevance_search

The bulk search endpoint returns up to 1,000 papers per request.
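
A minimal sketch of paging through the bulk search endpoint, using only the standard library. It assumes the endpoint lives at `https://api.semanticscholar.org/graph/v1/paper/search/bulk`, that each response carries a `data` list plus a `token` for the next page, and that a missing `token` means the results are exhausted; check these against the API docs linked above. The names `fetch_bulk_page` and `iter_bulk_search`, and the `max_papers` cap, are illustrative and not part of this notebook.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint URL; verify against the API docs before relying on it.
BULK_SEARCH_URL = "https://api.semanticscholar.org/graph/v1/paper/search/bulk"


def fetch_bulk_page(params, api_key=None):
    """Fetch one page of bulk-search results and return the decoded JSON."""
    url = BULK_SEARCH_URL + "?" + urllib.parse.urlencode(params)
    headers = {"x-api-key": api_key} if api_key else {}
    request = urllib.request.Request(url, headers=headers)
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


def iter_bulk_search(query, fields="title,year", max_papers=2000, fetch=fetch_bulk_page):
    """Yield paper records, following the continuation token between pages."""
    params = {"query": query, "fields": fields}
    yielded = 0
    while yielded < max_papers:
        page = fetch(params)
        for paper in page.get("data") or []:
            yield paper
            yielded += 1
            if yielded >= max_papers:
                return
        token = page.get("token")
        if not token:
            return  # No continuation token: assume the results are exhausted
        params = {**params, "token": token}


# Example (performs real HTTP requests, so it is left commented out):
# for paper in iter_bulk_search("text summarization", max_papers=5):
#     print(paper.get("title"))
```

The `fetch` parameter is injectable so the paging logic can be exercised without network access; to send an API key, pass something like `fetch=lambda p: fetch_bulk_page(p, api_key=api_key)`.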

In [4]:
!pip install python-dotenv requests

Collecting requests
  Using cached requests-2.32.5-py3-none-any.whl.metadata (4.9 kB)
Collecting charset_normalizer<4,>=2 (from requests)
  Using cached charset_normalizer-3.4.4-cp313-cp313-macosx_10_13_universal2.whl.metadata (37 kB)
Collecting idna<4,>=2.5 (from requests)
  Using cached idna-3.11-py3-none-any.whl.metadata (8.4 kB)
Collecting urllib3<3,>=1.21.1 (from requests)
  Using cached urllib3-2.6.3-py3-none-any.whl.metadata (6.9 kB)
Collecting certifi>=2017.4.17 (from requests)
  Using cached certifi-2026.1.4-py3-none-any.whl.metadata (2.5 kB)
Using cached requests-2.32.5-py3-none-any.whl (64 kB)
Using cached charset_normalizer-3.4.4-cp313-cp313-macosx_10_13_universal2.whl (208 kB)
Using cached idna-3.11-py3-none-any.whl (71 kB)
Using cached urllib3-2.6.3-py3-none-any.whl (131 kB)
Using cached certifi-2026.1.4-py3-none-any.whl (152 kB)
Installing collected packages: urllib3, idna, charset_normalizer, certifi, requests

In [1]:
import os
import requests
import json
from dotenv import load_dotenv

load_dotenv()  # Load variables from a local .env file, if one exists
api_key = os.getenv("S2_API_Key")  # None if the variable is not set

headers = {
    "x-api-key": api_key
}

# Fetch the ID of the latest release
response_latest_release = requests.get('https://api.semanticscholar.org/datasets/v1/release/latest', headers=headers)
if response_latest_release.status_code == 200:
    latest_release_id = response_latest_release.json()['release_id']
    print(f"Latest Release ID: {latest_release_id}")

    # Define the dataset name you want to download
    dataset_name = "s2orc"
    
    # Fetch the download links for the specified dataset in the latest release
    response_dataset = requests.get(f'https://api.semanticscholar.org/datasets/v1/release/{latest_release_id}/dataset/{dataset_name}', headers=headers)
    if response_dataset.status_code == 200:
        dataset_info = response_dataset.json()
        print(json.dumps(dataset_info, indent=2))  # This will print out the full response body
        
        # The dataset metadata lists its shard URLs as plain strings under "files"
        download_links = dataset_info.get('files', [])
        if download_links:
            print("Download Links:")
            for link in download_links:
                print(link)
        else:
            print("No download links found for the dataset. Here's the response content for debugging:")
            print(json.dumps(dataset_info, indent=2))  # Print the whole JSON response for debugging
    else:
        print(f"Failed to get download links for the dataset. Status Code: {response_dataset.status_code}")
        print("Response:", response_dataset.text)
else:
    print(f"Failed to fetch latest release ID. Status Code: {response_latest_release.status_code}")
    print("Response:", response_latest_release.text)


Latest Release ID: 2026-02-03
{
  "name": "s2orc",
  "description": "Full-body paper text parsed from open-access PDFs. Identifies structural elements such as paragraphs, sections, and bibliography entries.\n10M records in 30 4GB files.",
  "README": "Semantic Scholar Academic Graph Datasets\n\nThe \"s2orc\" dataset contains parsed full-body text from selected papers.\n\nA subset of this data was previously released (in a different format) as S2ORC https://github.com/allenai/s2orc\n\nThe body text is parsed from PDF documents using Grobid, documented at https://grobid.readthedocs.io.\nIts output is converted from XML into a single string with a set of annotation spans.\n\nSCHEMA\n - externalIds: IDs of this paper in different catalogs\n - content:\n   - source:\n\t   - pdfUrls: URLs to the PDF\n\t   - oaInfo: license/url/status information from Unpaywall\n   - text: Full body text as a single string\n   - annotations: Annotated spans of the full body text\n\n\nLICENSE\nThis collection 

In [None]:
import urllib.request
import gzip
import json

# Take the first shard URL from the dataset metadata fetched above
url = response_dataset.json()["files"][0]
try:
    with urllib.request.urlopen(url) as response:
        # GzipFile wraps the HTTP response, so the shard is decompressed as it streams
        with gzip.GzipFile(fileobj=response) as gz:
            # Each line is one JSON record; readline() returns bytes,
            # so decode to UTF-8 before parsing.
            first_line = gz.readline().decode('utf-8')
            
            if first_line:
                data = json.loads(first_line)
                print("First paper structure keys:", data.keys())
                print(json.dumps(data, indent=2)[:5000] + "...")
            else:
                print("File is empty.")
except Exception as e:
    print(f"Failed: {e}")

First paper structure keys: dict_keys(['corpusid', 'externalids', 'content'])

First paper content snippet:
{
  "corpusid": 238227165,
  "externalids": {
    "arxiv": "2109.15296",
    "mag": null,
    "acl": null,
    "pubmed": null,
    "pubmedcentral": null,
    "dblp": "journals/corr/abs-2109-15296",
    "doi": null
  },
  "content": {
    "source": {
      "pdfurls": null,
      "pdfsha": null,
      "oainfo": null
    },
    "text": null,
    "annotations": {
      "abstract": null,
      "author": null,
      "authoraffiliation": null,
      "authorfirstname": null,
      "authorlastname": null,
      "bibauthor": null,
      "bibauthorfirstname": null,
      "bibauthorlastname": null,
      "bibentry": null,
      "bibref": null,
      "bibtitle": null,
      "bibvenue": null,
      "figure": null,
      "figurecaption": null,
      "figureref": null,
      "formula": null,
      "paragraph": null,
      "publisher": null,
      "sectionheader": null,
      "table": null,
     
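
The first record above has `text: null`, so a single `readline()` is not enough to find a usable paper. One possible way to keep streaming until records with body text turn up is sketched below. `iter_records` and `has_full_text` are hypothetical helper names; the only assumption about the data is the one-JSON-object-per-line layout already relied on above.

```python
import gzip
import json


def iter_records(fileobj):
    """Yield one decoded JSON record per non-empty line of a gzipped JSONL stream."""
    with gzip.GzipFile(fileobj=fileobj) as gz:
        for raw_line in gz:
            line = raw_line.strip()
            if line:
                yield json.loads(line)


def has_full_text(record):
    """Return True when the record carries a non-empty content.text string."""
    content = record.get("content") or {}
    return bool(content.get("text"))


# Example with the shard URL from the cell above (streams over the network):
# with urllib.request.urlopen(url) as response:
#     for record in iter_records(response):
#         if has_full_text(record):
#             print(record["corpusid"])
#             break
```

Because `iter_records` is a generator, stopping at the first match avoids decompressing the rest of the shard.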

Paper {
  corpusid: number,

  externalids: {
    arxiv: string | null,
    mag: string | null,
    acl: string | null,
    pubmed: string | null,
    pubmedcentral: string | null,
    dblp: string | null,
    doi: string | null
  },

  content: {
    source: {
      pdfurls: string[] | null,
      pdfsha: string | null,
      oainfo: {
        license: string,
        openaccessurl: string,
        status: "GOLD" | "GREEN" | "BRONZE" | "CLOSED"
      } | null
    },

    text: string | null,

    annotations: {
      abstract: Span[] | null,
      title: Span[] | null,
      author: Span[] | null,
      authorfirstname: Span[] | null,
      authorlastname: Span[] | null,
      authoraffiliation: Span[] | null,

      sectionheader: Span[] | null,
      paragraph: Span[] | null,

      bibentry: BibEntrySpan[] | null,
      bibauthor: Span[] | null,
      bibauthorfirstname: Span[] | null,
      bibauthorlastname: Span[] | null,
      bibtitle: Span[] | null,
      bibvenue: Span[] | null,
      bibref: BibRefSpan[] | null,

      figure: FigureSpan[] | null,
      figurecaption: Span[] | null,
      figureref: Span[] | null,

      table: TableSpan[] | null,
      tableref: Span[] | null,

      formula: Span[] | null,
      publisher: Span[] | null,
      venue: Span[] | null
    }
  }
}
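
To show how the annotation fields relate to `content.text`, here is a small sketch that slices the text with one annotation type. It assumes each `Span` is an object with integer-like `start` and `end` character offsets into `text`, which the schema above does not spell out, and it also accepts annotation values delivered as JSON-encoded strings in case the raw files store them that way. `load_spans` and `span_texts` are hypothetical names, and the toy record is made up for illustration.

```python
import json


def load_spans(value):
    """Normalize an annotation value to a list of span dicts."""
    if value is None:
        return []
    if isinstance(value, str):
        # Assumption: the value may be a JSON-encoded string of spans
        value = json.loads(value)
    return value


def span_texts(record, kind):
    """Return the substrings of content.text covered by one annotation type."""
    content = record.get("content") or {}
    text = content.get("text") or ""
    annotations = content.get("annotations") or {}
    return [
        text[int(span["start"]):int(span["end"])]
        for span in load_spans(annotations.get(kind))
    ]


# Toy record (not real S2ORC data) exercising both value encodings
toy = {
    "content": {
        "text": "Intro\nHello world.",
        "annotations": {
            "sectionheader": '[{"start": 0, "end": 5}]',
            "paragraph": [{"start": 6, "end": 18}],
            "abstract": None,
        },
    }
}
print(span_texts(toy, "sectionheader"))  # ['Intro']
print(span_texts(toy, "paragraph"))      # ['Hello world.']
print(span_texts(toy, "abstract"))       # []
```

Returning an empty list for missing or null annotations keeps downstream loops free of `None` checks.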
