**Author** ::
Muhammad Hassan Mukhtar

**Affiliation** ::
The University of Salford, Manchester, England, UK

**Connect** ::
[GitHub](https://github.com/MHM-Rajpoot)
[LinkedIn](https://www.linkedin.com/in/-muhammad-hassan-mukhtar-/)

### Setup

In [1]:
!pip install requests
!pip install PyPDF2
# datetime and xml.etree.ElementTree ship with the Python standard library, so they need no installation.
# (The PyPI package named "datetime" is an unrelated Zope package.)

Ignoring encoding warnings when writing data to a CSV file is generally not recommended, and the cell below goes further by silencing all Python warnings, not only encoding-related ones. We accept this here because the later preprocessing steps filter out stop words and extraneous characters, which handles most characters that fail to encode cleanly.

In [2]:
import warnings
warnings.filterwarnings("ignore")
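
A narrower alternative is to suppress warnings only around the call that produces them, so that warnings elsewhere in the notebook stay visible. The sketch below is an assumption for illustration, not the approach used in this notebook: `noisy_step` and `run_quietly` are hypothetical names, and `noisy_step` merely stands in for a call that emits a warning.

```python
import warnings


def noisy_step():
    """Hypothetical stand-in for a call that emits a warning."""
    warnings.warn("possible encoding problem", UserWarning)
    return "done"


def run_quietly(func):
    """Run func with warnings suppressed only for the duration of the call."""
    with warnings.catch_warnings():
        # The filter change is undone automatically when the block exits.
        warnings.simplefilter("ignore")
        return func()


result = run_quietly(noisy_step)
print(result)
```

Because `warnings.catch_warnings()` restores the previous filters on exit, code that runs afterwards still reports its warnings as usual.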

### arXiv API Data Acquisition Example

**Fetching and Processing Research Papers from arXiv**

This Colab notebook demonstrates how to interact with the arXiv API to fetch, download, and process research papers based on a search query.

**Key Features:**

1. Retrieve Research Papers: Sends HTTP requests to the arXiv API, fetches XML data, and extracts relevant information such as titles, summaries, publication dates, and links.  
2. Filter Recent Papers: Filters the fetched papers to include only those published in the last six months.  
3. Download Full-Text PDFs: Constructs PDF URLs and downloads the research papers.  
4. Extract and Save Text: Uses PyPDF2 to extract text from downloaded PDFs and saves the content in a CSV file with two columns: Paper ID and Raw Text.  
5. Modular and Efficient: Implements functions for fetching, parsing, downloading, and saving papers, ensuring reusability and efficient batch processing.

**How It Works:**

- Define a search query, for example, "machine learning."
- Fetch and filter the latest research papers from arXiv.
- Download the full-text PDFs.
- Extract the text content and save it in a structured format.
- Use the extracted data for further analysis or natural language processing tasks.

This workflow automates research paper retrieval and processing, making it a valuable tool for academic research and data analysis.
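
The six-month filter mentioned above can be sketched in isolation. This version is an illustrative assumption rather than the notebook's own code: the helper name `is_recent`, the fixed reference time, and the sample timestamps are hypothetical. It reads the trailing `Z` in arXiv's `published` field as UTC and compares against a timezone-aware cutoff of 180 days.

```python
from datetime import datetime, timedelta, timezone


def is_recent(published, now, days=180):
    """Return True if a 'published' timestamp falls within `days` before `now`."""
    # Timestamps look like 2025-02-24T18:59:59Z; the trailing Z marks UTC.
    published_at = datetime.strptime(published, "%Y-%m-%dT%H:%M:%SZ")
    published_at = published_at.replace(tzinfo=timezone.utc)
    return published_at > now - timedelta(days=days)


# A fixed reference time keeps the example reproducible.
reference = datetime(2025, 3, 1, tzinfo=timezone.utc)
samples = ["2025-02-24T18:59:59Z", "2024-06-01T00:00:00Z"]  # hypothetical values
recent = [s for s in samples if is_recent(s, reference)]
print(recent)
```

In a live run, `datetime.now(timezone.utc)` would take the place of the fixed reference time.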

In [3]:
import os
import csv
import requests
import xml.etree.ElementTree as ET
from datetime import datetime, timedelta
import PyPDF2

In [4]:
def fetch_arxiv_articles(query, max_results=200, batch_size=10):
    """
    Fetch research articles from the arXiv API based on a given search query.
    Retrieves a specified number of articles in batches, sorted by submission date.

    Parameters:
        query (str): The keyword or topic to search for.
        max_results (int): The maximum number of articles to retrieve.
        batch_size (int): The number of articles per API request.

    Returns:
        list: A list of dictionaries with article details (title, summary, published date, and link).
    """
    api_url = "http://export.arxiv.org/api/query"
    all_papers = []
    for start in range(0, max_results, batch_size):
        # Pass parameters separately so requests URL-encodes the query (e.g. spaces).
        params = {
            "search_query": f"all:{query}",
            "start": start,
            "max_results": batch_size,
            "sortBy": "submittedDate",
            "sortOrder": "descending",
        }
        # A timeout keeps the notebook from hanging on a stalled connection.
        response = requests.get(api_url, params=params, timeout=30)

        if response.status_code == 200:
            papers = parse_arxiv_xml(response.text)
            all_papers.extend(papers)
        else:
            print(f"Failed to fetch papers at start={start}")
            break

    return all_papers
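
To see what the batched requests look like without touching the network, the URLs can be built and printed on their own. This is a minimal sketch under stated assumptions: `build_query_urls` is a hypothetical helper that is not part of the notebook, and capping the last batch with `min(...)` is a small variation on the loop above, which always requests a full batch.

```python
from urllib.parse import urlencode

BASE_URL = "http://export.arxiv.org/api/query"


def build_query_urls(query, max_results=200, batch_size=10):
    """Return one request URL per batch, with the query URL-encoded."""
    urls = []
    for start in range(0, max_results, batch_size):
        params = {
            "search_query": f"all:{query}",
            "start": start,
            # Assumption: cap the final batch so the total never exceeds max_results.
            "max_results": min(batch_size, max_results - start),
            "sortBy": "submittedDate",
            "sortOrder": "descending",
        }
        urls.append(f"{BASE_URL}?{urlencode(params)}")
    return urls


urls = build_query_urls("machine learning", max_results=25, batch_size=10)
for url in urls:
    print(url)
```

Here `urlencode` turns the space in "machine learning" into `+` and the colon into `%3A`, so the query string stays valid.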

In [5]:
def parse_arxiv_xml(xml_data):
    """
    Parse XML response from the arXiv API and extract relevant article details.

    Parameters:
        xml_data (str): XML response from the arXiv API.

    Returns:
        list: A list of dictionaries containing article details (title, summary, published date, and link).
    """
    atom = "{http://www.w3.org/2005/Atom}"  # Atom namespace used by every tag in the feed
    root = ET.fromstring(xml_data)
    papers = []

    for entry in root.findall(f"{atom}entry"):
        title = entry.find(f"{atom}title").text
        summary = entry.find(f"{atom}summary").text
        published = entry.find(f"{atom}published").text
        link = entry.find(f"{atom}id").text
        papers.append({"title": title, "summary": summary, "published": published, "link": link})

    return papers
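
The parsing step can be tried offline against a small hand-written Atom feed. Everything in the sample below is a placeholder (the ID, date, title, and abstract are not real arXiv data), and `parse_entries` is a hypothetical variant of the parser. It uses a namespace map and additionally collapses the line breaks and extra spaces that multi-line titles can contain.

```python
import xml.etree.ElementTree as ET

ATOM = {"atom": "http://www.w3.org/2005/Atom"}

# Hand-written placeholder feed; not real arXiv data.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <id>http://arxiv.org/abs/0000.00000v1</id>
    <published>2025-01-01T00:00:00Z</published>
    <title>A Hypothetical
  Paper Title</title>
    <summary>  Placeholder abstract text.  </summary>
  </entry>
</feed>"""


def clean(text):
    """Collapse runs of whitespace (including newlines) into single spaces."""
    return " ".join(text.split())


def parse_entries(xml_text):
    """Return one dict per Atom entry, with whitespace-normalized fields."""
    root = ET.fromstring(xml_text)
    entries = []
    for entry in root.findall("atom:entry", ATOM):
        entries.append({
            "title": clean(entry.findtext("atom:title", default="", namespaces=ATOM)),
            "summary": clean(entry.findtext("atom:summary", default="", namespaces=ATOM)),
            "published": clean(entry.findtext("atom:published", default="", namespaces=ATOM)),
            "link": clean(entry.findtext("atom:id", default="", namespaces=ATOM)),
        })
    return entries


entries = parse_entries(SAMPLE_FEED)
print(entries[0]["title"])
```

Using `findtext` with a default also avoids an `AttributeError` when an entry lacks one of the expected tags.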

In [6]:
def download_paper(pdf_url, filename):
    """
    Download and save a research paper from arXiv as a PDF.

    Parameters:
        pdf_url (str): The URL to the PDF of the paper.
        filename (str): The filename for saving the paper.
    """
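    response = requests.get(pdf_url, stream=True, timeout=60)
    # dirname is empty for a bare filename, so fall back to the current directory.
    os.makedirs(os.path.dirname(filename) or ".", exist_ok=True)
    if response.status_code == 200:
        with open(filename, "wb") as file:
            # Write in chunks so the whole PDF is never held in memory at once.
            for chunk in response.iter_content(chunk_size=8192):
                file.write(chunk)
        print(f"Paper saved as {filename}")
    else:
        print(f"Failed to download paper from {pdf_url}")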
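
The PDF URL and local filename are both derived from the entry link. The sketch below shows that derivation on its own; the three helper names are hypothetical and not part of the notebook. It splits on `/abs/` rather than on the last `/`, on the assumption that older arXiv identifiers such as `hep-th/9901001v1` contain a slash and would otherwise be cut in half.

```python
def arxiv_id_from_link(link):
    """Return the identifier that follows '/abs/' in an arXiv entry link."""
    # If '/abs/' is absent, the link is returned unchanged.
    return link.split("/abs/", 1)[-1]


def pdf_url_for(arxiv_id):
    """Build the PDF URL in the same form the notebook uses."""
    return f"https://arxiv.org/pdf/{arxiv_id}.pdf"


def local_filename_for(arxiv_id, directory="./arxiv_papers"):
    """Build a local path, replacing any slash that is unsafe in a file name."""
    return f"{directory}/{arxiv_id.replace('/', '_')}.pdf"


new_style = arxiv_id_from_link("http://arxiv.org/abs/2502.17437v1")
old_style = arxiv_id_from_link("http://arxiv.org/abs/hep-th/9901001v1")

print(pdf_url_for(new_style))
print(local_filename_for(new_style))
print(local_filename_for(old_style))
```

For the recent identifiers seen in this notebook's output, the result matches what `link.split('/')[-1]` produces.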

In [7]:
def extract_text_from_pdf(pdf_path):
    """
    Extracts text from a given PDF file.

    Parameters:
        pdf_path (str): The path to the PDF file.

    Returns:
        str: The extracted text content from the PDF.
    """
    try:
        with open(pdf_path, "rb") as file:
            reader = PyPDF2.PdfReader(file)
            text = "".join([page.extract_text() or "" for page in reader.pages])
        return text
    except Exception as e:
        print(f"Error extracting text from {pdf_path}: {e}")
        return ""

In [8]:
def save_papers_to_csv(papers):
    """
    Saves extracted paper text into a CSV file with two columns: paper ID and raw text.

    Parameters:
        papers (list): List of dictionaries containing paper details.
    """
    csv_filename = "arxiv_papers.csv"
    with open(csv_filename, mode="w", newline="", encoding="utf-8", errors="replace") as file:
        writer = csv.writer(file)
        writer.writerow(["Paper ID", "Raw Text"])

        for paper in papers:
            arxiv_id = paper['link'].split('/')[-1]
            pdf_filename = f"./arxiv_papers/{arxiv_id}.pdf"
            text = extract_text_from_pdf(pdf_filename)
            writer.writerow([arxiv_id, text.replace("\ufffd", "?")])  # Replace invalid characters with '?'

    print(f"Paper texts saved to {csv_filename}")

In [9]:
def main():
    """
    Main function to fetch, filter, download, and process recent research papers from arXiv.

    Workflow:
    1. Fetches research papers related to a specified query from the arXiv API (10 in this run; fetch_arxiv_articles defaults to 200).
    2. Filters papers to include only those published within the last six months.
    3. Downloads the full-text PDFs of the filtered papers.
    4. Extracts text from the downloaded PDFs.
    5. Saves the extracted text into a CSV file with two columns: Paper ID and Raw Text.

    Steps:
    - Defines a search query (default: "machine learning").
    - Calls fetch_arxiv_articles() to retrieve papers.
    - Filters results based on publication date.
    - Iterates over the filtered papers to:
      - Construct the arXiv ID and PDF URL.
      - Download and save the paper using download_paper().
    - Calls save_papers_to_csv() to store extracted text.

    Outputs:
    - Saves the downloaded PDFs in the "./arxiv_papers/" directory.
    - Stores extracted text in a CSV file named "arxiv_papers.csv".

    Returns:
        None
    """

    query = "machine learning"

    # Set your own max_results and batch_size, or omit them to use the defaults (200 results in batches of 10).
    papers = fetch_arxiv_articles(query, max_results=10, batch_size=1)

    print(f"\nFetched {len(papers)} Papers")

    cutoff_date = datetime.now() - timedelta(days=180)  # 6 months ago
    papers = [p for p in papers if datetime.strptime(p["published"], "%Y-%m-%dT%H:%M:%SZ") > cutoff_date]

    print(f"\nFetched {len(papers)} Papers from the last 6 months")

    if not papers:
        print("No recent papers found.")
        return

    print("\nDownloading and extracting text from papers:")
    for paper in papers:
        arxiv_id = paper['link'].split('/')[-1]
        pdf_url = f"https://arxiv.org/pdf/{arxiv_id}.pdf"
        filename = f"./arxiv_papers/{arxiv_id}.pdf"
        download_paper(pdf_url, filename)

    print("\nWriting Data to CSV")
    save_papers_to_csv(papers)
    print("Process completed!")

In [10]:
if __name__ == "__main__":
    main()


Fetched 10 Papers

Fetched 10 Papers from the last 6 months

Downloading and extracting text from papers:
Paper saved as ./arxiv_papers/2502.17437v1.pdf
Paper saved as ./arxiv_papers/2502.17436v1.pdf
Paper saved as ./arxiv_papers/2502.17433v1.pdf
Paper saved as ./arxiv_papers/2502.17432v1.pdf
Paper saved as ./arxiv_papers/2502.17429v1.pdf
Paper saved as ./arxiv_papers/2502.17427v1.pdf
Paper saved as ./arxiv_papers/2502.17425v1.pdf
Paper saved as ./arxiv_papers/2502.17424v1.pdf
Paper saved as ./arxiv_papers/2502.17423v1.pdf
Paper saved as ./arxiv_papers/2502.17421v1.pdf

Writing Data to CSV
Paper texts saved to arxiv_papers.csv
Process completed!


In [11]:
!zip -r arxiv_papers.zip arxiv_papers

  adding: arxiv_papers/ (stored 0%)
  adding: arxiv_papers/2502.17427v1.pdf (deflated 11%)
  adding: arxiv_papers/2502.17433v1.pdf (deflated 6%)
  adding: arxiv_papers/2502.17436v1.pdf (deflated 9%)
  adding: arxiv_papers/2502.17437v1.pdf (deflated 8%)
  adding: arxiv_papers/2502.17421v1.pdf (deflated 7%)
  adding: arxiv_papers/2502.17432v1.pdf (deflated 23%)
  adding: arxiv_papers/2502.17423v1.pdf (deflated 1%)
  adding: arxiv_papers/2502.17429v1.pdf (deflated 9%)
  adding: arxiv_papers/2502.17425v1.pdf (deflated 10%)
  adding: arxiv_papers/2502.17424v1.pdf (deflated 8%)


You now have a small, current dataset for NLP experiments: the downloaded PDFs, plus a CSV file that pairs each paper ID with its raw extracted text. It can serve as a starting point for preprocessing and further analysis.
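
As a starting point for that further analysis, the CSV can be read back with the standard library. This is a minimal sketch under stated assumptions: `load_paper_texts` is a hypothetical helper, and the demo writes placeholder rows to a temporary file instead of reading the real `arxiv_papers.csv`. The raised field size limit matters because a full paper's text can exceed the `csv` module's default limit of 131072 characters per field.

```python
import csv
import os
import tempfile


def load_paper_texts(csv_path):
    """Read a 'Paper ID' / 'Raw Text' CSV into a dict mapping ID to text."""
    # Raise the per-field limit so long extracted texts do not trigger csv.Error.
    csv.field_size_limit(10_000_000)
    with open(csv_path, newline="", encoding="utf-8") as file:
        return {row["Paper ID"]: row["Raw Text"] for row in csv.DictReader(file)}


# Demo with a temporary file and placeholder rows (not real paper text).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo_papers.csv")
    with open(demo_path, "w", newline="", encoding="utf-8") as file:
        writer = csv.writer(file)
        writer.writerow(["Paper ID", "Raw Text"])
        writer.writerow(["0000.00001v1", "placeholder text " * 10_000])
        writer.writerow(["0000.00002v1", "line one\nline two"])
    texts = load_paper_texts(demo_path)

print({paper_id: len(text) for paper_id, text in texts.items()})
```

To load the notebook's own output, pass `"arxiv_papers.csv"` to the helper instead of the temporary path.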

This concludes the tutorial. Thank you for following along.