# Scrape and Download Quantum Machine Learning Papers from Nature

## Objective
This notebook scrapes Nature's search results for articles on **Quantum Machine Learning (QML)** published within the last 7 days, looks up each title on arXiv, and downloads the preprint PDF when one is found. It saves article metadata to a text file and processes only articles whose titles pass a keyword-based QML filter.

## Theory
- **Web Scraping**: Uses `requests` and `BeautifulSoup` to extract article metadata (title, date, URL) from Nature's search results.
- **QML Filtering**: Filters articles by requiring both "quantum" and ML-related terms (e.g., "machine learning", "qml") in the title.
- **arXiv Integration**: Queries arXiv's API to find PDF versions of papers using their titles.
- **Rate Limiting**: Pauses for a fixed delay before each request to reduce the risk of being throttled or blocked.

- `requests`: For HTTP requests to fetch web pages and PDFs.
- `BeautifulSoup`: For parsing HTML/XML content (the `'xml'` parser used for arXiv responses requires `lxml` to be installed).
- `datetime`, `timedelta`: For date filtering (last 7 days).
- `quote`: For URL-encoding arXiv queries.
- `time`: For rate-limiting delays.
- `re`: For sanitizing filenames.
- `os`: For file and directory operations.

In [2]:
import requests
from bs4 import BeautifulSoup
from datetime import datetime, timedelta
from urllib.parse import quote
import time
import re
import os

- `BASE_URL`: Root URL for Nature.
- `SEARCH_URL`: Queries Nature for "quantum machine learning" articles from the last 7 days.
- `HEADERS`: Mimics a browser to avoid being blocked.
- `DELAY`: Number of seconds to pause before each request, to scrape politely and reduce the chance of being rate-limited.

In [3]:
# Cell 2: Define constants
BASE_URL = "https://www.nature.com"
SEARCH_URL = f"{BASE_URL}/search?q=quantum+machine+learning&date_range=last_7_days"
HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
}
DELAY = 2  # Seconds between requests to avoid rate limiting
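
The hard-coded query string in `SEARCH_URL` can also be assembled from its parts, which makes it easier to change the search terms. The sketch below is one possible way to do that; `build_search_url` is a hypothetical helper, and the parameter names `q` and `date_range` are simply copied from the URL above rather than from any documented Nature API.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.nature.com"


def build_search_url(query, date_range="last_7_days"):
    """Assemble a search URL (hypothetical helper, not part of the notebook)."""
    # urlencode escapes spaces as '+', matching the hand-written SEARCH_URL
    params = urlencode({"q": query, "date_range": date_range})
    return f"{BASE_URL}/search?{params}"


search_url = build_search_url("quantum machine learning")
print(search_url)
```

With the default arguments this reproduces the same string as the `SEARCH_URL` constant.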

- Removes invalid characters (e.g., `<`, `:`, `/`) from filenames.
- Limits length to 100 characters and appends `.pdf`.

In [4]:
# Cell 3: Sanitize filenames for safe saving
def sanitize_filename(filename):
    """Remove invalid characters from filename."""
    return re.sub(r'[<>:"/\\|?*]', '', filename).strip()[:100] + ".pdf"
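
A quick check of how `sanitize_filename` behaves on a few inputs may help. The function is repeated here so the snippet runs on its own; the second and third titles are made-up test strings, not real articles.

```python
import re


def sanitize_filename(filename):
    """Remove invalid characters from filename (same logic as the cell above)."""
    return re.sub(r'[<>:"/\\|?*]', '', filename).strip()[:100] + ".pdf"


# A title with no forbidden characters only gains the extension
plain = sanitize_filename("quantum neural networks form gaussian processes")

# Forbidden characters are dropped, not replaced, so words can run together
messy = sanitize_filename('What is QML? A "quick": review/overview')

# Truncation happens before '.pdf' is appended, so the result is 104 characters
long_name = sanitize_filename("x" * 150)

print(plain)
print(messy)
print(len(long_name))
```

Because characters are removed rather than substituted, two different titles can in principle map to the same filename, in which case the later download overwrites the earlier one.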

- Queries arXiv's API with the article title.
- Parses the XML response to extract the paper's ID.
- Converts the `abs` URL to a `pdf` URL.
- Returns `None` if no match is found or an error occurs.

In [5]:
def search_arxiv_pdf_url(title):
    """Search arXiv for PDF version of paper."""
    try:
        time.sleep(DELAY)
        base_api = "http://export.arxiv.org/api/query?"
        query = f'ti:"{title}"'
        params = {
            'search_query': query,
            'max_results': 1,
            'sortBy': 'submittedDate',
            'sortOrder': 'descending'
        }
        
        r = requests.get(base_api, params=params, timeout=10)
        r.raise_for_status()
        
        if "<entry>" in r.text:
            soup = BeautifulSoup(r.text, 'xml')
            entry = soup.find('entry')
            if entry:
                arxiv_id = entry.find('id').text.strip()
                # Swap only the '/abs/' path segment so other 'abs' substrings are left alone
                pdf_url = arxiv_id.replace('/abs/', '/pdf/', 1) + '.pdf'
                return pdf_url
        return None
    except Exception as e:
        print(f"Error searching arXiv for '{title}': {e}")
        return None
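
The parsing step can be tried offline, without calling arXiv. The sketch below uses the standard library's `xml.etree.ElementTree` instead of `BeautifulSoup` (a substitution made here so the snippet has no third-party dependencies), and feeds it a hand-written, minimal Atom feed. `SAMPLE_FEED` and `first_pdf_url` are hypothetical names; the entry ID is the one that appears in this notebook's output, and a real arXiv response contains many more fields.

```python
import xml.etree.ElementTree as ET

# Atom namespace used by the feed's elements
ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}

# Hand-written stand-in for an API response (illustrative only)
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <id>http://arxiv.org/abs/2305.09957v3</id>
    <title>Quantum neural networks form Gaussian processes</title>
  </entry>
</feed>"""


def first_pdf_url(feed_xml):
    """Return the PDF URL for the first entry in an Atom feed, or None."""
    root = ET.fromstring(feed_xml)
    entry = root.find("atom:entry", ATOM_NS)
    if entry is None:
        return None
    id_tag = entry.find("atom:id", ATOM_NS)
    if id_tag is None or not id_tag.text:
        return None
    # Same abs -> pdf conversion as the notebook, limited to the path segment
    return id_tag.text.strip().replace("/abs/", "/pdf/", 1) + ".pdf"


print(first_pdf_url(SAMPLE_FEED))
```

An empty feed (no `<entry>` element) yields `None`, which mirrors the "no match" branch of `search_arxiv_pdf_url`.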

- Streams the PDF in 8 KB chunks so large files are never held fully in memory.
- Saves to a `downloads` directory, creating it if needed.
- Returns `True` on success, `False` on failure.

In [6]:
def download_pdf(url, filename):
    """Download PDF file."""
    try:
        time.sleep(DELAY)
        r = requests.get(url, headers=HEADERS, timeout=30, stream=True)
        r.raise_for_status()
        
        # Create downloads directory if it doesn't exist
        os.makedirs('downloads', exist_ok=True)
        filepath = os.path.join('downloads', filename)
        
        with open(filepath, 'wb') as f:
            for chunk in r.iter_content(8192):
                f.write(chunk)
        print(f"✅ Saved PDF: {filename}")
        return True
    except Exception as e:
        print(f"❌ Failed to download PDF from {url}: {e}")
        return False
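
`download_pdf` reports success whenever the HTTP request succeeds, even if the server returned something other than a PDF (for example an HTML page). One optional safeguard, not part of the original notebook, is to check that the saved file begins with the `%PDF-` signature that PDF files start with. `looks_like_pdf` is a hypothetical helper name, and the temporary files below exist only to demonstrate it.

```python
import os
import tempfile


def looks_like_pdf(path):
    """Return True if the file starts with the PDF signature bytes."""
    with open(path, "rb") as f:
        return f.read(5) == b"%PDF-"


# Demonstrate on two throwaway files: one PDF-like, one HTML-like
with tempfile.TemporaryDirectory() as tmp:
    good_path = os.path.join(tmp, "good.pdf")
    bad_path = os.path.join(tmp, "bad.pdf")
    with open(good_path, "wb") as f:
        f.write(b"%PDF-1.7\n")
    with open(bad_path, "wb") as f:
        f.write(b"<html><body>Not a PDF</body></html>")

    good_result = looks_like_pdf(good_path)
    bad_result = looks_like_pdf(bad_path)

print(good_result, bad_result)
```

A caller could run this check on the path written by `download_pdf` and delete the file when it fails.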

- **QML Filter**: Requires both `'quantum'` and one of `['machine learning', 'neural network', 'qml', 'deep learning']` in the title.
- **Date Filtering**: Ensures articles are from the last 7 days.
- **Error Handling**: Catches exceptions per article and logs them, so one malformed entry does not stop the run.
- **Output**: Saves metadata to `recent_qml_nature_links.txt` and PDFs to a `downloads` folder.

In [7]:
def scrape_recent_qml_articles():
    """Scrape recent Quantum Machine Learning articles from Nature."""
    try:
        print("🔍 Searching for recent Quantum Machine Learning articles...")
        time.sleep(DELAY)
        r = requests.get(SEARCH_URL, headers=HEADERS, timeout=10)
        r.raise_for_status()
        soup = BeautifulSoup(r.text, 'html.parser')

        articles = soup.select('.c-card') or soup.select('.app-article-list-row__item')
        
        if not articles:
            print("No articles found. Nature's page structure may have changed.")
            return

        # Get current date at midnight for accurate comparison
        today = datetime.now().replace(hour=0, minute=0, second=0, microsecond=0)
        one_week_ago = today - timedelta(days=7)
        
        output_lines = []
        downloaded_count = 0

        for item in articles:
            # Reset per item so the error handler never sees an undefined or stale title
            title = "<unknown>"
            try:
                title_tag = item.select_one('.c-card__title, .article-item__title')
                date_tag = item.select_one('time, [datetime]')
                link_tag = item.select_one('a[href^="/"]')

                if not all([title_tag, date_tag, link_tag]):
                    continue

                title = title_tag.text.strip().lower()
                
                # Stricter QML filter: require 'quantum' and ML-related terms
                if ('quantum' in title and 
                    any(term in title for term in ['machine learning', 'neural network', 'qml', 'deep learning'])):
                    date_str = date_tag.get('datetime', '').split('T')[0]
                    
                    try:
                        pub_date = datetime.strptime(date_str, "%Y-%m-%d").date()
                    except ValueError:
                        continue

                    # Only process if within date range
                    if one_week_ago.date() <= pub_date <= today.date():
                        # The selector a[href^="/"] guarantees a site-relative link
                        url = BASE_URL + link_tag['href']
                        
                        print(f"\n📰 Title: {title}")
                        print(f"📅 Date: {pub_date}")
                        print(f"🔗 URL: {url}")

                        pdf_link = search_arxiv_pdf_url(title)
                        if pdf_link:
                            print(f"🆓 arXiv PDF: {pdf_link}")
                            filename = sanitize_filename(title)
                            if download_pdf(pdf_link, filename):
                                downloaded_count += 1
                        else:
                            print("ℹ️ No arXiv preprint found")

                        output_lines.append(f"Title: {title}\nDate: {pub_date}\nURL: {url}\n")
                        output_lines.append(f"arXiv PDF: {pdf_link if pdf_link else 'Not found'}\n")
                        output_lines.append("-" * 50 + "\n")

            except Exception as e:
                print(f"Error processing article '{title}': {e}")
                continue

        if output_lines:
            with open("recent_qml_nature_links.txt", "w", encoding='utf-8') as f:
                f.writelines(output_lines)
            print(f"\n✅ Saved {len(output_lines)//3} articles to recent_qml_nature_links.txt")
            print(f"📥 Downloaded {downloaded_count} PDFs")
        else:
            print("No QML articles found in the last 7 days.")

    except Exception as e:
        print(f"Error in main scraping function: {e}")
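
The title filter and the date window used inside `scrape_recent_qml_articles` can be pulled out into small pure functions and tested without any network access. This is a sketch of that idea; `is_qml_title`, `in_last_week`, and `ML_TERMS` are hypothetical names, and the term list is copied from the function above.

```python
from datetime import date, timedelta

ML_TERMS = ['machine learning', 'neural network', 'qml', 'deep learning']


def is_qml_title(title):
    """True if the title mentions 'quantum' and at least one ML term."""
    t = title.lower()
    return 'quantum' in t and any(term in t for term in ML_TERMS)


def in_last_week(pub_date, today):
    """True if pub_date falls within the 7 days up to and including today."""
    return today - timedelta(days=7) <= pub_date <= today


titles = [
    "Quantum neural networks form Gaussian processes",
    "Interpretable machine learning for atomic scale magnetic anisotropy in quantum materials",
    "Quantum error correction below threshold",  # made-up non-matching title
]
matches = [is_qml_title(t) for t in titles]
print(matches)

today = date(2025, 5, 22)  # arbitrary reference date for the demo
print(in_last_week(date(2025, 5, 19), today))
print(in_last_week(date(2025, 5, 10), today))
```

The second title passes because the check is a plain substring match: it applies machine learning to quantum materials rather than running ML on quantum hardware, so the filter is best read as a heuristic rather than a guarantee.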

- Runs the main function to scrape and download QML articles.
- Outputs progress to the console and saves results to files.

In [8]:
if __name__ == "__main__":
    scrape_recent_qml_articles()

🔍 Searching for recent Quantum Machine Learning articles...

📰 Title: characterizing privacy in quantum machine learning
📅 Date: 2025-05-19
🔗 URL: https://www.nature.com/articles/s41534-025-01022-z
ℹ️ No arXiv preprint found

📰 Title: interpretable machine learning for atomic scale magnetic anisotropy in quantum materials
📅 Date: 2025-05-18
🔗 URL: https://www.nature.com/articles/s41524-025-01637-y
ℹ️ No arXiv preprint found

📰 Title: quantum neural networks form gaussian processes
📅 Date: 2025-05-21
🔗 URL: https://www.nature.com/articles/s41567-025-02883-z
🆓 arXiv PDF: http://arxiv.org/pdf/2305.09957v3.pdf
✅ Saved PDF: quantum neural networks form gaussian processes.pdf

✅ Saved 3 articles to recent_qml_nature_links.txt
📥 Downloaded 1 PDFs
