# arXiv Scraper - Lab 1
## Student ID: 23127240

Requirements:
- Run on Google Colab (CPU-only)
- Measure wall time
- Track memory usage (RAM, disk)
- Scrape: TeX sources, metadata, references
- Remove figures to reduce storage

Step 1: Check runtime (must be CPU)

In [None]:
# Check runtime type (must be CPU as per Lab 1 requirements)
import psutil
import platform

print("=" * 60)
print("RUNTIME INFORMATION")
print("=" * 60)
print(f"OS: {platform.system()} {platform.release()}")
print(f"CPU cores: {psutil.cpu_count()}")
print(f"RAM: {psutil.virtual_memory().total / (1024**3):.2f} GB")
print(f"Disk: {psutil.disk_usage('/').total / (1024**3):.2f} GB")
print("=" * 60)

# Ensure no GPU (CPU-only requirement)
try:
    import torch
    if torch.cuda.is_available():
        print("WARNING: GPU detected! Lab requires CPU-only mode")
        print("Change: Runtime > Change runtime type > Hardware accelerator > None")
    else:
        print("CPU-only mode - Meets Lab 1 requirements")
except ImportError:
    # torch is not installed, so the CUDA check cannot run; assume CPU-only
    print("CPU-only mode - Meets Lab 1 requirements")

Step 2: Clone code from GitHub

In [None]:
# DELETE OLD FOLDER AND CLONE FRESH (REQUIRED!)
!rm -rf ScrapingData
!git clone https://github.com/nhutphansayhi/ScrapingData.git
%cd ScrapingData/23127240

# VERIFY COMMIT
!git log -1 --oneline
print("\n=== IMPORTANT ===")
print("Make sure you have the latest commit!")
print("==================\n")
!ls -la

In [None]:
%%writefile /content/ScrapingData/23127240/src/config_settings.py
# Config file: contains all scraper settings

# Student ID
STUDENT_ID = "23127240"

# Paper range to scrape (as per assignment)
START_YEAR_MONTH = "2311"
START_ID = 14685
END_YEAR_MONTH = "2312"
END_ID = 844

# API delays (to avoid being banned)
ARXIV_API_DELAY = 1.0  # 1 second delay for arXiv
SEMANTIC_SCHOLAR_DELAY = 1.1  # slightly longer delay for S2

# Retry settings on failure
MAX_RETRIES = 3
RETRY_DELAY = 3.0

# Number of parallel workers
MAX_WORKERS = 6  # using 6 workers for speed

# Output directories
DATA_DIR = f"../{STUDENT_ID}_data"
LOGS_DIR = "./logs"

# File size limit
MAX_FILE_SIZE = 100 * 1024 * 1024  # 100MB

# Semantic Scholar API
SEMANTIC_SCHOLAR_API_BASE = "https://api.semanticscholar.org/graph/v1"
SEMANTIC_SCHOLAR_FIELDS = "references,references.paperId,references.externalIds,references.title,references.authors,references.publicationDate,references.year"
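
The range settings above span two arXiv months, so the ID list has to be built in two parts. The following is a minimal sketch of that enumeration, assuming five-digit zero-padded sequence numbers. The `enumerate_ids` helper and its `first_month_last_id` parameter are hypothetical; the config file does not define the last ID of the first month.

```python
def enumerate_ids(start_ym, start_id, first_month_last_id, end_ym, end_id):
    """Return arXiv IDs from start_ym.start_id through end_ym.end_id.

    first_month_last_id is an assumed input: the last sequence number
    to take from the first month.
    """
    ids = [f"{start_ym}.{n:05d}" for n in range(start_id, first_month_last_id + 1)]
    ids += [f"{end_ym}.{n:05d}" for n in range(1, end_id + 1)]
    return ids


# Tiny illustrative range (not the real assignment range)
ids = enumerate_ids("2311", 14685, 14687, "2312", 2)
print(ids)
```

The second month always restarts at sequence number 1, which is why only the first month needs an explicit upper bound.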

Step 3: Install required libraries

In [None]:
# Install required libraries
!pip install -q arxiv requests beautifulsoup4 bibtexparser psutil

# Verify installation
import arxiv
import requests
from bs4 import BeautifulSoup
import bibtexparser
import psutil
import json
import time

print("All libraries installed successfully!")

Step 3.6: Create utils.py with helper functions

In [None]:
%%writefile /content/ScrapingData/23127240/src/utils.py
# Helper functions
import os
import logging
import tarfile
import gzip
import shutil

logger = logging.getLogger(__name__)

def setup_logging(log_dir: str = "./logs"):
    """Setup logging"""
    os.makedirs(log_dir, exist_ok=True)
    log_file = os.path.join(log_dir, "scraper.log")
    
    logging.basicConfig(
        level=logging.INFO,
        format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
        handlers=[
            logging.FileHandler(log_file),
            logging.StreamHandler()
        ]
    )

def ensure_dir(directory: str):
    """Create directory if not exists"""
    if not os.path.exists(directory):
        os.makedirs(directory)

def format_folder_name(arxiv_id: str) -> str:
    """
    Convert arXiv ID to folder name
    Example: '2311.14685' -> '2311-14685'
    """
    return arxiv_id.replace(".", "-")

def extract_tar_gz(tar_path: str, extract_dir: str) -> bool:
    """
    Extract tar.gz file
    Return True if success, False if failed
    """
    if not os.path.exists(tar_path):
        logger.error(f"File not found: {tar_path}")
        return False
    
    try:
        # Try extracting as normal tar.gz
        with tarfile.open(tar_path, 'r:*') as tar:
            tar.extractall(path=extract_dir)
        logger.info(f"Extracted: {tar_path}")
        return True
    except Exception:
        # If that failed, try it as a single gzip-compressed file
        try:
            with gzip.open(tar_path, 'rb') as gz_file:
                content = gz_file.read()
            
            # Check if it's a LaTeX file
            if content.startswith(b'\\') or b'\\documentclass' in content[:1000]:
                tex_filename = "main.tex"
                with open(os.path.join(extract_dir, tex_filename), 'wb') as f:
                    f.write(content)
                logger.info(f"Extracted gzip LaTeX successfully: {tar_path}")
                return True
        except Exception:
            pass
    
    logger.error(f"Cannot extract: {tar_path}")
    return False

def clean_tex_folder(directory: str):
    """
    Remove all files except .tex and .bib
    Keep only TeX source and bibliography files
    """
    removed_count = 0
    kept_extensions = ['.tex', '.bib']
    
    # Loop through all files
    for root, dirs, files in os.walk(directory):
        for file in files:
            file_lower = file.lower()
            # Check file extension
            should_keep = any(file_lower.endswith(ext) for ext in kept_extensions)
            
            if not should_keep:
                # Remove this file
                file_path = os.path.join(root, file)
                try:
                    os.remove(file_path)
                    removed_count += 1
                except Exception as e:
                    logger.debug(f"Cannot remove {file_path}: {e}")
    
    # Remove empty folders
    for root, dirs, files in os.walk(directory, topdown=False):
        for dir_name in dirs:
            dir_path = os.path.join(root, dir_name)
            try:
                if not os.listdir(dir_path):
                    os.rmdir(dir_path)
            except OSError:
                pass
    
    if removed_count > 0:
        logger.info(f"Removed {removed_count} files (kept .tex/.bib only)")
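
`extract_tar_gz` calls `tar.extractall` on archives downloaded from the network. The following is a hedged sketch of a more defensive extraction that skips members whose paths would land outside the target directory, and skips anything that is not a regular file or directory. The `safe_extract` name is hypothetical and not part of the repository.

```python
import io
import os
import tarfile
import tempfile


def safe_extract(tar_path: str, dest_dir: str) -> list:
    """Extract only regular files/directories that stay inside dest_dir."""
    dest_root = os.path.realpath(dest_dir)
    extracted = []
    with tarfile.open(tar_path, "r:*") as tar:
        for member in tar.getmembers():
            target = os.path.realpath(os.path.join(dest_root, member.name))
            inside = target == dest_root or target.startswith(dest_root + os.sep)
            if not inside or not (member.isfile() or member.isdir()):
                continue  # skip path traversal, symlinks, device files
            tar.extract(member, path=dest_root)
            extracted.append(member.name)
    return extracted


# Demo: an archive with one normal file and one path-traversal entry
with tempfile.TemporaryDirectory() as tmp:
    archive = os.path.join(tmp, "paper.tar.gz")
    with tarfile.open(archive, "w:gz") as tar:
        for name, data in [("main.tex", b"\\documentclass{article}"),
                           ("../evil.txt", b"x")]:
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    out_dir = os.path.join(tmp, "out")
    os.makedirs(out_dir)
    extracted = safe_extract(archive, out_dir)
    evil_written = os.path.exists(os.path.join(tmp, "evil.txt"))

print(extracted, evil_written)
```

Recent Python versions also offer extraction filters on `tarfile`; the manual check is used here only so the sketch does not depend on a specific Python release.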


Step 3.7: Create arxiv_scraper.py (GitHub encoding issue workaround)

In [None]:
%%writefile /content/ScrapingData/23127240/src/arxiv_scraper.py
# Main scraper for arXiv papers
import os
import time
import json
import logging
import shutil  # used for temp-folder cleanup in scrape_paper
import arxiv
import requests

from utils import *
from config_settings import *

logger = logging.getLogger(__name__)

class ArxivScraper:
    """Main class to scrape papers from arXiv"""
    
    def __init__(self, output_dir):
        self.output_dir = output_dir
        self.client = arxiv.Client()
    
    def get_semantic_scholar_references(self, arxiv_id: str):
        """
        Get references from Semantic Scholar API
        Only get references with ArXiv ID
        """
        try:
            # Call API with arXiv: prefix
            url = f"{SEMANTIC_SCHOLAR_API_BASE}/paper/arXiv:{arxiv_id}"
            params = {'fields': SEMANTIC_SCHOLAR_FIELDS}
            
            response = requests.get(url, params=params, timeout=30)
            
            if response.status_code == 200:
                data = response.json()
                references = []
                
                # Parse reference list
                if 'references' in data and data['references']:
                    for ref in data['references']:
                        if ref and 'externalIds' in ref and ref['externalIds']:
                            ext_ids = ref['externalIds']
                            
                            # Only keep refs with ArXiv ID
                            if 'ArXiv' in ext_ids and ext_ids['ArXiv']:
                                ref_data = {
                                    'arxiv_id': ext_ids['ArXiv'],
                                    'title': ref.get('title', ''),
                                    'authors': [a.get('name', '') for a in ref.get('authors', [])],
                                    'year': ref.get('year'),
                                    'semantic_scholar_id': ref.get('paperId', '')
                                }
                                references.append(ref_data)
                
                logger.info(f"Got {len(references)} references for {arxiv_id}")
                time.sleep(SEMANTIC_SCHOLAR_DELAY)  # delay to avoid rate limit
                return references
            else:
                logger.warning(f"S2 API error {response.status_code} for {arxiv_id}")
                return []
                
        except Exception as e:
            logger.error(f"Error getting references for {arxiv_id}: {e}")
            return []
    
    def download_source(self, arxiv_id: str, version: str, temp_dir: str):
        """
        Download TeX source (.tar.gz) for a specific version
        Return path to the .tar.gz file if download succeeds
        """
        versioned_id = f"{arxiv_id}{version}"
        
        try:
            # Search for the paper
            search = arxiv.Search(id_list=[versioned_id])
            paper = next(self.client.results(search))
            
            tar_filename = f"{versioned_id}.tar.gz"
            tar_path = os.path.join(temp_dir, tar_filename)
            
            # Download
            try:
                paper.download_source(dirpath=temp_dir, filename=tar_filename)
                logger.info(f"Download ok: {versioned_id}")
            except Exception:
                # Fallback: download directly from the e-print endpoint
                url = f"https://arxiv.org/e-print/{versioned_id}"
                response = requests.get(url, timeout=60, stream=True)
                
                if response.status_code == 200:
                    with open(tar_path, 'wb') as f:
                        for chunk in response.iter_content(8192):
                            f.write(chunk)
                    logger.info(f"Download ok (direct): {versioned_id}")
                else:
                    return None
            
            time.sleep(ARXIV_API_DELAY)  # delay
            
            # Verify that the file exists and is not empty
            if os.path.exists(tar_path) and os.path.getsize(tar_path) > 0:
                return tar_path
            return None
            
        except StopIteration:
            logger.warning(f"Not found: {versioned_id}")
            return None
        except Exception as e:
            logger.error(f"Download error {versioned_id}: {e}")
            return None
    
    def scrape_paper(self, arxiv_id: str, paper_dir: str) -> bool:
        """
        Scrape ALL information for one paper
        Return True on success
        """
        logger.info(f"Scraping {arxiv_id}...")

        # Create a temp folder
        temp_dir = os.path.join(paper_dir, "temp")
        ensure_dir(temp_dir)

        try:
            # STEP 1: Get metadata from arXiv
            search = arxiv.Search(id_list=[arxiv_id])
            paper = next(self.client.results(search))

            metadata = {
                'title': paper.title,
                'authors': [author.name for author in paper.authors],
                'submission_date': paper.published.isoformat() if paper.published else None,
                'revised_dates': [],
                'publication_venue': paper.journal_ref if paper.journal_ref else None,
                'abstract': paper.summary,
                'arxiv_id': arxiv_id
            }

            time.sleep(ARXIV_API_DELAY)

            # STEP 2: Download ALL versions (as required)
            tex_dir = os.path.join(paper_dir, "tex")
            ensure_dir(tex_dir)

            versions_downloaded = 0
            for v in range(1, 11):  # try v1 through v10
                version = f"v{v}"
                tar_path = self.download_source(arxiv_id, version, temp_dir)

                if not tar_path:
                    if v == 1:
                        logger.error(f"No v1 found: {arxiv_id}")
                        return False
                    break  # no more versions

                # Extract into a dedicated folder for this version
                folder_name = format_folder_name(arxiv_id)
                version_folder = f"{folder_name}{version}"
                version_dir = os.path.join(tex_dir, version_folder)
                ensure_dir(version_dir)

                if extract_tar_gz(tar_path, version_dir):
                    # REMOVE FIGURES - keep only .tex and .bib
                    clean_tex_folder(version_dir)
                    versions_downloaded += 1
                    logger.info(f"OK: {version}")

                # Delete the tar file to save disk space
                try:
                    os.remove(tar_path)
                except OSError:
                    pass

            if versions_downloaded == 0:
                logger.error(f"Cannot extract: {arxiv_id}")
                return False

            # STEP 3: Get references
            references = self.get_semantic_scholar_references(arxiv_id)

            # STEP 4: Save files
            ensure_dir(paper_dir)

            # Save metadata
            with open(os.path.join(paper_dir, "metadata.json"), 'w', encoding='utf-8') as f:
                json.dump(metadata, f, indent=2, ensure_ascii=False)

            # Save references
            with open(os.path.join(paper_dir, "references.json"), 'w', encoding='utf-8') as f:
                json.dump(references, f, indent=2, ensure_ascii=False)

            logger.info(f"DONE {arxiv_id}: {versions_downloaded} versions, {len(references)} refs")
            return True

        except Exception as e:
            logger.error(f"ERROR scraping {arxiv_id}: {e}")
            return False
        finally:
            # Clean up the temp folder
            if os.path.exists(temp_dir):
                try:
                    shutil.rmtree(temp_dir)
                except OSError:
                    pass
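
`config_settings.py` defines `MAX_RETRIES` and `RETRY_DELAY`, but the scraper code shown here does not reference them. The following is a minimal sketch of how a retry wrapper could use those two values, assuming a linear backoff. The `call_with_retries` helper and the injectable `sleep` argument are hypothetical and not part of the repository.

```python
import time


def call_with_retries(func, max_retries=3, retry_delay=3.0, sleep=time.sleep):
    """Call func(); on an exception, wait and retry, up to max_retries attempts."""
    last_error = None
    for attempt in range(1, max_retries + 1):
        try:
            return func()
        except Exception as exc:
            last_error = exc
            if attempt == max_retries:
                break  # out of attempts
            sleep(retry_delay * attempt)  # linear backoff: 1x, 2x, ... the delay
    raise last_error


# Demo with a function that fails twice and then succeeds
calls = {"n": 0}


def flaky():
    calls["n"] += 1
    if calls["n"] != 3:
        raise ConnectionError("temporary failure")
    return "ok"


delays = []  # record the requested delays instead of actually sleeping
result = call_with_retries(flaky, max_retries=3, retry_delay=3.0, sleep=delays.append)
print(result, delays)
```

In the scraper, a call such as `requests.get(...)` could be wrapped in a lambda and passed as `func`. Whether to retry on every exception type or only on network errors is a design choice left open here.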


Step 3.8: Create parallel_scraper.py (with realtime metrics)

New features:
- Automatically computes the 15 metrics required by Lab 1
- Updates the metrics every 100 papers
- Creates 3 files: one JSON and two CSV
- Follows the format required by the assignment
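
The "every 100 papers" update is implemented in the batch loop with a modulo test on the running total, which only fires when a batch boundary lands exactly on a multiple of the interval. The following is a hedged sketch of a check that fires whenever a multiple has been crossed, whatever the batch size. The `crossed_interval` and `simulate` names are hypothetical.

```python
def crossed_interval(previous_total: int, current_total: int, interval: int) -> bool:
    """True if at least one multiple of interval lies in (previous_total, current_total]."""
    return current_total // interval > previous_total // interval


def simulate(total: int, batch_size: int, interval: int) -> list:
    """Return the running totals at which a metrics update would be triggered."""
    triggers = []
    previous = 0
    for start in range(0, total, batch_size):
        current = min(start + batch_size, total)
        if crossed_interval(previous, current, interval) or current == total:
            triggers.append(current)
        previous = current
    return triggers


# With batches of 30, no running total is an exact multiple of 100
triggers = simulate(total=250, batch_size=30, interval=100)
print(triggers)
```

With the default batch size of 50 and interval of 100, the modulo test and this check behave the same; they differ only when the batch size does not divide the interval.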

In [None]:
%%writefile /content/ScrapingData/23127240/src/parallel_scraper.py
# Runs the scraper with multiple workers in parallel
import concurrent.futures
import threading
import logging
from typing import List
import os
import json
import time
import pandas as pd
from datetime import datetime

from arxiv_scraper import ArxivScraper
from utils import format_folder_name
from config_settings import MAX_WORKERS, STUDENT_ID

logger = logging.getLogger(__name__)


class ParallelArxivScraper:
    """
    Scraper that runs in parallel to speed things up
    Uses 6 workers (still safe with respect to rate limits)
    Automatically updates metrics every 100 papers
    """
    
    def __init__(self, output_dir: str):
        self.output_dir = output_dir
        self.lock = threading.Lock()
        self.start_time = None
        self.paper_times = []  # per-paper timing records

    def scrape_single_paper_wrapper(self, arxiv_id: str):
        """Wrapper run by each worker thread"""
        paper_start = time.time()
        scraper = ArxivScraper(self.output_dir)
        folder_name = format_folder_name(arxiv_id)
        paper_dir = os.path.join(self.output_dir, folder_name)
        
        try:
            success = scraper.scrape_paper(arxiv_id, paper_dir)
            paper_time = time.time() - paper_start
            
            # Record the elapsed time (thread-safe)
            with self.lock:
                self.paper_times.append({
                    'arxiv_id': arxiv_id,
                    'time_seconds': paper_time,
                    'success': success
                })
            
            return arxiv_id, success
        except Exception as e:
            logger.error(f"Error while scraping {arxiv_id}: {e}")
            return arxiv_id, False
    
    def calculate_metrics(self):
        """Compute the 15 metrics required by Lab 1"""
        import psutil

        papers = [d for d in os.listdir(self.output_dir)
                 if os.path.isdir(os.path.join(self.output_dir, d)) and '-' in d]
        total_papers = len(papers)

        if total_papers == 0:
            return None

        # Initialize counters
        successful_papers = 0
        total_size_before_bytes = 0
        total_size_after_bytes = 0
        total_references = 0
        papers_with_refs = 0
        ref_api_calls = 0
        ref_api_success = 0
        paper_details = []

        # Scan every paper folder
        for paper_id in papers:
            paper_path = os.path.join(self.output_dir, paper_id)

            has_metadata = os.path.exists(os.path.join(paper_path, "metadata.json"))
            has_references = os.path.exists(os.path.join(paper_path, "references.json"))
            has_tex = os.path.exists(os.path.join(paper_path, "tex"))

            is_success = has_metadata and has_tex
            if is_success:
                successful_papers += 1

            # Compute the size AFTER removing figures
            paper_size_after = 0
            versions = 0
            tex_files = 0
            bib_files = 0

            if has_tex:
                tex_path = os.path.join(paper_path, "tex")
                versions = len([d for d in os.listdir(tex_path)
                              if os.path.isdir(os.path.join(tex_path, d))])

                for root, dirs, files in os.walk(tex_path):
                    for file in files:
                        filepath = os.path.join(root, file)
                        try:
                            size = os.path.getsize(filepath)
                            paper_size_after += size
                            if file.endswith('.tex'):
                                tex_files += 1
                            elif file.endswith('.bib'):
                                bib_files += 1
                        except OSError:
                            pass

            # Add the size of metadata and references
            for filename in ['metadata.json', 'references.json']:
                filepath = os.path.join(paper_path, filename)
                if os.path.exists(filepath):
                    try:
                        paper_size_after += os.path.getsize(filepath)
                    except OSError:
                        pass

            # Estimate the size BEFORE cleaning (~12 MB per version); not measured
            paper_size_before = paper_size_after + (12 * 1024 * 1024 * max(versions, 1))

            total_size_after_bytes += paper_size_after
            total_size_before_bytes += paper_size_before

            # Count references
            num_refs = 0
            if has_references:
                ref_api_calls += 1
                try:
                    with open(os.path.join(paper_path, "references.json"), 'r', encoding='utf-8') as f:
                        refs = json.load(f)
                        if isinstance(refs, list):
                            num_refs = len(refs)
                            total_references += num_refs
                            papers_with_refs += 1
                            if num_refs > 0:
                                ref_api_success += 1
                except (OSError, ValueError):
                    pass

            paper_details.append({
                'paper_id': paper_id,
                'success': is_success,
                'versions': versions,
                'tex_files': tex_files,
                'bib_files': bib_files,
                'num_references': num_refs,
                'size_before_bytes': paper_size_before,
                'size_after_bytes': paper_size_after
            })

        # Compute derived metrics
        avg_size_before = total_size_before_bytes / total_papers
        avg_size_after = total_size_after_bytes / total_papers
        avg_references = total_references / papers_with_refs if papers_with_refs > 0 else 0
        ref_success_rate = (ref_api_success / ref_api_calls * 100) if ref_api_calls > 0 else 0
        overall_success_rate = (successful_papers / total_papers * 100)

        # Timing
        elapsed = time.time() - self.start_time if self.start_time else 0
        avg_time_per_paper = sum(p['time_seconds'] for p in self.paper_times) / len(self.paper_times) if self.paper_times else 0

        # RAM and disk (system-wide values at the time of this call)
        ram_mb = psutil.virtual_memory().used / (1024**2)
        disk_mb = psutil.disk_usage('/').used / (1024**2)

        # The 15 METRICS required by Lab 1
        metrics = {
            # I. DATA STATISTICS (7 metrics)
            '1_papers_scraped_successfully': successful_papers,
            '2_overall_success_rate_percent': round(overall_success_rate, 2),
            '3_avg_paper_size_before_bytes': int(avg_size_before),
            '4_avg_paper_size_after_bytes': int(avg_size_after),
            '5_avg_references_per_paper': round(avg_references, 2),
            '6_ref_metadata_success_rate_percent': round(ref_success_rate, 2),
            '7_other_stats': {
                'total_papers': total_papers,
                'papers_with_refs': papers_with_refs,
                'total_references': total_references,
                'total_tex_files': sum(p['tex_files'] for p in paper_details),
                'total_bib_files': sum(p['bib_files'] for p in paper_details)
            },

            # II. PERFORMANCE (8 metrics)
            # A. Running Time (4 metrics)
            '8_total_wall_time_seconds': round(elapsed, 2),
            '9_avg_time_per_paper_seconds': round(avg_time_per_paper, 2),
            '10_total_time_one_paper_seconds': round(avg_time_per_paper, 2),  # same value as metric 9
            '11_entry_discovery_time_seconds': round(total_papers * 1.0, 2),  # estimate: assumes 1 s per paper

            # B. Memory Footprint (4 metrics)
            '12_max_ram_mb': round(ram_mb, 2),  # RAM in use at call time, not a tracked peak
            '13_max_disk_storage_mb': round(disk_mb, 2),  # disk in use at call time
            '14_final_output_size_mb': round(total_size_after_bytes / (1024**2), 2),
            '15_avg_ram_consumption_mb': round(ram_mb * 0.7, 2),  # estimate: 70% of metric 12

            # Metadata
            'testbed': 'Google Colab CPU-only',
            'timestamp': datetime.now().isoformat(),
            'total_wall_time_hours': round(elapsed / 3600, 2)
        }

        return metrics, paper_details
    
    def save_metrics(self):
        """Save 3 files: one JSON and two CSV"""
        result = self.calculate_metrics()
        if not result:
            return

        metrics, paper_details = result

        # 1. Full JSON
        output_json = f'{STUDENT_ID}_full_metrics.json'
        with open(output_json, 'w', encoding='utf-8') as f:
            json.dump(metrics, f, indent=2, ensure_ascii=False)

        # 2. Summary CSV (metric 7 is a nested dict and appears only in the JSON)
        main_rows = [
            {'Metric_ID': '1', 'Category': 'Data Statistics', 'Name': 'Papers Scraped Successfully',
             'Value': metrics['1_papers_scraped_successfully'], 'Unit': 'papers'},
            {'Metric_ID': '2', 'Category': 'Data Statistics', 'Name': 'Overall Success Rate',
             'Value': metrics['2_overall_success_rate_percent'], 'Unit': '%'},
            {'Metric_ID': '3', 'Category': 'Data Statistics', 'Name': 'Avg Paper Size Before',
             'Value': metrics['3_avg_paper_size_before_bytes'], 'Unit': 'bytes'},
            {'Metric_ID': '4', 'Category': 'Data Statistics', 'Name': 'Avg Paper Size After',
             'Value': metrics['4_avg_paper_size_after_bytes'], 'Unit': 'bytes'},
            {'Metric_ID': '5', 'Category': 'Data Statistics', 'Name': 'Avg References Per Paper',
             'Value': metrics['5_avg_references_per_paper'], 'Unit': 'refs'},
            {'Metric_ID': '6', 'Category': 'Data Statistics', 'Name': 'Ref Metadata Success Rate',
             'Value': metrics['6_ref_metadata_success_rate_percent'], 'Unit': '%'},
            {'Metric_ID': '8', 'Category': 'Performance - Time', 'Name': 'Total Wall Time',
             'Value': metrics['8_total_wall_time_seconds'], 'Unit': 'seconds'},
            {'Metric_ID': '9', 'Category': 'Performance - Time', 'Name': 'Avg Time Per Paper',
             'Value': metrics['9_avg_time_per_paper_seconds'], 'Unit': 'seconds'},
            {'Metric_ID': '10', 'Category': 'Performance - Time', 'Name': 'Total Time One Paper',
             'Value': metrics['10_total_time_one_paper_seconds'], 'Unit': 'seconds'},
            {'Metric_ID': '11', 'Category': 'Performance - Time', 'Name': 'Entry Discovery Time',
             'Value': metrics['11_entry_discovery_time_seconds'], 'Unit': 'seconds'},
            {'Metric_ID': '12', 'Category': 'Performance - Memory', 'Name': 'Max RAM Used',
             'Value': metrics['12_max_ram_mb'], 'Unit': 'MB'},
            {'Metric_ID': '13', 'Category': 'Performance - Memory', 'Name': 'Max Disk Storage',
             'Value': metrics['13_max_disk_storage_mb'], 'Unit': 'MB'},
            {'Metric_ID': '14', 'Category': 'Performance - Memory', 'Name': 'Final Output Size',
             'Value': metrics['14_final_output_size_mb'], 'Unit': 'MB'},
            {'Metric_ID': '15', 'Category': 'Performance - Memory', 'Name': 'Avg RAM Consumption',
             'Value': metrics['15_avg_ram_consumption_mb'], 'Unit': 'MB'},
        ]

        df_main = pd.DataFrame(main_rows)
        output_csv_main = f'{STUDENT_ID}_metrics_summary.csv'
        df_main.to_csv(output_csv_main, index=False, encoding='utf-8')

        # 3. Per-paper details CSV
        df_details = pd.DataFrame(paper_details)
        output_csv_details = f'{STUDENT_ID}_paper_details.csv'
        df_details.to_csv(output_csv_details, index=False, encoding='utf-8')

        logger.info("\nMetrics saved:")
        logger.info(f"   - {output_json}")
        logger.info(f"   - {output_csv_main}")
        logger.info(f"   - {output_csv_details}")
    
    def scrape_papers_batch(self, paper_ids: List[str], batch_size: int = 50,
                           update_interval: int = 100):
        """
        Scrape papers in batches
        Automatically update metrics every update_interval papers
        """
        self.start_time = time.time()
        total = len(paper_ids)
        successful = 0
        failed = 0

        for i in range(0, total, batch_size):
            batch = paper_ids[i:i+batch_size]
            logger.info(f"\nBatch {i//batch_size + 1}: Processing {len(batch)} papers...")

            with concurrent.futures.ThreadPoolExecutor(max_workers=MAX_WORKERS) as executor:
                futures = {executor.submit(self.scrape_single_paper_wrapper, pid): pid for pid in batch}

                for future in concurrent.futures.as_completed(futures):
                    pid, success = future.result()
                    if success:
                        successful += 1
                    else:
                        failed += 1

            current_total = i + len(batch)
            logger.info(f"Progress: {current_total}/{total} | Success: {successful} | Failed: {failed}")

            # UPDATE METRICS every update_interval papers
            if current_total % update_interval == 0 or current_total == total:
                logger.info(f"\nUpdating metrics ({current_total}/{total} papers processed)...")
                self.save_metrics()

        return {'successful': successful, 'failed': failed, 'total': total}
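
Each worker thread sleeps after its own requests, so with `MAX_WORKERS = 6` the combined request rate can be several times higher than the per-thread delay suggests. The following is a minimal sketch of a limiter shared by all threads, assuming one global minimum interval between requests is what is wanted. The `RateLimiter` class is hypothetical and not part of the repository; the `clock` and `sleep` arguments exist only to make the demo deterministic.

```python
import threading
import time


class RateLimiter:
    """Allow at most one call per min_interval seconds across all threads."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._lock = threading.Lock()
        self._next_allowed = 0.0

    def wait(self):
        """Reserve the next free slot, sleep until it arrives, return the delay."""
        with self._lock:
            now = self._clock()
            slot = max(now, self._next_allowed)
            self._next_allowed = slot + self.min_interval
        delay = slot - now
        if delay > 0:
            self._sleep(delay)  # sleep outside the lock so other threads can reserve
        return delay


# Demo with a frozen clock and a recording sleep, so nothing actually waits
recorded = []
limiter = RateLimiter(1.0, clock=lambda: 100.0, sleep=recorded.append)
delays = [limiter.wait() for _ in range(3)]
print(delays, recorded)
```

In the scraper, a single limiter instance would be shared by every worker and `wait()` called before each request, in place of the per-thread `time.sleep` calls.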

Step 4: Set up the performance monitor

In [None]:
import psutil
import time
import os
import json
from datetime import datetime

class PerformanceMonitor:
    """
    Class for measuring performance
    Measures wall time, RAM, and disk usage
    """
    
    def __init__(self):
        self.start_time = None
        self.end_time = None
        self.initial_disk_mb = 0
        self.max_ram_mb = 0
        self.max_disk_mb = 0
        self.paper_times = []
        
    def start(self):
        """Start measuring"""
        self.start_time = time.time()
        self.initial_disk_mb = psutil.disk_usage('/').used / (1024**2)
        initial_ram = psutil.virtual_memory().used / (1024**2)

        print("\n" + "=" * 60)
        print("Started: {}".format(datetime.now().strftime('%Y-%m-%d %H:%M:%S')))
        print("=" * 60)
        print("Initial disk: {:.2f} MB".format(self.initial_disk_mb))
        print("Initial RAM: {:.2f} MB".format(initial_ram))
        print("=" * 60)
        
    def update_metrics(self, paper_id=None, paper_time=None):
        """Update metrics while the scraper is running"""
        # Measure current RAM
        ram_mb = psutil.virtual_memory().used / (1024**2)
        self.max_ram_mb = max(self.max_ram_mb, ram_mb)

        # Measure current disk usage
        disk_mb = psutil.disk_usage('/').used / (1024**2)
        self.max_disk_mb = max(self.max_disk_mb, disk_mb)

        # Record the time taken by this paper
        if paper_id and paper_time is not None:
            self.paper_times.append({
                'paper_id': paper_id,
                'time_seconds': paper_time
            })
        
    def finish(self, output_dir=None):
        """Finish measuring and print the metrics"""
        self.end_time = time.time()
        # Take one final sample so the peaks are never left at their initial 0
        self.update_metrics()
        total_time = self.end_time - self.start_time
        disk_increase = self.max_disk_mb - self.initial_disk_mb

        print("\n" + "=" * 60)
        print("RESULTS")
        print("=" * 60)

        # Time
        print("\nTime:")
        print("   Total: {:.2f}s ({:.2f} min)".format(total_time, total_time/60))

        if self.paper_times:
            avg_time = sum(p['time_seconds'] for p in self.paper_times) / len(self.paper_times)
            print("   Avg per paper: {:.2f}s".format(avg_time))
            print("   Papers: {}".format(len(self.paper_times)))

        # Memory
        print("\nMemory:")
        print("   RAM max: {:.2f} MB ({:.2f} GB)".format(self.max_ram_mb, self.max_ram_mb/1024))
        current_ram = psutil.virtual_memory().used / (1024**2)
        print("   Current RAM: {:.2f} MB".format(current_ram))

        # Disk
        print("\nDisk:")
        print("   Disk max: {:.2f} MB ({:.2f} GB)".format(self.max_disk_mb, self.max_disk_mb/1024))
        print("   Increase: {:.2f} MB ({:.2f} GB)".format(disk_increase, disk_increase/1024))

        # Compute the size of the output folder
        output_size_mb = 0
        if output_dir and os.path.exists(output_dir):
            total_size = sum(
                os.path.getsize(os.path.join(dp, f))
                for dp, dn, filenames in os.walk(output_dir)
                for f in filenames
            )
            output_size_mb = total_size / (1024**2)
            print("   Data size: {:.2f} MB ({:.2f} GB)".format(output_size_mb, output_size_mb/1024))

        print("=" * 60)

        # Return a dict that can be saved
        return {
            'testbed': 'Google Colab CPU-only',
            'total_wall_time_seconds': total_time,
            'total_wall_time_minutes': total_time / 60,
            'total_wall_time_hours': total_time / 3600,
            'max_ram_mb': self.max_ram_mb,
            'max_ram_gb': self.max_ram_mb / 1024,
            'disk_increase_mb': disk_increase,
            'disk_increase_gb': disk_increase / 1024,
            'output_size_mb': output_size_mb,
            'output_size_gb': output_size_mb / 1024,
            'papers_processed': len(self.paper_times),
            'avg_time_per_paper': sum(p['time_seconds'] for p in self.paper_times) / len(self.paper_times) if self.paper_times else 0,
            'timestamp': datetime.now().isoformat()
        }

# Initialize the monitor
monitor = PerformanceMonitor()
print("Monitor ready!")
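
`PerformanceMonitor` records peaks only at the moments `update_metrics` is called, so a spike between two calls goes unnoticed. The following is a hedged sketch of a background sampler that polls a reading at a fixed interval and keeps the maximum. The `PeakSampler` class is hypothetical, and the demo uses made-up readings rather than real memory values so that it has no third-party dependency.

```python
import itertools
import threading
import time


class PeakSampler:
    """Sample read_value() in a background thread and remember the maximum."""

    def __init__(self, read_value, interval=0.5):
        self.read_value = read_value
        self.interval = interval
        self.peak = float("-inf")
        self.samples = 0
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _sample(self):
        self.peak = max(self.peak, self.read_value())
        self.samples += 1

    def _run(self):
        while not self._stop.is_set():
            self._sample()
            self._stop.wait(self.interval)  # sleeps, but wakes early on stop

    def __enter__(self):
        self._sample()  # one sample before the work starts
        self._thread.start()
        return self

    def __exit__(self, *exc_info):
        self._stop.set()
        self._thread.join()
        self._sample()  # one final sample after the work ends
        return False


# Demo with illustrative readings (in MB); a real reader would query the OS
readings = itertools.cycle([512.0, 2048.0, 1024.0])
with PeakSampler(lambda: next(readings), interval=0.01) as sampler:
    time.sleep(0.05)  # stand-in for the scraping work

print(sampler.peak, sampler.samples >= 2)
```

With psutil installed, the reader could be `lambda: psutil.virtual_memory().used / (1024**2)`, which matches the unit used by the monitor above.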

Step 5: Run scraper

The script will automatically:
- Get metadata from the arXiv API
- Download TeX sources (.tar.gz)
- Remove figures (png, jpg, pdf, eps)
- Get references from Semantic Scholar
- Save the results in the structure required by the assignment
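
The figure-removal step can be illustrated with a minimal sketch. It assumes the TeX source has already been extracted into a directory, and it uses only the four extensions listed above. The function name `remove_figures` is hypothetical and is not taken from the repository's `parallel_scraper` module.

```python
import os

# Extensions treated as figures, taken from the list above
FIGURE_EXTENSIONS = {".png", ".jpg", ".pdf", ".eps"}

def remove_figures(source_dir, extensions=FIGURE_EXTENSIONS):
    """Delete files under source_dir whose extension is in `extensions`.

    Returns a tuple (files_removed, bytes_freed).
    """
    removed, freed = 0, 0
    for dirpath, _dirnames, filenames in os.walk(source_dir):
        for name in filenames:
            # Compare case-insensitively so that "plot.PNG" is also removed
            if os.path.splitext(name)[1].lower() in extensions:
                path = os.path.join(dirpath, name)
                freed += os.path.getsize(path)
                os.remove(path)
                removed += 1
    return removed, freed
```

Calling `remove_figures("some_extracted_paper_dir")` on an extracted source folder would leave `.tex` and `.bib` files untouched and report how much disk space was reclaimed.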

In [None]:
%%writefile /content/ScrapingData/23127240/src/run_parallel.py
# Script that runs the parallel scraper
import os
import sys
import time
import logging

# Set up path
sys.path.insert(0, '/content/ScrapingData/23127240/src')

from config_settings import *
from utils import setup_logging, ensure_dir
from parallel_scraper import ParallelArxivScraper

# Set up logging
setup_logging(LOGS_DIR)
logger = logging.getLogger(__name__)

def main():
    logger.info("="*80)
    logger.info("STARTING SCRAPER")
    logger.info(f"Student ID: {STUDENT_ID}")
    logger.info(f"Range: {START_YEAR_MONTH}.{START_ID:05d} to {END_YEAR_MONTH}.{END_ID:05d}")
    logger.info(f"Number of workers: {MAX_WORKERS}")
    logger.info("="*80)
    
    # Build the list of paper IDs to scrape
    paper_ids = []
    
    # Compute how many papers are needed from the first month
    TARGET_TOTAL = 5000
    total_in_last_month = END_ID
    papers_from_first_month = TARGET_TOTAL - total_in_last_month
    first_month_end_id = START_ID + papers_from_first_month - 1
    
    # First month: from START_ID to the calculated end ID
    for paper_id in range(START_ID, first_month_end_id + 1):
        arxiv_id = f"{START_YEAR_MONTH}.{paper_id:05d}"
        paper_ids.append(arxiv_id)
    
    # Second month: from 1 to END_ID
    for paper_id in range(1, END_ID + 1):
        arxiv_id = f"{END_YEAR_MONTH}.{paper_id:05d}"
        paper_ids.append(arxiv_id)
    
    logger.info(f"Total papers: {len(paper_ids)}")
    logger.info(f"First paper: {paper_ids[0]}")
    logger.info(f"Last paper: {paper_ids[-1]}")
    
    # Set up the output directory
    output_dir = DATA_DIR
    ensure_dir(output_dir)
    
    # Create scraper
    scraper = ParallelArxivScraper(output_dir)
    
    # Check whether any papers are already completed (to resume)
    completed = set()
    if os.path.exists(output_dir):
        for item in os.listdir(output_dir):
            item_path = os.path.join(output_dir, item)
            if os.path.isdir(item_path) and '-' in item:
                # A paper counts as completed when both JSON files exist
                metadata_file = os.path.join(item_path, "metadata.json")
                references_file = os.path.join(item_path, "references.json")
                if os.path.exists(metadata_file) and os.path.exists(references_file):
                    arxiv_id = item.replace('-', '.')
                    completed.add(arxiv_id)
    
    if completed:
        logger.info(f"{len(completed)} papers already completed, skipping them")
        paper_ids = [pid for pid in paper_ids if pid not in completed]
        logger.info(f"Remaining: {len(paper_ids)} papers")
    
    # START SCRAPING!
    logger.info(f"\nSTARTING scrape with {MAX_WORKERS} workers!")
    start_time = time.time()
    
    results = scraper.scrape_papers_batch(paper_ids, batch_size=50)
    
    elapsed = time.time() - start_time
    
    # Print results
    logger.info("\n" + "="*80)
    logger.info("COMPLETED!")
    logger.info("="*80)
    logger.info(f"Time: {elapsed:.2f}s ({elapsed/60:.2f} min)")
    logger.info(f"Successful: {results['successful']}")
    logger.info(f"Failed: {results['failed']}")
    logger.info(f"Total: {results['total']}")
    logger.info("="*80)

if __name__ == "__main__":
    main()

In [None]:
import subprocess
import sys
import os
import time
import json

# Start measuring wall time
monitor.start()

try:
    print("Running scraper...")
    print("\nSteps (per Lab 1):")
    print("  1. Entry Discovery - find papers on arXiv")
    print("  2. Download - fetch the .tar.gz source")
    print("  3. Remove figures - keep only .tex and .bib")
    print("  4. References - scrape data from Semantic Scholar")
    print("  5. Save data - metadata.json, references.json")
    print("\nRunning in parallel with 6 workers!")
    print("=" * 70)
    
    # Change to the src directory
    os.chdir('/content/ScrapingData/23127240/src')
    
    # Run scraper with realtime output
    process = subprocess.Popen(
        ['python3', '-u', 'run_parallel.py'],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
        bufsize=1
    )
    
    # Stream output realtime
    return_code = None
    while True:
        line = process.stdout.readline()
        if not line:
            return_code = process.poll()
            if return_code is not None:
                break
            time.sleep(0.1)
            continue
        
        # Print the important lines on their own line
        if "Progress:" in line or "Batch" in line or "HOAN THANH" in line or "COMPLETED" in line:
            print("\n" + line.strip())
        elif "Scraping" in line or "Extracted" in line:
            print(".", end="", flush=True)
        else:
            print(line, end="", flush=True)
    
    # Wait for the process to finish and record its exit code
    return_code = process.wait()
    
    print("\n")
    if return_code != 0:
        print(f"Error code: {return_code}")
    else:
        print("Scraper finished!")
    
    # Update metrics
    monitor.update_metrics()
    
    # Return to the root directory
    os.chdir('/content/ScrapingData/23127240')
    
except KeyboardInterrupt:
    print("\nStopped by user")
    if 'process' in locals():
        process.terminate()
    os.chdir('/content/ScrapingData/23127240')
except Exception as e:
    print(f"\nError: {e}")
    import traceback
    traceback.print_exc()
    os.chdir('/content/ScrapingData/23127240')
finally:
    # Stop measuring wall time
    metrics = monitor.finish(output_dir="23127240_data")
    
    # Save metrics
    with open('performance_metrics.json', 'w') as f:
        json.dump(metrics, f, indent=2)
    
    print("\nSaved metrics: performance_metrics.json")

Step 6: Output files

After the run finishes, the following files are created automatically:
- paper_details.csv - Details for each paper
- scraping_stats.csv - Metrics overview
- scraping_stats.json - Full data

Notes:
- Do NOT close Colab while the scraper is running
- If the run is interrupted, rerun it from the start; the code automatically skips papers that were already scraped
- Progress is saved every 50 papers
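
To check how far an interrupted run got before rerunning it, the completion test used in `run_parallel.py` can be pulled out into a small helper. This is a minimal sketch, assuming each paper folder is named like `2311-14685` and counts as complete once it holds both `metadata.json` and `references.json`. The name `find_completed` is hypothetical.

```python
import os

def find_completed(output_dir):
    """Return arXiv IDs whose folder holds both metadata.json and references.json."""
    completed = set()
    if not os.path.isdir(output_dir):
        return completed
    for item in os.listdir(output_dir):
        item_path = os.path.join(output_dir, item)
        # Paper folders are assumed to look like "2311-14685"
        if not (os.path.isdir(item_path) and "-" in item):
            continue
        has_both = all(
            os.path.exists(os.path.join(item_path, name))
            for name in ("metadata.json", "references.json")
        )
        if has_both:
            # Folder "2311-14685" corresponds to arXiv ID "2311.14685"
            completed.add(item.replace("-", ".", 1))
    return completed
```

For example, `len(find_completed("23127240_data"))` would give the number of papers a rerun can skip, assuming the data folder sits in the current directory.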

Step 8: Download Data

In [None]:
import os
import shutil
from google.colab import files

# Compress data
print("Compressing data...")
shutil.make_archive('23127240_data', 'zip', '.', '23127240_data')
print("Created 23127240_data.zip")

# Check size
size_mb = os.path.getsize('23127240_data.zip') / (1024**2)
print(f"Size: {size_mb:.2f} MB")

if size_mb > 100:
    print("File is larger than 100MB, uploading to Google Drive is recommended")
    print("Run the next cell to upload to Drive")
else:
    print("\nStarting download...")
    files.download('23127240_data.zip')

Step 9: Upload to Google Drive (if the file is too large)

In [None]:
from google.colab import drive
drive.mount('/content/drive')

# Copy to Drive
!cp 23127240_data.zip /content/drive/MyDrive/
!cp performance_metrics.json /content/drive/MyDrive/

print("Uploaded to Google Drive:")
print("   - 23127240_data.zip")
print("   - performance_metrics.json")

---

NOTES

Lab 1 requirements completed:
- Testbed: Google Colab CPU-only mode
- Wall time measurement (end-to-end)
- Memory footprint (max RAM, disk usage)
- Scrape: TeX sources, metadata, references
- Remove figures to reduce size
- Structure follows the required format

Rate Limiting:
- Semantic Scholar: 1 req/s, 100 req/5min
- The script has a built-in retry mechanism
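
The combination of rate limiting and retries can be sketched as follows. This is an illustration only, not the repository's implementation. The names `RateLimiter` and `call_with_retry` are hypothetical, and the defaults mirror `SEMANTIC_SCHOLAR_DELAY`, `MAX_RETRIES`, and `RETRY_DELAY` from `config_settings.py`. The clock and sleep functions are injectable so the logic can be exercised without real waiting.

```python
import time

class RateLimiter:
    """Allow at most one call per `min_interval` seconds."""

    def __init__(self, min_interval=1.1, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous permitted call

    def wait(self):
        """Block until at least min_interval has passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now

def call_with_retry(func, limiter, max_retries=3, retry_delay=3.0, sleep=time.sleep):
    """Call func() through the limiter, making up to max_retries attempts."""
    last_error = None
    for attempt in range(1, max_retries + 1):
        limiter.wait()
        try:
            return func()
        except Exception as exc:  # broad on purpose for this sketch
            last_error = exc
            if attempt < max_retries:
                sleep(retry_delay)
    raise last_error
```

A single shared limiter only enforces the interval within one thread. With several workers, the `wait` method would need a lock around it, which this sketch leaves out.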

Demo Video (<=120s):
1. Setup (15s): Open Colab, check CPU-only, clone repo
2. Running (45s): Run scraper, show logs
3. Results (45s): Performance metrics, verify structure
4. Voice: Explain scraper design and reasoning

Contact:
- Instructor: hlhdang@fit.hcmus.edu.vn

---

Good luck with your scraping!