# **Experimentation**

### **Import Relevant Libraries**

In [None]:
'''
The daily papers page for a given date (e.g. 4 October 2024) has the URL format: https://huggingface.co/papers?date=2024-10-04
'''

In [2]:
import requests
from bs4 import BeautifulSoup
import os

# Base URL for Hugging Face
BASE_URL = "https://huggingface.co"

# URL for the Hugging Face daily papers page (change the date parameter as needed)
URL = f"{BASE_URL}/papers?date=2024-10-04"

# Function to extract PDF links from individual paper page
def get_pdf_link(paper_page_url):
    # Fetch the paper's page content
    paper_response = requests.get(paper_page_url)
    
    # Parse the paper's page content
    if paper_response.status_code == 200:
        paper_soup = BeautifulSoup(paper_response.content, 'html.parser')
        
        # Find the PDF link on the paper's page
        # Assumes the PDF link is an <a> tag whose 'href' contains '.pdf';
        # links that point to a PDF without that extension will not match
        pdf_link = paper_soup.find('a', href=lambda href: href and ".pdf" in href)
        
        if pdf_link:
            return pdf_link['href']  # Return the PDF link
    return None

# Function to download the PDF file
def download_pdf(pdf_url, paper_title):
    # Get the PDF content
    pdf_response = requests.get(pdf_url)
    
    if pdf_response.status_code == 200:
        # Build a valid file name: replace characters that are not allowed in
        # file names on Windows (e.g. the ':' found in several paper titles)
        for ch in '\\/:*?"<>|':
            paper_title = paper_title.replace(ch, "-")
        pdf_file_name = f"{paper_title}.pdf"
        
        # Write the PDF content to a file
        with open(pdf_file_name, 'wb') as f:
            f.write(pdf_response.content)
        print(f"Downloaded: {pdf_file_name}")
    else:
        print(f"Failed to download PDF: {pdf_url}")

# Step 1: Scrape the daily papers page for paper links
response = requests.get(URL)

if response.status_code == 200:
    # Parse the main daily papers page
    soup = BeautifulSoup(response.content, 'html.parser')
    
    # Find all the paper links
    # Paper titles are <a> tags carrying the class attributes below
    paper_links = soup.find_all('a', class_='line-clamp-3 cursor-pointer text-balance')
    
    for paper in paper_links:
        paper_title = paper.text.strip()  # Get the title of the paper
        paper_page_url = paper['href']   # Get the relative link to the paper's Hugging Face page

        print(paper_page_url)
        
        # Ensure the link is a full URL
        paper_page_url = f"{BASE_URL}{paper_page_url}"
        
        # Step 2: Visit each paper's Hugging Face page to extract the PDF link
        pdf_link = get_pdf_link(paper_page_url)
        
        if pdf_link:
            # Step 3: Download the PDF
            download_pdf(pdf_link, paper_title)
        else:
            print(f"PDF link not found for: {paper_title}")
else:
    print(f"Failed to retrieve the page. Status code: {response.status_code}")


/papers/2410.02740
PDF link not found for: Revisit Large-Scale Image-Caption Data in Pre-training Multimodal Foundation Models
/papers/2410.02713
PDF link not found for: Video Instruction Tuning With Synthetic Data
/papers/2410.02757
PDF link not found for: Loong: Generating Minute-level Long Videos with Autoregressive Language Models
/papers/2410.02712
PDF link not found for: LLaVA-Critic: Learning to Evaluate Multimodal Models
/papers/2410.02746
PDF link not found for: Contrastive Localized Language-Image Pre-Training
/papers/2410.02073
PDF link not found for: Depth Pro: Sharp Monocular Metric Depth in Less Than a Second
/papers/2410.01679
PDF link not found for: VinePPO: Unlocking RL Potential For LLM Reasoning Through Refined Credit Assignment
/papers/2410.02724
PDF link not found for: Large Language Models as Markov Chains
/papers/2410.02678
PDF link not found for: Distilling an End-to-End Voice Assistant Without Instruction Training Data
/papers/2410.02416
PDF link not found for:
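
Every paper in the run above reports `PDF link not found`, so the `.pdf`-in-`href` heuristic did not match anything on these pages. One possible workaround, sketched below, skips the paper page and builds a PDF URL directly from the paper ID in the relative link. It assumes that the ID in `/papers/<id>` is the paper's arXiv identifier and that arXiv serves the PDF at `https://arxiv.org/pdf/<id>`. The helper name `arxiv_pdf_url` is hypothetical and not part of the original notebook.

```python
import re

ARXIV_PDF_BASE = "https://arxiv.org/pdf"  # assumption: arXiv hosts the PDF for each listed paper

def arxiv_pdf_url(paper_path):
    """Build an arXiv PDF URL from a relative link such as '/papers/2410.02740'.

    Returns None when the path does not end in an arXiv-style ID (YYMM.NNNNN).
    """
    match = re.search(r"/papers/(\d{4}\.\d{4,5})(?:v\d+)?/?$", paper_path)
    if match is None:
        return None
    return f"{ARXIV_PDF_BASE}/{match.group(1)}"

print(arxiv_pdf_url("/papers/2410.02740"))  # prints https://arxiv.org/pdf/2410.02740
```

The returned URL could be passed to `download_pdf` in place of the link from `get_pdf_link`, which would also save one HTTP request per paper.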

### **Using My Classes**

In [1]:
# Import the paperscraper module
from paperscraper import paperscraper

# Create an instance of the class
scraper = paperscraper('2024-10-04')

# Use get_links function
paper_pdfs = scraper.get_links()

pdf_dict = scraper.get_pdf_text(paper_pdfs)

In [3]:
print(pdf_dict['Revisit Large-Scale Image-Caption Data in Pre-training Multimodal Foundation Models'])

Preprint
REVISIT LARGE -SCALE IMAGE -CAPTION DATA IN PRE-
TRAINING MULTIMODAL FOUNDATION MODELS
Zhengfeng Lai∗, Vasileios Saveris∗, Chen Chen, Hong-You Chen, Haotian Zhang
Bowen Zhang, Juan Lao Tebar, Wenze Hu, Zhe Gan, Peter Grasch
Meng Cao, Yinfei Yang†
Apple AI/ML
{jeff_lai,v_saveris,pgrasch,mengcao,yinfeiy}@apple.com
ABSTRACT
Recent advancements in multimodal models highlight the value of rewritten cap-
tions for improving performance, yet key challenges remain. For example, while
synthetic captions often provide superior quality and image-text alignment, it is
not clear whether they can fully replace AltTexts: the role of synthetic captions
and their interaction with original web-crawled AltTexts in pre-training is still not
well understood. Moreover, different multimodal foundation models may have
unique preferences for specific caption formats, but efforts to identify the optimal
captions for each model remain limited. In this work, we propose a novel, con-
trollable, and scalab
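
The extracted text keeps the PDF's line layout, so words are split across lines with a trailing hyphen (`cap-` / `tions`). Before passing the text to a language model it may help to undo those breaks. The sketch below is one simple approach using only the standard library; the function name `join_hyphenated_breaks` is hypothetical, and the rule is deliberately naive: it also merges genuinely hyphenated compounds that happen to fall at a line end (for example `PRE-` / `TRAINING`).

```python
import re

def join_hyphenated_breaks(text):
    """Merge words split by a hyphen at a line end, then flatten remaining newlines."""
    # "cap-\ntions" -> "captions"
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Replace each remaining line break (and surrounding whitespace) with one space
    return re.sub(r"\s*\n\s*", " ", text).strip()

sample = "highlight the value of rewritten cap-\ntions for improving performance"
print(join_hyphenated_breaks(sample))
```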

### **Using Llama 3**

In [1]:
import json
import os

# Set the Hugging Face cache directory before importing transformers
os.environ['HF_HOME'] = r'C:\Users\josha\AppData\Local\Temp'

import numpy
import torch
from transformers import (AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig, pipeline)

In [2]:
# HF account config
with open("config.json") as config_file:
    config_data = json.load(config_file)
HF_TOKEN = config_data["HF_TOKEN"]

In [7]:
# Define model name
model_name = "meta-llama/Meta-Llama-3-8B"

In [8]:
# Quantisation config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True, 
    bnb_4bit_use_double_quant=True, 
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16)
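
As a rough sense of what 4-bit loading buys, the sketch below estimates the memory needed for the weights alone. It assumes about 8 billion parameters (taken from the "8B" in the model name) and ignores quantisation constants, any layers left unquantised, activations, and the KV cache, so real usage will be higher. The helper `weight_memory_gib` is hypothetical.

```python
def weight_memory_gib(n_params, bits_per_param):
    """Approximate memory for model weights in GiB."""
    return n_params * bits_per_param / 8 / 1024**3

N_PARAMS = 8e9  # assumption: ~8 billion parameters, per the "8B" in the model name

for label, bits in [("bfloat16", 16), ("4-bit (nf4)", 4)]:
    print(f"{label:>12}: {weight_memory_gib(N_PARAMS, bits):.1f} GiB")
```

Under these assumptions the 4-bit weights take roughly a quarter of the bfloat16 footprint, which is the main reason to quantise on a single consumer GPU.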

In [9]:
# Load the Tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, token=HF_TOKEN)
tokenizer.pad_token = tokenizer.eos_token

In [None]:
# Load the model
model = AutoModelForCausalLM.from_pretrained(
    model_name,                     # Address for Llama3
    device_map="auto",              # Loads model in GPU if available, else CPU
    quantization_config=bnb_config, # Reduces precision of model params to reduce model size
    token=HF_TOKEN                  # Private token to access model
    )
