# Intro

## Description

This notebook contains the pipeline for querying arXiv data and storing it in CSV files.

## API Usage / Strategy

For future developers, important features of the arXiv API:
- **it lacks a "search by published date" feature** (very problematic)
- **in a query, the first returned entry is always index 0** (meaning the paper at index 0 can differ from query to query, also problematic)
- it has the `"start"` & `"max_results"` parameters that let you slice out a subsection of the entire query (starting from index `start`, return `max_results` entries)
- a query exposes a view of at most 30,000 results, but a single request can only retrieve a slice of at most 2,000
- it requires a 3-second wait between queries (otherwise you risk an IP ban)
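
As a rough sketch of how the `start` / `max_results` slicing interacts with the two size limits, the generator below lists the request windows needed to walk a whole view. The names `MAX_VIEW_SIZE`, `MAX_STEP_SIZE`, and `iter_windows` are illustrative assumptions, not part of the arXiv API; the limits are the ones stated in the list above.

```python
from typing import Iterator, Tuple

MAX_VIEW_SIZE = 30_000  # assumed cap on results visible to one query
MAX_STEP_SIZE = 2_000   # assumed cap on results returned by one request


def iter_windows(total: int, step: int = MAX_STEP_SIZE,
                 view: int = MAX_VIEW_SIZE) -> Iterator[Tuple[int, int]]:
    """Yield (start, max_results) pairs covering the first min(total, view) results."""
    limit = min(total, view)
    for start in range(0, limit, step):
        # the last window may be shorter than a full step
        yield start, min(step, limit - start)


windows = list(iter_windows(4_500))
print(windows)  # [(0, 2000), (2000, 2000), (4000, 500)]
```

Each pair maps directly onto the `"start"` and `"max_results"` request parameters; a real loop would also wait between requests.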

I figured out a way to circumvent the lack of a "search by published date" feature:
- using the `"sortOrder": 'ascending'` parameter, we can **fix** the oldest entry as index 0
- using the `"start"` & `"max_results"` parameters, we can find the index at which the year 2020 papers start and then consistently query from that point onwards
- because the queries this data pipeline was built upon return fewer than 30,000 papers, it is unclear how the indexing will change once the 30,000 limit is surpassed
    - this is a likely place to check if future bugs pop up
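
The idea can be sketched without touching the network: given pages of `published` timestamps that are already sorted in ascending order, scan for the first one on or after the target date and convert its in-page position to an absolute index. The function name `first_index_on_or_after` and the sample timestamps are made up for illustration.

```python
from datetime import date
from typing import List, Optional


def first_index_on_or_after(pages: List[List[str]], target: date,
                            page_size: int) -> Optional[int]:
    """Return the absolute index of the first timestamp on or after `target`.

    `pages` holds ISO timestamps (e.g. "2020-01-03T10:00:00Z") in ascending
    order, split into consecutive pages of `page_size` entries (the last page
    may be shorter).
    """
    for page_number, page in enumerate(pages):
        for offset, stamp in enumerate(page):
            published = date.fromisoformat(stamp.split("T")[0])
            if published >= target:
                # absolute index = entries in earlier pages + position in this page
                return page_number * page_size + offset
    return None


pages = [
    ["2019-11-02T08:00:00Z", "2019-12-30T17:45:00Z"],
    ["2020-01-03T10:00:00Z", "2020-02-14T09:30:00Z"],
]
print(first_index_on_or_after(pages, date(2020, 1, 1), page_size=2))  # 2
```

The returned value is what a later request would pass as `"start"`, which is why the in-page offset has to be added to the number of entries already skipped.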

Takeaways
- this entire data pipeline is built on `"sortOrder": 'ascending'`; please do not change it
- (also, API documentation is lacking and debugging this seemingly super tiny issue took waaaay too long)

## Imports

In [66]:
import pandas as pd
import numpy as np
import requests
import feedparser
import time
from datetime import datetime

## Global Variables and Script Setup

In [70]:
import helper
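
The functions in this notebook read several constants from `helper.py`, which is not shown here. The sketch below is a guess at what such a module could look like, so that the cells have something to import. Every value is an assumption: the limits come from the notes in the intro, the column names follow the empty-dataframe output shown under Database Functions (minus the `Unnamed: 0` index column), and the feedparser keys and search term are placeholders to verify against real responses. `META_CSV_COLUMNS` is left out because nothing in the notebook indicates its contents.

```python
# helper.py (hypothetical sketch; every value below is an assumption)

DEFAULT_URL = "http://export.arxiv.org/api/query"  # public arXiv API endpoint
DEFAULT_SEARCH_QUERY = "all:radiation"             # placeholder search term
DEFAULT_DATE_QUERY = "2020-1-1"                    # year-month-day

MAX_VIEW_SIZE = 30_000  # most results one query view exposes
MAX_STEP_SIZE = 2_000   # most results one request returns
MIN_WAIT_TIME = 3       # seconds to wait between requests

# feedparser entry keys to read, in the same order as the CSV columns
ARXIV_KEYS = ["title", "arxiv_journal_ref", "authors", "arxiv_doi",
              "published", "summary", "link", "tags"]
MASTER_CSV_COLUMNS = ["title", "journal", "authors", "doi",
                      "published", "abstract", "url", "tags"]

DEFAULT_PARAMS = {
    "search_query": DEFAULT_SEARCH_QUERY,
    "sortBy": "submittedDate",
    "sortOrder": "ascending",  # the pipeline depends on ascending order
    "start": 0,
    "max_results": 5,
}
```

Keeping `ARXIV_KEYS` and `MASTER_CSV_COLUMNS` the same length and in the same order matters, because `parse_request` builds each row positionally.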

# Code

## Helper Functions

In [74]:
def fetch_request(url: str = helper.DEFAULT_URL, params: dict = helper.DEFAULT_PARAMS) -> feedparser.util.FeedParserDict:
    """
    Performs a fetch request using the arXiv API on the given URL.

    url -> str
        The given url to send the fetch request to

    params -> dict
        Parameters of the fetch request

    Returns -> feedparser.util.FeedParserDict
        A feedparser.util.FeedParserDict object that contains the parsed feed data (the arXiv API responds with an Atom XML feed, not JSON)
    
    Example
        fetch_request(helper.DEFAULT_URL, helper.DEFAULT_PARAMS)
        See `helper.py` file for the formats of the default values
    """
    response = requests.get(url, params=params)
    if response.status_code == 200:
        feed = feedparser.parse(response.content)
        print(f"Fetched {len(feed.entries)} entries.")
        return feed
    else:
        raise ConnectionError(response.status_code)
    
def parse_request(feed: feedparser.util.FeedParserDict, verbose:int = 0) -> pd.core.frame.DataFrame:
    """
    Converts the given parsed feed into a legible dataframe (useful for .csv storage).

    feed -> feedparser.util.FeedParserDict
        The given parsed feed object from feedparser

    verbose -> int
        0: suppresses all missing-data-points reporting
        1: at the end of script, summarize the total number of missing data points.
        2: for every queried paper, error report on the missing data features; at the end of script, also summarize the total number of missing data points.
        
    Returns -> pd.core.frame.DataFrame
        The parsed feed converted to a dataframe

    Example
        parse_request(feed, 0)
    """
    all_papers = []
    num_missing_keys = 0
    for paper in feed.entries:
        paper_data = []
        for key in helper.ARXIV_KEYS:
            try:
                paper_data.append(paper[key])
            except KeyError:  # this entry lacks the field
                paper_data.append(np.nan)
                if verbose == 2:
                    print(f"{paper.get('id', '<unknown id>')} does not have {key} key")
                num_missing_keys += 1
        all_papers.append(paper_data)
    if verbose in (1, 2):
        print(f"{num_missing_keys} missing keys.")
    df = pd.DataFrame(data=all_papers, columns=helper.MASTER_CSV_COLUMNS)
    return df

AttributeError: module 'helper' has no attribute 'DEFAULT_URL'

In [None]:
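def crawl_publish_index(feed: feedparser.util.FeedParserDict, tar_date: str = helper.DEFAULT_DATE_QUERY) -> int:
    """
    Finds the index (within the given feed) of the first paper published on or after the target date.

    feed -> feedparser.util.FeedParserDict
        The parsed feed to search; entries are assumed to be sorted by ascending date

    tar_date -> str
        The target date in year-month-day format, e.g. 2010-2-28

    Returns -> int or None
        The index of the first matching entry in `feed.entries`, or None if there is none
    """
    target_date_split = tar_date.split("-")
    if len(target_date_split) != 3:
        raise ValueError("ensure that your target date format is correct: year-month-day. eg. 2010-2-28")
    search_target_date = datetime(int(target_date_split[0]), int(target_date_split[1]), int(target_date_split[2]))

    for paper_index, paper in enumerate(feed.entries):
        # "published" is a timestamp string; keep only the date part before the "T"
        raw_date_split = paper["published"].split("T")[0].split("-")
        paper_published_date = datetime(int(raw_date_split[0]), int(raw_date_split[1]), int(raw_date_split[2]))
        if paper_published_date >= search_target_date:
            print(f"Publish {paper_published_date} >= Search {search_target_date}")
            print(f"Found index for target date: {paper_index}!")
            return paper_index
    print("Did not find index for target date!")
    return None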

def helper_extract_start_index(query: str = helper.DEFAULT_SEARCH_QUERY, tar_date: str = helper.DEFAULT_DATE_QUERY) -> int:
    """
    Pages through the query view in ascending date order until the target date is found.

    Returns -> int or None
        The index, within the whole query view, of the first paper published on or after
        `tar_date`, or None if no such paper is found
    """
    # step through the view (up to 30,000 records) one slice at a time
    for i in range(0, helper.MAX_VIEW_SIZE, helper.MAX_STEP_SIZE):
        params = {
            "search_query": query,
            "sortBy": 'submittedDate',
            "sortOrder": 'ascending',
            "start": i,
            "max_results": helper.MAX_STEP_SIZE
        }
        feed = fetch_request(helper.DEFAULT_URL, params)
        index = crawl_publish_index(feed, tar_date)
        if index is not None:
            return i + index # crawl_publish_index returns an index relative to this slice
        time.sleep(helper.MIN_WAIT_TIME) # must be at least 3 seconds or else arXiv will IP ban
    print("Did not find index for target date!")
    return None

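def extract_start_index(query: str = helper.DEFAULT_SEARCH_QUERY, tar_date: str = helper.DEFAULT_DATE_QUERY) -> int:
    """
    Wrapper around `helper_extract_start_index`.

    Returns -> int or None
        The start index for the target date, or None if it was not found
    """
    return helper_extract_start_index(query, tar_date)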
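
The split-and-cast date handling in `crawl_publish_index` could also be written with `datetime.strptime`, which parses and validates in one step. This is only a sketch of that alternative: `parse_target_date` and `parse_published` are hypothetical names, and the timestamp is assumed to carry its date before a `T` separator, as the existing code already assumes.

```python
from datetime import datetime


def parse_target_date(tar_date: str) -> datetime:
    """Parse a year-month-day string such as "2010-2-28"; raises ValueError if invalid."""
    return datetime.strptime(tar_date, "%Y-%m-%d")


def parse_published(published: str) -> datetime:
    """Parse the date part of a timestamp such as "2020-01-03T10:00:00Z"."""
    return datetime.strptime(published.split("T")[0], "%Y-%m-%d")


print(parse_target_date("2010-2-28"))           # 2010-02-28 00:00:00
print(parse_published("2020-01-03T10:00:00Z"))  # 2020-01-03 00:00:00

try:
    parse_target_date("2010-2-31")  # February has no 31st day
except ValueError as err:
    print(f"rejected: {err}")
```

A malformed string such as `"2010/02/28"` is rejected with the same `ValueError`, so the explicit length check on the split parts would no longer be needed.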

## Database Functions

In [76]:
pd.DataFrame(columns=helper.MASTER_CSV_COLUMNS)

Unnamed: 0,title,journal,authors,doi,published,abstract,url,tags


In [83]:
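def reset_meta_papers_db():
    """
    Resets/overwrites the "arxiv_meta_data.csv" database with an empty table (column headers only).
    USE WITH CAUTION.
    """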
    # Warning menu
    while True:
        user_input = input("Are you sure you want to run this function? This wipes the ENTIRE existing arxiv_meta_data.csv database! (y/n)")
        if user_input == "y":
            print("Proceeding to reset database.")
            break
        elif user_input == "n":
            print("Function canceled.")
            return None
        else:
            print("Wrong input. Please type 'y' or 'n'.")

    # Saving data to csv
    df = pd.DataFrame(columns=helper.META_CSV_COLUMNS)
    try:
        df.to_csv("../data/arxiv_meta_data.csv", index=False)
        print("Saved!")
    except OSError as err:  # e.g. missing directory or no write permission
        print(f"Failed to save... ({err})")

    # Return
    return None
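
The y/n warning menu appears in both reset functions, so it could be factored into one small helper. The sketch below is one possible shape for it; `confirm` and its `ask` parameter are hypothetical names, and `ask` exists only so the prompt can be exercised without a keyboard.

```python
from typing import Callable


def confirm(prompt: str, ask: Callable[[str], str] = input) -> bool:
    """Keep asking until the answer is 'y' or 'n'; return True for 'y'."""
    while True:
        answer = ask(prompt).strip().lower()
        if answer == "y":
            return True
        if answer == "n":
            return False
        print("Wrong input. Please type 'y' or 'n'.")


# Scripted answers stand in for keyboard input in this example
answers = iter(["maybe", "Y"])
result = confirm("Wipe the database? (y/n) ", ask=lambda _: next(answers))
print(result)  # the first answer is rejected, the second is accepted
```

A reset function would then start with something like `if not confirm(...): return None`.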

In [None]:
def reset_papers_db(query: str = helper.DEFAULT_SEARCH_QUERY, verbose:int = 0) -> None:
    """
    Resets/overwrites the "arxiv.csv" database with the 5 earliest-submitted papers for the given query (the request sorts by ascending submission date).
    USE WITH CAUTION.

    query -> str
        The given query/search term

    verbose -> int
        0: suppresses all missing-data-points reporting
        1: at the end of script, summarize the total number of missing data points.
        2: for every queried paper, error report on the missing data features; at the end of script, also summarize the total number of missing data points.
        
    Returns
        None

    Example
        reset_papers_db("radiation", 0)
    """
    # Warning menu
    while True:
        user_input = input("Are you sure you want to run this function? This wipes the ENTIRE existing arxiv.csv database! (y/n)")
        if user_input == "y":
            print("Proceeding to reset database.")
            break
        elif user_input == "n":
            print("Function canceled.")
            return None
        else:
            print("Wrong input. Please type 'y' or 'n'.")

    # Error control
    valid_verbose = {0, 1, 2}
    if verbose not in valid_verbose:
        raise ValueError(f"verbose input must be one of the following: {valid_verbose}")

    # Fetch request
    params = {
        "search_query": query,
        "sortBy": 'submittedDate', # do not change this
        "sortOrder": 'ascending', # do not change this
        "start": 0,
        "max_results": 5
    }
    feed = fetch_request(helper.DEFAULT_URL, params)

    # Parse request
    df = parse_request(feed, verbose)

    # Saving data to csv
    try:
        df.to_csv("../data/arxiv.csv", index=False)
        print("Saved!")
    except OSError as err:  # e.g. missing directory or no write permission
        print(f"Failed to save... ({err})")

    # Return
    return None

In [84]:
reset_meta_papers_db()

Proceeding to reset database.


AttributeError: module 'helper' has no attribute 'META_CSV_COLUMNS'

## Dev

In [16]:
def helper_fetch_papers(query: str = DEFAULT_SEARCH_QUERY, verbose:int = 0) -> None:
    """
    Resets/overwrites the "arxiv.csv" database with the 5 earliest-submitted papers for the given query (the request sorts by ascending submission date).

    query -> str
        The given query/search term

    verbose -> int
        0: suppresses all missing-data-points reporting
        1: at the end of script, summarize the total number of missing data points.
        2: for every queried paper, error report on the missing data features; at the end of script, also summarize the total number of missing data points.
        
    Returns
        None

    Example
        helper_fetch_papers("radiation", 0)
    """
    # Error control
    valid_verbose = {0, 1, 2}
    if verbose not in valid_verbose:
        raise ValueError(f"verbose input must be one of the following: {valid_verbose}")

    # Set request params
    params = {
        "search_query": query,
        "sortBy": 'submittedDate', # do not change this
        "sortOrder": 'ascending', # do not change this
        "start": 0,
        "max_results": 5
    }

    # Fetching request
    response = requests.get(url, params=params)
    if response.status_code == 200:
        feed = feedparser.parse(response.content)
        print(f"Fetched {len(feed.entries)} entries.")
    else:
        print(f"Error: {response.status_code}")
        return None

    # Parsing request
    all_papers = []
    num_missing_keys = 0
    for paper in feed.entries:
        paper_data = []
        for key in ARXIV_KEYS:
            try:
                paper_data.append(paper[key])
            except:
                paper_data.append(np.nan)
                if verbose == 2:
                    print(f"{paper['id']} does not have {key} key")
                num_missing_keys += 1
        all_papers.append(paper_data)
    if verbose == 1 or verbose == 2:
        print(f"{num_missing_keys} missing keys.")

    # Saving data to csv
    df = pd.DataFrame(data=all_papers, columns=MASTER_CSV_COLUMNS)
    try:
        df.to_csv("../data/arxiv.csv", index=False)
        print("Saved!")
    except:
        print("Failed to save...")

    # Return
    return None

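def fetch_more_papers(query: str = DEFAULT_SEARCH_QUERY, verbose:int = 0, n:int = 50) -> None:
    """
    Fetches n papers for the given query and appends them to the "arxiv.csv" database.
    The intent is to fetch the next n papers published after the last entry in the csv.

    query -> str
        The given query/search term

    verbose -> int
        0: only summarize the total number of missing data points
        1: also report the missing data features for every queried paper

    n -> int
        The number of papers to request

    Returns
        None
    """
    # Extract the published date of the last row in the dataset
    df = pd.read_csv("../data/arxiv.csv")
    start_date = df["published"].iloc[-1]

    # Set request params
    params = {
        "search_query": query,
        "start_date": start_date,  # NOTE: not a documented arXiv API parameter, so it likely has no effect
        "sortBy": 'submittedDate',  # relevance, lastUpdatedDate, submittedDate
        "max_results": n,
        "sortOrder": 'descending'
    }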

    # Fetching request
    response = requests.get(url, params=params)
    if response.status_code == 200:
        feed = feedparser.parse(response.content)
        print(f"Fetched {len(feed.entries)} entries.")
    else:
        print(f"Error: {response.status_code}")
        return None

    # Parsing request
    all_papers = []
    num_missing_keys = 0
    for paper in feed.entries:
        paper_data = []
        for key in ARXIV_KEYS:
            try:
                paper_data.append(paper[key])
            except:
                paper_data.append(np.nan)
                if verbose == 1:
                    print(f"{paper['id']} does not have {key} key")
                num_missing_keys += 1
        all_papers.append(paper_data)
    print(f"{num_missing_keys} missing keys.")

    # Saving data to csv
    df = pd.DataFrame(data=all_papers, columns=MASTER_CSV_COLUMNS)
    try:
        df.to_csv("../data/arxiv.csv", mode="a", index=False, header=False)
        print("Saved!")
    except OSError as err:  # e.g. missing directory or no write permission
        print(f"Failed to save... ({err})")

    # Return
    return None

In [17]:
fetch_more_papers()

Fetched 50 entries.
93 missing keys.
Failed to save...


In [5]:
def fetch_papers(query: str = DEFAULT_SEARCH_QUERY, verbose:int = 0, n:int = 50) -> None:
    """
    Wrapper for `fetch_more_papers` that requests n papers in batches of 10.
    """
    if n <= 0:
        print("n cannot be negative or 0.")
        return None
    else:
        mult_of_10 = n // 10
        leftover_of_10 = n % 10  # remainder after the full batches of 10

        for _ in range(mult_of_10):
            fetch_more_papers(query, verbose, 10)
            time.sleep(3)  # arXiv requires a 3 second wait between queries
        if leftover_of_10 > 0:
            fetch_more_papers(query, verbose, leftover_of_10)
        print("Finished loops!")
        return None
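
The batching arithmetic in a wrapper like this comes down to a quotient and a remainder, which `divmod` gives directly. A minimal sketch, assuming a batch size of 10 as the variable names suggest; `split_into_batches` is a hypothetical helper, not part of the pipeline.

```python
from typing import List


def split_into_batches(n: int, batch_size: int = 10) -> List[int]:
    """Split n requested papers into full batches plus one smaller leftover batch."""
    if n <= 0 or batch_size <= 0:
        raise ValueError("n and batch_size must be positive")
    full, leftover = divmod(n, batch_size)
    return [batch_size] * full + ([leftover] if leftover else [])


print(split_into_batches(50))  # [10, 10, 10, 10, 10]
print(split_into_batches(23))  # [10, 10, 3]
print(split_into_batches(7))   # [7]
```

The batch sizes always sum to `n`, which is a cheap property to assert before any request is sent.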

In [6]:
# WARNING, RUNNING THIS FUNCTION WILL RESET THE DATABASE
fetch_initial_papers(DEFAULT_SEARCH_QUERY, 0)

Fetched 10 entries.
16 missing keys.
Saved!


In [7]:
fetch_more_papers(DEFAULT_SEARCH_QUERY, 0, 1000)

Fetched 1000 entries.
1698 missing keys.
Saved!
