# Intro

## Description

This notebook contains the pipeline for querying and storing arxiv data into csv files.

## API Usage / Strategy

For future developers, important features about the arXiv API:
- **it lacks a "search by published date" feature** (very problematic)
- **in a query, the first return entry is always index 0** (meaning index 0 is different every query, also problematic)
- it has the `"start"` & `"max_results"` paramaters that allow you to slice out a subsection of the entire query (starting from index `start`, return `max_result` entries)
- it returns a max view of 30,000 but only allows you to retrieve a slice of 2000
- it requires a 3 second wait between each query (or IP banned)

I figured out a way to circumvent the lack of a "search by published date" feature below:
- using the `"sortOrder": 'ascending'` parameter, we can **fix** the oldest entry as index 0
- using `"start"` & `"max_results"` parameter, we can find out at which index the year 2020 papers started and then always consistently query from that point onwards
- because there aren't 30,000 published papers in the queries this data pipeline was built upon, it is unclear how the indexing will change when the 30,000 limit has been surpassed
    - this is a likely place to check if future bugs pop up

Takeaways
- this entire data pipeline is built on the `"sortOrder": 'ascending'`; please do not mess with this
- (also, API documentation is lacking and debugging this seemingly super tiny issue took waaaay too long)

## Imports

In [1]:
import pandas as pd
import numpy as np
import requests
import feedparser
import time
from datetime import datetime

## Global Variables and Script Setup

In [2]:
import helper

# Code

## Helper Functions

ConnectionError: ('Connection aborted.', ConnectionResetError(10054, 'An existing connection was forcibly closed by the remote host', None, 10054, None))

In [3]:
def fetch_request(url: str=helper.DEFAULT_URL, query: str=helper.DEFAULT_SEARCH_QUERY, start: int=0, max_results: int=10) -> feedparser.util.FeedParserDict:
    """
    Performs a fetch request using the arXiv API on the given URL.

    url -> str
        The given url to send the fetch request to

    query -> str
        The given search query for arxiv to find papers on

    start -> int
        The index of the papers at which to start pulling data on

    max_results -> int
        The total number of papers after `start` to pull from; cannot exceed 2000

    Returns -> feedparser.util.FeedParserDict
        A feedparser.util.FeedParserDict object that contains the JSON parsed data
    
    Example
        fetch_request(helper.DEFAULT_URL, "radiation", 10, 50)
        Query "radiation" returns 30,000 (default API behavior). We take the 11th article on the list (0th index) up to 49th article on the list, returning 50 total articles.
    """
    params = {
        "search_query": query,
        "sortBy": 'submittedDate',
        "sortOrder": 'ascending',
        "start": start,
        "max_results": max_results
    }
    response = requests.get(url, params=params)
    if response.status_code == 200:
        feed = feedparser.parse(response.content)
        print(f"Fetched {len(feed.entries)} entries.")
        return feed
    else:
        raise ConnectionError(response.status_code)

In [4]:
def parse_request(feed: feedparser.util.FeedParserDict, query: str, verbose: int=0) -> pd.core.frame.DataFrame:
    """
    Converts the given JSON feed file into a legible dataframe (useful for .csv storage).

    feed -> feedparser.util.FeedParserDict
        The given JSON object from feedparser

    query -> str
        The query term that was used to generate the feed. This is not enforced to be correct, so users need to manually double-check that this field is correct.
        Used in .csv saving.

    verbose -> int
        0: suppresses all missing-data-points reporting
        1: at the end of script, summarize the total number of missing data points.
        2: for every queried paper, error report on the missing data features; at the end of script, also summarize the total number of missing data points.
        
    Returns -> pd.core.frame.DataFrame
        The JSON object converted to a dataframe

    Example
        parse_request(feed, "radiation", 0)
    """
    all_papers = []
    num_missing_keys = 0
    for paper in feed.entries:
        paper_data = [query, datetime.now()]
        for key in helper.ARXIV_KEYS:
            try:
                if key == "summary":
                    paper_data.append(paper[key].replace("\n", " "))
                elif key == "authors":
                    paper_data.append([item["name"] for item in paper[key]])
                else:
                    paper_data.append(paper[key])
            except:
                paper_data.append(np.nan)
                if verbose == 2:
                    print(f"{paper['id']} does not have {key} key")
                num_missing_keys += 1
        all_papers.append(paper_data)
    if verbose == 1 or verbose == 2:
        print(f"{num_missing_keys} missing keys.")
    df = pd.DataFrame(data=all_papers, columns=helper.MASTER_CSV_COLUMNS)
    print("Parsed!")
    return df

In [5]:
def crawl_request_for_published(feed: feedparser.util.FeedParserDict, tar_date: str=helper.DEFAULT_DATE_QUERY) -> int:
    """
    Finds the first index position of a paper whose published date is either on or after the given target date (useful for selective querying).

    feed -> feedparser.util.FeedParserDict
        The given JSON object from feedparser

    tar_date -> str
        The given target date
        
    Returns -> int, None
        If found index, returns int
        Else, return None

    Example
        crawl_request_for_published(feed, "2010-2-31")
    """
    target_date_split = tar_date.split("-")
    if len(target_date_split) != 3:
        raise ValueError(f"ensure that your target date format is correct: year-month-day. eg. 2010-2-31")
    else:
        search_target_date = datetime(int(target_date_split[0]), int(target_date_split[1]), int(target_date_split[2]))

    for paper_index in range(len(feed.entries)):
        raw_date_split = feed.entries[paper_index]["published"].split("T")[0].split("-")
        paper_published_date = datetime(int(raw_date_split[0]), int(raw_date_split[1]), int(raw_date_split[2]))
        if paper_published_date >= search_target_date:
            print(f"Found index for target date: {paper_index}!")
            return paper_index
    print("Did not find index for target date!")
    return None

In [6]:
def helper_parse_and_crawl(query: str=helper.DEFAULT_SEARCH_QUERY, tar_date: str=helper.DEFAULT_DATE_QUERY) -> tuple:
    """
    Helper function that automatically saves all queried data to database and return the starting index.

    query -> str
        The given search query

    tar_date -> str
        The given target date
        
    Returns -> int, None
        If found index, returns int
        Else, return None

    Example
        helper_parse_and_crawl("radiation", "2010-2-31")
    """
    # every value in the up to 30,000 records that comes with the view
    final_index = 0
    for i in range(0, helper.MAX_VIEW_SIZE, helper.MAX_STEP_SIZE):
        # parsing all data
        print("--")
        print(f"Checking index {i}-{i+helper.MAX_STEP_SIZE}")
        print("Fetching request...")
        feed = fetch_request(helper.DEFAULT_URL, query, i, helper.MAX_STEP_SIZE)
        print("Parsing request...")
        df = parse_request(feed, query, 0)
        print("Crawling request for target date...")
        index = crawl_request_for_published(feed, tar_date)

        # Saving data to csv
        print(f"Saving {i}-{i+helper.MAX_STEP_SIZE}...")
        try:
            df.to_csv("../data/arxiv.csv", mode='a', header=False, index=False)
            print("Successfully saved!")
        except:
            print("Failed to save.")

        # Returning index if found
        if type(index) == int:
            print(f"Found target date! Sub-index: {index}, total: {final_index + index}")
            entry = df.iloc[index, :]
            final_index += index
            return (final_index, entry)
        else:
            final_index += helper.MAX_STEP_SIZE
        
        # Query wait
        time.sleep(helper.MIN_WAIT_TIME) # wait or else arXiv will IP ban
    print(f"Did not find index for target date in entire window.")
    return (np.nan, np.nan)

In [7]:
def get_papers_and_start_index(query: str=helper.DEFAULT_SEARCH_QUERY, tar_date: str=helper.DEFAULT_DATE_QUERY) -> None:
    """
    Fetches papers of the query up to the tar_date; saves data in the meta-data database.

    query -> str
        The given search query 

    tar_date -> str
        The given target date
        
    Returns
        None

    Example
        get_papers_and_start_index("radiation", "2010-2-31")
    """
    final_index, entry = helper_parse_and_crawl(query, tar_date)
    try:
        data_constr = [query,
                       entry["url"],
                       entry["title"],
                       entry["published"],
                       final_index,
                       tar_date,
                       datetime.now()
                       ]
    except:
        data_constr = [query,
                       np.nan,
                       np.nan,
                       np.nan,
                       np.nan,
                       tar_date,
                       datetime.now()
                       ]
    
    # Saving 
    df = pd.DataFrame(data=np.array(data_constr).reshape(1, 7), columns=helper.META_CSV_COLUMNS)
    try:
        df.to_csv("../data/arxiv_meta_data.csv", mode='a', index=False, header=False)
        print("Saved!")
    except:
        print("Failed to save...")
    return None

## Database Functions

In [8]:
def reset_meta_papers_db():
    """
    Overwrites the existing metadata database.
    USE WITH CAUTION.

    Returns
        None

    Example
        reset_meta_papers_db()
    """
    # Warning menu
    while True:
        user_input = input("Are you sure you want to run this function? This wipes the ENTIRE existing arxiv_meta_data.csv database! (y/n)")
        if user_input == "y":
            print("Proceeding to reset database.")
            break
        elif user_input == "n":
            print("Function canceled.")
            return None
        else:
            print("Wrong input. Please type 'y' or 'n'.")

    # Saving data to csv
    df = pd.DataFrame(columns=helper.META_CSV_COLUMNS)
    try:
        df.to_csv("../data/arxiv_meta_data.csv", index=False)
        print("Saved!")
    except:
        print("Failed to save...")

    # Return
    return None

In [9]:
def reset_papers_db() -> None:
    """
    Resets/overwrites the "arxiv.csv" database.
    USE WITH CAUTION.

    Returns
        None

    Example
        reset_papers_db()
    """
    # Warning menu
    while True:
        user_input = input("Are you sure you want to run this function? This wipes the ENTIRE existing arxiv.csv database! (y/n)")
        if user_input == "y":
            print("Proceeding to reset database.")
            break
        elif user_input == "n":
            print("Function canceled.")
            return None
        else:
            print("Wrong input. Please type 'y' or 'n'.")

    # Saving data to csv
    df = pd.DataFrame(columns=helper.MASTER_CSV_COLUMNS)
    try:
        df.to_csv("../data/arxiv.csv", index=False)
        print("Saved!")
    except:
        print("Failed to save...")

    # Return
    return None

## Script

In [10]:
# RESET ARXIV DATABASES; CAREFUL WHEN RUNNING
# reset_meta_papers_db#()
# reset_papers_db#()

In [11]:
get_papers_and_start_index("radiation", helper.DEFAULT_DATE_QUERY)

Proceeding to reset database.
Saved!
Proceeding to reset database.
Saved!
Checking index 0-2000
Fetching request...
Fetched 2000 entries.
Parsing request...
Parsed!
Crawling request for target date...
Did not find index for target date!
Saving 0-2000...
Successfully saved!
Checking index 2000-4000
Fetching request...
Fetched 2000 entries.
Parsing request...
Parsed!
Crawling request for target date...
Did not find index for target date!
Saving 2000-4000...
Successfully saved!
Checking index 4000-6000
Fetching request...
Fetched 2000 entries.
Parsing request...
Parsed!
Crawling request for target date...
Did not find index for target date!
Saving 4000-6000...
Successfully saved!
Checking index 6000-8000
Fetching request...
Fetched 2000 entries.
Parsing request...
Parsed!
Crawling request for target date...
Did not find index for target date!
Saving 6000-8000...
Successfully saved!
Checking index 8000-10000
Fetching request...
Fetched 2000 entries.
Parsing request...
Parsed!
Crawling requ

In [None]:
15758

## UNDER Dev

In [12]:
def helper_fetch_papers(query: str=helper.DEFAULT_SEARCH_QUERY, verbose:int=0) -> None:
    """
    Resets/overwrites the "arxiv.csv" database with the 5 most recently published papers about the given query.

    query -> str
        The given query/search term

    verbose -> int
        0: suppresses all missing-data-points reporting
        1: at the end of script, summarize the total number of missing data points.
        2: for every queried paper, error report on the missing data features; at the end of script, also summarize the total number of missing data points.
        
    Returns
        None

    Example
        fetch_initial_papers("radiation", 0)
    """
    # Error control
    valid_verbose = {0, 1, 2}
    if verbose not in valid_verbose:
        raise ValueError(f"verbose input must be one of the following: {valid_verbose}")

    # Set request params
    params = {
        "search_query": query,
        "sortBy": 'submittedDate', # do not change this
        "sortOrder": 'ascending', # do not change this
        "start": 0,
        "max_results": 5
    }

    # Fetching request
    response = requests.get(url, params=params)
    if response.status_code == 200:
        feed = feedparser.parse(response.content)
        print(f"Fetched {len(feed.entries)} entries.")
    else:
        print(f"Error: {response.status_code}")
        return None

    # Parsing request
    all_papers = []
    num_missing_keys = 0
    for paper in feed.entries:
        paper_data = []
        for key in ARXIV_KEYS:
            try:
                paper_data.append(paper[key])
            except:
                paper_data.append(np.nan)
                if verbose == 2:
                    print(f"{paper['id']} does not have {key} key")
                num_missing_keys += 1
        all_papers.append(paper_data)
    if verbose == 1 or verbose == 2:
        print(f"{num_missing_keys} missing keys.")

    # Saving data to csv
    df = pd.DataFrame(data=all_papers, columns=helper.MASTER_CSV_COLUMNS)
    try:
        df.to_csv("../data/arxiv.csv", index=False)
        print("Saved!")
    except:
        print("Failed to save...")

    # Return
    return None

def fetch_more_papers(query: str=helper.DEFAULT_SEARCH_QUERY, verbose: int=0, n: int=50) -> None:
    """
    TODO: make this docstring better lol
    fetches the next n papers published after the oldest entry in the csv
    """
    # Extract oldest published date in dataset 
    df = pd.read_csv("../data/arxiv.csv")
    start_date = df["published"].iloc[-1:].values[0]

    # Set request params
    params = {
        "search_query": query,
        "start_date": start_date,
        "sortBy": 'submittedDate',  # relevance, lastUpdatedDate, submittedDate
        "max_results": n,
        "sortOrder": 'descending'
    }

    # Fetching request
    response = requests.get(url, params=params)
    if response.status_code == 200:
        feed = feedparser.parse(response.content)
        print(f"Fetched {len(feed.entries)} entries.")
    else:
        print(f"Error: {response.status_code}")
        return None

    # Parsing request
    all_papers = []
    num_missing_keys = 0
    for paper in feed.entries:
        paper_data = []
        for key in ARXIV_KEYS:
            try:
                paper_data.append(paper[key])
            except:
                paper_data.append(np.nan)
                if verbose == 1:
                    print(f"{paper['id']} does not have {key} key")
                    num_missing_keys += 1
                else:
                    num_missing_keys += 1
        all_papers.append(paper_data)
    print(f"{num_missing_keys} missing keys.")

    # Saving data to csv
    df = pd.DataFrame(data=all_papers, columns=MASTER_CSV_COLUMNS)
    try:
        df.to_csv("../data/arxiv.csv", mode="a", index=False, header=False)
        print("Saved!")
    except:
        print("Failed to save...")

    # Return
    return None

In [13]:
fetch_more_papers()

NameError: name 'url' is not defined

In [None]:
def fetch_papers(query: str = DEFAULT_SEARCH_QUERY, verbose:int = 0, n:int = 50) -> None:
    """ 
    TODO docstring

    wrapper for fetch more papers
    """
    if n <= 0:
        print("n cannot be negative or 0.")
        return None
    else:
        mult_of_10 = n // 10
        leftoever_of_10 = n - mult_of_10

        if mult_of_10 > 0:
            for _ in range(mult_of_10):
                fetch_more_papers(query, verbose)
        fetch_more_papers(query, verbose, leftoever_of_10)
        print("Finished loops!")
        return None

In [None]:
# WARNING, RUNNING THIS FUNCTION WILL RESET THE DATABASE
fetch_initial_papers(DEFAULT_SEARCH_QUERY, 0)

Fetched 10 entries.
16 missing keys.
Saved!


In [None]:
fetch_more_papers(DEFAULT_SEARCH_QUERY, 0, 1000)

Fetched 1000 entries.
1698 missing keys.
Saved!
