## import dependencies

`requests`  
The arXiv search pages are not rendered dynamically in the browser (they appear to be server-side rendered), so the plain `requests` module is enough to scrape all the content.

`BeautifulSoup`  
is used to parse the HTML elements easily.

`typing`  
Python is a dynamically typed language.  
To avoid error-prone untyped code, I annotated the code using the `typing` module.

`pandas`  
provides the `DataFrame` data structure,  
which is used to present the parsed data neatly.

`threading`  
Scraping all the content (almost 120,000 results) in one thread is too slow.  
I created 20 threads to scrape the content quickly.

`queue`  
is used to exchange data safely in a multi-threaded environment.  
> [From the Python documentation: queue](https://docs.python.org/3/library/queue.html#module-queue)  
> The queue module implements multi-producer, multi-consumer queues. It is especially useful in threaded programming when information must be exchanged safely between multiple threads. The Queue class in this module implements all the required locking semantics.
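
As a minimal sketch of that pattern (not the scraper itself), the snippet below starts a few worker threads that each put one result into a shared `queue.Queue`. The `worker` function and its squared-number payload are made-up stand-ins for a page fetch.

```python
import queue
import threading


def worker(page: int, results: queue.Queue) -> None:
    # Stand-in for fetching and parsing one page (hypothetical workload).
    results.put((page, page * page))


results: queue.Queue = queue.Queue()
threads = [threading.Thread(target=worker, args=(page, results)) for page in range(5)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()  # wait until every worker has put its result

# Threads finish in arbitrary order, so sort by page number before use.
collected = sorted(results.get() for _ in range(results.qsize()))
print(collected)
```

The queue handles the locking, so the workers never touch a shared list directly.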

In [None]:
import requests
from bs4 import BeautifulSoup
import typing as T
import pandas as pd
import threading as th
import queue

## define constants

In [None]:
TERMS = ["diffusion"] # the terms to search for

YEAR = 1980 # the year to start searching from

SEARCH_SIZE = 200 # the maximum number of results per page (arXiv limit)

URL_FORMAT_HEAD = "https://arxiv.org/search/advanced?advanced="

URL_FORMAT_TAIL = (
    "&classification-physics_archives=all&classification-include_cross_list=include&date-filter_by=specific_year&date-year={year}"
    + f"&date-from_date=&date-to_date=&date-date_type=submitted_date&abstracts=show&size={SEARCH_SIZE}"
    + "&order=announced_date_first&start={index}"
)

N_THREADS = 20
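
`URL_FORMAT_TAIL` mixes two kinds of placeholders: `{SEARCH_SIZE}` sits in an f-string and is filled in immediately, while `{year}` and `{index}` sit in plain strings and stay as literal placeholders until `.format()` is called later. A shortened sketch of the same idea, using a made-up `TEMPLATE` name and a trimmed query string:

```python
SEARCH_SIZE = 200

TEMPLATE = (
    "&date-year={year}"          # plain string: placeholder kept for later
    + f"&size={SEARCH_SIZE}"     # f-string: filled in right now
    + "&start={index}"           # plain string: placeholder kept for later
)
print(TEMPLATE)
# &date-year={year}&size=200&start={index}

print(TEMPLATE.format(year=1980, index=400))
# &date-year=1980&size=200&start=400
```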

## define parsed data structure

In [None]:
class ArxivResult:
    id: int
    title: str
    authors: T.List[str]
    link: str
    tags: T.List[str]
    originally_announced_at: T.Optional[str]  # None when the date line is missing

    def __init__(
        self,
        id: int,
        title: str,
        authors: T.List[str],
        link: str,
        tags: T.List[str],
        originally_announced_at: T.Optional[str],
    ):
    ):
        self.id = id
        self.title = title
        self.authors = authors
        self.link = link
        self.tags = tags
        self.originally_announced_at = originally_announced_at

    def __str__(self):
        return f"ArxivResult(id={self.id}, title={self.title}, authors={self.authors}, link={self.link}, tags={self.tags}, originally_announced_at={self.originally_announced_at})"
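
For comparison, the same record could be written with the standard-library `dataclasses` module, which generates `__init__` and `__repr__` automatically. This is an alternative sketch, not the class used in this notebook: the name `ArxivRecord` is made up to avoid confusion with `ArxivResult`, and the field values (including the link) are placeholders.

```python
import typing as T
from dataclasses import asdict, dataclass


@dataclass
class ArxivRecord:
    id: int
    title: str
    authors: T.List[str]
    link: str
    tags: T.List[str]
    originally_announced_at: T.Optional[str] = None  # may be missing


record = ArxivRecord(
    id=1,
    title="An example title",
    authors=["A. Author", "B. Author"],
    link="https://arxiv.org/abs/0000.00000",  # placeholder, not a real paper
    tags=["cond-mat.stat-mech"],
)
print(record)
# asdict() gives a plain dict, convenient as one row for pandas.DataFrame.
print(asdict(record)["authors"])
```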

## define used functions

### get

This is just a wrapper around `requests.get`.  
But if I later want to scrape with another module (such as Selenium), I only need to change this function.  
**(Changing the dependency will not affect the rest of the code!)**

In [None]:
def get(url: str) -> str:
    # A timeout keeps a stalled connection from blocking its thread forever.
    response = requests.get(url, timeout=30)
    return response.text
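
One way to make that swap explicit is to pass the fetch function as a parameter. The sketch below illustrates the idea under that assumption and is not how this notebook is wired: `Fetcher`, `count_results`, and `fake_get` are hypothetical names, and `fake_get` returns canned HTML so the example runs without network access.

```python
import typing as T

# Any callable that takes a URL and returns the page's HTML as text.
Fetcher = T.Callable[[str], str]


def fake_get(url: str) -> str:
    # Canned response standing in for requests.get(url).text (hypothetical).
    return '<li class="arxiv-result">A</li><li class="arxiv-result">B</li>'


def count_results(url: str, fetch: Fetcher) -> int:
    # The caller depends only on the Fetcher signature, not on `requests`.
    html = fetch(url)
    return html.count('class="arxiv-result"')


print(count_results("https://arxiv.org/search/advanced", fake_get))  # 2
```

The same trick makes the parsing code testable offline, since a fake fetcher can stand in for the real one.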

### get_url, get_arxiv_results

`get_url`  
combines `year`, `start_index` (the index of the first result to fetch), and `terms`, and returns the search URL.

`get_arxiv_results`  
searches arXiv via the URL returned by `get_url` and parses the results.  
`item_index` is the offset used to assign sequential ids to the parsed results.  
If the `result_queue` parameter is not None, the function also puts the results into the queue (for the multi-threading use case).

In [None]:
def get_url(year: int, start_index: int, terms: T.List[str]) -> str:
    from urllib.parse import quote_plus  # local import keeps this cell self-contained

    assert len(terms) >= 1
    url = URL_FORMAT_HEAD
    # The first term uses AND; every additional term is OR-ed onto the query.
    for i, term in enumerate(terms):
        operator = "AND" if i == 0 else "OR"
        # quote_plus() makes terms with spaces or symbols safe to put in a URL.
        url += f"&terms-{i}-operator={operator}&terms-{i}-term={quote_plus(term)}&terms-{i}-field=all"
    url += URL_FORMAT_TAIL.format(year=year, index=start_index)
    return url


def get_arxiv_results(
    year: int,
    start_index: int,
    item_index: int,
    terms: T.List[str],
    result_queue: T.Optional[queue.Queue] = None,
) -> T.List[ArxivResult]:
    results = []

    url = get_url(year, start_index, terms)
    html = get(url)
    bs = BeautifulSoup(html, "html.parser")
    li_results = bs.find_all("li", {"class": "arxiv-result"})
    for i, row in enumerate(li_results):
        id = item_index + i + 1
        title = row.find("p", {"class": "title"}).text.strip()
        authors = [a.text.strip() for a in row.find("p", {"class": "authors"}).find_all("a")]
        link = row.find("p", {"class": "list-title"}).find("a").attrs["href"].strip()
        tags = [tag.text.strip() for tag in row.find("div", {"class": "tags"}).find_all(attrs={"class": "tag"})]

        # The announcement line is optional: find() returns None when it is absent.
        announced = row.find("p", {"class": "is-size-7"})
        originally_announced_at = announced.text.strip() if announced is not None else None

        results.append(ArxivResult(id, title, authors, link, tags, originally_announced_at))  # type: ignore

    if result_queue is not None:
        result_queue.put(results)

    return results

## main function

1. Repeat steps 2-5 until all the content has been parsed.
2. Parse `N_THREADS` pages. Each thread is in charge of one page at a time.
3. Wait for all threads to finish.
4. Combine the results. If the batch returned no results, move on to the next year.
5. Rebuild the result DataFrame and export it as CSV.
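
The batch loop in steps 2-4 can be sketched without any network access. Note that this sketch swaps in `concurrent.futures.ThreadPoolExecutor` for the hand-rolled threads and queues used in this notebook, handles a single year only, and relies on made-up stand-ins: `fetch_page`, `TOTAL_ITEMS`, `PAGE_SIZE`, and `N_WORKERS`.

```python
from concurrent.futures import ThreadPoolExecutor
import typing as T

TOTAL_ITEMS = 1050   # pretend the search has this many results
PAGE_SIZE = 200
N_WORKERS = 4


def fetch_page(start: int) -> T.List[int]:
    # Stand-in for one search request: returns the item numbers on that page.
    return list(range(start, min(start + PAGE_SIZE, TOTAL_ITEMS)))


all_items: T.List[int] = []
start = 0
with ThreadPoolExecutor(max_workers=N_WORKERS) as pool:
    while True:
        starts = [start + i * PAGE_SIZE for i in range(N_WORKERS)]
        # map() returns pages in the order of `starts`, so no sorting is needed.
        batch = [item for page in pool.map(fetch_page, starts) for item in page]
        if not batch:  # an empty batch means there is nothing left to fetch
            break
        all_items.extend(batch)
        start += N_WORKERS * PAGE_SIZE

print(len(all_items))  # 1050
```

Pages past the end simply come back empty, which is why one fully empty batch is a safe signal to stop or move on.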

In [None]:
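if __name__ == "__main__":
    from datetime import date

    search_idx = 0
    item_idx = 0
    arxiv_results = []

    # Stop after the current year; without a bound, empty years would advance YEAR forever.
    while YEAR <= date.today().year: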
        print(f"year: {YEAR}")
        queues = [queue.Queue() for _ in range(N_THREADS)]
        threads = []
        for i in range(N_THREADS):
            thread = th.Thread(target=get_arxiv_results, args=(YEAR, search_idx, item_idx, TERMS, queues[i]))
            thread.start()
            threads.append(thread)
            search_idx += SEARCH_SIZE
            item_idx += SEARCH_SIZE

        prev_size = len(arxiv_results)

        for i in range(N_THREADS):
            threads[i].join()
            arxiv_results.extend(queues[i].get())

        # Continue numbering from the last id actually assigned.
        item_idx = arxiv_results[-1].id if arxiv_results else 0

        if len(arxiv_results) == prev_size:
            YEAR += 1
            search_idx = 0
            continue

        pd.DataFrame(
            [
                [
                    arxiv_result.id,
                    arxiv_result.title,
                    arxiv_result.authors,
                    arxiv_result.link,
                    arxiv_result.tags,
                    arxiv_result.originally_announced_at,
                ]
                for arxiv_result in arxiv_results
            ],
            columns=["id", "title", "authors", "link", "tags", "originally_announced_at"],
        ).to_csv("arxiv_results.csv", index=False)