# Dev: Paper IDs collector


### References

- [arXiv.org](https://arxiv.org/)
- [arxiv.py on PyPI (Python wrapper for the arXiv API)](https://pypi.org/project/arxiv/)
- [arxiv.py documentation](http://lukasschwab.me/arxiv.py/index.html)

In [59]:
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup
from datetime import datetime, date
import warnings
warnings.filterwarnings('ignore')

## arguments

In [34]:
required_categories = ['math.ST', 'stat.ME', 'stat.AP', 'stat.CO', 'cs.LG', 'stat.ML', 'cs.AI', 'math.PR']
folder_output = 'datasets'
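
As a rough illustration of how `required_categories` is meant to be used: a paper is kept when at least one of its categories appears in the list. The sketch below is standalone. The helper name `has_required_category` is hypothetical, the first sample list is taken from the results shown at the end of this notebook, and the second one is made up to show a rejected paper.

```python
required_categories = ['math.ST', 'stat.ME', 'stat.AP', 'stat.CO', 'cs.LG', 'stat.ML', 'cs.AI', 'math.PR']

def has_required_category(cat_paper: list, cat_required: list) -> bool:
    """Return True when the two lists share at least one category."""
    return not set(cat_paper).isdisjoint(cat_required)

# sample taken from the notebook output: shares 'stat.ML' with the required list
kept = has_required_category(['math.OC', 'stat.ML'], required_categories)
# made-up sample with no required category
dropped = has_required_category(['hep-th', 'gr-qc'], required_categories)
print(kept, dropped)
```

Cross-listed papers are kept as soon as one of their categories matches, which is why rows such as `[math.OC, stat.ML]` appear in the results.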

In [69]:
# url: Computer Science (cs)
url_cs = 'https://arxiv.org/search/advanced?advanced=&terms-0-operator=AND&terms-0-term=&terms-0-field=title&classification-computer_science=y&classification-physics_archives=all&classification-include_cross_list=include&date-filter_by=all_dates&date-year=&date-from_date=&date-to_date=&date-date_type=submitted_date&abstracts=show&size=200&order=-announced_date_first&start=0'
# url: Mathematics (math)
url_math = 'https://arxiv.org/search/advanced?advanced=&terms-0-operator=AND&terms-0-term=&terms-0-field=title&classification-mathematics=y&classification-physics_archives=all&classification-include_cross_list=include&date-filter_by=all_dates&date-year=&date-from_date=&date-to_date=&date-date_type=submitted_date&abstracts=show&size=200&order=-announced_date_first&start=0'
# url: Statistics (stat)
url_stat = 'https://arxiv.org/search/advanced?advanced=&terms-0-operator=AND&terms-0-term=&terms-0-field=title&classification-physics_archives=all&classification-statistics=y&classification-include_cross_list=include&date-filter_by=all_dates&date-year=&date-from_date=&date-to_date=&date-date_type=submitted_date&abstracts=show&size=200&order=-announced_date_first&start=0'
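
The three URLs above differ only in the `classification-...=y` parameter. A minimal sketch of building them from their parts, assuming the query parameters are exactly the ones visible in the hard-coded strings; the names `BASE_URL` and `build_search_url` are hypothetical and not part of the notebook.

```python
from urllib.parse import urlencode

BASE_URL = 'https://arxiv.org/search/advanced'

def build_search_url(classification: str, size: int = 200, start: int = 0) -> str:
    """Build an advanced-search url for one classification, e.g. 'statistics'."""
    params = [
        ('advanced', ''),
        ('terms-0-operator', 'AND'),
        ('terms-0-term', ''),
        ('terms-0-field', 'title'),
        (f'classification-{classification}', 'y'),
        ('classification-physics_archives', 'all'),
        ('classification-include_cross_list', 'include'),
        ('date-filter_by', 'all_dates'),
        ('date-year', ''),
        ('date-from_date', ''),
        ('date-to_date', ''),
        ('date-date_type', 'submitted_date'),
        ('abstracts', 'show'),
        ('size', size),
        ('order', '-announced_date_first'),
        ('start', start),
    ]
    # urlencode keeps the parameter order and writes empty values as "key="
    return BASE_URL + '?' + urlencode(params)

print(build_search_url('computer_science'))
```

With these assumptions, `build_search_url('computer_science')` and `build_search_url('mathematics')` reproduce `url_cs` and `url_math`. For `'statistics'` the parameters are the same as in `url_stat`, only in a slightly different order.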

## functions

In [70]:
# check if two lists of categories have any common element
def is_category(cat_paper:list, cat_required: list):
    cat_inter = list(set(cat_paper) & set(cat_required))
    return len(cat_inter) > 0


## parse papers information in a page of the advanced query search
def parser_page(url:str, verbose:bool = False)->pd.DataFrame:
    # initialize
    col_x = ['paper_id', 'categories', 'submission_date', 'title', 'authors', 'abstract']

    # download html content
    try:
        # get request
        reqs = requests.get(url)
        # get html page
        soup = BeautifulSoup(reqs.text, 'lxml')
    except Exception as e:
        print(f'[error] It was not possible to download the html content of this url: "{url}"')
        print(str(e))
        return pd.DataFrame(columns = col_x)

    # initialize
    records = list() 

    # loop of results
    for tag in soup.find_all("li", {"class": "arxiv-result"}):

        # parse categories
        tag_d = tag.find_all("div", {"class": "tags is-inline-block"})[0]
        # strip each line and drop the ones that are empty or whitespace-only
        categories = [it.strip() for it in tag_d.text.split('\n') if it.strip()]
        if verbose:
            print("\n{0}: {1}".format(tag_d.name, categories))   

        # only continue if is a required paper by category
        if is_category(categories, required_categories):

            # parse paper id
            tag_a = tag.find_all("a")[0]
            paper_id = tag_a.text.replace('arXiv:', '').rstrip().lstrip()
            if verbose:
                print("{0}: {1}".format(tag_a.name, paper_id))

            # parse submission date
            tag_p = tag.find_all("p")[4]
            sdate = tag_p.text.split(';')[0].replace('Submitted', '').rstrip().lstrip()
            dt = datetime.strptime(sdate, '%d %B, %Y')
            submission_date = date(dt.year, dt.month, dt.day)
            if verbose:
                print("{0}: {1}".format(tag_p.name, submission_date))

            # parse title
            tag_p = tag.find_all("p")[1]
            title = tag_p.text.replace('\n','').rstrip().lstrip()
            if verbose:
                print("{0}: {1}".format(tag_p.name, title))
                
            # parse authors
            tag_p = tag.find_all("p")[2]
            authors = [a.rstrip().lstrip() for a in tag_p.text.replace('Authors:', '').replace('\n','').split(',')]
            if verbose:
                print("{0}: {1}".format(tag_p.name, authors))

            # parse abstract
            tag_s = tag.find_all("span", {"class": "abstract-full has-text-grey-dark mathjax"})[0]
            abstract = tag_s.contents[0].replace('\n','').rstrip().lstrip()
            if verbose:
                print("{0}: {1}".format(tag_s.name, abstract))

            # append if any common cat
            records.append([paper_id, categories, submission_date, title, authors, abstract])
        else:
            if verbose:
                print("discarded because it has no required category.")

    # store in a df
    df = pd.DataFrame(records, columns = col_x)
    if verbose:
        print(f'\nParsed {len(df)} papers in total.')

    # return
    return df


## get size of page (value of the "size" parameter of the url)
def get_page_size(url:str)->int:
    try:
        return int([iu for iu in url.split('&') if iu.startswith('size=')][0].replace('size=', ''))
    except Exception:
        print('[error] The "size" parameter is not available in this url.')
        return None

    
## get next page (increase the "start" parameter of the url by the page size)
def next_paginate(url:str, size:int)->str:
    try:
        start = int([iu for iu in url.split('&') if iu.startswith('start=')][0].replace('start=', ''))
        return url.replace(f'start={start}', f'start={start + size}')
    except Exception:
        print('[error] The "start" parameter is not available in this url.')
        return None

    
## check if url is valid (all required query fragments must be present)
def is_valid_url(url:str)->bool:
    required_tokens = ['size', 'start=0', '=all', 'abstracts=show', 'include_cross_list=include']
    return all(token in url for token in required_tokens)
    

## parse papers information in N pages of the advanced query search
def parser_pages(url:str, max_num_pages:int = 200000, verbose:bool = False)->pd.DataFrame:
    # check if a valid url
    if is_valid_url(url):
        print('It is a valid url.')
    else:
        print('[error] It is not a valid url.')
        return None

    # get page size
    size = get_page_size(url)
    # initialize
    col_x = ['paper_id', 'categories', 'submission_date', 'title', 'authors', 'abstract']
    frames = list()
    num_records = 0
    num_page = 1
    # loop (stop also if the next page url could not be built)
    while num_page <= max_num_pages and url is not None:
        # parse
        idf = parser_page(url, verbose = verbose)
        # validate
        if len(idf) == 0:
            print('Stop loop!')
            break
        # collect (DataFrame.append was removed in pandas 2.0, so pages are concatenated at the end)
        frames.append(idf)
        num_records += len(idf)
        # display
        dti = np.min(idf.submission_date).strftime("%Y-%m-%d")
        dtf = np.max(idf.submission_date).strftime("%Y-%m-%d")
        print(f'--> Page {num_page} - total num records = {num_records} / period: {dtf} - {dti}')
        # get next page url
        url = next_paginate(url, size)
        # add counter
        num_page += 1
    # return an empty df if no page had records, so the result is always defined
    if len(frames) == 0:
        return pd.DataFrame(columns = col_x)
    return pd.concat(frames)
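
The pagination helpers above work by splitting and replacing substrings of the url. As an alternative sketch, the same step can be done with the standard `urllib.parse` module, which edits the `start` parameter without touching the rest of the query. The function name `next_page_url` is hypothetical, and the shortened url below is only a sample with the same `size` and `start` parameters as the real ones.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def next_page_url(url: str) -> str:
    """Return the url of the next page: start is increased by size."""
    parts = urlsplit(url)
    # keep_blank_values preserves empty parameters such as "advanced="
    params = parse_qsl(parts.query, keep_blank_values=True)
    values = dict(params)
    # a KeyError or ValueError here means "size" or "start" is missing or not an integer
    size = int(values['size'])
    start = int(values['start'])
    new_params = [(k, str(start + size)) if k == 'start' else (k, v) for k, v in params]
    return urlunsplit(parts._replace(query=urlencode(new_params)))

# shortened sample url (assumption: real urls carry more parameters)
url = 'https://arxiv.org/search/advanced?advanced=&abstracts=show&size=200&order=-announced_date_first&start=0'
page_2 = next_page_url(url)
page_3 = next_page_url(page_2)
print(page_2)
print(page_3)
```

Unlike the string-based version, this variant raises an exception instead of returning `None` when a parameter is missing, so the caller has to decide how to handle that case.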

## main

In [72]:
parser_pages(url_stat, max_num_pages = 2, verbose = False)

It is a valid url.
--> Page 1 - total num records = 200 / period: 2021-09-16 - 2021-08-15
--> Page 2 - total num records = 399 / period: 2021-09-16 - 2021-07-12


| | paper_id | categories | submission_date | title | authors | abstract |
|---|---|---|---|---|---|---|
| 0 | 2109.08139 | [eess.SP, cs.LG, cs.NI, stat.ML] | 2021-09-16 | Adversarial Attacks against Deep Learning Base... | [Brian Kim, Yi Shi, Yalin E. Sagduyu, Tugba Er... | We consider adversarial machine learning based... |
| 1 | 2109.08134 | [cs.LG, stat.ML] | 2021-09-16 | Comparison and Unification of Three Regulariza... | [Sarah Rathnam, Susan A. Murphy, Finale Doshi-... | In batch reinforcement learning, there can be ... |
| 2 | 2109.08066 | [stat.AP, stat.CO] | 2021-09-10 | Efficient Uncertainty Quantification and Sensi... | [Bjørn Jensen, Allan P. Engsig-Karup, Kim Knud... | In the political decision process and control ... |
| 3 | 2109.08065 | [stat.AP] | 2021-09-09 | Sigmoids behaving badly: why they usually cann... | [Anders Sandberg, Stuart Armstrong, Rebecca Go... | Sigmoids (AKA s-curves or logistic curves) are... |
| 4 | 2109.08051 | [stat.ML, cs.LG] | 2021-09-16 | Frame by frame completion probability of an NF... | [Gustavo Pompeu da Silva, Rafael de Andrade Mo... | American football is an increasingly popular s... |
| 5 | 2109.08010 | [cs.LG, stat.ML] | 2021-09-16 | WildWood: a new Random Forest algorithm | [Stéphane Gaïffas, Ibrahim Merad, Yiyang Yu] | We introduce WildWood (WW), a new ensemble alg... |
| 6 | 2109.08009 | [stat.ME] | 2021-09-16 | Sparse logistic functional principal component... | [Rou Zhong, Shishi Liu, Haocheng Li, Jingxiao ... | Functional binary datasets occur frequently in... |
| 7 | 2109.07978 | [stat.ME, stat.AP, stat.OT] | 2021-09-16 | On variable selection in joint modeling of mea... | [Edmilson Rodrigues Pinto, Leandro Alves Pereira] | The joint modeling of mean and dispersion (JMM... |
| 8 | 2109.07956 | [stat.AP] | 2021-09-16 | On the ordering of credibility factors | [Jae Youn Ahn, Himchan Jeong, Yang Lu] | Traditional credibility analysis of risks in i... |
| 9 | 2109.07896 | [math.OC, stat.ML] | 2021-09-16 | Distributionally Robust Optimal Power Flow wit... | [Adrián Esteban-Pérez, Juan M. Morales] | In this paper, we develop a distributionally r... |
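
The `submission_date` column comes from a text line in each search result. A minimal sketch of that parsing step, assuming the line has the form `Submitted <day> <month name>, <year>; ...`, which is what the format string `'%d %B, %Y'` in `parser_page` implies. The function name `parse_submission_date` and the sample string are hypothetical.

```python
from datetime import datetime, date

def parse_submission_date(text: str) -> date:
    """Extract the submission date from a 'Submitted ...; ...' line."""
    # keep the part before the first ';' and drop the 'Submitted' label
    raw = text.split(';')[0].replace('Submitted', '').strip()
    # '%B' matches the full month name in the current locale (English assumed)
    return datetime.strptime(raw, '%d %B, %Y').date()

# hypothetical sample line in the assumed format
sample = 'Submitted 16 September, 2021; originally announced September 2021.'
parsed = parse_submission_date(sample)
print(parsed)
```

If the page layout or the locale differs from these assumptions, `strptime` raises a `ValueError`, which is a useful signal that the scraper needs to be updated.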
