## Building an arXiv Paper DataFrame by Crawling


In [2]:
import re
import arxiv
import pandas as pd
import numpy as np
from datetime import datetime

In [3]:
def make_arxiv_paper_df_with_abstract(paper_ids):
    # Collect one row per arXiv ID, then build the frame once at the end,
    # so an empty ID list yields an empty frame instead of a placeholder row.
    columns = ['Title', 'Journal/Conference', 'Date', 'Author', 'Link', 'Abstract']
    rows = []

    for paper_id in paper_ids:
        search = arxiv.Search(id_list=[paper_id])
        paper = next(search.results())

        # Look for an upper-case venue name followed by a year (e.g. "ICLR 2015")
        # in the free-text comment field; leave the venue empty if none is found.
        paper_journal_conf = ""
        match = re.search(r'[A-Z ]+[0-9]{4,}', str(paper.comment))
        if match is not None:
            candidate = match.group().strip()
            if len(candidate) > 4:
                # Keep exactly one space between the venue name and the year.
                paper_journal_conf = candidate[:-4].rstrip() + " " + candidate[-4:]

        rows.append([paper.title,
                     paper_journal_conf,
                     paper.published.date(),
                     str(paper.authors[0]) + ' et al',
                     paper.entry_id,
                     paper.summary])

    return pd.DataFrame(rows, columns=columns)
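
The venue parsing above depends on the free-text arXiv comment field, so it can be useful to check the rule in isolation, without any network call. The following is a minimal sketch under stated assumptions: the helper name `extract_venue` and the sample comment strings are hypothetical and not part of the notebook; only the pattern (an upper-case name followed by a year) comes from the function above.

```python
import re

# Hypothetical helper, not in the original notebook: the venue rule as a pure function.
VENUE_PATTERN = re.compile(r'[A-Z ]+[0-9]{4,}')

def extract_venue(comment):
    """Return a venue such as 'ICLR 2015' parsed from a comment, or '' if none is found."""
    match = VENUE_PATTERN.search(str(comment))
    if match is None:
        return ""
    venue = match.group().strip()
    if len(venue) <= 4:
        # Only a bare year matched, so there is no venue name to report.
        return ""
    # Force exactly one space between the name and the trailing four digits.
    return venue[:-4].rstrip() + " " + venue[-4:]

# Illustrative comment strings (made up for this sketch).
print(extract_venue("Accepted at ICLR 2015 as oral presentation"))
print(extract_venue("EMNLP2018 demo paper"))
print(repr(extract_venue("15 pages, 5 figures")))
```

Because the rule only recognizes upper-case names, venues written in mixed case (or comments that mention no venue) come back as an empty string, which matches the blank `Journal/Conference` cells in the output table.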

In [4]:
def str_convert_datetime(date):
    return datetime.strptime(date, '%Y-%m-%d').date()

In [5]:
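def add_other_papers_column(arxiv_paper_df_with_abstract, other_papers):
    # Despite the name, this appends each entry of other_papers as a new row.
    # It assumes the frame still has its default 0..n-1 integer index.
    next_idx = len(arxiv_paper_df_with_abstract)

    for other_paper in other_papers:
        arxiv_paper_df_with_abstract.loc[next_idx] = other_paper
        next_idx += 1

    return arxiv_paper_df_with_abstract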

In [6]:
def hyperlink(x):
    # Wrap a URL in Markdown link syntax.
    return '[Link](' + x + ')'

In [7]:
def make_arxiv_paper_df(arxiv_paper_df_with_abstract):
    # Sort chronologically, renumber rows from 1, and drop the abstract column.
    sorted_df = arxiv_paper_df_with_abstract.sort_values(by='Date').reset_index(drop=True)
    sorted_df.index = np.arange(1, len(sorted_df) + 1)
    arxiv_paper_df = sorted_df.drop(['Abstract'], axis='columns')

    return arxiv_paper_df

In [8]:
paper_ids = ["1409.0473v7", "1409.3215v3", "1706.03762v5", "1609.08144v2",
             "1508.07909v5", "1301.3781v3", "1808.06226v1", "1802.05365v2",
             "1810.04805v2", "2104.02395v3", "2202.07105v2", "1503.02531v1",
             "1910.01108v4", "1908.09355v1", "2008.05030v4", "1603.08983v6",
             "1709.01686v1", "1804.07461v3", "1902.03393v2", "2004.02178v2",
             "2002.10957v2", "2012.15828v2"]

arxiv_paper_df_with_abstract = make_arxiv_paper_df_with_abstract(paper_ids)

In [9]:
other_papers = [["Model Compression", "ACM SIGKDD 2006", str_convert_datetime("2006-08-20"),
                "Cristian Bucilă et al", "https://dl.acm.org/doi/abs/10.1145/1150402.1150464",
                "Often the best performing supervised learning models are ensembles of hundreds or thousands of base-level classifiers. Unfortunately, the space required to store this many classifiers, and the time required to execute them at run-time, prohibits their use in applications where test sets are large (e.g. Google), where storage space is at a premium (e.g. PDAs), and where computational power is limited (e.g. hearing aids). We present a method for 'compressing' large, complex ensembles into smaller, faster models, usually without significant loss in performance."],
                ["Adaptive Mixtures of Local Experts", "MIT Press 1991", str_convert_datetime("1991-03-01"),
                "Robert A. Jacobs et al", "https://ieeexplore.ieee.org/abstract/document/6797059", 
                "We present a new supervised learning procedure for systems composed of many separate networks, each of which learns to handle a subset of the complete set of training cases. The new procedure can be viewed either as a modular version of a multilayer supervised network, or as an associative version of competitive learning. It therefore provides a new link between these two apparently different approaches. We demonstrate that the learning procedure divides up a vowel discrimination task into appropriate subtasks, each of which can be solved by a very simple expert network."],
                ["Dropout: A Simple Way to Prevent Neural Networks from Overfitting", "JMLR 2014", str_convert_datetime("2014-01-01"),
                "Nitish Srivastava et al", "https://jmlr.org/papers/v15/srivastava14a.html",
                "Deep neural nets with a large number of parameters are very powerful machine learning systems. However, overfitting is a serious problem in such networks. Large networks are also slow to use, making it difficult to deal with overfitting by combining the predictions of many different large neural nets at test time. Dropout is a technique for addressing this problem. The key idea is to randomly drop units (along with their connections) from the neural network during training. This prevents units from co-adapting too much. During training, dropout samples from an exponential number of different “thinned” networks. At test time, it is easy to approximate the effect of averaging the predictions of all these thinned networks by simply using a single unthinned network that has smaller weights. This significantly reduces overfitting and gives major improvements over other regularization methods. We show that dropout improves the performance of neural networks on supervised learning tasks in vision, speech recognition, document classification and computational biology, obtaining state-of-the-art results on many benchmark data sets."],
                ["Linguistic Regularities in Continuous Space Word Representations", "NAACL 2013", str_convert_datetime("2013-06-01"),
                "Tomas Mikolov et al", "https://aclanthology.org/N13-1090/", 
                "Continuous space language models have recently demonstrated outstanding results across a variety of tasks. In this paper, we examine the vector-space word representations that are implicitly learned by the input-layer weights. We find that these representations are surprisingly good at capturing syntactic and semantic regularities in language, and that each relationship is characterized by a relation-specific vector offset. This allows vector-oriented reasoning based on the offsets between words. For example, the male/female relationship is automatically learned, and with the induced vector representations, “King - Man + Woman” results in a vector very close to “Queen.” We demonstrate that the word vectors capture syntactic regularities by means of syntactic analogy questions (provided with this paper), and are able to correctly answer almost 40% of the questions. We demonstrate that the word vectors capture semantic regularities by using the vector offset method to answer SemEval-2012 Task 2 questions. Remarkably, this method outperforms the best previous systems."],
                ["Large-Scale Distributed Language Modeling", "IEEE 2007", str_convert_datetime("2007-04-05"),
                "Ahmad Emami et al", "https://ieeexplore.ieee.org/document/4218031", 
                "A novel distributed language model that has no constraints on the n-gram order and no practical constraints on vocabulary size is presented. This model is scalable and allows for an arbitrarily large corpus to be queried for statistical estimates. Our distributed model is capable of producing n-gram counts on demand. By using a novel heuristic estimate for the interpolation weights of a linearly interpolated model, it is possible to dynamically compute the language model probabilities. The distributed architecture follows the client-server paradigm and allows for each client to request an arbitrary weighted mixture of the corpus. This allows easy adaptation of the language model to particular test conditions. Experiments using the distributed LM for re-ranking N-best lists of a speech recognition system resulted in considerable improvements in word error rate (WER), while integration with a machine translation decoder resulted in significant improvements in translation quality as measured by the BLEU score."],
                ["BLEU: a method for automatic evaluation of machine translation", "ACL 2002", str_convert_datetime("2002-07-01"),
                "Kishore Papineni et al", "https://dl.acm.org/doi/10.3115/1073083.1073135", 
                "Human evaluations of machine translation are extensive but expensive. Human evaluations can take months to finish and involve human labor that can not be reused. We propose a method of automatic machine translation evaluation that is quick, inexpensive, and language-independent, that correlates highly with human evaluation, and that has little marginal cost per run. We present this method as an automated understudy to skilled human judges which substitutes for them when there is need for quick or frequent evaluations."],
                ["Large Language Models in Machine Translation", "EMNLP 2007", str_convert_datetime("2007-06-01"),
                "Thorsten Brants et al", "https://aclanthology.org/D07-1090/",
                "This paper reports on the benefits of large-scale statistical language modeling in machine translation. A distributed infrastructure is proposed which we use to train on up to 2 trillion tokens, resulting in language models having up to 300 billion n-grams. It is capable of providing smoothed probabilities for fast, single-pass decoding. We introduce a new smoothing method, dubbed Stupid Backoff, that is inexpensive to train on large data sets and approaches the quality of Kneser-Ney Smoothing as the amount of training data increases."]
                ]

arxiv_paper_df_with_abstract = add_other_papers_column(arxiv_paper_df_with_abstract, other_papers)

In [10]:
arxiv_paper_df_with_abstract["Link"] = arxiv_paper_df_with_abstract["Link"].apply(hyperlink)

In [11]:
arxiv_paper_df = make_arxiv_paper_df(arxiv_paper_df_with_abstract)

In [14]:
arxiv_paper_df_with_abstract

| | Title | Journal/Conference | Date | Author | Link | Abstract |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | Neural Machine Translation by Jointly Learning... | ICLR 2015 | 2014-09-01 | Dzmitry Bahdanau et al | [Link](http://arxiv.org/abs/1409.0473v7) | Neural machine translation is a recently propo... |
| 1 | Sequence to Sequence Learning with Neural Netw... | | 2014-09-10 | Ilya Sutskever et al | [Link](http://arxiv.org/abs/1409.3215v3) | Deep Neural Networks (DNNs) are powerful model... |
| 2 | Attention Is All You Need | | 2017-06-12 | Ashish Vaswani et al | [Link](http://arxiv.org/abs/1706.03762v5) | The dominant sequence transduction models are ... |
| 3 | Google's Neural Machine Translation System: Br... | | 2016-09-26 | Yonghui Wu et al | [Link](http://arxiv.org/abs/1609.08144v2) | Neural Machine Translation (NMT) is an end-to-... |
| 4 | Neural Machine Translation of Rare Words with ... | ACL 2016 | 2015-08-31 | Rico Sennrich et al | [Link](http://arxiv.org/abs/1508.07909v5) | Neural machine translation (NMT) models typica... |
| 5 | Efficient Estimation of Word Representations i... | | 2013-01-16 | Tomas Mikolov et al | [Link](http://arxiv.org/abs/1301.3781v3) | We propose two novel model architectures for c... |
| 6 | SentencePiece: A simple and language independe... | EMNLP 2018 | 2018-08-19 | Taku Kudo et al | [Link](http://arxiv.org/abs/1808.06226v1) | This paper describes SentencePiece, a language... |
| 7 | Deep contextualized word representations | NAACL 2018 | 2018-02-15 | Matthew E. Peters et al | [Link](http://arxiv.org/abs/1802.05365v2) | We introduce a new type of deep contextualized... |
| 8 | BERT: Pre-training of Deep Bidirectional Trans... | | 2018-10-11 | Jacob Devlin et al | [Link](http://arxiv.org/abs/1810.04805v2) | We introduce a new language representation mod... |
| 9 | Ensemble deep learning: A review | | 2021-04-06 | M. A. Ganaie et al | [Link](http://arxiv.org/abs/2104.02395v3) | Ensemble learning combines several individual ... |


In [15]:
arxiv_paper_df_with_abstract.to_excel("arxiv_paper_df_with_abstract.xlsx")
arxiv_paper_df.to_excel("arxiv_paper_df.xlsx")

### Upload the DataFrame to GitHub

[Excel to Markdown Converter](https://tabletomarkdown.com/convert-spreadsheet-to-markdown/)
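
As an alternative to the online converter, the table could also be turned into Markdown directly in Python. The following is a minimal sketch, not the notebook's own method: the function name `rows_to_markdown` is hypothetical, and the sample row simply reuses one entry from the output above.

```python
def rows_to_markdown(headers, rows):
    """Render headers and rows as a GitHub-flavored Markdown pipe table."""
    def cell(value):
        # Escape pipes and flatten newlines so a cell cannot break the table layout.
        return str(value).replace("|", "\\|").replace("\n", " ")

    lines = ["| " + " | ".join(cell(h) for h in headers) + " |",
             "| " + " | ".join("---" for _ in headers) + " |"]
    for row in rows:
        lines.append("| " + " | ".join(cell(v) for v in row) + " |")
    return "\n".join(lines)

# Sample data taken from one row of the table above.
table = rows_to_markdown(
    ["Title", "Date", "Link"],
    [["Attention Is All You Need", "2017-06-12",
      "[Link](http://arxiv.org/abs/1706.03762v5)"]],
)
print(table)
```

With a DataFrame, `arxiv_paper_df.columns` and `arxiv_paper_df.itertuples(index=False)` could be passed as the two arguments. pandas also provides `DataFrame.to_markdown`, which relies on the optional `tabulate` package.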