## arXiv Paper DataFrame by Crawling


In [7]:
import re
import arxiv
import pandas as pd
import numpy as np
from datetime import datetime

In [2]:
# Start with an empty table; rows are filled in one by one via .loc below.
arxiv_paper_df_with_abstract = pd.DataFrame(columns=['Title',
                                                     'Journal/Conference',
                                                     'Date',
                                                     'Author',
                                                     'Link',
                                                     'Abstract'])

In [3]:
paper_ids = ["1409.0473v7", "1409.3215v3", "1706.03762v5", "1609.08144v2",
             "1508.07909v5", "1301.3781v3", "1808.06226v1", "1802.05365v2",
             "1810.04805v2", "2104.02395v3", "2202.07105v2", "1503.02531v1",
             "1910.01108v4", "1908.09355v1"]

In [4]:
for idx, paper_id in enumerate(paper_ids):
    # Fetch the metadata for a single paper by its versioned arXiv ID.
    search = arxiv.Search(id_list=[paper_id])
    paper = next(search.results())

    # Look for a venue-and-year token such as "ICLR 2015" in the arXiv comment.
    match = re.search(r'[A-Z ]+[0-9]{4,}', str(paper.comment))
    paper_journal_conf = match.group().strip() if match is not None else ""
    if len(paper_journal_conf) <= 4:
        # Nothing matched, or only a bare year: treat the venue as unknown.
        paper_journal_conf = ""
    elif paper_journal_conf[-5] != " ":
        # Insert the missing space before the year, e.g. "ICLR2015" -> "ICLR 2015".
        paper_journal_conf = paper_journal_conf[:-4] + " " + paper_journal_conf[-4:]

    arxiv_paper_df_with_abstract.loc[idx] = [paper.title,
                                             paper_journal_conf,
                                             paper.published.date(),
                                             str(paper.authors[0]) + ' et al.',
                                             paper.entry_id,
                                             paper.summary]
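
The venue-extraction step can be tried without calling the arXiv API. The sketch below isolates it into a helper; `extract_venue`, `VENUE_PATTERN`, and the sample comment strings are hypothetical names and inputs chosen for illustration, and the helper assumes the space check should look at the character just before the four-digit year.

```python
import re

# Capital letters/spaces followed by at least four digits, e.g. " ICLR 2015".
VENUE_PATTERN = re.compile(r'[A-Z ]+[0-9]{4,}')

def extract_venue(comment):
    """Return a 'VENUE YEAR' string from an arXiv comment, or '' if none is found."""
    match = VENUE_PATTERN.search(str(comment))
    if match is None:
        return ""
    venue = match.group().strip()
    if len(venue) <= 4:
        # Only a bare year (or less) matched, so there is no venue name.
        return ""
    if venue[-5] != " ":
        # No space before the year: "EMNLP2018" -> "EMNLP 2018".
        venue = venue[:-4] + " " + venue[-4:]
    return venue

print(extract_venue("Accepted at ICLR 2015 as oral presentation"))
print(extract_venue("EMNLP2018 demo"))
print(repr(extract_venue(None)))
```

Because the pattern only accepts capital letters and spaces before the year, venues written in mixed case or with punctuation would not be picked up; such papers end up with an empty `Journal/Conference` cell.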

In [5]:
def str_convert_datetime(date):
    return datetime.strptime(date, '%Y-%m-%d').date()
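
A quick check of the helper, as a minimal sketch: it returns a `datetime.date`, the same type as `paper.published.date()`, so manually entered rows can be compared and sorted together with the crawled ones. The sample dates are taken from the rows used in this notebook.

```python
from datetime import date, datetime

def str_convert_datetime(date_str):
    # Parse an ISO-style 'YYYY-MM-DD' string into a datetime.date.
    return datetime.strptime(date_str, '%Y-%m-%d').date()

parsed = str_convert_datetime("2006-08-20")
print(type(parsed).__name__)
print(parsed < date(2013, 1, 16))
```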

In [6]:
other_papers = [
    ["Model Compression", "ACM SIGKDD 2006", str_convert_datetime("2006-08-20"),
     "Cristian Bucilă et al.", "https://dl.acm.org/doi/abs/10.1145/1150402.1150464",
     "Often the best performing supervised learning models are ensembles of hundreds or thousands of base-level classifiers. Unfortunately, the space required to store this many classifiers, and the time required to execute them at run-time, prohibits their use in applications where test sets are large (e.g. Google), where storage space is at a premium (e.g. PDAs), and where computational power is limited (e.g. hearing aids). We present a method for 'compressing' large, complex ensembles into smaller, faster models, usually without significant loss in performance."],
    ["Adaptive Mixtures of Local Experts", "MIT Press 1991", str_convert_datetime("1991-03-01"),
     "Robert A. Jacobs et al.", "https://ieeexplore.ieee.org/abstract/document/6797059",
     "We present a new supervised learning procedure for systems composed of many separate networks, each of which learns to handle a subset of the complete set of training cases. The new procedure can be viewed either as a modular version of a multilayer supervised network, or as an associative version of competitive learning. It therefore provides a new link between these two apparently different approaches. We demonstrate that the learning procedure divides up a vowel discrimination task into appropriate subtasks, each of which can be solved by a very simple expert network."],
    ["Dropout: A Simple Way to Prevent Neural Networks from Overfitting", "JMLR 2014", str_convert_datetime("2014-01-01"),
     "Nitish Srivastava et al.", "https://ieeexplore.ieee.org/abstract/document/6797059",
     "Deep neural nets with a large number of parameters are very powerful machine learning systems. However, overfitting is a serious problem in such networks. Large networks are also slow to use, making it difficult to deal with overfitting by combining the predictions of many different large neural nets at test time. Dropout is a technique for addressing this problem. The key idea is to randomly drop units (along with their connections) from the neural network during training. This prevents units from co-adapting too much. During training, dropout samples from an exponential number of different 'thinned' networks. At test time, it is easy to approximate the effect of averaging the predictions of all these thinned networks by simply using a single unthinned network that has smaller weights. This significantly reduces overfitting and gives major improvements over other regularization methods. We show that dropout improves the performance of neural networks on supervised learning tasks in vision, speech recognition, document classification and computational biology, obtaining state-of-the-art results on many benchmark data sets."],
]
# Append each manual entry at the next unused row label so that no
# crawled row is overwritten.
for other_paper in other_papers:
    arxiv_paper_df_with_abstract.loc[len(arxiv_paper_df_with_abstract)] = other_paper
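
Appending with `.loc` depends on writing to a row label that does not exist yet; writing to an existing label replaces that row instead. A minimal sketch with a toy frame (the titles and dates are made up) showing that `len(df)` is always the next free label on a default 0-based index:

```python
import pandas as pd

# Toy frame standing in for the crawled table (values are made up).
df = pd.DataFrame({"Title": ["A", "B"], "Date": ["2020-01-01", "2021-01-01"]})
extra_rows = [["C", "2019-01-01"], ["D", "2018-01-01"]]

for row in extra_rows:
    # On a 0..n-1 index, len(df) is the first unused label,
    # so existing rows are never overwritten.
    df.loc[len(df)] = row

print(len(df))
print(df["Title"].tolist())
```

Starting one label lower, at `len(df) - 1`, would place the first new row on top of the last existing one, which is worth checking by comparing the final row count against the number of papers expected.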

In [8]:
# Sort chronologically, then renumber the rows starting from 1.
arxiv_paper_df_with_abstract = arxiv_paper_df_with_abstract.sort_values(by='Date').reset_index(drop=True)
arxiv_paper_df_with_abstract.index = np.arange(1, len(arxiv_paper_df_with_abstract) + 1)
# Keep a lighter copy without the abstracts.
arxiv_paper_df = arxiv_paper_df_with_abstract.drop(['Abstract'], axis='columns')
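
The same sort, renumber, and drop sequence on a toy frame, as a minimal sketch (the single-letter titles and placeholder abstracts are made up):

```python
from datetime import date
import pandas as pd

df = pd.DataFrame({
    "Title": ["B", "A", "C"],
    "Date": [date(2015, 3, 9), date(1991, 3, 1), date(2017, 6, 12)],
    "Abstract": ["...", "...", "..."],
})

# Oldest first, discard the old row labels, then number rows from 1.
df = df.sort_values(by="Date").reset_index(drop=True)
df.index = range(1, len(df) + 1)

# Lighter copy without the abstract column.
slim = df.drop(columns=["Abstract"])

print(df["Title"].tolist())
print(list(df.index))
print(list(slim.columns))
```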

In [9]:
arxiv_paper_df_with_abstract

|    | Title | Journal/Conference | Date | Author | Link | Abstract |
|---|---|---|---|---|---|---|
| 1 | Adaptive Mixtures of Local Experts | MIT Press 1991 | 1991-03-01 | Robert A. Jacobs et al. | https://ieeexplore.ieee.org/abstract/document/... | We present a new supervised learning procedure... |
| 2 | Model Compression | ACM SIGKDD 2006 | 2006-08-20 | Cristian Bucilă et al. | https://dl.acm.org/doi/abs/10.1145/1150402.115... | Often the best performing supervised learning ... |
| 3 | Efficient Estimation of Word Representations i... |  | 2013-01-16 | Tomas Mikolov et al. | http://arxiv.org/abs/1301.3781v3 | We propose two novel model architectures for c... |
| 4 | Dropout: A Simple Way to Prevent Neural Networ... | JMLR 2014 | 2014-01-01 | Nitish Srivastava et al. | https://ieeexplore.ieee.org/abstract/document/... | Deep neural nets with a large number of parame... |
| 5 | Neural Machine Translation by Jointly Learning... | ICLR 2015 | 2014-09-01 | Dzmitry Bahdanau et al. | http://arxiv.org/abs/1409.0473v7 | Neural machine translation is a recently propo... |
| 6 | Sequence to Sequence Learning with Neural Netw... |  | 2014-09-10 | Ilya Sutskever et al. | http://arxiv.org/abs/1409.3215v3 | Deep Neural Networks (DNNs) are powerful model... |
| 7 | Distilling the Knowledge in a Neural Network | NIPS 2014 | 2015-03-09 | Geoffrey Hinton et al. | http://arxiv.org/abs/1503.02531v1 | A very simple way to improve the performance o... |
| 8 | Neural Machine Translation of Rare Words with ... | ACL 2016 | 2015-08-31 | Rico Sennrich et al. | http://arxiv.org/abs/1508.07909v5 | Neural machine translation (NMT) models typica... |
| 9 | Google's Neural Machine Translation System: Br... |  | 2016-09-26 | Yonghui Wu et al. | http://arxiv.org/abs/1609.08144v2 | Neural Machine Translation (NMT) is an end-to-... |
| 10 | Attention Is All You Need |  | 2017-06-12 | Ashish Vaswani et al. | http://arxiv.org/abs/1706.03762v5 | The dominant sequence transduction models are ... |


In [195]:
arxiv_paper_df_with_abstract.to_excel("arxiv_paper_df_with_abstract.xlsx")
arxiv_paper_df.to_excel("arxiv_paper_df.xlsx")

### Upload the DataFrame to GitHub

[Excel to Markdown Converter](https://tabletomarkdown.com/convert-spreadsheet-to-markdown/)
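
As an alternative to an online converter, the Markdown table can be produced directly in Python. The sketch below uses only the standard library; `to_markdown_table` is a hypothetical helper written for illustration, not part of pandas.

```python
def to_markdown_table(headers, rows):
    """Render headers and rows as a GitHub-flavored Markdown table string."""
    def fmt(cells):
        # Escape pipes so cell text cannot break the column layout.
        return "| " + " | ".join(str(c).replace("|", "\\|") for c in cells) + " |"

    lines = [fmt(headers), fmt(["---"] * len(headers))]
    lines.extend(fmt(row) for row in rows)
    return "\n".join(lines)

table = to_markdown_table(
    ["Title", "Date"],
    [["Attention Is All You Need", "2017-06-12"]],
)
print(table)
```

For a DataFrame, the helper could be called as `to_markdown_table(df.columns, df.itertuples(index=False))`. pandas also provides `DataFrame.to_markdown()`, which requires the optional `tabulate` package.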