# Parse the list of papers on ``readthedocs.org`` into a JSON file

This notebook contains tools to parse the HTML list of papers on the ``readthedocs.org`` ``galpy`` page, as it was in January 2020, into a JSON file. Note that this code did not succeed in parsing the entire HTML; instead, it was run in a few stages, with some difficult-to-parse entries (especially those with emphasis in the title) parsed manually.

This notebook is kept to save some of the tools that were helpful in getting the JSON file.

In [74]:
import html, html.parser
import json
import ads

Custom HTML parser to extract title, authors, journal, volume, pages, year, and URL:

In [66]:
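class MyHTMLParser(html.parser.HTMLParser):
    """State machine that walks one HTML list entry and fills in the fields
    title, author, year, journal, volume, pages, and url in that order;
    ``workingon`` names the field expected next."""
    def __init__(self,*args,**kwargs):
        self.verbose= kwargs.pop('verbose',True)
        html.parser.HTMLParser.__init__(self,*args,**kwargs)
        self.workingon= 'title'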
        self.title= None
        self.author= None
        self.year= None
        self.journal= None
        self.volume= None
        self.pages= None
        self.url= None
        
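    def handle_starttag(self, tag, attrs):
        # Setting a field to True flags that its opening tag was seen;
        # handle_data then replaces the flag with the tag's text
        if self.verbose: print("Encountered a start tag:", tag)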
        if self.workingon == 'title':
            if tag == 'em':
                self.title= True
        if self.workingon == 'journal':
            if tag == 'em':
                self.journal= True
        if self.workingon == 'volume':
            if tag == 'strong':
                self.volume= True
        if self.workingon == 'url' and tag == 'a':
            self.url= dict(attrs)['href']
        
    def handle_data(self,data):
        if self.verbose: print("Data     :", data)
        if self.workingon == 'title' and self.title is True:
            self.title= data
            self.workingon= 'author'
            self.author= True
        elif self.workingon == 'author' and self.author is True:
            # Text after the title is expected to look like ', <authors> (<year>) '
            self.author= data.split('(')[0][1:].strip()
            self.year= data.split('(')[1].split(')')[0]
            self.workingon= 'journal'
        elif self.workingon == 'journal' and self.journal is True:
            self.journal= data
            self.workingon= 'volume'
        elif self.workingon == 'volume' and self.volume is True:
            self.volume= data
            self.workingon= 'pages'
            self.pages= True
        elif self.workingon == 'pages' and self.pages is True:
            # Text after the volume is expected to look like ', <pages> ('
            self.pages= data.split(',')[1].split('(')[0].strip()
            self.workingon= 'url'
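
The two string-splitting steps in ``handle_data`` can be tried in isolation. The sketch below is a minimal illustration, not part of the original notebook: the helper names ``split_author_year`` and ``extract_pages`` are hypothetical, and the sample text fragments assume that the raw HTML places ``, <authors> (<year>) `` after the title and ``, <pages> (`` after the volume, which is inferred from the parser code rather than from the raw file.

```python
def split_author_year(data):
    """Split text assumed to look like ', <authors> (<year>) ' into (authors, year)."""
    # Everything before the first '(' is the author list; drop the leading comma
    author = data.split('(')[0][1:].strip()
    # The year sits between the first '(' and the following ')'
    year = data.split('(')[1].split(')')[0]
    return author, year


def extract_pages(data):
    """Extract the page from text assumed to look like ', <pages> ('."""
    return data.split(',')[1].split('(')[0].strip()


# Sample fragments built from one entry in the list (assumed layout)
author, year = split_author_year(", A. P. Igoshev & S. B. Popov (2017) ")
pages = extract_pages(", 3204 (")
print(author, year, pages)
```

Both helpers raise ``IndexError`` when the expected ``(`` or ``,`` is missing, which is one way an unusual entry can make the parser fail.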

Parse from the raw HTML in an input file. The following code will search ADS for an updated publication record if the original entry did not have a volume (meaning that it was listed as *submitted* or *in press* in the original HTML):

In [150]:
out= {}
with open('../src/data/raw-papers.txt','r') as rawfile:
    ii= 0
    for line in rawfile:
        parser = MyHTMLParser(verbose=False)
        parser.feed(line)
        if parser.title:
            if not parser.volume:
                try: # searching ADS by the first nine words of the title
                    paper= list(ads.SearchQuery(title=' '.join(parser.title.split(' ')[:9]),
                                                fl=['volume','page','identifier']))[0]
                    parser.volume= paper.volume
                    parser.pages= paper.page[0]
                    # Use the first arXiv identifier, if any, to build the URL
                    for identifier in paper.identifier:
                        if 'arXiv:' in identifier:
                            parser.url= 'http://arxiv.org/abs/{}'.format(identifier.split(':')[1])
                            break
                except Exception: pass # no ADS match or incomplete record: keep what was parsed
            print(parser.title, parser.author, parser.year, parser.journal, parser.volume, parser.pages, parser.url)
            out['{ii}_{a}'.format(ii=ii,a=parser.author.split(' ')[1].split(',')[0])]=\
                                  {'author': parser.author,
                                   'title': parser.title,
                                   'year': parser.year,
                                   'journal': parser.journal,
                                   'volume': parser.volume,
                                   'pages': parser.pages,
                                   'url': parser.url}
            ii+= 1

Proper motions in the VVV Survey: Results for more than 15 million stars across NGC 6544 R. Contreras Ramos, M. Zoccali, F. Rojas, A. Rojas-Arriagada, M. Gárate, P. Huijse, F. Gran, M. Soto, A.A.R. Valcarce, P. A. Estévez, & D. Minniti 2017 Astron. & Astrophys. 608 A140 http://arxiv.org/abs/1709.07919
How to make a mature accreting magnetar A. P. Igoshev & S. B. Popov 2017 Mon. Not. Roy. Astron. Soc. 473 3204 http://arxiv.org/abs/1709.10385
iota Horologii is unlikely to be an evaporated Hyades star I. Ramirez, D. Yong, E. Gutierrez, M. Endl, D. L. Lambert, J.-D. Do Nascimento Jr 2017 Astrophys. J. 850 80 http://arxiv.org/abs/1710.05930
Confirming chemical clocks: asteroseismic age dissection of the Milky Way disk(s) V. Silva Aguirre, M. Bojsen-Hansen, D. Slumstrup, et al. 2017 Mon. Not. Roy. Astron. Soc. 475 5487 http://arxiv.org/abs/1710.09847
The universality of the rapid neutron-capture process revealed by a possible disrupted dwarf galaxy star Andrew R. Casey & Kevin C. Schlaufman 

The Galactic Disc in Action Space as seen by Gaia DR2 Wilma H. Trick, Johanna Coronado, Hans-Walter Rix 2018 Mon. Not. Roy. Astron. Soc. 484 3291 http://arxiv.org/abs/1805.03653
Tidal ribbons Walter Dehnen & Hasanuddin 2018 Mon. Not. Roy. Astron. Soc. 479 4720 http://arxiv.org/abs/1805.08481
Apocenter Pile-Up: Origin of the Stellar Halo Density Break Alis J. Deason, Vasily Belokurov, Sergey E. Koposov, & Lachlan Lancaster 2018 Astrophys. J. Lett. 862 L1 http://arxiv.org/abs/1805.10288
Bootes III is a disrupting dwarf galaxy associated with the Styx stellar stream Jeffrey L. Carlin & David J. Sand 2018 Astrophys. J. None None None
Proper motions of Milky Way Ultra-Faint satellites with Gaia DR2 × DES DR1 Andrew B. Pace & Ting S. Li 2018 Astrophys. J. 875 77 http://arxiv.org/abs/1806.02345
Transient spiral structure and the disc velocity substructure in Gaia DR2 Jason A. S. Hunt, Jack Hong, Jo Bovy, Daisuke Kawata, Robert J. J. Grand 2018 Mon. Not. Roy. Astron. Soc. 481 3794 http://arxiv

Secular dynamics of binaries in stellar clusters I: general formulation and dependence on cluster potential Chris Hamilton & Roman R. Rafikov 2019 Mon. Not. Roy. Astron. Soc. 488 5489 http://arxiv.org/abs/1902.01344
Secular dynamics of binaries in stellar clusters II: dynamical evolution Chris Hamilton & Roman R. Rafikov 2019 Mon. Not. Roy. Astron. Soc. 488 5512 http://arxiv.org/abs/1902.01345
Discovery of Tidal Tails in Disrupting Open Clusters: Coma Berenices and a Neighbor Stellar Group Shih-Yun Tang, Xiaoying Pang, Zhen Yuan, W. P. Chen, Jongsuk Hong, Bertrand Goldman, Andreas Just, Bekdaulet Shukirgaliyev, & Chien-Cheng Lin 2019 Astrophys. J. 877 12 http://arxiv.org/abs/1902.01404
A class of partly burnt runaway stellar remnants from peculiar thermonuclear supernovae R. Raddi, M. A. Hollands, D. Koester, et al. 2019 Mon. Not. Roy. Astron. Soc. None None None
Extended stellar systems in the solar neighborhood - III. Like ships in the night: the Coma Berenices neighbor moving group 

Modelling the Effects of Dark Matter Substructure on Globular Cluster Evolution with the Tidal Approximation Jeremy J. Webb, Jo Bovy, Raymond G. Carlberg, & Mark Gieles 2019 Mon. Not. Roy. Astron. Soc. 488 5748 http://arxiv.org/abs/1907.13132
Kinematic study of the association Cyg OB3 with Gaia DR2 Anjali Rao, Poshak Gandhi, Christian Knigge, John A. Paice, Nathan W. C. Leigh, & Douglas Boubert 2019 Mon. Not. Roy. Astron. Soc. None arXiv:1908.00810 http://arxiv.org/abs/1908.00810
Gravitational Potential from small-scale clustering in action space: Application to Gaia DR2 T. Yang, S. S. Boruah, & N. Afshordi 2019 Mon. Not. Roy. Astron. Soc. None arXiv:1908.02336 http://arxiv.org/abs/1908.02336
The Intrinsic Scatter of the Radial Acceleration Relation Connor Stone & Stephane Courteau 2019 Astrophys. J. 882 6 http://arxiv.org/abs/1908.06105
Radial Velocity Discovery of an Eccentric Jovian World Orbiting at 18 au Sarah Blunt, Michael Endl, Lauren M. Weiss, et al. 2019 Astron. J. 158 181 ht
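
The arXiv lookup inside the ADS branch of the parsing cell can be isolated as a small helper that needs no network access. This is a sketch under stated assumptions: the function name ``arxiv_url`` is hypothetical, and the identifier list in the usage line is made up for illustration (only the arXiv number appears in the output above); in the notebook the list comes from the ``identifier`` field of an ADS record.

```python
def arxiv_url(identifiers):
    """Return an arXiv abstract URL built from the first 'arXiv:' identifier, or None."""
    for identifier in identifiers:
        if 'arXiv:' in identifier:
            # 'arXiv:1709.07919' -> '1709.07919'
            return 'http://arxiv.org/abs/{}'.format(identifier.split(':')[1])
    return None  # no arXiv identifier in the record


# Hypothetical identifier list: a placeholder DOI followed by an arXiv identifier
print(arxiv_url(['10.0000/placeholder', 'arXiv:1709.07919']))
print(arxiv_url(['10.0000/placeholder']))
```

Returning ``None`` when nothing matches makes the missing-identifier case explicit instead of relying on an exception to end the search.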

Dump to a JSON file, which was then copied into the master JSON file:

In [151]:
with open('../src/data/processed.json','w') as outfile:
    json.dump(out,outfile,indent=2)
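
Before copying entries into another file, it can help to check that the dump reads back unchanged. The following is a minimal sketch, assuming a temporary directory instead of the notebook's ``../src/data`` path; the dictionary key ``example_key`` is hypothetical, and the field values are taken from one line of the printed output above and written as strings for illustration.

```python
import json
import os
import tempfile

# Hypothetical key; one entry with the same fields as the notebook's output dict
out = {'example_key': {'author': 'A. P. Igoshev & S. B. Popov',
                       'title': 'How to make a mature accreting magnetar',
                       'year': '2017',
                       'journal': 'Mon. Not. Roy. Astron. Soc.',
                       'volume': '473',
                       'pages': '3204',
                       'url': 'http://arxiv.org/abs/1709.10385'}}

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'processed.json')
    # Write the entries exactly as the notebook does
    with open(path, 'w') as outfile:
        json.dump(out, outfile, indent=2)
    # Read them back to confirm that the file is valid JSON
    with open(path, 'r') as infile:
        reloaded = json.load(infile)

print(reloaded == out)
```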