DBLP SAX Parser

What is it?

A Python package for parsing the DBLP XML dump using the Simple API for XML (SAX).

There are a total of 10 elements: "article", "inproceedings", "proceedings", "book", "incollection", "phdthesis", "mastersthesis", "www", "person", "data".

Across the elements, these are the feature types available: "address", "author", "booktitle", "cdrom", "chapter", "cite", "crossref", "editor", "ee", "isbn", "journal", "month", "note", "number", "pages", "publisher", "publnr", "school", "series", "title", "url", "volume", "year".
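
To illustrate how elements and features map onto SAX events, here is a minimal sketch using only the standard library. It is not the package's own handler: the class name `RecordHandler`, the chosen subsets of elements and features, and the sample XML record are all assumptions made up for illustration.

```python
import xml.sax
from xml.sax.handler import ContentHandler

# Subsets of the element and feature names listed above (chosen for brevity).
RECORD_ELEMENTS = {"article", "inproceedings", "book", "phdthesis"}
FEATURES = {"author", "title", "year", "journal"}


class RecordHandler(ContentHandler):
    """Hypothetical handler: collects one dict per record element."""

    def __init__(self):
        super().__init__()
        self.records = []
        self._record = None   # record currently being built
        self._feature = None  # feature currently being read
        self._text = []       # text chunks of the current feature

    def startElement(self, name, attrs):
        if name in RECORD_ELEMENTS:
            self._record = {"element": name, "key": attrs.get("key")}
        elif self._record is not None and name in FEATURES:
            self._feature = name
            self._text = []

    def characters(self, content):
        # SAX may deliver text in several chunks, so accumulate them.
        if self._feature is not None:
            self._text.append(content)

    def endElement(self, name):
        if name == self._feature:
            value = "".join(self._text).strip()
            # Features such as "author" can repeat, so store values in lists.
            self._record.setdefault(name, []).append(value)
            self._feature = None
        elif name in RECORD_ELEMENTS and self._record is not None:
            self.records.append(self._record)
            self._record = None


# Made-up sample record, shaped like a DBLP entry.
SAMPLE = b"""<dblp>
<article key="journals/x/Doe16">
<author>Jane Doe</author>
<author>John Roe</author>
<title>An Example Title.</title>
<year>2016</year>
<journal>Example Journal</journal>
</article>
</dblp>"""

handler = RecordHandler()
xml.sax.parseString(SAMPLE, handler)
print(handler.records)
```

Because SAX streams the document as events instead of building a tree, the handler only holds the record currently being read, which is what makes this approach practical for a file as large as the DBLP dump.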

Features

Download the DBLP files directly from the DBLP website.
Parse the DBLP XML file into a dataframe, which can be exported in either CSV or pickle format.

Future features for consideration

Add methods to parse only the records matching a specific attribute value, e.g. only records from the year 2016.
Allow selecting which elements or features are included or excluded.
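
Until such options exist, similar filtering can be applied after parsing. The sketch below is not part of the package: `filter_records` is a hypothetical helper, and it assumes the parsed records are available as a list of dicts with an "element" key plus one key per feature.

```python
def filter_records(records, year=None, include_elements=None, keep_features=None):
    """Hypothetical helper: filter records by year and element type,
    then keep only the requested features."""
    selected = []
    for record in records:
        if include_elements is not None and record.get("element") not in include_elements:
            continue
        # Compare as strings so that year=2016 and year="2016" both work.
        if year is not None and str(record.get("year")) != str(year):
            continue
        if keep_features is not None:
            record = {k: v for k, v in record.items() if k in keep_features}
        selected.append(record)
    return selected


# Made-up records for illustration.
records = [
    {"element": "article", "title": "A", "year": "2016", "pages": "1-10"},
    {"element": "book", "title": "B", "year": "2016"},
    {"element": "article", "title": "C", "year": "2015"},
]
result = filter_records(
    records,
    year=2016,
    include_elements={"article"},
    keep_features={"title", "year"},
)
print(result)
```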

Context and Purpose

I created this package while working on a project for a course module. Its aim is to provide a quick way to parse DBLP elements directly, with the contents exported as a CSV file for further preprocessing to suit each user's use case.

Installation

pip install dblp-sax-parser

# import package
from dblp_parser import DBLP_Parser as dp

Usage

The first step in using this parser is to instantiate the DBLP_Parser class.

# Instantiate the dblp class 
dblp = dp()

You can also use DBLP_Parser to download the DBLP data assets from the DBLP website. However, note that it might be faster to download the files directly from the DBLP site.

# download latest data sets from dblp website
dblp.download_latest_dump()

Parsing the xml file

filename = 'dblp.xml'

# execute the parser from the dblp class
parser, handler = dblp.execute_parser(filename=filename)

# use the handler to convert the parsed output to a dataframe
handler.to_df()

# the dataframe can be persisted as a pickle file or exported as csv file
handler.to_csv() # export to csv
handler.save() # persist as pickle

DBLP Methods

class DBLP_Parser

This is the main class to be instantiated before using the parser.

class DBLP_Parser.download_latest_dump

Downloads the latest DBLP files from the DBLP website. If the URLs where the files are hosted have changed or are incorrect, alternative URLs can be passed instead.
This downloads the DBLP .dtd and .xml.gz files, and decompresses the .gz file into .xml.
dtd_url [str]: URL of the .dtd file to download.
xml_zip_url [str]: URL of the .xml.gz file to download.
xml_zip_filename [str]: filename for the downloaded .xml.gz file.
xml_filename [str]: filename for the decompressed .xml file.
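
For readers who prefer to fetch the files themselves, the download and decompress steps can be sketched with the standard library. This is an illustration under stated assumptions, not the package's implementation: the helper names `download_file` and `decompress_gz` are hypothetical, and the URLs must be supplied by the caller.

```python
import gzip
import shutil
import urllib.request


def download_file(url, filename):
    """Hypothetical helper: download `url` and save it as `filename`."""
    urllib.request.urlretrieve(url, filename)
    return filename


def decompress_gz(xml_zip_filename, xml_filename):
    """Hypothetical helper: stream-decompress a .gz file to disk.

    Copying in chunks avoids loading the whole dump into memory.
    """
    with gzip.open(xml_zip_filename, "rb") as src, open(xml_filename, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return xml_filename
```

A call such as `decompress_gz("dblp.xml.gz", "dblp.xml")` would then produce the XML file for parsing, assuming the compressed file has already been downloaded under that name.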

class DBLP_Parser.execute_parser

This executes the underlying SAX parser with an xml.sax.handler.ContentHandler.
filename [str]: path and name of the XML file to be parsed. If download_latest_dump() was used, the file to be parsed will be "dblp.xml".
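
As a rough picture of what running a SAX parser over a file involves, the sketch below wires a handler to a parser with the standard library and returns both, mirroring the (parser, handler) pair shown in the usage section. The names `CountingHandler` and `run_sax_parser` are hypothetical, and enabling external entities is an assumption about what the DBLP dump needs in order to resolve the character entities declared in its .dtd file; the package's internals may differ.

```python
import xml.sax
from xml.sax.handler import ContentHandler, feature_external_ges


class CountingHandler(ContentHandler):
    """Hypothetical handler: counts how often each element name occurs."""

    def __init__(self):
        super().__init__()
        self.counts = {}

    def startElement(self, name, attrs):
        self.counts[name] = self.counts.get(name, 0) + 1


def run_sax_parser(filename, handler=None):
    """Hypothetical helper: parse `filename` and return (parser, handler)."""
    if handler is None:
        handler = CountingHandler()
    parser = xml.sax.make_parser()
    # Assumption: external entities are enabled so that entities declared in
    # the accompanying .dtd can be resolved. Only do this for trusted files.
    parser.setFeature(feature_external_ges, True)
    parser.setContentHandler(handler)
    parser.parse(filename)
    return parser, handler
```

With this sketch, `run_sax_parser("dblp.xml")` would return the parser together with a handler whose `counts` attribute holds the number of occurrences of each element.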

License

This code is published under the MIT licence.

References

Two main references contributed to this package. The outer DBLP class, which downloads the DBLP materials directly, came from angelosalatino. Some components of the SAX parsing logic were borrowed from hibernator11.
