Apifier

Apifier is a very simple HTML parser written in Python.

It parses HTML documents declaratively, using CSS or XPath selectors. Its main purpose is to extract tabular and/or paginated data.

Install

Apifier is available for Python 3:

pip install apifier

Example

Getting all comments from an article on "LeFigaro.fr":

from apifier import Apifier

config = {
    "name": "FigaroBot article comments",
    "encoding": "latin-1",
    "url": "http://www.lefigaro.fr/politique/le-scan/2016/07/21/25001-20160721ARTFIG00062-attentat-de-nice-la-droite-demande-une-enquete-independante.php",
    "foreach": "#fig-pagination-nav > li > a",
    "context": "page",
    "xpath": False,
    "prefix": "#reagir > div > div > div.fig-col.fig-col--comments > div:nth-child(3) > ul > li > article >",
    "description": {
        "author": "div.fig-comment-header a",
        "comment": "div.fig-comment-msg p"
    }
}

api = Apifier(config=config)
data = api.load()

Config

name : name of the current configuration
encoding : the encoding used by the page; data is converted from this encoding to UTF-8
url : page URL (the first page, for paginated data)
xpath : boolean, set to True if the selectors are XPath instead of CSS
next : selector for a "next" link; Apifier crawls pages by following this link until none is found

foreach : selector for the pagination links. In this example, the pagination looks like:

<ul id="fig-pagination-nav">
  <li class="fig-pagination-current"><a href="…"> 1 </a></li>
  <li><a href="…"> 2 </a></li>
  <li><a href="…"> 3 </a></li>
</ul>

context : name of a special variable attached to each item; its value is the content of the pagination link the item came from. In this example that content is just the page number, but the pagination mechanism can serve other purposes, such as categories
prefix : prepended to every descriptor selector
description : descriptors for the content to parse; in this example, the comment content and the author name
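To make the relationship between foreach and context concrete, here is a minimal sketch that uses only the standard library. It is not Apifier's implementation: the `LinkTextCollector` class and the `tagged` list are hypothetical names introduced for illustration. It collects the text of each pagination link from the snippet above and shows how that text would become the value stored under the context name.

```python
from html.parser import HTMLParser

# The pagination snippet from this README, hrefs left elided as in the source.
PAGINATION_HTML = """
<ul id="fig-pagination-nav">
  <li class="fig-pagination-current"><a href="…"> 1 </a></li>
  <li><a href="…"> 2 </a></li>
  <li><a href="…"> 3 </a></li>
</ul>
"""


class LinkTextCollector(HTMLParser):
    """Hypothetical helper: collect the stripped text of every <a> element."""

    def __init__(self):
        super().__init__()
        self.in_link = False
        self.labels = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.in_link = True
            self.labels.append("")

    def handle_data(self, data):
        # Only keep text that sits inside a link.
        if self.in_link:
            self.labels[-1] += data

    def handle_endtag(self, tag):
        if tag == "a":
            self.in_link = False
            self.labels[-1] = self.labels[-1].strip()


collector = LinkTextCollector()
collector.feed(PAGINATION_HTML)

# With "context": "page", each item parsed from a given link would carry
# the link's text under the key "page".
context_name = "page"
tagged = [{context_name: label} for label in collector.labels]
print(tagged)
```

Because the value is simply the link text, a navigation bar of category names would tag items with category labels instead of page numbers.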

To use an XPath selector instead of a CSS one, prefix it with a $.
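The Example section only covers foreach pagination. For sites that expose a single "next" link, a configuration might look like the sketch below. Everything in it is an assumption: the URL, the selectors, and the `config_next` name are placeholders, and whether other options such as context apply together with next is not stated in this README.

```python
# Hypothetical configuration: the URL and every selector below are
# placeholders, not taken from a real site.
config_next = {
    "name": "Example paginated comments",
    "encoding": "utf-8",
    "url": "http://example.com/articles/1/comments",
    # Followed from page to page until no matching link is found.
    "next": "a.next-page",
    "prefix": "ul.comments > li >",
    "description": {
        "author": "span.author",
        "comment": "p.body",
    },
}

# It would then be passed to Apifier as in the Example section:
# api = Apifier(config=config_next)
# data = api.load()
```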

For the LeFigaro.fr example, the result is:

    data = [
        {'comment': "…", 'author': '…', 'page': '1'},
        # etc.
    ]
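Since every row carries its context value, the flat list can be regrouped after loading. The following is a small sketch with made-up sample rows shaped like the result above; `group_by_context` is a hypothetical helper, not part of Apifier.

```python
from collections import defaultdict

# Sample rows shaped like the result above; the values are placeholders.
data = [
    {"comment": "first comment", "author": "alice", "page": "1"},
    {"comment": "second comment", "author": "bob", "page": "1"},
    {"comment": "third comment", "author": "carol", "page": "2"},
]


def group_by_context(rows, context="page"):
    """Group rows by their context value, dropping the context key itself."""
    grouped = defaultdict(list)
    for row in rows:
        item = {key: value for key, value in row.items() if key != context}
        grouped[row[context]].append(item)
    return dict(grouped)


by_page = group_by_context(data)
for page, comments in sorted(by_page.items()):
    print(page, len(comments))
```

Note that the context values are strings, so sorting them is lexicographic; convert to `int` first if numeric page order matters.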
