Skip to content
/ apifier Public

Apifier is a very simple HTML parser written in Python based on CSS selectors

License

Notifications You must be signed in to change notification settings

luxcem/apifier

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

41 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Apifier

Apifier is a very simple HTML parser written in Python.

It aims to parse HTML documents in a declarative way using css or xpath selectors. Its main purpose is to parse tabular and/or paginated data.

Install

Apifier is available for python 3

Build Status codecov PyPI version

pip install apifier

Example

Getting all comments from an article at "LeFigaro.fr"

from apifier import Apifier

config = {
    "name": "FigaroBot article comments",
    "encoding": "latin-1",
    "url": "http://www.lefigaro.fr/politique/le-scan/2016/07/21/25001-20160721ARTFIG00062-attentat-de-nice-la-droite-demande-une-enquete-independante.php",
    "foreach": "#fig-pagination-nav > li > a",
    "context": "page",
    "xpath": False,
    "prefix": "#reagir > div > div > div.fig-col.fig-col--comments > div:nth-child(3) > ul > li > article >",
    "description": {
        "author": "div.fig-comment-header a",
        "comment": "div.fig-comment-msg p"
    }
}

api = Apifier(config=config)
data = api.load()

Config

  • name : name of the current configuration
  • encoding : is the encoding the page is using, data will be converted from this encoding to utf-8 for sanity
  • url : page url, first page in case of paginated data
  • xpath: boolean, set to true if selectors are xpath instead of css
  • next : selector for a "next" link, apifier will crawl pages with next link until none is found
  • foreach : selector for the pagination links int this example pagination looks like :
    <ul id="fig-pagination-nav">
      <li class="fig-pagination-current"><a href="…"> 1 </a></li>
      <li><a href="…"> 2 </a></li>
      <li><a href="…"> 3 </a></li>
    </ul>
    
  • context : each data will be associated with a special variable named after the content of the pagination link in this case, this content is just the page number, but the pagination mechanism can be used for othher purpose like categories
  • prefix : descriptors will be prefixed by this option
  • description : descriptor for content to parse, in this example, comment content and author name.

To use xpath selector instead of css write them prefixed by a $.

The result is :

    data =
    [
        {'comment': "…", 'author': '…', 'page': '1'}, etc
    ]

About

Apifier is a very simple HTML parser written in Python based on CSS selectors

Topics

Resources

License

Stars

Watchers

Forks

Packages

No packages published