
Python Web Scraping

This list contains Python libraries related to web scraping and data processing.

Network

  • General
    • urllib - network library (stdlib)
    • requests - network library
    • grab - network library (pycurl based)
    • pycurl - network library (binding to libcurl)
    • urllib3 - Python HTTP library with thread-safe connection pooling, file post support, sanity friendly, and more.
    • httplib2 - network library
    • RoboBrowser - A simple, Pythonic library for browsing the web without a standalone web browser.
    • MechanicalSoup - A Python library for automating interaction with websites.
    • mechanize - Stateful programmatic web browsing.
    • socket - low-level networking interface (stdlib)
    • Unirest for Python - Unirest is a set of lightweight HTTP libraries available in multiple languages
    • hyper - HTTP/2 Client for Python
    • PySocks - Updated and actively maintained version of SocksiPy, with bug fixes and extra features. Acts as a drop-in replacement to the socket module.
  • Asynchronous
    • treq - requests like API (twisted based)
    • aiohttp - http client/server for asyncio (PEP-3156)

Web-Scraping Frameworks

  • Full Featured Crawlers
    • grab - web-scraping framework (pycurl/multicurl based)
    • scrapy - web-scraping framework (twisted based).
    • pyspider - A powerful spider system.
    • cola - A distributed crawling framework.
  • Other
    • portia - Visual scraping for Scrapy.
    • restkit - HTTP resource kit for Python. It lets you easily access HTTP resources and build objects around them.
    • requests-html - Pythonic HTML Parsing for Humans.
    • demiurge - PyQuery-based scraping micro-framework.
    • ScrapydWeb - A full-featured web UI for Scrapyd cluster management, which supports Scrapy Log Analysis & Visualization, Auto Packaging, Timer Tasks, Email Notice and so on.

HTML/XML Parsing

  • General
    • lxml - efficient HTML/XML processing library. Supports XPath. Written in C.
    • cssselect - working with DOM tree with CSS selectors
    • pyquery - working with DOM tree with jQuery-like selectors
    • BeautifulSoup - slow HTML/XML processing library, written in pure Python
    • html5lib - builds DOM of HTML/XML document according to WHATWG spec. That spec is used in all modern browsers.
    • feedparser - parsing of RSS/ATOM feeds.
    • MarkupSafe - Implements a XML/HTML/XHTML Markup safe string for Python.
    • xmltodict - Makes working with XML feel like working with JSON.
    • xhtml2pdf - HTML/CSS to PDF converter.
    • untangle - Converts XML documents to Python objects for easy access.
    • hodor - Configuration driven wrapper around lxml and cssselect.
    • chopper - Tool to extract a part from HTML page with corresponding CSS rules and preserving correct HTML.
    • selectolax - Python bindings to Modest engine (fast HTML5 parser with CSS selectors).
  • Sanitizing
    • Bleach - cleaning of HTML (requires html5lib)
    • sanitize - Bringing sanity to world of messed-up data.

Text Processing

Libraries for parsing and manipulating plain texts.

  • General

    • difflib - (Python standard library) Helpers for computing deltas.
    • Levenshtein - Fast computation of Levenshtein distance and string similarity.
    • fuzzywuzzy - Fuzzy String Matching.
    • esmre - Regular expression accelerator.
    • ftfy - Makes Unicode text less broken and more consistent automagically.
  • Transliteration

    • unidecode - ASCII transliterations of Unicode text.
  • Character encoding

    • uniout - Print readable chars instead of the escaped string.
    • chardet - Python 2/3 compatible character encoding detector.
    • xpinyin - A library to translate Chinese hanzi (漢字) to pinyin (拼音).
    • pangu.py - Spacing texts for CJK and alphanumerics.
    • cchardet - High-speed universal character encoding detector (binding to uchardet).
  • Slugify

    • awesome-slugify - A Python slugify library that can preserve unicode.
    • python-slugify - A Python slugify library that translates unicode to ASCII.
    • unicode-slugify - A slugifier that generates unicode slugs.
    • pytils - Simple tools for processing strings in Russian (including pytils.translit.slugify)
  • General Parser

    • PLY - Implementation of lex and yacc parsing tools for Python
    • pyparsing - A general purpose framework for generating parsers.
  • Human names

  • Phone Number

    • phonenumbers - Parsing, formatting, storing and validating international phone numbers.
  • User-agent string

  • robots.txt

    • reppy - Modern robots.txt Parser for Python
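
To illustrate the kind of string comparison the difflib entry above refers to, here is a minimal sketch using only the standard library. The helper names `similarity` and `closest` are hypothetical and chosen for this example; they are not part of difflib.

```python
import difflib


def similarity(a: str, b: str) -> float:
    """Return a ratio in [0, 1]; 1.0 means the two strings are identical."""
    return difflib.SequenceMatcher(None, a, b).ratio()


def closest(word, candidates, cutoff=0.6):
    """Return the best fuzzy match for `word`, or None if nothing clears `cutoff`."""
    matches = difflib.get_close_matches(word, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None


if __name__ == "__main__":
    # Hypothetical input: a misspelled library name looked up in a short list.
    print(closest("reqests", ["requests", "urllib3", "lxml"]))
```

For large inputs, the Levenshtein and fuzzywuzzy entries above are likely to be a better fit, since difflib is implemented in pure Python.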

Specific Formats Processing

Libraries for parsing and manipulating specific text formats.

  • General

    • tablib - A module for Tabular Datasets in XLS, CSV, JSON, YAML.
    • textract - Extract text from any document, Word, PowerPoint, PDFs, etc.
    • messytables - Tools for parsing messy tabular data
    • rows - A common, beautiful interface to tabular data, no matter the format (currently CSV, HTML, XLS, TXT -- more coming!)
  • Office

    • python-docx - Reads, queries and modifies Microsoft Word 2007/2008 docx files.
    • xlwt / xlrd - Writing and reading data and formatting information from Excel files.
    • XlsxWriter - A Python module for creating Excel .xlsx files.
    • xlwings - A BSD-licensed library that makes it easy to call Python from Excel and vice versa.
    • openpyxl - A library for reading and writing Excel 2010 xlsx/xlsm/xltx/xltm files.
    • Marmir - Takes Python data structures and turns them into spreadsheets.
  • PDF

    • PDFMiner - A tool for extracting information from PDF documents.
    • PyPDF2 - A library capable of splitting, merging and transforming PDF pages.
    • ReportLab - Allows rapid creation of rich PDF documents.
    • pdftables - Extract tables from PDF files directly
  • Markdown

    • Python-Markdown - A Python implementation of John Gruber’s Markdown.
    • Mistune - Fast, full-featured pure-Python Markdown parser.
    • markdown2 - A fast and complete Python implementation of Markdown
  • YAML

    • PyYAML - YAML implementations for Python.
  • CSS

  • ATOM/RSS

  • SQL

    • sqlparse - A non-validating SQL parser.
  • HTTP

    • http-parser - HTTP request/response parser for python in C
  • Microformats

    • opengraph - A Python module to parse the Open Graph Protocol tags
  • Portable Executable

    • pefile - A multi-platform module to parse and work with Portable Executable (aka PE) files.

  • PSD

Natural Language Processing

Libraries for working with human languages.

  • NLTK - A leading platform for building Python programs to work with human language data.
  • Pattern - A web mining module for Python. It has tools for natural language processing and machine learning, among others.
  • TextBlob - Providing a consistent API for diving into common NLP tasks. Stands on the giant shoulders of NLTK and Pattern.
  • jieba - Chinese Words Segmentation Utilities.
  • SnowNLP - A library for processing Chinese text.
  • loso - Another Chinese segmentation library.
  • genius - Chinese word segmentation based on Conditional Random Fields.
  • langid.py - Stand-alone language identification system.
  • Korean - A library for Korean morphology.
  • pymorphy2 - Morphological analyzer (POS tagger + inflection engine) for Russian language.
  • PyPLN - A distributed pipeline for natural language processing, made in Python. The goal of the project is to create an easy way to use NLTK for processing big corpora, with a Web interface.
  • langdetect - Port of Google's language-detection library to Python

Browser automation and emulation

  • Browsers

    • selenium - automating real browsers (Chrome, Firefox, Opera, IE)
    • Ghost.py - wrapper of QtWebKit (requires PyQT)
    • Spynner - wrapper of QtWebKit (requires PyQT)
    • Splinter - universal API to browser emulators (selenium webdrivers, django client, zope)
    • Requestium - Integration layer between Requests and Selenium for automation of web actions.
    • Splash - Lightweight, scriptable browser as a service with an HTTP API.
  • Headless tools

    • pyppeteer - Headless chrome/chromium automation library (unofficial port of puppeteer)
    • xvfbwrapper - Python wrapper for running a display inside X virtual framebuffer (Xvfb)

Multiprocessing

  • threading - standard Python library for running threads. Effective for I/O-bound tasks; of little use for CPU-bound tasks because of the Python GIL.
  • multiprocessing - standard Python library for running processes.
  • celery - An asynchronous task queue/job queue based on distributed message passing.
  • rq - Simple job queues for Python
  • concurrent-futures - The concurrent.futures module provides a high-level interface for asynchronously executing callables.
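
The point above about threads suiting I/O-bound work can be sketched with `concurrent.futures` from the standard library. This is a minimal illustration, not a real downloader: `fake_fetch` is a hypothetical stand-in that sleeps instead of making a network request, and `fetch_all` is a name invented for this example.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_fetch(url: str) -> str:
    """Hypothetical stand-in for a blocking HTTP request."""
    time.sleep(0.05)  # simulates waiting on the network; the GIL is released here
    return f"body of {url}"


def fetch_all(urls, max_workers=4):
    """Run fake_fetch over all URLs in a thread pool, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fake_fetch, urls))


if __name__ == "__main__":
    print(fetch_all(["page-1", "page-2", "page-3"]))
```

For CPU-bound work, swapping in `ProcessPoolExecutor` from the same module is the usual way around the GIL, at the cost of pickling arguments and results between processes.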

Asynchronous

Libraries for asynchronous networking programming.

  • asyncio - (Python standard library in Python 3.4+) Asynchronous I/O, event loop, coroutines and tasks.
  • Twisted - An event-driven networking engine.
  • Tornado - A Web framework and asynchronous networking library.
  • pulsar - Event-driven concurrent framework for Python.
  • diesel - Greenlet-based event I/O Framework for Python.
  • gevent - A coroutine-based Python networking library that uses greenlet.
  • eventlet - Asynchronous framework with WSGI support.
  • Tomorrow - Magic decorator syntax for asynchronous code.
  • grequests - Make asynchronous HTTP Requests easily.
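
As a rough sketch of the coroutine style that the asyncio entry above describes, the following runs several simulated requests concurrently on one event loop. The names `fake_fetch_async`, `crawl` and `run_crawl` are hypothetical, and `asyncio.sleep` stands in for real network I/O (which a client such as aiohttp would provide). It assumes Python 3.7 or later for `asyncio.run`.

```python
import asyncio


async def fake_fetch_async(url: str, delay: float = 0.01) -> str:
    """Hypothetical stand-in for an asynchronous HTTP request."""
    await asyncio.sleep(delay)  # yields control to the event loop while "waiting"
    return f"body of {url}"


async def crawl(urls):
    """Schedule all fetches at once; results come back in input order."""
    return await asyncio.gather(*(fake_fetch_async(u) for u in urls))


def run_crawl(urls):
    """Synchronous entry point that starts and stops the event loop."""
    return asyncio.run(crawl(urls))


if __name__ == "__main__":
    print(run_crawl(["page-1", "page-2", "page-3"]))
```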

Queue

  • celery - An asynchronous task queue/job queue based on distributed message passing.
  • huey - Little multi-threaded task queue.
  • mrq - Mr. Queue - A distributed worker task queue in Python using Redis & gevent.
  • RQ - lightweight task queue manager based on redis
  • simpleq - A simple, infinitely scalable, Amazon SQS based queue.
  • python-gearman - Python API for Gearman

Cloud Computing

Email

Libraries for parsing email.

  • flanker - An email address and MIME parsing library.
  • Talon - Mailgun library to extract message quotations and signatures.

URL and Network Address Manipulation

Libraries for parsing/modifying URLs and network addresses.

  • URL
    • furl - A small Python library that makes manipulating URLs simple.
    • purl - A simple, immutable URL class with a clean API for interrogation and manipulation.
    • urllib.parse - interface to break Uniform Resource Locator (URL) strings up in components (addressing scheme, network location, path etc.), to combine the components back into a URL string, and to convert a “relative URL” to an absolute URL given a “base URL.” (stdlib)
    • tldextract - Accurately separate the TLD from the registered domain and subdomains of a URL, using the Public Suffix List.
  • Network Address
    • netaddr - A Python library for representing and manipulating network addresses.
    • micawber - A small library for extracting rich content from URLs.
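
The three operations that the urllib.parse entry above names (splitting a URL into components, recombining them, and resolving a relative URL against a base) can be sketched as follows. The helper names `strip_fragment` and `absolutize` and the example.com URLs are assumptions made for this illustration.

```python
from urllib.parse import urljoin, urlsplit, urlunsplit


def strip_fragment(url: str) -> str:
    """Split a URL, drop its #fragment, and join the parts back together."""
    parts = urlsplit(url)  # SplitResult(scheme, netloc, path, query, fragment)
    return urlunsplit(parts._replace(fragment=""))


def absolutize(base: str, href: str) -> str:
    """Resolve a possibly relative link against the page it was found on."""
    return urljoin(base, href)


if __name__ == "__main__":
    page = "https://example.com/a/b?x=1#top"  # hypothetical page URL
    print(urlsplit(page).netloc)
    print(strip_fragment(page))
    print(absolutize(page, "../c"))
```

Normalizing links this way before queuing them is a common way to keep a crawler from visiting the same page twice under different spellings.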

Web Content Extracting

Libraries for extracting web contents.

  • Text and metadata from HTML pages
    • newspaper - News extraction, article extraction and content curation in Python.
    • python-goose - HTML Content/Article Extractor.
    • scrapely - Library for extracting structured data from HTML pages. Given some example web pages and the data to be extracted, scrapely constructs a parser for all similar pages.
  • Metadata from HTML pages
    • htmldate - Find creation date using common structural patterns or text-based heuristics.
    • lassie - Web Content Retrieval for Humans.
  • Text/Data from HTML pages
    • html2text - Convert HTML to Markdown-formatted text.
    • libextract - Extract data from websites.
    • python-readability - Fast Python port of arc90's readability tool.
    • sumy - A module for automatic summarization of text documents and HTML pages.
  • Images
    • Haul - An Extensible Image Crawler.
  • Video
    • you-get - A YouTube/Youku/Niconico video downloader written in Python 3.
    • youtube-dl - A small command-line program to download videos from YouTube.
  • Wiki
    • WikiTeam - Tools for downloading and preserving wikis.
  • Sitemap
    • linkchecker - check links in web documents or full websites
    • python-sitemap - Mini website crawler to make sitemap from a website.

WebSocket

Libraries for working with WebSocket.

  • Crossbar - Open-source Unified Application Router (Websocket & WAMP for Python on Autobahn).
  • AutobahnPython - WebSocket & WAMP for Python on Twisted and asyncio.
  • WebSocket-for-Python - WebSocket client and server library for Python 2 and 3 as well as PyPy.

DNS Resolving

  • dnsyo - Check your DNS against over 1500 global DNS servers.
  • pycares - interface to c-ares. c-ares is a C library that performs DNS requests and name resolutions asynchronously

Computer Vision

  • OpenCV - Open Source Computer Vision Library.
  • SimpleCV - Concise, readable interface for cameras, image manipulation, feature extraction, and format conversion (based on OpenCV).
  • mahotas - fast computer vision algorithms (all implemented in C++) operating over numpy arrays.

Proxy Server

  • scylla - Intelligent proxy pool for Humans
  • ProxyBroker - Proxy [Finder | Checker | Server]. HTTP(S) & SOCKS
  • shadowsocks - A fast tunnel proxy that helps you bypass firewalls (TCP & UDP support, User management API, TCP Fast Open, Workers and graceful restart, Destination IP blacklist)
  • tproxy - tproxy is a simple TCP routing proxy (layer 7) built on Gevent that lets you configure the routine logic in Python

Other python lists
