
Ruby Web Scraping

This list contains Ruby libraries related to web scraping and data processing.

Network

  • httparty - Makes HTTP fun again!
  • nestful - Simple Ruby HTTP/REST client with a sane API
  • EM-HTTP-Request - EventMachine-based asynchronous HTTP client
  • excon - Usable, fast, simple Ruby HTTP 1.1. It works great as a general HTTP(S) client and is particularly well suited to usage in API clients.
  • Faraday - an HTTP client lib that provides a common interface over many adapters (such as Net::HTTP) and embraces the concept of Rack middleware when processing the request/response cycle.
  • Http Client - Gives something like the functionality of libwww-perl (LWP) in Ruby.
  • HTTP - The HTTP Gem: a simple Ruby DSL for making HTTP requests.
  • Http-2 - Pure Ruby implementation of the HTTP/2 protocol
  • Patron - Patron is a Ruby HTTP client library based on libcurl.
  • RESTClient - Simple HTTP and REST client for Ruby, inspired by microframework syntax for specifying actions.
  • Savon - Savon is a SOAP client for the Ruby programming language.
  • Sawyer - Secret user agent of HTTP, built on top of Faraday.
  • Spyke - Interact with REST services in an ActiveRecord-like manner.
  • Typhoeus - Typhoeus wraps libcurl in order to make fast and reliable requests.
  • Mechanize - Mechanize is a Ruby library that makes automated web interaction easy.

Web-Scraping Frameworks

  • upton - A batteries-included framework for easy web-scraping
  • Wombat - Web scraper with an elegant DSL that parses structured data from web pages.
  • Anemone - web spider framework that can spider a domain and collect useful information about the pages it visits
  • Spidr - versatile Ruby web spidering library that can spider a site, multiple domains, certain links or infinitely. Spidr is designed to be fast and easy to use.
  • kimuraframework - Modern web scraping framework written in Ruby that works out of the box with Headless Chromium/Firefox, PhantomJS, or simple HTTP requests, and lets you scrape and interact with JavaScript-rendered websites
  • arachnid2 - A simple, fast, framework-less crawler with sensible defaults and lots of options. Crawls the page and runs your code directly against either Typhoeus responses or a Watir browser.

HTML/XML Parsing

  • nokogiri - HTML, XML, SAX, and Reader parser with XPath and CSS selector support
  • loofah - HTML/XML manipulation and sanitization based on Nokogiri
  • HappyMapper - allows you to parse XML data and convert it quickly and easily into Ruby data structures.
  • HTML::Pipeline - HTML processing filters and utilities.
  • Oga - An XML/HTML parser written in Ruby. Oga does not require system libraries such as libxml, making it easier and faster to install on various platforms.
  • Ox - A fast XML parser and Object marshaller.
  • ROXML - Custom mapping and bidirectional marshalling between Ruby and XML using annotation-style class methods, via Nokogiri or LibXML.
  • equivalent-xml - Easy tests of equivalency of XML documents for Nokogiri::XML

Text Processing

Libraries for parsing and manipulating plain text.

  • General
    • Kiba - library for writing reliable, concise, well-tested & maintainable data-processing code
    • diffy - a convenient way to generate a diff from two strings or files
    • CommonRegexRuby - find many kinds of common information in a string
  • Phone number
    • GlobalPhone - Parse, validate, and format phone numbers in Ruby using Google's libphonenumber database.
  • Country names
    • i18n_data - country/language names and 2-letter-code pairs, in 85 languages, for country/language i18n.
    • normalize_country - Convert country names and codes to a standard; includes a conversion program for XMLs, CSVs and DBs.
  • User agent
    • Device Detector - A precise and fast user agent parser and device detector, backed by the largest and most up-to-date user agent database.
  • General parser
    • Parslet - A small Ruby library for constructing parsers in the PEG (Parsing Expression Grammar) fashion.
    • Treetop - PEG (Parsing Expression Grammar) parser.
    • rley - Ruby gem implementing a general context-free grammar parser based on Earley's algorithm
  • Date & time
    • Chronic - A natural language date/time parser written in pure Ruby.
    • yymmdd - Tiny DSL for idiomatic date parsing and formatting.
    • Chronic Between - a simple Ruby natural language parser for date and time ranges
    • Chronic Duration - a simple Ruby natural language parser for elapsed time
    • Kronic - a dirt simple library for parsing and formatting human readable dates
    • Nickel - extracts date, time, and message information from naturally worded text
    • Tickle - a natural language parser for recurring events
  • Human Names
    • nameable - A Ruby gem that provides parsing and output of person names, as well as Gender & Ethnicity matching
  • N-grams
    • N-Gram - N-Gram generator in Ruby
    • ngram - break words and phrases into ngrams
    • raingrams - a flexible and general-purpose ngrams library written in Ruby
  • Text Similarity
    • FuzzyMatch - find a needle in a haystack based on string similarity and regular expression rules
    • fuzzy-string-match - fuzzy string matching library for ruby
    • FuzzyTools - In-memory TF-IDF fuzzy document finding with a fancy default tokenizer tuned on diverse record linkage datasets for easy out-of-the-box use
    • Going the Distance - contains scripts that do various distance calculations
    • hotwater - Fast Ruby FFI string edit distance algorithms
    • levenshtein-ffi - fast string edit distance computation, using the Damerau-Levenshtein algorithm
    • TF-IDF - Term Frequency - Inverse Document Frequency in Ruby
    • tf-idf-similarity - calculate the similarity between texts using tf*idf
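
To give a sense of what the N-grams and Text Similarity entries above compute, here is a minimal pure-Ruby sketch that uses only the standard library. The method names `ngrams` and `levenshtein` are hypothetical helpers written for this example, not the API of any gem in the list; the gems listed are generally faster and more complete.

```ruby
# Illustrative only: `ngrams` and `levenshtein` are hypothetical helper names,
# not the API of any gem listed above.

# Split text on whitespace and return word n-grams
# (arrays of n consecutive words).
def ngrams(text, n)
  text.split.each_cons(n).to_a
end

# Dynamic-programming Levenshtein edit distance: insertions, deletions,
# and substitutions each cost 1. Keeps only one previous row in memory.
def levenshtein(a, b)
  prev = (0..b.length).to_a
  a.each_char.with_index(1) do |ca, i|
    curr = [i]
    b.each_char.with_index(1) do |cb, j|
      cost = ca == cb ? 0 : 1
      curr << [prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost].min
    end
    prev = curr
  end
  prev.last
end

p ngrams("the quick brown fox", 2)
p levenshtein("kitten", "sitting")
```

For large inputs or many comparisons, the FFI-backed gems above are likely a better fit than this plain-Ruby version.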

Specific Formats Processing

Libraries for parsing and manipulating specific text formats.

  • General
    • markup - GitHub library to convert Markdown, rst, creole, etc. into HTML
  • Office
    • Yomu - Read text and metadata from files and documents (.doc, .docx, .pages, .odt, .rtf, .pdf)
    • spreadsheet - The Spreadsheet Library is designed to read and write Spreadsheet Documents.
    • roo - Roo implements read access for all spreadsheet types and read/write access for Google spreadsheets.
    • google-spreadsheet-ruby - This is a library to read/write Google Spreadsheet.
    • rubyXL - rubyXL is a gem which allows the parsing, creation, and manipulation of Microsoft Excel (.xlsx/.xlsm) Documents
    • remote_table - Open local or remote XLSX, XLS, ODS, CSV (comma separated), TSV (tab separated), other delimited, fixed-width files, and Google Docs.
    • sheets - Work with spreadsheets easily in a native ruby format.
    • workbook - Workbook contains workbooks, as in a table, contains rows, contains cells, reads/writes excel, ods and csv and tab separated files...
    • oxcelix - A fast Excel 2007/2010 (.xlsx) file parser that returns a collection of Matrix objects
    • wrap_excel - WrapExcel wraps win32ole to make Excel operations easy to use from Ruby. See the README for a detailed description.
  • libpcap
    • PacketFul - A library for reading and writing packets to an interface or to a libpcap-formatted file.
  • JSON
    • JsonCompare - Returns the difference between two JSON files
    • JSON - includes pure Ruby and C implementation for JSON.
    • JSON::Stream - a streaming JSON parser that generates SAX-like events.
    • YAJL - a streaming JSON parsing and encoding library for Ruby (C bindings to YAJL).
    • OJ - Optimized JSON, as the name implies, was written to provide speed optimized JSON handling. So far it has achieved that, and is about 2 times faster than any other Ruby JSON parser, and 3 or more times faster at serializing JSON.
  • Markdown
    • kramdown - Kramdown is yet-another-markdown-parser but fast, pure Ruby, using a strict syntax definition and supporting several common extensions.
    • Maruku - A pure-Ruby Markdown-superset interpreter.
    • Redcarpet - A fast, safe and extensible Markdown to (X)HTML parser.
  • ATOM/RSS
    • Feed normalizer - Extensible Ruby wrapper for Atom and RSS parsers.
    • Feedjira - A feed fetching and parsing library.
    • Ratom - A fast, libxml based, Ruby Atom library.
    • Simple rss - A simple, flexible, extensible, and liberal RSS and Atom reader.
  • BSON
  • MessagePack
    • MessagePack - an efficient binary serialization format. It lets you exchange data among multiple languages like JSON but it's faster and smaller. For example, small integers (like flags or error code) are encoded into a single byte, and typical short strings only require an extra byte in addition to the strings themselves. See http://msgpack.org
  • Protobuf
    • Protobuf - Ruby implementation for Protocol Buffers.
  • RDF
    • rdf - pure-Ruby library for working with Resource Description Framework (RDF) data
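
As a quick orientation for the JSON entries above, this sketch round-trips a small document with the `json` library that ships with Ruby. The sample payload is made up for illustration, and the other gems listed expose their own APIs, which are not shown here.

```ruby
require "json"

# Made-up sample payload for illustration.
raw = '{"title": "Example", "links": ["https://example.com/a", "https://example.com/b"]}'

# Parse into plain Ruby hashes and arrays; symbolize_names gives symbol keys.
page = JSON.parse(raw, symbolize_names: true)
puts page[:title]
puts page[:links].length

# Serialize the Ruby structure back to a JSON string.
puts JSON.generate(page)
```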

Natural Language Processing

Libraries for working with human languages.

  • General
    • Treat - Treat is a toolkit for natural language processing and computational linguistics in Ruby
    • Pragmatic Segmenter - Pragmatic Segmenter is a rule-based sentence boundary detection gem that works out-of-the-box across many languages.
    • Text - A collection of text algorithms including Levenshtein distance, Metaphone, Soundex 2, Porter stemming & White similarity.
    • whatlanguage - a language detection library for Ruby that uses bloom filters for speed
    • nlp - NLP tools for the Polish language
    • NlpToolz - Basic NLP tools, mostly based on OpenNLP; currently implements a sentence finder, tokenizer, and POS tagger, plus the Berkeley Parser
    • Open NLP (Ruby bindings)
    • Stanford Core NLP (Ruby bindings)
    • ve - a linguistic framework that's easy to use
    • zipf - a collection of various NLP tools and libraries
    • ruby-ner - named entity recognition with Stanford NER and Ruby
    • ruby-nlp - Ruby binding for the Stanford POS Tagger and Named Entity Recognizer
    • linkparser - a Ruby binding for the Abiword version of CMU's Link Grammar, a syntactic parser of English
  • Part-of-Speech Tagger
    • engtagger - English Part-of-Speech Tagger Library; a Ruby port of Lingua::EN::Tagger
    • rbtagger - a simple Ruby rule-based part-of-speech tagger
    • TreeTagger for Ruby - Ruby-based wrapper for the TreeTagger by Helmut Schmid
  • Sentence segmentation
  • Stemmers
  • Summarization
    • Epitome - A small gem to make your text shorter; an implementation of the Lexrank algorithm
    • ots - Ruby bindings to open text summarizer
    • summarize - Ruby C wrapper for Open Text Summarizer
  • Tokenizers
    • Jieba - Chinese tokenizer and segmenter (JRuby)
    • MeCab - Japanese morphological analyzer [MeCab Heroku buildpack]
    • NLP Pure - natural language processing algorithms implemented in pure Ruby with minimal dependencies
    • rseg - a Chinese Word Segmentation (中文分词) routine in pure Ruby
    • thailang4r - Thai tokenizer
    • tiny_segmenter - Ruby port of TinySegmenter.js for tokenizing Japanese text
    • tokenizer - a simple multilingual tokenizer
  • Word Count
    • wc - a rubygem to count word occurrences in a given text
    • word_count - a word counter for String and Hash in Ruby
    • Word Count Analyzer - analyzes a string for potential areas of the text that might cause word count discrepancies depending on the tool used
    • WordsCounted - a highly customisable Ruby text analyser

Browser automation and emulation

  • selenium - A browser automation framework and ecosystem
  • Watir - Watir implementation built on WebDriver's Ruby bindings
  • capybara-webkit - A Capybara driver for headless WebKit to test JavaScript web apps
  • poltergeist - A PhantomJS driver for Capybara

Multiprocessing

  • Celluloid - Actor-based concurrent object framework for Ruby
  • Parallel - Run any code in parallel processes (to use all CPUs) or threads (to speed up blocking operations).
  • Concurrent Ruby - Modern concurrency tools including agents, futures, promises, thread pools, supervisors, and more.
  • childprocess - Cross-platform Ruby library for managing child processes.
  • forkoff - brain-dead simple parallel processing for Ruby.
  • posix-spawn - Fast Process::spawn for Ruby >= 1.8.7 based on the posix_spawn() system interfaces.
  • thread - extensions to the thread library (includes thread pool).
  • Spawnling - spawn gem for Rails to easily fork or thread long-running code blocks.

Asynchronous

Libraries for asynchronous network programming.

  • EventMachine - event-driven I/O and lightweight concurrency library

Queue

  • Resque - A Redis-backed Ruby library for creating background jobs, placing them on multiple queues.
  • Qu - A Ruby library for queuing and processing background jobs.
  • Sidekiq - A full-featured background processing framework for Ruby. It aims to be simple to integrate with any modern Rails application and to offer much higher performance than other existing solutions.
  • Sneakers - A fast background processing framework for Ruby and RabbitMQ
  • Backburner - Backburner is a beanstalkd-powered job queue that can handle a very high volume of jobs.
  • Delayed::Job - Database backed asynchronous priority queue.
  • Que - A Ruby job queue that uses PostgreSQL's advisory locks for speed and reliability.
  • Shoryuken - A super efficient AWS SQS thread based message processor for Ruby.
  • Sucker Punch - A single process background processing library using Celluloid. Aimed to be Sidekiq's little brother.

Email

Libraries for parsing email.

  • mail - A Really Ruby Mail Library

URL Manipulation

Libraries for parsing URLs.

  • addressable - Addressable is a replacement for the URI implementation that is part of Ruby's standard library. It more closely conforms to RFC 3986, RFC 3987, and RFC 6570 (level 4), providing support for IRIs and URI templates.
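
A common scraping task that URL libraries handle is resolving relative links against the URL of the page they came from. The sketch below uses Ruby's standard `URI` library (the implementation that Addressable is described as replacing), so it runs without extra gems; the URLs are placeholders, and Addressable's own API is not shown.

```ruby
require "uri"

# Placeholder page URL, as a scraper might have fetched it.
page_url = URI.parse("https://example.com/articles/index.html?page=2")
puts page_url.host
puts page_url.path
puts page_url.query

# Resolve a relative href found on that page into an absolute URL.
absolute = URI.join(page_url.to_s, "images/logo.png")
puts absolute
```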

Web Content Extracting

Libraries for extracting web contents.

  • Metainspector - scrapes a given URL, and returns its title, meta description, meta keywords, an array with all the links, all the images in it, etc
  • LinkThumbnailer - Ruby gem that generates thumbnail images and videos from a given URL, much like the link previews on popular social websites.
  • docsplit - Docsplit is a command-line utility and Ruby library for splitting apart documents into their component parts
  • Ruby Readability - a tool for extracting the primary readable content of a webpage

WebSocket

Libraries for working with WebSocket.

  • em-websocket - EventMachine based WebSocket server
  • Faye - A set of tools for simple publish-subscribe messaging between web clients.
  • Firehose - Build realtime Ruby web applications.
  • Slanger - Open Pusher implementation compatible with Pusher libraries.

DNS Resolving

  • em-resolve-replace - EventMachine-aware pure Ruby DNS resolution
  • Celluloid::DNS - a high-performance DNS client resolver and server which can be easily integrated into other projects or used as a stand-alone daemon. It was forked from RubyDNS which is now implemented in terms of this library.

Computer Vision

Geolocation

  • geocoder - A complete geocoding solution for Ruby. With Rails it adds geocoding (by street or IP address), reverse geocoding (find street address based on given coordinates), and distance queries.
  • Geokit - Geokit gem provides geocoding and distance/heading calculations.
  • geoip - Searches a GeoIP database for a given host or IP address, and returns information about the country where the IP address is allocated, and the city, ISP and other information.

Other Ruby Lists