Crawly


Overview

Crawly is an application framework for crawling websites and extracting structured data, which can be used for a wide range of applications, such as data mining, information processing, or historical archiving.

Requirements

  1. Elixir "~> 1.7"
  2. Works on Linux, Windows, OS X and BSD

Installation

  1. Generate a new Elixir project: mix new <project_name> --sup
  2. Add Crawly to your mix.exs file (a fuller mix.exs sketch follows this list):
    def deps do
      [{:crawly, "~> 0.5.0"}]
    end
  3. Fetch the dependency: mix deps.get
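
For reference, here is a minimal sketch of how the full mix.exs might look after step 2, assuming the generated project is called crawly_example (the project and module names are illustrative; only the :crawly dependency comes from this README):

defmodule CrawlyExample.MixProject do
  use Mix.Project

  def project do
    [
      app: :crawly_example,
      version: "0.1.0",
      elixir: "~> 1.7",
      start_permanent: Mix.env() == :prod,
      deps: deps()
    ]
  end

  # mix new <project_name> --sup generates an application callback module;
  # Crawly runs as its own OTP application, so nothing extra is needed here.
  def application do
    [
      extra_applications: [:logger],
      mod: {CrawlyExample.Application, []}
    ]
  end

  defp deps do
    [
      {:crawly, "~> 0.5.0"}
    ]
  end
end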

Quickstart

In this section we show how to bootstrap a small project and set up Crawly for data extraction.

  1. Create a new Elixir project: mix new crawly_example --sup
  2. Add Crawly to the dependencies (mix.exs file):
defp deps do
    [
      {:crawly, "~> 0.5.0"}
    ]
end
  3. Fetch dependencies: mix deps.get
  4. Define the crawling rules (a spider):
cat > lib/crawly_example/esl_spider.ex << EOF
defmodule EslSpider do
  @behaviour Crawly.Spider
  alias Crawly.Utils

  # base_url/0 is used to filter out requests that would leave the target domain.
  @impl Crawly.Spider
  def base_url(), do: "https://www.erlang-solutions.com"

  # init/0 returns the URLs the crawl starts from.
  @impl Crawly.Spider
  def init(), do: [start_urls: ["https://www.erlang-solutions.com/blog.html"]]

  @impl Crawly.Spider
  def parse_item(response) do
    # Follow the "read more" links found on the page.
    hrefs = response.body |> Floki.find("a.more") |> Floki.attribute("href")

    requests =
      Utils.build_absolute_urls(hrefs, base_url())
      |> Utils.requests_from_urls()

    # Extract the blog post title from the current page.
    title = response.body |> Floki.find("article.blog_post h1") |> Floki.text()

    %{
      :requests => requests,
      :items => [%{title: title, url: response.request_url}]
    }
  end
end
EOF
  5. Configure Crawly. By default Crawly does not require any configuration, but you will most likely want one to fine-tune the crawl; put something like the following in config/config.exs (a custom pipeline sketch follows this list):
config :crawly,
  closespider_timeout: 10,
  concurrent_requests_per_domain: 8,
  follow_redirects: true,
  closespider_itemcount: 1000,
  output_format: "csv",
  item: [:title, :url],
  item_id: :title,
  middlewares: [
    Crawly.Middlewares.DomainFilter,
    Crawly.Middlewares.UniqueRequest,
    Crawly.Middlewares.UserAgent
  ],
  pipelines: [
    Crawly.Pipelines.Validate,
    Crawly.Pipelines.DuplicatesFilter,
    Crawly.Pipelines.CSVEncoder
  ]
  6. Start the crawl:
  • iex -S mix
  • Crawly.Engine.start_spider(EslSpider)
  7. Results can be inspected with: cat /tmp/EslSpider.jl
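
The pipelines configured above are applied to every scraped item in order. If you need project-specific post-processing, you can add your own module to that list. Below is a minimal sketch of a custom pipeline, assuming the run/2 callback shape used by the built-in pipelines (the module name and the drop rule are illustrative, not part of Crawly):

defmodule CrawlyExample.Pipelines.RequireTitle do
  # Drops items whose :title is missing or empty; all other items pass
  # through unchanged. Returning {false, state} removes the item from
  # further processing, as the built-in pipelines do when filtering.
  def run(item, state) do
    case Map.get(item, :title) do
      nil -> {false, state}
      "" -> {false, state}
      _title -> {item, state}
    end
  end
end

To enable it, list the module under pipelines in the configuration above, for example before Crawly.Pipelines.CSVEncoder.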

Documentation

Documentation is available online at https://oltarasenko.github.io/crawly/#/ and in the docs directory.

Tutorial

The Crawly tutorial: https://oltarasenko.github.io/crawly/#/?id=crawly-tutorial

Roadmap

  1. Cookies support
  2. XPath support
  3. Pluggable HTTP client
  4. Project generators (spiders)
  5. Retries support
  6. UI for jobs management

We are looking for contributors

We would gladly accept your contributions!

Articles

  1. Blog post on Erlang Solutions website: https://www.erlang-solutions.com/blog/web-scraping-with-elixir.html

Example projects

  1. Blog crawler: https://github.com/oltarasenko/crawly-spider-example
  2. E-commerce websites: https://github.com/oltarasenko/products-advisor
  3. Car shops: https://github.com/oltarasenko/crawly-cars