Adding some more information on the added statistics.
Also changing some of the default options related to statistics.
nietaki committed May 15, 2017
1 parent 355a58c commit b0cf085
Showing 3 changed files with 17 additions and 3 deletions.
4 changes: 4 additions & 0 deletions README.md
@@ -19,6 +19,10 @@ Here's a rough diagram:

![crawlie architecture diagram](assets/crawlie_arch_v0.2.0.png)

## Statistics

If you're interested in the crawling statistics or want to track the progress in real time, see [`Crawlie.crawl_and_track_stats/3`](https://hexdocs.pm/crawlie/Crawlie.html#crawl_and_track_stats/3). It starts a [`Stats GenServer`](https://hexdocs.pm/crawlie/Crawlie.Stats.Server.html) in Crawlie's supervision tree, which accumulates the statistics for the crawling session.
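
For illustration, a minimal usage sketch. The `MyParserLogic` module, the `{results, stats_ref}` return shape, and the `Crawlie.Stats.Server.get_stats/1` call are assumptions based on the linked docs, not taken verbatim from this commit:

```elixir
# Sketch only - MyParserLogic is a hypothetical Crawlie.ParserLogic implementation;
# the {results, stats_ref} shape and get_stats/1 call are assumptions.
urls = ["https://example.com/"]

{results, stats_ref} = Crawlie.crawl_and_track_stats(urls, MyParserLogic)

# consume the results, then read the statistics accumulated for this session
results |> Enum.to_list() |> IO.inspect()
stats_ref |> Crawlie.Stats.Server.get_stats() |> IO.inspect()
```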

## Configuration

See [the docs](https://hexdocs.pm/crawlie/Crawlie.html#crawl/3) for supported options.
12 changes: 11 additions & 1 deletion lib/crawlie.ex
@@ -23,6 +23,7 @@ defmodule Crawlie do
the options for [HTTPoison](https://hexdocs.pm/httpoison/HTTPoison.html#request/5),
as well as Crawlie-specific options.
It is perfectly fine to run multiple crawling sessions at the same time; they're independent.
## Arguments
- `source` - a `Stream` or an `Enum` containing the urls to crawl
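
For example, a `source` could be as simple as a list of url strings, or a lazy `Stream` (the file name below is made up):

```elixir
# any Enum of url strings works as a source...
urls = ["https://example.com/", "https://example.com/about"]

# ...as does a lazy Stream, e.g. reading seed urls from a hypothetical file,
# one url per line, skipping blank lines
urls_stream =
  File.stream!("seed_urls.txt")
  |> Stream.map(&String.trim/1)
  |> Stream.reject(&(&1 == ""))
```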
@@ -64,7 +65,16 @@ defmodule Crawlie do
@doc """
Crawls the urls provided in `source` using the given `Crawlie.ParserLogic`, and collects the crawling statistics.
The statistics are accumulated independently, per `Crawlie.crawl_and_track_stats/3` call.
See `Crawlie.crawl/3` for details.
## Additional options
All the options from `Crawlie.crawl/3` apply here as well.
- `:max_fetch_failed_uris_tracked` - `100` by default. The maximum number of uris kept in the `Crawlie.Stats.Server` for which fetching failed.
- `:max_parse_failed_uris_tracked` - `100` by default. The maximum number of uris kept in the `Crawlie.Stats.Server` for which parsing failed.
"""
def crawl_and_track_stats(source, parser_logic, options \\ []) do
ref = StatsServer.start_new()
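
To make the new limits concrete, here is a sketch of overriding them for a single crawling session. The option names come from this commit; `MyParserLogic` and the `{results, stats_ref}` return shape are assumptions:

```elixir
# Sketch: keep more failed uris in the Stats.Server than the new default (100).
# MyParserLogic and the return shape are assumptions, not from this commit.
urls = ["https://example.com/"]

{results, stats_ref} =
  Crawlie.crawl_and_track_stats(
    urls,
    MyParserLogic,
    max_fetch_failed_uris_tracked: 500,
    max_parse_failed_uris_tracked: 500
  )
```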
4 changes: 2 additions & 2 deletions lib/crawlie/options.ex
@@ -29,8 +29,8 @@ defmodule Crawlie.Options do
stages: core_count(),
],
pqueue_module: :pqueue3,
-max_fetch_failed_uris_tracked: 1000,
-max_parse_failed_uris_tracked: 1000,
+max_fetch_failed_uris_tracked: 100,
+max_parse_failed_uris_tracked: 100,
]
end
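
As a rough sketch of how defaults like these usually take effect, user-supplied options typically win over the defaults via a keyword merge. The merge below is an illustration only, not necessarily how `Crawlie.Options` combines them internally:

```elixir
# Illustration only - not taken from Crawlie.Options itself.
defaults = [
  pqueue_module: :pqueue3,
  max_fetch_failed_uris_tracked: 100,
  max_parse_failed_uris_tracked: 100
]

user_options = [max_fetch_failed_uris_tracked: 500]

# keys from the second list win, so user options override the defaults
merged = Keyword.merge(defaults, user_options)
# => [pqueue_module: :pqueue3, max_parse_failed_uris_tracked: 100,
#     max_fetch_failed_uris_tracked: 500]
```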

