Chan scraper

This program downloads attachments from threads on 2ch and 4chan. You can choose what to download: images, videos, or all files.

Requirements

Python with pip, plus the packages listed in requirements.txt (installation is covered below).

Installation

Clone the repo using this command:

git clone https://github.com/m3tro1d/chan-scraper

Or just download the zip.

Install the dependencies:

python -m pip install -r requirements.txt
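
If you prefer to keep the project's dependencies isolated, the usual virtual environment workflow applies here as well (this is standard Python tooling, not something the project requires):

python -m venv venv
source venv/bin/activate    # on Windows: venv\Scripts\activate
python -m pip install -r requirements.txt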

The master branch is usually stable, so cloning it directly should work without issues.

Usage

Usage: chan-scraper.py [OPTIONS] URL [URL]...

URL:
  Thread's URL

Options:
  -h,  --help     show help
  -m,  --mode     specify content for downloading:
                  all, images, videos (def: all)
  -p,  --pause    make a pause after each download
                  useful if the server throttles (def: False)
  -o,  --output   output directory (def: current)

For more information visit:
https://github.com/m3tro1d/chan-scraper

For example:

python chan-scraper.py -o img -m images https://2ch.hk/s/res/2127464.html

This will download all images from thread 2127464 on /s/ into the img folder.

Another one:

python chan-scraper.py -o threads https://boards.4channel.org/g/thread/77369090 https://boards.4channel.org/g/thread/77368911

This will download all files from both threads and place them into separate folders, named after their thread numbers, inside the threads folder.
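
If a server starts throttling you, the -p option described above can be combined with any of these examples. For instance, a run that downloads only the videos from one of the threads above, pausing after each file (the vids output directory is just an example name):

python chan-scraper.py -p -m videos -o vids https://boards.4channel.org/g/thread/77369090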

Attention: by default, if the directory selected with the -o option already exists and contains a file with a conflicting name, the existing file will not be replaced.

Extending

If you want to add support for another imageboard, there is a simple scheme for an 'extractor'. It is a class containing the following properties:

  • name - string representing the imageboard's name. For example: self.name = "fourchan". This is used for naming the directories when downloading multiple threads;
  • match() - a static method that returns a re.match object. It determines which URLs the extractor supports;
  • thread_number - int with thread's number according to the URL;
  • get_files_urls_names() - a method that returns a tuple (or list) of tuples, each containing a file's URL and name.

The constructor (i.e. __init__) must raise an exception if a network error is encountered; all error handling is done in the Scraper class.
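
To make this concrete, here is a minimal sketch of what such an extractor could look like. The class name, URL pattern, API endpoint and JSON layout are invented for illustration; only the attribute and method names (name, match(), thread_number, get_files_urls_names()) and the "let network errors propagate" rule come from the description above.

```python
import re

import requests


class ExampleExtractor:
    """Sketch of a hypothetical extractor; the URL pattern, API endpoint
    and JSON layout below are made up for illustration."""

    # Thread URLs this extractor claims to support (illustrative pattern)
    URL_PATTERN = re.compile(r"https?://example-chan\.org/(\w+)/thread/(\d+)")

    def __init__(self, url):
        # Used for naming directories when downloading multiple threads
        self.name = "examplechan"

        match = self.match(url)
        self.board = match.group(1)
        # Thread's number according to the URL
        self.thread_number = int(match.group(2))

        # Any network error is allowed to propagate: the Scraper class
        # is responsible for handling it
        response = requests.get(
            f"https://example-chan.org/api/{self.board}/{self.thread_number}.json"
        )
        response.raise_for_status()
        self._thread = response.json()

    @staticmethod
    def match(url):
        """Return a re.match object if the URL belongs to this imageboard,
        otherwise None."""
        return ExampleExtractor.URL_PATTERN.match(url)

    def get_files_urls_names(self):
        """Return a list of (file_url, file_name) tuples for every
        attachment in the thread."""
        files = []
        for post in self._thread.get("posts", []):
            for attachment in post.get("files", []):
                files.append((attachment["url"], attachment["name"]))
        return files
```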

Also make sure to modify the Scraper constructor: import your extractor and add it to the list self.extractors.
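
A rough sketch of that registration step (the module path is hypothetical, and whether the list holds classes or instances depends on the existing code, so mirror what the other entries do):

```python
# Inside Scraper.__init__ - illustrative only
from example_extractor import ExampleExtractor  # hypothetical module

self.extractors = [
    # ...the extractors that already ship with the project...
    ExampleExtractor,
]
```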

The underlying implementation is up to you, but I suggest reading the documentation on the imageboard's API and using it if possible. Also refer to the existing extractors for more practical details.

TODO

  • Skip a thread if it yields an HTTP error and continue with the other threads (and files)
  • Option to pause after each download to prevent server throttling
  • Rewrite the script to make it more modular and easier to maintain and extend
  • Print the full information (summary) at the end of the downloading (make it an option?)
  • Add usercode_auth optional cookie code for dvach restricted boards (as an input argument)
  • Use thread's header text for naming the output folders
  • Add option for saving with poster's filename
