
Scrapes a given board catalog on 4Chan for all comments, files, and associated metadata.


malavmodi/4Chan-Scraper


4Chan-Scraper

Scrapes a given board catalog on 4Chan for all comments, files, and associated metadata with the help of the BASC 4Chan Python library (basc_py4chan).

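UPDATE for Python 3.9+ / Script Failure

If the script fails on Python 3.9 or later, patch util.py in the installed basc_py4chan package. HTMLParser.unescape() was removed in Python 3.9; html.unescape() is its replacement.

  1. Windows navigation:
C:\Users\USER\AppData\Local\Programs\Python\Python3x\Lib\site-packages\basc_py4chan\util.py
  2. Linux navigation (the exact location varies by distribution):
/usr/lib/site-packages/python3.x/site-packages/basc_py4chan/util.py
  3. Replace the HTMLParser import block
# HTML parser was renamed in python 3.x
try:
    from html.parser import HTMLParser
except ImportError:
    from HTMLParser import HTMLParser
_parser = HTMLParser()

with the html module, so that any existing _parser.unescape() calls resolve to html.unescape():

# Use the html module directly; html.unescape() replaces HTMLParser.unescape()
import html
_parser = html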
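
Because the site-packages location differs between systems, one way to find the file to edit is to ask Python itself. The following is a minimal sketch, not part of the scraper: locate_package_file is a hypothetical helper name, and the final line only illustrates what html.unescape() does with HTML entities of the kind found in post comments.

```python
import html
import importlib.util
from pathlib import Path
from typing import Optional


def locate_package_file(package: str, filename: str) -> Optional[Path]:
    """Return the path of `filename` inside an installed package, or None.

    Hypothetical helper for illustration; not part of 4Chan-Scraper.
    """
    # find_spec on a top-level name locates the package without importing it.
    spec = importlib.util.find_spec(package)
    if spec is None or not spec.submodule_search_locations:
        return None
    for location in spec.submodule_search_locations:
        candidate = Path(location) / filename
        if candidate.is_file():
            return candidate
    return None


# Prints the path of the file to patch, or None if basc_py4chan is not installed.
print(locate_package_file("basc_py4chan", "util.py"))

# With `_parser = html`, a call to _parser.unescape(...) is html.unescape(...).
print(html.unescape("&gt;&gt;12345 it&#039;s"))  # >>12345 it's
```

Run it with the same interpreter that runs the scraper, so that the path printed belongs to the copy of basc_py4chan the script actually imports.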

Installation:

Download the .zip from the GitHub repo or clone it using

git clone https://github.com/malavmodi/4Chan-Scraper.git

then install the required dependencies with pip:

pip3 install -r requirements.txt

Commands:

  1. --board_name:
    • Board from 4Chan to scrape (required)
  2. --num_threads:
    • Number of threads to scrape (required)
  3. --debug:
    • Additional log output (optional; the value is case-insensitive)
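
As an illustration of how these three options fit together, here is a minimal argparse sketch. It is an assumption about how such a command line could be declared, not the script's actual source: the flag names come from the list above, while parse_bool, build_parser, the help strings, and the default of False for --debug are hypothetical.

```python
import argparse


def parse_bool(value: str) -> bool:
    """Interpret 'True'/'False' in any letter case (hypothetical helper)."""
    lowered = value.strip().lower()
    if lowered == "true":
        return True
    if lowered == "false":
        return False
    raise argparse.ArgumentTypeError(f"expected True or False, got {value!r}")


def build_parser() -> argparse.ArgumentParser:
    """Declare the three flags listed above; the real script may differ."""
    parser = argparse.ArgumentParser(description="Scrape a 4Chan board catalog.")
    parser.add_argument("--board_name", required=True,
                        help="Board from 4Chan to scrape, e.g. pol")
    parser.add_argument("--num_threads", required=True, type=int,
                        help="Number of threads to scrape")
    parser.add_argument("--debug", type=parse_bool, default=False,
                        help="Additional log output (True/False, any case)")
    return parser


# Parse an explicit argument list instead of sys.argv for demonstration.
args = build_parser().parse_args(
    ["--board_name", "pol", "--num_threads", "5", "--debug", "False"]
)
print(args.board_name, args.num_threads, args.debug)  # pol 5 False
```

Converting --debug through a small function rather than type=bool matters here, because bool("False") is True in Python.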

Example Usage:

NOTE: For additional information on usage, run python 4chan_scraper.py -h to list the available options.

  • Scraping the first 5 threads of /pol/
    • python 4chan_scraper.py --board_name "pol" --num_threads 5 --debug "False"

Runtime:

When run, the script creates a folder in the current working directory containing all associated data, organized hierarchically as follows:

  • Thread ID with Subject (if not Null) (Folder)
    • Thread ID files (Folder)
      • File Data
    • CSV with comments/replies from posts
    • JSON formatted output of thread
    • File Metadata
    • Thread Metadata
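
To make the layout above concrete, the sketch below writes a comparable folder tree for one thread using only the standard library. Everything in it is an assumption for illustration: the function names, the file names (such as "<id>.csv"), the folder-name pattern, and the post fields are hypothetical and may not match what the script produces. The file and thread metadata outputs are left out for brevity.

```python
import csv
import json
import re
import tempfile
from pathlib import Path
from typing import Dict, List, Optional


def thread_folder_name(thread_id: int, subject: Optional[str]) -> str:
    """Build 'ID - subject', or just 'ID' when the thread has no subject."""
    if not subject:
        return str(thread_id)
    # Replace characters that are not allowed in Windows file names.
    safe = re.sub(r'[\\/:*?"<>|]', "_", subject).strip()
    return f"{thread_id} - {safe}"


def write_thread(root: Path, thread_id: int, subject: Optional[str],
                 posts: List[Dict[str, object]]) -> Path:
    """Create the thread folder, its files subfolder, a CSV and a JSON dump."""
    folder = root / thread_folder_name(thread_id, subject)
    # Subfolder that would hold the downloaded file data.
    (folder / f"{thread_id} files").mkdir(parents=True, exist_ok=True)
    # CSV with one row per comment/reply.
    with open(folder / f"{thread_id}.csv", "w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=["post_id", "comment"])
        writer.writeheader()
        writer.writerows(posts)
    # JSON-formatted output of the whole thread.
    with open(folder / f"{thread_id}.json", "w", encoding="utf-8") as fh:
        json.dump({"thread_id": thread_id, "subject": subject, "posts": posts},
                  fh, indent=2)
    return folder


# Demonstration with made-up data, written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    sample_posts = [{"post_id": 1001, "comment": "first post"},
                    {"post_id": 1002, "comment": "a reply"}]
    out = write_thread(Path(tmp), 1001, "Example: subject?", sample_posts)
    print(out.name)
    print(sorted(p.name for p in out.iterdir()))
```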
