GitHub - khanonnie/foolfuuka_scraper: Scripts to scrape and clean 4chan threads from foolfuuka archives, for language model training. Disclaimer: I don't know python.

What?

This repository consists of three scripts that can be used to generate NovelAI training data from archived 4chan threads:

_scraper.py, which downloads a list of all threads matching a given arch.b4k.co search result
_dumper.py, which downloads every thread listed by _scraper.py
_cleaner.py, which parses, cleans, and reformats the threads dumped by _dumper.py
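
To make the first stage concrete, here is a minimal sketch of pulling thread links out of a search-result page. It is not the code in _scraper.py: the function name extract_thread_urls is hypothetical, and the pattern assumes thread links have the form https://<host>/<board>/thread/<number>/, so check it against the HTML your archive actually serves.

```python
import re

# Assumption: thread links look like https://<host>/<board>/thread/<number>/,
# optionally followed by a #<post> fragment that we ignore.
THREAD_LINK = re.compile(r'https?://[^/"\s]+/\w+/thread/\d+/')


def extract_thread_urls(html):
    """Return the unique thread URLs found in html, in order of appearance."""
    seen = set()
    urls = []
    for match in THREAD_LINK.finditer(html):
        url = match.group(0)
        if url not in seen:  # search pages often link the same thread twice
            seen.add(url)
            urls.append(url)
    return urls


if __name__ == "__main__":
    sample = (
        '<a href="https://arch.b4k.co/vg/thread/123/">View</a>'
        '<a href="https://arch.b4k.co/vg/thread/123/#456">Reply</a>'
        '<a href="https://arch.b4k.co/vg/thread/789/">View</a>'
    )
    for url in extract_thread_urls(sample):
        print(url)
```

Writing one URL per line is the kind of list a downloader can then read back and fetch one thread at a time.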

Search the arch.b4k.co archive for the threads you want to train a module on, then copy the search URL
- Be sure to select Only Opening Posts
- Your URL should look something like: https://arch.b4k.co/vg/search/subject/%2Faids%2F/type/op/
Paste the URL into _scraper.py's BASE_URL, then save the script.
Open your terminal and run python _scraper.py
- This will generate a file, output.txt, listing all of the threads matching your search
Open your terminal and run python _dumper.py
- This will download the HTML for every thread listed in output.txt and write it to disk
Create a folder called cleaned in the same folder as the scripts
Open your terminal and run python _cleaner.py
- This will parse the downloaded HTML files, clean them, and write cleaned text files to the cleaned folder
- This process is not fast.
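
For a rough idea of what the cleaning step involves, here is a minimal sketch that tidies the text of a single post. The rules are assumptions chosen for illustration (decode HTML entities, drop >>12345 post references and bare URLs, collapse whitespace) and the function name clean_post is hypothetical; the rules in _cleaner.py may differ.

```python
import html
import re

QUOTE_LINK = re.compile(r">>\d+")      # post references such as >>12345
URL = re.compile(r"https?://\S+")      # bare links
BLANK_RUN = re.compile(r"\n{3,}")      # three or more consecutive newlines


def clean_post(text):
    """Apply a few illustrative cleaning rules to one post's text."""
    text = html.unescape(text)         # &gt; becomes >, &amp; becomes &
    text = QUOTE_LINK.sub("", text)    # remove references to other posts
    text = URL.sub("", text)           # remove links
    # Collapse runs of spaces and tabs, and trim each line.
    lines = [re.sub(r"[ \t]+", " ", line).strip() for line in text.splitlines()]
    text = "\n".join(lines)
    # Keep at most one blank line between paragraphs.
    return BLANK_RUN.sub("\n\n", text).strip()


if __name__ == "__main__":
    raw = "&gt;&gt;12345\nnice  story   anon\nsee https://example.com/x for more"
    print(clean_post(raw))
```

Greentext lines such as >implying survive these rules, because only >> followed by digits is treated as a post reference.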

I've only tested it with arch.b4k.co, but it will probably work with most FoolFuuka-based 4chan archive sites with minimal or no changes.
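
If you want to try another archive, a small check like the one below can confirm that a search URL has the expected shape before you paste it into BASE_URL. This is only a sketch: the helper names are hypothetical, and the /page/<n>/ suffix used for later result pages is an assumption about FoolFuuka's URL scheme that you should verify on the site you target.

```python
from urllib.parse import urlsplit


def describe_search_url(base_url):
    """Split a search URL of the form https://<host>/<board>/search/... into parts."""
    parts = urlsplit(base_url)
    segments = [s for s in parts.path.split("/") if s]
    if len(segments) < 2 or segments[1] != "search":
        raise ValueError("not a FoolFuuka-style search URL: " + base_url)
    # "Only Opening Posts" shows up as the adjacent path segments type/op.
    ops_only = any(a == "type" and b == "op" for a, b in zip(segments, segments[1:]))
    return {"host": parts.netloc, "board": segments[0], "ops_only": ops_only}


def page_url(base_url, page):
    """Build the URL for a later result page (assumed /page/<n>/ suffix)."""
    return base_url.rstrip("/") + "/page/{}/".format(page)


if __name__ == "__main__":
    url = "https://arch.b4k.co/vg/search/subject/%2Faids%2F/type/op/"
    print(describe_search_url(url))
    print(page_url(url, 2))
```

A URL that fails the ops_only check most likely means Only Opening Posts was not selected in the search form.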
