This consists of three scripts that can be used to generate NovelAI training data from 4chan archived threads:
- _scraper.py, which downloads a list of all threads matching a given arch.b4k.co search result
- _dumper.py, which downloads all threads listed by _scraper.py
- _cleaner.py, which parses, cleans, and reformats the threads dumped by _dumper.py

Requirements:
- python3
- BeautifulSoup (`pip install beautifulsoup4`)
- ftfy (`pip install ftfy`)
- requests (`pip install requests`)
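
If you want to confirm the dependencies are in place before running anything, a small check along these lines may help. It is only a sketch: it assumes the scripts import BeautifulSoup under the name `bs4` (the import name of the `beautifulsoup4` package), and `missing_modules` is a hypothetical helper, not part of these scripts.

```python
import importlib.util

# Import names, which can differ from the pip package names:
# the beautifulsoup4 package is imported as bs4 (assumed to be what the scripts use).
REQUIRED_MODULES = ["bs4", "ftfy", "requests"]


def missing_modules(names=REQUIRED_MODULES):
    """Return the names that the import system cannot find."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules()
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules are installed.")
```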

Usage:
- Search the b4k archive for the threads you want to train a module on, then copy the URL
  - Be sure to select Only Opening Posts
  - Your URL should look something like: https://arch.b4k.co/vg/search/subject/%2Faids%2F/type/op/
- Paste the URL into `_scraper.py`'s `BASE_URL`, then save the script.
- Open your terminal and run `python _scraper.py`
  - This will generate a file `output.txt` containing all of the threads matching your search
- Open your terminal and run `python _dumper.py`
  - This will download the HTML for every thread listed in `output.txt` and write it to disk
- Create a folder called `cleaned` in the same folder as the scripts
- Open your terminal and run `python _cleaner.py`
  - This will parse the downloaded HTML files, clean them, and write cleaned text files to the `cleaned` folder
  - This process is not fast.

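
Once `BASE_URL` is set, the remaining steps could be chained from a single driver script. The following is a rough sketch rather than part of the project: `pipeline_commands`, `ensure_cleaned_dir`, and `run_pipeline` are hypothetical helpers, and it assumes the three scripts sit in the working directory and take no command-line arguments.

```python
import subprocess
import sys
from pathlib import Path

# Script names and their order are taken from the steps above.
PIPELINE = ["_scraper.py", "_dumper.py", "_cleaner.py"]


def pipeline_commands(python=sys.executable):
    """Return one command per stage, in the order the stages should run."""
    return [[python, script] for script in PIPELINE]


def ensure_cleaned_dir(workdir="."):
    """Create the cleaned/ folder next to the scripts if it does not exist yet."""
    cleaned = Path(workdir) / "cleaned"
    cleaned.mkdir(exist_ok=True)
    return cleaned


def run_pipeline(workdir="."):
    """Create cleaned/, then run each script, stopping at the first failure."""
    ensure_cleaned_dir(workdir)
    for command in pipeline_commands():
        # check=True raises CalledProcessError if a script exits non-zero.
        subprocess.run(command, cwd=workdir, check=True)
```

Calling `run_pipeline()` from the folder containing the scripts would then run the scraper, dumper, and cleaner back to back. Stopping at the first failure avoids running the slow cleaning step on an incomplete dump.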
I've only tested it with arch.b4k.co but it will probably work with most FoolFuuka-based 4chan archive sites with no or minimal changes.
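
If you do try another archive, the part that changes is the search URL pasted into `BASE_URL`. As a sketch, the example URL above can be rebuilt from its pieces like this; `build_search_url` is a hypothetical helper, and whether another FoolFuuka-based archive uses the same path layout is an assumption you would need to check against that site.

```python
from urllib.parse import quote


def build_search_url(host, board, subject):
    """Build an opening-posts-only subject search URL.

    The path layout is copied from the arch.b4k.co example URL; other
    archives are assumed, not known, to follow the same layout.
    """
    # safe="" makes quote() percent-encode "/" too, so "/aids/" becomes "%2Faids%2F".
    encoded = quote(subject, safe="")
    return f"https://{host}/{board}/search/subject/{encoded}/type/op/"


print(build_search_url("arch.b4k.co", "vg", "/aids/"))
# prints https://arch.b4k.co/vg/search/subject/%2Faids%2F/type/op/
```

Swapping the host for a different archive (for example a placeholder such as `archive.example.org`) gives a candidate URL to open in a browser and verify before pasting it into the script.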