khanonnie/foolfuuka_scraper

What?

This repository consists of three scripts for generating NovelAI training data from archived 4chan threads:

  • _scraper.py, which saves a list of all threads matching a given arch.b4k.co search
  • _dumper.py, which downloads every thread listed by _scraper.py
  • _cleaner.py, which parses, cleans, and reformats the threads dumped by _dumper.py
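
To give a rough idea of what the cleaning stage involves, here is a minimal sketch that turns one post's HTML into plain text. It is not the code in _cleaner.py: it swaps in the standard library's html.parser for BeautifulSoup and ftfy so that it runs without extra installs, and the names TextExtractor and clean_post, as well as the rule that drops >>12345 reply links, are assumptions made for illustration.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text, turning <br> tags into newlines."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "br":
            self.parts.append("\n")

    def handle_data(self, data):
        self.parts.append(data)


def clean_post(raw_html: str) -> str:
    """Strip tags, drop >>12345 reply links, and tidy whitespace."""
    parser = TextExtractor()
    parser.feed(raw_html)
    parser.close()
    text = "".join(parser.parts)
    # Remove reply references such as >>12345678 (hypothetical cleaning rule).
    text = re.sub(r">>\d+", "", text)
    # Collapse runs of spaces/tabs and trim each line.
    lines = [re.sub(r"[ \t]+", " ", line).strip() for line in text.splitlines()]
    # Drop lines left empty by the removals above.
    return "\n".join(line for line in lines if line)
```

Single-arrow greentext lines (for example ">implying") are left intact, because the pattern only matches two arrows followed by digits.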

Requirements

  • Python 3
  • Beautiful Soup 4 (pip install beautifulsoup4)
  • ftfy (pip install ftfy)
  • requests (pip install requests)

Instructions

  1. Search the b4k archive (arch.b4k.co) for the threads you want to train a module on, then copy the URL of the search results.
  2. Paste the URL into BASE_URL in _scraper.py, then save the script.
  3. Open your terminal and run python _scraper.py
    • This will generate a file output.txt containing all of the threads matching your search
  4. Open your terminal and run python _dumper.py
    • This will download the HTML for every thread listed in output.txt and write it to disk
  5. Create a folder called cleaned in the same folder as the scripts
  6. Open your terminal and run python _cleaner.py
    • This will parse the downloaded HTML files, clean them, and write cleaned text files to the cleaned folder
    • This process is not fast.
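
Two of the steps above are easy to get wrong by hand: checking what ended up in output.txt, and creating the cleaned folder before running the cleaner. The sketch below shows one way to do both. It assumes output.txt holds one thread entry per line, which the scripts do not document, and the helper names read_thread_list and ensure_cleaned_dir are hypothetical and not part of the repository.

```python
from pathlib import Path


def read_thread_list(path):
    """Return unique, non-empty lines from the thread list, in original order."""
    seen = set()
    threads = []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        entry = line.strip()
        if entry and entry not in seen:
            seen.add(entry)
            threads.append(entry)
    return threads


def ensure_cleaned_dir(script_dir):
    """Create the 'cleaned' folder next to the scripts if it is missing."""
    cleaned = Path(script_dir) / "cleaned"
    cleaned.mkdir(parents=True, exist_ok=True)
    return cleaned
```

For example, len(read_thread_list("output.txt")) gives a quick count of how many threads _dumper.py is about to download, and ensure_cleaned_dir(".") can be run from the scripts' folder before step 6.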

Notes

I've only tested these scripts with arch.b4k.co, but they will probably work with most FoolFuuka-based 4chan archive sites with little or no modification.
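
If you want to try another archive, the main thing that changes is the host name. The sketch below assumes thread URLs follow the /<board>/thread/<number>/ layout that FoolFuuka sites generally use; the functions parse_thread_url and retarget and the host archive.example.org are made up for illustration and do not appear in the scripts.

```python
import re
from urllib.parse import urlparse

# Assumed FoolFuuka thread path layout: /<board>/thread/<number>/
THREAD_PATH = re.compile(r"^/(?P<board>[^/]+)/thread/(?P<num>\d+)/?")


def parse_thread_url(url):
    """Split a thread URL into (host, board, thread number)."""
    parsed = urlparse(url)
    match = THREAD_PATH.match(parsed.path)
    if match is None:
        raise ValueError(f"not a recognised thread URL: {url}")
    return parsed.netloc, match["board"], int(match["num"])


def retarget(url, new_host):
    """Rebuild the same board/thread URL on a different archive host."""
    _, board, num = parse_thread_url(url)
    return f"https://{new_host}/{board}/thread/{num}/"
```

A URL that does not match the assumed layout raises ValueError, which is a quick signal that the other archive needs more than a host change.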

About

Scripts to scrape and clean 4chan threads from FoolFuuka archives, for language model training. Disclaimer: I don't know Python.
