chan-a

Expands threads, scrapes them, and edits the HTML for use in corpus linguistics.

-Added a corpus of all text from the front page in 2015-07

-Edit nearly four years later: Surprisingly, the scripts still mostly work with Firefox Quantum and Firefox ESR. You have to download geckodriver and add it to your /usr/bin. More info here.

-In the process of building a new month-long corpus for a/b/pol.

The purpose of this project is to scrape a forum for use in corpus linguistics research. By looking at how language is used (on an image board!) we can see trends, topical discussion, repetitive themes, etc.
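As a rough illustration of spotting repetitive themes, the sketch below counts the most frequent terms across a handful of posts. It is not taken from the repository's scripts; the function name `top_terms`, the tiny stopword list, and the sample posts are all made up for the example.

```python
import re
from collections import Counter

# Very simple tokenizer: lowercase runs of letters, digits, and apostrophes.
TOKEN_RE = re.compile(r"[a-z0-9']+")

# Hypothetical, deliberately tiny stopword list; a real study would use a fuller one.
STOPWORDS = frozenset({"the", "a", "is", "of", "and", "to"})


def top_terms(posts, n=3, stopwords=STOPWORDS):
    """Return the n most common non-stopword tokens across all posts."""
    counts = Counter()
    for post in posts:
        tokens = TOKEN_RE.findall(post.lower())
        counts.update(t for t in tokens if t not in stopwords)
    return counts.most_common(n)


# Made-up posts standing in for scraped text.
sample_posts = [
    "The new season is great",
    "Is the new season out yet?",
    "New chapter, new season, same thread",
]

print(top_terms(sample_posts, n=2))  # [('new', 4), ('season', 3)]
```

Running the same count over corpora scraped at different times is one simple way to compare which terms rise or fall.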

The HTML output is a balance of readability (for limited context) and ease of use in NLTK. Some HTML tags are still present and can be edited out of the HTML.
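If the leftover tags get in the way, one option is to strip them before handing the text to NLTK. The sketch below uses only Python's standard `html.parser`; it is not the repository's `corpusmaker.py`, and the class name `TextExtractor`, the function `html_to_text`, and the sample markup are assumptions made for illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script>/<style> and turning <br> into newlines."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag == "br":
            self._chunks.append("\n")

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self._chunks.append(data)

    def text(self):
        # Collapse runs of whitespace within each line and drop empty lines.
        lines = (" ".join(line.split()) for line in "".join(self._chunks).splitlines())
        return "\n".join(line for line in lines if line)


def html_to_text(html):
    """Return the plain text of an HTML fragment."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


# Made-up markup standing in for one scraped post.
sample = '<div class="post"><b>Anon</b> first line<br>second &amp; last</div><script>x=1</script>'
print(html_to_text(sample))
```

The resulting plain text can be written to a `.txt` file and loaded with whichever NLTK corpus reader or tokenizer suits the analysis.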

The shell script makes it easy to run the scraper as a cron job and have timed scrapes on a headless machine or VPS. This is ideal for a Raspberry Pi or something similar.

You'll have to install the Selenium WebDriver library: http://www.seleniumhq.org/download/
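Once Selenium and geckodriver are in place, a headless Firefox session can load a page and save its rendered HTML. The following is a minimal sketch under those assumptions, not the repository's `4chan-a-scraper.py`; the helper names `find_geckodriver` and `save_page`, the placeholder URL, and the output file name are hypothetical.

```python
import shutil
from pathlib import Path


def find_geckodriver():
    """Return the path to geckodriver if it is on PATH, else None."""
    return shutil.which("geckodriver")


def save_page(url, out_path):
    """Load url in headless Firefox and write the rendered HTML to out_path."""
    # Imported here so the module still loads on machines without Selenium.
    from selenium import webdriver
    from selenium.webdriver.firefox.options import Options

    options = Options()
    options.add_argument("-headless")  # no display needed, e.g. on a VPS
    driver = webdriver.Firefox(options=options)
    try:
        driver.get(url)
        Path(out_path).write_text(driver.page_source, encoding="utf-8")
    finally:
        driver.quit()  # always close the browser, even if the load fails


if __name__ == "__main__":
    if find_geckodriver() is None:
        print("geckodriver not found on PATH")
    else:
        # Placeholder URL and file name; substitute the page you want to scrape.
        save_page("https://example.org/", "frontpage.html")
```

Checking for geckodriver first gives a clearer message than the exception Selenium may raise when the driver is missing.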

Q: Why not just use the API?

A: I wanted all text from the front page, not just the OP and recent activity in the thread.

To do:

Make monthly corpus
Ability to convert to txt for NLTK
Test on other sites with similar forum software

Repository contents:

Corpus-tools
Corpus
htmlfiles
.gitignore
4chan-a-scraper.py
README.md
corpusmaker.py
script-scrape-4chan-a.sh

johnschriner/chan-a
