GitHubReadmeCorpus

A project to compile a corpus of all English-language README documentation on GitHub.

Currently compatible with Python 3.

Aim

A corpus of all publicly available README documentation from GitHub has a number of potential applications:

Create a list of the most common words by word class (noun, verb, adjective, ...) for this register.
Create the same list, broken down by programming language.
Create a dataset for teachers to use when preparing materials for a hypothetical "writing documentation" component of an ESP (English for Specific Purposes) course for programmers.
And so on.
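As a rough illustration of the first application, the sketch below counts the most frequent word tokens across a handful of README texts. It is not taken from this repository: the function name `most_common_words` and the sample strings are hypothetical, and the tokenizer is deliberately crude. Grouping the counts by word class would additionally require a part-of-speech tagger, which is not shown here.

```python
import re
from collections import Counter


def most_common_words(texts, n=10):
    """Return the n most frequent lowercase word tokens across texts."""
    counts = Counter()
    for text in texts:
        # Crude tokenizer (an assumption): runs of ASCII letters,
        # optionally with one internal apostrophe, e.g. "don't".
        counts.update(re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower()))
    return counts.most_common(n)


# Hypothetical sample input standing in for collected README files.
readmes = [
    "Install the package and run the script.",
    "Run the tests before you install.",
]
print(most_common_words(readmes, n=3))
```

For the per-language breakdown, the same function could be called once per group of READMEs, assuming each README is stored together with its repository's primary language.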

Running the scripts to create your own dataset

Clone this repository
Install the dependencies via

pip install -r requirements.txt

Create a GitHub token and add it to your environment variables (on Linux, run the following in the terminal):

export GITHUB_TOKEN=YOUR_GITHUB_TOKEN_HERE
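For reference, a script can pick up the exported variable roughly as sketched below. This is an assumption about how such a lookup might work, not a copy of what collect.py does; `get_github_token` and `auth_headers` are hypothetical helper names.

```python
import os


def get_github_token(env=os.environ):
    """Return the token from GITHUB_TOKEN, or raise a clear error if it is missing."""
    token = env.get("GITHUB_TOKEN", "").strip()
    if not token:
        raise RuntimeError("GITHUB_TOKEN is not set; export it before running the script.")
    return token


def auth_headers(token):
    # GitHub's REST API accepts a personal access token in the Authorization header.
    return {"Authorization": f"token {token}"}


# Demonstration with an explicit mapping instead of the real environment.
print(auth_headers(get_github_token({"GITHUB_TOKEN": "YOUR_GITHUB_TOKEN_HERE"})))
```

Failing early with a clear message is usually friendlier than letting the first API request fail with an authentication error.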

Add a data directory (or use whichever name you like and update it in collect.py)

mkdir data

Run the script

python collect.py

Docker method

Build the image

docker build -t corpus_collector .

Run it

Be sure to export GITHUB_TOKEN=YOUR_TOKEN_HERE before running this.

docker run -it --rm -v $(pwd)/data:/app/data -e GITHUB_TOKEN=$GITHUB_TOKEN corpus_collector

The data should end up in the local ./data directory, so you can stop the container and start it again later without re-processing all the READMEs from the start.
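The idea of resuming without re-processing can be sketched as follows: before fetching a README, check whether a file for that repository already exists in the data directory. This is a minimal sketch under stated assumptions, not the repository's actual logic; the `owner__name.md` naming scheme and the helpers `readme_path` and `pending_repos` are hypothetical.

```python
import tempfile
from pathlib import Path


def readme_path(data_dir, full_name):
    """Map a repository name like "owner/name" to a file inside data_dir."""
    # Hypothetical naming scheme: "/" is replaced so the name is a valid file name.
    return Path(data_dir) / (full_name.replace("/", "__") + ".md")


def pending_repos(repo_names, data_dir):
    """Return only the repositories whose README has not been saved yet."""
    return [name for name in repo_names if not readme_path(data_dir, name).exists()]


# Demonstration in a temporary directory standing in for ./data.
with tempfile.TemporaryDirectory() as tmp:
    readme_path(tmp, "octocat/hello").write_text("# hello")
    todo = pending_repos(["octocat/hello", "octocat/world"], tmp)
print(todo)
```

Because the check only looks at files on disk, it works the same whether the directory is local or mounted into the container with -v.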
