Robots.txt Database

Robots.txt files are used to tell search engines (and other robots) what content on a website they should crawl and potentially show in search results. If a website doesn't want certain content to be craweld (by Google, for example), that website can specify a set of rules telling search engines what content to not crawl. URLs (or URL patterns) that appear in robots.txt files are not guaranteed to disappear from search results, but such pages are "likely" to be removed from Google's index, according to Google.

This project aims to collect robots.txt files for websites across the internet, starting with 9000+ government websites. The contents of each robots.txt file are collected once a week, committed to this repo, and parsed for their specific directives (e.g. Disallow: /user/login/) to allow for researchers, journalists, and anyone else who is curious about patterns across robots.txt files. The data collected here can help answer questions about what types of content these websites are choosing to hide from search engines.

Accessing the data

There are a number of ways to access this "database" of robots.txt files.

Via webapp, https://robots-dot-txt-db.com/, which lets you search across all directives contained in 9000+ robots.txt files, and provides links to the Internet Archive for URLs and URL prefixes that have content that doesn't show up in search engine results.
Via Datasette at https://robots-dot-txt-db.herokuapp.com/robotstxt. Datasette is a tool used to publish data and allow people to run SQL queries against datasets. If you're familiar with SQL and want to dig into the details of the data collected here, this is a good alternative to the webapp listed above.
Searching this repo. All of the collected robots.txt files are in the data/ directory, so if you are familiar with grep or other such tools, you can search the text of the files directly that way.

Which websites are you collecting robots.txt files for?

As of November 2020, the focus is on government websites. There are a few sources used to collect as many domains as possible:

.gov domains, from https://home.dotgov.gov/data/ and https://raw.githubusercontent.com/GSA/data/master/dotgov-domains/current-full.
non .gov government domains, from https://raw.githubusercontent.com/GSA/govt-urls/master/1_govt_urls_full.csv
State and local public health departments as listed at https://www.cdc.gov/publichealthgateway/sitesgovernance/index.html
Federal agencies listed at: https://www.usa.gov/federal-agencies
State and local COVID-related websites, via the Covid Tracking Project. Specifically this google sheet as linked to from https://github.com/COVID19Tracking/covid-tracking#urlsyaml

Note that there can be overlap in these sources, but subdomains have their own robots.txt files - so while the full list of dotgov domains may have alaska.gov, it doesn't have dhss.alaska.gov - which could have a completely different robots.txt file.

See combine_all_domains_to_check.py for more detail on how the domains are combined. The source_scripts/ directory contains scripts used to collect domains, for cases where the domains aren't already provided in a CSV. For example, there is a script there to parse out all of the websites listed at https://www.cdc.gov/publichealthgateway/sitesgovernance/index.html.

Code

Deploying

Deploying updated data

The webapp gets its data from a Datasette app running on Heroku. To update the data there with the latest data in this repo:

# Generate a CSV based on all of the collected robots.txt files
python robots_txt_to_csv.py

# Turn that CSV into a sqlite DB
csvs-to-sqlite data/all_parsed_robots_txt_data.csv /path/to/sqlitedb.db

# Publish it
datasette publish heroku /path/to/sqlitedb.db -n robots-dot-txt-db --extra-options="--config max_returned_rows:5000"

Right now, the CSV generation is not automated - so running the robots_txt_to_csv.py script will cause local changes that should be committed to git. In the future, the CSV should be generated as part of the Github action that runs weekly.

Name		Name	Last commit message	Last commit date
Latest commit History 303 Commits
.github/workflows		.github/workflows
data		data
domain_sources		domain_sources
robots-txt-webapp		robots-txt-webapp
source_scripts		source_scripts
.gitignore		.gitignore
README.md		README.md
all_domains_to_check.csv		all_domains_to_check.csv
combine_all_domains_to_check.py		combine_all_domains_to_check.py
requirements.txt		requirements.txt
robots_txt_collector.py		robots_txt_collector.py
robots_txt_to_csv.py		robots_txt_to_csv.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Robots.txt Database

Accessing the data

Which websites are you collecting robots.txt files for?

Code

Deploying

Deploying updated data

About

Releases

Packages

Languages

nrjones8/robots-dot-txt-archive-bot

Folders and files

Latest commit

History

Repository files navigation

Robots.txt Database

Accessing the data

Which websites are you collecting robots.txt files for?

Code

Deploying

Deploying updated data

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages