crawler-user-agents

This repository contains a list of HTTP user-agents used by robots, crawlers, and spiders, as a single JSON file.

Install

Direct download

Download the crawler-user-agents.json file from this repository directly.

Npm / Yarn

crawler-user-agents is published on npmjs.com: https://www.npmjs.com/package/crawler-user-agents

To install it with npm or yarn:

npm install --save crawler-user-agents
# OR
yarn add crawler-user-agents

In Node.js, you can require the package to get an array of crawler user agents.

const crawlers = require('crawler-user-agents');
console.log(crawlers);

Usage

Each pattern is a regular expression. It should work out of the box with your favorite regex library:

JavaScript: if (RegExp(entry.pattern).test(req.headers['user-agent'])) { ... }
PHP: add a slash before and after the pattern: if (preg_match('/'.$entry['pattern'].'/', $_SERVER['HTTP_USER_AGENT'])): ...
Python: if re.search(entry['pattern'], ua): ...
Go: use this package. It provides the global variable Crawlers (synchronized with crawler-user-agents.json) and the functions IsCrawler and MatchingCrawlers.
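
As a minimal sketch of the Python case above, the snippet below compiles each pattern once and checks a user-agent string against all of them. SAMPLE_ENTRIES is a hand-written stand-in for the contents of crawler-user-agents.json (in real use you would load the file with json.load), and the helper names compile_patterns and is_crawler are hypothetical, not part of this repository.

```python
import re

# Hand-written stand-in for crawler-user-agents.json; in real use, load the
# file with json.load(). Only the "pattern" key is needed for matching.
SAMPLE_ENTRIES = [
    {"pattern": "Googlebot\\/"},
    {"pattern": "rogerbot"},
]


def compile_patterns(entries):
    """Compile every entry's pattern once so repeated lookups stay cheap."""
    return [re.compile(entry["pattern"]) for entry in entries]


def is_crawler(user_agent, compiled):
    """Return True if any pattern matches somewhere in the user-agent string."""
    return any(rx.search(user_agent) for rx in compiled)


compiled = compile_patterns(SAMPLE_ENTRIES)
print(is_crawler("Mozilla/5.0 (compatible; Googlebot/2.1)", compiled))        # True
print(is_crawler("Mozilla/5.0 (X11; Linux x86_64) Firefox/120.0", compiled))  # False
```

re.search is used rather than re.match because the patterns are meant to match a fragment anywhere in the user-agent string, not only at its start.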

Example of Go program:

package main

import (
	"fmt"

	"github.com/monperrus/crawler-user-agents"
)

func main() {
	userAgent := "Mozilla/5.0 (compatible; Discordbot/2.0; +https://discordapp.com)"

	isCrawler := agents.IsCrawler(userAgent)
	fmt.Println("isCrawler:", isCrawler)

	indices := agents.MatchingCrawlers(userAgent)
	fmt.Println("crawlers' indices:", indices)
	fmt.Println("crawler's URL:", agents.Crawlers[indices[0]].URL)
}

Output:

isCrawler: true
crawlers' indices: [237]
crawler's URL: https://discordapp.com

Contributing

I do welcome additions contributed as pull requests.

Each pull request should:

contain a single addition
specify a discriminating, relevant syntactic fragment (for example "totobot" and not "Mozilla/5 totobot v20131212.alpha1")
contain the pattern (generic regular expression), the discovery date (year/month/day), and the official URL of the robot
result in a valid JSON file (don't forget the comma between items)

Example:

{
  "pattern": "rogerbot",
  "addition_date": "2014/02/28",
  "url": "http://moz.com/help/pro/what-is-rogerbot-",
  "instances": ["rogerbot/2.3 example UA"]
}
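
Before opening a pull request, an entry like the one above can be sanity-checked locally. The following is a rough sketch, assuming only the rules listed in this section; check_entry is a hypothetical helper, not the repository's own validation script, which may check more or different things.

```python
import re
from datetime import datetime


def check_entry(entry):
    """Return a list of problems found in one entry (empty list means OK)."""
    problems = []
    # The contributing rules ask for a pattern, a discovery date, and a URL.
    for key in ("pattern", "addition_date", "url"):
        if key not in entry:
            problems.append("missing key: " + key)
    if problems:
        return problems
    try:
        rx = re.compile(entry["pattern"])
    except re.error as exc:
        return ["pattern is not a valid regular expression: %s" % exc]
    try:
        # Dates are written as year/month/day, e.g. 2014/02/28.
        datetime.strptime(entry["addition_date"], "%Y/%m/%d")
    except ValueError:
        problems.append("addition_date is not in year/month/day form")
    # Every sample user agent, if any are given, should be matched by the pattern.
    for ua in entry.get("instances", []):
        if not rx.search(ua):
            problems.append("instance not matched by pattern: " + ua)
    return problems


example = {
    "pattern": "rogerbot",
    "addition_date": "2014/02/28",
    "url": "http://moz.com/help/pro/what-is-rogerbot-",
    "instances": ["rogerbot/2.3 example UA"],
}
print(check_entry(example))  # prints []
```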

License

The list is under the MIT License. Versions prior to Nov 7, 2016 were under a CC-SA license.

Related work

There are a few wrapper libraries that use this data to detect bots:

Voight-Kampff (Ruby)
isbot (Ruby)
crawlers (Clojure)
isBot (Node.JS)

Other systems for spotting robots, crawlers, and spiders that you may want to consider are:

Crawler-Detect (PHP)
BrowserDetector (PHP)
browscap (JSON files)
