Collected sensitive Chinese keywords from various sources; for censorship testing and searching for sensitive content
| Name | Latest commit message | Commit time |
|---|---|---|
| csv | uploading individual csv files, gb2312 encoding | Dec 10, 2014 |
| dta | added Stata .dta file | Dec 22, 2014 |
| README.md | Update README.md | Dec 16, 2014 |

chinese-keywords

This repository contains a set of sensitive Chinese keywords (that is, keywords related to the Chinese Communist Party, pornography, dissident material, violence/terrorism, censorship, etc.). These keywords may be helpful to researchers who are searching for sensitive content in Chinese or testing for network interference.
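As a rough illustration of the "searching for sensitive content" use case, the sketch below checks a piece of text against a keyword list using plain substring matching, which is a common starting point for Chinese because words are not separated by spaces. The function name `find_keywords` and the sample keywords and text are hypothetical placeholders, not part of this repository.

```python
def find_keywords(text, keywords):
    """Return the keywords that occur in `text`, ordered by first appearance."""
    # set() drops duplicate keywords; empty strings are skipped because
    # "" is a substring of every text.
    hits = [(text.find(kw), kw) for kw in set(keywords) if kw and kw in text]
    return [kw for _, kw in sorted(hits)]


# Hypothetical placeholder data (neutral words, not taken from the lists).
sample_keywords = ["示例", "测试", "关键词"]
sample_text = "这是一个测试文本，包含一个示例。"

matches = find_keywords(sample_text, sample_keywords)
print(matches)
```

Substring matching is simple but can over-match short keywords that appear inside longer, unrelated words, so results may need manual review.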

As of Dec 9, there are 9,054 sensitive keywords collected from 13 different lists (see below for detailed info on the lists). To get a sense of what data is included in these CSV files, you can view a Google Doc spreadsheet of these 9,054 keywords sorted by the number of lists they appear on: https://docs.google.com/spreadsheets/d/19eS47Dg086vR1jh9oo51pXstYVT2wft13JGCrnAeU7A/edit?usp=sharing
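A minimal sketch of how such a ranking could be reproduced from GB2312-encoded CSV data: decode each list, collect its keywords, and count how many lists each keyword appears on. The column name `keyword`, the list names, and the sample rows are assumptions for illustration only; the actual CSV headers in this repository may differ.

```python
import csv
import io
from collections import Counter


def read_keywords(raw_bytes, column="keyword", encoding="gb2312"):
    """Decode CSV bytes and return the non-empty values of one column.

    `column` is a hypothetical header name; check the real files for theirs.
    """
    text = raw_bytes.decode(encoding)
    reader = csv.DictReader(io.StringIO(text, newline=""))
    values = ((row.get(column) or "").strip() for row in reader)
    return [value for value in values if value]


def rank_by_list_count(lists):
    """Given {list name: keywords}, return (keyword, number of lists) pairs,
    sorted by descending count and then by keyword."""
    counts = Counter()
    for keywords in lists.values():
        counts.update(set(keywords))  # a keyword counts once per list
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical in-memory stand-ins for two GB2312-encoded CSV files.
list_a = "keyword,translation\n示例,example\n测试,test\n".encode("gb2312")
list_b = "keyword,translation\n测试,test\n".encode("gb2312")

lists = {"list_a": read_keywords(list_a), "list_b": read_keywords(list_b)}
ranked = rank_by_list_count(lists)
print(ranked)
```

With files on disk, the same idea applies by opening each file with `open(path, encoding="gb2312", newline="")` and passing the file object to `csv.DictReader`. If decoding fails on some characters, trying `gb18030` (a superset of GB2312) may help.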

The CSV files contain machine translations (from Google) and human translations/notes for most of the keywords. Many also include theme and category variables, thanks to various sources that had previously tagged their keyword lists. Currently, there are three different versions:

The thirteen lists this collection contains are:

| Creator/source | Tested on/found from + original use of terms | # of keywords | Year | Method found + source |
|---|---|---|---|---|
| UNM/The Citizen Lab | Sina UC, triggers censorship of message in app | 1,818 | 2013 | reverse engineered from the client; more analysis here; download link |
| UNM/The Citizen Lab | Tom-Skype, triggers either surveillance or censorship in app | 2,574 | 2013 | reverse engineered from the client; more analysis here; download link |
| The Citizen Lab | LINE, triggers censorship of message in chat app | 673 | 2014 | reverse engineered from the client; more analysis here; download link |
| Jason Q. Ng (Blocked on Weibo) | Sina Weibo, triggers blocked searches on Weibo | 839 | 2013 | running Wikipedia China article titles through Sina Weibo search; more analysis and book |
| Xia Chu | Great Firewall, triggers blocking of webpages | 669 | 2014 | HTTP request scans of Wikipedia China articles to see if they'd trigger GFW block; more analysis here; download link (removed duplicates and keywords related to meta and user pages) |
| China Digital Times | Sina Weibo, triggers blocked searches on Weibo | 2,448 | 2014 | crowdsourced testing of suspected sensitive keywords on Sina Weibo; more analysis on CDT and in CDT's Grass Mud Horse Lexicon e-book; download link |
| GreatFire.org | Wikipedia, names of Wikipedia articles blocked from access | 488 | 2013 | testing to see if Wikipedia pages are available in China; more info; download link |
| Google/ATGFW.org | Google/Great Firewall, see 'Method found' | 456 | 2012 | ATGFW.org and GreatFire.org reverse engineered the keywords Google was using to warn users of potential censorship while using their service in China; download link |
| Jeffrey Knockel | Sina Show, keywords found in application's binary files | 910 | 2014 | extracted list from Sina Show app; of the 910 unique keywords, only 108 are used for censoring chat messages; download link |
| Unknown | 163.com, unclear | 376 | 2008 | archived by Nart Villeneuve; circulated on 163.com, a Chinese portal website; download link |
| Omnitalk BBS users? | Tencent QQ, keywords blocked by chat app in 2004 | 863 | 2004 | archived by Nart Villeneuve; extracted from Tencent QQ app; more info and analysis from CDT; download link |
| Jed Crandall et al / "ConceptDoppler" | Great Firewall, "keywords found to be censored at the 'gateway' level" | 669 | 2008 | archived by Nart Villeneuve; "HTTP keyword filtering by Internet routers"; website; paper; download link |
| Unknown | a "blog provider", unclear | 844 | 2005 | archived by Nart Villeneuve; according to Villeneuve: "This is a keyword list from a blog provider in China."; download link |

This project was started at The Citizen Lab's 2014 Connaught Summer Institute workshop.