Skip to content

news-r/greatfire

Repository files navigation

Travis build status AppVeyor build status

greatfire

Datasets of keywords and urls censored on the Chinese internet, taken from greatfire.org

🚧 Note

Though not publicly mentioned greatfire.org has an API publicly available, it’s not difficult to uncover. However, as I very much appreciate their project this package does not provide direct access to it and instead provides two datasets that I pledge to keep updated. If it is too outdated please open an issue.

Last updated: 2020-02-22 20:26:52

Installation

You can install greatfire from Github using via the remotes or devtools package.

# install.packages("remotes")
remotes::install_github("news-r/greatfire")

Example

Both datasets are rather large and therefore not lazily loaded, you therefore have to explicitly call the data function.

library(greatfire)

data(censored_keywords)
nrow(censored_keywords)
#> [1] 27387
head(censored_keywords)
#>                        title blocked_last_30_days             changed
#> 1          www.pinterest.com                    6 2012-09-23 20:13:39
#> 2                       占领                    6 2012-09-23 22:27:46
#> 3                       王悦                   33 2012-09-23 11:34:48
#> 4            "f**k" in china                   33 2019-07-20 05:25:09
#> 5                   "Github"                   33 2019-06-19 16:58:47
#> 6 "harmony" high-speed train                   33 2019-07-03 13:53:26

data(censored_urls)
nrow(censored_urls)
#> [1] 89322
head(censored_urls)
#>                                                                                                                                                        title blocked_last_30_days
#> 1                                                                                                                          ftp://creatorsinpack.dynalias.com                  100
#> 2 ftp://jinshu.myftp.org/book/400books/%E7%8E%8B%E5%8A%9B%E9%9B%84-------------%E6%88%91%E7%9A%84%E8%A5%BF%E5%9F%9F+%E4%BD%A0%E7%9A%84%E4%B8%9C%E5%9C%9F.txt                  100
#> 3                                                                                                                        ftp://jinshu.myftp.org/gfw/new.html                  100
#> 4                                                                                                                  ftp://jinshu.myftp.org:20021/gfw/new.html                  100
#> 5                                                                                                                                      http://0.facebook.com                  100
#> 6                                                                                                                                      http://000.1024gc.com                  100
#>               changed
#> 1 2019-05-07 19:22:32
#> 2 2019-05-18 10:16:53
#> 3 2019-05-06 20:36:55
#> 4 2019-05-11 03:40:58
#> 5 2019-07-05 16:15:49
#> 6 2019-06-21 07:58:36

There are two convenience search_* functions to search through the data.

gh <- search_urls("github.com") 
knitr::kable(gh)
title blocked_last_30_days changed
6282 http://aoxu.github.com 50 2019-07-11 17:43:34
21736 http://fanzuoyong.github.com 33 2019-07-11 17:56:18
22931 http://gist.github.com 100 2019-07-17 03:16:23
22932 http://gist.github.com/acethical/40081f64eb1461ee6b66 100 2019-06-08 13:40:25
22940 http://github.com/bannedbook/fanqiang/wiki 100 2019-07-31 21:25:07
22941 http://github.com/getlantern 100 2019-07-28 09:03:55
22942 http://github.com/getlantern/forum 100 2019-07-05 23:12:16
22943 http://github.com/greatfire/wiki 100 2019-07-26 04:07:25
22944 http://github.com/greatfire/wiki/issues 100 2019-07-19 15:03:19
22945 http://github.com/htcc/m 100 2019-06-28 22:59:31
22946 http://github.com/mothran/mongol 100 2019-05-24 10:24:28
28933 http://nodeload.github.com/goagent/goagent/legacy.zip/3.0 100 2019-07-14 23:44:30
29349 http://opendns.github.com/dnscrypt-osx-client 33 2019-07-26 14:38:35
29568 http://pages.github.com 20 2019-07-28 14:43:50
63854 http://www.google.com/search?q=gist.github.com 100 2019-07-09 23:23:29
63859 http://www.google.com/search?q=github.com 100 2019-06-05 14:08:49
69770 http://www.google.com/search?q=www.github.com 100 2019-07-11 18:02:40
81698 https://gist.github.com 100 2019-07-31 19:15:35
81699 https://gist.github.com/acethical/40081f64eb1461ee6b66 100 2019-05-29 13:19:55
81700 https://gist.github.com/anonymous/e6e9c344eff02dca5bc4 100 2019-07-31 06:13:51
81701 https://gist.github.com/cnrat/538bec6826f47a1288a1fb19ff73f821 100 2019-05-25 18:39:27
81702 https://gist.github.com/laurieainley/7663756 100 2019-05-29 11:24:17
81703 https://gist.github.com/mandiwise/5954bbb2e95c011885ff 100 2019-06-27 02:07:50
81704 https://gist.github.com/maxwelleite/10774746 100 2019-06-27 11:39:40
81705 https://gist.github.com/rongmu/0e1c4341800c008b4649 67 2019-07-14 15:16:32
81706 https://gist.github.com/shenzhuxi/4635732 100 2019-07-05 20:48:41
81707 https://gist.github.com/udacityandroid/5cb10300bb10becb5f8185012b913c8e 100 2019-06-04 16:46:15
81708 https://gist.github.com/warpthatdot/27609e05c1d22e855f71ba5747d56281.js 100 2019-06-01 17:07:51
81710 https://github.com 3 2019-07-28 16:27:49

About

🔥Keywords and URLs Censored on the Chinese Internet

Topics

Resources

License

Unknown, MIT licenses found

Licenses found

Unknown
LICENSE
MIT
LICENSE.md

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages