wikidig

wikidig: Dig Wikipedia Knowledge.

使用示例

val id = 1323
val name = PageDb.getNameById(id).getOrElse("<empty>")
val content = PageContentDb.getContent(id).getOrElse("<empty>")

提供的数据库及功能

wikidig将维基百科数据经过清洗变换之后，保存到rocksdb数据库之中，以方便独立运行，避免对第三方数据库的依赖。目前的数据库对象及功能如下：

wiki.dig.db.CategoryDb

根据名称获取id
根据id获取分类名称
根据id获取所有的子类(outlinks)id列表
根据id获取所有的父类(inlinks)id列表
根据id获取所有的指向该分类的文章页面id列表

wiki.dig.db.PageDb

根据名称(包括重定向过来的名称)获取id
根据id获取名称
根据id获取所有的可重定向过来的名称列表
根据id获取所有的入链(inlinks)
根据id获取所有的出链
根据id获取所隶属的类别集合

wiki.dig.db.PageContentDb

根据id获取对一个的维基百科全文

wiki.dig.db.CategoryHierarchyDb

获取所有的“Main topic classification“下的主要分类，如Mathematics， People等
抽样n个三角形
记录节点对之间的父子关系，并记录各个深度下，所有节点的文章数量

articles in depth 2: AcTuple(27031,15116742) articles in depth 5: AcTuple(4731811,12344957) articles in depth 4: AcTuple(1605788,13898532) articles in depth 1: AcTuple(788,15117523) articles in depth 3: AcTuple(1200166,15090304) articles in depth 6: AcTuple(7708999,7708999)

试验

当前问题

文档的Embedding没有去除停用词，没有取对数平滑TF

命令行功能

生成三角形

bin/start –sample n

生成n个三角形，n为一个整数，执行完毕后，会在当前目录下输出以sample开始的三个文件：

sample.triangle.ids.txt 每行为一个三角形，第一个id是父节点，剩余两个为子节点 sample.category.ids.txt 当前抽样结果中出现的分类id集合 sample.cat_page.txt 抽样结果中分类与文章id的对应关系，每一行中的第一个数字为分类id，其余数字为该分类下的文章id sample.page.ids.txt 抽样结果中涉及的文章id集合

生成文章的embedding

nohup bin/start –outPageEmbedding > embedding.log &

读取sample.page.ids.txt文件，如果一篇文章的长度大于500，则利用词向量平均值作为文档的向量表示，并保存到 expt数据库中，同时输出到文件： sample.page.embedding.txt

sample.page.embedding.txt 文章对应的所有词语的embedding结果的平均值

通过保存有PageId的文件，转换得到Page全文的文本文件

./bin/page-db -i path_pageids_big -o tmp.page.txt.big > outpage.log &

子图抽样和子图内的文章抽样, 生成一个文本文件，里面保存有抽取出来的子图及文章信息。

每一个子图的格式如下：

“` corpus $corpusId nodes: $nodeText edges: $edgeText articles: $articleText “` 其中，$开始的变量会替换为具体值。命令执行方式如下：

./bin/corpus-generator <start-index> <size> <filename> E.g.: ./bin/corpus-generator 1 10000 corpus.txt

其中的start-index表示corpus id的取值开始范围

参考

Wiki Clickstream

https://meta.wikimedia.org/wiki/Research:Wikipedia_clickstream

https://figshare.com/articles/Wikipedia_Clickstream/1305770

最新的Clickstream: 2017年1月

https://s3-eu-west-1.amazonaws.com/pfigshare-u-files/7563832/2017_01_en_clickstream.tsv.gz

我搜了一下bing学术搜索，没有发现和Wikipedia Clickstream相关的论文

Page view statistics for Wikimedia projects(Click Log)

下载地址: https://dumps.wikimedia.org/other/pageviews/

格式说明： https://dumps.wikimedia.org/other/pagecounts-raw/

维基百科的相关数据

Wikipedia每隔一段时间，就会在当前的维基百科数据，以压缩文件导出，其中，网站上把最新的维基百科数据导出，供人们使用。

维基百科镜像数据的下载地址：

https://dumps.wikimedia.org/mirrors.html

例如，我们可以把SQL格式的维基页面数据通过以下链接下载：

http://caesar.ftp.acc.umu.se/mirror/wikimedia.org/dumps/enwiki/20170601/enwiki-20170601-page.sql.gz

中文： http://ftp.acc.umu.se/mirror/wikimedia.org/dumps/zhwiki/20170520/

维基页面数据 (Page)

下载维基页面数据维基百科提供了XML、SQL等不同形式的导出数据，此处我们使用维基百科提供的SQL格式的数据，下载地址：
http://caesar.ftp.acc.umu.se/mirror/wikimedia.org/dumps/enwiki/20170601/enwiki-20170601-page.sql.gz

该文件大小大约1.5G

维基页面的表结构，可参考网页： https://www.mediawiki.org/wiki/Manual:Page_table

~~——————–~~———————+——+—–+———+—————-+

Field

Type

Null

Key

Default

Extra

~~——————–~~———————+——+—–+———+—————-+

page_id	int(10) unsigned	NO	PRI	NULL	AUTO_INCREMENT
page_namespace	int(11)	NO	MUL	NULL
page_title	varchar(255) binary	NO		NULL
page_restrictions	tinyblob	NO		NULL
page_is_redirect	tinyint(3) unsigned	NO	MUL	0
page_is_new	tinyint(3) unsigned	NO		0
page_random	real unsigned	NO	MUL	NULL
page_touched	binary(14)	NO		NULL
page_links_updated	varbinary(14)	YES		NULL
page_latest	int(10) unsigned	NO		NULL
page_len	int(10) unsigned	NO	MUL	NULL
page_content_model	varbinary(32)	YES		NULL
page_lang	varbinary(35)	YES		NULL

~~——————–~~———————+——+—–+———+—————-+

该表中大约有3500万行记录，页面的类型有多种，其中我们更关心的是维基百科的文章页面，文如果表中的page_namespace字段的值为0，则表示文章。

类别页面(Category page)用于表示页面之间的父子关系，通过page_namespace=14表示类别。

类别之间的链接(Categorylinks)

下载地址（2.1G）
http://ftp.acc.umu.se/mirror/wikimedia.org/dumps/enwiki/20170601/enwiki-20170601-categorylinks.sql.gz

表结构描述：

https://www.mediawiki.org/wiki/Manual:Categorylinks_table

~~——————-~~——————————+——+———+——————-+—————————–+

Field

Type

Null

Key

Default

Extra

~~——————-~~——————————+——+———+——————-+—————————–+

cl_from	int(10) unsigned	NO	UNI/PRI	0
cl_to	varchar(255) binary	NO	PRI	NULL
cl_sortkey	varbinary(230)	NO		NULL
cl_sortkey_prefix	varchar(255) binary	NO		NULL
cl_timestamp	timestamp	NO		CURRENT_TIMESTAMP	on update CURRENT_TIMESTAMP
cl_collation	varbinary(32)	NO	MUL	NULL
cl_type	enum(‘page’,’subcat’,’file’)	NO		‘page’

~~——————-~~——————————+——+———+——————-+—————————–+

该表保存了文章到类别、类别与子类别之间链接关系。

cl_from: Stores the page.page_id of the article where the link was placed.

cl_to: Stores the name (excluding namespace prefix) of the desired category. Spaces are replaced by underscores (_)

cl_sortkey: Stores the title by which the page should be sorted in a category list. This is the binary sortkey, that depending on $wgCategoryCollation may or may not be readable by a human (but should sort in correct order when comparing as a byte string)

cl_timestamp: Stores the time at which that link was last updated in the table.

cl_sortkey_prefix: This is either the empty string if a page is using the default sortkey (aka the sortkey is unspecified). Otherwise it is the human readable version of cl_sortkey. Needed mostly so that cl_sortkey can be easily updated in certain situations without re-parsing the entire page.

cl_collation: What collation is in use. Used so that if the collation changes, the updateCollation.php script knows what rows need to be fixed in db.

cl_type: What type of page is this (file, subcat (subcategory) or page (normal page)). Used so that the different sections on a category page can be paged independently in an efficient manner.

页面和类别之间的关系

下面我们看一下page和categorylinks两个表之间的关系。例如，我们要把所有出现在文章中的类别，根据类别关系构建成一棵树。

获取所有的文章/指定的文章进行观察

“` select * from page where page_namespace = 0; select * from page where page_namespace = 0 and page_title=’Anarchism’ “`

执行第2条SQL，将返回Anarchism的信息，假设其页面的page_id = 12;

查看分类信息

“` select * from categorylinks where cl_from = 12 “`

返回Anarchism页面拥有的所有的分类信息，假设拥有有一个类别Political_culture，根据该名称，我们可以进一步查询page表，获取其对应的page_id:

“` select * from page where page_namespace = 14 and page_title=’Political_culture’ “`

假设其page_id = 21722732，那么我们可以进一步获取该类别的父类别：

“` select * from categorylinks where cl_from = 21722732 “`

数据初始化

创建数据库语句

CREATE DATABASE zhwiki DEFAULT CHARACTER SET utf8 COLLATE utf8_general_ci;
CREATE USER 'xiatian'@'%' IDENTIFIED BY 'password';
GRANT ALL ON digger.* TO 'xiatian'@'%';

数据库中文乱码问题：

use digger SET NAMES utf8;

vi /etc/my.cnf 增加： [mysql] default-character-set = utf8

创建表

根据文件：jwpl_tables.sql

mysql zhwiki
source jwpl_tables.sql

导入数据到MySQL

使用JWPL的DataMachine创建某一个日期的维基百科库。

Data

English:

https://dumps.wikimedia.org/enwiki/20180801/ https://dumps.wikimedia.org/enwiki/20180801/enwiki-20180801-pages-articles.xml.bz2 https://dumps.wikimedia.org/enwiki/20180801/enwiki-20180801-pagelinks.sql.gz https://dumps.wikimedia.org/enwiki/20180801/enwiki-20180801-categorylinks.sql.gz

Chinese

https://dumps.wikimedia.org/zhwiki/20180801/ https://dumps.wikimedia.org/zhwiki/20180801/zhwiki-20180801-pages-articles.xml.bz2 https://dumps.wikimedia.org/zhwiki/20180801/zhwiki-20180801-pagelinks.sql.gz https://dumps.wikimedia.org/zhwiki/20180801/zhwiki-20180801-categorylinks.sql.gz

利用JWPL处理下载的数据

下载JWPL源代码，解压开，进行编译： mvn -DskipTests=true package
执行命令：

nohup java -Xmx16G -jar de.tudarmstadt.ukp.wikipedia.datamachine-1.2.0-SNAPSHOT-jar-with-dependencies.jar english Contents Disambiguation_pages /data/wiki/enwiki/20180801 &

此时会运行较长时间，需要数个小时；运行完毕后，会在enwiki/20180801/目录下生成一个output目录，里面包含了可以导入数据库的文本文件。

－执行导入命令 nohup mysqlimport -uroot -pxiatian –local –default-character-set=utf8 digger *.txt > /tmp/nohup.log &

中文百科的导入

nohup mysqlimport -uxiatian -pxiatian --local --default-character-set=utf8 zhwiki *.txt > /tmp/nohup.log &

英文：Contents Disambiguation_pages

中文：页面分类消歧义

构建本地的维基百科词条数据库

nohup ./bin/start --buildPageDb --buildPageContentDb --startId=0 --batchSize=1000 &

Third libraries

JWPL (Java Wikipedia Library) https://dkpro.github.io/dkpro-jwpl/

JWPL is a free, Java-based application programming interface that allows to access all information in Wikipedia.

wp-download https://github.com/pacurromon/wp-download

With wp-download you can automatically download the newest database dumps for all language edition you want:

Name		Name	Last commit message	Last commit date
Latest commit History 141 Commits
conf		conf
doc		doc
lib		lib
project		project
samples		samples
src/main		src/main
.gitignore		.gitignore
LICENSE		LICENSE
README.org		README.org
build.sbt		build.sbt
classification.md		classification.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

wikidig

使用示例

提供的数据库及功能

wiki.dig.db.CategoryDb

wiki.dig.db.PageDb

wiki.dig.db.PageContentDb

wiki.dig.db.CategoryHierarchyDb

试验

当前问题

命令行功能

参考

Wiki Clickstream

Page view statistics for Wikimedia projects(Click Log)

维基百科的相关数据

维基页面数据 (Page)

类别之间的链接(Categorylinks)

页面和类别之间的关系

数据初始化

创建数据库语句

数据库中文乱码问题：

创建表

导入数据到MySQL

Data

English:

Chinese

利用JWPL处理下载的数据

构建本地的维基百科词条数据库

Third libraries

About

Releases

Packages

Languages

License

iamxiatian/wikidig

Folders and files

Latest commit

History

Repository files navigation

wikidig

使用示例

提供的数据库及功能

wiki.dig.db.CategoryDb

wiki.dig.db.PageDb

wiki.dig.db.PageContentDb

wiki.dig.db.CategoryHierarchyDb

试验

当前问题

命令行功能

参考

Wiki Clickstream

Page view statistics for Wikimedia projects(Click Log)

维基百科的相关数据

维基页面数据 (Page)

类别之间的链接(Categorylinks)

页面和类别之间的关系

数据初始化

创建数据库语句

数据库中文乱码问题：

创建表

导入数据到MySQL

Data

English:

Chinese

利用JWPL处理下载的数据

构建本地的维基百科词条数据库

Third libraries

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages