This repository has been archived by the owner on Jul 27, 2022. It is now read-only.

Automatically exported from code.google.com/p/wiki-links

google-research-datasets/wiki-links

This data set is made available under a Creative Commons License:
  http://creativecommons.org/licenses/by/3.0/

Attribution 3.0 Unported (CC BY 3.0) Human-Readable Summary
===========================================================

You are free:

  - to Share -- to copy, distribute and transmit the work;
  - to Remix -- to adapt the work;
  and to make commercial use of the work.

Under the following conditions:

  * Attribution -- You must attribute the work in the manner specified
    by the author or licensor (but not in any way that suggests that
    they endorse you or your use of the work).

With the understanding that:

  * Waiver -- Any of the above conditions can be waived if you get
    permission from the copyright holder.

  * Public Domain -- Where the work or any of its elements is in
    the public domain under applicable law, the status is in no way
    affected by the license.

  * Other Rights -- In no way are any of the following rights
    affected by the license:

    - Your fair dealing or fair use rights, or other applicable
      copyright exceptions and limitations;

    - The author's moral rights;

    - Rights other persons may have either in the work itself or
      in how the work is used, such as publicity or privacy rights.

  * Notice -- For any reuse or distribution, you must make clear
    to others the license terms of this work.  The best way to do
    this is with a link to this web page:
    http://creativecommons.org/licenses/by/3.0/


Introduction
============

The Wikipedia links (WikiLinks) data consists of web pages that
satisfy the following two constraints:

  a. the page contains at least one hyperlink that points to
     Wikipedia, and
  b. the anchor text of that hyperlink closely matches the
     title of the target Wikipedia page.

We treat each page on Wikipedia as representing an entity
(or concept or idea), and the anchor text as a mention of the
entity. The WikiLinks data set was obtained by iterating
over Google's web index.

Content
=======

This dataset is accompanied by the following tech report:

https://web.cs.umass.edu/publication/docs/2012/UM-CS-2012-015.pdf

Please cite the above report if you use this data.

The dataset is split across 109 text files.

Each file is in the following format:

-------

URL\t<url>\n
MENTION\t<mention>\t<byte_offset>\t<target_url>\n
MENTION\t<mention>\t<byte_offset>\t<target_url>\n
MENTION\t<mention>\t<byte_offset>\t<target_url>\n
...
TOKEN\t<token>\t<byte_offset>\n
TOKEN\t<token>\t<byte_offset>\n
TOKEN\t<token>\t<byte_offset>\n
...
\n\n
URL\t<url>\n
...

-------

where each web page is identified by its URL (the line starting
with "URL"). For every mention (a line starting with "MENTION"), we
provide the mention string, the byte offset of the mention from the
start of the page, and the target URL, all separated by tabs. It is
possible (and in many cases very likely) that the contents of a
web page change over time. The dataset therefore also records the
10 least frequent tokens on each page at the time it was crawled.
These lines start with "TOKEN" and contain the token string and its
byte offset from the start of the page. The token strings can be
used as fingerprints to verify whether the page used to generate the
data has changed. Finally, pages are separated from each other by
two blank lines.
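
The format described above could be read with a short Python sketch
like the one below. It is an illustration, not code shipped with the
dataset: the names Mention, Token, Page, parse_pages and page_unchanged
are hypothetical, and it assumes that offsets are decimal integers,
that token strings are UTF-8 encoded, and that lines of any other
shape can be ignored as noise.

```python
from dataclasses import dataclass, field
from typing import Iterable, Iterator, List, Optional


@dataclass
class Mention:
    text: str
    byte_offset: int
    target_url: str


@dataclass
class Token:
    text: str
    byte_offset: int


@dataclass
class Page:
    url: str
    mentions: List[Mention] = field(default_factory=list)
    tokens: List[Token] = field(default_factory=list)


def parse_pages(lines: Iterable[str]) -> Iterator[Page]:
    """Yield one Page per URL record; blank separator lines are skipped."""
    page: Optional[Page] = None
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line:
            continue
        fields = line.split("\t")
        kind = fields[0]
        if kind == "URL" and len(fields) >= 2:
            # A new URL line closes the previous page, if there was one.
            if page is not None:
                yield page
            page = Page(url=fields[1])
        elif kind == "MENTION" and page is not None and len(fields) >= 4:
            # int() raises ValueError on a malformed offset.
            page.mentions.append(Mention(fields[1], int(fields[2]), fields[3]))
        elif kind == "TOKEN" and page is not None and len(fields) >= 3:
            page.tokens.append(Token(fields[1], int(fields[2])))
        # Any other line is ignored (an assumption, see above).
    if page is not None:
        yield page


def page_unchanged(page: Page, content: bytes, encoding: str = "utf-8") -> bool:
    """Return True if every TOKEN still occurs at its recorded byte offset.

    A page with no TOKEN lines passes vacuously.
    """
    for tok in page.tokens:
        expected = tok.text.encode(encoding)
        start = tok.byte_offset
        if content[start:start + len(expected)] != expected:
            return False
    return True
```

To use it, pass an open file to parse_pages, fetch the raw bytes of
page.url yourself, and call page_unchanged(page, raw_bytes) before
trusting the mention offsets on that page.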

Basic Statistics
================

Number of documents: 11 million
Number of entities:   3 million
Number of mentions:  40 million
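
Counts of this kind could be recomputed from the data files with a
sketch like the following. The function name count_stats is
hypothetical, and the sketch assumes that each "URL" line is one
document and that each distinct target URL is one entity; the tech
report may count entities differently.

```python
from typing import Iterable, Tuple


def count_stats(lines: Iterable[str]) -> Tuple[int, int, int]:
    """Return (documents, distinct entities, mentions) for the given lines."""
    documents = 0
    mentions = 0
    entities = set()
    for raw in lines:
        fields = raw.rstrip("\r\n").split("\t")
        if fields[0] == "URL":
            documents += 1
        elif fields[0] == "MENTION" and len(fields) >= 4:
            mentions += 1
            # Assumption: one entity per distinct target URL.
            entities.add(fields[3])
    return documents, len(entities), mentions
```

The function takes any iterable of lines, so an open file can be
passed directly. For dataset-wide totals the lines of all data files
need to go through a single call, so that an entity appearing in
several files is counted once.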


Finally, please note that this dataset was created automatically
from the web and therefore contains some amount of noise.

Enjoy!

Amar Subramanya (asubram@google.com)
Sameer Singh (sameer@cs.umass.edu)
Fernando Pereira (pereira@google.com)
Andrew McCallum (mccallum@cs.umass.edu)
Dave Orr (dmorr@google.com)
