This repository has been archived by the owner on Jul 27, 2022. It is now read-only.
Automatically exported from code.google.com/p/wiki-links
License
google-research-datasets/wiki-links
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
master
Could not load branches
Nothing to show
Could not load tags
Nothing to show
{{ refName }}
default
Name already in use
A tag already exists with the provided branch name. Many Git commands accept both tag and branch names, so creating this branch may cause unexpected behavior. Are you sure you want to create this branch?
Code
-
Clone
Use Git or checkout with SVN using the web URL.
Work fast with our official CLI. Learn more.
- Open with GitHub Desktop
- Download ZIP
Sign In Required
Please sign in to use Codespaces.
Launching GitHub Desktop
If nothing happens, download GitHub Desktop and try again.
Launching GitHub Desktop
If nothing happens, download GitHub Desktop and try again.
Launching Xcode
If nothing happens, download Xcode and try again.
Launching Visual Studio Code
Your codespace will open once ready.
There was a problem preparing your codespace, please try again.
Latest commit
Git stats
Files
Failed to load latest commit information.
Type
Name
Latest commit message
Commit time
This data set is made available under a Creative Commons License: http://creativecommons.org/licenses/by/3.0/ Attribution 3.0 Unported (CC BY 3.0) Human-Readable Summary =========================================================== You are free: - to Share -- to copy, distribute and transmit the work; - to Remix -- to adapt the work; and to make commercial use of the work. Under the following conditions: * Attribution -- You must attribute the work in the manner specified by the author or licensor (but not in any way that suggests that they endorse you or your use of the work). With the understanding that: * Waiver -- Any of the above conditions can be waived if you get permission from the copyright holder. * Public Domain -- Where the work or any of its elements is in the public domain under applicable law, the status is in no way affected by the license. * Other Rights -- In no way are any of the following rights affected by the license: - Your fair dealing or fair use rights, or other applicable copyright exceptions and limitations; - The author's moral rights; - Rights other persons may have either in the work itself or in how the work is used, such as publicity or privacy rights. * Notice -- For any reuse or distribution, you must make clear to others the license terms of this work. The best way to do this is with a link to this web page: http://creativecommons.org/licenses/by/3.0/ Introduction ============ The Wikipedia links (WikiLinks) data consists of web pages that satisfy the following two constraints: a. contain at least one hyperlink that points to Wikipedia, and b. the anchor text of that hyperlink closely matches the title of the target Wikipedia page. We treat each page on Wikipedia as representing an entity (or concept or idea), and the anchor text as a mention of the entity. The WikiLinks data set was obtained by iterating over Google's web index. Content ======= This dataset is accompanied by the following tech report: https://web.cs.umass.edu/publication/docs/2012/UM-CS-2012-015.pdf Please cite the above report if you use this data. The dataset is divided over 109 text files. Each file is in the following format: ------- URL\t<url>\n MENTION\t<mention>\t<byte_offset>\t<target_url>\n MENTION\t<mention>\t<byte_offset>\t<target_url>\n MENTION\t<mention>\t<byte_offset>\t<target_url>\n ... TOKEN\t<token>\t<byte_offset>\n TOKEN\t<token>\t<byte_offset>\n TOKEN\t<token>\t<byte_offset>\n ... \n\n URL\t<url>\n ... ------- where each web-page is identified by its url (annotated by "URL"). For every mention (denoted by "MENTION"), we provide the actual mention string, the byte offset of the mention from the start of the page and the target url all separated by a tab. It is possible (and in many cases very likely) that the contents of a web-page may change over time. The dataset also contains information about the top 10 least frequent tokens on that page at the time it was crawled. These line started with a "TOKEN" and contain the string of the token and the byte offset from the start of the page. These token strings can be used as fingerprints to verify if the page used to generate the data has changed. Finally, pages are separated from each other by two blank lines. Basic Statistics ================ Number of Document: 11 million Number of entities: 3 million Number of mentions: 40 million Finally please note that this dataset was created automatically from the web and therefore contains some amount of noise. Enjoy! Amar Subramanya (asubram@google.com) Sameer Singh (sameer@cs.umass.edu) Fernando Pereira (pereira@google.com) Andrew McCallum (mccallum@cs.umass.edu) Dave Orr (dmorr@google.com)
About
Automatically exported from code.google.com/p/wiki-links
Resources
License
Stars
Watchers
Forks
Releases
No releases published
Packages 0
No packages published