This repository has been archived by the owner on Jul 27, 2022. It is now read-only.
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
A tag already exists with the provided branch name. Many Git commands accept both tag and branch names, so creating this branch may cause unexpected behavior. Are you sure you want to create this branch?
Failed to load latest commit information.
Latest commit message
This data set is made available under a Creative Commons License: http://creativecommons.org/licenses/by/3.0/ Attribution 3.0 Unported (CC BY 3.0) Human-Readable Summary =========================================================== You are free: - to Share -- to copy, distribute and transmit the work; - to Remix -- to adapt the work; and to make commercial use of the work. Under the following conditions: * Attribution -- You must attribute the work in the manner specified by the author or licensor (but not in any way that suggests that they endorse you or your use of the work). With the understanding that: * Waiver -- Any of the above conditions can be waived if you get permission from the copyright holder. * Public Domain -- Where the work or any of its elements is in the public domain under applicable law, the status is in no way affected by the license. * Other Rights -- In no way are any of the following rights affected by the license: - Your fair dealing or fair use rights, or other applicable copyright exceptions and limitations; - The author's moral rights; - Rights other persons may have either in the work itself or in how the work is used, such as publicity or privacy rights. * Notice -- For any reuse or distribution, you must make clear to others the license terms of this work. The best way to do this is with a link to this web page: http://creativecommons.org/licenses/by/3.0/ Introduction ============ The Wikipedia links (WikiLinks) data consists of web pages that satisfy the following two constraints: a. contain at least one hyperlink that points to Wikipedia, and b. the anchor text of that hyperlink closely matches the title of the target Wikipedia page. We treat each page on Wikipedia as representing an entity (or concept or idea), and the anchor text as a mention of the entity. The WikiLinks data set was obtained by iterating over Google's web index. Content ======= This dataset is accompanied by the following tech report: https://web.cs.umass.edu/publication/docs/2012/UM-CS-2012-015.pdf Please cite the above report if you use this data. The dataset is divided over 109 text files. Each file is in the following format: ------- URL\t<url>\n MENTION\t<mention>\t<byte_offset>\t<target_url>\n MENTION\t<mention>\t<byte_offset>\t<target_url>\n MENTION\t<mention>\t<byte_offset>\t<target_url>\n ... TOKEN\t<token>\t<byte_offset>\n TOKEN\t<token>\t<byte_offset>\n TOKEN\t<token>\t<byte_offset>\n ... \n\n URL\t<url>\n ... ------- where each web-page is identified by its url (annotated by "URL"). For every mention (denoted by "MENTION"), we provide the actual mention string, the byte offset of the mention from the start of the page and the target url all separated by a tab. It is possible (and in many cases very likely) that the contents of a web-page may change over time. The dataset also contains information about the top 10 least frequent tokens on that page at the time it was crawled. These line started with a "TOKEN" and contain the string of the token and the byte offset from the start of the page. These token strings can be used as fingerprints to verify if the page used to generate the data has changed. Finally, pages are separated from each other by two blank lines. Basic Statistics ================ Number of Document: 11 million Number of entities: 3 million Number of mentions: 40 million Finally please note that this dataset was created automatically from the web and therefore contains some amount of noise. Enjoy! Amar Subramanya (firstname.lastname@example.org) Sameer Singh (email@example.com) Fernando Pereira (firstname.lastname@example.org) Andrew McCallum (email@example.com) Dave Orr (firstname.lastname@example.org)