Initial commit of files

commit 7fc35b850d9d95910126919f84036e6788aeedca 1 parent 6284d29
@kenshiro-o authored
4 .gitignore
@@ -0,0 +1,4 @@
+node_modules
+.idea
+initproject.sh
+wikipedia-js.iml
3  History.md
@@ -0,0 +1,3 @@
+0.0.1 / 2013-03-10
+====================
+* Initial release
70 README.md
@@ -0,0 +1,70 @@
+# wikipedia-js
+
+ wikipedia-js is a simple client for querying English-language Wikipedia articles. The results are formatted
+as basic HTML.
+At present, wikipedia-js only supports requesting the summary of an article (i.e. everything before the table of contents
+on a Wikipedia page). Support for formatting a whole Wikipedia article page is in progress.
+
+## Rationale
+
+ This project was created because Wikipedia does not currently offer a Node.js API.
+
+## Installation
+
+ $ npm install wikipedia-js
+
+## Usage
+ All searches are performed via the single method *searchArticle*:
+
+ ```js
+ var wikiClient = require("wikipedia-js");
+ var query = "Napoleon Bonaparte";
+ var options = {query: query, format: "json", summaryOnly: true};
+ wikiClient.searchArticle(options, function(err, htmlWikiText){
+ if(err){
+ console.log("An error occurred[query=%s, error=%s]", query, err);
+ }else{
+ console.log("Query successful[query=%s, html-formatted-wiki-text=%s]", query, htmlWikiText);
+ /*You should see something along the lines of:
+ <p><strong>Napoleon Bonaparte</strong> (French: Napoléon Bonaparte [napoleɔ̃ bɔnɑpaʁt], Italian: Napoleone Buonaparte; 15 August 1769&nbsp;– 5 May 1821) was a French military and political leader who rose to prominence during the latter stages of the <a href=http://en.wikipedia.org/French_Revolution">French Revolution</a> and its associated <a href=http://en.wikipedia.org/French_Revolutionary_Wars">wars</a> in Europe.</p>
+ <p>As <strong>Napoleon I</strong>, he was <a href=http://en.wikipedia.org/Emperor_of_the_French">Emperor of the French</a> from 1804 to 1815. His legal reform, the <a href=http://en.wikipedia.org/Napoleonic_Code">Napoleonic Code</a>, has been a major influence on many <a href=http://en.wikipedia.org/Civil_law_(legal_system)">civil law</a> jurisdictions worldwide, but he is best remembered for his role in the wars led against France by a series of coalitions, the so-called <a href=http://en.wikipedia.org/Napoleonic_Wars">Napoleonic Wars</a>. He established hegemony over most of continental Europe and sought to spread the ideals of the French Revolution, while consolidating an <a href=http://en.wikipedia.org/First_French_Empire">imperial monarchy</a> which restored aspects of the deposed <em><a href=http://en.wikipedia.org/Ancien_Régime">Ancien Régime</a>.</em> Due to his success in these wars, often against numerically superior enemies, he is generally regarded as one of the greatest military commanders of all time, and his campaigns are studied at military academies worldwide.(ref: Schom 1998)</p>
+ <p>Napoleon was born at <a href=http://en.wikipedia.org/Ajaccio">Ajaccio</a> in <a href=http://en.wikipedia.org/Corsica">Corsica</a> in a family of <a href=http://en.wikipedia.org/Nobility_of_Italy">noble Italian</a> ancestry which had settled Corsica in the 16th century. He trained as an artillery officer in mainland France. He rose to prominence under the <a href=http://en.wikipedia.org/French_First_Republic">French First Republic</a> and led successful campaigns against the <a href=http://en.wikipedia.org/First_Coalition">First</a> and <a href=http://en.wikipedia.org/War_of_the_Second_Coalition">Second</a> Coalitions arrayed against France. He led a successful invasion of the Italian peninsula.</p>
+ <p>In 1799, he staged a <em><a href=http://en.wikipedia.org/18_Brumaire">coup d</em>état</a> and installed himself as <a href=http://en.wikipedia.org/First_Consul">First Consul</a>; five years later the French Senate proclaimed him emperor, following a <a href=http://en.wikipedia.org/plebiscite">plebiscite</a> in his favour. In the first decade of the 19th century, the <a href=http://en.wikipedia.org/First_French_Empire">French Empire</a> under Napoleon engaged in a series of conflicts—the Napoleonic Wars—that involved every major European power.(ref: Schom 1998) After a streak of victories, France secured a dominant position in continental Europe, and Napoleon maintained the French <a href=http://en.wikipedia.org/sphere_of_influence">sphere of influence</a> through the formation of extensive alliances and the appointment of friends and family members to rule other European countries as French <a href=http://en.wikipedia.org/client_state">client state</a>s.</p>
+ <p>The <a href=http://en.wikipedia.org/Peninsular_War">Peninsular War</a> and 1812 <a href=http://en.wikipedia.org/French_invasion_of_Russia">French invasion of Russia</a> marked turning points in Napoleons fortunes. His <a href=http://en.wikipedia.org/Grande_Armée">Grande Armée</a> was badly damaged in the campaign and never fully recovered. In 1813, the <a href=http://en.wikipedia.org/Sixth_Coalition">Sixth Coalition</a> defeated his forces <a href=http://en.wikipedia.org/Battle_of_Leipzig">at Leipzig</a>; the following year the Coalition invaded France, forced Napoleon to abdicate and exiled him to the island of <a href=http://en.wikipedia.org/Elba">Elba</a>. Less than a year later, he escaped Elba and returned to power, but was defeated at the <a href=http://en.wikipedia.org/Battle_of_Waterloo">Battle of Waterloo</a> in June 1815. Napoleon spent the last six years of his life in confinement by the British on the island of <a href=http://en.wikipedia.org/Saint_Helena">Saint Helena</a>. An autopsy concluded he died of <a href=http://en.wikipedia.org/stomach_cancer">stomach cancer</a>, but there has been some debate about the cause of his death, as some scholars have speculated that he was a victim of <a href=http://en.wikipedia.org/arsenic_poisoning">arsenic poisoning</a>.</p>
+ */
+ }
+ });
+
+ ```
+
+## Additional features
+
+ The following features will be added soon:
+- retrieve whole article as opposed to summary only
+- return only wiki markup to user if requested (we are currently systematically formatting to HTML)
+- improve performance
+
+## Licence
+
+(The MIT License)
+
+Copyright (c) 2013 Kenshiro &lt;kenshiro@kenshiro.me&gt;
+
+Permission is hereby granted, free of charge, to any person obtaining
+a copy of this software and associated documentation files (the
+'Software'), to deal in the Software without restriction, including
+without limitation the rights to use, copy, modify, merge, publish,
+distribute, sublicense, and/or sell copies of the Software, and to
+permit persons to whom the Software is furnished to do so, subject to
+the following conditions:
+
+The above copyright notice and this permission notice shall be
+included in all copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED 'AS IS', WITHOUT WARRANTY OF ANY KIND,
+EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
+IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
+CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
+TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
+SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
2  lib/constants/wikiConstants.js
@@ -0,0 +1,2 @@
+module.exports.WIKIPEDIA_EN_URL = "http://en.wikipedia.org";
+module.exports.WIKIPEDIA_EN_API_URL = "http://en.wikipedia.org/w/api.php";
91 lib/converter/wikiToHTMLConverter.js
@@ -0,0 +1,91 @@
+var constants = require("./../constants/wikiConstants");
+
+var STYLE_REGEX = /[']+\s*([^']*)\s*[']+/g;
+var LINK_REGEX = /\[\[([^\]]*)\]\]/g;
+var SPECIAL_LANGUAGE_INFO_REGEX = /\{\{([^\}]*)\}\}/g;
+var REFERENCE_REGEX = /<ref\s*name="([^"]*)"\s*\/?>[^<]*(<\/ref>)?/g;
+var REFERENCES_TO_IGNORE = /<ref[^\/]*\/>|<ref[^>]*>[^<]*<\/ref>/g;
+var COMMENTS_REGEX = /<!--[\s\S]*?-->/g;
+
+var LANGUAGE_MAP = {
+ "lang-fr": "French",
+ "lang-es": "Spanish",
+ "lang-en": "English",
+ "lang-it": "Italian"
+};
+
+function convertLineToHTML(line){
+
+ line = line.replace(STYLE_REGEX, function(match, subMatch1){
+ var index = match.indexOf(subMatch1);
+ if(index === 2){
+ return "<em>" + subMatch1 + "</em>";
+ }else if(index === 3){
+ return "<strong>" + subMatch1 + "</strong>";
+ }else if(index === 5){
+ return "<strong><em>" + subMatch1 + "</em></strong>";
+ }else{
+ return subMatch1;
+ }
+ });
+
+ line = line.replace(LINK_REGEX, function(match, matchedLink){
+ var underscoreLink = "";
+ if(/\|/.test(matchedLink)){
+ //TODO Use underscore to perform trimming
+ var splitLink = matchedLink.split("|");
+ matchedLink = splitLink[1];
+ underscoreLink = splitLink[0].replace(/\s/g, "_");
+ }else{
+ underscoreLink = matchedLink.replace(/\s/g, "_");
+ }
+
+ return '<a href="' + constants.WIKIPEDIA_EN_URL + '/wiki/' + underscoreLink + '">' + matchedLink + '</a>';
+ });
+
+ line = line.replace(COMMENTS_REGEX, "");
+
+ line = line.replace(SPECIAL_LANGUAGE_INFO_REGEX, function(match, matchedLangStr){
+ var splitInfo = matchedLangStr.split("|");
+ if(splitInfo.length > 0){
+ //short-circuit right away if we are dealing with a citation/reference
+ if(/cite/.test(splitInfo[0])) {
+ return "";
+ }
+ var langInfo = LANGUAGE_MAP[splitInfo[0]];
+ var prefix = "";
+ var suffix = "";
+
+ //Print the language information if it is present
+ if(langInfo){
+ //langInfo is known to be truthy here, so append the separator directly
+ prefix = langInfo + ": ";
+ }else{
+ prefix = "[";
+ suffix = "]";
+ }
+ var ret = prefix;
+ for(var i = 1; i < splitInfo.length; ++i){
+ var curr = splitInfo[i];
+ if(!/links|IPA|icon/.test(curr)){
+ ret += splitInfo[i];
+ }
+ }
+ ret += suffix;
+ return ret;
+ }else{
+ return "";
+ }
+ });
+
+ line = line.replace(REFERENCE_REGEX, function(match, matchedAuthor){
+ return "(ref: " + matchedAuthor + ")";
+ });
+
+ line = line.replace(REFERENCES_TO_IGNORE, "");
+
+ return line;
+}
+
+
+module.exports.convertLineToHTML = convertLineToHTML;
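
Review note on `convertLineToHTML`: the sketch below restates the two inline rules (quote styles and `[[...]]` links) as standalone functions, to make the intended output easier to check by hand. The helper names `convertQuoteStyles` and `convertWikiLinks`, the constant `WIKI_ARTICLE_BASE`, and the `/wiki/` article prefix are assumptions made for illustration; none of them are part of this commit. Matching the quote runs explicitly, longest first, is one way to leave a lone apostrophe (as in "Napoleon's") untouched.

```js
// Hypothetical helpers for illustration; not part of this commit.
var WIKI_ARTICLE_BASE = "http://en.wikipedia.org/wiki/"; // assumed article URL prefix

// Convert wiki quote markup, longest run first, so that ''''' is not
// mistaken for ''' followed by ''.
function convertQuoteStyles(line) {
    return line
        .replace(/'''''(.+?)'''''/g, "<strong><em>$1</em></strong>")
        .replace(/'''(.+?)'''/g, "<strong>$1</strong>")
        .replace(/''(.+?)''/g, "<em>$1</em>");
}

// Convert [[Target]] and [[Target|label]] into HTML anchors.
function convertWikiLinks(line) {
    return line.replace(/\[\[([^\]]*)\]\]/g, function (match, inner) {
        var parts = inner.split("|");
        var target = parts[0].trim();
        var label = (parts.length > 1 ? parts[1] : parts[0]).trim();
        // Spaces become underscores; anything else unsafe is percent-encoded.
        var href = WIKI_ARTICLE_BASE + encodeURIComponent(target.replace(/\s/g, "_"));
        return '<a href="' + href + '">' + label + "</a>";
    });
}

var sample = "'''Napoleon''' rose during the [[French Revolution]] " +
    "and its [[French Revolutionary Wars|wars]].";
console.log(convertWikiLinks(convertQuoteStyles(sample)));
```

Quotes are converted before links here so that the double quotes emitted around `href` values are never seen by the quote patterns.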
60 lib/parser/wikiParser.js
@@ -0,0 +1,60 @@
+var _ = require("underscore"),
+ converter = require("./../converter/wikiToHTMLConverter");
+
+_.str = require("underscore.string");
+_.mixin(_.str.exports());
+
+function parseJson(wiki, callback) {
+ if (wiki.query && wiki.query.pages) {
+ var keys = Object.keys(wiki.query.pages);
+ if (keys.length > 0 && wiki.query.pages[keys[0]].revisions) {
+ //Take the first result
+ var latestArticle = wiki.query.pages[keys[0]].revisions[0]["*"];
+ var lines = latestArticle.split("\n");
+ var linesToParse = [];
+ var parsedText = "";
+
+ lines.forEach(function (line) {
+ if(_(line).startsWith("}}")){
+ linesToParse = [];
+ }else if(!(_(line).startsWith("|") ||
+ _(line).startsWith("{") ||
+ _(line).startsWith("<") ||
+ _(line).startsWith("[[File")||
+ _(line).startsWith("\n")||
+ line.length === 0)){
+ linesToParse.push(line);
+ }
+ });
+
+ linesToParse.forEach(function(line){
+ line = converter.convertLineToHTML(line);
+ parsedText += "<p>" + line + "</p>\n";
+// console.log(line);
+ });
+
+ process.nextTick(function () {
+ callback(null, parsedText);
+ });
+ } else {
+ process.nextTick(function () {
+ callback(null, null);
+ });
+ }
+ } else {
+ process.nextTick(function () {
+ callback(new Error("Unable to parse response: no query.pages entry found"), null);
+ });
+ }
+}
+
+
+module.exports.parse = function (wiki, format, callback) {
+ if (format === "json") {
+ parseJson(wiki, callback);
+ }else{
+ process.nextTick(function(){
+ callback(new Error("Unrecognized format [format=" + format + "]"));
+ });
+ }
+};
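
Review note on `parseJson`: the function reads `wiki.query.pages[<first key>].revisions[0]["*"]`. As far as I know, the API reports a title with no article as a page entry that has no `revisions` array, so a defensive extraction step may be worth isolating. The sketch below is one way to do that; `extractWikiText` is a hypothetical name, and the two sample objects are hand-written to mimic the response shape, not captured API output.

```js
// Hypothetical helper for illustration; not part of this commit.
// Returns the wikitext of the first page in a "query" response, or null
// when there is no page entry or the page carries no revisions.
function extractWikiText(apiResponse) {
    var pages = apiResponse && apiResponse.query && apiResponse.query.pages;
    if (!pages) { return null; }
    var keys = Object.keys(pages);
    if (keys.length === 0) { return null; }
    var revisions = pages[keys[0]].revisions;
    if (!revisions || revisions.length === 0) { return null; }
    return revisions[0]["*"];
}

// Hand-written samples mimicking the shape that parseJson reads.
var found = {query: {pages: {"1": {revisions: [{"*": "'''Napoleon Bonaparte''' was ..."}]}}}};
var missing = {query: {pages: {"-1": {title: "No such article", missing: ""}}}};

console.log(extractWikiText(found));
console.log(extractWikiText(missing));
```

Returning `null` for a missing article matches the `callback(null, null)` branch that `parseJson` already uses when no pages are present.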
33 lib/wikiClient.js
@@ -0,0 +1,33 @@
+var superAgent = require("superagent"),
+ wikiParser = require("./parser/wikiParser"),
+ wikiConstants = require("./constants/wikiConstants");
+
+function searchArticle(queryAndOptions, callback) {
+ var query = queryAndOptions.query;
+ if (!query) {
+ return callback(new Error("No search query was provided"));
+ }
+ var format = queryAndOptions.format || "json";
+ var summaryOnly = queryAndOptions.summaryOnly;
+ var queryParams = {action: "query", format: format, prop: "revisions",
+ rvprop: "content", titles: query, redirects: 1 };
+ if(summaryOnly){
+ queryParams.rvsection = 0;
+ }
+
+ superAgent.get(wikiConstants.WIKIPEDIA_EN_API_URL)
+ .query(queryParams)
+ .set("User-Agent", "Node.js wikipedia-js client (kenshiro@kenshiro.me)")
+ .end(function (res) {
+ if (res.ok) {
+ var jsonData = JSON.parse(res.text);
+ wikiParser.parse(jsonData, format, callback);
+ } else {
+ process.nextTick(function () {
+ return callback(new Error("Unexpected HTTP status received [status=" + res.status + "]"));
+ });
+ }
+ });
+}
+
+module.exports.searchArticle = searchArticle;
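
Review note on `searchArticle`: to see which request the parameters above produce without touching the network, the same object can be serialized with Node's core `querystring` module. This is a sketch only; `buildRequestUrl` is a hypothetical helper, and the real client delegates serialization to superagent's `.query()`.

```js
var querystring = require("querystring"); // Node.js core module

// Same value as WIKIPEDIA_EN_API_URL in lib/constants/wikiConstants.js.
var API_URL = "http://en.wikipedia.org/w/api.php";

// Hypothetical helper mirroring the parameters assembled in searchArticle.
function buildRequestUrl(options) {
    var params = {
        action: "query",
        format: options.format || "json",
        prop: "revisions",
        rvprop: "content",
        titles: options.query,
        redirects: 1
    };
    if (options.summaryOnly) {
        // Section 0 is the lead text that precedes the first heading.
        params.rvsection = 0;
    }
    return API_URL + "?" + querystring.stringify(params);
}

console.log(buildRequestUrl({query: "Napoleon Bonaparte", summaryOnly: true}));
```

Pasting the printed URL into a browser is a quick way to inspect the raw JSON that `wikiParser.parse` receives.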
23 package.json
@@ -0,0 +1,23 @@
+{
+ "name": "wikipedia-js",
+ "description": "A simple client to query wikipedia",
+ "version": "0.0.1",
+ "keywords": ["wikipedia", "wiki", "client", "search", "node", "node.js"],
+ "author": "kenshiro-o <kenshiro@kenshiro.me>",
+ "main": "lib/wikiClient",
+ "repository":{
+ "type": "git",
+ "url": "git://github.com/kenshiro-o/wikipedia-js.git"
+ },
+ "dependencies":{
+ "superagent": "0.12.x",
+ "cheerio": "0.10.x",
+ "underscore": "*",
+ "underscore.string": "2.3.x"
+ },
+ "devDependencies":{
+ "vows":"*",
+ "expect.js": "*"
+ },
+ "engines": {"node": ">= 0.8.x"}
+}
23 test/wikiClientTest.js
@@ -0,0 +1,23 @@
+var vows = require("vows"),
+ wikiClient = require("../lib/wikiClient"),
+ expect = require("expect.js"),
+ _ = require("underscore");
+
+_.str = require("underscore.string");
+_.mixin(_.str.exports());
+
+vows.describe("Wikipedia search checks").addBatch({
+ "When searching (in json) for Napoleon's wiki summary":{
+ topic: function(){
+ var options = {query: "Napoleon Bonaparte", format: "json", summaryOnly: true};
+ wikiClient.searchArticle(options, this.callback);
+ },
+
+ "A valid set of paragraphs is returned": function(err, response){
+ expect(err).to.be(null);
+ console.log(response);
+ expect(_(response).startsWith("<p><strong>Napoleon Bonaparte</strong>")).to.be(true);
+
+ }
+ }
+}).run();