
Add lang_links property to WikipediaPage #45

Closed · wants to merge 2 commits

Conversation

@jichu4n commented May 13, 2014

This change adds a lang_links property method to WikipediaPage, which returns a list of links to versions of the Wikipedia page in different languages. The result is a list of (language_prefix, title) tuples.

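As a rough illustration of what the proposed property computes (the sample response and the `extract_lang_links` helper below are hypothetical stand-ins, not code from this PR), it boils down to flattening a MediaWiki langlinks API response into (language_prefix, title) tuples:

```python
# Hypothetical sketch: flatten a MediaWiki langlinks response into
# (language_prefix, title) tuples. Sample data and names are stand-ins.
sample_response = {
    'query': {
        'pages': {
            '856': {
                'title': 'Wikipedia',
                'langlinks': [
                    {'lang': 'de', '*': 'Wikipedia'},
                    {'lang': 'fr', '*': u'Wikip\u00e9dia'},
                    {'lang': 'zh', '*': u'\u7ef4\u57fa\u767e\u79d1'},
                ],
            },
        },
    },
}

def extract_lang_links(response):
    links = []
    for page in response['query']['pages'].values():
        # pages without interlanguage links have no 'langlinks' key
        for langlink in page.get('langlinks', []):
            links.append((langlink['lang'], langlink['*']))
    return links

print(extract_lang_links(sample_response)[0])  # -> ('de', 'Wikipedia')
```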
@javierprovecho (Contributor)
Wouldn't it be better to return a list of (language_prefix, page_id, title) tuples, with the title optional?

@jichu4n (Author) commented May 13, 2014

Um - could you explain why it'd be better?

I chose (language_prefix, title) to be consistent with the `links` property method, which returns page titles rather than page IDs. But I'm open to suggestions :)

@javierprovecho (Contributor)

For important articles available in many languages, requesting by page ID makes an encoding error less likely than requesting by title. Numbers are unambiguous, more direct, and easier for Wikipedia to handle in future requests. I think you should provide both titles and page IDs. What do you think?

EDIT: before your contribution, I had developed a small function to obtain the langlinks.

# Langlinks (Python 2; getID is a separate helper, not shown, that resolves
# a (lang, title) pair to a page ID)
import json
import urllib
import urllib2

def langlinks(idioma, pageid):
    values = {
        'action'  : 'query',
        'prop'    : 'langlinks',
        'lllimit' : '500',
        'format'  : 'json',
        'pageids' : pageid
    }
    query = 'https://' + idioma + '.wikipedia.org/w/api.php?' + urllib.urlencode(values)
    response = urllib2.urlopen(query)
    jsondata = json.loads(response.read())
    result = {}
    for page in jsondata['query']['pages']:
        # pages with no interlanguage links have no 'langlinks' key
        for langlink in jsondata['query']['pages'][page].get('langlinks', []):
            result[langlink['lang']] = getID(langlink['lang'], langlink['*'])
    return result

@jichu4n (Author) commented May 15, 2014

So we have to make an API request for each (language, title) to get the corresponding page ID? For a popular page like "Wikipedia", that means >250 requests in this method. That's pretty expensive, especially if the user doesn't need to load the page in all of the languages.

On the other hand, to get the page ID, I'd be doing

set_lang(lang)
pageid = page(title=title).pageid

If the caller actually does want to access the content of the page in different languages, they will have to do the following:

for lang, pageid in my_page.lang_links:
    wikipedia.set_lang(lang)
    page_in_lang = wikipedia.page(pageid=pageid)

In other words, with this approach they'd end up sending a duplicate request for each language.

An alternative is to return WikipediaPage objects from lang_links; however, since each of the returned WikipediaPage objects would have a different language, but _wiki_request() always uses the global singleton language setting (API_URL), none of the property methods on the returned WikipediaPage objects would work.
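To make that failure mode concrete, here is a toy model of the problem (the names mirror the library, but the bodies are simplified stand-ins, not the real implementation): a page object created under one language silently queries another once the global changes.

```python
# Toy model: _wiki_request() builds its target from a module-level API_URL,
# so every page object created earlier follows the global after set_lang().
API_URL = 'https://en.wikipedia.org/w/api.php'

def set_lang(prefix):
    global API_URL
    API_URL = 'https://%s.wikipedia.org/w/api.php' % prefix

def _wiki_request():
    return API_URL  # the real function would issue an HTTP request here

class FakePage(object):
    @property
    def request_url(self):
        # every property method funnels through the shared global
        return _wiki_request()

page = FakePage()        # conceptually an 'en' page
set_lang('zh')
print(page.request_url)  # -> https://zh.wikipedia.org/w/api.php (not 'en'!)
```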

@jichu4n (Author) commented May 15, 2014

Or, we can refactor WikipediaPage to be language-aware - i.e., we can add a `lang` field to WikipediaPage, and change page(), _wiki_request(), and the WikipediaPage property methods to accept lang as an argument rather than relying on a global language setting. That would also fix the following problem, which we have today:

wikipedia.set_lang('en')
my_page = wikipedia.page('Wikipedia')
wikipedia.set_lang('zh')
print(my_page.summary)    # Does NOT work - raises KeyError

It'd be cleaner, but it'd be a much bigger change...
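A minimal sketch of that language-aware alternative (again a simplified stand-in with hypothetical names, not the actual refactor): each page captures its own lang at construction time, so later changes to the default can't break existing objects.

```python
# Sketch: each WikipediaPage-like object carries its own lang, so requests
# no longer depend on module state. Hypothetical simplified names.
DEFAULT_LANG = 'en'

class LangAwarePage(object):
    def __init__(self, title, lang=None):
        self.title = title
        self.lang = lang or DEFAULT_LANG  # default captured at construction

    def _wiki_request(self):
        return 'https://%s.wikipedia.org/w/api.php' % self.lang

en_page = LangAwarePage('Wikipedia')            # picks up the 'en' default
zh_page = LangAwarePage('Wikipedia', lang='zh')
DEFAULT_LANG = 'zh'  # changing the default no longer affects existing pages
print(en_page._wiki_request())  # -> https://en.wikipedia.org/w/api.php
```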

@goldsmith (Owner)

Alright, I just did some refactoring and pushed out a new version, so I'm not sure whether this PR will still apply cleanly. Thanks for the awesome discussion so far, @jichu4n and @javierprovecho. I think we should now add I18N support more formally (the first version of support was pretty hacky, as you can tell).

I'm definitely in favor of making lang an argument to the module functions (including page()), instead of relying on a stateful global setting that leads to weird edge-case bugs. That said, you should still be able to set a default language to avoid having to pass that argument into every call.

So, here's my full proposal (please let me know what you think):

  • refactor each of API_LANG, RATE_LIMIT, RATE_LIMIT_LAST_CALL, RATE_LIMIT_MIN_WAIT, and USER_AGENT into an explicit SETTINGS dictionary
    that is passed into each _wiki_request call
  • keep the set_lang function, which sets SETTINGS['API_LANG'] to a default language code
  • add a decorator called @explicit_settings that looks something like this:
def explicit_settings(fn):
  SETTINGS_KEYS = ('api_lang', 'rate_limit', 'rate_limit_min_wait', 'user_agent')

  def wrapped_fn(*args, **kwargs):
    global SETTINGS
    # iterate over a copy of the keys, since we pop from kwargs as we go
    for key in list(kwargs):
      if key not in SETTINGS_KEYS:
        continue

      # usually this will just be an assignment, but
      # for rate limiting there is some state to modify as well
      set_setting(key, kwargs.pop(key), SETTINGS)

    kwargs['SETTINGS'] = SETTINGS
    return fn(*args, **kwargs)
  return wrapped_fn

Then, if we apply that decorator to each of the relevant wikipedia methods and make sure they pass their SETTINGS argument into _wiki_request, we can do things like wikipedia.page('New York', lang='fr') and it will work automatically.
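Here's a self-contained sketch of that decorator idea (set_setting(), the example page() body, and the lowercase `settings` kwarg are assumptions filled in for illustration; the proposal above passes the dict as kwargs['SETTINGS']):

```python
# Self-contained sketch of the @explicit_settings idea: recognized keyword
# arguments are folded into a settings dict that is then passed along.
SETTINGS = {'api_lang': 'en', 'rate_limit': False,
            'rate_limit_min_wait': None, 'user_agent': 'example-agent'}

def set_setting(key, value, settings):
    # usually just an assignment; rate limiting would also reset state here
    settings[key] = value

def explicit_settings(fn):
    SETTINGS_KEYS = ('api_lang', 'rate_limit', 'rate_limit_min_wait', 'user_agent')

    def wrapped_fn(*args, **kwargs):
        for key in list(kwargs):  # copy the keys, since we pop as we go
            if key in SETTINGS_KEYS:
                set_setting(key, kwargs.pop(key), SETTINGS)
        kwargs['settings'] = SETTINGS
        return fn(*args, **kwargs)
    return wrapped_fn

@explicit_settings
def page(title, settings=None):
    # a real page() would pass `settings` on to _wiki_request()
    return (title, settings['api_lang'])

print(page('New York', api_lang='fr'))  # -> ('New York', 'fr')
```

Note that this sketch still mutates the module-level SETTINGS, so per-call overrides persist as the new defaults; copying the dict inside wrapped_fn would make overrides truly per-call.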

What do you guys think?
