New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added API for oSUMO. #1574

Closed
wants to merge 10 commits into
base: master
from

Conversation

Projects
None yet
3 participants
@shuhaowu
Copy link
Contributor

shuhaowu commented Aug 18, 2013

For more information, see #1444

It's all rebased and stuff!

Should be fairly stable.

One thing:

  • If you run a test. You need to run ./manage.py cron build_kb_bundles. Probably should get this resolved so that the code uses a different redis bin in test mode. (How?)

@shuhaowu shuhaowu referenced this pull request Aug 19, 2013

Closed

Offline SUMO Kitsune API #1444

7 of 7 tasks complete
def find_word_locations_with_spaces(s):
"""Builds an index in the format of {word: location}.
This is a english like search. For languages without spaces to

This comment has been minimized.

@mythmon

mythmon Aug 19, 2013

Member

english should be English

elif c in '.!?': # We want to treat . as a big stop. Add two space.
words.append(u'')
words.append(u'')
elif c in '\'"[]1234567890/\\()_':

This comment has been minimized.

@mythmon

mythmon Aug 19, 2013

Member

Several of these characters are in string.punctuation, which makes this if-tree misleading. The should be removed: '"[]\/()_. In other words, everything except numbers.

elif c in u'。!?':
words.append(u'')
words.append(u'')
elif c in u'\'"[]1234567890/\\()_()【】『』、¥《》’‘”“':

This comment has been minimized.

@mythmon

mythmon Aug 19, 2013

Member

This if-tree has similar issues to the one above. You are duplicating characters from above. Also, I imagine that English-style punctuation should also be included in line 57.

simple and intuitive NLP technique that scores words in a document
given a corpus based on how important this word is.
A full explanation of this is provided at <insert url when ready>.

This comment has been minimized.

@mythmon

mythmon Aug 19, 2013

Member

<insert url when ready>

def _f(self, term, doc_id):
return self.local_word_freq[doc_id][term]

# Algorithm adapted from wikipedia.

This comment has been minimized.

@mythmon

mythmon Aug 19, 2013

Member

This comment should be a docstring.

o = self._f(term, doc_id) / max(self.local_word_freq[doc_id].values())
return 0.5 + (0.5 * o)

# Wikipedia is amazing

This comment has been minimized.

@mythmon

mythmon Aug 19, 2013

Member

This comment should also be a doc string.

@@ -0,0 +1,9 @@
from django.conf.urls import patterns, url

# Note tha these url do not get considered into the locale middleware.

This comment has been minimized.

@mythmon

mythmon Aug 19, 2013

Member

Typo: tha -> that.


def transform_html(dochtml):
"""

This comment has been minimized.

@mythmon

mythmon Aug 19, 2013

Member

This docstring is malformed.


# This is because the database format is actually meant to have all of this
# in a list format
if 'locales' in merged_bundle:

This comment has been minimized.

@mythmon

mythmon Aug 19, 2013

Member

These 4 if statements might be better off as a loop:

for key in ['locales', 'topics', 'docs', 'indexes']:
    merged_bundle[key] = merged_bundle[key].values()

You might also use something similar above.

return bundle, bundle_hash


def cors_enabled(origin, methods=['GET']):

This comment has been minimized.

@mythmon

mythmon Aug 19, 2013

Member

We normally put decorators in a file named decorators.py.

bundle_hash = redis.hget(name, 'hash')

# redis.hget could return none if it does not exist.
# if redis is not available, toss_bundle won't actually toss it.

This comment has been minimized.

@mythmon

mythmon Aug 19, 2013

Member

"toss" is a pretty ill defined word here. Instead of "won't actually toss it", you might say something like "won't be able to insert it into Redis."

@@ -315,6 +324,7 @@
'api',
'favicon.ico',
'media',
'offline',

This comment has been minimized.

@mythmon

mythmon Aug 19, 2013

Member

For other reviewers: This is in the section about locale-immune urls.

@@ -490,6 +500,7 @@
'kitsune.karma',
'kitsune.tags',
'kitsune.kpi',
'kitsune.offline',

This comment has been minimized.

@mythmon

mythmon Aug 19, 2013

Member

For other reviewers: This is INSTALLED_APPS.

@mythmon

This comment has been minimized.

Copy link
Member

mythmon commented Aug 19, 2013

Some issues, mostly minor. Your Redis issues will go away when #1568 goes away. We already have a mechanism for using different Redises in tests, but since settings_test.py isn't being read for tests right now, it isn't being activated.

I'd like it if @rlr or @willkg could look at this in regards to DRF. Is it ok to be writing all this API view stuff like this, or would it be easier/better to do it with DRF?

@mythmon

This comment has been minimized.

Copy link
Member

mythmon commented Aug 19, 2013

And #1568 just got closed, so you could rebase and hopefully pick up those if you want.

@willkg

This comment has been minimized.

Copy link
Member

willkg commented Aug 19, 2013

I'll try to look at this before the end of Tuesday.

# This is at least the punctuations in Chinese.
elif c in u'。!?':
words.append(u'')
words.append(u'')

This comment has been minimized.

@willkg

willkg Aug 20, 2013

Member

Why append two empty strings here?

This comment has been minimized.

@shuhaowu

shuhaowu Aug 20, 2013

Contributor

This is a future proof routine in case we want to add a better indexer (especially for languages like korean, japanese, and chinese). This allows a bigger gap between the two words so they will score lower if two search terms near a period is found.

Example:

search. term will score lower than search term as their location is further apart.

elif _alpha_regex.match(c) is not None:
words.append(c)
else:
continue # Something weird, but it is totally okay

This comment has been minimized.

@willkg

willkg Aug 20, 2013

Member

Why skip this character? Doesn't this cause the locations to be off?

This comment has been minimized.

@shuhaowu

shuhaowu Aug 20, 2013

Contributor

It might. I'm not sure what to make of these characters. Most of them that I've seen are unicode variations of the dash. Or even invisible characters.

It should not affect the indexing that much. Though if there are better ideas I am all up for it.

This comment has been minimized.

@mythmon

mythmon Aug 20, 2013

Member

Also, the locations are of tokens, not of characters in the source string. So skipping over characters that aren't used doesn't really affect the locations.

@@ -0,0 +1,298 @@
# -*- coding: utf-8 -*-

This comment has been minimized.

@willkg

willkg Aug 20, 2013

Member

You need to rename this module to test__utils.py (two underscores).

This comment has been minimized.

@shuhaowu

shuhaowu Aug 20, 2013

Contributor

I thought test__utils are where we put utility functions for tests. Am I wrong?

This comment has been minimized.

@willkg

willkg Aug 20, 2013

Member

"test_utils" is a package that we use that has stuff in it. Because of that, we can't have other things named "test_utils". It needs to be "test__utils" with two underscores.

@cors_enabled('*')
def get_bundle(request):
if 'locale' not in request.GET or 'product' not in request.GET:
return HttpResponseBadRequest()

This comment has been minimized.

@willkg

willkg Aug 20, 2013

Member

Does this return an HTML page? If so, it's kind of janky to return an HTML page for a JSON API.

redis.hset(name, 'hash', bundle_hash)
redis.hset(name, 'bundle', bundle)

return bundle, bundle_hash

This comment has been minimized.

@willkg

willkg Aug 20, 2013

Member

How big are these bundles? Is Redis the best place to store them? I don't think we've stored large things in Redis up to now.

This comment has been minimized.

@willkg

willkg Aug 20, 2013

Member

Also, I don't see any code for removing things from Redis. Does this just accumulate bundles forever or is there some cleanup code I missed?

This comment has been minimized.

@willkg

willkg Aug 20, 2013

Member

Scratch that last one--I see the key now.

This comment has been minimized.

@shuhaowu

shuhaowu Aug 20, 2013

Contributor

The bundle is about 300MB in size, total.

This is used in both the cron job and the view.
"""
bundle = json.dumps(bundle)
bundle_hash = sha1(bundle).hexdigest() # track version

This comment has been minimized.

@willkg

willkg Aug 20, 2013

Member

Why use a hash and not a datetimestamp of when it was built?

This comment has been minimized.

@shuhaowu

shuhaowu Aug 20, 2013

Contributor

When we build it, the content may not have been changed at all. This is to address that possibility.

This comment has been minimized.

@willkg

willkg Aug 20, 2013

Member

Does that happen often?

This comment has been minimized.

@shuhaowu

shuhaowu Aug 21, 2013

Contributor

I don't know if it happens or not. Are there days where we don't get any changes to any articles for a particular product and locale?

bundle = json.dumps(bundle)
bundle_hash = sha1(bundle).hexdigest() # track version

if redis:

This comment has been minimized.

@willkg

willkg Aug 20, 2013

Member

This puzzles me. Why would redis be Falsey here?

This comment has been minimized.

@shuhaowu

shuhaowu Aug 20, 2013

Contributor

Is Redis a hard requirement now or still a recommended setup? This is to address the case where people don't have redis installed.

This comment has been minimized.

@willkg

willkg Aug 20, 2013

Member

It's a requirement for production. If you have tests that run through this code and Redis isn't available, then they should throw SkipTest.

That still doesn't really answer my question, though. Why would the argument be Falsey? Is the caller sending None in here? Seems like this function is named one thing and does another.

This comment has been minimized.

@shuhaowu

shuhaowu Aug 20, 2013

Contributor

In the event of a RedisError (I assume that means there is no redis on the dev machine), it is possible that redis that's passed in will be None. See https://github.com/mozilla/kitsune/pull/1574/files#L10R37

That might not be the optimal behaviour. Right it is written so that bundles can be downloaded even without redis. Checking updates from the app should just ignore that output.

'key': doc_key(doc.locale, doc.slug),
'title': doc.title,
'archived': True,
'slug': doc.slug

This comment has been minimized.

@willkg

willkg Aug 20, 2013

Member

Why is this return dict different than the one where the document is not archived? Why aren't they the same except for the archived value?

This comment has been minimized.

@shuhaowu

shuhaowu Aug 20, 2013

Contributor

We could probably do this. Though I thought since archived docs are probably out of date, the html and what not takes up more space if we were to put it into the bundle, and hence increasing the download size (I did not measure this effect, however.)

This comment has been minimized.

@willkg

willkg Aug 20, 2013

Member

So, I'm asking questions, not implying you need to do things. It's really hard to understand why you're doing what you're doing in this code.

This comment has been minimized.

@shuhaowu

shuhaowu Aug 20, 2013

Contributor

Yeah, a comment might help here. The primary reason is to save space, nothing more.

docs = Document.objects.filter(
locale=locale,
is_template=False,
category__in=(CATEGORIES[0][0], CATEGORIES[1][0])

This comment has been minimized.

@willkg

willkg Aug 20, 2013

Member

This is way easier to think about if you instead did:

category__in=(TROUBLESHOOTING_CATEGORY, HOW_TO_CATEGORY)
for text, boost in texts:

locations = get_locations(text)
for w, l in locations.iteritems():

This comment has been minimized.

@willkg

willkg Aug 20, 2013

Member

Nit: I'm a huge fan of never using lower-case-l as a variable name because it looks too much like a 1 in most fonts.

Adapted from Wikipedia.
idf(t, D) = \log \\frac{|D|}{|{d \in D : t \in D}|}
"""
appearance = 0 # Avoid division by 0 problem

This comment has been minimized.

@willkg

willkg Aug 20, 2013

Member

I don't understand this comment. How does this line avoid division by 0?

This comment has been minimized.

@shuhaowu

shuhaowu Aug 20, 2013

Contributor

Oops. That was left in. Division by 0 in our use case should never occur. I forgot the take the comment out.

appearance = 0 # Avoid division by 0 problem
for doc_id, words in self.local_word_freq.iteritems():
appearance += 1 if term in words else 0
# Add a 1 so we are approximately the same.. not really

This comment has been minimized.

@willkg

willkg Aug 20, 2013

Member

I don't understand this comment at all. Can you clarify it?

This comment has been minimized.

@shuhaowu

shuhaowu Aug 20, 2013

Contributor

Same as the div by zero comment.

if topic:
doc.topics.add(topic)

return doc, expected

This comment has been minimized.

@willkg

willkg Aug 20, 2013

Member

This doesn't need to be an instance method. You could make it a module function and save a bunch of characters.

'locale': expected_locale_doc,
'topic1': expected_topic1,
'topic2': expected_topic2
}

This comment has been minimized.

@willkg

willkg Aug 20, 2013

Member

This doesn't need to be an instance method. You could make it a module function and save a bunch of characters.

def feed(self, doc_id, texts, get_locations):
self.doc_count += 1
if doc_id in self.local_word_freq:
return

This comment has been minimized.

@willkg

willkg Aug 20, 2013

Member

What does this if statement do? Is it possible for a document to be in the feed twice?

This comment has been minimized.

@shuhaowu

shuhaowu Aug 20, 2013

Contributor

This shouldn't happen. I'm also not too sure why I left this in. I'll investigate to make sure.

This comment has been minimized.

@shuhaowu

shuhaowu Aug 23, 2013

Contributor

Looks like this is something left in by accident. Is removed.


serialized_doc = serialize_document_for_offline(doc)

# Only non-archived documents need to be indexed.

This comment has been minimized.

@willkg

willkg Aug 20, 2013

Member

If only non-archived documents are indexed, why don't we weed out all the archived documents altogether?

This comment has been minimized.

@shuhaowu

shuhaowu Aug 20, 2013

Contributor

Yes.

bundle, bundle_hash = insert_bundle_into_redis(redis,
product.slug,
locale,
bundle)

This comment has been minimized.

@willkg

willkg Aug 20, 2013

Member

If we're rebuilding the bundles every night, why should we build the bundle in get_bundle? Doesn't it take too long to build bundles to do it in a get request?

This comment has been minimized.

@shuhaowu

shuhaowu Aug 20, 2013

Contributor

When we deploy this, the bundles will not be in redis (or anytime we fire up a dev server without building it). This builds it and then insert it into redis.



@cors_enabled('*')
def bundle_meta(request):

This comment has been minimized.

@willkg

willkg Aug 20, 2013

Member

What's the purpose of this view?

This comment has been minimized.

@shuhaowu

shuhaowu Aug 20, 2013

Contributor

For updating. The client checks for updates by seeing if the hash on the server differs from the local copy. This way we don't need to download everything and is very light.

This comment has been minimized.

@willkg

willkg Aug 20, 2013

Member

Please add a docstring explaining what this view is for.

Also, maybe it makes sense to use E-tag and HEAD requests.

This comment has been minimized.

@shuhaowu

shuhaowu Aug 20, 2013

Contributor

For ETag, it might make sense. I'm not yet sure how that would interact with CORS (shouldn't be a problem.. but)

response['Access-Control-Allow-Origin'] = origin
return response
return decorated_func
return decorator

This comment has been minimized.

@willkg

willkg Aug 20, 2013

Member

This isn't available in any of the libs Kitsune uses or somewhere else? I'm kind of loathe to maintain our own CORS middleware if we don't have to.

This comment has been minimized.

@shuhaowu

shuhaowu Aug 20, 2013

Contributor

I'm not sure (and I'm not sure where to look). This can be pretty easily replaced if one is available, though.

@shuhaowu

This comment has been minimized.

Copy link
Contributor

shuhaowu commented Aug 21, 2013

Is there anything else that I have to fix?

@shuhaowu

This comment has been minimized.

Copy link
Contributor

shuhaowu commented Aug 23, 2013

I've added an admin view and removed all redis check and generation on view load.

The arguments to this function must be strings. This key is used
for the index.
"""
return locale + '~' + product_slug

This comment has been minimized.

@mythmon

mythmon Aug 26, 2013

Member

Using a tilde here is a little odd. This is mainly used in Redis keys, which traditionally use colons for separators. And in other parts of the code you even combine this with other keys with a colon, so you end up with keys like foo:bar~baz, which seems really strange. In other words: I think you should use a colon here instead of a squiggle.

This comment has been minimized.

@shuhaowu

shuhaowu Aug 26, 2013

Contributor

It is a bit odd, I was trying to find the least likely character conflict with as a separator as that gets passed to the offline app. The osumo: is just a name space as we are using the default redis database

@mythmon

This comment has been minimized.

Copy link
Member

mythmon commented Sep 5, 2013

This landed in cf8ccaa, r=me.

@mythmon mythmon closed this Sep 5, 2013

@shuhaowu shuhaowu deleted the shuhaowu:osumo branch Sep 6, 2013

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment