mozilla/kitsune

Closed
wants to merge 10 commits into
from

Conversation

Projects
None yet
3 participants
Contributor

shuhaowu commented Aug 18, 2013

 For more information, see #1444 It's all rebased and stuff! Should be fairly stable. One thing: If you run a test. You need to run ./manage.py cron build_kb_bundles. Probably should get this resolved so that the code uses a different redis bin in test mode. (How?)

Closed

mythmon reviewed Aug 19, 2013

 def find_word_locations_with_spaces(s): """Builds an index in the format of {word: location}. This is a english like search. For languages without spaces to

mythmon Aug 19, 2013

Member

english should be English

mythmon reviewed Aug 19, 2013

 elif c in '.!?': # We want to treat . as a big stop. Add two space. words.append(u'') words.append(u'') elif c in '\'"[]1234567890/\\()_':

mythmon Aug 19, 2013

Member

Several of these characters are in string.punctuation, which makes this if-tree misleading. The should be removed: '"[]\/()_. In other words, everything except numbers.

mythmon reviewed Aug 19, 2013

 elif c in u'。！？': words.append(u'') words.append(u'') elif c in u'\'"[]1234567890/\\()_（）【】『』、￥《》’‘”“':

mythmon Aug 19, 2013

Member

This if-tree has similar issues to the one above. You are duplicating characters from above. Also, I imagine that English-style punctuation should also be included in line 57.

mythmon reviewed Aug 19, 2013

 simple and intuitive NLP technique that scores words in a document given a corpus based on how important this word is. A full explanation of this is provided at .

mythmon Aug 19, 2013

Member

<insert url when ready>

mythmon reviewed Aug 19, 2013

 def _f(self, term, doc_id): return self.local_word_freq[doc_id][term] # Algorithm adapted from wikipedia.

mythmon Aug 19, 2013

Member

This comment should be a docstring.

mythmon reviewed Aug 19, 2013

 o = self._f(term, doc_id) / max(self.local_word_freq[doc_id].values()) return 0.5 + (0.5 * o) # Wikipedia is amazing

mythmon Aug 19, 2013

Member

This comment should also be a doc string.

mythmon reviewed Aug 19, 2013

 @@ -0,0 +1,9 @@ from django.conf.urls import patterns, url # Note tha these url do not get considered into the locale middleware.

mythmon Aug 19, 2013

Member

Typo: tha -> that.

mythmon reviewed Aug 19, 2013

 def transform_html(dochtml): """

mythmon Aug 19, 2013

Member

This docstring is malformed.

mythmon reviewed Aug 19, 2013

 # This is because the database format is actually meant to have all of this # in a list format if 'locales' in merged_bundle:

mythmon Aug 19, 2013

Member

These 4 if statements might be better off as a loop:

for key in ['locales', 'topics', 'docs', 'indexes']:
merged_bundle[key] = merged_bundle[key].values()

You might also use something similar above.

mythmon reviewed Aug 19, 2013

 return bundle, bundle_hash def cors_enabled(origin, methods=['GET']):

mythmon Aug 19, 2013

Member

We normally put decorators in a file named decorators.py.

mythmon reviewed Aug 19, 2013

 bundle_hash = redis.hget(name, 'hash') # redis.hget could return none if it does not exist. # if redis is not available, toss_bundle won't actually toss it.

mythmon Aug 19, 2013

Member

"toss" is a pretty ill defined word here. Instead of "won't actually toss it", you might say something like "won't be able to insert it into Redis."

mythmon reviewed Aug 19, 2013

 @@ -315,6 +324,7 @@ 'api', 'favicon.ico', 'media', 'offline',

mythmon Aug 19, 2013

Member

For other reviewers: This is in the section about locale-immune urls.

mythmon reviewed Aug 19, 2013

 @@ -490,6 +500,7 @@ 'kitsune.karma', 'kitsune.tags', 'kitsune.kpi', 'kitsune.offline',

mythmon Aug 19, 2013

Member

For other reviewers: This is INSTALLED_APPS.

Member

mythmon commented Aug 19, 2013

 Some issues, mostly minor. Your Redis issues will go away when #1568 goes away. We already have a mechanism for using different Redises in tests, but since settings_test.py isn't being read for tests right now, it isn't being activated. I'd like it if @rlr or @willkg could look at this in regards to DRF. Is it ok to be writing all this API view stuff like this, or would it be easier/better to do it with DRF?
Member

mythmon commented Aug 19, 2013

 And #1568 just got closed, so you could rebase and hopefully pick up those if you want.
Member

willkg commented Aug 19, 2013

 I'll try to look at this before the end of Tuesday.

willkg reviewed Aug 20, 2013

 # This is at least the punctuations in Chinese. elif c in u'。！？': words.append(u'') words.append(u'')

willkg Aug 20, 2013

Member

Why append two empty strings here?

shuhaowu Aug 20, 2013

Contributor

This is a future proof routine in case we want to add a better indexer (especially for languages like korean, japanese, and chinese). This allows a bigger gap between the two words so they will score lower if two search terms near a period is found.

Example:

search. term will score lower than search term as their location is further apart.

willkg reviewed Aug 20, 2013

 elif _alpha_regex.match(c) is not None: words.append(c) else: continue # Something weird, but it is totally okay

willkg Aug 20, 2013

Member

Why skip this character? Doesn't this cause the locations to be off?

shuhaowu Aug 20, 2013

Contributor

It might. I'm not sure what to make of these characters. Most of them that I've seen are unicode variations of the dash. Or even invisible characters.

It should not affect the indexing that much. Though if there are better ideas I am all up for it.

mythmon Aug 20, 2013

Member

Also, the locations are of tokens, not of characters in the source string. So skipping over characters that aren't used doesn't really affect the locations.

willkg reviewed Aug 20, 2013

 @@ -0,0 +1,298 @@ # -*- coding: utf-8 -*-

willkg Aug 20, 2013

Member

You need to rename this module to test__utils.py (two underscores).

shuhaowu Aug 20, 2013

Contributor

I thought test__utils are where we put utility functions for tests. Am I wrong?

willkg Aug 20, 2013

Member

"test_utils" is a package that we use that has stuff in it. Because of that, we can't have other things named "test_utils". It needs to be "test__utils" with two underscores.

willkg reviewed Aug 20, 2013

 @cors_enabled('*') def get_bundle(request): if 'locale' not in request.GET or 'product' not in request.GET: return HttpResponseBadRequest()

willkg Aug 20, 2013

Member

Does this return an HTML page? If so, it's kind of janky to return an HTML page for a JSON API.

willkg reviewed Aug 20, 2013

 redis.hset(name, 'hash', bundle_hash) redis.hset(name, 'bundle', bundle) return bundle, bundle_hash

willkg Aug 20, 2013

Member

How big are these bundles? Is Redis the best place to store them? I don't think we've stored large things in Redis up to now.

willkg Aug 20, 2013

Member

Also, I don't see any code for removing things from Redis. Does this just accumulate bundles forever or is there some cleanup code I missed?

willkg Aug 20, 2013

Member

Scratch that last one--I see the key now.

shuhaowu Aug 20, 2013

Contributor

The bundle is about 300MB in size, total.

willkg reviewed Aug 20, 2013

 This is used in both the cron job and the view. """ bundle = json.dumps(bundle) bundle_hash = sha1(bundle).hexdigest() # track version

willkg Aug 20, 2013

Member

Why use a hash and not a datetimestamp of when it was built?

shuhaowu Aug 20, 2013

Contributor

When we build it, the content may not have been changed at all. This is to address that possibility.

willkg Aug 20, 2013

Member

Does that happen often?

shuhaowu Aug 21, 2013

Contributor

I don't know if it happens or not. Are there days where we don't get any changes to any articles for a particular product and locale?

willkg reviewed Aug 20, 2013

 bundle = json.dumps(bundle) bundle_hash = sha1(bundle).hexdigest() # track version if redis:

willkg Aug 20, 2013

Member

This puzzles me. Why would redis be Falsey here?

shuhaowu Aug 20, 2013

Contributor

Is Redis a hard requirement now or still a recommended setup? This is to address the case where people don't have redis installed.

willkg Aug 20, 2013

Member

It's a requirement for production. If you have tests that run through this code and Redis isn't available, then they should throw SkipTest.

That still doesn't really answer my question, though. Why would the argument be Falsey? Is the caller sending None in here? Seems like this function is named one thing and does another.

shuhaowu Aug 20, 2013

Contributor

In the event of a RedisError (I assume that means there is no redis on the dev machine), it is possible that redis that's passed in will be None. See https://github.com/mozilla/kitsune/pull/1574/files#L10R37

That might not be the optimal behaviour. Right it is written so that bundles can be downloaded even without redis. Checking updates from the app should just ignore that output.

willkg reviewed Aug 20, 2013

 'key': doc_key(doc.locale, doc.slug), 'title': doc.title, 'archived': True, 'slug': doc.slug

willkg Aug 20, 2013

Member

Why is this return dict different than the one where the document is not archived? Why aren't they the same except for the archived value?

shuhaowu Aug 20, 2013

Contributor

We could probably do this. Though I thought since archived docs are probably out of date, the html and what not takes up more space if we were to put it into the bundle, and hence increasing the download size (I did not measure this effect, however.)

willkg Aug 20, 2013

Member

So, I'm asking questions, not implying you need to do things. It's really hard to understand why you're doing what you're doing in this code.

shuhaowu Aug 20, 2013

Contributor

Yeah, a comment might help here. The primary reason is to save space, nothing more.

willkg reviewed Aug 20, 2013

 docs = Document.objects.filter( locale=locale, is_template=False, category__in=(CATEGORIES[0][0], CATEGORIES[1][0])

willkg Aug 20, 2013

Member

category__in=(TROUBLESHOOTING_CATEGORY, HOW_TO_CATEGORY)


willkg reviewed Aug 20, 2013

 for text, boost in texts: locations = get_locations(text) for w, l in locations.iteritems():

willkg Aug 20, 2013

Member

Nit: I'm a huge fan of never using lower-case-l as a variable name because it looks too much like a 1 in most fonts.

willkg reviewed Aug 20, 2013

 Adapted from Wikipedia. idf(t, D) = \log \\frac{|D|}{|{d \in D : t \in D}|} """ appearance = 0 # Avoid division by 0 problem

willkg Aug 20, 2013

Member

I don't understand this comment. How does this line avoid division by 0?

shuhaowu Aug 20, 2013

Contributor

Oops. That was left in. Division by 0 in our use case should never occur. I forgot the take the comment out.

willkg reviewed Aug 20, 2013

 appearance = 0 # Avoid division by 0 problem for doc_id, words in self.local_word_freq.iteritems(): appearance += 1 if term in words else 0 # Add a 1 so we are approximately the same.. not really

willkg Aug 20, 2013

Member

I don't understand this comment at all. Can you clarify it?

shuhaowu Aug 20, 2013

Contributor

Same as the div by zero comment.

willkg reviewed Aug 20, 2013

 if topic: doc.topics.add(topic) return doc, expected

willkg Aug 20, 2013

Member

This doesn't need to be an instance method. You could make it a module function and save a bunch of characters.

willkg reviewed Aug 20, 2013

 'locale': expected_locale_doc, 'topic1': expected_topic1, 'topic2': expected_topic2 }

willkg Aug 20, 2013

Member

This doesn't need to be an instance method. You could make it a module function and save a bunch of characters.

willkg reviewed Aug 20, 2013

 def feed(self, doc_id, texts, get_locations): self.doc_count += 1 if doc_id in self.local_word_freq: return

willkg Aug 20, 2013

Member

What does this if statement do? Is it possible for a document to be in the feed twice?

shuhaowu Aug 20, 2013

Contributor

This shouldn't happen. I'm also not too sure why I left this in. I'll investigate to make sure.

shuhaowu Aug 23, 2013

Contributor

Looks like this is something left in by accident. Is removed.

willkg reviewed Aug 20, 2013

 serialized_doc = serialize_document_for_offline(doc) # Only non-archived documents need to be indexed.

willkg Aug 20, 2013

Member

If only non-archived documents are indexed, why don't we weed out all the archived documents altogether?

Contributor

Yes.

willkg reviewed Aug 20, 2013

 bundle, bundle_hash = insert_bundle_into_redis(redis, product.slug, locale, bundle)

willkg Aug 20, 2013

Member

If we're rebuilding the bundles every night, why should we build the bundle in get_bundle? Doesn't it take too long to build bundles to do it in a get request?

shuhaowu Aug 20, 2013

Contributor

When we deploy this, the bundles will not be in redis (or anytime we fire up a dev server without building it). This builds it and then insert it into redis.

willkg reviewed Aug 20, 2013

 @cors_enabled('*') def bundle_meta(request):

willkg Aug 20, 2013

Member

What's the purpose of this view?

shuhaowu Aug 20, 2013

Contributor

For updating. The client checks for updates by seeing if the hash on the server differs from the local copy. This way we don't need to download everything and is very light.

willkg Aug 20, 2013

Member

Also, maybe it makes sense to use E-tag and HEAD requests.

shuhaowu Aug 20, 2013

Contributor

For ETag, it might make sense. I'm not yet sure how that would interact with CORS (shouldn't be a problem.. but)

willkg reviewed Aug 20, 2013

 response['Access-Control-Allow-Origin'] = origin return response return decorated_func return decorator

willkg Aug 20, 2013

Member

This isn't available in any of the libs Kitsune uses or somewhere else? I'm kind of loathe to maintain our own CORS middleware if we don't have to.

shuhaowu Aug 20, 2013

Contributor

I'm not sure (and I'm not sure where to look). This can be pretty easily replaced if one is available, though.

Contributor

shuhaowu commented Aug 21, 2013

 Is there anything else that I have to fix?

shuhaowu added some commits Jun 3, 2013

 Added API for oSUMO. 
For more information, see #1444

Added URL for get language.
 bb3d18e 
 Some changes after feedback. 
 d5b0f6d 
 Added url. 
 453a0a7 
 Changed view to always return json. 
 91f78a0 
 Minor changes. 
 ffe9783 
 Fixed some stuff according to comments. 
 913dd9d 
 Added doc string. 
 e153316 
 Removed unused code 
 a95af1b 
 Added admin view to generate bundles 
Deleted falsy redis.
 91c87a4 
Contributor

mythmon reviewed Aug 26, 2013

 The arguments to this function must be strings. This key is used for the index. """ return locale + '~' + product_slug

mythmon Aug 26, 2013

Member

Using a tilde here is a little odd. This is mainly used in Redis keys, which traditionally use colons for separators. And in other parts of the code you even combine this with other keys with a colon, so you end up with keys like foo:bar~baz, which seems really strange. In other words: I think you should use a colon here instead of a squiggle.

shuhaowu Aug 26, 2013

Contributor

It is a bit odd, I was trying to find the least likely character conflict with as a separator as that gets passed to the offline app. The osumo: is just a name space as we are using the default redis database

 Added more comments/docstrings 
 33dc898 
Member

mythmon commented Sep 5, 2013

 This landed in cf8ccaa, r=me.