Fix bug# 67471678 - Support If-Modified-Since. #115
srinicodebytes wants to merge 6 commits into master
Thanks for your pull request. It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). 📝 Please visit https://cla.developers.google.com/ to sign. Once you've signed, please reply here (e.g.
wiarlawd left a comment
Partial review. I haven't looked at the tests yet.
| log.finer("queried for stream"); | ||
| boolean hasAction | ||
| = hasColumn(rs.getMetaData(), GsaSpecialColumns.GSA_ACTION); | ||
| boolean hasTimestamp = |
This is in between getting hasAction and logging hasAction, and we're not logging hasTimestamp. Either get/log/get/log or get/get/log/log.
| rs.getTimestamp(GsaSpecialColumns.GSA_TIMESTAMP.toString(), | ||
| updateTimestampTimezone); | ||
| if ((ts != null) | ||
| && (lastDocIdTimestamp == null || ts.after(lastDocIdTimestamp))) { |
What's the merit of checking the cache here, and what's a plausible scenario where ts < lastDocIdTimestamp?
I was trying to check whether ts is newer than the cached value before updating the cache. ts won't be < lastDocIdTimestamp, but it can be the same as lastDocIdTimestamp. The additional check is not required if I always update the cache.
Removed the check.
| private static final HashMap<Integer, String> sqlTypeNames = new HashMap<>(); | ||
| /** Cache for DocId and last modified time stamp. */ | ||
| private static Cache<DocId, Timestamp> docIdLastModifiedMap; |
s/docId//; I know it's a map from DocId to Timestamp, but it's really a cache of last modified dates. The "docId" feels like noise. I could see using "Timestamp" consistently, since that's the name of the column, so maybe "lastModifiedCache" or "timestampCache"?
Thx, meant to change the name from "..Map" to "..Cache". Removed "docId" prefix as suggested.
| Timestamp lastDocIdTimestamp = docIdLastModifiedMap.getIfPresent(id); | ||
| if (lastDocIdTimestamp != null | ||
| && !req.hasChangedSinceLastAccess( | ||
| new Date(lastDocIdTimestamp.getTime()))) { |
You should store Date rather than timestamp, to avoid having to do this for each crawl request. That also avoids the need to check for null here (because hasChangedSinceLastAccess does that, too).
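A minimal sketch of this suggestion, using only the JDK. The class name, the `ConcurrentHashMap` standing in for the real cache, and the `hasChangedSince` helper are all hypothetical; the helper only approximates what `hasChangedSinceLastAccess` is described as doing here, and is not the adaptor library's implementation.

```java
import java.sql.Timestamp;
import java.util.Date;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for the adaptor's last-modified cache.
final class LastModifiedCacheSketch {
  private final Map<String, Date> lastModifiedCache = new ConcurrentHashMap<>();

  /** Converts the Timestamp to a Date once, when the row is listed. */
  void record(String docId, Timestamp ts) {
    if (ts != null) {
      lastModifiedCache.put(docId, new Date(ts.getTime()));
    }
  }

  /** Returns the cached Date, or null if the DocId is not cached. */
  Date cached(String docId) {
    return lastModifiedCache.get(docId);
  }

  /**
   * Assumed behavior of the request-side check: a missing date on either
   * side is treated as "changed", so callers need no separate null check.
   */
  static boolean hasChangedSince(Date ifModifiedSince, Date lastModified) {
    if (ifModifiedSince == null || lastModified == null) {
      return true;
    }
    return lastModified.after(ifModifiedSince);
  }
}
```

With the conversion done at listing time, each crawl request only reads the cached `Date` and hands it to the check, with no per-request `new Date(...)`.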
| } | ||
| // Cache last modified time stamp for this DocId | ||
| if (hasTimestamp) { | ||
| Timestamp lastDocIdTimestamp = docIdLastModifiedMap.getIfPresent(id); |
What does "last" mean here? I think either "cachedTimestamp" or "cachedLastModified".
Renamed to "cachedLastModified"
| resp.respondNotFound(); | ||
| return; | ||
| } | ||
| // Check if modified since last access. |
This is supposed to happen before the query, or the cache is silly, and we could just compare it to a GSA_TIMESTAMP column in these results (which would be a valid design choice, but the whole idea was to avoid hitting the database in case that's unusually expensive).
Yes, moved it.
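The ordering discussed in this thread could be sketched as follows. Everything here is hypothetical scaffolding: the `Outcome` enum, the `Function` standing in for the SQL query, and the query counter exist only to show that a cache hit returns before the database is touched.

```java
import java.util.Date;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical sketch: consult the cache first, query only on a miss or change.
final class CheckBeforeQuerySketch {
  enum Outcome { NOT_MODIFIED, NOT_FOUND, CONTENT }

  private final Map<String, Date> lastModifiedCache = new ConcurrentHashMap<>();
  private final Function<String, String> database; // stand-in for the SQL query
  private int queryCount = 0;

  CheckBeforeQuerySketch(Function<String, String> database) {
    this.database = database;
  }

  void cacheLastModified(String docId, Date lastModified) {
    lastModifiedCache.put(docId, lastModified);
  }

  int queryCount() {
    return queryCount;
  }

  Outcome getDocContent(String docId, Date ifModifiedSince) {
    // Step 1: answer from the cache when possible, without a query.
    Date cached = lastModifiedCache.get(docId);
    if (cached != null && ifModifiedSince != null
        && !cached.after(ifModifiedSince)) {
      return Outcome.NOT_MODIFIED;
    }
    // Step 2: only now pay for the (possibly expensive) database lookup.
    queryCount++;
    String row = database.apply(docId);
    return row == null ? Outcome.NOT_FOUND : Outcome.CONTENT;
  }
}
```

If the check ran after the query instead, the counter would increase on every request and the cache would save nothing over comparing against a GSA_TIMESTAMP column in the results.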
| // One record means one document. | ||
| resp.setNoFollow(true); | ||
| log.log(Level.FINE, "Content modified since last crawl: {0}", id); | ||
| resp.setNoFollow(true); |
Restored the whitespace
| // In database adaptor's case, we almost never want to follow the URLs. | ||
| // One record means one document. | ||
| resp.setNoFollow(true); | ||
| log.log(Level.FINE, "Content modified since last crawl: {0}", id); |
Why not do this right away, on line 724?
Moved to line 724.
| private String modeOfOperation; | ||
| private Calendar updateTimestampTimezone; | ||
| private DateFormat formatter; |
You've got a threading problem here. getDocIds and getModifiedDocIds can run concurrently (though each one separately cannot). We could 1) maintain a separate formatter for each of them (slightly tedious but not a bad option), 2) synchronize the uses, 3) put these into a ThreadLocal (boo), or 4) create an instance for each call (again, slightly tedious, but not a bad option, given that calls happen every 15 minutes at best, by default). I'm leaning toward option 1.
Changed to have separate formatter for each (option 1).
Done.
wiarlawd left a comment
More thoughts, still partial.
| log.config("primary key: " + uniqueKey); | ||
| docIdLastModifiedMap = | ||
| CacheBuilder.newBuilder().maximumSize(1000000).build(); |
We might want to set an initial capacity. The (undocumented) default is 16 (though I think it might be 40 in practice), which is pretty small, and will lead to lots of allocations to grow the cache (done by doubling). Maybe initialCapacity(10000)?
| log.config("primary key: " + uniqueKey); | ||
| docIdLastModifiedMap = | ||
| CacheBuilder.newBuilder().maximumSize(1000000).build(); |
maximumSize takes a long, so add L to the value (FYI, initialCapacity takes an int).
| @@ -462,6 +485,8 @@ public void getDocIds(DocIdPusher pusher) throws IOException, | |||
| log.finer("queried for stream"); | |||
| boolean hasAction | |||
| = hasColumn(rs.getMetaData(), GsaSpecialColumns.GSA_ACTION); | |||
Should be OK to get the ResultSetMetaData twice, but let's not.
| log.finer("queried for stream"); | ||
| boolean hasAction | ||
| = hasColumn(rs.getMetaData(), GsaSpecialColumns.GSA_ACTION); | ||
| boolean hasTimestamp = |
I thought we were going to clear the cache in getDocIds (but not getModifiedDocIds, obviously). That avoids keeping a record for deleted rows, and in particular, if the ordering in getDocContent is fixed, avoids us reporting deleted rows as unmodified instead.
Yes, missed it. Added a call to invalidateAll().
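A small sketch of why the full listing clears the cache while the incremental one does not. The class and method names are hypothetical, and `Map.clear()` on a `ConcurrentHashMap` stands in for the `invalidateAll()` call mentioned above.

```java
import java.util.Date;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of the two listing paths sharing one cache.
final class ListingCacheSketch {
  private final Map<String, Date> lastModifiedCache = new ConcurrentHashMap<>();

  /** Full listing: drop everything first so deleted rows leave no entry. */
  void fullListing(Map<String, Date> allRows) {
    lastModifiedCache.clear(); // stand-in for invalidateAll()
    lastModifiedCache.putAll(allRows);
  }

  /** Incremental listing: only add or overwrite the rows that changed. */
  void incrementalListing(Map<String, Date> changedRows) {
    lastModifiedCache.putAll(changedRows);
  }

  boolean isCached(String docId) {
    return lastModifiedCache.containsKey(docId);
  }
}
```

Without the clear, a row deleted between two full listings would keep its cached date, and a later request for it could be answered as unmodified rather than not found.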
wiarlawd left a comment
I still haven't read the tests, but you changed the semantics of the prod code without changing any tests, which isn't a good sign.
| DateFormat formatter = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.SSS Z"); | ||
| formatter.setTimeZone(updateTimestampTimezone.getTimeZone()); | ||
| BufferedPusher outstream = new BufferedPusher(pusher); | ||
| lastModifiedCache.invalidateAll(); |
I think I would move this to the first line (it's tied to the semantics of getDocIds, not this try statement or the loop, and these fields/variables are used in this order below: lastModifiedCache, formatter, outstream).
| // Cache last modified time stamp for this DocId | ||
| if (hasTimestamp) { | ||
| Timestamp lastDocIdTimestamp = docIdLastModifiedMap.getIfPresent(id); | ||
| Date cachedLastModified = lastModifiedCache.getIfPresent(id); |
This is never used now, is it? Just overwritten below and logged?
Yes, removed call to getIfPresent()
| docIdLastModifiedMap.put(id, lastDocIdTimestamp); | ||
| log.log(Level.FINE, "docIdLastModifiedMap updated: {0}", | ||
| formatter.format(new Date(lastDocIdTimestamp.getTime()))); | ||
| if ((ts != null)) { |
| docIdLastModifiedMap.put(id, lastDocIdTimestamp); | ||
| log.log(Level.FINE, "lastDocIdTimestamp updated: {0}", | ||
| formatter.format(new Date(lastDocIdTimestamp.getTime()))); | ||
| if (cachedLastModified == null) { |
This if statement is wrong. Why would a row have to be uncached in order to update the cache with a new last modified?
| log.log(Level.FINE, "lastDocIdTimestamp updated: {0}", | ||
| formatter.format(new Date(lastDocIdTimestamp.getTime()))); | ||
| if (cachedLastModified == null) { | ||
| cachedLastModified = new Date(ts.getTime()); |
We're constructing this same Date object twice here; maybe it could be shared, if that doesn't seem too artificial.
| @Override | ||
| public void getDocIds(DocIdPusher pusher) throws IOException, | ||
| InterruptedException { | ||
| DateFormat formatter = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.SSS Z"); |
This is option 4, not option 1, but that's fine.
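For reference, option 4 (a new formatter per call) might look like the sketch below. `SimpleDateFormat` is not safe for concurrent use, so each call builds its own instance. The class and method names are hypothetical, and `Locale.ROOT` is an addition not present in the diff, used only to keep the output independent of the default locale.

```java
import java.text.DateFormat;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

// Hypothetical sketch: no shared DateFormat field, one instance per call.
final class PerCallFormatterSketch {
  private final TimeZone zone;

  PerCallFormatterSketch(TimeZone zone) {
    this.zone = zone;
  }

  /** Builds a fresh formatter; cheap next to a listing that runs rarely. */
  private DateFormat newFormatter() {
    // Locale.ROOT is an assumption of this sketch, not part of the diff.
    DateFormat formatter =
        new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.SSS Z", Locale.ROOT);
    formatter.setTimeZone(zone);
    return formatter;
  }

  /** Stand-in for the logging done in the full listing. */
  String logFullListing(Date lastModified) {
    DateFormat formatter = newFormatter();
    return "full listing: " + formatter.format(lastModified);
  }

  /** Stand-in for the logging done in the incremental listing. */
  String logIncrementalListing(Date lastModified) {
    DateFormat formatter = newFormatter();
    return "incremental listing: " + formatter.format(lastModified);
  }
}
```

Because neither method touches shared formatter state, the two listings can run concurrently without synchronization.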
srinicodebytes left a comment
The code changes were mostly about how the cache is updated. Earlier there were additional checks and the cache was updated only if the timestamp was newer; now the cache is always updated. Hence the tests didn't change. The new tests mostly cover a null modified date, a newer modified date, and content that is not modified.
to helper methods in tests.
No description provided.