Skip to content
This repository was archived by the owner on Jun 23, 2022. It is now read-only.

Fix bug# 67471678 - Support If-Modified-Since.#115

Open
srinicodebytes wants to merge 6 commits intomasterfrom
sri-ifmodifiedsince
Open

Fix bug# 67471678 - Support If-Modified-Since.#115
srinicodebytes wants to merge 6 commits intomasterfrom
sri-ifmodifiedsince

Conversation

@srinicodebytes
Copy link
Copy Markdown
Contributor

No description provided.

@googlebot
Copy link
Copy Markdown

Thanks for your pull request. It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please visit https://cla.developers.google.com/ to sign.

Once you've signed, please reply here (e.g. I signed it!) and we'll verify. Thanks.


  • If you've already signed a CLA, it's possible we don't have your GitHub username or you're using a different email address on your commit. Check your existing CLA data and verify that your email is set on your git commits.
  • If your company signed a CLA, they designated a Point of Contact who decides which employees are authorized to participate. You may need to contact the Point of Contact for your company and ask to be added to the group of authorized contributors. If you don't know who your Point of Contact is, direct the project maintainer to go/cla#troubleshoot. The email used to register you as an authorized contributor must be the email used for the Git commit.
  • In order to pass this check, please resolve this problem and have the pull request author add another comment and the bot will run again. If the bot doesn't comment, it means it doesn't think anything has changed.

Copy link
Copy Markdown
Contributor

@wiarlawd wiarlawd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Partial review. I haven't looked at the tests yet.

log.finer("queried for stream");
boolean hasAction
= hasColumn(rs.getMetaData(), GsaSpecialColumns.GSA_ACTION);
boolean hasTimestamp =
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is in between getting hasAction and logging hasAction, and we're not logging hasTimestamp. Either get/log/get/log or get/get/log/log.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

rs.getTimestamp(GsaSpecialColumns.GSA_TIMESTAMP.toString(),
updateTimestampTimezone);
if ((ts != null)
&& (lastDocIdTimestamp == null || ts.after(lastDocIdTimestamp))) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the merit of checking the cache here, and what's a plausible scenario where ts < lastDocIdTimestamp?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Trying to check if the ts is newer than cached value before updating cache. ts won't be < lastDocIdTimestamp, but can be same as lastDocIdTimestamp. The additional check is not required if I update the cache always..

Removed the check.

private static final HashMap<Integer, String> sqlTypeNames = new HashMap<>();

/** Cache for DocId and last modified time stamp. */
private static Cache<DocId, Timestamp> docIdLastModifiedMap;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/docId//; I know it's a map from DocId to Timestamp, but it's really a cache of last modified dates. The "docId" feels like noise. I could see using "Timestamp" consistently, since that's the name of the column, so maybe "lastModifiedCache" or "timestampCache"?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thx, meant to change the name from "..Map" to "..Cache". Removed "docId" prefix as suggested.

Timestamp lastDocIdTimestamp = docIdLastModifiedMap.getIfPresent(id);
if (lastDocIdTimestamp != null
&& !req.hasChangedSinceLastAccess(
new Date(lastDocIdTimestamp.getTime()))) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should store Date rather than timestamp, to avoid having to do this for each crawl request. That also avoids the need to check for null here (because hasChangedSinceLastAccess does that, too).

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

}
// Cache last modified time stamp for this DocId
if (hasTimestamp) {
Timestamp lastDocIdTimestamp = docIdLastModifiedMap.getIfPresent(id);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does "last" mean here? I think either "cachedTimestamp" or "cachedLastModified".

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Renamed to "cachedLastModified"

resp.respondNotFound();
return;
}
// Check if modified since last access.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is supposed to happen before the query, or the cache is silly, and we could just compare it to a GSA_TIMESTAMP column in these results (which would be a valid design choice, but the whole idea was to avoid hitting the database in case that's unusually expensive).

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, moved it.

// One record means one document.
resp.setNoFollow(true);
log.log(Level.FINE, "Content modified since last crawl: {0}", id);
resp.setNoFollow(true);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whitespace only change.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Restored the whitespace

// In database adaptor's case, we almost never want to follow the URLs.
// One record means one document.
resp.setNoFollow(true);
log.log(Level.FINE, "Content modified since last crawl: {0}", id);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not do this right away, on line 724?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved to line 724.

private String modeOfOperation;

private Calendar updateTimestampTimezone;
private DateFormat formatter;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You've got a threading problem here. getDocIds and getModifiedDocIds can run concurrently (though each one separately cannot). We could 1) maintain a separate formatter for each of them (slightly tedious but not a bad option), 2) synchronize the uses, 3) put these into a ThreadLocal (boo), or 4) create an instance for each call (again, slightly tedious, but not a bad option, given that calls happen every 15 minutes at best, by default). I'm leaning toward option 1.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed to have separate formatter for each (option 1).
Done.

Copy link
Copy Markdown
Contributor

@wiarlawd wiarlawd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More thoughts, still partial.

log.config("primary key: " + uniqueKey);

docIdLastModifiedMap =
CacheBuilder.newBuilder().maximumSize(1000000).build();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We might want to set an initial capacity. The (undocumented) default is 16 (though I think it might be 40 in practice), which is pretty small, and will lead to lots of allocations to grow the cache (done by doubling). Maybe initialCapacity(10000)?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thx, done.

log.config("primary key: " + uniqueKey);

docIdLastModifiedMap =
CacheBuilder.newBuilder().maximumSize(1000000).build();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maximumSize takes a long, so add L to the value (FYI, initialCapacity takes an int).

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thx, done.

@@ -462,6 +485,8 @@ public void getDocIds(DocIdPusher pusher) throws IOException,
log.finer("queried for stream");
boolean hasAction
= hasColumn(rs.getMetaData(), GsaSpecialColumns.GSA_ACTION);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be OK to get the ResultSetMetaData twice, but let's not.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, done.

log.finer("queried for stream");
boolean hasAction
= hasColumn(rs.getMetaData(), GsaSpecialColumns.GSA_ACTION);
boolean hasTimestamp =
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought we were going to clear the cache in getDocIds (but not getModifiedDocIds, obviously). That avoids keeping a record for deleted rows, and in particular, if the ordering in getDocContent is fixed, avoids us reporting deleted rows as unmodified instead.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, missed it. Added a call to invalidateAll().

Copy link
Copy Markdown
Contributor

@wiarlawd wiarlawd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still haven't read the tests, but you changed the semantics of the prod code without changing any tests, which isn't a good sign.

DateFormat formatter = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.SSS Z");
formatter.setTimeZone(updateTimestampTimezone.getTimeZone());
BufferedPusher outstream = new BufferedPusher(pusher);
lastModifiedCache.invalidateAll();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I would move this to the first line (it's tied to the semantics of getDocIds, not this try statement or the loop, and these fields/variables are used in this order below: lastModifiedCache, formatter, outstream).

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

// Cache last modified time stamp for this DocId
if (hasTimestamp) {
Timestamp lastDocIdTimestamp = docIdLastModifiedMap.getIfPresent(id);
Date cachedLastModified = lastModifiedCache.getIfPresent(id);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is never used, now, is it? Just overridden below and logged?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, removed call to getIfPresent()

docIdLastModifiedMap.put(id, lastDocIdTimestamp);
log.log(Level.FINE, "docIdLastModifiedMap updated: {0}",
formatter.format(new Date(lastDocIdTimestamp.getTime())));
if ((ts != null)) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Delete extra parens.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, thx

docIdLastModifiedMap.put(id, lastDocIdTimestamp);
log.log(Level.FINE, "lastDocIdTimestamp updated: {0}",
formatter.format(new Date(lastDocIdTimestamp.getTime())));
if (cachedLastModified == null) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This if statement is wrong. Why would a row have to be uncached in order to update the cache with a new last modified?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed it.

log.log(Level.FINE, "lastDocIdTimestamp updated: {0}",
formatter.format(new Date(lastDocIdTimestamp.getTime())));
if (cachedLastModified == null) {
cachedLastModified = new Date(ts.getTime());
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're constructing this same Date object twice here, maybe they could be shared, if it doesn't seem too artificial.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@Override
public void getDocIds(DocIdPusher pusher) throws IOException,
InterruptedException {
DateFormat formatter = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.SSS Z");
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is option 4, not option 1, but that's fine.

Copy link
Copy Markdown
Contributor Author

@srinicodebytes srinicodebytes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code changes were mostly around updating cache. Earlier there were additional checks and cache is updated only if newer, now cache is always updated. Hence the tests didn't change. New tests are mostly around null modified date, modified date is newer or not modified.

DateFormat formatter = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.SSS Z");
formatter.setTimeZone(updateTimestampTimezone.getTimeZone());
BufferedPusher outstream = new BufferedPusher(pusher);
lastModifiedCache.invalidateAll();
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

// Cache last modified time stamp for this DocId
if (hasTimestamp) {
Timestamp lastDocIdTimestamp = docIdLastModifiedMap.getIfPresent(id);
Date cachedLastModified = lastModifiedCache.getIfPresent(id);
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, removed call to getIfPresent()

docIdLastModifiedMap.put(id, lastDocIdTimestamp);
log.log(Level.FINE, "docIdLastModifiedMap updated: {0}",
formatter.format(new Date(lastDocIdTimestamp.getTime())));
if ((ts != null)) {
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, thx

log.log(Level.FINE, "lastDocIdTimestamp updated: {0}",
formatter.format(new Date(lastDocIdTimestamp.getTime())));
if (cachedLastModified == null) {
cachedLastModified = new Date(ts.getTime());
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

docIdLastModifiedMap.put(id, lastDocIdTimestamp);
log.log(Level.FINE, "lastDocIdTimestamp updated: {0}",
formatter.format(new Date(lastDocIdTimestamp.getTime())));
if (cachedLastModified == null) {
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed it.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants