Ready for Testing - Nutch REST Refactor #720

Merged 18 commits on Oct 30, 2015

ahmadia commented Oct 21, 2015

Quite a few todos here. Off the top of my head:

  • Handle multiple crawls simultaneously (needs to use the crawl name to manage crawl contexts/configurations as well as Bokeh documents)
    • punted
  • Restore the stuff I had to disable in crawl.js (chat with Brittain about this)
    • verified
  • Programmatically extract seeds instead of writing them to a file (chat with Brittain about this)
    • verified
  • Defensive programming for situations when Bokeh server and RMQ are not available
    • verified
  • Packaging of new Nutch + RESTful interface
  • Documenting the RMQ dependency
  • Move URLs over to the right side
  • Get rid of grid lines
  • Hide toolbar
  • Different colors for different domains
    • punted
  • List of URLs that have been viewed?
    • punted
  • Make the domain name bold, if possible?
    • punted
  • Color code the URLs?
    • punted
  • Disable wheel zoom
    • verified
  • Verify outstanding issues are resolved.
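
For the defensive-programming item in the list, one possible shape is a cheap reachability probe before the visualization is wired up. This is a minimal sketch and not the code in this PR: the function names are hypothetical, and the ports are only the stock defaults for RabbitMQ (5672) and the Bokeh server (5006), which this deployment may override.

```python
import socket


def service_available(host, port, timeout=1.0):
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        # create_connection raises OSError on refusal, timeout, or DNS failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def visualization_enabled(host="localhost", rmq_port=5672, bokeh_port=5006):
    """Enable the live plot only when both services accept connections.

    The port defaults are assumptions (stock RabbitMQ and Bokeh server
    ports), not values taken from this project's configuration.
    """
    return (service_available(host, rmq_port)
            and service_available(host, bokeh_port))
```

A crawl could call `visualization_enabled()` once at start-up and simply skip the plot when it returns False, so the crawl itself still runs.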

ahmadia commented Oct 21, 2015

@sujen1412 @chrismattmann @brittainhard - this is probably of interest. To try this locally, start from a normal Memex Explorer install, check out this branch, then make sure that a RESTful Nutch is installed and on the path and that the latest version of nutch-python is installed in the current environment.

After that, kick off services with supervisord and start a Nutch crawl:

[animated GIF: ezgif com-gif-maker 1]
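
The prerequisites described above can be checked with a short script before starting services. This is a minimal sketch for Python 3, not part of the PR; it assumes the Nutch launcher is an executable named `nutch` and that nutch-python is importable as the top-level module `nutch`. Both names are assumptions, so adjust them to match the local install.

```python
import importlib.util
import shutil


def missing_prerequisites(executable="nutch", module="nutch"):
    """Return a list of human-readable problems; an empty list means OK.

    Both default names are assumptions about the local install.
    """
    problems = []
    # shutil.which searches the directories on PATH, like a shell would.
    if shutil.which(executable) is None:
        problems.append("executable %r not found on PATH" % executable)
    # find_spec returns None when a top-level module cannot be found.
    if importlib.util.find_spec(module) is None:
        problems.append("module %r not importable" % module)
    return problems


if __name__ == "__main__":
    for problem in missing_prerequisites():
        print("missing:", problem)
```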

@chrismattmann

wow that is awesome @ahmadia

ahmadia commented Oct 22, 2015

@bryanv has suggested a few improvements to the layout of the visualization that I'm going to incorporate. They're added above.

ahmadia commented Oct 22, 2015

@brittainhard - I think we should also wrap the Celery task in a try/except clause -- a real Pokemon catch :) -- and have it log the exception and push an "Internal Server Error" status if an exception is raised.
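
The catch-all wrapper suggested above could look roughly like the sketch below. It is written as a plain decorator so it runs without Celery installed; with Celery it would sit underneath the task decorator. `report_failures`, `set_status`, and `run_crawl` are hypothetical names for illustration, not functions from this project.

```python
import functools
import logging

logger = logging.getLogger(__name__)


def report_failures(set_status):
    """Decorator factory: log any exception and report an error status.

    `set_status` is a hypothetical callback (for example, one that saves
    the status on the crawl model); it is not an API from this project.
    """
    def decorator(task_body):
        @functools.wraps(task_body)
        def wrapper(*args, **kwargs):
            try:
                return task_body(*args, **kwargs)
            except Exception:
                # Deliberately broad: any failure is logged with its traceback.
                logger.exception("Crawl task %s failed", task_body.__name__)
                set_status("Internal Server Error")
                return None
        return wrapper
    return decorator


# Usage with a stand-in task body and a list acting as the status sink.
statuses = []


@report_failures(statuses.append)
def run_crawl(seed_count):
    if seed_count == 0:
        raise ValueError("no seeds")
    return "finished"
```

Catching `Exception` rather than using a bare `except:` keeps `KeyboardInterrupt` and `SystemExit` propagating, which matters for worker shutdown.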

* Styling improvements suggested by Bryan Van de Ven
* Move URL strip out of capture.  Will create a "double/dot" if the page
  possesses both https/http links to the same page
@@ -30,8 +30,9 @@ $( document ).ready(function() {
}

function onOffStatus(status){
    onOffGroup(statuses.states[status]["disabled"], statuses.buttons, true);
Contributor Author

@brittainhard - Please take a look at this, I need to understand how to reenable these.

Contributor

@ahmadia why was this disabled in the first place?

Contributor Author

It was causing JavaScript errors in my console. I suspect there's a state somewhere I need to set that wasn't on.

ahmadia commented Oct 27, 2015

This is now waiting on a few minor edits and conversations before merge. I'd like to land this PR by end of day Tuesday.

if self.crawl.crawler != "nutch":
    raise ValueError("Crawl must be using the Nutch crawler.")

# TODO: Replace with real crawl stream monitoring
Contributor Author

This needs to be removed; there's real crawl stream monitoring here!

@brittainhard

I think we need to truncate the URLs in the graph. When you get a really long URL, the graph gets squished.
[screenshot: 2015-10-27 at 11:03:34 AM]

EDIT: Also, I think maybe the plot needs to be smaller. It occupies a large part of the screen, so when you use your mousewheel you end up zooming in and out of the graph. This could also benefit from having the refresh button, which resets the plot to the original zoom position.
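
The URL truncation suggested above could be sketched as a small label helper applied before the URLs reach the plot. This is only an illustration: `truncate_label`, its 40-character default, and the choice to elide the middle rather than the end are all assumptions, not what the PR implements.

```python
def truncate_label(url, max_length=40, marker="..."):
    """Shorten `url` to at most `max_length` characters for display.

    The middle is elided so the domain and the end of the path both stay
    visible. `max_length` and `marker` are illustrative defaults.
    """
    if max_length <= len(marker):
        raise ValueError("max_length must exceed the marker length")
    if len(url) <= max_length:
        return url
    keep = max_length - len(marker)
    head = (keep + 1) // 2  # the front gets the extra character, if any
    tail = keep // 2
    return url[:head] + marker + (url[-tail:] if tail else "")
```

Eliding the middle keeps the domain readable, which is what distinguishes rows in the plot; truncating the tail would be simpler if only the domain matters.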

* Reconfigures Nutch to route by crawl name
* Consumes routing keys by crawl on viz side
ahmadia changed the title from "WIP - Nutch REST Refactor" to "Ready for Testing - Nutch REST Refactor" on Oct 29, 2015

ahmadia commented Oct 29, 2015

Alright, please comment here if you've gotten a chance to test this, and say which portions of the functionality you tested. I'd like to make sure we cover:

  • ACHE crawls
  • Nutch crawls with bokeh-server/rmq off
  • Nutch crawls with visualization on
  • ES indices for Nutch crawls are being created and populated with content

ahmadia commented Oct 29, 2015

ES indices for Nutch crawls are being created and populated with content

Uh oh, ES appears to be empty. Investigating.

ahmadia commented Oct 29, 2015

Okay, the issue was that nutch-python wasn't performing the index step the same way the old crawl script does. I've added that functionality and pushed a new build of nutch-python.

ahmadia commented Oct 29, 2015

nutch-python-1.11-py27_1.tar.bz2 - to be precise.

ahmadia commented Oct 29, 2015

Alright, ES is now updating on index.

ahmadia commented Oct 29, 2015

  • ES indices for Nutch crawls are being created and populated with content

verified by Aron

ahmadia commented Oct 29, 2015

  • Nutch crawls with bokeh-server/rmq off

verified by Aron

ahmadia commented Oct 29, 2015

  • ACHE crawls

verified by Aron

ahmadia commented Oct 29, 2015

  • Nutch crawls with visualization on

verified!

ahmadia commented Oct 29, 2015

Respinning elasticnutch and nutch-python builds with common crawl support.

ahmadia commented Oct 29, 2015

nutch-python-1.11-py27_2.tar.bz2 and elasticnutch-1.11-3.tar.bz2

ahmadia commented Oct 29, 2015

Reimplementing the common crawl dump task.

ahmadia commented Oct 29, 2015

#744

* I touched things I usually don't, double-check the template modifications

ahmadia commented Oct 29, 2015

CCA dump functionality has been restored.

ahmadia added this to the v0.4 milestone on Oct 30, 2015

ahmadia commented Oct 30, 2015

Retesting

  • ACHE crawls
  • Nutch crawls with bokeh-server/rmq off
  • Nutch crawls with visualization on
  • ES indices for Nutch crawls are being created and populated with content

Off a fresh database.

ahmadia commented Oct 30, 2015

All verified.

ahmadia added a commit that referenced this pull request on Oct 30, 2015
ahmadia merged commit 0b9f96f into master on Oct 30, 2015
ahmadia deleted the ahmadia/nutch-rest-refactor branch on October 30, 2015 18:24

ahmadia commented Oct 30, 2015

merged.
