Ready for Testing - Nutch REST Refactor #720

ahmadia · 2015-10-21T19:42:35Z

Quite a few todos here. Off the top of my head:

* Some JavaScript is disabled in crawl.js; this needs to be fixed before merging into master

ahmadia · 2015-10-21T19:44:12Z

@sujen1412 @chrismattmann @brittainhard - this is probably of interest. To try this locally, you'll need a normal Memex Explorer install. Check out this branch, make sure a RESTful Nutch is installed and on the path, and install the latest version of nutch-python into the current environment.

After that, kick off services with supervisord and start a Nutch crawl:

chrismattmann · 2015-10-21T20:48:00Z

wow that is awesome @ahmadia

ahmadia · 2015-10-22T01:41:28Z

@bryanv has suggested a few improvements to the layout of the visualization that I'm going to incorporate. They're added above.

ahmadia · 2015-10-22T16:29:02Z

@brittainhard - I think we should also wrap the Celery task with a try/except clause -- a real Pokemon catch :) -- and have it log the exception and push an "Internal Server Error" status if an exception is raised.
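
A minimal sketch of the catch-all wrapper described above, written as a plain decorator so it runs without Celery. The names `report_failures` and `set_status` are hypothetical and not part of this PR; in the app, the wrapper would sit around the body of the Celery task and `set_status` would update the crawl's status.

```python
import functools
import logging

logger = logging.getLogger(__name__)


def report_failures(set_status):
    """Wrap a task so any exception is logged and reported as a status.

    `set_status` is a hypothetical callable that records the crawl status
    (for example, by saving it on the crawl model).
    """
    def decorator(task_func):
        @functools.wraps(task_func)
        def wrapper(*args, **kwargs):
            try:
                return task_func(*args, **kwargs)
            except Exception:  # deliberately broad: the "Pokemon catch"
                # logger.exception records the message plus the traceback.
                logger.exception("Task %s failed", task_func.__name__)
                set_status("Internal Server Error")
                return None
        return wrapper
    return decorator


# Usage: collect statuses in a list instead of a database.
statuses = []


@report_failures(statuses.append)
def failing_crawl():
    raise RuntimeError("Nutch server unreachable")


@report_failures(statuses.append)
def ok_crawl():
    return "finished"


failing_crawl()
result = ok_crawl()
```

Catching `Exception` rather than using a bare `except:` keeps `KeyboardInterrupt` and `SystemExit` propagating, which is usually what a worker process wants.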

* Styling improvements suggested by Bryan Van de Ven
* Move URL strip out of capture. Will create a "double/dot" if the page possesses both https/http links to the same page

ahmadia · 2015-10-26T23:59:37Z

source/base/static/base/js/crawl.js

@@ -30,8 +30,9 @@ $( document ).ready(function() {
  }

  function onOffStatus(status){
-    onOffGroup(statuses.states[status]["disabled"], statuses.buttons, true);


@brittainhard - Please take a look at this, I need to understand how to reenable these.

@ahmadia why was this disabled in the first place?

It was causing JavaScript errors in my console. I suspect there's a state somewhere I need to set that wasn't on.

ahmadia · 2015-10-27T02:35:43Z

This is now waiting on a few minor edits and conversations before merge. I'd like to land this PR by end of day Tuesday.

ahmadia · 2015-10-27T02:42:32Z

source/apps/crawl_space/viz/plot.py

+        if self.crawl.crawler != "nutch":
+            raise ValueError("Crawl must be using the Nutch crawler.")
+
+    # TODO: Replace with real crawl stream monitoring


This TODO needs to be removed; there's real crawl stream monitoring here!

brittainhard · 2015-10-27T16:04:42Z

I think we need to truncate the URLs in the graph. When you get a really long URL, the graph gets squished.

EDIT: Also, I think maybe the plot needs to be smaller. It occupies a large part of the screen, so when you use your mousewheel you end up zooming in and out of the graph. This could also benefit from having the refresh button, which resets the plot to the original zoom position.
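
One possible way to do the URL truncation suggested above, as a sketch only. The helper name `truncate_url`, the 40-character default, and the choice to keep both ends of the URL are assumptions, not something taken from the plot code in this PR.

```python
def truncate_url(url, max_len=40, marker="..."):
    """Shorten `url` to at most `max_len` characters, keeping both ends."""
    if max_len <= len(marker):
        raise ValueError("max_len must be larger than the marker")
    if len(url) <= max_len:
        return url
    keep = max_len - len(marker)
    head = (keep + 1) // 2  # favor the start, where the domain lives
    tail = keep - head
    return url[:head] + marker + (url[-tail:] if tail else "")


long_url = "http://example.com/" + "a" * 100
label = truncate_url(long_url)
```

Keeping the tail as well as the head means two long URLs on the same domain are still likely to get distinct labels on the axis.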

* Reconfigures Nutch to route by crawl name
* Consumes routing keys by crawl on viz side
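
To illustrate the idea behind per-crawl routing, here is a tiny in-memory model, assuming the routing key is exactly the crawl name and that matching is exact. `DirectExchange` and the crawl names are hypothetical stand-ins for illustration; this is not the RabbitMQ client code used in the PR.

```python
from collections import defaultdict


class DirectExchange:
    """In-memory stand-in for an exchange that matches routing keys exactly."""

    def __init__(self):
        self._bindings = defaultdict(list)

    def bind(self, routing_key, callback):
        # A consumer (e.g. one crawl's plot) subscribes to a single key.
        self._bindings[routing_key].append(callback)

    def publish(self, routing_key, message):
        # Only consumers bound to this exact key receive the message.
        for callback in self._bindings.get(routing_key, []):
            callback(message)


exchange = DirectExchange()
crawl_a_seen, crawl_b_seen = [], []

# Each visualization consumes only its own crawl's routing key.
exchange.bind("crawl-a", crawl_a_seen.append)
exchange.bind("crawl-b", crawl_b_seen.append)

# The crawler side publishes fetched URLs under the crawl name.
exchange.publish("crawl-a", "http://example.com/1")
exchange.publish("crawl-b", "http://example.org/1")
exchange.publish("crawl-a", "http://example.com/2")
```

With this separation, two crawls can run at the same time without their fetched URLs appearing in each other's plots.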

ahmadia · 2015-10-29T15:59:23Z

Alright, please comment here if you've gotten a chance to test this, and what portions of the functionality you tested. I'd like to make sure we cover:

* ACHE crawls
* Nutch crawls with bokeh-server/rmq off
* Nutch crawls with visualization on
* ES indices for Nutch crawls are being created and populated with content

ahmadia · 2015-10-29T16:24:07Z

ES indices for Nutch crawls are being created and populated with content

Uh oh, ES appears to be empty. Investigating.

ahmadia · 2015-10-29T17:10:59Z

Okay, the issue was that nutch-python wasn't performing the index step the same way the old crawl script does. I've added that functionality and pushed a new build of nutch-python.

ahmadia · 2015-10-29T17:11:55Z

nutch-python-1.11-py27_1.tar.bz2 - to be precise.

ahmadia · 2015-10-29T17:18:02Z

Alright, ES is now updating on index.

ahmadia · 2015-10-29T17:18:25Z

ES indices for Nutch crawls are being created and populated with content

verified by Aron

ahmadia · 2015-10-29T17:18:40Z

Nutch crawls with bokeh-server/rmq off

verified by Aron

ahmadia · 2015-10-29T17:20:20Z

ACHE crawls

verified by Aron

ahmadia · 2015-10-29T17:26:37Z

Nutch crawls with visualization on

verified!

ahmadia · 2015-10-29T19:58:47Z

Respinning elasticnutch and nutch-python builds with common crawl support.

ahmadia · 2015-10-29T20:14:56Z

nutch-python-1.11-py27_2.tar.bz2 and elasticnutch-1.11-3.tar.bz2

ahmadia · 2015-10-29T20:15:52Z

reimplementing the common crawl dump task.

ahmadia · 2015-10-29T20:17:28Z

#744

* I touched things I usually don't, double-check the template modifications

ahmadia · 2015-10-29T20:50:12Z

CCA dump functionality has been restored.

ahmadia · 2015-10-30T13:57:15Z

Retesting off a fresh database:

* ACHE crawls
* Nutch crawls with bokeh-server/rmq off
* Nutch crawls with visualization on
* ES indices for Nutch crawls are being created and populated with content

ahmadia · 2015-10-30T18:21:41Z

All verified.

ahmadia · 2015-10-30T18:24:24Z

merged.

ahmadia added 4 commits October 20, 2015 15:07

Remove log.io and Salt more completely

8c790c6

remove redundant requirements.txt

47b02e7

Upgrade to Bokeh 0.10

370c055

WIP - Restful Nutch crawling + Bokeh visualization

fc8638d

* Some JavaScript is disabled in crawl.js + this needs to be fixed before merge in master

ahmadia mentioned this pull request Oct 22, 2015

404 error when running on DigitalOcean #715

Closed

Improvements to the Bokeh Stream plot

c9d756f

* Styling improvements suggested by Bryan Van de Ven
* Move URL strip out of capture. Will create a "double/dot" if the page possesses both https/http links to the same page

ahmadia mentioned this pull request Oct 23, 2015

Port over styling from Memex Explorer into crawl example? nasa-jpl-memex/nutch-python#7

Open

ahmadia reviewed Oct 26, 2015

ahmadia added 2 commits October 26, 2015 22:13

Update elasticnutch dependency

c55658e

Add nutch-python dependency

54a17f8

ahmadia added the "3 - In Progress", "4 - Ready to review", and "Needs Discussion" labels Oct 27, 2015

ahmadia reviewed Oct 27, 2015

Enable simultaneous Nutch crawl visualization

1a718df

* Reconfigures Nutch to route by crawl name
* Consumes routing keys by crawl on viz side

ahmadia force-pushed the ahmadia/nutch-rest-refactor branch from c2a52ac to 1a718df Compare October 29, 2015 15:54

Restore crawl page count monitoring

2b9114d

ahmadia changed the title ~~WIP - Nutch REST Refactor~~ Ready for Testing - Nutch REST Refactor Oct 29, 2015

ahmadia mentioned this pull request Oct 29, 2015

Add link versions, dedup and optional indexing nasa-jpl-memex/nutch-python#19

Merged

ahmadia mentioned this pull request Oct 29, 2015

dump images and dump common crawl have been disabled #717

Closed

Restore CCA Dump

8c07361

* I touched things I usually don't, double-check the template modifications

ahmadia added this to the v0.4 milestone Oct 30, 2015

ahmadia added a commit that referenced this pull request Oct 30, 2015

Merge pull request #720 from memex-explorer/ahmadia/nutch-rest-refactor

0b9f96f

Nutch REST Refactor

ahmadia merged commit 0b9f96f into master Oct 30, 2015

ahmadia deleted the ahmadia/nutch-rest-refactor branch October 30, 2015 18:24

ahmadia added the "5 - Done" label and removed the "4 - Ready to review" label Oct 30, 2015
