Ready for Testing - Nutch REST Refactor #720
* Some JavaScript is disabled in crawl.js; this needs to be fixed before merging into master
@sujen1412 @chrismattmann @brittainhard - this is probably of interest. To try this locally, you'll need a normal memex explorer install, check out this branch, then make sure a RESTful Nutch is installed and on the path and that you've installed the latest version of After that, kick off services with
wow that is awesome @ahmadia
@bryanv has suggested a few improvements to the layout of the visualization that I'm going to incorporate. They're added above.
@brittainhard - I think we should also wrap the Celery task with a try/except clause -- a real Pokemon catch :) -- and have it log the exception and push an "Internal Server error" status if the exception is raised.
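A minimal sketch of the catch-all wrapper suggested in the comment above, written in plain Python so it runs without Celery. The names `run_with_status`, `task_fn`, and `set_status` are hypothetical and do not come from this PR; only the "Internal Server error" status string is taken from the comment.

```python
import logging

logger = logging.getLogger(__name__)


def run_with_status(task_fn, set_status, *args, **kwargs):
    """Run task_fn; on any exception, log it and push an error status.

    task_fn and set_status are hypothetical stand-ins for the real
    Celery task body and whatever call updates the crawl status.
    """
    try:
        return task_fn(*args, **kwargs)
    except Exception:
        # Deliberately broad ("Pokemon") catch: log the full traceback,
        # then report the failure instead of letting the worker die silently.
        logger.exception("Crawl task failed")
        set_status("Internal Server error")
        return None
```

In a real Celery task the same try/except would presumably sit inside the task function itself, with `set_status` replaced by the project's own status update.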
* Styling improvements suggested by Bryan Van de Ven
* Move URL strip out of capture. Will create a "double/dot" if the page has both http and https links to the same page
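To illustrate why http and https links to the same page can show up as two dots, here is a small sketch of a scheme-stripping helper. It assumes the "URL strip" in the commit message means dropping the scheme, which the source does not confirm; `strip_scheme` is a hypothetical name, not a function from this PR.

```python
from urllib.parse import urlsplit


def strip_scheme(url):
    """Return the URL without its scheme (hypothetical helper)."""
    parts = urlsplit(url)
    rest = parts.netloc + parts.path
    if parts.query:
        rest += "?" + parts.query
    return rest


# Both variants reduce to the same key once the scheme is removed,
# so de-duplicating on the stripped form would merge the two dots.
http_key = strip_scheme("http://example.com/a")
https_key = strip_scheme("https://example.com/a")
```

If the strip happens after the URLs are captured rather than before, the two variants are still distinct at capture time, which would be consistent with the double dot described above.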
@@ -30,8 +30,9 @@ $( document ).ready(function() {
}

function onOffStatus(status){
onOffGroup(statuses.states[status]["disabled"], statuses.buttons, true);
@brittainhard - Please take a look at this, I need to understand how to reenable these.
@ahmadia why was this disabled in the first place?
It was causing JavaScript errors in my console. I suspect there's a state somewhere I need to set that wasn't on.
This is now waiting on a few minor edits and conversations before merge. I'd like to land this PR by end of day Tuesday.
if self.crawl.crawler != "nutch":
    raise ValueError("Crawl must be using the Nutch crawler.")

# TODO: Replace with real crawl stream monitoring
This needs to be removed, there's real crawl stream monitoring here!
* Reconfigures Nutch to route by crawl name
* Consumes routing keys by crawl on viz side
c2a52ac to 1a718df
Alright, please comment here if you've gotten a chance to test this, and what portions of the functionality you tested. I'd like to make sure we get:
Uh oh, ES appears to be empty. Investigating.
Okay, the issue was that nutch-python wasn't performing the
nutch-python-1.11-py27_1.tar.bz2 - to be precise.
Alright, ES is now updating on index.
verified by Aron
verified by Aron
verified by Aron
verified!
Respinning elasticnutch and nutch-python builds with common crawl support.
nutch-python-1.11-py27_2.tar.bz2 and elasticnutch-1.11-3.tar.bz2
reimplementing the common crawl dump task.
* I touched things I usually don't, double-check the template modifications
CCA dump functionality has been restored.
Retesting ACHE crawls off a fresh database.
All verified.
merged.
Quite a few todos here. Off the top of my head:
* Handle multiple crawls simultaneously (needs to use the crawl name to manage crawl contexts/configurations as well as Bokeh documents)

Viz stuff:
* Different colors for different domains
* List of URLs that have been viewed?
* Make the domain name bold, if possible?
* Color code the URLs?
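For the "different colors for different domains" item, one possible approach is to assign each newly seen domain the next color from a fixed palette. This is only a sketch: `PALETTE` and `domain_colors` are hypothetical names, and the hex values are arbitrary placeholders rather than anything used by the visualization in this PR.

```python
from urllib.parse import urlsplit

# Arbitrary placeholder palette; colors repeat once it is exhausted.
PALETTE = ["#1f77b4", "#ff7f0e", "#2ca02c", "#d62728", "#9467bd"]


def domain_colors(urls, palette=PALETTE):
    """Map each domain to a color, in order of first appearance."""
    colors = {}
    for url in urls:
        domain = urlsplit(url).netloc
        if domain not in colors:
            colors[domain] = palette[len(colors) % len(palette)]
    return colors


example = domain_colors([
    "http://a.com/1",
    "http://b.com/2",
    "https://a.com/3",
])
```

The resulting mapping could then be used to build a per-URL color column for the plot's data source.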