Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce JobsObserver interface and place both JobStats and error notification behind that interface #436

Merged
merged 8 commits into from May 3, 2015

Conversation

azakkerman
Copy link
Contributor

I ran all tests and they seem to pass.

@@ -100,8 +102,7 @@ class JobManagementResource @Inject()(val jobScheduler: JobScheduler,
}
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This deleted job notification will no longer be sent. No notification is sent in deregisterJob. We should still send the notification here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

deregisterJob calls jobsObserver.onEvent(JobRemoved(job)), I just need to handle that in NotifyingJobObserver

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sounds good, could you add that so that this notification is still sent?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I see it was just added. Thanks!

import com.google.inject.Inject
import org.apache.mesos.chronos.scheduler.jobs._
import org.joda.time.{DateTimeZone, DateTime}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer the naming JobNotificationObserver

class JobNotificationObserver @Inject()(val notificationClients: List[ActorRef] = List(),
val clusterName: Option[String] = None) extends JobsObserver {
private[this] val log = Logger.getLogger(getClass.getName)
val clusterPrefix = clusterName.map(name => s"[$name]").getOrElse("")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's evaluate clusterPrefix inside JobRetriesExhausted (only where it is used)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is also used in JobRemoved case

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, makes sense

message=Some(taskStatus.getMessage),
attempt=Some(attempt),
isFailure=Some(true))
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we have a more descriptive variable name other than j?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

jobOrName?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure, another option is jobInfo or jobNameOrJob.

@elingg
Copy link
Member

elingg commented Apr 28, 2015

Great work, @azakkerman! I think use of an common interface for JobStats and JobNotifications is much cleaner. Thanks for addressing my comments.

@elingg
Copy link
Member

elingg commented Apr 28, 2015

LGTM. @brndnmtthws will also be reviewing this since this is a significant PR.

import org.joda.time.Period
import org.specs2.mock._

object MockJobUtils extends Mockito {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding the test!

@brndnmtthws
Copy link
Member

This looks great to me. Before we merge, can you please test this in a cluster? I think we can use the test-suite for that purpose. @elingg can you help with that?

@elingg
Copy link
Member

elingg commented Apr 29, 2015

@brndnmtthws, yes, I will test in my own cluster. One thing is that we haven't had a chance to hook up Chronos with Cassandra in the test cluster, and I'm not sure we will get to it today. Do you think we should get that up before merging?

@azakkerman, could you also try testing in a cluster as well? @kensipe should be able to help you with set up as well.

@brndnmtthws
Copy link
Member

It's pretty easy, just point it at C* (using Mesos-DNS names). You'll have to create the keyspace manually, but we should really have it happen within Chronos (CREATE KEYSPACE IF NOT EXISTS [derp] WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 3 };).

@elingg
Copy link
Member

elingg commented Apr 29, 2015

ok @brndnmtthws I will be sure test with Cassandra before merging it. If @azakkerman also has a chance to test earlier than I do, that would be great.

@azakkerman
Copy link
Contributor Author

I am trying to bring up DCOS in AWS with Ken's help to try testing it.

On Wed, Apr 29, 2015 at 10:37 AM, Elizabeth Lingg notifications@github.com
wrote:

ok @brndnmtthws https://github.com/brndnmtthws I will be sure test with
Cassandra before merging it. If @azakkerman
https://github.com/azakkerman also has a chance to test earlier than I
do, that would be great.


Reply to this email directly or view it on GitHub
#436 (comment).

@azakkerman
Copy link
Contributor Author

I have a DCOS cluster running in AWS with Cassandra deployed, but don't quite know how to configure chronos to run against that Cassandra instance. Who is a good resource for this?

@elingg
Copy link
Member

elingg commented Apr 29, 2015

Hello @azakkerman, you will have to run your version on Chronos with the cassandra options enabled and simply point to cassandra-dcos-node.cassandra.dcos.mesos for the cassandra hostname. The PORT on DCOS is 9160. You will also need to create the keyspace as Brenden mentioned above:
CREATE KEYSPACE IF NOT EXISTS [derp] WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 3 };)

@elingg
Copy link
Member

elingg commented Apr 29, 2015

If you would like to discuss over chat since I am WFH today, maybe you could ask @kensipe to add you to our slack channel. We can talk via IM. I will also be in the office tomorrow.

@elingg
Copy link
Member

elingg commented Apr 29, 2015

Here are the Cassandra options for Chronos:

"cassandra_consistency”: Consistency to use for Cassandra (default = ANY).

“cassandra_contact_points”: Comma separated list of contact points for Cassandra.

“cassandra_keyspace”: Keyspace to use for Cassandra (default = metrics).

“cassandra_port”: Port for Cassandra (default = 9042).

“cassandra_table”: Table to use for Cassandra (default = chronos).

“cassandra_ttl”: TTL for records written to Cassandra (default = 31536000).

@azakkerman
Copy link
Contributor Author

I was able to launch chronos with my custom jar, and write to cassandra. I see rows in cassandra corresponding to my test job instances. What sort of tests should I perform?

@azakkerman
Copy link
Contributor Author

cqlsh> select * from metrics.chronos
... ;

id | ts | attempt | is_failure | job_name | job_owner | job_parents | job_schedule | message | slave_id | task_state
--------------------------+--------------------------+---------+------------+----------+--------------------+-------------+----------------------------------+---------+-----------------------------------------+---------------
ct:1430339602000:0:foo1: | 2015-04-29 20:33:22+0000 | 0 | null | foo1 | akkerman@gmail.com | null | R1/2015-04-29T20:35:22.000Z/PT1M | null | 20150429-173600-1157627914-5050-1074-S3 | TASK_RUNNING
ct:1430339602000:0:foo1: | 2015-04-29 20:33:22+0000 | 0 | null | foo1 | akkerman@gmail.com | null | R1/2015-04-29T20:35:22.000Z/PT1M | null | 20150429-173600-1157627914-5050-1074-S3 | TASK_FINISHED
ct:1430339662000:0:foo1: | 2015-04-29 20:34:22+0000 | 0 | null | foo1 | akkerman@gmail.com | null | R0/2015-04-29T20:36:22.000Z/PT1M | null | 20150429-173600-1157627914-5050-1074-S1 | TASK_RUNNING
ct:1430339662000:0:foo1: | 2015-04-29 20:34:22+0000 | 0 | null | foo1 | akkerman@gmail.com | null | R0/2015-04-29T20:36:22.000Z/PT1M | null | 20150429-173600-1157627914-5050-1074-S1 | TASK_FINISHED
ct:1430339542265:0:foo1: | 2015-04-29 20:32:22+0000 | 0 | null | foo1 | akkerman@gmail.com | null | R0/2015-04-29T20:33:22.000Z/PT1M | null | 20150429-173600-1157627914-5050-1074-S4 | TASK_RUNNING
ct:1430339542265:0:foo1: | 2015-04-29 20:32:26+0000 | 0 | null | foo1 | akkerman@gmail.com | null | R0/2015-04-29T20:33:22.000Z/PT1M | null | 20150429-173600-1157627914-5050-1074-S4 | TASK_FINISHED

@elingg
Copy link
Member

elingg commented Apr 29, 2015

Nice job! Can you see the job history in the UI?

@azakkerman
Copy link
Contributor Author

screen shot 2015-04-29 at 2 00 29 pm

@elingg
Copy link
Member

elingg commented Apr 29, 2015

Looking good! Can you try a job that takes longer to run to confirm you can see some stats? Here is an example of what the UI should look like: #402

@azakkerman
Copy link
Contributor Author

I ran a sleep job that exits 1 in the end and verified that the email notification still works. Also deleted a job and got some emails. What specific stats are you looking for?

@elingg
Copy link
Member

elingg commented Apr 29, 2015

Notification emails are also a great test. If you run a job that runs longer than 0 or 1 sec (let's say sleep 5), you should see stats like in the first screen shot in this link: #402

@elingg
Copy link
Member

elingg commented Apr 29, 2015

After that, we should have tested sufficiently!

@azakkerman
Copy link
Contributor Author

Btw, it looks like if one had a job with limited repeat count and the job gets disabled (because repeat count is now 0), there is no way to re-enable the job by incrementing the repeat count and updating start time, the job remains disabled. I think this is a bug in the UI.
screen shot 2015-04-29 at 2 45 38 pm

@elingg
Copy link
Member

elingg commented Apr 29, 2015

Very possible that it is a UI bug. Anyway your test cases sound great! I think we are good to merge. Any final objections @brndnmtthws?

@brndnmtthws
Copy link
Member

Wonderful!

brndnmtthws added a commit that referenced this pull request May 3, 2015
Introduce JobsObserver interface and place both JobStats and error notification behind that interface
@brndnmtthws brndnmtthws merged commit 3ad15f0 into mesos:master May 3, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants