Refactor statusapiv1 to trait and implement for ease of creation of these objects when we implement our own parser #248

Merged
merged 9 commits on May 23, 2017

Conversation

shankar37
Contributor

This PR changes the statusapiv1 classes into traits and adds Impl classes that implement them. On its own, the change achieves nothing visible, but it prepares for rewriting the event log parsing code to use our own custom listeners and to build these objects from them. The previous classes had only val members, so they were hard to create: all the data had to be available before an object could be constructed. The new Impl classes have var members, which makes it easy to set data incrementally.
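A minimal sketch of the trait-plus-Impl pattern described above. The names `JobCounts` and `JobCountsImpl` are hypothetical and chosen only for illustration; the three field names are borrowed from the diff excerpt in this PR, and declaring the trait members as `def` is an assumption, not necessarily what the PR does.

```scala
// Read-only view handed to readers such as heuristics (hypothetical trait name).
trait JobCounts {
  def numCompletedStages: Int
  def numSkippedStages: Int
  def numFailedStages: Int
}

// Mutable implementation for writers: each var implements the trait's abstract def.
class JobCountsImpl(
    var numCompletedStages: Int = 0,
    var numSkippedStages: Int = 0,
    var numFailedStages: Int = 0) extends JobCounts

object JobCountsExample {
  // A writer can start from an empty object and fill fields in as data arrives.
  def build(): JobCounts = {
    val counts = new JobCountsImpl()
    counts.numCompletedStages += 1
    counts.numCompletedStages += 1
    counts.numFailedStages += 1
    counts // returned typed as the read-only trait
  }
}
```

Callers that hold only a `JobCounts` see getters and no setters, so readers stay read-only while the writer keeps the `JobCountsImpl` reference for updates.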

@superbobry
Contributor

Hi @shankar37, I think the PR is missing a commit renaming API classes to *Impl.

The PR seems to be missing a bit of context. Do you plan to make the JSON parsing streaming?

@shkhrgpt
Contributor

@shankar37 From this PR alone, it is difficult to understand the motivation for the change, and therefore difficult to review it. Can you please share the design of the event log parsing code?

@shankar37
Contributor Author

Here is the context, hopefully.

Currently, event log parsing happens in two ways. For the SparkFetcher, the code in SparkLogClient.scala streams the JSON event log and looks for the EnvironmentUpdate event only. It does not use Spark's replaybus and relies only on Spark's public APIs. In addition, the SparkFetcher uses the SparkRestClient to get the rest of the data from the REST APIs and deserializes it directly into the statusapiv1 objects.

The other way is in SparkFSFetcher, which uses the replaybus and SparkListeners to parse the event log. It then uses the LegacyDataConverter to convert the data it read into statusapiv1 objects. There are a couple of problems with this. First, it doesn't parse all the data from the event log, so some of the data we might require is not present. Second, in light of SPARK-18085 and this commit (apache/spark@561e9cc), the replaybus and these listeners are deprecated. Hence, we need to change the event log parsing for the FS fetcher to use our own parser.

What I am planning to do is write a generic EventLogParser that takes a stream and parses the event log JSON. It will parse like SparkLogClient does, but for all the events we care about, and will produce the statusapiv1 data in its entirety. Both SparkFSFetcher and SparkLogClient will then be made to use it. It will take boolean flags indicating which parts of the data need to be parsed, to avoid parsing data the client doesn't need. To do that, the EventLogParser needs to read one event at a time, store some intermediate data, and convert it into the statusapiv1 structure. I am trying to avoid creating a lot of intermediate data and copying it field by field into statusapiv1, as LegacyDataConverter does. Instead, I want to create the statusapiv1 Impl objects, fill in and modify their data as I parse and calculate it, and then just return them as instances of the statusapiv1 traits.

This PR is the first step towards that. It makes SparkApplicationData contain only traits. So readers (heuristics and metricsAggregator) will continue to get read-only data, but writers can create the Impl objects and write to individual fields. The problem with a class that has only val members is that you can create it only when all the fields are available, and that is difficult to do when you are parsing event logs one line at a time.
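The plan above could look roughly like the following sketch. Everything here is hypothetical: the event types, `AppSummary`, `AppSummaryImpl`, `EventLogParserSketch`, and the flag names are invented for illustration, and the JSON decoding step is assumed to have already turned each log line into an event object.

```scala
// Hypothetical, simplified events; real event logs are JSON, decoded elsewhere.
sealed trait Event
final case class EnvironmentUpdate(properties: Map[String, String]) extends Event
final case class StageCompleted(stageId: Int, failed: Boolean) extends Event

// Read-only view for heuristics and aggregators (hypothetical).
trait AppSummary {
  def sparkProperties: Map[String, String]
  def numCompletedStages: Int
  def numFailedStages: Int
}

// Mutable implementation that only the parser writes to.
class AppSummaryImpl extends AppSummary {
  var sparkProperties: Map[String, String] = Map.empty
  var numCompletedStages: Int = 0
  var numFailedStages: Int = 0
}

object EventLogParserSketch {
  // Consumes one event at a time; the boolean flags skip data the caller does not need.
  def parse(
      events: Iterator[Event],
      parseEnvironment: Boolean = true,
      parseStages: Boolean = true): AppSummary = {
    val summary = new AppSummaryImpl
    events.foreach {
      case EnvironmentUpdate(props) if parseEnvironment =>
        summary.sparkProperties = summary.sparkProperties ++ props
      case StageCompleted(_, failed) if parseStages =>
        if (failed) summary.numFailedStages += 1
        else summary.numCompletedStages += 1
      case _ => // event type not requested or not handled: ignore it
    }
    summary // an Impl is already an instance of the trait, so no cast is needed
  }
}
```

Because the Impl object is updated in place as events arrive, no intermediate structure or field-by-field copy is needed before handing the result to readers.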

@shkhrgpt
Contributor

Thanks @shankar37 for providing this detail.
I completely agree that using Spark replaybus and listeners is not a good idea. We are also facing the issue that some of the Spark listeners have changed from Spark 1.6 to 2.1, which makes Dr Elephant incompatible with Spark 2.1 event logs.
I strongly believe that we should completely eliminate the dependency on Spark, and therefore I like the motivation for this PR.
I just looked into SparkLogClient, and it has the logic to parse eventlog. However, it still depends on a couple of Spark classes, SparkListenerEnvironmentUpdate and SparkListenerEvent. Are you planning to remove those dependencies as well in the proposed eventlog parser?
Please let me know if I can help in this change.

@shankar37
Contributor Author

I am planning to move that code into the new EventLogParser and expand on it. I will continue to rely on SparkListenerEvent and its derived classes, as I didn't find any good way to remove that dependency. Do you have any ideas on how to avoid depending on it?

@shkhrgpt
Contributor

shkhrgpt commented May 16, 2017

I don't have any ideas either on how to remove these dependencies. Maybe we should think about it when you submit the PR. As of now, I think the goal should be to have the minimum dependency on Spark.

@shkhrgpt
Contributor

This change LGTM.
Thanks @shankar37

val numCompletedStages: Int,
val numSkippedStages: Int,
val numFailedStages: Int)
trait ApplicationInfo {
Contributor

@shankar37
Why do we have these traits? Why can't we just have simple classes with var arguments?

@shankar37
Contributor Author

shankar37 commented May 22, 2017 via email

@shkhrgpt
Contributor

@shankar37 LGTM.
Please merge this change.
Thank you.

@akshayrai akshayrai merged commit cae79c7 into linkedin:master May 23, 2017
skakker pushed a commit to skakker/dr-elephant that referenced this pull request Dec 14, 2017

4 participants