Various fixes to embedded gobblin to allow it to run on mr mode. #1344

Merged
merged 2 commits into from
Oct 31, 2016

ibuenros
Contributor

@ibuenros ibuenros commented Oct 26, 2016

Major changes:

  • Embedded Gobblin will automatically add the correct classpath to the distributed cache for a basic Gobblin run, and allows users to add their own jars.
  • Correctly set sys config in embedded mode.
  • MR mode will only upload jars to the distributed cache when setting up the job. This avoids uploading them when there are no work units. Additionally, MR mode can now reuse a common jar directory to avoid new uploads on every run.
  • bin/gobblin.sh shell script will automatically load Hadoop classpath if HADOOP_HOME is defined.

@ibuenros
Contributor Author

@chavdar can you review?

@ibuenros ibuenros assigned ibuenros and chavdar and unassigned ibuenros Oct 26, 2016
@coveralls

Coverage Status

Changes Unknown when pulling 286f539 on ibuenros:embedded-working into * on linkedin:master*.

@ibuenros ibuenros force-pushed the embedded-working branch 2 times, most recently from 111b1bc to 0d3bc00 Compare October 27, 2016 16:12
@@ -140,6 +140,10 @@

final FileSystem sourceFs = getSourceFileSystem(state);
final FileSystem targetFs = getTargetFileSystem(state);

log.info(String.format("Identified source file system at %s and target file system at %s.",
Contributor

log.info("Identified source file system at {} and target file system at {}.", ...) is more efficient.
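
To illustrate the suggestion, here is a minimal sketch of why the parameterized form can be cheaper. With String.format the message is built before log.info is even called, while a "{}" template is only expanded when the level is enabled. SketchLogger is a hypothetical stand-in written for this example and is not the SLF4J API; it returns the formatted message instead of writing it, so the behavior is easy to check.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in for a logger with "{}" placeholders; not the SLF4J API. */
final class SketchLogger {
  private final boolean infoEnabled;
  /** Counts how many times a message was actually formatted. */
  final AtomicInteger formatCalls = new AtomicInteger();

  SketchLogger(boolean infoEnabled) {
    this.infoEnabled = infoEnabled;
  }

  /** Expands each "{}" with the next argument, but only when INFO is enabled. */
  String info(String template, Object... args) {
    if (!infoEnabled) {
      return null; // level disabled: the message string is never built
    }
    formatCalls.incrementAndGet();
    StringBuilder sb = new StringBuilder();
    int from = 0;
    for (Object arg : args) {
      int at = template.indexOf("{}", from);
      if (at < 0) {
        break; // more arguments than placeholders: ignore the extras
      }
      sb.append(template, from, at).append(arg);
      from = at + 2;
    }
    sb.append(template.substring(from));
    return sb.toString();
  }
}
```

With a disabled logger, info(...) returns without touching the template, whereas log.info(String.format(...)) would have paid the formatting cost before the level check.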

* uses {@link #_sysConfig}, which is only initialized when the user runs {@link #withSysConfig(Configurable)} after
* construction.
*/
public Launcher initialize() {
Contributor

I don't like initialize() methods. How about the getMetrics() method check if _metrics is null and set it on the fly?
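
A minimal sketch of the lazy alternative, assuming the metrics object can be derived from the sys config at first use. LazyLauncherSketch, its String-typed fields, and metricsBuilds() are hypothetical names for illustration only, not the actual Launcher code.

```java
/** Hypothetical launcher fragment showing a lazily initialized getter. */
final class LazyLauncherSketch {
  private String sysConfig = "default"; // stand-in for the real sys config
  private String metrics;               // stays null until first requested
  private int metricsBuilds = 0;

  /** Builder-style setter, callable any time before the first getMetrics(). */
  LazyLauncherSketch withSysConfig(String sysConfig) {
    this.sysConfig = sysConfig;
    return this;
  }

  /** Builds the metrics object on first access, using the config set so far. */
  synchronized String getMetrics() {
    if (metrics == null) {
      metrics = "metrics[" + sysConfig + "]";
      metricsBuilds++;
    }
    return metrics;
  }

  /** Exposed only so the sketch can show that the value is built once. */
  synchronized int metricsBuilds() {
    return metricsBuilds;
  }
}
```

Because the value is created on first access, a caller can invoke withSysConfig(...) after construction without a separate initialize() step. One trade-off is that a config change after the first getMetrics() call is not picked up.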

@@ -358,6 +370,9 @@ public Launcher withSysConfig(Configurable sysConfig) {
/** Parent Gobblin instance */
public Launcher withGobblinInstanceEnvironment(GobblinInstanceEnvironment gobblinInstance) {
_gobblinEnv = Optional.of(gobblinInstance);
if (!_sysConfig.isPresent()) {
Contributor

getSysConfig() is already doing that through getDefaultSysConfig().

throws IOException {
TimingEvent distributedCacheSetupTimer =
this.eventSubmitter.getTimingEvent(TimingEvent.RunJobTimings.MR_DISTRIBUTED_CACHE_SETUP);

- Path jarFileDir = new Path(this.mrJobDir, JARS_DIR_NAME);
+ Path jarFileDir = this.jarsDir;
Contributor

I think this is dangerous as jarsDir will be shared across different jobs/runs. This may cause jars with mismatched versions to be stored.

I think you have two options:

  • Use both a job-specific and global directory. Use the latter if the name and size match, otherwise upload to the job-specific dir.
  • Have a more complex layout for the jar cache which allows different versions to be stored (like the gradle cache).
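
A rough sketch of the first option, under several assumptions: it uses java.nio.file on a local file system as a stand-in for the Hadoop FileSystem API, JarCacheSketch and resolveJar are hypothetical names, and seeding the shared directory when the jar is absent is a guess at the intended behavior rather than something stated above.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hypothetical helper: reuse a shared jar when name and size match. */
final class JarCacheSketch {
  /** Returns the path to put on the classpath, copying the jar only when needed. */
  static Path resolveJar(Path localJar, Path sharedDir, Path jobDir) throws IOException {
    Path shared = sharedDir.resolve(localJar.getFileName());
    if (!Files.exists(shared)) {
      // Assumption: the first job to need a jar seeds the shared directory.
      Files.createDirectories(sharedDir);
      return Files.copy(localJar, shared);
    }
    if (Files.size(shared) == Files.size(localJar)) {
      return shared; // same name and size: reuse without uploading
    }
    // Same name but different size: keep this copy private to the job.
    Files.createDirectories(jobDir);
    return Files.copy(localJar, jobDir.resolve(localJar.getFileName()),
        StandardCopyOption.REPLACE_EXISTING);
  }
}
```

Name and size are only a heuristic; two different builds of a jar can share both, so a checksum comparison would be stricter at the cost of reading the files.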

Contributor Author

Jars are added individually to the distributed classpath, so the same directory can safely hold multiple versions of the same jar.

Contributor

I am not sure I understand. Don't all embedded Gobblin jobs reuse the same path?

Contributor Author

Line 362 in MRJobLauncher is DistributedCache.addFileToClassPath(destJarFile, this.conf, this.fs); so the jars get added individually to the distributed classpath. The directory is just a container for all jars.

Contributor

Never mind. If the jars are already versioned, it should work.

@ibuenros
Contributor Author

@chavdar addressed comments.

@ibuenros ibuenros merged commit 9f09668 into apache:master Oct 31, 2016
@ibuenros ibuenros deleted the embedded-working branch October 31, 2016 22:03
3 participants