Development Mode

Utz Westermann edited this page Aug 11, 2017 · 14 revisions

Schedoscope has a development mode that assists you in testing views in a dev environment. The feature allows you to mark a view as in development. If a materialize command addresses this view, Schedoscope loads the view's dependencies from a different Schedoscope environment or Hadoop cluster. Specifically, it transforms the dependencies using a distcp transformation that copies data from the configured environment or Hadoop cluster.

In this manner, developers can easily get realistic test data into a dev environment. It relieves them of tracking down all the stage data required for their test setup just to get Schedoscope to materialize the direct dependencies of the views they are actually interested in.

Usage

The development mode is enabled and configured via schedoscope.conf. Check Schedoscope Configuration for more information.

schedoscope {

  # ...

  #
  # Settings related to dev environment
  #

  development {

    #
    # Enables the development feature. Default is false.
    #
    enabled = true

    #
    # Views under development i.e. the views for which the dependencies will be stubbed
    # by loading data from the prod environment. Default is [""].
    #
    viewUrls = ["schedoscope.example.osm.datamart/ShopProfiles"]

    #
    # Address of the production namenode. Default is "localhost:8020".
    #
    prodNameNode = "prod-hadoop:8020"

    #
    # The environment under prod. Default is "prod".
    #
    prodEnv = "prod"

    #
    # Hdfs root on prod. Default is "/hdp".
    #
    prodViewDataHdfsRoot = "/hdp"
  }

}

viewUrls marks the views under development. The remaining settings help Schedoscope determine the location of views in the remote Schedoscope instance you want to copy data from. In principle, you can copy these values from the schedoscope.conf file of the instance you want to get data from.
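
As an illustration of that mapping, the sketch below pairs each development setting with the setting it would presumably mirror in the remote instance's schedoscope.conf. The remote key names mentioned in the comments (schedoscope.hadoop.nameNode, schedoscope.app.environment, schedoscope.hadoop.viewDataHdfsRoot) are assumptions and are not taken from this page; verify them against Schedoscope Configuration before relying on them.

```
schedoscope {
  development {
    enabled = true
    viewUrls = ["schedoscope.example.osm.datamart/ShopProfiles"]

    # Assumed to mirror schedoscope.hadoop.nameNode of the remote instance
    prodNameNode = "prod-hadoop:8020"

    # Assumed to mirror schedoscope.app.environment of the remote instance
    prodEnv = "prod"

    # Assumed to mirror schedoscope.hadoop.viewDataHdfsRoot of the remote instance
    prodViewDataHdfsRoot = "/hdp"
  }
}
```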

Trigger distcp via ssh

In some cases, the namenode of the production cluster is not accessible from the dev environment. In this case, the distcp command can be executed via ssh, provided the prod cluster has access to the namenodes of your dev cluster. At the moment, all authentication, including ssh, is done via Kerberos.

Note: Ensure the target host for ssh is included in your known_hosts file.

The feature can be enabled by extending the previous configuration:

  development {
    
    ...

    #
    # Enables the usage of ssh to run the distcp on a remote machine. Default is false.
    #
    sshEnabled = true

    #
    # Ssh to this machine. Only needed if ssh is enabled. Default is "".
    #
    sshTarget = "remote-machine"
   
  }

Note: The ssh command is executed by a shell transformation, which by default has a concurrency of 1. To speed up the copy process, you can increase this value.
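
A minimal sketch of raising that limit, assuming the concurrency of shell transformations is controlled by a schedoscope.transformations.shell.concurrency setting. The key path is an assumption here, so check Schedoscope Configuration for the exact name. The value 5 is an arbitrary example.

```
schedoscope {
  transformations {
    shell {
      # Assumed key: number of shell transformations run in parallel.
      # 5 is an arbitrary illustrative value, not a recommendation.
      concurrency = 5
    }
  }
}
```

Since each concurrent shell transformation starts its own distcp run via ssh, it is probably sensible to increase the value gradually and watch the load on both clusters.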
