
Avro2TF Official Tutorial

Chenya Zhang edited this page Mar 15, 2019 · 13 revisions



Avro2TF fills the data-processing gap before training: it makes users’ training data ready to be consumed by deep learning training frameworks. It reads raw user input data in any format supported by Spark and generates tensorized training data in Avro (LinkedIn is a heavy user of Avro) or TFRecord format.

Avro2TF exposes a JSON config in which a modeler specifies the tensors to use in training.

For each tensor, a user should specify two kinds of information:

What existing features are used to construct the tensor.

The expected name, dtype, and shape of the tensor.
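
The two kinds of information above can be sketched as a small Python snippet that builds such a config and prints it as JSON. This is a minimal sketch, not the confirmed Avro2TF schema: the key names (`features`, `labels`, `inputFeatureInfo`, `columnExpr`, `outputTensorInfo`), the dtype strings, and the column names are all assumptions for illustration and should be checked against the Avro2TF config reference.

```python
import json


def make_tensor_spec(column_expr, name, dtype, shape):
    """Bundle the two kinds of information needed for one tensor.

    All key names here are assumed for illustration.
    """
    return {
        # 1) Which existing feature (column) the tensor is built from.
        "inputFeatureInfo": {"columnExpr": column_expr},
        # 2) The expected name, dtype, and shape of the output tensor.
        "outputTensorInfo": {"name": name, "dtype": dtype, "shape": shape},
    }


# Hypothetical columns; an empty shape list stands for a scalar tensor.
config = {
    "features": [
        make_tensor_spec("userId", "userId", "long", []),
        make_tensor_spec("movieId", "movieId", "long", []),
    ],
    "labels": [
        make_tensor_spec("rating", "response", "float", []),
    ],
}

config_json = json.dumps(config, indent=2)
print(config_json)
```

Keeping the input side and the output side in separate sub-objects mirrors the split described above: one part says where the data comes from, the other says what the resulting tensor should look like.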

The final tensorized training data is stored in Avro or TFRecord format. Downstream in TensorFlow, we provide a native AvroRecordDataset TensorFlow API, in addition to the official TFRecordDataset API, to load the tensorized data into in-memory tensors.

Docker

Docker provides a way to run applications securely isolated in a container, packaged with all their dependencies and libraries.

Docker is a platform for developers and sysadmins to develop, deploy, and run applications with containers. The use of Linux containers to deploy applications is called containerization. Containers are not new, but their use for easily deploying applications is.

Containerization is increasingly popular because containers are:

Flexible: Even the most complex applications can be containerized.

Lightweight: Containers leverage and share the host kernel.

Interchangeable: You can deploy updates and upgrades on-the-fly.

Portable: You can build locally, deploy to the cloud, and run anywhere.

Scalable: You can increase and automatically distribute container replicas.

Stackable: You can stack services vertically and on-the-fly.

Jupyter Notebook

The Jupyter Notebook is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations and narrative text.

The notebook extends the console-based approach to interactive computing in a qualitatively new direction, providing a web-based application suitable for capturing the whole computation process: developing, documenting, and executing code, as well as communicating the results. The Jupyter notebook combines two components:

A web application: a browser-based tool for interactive authoring of documents which combine explanatory text, mathematics, computations and their rich media output.

Notebook documents: a representation of all content visible in the web application, including inputs and outputs of the computations, explanatory text, mathematics, images, and rich media representations of objects.

Docker Environment Setup

The Avro2TF tutorial runs in Docker. You will need to install Docker or Docker Toolbox on your system to use it.

To install Docker, follow the official instructions to download and install Docker for your operating system.

After you have installed Docker, launch the Docker daemon (this happens automatically on some systems).

To launch it manually, click the Docker icon in your applications folder. Then check that Docker is running by trying a command such as docker login or docker ps.

Install Avro2TF Open Source Docker Image

The next step is to install and launch the Avro2TF Open Source Docker image.

In your terminal, run the following to launch a container for the docker image:

docker run -p 8080:8888 --name avro2tf-offcial-tutorial linkedin/avro2tf-official-tutorial:latest

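Copy the URL printed in the terminal and paste it into your browser to play with the Jupyter notebook. (Note: remember to change the port in the URL from 8888 to 8080, because the command above maps port 8080 on your machine to port 8888 inside the container.)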
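
The port change can also be done programmatically. The following is a minimal sketch using only the Python standard library; the function name `to_host_url` and the sample token are hypothetical, and the URL format assumes the usual `http://host:8888/?token=...` line that Jupyter prints on startup.

```python
from urllib.parse import urlsplit, urlunsplit


def to_host_url(jupyter_url, container_port=8888, host_port=8080):
    """Rewrite the container port in a Jupyter URL to the mapped host port."""
    parts = urlsplit(jupyter_url)
    # Leave the URL untouched if it does not use the container port.
    if parts.port != container_port:
        return jupyter_url
    netloc = f"{parts.hostname}:{host_port}"
    return urlunsplit(
        (parts.scheme, netloc, parts.path, parts.query, parts.fragment)
    )


# Sample URL with a made-up token, for illustration only.
printed_url = "http://127.0.0.1:8888/?token=abc123"
print(to_host_url(printed_url))
```

The defaults match the `-p 8080:8888` mapping in the command above; pass different ports if you change that mapping.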

Docker image from Docker Hub:

The tutorial contains detailed instructions on how to run it within the Jupyter notebook.

The tutorial sample gives you read permission. If you want to run some quick experiments of your own, feel free to "Make a copy", but please do not distribute it without notifying us.

Start playing with our tutorials on MovieLens and text data now! :)
