# Abacuza

A Simplified Data Processing Platform

## Prerequisites

- Docker Engine: v19.03 or above
- Docker Compose: v1.27.2 or above
## How to Build

1. Clone the repository:

   ```bash
   git clone https://github.com/daxnet/abacuza
   ```

2. Build everything with the following command:

   ```bash
   docker-compose -f docker-compose.build.yaml build
   ```
## How to Debug (Services)

1. Start the infrastructure services, such as the database and the Redis cache:

   ```bash
   docker-compose -f docker-compose.dev.yaml up
   ```

2. Open `abacuza.sln` from the `src/services` directory in Visual Studio 2019

3. Press F5 to debug
## How to Debug (Front-end)

1. Follow the instructions in How to Debug (Services) to start the infrastructure services and the backend services

2. Go to the `src/client` directory

3. Run `npm install` to install the dependencies

4. Run `npm start` to start the Angular development server at `localhost:4200`

5. Navigate to http://localhost:4200 in a web browser to access the Abacuza Administrator dashboard
## How to Run

1. Execute the following command to run everything:

   ```bash
   docker-compose up
   ```

2. Navigate to http://localhost:9320 in a web browser to access the Abacuza Administrator dashboard
## Word Count Walkthrough

Microsoft provides a .NET for Apache Spark tutorial that demonstrates counting the words in a given text file. We will use that demo to show the features and data processing capabilities provided by Abacuza.

### Develop the Word Count Application

An application in Abacuza describes how the data should be processed or transformed; it is usually developed by data scientists to meet their analysis needs. Applications are assigned to job runners and loaded by a job runner when a project requests a data processing session. Developing an application for Abacuza involves the following tasks:
- Create a new .NET 5 console application
- Add the `Microsoft.Spark` and `Abacuza.JobRunners.Spark.SDK` NuGet package references
- Customize the application
- Build and pack the application
1. Create a new .NET 5 console application:

   ```bash
   dotnet new console -f net5.0 -n WordCountApp
   ```

2. Add the NuGet package references:

   ```bash
   dotnet add package Microsoft.Spark --version 1.0.0
   dotnet add package Abacuza.JobRunners.Spark.SDK --prerelease
   ```
3. Add a new class that derives from `SparkRunnerBase`; its implementation is taken from the example code provided by Microsoft:

   ```csharp
   using Abacuza.JobRunners.Spark.SDK;
   using Microsoft.Spark.Sql;

   namespace WordCountApp
   {
       public class WordCountRunner : SparkRunnerBase
       {
           public WordCountRunner(string[] args) : base(args) { }

           protected override DataFrame RunInternal(SparkSession sparkSession, DataFrame dataFrame)
               => dataFrame
                   .Select(Functions.Split(Functions.Col("value"), " ").Alias("words"))
                   .Select(Functions.Explode(Functions.Col("words")).Alias("word"))
                   .GroupBy("word")
                   .Count()
                   .OrderBy(Functions.Col("count").Desc());
       }
   }
   ```
4. In `Program.cs`, update the `Main` method to invoke the `WordCountRunner`:

   ```csharp
   static void Main(string[] args)
   {
       new WordCountRunner(args).Run();
   }
   ```
5. Under the WordCountApp project folder, execute the following command to publish the application targeting the Linux x64 platform:

   ```bash
   dotnet publish -c Release -f net5.0 -r linux-x64 -o published
   ```
6. Zip the contents of the `published` folder. Note that the zip file should contain only the files under the `published` folder; the `published` folder itself shouldn't be included in the archive
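This packaging step is easy to get wrong. As an illustrative helper (not part of Abacuza or the .NET tooling), the following Python sketch zips only the contents of a folder, so no top-level `published/` entry ends up in the archive:

```python
import os
import zipfile

def zip_contents(src_dir: str, zip_path: str) -> None:
    """Zip the files *inside* src_dir so the archive root holds the
    files directly, without a top-level folder entry."""
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for root, _dirs, files in os.walk(src_dir):
            for name in files:
                full = os.path.join(root, name)
                # arcname is the path relative to src_dir, so the
                # source folder itself never appears in the zip
                zf.write(full, arcname=os.path.relpath(full, src_dir))
```

For example, `zip_contents("published", "WordCountApp20210313.zip")` produces an archive whose root contains the published files directly.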
### Create a Cluster Connection

Before doing the data transformation, you will need to create a cluster connection in Abacuza, which connects to a data processing cluster. By default, Abacuza ships with a Spark cluster implementation, which is the one used here.
1. Start the Abacuza services and the front-end dashboard:

   ```bash
   docker-compose up
   ```

   For more information about running Abacuza locally, refer to the steps above

2. Open your web browser and navigate to http://localhost:9320 to open the Abacuza dashboard
3. In the left pane, from the `Cluster` menu, click `Connections`, then in the `Cluster Connections` page, click the plus icon to create a new cluster connection

4. In the `Create New Cluster Connection` dialog, fill in the name and description fields, and for `Cluster type` choose `spark`. In the `Settings` text box, input the Spark settings in JSON format; to keep things simple, we just specify the base URL of the Spark Livy service. Click the `SAVE` button to save the changes

5. Your cluster connection to the running Spark instance should now be ready
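As a minimal illustration, assuming a local Livy server on its default REST port 8998, the Settings JSON might look like the following. The `baseUrl` property name is an assumption in this sketch; check the Abacuza documentation for the exact settings schema:

```json
{
  "baseUrl": "http://localhost:8998"
}
```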
### Create a Job Runner

Follow the steps below to create a job runner in Abacuza.
1. From the `Jobs` menu, click `Job Runners`, then in the `Job Runners` page, click the plus icon to create a new job runner

2. In the `Create Job Runner` dialog, fill in the name and description for the job runner, and for the `Cluster type`, choose `Spark`

3. Click the `SAVE` button; Abacuza will redirect you to the `Job Runner Details` page

4. In the `Job Runner Details` page, under the `Binaries` section, add the following two files to the job runner:
   - `microsoft-spark-3-0_2.12-1.0.0.jar` - you can find it in your `published` folder
   - `WordCountApp20210313.zip` - this is the zip file you created in step 6 of Develop the Word Count Application
5. Under the `Payload template` section, use the following JSON document:

   ```json
   {
     "file": "${jr:binaries:microsoft-spark-3-0_2.12-1.0.0.jar}",
     "className": "org.apache.spark.deploy.dotnet.DotnetRunner",
     "args": [
       "${jr:binaries:WordCountApp20210313.zip}",
       "WordCountApp",
       "${proj:input-endpoint}",
       "${proj:input-endpoint-settings}",
       "${proj:output-endpoint}",
       "${proj:output-endpoint-settings}",
       "${proj:context}"
     ]
   }
   ```

   Note that the `${jr:binaries:...}` placeholder refers to the binary files that you've uploaded to the current job runner

6. Save the job runner
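Conceptually, Abacuza resolves each `${...}` placeholder before submitting the payload to the cluster: `jr:binaries:<name>` resolves to an uploaded binary, and `proj:<key>` to project-level values. The following Python sketch illustrates this style of template expansion; the function and the mapping are hypothetical, for illustration only, and not part of the Abacuza API:

```python
import re

def render_payload(template: str, values: dict) -> str:
    """Replace ${scope:key} placeholders with entries from `values`.

    `values` maps the full placeholder body (e.g. 'proj:context' or
    'jr:binaries:app.zip') to its resolved string. Placeholders with
    no entry in `values` are left untouched."""
    def _sub(m):
        return values.get(m.group(1), m.group(0))
    return re.sub(r"\$\{([^}]+)\}", _sub, template)
```

For example, `render_payload("${proj:context}", {"proj:context": "ctx"})` returns `"ctx"`.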
### Create a Project and Run the Job

1. From the `Projects` menu, click `Projects`

2. In the `Projects` page, click the plus icon to add a new project
3. In the `Create New Project` dialog, fill in the name and description of the project. For `Input Endpoint`, choose `Text Files`; for `Output Endpoint`, choose `Console`, which means that we want the output of the data processing to be shown in the console log. For the `Job Runner`, choose the one we created in the previous steps

4. Save the project; the `Project Details` page will be shown
5. Let's prepare some data: follow the instructions on Microsoft's official site to create a `demo.txt` file

6. On the `Project Details` page, under the `INPUT` section, add `demo.txt` as the project input
7. Click the `SUBMIT` button; the data processing job will be submitted to one of the clusters whose type is `spark`, and on that cluster the customized application we developed above will be executed for data processing. You can monitor the status of the execution from the `REVISIONS` tab of the `Project Details` page

8. Once the job has completed successfully, you can click the `log` icon to see the logs. In this example, you can see the word-count output in the log
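The log contains the word-count table produced by the pipeline above: each line is split on spaces, the words are exploded into rows, grouped, counted, and sorted by count in descending order. The same computation can be sketched in plain Python (for illustration only; Abacuza actually runs it on Spark):

```python
from collections import Counter

def word_count(lines):
    """Mimic the Spark pipeline: split each line on spaces, count
    every word, and return (word, count) pairs sorted by count
    in descending order. Empty tokens are skipped."""
    counts = Counter(
        word for line in lines for word in line.split(" ") if word
    )
    return sorted(counts.items(), key=lambda kv: -kv[1])
```

For example, `word_count(["the quick brown fox", "the lazy dog the"])` ranks `("the", 3)` first.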
For more information about the architecture, the design concepts and the developer's manual, please refer to the Abacuza Documentation.