# Abacuza

A Simplified Data Processing Platform

## Prerequisites

- Docker Engine: v19.03 or above
- Docker Compose: v1.27.2 or above
## How to Build

1. Clone the repository:

   ```bash
   git clone https://github.com/daxnet/abacuza
   ```

2. Build everything with the following command:

   ```bash
   docker-compose -f docker-compose.build.yaml build
   ```
## How to Debug (Services)

1. Start the infrastructure services, such as the database and the Redis cache:

   ```bash
   docker-compose -f docker-compose.dev.yaml up
   ```

2. Open `abacuza.sln` from the `src/services` directory in Visual Studio 2019

3. Press F5 to debug
## How to Debug (Front-end)

1. Follow the instructions in How to Debug (Services) to start the infrastructure services and the backend services

2. Go to the `src/client` directory

3. Run `npm install` to install the dependencies

4. Run `npm start` to start the Angular development server at `localhost:4200`

5. Navigate to http://localhost:4200 in a web browser to access the Abacuza Administrator dashboard
## How to Run

1. Execute the following command to run everything:

   ```bash
   docker-compose up
   ```

2. Navigate to http://localhost:9320 in a web browser to access the Abacuza Administrator dashboard
## Word Count Walkthrough

Microsoft provides a .NET for Apache Spark tutorial that demonstrates counting the words in a given text file. We will use that demo to show the features and data processing capabilities provided by Abacuza.

### Develop the Word Count Application

An application in Abacuza describes how the data should be processed or transformed; it is usually developed by data scientists to meet their analysis needs. Applications are assigned to job runners and loaded by a job runner when a project requests a data processing session. Developing an application for Abacuza involves the following tasks:
- Create a new .NET 5 console application
- Add the `Microsoft.Spark` and `Abacuza.JobRunners.Spark.SDK` NuGet package references
- Customize the application
- Build and pack the application
1. Create a new .NET 5 console application:

   ```bash
   dotnet new console -f net5.0 -n WordCountApp
   ```

2. Add the NuGet package references:

   ```bash
   dotnet add package Microsoft.Spark --version 1.0.0
   dotnet add package Abacuza.JobRunners.Spark.SDK --prerelease
   ```
3. Add a new class that derives from `SparkRunnerBase`; its implementation is taken from the example code provided by Microsoft:

   ```csharp
   using Abacuza.JobRunners.Spark.SDK;
   using Microsoft.Spark.Sql;

   namespace WordCountApp
   {
       public class WordCountRunner : SparkRunnerBase
       {
           public WordCountRunner(string[] args) : base(args) { }

           protected override DataFrame RunInternal(SparkSession sparkSession, DataFrame dataFrame)
               => dataFrame
                   .Select(Functions.Split(Functions.Col("value"), " ").Alias("words"))
                   .Select(Functions.Explode(Functions.Col("words")).Alias("word"))
                   .GroupBy("word")
                   .Count()
                   .OrderBy(Functions.Col("count").Desc());
       }
   }
   ```
4. In `Program.cs`, update the `Main` method to invoke the `WordCountRunner`:

   ```csharp
   static void Main(string[] args)
   {
       new WordCountRunner(args).Run();
   }
   ```
5. Under the WordCountApp project folder, execute the following command to publish the application targeting the Linux x64 platform:

   ```bash
   dotnet publish -c Release -f net5.0 -r linux-x64 -o published
   ```
6. Zip the contents of the `published` folder. Note that the zip file should contain only the files under the `published` folder; the `published` folder itself shouldn't be included in the archive
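This packaging step is easy to get wrong. As an illustrative helper (not part of Abacuza or the .NET tooling), the following Python sketch zips only the contents of a folder, so no top-level `published/` entry ends up in the archive:

```python
import os
import zipfile

def zip_contents(src_dir: str, zip_path: str) -> None:
    """Zip the files *inside* src_dir so the archive root holds the
    files directly, without a top-level folder entry."""
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for root, _dirs, files in os.walk(src_dir):
            for name in files:
                full = os.path.join(root, name)
                # arcname is the path relative to src_dir, so the
                # source folder itself never appears in the zip
                zf.write(full, arcname=os.path.relpath(full, src_dir))
```

For example, `zip_contents("published", "WordCountApp20210313.zip")` produces an archive whose root contains the published files directly.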
### Create a Cluster Connection

Before doing the data transformation, you will need to create a cluster connection in Abacuza, which connects to a data processing cluster. By default, Abacuza ships with a Spark cluster implementation, which is the one used here.
1. Start the Abacuza services and the front-end dashboard:

   ```bash
   docker-compose up
   ```

   For more information about running Abacuza locally, refer to the steps above

2. Open your web browser and navigate to http://localhost:9320 to open the Abacuza dashboard
3. In the left pane, from the `Cluster` menu, click `Connections`, then in the `Cluster Connections` page, click the plus icon to create a new cluster connection

4. In the `Create New Cluster Connection` dialog, fill in the name and description fields, and for `Cluster type` choose `spark`. In the `Settings` text box, input the Spark settings in JSON format; to keep things simple, we just specify the base URL of the Spark Livy service. Click the `SAVE` button to save the changes

5. Your cluster connection to the running Spark instance should now be ready
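As a minimal illustration, assuming a local Livy server on its default REST port 8998, the Settings JSON might look like the following. The `baseUrl` property name is an assumption in this sketch; check the Abacuza documentation for the exact settings schema:

```json
{
  "baseUrl": "http://localhost:8998"
}
```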
### Create a Job Runner

Follow the steps below to create a job runner in Abacuza.
1. From the `Jobs` menu, click `Job Runners`, then in the `Job Runners` page, click the plus icon to create a new job runner

2. In the `Create Job Runner` dialog, fill in the name and description for the job runner, and for the `Cluster type`, choose `Spark`

3. Click the `SAVE` button; Abacuza will redirect you to the `Job Runner Details` page

4. In the `Job Runner Details` page, under the `Binaries` section, add the following two files to the job runner:
   - `microsoft-spark-3-0_2.12-1.0.0.jar` - you can find it in your `published` folder
   - `WordCountApp20210313.zip` - this is the zip file you created in step 6 of Develop the Word Count Application
5. Under the `Payload template` section, use the following JSON document:

   ```json
   {
     "file": "${jr:binaries:microsoft-spark-3-0_2.12-1.0.0.jar}",
     "className": "org.apache.spark.deploy.dotnet.DotnetRunner",
     "args": [
       "${jr:binaries:WordCountApp20210313.zip}",
       "WordCountApp",
       "${proj:input-endpoint}",
       "${proj:input-endpoint-settings}",
       "${proj:output-endpoint}",
       "${proj:output-endpoint-settings}",
       "${proj:context}"
     ]
   }
   ```

   Note that the `${jr:binaries:...}` placeholder refers to the binary files that you've uploaded to the current job runner

6. Save the job runner
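Conceptually, Abacuza resolves each `${...}` placeholder before submitting the payload to the cluster: `jr:binaries:<name>` resolves to an uploaded binary, and `proj:<key>` to project-level values. The following Python sketch illustrates this style of template expansion; the function and the mapping are hypothetical, for illustration only, and not part of the Abacuza API:

```python
import re

def render_payload(template: str, values: dict) -> str:
    """Replace ${scope:key} placeholders with entries from `values`.

    `values` maps the full placeholder body (e.g. 'proj:context' or
    'jr:binaries:app.zip') to its resolved string. Placeholders with
    no entry in `values` are left untouched."""
    def _sub(m):
        return values.get(m.group(1), m.group(0))
    return re.sub(r"\$\{([^}]+)\}", _sub, template)
```

For example, `render_payload("${proj:context}", {"proj:context": "ctx"})` returns `"ctx"`.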
### Create a Project and Run the Job

1. From the `Projects` menu, click `Projects`

2. In the `Projects` page, click the plus icon to add a new project
3. In the `Create New Project` dialog, fill in the name and description of the project. For `Input Endpoint`, choose `Text Files`; for `Output Endpoint`, choose `Console`, which means that we want the output of the data processing to be shown in the console log. For the `Job Runner`, choose the one we created in the previous steps

4. Save the project; the `Project Details` page will be shown
5. Let's prepare some data: follow the instructions on Microsoft's official site to create a `demo.txt` file

6. On the `Project Details` page, under the `INPUT` section, add `demo.txt` as the project input
7. Click the `SUBMIT` button; the data processing job will be submitted to one of the clusters whose type is `spark`, and on that cluster the customized application we developed above will be executed for data processing. You can monitor the status of the execution from the `REVISIONS` tab of the `Project Details` page

8. Once the job has completed successfully, you can click the `log` icon to see the logs. In this example, you can see the word-count output in the log
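The log contains the word-count table produced by the pipeline above: each line is split on spaces, the words are exploded into rows, grouped, counted, and sorted by count in descending order. The same computation can be sketched in plain Python (for illustration only; Abacuza actually runs it on Spark):

```python
from collections import Counter

def word_count(lines):
    """Mimic the Spark pipeline: split each line on spaces, count
    every word, and return (word, count) pairs sorted by count
    in descending order. Empty tokens are skipped."""
    counts = Counter(
        word for line in lines for word in line.split(" ") if word
    )
    return sorted(counts.items(), key=lambda kv: -kv[1])
```

For example, `word_count(["the quick brown fox", "the lazy dog the"])` ranks `("the", 3)` first.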
For more information about the architecture, the design concepts and the developer's manual, please refer to the Abacuza Documentation.