
Assessing CPU vs FPGA Performance Using Spark

We benchmark the performance of a simple query using the NYC Taxi Dataset.

[Screenshot: performance web app]

Downloading and Normalizing Data

We use a subset of the Yellow Trip Data consisting of files with 18 columns. We further normalize the data in these files to conform to the schema published on the NYC Taxi Dataset site for Yellow Taxi trips.

To download the data locally:

  1. Clone this repo
  2. Clone nyc-taxi-data git repo
  3. Edit download_raw_data.sh in the root of the nyc-taxi-data repo, replacing the default setup_files/raw_data_urls.txt with <this_repo>/dataset/yellow_taxi_files.csv
  4. Run ./download_raw_data.sh to download the dataset.
  5. Normalize the downloaded dataset by working through Notebooks/Standardize Schema.ipynb in this repo; the clean_schema function performs the normalization (see the sketch below).

Our dataset is 100 GB in size split into 82 files.
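
For illustration only, here is a minimal Scala/Spark sketch of the normalization idea. The actual implementation is the clean_schema function in Notebooks/Standardize Schema.ipynb; the paths and the specific column renames below are hypothetical examples, not the repo's real mapping.

import org.apache.spark.sql.SparkSession

// Sketch only: the real normalization is done by clean_schema in
// Notebooks/Standardize Schema.ipynb. Paths and renames are hypothetical.
val spark = SparkSession.builder
  .appName("standardize-schema-sketch")
  .master("local[*]")
  .getOrCreate()

val raw = spark.read
  .option("header", "true")
  .csv("/path/to/nyc-taxi-data/data/yellow_tripdata_*.csv")

// withColumnRenamed is a no-op when the source column is absent, so the same
// renames can be applied across files whose headers differ slightly.
val cleaned = raw
  .withColumnRenamed("Payment_Type", "payment_type")
  .withColumnRenamed("Trip_Distance", "trip_distance")
  .withColumnRenamed("Passenger_Count", "passenger_count")

cleaned.write
  .option("header", "true")
  .mode("overwrite")
  .csv("/path/to/taxi_data_cleaned_18_standard")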

Collecting Performance Data

We use local Spark configured with local[*] (default) and structured streaming to measure and aggregate performance on our dataset. Each streaming batch consists of a single CSV file. The profiled query is:

select payment_type, count(*) as total from nyctaxidata group by payment_type

As configured above, all local CPU cores are utilized for the query.
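
To make the setup concrete, below is a minimal sketch of such a streaming query. This is not the repo's benchmark_taxi.scala; the paths, the schema inference, and the console sink are illustrative assumptions.

import org.apache.spark.sql.SparkSession

// Minimal sketch of the streaming setup described above; not the repo's
// benchmark_taxi.scala. Paths and option values are illustrative.
val spark = SparkSession.builder
  .appName("taxi-q1-sketch")
  .master("local[*]")                      // use all local CPU cores
  .getOrCreate()

val rootPath = "/path/to/taxi_data_cleaned_18_standard"  // normalized dataset

// Streaming CSV sources need an explicit schema; infer it once from a static read.
val taxiSchema = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv(rootPath)
  .schema

val trips = spark.readStream
  .schema(taxiSchema)
  .option("header", "true")
  .option("maxFilesPerTrigger", "1")       // one CSV file per streaming batch
  .csv(rootPath)

trips.createOrReplaceTempView("nyctaxidata")

val totals = spark.sql(
  "select payment_type, count(*) as total from nyctaxidata group by payment_type")

totals.writeStream
  .outputMode("complete")                  // re-emit the full aggregate each batch
  .format("console")
  .option("checkpointLocation", "/path/to/checkpoint")
  .start()
  .awaitTermination()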

CPU

  1. Install Apache Spark.
  2. In <repo_root>/benchmarking/queries/benchmark_taxi.scala, modify the following values as appropriate:
val rootPath = s"~/data/taxi_data_cleaned_18_standard" //root of the dataset
val magentaOutDir = s"~/data/queries_e8/$queryName/processed/results" // query results
val checkpointLoc = s"~/data/queries_e8/$queryName/checkpoint" // checkpoint files
val logDir = s"~/data/queries_e8/$queryName/monitor/results" // profiling results
  3. Launch spark-shell with enough memory to stream the data:

From the root of this repo:

$ cd benchmarking/queries
$ spark-shell --driver-memory 50G
  4. Load the relevant file and launch Spark processing:
scala> :load benchmark_taxi.scala
scala> Benchmark.main(1)

Benchmarking results are written to directories whose names are prefixed with the logDir value, so in the example above these will be:

~/data/queries_e8/q1/monitor/results_0
~/data/queries_e8/q1/monitor/results_1
~/data/queries_e8/q1/monitor/results_2
etc
  5. Collect the results:
$ cat ~/data/queries_e8/q1/monitor/results_*/*.csv > taxi_q1_profile.csv

FPGA

Provisioning an NP-10 FPGA-enabled machine in Azure and following the steps above should yield the FPGA benchmarks. For the demo, we used a custom one-off implementation of this query to assess FPGA performance.

Web UI

See this README for instructions on how to visualize the performance data in the Web UI. The finished app is shown here.