
Assessing CPU vs FPGA Performance Using Spark

We benchmark the performance of a simple query using the NYC Taxi Dataset.

[Screenshot: performance web app]

Downloading and Normalizing Data

We use a subset of the Yellow Trip Data consisting of files with 18 columns. We further normalize the data in these files to conform to the schema published on the NYC Taxi Dataset site for Yellow Taxi trips.

To download the data locally:

  1. Clone this repo
  2. Clone nyc-taxi-data git repo
  3. Edit download_raw_data.sh in the root of the nyc-taxi-data repo, replacing the default setup_files/raw_data_urls.txt with <this_repo>/dataset/yellow_taxi_files.csv
  4. Run ./download_raw_data.sh to download the dataset.
  5. Normalize the downloaded dataset by working through Notebooks/Standardize Schema.ipynb in this repo; the clean_schema function performs the normalization (see the sketch below).

Our dataset is 100 GB in size split into 82 files.
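
For illustration only, here is a minimal Scala/Spark sketch of the normalization idea. The actual implementation is the clean_schema function in Notebooks/Standardize Schema.ipynb; the paths and the specific column renames below are hypothetical examples, not the repo's real mapping.

import org.apache.spark.sql.SparkSession

// Sketch only: the real normalization is done by clean_schema in
// Notebooks/Standardize Schema.ipynb. Paths and renames are hypothetical.
val spark = SparkSession.builder
  .appName("standardize-schema-sketch")
  .master("local[*]")
  .getOrCreate()

val raw = spark.read
  .option("header", "true")
  .csv("/path/to/nyc-taxi-data/data/yellow_tripdata_*.csv")

// withColumnRenamed is a no-op when the source column is absent, so the same
// renames can be applied across files whose headers differ slightly.
val cleaned = raw
  .withColumnRenamed("Payment_Type", "payment_type")
  .withColumnRenamed("Trip_Distance", "trip_distance")
  .withColumnRenamed("Passenger_Count", "passenger_count")

cleaned.write
  .option("header", "true")
  .mode("overwrite")
  .csv("/path/to/taxi_data_cleaned_18_standard")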

Collecting Performance Data

We use local Spark configured with local[*] (default) and structured streaming to measure and aggregate performance on our dataset. Each streaming batch consists of a single CSV file. The profiled query is:

select payment_type, count(*) as total from nyctaxidata group by payment_type

As configured above, all local CPU cores are utilized for the query.
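
To make the setup concrete, below is a minimal sketch of such a streaming query. This is not the repo's benchmark_taxi.scala; the paths, the schema inference, and the console sink are illustrative assumptions.

import org.apache.spark.sql.SparkSession

// Minimal sketch of the streaming setup described above; not the repo's
// benchmark_taxi.scala. Paths and option values are illustrative.
val spark = SparkSession.builder
  .appName("taxi-q1-sketch")
  .master("local[*]")                      // use all local CPU cores
  .getOrCreate()

val rootPath = "/path/to/taxi_data_cleaned_18_standard"  // normalized dataset

// Streaming CSV sources need an explicit schema; infer it once from a static read.
val taxiSchema = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv(rootPath)
  .schema

val trips = spark.readStream
  .schema(taxiSchema)
  .option("header", "true")
  .option("maxFilesPerTrigger", "1")       // one CSV file per streaming batch
  .csv(rootPath)

trips.createOrReplaceTempView("nyctaxidata")

val totals = spark.sql(
  "select payment_type, count(*) as total from nyctaxidata group by payment_type")

totals.writeStream
  .outputMode("complete")                  // re-emit the full aggregate each batch
  .format("console")
  .option("checkpointLocation", "/path/to/checkpoint")
  .start()
  .awaitTermination()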

CPU

  1. Install Apache Spark.
  2. In <repo_root>/benchmarking/queries/benchmark_taxi.scala, modify the following values as appropriate:
val rootPath = s"~/data/taxi_data_cleaned_18_standard" //root of the dataset
val magentaOutDir = s"~/data/queries_e8/$queryName/processed/results" // query results
val checkpointLoc = s"~/data/queries_e8/$queryName/checkpoint" // checkpoint files
val logDir = s"~/data/queries_e8/$queryName/monitor/results" // profiling results
  3. Launch spark-shell with enough memory to stream the data:

From the root of this repo:

$ cd benchmarking/queries
$ spark-shell --driver-memory 50G
  4. Load the relevant file and launch Spark processing:
scala> :load benchmark_taxi.scala
scala> Benchmark.main(1)

Benchmarking results are written to directories whose names are prefixed with the logDir value, so in the example above these will be:

~/data/queries_e8/q1/monitor/results_0
~/data/queries_e8/q1/monitor/results_1
~/data/queries_e8/q1/monitor/results_2
etc
  5. Collect the results:
$ cat ~/data/queries_e8/q1/monitor/results_*/*.csv > taxi_q1_profile.csv

FPGA

Provisioning an NP-10 FPGA-enabled machine in Azure and following the steps above should yield the FPGA benchmarks. For the demo, we used a custom one-off implementation of this query to assess FPGA performance.

Web UI

See this README for instructions on how to visualize the performance data in the Web UI. The finished app is shown here.