G10: Enabling An Efficient Unified GPU Memory and Storage Architecture with Smart Tensor Migrations (Artifact)

In this artifact, we provide the source code of G10 and the necessary instructions to reproduce the key performance results in our paper.

0. Hardware and Software Dependencies

The artifact can be executed on any x86 machine with at least 30 GB of main memory and at least 120 GB of disk space. We strongly recommend running the artifact on a multi-core workstation with at least 128 GB of memory. The artifact needs a Linux environment (preferably Ubuntu) and a compiler that supports the C++14 standard.
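A quick pre-flight check can confirm these minimums before building. The sketch below is not part of the artifact: the function names are hypothetical, the thresholds are the 30 GB and 120 GB figures stated above, and it treats GB and GiB as interchangeable for simplicity.

```python
import os
import shutil

GIB = 1024 ** 3
MIN_MEM_GB = 30    # minimum main memory stated above
MIN_DISK_GB = 120  # minimum disk space stated above


def meets_minimum(mem_gb, disk_gb):
    """Return True if both values reach the stated minimums."""
    return mem_gb >= MIN_MEM_GB and disk_gb >= MIN_DISK_GB


def detect(path="."):
    """Return (physical memory, free disk space at `path`) in GiB on Linux."""
    mem = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") / GIB
    disk = shutil.disk_usage(path).free / GIB
    return mem, disk
```

Calling `meets_minimum(*detect())` from the directory where the repository will be cloned gives a yes/no answer for the current machine.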

1. Installation

1.1 Downloading the Repository

Use the following command to download the artifact:

git clone https://github.com/platformxlab/G10.git

1.2 Installation

Install the following dependencies:

sudo apt install flex bison tmux python3-pip
pip3 install matplotlib networkx pandas PyPDF2

Build G10 (the output executable is named gpg):

cd G10/src
make clean
make

2. Experiment Workflow

This section describes the steps to generate and run the necessary experiments. We strongly recommend reading src/resources/README.md to learn more about each script used in this section.

2.1 Generating Configuration Files

The first step is to generate the appropriate config files. We provide the Python script resources/genconfigs.py, which generates all the config files used in this artifact (in the src/configs/ directory).

python3 resources/genconfigs.py

2.2 Launching A Single Experiment

Every configuration file specifies the DNN model and the batch size to be used, as well as other system configuration parameters (such as GPU memory size, SSD bandwidth, the baseline type, and so on). The DNN model graph information and execution traces are already included for the configs generated by the resources/genconfigs.py script.

To run a single experiment, pass its config file to gpg (the example below runs G10 on BERT with a batch size of 256):

./gpg "$relative_path_to_config_file"
    # e.g.,  ./gpg configs/BERT/256-sim_prefetch_lru.config

The program runs the Tensor Vitality Analysis and Smart Tensor Migration algorithms, then simulates the performance of the DNN training. The results are written to the "$G10_HOME"/results directory.

For each experiment, our program will generate separate logs for analyzed DNN graph information, tensor vitality analysis results, smart tensor migration scheduling, and performance simulation results. See results/README.md for more details of our program's output.
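When scripting single runs, the config path can be assembled from the model name and batch size. The sketch below is a hypothetical helper, not part of the artifact. It assumes the `<model>/<batch size>-<variant>.config` layout seen in the example above and that the command is launched from G10/src.

```python
from pathlib import PurePosixPath


def config_path(model, batch_size, variant="sim_prefetch_lru"):
    """Build a path like configs/BERT/256-sim_prefetch_lru.config."""
    return PurePosixPath("configs") / model / f"{batch_size}-{variant}.config"


def gpg_command(model, batch_size, variant="sim_prefetch_lru"):
    """Return the argument list for one simulation run."""
    return ["./gpg", str(config_path(model, batch_size, variant))]


print(gpg_command("BERT", 256))
# prints ['./gpg', 'configs/BERT/256-sim_prefetch_lru.config']
```

The returned list can be passed to `subprocess.run(..., cwd="G10/src", check=True)` to launch the experiment.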

2.3 Launching Batched Experiments

To run a large number of experiments at once, we provide the shell script resources/run.sh. It uses a regular expression to match multiple config files and automatically launches the matching experiments in separate tmux windows for parallel execution.
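Before launching a long batch, it can help to preview which config files a pattern selects. The sketch below uses Python's `re` module on the Figure 17 pattern from this section. The candidate paths are hypothetical, and we assume run.sh accepts a path when the pattern matches anywhere in it. The actual matching rules and regex dialect of run.sh may differ, so check src/resources/README.md.

```python
import re

# Pattern copied from the Figure 17 command in this section.
PATTERN = (
    r"(VIT\/1024|Inceptionv3\/1280)"
    r"-sim_(deepUM|prefetch_lru|FlashNeuron)"
    r"-cpu(0|16|32|64|256)\.config"
)

# Hypothetical candidate paths, relative to the configs directory.
candidates = [
    "VIT/1024-sim_deepUM-cpu16.config",
    "VIT/1024-sim_lru-cpu16.config",
    "Inceptionv3/1280-sim_prefetch_lru-cpu256.config",
    "BERT/256-sim_prefetch_lru.config",
]


def preview(pattern, paths):
    """Return the paths that contain a match for the pattern."""
    rx = re.compile(pattern)
    return [p for p in paths if rx.search(p)]


matched = preview(PATTERN, candidates)
print(matched)
```

Here only the first and third candidates are selected: the second uses the `lru` baseline, which this pattern does not list, and the fourth has no `-cpu` suffix.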

For convenience, we also provide the shell script artifact_run.sh, which runs all the experiments and is introduced in the next section. To run only the experiments for individual figures in the paper, see lines 23-45 of artifact_run.sh:

# First run experiments for figure 11-14
./run.sh -p "(BERT\/256|VIT\/1280|Inceptionv3\/1536|ResNet152\/1280|SENet154\/1024)-sim_(deepUM|prefetch_lru|FlashNeuron|G10GDSSSD|G10GDSFULL|lru)\.config" -dr -j $MAX_PROCESS_NUM
# The time for running this is about 104m33.975s (for MAX_PROCESS_NUM=6)

# Then run experiments for figure 15
./run.sh -p "(BERT\/(128|256|512|768|1024)|VIT\/(256|512|768|1024|1280)|Inceptionv3\/(512|768|1024|1280|1536|1792)|ResNet152\/(256|512|768|1024|1280)|SENet154\/(256|512|768|1024))-sim_(deepUM|prefetch_lru|FlashNeuron|lru)\.config" -dr -j $MAX_PROCESS_NUM
# The time for running this is about 155m11.104s (for MAX_PROCESS_NUM=6)

# Then run experiments for figure 16
./run.sh -p "(BERT\/(256|384|512|640)|VIT\/(768|1024|1280|1536)|Inceptionv3\/(512|1024|1280|1536)|ResNet152\/(768|1024|1280|1536)|SENet154\/(256|512|768|1024))-sim_prefetch_lru(-cpu(0|16|32|64|96|192|256))?\.config" -dr -j $MAX_PROCESS_NUM
# The time for running this is about 406m30.954s (for MAX_PROCESS_NUM=6)

# Then run experiments for figure 17
./run.sh -p "(VIT\/1024|Inceptionv3\/1280)-sim_(deepUM|prefetch_lru|FlashNeuron)-cpu(0|16|32|64|256)\.config" -dr -j $MAX_PROCESS_NUM
# The time for running this is about 24m8.144s (for MAX_PROCESS_NUM=6)

# Then run experiments for figure 18
./run.sh -p "(BERT\/512|VIT\/1280|Inceptionv3\/1536|ResNet152\/1280|SENet154\/1024)-sim_(deepUM|prefetch_lru|FlashNeuron|lru)-ssd(6_4|12_8|19_2|25_6|32)-.*\.config" -dr -j $MAX_PROCESS_NUM
# The time for running this is about 354m40.747s (for MAX_PROCESS_NUM=6)

# Then run experiments for figure 19
./run.sh -p "(BERT\/256|VIT\/1280|Inceptionv3\/1536|ResNet152\/1280|SENet154\/1024)-sim_prefetch_lru-var0_(05|10|15|20|25)\.config" -dr -j $MAX_PROCESS_NUM
# The time for running this is about 124m17.909s (for MAX_PROCESS_NUM=6)
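To budget time for the full set of experiments, the reported run times can be summed. The sketch below only adds up the figures quoted in the comments above (all measured with MAX_PROCESS_NUM=6). Actual times depend on the machine and on the chosen degree of parallelism.

```python
import re

# Wall-clock times reported in the comments above (MAX_PROCESS_NUM=6).
REPORTED = {
    "Figures 11-14": "104m33.975s",
    "Figure 15": "155m11.104s",
    "Figure 16": "406m30.954s",
    "Figure 17": "24m8.144s",
    "Figure 18": "354m40.747s",
    "Figure 19": "124m17.909s",
}


def to_seconds(text):
    """Convert a `time`-style string such as '24m8.144s' to seconds."""
    minutes, seconds = re.fullmatch(r"(\d+)m([\d.]+)s", text).groups()
    return int(minutes) * 60 + float(seconds)


total_seconds = sum(to_seconds(t) for t in REPORTED.values())
print(f"about {total_seconds / 3600:.1f} hours in total")
# prints: about 19.5 hours in total
```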

The variable MAX_PROCESS_NUM sets the maximum number of experiments that the script runs in parallel. Users may have to change MAX_PROCESS_NUM based on their machine's main memory capacity (each experiment requires a peak memory of about 28.5 GB). See lines 1-6 of artifact_run.sh:

# --------------------------------- IMPORTANT -------------------------------------------------
# ! Please modify this number based on your machine's main memory capacity. One experiment process will need a peak memory of 28.5 GB.
# We recommend reserving 30 GB for each process to ensure that the program won't crash.
# For example, if your machine has 128 GB of main memory, this number can be set as 4.
MAX_PROCESS_NUM=4
# ---------------------------------------------------------------------------------------------
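The rule of thumb in the comment above amounts to an integer division of main memory by the 30 GB reservation. A minimal sketch of it follows. The function name is hypothetical, and the result is clamped to 1 so that a machine at the 30 GB minimum still runs one experiment at a time.

```python
PER_PROCESS_GB = 30  # recommended reservation per experiment process


def max_process_num(total_mem_gb, per_process_gb=PER_PROCESS_GB):
    """Largest number of parallel experiments that fits in main memory."""
    return max(1, int(total_mem_gb // per_process_gb))


for mem in (64, 128, 256):
    print(mem, "GB ->", max_process_num(mem))
# prints:
# 64 GB -> 2
# 128 GB -> 4
# 256 GB -> 8
```

By the same rule, the MAX_PROCESS_NUM=6 setting used for the reported run times corresponds to a machine with at least 180 GB of main memory.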

3. Evaluation and Expected Results

After setting MAX_PROCESS_NUM, evaluate the artifact by running:

./artifact_run.sh

This script runs all the experiments, gathers the data, and draws the figures, in that order. It also contains a detailed description of each command and the location of each output figure.

To run individual data gathering and figure drawing scripts, see lines 49-124 of artifact_run.sh:

#-------------------------------- Gathering Data -----------------------------------------------------------------------------------=

# Collect all the numbers, store it in raw_output/data.json
python3 gatherKernelInfo.py

# Gather data for figure 11
python3 figureDrawingDataPrepOverallPerformance.py  # The gathered data is stored in figure_drawing/overall_performance

# Gather data for figure 12
python3 figureDrawingDataPrepBreakdown.py  # The gathered data is stored in figure_drawing/overall_breakdown

# Gather data for figure 13
./figureDrawingDataPrepKernelCDF.sh  # The gathered data is stored in figure_drawing/overall_slowdown_cdf

# Gather data for figure 14
python3 figureDrawingDataPrepTraffic.py  # The gathered data is stored in figure_drawing/overall_traffic

# Gather data for figure 15
python3 figureDrawingDataPrep.py  # The gathered data is stored in figure_drawing/overall_batchsize

# Gather data for figure 16
python3 figureDrawingDataPrepCPUsensitivity.py  # The gathered data is stored in figure_drawing/sensitivity_cpumem

# Gather data for figure 17
python3 figureDrawingDataPrepCPUSensitivityCombined.py  # The gathered data is stored in figure_drawing/sensitivity_cpumem_combined

# Gather data for figure 18
python3 figureDrawingDataPrepSSD.py  # The gathered data is stored in figure_drawing/sensitivity_ssdbw

# Gather data for figure 19
python3 figureDrawingDataPrepVariation.py  # The gathered data is stored in figure_drawing/sensitivity_variation



#-------------------------------- Drawing Figures -----------------------------------------------------------------------------------

cd figure_drawing

# Plot figures for Figure 2-4, and Figure 20-21 (Appendix)

python3 plot_mem_consumption.py  # Figure 2 is output/dnn_memconsumption.pdf

python3 plot_tensor_time_cdf.py  # Figure 3 is output/tensor_time_cdf.pdf

python3 plot_tensor_period_distribution.py  # Figure 4 is output/tensor_periods_distribution.pdf

python3 plot_detail_mem_breakdown_live.py  # Figure 20 is output/dnn_mem_consumption_breakdown_live.pdf

python3 plot_detail_mem_breakdown_active.py  # Figure 21 is output/dnn_mem_consumption_breakdown_active.pdf

# Draw Figure 11
python3 overallPerf.py  # Figure 11 is output/OverallPerfNew.pdf

# Draw Figure 12
python3 overallBreakdown.py  # Figure 12 is output/Breakdown.pdf

# Draw Figure 13
python3 overallSlowdownCDF.py  # Figure 13 is output/KernelTimeCDF.pdf

# Draw Figure 14
python3 overallTraffic.py  # Figure 14 is output/OverallTraffic.pdf

# Draw Figure 15
python3 overallBatchSize.py  # Figure 15 is output/OverallPerfBatchSize.pdf

# Draw Figure 16
python3 sensitivityCPUMem.py  # Figure 16 is output/OverallPerfCPUMem.pdf

# Draw Figure 17
python3 sensitivityCPUMemCombined.py  # Figure 17 is output/OverallPerfCPUMemCombined.pdf

# Draw Figure 18 
python3 sensitivitySSDbw.py  # Figure 18 is output/OverallPerfSSDBW.pdf 

# Draw Figure 19
python3 SensitivityKernelVariation.py # Figure 19 is output/SensitivityVariation.pdf

We have provided the expected result files in the directory example_results. To verify the results, one can compare the generated figures directly with those in the paper, or compare the data for each figure with the example results we provided.
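For the data comparison, a token-by-token check with a numeric tolerance is one option. The sketch below is a hypothetical helper, not part of the artifact. It assumes the data files are whitespace-separated text, and the 1% relative tolerance is an arbitrary choice. Adapt it to the actual file formats in example_results.

```python
import math


def _as_float(token):
    """Return the token as a float, or None if it is not numeric."""
    try:
        return float(token)
    except ValueError:
        return None


def texts_match(generated, expected, rel_tol=0.01):
    """Compare two text blobs token by token, allowing numeric tolerance."""
    gen_tokens, exp_tokens = generated.split(), expected.split()
    if len(gen_tokens) != len(exp_tokens):
        return False
    for g, e in zip(gen_tokens, exp_tokens):
        gf, ef = _as_float(g), _as_float(e)
        if gf is not None and ef is not None:
            # Numeric tokens may differ slightly between runs.
            if not math.isclose(gf, ef, rel_tol=rel_tol, abs_tol=1e-12):
                return False
        elif g != e:
            # Non-numeric tokens (labels, headers) must match exactly.
            return False
    return True
```

To use it, read a generated data file and its counterpart under example_results with `pathlib.Path.read_text()` and pass both strings to `texts_match`.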

4. Experiment Customization

4.1 Changing Simulation Configurations

In addition to the provided configurations, users can write their own config files to evaluate other settings. The simplest way to do this is to modify the resources/genconfigs.py script. Note that we provide only the DNN training execution traces used in our paper.

4.2 Custom DNN Training Profiling

Users can also generate their own traces of DNN training on their own GPUs, including traces for customized batch sizes. Custom profiling is done by modifying the config files named "profile" rather than "sim" and running them with the G10 executable (gpg). This requires a correctly installed CUDA Toolkit (11.0 or higher) with the cuDNN and cuBLAS libraries. Before profiling, make sure that the CUDA code generation part of our framework is built:

cd G10/src/cudnn
make clean && make

5. Licence

This project is licenced under the terms of the Apache 2.0 licence.
