# FPGA Development Flow and Optimizations

##### Sections
- [oneAPI with Intel® FPGAs](#oneAPI-with-Intel®-FPGAs)
- [Stage 1: Emulation](#Stage-1:-Emulation)
- [Stage 2: Optimization Report Generation](#Stage-2:-Optimization-Report-Generation)
- [Stage 3: Optimizing the code for FPGA](#Stage-3:-Optimizing-the-code-for-FPGA)
- [References to Learn More](#References-to-Learn-More)

***
## oneAPI with Intel® FPGAs

The development flow for Intel FPGAs with oneAPI includes multiple stages. These stages let you do the following without having to endure the lengthy compile to a full FPGA executable each time:
* Ensure the functionality of your code (you get the correct answers from your computation)
* Ensure the custom hardware built to implement your code has optimal performance

The detailed flow is shown in the diagram below.

In this lab, we will practice the first two stages of the flow: emulating your code to make sure it is functional, and generating an optimization report to see how well optimized the hardware image generated from your code is. (A subsequent lab will give you practice working with the optimization report.)

<img src="assets/fpga_flow.png">

***
## Stage 1: Emulation

The first stage of developing code for FPGAs with oneAPI is __emulation__. The purpose of emulation is to make sure that your code is __functional__, or in other words, that you __get the correct answers from your computations__.

The compile time for this stage will be very quick, usually seconds.

This quick compile time allows you to iterate through this stage many times, until your code is functionally correct.
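To make "functionally correct" concrete, the following is a minimal host-only sketch in standard C++ (no SYCL) of the kind of check that emulation lets you iterate on quickly: compute a reference result on the CPU and compare a kernel's output against it. The names `ReferenceVecAdd`, `ResultsMatch`, and `FillSummands` are illustrative assumptions and are not part of the lab sources.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-only reference: element-wise sum of two equally sized vectors.
std::vector<float> ReferenceVecAdd(const std::vector<float> &a,
                                   const std::vector<float> &b) {
  assert(a.size() == b.size());
  std::vector<float> out(a.size());
  for (std::size_t i = 0; i < a.size(); i++) {
    out[i] = a[i] + b[i];
  }
  return out;
}

// True only if both vectors have the same length and identical elements.
bool ResultsMatch(const std::vector<float> &got,
                  const std::vector<float> &expected) {
  if (got.size() != expected.size()) return false;
  for (std::size_t i = 0; i < got.size(); i++) {
    if (got[i] != expected[i]) return false;
  }
  return true;
}

// Fills a with 1..n and b with n..1, so every element-wise sum is n + 1.
void FillSummands(std::size_t n, std::vector<float> &a,
                  std::vector<float> &b) {
  a.resize(n);
  b.resize(n);
  for (std::size_t i = 0; i < n; i++) {
    a[i] = static_cast<float>(i + 1);
    b[i] = static_cast<float>(n - i);
  }
}
```

With `n = 16`, for example, `ReferenceVecAdd` returns a vector in which every element is 17, and `ResultsMatch` reports a failure as soon as a single element of the compared result differs.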

__Now, let's give it a try!__

The code below just adds two vectors. Please have a quick look to get an overview of what it is doing.

We will continue using this simple piece of code to learn about the development flow for FPGAs with oneAPI and how to apply optimizations.

__After you are finished examining the code, click ▶ to save the code to a file.__

In [1]:
%%writefile lab/loop.cpp
//==============================================================
// Copyright (c) Intel Corporation
//
// SPDX-License-Identifier: Apache-2.0
// =============================================================
#include <CL/sycl.hpp>
#include <iomanip>
#include <iostream>
#include <string>
#include <vector>

// dpc_common.hpp can be found in the dev-utilities include folder.
// e.g., $ONEAPI_ROOT/dev-utilities/latest/include/dpc_common.hpp
#include "dpc_common.hpp"

// Header locations and some DPC++ extensions changed between beta09 and beta10
// Temporarily modify the code sample to accept either version
#define BETA09 20200827
#if __SYCL_COMPILER_VERSION <= BETA09
  #include <CL/sycl/intel/fpga_extensions.hpp>
  namespace INTEL = sycl::intel;  // Namespace alias for backward compatibility
#else
  #include <CL/sycl/INTEL/fpga_extensions.hpp>
#endif

using namespace sycl;

// This function instantiates the vector add kernel, which contains
// a loop that adds up the two summand arrays and stores the result
// into sum. 
void VecAdd(const std::vector<float> &summands1,
            const std::vector<float> &summands2, std::vector<float> &sum,
            size_t array_size) {


#if defined(FPGA_EMULATOR)
  INTEL::fpga_emulator_selector device_selector;
#else
  INTEL::fpga_selector device_selector;
#endif

  try {
    queue q(device_selector, dpc_common::exception_handler,
            property::queue::enable_profiling{});

    buffer buffer_summands1(summands1);
    buffer buffer_summands2(summands2);
    // Use verbose SYCL 1.2 syntax for the output buffer.
    // (This will become unnecessary in a future compiler version.)
    buffer<float, 1> buffer_sum(sum.data(), array_size);

    event e = q.submit([&](handler &h) {
      auto acc_summands1 = buffer_summands1.get_access<access::mode::read>(h);
      auto acc_summands2 = buffer_summands2.get_access<access::mode::read>(h);
      auto acc_sum = buffer_sum.get_access<access::mode::discard_write>(h);

      h.single_task([=]() {
        for (size_t i = 0; i < array_size; i++) {
          acc_sum[i] = acc_summands1[i] + acc_summands2[i];
        }
      });
    });

    double start = e.get_profiling_info<info::event_profiling::command_start>();
    double end = e.get_profiling_info<info::event_profiling::command_end>();
    // convert from nanoseconds to ms
    double kernel_time = (double)(end - start) * 1e-6;

    std::cout << " kernel time : " << kernel_time << " ms\n";
    std::cout << "Throughput for kernel " << ": ";
    std::cout << std::fixed << std::setprecision(3)
              << ((double)array_size / kernel_time) / 1e6f << " GFlops\n";

  } catch (sycl::exception const &e) {
    // Catches exceptions in the host code
    std::cout << "Caught a SYCL host exception:\n" << e.what() << "\n";

    // Most likely the runtime couldn't find FPGA hardware!
    if (e.get_cl_code() == CL_DEVICE_NOT_FOUND) {
      std::cout << "If you are targeting an FPGA, please ensure that your "
                   "system has a correctly configured FPGA board.\n";
      std::cout << "If you are targeting the FPGA emulator, compile with "
                   "-DFPGA_EMULATOR.\n";
    }
    std::terminate();
  }
}

int main(int argc, char *argv[]) {
  size_t array_size = 1 << 26;

  if (argc > 1) {
    std::string option(argv[1]);
    if (option == "-h" || option == "--help") {
      std::cout << "Usage: \n<executable> <data size>\n\nFAILED\n";
      return 1;
    } else {
      array_size = std::stoi(option);
    }
  }

  std::vector<float> summands1(array_size);
  std::vector<float> summands2(array_size);

  std::vector<float> sum(array_size);

  // Initialize the two summand arrays (arrays to be added to each other) to
  // 1:N and N:1, so that each element-wise sum is N + 1
  for (size_t i = 0; i < array_size; i++) {
    summands1[i] = static_cast<float>(i + 1);
    summands2[i] = static_cast<float>(array_size - i);
  }

  std::cout << "Input Array Size:  " << array_size << "\n";

  // Instantiate VecAdd kernel that contains a loop that adds up the two summand arrays.
  VecAdd(summands1, summands2, sum, array_size);

  // Verify that every output element equals the sum of its two summands
  for (size_t i = 0; i < array_size; i++) {
    if (sum[i] != summands1[i] + summands2[i]) {
      std::cout << "FAILED: The results are incorrect\n";
      return 1;
    }
  }
  std::cout << "PASSED: The results are correct\n";
  return 0;
}

Writing lab/loop.cpp


__Now, you will compile the code to target the FPGA emulator.__


In [None]:
! echo "##" $(whoami) is compiling a simple FPGA code with a loop
! dpcpp -fintelfpga lab/loop.cpp -DFPGA_EMULATOR -o bin/loop.emu
! bin/loop.emu

***
## Stage 2: Optimization Report Generation

In this next section of the lab, you will compile the kernel using different command line options with Intel's oneAPI DPC++ compiler in order to create an optimization report. You will also use the Jupyter Lab interface to browse to and open the report file, as explained below.


__Let's compile the code and generate an optimization report.__

The commands you need to do this are shown below. (We are using the two-step method since there is a current issue showing the source code in the report with the one-step method.)

```
dpcpp -fintelfpga lab/loop.cpp -c -o bin/loop.o
dpcpp -fintelfpga bin/loop.o -fsycl-link -Xshardware -o bin/loop.a
```

__This compilation may take approximately 2 minutes.__

In [None]:
! rm -rf bin/loop.a
! echo "##" $(whoami) is working on generating the report for loop.cpp lab step 2
! dpcpp -fintelfpga lab/loop.cpp -c -o bin/loop.o
! dpcpp -fintelfpga bin/loop.o -fsycl-link -Xshardware -o bin/loop.a
! echo "Compilation done."

You may wonder if the two previous dpcpp calls can be combined into a single one. The answer is yes, and it would be:
```
dpcpp -fintelfpga lab/loop.cpp -fsycl-link -Xshardware -o bin/loop.a
```
However, you may then get a warning such as:
```
aoc: Warning: Cannot find dependency file "/home/u32284/tmp/loop.d" for source file "/home/u32284/tmp/loop-172fb0.spv". Source code will not be available in the HLD Reports. Ensure you ran dpcpp with the -fintelfpga flag.
```
This means that you won't see the source code in the resulting reports, so use the one-step method at your own risk.

In [None]:
! echo "##" $(whoami) is working on generating the report for loop.cpp lab step 2
! dpcpp -fintelfpga lab/loop.cpp -fsycl-link -Xshardware -o bin/loop.a
! echo "The compile is finished."

__When you see the "Compilation done." message from the two-step compile (or "The compile is finished." from the one-step compile), an optimization report file will have been generated for our example. You will see a warning if you compile more than once; it can be ignored.__

Now, let's examine the generated report file.

Within the Jupyter Lab environment, it’s easy to navigate to the generated optimization report through the file browser on the leftmost panel. Browse to the directory `bin/loop.prj/reports/`. (Double-click on directories to open them.)

__Double-click on report.html.__ The report will open up as another tab beside the notebook tab in Jupyter Lab, as shown below.

__You will need to click on "Trust HTML" for the report to display correctly.__

<img src="assets/FPGA-report.png">

You've now learned how to use the first two stages of the FPGA development flow with oneAPI! You will spend most of your development time in these two stages. In the next sections of the lab, you will learn different optimization techniques.

***
## Stage 3: Optimizing the code for FPGA

As you may have noticed in the report, the `Initiation Interval` (the number of clock cycles between the launch of successive loop iterations) is too high (`II=286`) and, as a result, the FPGA device would be running at 240 MHz, which is not ideal. To overcome these kinds of issues, in this exercise we will apply three example optimizations:

1. Give a hint to the compiler promising that the vectors `acc_sum`, `acc_summands1` and `acc_summands2` do not share elements (they are non-overlapping regions of memory) and therefore they do not result in loop-carried dependencies.
2. Unroll the loop so that more pipelines process elements of the arrays per clock cycle.
3. Replicate the kernel to better utilize the FPGA hardware resources and to enable more operations per second. For the sake of expediency, we will replicate the kernel while changing the unroll factor for each replica. Replication is implemented in modern C++ using templated functions and classes.
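To see why the first hint matters, consider what a compiler has to assume when it cannot tell whether the output overlaps an input. The following host-only sketch in standard C++ (no SYCL) runs the same addition loop twice: once on separate storage, and once with the output starting one element after the first input. The names `AddInto`, `AddDisjoint`, and `AddOverlapping` are illustrative assumptions and are not part of the lab code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Element-wise add through raw pointers. Nothing in this signature tells the
// compiler whether out overlaps a or b, so it must assume that it might.
void AddInto(const float *a, const float *b, float *out, std::size_t n) {
  for (std::size_t i = 0; i < n; i++) {
    out[i] = a[i] + b[i];
  }
}

// Case 1: separate storage. Every iteration is independent of the others.
std::vector<float> AddDisjoint() {
  std::vector<float> a{1, 1, 1, 1}, b{1, 1, 1, 1}, out(4);
  AddInto(a.data(), b.data(), out.data(), 4);
  return out;  // {2, 2, 2, 2}
}

// Case 2: out starts one element after a inside the same storage, so
// iteration i reads the value that iteration i - 1 has just written.
// This is a loop-carried dependency.
std::vector<float> AddOverlapping() {
  std::vector<float> data{1, 1, 1, 1, 1}, b{1, 1, 1, 1};
  AddInto(data.data(), b.data(), data.data() + 1, 4);
  return data;  // {1, 2, 3, 4, 5}
}
```

In the overlapping case each iteration has to wait for the previous one to finish, which is the situation a pipelining compiler must guard against unless it is promised that the regions never overlap. The restrict hint described in item 1 is that promise, so it is only safe to give when the data really is disjoint, as in the first case.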

In the following code, some changes are needed:

1. Inspect the code cell below and make the following modifications.
   1. STEP A: Add the `intel::kernel_args_restrict` C++ attribute to the `h.single_task` lambda (as in Slide 16 in 10-FPGA-Optimizations)
   2. STEP B: Add the unroll pragma to the loop inside `h.single_task`, using the template argument `unroll_factor` (as in Slide 13)
   3. STEP C: Add a fifth invocation of `VecAdd` in the `main()` function, now using `unroll_factor=16` and `sum_unrollx16`
2. When the modifications are complete, click run ▶ to save the code to a file.
3. Run ▶ the cell in the __Build and check the report of the modified code__ section below the code snippet to compile the code in the saved file and generate its report.



In [2]:
%%writefile lab/loop_unroll.cpp
//==============================================================
// Copyright (c) Intel Corporation
//
// SPDX-License-Identifier: Apache-2.0
// =============================================================
#include <CL/sycl.hpp>
#include <iomanip>
#include <iostream>
#include <string>
#include <vector>

// dpc_common.hpp can be found in the dev-utilities include folder.
// e.g., $ONEAPI_ROOT/dev-utilities//include/dpc_common.hpp
#include "dpc_common.hpp"

// Header locations and some DPC++ extensions changed between beta09 and beta10
// Temporarily modify the code sample to accept either version
#define BETA09 20200827
#if __SYCL_COMPILER_VERSION <= BETA09
  #include <CL/sycl/intel/fpga_extensions.hpp>
  namespace INTEL = sycl::intel;  // Namespace alias for backward compatibility
#else
  #include <CL/sycl/INTEL/fpga_extensions.hpp>
#endif

using namespace sycl;

template <int unroll_factor> class VAdd;

// This function instantiates the vector add kernel, which contains
// a loop that adds up the two summand arrays and stores the result
// into sum. This loop will be unrolled by the specified unroll_factor.
template <int unroll_factor>
void VecAdd(const std::vector<float> &summands1,
            const std::vector<float> &summands2, std::vector<float> &sum,
            size_t array_size) {


#if defined(FPGA_EMULATOR)
  INTEL::fpga_emulator_selector device_selector;
#else
  INTEL::fpga_selector device_selector;
#endif

  try {
    queue q(device_selector, dpc_common::exception_handler,
            property::queue::enable_profiling{});

    buffer buffer_summands1(summands1);
    buffer buffer_summands2(summands2);
    // Use verbose SYCL 1.2 syntax for the output buffer.
    // (This will become unnecessary in a future compiler version.)
    buffer<float, 1> buffer_sum(sum.data(), array_size);

    event e = q.submit([&](handler &h) {
      auto acc_summands1 = buffer_summands1.get_access<access::mode::read>(h);
      auto acc_summands2 = buffer_summands2.get_access<access::mode::read>(h);
      auto acc_sum = buffer_sum.get_access<access::mode::discard_write>(h);

// This template argument to single_task allows for correct name mangling for all the kernels
      h.single_task<VAdd<unroll_factor>>([=]()
//STEP A: Insert here the intel::kernel_args_restrict C++ attribute (as in Slide 16 in 10-FPGA-Optimizations)
                                         {
        // Unroll the loop fully or partially, depending on unroll_factor
//STEP B: Insert here the pragma unroll using the template argument unroll_factor (as in Slide 13)
        for (size_t i = 0; i < array_size; i++) {
          acc_sum[i] = acc_summands1[i] + acc_summands2[i];
        }
      });
    });

    double start = e.get_profiling_info<info::event_profiling::command_start>();
    double end = e.get_profiling_info<info::event_profiling::command_end>();
    // convert from nanoseconds to ms
    double kernel_time = (double)(end - start) * 1e-6;

    std::cout << "unroll_factor " << unroll_factor
              << " kernel time : " << kernel_time << " ms\n";
    std::cout << "Throughput for kernel with unroll_factor " << unroll_factor
              << ": ";
    std::cout << std::fixed << std::setprecision(3)
              << ((double)array_size / kernel_time) / 1e6f << " GFlops\n";

  } catch (sycl::exception const &e) {
    // Catches exceptions in the host code
    std::cout << "Caught a SYCL host exception:\n" << e.what() << "\n";

    // Most likely the runtime couldn't find FPGA hardware!
    if (e.get_cl_code() == CL_DEVICE_NOT_FOUND) {
      std::cout << "If you are targeting an FPGA, please ensure that your "
                   "system has a correctly configured FPGA board.\n";
      std::cout << "If you are targeting the FPGA emulator, compile with "
                   "-DFPGA_EMULATOR.\n";
    }
    std::terminate();
  }
}

int main(int argc, char *argv[]) {
  size_t array_size = 1 << 26;

  if (argc > 1) {
    std::string option(argv[1]);
    if (option == "-h" || option == "--help") {
      std::cout << "Usage: \n<executable> <data size>\n\nFAILED\n";
      return 1;
    } else {
      array_size = std::stoi(option);
    }
  }

  std::vector<float> summands1(array_size);
  std::vector<float> summands2(array_size);

  std::vector<float> sum_unrollx1(array_size);
  std::vector<float> sum_unrollx2(array_size);
  std::vector<float> sum_unrollx4(array_size);
  std::vector<float> sum_unrollx8(array_size);
  std::vector<float> sum_unrollx16(array_size);

  // Initialize the two summand arrays (arrays to be added to each other) to
  // 1:N and N:1, so that each element-wise sum is N + 1
  for (size_t i = 0; i < array_size; i++) {
    summands1[i] = static_cast<float>(i + 1);
    summands2[i] = static_cast<float>(array_size - i);
  }

  std::cout << "Input Array Size:  " << array_size << "\n";

  // Instantiate VecAdd kernel with different unroll factors: 1, 2, 4, 8, 16
  // The VecAdd kernel contains a loop that adds up the two summand arrays.
  // This loop will be unrolled by the specified unroll factor.
  // The sum array is expected to be identical, regardless of the unroll factor.
  VecAdd<1>(summands1, summands2, sum_unrollx1, array_size);
  VecAdd<2>(summands1, summands2, sum_unrollx2, array_size);
  VecAdd<4>(summands1, summands2, sum_unrollx4, array_size);
  VecAdd<8>(summands1, summands2, sum_unrollx8, array_size);

//STEP C: Add a fifth invocation to VecAdd now using unroll_factor=16 and sum_unrollx16

  // Verify that the output data is the same for every unroll factor
  for (size_t i = 0; i < array_size; i++) {
    if (sum_unrollx1[i] != summands1[i] + summands2[i] ||
        sum_unrollx1[i] != sum_unrollx2[i] ||
        sum_unrollx1[i] != sum_unrollx4[i] ||
        sum_unrollx1[i] != sum_unrollx8[i] ||
        sum_unrollx1[i] != sum_unrollx16[i]) {
      std::cout << "FAILED: The results are incorrect\n";
      return 1;
    }
  }
  std::cout << "PASSED: The results are correct\n";
  return 0;
}


Writing lab/loop_unroll.cpp


### Build and check the report of the modified code

Select the cell below and click Run ▶ to compile and generate the report for the code that you modified above:

In [None]:
! rm -rf lab/loop_unroll.a
! echo "##" $(whoami) is working on optimizing with loop unrolling
! dpcpp -fintelfpga lab/loop_unroll.cpp -c -o lab/loop_unroll.o
! dpcpp -fintelfpga lab/loop_unroll.o -fsycl-link -Xshardware -o lab/loop_unroll.a
! echo "The compile is finished."

Browse to the directory `lab/loop_unroll.prj/reports/` (Double-click on directories to open them.)
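When comparing the kernels in the report, it can help to picture what an unroll factor does to the loop. The sketch below is a software analogy in host-only standard C++ (no SYCL, no pragma): a function template whose compile-time parameter sets how many additions are handled per outer iteration, plus a tag class showing that each factor yields a distinct type, much as `VAdd<unroll_factor>` gives each kernel its own name. `UnrollTag` and `AddUnrolled` are illustrative assumptions, not part of the lab code, and on the FPGA the compiler builds parallel hardware rather than running this loop on a CPU.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>
#include <vector>

// Tag type: UnrollTag<1>, UnrollTag<2>, ... are all different types.
template <int UnrollFactor> class UnrollTag;

// Adds a and b into out, handling UnrollFactor elements per outer iteration.
template <int UnrollFactor>
void AddUnrolled(const std::vector<float> &a, const std::vector<float> &b,
                 std::vector<float> &out) {
  static_assert(UnrollFactor >= 1, "unroll factor must be positive");
  assert(a.size() == b.size() && a.size() == out.size());
  constexpr std::size_t kStep = static_cast<std::size_t>(UnrollFactor);
  const std::size_t n = a.size();
  std::size_t i = 0;
  // Main body: the inner loop has a compile-time trip count, so it can be
  // flattened into kStep independent additions.
  for (; i + kStep <= n; i += kStep) {
    for (std::size_t j = 0; j < kStep; j++) {
      out[i + j] = a[i + j] + b[i + j];
    }
  }
  // Remainder: elements left over when n is not a multiple of kStep.
  for (; i < n; i++) {
    out[i] = a[i] + b[i];
  }
}
```

Calling `AddUnrolled<1>`, `AddUnrolled<4>`, and `AddUnrolled<16>` on the same inputs produces identical results, which mirrors the check at the end of `main()` in the lab code: the unroll factor changes how the work is organized, not what is computed.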

### Solution (Don't peek unless you have to)

In [3]:
%%writefile solutions/loop_unroll.cpp
//==============================================================
// Copyright (c) Intel Corporation
//
// SPDX-License-Identifier: Apache-2.0
// =============================================================
#include <CL/sycl.hpp>
#include <iomanip>
#include <iostream>
#include <string>
#include <vector>

// dpc_common.hpp can be found in the dev-utilities include folder.
// e.g., $ONEAPI_ROOT/dev-utilities//include/dpc_common.hpp
#include "dpc_common.hpp"

// Header locations and some DPC++ extensions changed between beta09 and beta10
// Temporarily modify the code sample to accept either version
#define BETA09 20200827
#if __SYCL_COMPILER_VERSION <= BETA09
  #include <CL/sycl/intel/fpga_extensions.hpp>
  namespace INTEL = sycl::intel;  // Namespace alias for backward compatibility
#else
  #include <CL/sycl/INTEL/fpga_extensions.hpp>
#endif

using namespace sycl;

template <int unroll_factor> class VAdd;

// This function instantiates the vector add kernel, which contains
// a loop that adds up the two summand arrays and stores the result
// into sum. This loop will be unrolled by the specified unroll_factor.
template <int unroll_factor>
void VecAdd(const std::vector<float> &summands1,
            const std::vector<float> &summands2, std::vector<float> &sum,
            size_t array_size) {


#if defined(FPGA_EMULATOR)
  INTEL::fpga_emulator_selector device_selector;
#else
  INTEL::fpga_selector device_selector;
#endif

  try {
    queue q(device_selector, dpc_common::exception_handler,
            property::queue::enable_profiling{});

    buffer buffer_summands1(summands1);
    buffer buffer_summands2(summands2);
    // Use verbose SYCL 1.2 syntax for the output buffer.
    // (This will become unnecessary in a future compiler version.)
    buffer<float, 1> buffer_sum(sum.data(), array_size);

    event e = q.submit([&](handler &h) {
      auto acc_summands1 = buffer_summands1.get_access<access::mode::read>(h);
      auto acc_summands2 = buffer_summands2.get_access<access::mode::read>(h);
      auto acc_sum = buffer_sum.get_access<access::mode::discard_write>(h);

      h.single_task<VAdd<unroll_factor>>([=]()
                                         [[intel::kernel_args_restrict]] {
        // Unroll the loop fully or partially, depending on unroll_factor
        #pragma unroll unroll_factor
        for (size_t i = 0; i < array_size; i++) {
          acc_sum[i] = acc_summands1[i] + acc_summands2[i];
        }
      });
    });

    double start = e.get_profiling_info<info::event_profiling::command_start>();
    double end = e.get_profiling_info<info::event_profiling::command_end>();
    // convert from nanoseconds to ms
    double kernel_time = (double)(end - start) * 1e-6;

    std::cout << "unroll_factor " << unroll_factor
              << " kernel time : " << kernel_time << " ms\n";
    std::cout << "Throughput for kernel with unroll_factor " << unroll_factor
              << ": ";
    std::cout << std::fixed << std::setprecision(3)
              << ((double)array_size / kernel_time) / 1e6f << " GFlops\n";

  } catch (sycl::exception const &e) {
    // Catches exceptions in the host code
    std::cout << "Caught a SYCL host exception:\n" << e.what() << "\n";

    // Most likely the runtime couldn't find FPGA hardware!
    if (e.get_cl_code() == CL_DEVICE_NOT_FOUND) {
      std::cout << "If you are targeting an FPGA, please ensure that your "
                   "system has a correctly configured FPGA board.\n";
      std::cout << "If you are targeting the FPGA emulator, compile with "
                   "-DFPGA_EMULATOR.\n";
    }
    std::terminate();
  }
}

int main(int argc, char *argv[]) {
  size_t array_size = 1 << 26;

  if (argc > 1) {
    std::string option(argv[1]);
    if (option == "-h" || option == "--help") {
      std::cout << "Usage: \n<executable> <data size>\n\nFAILED\n";
      return 1;
    } else {
      array_size = std::stoi(option);
    }
  }

  std::vector<float> summands1(array_size);
  std::vector<float> summands2(array_size);

  std::vector<float> sum_unrollx1(array_size);
  std::vector<float> sum_unrollx2(array_size);
  std::vector<float> sum_unrollx4(array_size);
  std::vector<float> sum_unrollx8(array_size);
  std::vector<float> sum_unrollx16(array_size);

  // Initialize the two summand arrays (arrays to be added to each other) to
  // 1:N and N:1, so that each element-wise sum is N + 1
  for (size_t i = 0; i < array_size; i++) {
    summands1[i] = static_cast<float>(i + 1);
    summands2[i] = static_cast<float>(array_size - i);
  }

  std::cout << "Input Array Size:  " << array_size << "\n";

  // Instantiate VecAdd kernel with different unroll factors: 1, 2, 4, 8, 16
  // The VecAdd kernel contains a loop that adds up the two summand arrays.
  // This loop will be unrolled by the specified unroll factor.
  // The sum array is expected to be identical, regardless of the unroll factor.
  VecAdd<1>(summands1, summands2, sum_unrollx1, array_size);
  VecAdd<2>(summands1, summands2, sum_unrollx2, array_size);
  VecAdd<4>(summands1, summands2, sum_unrollx4, array_size);
  VecAdd<8>(summands1, summands2, sum_unrollx8, array_size);
  VecAdd<16>(summands1, summands2, sum_unrollx16, array_size);

  // Verify that the output data is the same for every unroll factor
  for (size_t i = 0; i < array_size; i++) {
    if (sum_unrollx1[i] != summands1[i] + summands2[i] ||
        sum_unrollx1[i] != sum_unrollx2[i] ||
        sum_unrollx1[i] != sum_unrollx4[i] ||
        sum_unrollx1[i] != sum_unrollx8[i] ||
        sum_unrollx1[i] != sum_unrollx16[i]) {
      std::cout << "FAILED: The results are incorrect\n";
      return 1;
    }
  }
  std::cout << "PASSED: The results are correct\n";
  return 0;
}


Writing solutions/loop_unroll.cpp


Generate the report:

In [None]:
! rm -rf solutions/loop_unroll.a
! echo "##" $(whoami) is working on optimizing with loop unrolling
! dpcpp -fintelfpga solutions/loop_unroll.cpp -c -o solutions/loop_unroll.o
! dpcpp -fintelfpga solutions/loop_unroll.o -fsycl-link -Xshardware -o solutions/loop_unroll.a
! echo "The compile is finished."

Browse to the directory `solutions/loop_unroll.prj/reports/` (Double-click on directories to open them.)

The report should now look like:

<img src="assets/FPGA-unroll-report.png">


Feel free to compile and run it as we did in the first hands-on session. On the Intel® Arria® 10 FPGA available in DevCloud, you can expect output similar to the following:
```
Input Array Size:  67108864
unroll_factor 1 kernel time : 242.849 ms
Throughput for kernel with unroll_factor 1: 0.276 GFlops
unroll_factor 2 kernel time : 122.319 ms
Throughput for kernel with unroll_factor 2: 0.549 GFlops
unroll_factor 4 kernel time : 63.950 ms
Throughput for kernel with unroll_factor 4: 1.049 GFlops
unroll_factor 8 kernel time : 39.567 ms
Throughput for kernel with unroll_factor 8: 1.696 GFlops
unroll_factor 16 kernel time : 37.500 ms
Throughput for kernel with unroll_factor 16: 1.790 GFlops
PASSED: The results are correct
```
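The throughput figures in this output follow directly from the array size and the kernel time. As a minimal sketch of that arithmetic, the helpers below reproduce the conversion used in the lab code (profiling counters in nanoseconds, one addition per element); the function names `NanosecondsToMilliseconds`, `ThroughputGFlops`, and `Speedup` are illustrative assumptions, not part of the lab sources.

```cpp
#include <cassert>
#include <cstddef>

// The profiling counters are in nanoseconds; the lab reports milliseconds.
double NanosecondsToMilliseconds(double ns) { return ns * 1e-6; }

// One addition per element. Elements per millisecond divided by 1e6 is the
// same as elements per second divided by 1e9, i.e. GFlops.
double ThroughputGFlops(std::size_t array_size, double kernel_time_ms) {
  assert(kernel_time_ms > 0.0);
  return (static_cast<double>(array_size) / kernel_time_ms) / 1e6;
}

// How many times faster the optimized kernel ran than the baseline.
double Speedup(double baseline_ms, double optimized_ms) {
  assert(optimized_ms > 0.0);
  return baseline_ms / optimized_ms;
}
```

For the run above, `ThroughputGFlops(67108864, 242.849)` is about 0.276 and `ThroughputGFlops(67108864, 37.500)` is about 1.790, matching the printed values. `Speedup(242.849, 37.500)` is roughly 6.5, well short of 16, and the step from unroll factor 8 to 16 gains very little, which suggests that something other than the loop itself (possibly memory bandwidth) limits the kernel at the higher factors.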


***
## References to Learn More

Please refer to the following resources to learn more. This is a great thing to do if you have extra time during the lab!

#### FPGA Specific Documentation

* [Website hub for using FPGAs with oneAPI](https://software.intel.com/content/www/us/en/develop/tools/oneapi/components/fpga.html)
* [Intel® oneAPI Programming Guide](https://software.intel.com/content/www/us/en/develop/download/intel-oneapi-programming-guide.html)
* [Intel® oneAPI DPC++ FPGA Optimization Guide](https://software.intel.com/content/www/us/en/develop/download/oneapi-fpga-optimization-guide.html)
* [oneAPI Samples on GitHub](https://github.com/oneapi-src/oneAPI-samples)

#### Intel® oneAPI Toolkit documentation
* [Intel® oneAPI main page](https://software.intel.com/oneapi "oneAPI main page")
* [Intel® DevCloud Signup](https://software.intel.com/en-us/devcloud/oneapi "Intel DevCloud")  Sign up here if you do not have an account.
* [Intel® DevCloud Connect](https://devcloud.intel.com/datacenter/connect)  Login to the DevCloud here.
* [oneAPI Specification elements](https://www.oneapi.com/spec/)
* [DPC++ reference](https://docs.oneapi.com/versions/latest/dpcpp/)

#### SYCL 
* [SYCL* Specification (for version 1.2.1)](https://www.khronos.org/registry/SYCL/specs/sycl-1.2.1.pdf)

#### Modern C++
* [CPPReference](https://en.cppreference.com/w/)
* [CPlusPlus](http://www.cplusplus.com/)