### Scaling applications in CUDA Quantum
Main reference: https://nvidia.github.io/cuda-quantum/latest/examples/python/tutorials/multi_gpu_workflows.html

Targets

    - A target is the backend that executes a kernel: either a quantum circuit simulator or quantum hardware.
    - Targets let you switch between QPUs, CPUs and GPUs without changing the kernel code.
    - The default target provides a state vector simulator based on the CPU-only, OpenMP-threaded Q++ library.


Available Targets

        qpp-cpu: The default multithreaded CPU backend.
        nvidia: GPU-based backend which accelerates quantum circuit simulation on NVIDIA GPUs.
        nvidia-mqpu: Lets users program workflows that use multiple quantum processors, emulated today by GPUs.
        nvidia-mgpu: Allows for scaling circuit simulation beyond what is feasible with any QPU today.
        density-matrix-gpu: Noisy simulations via density matrix calculations. A CPU version is also available.
        tensornet: GPU-accelerated tensor network (TN) backend.

<div style="display:flex;justify-content:center;">
    <img src="images/targets.png" alt="Image Title" width="600">
</div>


In [1]:
# Print all the available targets for your system
import cudaq

targets = cudaq.get_targets()

for target in targets:
    print(target)

Target nvidia-mgpu
	simulator=nvidia_mgpu
	platform=default
	description=

Target quantinuum
	simulator=qpp
	platform=default
	description=

Target photonics
	simulator=qpp
	platform=default
	description=

Target density-matrix-cpu
	simulator=dm
	platform=default
	description=The Density Matrix CPU Target provides a simulated QPU via OpenMP-enabled, CPU-only density matrix emulation.

Target iqm
	simulator=qpp
	platform=default
	description=

Target nvidia-mqpu-fp64
	simulator=custatevec_fp64
	platform=mqpu
	description=The NVIDIA MQPU FP64 Target provides a simulated QPU for every available CUDA GPU on the underlying system. Each QPU is simulated via cuStateVec FP64.

Target tensornet
	simulator=tensornet
	platform=default
	description=

Target orca
	simulator=qpp
	platform=default
	description=

Target qpp-cpu
	simulator=qpp
	platform=default
	description=QPP-based CPU-only backend target

Target remote-mqpu
	simulator=qpp
	platform=mqpu
	description=

Target nvidia-mqpu
	simulator=c

    Some ways to scale your application:

    1. Increasing the number of qubits (weak scaling)

            - mgpu backend

    2. Distributing the circuit execution (strong scaling)
            2.1 Asynchronous sampling
            2.2 Hamiltonian batching
            2.3 Parameter batching

            - mqpu backend
            - Each GPU acts as a virtual QPU

         As a rule of thumb, we can parallelize over any of the inputs to `cudaq.sample()` or `cudaq.observe()`: the kernel, the Hamiltonian, the kernel parameters, etc.
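
Parameter batching can be sketched in plain Python. This is a minimal illustration of the splitting step only, not part of the CUDA Quantum API: the helper `split_into_batches` and the example parameter values are hypothetical, and no quantum kernel is executed.

```python
def split_into_batches(parameter_sets, num_qpus):
    """Split a list of parameter sets into num_qpus contiguous, nearly equal batches."""
    if num_qpus < 1:
        raise ValueError("num_qpus must be at least 1")
    base, extra = divmod(len(parameter_sets), num_qpus)
    batches = []
    start = 0
    for qpu_id in range(num_qpus):
        # The first `extra` QPUs take one additional parameter set each.
        size = base + (1 if qpu_id < extra else 0)
        batches.append(parameter_sets[start:start + size])
        start += size
    return batches


# Ten hypothetical parameter sets, each holding two rotation angles.
parameter_sets = [[0.1 * i, 0.2 * i] for i in range(10)]
batches = split_into_batches(parameter_sets, num_qpus=4)

for qpu_id, batch in enumerate(batches):
    print(f"QPU {qpu_id}: {len(batch)} parameter sets")
```

On the mqpu backend, each batch would then be submitted to its own virtual QPU by passing `qpu_id` to `cudaq.sample_async()` or `cudaq.observe_async()`, in the same way as the asynchronous examples in this section.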

### Multiple NVIDIA GPUs for the mgpu backend

    - Each additional qubit doubles the size of the state vector, so memory requirements grow exponentially with the qubit count.

    - The nvidia-mgpu target allows for scaling the qubit count by pooling memory from GPUs across multiple nodes.
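
To make the exponential growth concrete, here is a rough back-of-the-envelope sketch. It rests on assumptions that do not come from the text above: single-precision complex amplitudes (8 bytes each), a hypothetical 80 GiB of usable memory per GPU, a power-of-two GPU count, and no allowance for simulator workspace. The helper names are illustrative only.

```python
def statevector_bytes(num_qubits, bytes_per_amplitude=8):
    """Memory for a full state vector: 2**n amplitudes.

    8 bytes per amplitude assumes single-precision complex numbers;
    pass 16 for double precision.
    """
    return (2 ** num_qubits) * bytes_per_amplitude


def min_gpus(num_qubits, gpu_memory_bytes, bytes_per_amplitude=8):
    """Smallest power-of-two GPU count whose pooled memory holds the state vector.

    Assumption: GPUs are added in powers of two, and workspace overhead is ignored.
    """
    needed = statevector_bytes(num_qubits, bytes_per_amplitude)
    gpus = 1
    while gpus * gpu_memory_bytes < needed:
        gpus *= 2
    return gpus


GIB = 2 ** 30
gpu_memory = 80 * GIB  # hypothetical per-GPU memory

for n in (30, 33, 34, 36):
    size_gib = statevector_bytes(n) / GIB
    print(f"{n} qubits: {size_gib:.0f} GiB, at least {min_gpus(n, gpu_memory)} GPU(s)")
```

Under these assumptions, going from 33 to 34 qubits already exceeds a single GPU, which is the situation the nvidia-mgpu target is meant for.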


### Asynchronous sampling via mqpu backend

In [21]:
# Run this snippet as a standalone Python script, launched with
# mpirun -np N python <filename>
# where N is the number of GPUs you have available

import cudaq

# set the target here
# alternatively this target could also be set at runtime
cudaq.set_target("nvidia-mqpu")
target = cudaq.get_target()
num_qpus = target.num_qpus()
print("Number of QPUs:", num_qpus)

# Construct a quantum circuit
kernel = cudaq.make_kernel()
qubits = kernel.qalloc(2)
kernel.h(qubits[0])
kernel.cx(qubits[0], qubits[1])
kernel.mz(qubits)

# Sample the circuit asynchronously, once on each virtual QPU
futures = []
for i in range(num_qpus):
    futures.append(cudaq.sample_async(kernel, qpu_id=i))

# You can do some other processing while you wait
# for your asynchronous results

# Extract the results
for future in futures:
    print(future.get())


Number of QPUs: 2
{ 00:493 11:507 }

{ 00:463 11:537 }



### Asynchronous expectation value computation

In [22]:
# Run this snippet as a standalone Python script, launched with
# mpirun -np N python <filename>
# where N is the number of GPUs you have available

import cudaq
from cudaq import spin

kernel = cudaq.make_kernel()
qubit = kernel.qalloc()
kernel.x(qubit)

# Measuring in the Z-basis.
hamiltonian = spin.z(0)

# Call `cudaq.observe_async()` with the specified number of shots.
future = cudaq.observe_async(kernel=kernel,
                             spin_operator=hamiltonian,
                             qpu_id=0,
                             shots_count=2000)

# `get()` blocks until the asynchronous result is ready.
observe_result = future.get()
got_expectation = observe_result.expectation()

                    Hamiltonian term distribution over multiple QPUs
<div style="display:flex;justify-content:center;">
    <img src="images/hsplit.png" alt="Image Title" width="500">
</div>
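
The idea behind distributing Hamiltonian terms relies on linearity: the expectation value of a sum of terms is the sum of the per-term expectation values, so each QPU can evaluate a subset of terms and the partial results can be added. The following is a purely classical sketch of that idea, not CUDA Quantum code. The Hamiltonian, the basis state, and all helper names are hypothetical, and Python threads stand in for virtual QPUs.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical Hamiltonian: (coefficient, Pauli string) pairs using only I and Z,
# so each term's expectation on a basis state is easy to compute classically.
hamiltonian = [(0.5, "ZI"), (-1.0, "IZ"), (0.25, "ZZ"), (2.0, "II")]
bitstring = "10"  # the computational basis state |10>


def term_expectation(pauli, bits):
    """Expectation of an I/Z Pauli string on a basis state: a product of +1 and -1."""
    value = 1.0
    for op, bit in zip(pauli, bits):
        if op == "Z" and bit == "1":
            value = -value
    return value


def partial_expectation(terms, bits):
    """Weighted sum of term expectations for one chunk of the Hamiltonian."""
    return sum(coeff * term_expectation(pauli, bits) for coeff, pauli in terms)


def distributed_expectation(terms, bits, num_qpus):
    """Deal terms round-robin to num_qpus workers and add up their partial sums."""
    chunks = [terms[i::num_qpus] for i in range(num_qpus)]
    with ThreadPoolExecutor(max_workers=num_qpus) as pool:
        partials = list(pool.map(lambda chunk: partial_expectation(chunk, bits), chunks))
    return sum(partials)


serial = partial_expectation(hamiltonian, bitstring)
distributed = distributed_expectation(hamiltonian, bitstring, num_qpus=2)
print(serial, distributed)
```

Both evaluations agree, which is why the split is safe. In CUDA Quantum itself, to my understanding, this splitting is requested on the nvidia-mqpu target by passing `execution=cudaq.parallel.thread` to `cudaq.observe()`, rather than by partitioning the terms by hand.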
