# Discrete Fourier Transform

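This notebook demonstrates a hardware-accelerated DFT. Because the design does not use the FFT algorithm, it is not the fastest possible implementation.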
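
For reference, a direct (non-FFT) DFT evaluates `X[k] = sum over n of x[n] * exp(-2j*pi*k*n/N)` with two nested loops, which costs on the order of N² operations. The following is a minimal pure-Python sketch of that definition. The name `naive_dft` is illustrative, and this is not the HLS source used to build the IP.

```python
import cmath

def naive_dft(samples):
    """Direct O(N^2) DFT: X[k] = sum_n x[n] * exp(-2j*pi*k*n/N)."""
    n_points = len(samples)
    result = []
    for k in range(n_points):
        acc = 0j
        for n, x in enumerate(samples):
            acc += x * cmath.exp(-2j * cmath.pi * k * n / n_points)
        result.append(acc)
    return result

# A unit impulse transforms to a flat spectrum of ones.
spectrum = naive_dft([1, 0, 0, 0])
print([round(abs(v), 6) for v in spectrum])  # [1.0, 1.0, 1.0, 1.0]
```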

## Loading the Overlay

Load the overlay and list the IP names:

In [27]:
from pynq import Overlay
overlay = Overlay("./dft.bit")
for key in overlay.ip_dict: # list every IP in the overlay
    print(key)

dft_0


The IP is named dft_0:

In [28]:
dft = overlay.dft_0

Inspect the IP's registers:

In [29]:
dft.register_map

RegisterMap {
  CTRL = Register(AP_START=0, AP_DONE=0, AP_IDLE=1, AP_READY=0, RESERVED_1=0, AUTO_RESTART=0, RESERVED_2=0),
  GIER = Register(Enable=0, RESERVED=0),
  IP_IER = Register(CHAN0_INT_EN=0, CHAN1_INT_EN=0, RESERVED=0),
  IP_ISR = Register(CHAN0_INT_ST=0, CHAN1_INT_ST=0, RESERVED=0),
  real_sample_1 = Register(real_sample=402960384),
  real_sample_2 = Register(real_sample=0),
  imag_sample_1 = Register(imag_sample=402980864),
  imag_sample_2 = Register(imag_sample=0),
  real_op_1 = Register(real_op=402984960),
  real_op_2 = Register(real_op=0),
  imag_op_1 = Register(imag_op=402968576),
  imag_op_2 = Register(imag_op=0)
}

Print the address offset of each buffer-pointer register (in hex):

In [30]:
print(format(dft.register_map.real_sample_1.address, 'x'))
print(format(dft.register_map.imag_sample_1.address, 'x'))
print(format(dft.register_map.real_op_1.address, 'x'))
print(format(dft.register_map.imag_op_1.address, 'x'))

10
1c
28
34


---

## Allocating Buffers

Create the buffers with allocate and write the input data into them:

In [31]:
from pynq import allocate
import numpy as np
size = 1024
input_buffer_1 = allocate(shape=(size,), dtype='f4')
input_buffer_2 = allocate(shape=(size,), dtype='f4')
output_buffer_1 = allocate(shape=(size,), dtype='f4')
output_buffer_2 = allocate(shape=(size,), dtype='f4')

np.random.seed(0)
# real_in = np.ones(shape=(1024,), dtype='f4')
# imag_in = np.zeros(shape=(1024,), dtype='f4')
real_in = np.random.random((size,))
imag_in = np.random.random((size,))

np.copyto(input_buffer_1, real_in)
np.copyto(input_buffer_2, imag_in)

print(input_buffer_1[0:10])
print(input_buffer_2[0:10])

[ 0.54881352  0.71518934  0.60276335  0.54488319  0.42365479  0.64589411
  0.4375872   0.89177299  0.96366274  0.38344151]
[ 0.77827615  0.84834528  0.49041989  0.18534859  0.99581528  0.12935576
  0.47145733  0.0680931   0.94385087  0.96492493]


Write the physical addresses of the buffers to the IP's registers:

In [32]:
dft.write(0x10, input_buffer_1.physical_address)
dft.write(0x1c, input_buffer_2.physical_address)
dft.write(0x28, output_buffer_1.physical_address)
dft.write(0x34, output_buffer_2.physical_address)
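
The register dump lists each pointer as a `_1`/`_2` pair. A plausible reading, and an assumption here, is that these are the low and high 32-bit words of a 64-bit address, as is common for HLS-generated AXI master pointers. Since every `_2` register reads 0 in the dump, writing only the low word is enough on this board. The sketch below shows the split; `split_address` is a hypothetical helper, not part of PYNQ.

```python
def split_address(physical_address):
    """Split a 64-bit address into (low, high) 32-bit words."""
    low = physical_address & 0xFFFFFFFF
    high = (physical_address >> 32) & 0xFFFFFFFF
    return low, high

# Value of real_sample_1 taken from the register dump above.
low, high = split_address(402960384)
print(hex(low), hex(high))

# Assumption: if the high word were ever non-zero, it would go to the
# matching *_2 register; the offset of that register is not shown here.
```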

---

## Running the IP

Run the IP and time it:

In [33]:
import time

dft.write(0x00, 0x01)
start_time = time.time()
while True:
    reg = dft.read(0x00)
    if reg != 1:
        break
end_time = time.time()

print("Elapsed: {}s".format(end_time - start_time))

Elapsed: 0.01782536506652832s
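
The loop above spins for as long as the control register reads exactly 1 (AP_START only), and it never gives up if the IP hangs. A hedged alternative is to test the status bits explicitly and add a timeout. The bit positions below are assumed from the usual HLS `ap_ctrl_hs` layout, which matches the field order in the register dump, and `wait_until_done` is a hypothetical helper. A stand-in callable replaces `dft.read(0x00)` so the sketch runs without a board.

```python
import time

# Bit masks assumed from the usual HLS ap_ctrl_hs control register layout.
AP_START, AP_DONE, AP_IDLE, AP_READY = 0x01, 0x02, 0x04, 0x08

def wait_until_done(read_ctrl, timeout_s=1.0):
    """Poll the control register until AP_DONE or AP_IDLE is set.

    read_ctrl is any zero-argument callable returning the register value,
    for example `lambda: dft.read(0x00)`.
    Returns the elapsed time in seconds, or raises TimeoutError.
    """
    start = time.perf_counter()
    while True:
        if read_ctrl() & (AP_DONE | AP_IDLE):
            return time.perf_counter() - start
        if time.perf_counter() - start > timeout_s:
            raise TimeoutError("IP did not finish within the timeout")

# Stand-in for the hardware: reads "running" twice, then "done and idle".
fake_values = iter([AP_START, AP_START, AP_DONE | AP_IDLE])
elapsed = wait_until_done(lambda: next(fake_values))
print("elapsed: {:.6f} s".format(elapsed))
```

`time.perf_counter()` is used instead of `time.time()` because it is a monotonic, high-resolution clock intended for measuring short intervals.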


Inspect the computed result:

In [34]:
rst_hw = output_buffer_1 + output_buffer_2*1j
print(rst_hw[0:10])

[ 506.74743652 +5.22333923e+02j    8.50856686 -9.61070824e+00j
    5.17804146 -6.18007779e-03j   -5.57392788 -8.27330017e+00j
    4.22003651 +6.31724644e+00j    4.42737579 +3.13718653e+00j
   -7.51551008 +9.44968283e-01j   17.26153946 +1.53106999e+00j
   19.68187714 +7.86181021e+00j   -1.89949572 -9.49977398e+00j]


---

## Results and Discussion

Compare against the software result:

In [35]:
start_time = time.time()
rst_ps = np.fft.fft(real_in+imag_in*1j)
end_time = time.time()

print("Elapsed: {} s".format(end_time - start_time))
print(rst_ps[0:10])
rmse = np.sqrt((np.abs(rst_ps-rst_hw) ** 2).sum()/1024)
print("RMSE: {}".format(rmse))

Elapsed: 0.0024955272674560547 s
[ 506.74752192 +5.22334173e+02j    8.50859753 -9.61077954e+00j
    5.17805970 -6.18167449e-03j   -5.57394648 -8.27328398e+00j
    4.22003839 +6.31722764e+00j    4.42736651 +3.13717990e+00j
   -7.51552925 +9.44970169e-01j   17.26157257 +1.53107281e+00j
   19.68191240 +7.86180915e+00j   -1.89950851 -9.49974523e+00j]
RMSE: 1.4750579194844413e-05


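From this we can see:

(1) The RMSE (root-mean-square error) is very small, so the hardware output is correct.

(2) The software takes about 2.5 ms, whereas the hardware takes about 17.8 ms, so the hardware is roughly 7 times slower.

This is because the software uses the fast Fourier transform (FFT) algorithm, while the hardware design does not.

Even so, the two are within an order of magnitude of each other.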
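
To put the algorithmic gap in perspective, the sketch below compares rough operation counts for a direct DFT and a radix-2 FFT at N = 1024. These are order-of-magnitude estimates only: constant factors are ignored, and `np.fft.fft` is not necessarily a plain radix-2 implementation. The function names are illustrative.

```python
import math

def direct_dft_ops(n_points):
    """Complex multiply-accumulates in a direct DFT: N * N."""
    return n_points * n_points

def radix2_fft_ops(n_points):
    """Rough operation count for a radix-2 FFT: N * log2(N)."""
    return n_points * int(math.log2(n_points))

n = 1024
ratio = direct_dft_ops(n) // radix2_fft_ops(n)
print(direct_dft_ops(n), radix2_fft_ops(n), ratio)  # 1048576 10240 102
```

By this estimate the hardware does on the order of 100 times more arithmetic than the FFT, yet finishes only about 7 times later.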



---

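If the DFT is instead run on the PS without the FFT algorithm (that is, using the testbench (tb) directly as the software implementation, which is provided in the `./ps_time_used` folder), the result is: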

In [36]:
import os
os.chdir("./ps_time_used")
!./a.out
os.chdir("..")

time used: 214155 us
----------------------------------------------
   RMSE(R)           RMSE(I)
0.001853297697380 0.001314980210736
----------------------------------------------
*******************************************
PASS: The output matches the golden output!
*******************************************


The run time is about 214 ms, whereas the hardware-accelerated version takes under 20 ms, a speedup of more than 10x.
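
The ratios can be worked out directly from the times printed in the cell outputs above. The following small sketch does that arithmetic; the variable names and the `speedup` helper are illustrative.

```python
# Times copied from the cell outputs above, in seconds.
hw_dft_s = 0.01782536506652832      # PL, direct DFT
sw_fft_s = 0.0024955272674560547    # PS, np.fft.fft
sw_dft_s = 214155e-6                # PS, direct DFT (a.out reports microseconds)

def speedup(baseline_s, accelerated_s):
    """How many times faster the accelerated run is than the baseline."""
    return baseline_s / accelerated_s

print("PL DFT vs PS DFT: {:.1f}x faster".format(speedup(sw_dft_s, hw_dft_s)))
print("PL DFT vs PS FFT: {:.1f}x slower".format(speedup(hw_dft_s, sw_fft_s)))
```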

---

# Encapsulation

Calls to the IP can be wrapped by subclassing DefaultIP.

Note that bindto must name the IP that the driver binds to; the IP in this folder is version 1.2.

In [39]:
from pynq import DefaultIP

class DftDriver(DefaultIP):
    bindto = ['xilinx.com:hls:dft:1.2']

    def __init__(self, description):
        super().__init__(description=description)
        # Allocate the buffers once; `size` is the global defined earlier.
        self.input_buffer_1 = allocate(shape=(size,), dtype='f4')
        self.input_buffer_2 = allocate(shape=(size,), dtype='f4')
        self.output_buffer_1 = allocate(shape=(size,), dtype='f4')
        self.output_buffer_2 = allocate(shape=(size,), dtype='f4')
        # Point this IP's address registers at the driver's own buffers.
        self.write(0x10, self.input_buffer_1.physical_address)
        self.write(0x1c, self.input_buffer_2.physical_address)
        self.write(0x28, self.output_buffer_1.physical_address)
        self.write(0x34, self.output_buffer_2.physical_address)

    def dft(self, inre, inim):
        np.copyto(self.input_buffer_1, inre)
        np.copyto(self.input_buffer_2, inim)

        # Start the IP and poll the control register until it finishes.
        self.write(0x00, 0x01)
        while self.read(0x00) == 1:
            pass

        # Copy the results out so that a later call cannot overwrite them.
        outre = np.array(self.output_buffer_1)
        outim = np.array(self.output_buffer_2)
        return outre, outim

    def free_buffers(self):
        # Release the buffers once the driver is no longer needed.
        for buf in (self.input_buffer_1, self.input_buffer_2,
                    self.output_buffer_1, self.output_buffer_2):
            buf.freebuffer()

Call the IP:

In [41]:
overlay = Overlay("./dft.bit")
dft = overlay.dft_0

start_time = time.time()
real_out, imag_out = dft.dft(real_in, imag_in)
end_time = time.time()
print("Elapsed: {} s".format(end_time - start_time))

rst_hw = real_out + imag_out*1j
rmse = np.sqrt((np.abs(rst_ps-rst_hw) ** 2).sum()/1024)
print("RMSE: {}".format(rmse))

Elapsed: 0.018622636795043945 s
RMSE: 1.4750579194844413e-05
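
The RMSE check appears twice in this notebook with a hard-coded 1024. As a sketch, it could be factored into a helper that divides by the actual sequence length. The pure-Python version below is illustrative (`complex_rmse` is a hypothetical name) and computes the same quantity as `np.sqrt((np.abs(a - b) ** 2).sum() / len(a))`.

```python
import math

def complex_rmse(reference, measured):
    """Root-mean-square of |reference[i] - measured[i]| over all points."""
    if len(reference) != len(measured):
        raise ValueError("sequences must have the same length")
    total = sum(abs(r - m) ** 2 for r, m in zip(reference, measured))
    return math.sqrt(total / len(reference))

# Two points, each off by a difference of magnitude 5, give an RMSE of 5.
print(complex_rmse([0j, 1 + 1j], [3 + 4j, 4 + 5j]))  # 5.0
```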
