Download the creditcard dataset from Kaggle, then inspect the dataset's shape and label distribution.

In [None]:
import numpy as np   # used in later cells (np.hstack, np.argmax, ...)
import pandas as pd
data = pd.read_csv(r'creditcard.csv')
print(data.values.shape)
print(data['Class'].value_counts())

(284807, 31)
0    284315
1       492
Name: Class, dtype: int64

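The data has 30 features plus the label column "Class". The "normal" class has about 284,000 samples (284,315), while the "fraud" class has only 492, so the class distribution is extremely imbalanced. Common remedies for such datasets are oversampling and undersampling. Oversampling the "fraud" class would sharply increase the amount of data and place a heavy burden on the quantum neural network, so undersampling is used here.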
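To make the imbalance concrete, the class counts printed above can be turned into a fraud rate and a normal-to-fraud ratio. This is a small illustrative calculation that is not part of the original notebook; the variable names are chosen here for illustration.

```python
# Class counts copied from the value_counts() output above.
n_normal = 284315
n_fraud = 492

total = n_normal + n_fraud
fraud_rate = n_fraud / total      # fraction of all samples that are fraud
ratio = n_normal / n_fraud        # normal samples per fraud sample

print(f"total samples: {total}")
print(f"fraud rate: {fraud_rate:.4%}")
print(f"normal : fraud is roughly {ratio:.0f} : 1")
```

With well under 1% positives, a model that always predicts "normal" would already reach very high accuracy, which is why resampling and recall matter in the rest of the notebook.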

First, separate the features from the labels and drop the irrelevant "Time" column:

In [None]:
df = pd.read_csv("creditcard.csv")                   # read the file once
data = df.drop(['Time', 'Class'], axis=1).values     # feature matrix without "Time" and the label
label = df['Class'].values.reshape(-1, 1)            # label column as an (n, 1) array

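Even after dropping "Time", 29 features remain, which is too heavy a load for a quantum neural network, so PCA is used to reduce the dimensionality. Import PCA from sklearn: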

In [None]:
from sklearn.decomposition import PCA

pca = PCA(n_components=16)
down_data = pca.fit_transform(data)
print(down_data.shape)

(284807, 16)
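The notebook does not report how much variance the 16 retained components preserve. The following is a minimal sketch of how that could be checked, assuming only NumPy and using synthetic data as a stand-in for the real feature matrix (the array `X` and its scaling are hypothetical). With sklearn, the same per-component ratios are available as `pca.explained_variance_ratio_` after fitting.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for the 29-feature matrix (hypothetical data, not creditcard.csv).
X = rng.normal(size=(500, 29)) * np.linspace(3.0, 0.1, 29)

Xc = X - X.mean(axis=0)                           # PCA operates on centered data
_, s, _ = np.linalg.svd(Xc, full_matrices=False)  # singular values, sorted descending
explained = s**2 / np.sum(s**2)                   # variance ratio per principal component
cumulative = np.cumsum(explained)

print(f"variance kept by the first 16 components: {cumulative[15]:.3f}")
```

If the cumulative ratio at 16 components is low on the real data, a different number of components or a different encoding might be worth considering.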

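The data is reduced to 16 features. Next, import train_test_split from sklearn, split the data into training and test sets, and extract the "normal" and "fraud" samples of the training set separately: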

In [None]:
from sklearn.model_selection import train_test_split

new_data = np.hstack((down_data,label))
train, test = train_test_split(new_data,random_state=2,train_size=0.7)

test_data = test[:,0:-1]
test_label = test[:,-1]

train_normal_data = train[train[:,-1] == 0][:,0:-1]
train_normal_label = train[train[:,-1] == 0][:,-1]

fraud_data = train[train[:,-1] == 1]

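Undersample the "normal" class. Clustering is used here: a number of samples are drawn from each cluster and then combined with the "fraud" class to form the new training set. The number of clusters is chosen with the silhouette coefficient: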

In [None]:
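from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

n_clusters_range = range(2, 11)
silhouette_scores = []
for n_clusters in n_clusters_range:
    kmeans = KMeans(n_clusters=n_clusters, random_state=0)
    cluster_labels = kmeans.fit_predict(train_normal_data)
    # silhouette_score is O(n^2) in the number of samples; its sample_size argument can subsample if this is too slow
    silhouette_scores.append(silhouette_score(train_normal_data, cluster_labels))

best_n_clusters = n_clusters_range[np.argmax(silhouette_scores)]
print(f"Best number of clusters: {best_n_clusters}")
print(f"Number of class-1 samples: {len(fraud_data)}")

Best number of clusters: 2
Number of class-1 samples: 358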
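For readers unfamiliar with the silhouette coefficient, here is a minimal pure-Python sketch of its definition on a toy 1-D dataset. For each point, `a` is the mean distance to the other points in its own cluster and `b` is the smallest mean distance to any other cluster; the point's score is `(b - a) / max(a, b)`. The function name and toy data are illustrative assumptions, not part of the notebook, and sklearn's `silhouette_score` should be used in practice.

```python
def silhouette(points, labels):
    """Mean silhouette coefficient for 1-D points (illustrative, O(n^2))."""
    clusters = sorted(set(labels))
    scores = []
    for i, (p, li) in enumerate(zip(points, labels)):
        # Distances to the other members of the same cluster.
        same = [abs(p - q) for j, (q, lj) in enumerate(zip(points, labels))
                if lj == li and j != i]
        if not same:                      # singleton cluster: score defined as 0
            scores.append(0.0)
            continue
        a = sum(same) / len(same)
        # Smallest mean distance to the members of any other cluster.
        b = min(
            sum(abs(p - q) for q, lj in zip(points, labels) if lj == c)
            / sum(1 for lj in labels if lj == c)
            for c in clusters if c != li
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

points = [0.0, 0.2, 0.4, 5.0, 5.2, 5.4]
good = silhouette(points, [0, 0, 0, 1, 1, 1])   # matches the two natural groups
bad = silhouette(points, [0, 0, 1, 1, 2, 2])    # splits the groups arbitrarily
print(good, bad)
```

The clustering that matches the natural groups scores close to 1, while the arbitrary split scores much lower, which is the property the loop above relies on when it takes the argmax.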

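To make the "normal" class the same size as the "fraud" class, select from each of the two clusters the 179 samples closest to its cluster center; together they form new_normal_data: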

In [None]:
kmeans = KMeans(n_clusters=2, random_state=0).fit(train_normal_data)

labels = kmeans.labels_
centers = kmeans.cluster_centers_

distances = kmeans.transform(train_normal_data)

n = len(fraud_data) // best_n_clusters   # integer division: n is used as a slice bound below
indices = []
new_normal_data = []
for i in range(kmeans.n_clusters):
    indices_i = np.argsort(distances[:,i])[:n]
    indices.append(indices_i)

for i in range(kmeans.n_clusters):
    for idx in indices[i]:
        new_normal_data.append(train_normal_data[idx])
new_normal_data = np.array(new_normal_data)
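The selection rule used above (rank samples by distance to a cluster center and keep the closest ones) can be seen on a toy example. This is a pure-Python sketch with hypothetical points and fixed centers, so no KMeans fit is involved; it mirrors what `np.argsort(distances[:, i])[:n]` does on the output of `kmeans.transform`.

```python
import math

# Hypothetical 2-D samples and two fixed cluster centers.
samples = [(0.0, 0.1), (0.2, 0.0), (1.5, 1.5), (4.0, 4.1), (3.8, 4.0), (2.6, 2.4)]
centers = [(0.0, 0.0), (4.0, 4.0)]
n_per_cluster = 2

picked = []
for center in centers:
    # Rank all sample indices by Euclidean distance to this center, keep the closest ones.
    order = sorted(range(len(samples)), key=lambda i: math.dist(samples[i], center))
    picked.append(order[:n_per_cluster])

subset = [samples[i] for idx in picked for i in idx]
print(picked)
```

Because the ranking runs over all samples rather than only the members of each cluster, a point lying between two centers could in principle be picked twice; on well-separated clusters this does not occur.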

Combine the new "normal" samples with the "fraud" samples into train_data, then normalize:

In [None]:
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from pandas import DataFrame

zero = np.zeros((len(new_normal_data), 1))   # label 0 for every selected "normal" sample
new_normal = np.hstack((new_normal_data,zero))

train_data = np.vstack((new_normal,fraud_data))[:,0:-1]
train_label = np.vstack((new_normal,fraud_data))[:,-1]

da = StandardScaler().fit(train_data)
train_data = da.transform(train_data)
minmax_scale = MinMaxScaler((-1, 1)).fit(train_data)
train_data = minmax_scale.transform(train_data)


test_data = da.transform(test_data)
test_data = minmax_scale.transform(test_data)

# d1 = DataFrame(train_data)
# d1.to_csv(r'train_data.csv',header=None,index=False)
# d2 = DataFrame(train_label)
# d2.to_csv(r'train_label.csv',header=None,index=False)
# d3 = DataFrame(test_data)
# d3.to_csv(r'test_data.csv',header=None,index=False)
# d4 = DataFrame(test_label)
# d4.to_csv(r'test_label.csv',header=None,index=False)
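The two-stage scaling above is fitted on the training set only and then applied to the test set. The sketch below reproduces that logic in plain Python on a hypothetical 1-D feature, to show one consequence: test values can land outside [-1, 1]. The numbers and helper names are illustrative assumptions.

```python
from statistics import mean, pstdev

train = [2.0, 4.0, 6.0, 8.0]   # hypothetical training values of one feature
test = [5.0, 10.0]             # hypothetical test values of the same feature

# Step 1: standardize with the training mean and population std (as StandardScaler does).
mu, sigma = mean(train), pstdev(train)

def standardize(x):
    return (x - mu) / sigma

# Step 2: map the standardized training range onto [-1, 1] (as MinMaxScaler((-1, 1)) does).
z_train = [standardize(x) for x in train]
z_min, z_max = min(z_train), max(z_train)

def to_range(z):
    return 2 * (z - z_min) / (z_max - z_min) - 1

train_scaled = [to_range(z) for z in z_train]
test_scaled = [to_range(standardize(x)) for x in test]
print(train_scaled, test_scaled)
```

Training values span exactly [-1, 1], while a test value larger than anything seen in training maps above 1. Since both steps are affine, the MinMaxScaler alone would give the same final values here.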

Build a simple SVM model:

In [None]:
from sklearn import svm
################# The preprocessed data can be loaded directly here ######################
# train_data = np.genfromtxt(r'train_data.csv', delimiter=',')
# test_data = np.genfromtxt(r'test_data.csv', delimiter=',')
# train_label = np.genfromtxt(r'train_label.csv', delimiter=',')
# test_label = np.genfromtxt(r'test_label.csv', delimiter=',')
clf = svm.SVC(kernel = 'rbf')
clf.fit(train_data,train_label)

Because the data is imbalanced, we evaluate with accuracy, recall, and the confusion matrix:

In [None]:
from sklearn.metrics import recall_score, accuracy_score, confusion_matrix


p = clf.predict(test_data)
labels = [0, 1] 
cm = confusion_matrix(test_label, p, labels=labels)
print(cm)
print("accuracy_svm:",accuracy_score(test_label,p))
print("recall_svm:",recall_score(test_label,p))

[[81864  3445]
 [   13   121]]
accuracy_svm: 0.9595285746052924
recall_svm: 0.9029850746268657
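As a sanity check, the accuracy and recall printed above can be recomputed by hand from the confusion matrix. sklearn's `confusion_matrix` puts true labels on the rows and predictions on the columns, so with `labels=[0, 1]` the layout is `[[TN, FP], [FN, TP]]`. The variable names below are chosen for illustration.

```python
# Confusion matrix printed above: rows are true labels, columns are predictions.
tn, fp = 81864, 3445
fn, tp = 13, 121

accuracy = (tp + tn) / (tp + tn + fp + fn)   # fraction of all test samples classified correctly
recall = tp / (tp + fn)                      # fraction of true fraud samples that were caught

print("accuracy:", accuracy)
print("recall:", recall)
```

Accuracy is dominated by the large "normal" class, while recall looks only at the 134 fraud samples in the test set.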

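Because the "normal" class was undersampled for training, the amount of data is smaller and training is faster, but 3445 "normal" samples are misclassified at test time. Undersampling performed worse than oversampling but better than no resampling; that comparison is outside the scope of this task, so it is not included in the notebook.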

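Next, build the classification model with a quantum neural network (QNN), using tensorflow_quantum. The classical data must first be loaded into a circuit. Following the approach in [1], a threshold is set: classical values above the threshold become 1 and all others become 0. If the i-th value is 1, an X gate is applied to the i-th qubit; if it is 0, nothing is done: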

In [None]:
import tensorflow as tf
import tensorflow_quantum as tfq
import cirq
import sympy

thres = 0  # the data was scaled to the range (-1, 1) during preprocessing, so 0 is used as the threshold

train_data_binary = np.array(train_data > thres, dtype=np.float32)
test_data_binary = np.array(test_data > thres, dtype=np.float32)

def convert(data):
    values = np.ndarray.flatten(data)
    qubits = cirq.GridQubit.rect(4, 4)   # 16 features, so define 16 qubits
    circuit = cirq.Circuit()
    for i, value in enumerate(values):   # iterate over the index and value of each feature
        if value == 1:
            circuit.append(cirq.X(qubits[i]))   # a value of 1 puts an X gate on the qubit with the same index
    return circuit

train_data_circ = [convert(i) for i in train_data_binary]
test_data_circ = [convert(j) for j in test_data_binary]

# Convert the circuits to tensors
train_data_tensor = tfq.convert_to_tensor(train_data_circ)
test_data_tensor = tfq.convert_to_tensor(test_data_circ)
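The encoding above reduces every feature to a single bit before it reaches the circuit. The following pure-Python sketch, which deliberately avoids cirq, shows which qubit indices would receive an X gate for a hypothetical scaled sample; the helper name `x_gate_positions` and the sample values are assumptions for illustration.

```python
def x_gate_positions(sample, threshold=0.0):
    """Return the indices of features above the threshold, i.e. where an X gate would go."""
    return [i for i, value in enumerate(sample) if value > threshold]

sample = [-0.8, 0.3, 0.0, 0.9, -0.1, 0.05]      # hypothetical scaled features
bits = [1 if v > 0.0 else 0 for v in sample]    # binarized sample
positions = x_gate_positions(sample)

print(bits)
print(positions)
```

Note that 0.05 and 0.9 map to the same bit, so the magnitude of each feature is lost; only its sign relative to the threshold survives.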

Build the quantum network layer:

In [None]:
def create_quantum_model():
    data_qubits = cirq.GridQubit.rect(4, 4)  # 16 features map to 16 qubits
    readout = cirq.GridQubit(-1, -1)         
    circuit = cirq.Circuit()

    # Prepare the readout qubit.
    circuit.append(cirq.X(readout))
    circuit.append(cirq.H(readout))

    for i, qubit in enumerate(data_qubits):
        symbol = sympy.Symbol('cx' + '-' + str(i))
        circuit.append(cirq.CX(qubit, readout) ** symbol)
        symbol = sympy.Symbol('cz' + '-' + str(i))
        circuit.append(cirq.CZ(qubit, readout) ** symbol)  # parameterized circuit built from CX and CZ gates
    circuit.append(cirq.H(readout))
    return circuit, cirq.Z(readout)


model_circuit, model_readout = create_quantum_model()
model = tf.keras.Sequential([                              # build a TensorFlow Sequential model
    tf.keras.layers.Input(shape=(), dtype=tf.string),
    tfq.layers.PQC(model_circuit, model_readout),          # add a PQC quantum layer that uses model_circuit built above
])

model.compile(
    loss=tf.keras.losses.BinaryCrossentropy(),             # binary cross-entropy for the two-class model
    optimizer=tf.keras.optimizers.Adam(),
    metrics=['accuracy'])                                  # the two training classes are the same size, so accuracy is used

model.fit(                                                 # fit the model
      train_data_tensor, train_label,
      batch_size=8,
      epochs=10,                                           # memory on the cloud JupyterHub is limited, so only 10 epochs are run
      verbose=1)


The training log is as follows:

Epoch 1/10
90/90 [==============================] - 14s 157ms/step - loss: 0.6327 - accuracy: 0.6187
Epoch 2/10
90/90 [==============================] - 14s 157ms/step - loss: 0.6067 - accuracy: 0.6858
Epoch 3/10
90/90 [==============================] - 14s 158ms/step - loss: 0.5851 - accuracy: 0.7444
Epoch 4/10
90/90 [==============================] - 14s 153ms/step - loss: 0.5642 - accuracy: 0.7737
Epoch 5/10
90/90 [==============================] - 14s 157ms/step - loss: 0.5441 - accuracy: 0.7877
Epoch 6/10
90/90 [==============================] - 14s 157ms/step - loss: 0.5227 - accuracy: 0.7989
Epoch 7/10
90/90 [==============================] - 14s 154ms/step - loss: 0.5047 - accuracy: 0.8003
Epoch 8/10
90/90 [==============================] - 14s 158ms/step - loss: 0.4923 - accuracy: 0.8059
Epoch 9/10
90/90 [==============================] - 14s 155ms/step - loss: 0.4823 - accuracy: 0.8045
Epoch 10/10
90/90 [==============================] - 14s 158ms/step - loss: 0.4737 - accuracy: 0.8045

After 10 epochs the accuracy has essentially saturated.
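Two numbers in this setup can be checked with simple arithmetic: the "90/90" steps per epoch in the log and the number of trainable parameters in the circuit. This is an illustrative calculation based on the sizes stated in the notebook (358 samples per class, batch size 8, 16 data qubits); the variable names are chosen here.

```python
import math

n_train = 358 + 358                  # downsampled "normal" samples + "fraud" samples
batch_size = 8
steps_per_epoch = math.ceil(n_train / batch_size)   # the last batch is only partially filled

n_data_qubits = 16
n_params = 2 * n_data_qubits         # one CX exponent and one CZ exponent per data qubit

print("steps per epoch:", steps_per_epoch)
print("trainable parameters:", n_params)
```

The model is therefore very small, with a few dozen parameters trained on 716 samples.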

Use the trained model to predict on the test data and compute the metrics:

In [None]:
pre_qnn = model.predict(test_data_tensor)
pre_qnn = (pre_qnn > 0.5).astype("int32").ravel()   # threshold the QNN output (not the SVM predictions p)

cm_qnn = confusion_matrix(test_label, pre_qnn, labels=labels)
print(cm_qnn)
print("accuracy_qnn:",accuracy_score(test_label,pre_qnn))
print("recall_qnn:",recall_score(test_label,pre_qnn))

[[83270  2039]
 [   43    91]]
accuracy_qnn: 0.975632878059057
recall_qnn: 0.6791044776119403

For comparison, the SVM results from above:

[[81864  3445]
 [   13   121]]
accuracy_svm: 0.9595285746052924
recall_svm: 0.9029850746268657

The QNN's accuracy is slightly higher than the SVM's, but it classifies "fraud" samples less well (recall 0.679 versus 0.903).
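Precision is not reported in the notebook, but it can be derived from the two confusion matrices above and gives another view of the comparison. The sketch below is an illustrative addition; the helper name `precision_recall` is an assumption.

```python
def precision_recall(cm):
    """Precision and recall of the positive class from [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = cm
    return tp / (tp + fp), tp / (tp + fn)

# Confusion matrices copied from the outputs above.
confusion = {
    "svm": [[81864, 3445], [13, 121]],
    "qnn": [[83270, 2039], [43, 91]],
}

results = {}
for name, cm in confusion.items():
    precision, recall = precision_recall(cm)
    results[name] = (precision, recall)
    print(f"{name}: precision={precision:.4f} recall={recall:.4f}")
```

Both models flag far more "normal" transactions than true frauds, so precision is low for each; the QNN raises fewer false alarms, which is what lifts its accuracy, while the SVM catches more of the actual fraud cases.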

References:
[1] https://arxiv.org/pdf/1802.06002.pdf