# Submission Notebook

## Reproduction

### Running the QRF Model 


The model is based on the paper by Srikumar et al., "A kernel-based quantum random forest for improved classification" (2022). The code is intended for research purposes and for developing proofs of concept. For questions about the code, please email maiyuren.s@gmail.com.

In [30]:
from quantum_random_forest import QuantumRandomForest, set_multiprocessing
from split_function import SplitCriterion
from data_construction import data_preprocessing
import numpy as np
import matplotlib.pyplot as plt
from sklearn import metrics, datasets

#### Load dataset

Load your own dataset here. The preprocessing can be left untouched, but note that some embeddings require data of a specific dimension. To reduce the data to the required dimension with PCA, change the X_dim variable.
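
How data_preprocessing performs the reduction is not shown in this notebook. As a rough illustration of what reducing to X_dim components means, the sketch below standardizes the features and projects them onto the top principal components using NumPy only. The helper name `pca_reduce` and the synthetic input are hypothetical and not part of the repository.

```python
import numpy as np

def pca_reduce(X, x_dim):
    """Project the rows of X onto the top x_dim principal components."""
    X = np.asarray(X, dtype=float)
    # Standardize each feature; constant columns are left at zero.
    std = X.std(axis=0)
    std[std == 0] = 1.0
    Z = (X - X.mean(axis=0)) / std
    # Rows of Vt are the principal directions, ordered by decreasing variance.
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)
    return Z @ Vt[:x_dim].T

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 30))   # synthetic stand-in with 30 features
X_reduced = pca_reduce(X_demo, 6)
print(X_reduced.shape)
```

The number of retained components plays the role of X_dim, which here matches the 6 qubits used for the embedding.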

In [31]:
data = datasets.load_breast_cancer()
num_classes = 2
X, y = data.data, data.target

In [32]:
training_set, testing_set = data_preprocessing(X, y, 
                                               train_prop=0.75,        # Proportion of dataset allocated for training
                                               X_dim=6)                # Determine the required dimension of the dataset. None for default.

New datapoint dimension: 6


#### Model parameters 

In [33]:
n_qubits = 6                                         # Number of qubits for the embedding
dt_type = 'qke'                                      # Do not touch
ensemble_var = None                                  # Do not touch
branch_var = ['eff_anz_pqc_arch', 
              'iqp_anz_pqc_arch', 
              'eff_anz_pqc_arch']                    # Type of ansatz, or a list to vary it down the tree (as given here)
num_trees = 3                                        # Number of trees in ensemble 
split_num = 2                                        # Do not touch
pqc_sample_num = 2024                                # Number of circuit samples per kernel estimation
num_classes = num_classes                            # Number of classes in dataset
max_depth = 4                                        # Maximum depth of the tree
num_params_split = n_qubits*(n_qubits +1)            # Number of parameters in the embedding (depends on the ansatz), or a list to vary it down the tree, e.g. [2 * n_qubits ** 2, n_qubits*(n_qubits +1), 2 * n_qubits ** 2]
num_rand_gen = 1                                     # Do not touch
num_rand_meas_q = n_qubits                           # Do not touch 
svm_num_train = 5                                    # L, Number of Landmarks
svm_c = 10                                           # C term in SVM optimisation, or list down the tree [100, 50, 20]
min_samples_split = svm_num_train                    # Minimum number of samples
embedding_type = ['as_params_all', 
                  'as_params_iqp', 
                  'as_params_all']                   # Type of embedding, or a list to vary it down the tree (as given here)
criterion = SplitCriterion.init_info_gain('clas')    # Do not touch
device = 'cirq'                                      # Choose a device. Also possible to run on IBM
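
The comments above indicate that branch_var and num_params_split can both be given as lists with one entry per tree depth. The sketch below builds a matching parameter-count list from the ansatz names. The pairing of each ansatz with a count is an assumption read off the order of the example lists in the comments, and the lookup table `PARAM_COUNT` is hypothetical. Whether the library requires the list form when branch_var is a list is not confirmed here.

```python
n_qubits = 6

# Assumed pairing, inferred from the order of the example lists in the comments.
PARAM_COUNT = {
    'eff_anz_pqc_arch': lambda n: 2 * n ** 2,
    'iqp_anz_pqc_arch': lambda n: n * (n + 1),
}

branch_var = ['eff_anz_pqc_arch', 'iqp_anz_pqc_arch', 'eff_anz_pqc_arch']

# One parameter count per depth, aligned with branch_var.
num_params_split = [PARAM_COUNT[name](n_qubits) for name in branch_var]
print(num_params_split)  # [72, 42, 72]
```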

#### Set up model

In [34]:
qrf_reproduction = QuantumRandomForest(n_qubits, 'clas', num_trees, criterion, max_depth=max_depth, 
                          min_samples_split=min_samples_split, tree_split_num=split_num, num_rand_meas_q=num_rand_meas_q,
                          ensemble_var=ensemble_var, dt_type=dt_type, num_classes=num_classes, ensemble_vote_type='ave',
                          num_params_split=num_params_split, num_rand_gen=num_rand_gen, pqc_sample_num=pqc_sample_num,
                          embed=embedding_type, branch_var=branch_var, svm_num_train=svm_num_train, svm_c=svm_c, 
                          nystrom_approx=True, device=device, cholesky=False)
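
The model is built with nystrom_approx=True and svm_num_train landmarks. As a generic illustration of the Nyström idea (not the library's implementation), the kernel matrix over n points is approximated from an n x L block C and an L x L landmark block W as C W^+ C^T. This needs L*L + n*L kernel evaluations rather than n*n. With L = 5 and n = 180 that is 25 and 900, which agrees with the progress-bar totals in the training log. The sketch uses a classical RBF kernel as a stand-in for the quantum kernel; `rbf_kernel` and `nystrom` are hypothetical names.

```python
import numpy as np

def rbf_kernel(A, B, gamma=0.5):
    """Classical stand-in for the quantum kernel."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq)

def nystrom(X, landmark_idx, kernel=rbf_kernel):
    """Approximate the full kernel matrix from a set of landmark points."""
    landmarks = X[landmark_idx]
    W = kernel(landmarks, landmarks)   # L x L evaluations
    C = kernel(X, landmarks)           # n x L evaluations
    return C @ np.linalg.pinv(W) @ C.T

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(180, 6))                 # synthetic data
idx = rng.choice(180, size=5, replace=False)       # 5 landmarks
K_approx = nystrom(X_demo, idx)
print(K_approx.shape)
```

The approximation is exact on the landmark rows and columns and degrades for points far from every landmark, which is one reason the number of landmarks is a tuning parameter.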

#### Train

In [35]:
qrf_reproduction.train(training_set, 
          partition_sample_size=180)               # Partition size is the number of instances given to each tree. Set to None to use all the data for all trees



Training tree 1 of 3 ------------------------------------------------------------

---Training sub-tree of depth: 1 (180 instances)
We use the Nyström Approximation WITHOUT Incomplete Cholesky Decomposition
[  2  47  38 115 156]


100%|██████████| 25/25 [00:00<00:00, 116.50it/s]
100%|██████████| 900/900 [00:11<00:00, 75.37it/s]


Info gain: 0.3802
Accuracy for binary dataset: 0.8556
Number of SV: [34 31]
----> Selected SVM info gain: 0.3802

---Training sub-tree of depth: 2 (72 instances)
We use the Nyström Approximation WITHOUT Incomplete Cholesky Decomposition
[66 43 14 47 28]


100%|██████████| 25/25 [00:00<00:00, 44.70it/s]
100%|██████████| 360/360 [00:11<00:00, 30.63it/s]


Info gain: 0.0000
Accuracy for binary dataset: 0.8194
Number of SV: [19 13]
Increase SVM_C...
We use the Nyström Approximation WITHOUT Incomplete Cholesky Decomposition
[34 51 50  8 62]


100%|██████████| 25/25 [00:00<00:00, 48.88it/s]
100%|██████████| 360/360 [00:11<00:00, 31.74it/s]


Info gain: 0.1015
Accuracy for binary dataset: 0.8611
Number of SV: [14 12]
----> Selected SVM info gain: 0.1015

---Training sub-tree of depth: 3 (67 instances)
We use the Nyström Approximation WITHOUT Incomplete Cholesky Decomposition
[23 34 50 56 27]


100%|██████████| 25/25 [00:00<00:00, 119.10it/s]
100%|██████████| 335/335 [00:04<00:00, 81.72it/s]


Info gain: 0.1916
Accuracy for binary dataset: 0.9104
Number of SV: [8 7]
----> Selected SVM info gain: 0.1916

---Training sub-tree of depth: 4 (58 instances)

---Training sub-tree of depth: 4 (9 instances)

---Training sub-tree of depth: 3 (5 instances)
We use the Nyström Approximation WITHOUT Incomplete Cholesky Decomposition
[4 2 1 4 3]


100%|██████████| 25/25 [00:00<00:00, 174.77it/s]
100%|██████████| 25/25 [00:00<00:00, 413.24it/s]


Info gain: 0.7219
Accuracy for binary dataset: 0.0000
Number of SV: [2 1]
----> Selected SVM info gain: 0.7219

---Training sub-tree of depth: 4 (4 instances)

---Training sub-tree of depth: 4 (1 instances)

---Training sub-tree of depth: 2 (108 instances)
We use the Nyström Approximation WITHOUT Incomplete Cholesky Decomposition
[92 91 66 91  8]


100%|██████████| 25/25 [00:00<00:00, 66.35it/s]
 16%|█▌        | 84/540 [00:02<00:13, 34.46it/s]


KeyboardInterrupt: 
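
The "Info gain" lines in the log report the split criterion at each node. The sketch below computes information gain under the assumption that it is the usual Shannon-entropy definition in bits; the library's exact definition is not shown in this notebook. The 5-instance node in the log has children of 4 and 1 instances and a gain of 0.7219. That is the value a perfectly separating split gives for a parent with a 4-versus-1 class mix, which is one composition consistent with the log rather than a confirmed one.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a label list, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(parent, left, right):
    """Parent entropy minus the size-weighted entropy of the two children."""
    n = len(parent)
    children = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - children

parent = [0, 0, 0, 0, 1]                     # assumed class mix of the node
gain = info_gain(parent, [0, 0, 0, 0], [1])  # split into pure children
print(round(gain, 4))  # 0.7219
```

A gain of 0.0000, as in the first attempt at depth 2, means the split left the class mix of the children no purer than the parent's.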

#### Test

In [None]:
acc, preds_qrf = qrf_reproduction.test(testing_set, 
                          ret_pred=True, 
                          parallel=False,            # Set to False if you don't want parallel computation. Needs to be False for calc_tree_corr to be True.
                          calc_tree_corr=True)       # True is required to later look at correlations between trees

In [None]:
# Classification report
print(f"Classification report for QRF:\n"
      f"{metrics.classification_report(testing_set.y, preds_qrf)}\n")

#### Further analysis of model

In [None]:
# Print out tree
qrf_reproduction.print_trees()

In [None]:
corr_dict = qrf_reproduction.compute_tree_correlation()
for k,v in corr_dict.items():
    print("Class", k)
    plt.imshow(v)
    plt.colorbar()
    plt.show()

## Own implementation: Comparing Cholesky to non-Cholesky

In [None]:
data = datasets.load_breast_cancer()
num_classes = 2
X, y = data.data, data.target

In [None]:
training_set, testing_set = data_preprocessing(X, y, 
                                           train_prop=0.75,        # Proportion of dataset allocated for training
                                           X_dim=6)                # Determine the required dimension of the dataset. None for default.

In [None]:
branch_var = 'eff_anz_pqc_arch'                   
num_trees = 6                                      
max_depth = 7
embedding_type = 'as_params_all'            

In [None]:
cholesky = True
qrf = QuantumRandomForest(n_qubits, 'clas', num_trees, criterion, max_depth=max_depth, 
                      min_samples_split=min_samples_split, tree_split_num=split_num, num_rand_meas_q=num_rand_meas_q,
                      ensemble_var=ensemble_var, dt_type=dt_type, num_classes=num_classes, ensemble_vote_type='ave',
                      num_params_split=num_params_split, num_rand_gen=num_rand_gen, pqc_sample_num=pqc_sample_num,
                      embed=embedding_type, branch_var=branch_var, svm_num_train=svm_num_train, svm_c=svm_c, 
                      nystrom_approx=True, device=device, cholesky=cholesky)
qrf.train(training_set, partition_sample_size=180)
acc, preds_qrf = qrf.test(testing_set, 
                      ret_pred=True, 
                      parallel=False,            # Set to False if you don't want parallel computation. Needs to be False for calc_tree_corr to be True.
                      calc_tree_corr=True)       # True is required to later look at correlations between trees
# Classification report
classification_report = metrics.classification_report(testing_set.y, preds_qrf)
print(f"Classification report for QRF (With Cholesky Improvement):\n"
      f"{classification_report}\n")

In [None]:
cholesky = False
qrf = QuantumRandomForest(n_qubits, 'clas', num_trees, criterion, max_depth=max_depth, 
                      min_samples_split=min_samples_split, tree_split_num=split_num, num_rand_meas_q=num_rand_meas_q,
                      ensemble_var=ensemble_var, dt_type=dt_type, num_classes=num_classes, ensemble_vote_type='ave',
                      num_params_split=num_params_split, num_rand_gen=num_rand_gen, pqc_sample_num=pqc_sample_num,
                      embed=embedding_type, branch_var=branch_var, svm_num_train=svm_num_train, svm_c=svm_c, 
                      nystrom_approx=True, device=device, cholesky=cholesky)
qrf.train(training_set, partition_sample_size=180)
acc, preds_qrf = qrf.test(testing_set, 
                      ret_pred=True, 
                      parallel=False,            # Set to False if you don't want parallel computation. Needs to be False for calc_tree_corr to be True.
                      calc_tree_corr=True)       # True is required to later look at correlations between trees
# Classification report
classification_report = metrics.classification_report(testing_set.y, preds_qrf)
print(f"Classification report for QRF (No Cholesky Improvement):\n"
      f"{classification_report}\n")
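
For readers unfamiliar with the decomposition behind the cholesky flag, the following is a generic textbook sketch of a pivoted incomplete Cholesky factorization, which builds a low-rank factor G with K ≈ G Gᵀ by greedily picking the column with the largest remaining diagonal residual. How the implementation compared above combines this with the Nyström approximation is not shown in this notebook, so the code is an assumption-laden illustration only; `incomplete_cholesky` is a hypothetical name.

```python
import numpy as np

def incomplete_cholesky(K, rank, tol=1e-10):
    """Pivoted incomplete Cholesky: returns G (n x r) and the chosen pivots."""
    n = K.shape[0]
    d = np.diag(K).astype(float).copy()   # residual diagonal
    G = np.zeros((n, rank))
    pivots = []
    for j in range(rank):
        i = int(np.argmax(d))             # column with the largest residual
        if d[i] <= tol:                   # residual exhausted: stop early
            G = G[:, :j]
            break
        pivots.append(i)
        # New column: residual of column i, scaled by the pivot.
        G[:, j] = (K[:, i] - G[:, :j] @ G[i, :j]) / np.sqrt(d[i])
        d -= G[:, j] ** 2
    return G, pivots

rng = np.random.default_rng(0)
A = rng.normal(size=(8, 3))
K = A @ A.T                               # synthetic rank-3 PSD matrix
G, pivots = incomplete_cholesky(K, rank=3)
print(np.allclose(G @ G.T, K))
```

The pivots play a role similar to landmarks, except that they are chosen adaptively from the kernel values instead of at random.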