[![Binder](https://mybinder.org/badge_logo.svg)](https://lab.mlpack.org/v2/gh/mlpack/examples/master?urlpath=lab%2Ftree%2Fbreast_cancer_wisconsin_transformation_with_pca%2Fbreast-cancer-wisconsin-pca-cpp.ipynb)

In [1]:
/**
 * @file breast-cancer-wisconsin-pca-cpp.ipynb
 *
 * A simple example of Principal Component Analysis (PCA)
 * applied to the UCI Breast Cancer Wisconsin (Diagnostic) dataset.
 *
 * https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)
 */

In [2]:
!wget https://lab.mlpack.org/data/breast-cancer-wisconsin.csv

--2020-07-27 14:35:32--  https://lab.mlpack.org/data/breast-cancer-wisconsin.csv
Resolving lab.mlpack.org (lab.mlpack.org)... 95.216.66.112
Connecting to lab.mlpack.org (lab.mlpack.org)|95.216.66.112|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 124197 (121K) [application/octet-stream]
Saving to: ‘breast-cancer-wisconsin.csv.1’

     0K .......... .......... .......... .......... .......... 41%  221M 0s
    50K .......... .......... .......... .......... .......... 82%  227M 0s
   100K .......... .......... .                               100%  254M=0.001s

2020-07-27 14:35:32 (229 MB/s) - ‘breast-cancer-wisconsin.csv.1’ saved [124197/124197]



In [3]:
#include <mlpack/xeus-cling.hpp>

#include <mlpack/core.hpp>
#include <mlpack/methods/pca/pca.hpp>

In [4]:
// Header files to create and show the plot.
#define WITHOUT_NUMPY 1
#include "matplotlibcpp.h"
#include "xwidgets/ximage.hpp"

namespace plt = matplotlibcpp;

In [5]:
using namespace mlpack;

In [6]:
using namespace mlpack::pca;

In [7]:
arma::mat input;
data::Load("breast-cancer-wisconsin.csv", input);

In [8]:
// Print the first 11 data points. mlpack stores one point per column, so the
// selected columns are transposed to show one point per row.
std::cout << input.submat(0, 0, input.n_rows - 1 , 10).t() << std::endl;

            0   1.7990e+01   1.0380e+01   1.2280e+02   1.0010e+03   1.1840e-01   2.7760e-01   3.0010e-01   1.4710e-01   2.4190e-01   7.8710e-02   1.0950e+00   9.0530e-01   8.5890e+00   1.5340e+02   6.3990e-03   4.9040e-02   5.3730e-02   1.5870e-02   3.0030e-02   6.1930e-03   2.5380e+01   1.7330e+01   1.8460e+02   2.0190e+03   1.6220e-01   6.6560e-01   7.1190e-01   2.6540e-01   4.6010e-01   1.1890e-01            0
   1.0000e+00   2.0570e+01   1.7770e+01   1.3290e+02   1.3260e+03   8.4740e-02   7.8640e-02   8.6900e-02   7.0170e-02   1.8120e-01   5.6670e-02   5.4350e-01   7.3390e-01   3.3980e+00   7.4080e+01   5.2250e-03   1.3080e-02   1.8600e-02   1.3400e-02   1.3890e-02   3.5320e-03   2.4990e+01   2.3410e+01   1.5880e+02   1.9560e+03   1.2380e-01   1.8660e-01   2.4160e-01   1.8600e-01   2.7500e-01   8.9020e-02            0
   2.0000e+00   1.9690e+01   2.1250e+01   1.3000e+02   1.2030e+03   1.0960e-01   1.5990e-01   1.9740e-01   1.2790e-01   2.0690e-01   5.9990e-02   7.4560e-01   7.8690e

In [9]:
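// Separate the labels (last row) from the features and drop the ids (first row).
// These are the last and first columns of the CSV; mlpack transposes the data on load.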
arma::rowvec labels = input.row(input.n_rows - 1);
arma::mat dataset = input.rows(1, input.n_rows - 2);
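
The split above relies on the transposed layout that `data::Load` produces: a CSV file with one record per row becomes a matrix with one record per column. The following standard-library sketch illustrates that layout and the same id/label split without Armadillo. It is only an illustration, not mlpack code: `Matrix`, `SplitResult`, `Transpose` and `SplitIdsAndLabels` are hypothetical names introduced here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper type: matrix[row][column].
using Matrix = std::vector<std::vector<double>>;

// Hypothetical result type for the split below.
struct SplitResult
{
  std::vector<double> labels;  // One label per data point.
  Matrix features;             // features[feature][point].
};

// Turn a file-order table (one record per row) into the layout used after
// loading (one record per column).
Matrix Transpose(const Matrix& table)
{
  assert(!table.empty());
  const std::size_t rows = table.size();
  const std::size_t cols = table[0].size();

  Matrix out(cols, std::vector<double>(rows, 0.0));
  for (std::size_t r = 0; r < rows; ++r)
  {
    assert(table[r].size() == cols);
    for (std::size_t c = 0; c < cols; ++c)
      out[c][r] = table[r][c];
  }
  return out;
}

// Drop the first row (ids) and return the last row (labels) separately,
// assuming the matrix is already in one-point-per-column layout.
SplitResult SplitIdsAndLabels(const Matrix& loaded)
{
  assert(loaded.size() >= 3);
  SplitResult result;
  result.labels = loaded.back();
  result.features.assign(loaded.begin() + 1, loaded.end() - 1);
  return result;
}
```

Calling `SplitIdsAndLabels(Transpose(table))` on a table whose columns are id, features and label yields the same two pieces as `labels` and `dataset` in the cell above.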

In [10]:
// Perform Principal Component Analysis using the exact method. Passing `true`
// to the constructor scales the data before the decomposition.
// Other decomposition methods are 'randomized', 'randomized-block-krylov' and 'quic'.
//
// For more information, check out https://www.mlpack.org/doc/mlpack-3.3.2/doxygen/classmlpack_1_1pca_1_1PCA.html
// or uncomment the line below.
// ?PCA<>
PCA<> pca(true);
pca.Apply(dataset, 2);
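
To make the scaling step and the decomposition more concrete, here is a minimal, standard-library-only sketch of the idea behind PCA: standardize each feature, form the covariance matrix, and extract its leading eigenvector. This is not how mlpack implements `PCA<>`; the exact method there relies on an SVD, and power iteration is swapped in here only because it is short. The names `Matrix`, `Standardize`, `Covariance` and `LeadingEigenvector` are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper type: matrix[feature][point], i.e. one point per column.
using Matrix = std::vector<std::vector<double>>;

// Shift every feature to zero mean and scale it to unit sample standard deviation.
void Standardize(Matrix& data)
{
  for (std::vector<double>& feature : data)
  {
    const std::size_t n = feature.size();
    assert(n > 1);

    double mean = 0.0;
    for (double value : feature) mean += value;
    mean /= static_cast<double>(n);

    double variance = 0.0;
    for (double value : feature) variance += (value - mean) * (value - mean);
    variance /= static_cast<double>(n - 1);
    const double stddev = std::sqrt(variance);

    // Constant features carry no variance; map them to zero.
    for (double& value : feature)
      value = (stddev > 0.0) ? (value - mean) / stddev : 0.0;
  }
}

// Sample covariance matrix of data whose features are already centered.
Matrix Covariance(const Matrix& data)
{
  assert(!data.empty() && data[0].size() > 1);
  const std::size_t d = data.size();
  const std::size_t n = data[0].size();

  Matrix cov(d, std::vector<double>(d, 0.0));
  for (std::size_t i = 0; i < d; ++i)
    for (std::size_t j = 0; j < d; ++j)
    {
      for (std::size_t p = 0; p < n; ++p)
        cov[i][j] += data[i][p] * data[j][p];
      cov[i][j] /= static_cast<double>(n - 1);
    }
  return cov;
}

// Approximate the leading eigenvector (the first principal direction) by
// power iteration.
std::vector<double> LeadingEigenvector(const Matrix& cov, int iterations = 200)
{
  const std::size_t d = cov.size();

  // Non-uniform start vector. Power iteration breaks down if the start is
  // orthogonal to the leading eigenvector, which this choice does not rule out.
  std::vector<double> v(d);
  for (std::size_t i = 0; i < d; ++i) v[i] = 1.0 / static_cast<double>(i + 1);

  for (int it = 0; it < iterations; ++it)
  {
    std::vector<double> w(d, 0.0);
    for (std::size_t i = 0; i < d; ++i)
      for (std::size_t j = 0; j < d; ++j)
        w[i] += cov[i][j] * v[j];

    double norm = 0.0;
    for (double x : w) norm += x * x;
    norm = std::sqrt(norm);
    assert(norm > 0.0);

    for (std::size_t i = 0; i < d; ++i) v[i] = w[i] / norm;
  }
  return v;
}
```

Projecting each standardized point onto the returned vector gives its first principal component score. The sign of an eigenvector is arbitrary, so scores from different implementations may differ by a factor of -1.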

In [11]:
// Print the first eleven columns (points) of the transformed input, one per row.
std::cout << dataset.cols(0, 10).t() << std::endl;

   -9.1848   -1.9469
   -2.3857    3.7649
   -5.7289    1.0742
   -7.1167  -10.2666
   -3.9318    1.9464
   -2.3782   -3.9465
   -2.2369    2.6877
   -2.1414   -2.3382
   -3.1721   -3.3888
   -6.3462   -7.7204
    0.8097    2.6569



In [12]:
// Plot the transformed input.

// Select the points labeled 0.0 (Benign).
arma::mat dataset0 = dataset.cols(arma::find(labels == 0.0));

// Extract the x and y coordinates of those points.
std::vector<double> x0 = arma::conv_to<std::vector<double>>::from(dataset0.row(0));
std::vector<double> y0 = arma::conv_to<std::vector<double>>::from(dataset0.row(1));

// Select the points labeled 1.0 (Malignant).
arma::mat dataset1 = dataset.cols(arma::find(labels == 1.0));

// Extract the x and y coordinates of those points.
std::vector<double> x1 = arma::conv_to<std::vector<double>>::from(dataset1.row(0));
std::vector<double> y1 = arma::conv_to<std::vector<double>>::from(dataset1.row(1));

plt::figure_size(800, 800);
plt::scatter(x0, y0, 4);
plt::scatter(x1, y1, 4);

plt::xlabel("Principal Component - 1");
plt::ylabel("Principal Component - 2");
plt::title("Projection of Breast Cancer dataset onto first two principal components");

plt::save("./basic2.png");
auto im = xw::image_from_file("basic2.png").finalize();
im

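[Output: scatter plot "Projection of Breast Cancer dataset onto first two principal components"; x-axis "Principal Component - 1", y-axis "Principal Component - 2"]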

The plot shows that the two classes, Benign and Malignant, are linearly separable to some extent
once projected onto a two-dimensional space. The Benign class is also more spread out than the
Malignant class.
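
These visual impressions can be checked numerically. The sketch below is one possible way to do so, not part of the original example: it measures the spread of a class as the mean squared distance to its centroid, and uses the accuracy of a nearest-centroid rule (whose decision boundary between two classes is a straight line) as a rough proxy for linear separability. `Point2`, `Centroid`, `Spread` and `NearestCentroidAccuracy` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper type: a point in the two-dimensional projected space.
struct Point2
{
  double x;
  double y;
};

// Squared Euclidean distance between two points.
double SquaredDistance(const Point2& a, const Point2& b)
{
  return (a.x - b.x) * (a.x - b.x) + (a.y - b.y) * (a.y - b.y);
}

// Arithmetic mean of a non-empty set of points.
Point2 Centroid(const std::vector<Point2>& points)
{
  assert(!points.empty());
  Point2 c{0.0, 0.0};
  for (const Point2& p : points)
  {
    c.x += p.x;
    c.y += p.y;
  }
  c.x /= static_cast<double>(points.size());
  c.y /= static_cast<double>(points.size());
  return c;
}

// Mean squared distance to the class centroid: a simple measure of spread.
double Spread(const std::vector<Point2>& points)
{
  const Point2 c = Centroid(points);
  double sum = 0.0;
  for (const Point2& p : points) sum += SquaredDistance(p, c);
  return sum / static_cast<double>(points.size());
}

// Fraction of points that lie closer to their own class centroid than to the
// centroid of the other class.
double NearestCentroidAccuracy(const std::vector<Point2>& classA,
                               const std::vector<Point2>& classB)
{
  const Point2 ca = Centroid(classA);
  const Point2 cb = Centroid(classB);

  std::size_t correct = 0;
  for (const Point2& p : classA)
    if (SquaredDistance(p, ca) < SquaredDistance(p, cb)) ++correct;
  for (const Point2& p : classB)
    if (SquaredDistance(p, cb) < SquaredDistance(p, ca)) ++correct;

  return static_cast<double>(correct) /
      static_cast<double>(classA.size() + classB.size());
}
```

In the notebook, the `x0`/`y0` and `x1`/`y1` vectors from the plotting cell could be zipped into two `Point2` lists and passed to these functions. A larger `Spread` value for one class would support the observation about spread, and an accuracy close to 1 would support the observation about separability.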