<div>
<img src=https://www.institutedata.com/wp-content/uploads/2019/10/iod_h_tp_primary_c.svg width="300">
</div>

# Lab 6.4
# *PCA Lab*

**In this lab, we will:**
- Explore how PCA is related to correlation.
- Use PCA to perform dimensionality reduction.

### 1. Load Data

Features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe characteristics of the cell nuclei present in the image. The separating plane in the 3-dimensional space is described in: [K. P. Bennett and O. L. Mangasarian: "Robust Linear Programming Discrimination of Two Linearly Inseparable Sets", Optimization Methods and Software 1, 1992, 23-34].

This database is also available through the UW CS ftp server: run `ftp ftp.cs.wisc.edu`, then `cd math-prog/cpo-dataset/machine-learn/WDBC/`.

It can also be found on the UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29

Attribute Information:

- Field 1: ID number
- Field 2: Diagnosis (M = malignant, B = benign)
- Fields 3-32: ten real-valued features are computed for each cell nucleus:
    - a) radius (mean of distances from center to points on the perimeter)
    - b) texture (standard deviation of gray-scale values)
    - c) perimeter
    - d) area
    - e) smoothness (local variation in radius lengths)
    - f) compactness (perimeter^2 / area - 1.0)
    - g) concavity (severity of concave portions of the contour)
    - h) concave points (number of concave portions of the contour)
    - i) symmetry
    - j) fractal dimension ("coastline approximation" - 1)

The mean, standard error and "worst" or largest (mean of the three largest values) of these features were computed for each image, resulting in 30 features. For instance, field 3 is Mean Radius, field 13 is Radius SE, field 23 is Worst Radius.
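The field layout described above can be sketched in a few lines. This is only an illustration: the names in `base_features` and the `name_suffix` pattern are assumptions made for this example and may not match the column names in the CSV exactly.

```python
# Ten base measurements, in the order listed above (names are illustrative)
base_features = [
    "radius", "texture", "perimeter", "area", "smoothness",
    "compactness", "concavity", "concave_points", "symmetry",
    "fractal_dimension",
]
# Three summary statistics are reported for each measurement
suffixes = ["mean", "se", "worst"]

# Fields 1 and 2 hold the ID and the diagnosis, so features start at field 3
field_names = {}
field = 3
for suffix in suffixes:
    for name in base_features:
        field_names[field] = f"{name}_{suffix}"
        field += 1

print(len(field_names))  # 30
print(field_names[3], field_names[13], field_names[23])  # radius_mean radius_se radius_worst
```

The three statistics are laid out in blocks of ten, which is why the same measurement reappears every ten fields.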

All feature values are recoded with four significant digits.

Missing attribute values: none

Class distribution: 357 benign, 212 malignant

In [None]:
# IMPORT LIBRARIES
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

plt.style.use('fivethirtyeight')

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

In [None]:
breast_cancer_csv = r'C:\Users\pabarca\OneDrive - GRUPO GRANSOLAR\Desktop\IOD - Python\DATA\breast-cancer-wisconsin-data.csv'
breast_cancer = pd.read_csv(breast_cancer_csv, index_col='id')

### 2. EDA

Explore dataset. Clean data. Find correlation.

In [None]:
breast_cancer.head()

In [None]:
breast_cancer.shape

In [None]:
breast_cancer.info()

In [None]:
breast_cancer.isnull().sum()

In [None]:
breast_cancer.drop(labels='Unnamed: 32', axis=1, inplace=True)

In [None]:
breast_cancer['diagnosis'].value_counts(normalize=True).plot(kind='bar')
plt.show()

In [None]:
sns.pairplot(breast_cancer)

In [None]:
# Copied code from seaborn examples
# https://seaborn.pydata.org/examples/many_pairwise_correlations.html
sns.set(style="white")

# Generate a mask for the upper triangle
mask = np.zeros_like(breast_cancer.corr(numeric_only=True), dtype=bool)
mask[np.triu_indices_from(mask)] = True

# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(18, 18))

# Generate a custom diverging colormap
cmap = sns.diverging_palette(220, 10, as_cmap=True)

# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(breast_cancer.corr(numeric_only=True), mask=mask, cmap=cmap, vmax=1, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5}, annot=True)

plt.show();

### 3. Subset & Normalise

Subset the data to include every column except `diagnosis`, then apply `StandardScaler`.

In [None]:
# ANSWER
from sklearn.preprocessing import StandardScaler

In [None]:
breast_cancer.columns

In [None]:
# Select target column name
target_column = 'diagnosis'

# Save feature column names as a list
feature_columns = breast_cancer.columns.drop('diagnosis')
#feature_columns = [c for c in breast_cancer.columns if c != 'diagnosis']
feature_columns

In [None]:
X = breast_cancer[feature_columns]
y = breast_cancer['diagnosis']

In [None]:
X.head()

In [None]:
# Use StandardScaler to fit and transform X to be standardised
scaler = StandardScaler()
Xs = scaler.fit_transform(X)

In [None]:
Xs
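To see what standardisation does, here is a minimal sketch on a small made-up matrix (`X_demo` is not part of the lab data). It applies the column-wise z-score by hand, using the population standard deviation (`ddof=0`), which is the same convention `StandardScaler` uses.

```python
import numpy as np

# Made-up data: 4 samples, 2 features on very different scales
X_demo = np.array([[1.0, 100.0],
                   [2.0, 300.0],
                   [3.0, 500.0],
                   [4.0, 700.0]])

# Column-wise z-score: subtract the mean, divide by the population std (ddof=0)
Xs_demo = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

print(Xs_demo.mean(axis=0))  # approximately 0 for every column
print(Xs_demo.std(axis=0))   # approximately 1 for every column
```

After this step every feature contributes on the same scale, so PCA is not dominated by features with large units such as area.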

### 4. Calculate correlation matrix

We will be using the correlation matrix to calculate the eigenvectors and eigenvalues.
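As a small check of why the correlation matrix is the natural object here, the sketch below uses synthetic data (`X_demo` is made up for illustration). For standardised data, the correlation matrix is simply the matrix product of the data with itself, divided by the number of samples.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 200 samples, 3 features, with feature 1 correlated to feature 0
X_demo = rng.normal(size=(200, 3))
X_demo[:, 1] += 0.8 * X_demo[:, 0]

# Standardise column-wise (population std, ddof=0)
Xs_demo = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

# Correlation matrix from the standardised data
corr_from_product = Xs_demo.T @ Xs_demo / len(Xs_demo)

# Correlation matrix computed directly by NumPy on the raw data
corr_numpy = np.corrcoef(X_demo, rowvar=False)

print(np.allclose(corr_from_product, corr_numpy))  # True
```

This also shows that standardising first does not change the correlation matrix: correlation is already scale-free.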

In [None]:
# ANSWER
pd.DataFrame(Xs, columns=feature_columns)

In [None]:
# Create dataframe from Xs and calculate correlation matrix with .corr() method

Xcorr = pd.DataFrame(Xs, columns=feature_columns).corr()
Xcorr

In [None]:
type(Xcorr)

In [None]:
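# Calculate eigenvalues and eigenvectors of the (symmetric) correlation matrix.
# np.linalg.eigh returns eigenvalues in ascending order, so reverse both outputs
# to put the largest eigenvalue (the first principal component) first.
eig_vals, eig_vecs = np.linalg.eigh(Xcorr)
eig_vals, eig_vecs = eig_vals[::-1], eig_vecs[:, ::-1]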

In [None]:
len(eig_vals)

In [None]:
eig_vals

In [None]:
# Print the first eigenvalue
eig_vals[0]

In [None]:
# Print the corresponding eigenvector (NumPy stores eigenvectors as columns)
eig_vecs[:, 0]
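One detail worth checking by hand: NumPy returns eigenvectors as the columns of the second output, so the vector paired with eigenvalue `i` is column `i`, not row `i`. The sketch below uses a small made-up 2x2 correlation matrix `A` and `np.linalg.eigh`, which is intended for symmetric matrices.

```python
import numpy as np

# Made-up 2x2 correlation matrix
A = np.array([[1.0, 0.6],
              [0.6, 1.0]])

# eigh returns eigenvalues in ascending order; reorder to descending
vals, vecs = np.linalg.eigh(A)
order = np.argsort(vals)[::-1]
vals, vecs = vals[order], vecs[:, order]

# The eigenvector for vals[0] is the first COLUMN of vecs
v0 = vecs[:, 0]

print(vals)                                # approximately 1.6 and 0.4
print(np.allclose(A @ v0, vals[0] * v0))   # True: A v = lambda v
```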

### 5. Calculate and plot the explained variance

A useful measure is the **explained variance**, which is calculated from the eigenvalues.

The explained variance tells us how much information (variance) is captured by each principal component.

### $$ \text{ExpVar}_i = \bigg(\frac{\text{eigenvalue}_i}{\sum_{j=1}^{n}{\text{eigenvalue}_j}}\bigg) \times 100$$
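A quick worked example of the formula, using four made-up eigenvalues rather than the ones from the lab data:

```python
import numpy as np

# Made-up eigenvalues, already sorted from largest to smallest
eigenvalues = np.array([2.0, 1.0, 0.5, 0.5])

# Explained variance (%) of each component: eigenvalue / sum of eigenvalues * 100
exp_var = eigenvalues / eigenvalues.sum() * 100

# Cumulative explained variance (%) as components are added
cum_exp_var = np.cumsum(exp_var)

print(exp_var)      # 50, 25, 12.5 and 12.5 percent
print(cum_exp_var)  # 50, 75, 87.5 and 100 percent
```

Here the first two components together capture 75% of the variance, and all four together capture 100%.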

In [None]:
def calculate_cum_var_exp(eig_vals):
    # Sort eigenvalues in descending order so components are ranked by variance
    sorted_vals = np.sort(np.real(eig_vals))[::-1]
    # Percentage of total variance explained by each component
    var_exp = sorted_vals / sorted_vals.sum() * 100
    # Running total across components
    cum_var_exp = np.cumsum(var_exp)
    return cum_var_exp

In [None]:
def plot_var_exp(eig_vals):

    cum_var_exp = calculate_cum_var_exp(eig_vals)

    plt.figure(figsize=(9,7))

    component_number = [i+1 for i in range(len(cum_var_exp))]

    plt.plot(component_number, cum_var_exp, lw=7)

    plt.axhline(y=0, linewidth=5, color='grey', ls='dashed')
    plt.axhline(y=100, linewidth=3, color='grey', ls='dashed')

    ax = plt.gca()
    ax.set_xlim([1, len(cum_var_exp)])
    ax.set_ylim([-5,105])

    ax.set_ylabel('cumulative variance explained', fontsize=16)
    ax.set_xlabel('component', fontsize=16)

    for tick in ax.xaxis.get_major_ticks():
        tick.label1.set_fontsize(12)

    for tick in ax.yaxis.get_major_ticks():
        tick.label1.set_fontsize(12)

    ax.set_title('component vs cumulative variance explained\n', fontsize=20)

    plt.show()

In [None]:
plot_var_exp(eig_vals)

### 6. Using sklearn For PCA

    from sklearn.decomposition import PCA
    
- Create an instance of PCA
    - Fit X
- Plot the cumulative explained variance
- Apply dimensionality reduction to X with n_components=16
    - Fit and transform X
- Create a pairplot of PCA-transformed data

In [None]:
# ANSWER
# Create an instance of PCA (do not set n_components)

# Fit Xs (breast cancer dataset having standardised features)
from sklearn.decomposition import PCA

In [None]:
# Instantiate the PCA class
breast_pca = PCA()

# Fit PCA with standardised features
breast_pca.fit(Xs)

In [None]:
# ANSWER
# Plot cumulative variance explained vs number of components
cum_var = np.cumsum(breast_pca.explained_variance_ratio_) * 100
plt.plot(range(1, len(cum_var) + 1), cum_var)
plt.xlabel('Number of components')
plt.ylabel('Cumulative explained variance (%)')
plt.show()

In [None]:
# Plot cumulative variance explained vs number of components with custom plot_var_exp function
plot_var_exp(breast_pca.explained_variance_)
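The manual eigenvalue route and sklearn's `PCA` should agree on the share of variance per component. The sketch below checks this on synthetic data (`X_demo` is made up for illustration); it compares normalised eigenvalues of the correlation matrix with `explained_variance_ratio_`.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)

# Synthetic data: 500 samples, 4 features, with two correlated pairs
X_demo = rng.normal(size=(500, 4))
X_demo[:, 1] += X_demo[:, 0]
X_demo[:, 3] -= 0.5 * X_demo[:, 2]

# Standardise column-wise
Xs_demo = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

# Manual route: eigenvalues of the correlation matrix, largest first
eig_vals_demo = np.linalg.eigvalsh(np.corrcoef(Xs_demo, rowvar=False))[::-1]
manual_ratio = eig_vals_demo / eig_vals_demo.sum()

# sklearn route
pca_demo = PCA().fit(Xs_demo)

print(np.allclose(manual_ratio, pca_demo.explained_variance_ratio_))  # True
```

Note that `explained_variance_` itself differs from the correlation-matrix eigenvalues by a factor of n / (n - 1), because sklearn uses the sample covariance; the ratios are unaffected.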

In [None]:
# Create another instance of PCA (this time with n_components = 16)
breast_pca = PCA(n_components=16)

# Fit PCA with standardised features
breast_pca.fit(Xs)
std_x_pca = breast_pca.transform(Xs)
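This lab fixes the number of components at 16. One common alternative, sketched here on synthetic data, is to keep the smallest number of components that reaches a chosen share of the variance. The 95% threshold and the data `X_demo` are assumptions made for this example, not values taken from the lab.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(7)

# Synthetic data: 10 observed features driven by 3 latent factors plus noise
latent = rng.normal(size=(400, 3))
mixing = rng.normal(size=(3, 10))
X_demo = latent @ mixing + 0.1 * rng.normal(size=(400, 10))
Xs_demo = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

# Manual choice: first component count whose cumulative ratio reaches 95%
cum_ratio = np.cumsum(PCA().fit(Xs_demo).explained_variance_ratio_)
k_manual = int(np.argmax(cum_ratio >= 0.95)) + 1

# sklearn can make the same choice when n_components is a float in (0, 1)
pca_95 = PCA(n_components=0.95, svd_solver="full").fit(Xs_demo)

print(k_manual, pca_95.n_components_)
```

Both numbers should match, and `pca_95.transform` then returns only that many columns.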

In [None]:
# ANSWER
# Show principal components as a dataframe (the pairplot follows below)

pd.DataFrame(std_x_pca).head()

In [None]:
pd.DataFrame(Xs, columns=feature_columns)

In [None]:
sns.pairplot(pd.DataFrame(std_x_pca), kind='reg');

In [None]:
# Plot PC1 vs PC2
plt.scatter(std_x_pca[:, 0], std_x_pca[:, 1])
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.show()

You should notice that the transformed features are decorrelated: no pair of principal components shows an increasing or decreasing trend.
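The decorrelation can also be checked numerically rather than by eye. The sketch below uses synthetic, strongly correlated features (`X_demo` is made up for illustration), projects them onto the eigenvectors of their correlation matrix, and compares the correlation matrices before and after.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic data: 3 features that all follow one shared signal plus small noise
signal = rng.normal(size=(300, 1))
X_demo = np.hstack([signal + 0.3 * rng.normal(size=(300, 1)) for _ in range(3)])
Xs_demo = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

# Correlation before PCA: large off-diagonal entries
corr_before = np.corrcoef(Xs_demo, rowvar=False)

# Project onto the eigenvectors of the correlation matrix (the principal axes)
vals, vecs = np.linalg.eigh(corr_before)
scores = Xs_demo @ vecs

# Correlation after PCA: off-diagonal entries are numerically zero
corr_after = np.corrcoef(scores, rowvar=False)

print(np.round(corr_before, 2))
print(np.round(corr_after, 2))
```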

### 7. Split Data to 80/20 and use PCA prior to a supervised learning task

In this section we use PCA as a preprocessing step to a supervised learning algorithm.

In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split

Split the original dataset 80/20. Then apply standard scaler followed by PCA.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Apply standard scaler to X_train and X_test (fit_transform on X_train, transform on X_test):
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Instantiate the PCA class and set at 16 components
breast_pca = PCA(n_components=16)

# Apply PCA to the standardised features
X_train_scaled_pca = breast_pca.fit_transform(X_train_scaled)
X_test_scaled_pca = breast_pca.transform(X_test_scaled)

Apply a KNN algorithm on `X_train_scaled` and `X_train_scaled_pca` with 5 neighbours, then evaluate using `X_test_scaled` and `X_test_scaled_pca`. Has performance been impacted as a result of dimension reduction?

In [None]:
# Set KNN classifier to use 5 neighbors and fit to X_train_scaled
knn5 = KNeighborsClassifier(n_neighbors=5)
knn5.fit(X_train_scaled, y_train)

# Test accuracy of KNN using standardised data
print("Number of features in standardised data:       ", X_test_scaled.shape[1])
print("Test accuracy using standardised data:    ", knn5.score(X_test_scaled, y_test))


In [None]:
# Set KNN classifier to use 5 neighbors and fit to X_train_scaled_pca
knn5 = KNeighborsClassifier(n_neighbors=5)
knn5.fit(X_train_scaled_pca, y_train)

# Test accuracy of KNN using standardised PCA-transformed data
print("Number of features in standardised PCA-transformed data:       ", X_test_scaled_pca.shape[1])
print("Test accuracy using standardised PCA-transformed data:    ", knn5.score(X_test_scaled_pca, y_test))
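A single 80/20 split gives only one accuracy estimate. As a possible extension, the scaler, PCA and KNN can be chained in a pipeline and scored with `cross_val_score` (imported above), so that scaling and PCA are refitted on the training part of every fold. This sketch is self-contained and therefore loads the copy of the Wisconsin diagnostic dataset bundled with scikit-learn via `load_breast_cancer`; using that copy instead of the CSV, and 5 folds, are assumptions of this example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Bundled copy of the dataset: 569 samples, 30 features
X_demo, y_demo = load_breast_cancer(return_X_y=True)

# Scale, reduce to 16 components, then classify with 5 nearest neighbours
pipe = make_pipeline(
    StandardScaler(),
    PCA(n_components=16),
    KNeighborsClassifier(n_neighbors=5),
)

# 5-fold cross-validated accuracy; every step is refitted inside each fold
scores = cross_val_score(pipe, X_demo, y_demo, cv=5)

print(scores)
print(scores.mean())
```

Keeping the preprocessing inside the pipeline avoids leaking information from the held-out fold into the scaler or the PCA fit.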

**References**

[Breast Cancer Wisconsin (Diagnostic) Data Set](https://www.kaggle.com/uciml/breast-cancer-wisconsin-data/downloads/breast-cancer-wisconsin-data.zip/2)

[Breast Cancer Machine Learning Prediction](https://gtraskas.github.io/post/breast_cancer/)

[Understanding PCA (Principal Component Analysis) with Python](https://towardsdatascience.com/dive-into-pca-principal-component-analysis-with-python-43ded13ead21)



---

> > > > > > > > > © 2025 Institute of Data

---



