# Lecture 10. Principal Component Analysis (PCA)

## What is Principal Component Analysis (PCA)?

Principal Component Analysis (PCA) is a dimensionality reduction technique used in machine learning and data analysis. It is a powerful tool for simplifying complex datasets by transforming the original set of features into a smaller set of uncorrelated variables called principal components.

The main idea behind PCA is to find the directions (principal components) that maximize the variance in the data. These principal components are orthogonal (perpendicular) to each other, and they capture the most important patterns or characteristics in the data.
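The variance-maximization idea can be written compactly. The following is a sketch under the usual assumption that the data matrix $X$ has been centered; the symbols $\Sigma$ (sample covariance matrix of the features) and $w_k$ (the $k$-th principal direction) are notation introduced here for illustration, not taken from the lecture.

$$
w_1 = \arg\max_{\|w\| = 1} \operatorname{Var}(Xw) = \arg\max_{\|w\| = 1} w^\top \Sigma\, w,
\qquad
w_k = \arg\max_{\substack{\|w\| = 1 \\ w \perp w_1, \dots, w_{k-1}}} w^\top \Sigma\, w .
$$

The maximizers are eigenvectors of $\Sigma$, and the variance attained along $w_k$ is the corresponding eigenvalue $\lambda_k$, which is why the procedure reduces to an eigen-decomposition of the covariance matrix.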

### Why use PCA?

PCA is useful in several scenarios, including:

1. **Data Visualization**: PCA can be used to visualize high-dimensional data in 2D or 3D space by projecting the data onto the first few principal components.

2. **Dimensionality Reduction**: PCA can reduce the number of features in a dataset by keeping only the most important principal components, which can improve model performance and computational efficiency.

3. **Noise Reduction**: PCA can help remove noise and redundancy from the data by keeping the high-variance components, which are assumed to carry the signal, and discarding the low-variance components, which are assumed to be mostly noise.

4. **Feature Extraction**: PCA can be used for feature extraction by transforming the original features into a new set of uncorrelated features (principal components).
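To make the noise-reduction use case above concrete, here is a minimal sketch on made-up data. Everything in it is an assumption for illustration: the data are synthetic (a 2-dimensional signal embedded in 10 features plus Gaussian noise), and names such as `clean`, `noisy`, and `pca_demo` are hypothetical. It projects the noisy data onto two components and maps it back with `inverse_transform`, then compares the error against the noise-free signal.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Hypothetical data: a 2-dimensional signal embedded in 10 features, plus noise.
latent = rng.normal(size=(500, 2))
mixing = rng.normal(size=(2, 10))
clean = latent @ mixing
noisy = clean + 0.3 * rng.normal(size=clean.shape)

# Keep the two highest-variance components, then map back to the 10 original features.
pca_demo = PCA(n_components=2)
denoised = pca_demo.inverse_transform(pca_demo.fit_transform(noisy))

# Mean squared error against the noise-free signal, before and after the projection.
err_noisy = np.mean((noisy - clean) ** 2)
err_denoised = np.mean((denoised - clean) ** 2)
print(f"MSE before: {err_noisy:.4f}, after: {err_denoised:.4f}")
```

This works here only because the signal really is low-dimensional and has much larger variance than the noise; on real data that assumption should be checked rather than taken for granted.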

### How does PCA work?

PCA follows these steps:

1. **Standardize the data**: Center the data by subtracting the mean from each feature, and scale the data by dividing each feature by its standard deviation.

2. **Calculate the covariance matrix**: Compute the covariance matrix, which holds the variance of each feature on its diagonal and the covariance between each pair of features elsewhere.

3. **Calculate the eigenvectors and eigenvalues**: Find the eigenvectors and eigenvalues of the covariance matrix. Eigenvectors represent the principal components, and eigenvalues represent the amount of variance captured by each principal component.

4. **Select principal components**: Choose the top `k` eigenvectors (principal components) that capture the most variance in the data.

5. **Project the data**: Transform the original data onto the new subspace defined by the selected principal components.

After performing PCA, the transformed data (principal components) can be used as input for various machine learning algorithms or for data visualization and analysis.
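The five steps above can be sketched directly in NumPy. This is an illustrative implementation, not the one scikit-learn uses internally (scikit-learn relies on a singular value decomposition); the function name `pca_from_scratch` and the toy data are hypothetical.

```python
import numpy as np

def pca_from_scratch(X, k):
    """Project X (n_samples x n_features) onto its top-k principal components."""
    # Step 1: standardize each feature (zero mean, unit variance).
    X_std = (X - X.mean(axis=0)) / X.std(axis=0)
    # Step 2: covariance matrix of the standardized features.
    cov = np.cov(X_std, rowvar=False)
    # Step 3: eigen-decomposition; eigh is meant for symmetric matrices
    # and returns eigenvalues in ascending order.
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    # Step 4: sort by decreasing eigenvalue and keep the top k eigenvectors.
    order = np.argsort(eigenvalues)[::-1]
    eigenvalues = eigenvalues[order]
    components = eigenvectors[:, order[:k]]
    # Step 5: project the standardized data onto the selected components.
    X_projected = X_std @ components
    explained_ratio = eigenvalues[:k] / eigenvalues.sum()
    return X_projected, components, explained_ratio

# Hypothetical toy data: 200 samples, 5 features, two of them strongly correlated.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 5))
X_toy[:, 1] = X_toy[:, 0] + 0.1 * rng.normal(size=200)

Z, W, ratio = pca_from_scratch(X_toy, k=2)
print(Z.shape)  # (200, 2)
print(ratio)    # fraction of total variance captured by each component
```

The columns of `W` are orthonormal, and the projected columns of `Z` are uncorrelated with each other, which matches the properties of principal components described earlier.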

PCA is a powerful technique, but it also has limitations. It is a linear method: each principal component is a linear combination of the original features, so PCA may miss structure in data whose important relationships are non-linear. Additionally, because each component mixes several original features, the components may not have a clear interpretation, which can make the results difficult to explain.

---

Using the Pima Diabetes dataset again, we would like to demonstrate PCA.

In [1]:
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load the dataset
file_path = 'https://raw.githubusercontent.com/npradaschnor/Pima-Indians-Diabetes-Dataset/master/diabetes.csv'
data = pd.read_csv(file_path)

# Split the data into features and target
X = data.drop('Outcome', axis=1)
y = data['Outcome']

# Standardize the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

In [2]:
# Perform PCA
pca = PCA(n_components=2)  # Choose the number of principal components
X_pca = pca.fit_transform(X_scaled)

Here we keep only two components and check how well a model trained on them performs.
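Before training a model, it can help to check how much of the total variance a given number of components retains. The sketch below uses the `explained_variance_ratio_` attribute of scikit-learn's `PCA`. To stay self-contained it runs on a random stand-in matrix `X_demo` (an assumption for illustration, not the diabetes data), so the printed numbers say nothing about the real dataset.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the standardized features: 300 samples, 8 features.
rng = np.random.default_rng(0)
X_demo = StandardScaler().fit_transform(rng.normal(size=(300, 8)))

# Fit PCA with all components to see the full variance breakdown.
pca_full = PCA().fit(X_demo)
ratios = pca_full.explained_variance_ratio_
cumulative = np.cumsum(ratios)

for i, (r, c) in enumerate(zip(ratios, cumulative), start=1):
    print(f"PC{i}: {r:.3f} of variance, {c:.3f} cumulative")
```

In the notebook itself, the same check on the two-component model would be `pca.explained_variance_ratio_.sum()`, which gives the fraction of variance in `X_scaled` that the two components keep.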

In [3]:

# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X_pca, y, test_size=0.2, random_state=42)

# Train a logistic regression model on the principal components
logreg = LogisticRegression()
logreg.fit(X_train, y_train)

# Make predictions on the test set
y_pred = logreg.predict(X_test)

# Calculate the accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")

Accuracy: 0.7077922077922078


The accuracy has dropped from about 75% in the previous notes (for example, with the ensemble models) to about 71%. This suggests that reducing the features to two principal components loses some information that is useful for prediction.
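One way to explore this trade-off is to compare several values of `n_components` with cross-validation. The sketch below is a hedged illustration rather than a rerun of the experiment above. It uses a synthetic dataset from `make_classification` as a stand-in for the diabetes data, so its scores are not comparable to the numbers in these notes. It also wraps the scaler, PCA, and classifier in a pipeline so that the scaling and the components are fitted on the training folds only.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the diabetes data: 8 features, binary target.
X_demo, y_demo = make_classification(
    n_samples=768, n_features=8, n_informative=5, n_redundant=2, random_state=42
)

# Mean 5-fold cross-validated accuracy for each number of components.
scores_by_k = {}
for k in range(1, 9):
    model = make_pipeline(StandardScaler(), PCA(n_components=k), LogisticRegression())
    scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy")
    scores_by_k[k] = scores.mean()
    print(f"k={k}: mean CV accuracy = {scores_by_k[k]:.3f}")
```

To try this on the real data, replace `X_demo` and `y_demo` with the unscaled `X` and `y` from the first cell. Passing a float such as `PCA(n_components=0.95, svd_solver="full")` instead asks scikit-learn to keep just enough components to explain that fraction of the variance.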