Theory of PCA with example

Principal Component Analysis (PCA) is a dimensionality reduction method that simplifies complex datasets while retaining most of their variance. It transforms the original variables into a new set of uncorrelated, mutually orthogonal variables called principal components. The components are ordered by the amount of variance they explain: the first captures the most, the second captures the most of what remains, and so on.
Suppose we want to use PCA to reduce a dataset with two features, weight and height, to a single dimension. After the input is standardized, PCA computes the covariance matrix of the features and finds its two eigenvectors. The eigenvector with the largest eigenvalue points in the direction of maximum variance in the data. Projecting the standardized data onto this eigenvector yields a one-dimensional representation of the dataset.
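
The steps in this example can be sketched directly with NumPy. The height and weight values below are made up purely for illustration, and the variable names are assumptions rather than part of the notebook; this is a minimal sketch of the covariance/eigenvector route, not how scikit-learn implements `PCA` internally.

```python
import numpy as np

# Hypothetical height (cm) and weight (kg) measurements, invented for illustration.
data = np.array([
    [150.0, 50.0],
    [160.0, 58.0],
    [165.0, 63.0],
    [170.0, 68.0],
    [180.0, 80.0],
    [185.0, 84.0],
])

# 1. Standardize each feature to zero mean and unit variance.
standardized = (data - data.mean(axis=0)) / data.std(axis=0)

# 2. Covariance matrix of the standardized features (2 x 2).
cov = np.cov(standardized, rowvar=False)

# 3. Eigendecomposition; eigh is meant for symmetric matrices and
#    returns eigenvalues in ascending order.
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# 4. The eigenvector with the largest eigenvalue is the first principal component.
first_component = eigenvectors[:, np.argmax(eigenvalues)]

# 5. Project the standardized data onto it to get one value per sample.
projected = standardized @ first_component

explained_ratio = eigenvalues.max() / eigenvalues.sum()
print("First principal component:", first_component)
print("Share of variance explained:", explained_ratio)
print("1-D representation:", projected)
```

The sign of `first_component` is arbitrary: flipping it gives an equally valid component and simply mirrors the projected values.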


In [None]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

In [None]:
iris = load_iris()
X_iris = iris.data
y_iris = iris.target

In [None]:
X_train_iris, X_test_iris, y_train_iris, y_test_iris = train_test_split(X_iris, y_iris, test_size=0.2, random_state=42)

In [None]:
model_iris_without_pca = RandomForestClassifier(random_state=42)
model_iris_without_pca.fit(X_train_iris, y_train_iris)
y_pred_iris_without_pca = model_iris_without_pca.predict(X_test_iris)
accuracy_iris_without_pca = accuracy_score(y_test_iris, y_pred_iris_without_pca)
print("Accuracy without PCA on Iris dataset:", accuracy_iris_without_pca)

Accuracy without PCA on Iris dataset: 1.0


In [None]:
scaler_iris = StandardScaler()
X_train_scaled_iris = scaler_iris.fit_transform(X_train_iris)
X_test_scaled_iris = scaler_iris.transform(X_test_iris)
pca_iris = PCA(n_components=2)
X_train_pca_iris = pca_iris.fit_transform(X_train_scaled_iris)
X_test_pca_iris = pca_iris.transform(X_test_scaled_iris)

In [None]:
model_iris_with_pca = RandomForestClassifier(random_state=42)
model_iris_with_pca.fit(X_train_pca_iris, y_train_iris)
y_pred_iris_with_pca = model_iris_with_pca.predict(X_test_pca_iris)
accuracy_iris_with_pca = accuracy_score(y_test_iris, y_pred_iris_with_pca)
print("Accuracy with PCA on Iris dataset:", accuracy_iris_with_pca)

Accuracy with PCA on Iris dataset: 0.9
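
To see how much information the two retained components carry, one option is to inspect the `explained_variance_ratio_` attribute of the fitted `PCA` object. The sketch below rebuilds the same split and scaling so that it runs on its own; the shortened variable names are assumptions and not the notebook's own.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Same data, split, and seed as the cells above.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit the scaler and PCA on the training data only.
scaler = StandardScaler().fit(X_train)
pca = PCA(n_components=2).fit(scaler.transform(X_train))

# Fraction of the total training variance captured by each kept component.
ratios = pca.explained_variance_ratio_
print("Variance explained per component:", ratios)
print("Total variance retained:", ratios.sum())
```

A total well below 1 would mean that a noticeable part of the variance was discarded, although retained variance is not the same thing as retained class information.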


In [None]:
model_custom_without_pca = RandomForestClassifier(random_state=42)
model_custom_without_pca.fit(X_train_custom, y_train_custom)
y_pred_custom_without_pca = model_custom_without_pca.predict(X_test_custom)
accuracy_custom_without_pca = accuracy_score(y_test_custom, y_pred_custom_without_pca)
print("Accuracy without PCA on custom dataset:", accuracy_custom_without_pca)

Accuracy without PCA on custom dataset: 0.9435897435897436


In [None]:
scaler_custom = StandardScaler()
X_train_scaled_custom = scaler_custom.fit_transform(X_train_custom)
X_test_scaled_custom = scaler_custom.transform(X_test_custom)
pca_custom = PCA(n_components=2)
X_train_pca_custom = pca_custom.fit_transform(X_train_scaled_custom)
X_test_pca_custom = pca_custom.transform(X_test_scaled_custom)

In [None]:
model_custom_with_pca = RandomForestClassifier(random_state=42)
model_custom_with_pca.fit(X_train_pca_custom, y_train_custom)
y_pred_custom_with_pca = model_custom_with_pca.predict(X_test_pca_custom)
accuracy_custom_with_pca = accuracy_score(y_test_custom, y_pred_custom_with_pca)
print("Accuracy with PCA on custom dataset:", accuracy_custom_with_pca)

Accuracy with PCA on custom dataset: 0.2762820512820513


**Analyse the result. Write in brief a paragraph on your observations.**

The RandomForestClassifier performs very well on the Iris dataset: it reaches an accuracy of 1.0 on the 30-sample test set without PCA and 0.9 after the four features are reduced to two principal components. This suggests that the Iris classes are easily separable and that most of the useful information survives the reduction, although a few test samples are misclassified once two dimensions are dropped. The custom dataset is a harder task, with an accuracy of 0.9436 without PCA. With PCA applied, the accuracy falls sharply to 0.2763, which indicates that keeping only two components is not suitable for this dataset. PCA is unsupervised, so the directions of greatest variance are not necessarily the ones that separate the classes. It is possible that the features are only weakly correlated, so that two components cannot summarize them, or that the reduction discards information that is crucial for classification. Further experiments, such as retaining more components, would be needed to find out whether PCA can be used on the custom dataset without this loss.
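
One possible follow-up experiment is to vary the number of retained components and watch how accuracy responds. The sketch below uses Iris as a stand-in, because the loading code for the custom dataset is not shown here. It also swaps the single train/test split for 5-fold cross-validation, which is a deliberate change from the cells above, and the `scores` dictionary is an illustrative name.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

scores = {}
for n in range(1, X.shape[1] + 1):
    # The pipeline refits the scaler and PCA inside every fold,
    # so no information from the held-out fold leaks into them.
    pipeline = make_pipeline(
        StandardScaler(),
        PCA(n_components=n),
        RandomForestClassifier(random_state=42),
    )
    scores[n] = cross_val_score(pipeline, X, y, cv=5).mean()
    print(f"n_components={n}: mean CV accuracy = {scores[n]:.3f}")
```

Running the same loop on the custom dataset, with `range` extended to its feature count, would show whether the accuracy recovers as more components are kept.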




