# Dimensionality Reduction and Classification Performance Analysis on the Mushroom Dataset

This notebook explores the impact of Principal Component Analysis (PCA) on classification performance using Logistic Regression. We use the Mushroom Dataset, which consists entirely of categorical features, to demonstrate how dimensionality reduction can affect model accuracy and interpretability.

## 1. Load and Prepare the Mushroom Dataset

We begin by loading the Mushroom Dataset. If the dataset is not available locally, we download it from the UCI Machine Learning Repository.

In [ ]:
# Load the Mushroom Dataset
import os

import pandas as pd

DATA_URL = "https://archive.ics.uci.edu/ml/machine-learning-databases/mushroom/agaricus-lepiota.data"
COLUMN_NAMES = [
    "class", "cap-shape", "cap-surface", "cap-color", "bruises", "odor",
    "gill-attachment", "gill-spacing", "gill-size", "gill-color", "stalk-shape",
    "stalk-root", "stalk-surface-above-ring", "stalk-surface-below-ring",
    "stalk-color-above-ring", "stalk-color-below-ring", "veil-type", "veil-color",
    "ring-number", "ring-type", "spore-print-color", "population", "habitat"
]

local_csv = "../../data/mushroom.csv"
if not os.path.exists(local_csv):
    # Download once and cache locally; create the cache directory if it is missing
    df = pd.read_csv(DATA_URL, header=None, names=COLUMN_NAMES)
    os.makedirs(os.path.dirname(local_csv), exist_ok=True)
    df.to_csv(local_csv, index=False)
else:
    df = pd.read_csv(local_csv)

df.head()

## 2. One-Hot Encoding of Categorical Features

Since all features are categorical, we need to convert them into a numerical format suitable for PCA. One-hot encoding transforms each categorical variable into a set of binary indicator columns, giving PCA a numeric vector space to operate in.

**Why is one-hot encoding necessary before PCA?**

PCA requires numerical input and operates on the covariance structure of the data. Categorical variables must be converted to a numeric format so that PCA can capture relationships and redundancy among features.

In [ ]:
# Apply one-hot encoding to all categorical features except the target
X_raw = df.drop("class", axis=1)
y = df["class"]
X_encoded = pd.get_dummies(X_raw)
X_encoded.head()
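To make the encoding step concrete, here is a minimal sketch on a made-up four-row frame (the values are illustrative and not taken from the dataset; `toy` and `toy_encoded` are hypothetical names). It also shows one source of the redundancy that PCA later exploits: the indicator columns of each original feature always sum to 1.

```python
import pandas as pd

# Toy frame with two categorical features (values are made up for illustration).
toy = pd.DataFrame({
    "odor": ["n", "f", "n", "a"],
    "gill-size": ["b", "n", "b", "b"],
})

# One binary column per (feature, category) pair; dtype=int gives 0/1 instead of booleans.
toy_encoded = pd.get_dummies(toy, dtype=int)
print(toy_encoded.columns.tolist())

# Within one original feature, the indicator columns sum to 1 in every row,
# so the columns of each group are linearly dependent.
odor_cols = [c for c in toy_encoded.columns if c.startswith("odor_")]
row_sums = toy_encoded[odor_cols].sum(axis=1).tolist()
print(row_sums)
```

Two features with three and two categories become five columns, which is why the column count grows so quickly on the full dataset.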

## 3. Separate Features and Target Variable

We have already separated the features (`X_encoded`) from the target variable (`y`). The target variable indicates whether a mushroom is edible ('e') or poisonous ('p').

In [ ]:
# Confirm separation
print("Features shape:", X_encoded.shape)
print("Target shape:", y.shape)

## 4. Print Dataset Dimensions After Encoding

Let's observe the increase in the number of features after one-hot encoding.

In [ ]:
# Print the shape of the dataset after encoding
print(f"Number of samples: {X_encoded.shape[0]}")
print(f"Number of features after one-hot encoding: {X_encoded.shape[1]}")

## 5. Standardize Features

Standardization puts every feature on the same scale so that each contributes equally to the analysis, even when the features are binary. PCA is sensitive to the scale of its inputs, so standardizing prevents features with larger variances from dominating the principal components.

**Why standardize binary features?**

Even though one-hot encoded features are binary, their variance depends on how frequent each category is: an indicator that equals 1 with frequency p has variance p(1 - p). Standardization centers each column and scales it to unit variance, so PCA weights common and rare categories equally.

In [ ]:
# Standardize the features
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_encoded)
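The effect of category frequency on variance can be checked by hand. The sketch below uses two hypothetical indicator columns (not drawn from the dataset) and a small `standardize` helper that mirrors what `StandardScaler` does per column: subtract the mean and divide by the population standard deviation.

```python
import numpy as np

# Two hypothetical indicator columns: a common category and a rare one.
common = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0], dtype=float)  # frequency 0.5
rare = np.array([1, 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=float)    # frequency 0.1

for name, col in [("common", common), ("rare", rare)]:
    p = col.mean()
    # For a 0/1 column the population variance equals p * (1 - p).
    print(name, "variance:", col.var(), "p*(1-p):", p * (1 - p))

def standardize(col):
    """Center a column and scale it to unit (population) variance."""
    return (col - col.mean()) / col.std()

common_z = standardize(common)
rare_z = standardize(rare)
print("unit variance after scaling:", common_z.var(), rare_z.var())
```

After scaling, the single 1 in the rare column maps to roughly 3 while a 1 in the common column maps to 1, so standardization gives rare categories more weight per occurrence. Whether that is desirable depends on the analysis.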

## 6. Apply PCA

We apply PCA to the standardized, one-hot encoded dataset without specifying the number of components initially. This allows us to analyze the explained variance for all components.

In [ ]:
# Apply PCA
from sklearn.decomposition import PCA

pca = PCA()
X_pca = pca.fit_transform(X_scaled)
explained_variance = pca.explained_variance_ratio_
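For intuition about what the explained variance ratio measures, the following sketch computes it directly from the eigenvalues of a covariance matrix. It runs on synthetic data standing in for `X_scaled` (the array `X` and its correlated second column are assumptions made for illustration) and is not a replacement for the scikit-learn call above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for X_scaled: 200 samples, 5 features, two of them correlated.
X = rng.normal(size=(200, 5))
X[:, 1] = X[:, 0] + 0.1 * rng.normal(size=200)
X = (X - X.mean(axis=0)) / X.std(axis=0)

# Covariance matrix of the standardized features.
cov = np.cov(X, rowvar=False)

# eigvalsh returns eigenvalues of a symmetric matrix in ascending order;
# reverse them so the largest (first principal component) comes first.
eigenvalues = np.linalg.eigvalsh(cov)[::-1]

# Each component's share of the total variance.
explained_variance_ratio = eigenvalues / eigenvalues.sum()
print(np.round(explained_variance_ratio, 3))
```

Because two of the five columns are nearly identical, the first component absorbs roughly two features' worth of variance and the last one is close to zero, which is the same pattern redundant one-hot columns produce.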

## 7. Create Scree Plot and Cumulative Explained Variance Plot

A scree plot helps visualize how much variance each principal component explains. The cumulative plot shows how much total variance is explained as we add more components.

In [ ]:
import matplotlib.pyplot as plt
import numpy as np

plt.figure(figsize=(10,5))
plt.plot(np.arange(1, len(explained_variance)+1), explained_variance, marker='o', label='Explained Variance Ratio')
plt.plot(np.arange(1, len(explained_variance)+1), np.cumsum(explained_variance), marker='s', label='Cumulative Explained Variance')
plt.xlabel('Principal Component')
plt.ylabel('Variance Ratio')
plt.title('Scree Plot and Cumulative Explained Variance')
plt.legend()
plt.grid(True)
plt.show()

## 8. Determine Optimal Number of Principal Components

We select the smallest number of principal components that retains at least 95% of the variance. This balances dimensionality reduction with information preservation.

**Justification:**

Retaining 95% of the variance, a common rule of thumb, preserves most of the variation in the original data while reducing redundancy and collinearity.

In [ ]:
# Find the number of components to retain 95% variance
cum_var = np.cumsum(explained_variance)
optimal_components = np.argmax(cum_var >= 0.95) + 1
print(f"Optimal number of principal components to retain 95% variance: {optimal_components}")
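The `np.argmax(...) + 1` idiom is compact but easy to misread, so here is a small worked sketch with hypothetical variance ratios (the numbers and the helper name `components_for` are made up for illustration).

```python
import numpy as np

# Hypothetical explained-variance ratios for six components (they sum to 1).
ratios = np.array([0.50, 0.25, 0.15, 0.06, 0.03, 0.01])
toy_cum_var = np.cumsum(ratios)  # roughly 0.50, 0.75, 0.90, 0.96, 0.99, 1.00

def components_for(threshold, cum_var):
    """Smallest k such that the first k components reach the threshold."""
    # np.argmax on a boolean array returns the index of the first True;
    # adding 1 converts the zero-based index into a component count.
    # Caveat: if no entry reaches the threshold, argmax returns 0 and this yields 1.
    return int(np.argmax(cum_var >= threshold)) + 1

for t in (0.80, 0.95, 0.995):
    print(f"threshold {t}: keep {components_for(t, toy_cum_var)} components")
```

With these numbers, the 95% threshold is first met by the fourth component. scikit-learn's `PCA` also accepts a fraction between 0 and 1 as `n_components`, which performs a comparable selection internally.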

## 9. Project Data onto First Two Principal Components and Visualize

We project the data onto the first two principal components and visualize the separability of edible and poisonous mushrooms.

In [ ]:
# 2D scatter plot of first two principal components
plt.figure(figsize=(8,6))
colors = {'e': 'green', 'p': 'red'}
for label in ['e', 'p']:
    idx = (y == label)
    plt.scatter(X_pca[idx, 0], X_pca[idx, 1], c=colors[label], label=f"{'Edible' if label=='e' else 'Poisonous'}", alpha=0.8, s=10)
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.title('Projection onto First Two Principal Components')
plt.legend()
plt.grid(True)
plt.show()

## 10. Visualize Additional Principal Component Pair Plots (if needed)

If more than two principal components are retained, we can visualize additional pair plots to further explore class separability.

In [ ]:
# Pair plot for first four principal components (if optimal_components > 2)
import seaborn as sns

if optimal_components > 2:
    pcs_df = pd.DataFrame(X_pca[:, :4], columns=['PC1', 'PC2', 'PC3', 'PC4'])
    pcs_df['class'] = y.values
    sns.pairplot(pcs_df, hue='class', palette=colors, plot_kws={'alpha':0.5, 's':10})
    plt.suptitle('Pair Plots of First Four Principal Components', y=1.02)
    plt.show()

## 11. Split Data into Training and Testing Sets

We split the standardized original data into training and testing sets for model evaluation.

In [ ]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, random_state=42, stratify=y
)

## 12. Train and Evaluate Baseline Logistic Regression Model

We train a Logistic Regression classifier on the original standardized data and evaluate its performance.

In [ ]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, accuracy_score

# Train baseline model
baseline_clf = LogisticRegression(max_iter=1000, random_state=42)
baseline_clf.fit(X_train, y_train)
y_pred_baseline = baseline_clf.predict(X_test)

# Evaluate
print("Baseline Logistic Regression Performance:")
print(classification_report(y_test, y_pred_baseline))
print(f"Accuracy: {accuracy_score(y_test, y_pred_baseline):.4f}")

## 13. Transform Data Using Optimal Principal Components

We transform both training and testing sets using the selected number of principal components.

In [ ]:
# Transform data using optimal number of principal components
pca_opt = PCA(n_components=optimal_components)
X_train_pca = pca_opt.fit_transform(X_train)
X_test_pca = pca_opt.transform(X_test)

## 14. Train and Evaluate Logistic Regression on PCA-Transformed Data

We train a Logistic Regression classifier on the PCA-transformed training data and evaluate its performance.

In [ ]:
# Train model on PCA-transformed data
pca_clf = LogisticRegression(max_iter=1000, random_state=42)
pca_clf.fit(X_train_pca, y_train)
y_pred_pca = pca_clf.predict(X_test_pca)

# Evaluate
print("Logistic Regression Performance on PCA-Transformed Data:")
print(classification_report(y_test, y_pred_pca))
print(f"Accuracy: {accuracy_score(y_test, y_pred_pca):.4f}")
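In the steps above, the scaler is fitted and the component count is chosen on the full dataset before the split, so some information from the test rows reaches preprocessing. One possible variation, sketched here on synthetic data from `make_classification` rather than on the mushroom data (all names such as `X_demo` and `pipe` are hypothetical), is to chain the steps in a scikit-learn pipeline so that scaling and PCA are fitted on the training split only.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the mushroom dataset): 500 samples, 30 features.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=30, n_informative=8, n_redundant=10, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# A fractional n_components asks PCA to keep enough components
# to explain at least that share of the variance.
pipe = make_pipeline(
    StandardScaler(),
    PCA(n_components=0.95),
    LogisticRegression(max_iter=1000, random_state=42),
)

# fit() learns the scaling, the projection, and the classifier from the training split only.
pipe.fit(X_tr, y_tr)

n_kept = pipe.named_steps["pca"].n_components_
test_accuracy = pipe.score(X_te, y_te)
print(f"Components kept: {n_kept}")
print(f"Test accuracy: {test_accuracy:.4f}")
```

The numbers printed here say nothing about the mushroom data; the point is only the structure, which could be applied to `X_encoded` and `y` in place of the synthetic arrays.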

## 15. Compare and Analyze Model Performance

Let's compare the performance metrics of the baseline and PCA-transformed models.

- **Did dimensionality reduction affect classification accuracy?**
- **Was there a trade-off between reducing redundancy and information loss?**
- **Did PCA help with feature collinearity and redundancy?**
- **Is Logistic Regression a good surrogate for evaluating PCA effectiveness?**

### Discussion

- If the accuracy and F1-score remain high after PCA, it suggests that most of the relevant information was retained, and redundancy was reduced.
- If performance drops significantly, it may indicate that some important information was lost during dimensionality reduction.
- PCA is particularly useful for handling collinearity and redundancy, which are common in one-hot encoded categorical datasets.
- Logistic Regression is a suitable surrogate for evaluating PCA effectiveness because, as a linear model, its performance is sensitive to feature quality and redundancy.

**Conclusion:**

Dimensionality reduction using PCA can simplify models and reduce computational cost, but it's important to balance the reduction in dimensionality against information loss. In high-dimensional categorical datasets, PCA can yield a more compact model and potentially better performance, especially when features are highly redundant, although each component mixes many original categories and is therefore harder to interpret than a single indicator column.
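To answer the comparison questions with numbers side by side, a small helper like the one below could collect accuracy and F1 for each model into one table. This is a sketch with made-up labels to show the output format; `compare_models` is a hypothetical name, and treating 'p' (poisonous) as the positive class is an assumption.

```python
import pandas as pd
from sklearn.metrics import accuracy_score, f1_score

def compare_models(y_true, predictions, pos_label="p"):
    """Return one row of accuracy and F1 per model name in `predictions`."""
    rows = {}
    for name, y_pred in predictions.items():
        rows[name] = {
            "accuracy": accuracy_score(y_true, y_pred),
            "f1": f1_score(y_true, y_pred, pos_label=pos_label),
        }
    return pd.DataFrame.from_dict(rows, orient="index")

# Made-up labels purely to show the output format; not results from the notebook.
toy_true = ["e", "p", "p", "e", "p", "e"]
toy_predictions = {
    "baseline": ["e", "p", "p", "e", "p", "e"],
    "pca": ["e", "p", "e", "e", "p", "e"],
}

summary = compare_models(toy_true, toy_predictions)
print(summary)
```

In the notebook itself, the call would look like `compare_models(y_test, {"baseline": y_pred_baseline, "pca": y_pred_pca})`.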