### Question 1: What is K-Nearest Neighbors (KNN) and how does it work in both classification and regression problems?

->  K-Nearest Neighbors (KNN) is a supervised, non-parametric, instance-based machine learning algorithm.
It does not build an explicit model during training; instead, it stores the training data and makes predictions based on similarity.

How KNN Works (General Steps)

1. Choose the value of K (number of nearest neighbors).

2. Calculate the distance between the new data point and all training points (commonly Euclidean distance).

3. Select the K closest points.

4. Make a prediction based on those K neighbors.

Distance Metrics Commonly Used

- Euclidean Distance (most common)

- Manhattan Distance

- Minkowski Distance

- Cosine Distance, i.e. 1 minus cosine similarity (for text/high-dimensional data)
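The metrics listed above can be written out in a few lines of NumPy. This is a minimal sketch for illustration; the helper names `minkowski` and `cosine_distance` and the two sample vectors are assumptions, not part of the original notebook. Euclidean and Manhattan distance are the Minkowski distance with `p=2` and `p=1` respectively.

```python
import numpy as np

def minkowski(a, b, p):
    """Minkowski distance of order p between two 1-D arrays (hypothetical helper)."""
    return np.sum(np.abs(a - b) ** p) ** (1.0 / p)

def cosine_distance(a, b):
    """1 - cosine similarity; depends on the angle between vectors, not their length."""
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Two made-up points, chosen so the arithmetic is easy to check by hand
a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 6.0, 3.0])

euclidean = minkowski(a, b, p=2)  # sqrt(3^2 + 4^2 + 0^2) = 5
manhattan = minkowski(a, b, p=1)  # 3 + 4 + 0 = 7
cosine = cosine_distance(a, b)

print(euclidean, manhattan, cosine)
```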

KNN for Classification

How it works:

- The algorithm looks at the K nearest data points.

- The class that appears most frequently among those neighbors is assigned to the new data point.

Example:

If K = 5 and among the 5 nearest neighbors:

- 3 belong to Class A

- 2 belong to Class B

The new data point is classified as Class A.

KNN for Regression

How it works:

- The algorithm finds the K nearest neighbors.

- The predicted value is the average (or weighted average) of their target values.

Example:

If K = 3 and neighbor values are:

50, 60, 70

Prediction = (50 + 60 + 70) / 3 = 60
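The two worked examples above can be reproduced with a small from-scratch implementation. This is only a sketch under simple assumptions (Euclidean distance, unweighted vote or mean, no tie-breaking rule); the function `knn_predict` and the toy 1-D data are made up for illustration and are not how scikit-learn implements KNN internally.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k, task="classification"):
    """Predict for one point using its k nearest training points (hypothetical helper)."""
    # Steps 2-3: Euclidean distance to every training point, then keep the k closest
    distances = np.linalg.norm(X_train - x_new, axis=1)
    nearest = np.argsort(distances)[:k]
    neighbor_targets = y_train[nearest]
    # Step 4: majority vote for classification, mean for regression
    if task == "classification":
        return Counter(neighbor_targets.tolist()).most_common(1)[0][0]
    return float(np.mean(neighbor_targets))

# Classification: the 5 nearest neighbors of x=3 are 3 x "A" and 2 x "B"
X_cls = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [20.0], [21.0]])
y_cls = np.array(["A", "A", "B", "A", "B", "B", "B"])
predicted_class = knn_predict(X_cls, y_cls, np.array([3.0]), k=5)

# Regression: the 3 nearest neighbors of x=2 have targets 50, 60, 70
X_reg = np.array([[1.0], [2.0], [3.0], [10.0]])
y_reg = np.array([50.0, 60.0, 70.0, 200.0])
predicted_value = knn_predict(X_reg, y_reg, np.array([2.0]), k=3, task="regression")

print(predicted_class, predicted_value)
```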



### Question 2: What is the Curse of Dimensionality and how does it affect KNN performance?


-> The Curse of Dimensionality refers to the problems that arise when the number of features (dimensions) in a dataset becomes very large. As dimensions increase, the volume of the feature space grows exponentially, so a fixed number of data points becomes sparse and distance-based similarity measures become less meaningful.


Why It Is Called a “Curse”

- More dimensions → much larger feature space

- Same number of data points spread very thinly

- Distances between points become almost equal

Effect of Curse of Dimensionality on KNN

KNN heavily relies on distance calculations, so high dimensionality seriously affects its performance.

1. Distance Becomes Less Meaningful

In high dimensions, the distance between the nearest and farthest neighbors becomes very similar.

KNN cannot clearly identify “nearest” neighbors.

Result: Poor prediction accuracy

2. Increased Computational Cost

KNN must compute distances in all dimensions for every query.

More dimensions = more calculations.

Result: Slower performance

3. Data Sparsity

High-dimensional data requires much more data to maintain density.

With limited data, neighbors may not truly be similar.

Result: Higher error rate

4. Noise Dominates

Irrelevant or noisy features distort distance calculations.

Important features lose influence.

Result: Misleading neighbors
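The claim that distances "become almost equal" can be checked with a small simulation. The sketch below is an illustration on uniformly random synthetic points (an assumption, not data from this notebook): for a random query point it measures the relative contrast `(farthest - nearest) / nearest`, which tends to shrink as the number of dimensions grows.

```python
import numpy as np

rng = np.random.default_rng(0)

def relative_contrast(n_points, n_dims):
    """(max distance - min distance) / min distance from one random query point."""
    X = rng.random((n_points, n_dims))   # synthetic points in the unit hypercube
    query = rng.random(n_dims)
    d = np.linalg.norm(X - query, axis=1)
    return (d.max() - d.min()) / d.min()

# Same number of points, increasing dimensionality
contrasts = {dim: relative_contrast(500, dim) for dim in (2, 10, 100, 1000)}

for dim, value in contrasts.items():
    print(f"{dim:>5} dimensions: relative contrast = {value:.3f}")
```

A small contrast means the nearest and farthest points are at nearly the same distance, so the "K nearest" neighbors are barely closer than any other point.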

### Question 3: What is Principal Component Analysis (PCA)? How is it different from feature selection?

-> Principal Component Analysis (PCA) is an unsupervised machine learning technique used for dimensionality reduction. It reduces the number of input features while preserving as much important information (variance) as possible. PCA transforms the original correlated features into a new set of uncorrelated variables called principal components, which are linear combinations of the original features. These components are ranked in order of the amount of variance they explain, and only the most significant ones are retained.

PCA is mainly used to handle high-dimensional data, remove redundancy, reduce noise, and improve model performance.

**Difference between PCA and Feature Selection:**

| Aspect | PCA | Feature Selection |
| --- | --- | --- |
| Method | Feature extraction | Feature selection |
| Output features | New transformed features | Original features |
| Data transformation | Yes | No |
| Interpretability | Low | High |
| Handling correlation | Removes correlation | May keep correlated features |
| Information loss | Possible | Minimal if done correctly |
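The contrast in the table can be seen side by side in code. This sketch uses the Wine dataset with scikit-learn's `PCA` and `SelectKBest`; the choice of `k=3`, `n_components=3`, and the `f_classif` score function are assumptions made only for illustration.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import StandardScaler

wine = load_wine()
X_scaled = StandardScaler().fit_transform(wine.data)
y = wine.target

# Feature extraction: 3 new columns, each a linear combination of all 13 features
pca = PCA(n_components=3)
X_pca = pca.fit_transform(X_scaled)

# Feature selection: 3 of the original columns are kept unchanged
selector = SelectKBest(score_func=f_classif, k=3)
X_selected = selector.fit_transform(X_scaled, y)
selected_names = [wine.feature_names[i] for i in selector.get_support(indices=True)]

print("PCA output shape:", X_pca.shape)
print("Each component mixes", pca.components_.shape[1], "original features")
print("Selected original features:", selected_names)
```

Note that `SelectKBest` uses the labels `y` (it is supervised here), while PCA never sees them.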

### Question 4: What are eigenvalues and eigenvectors in PCA, and why are they important?


=> In the context of Principal Component Analysis (PCA):

Eigenvectors of the data's covariance matrix are the directions (axes) along which the data varies the most. Each eigenvector defines one principal component, and the eigenvectors are orthogonal (perpendicular) to each other, so the resulting components are uncorrelated.

Eigenvalues are scalar values that represent the magnitude of variance along the corresponding eigenvector. A larger eigenvalue means that more variance is captured along that eigenvector. In PCA, eigenvalues tell us how much information (variance) each principal component captures from the original dataset.

**Importance in PCA:**

1. Dimensionality Reduction: Eigenvectors define the new feature space, and eigenvalues help in deciding which of these new dimensions (principal components) are most significant. By selecting the eigenvectors with the largest eigenvalues, we retain the principal components that capture the most variance, thus reducing dimensionality while preserving as much information as possible.

2.  Variance Explained: The ratio of an individual eigenvalue to the sum of all eigenvalues indicates the proportion of total variance explained by its corresponding principal component. This is crucial for understanding how much information is retained when reducing dimensions.

3. Data Transformation: PCA projects the original data onto the new coordinate system defined by the eigenvectors. This transformation results in a set of uncorrelated principal components, which simplifies subsequent analysis and modeling.
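The three points above can be traced by hand with NumPy. This is a minimal sketch, assuming standardized Wine data as the input; it computes the eigen-decomposition of the covariance matrix directly and compares the result with scikit-learn's `PCA`, which should agree up to the sign of each component.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(load_wine().data)

# Covariance matrix of the standardized features (13 x 13)
cov = np.cov(X_scaled, rowvar=False)

# eigh is meant for symmetric matrices; it returns eigenvalues in ascending order
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]          # sort descending by variance
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# Variance explained: each eigenvalue divided by the sum of all eigenvalues
explained_ratio = eigenvalues / eigenvalues.sum()

# Data transformation: project onto the top-2 eigenvectors
X_projected = X_scaled @ eigenvectors[:, :2]

# Reference result from scikit-learn for comparison
pca = PCA(n_components=2).fit(X_scaled)
print(explained_ratio[:2])
print(pca.explained_variance_ratio_)
```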

### Question 5: How do KNN and PCA complement each other when applied in a single pipeline?


=> K-Nearest Neighbors (KNN) and Principal Component Analysis (PCA) can complement each other in a pipeline, particularly when dealing with high-dimensional data, by mitigating some of the drawbacks of KNN and leveraging the strengths of PCA.

Here's how they complement each other:

1. Addressing the Curse of Dimensionality for KNN:

- KNN's Weakness: KNN's performance degrades significantly in high-dimensional spaces due to the "Curse of Dimensionality." In high dimensions, data points become sparse, and the concept of "nearest neighbor" becomes less meaningful as distances between all points tend to converge.
- PCA's Solution: PCA is a dimensionality reduction technique that transforms the data into a lower-dimensional space while retaining as much variance (information) as possible. By applying PCA before KNN, we can reduce the number of features, making the distance calculations more reliable and meaningful for KNN.

2. Improving Computational Efficiency:

- KNN's Cost: Without dimensionality reduction, KNN needs to calculate distances between a new data point and all training data points across all original features. This can be computationally expensive, especially with many features.
- PCA's Benefit: By reducing the number of dimensions with PCA, KNN's distance calculations become faster, leading to quicker prediction times (KNN itself has almost no training cost, since it only stores the data).

3. Reducing Noise and Overfitting:

- KNN's Sensitivity: KNN can be sensitive to noisy or irrelevant features, as these can unduly influence the distance calculations and lead to incorrect neighbor identification.
- PCA's Advantage: PCA often helps with noise reduction by keeping the components that explain the most variance; noise tends to fall into low-variance components that can then be discarded. This can make KNN more robust and less prone to overfitting on irrelevant features, though PCA is unsupervised, so a low-variance component is not guaranteed to be uninformative.

4. Feature Extraction vs. Feature Selection:

- PCA performs feature extraction, creating new, uncorrelated features (principal components) that are linear combinations of the original features. This is beneficial because it captures the underlying structure of the data.
- While feature selection might remove entire original features, PCA transforms them into a more compact and informative representation, which can be more effective for KNN.
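In scikit-learn, the combination described above is usually expressed as a single `Pipeline`, so that scaling and PCA are re-fit on the training portion of every cross-validation fold. The sketch below is one possible setup on the Wine dataset; the step names and the choices of 2 components, `n_neighbors=5`, and 5 folds are assumptions for illustration.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

# Scale -> reduce dimensions -> classify, all fitted together
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=2)),
    ("knn", KNeighborsClassifier(n_neighbors=5)),
])

# Each fold fits the scaler and PCA on its own training data only
scores = cross_val_score(pipeline, X, y, cv=5)
print("Fold accuracies:", scores)
print("Mean accuracy:", scores.mean())
```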

### Question 6: Train a KNN Classifier on the Wine dataset with and without feature scaling. Compare model accuracy in both cases.
(Include your Python code and output in the code box below.)

In [1]:
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

# Load dataset
wine = load_wine()
X = wine.data
y = wine.target

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# -------------------------------
# KNN WITHOUT Feature Scaling
# -------------------------------
knn_no_scaling = KNeighborsClassifier(n_neighbors=5)
knn_no_scaling.fit(X_train, y_train)
y_pred_no_scaling = knn_no_scaling.predict(X_test)
accuracy_no_scaling = accuracy_score(y_test, y_pred_no_scaling)

# -------------------------------
# KNN WITH Feature Scaling
# -------------------------------
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

knn_scaled = KNeighborsClassifier(n_neighbors=5)
knn_scaled.fit(X_train_scaled, y_train)
y_pred_scaled = knn_scaled.predict(X_test_scaled)
accuracy_scaled = accuracy_score(y_test, y_pred_scaled)

accuracy_no_scaling, accuracy_scaled


(0.7407407407407407, 0.9629629629629629)
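One likely reason for the gap between the two accuracies is that the raw Wine features live on very different scales, so the feature with the largest spread dominates the Euclidean distance. The short check below is an illustrative sketch (not part of the original answer) that prints the standard deviation of each raw feature.

```python
import numpy as np
from sklearn.datasets import load_wine

wine = load_wine()

# Standard deviation of every raw (unscaled) feature
stds = wine.data.std(axis=0)
for name, std in sorted(zip(wine.feature_names, stds), key=lambda t: -t[1]):
    print(f"{name:<30} std = {std:.2f}")

# Feature with the largest spread; it contributes most to unscaled distances
largest = wine.feature_names[int(np.argmax(stds))]
print("Largest spread:", largest)
```

After `StandardScaler`, every feature has unit variance, so no single feature dominates the neighbor search.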

### Question 7: Train a PCA model on the Wine dataset and print the explained variance ratio of each principal component.
(Include your Python code and output in the code box below.)


In [2]:
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Load the Wine dataset
wine = load_wine()
X = wine.data

# Standardize the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Train PCA model
pca = PCA()
pca.fit(X_scaled)

# Print explained variance ratio
explained_variance = pca.explained_variance_ratio_
explained_variance


array([0.36198848, 0.1920749 , 0.11123631, 0.0706903 , 0.06563294,
       0.04935823, 0.04238679, 0.02680749, 0.02222153, 0.01930019,
       0.01736836, 0.01298233, 0.00795215])
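The ratios are easier to interpret as a running total. The sketch below takes the values printed above, hard-coded so that it does not depend on re-running PCA, and computes how many components are needed to reach a cumulative variance threshold; the 90% and 95% thresholds are common conventions, used here as assumptions.

```python
import numpy as np

# Explained variance ratios copied from the output above
ratios = np.array([0.36198848, 0.1920749, 0.11123631, 0.0706903, 0.06563294,
                   0.04935823, 0.04238679, 0.02680749, 0.02222153, 0.01930019,
                   0.01736836, 0.01298233, 0.00795215])

cumulative = np.cumsum(ratios)

# Index of the first component at which the running total reaches the threshold
n_90 = int(np.argmax(cumulative >= 0.90) + 1)
n_95 = int(np.argmax(cumulative >= 0.95) + 1)

print("Cumulative variance:", np.round(cumulative, 4))
print("Components for 90%:", n_90)
print("Components for 95%:", n_95)
```

For these values, the first two components together explain about 55% of the variance, 8 components are needed for 90%, and 10 for 95%.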

### Question 8: Train a KNN Classifier on the PCA-transformed dataset (retain top 2 components). Compare the accuracy with the original dataset.
(Include your Python code and output in the code box below.)


In [3]:
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Load dataset
wine = load_wine()
X = wine.data
y = wine.target

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# -------------------------------
# KNN on ORIGINAL (scaled) dataset
# -------------------------------
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

knn_original = KNeighborsClassifier(n_neighbors=5)
knn_original.fit(X_train_scaled, y_train)
y_pred_original = knn_original.predict(X_test_scaled)
accuracy_original = accuracy_score(y_test, y_pred_original)

# -------------------------------
# PCA (top 2 components) + KNN
# -------------------------------
pca = PCA(n_components=2)
X_train_pca = pca.fit_transform(X_train_scaled)
X_test_pca = pca.transform(X_test_scaled)

knn_pca = KNeighborsClassifier(n_neighbors=5)
knn_pca.fit(X_train_pca, y_train)
y_pred_pca = knn_pca.predict(X_test_pca)
accuracy_pca = accuracy_score(y_test, y_pred_pca)

accuracy_original, accuracy_pca


(0.9629629629629629, 0.9814814814814815)

### Question 9: Train a KNN Classifier with different distance metrics (euclidean, manhattan) on the scaled Wine dataset and compare the results.
(Include your Python code and output in the code box below.)



In [4]:
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Load dataset (if not already loaded)
wine = load_wine()
X = wine.data
y = wine.target

# Split data into training and testing sets (if not already split)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Scale the features (if not already scaled)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# KNN with Euclidean Distance
knn_euclidean = KNeighborsClassifier(n_neighbors=5, metric='euclidean')
knn_euclidean.fit(X_train_scaled, y_train)
y_pred_euclidean = knn_euclidean.predict(X_test_scaled)
accuracy_euclidean = accuracy_score(y_test, y_pred_euclidean)

# KNN with Manhattan Distance
knn_manhattan = KNeighborsClassifier(n_neighbors=5, metric='manhattan')
knn_manhattan.fit(X_train_scaled, y_train)
y_pred_manhattan = knn_manhattan.predict(X_test_scaled)
accuracy_manhattan = accuracy_score(y_test, y_pred_manhattan)

print(f"Accuracy with Euclidean Distance: {accuracy_euclidean:.4f}")
print(f"Accuracy with Manhattan Distance: {accuracy_manhattan:.4f}")

Accuracy with Euclidean Distance: 0.9630
Accuracy with Manhattan Distance: 0.9630


## Compare and Discuss Results


The KNN classifier on the scaled Wine dataset achieved an accuracy of `0.9630` (52 of 54 test samples) with both Euclidean and Manhattan distance metrics. In this specific case, the two metrics give identical test accuracy.

This suggests that for the Wine dataset, the way distances are measured (either as the shortest straight line in Euclidean space or as the sum of absolute differences along axes in Manhattan space) makes little practical difference to classification. Equal accuracy does not by itself prove that the two models select the same neighbors or make the same predictions; it only shows that they make the same number of errors on this split.

Given the relatively low dimensionality (13 features) and well-separated classes of the Wine dataset, the choice of distance metric often has less impact compared to very high-dimensional or noisy datasets. Both metrics are generally effective in such scenarios. If the dataset had very high dimensionality, sparse features, or many outliers, the differences between Euclidean and Manhattan distances might become more pronounced.
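Because a single 54-sample test split is a small basis for comparison, the two metrics can also be compared with cross-validation. This is a sketch under assumed settings (5 stratified folds with a fixed seed, `n_neighbors=5`); the helper `make_knn_pipeline` is hypothetical and only wraps scaling and KNN so that scaling is re-fit per fold.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

def make_knn_pipeline(metric):
    """Scaling + KNN with the given distance metric (hypothetical helper)."""
    return make_pipeline(StandardScaler(),
                         KNeighborsClassifier(n_neighbors=5, metric=metric))

# Mean cross-validated accuracy for each metric on the same folds
cv_accuracy = {
    metric: cross_val_score(make_knn_pipeline(metric), X, y, cv=cv).mean()
    for metric in ("euclidean", "manhattan")
}

for metric, acc in cv_accuracy.items():
    print(f"{metric:<10} mean CV accuracy = {acc:.4f}")
```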


### Question 10: You are working with a high-dimensional gene expression dataset to classify patients with different types of cancer.
Due to the large number of features and a small number of samples, traditional models overfit.
Explain how you would:

● Use PCA to reduce dimensionality

● Decide how many components to keep

● Use KNN for classification post-dimensionality reduction

● Evaluate the model

● Justify this pipeline to your stakeholders as a robust solution for real-world biomedical data

(Include your Python code and output in the code box below.)

### Justifying the Pipeline to Stakeholders

#### 1. The Challenge: High-Dimensional Gene Expression Data and Overfitting

Biomedical datasets, especially those from gene expression profiling, often present a significant challenge: a very large number of features (genes) compared to a relatively small number of samples (patients). This "many features, few samples" setting is where the **'Curse of Dimensionality'** is most severe. In such high-dimensional spaces, traditional machine learning models are prone to **overfitting**. Overfitting occurs when a model learns the noise and specific details of the training data too well, performing excellently on training data but poorly on unseen, real-world patient data. This is particularly problematic in critical applications like cancer classification, where reliable predictions are paramount.

#### 2. PCA as the Solution: Mitigating the Curse of Dimensionality and Noise Reduction

**Principal Component Analysis (PCA)** is a powerful, unsupervised dimensionality reduction technique that directly tackles these challenges. Instead of simply selecting a subset of original features, PCA transforms the original, often correlated features (genes) into a new, smaller set of uncorrelated features called **Principal Components (PCs)**. These PCs capture the most significant variance in the data.

*   **Dimensionality Reduction:** By projecting the data onto a lower-dimensional space defined by the most informative PCs, PCA effectively combats the "Curse of Dimensionality." This makes the data more manageable and the relationships between data points more meaningful for downstream algorithms.
*   **Noise Reduction:** PCA inherently acts as a noise filter. Components that explain very little variance are often associated with random noise in the data. By discarding these less informative components, PCA helps to remove irrelevant noise, allowing the signal (true biological patterns) to emerge more clearly.
*   **Preserving Essential Variance:** PCA is designed to retain as much of the original data's variance as possible in the selected principal components. This means that while we reduce the number of features, we are preserving the most important information and underlying structure of the gene expression patterns critical for distinguishing cancer types.

This preprocessing step significantly reduces the risk of overfitting by presenting a cleaner, more concise representation of the data to the classification model.

#### 3. Deciding the Optimal Number of Principal Components

Choosing the right number of principal components is vital to ensure we retain sufficient information while effectively reducing dimensionality. We employ several data-driven methods for this:

*   **Explained Variance Ratio:** We analyze the `explained_variance_ratio_` from the PCA model. This tells us the proportion of total variance explained by each principal component. We aim to select enough components to capture a high cumulative percentage of the total variance, typically targeting **90-95%**. This ensures that the majority of the original data's information content is preserved.
*   **Scree Plot Analysis:** A scree plot visualizes the eigenvalues (variance explained) for each principal component in descending order. We look for an "elbow" point in the plot, where the curve sharply changes direction from a steep slope to a more gradual one. Components before this elbow typically contribute significantly to variance, while those after it contribute less, often representing noise.
*   **Cross-Validation with KNN:** To fine-tune our selection and ensure it optimizes the classification task, we can use cross-validation. We train the KNN model with varying numbers of principal components (selected based on the above methods) and evaluate its performance. The number of components that yields the best and most stable classification accuracy through cross-validation will be chosen as the optimal set.

This systematic approach ensures that our dimensionality reduction is not arbitrary but is carefully chosen to maximize signal retention and model performance.
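The cross-validation approach to choosing the number of components can be sketched with `GridSearchCV`. The example below uses the Wine dataset as a stand-in for gene expression data; the candidate grid `[2, 3, 5, 8, 10, 13]` and the step names are assumptions chosen for illustration.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA()),
    ("knn", KNeighborsClassifier(n_neighbors=5)),
])

# Candidate numbers of components (assumed grid); "pca__" targets the PCA step
candidates = [2, 3, 5, 8, 10, 13]
search = GridSearchCV(pipeline, {"pca__n_components": candidates}, cv=5)
search.fit(X, y)

best_n = search.best_params_["pca__n_components"]
print("Best number of components:", best_n)
print("Best mean CV accuracy:", round(search.best_score_, 4))
```

As an alternative based purely on explained variance, `PCA(n_components=0.95)` tells scikit-learn to keep the smallest number of components whose cumulative explained variance reaches 95%.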

#### 4. KNN Classification on PCA-Transformed Data: Enhanced Performance and Efficiency

Once the high-dimensional gene expression data has been optimally reduced and cleaned using PCA, the K-Nearest Neighbors (KNN) classifier can be applied effectively. KNN is a simple yet powerful non-parametric algorithm that classifies a new data point based on the majority class of its 'k' nearest neighbors in the feature space.

*   **Reliable Distance Calculations:** KNN's performance heavily relies on accurate distance calculations between data points. In high-dimensional spaces, distances become less meaningful (as discussed in the Curse of Dimensionality). By applying KNN to the PCA-transformed data, we ensure that the distances are computed in a lower-dimensional, less noisy, and more discriminative feature space. This leads to more reliable identification of true neighbors and, consequently, more accurate classifications.
*   **Improved Classification Performance:** With a clearer representation of the data's underlying patterns from PCA, KNN can often distinguish between different cancer types more effectively. The reduced noise and sparsity can lead to higher classification accuracy, precision, and recall, although this has to be confirmed empirically because PCA is unsupervised and may discard low-variance directions that separate the classes.
*   **Increased Computational Efficiency:** A significant benefit of dimensionality reduction is the reduction in computational cost. KNN must calculate distances to all training points at prediction time. By operating on a much smaller number of principal components instead of hundreds or thousands of original genes, prediction time is substantially reduced (KNN has no real training phase beyond storing the data). This is crucial for real-world applications where rapid analysis might be required.
*   **Interpretability (Post-Hoc):** While individual principal components are abstract (linear combinations of original genes), the overall pipeline remains interpretable in terms of the factors that differentiate cancer types. We can analyze the loadings of the original genes on the selected principal components to understand which genes contribute most to the identified biological variations, thereby maintaining a link back to biological insights.
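The loading analysis mentioned in the last point can be sketched as follows. The Wine features stand in for genes here (an assumption for illustration); `pca.components_` holds one row per principal component with one weight per original feature, and the largest absolute weights indicate which features drive that component.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

wine = load_wine()
X_scaled = StandardScaler().fit_transform(wine.data)

pca = PCA(n_components=2).fit(X_scaled)
loadings = pca.components_            # shape: (n_components, n_original_features)

# Rank original features by absolute weight on the first principal component
top_idx = np.argsort(np.abs(loadings[0]))[::-1][:3]
top_features = [wine.feature_names[i] for i in top_idx]

print("Top contributors to PC1:")
for i in top_idx:
    print(f"  {wine.feature_names[i]:<30} weight = {loadings[0, i]:+.3f}")
```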

#### 5. Rigorous Model Evaluation for Biomedical Data

For a critical application like cancer classification, robust model evaluation is paramount. We don't just rely on a single accuracy score; instead, we employ a comprehensive suite of metrics and validation strategies:

*   **Accuracy:** While a good starting point, accuracy alone can be misleading, especially with imbalanced datasets (e.g., fewer samples of a rare cancer type).
*   **Precision and Recall:** These metrics are crucial for understanding the trade-offs in classification. **Precision** (positive predictive value) tells us the proportion of correctly identified positive cases out of all cases predicted as positive. **Recall** (sensitivity) tells us the proportion of correctly identified positive cases out of all actual positive cases. In cancer diagnosis, a high recall is often critical to minimize false negatives (missing actual cancer cases).
*   **F1-Score:** The F1-score is the harmonic mean of precision and recall, providing a balanced measure of a model's performance, particularly useful when precision and recall have conflicting priorities or when dealing with imbalanced classes.
*   **Specificity:** In biomedical contexts, **specificity** (the proportion of actual negative cases correctly identified as negative) is also highly important to avoid false positives (incorrectly diagnosing healthy individuals).
*   **Area Under the Receiver Operating Characteristic Curve (AUC-ROC):** AUC-ROC provides an aggregate measure of performance across all possible classification thresholds, indicating the model's ability to distinguish between classes. With more than two cancer types, it is computed per class (one-vs-rest) and averaged.
*   **Cross-Validation (e.g., K-Fold Cross-Validation):** To ensure the model's robustness and generalizability to unseen data, we utilize cross-validation. This technique involves splitting the training data into multiple folds, training the model on a subset of these folds, and validating on the remaining fold, repeating this process multiple times. This provides a more reliable estimate of the model's performance and guards against overfitting to a specific data split. The scaler and PCA must be re-fit on the training folds only in each iteration; fitting them on the full dataset first would leak information from the validation fold and inflate the scores.
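The metrics listed above can be computed from cross-validated predictions. The sketch below again uses the Wine dataset as a stand-in for a multi-class cancer dataset; the pipeline settings, the 5 stratified folds, and the one-vs-rest AUC are assumptions. Specificity is not provided directly by `classification_report`, so it is derived per class from the confusion matrix.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
pipeline = make_pipeline(StandardScaler(), PCA(n_components=2),
                         KNeighborsClassifier(n_neighbors=5))
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Out-of-fold predictions: every sample is predicted by a model that never saw it
y_pred = cross_val_predict(pipeline, X, y, cv=cv)
y_proba = cross_val_predict(pipeline, X, y, cv=cv, method="predict_proba")

# Precision, recall, and F1 per class
print(classification_report(y, y_pred))

# Specificity per class: TN / (TN + FP), treating each class as "positive" in turn
cm = confusion_matrix(y, y_pred)
fp = cm.sum(axis=0) - np.diag(cm)
tn = cm.sum() - cm.sum(axis=1) - cm.sum(axis=0) + np.diag(cm)
specificity = tn / (tn + fp)
print("Specificity per class:", np.round(specificity, 3))

# Multi-class AUC, one-vs-rest, macro-averaged over classes
auc = roc_auc_score(y, y_proba, multi_class="ovr")
print("AUC-ROC (one-vs-rest):", round(auc, 4))
```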

#### 6. Justification for Stakeholders: A Robust and Data-Driven Solution

This PCA-KNN pipeline offers a robust, efficient, and data-driven solution for classifying cancer types from complex gene expression data, addressing key concerns for real-world biomedical applications:

*   **Mitigation of Overfitting:** By employing PCA, we effectively reduce the high dimensionality inherent in gene expression data, preventing models from overfitting to noise and ensuring that our classifications are based on the true underlying biological signals, not spurious correlations.
*   **Improved Accuracy and Reliability:** Combining PCA's dimensionality reduction and noise filtering with KNN's distance-based classification in a more meaningful feature space can lead to more accurate and stable predictions, which we verify through cross-validation before any clinical use.
*   **Enhanced Computational Efficiency:** The reduction in data dimensions significantly speeds up prediction, since KNN computes distances over far fewer features. This is vital for integrating such analytical tools into clinical workflows where rapid insights can be critical.
*   **Data-Driven Decision Support:** Our rigorous evaluation methodology, incorporating a comprehensive set of metrics and cross-validation, provides confidence in the model's generalizability and performance. While PCA transforms features, we maintain the ability to infer biological significance by analyzing gene loadings on principal components, thus offering an explainable and data-driven foundation for medical decision support.
*   **Scalability:** Because PCA compresses the feature dimension, the pipeline remains practical as gene expression datasets include more genes. If the number of samples grows very large, KNN's prediction cost grows with it, and an approximate or tree-based neighbor search may be needed.

In summary, this PCA-KNN pipeline transforms complex gene expression data into actionable, reliable, and efficient cancer classification insights, providing a strong foundation for clinical application and research.

## Conceptual Code Example for PCA + KNN

A real gene expression dataset would be much larger, so the code below illustrates the same steps (scaling, PCA, KNN, and an accuracy comparison) on the Wine dataset. It repeats the pipeline from Question 8.


In [5]:
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Load dataset
wine = load_wine()
X = wine.data
y = wine.target

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# -------------------------------
# KNN on ORIGINAL (scaled) dataset
# -------------------------------
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

knn_original = KNeighborsClassifier(n_neighbors=5)
knn_original.fit(X_train_scaled, y_train)
y_pred_original = knn_original.predict(X_test_scaled)
accuracy_original = accuracy_score(y_test, y_pred_original)

# -------------------------------
# PCA (top 2 components) + KNN
# -------------------------------
pca = PCA(n_components=2)
X_train_pca = pca.fit_transform(X_train_scaled)
X_test_pca = pca.transform(X_test_scaled)

knn_pca = KNeighborsClassifier(n_neighbors=5)
knn_pca.fit(X_train_pca, y_train)
y_pred_pca = knn_pca.predict(X_test_pca)
accuracy_pca = accuracy_score(y_test, y_pred_pca)

accuracy_original, accuracy_pca

(0.9629629629629629, 0.9814814814814815)
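To get closer to the "many features, few samples" setting of the question, the same pipeline can be run on synthetic data. The sketch below is an illustration only: the data comes from `make_classification` (100 samples, 2000 features, 3 classes), which is an assumption and not real gene expression data, and the choice of 10 components is arbitrary. No accuracy is quoted here because the result depends on the synthetic data settings.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for gene expression: far more features than samples
X, y = make_classification(n_samples=100, n_features=2000, n_informative=20,
                           n_classes=3, random_state=0)

pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=10)),      # assumed value; tune via cross-validation
    ("knn", KNeighborsClassifier(n_neighbors=5)),
])

# Stratified folds keep class proportions similar in every split
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(pipeline, X, y, cv=cv)

print("Data shape:", X.shape)
print("Fold accuracies:", scores)
print("Mean accuracy:", scores.mean())
```

Wrapping all steps in one `Pipeline` matters most in this setting: with so few samples, fitting the scaler or PCA on the full dataset before splitting would leak validation information into training.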