# Question 1: What is K-Nearest Neighbors (KNN) and How Does It Work in Both Classification and Regression Problems?

**Answer:**  
**K-Nearest Neighbors (KNN)** is a **supervised learning algorithm** used for both **classification** and **regression** tasks. It is **non-parametric**, meaning it assumes no fixed functional form for the data distribution, and **instance-based** (a "lazy learner"), meaning it stores the training set and defers all computation to prediction time, when it predicts from the closest training samples.

---

## How KNN Works
1. Choose the number of neighbors, **K** (e.g., K = 3 or 5).  
2. For a given test point:
   - Calculate the **distance** between the test point and all training points (commonly using **Euclidean distance**).  
   - Identify the **K nearest neighbors** based on these distances.  
3. Depending on the task:
   - **Classification:** The class most common among the K neighbors is assigned to the test point (majority voting).  
   - **Regression:** The output is the **average** (or weighted average) of the values of the K neighbors.
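
The steps above can be sketched in a few lines of NumPy. This is a minimal brute-force illustration, not how a library implements KNN; the function name `knn_predict` and the toy data are made up for the example.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3, task="classification"):
    """Predict for one query point with brute-force KNN (Euclidean distance)."""
    # Step 2a: distance from the query to every training point
    distances = np.linalg.norm(X_train - x_query, axis=1)
    # Step 2b: indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    neighbor_targets = y_train[nearest]
    # Step 3: majority vote for classification, plain mean for regression
    if task == "classification":
        return Counter(neighbor_targets.tolist()).most_common(1)[0][0]
    return float(neighbor_targets.mean())

# Toy data (hypothetical): two small clusters in 2D
X_train = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.5],
                    [6.0, 6.0], [6.5, 7.0], [7.0, 6.5]])
y_class = np.array([0, 0, 0, 1, 1, 1])
y_value = np.array([10.0, 12.0, 11.0, 30.0, 32.0, 31.0])

query = np.array([1.8, 1.6])
print(knn_predict(X_train, y_class, query, k=3))                     # 0
print(knn_predict(X_train, y_value, query, k=3, task="regression"))  # 11.0
```

The query sits inside the first cluster, so its three nearest neighbors all carry class 0 and target values 10, 12, and 11.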

---

## KNN for Classification
- Example: Predicting whether a loan applicant will **default or not** based on features like income, age, and credit score.  
- The algorithm finds the K most similar applicants and predicts the class with the majority vote.  

**Prediction Rule:**  
\[
\hat{y} = \text{mode}(y_i \text{ of K nearest neighbors})
\]

---

## KNN for Regression
- Example: Predicting **house prices** based on nearby houses’ prices.  
- The algorithm takes the **average** of the prices of the K nearest houses.  

**Prediction Rule:**  
\[
\hat{y} = \frac{1}{K} \sum_{i=1}^{K} y_i
\]
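
To illustrate the plain average versus the weighted average mentioned earlier, here is a small sketch using scikit-learn's `KNeighborsRegressor`. The floor areas and prices are invented numbers chosen so the arithmetic is easy to check by hand.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Hypothetical data: floor area (one feature) and price
X = np.array([[50.0], [60.0], [80.0], [100.0], [120.0]])
y = np.array([150.0, 180.0, 240.0, 310.0, 360.0])

# weights="uniform" is the plain mean; weights="distance" weights each
# neighbor by the inverse of its distance to the query
uniform = KNeighborsRegressor(n_neighbors=3, weights="uniform").fit(X, y)
weighted = KNeighborsRegressor(n_neighbors=3, weights="distance").fit(X, y)

query = np.array([[70.0]])
pred_uniform = float(uniform.predict(query)[0])
pred_weighted = float(weighted.predict(query)[0])

# Neighbors of 70 are 60, 80 (distance 10) and 50 (distance 20)
print(round(pred_uniform, 2))   # 190.0 = (180 + 240 + 150) / 3
print(round(pred_weighted, 2))  # 198.0 = (18 + 24 + 7.5) / 0.25
```

The weighted prediction is pulled toward the two closer houses, which is the usual reason for preferring distance weighting when neighbors are unevenly spaced.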

---

## Key Points
- **K value selection:**  
  - Small K → more sensitive to noise (overfitting).  
  - Large K → smoother decision boundary (may underfit).  
- **Distance metrics:** Euclidean, Manhattan, or Minkowski distance are commonly used.  
- **Feature scaling:** Important, since KNN depends on distance: features with larger numeric ranges dominate the distance, so features should be normalized or standardized first.
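
One common way to pick K is cross-validation. The sketch below assumes the Wine dataset and an arbitrary list of candidate values (`k_values`); both are illustrative choices, and scaling is placed inside the pipeline so it is refit on each training fold.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

# Candidate values of K (an arbitrary illustrative grid)
k_values = [1, 3, 5, 7, 9, 11, 15]
mean_scores = {}
for k in k_values:
    # Scale inside the pipeline so each fold is standardized on its own training part
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    mean_scores[k] = cross_val_score(model, X, y, cv=5).mean()

best_k = max(mean_scores, key=mean_scores.get)
for k, score in mean_scores.items():
    print(f"K={k:>2}: mean CV accuracy = {score:.4f}")
print("Best K:", best_k)
```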


# Question 2: What is the Curse of Dimensionality and How Does It Affect KNN Performance?

**Answer:**  
The **Curse of Dimensionality** refers to the problems that arise when data has a **large number of features (dimensions)**. As the number of dimensions increases, the data becomes **sparse**, distances between points become **less meaningful**, and models that rely on distance measures—like **K-Nearest Neighbors (KNN)**—suffer in performance.

---

## What Happens in High Dimensions
1. **Data Sparsity:**  
   - In high-dimensional spaces, data points are far apart and the notion of “closeness” becomes weak.  
   - Most data points appear to be at almost the same distance from each other.  

2. **Distance Measures Lose Meaning:**  
   - In high dimensions, the difference between the nearest and farthest neighbors becomes very small.  
   - This makes it difficult for KNN to find truly “nearest” neighbors.

3. **Increased Computation:**  
   - More features mean more distance calculations, which increases computational cost and memory usage.
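
The loss of contrast between near and far neighbors can be observed directly. The sketch below draws uniformly random points (an assumption made purely for illustration) and measures how much farther the farthest point is than the nearest one, relative to the nearest distance.

```python
import numpy as np

rng = np.random.default_rng(0)

def relative_contrast(n_dims, n_points=500):
    """(farthest - nearest) / nearest distance from one query to random points."""
    points = rng.random((n_points, n_dims))
    query = rng.random(n_dims)
    d = np.linalg.norm(points - query, axis=1)
    return (d.max() - d.min()) / d.min()

# The ratio shrinks as the dimension grows: distances concentrate
contrasts = {dim: relative_contrast(dim) for dim in (2, 10, 100, 1000)}
for dim, c in contrasts.items():
    print(f"{dim:>4} dims: relative contrast = {c:.2f}")
```

In low dimensions the nearest point is many times closer than the farthest; in 1000 dimensions the two distances differ by only a small fraction, which is why "nearest" carries little information there.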

---

## Impact on KNN Performance
- **Reduced Accuracy:**  
  - Since KNN relies heavily on distance to classify or predict, less meaningful distances lead to poor model decisions.  
- **Overfitting Risk:**  
  - Irrelevant or noisy features contribute to the distance as much as informative ones, so the selected neighbors may reflect noise rather than true similarity, and the model generalizes poorly.  
- **Slower Predictions:**  
  - KNN must compute distances to all training samples, and high dimensionality increases this burden.

---

## How to Handle the Curse of Dimensionality
1. **Feature Selection:**  
   - Keep only the most relevant features using techniques like correlation analysis or feature importance scores.  
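2. **Dimensionality Reduction:**  
   - Apply **PCA (Principal Component Analysis)** to project the data onto fewer dimensions before KNN. **t-SNE** also reduces dimensions but is mainly a visualization tool; scikit-learn's `TSNE`, for example, cannot transform unseen test points.  
3. **Normalization:**  
   - Scale features so that all contribute equally to the distance metric. This does not reduce dimensionality, but it keeps a few large-scale features from dominating the distances.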

# Question 3: What is Principal Component Analysis (PCA)? How is it Different from Feature Selection?

**Answer:**  
**Principal Component Analysis (PCA)** is an **unsupervised dimensionality reduction technique** used to transform high-dimensional data into a smaller set of **uncorrelated features** called **principal components**. These components capture the **maximum variance** (information) present in the original dataset.

---

## How PCA Works
1. **Standardize the Data:**  
   - Ensure all features have the same scale (mean = 0, variance = 1).

2. **Compute the Covariance Matrix:**  
   - Measures how features vary with respect to each other.

3. **Calculate Eigenvalues and Eigenvectors:**  
   - Eigenvectors represent directions (principal components).  
   - Eigenvalues represent the amount of variance captured by each component.

4. **Select Top Components:**  
   - Choose the top *k* components that capture most of the variance (e.g., 95%).

5. **Transform the Data:**  
   - Project original data onto the new principal components to obtain reduced dimensions.
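
The five steps can be followed by hand with NumPy. This is a sketch for illustration, assuming the Wine dataset and a 95% variance threshold; it checks itself against scikit-learn's `PCA`, which uses an SVD internally rather than an explicit covariance matrix.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = load_wine().data

# Step 1: standardize (mean 0, variance 1 per feature)
X_std = StandardScaler().fit_transform(X)

# Step 2: covariance matrix of the features
cov = np.cov(X_std, rowvar=False)

# Step 3: eigen-decomposition (eigh is for symmetric matrices, ascending order)
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]          # sort descending by variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Step 4: smallest number of components reaching 95% cumulative variance
ratio = eigvals / eigvals.sum()
k = int(np.searchsorted(np.cumsum(ratio), 0.95) + 1)

# Step 5: project onto the top-k eigenvectors
X_reduced = X_std @ eigvecs[:, :k]

print("Components kept:", k)
print("Reduced shape:", X_reduced.shape)
print("Matches sklearn:", np.allclose(ratio, PCA().fit(X_std).explained_variance_ratio_))
```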

---

## Example
If you have 10 features, PCA might find that 2 or 3 principal components capture 95% of the information — allowing you to reduce dimensions while preserving most of the variability.

---

## PCA vs. Feature Selection

| Aspect | PCA (Feature Extraction) | Feature Selection |
|--------|--------------------------|-------------------|
| **Approach** | Creates **new features** by combining existing ones (linear combinations) | **Selects** a subset of the original features |
| **Goal** | Reduce dimensionality while retaining maximum variance | Keep only the most relevant features for prediction |
| **Output Features** | Transformed, uncorrelated (principal components) | Original features (subset) |
| **Interpretability** | Less interpretable (new synthetic features) | More interpretable (original features retained) |
| **Supervision** | Unsupervised method | Can be supervised or unsupervised |
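
The difference in output is easy to see in code. The sketch below assumes the Wine dataset and uses `SelectKBest` with an ANOVA F-test as one possible (supervised) feature selection method; other selectors would work equally well for the comparison.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import StandardScaler

data = load_wine()
X_std = StandardScaler().fit_transform(data.data)

# Feature selection: keeps 3 of the original columns (uses the labels)
selector = SelectKBest(score_func=f_classif, k=3).fit(X_std, data.target)
selected_names = [name for name, keep in zip(data.feature_names, selector.get_support()) if keep]

# PCA: builds 3 new columns, each a linear combination of all 13 features (ignores the labels)
pca = PCA(n_components=3).fit(X_std)

print("Selected original features:", selected_names)
print("PCA loading matrix shape:", pca.components_.shape)  # (3, 13)
```

Each row of `pca.components_` holds one weight per original feature, which is why a principal component has no single-feature interpretation.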



# Question 4: What are Eigenvalues and Eigenvectors in PCA, and Why are They Important?

**Answer:**

In **Principal Component Analysis (PCA)**, **eigenvalues** and **eigenvectors** are mathematical concepts derived from the **covariance matrix** of the data. They play a crucial role in identifying the **principal components**, which are the new directions along which the data varies the most.

---

## What are Eigenvectors?
- **Eigenvectors** represent the **directions (axes)** of the new feature space.
- Each eigenvector defines a **principal component**, which is a linear combination of the original features.
- They show **where** the data spreads the most in the multi-dimensional space.

### Example:
If your dataset has two correlated features, PCA will find new axes (eigenvectors) that point in the directions of maximum and minimum variance.

---

## What are Eigenvalues?
- **Eigenvalues** represent the **magnitude (amount)** of variance captured by each eigenvector.
- A **larger eigenvalue** means that the corresponding eigenvector (principal component) explains **more variance** in the data.
- Eigenvalues help us **rank** components and decide how many to keep.

---

## Why Are They Important in PCA?

| Concept | Role in PCA |
|----------|--------------|
| **Eigenvectors** | Determine the **direction** of maximum variance (principal components). |
| **Eigenvalues** | Determine the **importance** (variance explained) of each component. |
| **Dimensionality Reduction** | Components with small eigenvalues are often dropped since they carry little information. |

---

## Example Analogy
Imagine data points scattered on a 2D plane:
- The **eigenvector** points along the direction where points are most spread out.
- The **eigenvalue** tells **how much** spread (variance) there is in that direction.
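
The analogy can be reproduced numerically. The sketch below generates synthetic 2D points scattered around the line y = 2x (an invented example) and reads the main direction and its variance off the covariance matrix.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic correlated data: y is roughly 2x plus some noise
x = rng.normal(size=500)
y = 2.0 * x + rng.normal(scale=0.5, size=500)
data = np.column_stack([x, y])

cov = np.cov(data, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)   # ascending order for symmetric matrices

# The last column is the eigenvector with the largest eigenvalue
main_direction = eigvecs[:, -1]
slope = main_direction[1] / main_direction[0]

print("Eigenvalues (small, large):", np.round(eigvals, 3))
print("Main direction:", np.round(main_direction, 3))
print("Slope of main direction:", round(float(slope), 2))
print("Share of variance on main axis:", round(float(eigvals[-1] / eigvals.sum()), 3))
```

The main direction has a slope close to 2, following the line the points were generated around, and its eigenvalue accounts for nearly all of the total variance.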





# Question 5: How do KNN and PCA Complement Each Other When Applied in a Single Pipeline?

**Answer:**

**K-Nearest Neighbors (KNN)** and **Principal Component Analysis (PCA)** are often used together in machine learning workflows because they **complement each other’s strengths and limitations**.

---

## Step-by-Step Relationship Between PCA and KNN

### 1. **Dimensionality Reduction Before KNN**
- KNN relies on **distance metrics** (like Euclidean distance) to classify or predict.
- In high-dimensional data, distances become less meaningful — this is known as the **Curse of Dimensionality**.
- **PCA reduces the number of dimensions** while keeping the most informative patterns, making KNN more efficient and accurate.

---

### 2. **Noise Reduction**
- PCA **decorrelates** the features and lets you drop low-variance directions, which often (though not always) contain mostly noise.
- KNN performs better on cleaner, less noisy data since its decisions depend directly on nearby points.

---

### 3. **Improved Computation Efficiency**
- KNN’s computation cost grows with the number of features.
- Using PCA to reduce features **speeds up distance calculations**, improving KNN’s performance on large datasets.

---

### 4. **Easier Visualization**
- After applying PCA, data can be projected into 2D or 3D space.
- This helps visualize how KNN separates classes in the reduced space, although the components themselves are harder to interpret than the original features.



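# Create a pipeline with PCA + KNN
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Load the Wine dataset (used here as example data) and hold out a test set
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=2)),
    ('knn', KNeighborsClassifier(n_neighbors=5))
])

# Train and evaluate
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print("Accuracy with PCA + KNN:", accuracy_score(y_test, y_pred))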


# Question 6: Train a KNN Classifier on the Wine Dataset with and without Feature Scaling. Compare Model Accuracy in Both Cases.

from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Load the Wine dataset
data = load_wine()
X, y = data.data, data.target

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)


# KNN Without Feature Scaling

knn_no_scaling = KNeighborsClassifier(n_neighbors=5)
knn_no_scaling.fit(X_train, y_train)
y_pred_no_scale = knn_no_scaling.predict(X_test)
accuracy_no_scale = accuracy_score(y_test, y_pred_no_scale)


# KNN With Feature Scaling

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

knn_scaled = KNeighborsClassifier(n_neighbors=5)
knn_scaled.fit(X_train_scaled, y_train)
y_pred_scaled = knn_scaled.predict(X_test_scaled)
accuracy_scaled = accuracy_score(y_test, y_pred_scaled)


# Compare Results

print("Accuracy without scaling :", round(accuracy_no_scale, 4))
print("Accuracy with scaling    :", round(accuracy_scaled, 4))

Accuracy without scaling : 0.7407
Accuracy with scaling    : 0.963


# Question 7: Train a PCA Model on the Wine Dataset and Print the Explained Variance Ratio of Each Principal Component


# Import required libraries
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import pandas as pd

# Load the Wine dataset
data = load_wine()
X = data.data

# Step 1: Standardize the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Step 2: Apply PCA
pca = PCA()
X_pca = pca.fit_transform(X_scaled)

# Step 3: Display the explained variance ratio
explained_variance = pca.explained_variance_ratio_

# Convert to DataFrame for better readability
df = pd.DataFrame({
    'Principal Component': [f'PC{i+1}' for i in range(len(explained_variance))],
    'Explained Variance Ratio': explained_variance
})

print(df)
print("\nTotal Variance Explained:", round(sum(explained_variance), 4))

   Principal Component  Explained Variance Ratio
0                  PC1                  0.361988
1                  PC2                  0.192075
2                  PC3                  0.111236
3                  PC4                  0.070690
4                  PC5                  0.065633
5                  PC6                  0.049358
6                  PC7                  0.042387
7                  PC8                  0.026807
8                  PC9                  0.022222
9                 PC10                  0.019300
10                PC11                  0.017368
11                PC12                  0.012982
12                PC13                  0.007952

Total Variance Explained: 1.0


# Question 8: Train a KNN Classifier on the PCA-Transformed Dataset (Retain Top 2 Components). Compare the Accuracy with the Original Dataset.


from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Load the Wine dataset
data = load_wine()
X, y = data.data, data.target

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Step 1: Standardize the data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Step 2: Train KNN on the original (scaled) dataset
knn_original = KNeighborsClassifier(n_neighbors=5)
knn_original.fit(X_train_scaled, y_train)
y_pred_original = knn_original.predict(X_test_scaled)
accuracy_original = accuracy_score(y_test, y_pred_original)

# Step 3: Apply PCA (retain top 2 components)
pca = PCA(n_components=2)
X_train_pca = pca.fit_transform(X_train_scaled)
X_test_pca = pca.transform(X_test_scaled)

# Step 4: Train KNN on PCA-transformed dataset
knn_pca = KNeighborsClassifier(n_neighbors=5)
knn_pca.fit(X_train_pca, y_train)
y_pred_pca = knn_pca.predict(X_test_pca)
accuracy_pca = accuracy_score(y_test, y_pred_pca)

# Step 5: Compare results
print("Accuracy on Original Scaled Data :", round(accuracy_original, 4))
print("Accuracy on PCA (2 Components)   :", round(accuracy_pca, 4))

Accuracy on Original Scaled Data : 0.963
Accuracy on PCA (2 Components)   : 0.9815


# Question 9: Train a KNN Classifier with Different Distance Metrics (Euclidean, Manhattan) on the Scaled Wine Dataset and Compare the Results

from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Load the Wine dataset
data = load_wine()
X, y = data.data, data.target

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Step 1: Standardize the data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Step 2: Train KNN with Euclidean distance
knn_euclidean = KNeighborsClassifier(n_neighbors=5, metric='euclidean')
knn_euclidean.fit(X_train_scaled, y_train)
y_pred_euclidean = knn_euclidean.predict(X_test_scaled)
accuracy_euclidean = accuracy_score(y_test, y_pred_euclidean)

# Step 3: Train KNN with Manhattan distance
knn_manhattan = KNeighborsClassifier(n_neighbors=5, metric='manhattan')
knn_manhattan.fit(X_train_scaled, y_train)
y_pred_manhattan = knn_manhattan.predict(X_test_scaled)
accuracy_manhattan = accuracy_score(y_test, y_pred_manhattan)

# Step 4: Compare results
print("Accuracy using Euclidean Distance :", round(accuracy_euclidean, 4))
print("Accuracy using Manhattan Distance :", round(accuracy_manhattan, 4))

# Optional: Display confusion matrices for deeper insight
print("\nConfusion Matrix (Euclidean):\n", confusion_matrix(y_test, y_pred_euclidean))
print("\nConfusion Matrix (Manhattan):\n", confusion_matrix(y_test, y_pred_manhattan))

# Classification reports
print("\nClassification Report (Euclidean):\n", classification_report(y_test, y_pred_euclidean))
print("\nClassification Report (Manhattan):\n", classification_report(y_test, y_pred_manhattan))

# Question 10: Using PCA and KNN for High-Dimensional Gene Expression Data

**Answer:**  

High-dimensional datasets, such as gene expression profiles, pose unique challenges in machine learning:

- Thousands of gene features (dimensions) but only a few patient samples.  
- Traditional models easily **overfit** due to sparse data in high-dimensional space.  
- **KNN** alone may fail because distance metrics become less meaningful (**curse of dimensionality**).  

To build a robust and interpretable pipeline, we can combine **PCA for dimensionality reduction** with **KNN for classification**.

---

## 1. Use PCA to Reduce Dimensionality

**Principal Component Analysis (PCA)** transforms the original high-dimensional data into a new set of uncorrelated variables called **principal components (PCs)**.

- Steps:
  1. **Standardize the data**: Mean = 0, Variance = 1.
  2. **Compute covariance matrix** and extract **eigenvectors** and **eigenvalues**.
  3. **Transform data** onto the principal components.

- Benefits for gene expression data:
  - Captures most **variance** while discarding low-variance directions, which are often dominated by noise.  
  - Reduces feature space, mitigating **overfitting**.  
  - Improves computational efficiency.

---

## 2. Decide How Many Components to Keep

- Use **explained variance ratio** to determine the number of PCs that capture most of the information.
- Common strategy:
  - Keep enough components to retain **90–95% of total variance**.
  - Example:
    ```python
    import numpy as np
    # `pca` is assumed to be a PCA object already fitted with all components
    cumulative_variance = np.cumsum(pca.explained_variance_ratio_)
    num_components = np.where(cumulative_variance >= 0.95)[0][0] + 1
    ```
- This balances **information retention** with **dimensionality reduction**.
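
A runnable version of this selection is sketched below. No gene expression data is given here, so the Wine dataset stands in as an assumption. The second half uses scikit-learn's option of passing a fraction as `n_components`, which picks the number of components needed to reach that share of variance.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Stand-in data (assumption): Wine instead of a gene expression matrix
X_std = StandardScaler().fit_transform(load_wine().data)

# Manual route: fit all components, then find where cumulative variance reaches 95%
pca = PCA().fit(X_std)
cumulative_variance = np.cumsum(pca.explained_variance_ratio_)
num_components = int(np.argmax(cumulative_variance >= 0.95) + 1)

# Shortcut: a float in (0, 1) asks PCA to keep enough components for that variance share
pca_95 = PCA(n_components=0.95).fit(X_std)

print("Manual count:", num_components)
print("PCA(n_components=0.95) kept:", pca_95.n_components_)
```

Both routes should agree; the shortcut is convenient inside a pipeline because the count is then re-derived from each training fold.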

---

## 3. Use KNN for Classification Post-Dimensionality Reduction

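- KNN is **distance-based**; working in a lower-dimensional space typically makes distances more meaningful and reduces sensitivity to noise.
- Steps:
  1. Fit the scaler and PCA on the **training set only**, then apply the same transformation to both **training and test sets**.  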
  2. Train **KNN classifier** on the PCA-reduced features.  
  3. Predict cancer types for test patients.

- Hyperparameter considerations:
  - **k (number of neighbors)**: Use cross-validation to choose optimal k.  
  - **Distance metric**: Euclidean distance works well after scaling.
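
Putting the pieces together, one possible pipeline is sketched below. The data is synthetic (`make_classification` with 100 samples and 2000 features, a stand-in for real expression profiles), and the parameter grids are arbitrary illustrative choices. Wrapping the scaler and PCA in a `Pipeline` means they are refit on each training fold, which avoids leaking test information into the projection.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for gene expression data (an assumption, not real data):
# 100 "patients", 2000 "genes", 3 classes
X, y = make_classification(n_samples=100, n_features=2000, n_informative=20,
                           n_classes=3, random_state=0)

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA()),
    ("knn", KNeighborsClassifier()),
])

# Illustrative grids; component counts must stay below the training-fold size
param_grid = {
    "pca__n_components": [5, 10, 20],
    "knn__n_neighbors": [3, 5, 7],
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(pipeline, param_grid, cv=cv, scoring="accuracy")
search.fit(X, y)

print("Best parameters:", search.best_params_)
print("Cross-validated accuracy:", round(search.best_score_, 3))
```

With far fewer samples than features, PCA can produce at most as many meaningful components as there are training samples, so the grid for `pca__n_components` is kept small.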

---

## 4. Evaluate the Model

- Use **k-fold cross-validation** (e.g., 5- or 10-fold) to estimate performance.
- Metrics suitable for biomedical classification:
  - **Accuracy**: Overall correctness.  
  - **Precision, Recall, F1-Score**: Important for imbalanced cancer classes.  
  - **ROC-AUC**: Measures discrimination ability of the model.

- Example evaluation in Python:
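```python
from sklearn.model_selection import cross_val_score

# Assumes `knn` is a KNeighborsClassifier and `X_pca`, `y` hold the PCA-reduced features and labels.
# For an unbiased estimate, pass a Pipeline (scaler + PCA + KNN) instead so PCA is refit in each fold.
scores = cross_val_score(knn, X_pca, y, cv=5, scoring='accuracy')
print("Cross-validated Accuracy:", scores.mean())
```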