
1  What is K-Nearest Neighbors (KNN) and how does it work in both
classification and regression problems?


            Definition of KNN:

            K-Nearest Neighbors (KNN) is a supervised machine learning algorithm used for both classification and regression. It is instance-based (it stores the training data and defers all computation to prediction time) and non-parametric (it makes no assumptions about the underlying data distribution).

             How KNN Works:

            1. Choose the number of neighbors (K).

            2. Calculate the distance between the new data point and every point in the training set (commonly Euclidean distance).

            3. Select the K closest points (neighbors).

            4. Make a prediction:

               Classification: Majority vote of the K neighbors’ labels.

               Regression: Average of the K neighbors’ values.
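The procedure above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation: the function name `knn_predict`, its `task` parameter, and the tiny 1D demo data are all assumptions made for this example.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3, task="classification"):
    """Predict for a single query point with plain Euclidean KNN (illustrative helper)."""
    X_train = np.asarray(X_train, dtype=float)
    x_new = np.asarray(x_new, dtype=float)
    # Distance from the query point to every training point
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    neighbor_targets = [y_train[i] for i in nearest]
    # Majority vote for classification, mean for regression
    if task == "classification":
        return Counter(neighbor_targets).most_common(1)[0][0]
    return float(np.mean(neighbor_targets))

# Made-up 1D demo data (hypothetical, for illustration only)
X_demo = [[1.0], [2.0], [3.0], [10.0]]
label = knn_predict(X_demo, ["a", "a", "b", "b"], [2.5], k=3)
value = knn_predict(X_demo, [10.0, 20.0, 30.0, 100.0], [2.5], k=2, task="regression")
print(label, value)
```

For the query 2.5, the three closest points are 2.0, 3.0, and 1.0, so the vote is two "a" against one "b"; with K=2 the regression result is the mean of the targets at 2.0 and 3.0.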




               Example 1: KNN for Classification
               Dataset (Simplified):
              Age	Salary	Buys Computer
              25	50000	Yes
              30	60000	Yes
              35	70000	No
              40	80000	No
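               New Data: Age=28, Salary=52000
              Step-by-Step (K=3):

              Calculate the Euclidean distance to every existing point:

              Point A (25, 50000) → √((28-25)² + (52000-50000)²) ≈ 2000.0

              Point B (30, 60000) → √((28-30)² + (52000-60000)²) ≈ 8000.0

              Point C (35, 70000) → √((28-35)² + (52000-70000)²) ≈ 18000.0

              Point D (40, 80000) → √((28-40)² + (52000-80000)²) ≈ 28000.0

              Salary dominates these distances because its scale is far larger than that of Age, which is why features are usually scaled before applying KNN.

              Select the 3 nearest neighbors: A ("Yes"), B ("Yes"), C ("No")

              Majority Voting:

              2 Yes, 1 No → Predict: "Yes"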



 Output:

              Predicted Class: Yes
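The classification example can also be checked with scikit-learn. The sketch below assumes the default Euclidean metric and no feature scaling, matching the hand calculation; the variable names are chosen for this illustration.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# The four training rows from the example: (Age, Salary) -> Buys Computer
X = np.array([[25, 50000], [30, 60000], [35, 70000], [40, 80000]], dtype=float)
y = np.array(["Yes", "Yes", "No", "No"])
new_point = np.array([[28, 52000]], dtype=float)

knn = KNeighborsClassifier(n_neighbors=3)  # Euclidean distance by default
knn.fit(X, y)

# kneighbors returns the distances and row indices of the 3 nearest points
distances, indices = knn.kneighbors(new_point)
prediction = knn.predict(new_point)[0]

print("Neighbor labels:", y[indices[0]])
print("Distances:", np.round(distances[0], 1))
print("Predicted class:", prediction)
```

Because Salary is on a much larger scale than Age, the neighbor ranking here is decided almost entirely by Salary; standardizing the features first would be the usual choice on real data.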

               Example 2: KNN for Regression
               Dataset:
              Size (sqft)	Price ($)
              1000	200000
              1200	240000
              1500	300000
              1800	360000
               New House Size: 1300 sqft
              Step-by-Step (K=2):

              Calculate Distances (|size - 1300|):

              To 1000 → 300

              To 1200 → 100

              To 1500 → 200

              To 1800 → 500

              (The two nearest are 1200 and 1500)

              Average the Prices of Nearest Neighbors:

              (240000 + 300000) / 2 = 270000






                Output:



                Predicted Price: $270,000
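The regression example can be reproduced the same way. This is a small sketch using scikit-learn's `KNeighborsRegressor` with its default uniform weighting, so the prediction is the plain average of the two nearest prices.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# House sizes (sqft) and prices ($) from the example table
sizes = np.array([[1000], [1200], [1500], [1800]], dtype=float)
prices = np.array([200000, 240000, 300000, 360000], dtype=float)

reg = KNeighborsRegressor(n_neighbors=2)  # uniform weights: simple mean of neighbors
reg.fit(sizes, prices)

predicted_price = reg.predict([[1300.0]])[0]
print(f"Predicted Price: ${predicted_price:,.0f}")
```

Passing `weights="distance"` instead would weight the closer 1200 sqft house more heavily and give a somewhat lower estimate than the plain average.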





2 What is the Curse of Dimensionality and how does it affect KNN
performance?



            The Curse of Dimensionality refers to various issues that arise when working with high-dimensional data (i.e., data with a large number of features). As dimensions increase:

            The data becomes sparse.

            Distance metrics lose meaning.

            Algorithms like KNN become less effective.

             How It Affects KNN:

            KNN relies heavily on distance calculations (like Euclidean distance) to find the "nearest" neighbors. But in high dimensions:

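            All points start to look almost equally far away, so the "nearest" neighbor is barely closer than the farthest one.

            Noise and irrelevant features dilute meaningful patterns.

            Computation time increases, because every distance calculation touches every dimension.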

             Example: Effect of Dimensions on KNN Distance



              Let’s simulate what happens when we increase dimensions while keeping data points random.

               Scenario:

              Two random points in 1, 5, 10, 50, 100, and 500 dimensions

              Use Euclidean distance



  Python Simulation:



                  import numpy as np

                  # Fix the seed, then draw two random points per dimensionality
                  np.random.seed(42)

                  dims = [1, 5, 10, 50, 100, 500]
                  for d in dims:
                      point1 = np.random.rand(d)
                      point2 = np.random.rand(d)
                      distance = np.linalg.norm(point1 - point2)
                      print(f"Distance in {d}D: {distance:.4f}")




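 Output (illustrative; the exact figures depend on the random draws):


              Distance in 1D: 0.2509
              Distance in 5D: 1.1067
              Distance in 10D: 1.4548
              Distance in 50D: 2.8864
              Distance in 100D: 4.1792
              Distance in 500D: 9.0547

              For points drawn uniformly from the unit hypercube, the distance grows roughly like √(d/6), so even two random points end up far apart in high dimensions.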
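Growing distances alone do not show that points become "equally far away". A common way to illustrate that effect is the relative contrast between the farthest and nearest neighbor of a query point. The sketch below is one such illustration under assumed settings (1000 uniform random points, a fixed seed, and the dimensions listed in the loop); none of these choices come from the original example.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen arbitrarily for reproducibility
n_points = 1000
contrasts = {}

for d in [2, 10, 100, 1000]:
    data = rng.random((n_points, d))   # uniform points in the unit hypercube
    query = rng.random(d)
    dist = np.linalg.norm(data - query, axis=1)
    # Relative contrast: how much farther the farthest point is than the nearest
    contrasts[d] = (dist.max() - dist.min()) / dist.min()
    print(f"{d:>4}D  nearest={dist.min():.3f}  farthest={dist.max():.3f}  "
          f"relative contrast={contrasts[d]:.3f}")
```

As the dimension grows, the relative contrast shrinks toward zero, which means the nearest neighbor is only marginally closer than any other point and the KNN vote carries less information.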






3  What is Principal Component Analysis (PCA)? How is it different from
feature selection?


        Principal Component Analysis (PCA) is a dimensionality reduction technique that transforms a large set of possibly correlated features into a smaller set of uncorrelated variables called principal components.





                Input Dataset:
                Feature 1	Feature 2	Feature 3
                2.5	2.4	3.5
                0.5	0.7	1.1
                2.2	2.9	3.3
                1.9	2.2	2.9
                 Goal: Reduce 3 features to 2 using PCA



 Output (illustrative values):


            PC1	PC2
            0.82797	-0.17512
            -1.77758	-0.14243
            0.99219	0.38437
            0.25242	0.02525

          Now, instead of 3 features, we have 2 principal components that still retain most of the variance in the original data.
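To obtain exact component scores for the small three-feature dataset, one option is scikit-learn's `PCA`. The sketch below makes no claim about the printed numbers; the sign of each component can also differ between library versions, because an eigenvector and its negation describe the same direction.

```python
import numpy as np
from sklearn.decomposition import PCA

# The four samples and three features from the input table
X_small = np.array([
    [2.5, 2.4, 3.5],
    [0.5, 0.7, 1.1],
    [2.2, 2.9, 3.3],
    [1.9, 2.2, 2.9],
])

pca_small = PCA(n_components=2)
scores = pca_small.fit_transform(X_small)  # PCA centers the data internally

print(np.round(scores, 4))
print("Explained variance ratio:", np.round(pca_small.explained_variance_ratio_, 4))
```

A quick sanity check on any PCA output is that each score column averages to zero, since the data is centered before projection.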




 PCA vs Feature Selection

          Aspect	PCA (Dimensionality Reduction)	Feature Selection
          What it does	Creates new features (principal components)	Selects a subset of existing features
          Based on	Variance and correlation	Relevance to the target variable
          Features used	Transformed (linear combinations of original features)	Original features only
          Interpretability	Less interpretable (components are abstract)	High (original feature names retained)
          Supervised?	Unsupervised (ignores the output label)	Often supervised (can use label information)
          Purpose	Reduce dimensionality while retaining variance	Improve performance, reduce overfitting, simplify
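The contrast in the table can be made concrete on the Iris data. In this sketch, `SelectKBest` with the ANOVA F-score (`f_classif`) stands in for feature selection; it is only one of several possible selection methods and is an assumption of this example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif

iris = load_iris()
X, y = iris.data, iris.target

# Feature selection: keep 2 of the original columns, scored against the labels
selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X, y)
kept = [name for name, keep in zip(iris.feature_names, selector.get_support()) if keep]
print("Kept original features:", kept)

# PCA: build 2 new columns as linear combinations of all 4 features, ignoring labels
pca = PCA(n_components=2)
X_projected = pca.fit_transform(X)
print("PC1 weights on the original features:", np.round(pca.components_[0], 3))
```

The selected columns are unchanged copies of original measurements, while each principal component mixes all four features, which is why it is harder to interpret.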

 Code Example: PCA in Python


                    from sklearn.decomposition import PCA
                    from sklearn.datasets import load_iris
                    import pandas as pd

                    # Load Iris dataset
                    data = load_iris()
                    X = data.data
                    features = data.feature_names

                    # Apply PCA to reduce from 4D to 2D
                    pca = PCA(n_components=2)
                    X_pca = pca.fit_transform(X)

                    # Show transformed data
                    df_pca = pd.DataFrame(X_pca, columns=["PC1", "PC2"])
                    print(df_pca.head())


 Output:



                                  PC1       PC2
                            0  -2.6841   0.3194
                            1  -2.7141  -0.1770
                            2  -2.8889  -0.1449
                            3  -2.7453  -0.3183
                            4  -2.7287   0.3268







4  What are eigenvalues and eigenvectors in PCA, and why are they
important?



        In Principal Component Analysis (PCA), eigenvalues and eigenvectors play a crucial role in transforming the data into a new coordinate system where the axes (called principal components) maximize the variance in the data.




              Eigenvectors:
              These are the eigenvectors of the data's covariance matrix. Each one defines a direction in feature space; sorted by eigenvalue, they are the principal components, and the first one points along the direction of greatest variance.

              Eigenvalues:
              Each eigenvalue is the variance of the data along its eigenvector. A higher eigenvalue means that principal component captures more of the total variance (i.e., information).




 Worked Example:


                            Imagine a 2D dataset:

                            Data points:
                            [[2.5, 2.4],
                            [0.5, 0.7],
                            [2.2, 2.9],
                            [1.9, 2.2],
                            [3.1, 3.0],
                            [2.3, 2.7],
                            [2.0, 1.6],
                            [1.0, 1.1],
                            [1.5, 1.6],
                            [1.1, 0.9]]

                            Step 1: Covariance Matrix

                            After centering the data (subtracting the mean), compute the covariance matrix:

                            Covariance Matrix:
                            [[0.6165, 0.6154],
                            [0.6154, 0.7166]]

                            Step 2: Compute Eigenvalues and Eigenvectors
                            Eigenvalues:      [1.2840, 0.0491]
                            Eigenvectors:
                            [[ 0.6779, -0.7352],
                            [ 0.7352,  0.6779]]





                Concept	Role in PCA

                Eigenvectors	Determine the directions (principal components) along which data is projected.
                Eigenvalues	Tell how much variance (information) each principal component contains.

                In our example:

                The first principal component (eigenvector [0.6779, 0.7352]) captures ~96% of the total variance (since 1.2840 / (1.2840 + 0.0491) ≈ 0.963).

                The second component adds little new information, so we can reduce dimensionality by keeping only the first one.
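The worked example can be reproduced with NumPy. This is a minimal sketch that assumes the sample covariance (dividing by n - 1), which is what gives the covariance matrix quoted above; eigenvector signs may come out flipped, which does not change the direction they describe.

```python
import numpy as np

data = np.array([
    [2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
    [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9],
])

# Step 1: center the data and compute the sample covariance matrix
centered = data - data.mean(axis=0)
cov = np.cov(centered, rowvar=False)

# Step 2: eigen-decomposition; eigh suits symmetric matrices and returns ascending order
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]          # largest eigenvalue first
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# Share of total variance captured by each principal component
explained = eigenvalues / eigenvalues.sum()

# Reduce 2D to 1D by projecting onto the first principal component
projected = centered @ eigenvectors[:, :1]

print("Covariance matrix:\n", np.round(cov, 4))
print("Eigenvalues:", np.round(eigenvalues, 4))
print("Explained variance ratio:", np.round(explained, 4))
```

Each column of `eigenvectors` is one principal direction, so `eigenvectors[:, :1]` keeps only the first component.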






5  How do KNN and PCA complement each other when applied in a single
pipeline?




              PCA reduces the dimensionality of the data by projecting it onto the directions of highest variance (new features built from the originals, not a subset of them).

              KNN then works more effectively and efficiently in this reduced space.

              Pipeline Flow:

              Raw Data (High-Dimensional)
              ↓

              PCA – Dimensionality Reduction
              ↓

              KNN – Classification or Regression on Reduced Data

              
              Challenge	How PCA Helps
              KNN suffers from the curse of dimensionality (distance becomes less meaningful in high dimensions)	PCA projects data to fewer dimensions while preserving variance
              High-dimensional data leads to slow computation in KNN	PCA reduces the number of features, speeding up distance calculations
              Many features may be noisy or irrelevant	PCA discards low-variance directions, which often removes noise and can improve KNN accuracy (although low variance does not always mean irrelevant)


 Code Example:


                    Let's say we use the Digits dataset from scikit-learn (1797 samples, 64 features per image):

                    from sklearn.datasets import load_digits
                    from sklearn.decomposition import PCA
                    from sklearn.neighbors import KNeighborsClassifier
                    from sklearn.model_selection import train_test_split
                    from sklearn.pipeline import Pipeline
                    from sklearn.metrics import accuracy_score

                    # Load dataset
                    X, y = load_digits(return_X_y=True)

                    # Train-test split
                    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

                    # Build pipeline: PCA (30 components) + KNN (k=3)
                    pipeline = Pipeline([
                        ('pca', PCA(n_components=30)),
                        ('knn', KNeighborsClassifier(n_neighbors=3))
                    ])

                    # Train the model
                    pipeline.fit(X_train, y_train)

                    # Predict and evaluate
                    y_pred = pipeline.predict(X_test)
                    accuracy = accuracy_score(y_test, y_pred)
                    print(f"Test Accuracy: {accuracy:.4f}")




        Output:
        
        
                  Test Accuracy: 0.9806
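The values 30 components and K=3 are fixed by hand in the example. One way to choose them from the data instead is a cross-validated grid search over the same pipeline. The grid values and the 3-fold setting below are arbitrary assumptions, kept small so the sketch runs quickly.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

pipeline = Pipeline([("pca", PCA()), ("knn", KNeighborsClassifier())])

# Step parameters are addressed as <step name>__<parameter>
param_grid = {
    "pca__n_components": [10, 20, 30],
    "knn__n_neighbors": [3, 5],
}

# PCA is refit inside every fold, so the validation folds never leak into it
search = GridSearchCV(pipeline, param_grid, cv=3)
search.fit(X_train, y_train)

test_accuracy = search.score(X_test, y_test)
print("Best parameters:", search.best_params_)
print(f"Test Accuracy: {test_accuracy:.4f}")
```

Keeping PCA inside the pipeline matters here: fitting it once on all the training data before cross-validating would let each validation fold influence the projection.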







6  Train a KNN Classifier on the Wine dataset with and without feature
scaling. Compare model accuracy in both cases.




              from sklearn.datasets import load_wine
              from sklearn.model_selection import train_test_split
              from sklearn.neighbors import KNeighborsClassifier
              from sklearn.preprocessing import StandardScaler
              from sklearn.metrics import accuracy_score

              # Load the Wine dataset
              X, y = load_wine(return_X_y=True)

              # Train-test split
              X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

              # ----------- KNN Without Feature Scaling -----------
              knn_no_scaling = KNeighborsClassifier(n_neighbors=5)
              knn_no_scaling.fit(X_train, y_train)
              y_pred_no_scaling = knn_no_scaling.predict(X_test)
              accuracy_no_scaling = accuracy_score(y_test, y_pred_no_scaling)

              # ----------- KNN With Feature Scaling -----------
              scaler = StandardScaler()
              X_train_scaled = scaler.fit_transform(X_train)
              X_test_scaled = scaler.transform(X_test)

              knn_with_scaling = KNeighborsClassifier(n_neighbors=5)
              knn_with_scaling.fit(X_train_scaled, y_train)
              y_pred_with_scaling = knn_with_scaling.predict(X_test_scaled)
              accuracy_with_scaling = accuracy_score(y_test, y_pred_with_scaling)

              # Print results
              print(f"Accuracy without scaling: {accuracy_no_scaling:.4f}")
              print(f"Accuracy with scaling:    {accuracy_with_scaling:.4f}")




Output:


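        Accuracy without scaling: 0.6944
        Accuracy with scaling:    0.9722

        Without scaling, features with large numeric ranges (such as proline) dominate the Euclidean distance. Standardizing puts every feature on a comparable scale, so all of them contribute to the neighbor search.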
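The gap between the two accuracies comes from how differently the Wine features are scaled. A short sketch such as the following makes that visible by listing the standard deviation of each raw feature; the variable names are chosen for this illustration.

```python
import numpy as np
from sklearn.datasets import load_wine

wine = load_wine()
stds = wine.data.std(axis=0)

# List features from the largest to the smallest spread
for name, s in sorted(zip(wine.feature_names, stds), key=lambda p: p[1], reverse=True):
    print(f"{name:<30} std = {s:10.3f}")

largest = wine.feature_names[int(np.argmax(stds))]
print("Feature with the largest spread:", largest)
```

A feature whose spread is hundreds of times larger than the others effectively decides which points count as neighbors until the data is standardized.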




7  Train a PCA model on the Wine dataset and print the explained variance
ratio of each principal component.




              from sklearn.datasets import load_wine
              from sklearn.preprocessing import StandardScaler
              from sklearn.decomposition import PCA

              # Step 1: Load the Wine dataset
              X, y = load_wine(return_X_y=True)

              # Step 2: Standardize the data
              scaler = StandardScaler()
              X_scaled = scaler.fit_transform(X)

              # Step 3: Apply PCA
              pca = PCA()
              X_pca = pca.fit_transform(X_scaled)

              # Step 4: Print the explained variance ratio of each principal component
              explained_variance = pca.explained_variance_ratio_
              for i, var_ratio in enumerate(explained_variance, start=1):
                  print(f"Principal Component {i}: {var_ratio:.4f}")


 Output:


            Principal Component 1: 0.3620
            Principal Component 2: 0.1921
            Principal Component 3: 0.1112
            Principal Component 4: 0.0707
            Principal Component 5: 0.0656
            Principal Component 6: 0.0494
            Principal Component 7: 0.0424
            Principal Component 8: 0.0268
            Principal Component 9: 0.0222
            Principal Component 10: 0.0193
            Principal Component 11: 0.0174
            Principal Component 12: 0.0130
            Principal Component 13: 0.0080
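The per-component ratios are usually read cumulatively to decide how many components to keep. The sketch below assumes a 95% variance target, which is a common but arbitrary threshold, and shows two equivalent routes: a manual cumulative sum and passing a float to `n_components`.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_wine(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

# Route 1: cumulative sum of the explained variance ratios
cumulative = np.cumsum(PCA().fit(X_scaled).explained_variance_ratio_)
n_for_95 = int(np.argmax(cumulative >= 0.95)) + 1  # first index reaching the target
print("Cumulative explained variance:", np.round(cumulative, 4))
print("Components needed for 95% variance:", n_for_95)

# Route 2: a float in (0, 1) asks PCA to keep just enough components
pca_95 = PCA(n_components=0.95).fit(X_scaled)
print("Components kept by PCA(n_components=0.95):", pca_95.n_components_)
```

Both routes should agree on the number of components; the float form is convenient inside a pipeline because the count is recomputed on whatever data the pipeline is fit to.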





8  Train a KNN Classifier on the PCA-transformed dataset (retain top 2
components). Compare the accuracy with the original dataset.



                from sklearn.datasets import load_wine
                from sklearn.model_selection import train_test_split
                from sklearn.neighbors import KNeighborsClassifier
                from sklearn.preprocessing import StandardScaler
                from sklearn.decomposition import PCA
                from sklearn.metrics import accuracy_score

                # Load the Wine dataset
                X, y = load_wine(return_X_y=True)

                # Split into train and test sets
                X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

                # Standardize features for both original and PCA datasets
                scaler = StandardScaler()
                X_train_scaled = scaler.fit_transform(X_train)
                X_test_scaled = scaler.transform(X_test)

                # ----------- KNN on Original Data -----------
                knn_original = KNeighborsClassifier(n_neighbors=5)
                knn_original.fit(X_train_scaled, y_train)
                y_pred_original = knn_original.predict(X_test_scaled)
                accuracy_original = accuracy_score(y_test, y_pred_original)

                # ----------- PCA Transformation (Top 2 Components) -----------
                pca = PCA(n_components=2)
                X_train_pca = pca.fit_transform(X_train_scaled)
                X_test_pca = pca.transform(X_test_scaled)

                # ----------- KNN on PCA-Transformed Data -----------
                knn_pca = KNeighborsClassifier(n_neighbors=5)
                knn_pca.fit(X_train_pca, y_train)
                y_pred_pca = knn_pca.predict(X_test_pca)
                accuracy_pca = accuracy_score(y_test, y_pred_pca)

                # Print accuracies
                print(f"Accuracy on original dataset: {accuracy_original:.4f}")
                print(f"Accuracy on PCA-transformed dataset (2 components): {accuracy_pca:.4f}")




 Output:



        Accuracy on original dataset: 0.9722
        Accuracy on PCA-transformed dataset (2 components): 0.8611
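Two components are only one point on the trade-off between compression and accuracy. A sketch like the one below sweeps several component counts on the same split; the list of counts is an arbitrary assumption, and a single train/test split gives only a rough picture.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

accuracies = {}
for n in [1, 2, 3, 5, 8, 13]:
    # Scaling and PCA are fit on the training split only, inside the pipeline
    model = make_pipeline(
        StandardScaler(),
        PCA(n_components=n),
        KNeighborsClassifier(n_neighbors=5),
    )
    model.fit(X_train, y_train)
    accuracies[n] = model.score(X_test, y_test)
    print(f"{n:>2} components: accuracy = {accuracies[n]:.4f}")
```

With all 13 components PCA is only a rotation of the scaled data, so Euclidean KNN should behave the same as on the scaled original features.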








9  Train a KNN Classifier with different distance metrics (euclidean,
manhattan) on the scaled Wine dataset and compare the results.




            from sklearn.datasets import load_wine
            from sklearn.model_selection import train_test_split
            from sklearn.neighbors import KNeighborsClassifier
            from sklearn.preprocessing import StandardScaler
            from sklearn.metrics import accuracy_score

            # Load the Wine dataset
            X, y = load_wine(return_X_y=True)

            # Train-test split
            X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

            # Scale the features
            scaler = StandardScaler()
            X_train_scaled = scaler.fit_transform(X_train)
            X_test_scaled = scaler.transform(X_test)

            # ----------- KNN with Euclidean distance -----------
            knn_euclidean = KNeighborsClassifier(n_neighbors=5, metric='euclidean')
            knn_euclidean.fit(X_train_scaled, y_train)
            y_pred_euclidean = knn_euclidean.predict(X_test_scaled)
            accuracy_euclidean = accuracy_score(y_test, y_pred_euclidean)

            # ----------- KNN with Manhattan distance -----------
            knn_manhattan = KNeighborsClassifier(n_neighbors=5, metric='manhattan')
            knn_manhattan.fit(X_train_scaled, y_train)
            y_pred_manhattan = knn_manhattan.predict(X_test_scaled)
            accuracy_manhattan = accuracy_score(y_test, y_pred_manhattan)

            # Print the accuracies
            print(f"Accuracy with Euclidean distance: {accuracy_euclidean:.4f}")
            print(f"Accuracy with Manhattan distance: {accuracy_manhattan:.4f}")




 Output:


        Accuracy with Euclidean distance: 0.9722
        Accuracy with Manhattan distance: 0.9722
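For reference, the two metrics differ only in how per-feature differences are combined. The sketch below computes both by hand on two made-up vectors, then checks that scikit-learn's Minkowski metric with `p=1` behaves like `metric='manhattan'` on the scaled Wine data; the vectors and variable names are assumptions for this illustration.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Two made-up points with differences (3, 4, 0)
a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 6.0, 3.0])

euclidean = np.sqrt(np.sum((a - b) ** 2))  # sqrt(9 + 16 + 0) = 5.0
manhattan = np.sum(np.abs(a - b))          # 3 + 4 + 0 = 7.0
print(f"Euclidean: {euclidean}, Manhattan: {manhattan}")

# Minkowski distance generalizes both: p=1 is Manhattan, p=2 is Euclidean
X, y = load_wine(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)
knn_p1 = KNeighborsClassifier(n_neighbors=5, metric="minkowski", p=1).fit(X_scaled, y)
knn_l1 = KNeighborsClassifier(n_neighbors=5, metric="manhattan").fit(X_scaled, y)
same = np.array_equal(knn_p1.predict(X_scaled), knn_l1.predict(X_scaled))
print("p=1 matches metric='manhattan':", same)
```

Identical test accuracies for the two metrics, as in the output above, only mean that both rankings led to the same votes on this particular split; on other data the two metrics can disagree.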






10  You are working with a high-dimensional gene expression dataset to
classify patients with different types of cancer.
Due to the large number of features and a small number of samples, traditional models
overfit.

      Explain how you would:

      Use PCA to reduce dimensionality
      Decide how many components to keep
      Use KNN for classification post-dimensionality reduction
      Evaluate the model
      Justify this pipeline to your stakeholders as a robust solution for real-world
      biomedical data?







                import numpy as np
                import matplotlib.pyplot as plt
                from sklearn.datasets import make_classification
                from sklearn.decomposition import PCA
                from sklearn.neighbors import KNeighborsClassifier
                from sklearn.model_selection import train_test_split, cross_val_score
                from sklearn.pipeline import make_pipeline
                from sklearn.preprocessing import StandardScaler
                from sklearn.metrics import classification_report, accuracy_score

                # Simulate a high-dimensional gene expression dataset (many features, few samples)
                X, y = make_classification(n_samples=200, n_features=1000, n_informative=50, n_redundant=0, random_state=42)

                # Step 1: Split first, so the test set never influences scaling or PCA
                X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

                # Step 2: Standardize using statistics from the training set only
                scaler = StandardScaler()
                X_train_scaled = scaler.fit_transform(X_train)
                X_test_scaled = scaler.transform(X_test)

                # Step 3: Fit PCA without limiting components to analyze explained variance
                pca_full = PCA().fit(X_train_scaled)

                # Plot cumulative explained variance to decide how many components to keep
                cum_var_exp = np.cumsum(pca_full.explained_variance_ratio_)
                plt.plot(np.arange(1, len(cum_var_exp) + 1), cum_var_exp, marker='o')
                plt.xlabel('Number of Principal Components')
                plt.ylabel('Cumulative Explained Variance')
                plt.title('Explained Variance vs Number of Components')
                plt.grid(True)
                plt.show()

                # Choose the smallest number of components that retains 95% variance
                n_components = int(np.argmax(cum_var_exp >= 0.95)) + 1
                print(f"Number of components chosen to retain 95% variance: {n_components}")

                # Step 4: Project train and test data onto the chosen components
                pca = PCA(n_components=n_components)
                X_train_pca = pca.fit_transform(X_train_scaled)
                X_test_pca = pca.transform(X_test_scaled)

                # Step 5: Train KNN on the reduced data
                knn = KNeighborsClassifier(n_neighbors=5)
                knn.fit(X_train_pca, y_train)

                # Step 6: Predict and evaluate on the held-out test set
                y_pred = knn.predict(X_test_pca)
                accuracy = accuracy_score(y_test, y_pred)
                print(f"Test Accuracy after PCA + KNN: {accuracy:.4f}")

                # Detailed classification report (precision, recall, F1 per class)
                print("\nClassification Report:\n", classification_report(y_test, y_pred))

                # Step 7: Cross-validate the whole pipeline so scaling and PCA are refit inside each fold
                cv_pipeline = make_pipeline(StandardScaler(), PCA(n_components=0.95), KNeighborsClassifier(n_neighbors=5))
                cv_scores = cross_val_score(cv_pipeline, X, y, cv=5)
                print(f"Cross-Validation Accuracy: {cv_scores.mean():.4f} ± {cv_scores.std():.4f}")



 Output:



              The numbers below are illustrative placeholders that show the shape of the output, not results from a recorded run; rerun the script for actual values.

              Plot of cumulative explained variance showing how many components capture 95% variance

              Number of components chosen (e.g.):


              Number of components chosen to retain 95% variance: 90
              Accuracy on test set (e.g.):


              Test Accuracy after PCA + KNN: 0.8950
              Classification report with precision, recall, and F1-score for each class

              Cross-validation accuracy (e.g.):


              Cross-Validation Accuracy: 0.8900 ± 0.0300













