- In the K-Dimension Tree algorithm, the data points are partitioned into a binary tree based on their features. This allows for faster computation of the distances between the new data point and the training data points.
- Brute Force KNN has a high computational cost on large datasets because it requires the computation of the distance between the new data point and all the training data points. K-Dimension Tree partitions the data points into a binary tree, so a query can skip whole regions of the feature space instead of visiting every point. This advantage is strongest for low- to moderate-dimensional data and shrinks as the number of features grows.
- Ball Tree is more appropriate for data with unevenly distributed data points because it partitions the data into hyperspheres, which can better capture the geometry of the data.
- In Brute Force KNN, the distance between the new data point and all the training data points needs to be computed to classify a single new data point. Therefore, the number of distances to be computed is equal to the number of training data points, which is 1000. Each distance involves all 10 features, so this amounts to 1000 x 10 = 10,000 per-feature difference computations.
- In K-Dimension Tree, the binary tree is constructed by recursively splitting the data points along one feature dimension at a time, cycling through the features. With median splits, the depth of the tree is about the logarithm of the number of data points to the base 2, so a query passes through roughly log2(1000) ≈ 10 splits to reach a leaf.

# 20th April Assignment-1

## Q1. What is the KNN algorithm?

The K-nearest neighbors (KNN) algorithm is a supervised machine learning algorithm used for both classification and regression tasks. It is a non-parametric method that makes predictions based on the k closest training examples in the feature space.

In the classification task, KNN determines the class of a new data point by finding the K nearest neighbors from the training data and assigning the class label that is most common among those neighbors. The algorithm measures the distances between the new data point and the training examples using a distance metric, such as Euclidean distance.

In the regression task, KNN predicts the value of a continuous target variable for a new data point by averaging the values of the K nearest neighbors. Instead of assigning a class label, KNN computes the average or weighted average of the target variable values of the K nearest neighbors and uses that as the predicted value.
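
The two prediction rules described above can be sketched in a few lines of plain Python. This is a minimal illustration rather than a production implementation: the function names and the toy data are made up for the example, and Euclidean distance is assumed as the metric.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Straight-line distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def k_nearest(X_train, y_train, query, k):
    """Return the targets of the k training points closest to `query`."""
    order = sorted(range(len(X_train)), key=lambda i: euclidean(X_train[i], query))
    return [y_train[i] for i in order[:k]]

def knn_classify(X_train, y_train, query, k=3):
    """Classification: majority vote among the k nearest neighbors."""
    votes = Counter(k_nearest(X_train, y_train, query, k))
    return votes.most_common(1)[0][0]

def knn_regress(X_train, y_train, query, k=3):
    """Regression: unweighted mean of the k nearest neighbors' targets."""
    neighbors = k_nearest(X_train, y_train, query, k)
    return sum(neighbors) / len(neighbors)

# Toy data (invented for illustration): two small clusters in 2-D.
X = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (6.0, 6.0), (7.0, 7.5), (6.5, 6.0)]
labels = ["a", "a", "a", "b", "b", "b"]
values = [1.0, 1.2, 0.8, 5.0, 5.5, 4.5]

print(knn_classify(X, labels, (1.2, 1.1), k=3))  # a
print(knn_regress(X, values, (6.4, 6.2), k=3))   # 5.0
```

Note that there is no training step: all the work happens at prediction time, when distances to the stored examples are computed.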








## Q2. How do you choose the value of K in KNN?
The choice of the value of K in KNN is a hyperparameter that needs to be tuned based on the problem at hand. Selecting the appropriate value of K is crucial, as it can impact the performance of the KNN algorithm. There is no definitive rule for choosing K, but here are a few considerations:

- Smaller values of K make the model more sensitive to individual data points, potentially leading to overfitting. This means that the decision boundary or regression line might be too complex and follow the noise in the data.
- Larger values of K provide a smoother decision boundary or regression line, but they might overlook local patterns in the data and lead to underfitting.
- For binary classification, an odd value of K avoids tied votes. With more than two classes, ties can still occur even when K is odd, so a tie-breaking rule (or distance weighting) is needed.
- The value of K should not be too large compared to the number of data points in the training set, as it may cause the majority class to dominate the prediction for any new data point.
- Typically, the value of K is chosen through experimentation and cross-validation, where different values are tested and the performance of the model is evaluated using appropriate evaluation metrics.
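
As a rough sketch of choosing K by experimentation, the snippet below scores a few candidate values with leave-one-out cross-validation on a tiny invented dataset. The helper names (`predict`, `loo_accuracy`) and the data are assumptions for illustration; one point is deliberately mislabeled so that the effect of a very small K is visible.

```python
import math
from collections import Counter

def predict(X_train, y_train, query, k):
    """Majority vote among the k nearest training points (Euclidean)."""
    order = sorted(range(len(X_train)), key=lambda i: math.dist(X_train[i], query))
    votes = Counter(y_train[i] for i in order[:k])
    return votes.most_common(1)[0][0]

def loo_accuracy(X, y, k):
    """Leave-one-out accuracy: predict each point from all the others."""
    hits = 0
    for i in range(len(X)):
        X_rest = X[:i] + X[i + 1:]
        y_rest = y[:i] + y[i + 1:]
        hits += predict(X_rest, y_rest, X[i], k) == y[i]
    return hits / len(X)

# Toy data (invented): two clusters; the point (1.5, 1.5) carries a noisy label.
X = [(1, 1), (1, 2), (2, 1), (2, 2), (1.5, 1.5),
     (6, 6), (6, 7), (7, 6), (7, 7), (6.5, 6.5)]
y = ["a", "a", "a", "a", "b", "b", "b", "b", "b", "b"]

scores = {k: loo_accuracy(X, y, k) for k in (1, 3, 9)}
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

On this data K=1 follows the noisy label, K=9 lets the larger class dominate every vote, and K=3 sits in between, which mirrors the trade-off listed above.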

## Q3. What is the difference between KNN classifier and KNN regressor?
 The main difference between the KNN classifier and KNN regressor lies in their respective prediction tasks:

KNN Classifier: It is used for classification tasks, where the goal is to assign class labels to data points based on their features. The KNN classifier determines the class membership of a new data point by considering the class labels of its K nearest neighbors. The class label that appears most frequently among the neighbors is assigned to the new data point. The output of the KNN classifier is a discrete class label.

KNN Regressor: It is used for regression tasks, where the goal is to predict continuous target variables. The KNN regressor predicts the value of a new data point by averaging the values of its K nearest neighbors' target variables. The output of the KNN regressor is a continuous numerical value.

In summary, the KNN classifier assigns class labels based on the majority vote of the K nearest neighbors, while the KNN regressor predicts a continuous value by averaging the target variable values of the K nearest neighbors.


## Q4. How do you measure the performance of KNN?

The performance of the KNN algorithm can be measured using various evaluation metrics, depending on whether it is used for classification or regression tasks. Here are some commonly used performance metrics for KNN:

For Classification Tasks:

- Accuracy: It is the most straightforward metric and represents the proportion of correctly classified instances out of the total number of instances. It is suitable for balanced datasets but may be misleading when dealing with imbalanced datasets.

- Confusion Matrix: A confusion matrix provides a detailed breakdown of the classification results. It shows the number of true positives, true negatives, false positives, and false negatives. From the confusion matrix, additional metrics can be derived:

- Precision: It measures the proportion of correctly predicted positive instances out of all instances predicted as positive. Precision focuses on the accuracy of positive predictions.
- Recall (Sensitivity or True Positive Rate): It measures the proportion of correctly predicted positive instances out of all actual positive instances. Recall focuses on the coverage of positive instances.
- F1 Score: It is the harmonic mean of precision and recall and provides a balanced measure of both metrics.
- ROC Curve and AUC: Receiver Operating Characteristic (ROC) curve is a graphical representation of the trade-off between true positive rate (sensitivity) and false positive rate (1 - specificity) at various classification thresholds. The Area Under the Curve (AUC) summarizes the ROC curve's performance in a single value, with a higher AUC indicating better performance.

For Regression Tasks:

- Mean Absolute Error (MAE): It measures the average absolute difference between the predicted values and the actual values. It provides a direct interpretation of the average prediction error.

- Mean Squared Error (MSE): It measures the average squared difference between the predicted values and the actual values. MSE gives higher weight to larger errors compared to MAE.

- Root Mean Squared Error (RMSE): It is the square root of MSE and provides an interpretable metric in the original units of the target variable.

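- R-squared (coefficient of determination): It measures the proportion of the variance in the target variable that is predictable from the independent variables. It is at most 1, with higher values indicating a better fit. It typically falls between 0 and 1, but it can be negative on held-out data when the model predicts worse than simply using the mean of the target.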

When evaluating the performance of KNN, it is important to consider the specific problem and choose the appropriate metrics that align with the task's requirements and characteristics of the dataset. Cross-validation techniques can also be used to obtain more reliable estimates of the performance metrics.
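
The metrics listed above can be computed directly from predictions. The sketch below uses hypothetical helper names and small invented prediction vectors; it assumes a binary problem with the positive class encoded as `1`.

```python
import math

def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F1 from the confusion-matrix counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": (tp + tn) / len(pairs), "precision": precision,
            "recall": recall, "f1": f1}

def regression_metrics(y_true, y_pred):
    """MAE, MSE, RMSE and R-squared for continuous predictions."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mse = sum(e * e for e in errors) / n
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return {"mae": sum(abs(e) for e in errors) / n, "mse": mse,
            "rmse": math.sqrt(mse), "r2": 1 - (mse * n) / ss_tot}

# Invented predictions, for illustration only.
clf = classification_metrics([1, 1, 1, 0, 0, 0, 0, 1], [1, 1, 0, 0, 0, 1, 0, 1])
reg = regression_metrics([3.0, 5.0, 7.0, 9.0], [2.0, 5.0, 8.0, 9.0])
print(clf)
print(reg)
```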








## Q5. What is the curse of dimensionality in KNN?
The curse of dimensionality refers to the phenomenon where the performance of certain algorithms, including K-nearest neighbors (KNN), degrades as the number of dimensions or features in the dataset increases. It poses challenges when dealing with high-dimensional data. There are a few key effects of the curse of dimensionality:

- Increased Sparsity: As the number of dimensions increases, the data becomes more sparse. In other words, the available data points become more spread out in the feature space. This sparsity makes it harder to find meaningful patterns or neighbors in the data.

- Increased Computational Complexity: The cost of each distance calculation grows linearly with the number of dimensions, while the number of data points needed to cover the feature space at the same density grows exponentially. Tree-based search structures also lose their advantage in high dimensions, so neighbor search tends toward brute-force cost, making KNN more time-consuming and resource-intensive.

- Decreased Discriminative Power: In high-dimensional spaces, the distance between the nearest and farthest points tends to become similar. Consequently, the differences between nearest neighbors become less informative, making it harder to distinguish between different classes or regression targets.

To mitigate the curse of dimensionality in KNN, some techniques can be employed, such as dimensionality reduction methods (e.g., Principal Component Analysis, t-SNE) or feature selection techniques. These approaches aim to reduce the number of dimensions while preserving meaningful information and patterns in the data.
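
The loss of contrast between near and far points can be observed with a small simulation. The sketch below draws uniform random points (an assumption made purely for illustration) and reports the ratio of the nearest distance to the farthest distance from a random query; a ratio close to 1 means the "nearest" neighbor is barely nearer than any other point.

```python
import math
import random

def nearest_to_farthest_ratio(n_points, n_dims, seed=0):
    """Ratio min(distance) / max(distance) from a random query to random points."""
    rng = random.Random(seed)
    points = [[rng.random() for _ in range(n_dims)] for _ in range(n_points)]
    query = [rng.random() for _ in range(n_dims)]
    distances = [math.dist(p, query) for p in points]
    return min(distances) / max(distances)

for d in (2, 10, 100, 1000):
    print(d, round(nearest_to_farthest_ratio(200, d), 3))
```

The ratio should climb toward 1 as the number of dimensions increases, which is the "decreased discriminative power" effect described above.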






## Q6. How do you handle missing values in KNN?
Handling missing values in KNN can be approached in the following ways:

- Deletion: One straightforward approach is to delete the instances or features with missing values. If a significant portion of the data has missing values, this approach may result in a loss of valuable information.

- Imputation: Missing values can be replaced or imputed with estimated values. In KNN, the most common imputation method involves computing the missing value based on the average or weighted average of the nearest neighbors. The algorithm finds the K nearest neighbors of the instance with the missing value and uses their known values to impute the missing value.

- Distance-based imputation: Another approach is to treat missing values as a separate category and modify the distance metric to handle missing values appropriately. For example, you can define a distance metric that ignores missing values or assigns a penalty for missing values during the distance calculation.

It's important to note that the choice of handling missing values in KNN depends on the specific dataset and the nature of the missingness. Careful consideration should be given to the impact of imputation on the overall data quality and the potential biases introduced by the imputation process. Additionally, it is advisable to evaluate the performance of KNN with different handling techniques and compare their results to determine the most suitable approach for the given problem.
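
A minimal sketch of neighbor-based imputation follows. It is a simplified illustration, not the algorithm of any particular library: missing entries are marked with `None`, distances are taken over the features both rows have, and no rescaling is applied for the number of shared features. The function names and data are invented.

```python
import math

def masked_distance(a, b):
    """Euclidean distance over the features present in both rows."""
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not shared:
        return math.inf  # nothing to compare on
    return math.sqrt(sum((x - y) ** 2 for x, y in shared))

def knn_impute(rows, k=2):
    """Fill each missing entry with the mean of that feature among the
    k nearest rows that have a value for it."""
    filled = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            donors = [r for idx, r in enumerate(rows) if idx != i and r[j] is not None]
            donors.sort(key=lambda r: masked_distance(row, r))
            nearest = donors[:k]
            if nearest:
                filled[i][j] = sum(r[j] for r in nearest) / len(nearest)
    return filled

# Invented data: the last row is missing its third feature.
rows = [
    [1.0, 2.0, 10.0],
    [1.1, 2.1, 12.0],
    [8.0, 9.0, 50.0],
    [1.05, 2.05, None],
]
print(knn_impute(rows, k=2)[3])
```

Here the missing value is filled from the two rows that resemble the incomplete row, not from the distant third row.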


## Q7. Compare and contrast the performance of the KNN classifier and regressor. Which one is better for which type of problem?
 The performance of the KNN classifier and regressor can vary depending on the specific problem and dataset characteristics. Here are some key points to compare and contrast their performance:

KNN Classifier:

Strengths:
- Simple and easy to understand.
- Non-parametric nature allows it to handle complex decision boundaries.
- Can handle multiclass classification problems.
- Can make probabilistic predictions by using the proportion of each class among the nearest neighbors as an estimate of the class probabilities.

Weaknesses:
- Computationally expensive, especially with large datasets or high-dimensional feature spaces.
- Sensitive to irrelevant and noisy features.
- Relies on the appropriate choice of the K value, which may require careful tuning.
- Performs poorly when the dataset has imbalanced class distributions.

KNN Regressor:

Strengths:
- Simple and intuitive.
- Non-parametric nature makes it suitable for capturing nonlinear relationships.
- Can handle both univariate and multivariate regression problems.
- Provides interpretable predictions by averaging the target values of nearest neighbors.

Weaknesses:
- Sensitive to outliers and noisy data points.
- Computationally expensive, especially with large datasets or high-dimensional feature spaces.
- Relies on the appropriate choice of the K value, which may require careful tuning.
- May produce smoothed predictions that do not capture sharp changes or local patterns in the data.

Choosing between the KNN classifier and regressor depends on the nature of the problem and the type of target variable. Use the KNN classifier for problems where the goal is to assign class labels to data points based on their features, and use the KNN regressor when the goal is to predict continuous target variables.







## Q8. What are the strengths and weaknesses of the KNN algorithm for classification and regression tasks, and how can these be addressed?
Strengths and weaknesses of the KNN algorithm for classification and regression tasks, along with possible mitigations, include:

Strengths:

- Intuitive and simple to understand.
- Non-parametric nature allows it to capture complex patterns and decision boundaries.
- Versatile and can handle both classification and regression tasks.
- No assumptions about the underlying data distribution.
- Can be effective with small to medium-sized datasets.

Weaknesses:

- Computationally expensive, especially with large datasets or high-dimensional feature spaces. Mitigation: Techniques like dimensionality reduction or approximate nearest neighbor algorithms can be applied to reduce computation time.
- Sensitive to the choice of K value, which may require careful tuning. Mitigation: Perform cross-validation or grid search to find the optimal value of K.
- Sensitive to the presence of irrelevant and noisy features. Mitigation: Feature selection or feature engineering techniques can be used to improve performance.
- Prone to the curse of dimensionality, leading to reduced performance in high-dimensional spaces. Mitigation: Apply dimensionality reduction techniques to reduce the number of features or use feature selection methods to choose the most relevant features.

To address these weaknesses, it is important to preprocess the data, handle missing values appropriately, perform feature selection or dimensionality reduction when necessary, and carefully tune hyperparameters such as the K value. Additionally, distance-weighted voting, in which closer neighbors count for more, can improve the performance of KNN. Overall, understanding the strengths and weaknesses of KNN can help in selecting appropriate preprocessing techniques and addressing potential challenges for improved results.


## Q9. What is the difference between Euclidean distance and Manhattan distance in KNN?

The difference between Euclidean distance and Manhattan distance lies in the way they measure the distance between two points in a feature space. In KNN, the distance metric is crucial for determining the nearest neighbors. Here's a comparison between Euclidean distance and Manhattan distance:

Euclidean Distance:

- Euclidean distance is also known as the straight-line or L2 distance.
- It calculates the square root of the sum of squared differences between corresponding features of two data points.
- It depends only on the size of the differences between points: squaring each difference discards its sign and gives large differences extra weight.
- Euclidean distance is based on the Pythagorean theorem and can be interpreted as the length of the straight line connecting two points in a multi-dimensional space.
- It is sensitive to the scale of the features and tends to give more weight to features with larger values.

Manhattan Distance:

- Manhattan distance is also known as the city block or L1 distance.
- It calculates the sum of absolute differences between corresponding features of two data points.
- It also depends only on the size of the differences, but each difference contributes linearly, so a single large difference has less influence than under Euclidean distance.
- Manhattan distance measures the distance as the sum of the lengths of the horizontal and vertical legs of the right-angled triangle formed by the differences between points.
- Like Euclidean distance, it is sensitive to the scale of the features, so features should still be scaled before use.

The choice between Euclidean distance and Manhattan distance in KNN depends on the nature of the problem and the characteristics of the dataset. Generally, Euclidean distance is more commonly used and suitable when the features have similar scales and straight-line distance is a meaningful notion of similarity. On the other hand, Manhattan distance can be useful when dealing with high-dimensional data or when robustness to a large difference in a single feature is desired.
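
For a concrete comparison, both formulas can be written out directly. The function names and the example points are chosen here for illustration.

```python
import math

def euclidean_distance(a, b):
    """Square root of the sum of squared feature differences (L2)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan_distance(a, b):
    """Sum of absolute feature differences (L1)."""
    return sum(abs(x - y) for x, y in zip(a, b))

p, q = (0.0, 0.0), (3.0, 4.0)
print(euclidean_distance(p, q))  # 5.0
print(manhattan_distance(p, q))  # 7.0
```

The two points form the classic 3-4-5 right triangle: the Euclidean distance is the hypotenuse, while the Manhattan distance is the sum of the two legs.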








## Q10. What is the role of feature scaling in KNN?
Feature scaling plays an important role in KNN because the algorithm calculates distances between data points. If the features have different scales, it can lead to biased distance calculations, resulting in some features having a larger influence on the distance than others. Feature scaling ensures that all features are on a comparable scale, which helps in making fair distance-based comparisons between data points.

The main reasons for applying feature scaling in KNN are:

- Equal Weightage: Scaling the features brings them to a similar range, ensuring that no single feature dominates the distance calculations due to its larger scale. This prevents features with larger magnitudes from overshadowing others.

- Improved Accuracy: Scaling can improve the accuracy and performance of the KNN algorithm. It helps in identifying patterns and similarities based on the actual relationships between features rather than their scales.

- Convergence Speed: Unlike gradient-based models, KNN has no iterative training phase, so scaling does not speed up convergence here. Its benefit for KNN comes entirely from making the distance calculations fair across features.

Common methods for feature scaling in KNN include:

- Min-max scaling (also known as normalization): It scales the features to a specific range, typically between 0 and 1, by subtracting the minimum value and dividing by the range.
- Standardization (Z-score normalization): It transforms the features to have zero mean and unit variance by subtracting the mean and dividing by the standard deviation.

Applying feature scaling ensures that all features contribute equally to the distance calculations in KNN and can lead to improved performance and more reliable results.
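
Both scaling methods are simple enough to write by hand. The sketch below applies them to one feature column at a time; the function names and the sample columns are illustrative assumptions.

```python
import math

def min_max_scale(column):
    """Rescale values to [0, 1]: (v - min) / (max - min)."""
    lo, hi = min(column), max(column)
    span = hi - lo
    return [(v - lo) / span if span else 0.0 for v in column]

def standardize(column):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mean = sum(column) / len(column)
    std = math.sqrt(sum((v - mean) ** 2 for v in column) / len(column))
    return [(v - mean) / std if std else 0.0 for v in column]

# Invented columns on very different scales.
ages = [20.0, 30.0, 40.0, 50.0]
incomes = [20000.0, 40000.0, 60000.0, 80000.0]

print(min_max_scale(ages))
print(min_max_scale(incomes))
print(standardize(ages))
```

After min-max scaling, both columns map to the same values despite their original ranges differing by a factor of a thousand. In practice the minimum, maximum, mean and standard deviation should be computed on the training data only and then reused for new points.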

# 21st April Assignment-2

## Q1. What is the main difference between the Euclidean distance metric and the Manhattan distance metric in KNN? How might this difference affect the performance of a KNN classifier or regressor?
 
The main difference between the Euclidean distance metric and the Manhattan distance metric in KNN lies in how they calculate the distance between two points in a feature space.

Euclidean distance:

- Euclidean distance is also known as the L2 distance or the straight-line distance.
- It measures the straight-line distance between two points. Squaring each feature difference discards its sign and gives larger differences proportionally more weight.
- It calculates the square root of the sum of squared differences between corresponding features of two data points.

Manhattan distance:

- Manhattan distance is also known as the L1 distance or the city block distance.
- It measures the distance between two points by summing the absolute differences between their feature values along each dimension.
- Each difference contributes linearly, so no single feature difference is amplified by squaring.

The difference in how Euclidean distance and Manhattan distance calculate distances can affect the performance of a KNN classifier or regressor in the following ways:

Sensitivity to feature scales:
- Both metrics are sensitive to the scales of the features. Features with larger scales will have a larger impact on the distance calculation compared to features with smaller scales.
- Because Euclidean distance squares each difference, a large-scale feature dominates it even more strongly than it dominates Manhattan distance. Feature scaling is advisable with either metric.

Influence of outliers:
- Euclidean distance is more sensitive to outliers because it incorporates the squared differences between feature values. Outliers can significantly affect the overall distance calculation.
- Manhattan distance is less affected by outliers since it considers only the absolute differences between feature values.

Neighborhood shape:
- Under Euclidean distance, the set of points within a given distance of a query is a circle (in 2-D) or a sphere, so neighborhoods look the same in every direction.
- Under Manhattan distance, that set is a diamond (a square rotated by 45 degrees in 2-D), so neighborhoods extend farthest along the feature axes. The two metrics can therefore select different neighbors and produce different decision boundaries, although the boundary itself is not a circle or a diamond in either case.

The choice between Euclidean distance and Manhattan distance in KNN depends on the specific problem and the characteristics of the data. Euclidean distance is commonly used when the features are on comparable scales and straight-line distance is a meaningful measure of similarity. Manhattan distance can be useful when robustness to a large difference in a single feature matters or when the data is high-dimensional. It is recommended to try both distance metrics and evaluate their impact on the performance of the KNN algorithm for a particular problem.







## Q2. How do you choose the optimal value of k for a KNN classifier or regressor? What techniques can be used to determine the optimal k value?

Choosing the optimal value of k for a KNN classifier or regressor is crucial for achieving good performance. The optimal value of k depends on the specific dataset and problem at hand. Here are some techniques that can be used to determine the optimal k value:

- **Cross-validation**: Split the dataset into several folds. For each candidate k, train the KNN model on all but one fold, evaluate it on the held-out fold using appropriate evaluation metrics (e.g., accuracy, F1 score, mean squared error), and repeat so that every fold serves as the validation set once. Select the k value with the best average validation performance. This technique helps in estimating the generalization performance of the model.

- **Grid search**: Define a range of possible k values and perform a grid search by training and evaluating the KNN model with each k value using cross-validation. Compute the performance metrics for each k value and select the one that gives the best results. This approach allows for an exhaustive search over multiple k values and can be combined with other hyperparameter tuning techniques.

- **Elbow method**: For regression tasks, plot the mean squared error (MSE) or another appropriate evaluation metric against different k values. Look for the k value where the error decreases significantly and then starts to level off. This point is often referred to as the "elbow" of the curve and can be considered as a good choice for the optimal k value.

- **Domain knowledge and problem understanding**: Consider the nature of the problem, the complexity of the dataset, and the expected characteristics of the data. For example, in a classification problem, if the classes are expected to be well-separated, a smaller value of k may be appropriate. On the other hand, if the classes are overlapping or noisy, a larger value of k may be beneficial.

- **Experimentation and validation**: Try different k values and evaluate their performance on a validation set or through cross-validation. Compare the results and assess which k value provides the best trade-off between bias and variance. It may also be helpful to compare the performance with different k values on multiple datasets or using resampling techniques like bootstrapping to gain more insights into the stability and robustness of the results.

It's important to note that the optimal k value may vary for different datasets and problems. It is recommended to consider multiple techniques, evaluate the performance using appropriate metrics, and select the k value that gives the best results based on the specific requirements of the problem.
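
The search over k can be sketched as a small grid search scored by k-fold cross-validation. The example below is an illustration under stated assumptions: a KNN regressor, mean squared error as the metric, interleaved folds, and a noise-free toy target `y = x^2` invented for the purpose. The helper names are hypothetical.

```python
import math

def knn_regress(X_train, y_train, query, k):
    """Mean target of the k nearest training points (Euclidean)."""
    order = sorted(range(len(X_train)), key=lambda i: math.dist(X_train[i], query))
    return sum(y_train[i] for i in order[:k]) / k

def k_fold_mse(X, y, k_neighbors, n_folds=5):
    """Average validation MSE over n_folds interleaved folds."""
    fold_errors = []
    for fold in range(n_folds):
        val_idx = list(range(fold, len(X), n_folds))
        held_out = set(val_idx)
        X_tr = [x for i, x in enumerate(X) if i not in held_out]
        y_tr = [t for i, t in enumerate(y) if i not in held_out]
        sq = [(knn_regress(X_tr, y_tr, X[i], k_neighbors) - y[i]) ** 2 for i in val_idx]
        fold_errors.append(sum(sq) / len(sq))
    return sum(fold_errors) / n_folds

# Toy data (invented): 30 evenly spaced points on y = x^2.
X = [(i / 10,) for i in range(30)]
y = [x[0] ** 2 for x in X]

scores = {k: k_fold_mse(X, y, k) for k in (1, 2, 5, 15)}
best_k = min(scores, key=scores.get)
for k, mse in scores.items():
    print(k, round(mse, 4))
print("best k:", best_k)
```

Plotting `scores` against k would give the curve used by the elbow method. On real, noisy data the folds should be shuffled (and stratified for classification) rather than interleaved by index.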







## Q3. How does the choice of distance metric affect the performance of a KNN classifier or regressor? In what situations might you choose one distance metric over the other?

The choice of distance metric in a KNN classifier or regressor can significantly affect its performance. Different distance metrics capture different aspects of the relationships between data points. Here's how the choice of distance metric can impact the performance:

Euclidean Distance:

- Euclidean distance squares each feature difference, so it measures straight-line distance and is unchanged if the feature axes are rotated.
- It is well-suited when straight-line closeness across all features together is a meaningful notion of similarity.
- Euclidean distance is commonly used when the features are continuous, on comparable scales, and their relative distances are crucial in determining similarities or differences between data points.
- Its neighborhoods are circular or spherical: all points at the same straight-line distance from the query are treated alike.

Manhattan Distance:

- Manhattan distance sums the absolute feature differences, so each feature contributes independently and linearly.
- It is appropriate when differences along individual features should simply add up, and when a single large difference should not dominate the result.
- Manhattan distance is often used for high-dimensional or grid-like data. Like Euclidean distance, it still requires the features to be on comparable scales.
- Its neighborhoods are diamond-shaped, extending farthest along the feature axes.

Other Distance Metrics:

- Besides Euclidean and Manhattan distances, there are other distance metrics that can be used in KNN, such as Minkowski distance (which generalizes both Euclidean and Manhattan distances) or Mahalanobis distance (which accounts for covariance between features).
- The choice of these distance metrics depends on the specific characteristics of the data and the problem requirements. For example, Mahalanobis distance is useful when dealing with correlated features or when the distribution of the data is non-spherical.

The choice of distance metric should be made based on the nature of the problem and the characteristics of the dataset. Consider the scales and meanings of the features, as well as how much weight a large difference in a single feature should carry. It is advisable to experiment with different distance metrics and evaluate their impact on the performance of the KNN algorithm using appropriate evaluation metrics. Additionally, domain knowledge and understanding of the problem can guide the selection of the most appropriate distance metric for a given situation.
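
Minkowski distance, mentioned above as the generalization of both metrics, makes the relationship explicit: the order `p` selects the metric. The sketch below uses an illustrative function name and example points.

```python
def minkowski_distance(a, b, p=2):
    """(sum |x - y|^p)^(1/p); p=1 gives Manhattan, p=2 gives Euclidean."""
    if p < 1:
        raise ValueError("p must be at least 1 for a valid distance metric")
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

u, v = (0.0, 0.0), (3.0, 4.0)
for p in (1, 2, 3):
    print(p, minkowski_distance(u, v, p))
```

As `p` grows, the largest single feature difference increasingly dominates the result, which is one way to think about the trade-off between the metrics when tuning `p` as a hyperparameter.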







## Q4. What are some common hyperparameters in KNN classifiers and regressors, and how do they affect the performance of the model? How might you go about tuning these hyperparameters to improve model performance?

In KNN classifiers and regressors, there are several common hyperparameters that can be tuned to improve model performance. Here are some of the key hyperparameters and their impact on the model:

Number of neighbors (K):

- The K hyperparameter determines the number of nearest neighbors considered for classification or regression.
- A smaller value of K makes the model more sensitive to noise and can lead to overfitting.
- A larger value of K can smooth out decision boundaries or regression lines, but may oversimplify the model.
- Tuning: Perform cross-validation or grid search to find the optimal K value based on the dataset and problem. Test a range of values and select the one that provides the best trade-off between bias and variance.

Distance metric:

- The distance metric determines how distances are calculated between data points.
- Common options include Euclidean distance, Manhattan distance, Minkowski distance, or customized distance functions.
- The choice of distance metric should align with the characteristics of the data and the problem requirements.
- Tuning: Experiment with different distance metrics and evaluate their impact on the model performance using appropriate evaluation metrics. Choose the distance metric that gives the best results based on the problem and data properties.

Weighting scheme:

- KNN can use different weighting schemes to consider the influence of neighbors in the decision-making process.
- Common weighting options include uniform weighting (all neighbors contribute equally) and distance-based weighting (closer neighbors have a higher influence).
- Weighting can help address imbalanced datasets or give more importance to neighbors that are closer in distance.
- Tuning: Evaluate the performance of the model using different weighting schemes and select the one that yields the best results based on the problem and data.

Feature scaling:

- Feature scaling ensures that all features contribute equally to distance calculations by putting them on a similar scale.
- Common scaling techniques include min-max scaling (normalization) or standardization (Z-score normalization).
- Scaling is important as it prevents features with larger scales from dominating the distance calculations.
- Tuning: Apply different scaling techniques to the features and assess their impact on model performance. Choose the scaling technique that results in the best performance.

Other considerations:

- Other settings, such as how ties are broken when two classes receive the same number of votes, may also impact model performance.
- Additionally, preprocessing steps like feature selection, dimensionality reduction, or handling of missing values can influence model performance.

To tune these hyperparameters and improve model performance:

- Perform cross-validation or grid search over a range of hyperparameter values.
- Evaluate the model using appropriate evaluation metrics.
- Select the hyperparameter values that yield the best performance on a validation set or through cross-validation.
- It's important to avoid overfitting by using a separate test set to assess the final model's performance.

Hyperparameter tuning should be guided by domain knowledge, problem understanding, and careful experimentation to identify the optimal combination of hyperparameters that leads to the best performance for the specific dataset and problem.
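
To illustrate the weighting-scheme hyperparameter, the sketch below compares uniform voting with inverse-distance voting on a tiny invented 1-D dataset. The function name, the option strings `"uniform"` and `"distance"`, and the small constant that guards against division by zero are all choices made for this example.

```python
import math
from collections import defaultdict

def weighted_knn_classify(X_train, y_train, query, k=3, weights="uniform"):
    """Vote among the k nearest neighbors; with weights="distance",
    each vote counts 1 / distance so closer neighbors matter more."""
    order = sorted(range(len(X_train)), key=lambda i: math.dist(X_train[i], query))[:k]
    scores = defaultdict(float)
    for i in order:
        d = math.dist(X_train[i], query)
        scores[y_train[i]] += 1.0 / (d + 1e-12) if weights == "distance" else 1.0
    return max(scores, key=scores.get)

# Invented data: one close "a" neighbor, two farther "b" neighbors.
X = [(0.0,), (2.0,), (2.2,)]
y = ["a", "b", "b"]
query = (0.5,)

print(weighted_knn_classify(X, y, query, k=3, weights="uniform"))   # b
print(weighted_knn_classify(X, y, query, k=3, weights="distance"))  # a
```

With uniform weights the two distant "b" points outvote the single nearby "a" point. With distance weighting the nearby point wins, so the two settings can give different predictions on the same data and are worth comparing during tuning.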



## Q5. How does the size of the training set affect the performance of a KNN classifier or regressor? What techniques can be used to optimize the size of the training set?
The size of the training set can significantly impact the performance of a KNN classifier or regressor. Here's how the size of the training set affects the performance and some techniques to optimize its size:

1. Impact on Performance:

- Small Training Set: When the training set is small, the model may not capture the underlying patterns and relationships in the data effectively. The model may be prone to overfitting, resulting in poor generalization to unseen data.
- Large Training Set: A larger training set provides more diverse examples and a better representation of the data distribution. It helps the model generalize better to unseen instances and reduces the risk of overfitting.

2. Techniques to Optimize Training Set Size:

- Cross-validation: Utilize cross-validation techniques, such as k-fold cross-validation or stratified sampling, to make the most of the available data. Cross-validation allows for training and evaluating the model on different subsets of the data, providing a more robust assessment of performance.
- Data augmentation: If the training set is limited, data augmentation techniques can be employed to artificially increase the size and diversity of the training set. This can involve techniques such as adding noise, rotating, flipping, or scaling the existing data to generate new samples.
- Sampling techniques: If the training set is too large, sampling techniques such as random sampling, stratified sampling, or cluster-based sampling can be used to reduce the size of the training set while maintaining its representativeness. This can help to reduce training time and memory requirements without sacrificing performance.
- Active learning: In situations where labeling new instances is more expensive or time-consuming than training the model, active learning techniques can be applied. These techniques intelligently select the most informative instances for labeling, gradually expanding the training set while maximizing the model's performance.
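As a minimal sketch of the sampling idea above, the snippet below halves the iris dataset while preserving class proportions. The dataset choice and the 50% fraction are assumptions made for illustration, not values from the original answer.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Keep half of the data; stratify=y preserves the class proportions.
X_sub, _, y_sub, _ = train_test_split(
    X, y, train_size=0.5, stratify=y, random_state=0
)

print("Original class counts: ", np.bincount(y))
print("Subsample class counts:", np.bincount(y_sub))
```

Because KNN keeps the whole training set in memory, a representative subsample like this directly reduces both storage and prediction time.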

The choice of the optimal training set size depends on the specific problem, available resources, and trade-offs between model performance and computational constraints. It is important to find a balance between having enough data to capture the underlying patterns and minimizing the risk of overfitting. Regular monitoring and evaluation of the model's performance using appropriate metrics can help determine the optimal training set size for a given problem.
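One way to monitor how performance changes with training set size is a learning curve. The following is a rough sketch using scikit-learn's `learning_curve` on the iris dataset; the dataset, `n_neighbors=5`, and the chosen size fractions are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import learning_curve
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Evaluate a 5-NN classifier on growing fractions of the training folds.
# shuffle=True matters here because the iris rows are ordered by class.
train_sizes, train_scores, test_scores = learning_curve(
    KNeighborsClassifier(n_neighbors=5),
    X,
    y,
    train_sizes=np.linspace(0.2, 1.0, 5),
    cv=5,
    shuffle=True,
    random_state=0,
)

for n, score in zip(train_sizes, test_scores.mean(axis=1)):
    print(f"{n:3d} training samples -> mean CV accuracy {score:.3f}")
```

If the validation score stops improving as the training size grows, adding more data is unlikely to help much and a smaller set may be enough.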







## Q6. What are some potential drawbacks of using KNN as a classifier or regressor? How might you overcome these drawbacks to improve the performance of the model?

While KNN is a simple and intuitive algorithm, it has some potential drawbacks as a classifier or regressor. Here are a few drawbacks and approaches to overcome them:

Computational Complexity:

- KNN requires computing distances between the query instance and all training instances, making it computationally expensive for large datasets.
- Overcoming this drawback:
  - Implement efficient data structures like **KD-trees** or **ball trees** to speed up nearest neighbor searches.
  - Utilize approximate nearest neighbor algorithms to trade off accuracy for computational efficiency.
  - Reduce the dimensionality of the data using feature selection or dimensionality reduction techniques to reduce the computational burden.
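The search structures mentioned above can be selected through the `algorithm` parameter of scikit-learn's neighbor estimators. The sketch below, on synthetic low-dimensional data (an assumption made for illustration), checks that the tree-based searches return the same neighbor distances as brute force.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 3))      # synthetic, low-dimensional data
queries = rng.normal(size=(5, 3))

results = {}
for algorithm in ("brute", "kd_tree", "ball_tree"):
    nn = NearestNeighbors(n_neighbors=5, algorithm=algorithm).fit(X)
    distances, indices = nn.kneighbors(queries)
    results[algorithm] = distances

# The exact index structures return the same neighbor distances as brute force.
print(np.allclose(results["brute"], results["kd_tree"]))
print(np.allclose(results["brute"], results["ball_tree"]))
```

The trees change only how the neighbors are found, not which neighbors are returned, so they trade index-building time for faster queries.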

Sensitivity to Irrelevant Features:

- KNN considers all features equally, which means irrelevant or noisy features can negatively impact its performance.
- Overcoming this drawback:
  - Perform feature selection or feature engineering to identify and remove irrelevant or redundant features.
  - Use a separate selection step, such as L1-based feature selection with a regularized linear model, to pick relevant features before fitting KNN. KNN itself has no training phase in which features could be selected automatically.
  - Apply dimensionality reduction techniques like Principal Component Analysis (PCA) to reduce the dimensionality and focus on the most informative features.
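To illustrate the effect of irrelevant features, the sketch below appends 20 synthetic noise columns to the iris data and compares KNN with and without a univariate selection step. The noise columns, `SelectKBest` with `f_classif`, and `k=4` are illustrative assumptions, not part of the original answer.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Append 20 pure-noise columns (synthetic, for illustration only).
rng = np.random.default_rng(0)
X_noisy = np.hstack([X, rng.normal(size=(X.shape[0], 20))])

plain = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
selected = make_pipeline(
    StandardScaler(),
    SelectKBest(f_classif, k=4),   # keep the 4 highest-scoring features
    KNeighborsClassifier(n_neighbors=5),
)

score_plain = cross_val_score(plain, X_noisy, y, cv=5).mean()
score_selected = cross_val_score(selected, X_noisy, y, cv=5).mean()
print(f"All 24 features:  {score_plain:.3f}")
print(f"Top 4 by F-score: {score_selected:.3f}")
```

Placing the selector inside the pipeline means it is refit on each training fold, so the comparison does not leak information from the validation folds.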

Sensitivity to Outliers:

- KNN can be sensitive to outliers since it considers all neighbors equally, including those that are far away and potentially noisy.
- Overcoming this drawback:
  - Outlier detection techniques can be applied to identify and handle outliers before training the KNN model.
  - Consider distance-weighted voting, which gives less weight to neighbors that are far away, or apply distance-based outlier removal techniques.
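A small sketch of inverse-distance voting shows how distant neighbors lose influence. The helper `weighted_knn_predict` and the toy data are hypothetical, written only for this illustration.

```python
import numpy as np

def weighted_knn_predict(X_train, y_train, query, k=3, eps=1e-12):
    """Predict a class by inverse-distance-weighted voting (hypothetical helper)."""
    dists = np.linalg.norm(X_train - query, axis=1)
    nearest = np.argsort(dists)[:k]
    weights = 1.0 / (dists[nearest] + eps)   # closer neighbors get larger weights
    classes = np.unique(y_train)
    votes = np.array([weights[y_train[nearest] == c].sum() for c in classes])
    return classes[np.argmax(votes)]

# Toy 1-D data: one close neighbor of class 0, two distant neighbors of class 1.
X_train = np.array([[0.9], [3.0], [3.5], [10.0]])
y_train = np.array([0, 1, 1, 1])
query = np.array([1.0])

nearest = np.argsort(np.linalg.norm(X_train - query, axis=1))[:3]
majority = np.bincount(y_train[nearest]).argmax()

print("Unweighted majority vote:", majority)
print("Distance-weighted vote:  ", weighted_knn_predict(X_train, y_train, query))
```

Here the plain majority vote follows the two distant points, while the weighted vote follows the single close one.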

Imbalanced Data:

- KNN can struggle with imbalanced datasets where one class is significantly more prevalent than others.
- Overcoming this drawback:
  - Apply resampling techniques like oversampling the minority class or undersampling the majority class to balance the dataset.
  - Use techniques like Synthetic Minority Over-sampling Technique (SMOTE) to generate synthetic examples of the minority class.
  - Weight the neighbors' votes, for example by distance or with a custom weighting that favors the minority class, so that minority-class neighbors carry more influence during classification.
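Random oversampling, the simplest of the resampling options above, can be sketched with NumPy alone. The helper `random_oversample` and the 90/10 synthetic dataset are assumptions made for illustration; in practice it would be applied to the training split only.

```python
import numpy as np

def random_oversample(X, y, random_state=0):
    """Duplicate minority-class rows at random until every class matches the largest."""
    rng = np.random.default_rng(random_state)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    keep = []
    for cls, count in zip(classes, counts):
        idx = np.flatnonzero(y == cls)
        # Draw the missing rows with replacement from this class.
        extra = rng.choice(idx, size=target - count, replace=True)
        keep.append(np.concatenate([idx, extra]))
    keep = np.concatenate(keep)
    return X[keep], y[keep]

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))           # synthetic features
y = np.array([0] * 90 + [1] * 10)       # 90/10 class imbalance

X_bal, y_bal = random_oversample(X, y)
print(np.bincount(y), "->", np.bincount(y_bal))
```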

Determining the Optimal K:

- Selecting the optimal value of K is crucial for KNN's performance. A K that is too small makes the model sensitive to noise (overfitting), while a K that is too large oversmooths the decision boundary (underfitting).
- Overcoming this drawback:
  - Perform hyperparameter tuning using techniques like cross-validation or grid search to find the optimal K value for the dataset.
  - Evaluate the model's performance for different K values using appropriate evaluation metrics and select the one that provides the best trade-off between bias and variance.
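The grid search mentioned above could look roughly like this with scikit-learn's `GridSearchCV`. The iris dataset and the 1 to 20 search range are assumptions chosen for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Try every K from 1 to 20 with 5-fold cross-validation.
search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": list(range(1, 21))},
    cv=5,
    scoring="accuracy",
)
search.fit(X, y)

print("Best K:", search.best_params_["n_neighbors"])
print("Mean CV accuracy:", round(search.best_score_, 3))
```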

By addressing these potential drawbacks through appropriate preprocessing, feature selection, hyperparameter tuning, and data balancing techniques, it is possible to improve the performance of the KNN classifier or regressor and make it more robust and accurate.







# 22nd April Assignment-3

## Q1. Write a Python code to implement the KNN classifier algorithm on load_iris dataset in sklearn.datasets.

In [2]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Load the iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a KNN classifier object
knn = KNeighborsClassifier(n_neighbors=3)

# Train the classifier
knn.fit(X_train, y_train)

# Make predictions on the test set
y_pred = knn.predict(X_test)

# Calculate the accuracy of the classifier
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)


Accuracy: 1.0


## Q2. Write a Python code to implement the KNN regressor algorithm on load_boston dataset in sklearn.datasets.

In [6]:
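from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

# load_boston was deprecated in scikit-learn 1.0 and removed in 1.2, so it is
# not imported here. The original cell left the Boston loading commented out
# and reused X and y from the earlier iris cell; the iris data is loaded
# explicitly so that this cell runs on its own.
iris = load_iris()
X = iris.data
y = iris.target  # class labels (0, 1, 2) treated as a numeric regression target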

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a KNN regressor object
knn = KNeighborsRegressor(n_neighbors=3)

# Train the regressor
knn.fit(X_train, y_train)

# Make predictions on the test set
y_pred = knn.predict(X_test)

# Calculate the mean squared error (MSE)
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error:", mse)


Mean Squared Error: 0.007407407407407404


## Q3. Write a Python code snippet to find the optimal value of K for the KNN classifier algorithm using cross-validation on load_iris dataset in sklearn.datasets.


In [7]:
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Load the iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Define a list of K values to test
k_values = list(range(1, 21))

# Perform cross-validation for each K value
cv_scores = []
for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, X, y, cv=5, scoring='accuracy')
    cv_scores.append(scores.mean())

# Find the optimal K value with the highest cross-validated accuracy
optimal_k = k_values[cv_scores.index(max(cv_scores))]
print("Optimal K:", optimal_k)


Optimal K: 6


## Q5. Write a Python code snippet to implement the KNN classifier algorithm with weighted voting on load_iris dataset in sklearn.datasets.

In [9]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Load the iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a KNN classifier object with weighted voting
knn = KNeighborsClassifier(n_neighbors=3, weights='distance')

# Train the classifier
knn.fit(X_train, y_train)

# Make predictions on the test set
y_pred = knn.predict(X_test)

# Calculate the accuracy of the classifier
accuracy = knn.score(X_test, y_test)
print("Accuracy:", accuracy)




Accuracy: 1.0


## Q6. Implement a function to standardise the features before applying KNN classifier.

In [10]:
from sklearn.preprocessing import StandardScaler

def preprocess_data(X_train, X_test):
    # Create a StandardScaler object
    scaler = StandardScaler()

    # Fit the scaler on the training data
    scaler.fit(X_train)

    # Standardize the training and testing data
    X_train_std = scaler.transform(X_train)
    X_test_std = scaler.transform(X_test)

    return X_train_std, X_test_std


from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Load the iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Preprocess the data by standardizing the features
X_train_std, X_test_std = preprocess_data(X_train, X_test)

# Create a KNN classifier object
knn = KNeighborsClassifier(n_neighbors=3)

# Train the classifier on the standardized training data
knn.fit(X_train_std, y_train)

# Make predictions on the standardized test data
y_pred = knn.predict(X_test_std)

# Calculate the accuracy of the classifier
accuracy = knn.score(X_test_std, y_test)
print("Accuracy:", accuracy)


Accuracy: 1.0
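As an alternative sketch, the scaler and the classifier can be combined in a scikit-learn pipeline. This goes beyond what the question asks for; the 5-fold cross-validation setup is an assumption.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# The scaler is refit on the training portion of every fold.
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))
scores = cross_val_score(model, X, y, cv=5)

print("Mean CV accuracy:", scores.mean())
```

With the pipeline, the scaling statistics are always computed from training data only, so no manual `fit`/`transform` bookkeeping is needed during cross-validation.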


## Q7. Write a Python function to calculate the euclidean distance between two points.

In [13]:
import numpy as np

def euclidean_distance(point1, point2):
    # Calculate the squared Euclidean distance between the two points
    squared_distance = np.sum(np.square(point1 - point2))
    
    # Take the square root of the squared distance to obtain the Euclidean distance
    distance = np.sqrt(squared_distance)
    
    return distance

# Scalars are treated as one-dimensional points: |3 - 8| = 5
d = euclidean_distance(3, 8)

print(d)

5.0


## Q8. Write a Python function to calculate the manhattan distance between two points.

In [15]:
import numpy as np

def manhattan_distance(point1, point2):
    # Calculate the Manhattan distance between the two points
    distance = np.sum(np.abs(point1 - point2))
    
    return distance

point1 = np.array([1, 2, 4])
point2 = np.array([4, 5, 6])

distance = manhattan_distance(point1, point2)
print("Manhattan Distance:", distance)


Manhattan Distance: 8
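Both distances are special cases of the Minkowski distance, with p = 1 giving Manhattan and p = 2 giving Euclidean. The helper `minkowski_distance` below is a hypothetical name used for this sketch, which reuses the two points from the Manhattan example.

```python
import numpy as np

def minkowski_distance(point1, point2, p):
    """Minkowski distance of order p between two points (hypothetical helper)."""
    diff = np.abs(np.asarray(point1, dtype=float) - np.asarray(point2, dtype=float))
    return np.sum(diff ** p) ** (1.0 / p)

point1 = np.array([1, 2, 4])
point2 = np.array([4, 5, 6])

print("p=1 (Manhattan):", minkowski_distance(point1, point2, 1))
print("p=2 (Euclidean):", minkowski_distance(point1, point2, 2))
```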
