#### Instructions
###### Follow the instructions given in comments prefixed with ## and write your code below them.
###### Also fill in the blanks in the partially written code.
###### Do not change the rest of the code.

### Answer the questions given at the end of this notebook within your report.


### You need to submit your GitHub repository link. Refer to Section 6: Final Submission in the PDF document for details.


In [1]:
# Import necessary libraries
import cv2
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from scipy.spatial import distance
from matplotlib.offsetbox import OffsetImage, AnnotationBbox

# Read the image Plaksha_Faculty.jpg
img = cv2.imread("Plaksha_Faculty.jpg")

# Convert the image to grayscale
gray_img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# Load Haar cascade classifier
face_cascade = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

# Detect faces in the image
faces_rect = face_cascade.detectMultiScale(gray_img, 1.05, 4, minSize=(25,25), maxSize=(50,50))

# Define text parameters
text = "Face"
font = cv2.FONT_HERSHEY_SIMPLEX
font_scale = 0.5
font_color = (0, 0, 255)  # Red color in BGR
font_thickness = 1

# Draw rectangles around faces and add labels
for (x, y, w, h) in faces_rect:
    cv2.rectangle(img, (x, y), (x+w, y+h), (0, 0, 255), 2)
    cv2.putText(img, text, (x, y-5), font, font_scale, font_color, font_thickness)

# Display the image
cv2.imshow(f"Total number of faces detected: {len(faces_rect)}", img)
cv2.waitKey(0)
cv2.destroyAllWindows()

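# Convert a clean copy of the image to HSV (img now has rectangles and labels
# drawn on it, which would otherwise leak into the face crops)
img_hsv = cv2.cvtColor(cv2.imread("Plaksha_Faculty.jpg"), cv2.COLOR_BGR2HSV)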

hue_saturation = []
face_images = []

# Extract Hue and Saturation values
for (x, y, w, h) in faces_rect:
    face = img_hsv[y:y + h, x:x + w]
    hue = np.mean(face[:, :, 0])
    saturation = np.mean(face[:, :, 1])
    hue_saturation.append((hue, saturation))
    face_images.append(face)

hue_saturation = np.array(hue_saturation)

# Perform K-Means clustering
kmeans = KMeans(n_clusters=2, random_state=42)
kmeans.fit(hue_saturation)

# Create a figure
fig, ax = plt.subplots(figsize=(12, 6))

# Plot clustered faces
for i, (x, y, w, h) in enumerate(faces_rect):
    im = OffsetImage(cv2.cvtColor(cv2.resize(face_images[i], (20, 20)), cv2.COLOR_HSV2RGB))
    ab = AnnotationBbox(im, (hue_saturation[i, 0], hue_saturation[i, 1]), frameon=False, pad=0)
    ax.add_artist(ab)
    plt.scatter(hue_saturation[i, 0], hue_saturation[i, 1], c='r' if kmeans.labels_[i] == 0 else 'b')

plt.xlabel("Hue")
plt.ylabel("Saturation")
plt.title("Face Clustering based on Hue and Saturation")
plt.grid(True)
plt.show()

# Scatter plot for clusters
fig, ax = plt.subplots(figsize=(12, 6))
cluster_0_points = np.array([hue_saturation[i] for i in range(len(kmeans.labels_)) if kmeans.labels_[i] == 0])
cluster_1_points = np.array([hue_saturation[i] for i in range(len(kmeans.labels_)) if kmeans.labels_[i] == 1])

plt.scatter(cluster_0_points[:, 0], cluster_0_points[:, 1], c='green', label="Cluster 0")
plt.scatter(cluster_1_points[:, 0], cluster_1_points[:, 1], c='blue', label="Cluster 1")

# Compute and plot centroids
centroid_0 = kmeans.cluster_centers_[0]
centroid_1 = kmeans.cluster_centers_[1]
plt.scatter(centroid_0[0], centroid_0[1], c='black', marker='x', s=100, label="Centroid 0")
plt.scatter(centroid_1[0], centroid_1[1], c='black', marker='x', s=100, label="Centroid 1")

plt.xlabel("Hue")
plt.ylabel("Saturation")
plt.title("K-Means Clustering of Faces")
plt.legend()
plt.grid(True)
plt.show()

# Template Matching
template_img = cv2.imread("Dr_Shashi_Tharoor.jpg")
template_gray = cv2.cvtColor(template_img, cv2.COLOR_BGR2GRAY)

# Detect face in template image
template_faces = face_cascade.detectMultiScale(template_gray, 1.05, 4, minSize=(25,25), maxSize=(50,50))

# Draw rectangles around detected faces
for (x, y, w, h) in template_faces:
    cv2.rectangle(template_img, (x, y), (x + w, y + h), (0, 255, 0), 3)

cv2.imshow("Template Image - Detected Face", template_img)
cv2.waitKey(0)
cv2.destroyAllWindows()

## Report:
## Answer the following questions within your report:


#### 1. What are the common distance metrics used in distance-based classification algorithms? 

The most common distance measures employed in distance-based classification techniques are:

Euclidean Distance: Calculates the straight-line distance between two points in a multi-dimensional space.

Mahalanobis Distance: Accounts for correlations between variables and rescales each feature by its variance; commonly applied in anomaly detection.

Manhattan Distance: Calculates the sum of absolute differences between coordinates (also referred to as taxicab or city block distance).

Minkowski Distance: Generalization of both Euclidean and Manhattan distances.

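Cosine Similarity: Calculates the cosine of the angle between two vectors. It is strictly a similarity rather than a distance, so it is usually converted to cosine distance (1 - cosine similarity). Applied in text classification and other high-dimensional settings.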
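
For reference, the metrics listed above can be written out as follows. This is a sketch in standard textbook notation; the symbols $\mathbf{x}, \mathbf{y} \in \mathbb{R}^n$ (the two points), $\Sigma$ (the covariance matrix of the data), and $p \ge 1$ (the Minkowski order) are assumptions introduced here and do not appear in the notebook code.

$$
\begin{aligned}
d_{\text{Euclidean}}(\mathbf{x}, \mathbf{y}) &= \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} \\
d_{\text{Manhattan}}(\mathbf{x}, \mathbf{y}) &= \sum_{i=1}^{n} |x_i - y_i| \\
d_{\text{Minkowski}}(\mathbf{x}, \mathbf{y}) &= \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p} \\
d_{\text{Mahalanobis}}(\mathbf{x}, \mathbf{y}) &= \sqrt{(\mathbf{x} - \mathbf{y})^{\top} \Sigma^{-1} (\mathbf{x} - \mathbf{y})} \\
d_{\text{cosine}}(\mathbf{x}, \mathbf{y}) &= 1 - \frac{\mathbf{x} \cdot \mathbf{y}}{\lVert \mathbf{x} \rVert \, \lVert \mathbf{y} \rVert}
\end{aligned}
$$

Setting $p = 1$ in the Minkowski distance gives the Manhattan distance, and $p = 2$ gives the Euclidean distance.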


#### 2. What are some real-world applications of distance-based classification algorithms? 

Distance-based classification algorithms, such as K-Nearest Neighbors (KNN), are used in the following applications:

Handwritten Digit Recognition: Applied in Optical Character Recognition (OCR) systems.

Medical Diagnosis: Assists in the classification of diseases according to patient symptoms.

Recommender Systems: Recommends products based on user interests by calculating similarity.

Anomaly Detection: Flags fraudulent financial transactions as points that lie far from normal behavior.

Face Recognition: Compares face embeddings in security and surveillance systems.

#### 3. Explain various distance metrics. 

Euclidean Distance: Measures the straight-line distance between two points in space and is commonly used in classification and clustering.

Mahalanobis Distance: Takes correlations in the data into account and rescales each feature by its variance, which makes it effective for anomaly detection and for data with correlated features.

Manhattan Distance: Calculates the total of absolute differences in coordinates, similar to movement in a grid-like direction.

Minkowski Distance: A generalization of the Manhattan and Euclidean distances governed by an order parameter p; p = 1 gives the Manhattan distance and p = 2 gives the Euclidean distance.

Cosine Similarity: Uses the angle between two vectors, ignoring their magnitudes, to measure similarity; generally applied in document and text analysis.
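
The metrics described above can be computed with `scipy.spatial.distance`, which the notebook already imports. The following is a minimal sketch: the points `a` and `b` and the random `sample` used to estimate a covariance matrix are hypothetical values chosen for illustration, not data from the face-clustering code.

```python
import numpy as np
from scipy.spatial import distance

# Two illustrative points (hypothetical values, not taken from the notebook's data)
a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 6.0, 3.0])

# Straight-line distance: sqrt(3^2 + 4^2 + 0^2)
euclidean = distance.euclidean(a, b)

# Sum of absolute coordinate differences: 3 + 4 + 0
manhattan = distance.cityblock(a, b)

# Minkowski distance of order p=3: (3^3 + 4^3 + 0^3)^(1/3)
minkowski_p3 = distance.minkowski(a, b, p=3)

# Cosine distance, i.e. 1 - cosine similarity
cosine_dist = distance.cosine(a, b)

# Mahalanobis distance needs the inverse covariance matrix of the data.
# Here it is estimated from a random sample (an assumption for illustration).
rng = np.random.default_rng(0)
sample = rng.normal(size=(200, 3))
inv_cov = np.linalg.inv(np.cov(sample, rowvar=False))
mahalanobis = distance.mahalanobis(a, b, inv_cov)

print(f"Euclidean:       {euclidean:.3f}")
print(f"Manhattan:       {manhattan:.3f}")
print(f"Minkowski (p=3): {minkowski_p3:.3f}")
print(f"Cosine distance: {cosine_dist:.3f}")
print(f"Mahalanobis:     {mahalanobis:.3f}")
```

In practice the inverse covariance matrix would be estimated from the actual feature matrix (for example, the hue and saturation values) rather than from random data.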

#### 4. What is the role of cross validation in model performance? 

Cross-validation assists in the evaluation of the model's performance by:

Detecting overfitting by checking that the model generalizes to data it was not trained on.

Giving a more reliable estimate of performance by averaging over multiple train/test splits.

Assisting in hyperparameter tuning by comparing different settings (e.g., best k value in K-NN).

Some common approaches are k-fold cross-validation, leave-one-out cross-validation (LOOCV), and stratified k-fold cross-validation.
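
As a rough illustration of the points above, the sketch below runs stratified 5-fold cross-validation for a few values of k in K-NN. The Iris dataset, the candidate values of k, and the variable names are assumptions made for this example; they are not part of the assignment code.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Small built-in dataset, used here purely for illustration
X, y = load_iris(return_X_y=True)

# Stratified k-fold keeps the class proportions similar in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

mean_scores = {}
for k in (1, 5, 15):
    model = KNeighborsClassifier(n_neighbors=k)
    # One accuracy score per fold; each fold is held out exactly once
    scores = cross_val_score(model, X, y, cv=cv)
    mean_scores[k] = scores.mean()
    print(f"k={k}: mean accuracy = {scores.mean():.3f} (+/- {scores.std():.3f})")

# Pick the k with the highest mean cross-validated accuracy
best_k = max(mean_scores, key=mean_scores.get)
print(f"Best k by cross-validation: {best_k}")
```

Averaging over folds is what makes the comparison between values of k less dependent on one particular train/test split.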

#### 5. Explain variance and bias in terms of KNN.

Bias: How far the model's predictions are from the true values. A high bias model (e.g., large k in K-NN) simplifies patterns too much, resulting in underfitting.

Variance: How much the model's predictions vary with different datasets. A high variance model (e.g., small k in K-NN) memorizes training data, resulting in overfitting.

Bias-Variance Tradeoff:

Low k (e.g., 1-NN) → Low bias, high variance (overfits).

High k → High bias, low variance (underfits).

Optimal k trades off bias and variance for improved generalization.
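
The tradeoff above can be observed by comparing training and test accuracy for small and large k. This is a minimal sketch on a synthetic two-class dataset; `make_moons`, the noise level, and the chosen values of k are assumptions for illustration only.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Noisy synthetic two-class data (hypothetical, for illustration)
X, y = make_moons(n_samples=400, noise=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

results = {}
for k in (1, 15, 200):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    train_acc = model.score(X_train, y_train)
    test_acc = model.score(X_test, y_test)
    results[k] = (train_acc, test_acc)
    print(f"k={k:>3}: train accuracy = {train_acc:.3f}, test accuracy = {test_acc:.3f}")
```

With k = 1 each training point is its own nearest neighbor, so the training accuracy is perfect even though the data is noisy; a gap between training and test accuracy is the usual sign of high variance. With a very large k the prediction approaches the majority class of a large neighborhood, which is the high-bias end of the tradeoff.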