#### Instructions
###### Follow the instructions given in the comments prefixed with ## and write your code below each one.
###### Also fill in the blanks in the partial code.
###### Do not change the rest of the code.

### Answer the questions given at the end of this notebook within your report.

### You need to submit your GitHub repository link. Refer to the PDF document for the instructions and details.





In [None]:
## import cv2
import cv2
## import numpy
import numpy as np
## import matplotlib pyplot
import matplotlib.pyplot as plt
## import KMeans cluster from sklearn
from sklearn.cluster import KMeans
## import distance from scipy.spatial
from scipy.spatial import distance
from matplotlib.offsetbox import OffsetImage, AnnotationBbox

In [None]:
## Reading the image plaksha_Faculty.jpg
img = cv2.imread('plaksha_Faculty.jpg')
  
## Convert the image to grayscale
gray_img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
  
# Loading the required haar-cascade xml classifier file
face_cascade = cv2.CascadeClassifier(cv2.data.haarcascades + 'haarcascade_frontalface_default.xml')
  
# Applying the face detection method on the grayscale image. 
## Change the parameters for better detection of faces in your case.
faces_rect = face_cascade.detectMultiScale(gray_img, 1.05, 4, minSize=(25,25), maxSize=(50,50))
 
# Define the text and font parameters
text = 'Face' ## The text you want to write
font = cv2.FONT_HERSHEY_SIMPLEX  ## Font type
font_scale = 0.5  ## Font scale factor
font_color = (0, 0, 255)  ## Text color in BGR format (here, it's red)
font_thickness = 1  ## Thickness of the text

  
# Iterating through rectangles of detected faces
for (x, y, w, h) in faces_rect:
    cv2.rectangle(img, (x, y), (x+w, y+h), (0, 0, 255), 2)
    # Use cv2.putText to add the text to the image, Use text, font, font_scale, font_color, font_thickness here
    cv2.putText(img, text, (x, y - 10), font, font_scale, font_color, font_thickness)
    
## Display the image and window title should be "Total number of face detected are #"  
cv2.imshow('Total number of face detected are ' + str(len(faces_rect)), img)
cv2.waitKey(0)
cv2.destroyAllWindows()

In [None]:

# Extract face region features (Hue and Saturation)
## Re-read the image and convert it from BGR to HSV, so the boxes and labels drawn on img above do not skew the colour features
img_hsv = cv2.cvtColor(cv2.imread('plaksha_Faculty.jpg'), cv2.COLOR_BGR2HSV)
hue_saturation = []
face_images = []  # To store detected face images

for (x, y, w, h) in faces_rect:
    face = img_hsv[y:y + h, x:x + w]
    hue = np.mean(face[:, :, 0])
    saturation = np.mean(face[:, :, 1])
    hue_saturation.append((hue, saturation))
    face_images.append(face)

hue_saturation = np.array(hue_saturation)

## Perform k-Means clustering on hue_saturation and store in kmeans
kmeans = KMeans(n_clusters=2, random_state=0).fit(hue_saturation)
#centroids = kmeans.cluster_centers_
#labels = kmeans.labels_

# Create a figure and axis
fig, ax = plt.subplots(figsize=(12, 6))

# Plot the clustered faces with custom markers
for i, (x, y, w, h) in enumerate(faces_rect):
    im = OffsetImage(cv2.cvtColor(cv2.resize(face_images[i], (20, 20)), cv2.COLOR_HSV2RGB))
    ab = AnnotationBbox(im, (hue_saturation[i, 0], hue_saturation[i, 1]), frameon=False, pad=0)
    ax.add_artist(ab)
    # add_artist does not update the axis limits, so plot the (markerless, invisible) point to autoscale the axes
    plt.plot(hue_saturation[i, 0], hue_saturation[i, 1])
    

## Put x label
plt.xlabel('Hue')
## Put y label
plt.ylabel('Saturation')
## Put title
plt.title('Face Features - Hue vs Saturation')
## Put grid
plt.grid(True)
## show the plot
plt.show()

In [None]:
# Create an empty list to store legend labels
legend_labels = []

# Create lists to store points for each cluster
cluster_0_points = []
cluster_1_points = []

# Your code for scatter plot goes here
fig, ax = plt.subplots(figsize=(12, 6))
for i, (x, y, w, h) in enumerate(faces_rect):
    if kmeans.labels_[i] == 0:
        cluster_0_points.append((hue_saturation[i, 0], hue_saturation[i, 1]))
    else:
        cluster_1_points.append((hue_saturation[i, 0], hue_saturation[i, 1]))


cluster_0_points = np.array(cluster_0_points)
# Plot points for cluster 0 in green
plt.scatter(cluster_0_points[:, 0], cluster_0_points[:, 1], c='green', label='Cluster 0', marker='o')


cluster_1_points = np.array(cluster_1_points)
# Plot points for cluster 1 in blue
plt.scatter(cluster_1_points[:, 0], cluster_1_points[:, 1], c='blue', label='Cluster 1', marker='o')

# Calculate and plot centroids
centroid_0 = kmeans.cluster_centers_[0]
centroid_1 = kmeans.cluster_centers_[1]

# Plot both the centroid for cluster 0 and cluster 1 
plt.scatter(centroid_0[0], centroid_0[1], c='red', marker='X', s=200, label='Centroid 0')
plt.scatter(centroid_1[0], centroid_1[1], c='black', marker='X', s=200, label='Centroid 1')

## Put x label
plt.xlabel('Hue')
## Put y label
plt.ylabel('Saturation')
## Put title
plt.title('K-Means Clustering of Detected Faces (Hue vs Saturation)')
## Add a legend
plt.legend()
## Add grid
plt.grid(True)
## Show the plot
plt.show()


In [None]:
## Read the template image 'Dr_Shashi_Tharoor.jpg' using cv2 and store it in template_img
template_img = cv2.imread('Dr_Shashi_Tharoor.jpg')
# Detect faces in the template image after converting it to grayscale and store them in template_faces
template_faces = face_cascade.detectMultiScale(cv2.cvtColor(template_img, cv2.COLOR_BGR2GRAY), 1.05, 4)
# Draw rectangles around the detected faces
for (x, y, w, h) in template_faces:
    cv2.rectangle(template_img, (x, y), (x + w, y + h), (0, 255, 0), 3)
cv2.imshow('Template Image - Dr Shashi Tharoor - Faces Detected: ' + str(len(template_faces)), template_img)
cv2.waitKey(0)
cv2.destroyAllWindows()      

In [None]:
# Convert a clean copy of the template image (without the rectangles drawn above) to HSV color space and store it in template_hsv
template_hsv = cv2.cvtColor(cv2.imread('Dr_Shashi_Tharoor.jpg'), cv2.COLOR_BGR2HSV)

# Extract mean hue and saturation features from the whole template image (the detected faces above used only their cropped regions).
template_hue = np.mean(template_hsv[:, :, 0])
template_saturation = np.mean(template_hsv[:, :, 1])

# Predict the cluster label for the template image and store it in template_label
template_label = kmeans.predict([[template_hue, template_saturation]])[0]

# Create a figure and axis for visualization
fig, ax = plt.subplots(figsize=(12, 6))

# Plot the clustered faces with custom markers (similar to previous code)
for i, (x, y, w, h) in enumerate(faces_rect):
    color = 'red' if kmeans.labels_[i] == 0 else 'blue'
    im = OffsetImage(cv2.cvtColor(cv2.resize(face_images[i], (20, 20)), cv2.COLOR_HSV2RGB))
    ab = AnnotationBbox(im, (hue_saturation[i, 0], hue_saturation[i, 1]), frameon=False, pad=0)
    ax.add_artist(ab)
    plt.plot(hue_saturation[i, 0], hue_saturation[i, 1], 'o', markersize=5, color=color)

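# Plot the template image in the respective cluster
if template_label == 0:
    color = 'red'
else:
    color = 'blue'
im = OffsetImage(cv2.cvtColor(cv2.resize(template_img, (20, 20)), cv2.COLOR_BGR2RGB))
ab = AnnotationBbox(im, (template_hue, template_saturation), frameon=False, pad=0)
ax.add_artist(ab)
# Mark the template in its cluster colour; this also extends the axis limits to include it
plt.plot(template_hue, template_saturation, 'o', markersize=5, color=color)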

## Put x label
plt.xlabel('Hue')
## Put y label
plt.ylabel('Saturation')
## Put title
plt.title('Clustered Faces with Template Image')
## Add grid
plt.grid(True)
## show plot
plt.show()

In [None]:
# Create an empty list to store legend labels
legend_labels = []

# Create lists to store points for each cluster
cluster_0_points = []
cluster_1_points = []

# Your code for scatter plot goes here
fig, ax = plt.subplots(figsize=(12, 6))
for i, (x, y, w, h) in enumerate(faces_rect):
    if kmeans.labels_[i] == 0:
        cluster_0_points.append((hue_saturation[i, 0], hue_saturation[i, 1]))
    else:
        cluster_1_points.append((hue_saturation[i, 0], hue_saturation[i, 1]))

# Plot points for cluster 0 in green
cluster_0_points = np.array(cluster_0_points)
plt.scatter(cluster_0_points[:, 0], cluster_0_points[:, 1], c='green', label='Cluster 0', marker='o')

# Plot points for cluster 1 in blue
cluster_1_points = np.array(cluster_1_points)
plt.scatter(cluster_1_points[:, 0], cluster_1_points[:, 1], c='blue', label='Cluster 1', marker='o')

# Calculate and plot centroids for both the clusters
centroid_0 = kmeans.cluster_centers_[0]
centroid_1 = kmeans.cluster_centers_[1]
plt.scatter(centroid_0[0], centroid_0[1], c='red', marker='X', s=200, label='Centroid 0') ## plot for centroid 0
plt.scatter(centroid_1[0], centroid_1[1], c='black', marker='X', s=200, label='Centroid 1')  ## plot for centroid 1
plt.plot(template_hue, template_saturation, marker='o', c='violet', markersize=10, label='Class ?')

## Put x label
plt.xlabel('Hue')
## Put y label
plt.ylabel('Saturation')
## Put title
plt.title('K-Means Clustering with Template Image Classification')
## Add a legend
plt.legend()
## Add grid
plt.grid(True)
## show the plot
plt.show()
## End of Lab 5 ##

## Report:
## Answer the following questions within your report:


#### 1. What are the common distance metrics used in distance-based classification algorithms? 

The common distance metrics include:
- **Euclidean Distance**: The straight-line distance between two points in n-dimensional space. It is the most widely used metric.
- **Manhattan Distance (City Block / L1 Norm)**: The sum of absolute differences between the coordinates of two points.
- **Minkowski Distance**: A generalized form of both Euclidean and Manhattan distance, parameterized by *p*.
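- **Cosine Similarity / Distance**: Cosine similarity is the cosine of the angle between two vectors, and cosine distance is 1 minus that similarity. It ignores vector length, which makes it useful for high-dimensional text data.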
- **Hamming Distance**: Counts the number of positions at which the corresponding symbols are different; commonly used for categorical or binary data.
- **Chebyshev Distance (L∞ Norm)**: The maximum of the absolute differences between coordinates.
- **Mahalanobis Distance**: Measures distance relative to the covariance of the data, so it accounts for correlations between features and is scale-invariant.
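
As a rough illustration of the list above, the sketch below evaluates several of these metrics with `scipy.spatial.distance`, which the notebook already imports. The two points `a` and `b` are made-up values, not data from the lab, and the identity matrix passed to `mahalanobis` is an assumption chosen only to keep the example short; in practice it would be the inverse covariance matrix of the dataset.

```python
import numpy as np
from scipy.spatial import distance

# Two made-up 2-D feature vectors (hypothetical values, not lab data)
a = np.array([1.0, 2.0])
b = np.array([4.0, 6.0])

d_euclidean = distance.euclidean(a, b)        # sqrt(3^2 + 4^2) = 5.0
d_manhattan = distance.cityblock(a, b)        # |3| + |4| = 7.0
d_chebyshev = distance.chebyshev(a, b)        # max(|3|, |4|) = 4.0
d_minkowski = distance.minkowski(a, b, p=3)   # (3^3 + 4^3)^(1/3)
d_cosine = distance.cosine(a, b)              # 1 - cosine of the angle between a and b

# Mahalanobis takes the inverse covariance matrix of the data.
# With the identity matrix (an assumption made here) it reduces to Euclidean distance.
d_mahalanobis = distance.mahalanobis(a, b, np.eye(2))

print(d_euclidean, d_manhattan, d_chebyshev, d_minkowski, d_cosine, d_mahalanobis)
```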

#### 2. What are some real-world applications of distance-based classification algorithms? 

- **Face recognition**: Classifying faces by comparing feature distances (as done in this lab).
- **Recommendation systems**: Finding similar users or items (collaborative filtering with KNN).
- **Medical diagnosis**: Classifying patients based on symptoms or biomarker similarity.
- **Handwriting / digit recognition**: Classifying handwritten characters by comparing pixel feature distances (e.g., MNIST dataset).
- **Anomaly / fraud detection**: Flagging transactions that are far from normal patterns.
- **Gene expression analysis**: Grouping genes with similar expression profiles.
- **Customer segmentation**: Clustering customers by purchasing behaviour using K-Means.
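
To make the face-classification case concrete, here is a minimal sketch of nearest-centroid assignment, which is the same idea as assigning a new sample to its closest K-Means cluster centre. The centroid coordinates and the `new_face` feature vector are hypothetical numbers chosen for illustration, not values computed in this lab.

```python
import numpy as np
from scipy.spatial import distance

# Hypothetical cluster centroids in (hue, saturation) space
centroids = np.array([[20.0, 90.0],
                      [110.0, 60.0]])

# Hypothetical (hue, saturation) features of a new face
new_face = np.array([[30.0, 80.0]])

# Euclidean distance from the new face to every centroid, shape (1, 2)
dists = distance.cdist(new_face, centroids, metric='euclidean')

# The predicted class is the index of the closest centroid
predicted_label = int(np.argmin(dists, axis=1)[0])
print(predicted_label)  # 0, because the first centroid is much closer
```

Swapping `metric='euclidean'` for another metric name accepted by `cdist` (for example `'cityblock'`) changes the notion of "closest" without changing the rest of the code.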

#### 3. Explain various distance metrics. 

a) **Euclidean Distance**: This is the most common way to measure distance, like measuring the straight-line distance between two points on a piece of paper with a ruler. It calculates the shortest path between point A and point B. It works well when all features are numeric and have similar scales.

b) **Manhattan Distance**: Imagine you are in a city with grid-like streets and you want to go from one intersection to another. You can't go through buildings, so you have to walk along the streets (horizontal + vertical distance). That sum of horizontal and vertical steps is the Manhattan distance. It is often preferred for high-dimensional data, and because the coordinate differences are not squared, a single outlying feature influences it less than it influences the Euclidean distance.

c) **Minkowski Distance**: This is a generalized form that combines both Euclidean and Manhattan distances. It uses a parameter 'p'. If p=1, it becomes Manhattan distance. If p=2, it becomes Euclidean distance. It gives you the flexibility to choose the best distance type for your problem.
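
The relationship described above can be checked numerically. The following is a small sketch, not a library implementation: `minkowski` is a helper written here for illustration, and `u` and `v` are made-up points. It also shows that a large `p` approaches the Chebyshev (L∞) distance.

```python
import numpy as np

def minkowski(u, v, p):
    """Minkowski distance of order p between two 1-D arrays."""
    return np.sum(np.abs(u - v) ** p) ** (1.0 / p)

# Made-up points whose coordinate differences are 3 and 4
u = np.array([1.0, 2.0])
v = np.array([4.0, 6.0])

d_p1 = minkowski(u, v, p=1)      # Manhattan: 3 + 4 = 7
d_p2 = minkowski(u, v, p=2)      # Euclidean: sqrt(9 + 16) = 5
d_large = minkowski(u, v, p=50)  # close to the Chebyshev distance, max(3, 4) = 4

print(d_p1, d_p2, d_large)
```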

d) **Cosine Distance**: Instead of measuring how far apart two points are, this measures the angle between two vectors. Think of two arrows starting from the same point; if they point in the same direction, the distance is small (similarity is high), even if one arrow is much longer than the other. This is great for text analysis where the length of the document doesn't matter as much as the content.

e) **Hamming Distance**: This is used for categorical data or computer strings. It simply counts the number of positions where two strings of equal length are different. For example, the Hamming distance between "face" and "fact" is 1 because only the last letter is different. It measures the minimum number of substitutions required to change one string into the other.
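
The counting rule can be written in a few lines of plain Python. This is an illustrative helper (`hamming_distance` is a name chosen here), and it rejects inputs of different lengths because the Hamming distance is only defined for equal-length sequences.

```python
def hamming_distance(s, t):
    """Count the positions at which two equal-length sequences differ."""
    if len(s) != len(t):
        raise ValueError("Hamming distance needs sequences of equal length")
    return sum(c1 != c2 for c1, c2 in zip(s, t))

print(hamming_distance("face", "fact"))    # 1: only the last letter differs
print(hamming_distance("10110", "11100"))  # 2: the second and fourth bits differ
```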

f) **Chebyshev Distance**: Imagine playing chess; the King can move to any adjacent square (horizontal, vertical, or diagonal). The minimum number of moves the King needs to get from one square to another is the Chebyshev distance. Mathematically, it is simply the greatest single difference along any coordinate dimension.

#### 4. What is the role of cross validation in model performance? 

Cross-validation is a technique used to assess how well a model generalizes to unseen data. Its roles include:
- **Detecting overfitting**: By training and testing on different subsets, it exposes models that memorize training data rather than learning general patterns.
- **Hyperparameter tuning**: Helps select hyperparameters such as the value of *k* in KNN by comparing validation scores across candidate values.
- **Reliable performance estimation**: Provides a more robust estimate of model accuracy than a single train-test split, since every data point is used for both training and testing across different folds.
- **Efficient use of data**: Particularly valuable when the dataset is small, as it maximizes the use of available samples for both training and validation.

Common methods include k-fold CV, stratified k-fold CV, and leave-one-out CV (LOOCV).
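
As a sketch of how k-fold cross-validation can guide the choice of *k* in KNN, the example below uses scikit-learn's `cross_val_score` on the Iris dataset that ships with scikit-learn. The dataset and the candidate values of *k* are assumptions made for illustration and are unrelated to the face data in this lab.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Small labelled dataset bundled with scikit-learn (illustrative choice)
X, y = load_iris(return_X_y=True)

mean_scores = {}
for k in (1, 5, 15):  # candidate values of k, chosen arbitrarily
    model = KNeighborsClassifier(n_neighbors=k)
    # cv=5 with a classifier performs stratified 5-fold cross-validation
    scores = cross_val_score(model, X, y, cv=5)
    mean_scores[k] = scores.mean()

# Pick the k with the highest mean validation accuracy
best_k = max(mean_scores, key=mean_scores.get)
print(mean_scores, best_k)
```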

#### 5. Explain variance and bias in terms of KNN.

- **Bias** refers to errors due to overly simplistic assumptions in the model. In KNN, a **large value of k** leads to **high bias** because the model averages over too many neighbors, effectively underfitting and ignoring local patterns. In the extreme case k = N (the number of training samples), every query receives the overall majority class.
- **Variance** refers to sensitivity to fluctuations in the training data. A **small value of k** (e.g., k = 1) leads to **high variance** because the prediction depends on a single neighbor and is very sensitive to noise.
- The **bias-variance tradeoff** in KNN is controlled by k:
  - **Small k → Low bias, High variance** (complex, noisy decision boundary)
  - **Large k → High bias, Low variance** (smooth, potentially underfitting boundary)
- The optimal k balances this tradeoff and is typically found through cross-validation.
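
The tradeoff can be observed by comparing training and test accuracy for different values of k. The sketch below uses a synthetic two-class dataset from `make_moons`; the dataset, noise level, split, and the values of k are all assumptions chosen for illustration. With k = 1 each training point is its own nearest neighbor, so training accuracy is perfect even though the boundary follows the noise.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic noisy two-class data (illustrative, not lab data)
X, y = make_moons(n_samples=300, noise=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

results = {}
for k in (1, 15, len(X_train)):  # small k, moderate k, and k = N
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    # Store (training accuracy, test accuracy) for this k
    results[k] = (model.score(X_train, y_train), model.score(X_test, y_test))

for k, (train_acc, test_acc) in results.items():
    print(f"k={k:3d}  train={train_acc:.3f}  test={test_acc:.3f}")
```

A large gap between training and test accuracy points to high variance, while low accuracy on both points to high bias.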