In [50]:
import numpy as np
import os
import matplotlib.pyplot as plt

import pandas as pd
from collections import Counter, defaultdict
from PIL import Image
import imagehash
from tensorflow.keras.applications import ResNet50

# Working with Images Lab
## Information retrieval, preprocessing, and feature extraction

In this lab, you'll work with images of felines (cats), which have been classified according to their taxonomy. Each subfolder contains images of a particular species. The dataset is located [here](https://www.kaggle.com/datasets/datahmifitb/felis-taxonomy-image-classification) but it's also provided to you in the `data/` folder.

### Problem 1. Some exploration (1 point)
How many types of cats are there? How many images do we have of each? What is a typical image size? Are there any outliers in size?

#### Types of Cats 
The dataset contains seven types of cats: African wildcat, Blackfoot cat, Chinese mountain cat, Domestic cat, European wildcat, Jungle cat, and Sand cat.


#### Images of Each Type
- African wildcat: 91 images
- Blackfoot cat: 79 images
- Chinese mountain cat: 42 images
- Domestic cat: 64 images
- European wildcat: 85 images
- Jungle cat: 86 images
- Sand cat: 72 images
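
The counts above can be cross-checked programmatically instead of reading the directory listings by eye. Here is a minimal sketch, assuming each species lives in its own subfolder of `data/` and that only `.jpg`, `.jpeg`, and `.png` files count as images. The helper name `count_images_per_folder` is illustrative and not part of the lab.

```python
import os

IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png")

def count_images_per_folder(root):
    """Return {subfolder name: number of image files directly inside it}."""
    counts = {}
    for entry in sorted(os.listdir(root)):
        folder = os.path.join(root, entry)
        # Skip plain files and hidden folders such as '.ipynb_checkpoints'
        if entry.startswith(".") or not os.path.isdir(folder):
            continue
        counts[entry] = sum(
            1
            for name in os.listdir(folder)
            if os.path.isfile(os.path.join(folder, name))
            and name.lower().endswith(IMAGE_EXTENSIONS)
        )
    return counts

# Only runs when the lab's data folder is present
if os.path.isdir("data"):
    for species, n_images in count_images_per_folder("data").items():
        print(f"{species}: {n_images} images")
```

Filtering by extension matters here because some folders also contain a `.ipynb_checkpoints` entry, which a bare `len(os.listdir(...))` would count as an image.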

In [7]:
os.listdir("data/blackfoot-cat")

['bc (1).jpg',
 'bc (10).jpg',
 'bc (11).jpg',
 'bc (12).jpg',
 'bc (13).jpg',
 'bc (14).jpg',
 'bc (15).jpg',
 'bc (16).jpg',
 'bc (17).jpg',
 'bc (18).jpg',
 'bc (19).jpg',
 'bc (2).jpg',
 'bc (20).jpg',
 'bc (21).jpg',
 'bc (22).jpg',
 'bc (23).jpg',
 'bc (24).jpg',
 'bc (25).jpg',
 'bc (26).jpg',
 'bc (27).jpg',
 'bc (28).jpg',
 'bc (29).jpg',
 'bc (3).jpg',
 'bc (30).jpg',
 'bc (31).jpg',
 'bc (32).jpg',
 'bc (33).jpg',
 'bc (34).jpg',
 'bc (35).jpg',
 'bc (36).jpg',
 'bc (37).jpg',
 'bc (38).jpg',
 'bc (39).jpg',
 'bc (4).jpg',
 'bc (40).jpg',
 'bc (41).jpg',
 'bc (42).jpg',
 'bc (43).jpg',
 'bc (44).jpg',
 'bc (45).jpg',
 'bc (46).jpg',
 'bc (47).jpg',
 'bc (48).jpg',
 'bc (49).jpg',
 'bc (5).jpg',
 'bc (50).jpg',
 'bc (51).jpg',
 'bc (52).jpg',
 'bc (53).jpg',
 'bc (54).jpg',
 'bc (55).jpg',
 'bc (56).jpg',
 'bc (57).jpg',
 'bc (58).jpg',
 'bc (59).jpg',
 'bc (6).jpg',
 'bc (60).jpg',
 'bc (61).jpg',
 'bc (62).jpg',
 'bc (63).jpg',
 'bc (64).jpg',
 'bc (65).jpg',
 'bc (66).jpg'

In [8]:
os.listdir("data/chinese-mountain-cat")

['.ipynb_checkpoints',
 'ch (1).jpg',
 'ch (10).jpg',
 'ch (11).jpg',
 'ch (12).jpg',
 'ch (13).jpg',
 'ch (14).jpg',
 'ch (15).jpg',
 'ch (16).jpg',
 'ch (17).jpg',
 'ch (18).jpg',
 'ch (19).jpg',
 'ch (2).jpg',
 'ch (20).jpg',
 'ch (21).jpg',
 'ch (22).jpg',
 'ch (23).jpg',
 'ch (24).jpg',
 'ch (25).jpg',
 'ch (26).jpg',
 'ch (27).jpg',
 'ch (28).jpg',
 'ch (29).jpg',
 'ch (3).jpg',
 'ch (30).jpg',
 'ch (31).jpg',
 'ch (32).jpg',
 'ch (33).jpg',
 'ch (34).jpg',
 'ch (35).jpg',
 'ch (36).jpg',
 'ch (37).jpg',
 'ch (38).jpg',
 'ch (39).jpg',
 'ch (4).jpg',
 'ch (40).jpg',
 'ch (41).jpg',
 'ch (42).jpg',
 'ch (5).jpg',
 'ch (6).jpg',
 'ch (7).jpg',
 'ch (8).jpg',
 'ch (9).jpg']

In [9]:
os.listdir("data/domestic-cat")

['dc (1).jpg',
 'dc (10).jpg',
 'dc (11).jpg',
 'dc (12).jpg',
 'dc (13).jpg',
 'dc (14).jpg',
 'dc (15).jpg',
 'dc (16).jpg',
 'dc (17).jpg',
 'dc (18).jpg',
 'dc (19).jpg',
 'dc (2).jpg',
 'dc (20).jpg',
 'dc (21).jpg',
 'dc (22).jpg',
 'dc (23).jpg',
 'dc (24).jpg',
 'dc (25).jpg',
 'dc (26).jpg',
 'dc (27).jpg',
 'dc (28).jpg',
 'dc (29).jpg',
 'dc (3).jpg',
 'dc (30).jpg',
 'dc (31).jpg',
 'dc (32).jpg',
 'dc (33).jpg',
 'dc (34).jpg',
 'dc (35).jpg',
 'dc (36).jpg',
 'dc (37).jpg',
 'dc (38).jpg',
 'dc (39).jpg',
 'dc (4).jpg',
 'dc (40).jpg',
 'dc (41).jpg',
 'dc (42).jpg',
 'dc (43).jpg',
 'dc (44).jpg',
 'dc (45).jpg',
 'dc (46).jpg',
 'dc (47).jpg',
 'dc (48).jpg',
 'dc (49).jpg',
 'dc (5).jpg',
 'dc (50).jpg',
 'dc (51).jpg',
 'dc (52).jpg',
 'dc (53).jpg',
 'dc (54).jpg',
 'dc (55).jpg',
 'dc (56).jpg',
 'dc (57).jpg',
 'dc (58).jpg',
 'dc (59).jpg',
 'dc (6).jpg',
 'dc (60).jpg',
 'dc (61).jpg',
 'dc (62).jpg',
 'dc (63).jpg',
 'dc (64).jpg',
 'dc (7).jpg',
 'dc (8).jpg',


In [10]:
os.listdir("data/jungle-cat")

['jg (1).jpg',
 'jg (10).jpg',
 'jg (11).jpg',
 'jg (12).jpg',
 'jg (13).jpg',
 'jg (14).jpg',
 'jg (15).jpg',
 'jg (16).jpg',
 'jg (17).jpg',
 'jg (18).jpg',
 'jg (19).jpg',
 'jg (2).jpg',
 'jg (20).jpg',
 'jg (21).jpg',
 'jg (22).jpg',
 'jg (23).jpg',
 'jg (24).jpg',
 'jg (25).jpg',
 'jg (26).jpg',
 'jg (27).jpg',
 'jg (28).jpg',
 'jg (29).jpg',
 'jg (3).jpg',
 'jg (30).jpg',
 'jg (31).jpg',
 'jg (32).jpg',
 'jg (33).jpg',
 'jg (34).jpg',
 'jg (35).jpg',
 'jg (36).jpg',
 'jg (37).jpg',
 'jg (38).jpg',
 'jg (39).jpg',
 'jg (4).jpg',
 'jg (40).jpg',
 'jg (41).jpg',
 'jg (42).jpg',
 'jg (43).jpg',
 'jg (44).jpg',
 'jg (45).jpg',
 'jg (46).jpg',
 'jg (47).jpg',
 'jg (48).jpg',
 'jg (49).jpg',
 'jg (5).jpg',
 'jg (50).jpg',
 'jg (51).jpg',
 'jg (52).jpg',
 'jg (53).jpg',
 'jg (54).jpg',
 'jg (55).jpg',
 'jg (56).jpg',
 'jg (57).jpg',
 'jg (58).jpg',
 'jg (59).jpg',
 'jg (6).jpg',
 'jg (60).jpg',
 'jg (61).jpg',
 'jg (62).jpg',
 'jg (63).jpg',
 'jg (64).jpg',
 'jg (65).jpg',
 'jg (66).jpg'

In [11]:
os.listdir("data/european-wildcat")

['eu (1).jpg',
 'eu (10).jpg',
 'eu (11).jpg',
 'eu (12).jpg',
 'eu (13).jpg',
 'eu (14).jpg',
 'eu (15).jpg',
 'eu (16).jpg',
 'eu (17).jpg',
 'eu (18).jpg',
 'eu (19).jpg',
 'eu (2).jpg',
 'eu (20).jpg',
 'eu (21).jpg',
 'eu (22).jpg',
 'eu (23).jpg',
 'eu (24).jpg',
 'eu (25).jpg',
 'eu (26).jpg',
 'eu (27).jpg',
 'eu (28).jpg',
 'eu (29).jpg',
 'eu (3).jpg',
 'eu (30).jpg',
 'eu (31).jpg',
 'eu (32).jpg',
 'eu (33).jpg',
 'eu (34).jpg',
 'eu (35).jpg',
 'eu (36).jpg',
 'eu (37).jpg',
 'eu (38).jpg',
 'eu (39).jpg',
 'eu (4).jpg',
 'eu (40).jpg',
 'eu (41).jpg',
 'eu (42).jpg',
 'eu (43).jpg',
 'eu (44).jpg',
 'eu (45).jpg',
 'eu (46).jpg',
 'eu (47).jpg',
 'eu (48).jpg',
 'eu (49).jpg',
 'eu (5).jpg',
 'eu (50).jpg',
 'eu (51).jpg',
 'eu (52).jpg',
 'eu (53).jpg',
 'eu (54).jpg',
 'eu (55).jpg',
 'eu (56).jpg',
 'eu (57).jpg',
 'eu (58).jpg',
 'eu (59).jpg',
 'eu (6).jpg',
 'eu (60).jpg',
 'eu (61).jpg',
 'eu (62).jpg',
 'eu (63).jpg',
 'eu (64).jpg',
 'eu (65).jpg',
 'eu (66).jpg'

In [12]:
os.listdir("data/sand-cat")

['.ipynb_checkpoints',
 'sd (1).jpg',
 'sd (10).jpg',
 'sd (11).jpg',
 'sd (12).jpg',
 'sd (13).jpg',
 'sd (14).jpg',
 'sd (15).jpg',
 'sd (16).jpg',
 'sd (17).jpg',
 'sd (18).jpg',
 'sd (19).jpg',
 'sd (2).jpg',
 'sd (20).jpg',
 'sd (21).jpg',
 'sd (22).jpg',
 'sd (23).jpg',
 'sd (24).jpg',
 'sd (25).jpg',
 'sd (26).jpg',
 'sd (27).jpg',
 'sd (28).jpg',
 'sd (29).jpg',
 'sd (3).jpg',
 'sd (30).jpg',
 'sd (31).jpg',
 'sd (32).jpg',
 'sd (33).jpg',
 'sd (34).jpg',
 'sd (35).jpg',
 'sd (36).jpg',
 'sd (37).jpg',
 'sd (38).jpg',
 'sd (39).jpg',
 'sd (4).jpg',
 'sd (40).jpg',
 'sd (41).jpg',
 'sd (42).jpg',
 'sd (43).jpg',
 'sd (44).jpg',
 'sd (45).jpg',
 'sd (46).jpg',
 'sd (47).jpg',
 'sd (48).jpg',
 'sd (49).jpg',
 'sd (5).jpg',
 'sd (50).jpg',
 'sd (51).jpg',
 'sd (52).jpg',
 'sd (53).jpg',
 'sd (54).jpg',
 'sd (55).jpg',
 'sd (56).jpg',
 'sd (57).jpg',
 'sd (58).jpg',
 'sd (59).jpg',
 'sd (6).jpg',
 'sd (60).jpg',
 'sd (61).jpg',
 'sd (62).jpg',
 'sd (63).jpg',
 'sd (64).jpg',
 'sd (6

In [13]:
os.listdir("data/african-wildcat")

['af (1).jpg',
 'af (10).jpg',
 'af (11).jpg',
 'af (12).jpg',
 'af (13).jpg',
 'af (14).jpg',
 'af (15).jpg',
 'af (16).jpg',
 'af (17).jpg',
 'af (18).jpg',
 'af (19).jpg',
 'af (2).jpg',
 'af (20).jpg',
 'af (21).jpg',
 'af (22).jpg',
 'af (23).jpg',
 'af (24).jpg',
 'af (25).jpg',
 'af (26).jpg',
 'af (27).jpg',
 'af (28).jpg',
 'af (29).jpg',
 'af (3).jpg',
 'af (30).jpg',
 'af (31).jpg',
 'af (32).jpg',
 'af (33).jpg',
 'af (34).jpg',
 'af (35).jpg',
 'af (36).jpg',
 'af (37).jpg',
 'af (38).jpg',
 'af (39).jpg',
 'af (4).jpg',
 'af (40).jpg',
 'af (41).jpg',
 'af (42).jpg',
 'af (43).jpg',
 'af (44).jpg',
 'af (45).jpg',
 'af (46).jpg',
 'af (47).jpg',
 'af (48).jpg',
 'af (49).jpg',
 'af (5).jpg',
 'af (50).jpg',
 'af (51).jpg',
 'af (52).jpg',
 'af (53).jpg',
 'af (54).jpg',
 'af (55).jpg',
 'af (56).jpg',
 'af (57).jpg',
 'af (58).jpg',
 'af (59).jpg',
 'af (6).jpg',
 'af (60).jpg',
 'af (61).jpg',
 'af (62).jpg',
 'af (63).jpg',
 'af (64).jpg',
 'af (65).jpg',
 'af (66).jpg'

#### Typical Image Size and Outliers

In [15]:
files1 = os.listdir("data/african-wildcat")
files2 = os.listdir("data/blackfoot-cat")
files3 = os.listdir("data/chinese-mountain-cat")
files4 = os.listdir("data/domestic-cat")
files5 = os.listdir("data/european-wildcat")
files6 = os.listdir("data/jungle-cat")
files7 = os.listdir("data/sand-cat")
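all_files = files1 + files2 + files3 + files4 + files5 + files6 + files7
# Keep image files only; os.listdir also returns entries such as '.ipynb_checkpoints'
cat_files = [f for f in all_files if f.lower().endswith(('.jpg', '.jpeg', '.png'))]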

In [16]:
# Supported image extensions
IMAGE_EXTENSIONS = ('.jpg', '.jpeg', '.png')

# Folders to scan
folders = [
    "data/african-wildcat",
    "data/blackfoot-cat",
    "data/sand-cat",
    "data/chinese-mountain-cat",
    "data/domestic-cat",
    "data/european-wildcat",
    "data/jungle-cat"
]

# Step 1: Collect image paths and sizes
image_data = []
for folder in folders:
    for file in os.listdir(folder):
        full_path = os.path.join(folder, file)
        if os.path.isfile(full_path) and file.lower().endswith(IMAGE_EXTENSIONS):
            try:
                with Image.open(full_path) as img:
                    width, height = img.size
                    image_data.append((full_path, width, height))
            except Exception as e:
                print(f"Error reading {full_path}: {e}")

# Step 2: Create DataFrame
df = pd.DataFrame(image_data, columns=["path", "width", "height"])

# Step 3: Count most common sizes
size_counts = Counter([(w, h) for _, w, h in image_data])
most_common = size_counts.most_common()

print("📏 Most common image sizes (width x height):")
for size, count in most_common[:10]:  # show top 10
    print(f"{size[0]} x {size[1]} - {count} images")

# Step 4: Average size
if not df.empty:
    avg_width = df["width"].mean()
    avg_height = df["height"].mean()
    print(f"\n📊 Average size: {avg_width:.1f} x {avg_height:.1f} pixels")


# Step 5: Compute area and size
df["area"] = df["width"] * df["height"]
df["size"] = list(zip(df["width"], df["height"]))

# Step 6: Group by size and count
size_counts = df.groupby("size").agg(
    count=("size", "count"),
    width=("width", "first"),
    height=("height", "first"),
    area=("area", "first")
).reset_index(drop=True)

# Step 7: IQR outlier detection function
def detect_outliers(series):
    Q1 = series.quantile(0.25)
    Q3 = series.quantile(0.75)
    IQR = Q3 - Q1
    return (series < Q1 - 1.5 * IQR) | (series > Q3 + 1.5 * IQR)

# Step 8: Apply IQR on width & height
# (quartiles are computed over the distinct sizes, not over individual images)
size_counts["outlier_width"] = detect_outliers(size_counts["width"])
size_counts["outlier_height"] = detect_outliers(size_counts["height"])
size_counts["is_outlier"] = size_counts["outlier_width"] | size_counts["outlier_height"]

# Step 9: Filter to only outliers with high counts
threshold = 3
filtered_outliers = size_counts[(size_counts["is_outlier"]) & (size_counts["count"] >= threshold)]

# Step 10: Show outlier size summaries
print(f"\n🔍 Outlier sizes that appear at least {threshold} times:")
for _, row in filtered_outliers.iterrows():
    print(f"{row['width']} x {row['height']} - {row['count']} images")

# Step 11: Get actual image paths from df
outlier_sizes = set(zip(filtered_outliers["width"], filtered_outliers["height"]))
outlier_df = df[df["size"].isin(outlier_sizes)]

print(f"\n📸 Total outlier images after filtering: {len(outlier_df)}")
print(outlier_df[["path", "width", "height"]].to_string(index=False))

📏 Most common image sizes (width x height):
275 x 183 - 108 images
259 x 194 - 30 images
225 x 225 - 22 images
183 x 275 - 18 images
300 x 168 - 12 images
262 x 192 - 12 images
265 x 190 - 10 images
251 x 201 - 8 images
276 x 183 - 7 images
194 x 259 - 6 images

📊 Average size: 406.6 x 310.9 pixels

🔍 Outlier sizes that appear at least 3 times:
1280 x 720 - 3 images
1300 x 866 - 3 images
1300 x 956 - 3 images
2000 x 1333 - 4 images

📸 Total outlier images after filtering: 13
                                 path  width  height
data/chinese-mountain-cat\ch (35).jpg   1280     720
data/chinese-mountain-cat\ch (36).jpg   1280     720
    data/european-wildcat\eu (10).jpg   2000    1333
    data/european-wildcat\eu (15).jpg   1300     866
    data/european-wildcat\eu (26).jpg   2000    1333
    data/european-wildcat\eu (29).jpg   1300     956
     data/european-wildcat\eu (3).jpg   2000    1333
    data/european-wildcat\eu (46).jpg   1300     956
    data/european-wildcat\eu (58).jpg   128
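
The mean reported above (406.6 x 310.9) is pulled upward by a handful of very large images, so the median is arguably a better summary of a "typical" size. The sketch below uses the standard library and made-up sizes rather than the real dataset.

```python
from statistics import mean, median

# Hypothetical (width, height) pairs: mostly small images plus one large one
sizes = [(275, 183), (275, 183), (259, 194), (225, 225), (2000, 1333)]

widths = [w for w, _ in sizes]
heights = [h for _, h in sizes]

# The single large image shifts the mean far more than the median
print(f"mean:   {mean(widths):.1f} x {mean(heights):.1f}")
print(f"median: {median(widths)} x {median(heights)}")
```

With the real data, the same comparison is available through `df[["width", "height"]].median()`.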

### Problem 2. Duplicat(e)s (1 point)
Find a way to filter out (remove) identical images. I would recommend using file hashes, but there are many approaches. Keep in mind that file saving, recompression, etc. can change the file content (bytes) without changing how the image looks.
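
As a baseline for the file-hash suggestion, here is a minimal sketch that groups files by the SHA-256 digest of their bytes. It only catches byte-for-byte copies, so a recompressed copy of the same picture would slip through. The function names are illustrative.

```python
import hashlib
from collections import defaultdict

def file_digest(path, chunk_size=65536):
    """SHA-256 of a file's bytes, read in chunks to keep memory use low."""
    sha = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            sha.update(chunk)
    return sha.hexdigest()

def group_exact_duplicates(paths):
    """Return lists of paths whose contents are byte-identical."""
    groups = defaultdict(list)
    for path in paths:
        groups[file_digest(path)].append(path)
    return [group for group in groups.values() if len(group) > 1]
```

Within each returned group, one would typically keep the first path and drop the rest.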

In [18]:
folders = [
    "data/african-wildcat",
    "data/blackfoot-cat",
    "data/chinese-mountain-cat",
    "data/domestic-cat",
    "data/european-wildcat",
    "data/jungle-cat",
    "data/sand-cat"
]

In [19]:
!pip install imagehash pillow



In [20]:
IMAGE_EXTENSIONS = ('.jpg', '.jpeg', '.png')
folders = [
    "data/african-wildcat",
    "data/blackfoot-cat",
    "data/chinese-mountain-cat",
    "data/domestic-cat",
    "data/european-wildcat",
    "data/jungle-cat",
    "data/sand-cat"
]

hash_dict = defaultdict(list)
duplicate_paths = []

for folder in folders:
    for fname in os.listdir(folder):
        path = os.path.join(folder, fname)
        if os.path.isfile(path) and fname.lower().endswith(IMAGE_EXTENSIONS):
            try:
                with Image.open(path) as img:
                    hash_val = imagehash.phash(img)  # perceptual hash
                    if hash_val in hash_dict:
                        duplicate_paths.append(path)  # This is a duplicate
                    else:
                        hash_dict[hash_val].append(path)
            except Exception as e:
                print(f"❌ Error with file {path}: {e}")

print(f"✅ Found {len(duplicate_paths)} duplicates:")
for dup in duplicate_paths:
    print(dup)

✅ Found 54 duplicates:
data/african-wildcat\af (32).jpg
data/african-wildcat\af (37).jpg
data/african-wildcat\af (61).jpg
data/african-wildcat\af (74).jpg
data/blackfoot-cat\bc (63).jpg
data/chinese-mountain-cat\ch (20).jpg
data/chinese-mountain-cat\ch (32).jpg
data/chinese-mountain-cat\ch (39).jpg
data/chinese-mountain-cat\ch (42).jpg
data/chinese-mountain-cat\ch (9).jpg
data/domestic-cat\dc (27).jpg
data/domestic-cat\dc (36).jpg
data/domestic-cat\dc (42).jpg
data/domestic-cat\dc (5).jpg
data/domestic-cat\dc (52).jpg
data/european-wildcat\eu (11).jpg
data/european-wildcat\eu (3).jpg
data/european-wildcat\eu (32).jpg
data/european-wildcat\eu (33).jpg
data/european-wildcat\eu (42).jpg
data/european-wildcat\eu (47).jpg
data/european-wildcat\eu (48).jpg
data/european-wildcat\eu (5).jpg
data/european-wildcat\eu (50).jpg
data/european-wildcat\eu (51).jpg
data/european-wildcat\eu (54).jpg
data/european-wildcat\eu (55).jpg
data/european-wildcat\eu (61).jpg
data/european-wildcat\eu (62).jpg
da
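
The cell above flags only images whose perceptual hashes are exactly equal. Visually similar images often differ in a few hash bits, so a common refinement is to compare hashes by Hamming distance. The sketch below works on plain integers to stay self-contained; with `imagehash`, subtracting two hash objects gives the same bit-difference count. The threshold of 4 bits is an arbitrary assumption, not a recommended value.

```python
def hamming_distance(hash_a, hash_b):
    """Number of differing bits between two integer hashes."""
    return bin(hash_a ^ hash_b).count("1")

def find_near_duplicates(hashes, max_distance=4):
    """Return (name_a, name_b, distance) for pairs within max_distance bits.

    `hashes` maps a file name to its hash as an int. The pairwise comparison
    is O(n^2), which is fine for a few hundred images.
    """
    names = sorted(hashes)
    pairs = []
    for i, name_a in enumerate(names):
        for name_b in names[i + 1:]:
            distance = hamming_distance(hashes[name_a], hashes[name_b])
            if distance <= max_distance:
                pairs.append((name_a, name_b, distance))
    return pairs

# Hypothetical 16-bit hashes, for illustration only
example = {
    "a.jpg": 0b1111000011110000,
    "b.jpg": 0b1111000011110001,
    "c.jpg": 0b0000111100001111,
}
print(find_near_duplicates(example))
```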

### Problem 3. Loading a model (2 points)
Find a suitable, trained convolutional neural network classifier. I recommend `ResNet50` as it's small enough to run well on any machine and powerful enough to make reasonable predictions. Most ready-made classifiers have been trained for 1000 classes.

You'll need to install libraries and possibly tinker with configurations for this task. When you're done, display the total number of layers and the total number of parameters. For ResNet50, you should expect around 50 layers and 25M parameters.

In [52]:
model = ResNet50(weights='imagenet', include_top=True)

In [54]:
model.summary()

In [56]:
print(f"\n🔍 Total layers: {len(model.layers)}")
print(f"🧮 Total parameters: {model.count_params():,}")


🔍 Total layers: 177
🧮 Total parameters: 25,636,712
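
The reported 177 does not contradict the "around 50 layers" expectation. `len(model.layers)` counts every Keras layer object (input, padding, batch normalization, activation, addition, pooling, and so on), while the "50" in the name counts only the weighted layers on the main path. Here is a short arithmetic sketch, assuming the standard ResNet-50 layout of 3, 4, 6, and 3 bottleneck blocks per stage.

```python
# Bottleneck blocks in each of the four stages of the standard ResNet-50
blocks_per_stage = [3, 4, 6, 3]
convs_per_block = 3          # 1x1, 3x3, 1x1 convolutions in each bottleneck

stem_convs = 1               # the initial 7x7 convolution
block_convs = sum(blocks_per_stage) * convs_per_block
fully_connected = 1          # the final 1000-way classifier

named_depth = stem_convs + block_convs + fully_connected
print(f"Weighted layers on the main path: {named_depth}")

# Each stage also starts with one projection (1x1 conv) shortcut,
# which the "50" does not include.
shortcut_convs = len(blocks_per_stage)
total_convs = stem_convs + block_convs + shortcut_convs
print(f"Convolutions including shortcuts: {total_convs}")
```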


### Problem 4. Prepare the images (1 point)
You'll need to prepare the images for passing to the model. To do so, they have to be resized to the same dimensions. Most available models have a specific requirement for sizes. You may need to do additional preprocessing, depending on the model requirements. These requirements should be easily available in the model documentation.
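
For the Keras `ResNet50` loaded above, the expected input is a 224 x 224 RGB image. `tensorflow.keras.applications.resnet50.preprocess_input` then converts RGB to BGR and subtracts the per-channel ImageNet means without rescaling. The NumPy sketch below mirrors those steps to make them explicit. The nearest-neighbour resize is a simplification standing in for a proper resampling filter, and the function names are illustrative. In practice, calling the library's own `preprocess_input` is the safer choice.

```python
import numpy as np

# ImageNet channel means in BGR order, as used by "caffe"-style preprocessing
IMAGENET_BGR_MEANS = np.array([103.939, 116.779, 123.68], dtype=np.float32)

def resize_nearest(image, size=(224, 224)):
    """Nearest-neighbour resize of an (H, W, 3) array to (size[0], size[1], 3)."""
    height, width = image.shape[:2]
    rows = np.arange(size[0]) * height // size[0]
    cols = np.arange(size[1]) * width // size[1]
    return image[rows][:, cols]

def preprocess_for_resnet(image_rgb):
    """Resize, convert RGB -> BGR, and zero-center each channel."""
    resized = resize_nearest(image_rgb).astype(np.float32)
    bgr = resized[..., ::-1]
    return bgr - IMAGENET_BGR_MEANS

# Synthetic 183 x 275 RGB image standing in for a real photo
dummy = np.random.default_rng(0).integers(0, 256, size=(183, 275, 3), dtype=np.uint8)
prepared = preprocess_for_resnet(dummy)
print(prepared.shape, prepared.dtype)
```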

### Problem 5. Load the images efficiently (1 point)
Now that you've seen how to prepare the images for passing to the model... find a way to do it efficiently. Instead of loading the entire dataset in the RAM, read the images in batches (e.g. 4 images at a time). The goal is to read these, preprocess them, maybe save the preprocessed results in RAM.

If you've already done this in one of the previous problems, just skip this one. You'll get your point for it.

\* Even better, save the preprocessed image arrays (they will not be valid .jpg files) as separate files, so you can load them "lazily" in the following steps. This is a very common optimization when working with large datasets.
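
One way to avoid holding the whole dataset in memory is a generator that yields one small batch at a time. Here is a minimal sketch. `load_fn` is a placeholder for whatever load-and-preprocess function you wrote for Problem 4, and the batch size of 4 follows the example in the text.

```python
def iter_batches(paths, load_fn, batch_size=4):
    """Yield (batch_paths, batch_items), loading only one batch at a time."""
    for start in range(0, len(paths), batch_size):
        batch_paths = paths[start:start + batch_size]
        yield batch_paths, [load_fn(path) for path in batch_paths]

# Usage with a stand-in loader that just returns the length of the path string
demo_paths = [f"img_{i}.jpg" for i in range(10)]
for batch_paths, batch_items in iter_batches(demo_paths, load_fn=len):
    print(len(batch_paths), "items loaded")
```

For the lazy-loading variant, each preprocessed batch could be written to disk with `numpy.save` and read back later with `numpy.load`.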

### Problem 6. Predictions (1 point)
Finally, you're ready to get into the meat of the problem. Obtain predictions from your model and evaluate them. This will likely involve manual work to decide how the returned classes relate to the original ones.

Create a [confusion matrix](https://en.wikipedia.org/wiki/Confusion_matrix) to evaluate the classification.
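
A confusion matrix simply counts how often each true class was predicted as each class. The sketch below builds one with NumPy from integer class indices. The labels are made up, and it assumes you have already mapped the model's ImageNet predictions onto your own class indices (possibly with an extra "other" class). `sklearn.metrics.confusion_matrix` offers the same functionality if scikit-learn is installed.

```python
import numpy as np

def confusion_matrix(true_labels, predicted_labels, n_classes):
    """Rows are true classes, columns are predicted classes."""
    matrix = np.zeros((n_classes, n_classes), dtype=int)
    for true, predicted in zip(true_labels, predicted_labels):
        matrix[true, predicted] += 1
    return matrix

# Hypothetical labels for three classes (0, 1, 2), not real model output
y_true = [0, 0, 1, 1, 2, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0, 2]
cm = confusion_matrix(y_true, y_pred, n_classes=3)
print(cm)
# Correct predictions sit on the diagonal
print("accuracy:", np.trace(cm) / cm.sum())
```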

### Problem 7. Grayscale (1 point)
Converting the images to grayscale should affect the classification negatively, as we lose some of the color information.

Find a way to preprocess the images to grayscale (using what you already have in Problems 4 and 5), pass them to the model, and compare the classification results to the previous ones.
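
Because the model still expects three input channels, one option is to compute a single luminance channel and copy it into all three. The sketch below uses the ITU-R BT.601 weights, which are the same ones Pillow documents for `convert("L")`. The function name is illustrative.

```python
import numpy as np

# ITU-R BT.601 luma weights for R, G, B
LUMA_WEIGHTS = np.array([0.299, 0.587, 0.114], dtype=np.float32)

def to_grayscale_3ch(image_rgb):
    """Convert an (H, W, 3) RGB array to grayscale, repeated over 3 channels."""
    gray = image_rgb.astype(np.float32) @ LUMA_WEIGHTS
    return np.repeat(gray[..., np.newaxis], 3, axis=-1)

# A 2 x 2 test image: pure red, pure green, pure blue, and white
tiny = np.array([[[255, 0, 0], [0, 255, 0]],
                 [[0, 0, 255], [255, 255, 255]]], dtype=np.uint8)
print(to_grayscale_3ch(tiny)[..., 0])
```

The result can then go through the same resizing and model-specific preprocessing as the colour images.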

### Problem 8. Deep image features (1 point)
Find a way to extract one-dimensional vectors (features) for each (non-grayscale) image, using your model. This is typically done by "short-circuiting" the model output to be an intermediate layer, while keeping the input the same. 

In case the outputs (also called feature maps) have different shapes, you can flatten them in different ways. Try to not create huge vectors; the goal is to have a relatively short sequence of numbers which describes each image.

You may find a tutorial like [this](https://towardsdatascience.com/exploring-feature-extraction-with-cnns-345125cefc9a) pretty useful but note your implementation will depend on what model (and framework) you've decided to use.

It's a good idea to save these as one or more files, so you'll spare yourself a ton of preprocessing.
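
A common way to turn a stack of feature maps into a short vector is global average pooling: average each channel over its spatial positions. For ResNet50's last convolutional block the feature maps are 7 x 7 x 2048 for a 224 x 224 input, so this yields 2048 numbers per image. The NumPy sketch below runs on a random array standing in for real model output. In Keras, `ResNet50(include_top=False, pooling='avg')` is one way to get such a vector directly.

```python
import numpy as np

def global_average_pool(feature_maps):
    """Reduce (batch, height, width, channels) to (batch, channels)."""
    return feature_maps.mean(axis=(1, 2))

# Random stand-in for the feature maps of a batch of 4 images
fake_maps = np.random.default_rng(0).random((4, 7, 7, 2048), dtype=np.float32)
features = global_average_pool(fake_maps)
print(features.shape)
```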

### Problem 9. Putting deep image features to use (1 point)
Try to find similar images, using a similarity metric on the features you got in the previous problem. Two good metrics are `mean squared error` and `cosine similarity`. How do they work? Can you spot images that look too similar? Can you explain why?
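
As a starting point for the "how do they work" question: mean squared error measures how far apart two vectors are element by element, so it is sensitive to magnitude. Cosine similarity measures only the angle between them. Here is a minimal NumPy sketch, assuming both vectors have the same length and are non-zero.

```python
import numpy as np

def mean_squared_error(a, b):
    """Average squared difference; 0 means identical, larger means less similar."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.mean((a - b) ** 2))

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1 means same direction."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

u = [1.0, 2.0, 3.0]
v = [2.0, 4.0, 6.0]   # same direction as u, twice the length
print(mean_squared_error(u, v), cosine_similarity(u, v))
```

Here `u` and `v` point the same way, so cosine similarity treats them as a perfect match while the MSE is clearly non-zero.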

\* If we were to take Fourier features (in a similar manner, these should be a vector of about the same length), how do they compare to the deep features; i.e., which features are better to "catch" similar images?
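
One possible reading of "Fourier features" is the magnitudes of the lowest-frequency coefficients of the 2D FFT of a grayscale image. The sketch below follows that assumption. The `keep` parameter and the log scaling are illustrative choices, and `keep=16` gives 256 values, so a larger block would be needed to approach the length of the deep features.

```python
import numpy as np

def fourier_features(gray_image, keep=16):
    """Log-magnitudes of the keep x keep lowest-frequency 2D FFT coefficients."""
    # fftshift moves the zero-frequency term to the centre of the spectrum
    spectrum = np.fft.fftshift(np.fft.fft2(gray_image))
    center_row, center_col = spectrum.shape[0] // 2, spectrum.shape[1] // 2
    half = keep // 2
    block = spectrum[center_row - half:center_row + half,
                     center_col - half:center_col + half]
    return np.log1p(np.abs(block)).ravel()

# Random 64 x 64 array standing in for a grayscale image
image = np.random.default_rng(0).random((64, 64))
print(fourier_features(image).shape)
```

FFT magnitudes do not change under circular shifts of the image, which is one property worth keeping in mind when comparing these features with the deep ones.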

### * Problem 10. Explore, predict, and evaluate further
You can do a ton of things here, as you wish. For example, how does masking different areas of the image affect classification? The answer can be visualized as a **saliency map** ([info](https://en.wikipedia.org/wiki/Saliency_map)). Can we detect objects? Can we significantly reduce the number of features that we get while keeping the quality? Can we reliably train a model to predict our own classes? We'll look into these in detail in the future.
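
The masking idea can be sketched as an occlusion test: blank out one tile at a time and record how much the model's score drops. In the sketch below, `score_fn` is a hypothetical callable that would wrap your model and return the probability of the class of interest. Here it is replaced by a toy function so the example runs without a model. The patch size and fill value are arbitrary assumptions.

```python
import numpy as np

def occlusion_map(image, score_fn, patch=8, fill=0.0):
    """Score drop when each patch x patch tile is masked; higher = more important."""
    baseline = score_fn(image)
    height, width = image.shape[:2]
    rows, cols = -(-height // patch), -(-width // patch)   # ceiling division
    drops = np.zeros((rows, cols))
    for r in range(rows):
        for c in range(cols):
            masked = image.copy()
            masked[r * patch:(r + 1) * patch, c * patch:(c + 1) * patch] = fill
            drops[r, c] = baseline - score_fn(masked)
    return drops

# Toy "model": the score is the mean brightness of the top-left 8 x 8 corner
toy_image = np.ones((16, 16))
toy_score = lambda img: float(img[:8, :8].mean())
print(occlusion_map(toy_image, toy_score))
```

Only the top-left tile matters to the toy score, so that is the only tile whose masking changes the result.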