# Homework06

Some exercises with image and audio data preparation.

## Goals

- Even more practice with lists
- Get familiar with pandas `DataFrames`
- Practice dataset exploration and normalization/scaling
- Set up a dataset for proper classification

### Setup

Run the following two cells to download and import all the necessary libraries and helpers for this homework.

In [None]:
!wget -q https://github.com/PSAM-5020-2025F-A/5020-utils/raw/main/src/audio_utils.py
!wget -q https://github.com/PSAM-5020-2025F-A/5020-utils/raw/main/src/data_utils.py
!wget -q https://github.com/PSAM-5020-2025F-A/5020-utils/raw/main/src/image_utils.py

!wget -q https://github.com/PSAM-5020-2025F-A/Homework04/raw/main/Homework04_utils.pyc
!wget -q https://github.com/PSAM-5020-2025F-A/Homework05/raw/main/Homework05_utils.pyc

!wget -qO- https://github.com/PSAM-5020-2025F-A/5020-utils/releases/latest/download/forest-tree.tar.gz | tar xz
!wget -qO- https://github.com/PSAM-5020-2025F-A/5020-utils/releases/latest/download/instruments.tar.gz | tar xz

In [None]:
import matplotlib.pyplot as plt
import pandas as pd

from sklearn.preprocessing import StandardScaler, MinMaxScaler

from image_utils import make_image, get_pixels

from Homework06_utils import AwesomeAudioClassifier, AwesomeImageClassifier

AUDIO_PATH = "./data/audio/instruments/test"
IMAGE_PATH = "./data/image/forest-tree/test"

## More Image/Audio Classification

We're going to re-visit the classification exercises from `Homework04` and `Homework05`.

This exercise is a bit different though. In some ways it's the opposite of the previous exercises because we'll already have classification models ready to be used, but will have to normalize and standardize our dataset in order to run them. This is more representative of the type of work that goes into using real, pre-trained, ML models in the wild.

### The Models

We have two `Awesome` models, one for audio classification (`AwesomeAudioClassifier`), and one for image classification (`AwesomeImageClassifier`).

Unlike the classification models we set up for `Homework04` and `Homework05`, these models have more strict requirements about the shape and values of their input data. We can't run them on the files as they are.

### The Data

Audio and Image files are in the `data/audio/` and `data/image/` directories respectively.

We will use the `get_training_data()` function from each of our classifiers to get the initial training data and labels for our audio and image files.

### The Features

This is the challenging part.

The data returned by `get_training_data()` is a representation of the content of the audio and image files, but it hasn't been processed or normalized in order to be used by the classifier models provided.

We can try to create a `DataFrame` directly from this data, and it might seem like it works. But if we take a look at the result we'll see `NaN` (Not-a-Number) values in some of the columns, and if we send that to the model it will barf and complain about having `NaN`s in the data.

This happens because all of the audios and images have different sizes. Hoooray !!
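
A minimal sketch of why this happens, using made-up toy lists rather than the homework data: when the inner lists have different lengths, pandas makes the `DataFrame` as wide as the longest record and fills the missing positions of the shorter ones with `NaN`.

```python
import pandas as pd

# Toy "features": three records with different lengths (not the real dataset)
toy_features = [
    [1.0, 2.0, 3.0, 4.0],
    [5.0, 6.0],
    [7.0, 8.0, 9.0],
]

toy_df = pd.DataFrame(toy_features)

# The DataFrame is as wide as the longest record
print(toy_df.shape)

# Count the missing values in each column: the further right, the more NaNs
nan_per_column = toy_df.isna().sum()
print(nan_per_column)
```

Counting `NaN`s per column like this is a quick way to confirm whether a dataset has ragged records before sending it to a model.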

Welcome to Machine Learning. This is probably where most of the time in any ML project is spent: cleaning up data and making sure it has the right format, size and shape that a model expects.

For this exercise it won't be too hard to fix these.

Let's start with the audio files since they're one-dimensional, and once we have the audio modeling working we'll come back to the image files.

<div style="background:#040; padding:10px; width:calc(100% - 28px)">

## Audio Data

Let's run the `AwesomeAudioClassifier.get_training_data()` function to get some audio data. This function returns audio data and labels from files inside a specified directory.

</div>

In [None]:
features, labels = AwesomeAudioClassifier.get_training_data(AUDIO_PATH)

### Audio Features

The audio data returned is in the frequency domain, not raw samples. This means we can't play these audio files, but we can still plot the data. We will also have to normalize and clean it before we can run it through our classifier.

Let's take a look at this data.

What are the labels ? How many records do we have ? How many features do we have in each record ? Can we plot our data ?
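
One possible way to approach these questions, sketched on hypothetical toy values (`toy_labels` and `toy_features` are made up for illustration): `collections.Counter` shows how many records each label has, and a list of lengths shows whether the records are consistent.

```python
from collections import Counter

# Hypothetical labels and ragged feature lists, standing in for the real data
toy_labels = ["piano", "guitar", "piano", "clarinet", "guitar", "piano"]
toy_features = [
    [0.1, 0.2, 0.3],
    [0.4, 0.5],
    [0.6, 0.7, 0.8, 0.9],
    [1.0],
    [1.1, 1.2],
    [1.3, 1.4, 1.5],
]

# How many records does each label have ?
label_counts = Counter(toy_labels)
print(label_counts)

# How many features does each record have ?
lengths = [len(f) for f in toy_features]
print("records:", len(toy_labels), "shortest:", min(lengths), "longest:", max(lengths))
```

Looking at the counts per label also hints at whether the classes are balanced, which matters when reading per-class accuracy later.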

In [None]:
# TODO: How many records ?
print("the labels are:", set(labels))
print("the number of recordings is:", len(labels))

# TODO: How many features ?
# the data is not yet structured as a DataFrame, so we can't get the number of features
# by counting the values in the first record (different records can have different lengths)
# creating a list with the feature length of every recording
features_lengths = [len(f) for f in features]
min_features = min(features_lengths)
max_features = max(features_lengths)
print("the records have from", min_features, "to", max_features, "features.")

# TODO: Plot some features
first_record_features = features[0]
plt.plot(first_record_features)

### Looks like data !

Looks like audio frequency-domain data to be more specific.

If we were to follow some of the data exploration steps we saw in class we would want to put this data in a `DataFrame` in order to calculate some of its statistical properties, and maybe scale/normalize it before we use it in a classifier model.

Let's try it:

In [None]:
features_df = pd.DataFrame(features)
features_df

It looks like it works, but when we look closely at the `DataFrame`, especially at the features that are further to the right, we'll see our problem: `NaN` values.

As previously mentioned, this happens because the length of our features is different for each file.

### Fix Audio Data

Let's fix this by making all of the feature lists have the same length. We can either pad the short ones or slice the longer ones to have the same length as the shortest feature list. The second option is preferable since padding would require adding information to the dataset and that might have side effects.

So, we'll go through the list of lists, create a list of lengths and find the smallest length.

Then, we'll iterate through the list of lists again and slice all the feature lists to have that same length.
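
To make the two options concrete, here is a small sketch on made-up toy lists (not the audio data). It assumes the records are plain Python lists; the zero used for padding is just one possible fill value.

```python
# Toy records with different lengths
toy_features = [[3, 1, 4, 1, 5], [9, 2, 6], [5, 3, 5, 8]]

shortest = min(len(f) for f in toy_features)
longest = max(len(f) for f in toy_features)

# Option 1: slice every record down to the shortest length (drops information)
truncated = [f[:shortest] for f in toy_features]

# Option 2: pad the shorter records with zeros up to the longest length (adds information)
padded = [f + [0] * (longest - len(f)) for f in toy_features]

print(truncated)
print(padded)
```

Slicing only keeps values that were actually measured, while padding invents values that a model could mistake for real signal.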

In [None]:
# TODO: go through the list of features and make their lengths consistent
# creating an empty list to populate with lengths of features
features_length = []
for f in features:
    features_length.append(len(f))
print("features length:",features_length)

# finding the smallest length
smallest_length = min(features_length)
print("smallest length:", smallest_length)

# cropping the features longer than smallest_length and appending the cropped versions to a new list called cropped_features
cropped_features = []
for f in features:
    cropped_f = f[:smallest_length]
    cropped_features.append(cropped_f)
    
# checking the length of the first cropped feature, which should match smallest_length (43000 for this dataset)
print("length of the first cropped feature:",len(cropped_features[0]))

A `DataFrame` created using the cropped features should look more consistent now.

In [None]:
features_df = pd.DataFrame(cropped_features)
features_df

### Bonus: Empty features

We've removed the `NaN` values, but it seems like we have a lot of columns that are all zeros or nearly all zeros.

While it's not necessary, we could also remove these in order to speed up the modeling later.
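
A minimal sketch of the idea on a made-up `DataFrame`: build a boolean mask from the column sums and keep only the columns that pass. The `threshold` value is a hypothetical cutoff, and the sketch assumes all values are non-negative; if a dataset had negative values, summing absolute values would be a safer test for "nearly all zeros".

```python
import pandas as pd

# Toy data: columns 1 and 2 carry (almost) no information
toy_df = pd.DataFrame({
    0: [120.0, 95.0, 130.0],
    1: [0.0, 0.0, 0.0],
    2: [0.5, 0.0, 0.2],
    3: [80.0, 60.0, 75.0],
})

threshold = 100  # hypothetical cutoff, to be tuned for the actual data

# One sum per column
column_sums = toy_df.sum(axis=0)

# True for the columns worth keeping
keep_mask = column_sums >= threshold

# Select only those columns
reduced_df = toy_df.loc[:, keep_mask]
print(list(reduced_df.columns))
```

Selecting with a mask and dropping by column name give the same result here; the mask version just avoids building the list of names to drop.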

In [None]:
# sum of all columns
display(features_df.sum())

# columns where the sum is less than 100
display((features_df.sum(axis=0) < 100))

# TODO: remove columns with no information
# getting the actual columns that are nearly all zeros
columns_to_drop = features_df.columns[(features_df.sum(axis=0) < 100)]
# dropping those columns
features_df = features_df.drop(columns=columns_to_drop)

### Run the Model

Now that we have a `DataFrame` with consistent rows, we can fit and evaluate our model.

The next cell runs the pre-defined classification model, fitting it with our `features_df` `DataFrame` and then reports the accuracy of our model.

We just have to run it.

In [None]:
# Fit the classifier and report training accuracy
AwesomeAudioClassifier.fit(features_df, labels)

### Scale / Normalize

Hmmm.... it runs, but we can do better.

We saw in class that normalizing/rescaling our features can help us find actual patterns in our data. It also helps models find patterns.

Try scaling the `DataFrame` using either a `MinMaxScaler` or a `StandardScaler` object.
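
For intuition, the sketch below writes out the arithmetic behind the two strategies for a single toy column in plain Python. The function names are made up for illustration and are not part of scikit-learn; the standard-scaling version uses the population standard deviation, which is what `StandardScaler` is documented to use.

```python
from statistics import mean, pstdev

def min_max_scale(values):
    """Map values to the [0, 1] range using the column's min and max."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standard_scale(values):
    """Center values at 0 and divide by the population standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

toy_column = [2.0, 4.0, 6.0, 8.0]

mm = min_max_scale(toy_column)
st = standard_scale(toy_column)

print(mm)
print(st)
```

Both toy functions would divide by zero on a constant column, so they are only meant to show the formulas, not to replace the scaler objects.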

In [None]:
# TODO: scale/normalize features

# creating a scaler object
# min_max_scaler = MinMaxScaler().set_output(transform="pandas")
std_scaler = StandardScaler().set_output(transform="pandas")

# applying the scaler object to our dataframe
# features_min_max_df = min_max_scaler.fit_transform(features_df)
features_scaled_df = std_scaler.fit_transform(features_df)

# this is the scaled version of the dataframe:
# features_min_max_df
features_scaled_df

### Run the Model Again

This time with scaled data.

In [None]:
# Fit the classifier and report training accuracy
AwesomeAudioClassifier.fit(features_scaled_df, labels)

### Interpretation

<span style="color:hotpink;">
Do different scaling strategies influence the prediction results ? What might that tell us about our data ?
</span>

<span style="color:hotpink;">
training accuracy with the minmax scaler: {'clarinet': 0.90476, 'guitar': 0.9, 'piano': 1.0, 'overall': 0.93492}

training accuracy with standard scaler: {'clarinet': 0.90476, 'guitar': 0.9, 'piano': 1.0, 'overall': 0.93492}

The MinMaxScaler transforms every feature to fit within a specific range (0 to 1), relative to the minimum and maximum values of that feature across all audio samples. Using this scaler on the DataFrame helps because it places all features on a comparable scale, preventing features with large ranges from dominating the model's training process. For instance, if one feature has values ranging from 0 to 100000000000 and another has values from 0 to 100, an unscaled model would base its decisions almost entirely on the differences in the first feature. It would nearly ignore the second one, even though that feature could be very useful for differentiating between instruments. When we scale the data, the model makes better decisions because it can evaluate all features on a level playing field.

A key difference between MinMaxScaler and StandardScaler is how they handle values outside the range of the original training set. MinMaxScaler fixes its range using the minimum and maximum of the data it was fit on, so a new data point outside that range is transformed to a value greater than 1 or less than 0. A single extreme value in the training data also compresses every other value in that column into a small part of the [0, 1] range. StandardScaler, on the other hand, uses the mean and standard deviation of each column to center its values at 0 with a standard deviation of 1. For roughly normally distributed data most scaled values fall within about [-3, 3], but this output range is not fixed, so new data points can be scaled to any value. Because the mean and standard deviation depend on all the values in a column rather than just the two extremes, StandardScaler is somewhat less sensitive to outliers than MinMaxScaler, although outliers still affect it.

In our case, the training accuracy improved significantly after scaling with both methods, which means that scaling was necessary. However, the prediction results did not change when using the different scaling strategies, which suggests that the data does not have extreme outliers that would give one scaler an advantage over the other.</span>

<div style="background:#040; padding:10px; width:calc(100% - 28px)">

## Image Data

This is a bit trickier, but only because our classifier model for images is a bit pickier. Not only do we have to ensure that all of our records have the same number of features (images have the same number of pixels), we will also have to convert the pixels into grayscale pixels.

Let's start by reading the data and looking at what we get.

</div>

In [None]:
imgs, labels = AwesomeImageClassifier.get_training_data(IMAGE_PATH)

### Image Data

What did we get in the `imgs` variable ? How many records do we have ? How many features does each record/image have ?

In [None]:
# TODO: look at the imgs and labels variables and get some information about the data

# How many imgs ?
print("the number of imgs is", len(imgs))

# How many labels ?
print("the labels are:", set(labels))


# How many records ?
print("the amount of records is:", len(labels))


# How many features ?
# assuming that each pixel is a feature, we need to calculate the number of pixels (width x height) of each image
# we can get these dimensions from the size of the image
features_imgs = []
for i in imgs:
    width, height = i.size
    num_pixels = width * height
    features_imgs.append(num_pixels)
min_features = min(features_imgs)
max_features = max(features_imgs)

print("the records have from",min_features, "to", max_features, "features (/pixels).")


### Create Image Features

It seems like we have actual `PIL` image objects and their labels.

This will work to our advantage: if we just create a `DataFrame` of the extracted pixels from these images we'll probably have a problem with missing feature values again, but having image objects lets us fix the images before we extract their pixels.

In [None]:
features = []
for img in imgs:
  features.append(get_pixels(img))

print(len(features), len(features[0]), len(features[11]))

features_df = pd.DataFrame(features)
features_df

### Fix Images

We could follow a similar approach to how we fixed the audio data, and just slice our pixel arrays to have the same length as the shortest pixel array, but that will distort our images. Try it out to see the result. Instead of taking pixels out from the end of the image, what we really have to do is change the images' dimensions so they all have the same `width` and `height` before we get their pixels.
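
A tiny sketch of the distortion, using a made-up grayscale "image" stored as a flat, row-major list (the helper `to_rows` is hypothetical, written only for this illustration):

```python
def to_rows(flat_pixels, width):
    """Split a row-major flat pixel list into rows of the given width."""
    return [flat_pixels[i:i + width] for i in range(0, len(flat_pixels), width)]

# A 4-wide by 2-tall toy image: left half dark (0), right half bright (9)
wide_img = [0, 0, 9, 9,
            0, 0, 9, 9]
original_rows = to_rows(wide_img, 4)

# Slice it to 4 pixels, as if the shortest image in the dataset were 2x2
sliced = wide_img[:4]

# Reading the sliced pixels back as a 2-wide image changes the picture:
# the left/right split has turned into a top/bottom split
sliced_rows = to_rows(sliced, 2)

print(original_rows)
print(sliced_rows)
```

The sliced list has the right length, but its pixels no longer sit where they were in the original picture, which is why the images need a common `width` and `height` first.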

There are a couple of ways to achieve this:
- Crop: use the `image.crop()` function to cut the images.
- Resize: use `image.resize()` to stretch/squeeze the images into specific shapes.

Documentation for [`crop()`](https://pillow.readthedocs.io/en/stable/reference/Image.html#PIL.Image.Image.crop) and [`resize()`](https://pillow.readthedocs.io/en/stable/reference/Image.html#PIL.Image.Image.resize).

Take a look at a few images before picking a strategy and then take a look after to see what the chosen strategy does to the images.
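
As a rough sketch of the difference between the two strategies, the example below applies both to a blank toy image created with `Image.new()`. The `target_size` of 64x64 is an arbitrary choice for illustration, and the center crop is just one possible way to pick a crop box.

```python
from PIL import Image

# A blank 120x80 toy image standing in for one of the dataset images
toy_img = Image.new("RGB", (120, 80), color=(30, 120, 60))
target_size = (64, 64)  # arbitrary (width, height) chosen for this sketch

# Resize: keeps the whole picture but stretches/squeezes it to the new shape
resized = toy_img.resize(target_size)

# Crop: keeps the original proportions but throws away the borders
# the box is (left, upper, right, lower); here it is centered on the image
left = (toy_img.width - target_size[0]) // 2
upper = (toy_img.height - target_size[1]) // 2
cropped = toy_img.crop((left, upper, left + target_size[0], upper + target_size[1]))

print(resized.size, cropped.size)
```

Both results have the same dimensions, so either one yields the same number of pixels per image. A centered crop like this only makes sense when every image is at least as large as the target size.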

In [None]:
# TODO: look at characteristics/dimensions of the images

# looking at two random images before resizing
display(imgs[14])
display(imgs[8])

# getting all the widths and heights to calculate their averages, which we'll use as the target size when resizing
widths = []
heights = []
for i in imgs:
  width, height = i.size
  widths.append(width)
  heights.append(height)
avg_width = int(sum(widths) / len(widths))
avg_height = int(sum(heights) / len(heights))

# TODO: go through the images and make their dimensions consistent
resized_imgs = []
new_size = (avg_width, avg_height)
for i in imgs:
   resized_img = i.resize(new_size)
   resized_imgs.append(resized_img)

# TODO: look at some images
display(resized_imgs[14])
display(resized_imgs[8])
# (they're bigger)

### Create Features

Now that we have images with consistent dimensions, we can extract their pixels and convert them to grayscale, so we get a nice looking `DataFrame` to send to our classifier model.
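
There is more than one way to turn an `(R, G, B)` pixel into a single grayscale value. The sketch below compares a plain channel average with the weighted formula that Pillow documents for its `convert("L")` mode. The function names are made up for illustration, and which variant `AwesomeImageClassifier` expects is not stated here, so treat this as a comparison rather than a recommendation.

```python
def average_gray(pixel):
    """Grayscale as the integer average of the three channels."""
    r, g, b = pixel
    return (r + g + b) // 3

def luma_gray(pixel):
    """Grayscale using the ITU-R 601-2 luma weights (R: 0.299, G: 0.587, B: 0.114)."""
    r, g, b = pixel
    return (299 * r + 587 * g + 114 * b) // 1000

# Pure red, green, blue and white toy pixels
toy_pixels = [(255, 0, 0), (0, 255, 0), (0, 0, 255), (255, 255, 255)]

avg_values = [average_gray(p) for p in toy_pixels]
luma_values = [luma_gray(p) for p in toy_pixels]

print(avg_values)
print(luma_values)
```

The average treats all three colors the same, while the weighted version makes green pixels look brighter than blue ones, closer to how people perceive brightness.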

In [None]:
# TODO: calculate grayscale pixel values


grayscale_pixel_values = []
for img in resized_imgs:
    bwpxs = []
    # getting the (R, G, B) pixels from the current image and
    # calculating each grayscale value as the average of the three channels
    for r, g, b in get_pixels(img):
        gval = (r + g + b) // 3
        bwpxs.append(gval)
    grayscale_pixel_values.append(bwpxs)
    

# TODO: look at some images with make_image()
himg = make_image(grayscale_pixel_values[0])
display(himg)

# TODO: create DataFrame
features_df = pd.DataFrame(grayscale_pixel_values)

### Run the Image Model

Now that we have a `DataFrame` with consistent features, we can fit and evaluate our model.

The next cell runs the pre-defined classification model, fitting it with our `features_df` `DataFrame` and then reports the accuracy of our model.

We just have to run it (and wait a bit because it can take up to $20$ seconds for it to run).

In [None]:
# Fit the classifier and report training accuracy
AwesomeImageClassifier.fit(features_df, labels)

### Scaling / Normalizing

Run the classifier model again, but this time using normalized features.

In [None]:
# TODO: create scaler object, scale data and re-run classification

# TODO: scale/normalize features

# creating a scaler object
min_max_scaler = MinMaxScaler().set_output(transform="pandas")
# std_scaler = StandardScaler().set_output(transform="pandas")

# applying the scaler object to our dataframe
features_min_max_df = min_max_scaler.fit_transform(features_df)
# features_scaled_df = std_scaler.fit_transform(features_df)

# this is the scaled version of the dataframe:
features_min_max_df
# features_scaled_df

# Fit the classifier and report training accuracy
AwesomeImageClassifier.fit(features_min_max_df, labels)
# AwesomeImageClassifier.fit(features_scaled_df, labels)


### Interpretation

<span style="color:hotpink;">
Do different scaling strategies influence the prediction results ? What might that tell us about our data ?
</span>

<span style="color:hotpink;">
training accuracy with the minmax scaler: {'florist': 0.97917, 'forest': 1.0, 'tree': 0.825, 'overall': 0.93472}

training accuracy with the standard scaler: {'florist': 0.97917, 'forest': 1.0, 'tree': 0.85, 'overall': 0.94306}

The different scaling strategies influenced the prediction results only slightly, with StandardScaler performing marginally better overall (0.94306 vs. 0.93472) because of a small gain on the 'tree' class (0.85 vs. 0.825). This slight edge might indicate that centering the data around a mean of 0 was a little more effective than scaling it to a 0-1 range. The fact that the results are so similar suggests that the data likely does not have significant outliers, which would make sense if all the grayscale pixel values share the same bounded 0-255 range. </span>