# COGS 108 - Data Checkpoint

# Names

- Mateo Ignacio
- Samuel Piltch
- Nate del Rosario 🐐
- Lisa Hwang
- Geovaunii D. White

<a id='research_question'></a>
# Research Question

Since we have never worked with audio data or audio classification, we wanted to start with a binary example. We ask: can we classify animal sounds (cat vs. dog) using audio analysis?

# Dataset(s)

- Dataset Name: Audio Cats and Dogs
- Link to the dataset: [data](https://www.kaggle.com/datasets/mmoreaux/audio-cats-and-dogs)
- Number of observations: 277

The dataset consists of 277 wav files of cat and dog sounds, along with a supplemental csv of train/test splits.
A deeper look at the data can be found [here](https://www.kaggle.com/code/mmoreaux/a-look-into-the-data?scriptVersionId=1573551),
which shows that the wav files can be converted into more usable forms such as numpy arrays.
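
As a rough illustration of turning a wav file into a numpy array, here is a minimal sketch using `scipy.io.wavfile`. The helper name `load_wav_as_array` is hypothetical, the scaling step assumes signed integer PCM (for example 16-bit), and the demo uses a synthetic tone rather than a file from the dataset.

```python
import os
import tempfile

import numpy as np
from scipy.io import wavfile


def load_wav_as_array(path):
    """Read a wav file and return (sample_rate, samples as float32)."""
    sample_rate, samples = wavfile.read(path)
    if np.issubdtype(samples.dtype, np.integer):
        # Assumption: signed integer PCM (e.g. 16-bit); scale to roughly [-1, 1]
        scale = float(np.iinfo(samples.dtype).max)
        samples = samples.astype(np.float32) / scale
    return sample_rate, samples.astype(np.float32)


# Demo on a synthetic one-second 440 Hz tone (not a file from the dataset)
demo_rate = 16000
t = np.arange(demo_rate) / demo_rate
tone = (0.5 * np.sin(2 * np.pi * 440 * t) * 32767).astype(np.int16)

with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, 'tone.wav')
    wavfile.write(demo_path, demo_rate, tone)
    rate, samples = load_wav_as_array(demo_path)

print(rate, samples.shape, samples.dtype)
```

`librosa.load`, which this notebook uses later, returns a float array and a sample rate in one call; the sketch only shows what that conversion involves.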

# Setup

In [28]:
import numpy as np
import pandas as pd
import seaborn as sns
import plotly.express as px
import os
import glob
import librosa

# audio plot 
from scipy.io import wavfile as wav
import IPython.display as ipd

# DL libraries that we may or may not need
import tensorflow
from sklearn.preprocessing import LabelEncoder
from tensorflow.keras.utils import to_categorical
from sklearn.model_selection import train_test_split
from tensorflow.keras.models import Sequential,Model
from tensorflow.keras.callbacks import Callback,EarlyStopping
from sklearn.metrics import confusion_matrix, classification_report
from tensorflow.keras.layers import Conv2D, Activation, Flatten, Dense,GlobalAveragePooling2D, Dropout

In [29]:
# List the wav files
ROOT_DIR = 'data/cats_dogs/'
X_path = os.listdir(ROOT_DIR)

# change y to int values
y = [0 if 'cat' in f else 1 for f in X_path] 

# Split train and test
X_train, X_test, y_train, y_test = train_test_split(X_path, y, test_size=0.33)

print(f"in Total, there are {len(y) - sum(y)} cats and {sum(y)} dogs")
print(f"in X_train, there are {len(y_train) - sum(y_train)} cats and {sum(y_train)} dogs")
print(f"in X_test, there are {len(y_test) - sum(y_test)} cats and { sum(y_test)} dogs")

in Total, there are 164 cats and 115 dogs
in X_train, there are 104 cats and 82 dogs
in X_test, there are 60 cats and 33 dogs
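
The totals printed above sum to 279, two more than the 277 wav files listed for the dataset, and the per-label counts computed later in this notebook show 113 dogs rather than 115. One possible explanation (an assumption, not verified here) is that `os.listdir` returns two non-wav entries, which the `'cat' in f` test then labels as dogs. A minimal sketch of a stricter listing step follows; `list_labelled_wavs` is a hypothetical helper and the demo file names are placeholders.

```python
import os
import tempfile


def list_labelled_wavs(root_dir):
    """Return (filenames, labels) for wav files; 0 = cat, 1 = dog."""
    filenames, labels = [], []
    for name in sorted(os.listdir(root_dir)):
        if not name.lower().endswith('.wav'):
            continue  # ignore hidden files, csvs, and other non-audio entries
        if name.startswith('cat'):
            labels.append(0)
        elif name.startswith('dog'):
            labels.append(1)
        else:
            continue  # label cannot be determined from the filename
        filenames.append(name)
    return filenames, labels


# Demo on a temporary directory with empty placeholder files (hypothetical names)
with tempfile.TemporaryDirectory() as tmp:
    for name in ['cat_1.wav', 'dog_barking_1.wav', '.DS_Store', 'notes.txt']:
        open(os.path.join(tmp, name), 'w').close()
    demo_files, demo_labels = list_labelled_wavs(tmp)

print(demo_files, demo_labels)
```

The returned lists could be passed to `train_test_split` in place of `X_path` and `y`.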


# Data Cleaning

Since we are working with wav files, which are not in tabular numerical form, we need to handle them differently.
Our approach is as follows:
- Create a DataFrame of file paths and assign a label to each file.
- For each observation, use the file path to generate a new set of data: the audio features and statistics for that particular wav file.
This is similar to [looking up a song URI from a general dataset](https://developer.spotify.com/documentation/web-api/reference/get-several-audio-features) and producing a new feature series for it.
- Merge these two DataFrames using an outer join to combine the columns from both. This should require little imputation, since both DataFrames contain the same observations to merge on.
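
The merge step in the list above can be sketched as follows. This is a minimal illustration, not the notebook's actual merge: the two small DataFrames, the `mfcc_0` column, and its values are made up, and `file_path` is assumed to be the shared key.

```python
import pandas as pd

# Hypothetical stand-ins for the label table and the feature table
labels_df = pd.DataFrame({
    'file_path': ['cat_1.wav', 'dog_barking_1.wav'],
    'label': ['cat', 'dog'],
})
features_df = pd.DataFrame({
    'file_path': ['cat_1.wav', 'dog_barking_1.wav'],
    'mfcc_0': [-276.0, -262.4],  # made-up values for illustration
})

# Outer join on the shared key; indicator=True adds a '_merge' column
# showing whether each row came from the left, right, or both tables
merged = labels_df.merge(features_df, on='file_path', how='outer', indicator=True)

# Rows found in only one table would need imputation or investigation
unmatched = merged[merged['_merge'] != 'both']
print(merged)
print(len(unmatched))
```

If both tables really cover the same files, `unmatched` is empty and the outer join behaves like an inner join.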

Let's read in our train/test split:

In [30]:
df = pd.read_csv('data/train_test_split.csv')
df = df[['test_cat', 'test_dog', 'train_cat', 'train_dog']]
df

,test_cat,test_dog,train_cat,train_dog
0,cat_22.wav,dog_barking_97.wav,cat_99.wav,dog_barking_33.wav
1,cat_116.wav,dog_barking_0.wav,cat_54.wav,dog_barking_86.wav
2,cat_155.wav,dog_barking_93.wav,cat_34.wav,dog_barking_45.wav
3,cat_58.wav,dog_barking_10.wav,cat_132.wav,dog_barking_76.wav
4,cat_77.wav,dog_barking_26.wav,cat_124.wav,dog_barking_4.wav
...,...,...,...,...
110,,,cat_15.wav,
111,,,cat_88.wav,
112,,,cat_73.wav,
113,,,cat_32.wav,


As we can see, there are many missing (NaN) values because the four partitions have different sizes, so the shorter columns are padded. This DataFrame is also not in the most usable format, so we will load our data into a different form.
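
For reference, the wide split table could also be reshaped into one row per file. The sketch below is a toy example under stated assumptions: it reuses a few filenames from the output above, blanks one cell to imitate the uneven column lengths, and the names `wide`, `tidy`, `partition`, and `split` are hypothetical.

```python
import pandas as pd

# Toy frame mimicking the layout of train_test_split.csv
wide = pd.DataFrame({
    'test_cat': ['cat_22.wav', 'cat_116.wav'],
    'test_dog': ['dog_barking_97.wav', None],  # blank cell, as in shorter columns
    'train_cat': ['cat_99.wav', 'cat_54.wav'],
    'train_dog': ['dog_barking_33.wav', 'dog_barking_86.wav'],
})

# One row per (partition, filename); drop the padding cells
tidy = wide.melt(var_name='partition', value_name='filename')
tidy = tidy.dropna(subset=['filename'])

# Split 'test_cat' into split='test' and label='cat'
parts = tidy['partition'].str.split('_', expand=True)
tidy = tidy.assign(split=parts[0], label=parts[1])
tidy = tidy.drop(columns='partition').reset_index(drop=True)
print(tidy)
```

The long form has no padding values and keeps the provided split alongside each label.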

In [31]:
base_dir = 'data/cats_dogs'
labels = ['cat', 'dog']
data = []

# Iterate through the files in the base directory
for filename in os.listdir(base_dir):
    if filename.endswith('.wav'):
        file_path = os.path.join(base_dir, filename)

        # Determine the label based on the filename
        if filename.startswith('cat'):
            label = 'cat'
        elif filename.startswith('dog'):
            label = 'dog'
        else:
            # Skip the file if the label cannot be determined
            continue

        # Append the file path and label to the data list
        data.append({'file_path': file_path, 'label': label})

# Convert the data list into a pandas DataFrame
cats_and_dogs = pd.DataFrame(data).sort_values(by = 'label', ascending=True)
cats_and_dogs


,file_path,label
0,data/cats_dogs/cat_74.wav,cat
146,data/cats_dogs/cat_163.wav,cat
149,data/cats_dogs/cat_31.wav,cat
151,data/cats_dogs/cat_25.wav,cat
152,data/cats_dogs/cat_19.wav,cat
...,...,...
184,data/cats_dogs/dog_barking_112.wav,dog
51,data/cats_dogs/dog_barking_64.wav,dog
134,data/cats_dogs/dog_barking_3.wav,dog
128,data/cats_dogs/dog_barking_36.wav,dog


This form is much neater: each file now has an explicit label, which will make binary classification easier.

In [32]:
counts = cats_and_dogs.groupby('label').agg('count')
counts

label,file_path
cat,164
dog,113


In [33]:
def extract_features(file_path):
    """Extract time-averaged MFCC, chroma, and spectral contrast features from a wav file.

    These are the features that were recommended to us; we should consider adding more.
    """
    
    y, sr = librosa.load(file_path)

    # Extract MFCC features
    mfcc = librosa.feature.mfcc(y=y, sr=sr)
    mfcc_mean = np.mean(mfcc, axis=1)

    # Extract chroma features
    chroma = librosa.feature.chroma_stft(y=y, sr=sr)
    chroma_mean = np.mean(chroma, axis=1)

    # Extract spectral contrast features
    spec_contrast = librosa.feature.spectral_contrast(y=y, sr=sr)
    spec_contrast_mean = np.mean(spec_contrast, axis=1)

    features = np.concatenate([mfcc_mean, chroma_mean, spec_contrast_mean])
    return features

def add_features(cats_and_dogs):
    """Adds certain features to the existing dataset"""
    
    # Assuming cats_and_dogs is a DataFrame containing file paths under the 'file_path' column
    num_rows = len(cats_and_dogs)

    # Create empty NumPy arrays for storing features
    mfcc_features = np.empty((num_rows, 20))
    chroma_features = np.empty((num_rows, 12))
    spec_contrast_features = np.empty((num_rows, 7))

    # Iterate through the rows of the DataFrame, extract features for the current file
    for index, row in cats_and_dogs.iterrows():
        file_path = row['file_path']
        features = extract_features(file_path)

        # Check if the index is within bounds (will error otherwise)
        if index < num_rows:
            # Add the extracted features to the respective NumPy arrays
            mfcc_features[index] = features[:20] 
            chroma_features[index] = features[20:32]  
            spec_contrast_features[index] = features[32:]  

    # Add the extracted features as new columns in the DataFrame
    cats_and_dogs['mfcc'] = list(mfcc_features)
    cats_and_dogs['chroma'] = list(chroma_features)
    cats_and_dogs['spec_contrast'] = list(spec_contrast_features)
    return cats_and_dogs
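
To make the shapes in the two functions above concrete: each librosa feature is a 2-D array of shape (n_features, n_frames), and `np.mean(..., axis=1)` averages over time, leaving one value per coefficient. The sketch below uses random stand-in arrays rather than real librosa output, and `n_frames` is an arbitrary assumed value; the row counts 20, 12, and 7 are the ones `add_features` relies on.

```python
import numpy as np

rng = np.random.default_rng(0)
n_frames = 50  # hypothetical number of analysis frames for one clip

# Stand-ins shaped like the librosa outputs: (n_features, n_frames)
mfcc = rng.normal(size=(20, n_frames))
chroma = rng.random(size=(12, n_frames))
spec_contrast = rng.random(size=(7, n_frames))

# Average over the time axis, then join into one fixed-length vector
features = np.concatenate([m.mean(axis=1) for m in (mfcc, chroma, spec_contrast)])

print(features.shape)  # 20 + 12 + 7 values
print(features[:20].shape, features[20:32].shape, features[32:].shape)
```

Because every clip collapses to the same 39 values regardless of its length, clips of different durations can share one feature table.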

In [34]:
cats_and_dogs = add_features(cats_and_dogs)
cats_and_dogs

,file_path,label,mfcc,chroma,spec_contrast
0,data/cats_dogs/cat_74.wav,cat,"[-276.0411071777344, 105.24787139892578, -22.0...","[0.47054484486579895, 0.5720701813697815, 0.58...","[14.061530355875648, 14.454450031504017, 14.83..."
146,data/cats_dogs/cat_163.wav,cat,"[-352.4539794921875, 95.86172485351562, -27.75...","[0.21052506566047668, 0.12869735062122345, 0.2...","[18.26224054072285, 23.685789045613287, 34.535..."
149,data/cats_dogs/cat_31.wav,cat,"[-381.098876953125, 133.116943359375, 28.10297...","[0.6545570492744446, 0.4674604833126068, 0.354...","[23.258493391398986, 18.01721994561419, 18.512..."
151,data/cats_dogs/cat_25.wav,cat,"[-282.0267639160156, 135.3180694580078, -27.07...","[0.2937651574611664, 0.3864949643611908, 0.277...","[22.03832494555295, 22.202616951665533, 26.545..."
152,data/cats_dogs/cat_19.wav,cat,"[-342.8298645019531, 94.88453674316406, -37.67...","[0.21329431235790253, 0.21201826632022858, 0.2...","[16.66223592411783, 13.437311764820556, 21.077..."
...,...,...,...,...,...
184,data/cats_dogs/dog_barking_112.wav,dog,"[-262.3529357910156, 104.95082092285156, -17.7...","[0.5691002011299133, 0.5664060711860657, 0.488...","[13.296213402630706, 14.02747405875588, 16.428..."
51,data/cats_dogs/dog_barking_64.wav,dog,"[-261.4265441894531, 160.21958923339844, -19.4...","[0.16335465013980865, 0.17941848933696747, 0.2...","[16.418053164264435, 29.863924132396324, 29.88..."
134,data/cats_dogs/dog_barking_3.wav,dog,"[-160.60214233398438, 118.60657501220703, -48....","[0.646116316318512, 0.6256576776504517, 0.5965...","[16.576413890232995, 10.888284651165053, 11.98..."
128,data/cats_dogs/dog_barking_36.wav,dog,"[-426.46875, 31.09389877319336, -27.6320247650...","[0.48396408557891846, 0.4430707097053528, 0.46...","[20.314767679158702, 10.943070518954004, 17.36..."
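
One thing that may be worth checking in `add_features`: `cats_and_dogs` was sorted by label without resetting its index, so `mfcc_features[index]` stores each result by index label, while `list(mfcc_features)` is assigned back in row order. If that is the case, rows after the first could receive another file's features. A minimal sketch of a positional variant follows; `add_features_by_position` and its `extract` parameter are hypothetical names, and the demo passes a fake extractor in place of `extract_features` so it runs without audio files.

```python
import numpy as np
import pandas as pd


def add_features_by_position(df, extract, sizes=(20, 12, 7)):
    """Attach feature vectors to df in its current row order.

    `extract` is any callable mapping a file path to a 1-D array of
    length sum(sizes), e.g. the notebook's extract_features.
    """
    n_mfcc, n_chroma, _ = sizes
    # One vector per row, collected in row order rather than by index label
    feats = np.vstack([extract(path) for path in df['file_path']])
    out = df.copy()
    out['mfcc'] = list(feats[:, :n_mfcc])
    out['chroma'] = list(feats[:, n_mfcc:n_mfcc + n_chroma])
    out['spec_contrast'] = list(feats[:, n_mfcc + n_chroma:])
    return out


# Demo with a non-sequential index and a fake extractor (path length as feature)
demo = pd.DataFrame(
    {'file_path': ['a.wav', 'bb.wav', 'ccc.wav'], 'label': ['cat', 'cat', 'dog']},
    index=[2, 0, 1],
)
fake_extract = lambda path: np.full(39, float(len(path)))
demo_out = add_features_by_position(demo, fake_extract)
print(demo_out[['file_path', 'label']])
```

Calling `reset_index(drop=True)` right after `sort_values` would be another way to keep index labels and row positions in agreement.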
