# COGS 108 - Data Checkpoint

# Names

- Mateo Ignacio
- Samuel Piltch
- Nate del Rosario 🐐
- Lisa Hwang
- Geovaunii D. White

<a id='research_question'></a>
# Research Question

Since we've never worked with audio data or audio classification, we wanted to start with a binary example. We ask the question: can we classify animal noises (cat vs. dog) using audio analysis?

# Dataset(s)

- Dataset Name: Audio Cats and Dogs
- Link to the dataset: [data](https://www.kaggle.com/datasets/mmoreaux/audio-cats-and-dogs)
- Number of observations: 277

The dataset consists of 277 WAV files of cats and dogs. In addition, a supplemental CSV of train/test splits has been provided.
A deeper dive into the data can be found [here](https://www.kaggle.com/code/mmoreaux/a-look-into-the-data?scriptVersionId=1573551),
where we can see that the WAV files can be processed into more usable forms such as NumPy arrays.
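
As a minimal sketch of what turning a WAV file into a NumPy array can look like, the snippet below uses `scipy.io.wavfile`. To stay self-contained it first writes a one-second synthetic tone to a temporary file; the file name `demo_tone.wav`, the 16 kHz sample rate, and the helper `load_wav` are assumptions made for illustration, not properties of the dataset. With the real data, a path under `data/cats_dogs/` would be passed instead.

```python
import os
import tempfile

import numpy as np
from scipy.io import wavfile


def load_wav(path):
    """Read a WAV file and return (sample_rate, samples, duration_seconds)."""
    sample_rate, samples = wavfile.read(path)
    duration = samples.shape[0] / sample_rate
    return sample_rate, samples, duration


# Synthetic stand-in for a dataset file: a 1-second 440 Hz tone at 16 kHz.
demo_rate = 16000
t = np.arange(demo_rate) / demo_rate
tone = (0.5 * np.sin(2 * np.pi * 440 * t) * 32767).astype(np.int16)

demo_path = os.path.join(tempfile.mkdtemp(), "demo_tone.wav")
wavfile.write(demo_path, demo_rate, tone)

sample_rate, samples, duration = load_wav(demo_path)
print(sample_rate, samples.dtype, samples.shape, duration)
```

The returned `samples` array can then be fed to any NumPy-based feature computation.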

# Setup

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import plotly.express as px
import os
import glob

# audio plot 
from scipy.io import wavfile as wav
import IPython.display as ipd

# DL libraries that we may or may not need
import tensorflow
from sklearn.preprocessing import LabelEncoder
from tensorflow.keras.utils import to_categorical
from sklearn.model_selection import train_test_split
from tensorflow.keras.models import Sequential,Model
from tensorflow.keras.callbacks import Callback,EarlyStopping
from sklearn.metrics import confusion_matrix, classification_report
from tensorflow.keras.layers import Conv2D, Activation, Flatten, Dense,GlobalAveragePooling2D, Dropout

ModuleNotFoundError: No module named 'plotly'

In [31]:
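# List the wav files; any other directory entry would otherwise
# fall through to the dog label (1) below
ROOT_DIR = 'data/cats_dogs/'
X_path = [f for f in os.listdir(ROOT_DIR) if f.endswith('.wav')]

# Encode labels as int values: 0 = cat, 1 = dog
y = [0 if 'cat' in f else 1 for f in X_path]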

# Split train and test
X_train, X_test, y_train, y_test = train_test_split(X_path, y, test_size=0.33)

print(f"In total, there are {len(y) - sum(y)} cats and {sum(y)} dogs.")
print(f"In X_train, there are {len(y_train) - sum(y_train)} cats and {sum(y_train)} dogs.")
print(f"In X_test, there are {len(y_test) - sum(y_test)} cats and {sum(y_test)} dogs.")

in Total, there are 164 cats and 115 dogs
in X_train, there are 107 cats and 79 dogs
in X_test, there are 57 cats and 36 dogs
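
Because the split above is purely random, the cat/dog ratio can differ between partitions and from run to run. One option, which is not what the notebook currently does, is to pass `stratify` and `random_state` to `train_test_split`. The sketch below uses placeholder file names that mirror the 164 cat / 113 dog WAV counts; the seed value is arbitrary.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Placeholder file names (hypothetical), mirroring the 164 cat / 113 dog WAV counts.
files = [f"cat_{i}.wav" for i in range(164)] + [f"dog_barking_{i}.wav" for i in range(113)]
labels = [0 if name.startswith("cat") else 1 for name in files]

X_train, X_test, y_train, y_test = train_test_split(
    files,
    labels,
    test_size=0.33,
    stratify=labels,   # keep the cat/dog ratio similar in both partitions
    random_state=42,   # arbitrary seed so the split is reproducible
)

print("train:", Counter(y_train))
print("test: ", Counter(y_test))
```

With stratification, both partitions keep roughly the same share of dogs as the full dataset.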


# Data Cleaning

Since we are dealing with WAV files, which are not in tabular numerical form, we need to handle them differently from a typical dataset.
Our approach is as follows:
- Create a DataFrame of file paths and assign a label to each observation.
- For each observation, use the file path to generate a new set of data: the audio features and statistics for that particular WAV file. This is similar to taking a song URI from a general dataset and [requesting its audio features from the Spotify Web API](https://developer.spotify.com/documentation/web-api/reference/get-several-audio-features) to produce a new feature series for it.
- Merge the two DataFrames using an outer join to combine all the columns from both. This should ideally not require much imputation, since both DataFrames contain the same observations and share the file path as the merge key.
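
The steps above can be sketched roughly as follows. The features chosen here (duration, RMS energy, zero-crossing rate) are illustrative assumptions rather than a final feature set, and the helper `extract_features` and the two synthetic files `cat_demo.wav` and `dog_demo.wav` are hypothetical stand-ins so that the example runs without the dataset.

```python
import os
import tempfile

import numpy as np
import pandas as pd
from scipy.io import wavfile


def extract_features(path):
    """Compute a few simple per-file statistics (illustrative choices)."""
    rate, samples = wavfile.read(path)
    x = samples.astype(np.float64)
    if x.ndim > 1:  # average the channels if the file is stereo
        x = x.mean(axis=1)
    signs = np.signbit(x)
    return {
        "file_path": path,
        "duration_s": len(x) / rate,
        "rms": float(np.sqrt(np.mean(x ** 2))),
        # fraction of adjacent sample pairs whose sign differs
        "zero_cross_rate": float(np.mean(signs[1:] != signs[:-1])),
    }


# Synthetic stand-ins for two dataset files (hypothetical names and tones).
tmp_dir = tempfile.mkdtemp()
rate = 16000
t = np.arange(rate) / rate
paths = []
for name, freq in [("cat_demo.wav", 880), ("dog_demo.wav", 220)]:
    path = os.path.join(tmp_dir, name)
    tone = (0.3 * 32767 * np.sin(2 * np.pi * freq * t)).astype(np.int16)
    wavfile.write(path, rate, tone)
    paths.append(path)

# DataFrame 1: file paths and labels.
labels_df = pd.DataFrame({"file_path": paths, "label": ["cat", "dog"]})

# DataFrame 2: one row of features per file.
features_df = pd.DataFrame([extract_features(p) for p in paths])

# Outer join on the shared key; every path appears in both frames,
# so no missing values are expected.
merged = labels_df.merge(features_df, on="file_path", how="outer")
print(merged)
```

Since every file path appears in both frames, the outer join behaves like an inner join here; it only matters if feature extraction fails for some files.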

Let's read in our train/test split:

In [33]:
df = pd.read_csv('data/train_test_split.csv')
df = df[['test_cat', 'test_dog', 'train_cat', 'train_dog']]
df

|     | test_cat | test_dog | train_cat | train_dog |
|-----|----------|----------|-----------|-----------|
| 0 | cat_22.wav | dog_barking_97.wav | cat_99.wav | dog_barking_33.wav |
| 1 | cat_116.wav | dog_barking_0.wav | cat_54.wav | dog_barking_86.wav |
| 2 | cat_155.wav | dog_barking_93.wav | cat_34.wav | dog_barking_45.wav |
| 3 | cat_58.wav | dog_barking_10.wav | cat_132.wav | dog_barking_76.wav |
| 4 | cat_77.wav | dog_barking_26.wav | cat_124.wav | dog_barking_4.wav |
| ... | ... | ... | ... | ... |
| 110 | NaN | NaN | cat_15.wav | NaN |
| 111 | NaN | NaN | cat_88.wav | NaN |
| 112 | NaN | NaN | cat_73.wav | NaN |
| 113 | NaN | NaN | cat_32.wav | NaN |


As we can see, there are many NaN values because the four partitions have different sizes, so the shorter columns are padded. This DataFrame is also not in the most usable format, so we will load our data into a different form.
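
For reference, the wide split table could also be reshaped into a tidy long form with `melt`. This is only a sketch of one alternative, not the approach taken below. The small frame `wide` stands in for `train_test_split.csv`, reusing a few file names from the preview above, and the column names `partition_label`, `filename`, and `split` are hypothetical.

```python
import numpy as np
import pandas as pd

# Small stand-in for the wide split table (a few rows from the preview above).
wide = pd.DataFrame({
    "test_cat": ["cat_22.wav", "cat_116.wav", np.nan],
    "test_dog": ["dog_barking_97.wav", "dog_barking_0.wav", np.nan],
    "train_cat": ["cat_99.wav", "cat_54.wav", "cat_15.wav"],
    "train_dog": ["dog_barking_33.wav", "dog_barking_86.wav", np.nan],
})

# Stack the four columns into (partition_label, filename) pairs
# and drop the NaN padding.
long = (
    wide.melt(var_name="partition_label", value_name="filename")
        .dropna(subset=["filename"])
        .reset_index(drop=True)
)

# Split names such as "train_cat" into split="train" and label="cat".
parts = long["partition_label"].str.split("_", expand=True)
long["split"] = parts[0]
long["label"] = parts[1]
long = long.drop(columns="partition_label")
print(long)
```

Each row then describes one file with its split and label, and no NaN padding remains.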

In [40]:
base_dir = 'data/cats_dogs'
labels = ['cat', 'dog']
data = []

# Iterate through the files in the base directory
for filename in os.listdir(base_dir):
    if filename.endswith('.wav'):
        file_path = os.path.join(base_dir, filename)

        # Determine the label based on the filename
        if filename.startswith('cat'):
            label = 'cat'
        elif filename.startswith('dog'):
            label = 'dog'
        else:
            # Skip the file if the label cannot be determined
            continue

        # Append the file path and label to the data list
        data.append({'file_path': file_path, 'label': label})

# Convert the data list into a pandas DataFrame
cats_and_dogs = pd.DataFrame(data).sort_values(by = 'label', ascending=True)
cats_and_dogs


|     | file_path | label |
|-----|-----------|-------|
| 0 | data/cats_dogs/cat_74.wav | cat |
| 146 | data/cats_dogs/cat_163.wav | cat |
| 149 | data/cats_dogs/cat_31.wav | cat |
| 151 | data/cats_dogs/cat_25.wav | cat |
| 152 | data/cats_dogs/cat_19.wav | cat |
| ... | ... | ... |
| 184 | data/cats_dogs/dog_barking_112.wav | dog |
| 51 | data/cats_dogs/dog_barking_64.wav | dog |
| 134 | data/cats_dogs/dog_barking_3.wav | dog |
| 128 | data/cats_dogs/dog_barking_36.wav | dog |


As we can see, this form is much neater since we are now dealing with labelled data, which will make binary classification easier.

In [48]:
counts = cats_and_dogs.groupby('label').agg('count')
counts

| label | file_path |
|-------|-----------|
| cat | 164 |
| dog | 113 |
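
These counts show a moderate class imbalance, which implies a majority-class baseline that any classifier should beat. The sketch below hard-codes the two counts from the table above (in the notebook they could be read from `counts` instead) and computes the class proportions and the accuracy of always predicting the majority class, which comes to roughly 59%.

```python
import pandas as pd

# Class counts taken from the table above.
class_counts = pd.Series({"cat": 164, "dog": 113}, name="file_path")

proportions = class_counts / class_counts.sum()
majority_label = class_counts.idxmax()
baseline_accuracy = proportions.max()

print(proportions.round(3))
print(f"Always predicting '{majority_label}' would score about {baseline_accuracy:.1%} accuracy.")
```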
