# IMPORTS FOR DATA HANDLING
**NOTE**: The functions below are designed to be easy to import and thus independent of specific user-defined files. Hence, some constants (such as the list of genres or the number of genres) were obtained elsewhere and are simply hard-coded here.

# Imported libraries & modules

In [1]:
# For handling data:
import numpy as np
import csv
import pandas as pd

# For enabling data shuffling:
from random import shuffle

# For handling tensors:
import tensorflow as tf
from tensorflow.keras.utils import to_categorical

# Preparing datasets
Preparing data for the following:

- Viewing and working with the data and target labels in simple formats
- Working with neural networks (abstracting aspects like batches and data shuffling)

**SOME NOTES**:

- `to_categorical` was imported as `from tensorflow.keras.utils import to_categorical`
- `to_categorical` converts integer labels to the appropriate 1-hot encoding
- The code below is mostly for convenience; we can do without it
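
To make the note on 1-hot encoding concrete, here is a minimal NumPy-only sketch of the kind of array that encoding produces. It is not the Keras implementation: `one_hot` is a hypothetical helper written only for illustration, and the label values are made up.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Hypothetical helper: map integer labels to 1-hot rows."""
    labels = np.asarray(labels, dtype=int)
    # Row i of the identity matrix is the 1-hot vector for class i.
    return np.eye(num_classes, dtype=np.float32)[labels]

encoded = one_hot([0, 2, 1], num_classes=3)
print(encoded)
```

For these inputs, `to_categorical([0, 2, 1], num_classes=3)` should yield the same values: one row per label, with a single 1 in the column of that label's class.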

In [4]:
# Segmenting the data (along with duplicating the corresponding labels of course):
def get_segmented_data(df, audio_data, segments_per_file, shuffle_data=True):
    # Segmenting the audio data's frames as indicated by `segments_per_file`...
    
    # NOTE: We assume the audio data to be either spectrograms, melspectrograms or MFCC arrays
    # NOTE: We also assume that each entry in the raw data is equally dimensioned
    n_frames = audio_data.shape[2] # We assume the data to be shaped as: <Entries>, <MFCCs/Frequencies>, <Frames>
    segment_size = int(n_frames // segments_per_file)
    data = []
    for i in range(len(df['TRACK'])):
        for j in range(segments_per_file):
            data.append(audio_data[i, :, j*segment_size:(j+1)*segment_size])
    
    #________________________
    # Duplicating labels to match each segment...
    # Total labels:
    labels = []
    for label in df['TARGET']:
        labels += [label]*segments_per_file

    #________________________
    # Shuffling the data for unbiased training and testing (hence better convergence of model):
    # Joining melspectrograms and labels to shuffle data and labels in corresponding order...
    D = list(zip(data, labels))
    # Shuffling list items...
    if shuffle_data:
        shuffle(D)
    # Separating melspectrograms and their labels for future convenience...
    segmented_data = np.array([d[0] for d in D])
    segmented_labels = np.array([d[1] for d in D])

    return segmented_data, segmented_labels

#================================================
# Dividing the data and labels into training and validation datasets:
def get_data_in_splits(data, labels, validation_start):
    # Specifying proportions for datasets:
    validation_start = round(validation_start*len(labels)) # Might as well be `len(data)`
    
    # Training data:
    train_data = data[:validation_start] # Feature values
    train_labels = labels[:validation_start] # Target values
    
    # Validation data:
    validation_data = data[validation_start:] # Feature values
    validation_labels = labels[validation_start:] # Target values
    
    print(f'Training data shape = {train_data.shape}, Validation data shape = {validation_data.shape}')

    return train_data, train_labels, validation_data, validation_labels

#================================================
# Get datasets wrapped in a `tf.data.Dataset` object for convenience when working with neural networks:
def get_data(df, audio_data, n_classes, segments_per_file=4, validation_start=0.7, batch_size=32, shuffle_data=True):
    # NOTE: `n_classes` = Number of target classes
    
    data, labels = get_segmented_data(df, audio_data, segments_per_file, shuffle_data)
    
    # Dividing the data and labels into training and validation datasets:
    train_data, train_labels, validation_data, validation_labels = get_data_in_splits(data, labels, validation_start)
    
    #------------------------------------
    # Dictionary of training and validation data and labels in simpler data types:
    data_and_labels = {}
    data_and_labels['train_data'] = train_data
    data_and_labels['validation_data'] = validation_data
    data_and_labels['train_labels'] = train_labels
    data_and_labels['validation_labels'] = validation_labels

    #------------------------------------
    # Preparing the dataset for working in neural networks:
    train_dataset = tf.data.Dataset.from_tensor_slices((train_data, to_categorical(train_labels, num_classes=n_classes)))
    '''
    NOTE:
    Shuffling rows in training dataset helps in making the model converge in training.
    However, this is not necessary in our case since our dataset was already shuffled before (when `shuffle_data=True`).
    However, if it were necessary, we would have done it as follows:
    
    `train_dataset = train_dataset.shuffle(buffer_size=1024)`
    '''
    train_dataset = train_dataset.batch(batch_size)
    
    # Preparing the validation dataset:
    validation_dataset = tf.data.Dataset.from_tensor_slices((validation_data, to_categorical(validation_labels, num_classes=n_classes)))
    validation_dataset = validation_dataset.batch(batch_size)

    # Parameters:
    params = {}
    params['segments_per_file'] = segments_per_file
    params['validation_start'] = validation_start
    params['n_classes'] = n_classes
    params['batch_size'] = batch_size

    return params, data_and_labels, train_dataset, validation_dataset
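
The segmentation step in `get_segmented_data` can be checked on a small array. The sketch below is a vectorised alternative to the explicit loops, not the code used above; the toy shapes and label values are assumptions chosen for illustration. It also shows that frames which do not fill a whole segment are dropped when the frame count is not divisible by `segments_per_file`.

```python
import numpy as np

# Toy input: 2 tracks, 3 feature rows, 10 frames each (values are arbitrary).
n_tracks, n_features, n_frames = 2, 3, 10
audio_data = np.arange(n_tracks * n_features * n_frames).reshape(
    n_tracks, n_features, n_frames
)
targets = np.array([5, 7])  # one made-up integer label per track
segments_per_file = 4

segment_size = n_frames // segments_per_file  # 10 // 4 == 2

# Drop the trailing frames that do not fill a whole segment.
usable = audio_data[:, :, : segment_size * segments_per_file]

# Split the frame axis into (segment, frame-within-segment), then move the
# segment axis next to the track axis and merge the two.
segments = usable.reshape(n_tracks, n_features, segments_per_file, segment_size)
segments = segments.transpose(0, 2, 1, 3).reshape(-1, n_features, segment_size)

# Repeat each label once per segment, keeping track order.
segment_labels = np.repeat(targets, segments_per_file)

print(segments.shape)   # (tracks * segments_per_file, features, segment_size)
print(segment_labels)
```

Segment `j` of track `i` ends up at index `i * segments_per_file + j`, which matches the order in which the loops above append segments before any shuffling.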

**NOTE ON SHUFFLING DATA BEFORE DIVIDING IT**:

Shuffling the data before dividing it into training and validation datasets reduced overfitting and improved the model's accuracy (both training and validation). Hence, it seems the rows of the original dataset were arranged in some order that the model could overfit to; shuffling the rows avoids this issue.
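
The effect can be illustrated with a toy example. The sketch below assumes, purely for illustration, that the rows are stored grouped by class; whether the original dataset was ordered this way is not confirmed here. The data, labels, and seed are all made up.

```python
import numpy as np

# Made-up labels stored in class order: ten rows each of classes 0, 1 and 2.
labels = np.repeat([0, 1, 2], 10)
data = np.arange(len(labels)).reshape(-1, 1)  # stand-in feature rows

# Same rule as `validation_start=0.7`: the first 70% of rows go to training.
split = round(0.7 * len(labels))

# Without shuffling, the validation slice contains only the last class.
print(np.unique(labels[split:]))

# Shuffling with one shared permutation keeps each row paired with its label.
rng = np.random.default_rng(seed=0)  # seed chosen arbitrarily
order = rng.permutation(len(labels))
shuffled_data, shuffled_labels = data[order], labels[order]
print(np.unique(shuffled_labels[split:]))
```

With ordered rows, the model would be validated on a class distribution very different from the one it was trained on; a shared permutation before splitting mixes the classes across both parts while keeping data and labels aligned.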