# Step 2, alternative a: Generate UMAP representations from spectrograms - Basic pipeline

## Introduction

This script creates UMAP representations from spectrograms using the basic pipeline.
The following minimal structure and files are required in the project directory:

    ├── data
    │   └── df.pkl         <- pickled pandas dataframe with metadata and spectrograms (generated in
    │                         01_generate_spectrograms.ipynb)
    └── spec_params.py     <- Python file containing the spectrogram parameters used (generated in
                              01_generate_spectrograms.ipynb)

The pickled dataframe df.pkl must contain the following column (its position among the other columns does not matter):

    | spectrograms    |    ....
    ------------------------------------------
    |  2D np.array    |    ....
    |  ...            |    ....
    |  ...            |    .... 

## Import statements, constants and functions

In [1]:
import pandas as pd
import numpy as np
import pickle
import os
from pathlib import Path
import umap

from preprocessing_functions import calc_zscore, pad_spectro
from custom_dist_functions_umap import unpack_specs

In [2]:
P_DIR = os.getcwd()  # project directory; default: your current working directory
DATA = os.path.join(P_DIR, 'data')

Specify the UMAP parameters. If desired, other inputs can be used for UMAP, such as denoised spectrograms, bandpass-filtered spectrograms or others (MFCCs, spectrograms on a different frequency scale, ...), by changing the INPUT_COL parameter.

In [3]:
INPUT_COL = 'spectrograms'  # column that is used as input for UMAP;
                            # could also be e.g. 'denoised_spectrograms' or 'stretched_spectrograms'

METRIC_TYPE = 'euclidean'   # distance metric used in UMAP; see the UMAP documentation for other options,
                            # e.g. 'correlation', 'cosine', 'manhattan'

N_COMP = 3                  # number of dimensions desired in latent space
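
To give a feel for what the METRIC_TYPE choice changes, the following minimal sketch computes four common distances between two toy vectors with plain NumPy. It uses the standard textbook definitions and does not reproduce UMAP's own implementations; the vectors `a` and `b` are made up for illustration.

```python
import numpy as np

# Two toy "flattened spectrograms" (hypothetical values); b is a scaled by 2
a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])

# Euclidean and Manhattan distances are sensitive to overall amplitude
euclidean = np.linalg.norm(a - b)
manhattan = np.abs(a - b).sum()

# Cosine and correlation distances only look at the shape of the vectors
cosine = 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
correlation = 1.0 - np.corrcoef(a, b)[0, 1]

print("euclidean:", euclidean)
print("manhattan:", manhattan)
print("cosine:", cosine)
print("correlation:", correlation)
```

Because `b` is just a scaled copy of `a`, the cosine and correlation distances are (numerically) zero, while the Euclidean and Manhattan distances are not. Which behaviour is preferable depends on whether amplitude differences between calls should count as dissimilarity.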

## Load data

In [4]:
df = pd.read_pickle(os.path.join(DATA, 'df.pkl'))
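
Before running the pipeline, it can help to verify that the loaded dataframe meets the requirement stated in the introduction: the input column exists and holds 2D NumPy arrays. The sketch below shows one possible check; `check_input_column` and `toy_df` are hypothetical names introduced here for illustration and are not part of the project code.

```python
import numpy as np
import pandas as pd

def check_input_column(frame, column):
    """Raise an error if `column` is missing or contains anything but 2D arrays."""
    if column not in frame.columns:
        raise KeyError(f"Column '{column}' not found in dataframe")
    # Collect the positions of all entries that are not 2D NumPy arrays
    bad_rows = [i for i, spec in enumerate(frame[column])
                if not (isinstance(spec, np.ndarray) and spec.ndim == 2)]
    if bad_rows:
        raise ValueError(f"Entries at positions {bad_rows} are not 2D np.arrays")
    return True

# Toy dataframe standing in for df.pkl: same number of frequency bins,
# different number of time frames
toy_df = pd.DataFrame({"spectrograms": [np.zeros((4, 2)), np.ones((4, 3))]})
column_ok = check_input_column(toy_df, "spectrograms")
```

In the notebook itself, the equivalent call would be `check_input_column(df, INPUT_COL)`.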

## UMAP

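In this step, each spectrogram is z-transformed, zero-padded to the length of the longest spectrogram in the dataset and flattened into a one-dimensional numeric vector. The UMAP reducer is then configured with the parameters specified above.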

In [5]:
# Basic pipeline
# No time-shift allowed; spectrograms should be aligned at the start. All spectrograms are zero-padded
# to equal length.

specs = df[INPUT_COL]                    # choose spectrogram column
specs = [calc_zscore(s) for s in specs]  # z-transform each spectrogram

maxlen = np.max([spec.shape[1] for spec in specs])  # number of time frames of the longest spectrogram
# pad all specs to maxlen, then flatten each one row by row into a 1D vector
flattened_specs = [pad_spectro(spec, maxlen).flatten() for spec in specs]
data = np.asarray(flattened_specs)  # final input for UMAP: one row per spectrogram

# specify parameters of the UMAP reducer
reducer = umap.UMAP(n_components=N_COMP, metric=METRIC_TYPE,
                    min_dist=0, random_state=2204)
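
The helpers `calc_zscore` and `pad_spectro` are imported from `preprocessing_functions`, whose source is not shown here. The sketch below illustrates what such a z-transform, pad and flatten sequence could look like on toy data. The functions `zscore_sketch` and `pad_sketch` are hypothetical stand-ins, and details such as normalising over the whole spectrogram or padding at the end are assumptions that may differ from the actual helpers.

```python
import numpy as np

def zscore_sketch(spec):
    """Assumed behaviour: zero mean and unit variance over the whole spectrogram."""
    spec = np.asarray(spec, dtype=float)
    std = spec.std()
    if std == 0:
        return spec - spec.mean()  # avoid division by zero for constant input
    return (spec - spec.mean()) / std

def pad_sketch(spec, maxlen):
    """Assumed behaviour: append zero columns until there are maxlen time frames."""
    n_missing = maxlen - spec.shape[1]
    return np.pad(spec, ((0, 0), (0, n_missing)), mode="constant", constant_values=0)

# Two toy spectrograms with 4 frequency bins and 2 or 3 time frames
toy_specs = [np.arange(8, dtype=float).reshape(4, 2),
             np.arange(12, dtype=float).reshape(4, 3)]
toy_specs = [zscore_sketch(s) for s in toy_specs]

toy_maxlen = max(s.shape[1] for s in toy_specs)
toy_data = np.asarray([pad_sketch(s, toy_maxlen).flatten() for s in toy_specs])
print(toy_data.shape)  # (2, 12): one row per spectrogram, 4 bins * 3 frames
```

Every row of the resulting matrix has the same length, which is what UMAP needs as input. Note that padding with zeros after the z-transform means the padded region sits at the spectrogram's mean value.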

## Fit UMAP

In [6]:
embedding = reducer.fit_transform(data)  # embedding contains the new coordinates of the datapoints in N_COMP-dimensional space (3D here)

In [7]:
# Add UMAP coordinates to dataframe

for i in range(N_COMP):
    df['UMAP'+str(i+1)] = embedding[:,i]

## Save dataframe

In [8]:
df.to_pickle(os.path.join(DATA, 'df_umap.pkl'))