# Interactive visualization of UMAP representations: Part 1 (Prep)

This script creates a spectrogram image for each call and saves all images in a pickled dictionary in the data subfolder (image_data.pkl). These images are displayed later in the interactive visualization tool; generating them beforehand makes the tool faster, because images are looked up in the dictionary instead of being created on the fly.

The default dictionary key is the filename without its file extension (e.g. without .wav). If the dataframe contains a column 'callID', its values are used as keys instead.
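
As a minimal sketch of how such a default key could be derived, the hypothetical helper `default_call_id` below (not part of the notebook) strips only the final extension with `os.path.splitext`, so dots elsewhere in the filename are preserved.

```python
import os


def default_call_id(filename):
    """Return the filename without its final extension (hypothetical helper)."""
    stem, _extension = os.path.splitext(filename)
    return stem


print(default_call_id("call_1.wav"))
print(default_call_id("rec.2021.call_2.wav"))
```

Filenames with several dots are the case to watch: a key built by cutting at the first dot would collapse `rec.2021.call_2.wav` and `rec.2022.call_2.wav` into the same ID.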

#### The following minimal structure and files are required in the project directory:

    ├── data
    │   ├── df_umap.pkl   <- pickled pandas dataframe with metadata, raw_audio, spectrograms and UMAP coordinates
    │                        (generated in 02a_generate_UMAP_basic.ipynb or 02b_generate_UMAP_timeshift.ipynb)
    ├── parameters
    │   ├── spec_params.py  <- python file containing the spectrogram parameters used
    │                          (generated in 01_generate_spectrograms.ipynb)
  

#### The following columns must exist in the pickled dataframe df_umap.pkl:
(callID is optional)

    | filename   | spectrograms    |  samplerate_hz |    [optional: callID]
    --------------------------------------------------------------------
    | call_1.wav |  2D np.array    |      8000      |    [call_1]
    | call_2.wav |  ...            |      48000     |    [call_2] 
    | ...        |  ...            |      ....      |    ....  
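
A quick check for the required columns can catch a mismatched dataframe before any images are rendered. The sketch below is an illustration only: `REQUIRED_COLUMNS` and `missing_columns` are hypothetical names, and the function takes any iterable of column names so that it can be tried without a dataframe.

```python
# Column names taken from the table above; callID is optional and therefore not listed.
REQUIRED_COLUMNS = ("filename", "spectrograms", "samplerate_hz")


def missing_columns(columns, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent, in their listed order."""
    present = set(columns)
    return [name for name in required if name not in present]


# With a pandas dataframe this would be called as missing_columns(df.columns).
example_columns = ["filename", "spectrograms", "label"]
print(missing_columns(example_columns))
```

An empty list means all required columns are present.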

#### The following files are generated in this script:

    ├── data
    │   ├── df_umap.pkl <- overwritten with an updated version that includes the ID column
    │   ├── image_data.pkl <- pickled dictionary with spectrogram images as values, ID column as keys

## Import statements, constants and functions

In [3]:
import pandas as pd
import numpy as np
import pickle
import matplotlib.pyplot as plt
import os
from pathlib import Path
import soundfile as sf
import io
import librosa
import librosa.display
import umap

import sys 
sys.path.insert(0, '..')

In [4]:
P_DIR = str(Path(os.getcwd()).parents[0])  
DATA = os.path.join(os.path.sep, P_DIR, 'data') 
DF_NAME = 'df_umap.pkl'

SPEC_COL = 'spectrograms' # column name that contains the spectrograms
ID_COL = 'callID' # column name that contains call identifier (must be unique)

# Spectrogramming parameters (needed for generating the images)

from parameters.spec_params import FFT_WIN, FFT_HOP, FMIN, FMAX

OVERWRITE = False  # If there already exists an image_data.pkl, should it be overwritten? Default no

__Make sure the spectrogramming parameters are correct!__ They are used to set the correct time and frequency axis labels for the spectrogram images. 

For example, if you are using bandpass-filtered spectrograms, you may have to set FMIN to LOWCUT and FMAX to HIGHCUT, and adapt N_MELS.
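
The time axis depends on the hop length in samples, which this notebook obtains by multiplying FFT_HOP by the sample rate of each call. The sketch below assumes FFT_HOP is a duration in seconds. `EXAMPLE_HOP_S` is a made-up value rather than the one in spec_params.py, and the hypothetical helper uses `round` instead of `int` so that floating-point error cannot truncate the result by one sample.

```python
def hop_length_in_samples(hop_seconds, samplerate_hz):
    """Convert a hop duration in seconds to a whole number of samples (at least 1)."""
    return max(1, round(hop_seconds * samplerate_hz))


EXAMPLE_HOP_S = 0.005  # hypothetical hop duration, not read from spec_params.py

# The same hop duration corresponds to different sample counts at different rates.
for samplerate_hz in (8000, 48000):
    print(samplerate_hz, hop_length_in_samples(EXAMPLE_HOP_S, samplerate_hz))
```

Because the dataframe may mix sample rates, the hop length has to be computed per call rather than once for the whole dataset.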

## 1. Read in files

In [27]:
df = pd.read_pickle(os.path.join(os.path.sep, DATA, DF_NAME))

### 1.1. Check if call identifier column is present

In [28]:
# Default callID will be the name of the wav file

if ID_COL not in df.columns:
    print('No ID-Column found (', ID_COL, ')')
    
    if 'filename' in df.columns:
        print("Default ID column ", ID_COL, "will be generated from filename.")
        # os.path.splitext removes only the final extension, so dots inside the name are kept
        df[ID_COL] = [os.path.splitext(x)[0] for x in df['filename']]
    else:
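        # Neither an ID column nor a filename column exists, so no keys can be derived
        raise KeyError("Dataframe has neither a '" + ID_COL + "' nor a 'filename' column.")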

No ID-Column found ( callID )
Default ID column  callID will be generated from filename.
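
Because the IDs become dictionary keys, duplicate IDs would silently overwrite each other's images. The following sketch shows one way to detect them; `duplicated_ids` is a hypothetical helper and the example IDs are invented.

```python
from collections import Counter


def duplicated_ids(ids):
    """Return the sorted list of IDs that occur more than once."""
    counts = Counter(ids)
    return sorted(key for key, n in counts.items() if n > 1)


# With the dataframe this would be called as duplicated_ids(df[ID_COL]).
example_ids = ["call_1", "call_2", "call_1", "call_3"]
print(duplicated_ids(example_ids))
```

An empty result means every call has a unique key.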


## 2. Generate spectrogram images

A spectrogram image is generated from each row in the dataframe. Images are saved in a dictionary whose keys are the values of the ID_COL column of the dataframe.

The dictionary is pickled and saved as image_data.pkl. The interactive visualization script later loads this file and displays the images.
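
The stored values are the raw bytes of PNG files, so the dictionary can be sanity-checked without decoding any image. The sketch below round-trips a small dictionary through pickle in memory and checks the 8-byte PNG signature. The placeholder bytes stand in for rendered spectrograms and are not decodable images, and `looks_like_png` is a hypothetical helper.

```python
import io
import pickle

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"  # first 8 bytes of every PNG file


def looks_like_png(data):
    """Check that a value is a bytes object starting with the PNG signature."""
    return isinstance(data, bytes) and data.startswith(PNG_SIGNATURE)


# Placeholder entries; in the notebook the values come from plt.savefig(buf, format='png').
image_data = {
    "call_1": PNG_SIGNATURE + b"placeholder",
    "call_2": PNG_SIGNATURE + b"placeholder",
}

# Serialize to an in-memory buffer instead of a file, then load it back.
buffer = io.BytesIO()
pickle.dump(image_data, buffer, protocol=pickle.HIGHEST_PROTOCOL)
buffer.seek(0)
restored = pickle.load(buffer)

print(all(looks_like_png(value) for value in restored.values()))
```

The same check can be run on the dictionary loaded from image_data.pkl to confirm that every entry holds image bytes.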

In [2]:
OVERWRITE=True # Do you want to overwrite any existing image_data.pkl file?

In [43]:
if not OVERWRITE and os.path.isfile(os.path.join(os.path.sep,DATA,'image_data.pkl')):
    print("File already exists. Overwrite is set to FALSE, so no new image_data will be generated.")
    
    # Double-check that image_data contains all the required calls
    with open(os.path.join(os.path.sep, DATA, 'image_data.pkl'), 'rb') as handle:
        image_data = pickle.load(handle)  
    image_keys = list(image_data.keys())
    expected_keys = list(df[ID_COL])
    missing = list(set(expected_keys)-set(image_keys))
    
    if len(missing)>0:
        print("BUT: The current image_data.pkl file doesn't seem to contain all calls that are in your dataframe!")
        print(len(missing), "calls are missing.")
        
else:
    image_data = {}
    for i in range(len(df)):
        print('\rProcessing i:',i, end='')
        dat = np.asarray(df.iloc[i][SPEC_COL]) 
        sr = df.iloc[i]['samplerate_hz']
        plt.figure()
        # FFT_HOP is multiplied by the sample rate to get the hop length in samples
        librosa.display.specshow(dat, sr=sr, hop_length=int(FFT_HOP * sr), fmin=FMIN, fmax=FMAX, y_axis='mel', x_axis='s', cmap='inferno')
        # Render the figure into an in-memory PNG and store the raw bytes
        buf = io.BytesIO()
        plt.savefig(buf, format='png')
        byte_im = buf.getvalue()
        image_data[df.iloc[i][ID_COL]] = byte_im
        plt.close()

    # Store data (serialize)
    with open(os.path.join(os.path.sep,DATA,'image_data.pkl'), 'wb') as handle:
        pickle.dump(image_data, handle, protocol=pickle.HIGHEST_PROTOCOL)

Processing i: 6429
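
When an existing image_data.pkl lacks some calls, one option is to render only the missing entries instead of regenerating everything. The sketch below illustrates that idea under stated assumptions and is not what the cell above does: `fill_missing_images` and `fake_render` are hypothetical, and `fake_render` merely stands in for the plotting code.

```python
def fill_missing_images(image_data, expected_ids, render):
    """Render images only for IDs not yet in image_data; return the IDs that were added."""
    added = []
    for call_id in expected_ids:
        if call_id not in image_data:
            image_data[call_id] = render(call_id)
            added.append(call_id)
    return added


def fake_render(call_id):
    """Stand-in for the real rendering step; returns placeholder bytes."""
    return ("image for " + call_id).encode("utf-8")


cache = {"call_1": b"existing"}
added = fill_missing_images(cache, ["call_1", "call_2", "call_3"], fake_render)
print(added)
```

Existing entries are left untouched, so this only helps when the spectrogramming parameters have not changed since the file was written.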

## 3. Save dataframe

Save the dataframe to make sure it contains the correct ID column for access to the image_data.

In [10]:
df.to_pickle(os.path.join(os.path.sep, DATA, DF_NAME))