# File Investigation
The purpose of this file is to investigate using the file descriptions to build a database.

The following assumptions are to be tested:
1. Each file has only one language.
2. Our VAD can correctly discern music from voice.
3. For single-item files, we can reliably discard instrumentals and announcements.

In [1]:
import pandas as pd
import os
import sys
import spleeter
from collections import namedtuple

module_path = os.path.abspath(os.path.join('../vad_utils'))
if module_path not in sys.path:
    sys.path.append(module_path)
from vad_utils import SAMPLING_RATE, FRAME_SIZE_MS, SAMPLES_PER_FRAME
import vad_utils as vu
from pydub import AudioSegment


In [2]:
fd = pd.read_csv("../../data/records_with_voxgrn_files.csv")
print(f'The columns of records are:\n{fd.columns}')
items = pd.read_csv("/prometheus/GRN/grid_program_items.csv")
print(f'Program items shape: {items.shape}')


The columns of records are:
Index(['Unnamed: 0.1', 'Unnamed: 0', 'iso', 'language_name', 'track',
       'location', 'year', 'path', 'filename', 'length', 'program', 'ID'],
      dtype='object')
Program items shape: (267681, 21)


How many files have a duplicate ID?

In [3]:
print(f'Duplicate files {sum(fd.ID.duplicated(keep="first"))}')

Duplicate files 0


Let's start with the third assumption. Can we identify files that contain only unwanted item types?

In [3]:
items['program'] = items['Program Number'].str[1:]
items['ID'] = items['program'] + '_' + items['Track Number'].astype(int).apply('{:0>3d}'.format)
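
As a quick check of the ID format built above, here is a minimal sketch on made-up rows. The sample values are hypothetical; it assumes `Program Number` is a string with a one-letter prefix and `Track Number` is numeric, as the cell above implies.

```python
import pandas as pd

# Hypothetical sample rows; the real values come from grid_program_items.csv.
toy_items = pd.DataFrame({
    "Program Number": ["A66533", "C03111"],
    "Track Number": [1.0, 12.0],
})

# Drop the one-letter prefix, then append the zero-padded track number.
toy_items["program"] = toy_items["Program Number"].str[1:]
toy_items["ID"] = (
    toy_items["program"]
    + "_"
    + toy_items["Track Number"].astype(int).apply("{:0>3d}".format)
)

print(toy_items["ID"].tolist())  # ['66533_001', '03111_012']
```

Note that `astype(int)` raises if any `Track Number` is missing, so the real data must have no gaps in that column for the cell above to run.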

Now recreate the merged file/item table without dropping unwanted types.

In [4]:
items_with_records = pd.merge(fd, items, on="ID", how='inner', validate='1:m')


In [5]:
items_with_records['compound'] = items_with_records.duplicated(subset=['path', 'filename'], keep=False)
print(f'Compound items {sum(items_with_records.compound)}')

Compound items 56937
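
The count above is a number of items, not of files. A small sketch on hypothetical rows shows how `duplicated(keep=False)` marks every item that shares a file, and how the number of distinct compound files could be derived from it.

```python
import pandas as pd

# Hypothetical rows: two items share one file, one item has its own file.
toy = pd.DataFrame({
    "path": ["Programs/03/", "Programs/03/", "Programs/65/"],
    "filename": ["C03351.wav", "C03351.wav", "A65951-009.wav"],
    "Item Type": ["Announcement", "Song", "Instrumental"],
})

# keep=False flags every row of a duplicated (path, filename) pair,
# so all items that share a file are marked, not only the repeats.
toy["compound"] = toy.duplicated(subset=["path", "filename"], keep=False)
print(toy["compound"].tolist())  # [True, True, False]

# Number of distinct files that hold more than one item.
compound_files = toy.loc[toy["compound"], ["path", "filename"]].drop_duplicates()
print(len(compound_files))  # 1
```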


Now mark the items based on their usability.

In [6]:
def usable_types(item_row):
    unusable_items = ['Instrumental', 'Sound Effect', 'Announcement', 'Bridge']
    return item_row['Item Type'] not in unusable_items

items_with_records['usable'] = items_with_records.apply(usable_types, axis=1)

print(f'Of the {sum(~items_with_records.usable)} unusable items, {sum(~items_with_records.usable & ~items_with_records.compound)} are in single files.')


Of the 4405 unusable items, 1221 are in single files.
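
The row-wise `apply` above can also be written as a vectorized `isin` test. The following is a sketch on hypothetical rows, not a change to the result; the name `UNUSABLE_TYPES` is introduced here for illustration.

```python
import pandas as pd

UNUSABLE_TYPES = ["Instrumental", "Sound Effect", "Announcement", "Bridge"]

# Hypothetical item types.
toy = pd.DataFrame({"Item Type": ["Song", "Instrumental", "Announcement"]})

# Row-wise version, equivalent to usable_types() in the cell above.
row_wise = toy.apply(lambda row: row["Item Type"] not in UNUSABLE_TYPES, axis=1)

# Vectorized version: isin avoids one Python-level call per row.
vectorized = ~toy["Item Type"].isin(UNUSABLE_TYPES)

print(vectorized.tolist())  # [True, False, False]
```

On a table of roughly 270,000 items the vectorized form should be noticeably faster, although the two give the same flags.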


In [7]:
single_file_unusable = items_with_records[~items_with_records.usable & ~items_with_records.compound]
multi_file_unusable = items_with_records[~items_with_records.usable & items_with_records.compound]

Sanity check the single file items found.

In [9]:
# print the first five single-file unusable items
for row in single_file_unusable.head(5).itertuples():
    print(f'{row.path} {row.filename}')

Programs/03/03111/A03111/From_CM/ C03111B Region 00_10_37_802 to 00_10_53_562 (04 _ 05).wav
Programs/03/03110/A03110/From_CM/ C03110A Region 00_13_28_106 to 00_13_42_805 (06 _ 07).wav
Programs/03/03410/A03410/From_CM/ C03410A-02.wav
Programs/66/66533/A66533/PM-1407/ A66533-001_Dubli_Introduction.wav
Programs/65/65951/A65951/PM-1810/ A65951-009.wav


Are announcements always in the language of the file?

In [12]:
announcements = single_file_unusable[single_file_unusable['Item Type'].str.contains('Announcement')]

In [13]:
# print the first five single-file announcements
for row in announcements.head(5).itertuples():
    print(f'{row.path} {row.filename}')

Programs/66/66533/A66533/PM-1407/ A66533-001_Dubli_Introduction.wav
Programs/65/65540/A65540/PM-1706/ A65540-001.wav
Programs/65/65539/A65539/PM-1706/ A65539-001.wav
Programs/65/65538/A65538/PM-1706/ A65538-001.wav
Programs/65/65537/A65537/PM-1706/ A65537-001.wav


No, sometimes they are in English.

Let's look at files that contain mixed items.

In [13]:
# print the first five unusable items that live in compound files
for row in multi_file_unusable.head(5).itertuples():
    print(f'{row.path} {row.filename}')

Programs/03/03111/A03111/From_CM/ C03111B Region 00_14_24_256 to 00_18_27_327 (06 _ 07).wav
Programs/03/03111/A03111/From_CM/ C03111B Region 00_18_27_328 to 00_22_19_986 (07 _ 08).wav
Programs/03/03111/A03111/From_CM/ C03111 Region 00_11_05_088 to 00_14_58_047 (04 _ 05).wav
Programs/03/03351/C03351/PM/ 03351.wav
Programs/03/03381/C03381/Copy_From_MP3_CM/ C03381A.wav


So compound files can contain instrumentals, and we need to be able to reject them.

What about announcements? Are they ever embedded in files with useful data?

In [10]:
multi_announcements = multi_file_unusable[multi_file_unusable['Item Type'].str.contains('Announcement')].copy()


Are all announcements embedded in other files in the language of the file?

In [15]:
# print the first five announcements embedded in compound files
for row in multi_announcements.head(5).itertuples():
    print(f'{row.path} {row.filename}')

Programs/03/03351/C03351/PM/ 03351.wav
Programs/62/62537/A62537/PM/ A62537-01.wav
Programs/62/62537/A62537/PM/ A62537-11.wav
Programs/62/62537/A62537/PM/ A62537-24.wav
Programs/66/66224/A66224/PM-1904/ A66224-001.wav


So there often seems to be a short announcement at the start of the file. Should I mark these? Marking them could be useful, since it would give the time span to discard.
Are all of these announcements at the start of the files?

In [11]:
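def announcement_is_at_start(item):
    # Collect all items in the same file. Note: this matches on filename only,
    # so identically named files in different paths are pooled together.
    multi_items = items_with_records[items_with_records['filename'] == item['filename']].copy()
    # Treat a missing start time as the very start of the file.
    multi_items['Item Start Time'] = multi_items['Item Start Time'].fillna('0:00')
    # Caveat: if start times are stored as strings, this sort is lexicographic.
    multi_items = multi_items.sort_values(by='Item Start Time')
    return multi_items.iloc[0]['Item Type'] == 'Announcement'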

multi_announcements['Announce at start'] = multi_announcements.apply(announcement_is_at_start, axis=1)

print(f'Announcement at the start of {sum(multi_announcements["Announce at start"])} of {len(multi_announcements)} files')

Announcement at the start of 45 of 363 files
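
The check above sorts start times as text and matches files by name only. The sketch below is one possible stricter variant, under two stated assumptions: that `Item Start Time` holds `M:SS` or `H:MM:SS` strings with whole seconds, and that a file is identified by `path` plus `filename`. The helper `start_time_to_seconds` and the sample rows are hypothetical.

```python
import pandas as pd


def start_time_to_seconds(text):
    """Convert 'M:SS' or 'H:MM:SS' to seconds; missing values count as 0."""
    if pd.isna(text):
        return 0
    seconds = 0
    for part in str(text).split(":"):
        seconds = seconds * 60 + int(part)
    return seconds


# Hypothetical items: the same filename appears under two different paths.
toy = pd.DataFrame({
    "path": ["p/", "p/", "p/", "q/", "q/"],
    "filename": ["a.wav", "a.wav", "a.wav", "a.wav", "a.wav"],
    "Item Type": ["Announcement", "Song", "Song", "Song", "Announcement"],
    "Item Start Time": [None, "2:05", "10:30", "0:00", "9:15"],
})

toy["start_s"] = toy["Item Start Time"].apply(start_time_to_seconds)

# Sorting the raw strings would put '10:30' before '2:05'; sorting on
# seconds does not. Keep the earliest item of each (path, filename) pair.
first_items = (
    toy.sort_values("start_s", kind="stable")
       .drop_duplicates(subset=["path", "filename"], keep="first")
       .set_index(["path", "filename"])
)
starts_with_announcement = first_items["Item Type"] == "Announcement"
print(starts_with_announcement)
```

Because this variant keys on the file rather than on the announcement row, it yields one flag per file, which may differ from the per-item count printed above.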


## Save to File
To consolidate this, let's create a new file descriptor file without the single-file announcements and instrumentals.

To do this, I want to filter out the file descriptors that are associated with single items and are unusable.

In [12]:
fd['unusable'] = fd['ID'].isin(single_file_unusable.ID)
print(f'{sum(fd.unusable)} files out of {len(fd)} are unusable.')

1221 files out of 210488 are unusable.


In [23]:
print(fd.columns)

Index(['Unnamed: 0.1', 'Unnamed: 0', 'iso', 'language_name', 'track',
       'location', 'year', 'path', 'filename', 'length', 'program', 'ID',
       'unusable'],
      dtype='object')


In [13]:
usable_fd = fd[~fd.unusable].copy()
usable_fd.drop(columns=['Unnamed: 0.1', 'Unnamed: 0', 'unusable'], inplace=True)


In [25]:
usable_fd.to_csv('../../data/usable_files.csv')
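
The `Unnamed: 0` and `Unnamed: 0.1` columns dropped above are most likely saved indexes from earlier `to_csv` round trips. The sketch below, on hypothetical rows and an in-memory buffer, shows how the default `to_csv` call produces such a column on re-read and how `index=False` avoids it.

```python
import io

import pandas as pd

# Hypothetical file descriptors.
toy = pd.DataFrame({"ID": ["66533_001", "03111_012"], "iso": ["aaa", "bbb"]})

# Default: the index is written as an unnamed first column.
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()
print(cols_with_index)  # ['Unnamed: 0', 'ID', 'iso']

# index=False writes only the data columns.
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()
print(cols_without_index)  # ['ID', 'iso']
```

Whether to pass `index=False` when writing `usable_files.csv` depends on whether downstream notebooks expect the extra column.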

## VAD Testing

* Test how well our VAD does at rejecting instrumentals.
* Test the effect of altering its sensitivity.
* See what the VAD does with singing and background instrumentals.

In [8]:
# find instrumental files
single_file_items = items_with_records[~items_with_records.compound]
instrumental_items = items_with_records[items_with_records['Item Type'] == 'Instrumental']
fd['single'] = fd.ID.isin(single_file_items.ID)
fd['instrumental'] = fd.ID.isin(instrumental_items.ID)
single_instrumental_files = fd[fd.single & fd.instrumental]

print(f'Found {sum(fd.single)} single item files. Found {sum(fd.instrumental)} instrumental files. Found {len(single_instrumental_files)} single instrumental files. Total files {len(fd)}.')

Found 196150 single item files. Found 2692 instrumental files. Found 1057 single instrumental files. Total files 210488.


In [13]:

def condition_audio_segment(audio_seg):
    if audio_seg.channels != 1:
        audio_seg = audio_seg.set_channels(1)

    if audio_seg.sample_width != 2:
        audio_seg = audio_seg.set_sample_width(2)

    if audio_seg.frame_rate != SAMPLING_RATE:
        audio_seg = audio_seg.set_frame_rate(SAMPLING_RATE)

    return audio_seg
        

def extract_audio_segment_for_file(fd, sensitivity=2):
    fmt = 'wav'
    if fd.filename[-4:].lower() == '.mp3' :
        fmt = 'mp3'
    if not fd.path.endswith('/'):
        path = fd.path + '/'
    else:
        path = fd.path
    audio_seg = AudioSegment.from_file('/media/programs/' + path + fd.filename, format=fmt)

    # now condition the segment and extract the raw segments.
    audio_seg = condition_audio_segment(audio_seg)
    segs = vu.audio_to_raw_voice_segments(audio_seg, sensitivity)

    return audio_seg, segs


In [14]:
# find some segments
segs = list()
Desc = namedtuple('Desc', ['sens', 'start', 'stop', 'path', 'file'])
for sens in range(4):
    # file_desc avoids rebinding the global `fd` DataFrame used in earlier cells
    for file_desc in single_instrumental_files.itertuples():
        audio_seg, voice_segs = extract_audio_segment_for_file(file_desc, sens)
        segs_for_time = vu.divide_into_segments(voice_segs, 4.0)
        # report every file for which the VAD found no voice segments
        if len(segs_for_time) == 0:
            print(f'{sens}: No segments {file_desc.path} {file_desc.filename}')
        for seg in segs_for_time:
            start = vu.convert_frames_to_ms(seg.start)
            stop = vu.convert_frames_to_ms(seg.stop)
            segs.append(Desc(sens, start, stop, file_desc.path, file_desc.filename))
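
Since every file in this loop is labelled as instrumental, any segment the VAD reports as voice is presumably a false positive. One way to compare sensitivities is to total the reported segments per setting. The sketch below uses hypothetical `Desc` records with the same fields as above, with times in milliseconds.

```python
from collections import namedtuple

import pandas as pd

Desc = namedtuple("Desc", ["sens", "start", "stop", "path", "file"])

# Hypothetical results; the real list is built by the loop above.
toy_segs = [
    Desc(0, 300, 4290, "p/", "a.wav"),
    Desc(0, 5000, 9000, "p/", "a.wav"),
    Desc(1, 300, 4290, "p/", "a.wav"),
    Desc(3, 300, 2300, "q/", "b.wav"),
]

seg_df = pd.DataFrame(toy_segs)
seg_df["duration_ms"] = seg_df["stop"] - seg_df["start"]
seg_df["full_path"] = seg_df["path"] + seg_df["file"]

# Per sensitivity: segment count, total flagged time, and files affected.
summary = seg_df.groupby("sens").agg(
    segments=("duration_ms", "size"),
    voiced_ms=("duration_ms", "sum"),
    files=("full_path", "nunique"),
)
# Include sensitivities that produced no segments at all.
summary = summary.reindex(range(4), fill_value=0)
print(summary)
```

A setting whose row stays at zero would have rejected every instrumental in the sample, which is the behaviour this test is looking for.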



In [24]:
from spleeter.separator import Separator
# test out using spleeter
# Load the spleeter configuration for vocal separation
separator = Separator('spleeter:2stems')

# Load the input audio file
first_file = single_instrumental_files.iloc[0]
# os.path.join copes with paths that lack a trailing '/'
input_file = os.path.join('/media/programs', first_file.path, first_file.filename)

# Use spleeter to separate the vocals from the input file
separator.separate_to_file(input_file, '../../data/spleeter')


INFO:tensorflow:Using config: {'_model_dir': 'pretrained_models/2stems', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': gpu_options {
  per_process_gpu_memory_fraction: 0.7
}
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_checkpoint_save_graph_def': True, '_service': None, '_cluster_spec': ClusterSpec({}), '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Apply unet for vocals_spectrogram
INFO:tensorflow:Apply unet for accompaniment_spectrogram
INFO:tensorflow:Don

In [33]:
# now see how vad handles the separated file
fake_fd = namedtuple('fake_fd', ['path', 'filename'])
dummy_fd = fake_fd('../../data/spleeter/C03111B Region 00_10_37_802 to 00_10_53_562 (04 _ 05)/', 'vocals.wav')
audio_seg, voice_segs = extract_audio_segment_for_file(dummy_fd)
segs_for_time = vu.divide_into_segments(voice_segs, 4.0)
for i, seg in enumerate(segs_for_time):
    start = vu.convert_frames_to_ms(seg.start)
    stop = vu.convert_frames_to_ms(seg.stop)
    print(f'{start} {stop}')



300 4290


Save the list of analysed files.

In [None]:
instrumental_files = pd.DataFrame(segs)
instrumental_files.to_csv('../../data/instrumenal_file_segs.csv')

These segments will be analysed in Instrumental.ipynb.