# Few Segment Languages
This notebook aims to work out why some of the languages had so few segments created.

## 4 Sec Data
Let's start with the 4 sec data.

In [25]:
import pandas as pd
import numpy as np
import os
import sys
import pickle as pkl
from pathlib import Path
module_path = os.path.abspath(os.path.join('../vad_utils'))
if module_path not in sys.path:
    sys.path.append(module_path)
from vad_utils import SAMPLING_RATE, FRAME_SIZE_MS, SAMPLES_PER_FRAME
import vad_utils as vu
from pydub import AudioSegment



In [3]:
SEGMENTS_DIR = '/media/originals/fsegs/'
DATASETS_DIR = '/media/originals/datasets/'
SEC_4_DATA_DIR = 'fseg_4sec/data/'


Read in the relevant input data

In [4]:
fsegs_4sec = pd.read_csv("../../data/fseg_4_df.csv")

## Extract the Languages

In [6]:
langs = fsegs_4sec['iso'].value_counts()

Examining these counts, we see that smy, srz, kmz, rdb, and tov each have only one segment.

In [7]:
# what languages do we have
lang_ids = sorted(list(langs.index))

In [8]:
fd = pd.read_csv('../../data/usable_files.csv')

# Investigation

We want to answer the following questions:

1. Do these languages have many candidate files?
2. Are there any languages with zero segments?
3. If a different sensitivity were used would the result have been different?

Let's answer the second question first.

In [9]:
fd['no_segs'] = ~fd['iso'].isin(lang_ids)
print(f'Files of languages without any segments: {sum(fd.no_segs)}')

Files of languages without any segments: 802


In [10]:
fd_no_segs = fd[fd.no_segs].copy()
iso_no_segs = set(fd_no_segs.iso)
print(f'ISO languages which had no segments generated: {len(iso_no_segs)}')

ISO languages which had no segments generated: 30


Now find the languages with a low number of segments (10 or fewer).

In [17]:
langs_df = langs.to_frame()
low_segs_langs = langs_df[langs_df.iso <= 10]
low_seg_iso = set(low_segs_langs.index)
print(low_seg_iso)

{'kmz', 'pau', 'jge', 'inj', 'rdb', 'bsy', 'rmq', 'chb', 'tov', 'arx', 'srz', 'ulk', 'def', 'eip', 'dmw', 'wrk', 'smy', 'mep', 'buj', 'dso', 'sos'}
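
The filter above depends on the counts column being called `iso`, which varies with the pandas version (to my knowledge, pandas 2.0 and later name it `count`). A minimal sketch that sidesteps the column name by masking the `value_counts()` Series directly. The ISO codes and counts here are invented, not taken from the real data:

```python
import pandas as pd

# Toy stand-in for fsegs_4sec; the codes and counts are made up.
toy_fsegs = pd.DataFrame({'iso': ['aaa'] * 12 + ['bbb'] * 3 + ['ccc']})

toy_langs = toy_fsegs['iso'].value_counts()

# Boolean-mask the Series itself, so no column name is needed.
toy_low_seg_iso = set(toy_langs[toy_langs <= 10].index)
print(sorted(toy_low_seg_iso))
```
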


Now answer question 1: How many files does each of these languages have?

In [20]:
low_seg_iso = low_seg_iso | iso_no_segs
fd_low_segs = fd[fd.iso.isin(low_seg_iso)].copy()  # copy, so columns can be added later without a SettingWithCopyWarning
fd_segs = fd_low_segs.iso.value_counts()


So of the 51 languages in question, 32 had at least 10 files. Let's see whether we could have extracted more segments from them.
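
The tally quoted above is not computed in a cell. A sketch of how such a tally could be derived from a per-language file count like `fd_segs`, using an invented file listing rather than `usable_files.csv`:

```python
import pandas as pd

# Invented file listing: one row per file, tagged with its language.
toy_fd = pd.DataFrame({'iso': ['aaa'] * 11 + ['bbb'] * 10 + ['ccc'] * 2})

toy_fd_segs = toy_fd['iso'].value_counts()        # files per language
n_langs = len(toy_fd_segs)
n_with_10_files = int((toy_fd_segs >= 10).sum())  # languages with >= 10 files
print(f'{n_with_10_files} of {n_langs} languages have at least 10 files')
```
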

# Recreate the VAD Segments

In [21]:

def condition_audio_segment(audio_seg):
    if audio_seg.channels != 1:
        audio_seg = audio_seg.set_channels(1)

    if audio_seg.sample_width != 2:
        audio_seg = audio_seg.set_sample_width(2)

    if audio_seg.frame_rate != SAMPLING_RATE:
        audio_seg = audio_seg.set_frame_rate(SAMPLING_RATE)

    return audio_seg
        

def extract_audio_segment_for_file(fd, sensitivity=2):
    fmt = 'wav'
    if fd.filename[-4:].lower() == '.mp3' :
        fmt = 'mp3'
    if not fd.path.endswith('/'):
        path = fd.path + '/'
    else:
        path = fd.path
    audio_seg = AudioSegment.from_file('/media/programs/' + path + fd.filename, format=fmt)

    # now condition the segment and extract the raw segments.
    audio_seg = condition_audio_segment(audio_seg)
    segs = vu.audio_to_raw_voice_segments(audio_seg, sensitivity)

    return audio_seg, segs


## Find the relevant files.

In [26]:
from collections import namedtuple

# Collect the 4-second segments for every low-segment file at each VAD sensitivity.
segs = list()
Desc = namedtuple('Desc', ['sens', 'start', 'stop', 'path', 'file'])
for sens in range(4):
    # use `row` rather than `fd`, so the `fd` DataFrame is not shadowed
    for row in fd_low_segs.itertuples():
        audio_seg, voice_segs = extract_audio_segment_for_file(row, sens)
        segs_for_time = vu.divide_into_segments(voice_segs, 4.0)
        # report every file that yields nothing; the previous for-else only checked the last file
        if len(segs_for_time) == 0:
            print(f'{sens}: No segments {row.path} {row.filename}')
        for seg in segs_for_time:
            start = vu.convert_frames_to_ms(seg.start)
            stop = vu.convert_frames_to_ms(seg.stop)
            segs.append(Desc(sens, start, stop, row.path, row.filename))



In [27]:
segs_df = pd.DataFrame(segs)

In [29]:
segs_df_sens = segs_df.sens.value_counts()
print(segs_df_sens)

3    2968
2    1353
1     371
0     327
Name: sens, dtype: int64


So a sensitivity of 3 yields far more segments: 2968, against 371 at sensitivity 1. Why did I choose 1? Because it did best at rejecting instrumentals, although 3 was not much worse.

Questions: 
* How many segments per file did we find for each sensitivity?
* What happens when we reject the first segment in each file?
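
For the second question, one possible approach is to number the segments within each (sensitivity, file) group and drop position 0. This is only a sketch on invented rows that reuse the `Desc` column names. It assumes that "first" means the earliest `start` within a file:

```python
import pandas as pd

# Invented segment table with the same columns as Desc.
toy_segs = pd.DataFrame({
    'sens':  [1, 1, 1, 1, 3, 3],
    'start': [0, 4000, 8000, 0, 0, 4000],
    'stop':  [4000, 8000, 12000, 4000, 4000, 8000],
    'path':  ['p/', 'p/', 'p/', 'q/', 'p/', 'p/'],
    'file':  ['a.wav', 'a.wav', 'a.wav', 'b.wav', 'a.wav', 'a.wav'],
})

# Order segments in time, number them within each (sens, path, file) group,
# then keep everything except the first one in each group.
ordered = toy_segs.sort_values(['sens', 'path', 'file', 'start'])
position = ordered.groupby(['sens', 'path', 'file']).cumcount()
without_first = ordered[position > 0]

print(without_first.sens.value_counts().sort_index())
```

Files that produced a single segment disappear entirely under this rule, which matters for languages that already have very few segments.
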

In [36]:
def seg_count(fd, sens):
    """Count the segments found for one file row at the given sensitivity."""
    # segs_df is only read here, so no `global` declaration is needed
    segs_for_file = segs_df[(segs_df.path == fd.path) & (segs_df.file == fd.filename)]
    return sum(segs_for_file.sens == sens)

# one count column per sensitivity: segs_0 .. segs_3
for sens in range(4):
    fd_low_segs[f'segs_{sens}'] = fd_low_segs.apply(seg_count, axis=1, args=(sens,))
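
Calling `seg_count` through `apply` rescans `segs_df` once per file and per sensitivity. A possible alternative, sketched here on invented data, counts everything in one `groupby` and merges the result back onto the file listing. The toy frames and their contents are assumptions for illustration only:

```python
import pandas as pd

# Invented segment table and file listing.
toy_segs = pd.DataFrame({
    'sens': [0, 1, 1, 3, 3, 3],
    'path': ['p/', 'p/', 'p/', 'p/', 'p/', 'q/'],
    'file': ['a.wav', 'a.wav', 'a.wav', 'a.wav', 'a.wav', 'b.wav'],
})
toy_fd = pd.DataFrame({'path': ['p/', 'q/', 'r/'],
                       'filename': ['a.wav', 'b.wav', 'c.wav']})

# Count segments per (file, sensitivity), then pivot sensitivities into columns.
counts = (toy_segs.groupby(['path', 'file', 'sens']).size()
          .unstack('sens', fill_value=0)
          .reindex(columns=range(4), fill_value=0)  # keep sensitivities with no hits
          .add_prefix('segs_')
          .reset_index()
          .rename(columns={'file': 'filename'}))

# A left merge keeps files with no segments at all; they get zeros.
seg_cols = [f'segs_{s}' for s in range(4)]
merged = toy_fd.merge(counts, how='left', on=['path', 'filename'])
merged[seg_cols] = merged[seg_cols].fillna(0).astype(int)
print(merged)
```
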


In [None]:
fd_low_segs.to_csv('../../data/files_with_few_segs.csv')