# Time Investigation
Before the items can be used, I need to verify that their start and end times are right.
I will split the investigation into 3 file types:

1. wav files. I first need to verify whether the original time correction algorithm worked.
2. mp3 files. I strongly suspect that these have had dead space removed from them.
3. recovered files. I am not sure how the times will line up on these.

In [1]:
import pandas as pd
import numpy as np
import os
import sys
import pickle as pkl
from pathlib import Path
import time
import glob
import json
import requests
import math
from pydub import AudioSegment




In [2]:
items = pd.read_csv("../../data/items_with_records_all.csv")
print(f'The columns of items with records are:\n{items.columns}')

The columns of items with records are:
Index(['Unnamed: 0', 'iso', 'language_name', 'track', 'location', 'year',
       'path', 'filename', 'length', 'ID', 'item', 'title', 'start',
       'duration', 'end', 'type', 'program'],
      dtype='object')


## Item zero
Before starting I have a strong suspicion that item zero is always an announcement in English. How many zeros do we have?

In [23]:
item_zero = items[items['item'] == 0].copy()
print(f"Item zero count: {len(item_zero)}")

Item zero count: 152


This assumption appears to be untrue - item zero is OK.

## Bad filenames
Some of the recovered files had unreadable filenames. First, check whether this is true.

In [16]:
import os

def get_fname(path, fname):
    if path[-1] != '/':
        path = path + '/'
    files = glob.glob('/media/programs/' + path + fname.replace('\ufffd', '*'))
    if len(files) == 1:
        return files[0]
    return '/media/programs/' + path + fname


def check_for_file(item_row):
    return os.path.isfile(get_fname(item_row.path, item_row.filename))    
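
The wildcard lookup in `get_fname` can be tried out in isolation. The sketch below is a hypothetical variant, not the code used above: `find_recovered_file` and the sample filename are made up, and it assumes the literal parts of the name should be protected with `glob.escape` so that characters such as `[` are not read as glob syntax.

```python
import glob
import os
import tempfile


def find_recovered_file(directory, fname):
    """Return the single file matching fname, treating U+FFFD as a wildcard.

    Hypothetical helper: the literal parts are escaped so that '[', '?' and
    '*' in a directory or filename are matched as ordinary characters.
    """
    parts = fname.split('\ufffd')
    pattern = os.path.join(glob.escape(directory),
                           '*'.join(glob.escape(p) for p in parts))
    matches = glob.glob(pattern)
    return matches[0] if len(matches) == 1 else None


# Demonstration on a throwaway directory with a made-up filename.
with tempfile.TemporaryDirectory() as tmp:
    real_name = 'caf\u00e9 [1].mp3'
    open(os.path.join(tmp, real_name), 'w').close()
    found = find_recovered_file(tmp, 'caf\ufffd [1].mp3')
    found_name = os.path.basename(found) if found else None
```

Returning `None` when there is not exactly one match makes an ambiguous or missing file explicit, whereas `get_fname` falls back to the unmodified name.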



In [17]:
# sanity check on item zero
item_zero['file found'] = item_zero.apply(check_for_file, axis=1)
print(f'Path not found for {sum(item_zero["file found"] == False)}')

Path not found for 0


In [24]:
items['file found'] = items.apply(check_for_file, axis=1)
print(f'Path not found for {sum(items["file found"] == False)}')

Path not found for 0


## Duration of files
The files on the disk did not have the length field filled in. As a sanity check it might be a good idea to check that all the files have the right duration. This section uses pydub to find the length of all files.

In [28]:
def determine_audio_length(row):
    if not np.isnan(row.length):
        return row.length
    else:
        audio = AudioSegment.from_file(get_fname(row.path, row.filename))
        return audio.duration_seconds

In [20]:
# sanity check
item_zero = item_zero.copy()
item_zero['secs'] = item_zero.apply(determine_audio_length, axis=1)

This is going to take a long time to execute and the time appears to be correct anyway. We also need to factor in the unreadable filenames. Let's just do it for the files we need to.

In [29]:

items['secs'] = items.apply(determine_audio_length, axis=1)
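
Because many items share one file, one way to cut the run time might be to measure each distinct file once and map the result back to the rows. This is only a sketch under assumptions: `add_file_lengths`, `fake_measure` and the toy data are hypothetical, and unlike `determine_audio_length` it measures every file rather than only those with a missing `length`.

```python
import pandas as pd


def add_file_lengths(items, measure):
    """Measure each distinct (path, filename) once and map it back to rows.

    `measure` is any callable taking (path, filename) and returning seconds.
    In this notebook it could wrap AudioSegment.from_file(...).duration_seconds.
    """
    files = items[['path', 'filename']].drop_duplicates().copy()
    files['secs'] = [measure(p, f)
                     for p, f in zip(files['path'], files['filename'])]
    return items.merge(files, on=['path', 'filename'], how='left')


# Toy data with a stub measure function (values are made up).
calls = []


def fake_measure(path, filename):
    calls.append(filename)
    return 100.0 if filename == 'a.wav' else 50.0


toy = pd.DataFrame({'path': ['x/', 'x/', 'y/'],
                    'filename': ['a.wav', 'a.wav', 'b.mp3'],
                    'item': [1, 2, 1]})
result = add_file_lengths(toy, fake_measure)
```

In the toy data the stub is called twice for three rows, once per distinct file.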

In [31]:
items.drop(columns=['length'], inplace=True)
# rename() returns a new frame unless inplace=True, so keep the result
items = items.rename(columns={'secs': 'length'})
items.to_csv("../../data/items_all.csv")


In [13]:
items = pd.read_csv("../../data/items_all.csv")

print(items.columns)

Index(['Unnamed: 0.1', 'Unnamed: 0', 'iso', 'language_name', 'track',
       'location', 'year', 'path', 'filename', 'ID', 'item', 'title', 'start',
       'duration', 'end', 'type', 'program', 'file found', 'length'],
      dtype='object')
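
The `Unnamed: 0` columns in this output come from `to_csv` writing the index as an unnamed first column, which `read_csv` then loads as data. A small sketch on made-up data shows the effect and how `index=False` avoids it:

```python
import io

import pandas as pd

df = pd.DataFrame({'iso': ['aaa', 'bbb'], 'length': [12.5, 30.0]})

# Default behaviour: the index is written as an unnamed first column.
buf = io.StringIO()
df.to_csv(buf)
buf.seek(0)
with_index = pd.read_csv(buf)

# index=False leaves the index out, so the columns round-trip unchanged.
buf = io.StringIO()
df.to_csv(buf, index=False)
buf.seek(0)
without_index = pd.read_csv(buf)

print(list(with_index.columns))
print(list(without_index.columns))
```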


In [None]:
items.rename(columns={ 'secs' : 'length' }, inplace=True)


In [17]:
items.drop(columns=['Unnamed: 0', 'Unnamed: 0.1'], inplace=True)

In [18]:
items.to_csv("../../data/items_all.csv")


## Time reconstruction
As a starting point, let's use the original time reconstruction. My initial code used two signals to decide that a file was a compound file: a filename ending in A or B, and multiple items appearing in the one file. Using the filename is a bit weak because GRN is not consistent in its naming. Using multiple items would appear to be more robust, BUT there is a chance that filtered-out items have left some compound files with just one item. There might be some files in this category, but it is unlikely.

For the voxGRN data we know that a diamond in the name means it is definitely a compound item.

Let's see if we can identify compound items.

In [14]:
items['compound'] = items.duplicated(subset=['path', 'filename'], keep=False)
print(f'Compound records: {sum(items["compound"])}')

Compound records: 54890
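
Note that this count is of records, not files. A toy example (made-up rows) shows how `duplicated(..., keep=False)` flags every row of a shared file, and how the number of distinct compound files could be counted separately:

```python
import pandas as pd

toy = pd.DataFrame({
    'path':     ['p1/', 'p1/', 'p1/', 'p2/'],
    'filename': ['a.wav', 'a.wav', 'b.wav', 'a.wav'],
    'item':     [1, 2, 1, 1],
})

# keep=False marks all rows of a repeated (path, filename), not just the repeats.
toy['compound'] = toy.duplicated(subset=['path', 'filename'], keep=False)

compound_records = int(toy['compound'].sum())
compound_files = len(
    toy.loc[toy['compound'], ['path', 'filename']].drop_duplicates()
)
```

Here `p1/a.wav` holds two items, so two records are flagged but they belong to a single file; `p2/a.wav` shares a filename but not a path, so it stays single.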


The rules used originally included the following description:

    Now there are two types of items - compound and single. We want the same method to access each. The solution is to create a start time and an end time for each.

    The rules for creating this start and end times should be as follows:

### Single Item Files
    Item Start Time seems to not be related to the file. Is it the original location? Ignore it.
    Assume the start time is the beginning of the file.

| Scenario | End Time | Item Time | Action |
| -------- | -------- | --------- | ------ |
| 1.       | No       | No        | End = Length of file. |
| 2.       | Yes      | No        | End = End Time. |
| 3.       | No       | Yes       | End = Item Time.   |
| 4.       | Yes      | Yes       | End = Item Time.  |

    End must be checked to ensure it is less than file length.

    For compound files (including files that appear multiple times in the data):

| Scenario | Start Time | End Time | Item Time | Action |
| -------- | ---------- | -------- | --------- | ------ |
| 1.       | No         | No       | No        | Start = 0, End = Length of File |
| 2.       | No         | No       | Yes       | Start = 0, End = Item Time |
| 3.       | No         | Yes      | No        | Start = 0, End = End Time |
| 4.       | No         | Yes      | Yes       | Start = max(0, End Time - Item Time), End = End Time |
| 5.       | Yes        | No       | No        | Start = Start Time, End = Length of File |
| 6.       | Yes        | No       | Yes       | Start = Start Time, End = Start + Item Time |
| 7.       | Yes        | Yes      | No        | Start = Start Time, End = End Time |
| 8.       | Yes        | Yes      | Yes       | Start = Start Time, End = End Time |

    End must be checked to ensure it is less than the length of the file.


In [11]:
# try some time functions
import re

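def extract_seconds(_time):
    # Parse the first "m:ss" or "mm:ss" pattern found in a string into seconds.
    # Anything that is not a string (e.g. NaN) or has no match yields 0.0.
    if isinstance(_time, str):
        m = re.search(r'([0-9]{1,2}):([0-9]{2})', _time)
        if m:
            minutes = float(m.group(1))  # avoid shadowing the builtin min()
            seconds = float(m.group(2))
            return minutes * 60.0 + seconds
    return 0.0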

def calculate_item_start_position(row):
    if row['compound']:
        if row['end'] and row['duration'] and not row['start']:
            return max(0.0, extract_seconds(row['end']) - extract_seconds(row['duration']))
        return extract_seconds(row['start'])
    return 0.0
    
def calculate_item_end_position(row):
    end_time = row.length
    if row.compound:
        if row.end:
            end_time = extract_seconds(row.end)
        elif row.duration:
            end_time = extract_seconds(row.duration)
            if row.start:
                end_time += extract_seconds(row.start)

    else:
        if row.duration:
            end_time = extract_seconds(row.duration)
        elif row.end:
            end_time = extract_seconds(row.end)
    return min(row.length, end_time)
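
One thing to watch with the truthiness tests above: `read_csv` loads empty cells as NaN by default, and NaN is truthy in Python. A compound row with a missing `end` would therefore take the `if row.end:` branch, and `extract_seconds` would turn it into 0.0. Whether this matters depends on how many rows have missing values, which is not shown here. A minimal sketch of an explicit check, where `has_value` is a hypothetical helper and the row is made up:

```python
import numpy as np
import pandas as pd


def has_value(value):
    """True only for non-missing, non-empty values (hypothetical helper)."""
    return pd.notna(value) and value != ''


# NaN is truthy, so a bare `if row.end:` would take the branch.
nan_is_truthy = bool(np.nan)

row = pd.Series({'start': '3:10', 'duration': np.nan,
                 'end': np.nan, 'length': 400.0})
fields_present = {name: bool(has_value(row[name]))
                  for name in ['start', 'duration', 'end']}
```

Swapping `row.end` for `has_value(row.end)` in the two position functions would make the "No" cases in the rule tables explicit.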

I am going to apply these rules to the data and look at how they perform empirically. I cannot see any other way to do it.

In [15]:
items['_start'] = items.apply(calculate_item_start_position,axis=1)
items['_end'] = items.apply(calculate_item_end_position,axis=1)

Now let's see how well this did with the following groups:
1. wav files with multiple items.
2. mp3 files with multiple items.
3. wav files with the old a|b that were not marked as compound.
4. mp3 with a diamond and not marked compound.

In [24]:
items['wav'] = items.filename.str.upper().str.endswith('WAV')
# raw string so the regex escapes are not treated as Python string escapes
items['AB'] = items.filename.str.contains(r'[ACV]?[0-9]{5}[-=]?[ABab]\.', regex=True)
items['diamond'] = items.filename.str.contains("♦")
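
As a quick check of what the A/B pattern accepts, here it is on a few filenames. The first two names appear in this notebook's output; the third is made up to exercise the separator and a lower-case side letter.

```python
import pandas as pd

names = pd.Series([
    'C03230A.wav',   # from the notebook output: A side in the name
    '03350.wav',     # from the notebook output: no side marker
    'c12345-b.WAV',  # made-up name with a separator and lower-case side
])

is_wav = names.str.upper().str.endswith('WAV')
is_ab = names.str.contains(r'[ACV]?[0-9]{5}[-=]?[ABab]\.', regex=True)
```

The pattern needs five digits followed directly by an optional `-` or `=`, a side letter and a dot, so `03350.wav` is not flagged even though it turns out to hold both sides.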

### 1. wav files with multiple items

In [28]:
wav_multi_not_ab = items[items.wav & ~items.AB & items.compound].copy()
wav_multi_not_ab.drop(columns=['wav', 'AB', 'diamond', 'file found'], inplace=True)
wav_multi_ab = items[items.wav & items.AB & items.compound].copy()
wav_multi_ab.drop(columns=['wav', 'AB', 'diamond', 'file found'], inplace=True)


Looking at the first file, 03350.wav. It has 4 items, annotated at:
1. 0 to 186. Actually does not start for 10 seconds and ends at 200.
2. 191 to 375. Actually starts at 207 and ends with music at 370-396.
3. 380 to 582. Actually starts at 401 and ends with music from 596 to 608.
4. 587 to 790. Actually runs 611 to 825.

Now this file is a lot longer - about twice the size. What is in the rest of the file?

There is an A and B side. It appears that both A and B are in the one wav. The start and end times appear to be relative to the tape side.

How did we lose the B side items? Because we assume different tracks to be in different files we lost the B side. Humph.

In [32]:
grid_items = pd.read_csv("/prometheus/GRN/grid_program_items.csv")
print(f'Program items shape: {grid_items.shape}')


Program items shape: (267681, 21)


In [33]:
print(f'{wav_multi_not_ab.iloc[0].path} {wav_multi_not_ab.iloc[0].filename}')
print(f'{wav_multi_ab.iloc[0].path} {wav_multi_ab.iloc[0].filename}')

Programs/03/03350/C03350/PM/ 03350.wav
Programs/03/03230/C03230/Copy_From_MP3_CM/ C03230A.wav


Look at a file with A or B in its name, C03230A.wav:
1. Item 1: 0 to 405. Actually starts at 0 and stops at 405.
2. Item 2: 410 to 767. Actually starts at 410 and finishes at 767.

### 2. Multi files with diamonds

In [34]:
multi_diamond = items[items.diamond & items.compound]
single_diamond = items[items.diamond & ~items.compound]

In [36]:
print(f'{single_diamond.iloc[0].path} {single_diamond.iloc[0].filename}')
print(f'{multi_diamond.iloc[0].path} {multi_diamond.iloc[0].filename}')


vox_grn/Audio_MP3/10/10990 Huave de San Mateo del Mar Words of Life 002 NT Portions ♦ Instrumental 10990.mp3
vox_grn/Audio_MP3/03/03130 Sadri Words of Life 001 Who is He ♦ What is a Christian ♦ Instrumental - Kului ♦ Power Ov 03130.mp3


The single diamond file is one item and an instrumental. The instrumental appears to be hidden at the end of the file. It ends at 1790, as stated.
The multi diamond:
1. 0 to 235. Actually goes 0 to 240.
2. 240 to 436. Actually goes 246 to 450.
3. Was an instrumental that has been skipped. Ends at 480.
4. 469 to 660. Actually goes 481 to 675.
5. 665 to 868. Actually goes 680
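
To put numbers on the drift, the annotated and observed times for items 1, 2 and 4 of the multi diamond file can be compared directly. Item 3 has no annotated times and item 5 has no observed end, so both are left out. The column names are illustrative only.

```python
import pandas as pd

# (annotated start, annotated end, observed start, observed end) in seconds,
# copied from items 1, 2 and 4 of the multi diamond file.
checks = pd.DataFrame(
    [(0, 235, 0, 240), (240, 436, 246, 450), (469, 660, 481, 675)],
    columns=['start', 'end', 'obs_start', 'obs_end'],
    index=[1, 2, 4],
)
checks['start_error'] = checks['obs_start'] - checks['start']
checks['end_error'] = checks['obs_end'] - checks['end']
print(checks[['start_error', 'end_error']])
```

The start error grows from 0 to 6 to 12 seconds and the end error from 5 to 14 to 15 seconds, so on these three items the annotations fall further behind the audio as the file goes on.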

## Conclusion
The value in using items is the item type metadata field that they add. This would seem to be a lost cause, as the labelled information is often incorrect. This is compounded by the time stamps being erroneous.

If we wind back and only use the files, is it correct to assume:
1. Each file has only one language.
2. Our VAD can correctly discern music from voice.

Furthermore, for single item files can we discard instrumentals/announcements reliably?

These assumptions will be tested in FileInvestigation.ipynb