# Time correction
The item times can be erroneous or missing. This is further compounded by some files containing multiple items. The previous analysis (StandardiseDataAccess.ipynb) relied on knowing which items belonged to files that contained multiple items, which is not a reliable signal. Therefore this analysis does not differentiate based on compound files.

The issue appears to be that when an item is completely contained in a file, the start and end times seem to be nonsense, except that their difference is the duration. However, when there are multiple items in one file, the start and end times are used. When an item is the first in its file, the start is commonly NaN.

In [1]:
import pandas as pd
import numpy as np
import os
import sys
from pathlib import Path
import requests
import json


## Load Input Data



In [2]:
# Read in the description of the input items.

items = pd.read_csv("../../data/items_with_records_with_voxgrn_files.csv")
print(f'Program items shape: {items.shape}')


Program items shape: (248671, 18)


In [3]:
items['compound'] = items.duplicated(subset=['path', 'filename'], keep=False)
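
As a small illustration of how the `compound` flag behaves, the sketch below applies the same `duplicated(..., keep=False)` call to a toy frame. The rows and file names are made up for the example; only the `path` and `filename` column names come from the data above.

```python
import pandas as pd

# Toy data (hypothetical values): two items share one file, one item is alone.
toy = pd.DataFrame({
    "path": ["a/", "a/", "b/"],
    "filename": ["prog1.mp3", "prog1.mp3", "prog2.mp3"],
    "item": [1, 2, 1],
})

# keep=False marks every member of a duplicated group, not just the repeats.
toy["compound"] = toy.duplicated(subset=["path", "filename"], keep=False)
print(toy["compound"].tolist())  # [True, True, False]
```

With the default `keep='first'`, the first item of each shared file would be left unflagged, which is why `keep=False` is the appropriate choice here.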


In [10]:
# try some time functions
import re

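def extract_seconds(_time):
    """Convert an 'M:SS' or 'MM:SS' string to seconds; anything else (including NaN) gives 0.0."""
    if isinstance(_time, str):
        m = re.search(r'([0-9]{1,2}):([0-9]{2})', _time)
        if m:
            minutes = float(m.group(1))  # renamed to avoid shadowing the built-in min()
            seconds = float(m.group(2))
            return minutes * 60.0 + seconds
    return 0.0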

In [12]:
items['start_time'] = items['start'].apply(extract_seconds)
items['end_time'] = items['end'].apply(extract_seconds)
items['duration_time'] = items['duration'].apply(extract_seconds)
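
Because `extract_seconds` returns 0.0 for anything it cannot parse, a missing value and a genuine `0:00` become indistinguishable after conversion. If that distinction matters, a variant along the following lines keeps unparseable values as NaN. This is only a sketch: the name `extract_seconds_or_nan` is hypothetical and the sample values are made up.

```python
import re

import numpy as np
import pandas as pd

# Same M:SS / MM:SS pattern as extract_seconds, compiled once.
_MMSS = re.compile(r"([0-9]{1,2}):([0-9]{2})")

def extract_seconds_or_nan(value):
    """Parse 'M:SS' or 'MM:SS' into seconds; return NaN when nothing parses."""
    if isinstance(value, str):
        m = _MMSS.search(value)
        if m:
            return int(m.group(1)) * 60.0 + int(m.group(2))
    return np.nan

# Hypothetical sample: a valid time, a missing value, a true zero, and junk text.
sample = pd.Series(["3:25", np.nan, "0:00", "n/a"])
parsed = sample.apply(extract_seconds_or_nan)
print(parsed.isna().sum())  # number of values that could not be parsed
```

The NaN values can then be counted or inspected before being filled with zero.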


## Data Exploration
To correct the start and end times of each item we need to first find any patterns that we can exploit. We know that files that contain multiple items are treated differently.

Items that share the same path and filename will almost certainly belong to multiple-item files. I am calling such items compound items. The question is whether this finds them all: are there items that are the only ones in a file because the other items were removed (for being instrumentals etc.)? I know that with the new files, if there are multiple items, the filename contains diamonds. Let's see how many of them there are.

In [13]:
print(f'Items with diamonds that are not marked compound: {sum(items["filename"].str.contains("♦") & ~items["compound"])}')

Items with diamonds that are not marked compound: 1


Just one! Why?

In [14]:
odd_compound = items[items["filename"].str.contains("♦") & ~items["compound"]]

The other item in that file is an instrumental. Disturbingly, the start time is not right.

What do we think we know?
1. The start time is unreliable in single-item files.
2. The end time is likewise unreliable in single-item files.
3. End time minus start time is reliably the item duration.
4. When the start time is NaN it is actually zero.
5. If we could reliably pick out the multiple-item files, we could say that their start and end times are correct (although the item above shows that this might not be true).
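
Point 3 can be checked directly from the parsed columns. The sketch below flags rows where `end_time - start_time` disagrees with `duration_time`. The helper name `duration_mismatch`, the one-second `tolerance`, and the toy rows are assumptions made for illustration; the column names match those created earlier.

```python
import pandas as pd

def duration_mismatch(df, tolerance=1.0):
    """Flag rows where end_time - start_time differs from duration_time
    by more than `tolerance` seconds (tolerance is an assumed value)."""
    diff = df["end_time"] - df["start_time"] - df["duration_time"]
    return diff.abs() > tolerance

# Toy rows (hypothetical): the last one is off by ten seconds.
toy = pd.DataFrame({
    "start_time": [0.0, 120.0, 300.0],
    "end_time": [120.0, 300.0, 400.0],
    "duration_time": [120.0, 180.0, 90.0],
})
toy["mismatch"] = duration_mismatch(toy)
print(toy["mismatch"].tolist())  # [False, False, True]
```

Applied to `items`, `duration_mismatch(items).sum()` would give the number of rows that contradict point 3.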

Let's test point 5 on the voxgrn data because we know which files contain multiple items.

In [15]:
voxgrn_multi = items[items["filename"].str.contains("♦")].copy()
voxgrn_multi.shape

(17207, 21)

In [16]:
def valid_times(row):
    return row.start_time < row.length and row.end_time < row.length

voxgrn_multi['valid'] = voxgrn_multi.apply(valid_times, axis=1)

In [17]:
invalid_multi = voxgrn_multi[~voxgrn_multi.valid]

OK, so the gaps have been removed in the MP3s, meaning that the item boundaries do not have any padding. Let's see whether recalculating the start times from the accumulated durations corrects the problem.

In [18]:
voxgrn_multi.sort_values(by=['ID', 'item'], ascending=[True, True], inplace=True)

In [19]:
current_ID = ''
accumulated_time = 0
def calculate_start_time(row):
    global current_ID, accumulated_time
    if row.ID != current_ID:
        current_ID = row.ID
        accumulated_time = 0
    start_time = accumulated_time
    accumulated_time += row.duration_time
    return start_time

voxgrn_multi['start_'] = voxgrn_multi.apply(calculate_start_time, axis=1)

In [20]:
def calculate_end_time(row):
    return row.start_ + row.duration_time

voxgrn_multi['end_'] = voxgrn_multi.apply(calculate_end_time, axis=1)
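
The running-total logic above depends on module-level globals that must be reset before each `apply`. An equivalent formulation, sketched here on toy data, uses a grouped cumulative sum instead. The toy values are made up; the column names `ID`, `item`, `duration_time`, `start_` and `end_` follow the cells above, and the sketch assumes, as the original does, that each `ID` corresponds to one file.

```python
import pandas as pd

# Toy data (hypothetical): two programs with three and two items.
toy = pd.DataFrame({
    "ID": ["p1", "p1", "p1", "p2", "p2"],
    "item": [1, 2, 3, 1, 2],
    "duration_time": [60.0, 90.0, 30.0, 45.0, 15.0],
})

# Items must be in playing order within each program.
toy = toy.sort_values(["ID", "item"])

# The end of each item is the running total of durations within its program;
# the start is that total minus the item's own duration.
toy["end_"] = toy.groupby("ID")["duration_time"].cumsum()
toy["start_"] = toy["end_"] - toy["duration_time"]

print(toy[["ID", "item", "start_", "end_"]])
```

This avoids the hidden state, so it can be rerun on any frame without resetting `current_ID` and `accumulated_time` first.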

Comparing the recalculated times with the originals (by visual inspection and by playing the tracks to check alignment) shows that the original start times are erroneous in both directions (positive and negative) and that the duration is NOT always right when applied from the programs to the voxgrn data.

What if we apply the same logic to the non-voxgrn data?

In [23]:
orig_multi = items[(items["filename"].str.contains("♦") == False) & (items.compound == True)].copy()

In [24]:
orig_multi.sort_values(by=['ID', 'item'], ascending=[True, True], inplace=True)


In [25]:
accumulated_time = 0
current_ID = ''

orig_multi['start_'] = orig_multi.apply(calculate_start_time, axis=1)
orig_multi['end_'] = orig_multi.apply(calculate_end_time, axis=1)


Looking at the first example, program 20, there is a song for the first 45 seconds which is not in the metadata. The gaps between items do not exist, which means that at the end it is back in sync again. This is poor quality data: the metadata is erroneous.

The next program, 30, is similarly erroneous. There is an instrumental completely misplaced at the end of the first message.

What does the voice activation program do with the file?

In [26]:
import torch
torch.cuda.is_available()

True