# Form Local DB
This notebook collates the output of the file-generation step and builds a locally accessible Hugging Face (HF) dataset.

## 10 Sec Data
Let's start with the 10 sec data.

In [1]:
import pandas as pd
import numpy as np
import os
import sys
import pickle as pkl
from pathlib import Path


In [2]:
SEGMENTS_DIR = '/media/originals/py_audio_seg/'
DATASETS_DIR = '/media/originals/datasets/'
SEC_4_DATA_DIR = 'py_audio_seg_4sec/data/'
SEC_6_DATA_DIR = 'py_audio_seg_6sec/data/'
SEC_10_DATA_DIR = 'py_audio_seg_10sec/data/'


## Form a DF from the segments

In [3]:
# Concatenate all nine segment CSV files (seg_10sec_0.csv ... seg_10sec_8.csv).
seg_10_df = pd.concat(
    [pd.read_csv(f'{SEGMENTS_DIR}seg_10sec_{i}.csv', index_col='file_name') for i in range(9)]
)

print(seg_10_df.head())

                                                        Index      _1  iso  \
file_name                                                                    
py_audio_seg_10sec/data/aaa/C14610A Region 00_0...  14610_002   10622  aaa   
py_audio_seg_10sec/data/aab/Alumu-Tesu Messages...  13981_002  204137  aab   
py_audio_seg_10sec/data/aab/Alumu-Tesu Messages...  13981_002  204137  aab   
py_audio_seg_10sec/data/aab/Alumu-Tesu Messages...  13981_002  204137  aab   
py_audio_seg_10sec/data/aab/Alumu-Tesu Messages...  13981_002  204137  aab   

                                                   language_name  track  \
file_name                                                                 
py_audio_seg_10sec/data/aaa/C14610A Region 00_0...          Otwa      2   
py_audio_seg_10sec/data/aab/Alumu-Tesu Messages...    Alumu-Tesu      2   
py_audio_seg_10sec/data/aab/Alumu-Tesu Messages...    Alumu-Tesu      2   
py_audio_seg_10sec/data/aab/Alumu-Tesu Messages...    Alumu-Tesu      2   
py_

## Extract the Languages

In [4]:
langs = seg_10_df['iso'].value_counts()

In [5]:
# what languages do we have
lang_ids = sorted(list(langs.index))

In [6]:
import csv
# newline='' lets the csv module control the line endings itself
with open('lang10.csv', 'w', newline='') as csv_file:
    csv_writer = csv.writer(csv_file)
    csv_writer.writerow(lang_ids)

In [7]:
# Now write the languages out in groups of len(lang_ids)//10 for processing.
# If the count is not a multiple of 10, a short extra group holds the remainder.
lang_divisions = [i for i in range(len(lang_ids)//10, len(lang_ids), len(lang_ids)//10)]
lang_sets = []
start = 0
for lang_end in lang_divisions:
    lang_sets.append(lang_ids[start:lang_end])
    start = lang_end
lang_sets.append(lang_ids[start:])
# now write each group out as a pickle file
for i, s in enumerate(lang_sets):
    with open(f'lang10_{i}.pkl', 'wb') as pkl_file:
        pkl.dump(s, pkl_file)
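
If exactly 10 groups are wanted regardless of the language count, the split can be done with `divmod`. This is a minimal sketch rather than the code used in this notebook: `split_evenly` and `demo_ids` are hypothetical names, and `demo_ids` only stands in for `lang_ids`.

```python
def split_evenly(items, n_groups):
    """Split items into n_groups contiguous chunks whose sizes differ by at most one."""
    base, extra = divmod(len(items), n_groups)
    chunks, start = [], 0
    for i in range(n_groups):
        # The first `extra` chunks each take one additional item.
        end = start + base + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


# Hypothetical stand-in for lang_ids.
demo_ids = [f'l{i:04d}' for i in range(3153)]
groups = split_evenly(demo_ids, 10)
print([len(g) for g in groups])  # [316, 316, 316, 315, 315, 315, 315, 315, 315, 315]
```

The chunks stay contiguous and sorted, so each pickle file would still cover an alphabetical range of ISO codes.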


## Get some stats
So there are 3940 languages, 3032 of which have at least 100 four-second segments of data.
There are 2.4 million 4 sec segments, and 592 languages have at least 1000 segments.

How much audio is there?


In [8]:
audio_length = seg_10_df['seg_stop'] - seg_10_df['seg_start']
duration = int(sum(audio_length))//1000
hours = duration//3600
minutes = (duration - hours * 3600)//60
print(f'{hours}:{minutes}:{duration%60} {duration}')

403:52:28 1453948
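
The same hours/minutes/seconds arithmetic can be written with `divmod`. This is a small sketch, assuming the duration has already been converted from milliseconds to whole seconds as in the cell above; `format_hms` is a hypothetical helper name. Unlike the cell above, it zero-pads minutes and seconds.

```python
def format_hms(total_seconds):
    """Format a duration in seconds as H:MM:SS."""
    hours, remainder = divmod(int(total_seconds), 3600)
    minutes, seconds = divmod(remainder, 60)
    return f'{hours}:{minutes:02d}:{seconds:02d}'


print(format_hms(1453948))  # 403:52:28
```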


In [11]:
seg_10_df.shape[0] * 16497

40318222581

In [9]:
seg_10_df.to_csv("../../data/seg_10_df.csv")



## Create the metadata
At first I tried creating a JSON file for the metadata. This proved difficult because the JSON created by the dataset is ill-formed. Inspired by [Minds14](https://huggingface.co/datasets/PolyAI/minds14/blob/main/minds14.py), I decided to use CSV files instead. Users should be able to select the data per language, so it follows that there should be one CSV file per language. Before getting into that, here is my understanding of how HF datasets work when a script is used to download the data. The user executes:

```
load_dataset([string to identify db], [string to identify config], [args])
```
The `string to identify db` can be either a path to a local directory or an HF space. This is neat because it lets you test things locally before deploying them on HF. Once `load_dataset` finds the location, it looks for a Python script file of the same name in that directory and executes it. The script needs to do two things:

* Define (instantiate) the legal configurations
* Define (instantiate) a builder

After `load_dataset` has executed the script, it checks that there is a configuration matching the one the user has given. Both the configuration name and the names of the arguments have to match. If they do, it sets the builder's `config` member to that configuration and calls `_split_generators`. That function needs to load the metadata required for the requested configuration. The metadata could exist locally, on a Hugging Face space alongside the script file, or on a third-party server. Similarly, it has to load the data required for the configuration, which can be located in any of the same three places. After loading the metadata and the data, `_split_generators` creates and returns a list of `datasets.SplitGenerator`s, one for each split requested, and must pass the necessary metadata and data to each one.

`load_dataset` then calls `_generate_examples` on the builder, once for each split. It has to load the data for a single record from the data provided. It is a generator, so it yields the data one record at a time.
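
The last step, yielding one record at a time, can be sketched without the `datasets` library. This is an illustration only, not the actual loading script: the function name `generate_examples`, the column list `COLUMNS`, and the sample rows are assumptions. It reads a header-less per-language CSV and yields `(key, example)` pairs, which is the shape `_generate_examples` is expected to produce.

```python
import csv
import io

# Assumed column order for a header-less per-language CSV (hypothetical).
COLUMNS = ['file', 'iso', 'seg_start', 'seg_end']


def generate_examples(csv_text):
    """Yield (key, example) pairs, one per CSV row."""
    reader = csv.reader(io.StringIO(csv_text))
    for key, row in enumerate(reader):
        example = dict(zip(COLUMNS, row))
        # The csv module returns strings, so convert numeric fields explicitly.
        example['seg_start'] = float(example['seg_start'])
        example['seg_end'] = float(example['seg_end'])
        yield key, example


# Made-up rows standing in for one language's metadata file.
sample = (
    "./seg_10sec/data/aaa/a.mp3,aaa,0,10000\n"
    "./seg_10sec/data/aaa/b.mp3,aaa,10000,20000\n"
)
for key, example in generate_examples(sample):
    print(key, example['file'], example['seg_end'] - example['seg_start'])
```

In a real builder the generator would also attach the audio for each row. That part depends on how the tar.gz archives are extracted, so it is left out here.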

With that background here is how I intend to set up the database. First the legal calls to `load_dataset` are:
```
load_dataset('[path|HF space]/VoxGRN', '[seg_4sec|seg_6sec|seg_10sec]', languages=['all', 'aaa', ...])
```
where `languages` is a list of the ISO codes of the languages to load, or a macro value such as `all`. ISO-639-1 codes are accepted as well.

The metadata lives in a tar.gz file that contains the CSV files for every language. There is one tar.gz file for each of the three configurations. These files will be hosted on the SIL server.

The data for each language is a tar.gz file of the mp3 files for that language. There is one tar.gz file per language per configuration. These files will also be hosted on the SIL server.
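
The layout described above can be captured in two small URL helpers. This is a sketch under assumptions: the base URL and the `vox_grn_4sec_...` file names are taken from the 4 sec upload and download cells later in this notebook, and the 6 sec and 10 sec archives are assumed to follow the same naming pattern. `metadata_url` and `data_url` are hypothetical names.

```python
BASE_URL = 'https://grn-media.s3.amazonaws.com/archives'
CONFIGS = ('seg_4sec', 'seg_6sec', 'seg_10sec')


def _length_tag(config):
    """Return the '4sec' part of 'seg_4sec', rejecting unknown configurations."""
    if config not in CONFIGS:
        raise ValueError(f'unknown configuration: {config!r}')
    return config.split('_', 1)[1]


def metadata_url(config):
    # One metadata archive per configuration (naming assumed from the 4 sec case).
    return f'{BASE_URL}/{config}/vox_grn_{_length_tag(config)}_csv.tar.gz'


def data_url(config, lang):
    # One audio archive per language per configuration (naming assumed from the 4 sec case).
    return f'{BASE_URL}/{config}/data/vox_grn_{_length_tag(config)}_{lang}.tar.gz'


print(metadata_url('seg_4sec'))
print(data_url('seg_4sec', 'gvn'))
```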

This means I need to:

1. Create a CSV file for the metadata of each language in each configuration.
2. Create a tar.gz file of the mp3 files for each language in each configuration.
3. Move the tar.gz files to the SIL server.
4. Get the script file loading the tar.gz files from the SIL server.
5. Check the script file into the HF space.

I will test all the files locally before moving them to their remote locations.

## Create the CSV files

In [10]:
seg_10_df = seg_10_df.drop(columns=['Index', 'year', 'path', 'filename'])
seg_10_df.rename(inplace=True, columns={ 'file_name' : 'file', 'start' : 'item_start', 'end' : 'item_end'})

In [11]:
# The rename above did not touch the index ('file_name' is the index, not a column). Let's try making it a column.
seg_10_df['file'] = seg_10_df.index
seg_10_df['file'] = seg_10_df['file'].apply( lambda x : './' + x)

In [12]:
# change the index
seg_10_df.set_index('file', inplace=True)

In [14]:
# write out a csv file for every language
for lang in lang_ids:
    lang_df = seg_10_df[seg_10_df['iso'] == lang]
    lang_df.to_csv(f'/media/originals/datasets/seg_10sec_py/{lang}.csv', header=False)
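
With several thousand languages, filtering the whole frame once per language repeats the same scan many times. A possible alternative is `groupby`, which partitions the rows in one pass. The sketch below uses a tiny made-up frame (`demo_df`) and a temporary output directory in place of `seg_10_df` and the real dataset path.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Tiny stand-in for seg_10_df (hypothetical values).
demo_df = pd.DataFrame(
    {'iso': ['aaa', 'aab', 'aab'], 'seg': [0, 0, 1]},
    index=pd.Index(['./a/0.mp3', './b/0.mp3', './b/1.mp3'], name='file'),
)

out_dir = Path(tempfile.mkdtemp())
# groupby yields one (language, sub-frame) pair per distinct iso code.
for lang, lang_df in demo_df.groupby('iso'):
    lang_df.to_csv(out_dir / f'{lang}.csv', header=False)

print(sorted(p.name for p in out_dir.iterdir()))
```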

Sanity check - are we able to load the dataset with the tar.gz metadata file located locally?

In [None]:
from datasets import load_dataset
grnvox_test = load_dataset('/home/jovyan/grnvox_test', 'seg_4sec', languages=['aaa', 'aac'])

Using custom data configuration seg_4sec-7e13906d46e1114d


Downloading and preparing dataset grnvox_test/seg_4sec to /home/jovyan/.cache/huggingface/datasets/grnvox_test/seg_4sec-7e13906d46e1114d/0.0.0/7cad7bd8d099aedf768182b0300e5231ce223ba4fd15db3f13ffd038c2aa6374...


Downloading data:   0%|          | 0.00/914k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/2.45M [00:00<?, ?B/s]

Generating train split: 0 examples [00:00, ? examples/s]

Dataset grnvox_test downloaded and prepared to /home/jovyan/.cache/huggingface/datasets/grnvox_test/seg_4sec-7e13906d46e1114d/0.0.0/7cad7bd8d099aedf768182b0300e5231ce223ba4fd15db3f13ffd038c2aa6374. Subsequent calls will reuse this data.


  0%|          | 0/1 [00:00<?, ?it/s]

In [7]:

grnvox_test['train'][0]['audio']

{'path': '/home/jovyan/.cache/huggingface/datasets/downloads/extracted/5796cd957238728f8aeddc5080d3081ffa4132aafbe25c0dcc09047d130bbd31/seg_4sec/data/aaa/A14610_001_001.mp3',
 'array': array([ 0.        ,  0.        ,  0.        , ..., -0.05321169,
        -0.03989655, -0.03859127], dtype=float32),
 'sampling_rate': 16000}

In [8]:
import IPython.display as ipd
ipd.Audio(data=grnvox_test['train'][0]['audio']['array'], rate=grnvox_test['train'][0]['audio']['sampling_rate'])

Great! Now let's create all the tar.gz files.

In [15]:
class cd:
    """Context manager for changing the current working directory"""
    def __init__(self, newPath):
        self.newPath = os.path.expanduser(newPath)

    def __enter__(self):
        self.savedPath = os.getcwd()
        os.chdir(self.newPath)

    def __exit__(self, etype, value, traceback):
        os.chdir(self.savedPath)
        


Warning: This takes about 6 hours to run.

In [16]:
with cd('/media/originals/datasets'):
    for lang in lang_ids:
        os.system(f'tar -czf voxgrn_py_audio_10sec_{lang}.tar.gz py_audio_seg_10sec/data/{lang}')
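
The same archives could also be built with the standard-library `tarfile` module, which avoids changing the working directory and shelling out. This is a sketch, not what was run here: `make_archive` is a hypothetical helper, and the demo uses a temporary directory with a dummy file instead of the real mp3 tree.

```python
import tarfile
import tempfile
from pathlib import Path


def make_archive(base_dir, rel_path, archive_path):
    """Create a gzip-compressed tar of base_dir/rel_path, storing paths relative to base_dir."""
    with tarfile.open(archive_path, 'w:gz') as tar:
        # arcname keeps member names relative, as running tar from base_dir would.
        tar.add(Path(base_dir) / rel_path, arcname=rel_path)


# Demo on a throwaway directory tree (dummy content, not real audio).
base = Path(tempfile.mkdtemp())
(base / 'data' / 'aaa').mkdir(parents=True)
(base / 'data' / 'aaa' / 'x.mp3').write_bytes(b'dummy')

archive = Path(tempfile.mkdtemp()) / 'demo_aaa.tar.gz'
make_archive(base, 'data/aaa', archive)

with tarfile.open(archive, 'r:gz') as tar:
    names = tar.getnames()
print(names)
```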


## Uploading files
To upload files to the SIL Amazon server we first have to log on.

In [1]:
import boto3
import getpass


In [2]:
access_key = getpass.getpass('key')
secret = getpass.getpass('secret')

In [3]:
session = boto3.session.Session()
client = session.client('s3',
                        region_name='us-east-1',
                        aws_access_key_id=access_key,
                        aws_secret_access_key=secret)

### Upload the meta data

In [None]:
# let's upload the grn data
client.upload_file('/home/jovyan/grnvox_test/vox_grn_4sec_csv.tar.gz', 'grn-media', 'archives/seg_4sec/vox_grn_4sec_csv.tar.gz')

### Upload the language files
This will take a while and may get interrupted. To make sure we can pick up where we left off, the set of languages already uploaded is kept in a pickle file.

In [22]:
class persistent_set:
    """Context manager for keeping track of a persistent variable"""
    def __init__(self, setname):
        self.pkl_filename = f'{setname}.pkl'
        self.the_set = set()

    def __enter__(self):
        if os.path.isfile(self.pkl_filename):
            with open(self.pkl_filename, 'rb') as pklFile:
                self.the_set = pkl.load(pklFile)
        return self.the_set

    def __exit__(self, etype, value, traceback):
        with open(self.pkl_filename, 'wb') as pklFile:
            pkl.dump(self.the_set, pklFile)



In [23]:
import time
start_time = time.time()
items_to_upload = len(lang_ids)
for lang in lang_ids:
    with persistent_set('uploaded_4sec') as uploaded:
        if lang not in uploaded:
            client.upload_file(f'/media/originals/datasets/vox_grn_4sec_{lang}.tar.gz', 'grn-media', f'archives/seg_4sec/data/vox_grn_4sec_{lang}.tar.gz')
            uploaded.add(lang)
            items_uploaded = len(uploaded)
            if items_uploaded % 10 == 0:
                print(f'Uploaded {items_uploaded} out of {items_to_upload} in {time.time()-start_time} seconds')


Uploaded 10 out of 3153 in 147.65817785263062 seconds
Uploaded 20 out of 3153 in 321.1371548175812 seconds
Uploaded 30 out of 3153 in 691.7319128513336 seconds
Uploaded 40 out of 3153 in 992.736917257309 seconds
Uploaded 50 out of 3153 in 2225.234810113907 seconds
Uploaded 60 out of 3153 in 2529.969522714615 seconds
Uploaded 70 out of 3153 in 2798.333982229233 seconds
Uploaded 80 out of 3153 in 2951.5486624240875 seconds
Uploaded 90 out of 3153 in 3343.21928024292 seconds
Uploaded 100 out of 3153 in 3456.551223754883 seconds
Uploaded 110 out of 3153 in 3739.9261951446533 seconds
Uploaded 120 out of 3153 in 3981.182122707367 seconds
Uploaded 130 out of 3153 in 4103.039168357849 seconds
Uploaded 140 out of 3153 in 4497.39616394043 seconds
Uploaded 150 out of 3153 in 4692.543757200241 seconds
Uploaded 160 out of 3153 in 5005.599019050598 seconds
Uploaded 170 out of 3153 in 5229.098614931107 seconds
Uploaded 180 out of 3153 in 5368.917381763458 seconds
Uploaded 190 out of 3153 in 5733.4422
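
The resume logic can be checked without touching S3 by passing the upload step in as a function. This is a sketch under assumptions: `upload_all` and `flaky_upload` are hypothetical names, and `flaky_upload` fails once on purpose to simulate an interruption. In this notebook the real upload step would be a small wrapper around `client.upload_file`.

```python
import os
import pickle
import tempfile


def upload_all(lang_ids, upload_one, state_file):
    """Upload each language once, saving progress to state_file after every success."""
    done = set()
    if os.path.isfile(state_file):
        with open(state_file, 'rb') as f:
            done = pickle.load(f)
    for lang in lang_ids:
        if lang in done:
            continue
        upload_one(lang)  # may raise; progress so far is already on disk
        done.add(lang)
        with open(state_file, 'wb') as f:
            pickle.dump(done, f)
    return done


state = os.path.join(tempfile.mkdtemp(), 'uploaded_demo.pkl')
calls = []


def flaky_upload(lang):
    # Hypothetical uploader that fails once on 'aac' to simulate an interruption.
    if lang == 'aac' and 'aac-failed' not in calls:
        calls.append('aac-failed')
        raise ConnectionError('simulated interruption')
    calls.append(lang)


langs = ['aaa', 'aab', 'aac', 'aad']
try:
    upload_all(langs, flaky_upload, state)
except ConnectionError:
    pass  # the second run below resumes from the saved state
done = upload_all(langs, flaky_upload, state)
print(sorted(done), calls)
```

The second run skips `aaa` and `aab` because they were recorded before the simulated failure.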

Now I have committed the script file. Let's test that it works from Hugging Face.

In [28]:
from datasets import load_dataset
grnvox_test = load_dataset('johno-grn/grnvox_test', 'seg_10sec', languages=['aad', 'aac'])

Downloading builder script:   0%|          | 0.00/28.9k [00:00<?, ?B/s]

Using custom data configuration seg_10sec-6ede14c4d3193064


Downloading and preparing dataset grnvox_test/seg_10sec to /home/jovyan/.cache/huggingface/datasets/johno-grn___grnvox_test/seg_10sec-6ede14c4d3193064/0.0.0/8ddc29269740b672479647faf8981fe5eab5f08d1c231872b740e80d869b8065...


Generating train split: 0 examples [00:00, ? examples/s]

Dataset grnvox_test downloaded and prepared to /home/jovyan/.cache/huggingface/datasets/johno-grn___grnvox_test/seg_10sec-6ede14c4d3193064/0.0.0/8ddc29269740b672479647faf8981fe5eab5f08d1c231872b740e80d869b8065. Subsequent calls will reuse this data.


  0%|          | 0/1 [00:00<?, ?it/s]

In [29]:
import IPython.display as ipd
# take the array and the sampling rate from the same record
sample = grnvox_test['train'][2]['audio']
ipd.Audio(data=sample['array'], rate=sample['sampling_rate'])

In [16]:
grnvox_test.cleanup_cache_files()

{'train': 0}

In [6]:
# read the file back using the web address
import requests

url = 'https://grn-media.s3.amazonaws.com/archives/seg_4sec/vox_grn_4sec_csv.tar.gz'
r = requests.get(url, allow_redirects=True)
r.raise_for_status()  # fail loudly if the download did not succeed

In [4]:
s3 = boto3.resource('s3',                        
                    region_name='us-east-1',
                    aws_access_key_id=access_key,
                    aws_secret_access_key=secret)
bucket = s3.Bucket('grn-media')

We need to check whether all the files were uploaded correctly.

In [8]:
import datetime
e = datetime.datetime.now()
t_obj = list(bucket.objects.filter(Prefix='archives/seg_6sec/data'))
print(f'Processed {len(list(t_obj))} at {e.hour:02d}:{e.minute:02d}:{e.second:02d}')

Processed 1930 at 10:54:09


In [9]:
e = datetime.datetime.now()
t_obj = list(bucket.objects.filter(Prefix='archives/seg_10sec/data'))
print(f'Processed {len(list(t_obj))} at {e.hour:02d}:{e.minute:02d}:{e.second:02d}')

Processed 1875 at 10:54:27
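
Counting objects shows how many archives arrived, but not which ones are missing. A set comparison against the expected keys gives the list of languages to retry. This is a sketch: `missing_archives` is a hypothetical helper, the key pattern is assumed from the 4 sec upload cell, and the demo keys are made up. In the notebook the keys could come from `[o.key for o in t_obj]`.

```python
def missing_archives(lang_ids, uploaded_keys, config='seg_10sec'):
    """Return the languages whose expected archive key is absent from uploaded_keys."""
    length = config.split('_', 1)[1]  # e.g. '10sec'
    present = set(uploaded_keys)
    return sorted(
        lang
        for lang in lang_ids
        # Key naming assumed to mirror the 4 sec uploads.
        if f'archives/{config}/data/vox_grn_{length}_{lang}.tar.gz' not in present
    )


# Made-up listing with one archive missing.
demo_keys = [
    'archives/seg_10sec/data/vox_grn_10sec_aaa.tar.gz',
    'archives/seg_10sec/data/vox_grn_10sec_aac.tar.gz',
]
print(missing_archives(['aaa', 'aab', 'aac'], demo_keys))  # ['aab']
```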


In [42]:
print(t_obj[110])

s3.ObjectSummary(bucket_name='grn-media', key='archives/seg_4sec/data/vox_grn_4sec_gvn.tar.gz')


In [47]:
print(grnvox_test['train'][0]['iso'])

gvn


In [49]:
print(grnvox_test['train'].features)

{'file': Value(dtype='string', id=None), 'audio': Audio(sampling_rate=16000, mono=True, decode=True, id=None), 'iso': Value(dtype='string', id=None), 'program': Value(dtype='string', id=None), 'location': Value(dtype='string', id=None), 'item_no': Value(dtype='string', id=None), 'title': Value(dtype='string', id=None), 'item_start': Value(dtype='float32', id=None), 'item_end': Value(dtype='float32', id=None), 'seg_start': Value(dtype='float32', id=None), 'seg_end': Value(dtype='float32', id=None), 'seg': Value(dtype='int32', id=None)}
