# Form Local DB
This notebook collates the output of the file generation and builds a locally accessible Hugging Face (HF) DB.

## 6 Sec Data
Let's start with the 6 sec data.

In [1]:
import pandas as pd
import numpy as np
import os
import sys
import pickle as pkl
from pathlib import Path


In [2]:
SEGMENTS_DIR = '/media/originals/segments/'
DATASETS_DIR = '/media/originals/datasets/'
SEC_4_DATA_DIR = 'seg_4sec/data/'
SEC_6_DATA_DIR = 'seg_6sec/data/'
SEC_10_DATA_DIR = 'seg_10sec/data/'



## Form a DF from the segments

In [3]:
# Concatenate the ten csv shards (seg_6sec_0.csv ... seg_6sec_9.csv) into one DataFrame.
seg_6_df = pd.concat(
    [pd.read_csv(f'{SEGMENTS_DIR}seg_6sec_{i}.csv', index_col='file_name') for i in range(10)]
)

print(seg_6_df.head())

                                           Index  iso program location  \
file_name                                                                
seg_6sec/data/aaa/A14610_001_000.mp3  A14610_001  aaa  A14610   Igarra   
seg_6sec/data/aaa/A14610_001_001.mp3  A14610_001  aaa  A14610   Igarra   
seg_6sec/data/aaa/A14610_001_002.mp3  A14610_001  aaa  A14610   Igarra   
seg_6sec/data/aaa/A14610_001_003.mp3  A14610_001  aaa  A14610   Igarra   
seg_6sec/data/aaa/A14610_001_004.mp3  A14610_001  aaa  A14610   Igarra   

                                        year  \
file_name                                      
seg_6sec/data/aaa/A14610_001_000.mp3  1963.0   
seg_6sec/data/aaa/A14610_001_001.mp3  1963.0   
seg_6sec/data/aaa/A14610_001_002.mp3  1963.0   
seg_6sec/data/aaa/A14610_001_003.mp3  1963.0   
seg_6sec/data/aaa/A14610_001_004.mp3  1963.0   

                                                                   path  \
file_name                                                            

## Extract the Languages

In [4]:
langs = seg_6_df['iso'].value_counts()

In [5]:
# what languages do we have
lang_ids = sorted(list(langs.index))

In [6]:
import csv
with open('lang6sec.csv', 'w', newline='') as csv_file:  # newline='' is recommended for the csv module
    csv_writer = csv.writer(csv_file)
    csv_writer.writerow(lang_ids)

## Get some stats
So there are 3153 languages, 2845 of which have at least 100 4 second segments of data.
There are 2.5 million 4 sec segments. 621 languages have at least 1000 segments.
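
Figures like these can be recomputed from the per-language segment counts. A minimal sketch, assuming the counts are available as a plain mapping (for example `langs.to_dict()`); the helper name `count_at_least` and the toy numbers are illustrative only and are not real figures from the dataset:

```python
def count_at_least(seg_counts, threshold):
    """Return how many languages have at least `threshold` segments."""
    return sum(1 for n in seg_counts.values() if n >= threshold)


# Toy counts standing in for langs.to_dict(); these are made-up values.
toy_counts = {'aaa': 1500, 'aab': 120, 'aac': 40}

n_languages = len(toy_counts)                    # number of languages
n_100 = count_at_least(toy_counts, 100)          # languages with >= 100 segments
n_1000 = count_at_least(toy_counts, 1000)        # languages with >= 1000 segments
n_segments = sum(toy_counts.values())            # total number of segments
print(n_languages, n_100, n_1000, n_segments)
```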

How much audio is there?


In [7]:
audio_length = seg_6_df['seg_stop'] - seg_6_df['seg_start']
duration = int(sum(audio_length))
hours = duration//3600
minutes = (duration - hours * 3600)//60
print(f'{hours}:{minutes}:{duration%60} {duration}')

3278:38:12 11803092
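
The same breakdown can be written with `divmod`, which also makes zero-padding straightforward. A small sketch; the helper name `format_hms` is not part of the notebook:

```python
def format_hms(total_seconds):
    """Format a duration in seconds as H:MM:SS."""
    hours, remainder = divmod(int(total_seconds), 3600)
    minutes, seconds = divmod(remainder, 60)
    return f'{hours}:{minutes:02d}:{seconds:02d}'


print(format_hms(11803092))  # 3278:38:12
```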


In [8]:
seg_6_df.shape[0] * 16497

32452601454

## Create the metadata
At first I tried creating a JSON file for the metadata. This proved difficult because the JSON created by the dataset is ill-formed. Inspired by [Minds14](https://huggingface.co/datasets/PolyAI/minds14/blob/main/minds14.py), I decided to use csv files instead. Users should be able to select data per language, so it follows that there should be one csv file per language. Before getting into that, let me give an account of my understanding of how HF datasets work when a script is used to download the data. The user executes:

```
load_dataset([string to identify db], [string to identify config], [args])
```
The `string to identify db` can be either a path to a local directory or the name of a HF space. This is neat because it lets you test things locally before deploying them on HF. Once `load_dataset` finds the location, it looks for a Python script file of the same name in that directory and executes it. There are two things the script needs to do:

* Define (instantiate) the legal configurations
* Define (instantiate) a builder

After `load_dataset` has executed the script, it checks that there is a configuration matching the one the user has given. Both the configuration name and the names of the arguments have to match. If they do, it sets the builder's config member to that configuration and calls `_split_generators`. That function needs to load the metadata required for the requested configuration. The metadata could exist locally, on a HF space alongside the script file, or on a third-party server. Similarly, it has to load the data required for the configuration, which can be located in any of the same three places. After loading the metadata and the data, `_split_generators` creates and returns a list of `datasets.SplitGenerator`s, one for each split requested, passing the necessary metadata and data to each one. `load_dataset` then calls `_generate_examples` on the builder, once for each split. This is a generator: it loads the data for a single record from the data provided and yields the records one at a time.
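
To make the last step concrete, here is a library-free sketch of what the per-split generator does: read a header-less metadata csv and yield one `(key, record)` pair per row. In the real script this logic would sit in `_generate_examples` of the builder. The column list `COLUMNS`, the function name `generate_examples`, the extraction directory, and the sample row are all assumptions for illustration, not the actual layout of the csv files.

```python
import csv
import io
import os

# Assumed column order for the header-less csv (hypothetical, for illustration).
COLUMNS = ['file', 'iso', 'program', 'seg_start', 'seg_end']


def generate_examples(csv_file, data_root):
    """Yield (key, record) pairs, one per metadata row."""
    for key, row in enumerate(csv.reader(csv_file)):
        record = dict(zip(COLUMNS, row))
        # Resolve the relative path stored in the csv against the extracted archive.
        record['audio'] = os.path.normpath(os.path.join(data_root, record['file']))
        yield key, record


# A made-up metadata row standing in for one line of a per-language csv file.
sample = io.StringIO('./seg_6sec/data/aaa/A14610_001_000.mp3,aaa,A14610,0.0,6.0\n')
examples = list(generate_examples(sample, 'extracted'))
for key, record in examples:
    print(key, record['iso'], record['audio'])
```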

With that background here is how I intend to set up the database. First the legal calls to `load_dataset` are:
```
load_dataset('[path|HF space]/VoxGRN', '[seg_4sec|seg_6sec|seg_10sec]', languages=['all', 'aaa', ...])
```
where `languages` is a list of the iso codes of the languages to load, or a macro value such as `all`. It will also accept ISO-639-1 codes.
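
A sketch of how the `languages` argument could be resolved inside the script, assuming `all` is the only macro. The function name `resolve_languages` and the error behaviour are assumptions, not the script's actual code, and the mapping from ISO-639-1 codes to three-letter codes is left out.

```python
def resolve_languages(requested, available):
    """Expand the 'all' macro and reject unknown iso codes."""
    if 'all' in requested:
        return sorted(available)
    unknown = sorted(set(requested) - set(available))
    if unknown:
        raise ValueError(f'Unknown language codes: {unknown}')
    # Preserve the caller's order but drop duplicates.
    return list(dict.fromkeys(requested))


available = ['aaa', 'aac', 'gun', 'gvn']
print(resolve_languages(['all'], available))
print(resolve_languages(['gvn', 'aaa', 'gvn'], available))
```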

The metadata exists in a tar.gz file containing the csv files for every language. There is one such tar.gz file for each of the three configurations. These files will exist on the SIL server.

The data for each language is a tar.gz file of the mp3 files for that language. There is one tar.gz file per language per configuration. These files will also exist on the SIL server.
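
Given that layout, the download locations can be derived from the configuration name and the iso code. The sketch below follows the bucket URL and key pattern used for the 6 sec uploads later in this notebook. That the other configurations follow the same naming is an assumption, as are the helper names `metadata_url` and `data_url`.

```python
BASE_URL = 'https://grn-media.s3.amazonaws.com/archives'


def metadata_url(config):
    """URL of the csv archive for a configuration such as 'seg_6sec'."""
    tag = config.replace('seg_', '')  # 'seg_6sec' -> '6sec'
    return f'{BASE_URL}/{config}/vox_grn_{tag}_csv.tar.gz'


def data_url(config, lang):
    """URL of the mp3 archive for one language in one configuration."""
    tag = config.replace('seg_', '')
    return f'{BASE_URL}/{config}/data/vox_grn_{tag}_{lang}.tar.gz'


print(metadata_url('seg_6sec'))
print(data_url('seg_6sec', 'aaa'))
```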

This means I need to:

1. Create a csv file for the metadata of each language in each configuration.
2. Create a tar.gz file of the mp3 files for each language in each configuration.
3. Move the tar.gz files to the SIL server.
4. Get the script file to load the tar.gz files from the SIL server.
5. Check the script file into the HF space.

I will test all the files locally before moving them to their remote locations.

## Create the CSV files

In [6]:
seg_6_df = seg_6_df.drop(columns=['Index', 'year', 'path', 'filename'])
seg_6_df.rename(inplace=True, columns={ 'file_name' : 'file', 'start' : 'item_start', 'end' : 'item_end'})

In [7]:
# The index was not renamed, because rename(columns=...) only touches columns. Let's try making it a column.
seg_6_df['file'] = seg_6_df.index
seg_6_df['file'] = seg_6_df['file'].apply( lambda x : './' + x)

In [8]:
# change the index
seg_6_df.set_index('file', inplace=True)

In [12]:
# write out a csv file for every language
for lang in lang_ids:
    lang_df = seg_6_df[seg_6_df['iso'] == lang]
    lang_df.to_csv(f'/media/originals/datasets/seg_6sec/{lang}.csv', header=False)

Sanity check - are we able to load the dataset with the tar.gz metadata file located locally?

In [15]:
from datasets import load_dataset
grnvox_test = load_dataset('/home/jovyan/grnvox_test', 'seg_6sec', languages=['aaa', 'aac'])

Using custom data configuration seg_6sec-7e13906d46e1114d


Downloading and preparing dataset grnvox_test/seg_6sec to /home/jovyan/.cache/huggingface/datasets/grnvox_test/seg_6sec-7e13906d46e1114d/0.0.0/1f16a8ded62c9b448cdaade255fbd9dbf70c8fde5f9f6a936f71cd28481fa259...


Generating train split: 0 examples [00:00, ? examples/s]

Dataset grnvox_test downloaded and prepared to /home/jovyan/.cache/huggingface/datasets/grnvox_test/seg_6sec-7e13906d46e1114d/0.0.0/1f16a8ded62c9b448cdaade255fbd9dbf70c8fde5f9f6a936f71cd28481fa259. Subsequent calls will reuse this data.


  0%|          | 0/1 [00:00<?, ?it/s]

In [16]:

grnvox_test['train'][0]['audio']

{'path': '/home/jovyan/.cache/huggingface/datasets/downloads/extracted/f17c3ee24a87379216254cd32b0e73c44145ad17dfede5b4a862fd70258cd693/seg_6sec/data/aaa/A14610_001_001.mp3',
 'array': array([ 0.        ,  0.        ,  0.        , ..., -0.00084448,
        -0.00062126,  0.00092   ], dtype=float32),
 'sampling_rate': 16000}

In [17]:
import IPython.display as ipd
ipd.Audio(data=grnvox_test['train'][0]['audio']['array'], rate=grnvox_test['train'][0]['audio']['sampling_rate'])

Great! Now let's create all the tar.gz files.

In [6]:
class cd:
    """Context manager for changing the current working directory"""
    def __init__(self, newPath):
        self.newPath = os.path.expanduser(newPath)

    def __enter__(self):
        self.savedPath = os.getcwd()
        os.chdir(self.newPath)

    def __exit__(self, etype, value, traceback):
        os.chdir(self.savedPath)
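
The `cd` helper is used below to run `tar` from the datasets directory so that the archive stores relative paths. An alternative that avoids changing directory is the standard `tarfile` module, where `arcname` controls the path stored inside the archive. A minimal sketch, assuming the same `seg_6sec/data/{lang}` layout; the function name `make_lang_archive` is illustrative and the demo runs against a temporary directory with a dummy file rather than the real data.

```python
import os
import tarfile
import tempfile


def make_lang_archive(datasets_dir, lang, config='seg_6sec'):
    """Create vox_grn_<tag>_<lang>.tar.gz with member paths relative to datasets_dir."""
    tag = config.replace('seg_', '')
    rel_dir = os.path.join(config, 'data', lang)
    archive = os.path.join(datasets_dir, f'vox_grn_{tag}_{lang}.tar.gz')
    with tarfile.open(archive, 'w:gz') as tar:
        # arcname keeps member paths relative, as `tar` run from datasets_dir would.
        tar.add(os.path.join(datasets_dir, rel_dir), arcname=rel_dir)
    return archive


with tempfile.TemporaryDirectory() as tmp:
    lang_dir = os.path.join(tmp, 'seg_6sec', 'data', 'aaa')
    os.makedirs(lang_dir)
    with open(os.path.join(lang_dir, 'dummy.mp3'), 'wb') as f:
        f.write(b'not real audio')
    archive = make_lang_archive(tmp, 'aaa')
    archive_name = os.path.basename(archive)
    with tarfile.open(archive, 'r:gz') as tar:
        names = tar.getnames()
    print(archive_name, names)
```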
        


## Uploading files
To upload files to the SIL Amazon server we first have to log on.

In [7]:
import boto3
import getpass


In [8]:
access_key = getpass.getpass('key')
secret = getpass.getpass('secret')

In [9]:
session = boto3.session.Session()
client = session.client('s3',
                        region_name='us-east-1',
                        aws_access_key_id=access_key,
                        aws_secret_access_key=secret)

### Upload the metadata

In [21]:
# Let's upload the GRN metadata archive
client.upload_file('/home/jovyan/grnvox_test/vox_grn_6sec_csv.tar.gz', 'grn-media', 'archives/seg_6sec/vox_grn_6sec_csv.tar.gz')

### Upload the language files
This will take a while and may get interrupted. To ensure we can pick up where we left off, a pickle of the set of uploaded languages will be kept.

In [10]:
class persistent_set:
    """Context manager for keeping track of a persistent variable"""
    def __init__(self, setname):
        self.pkl_filename = f'{setname}.pkl'
        self.the_set = set()

    def __enter__(self):
        if os.path.isfile(self.pkl_filename):
            with open(self.pkl_filename, 'rb') as pklFile:
                self.the_set = pkl.load(pklFile)
        return self.the_set

    def __exit__(self, etype, value, traceback):
        with open(self.pkl_filename, 'wb') as pklFile:
            pkl.dump(self.the_set, pklFile)
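
The point of the pickle is that progress survives an interruption. The sketch below demonstrates that behaviour with a function-based equivalent of the class, written with `contextlib.contextmanager` purely to keep the example short. The name `persistent_set_demo`, the temporary file, and the simulated failure are illustrative.

```python
import os
import pickle
import tempfile
from contextlib import contextmanager


@contextmanager
def persistent_set_demo(path):
    """Load a set from `path` if it exists and save it again on exit."""
    items = set()
    if os.path.isfile(path):
        with open(path, 'rb') as f:
            items = pickle.load(f)
    try:
        yield items
    finally:
        # Saved even if the body raises, so completed work is not lost.
        with open(path, 'wb') as f:
            pickle.dump(items, f)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'uploaded.pkl')
    try:
        with persistent_set_demo(path) as done:
            done.add('aaa')
            raise RuntimeError('simulated interruption')
    except RuntimeError:
        pass
    # A later run sees what was recorded before the interruption.
    with persistent_set_demo(path) as done:
        resumed = set(done)
    print(resumed)
```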



In [11]:
import time
start_time = time.time()
items_to_upload = len(lang_ids)
for lang in lang_ids:
    with persistent_set('uploaded_6sec') as uploaded:
        if lang not in uploaded:
            with cd('/media/originals/datasets'):
                os.system(f'tar -czf vox_grn_6sec_{lang}.tar.gz seg_6sec/data/{lang}')

            client.upload_file(f'/media/originals/datasets/vox_grn_6sec_{lang}.tar.gz', 'grn-media', f'archives/seg_6sec/data/vox_grn_6sec_{lang}.tar.gz')
            uploaded.add(lang)
            items_uploaded = len(uploaded)
            if items_uploaded % 10 == 0:
                print(f'Uploaded {items_uploaded} out of {items_to_upload} in {time.time()-start_time} seconds')


Uploaded 3000 out of 3153 in 337.1318874359131 seconds
Uploaded 3010 out of 3153 in 1474.8198297023773 seconds
Uploaded 3020 out of 3153 in 1965.9780113697052 seconds
Uploaded 3030 out of 3153 in 2313.6164541244507 seconds
Uploaded 3040 out of 3153 in 2811.6134696006775 seconds
Uploaded 3050 out of 3153 in 3542.2324101924896 seconds
Uploaded 3060 out of 3153 in 4429.601443767548 seconds
Uploaded 3070 out of 3153 in 5232.170476913452 seconds
Uploaded 3080 out of 3153 in 5787.633058309555 seconds
Uploaded 3090 out of 3153 in 6161.7162499427795 seconds
Uploaded 3100 out of 3153 in 6803.98233127594 seconds
Uploaded 3110 out of 3153 in 7509.925943374634 seconds
Uploaded 3120 out of 3153 in 7889.163735866547 seconds
Uploaded 3130 out of 3153 in 8317.778066396713 seconds
Uploaded 3140 out of 3153 in 8753.52340388298 seconds
Uploaded 3150 out of 3153 in 9035.480849266052 seconds


Now I have committed the script file. Let's test that it works from Hugging Face.

In [52]:
from datasets import load_dataset
grnvox_test = load_dataset('johno-grn/grnvox_test', 'seg_6sec', languages=['gvn', 'gun', 'aaa'])

Using custom data configuration seg_4sec-03103f692243d22c


Downloading and preparing dataset grnvox_test/seg_4sec to /home/jovyan/.cache/huggingface/datasets/johno-grn___grnvox_test/seg_4sec-03103f692243d22c/0.0.0/6629e274fe3dde21b99a19ed9a191c9a092eeb63f5c42a00eb2b8d8390e76c4a...


Generating train split: 0 examples [00:00, ? examples/s]

Dataset grnvox_test downloaded and prepared to /home/jovyan/.cache/huggingface/datasets/johno-grn___grnvox_test/seg_4sec-03103f692243d22c/0.0.0/6629e274fe3dde21b99a19ed9a191c9a092eeb63f5c42a00eb2b8d8390e76c4a. Subsequent calls will reuse this data.


  0%|          | 0/1 [00:00<?, ?it/s]

In [44]:
import IPython.display as ipd
# Take the array and the sampling rate from the same record.
sample = grnvox_test['train'][2]
ipd.Audio(data=sample['audio']['array'], rate=sample['audio']['sampling_rate'])

In [3]:
grnvox_test.cleanup_cache_files()

{'train': 0}

In [6]:
# read the file back using the web address
import requests

url = 'https://grn-media.s3.amazonaws.com/archives/seg_6sec/vox_grn_6sec_csv.tar.gz'
r = requests.get(url, allow_redirects=True)
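
Once the bytes are downloaded, the archive can be inspected in memory without writing it to disk. A sketch, assuming the response body is a gzip-compressed tar. A small archive is built in memory here to stand in for `r.content`; the helper name `list_members` and the member name `seg_6sec/aaa.csv` are illustrative.

```python
import io
import tarfile


def list_members(archive_bytes):
    """Return the member names of a tar.gz held in memory."""
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode='r:gz') as tar:
        return tar.getnames()


# Build a tiny stand-in archive; with a real download use list_members(r.content).
buffer = io.BytesIO()
with tarfile.open(fileobj=buffer, mode='w:gz') as tar:
    payload = b'file,iso\n'
    info = tarfile.TarInfo(name='seg_6sec/aaa.csv')
    info.size = len(payload)
    tar.addfile(info, io.BytesIO(payload))

members = list_members(buffer.getvalue())
print(members)
```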

In [7]:
s3 = boto3.resource('s3',                        
                    region_name='us-east-1',
                    aws_access_key_id=access_key,
                    aws_secret_access_key=secret)
bucket = s3.Bucket('grn-media')

We need to check whether all the files were uploaded.

In [8]:
import datetime
e = datetime.datetime.now()
t_obj = list(bucket.objects.filter(Prefix='archives/seg_6sec/data'))
print(f'Processed {len(list(t_obj))} at {e.hour:02d}:{e.minute:02d}:{e.second:02d}')

Processed 1416 at 03:56:53


In [15]:
e = datetime.datetime.now()
t_obj = list(bucket.objects.filter(Prefix='archives/seg_6sec/data'))
print(f'Processed {len(list(t_obj))} at {e.hour:02d}:{e.minute:02d}:{e.second:02d}')

Processed 3154 at 19:40:22
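
A count alone does not show which archives are missing. One way to check is sketched below, under the assumption that every language should have a key of the form used in the upload loop. `missing_archives` and the toy key list are illustrative; with boto3 the keys would come from `[o.key for o in t_obj]`.

```python
def missing_archives(lang_ids, keys, config='seg_6sec'):
    """Return the iso codes whose archive key is absent from `keys`."""
    tag = config.replace('seg_', '')
    present = set(keys)
    return [lang for lang in lang_ids
            if f'archives/{config}/data/vox_grn_{tag}_{lang}.tar.gz' not in present]


# Toy listing: only the archive for 'aaa' is present.
toy_keys = ['archives/seg_6sec/data/vox_grn_6sec_aaa.tar.gz']
missing = missing_archives(['aaa', 'aac'], toy_keys)
print(missing)
```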


In [42]:
print(t_obj[110])

s3.ObjectSummary(bucket_name='grn-media', key='archives/seg_4sec/data/vox_grn_4sec_gvn.tar.gz')


In [47]:
print(grnvox_test['train'][0]['iso'])

gvn


In [49]:
print(grnvox_test['train'].features)

{'file': Value(dtype='string', id=None), 'audio': Audio(sampling_rate=16000, mono=True, decode=True, id=None), 'iso': Value(dtype='string', id=None), 'program': Value(dtype='string', id=None), 'location': Value(dtype='string', id=None), 'item_no': Value(dtype='string', id=None), 'title': Value(dtype='string', id=None), 'item_start': Value(dtype='float32', id=None), 'item_end': Value(dtype='float32', id=None), 'seg_start': Value(dtype='float32', id=None), 'seg_end': Value(dtype='float32', id=None), 'seg': Value(dtype='int32', id=None)}
