<a href="https://colab.research.google.com/github/johno-source/vox-grn/blob/main/vox-grn.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Analysis of vox-grn
This notebook analyses the vox-grn dataset hosted on Hugging Face.

In [1]:
import pandas as pd
import json
import requests


## Terminology
A few terms need to be clarified to make sense of this notebook:
| Term | Meaning |
|------|---------|
| Item | A single logical unit of audio. It might be a story, a song, a testimony, or some other contiguous, logically connected stream of audio. Items always have just one language. All items belong to just one program. All items can be uniquely identified using their program ID and their item ID. |
| Program | A collection of Items that are logically related. It is not necessary that all items are of the same language, although they usually are. All programs can be uniquely identified using their 5 digit program ID. |
| Program Set | A collection of Items similar to a program. Each Program Set has a Program Set ID which can be used to uniquely identify the set. A Program Set can also be a Program. In this case its Program ID is the same as its Program Set ID. |
| vox-grn | A Hugging Face dataset constructed from GRN items. Where variables are sourced from vox-grn they are prefixed with vox_. |
| File | This is a single mp3 file loaded from vox-grn. A file may contain 1 to many items from one, and only one, program set. No files were found that contain items that were part of a program that was not also a program set. | 
| GRID | A GRN program used to interface to GRN's SQL database that contains metadata about Programs, Program Sets, and Items. Where variables are derived from data extracted using GRID they are prefixed with grid_. |

## Load Data
First read in the json file associated with the data set and convert it to a data frame. 
Likewise read in the csv file exported from GRID.

In [2]:

# create a dataframe using a generator
def gen_vox_grn():
  resp = requests.get('https://raw.githubusercontent.com/johno-source/vox-grn/main/data/vox-grn.json')
  vox_dict = json.loads(resp.text)
  for iso in vox_dict.keys():
    lang_df = pd.json_normalize(vox_dict[iso])
    lang_df['iso'] = iso
    yield lang_df

vox_df = pd.concat(gen_vox_grn())

The path of the file contains the GRN program/program set identifier. Extract this for later analysis.

In [3]:
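# Escape the leading dot so it matches a literal '.'; expand=False returns a Series.
vox_df['program'] = vox_df['file'].str.extract(r'\./Audio_MP3/[0-9]{2}/([0-9]{5})', expand=False)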
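
As a minimal sketch of what this extraction does, the snippet below applies the same pattern to two sample paths. The `sample` frame is illustrative only: the first path appears in this notebook's output and the second is a made-up path that does not follow the layout. Rows that do not match come back as NaN, so counting nulls is a cheap way to confirm every file was assigned a program.

```python
import pandas as pd

# Illustrative frame: the first path appears in this notebook's output,
# the second is a made-up path that does not follow the layout.
sample = pd.DataFrame({
    "file": [
        "./Audio_MP3/62/62841/Ambai Bible Readings 002 62841.mp3",
        "./Other/readme.txt",
    ]
})

# expand=False returns a Series when the pattern has a single capture group.
sample["program"] = sample["file"].str.extract(
    r"\./Audio_MP3/[0-9]{2}/([0-9]{5})", expand=False
)

# Paths that do not match yield NaN rather than raising an error.
unmatched = int(sample["program"].isna().sum())
print(sample["program"].tolist(), unmatched)
```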

In [6]:
# load the GRID data. Exported 16 August 2022
#grid_items = pd.read_csv("https://raw.githubusercontent.com/johno-source/vox-grn/main/data/items_with_records.csv")
grid_items = pd.read_csv("/prometheus/GRN/grid_program_items1.csv")
grid_sets = pd.read_csv("/prometheus/GRN/grid_program_sets.csv")
grn_languages = pd.read_csv("/prometheus/GRN/grid_languages-1.csv")



The grid_items program identifier is prefixed with a single character that is not relevant for this analysis. Strip it off.

In [7]:
grid_items['prog_no'] = grid_items['Program Number'].str.extract(r'([0-9]{5})')

The program sets have their program id as an integer. Convert it to a string and format it using 5 digits.

In [8]:
grid_sets['program'] = grid_sets['Program Set Number'].astype(int).apply('{:0>5d}'.format)
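
For illustration, the zero-padding step can be checked on a few made-up program set numbers. The `toy_sets` frame below is hypothetical; it compares the format-string approach used above with `str.zfill`, which should give the same result for non-negative integers.

```python
import pandas as pd

# Made-up program set numbers, for illustration only.
toy_sets = pd.DataFrame({"Program Set Number": [42, 1234, 62841]})

# Approach used in this notebook: format each integer with 5 digits.
padded_format = toy_sets["Program Set Number"].astype(int).apply("{:0>5d}".format)

# Equivalent alternative for non-negative integers: convert to str, then pad.
padded_zfill = toy_sets["Program Set Number"].astype(int).astype(str).str.zfill(5)

print(padded_format.tolist())
```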

Add the iso language to each of the grid program sets.

In [28]:
grn_lang_num_to_iso = dict(zip(grn_languages['Language Number'], grn_languages['ISO Language Code']))
grid_sets['iso'] = grid_sets['Language Number'].map(grn_lang_num_to_iso)
grid_items['iso'] = grid_items['Language Number'].map(grn_lang_num_to_iso)

## Sanity Check
To ensure accurate analysis we need to establish that the GRID data is a superset of vox-grn. This should be the case as the export of the GRID data occurred after vox-grn was taken.

In [10]:
grid_item_program_ids = set(grid_items['prog_no'])
grid_program_set_ids = set(grid_sets['program'])
vox_grn_program_ids = set(vox_df['program'])

vox_extra_ids = vox_grn_program_ids-grid_item_program_ids
program_sets_not_in_items = grid_program_set_ids - grid_item_program_ids
print(f'Number of GRID programs: {len(grid_item_program_ids)}')
print(f'Number of GRID program sets: {len(grid_program_set_ids)}')
print(f'Number of vox-grn programs: {len(vox_grn_program_ids)}')
print(f'Number of vox_grn programs not in Grid programs {len(vox_extra_ids)}')
print(f'Number of GRID program sets not in Grid programs {len(program_sets_not_in_items)}')
print(f'Number of vox-grn programs not in Grid programs or sets: {len(vox_extra_ids - program_sets_not_in_items)}')
print(f'Number of vox-grn programs not in Grid program sets: {len(vox_grn_program_ids-grid_program_set_ids)}')

Number of GRID programs: 14983
Number of GRID program sets: 14999
Number of vox-grn programs: 13004
Number of vox_grn programs not in Grid programs 183
Number of GRID program sets not in Grid programs 1384
Number of vox-grn programs not in Grid programs or sets: 0
Number of vox-grn programs not in Grid program sets: 0
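
The counts above come from plain set differences. A minimal sketch with made-up identifiers (none of these are real GRN programs) shows the same checks, plus the `<=` subset operator as a direct test of the superset condition:

```python
# Made-up identifiers, for illustration only.
toy_grid_set_ids = {"00001", "00002", "00003", "00004"}
toy_grid_item_ids = {"00001", "00002"}
toy_vox_ids = {"00001", "00003"}

# vox programs with no item-level records in GRID.
missing_from_items = toy_vox_ids - toy_grid_item_ids

# vox programs with no program set record in GRID.
missing_from_sets = toy_vox_ids - toy_grid_set_ids

# True when every vox program is covered by the GRID program sets.
sets_cover_vox = toy_vox_ids <= toy_grid_set_ids

print(missing_from_items, missing_from_sets, sets_cover_vox)
```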


So the program sets are a superset of vox-grn, which is good. But the program set entries only have information on entire sets, not on individual items. Which vox-grn programs are not in the GRID programs?

In [19]:
print(f'Vox-grn programs not in GRID programs: {vox_extra_ids}')

Vox-grn programs not in GRID programs: {'85254', '67184', '67271', '63363', '64570', '62736', '38273', '66970', '37764', '38057', '68013', '63128', '82734', '67077', '63105', '33091', '63092', '64619', '81718', '82784', '81771', '66422', '64618', '37772', '67160', '66836', '25190', '78074', '38059', '38053', '74935', '67082', '67162', '63404', '80624', '64866', '67161', '78076', '82747', '80776', '67288', '67062', '80757', '37763', '38051', '63096', '27171', '32190', '63129', '63611', '67157', '63107', '78060', '67153', '82757', '78130', '65216', '62841', '67060', '38188', '67185', '37160', '85234', '33090', '38047', '24740', '80787', '67253', '85249', '78042', '67081', '81768', '67155', '67156', '63095', '35890', '66314', '74806', '82760', '38055', '65463', '66940', '63127', '81755', '78075', '66933', '63094', '38190', '80667', '79146', '82758', '75258', '66412', '35900', '14961', '66328', '80786', '64817', '82793', '29560', '67159', '65464', '67191', '64974', '82797', '67187', '65457

So although this is a small number (183 out of 13004), it means there are some items for which we have set information but no item-specific information. Let's see which recordings lack item information:

In [12]:
pd.set_option('display.max_colwidth', 300)
vox_extras = vox_df[vox_df['program'].isin(vox_extra_ids)]
print(vox_extras.head())

                                                        file language name  \
38   ./Audio_MP3/62/62841/Ambai Bible Readings 002 62841.mp3         Ambai   
39   ./Audio_MP3/62/62841/Ambai Bible Readings 001 62841.mp3         Ambai   
0   ./Audio_MP3/37/37567/Arhuaco Mark Portions 008 37567.mp3       Arhuaco   
1   ./Audio_MP3/37/37567/Arhuaco Mark Portions 009 37567.mp3       Arhuaco   
2   ./Audio_MP3/37/37567/Arhuaco Mark Portions 011 37567.mp3       Arhuaco   

                location                  copyright    year  disguised  \
38                   NaN  Global Recordings Network  2003.0      False   
39                   NaN  Global Recordings Network  2003.0      False   
0   Villavicencion, Meta  Global Recordings Network  2007.0      False   
1   Villavicencion, Meta  Global Recordings Network  2007.0      False   
2   Villavicencion, Meta  Global Recordings Network  2007.0      False   

         length  iso program  
38  1249.589875  amk   62841  
39  1684.266708  amk   6

Are they all scripture readings?

In [13]:
pd.set_option('display.max_colwidth', 300)
pd.set_option('display.max_rows', 200)
print(vox_extras['file'])


38                                                                     ./Audio_MP3/62/62841/Ambai Bible Readings 002 62841.mp3
39                                                                     ./Audio_MP3/62/62841/Ambai Bible Readings 001 62841.mp3
0                                                                     ./Audio_MP3/37/37567/Arhuaco Mark Portions 008 37567.mp3
1                                                                     ./Audio_MP3/37/37567/Arhuaco Mark Portions 009 37567.mp3
2                                                                     ./Audio_MP3/37/37567/Arhuaco Mark Portions 011 37567.mp3
                                                                ...                                                           
9                     ./Audio_MP3/67/67062/Zapoteco de Tavehua Las Palabras y Hechos de Jesucrist 002 The Lost Sheep 67062.mp3
13                    ./Audio_MP3/67/67062/Zapoteco de Tavehua Las Palabras y Hechos de Jesucrist 003 The New N

So, no, they are not all scripture readings and furthermore some of them do contain multiple items.


## Language Consistency
The first test is to check that the vox-grn ISO language matches the GRID program set language.

In [18]:
# form a dictionary of program set number to iso language
program_set_language = dict(zip(grid_sets['program'], grid_sets['iso']))
vox_df['grid iso'] = vox_df['program'].map(program_set_language)
vox_lang_discrepancy = vox_df[vox_df["grid iso"] != vox_df["iso"]]
print(f'Number of ISO language discrepancies: {len(vox_lang_discrepancy)} out of {len(vox_df)} files.')

Number of ISO language discrepancies: 3073 out of 202263 files.


So about 1.5% of files have a discrepancy in the language they are labelled with. 
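
One caveat worth checking: in pandas, `!=` evaluates to True when either side is missing, so a file whose program set has no ISO code would also be counted as a discrepancy. The sketch below uses a made-up three-row frame (`toy` is hypothetical) to separate genuine mismatches from missing values, and reproduces the percentage from the counts printed above.

```python
import pandas as pd

# Made-up rows: a match, a genuine mismatch, and a missing GRID language.
toy = pd.DataFrame({
    "iso": ["amk", "arh", "zav"],
    "grid iso": ["amk", "spa", None],
})

# Same comparison as used above; missing values compare as "different".
differs = toy["grid iso"] != toy["iso"]

# Separate rows where GRID simply has no language from real disagreements.
missing = toy["grid iso"].isna()
true_mismatch = differs & ~missing

# Discrepancy rate from the counts reported by the notebook.
rate_percent = 3073 / 202263 * 100

print(differs.tolist(), true_mismatch.tolist(), round(rate_percent, 2))
```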

One of the shortcomings of using the program sets is that each is labelled with just one language when there may be multiple languages in the set. To see whether that is the source of the discrepancy, check the files with a discrepancy against the program item data.

First of all, are all the files with a discrepancy in the grid_items? We cannot check this exactly because we do not know which items are in each file. But we can check whether the programs are contained in both.

In [24]:
vox_disc_program_ids = set(vox_lang_discrepancy['program'])
print(f'Number of programs of files with discrepancies ({len(vox_disc_program_ids)}) not in grid_item\'s programs: {len(vox_disc_program_ids-grid_item_program_ids)}.')


Number of programs of files with discrepancies (700) not in grid_item's programs: 44.


So most of the files are of programs that are also in the grid_items. See how many of these have language classifications that correspond to the language classifications given in the grid items.

In [48]:
vox_disc_in_programs = vox_lang_discrepancy[vox_lang_discrepancy['program'].isin(grid_item_program_ids)].copy()

# form a dictionary of grid program id to languages
grid_program_to_language_dict = dict()
def determine_program_language(item):
    global grid_program_to_language_dict
    grid_program_to_language_dict.setdefault(item['prog_no'], set()).add(item['iso'])

grid_items.apply(determine_program_language, axis=1)
print(len(grid_item_program_ids))


14983


In [60]:
# Now use the above dictionary to confirm if the language listed for the file is in the program.
def language_plausibility_check(vox_file):
    global grid_program_to_language_dict
    lang_set = grid_program_to_language_dict[vox_file.program]
    return vox_file['iso'] in lang_set
   

In [61]:
vox_disc_in_programs['lang in program'] = vox_disc_in_programs.apply(language_plausibility_check, axis=1)
print(f'Language found to be plausible in {sum(vox_disc_in_programs["lang in program"])} cases out of {len(vox_disc_in_programs)}')

Language found to be plausible in 1490 cases out of 2577


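So there are 1087 files (2577 - 1490) where the labelled language does not appear among the item languages of the program, and so does not appear to be plausible. The remaining 496 files with a discrepancy (3073 - 2577) belong to programs that are not in grid_items and could not be checked this way.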
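
The plausibility check can also be expressed with a `groupby`, which avoids calling `apply` purely for its side effect. This is a sketch on made-up data, not the code used to produce the numbers above; `toy_items` and `toy_files` are hypothetical stand-ins for `grid_items` and the discrepancy frame.

```python
import pandas as pd

# Made-up item records: program 00001 has items in two languages.
toy_items = pd.DataFrame({
    "prog_no": ["00001", "00001", "00002"],
    "iso": ["amk", "arh", "amk"],
})

# Map each program to the set of languages found among its items.
program_languages = toy_items.groupby("prog_no")["iso"].apply(set).to_dict()

# Made-up files labelled with a language that may or may not be in the program.
toy_files = pd.DataFrame({
    "program": ["00001", "00002", "00003"],
    "iso": ["arh", "arh", "amk"],
})

# A label is plausible if some item in the same program has that language.
# Programs with no item records fall back to an empty set.
toy_files["lang in program"] = [
    iso in program_languages.get(prog, set())
    for prog, iso in zip(toy_files["program"], toy_files["iso"])
]

print(toy_files["lang in program"].tolist())
```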

## Items and Compound Files
One of the difficulties with the GRN data is that occasionally multiple items are placed in a single audio file. The names of the mp3 files follow a fixed pattern:

In [62]:
print(vox_df.iloc[0].file)

./Audio_MP3/13/13981/Alumu-Tesu Messages 002 What is a Christian ♦ The Woman at the Well ♦ God's Answers ♦ Jes 13981.mp3


This follows the pattern:

```./Audio_MP3/QQ/PPPPP/[Program Title] NNN [Item Title 1] ♦ [Item Title 2] ♦ ... PPPPP.mp3```

where
* ```QQ``` are the first two numerals in the 5 digit program identifier
* ```PPPPP``` is the 5 digit program identifier
* ```NNN``` is a 3 digit file identifier

The 3 digit file identifier is NOT to be confused with GRN's item number. Where there is one file per item, which is true for the bulk of the data, the file identifier and the item number are the same. However, a significant number of the files contain multiple items. These files can be counted by testing each file name for the diamond character (♦), which is Unicode U+2666.

In [63]:
vox_compound = vox_df[vox_df['file'].str.contains('\u2666')]
print(f'The number of compound files in vox-grn: {vox_compound.shape[0]} out of a total of {vox_df.shape[0]} files.')

The number of compound files in vox-grn: 9894 out of a total of 202263 files.
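
The file-name pattern described above can be parsed with a regular expression. The helper below is a sketch, not part of the original analysis: `parse_vox_path` and its field names are hypothetical, and it assumes the first three-digit token after the program title is the file identifier, so a program title that itself contains a three-digit number would be split incorrectly. Item titles can be cut short in the file name (the last title in the example ends in "Jes"), so the parsed titles may be incomplete.

```python
import re

# Hypothetical parser for paths of the form
# ./Audio_MP3/QQ/PPPPP/[Program Title] NNN [Item Titles] PPPPP.mp3
FILE_PATTERN = re.compile(
    r"\./Audio_MP3/(?P<prefix>\d{2})/(?P<program>\d{5})/"
    r"(?P<title>.*?) (?P<file_no>\d{3})(?: (?P<items>.*))? (?P=program)\.mp3$"
)


def parse_vox_path(path):
    """Return the parts of a vox-grn file path, or None if it does not match."""
    match = FILE_PATTERN.match(path)
    if match is None:
        return None
    parts = match.groupdict()
    raw_items = parts.pop("items")
    # Item titles are separated by the diamond character (U+2666).
    parts["items"] = (
        [title.strip() for title in raw_items.split("\u2666")] if raw_items else []
    )
    return parts


example = (
    "./Audio_MP3/13/13981/Alumu-Tesu Messages 002 What is a Christian "
    "\u2666 The Woman at the Well \u2666 God's Answers \u2666 Jes 13981.mp3"
)
parsed = parse_vox_path(example)
print(parsed["title"], parsed["file_no"], len(parsed["items"]))
```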


There are a few consequences of multiple items being in one file:
* GRN's database includes start and end times for each item in a compound file. This information is not present in vox-grn.
* Hugging Face datasets have no mechanism to allow multiple audio samples to be extracted from the one file.
* If the files had been split on the basis of each item there would be about 40000 more audio samples in the database.
* Although each of GRN's items only ever have one language, programs can contain items from multiple languages. Some of the compound files, which have been universally associated with a single ISO language, actually contain items from different languages.

Although the first points are annoying, the last point makes the database inaccurate with plausible but wrong classifications. The next section will determine the extent of miscategorised content.
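
The number of extra samples that splitting would produce can in principle be estimated from the file names: a file with k diamond separators holds k + 1 items, so splitting it adds k samples. The sketch below uses made-up file names (`toy_files` is hypothetical). It assumes no separators are lost to truncation; since real file names appear to be truncated, the same count on vox-grn would likely be an underestimate.

```python
import pandas as pd

# Made-up file names: one single-item file and two compound files.
toy_files = pd.Series([
    "Prog 001 Only Item 00001.mp3",
    "Prog 002 First \u2666 Second \u2666 Third 00001.mp3",
    "Prog 003 A \u2666 B 00001.mp3",
])

# Each diamond separates two items, so items = separators + 1.
separators = toy_files.str.count("\u2666")
items_per_file = separators + 1

# Splitting every file into its items adds one sample per separator.
extra_samples = int(separators.sum())

print(items_per_file.tolist(), extra_samples)
```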


## Compound Files with Multiple Languages
To find the vox-grn files that may contain multiple languages, first select the files that contain multiple items.


In [68]:
vox_multiple_items = vox_df[vox_df['file'].str.contains('\u2666')].copy()
print(vox_multiple_items.shape)

(9894, 10)


In [69]:
def program_has_multiple_languages(vox_file):
    # Programs missing from grid_items are treated as single-language.
    lang_set = grid_program_to_language_dict.get(vox_file.program, set())
    return len(lang_set) > 1

vox_multiple_items['multiple languages'] = vox_multiple_items.apply(program_has_multiple_languages, axis=1)
print(f'Of the files containing multiple items({len(vox_multiple_items)}) {sum(vox_multiple_items["multiple languages"])} have programs with more than one language.')


Of the files containing multiple items(9894) 2546 have programs with more than one language.
