# Setup for segmenting GRN Data for vox-grn
This is the second attempt to split the data for vox-grn. The first segmentation (SegmentVoxN) relied on item metadata, which turned out to be too unreliable to use.
This new set of notebooks is called FVox, reflecting that it is based on files rather than data items.

This notebook is analogous to SegmentVOX. It splits the problem into 10 subsets so that segmentation can be run in parallel.


In [1]:
import pandas as pd
import os

In [2]:
# Read in the description of the input files and drop the unwanted index column left over from an earlier to_csv.
fd = pd.read_csv('../../data/usable_files.csv')
fd.drop(columns=['Unnamed: 0'], inplace=True)
print(fd.columns)


Index(['iso', 'language_name', 'track', 'location', 'year', 'path', 'filename',
       'length', 'program', 'ID'],
      dtype='object')


We need to drop any files with an unknown iso.

In [3]:
fd = fd[fd.iso.notna()]
print(len(fd))

208374


Some filenames contain foreign characters, which appear in the CSV as the Unicode replacement character (U+FFFD). Because these filenames will be carried through the rest of the process, it is a good idea to give the files usable names. Do this and update the usable files list.

In [18]:
import os
import shutil
import glob

def get_foreign_name(path, fname):
    if path[-1] != '/':
        path = path + '/'
    files = glob.glob('/media/programs/' + path + fname.replace('\ufffd', '*'))
    if len(files) == 1:
        return files[0]
    return None

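def rename_file(row):
    # Return the filename to use for this row. If the recorded name is not on disk,
    # look for the real file and copy it to a name with '\ufffd' replaced by '_'.
    path = row.path
    if path[-1] != '/':
        path = path + '/'
    fname = '/media/programs/' + path + row.filename
    if not os.path.isfile(fname):
        foreign_name = get_foreign_name(row.path, row.filename)
        if foreign_name:
            new_name = row.filename.replace('\ufffd', '_')
            # Copy rather than move, so the original file is left in place.
            shutil.copy(foreign_name, '/media/programs/' + path + new_name)
            return new_name

    return row.filename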

fd['new_name'] = fd.apply(rename_file, axis=1)
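
The wildcard lookup can be tried out in isolation. The following is a minimal sketch, assuming a temporary directory and made-up filenames (`track_XY_01.mp3` and `track_ABC_01.mp3` are hypothetical, as is the helper `find_unique_match`). It shows that replacing each `\ufffd` with `*` finds the real file when there is exactly one candidate, and returns nothing when the pattern is ambiguous.

```python
import glob
import os
import tempfile

def find_unique_match(directory, damaged_name):
    """Return the single file matching damaged_name, or None if missing or ambiguous."""
    # glob.escape protects any glob metacharacters in the directory part.
    pattern = os.path.join(glob.escape(directory), damaged_name.replace('\ufffd', '*'))
    matches = glob.glob(pattern)
    return matches[0] if len(matches) == 1 else None

with tempfile.TemporaryDirectory() as tmp:
    # 'XY' is an ASCII stand-in for two characters that were lost in decoding.
    open(os.path.join(tmp, 'track_XY_01.mp3'), 'w').close()
    damaged = 'track_\ufffd\ufffd_01.mp3'

    unique = find_unique_match(tmp, damaged)
    unique_name = os.path.basename(unique) if unique else None

    # A second candidate makes the wildcard ambiguous, so no match is returned.
    open(os.path.join(tmp, 'track_ABC_01.mp3'), 'w').close()
    ambiguous = find_unique_match(tmp, damaged)

# The replacement name that would be recorded for the file.
safe_name = damaged.replace('\ufffd', '_')
print(unique_name, ambiguous, safe_name)
```

Because `*` matches any number of characters, two damaged names in the same directory can match each other's files; returning `None` in that case leaves the original filename untouched rather than guessing.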

In [19]:
renamed_files = fd[fd.new_name != fd.filename].copy()

In [21]:
fd['filename'] = fd.new_name
fd.drop(columns=['new_name'], inplace=True)
fd.to_csv('../../data/usable_files.csv')
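
The `Unnamed: 0` column dropped at the top of this notebook comes from writing the DataFrame index to the CSV. As a small illustration on toy data (the frame below is made up), this sketch compares the default behaviour with `index=False`:

```python
import io
import pandas as pd

toy = pd.DataFrame({'iso': ['aaa', 'bbb'], 'filename': ['a.mp3', 'b.mp3']})

# Default: the index is written as an unnamed first column.
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
cols_default = list(pd.read_csv(with_index).columns)

# index=False: only the data columns are written.
no_index = io.StringIO()
toy.to_csv(no_index, index=False)
no_index.seek(0)
cols_no_index = list(pd.read_csv(no_index).columns)

print(cols_default)
print(cols_no_index)
```

If `index=False` were adopted here, the `drop(columns=['Unnamed: 0'])` call in the first cell would need `errors='ignore'` (or removal), since the column would no longer exist on the next read.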

# Parallel Processing
Through trial and error, the ideal number of jobs to run in parallel was found to be about 10. With more than this, VS Code starts to produce errors. Divide the data into 10 lots.

In [22]:
# Sort by iso so that files for the same language end up next to each other.
sorted_items = fd.sort_values('iso')
n_chunks = 10
n_rows = len(sorted_items)
# Boundaries for exactly n_chunks contiguous slices of near-equal size.
bounds = [i * n_rows // n_chunks for i in range(n_chunks + 1)]
item_df = [sorted_items[bounds[i]:bounds[i + 1]] for i in range(n_chunks)]
# now write the divided rows out as csv files
for i, df in enumerate(item_df):
    df.to_csv(f'/media/originals/fsegs/files_{i}.csv')
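
As an alternative way to cut the sorted rows into 10 lots, `numpy.array_split` can be applied to the row positions. This is only a sketch on a made-up frame of 103 rows, not the method used above. It shows that the split always yields exactly the requested number of pieces, with sizes differing by at most one.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the file table: 103 rows spread over 7 made-up iso codes.
toy = pd.DataFrame({
    'iso': [f'l{i % 7:02d}' for i in range(103)],
    'filename': [f'f{i}.mp3' for i in range(103)],
})
sorted_toy = toy.sort_values('iso')

# Split the row positions, then select each block of rows with iloc.
index_chunks = np.array_split(np.arange(len(sorted_toy)), 10)
chunks = [sorted_toy.iloc[idx] for idx in index_chunks]

sizes = [len(c) for c in chunks]
print(sizes)
```

Splitting positions rather than the DataFrame itself keeps the example independent of how a given pandas version handles being passed directly to `array_split`.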

In [23]:
# read one back to check
df = pd.read_csv('/media/originals/fsegs/files_9.csv')
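
Reading one file back confirms the format, but not that the lots together cover every row. A fuller check might look like the sketch below, which uses a temporary directory and a toy frame (both assumptions for illustration) in place of `/media/originals/fsegs/` and the real data.

```python
import os
import tempfile
import pandas as pd

# Toy stand-in for the file table; the real one has an 'ID' column as well.
toy = pd.DataFrame({'ID': range(25), 'iso': ['aaa'] * 10 + ['bbb'] * 15})
n_chunks = 10
bounds = [i * len(toy) // n_chunks for i in range(n_chunks + 1)]

with tempfile.TemporaryDirectory() as out_dir:
    # Write the lots out in the same files_{i}.csv naming scheme.
    for i in range(n_chunks):
        toy.iloc[bounds[i]:bounds[i + 1]].to_csv(os.path.join(out_dir, f'files_{i}.csv'))
    # Read every lot back, using the first column as the index.
    parts = [
        pd.read_csv(os.path.join(out_dir, f'files_{i}.csv'), index_col=0)
        for i in range(n_chunks)
    ]

combined = pd.concat(parts)
total_rows = len(combined)
all_ids_present = sorted(combined['ID']) == sorted(toy['ID'])
print(total_rows, all_ids_present)
```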

Some stats

In [None]:
iso_langs = fd.iso.value_counts()
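
A few summary figures can be read straight off the `value_counts` result. The sketch below uses a toy frame with made-up iso codes; the variable names other than `iso` are illustrative only.

```python
import pandas as pd

toy = pd.DataFrame({'iso': ['aaa', 'bbb', 'aaa', 'ccc', 'aaa', 'bbb']})
iso_counts = toy.iso.value_counts()

n_languages = len(iso_counts)              # number of distinct iso codes
top_iso = iso_counts.index[0]              # code with the most files
singletons = int((iso_counts == 1).sum())  # codes with only one file

print(n_languages, top_iso, singletons)
print(iso_counts.describe())
```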