## Dataset details

I will use the files listed in the `limited_supervision` folder for each language.

I will check the total length of the concatenated audio (does it add up to 1 hour?) and the proportion of male and female recordings.

In [1]:
import os
import gc
import pandas as pd
import numpy as np
import glob
import json

import librosa
import librosa.display
import IPython.display as ipd

import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
LANGUAGE_PATH = "../data/mls/mls_{}"
IDS_PATH = os.path.join(LANGUAGE_PATH, "train/limited_supervision/1hr/{}") # language, folder
AUDIO_PATH = os.path.join(LANGUAGE_PATH, "train/audio/{}/{}/{}.flac") # language, speaker, book, id
LANGUAGES = ["italian", "polish", "portuguese", "spanish"]
FOLDERS = list(range(0, 6))

In [3]:
FOLDERS

[0, 1, 2, 3, 4, 5]
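
To make the path templates concrete, the sketch below expands them for the first Italian ID and for folder `0`. It is only an illustration and assumes a POSIX-style path separator; the first placeholder of each template is the one inherited from `LANGUAGE_PATH`.

```python
import os

LANGUAGE_PATH = "../data/mls/mls_{}"
IDS_PATH = os.path.join(LANGUAGE_PATH, "train/limited_supervision/1hr/{}")
AUDIO_PATH = os.path.join(LANGUAGE_PATH, "train/audio/{}/{}/{}.flac")

# Placeholders are filled in order: language first, then the remaining fields.
handles_path = os.path.join(IDS_PATH.format("italian", 0), "handles.txt")
audio_path = AUDIO_PATH.format("italian", "8828", "8610", "8828_8610_000510")

print(handles_path)
print(audio_path)
```

The audio path produced this way is the same one printed for the first file when the audio is loaded.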

I will collect all the IDs listed in the `handles.txt` files.

In [4]:
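def get_content_from_file(filepath):
    # Read one ID per line. rstrip("\n") removes only the newline, so the
    # last ID is not truncated when the file has no trailing newline.
    with open(filepath) as f:
        lines = [line.rstrip("\n") for line in f]
    return lines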

In [5]:
#get_content_from_file("../data/mls/mls_italian/train/limited_supervision/1hr/0/handles.txt")

In [6]:
ids_map = pd.DataFrame(columns=["language", "ids"])
for language in LANGUAGES:
    print("Idioma: {}".format(language))
    ids = []
    for folder in FOLDERS:
        print("Folder {}".format(folder))
        data_path = IDS_PATH.format(language, folder)
        content = get_content_from_file(os.path.join(data_path, "handles.txt"))
        ids = ids + content
    ids_map.loc[len(ids_map), :] = [language, ids]
    print()

Idioma: italian
Folder 0
Folder 1
Folder 2
Folder 3
Folder 4
Folder 5

Idioma: polish
Folder 0
Folder 1
Folder 2
Folder 3
Folder 4
Folder 5

Idioma: portuguese
Folder 0
Folder 1
Folder 2
Folder 3
Folder 4
Folder 5

Idioma: spanish
Folder 0
Folder 1
Folder 2
Folder 3
Folder 4
Folder 5





In [7]:
ids_map

Unnamed: 0,language,ids
0,italian,"[8828_8610_000510, 659_547_000790, 643_529_000..."
1,polish,"[4228_1447_000021, 7014_6288_000494, 1329_1447..."
2,portuguese,"[10199_6390_000016, 12249_12879_002169, 3718_2..."
3,spanish,"[12332_10604_000010, 11772_11957_000011, 13690..."


How many audio clips do we have for each language?

In [8]:
ids_map["count"] = ids_map["ids"].apply(len)

In [9]:
ids_map

Unnamed: 0,language,ids,count
0,italian,"[8828_8610_000510, 659_547_000790, 643_529_000...",240
1,polish,"[4228_1447_000021, 7014_6288_000494, 1329_1447...",238
2,portuguese,"[10199_6390_000016, 12249_12879_002169, 3718_2...",236
3,spanish,"[12332_10604_000010, 11772_11957_000011, 13690...",233


In [10]:
ids_map["count"].sum()

947

Let's load the audio to understand the characteristics of our dataset.

In [11]:
ids_map = ids_map.drop(columns=["count"])

In [12]:
df_audios = ids_map.explode("ids").reset_index(drop=True)
df_audios = df_audios.rename(columns={"ids":"id"})

In [13]:
# "id", "language",  "speaker", "book", "segment"
df_audios["id_splitted"] = df_audios["id"].str.split("_")
df_audios["speaker"] = df_audios["id_splitted"].apply(lambda x : x[0])
df_audios["book"] = df_audios["id_splitted"].apply(lambda x : x[1])
df_audios["segment"] = df_audios["id_splitted"].apply(lambda x : x[2])
df_audios = df_audios.drop(columns=["id_splitted"])

In [14]:
df_audios = df_audios.astype(str)

In [15]:
df_audios

Unnamed: 0,language,id,speaker,book,segment
0,italian,8828_8610_000510,8828,8610,000510
1,italian,659_547_000790,659,547,000790
2,italian,643_529_000116,643,529,000116
3,italian,8828_8610_000321,8828,8610,000321
4,italian,659_547_000279,659,547,000279
...,...,...,...,...,...
942,spanish,6156_4006_000020,6156,4006,000020
943,spanish,101_567_000034,101,567,000034
944,spanish,8881_8550_000536,8881,8550,000536
945,spanish,6156_4006_000028,6156,4006,000028
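
The ID split above can also be written in a single step. The sketch below is an alternative, not the notebook's own code; it uses a two-row toy frame and relies on `str.split(..., expand=True)` returning one column per piece.

```python
import pandas as pd

df = pd.DataFrame({"id": ["8828_8610_000510", "659_547_000790"]})

# expand=True yields one column per piece of the ID (speaker, book, segment).
# The pieces stay strings, so the segment keeps its leading zeros.
df[["speaker", "book", "segment"]] = df["id"].str.split("_", expand=True)

print(df)
```

Keeping the pieces as strings matters here because the same values are later used as merge keys and to rebuild file paths.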


In [16]:
# metainfo.txt (one per language): "gender", "title"
df_metainfo = pd.DataFrame()
for language in df_audios["language"].unique():
    metainfo_path = os.path.join(LANGUAGE_PATH, "metainfo.txt").format(language)
    df_aux = pd.read_csv(metainfo_path, sep="|", header=0, encoding="utf-8", dtype='str')
    df_aux.columns = df_aux.columns.str.strip().str.lower()
    df_aux["language"] = language
    df_metainfo = pd.concat([df_metainfo, df_aux])

df_metainfo = df_metainfo.apply(lambda x : x.str.strip())
df_metainfo = df_metainfo.rename(columns={"book id":"book"})

In [17]:
df_metainfo.isna().sum()

speaker      0
gender       0
partition    0
minutes      0
book         0
title        0
chapter      0
language     0
dtype: int64

In [18]:
df_metainfo.head(2)

Unnamed: 0,speaker,gender,partition,minutes,book,title,chapter,language
0,6001,F,train,31.716,10011,"Piacevoli Notti, Libro 1",Notte Prima: FAVOLA II,italian
1,6001,F,train,25.961,10011,"Piacevoli Notti, Libro 1",Notte Quinta: FAVOLA II,italian


In [19]:
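# metainfo.txt has one row per chapter, so each (language, speaker, book) key
# matches several rows and the merge temporarily inflates the row count.
df_audios = pd.merge(
    df_audios,
    df_metainfo[['speaker', 'gender', 'book', 'title', 'language']],
    on=["language", "speaker", "book"],
    how="left",
    indicator=True
)
print(df_audios["_merge"].value_counts())
# Dropping exact duplicates brings the table back to one row per audio ID.
df_audios = df_audios.drop_duplicates().reset_index(drop=True)
print(df_audios["_merge"].value_counts())
df_audios = df_audios.drop(columns=["_merge"])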

both          8841
left_only        0
right_only       0
Name: _merge, dtype: int64
both          947
left_only       0
right_only      0
Name: _merge, dtype: int64


In [20]:
df_audios

Unnamed: 0,language,id,speaker,book,segment,gender,title
0,italian,8828_8610_000510,8828,8610,000510,F,"Novelle per un Anno, vol. 12: Il Viaggio"
1,italian,659_547_000790,659,547,000790,F,Avventure di Pinocchio
2,italian,643_529_000116,643,529,000116,F,Divina Commedia
3,italian,8828_8610_000321,8828,8610,000321,F,"Novelle per un Anno, vol. 12: Il Viaggio"
4,italian,659_547_000279,659,547,000279,F,Avventure di Pinocchio
...,...,...,...,...,...,...,...
942,spanish,6156_4006_000020,6156,4006,000020,M,Condenada y Otros Cuentos
943,spanish,101_567_000034,101,567,000034,M,Don Quijote 1
944,spanish,8881_8550_000536,8881,8550,000536,M,Aprendiz de Conspirador
945,spanish,6156_4006_000028,6156,4006,000028,M,Condenada y Otros Cuentos


What is the proportion of male and female recordings per language?

In [21]:
men_women_proportion = df_audios.groupby("language")["gender"].value_counts().rename("counts").to_frame().reset_index()
men_women_proportion = men_women_proportion.pivot(index="language", columns="gender", values="counts")
men_women_proportion["total"] = men_women_proportion["M"] + men_women_proportion["F"]
men_women_proportion["ptc_women"] = np.round(men_women_proportion["F"] / men_women_proportion["total"] * 100, 2)
men_women_proportion["ptc_men"] = np.round(men_women_proportion["M"] / men_women_proportion["total"] * 100, 2)
men_women_proportion = men_women_proportion[['F', 'M', 'ptc_women', 'ptc_men', 'total']]
men_women_proportion

language,F,M,ptc_women,ptc_men,total
italian,116,124,48.33,51.67,240
polish,118,120,49.58,50.42,238
portuguese,119,117,50.42,49.58,236
spanish,115,118,49.36,50.64,233
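
The same kind of table can be obtained more compactly with `pd.crosstab`. The sketch below is an alternative on toy data (the rows are invented, not taken from the dataset): one call gives the counts and a second call with `normalize="index"` gives the per-language percentages.

```python
import pandas as pd

df = pd.DataFrame({
    "language": ["italian", "italian", "italian", "polish", "polish"],
    "gender": ["F", "M", "M", "F", "M"],
})

# Absolute counts per language and gender.
counts = pd.crosstab(df["language"], df["gender"])

# normalize="index" divides each row by its total, giving shares per language.
pct = (pd.crosstab(df["language"], df["gender"], normalize="index") * 100).round(2)

print(counts)
print(pct)
```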


In [22]:
# audio
audios = []
lengths = []
srs = []
for i, row in df_audios.iterrows():
    print(i)
    print("Idioma: {}".format(row["language"]))
    print("ID: {}".format(row["id"]))
    
    audio_path = AUDIO_PATH.format(row["language"], row["speaker"], row["book"], row["id"])
    audio, sr = librosa.load(audio_path, sr=16000, mono=True) # the sample rate is given in README.md
    length = len(audio)
    print("Leu {} com sucesso!".format(audio_path))
    
    audios.append(audio)
    srs.append(sr)
    lengths.append(length)
    
    print()

0
Idioma: italian
ID: 8828_8610_000510
Leu ../data/mls/mls_italian/train/audio/8828/8610/8828_8610_000510.flac com sucesso!

1
Idioma: italian
ID: 659_547_000790
Leu ../data/mls/mls_italian/train/audio/659/547/659_547_000790.flac com sucesso!

...

908
Idioma: spanish
ID: 407_567_000025
Leu ../data/mls/mls_spanish/train/audio/407/567/407_567_000025.flac com sucesso!

...

In [23]:
df_audios["audio"] = audios
df_audios["length"] = lengths
df_audios["sr"] = srs

In [24]:
df_audios

Unnamed: 0,language,id,speaker,book,segment,gender,title,audio,length,sr
0,italian,8828_8610_000510,8828,8610,000510,F,"Novelle per un Anno, vol. 12: Il Viaggio","[0.0, 3.0517578e-05, 3.0517578e-05, 6.1035156e...",318560,16000
1,italian,659_547_000790,659,547,000790,F,Avventure di Pinocchio,"[0.0, 3.0517578e-05, 0.0, 3.0517578e-05, 3.051...",186880,16000
2,italian,643_529_000116,643,529,000116,F,Divina Commedia,"[-0.007232666, -0.00491333, -0.0054016113, -0....",314560,16000
3,italian,8828_8610_000321,8828,8610,000321,F,"Novelle per un Anno, vol. 12: Il Viaggio","[0.0, 3.0517578e-05, 0.0, 0.0, 3.0517578e-05, ...",229920,16000
4,italian,659_547_000279,659,547,000279,F,Avventure di Pinocchio,"[0.0, 0.0, 0.0, -3.0517578e-05, -9.1552734e-05...",197440,16000
...,...,...,...,...,...,...,...,...,...,...
942,spanish,6156_4006_000020,6156,4006,000020,M,Condenada y Otros Cuentos,"[0.0, 6.1035156e-05, 9.1552734e-05, 6.1035156e...",250880,16000
943,spanish,101_567_000034,101,567,000034,M,Don Quijote 1,"[0.0020751953, 0.0024108887, 0.002166748, 0.00...",297600,16000
944,spanish,8881_8550_000536,8881,8550,000536,M,Aprendiz de Conspirador,"[0.00076293945, 0.00064086914, 0.0002746582, 0...",162080,16000
945,spanish,6156_4006_000028,6156,4006,000028,M,Condenada y Otros Cuentos,"[-0.00036621094, -0.0004272461, -0.00045776367...",238240,16000


In [25]:
gc.collect()

22

How much audio do we have for each language?

In [26]:
df_audios["duration_s"] = df_audios.apply(lambda x : librosa.get_duration(y=x["audio"], sr=x["sr"]), axis=1)

In [27]:
df_audios.head()

Unnamed: 0,language,id,speaker,book,segment,gender,title,audio,length,sr,duration_s
0,italian,8828_8610_000510,8828,8610,510,F,"Novelle per un Anno, vol. 12: Il Viaggio","[0.0, 3.0517578e-05, 3.0517578e-05, 6.1035156e...",318560,16000,19.91
1,italian,659_547_000790,659,547,790,F,Avventure di Pinocchio,"[0.0, 3.0517578e-05, 0.0, 3.0517578e-05, 3.051...",186880,16000,11.68
2,italian,643_529_000116,643,529,116,F,Divina Commedia,"[-0.007232666, -0.00491333, -0.0054016113, -0....",314560,16000,19.66
3,italian,8828_8610_000321,8828,8610,321,F,"Novelle per un Anno, vol. 12: Il Viaggio","[0.0, 3.0517578e-05, 0.0, 0.0, 3.0517578e-05, ...",229920,16000,14.37
4,italian,659_547_000279,659,547,279,F,Avventure di Pinocchio,"[0.0, 0.0, 0.0, -3.0517578e-05, -9.1552734e-05...",197440,16000,12.34
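
Since every clip is loaded at 16000 Hz, the duration is simply the number of samples divided by the sample rate. The short check below reproduces the `duration_s` values from the `length` column for the first three rows of the table; it is a sanity check, not a replacement for `librosa.get_duration`.

```python
SAMPLE_RATE = 16000  # Hz, the rate passed to librosa.load

# Sample counts copied from the "length" column of the first three rows.
lengths = {
    "8828_8610_000510": 318560,
    "659_547_000790": 186880,
    "643_529_000116": 314560,
}

# duration in seconds = number of samples / samples per second
durations = {clip_id: n / SAMPLE_RATE for clip_id, n in lengths.items()}

for clip_id, seconds in durations.items():
    print(clip_id, round(seconds, 2))
```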


In [28]:
seconds_total = df_audios.groupby("language")["duration_s"].sum()
seconds_total

language
italian       3559.770125
polish        3575.904312
portuguese    3581.950375
spanish       3580.304125
Name: duration_s, dtype: float64

In [29]:
minutes = seconds_total / 60
seconds = seconds_total - (minutes).astype(int) * 60

In [30]:
pd.concat([minutes.astype(int).rename("minutes").to_frame(), seconds.astype(int).rename("seconds").to_frame()], axis=1)

language,minutes,seconds
italian,59,19
polish,59,35
portuguese,59,41
spanish,59,40
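
The minutes and seconds split can also be done with `divmod`, which makes it easy to see how far each language is from a full hour. The sketch below uses the per-language totals printed earlier; the `breakdown` dictionary is a helper introduced only for this illustration.

```python
# Total seconds per language, copied from the groupby output.
totals = {
    "italian": 3559.770125,
    "polish": 3575.904312,
    "portuguese": 3581.950375,
    "spanish": 3580.304125,
}

breakdown = {}
for language, total in totals.items():
    # divmod returns (whole minutes, remaining whole seconds).
    minutes, seconds = divmod(int(total), 60)
    breakdown[language] = (minutes, seconds)
    missing = 3600 - total
    print(f"{language}: {minutes} min {seconds} s ({missing:.1f} s short of one hour)")
```

Every language falls slightly under one hour, by less than a minute in each case.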


I save the audio as strings.

In [40]:
%%time
df_audios["audio"] = df_audios.apply(lambda x : json.dumps(x["audio"].tolist()), axis=1)

CPU times: user 1min 8s, sys: 3.09 s, total: 1min 12s
Wall time: 1min 14s


In [42]:
df_audios.iloc[0]["audio"]

'[0.0, 3.0517578125e-05, 3.0517578125e-05, 6.103515625e-05, -3.0517578125e-05, 3.0517578125e-05, 3.0517578125e-05, 3.0517578125e-05, 6.103515625e-05, 0.0, 3.0517578125e-05, -3.0517578125e-05, -3.0517578125e-05, 0.0, 0.0, 0.0, 3.0517578125e-05, 0.0, 3.0517578125e-05, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 3.0517578125e-05, 3.0517578125e-05, 3.0517578125e-05, 0.0, 3.0517578125e-05, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, -6.103515625e-05, -3.0517578125e-05, -3.0517578125e-05, -3.0517578125e-05, -6.103515625e-05, -6.103515625e-05, -6.103515625e-05, -6.103515625e-05, -3.0517578125e-05, 0.0, -6.103515625e-05, -6.103515625e-05, -3.0517578125e-05, 0.0, -3.0517578125e-05, -3.0517578125e-05, 0.0, -3.0517578125e-05, 0.0, -3.0517578125e-05, -3.0517578125e-05, -3.0517578125e-05, 3.0517578125e-05, 3.0517578125e-05, 0.0, 0.0, 3.0517578125e-05, 0.0, 6.103515625e-05, 3.0517578125e-05, 3.0517578125e-05, 3.0517578125e-05, 3.0517578125e-05, 3.0517578125e-05, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, -6.10
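
When the CSV is read back, each string has to be decoded into an array again. The sketch below shows one way to do the round trip on a three-sample toy array; casting back to `float32` is an assumption about the desired dtype, based on the values shown in the table.

```python
import json

import numpy as np

audio = np.array([0.0, 3.0517578e-05, -6.1035156e-05], dtype=np.float32)

# Encode: NumPy array -> Python list -> JSON string (what is stored in the CSV).
encoded = json.dumps(audio.tolist())

# Decode: JSON string -> Python list -> NumPy array with the original dtype.
decoded = np.asarray(json.loads(encoded), dtype=np.float32)

print(type(encoded).__name__, decoded.dtype)
print(np.array_equal(audio, decoded))
```

Storing samples as text is convenient but much larger than the raw samples, which is consistent with the encoding step taking over a minute here.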

I save the final dataframe.

In [43]:
%%time
df_audios.to_csv("../data/audios.csv")

In [44]:
del df_audios