<a href="https://colab.research.google.com/github/gkv856/util_repo/blob/master/How_to_download_Voxceleb1_audio_data_(demo_for_India_celeb).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# README
This notebook shows how to download audio data from the VoxCeleb1 dataset.
A couple of things to note:


1.   The audio data is split into multiple part files on the VoxCeleb website. You can find more details [here](https://www.robots.ox.ac.uk/~vgg/data/voxceleb/vox1.html).
2.   We will use DevA as a sample, but you can scale this up as convenient.

**Steps to get the data**


*   Download the part files using wget (depending on the available disk space, you can download all of them or just one)
*   Each part file is 10 GB in size
*   Merge the part files into one '.zip' file
*   Extract the zip file into regular 'wav' files
*   Use a function to **move** the desired celebrities' wav files to another folder
*   Zip the final folder, then download and use it
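
Since the number of parts you can download depends on free disk space, it may help to check that first. The following is a minimal sketch using `shutil.disk_usage` from the standard library; the helper name `parts_that_fit` and the `free_bytes` override are hypothetical and exist only for this illustration.

```python
import shutil

# 10 GiB per part file (10 * 1024**3 = 10737418240 bytes)
PART_SIZE_BYTES = 10 * 1024**3


def parts_that_fit(path=".", part_size=PART_SIZE_BYTES, free_bytes=None):
    """Return how many part files could be stored at `path`.

    The merged zip and the extracted wav files need extra room on top of
    the parts themselves, so treat the result as an upper bound.
    """
    if free_bytes is None:
        # Ask the operating system how much space is free at `path`
        free_bytes = shutil.disk_usage(path).free
    return free_bytes // part_size


print(f"Room for roughly {parts_that_fit()} part file(s) in the current directory")
```

Merging and extracting both create additional copies of the data, so leaving a generous margin beyond this number is probably wise.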




# Download and explore the metadata

In [45]:
import pandas as pd

In [None]:
df = pd.read_csv("https://www.robots.ox.ac.uk/~vgg/data/voxceleb/meta/vox1_meta.csv", delimiter="\t")
df.head()

Unnamed: 0,VoxCeleb1 ID,VGGFace1 ID,Gender,Nationality,Set
0,id10001,A.J._Buckley,m,Ireland,dev
1,id10002,A.R._Rahman,m,India,dev
2,id10003,Aamir_Khan,m,India,dev
3,id10004,Aaron_Tveit,m,USA,dev
4,id10005,Aaron_Yoo,m,USA,dev


In [None]:
# Create another DataFrame containing only the rows whose Nationality is India,
# since only Indian celebrities are of interest here.
df_india = df[df["Nationality"]=="India"]
df_india.head()

Unnamed: 0,VoxCeleb1 ID,VGGFace1 ID,Gender,Nationality,Set
1,id10002,A.R._Rahman,m,India,dev
2,id10003,Aamir_Khan,m,India,dev
16,id10017,Ajay_Devgn,m,India,dev
17,id10018,Akshay_Kumar,m,India,dev
44,id10045,Amitabh_Bachchan,m,India,dev


# Download part files

In [None]:
!wget https://thor.robots.ox.ac.uk/~vgg/data/voxceleb/vox1a/vox1_dev_wav_partaa

--2021-10-07 07:24:51--  https://thor.robots.ox.ac.uk/~vgg/data/voxceleb/vox1a/vox1_dev_wav_partaa
Resolving thor.robots.ox.ac.uk (thor.robots.ox.ac.uk)... 129.67.95.98
Connecting to thor.robots.ox.ac.uk (thor.robots.ox.ac.uk)|129.67.95.98|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 10737418240 (10G) [application/octet-stream]
Saving to: ‘vox1_dev_wav_partaa’


2021-10-07 07:30:56 (28.1 MB/s) - ‘vox1_dev_wav_partaa’ saved [10737418240/10737418240]



In [None]:
!wget https://thor.robots.ox.ac.uk/~vgg/data/voxceleb/vox1a/vox1_dev_wav_partab

--2021-10-07 07:30:56--  https://thor.robots.ox.ac.uk/~vgg/data/voxceleb/vox1a/vox1_dev_wav_partab
Resolving thor.robots.ox.ac.uk (thor.robots.ox.ac.uk)... 129.67.95.98
Connecting to thor.robots.ox.ac.uk (thor.robots.ox.ac.uk)|129.67.95.98|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 10737418240 (10G) [application/octet-stream]
Saving to: ‘vox1_dev_wav_partab’


2021-10-07 07:37:03 (28.0 MB/s) - ‘vox1_dev_wav_partab’ saved [10737418240/10737418240]



In [None]:
!wget https://thor.robots.ox.ac.uk/~vgg/data/voxceleb/vox1a/vox1_dev_wav_partac

--2021-10-07 07:48:18--  https://thor.robots.ox.ac.uk/~vgg/data/voxceleb/vox1a/vox1_dev_wav_partac
Resolving thor.robots.ox.ac.uk (thor.robots.ox.ac.uk)... 129.67.95.98
Connecting to thor.robots.ox.ac.uk (thor.robots.ox.ac.uk)|129.67.95.98|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 10737418240 (10G) [application/octet-stream]
Saving to: ‘vox1_dev_wav_partac’
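
A 10 GB download can be interrupted or truncated without an obvious error. If the dataset page lists checksums for the part files (an assumption here; check the page linked in the README), one way to verify a download is to compute its MD5 digest in chunks. This is a sketch, and `md5_of_file` is a hypothetical helper name.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        # iter() with a sentinel keeps calling fh.read until it returns b""
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Usage (the expected value is a placeholder, not a real checksum):
# expected = "<md5 from the dataset page>"
# print(md5_of_file("vox1_dev_wav_partaa") == expected)
```

Reading in chunks keeps memory use small, which matters for files far larger than the available RAM.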



# Merge all the part files into one big zip file

In [None]:
# match only the part files, so a rerun does not feed an existing zip back into cat
!cat vox1_dev_wav_part* > vox1_dev_wav.zip
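
The same merge can be sketched in Python, which may be useful outside a shell environment. The function below concatenates the part files in sorted order (partaa, partab, ...) and streams them so that no part is loaded fully into memory. `merge_parts` and its default arguments are hypothetical names chosen to match the files in this notebook.

```python
import glob
import os
import shutil


def merge_parts(pattern="vox1_dev_wav_part*", output="vox1_dev_wav.zip"):
    """Concatenate all files matching `pattern`, in sorted order, into `output`."""
    out_abs = os.path.abspath(output)
    # Skip the output file itself in case the pattern happens to match it
    parts = sorted(p for p in glob.glob(pattern) if os.path.abspath(p) != out_abs)
    if not parts:
        raise FileNotFoundError(f"no files match {pattern!r}")
    with open(output, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                shutil.copyfileobj(src, out)  # copies in chunks
    return parts


# Usage: merge_parts() in the folder that holds the downloaded parts
```

The returned list makes it easy to confirm which parts were merged and in what order.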

# Extract files to a folder from the zip file

In [None]:
!jar xvf /content/vox1_dev_wav.zip
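
If only a few speakers are needed, an alternative to extracting everything is to pull out just their folders with Python's `zipfile` module. This is a sketch under two assumptions: the merged archive is complete (all parts were downloaded and merged), and member names follow a `wav/<speaker_id>/...` layout, consistent with the `/content/wav/<id>` folders this notebook works with. `extract_speakers` is a hypothetical helper name.

```python
import zipfile


def extract_speakers(zip_path, speaker_ids, dest="."):
    """Extract only the archive members that belong to the given speaker IDs.

    Assumes member names look like 'wav/<speaker_id>/<...>/<file>.wav'.
    """
    wanted = set(speaker_ids)
    extracted = []
    with zipfile.ZipFile(zip_path) as zf:
        for name in zf.namelist():
            parts = name.split("/")
            # Keep members whose second path component is a wanted speaker ID
            if len(parts) > 1 and parts[0] == "wav" and parts[1] in wanted:
                zf.extract(name, dest)
                extracted.append(name)
    return extracted


# Usage: extract_speakers("/content/vox1_dev_wav.zip", ["id10002", "id10003"], "/content")
```

This avoids writing the unwanted speakers to disk at all, at the cost of scanning the archive's full member list once.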

# Defining functions

In [None]:
# collect the VoxCeleb1 IDs of the Indian celebrities (returned as a NumPy array)
celeb_lst = df_india["VoxCeleb1 ID"].values
celeb_lst

array(['id10002', 'id10003', 'id10017', 'id10018', 'id10045', 'id10292',
       'id10324', 'id10393', 'id10519', 'id10583', 'id10662', 'id10724',
       'id10852', 'id10901', 'id10912', 'id10941', 'id10943', 'id10955',
       'id10956', 'id11071', 'id11089', 'id11090', 'id11100', 'id11130',
       'id11136', 'id11209'], dtype=object)

In [None]:
import os

def walk_through_dir(dir_path):
    """
    Prints the number of subdirectories and files in each celebrity folder
    found under dir_path.
    """
    for dirpath, dirnames, filenames in os.walk(dir_path):
        # dirpath might look like /content/wav/id11090;
        # the last component (id11090) is the celebrity ID we filter on
        celeb_id = os.path.basename(dirpath)
        if celeb_id in celeb_lst:
            print(f"There are {len(dirnames)} directories and {len(filenames)} files in '{dirpath}'.")

In [None]:
# test the function
walk_through_dir("/content/wav")
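
The function above reports only the direct children of each celebrity folder. If the wav files sit in nested subfolders (an assumption about the extracted layout), a recursive count per speaker may be more informative. The sketch below returns a dictionary instead of printing; `count_wavs_per_speaker` is a hypothetical helper name.

```python
import os


def count_wavs_per_speaker(root, speaker_ids):
    """Return {speaker_id: number of .wav files found anywhere under root/speaker_id}."""
    counts = {}
    for speaker_id in speaker_ids:
        speaker_dir = os.path.join(root, speaker_id)
        if not os.path.isdir(speaker_dir):
            continue  # speaker not present in the extracted data
        counts[speaker_id] = sum(
            1
            for _, _, filenames in os.walk(speaker_dir)
            for filename in filenames
            if filename.lower().endswith(".wav")
        )
    return counts


# Usage: count_wavs_per_speaker("/content/wav", celeb_lst)
```

Speakers missing from the result were not found under `root`, which can happen when only some of the part files were downloaded.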

In [None]:
import os
import shutil


def move_dir(dir_path, destination="./vox_indian"):
    """
    Walks through dir_path and moves each celebrity folder listed in celeb_lst
    to the destination folder.
    Args:
    dir_path (str): root directory to look into
    destination (str): target directory, where to save the final audio files
    """
    # make sure the target exists so shutil.move can rename into it
    os.makedirs(destination, exist_ok=True)
    for dirpath, dirnames, filenames in os.walk(dir_path):
        celeb_id = os.path.basename(dirpath)
        if celeb_id in celeb_lst:
            dest_path = os.path.join(destination, celeb_id)
            print(f"Moving from '{dirpath}' to '{dest_path}'")
            shutil.move(dirpath, dest_path)
            # the folder is gone now, so stop os.walk from descending into it
            dirnames[:] = []

In [None]:
move_dir("/content/wav")

# Final step - zip the subset of vox audio

In [None]:
!zip -r vox1_indian.zip /content/vox_indian
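
The archive can also be created from Python with `shutil.make_archive`. The sketch below stores paths relative to the folder's parent, so the archive unpacks to `vox_indian/...` rather than reproducing the full `/content/...` path. `zip_folder` is a hypothetical helper name.

```python
import os
import shutil


def zip_folder(folder, archive_name=None):
    """Zip `folder` so that paths inside the archive start at the folder's own name.

    `archive_name` is the output path without the '.zip' extension.
    """
    folder = os.path.abspath(folder)
    if archive_name is None:
        archive_name = folder  # writes '<folder>.zip' next to the folder
    return shutil.make_archive(
        os.path.abspath(archive_name),
        "zip",
        root_dir=os.path.dirname(folder),   # archive paths are relative to this
        base_dir=os.path.basename(folder),  # only this folder is included
    )


# Usage: zip_folder("/content/vox_indian", "/content/vox1_indian")
```

The function returns the path of the archive it wrote, which can then be downloaded from the notebook environment.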

# Now you can download a subset of data