#### Exploratory Data Analysis: Part 1

This notebook contains analysis of the gender and age labels collected from two of the sources.

In [108]:
import numpy as np
import pandas as pd

The audiolingua files

In [2]:
df_1 = pd.read_csv('../audio/1_audiolingua/file_listing.csv')

The voxforge files

In [4]:
df_4 = pd.read_csv('../audio/4_voxforge/filelisting.csv')

Examine the first rows of the audiolingua listing:

In [7]:
df_1.head()

Unnamed: 0,file_name,language,labels
0,hanna_pancakes,en,"['A1', 'female', 'adult', '30-60 seconds', 'fo..."
1,sport_australia_matt,en,"['B1', 'male', 'adult', '120-180 seconds', 'my..."
2,guess_who_melissa,en,"['A2', 'female', 'adult', '30-60 seconds', 'ce..."
3,who_was_he_lisa,en,"['A2', 'female', 'adult', '30-60 seconds', 'ce..."
4,gettysburg_address,en,"['B1', 'male', 'adult', '120-180 seconds', 'hi..."


The labels were originally lists, but they were saved to the CSV as strings:

In [8]:
df_1.loc[0, 'labels']

"['A1', 'female', 'adult', '30-60 seconds', 'food', 'recipe']"

In [47]:
def grab_gender_1(x):
    """
    Returns the gender from an entry's list of labels
    
    Parameters:
        x (str) : list of labels
    
    Returns:
        gender (str) : one of 'female', 'male', 'mixed', or 'unknown'
    """
    if "'female'" in x and "'male'" in x:
        return 'mixed'
    elif "'female'" in x:
        return 'female'
    elif "'male'" in x:
        return 'male'
    else:
        return 'unknown'
    
v_gender_1 = np.vectorize(grab_gender_1)
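The inner single quotes in the checks above matter because the labels are a single string, and a bare `male` is a substring of `female`. A minimal sketch of the difference, using a shortened, made-up labels string (`labels` is a hypothetical name, not a variable from the notebook):

```python
# A shortened labels string in the same format as the 'labels' column.
labels = "['A1', 'female', 'adult']"

# Without the surrounding quotes, 'male' matches inside 'female'.
print('male' in labels)      # True

# With the quotes included, only a standalone 'male' label matches.
print("'male'" in labels)    # False
```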

In [80]:
def grab_age_1(x):
    """
    Returns the age group from an entry's list of labels
    
    Parameters:
        x (str) : list of labels
    
    Returns:
        age (str) : one of 'adult', 'teenager', 'child', 'senior', or 'unknown'
    """
    if "'adult'" in x:
        return 'adult'
    elif "'teenager'" in x:
        return 'teenager'
    elif "'child'" in x:
        return 'child'
    elif "'senior citizens'" in x:
        return 'senior'
    else:
        return 'unknown'

v_age_1 = np.vectorize(grab_age_1)

In [81]:
df_1['gender'] = v_gender_1(df_1['labels'])

In [82]:
df_1['age'] = v_age_1(df_1['labels'])
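Because the labels are stored as strings, the functions above rely on substring checks. One possible alternative, sketched here on a small stand-in DataFrame rather than the real file, is to parse each string back into a list with `ast.literal_eval` so that membership tests match whole labels. The names `raw`, `demo`, and `label_list` are assumptions made for this illustration.

```python
import ast

import pandas as pd

# Example value copied from the first row of the listing.
raw = "['A1', 'female', 'adult', '30-60 seconds', 'food', 'recipe']"

# literal_eval only evaluates Python literals, so it is safer than eval().
parsed = ast.literal_eval(raw)
print(parsed)

# With a real list, membership is an exact match on each label.
print('male' in parsed)

# Applied to a whole column (stand-in data, including an empty label list).
demo = pd.DataFrame({'labels': [raw, '[]']})
demo['label_list'] = demo['labels'].apply(ast.literal_eval)
print(demo['label_list'].apply(len).tolist())
```

This assumes every entry is a valid Python list literal. A malformed entry would raise an exception and would need separate handling.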

In [83]:
df_1.head()

Unnamed: 0,file_name,language,labels,gender,age
0,hanna_pancakes,en,"['A1', 'female', 'adult', '30-60 seconds', 'fo...",female,adult
1,sport_australia_matt,en,"['B1', 'male', 'adult', '120-180 seconds', 'my...",male,adult
2,guess_who_melissa,en,"['A2', 'female', 'adult', '30-60 seconds', 'ce...",female,adult
3,who_was_he_lisa,en,"['A2', 'female', 'adult', '30-60 seconds', 'ce...",female,adult
4,gettysburg_address,en,"['B1', 'male', 'adult', '120-180 seconds', 'hi...",male,adult


Examination of the gender distribution of audiolingua samples:

In [84]:
df_1['gender'].value_counts()

female     444
male       158
mixed       21
unknown      5
Name: gender, dtype: int64

Most samples are from female speakers (444 of 628), but male speakers are also represented, and 21 files carry both a female and a male label.
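To put the counts on a common scale, they can be converted to shares of the total. The sketch below rebuilds the counts by hand from the output above. On the real frame, `df_1['gender'].value_counts(normalize=True)` should give the same proportions. The names `gender_counts` and `gender_share` are introduced only for this example.

```python
import pandas as pd

# Counts copied from the value_counts() output above.
gender_counts = pd.Series({'female': 444, 'male': 158, 'mixed': 21, 'unknown': 5})

# Divide by the total number of files to get each group's share.
gender_share = (gender_counts / gender_counts.sum()).round(3)
print(gender_share)
```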

Inspection of samples without gender labels:

In [85]:
df_1[df_1['gender'] == 'unknown']

Unnamed: 0,file_name,language,labels,gender,age
153,miss_awesome_-ailsa,en,"['A2', '30-60 seconds', 'cinema', 'physical de...",unknown,unknown
354,sophia_-_queen-2,en,[],unknown,unknown
403,_2_-3,zh,"['celebrities', 'environmentalism', 'my roots'...",unknown,unknown
404,_partie1,zh,"['celebrities', 'environmentalism', 'my roots'...",unknown,unknown
411,-38,zh,"['A2', 'B1', 'school', 'high school', 'daily']",unknown,unknown


Examination of the age labels of audiolingua samples:

In [86]:
df_1['age'].value_counts()

adult       498
teenager     81
senior       21
child        16
unknown      12
Name: age, dtype: int64

Most samples are from adult speakers (498 of 628), but teenagers, seniors, and children are also represented.

Inspection of samples without age labels:

In [87]:
df_1[df_1['age'] == 'unknown']

Unnamed: 0,file_name,language,labels,gender,age
57,shambles_3,en,"['B1', 'female', 'male', '30-60 seconds', 'cin...",mixed,unknown
153,miss_awesome_-ailsa,en,"['A2', '30-60 seconds', 'cinema', 'physical de...",unknown,unknown
171,painting_3,en,"['A2', 'male', '0-30 seconds', 'art']",male,unknown
203,grinch,en,"['A1', 'female', 'male', '0-30 seconds', 'read...",mixed,unknown
352,thje_stinky_cheese_man_-_kathleen_andrew,en,"['A2', 'female', 'male', '30-60 seconds', 'rea...",mixed,unknown
354,sophia_-_queen-2,en,[],unknown,unknown
403,_2_-3,zh,"['celebrities', 'environmentalism', 'my roots'...",unknown,unknown
404,_partie1,zh,"['celebrities', 'environmentalism', 'my roots'...",unknown,unknown
411,-38,zh,"['A2', 'B1', 'school', 'high school', 'daily']",unknown,unknown
467,duihua_diergehaizi,zh,"['B1', 'female', 'male', '60-90 seconds', 'fam...",mixed,unknown


Inspection of gender labels by language:

In [89]:
df_1.groupby(by=['language', 'gender']).count()['file_name']

language  gender 
en        female     269
          male       120
          mixed        8
          unknown      2
zh        female     175
          male        38
          mixed       13
          unknown      3
Name: file_name, dtype: int64

The distribution is similar across languages: female speakers are the majority in both English (about 67%) and Mandarin (about 76%) samples.
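Raw counts are hard to compare when the languages have different numbers of files, so a row-normalised table can make the comparison more concrete. The sketch below rebuilds the grouped counts by hand. On the real frame, something like `pd.crosstab(df_1['language'], df_1['gender'], normalize='index')` should produce an equivalent table. The names `counts` and `shares` are assumptions for this example.

```python
import pandas as pd

# Counts copied from the grouped output above (rows: language, columns: gender).
counts = pd.DataFrame(
    {'female': [269, 175], 'male': [120, 38], 'mixed': [8, 13], 'unknown': [2, 3]},
    index=pd.Index(['en', 'zh'], name='language'),
)

# Divide each row by its own total so every language sums to 1.
shares = counts.div(counts.sum(axis=1), axis=0).round(3)
print(shares)
```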

Inspection of age group labels by language:

In [90]:
df_1.groupby(by=['language', 'age']).count()['file_name']

language  age     
en        adult       299
          child        14
          senior       21
          teenager     59
          unknown       6
zh        adult       199
          child         2
          teenager     22
          unknown       6
Name: file_name, dtype: int64

Distribution is similar across languages, though there aren't any Mandarin samples labeled as being from seniors.

Examination of samples from voxforge:

In [78]:
df_4.head()

Unnamed: 0,sample_number,age,dialect,gender,language,samples
0,es0000,Age Range: Adulto,Pronunciation dialect: Español España,Gender: Masculino,es,"['es0000-000', 'es0000-001', 'es0000-002', 'es..."
1,es0001,Age Range: Adulto,Pronunciation dialect: Español España,Gender: Masculino,es,"['es0001-000', 'es0001-001', 'es0001-002', 'es..."
2,es0002,Age Range: Adulto,Pronunciation dialect: Español Argentina,Gender: Masculino,es,"['es0002-000', 'es0002-001', 'es0002-002', 'es..."
3,es0003,Age Range: Adulto,Pronunciation dialect: Español Mexicano,Gender: Masculino,es,"['es0003-000', 'es0003-001', 'es0003-002', 'es..."
4,es0004,Age Range: Adulto,Pronunciation dialect: Español Argentina,Gender: Masculino,es,"['es0004-000', 'es0004-001', 'es0004-002', 'es..."


Listing of age value types:

In [79]:
df_4['age'].value_counts()

Age Range: Adulto                  571
Age Range: Adulte                  498
Age Range: Adult                   480
Age Range: Jeune                    92
Age Range: Senior                   58
Age Range: Youth                    47
Age Range: Please Select            16
Age Range: Niño                     13
Age Range: desconocido               7
Age range: adult;                    6
User Name:BlueAgent                  2
Age range: adult                     2
Age Range: inconnu                   1
Age range: youth;                    1
Age Range: adult                     1
Age Range: Tercera Edad              1
Age Range: Por favor Seleccione      1
Age Range: Sélectionnez              1
Name: age, dtype: int64

Translation and replacement of values:

In [97]:
# Assign the result back rather than using inplace=True on a column selection,
# which may not update df_4 in recent pandas versions.
df_4['age'] = df_4['age'].replace({
    'Age Range: Adult' : 'adult',
    'Age Range: Adulto' : 'adult',
    'Age Range: Adulte' : 'adult',
    'Age Range: Jeune' : 'child',
    'Age Range: Senior' : 'senior',
    'Age Range: Youth' : 'child',
    'Age Range: Please Select' : 'unknown',
    'Age Range: Niño' : 'child',
    'Age Range: desconocido' : 'unknown',
    'Age range: adult;' : 'adult',
    'Age range: adult' : 'adult',  # variant without the trailing semicolon
    'User Name:BlueAgent' : 'unknown',
    'Age Range: adult' : 'adult',
    'Age Range: inconnu' : 'unknown',
    'Age range: youth;' : 'child',
    'Age Range: Tercera Edad' : 'senior',
    'Age Range: Por favor Seleccione' : 'unknown',
    'Age Range: Sélectionnez' : 'unknown'
})
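Many of the raw values differ only in capitalisation or a trailing semicolon, so one alternative to listing every variant is to normalise the strings first and then translate the remainder. This is a sketch of that idea on a handful of values copied from the listing, not the method used in this notebook. `raw_age`, `cleaned`, `translations`, and `age_group` are hypothetical names.

```python
import pandas as pd

# A few raw values copied from the value_counts() listing.
raw_age = pd.Series([
    'Age Range: Adulto',
    'Age range: adult;',
    'Age Range: Jeune',
    'Age Range: Tercera Edad',
    'Age Range: Please Select',
    'User Name:BlueAgent',
])

# Lowercase, drop the "age range:" prefix, then strip trailing ';' and spaces.
cleaned = (
    raw_age.str.lower()
           .str.replace(r'^age range:\s*', '', regex=True)
           .str.rstrip('; ')
)

# Translate the remaining Spanish, French, and English terms.
translations = {
    'adult': 'adult', 'adulto': 'adult', 'adulte': 'adult',
    'youth': 'child', 'jeune': 'child', 'niño': 'child',
    'senior': 'senior', 'tercera edad': 'senior',
}

# Anything not in the table (placeholders, stray fields) becomes 'unknown'.
age_group = cleaned.map(translations).fillna('unknown')
print(age_group.tolist())
```

The trade-off is that unexpected values fall silently into `unknown`, whereas an explicit mapping leaves them visible in the next `value_counts()` call.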

Post-replacement distribution of age labels:

In [100]:
df_4['age'].value_counts()

adult      1558
child       153
senior       59
unknown      28
Name: age, dtype: int64

Voxforge similarly has mainly adult speakers, and does not seem to distinguish teenage speakers as a separate category.

Listing of gender values from voxforge:

In [103]:
df_4['gender'].value_counts()

Gender: Masculin                584
Gender: Masculino               546
Gender: Male                    543
Gender: Female                   52
Gender: Femenino                 37
Gender: Féminin                  14
Gender: desconocido               8
Gender: male;                     6
Gender: male                      2
Gender: Por favor Seleccione      2
Gender: female;                   1
Gender: female                    1
Gender: Sélectionnez              1
Gender: inconnu                   1
Name: gender, dtype: int64

Translation and replacement of gender labels:

In [104]:
# Assign the result back rather than using inplace=True on a column selection,
# which may not update df_4 in recent pandas versions.
df_4['gender'] = df_4['gender'].replace({
    'Gender: Female' : 'female',
    'Gender: Femenino' : 'female',
    'Gender: Féminin' : 'female',
    'Gender: Male' : 'male',
    'Gender: Masculin' : 'male',
    'Gender: Masculino' : 'male',
    'Gender: Por favor Seleccione' : 'unknown',
    'Gender: Sélectionnez' : 'unknown',
    'Gender: desconocido' : 'unknown',
    'Gender: female' : 'female',
    'Gender: female;' : 'female',
    'Gender: inconnu' : 'unknown',
    'Gender: male' : 'male',
    'Gender: male;' : 'male'
})

Post-replacement distribution of gender labels:

In [105]:
df_4['gender'].value_counts()

male       1681
female      105
unknown      12
Name: gender, dtype: int64

In contrast to the audiolingua samples, the vast majority of these samples are from male speakers. Each sample also appears to have a single gender label (and a single speaker).

Distribution of gender labels by language:

In [106]:
df_4.groupby(by=['language', 'gender']).count()['sample_number']

language  gender 
en        female      53
          male       545
es        female      38
          male       552
          unknown     10
fr        female      14
          male       584
          unknown      2
Name: sample_number, dtype: int64

Distribution is similarly skewed across languages.

Distribution of age group labels across languages:

In [107]:
df_4.groupby(by=['language', 'age']).count()['sample_number']

language  age    
en        adult      483
          child       47
          senior      50
          unknown     18
es        adult      577
          child       14
          senior       1
          unknown      8
fr        adult      498
          child       92
          senior       8
          unknown      2
Name: sample_number, dtype: int64

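Adult speakers dominate in every language, but the smaller groups differ: most senior samples are English (50 of 59) and most child samples are French (92 of 153).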

Although the examined samples represent only a fraction of the sources and languages used, they show at least some variation in speaker gender and age across the audio files. There do not appear to be compounding imbalances that would prevent a machine learning model from identifying speech from a variety of different speakers.