I pre-downloaded a lot of files so that you don't need to be constantly querying the database. Here's what the notebook does with these files:

(1) It goes through all the transcripts in a specified age range (currently 6 to 72 months) and only grabs those that have associated audio/video with timestamps.

(2) From those transcripts, it builds a map from word to transcript ID so you can easily get all the transcripts that contain whatever word you are looking for. You can also do this with the API, but it involves a lot of querying.

(4) It maps transcript IDs to file names and likely URLs to give you you most of the info you need to go retrieve the audio file associated with a particular transcript. You can sub in wav for xml in the file names, and if that doesn't work try mp3 or mp4. The only challenge is that there is frequently a 0wav directory buried at some level in the directory structure that contains wavs. Finding the wav files in this case might involve some manual oversight.


In [2]:
# imports
import pandas as pd
import sys
import numpy as np
import nltk
import glob


In [3]:
# Read through each csv file of utterances from the corpora
# that we've previously confirmed have media files.

# I'm using pre-downloaded csvs to avoid too many api calls
csvfiles = glob.glob("csv-files-EngNA-media/[A-Z]*.csv")

# keep track of which words are in which utterances
# faster than querying the database
word2utt = {}

# keep track of which utterances have start and stop times
# these are the only ones we care about right now
timestamps = {}

# keep track of all words that are used for frequency counting
# we only want to look at the highest frequency words
yesprint = 0
wordlist = []
for c in csvfiles:
    
    # print out every 10th corpus to reduce verbosity of output
    #if yesprint % 10 == 0:
    print(c)
    #yesprint +=1 

    # read in the csv file
    child = pd.read_csv(c)
    
    # get kid age, start and stop times
    child = child.astype({"target_child_age": float})
    child = child.astype({"media_start": float})
    child = child.astype({"media_end": float})

  
    # this stores utterances from kids between 6 and 72 months
    # who have transcripts with start and stop times
    newchild = child.loc[(child["speaker_role"] == "Target_Child") & 
              (child["target_child_age"] >= 6) &
              (child["target_child_age"] <= 72) &
              (child["media_start"] > 0),
                         ["id", "transcript_id", "media_start", "media_end", "gloss", "target_child_sex","target_child_age"]]
    
    # compile complete record into alldata
    try:
        alldata = alldata.append(newchild)
    except Exception as noalldata:
        alldata = newchild
    
    thedata = list(newchild.itertuples(index=False, name=None))
    
    # first map every word to all the utterances it appears in (word2utt)
    # then map each utterance ID to its transcript ID, timestamps, gloss, 
    # and corpus name (timestamps)
    for line in thedata:
        if type(line[4]) == str:
            for w in line[4].split():
                wordlist.append(w)
                if w in word2utt:
                    word2utt[w].append(line[0])
                else:
                    word2utt[w] = [line[0]]
            timestamps[line[0]] = [line[1], line[2], line[3], line[4], c]
            
# Note that Sprott corpus as exported doesn't have ages for kids, though those are available
# Note also that childes-py does not have access to MOST of the kids in CHILDES.

csv-files-EngNA-media/NewmanRatner.csv
csv-files-EngNA-media/Goad.csv
csv-files-EngNA-media/Peters.csv
csv-files-EngNA-media/Gleason.csv
csv-files-EngNA-media/PaidoEnglish.csv
csv-files-EngNA-media/Brent.csv
csv-files-EngNA-media/Providence.csv
csv-files-EngNA-media/Sprott.csv
csv-files-EngNA-media/Sachs.csv
csv-files-EngNA-media/Bloom.csv
csv-files-EngNA-media/Bernstein.csv
csv-files-EngNA-media/Soderstrom.csv
csv-files-EngNA-media/EllisWeismer.csv
csv-files-EngNA-media/POLER.csv
csv-files-EngNA-media/Rollins.csv
csv-files-EngNA-media/McCune.csv


  exec(code_obj, self.user_global_ns, self.user_ns)


csv-files-EngNA-media/Snow.csv
csv-files-EngNA-media/Menn.csv
csv-files-EngNA-media/VanHouten.csv
csv-files-EngNA-media/McMillan.csv
csv-files-EngNA-media/Nelson.csv
csv-files-EngNA-media/Davis.csv
csv-files-EngNA-media/Weist.csv
csv-files-EngNA-media/Braunwald.csv
csv-files-EngNA-media/MacWhinney.csv


  exec(code_obj, self.user_global_ns, self.user_ns)


In [4]:
# example: print the data for utt 2703191
print(timestamps[2703191])

# example: print the first 10 utts that have "doggy"
print(word2utt["doggy"][0:10])


[10177, 820.761, 823.738, 'yyy ride yyy doggy back', 'csv-files-EngNA-media/NewmanRatner.csv']
[2686897, 2687332, 2687457, 2687619, 2694910, 2696457, 2703191, 2703343, 2720479, 2786680]


In [None]:
# Here's a quick example of how to use the python wrapper
# for the R interface to childesdb in case you'd like to 
# use this later to get files

# https://github.com/langcog/childespy

#import childespy
#utterances = childespy.get_utterances(corpus="ENNI")
#utterances.to_csv("ENNI.csv")

#import childespy
#phontrans = childespy.get_transcripts(corpus = "MacWhinney")
#phontrans.to_csv("moretranscripts.csv")


In [5]:
# map corpus name to collection so you can get URL for audio
# only worry about the corpora we know have media,
# using a manually curated list of EngNA corpora with media.
corp2url = {}

import csv
with open('corpora-EngNA-media.txt', newline='') as csvfile:
    csvfile.readline()
    treader = csv.reader(csvfile, delimiter='|', quotechar='"')
    for row in treader:
        corp2url[row[1]] = "https://media.talkbank.org/" + row[0].lower() + "/"



In [6]:
# map transcript IDs to file names since we need the file
# names to retrieve the audio
# getting this from a pre-downloaded file of EngNA transcripts to avoid 
# api calls

trans2file = {}
import csv
with open('transcripts.csv', newline='') as csvfile:
    csvfile.readline()
    treader = csv.reader(csvfile, delimiter=',', quotechar='"')
    for row in treader:
        if row[2] in corp2url:
            urlstart = corp2url[row[2]]
            fileurl = urlstart + "/" + row[5]
            trans2file[int(row[1])] = (urlstart + row[5], row[7])



In [6]:
# now get a list of audio and video files that have timestamped 
# transcriptsion using the timestamps dictionary

audio2get = {}
missing = {}
for t,v in timestamps.items():
    audio2get[ trans2file[v[0]][0] ] = 1


print(len(audio2get.keys()))


2856


### Next step: Get the audio files

* Determining mp3 vs. wav vs. mp4 will require either parsing the main webpage for the media for that corpus or just trying both and seeing what works. Recall that there is often a 0wav directory with wav files somewhere in corpora that seem to have only mp3.
* I use curl or wget in a shell to get the audio, but urllib in python works fine, see below.



In [7]:
# function to get an audio file

# this lets you bypass weird SSL problem with CHILDES
import ssl
ssl._create_default_https_context = ssl._create_unverified_context

# Need for handling mp4s
import ffmpeg
import sox

# some other imports
import urllib.request
import re
import os

# function: takes a url, tries to get the audio file first as
# a wav then as an mp3. Returns the file name if successful
# and return "failure" if unsuccessful
# Note: this only works for childes and not phonbank.
# Need to fancy up the regex below.
def get_the_audiofile(afilename, verbose=False):
    
    thefile = re.sub("xml", "wav", afilename)
    if thefile=="https://media.talkbank.org/childes/Eng-NA/MacWhinney/021001a.wav":
        #this wav file is mislabeled
        thefile = "https://media.talkbank.org/childes/Eng-NA/MacWhinney/021001.wav"
    
    outputfile = re.sub("Frogs/", "splitme", thefile)
    outputfile = re.sub("Eng-NA/", "splitme", outputfile)
    outputfile = re.sub("Clinical-MOR/", "splitme", outputfile)
    
    outputfile = re.sub("/", "_", outputfile).split("splitme")[1]
    outputfile = "rawfiles/" + outputfile
    # let's not waste time re-downloading files
    if verbose:
        print("checking for " + outputfile)
    if os.path.exists(outputfile):
        if verbose:
            print("file already downloaded")
        return outputfile

    # first try as a wav file

    try:
        if verbose:
            print(thefile, outputfile)
        urllib.request.urlretrieve(thefile, outputfile)
        if verbose: 
            print("DOWNLOAD COMPLETE:", outputfile, "\n\n")
        return outputfile
        
    except Exception as e:
        
            # if that fails, look for the 0wav folder
            if verbose:
                print("ERROR: Could not download wav file {}".format(thefile))
                print("Trying to find the 0wav folder, if any")
            string = thefile.split("/")
            thenewfile = "/".join(string[0:(len(string)-2)] + ["0wav"] + string[(len(string)-2):len(string)])
            if thefile.find("EllisWeismer/TD")>0:
                thenewfile = re.sub("/TD/","/0wav/",thefile)
                string = thenewfile.split("/")
                thenewfile = "/".join(string[0:(len(string)-1)] + ["controls"] + string[(len(string)-1):len(string)])
            if verbose:
                print(thenewfile, outputfile)
            try:
                urllib.request.urlretrieve(thenewfile, outputfile)
                if verbose:
                    print("DOWNLOAD COMPLETE:", outputfile, "\n\n")
                return outputfile

            except Exception as e2:

                    # if that fails, look again for the 0wav folder
                    if verbose:
                        print("ERROR: Could not download wav file {}".format(thenewfile))
                        print("Trying to find the 0wav folder, if any")
                    string = thefile.split("/")
                    thenewfile = "/".join(string[0:(len(string)-1)] + ["0wav"] + string[(len(string)-1):len(string)])
                    if thefile.find("Gleason/Father")>0:
                        # because why, Brian, why?
                        thenewfile = re.sub(".wav","-f.wav",thenewfile)
                    if thefile.find("Gleason/Mother")>0:
                        # because why, Brian, why?
                        thenewfile = re.sub(".wav","-m.wav",thenewfile)
                    if verbose:
                        print(thenewfile, outputfile)
                    try:
                        urllib.request.urlretrieve(thenewfile, outputfile)
                        if verbose:
                            print("DOWNLOAD COMPLETE:", outputfile, "\n\n")
                        return outputfile

                    except Exception as e3:

                            # if that fails, look one last time for the 0wav folder
                            if verbose:
                                print("ERROR: Could not download wav file {}".format(thenewfile))
                                print("Trying to fine the 0wav folder, if any")
                            string = thefile.split("/")
                            thenewfile = "/".join(string[0:(len(string)-3)] + ["0wav"] + string[(len(string)-3):len(string)])
                            if verbose:
                                print(thenewfile, outputfile)
                            try:
                                urllib.request.urlretrieve(thenewfile, outputfile)
                                if verbose:
                                    print("DOWNLOAD COMPLETE:", outputfile, "\n\n")
                                return outputfile

                            except Exception as e4:

                                # if that fails, try as an mp3
                                if verbose:
                                    print("ERROR: Could not download wav file {}".format(thenewfile))
                                    print("Trying to download mp3")
                                thenewfile = re.sub("wav", "mp3", thefile)
                                outputfile = re.sub("wav", "mp3", outputfile)
                                if verbose:
                                    print(thenewfile, outputfile)
                                try:
                                    urllib.request.urlretrieve(thenewfile, outputfile)
                                    if verbose:
                                        print("DOWNLOAD COMPLETE:", outputfile, "\n\n Converting to wav \n\n")
                                    outputfile2 = re.sub("mp3", "wav", outputfile)
                                    tfm = sox.Transformer()
                                    tfm.build_file(outputfile, outputfile2)
                                    return outputfile2

                                except Exception as e5:

                                    # if that fails, try as an mp4
                                    if verbose:
                                        print("ERROR: Could not download mp3 file {}".format(thenewfile))
                                        print("Trying to download mp4")
                                    thenewfile = re.sub("wav", "mp4", thefile)
                                    outputfile = re.sub("mp3", "mp4", outputfile)
                                    if verbose: 
                                        print(thenewfile, outputfile)
                                    try:
                                        urllib.request.urlretrieve(thenewfile, outputfile)
                                        if verbose: 
                                            print("DOWNLOAD COMPLETE:", outputfile, "\n\n Converting mp4 to wav \n\n")
                                        outputfile2 = re.sub("mp4", "wav", outputfile)
                                        ffmpeg.input(outputfile).output(outputfile2).run()
                                        return outputfile2

                                    except Exception as e6:
                                        if verbose:
                                            print("GIVING UP: Could not download mp4 file {}".format(thenewfile))
                                        return "failure"




Some notes on files.

* EllisWeismer has both TD and language delayed kids. Make sure not to use the language delayed kids.

Unusable 0wav
* Eng-NA/NewmanRatner has a 0wav for the main files, but not for /Interviews/
* English-NA/Goad has 0wav but it collapses all the clips into a single file, so not easily usable
* English-NA/Providence has 0wav folders for two of the kids, but these folders only contain mp3s 

No 0wav folders:
* English-NA/Sachs
* English-NA/Weist

### Step after that: Extracting audio
Once you've downloaded the audio files, you'll use the data in the timestamps dictionary above to extract the audio clips corresponding to the utterance. I use the unix tools sox or ffmpeg, but there is a nice python wrapper for sox that does this well.



In [8]:
# parse audiofile
def get_clip(uttdata, uttid, afile, theword, verbose=False):

    tfm = sox.Transformer()
    starttime = uttdata[1]
    stoptime = uttdata[2]
    tfm.trim(starttime, stoptime)
    tfm.convert(16000, 1) 
    newtail = "_" + str(starttime) + "-" + str(stoptime) + ".wav"
    newfile = re.sub(".wav", newtail, afile)
    #newfile = theword + "_" + newfile
    newfile = re.sub("rawfiles/","",newfile)
    tfm.build_file(afile, "parsedfiles/" + newfile)

    if verbose:
        print("New file created:", newfile)
        print("Transcript:", uttdata[3] )
    
    alldata.loc[alldata["id"] == uttid, "file"] = newfile
    
    return True


In [38]:
# example: getting an audio file from CHILDES
# and extracting relevant portion using timestamps

import random

# pick some random utterance
someutt = random.choice(list(timestamps.keys()))
uttdata = timestamps[someutt]

#print(uttdata)

# look up the associated audio file
audiofile = trans2file[uttdata[0]]
outfilename = get_the_audiofile(audiofile[0])

if outfilename != "failure":
    # creating a new file using timestamps
    get_clip(uttdata, someutt, outfilename, "y")




Now let's actually download the files we need

In [9]:
# set words of interest
words = ["baby","daddy","mommy","dad","car"]

In [42]:
import warnings
warnings.filterwarnings("ignore")

nofind = []
noparse = []

for word in words:
    print("now working on :" + word)
    wordstamps = word2utt[word]
    for someutt in wordstamps:
        uttdata = timestamps[someutt]

        #print(uttdata)

        audiofile = trans2file[uttdata[0]]
        outfilename = get_the_audiofile(audiofile[0])

        if outfilename != "failure":
            # creating a new file using timestamps
            try:
                get_clip(uttdata, someutt, outfilename, word)
            except Exception as parsefail:
                print("Unable to parse "+outfilename)
                noparse = noparse + [someutt]
        else:
            nofind = nofind + [someutt]


output_file: parsedfiles/daddy_NewmanRatner_24_5346GG_938.086-940.687.wav already exists and will be overwritten on build
output_file: parsedfiles/daddy_NewmanRatner_24_5623AT_816.434-817.357.wav already exists and will be overwritten on build


now working on :daddy


output_file: parsedfiles/daddy_NewmanRatner_24_5623AT_826.938-829.341.wav already exists and will be overwritten on build
output_file: parsedfiles/daddy_NewmanRatner_24_7162MB_424.653-426.816.wav already exists and will be overwritten on build
output_file: parsedfiles/daddy_NewmanRatner_24_6071WB_641.037-648.03.wav already exists and will be overwritten on build
output_file: parsedfiles/daddy_NewmanRatner_24_6071WB_696.932-702.421.wav already exists and will be overwritten on build
output_file: parsedfiles/daddy_NewmanRatner_24_7658LT_153.399-155.667.wav already exists and will be overwritten on build
output_file: parsedfiles/daddy_NewmanRatner_Interviews_24_5073AC_154.007-157.767.wav already exists and will be overwritten on build
output_file: parsedfiles/daddy_Goad_Julia_11102_126.594-127.781.wav already exists and will be overwritten on build
output_file: parsedfiles/daddy_Goad_Julia_11116_185.775-187.389.wav already exists and will be overwritten on build
output_file: parsedfiles/d

now working on :mommy


output_file: parsedfiles/mommy_Gleason_Dinner_katie_1079.896-1083.052.wav already exists and will be overwritten on build
output_file: parsedfiles/mommy_Gleason_Dinner_katie_1079.896-1083.052.wav already exists and will be overwritten on build
output_file: parsedfiles/mommy_Gleason_Dinner_katie_1079.896-1083.052.wav already exists and will be overwritten on build
output_file: parsedfiles/mommy_Gleason_Dinner_katie_1079.896-1083.052.wav already exists and will be overwritten on build
output_file: parsedfiles/mommy_Gleason_Dinner_katie_1079.896-1083.052.wav already exists and will be overwritten on build
ffmpeg version 4.4 Copyright (c) 2000-2021 the FFmpeg developers
  built with Apple clang version 12.0.5 (clang-1205.0.22.9)
  configuration: --prefix=/usr/local/Cellar/ffmpeg/4.4_2 --enable-shared --enable-pthreads --enable-version3 --cc=clang --host-cflags= --host-ldflags= --enable-ffplay --enable-gnutls --enable-gpl --enable-libaom --enable-libbluray --enable-libdav1d --enable-libmp3l

now working on :dad


output_file: parsedfiles/dad_Providence_Ethan_010904_1889.78-1904.0.wav already exists and will be overwritten on build
output_file: parsedfiles/dad_Providence_Naima_020427_1599.725-1607.57.wav already exists and will be overwritten on build
output_file: parsedfiles/dad_Providence_Naima_020427_2428.96-2434.11.wav already exists and will be overwritten on build
output_file: parsedfiles/dad_Providence_Naima_020427_2428.96-2434.11.wav already exists and will be overwritten on build
output_file: parsedfiles/dad_Providence_Naima_020427_2428.96-2434.11.wav already exists and will be overwritten on build
output_file: parsedfiles/dad_Providence_Naima_020427_2428.96-2434.11.wav already exists and will be overwritten on build
output_file: parsedfiles/dad_Providence_Naima_020427_2428.96-2434.11.wav already exists and will be overwritten on build
output_file: parsedfiles/dad_Providence_Naima_020427_2428.96-2434.11.wav already exists and will be overwritten on build
output_file: parsedfiles/dad_Pro

now working on :car


output_file: parsedfiles/car_Goad_Julia_20913_2246.114-2249.504.wav already exists and will be overwritten on build
output_file: parsedfiles/car_Goad_Julia_20913_2246.114-2249.504.wav already exists and will be overwritten on build
output_file: parsedfiles/car_Gleason_Mother_david_799.799-802.776.wav already exists and will be overwritten on build
ffmpeg version 4.4 Copyright (c) 2000-2021 the FFmpeg developers
  built with Apple clang version 12.0.5 (clang-1205.0.22.9)
  configuration: --prefix=/usr/local/Cellar/ffmpeg/4.4_2 --enable-shared --enable-pthreads --enable-version3 --cc=clang --host-cflags= --host-ldflags= --enable-ffplay --enable-gnutls --enable-gpl --enable-libaom --enable-libbluray --enable-libdav1d --enable-libmp3lame --enable-libopus --enable-librav1e --enable-librubberband --enable-libsnappy --enable-libsrt --enable-libtesseract --enable-libtheora --enable-libvidstab --enable-libvorbis --enable-libvpx --enable-libwebp --enable-libx264 --enable-libx265 --enable-libxml2

TypeError: can only concatenate str (not "list") to str

# Next up, processing the audio files for input into CNN

First, we check to see how much we've got to work with

In [43]:
print("Unable to find file online: ")
print(nofind)
print("Unable to find file locally: ")
print(noparse)

Unable to find file online: 
[2414379, 2414415, 2414443, 2414538, 2414644, 2414656, 2414702, 2414738, 2414764, 2414817, 2414976, 2415042, 2415093, 2415103, 2415185, 2415202, 2415226, 2415310, 2415398, 2415555, 2415694, 2415793, 2415831, 2415873, 2416008, 2416058, 2416135, 2416181, 2416250, 2416324, 2416351, 2416376, 2416610, 2416684, 2416786, 2416800, 2416840, 2416850, 2417033, 2417306, 2417320, 2417691, 2417808, 2418243, 2418286, 2418416, 2418448, 2418480, 2418517, 2418534, 2418584, 2418615, 2418629, 2418699, 2418713, 2418806, 2418845, 2418874, 2418921, 2418934, 2419246, 2419273, 2419313, 2419506, 2419563, 2419588, 2419657, 2419690, 2419799, 2419824, 2419874, 2419942, 2420024, 2420036, 2420219, 2420283, 2420511, 2420575, 2420585, 2420612, 2420796, 2420807, 2420819, 2420839, 2420850, 2421074, 2422042, 2422155, 2422236, 2422302, 2422431, 2422493, 2422510, 2422674, 2422690, 2422731, 2422799, 2422811, 2422871, 2422887, 2422894, 2422926, 2422943, 2422988, 2422995, 2423066, 2423097, 2423281

Next up, set up 'usedata' to contain all the information about the files we need

Exclude files that
* don't exist
* are 10 seconds or longer
* don't contain any of our target words

In [62]:
import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np
from scipy.io import wavfile

# which utterances contain which words?
alldata.loc[:,words] = 0
for word in words:
    alldata.loc[alldata["id"].isin(word2utt[word]), word] = 1

# childes-db list of transcripts is missing some of transcripts available in their list of utterances
alldata.loc[:,"file"] = np.array([trans2file.get(key) for key in alldata["transcript_id"]], dtype='O')
usedata = alldata[alldata["file"].isna() == False]
#this looks stupid but was only thing I could get to work
usedata.loc[:,"file"] = np.array([trans2file.get(key)[0] for key in usedata["transcript_id"]], dtype='O')
usedata.loc[:,"file"] = usedata.apply(lambda x: re.sub(".xml","",x["file"]), axis=1)
usedata.loc[:,"file"] = usedata.apply(lambda x: re.sub('.*Eng-NA/',"",x["file"]), axis=1)
usedata.loc[:,"file"] = usedata.apply(lambda x: re.sub('.*Frogs/',"",x["file"]), axis=1)
usedata.loc[:,"file"] = usedata.apply(lambda x: re.sub('.*Clinical-MOR/',"",x["file"]), axis=1)
usedata.loc[:,"file"] = usedata.apply(lambda x: re.sub("/","_",x["file"]), axis=1)
usedata.loc[:,"file"] = [''.join([str(x[1]["file"]), "_", str(x[1]["media_start"]), "-", str(x[1]["media_end"]), ".wav"]) for x in usedata.iterrows()]
usedata.loc[:,"length"] = usedata["media_end"] - usedata["media_start"]
usedata.loc[:,"tot"] = usedata[words].sum(axis=1)

alltrainingdata = usedata[(usedata["length"]<10) & (usedata["tot"]>0)]

print(alltrainingdata["tot"].value_counts())

print(alltrainingdata[words].sum())

print(alltrainingdata.iloc[0:5])

print("done")

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._setitem_single_column(ilocs[0], value, pi)


1    10268
2      253
3        5
Name: tot, dtype: int64
baby     3983
daddy    1663
mommy    2995
dad       459
car      1689
dtype: int64
            id  transcript_id  media_start  media_end  gloss target_child_sex  \
23919  2618441           9852      581.107    585.256  mommy             male   
42513  2637036           9958      135.516    137.102   baby             male   
42535  2637058           9958      138.205    139.957   baby             male   
42600  2637123           9958      151.580    154.082   baby             male   
43110  2637633           9958      290.901    292.404   baby             male   

       target_child_age  baby  daddy  mommy  dad  car  \
23919          7.000144     0      0      1    0    0   
42513         10.000205     1      0      0    0    0   
42535         10.000205     1      0      0    0    0   
42600         10.000205     1      0      0    0    0   
43110         10.000205     1      0      0    0    0   

                              

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self.obj[key] = value


Now, read in the data

Note: It seems that Eng-NA/McCune/Alice/011000.mp4 should definitely exist but is not on the server.

In [150]:
all_wave = {}
for readfile in alltrainingdata["file"]:
    try:
        samples, sample_rate = librosa.load('parsedfiles/' + readfile, sr = 16000)
    except Exception as nofind:
        print("Strangely cannot find file: " + readfile)
    all_wave[readfile] = samples

print(len(all_wave))
print(len(usedata["id"]))



Strangely cannot find file: McCune_Alice_010900_21.523-28.36.wav
Strangely cannot find file: McCune_Alice_010900_34.253-39.241.wav
Strangely cannot find file: McCune_Alice_010900_52.205-54.106.wav




Strangely cannot find file: McCune_Alice_010900_64.628-67.766.wav
Strangely cannot find file: McCune_Alice_010900_105.291-109.401.wav
Strangely cannot find file: McCune_Alice_010900_150.633-155.401.wav




Strangely cannot find file: McCune_Alice_010900_159.195-160.638.wav
Strangely cannot find file: McCune_Alice_010900_182.123-187.78.wav
Strangely cannot find file: McCune_Alice_010900_198.075-206.621.wav




Strangely cannot find file: McCune_Alice_010900_208.896-210.798.wav
Strangely cannot find file: McCune_Alice_010900_228.41-231.156.wav
Strangely cannot find file: McCune_Alice_010900_290.055-295.441.wav




Strangely cannot find file: McCune_Alice_010900_312.686-315.683.wav
Strangely cannot find file: McCune_Alice_010900_335.53-339.488.wav
Strangely cannot find file: McCune_Alice_010900_341.116-343.893.wav




Strangely cannot find file: McCune_Alice_010900_384.163-392.11.wav
Strangely cannot find file: McCune_Alice_010900_396.511-401.806.wav
Strangely cannot find file: McCune_Alice_010900_408.74-413.785.wav




Strangely cannot find file: McCune_Alice_010900_455.113-460.281.wav
Strangely cannot find file: McCune_Alice_010900_461.371-464.446.wav
Strangely cannot find file: McCune_Alice_010900_480.26-483.33.wav




Strangely cannot find file: McCune_Alice_010900_500.998-507.561.wav
Strangely cannot find file: McCune_Alice_010900_523.563-529.8.wav
Strangely cannot find file: McCune_Alice_010900_534.395-543.675.wav




Strangely cannot find file: McCune_Alice_010900_547.073-554.478.wav
Strangely cannot find file: McCune_Alice_010900_578.428-581.891.wav
Strangely cannot find file: McCune_Alice_010900_588.586-590.671.wav




Strangely cannot find file: McCune_Alice_010900_605.285-612.281.wav
Strangely cannot find file: McCune_Alice_010900_645.586-650.081.wav
Strangely cannot find file: McCune_Alice_010900_650.081-653.188.wav




Strangely cannot find file: McCune_Alice_010900_653.188-657.088.wav




Strangely cannot find file: McCune_Alice_010900_658.056-664.55.wav
Strangely cannot find file: McCune_Alice_010900_666.971-670.145.wav




Strangely cannot find file: McCune_Alice_010900_671.135-676.388.wav
Strangely cannot find file: McCune_Alice_010900_680.085-684.838.wav
Strangely cannot find file: McCune_Alice_010900_727.44-732.533.wav




Strangely cannot find file: McCune_Alice_010900_732.533-736.873.wav
Strangely cannot find file: McCune_Alice_010900_752.51-761.793.wav
Strangely cannot find file: McCune_Alice_010900_783.476-787.891.wav




Strangely cannot find file: McCune_Alice_010900_787.891-794.831.wav
Strangely cannot find file: McCune_Alice_010900_798.995-800.835.wav
Strangely cannot find file: McCune_Alice_010900_800.835-802.886.wav




Strangely cannot find file: McCune_Alice_010900_802.886-804.173.wav
Strangely cannot find file: McCune_Alice_010900_839.315-840.478.wav




Strangely cannot find file: McCune_Alice_010900_866.605-871.805.wav
Strangely cannot find file: McCune_Alice_010900_871.805-875.496.wav
Strangely cannot find file: McCune_Alice_010900_927.668-935.018.wav




Strangely cannot find file: McCune_Alice_010900_955.296-958.995.wav
Strangely cannot find file: McCune_Alice_010900_1026.546-1030.505.wav
Strangely cannot find file: McCune_Alice_010900_1038.541-1043.265.wav




Strangely cannot find file: McCune_Alice_010900_1058.148-1063.138.wav
Strangely cannot find file: McCune_Alice_010900_1067.706-1071.673.wav
Strangely cannot find file: McCune_Alice_010900_1072.315-1076.143.wav




Strangely cannot find file: McCune_Alice_010900_1080.76-1086.693.wav
Strangely cannot find file: McCune_Alice_010900_1086.693-1091.411.wav
Strangely cannot find file: McCune_Alice_010900_1094.563-1103.636.wav




Strangely cannot find file: McCune_Alice_010900_1104.921-1109.175.wav
Strangely cannot find file: McCune_Alice_010900_1109.175-1114.406.wav
Strangely cannot find file: McCune_Alice_010900_1132.185-1135.655.wav




Strangely cannot find file: McCune_Alice_010900_1135.655-1144.465.wav
Strangely cannot find file: McCune_Alice_010900_1165.418-1167.608.wav
Strangely cannot find file: McCune_Alice_010900_1169.785-1175.475.wav




Strangely cannot find file: McCune_Alice_010900_1175.475-1179.68.wav
Strangely cannot find file: McCune_Alice_010900_1179.68-1183.938.wav
Strangely cannot find file: McCune_Alice_010900_1185.648-1187.136.wav




Strangely cannot find file: McCune_Alice_010900_1187.136-1189.881.wav
Strangely cannot find file: McCune_Alice_010900_1270.901-1274.948.wav
Strangely cannot find file: McCune_Alice_010900_1275.641-1278.415.wav




Strangely cannot find file: McCune_Alice_010900_1279.693-1284.991.wav
Strangely cannot find file: McCune_Alice_010900_1308.046-1313.976.wav
Strangely cannot find file: McCune_Alice_010900_1324.751-1330.253.wav




Strangely cannot find file: McCune_Alice_010900_1334.2-1336.886.wav
Strangely cannot find file: McCune_Alice_010900_1345.766-1350.436.wav
Strangely cannot find file: McCune_Alice_010900_1350.436-1356.15.wav




Strangely cannot find file: McCune_Alice_010900_1356.928-1361.701.wav
Strangely cannot find file: McCune_Alice_010900_1365.63-1368.851.wav
Strangely cannot find file: McCune_Alice_010900_1392.133-1398.433.wav




Strangely cannot find file: McCune_Alice_010900_1417.111-1424.336.wav
Strangely cannot find file: McCune_Alice_010900_1426.751-1432.761.wav
Strangely cannot find file: McCune_Alice_010900_1439.411-1443.158.wav




Strangely cannot find file: McCune_Alice_010900_1451.231-1455.218.wav
Strangely cannot find file: McCune_Alice_010900_1455.218-1461.763.wav
Strangely cannot find file: McCune_Alice_010900_1524.411-1528.476.wav




Strangely cannot find file: McCune_Alice_010900_1539.528-1541.196.wav
Strangely cannot find file: McCune_Alice_010900_1597.69-1601.546.wav
Strangely cannot find file: McCune_Alice_010900_1609.775-1614.56.wav




Strangely cannot find file: McCune_Alice_010900_1614.56-1620.321.wav
Strangely cannot find file: McCune_Alice_010900_1651.876-1661.353.wav
Strangely cannot find file: McCune_Alice_010900_1664.415-1673.411.wav




Strangely cannot find file: McCune_Alice_010900_1674.088-1682.573.wav
Strangely cannot find file: McCune_Alice_010900_1685.9-1691.16.wav
Strangely cannot find file: McCune_Alice_010900_1691.16-1693.755.wav




Strangely cannot find file: McCune_Alice_010900_1698.93-1704.196.wav
Strangely cannot find file: McCune_Alice_010900_1717.981-1720.775.wav
Strangely cannot find file: McCune_Alice_010900_1724.053-1727.356.wav




Strangely cannot find file: McCune_Alice_010900_1727.356-1731.808.wav
Strangely cannot find file: McCune_Alice_010900_1784.015-1789.03.wav
Strangely cannot find file: McCune_Alice_010900_1816.481-1823.188.wav




Strangely cannot find file: McCune_Alice_010900_1845.425-1850.508.wav
Strangely cannot find file: McCune_Alice_010900_1861.746-1864.328.wav
Strangely cannot find file: McCune_Alice_010900_1889.363-1895.215.wav




Strangely cannot find file: McCune_Alice_010900_1930.768-1935.745.wav
Strangely cannot find file: McCune_Alice_010900_1944.991-1948.511.wav
Strangely cannot find file: McCune_Alice_010900_1949.106-1953.476.wav




Strangely cannot find file: McCune_Alice_010900_1977.033-1980.171.wav
Strangely cannot find file: McCune_Alice_010900_1981.25-1983.275.wav
Strangely cannot find file: McCune_Alice_010900_1997.066-1999.84.wav




Strangely cannot find file: McCune_Alice_010900_2021.183-2024.345.wav
Strangely cannot find file: McCune_Alice_010900_2024.345-2028.151.wav
Strangely cannot find file: McCune_Alice_010900_2054.796-2058.468.wav




Strangely cannot find file: McCune_Alice_010900_2083.986-2089.141.wav
Strangely cannot find file: McCune_Alice_010900_2091.739-2097.265.wav
Strangely cannot find file: McCune_Alice_010900_2097.265-2100.24.wav




Strangely cannot find file: McCune_Alice_010900_2127.92-2132.858.wav
Strangely cannot find file: McCune_Alice_010900_2132.858-2136.718.wav
Strangely cannot find file: McCune_Alice_010900_2136.718-2138.74.wav




Strangely cannot find file: McCune_Alice_010900_2141.043-2149.235.wav
Strangely cannot find file: McCune_Alice_010900_2154.745-2158.79.wav
Strangely cannot find file: McCune_Alice_010900_2160.638-2162.38.wav




Strangely cannot find file: McCune_Alice_010900_2177.501-2187.236.wav
Strangely cannot find file: McCune_Alice_010900_2258.818-2266.433.wav
Strangely cannot find file: McCune_Alice_010900_2267.905-2269.551.wav




Strangely cannot find file: McCune_Alice_010900_2269.551-2274.713.wav
Strangely cannot find file: McCune_Alice_010900_2301.455-2305.555.wav
Strangely cannot find file: McCune_Alice_010900_2306.361-2313.201.wav




Strangely cannot find file: McCune_Alice_010900_2313.201-2318.685.wav
Strangely cannot find file: McCune_Alice_010900_2353.386-2355.663.wav
Strangely cannot find file: McCune_Alice_010900_2357.836-2360.035.wav




Strangely cannot find file: McCune_Alice_010900_2378.526-2380.885.wav
Strangely cannot find file: McCune_Alice_010900_2417.34-2423.185.wav
Strangely cannot find file: McCune_Alice_010900_2444.928-2453.063.wav




Strangely cannot find file: McCune_Alice_010900_2458.191-2461.96.wav
Strangely cannot find file: McCune_Alice_010900_2473.835-2480.426.wav
Strangely cannot find file: McCune_Alice_010900_2483.801-2490.226.wav




Strangely cannot find file: McCune_Alice_010900_2494.521-2497.48.wav
Strangely cannot find file: McCune_Alice_010900_2504.353-2507.713.wav
Strangely cannot find file: McCune_Alice_010900_2566.765-2568.365.wav




Strangely cannot find file: McCune_Alice_010900_2572.516-2573.088.wav
Strangely cannot find file: McCune_Alice_010900_2575.271-2576.426.wav
Strangely cannot find file: McCune_Alice_010900_2607.713-2613.73.wav




Strangely cannot find file: McCune_Alice_010900_2619.525-2621.946.wav
Strangely cannot find file: McCune_Alice_010900_2633.623-2639.213.wav
Strangely cannot find file: McCune_Alice_010900_2639.213-2644.943.wav




Strangely cannot find file: McCune_Alice_010900_2648.473-2656.386.wav
Strangely cannot find file: McCune_Alice_010900_2657.705-2659.826.wav




Strangely cannot find file: MacWhinney_021001a_416.151-420.132.wav
Strangely cannot find file: MacWhinney_021001a_1112.283-1115.609.wav
10526
10526


In [None]:
import keras

temp = keras.preprocessing.sequence.pad_sequences(all_wave, maxlen=159920)
print(np.shape(temp))

In [58]:

alltrainingdata.describe()

Unnamed: 0,id,transcript_id,media_start,media_end,target_child_age,baby,daddy,mommy,dad,car,length,tot
count,10526.0,10526.0,10526.0,10526.0,10526.0,10526.0,10526.0,10526.0,10526.0,10526.0,10526.0,10526.0
mean,10477730.0,26877.227817,1602.227642,1605.459361,28.895044,0.378396,0.15799,0.284534,0.043606,0.16046,3.231718,1.024986
std,6951607.0,15767.110287,2017.8525,2017.967001,10.093961,0.48501,0.364749,0.451213,0.204227,0.367049,1.896246,0.159103
min,324825.0,3350.0,0.778,1.88,7.000144,0.0,0.0,0.0,0.0,0.0,0.001,1.0
25%,2435630.0,9271.0,443.0645,446.12675,21.000431,0.0,0.0,0.0,0.0,0.0,1.86425,1.0
50%,14276990.0,32441.0,959.7535,962.5715,28.493398,0.0,0.0,0.0,0.0,0.0,3.0,1.0
75%,16754490.0,42124.25,1935.11675,1938.517,34.000698,1.0,0.0,1.0,0.0,0.0,4.0665,1.0
max,17254240.0,43674.0,16257.374,16261.229,69.001417,1.0,1.0,1.0,1.0,1.0,9.995,3.0
