## Convert TextGrids to CSV

This Jupyter Notebook converts TextGrid files to CSV files so that they are easier to analyze. It was written with the .flac and .TextGrid files from the Seoul Corpus (Yun et al. 2015) in mind.

## Packages

In [1]:
from praatio import textgrid
import glob
import numpy as np
import pandas as pd
from tqdm.notebook import tqdm

## tg2df

*Modified from "Doing phonetic research on the Seoul Corpus" paper and scripts. Located at https://osf.io/ukh6d/overview?view_only=d9bda726aebe4512830c6996f2ae4cae*

**tg2df** creates a dataframe with one row per interval in the specified tier: the source file name, the interval label, and its start and end times.

For the Seoul Corpus data, the tiers are the following:

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;0 -- Phoneme  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;1 -- Word/Eojeol (pronounced, hangeul)  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;2 -- Word/Eojeol (pronounced, IPA)  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;3 -- Utterance (pronounced)  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;4 -- Word/Eojeol (orthography, hangeul)  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;5 -- Word/Eojeol (orthography, IPA)  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;6 -- Utterance (orthography)

**Inputs**:
- string&nbsp;&nbsp;&nbsp;*filename*
- int&nbsp;&nbsp;&nbsp;*tier_num*

**Outputs**: (dataframe) of the chosen tier, with columns `InputFile`, `text`, `start`, and `end`

In [3]:
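def tg2df(filename, tier_num):
  # Keep unlabeled intervals as rows and rename tiers that share a name.
  tg = textgrid.openTextgrid(filename, includeEmptyIntervals=True, duplicateNamesMode='rename')
  tiers = tg.tierNames

  # Look the tier up by position (0 = Phoneme ... 6 = Utterance).
  tier = tg.getTier(tiers[tier_num])
  # Drop the 9-character ".TextGrid" suffix to get the recording name.
  InputFile = filename[:-9]
  # Entries are (start, end, label); reorder them to match the column names.
  data = [(InputFile, label, start, end) for start, end, label in tier.entries]
  df = pd.DataFrame(data, columns=['InputFile', 'text', 'start', 'end'])

  return df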
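The `filename[:-9]` slice in `tg2df` works because `.TextGrid` is exactly nine characters long, so it assumes every input ends in that extension. A minimal sketch of a more tolerant alternative follows, using `pathlib.Path.stem` from the standard library. The helper name `input_file_label` is hypothetical and not part of the original scripts. Unlike the slice, it also drops any leading directory.

```python
from pathlib import Path

def input_file_label(filename):
    """Return the file name without its directory and final extension."""
    return Path(filename).stem

name = "s01m16f1.TextGrid"
print(name[:-9])                                     # s01m16f1
print(input_file_label(name))                        # s01m16f1
print(input_file_label("corpus/s01m16f1.TextGrid"))  # s01m16f1
```

Whether dropping the directory is desirable depends on how `InputFile` is used later, so treat this as an option rather than a replacement.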

## merge_all_tgs

*Modified from "Doing phonetic research on the Seoul Corpus" paper and scripts. Located at https://osf.io/ukh6d/overview?view_only=d9bda726aebe4512830c6996f2ae4cae*

**merge_all_tgs** applies tg2df to every file in a list and combines the results into one dataframe, sorted by file name and then by start time.

**Inputs**:
- string list&nbsp;&nbsp;&nbsp;*list_of_files*
- int&nbsp;&nbsp;&nbsp;*tier_num*

**Outputs**: (dataframe) containing the rows from all TextGrids

In [4]:
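def merge_all_tgs(list_of_files, tier_num):
  # Convert every file first, then concatenate once; growing a dataframe
  # inside the loop would copy all earlier rows on each iteration.
  frames = [tg2df(f, tier_num) for f in tqdm(list_of_files, leave=True)]
  df = pd.concat(frames, axis=0, ignore_index=True)
  # Order rows by recording, then by time within each recording.
  df = df.sort_values(by=['InputFile', 'start']).reset_index(drop=True)
  return df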
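To show what the combining and sorting steps do without needing any TextGrid files, here is a small sketch that uses hand-made frames in place of real `tg2df` output. The file names, labels, and times are invented for illustration only.

```python
import pandas as pd

# Hypothetical rows standing in for the output of tg2df on two files.
columns = ["InputFile", "text", "start", "end"]
frame_b = pd.DataFrame([("s02f20m1", "b", 0.0, 0.4)], columns=columns)
frame_a = pd.DataFrame(
    [("s01m16f1", "a2", 0.5, 0.9), ("s01m16f1", "a1", 0.0, 0.5)],
    columns=columns,
)

# Stack the frames, then sort by recording and start time.
merged = pd.concat([frame_b, frame_a], axis=0, ignore_index=True)
merged = merged.sort_values(by=["InputFile", "start"]).reset_index(drop=True)
print(merged)
```

After sorting, the rows for `s01m16f1` come first in time order, followed by the row for `s02f20m1`, and the index runs from 0 again. A frame like this can then be written out with `DataFrame.to_csv`, for example with `index=False` to leave out the row index.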

## get_filelist

**get_filelist** returns the names (without the directory) of all files in *path* that have the given extension.

**Inputs**:
- Path()&nbsp;&nbsp;&nbsp;*path*
- string&nbsp;&nbsp;&nbsp;*extension*

**Outputs**: (string list) of filenames

In [9]:
def get_filelist(path, extension):
    # Match files directly inside path; f.name drops the directory part.
    return [f.name for f in path.glob("*." + extension)]
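Because `get_filelist` returns bare file names, they need to be joined with the directory again before being opened from a different working directory. The sketch below demonstrates this on a temporary directory with empty placeholder files. The file names are invented, and the sorting is added here only because `glob` does not guarantee an order.

```python
import tempfile
from pathlib import Path

def get_filelist(path, extension):
    return [f.name for f in path.glob("*." + extension)]

with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    # Create empty placeholder files to stand in for corpus data.
    for name in ["s01m16f1.TextGrid", "s01m16f1.flac", "s02f20m1.TextGrid"]:
        (folder / name).touch()

    names = sorted(get_filelist(folder, "TextGrid"))
    # Rebuild full paths for functions that open the files.
    full_paths = [str(folder / n) for n in names]

print(names)  # ['s01m16f1.TextGrid', 's02f20m1.TextGrid']
```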

## get_subsection

**get_subsection** takes the list of files from the Seoul Corpus and filters it by speaker age range (both bounds inclusive) and speaker gender ("m" or "f"). Either filter can be omitted.

Seoul Corpus filenames are written like:  **s01m16f1**

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;s01 -- Speaker 1  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;m16 -- 16-year-old male speaker  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;f -- Interviewer gender  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;1 -- File number  

**Inputs**:
- string list&nbsp;&nbsp;&nbsp;*filenames*
- (int, int) tuple&nbsp;&nbsp;&nbsp;*age* (minimum and maximum age, inclusive; optional)
- string&nbsp;&nbsp;&nbsp;*gender* ("m" or "f"; optional)

**Outputs**: (string list) of filenames

In [14]:
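def get_subsection(filenames, age=None, gender=None):
    new_filenames = []

    for filename in filenames:
        # Positions follow the s01m16f1 pattern: gender at index 3, age at 4-5.
        g = filename[3]
        a = int(filename[4:6])

        # Skip speakers of the other gender when a gender filter is given.
        if gender is not None and g != gender:
            continue

        # age is a (min, max) pair; both bounds are inclusive.
        if age is not None and not (age[0] <= a <= age[1]):
            continue

        new_filenames.append(filename)

    return new_filenames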
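The filename scheme described above can also be read with a regular expression instead of fixed character positions. This is only a sketch of an alternative, not the method used in the original scripts. The names `NAME_PATTERN` and `parse_name` and the example file names are hypothetical.

```python
import re

# Fields in names such as "s01m16f1": speaker number, speaker gender,
# speaker age, interviewer gender, file number.
NAME_PATTERN = re.compile(r"^s(\d+)([mf])(\d+)([mf])(\d+)")

def parse_name(filename):
    """Split a Seoul Corpus style file name into its fields, or return None."""
    match = NAME_PATTERN.match(filename)
    if match is None:
        return None
    speaker, gender, age, interviewer, number = match.groups()
    return {
        "speaker": int(speaker),
        "gender": gender,
        "age": int(age),
        "interviewer": interviewer,
        "file_number": int(number),
    }

files = ["s01m16f1.TextGrid", "s02f25m2.TextGrid", "s03m31f1.TextGrid"]

# Keep male speakers aged 10 to 19, both bounds inclusive.
teen_males = []
for f in files:
    info = parse_name(f)
    if info and info["gender"] == "m" and 10 <= info["age"] <= 19:
        teen_males.append(f)

print(teen_males)  # ['s01m16f1.TextGrid']
```

With the function from this notebook, the same selection would be `get_subsection(files, age=(10, 19), gender="m")`.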