# Workbook for Processing the P2FA forced alignment
By Katrina Li 2023.4.27

# Outline

1. Initial check of the boundary

    Use `Check_P2FA.praat`

1. Modify the onset-rhyme boundary, repeat if needed

    Use `modify_boundaries.praat`

1. Generate the f0 tier

    Use `generate_f0_tier.praat`

1. Extract f0 and duration

    Use `extract_acoustics.py`

1. Modify the f0 boundary, repeat if needed

    Use `modify_boundaries.praat`

1. Check visualisation

In [None]:
from graphviz import Digraph
dot = Digraph(comment = 'Outline')
dot.node("CheckB","Check segmentation")
dot.node("Modify","Modify boundary (syllable or f0)",shape = "box")
dot.node("Generatef0","Generate f0 tier")
dot.node("Extract","Extract f0 and duration")
dot.node("Check","Detect f0 tracking error\nCheck data visualisation",shape = "diamond")
dot.node("Finish","Analyse data in R")
# connections
dot.edge("CheckB","Generatef0")
dot.edge("Generatef0", "Extract")
# dot.edge("Generatef0", "Modifyf0")
dot.edge("Modify","Extract",style = "dashed")
dot.edge("Extract","Check")
dot.edge("Check","Finish",label = "good")
dot.edge("Check","Modify",label = "problem", style = "dashed")
dot
dot.render(directory='', view = True).replace('\\', '/')
dot

# Prerequisite

Establish an appropriate folder structure for each language. My folder structure is shown below. Note that the subfolders hold files at intermediate stages.

```
workflow/
├── sound_original/
│   └── S9dia2nDT4.wav
├── textgrid_original/
│   ├── discard/
│   ├── later/
│   ├── processed/
│   └── S10dia1B2_checked.TextGrid
├── textgrid_checked/
│   ├── modify/
│   └── processed/
├── p2falog/
│   └── S9dia2nDT4_log.txt
└── textgrid_pitch_batch/
    ├── discard/
    ├── modify/
    ├── S10dia1B2.PointProcess
    └── S10dia1B2_checked.TextGrid
```
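
The layout above can be created in one go for a new language. This is a minimal sketch, assuming the subfolder names shown in the tree; `make_workflow_dirs`, `SUBFOLDERS` and the `root` argument are hypothetical names and not part of the existing scripts.

```python
from pathlib import Path

# Subfolders copied from the tree above (relative to the workflow folder).
SUBFOLDERS = [
    "sound_original",
    "textgrid_original/discard",
    "textgrid_original/later",
    "textgrid_original/processed",
    "textgrid_checked/modify",
    "textgrid_checked/processed",
    "p2falog",
    "textgrid_pitch_batch/discard",
    "textgrid_pitch_batch/modify",
]


def make_workflow_dirs(root):
    """Create the workflow folder skeleton under `root` (hypothetical helper)."""
    workflow = Path(root) / "workflow"
    for sub in SUBFOLDERS:
        # exist_ok=True makes the call safe to repeat on an existing layout.
        (workflow / sub).mkdir(parents=True, exist_ok=True)
    return workflow
```

Calling it with the language folder as `root` leaves existing files untouched, because only missing folders are created.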

# Process

##  Check the segmentation (syllable level)

In this step, we check the segmentation returned from the P2FA forced alignment.

Run the Praat script `scripts_praat/checkP2FA.praat`. It steps through each onset/rhyme boundary, letting you accept the automatic marking or indicate a new boundary.

- Input: `textgrid_original/XXX.wav.Textgrid` files, and the associated wav files in the `sound_original` folder.

- Output: `textgrid_checked/XXX_checked.Textgrid` files with generated rhyme and syllable tiers. The original files will be moved into the `textgrid_original/processed` subfolder.

On macOS, I recommend using the app BetterTouchTool together with the Praat script, so that you can assign keyboard shortcuts to the script's buttons.

After you finish checking the files, remember to move them all into the `textgrid_original/processed` folder.

## Modify the boundary (f0 or syllable)

There are two ways to organise the modification process.

1. Put the files in a `/modify` folder, either in `textgrid_checked` (to modify onset-syllable boundaries) or in `textgrid_pitch_batch` (to modify the f0 boundaries).

    Run the Praat script `modify_boundaries.praat`, which goes through the files in a folder.

2. Use `open_target_files` to open individual files in either `textgrid_checked` or `textgrid_pitch_batch`, then manually save them to the same place.

### Handy script 1: Move selected files
In the `textgrid_original` folder, move files to `/unprocessed` so that some TextGrids can be processed later. Alternatively, move all the files to the `unprocessed` folder, then move the files to be annotated back out of it. The code below demonstrates this kind of selective move.

In [None]:
# Prerequisite
import os
from pathlib import Path
import shutil
import re

current_lang = "Chengdu"
dir = os.path.join("/Users/kechun/Documents/0_PhD_working_folder", str(current_lang), "workflow")

######################## In checking the P2FA step#####################
# directory = os.path.join(dir, "textgrid_original/processed")
# destination = os.path.join(dir, "textgrid_original/")
######################## Moving the sound files #####################
# directory = os.path.join(dir, "sound_original")
# destination = os.path.join(dir, "sound_original/old4")
######################## In checking the pitch boundary step#####################
directory = os.path.join(dir, "textgrid_pitch_batch")
destination = os.path.join(dir, "textgrid_pitch_batch/modify")

# If no filter criteria apply, leave the list empty.
# A-Z
targetsentence = ["D"]
# 1-5, 2a
targetfocus = ["1","2","3","4","5"]
targetid = []
n_of_files = 0
for ifile in os.listdir(directory):
    if ifile.endswith(".TextGrid") or ifile.endswith(".PointProcess") or ifile.endswith(".wav"):
        filename = ifile.split(".")[0]
        if filename.endswith("_checked"):
            filename = filename[:-8]
        parid = re.split("dia1?2?N?n?1?",filename)[0]
        condition = re.split("dia1?2?N?n?1?",filename)[1]
        tone = re.search("[A-Z]T?",condition).group(0)
        focus = re.search("[1-5]a?",condition).group(0)
        # if (tone in targetsentence or not targetsentence) and (focus in targetfocus or not targetfocus) and (parid in targetid or not targetid):
        #     print(ifile," go through")
        # else:
        #     print("not go through")
        if (tone in targetsentence or not targetsentence) and (focus in targetfocus or not targetfocus) and (parid in targetid or not targetid):
            n_of_files += 1
            shutil.move(os.path.join(directory,ifile), destination)
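
The file-name parsing in the cell above can be tried out on its own before any file is moved. This is a small sketch that reuses the same regular expressions; `parse_name` and `selected` are hypothetical helper names, and the sketch assumes every file name contains `dia`, as in the examples in this workbook.

```python
import re

# Same pattern as in the cell above: "dia" plus optional dialogue markers.
DIALOGUE = re.compile(r"dia1?2?N?n?1?")


def parse_name(ifile):
    """Split a file name into (participant, sentence, focus)."""
    stem = ifile.split(".")[0]
    if stem.endswith("_checked"):
        stem = stem[: -len("_checked")]
    # Raises ValueError if the name does not contain "dia".
    parid, condition = DIALOGUE.split(stem, maxsplit=1)
    sentence = re.search(r"[A-Z]T?", condition).group(0)
    focus = re.search(r"[1-5]a?", condition).group(0)
    return parid, sentence, focus


def selected(value, targets):
    """An empty target list means that no filter applies."""
    return not targets or value in targets


print(parse_name("S9dia2nDT4.wav"))              # ('S9', 'DT', '4')
print(parse_name("S10dia1B2_checked.TextGrid"))  # ('S10', 'B', '2')
```

Note that a sentence code such as `DT` is returned whole, so a filter list of `["D"]` would not select it.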

### Handy script 2: Cut the checked files shorter

For Condition4 files, we need an extra processing step. We exclude the first part of the sentence to reduce the checking workload.

1. Move the TextGrid files from `/unprocessed` to `/old4`, and the sound files into the `sound_original/old4` subfolder.

2. Run the Praat script `cut_condition4.praat` and select the cutting position. The second part of each file is then saved to the `textgrid_original` folder.

3. Move the cut files from `old4` to `old4/processed`. This folder also keeps the TextGrid for the first part of the sentence.

4. You can now follow the previous steps to check the new textgrid files.

### Handy script 3: Open target files

Sometimes we only want to check specific files, and the code below opens them. There is also a Praat script, 'Open target file.praat', which achieves the same thing.

In [None]:
# Prerequisite
import os
from pathlib import Path
import subprocess

############ Specify language and the directory to open textgrids #############
current_lang = "Chengdu"
# folder = "textgrid_pitch_batch"
folder = "textgrid_checked"
################ Specify file list #############
filelist = ['S1diaB2', 'S3diaD2']
# filelist = ['S6dia1nI4','S9dia1nI4','S12dia1nG2','S16dia1nI4','S19dia1nI4','S15dia1nI4','S1dia1nI4','S5dia1nI4','S11dia1nI4','S15dia1nG1','S15dia1I4','S3dia1I4','S3dia1nI4','S9dia1I4','S17dia1nI4','S11dia1I4','S16dia1I4','S10dia1nG2','S10dia1nI4','S4dia1nI4','S6dia1I4','S17dia1I4','S10dia1I4','S14dia1nI4','S19dia1I4','S14dia1I4','S5dia1I4']

dir = os.path.join("/Users/kechun/Documents/0_PhD_working_folder", str(current_lang), "workflow")
directory = os.path.join(dir, folder)
sounddir = os.path.join(dir, "sound_original")
commands = list()
# directory = os.path.join(dir, "textgrid_checked")
allfile = []
for ifile in filelist:
    textgridfile = os.path.join(directory,ifile+"_checked.TextGrid")
    soundfile = os.path.join(sounddir, ifile+".wav")
    # subprocess.call(['/Applications/Praat.app/Contents/MacOS/Praat', '--send', textgridfile])
    # subprocess.call(['/Applications/Praat.app/Contents/MacOS/Praat', '--send', soundfile])
    allfile.append(str(textgridfile))
    allfile.append(str(soundfile))
allfile = " ".join(allfile)
print(allfile)
!/Applications/Praat.app/Contents/MacOS/Praat --open {allfile}
# subprocess.call(['/Applications/Praat.app/Contents/MacOS/Praat', '--open', allfile])
# subprocess.call('/Applications/Praat.app/Contents/MacOS/Praat --open {allfile}')
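
The `!` line in the cell above only works inside IPython/Jupyter, and a path containing spaces would be split into several arguments. A possible alternative, sketched here under the assumption that Praat is installed at the same location, is to pass each path as a separate list item to `subprocess.call`. `praat_open_command` and `open_in_praat` are hypothetical helper names.

```python
import os
import subprocess

PRAAT = "/Applications/Praat.app/Contents/MacOS/Praat"


def praat_open_command(names, textgrid_dir, sound_dir, praat=PRAAT):
    """Build the argument list: one TextGrid and one wav file per name."""
    command = [praat, "--open"]
    for name in names:
        command.append(os.path.join(textgrid_dir, name + "_checked.TextGrid"))
        command.append(os.path.join(sound_dir, name + ".wav"))
    return command


def open_in_praat(names, textgrid_dir, sound_dir):
    """Launch Praat with the files; each path is its own argument."""
    return subprocess.call(praat_open_command(names, textgrid_dir, sound_dir))
```

With the variables from the cell above, the call would be `open_in_praat(filelist, directory, sounddir)`.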

## Generate the f0 tier
After checking the segmentation boundaries, the next step is to modify the 'rhyme' tier to make it suitable for f0 extraction.
The code below calls the Praat script `generate_f0_tier.praat`, which replaces the original rhyme tier with an f0 tier in which the initial and final stretches where no f0 is detected are removed.

- Input: `textgrid_checked/XX_checked.TextGrid` files. Remember to change the language in the code below.

- Output: `textgrid_pitch_batch/XX_checked.TextGrid` files. The original files will be moved into the subfolder `textgrid_checked/processed`.


The current design iterates over the entire folder. Calling Praat as a subprocess is not a very elegant solution, but its advantage is that each call can be followed by a Python step that moves the processed file into another folder. When an error arises, we know which file caused it, can adjust accordingly, and continue with the remaining files.
The alternative is to ask users to specify the files to process (using the commented code), but this is inconvenient when there are many files.

In either case, the idea is to avoid regenerating files, because they may later be adjusted manually and we do not want to overwrite the modified tiers.
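
As an extra guard against overwriting, one could list only the TextGrids that have no counterpart in the output folder yet. This is a sketch of that idea and not part of the current workflow; `pending_textgrids` is a hypothetical helper, and it assumes the input and output files share the same name, as described above.

```python
import os


def pending_textgrids(checked_dir, pitch_dir):
    """Return TextGrid names in `checked_dir` with no counterpart in `pitch_dir`."""
    done = set(os.listdir(pitch_dir))
    return sorted(
        name
        for name in os.listdir(checked_dir)
        # Subfolders such as "processed" are skipped by the suffix test.
        if name.endswith(".TextGrid") and name not in done
    )
```

Looping over `pending_textgrids(directory, pitch_dir)` instead of `os.listdir(directory)` would then skip any file that already has an f0 tier.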

In [None]:
# This file will read specified files, and run the praat script
import subprocess
from itertools import product
import os
import shutil
import parselmouth
from parselmouth.praat import call
import math

################## Change the language###############
current_lang = "Chengdu"
dir = os.path.join("/Users/kechun/Documents/0_PhD_working_folder", str(current_lang), "workflow")
directory = os.path.join(dir, "textgrid_checked")
destination = os.path.join(dir, "textgrid_checked/processed")
# Open the sound
directory_sound = os.path.join(dir, "sound_original")

# par = ["S9",]
# dia = ["dia"]
# sentence = ["A"]
# focus = ["1","2","5"]
# element_list = list(product(par,dia,sentence,focus))

# Arguments to the script
# The first three items call the app and the script
# Arg1: filename
# Arg2: textgriddir
# Arg3: sounddir
# Arg4: f0min (if 0, then the default two-pass pitch range calculation will be used)
# Arg5: f0max (if 0, then the default two-pass pitch range calculation will be used)
# Other variables, such as the file directory and the tiers, can be modified
# Output: the updated f0
#########################   V1   #########################
for ifile in os.listdir(directory):
    if ifile.endswith(".TextGrid"):
        basename = ifile[:-len("_checked.TextGrid")]  # strip the 17-character suffix
        filename = basename + "_checked"
########################   V2   #########################
# for element in element_list:
#     basename = ''.join(element)
#     filename = basename + "_checked"

    # Open the sound
        soundname = os.path.join(directory_sound, basename + ".wav")
        sound = parselmouth.Sound(soundname)
        # Calculate the best f0
        # The first pass
        pitch1 = call(sound, "To Pitch", 0.0, 50, 800)
        min1 = call(pitch1, "Get minimum", 0, 0, "Hertz", "None")
        max1 = call(pitch1, "Get maximum", 0, 0, "Hertz", "None")
        q1 = call(pitch1, "Get quantile", 0, 0, 0.25, "Hertz")
        q3 = call(pitch1, "Get quantile", 0, 0, 0.75, "Hertz")
        q1 = math.floor(q1)
        q3 = math.ceil(q3)
        # The second pass
        defaultf0floor = math.floor((0.7 * q1)/ 10) * 10
        defaultf0ceiling = math.ceil((2.5 * q3)/ 10) * 10
        # print(defaultf0floor)
        # print(defaultf0ceiling)
        # Run the script, not send, 
        subprocess.call(["/Applications/Praat.app/Contents/MacOS/Praat", "--run", "scripts_praat/generate_f0_tier.praat", filename, directory, directory_sound, str(defaultf0floor),str(defaultf0ceiling)])
        path = filename + ".TextGrid"
        shutil.move(os.path.join(directory,path), destination)
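
The second pass in the cell above is plain arithmetic on the first-pass quartiles, so it can be checked without Praat. The sketch below repeats the same formula (0.7 × Q1 rounded down and 2.5 × Q3 rounded up, both to the nearest 10 Hz); `pitch_range` is a hypothetical name and the quartile values are made up for illustration.

```python
import math


def pitch_range(q1, q3):
    """Return (floor, ceiling) in Hz from the first-pass quartiles."""
    q1 = math.floor(q1)
    q3 = math.ceil(q3)
    f0floor = math.floor((0.7 * q1) / 10) * 10
    f0ceiling = math.ceil((2.5 * q3) / 10) * 10
    return f0floor, f0ceiling


# Illustrative quartiles only, not measured data.
print(pitch_range(120.4, 210.2))  # (80, 530)
```

If a recording yields an undefined quartile (NaN), `math.floor` raises a `ValueError`, which is one way a problematic file can surface in the loop.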

## Extract f0 and duration

Open the script `extract_acoustics.py` and specify the relevant parameters (e.g. the language to work on). The script extracts the acoustic parameters for analysis.

- Input: `textgrid_pitch_batch/XXX.TextGrid` files and any associated `PointProcess` or `Pitch` files. In addition, `Template.xlsx` provides the template of read speech to compare against. Note that the `Trim` column in `Template.xlsx` must not contain any punctuation.

- Output: `extract_acoustics_results/230530Cantonese_data.tsv` and `extract_acoustics_results/230530Cantonese_realf0_data.tsv`

This script mainly makes use of the Python package Parselmouth.

In [None]:
%run extract_acoustics.py

## Check visualisation

**Step I**

Open `InTone_Visualisation.Rproj`. 

Open `Rcode_checkf0/01f0clean.Rmd` and specify the data files to be read in (the file names include a date).

- Input: the generated datafiles in `extract_acoustics_results`

- Output: `X_flagfiles.csv` which contains all the files to check, and the visualisation of pitch contours in the folder `03figures`.
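
Because the data files carry a date, it can be handy to look up the newest one before editing the Rmd. This is a sketch under the assumption that the prefix is a six-digit `YYMMDD` date, as `230530Cantonese_data.tsv` suggests; `latest_data_file` is a hypothetical helper.

```python
import re
from pathlib import Path


def latest_data_file(results_dir, language, suffix="_data.tsv"):
    """Return the newest `<YYMMDD><language><suffix>` path, or None if absent."""
    pattern = re.compile(r"(\d{6})" + re.escape(language + suffix))
    dated = []
    for path in Path(results_dir).iterdir():
        match = pattern.fullmatch(path.name)
        if match:
            dated.append((match.group(1), path))
    if not dated:
        return None
    # YYMMDD strings sort chronologically within one century (assumed here).
    return max(dated, key=lambda item: item[0])[1]
```

For the real-f0 file, pass `suffix="_realf0_data.tsv"`.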

**Step II**

Use the *Python* script below to move all the flagged files into the `textgrid_pitch_batch/modify` folder, along with their PointProcess files (if a file has none, one is generated). Modifying the PointProcess file is often needed to correct the pitch.

In [None]:
# Move the textgrid files to modify folder
# Prerequisite
import os
from pathlib import Path
import shutil
import re
import pandas as pd
import parselmouth
from parselmouth.praat import call

#################################################
# read the flagfile csv
flagfiles = pd.read_csv("Rcode_checkf0/02output_files/Chengdu_flagfiles_essential.csv")
# flagfiles = pd.read_csv("Rcode_analysis/Cantonese_dur_outliers.csv")
#################################################

current_lang = "Chengdu"
dir = os.path.join("/Users/kechun/Documents/0_PhD_working_folder", str(current_lang), "workflow")
directory = os.path.join(dir, "textgrid_pitch_batch")
destination = os.path.join(dir, "textgrid_pitch_batch/modify")
sounddirectory = os.path.join(dir,"sound_original")

for index,row in flagfiles.iterrows():
    file_to_move = row["filename"]
    filename = file_to_move + "_checked.TextGrid"
    # print(filename)
    # Read sound files
    soundname = file_to_move + ".wav"
    sound = parselmouth.Sound(os.path.join(sounddirectory,soundname))
    #Check if pointprocess exists
    pointprocess = os.path.join(directory,file_to_move + ".PointProcess")
    pitch = os.path.join(directory,file_to_move + ".Pitch")
    if os.path.exists(pitch):
        shutil.move(pitch,destination)
    if os.path.exists(pointprocess):
        shutil.move(pointprocess,destination)
    else:
        # Read the default f0 floor and ceiling
        defaultf0floor = row["defaultf0floor"]
        defaultf0ceiling = row["defaultf0ceiling"]
        # Generate the PointProcess file
        pointprocess = call(sound, "To PointProcess (periodic, cc)", defaultf0floor,defaultf0ceiling)
        # Save the PointProcess file into the modify folder
        pointprocess.save(os.path.join(destination,file_to_move + ".PointProcess"))
    shutil.move(os.path.join(directory,filename), destination)
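
Before moving anything, it can help to preview what the loop above would do. The sketch below is a dry run using only the standard library: it reads the `filename` column of the flag file and reports which files would be moved and which recordings would need a new PointProcess. `plan_moves` is a hypothetical helper, and nothing is moved or generated.

```python
import csv
import os


def plan_moves(flag_csv, batch_dir):
    """Return (files_to_move, names_needing_pointprocess) for flagged recordings."""
    to_move, need_pointprocess = [], []
    with open(flag_csv, newline="") as handle:
        for row in csv.DictReader(handle):
            name = row["filename"]
            to_move.append(name + "_checked.TextGrid")
            for ext in (".Pitch", ".PointProcess"):
                if os.path.exists(os.path.join(batch_dir, name + ext)):
                    to_move.append(name + ext)
                elif ext == ".PointProcess":
                    # The loop above would generate this file with Parselmouth.
                    need_pointprocess.append(name)
    return to_move, need_pointprocess
```

Printing both lists first makes it easy to spot a flagged name whose TextGrid is not in `textgrid_pitch_batch` before `shutil.move` fails on it.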

**Step III**

I then move all files back to the `textgrid_pitch_batch` folder and rerun `extract_acoustics.py`, followed by `Rcode_checkf0/01f0clean.Rmd` (the two steps above). This is because using the PointProcess files might result in slightly different pitch parameters.
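
Moving the files back can be scripted as well. This is a minimal sketch, assuming that only TextGrid, PointProcess and Pitch files need to return; `move_back` is a hypothetical helper.

```python
import os
import shutil


def move_back(modify_dir, batch_dir,
              extensions=(".TextGrid", ".PointProcess", ".Pitch")):
    """Move annotation files from `modify_dir` to `batch_dir`; return the names moved."""
    moved = []
    for name in sorted(os.listdir(modify_dir)):
        if name.endswith(extensions):
            shutil.move(os.path.join(modify_dir, name),
                        os.path.join(batch_dir, name))
            moved.append(name)
    return moved
```

Depending on the operating system, a file of the same name already present in `batch_dir` may be overwritten, so it is worth checking that folder first.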

**Step IV**

Run `modify_boundaries.praat` to modify the boundaries.

If changing the f0 floor/ceiling is crucial for generating the correct contour (e.g. with creaky voice), remember to click 'Save pitch' so that the corresponding Pitch object is saved, not just the PointProcess.