# TalkBank Data Pipeline

Man Ho Wong  
University of Pittsburgh

Here are the major steps of the pipeline:
1. Get download URLs for target collection(s)
2. Screen for potential datasets in the target collection(s)
3. Download potential datasets to local drive
4. Search for target files in potential datasets and index them
5. Standardize header labels
6. Save the cleaned index table

### Prerequisites:

In [None]:
# See README.md for installation of the following packages
from bs4 import BeautifulSoup
from etc.pittchat import get_age_m              # get age in months
from etc.pittchat import get_age_d              # get age in days
import numpy as np
import os
import pandas as pd
import pickle
import pylangacq                                # read CHAT files
import requests
from tqdm import tqdm                           # progress bar (optional)
import urllib.request
import zipfile

# To get all labels of a given field (e.g. 'group')
def get_labels(var):
    labels_by_corpus = {}
    corpus_set = set(data_idx.corpus)
    for c in corpus_set:
        labels_by_corpus[c] = set(data_idx[var][data_idx.corpus==c])
    return labels_by_corpus

# Pretty printing for better readability:
# - print dict in compact format instead of one item per line
# - Items will be in alphabetical order; Counter in descending order
# - Nested Dict will be printed with suitable indentation
import pprint  # pretty printing
cp = pprint.PrettyPrinter(compact=True, sort_dicts=True)

### Load data from last run (optional):

In [None]:
with open('data/paths.pkl', 'rb') as f:
    paths = pickle.load(f)
with open('data/data_idx_raw.pkl', 'rb') as f:
    data_idx = pickle.load(f)

# 1. Get download URLs for target collection(s)

- The following code gets the download URLs for the zip files of the corpora in the target collection(s).  
- You can specify a target collection by the name of the collection (e.g. "childes") and the name of the folder (e.g. "Eng-NA").

> To find the name of a specific collection or folder:
> 1. Go to the [TalkBank Browser](https://sla.talkbank.org/TBB).
> 2. Select the collection in the pull-down menu and navigate to the folder you are interested in.
> 3. The name of the collection and/or folder is in the directory path under the pull-down menu.

***Note:** This code can only find URLs for public collections. For password-protected collections, please follow the instructions in TalkBank's documentation.*

In [None]:
# Get download URLs for zip files

############################## User Input ######################################
# 1. Enter the name of the TalkBank collection below
bank_name = "   "
# 2. Enter target folder(s) in the collection, e.g. folders=['Eng-NA','Eng-UK']
#    To include all downloadable corpora in the data search, 
#    enter an empty string, i.e. folders = ['']
#    Case-sensitive!
folders = ['   ']
################################################################################

# Find all URLs in the selected collection
root_url = 'https://' + bank_name + '.talkbank.org/data/'
# List of urls (only root_url for now)
urls = [root_url]
print('Looking for URLs in {}...'.format(bank_name))
for url in urls:  # 'urls' grows during the loop, so subfolders are visited too
    if not url.endswith('.zip'): # inspect current url if not a zip file
        # get urls under current url:
        req = requests.get(url)
        soup = BeautifulSoup(req.text, 'html.parser')
        for a in soup.find_all('a'):
            path = a.get('href')
            # add urls under current url to 'urls'
            # (skip anchors without href, query links, and links to the root folder):
            if path and not ('?' in path or path.startswith('/')):
                urls.append(url + path)

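# Get URLs for zip files (any() avoids duplicates when a URL matches several folders)
zip_urls = [url for url in urls
            if url.endswith('.zip') and any(d in url for d in folders)]
print("Done! All URLs for zip files are stored in 'zip_urls'. \n"
      "There are {} downloadable corpora.".format(len(zip_urls)))
if zip_urls:  # show an example only if at least one URL was found
    print("Here is an example: \n{}".format(zip_urls[0]))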
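
The folder filter is a plain substring match on each URL. The sketch below replays it offline on a few made-up URLs (the corpus names `Alpha`, `Beta` and `Gamma` are placeholders, not real corpora) with a hypothetical helper `select_zip_urls`, to show that the match is case-sensitive and that `['']` keeps every zip file.

```python
def select_zip_urls(urls, folders):
    """Keep zip URLs whose path contains at least one of the folder names."""
    return [url for url in urls
            if url.endswith('.zip') and any(d in url for d in folders)]

# Made-up URLs for illustration only
root = 'https://childes.talkbank.org/data/'
sample_urls = [
    root,
    root + 'Eng-NA/',
    root + 'Eng-NA/Alpha.zip',
    root + 'Eng-UK/Beta.zip',
    root + 'French/Gamma.zip',
]

print(select_zip_urls(sample_urls, ['Eng-NA', 'Eng-UK']))  # the two English corpora
print(select_zip_urls(sample_urls, ['eng-na']))            # [] (case-sensitive)
print(len(select_zip_urls(sample_urls, [''])))             # 3 (all zip files)
```

Because the match is on substrings, a short folder name such as `'Eng'` would select both `Eng-NA` and `Eng-UK`.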

# 2. Screen for potential datasets in the target collection(s) 

## Step 2A (To skip, go to Step 2B)
- The following code screens for potential datasets that contain the target file(s) you need.

- **Strategies:** This step is optional but recommended for efficiency: if your target collection is large, it narrows the scope of the data search later in this pipeline. The screening criteria do not need to be stringent; general criteria that are good enough to narrow the search will do. The goal is to find the datasets containing at least one file that satisfies *some* of your data requirements. A more detailed data search is done in a later step.

- This step uses the `pylangacq` package to look up the metadata in file headers. Please see the package's documentation for instructions.
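
As an illustration of what a screening condition might look like, the sketch below tests a header for a participant coded `CHI`. It runs on hand-written dicts that mimic the header layout described in the `pylangacq` documentation (a `'Participants'` key mapping participant codes to their details). The sample values and the helper name `has_target_child` are made up for this example, so inspect a real header before relying on these keys.

```python
def has_target_child(header):
    """Return True if the header lists a participant coded 'CHI'."""
    participants = header.get('Participants', {})  # missing field -> empty dict
    return 'CHI' in participants

# Hand-written headers for illustration only (not taken from a real corpus)
header_with_child = {
    'Languages': ['eng'],
    'Participants': {
        'CHI': {'role': 'Target_Child', 'group': 'TD'},
        'MOT': {'role': 'Mother', 'group': ''},
    },
}
header_without_child = {
    'Languages': ['eng'],
    'Participants': {'INV': {'role': 'Investigator'}},
}

print(has_target_child(header_with_child))     # True
print(has_target_child(header_without_child))  # False
print(has_target_child({}))                    # False
```

Inside the screening loop, the equivalent condition would be `if 'CHI' in h.get('Participants', {}):`.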

In [None]:
# list of corpora matching the criteria
search_result = []
# Inspect every corpus on the 'zip_urls' list:
for url in tqdm(zip_urls):  # tqdm for progress bar
    corpus = pylangacq.read_chat(url) # read current corpus into a Reader
    for h in corpus.headers():
        ########################### User Input #################################        
        # Insert if-conditions for search criteria below:
        if (        ):
        ########################################################################
            search_result.append(url)  # add corpus to 'search_result'
            break  # move on to the next corpus once a file matches the criteria

print('\n{} corpora matching the criteria:'.format(len(search_result)))
# Create a dataframe to store corpus info
corpus_names = [url.split('/')[-1].replace('.zip','') for url in search_result]
homepages    = [url.replace('data','access').replace('zip','html') for url in search_result]
# Slice off the root URL and '.zip' (str.lstrip/rstrip strip character sets, not prefixes/suffixes)
local_paths  = ['data/'+bank_name+'/'+url[len(root_url):-len('.zip')] for url in search_result]
paths = pd.DataFrame({'corpus':corpus_names,'homepage':homepages,
                      'zip_url':search_result,'local_path':local_paths})
paths

# Save 'paths' (create the 'data' folder first if it does not exist yet)
os.makedirs('data', exist_ok=True)
with open('data/paths.pkl', 'wb') as f:
    pickle.dump(paths, f, -1)
print("'paths' was saved as 'data/paths.pkl'. ")
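
The corpus name, homepage, and local path are all derived from the zip URL by string manipulation. The sketch below shows the local-path derivation on a made-up URL (the corpus name `Philip` is a placeholder) with a hypothetical helper `url_to_local_path`. It slices the URL by length because `str.lstrip` and `str.rstrip` remove any run of the given characters rather than a literal prefix or suffix.

```python
def url_to_local_path(url, root_url, bank_name):
    """Map a corpus zip URL to the local folder the corpus is extracted to."""
    relative = url[len(root_url):-len('.zip')]  # e.g. 'Eng-NA/Philip'
    return 'data/' + bank_name + '/' + relative

root_url = 'https://childes.talkbank.org/data/'
url = root_url + 'Eng-NA/Philip.zip'  # made-up corpus name

print(url_to_local_path(url, root_url, 'childes'))  # data/childes/Eng-NA/Philip
# rstrip treats '.zip' as a set of characters and eats into the name:
print('Philip.zip'.rstrip('.zip'))                  # Phil
```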

## Step 2B (Run this if Step 2A was skipped)

In [None]:
# Uncomment and run the following lines if Step 2A was skipped

# search_result = zip_urls  # no screening: treat every corpus found in Step 1 as a potential dataset
# corpus_names = [url.split('/')[-1].replace('.zip','') for url in search_result]
# homepages    = [url.replace('data','access').replace('zip','html') for url in search_result]
# local_paths  = ['data/'+bank_name+'/'+url[len(root_url):-len('.zip')] for url in search_result]
# paths = pd.DataFrame({'corpus':corpus_names,'homepage':homepages,
#                       'zip_url':search_result,'local_path':local_paths})
# os.makedirs('data', exist_ok=True)
# with open('data/paths.pkl', 'wb') as f:
#     pickle.dump(paths, f, -1)
# print("'paths' was saved as 'data/paths.pkl'. ")
# paths

# 3. Download potential datasets to local drive

The following code downloads and extracts the potential datasets to your local drive. A folder named "data" will be created in the current directory to store the extracted data.

In [None]:
for url in paths['zip_url']:    
    print('Downloading and extracting {}...'.format(url))
    # create directories
    fname = url.split("/")[-1]
    # Slice off the root URL and the file name (str.lstrip/rstrip strip character sets, not substrings)
    dest_dir = 'data/' + bank_name + '/' + url[len(root_url):-len(fname)]
    os.makedirs(dest_dir, exist_ok=True)
    # Download corpus from URL
    zip_path = dest_dir + fname
    urllib.request.urlretrieve(url, zip_path)
    # Extract zip file
    with zipfile.ZipFile(zip_path, 'r') as z:
        z.extractall(dest_dir)
    os.remove(zip_path)
       
print("Done! Zip files were downloaded and extracted to 'data/'. ")
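
To see the directory layout this produces without downloading anything, the sketch below builds a small zip file in a temporary folder and then applies the same create-folder, extract, and delete steps. The file names (`Demo.zip`, `Demo/sample.cha`) are placeholders.

```python
import os
import tempfile
import zipfile

with tempfile.TemporaryDirectory() as tmp:
    # Stand-in for a downloaded corpus: a zip holding one CHAT file
    dest_dir = os.path.join(tmp, 'data', 'childes', 'Eng-NA')
    os.makedirs(dest_dir, exist_ok=True)  # no error if the folder already exists
    zip_path = os.path.join(dest_dir, 'Demo.zip')
    with zipfile.ZipFile(zip_path, 'w') as z:
        z.writestr('Demo/sample.cha', '@Begin\n@End\n')

    # Extract next to the zip file, then delete the zip file
    with zipfile.ZipFile(zip_path, 'r') as z:
        z.extractall(dest_dir)
    os.remove(zip_path)

    extracted = sorted(os.listdir(os.path.join(dest_dir, 'Demo')))
    zip_still_there = os.path.exists(zip_path)

print(extracted)        # ['sample.cha']
print(zip_still_there)  # False
```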

# 4. Search for target files in potential datasets and index them

- The following code searches for files that satisfy your data requirements by checking the metadata in their file headers. An index table will be generated to store the metadata of these files for easier file lookup later. Each row of the index table is a file (i.e. an entry) and each column is a file header field. You can name each column however you prefer.

> File header fields are listed in the [CHAT manual](https://talkbank.org/manuals/CHAT.html). Note that not all files contain the same header fields. To check what header fields are available in a file, you may view the header of the file in the TalkBank Browser or download the file and view it on your computer.

- This step uses the `pylangacq` package to look up the metadata in file headers. Please see the package's documentation for instructions.
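
As an illustration, the sketch below fills one row of the index from a header and replaces missing values with an empty string. The header is a hand-written dict that mimics the layout described in the `pylangacq` documentation; the column names, the sample values, and the helper `build_info` are all assumptions made for this example.

```python
def build_info(header, file_path):
    """Collect the fields for one row of the index table."""
    child = header.get('Participants', {}).get('CHI', {})
    return {
        'file_path': file_path,
        'corpus': child.get('corpus', ''),  # '' marks a missing value
        'sex': child.get('sex', ''),
        'group': child.get('group', ''),
    }

# Hand-written header for illustration only; 'group' is deliberately absent
header = {
    'Participants': {
        'CHI': {'corpus': 'Demo', 'sex': 'female'},
    },
}

info_dict = build_info(header, 'data/childes/Eng-NA/Demo/sample.cha')
print(info_dict['corpus'])       # Demo
print(repr(info_dict['group']))  # ''
```

The keys of the returned dict must match the column names given to `data_idx`.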

In [None]:
############################## User Input ######################################
# Create an index table by building an empty dataframe with desired column names
data_idx = pd.DataFrame(columns=[                   ])
################################################################################

# Search for target files and store their metadata to 'data_idx'
for corpus_dir in tqdm(paths['local_path']):       # tqdm for progress bar  
    corpus = pylangacq.Reader.from_dir(corpus_dir) # Read corpus into a Reader
    for f in corpus:                               # Loop through each file
        h = f.headers()[0]                         # Header of each file
        ############################ User Input ################################
        # Insert below the conditions to skip files you don't need:
        # e.g. skip files without 'CHI' in the participant field of the header:
        # if 'CHI' not in h['Participants']: continue
        
        # Create a dict with field names SAME as column names in data_idx
        # Number of fields must be equal to the number of columns in data_idx!
        # Insert below the code to retrieve file/ header info in each field
        # Info can be retrieved from the header (h) or the file (f)
        # Tips:
        # - Age: Age info in the header is in (y,m,d) format. Use get_age_d or
        #   get_age_m to convert ymd to age in days or months
        # - File paths: replace '/' in file paths with '\' for Windows
        # - Handling missing values:
        #   Some files may have missing fields in the header, use if-condition
        #   to spot missing values and replace them with nan or empty string ''
        
        info_dict = {    }
        ########################################################################
        # Fill the next row of data_idx with info_dict
        data_idx.loc[len(data_idx)] = info_dict
                    
# Replace empty string with 'unspecified'
data_idx.replace(to_replace = '', value = 'unspecified', inplace=True)
data_idx.head()

# Save unprocessed 'data_idx' as 'data/data_idx_raw.pkl'
with open('data/data_idx_raw.pkl', 'wb') as f:
    pickle.dump(data_idx, f, -1)
    
print("'data_idx' was saved as 'data/data_idx_raw.pkl'. ")

---

# 5. Standardize header labels

Different studies may use different text (called "header labels" in this pipeline) to specify the same recording condition in the file header (e.g. either `TD` or `typical` might be used to label files from typically developing children). In addition, because metadata may be entered manually, it can contain human errors such as typos or missing data. To correctly index files recorded under the same conditions in different studies, we need to standardize the header labels from all the studies.

Here is the general workflow:
1. Inspect header labels
2. Check study design or header label definition
3. Integrate header labels in the index table
4. Replace missing labels

From now on, the pipeline will only work on the index table instead of the raw data.

## 5.1 Inspect header labels
To get all the labels used for a specific header field (i.e. column) in the index table `data_idx`, you can call this function (defined at the beginning of this notebook): `get_labels( <column name> )`.

In [None]:
############################## User Input ######################################
# Enter the column name of data_idx you want to inspect
cp.pprint( get_labels('') )           # pprint for compact printing
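
For illustration, the sketch below runs the same function on a toy index table. The corpus names and labels are made up.

```python
import pandas as pd

# Toy index table for illustration only
data_idx = pd.DataFrame({
    'corpus': ['CorpusA', 'CorpusA', 'CorpusB', 'CorpusB'],
    'group':  ['TD', 'TD', 'typical', 'unspecified'],
})

def get_labels(var):
    """Return the set of labels used in column `var`, keyed by corpus."""
    labels_by_corpus = {}
    for c in set(data_idx.corpus):
        labels_by_corpus[c] = set(data_idx[var][data_idx.corpus == c])
    return labels_by_corpus

labels = get_labels('group')
print(labels['CorpusA'])          # {'TD'}
print(sorted(labels['CorpusB']))  # ['typical', 'unspecified']
```

Here `CorpusA` and `CorpusB` use different labels (`TD` and `typical`) for what may be the same condition, which is the kind of mismatch the next steps resolve.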

## 5.2 Check study design / header label definition

Please visit the homepages of the datasets for more info:

In [None]:
paths['homepage'].values

## 5.3 Integrate header labels in the index

The goal of this step is to standardize header labels so that all files are indexed with a common set of labels for every header field.
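
One way to do this, sketched below on a toy table, is to map each variant to a standard label with `Series.replace`. The labels and the mapping are made up for illustration; build your own mapping from the labels found in Step 5.1 and the corpus documentation from Step 5.2.

```python
import pandas as pd

# Toy index table for illustration only
data_idx = pd.DataFrame({
    'corpus': ['CorpusA', 'CorpusB', 'CorpusB'],
    'group':  ['TD', 'typical', 'Typical'],
})

# Hypothetical mapping: variant label -> standard label
group_map = {'typical': 'TD', 'Typical': 'TD'}
data_idx['group'] = data_idx['group'].replace(group_map)

print(sorted(set(data_idx['group'])))  # ['TD']
```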

## 5.4 Replace missing labels

Missing labels are filled in according to the documentation of the corpus that each file comes from.
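
As a sketch, assume the documentation of a made-up corpus `CorpusB` states that all of its children are typically developing. The `'unspecified'` placeholder written in Step 4 can then be replaced for that corpus only:

```python
import pandas as pd

# Toy index table for illustration only
data_idx = pd.DataFrame({
    'corpus': ['CorpusA', 'CorpusB', 'CorpusB'],
    'group':  ['unspecified', 'TD', 'unspecified'],
})

# Rows from CorpusB whose group label is missing
missing = (data_idx['corpus'] == 'CorpusB') & (data_idx['group'] == 'unspecified')
data_idx.loc[missing, 'group'] = 'TD'

print(list(data_idx['group']))  # ['unspecified', 'TD', 'TD']
```

The row from `CorpusA` is left untouched because its documentation was not consulted in this example.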

# 6. Save the cleaned index table

You may use the index table, `data_idx`, for retrieving data in your dataset. Save the table for data analysis later.

In [None]:
# Save cleaned 'data_idx' as 'data/data_idx_cleaned.pkl'
with open('data/data_idx_cleaned.pkl', 'wb') as f:
    pickle.dump(data_idx, f, -1)    
print("Cleaned 'data_idx' was saved as 'data/data_idx_cleaned.pkl'. ")
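
In a later session the table can be loaded back and filtered to find the files you need. The sketch below round-trips a toy table through a temporary pickle file; the column names and values are made up.

```python
import os
import pickle
import tempfile

import pandas as pd

# Toy index table for illustration only
data_idx = pd.DataFrame({
    'file_path': ['a.cha', 'b.cha', 'c.cha'],
    'group':     ['TD', 'TD', 'unspecified'],
})

with tempfile.TemporaryDirectory() as tmp:
    pkl_path = os.path.join(tmp, 'data_idx_cleaned.pkl')
    with open(pkl_path, 'wb') as f:
        pickle.dump(data_idx, f, -1)  # -1 selects the highest pickle protocol
    with open(pkl_path, 'rb') as f:
        loaded = pickle.load(f)

# Paths of all files labeled 'TD'
td_files = list(loaded.loc[loaded['group'] == 'TD', 'file_path'])
print(td_files)  # ['a.cha', 'b.cha']
```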

In [None]:
# Uncomment the following lines to read data into readers and export as pickle

# corpus_readers = [pylangacq.Reader.from_dir(d) for d in tqdm(paths['local_path'])]
# with open('data/corpus_readers.pkl', 'wb') as f:
#     pickle.dump(corpus_readers, f, -1)    
# print("'corpus_readers' was saved as 'data/corpus_readers.pkl'. ")