# Accessing and Searching Cantonese Corpora in PyCantonese

[Jackson Lee](https://jacksonllee.com/)

May 16, 2021 (tutorial at the [School of Cantonese Studies](https://www.eduhk.hk/lml/scs2021/), the Education University of Hong Kong)

Source of this Jupyter notebook: https://github.com/jacksonllee/pycantonese/blob/main/docs/tutorials/lee-pycantonese-2021-05-16.ipynb

The easiest way to immediately play with this notebook is to log on to your Google account (Gmail, etc.) and [open this notebook in Google Colab](https://colab.research.google.com/github/jacksonllee/pycantonese/blob/main/docs/tutorials/lee-pycantonese-2021-05-16.ipynb). You'll have your own copy to run cells from, make changes, and save in your Google Drive.

## Introduction

This tutorial introduces [PyCantonese](https://pycantonese.org/), a Python library for Cantonese linguistics and natural language processing. In particular, this tutorial focuses on the following topics and provides hands-on exercises:

1. Accessing existing Cantonese corpora
2. Searching corpora programmatically

## Download and Install PyCantonese

Like most open-source Python libraries, PyCantonese can be installed with the `pip install` terminal command. For reproducibility, this tutorial pins the version to v3.3.0, the latest release at the time of writing.

In [None]:
!pip install pycantonese==3.3.0

## 1. Accessing Existing Cantonese Corpora

For a corpus to be useful for PyCantonese, its source data files have to be publicly available in a machine-readable format, and the format has to be the [CHAT](https://talkbank.org/manuals/CHAT.pdf) format. This format was chosen for an important reason: A cluster of Cantonese corpora with publicly available data files are those on CHILDES and TalkBank, thanks to research on Cantonese language acquisition.

PyCantonese is shipped with the [Hong Kong Cantonese Corpus](http://compling.hss.ntu.edu.sg/hkcancor/) (HKCanCor, CC BY license). We are going to use this corpus a lot in this tutorial.

### Getting Started with HKCanCor

To give you a sense of what the HKCanCor data looks like, here are the first three utterances as they've been transformed to CHAT and incorporated into PyCantonese:

```
*XXA:	喂 遲 啲 去 唔 去 旅行 啊 ?
%mor:	e|wai3 a|ci4 u|di1 v|heoi3 d|m4 v|heoi3 vn|leoi5hang4 y|aa3 ?
*XXA:	你 老公 有冇 平 機票 啊 ?
%mor:	r|nei5 n|lou5gung1 v1|jau5mou5 a|peng4 n|gei1piu3 y|aa3 ?
*XXB:	平 機票 要 淡季 先 有得 平 𡃉 喎 .
%mor:	a|peng4 n|gei1piu3 vu|jiu3 an|daam6gwai3 d|sin1 vu|jau5dak1 a|peng4 y|gaa3 y|wo3 .
```
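To make the layout of these two tiers concrete, here is a minimal sketch that lines up the words of the first utterance with their `%mor` annotations using only plain string operations. The helper names `split_tier` and `parse_mor` are made up for illustration. This is not how PyCantonese reads CHAT files internally, and it assumes each `%mor` item has the shape `pos|jyutping`, as in the sample above.

```python
# The first utterance from the sample above: main tier and %mor tier.
main_tier = "*XXA:\t喂 遲 啲 去 唔 去 旅行 啊 ?"
mor_tier = "%mor:\te|wai3 a|ci4 u|di1 v|heoi3 d|m4 v|heoi3 vn|leoi5hang4 y|aa3 ?"


def split_tier(line):
    """Drop the tier label (everything before the tab) and split on whitespace."""
    _, _, content = line.partition("\t")
    return content.split()


def parse_mor(item):
    """Split 'pos|jyutping'. Punctuation such as '?' has no '|' and no Jyutping."""
    pos, separator, jyutping = item.partition("|")
    if not separator:
        return item, None
    return pos, jyutping


surface_words = split_tier(main_tier)
annotations = [parse_mor(item) for item in split_tier(mor_tier)]

# Pair each word with its (pos, jyutping) annotation by position.
aligned = list(zip(surface_words, annotations))
for word, (pos, jyutping) in aligned:
    print(word, pos, jyutping)
```

The two tiers align one-to-one by position, which is why a simple `zip` is enough here.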

In [None]:
import pycantonese

corpus = pycantonese.hkcancor()

What is HKCanCor's word count?

In [None]:
words = corpus.words()

In [None]:
len(words)

Let's check out the first 10 words.

In [None]:
words[:10]

Hmm, the output got cut off arbitrarily at 你, because `corpus.words()` has intentionally removed the *utterance-level* data structure in order to expose the corpus data as a flat (and very long) list of words.

### [ Exercise 1 ]

`words` is a list of the words in HKCanCor. `words[:10]` gives you the first 10 words. If you want to see the first 20 words, what do you do? Type your code in the cell below and run it.

### Bringing Back the Utterances

Let's look at the words again, but this time we use a list to wrap around each utterance's words.

In [None]:
words_by_utterances = corpus.words(by_utterances=True)

If we ask for a count by `len()` again, it's the number of utterances.

In [None]:
len(words_by_utterances)

Here are the first two utterances.

In [None]:
words_by_utterances[:2]  # a list of two lists

Let's check them out again but with a `for` loop instead. In actual work, a `for` loop is going to be much more useful for walking through corpus data and grabbing whatever is of interest. For now, we're simply `print`ing the two utterances as a start.

In [None]:
for utterance in words_by_utterances[:2]:
    print(utterance)  # each utterance is a list of strings
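To see how the flat list and the by-utterance list relate, here is a small sketch on stand-in data: the three HKCanCor utterances shown earlier, typed in by hand rather than loaded through PyCantonese. The `toy_` names are made up for illustration.

```python
from itertools import chain

# Stand-in for corpus.words(by_utterances=True): a list of lists of strings.
toy_by_utterances = [
    ["喂", "遲", "啲", "去", "唔", "去", "旅行", "啊", "?"],
    ["你", "老公", "有冇", "平", "機票", "啊", "?"],
    ["平", "機票", "要", "淡季", "先", "有得", "平", "𡃉", "喎", "."],
]

# Flattening removes the utterance boundaries and leaves one long list of words.
toy_flat = list(chain.from_iterable(toy_by_utterances))

print(len(toy_by_utterances))  # number of utterances
print(len(toy_flat))           # number of words
print(toy_flat[:10])           # the tenth word already belongs to the second utterance
```

On this toy data, the first ten words of the flat list run past the end of the first utterance, which is the same effect we saw with `words[:10]`.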

### Visualization Example: Utterance Lengths

With access to the corpus source data, you can see just about anything you want. Besides examining contents of particular interest, you can zoom out for more general statistics and visualize them. Since we have the utterances in hand, here's something we can pull out real quick: what does the distribution of utterance lengths in HKCanCor look like?

In [None]:
utterance_lengths = []

for utterance in words_by_utterances:

    # Utterances that are too long seem suspicious.
    # Deciding where to cut for utterance boundaries is generally hard...
    if len(utterance) > 50:
        continue

    utterance_lengths.append(len(utterance))

In [None]:
# Install packages for making plots.
!pip install seaborn==0.11.1 matplotlib pandas

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

plt.rcParams["figure.figsize"] = (10, 6)
sns.histplot(utterance_lengths, binwidth=2).set(xlabel="Utterance length", ylabel="Count", title="Count against utterance length (<= 50) in HKCanCor")
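If you just want the numbers behind a histogram like this, a tally with `collections.Counter` from the standard library works too. The sketch below uses a handful of hypothetical lengths (`toy_lengths`) rather than the real HKCanCor values, and applies the same cutoff of 50 words.

```python
from collections import Counter

# Hypothetical utterance lengths standing in for the real HKCanCor values.
toy_lengths = [9, 7, 10, 3, 3, 120, 1, 7]

# Same cutoff as above: skip utterances longer than 50 words.
kept = [n for n in toy_lengths if n <= 50]

# Map each utterance length to the number of utterances with that length.
length_counts = Counter(kept)

# Print a rough text histogram, shortest utterances first.
for length in sorted(length_counts):
    print(f"{length:>3} {'#' * length_counts[length]}")
```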

### What About CHILDES / TalkBank?

To illustrate how you may access a Cantonese CHILDES dataset, let's use the Cantonese monolingual [Lee-Wong-Leung corpus](https://childes.talkbank.org/access/Chinese/Cantonese/LeeWongLeung.html). All you need is the URL that points to the ZIP file containing the CHAT data files.

In [None]:
corpus = pycantonese.read_chat("https://childes.talkbank.org/data/Chinese/Cantonese/LeeWongLeung.zip")

Before the previous cell, `corpus` referred to HKCanCor. Now `corpus` has been repurposed to represent the Lee-Wong-Leung corpus instead. This is intentional: it signals that both `corpus` objects offer the same functionality.

### [ Exercise 2 ]

Write and run code to answer the following questions. The code you need can be found from above.

How many words are there in the Lee-Wong-Leung corpus?

How many utterances are there in the Lee-Wong-Leung corpus?

Using a `for` loop, show the first ten utterances of the Lee-Wong-Leung corpus.

### Visualization Example: Mean Length of Utterance (MLU)

The mean length of utterance (MLU) is a standard measure of productivity in language acquisition research. Since PyCantonese uses the CHAT format from CHILDES and TalkBank, it can readily compute language development measures such as MLU.
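To show the arithmetic behind the name, here is a simplified, word-based sketch of MLU: the average number of words per utterance, with punctuation left out. This is an assumption-laden stand-in. The `.mlu()` method may count morphemes and apply further exclusions, so don't expect identical numbers. The data is the three HKCanCor utterances shown earlier (adult speech, used here only for the arithmetic), and `mean_length_of_utterance` and `PUNCTUATION` are made-up names.

```python
# Stand-in data: three utterances as lists of words.
toy_utterances = [
    ["喂", "遲", "啲", "去", "唔", "去", "旅行", "啊", "?"],
    ["你", "老公", "有冇", "平", "機票", "啊", "?"],
    ["平", "機票", "要", "淡季", "先", "有得", "平", "𡃉", "喎", "."],
]

# Assumption: these marks are punctuation and should not count as words.
PUNCTUATION = {"?", ".", "!", ","}


def mean_length_of_utterance(utterances):
    """Average number of non-punctuation words per non-empty utterance."""
    lengths = [sum(1 for word in u if word not in PUNCTUATION) for u in utterances]
    lengths = [n for n in lengths if n > 0]
    return sum(lengths) / len(lengths) if lengths else 0.0


toy_mlu = mean_length_of_utterance(toy_utterances)
print(round(toy_mlu, 2))
```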

Let's use Timmy's data from the CHILDES Yip-Matthews Cantonese-English bilingual corpus. Between 2 and 3.5 years old, it appears that Timmy went from being Cantonese-dominant to English-dominant.

In [None]:
timmy_cantonese = pycantonese.read_chat("https://childes.talkbank.org/data/Biling/YipMatthews.zip", "TimCan")
timmy_english = pycantonese.read_chat("https://childes.talkbank.org/data/Biling/YipMatthews.zip", "TimEng")

In [None]:
import pandas as pd
import seaborn as sns

df = pd.DataFrame(
    {"MLU": timmy_cantonese.mlu() + timmy_english.mlu(),
     "Language": ["Cantonese"] * timmy_cantonese.n_files() + ["English"] * timmy_english.n_files(),
     "Age in months": timmy_cantonese.ages(months=True) + timmy_english.ages(months=True)}
)

sns.lmplot(
    x="Age in months", y="MLU", hue="Language", data=df, markers=["o", "x"], legend=True, legend_out=False, ci=None,
).set(xlim=(22, 45), ylim=(0, 6), title="Timmy's MLU between 2 and 3.5 years old")

## 2. Searching Corpora Programmatically

When you need corpus data, there's a reason for that. This section is about a common scenario: You're looking for something specific from a corpus for your research. Your search criteria can be as simple or complex as desired, provided that you can programmatically tap into them in some way.

### Tokens as Words with Annotations

HKCanCor is word-segmented as well as annotated for parts of speech and Jyutping romanization. While the `.words()` call gives us only the plain text words as Chinese/Cantonese characters, the annotations are accessible via `.tokens()`.

In [None]:
corpus = pycantonese.hkcancor()

In [None]:
tokens_by_utterances = corpus.tokens(by_utterances=True)

We saw the very first utterance in HKCanCor as `['喂', '遲', '啲', '去', '唔', '去', '旅行', '啊', '?']` (as a list of plain text strings) through `.words()`. Let's check it out again with annotations.

In [None]:
tokens_by_utterances[0]

This is a list of `Token` objects. Each `Token` object has several attributes:

* `word`: the word form as Chinese/Cantonese characters
* `pos`: part-of-speech tag (see the [HKCanCor documentation](http://compling.hss.ntu.edu.sg/hkcancor/) for the POS tagset)
* `jyutping`: Jyutping romanization
* (`mor` and `gra` are morphological information and grammatical relations, respectively. They are used in CHILDES and TalkBank datasets, but not in HKCanCor.)

The attributes from each token can be directly accessed.

In [None]:
for token in tokens_by_utterances[0][:3]:
    print(f"word: {token.word}")
    print(f"jyutping: {token.jyutping}")
    print()
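Because the attributes are plain values, they combine easily with ordinary Python. The sketch below romanizes part of an utterance by joining the Jyutping of each token. It uses a hypothetical `ToyToken` class as a stand-in for the real `Token` objects, and it assumes that punctuation carries no Jyutping (represented as `None` here), which you should check against the actual output of `.tokens()`.

```python
from typing import NamedTuple, Optional


class ToyToken(NamedTuple):
    """Hypothetical stand-in for a PyCantonese Token, with only three attributes."""
    word: str
    pos: str
    jyutping: Optional[str]


toy_utterance = [
    ToyToken("喂", "E", "wai3"),
    ToyToken("遲", "A", "ci4"),
    ToyToken("啲", "U", "di1"),
    ToyToken("?", "?", None),  # assumption: punctuation has no Jyutping
]

# Keep only the tokens that have a romanization, then join them with spaces.
romanized = " ".join(t.jyutping for t in toy_utterance if t.jyutping)
print(romanized)
```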

### [ Exercise 3 ]

Using `tokens_by_utterances` we've just created, write code to generate the following output for the first utterance, where each line has a word followed by a space and then the word's part-of-speech tag. You should use a `for` loop.

```
喂 E
遲 A
啲 U
去 V
唔 D
去 V
旅行 VN
啊 Y
? ?
```

### Searching by a Single Element

Now that we know about the tokens, we're ready for programmatically searching a corpus.

Let's say you're interested in gwai2 鬼, and that you'd like to pull out all occurrences of 鬼 from HKCanCor. The `.search()` call handles a search targeting a single word/token ("search for all utterances with 鬼").

In [None]:
gwai2 = corpus.search(character="鬼", by_utterances=True)

How many utterances with gwai2 鬼 does the search return from HKCanCor?

In [None]:
len(gwai2)

What's the first utterance with 鬼?

In [None]:
gwai2[0]

Show the first 20 utterances as plain Chinese characters, without the nitty-gritty details of the tokens.

In [None]:
for utterance in gwai2[:20]:
    print("".join(token.word for token in utterance))
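Conceptually, a single-criterion search like this is a filter over utterances. The sketch below spells that idea out on stand-in data (the three utterances shown earlier, with a hypothetical `ToyToken` class). It is not the actual implementation of `.search()`, and `utterances_with_character` is a made-up helper.

```python
from typing import NamedTuple


class ToyToken(NamedTuple):
    """Hypothetical stand-in for a PyCantonese Token."""
    word: str
    pos: str


def make_utterance(pairs):
    """Build a list of ToyToken objects from (word, pos) pairs."""
    return [ToyToken(word, pos) for word, pos in pairs]


toy_corpus = [
    make_utterance([("喂", "E"), ("遲", "A"), ("啲", "U"), ("去", "V"), ("唔", "D"),
                    ("去", "V"), ("旅行", "VN"), ("啊", "Y"), ("?", "?")]),
    make_utterance([("你", "R"), ("老公", "N"), ("有冇", "V1"), ("平", "A"),
                    ("機票", "N"), ("啊", "Y"), ("?", "?")]),
    make_utterance([("平", "A"), ("機票", "N"), ("要", "VU"), ("淡季", "AN"), ("先", "D"),
                    ("有得", "VU"), ("平", "A"), ("𡃉", "Y"), ("喎", "Y"), (".", ".")]),
]


def utterances_with_character(utterances, character):
    """Keep the utterances in which at least one word contains the character."""
    return [u for u in utterances if any(character in t.word for t in u)]


hits = utterances_with_character(toy_corpus, "機")
for utterance in hits:
    print("".join(token.word for token in utterance))
```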

### [ Exercise 4 ]

Suppose you're working on numerical classifiers, and that you want to find all of them from HKCanCor.

1. What is the part-of-speech tag for classifiers? Check the [HKCanCor documentation](http://compling.hss.ntu.edu.sg/hkcancor/).
2. We have the HKCanCor reader object `corpus` ready. How should we use the `.search()` call to say "give me all utterances where a numerical classifier shows up"? You may need to refer to the [PyCantonese documentation](https://pycantonese.org/searches.html).

### More Complex Searches

Because the `.search()` method targets a single token at a time, you have to write your own custom code for more complex searches.

Reminders:
* To look for what you want, it's a matter of how you make use of the available corpus data and its annotations.
* We've been using HKCanCor, but it's just one of the many possible corpora you could use.
* After all, you drive your own research. You decide what you want the computer to do!

Your use case may vary, but a basic approach looks like this:

1. Create an empty container that's going to keep the search result.
2. Loop through a corpus.
3. Whenever you (= your code) see what's of interest, keep that in the container you've created.
4. When the looping is over, the container should have what you're after.

As an example, let's say you're interested in nominal syntax and semantics with respect to ge3 嘅. Now you want to grab all the utterances in HKCanCor where 嘅 appears in a nominal context.

In [None]:
# Install a package that makes it easy to get ngrams.
!pip install nskipgrams==0.3.0

In our search below, what exactly are the search criteria?

In [None]:
import nskipgrams

# 1. Create an empty container that's going to keep the search result.
result = []

# 2. Loop through a corpus.
for utterance in corpus.tokens(by_utterances=True):

    # Set up a two-token sliding window and move it along the utterance.
    for bigram in nskipgrams.ngrams_from_seq(utterance, 2):
        token1, token2 = bigram

        # 3. Whenever you (= your code) see what's of interest, keep that in the container you've created.
        if token1.word == "嘅" and token2.pos.startswith("N"):
            result.append(utterance)

            # Once you've saved this utterance,
            # you don't need to check the rest of the utterance anymore.
            break

# 4. When the looping is over, the container should have what you're after.

In [None]:
len(result)

In [None]:
result[:3]
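The two-token sliding window doesn't strictly need an extra package. As a sketch, `zip(utterance, utterance[1:])` yields the same adjacent pairs for a list. The example below runs the same search logic on a tiny hypothetical corpus. `ToyToken` and the part-of-speech tags in it are illustrative assumptions, not HKCanCor data.

```python
from typing import NamedTuple


class ToyToken(NamedTuple):
    """Hypothetical stand-in for a PyCantonese Token."""
    word: str
    pos: str


# Hypothetical two-utterance corpus; the POS tags are illustrative only.
toy_corpus = [
    [ToyToken("我", "R"), ToyToken("嘅", "U"), ToyToken("書", "N")],
    [ToyToken("係", "V"), ToyToken("我", "R"), ToyToken("嘅", "U"), ToyToken(".", ".")],
]

toy_result = []

for utterance in toy_corpus:
    # zip(seq, seq[1:]) pairs each token with the token right after it.
    for token1, token2 in zip(utterance, utterance[1:]):
        if token1.word == "嘅" and token2.pos.startswith("N"):
            toy_result.append(utterance)
            # One match is enough to keep this utterance.
            break

print(len(toy_result))
```

Only the first toy utterance is kept, because the token after 嘅 in the second one is not tagged as a noun.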

Questions: Did this search strategy miss cases of NP ellipsis? If we left out the search criterion that a noun must be present, would there be issues?

### [ Exercise 5 ]

If you were to work on Cantonese verbal particles (aspect, etc.), what would you do to pull out all instances of verbal particles from HKCanCor?

* What are verbal particles? Would you target a specific set of Chinese/Cantonese characters, or would you use the part-of-speech tags? A combination of both?
* How much context would you like for each hit in the search result? A window of tokens around a found verbal particle? The utterance where it appears? A window of utterances?

## There's More!

PyCantonese, either on its own or combined with other tools, supports other functionality that we didn't have time to cover in this tutorial:

* Jyutping parsing and conversion
* Making use of speakers' demographics available from the metadata of CHAT-formatted corpus data
* Stop words
* Word segmentation
* Part-of-speech tagging
* Creating your own word-segmented, POS-tagged corpus from unanalyzed Cantonese text
* Topic modeling
* Sentiment analysis
* and more?

The goal of PyCantonese is to fill the Cantonese-specific gaps that other tools don't. Some of these items can be handled by PyCantonese alone (and are therefore [documented](https://pycantonese.org/)); others become possible when combined with more programming and modeling chops. The sky is the limit! There may be more tutorials in the future; in the meantime, I hope you found this one helpful.