<a href="https://colab.research.google.com/github/liao961120/hgct/blob/main/docs_source/nb/corpusSearch.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
## Colab setup
# !gdown https://github.com/liao961120/hgct/raw/main/test/data.zip
# !unzip -q data.zip
# !pip install -qU hgct

<!-- Hide section entries in toc -->
# Search API in *hgct*

In the following tutorials (Appendix A and B), we will use a small
collection of texts as the example corpus. The text data is available on
GitHub at <https://github.com/liao961120/hgct/raw/main/test/data.zip>.
After extracting `data.zip` to the directory `data`, it should have the
following structure:

```
data
├── 01
│   ├── 儀禮_公食大夫禮.txt
│   ├── ...
│   └── 黃帝內經_靈樞經.txt
├── 02
│   ├── ...
│   └── 鹽鐵論_卷四.txt
├── 03
│   ├── 三國志_吳書一.txt
│   ├── ...
│   └── 魯勝墨辯注敘_魯勝墨辯注敘.txt
├── 08
│   ├── asbc1.txt
│   └── asbc2.txt
└── 10
    ├── dispersion1.txt
    ├── ...
    └── dispersion5.txt
```

The directory `data` corresponds to the corpus in *hgct*’s corpus
representation. It contains five directories, each of which corresponds
to a subcorpus. Directories `01`, `02`, and `03` consist of small samples
of Literary Chinese texts collected from the Chinese Text Project
(<https://ctext.org>). Directory `08` holds modern Chinese texts sampled
from ASBC. The directory `10` is a toy corpus in @gries2020 \[p. 102\]
used for illustrating calculations of dispersion measures.

In this tutorial, we demonstrate the supported functionalities in *hgct*
for searching the corpus.

## Loading Corpus Data into Concordancer

Provided that the input corpus follows the required directory structure
mentioned in @sec:corpus-structure-and-input-data, users could convert
the input corpus to the internal corpus representation with
`PlainTextReader()` as in the following code block. Since we are now
demonstrating the search functions, we immediately pass the corpus to
`Concordancer()`, which is the object used in *hgct* for searching the
corpus.

In [2]:
from hgct import PlainTextReader, Concordancer
c = Concordancer(PlainTextReader("data/").corpus)

The `Concordancer` object can be used to retrieve results matching the
search pattern as a sequence[1] of concordance lines. Since many search
patterns return a large number of results, we define a wrapper
function `get_first_n()` here for the purpose of demonstration.

[1] More precisely, a *generator* of concordance lines.

In [3]:
def get_first_n(cql, n=10, left=5, right=5):
    """Return at most the first `n` concordance lines matching `cql`."""
    out = []
    for i, r in enumerate(c.cql_search(cql, left=left, right=right)):
        if i == n:
            break
        out.append(r)
    return out
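
The same truncation can also be expressed with `itertools.islice`, which stops consuming a generator after `n` items. The sketch below is only an illustration: `fake_search()` is a hypothetical stand-in for the generator returned by `c.cql_search()`, so the block runs without the corpus.

```python
from itertools import islice

def fake_search():
    """Hypothetical stand-in for a generator of concordance lines."""
    i = 0
    while True:  # endless, like a query with very many hits
        yield f"hit-{i}"
        i += 1

def take(results, n=10):
    """Return the first `n` items of any iterable as a list."""
    return list(islice(results, n))

first_three = take(fake_search(), n=3)
print(first_three)
```

Because `islice` pulls items lazily, the endless generator above is never exhausted; only the first three items are produced.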

## Search by Character

In our first example, we define the search pattern as
`[char="龜"] [char="[一-龜]"]`, which roughly means

> a sequence of two characters starting with “龜” and ending with any
> Chinese character (not, e.g., punctuation)

Passing this pattern to `get_first_n()` (or `Concordancer.cql_search()`)
gives us a sequence of `Concord` objects. A `Concord` object is used to
represent a matched result returned from the corpus in *hgct*.

In [4]:
cql = '''
[char="龜"] [char="[一-龜]"]
'''
# left/right: left/right context size around the keyword
results = get_first_n(cql, n=5, left=6, right=3)  
results

[<Concord 遷有無，貨自{龜貝}，至此>,
 <Concord 山在西北。有{龜山}。有龍>,
 <Concord ，故獸不狘；{龜以}為畜，>,
 <Concord 江郡常歲時生{龜長}尺二寸>,
 <Concord 無為頓復卜三{龜知}。聖人>]
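
To see why `[一-龜]` excludes punctuation, it may help to try the character class in plain Python. This is a sketch under the assumption that attribute values in the search pattern behave like Python regular expressions; it does not use *hgct* at all, and the short test string is made up from the first result above.

```python
import re

# Assumption: attribute values behave like Python regular expressions.
han = re.compile("[一-龜]")  # code points U+4E00 through U+9F9C

is_han_char = bool(han.fullmatch("貝"))    # a Chinese character
is_han_comma = bool(han.fullmatch("，"))   # full-width comma, U+FF0C
print(is_han_char, is_han_comma)

# Plain-regex analogue of the two-token pattern on a short string.
text = "貨自龜貝，至此龜，"
pairs = re.findall("龜[一-龜]", text)
print(pairs)
```

The second “龜” in the test string is followed by a comma, so it is not part of any match.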

To get more information about a particular matching result, we can look
at the `data` attribute in a `Concord` object, which is a dictionary
holding the relevant information of the matching result.

In [5]:
result_1 = results[0]
result_1.data

{'captureGroups': {},
 'keyword': '龜貝',
 'left': '遷有無，貨自',
 'meta': {'id': '02/漢書_傳.txt',
  'text': {'book': '漢書', 'sec': '傳'},
  'time': {'label': '漢', 'ord': 2, 'time_range': [-205, 220]}},
 'position': (1, 6, 3482, 42),
 'right': '，至此'}

Note the `position` key in `Concord.data`. It holds the position of the
matched keyword in the corpus. The elements in the 4-tuple
`(1, 6, 3482, 42)` correspond respectively to the indices of
`(subcorpus, text, sentence, character)`.
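
Since `Concord.data` is an ordinary dictionary, it can be processed with plain Python. The sketch below works on a literal copy of the dictionary printed above (so it runs without the corpus); the variable names are our own choices for illustration.

```python
# Literal copy of the `data` dictionary printed above (no corpus needed).
data = {
    "captureGroups": {},
    "keyword": "龜貝",
    "left": "遷有無，貨自",
    "meta": {
        "id": "02/漢書_傳.txt",
        "text": {"book": "漢書", "sec": "傳"},
        "time": {"label": "漢", "ord": 2, "time_range": [-205, 220]},
    },
    "position": (1, 6, 3482, 42),
    "right": "，至此",
}

# Rebuild the keyword-in-context line, marking the keyword with braces.
kwic = f"{data['left']}{{{data['keyword']}}}{data['right']}"

# Unpack the position into its four named parts.
subcorp_idx, text_idx, sent_idx, char_idx = data["position"]

print(kwic)
print(subcorp_idx, text_idx, sent_idx, char_idx)
```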

We did not mention above how the index of a subcorpus is determined. The
index of a subcorpus is automatically determined according to the
**character order of the directory names**. Remember that there are five
directories (subcorpora) in our input corpus: `01`, `02`, `03`, `08`,
and `10`. So by character order, `01` appears before `02`, `02` before
`03`, `03` before `08`, and so on. Hence, the first directory `01` is
given the index of 0, the second is given the index of 1, and so on.
These indices of the subcorpora, as seen later in Appendix B, could be
used for limiting the scope of the functions in *hgct* in computing
corpus statistical measures.
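
The mapping from directory names to subcorpus indices can be reproduced with a plain string sort. This is a sketch that assumes *hgct* orders names the way Python's `sorted()` orders strings; the library's actual implementation is not shown here.

```python
# Directory names of the example corpus, deliberately shuffled.
dirs = ["10", "03", "01", "08", "02"]

# Sorting strings compares them character by character.
subcorp_index = {name: i for i, name in enumerate(sorted(dirs))}
print(subcorp_index)

# Character order is not numeric order: without zero padding,
# "10" would sort before "9".
unpadded = sorted(["9", "10"])
print(unpadded)
```

Under this assumption, directory `02` receives index 1, which agrees with the first element of the `position` tuple of the result found in `02/漢書_傳.txt`.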

## Search by Character Components

In addition to character forms, we can also describe search patterns in
terms of character compositions, such as the Kangxi Radical or
Ideographic Descriptions of a character.

### Kangxi Radicals

To list all Kangxi radicals present among the characters of
the corpus, use the attribute `Concordancer.chr_radicals`:

In [None]:
print(c.chr_radicals)

{'火', '豆', '鳥', '鹿', '辵', '風', '鬯', '手', '欠', '瓦', '见', 
 '卜', '网', '彐', '冫', '夕', '鬥', '子', '勹', '饣', '鬼', ...}

To search the corpus with Kangxi radicals, simply use the attribute
`radical` in the description of the search pattern.

In [None]:
cql = '''
[radical="立"]
'''
get_first_n(cql, 5)

[<Concord 》有竘匠。{竵}：不正也。>,
 <Concord ，遠塗也，{竫}立安坐而至>,
 <Concord ？惟諓諓善{竫}言。俾君子>,
 <Concord 自申束也。{竫}：亭安也。>,
 <Concord 聲。靖：立{竫}也。从立青>]

### Ideographic Description Characters (IDCs)

Character components defined according to Unicode’s Ideographic
Description Characters (IDCs) could also be used for searching. The IDCs
and their names in *hgct* are found in `Concordancer.chr_idcs`:

In [None]:
c.chr_idcs

{'curC': '⿷', 'encl': '⿴', 'horz2': '⿰', 'horz3': '⿲', 
 'over': '⿻', 'sur7': '⿹', 'surL': '⿺', 'surN': '⿵', 
 'surT': '⿸', 'surU': '⿶', 'vert2': '⿱', 'vert3': '⿳'}
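
To relate the short names above to Unicode's own terminology, the standard library's `unicodedata` module can print the official name of each IDC. The dictionary in this sketch is copied from the output above; nothing else is taken from *hgct*.

```python
import unicodedata

# Copied from the output of `c.chr_idcs` above.
idcs = {'curC': '⿷', 'encl': '⿴', 'horz2': '⿰', 'horz3': '⿲',
        'over': '⿻', 'sur7': '⿹', 'surL': '⿺', 'surN': '⿵',
        'surT': '⿸', 'surU': '⿶', 'vert2': '⿱', 'vert3': '⿳'}

prefix = "IDEOGRAPHIC DESCRIPTION CHARACTER "
# List the IDCs in code point order with their Unicode names.
for name, ch in sorted(idcs.items(), key=lambda kv: ord(kv[1])):
    uni_name = unicodedata.name(ch).replace(prefix, "")
    print(f"{name:6} U+{ord(ch):04X} {uni_name}")
```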

To search according to Ideographic Descriptions, use the attributes
`compo` and/or `idc`.

In [None]:
cql = '''
[compo="木" & idc="vert2" & pos="0"]
'''
get_first_n(cql, 5)

[<Concord 以中牟叛，{桼}雕刑殘，莫>,
 <Concord 銅錮其內，{桼}塗其外，被>,
 <Concord 行，堅如膠{桼}，昆弟不能>,
 <Concord 陳、夏千畝{桼}；齊、魯千>,
 <Concord 千兩；木器{桼}者千枚，銅>]

In [None]:
cql = '''
[compo="木" & idc="vert2" & pos="1"]
'''
get_first_n(cql, 5)

[<Concord 之也。从手{罙}聲。撢：探>,
 <Concord 營道。从水{罙}聲。潭：水>,
 <Concord 吉臺原姑與{柒}里，使海於>,
 <Concord 㕮咀，以水{柒}升，微火煮>,
 <Concord ，綿裹。右{柒}味，㕮咀。>]

In [None]:
cql = '''
[compo="木" & idc="vert2"]
'''
get_first_n(cql, 5)

[<Concord 之也。从手{罙}聲。撢：探>,
 <Concord 營道。从水{罙}聲。潭：水>,
 <Concord 城，積木為{寨}，匈奴不敢>,
 <Concord 入侍。以邊{寨}無寇。減戍>,
 <Concord 親王（柬埔{寨}）等針撥白>]
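
The three queries above suggest how `compo`, `idc`, and `pos` interact: `pos` appears to be a zero-based index of the component within the shape. The toy model below illustrates that reading. Everything in it is hypothetical: the mini-table `TOY_IDS` is hand-written, the functions `matches()` and `search()` are our own, and `pos` is an integer here rather than a string as in the search pattern. It is not how *hgct* is implemented.

```python
# Hypothetical mini-table: character -> (shape name, components in order).
# Hand-written for illustration; hgct's own decomposition data may differ.
TOY_IDS = {
    "杏": ("vert2", ["木", "口"]),  # ⿱木口
    "呆": ("vert2", ["口", "木"]),  # ⿱口木
    "林": ("horz2", ["木", "木"]),  # ⿰木木
    "困": ("encl", ["囗", "木"]),   # ⿴囗木
}

def matches(char, compo=None, idc=None, pos=None):
    """True if `char` satisfies every constraint that is not None."""
    shape, parts = TOY_IDS[char]
    if idc is not None and shape != idc:
        return False
    if compo is None:
        return True
    if pos is None:
        return compo in parts  # component anywhere in the shape
    return pos < len(parts) and parts[pos] == compo

def search(**constraints):
    """Return all toy characters matching the given constraints."""
    return [ch for ch in TOY_IDS if matches(ch, **constraints)]

top = search(compo="木", idc="vert2", pos=0)
bottom = search(compo="木", idc="vert2", pos=1)
either = search(compo="木", idc="vert2")
print(top, bottom, either)
```

In this toy model, dropping `pos` returns the union of the two positional queries, which mirrors the pattern seen in the three result lists above.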

Either `compo` or `idc` could be left out if a more abstract search
pattern is preferred. For instance, if the shape (`idc`) and the
position (`pos`) are not of interest, these attributes could be left
out.

In [None]:
cql = '''
[compo="木"]
'''
get_first_n(cql, 5)

[<Concord 梅。楥，柜{枊}。栩，杼。>,
 <Concord 。讀若過。{枊}：馬柱。从>,
 <Concord 繫其頸著馬{枊}，五葬反。>,
 <Concord 其甲冑、干{楯}也；钁鍤、>,
 <Concord 句踐也以甲{楯}三千，棲於>]

If one is interested only in the shape of the character, `idc` could be
specified while all other attributes could be left out.

In [None]:
cql = '''
[idc="encl"] [idc="encl"]
'''
get_first_n(cql, 5)

[<Concord 岸崩。始置{圃囿}署，以宦者>,
 <Concord 曰：「請以{國因}。」故曰可>,
 <Concord 天子東出其{國四}十六里而壇>,
 <Concord 入{國四}旬，五行九>,
 <Concord 君約，破趙{國因}封二子者各>]

### Radical Semantic Type

Ma’s (2016) semantic type classification of Kangxi Radicals is also
incorporated in *hgct*’s search function. Use the attribute `semtag` to
specify a radical semantic type. Refer to @tbl:ma2016-radical for the 22
available semantic types.

In [None]:
cql = '''
[semtag="植物"] [semtag="植物"]
'''
get_first_n(cql, 5)

[<Concord 。且夫山不{槎蘗}，澤不伐夭>,
 <Concord 彘有艽莦，{槎櫛}堀虛，連比>,
 <Concord 則從行獵，{槎桎}拔，失鹿，>,
 <Concord 。冒甯柘，{槎棘}枳，窮浚谷>,
 <Concord 嶽之山，多{枳棘}剛木。有獸>]

## Search by Phonetic Properties

*hgct* also supports searching the corpus by sound properties. The
sound properties are defined according to data from two
systems: Guangyun 廣韻 (Middle Chinese) and the Chinese dictionary compiled by
the Ministry of Education (MOE) in Taiwan (Mandarin).

In [None]:
c.cql_attrs['CharPhonetic']

{'moe': ['phon', 'tone', 'tp', 'sys="moe"'],
 '廣韻': ['攝', '聲調', '韻母', '聲母', '開合', '等第', 
         '反切', '拼音', 'IPA', 'sys="廣韻"']
}

### Mandarin (based on 萌典)

In [None]:
cql = '''
[phon="ㄨㄥ" & tone="1" & sys="moe"]
'''
get_first_n(cql, 5)

[<Concord 」耳邊不斷{嗡}嗡的縈繞著>,
 <Concord 耳邊不斷嗡{嗡}的縈繞著類>,
 <Concord 哭泣不秩聲{翁}，縗絰垂涕>,
 <Concord ，黑文而赤{翁}，名曰櫟，>,
 <Concord 發猛，塤篪{翁}博，瑟易良>]
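
Search patterns with several attributes follow a regular shape: `name="value"` pairs joined by `&` inside square brackets, and tokens separated by spaces. When many such patterns are needed, they could be assembled with small string helpers. The functions `token()` and `sequence()` below are hypothetical conveniences of our own, not part of *hgct*; they only build the pattern strings.

```python
def token(**attrs):
    """Format keyword arguments as one token, e.g. [a="x" & b="y"]."""
    body = " & ".join(f'{name}="{value}"' for name, value in attrs.items())
    return f"[{body}]"

def sequence(*tokens):
    """Join tokens with spaces to describe consecutive characters."""
    return " ".join(tokens)

q1 = token(phon="ㄨㄥ", tone="1", sys="moe")
q2 = sequence(token(idc="encl"), token(idc="encl"))
print(q1)
print(q2)
```

The resulting strings could then be passed to `get_first_n()` in the same way as the hand-written patterns.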

In [None]:
cql = '''
[phon="^p" & tp="ipa" & sys="moe"] [phon="^p" & tp="ipa" & sys="moe"]
'''
get_first_n(cql, 5)

[<Concord 大禍或遭流{炮波}及。我們步>,
 <Concord 牀版也。从{片扁}聲。讀若邊>,
 <Concord 如看推理名{片般}，由姐妹的>,
 <Concord 了進來，一{片片}綠油油的田>,
 <Concord 好高哇！一{片片}的竹葉，好>]
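
The pattern `^p` in the query above relies on the regex anchor `^`, which restricts the match to the beginning of the transcription. The sketch below shows the effect of the anchor in plain Python, assuming the `phon` value is matched like a Python regular expression. The strings in `candidates` are made-up illustrative transcriptions, not data from *hgct*.

```python
import re

# Made-up transcriptions for illustration only.
candidates = ["pa", "pʰian", "apa", "ma"]

# Anchored: "p" must be the first symbol of the string.
anchored = [s for s in candidates if re.search("^p", s)]

# Unanchored: "p" may occur anywhere in the string.
unanchored = [s for s in candidates if re.search("p", s)]

print(anchored)
print(unanchored)
```

Note that the anchored pattern accepts both a plain and an aspirated initial, since both transcriptions begin with the symbol `p`.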

### Middle Chinese (based on 廣韻)

In [None]:
cql = '''
[韻母="東" & 聲調="平" & sys="廣韻"]
'''
get_first_n(cql, 5)

[<Concord 从雨相聲。{霚}：地气發，>,
 <Concord 山，其上多{銅}，其下多玉>,
 <Concord 無草木，多{銅}玉。囂水出>,
 <Concord 玉，其下多{銅}，其獸多閭>,
 <Concord 山，其上多{銅}玉，其下多>]