nevenjovanovic/explore-treebanks-xquery

Explore dependency tree morphosyntactic annotations with XQuery

Neven Jovanović, Department of Classical Philology, Faculty of Humanities and Social Sciences, University of Zagreb

Summary

Import a set of treebanks (for example, from Vanessa Gorman's Greek collection, freely available through GitHub and Perseids Publications) into an XML database. Then analyze the set with XQuery scripts, run for example from the BaseX GUI.

How to get started

  1. Clone this repository
  2. Install BaseX
  3. Build the database from a local copy of the treebanks:
    1. download the XML files from Gorman's repository, or clone the repository
    2. in BaseX, open and run the script createGrcTBG.xq
  4. As an alternative to step 3, without cloning Gorman's repository:
    1. run the script createGrcTBGpull.xq
    2. then run populateGrcTBGfromGit.xq; this second script pulls the files directly from GitHub and adds them to the database created by the first script (it takes about a minute)
  5. Run other XQuery scripts (see the project wiki)
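
For orientation, the local import in step 3 can be sketched as a single BaseX call. This is only a minimal sketch and not the content of createGrcTBG.xq: the database name and the directory path are placeholders chosen for illustration, and `db:create` is specific to BaseX.

```xquery
xquery version "3.1";

(: Assumption: both values are hypothetical placeholders;
   the repository's own scripts may use other names and paths. :)
declare variable $db-name := "grc-tb-example";
declare variable $source-dir := "/path/to/local/treebank/clone/";

(: BaseX Database Module: create a database from all XML files
   found in the given directory. :)
db:create($db-name, $source-dir)
```

Once the database exists, a separate query such as `count(collection("grc-tb-example")//sentence)` gives a quick check that the files were imported.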

Some basic queries

  1. retrieve some statistics on the collection and its annotations (for results, see the wiki):
    1. how many sentences and words (also excluding punctuation): GrcTBStatsGeneral.xq
    2. how many ellipses, words without annotations or with undefined annotations: GrcTBStatsGeneral.xq
    3. sentences grouped by word count: GrcTBStatsGeneral.xq
    4. how many instances of certain syntactic relations: GrcTBStatsRelations-1.xq; current results on a dedicated wiki page
    5. how many instances of certain parts of speech – how many nouns, verbs, adverbs...? GrcTBStatsPOS-1.xq; current results on a wiki page
    6. how many wordforms: FindWordforms1.xq, with a report on the wiki page WordformsList (66,313 wordforms); how many lemmata: FindLemmata.xq, with a report on the wiki page LemmataList (18,056 lemmata)
    7. how many lemmata per part of speech
  2. retrieve sentences of a certain type, or configuration
  3. retrieve a list of all parts of speech in a specific syntactic relation (for example, what POS can take on the role of subject?): FindPOStag.xq; for a sample report on AuxZ, the emphasizing particle, see POStagAuxZ
  4. retrieve a set of treebank fragments, each consisting of a specific syntactic relation, its parent, and all its children (for example, what is dependent on a subject?): traverseTree1-levels-all.xq; the same result can be achieved more easily and efficiently by following Bozia 2018
  5. transform the "flat" ALDT XML into nested tree structures, as proposed by Bozia 2018, for easier retrieving of children and parent nodes: ListToTree.xq
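
Two of the queries above, counting words with and without punctuation and nesting the "flat" word list into a tree, can be illustrated on a toy sentence. The following is a minimal sketch, not the code of GrcTBStatsGeneral.xq or ListToTree.xq. It assumes the usual ALDT layout, in which each `word` carries `id`, `head`, `relation` and `postag` attributes, root words have `head="0"`, and punctuation has a `postag` beginning with `u`. The sample sentence and the function `local:nest` are made up for illustration.

```xquery
xquery version "3.1";

(: Hypothetical sentence in the flat ALDT layout; values are illustrative only. :)
declare variable $sentence :=
  <sentence id="1">
    <word id="1" form="ὁ" lemma="ὁ" postag="l-s---mn-" relation="ATR" head="2"/>
    <word id="2" form="ἄνθρωπος" lemma="ἄνθρωπος" postag="n-s---mn-" relation="SBJ" head="3"/>
    <word id="3" form="λέγει" lemma="λέγω" postag="v3spia---" relation="PRED" head="0"/>
    <word id="4" form="." lemma="." postag="u--------" relation="AuxK" head="0"/>
  </sentence>;

(: Copy a word with its attributes and recursively nest, as child elements,
   every word of the same sentence whose @head points to this word's @id. :)
declare function local:nest($word as element(word)) as element(word) {
  element word {
    $word/@*,
    for $child in $word/../word[@head = $word/@id]
    return local:nest($child)
  }
};

<report>
  <!-- all words, and words whose postag does not mark punctuation -->
  <words total="{count($sentence/word)}"
         no-punct="{count($sentence/word[not(starts-with(@postag, 'u'))])}"/>
  <!-- start from the root words (head = 0) and nest their dependents -->
  <tree>{
    for $root in $sentence/word[@head = "0"]
    return local:nest($root)
  }</tree>
</report>
```

For this sample the report counts four words, three of them non-punctuation, and the PRED word contains the SBJ word, which in turn contains the article. In the nested form, the children of a node are simply `word/word` and its parent is `..`. The sketch assumes well-formed annotations; a word whose `head` points back to itself or to one of its own descendants would make the recursion loop.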

A note on the XML source repositories

Gorman's GitHub repository and Perseids Publications differ slightly in filenames and stability. Gorman's repository is the working copy and is more up to date; Perseids contains more stable versions, which might not be synchronized immediately.

License

CC-BY
