Korp corpus annotation system

Right now the app is pre-configured to work with a backend running on our server.

Running the frontend

Go into the frontend directory
Install dependencies by running yarn
Start the frontend by running yarn start
Open localhost:9111 in your browser

Running the backend

Requirements

Python 3.6 to 3.10
Corpus Workbench (CWB) 3.4.12 or newer

Corpus Workbench

Download the current stable version of Corpus Workbench. Install by following the Installing the CWB Core instructions, either by using the provided packages or building from source. Refer to the included INSTALL text file for further instructions.

Once CWB is installed, by default you will find it under /usr/local/cwb-X.X.X/bin (where X.X.X is the version number). Confirm that the installation was successful by running:

/usr/local/cwb-X.X.X/bin/cqp -v
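
Because X.X.X depends on the version you installed, it can help to search for the binary instead of typing the path by hand. The following is a minimal sketch, assuming the default /usr/local prefix mentioned above; the function name find_cqp is hypothetical and not part of Korp or CWB.

```python
from pathlib import Path


def find_cqp(prefix="/usr/local"):
    """Return every cwb-*/bin/cqp file found under `prefix`, sorted by path."""
    root = Path(prefix)
    if not root.is_dir():
        return []
    # The pattern mirrors the default layout /usr/local/cwb-X.X.X/bin/cqp.
    return sorted(str(p) for p in root.glob("cwb-*/bin/cqp") if p.is_file())
```

Calling find_cqp() returns a list of candidate paths; any of them can then be run with -v as shown above.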

CWB needs two directories for storing the corpora: one for the data and one for the corpus registry. You may create these directories wherever you want, but from here on we will assume that you have created the following two:

/corpora/data
/corpora/registry

The added functionality for uploading vrt files through the frontend also needs a third directory:

/corpora/vrts
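
The three directories can be created by hand, or with a few lines of Python. This is only a sketch: the helper name create_corpus_dirs is an assumption, and the base path is whatever location you chose for your corpora.

```python
from pathlib import Path


def create_corpus_dirs(base):
    """Create the data, registry and vrts directories under `base`."""
    created = []
    for name in ("data", "registry", "vrts"):
        path = Path(base) / name
        # parents=True also creates `base`; exist_ok=True makes reruns harmless.
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created
```

For the layout assumed in this README you would call create_corpus_dirs("/corpora"), which may require elevated permissions because the path sits at the filesystem root.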

A sample vrt file has been provided. To encode it, run the command:

cwb-encode -d /absolute/path/corpora/data/test -f /input/file/path/test.vrt -R /absolute/path/corpora/registry/test -S corpus -S text -S sentence -S phrase_with_errors:1 -P word_id -P word_tokens -P grammatical_number -P POS -P lemma -P correction_status -P annotator_code -P error_correction -P error_tag -P dependency_relation -P dependency_head -P verb_morpho -P case -P error_type -c utf8

If you have multiple corpora, make sure each one is encoded in its own separate data directory.

Afterwards, run:

cwb-makeall -r /absolute/path/corpora/registry test
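
When you encode several corpora, typing the long attribute list by hand is error-prone. The sketch below assembles the same two commands from Python lists. It only builds and formats the argument lists and does not run anything; the function names are hypothetical, and the flags used (-d, -f, -R, -S, -P, -c, -r) are the ones that appear in the commands above.

```python
import shlex


def build_encode_command(data_dir, vrt_file, registry, structural, positional,
                         charset="utf8"):
    """Return the cwb-encode argument list for one corpus."""
    cmd = ["cwb-encode", "-d", data_dir, "-f", vrt_file, "-R", registry]
    for name in structural:
        cmd += ["-S", name]  # structural attributes, e.g. sentence
    for name in positional:
        cmd += ["-P", name]  # positional attributes, e.g. lemma
    cmd += ["-c", charset]
    return cmd


def build_makeall_command(registry_dir, corpus):
    """Return the cwb-makeall argument list for one corpus."""
    return ["cwb-makeall", "-r", registry_dir, corpus]


def format_command(cmd):
    """Join an argument list into a shell-safe string for printing."""
    return " ".join(shlex.quote(arg) for arg in cmd)
```

Pass the same paths you would use in the commands above. The resulting list can be printed with format_command or handed to subprocess.run once you are happy with it.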

Setting up the Python environment and requirements

Optionally you may set up a virtual Python environment:

$ python3 -m venv venv
$ source venv/bin/activate

Install the required Python modules using pip with the included requirements.txt.

$ pip3 install -r requirements.txt

Configuring Korp

The supplied config.py contains the default configuration.

The following variables need to be set for Korp to work:

CQP_EXECUTABLE: The absolute path to the CQP binary. By default /usr/local/cwb-X.X.X/bin/cqp
CWB_SCAN_EXECUTABLE: The absolute path to the cwb-scan-corpus binary. By default /usr/local/cwb-X.X.X/bin/cwb-scan-corpus
CWB_REGISTRY: The absolute path to the CWB registry files. This is the /corpora/registry folder you created before.
CWB_VRTS: The absolute path to the directory that holds the vrt files. This is the /corpora/vrts folder you created before.

If you are planning on using functionality dependent on a database, you also need to set the following variables:

DBNAME: The name of the MySQL database where the corpus data will be stored.
DBUSER and DBPASSWORD: Username and password for accessing the database.

For caching to work you need to specify both a cache directory (CACHE_DIR) and a Memcached server address or socket (MEMCACHED_SERVER).
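
Before starting the backend it can be useful to check that the settings you need are actually filled in. The following is a minimal sketch, assuming the settings are available as a plain dictionary (for example vars(config) after import config); the helper unset_settings and its two flags are hypothetical and not part of Korp.

```python
REQUIRED = ("CQP_EXECUTABLE", "CWB_SCAN_EXECUTABLE", "CWB_REGISTRY", "CWB_VRTS")
DATABASE = ("DBNAME", "DBUSER", "DBPASSWORD")
CACHE = ("CACHE_DIR", "MEMCACHED_SERVER")


def unset_settings(settings, use_database=False, use_cache=False):
    """Return the names of needed settings that are missing or empty."""
    needed = list(REQUIRED)
    if use_database:
        needed += DATABASE
    if use_cache:
        needed += CACHE
    # A setting counts as unset when it is absent, None or an empty string.
    return [name for name in needed if not settings.get(name)]
```

An empty list means every setting in the selected groups has a value. The check does not verify that the paths exist or that the database is reachable.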

Running the backend

To run the backend, simply run run.py:

python3 run.py

Corpus Configuration for the Korp Frontend

A sample configuration is already provided to get you up and running. If you add more corpora, follow the procedure below.

The corpus configuration used by the Korp frontend is served by the backend. In config.py, the variable CORPUS_CONFIG_DIR should point to a directory having the following structure:

.
├── attributes
│   ├── positional
│   │   ├── lemma.yaml
│   │   ├── msd.yaml
│   │   ├── ...
│   │   └── pos.yaml
│   └── structural
│       ├── author.yaml
│       ├── title.yaml
│       ├── ...
│       └── year.yaml
├── corpora
│   ├── corpus1.yaml
│   ├── corpus2.yaml
│   ├── ...
│   └── yet-another-corpus.yaml
└── modes
    ├── default.yaml
    ├── another.yaml
    ├── ...
    └── other.yaml

The modes directory contains one YAML file per mode in Korp.
The corpora directory contains one YAML file per corpus.
The attributes directory contains two subdirectories: positional and structural, containing optional attribute presets referred to by the corpus configurations.
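
To see at a glance what a configuration directory contains, the layout above can be walked with a short script. This is a sketch only: list_config_files is a hypothetical helper, and it simply lists the YAML files per subdirectory without reading or validating them.

```python
from pathlib import Path

SUBDIRS = ("modes", "corpora", "attributes/positional", "attributes/structural")


def list_config_files(config_dir):
    """Map each expected subdirectory to the sorted names of its .yaml files."""
    root = Path(config_dir)
    found = {}
    for sub in SUBDIRS:
        folder = root / sub
        if folder.is_dir():
            # stem drops the .yaml suffix, so default.yaml becomes "default".
            found[sub] = sorted(p.stem for p in folder.glob("*.yaml"))
        else:
            found[sub] = []
    return found
```

Pass the directory that CORPUS_CONFIG_DIR points to. An empty list for one of the attributes subdirectories is not necessarily a problem, since attribute presets are optional.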

Note:
Most settings in these files referring to labels or descriptions can optionally be localized using ISO 639-3 language codes. For example, a label can look both like this:

label: author

... and like this:

label:
  eng: author
  swe: författare
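
Code that reads these files therefore has to accept both forms. The sketch below shows one way to do that on already-parsed YAML data. The function name localize and the fallback order (requested language, then eng, then any available value) are assumptions made for illustration, not the documented behavior of the Korp frontend.

```python
def localize(value, lang, fallback="eng"):
    """Return a display string for a plain or localized setting."""
    if isinstance(value, dict):
        if lang in value:
            return value[lang]
        if fallback in value:
            return value[fallback]
        # Assumed last resort: any available translation, or an empty string.
        return next(iter(value.values()), "")
    return value
```

With the two examples above, localize("author", "swe") returns the plain string unchanged, while the localized form returns the entry stored under swe.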

Mode Configuration

At least one mode file is required, and that file must be named default.yaml. This is the mode that will be loaded when no mode is explicitly requested.

Required:

label: The name of the mode, which will be shown in the interface.

Optional:

description: A description of the mode, shown when first entering it. May include HTML.
order: A number used for sorting the modes in the interface. Modes without an order will end up last.

folders: A folder structure for the corpus selector. These folders can then be referenced by individual corpora. The folder structure can be of any depth, and folders can have any number of sub-folders (using the key subfolders). You may use HTML in the descriptions. Example:

folders:
  novels:
    title:
      eng: Novels
      swe: Skönlitteratur
    description:
      eng: Corpora consisting of novels.
      swe: Korpusar bestående av skönlitteratur.
    subfolders:
      classics:
        title:
          eng: Classics
          swe: Klassiker
      scifi:
        title: Science-Fiction

preselected_corpora: A list of corpus IDs which will be pre-selected when the user enters the mode. You may also refer to folders by using the prefix __, and dot-notation for referring to subfolders. Example:
```
preselected_corpora:
  - my-corpus
  - __novels.scifi
```
Other than the above, you can also override almost all the global settings set in the frontend's config.yaml. See the documentation for the frontend for a list of available settings.
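
The folder references used in preselected_corpora can be resolved against the folders structure by following the dot-separated path through each subfolders key. The following is a minimal sketch operating on already-parsed YAML data; find_folder is a hypothetical helper and not the lookup code Korp itself uses.

```python
def find_folder(folders, reference):
    """Return the folder a '__a.b.c' reference points to, or None."""
    if not reference.startswith("__"):
        return None  # a plain corpus ID, not a folder reference
    level = folders
    folder = None
    for part in reference[2:].split("."):
        folder = level.get(part)
        if folder is None:
            return None  # the path does not exist in the folder structure
        # Descend one level; folders without subfolders end the path here.
        level = folder.get("subfolders", {})
    return folder
```

Applied to the folders example above, the reference __novels.scifi resolves to the scifi entry, while my-corpus is left alone because it lacks the __ prefix.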

Corpus Configuration

Corpus configuration files are placed in the corpora folder, and the filename of each configuration file should correspond to a corpus ID in lowercase, followed by .yaml, e.g. mycorpus.yaml.

Required:

id: The corpus' system name, same as the configuration file name (minus .yaml).
title: Title of the corpus.
description: Description of the corpus. HTML can be used.
mode: A list of the modes in which the corpus will be included, optionally specifying a folder. Example:
```
mode:
  - name: default
    folder: novels.classics
```
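
The naming rule and the required keys can be checked mechanically before the backend loads a file. The sketch below works on an already-parsed configuration dictionary; check_corpus_config and its problem messages are hypothetical, and it only checks the id, title and description keys by default.

```python
from pathlib import Path


def check_corpus_config(filename, config, required=("id", "title", "description")):
    """Return a list of problems found in one corpus configuration."""
    problems = []
    path = Path(filename)
    if path.suffix != ".yaml":
        problems.append("file name should end in .yaml")
    corpus_id = config.get("id")
    # The file name (minus .yaml) should be the corpus ID in lowercase.
    if corpus_id is not None and path.stem != str(corpus_id).lower():
        problems.append("file name should be the lowercase corpus ID")
    for key in required:
        if key not in config:
            problems.append("missing required key: " + key)
    return problems
```

An empty list means the file name and the checked keys are consistent with the rules above.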

Optional:

within: Use this to override default_within (set in the global or mode config). within is a list of structural elements to use as boundaries when searching, ordered from smaller to bigger. Example:
```
within:
  - label:
      eng: sentence
      swe: mening
    value: sentence
  - label:
      eng: paragraph
      swe: stycke
    value: paragraph
```
context: Use this to override default_context (set in the global or mode config). context is a list of structural elements that can be used as context in the displaying of the search results, ordered from smaller to bigger. Example:
```
context:
  - label:
      eng: 1 sentence
      swe: 1 mening
    value: 1 sentence
  - label:
      eng: 1 paragraph
      swe: 1 stycke
    value: 1 paragraph
```
attribute_filters: A list of structural attributes on which the user will be able to filter the search results, using menus in both simple and extended search.
pos_attributes and struct_attributes: Lists of positional and structural attributes. Every item in each list should be an object with one key. The key should be the ID of the attribute, e.g. msd for positional attributes or text_title for structural. The value should be either 1) an object with a complete attribute definition, or 2) a string referring to an attribute preset containing such a definition, e.g. msd to refer to attributes/positional/msd.yaml. With option 1, you may also refer to a preset by using the key preset and then extend/override that preset. The attribute definition is what tells the Korp frontend how to handle each attribute, like how it should be presented in the sidebar and what interface widget to use in extended search. For more information about what options are available for attribute definitions, see the Korp frontend documentation. Example:
```
struct_attributes:
  - text_title: title
  - text_type:
      label:
        eng: type
        swe: typ
  - text_source:
      preset: url
      label:
        eng: source
        swe: källa
```
custom_attributes: See Custom attributes.
reading_mode: See Reading mode.
limited_access: Set to true to indicate that this corpus requires the user to be logged in and to have the right permissions.
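
The two ways of specifying an entry in pos_attributes or struct_attributes (a preset name, or an object that may extend a preset) can be illustrated with a small resolver. This is a sketch under assumptions: resolve_attribute is a hypothetical helper, presets are passed in as an already-loaded dictionary, and overrides are applied as a shallow merge, which may differ from how Korp combines them.

```python
def resolve_attribute(item, presets):
    """Turn one list item {attribute_id: value} into (attribute_id, definition)."""
    (attr_id, value), = item.items()  # every item must have exactly one key
    if isinstance(value, str):
        # Option 2: the value names a preset, e.g. "title".
        return attr_id, dict(presets[value])
    definition = {}
    if "preset" in value:
        # Option 1 with a preset: start from the preset, then override it.
        definition.update(presets[value["preset"]])
    definition.update({k: v for k, v in value.items() if k != "preset"})
    return attr_id, definition
```

With the struct_attributes example above, text_title takes its whole definition from the title preset, text_type is used as written, and text_source starts from the url preset with its label replaced.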
