PMark-est/SoftwareProjectUT24

Korp corpus annotation system

Right now the app is pre-configured to work with a backend running on our server.

Running the frontend

  1. Go into the frontend directory
  2. Install dependencies by running yarn
  3. Start the frontend by running yarn start
  4. Open localhost:9111 in your browser

Running the backend

Requirements

Corpus Workbench

Download the current stable version of Corpus Workbench. Install by following the Installing the CWB Core instructions, either by using the provided packages or building from source. Refer to the included INSTALL text file for further instructions.

Once CWB is installed, by default you will find it under /usr/local/cwb-X.X.X/bin (where X.X.X is the version number). Confirm that the installation was successful by running:

/usr/local/cwb-X.X.X/bin/cqp -v

CWB needs two directories for storing the corpora: one for the data and one for the corpus registry. You may create these directories wherever you want, but from here on we will assume that you have created the following two:

/corpora/data
/corpora/registry

The added functionality for uploading vrt files through the frontend also needs a third directory:

/corpora/vrts
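
The three directories can also be created from a short script. This is a minimal sketch that assumes the /corpora root used above; the helper name create_corpus_dirs is ours and is not part of CWB or Korp.

```python
from pathlib import Path


def create_corpus_dirs(root="/corpora"):
    """Create the data, registry and vrts directories under `root`.

    `root` defaults to the /corpora location assumed in this README;
    pass a different path if you keep your corpora elsewhere.
    """
    created = []
    for name in ("data", "registry", "vrts"):
        path = Path(root) / name
        # parents=True creates `root` itself if it is missing;
        # exist_ok=True makes the call safe to repeat.
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created
```

Calling create_corpus_dirs() without an argument writes to /corpora, which usually requires elevated privileges.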

A sample vrt file has been provided. To encode it, run the command:

cwb-encode -d /absolute/path/corpora/data/test -f /input/file/path/test.vrt -R /absolute/path/corpora/registry/test -S corpus -S text -S sentence -S phrase_with_errors:1 -P word_id -P word_tokens -P grammatical_number -P POS -P lemma -P correction_status -P annotator_code -P error_correction -P error_tag -P dependency_relation -P dependency_head -P verb_morpho -P case -P error_type -c utf8

If you have multiple corpora, make sure each one is encoded into its own separate data directory.

Afterwards, run:

cwb-makeall -r /absolute/path/corpora/registry test
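
Because the cwb-encode command line is long, it can help to assemble it from lists of attribute names, especially when encoding several corpora. The following is a rough sketch that only uses the flags shown above (-d, -f, -R, -S, -P, -c). The function name build_encode_command, the corpus name mycorpus and its paths are hypothetical, and the shortened attribute lists are for illustration only.

```python
import shlex


def build_encode_command(data_dir, vrt_file, registry_path,
                         structural, positional, charset="utf8"):
    """Return the cwb-encode argument list for one corpus.

    `registry_path` is passed to -R unchanged.
    """
    cmd = ["cwb-encode", "-d", data_dir, "-f", vrt_file, "-R", registry_path]
    for name in structural:
        cmd += ["-S", name]   # structural attributes, e.g. "sentence"
    for name in positional:
        cmd += ["-P", name]   # positional attributes, e.g. "lemma"
    cmd += ["-c", charset]
    return cmd


# Hypothetical second corpus with its own data directory.
cmd = build_encode_command(
    "/corpora/data/mycorpus",
    "/corpora/vrts/mycorpus.vrt",
    "/corpora/registry/mycorpus",
    structural=["corpus", "text", "sentence", "phrase_with_errors:1"],
    positional=["word_id", "lemma"],
)
print(shlex.join(cmd))  # shell-quoted form, handy for checking by eye
```

The returned list can be handed to subprocess.run(cmd, check=True) once the printed command looks right.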

Setting up the Python environment and requirements

Optionally you may set up a virtual Python environment:

$ python3 -m venv venv
$ source venv/bin/activate

Install the required Python modules using pip with the included requirements.txt.

$ pip3 install -r requirements.txt

Configuring Korp

The supplied config.py contains the default configuration.

The following variables need to be set for Korp to work:

  • CQP_EXECUTABLE
    The absolute path to the CQP binary. By default /usr/local/cwb-X.X.X/bin/cqp

  • CWB_SCAN_EXECUTABLE
    The absolute path to the cwb-scan-corpus binary. By default /usr/local/cwb-X.X.X/bin/cwb-scan-corpus

  • CWB_REGISTRY
    The absolute path to the CWB registry files. This is the /corpora/registry folder you created before.

  • CWB_VRTS
    The absolute path to the directory to house vrt files. This is the /corpora/vrts folder you created before.

If you are planning on using functionality dependent on a database, you also need to set the following variables:

  • DBNAME
    The name of the MySQL database where the corpus data will be stored.

  • DBUSER & DBPASSWORD
    Username and password for accessing the database.

For caching to work you need to specify both a cache directory (CACHE_DIR) and a Memcached server address or socket (MEMCACHED_SERVER).
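
Before starting the backend, it may be worth checking that the settings listed above are actually filled in. This sketch assumes the settings are plain module-level variables in config.py, as the text suggests; the helper missing_settings is hypothetical and not part of Korp.

```python
REQUIRED = ("CQP_EXECUTABLE", "CWB_SCAN_EXECUTABLE", "CWB_REGISTRY", "CWB_VRTS")
DATABASE = ("DBNAME", "DBUSER", "DBPASSWORD")


def missing_settings(config, use_database=False):
    """Return the names of settings that are unset or empty in `config`.

    `config` is any object exposing the settings as attributes, for
    example the module obtained with `import config`.
    """
    names = REQUIRED + (DATABASE if use_database else ())
    return [name for name in names if not getattr(config, name, None)]
```

For example, after `import config`, an empty list from missing_settings(config) means all four required variables have a value; it does not verify that the paths exist.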

Running the backend

To run the backend, simply run run.py:

python3 run.py

Corpus Configuration for the Korp Frontend

A sample configuration is already provided to get you up and running. If you add more corpora, follow the procedure below.

The corpus configuration used by the Korp frontend is served by the backend. In config.py, the variable CORPUS_CONFIG_DIR should point to a directory having the following structure:

.
├── attributes
│   ├── positional
│   │   ├── lemma.yaml
│   │   ├── msd.yaml
│   │   ├── ...
│   │   └── pos.yaml
│   └── structural
│       ├── author.yaml
│       ├── title.yaml
│       ├── ...
│       └── year.yaml
├── corpora
│   ├── corpus1.yaml
│   ├── corpus2.yaml
│   ├── ...
│   └── yet-another-corpus.yaml
└── modes
    ├── default.yaml
    ├── another.yaml
    ├── ...
    └── other.yaml
  • The modes directory contains one YAML file per mode in Korp.
  • The corpora directory contains one YAML file per corpus.
  • The attributes directory contains two subdirectories: positional and structural, containing optional attribute presets referred to by the corpus configurations.
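
When starting a configuration directory from scratch, the layout above can be scaffolded with a few lines of Python. This is a sketch only: the function name scaffold_corpus_config is hypothetical, and the placeholder default.yaml it writes contains just a label, which is the one key every mode file must have.

```python
from pathlib import Path

SUBDIRS = ("attributes/positional", "attributes/structural", "corpora", "modes")


def scaffold_corpus_config(root):
    """Create the expected directory layout plus a placeholder default mode."""
    root = Path(root)
    for sub in SUBDIRS:
        (root / sub).mkdir(parents=True, exist_ok=True)
    default_mode = root / "modes" / "default.yaml"
    if not default_mode.exists():
        # Placeholder content; replace the label with your own.
        default_mode.write_text("label: Default\n", encoding="utf-8")
    return root
```

Point CORPUS_CONFIG_DIR in config.py at the directory you pass in as root.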

Note:
Most settings in these files referring to labels or descriptions can optionally be localized using ISO 639-3 language codes. For example, a label can look both like this:

label: author

... and like this:

label:
  eng: author
  swe: författare
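
To make the two forms concrete, here is a small sketch of how a consumer of these files could pick the text for one language. It is not taken from the Korp code; the function name localize and the fallback to eng are assumptions made for illustration.

```python
def localize(value, lang, fallback="eng"):
    """Return the text of a label for `lang`.

    `value` is either a plain string (not localized) or a dict keyed by
    ISO 639-3 language codes.
    """
    if isinstance(value, str):
        return value
    if lang in value:
        return value[lang]
    # The fallback order is an assumption of this sketch, not Korp behaviour.
    if fallback in value:
        return value[fallback]
    return next(iter(value.values()), "")
```

With the localized label above, localize(label, "swe") gives the Swedish text, while the plain-string form is returned unchanged for any language.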

Mode Configuration

At least one mode file is required, and that file must be named default.yaml. This is the mode that will be loaded when no mode is explicitly requested.

Required:

  • label: The name of the mode, which will be shown in the interface.

Optional:

  • description: A description of the mode, shown when first entering it. May include HTML.
  • order: A number used for sorting the modes in the interface. Modes without an order will end up last.
  • folders: A folder structure for the corpus selector. These folders can then be referenced by individual corpora. The folder structure can be of any depth, and folders can have any number of sub-folders (using the key subfolders). You may use HTML in the descriptions. Example:
    folders:
      novels:
        title:
          eng: Novels
          swe: Skönlitteratur
        description:
          eng: Corpora consisting of novels.
          swe: Korpusar bestående av skönlitteratur.
        subfolders:
          classics:
            title:
              eng: Classics
              swe: Klassiker
          scifi:
            title: Science-Fiction
    
  • preselected_corpora: A list of corpus IDs which will be pre-selected when the user enters the mode. You may also refer to folders by using the prefix __, and dot-notation for referring to subfolders. Example:
    preselected_corpora:
      - my-corpus
      - __novels.scifi
    
  • Other than the above, you can also override almost all the global settings set in the frontend's config.yaml. See the documentation for the frontend for a list of available settings.
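
The folder references used by preselected_corpora (the __ prefix with dot-notation) can be illustrated with a short sketch. This is not Korp's implementation: the function names, the corpus_folders mapping and the decision to include corpora from sub-folders are all assumptions made for the example.

```python
def folder_exists(folders, dotted_path):
    """Check that a dot-notation path such as "novels.scifi" names a folder."""
    level = folders
    for name in dotted_path.split("."):
        if not isinstance(level, dict) or name not in level:
            return False
        level = level[name].get("subfolders", {})
    return True


def expand_preselected(preselected, folders, corpus_folders):
    """Expand folder references ("__" prefix) into corpus IDs.

    `corpus_folders` maps corpus ID -> dotted folder path, as declared in
    each corpus configuration. Corpora in sub-folders of a referenced
    folder are included too; that choice is an assumption of this sketch.
    """
    result = []
    for item in preselected:
        if not item.startswith("__"):
            result.append(item)  # a plain corpus ID
            continue
        path = item[2:]
        if not folder_exists(folders, path):
            raise KeyError(f"unknown folder: {path}")
        for corpus_id, folder in sorted(corpus_folders.items()):
            if folder == path or folder.startswith(path + "."):
                result.append(corpus_id)
    return result
```

Checking folder_exists first means that a misspelled folder reference fails loudly instead of silently selecting nothing.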

Corpus Configuration

Corpus configuration files are placed in the corpora folder, and the filename of each configuration file should correspond to a corpus ID in lowercase, followed by .yaml, e.g. mycorpus.yaml.

Required:

  • id: The corpus' system name, same as the configuration file name (minus .yaml).
  • title: Title of the corpus.
  • description: Description of the corpus. HTML can be used.
  • mode: A list of the modes in which the corpus will be included, optionally specifying a folder. Example:
    mode:
      - name: default
        folder: novels.classics
    

Optional:

  • within: Use this to override default_within (set in the global or mode config). within is a list of structural elements to use as boundaries when searching, ordered from smaller to bigger. Example:
    within:
      - label:
          eng: sentence
          swe: mening
        value: sentence
      - label:
          eng: paragraph
          swe: stycke
        value: paragraph
    
  • context: Use this to override default_context (set in the global or mode config). context is a list of structural elements that can be used as context in the displaying of the search results, ordered from smaller to bigger. Example:
    context:
      - label:
          eng: 1 sentence
          swe: 1 mening
        value: 1 sentence
      - label:
          eng: 1 paragraph
          swe: 1 stycke
        value: 1 paragraph
    
  • attribute_filters: A list of structural attributes on which the user will be able to filter the search results, using menus in both simple and extended search.
  • pos_attributes and struct_attributes: Lists of positional and structural attributes. Every item in each list should be an object with one key. The key should be the ID of the attribute, e.g. msd for positional attributes or text_title for structural. The value should be either 1) an object with a complete attribute definition, or 2) a string referring to an attribute preset containing such a definition, e.g. msd to refer to attributes/positional/msd.yaml. With option 1, you may also refer to a preset by using the key preset and then extend/override that preset. The attribute definition is what tells the Korp frontend how to handle each attribute, like how it should be presented in the sidebar and what interface widget to use in extended search. For more information about what options are available for attribute definitions, see the Korp frontend documentation. Example:
    struct_attributes:
      - text_title: title
      - text_type:
          label:
            eng: type
            swe: typ
      - text_source:
          preset: url
          label:
            eng: source
            swe: källa
    
  • custom_attributes: See Custom attributes.
  • reading_mode: See Reading mode.
  • limited_access: Set to true to indicate that this corpus requires the user to be logged in and to have the right permissions.
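
The two ways of writing an item in pos_attributes and struct_attributes (a preset name, or a full definition that may extend a preset) can be sketched as follows. This only illustrates the rules described above and is not the backend's code; resolve_attribute is a hypothetical name, presets are assumed to be already loaded into a dict, and overrides are merged at the top level only.

```python
def resolve_attribute(entry, presets):
    """Turn one pos_attributes/struct_attributes item into (id, definition).

    `entry` is a one-key dict as described above; `presets` maps preset
    names to definitions already loaded from the attributes directory.
    """
    if len(entry) != 1:
        raise ValueError("each attribute item must have exactly one key")
    (attr_id, value), = entry.items()
    if isinstance(value, str):
        # Option 2: a string naming a preset.
        return attr_id, dict(presets[value])
    definition = dict(value)
    preset_name = definition.pop("preset", None)
    if preset_name is not None:
        # Option 1 with "preset": start from the preset, then override.
        definition = {**presets[preset_name], **definition}
    return attr_id, definition
```

Applied to the struct_attributes example above, text_title takes the title preset as-is, text_type uses only its inline definition, and text_source starts from the url preset with its label replaced.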
