Right now the app is pre-configured to work with a backend running on our server.
- Go into the frontend directory
- Install dependencies by running yarn
- Start the frontend by running yarn start
- Open localhost:9111 in your browser
- Python 3.6 - 3.10 installed
- Corpus Workbench (CWB) 3.4.12 or newer
Download the current stable version of Corpus Workbench. Install by following the
Installing the CWB Core instructions, either by using the provided
packages or building from source. Refer to the included INSTALL text file for further instructions.
Once CWB is installed, by default you will find it under /usr/local/cwb-X.X.X/bin (where X.X.X is the version
number). Confirm that the installation was successful by running:
/usr/local/cwb-X.X.X/bin/cqp -v
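If you prefer to script this check, the following is a minimal sketch. It assumes that `cqp -v` exits with status 0 when the installation works; `binary_responds` is a hypothetical helper name, not part of CWB or Korp.

```python
import subprocess


def binary_responds(argv):
    """Return True if the command can be started and exits with status 0."""
    try:
        result = subprocess.run(
            argv, stdout=subprocess.PIPE, stderr=subprocess.PIPE, check=False
        )
    except OSError:
        # Covers a missing binary as well as one that is not executable.
        return False
    return result.returncode == 0


# Path taken from the text above; replace X.X.X with your installed version.
CQP_COMMAND = ["/usr/local/cwb-X.X.X/bin/cqp", "-v"]
```

Calling `binary_responds(CQP_COMMAND)` returns False for as long as the placeholder path does not point to a real binary.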
CWB needs two directories for storing the corpora: one for the data and one for the corpus registry. You may create these directories wherever you want, but from here on we will assume that you have created the following two:
/corpora/data
/corpora/registry
The added functionality for uploading vrt files through the frontend also needs a third directory:
/corpora/vrts
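The directories can also be created from Python. This is a small sketch, assuming everything lives under a single base directory; `create_corpus_dirs` is a hypothetical helper, and the default `/corpora` is only the location assumed in this document.

```python
from pathlib import Path


def create_corpus_dirs(base="/corpora"):
    """Create the data, registry and vrts directories under `base`."""
    created = []
    for name in ("data", "registry", "vrts"):
        path = Path(base) / name
        # parents=True also creates `base`; exist_ok=True makes reruns harmless.
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created
```

Writing to `/corpora` usually requires elevated permissions, so you may want to pass a base directory that your user owns.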
A sample vrt file has been provided. To encode it, run the command:
cwb-encode -d /absolute/path/corpora/data/test -f /input/file/path/test.vrt -R /absolute/path/corpora/registry/test -S corpus -S text -S sentence -S phrase_with_errors:1 -P word_id -P word_tokens -P grammatical_number -P POS -P lemma -P correction_status -P annotator_code -P error_correction -P error_tag -P dependency_relation -P dependency_head -P verb_morpho -P case -P error_type -c utf8
For multiple corpora, make sure each one is encoded into its own separate data directory.
Afterwards, run:
cwb-makeall -r /absolute/path/corpora/registry test
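When encoding several corpora, it can help to generate the two commands rather than type them. The sketch below only builds the argument lists and does not run anything. It assumes that `-R` names the registry file for the corpus (registry directory plus corpus ID) and that each corpus gets a data directory named after its ID. The function names are hypothetical; the attribute lists are copied from the sample command.

```python
# Structural (-S) and positional (-P) attributes from the sample command.
S_ATTRS = ["corpus", "text", "sentence", "phrase_with_errors:1"]
P_ATTRS = [
    "word_id", "word_tokens", "grammatical_number", "POS", "lemma",
    "correction_status", "annotator_code", "error_correction", "error_tag",
    "dependency_relation", "dependency_head", "verb_morpho", "case",
    "error_type",
]


def encode_command(corpus, vrt_file, data_root, registry_dir):
    """Build the cwb-encode argument list for one corpus."""
    argv = [
        "cwb-encode",
        "-d", f"{data_root}/{corpus}",     # one data directory per corpus
        "-f", vrt_file,
        "-R", f"{registry_dir}/{corpus}",  # assumption: registry file path
    ]
    for name in S_ATTRS:
        argv += ["-S", name]
    for name in P_ATTRS:
        argv += ["-P", name]
    argv += ["-c", "utf8"]
    return argv


def makeall_command(corpus, registry_dir):
    """Build the cwb-makeall argument list for one corpus."""
    return ["cwb-makeall", "-r", registry_dir, corpus]
```

The returned lists can be passed to `subprocess.run(argv, check=True)` once the paths point at real locations.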
Optionally you may set up a virtual Python environment:
$ python3 -m venv venv
$ source venv/bin/activate
Install the required Python modules using pip with the included requirements.txt.
$ pip3 install -r requirements.txt
The supplied config.py contains the default configuration.
The following variables need to be set for Korp to work:
- CQP_EXECUTABLE: The absolute path to the CQP binary. By default /usr/local/cwb-X.X.X/bin/cqp
- CWB_SCAN_EXECUTABLE: The absolute path to the cwb-scan-corpus binary. By default /usr/local/cwb-X.X.X/bin/cwb-scan-corpus
- CWB_REGISTRY: The absolute path to the CWB registry files. This is the /corpora/registry folder you created before.
- CWB_VRTS: The absolute path to the directory to house vrt files. This is the /corpora/vrts folder you created before.
If you are planning on using functionality dependent on a database, you also need to set the following variables:
- DBNAME: The name of the MySQL database where the corpus data will be stored.
- DBUSER & DBPASSWORD: Username and password for accessing the database.
For caching to work you need to specify both a cache directory (CACHE_DIR) and a Memcached server address or socket
(MEMCACHED_SERVER).
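To see at a glance which settings are still unset, something like the following could be used. It is only a sketch: it works on a plain dictionary rather than on the real config.py (whose layout is not shown here), `missing_settings` is a hypothetical helper, and the example values are placeholders.

```python
REQUIRED = ("CQP_EXECUTABLE", "CWB_SCAN_EXECUTABLE", "CWB_REGISTRY", "CWB_VRTS")
DATABASE = ("DBNAME", "DBUSER", "DBPASSWORD")
CACHE = ("CACHE_DIR", "MEMCACHED_SERVER")


def missing_settings(settings, use_database=False, use_cache=False):
    """Return the names of settings that are absent or empty."""
    names = list(REQUIRED)
    if use_database:
        names += DATABASE
    if use_cache:
        # Caching needs both the directory and the Memcached address.
        names += CACHE
    return [name for name in names if not settings.get(name)]


# Placeholder values; X.X.X stands for the installed CWB version.
EXAMPLE_SETTINGS = {
    "CQP_EXECUTABLE": "/usr/local/cwb-X.X.X/bin/cqp",
    # Assumption: cwb-scan-corpus sits next to cqp in bin/.
    "CWB_SCAN_EXECUTABLE": "/usr/local/cwb-X.X.X/bin/cwb-scan-corpus",
    "CWB_REGISTRY": "/corpora/registry",
    "CWB_VRTS": "/corpora/vrts",
}
```

With `EXAMPLE_SETTINGS`, the basic check reports nothing missing, while `use_cache=True` reports both cache settings.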
To run the backend, simply run run.py:
python3 run.py
A sample is already provided to get you up and running, but if you add more corpora, follow this procedure.
The corpus configuration used by the Korp frontend is served by the backend. In config.py, the variable
CORPUS_CONFIG_DIR should point to a directory having the following structure:
.
├── attributes
│   ├── positional
│   │   ├── lemma.yaml
│   │   ├── msd.yaml
│   │   ├── ...
│   │   └── pos.yaml
│   └── structural
│       ├── author.yaml
│       ├── title.yaml
│       ├── ...
│       └── year.yaml
├── corpora
│   ├── corpus1.yaml
│   ├── corpus2.yaml
│   ├── ...
│   └── yet-another-corpus.yaml
└── modes
    ├── default.yaml
    ├── another.yaml
    ├── ...
    └── other.yaml
- The modes directory contains one YAML file per mode in Korp.
- The corpora directory contains one YAML file per corpus.
- The attributes directory contains two subdirectories: positional and structural, containing optional attribute presets referred to by the corpus configurations.
Note:
Most settings in these files referring to labels or descriptions can optionally be localized using ISO 639-3 language
codes. For example, a label can look both like this:
label: author
... and like this:
label:
  eng: author
  swe: författare
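A consumer of these files has to accept both forms. The sketch below shows one way to do that on already-parsed YAML values; `localize` is a hypothetical helper, and the fallback to English and then to any available language is an assumption, not documented Korp behaviour.

```python
def localize(value, lang, fallback="eng"):
    """Return the text for `lang` from a plain or localized setting."""
    if isinstance(value, str):
        # Plain form: the same text is used for every language.
        return value
    if lang in value:
        return value[lang]
    if fallback in value:
        return value[fallback]
    # Last resort: the first language that is present.
    return next(iter(value.values()))
```

For example, `localize({"eng": "author", "swe": "författare"}, "swe")` picks the Swedish label, and `localize("author", "swe")` returns the plain string unchanged.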
At least one mode file is required, and that file must be named default.yaml. This is the mode that will be loaded
when no mode is explicitly requested.
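The layout and the default.yaml requirement can be checked before starting the backend. This is a minimal sketch; `check_config_dir` is a hypothetical helper, and treating all four directories as mandatory is an assumption (the attribute presets themselves are optional).

```python
from pathlib import Path

EXPECTED_DIRS = (
    "attributes/positional",
    "attributes/structural",
    "corpora",
    "modes",
)


def check_config_dir(root):
    """Return a list of problems found in a corpus configuration directory."""
    root = Path(root)
    problems = [
        f"missing directory: {name}"
        for name in EXPECTED_DIRS
        if not (root / name).is_dir()
    ]
    # The mode loaded when no mode is explicitly requested.
    if not (root / "modes" / "default.yaml").is_file():
        problems.append("missing file: modes/default.yaml")
    return problems
```

An empty list means the directory has the expected shape; the contents of the YAML files are not inspected.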
Required:
- label: The name of the mode, which will be shown in the interface.
Optional:
- description: A description of the mode, shown when first entering it. May include HTML.
- order: A number used for sorting the modes in the interface. Modes without an order will end up last.
- folders: A folder structure for the corpus selector. These folders can then be referenced by individual corpora. The folder structure can be of any depth, and folders can have any number of sub-folders (using the key subfolders). You may use HTML in the descriptions. Example:
    folders:
      novels:
        title:
          eng: Novels
          swe: Skönlitteratur
        description:
          eng: Corpora consisting of novels.
          swe: Korpusar bestående av skönlitteratur.
        subfolders:
          classics:
            title:
              eng: Classics
              swe: Klassiker
          scifi:
            title: Science-Fiction
- preselected_corpora: A list of corpus IDs which will be pre-selected when the user enters the mode. You may also refer to folders by using the prefix __, and dot-notation for referring to subfolders. Example:
    preselected_corpora:
      - my-corpus
      - __novels.scifi
- Other than the above, you can also override almost all the global settings set in the frontend's config.yaml. See the documentation for the frontend for a list of available settings.
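To make the order rule and the folder references concrete, here is a small sketch working on already-parsed mode files. All three function names are hypothetical, and the code illustrates the rules as described here; it is not how the Korp frontend implements them.

```python
def sort_modes(modes):
    """Sort mode names by 'order'; modes without an order end up last."""
    return sorted(
        modes,
        key=lambda name: ("order" not in modes[name], modes[name].get("order", 0)),
    )


def split_preselected(entries):
    """Separate plain corpus IDs from folder references (prefix '__')."""
    corpora = [e for e in entries if not e.startswith("__")]
    folder_paths = [e[2:] for e in entries if e.startswith("__")]
    return corpora, folder_paths


def find_folder(folders, dotted_path):
    """Follow a dot-notation path such as 'novels.scifi' through subfolders."""
    node = None
    level = folders
    for part in dotted_path.split("."):
        node = level[part]  # raises KeyError if the folder does not exist
        level = node.get("subfolders", {})
    return node
```

With the examples above, `split_preselected(["my-corpus", "__novels.scifi"])` yields the corpus ID `my-corpus` and the folder path `novels.scifi`, which `find_folder` then looks up in the folders structure.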
Corpus configuration files are placed in the corpora folder, and the filename of each configuration file should
correspond to a corpus ID in lowercase, followed by .yaml, e.g. mycorpus.yaml.
Required:
- id: The corpus' system name, same as the configuration file name (minus .yaml).
- title: Title of the corpus.
- description: Description of the corpus. HTML can be used.
- modes: A list of the modes in which the corpus will be included, optionally specifying a folder. Example:
    mode:
      - name: default
        folder: novels.classics
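The filename rule and the basic required settings are easy to verify once a file has been parsed. The sketch below is an assumption-laden helper (`check_corpus_config` is a hypothetical name). It checks only id, title and description, and leaves the list of modes unchecked.

```python
from pathlib import Path


def check_corpus_config(filename, config, required=("id", "title", "description")):
    """Return a list of problems for one parsed corpus configuration."""
    problems = [f"missing key: {key}" for key in required if key not in config]
    if "id" in config:
        # The file should be named after the corpus ID in lowercase.
        expected = f"{config['id'].lower()}.yaml"
        actual = Path(filename).name
        if actual != expected:
            problems.append(f"file is named {actual}, expected {expected}")
    return problems
```

`config` is the dictionary you get from loading the YAML file with whichever YAML library you use.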
Optional:
- within: Use this to override default_within (set in the global or mode config). within is a list of structural elements to use as boundaries when searching, ordered from smaller to bigger. Example:
    within:
      - label:
          eng: sentence
          swe: mening
        value: sentence
      - label:
          eng: paragraph
          swe: stycke
        value: paragraph
- context: Use this to override default_context (set in the global or mode config). context is a list of structural elements that can be used as context in the displaying of the search results, ordered from smaller to bigger. Example:
    context:
      - label:
          eng: 1 sentence
          swe: 1 mening
        value: 1 sentence
      - label:
          eng: 1 paragraph
          swe: 1 stycke
        value: 1 paragraph
- attribute_filters: A list of structural attributes on which the user will be able to filter the search results, using menus in both simple and extended search.
- pos_attributes and struct_attributes: Lists of positional and structural attributes. Every item in each list should be an object with one key. The key should be the ID of the attribute, e.g. msd for positional attributes or text_title for structural. The value should be either 1) an object with a complete attribute definition, or 2) a string referring to an attribute preset containing such a definition, e.g. msd to refer to attributes/positional/msd.yaml. With option 1, you may also refer to a preset by using the key preset and then extend/override that preset. The attribute definition is what tells the Korp frontend how to handle each attribute, like how it should be presented in the sidebar and what interface widget to use in extended search. For more information about what options are available for attribute definitions, see the Korp frontend documentation. Example:
    struct_attributes:
      - text_title: title
      - text_type:
          label:
            eng: type
            swe: typ
      - text_source:
          preset: url
          label:
            eng: source
            swe: källa
- custom_attributes: See Custom attributes.
- reading_mode: See Reading mode.
- limited_access: Set to true to indicate that this corpus requires the user to be logged in and to have the right permissions.
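The three ways of specifying an attribute (preset name, full definition, or preset plus overrides) can be sketched as follows. This is an illustration on already-parsed data, not the backend's actual code: the function names are hypothetical, `presets` stands for the loaded contents of one attributes subdirectory, and the shallow merge is an assumption.

```python
def resolve_attribute(value, presets):
    """Turn one attribute value into a complete definition."""
    if isinstance(value, str):
        # A string refers to a preset, e.g. "title" for title.yaml.
        return dict(presets[value])
    definition = dict(value)
    preset_name = definition.pop("preset", None)
    if preset_name is None:
        return definition
    # Start from the preset, then let the corpus config extend/override it.
    merged = dict(presets[preset_name])
    merged.update(definition)
    return merged


def resolve_attributes(items, presets):
    """Resolve a pos_attributes or struct_attributes list into a dict by ID."""
    resolved = {}
    for item in items:
        (attr_id, value), = item.items()  # every item has exactly one key
        resolved[attr_id] = resolve_attribute(value, presets)
    return resolved
```

Applied to the struct_attributes example above, text_title takes the title preset as-is, text_type uses its inline definition, and text_source takes the url preset with the label replaced.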