
HPLT Analytics

This tool provides a full range of analytics automatically computed on either monolingual or bilingual data sets to help make informed decisions about them.

It shows corpus details, volumes, language, length, noise and quality score distributions, common n-grams and more, in the spirit of the work carried out in https://www.semanticscholar.org/paper/Documenting-the-English-Colossal-Clean-Crawled-Dodge-Sap/40c3327a6ddb0603b6892344509c7f428ab43d81.

Support for language-dependent components has been added for dozens of languages.

The tool generates automated reports and is operated from a web application to which a corpus can be uploaded. Once the corpus is processed, the viewer plots the analysis and automatically generates a PDF report containing the same information.

Icon: https://thenounproject.com/icon/fingerprint-3530285/

Running the Docker containers:

  • sudo docker-compose build
  • sudo docker-compose up

URLs to upload and view a dataset:

  • Uploader: localhost:8000/uploader
  • Viewer: localhost:8000/viewer

If you need to access the Docker container to run commands inside it:

  • sudo docker exec -it dat-webapp /bin/bash

Code and data are located in /work

Current info in the generated YAML files:

The stats generated by this tool come in a handy YAML format with the following fields (a short Python loading example follows the field list):

  • bicleaner_scores: Distribution of segment pairs with a certain Bicleaner AI score (only for parallel corpora)
  • corpus: Corpus filename
  • docs_avg_lm: Distribution of documents having a certain average Monocleaner fluency score over their segments (only for monolingual documents)
  • docs_collections: Distribution of documents per origin collection (only for monolingual documents)
  • docs_langs: Distribution of documents having a certain percentage of their segments in the declared document language (only for monolingual documents)
  • docs_segments: Distribution of documents having a certain amount of segments (only for monolingual documents)
  • docs_timestamp: Unix timestamp indicating when the documents included in the stats were obtained (only for monolingual documents)
  • docs_top100_domains: 100 most common domains, and the amount of documents for each one (only for monolingual documents)
  • docs_top100_tld: 100 most common top level domains (not including subdomains), and the amount of documents for each one (only for monolingual documents)
  • docs_total: Total amount of documents in the corpus (only for monolingual documents)
  • docs_warning: List of issues encountered while processing documents (only for monolingual documents)
    • docs_unmatching_xxx: Some documents (a total of xxx) in the corpus had a mismatch between their number of segments and their LM scores or language identifications, so they were discarded.
  • hardrules_tags: List of possible issues in the segments, detected by Hardrules
    • not_too_long: Percentage of segments longer than 1024 characters.
    • not_too_short: Percentage of segments shorter than 3 tokens.
    • no_urls: Percentage of segments containing URLs.
    • no_bad_encoding: Percentage of badly encoded segments.
    • no_porn: Percentage of segments containing porn content (not available for all languages)
  • monocleaner_scores: Distribution of segments with a certain Monocleaner score (only for monolingual corpora)
  • sentence_pairs: Total amount of segments (in the case of monolingual corpora) or segment pairs (in the case of parallel corpora)
  • src_bytes: Total size of source segments, uncompressed.
  • srclang: Source language.
  • src_langs: Distribution of source segment languages, as identified by FastSpell
  • src_ngrams: Distribution of the 5 most common n-grams of each order (1-grams to 5-grams) in source segments.
  • src_sent_tokens: Distribution of source segments having a certain amount of tokens (more info on tokenization tools here)
  • src_sent_tokens_mean: Mean value of src_sent_tokens.
  • src_sent_tokens_median: Median value of src_sent_tokens.
  • src_tokens: Total amount of tokens in source segments.
  • src_unique_sents: Distribution of source segments having a certain amount of tokens, after removing duplicated segments.
  • timestamp: Unix timestamp indicating when the stats were obtained.
  • trg_bytes: Total size of target segments, uncompressed (only for parallel corpora)
  • trglang: Target language.
  • trg_langs: Distribution of target segment languages, as identified by FastSpell (only for parallel corpora)
  • trg_ngrams: Distribution of the 5 most common n-grams of each order (1-grams to 5-grams) in target segments (only for parallel corpora)
  • trg_sent_tokens: Distribution of target segments having a certain amount of tokens (more info on tokenization tools here) (only for parallel corpora)
  • trg_sent_tokens_mean: Mean value of trg_sent_tokens (only for parallel corpora)
  • trg_sent_tokens_median: Median value of trg_sent_tokens (only for parallel corpora)
  • trg_tokens: Total amount of tokens in target segments (only for parallel corpora)
  • trg_unique_sents: Distribution of target segments having a certain amount of tokens, after removing duplicated segments (only for parallel corpora)
  • ttr_src: Type-Token Ratio of the source segments.
  • ttr_trg: Type-Token Ratio of the target segments.
  • unique_sents: Total amount of segments (for monolingual corpora) or segment pairs (for parallel corpora), after removing duplicated segments or segment pairs.
  • warnings: List of issues encountered while processing the corpus.
    • src_warning_tok_xxx_yyy: The source language is not supported by a dedicated tokenizer, so it falls back to the xxx tokenizer with the yyy language (only for parallel corpora).
    • trg_warning_tok_xxx_yyy: Same as the above but for the target language (only for parallel corpora).
    • ngrams_xxx_nostopwords: No stopwords available for the xxx language (the language being processed)
    • ngrams_xxx_freq: The stopwords used for the xxx language were simply obtained by frequency (top 1% of the corpus)
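
As a quick illustration of how these files can be consumed, here is a minimal Python sketch that loads one of them and prints a few of the fields listed above. The filename stats.yaml is hypothetical, the file is assumed to be a flat mapping of the fields described here, and fields marked as monolingual-only or parallel-only may simply be absent.

```python
# Minimal sketch: load a generated stats YAML file and print a few fields.
# "stats.yaml" is a hypothetical filename; monolingual-only or parallel-only
# fields may be missing, hence the .get() calls.
import yaml

with open("stats.yaml") as f:
    stats = yaml.safe_load(f)

print("Corpus:", stats.get("corpus"))
print("Source language:", stats.get("srclang"))
print("Segments / segment pairs:", stats.get("sentence_pairs"))
print("Unique segments / pairs:", stats.get("unique_sents"))
print("Mean source tokens per segment:", stats.get("src_sent_tokens_mean"))
print("Source TTR:", stats.get("ttr_src"))

# Hardrules results, assumed here to be a mapping from rule name to the
# percentage of segments affected by that rule.
for rule, pct in (stats.get("hardrules_tags") or {}).items():
    print(f"  {rule}: {pct}%")

# Document-level fields only exist for monolingual document corpora.
if "docs_total" in stats:
    print("Documents:", stats["docs_total"])
```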

Viewer:

HPLTAnalytics comes with a webapp that displays the generated YAML files in a friendlier, more comfortable interface. It has the following sections:

  • General overview:
    • Corpus name
    • Date on which the analysis was performed
    • Language(s)
  • Volumes
    • Documents (only for monolingual documents)
    • Segments
    • Unique segments
    • Size in tokens
    • File size
  • Type Token Ratio
    • Lexical variation indicator. The ratio is obtained by dividing the total number of different words (called types) by the total number of words (called tokens). The higher, the better: a high TTR indicates a high degree of lexical variation, while a low TTR indicates the opposite (a minimal worked example follows this list).
  • Top 10 domains (excluding subdomains) (only for monolingual documents)
  • Top 10 TLDs (only for monolingual documents)
  • Document size (in segments). Histogram showing the distribution of document sizes (only for monolingual documents)
  • Documents by collection (only for monolingual documents)
  • Language distribution.
    • Number of segments: shows the percentage of segments per automatically identified language.
    • Percentage of segments in the declared language, inside documents (only for monolingual documents)
  • Quality Score distribution: as per language models (monolingual) or Bicleaner AI scores (a tool that computes the likelihood of two sentences being mutual translations)
  • Quality Score average distribution: Histogram displaying the distribution of the average fluency score of segments in documents (only for monolingual documents)
  • Segment length distribution: tokens per segment for each language, showing total, unique and duplicate segments or segment pairs (a sketch reproducing this plot from the YAML follows this list).
  • Noise distribution: the result of applying hard rules and computing which percentage is affected by them (too short or too long sentences, sentences being URLs, bad encoding, sentences containing poor language, etc.)
  • Frequent n-grams: the 5 most frequent n-grams of each order (1-grams to 5-grams)
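
As a rough illustration of the Type Token Ratio described above, here is a small Python sketch that computes it with naive whitespace tokenization. The real tool relies on language-dependent tokenizers, so actual values will differ; the sample segments are made up.

```python
# Illustrative sketch: Type-Token Ratio = distinct tokens (types) / total tokens.
# Naive whitespace tokenization is a simplification; the tool itself uses
# language-dependent tokenizers, so real values will differ.
def type_token_ratio(segments):
    tokens = [tok.lower() for seg in segments for tok in seg.split()]
    return len(set(tokens)) / len(tokens) if tokens else 0.0

segments = [
    "the quick brown fox jumps over the lazy dog",
    "the dog sleeps",
]
print(round(type_token_ratio(segments), 3))  # 9 types / 12 tokens = 0.75
```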
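
Similarly, the segment length histogram mentioned above could in principle be reproduced directly from the YAML. The sketch below assumes that src_sent_tokens is a mapping from a token-count bucket to the number of segments in that bucket and that the keys are numeric-like; the actual key format in the generated files may differ, and stats.yaml is again a hypothetical filename.

```python
# Hedged sketch: plot a segment length histogram from the stats YAML.
# Assumes src_sent_tokens maps a token-count bucket to a segment count;
# the actual key format in the generated files may differ.
import yaml
import matplotlib.pyplot as plt

with open("stats.yaml") as f:                      # hypothetical path
    stats = yaml.safe_load(f)

dist = stats.get("src_sent_tokens") or {}
buckets = sorted(dist, key=lambda k: float(k))     # assumption: numeric-like keys
counts = [dist[b] for b in buckets]

plt.bar(range(len(buckets)), counts)
plt.xticks(range(len(buckets)), buckets, rotation=90)
plt.xlabel("Tokens per segment")
plt.ylabel("Segments")
plt.title(stats.get("corpus", "corpus"))
plt.tight_layout()
plt.savefig("segment_length_distribution.png")
```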

Output examples:

  • HPLT monolingual documents for Afrikaans: it shows that more than half of the documents come from the same domain, and that a large number of documents have less than 30% of their segments in Afrikaans. The corpus also contains a lot of short segments.

[Screenshot: Data Analytics Viewer]

  • Parallel English-Norwegian HPLT corpus from the initial data release: it shows that deduplication is one of the most important issues to address.

[Screenshot: Data Analytics Viewer]

  • Monolingual Basque corpus from HPLT: it shows that at least 3/4 of the corpus is not in Basque, and that a very high percentage of segments are very short.

[Screenshot: Data Analytics Viewer]