# OCR QA Score Pipeline with Impresso Package

### What is this notebook about?

This notebook demonstrates the functionality and use cases of the Impresso subpackage `ocrqa`.

At its core, this package/pipeline takes a text input and calculates a QA score using a Bloom filter. It first automatically detects the language of the text using the Impresso subpackage `langident` (unless explicitly specified). The pipeline dynamically checks all available languages based on existing Bloom filters and returns an "unsupported language" message if the language is not yet available. If supported, it retrieves the latest Bloom filter for the detected language from the Impresso Hugging Face repository: `impresso-project/OCR-quality-assessment-unigram`, which is then used to calculate the QA score.

*  QA Score: A value between 0 and 1, representing the proportion of words in the text that are found in the Bloom filter (known words divided by all words).
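To make the idea concrete, here is a minimal sketch of how a Bloom filter can back such a score. It is an illustration only, not the `ocrqa` implementation: the `ToyBloomFilter` class, the `qa_score` function, the tokenization rule, and the tiny vocabulary are all assumptions made for this example.

```python
import hashlib
import re


class ToyBloomFilter:
    """A tiny Bloom filter for illustration only (not the Impresso implementation)."""

    def __init__(self, size_bits=8192, num_hashes=4):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, word):
        # Derive several bit positions per word by salting a SHA-256 hash.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{word}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, word):
        for pos in self._positions(word):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, word):
        # A Bloom filter may report a false positive, but never a false negative.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(word)
        )


def qa_score(text, bloom):
    """Share of unique lowercased word tokens that the filter reports as known."""
    tokens = set(re.findall(r"\w+", text.lower()))
    if not tokens:
        return 0.0  # assumption: an empty text gets the lowest score
    known = sum(1 for token in tokens if token in bloom)
    return known / len(tokens)


# Hypothetical mini-vocabulary; the real filters are built from far larger word lists.
vocabulary = ToyBloomFilter()
for word in ["ein", "kleiner", "hund", "lebte", "in", "einem", "dorf"]:
    vocabulary.add(word)

clean = "Ein kleiner Hund lebte in einem Dorf"
noisy = "Ein kle1ner Hvnd lebte in einem Dorf"  # simulated OCR errors
print(qa_score(clean, vocabulary))
print(qa_score(noisy, vocabulary))
```

The clean sentence scores higher than the noisy one because the two corrupted tokens are not in the vocabulary. A Bloom filter trades a small false-positive rate for a very compact representation, which is why it suits large word lists.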

Additionally, as will be shown below, the pipeline provides optional arguments that allow for more advanced usage of the package.



### Prerequisites

First, install the Impresso package:

In [None]:
!pip install glebs_package

Collecting glebs_package
  Downloading glebs_package-1.2.2-py3-none-any.whl.metadata (351 bytes)
Collecting pybloomfiltermmap3 (from glebs_package)
  Downloading pybloomfiltermmap3-0.6.0.tar.gz (505 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 505.3/505.3 kB 12.5 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Collecting floret (from glebs_package)
  Downloading floret-0.10.5-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (3.1 kB)
Downloading glebs_package-1.2.2-py3-none-any.whl (5.1 kB)
Downloading floret-0.10.5-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (321 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 321.6/321.6 kB 15.0 MB/s eta 0:00:00
Building wheels for collected packages: pybloomfiltermmap3
  Building wheel for pybloomfiltermmap3 (setup.py) ... done
  Created wheel for pybloomfiltermmap3: filename=pybloomfilte

---
### Basic Usage

Unless explicitly specified, the pipeline uses the Impresso subpackage `langident` to detect the language of the text with the latest available model. For more details on the langident subpackage, please refer to the Impresso demo notebook *langident_pipeline_demo.ipynb*.

Once the language is detected, the pipeline checks if a corresponding Bloom filter exists. If available, it retrieves and uses the latest version.
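The lookup described above can be sketched as follows. This is a hedged illustration, not the package's actual code: the file-name pattern is modelled on the filter name that appears in this notebook's outputs (for example `ocrqa-wp_v1.0.6-de.bloom`), while the functions `latest_filters` and `pick_filter` and the repository listing are hypothetical.

```python
import re

# Pattern modelled on the filter names shown in this notebook's outputs,
# e.g. "ocrqa-wp_v1.0.6-de.bloom". Treat it as an assumption.
FILTER_NAME = re.compile(r"^ocrqa-wp_v(\d+(?:\.\d+)*)-([a-z]{2,3})\.bloom$")


def latest_filters(filenames):
    """Map each language code to its highest-versioned filter file name."""
    best = {}
    for name in filenames:
        match = FILTER_NAME.match(name)
        if match is None:
            continue  # skip files that are not Bloom filters
        # Compare versions numerically, so that 1.0.10 ranks above 1.0.9.
        version = tuple(int(part) for part in match.group(1).split("."))
        lang = match.group(2)
        if lang not in best or version > best[lang][0]:
            best[lang] = (version, name)
    return {lang: name for lang, (_, name) in best.items()}


def pick_filter(lang, filenames):
    """Return the newest filter for `lang`, or None if the language is unsupported."""
    return latest_filters(filenames).get(lang)


# Hypothetical repository listing, for illustration only.
listing = [
    "README.md",
    "ocrqa-wp_v1.0.5-de.bloom",
    "ocrqa-wp_v1.0.6-de.bloom",
    "ocrqa-wp_v1.0.5-lb.bloom",
]
print(pick_filter("de", listing))
print(pick_filter("it", listing))
```

Deriving the supported languages from the files that actually exist means no separate language list has to be maintained: a language becomes supported as soon as a filter for it is published.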

In [None]:
# Import the OCR QA pipeline from the Impresso package
from glebs_package.ocrqa import OCRQAPipeline

# Instantiate the pipeline once and reuse it for every call below
ocrqa_pipeline = OCRQAPipeline()

Once the pipeline is initialized, pass it the text you want to assess. This example uses German text.

In [None]:
de_text = "Ein kleiner Hund namens Max lebte in einem ruhigen Dorf. Jeden Tag rannte er durch die Straßen und spielte mit den Kindern. Eines Tages fand er einen geheimen Garten, den niemand kannte. Max entschied sich, den Garten zu erkunden und entdeckte viele schöne Blumen und Tiere. Von diesem Tag an besuchte er den Garten jeden Nachmittag."

In [None]:
ocrqa_pipeline(de_text)

{'language': 'de', 'score': 1.0}

The default output of the pipeline is a dictionary containing the detected language and the corresponding QA score. The QA score is rounded to one decimal place so that minor variations, such as an unusual name, do not noticeably change the overall score.

In this example, the score is 1.0, indicating that all or nearly all words in the text exist in the Bloom filter. For OCR-derived text, a score this high suggests that the OCR process was highly successful.
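The effect of rounding can be illustrated with a short calculation. The sketch below assumes the score is the share of known tokens rounded with Python's built-in `round`; the function `rounded_score` and the token count of 44 are chosen for illustration and are not taken from the package's code.

```python
def rounded_score(known, total):
    """Share of known tokens, rounded to one decimal place."""
    return round(known / total, 1)


total_tokens = 44  # assumed number of unique tokens in a short paragraph
for unknown in range(5):
    # Print how the rounded score reacts to a growing number of unknown tokens.
    print(unknown, rounded_score(total_tokens - unknown, total_tokens))
```

Under these assumptions, one or two unknown tokens out of 44 still round to 1.0, and the score only drops to 0.9 with the third. The shorter the text, the fewer unknown tokens it takes to move the rounded score.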

---
### Advanced Usage

The pipeline accepts several optional arguments when called, which help you understand the results in more detail: `language`, `version`, `diagnostics`, `model_id`, and `supported_languages`.

*   `language`: Accepts language abbreviation strings such as "en" (English) or "de" (German). If provided, the pipeline assumes the specified language and skips the language detection step, directly using the corresponding Bloom filter.

*   `version`: Accepts a specific Bloom filter model version in the format "1.0.5" or "1.0.6". If specified, the pipeline uses the requested version (if available) and skips the automatic retrieval of the latest model.

*   `diagnostics`: Boolean. If set to True, the pipeline returns additional information: the known tokens (`known_tokens`), the unknown tokens (spelled `unknowns_tokens` in the output), and the name of the Bloom filter used (`bloom_filter`). For more details, see the sections below.

*   `model_id`: Boolean. If set to True, the pipeline includes the name of the Bloom filter model used in the output.

*   `supported_languages`: Boolean. If set to True, the pipeline returns a list of supported languages (i.e., languages for which a Bloom filter is available).

These arguments can be used individually, in combination with each other, or all at once, depending on the level of detail needed.

**Example 1**: `language`

In [None]:
# Using the same German text example as before
ocrqa_pipeline(de_text, language="lb")

{'language': 'lb', 'score': 0.9}

Even though the provided text is clearly in German, specifying the language as Luxembourgish, for example, forces the pipeline to use the corresponding Bloom filter for that language. If the selected language is unsupported, the pipeline will return an appropriate error message.

**Example 2**: `version`

In [None]:
# Using the same German text example as before
ocrqa_pipeline(de_text, version="1.0.5")

{'language': 'de', 'score': 1.0}

In the example above, explicitly setting `version` to *1.0.5* instructs the pipeline to use the Bloom filter for that version, even if a more recent version is available.

**Example 3**: `diagnostics`

In [None]:
# Using the same German text example as before
ocrqa_pipeline(de_text, diagnostics=True)

{'language': 'de',
 'score': 1.0,
 'diagnostics': {'known_tokens': ['jeden',
   'die',
   'lebte',
   'eines',
   'diesem',
   'namens',
   'einen',
   'hund',
   'er',
   'entdeckte',
   'von',
   'straßen',
   'den',
   'besuchte',
   'viele',
   'zu',
   'durch',
   'tag',
   'kleiner',
   'fand',
   'nachmittag',
   'garten',
   'sich',
   'in',
   'spielte',
   'mit',
   'tiere',
   'an',
   'entschied',
   'rannte',
   'tages',
   'blumen',
   'kindern',
   'schöne',
   'geheimen',
   'einem',
   'niemand',
   'dorf',
   'ruhigen',
   'max',
   'kannte',
   'erkunden',
   'und',
   'ein'],
  'unknowns_tokens': [],
  'bloom_filter': 'ocrqa-wp_v1.0.6-de.bloom'}}

When `diagnostics` is set to *True*, an additional key, `diagnostics`, is added to the dictionary. Its value contains all known and unknown tokens, as well as the name of the Bloom filter used. In this example, there are no unknown tokens, meaning every word exists in this specific Bloom filter.

**Example 4**: `model_id`

In [None]:
# Using the same German text example as before
ocrqa_pipeline(de_text, model_id=True)

{'language': 'de', 'score': 1.0, 'bloom_filter': 'ocrqa-wp_v1.0.6-de.bloom'}

The `model_id` argument is a lighter alternative to `diagnostics`. If set to `True`, the pipeline returns one additional key, `bloom_filter`, whose value is the name of the Bloom filter used for the analysis.

**Example 5**: `supported_languages`

In [None]:
# Using the same German text example as before
ocrqa_pipeline(de_text, supported_languages=True)

{'language': 'de',
 'score': 1.0,
 'supported_languages': ['de', 'lb', 'fr', 'en']}

Once `supported_languages` is set to *True*, the pipeline returns an additional key, `supported_languages`, with a value containing a list of all currently supported languages (i.e., languages that have a corresponding Bloom filter).

**Example 6**: All at once

In [None]:
# Using the same German text example as before
ocrqa_pipeline(de_text, language="fr", version="1.0.5", diagnostics=True, model_id=True, supported_languages=True)

{'language': 'fr',
 'score': 0.8,
 'diagnostics': {'known_tokens': ['jeden',
   'die',
   'eines',
   'diesem',
   'namens',
   'einen',
   'hund',
   'er',
   'von',
   'straßen',
   'den',
   'viele',
   'zu',
   'durch',
   'tag',
   'kleiner',
   'fand',
   'nachmittag',
   'garten',
   'sich',
   'in',
   'mit',
   'tiere',
   'an',
   'tages',
   'blumen',
   'kindern',
   'schöne',
   'geheimen',
   'einem',
   'niemand',
   'dorf',
   'max',
   'kannte',
   'und',
   'ein'],
  'unknowns_tokens': ['lebte',
   'ruhigen',
   'entschied',
   'rannte',
   'erkunden',
   'entdeckte',
   'spielte',
   'besuchte'],
  'bloom_filter': 'ocrqa-wp_v1.0.5-fr.bloom'},
 'supported_languages': ['de', 'lb', 'fr', 'en']}

You can use a mix of additional parameters, or all of them at once, to gain a deeper understanding of your QA score. In the example above, we set the language to French although the text is German. This results in several unknown tokens, because the French Bloom filter does not cover all of the German words in the text.

---
### Edge cases

In [None]:
short_de_text_with_unusual_name = "Glebs geht ins Büro."

In [None]:
# Example 1: Very short sentence
ocrqa_pipeline(short_de_text_with_unusual_name, diagnostics=True)

{'language': 'lb',
 'score': 0.8,
 'diagnostics': {'known_tokens': ['ins', 'büro', 'geht'],
  'unknowns_tokens': ['glebs'],
  'bloom_filter': 'ocrqa-wp_v1.0.5-lb.bloom'}}
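
Two things stand out in this output. First, the short German sentence was identified as Luxembourgish (`lb`), so the Luxembourgish Bloom filter was used; passing `language="de"` explicitly, as described under Advanced Usage, would avoid relying on language detection for such short inputs. Second, with only four tokens, the single unknown name accounts for a quarter of the text. The sketch below recomputes the score from the diagnostics. It assumes the score is the number of known tokens divided by all tokens, rounded to one decimal place, which is consistent with the outputs shown in this notebook but is not taken from the package's code; the function `score_from_diagnostics` is hypothetical.

```python
def score_from_diagnostics(diagnostics):
    """Recompute the rounded score from the token lists in a diagnostics dict."""
    known = len(diagnostics["known_tokens"])
    # The key is spelled "unknowns_tokens" in the pipeline output shown above.
    unknown = len(diagnostics["unknowns_tokens"])
    total = known + unknown
    if total == 0:
        return None  # nothing to score
    return round(known / total, 1)


# Diagnostics copied from the edge-case output above.
edge_case = {
    "known_tokens": ["ins", "büro", "geht"],
    "unknowns_tokens": ["glebs"],
    "bloom_filter": "ocrqa-wp_v1.0.5-lb.bloom",
}
print(score_from_diagnostics(edge_case))  # 3 of 4 tokens are known
```

The same calculation applied to Example 6 (36 known and 8 unknown tokens) also gives 0.8, matching the score reported there.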