# Natural Language Processing with Python and medkit

medkit (GitHub: https://github.com/TeamHeka/medkit, documentation: https://medkit.readthedocs.io/en/stable/) is a library dedicated to processing clinical data.

For now, medkit mainly handles two types of data: text and audio. For text, most of the provided resources were developed for French, but the library can be used with other languages.

## Installing medkit

The recommended approach is to use a Conda environment; in this lab, we will rely on a *pip* install instead.

The library is already installed on this system; the command line is given only as a reference.

In [None]:
!pip3 install --upgrade medkit-lib[optional]

## Downloading a corpus of documents

The documents have already been downloaded here, so there is no need to perform this step.

In [None]:
!wget https://github.com/neurazlab/mtsamplesFR/raw/master/data/mtsamples.csv -O 01_data/mtsamples.csv

## Reading the documents
We load the corpus into a data frame and print the first document.

In [1]:
import pandas as pd

docs = pd.read_csv('/home/ressources/PBL/mtsamples.csv')

The *iloc* indexer selects rows by position; index 0 returns the first row of the data frame.

In [2]:
print(docs.iloc[0]['transcription'])

SUBJECTIVE:,  This 23-year-old white female presents with complaint of allergies.  She used to have allergies when she lived in Seattle but she thinks they are worse here.  In the past, she has tried Claritin, and Zyrtec.  Both worked for short time but then seemed to lose effectiveness.  She has used Allegra also.  She used that last summer and she began using it again two weeks ago.  It does not appear to be working very well.  She has used over-the-counter sprays but no prescription nasal sprays.  She does have asthma but doest not require daily medication for this and does not think it is flaring up.,MEDICATIONS: , Her only medication currently is Ortho Tri-Cyclen and the Allegra.,ALLERGIES: , She has no known medicine allergies.,OBJECTIVE:,Vitals:  Weight was 130 pounds and blood pressure 124/78.,HEENT:  Her throat was mildly erythematous without exudate.  Nasal mucosa was erythematous and swollen.  Only clear drainage was seen.  TMs were clear.,Neck:  Supple without adenopathy.,L

Inspecting the document object:

In [3]:
docs.iloc[0]

Unnamed: 0                                                           0
description           A 23-year-old white female presents with comp...
medical_specialty                                 Allergy / Immunology
sample_name                                         Allergic Rhinitis 
transcription        SUBJECTIVE:,  This 23-year-old white female pr...
keywords             allergy / immunology, allergic rhinitis, aller...
Name: 0, dtype: object

## Creating your first medkit document

In [10]:
from medkit.core.text import TextDocument

doc = TextDocument(text=docs.iloc[0]['transcription'])
print(doc.text)

SUBJECTIVE:,  This 23-year-old white female presents with complaint of allergies.  She used to have allergies when she lived in Seattle but she thinks they are worse here.  In the past, she has tried Claritin, and Zyrtec.  Both worked for short time but then seemed to lose effectiveness.  She has used Allegra also.  She used that last summer and she began using it again two weeks ago.  It does not appear to be working very well.  She has used over-the-counter sprays but no prescription nasal sprays.  She does have asthma but doest not require daily medication for this and does not think it is flaring up.,MEDICATIONS: , Her only medication currently is Ortho Tri-Cyclen and the Allegra.,ALLERGIES: , She has no known medicine allergies.,OBJECTIVE:,Vitals:  Weight was 130 pounds and blood pressure 124/78.,HEENT:  Her throat was mildly erythematous without exudate.  Nasal mucosa was erythematous and swollen.  Only clear drainage was seen.  TMs were clear.,Neck:  Supple without adenopathy.,L

### Using regular expressions to extract vitals
Let's first extract vitals (weight, blood pressure, height...) using regular expressions.

In [11]:
from medkit.text.ner import RegexpMatcher, RegexpMatcherRule

regexp_rules = [
    RegexpMatcherRule(regexp=r"[0-9]+", label="number"),  # change this rule to detect blood pressure (the form is usually 120/80)
    RegexpMatcherRule(regexp=r"[0-9]{2,3}/[0-9]{2,3}", label="bp"),
    #    
    # Add a rule capturing BMI values
    #
]
regexp_matcher = RegexpMatcher(rules=regexp_rules)

The code above only creates a matcher; it does not process any text yet. To detect entities, the matcher has to be run on the document.

In [12]:
entities = regexp_matcher.run([doc.raw_segment])

Let's look at the entities detected by the regular expressions:

In [13]:
for entity in entities:
    print(f"text={entity.text!r}, spans={entity.spans}, label={entity.label}")

text='23', spans=[Span(start=19, end=21)], label=number
text='130', spans=[Span(start=775, end=778)], label=number
text='124', spans=[Span(start=805, end=808)], label=number
text='78', spans=[Span(start=809, end=811)], label=number
text='1', spans=[Span(start=1053, end=1054)], label=number
text='2', spans=[Span(start=1222, end=1223)], label=number
text='124/78', spans=[Span(start=805, end=811)], label=bp
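
To see where the span offsets in the output above come from, here is a minimal sketch that uses only Python's standard `re` module rather than medkit. The sample sentence is a hypothetical excerpt modelled on the vitals line of the document, and the two patterns are the ones used in the rules above; medkit's own matcher does more than this (it builds entity objects, for example).

```python
import re

# Hypothetical sample sentence, modelled on the vitals line of the document.
sample = "Vitals:  Weight was 130 pounds and blood pressure 124/78."

# Same two patterns as in the matcher rules above.
patterns = {
    "number": r"[0-9]+",
    "bp": r"[0-9]{2,3}/[0-9]{2,3}",
}

matches = []
for label, pattern in patterns.items():
    for m in re.finditer(pattern, sample):
        # m.start() and m.end() play the role of Span(start=..., end=...)
        matches.append((m.group(), m.start(), m.end(), label))

for text, start, end, label in matches:
    print(f"text={text!r}, start={start}, end={end}, label={label}")
```

Note that the two rules overlap: the blood pressure value is reported once as a whole by the "bp" pattern and once per component by the "number" pattern, just as in the medkit output.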


The next cell attaches the detected entities to the document as annotations. This step is required for visualization, but its details are not important here.

In [14]:
for entity in entities:
    doc.anns.add(entity)

In [20]:
from spacy import displacy
from medkit.text.spacy.displacy_utils import medkit_doc_to_displacy

colors = {"number": "#ff6961", "bp":"#93ff33"}
options = {"ents": ['number', 'bp'], "colors": colors}

displacy_data = medkit_doc_to_displacy(doc)
displacy.render(displacy_data, manual=True, style="ent", options=options)

#### Exercise:

Modify the code above to capture BMI, blood pressure,
and the age of the patient.

# Extracting drugs using a dictionary

We will rely on a list of drugs provided by the US FDA to build a dictionary (https://www.fda.gov/drugs/drug-approvals-and-databases/drugsfda-data-files).

### Step 1 - constructing a dictionary of drug names
Read the Products file, and identify the columns of interest

In [21]:
import pandas as pd

drugs = pd.read_csv("/home/ressources/PBL/Products.txt", sep="\t",on_bad_lines='skip')
drugs.iloc[0]

ApplNo                                             4
ProductNo                                          4
Form                       SOLUTION/DROPS;OPHTHALMIC
Strength                                          1%
ReferenceDrug                                      0
DrugName                                   PAREDRINE
ActiveIngredient     HYDROXYAMPHETAMINE HYDROBROMIDE
ReferenceStandard                                0.0
Name: 0, dtype: object

In [22]:
# Print the size of the dataset, here 46k brand names
len(drugs)

46146

### Step 2 - Build the dictionary using the IAMsystem matcher

In [23]:
from iamsystem import Matcher
from iamsystem import ESpellWiseAlgo

from medkit.text.ner import RegexpMatcher, RegexpMatcherRule
from medkit.text.ner.iamsystem_matcher import IAMSystemMatcher
from medkit.text.ner.iamsystem_matcher import MedkitKeyword

keywords_list=[]
# Creation of the list of terms to be searched
for i in range(0, len(drugs)):
    keywords_list.append(MedkitKeyword(label=drugs.at[i, 'DrugName'], kb_id=drugs.at[i, 'ApplNo'], kb_name="FDA", ent_label='drug'))

    
    
matcher = Matcher.build(
    keywords=keywords_list,
    spellwise=[dict(measure=ESpellWiseAlgo.LEVENSHTEIN,max_distance=1,min_nb_char=10,)],
    stopwords=["and"],
    w=2,
)

In [24]:
iam_matcher = IAMSystemMatcher(matcher = matcher, attrs_to_copy = ["is_negated", "family"])

In [25]:
entities = iam_matcher.run([doc.raw_segment])

for entity in entities:
    print(f"text={entity.text!r}, spans={entity.spans}, label={entity.label}")

text='Claritin', spans=[Span(start=200, end=208)], label=drug
text='Zyrtec', spans=[Span(start=214, end=220)], label=drug
text='Allegra', spans=[Span(start=303, end=310)], label=drug
text='Ortho Tri Cyclen', spans=[Span(start=660, end=665), ModifiedSpan(length=1, replaced_spans=[]), Span(start=666, end=669), ModifiedSpan(length=1, replaced_spans=[]), Span(start=670, end=676)], label=drug
text='Allegra', spans=[Span(start=685, end=692)], label=drug
text='Zyrtec', spans=[Span(start=1070, end=1076)], label=drug
text='Allegra', spans=[Span(start=1088, end=1095)], label=drug
text='loratadine', spans=[Span(start=1134, end=1144)], label=drug
text='Nasonex', spans=[Span(start=1237, end=1244)], label=drug


In [26]:
for entity in entities:
    doc.anns.add(entity)

In [27]:
from spacy import displacy
from medkit.text.spacy.displacy_utils import medkit_doc_to_displacy

colors = {'weight': "#85C1E9", "number": "#ff6961", "drug":"#ff9900"}
options = {"ents": ['weight', 'number','drug'], "colors": colors}

displacy_data = medkit_doc_to_displacy(doc)
displacy.render(displacy_data, manual=True, style="ent", options=options)

#### Exercise:

Complete the code above to capture not only brand names but also ingredients (i.e. molecule names)

## Extracting phenotypes using dictionaries

We will rely on phenotypes provided by the Human Phenotype Ontology (downloaded from https://hpo.jax.org/app/data/ontology).

In [28]:
import pandas as pd

pheno_terms = pd.read_csv("/home/ressources/PBL/hp_terms.txt", sep="\t",on_bad_lines='skip', )
print(pheno_terms.iloc[0])
pheno_syn_terms = pd.read_csv("/home/ressources/PBL/hp_synonyms.txt", sep="\t",on_bad_lines='skip', )
print(pheno_syn_terms.iloc[0])

Term    1-2 finger syndactyly,
Name: 0, dtype: object
Term    1-Methylhistidinuria
Name: 0, dtype: object


The next step takes about 30 seconds.

In [29]:
pheno_keywords_list=[]
max_index = 0
# Creation of the list of terms to be searched
for i in range(0, len(pheno_terms)):
    pheno_keywords_list.append(MedkitKeyword(label=pheno_terms.at[i, 'Term'], kb_id=i, kb_name="HPO", ent_label='pheno'))
    max_index = i

pheno_matcher = Matcher.build(
    keywords=pheno_keywords_list,
    spellwise=[dict(measure=ESpellWiseAlgo.LEVENSHTEIN,max_distance=1,min_nb_char=10,)],
    stopwords=["and"],
    w=2,
)

In [30]:
iam_matcher_pheno = IAMSystemMatcher(matcher = pheno_matcher, attrs_to_copy = ["is_negated", "family"])

In [31]:
entities = iam_matcher_pheno.run([doc.raw_segment])
for entity in entities:
    print(f"text={entity.text!r}, spans={entity.spans}, label={entity.label}")

text='asthma', spans=[Span(start=520, end=526)], label=pheno
text='Allergic rhinitis', spans=[Span(start=1028, end=1036), ModifiedSpan(length=1, replaced_spans=[]), Span(start=1037, end=1045)], label=pheno
text='2', spans=[Span(start=1222, end=1223)], label=pheno


In [32]:
for entity in entities:
    doc.anns.add(entity)

In [33]:
from spacy import displacy
from medkit.text.spacy.displacy_utils import medkit_doc_to_displacy

colors = {'weight': "#85C1E9", "number": "#ff6961", "drug":"#ff9900", "pheno" : "#10ad10"}
options = {"ents": ['weight', 'number','drug', 'pheno'], "colors": colors}

displacy_data = medkit_doc_to_displacy(doc)
displacy.render(displacy_data, manual=True, style="ent", options=options)

#### Exercise:

Modify the code above to also capture synonyms of phenotypes.

## Extracting entities using neural networks (transformers)

We will use a model from the Hugging Face Hub (https://huggingface.co/d4data/biomedical-ner-all).

In [34]:
from medkit.text.ner.hf_entity_matcher import HFEntityMatcher

matcher = HFEntityMatcher(model="d4data/biomedical-ner-all")
# https://huggingface.co/d4data/biomedical-ner-all

In [36]:
# detect entities in the raw segment
detected_entities = matcher.run([doc.raw_segment]) 
for entity in detected_entities:
    print(f"text={entity.text!r}, spans={entity.spans}, label={entity.label}")

text='23-year-old', spans=[Span(start=19, end=30)], label=Age
text='white', spans=[Span(start=31, end=36)], label=Personal_background
text='female', spans=[Span(start=37, end=43)], label=Sex
text='presents', spans=[Span(start=44, end=52)], label=Clinical_event
text='allergies', spans=[Span(start=100, end=109)], label=Disease_disorder
text='Seattle', spans=[Span(start=128, end=135)], label=Nonbiological_location
text='summer', spans=[Span(start=337, end=343)], label=Duration
text='two', spans=[Span(start=373, end=376)], label=Duration
text='over', spans=[Span(start=447, end=451)], label=Detailed_description
text='counter', spans=[Span(start=456, end=463)], label=Detailed_description
text='asthma', spans=[Span(start=520, end=526)], label=History
text='medication', spans=[Span(start=555, end=565)], label=Medication
text='Ortho Tri-Cyclen', spans=[Span(start=660, end=676)], label=Medication
text='Vitals', spans=[Span(start=755, end=761)], label=Diagnostic_procedure
text='Weight', spans=[Sp

Let's visualize only the signs and symptoms (label Sign_symptom), together with the biological structures:

In [37]:
for entity in detected_entities:
    if entity.label=="Sign_symptom" or entity.label=="Biological_structure":
        doc.anns.add(entity)
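
When more labels are involved, chaining `or` comparisons becomes hard to read. A possible alternative, sketched below, is to test membership in a set of labels. `FakeEntity` is a hypothetical stand-in for medkit entities that only carries the two fields used here; the texts and labels are taken from the model output shown above.

```python
from collections import namedtuple

# Hypothetical stand-in for medkit entities: only the fields used here.
FakeEntity = namedtuple("FakeEntity", ["text", "label"])

detected = [
    FakeEntity("female", "Sex"),
    FakeEntity("Seattle", "Nonbiological_location"),
    FakeEntity("asthma", "History"),
    FakeEntity("Ortho Tri-Cyclen", "Medication"),
]

# Labels we want to keep; extend this set rather than adding "or" clauses
wanted_labels = {"History", "Medication"}

selected = [e for e in detected if e.label in wanted_labels]
for e in selected:
    print(f"text={e.text!r}, label={e.label}")
```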

In [38]:
from spacy import displacy
from medkit.text.spacy.displacy_utils import medkit_doc_to_displacy

colors = {'weight': "#85C1E9", "number": "#ff6961", "drug":"#ff9900", 
          "pheno" : "#10ad10", "Sign_symptom" : "#0281fa",
         "Biological_structure" : "#dcdcdc"}
options = {"ents": ['weight', 'number','drug', 'pheno','Sign_symptom','Biological_structure'], "colors": colors}

displacy_data = medkit_doc_to_displacy(doc)
displacy.render(displacy_data, manual=True, style="ent", options=options)

#### Exercise:

Add the entities of the class Sex, History and Disease_disorder as annotations to the document, and print them.

# Detecting the context

Resetting the document:

In [None]:
from medkit.core.text import TextDocument

doc = TextDocument(text=docs.iloc[0]['transcription'])
print(doc.text)

First, we split the text into sentences.

In [None]:
from medkit.text.segmentation import SentenceTokenizer

sentence_tokenizer = SentenceTokenizer(
    output_label="sentence",
    keep_punct=True,
    split_on_newlines=True,
)

# Run the sentence tokenizer on the raw segment,
# i.e. on the full text of the document
sentence_segs = sentence_tokenizer.run([doc.raw_segment])

for sentence_seg in sentence_segs:
    print("Sentence: ",sentence_seg.text, end="\n\n")

We now look for markers of family history and negation in the sentences.

In [None]:
from medkit.text.context import FamilyDetector, FamilyDetectorRule

family_rule_1 = FamilyDetectorRule(
    # Pattern to search inside each input segment.
    # If the pattern is found, the segment will be flagged
    # as being related to family history
    regexp=r"\bfamily\b",
    # Optional exclusions patterns: if found,
    # the segment won't be flagged
    # (Exclusion regexps are also supported for RegexpMatcher)
    exclusion_regexps=[r"\bwith (his|her) family\b"],
    # The regexp will be used with a case-insensitivity flag
    case_sensitive=False,
    # Special chars in the input text will be converted
    # to equivalent ASCII chars before running the regexp on it
    unicode_sensitive=False,
)

family_rule_2 = FamilyDetectorRule(
    regexp=r"\bfamilial\s+history\b",
    case_sensitive=False,
    unicode_sensitive=False,
)
family_rule_3 = FamilyDetectorRule(
    regexp=r"father|mother|brother|sister|cousin|uncle|aunt",
    case_sensitive=False,
    unicode_sensitive=False,
)


family_detector = FamilyDetector(rules=[family_rule_1, family_rule_2, family_rule_3], output_label="family")
# The family detector doesn't return anything but instead adds an attribute to each
# segment with a boolean value indicating if description of family history was detected or not
family_detector.run(sentence_segs)

# Print sentences detected as being related to family history
for sentence_seg in sentence_segs:
    # Retrieve the attribute created by the family detector
    family_attr = sentence_seg.attrs.get(label="family")[0]
    # Only print sentences about family history
    if family_attr.value:
        print(sentence_seg.text)

In [None]:
from medkit.text.segmentation import SyntagmaTokenizer

# Here we will use the default settings of SyntagmaTokenizer,
# but you can specify your own separator patterns
syntagma_tokenizer = SyntagmaTokenizer(
    output_label="syntagma",
    # We want to keep the family history information
    # at the syntagma level
    attrs_to_copy=["family"],
)
# The syntagma tokenizer expects sentence segments as input
syntagma_segs = syntagma_tokenizer.run(sentence_segs)

for syntagma_seg in syntagma_segs:
    print(syntagma_seg.text)

In [None]:
from medkit.text.context import NegationDetector, NegationDetectorRule

negation_rule_1 = NegationDetectorRule(
    regexp=r"absence|\bno\b|\bnot\b|\bnormal\b",
    case_sensitive=False,
    unicode_sensitive=False,
)

# NegationDetectorRule objects have the same structure as FamilyDetectorRule
# Here we use the single custom rule defined above
negation_detector = NegationDetector(rules=[negation_rule_1], output_label="negation")
negation_detector.run(syntagma_segs)

# Display negated syntagmas
for syntagma_seg in syntagma_segs:
    negation_attr = syntagma_seg.attrs.get(label="negation")[0]
    if negation_attr.value:
        print(syntagma_seg.text)
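
To get a feel for what the negation rule above flags, the following sketch applies the same regular expression with Python's standard `re` module. The syntagmas are hypothetical examples written for this illustration (some borrow phrases from the sample document), and the real `NegationDetector` stores its result as an attribute on each segment instead of returning booleans.

```python
import re

# Same pattern as negation_rule_1, with the case-insensitive flag
negation_pattern = re.compile(r"absence|\bno\b|\bnot\b|\bnormal\b", flags=re.IGNORECASE)

# Hypothetical syntagmas used only for this illustration
syntagmas = [
    "She has no known medicine allergies",
    "She does have asthma",
    "It does not appear to be working very well",
    "Heart sounds were normal",
    "Abnormal breath sounds",
]

flags = [(s, bool(negation_pattern.search(s))) for s in syntagmas]
for text, is_negated in flags:
    print(f"negation={is_negated!s:5} | {text}")
```

Two details are worth noticing: the word boundaries keep "known" and "Abnormal" from matching, and treating "normal" as a negation marker flags the whole syntagma, which may or may not be what you want for a given entity.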

Once the context has been detected, we can search for entities.

In [None]:
import pandas as pd

pheno_terms = pd.read_csv("/home/ressources/PBL/hp_terms.txt", sep="\t",on_bad_lines='skip', )
print(pheno_terms.iloc[0])
pheno_syn_terms = pd.read_csv("/home/ressources/PBL/hp_synonyms.txt", sep="\t",on_bad_lines='skip', )
print(pheno_syn_terms.iloc[0])

In [None]:
pheno_keywords_list=[]
max_index = 0
# Creation of the list of terms to be searched
for i in range(0, len(pheno_terms)):
    pheno_keywords_list.append(MedkitKeyword(label=pheno_terms.at[i, 'Term'], kb_id=i, kb_name="HPO", ent_label='pheno'))
    max_index = i

pheno_matcher = Matcher.build(
    keywords=pheno_keywords_list,
    spellwise=[dict(measure=ESpellWiseAlgo.LEVENSHTEIN,max_distance=1,min_nb_char=10,)],
    stopwords=["and"],
    w=2,
)

In [None]:
from iamsystem import Matcher
from iamsystem import ESpellWiseAlgo

from medkit.text.ner import RegexpMatcher, RegexpMatcherRule
from medkit.text.ner.iamsystem_matcher import IAMSystemMatcher
from medkit.text.ner.iamsystem_matcher import MedkitKeyword

iam_matcher_pheno = IAMSystemMatcher(matcher = pheno_matcher, attrs_to_copy = ["negation", "family"])
entities = iam_matcher_pheno.run(syntagma_segs)
for entity in entities:
    print(f"text={entity.text!r}, spans={entity.spans}, label={entity.label}")
    family_attr = entity.attrs.get(label="family")[0]
    print("family:", family_attr.value)
    negation_attr = entity.attrs.get(label="negation")[0]
    print("negation:", negation_attr.value)

In [None]:
for entity in entities:
    doc.anns.add(entity)

In [None]:
from spacy import displacy
from medkit.text.spacy.displacy_utils import medkit_doc_to_displacy

# Define a custom formatter that will also display some context flags
# ex: "disorder[fn]" for an entity with label "disorder" and
# family and negation attributes set to True
def _custom_formatter(entity):
    label = entity.label
    flags = []
    
    family_attr = entity.attrs.get(label="family")[0]
    if family_attr.value:
        flags.append("f")
    negation_attr = entity.attrs.get(label="negation")[0]
    if negation_attr.value:
        flags.append("n")

    if flags:
        label += "[" + "".join(flags) + "]"
    
    return label

# Pass the formatter to medkit_doc_to_displacy()
displacy_data = medkit_doc_to_displacy(doc, entity_formatter=_custom_formatter)
displacy.render(docs=displacy_data, manual=True, style="ent")

# Annotating all the documents

**Task 1**: identify the most frequent drugs

In [39]:
import pandas as pd
docs = pd.read_csv('/home/ressources/PBL/mtsamples.csv')
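
One possible way to approach Task 1 is to collect the text of every drug entity and count occurrences with `collections.Counter`. The sketch below only shows the counting step on hypothetical mentions; in the lab, the inner lists would presumably come from running the drug matcher on each document and reading `entity.text`.

```python
from collections import Counter

# Hypothetical drug mentions, one inner list per document.
mentions_per_doc = [
    ["Claritin", "Zyrtec", "Allegra", "Allegra"],
    ["zyrtec", "Nasonex"],
    ["ALLEGRA"],
]

drug_counts = Counter()
for mentions in mentions_per_doc:
    # Normalize case so that "Allegra" and "ALLEGRA" are counted together
    drug_counts.update(m.lower() for m in mentions)

# The two most frequent drugs with their counts
print(drug_counts.most_common(2))
```

Counting the normalized text is a simplification: counting by `kb_id` instead would group different spellings of the same product, at the cost of depending on the dictionary.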

# Plotting all weights

**Task 2:** Extract all weights from the corpus

In [None]:
import pandas as pd
import unicodedata
import unidecode

from medkit.text.ner import RegexpMatcher, RegexpMatcherRule
from medkit.core.text import TextDocument
from medkit.core.text import Segment, Span
# from medkit.text.preprocessing import Normalizer, NormalizerRule
from medkit.text.preprocessing import CharReplacer
from medkit.text.preprocessing import (
    CharReplacer,
    LIGATURE_RULES,
    SIGN_RULES,
    SPACE_RULES,
    DOT_RULES,
    FRACTION_RULES,
    QUOTATION_RULES,
)

characters_to_normalize = """''', 'é', 'â', '\xa0', 'à', 'è', 'ô', 'ê', 'À', 'µ', '➢', '–', '°', 'Œ', '�', '½', 'œ', 'ç', '…', 'Ï', '¼', 'ï', '²', 'É', '•', 'ë', 'û', '¾', 'ù', ':black_small_square:', 'î', '—', '«', '»', '®', '¿', 'Ô', '≥', '¤', '³', 'Í', 'Î', 'å', 'ð', 'ñ', 'ò', '÷', 'ø', 'Ó', 'Å', '‡', 'Š', '■', '⇨', ''', 'β', '\u2002', 'ü', '„', 'ì', 'Û', 'Ç', '¹', 'ž', 'ú', '□', '●', '´', '¶', 'ö', 'Þ', 'Ò', 'Æ', 'Ë', '¡', 'õ', 'ä', 'Ù', 'Ã', '¨', 'š', 'Ú', '✓', '←', 'Ê', 'α', '§', '©', '·', '¸', '×', 'Ø', 'Ì', '™', 'È', 'μ', '−', '◙', 'Δ', 'ª', 'Â', '→', '£', '¦', '≈', 'ó', 'æ', '❖', 'Ö', 'ˆ', '‰', 'Ž', '±', 'Ý', '‚', '›', '¬', 'ß', 'ý', 'þ', 'ÿ', 'Õ', 'Ñ', '¢', '⋄', '€', '˜', '¯', 'í', 'á', 'Ü', 'º', '†', '‹', '\u0600', 'ࠂ', 'ࠄ', '࠘', '࠺', '࠼', 'ࡌ', 'ࡎ', 'ࡘ', '\u086c', '\u086e', '\u0874', '\u08c8', 'ࣔ', 'ࣖ', 'ࣺ', 'ं', 'ऐ', 'ऒ', 'द', 'न', 'प', 'ऴ', 'श', 'स', 'ऺ', '॒', '॔', 'ॶ', 'ॸ', 'ॺ', 'জ', 'ল', '\ueaee', '\uead5', '\uead1', '싌', 'ꮶ', 'ꮢ', '鎶', '膫', '沓', '뚓', '屡', '퇪', '\uea5c', '呜', 'ᘎ', '써', '䥳', '洀', 'ፈ', '猄', 'ᘉ', '㸀', 'Ī', 'ᘔ', '㔀', '脈', '䩃', '\u085c', '憁', '\u1c4a', '̨', 'ᘀ', '쉨', '띱', 'ࡕ', '封', '䩡', '䡭', 'Ѐ', '䡮', '\u0875', 'ȣ', '樃', 'ࠆ', 'ᘁ', '̝', 'ᘐ', 'ᡊ', 'ᘗ', '⨾', '䌁', '尀', 'ᘓ', '⁊', '㥉', '㥊', '㨀', '㨊', '㨌', '㨔', '㨖', '㨪', '㨬', '㨮', '㨸', '㨺', '㩪', '㩬', '㩴', '㩶', '㪦', '㪨', '㫄', '㬖', '㬜', '㭆', '㭈', '㭊', '㭎', '㭐', '㭚', '㭾', '㮔', '㮴', '㮶', '㮸', '㮺', '㮼', '㯀', '㯂', '㯆', '㯈', '㯌', '㯎', '㯪', '㯮', '겾', '짞', '뻞', '뺤', '뺜', '뺍', '뺅', '群', '獷', 'ᘆ', 'ᡨ', '䘕', 'ཨ', '理', '̏', '唀', 'Ĉ', 'ᘊ', '䌀', '䉨', '츁', 'ᙊ', '愀', 'ᔜ', '票', '瀉', '㕨', '騜', '伀', '͊', '儀', '덨', '렐', '˅', 'ᔁ', '뽨', '罹', 'ᔔ', 'ᔀ', 'ɍ', '"', '"', 'Ä', 'ƒ', 'ã', 'Ÿ', '¥', '彟', 'ഠ', '\u200d', '䕒', '乁', '䵉', '呁', '佉', '⁎', '䕍', '䥄', '䅃', '䕌', '\u0d0d', '桃', '晥', '搠', '\u2065', '敓', '癲', '捩', '\u0d65', '倠', '\u2e72', '䨠', '慥', '\u2d6e', '癙', '獥', '䘠', '䝁', '乏', '吠', '泩', '烩', '潨', '敮', '㨠', '〠', '‱', '㘵', '㈳', 'റ', '摁', '潪', '湩', '獴', '牐', '\u202e', '敊', '湡', '䰭', '捵', '䐠', '䕉', '䱈', '\ue954', 
'\ue96c', '桰', '湯', '\u3130', '㔠', '‶', '㤰', '㌠', '′', '㌱', '牄', '浅', '慭', '畮', '汥', '䜠', '䕕', '佒', 'ൔ', 'ㄠ', 'ല', '䄠', '慮', '丠', '噏', '剁', 'ു', '䌠', '牨', '獩', '潴', '䅆', '卉', '൙', '㈠', 'ര', '\u2073', '敤', 'Á', '\xad', 'ࢲ', '࣪', '࣬', '࣭', '࣮', 'ऀ', 'ऍ', 'ऎ', 'ग', 'ङ', 'ड', 'ण', 'थ', 'ू', 'ॉ', '॓', 'ॗ', 'फ़', 'ॠ', '२', '३', '७', 'ॱ', 'ॳ', 'ॼ', 'ং', 'র', 'া', '\u09d9', 'ਃ', 'ਆ', 'ਇ', 'ਚ', 'ਛ', 'ਤ', 'ਥ', '\u0a31', 'ਲ', 'ਸ਼', '\u0a37', '\u0a4e', 'ધ', 'ુ', '\ue7f5', '싍', 'ꮷ', 'ꏵ', '貗', '貣', '碁', 'ᘑ', 'ɨ', '湽', '瑨', '쑞', 'ᔗ', '歨'"""

normalized_rules = []
for ch in characters_to_normalize:
    if len(ch) != len(unidecode.unidecode(ch)):
        #normalized_rules.append(NormalizerRule(ch, unidecode.unidecode(ch)))
        normalized_rules.append([ch, unidecode.unidecode(ch)])

#normalized_rules_set = [*default_rules, *normalized_rules]
rules = (
    LIGATURE_RULES
    + SIGN_RULES
    + SPACE_RULES
    + DOT_RULES
    + FRACTION_RULES
    + QUOTATION_RULES
    + normalized_rules
)

#preprocessing_op  = Normalizer(output_label="clean_segment", rules=normalized_rules_set)
# preprocessing_op  = Normalizer(output_label="clean_segment", rules=default_rules)
preprocessing_char_op = CharReplacer(output_label="clean_segment", rules=rules)

regexp_rules = [
    RegexpMatcherRule(regexp=r"### Complete the REGEX here #######", label="weight"),
    #RegexpMatcherRule(regexp=r"[0-9]+", label="number"), # change this rule to detect blood pressure (the form is usually 120/80)
    # copy this rule as a template to add rules identifying the height and the BMI
]
regexp_matcher = RegexpMatcher(rules=regexp_rules)

annotated_documents_regex = []

### Complete the code below

docs = pd.read_csv('/home/ressources/PBL/mtsamples.csv')

for i in range(0,len(docs)):
    doc = TextDocument(text=str(docs.iloc[i]['transcription']))
    annotated_documents_regex.append(doc)
    normalized = preprocessing_char_op.run([doc.raw_segment])
    entities = regexp_matcher.run(normalized)
    for entity in entities:
        doc.anns.add(entity)

Normalize all weights to kilograms (1 pound = 0.453592 kg) and store all the values in a data structure.
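
The conversion itself can be sketched as follows. This assumes that each detected weight has already been split into a numeric value and a unit string; the helper `to_kilograms` and the unit spellings it accepts are hypothetical choices for this illustration, not part of medkit.

```python
POUND_TO_KG = 0.453592  # conversion factor given above

# Unit spellings accepted by this sketch (an assumption, extend as needed)
POUND_UNITS = {"pound", "pounds", "lb", "lbs"}
KG_UNITS = {"kg", "kgs", "kilogram", "kilograms"}


def to_kilograms(value: float, unit: str) -> float:
    """Convert a weight to kilograms, given its value and unit."""
    unit = unit.strip().lower().rstrip(".")
    if unit in POUND_UNITS:
        return value * POUND_TO_KG
    if unit in KG_UNITS:
        return value
    raise ValueError(f"Unknown weight unit: {unit!r}")


# Hypothetical (value, unit) pairs, as they could be parsed from entity texts
raw_weights = [(130, "pounds"), (72.5, "kg"), (200, "lbs.")]
weights_kg = [to_kilograms(value, unit) for value, unit in raw_weights]
print([round(w, 1) for w in weights_kg])
```

A plain list of floats such as `weights_kg` is enough for plotting a histogram afterwards.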

Plot the graph of all detected weights