# How to use Laundromat

#### Import the SpacyModel class. This is the main class that controls Laundromat.

In [1]:
from laundromat.spacy.spacy_model import SpacyModel

#### Instantiate your model. The class takes a path to an existing model as an argument. If no argument is given, the class defaults to the nb_core_news_lg SpaCy model.

In [2]:
nlp = SpacyModel() 

#### Consider the following text

In [3]:
text = ("Hei, mitt navn er Ola Nordmann. Jeg er 20 år og jobber 50 prosent "
"stilling i Arbeidsgiver AS i Norge. Du kan nå meg på 99999999.")

#### There are a variety of functions that can be applied depending on your needs. Keep in mind that the predict() function returns the start and end position in the form of token index, i.e. which token(s) in the text are part of the entity

In [4]:
#returns [entity text, entity label, start token index, end token index]
entities = nlp.predict(text)
print(entities)

[['Ola Nordmann', 'PER', 5, 7], ['Arbeidsgiver AS', 'ORG', 18, 20], ['Norge', 'LOC', 21, 22]]
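The token indices in this output can be approximated without loading a model. The sketch below is an illustration only: it uses a simple regular expression as a rough stand-in for spaCy's tokenizer (an assumption, since the real tokenizer has many more rules), and it treats the end index as exclusive, which is what the printed output suggests. The helper `span_text` is hypothetical and not part of Laundromat.

```python
import re

text = ("Hei, mitt navn er Ola Nordmann. Jeg er 20 år og jobber 50 prosent "
        "stilling i Arbeidsgiver AS i Norge. Du kan nå meg på 99999999.")

# Rough approximation of tokenization: runs of word characters,
# or single punctuation marks. This is NOT spaCy's actual tokenizer.
tokens = re.findall(r"\w+|[^\w\s]", text)

def span_text(tokens, start, end):
    """Join the tokens in the half-open range [start, end) into a string."""
    return " ".join(tokens[start:end])

# Look up the spans reported by predict() above.
print(span_text(tokens, 5, 7))
print(span_text(tokens, 18, 20))
print(span_text(tokens, 21, 22))
```

Under these assumptions, each span recovers the entity text from the corresponding entry in the output.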


#### If you want the predictions rendered more readably and do not need a return value, use the display() function. It highlights the predicted entities in the given text.

In [5]:
nlp.display(text)

In [6]:
nlp.count(text)

{'PER': 1,
 'FNR': 0,
 'DTM': 0,
 'TLF': 0,
 'AMOUNT': 0,
 'LOC': 1,
 'CREDIT_CARD': 0,
 'ORG': 1}

#### To replace the found entities, use the replace() function, which returns the replaced text. The replacement argument determines which kind of replacement is performed; accepted values are entity, character, pad, and shuffle, and the default is entity. The replacement_char argument sets the character used for character or pad replacement.

In [7]:
replaced_text_1 = nlp.replace(text, replacement="entity", replacement_char=":^)")
print(replaced_text_1)

Hei, mitt navn er <PER>. Jeg er 20 år og jobber 50 prosent stilling i <ORG> i <LOC>. Du kan nå meg på 99999999.


In [8]:
replaced_text_2 = nlp.replace(text, replacement="pad", replacement_char=":^)")
print(replaced_text_2)

Hei, mitt navn er :^):^):^):^):^):^):^):^):^):^):^):^). Jeg er 20 år og jobber 50 prosent stilling i :^):^):^):^):^):^):^):^):^):^):^):^):^):^):^) i :^):^):^):^):^). Du kan nå meg på 99999999.
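Judging from this output, pad replacement repeats replacement_char once for every character of the entity (12 times for "Ola Nordmann", 15 for "Arbeidsgiver AS", 5 for "Norge"). The following is a minimal sketch of that behaviour in plain Python. The function `pad_entities` is hypothetical and is not Laundromat's implementation; it takes the entity strings directly instead of running a model.

```python
def pad_entities(text, entity_texts, pad_char):
    """Replace each entity string with pad_char repeated once per character."""
    for entity in entity_texts:
        text = text.replace(entity, pad_char * len(entity))
    return text

text = ("Hei, mitt navn er Ola Nordmann. Jeg er 20 år og jobber 50 prosent "
        "stilling i Arbeidsgiver AS i Norge. Du kan nå meg på 99999999.")

# Entity strings taken from the predict() output for this text.
padded = pad_entities(text, ["Ola Nordmann", "Arbeidsgiver AS", "Norge"], ":^)")
print(padded)
```

Because the pad string here is three characters long, the padded text is longer than the original; a single-character replacement_char would preserve the length.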


In [9]:
replaced_text_3 = nlp.replace(text)
print(replaced_text_3)

Hei, mitt navn er <PER>. Jeg er 20 år og jobber 50 prosent stilling i <ORG> i <LOC>. Du kan nå meg på 99999999.


#### Laundromat also contains the similarity() function which returns the cosine similarity between the unaltered text and the replaced text.

In [10]:
nlp.similarity(replaced_text_1)

0.8818055262549428

In [11]:
nlp.similarity(replaced_text_2)

0.789567684923907

In [12]:
nlp.similarity(replaced_text_3)

0.8818055262549428
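For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below computes it for plain Python lists. How Laundromat obtains the vectors for the two texts is not shown here, so the inputs are made-up vectors and `cosine_similarity` is a hypothetical helper.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # Similarity is undefined for a zero vector; return 0.0 by convention.
        return 0.0
    return dot / (norm_a * norm_b)

same = cosine_similarity([3.0, 4.0], [3.0, 4.0])        # identical direction
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])  # no overlap
print(same, orthogonal)
```

A value close to 1 means the replaced text still points in nearly the same direction as the original in vector space, which is why the pad replacement scores lower than the entity replacement above.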

#### The add_patterns() function adds the built-in regex and lookup functionality to the nlp pipeline. Observe how functionality is increased after we call on this function.

In [13]:
nlp.add_patterns()

In [14]:
nlp.predict(text)

[['Ola Nordmann', 'PER', 5, 7],
 ['50 prosent', 'AMOUNT', 14, 16],
 ['Arbeidsgiver AS', 'ORG', 18, 20],
 ['Norge', 'LOC', 21, 22],
 ['99999999.', 'TLF', 28, 29]]

In [15]:
nlp.display(text)

#### You can also disable the NER model if you only wish to use regex and lookup.

In [16]:
nlp.disable_ner()

In [17]:
nlp.display(text)

#### And enable it again

In [18]:
nlp.enable_ner()

In [19]:
nlp.display(text)

In [20]:
#Laundromat comes with list lookup disabled by default. To enable it,
# call add_patterns with the parameter lookup = True
nlp = SpacyModel()
nlp.add_patterns(lookup=True)

#### Now we come to a core feature of Laundromat: the ability to easily add custom regular expressions or lookup lists by calling a function.

In [21]:
#First let us consider RegExes. You can print the current regex_labels by calling
# .print_regex_labels()
nlp.print_regex_labels()

FNR
CREDIT_CARD
TLF
DTM
AMOUNT


In [22]:
#Next let us add a regex for detecting all lower case four letter words.
#You can do this by calling the .new_regex() function.
nlp.new_regex(pattern=r"([a-z]{4})", context="Four letter words", label="FLW")

In [23]:
#Now we can see that it is part of the list of RegExes.
nlp.print_regex_labels()

FNR
CREDIT_CARD
TLF
DTM
AMOUNT
FLW


In [24]:
#We see that our new regex correctly identifies the two lower case four letter words.
nlp.predict(text)

[['mitt', 'FLW', 2, 3],
 ['navn', 'FLW', 3, 4],
 ['Ola Nordmann', 'PER', 5, 7],
 ['50 prosent', 'AMOUNT', 14, 16],
 ['Arbeidsgiver AS', 'ORG', 18, 20],
 ['Norge', 'LOC', 21, 22],
 ['99999999.', 'TLF', 28, 29]]
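Only whole words were tagged in this output, even though the pattern `[a-z]{4}` on its own can match inside longer words. The sketch below uses Python's standard `re` module to show the difference between an unanchored search and one restricted to whole words. It is an illustration only; how Laundromat applies the pattern internally (for example, per token) is an assumption and is not confirmed here.

```python
import re

text = ("Hei, mitt navn er Ola Nordmann. Jeg er 20 år og jobber 50 prosent "
        "stilling i Arbeidsgiver AS i Norge. Du kan nå meg på 99999999.")

# Unanchored: also matches four-letter runs inside longer words.
unanchored = re.findall(r"[a-z]{4}", text)

# Anchored with word boundaries: only standalone four-letter words match.
whole_words = re.findall(r"\b[a-z]{4}\b", text)

print(unanchored)
print(whole_words)
```

If you apply a pattern like this to raw strings outside Laundromat, adding `\b` on both sides avoids fragments such as the "ordm" in "Nordmann".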

In [25]:
#Now to lookup lists. These lookup lists are stored in a list of tuples in the form of 
# (lookup list path, entity label). Only the first column of these lists will be considered.
#To print the current lookup lists use the .print_matcher_lists() function.
nlp.print_matcher_lists()

Path:  land.csv Entity tag:  LOC
Path:  etternavn_ssb.csv Entity tag:  PER
Path:  guttefornavn_ssb.csv Entity tag:  PER
Path:  jentefornavn_ssb.csv Entity tag:  PER


In [26]:
#To add a lookup list use the .add_list() function.
nlp.add_list(("Some random", "RANDOM"))

In [27]:
#As we can see, our new list was added
nlp.print_matcher_lists()

Path:  land.csv Entity tag:  LOC
Path:  etternavn_ssb.csv Entity tag:  PER
Path:  guttefornavn_ssb.csv Entity tag:  PER
Path:  jentefornavn_ssb.csv Entity tag:  PER
Path:  Some random Entity tag:  RANDOM


#### Now let us explore how to train a model. We shall train a new entity type, but to strengthen performance on already existing entities one just has to provide data with labels corresponding to existing entities. The following "dataset" was copied from SpaCy's website on training new entities. Beware that certain functionality only works for the entity types defined in the code, and so the following example does not inclu

In [28]:
#Note that the label positions are given as character offsets into the string.
#SpaCy works with both character offsets and token indices (a token's position in the text),
# which can be confusing at times.
nlp = SpacyModel()
LABEL = "ANIMAL"

TRAIN_DATA = [
    (
        "Horses are too tall and they pretend to care about your feelings",
        {"entities": [(0, 6, LABEL)]},
    ),
    ("Do they bite?", {"entities": []}),
    (
        "horses are too tall and they pretend to care about your feelings",
        {"entities": [(0, 6, LABEL)]},
    ),
    ("horses pretend to care about your feelings", {"entities": [(0, 6, LABEL)]}),
    (
        "they pretend to care about your feelings, those horses",
        {"entities": [(48, 54, LABEL)]},
    ),
    ("horses?", {"entities": [(0, 6, LABEL)]}),
]
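Because the annotations use character offsets, a quick sanity check is to slice each text with its (start, end) pair and confirm that the slice is the intended entity. The sketch below does this for two of the examples above; `annotated_spans` is a hypothetical helper, not part of Laundromat.

```python
LABEL = "ANIMAL"

# Two of the training examples, copied from the data above.
examples = [
    (
        "Horses are too tall and they pretend to care about your feelings",
        {"entities": [(0, 6, LABEL)]},
    ),
    (
        "they pretend to care about your feelings, those horses",
        {"entities": [(48, 54, LABEL)]},
    ),
]

def annotated_spans(examples):
    """Return the substring selected by every (start, end, label) annotation."""
    return [
        text[start:end]
        for text, annotation in examples
        for start, end, _label in annotation["entities"]
    ]

spans = annotated_spans(examples)
print(spans)
```

An off-by-one offset would show up immediately here as a clipped or padded word.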

In [29]:
#We see that our model is not capable of recognising horses.
for data in TRAIN_DATA:
    print(nlp.predict(data[0]))

[]
[]
[]
[]
[['those horses', 'PER', 8, 10]]
[]


#### Keep in mind that display() only has functionality for existing entities. It will not display new entities.

#### First let us test our performance on the data. Laundromat contains three scoring functions: SpaCy's built-in F1 scorer, a confusion matrix function, and a custom scoring function. The latter two allow for a strict or lax scoring rule, and the custom scoring function prints a variety of metrics.

In [30]:
nlp.f1_scorer(TRAIN_DATA)

{'uas': 0.0,
 'las': 0.0,
 'las_per_type': {'root': {'p': 0.0, 'r': 0.0, 'f': 0.0},
  'flat:foreign': {'p': 0.0, 'r': 0.0, 'f': 0.0},
  'nsubj': {'p': 0.0, 'r': 0.0, 'f': 0.0},
  'flat:name': {'p': 0.0, 'r': 0.0, 'f': 0.0}},
 'ents_p': 0.0,
 'ents_r': 0.0,
 'ents_f': 0.0,
 'ents_per_type': {'ANIMAL': {'p': 0.0, 'r': 0.0, 'f': 0.0},
  'PER': {'p': 0.0, 'r': 0.0, 'f': 0.0}},
 'tags_acc': 0.0,
 'token_acc': 100.0,
 'textcat_score': 0.0,
 'textcats_per_cat': {}}

In [31]:
nlp.confusion_matrix(TRAIN_DATA)

| | Is Positive | Is Negative |
|---|---|---|
| Predicted Positive | 0 | 1 |
| Predicted Negative | 5 | 41 |


In [32]:
nlp.print_scores(TRAIN_DATA)

Accuracy is:  0.8723404255319149
Balanced accuracy is:  0.4880952380952381
Precision is:  0.0
Recall is:  0.0
F1 score is:  0.0
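These figures follow from the confusion matrix printed before training (0 true positives, 1 false positive, 5 false negatives, 41 true negatives) using the standard definitions. The sketch below recomputes them; `scores` is a hypothetical helper, and treating undefined ratios as 0.0 is an assumption about how the zero-division cases are handled.

```python
def scores(tp, fp, fn, tn):
    """Compute common classification metrics from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    recall = tp / (tp + fn) if (tp + fn) else 0.0       # true positive rate
    specificity = tn / (tn + fp) if (tn + fp) else 0.0  # true negative rate
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    balanced_accuracy = (recall + specificity) / 2
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "accuracy": accuracy,
        "balanced_accuracy": balanced_accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

result = scores(tp=0, fp=1, fn=5, tn=41)
print(result)
```

The high accuracy alongside zero recall shows why balanced accuracy and F1 are more informative here: almost all tokens are negatives, so predicting nothing is still right most of the time.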


#### Now let us train on the available data. The train() function accepts both number of training iterations and output directory as arguments, but these default to 30 and None (which means the model is not saved) respectively.

In [33]:
nlp.train(TRAIN_DATA, labels=[LABEL], n_iter=35)

Losses {'ner': 43.7634323188446}
Losses {'ner': 37.36912397686882}
Losses {'ner': 29.574495344797867}
Losses {'ner': 35.95390339749133}
Losses {'ner': 25.025167298056203}
Losses {'ner': 36.8087579193525}
Losses {'ner': 23.677375968293404}
Losses {'ner': 29.86252936234814}
Losses {'ner': 32.49622757605357}
Losses {'ner': 28.53937743109418}
Losses {'ner': 29.734447697010182}
Losses {'ner': 29.19986493446595}
Losses {'ner': 28.927347520426615}
Losses {'ner': 31.028226506207602}
Losses {'ner': 27.96554018043389}
Losses {'ner': 29.021093687762914}
Losses {'ner': 19.464889489364396}
Losses {'ner': 24.685661644959964}
Losses {'ner': 28.09824222815655}
Losses {'ner': 30.978461074040162}
Losses {'ner': 21.962706289265558}
Losses {'ner': 25.314817140913256}
Losses {'ner': 27.289513138375696}
Losses {'ner': 31.48207950840373}
Losses {'ner': 19.897172549979036}
Losses {'ner': 28.3832769321466}
Losses {'ner': 26.69563136419244}
Losses {'ner': 29.50139485899109}
Losses {'ner': 24.454738525967507}
Lo

In [34]:
nlp.f1_scorer(TRAIN_DATA)

{'uas': 0.0,
 'las': 0.0,
 'las_per_type': {'root': {'p': 0.0, 'r': 0.0, 'f': 0.0},
  'flat:foreign': {'p': 0.0, 'r': 0.0, 'f': 0.0},
  'flat:name': {'p': 0.0, 'r': 0.0, 'f': 0.0},
  'nsubj': {'p': 0.0, 'r': 0.0, 'f': 0.0}},
 'ents_p': 100.0,
 'ents_r': 100.0,
 'ents_f': 100.0,
 'ents_per_type': {'ANIMAL': {'p': 100.0, 'r': 100.0, 'f': 100.0}},
 'tags_acc': 0.0,
 'token_acc': 100.0,
 'textcat_score': 0.0,
 'textcats_per_cat': {}}

In [35]:
nlp.confusion_matrix(TRAIN_DATA)

| | Is Positive | Is Negative |
|---|---|---|
| Predicted Positive | 5 | 0 |
| Predicted Negative | 0 | 42 |


In [36]:
nlp.print_scores(TRAIN_DATA)

Accuracy is:  1.0
Balanced accuracy is:  1.0
Precision is:  1.0
Recall is:  1.0
F1 score is:  1.0


In [37]:
for data in TRAIN_DATA:
    print(nlp.predict(data[0]))

[['horses', 'ANIMAL', 9, 10]]
[['horses', 'ANIMAL', 0, 1]]
[]
[['Horses', 'ANIMAL', 0, 1]]
[['horses', 'ANIMAL', 0, 1]]
[['horses', 'ANIMAL', 0, 1]]


#### We observe a strong improvement in performance. Let us now test it on a new sentence.

In [52]:
horse_text = "Horse"

In [53]:
nlp.predict(horse_text)

[['Horse', 'ANIMAL', 0, 1]]