# How to use Laundromat

#### Import the SpacyModel class. This is the main class that controls Laundromat.

In [1]:
from laundromat.spacy.spacy_model import SpacyModel

#### Instantiate your model. The class takes a path to an existing model as an argument. If no argument is given, the class defaults to the nb_core_news_lg spaCy model.

In [2]:
nlp = SpacyModel() 

#### Consider the following text

In [3]:
text = ("Mitt navn er Ola Nordmann. Jeg er 20 år og jobber 50 prosent "
"stilling i Arbeidsgiver AS i Norge. Du kan nå meg på 99999999.")

#### Several functions are available depending on your needs. Keep in mind that the predict() function returns the start and end positions as token indices, i.e. which token(s) in the text are part of the entity. The end index is exclusive.

In [4]:
#returns [entity text, entity label, start token index, end token index]
entities = nlp.predict(text)
print(entities)

[['Ola Nordmann', 'PER', 3, 5], ['Arbeidsgiver AS', 'ORG', 16, 18], ['Norge', 'LOC', 19, 20]]
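To make the token indices concrete, here is a minimal sketch that maps the token spans printed above back to character offsets. The regex tokenizer is a hypothetical stand-in that happens to split this sentence the same way the output suggests; it is not the tokenizer Laundromat or spaCy uses, and `token_span_to_chars` is an illustrative helper, not part of the library.

```python
import re

text = ("Mitt navn er Ola Nordmann. Jeg er 20 år og jobber 50 prosent "
        "stilling i Arbeidsgiver AS i Norge. Du kan nå meg på 99999999.")

# Hypothetical stand-in tokenizer: words and single punctuation marks,
# each stored with its character start and end offset.
tokens = [(m.group(), m.start(), m.end())
          for m in re.finditer(r"\w+|[^\w\s]", text)]

def token_span_to_chars(tokens, start, end):
    """Map a half-open token span [start, end) to character offsets."""
    return tokens[start][1], tokens[end - 1][2]

# Entities copied from the predict() output above.
entities = [['Ola Nordmann', 'PER', 3, 5],
            ['Arbeidsgiver AS', 'ORG', 16, 18],
            ['Norge', 'LOC', 19, 20]]

for ent_text, label, start, end in entities:
    char_start, char_end = token_span_to_chars(tokens, start, end)
    print(label, (char_start, char_end), text[char_start:char_end])
```

Slicing the text with the resulting character offsets should give back the entity text, which is a quick way to check which kind of index you are holding.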


#### If you want a prettier view of the predictions and do not need a return value, you can use the display() function. It highlights the predicted entities in the given text.

In [5]:
nlp.display(text)

In [6]:
nlp.count(text)

{'PER': 1,
 'FNR': 0,
 'DTM': 0,
 'TLF': 0,
 'AMOUNT': 0,
 'LOC': 1,
 'CREDIT_CARD': 0,
 'ORG': 1}
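The count() output above reports every known label, including those with zero hits. A minimal sketch of that kind of tally, assuming entities in the list format returned by predict(); `count_labels` is a hypothetical helper and not Laundromat's implementation.

```python
# Label set copied from the count() output above.
LABELS = ["PER", "FNR", "DTM", "TLF", "AMOUNT", "LOC", "CREDIT_CARD", "ORG"]

def count_labels(entities):
    """Tally entities per label, keeping labels with zero hits."""
    counts = {label: 0 for label in LABELS}
    for _text, label, _start, _end in entities:
        counts[label] = counts.get(label, 0) + 1
    return counts

# Entities copied from the predict() output earlier in this notebook.
entities = [['Ola Nordmann', 'PER', 3, 5],
            ['Arbeidsgiver AS', 'ORG', 16, 18],
            ['Norge', 'LOC', 19, 20]]

counts = count_labels(entities)
print(counts)
```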

#### If you want to replace the found entities with something, you can use the replace() function. It returns the replaced text. The replacement argument determines which kind of replacement is performed. Accepted values are entity, character, pad, and shuffle; the default is entity. You can also supply the character to be used for character or pad replacement with the replacement_char argument. It is ignored for entity replacement, as the first example shows.

In [7]:
replaced_text_1 = nlp.replace(text, replacement="entity", replacement_char=":^)")
print(replaced_text_1)

Mitt navn er <PER>. Jeg er 20 år og jobber 50 prosent stilling i <ORG> i <LOC>. Du kan nå meg på 99999999.


In [8]:
replaced_text_2 = nlp.replace(text, replacement="pad", replacement_char=":^)")
print(replaced_text_2)

Mitt navn er :^):^):^):^):^):^):^):^):^):^):^):^). Jeg er 20 år og jobber 50 prosent stilling i :^):^):^):^):^):^):^):^):^):^):^):^):^):^):^) i :^):^):^):^):^). Du kan nå meg på 99999999.


In [9]:
replaced_text_3 = nlp.replace(text)
print(replaced_text_3)

Mitt navn er <PER>. Jeg er 20 år og jobber 50 prosent stilling i <ORG> i <LOC>. Du kan nå meg på 99999999.
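Judging from the outputs above, entity replacement substitutes a `<LABEL>` tag, while pad replacement repeats replacement_char once per character of the entity. The sketch below reproduces just those two behaviours on character spans. It is an illustration under that assumption, not Laundromat's code; `replace_spans` is a hypothetical helper, and the character and shuffle modes are left out because their output is not shown here.

```python
def replace_spans(text, spans, mode="entity", replacement_char="*"):
    """Replace non-overlapping (start_char, end_char, label) spans in text."""
    pieces = []
    cursor = 0
    for start, end, label in sorted(spans):
        pieces.append(text[cursor:start])  # untouched text before the span
        if mode == "entity":
            pieces.append(f"<{label}>")
        elif mode == "pad":
            # One copy of replacement_char per replaced character.
            pieces.append(replacement_char * (end - start))
        else:
            raise ValueError(f"mode not covered by this sketch: {mode}")
        cursor = end
    pieces.append(text[cursor:])  # remaining text after the last span
    return "".join(pieces)

sample = "Mitt navn er Ola Nordmann."
spans = [(13, 25, "PER")]  # character offsets of "Ola Nordmann"

as_entity = replace_spans(sample, spans)
as_pad = replace_spans(sample, spans, mode="pad", replacement_char=":^)")
print(as_entity)
print(as_pad)
```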


#### Laundromat also contains the similarity() function, which returns the cosine similarity between the unaltered text and the replaced text.

In [10]:
nlp.similarity(replaced_text_1)

0.8880513862428588

In [11]:
nlp.similarity(replaced_text_2)

0.7791217588867314

In [12]:
nlp.similarity(replaced_text_3)

0.8880513862428588
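For readers unfamiliar with cosine similarity, here is a minimal sketch of the calculation. It uses plain word counts as vectors, which is an assumption made to keep the example self-contained. Laundromat's similarity() presumably works on the model's vectors, so the numbers will not match the outputs above.

```python
import math
import re
from collections import Counter

def bag_of_words(text):
    """Turn a text into a word-count vector (a stand-in for model vectors)."""
    return Counter(re.findall(r"\w+", text.lower()))

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

original = "Mitt navn er Ola Nordmann"
replaced = "Mitt navn er <PER>"

score = cosine_similarity(bag_of_words(original), bag_of_words(replaced))
print(score)
```

The more of the original wording a replacement strategy keeps, the closer the score stays to 1.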

#### The add_patterns() function adds the built-in regex and lookup functionality to the nlp pipeline. Observe how the functionality increases after we call this function.

In [13]:
nlp.add_patterns()

[('land.csv', 'LOC'), ('etternavn_ssb.csv', 'PER'), ('guttefornavn_ssb.csv', 'PER'), ('jentefornavn_ssb.csv', 'PER')]
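To illustrate what regex and lookup matching means, here is a minimal sketch. The phone-number pattern and the tiny lookup table are hypothetical stand-ins for the built-in patterns and the CSV files listed above; `find_pattern_entities` is not a Laundromat function, and it reports character offsets rather than token indices.

```python
import re

# Hypothetical pattern: an eight-digit number treated as a phone number (TLF).
PHONE_RE = re.compile(r"\b\d{8}\b")

# Hypothetical lookup table standing in for the name and country CSV files.
NAME_LOOKUP = {"Ola": "PER", "Nordmann": "PER", "Norge": "LOC"}

def find_pattern_entities(text):
    """Return (matched text, label, start_char, end_char) tuples."""
    found = []
    for match in PHONE_RE.finditer(text):
        found.append((match.group(), "TLF", match.start(), match.end()))
    for match in re.finditer(r"\w+", text):
        label = NAME_LOOKUP.get(match.group())
        if label is not None:
            found.append((match.group(), label, match.start(), match.end()))
    return sorted(found, key=lambda entity: entity[2])

sample = "Ola Nordmann bor i Norge. Du kan nå meg på 99999999."
matches = find_pattern_entities(sample)
for entity in matches:
    print(entity)
```

Rules like these catch entities with a fixed shape or a known vocabulary, which a statistical NER model can miss.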


In [14]:
nlp.predict(text)

NameError: name 'overlap' is not defined

In [None]:
nlp.display(text)

#### You can also disable the NER model if you only wish to use regex and lookup.

In [None]:
nlp.disable_ner()

In [None]:
nlp.display(text)

#### And enable it again

In [None]:
nlp.enable_ner()

In [None]:
nlp.display(text)

#### Now let us explore how to train a model. We shall train a new entity type; to strengthen performance on existing entities, simply provide data labelled with those existing entity types. The following "dataset" was copied from spaCy's website on training new entities. Beware that certain functionality only works for the entity types defined in the code, and so the following example does not inclu

In [None]:
#Note that the label positions are given as character offsets (string indices).
#spaCy works with both character offsets and token indices, i.e. which token in the text it is,
# which can be confusing at times.
LABEL = "ANIMAL"

TRAIN_DATA = [
    (
        "Horses are too tall and they pretend to care about your feelings",
        {"entities": [(0, 6, LABEL)]},
    ),
    ("Do they bite?", {"entities": []}),
    (
        "horses are too tall and they pretend to care about your feelings",
        {"entities": [(0, 6, LABEL)]},
    ),
    ("horses pretend to care about your feelings", {"entities": [(0, 6, LABEL)]}),
    (
        "they pretend to care about your feelings, those horses",
        {"entities": [(48, 54, LABEL)]},
    ),
    ("horses?", {"entities": [(0, 6, LABEL)]}),
]
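Because the annotations use character offsets, it is worth checking that each span slices out the text you intended before training. A minimal sketch using two of the examples above; `annotated_spans` is a hypothetical helper.

```python
LABEL = "ANIMAL"

# Two examples copied from TRAIN_DATA above.
examples = [
    (
        "Horses are too tall and they pretend to care about your feelings",
        {"entities": [(0, 6, LABEL)]},
    ),
    (
        "they pretend to care about your feelings, those horses",
        {"entities": [(48, 54, LABEL)]},
    ),
]

def annotated_spans(example):
    """Return the (text slice, label) pairs an example's offsets point at."""
    text, annotations = example
    return [(text[start:end], label)
            for start, end, label in annotations["entities"]]

for example in examples:
    print(annotated_spans(example))
```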

In [None]:
#We see that our model is not capable of recognising horses.
for data in TRAIN_DATA:
    print(nlp.predict(data[0]))

#### Keep in mind that display() only has functionality for existing entities. It will not display new entities.

#### First, let us test our performance on the data. Laundromat contains three scoring functions: spaCy's built-in F1 scorer, a confusion matrix function, and a custom scoring function. The latter two allow for a strict or lax scoring rule, and the custom scoring function prints a variety of metrics.

In [None]:
nlp.f1_scorer(TRAIN_DATA)

In [None]:
nlp.confusion_matrix(TRAIN_DATA)

In [None]:
nlp.print_scores(TRAIN_DATA)
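The difference between a strict and a lax scoring rule can be sketched as follows. This assumes that strict means exact boundaries plus the same label, and lax means any overlap plus the same label. That reading is an assumption; Laundromat's own definitions may differ, and `score_spans` is a hypothetical helper.

```python
def score_spans(gold, predicted, strict=True):
    """Precision, recall and F1 for (start, end, label) spans."""
    def matches(g, p):
        if g[2] != p[2]:
            return False
        if strict:
            return g[0] == p[0] and g[1] == p[1]  # exact boundaries
        return g[0] < p[1] and p[0] < g[1]        # any overlap

    correct_pred = sum(any(matches(g, p) for g in gold) for p in predicted)
    found_gold = sum(any(matches(g, p) for p in predicted) for g in gold)
    precision = correct_pred / len(predicted) if predicted else 0.0
    recall = found_gold / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

gold = [(0, 6, "ANIMAL")]
predicted = [(0, 5, "ANIMAL")]  # boundary is off by one character

strict_scores = score_spans(gold, predicted, strict=True)
lax_scores = score_spans(gold, predicted, strict=False)
print(strict_scores)
print(lax_scores)
```

A prediction with a slightly wrong boundary counts as a miss under the strict rule and as a hit under the lax rule, which is why the two can give very different pictures of the same model.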

#### Now let us train on the available data. The train() function accepts the number of training iterations and an output directory as arguments; these default to 30 and None (which means the model is not saved), respectively.

In [None]:
nlp.train(TRAIN_DATA, labels=[LABEL], n_iter=35)

In [None]:
nlp.f1_scorer(TRAIN_DATA)

In [None]:
nlp.confusion_matrix(TRAIN_DATA)

In [None]:
nlp.print_scores(TRAIN_DATA)

In [None]:
for data in TRAIN_DATA:
    print(nlp.predict(data[0]))

#### We observe a strong improvement in performance. Let us now test it on a new sentence.

In [None]:
horse_text = "What are horses?"

In [None]:
nlp.predict(horse_text)

In [None]:
nlp.count(text)

In [None]:
nlp.count(2)