## Text categorization (using spacy)

This is getting into some of the higher-level things that spacy also provides.

Categorization by nature tends to answer a much more specific question, unlike tasks such as Named Entity Recognition,
which can do a fairly good job on general input (though they can still be refined for specific fields).

So categorization fundamentally needs training. Training needs examples. Examples of the thing you want it to answer.


As we are acting as programmers right now, we need to do the busywork:
creating those examples, formatting those examples how the software wants to see them, 
running the training, seeing whether it worked or whether we (or it) did something dumb,
perhaps fiddling with some of the knobs. 

## (Install, import, load)

In [None]:
!pip --quiet install https://github.com/scarfboy/wetsuite-dev/archive/refs/heads/main.zip           
!python3 -m spacy download en_core_web_trf

!pip3 install -U spacy ml_datasets

In [2]:
import re, random, pprint, os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'    # suppressing most warning and error output - see also spacy basics notebook
import spacy

## Using spacy

There is various software that will categorize text for you, is smart in varied ways, uses and/or provides things like word embeddings, etc.

If any were fundamentally better or nicer, or nicer for a specific task, we would choose that.

Each of them needs you to learn how to wrangle it, so this moment's choice to use spacy's categorization
is just that if you were using spacy already, it might have less of a learning curve.
(Also, it can combine well with recent methods, like spacy's use of transformers.)

That said, it can only take so much work off your hands - you still need to understand at least its general workflow.

---

...as a side note to potentially avoid some confusion: 
There were some structural changes made between spacy v2 and v3,
so be careful when you look for tutorials on the internet to copy-paste from,
because right now a lot of them detail the v2 way, which won't work in v3.

- We will be using v3's preferred method, centered on config files instead of scripts (and typically but not necessarily trained via the command line),
- the v2 way was more code-based - you ended up writing a short training script,
- and you can still do that in v3, but should only need that when doing more custom things - which we won't need here.

---

To get actually started: what wrangling do we need to do before spacy does the work for us?

So, as v3 tutorials start with, you generate a config, for a `textcat` component (it helps to know some spacy terminology already).

This is an editable text file that represents the training we will be running, including training parameters, and how to load data. 
You can tweak this, but largely won't need to.

The main thing it _doesn't_ refer to is the data - that's handed into the actual training command, alongside the config.

In [2]:
!python3 -m spacy  init config  --pipeline textcat  mytrain.cfg
# which will also mention some choices

ℹ Generated config template specific for your use case
- Language: en
- Pipeline: textcat
- Optimize for: efficiency
- Hardware: CPU
- Transformer: None
✔ Auto-filled config with all values
✔ Saved config
mytrain.cfg
You can now add your data and train your pipeline:
python -m spacy train mytrain.cfg --paths.train ./train.spacy --paths.dev ./dev.spacy


Working backwards from the above, that suggested command mentions 
* the config we just made
* `--paths.train ./train.spacy`
* `--paths.dev ./dev.spacy`

What are those? 
* Those parameter names (`paths.train`, `paths.dev`) are things that will be slotted into the configuration at runtime.
* Those values are filenames that contain the training data (more on them below)

For the inquisitive: the config asks for those parameters by name - open the .cfg you just made and look for the `[paths]` section, and for the `[corpora.train]` and `[corpora.dev]` sections that refer to it.
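
To make the "slotted in at runtime" idea concrete, here is a toy model in plain Python. It is not spaCy's implementation: `apply_override`, `lookup` and `resolve` are hypothetical helper names, and the nested dict only imitates the path-related parts of a generated config, on the assumption that the corpus path there is written as a `${paths.train}` style reference.

```python
import re

# A toy stand-in for the path-related parts of a generated config (assumed layout).
config = {
    "paths":   {"train": None, "dev": None},
    "corpora": {
        "train": {"path": "${paths.train}"},
        "dev":   {"path": "${paths.dev}"},
    },
}

def apply_override(cfg, dotted_name, value):
    """Set a nested value from a dotted name such as 'paths.train' (hypothetical helper)."""
    *parents, last = dotted_name.split(".")
    node = cfg
    for key in parents:
        node = node[key]
    node[last] = value

def lookup(cfg, dotted_name):
    """Fetch a nested value from a dotted name such as 'paths.train'."""
    node = cfg
    for key in dotted_name.split("."):
        node = node[key]
    return node

def resolve(cfg, value):
    """Replace ${section.key} references in a string with the values they point at."""
    return re.sub(r"\$\{([^}]+)\}", lambda m: str(lookup(cfg, m.group(1))), value)

# what  --paths.train ./train.spacy --paths.dev ./dev.spacy  amounts to in this toy model
apply_override(config, "paths.train", "./train.spacy")
apply_override(config, "paths.dev", "./dev.spacy")

train_path = resolve(config, config["corpora"]["train"]["path"])
dev_path = resolve(config, config["corpora"]["dev"]["path"])
print(train_path, dev_path)
```

The point is only that the command-line values fill in blanks that the config leaves open, and everything that refers to those blanks picks them up.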


### imdb example (2 exclusive categories)

Let's get some readymade training data. There are many sources of datasets to experiment with; for now we're going with [ml-datasets](https://github.com/explosion/ml-datasets#available-loaders) for ease.

For example, their [imdb dataset](https://github.com/explosion/ml-datasets#imdb) is *binary sentiment analysis* - which is a fancy term for "there are two output classes, that happen to be called 'positive' and 'negative'". (It doesn't know how we value them, it just associates things with either class.)

As a task this isn't very interesting in a legal context (except maybe broadly sorting [internetconsultaties](https://www.internetconsultatie.nl/)),
but it makes for a simpler introduction.

In [2]:
import ml_datasets

# ml_datasets.imdb() returns a set of training data and a separate set of evaluation data - a common practice in machine learning,
# to e.g. check that you have actually learned to generalize, not just to precisely recognize your training data.

# the underlying dataset has 25k in training and 25k in testing   (plus 50k unlabeled to play with, not exposed by ml_datasets)
# with a 50-50 split in positive and negative examples

# ml_datasets.imdb() with no parameters would fetch all that data, which would take on the order of five minutes (largely because it's all in separate files)
#train, test = ml_datasets.imdb( )
# For this moment, let's load just a few to inspect
train_list, test_list = ml_datasets.imdb( train_limit=100, dev_limit=100 ) 
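
If your own data does not come pre-split like this, you would make the split yourself. A minimal sketch, assuming a list of `(text, label)` pairs; the `split_train_dev` helper and the 80/20 ratio are illustrative choices, not something ml_datasets provides.

```python
import random

def split_train_dev(pairs, dev_fraction=0.2, seed=0):
    """Shuffle a copy of the (text, label) pairs and split off a held-out dev set."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)      # seeded, so the split is repeatable
    n_dev = int(len(shuffled) * dev_fraction)
    return shuffled[n_dev:], shuffled[:n_dev]  # (train, dev)

# made-up stand-in data, just to show the shapes involved
pairs = [(f"review number {i}", "pos" if i % 2 == 0 else "neg") for i in range(10)]
train_pairs, dev_pairs = split_train_dev(pairs)
print(len(train_pairs), len(dev_pairs))
```

Shuffling before splitting matters when the source data is ordered, for example all positive reviews first.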

In [3]:
# A quick peek at that data
# both train and test are a list of (review_text, posorneg_text) pairs

import random, textwrap, re

for review_text, annot_text in random.sample( train_list, 2 ):
    print('-'*50)
    print(f"Sentiment: {annot_text.upper()}")
    #review_text = re.sub(r'\s+', ' ', review_text).strip()
    print("Review: %s"%('\n'.join(textwrap.wrap(review_text))))

--------------------------------------------------
Sentiment: POS
Review: In what could have been seen as a coup towards the sexual "revolution"
(purposefully I use quotations for that word), Jean Eustache wrote and
directed The Mother and the Whore as a poetic, damning critique of
those who can't seem to get enough love. If there is a message to this
film- and I'd hope that the message would come only after the fact of
what else this Ben-Hur length feature has to offer- it's that in order
to love, honestly, there has to be some level of happiness, of real
truth. Is it possible to have two lovers? Some can try, but what is
the outcome if no one can really have what they really want, or feel
they can even express to say what they want?     What is the truth in
the relationships that Alexandre (Jean-Pierre Leaud) has with the
women around him? He's a twenty-something pseudo-intellectual, not
with any seeming job and he lives off of a woman, Marie (Bernadette
Lafont) slightly older than h

#### Creating those files it wants

For the inquisitive: the corpus sections of that config each result in a Corpus object, which yields [Example](https://spacy.io/api/example) objects
(which is why it is used in training and pretraining). By default a config uses the built-in [corpus reader](https://spacy.io/api/top-level#corpus-readers).

...but you don't really need to know that ([until data grows too large for RAM](https://spacy.io/usage/training#data-corpora)).
Right now you can do something like the following:




In [5]:
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en") # blank means just the tokenizer, no trained components  (the vocabulary seems to start nearly empty and grow as documents are processed)

category_names = ['pos', 'neg']

db = DocBin()
for review_text, annot_text in random.sample( train_list, 100 ):
    doc = nlp.make_doc( review_text )
    
    # this is overkill right here, but some decent syntax-fu if you have a lot of categories to set to 0
    doc.cats = {category:0   for category in category_names}
    doc.cats[annot_text] = 1

    db.add(doc)
db.to_disk('mytrain.spacy')
    

In [11]:
db = DocBin()
db.from_disk('mytrain.spacy')

nlp = spacy.blank("en")

for doci, doc in enumerate( db.get_docs(nlp.vocab) ):
    print( db.cats[doci], list( doc ) )

{'pos': 0, 'neg': 1} [Suffice, to, say, that, -, despite, the, odd, ludicrous, panegyric, to, his, soi, disant, ", abilities, ", posted, here, -, the, director, of, this, inept, ,, odious, tosh, has, n't, made, a, film, since, ., Well, that, is, excellent, news, as, far, as, I, 'm, concerned, ., 



, Dead, Babies, has, all, of, the, bile, of, its, creator, ,, but, lacks, the, wit, and, technical, proficiency, that, make, Martin, Amis, the, novelist, readable, ., 



, When, will, the, British, film, industry, wake, up, and, realise, that, if, it, wants, to, regain, the, status, it, once, had, it, should, stop, producing, rubbish, like, this, and, make, something, real, people, will, actually, want, to, watch, ?, 



, Avoid, like, the, plague, .]
{'pos': 0, 'neg': 1} [The, film, My, Name, is, Modesty, is, based, around, an, episode, that, takes, up, about, one, page, in, the, 10th, modesty, Blaise, novel, called, Night, of, the, Morningstar, ., It, describes, an, incident, in, which, 

#### Running the training

As suggested earlier, you can now run
`python3 -m spacy train mytrain.cfg --paths.train ./train.spacy --paths.dev ./dev.spacy`
(with the paths pointing at the `.spacy` files you actually wrote: the cell above wrote `mytrain.spacy`, and a dev file can be made the same way from `test_list`).

If you want to do the same but avoid the command-line callout, e.g. because you're not in a notebook or at a command line,
the code equivalent of that suggested line would be something like 
`spacy.cli.train.train("./mytrain.cfg", overrides={"paths.train": "./train.spacy", "paths.dev": "./dev.spacy"})`


P, R, and F refer to 
* Precision - roughly 'how many of the assigned labels are correct'
* Recall - roughly 'how many of the labels we know should be there are assigned?'
* F1 score - the harmonic mean of the precision and recall, basically a single figure

You can get into a long, mathematical, and pragmatic discussion about whether false positives or false negatives matter more for your task.

Only considering precision ignores the question of what we're missing (false negatives).
Only considering recall ignores how many of the labels we did assign are wrong (false positives).

If you just want a single number generally indicating how well it works,
metrics like F1 balance both precision and recall in a single number.
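
To make those three definitions concrete, here is a small plain-Python sketch that computes them for one label from a list of known (gold) labels and a list of predicted labels. The `precision_recall_f1` helper and the example labels are made up for illustration; this is not how spacy reports its scores internally.

```python
def precision_recall_f1(gold, predicted, label):
    """Precision, recall and F1 for a single label, from parallel lists of labels."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if p == label and g == label)   # assigned, and correct
    fp = sum(1 for g, p in pairs if p == label and g != label)   # assigned, but wrong
    fn = sum(1 for g, p in pairs if p != label and g == label)   # should have been assigned, but missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

gold      = ["pos", "pos", "pos", "pos", "neg", "neg"]
predicted = ["pos", "pos", "neg", "neg", "pos", "neg"]

p, r, f = precision_recall_f1(gold, predicted, "pos")
print(f"P={p:.3f}  R={r:.3f}  F1={f:.3f}")
```

Here two of the three 'pos' predictions are right (precision 2/3), but only two of the four real 'pos' items were found (recall 1/2), and F1 lands in between at 4/7.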

https://developers.google.com/machine-learning/crash-course/classification/precision-and-recall



https://explosion.ai/blog/spacy-v3-project-config-systems

https://spacy.io/usage/training#quickstart

In [4]:
import spacy
english = spacy.load("en_core_web_trf")

### 20newsgroups example

Multiple categories, still treated as exclusive. 
Mostly just a step to show that something more complex-looking isn't necessarily much scarier.


In [6]:
from sklearn.datasets import fetch_20newsgroups

newsgroups_train = fetch_20newsgroups(subset='train')

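# the data is in newsgroups_train.data (a list of post texts) and newsgroups_train.target (one category index per post);
# the category names those indices refer to are in
#newsgroups_train.target_names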

#https://scikit-learn.org/0.19/datasets/twenty_newsgroups.html
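
With more than two exclusive categories, the per-document `cats` dict works the same way as in the imdb example: one category set to 1, all others to 0. A minimal sketch of that conversion in plain Python, using a hypothetical `make_cats` helper and a few made-up stand-ins for `target_names`, `target` and `data` (so that it runs without downloading anything):

```python
def make_cats(target_index, target_names):
    """Build a cats-style dict: 1.0 for the one true category, 0.0 for all others."""
    return {name: (1.0 if i == target_index else 0.0) for i, name in enumerate(target_names)}

# made-up stand-ins for newsgroups_train.target_names / .target / .data
target_names = ["rec.autos", "sci.space", "talk.politics.misc"]
targets = [1, 0]
texts = [
    "The shuttle launch was delayed again.",
    "What oil should I use in an old engine?",
]

for text, target in zip(texts, targets):
    print(make_cats(target, target_names), text)
```

Each resulting dict is what you would assign to `doc.cats` before adding the document to a DocBin.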

### Annotation

Above, we cheated our way out of a major step, by having data that already had categories. 

In the real world, you will often need to create those associations yourself.

This certainly is extra wrangling, and often some extra data conversion - more so if you want an incremental sort of workflow,
with "add more work to previous set" or "give me suggestions to confirm".

This is where you might consider paid-for annotation tools.






### In spacy 2 (for reference; we mostly want spacy 3 now)

In [27]:



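tc_train = []
# train_list is the imdb data loaded earlier; sampling 200 needs it to hold at least 200 items (i.e. a larger train_limit)
for text, annot in random.sample( train_list, 200 ):
    item = {}
    item['text'] = re.sub(r'\s+', ' ', text).strip()
    item['cats'] = { "pos":(annot=="pos"),   "neg":(annot=="neg") }
    tc_train.append( item )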

# import pprint
# pprint.pprint( tc_train[0] )

200


In [None]:
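# spacy2  - kept for reference; this uses the v2 API and will not run as-is under spacy 3
  
from spacy.util import minibatch, compounding

nlp = english
training_data = tc_train
iterations = 20

# the pipeline needs a textcat component that knows our labels (v2-style calls)
if "textcat" not in nlp.pipe_names:
    textcat = nlp.create_pipe("textcat")
    nlp.add_pipe(textcat, last=True)
    for label in ("pos", "neg"):
        textcat.add_label(label)

with nlp.disable_pipes([ pipe for pipe in nlp.pipe_names if pipe != "textcat" ]):   # only train the textcat component
    print("Beginning training")

    optimizer = nlp.begin_training()

    batch_sizes = compounding( 4.0, 32.0, 1.001 )

    for i in range(iterations):
        loss = {}
        random.shuffle(training_data)
        for batch in minibatch(training_data, size=batch_sizes) :
            # each item is a dict with 'text' and 'cats'; v2's update wants texts and annotation dicts as separate lists
            texts       = [ item['text']            for item in batch ]
            annotations = [ {'cats': item['cats']}  for item in batch ]
            nlp.update(        # https://spacy.io/api/language#update
                texts,
                annotations,
                drop=0.2,      # dropout rate
                sgd=optimizer, # an optimizer 
                losses=loss    # dictionary to update losses in
            )
        print( i, loss )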


# minibatch and compounding are mechanical details of training in small batches.
# You can ignore them the first time around.
# If you want to know:
# - minibatch is about training on subsets of the data at a time. It is often combined with shuffling first. 
#   For example:
#     work = list(range(15))
#     random.shuffle(work)
#     for batch in minibatch(work, 3):
#         print( batch )
#   would yield size-3 batches, e.g. 
#     [10, 6, 5]
#     [13, 2, 3]
#     [12, 0, 9]
#     [7, 8, 1]
#     [11, 14, 4]
# - compounding is a generator that returns floats. Take compounding( 4.0, 32.0, 1.5 )
#   - it starts at the start value (4) 
#   - multiplies that by the compounding factor (1.5) at each step
#   - until it reaches the maximum (32)
#   in this case it will yield 4.0, 6.0, 9.0, 13.5, 20.25, 30.375, and then 32.0 forevermore
#   Its use here is to start with smaller batches to learn faster at the start, then slowly move to a larger batch size that ends up refining.
#   The actual example above (compounding( 4.0, 32.0, 1.001 )) takes roughly 2000 steps to actually reach 32   
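
To see these mechanics without spacy, both helpers are easy to imitate. The following is a plain-Python sketch of the behaviour described in the comments above, using hypothetical names `toy_compounding` and `toy_minibatch`; the real functions may differ in details.

```python
import itertools

def toy_compounding(start, stop, compound):
    """Yield start, start*compound, start*compound**2, ... clipped at stop, forever."""
    value = start
    while True:
        yield min(value, stop)
        value *= compound

def toy_minibatch(items, sizes):
    """Cut items into consecutive batches, taking each batch's size from the sizes generator."""
    items = iter(items)
    for size in sizes:
        batch = list(itertools.islice(items, int(size)))
        if not batch:       # nothing left to hand out
            break
        yield batch

first_sizes = list(itertools.islice(toy_compounding(4.0, 32.0, 1.5), 8))
print(first_sizes)

batches = list(toy_minibatch(range(30), toy_compounding(4.0, 32.0, 1.5)))
print([len(batch) for batch in batches])
```

The first print shows the batch size growing and then staying at the maximum; the second shows thirty items being handed out in growing batches, with the last batch simply being whatever is left.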

### In spacy 3

In [None]:


from spacy.training import Example
# An example stores two Doc objects
# - one for holding the gold-standard reference data
# - one for holding the predictions of the pipeline
# An Alignment object stores the alignment between these two documents, as they can differ in tokenization.
#example = Example(predicted, reference)


In [None]:
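from spacy.training import Example
from spacy.util import minibatch, compounding

nlp = english
training_data = tc_train
iterations = 20

# add a textcat component if the loaded pipeline does not have one yet
if "textcat" not in nlp.pipe_names:
    textcat = nlp.add_pipe("textcat")
else:
    textcat = nlp.get_pipe("textcat")

# v3 trains on Example objects rather than on separate texts and annotations
examples = [
    Example.from_dict( nlp.make_doc(item['text']), {'cats': {label: float(value) for label, value in item['cats'].items()}} )
    for item in training_data
]

with nlp.select_pipes(enable=["textcat"]):   # only train the textcat component  (v3's replacement for disable_pipes)
    print("Beginning training")

    textcat.initialize(lambda: examples, nlp=nlp)   # sets up the model, taking the labels from the examples
    optimizer = nlp.create_optimizer()    
    
    batch_sizes = compounding( 4.0, 32.0, 1.001 )

    for i in range(iterations):
        loss = {}
        random.shuffle(examples)
        for batch in minibatch(examples, size=batch_sizes) :
            nlp.update(        # https://spacy.io/api/language#update
                batch,
                drop=0.2,      # dropout rate
                sgd=optimizer, # an optimizer 
                losses=loss    # dictionary to update losses in
            )
        print( i, loss )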


# You can ignore minibatch and compounding for now, 
# they are just mechanical details of training in small batches
# 
# If you want to know:
# - minibatch is about training small bits at a time,  and is often combined with randomizing first, e.g. 
#     work = list(range(15))
#     random.shuffle(work)
#     for batch in minibatch(work, 3):
#         print( batch )
#   would yield size-3 batches, e.g. 
#     [10, 6, 5]
#     [13, 2, 3]
#     [12, 0, 9]
#     [7, 8, 1]
#     [11, 14, 4]
# - compounding is about starting with smaller batches
#   it's a generator that returns floats - it multiplies the start value (4) by the compounding factor (1.5) until it reaches the maximum (32)
#   e.g.  compounding( 4.0, 32.0, 1.5 )   will yield   4.0, 6.0, 9.0, 13.5, 20.25, 30.375, and then 32.0 forevermore

## using fasttext

https://fasttext.cc/docs/en/python-module.html

