In [None]:
#all_slow

# Tutorial - Sequence Classification
> Performing Sequence Classification with AdaptNLP

Sequence Classification (or Text Classification) is the NLP task of predicting a label for a sequence of words.

For example, a string of `That movie was terrible because the acting was bad` could be tagged with a label of `negative`. A string of `That movie was great because the acting was good` could be tagged with a label of `positive`.

A model that can predict sentiment from text is called a sentiment classifier, which is an example of a sequence classification model.

Below, we'll walk through how to use AdaptNLP's `EasySequenceClassifier` class to do the following:
1. Load pre-trained models and tag data using mini-batched inference
2. Train and fine-tune a pre-trained model on your own dataset
3. Evaluate your model

## Loading Pretrained Models and Tag Data using Mini-Batched Inference

We'll get started by importing the `EasySequenceClassifier` class from AdaptNLP and instantiating it.

In [None]:
from adaptnlp import EasySequenceClassifier
from pprint import pprint

classifier = EasySequenceClassifier()

With this class we can dynamically load models to run inference.

Let's use the `HFModelHub` to search for some pre-trained sequence classification models to use:

In [None]:
from adaptnlp.model_hub import HFModelHub
hub = HFModelHub()

We can search either by task or by model name. Below is an example of the text classification models that HuggingFace itself has published:

In [None]:
hub.search_model_by_task('text-classification')

[Model Name: distilbert-base-uncased-finetuned-sst-2-english, Tasks: [text-classification],
 Model Name: roberta-base-openai-detector, Tasks: [text-classification],
 Model Name: roberta-large-mnli, Tasks: [text-classification],
 Model Name: roberta-large-openai-detector, Tasks: [text-classification]]

For this example though we will tag some text with a model that [NLP Town](https://www.nlp.town/) has trained called `nlptown/bert-base-multilingual-uncased-sentiment`. Let's find it in the model hub:

In [None]:
model = hub.search_model_by_name('nlptown/bert-base', user_uploaded=True)[0]; model

Model Name: nlptown/bert-base-multilingual-uncased-sentiment, Tasks: [text-classification]

This is a multi-lingual model that predicts how many stars (1-5) a text review has given a product. More information can be found in the Transformers model card [here](https://huggingface.co/nlptown/bert-base-multilingual-uncased-sentiment).

Next we can perform some inference. First let's write some example text:

In [None]:
example_text = "This didn't work at all"

Then we can tell our classifier to tag some text with `tag_text`:

In [None]:
sentences = classifier.tag_text(
    text=example_text,
    model_name_or_path=model,
    mini_batch_size=1
)

2021-05-31 16:09:42,794 loading file nlptown/bert-base-multilingual-uncased-sentiment




Now let's look at our outputs:

In [None]:
#hide_input
print(f"Tag Score Outputs:\n")
print(f'Sentences: {sentences["sentences"]}\n')
print(f'Classes: {sentences["classes"]}\n')
print(f"Probabilities: \n\t{sentences['probs']}")

Tag Score Outputs:

Sentences: ["This did n't work at all"]

Classes: ['1 star']

Probabilities: 
	tensor([[8.4212e-01, 1.3793e-01, 1.8024e-02, 1.2419e-03, 6.8153e-04]])


It's just as easy to pass in multiple sentences at once (as a list). Let's try that now:

In [None]:
multiple_text = ["This didn't work well at all.",
                 "I really liked it.",
                 "It was really useful.",
                 "It broke after I bought it."]

We'll pass it into the `classifier` just like before:

In [None]:
sentences = classifier.tag_text(
    text=multiple_text,
    model_name_or_path=model,
    mini_batch_size=2
)

And we can check the outputs again:

In [None]:
#hide_input
print(f"Tag Score Outputs:\n")
print(f'Sentences: {sentences["sentences"]}\n')
print(f'Classes: {sentences["classes"]}\n')
print(f"Probabilities: \n\t{sentences['probs']}")

Tag Score Outputs:

Sentences: ["This did n't work well at all .", 'I really liked it .', 'It was really useful .', 'It broke after I bought it .']

Classes: ['1 star', '4 stars', '5 stars', '1 star']

Probabilities: 
	tensor([[6.2198e-01, 3.3563e-01, 4.0320e-02, 1.5827e-03, 4.8790e-04],
        [3.2305e-03, 4.7872e-03, 5.4017e-02, 4.8129e-01, 4.5668e-01],
        [5.9679e-03, 9.2630e-03, 7.0121e-02, 4.1363e-01, 5.0102e-01],
        [4.4894e-01, 3.9348e-01, 1.4158e-01, 1.2110e-02, 3.8937e-03]])
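
To read the tensor above: each row belongs to one sentence and each column to one class. The following is a minimal sketch using plain Python lists instead of tensors. It assumes the columns are ordered from `1 star` to `5 stars` (an assumption that is consistent with the printed `Classes`, not something taken from AdaptNLP's API); `STAR_LABELS` and `top_label` are hypothetical names.

```python
# Probabilities copied from the output above, one row per sentence.
probs = [
    [6.2198e-01, 3.3563e-01, 4.0320e-02, 1.5827e-03, 4.8790e-04],
    [3.2305e-03, 4.7872e-03, 5.4017e-02, 4.8129e-01, 4.5668e-01],
    [5.9679e-03, 9.2630e-03, 7.0121e-02, 4.1363e-01, 5.0102e-01],
    [4.4894e-01, 3.9348e-01, 1.4158e-01, 1.2110e-02, 3.8937e-03],
]

# Assumed column order (hypothetical constant, not part of AdaptNLP).
STAR_LABELS = ["1 star", "2 stars", "3 stars", "4 stars", "5 stars"]


def top_label(row, labels):
    """Return the most probable label and its probability for one row."""
    best = max(range(len(row)), key=row.__getitem__)
    return labels[best], row[best]


predictions = [top_label(row, STAR_LABELS) for row in probs]
for label, p in predictions:
    print(f"{label}: {p:.3f}")
```

Note that the first and last sentences both end up as `1 star`, but with quite different confidence, which is why it can be worth inspecting the full distribution rather than only the predicted class.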


> Note: The output is a probability distribution over the labels the text could be tagged with. If you're running this on a GPU, you can specify the `mini_batch_size` parameter to run mini-batch inference against your data for faster run time.
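
Conceptually, mini-batched inference splits the input list into chunks of at most `mini_batch_size` items and runs the model once per chunk. The sketch below illustrates only the chunking step in plain Python; `make_batches` is a hypothetical helper and not AdaptNLP's internal implementation.

```python
def make_batches(items, mini_batch_size):
    """Yield consecutive slices of `items` with at most `mini_batch_size` elements."""
    if mini_batch_size < 1:
        raise ValueError("mini_batch_size must be a positive integer")
    for start in range(0, len(items), mini_batch_size):
        yield items[start:start + mini_batch_size]


multiple_text = [
    "This didn't work well at all.",
    "I really liked it.",
    "It was really useful.",
    "It broke after I bought it.",
]

batches = list(make_batches(multiple_text, mini_batch_size=2))
print(len(batches))  # 2 batches of 2 sentences each
```

Larger batches mean fewer forward passes but more memory per pass, so the best value depends on the model size and the available GPU memory.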

You can set `model_name_or_path` to any of Transformers' or Flair's pre-trained sequence classification models.

Let's tag some text with another model, specifically Oliver Guhr's German sentiment model called `oliverguhr/german-sentiment-bert`.

First we'll write some German text:

In [None]:
german_text = ["Das hat überhaupt nicht gut funktioniert.",
               "Ich mochte es wirklich.",
               "Es war wirklich nützlich.",
               "Es ist kaputt gegangen, nachdem ich es gekauft habe."]

And then tag it:

In [None]:
sentences = classifier.tag_text(
    german_text,
    model_name_or_path="oliverguhr/german-sentiment-bert",
    mini_batch_size=1
)

> Note: As seen here, you can either search for a model through the various `ModelHub` classes, or directly pass in the name of the model you want as a string.

Let's look at the output:

In [None]:
#hide_input
print(f"Tag Score Outputs:\n")
print(f'Sentences: {sentences["sentences"]}\n')
print(f'Classes: {sentences["classes"]}\n')
print(f"Probabilities: \n\t{sentences['probs']}")

Tag Score Outputs:

Sentences: ['Das hat überhaupt nicht gut funktioniert .', 'Ich mochte es wirklich .', 'Es war wirklich nützlich .', 'Es ist kaputt gegangen , nachdem ich es gekauft habe .']

Classes: ['negative', 'positive', 'positive', 'negative']

Probabilities: 
	tensor([[8.2706e-04, 9.9915e-01, 2.7373e-05],
        [7.0231e-01, 2.0294e-01, 9.4746e-02],
        [9.8132e-01, 1.8442e-02, 2.3914e-04],
        [4.2462e-03, 9.9566e-01, 9.4817e-05]])


Don't forget you can still quickly run inference with the multi-lingual review sentiment model you loaded in earlier (memory permitting)! Just change the `model_name_or_path` param to the model you used before.

Let's release the German sentiment model to free up some memory for our next step: training!

In [None]:
classifier.release_model(model_name_or_path="oliverguhr/german-sentiment-bert")

## Train and Fine-Tune a Pre-Trained Model on Your Own Dataset

Let's imagine you have your own dataset with text/label pairs you'd like to create a sequence classification model for.

With the easy sequence classifier, you can take advantage of transfer learning by fine-tuning pre-trained models on your own custom datasets.

Note: The `EasySequenceClassifier` is integrated heavily with the `datasets.Dataset` and `transformers.Trainer` class objects, so please check out the [datasets](https://huggingface.co/datasets) and [transformers](https://huggingface.co/transformers) documentation for more information.

We'll first need a "custom" dataset to start training our model. Our `EasySequenceClassifier.train()` method can run with either `datasets.Dataset` objects or CSV data file paths. Since the datasets library makes it so easy, we'll use the `datasets.load_dataset()` method to load in the IMDB Sentiment dataset. We'll show an example with a CSV later. 

In [None]:
from datasets import load_dataset

train_dataset, eval_dataset = load_dataset('imdb', split=['train[:1%]', 'test[:1%]'])

# Uncomment below to use all of the data. Note that training and evaluation can then take an hour or more.
#train_dataset, eval_dataset = load_dataset('imdb', split=['train', 'test'])

pprint(vars(train_dataset.info))

Reusing dataset imdb (/root/.cache/huggingface/datasets/imdb/plain_text/1.0.0/90099cb476936b753383ba2ae6ab2eae419b2e87f71cd5189cb9c8e5814d12a3)


{'builder_name': 'imdb',
 'citation': '@InProceedings{maas-EtAl:2011:ACL-HLT2011,\n'
             '  author    = {Maas, Andrew L.  and  Daly, Raymond E.  and  '
             'Pham, Peter T.  and  Huang, Dan  and  Ng, Andrew Y.  and  Potts, '
             'Christopher},\n'
             '  title     = {Learning Word Vectors for Sentiment Analysis},\n'
             '  booktitle = {Proceedings of the 49th Annual Meeting of the '
             'Association for Computational Linguistics: Human Language '
             'Technologies},\n'
             '  month     = {June},\n'
             '  year      = {2011},\n'
             '  address   = {Portland, Oregon, USA},\n'
             '  publisher = {Association for Computational Linguistics},\n'
             '  pages     = {142--150},\n'
             '  url       = {http://www.aclweb.org/anthology/P11-1015}\n'
             '}\n',
 'config_name': 'plain_text',
 'dataset_size': 133190346,
 'description': 'Large Movie Review Dataset.\n'
            

Let's take a brief look at what the IMDB Sentiment dataset looks like. The label column has two classes, 0 and 1. You can see the class names mapped to these integers with `train_dataset.features["label"].names`.
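
The mapping between integer labels and class names is simply positional. Here is a small sketch in plain Python, assuming the class names are `neg` and `pos` in that index order; the `labels` list is made-up sample data rather than rows from the dataset.

```python
from collections import Counter

# Assumed class names in index order: 0 -> "neg", 1 -> "pos".
label_names = ["neg", "pos"]

# Hypothetical integer labels standing in for the `label` column.
labels = [1, 1, 0, 1, 0]

# Translate each integer to its class name and count the classes.
named = [label_names[i] for i in labels]
counts = Counter(named)

print(named)
print(counts)
```

Checking the class counts this way is a quick guard against training on a slice that happens to contain only one class.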

In [None]:
train_dataset.set_format(type="pandas", columns=["text", "label"])
train_dataset[:]

| | label | text |
|---|---|---|
| 0 | 1 | Bromwell High is a cartoon comedy. It ran at t... |
| 1 | 1 | Homelessness (or Houselessness as George Carli... |
| 2 | 1 | Brilliant over-acting by Lesley Ann Warren. Be... |
| 3 | 1 | This is easily the most underrated film inn th... |
| 4 | 1 | This is not the typical Mel Brooks film. It wa... |
| ... | ... | ... |
| 245 | 1 | That hilarious line is typical of what these n... |
| 246 | 1 | Faith and Mortality... viewed through the lens... |
| 247 | 1 | The unlikely duo of Zero Mostel and Harry Bela... |
| 248 | 1 | *some spoilers*<br /><br />I was pleasantly su... |


Let's reformat it back into a more "pythonic" dataset:

In [None]:
train_dataset.set_format(columns=["text", "label"])

Uncomment the cell below to see training done with CSV files. It saves the `datasets.Dataset` objects you have in `train_dataset` and `eval_dataset` as CSVs and then trains the model from the CSV file paths. Skip it to continue straight to training.

In [None]:
#train_dataset.set_format(type="pandas", columns=["text", "label"])
#eval_dataset.set_format(type="pandas", columns=["text", "label"])

#train_dataset[:].to_csv("./IMDB train.csv", index=False)
#eval_dataset[:].to_csv("./IMDB eval.csv", index=False)

#train_dataset = "./IMDB train.csv"
#eval_dataset = "./IMDB eval.csv"
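
If your data does not come from the `datasets` library at all, a CSV with one text column and one label column is enough. The sketch below writes and re-reads such a file using only the standard library; the two rows are made-up examples and the temporary path is hypothetical. The column names `text` and `label` match the `text_col_nm` and `label_col_nm` values used for training later on.

```python
import csv
import os
import tempfile

# Hypothetical rows standing in for a text/label dataset.
rows = [
    {"text": "That movie was great because the acting was good", "label": 1},
    {"text": "That movie was terrible because the acting was bad", "label": 0},
]

csv_path = os.path.join(tempfile.mkdtemp(), "train.csv")

# Write a header row followed by one line per example.
with open(csv_path, "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["text", "label"])
    writer.writeheader()
    writer.writerows(rows)

# Read the file back; csv returns every field as a string, so cast the label.
with open(csv_path, newline="", encoding="utf-8") as f:
    loaded = [{"text": r["text"], "label": int(r["label"])} for r in csv.DictReader(f)]

print(loaded == rows)
```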

Before we start training, we need to specify the training arguments. These consist mainly of the hyperparameters we want to provide to the model, such as batch size, initial learning rate, and number of epochs.

We will be using the `transformers.TrainingArguments` data class to store our training args. These are compatible with the `transformers.Trainer` as well as AdaptNLP's train methods. For more documentation on the `TrainingArguments` class, please look [here](https://huggingface.co/transformers/main_classes/trainer.html#trainingarguments). There are a lot of arguments available, but we will pass in the important ones and use default values for the rest.

The training arguments below specify, among other things, the output directory for your model and checkpoints.

In [None]:
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir='./models',
    num_train_epochs=1,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    warmup_steps=500,
    weight_decay=0.01,
    evaluation_strategy="steps",
    logging_dir='./logs',
    save_steps=100
)
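
As a sanity check on these values, it helps to estimate how many optimizer steps the run will actually take. The rough sketch below assumes a single device, no gradient accumulation, 250 training rows (1% of the 25,000 IMDB training reviews), a linear warmup, and a base learning rate of 5e-5 (the `TrainingArguments` default); `total_steps` and `warmup_lr` are hypothetical helper names.

```python
import math


def total_steps(num_examples, batch_size, num_epochs):
    """Optimizer steps for a single device without gradient accumulation."""
    return math.ceil(num_examples / batch_size) * num_epochs


def warmup_lr(step, base_lr, warmup_steps):
    """Learning rate during a linear warmup, capped at `base_lr`."""
    return base_lr * min(1.0, step / warmup_steps)


steps = total_steps(num_examples=250, batch_size=4, num_epochs=1)
final_lr = warmup_lr(steps, base_lr=5e-5, warmup_steps=500)

print(steps)  # 63
print(final_lr)
```

Under these assumptions the run ends after 63 steps, well before the 500 warmup steps are over, so the learning rate never reaches its configured value. When training on a small subset, a smaller `warmup_steps` may be more appropriate.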

Now we can run the built-in `train()` method by passing in the training arguments. The training method is also where you specify your data arguments, which include your train and eval datasets, the pre-trained model ID (this should have been loaded in your earlier cells, but can be loaded dynamically), the text column name, the label column name, and the ordered label names (only required if loading in paths to CSV data files for the dataset args).

Please check out AdaptNLP's package reference for more information [here](https://novetta.github.io/adaptnlp/class-api/sequence-classifier-module.html).

In [None]:
classifier.train(training_args=training_args,
                 train_dataset=train_dataset,
                 eval_dataset=eval_dataset,
                 model_name_or_path="nlptown/bert-base-multilingual-uncased-sentiment",
                 text_col_nm="text",
                 label_col_nm="label",
                 label_names=["negative","positive"]  # ordered by label id: 0 = negative, 1 = positive
                )

## Evaluate Your Model

After training, you can evaluate the model with the eval dataset you passed in for training.

In [None]:
classifier.evaluate(model_name_or_path="nlptown/bert-base-multilingual-uncased-sentiment")

{'eval_loss': 0.017184646800160408,
 'eval_accuracy': 1.0,
 'eval_f1': array([1.]),
 'eval_precision': array([1.]),
 'eval_recall': array([1.]),
 'eval_runtime': 2.7201,
 'eval_samples_per_second': 91.91,
 'epoch': 1.0}
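
The metric names in the output above follow their usual definitions. The sketch below computes them for a binary task in plain Python; `binary_metrics` is a hypothetical helper, the label lists are made-up values, and this is not necessarily how AdaptNLP computes its `eval_*` numbers.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    correct = sum(t == p for t, p in pairs)

    accuracy = correct / len(pairs)
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up gold labels and predictions for five examples.
y_true = [1, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 1]

metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

With a very small evaluation slice, scores such as the perfect values above may simply reflect that the slice contains few examples or only a single class, so they are worth re-checking on the full test split.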

You may notice it's a little odd that we're still using the `model_name_or_path` of the pre-trained model that we fine-tuned via transfer learning. We can release the fine-tuned model and then load it back in from the directory where we serialized it.

In [None]:
classifier.release_model(model_name_or_path="nlptown/bert-base-multilingual-uncased-sentiment")

In [None]:
sentences = classifier.tag_text(
    multiple_text,
    model_name_or_path="./models",
    mini_batch_size=1
)

print(f"Tag Score Outputs:\n")
print(f'Sentences: {sentences["sentences"]}\n')
print(f'Classes: {sentences["classes"]}\n')
print(f"Probabilities: \n\t{sentences['probs']}")

2021-04-20 19:43:56,203 loading file ./models




Tag Score Outputs:

{"This didn't work well at all.": [neg (0.263), pos (0.737)]}
{'I really liked it.': [neg (0.1309), pos (0.8691)]}
{'It was really useful.': [neg (0.184), pos (0.816)]}
{'It broke after I bought it.': [neg (0.2716), pos (0.7284)]}


And we're done!