<a href="https://colab.research.google.com/github/tds2023-24/course/blob/main/notebooks/09_Introduction_NLP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<div class='bar_title'></div>

*Practical Data Science*

# Introduction to Natural Language Processing

Gunther Gust & Viet Nguyen<br>
Chair for Enterprise AI <br>
Center for Artificial Intelligence and Data Science (CAIDAS)

<img src="images/d3.png" style="width:20%; float:left;" />

<img src="images/CAIDASlogo.png" style="width:20%; float:left;" />

## *Traditional* NLP - Working With Text Data (scikit-learn)

__Credits for this section__

- https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html

The goal of this section is to explore some of the main scikit-learn tools on a single practical task: analyzing a collection of text documents (newsgroups posts) on twenty different topics.

In this section we will see how to:
- extract feature vectors suitable for machine learning
- train a model to perform categorization

### The 20 newsgroups dataset


The dataset is called “Twenty Newsgroups”. Here is the official description, quoted from the website:

The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups. To the best of our knowledge, it was originally collected by Ken Lang, probably for his paper “Newsweeder: Learning to filter netnews,” though he does not explicitly mention this collection. The 20 newsgroups collection has become a popular data set for experiments in text applications of machine learning techniques, such as text classification and text clustering.

In the following we will use the built-in dataset loader for 20 newsgroups from scikit-learn.
In order to get faster execution times for this first example, we will work on a partial dataset with only 4 categories out of the 20 available in the dataset.

We can now define those categories and load the list of files matching them as follows:

In [None]:
# install on Colab
# %pip install scikit-learn transformers gradio fastai

In [None]:
from sklearn.datasets import fetch_20newsgroups

categories = ['alt.atheism', 'soc.religion.christian', 'comp.graphics', 'sci.med']
twenty_train = fetch_20newsgroups(subset='train', categories=categories, shuffle=True, random_state=42)

The returned dataset is a scikit-learn “bunch”: a simple holder object whose fields can be accessed either as Python dict keys or as object attributes. For instance, `target_names` holds the list of the requested category names:

In [None]:
twenty_train.target_names

The files themselves are loaded in memory in the data attribute. For reference the filenames are also available:

In [None]:
len(twenty_train.data)

In [None]:
twenty_train.filenames[0]

Let’s print the first lines of the first loaded file:

In [None]:
print("\n".join(twenty_train.data[0].split("\n")[:10]))

**Target Labels**

In [None]:
print(twenty_train.target_names[twenty_train.target[0]])

Supervised learning algorithms will require a category label for each document in the training set. In this case the category is the name of the newsgroup which also happens to be the name of the folder holding the individual documents.

For speed and space efficiency reasons `scikit-learn` loads the target attribute as an array of integers that corresponds to the index of the category name in the target_names list. The category integer id of each sample is stored in the target attribute:

In [None]:
twenty_train.target[:15]

It is possible to get back the category names as follows:

In [None]:
for t in twenty_train.target[:15]:
    print(twenty_train.target_names[t])

### Extracting features from text files

In order to perform machine learning on text documents, we first need to turn the text content into numerical feature vectors. 

#### Bags of words

1. Assign a fixed integer id to each word occurring in any document of the training set (for instance by building a dictionary from words to integer indices).

2. For each document #i, count the number of occurrences of each word w and store it in X[i, j] as the value of feature #j where j is the index of word w in the dictionary.
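
The two steps above can be sketched in plain Python. This is a minimal illustration on a made-up two-document corpus, not the scikit-learn implementation: the `tokenize` helper and the corpus are assumptions, and ids are assigned in order of first appearance, whereas `CountVectorizer` orders its vocabulary alphabetically.

```python
import re
from collections import Counter

# Toy corpus (hypothetical, not taken from 20 newsgroups)
docs = ["the cat sat", "the cat saw the dog"]

def tokenize(text):
    # Lowercase and keep runs of two or more word characters
    return re.findall(r"\b\w\w+\b", text.lower())

# Step 1: assign a fixed integer id to every distinct word
vocabulary = {}
for doc in docs:
    for word in tokenize(doc):
        vocabulary.setdefault(word, len(vocabulary))

# Step 2: X[i][j] = number of occurrences of word j in document i
X = [[0] * len(vocabulary) for _ in docs]
for i, doc in enumerate(docs):
    for word, count in Counter(tokenize(doc)).items():
        X[i][vocabulary[word]] = count

print(vocabulary)
print(X)
```

Each row of `X` is one document and each column is one word, so the second row records that "the" occurs twice in the second document.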



The bags of words representation implies that `n_features` is the number of distinct words in the corpus: this number is typically larger than 100,000.

If `n_samples == 10000`, storing X as a NumPy array of type float32 would require 10000 x 100000 x 4 bytes = __4GB in RAM__ and may quickly lead to memory overload.

Fortunately, __most values in X will be zeros__, since a given document uses fewer than a few thousand distinct words. For this reason we say that bags of words are typically __high-dimensional sparse datasets__. We can save a lot of memory by only storing the non-zero parts of the feature vectors in memory.

`scipy.sparse` matrices are data structures that do exactly this, and `scikit-learn` has built-in support for these structures.
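
To get a feeling for the savings, the arithmetic can be written out. This is a back-of-the-envelope sketch: the figure of 200 distinct words per document and the 4-byte indices are assumptions chosen for illustration, and the layout is a simplified compressed-row scheme rather than an exact account of what `scipy.sparse` allocates.

```python
n_samples = 10_000
n_features = 100_000
bytes_per_value = 4  # float32

# Dense storage: every cell is stored, zeros included
dense_bytes = n_samples * n_features * bytes_per_value

# Sparse, row-compressed storage: one value and one column index per
# non-zero entry, plus one row pointer per row.
# ASSUMPTION: 200 distinct words per document on average, 4-byte indices.
nonzeros_per_doc = 200
bytes_per_index = 4
nnz = n_samples * nonzeros_per_doc
sparse_bytes = (nnz * bytes_per_value
                + nnz * bytes_per_index
                + (n_samples + 1) * bytes_per_index)

print(f"dense:  {dense_bytes / 1e9:.1f} GB")
print(f"sparse: {sparse_bytes / 1e6:.1f} MB")
```

Under these assumptions the sparse representation is more than two orders of magnitude smaller than the dense one.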

**Tokenizing text with scikit-learn**

Text preprocessing, tokenizing and filtering of stopwords are all included in `CountVectorizer`, which builds a dictionary of features and transforms documents to feature vectors:

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer(ngram_range=(1,1))

X_train_counts = count_vect.fit_transform(twenty_train.data)
X_train_counts.shape

CountVectorizer supports counts of N-grams of words or consecutive characters. Once fitted, the vectorizer has built a dictionary of feature indices:

In [None]:
count_vect.vocabulary_.get('algorithm')

The index value of a word in the vocabulary is the column of that word in the count matrix, so it links the word to its counts across the whole training corpus.

#### From occurrences to frequencies

Occurrence count is a good start but there is an issue: longer documents will have higher average count values than shorter documents, even though they might talk about the same topics.

- To avoid these potential discrepancies it suffices to divide the number of occurrences of each word in a document by the total number of words in the document: these new features are called `tf` for **Term Frequencies**.

- Another refinement on top of `tf` is to downscale weights for words that occur in many documents in the corpus and are therefore less informative than those that occur only in a smaller portion of the corpus. This downscaling is called `tf–idf` for **“Term Frequency times Inverse Document Frequency”**.
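
The two definitions above can be worked through by hand. The sketch below uses the textbook formulas (`tf` as count divided by document length, `idf` as the logarithm of the number of documents divided by the document frequency) on a made-up corpus. With its default settings, `TfidfTransformer` uses a smoothed idf and normalizes each row, so its numbers will differ from these.

```python
import math
from collections import Counter

# Toy corpus, already tokenized (hypothetical)
docs = [
    "the cat sat on the mat".split(),
    "the dog sat".split(),
    "the dog barked".split(),
]

def term_frequencies(tokens):
    # tf: occurrences of each word divided by the document length
    counts = Counter(tokens)
    return {word: count / len(tokens) for word, count in counts.items()}

# Document frequency: in how many documents does each word occur?
n_docs = len(docs)
document_frequency = Counter(word for tokens in docs for word in set(tokens))

# idf: words that occur in every document get weight log(1) = 0
idf = {word: math.log(n_docs / df) for word, df in document_frequency.items()}

tfidf = [
    {word: tf * idf[word] for word, tf in term_frequencies(tokens).items()}
    for tokens in docs
]

for row in tfidf:
    print({word: round(weight, 3) for word, weight in row.items()})
```

The word "the" appears in all three documents and ends up with weight zero, while "cat", which appears in only one document, gets the largest weight in the first document.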

Both `tf` and `tf–idf` can be computed as follows using `TfidfTransformer`:

In [None]:
from sklearn.feature_extraction.text import TfidfTransformer
tfidf_transformer = TfidfTransformer(use_idf=True)
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
X_train_tfidf.shape 

### Training a classifier and performing predictions

Now that we have our features, we can train a classifier to try to predict the category of a post. Let’s start with a LogisticRegression classifier, which provides a nice baseline for this task:

In [None]:
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression().fit(X = X_train_tfidf, y = twenty_train.target)

To try to predict the outcome on a new document we need to extract the features using almost the same feature extracting chain as before. The difference is that we call transform instead of fit_transform on the transformers, since they have already been fit to the training set:

In [None]:
docs_new = ['God is love', 'OpenGL on the GPU is fast','The tomograph is not working today and the patients are waiting for it to be fixed.']
X_new_counts = count_vect.transform(docs_new)
X_new_tfidf = tfidf_transformer.transform(X_new_counts)

In [None]:
predicted = clf.predict(X_new_tfidf)
for doc, category in zip(docs_new, predicted):
    print('%r => %s' % (doc, twenty_train.target_names[category]))
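
Why `transform` rather than `fit_transform` on new documents? The sketch below uses a hypothetical `ToyCountVectorizer` (a stand-in written for this illustration, not scikit-learn's class) to show the idea: `fit` learns the vocabulary once, and `transform` reuses it, so new documents are mapped onto the same columns the classifier was trained on.

```python
import re

class ToyCountVectorizer:
    """Hypothetical, minimal stand-in for a count vectorizer."""

    def _tokenize(self, text):
        return re.findall(r"\w+", text.lower())

    def fit(self, docs):
        # Learn the vocabulary from the training documents only
        words = sorted({w for doc in docs for w in self._tokenize(doc)})
        self.vocabulary_ = {w: j for j, w in enumerate(words)}
        return self

    def transform(self, docs):
        # Reuse the learned vocabulary; words never seen during fit are skipped
        rows = []
        for doc in docs:
            row = [0] * len(self.vocabulary_)
            for w in self._tokenize(doc):
                if w in self.vocabulary_:
                    row[self.vocabulary_[w]] += 1
            rows.append(row)
        return rows

vect = ToyCountVectorizer().fit(["god is love", "opengl is fast"])
X_new = vect.transform(["love is blind"])

print(vect.vocabulary_)
print(X_new)
```

The word "blind" was not in the training documents, so it is dropped, and the new document still has exactly one column per training-vocabulary word. Calling `fit` again on the new documents would change the columns and make the features incompatible with the trained classifier.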

### Evaluation of the performance on the test set

In order to make the vectorizer => transformer => classifier easier to work with, scikit-learn provides a Pipeline class that behaves like a compound classifier:

In [None]:
from sklearn.pipeline import Pipeline

text_clf = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', LogisticRegression()),
    ])

In [None]:
text_clf.fit(twenty_train.data, twenty_train.target)

Evaluating the predictive accuracy of the model is equally easy. First, we need the predictions on the test data:

In [None]:
from sklearn import metrics
import pandas as pd

twenty_test = fetch_20newsgroups(subset='test', categories=categories, shuffle=True, random_state=42)
docs_test = twenty_test.data
predicted = text_clf.predict(docs_test)

scikit-learn's `classification_report` provides a detailed, per-class performance analysis of the results:

In [None]:
clf_report = metrics.classification_report(twenty_test.target, 
                                           predicted, 
                                           target_names=twenty_test.target_names, 
                                           output_dict=True)
pd.DataFrame(clf_report).T

## Metrics explained

A confusion matrix is a table used to evaluate the performance of a classification model by comparing actual labels with predicted labels.

|               | Predicted Positive | Predicted Negative |
|---------------|--------------------|--------------------|
| **Actual Positive** | True Positive (TP)      | False Negative (FN)     |
| **Actual Negative** | False Positive (FP)     | True Negative (TN)      |

### Key:
- **True Positive (TP)**: Correctly predicted positive cases.
- **False Positive (FP)**: Incorrectly predicted as positive when they are actually negative.
- **False Negative (FN)**: Incorrectly predicted as negative when they are actually positive.
- **True Negative (TN)**: Correctly predicted negative cases.



### Precision
Measures how many of the predicted positive samples are actually positive.

$ \text{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}} $

---

### Recall
Measures how many of the actual positive samples are correctly identified.

$ \text{Recall} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}} $

---

### F1-Score
The harmonic mean of precision and recall, balancing both metrics.

$ \text{F1-Score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} $

---

### Support
The number of true instances in each category.

---

### Accuracy
Overall fraction of correctly classified samples.

---

### Macro Avg
Arithmetic mean of the metrics for each class, treating all classes equally.

---

### Weighted Avg
Metrics averaged across classes, weighted by the number of instances in each class.
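
The definitions above can be checked on a small example. This is a minimal sketch in plain Python with made-up labels for a 3-class problem; the helper `per_class_metrics` is introduced here for illustration and treats one class at a time as "positive" and all others as "negative".

```python
# Hypothetical true and predicted labels for a 3-class problem
y_true = [0, 0, 0, 0, 1, 1, 1, 2, 2, 2]
y_pred = [0, 0, 0, 1, 1, 1, 2, 2, 2, 0]

def per_class_metrics(y_true, y_pred, label):
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    support = tp + fn  # number of true instances of this class
    return precision, recall, f1, support

labels = sorted(set(y_true))
report = {label: per_class_metrics(y_true, y_pred, label) for label in labels}

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
macro_f1 = sum(f1 for _, _, f1, _ in report.values()) / len(report)
weighted_f1 = sum(f1 * support for _, _, f1, support in report.values()) / len(y_true)

for label, (precision, recall, f1, support) in report.items():
    print(f"class {label}: precision={precision:.2f} recall={recall:.2f} "
          f"f1={f1:.2f} support={support}")
print(f"accuracy={accuracy:.2f} macro_f1={macro_f1:.2f} weighted_f1={weighted_f1:.2f}")
```

Because class 0 has the largest support and the highest F1-score here, the weighted average comes out slightly above the macro average.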


### Building an app

We can expose our model to users by designing a small app using gradio:

In [None]:
import gradio as gr

def newsgroups_classification(newsgroup_post):
    category = text_clf.predict([newsgroup_post])
    return twenty_train.target_names[category[0]]

iface = gr.Interface(fn=newsgroups_classification, inputs="textbox", outputs="text")
iface.launch(share=True)

## Exercises

1. Compute the Bag-of-Words for the documents below. What can you observe from the top-5 words that occur the most frequently in all documents?
2. Based on your observation, modify the Bag-of-Words implementation so that "dog" and "fox" appear among the top-5 words.

In [None]:
import pandas as pd

# 5 documents
documents = [
    "The quick brown fox jumps over the lazy dog. The dog, which was very lazy, did not even try to get up.",
    "In fact, it just lay there and watched as the fox ran around. The fox was very happy and excited because it loved to play.",
    "Every time the fox jumped, the dog would just look at it and think about how nice it would be to join in on the fun.",
    "However, the dog was too tired to move or play. It was a beautiful day outside, but the dog preferred to stay inside and relax.",
    "The sun was shining brightly, and there were many birds singing in the trees. Despite all of this, the dog remained in its comfortable spot."
]

# EX 1 - [YOUR CODE HERE] call the count vectorizer and create Bag-of-Words for the documents
count_vectorizer = ...
X_bow = ...


# Visualization -- you don't need to touch this
df_bow = pd.DataFrame(X_bow.toarray(), columns=count_vectorizer.get_feature_names_out())
print("Bag-of-Words Representation:")
print(df_bow.sum().nlargest(5))

In [None]:
# EX 2 - [YOUR CODE HERE] call the count vectorizer and create Bag-of-Words for the documents, with necessary modification.
# Hint 1: look at the top-5 frequency in the exercise above. Do you want to remove some of them?
# Hint 2: you need one parameter from this class: https://scikit-learn.org/1.5/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
count_vectorizer_mod = ...
X_bow_mod = ...

# Visualization -- you don't need to touch this
df_bow_mod = pd.DataFrame(X_bow_mod.toarray(), columns=count_vectorizer_mod.get_feature_names_out())
print("Bag-of-Words Representation:")
print(df_bow_mod.sum().nlargest(5))

## huggingface transformers

__Credits for this section__

<img src="https://huggingface.co/front/assets/huggingface_logo-noborder.svg" width="100" align="left"/>

Materials taken from
- https://huggingface.co/docs/transformers/notebooks
- https://huggingface.co/course
- https://huggingface.co/docs/transformers/quicktour

### A brief history

**2017: Introduction of the transformer: "Attention Is All You Need" (Vaswani et al., 2017)**



**followed by...**

<img src="https://user-images.githubusercontent.com/13711052/146187951-5897600f-0c03-4816-8f6b-2e5487543c47.png" width="100%">

**Main ingredients for the breakthrough in NLP**

<img src="https://user-images.githubusercontent.com/13711052/146187962-259ab2e8-4df4-4abf-8b90-d85b8948e1ca.png" width="100%">

**The modern model training paradigm**

<img src="https://user-images.githubusercontent.com/13711052/146187969-fe0e37ef-8c73-495b-811c-133c8ea2611b.png" width="100%">

**Works for vision too**

<img src="https://user-images.githubusercontent.com/13711052/146187977-47f78167-ee42-4f7a-8989-a8a02a20ea73.png" width="100%">

**Including multiple modalities**

<img src="https://user-images.githubusercontent.com/13711052/146187985-6ef1a57f-974b-475d-820e-677cfee3032c.png" width="100%">

**Transformers are now everywhere**

<img src="https://user-images.githubusercontent.com/13711052/146187988-6fc51ea3-6af4-4976-81b9-d51ef0bca113.png" width="50%">

### Introducing huggingface

The wild west of open source

<img src="https://user-images.githubusercontent.com/13711052/146187999-ecf16bff-e0ec-4d4a-9378-e491295761bc.png" width="100%">

Hugging Face, a company that first built a chat app for bored teens, provides open-source NLP technologies. In 2019 it raised $15 million to build a definitive NLP library. From its chat app to this day, Hugging Face has been able to swiftly develop language processing expertise. The company’s aim is to advance NLP and democratize it for use by everyone.

> We're on a journey to advance and democratize artificial intelligence through open source and open science.

**The huggingface ecosystem**

<img src="https://user-images.githubusercontent.com/13711052/146188011-f5c5f3f3-0156-43ee-be4c-71c2e3310801.png" width="100%">

**Model hub**: https://huggingface.co/models

<img src="https://user-images.githubusercontent.com/13711052/146188034-a592840b-5675-4a15-8589-4c752729090e.png" width="100%">



**Dataset hub**: https://huggingface.co/datasets

<img src="https://user-images.githubusercontent.com/13711052/146188024-f8146eb1-2d9d-47be-b5c2-fa388e48f705.png" width="100%">

### huggingface quick tour

Let's have a quick look at the 🤗 Transformers library features. The library downloads pretrained models for Natural
Language Understanding (NLU) tasks, such as analyzing the sentiment of a text, and Natural Language Generation (NLG),
such as completing a prompt with new text or translating into another language.

First we will see how to easily leverage the pipeline API to quickly use those pretrained models at inference. Then, we
will dig a little bit more and see how the library gives you access to those models and helps you preprocess your data.

#### Getting started on a task with a pipeline

The easiest way to use a pretrained model on a given task is to use `pipeline`.

In [None]:
import fastai
import transformers

🤗 Transformers provides the following tasks out of the box:

- __Sentiment analysis:__ is a text positive or negative?
- __Text generation (in English):__ provide a prompt and the model will generate what follows.
- __Named entity recognition (NER):__ in an input sentence, label each word with the entity it represents (person, place,
  etc.)
- __Question answering:__ provide the model with some context and a question, extract the answer from the context.
- __Filling masked text:__ given a text with masked words (e.g., replaced by `[MASK]`), fill the blanks.
- __Summarization:__ generate a summary of a long text.
- __Translation:__ translate a text into another language.
- __Feature extraction:__ return a tensor representation of the text.

In [None]:
from transformers import pipeline
classifier = pipeline('sentiment-analysis')

When typing this command for the first time, a pretrained model and its tokenizer are downloaded and cached. We will
look at both later on, but as an introduction the tokenizer's job is to preprocess the text for the model, which is
then responsible for making predictions. 

The pipeline groups all of that together, and post-processes the predictions to
make them readable. For instance:

In [None]:
classifier('We are very happy to show you the 🤗 Transformers library.')

That's encouraging! You can use it on a list of sentences, which will be preprocessed then fed to the model, returning
a list of dictionaries like this one:

In [None]:
results = classifier(["We are very happy to show you the 🤗 Transformers library.",
                      "We hope you don't hate it."])
for result in results:
    print(f"label: {result['label']}, with score: {round(result['score'], 4)}")

You can see the second sentence has been classified as negative (it needs to be positive or negative) but its score is
fairly neutral.

To compute predictions for a large dataset, look at [iterating over a pipeline](https://huggingface.co/transformers/main_classes/pipelines.html).


**German data**

By default, the model downloaded for this pipeline is called "distilbert-base-uncased-finetuned-sst-2-english". We can
look at its [model page](https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english) to get more
information about it. It uses the [DistilBERT architecture](https://huggingface.co/transformers/model_doc/distilbert.html) and has been fine-tuned on a dataset called SST-2 for the sentiment analysis task.

Let's say we want to use another model; for instance, one that has been trained on German data. We can search through
the [model hub](https://huggingface.co/models) that gathers models pretrained on a lot of data by research labs, but
also community models (usually fine-tuned versions of those big models on a specific dataset). Applying the tags
"de" and "text-classification" gives back a suggestion "nlptown/bert-base-multilingual-uncased-sentiment". Let's
see how we can use it.

You can directly pass the name of the model to use to `pipeline`:

In [None]:
classifier = pipeline('sentiment-analysis', model="nlptown/bert-base-multilingual-uncased-sentiment", top_k=None)

In [None]:
classifier(["Topics in Data Science 2 ist super!", "Bayern München ist echt mies", "Dies ist ein neutraler Beispieltext"])

This classifier can now deal with texts in English, French, but also Dutch, German, Italian and Spanish! You can also
replace that name by a local folder where you have saved a pretrained model (see below). 



**Building a Pipeline Step-by-Step**

You can also pass a model object and its associated tokenizer. We will need two classes for this. The first is `AutoTokenizer`, which we will use to download the tokenizer associated with the model we picked and instantiate it. The second is
the model itself. Note that if we were using the library on another task, the class of the model would change. The
[task summary](https://huggingface.co/transformers/task_summary.html) tutorial summarizes which class is used for which task.

In [None]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification

Now, to download the model and tokenizer we found previously, we just have to call the
`from_pretrained` method of `AutoModelForSequenceClassification` and `AutoTokenizer` (feel free to replace `model_name` with
any other model from the model hub):

In [None]:
model_name = "nlptown/bert-base-multilingual-uncased-sentiment"
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
classifier = pipeline('sentiment-analysis', model=model, tokenizer=tokenizer)

If you don't find a model that has been pretrained on some data similar to yours, you will need to fine-tune a
pretrained model on your data. We provide [example scripts](https://huggingface.co/docs/transformers/training) to do so. Once you're done, don't forget
to share your fine-tuned model on the hub with the community, using [this tutorial](https://huggingface.co/transformers/model_sharing.html).

#### Under the hood: pretrained models

As we saw, the model and tokenizer are created using the `from_pretrained` method:

In [None]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
model_name = "distilbert-base-uncased-finetuned-sst-2-english"
pt_model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

 __a) Using the tokenizer__

We mentioned the tokenizer is responsible for the preprocessing of your texts. 
* First, it will **split a given text in words** (or part of words, punctuation symbols, etc.) usually called *tokens*. There are multiple rules that can govern that process (you can learn more about them in the [tokenizer summary](https://huggingface.co/transformers/tokenizer_summary.html)), which is why we need to instantiate the tokenizer using the name of the model, to make sure we use the same rules as when the model was pretrained.

* The second step is to convert those **tokens into numbers**, to be able to build a tensor out of them and feed them to
the model. To do this, the tokenizer has a *vocab*, which is the part we download when we instantiate it with the
`from_pretrained` method, since we need to use the same *vocab* as when the model was pretrained.
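
The two steps can be illustrated with a toy tokenizer. This is a sketch under stated assumptions: the ten-entry `vocab` is made up, and the splitting rule is a simplified greedy longest-match scheme with a `##` continuation prefix, in the spirit of the WordPiece tokenization used by BERT-style models. The real tokenizer ships a far larger vocabulary and more normalization rules.

```python
# Toy vocabulary (hypothetical; a real model ships its own, much larger vocab)
vocab = {"[UNK]": 0, "[CLS]": 1, "[SEP]": 2, "we": 3, "are": 4, "happy": 5,
         "token": 6, "##izer": 7, "##s": 8, ".": 9}

def split_into_subwords(word, vocab):
    # Greedy longest-match-first: take the longest piece found in the vocab,
    # then continue on the remainder, marked with the "##" prefix.
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            piece = word[start:end] if start == 0 else "##" + word[start:end]
            if piece in vocab:
                break
            end -= 1
        if end == start:  # no known piece: treat the whole word as unknown
            return ["[UNK]"]
        pieces.append(piece)
        start = end
    return pieces

def encode(text, vocab):
    # Step 1: split the text into tokens
    words = text.lower().replace(".", " .").split()
    tokens = ["[CLS]"]
    for word in words:
        tokens.extend(split_into_subwords(word, vocab))
    tokens.append("[SEP]")
    # Step 2: convert tokens into numbers using the vocab
    return tokens, [vocab[token] for token in tokens]

tokens, input_ids = encode("We are happy tokenizers.", vocab)
print(tokens)
print(input_ids)
```

The word "tokenizers" is not in the vocabulary as a whole, so it is split into the known pieces `token`, `##izer` and `##s`. Using a different vocabulary at inference time would produce different ids, which is why the tokenizer is loaded with the same name as the model.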

To apply these steps on a given text, we can just feed it to our tokenizer:

In [None]:
inputs = tokenizer("We are very happy to show you the 🤗 Transformers library.")

This returns a dictionary mapping strings to lists of ints. It contains the [ids of the tokens](https://huggingface.co/transformers/glossary.html#input-ids), as
mentioned before, but also additional arguments that will be useful to the model. Here for instance, we also have an
[attention mask](https://huggingface.co/transformers/glossary.html#attention-mask) that the model will use to have a better understanding of the
sequence:

In [None]:
print(inputs)

You can pass a list of sentences directly to your tokenizer. If your goal is to send them through your model as a
batch, you probably want to pad them all to the same length, truncate them to the maximum length the model can accept
and get tensors back. You can specify all of that to the tokenizer:

In [None]:
pt_batch = tokenizer(
    ["We are very happy to show you the 🤗 Transformers library.", "We hope you don't hate it."],  # Text inputs
    padding=True,  # Equalize token lengths
    truncation=True,  # Cut off excess tokens
    max_length=512,  # Limit to 512 tokens
    return_tensors="pt"  # Output as PyTorch tensors
)

The padding is automatically applied on the side expected by the model (in this case, on the right), with the padding
token the model was pretrained with. The attention mask is also adapted to take the padding into account:

In [None]:
for key, value in pt_batch.items():
    print(f"{key}: {value.numpy().tolist()}")
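
What padding and the attention mask do can be sketched in a few lines. The helper `pad_batch` and the token ids below are made up for illustration, and `PAD_ID = 0` is an assumption. A real tokenizer uses the padding token of its own vocabulary and also takes care of special tokens when truncating, which this naive slicing does not.

```python
PAD_ID = 0  # ASSUMPTION: id of the padding token in this toy setup

def pad_batch(sequences, max_length, pad_id=PAD_ID):
    # Truncate to max_length, then right-pad to the longest remaining sequence
    truncated = [seq[:max_length] for seq in sequences]
    width = max(len(seq) for seq in truncated)
    input_ids = [seq + [pad_id] * (width - len(seq)) for seq in truncated]
    # 1 marks a real token, 0 marks padding the model should ignore
    attention_mask = [[1] * len(seq) + [0] * (width - len(seq))
                      for seq in truncated]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Two already-tokenized sentences of different lengths (hypothetical ids)
batch = pad_batch([[101, 7, 8, 9, 102], [101, 7, 102]], max_length=512)
for key, value in batch.items():
    print(f"{key}: {value}")
```

After padding, both rows have the same length, so they can be stacked into one rectangular tensor, and the mask tells the model which positions carry no content.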

You can learn more about tokenizers [here](https://huggingface.co/transformers/preprocessing.html).

__b) Using the model__

Once your input has been preprocessed by the tokenizer, you can send it directly to the model. As we mentioned, it will
contain all the relevant information the model needs. For a PyTorch model, you need to [unpack](https://geekflare.com/python-unpacking-operators/) the dictionary by adding `**`.

In [None]:
pt_outputs = pt_model(**pt_batch)

In 🤗 Transformers, all outputs are objects that contain the model's final activations along with other metadata. These
objects are described in greater detail [here](https://huggingface.co/transformers/main_classes/output.html). For now, let's inspect the output ourselves:

In [None]:
print(pt_outputs)

Notice how the output object has a `logits` attribute. You can use this to access the model's final activations.

> **NOTE:** All 🤗 Transformers models (PyTorch or TensorFlow) return the activations of the model **before** the final activation
> function (like SoftMax) since this final activation function is often fused with the loss.

Let's apply the SoftMax activation to get predictions.

In [None]:
from torch import nn
pt_predictions = nn.functional.softmax(pt_outputs.logits, dim=-1)

We can see we get the numbers from before:

In [None]:
print(pt_predictions)
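
For intuition, the SoftMax step can be reproduced without PyTorch. This is a minimal sketch on made-up logits (the values are assumptions, not the model's actual output): each row of logits is exponentiated and divided by the row sum, so the results are positive and add up to one.

```python
import math

def softmax(logits):
    # Subtract the maximum for numerical stability; the result is unchanged
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for two sentences and two classes (NEGATIVE, POSITIVE)
logits = [[-4.0, 4.0], [0.1, -0.1]]
probabilities = [softmax(row) for row in logits]

for row in probabilities:
    print([round(p, 4) for p in row])
```

A large gap between the two logits gives a probability close to one, while nearly equal logits give a probability close to one half, which matches the "fairly neutral" score observed for the second sentence earlier.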

If you provide the model with labels in addition to inputs, the model output object will also contain a `loss`
attribute:

In [None]:
import torch
pt_outputs = pt_model(**pt_batch, labels = torch.tensor([1, 0]))
print(pt_outputs)
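
For a single-label classification head, the reported loss is typically a cross-entropy between the logits and the labels, averaged over the batch. The sketch below computes that quantity by hand on made-up logits, as an illustration of the formula rather than of the library's internals.

```python
import math

def cross_entropy(logits, labels):
    losses = []
    for row, label in zip(logits, labels):
        # -log(softmax(row)[label]), computed via a stable log-sum-exp
        largest = max(row)
        log_sum_exp = largest + math.log(sum(math.exp(x - largest) for x in row))
        losses.append(log_sum_exp - row[label])
    return sum(losses) / len(losses)  # mean over the batch

# Hypothetical logits for two sentences and two classes
logits = [[-4.0, 4.0], [0.1, -0.1]]

loss_correct = cross_entropy(logits, labels=[1, 0])  # labels agree with the logits
loss_wrong = cross_entropy(logits, labels=[0, 1])    # labels contradict the logits

print(f"loss with matching labels:      {loss_correct:.4f}")
print(f"loss with contradicting labels: {loss_wrong:.4f}")
```

The loss is small when the labels agree with the class that has the larger logit and grows when they contradict it, which is the signal a training loop uses to update the weights.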

Models are standard [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) or [tf.keras.Model](https://www.tensorflow.org/api_docs/python/tf/keras/Model) so you can use them in your usual training loop. 🤗
Transformers also provides a `Trainer` (or `TFTrainer` if you are using
TensorFlow) class to help with your training (taking care of things such as distributed training, mixed precision,
etc.). See the [training tutorial](https://huggingface.co/transformers/training.html) for more details.

Once your model is fine-tuned, you can save it with its tokenizer in the following way:

In [None]:
pt_save_directory = './pt_save_pretrained'
tokenizer.save_pretrained(pt_save_directory)
pt_model.save_pretrained(pt_save_directory)

You can then load this model back using the `AutoModel.from_pretrained` method by passing the
directory name instead of the model name. One cool feature of 🤗 Transformers is that you can easily switch between
PyTorch and TensorFlow: any model saved as before can be loaded back either in PyTorch or TensorFlow.

Use the corresponding Auto class to load it like this:

In [None]:
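from transformers import AutoModel
tokenizer = AutoTokenizer.from_pretrained(pt_save_directory)
# AutoModel restores the base model only (no classification head);
# use AutoModelForSequenceClassification to reload the head as well.
reloaded_model = AutoModel.from_pretrained(pt_save_directory)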

Lastly, you can also ask the model to return all hidden states and all attention weights if you need them:

In [None]:
pt_outputs = pt_model(**pt_batch, output_hidden_states=True, output_attentions=True)
all_hidden_states  = pt_outputs.hidden_states 
all_attentions = pt_outputs.attentions

#### Accessing the code

The `AutoModel` and `AutoTokenizer` classes are just shortcuts that will automatically work with any pretrained model. Behind the scenes, the library has one model class per combination of architecture plus task, so the code is easy to access and tweak if you need to.


In our previous example, the model was called "distilbert-base-uncased-finetuned-sst-2-english", which means it's using
the [DistilBERT](https://huggingface.co/transformers/model_doc/distilbert.html) architecture. As
`AutoModelForSequenceClassification` (or
`TFAutoModelForSequenceClassification` if you are using TensorFlow) was used, the model
automatically created is then a `DistilBertForSequenceClassification`. You can look at its
documentation for all details relevant to that specific model, or browse the source code. This is how you would
directly instantiate model and tokenizer without the auto magic:


In [None]:
from transformers import DistilBertTokenizer, DistilBertForSequenceClassification
model_name = "distilbert-base-uncased-finetuned-sst-2-english"
model = DistilBertForSequenceClassification.from_pretrained(model_name)
tokenizer = DistilBertTokenizer.from_pretrained(model_name)

#### Customizing the model

**Create a model with custom configuration.** 

If you want to change how the model itself is built, you can define a custom configuration class. Each architecture comes with its own relevant configuration. For example, `DistilBertConfig` allows you to specify
parameters such as the hidden dimension, dropout rate, etc. for DistilBERT. If you do core modifications, like changing
the hidden size, you won't be able to use a pretrained model anymore and will need to train from scratch. You would
then instantiate the model directly from this configuration.

Below, we load a predefined vocabulary for a tokenizer with the
`DistilBertTokenizer.from_pretrained` method. However, unlike the tokenizer, we wish to initialize
the model from scratch. Therefore, we instantiate the model from a configuration instead of using the
`DistilBertForSequenceClassification.from_pretrained` method.

In [None]:
from transformers import DistilBertConfig, DistilBertTokenizer, DistilBertForSequenceClassification


config = DistilBertConfig(
    n_heads=8,       # Set the number of attention heads
    dim=512,         # Define the dimensionality of hidden layers
    hidden_dim=4*512 # Set the dimension of the feed-forward layer (4 times the hidden layer dimension)
)

tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
model = DistilBertForSequenceClassification(config)

**Changing the number of labels.** 

For something that only changes the head of the model (for instance, the number of labels), you can still use a
pretrained model for the body. For instance, let's define a classifier for 10 different labels using a pretrained body.
Instead of creating a new configuration with all the default values just to change the number of labels, we can instead
pass any argument a configuration would take to the `from_pretrained` method and it will update the default
configuration appropriately:

In [None]:
from transformers import DistilBertConfig, DistilBertTokenizer, DistilBertForSequenceClassification
model_name = "distilbert-base-uncased"
model = DistilBertForSequenceClassification.from_pretrained(model_name, num_labels=10)
tokenizer = DistilBertTokenizer.from_pretrained(model_name)

### Building an app

Example from https://huggingface.co/spaces/gradio/gpt-neo

In [None]:
import gradio as gr

title = "GPT-Neo Demo"
description = "demo for GPT-Neo by EleutherAI for text generation. To use it, simply add your text, or click one of the examples to load them. Read more at the links below."
article = "<p style='text-align: center'><a href='http://github.com/eleutherai/gpt-neo'>GPT-Neo: Large Scale Autoregressive Language Modeling with Mesh-Tensorflow</a></p>"

examples = [
    ['The tower is 324 metres (1,063 ft) tall,'],
    ["The Moon's orbit around Earth has"],
    ["The smooth Borealis basin in the Northern Hemisphere covers 40%"]
]

iface = gr.load("huggingface/EleutherAI/gpt-neo-1.3B", 
                inputs=gr.components.Textbox(lines=5, label="Input Text"),
                title=title,
                description=description,
                article=article, 
                examples=examples)


In [None]:
iface.launch(share=True)

# Mentimeter

<img src="images/d3.png" style="width:50%; float:center;" />