# Lithology classification using Hugging Face, part 2

- toc: true 
- badges: true
- comments: true
- categories: [hugging-face, NLP, lithology]
- author: J-M


# About

This is a continuation of [Lithology classification using Hugging Face, part 1](https://jmp75.github.io/work-blog/hugging-face/nlp/lithology/2022/06/01/lithology-classification-hugging-face.html).

We saw in the previous post that most of the Namoi lithology logs have a primary (major) lithology code. A substantial proportion were nevertheless coded `None`, despite descriptions that would obviously lead to a categorisation. There were also many labels, with a long-tailed frequency histogram.

# Kernel installation

Note that `mamba` does not seem to work as a replacement for `conda install -c huggingface -c conda-forge datasets`: it returns immediately with no output that I can see. A pity, because `conda install` takes minutes to solve the environment.

```sh
conda install --force-reinstall mamba -c conda-forge
# led to conflicts! Probably related to my installing grayskull and shyaml recently in the base environment. This seems to prevent the installation of the latest version of mamba
```

```text
UnsatisfiableError: The following specifications were found to be incompatible with a past
explicit spec that is not an explicit spec in this operation (openssl):

  - mamba -> conda[version='4.6.*,<4.13.0|4.6.*|4.7.*,<4.13.0|>=4.7.12,<4.13.0|>=4.8,<4.13.0|>=4.7.12,<4.8']
  - mamba -> openssl[version='>=1.1.1f,<1.1.2a|>=1.1.1g,<1.1.2a|>=1.1.1h,<1.1.2a|>=1.1.1i,<1.1.2a|>=1.1.1j,<1.1.2a|>=1.1.1k,<1.1.2a|>=1.1.1l,<1.1.2a|>=1.1.1n,<1.1.2a|>=1.1.1o,<1.1.2a']
  - mamba -> openssl[version='>=1.1.1o,<1.1.2a'] -> ca-certificates
  - mamba -> pypy3.7[version='>=7.3.7'] -> openssl[version='1.0.*|>=1.0.2o,<1.0.3a|>=1.0.2p,<1.0.3a|>=1.1.1a,<1.1.2a|>=1.1.1e,<1.1.2a|>=3.0.0,<4.0a0|>=3.0.3,<4.0a0|>=3.0.2,<4.0a0|>=1.1.1d,<1.1.2a']

The following specifications were found to be incompatible with each other:

Output in format: Requested package -> Available versions
```

I first tried `conda remove shyaml grayskull conda-build`. In the end I figured out that I needed `conda install -c conda-forge conda=4.12.0`.



```sh
myenv=hf
mamba create -n $myenv python=3.9 -c conda-forge
mamba install -n $myenv --yes ipykernel matplotlib sentencepiece scikit-learn -c conda-forge
mamba install -n $myenv --yes pytorch=1.11 -c pytorch -c nvidia -c conda-forge
mamba install -n $myenv --yes torchvision torchaudio -c pytorch -c nvidia -c conda-forge
mamba install -n $myenv --yes -c huggingface -c conda-forge datasets transformers 
conda activate $myenv
python -m ipykernel install --user --name $myenv --display-name "Hugging Face"
# /home/per202/src/learning-notes/ml/fastai/log.md
cd /home/per202/src/learning-notes/ml/fastai
# OPTIONAL? mamba install -n $myenv -c pytorch -c conda-forge --file nbdev.txt
```

```bat
set myenv=hf
mamba create -n %myenv% python=3.9 -c conda-forge
mamba install -n %myenv% --yes ipykernel matplotlib sentencepiece scikit-learn -c conda-forge
mamba install -n %myenv% --yes pytorch=1.11 -c pytorch -c nvidia -c conda-forge
mamba install -n %myenv% --yes torchvision torchaudio -c pytorch -c nvidia -c conda-forge
mamba install -n %myenv% --yes -c huggingface -c conda-forge datasets transformers 
conda activate %myenv%
python -m ipykernel install --user --name %myenv% --display-name "Hugging Face"
REM /home/per202/src/learning-notes/ml/fastai/log.md
REM cd /home/per202/src/learning-notes/ml/fastai
REM OPTIONAL? mamba install -n %myenv% -c pytorch -c conda-forge --file nbdev.txt
```

```sh
. /home/per202/config/baseconda 
conda env list
conda activate hf
rm tokz.log ;  nsntrace -o tokz.log python ./tokz.py 
```

Use ctrl-c to end at any time.
Downloading: 100%|| 52.0/52.0 [00:00<00:00, 106kB/s]
Downloading: 100%|| 578/578 [00:00<00:00, 753kB/s]
Downloading: 100%|| 2.35M/2.35M [00:00<00:00, 3.91MB/s]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Finished capturing 3454 packets.

In [None]:
#hide
import pandas as pd
from pathlib import Path
fn =  Path('~').expanduser() / "data/ela/shp_namoi_river/NGIS_LithologyLog.csv"
litho_logs = pd.read_csv(fn, dtype={'FromDepth': str, 'ToDepth': str, 'MajorLithCode': str, 'MinorLithCode': str})
MAJOR_CODE='MajorLithCode'
MINOR_CODE='MinorLithCode'
DESC='Description'

#from ela.textproc import token_freq, plot_freq

from collections import Counter
def token_freq(tokens, n_most_common = 50):
    list_most_common=Counter(tokens).most_common(n_most_common)
    return pd.DataFrame(list_most_common, columns=["token","frequency"])

def plot_freq(dataframe, y_log = False, x='token', figsize=(15,10), fontsize=14):
    """Plot a sorted histogram of word frequencies

    Args:
        dataframe (pandas dataframe): frequency of tokens, typically with colnames ["token","frequency"]
        y_log (bool): should there be a log scale on the y axis
        x (str): name of the column with the tokens (i.e. words)
        figsize (tuple): figure size (width, height) in inches
        fontsize (int): font size of the tick labels

    Returns:
        barplot: plot

    """
    p = dataframe.plot.bar(x=x, figsize=figsize, fontsize=fontsize)
    if y_log:
        # `nonposy` was renamed `nonpositive` in matplotlib 3.3
        p.set_yscale("log", nonpositive='clip')
    return p


litho_classes=litho_logs[MAJOR_CODE].values
df_most_common= token_freq(litho_classes, 50)
plot_freq(df_most_common)

In [None]:
import torch

from datasets import Dataset,DatasetDict

from transformers import AutoModelForSequenceClassification,AutoTokenizer

import numpy as np


* For the sake of applying HF, can I reduce the number of target labels?
* Unbalanced data sets: [8 Tactics to Combat Imbalanced Classes in Your Machine Learning Dataset](https://machinelearningmastery.com/tactics-to-combat-imbalanced-classes-in-your-machine-learning-dataset/)


In [None]:
def sample_desc_for_code(major_code, n=50, seed=None):
    is_code = litho_logs[MAJOR_CODE] == major_code
    coded = litho_logs.loc[is_code][DESC]
    # use the requested sample size, and a local seed rather than the global numpy one
    return coded.sample(n=n, random_state=seed)

In [None]:
sample_desc_for_code('UNKN', seed=123)

The "unknown" category is rather interesting in fact, and worth keeping as a valid class.

## Subsetting

Let's keep "only" the main labels, for the sake of this exercise.

In [None]:
labels_kept = df_most_common['token'][:17].values
labels_kept = labels_kept[labels_kept != 'None']
labels_kept

In [None]:
kept = [x in labels_kept for x in litho_classes]

In [None]:
litho_logs_kept = litho_logs[kept].copy()  # copy, so that adding columns later does not trigger SettingWithCopyWarning
litho_logs_kept.sample(10)

In [None]:
MAJOR_CODE_INT='MajorLithoCodeInt'
from datasets import ClassLabel
labels = ClassLabel(names=labels_kept)
litho_logs_kept[MAJOR_CODE_INT] = [labels.str2int(x) for x in litho_logs_kept[MAJOR_CODE].values]

## Class imbalance

Even our subset of 16 classes is rather imbalanced: just by eyeballing the histogram, the "clay" label is more than 30 times as frequent as "coal".

The post by Jason Brownlee, [8 Tactics to Combat Imbalanced Classes in Your Machine Learning Dataset](https://machinelearningmastery.com/tactics-to-combat-imbalanced-classes-in-your-machine-learning-dataset), lists several ways to address this. One of them is to resample from labels, perhaps with replacement, to equalise classes. It is a relatively easy approach to implement, but its problems grow with the level of imbalance.

The video [Simple Training with the 🤗 Transformers Trainer (at 669 seconds)](https://youtu.be/u--UVvH-LIQ?t=669) also explains the issues with imbalances and crude resampling. It offers instead a class-weighting solution that is more robust. That approach is mentioned in Jason's post, but the video has a "Hugging Face style" implementation ready to repurpose.

### Resample with replacement

Just for information, what we'd do with a relatively crude resampling may be:


In [None]:
def sample_major_lithocode(dframe, code, n=10000, seed=None):
    x = dframe[dframe[MAJOR_CODE] == code]
    replace = n > len(x)
    return x.sample(n=n, replace=replace, random_state=seed)

In [None]:
sample_major_lithocode(litho_logs_kept, 'CLAY', n=10, seed=0)

In [None]:
balanced_litho_logs = [sample_major_lithocode(litho_logs_kept, code, n=10000, seed=0) for code in labels_kept]
balanced_litho_logs = pd.concat(balanced_litho_logs)
balanced_litho_logs.head()

In [None]:
plot_freq(token_freq(balanced_litho_logs[MAJOR_CODE].values, 50))
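
To get a feel for why crude resampling gets worse as the imbalance grows, here is a minimal sketch that counts how many distinct rows survive when a small class is oversampled with replacement. The class sizes are made up for illustration (they are not the Namoi figures), and `distinct_fraction` is a hypothetical helper, not something from the notebook.

```python
import random

def distinct_fraction(class_size, n_samples, seed=0):
    """Fraction of the n_samples rows, drawn with replacement, that are distinct."""
    rng = random.Random(seed)
    drawn = [rng.randrange(class_size) for _ in range(n_samples)]
    return len(set(drawn)) / n_samples

# Hypothetical class sizes, all smaller than the 10000 rows requested per class,
# i.e. the cases where the resampling above has to draw with replacement.
for size in (100, 1000, 5000):
    print(size, round(distinct_fraction(size, 10000), 3))
```

The smaller the class, the more the "balanced" data set consists of repeated copies of the same few descriptions, which is presumably where the risk of overfitting the rare classes comes from.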

### Dealing with imbalanced classes with weights



In [None]:
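# Count by integer code, so that the order of the weights matches the label indices seen by the loss function
sorted_counts = litho_logs_kept[MAJOR_CODE_INT].value_counts().sort_index()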
sorted_counts

In [None]:
sorted_counts / sorted_counts.sum()

In [None]:
class_weights = (1 - sorted_counts / sorted_counts.sum()).values

In [None]:
class_weights
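
To make the weighting formula concrete, here is a toy sketch with made-up counts for three hypothetical classes. It applies the same "one minus relative frequency" rule as the cell above, in plain Python.

```python
# Made-up counts for three hypothetical classes (not the Namoi figures).
counts = {"CLAY": 900, "SAND": 90, "COAL": 10}

total = sum(counts.values())
# Same rule as above: weight = 1 - relative frequency of the class.
weights = {code: 1 - n / total for code, n in counts.items()}

for code, w in weights.items():
    print(f"{code}: {w:.2f}")
```

Note that with this rule all weights fall between 0 and 1, so two rare classes end up with similar weights even if one is ten times rarer than the other. Other schemes (inverse frequency, for instance) separate them more; which one works best here is something I have not tested.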

In [None]:
torch.cuda.is_available()

In [None]:
class_weights = torch.from_numpy(class_weights).float().to("cuda")
class_weights

In [None]:
model_nm = 'microsoft/deberta-v3-small'

Note: once, when running the cell below, it did not complete. It got stuck for 500 seconds, and when I interrupted the execution the stack trace showed it stuck at `sock.connect(sa)`. There seems to be a timeout at play, but it is not at all clear from the documentation of `from_pretrained` whether there is a way to specify it, or what its default is. Irritating.

'Active cell trusted. 20 of 52 cells trusted.'

I tried "Ctrl-Shift-C" then "Trust Notebook", but this had no benefit. `AutoTokenizer.from_pretrained` still spins forever without feedback to the user. Poor UX.

```text
CPU times: user 504 ms, sys: 57.9 ms, total: 562 ms
Wall time: 14min 13s
```
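
Since the stack trace points at a network connection, one thing that may be worth trying is the offline mode of the Hugging Face libraries, which is controlled by the `TRANSFORMERS_OFFLINE` and `HF_DATASETS_OFFLINE` environment variables. This is only a sketch: it assumes the files were already downloaded or saved locally in a previous session, and I have not verified that it removes the delay observed here.

```python
import os

# Assumption: tokenizer and model files are already cached or saved locally.
# These variables need to be set before `transformers` / `datasets` are imported.
os.environ["TRANSFORMERS_OFFLINE"] = "1"
os.environ["HF_DATASETS_OFFLINE"] = "1"

# Later, in the loading cell, local files can also be requested explicitly, e.g.:
# tokz = AutoTokenizer.from_pretrained('./tokz_pretrained', local_files_only=True)
```

With offline mode on, anything not yet cached will fail to load instead of being downloaded, so the first run still has to be done online.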


In [None]:
from pathlib import Path
p = Path('./tokz_pretrained')
if p.exists():
    tokz = AutoTokenizer.from_pretrained(p)
else:
    tokz = AutoTokenizer.from_pretrained(model_nm)
    tokz.save_pretrained('./tokz_pretrained')

Let's see what this does on a typical lithology description:

In [None]:
tokz.tokenize('CLAY, VERY SANDY')

Well, the vocabulary is probably case sensitive, and the all-uppercase descriptions in the source data are likely problematic. Let's check with lowercase:

In [None]:
tokz.tokenize('clay, very sandy')

This looks better. So let's change the descriptions to lowercase; we are not losing any relevant information in this case, I think.

In [None]:
litho_logs_kept[DESC] = litho_logs_kept[DESC].str.lower()

In [None]:
litho_logs_kept_mini = litho_logs_kept[[MAJOR_CODE_INT, DESC]]
litho_logs_kept_mini.sample(n=10)



In [None]:
# NLP for beginners
ds = Dataset.from_pandas(litho_logs_kept_mini)

In [None]:

# https://youtu.be/_BZearw7f0w?list=PLo2EIpI_JMQvWfQndUesu0nPBAtZ9gP1o&t=150
# Cheating a bit on guessing the length (max is 90 tokens)
max_length=128

def tok_func(x): 
    return tokz(x[DESC], padding='max_length', truncation=True, max_length=max_length, return_tensors='pt')

In [None]:
# tok_ds = ds.map(tok_func, batched=True) 
# Nope, no can do. 

In [None]:
tok_ds = ds.map(tok_func)

In [None]:
tok_ds_tmp = tok_ds[:5]
tok_ds_tmp.keys()

In [None]:
# check
len(tok_ds_tmp['input_ids'][0][0])
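
Rather than guessing `max_length`, the longest description can be measured directly. The sketch below is a minimal, standalone version of that check: `max_token_count` is a hypothetical helper, the descriptions are made up, and `str.split` is only a stand-in for `tokz.tokenize` so that the snippet runs without the tokenizer.

```python
def max_token_count(texts, tokenize):
    """Return the largest number of tokens produced for any of the texts."""
    return max(len(tokenize(t)) for t in texts)

# Made-up descriptions; in the notebook one would pass the lowercased
# descriptions and `tokz.tokenize` instead of `str.split`.
sample_descriptions = [
    "clay, very sandy",
    "sand",
    "gravel and coarse sand, water bearing",
]
longest = max_token_count(sample_descriptions, str.split)
print(longest)
```

Keep in mind that `tokz.tokenize` does not add the special tokens that a call to `tokz(...)` adds, so a small margin on top of the measured maximum is needed.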

Once again, the `from_pretrained` method takes ages to complete, yet the effective CPU time is minimal. No idea what is going on. 

So, better cache the model locally just in case this behavior persists.

```text
Downloading: 100%
273M/273M [00:25<00:00, 11.6MB/s]
CPU times: user 4.38 s, sys: 1.17 s, total: 5.55 s
Wall time: 19min 16s
```

In [None]:
p = Path('./model_pretrained')

As I write this by elaborating on Jeremy Howard's notebook, I should mention a misunderstanding I have about `num_labels`.

In [None]:
num_labels = len(labels_kept)

In [None]:
if p.exists():
    model = AutoModelForSequenceClassification.from_pretrained(p, num_labels=num_labels)
else:
    model = AutoModelForSequenceClassification.from_pretrained(model_nm, num_labels=num_labels)    
    model.save_pretrained(p)

In [None]:
print(type(model))

In [None]:
# https://youtu.be/1pedAIvTWXk?list=PLo2EIpI_JMQvWfQndUesu0nPBAtZ9gP1o&t=143
litho_desc_list = [x for x in litho_logs_kept_mini[DESC].values]
input_descriptions = tokz(litho_desc_list, padding=True, truncation=True, max_length=256, return_tensors='pt')
input_descriptions['input_ids'].shape

In [None]:
# model(input_descriptions['input_ids'][:5,:], attention_mask=input_descriptions['attention_mask'][:5,:]).logits

In [None]:
from transformers import TrainingArguments,Trainer

In [None]:
tok_ds

Transformers always assumes that the column holding your labels is named `labels`:

In [None]:
tok_ds = tok_ds.rename_columns({MAJOR_CODE_INT:'labels'})

In [None]:
tok_ds.set_format('torch')

In [None]:
bs = 128
epochs = 4

lr = 8e-5

In [None]:
args = TrainingArguments(output_dir='./litho_outputs', learning_rate=lr, warmup_ratio=0.1, lr_scheduler_type='cosine', fp16=True,
    evaluation_strategy="epoch", per_device_train_batch_size=bs, per_device_eval_batch_size=bs*2,
    num_train_epochs=epochs, weight_decay=0.01, report_to='none')

In [None]:
dds = tok_ds.train_test_split(0.25, seed=42)

In [None]:
# https://huggingface.co/docs/transformers/training

In [None]:
import numpy as np
from datasets import load_metric

In [None]:
import sklearn

`metric = load_metric("accuracy")` took an awfully long time:

```text
CPU times: user 92.7 ms, sys: 70.7 ms, total: 163 ms
Wall time: 6min 41s
```


In [None]:
# %%time
# metric = load_metric("accuracy")

# # ImportError: To be able to use accuracy, you need to install the following dependencies['sklearn'] using 'pip install sklearn' for instance'
# # (hf) per202@keywest-bm:~$ mamba install -c conda-forge scikit-learn

In [None]:
# from Jeremy's notebook:
# def compute_metrics(eval_pred):
#     logits, labels = eval_pred
#     predictions = np.argmax(logits, axis=-1)
#     return metric.compute(predictions=predictions, references=labels)

In [None]:
# Defining the Trainer to compute Custom Loss Function
class WeightedLossTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False):
        # Feed inputs to model and extract logits
        outputs = model(**inputs)
        logits = outputs.get("logits")
        # Extract Labels
        labels = inputs.get("labels")
        # Define loss function with class weights
        loss_func = torch.nn.CrossEntropyLoss(weight=class_weights)
        # Compute loss
        loss = loss_func(logits, labels)
        return (loss, outputs) if return_outputs else loss
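
To see what the class weights do inside the loss, here is a plain Python sketch of a weighted cross-entropy on toy numbers. It follows the weighted-mean form (each sample's loss is scaled by the weight of its true class, then divided by the sum of those weights), which is my understanding of what `torch.nn.CrossEntropyLoss` computes with a `weight` argument and the default reduction. The logits, labels and weights are made up.

```python
import math

def weighted_cross_entropy(logits, labels, weights):
    """Mean cross-entropy, each sample weighted by the weight of its true class."""
    total_loss, total_weight = 0.0, 0.0
    for row, label in zip(logits, labels):
        # Negative log-softmax of the true class, computed in a numerically stable way
        m = max(row)
        log_sum_exp = m + math.log(sum(math.exp(v - m) for v in row))
        nll = log_sum_exp - row[label]
        total_loss += weights[label] * nll
        total_weight += weights[label]
    return total_loss / total_weight

# Two samples, both predicted as class 0; the second one truly belongs to the rare class 1.
toy_logits = [[2.0, 0.0], [2.0, 0.0]]
toy_labels = [0, 1]
equal_loss = weighted_cross_entropy(toy_logits, toy_labels, [1.0, 1.0])
rare_loss = weighted_cross_entropy(toy_logits, toy_labels, [0.1, 0.9])
print(equal_loss, rare_loss)
```

With the higher weight on the rare class, the mistake on the second sample dominates the loss, which is the intended effect of the weighting.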


    

In [None]:
from sklearn.metrics import f1_score    

In [None]:
def compute_metrics(eval_pred):
    labels = eval_pred.label_ids
    predictions = eval_pred.predictions.argmax(-1)
    f1 = f1_score(labels, predictions, average="weighted")
    return {"f1": f1}
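
For reference, here is a plain Python sketch of what a support-weighted F1 score does on toy labels: the per-class F1 scores are averaged with weights equal to each class's share of the true labels. `weighted_f1` is a hypothetical helper written for illustration, and the labels are made up.

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with weights equal to each class's share of y_true."""
    support = Counter(y_true)
    total = 0.0
    for cls, n_true in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        n_pred = sum(1 for p in y_pred if p == cls)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
        total += f1 * n_true
    return total / len(y_true)

# Four samples of a common class, one of a rare class that is never predicted.
toy_true = [0, 0, 0, 0, 1]
toy_pred = [0, 0, 0, 0, 0]
score = weighted_f1(toy_true, toy_pred)
print(round(score, 4))
```

On this toy case the score stays fairly high even though the rare class is missed entirely, so with imbalanced classes it may be worth looking at per-class or macro-averaged scores as well.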

In [None]:
output_dir = "./hf_training"
batch_size = 128
epochs = 5
lr = 8e-5

In [None]:
training_args = TrainingArguments(output_dir = output_dir,
                                  num_train_epochs=epochs, 
                                  learning_rate=lr,
                                  lr_scheduler_type='cosine', 
                                  per_device_train_batch_size=batch_size,
                                  per_device_eval_batch_size=batch_size * 2,
                                  weight_decay=0.01,
                                  evaluation_strategy="epoch",
                                  logging_steps=len(dds['train']),
                                  fp16=True,
                                  push_to_hub=False,
                                  report_to='none')
    


In [None]:
model = model.to("cuda:0")
# otherwise trainer.train will give: 
# RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument index in method wrapper__index_select)

In [None]:
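# Use the subclass defined above, so that the class weights are actually applied in the loss
trainer = WeightedLossTrainer(model=model,
                              args=training_args,
                              train_dataset=dds['train'],
                              eval_dataset=dds['test'],
                              tokenizer=tokz,
                              compute_metrics=compute_metrics)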

# https://github.com/nlp-with-transformers/notebooks/issues/31#issuecomment-1075369210 ???
# old_collator = trainer.data_collator
# trainer.data_collator = lambda data: dict(old_collator(data))

# (hf) per202@keywest-bm:~/src/work-blog$ mamba list | grep hugging
# datasets                  2.2.2                      py_0    huggingface
# huggingface_hub           0.7.0                      py_0    huggingface
# sacremoses                master                     py_0    huggingface
# transformers              4.11.3                     py_0    huggingface


# mamba update -c conda-forge transformers datasets

In [None]:
trainer.train()


```text
ValueError: Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length.
```

OK. But where? This comes from `transformers/tokenization_utils_base`, so presumably it has something to do with the settings of the tokeniser. Besides, from googling for similar error messages, I am not convinced that following the suggestion blindly is the right approach.

The YouTube video has factorised labels (integer codes for strings).


In [None]:
dtrain = dds['train']

In [None]:
dtrain.features