# Disclaimer
Notice: This computer software was prepared by Battelle Memorial Institute, hereinafter the Contractor, under Contract No. DE-AC05-76RL01830 with the Department of Energy (DOE).  All rights in the computer software are reserved by DOE on behalf of the United States Government and the Contractor as provided in the Contract.  You are authorized to use this computer software for Governmental purposes but it is not to be released or distributed to the public.  NEITHER THE GOVERNMENT NOR THE CONTRACTOR MAKES ANY WARRANTY, EXPRESS OR IMPLIED, OR ASSUMES ANY LIABILITY FOR THE USE OF THIS SOFTWARE.  This Software is provided for use on Federally funded projects only.

# Example of a classification task

Preparatory steps:
1. Save a serialized model to the "data/06_models" directory. Here, we use SciBERT with continued pre-training on OSTI.
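2. Update the tokenizer name, if necessary, so that it matches the saved model. The cells below use `roberta-large`, and the loading messages report a `RobertaForSequenceClassification` model.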
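
Before running the cells, it can help to confirm that each model directory is in place. The check below is a minimal sketch: the helper name `check_model_dir` is hypothetical, and the file names it looks for (`config.json` plus `pytorch_model.bin` or `model.safetensors`) are the ones `save_pretrained` typically writes, which is an assumption about how the model was serialized.

```python
from pathlib import Path

# Weight file names commonly written by `save_pretrained` (assumption:
# the model was serialized that way and is not sharded).
WEIGHT_FILES = ("pytorch_model.bin", "model.safetensors")


def check_model_dir(model_dir):
    """Return a list of problems found in a serialized-model directory."""
    model_dir = Path(model_dir)
    if not model_dir.is_dir():
        return [f"missing directory: {model_dir}"]
    problems = []
    if not (model_dir / "config.json").is_file():
        problems.append("missing config.json")
    if not any((model_dir / name).is_file() for name in WEIGHT_FILES):
        problems.append("missing weight file")
    return problems
```

For example, `check_model_dir(project_dir / "data" / "06_models" / "Multiclass Classification")` would return an empty list when the directory looks complete.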

In [1]:
from pathlib import Path

import pandas as pd
from transformers import AutoConfig, AutoModelForSequenceClassification, AutoTokenizer, TextClassificationPipeline


# Assumes the notebook is run from a directory one level below the project root.
project_dir = Path.cwd().parent

## Multiclass (OSTI Labels)

In [2]:
model_path = str(project_dir / "data" / "06_models" / "Multiclass Classification")
tokenizer_name = "roberta-large"

In [3]:
config = AutoConfig.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(tokenizer_name, config=config)
model = AutoModelForSequenceClassification.from_pretrained(model_path, config=config)
pipe = TextClassificationPipeline(model=model, tokenizer=tokenizer)

Some weights of the model checkpoint at C:\Users\burk640\OneDrive - PNNL\Desktop\DUDE\nukelm\data\06_models\Multiclass Classification were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
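
The message above lists checkpoint tensors that have no counterpart in the classification model. As a rough illustration (not the actual `transformers` loading code), the comparison amounts to a set difference over parameter names. The two pooler names come from the message; the other key names are chosen only for illustration.

```python
def unused_checkpoint_keys(checkpoint_keys, model_keys):
    """Return checkpoint parameter names that the model has no slot for."""
    model_keys = set(model_keys)
    return [key for key in checkpoint_keys if key not in model_keys]


# Illustrative names only; a real checkpoint holds many more tensors.
checkpoint_keys = [
    "roberta.embeddings.word_embeddings.weight",
    "roberta.pooler.dense.weight",
    "roberta.pooler.dense.bias",
    "classifier.dense.weight",
]
model_keys = [
    "roberta.embeddings.word_embeddings.weight",
    "classifier.dense.weight",
]

unused = unused_checkpoint_keys(checkpoint_keys, model_keys)
print(unused)
```

Because only pooler weights are reported as unused, and the classifier weights are not reported as missing, the message is likely harmless here. That reading is an inference from the log, not something the notebook states.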


In [4]:
labels = pd.read_csv(project_dir / "data" / "06_models" / "osti-labels.csv", index_col=0)


def get_label_multi(label_no):
    """Return the title-cased OSTI category description for a class index."""
    label_row = labels.iloc[label_no]
    return label_row["Description"].title()
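
To make the lookup concrete, the sketch below maps a pipeline label such as `LABEL_1` to a description using a small in-memory table. The row order, the raw casing, and the helper name `describe` are invented stand-ins; the real descriptions come from `osti-labels.csv`, whose contents are not shown here. It uses the standard `csv` module instead of pandas only to stay dependency-free.

```python
import csv
import io

# Stand-in for osti-labels.csv: the row order and raw casing are assumptions.
LABELS_CSV = """,Description
0,PLASMA PHYSICS AND FUSION TECHNOLOGY
1,SPECIFIC NUCLEAR REACTORS AND ASSOCIATED PLANTS
"""

rows = list(csv.DictReader(io.StringIO(LABELS_CSV)))


def describe(pipeline_label):
    """Map a default pipeline label like "LABEL_1" to a title-cased description."""
    label_no = int(pipeline_label.split("_")[1])  # "LABEL_1" -> 1
    return rows[label_no]["Description"].title()


print(describe("LABEL_1"))
```

Note that `str.title()` capitalizes every word, which is why the outputs in this notebook read "And" rather than "and".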

In [5]:
sentence = "These calculations are based on a description of the stagnation phase where the rate of change of energy in the hot spot is determined by the competition of the energy losses due to thermal conduction and radiation with the heating due to compressional work and alpha particle energy deposition."  # noqa: E501

result = pipe(sentence)

result = pd.DataFrame(result)
result["label"] = result["label"].apply(lambda label: int(label.split("_")[1])).apply(get_label_multi)
result

| | label | score |
|---|---|---|
| 0 | Plasma Physics And Fusion Technology | 0.521805 |
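
The `score` column is the probability the pipeline assigns to its top class. For a multiclass model this is, by default, a softmax over the model's logits followed by picking the largest entry. The sketch below reproduces that step in plain Python. The logits are invented numbers, not values from this model, and the `LABEL_<n>` naming assumes the config keeps the default label names, as the parsing code in this notebook implies.

```python
import math


def softmax(logits):
    """Convert raw logits to probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_prediction(logits):
    """Return the best class in the pipeline's default output shape."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": f"LABEL_{best}", "score": probs[best]}


# Invented logits for a three-class model.
prediction = top_prediction([0.2, 1.5, 0.9])
print(prediction["label"], round(prediction["score"], 3))
```

Read this way, a score near 0.52 means that almost half of the probability mass is spread over the other classes.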


In [6]:
sentence = "The use of heavy water as the moderator is the key to the PHWR system, enabling the use of natural uranium as the fuel (in the form of ceramic UO2), which means that it can be operated without expensive uranium enrichment facilities."  # noqa: E501

result = pipe(sentence)

result = pd.DataFrame(result)
result["label"] = result["label"].apply(lambda label: int(label.split("_")[1])).apply(get_label_multi)
result

| | label | score |
|---|---|---|
| 0 | Specific Nuclear Reactors And Associated Plants | 0.958586 |


## Binary (Related to the Nuclear Fuel Cycle)

In [7]:
model_path = str(project_dir / "data" / "06_models" / "Binary Classification")
tokenizer_name = "roberta-large"

In [8]:
config = AutoConfig.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(tokenizer_name, config=config)
model = AutoModelForSequenceClassification.from_pretrained(model_path, config=config)
pipe = TextClassificationPipeline(model=model, tokenizer=tokenizer)

Some weights of the model checkpoint at C:\Users\burk640\OneDrive - PNNL\Desktop\DUDE\nukelm\data\06_models\Binary Classification were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [9]:
def get_label_binary(label_no):
    """Map class index 1 to "NFC-Related" and class index 0 to "Not NFC-Related"."""
    if label_no:
        label = "NFC-Related"
    else:
        label = "Not NFC-Related"
    return label

In [10]:
sentence = "These calculations are based on a description of the stagnation phase where the rate of change of energy in the hot spot is determined by the competition of the energy losses due to thermal conduction and radiation with the heating due to compressional work and alpha particle energy deposition."  # noqa: E501

result = pipe(sentence)

result = pd.DataFrame(result)
result["label"] = result["label"].apply(lambda label: int(label.split("_")[1])).apply(get_label_binary)
result

| | label | score |
|---|---|---|
| 0 | NFC-Related | 0.80213 |
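
The fusion sentence is labeled NFC-Related with a score of about 0.80, noticeably lower than the score for the reactor sentence in the next cell. If borderline cases matter, one option is to require a minimum score before accepting the positive label. The helper `label_with_threshold`, the "Uncertain" label, and the 0.9 cutoff are illustrative assumptions, not part of the original workflow.

```python
def label_with_threshold(prediction, threshold=0.9):
    """Return a binary label, flagging low-confidence positives as uncertain."""
    label_no = int(prediction["label"].split("_")[1])
    if not label_no:
        return "Not NFC-Related"
    if prediction["score"] >= threshold:
        return "NFC-Related"
    return "Uncertain"


# Predictions shaped like the pipeline output; scores copied from this notebook.
print(label_with_threshold({"label": "LABEL_1", "score": 0.80213}))
print(label_with_threshold({"label": "LABEL_1", "score": 0.990763}))
```

With this cutoff the first call yields "Uncertain" and the second "NFC-Related". A suitable threshold would need to be chosen on held-out data.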


In [11]:
sentence = "The use of heavy water as the moderator is the key to the PHWR system, enabling the use of natural uranium as the fuel (in the form of ceramic UO2), which means that it can be operated without expensive uranium enrichment facilities."  # noqa: E501

result = pipe(sentence)

result = pd.DataFrame(result)
result["label"] = result["label"].apply(lambda label: int(label.split("_")[1])).apply(get_label_binary)
result

| | label | score |
|---|---|---|
| 0 | NFC-Related | 0.990763 |
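
Each example in this notebook repeats the same three steps: run the pipeline, parse the `LABEL_<n>` string, and map the index to a readable name. A small wrapper can bundle them. This is a sketch with hypothetical names (`classify`, `fake_pipe`); `fake_pipe` stands in for a real `TextClassificationPipeline` so that the block runs without a model, and its fixed score is invented.

```python
def classify(pipe, sentences, to_label):
    """Run `pipe` on a list of sentences and return rows with readable labels."""
    rows = []
    for sentence, pred in zip(sentences, pipe(sentences)):
        label_no = int(pred["label"].split("_")[1])  # "LABEL_1" -> 1
        rows.append({"text": sentence, "label": to_label(label_no), "score": pred["score"]})
    return rows


def fake_pipe(sentences):
    """Stand-in pipeline: one {"label", "score"} dict per input sentence."""
    return [{"label": "LABEL_1", "score": 0.5} for _ in sentences]


def get_label_binary(label_no):
    return "NFC-Related" if label_no else "Not NFC-Related"


rows = classify(fake_pipe, ["first sentence", "second sentence"], get_label_binary)
print(rows)
```

With the real objects, `pd.DataFrame(classify(pipe, sentences, get_label_binary))` would give one row per sentence, since the pipeline accepts a list of strings and returns one prediction per input.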
