<a href="https://colab.research.google.com/github/jpsiegel/Projects/blob/master/caseStudy_Jan.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction

### Background

This case study is losely based on an actual market research problem. The client tested a (relatively) large number of advertising messages over the years. These tests involved respondents evaluating the credibility and appeal of those messages. These evaluations are aggregated to a score which is then binned using deciles. Top decile claims are likely to be used in actual advertising campaigns. 

We try to predict the claim performance (as represented by the decile) from the text by applying a popular pre-trained model.

### Your task

Fill in the blanks to make a basic analysis work. Add to it as much as you like. During the subsequent interview, you can explain your solution and approach.

### Data

There are three column in the data, `Message_Text` (The advertising message), `score` (The survey-based score) and `label` (Decile).

Our client wants to predict `label` from `Message_Text`. There are 10 classes in total.

The data structure is based on real data, however, for confidentiality reasons it is not our actual client data.

### Model

The code below apply *distilbert model* to do the classification. Please fill the blanks.

We will use pretraiend model from [huggingface](https://huggingface.co/) library. Hugginface is an open source AI library where published cutting-edge advanced AI models. You can find [courses](https://huggingface.co/course/chapter1/1) online.
 
This case study is [text classification](https://huggingface.co/tasks/text-classification) task.
If you are not familiar with [Bert](https://en.wikipedia.org/wiki/BERT_(language_model)), please check this [paper](https://arxiv.org/abs/1810.04805). Please also check [attention machenism](https://arxiv.org/abs/1706.03762) and transformer. 

###Setup

In [1]:
!pip install transformers --quiet 
!pip install datasets --quiet
!ln -sf /opt/bin/nvidia-smi /usr/bin/nvidia-smi
!pip install gputil --quiet
!pip install psutil --quiet
!pip install humanize --quiet

import psutil
import humanize
import os
import GPUtil as GPU

[K     |████████████████████████████████| 4.7 MB 38.6 MB/s 
[K     |████████████████████████████████| 6.6 MB 47.3 MB/s 
[K     |████████████████████████████████| 120 kB 72.0 MB/s 
[K     |████████████████████████████████| 365 kB 15.7 MB/s 
[K     |████████████████████████████████| 115 kB 13.0 MB/s 
[K     |████████████████████████████████| 212 kB 54.6 MB/s 
[K     |████████████████████████████████| 127 kB 54.2 MB/s 
[?25h  Building wheel for gputil (setup.py) ... [?25l[?25hdone


In [2]:
# check GPU unit

GPUs = GPU.getGPUs()
gpu = GPUs[0]  # Only one GPU on Colab and not guaranteed

def print_mem():
  """Prints available ram and graphic memory"""
  process = psutil.Process(os.getpid())
  print("RAM Free: " + humanize.naturalsize(psutil.virtual_memory().available), " | Used: " + humanize.naturalsize(process.memory_info().rss))
  print("VRAM Free: {0:.0f}MB | Used: {1:.0f}MB | Using {2:3.0f}% Total {3:.0f}MB".format(gpu.memoryFree, gpu.memoryUsed, gpu.memoryUtil * 100, gpu.memoryTotal))

print_mem()

RAM Free: 12.4 GB  | Used: 111.6 MB
VRAM Free: 15109MB | Used: 0MB | Using   0% Total 15109MB


In [3]:
# mount google drive for workspace

from google.colab import drive

def mount_gdrive():
  """Sets up google drive directory access"""
  path = "/content/drive"
  drive.mount(path, force_remount=True)
  
my_dir = "/content/drive/MyDrive/CaseStudySkim/"
mount_gdrive()
print(os.listdir(my_dir))

Mounted at /content/drive
['sampled_data_NLP.xltx', 'runs', 'checkpoint-500', 'caseStudy_Jan.ipynb']


###Data Exploration

In [5]:
import pandas as pd

data = pd.read_excel(my_dir + "sampled_data_NLP.xltx")
print("Shape: ", data.shape, "\n")
data.head()

Shape:  (7211, 3) 



Unnamed: 0,Message_Text,label,score
0,Impeccable stain removal,class7,0.634615
1,With Odour Resistance Formula - Still fresh at...,class9,0.875
2,Original recipe,class8,0.788462
3,Natural hair gene awakening for 1000 new hair ...,class4,0.383333
4,Plastic-free packaging,class5,0.483871


In [7]:
symbol_example1 = 1254
symbol_example2 = 1649
dutch_example = 2000
spanish_example = 1104
print("Unwanted characters: \n\n", data.iloc[symbol_example1], "\n")
print(data.iloc[symbol_example2], "\n")
print("Non-english languages: \n\n", data.iloc[dutch_example], "\n")
print(data.iloc[spanish_example])

Unwanted characters: 

 Message_Text    Yumos: 2 fragrances in 1¬†fabcon - every time ...
label                                                      class2
score                                                    0.105263
Name: 1254, dtype: object 

Message_Text    "It‚Äôs better than any other Deodorants that ...
label                                                     class10
score                                                    0.928571
Name: 1649, dtype: object 

Non-english languages: 

 Message_Text    Vrij van conserveermiddelen
label                                class5
score                              0.433333
Name: 2000, dtype: object 

Message_Text    Textura ideal con muchos tomates
label                                     class9
score                                       0.85
Name: 1104, dtype: object


### Preprocessing

In [8]:
!pip install googletrans==4.0.0rc1 --quiet
!pip install langid --quiet

from googletrans import Translator
import langid

[K     |████████████████████████████████| 55 kB 3.7 MB/s 
[K     |████████████████████████████████| 42 kB 1.3 MB/s 
[K     |████████████████████████████████| 1.4 MB 56.4 MB/s 
[K     |████████████████████████████████| 53 kB 2.0 MB/s 
[K     |████████████████████████████████| 65 kB 3.2 MB/s 
[?25h  Building wheel for googletrans (setup.py) ... [?25l[?25hdone
[K     |████████████████████████████████| 1.9 MB 36.7 MB/s 
[?25h  Building wheel for langid (setup.py) ... [?25l[?25hdone


In [10]:
# Delete unwanted characters

data["Message_Text"] = data["Message_Text"].apply(lambda x: ''.join(["" if ord(i) < 32 or ord(i) > 126 else i for i in x])) # filters out non printable ascii

print(data.iloc[symbol_example1], "\n")
print(data.iloc[symbol_example2])

# Remove non-english languages 
# Translate? some frases that sound cool or catchy in english stop having that property when translated

translator = Translator()
def detect_lang_google(string, translator=translator):
  """Given a string, use Google API to check if its in english"""
  try:
    ret = translator.detect(string).lang == "en"
  except Exception as err:
    print(string, err)
    ret = False
  return ret

def detect_lang(string):
  """Given a string, use langid to check if its in english"""
  return langid.classify(string)[0] == "en"
#print(langid.classify("More space, better taste"))

data_noneng = data[data["Message_Text"].map(detect_lang) == False]
data_eng = data[data["Message_Text"].map(detect_lang) == True]

pct_lost = round((1 - (data_eng.shape[0] / data.shape[0])) * 100, 1)
print("\nDataset reduced to {} from {} examples ({}% removed)".format(data_eng.shape[0], data.shape[0], pct_lost))


Message_Text    Yumos: 2 fragrances in 1fabcon - every time yo...
label                                                      class2
score                                                    0.105263
Name: 1254, dtype: object 

Message_Text    "Its better than any other Deodorants that I h...
label                                                     class10
score                                                    0.928571
Name: 1649, dtype: object

Dataset reduced to 6438 from 7211 examples (10.7% removed)


In [26]:
# Make sure we still have balanced classes

def get_pct(value, total=data_eng.shape[0]):
  pct = round((value / total) * 100, 2)
  return str(pct) + "%"

amount = data_eng.label.value_counts()
percentages = data_eng.label.value_counts().apply(get_pct)
balance_check = pd.concat([amount, percentages], axis=1)
balance_check

Unnamed: 0,label,label.1
class10,732,11.37%
class3,686,10.66%
class4,669,10.39%
class8,658,10.22%
class5,651,10.11%
class9,635,9.86%
class6,631,9.8%
class2,630,9.79%
class7,605,9.4%
class1,541,8.4%


In [27]:
# Cast string label to one hot encoding

def one_hot_label(str_label):
  one_hot_list = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
  idx = int(str_label.strip("class")) - 1
  one_hot_list[idx] = 1
  return one_hot_list

data_ready = data_eng.copy()
data_ready["label"] = data_eng["label"].map(one_hot_label)
data_ready

Unnamed: 0,Message_Text,label,score
1,With Odour Resistance Formula - Still fresh at...,"[0, 0, 0, 0, 0, 0, 0, 0, 1, 0]",0.875000
2,Original recipe,"[0, 0, 0, 0, 0, 0, 0, 1, 0, 0]",0.788462
3,Natural hair gene awakening for 1000 new hair ...,"[0, 0, 0, 1, 0, 0, 0, 0, 0, 0]",0.383333
4,Plastic-free packaging,"[0, 0, 0, 0, 1, 0, 0, 0, 0, 0]",0.483871
5,Blended with the most luxurious fragrances for...,"[0, 0, 0, 0, 1, 0, 0, 0, 0, 0]",0.476190
...,...,...,...
7206,Every drop actively removes stains,"[0, 0, 0, 0, 1, 0, 0, 0, 0, 0]",0.428571
7207,"Indulge your senses with a richer, more foamy cup","[1, 0, 0, 0, 0, 0, 0, 0, 0, 0]",0.055556
7208,CIF Cream with microparticles acts deep into t...,"[1, 0, 0, 0, 0, 0, 0, 0, 0, 0]",0.021739
7209,With Patented Technology to refresh your senses,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 1]",0.980000


In [28]:
# convert pandas into dataset and get train and test dataset
from datasets import Dataset

ds = (Dataset.from_pandas(data_ready, preserve_index=False).train_test_split(train_size=0.8, test_size=0.2))
print("Train set: ", len(ds["train"]), "examples")
print("Test set: ", len(ds["test"]), "examples")

# peek at one example
ds["train"][0]

Train set:  5150 examples
Test set:  1288 examples


{'Message_Text': 'Perfect result with natural ingredients',
 'label': [0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
 'score': 0.1637931034482758}

## Tokenizer

We are using huggingface [AutoClass](https://huggingface.co/docs/transformers/model_doc/auto). For the [tokenizer](https://huggingface.co/docs/transformers/main_classes/tokenizer), please check here.

In [29]:
import torch
from transformers import (AutoTokenizer, 
                          AutoModelForSequenceClassification, 
                          TrainingArguments, 
                          Trainer)

In [32]:
# Instantiate tokenizer

tokenizer = AutoTokenizer.from_pretrained(
"distilbert-base-uncased"
)

# same as
# tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')

Downloading tokenizer_config.json:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

Downloading config.json:   0%|          | 0.00/483 [00:00<?, ?B/s]

Downloading vocab.txt:   0%|          | 0.00/226k [00:00<?, ?B/s]

Downloading tokenizer.json:   0%|          | 0.00/455k [00:00<?, ?B/s]

In [35]:
# Define the maximum number of words to tokenize (DistilBERT can tokenize up to 512)
MAX_LENGTH = 128


# Define function to encode text data in batches
def batch_encode(tokenizer, dataset, batch_size=16, max_length=MAX_LENGTH):
    """
    A function that encodes a batch of texts and returns the texts'
    corresponding encodings and attention masks that are ready to be fed 
    into a pre-trained transformer model.
    
    Input:
        - tokenizer:   Tokenizer object from the PreTrainedTokenizer Class
        - texts:       List of strings where each string represents a text
        - batch_size:  Integer controlling number of texts in a batch
        - max_length:  Integer controlling max number of words to tokenize in a given text
    Output:
        - input_ids:       sequence of texts encoded as a tf.Tensor object
        - attention_mask:  the texts' attention mask encoded as a tf.Tensor object
    """
    
    input_ids = []
    attention_mask = []
    
    for i in range(0, len(dataset), batch_size):
        batch = [ dataset[n]["Message_Text"] for n in range(batch_size) ]
        inputs = tokenizer.batch_encode_plus(batch,
                                             max_length=max_length,
                                             padding='longest', #implements dynamic padding
                                             truncation=True,
                                             return_attention_mask=True,
                                             return_token_type_ids=False
                                             )
        input_ids.extend(inputs['input_ids'])
        attention_mask.extend(inputs['attention_mask'])
    
    
    return torch.FloatTensor(input_ids), torch.FloatTensor(attention_mask)
    
    
# Encode X_train
X_train_ids, X_train_attention = batch_encode(tokenizer, ds["train"])
print(X_train_ids)
print(X_train_attention)

# Encode X_test
#X_test_ids, X_test_attention = batch_encode(tokenizer, X_test.tolist())

tensor([[  101.,  3819.,  2765.,  ...,     0.,     0.,     0.],
        [  101.,  1996.,  2087.,  ...,     0.,     0.,     0.],
        [  101.,  1996.,  2190.,  ...,     0.,     0.,     0.],
        ...,
        [  101., 19275.,  2000.,  ...,     0.,     0.,     0.],
        [  101.,  2317.,  2099.,  ...,     0.,     0.,     0.],
        [  101.,  1037.,  5572.,  ...,     0.,     0.,     0.]])
tensor([[1., 1., 1.,  ..., 0., 0., 0.],
        [1., 1., 1.,  ..., 0., 0., 0.],
        [1., 1., 1.,  ..., 0., 0., 0.],
        ...,
        [1., 1., 1.,  ..., 0., 0., 0.],
        [1., 1., 1.,  ..., 0., 0., 0.],
        [1., 1., 1.,  ..., 0., 0., 0.]])


In [42]:
def tokenize_and_encode(examples):
    return tokenizer(examples["Message_Text"], truncation=True, padding="max_length")
  
tokenize_and_encode(ds["train"][1])

{'input_ids': [101, 1996, 2087, 4621, 21101, 8208, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

In [43]:
cols = ds["train"].column_names
cols.remove("label")
ds_enc = ds.map(tokenize_and_encode, batched=True, remove_columns=cols)
print("\n Training example \n", ds_enc["train"][0],  "\n")

ds_enc

  0%|          | 0/6 [00:00<?, ?ba/s]

  0%|          | 0/2 [00:00<?, ?ba/s]


 Training example 
 {'label': [0, 1, 0, 0, 0, 0, 0, 0, 0, 0], 'input_ids': [101, 3819, 2765, 2007, 3019, 12760, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 

DatasetDict({
    train: Dataset({
        features: ['label', 'input_ids', 'attention_mask'],
        num_rows: 5150
    })
    test: Dataset({
        features: ['label', 'input_ids', 'attention_mask'],
        num_rows: 1288
    })
})

In [44]:
# cast to tensors

ds_enc.set_format("torch")
ds_enc = (ds_enc
          .map(lambda x : {"float_label": x["label"].to(torch.float)}, remove_columns=["label"])
          .rename_column("float_label", "label"))
print("\n Training example in tensor \n")
ds_enc["train"][0]

  0%|          | 0/5150 [00:00<?, ?ex/s]

  0%|          | 0/1288 [00:00<?, ?ex/s]


 Training example in tensor 



{'input_ids': tensor([  101,  3819,  2765,  2007,  3019, 12760,   102,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,   

## Model

In [41]:
AMOUNT_CLASSES = 10
DROPOUT = 0.2

model = (AutoModelForSequenceClassification
         .from_pretrained(
            "distilbert-base-uncased",
            num_labels=AMOUNT_CLASSES,
            dropout=DROPOUT
          ).to('cuda')
        )

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_layer_norm.weight', 'vocab_projector.bias', 'vocab_layer_norm.bias', 'vocab_transform.bias', 'vocab_transform.weight', 'vocab_projector.weight']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.bias', 'classifier.bias', 'classifier.w

In [45]:
# let's peek the data
ds_enc["train"][0]


{'input_ids': tensor([  101,  3819,  2765,  2007,  3019, 12760,   102,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,   

In [61]:
args = TrainingArguments(
    output_dir=my_dir,
    per_device_train_batch_size=8,
    evaluation_strategy="epoch",
    num_train_epochs=2
  # please put your code here....
)

from torch import nn
from transformers import Trainer


#class CustomTrainer(Trainer):
#    def compute_loss(self, model, inputs, return_outputs=False):
#        labels = inputs.get("labels")
#        # forward pass
#        outputs = model(**inputs)
#        logits = outputs.get("logits")
#        # compute custom loss (suppose one has 3 labels with different weights)
#        loss_fct = nn.CrossEntropyLoss(weight=torch.tensor([1.0, 2.0, 3.0]))
#        loss = loss_fct(logits.view(-1, self.model.config.num_labels), labels.view(-1))
#        return (loss, outputs) if return_outputs else loss

import numpy as np
from datasets import load_metric
metric = load_metric("accuracy")
def compute_metrics(eval_pred):
  print(eval_pred, type(eval_pred))
  logits, labels = eval_pred
  predictions = np.argmax(logits, axis=1)
  print(predictions, type(predictions))
  numbered_labels = []
  for label in labels:
    listed = list(label)
    num = listed.index(1.0)
    numbered_labels.append(num)
  print(numbered_labels, type(numbered_labels))
  return metric.compute(predictions=predictions, references=numbered_labels)

trainer = Trainer(model=model, 
                  args=args, 
                  train_dataset=ds_enc["train"], 
                  eval_dataset=ds_enc["test"], 
                  tokenizer=tokenizer,
                  compute_metrics=compute_metrics
                  )

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).


In [62]:
trainer.train()

***** Running training *****
  Num examples = 5150
  Num Epochs = 2
  Instantaneous batch size per device = 32
  Total train batch size (w. parallel, distributed & accumulation) = 32
  Gradient Accumulation steps = 1
  Total optimization steps = 322


Epoch,Training Loss,Validation Loss,Accuracy
1,No log,0.635552,0.156056
2,No log,0.609917,0.163043


***** Running Evaluation *****
  Num examples = 1288
  Batch size = 8


<transformers.trainer_utils.EvalPrediction object at 0x7f6b0c0e3c90> <class 'transformers.trainer_utils.EvalPrediction'>
[8 4 6 ... 1 1 0] <class 'numpy.ndarray'>
[6, 0, 3, 0, 7, 1, 1, 1, 5, 5, 5, 6, 8, 0, 4, 1, 6, 9, 3, 3, 5, 2, 6, 8, 6, 8, 8, 9, 3, 7, 0, 2, 9, 2, 5, 9, 5, 6, 8, 6, 3, 4, 8, 6, 1, 2, 3, 2, 6, 0, 9, 8, 3, 2, 9, 2, 3, 1, 5, 2, 3, 1, 6, 6, 0, 7, 4, 7, 8, 8, 0, 6, 1, 3, 6, 5, 4, 5, 4, 5, 8, 9, 6, 2, 4, 9, 0, 1, 3, 9, 6, 8, 9, 7, 8, 7, 6, 5, 1, 4, 8, 3, 8, 9, 2, 9, 0, 6, 2, 7, 4, 7, 9, 4, 4, 5, 1, 6, 9, 4, 0, 2, 1, 2, 7, 4, 9, 9, 1, 7, 0, 3, 0, 7, 6, 5, 9, 1, 3, 4, 6, 4, 3, 5, 1, 0, 7, 5, 6, 3, 4, 5, 4, 6, 9, 5, 0, 2, 8, 1, 0, 2, 2, 5, 1, 3, 0, 7, 8, 0, 4, 8, 0, 7, 9, 5, 4, 6, 9, 9, 0, 7, 0, 9, 0, 5, 0, 9, 4, 1, 1, 4, 8, 5, 1, 2, 2, 9, 6, 3, 9, 8, 3, 9, 9, 4, 8, 9, 8, 9, 8, 4, 1, 3, 3, 3, 6, 9, 5, 3, 6, 4, 0, 8, 1, 0, 0, 5, 4, 4, 9, 4, 8, 2, 9, 4, 3, 0, 3, 1, 9, 9, 9, 0, 2, 9, 7, 4, 4, 1, 3, 8, 1, 0, 8, 2, 8, 5, 8, 9, 4, 1, 6, 1, 3, 3, 3, 6, 1, 6, 8, 4, 1, 7, 5, 8, 8, 8, 6,

***** Running Evaluation *****
  Num examples = 1288
  Batch size = 8


<transformers.trainer_utils.EvalPrediction object at 0x7f6b13a10450> <class 'transformers.trainer_utils.EvalPrediction'>
[8 4 6 ... 1 1 0] <class 'numpy.ndarray'>
[6, 0, 3, 0, 7, 1, 1, 1, 5, 5, 5, 6, 8, 0, 4, 1, 6, 9, 3, 3, 5, 2, 6, 8, 6, 8, 8, 9, 3, 7, 0, 2, 9, 2, 5, 9, 5, 6, 8, 6, 3, 4, 8, 6, 1, 2, 3, 2, 6, 0, 9, 8, 3, 2, 9, 2, 3, 1, 5, 2, 3, 1, 6, 6, 0, 7, 4, 7, 8, 8, 0, 6, 1, 3, 6, 5, 4, 5, 4, 5, 8, 9, 6, 2, 4, 9, 0, 1, 3, 9, 6, 8, 9, 7, 8, 7, 6, 5, 1, 4, 8, 3, 8, 9, 2, 9, 0, 6, 2, 7, 4, 7, 9, 4, 4, 5, 1, 6, 9, 4, 0, 2, 1, 2, 7, 4, 9, 9, 1, 7, 0, 3, 0, 7, 6, 5, 9, 1, 3, 4, 6, 4, 3, 5, 1, 0, 7, 5, 6, 3, 4, 5, 4, 6, 9, 5, 0, 2, 8, 1, 0, 2, 2, 5, 1, 3, 0, 7, 8, 0, 4, 8, 0, 7, 9, 5, 4, 6, 9, 9, 0, 7, 0, 9, 0, 5, 0, 9, 4, 1, 1, 4, 8, 5, 1, 2, 2, 9, 6, 3, 9, 8, 3, 9, 9, 4, 8, 9, 8, 9, 8, 4, 1, 3, 3, 3, 6, 9, 5, 3, 6, 4, 0, 8, 1, 0, 0, 5, 4, 4, 9, 4, 8, 2, 9, 4, 3, 0, 3, 1, 9, 9, 9, 0, 2, 9, 7, 4, 4, 1, 3, 8, 1, 0, 8, 2, 8, 5, 8, 9, 4, 1, 6, 1, 3, 3, 3, 6, 1, 6, 8, 4, 1, 7, 5, 8, 8, 8, 6,



Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=322, training_loss=0.07614654043446416, metrics={'train_runtime': 491.1893, 'train_samples_per_second': 20.97, 'train_steps_per_second': 0.656, 'total_flos': 1364608865280000.0, 'train_loss': 0.07614654043446416, 'epoch': 2.0})

In [56]:
trainer.evaluate()

***** Running Evaluation *****
  Num examples = 1288
  Batch size = 8


{'eval_loss': 0.42159342765808105,
 'eval_runtime': 19.595,
 'eval_samples_per_second': 65.731,
 'eval_steps_per_second': 8.216,
 'epoch': 5.0}

# Improvement?

Congrats! You finish your model training.

But this is a very basic model, there are still lots of improvements could be done. 

For example, improving model accuracy by tuning hyperparameter, changing different models, logging model to diagnose models. 
You could also try explinable AI to interpret why model gives this prediction. 

Client is also interested in directly predicting socres, you could also try that.

Have fun!