## __Statistical and Linguistic Insights for Model Explanation - SLIME__ 
### __Fine-tuning custom LLM for classification__
<font size=3>

To adapt the LLM for classification, we can modify the $\mathtt{CustomModel}$ class and assess the resulting neural-network (NN) model using:
* the $\mathtt{FitModel().fit()}$ method, when the dataset is large enough to hold out separate training and validation sets;
* the $\mathtt{FitModel().kfold()}$ method, when the dataset is too small to hold out a separate validation set.

Once the NN modeling is settled, we run the final training with the $\mathtt{FitModel().fit()}$ method.

In [1]:
import sys
# Make the local slime_nlp package importable from the parent directory.
sys.path.insert(0, '../')

from slime_nlp.dataset import ImportData
from slime_nlp.model import CustomModel, FitModel

### __1. Fine-tuning with training and validation data__
<font size=3>

- Use $\mathtt{ImportData}$ to split the dataset into training, validation, and test sets;
- Use $\mathtt{FitModel.fit()}$ to train on the training set and evaluate on the validation set.

In [2]:
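# Hold out 15% of the data for validation and 10% for testing; the rest is used for training.
# The variable is named `data` to avoid shadowing Python's built-in id().
data = ImportData(path_name="../dataset/adress_all.csv", n_val=0.15, n_test=0.1,
                  group_by=['text', 'group'], verbose=True)

train_data = data.train
val_data = data.val
test_data = data.test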

DataFrame:
                                                 text  group
0  well the little girl is saying to be uiet to h...      0
1  mhm . well the water's running over on the flo...      0
2  look at the picture <unintelligible> . oh okay...      0

Data length: N_total = 156
N-train = 118, N-val = 23, N-test = 15
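
The reported sizes are consistent with truncating each requested fraction of the 156 samples to an integer and assigning the remainder to the training set. The sketch below reproduces that arithmetic. `split_sizes` is a hypothetical helper, not part of `slime_nlp`, and the rounding rule is an assumption inferred from the printed output rather than taken from the library's source.

```python
def split_sizes(n_total, n_val=0.15, n_test=0.1):
    """Return (n_train, n_val, n_test) sample counts for fractional split sizes.

    Assumption: each fraction is truncated to an integer count and the
    remaining samples go to the training set.
    """
    n_val_count = int(n_total * n_val)
    n_test_count = int(n_total * n_test)
    n_train_count = n_total - n_val_count - n_test_count
    return n_train_count, n_val_count, n_test_count


print(split_sizes(156, n_val=0.15, n_test=0.1))  # (118, 23, 15)
print(split_sizes(156, n_val=0.0, n_test=0.1))   # (141, 0, 15)
```

The second call corresponds to the `n_val=0.0` setting used when no validation set is held out.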



In [3]:
fm = FitModel(device='cpu')

print(fm.__doc__)


    # FitModel: CustomModel model fitting.

    Input: (device='cpu', optimizer='AdamW', lr=2e-5, lr_sub=2e-4, eps=1e-8)
    -----
    - device (str): select CPU or GPU for training.
    - optimizer (str): training optimizer name.
    - lr (float): learning-rate for AutoModel' LLM weights adjustment.
    - lr_sub (float): learning-rate for weights adjustment of the CustomModel's 
    additional layer block.
    - eps (float): optimizer constant for numerical stability.

    Methods:
    -------
    - train_step (X, y):
      -- X (Tensor): CustomModel input data.
      -- y (Tensor): tensor of numerical labels.

      Returns (Tensor) the loss function value.

    - fit (train_data, val_data=None, epochs=1, batch_size=1, pretrained_name="google-bert/bert-base-cased",
    klabel='', path_name=None, patience=0, min_delta=1e-2):
      -- train_data (Dataframe): pandas dataframe (ImportData's output) with "text"(str) 
      and "group"(int) columns.
      -- val_data (Dataframe): equivale
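
The `fit` signature above exposes `patience` and `min_delta`, which are the usual knobs of early stopping, but the printed docstring is cut off before describing them. The sketch below shows one common convention for these two parameters. The `EarlyStopping` class is hypothetical and is not `FitModel`'s actual implementation; in particular, treating `patience=0` as "early stopping disabled" and monitoring a validation loss are assumptions.

```python
class EarlyStopping:
    """Generic early-stopping rule (an assumed convention, not slime_nlp code)."""

    def __init__(self, patience=0, min_delta=1e-2):
        self.patience = patience      # epochs to wait without improvement
        self.min_delta = min_delta    # minimum decrease that counts as improvement
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True if training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0
            return False
        self.bad_epochs += 1
        # Assumption: patience=0 means early stopping is switched off.
        return self.patience > 0 and self.bad_epochs >= self.patience


stopper = EarlyStopping(patience=2, min_delta=1e-2)
losses = [0.70, 0.65, 0.649, 0.648, 0.60]  # made-up validation losses

stopped_at = None
for epoch, loss in enumerate(losses, start=1):
    if stopper.step(loss):
        stopped_at = epoch
        break

print(stopped_at)  # 4: epochs 3 and 4 improve by less than min_delta
```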

In [4]:
fm.fit(train_data, val_data, epochs=1)

#Epoch 1/1:
Batch:99% - <train-loss> = 7.261e-01
<validation-metric>: Acc = 5.652e-01, F1 = 0.000e+00
Time taken: 68.94s
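
An F1 score of exactly zero alongside a non-trivial accuracy means the model produced no true positives on the validation set, which is plausible after a single epoch. The sketch below works through one configuration that reproduces the reported numbers: 23 validation samples, 13 of them in class 0, and a model that predicts class 0 for everything. The class balance and the choice of class 1 as the positive label are assumptions; other confusion matrices with zero true positives and 13 true negatives would give the same metrics.

```python
def accuracy_and_f1(y_true, y_pred):
    """Compute accuracy and binary F1, treating label 1 as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)

    accuracy = (tp + tn) / len(y_true)
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0  # F1 is taken as 0 when undefined
    return accuracy, f1


# Hypothetical validation labels: 13 samples of class 0 and 10 of class 1.
y_true = [0] * 13 + [1] * 10
# A model that always predicts class 0.
y_pred = [0] * 23

acc, f1 = accuracy_and_f1(y_true, y_pred)
print(f"Acc = {acc:.3e}, F1 = {f1:.3e}")  # Acc = 5.652e-01, F1 = 0.000e+00
```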



In [None]:
fm.plot_metric() 

### __2. Fine-tuning with K-fold cross-validation__
<font size=3>

- Use $\mathtt{ImportData}$ to split the dataset into training and test sets;
- Use $\mathtt{FitModel.kfold()}$ to run K-fold cross-validation on the training set.
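
In K-fold cross-validation, the training set is partitioned into K folds; each fold serves once as the validation set while the remaining K-1 folds are used for training. The sketch below illustrates the index bookkeeping for 141 samples and K=5. `kfold_indices` is a hypothetical helper written for illustration; how `FitModel.kfold()` partitions or shuffles the data internally is not shown in this notebook.

```python
def kfold_indices(n_samples, k):
    """Yield (train_idx, val_idx) index lists for k contiguous folds.

    The first (n_samples % k) folds receive one extra sample so that
    every sample is used for validation exactly once.
    """
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        stop = start + size
        val_idx = list(range(start, stop))
        train_idx = list(range(0, start)) + list(range(stop, n_samples))
        yield train_idx, val_idx
        start = stop


folds = list(kfold_indices(141, 5))
for i, (train_idx, val_idx) in enumerate(folds, start=1):
    print(f"fold {i}: train = {len(train_idx)}, val = {len(val_idx)}")
```

With 141 samples and 5 folds, the first fold validates on 29 samples and the other four on 28 each.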

In [4]:
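# No validation split here (n_val=0.0): K-fold creates the validation folds itself.
# The variable is named `data` to avoid shadowing Python's built-in id().
data = ImportData(path_name="../dataset/adress_all.csv", n_val=0.0, n_test=0.1,
                  group_by=['text', 'group'], verbose=True)

train_data = data.train
test_data = data.test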

DataFrame:
                                                 text  group
0  well the little girl is saying to be uiet to h...      0
1  mhm . well the water's running over on the flo...      0
2  look at the picture <unintelligible> . oh okay...      0

Data length: N_total = 156
N-train = 141, N-val = 0, N-test = 15



In [None]:
fm = FitModel(device='cpu')

fm.kfold(train_data, K=5, batch_size=2, epochs=30)

In [None]:
fm.plot_metric() 

### __3. Final training after NN modeling__
<font size=3>

- Use $\mathtt{ImportData}$ to split the dataset into training and test sets;
- Use $\mathtt{FitModel.fit()}$ for the final training on the whole training set.

In [5]:
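# No validation split (n_val=0.0): all non-test data is used for the final training.
# The variable is named `data` to avoid shadowing Python's built-in id().
data = ImportData(path_name="../dataset/adress_all.csv", n_val=0.0, n_test=0.1,
                  group_by=['text', 'group'], verbose=True)

train_data = data.train
test_data = data.test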

DataFrame:
                                                 text  group
0  well the little girl is saying to be uiet to h...      0
1  mhm . well the water's running over on the flo...      0
2  look at the picture <unintelligible> . oh okay...      0

Data length: N_total = 156
N-train = 141, N-val = 0, N-test = 15



In [None]:
fm = FitModel(device='cpu')

# Final training: no validation data is passed, so the whole training split is used.
fm.fit(train_data, epochs=2)

In [None]:
fm.evaluate(test_data)

### __4. Making predictions__

In [2]:
model = CustomModel().to('cpu')

print(model.__doc__)


    # CustomModel: Custom LLM for classification

    Input: (pretrained_name="google-bert/bert-base-cased")
    ----- 
    - pretained_name (str): pretrained model name from huggingface.co repository.

    Returns object with callable model's input. 

    
    Methods:
    -------
    - forward = __call__: (input_ids, token_type_ids=None, attention_mask=None)
      -- input_ids (Tensor): sequence of special tokens IDs.
      -- token_type_ids (Tensor): sequence of token indices to distinguish 
      between sentence pairs.
      -- attention_mask (Tensor): mask to avoid performing attention on padding 
      token indices.

      Returns a Tensor with linear prediction output.
    
    - load: (path_name="weights/model_weights.pt", device='cpu') 
      Loads model's weights.
      
      -- path_name (str): path string of the model's weights (.pt).
      -- device (str): select CPU or GPU for prediction processing.
    
    - predict: (data)
      -- data (Dataframe): pandas datafram

In [None]:
# Load the saved weights, then predict on the held-out test data.
model.load("../weights/best_model.pt")

pred = model.predict(test_data)
