<h1 align="center">
  <a href="https://uptrain.ai">
    <img width="300" src="https://user-images.githubusercontent.com/108270398/214240695-4f958b76-c993-4ddd-8de6-8668f4d0da84.png" alt="uptrain">
  </a>
</h1>

<h1 style="text-align: center;">Fine-tuning a Large-Language Model</h1>

### Install required packages
- [PyTorch](https://pytorch.org/get-started/locally/): Deep learning framework.
- [Hugging Face Transformers](https://huggingface.co/docs/transformers/installation): Provides pretrained state-of-the-art models.
- [Hugging Face Datasets](https://pypi.org/project/datasets/): Provides access to public Hugging Face datasets.
- [IPywidgets](https://ipywidgets.readthedocs.io/en/stable/user_install.html): For interactive notebook widgets.

In [None]:
%pip install torch transformers[torch] datasets ipywidgets textblob uptrain

# Imports

In [None]:
import json

import uptrain
from textblob import TextBlob
from transformers import AutoModelForMaskedLM, AutoTokenizer

# Local helper modules that accompany this notebook
from model_constants import *
from model_train import retrain_model
from helper_funcs import *

In [None]:
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = AutoModelForMaskedLM.from_pretrained(model_checkpoint)
testing_text = "Nike shoes are very [MASK]."
original_model_outputs = test_model(model, testing_text)

def nike_text_present_func(inputs, outputs, gts=None, extra_args={}):
    """Return one boolean per input text: True if the text mentions Nike."""
    is_present = []
    for text in inputs["text"]:
        # Extend with other brands (e.g. "puma", "adidas", "bata") if needed.
        is_present.append("nike" in text.lower())
    return is_present


uptrain_save_fold_name = "uptrain_smart_data_bert"
nike_text_present = uptrain.Signal("Nike Text Present", nike_text_present_func)

cfg = {
    'checks': [{
        'type': uptrain.Anomaly.EDGE_CASE,
        "signal_formulae": nike_text_present
    }],

    # Define where to save the retraining dataset
    'retraining_folder': uptrain_save_fold_name,
    
    # Define when to retrain, define a large number because we
    # are not retraining yet
    'retrain_after': 10000000000
}

framework = uptrain.Framework(cfg)
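
The signal function above checks for a single brand name. As a minimal sketch of how the same idea could cover several keywords, the hypothetical helper `make_keyword_signal` below builds a function with the same `(inputs, outputs, gts, extra_args)` signature. It assumes, as the function above does, that `inputs["text"]` is a list of strings; the brand list is illustrative only.

```python
from typing import Callable, Iterable, List


def make_keyword_signal(keywords: Iterable[str]) -> Callable[..., List[bool]]:
    """Build a signal function that flags texts mentioning any keyword.

    Hypothetical helper, not part of uptrain.
    """
    lowered = [kw.lower() for kw in keywords]

    def signal(inputs, outputs, gts=None, extra_args=None):
        # One boolean per input text, in the same order as inputs["text"].
        return [
            any(kw in text.lower() for kw in lowered)
            for text in inputs["text"]
        ]

    return signal


# Illustrative brand list (an assumption, not taken from the dataset).
brand_present = make_keyword_signal(["nike", "puma", "adidas"])
flags = brand_present(
    {"text": ["Nike shoes are great", "I like sandals"]}, outputs=None
)
print(flags)  # [True, False]
```

The returned function could then be wrapped in `uptrain.Signal` in the same way as `nike_text_present_func`.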

# Sentiment Analysis (Preprocessing)

* We use TextBlob, a Python text-processing library with a built-in sentiment analyzer.
* The sentiment polarity score (in the range [-1, 1]) classifies each review as positive (polarity > 0) or not.
* Reviews containing any of the unwanted adjectives `["basic", "cheap", "feminine", "expensive", "inexpensive", "costly", "common"]` are not selected, even if their polarity is positive.
* The function returns a boolean list, one entry per review, which is used to filter the dataset.

In [None]:
def positive_sentiment(inputs, outputs, gts = None, extra_args = {}):
    # Adjectives that disqualify a review even when its polarity is positive
    blocked_adjectives = ["basic", "cheap", "feminine", "expensive", "inexpensive", "costly", "common"]
    is_positive = []
    for text in inputs["text"]:
        # TextBlob polarity lies in [-1, 1]; values above 0 count as positive
        is_pos = TextBlob(text).sentiment.polarity > 0
        if any(adj in text for adj in blocked_adjectives):
            is_pos = False
        is_positive.append(bool(is_pos))
    return is_positive

uptrain_save_fold_name = "uptrain_smart_data_bert"
positive_signal = uptrain.Signal("Positive Sentiment", positive_sentiment)

cfg = {
    'checks': [{
        'type': uptrain.Anomaly.EDGE_CASE,
        "signal_formulae": positive_signal
    }],

    # Define where to save the retraining dataset
    'retraining_folder': uptrain_save_fold_name,
    
    # Define when to retrain, define a large number because we
    # are not retraining yet
    'retrain_after': 10000000000
}

framework = uptrain.Framework(cfg) 
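
Note that the adjective check in `positive_sentiment` is a case-sensitive substring test, so it would also fire on longer words such as "uncommon" or "basically", and would miss "Cheap" at the start of a sentence. If that is not the intended behavior, a whole-word, case-insensitive variant could be sketched as follows. The helper name `contains_blocked_adjective` is hypothetical; the adjective list is the one used above.

```python
import re

# Adjectives copied from the filter list used in positive_sentiment.
BLOCKED_ADJECTIVES = [
    "basic", "cheap", "feminine", "expensive",
    "inexpensive", "costly", "common",
]

# \b anchors each alternative at word boundaries, so "uncommon" and
# "basically" are not treated as matches.
_BLOCKED_RE = re.compile(
    r"\b(?:" + "|".join(map(re.escape, BLOCKED_ADJECTIVES)) + r")\b",
    re.IGNORECASE,
)


def contains_blocked_adjective(text: str) -> bool:
    """Return True if any blocked adjective occurs as a whole word."""
    return _BLOCKED_RE.search(text) is not None
```

Swapping this check into the signal would change which reviews are selected, so the resulting retraining set may differ from the one produced in this notebook.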

# Identify edge cases

In [None]:
# Alternatively: raw_dataset = create_sample_dataset("raw_nike_reviews_data.json")
raw_dataset = "raw_nike_reviews_data.json"
print(raw_dataset)
with open(raw_dataset) as f:
    all_data = json.load(f)
for sample in all_data['data']:
    inputs = {'data': {'text': [sample['text']]}}
    framework.log(inputs = inputs, outputs = None)
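
The loop above logs one review at a time, wrapping each text in the nested `{'data': {'text': [...]}}` structure. A small sketch of how those payloads are shaped, using a hypothetical helper `iter_log_payloads` and made-up sample reviews, is shown below. Whether `framework.log` accepts more than one text per call is an assumption here; the default `batch_size=1` reproduces the loop above.

```python
from typing import Dict, Iterator, List


def iter_log_payloads(
    samples: List[Dict[str, str]], batch_size: int = 1
) -> Iterator[dict]:
    """Yield payloads shaped like the `inputs` argument used above."""
    for start in range(0, len(samples), batch_size):
        batch = samples[start:start + batch_size]
        yield {"data": {"text": [sample["text"] for sample in batch]}}


# Made-up data in the same layout as raw_nike_reviews_data.json:
# a top-level "data" list whose items each carry a "text" field.
all_data = {
    "data": [
        {"text": "Nike shoes are comfortable"},
        {"text": "Great running shoes"},
        {"text": "Too costly"},
    ]
}

payloads = list(iter_log_payloads(all_data["data"], batch_size=2))
print(len(payloads))  # 2
```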

retraining_dataset = create_dataset_from_csv(uptrain_save_fold_name + "/1/smart_data.csv", "text", "retrain_dataset.json")
print(retraining_dataset)

# Retraining the model to skew towards positive reviews/descriptions

In [5]:
retrain_model(model, retraining_dataset)
retrained_model_outputs = test_model(model, testing_text)

Using custom data configuration default-6f7d250b2b1e8631


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
Downloading and preparing dataset json/default to /Users/prateekrao/.cache/huggingface/datasets/json/default-6f7d250b2b1e8631/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51...


Dataset json downloaded and prepared to /Users/prateekrao/.cache/huggingface/datasets/json/default-6f7d250b2b1e8631/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51. Subsequent calls will reuse this data.


The following columns in the evaluation set don't have a corresponding argument in `DistilBertForMaskedLM.forward` and have been ignored: word_ids. If word_ids are not expected by `DistilBertForMaskedLM.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 135
  Batch size = 64
You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


The following columns in the training set don't have a corresponding argument in `DistilBertForMaskedLM.forward` and have been ignored: word_ids. If word_ids are not expected by `DistilBertForMaskedLM.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 1213
  Num Epochs = 10
  Instantaneous batch size per device = 64
  Total train batch size (w. parallel, distributed & accumulation) = 64
  Gradient Accumulation steps = 1
  Total optimization steps = 190
  Number of trainable parameters = 66985530
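
The step count in the log above follows from the other reported values. A quick check, assuming a single device and no gradient accumulation (as the log states), with the last partial batch of each epoch counted as a full step:

```python
import math

num_examples = 1213  # "Num examples" in the training log
batch_size = 64      # "Total train batch size"
num_epochs = 10      # "Num Epochs"

# The final, smaller batch of each epoch still counts as one step.
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
print(steps_per_epoch, total_steps)  # 19 190
```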


>>>Before training, Perplexity: 46.72


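| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1 | 3.0938 | 2.379672 |
| 2 | 2.1386 | 2.008094 |
| 3 | 1.8374 | 1.841843 |
| 4 | 1.6828 | 1.701258 |
| 5 | 1.5171 | 1.63571 |
| 6 | 1.3965 | 1.569052 |
| 7 | 1.3328 | 1.481639 |
| 8 | 1.2528 | 1.368961 |
| 9 | 1.2212 | 1.40201 |
| 10 | 1.1325 | 1.398103 |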


The following columns in the evaluation set don't have a corresponding argument in `DistilBertForMaskedLM.forward` and have been ignored: word_ids. If word_ids are not expected by `DistilBertForMaskedLM.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 135
  Batch size = 64

>>>After training, Perplexity: 4.41
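
For masked language models, perplexity is commonly reported as the exponential of the mean cross-entropy loss on the evaluation set. Assuming `retrain_model` follows that convention (its source is not shown here), the reported perplexities can be converted back into the losses they imply:

```python
import math


def perplexity(mean_cross_entropy_loss: float) -> float:
    """Perplexity as exp of the mean cross-entropy loss (in nats)."""
    return math.exp(mean_cross_entropy_loss)


def implied_loss(ppl: float) -> float:
    """Inverse of perplexity(): the loss that yields a given perplexity."""
    return math.log(ppl)


before, after = 46.72, 4.41  # values reported in the output above
print(round(implied_loss(before), 2), round(implied_loss(after), 2))  # 3.84 1.48
```

Because masked-language-model evaluation typically masks tokens at random, the loss behind the final perplexity need not match the last validation loss logged during training exactly.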


In [6]:
print([original_model_outputs, retrained_model_outputs])

# Create Nike review training dataset
nike_attrs = {
    "version": "0.1.0",
    'source': "nike review dataset",
    'url': 'https://www.kaggle.com/datasets/tinkuzp23/nike-onlinestore-customer-reviews?resource=download',
}
# Download the dataset from the URL, unzip it, and copy the CSV file here
raw_nike_reviews_dataset = create_dataset_from_csv("Final1.csv", "Description", "raw_nike_reviews_data.json")

[['popular', 'expensive', 'durable', 'common', 'comfortable', 'worn', 'versatile', 'inexpensive', 'rare', 'fashionable', 'costly', 'cheap', 'attractive', 'affordable', 'lightweight', 'basic', 'important', 'distinctive', 'sturdy', 'similar'], ['comfortable', 'popular', 'durable', 'good', 'attractive', 'nice', 'lightweight', 'functional', 'versatile', 'affordable', 'luxurious', 'fancy', 'effective', 'reliable', 'cool', 'light', 'soft', 'special', 'beautiful', 'flexible']]
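
To make the shift between the two prediction lists easier to read, the sketch below compares them and checks how many of the adjectives filtered out during preprocessing still appear. The lists are copied from the output above; the variable names are illustrative.

```python
original = ['popular', 'expensive', 'durable', 'common', 'comfortable',
            'worn', 'versatile', 'inexpensive', 'rare', 'fashionable',
            'costly', 'cheap', 'attractive', 'affordable', 'lightweight',
            'basic', 'important', 'distinctive', 'sturdy', 'similar']
retrained = ['comfortable', 'popular', 'durable', 'good', 'attractive',
             'nice', 'lightweight', 'functional', 'versatile', 'affordable',
             'luxurious', 'fancy', 'effective', 'reliable', 'cool',
             'light', 'soft', 'special', 'beautiful', 'flexible']
# Adjectives excluded by the positive-sentiment filter
blocked = ["basic", "cheap", "feminine", "expensive",
           "inexpensive", "costly", "common"]

kept = [w for w in original if w in retrained]
dropped = [w for w in original if w not in retrained]
added = [w for w in retrained if w not in original]
blocked_before = [w for w in original if w in blocked]
blocked_after = [w for w in retrained if w in blocked]

print(len(kept), len(dropped), len(added))  # 7 13 13
print(blocked_before)  # ['expensive', 'common', 'inexpensive', 'costly', 'cheap', 'basic']
print(blocked_after)   # []
```

Six of the seven filtered adjectives appear in the original model's top predictions and none appear after retraining.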


# Save the model.

In [8]:
import pickle as pkl

# Use a context manager so the file handle is closed after writing
with open('final_model.pkl', 'wb') as f:
    pkl.dump(model, f)