First, you'll need to install the necessary dependencies, such as PyTorch and the Transformers library. You can do this using pip:

In [None]:
!pip install torch transformers

Next, you'll need to download a pre-trained Transformer model. You can do this using the transformers library:

In [1]:
import pandas as pd
import numpy as np
from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline, BertTokenizer, BertModel, Trainer, TrainingArguments, BertConfig, BertForMaskedLM

In [2]:
model_names = ["moreh/MoMo-70B-lora-1.8.6-DPO","bert-large-uncased-whole-word-masking","bert-base-uncased"]
model_name = model_names[1]
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-large-uncased-whole-word-masking and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Now, you'll need to prepare your dataset for fine-tuning. This will involve tokenizing your text data and creating a CSV file that contains the input text and the corresponding labels.

In [3]:
unmasker = pipeline('fill-mask', model=model_name)

Some weights of the model checkpoint at bert-large-uncased-whole-word-masking were not used when initializing BertForMaskedLM: ['bert.pooler.dense.weight', 'bert.pooler.dense.bias', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [4]:
unmasker("Hello I'm a [MASK] model.")
# unmasker("The man worked as a [MASK].")

[{'score': 0.1581384241580963,
  'token': 4827,
  'token_str': 'fashion',
  'sequence': "hello i'm a fashion model."},
 {'score': 0.10551061481237411,
  'token': 3104,
  'token_str': 'cover',
  'sequence': "hello i'm a cover model."},
 {'score': 0.08340441435575485,
  'token': 3287,
  'token_str': 'male',
  'sequence': "hello i'm a male model."},
 {'score': 0.036382000893354416,
  'token': 3565,
  'token_str': 'super',
  'sequence': "hello i'm a super model."},
 {'score': 0.036095913499593735,
  'token': 2327,
  'token_str': 'top',
  'sequence': "hello i'm a top model."}]
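
The pipeline output above is a list of candidate tokens ranked by score. As a rough illustration of how such a ranking can be derived from raw model scores, here is a minimal sketch that applies a softmax and keeps the top candidates. The logits and the four-word vocabulary are hypothetical values chosen for the example, not real model output, and `top_k_predictions` is an illustrative helper rather than part of the transformers API.

```python
import math

def top_k_predictions(logits, vocab, k=3):
    """Return the k highest-probability (token, probability) pairs."""
    # Softmax, with the maximum subtracted first for numerical stability
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Sort candidates from most to least likely and keep the first k
    ranked = sorted(zip(vocab, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

# Hypothetical scores for four candidate tokens (not real model output)
vocab = ["fashion", "cover", "male", "super"]
logits = [2.0, 1.6, 1.4, 0.5]

top = top_k_predictions(logits, vocab, k=2)
for token, prob in top:
    print(f"{token}: {prob:.3f}")
```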

In [5]:
# tokenizer = BertTokenizer.from_pretrained(model_name)
# model = BertModel.from_pretrained(model_name)
from transformers import BertForMaskedLM, BertTokenizer, BertConfig

config = BertConfig.from_pretrained(model_name)
model = BertForMaskedLM.from_pretrained(model_name, config=config)
tokenizer = BertTokenizer.from_pretrained(model_name)



In [8]:
df = pd.read_parquet("data/train-00000-of-00001-45c83a9a6160af42.parquet")

In [9]:
# df = df.iloc[: 1000]
df

Unnamed: 0,instruction,input,output
0,Your goal is to determine the relationship bet...,"Sentence 1: For his hypotension, autonomic tes...",Entailment
1,"In the provided text, your objective is to rec...","The first product of ascorbate oxidation , the...",The : O\nfirst : O\nproduct : O\nof : O\nascor...
2,"If you have expertise in healthcare, assist us...",i am having acute gas problem which i am feeli...,Dear-thanks for using our service and will try...
3,"Being a doctor, your task is to answer the med...",###Question: What is the relation between Hous...,###Answer: Does not bath self (finding) interp...
4,"In your role as a medical professional, addres...",Hi I have the last few weeks had a sharp pain ...,Thank you. Upper abdominal pain with anemia in...
...,...,...,...
205043,Your role involves answering medical questions...,"Ok, so I have a friend that is 30 years old an...","Hi, Why the detection may have been late. I mu..."
205044,Your task is to determine the relationships be...,"Meds at home : @treatment$ , Levoxyl , @treatm...",No Relations
205045,"Given your profession as a doctor, please prov...",###Question: What is the relation between Open...,###Answer: Open fine needle biopsy of liver (p...
205046,Your goal as an annotator is to recognize clin...,Received 2 mg dilaudid for pain o/n .,Received : O\n2 : O\nmg : O\ndilaudid : B-TREA...


In [None]:
# text = "Replace me by any text you'd like."
# encoded_input = tokenizer(text, return_tensors='pt')
# output = model(**encoded_input) # type: ignore

In [10]:
import torch

In [None]:
# Put model in train mode and move it to the chosen device
# (CPU here; use torch.device('cuda') instead if a GPU is available)
model.train()
device = torch.device('cpu')
model.to(device)

In [11]:
import torch
from torch.utils.data import Dataset, DataLoader
from transformers import BertForMaskedLM, BertTokenizer
from queue import Queue

# Load model and tokenizer
model = BertForMaskedLM.from_pretrained('bert-base-uncased')
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
df = pd.read_parquet("data/train-00000-of-00001-45c83a9a6160af42.parquet")

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight', 'bert.pooler.dense.bias']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [12]:
# Sample data
train_data = [
    ['Summarize this', 'The quick brown fox jumped over the lazy dog', 'The text describes a fox quickly jumping over a lazy dog.'],
    ['Translate to French', 'The house is blue', 'La maison est bleue']  
]
# Replace the samples above with the first 100 rows of the parquet DataFrame
train_data = df.values[:100]
# train_data

In [13]:

# Custom dataset
class CustomDataset(Dataset):
    def __init__(self, data):
        self.data = data

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        # Index directly so samples can be revisited on every epoch
        # (a queue would be empty after the first pass)
        return self.data[idx]

# Custom collate function: tokenize, pad, and truncate each batch to a fixed length
# (max_len=128 is an arbitrary choice; BERT accepts at most 512 tokens)
def custom_collate(batch, max_len=128):
    prompts = [instruction + tokenizer.mask_token + input_text
               for instruction, input_text, _ in batch]
    targets = [output_text for _, _, output_text in batch]

    # BERT's tokenizer has no eos_token (it is None); it adds [CLS] and [SEP] itself
    inputs = tokenizer(prompts, padding='max_length', truncation=True,
                       max_length=max_len, return_tensors='pt')
    target_enc = tokenizer(targets, padding='max_length', truncation=True,
                           max_length=max_len, return_tensors='pt')

    # Labels must have the same shape as input_ids; -100 marks padding so the loss ignores it.
    # Aligning targets position by position is a simplification for this experiment.
    labels = target_enc['input_ids'].clone()
    labels[target_enc['attention_mask'] == 0] = -100

    return {'input_ids': inputs['input_ids'],
            'attention_mask': inputs['attention_mask'],
            'labels': labels}

# Create custom dataset and dataloader
custom_dataset = CustomDataset(train_data)
custom_dataloader = DataLoader(custom_dataset, batch_size=2, collate_fn=custom_collate)

# Fine-tune 
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

In [15]:
model.train()  # Set model to training mode

for epoch in range(3):
    for item in train_encodings:  
        # Keep only the first 3 tokens of each sequence so input_ids and labels
        # have matching shapes (this slices sequence length, not batch size)
        input_ids = item['input_ids'][:, :3]
        labels = item.get('labels', None)
        if labels is not None:
            labels = labels[:, :3]

        outputs = model(input_ids, labels=labels)  
        loss = outputs.loss
        loss.backward() 
        optimizer.step()
        optimizer.zero_grad()

model.eval()  # Set model to evaluation mode

NameError: name 'train_encodings' is not defined

In [14]:
model.train()  # Set model to training mode

for epoch in range(3):
    for batch in custom_dataloader:  
        input_ids = batch['input_ids']
        attention_mask = batch['attention_mask']
        labels = batch['labels']

        outputs = model(input_ids, attention_mask=attention_mask, labels=labels)  
        loss = outputs.loss
        loss.backward() 
        optimizer.step()
        optimizer.zero_grad()

model.eval()  # Set model to evaluation mode

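The collate step above has to turn sequences of different lengths into one rectangular batch. A minimal, framework-free sketch of that idea follows: pad every sequence to the longest one, record which positions are real tokens in an attention mask, and set padded label positions to -100, the value PyTorch's `CrossEntropyLoss` ignores by default. The helper name `pad_batch` and the token ids are hypothetical (apart from 101 and 102, which are `[CLS]` and `[SEP]` in `bert-base-uncased`), and the labels simply mirror the inputs as a placeholder.

```python
def pad_batch(sequences, pad_id=0, ignore_index=-100):
    """Pad token-id lists to a common length and build masks and labels."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask, labels = [], [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding
        attention_mask.append([1] * len(seq) + [0] * n_pad)
        # Padded label positions are excluded from the loss
        labels.append(seq + [ignore_index] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}

# Two hypothetical sequences of different lengths
batch = pad_batch([[101, 7592, 102], [101, 7592, 2088, 999, 102]])
for name, rows in batch.items():
    print(name, rows)
```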
In [None]:
# Sample text generation  
instruction = 'Summarize this'
text = 'The quick brown fox jumped over the lazy dog'
input_ids = tokenizer.encode(instruction + tokenizer.mask_token + text,  
                             return_tensors='pt') 
outputs = model(input_ids)
predictions = torch.argmax(outputs.logits, dim=2)
generated_text = tokenizer.decode(predictions[0], skip_special_tokens=True)
print(generated_text)

In [16]:
# Tokenize and encode data
def encode(instruction, input, output=None):
    input_ids = tokenizer.encode(instruction + tokenizer.sep_token + input, 
                                 return_tensors='pt')
    if output:
        output_ids = tokenizer.encode(output, return_tensors='pt') 
    return {'input_ids': input_ids, 'labels': output_ids} if output else {'input_ids': input_ids}

In [17]:
# Tokenize and encode data
def encode(instruction, input, output=None):
    # Truncate to BERT's 512-token limit so long samples cannot cause indexing errors
    input_ids = tokenizer.encode(instruction + tokenizer.mask_token + input, 
                                 return_tensors='pt', truncation=True, max_length=512)
    if output:
        output_ids = tokenizer.encode(output, return_tensors='pt',
                                      truncation=True, max_length=512) 
        return {'input_ids': input_ids, 'labels': output_ids}
    return {'input_ids': input_ids}

train_encodings = [encode(ins, inp, out) for ins, inp, out in train_data]

# Fine-tune 
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

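BERT checkpoints such as `bert-base-uncased` have 512 position embeddings, so any longer sequence has to be shortened before it reaches the model. The sketch below shows one simple way to do that by hand: keep as many tokens as fit and reserve two slots for `[CLS]` and `[SEP]`. The helper `truncate_ids` and the input ids are hypothetical; in practice the tokenizer's own `truncation=True, max_length=512` options do this for you.

```python
def truncate_ids(token_ids, max_length=512, cls_id=101, sep_id=102):
    """Wrap token ids in [CLS]/[SEP] and cut the body so the total fits max_length."""
    body = token_ids[: max_length - 2]  # leave room for the two special tokens
    return [cls_id] + body + [sep_id]

# 700 hypothetical token ids, longer than the 512-token limit
long_ids = list(range(1000, 1700))
short_ids = list(range(1000, 1010))

truncated = truncate_ids(long_ids)
untouched = truncate_ids(short_ids)
print(len(truncated), len(untouched))
```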

In [18]:
model.train()  # Set model to training mode

for epoch in range(3):
    for item in train_encodings:  
        # Keep only the first 3 tokens of each sequence so input_ids and labels
        # have matching shapes (this slices sequence length, not batch size)
        input_ids = item['input_ids'][:, :3]
        labels = item.get('labels', None)
        if labels is not None:
            labels = labels[:, :3]

        outputs = model(input_ids, labels=labels)  
        loss = outputs.loss
        loss.backward() 
        optimizer.step()
        optimizer.zero_grad()

model.eval()  # Set model to evaluation mode

BertForMaskedLM(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_a
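
The training loop above slices its tensors with `[:, :3]`. On a tensor of shape (batch, sequence length) that keeps every row and only the first three columns, so it shortens each sequence rather than changing how many sequences are in the batch. The nested-list sketch below mirrors the two kinds of slicing; the token ids are hypothetical.

```python
# One hypothetical batch of 2 sequences with 5 token ids each: shape (2, 5)
batch = [
    [101, 7592, 2088, 999, 102],
    [101, 2054, 2003, 2023, 102],
]

# Equivalent of tensor[:, :3]: all sequences, first 3 tokens of each
first_three_tokens = [row[:3] for row in batch]

# Equivalent of tensor[:1]: first sequence only, all of its tokens
first_sequence = batch[:1]

print(len(first_three_tokens), len(first_three_tokens[0]))  # 2 3
print(len(first_sequence), len(first_sequence[0]))          # 1 5
```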

In [22]:
# Sample text generation  
instruction = 'Summarize sds'
text = 'The quick brown fox jumped over the lazy dogs'
input_ids = tokenizer.encode(instruction + tokenizer.sep_token + text,  
                             return_tensors='pt')
# outputs = model(input_ids)
# predictions = torch.argmax(outputs.logits, dim=2)
# generated_text = tokenizer.decode(predictions[0], skip_special_tokens=True)
# print(generated_text)

generated = model.generate(input_ids)
# generated
print(tokenizer.decode(generated[0]))

[CLS] summarize sds [SEP] the quick brown fox jumped over the lazy dogs [SEP] # # #


In [None]:
# Tokenize
# Assumption: inputs/outputs come from the instruction, input, and output columns of df;
# only the first 1000 samples are used
subset = df.iloc[:1000]
inputs = (subset["instruction"] + " " + subset["input"]).tolist()
outputs = subset["output"].tolist()

tokenized_data = tokenizer(inputs, padding="max_length", truncation=True, max_length=32)

# Without return_tensors the tokenizer returns plain Python lists, one per sample
input_ids = tokenized_data["input_ids"]
attention_mask = tokenized_data["attention_mask"]

fine_tune_data = pd.DataFrame({
    "input_ids": input_ids,
    "attention_mask": attention_mask,
    "labels": outputs
})

fine_tune_data.to_csv("fine_tune_data.csv", index=False)

In [None]:
tokenized_data = tokenizer.batch_encode_plus(inputs,
                                            padding="max_length",
                                            truncation=True,
                                            max_length=32,
                                            return_attention_mask=True,
                                            return_tensors="pt")

input_ids = tokenized_data["input_ids"].detach().cpu().numpy().tolist()
attention_mask = tokenized_data["attention_mask"].detach().cpu().numpy().tolist()
# print(len(input_ids), len(attention_mask))

# Flatten 2D array to 1D
input_ids = [item for sublist in input_ids for item in sublist]  
attention_mask = [item for sublist in attention_mask for item in sublist]

print(len(input_ids), len(attention_mask), len(outputs))

# fine_tune_data = pd.DataFrame({
#     "input_ids": input_ids,
#     "attention_mask": attention_mask, 
#     "labels": outputs
# })

# fine_tune_data.to_csv("fine_tune_data.csv", index=False)
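
Flattening the token ids, as in the cell above, produces one long list whose length no longer matches the number of labels, which is why the three columns cannot be combined into a table afterwards. The following sketch, using hypothetical ids and labels, contrasts the flattened list with keeping one list of ids per sample.

```python
# 4 hypothetical samples, each padded to 32 token ids
rows = [[i] * 32 for i in range(4)]
labels = ["a", "b", "c", "d"]

# Flattening merges all samples into a single list of 4 * 32 ids
flat = [token for row in rows for token in row]
print(len(flat), len(labels))  # 128 4

# Keeping one list per sample preserves the one-to-one pairing with labels
records = [{"input_ids": row, "labels": label} for row, label in zip(rows, labels)]
print(len(records), len(records[0]["input_ids"]))  # 4 32
```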

In [None]:
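# Set up the training arguments
import datasets
from sklearn.metrics import accuracy_score
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=64,
    evaluation_strategy="epoch",  # recent transformers releases name this eval_strategy
    learning_rate=5e-5,
    save_total_limit=2,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    greater_is_better=True,
    save_strategy="epoch",  # must match the evaluation strategy when load_best_model_at_end=True
    save_on_each_node=True,
)

# Trainer expects an indexable dataset of dicts, so convert the DataFrame.
# The "labels" column must hold integer class ids in [0, num_labels).
train_dataset = datasets.Dataset.from_pandas(fine_tune_data)

# Set up the Trainer
trainer = Trainer(
    model=AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=8),
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=train_dataset,  # placeholder so per-epoch evaluation can run; use a held-out split in practice
    tokenizer=tokenizer,
    compute_metrics=lambda pred: {"accuracy": accuracy_score(pred.label_ids, pred.predictions.argmax(-1))},
)

# Fine-tune the model
trainer.train()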

In [None]:
# Training loop
# Assumption: batches come from custom_dataloader defined earlier
num_epochs = 3
criterion = torch.nn.CrossEntropyLoss()  # ignores label positions set to -100

for epoch in range(num_epochs):
    for step, batch in enumerate(custom_dataloader):
        # Move batch to the selected device
        batch = {k: v.type(torch.long).to(device) for k, v in batch.items()}
        
        # Forward pass
        outputs = model(**batch)
        loss = criterion(outputs.logits.view(-1, model.config.vocab_size), 
                         batch['labels'].view(-1))
        
        # Backward pass and optimization
        optimizer.zero_grad()  
        loss.backward()
        optimizer.step()
        
        # Logging
        if step % 100 == 0:
            print(f'Epoch: {epoch} | Loss: {loss.item()}')

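The loop above follows the usual cycle of forward pass, loss computation, backward pass, and parameter update. To make that cycle concrete without any framework, here is a minimal sketch that fits a single weight `w` in `y = w * x` by gradient descent on a mean squared error. The data, learning rate, and the `train` helper are hypothetical choices for illustration; the real loop delegates the gradient and update steps to `loss.backward()` and `optimizer.step()`.

```python
def train(xs, ys, lr=0.05, epochs=50):
    """Fit y = w * x with plain gradient descent; return w and the loss per epoch."""
    w = 0.0
    history = []
    for _ in range(epochs):
        # Forward pass: predictions and mean squared error
        preds = [w * x for x in xs]
        loss = sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(xs)
        # Backward pass: derivative of the loss with respect to w
        grad = sum(2 * (p - y) * x for p, y, x in zip(preds, ys, xs)) / len(xs)
        # Optimizer step: move w against the gradient
        w -= lr * grad
        history.append(loss)
    return w, history

# Hypothetical data generated by y = 2 * x
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]
w, history = train(xs, ys)
print(f"w = {w:.4f}, first loss = {history[0]:.4f}, last loss = {history[-1]:.2e}")
```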
In [None]:
# Save fine-tuned model
output_dir = '/path/to/save/' 
model.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)

In [None]:
df = df.iloc[:1000]

In [None]:
# Tokenize all rows at once; padding and truncation give fixed-length tensors
max_length = 32
input_ids = tokenizer(df['instruction'].values.tolist(), padding='max_length',
                      truncation=True, max_length=max_length, return_tensors='pt')['input_ids']
output_ids = tokenizer(df['output'].values.tolist(), padding='max_length',
                       truncation=True, max_length=max_length, return_tensors='pt')['input_ids']

In [None]:
input_ids

In [None]:
# device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
device = torch.device("cpu")

In [None]:
# The tokenizer has already padded both tensors to max_length
padded_input_ids = input_ids.to(device)
padded_output_ids = output_ids.to(device)

In [None]:
# Create a training loop
# device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
# BertForMaskedLM adds a token-prediction head, so its logits can be scored against token ids
model = BertForMaskedLM.from_pretrained(model_name).to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)
loss_fn = torch.nn.CrossEntropyLoss()

In [None]:
for epoch in range(5):
    model.train()
    total_loss = 0
    for batch_idx, (input_ids, output_ids) in enumerate(zip(padded_input_ids, padded_output_ids)):
        # Each row is one sample: use integer ids and add a batch dimension of 1
        input_ids = input_ids.long().unsqueeze(0).to(device)
        output_ids = output_ids.long().unsqueeze(0).to(device)
        optimizer.zero_grad()
        outputs = model(input_ids)
        # CrossEntropyLoss expects (tokens, vocab) logits and (tokens,) targets
        loss = loss_fn(outputs.logits.view(-1, model.config.vocab_size), output_ids.view(-1))
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    print(f'Epoch {epoch+1}, Loss: {total_loss / len(df)}')

# Evaluate the model
model.eval()
test_loss = 0
correct = 0

In [None]:
with torch.no_grad():
    for batch_idx, (input_ids, output_ids) in enumerate(zip(padded_input_ids, padded_output_ids)):
        input_ids = input_ids.long().unsqueeze(0).to(device)
        output_ids = output_ids.long().unsqueeze(0).to(device)
        outputs = model(input_ids)
        loss = loss_fn(outputs.logits.view(-1, model.config.vocab_size), output_ids.view(-1))
        test_loss += loss.item()
        # Most likely token at every position
        predicted = outputs.logits.argmax(dim=-1)
        correct += (predicted == output_ids).sum().item()

# correct counts matching tokens, so normalize by the total number of tokens
accuracy = correct / padded_output_ids.numel()
print(f'Test Loss: {test_loss / len(df)}')
print(f'Accuracy: {accuracy:.4f}')

In [None]:
# Load your dataset
df = pd.read_csv("your_dataset.csv")

# Tokenize the text data; without return_tensors the result is plain Python lists,
# which can be stored one list per DataFrame row
tokenized_data = tokenizer.batch_encode_plus(df["text"].tolist(),
                                            padding="max_length",
                                            truncation=True,
                                            max_length=32,
                                            return_attention_mask=True)

# Create a CSV file for fine-tuning
fine_tune_data = pd.DataFrame({"input_ids": tokenized_data["input_ids"],
                             "attention_mask": tokenized_data["attention_mask"],
                             "labels": df["label"]})
fine_tune_data.to_csv("fine_tune_data.csv", index=False)

Finally, you can fine-tune the Transformer model using the Trainer class from the transformers library:

In [None]:
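import datasets
from sklearn.metrics import accuracy_score
from transformers import Trainer, TrainingArguments

# Set up the training arguments
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=64,
    evaluation_strategy="epoch",  # recent transformers releases name this eval_strategy
    learning_rate=5e-5,
    save_total_limit=2,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    greater_is_better=True,
    save_strategy="epoch",  # must match the evaluation strategy when load_best_model_at_end=True
    save_on_each_node=True,
)

# Trainer expects an indexable dataset of dicts, so convert the DataFrame
train_dataset = datasets.Dataset.from_pandas(fine_tune_data)

# Set up the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=train_dataset,  # placeholder so per-epoch evaluation can run; use a held-out split in practice
    tokenizer=tokenizer,
    compute_metrics=lambda pred: {"accuracy": accuracy_score(pred.label_ids, pred.predictions.argmax(-1))},
)

# Fine-tune the model
trainer.train()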

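The `compute_metrics` callback above reduces each row of class scores to its highest-scoring class and compares the result with the true labels. A framework-free sketch of that calculation is shown here; the logits and labels are hypothetical, and `accuracy_from_logits` is an illustrative stand-in for the combination of `argmax(-1)` and `accuracy_score`.

```python
def accuracy_from_logits(logits, label_ids):
    """Pick the highest-scoring class per row and compare with the true labels."""
    predicted = [max(range(len(row)), key=row.__getitem__) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predicted, label_ids))
    return {"accuracy": correct / len(label_ids)}

# Hypothetical scores for 3 samples over 3 classes
logits = [
    [0.1, 2.3, -1.0],  # predicted class 1
    [1.5, 0.2, 0.3],   # predicted class 0
    [0.0, 0.1, 0.9],   # predicted class 2
]
label_ids = [1, 0, 1]

metrics = accuracy_from_logits(logits, label_ids)
print(metrics)
```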
Once the fine-tuning process is complete, you can use the trained model to make predictions on new, unseen text data. Here's an example of how you could do this using the transformers library:

In [None]:
# Load the trained model (point this at your fine-tuned checkpoint directory;
# "bert-base-uncased" on its own has an untrained classification head)
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=8)
model.eval()

# Tokenize the input text
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
input_text = "This is a sample input text."
input_ids = tokenizer.encode(input_text, return_tensors="pt")

# Make predictions on the input text (no gradients are needed for inference)
with torch.no_grad():
    predictions = model(input_ids)

# The model returns an output object; the class scores are in .logits
predicted_label = predictions.logits.argmax(-1).item()

# Print the predicted label
print(f"Predicted label: {predicted_label}")

In this example, we first load the trained model and tokenizer. We then tokenize the input text and pass it through the model to get the predicted label.

You can also use the predict method of the Trainer class to make predictions on new data. Here's an example of how you could do this:

In [None]:
import datasets

# Load the trained model
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=8)

# Tokenize the input text; calling the tokenizer returns input_ids and the matching attention_mask
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
input_text = "This is a sample input text."
encoding = tokenizer(input_text, max_length=32, padding="max_length", truncation=True)

# Create a one-row dataset for the input text
input_dataset = datasets.Dataset.from_dict({key: [value] for key, value in encoding.items()})

# Set up the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=fine_tune_data,
    tokenizer=tokenizer,
    compute_metrics=lambda pred: {"accuracy": accuracy_score(pred.label_ids, pred.predictions.argmax(-1))},
)

# Make predictions on the input text
predictions = trainer.predict(input_dataset)

# predict returns an output object; the class scores are in .predictions
predicted_label = predictions.predictions.argmax(-1)

# Print the predicted label
print(f"Predicted label: {predicted_label}")

In this example, we first load the trained model and tokenizer. We then tokenize the input text and create a dataset for the input text. We set up the Trainer and use the predict method to make predictions on the input text. Finally, we get the predicted label and print it to the console.

Keep in mind that how you use the trained model will depend on your use case and the format of your data. You may need to modify the code to fit your needs.

You can also use a BERT-based model as the starting point for language translation tasks. BERT has been pre-trained on a large corpus of text data and has learned to encode language in a way that is useful for a wide range of NLP tasks. Note that BERT is an encoder-only model, so on its own it does not generate text.

To use a BERT-based model for language translation, you would need to fine-tune it on a large parallel corpus of text in the source language and the target language. This would involve adding a new output component on top of the pre-trained BERT model (for translation, a decoder that generates target-language tokens) and training the whole network to predict the correct translation in the target language.

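Translation differs from classification in that the output is a sequence produced one token at a time. The sketch below illustrates that decoding loop in its simplest greedy form: repeatedly ask for the next token given the source and what has been generated so far, and stop at an end marker. Everything in it is hypothetical: `greedy_decode`, the `<eos>` marker, and especially `toy_next_token`, a word-for-word lookup that stands in for a trained decoder and is not how a real model chooses tokens.

```python
def greedy_decode(next_token, source, max_len=10, eos="<eos>"):
    """Generate tokens one at a time until the end marker or max_len is reached."""
    output = []
    for _ in range(max_len):
        token = next_token(source, output)
        if token == eos:
            break
        output.append(token)
    return output

# Hypothetical word-for-word lexicon standing in for a trained decoder
TOY_LEXICON = {"the": "la", "house": "maison", "is": "est", "blue": "bleue"}

def toy_next_token(source, generated):
    # Emit one target word per source word, then signal the end of the sequence
    if len(generated) >= len(source):
        return "<eos>"
    return TOY_LEXICON.get(source[len(generated)], "<unk>")

translation = greedy_decode(toy_next_token, "the house is blue".split())
print(" ".join(translation))
```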
Here's a simplified example of the fine-tuning setup using the transformers library. It uses a sequence-classification head, so it predicts a label for each sentence rather than generating a translation:

In [None]:
import datasets

# Load the pre-trained BERT model
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=8)

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Define the training sentences
sentences = [
    "This is a sample input sentence in English",
    "This is a sample input sentence in Spanish",
]

# Define the labels
labels = [1, 1]

# Tokenize; calling the tokenizer returns input_ids and the matching attention_mask
encodings = tokenizer(sentences, max_length=32, padding="max_length", truncation=True)

# Create a dataset from the training data (datasets.Dataset is used here because
# Trainer cannot index a pandas DataFrame row by row)
train_dataset = datasets.Dataset.from_dict({
    "input_ids": encodings["input_ids"],
    "attention_mask": encodings["attention_mask"],
    "label": labels,
})

# Set up the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=train_dataset,  # placeholder so per-epoch evaluation can run; use a held-out split in practice
    tokenizer=tokenizer,
    compute_metrics=lambda pred: {"accuracy": accuracy_score(pred.label_ids, pred.predictions.argmax(-1))},
)

# Fine-tune the model
trainer.train()

In [None]:

import datasets

# Use the trained model on new text.
# Note: a sequence-classification head returns a class index, not target-language text.
def translate(input_text):
    encoding = tokenizer(input_text, max_length=32, padding="max_length", truncation=True)
    input_dataset = datasets.Dataset.from_dict({key: [value] for key, value in encoding.items()})
    predictions = trainer.predict(input_dataset)
    # predict returns an output object; the class scores are in .predictions
    predicted_label = predictions.predictions.argmax(-1)
    return predicted_label

# Test the model
input_text = "I love to travel to new places."
translated_text = translate(input_text)
print(f"Predicted label: {translated_text}")