### Load Data
Create a folder in Google Drive and place your data text file in it. Alternatively, host the file elsewhere and load it from its URL.

https://sites.psu.edu/hcai/files/2023/04/monica.txt
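If you go the URL route, the file first has to be copied onto the Colab disk. The following is a minimal sketch using only the standard library; the helper name `download_text` and the destination path in the commented call are assumptions for illustration, not part of the original notebook.

```python
import urllib.request
from pathlib import Path


def download_text(url, dest_path):
    """Download `url` to `dest_path` and return the number of characters saved."""
    dest = Path(dest_path)
    # Create the target folder if it does not exist yet.
    dest.parent.mkdir(parents=True, exist_ok=True)
    urllib.request.urlretrieve(url, str(dest))
    # Read the file back as UTF-8 to confirm it is usable text.
    return len(dest.read_text(encoding="utf-8"))


# Example call (hypothetical destination path, not executed here):
# download_text("https://sites.psu.edu/hcai/files/2023/04/monica.txt",
#               "/content/monica.txt")
```

The returned character count gives a quick check that the download is not empty before you move on to preprocessing.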

In [1]:
from google.colab import drive

In [2]:
import re

In [3]:
drive.mount('/content/drive')

Mounted at /content/drive


In [4]:
file_name = '/content/sherlock_dialogues.txt'

### Data Preprocessing
Sample cleaning approach.

In [6]:
with open(file_name, 'r') as f:
  data = f.read()

In [7]:
len(data)

1066771

In [8]:
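def cleaning(s):
    s = str(s)
    # Raw strings keep the regex escapes (\s, \W, \d) intact.
    s = re.sub(r'\s\W', ' ', s)      # whitespace followed by a non-word character
    s = re.sub(r'\W,\s', ' ', s)     # non-word character, comma, whitespace
    s = re.sub(r'\d+', '', s)        # digits
    s = re.sub(r'\s+', ' ', s)       # collapse runs of whitespace
    s = re.sub(r'[!@#$_]', '', s)    # selected symbols
    # Note: str.replace matches literal substrings, so this also strips
    # "co" from inside ordinary words (e.g. "consent" becomes "nsent").
    s = s.replace("co", "")
    s = s.replace("https", "")
    # Literal match of the four characters "[\w*"; this is not a regex.
    s = s.replace(r"[\w*", " ")
    return s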
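Because `s.replace("co", "")` is a plain substring replacement, it also removes "co" inside ordinary words, which is why the cleaned sample printed later in this section contains "nsent" and "mrades". If the intent was to drop link fragments such as `https://t.co/...`, one possible alternative is to remove whole URLs with a regular expression. The function name `cleaning_v2` and the sample sentence are hypothetical, and this sketch is not guaranteed to match the original author's intent.

```python
import re


def cleaning_v2(s):
    """Hypothetical variant of cleaning() that leaves 'co' inside words alone."""
    s = str(s)
    s = re.sub(r"https?://\S+", " ", s)  # drop whole URLs, not the substrings "https"/"co"
    s = re.sub(r"\d+", "", s)            # remove digits, as in the original
    s = re.sub(r"[!@#$_]", "", s)        # remove the same symbol set as the original
    s = re.sub(r"\s+", " ", s)           # collapse whitespace last
    return s.strip()


sample = "The microscopic examination   for blood corpuscles, see https://t.co/abc 42 times!"
print(cleaning_v2(sample))
# → The microscopic examination for blood corpuscles, see times
```

Running the whitespace collapse last means gaps left by removed URLs and digits are also squeezed into single spaces.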


In [9]:
data = cleaning(data)

In [10]:
len(data)

992724
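The two lengths above show how much text the cleaning step removed. A quick calculation, using only the numbers printed in this notebook:

```python
chars_before = 1066771  # len(data) before cleaning
chars_after = 992724    # len(data) after cleaning

removed = chars_before - chars_after
share = 100 * removed / chars_before
print(f"removed {removed} characters ({share:.2f}% of the original)")
# → removed 74047 characters (6.94% of the original)
```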

In [11]:
data[:1000]

"I'm awful frightened, Uniform away for repairs. All right--noon exactly, I should have more faith, At the end of that time she shall give her answer. Beautiful beautiful The old Guiacum test was very clumsy and uncertain. So is the microspic examination for blood rpuscles. The latter is valueless if the stains are a few hours old. Now, this appears to act as well whether the blood is old or new. Had this test been invented, there are hundreds of men now walking the earth who would long ago have paid the penalty of their crimes. He has given his nsent, provided we get these mines working all right. I have no fear on that head. I ought to be more case-hardened after my Afghan experiences. I saw my own mrades hacked to pieces at Maiwand without losing my nerve. The plot thickens, There can't be any number of Injuns here, I have had no time for bite or sup for eight-and-forty hours. It was magnificent, From their lightness and transparency, I should imagine that they are soluble in water,

### Model Training
You can also use the server file to create a pipeline for your language model. The following instructions can guide you through the process.

In [12]:
!pip install transformers



In [13]:
from transformers import TextDataset, DataCollatorForLanguageModeling
from transformers import GPT2Tokenizer, GPT2LMHeadModel
from transformers import Trainer, TrainingArguments


#### Sample pipeline for fine-tuning a language model

Steps Overview:

1. Load Tokenizer
2. Load and Tokenize Dataset
3. Load Data Collator
4. Load Pre-trained Model
5. Set Training Arguments
6. Initialize Trainer
7. Train the Model
8. Save Fine-tuned Model




In [14]:
def load_dataset(file_path, tokenizer, block_size = 128):
    dataset = TextDataset(
        tokenizer = tokenizer,
        file_path = file_path,
        block_size = block_size,
    )
    return dataset


def load_data_collator(tokenizer, mlm = False):
    data_collator = DataCollatorForLanguageModeling(
        tokenizer=tokenizer,
        mlm=mlm,
    )
    return data_collator


def train(train_file_path, model_name,
          output_dir,
          overwrite_output_dir,
          per_device_train_batch_size,
          num_train_epochs,
          save_steps):
    # Load the tokenizer, then build the dataset and collator from it
    tokenizer = GPT2Tokenizer.from_pretrained(model_name)
    train_dataset = load_dataset(train_file_path, tokenizer)
    data_collator = load_data_collator(tokenizer)

    # Save the tokenizer so inference can load it from output_dir
    tokenizer.save_pretrained(output_dir)

    model = GPT2LMHeadModel.from_pretrained(model_name)

    # Saves the pre-trained weights; trainer.save_model() below overwrites them
    model.save_pretrained(output_dir)

    training_args = TrainingArguments(
        output_dir=output_dir,
        overwrite_output_dir=overwrite_output_dir,
        per_device_train_batch_size=per_device_train_batch_size,
        num_train_epochs=num_train_epochs,
        save_steps=save_steps,  # write a checkpoint every save_steps steps
    )

    trainer = Trainer(
        model=model,
        args=training_args,
        data_collator=data_collator,
        train_dataset=train_dataset,
    )

    trainer.train()
    trainer.save_model()  # writes the fine-tuned model to output_dir
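The `block_size` argument of `load_dataset` controls how long each training example is, measured in tokens. As a rough mental model (an assumption about the behavior, not the library's actual code), the tokenized file is cut into consecutive, non-overlapping blocks and any shorter tail is dropped. The helper `chunk_token_ids` below is hypothetical and works on a plain list of integers.

```python
def chunk_token_ids(token_ids, block_size=128):
    """Split a flat list of token ids into full blocks of `block_size` tokens.

    A trailing remainder shorter than `block_size` is discarded.
    """
    last_start = len(token_ids) - block_size + 1
    return [token_ids[i:i + block_size] for i in range(0, last_start, block_size)]


# Ten toy "tokens" with a block size of 4: two full blocks, two tokens dropped.
print(chunk_token_ids(list(range(10)), block_size=4))
# → [[0, 1, 2, 3], [4, 5, 6, 7]]
```

Under this model, a larger `block_size` gives the model more context per example but produces fewer examples and uses more memory per batch.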


#### Model Parameters
1. Path to your training data file
2. Output directory for the trained model and tokenizer (a local Colab path here; the final cell copies it to Google Drive)
3. Model training parameters

In [15]:
# Set these parameters for your own data and model
train_file_path = '/content/sherlock_dialogues.txt'
model_name = 'gpt2'
output_dir = '/content/HolmesModel'

overwrite_output_dir = False
per_device_train_batch_size = 8
num_train_epochs = 2.0
save_steps = 500
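To judge whether `save_steps` is sensible for a given run, it helps to estimate the total number of optimizer steps. The sketch below assumes a single device and no gradient accumulation; the function name and the example count of 2000 training blocks are hypothetical, since the notebook does not report the dataset size.

```python
import math


def estimate_total_steps(num_examples, batch_size, num_epochs):
    """Rough step count for one device without gradient accumulation."""
    batches_per_epoch = math.ceil(num_examples / batch_size)
    return math.ceil(batches_per_epoch * num_epochs)


# Hypothetical dataset of 2000 blocks with the settings from the cell above.
total = estimate_total_steps(num_examples=2000, batch_size=8, num_epochs=2.0)
print(total)
# → 500
```

If the estimate is smaller than `save_steps`, no intermediate checkpoint is written during training and only the final save remains.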


In [16]:
# Training takes about 30 minutes in Colab.
train(
    train_file_path=train_file_path,
    model_name=model_name,
    output_dir=output_dir,
    overwrite_output_dir=overwrite_output_dir,
    per_device_train_batch_size=per_device_train_batch_size,
    num_train_epochs=num_train_epochs,
    save_steps=save_steps
)


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`

wandb: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.
wandb: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
wandb: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter:
wandb: No netrc file found, creating one.
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc
wandb: Currently logged in as: sivanid2606 (sivanid2606-penn-state) to https://api.wandb.ai. Use `wandb login --relogin` to force relogin


`loss_type=None` was set in the config but it is unrecognised.Using the default loss: `ForCausalLMLoss`.


| Step | Training Loss |
|------|---------------|
| 500  | 2.9486        |
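The training output above pauses to ask for a Weights & Biases API key. If you do not want experiment tracking, one option is to put wandb into its disabled mode before calling `train(...)`. This is a sketch under the assumption that the `WANDB_MODE` environment variable is honored by the installed wandb version; passing `report_to="none"` to `TrainingArguments` is another way to switch off logging integrations.

```python
import os

# Assumption: wandb reads WANDB_MODE at start-up; "disabled" turns logging into a no-op,
# so the interactive API key prompt should not appear.
os.environ["WANDB_MODE"] = "disabled"

print(os.environ["WANDB_MODE"])
```

Set the variable in a cell that runs before training starts; changing it after the run has begun has no effect on that run.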


### Model Inference
Load the fine-tuned model and tokenizer from the output directory, then generate a response to a prompt.

In [17]:
from transformers import GPT2LMHeadModel, GPT2Tokenizer


In [18]:
def load_model(model_path):
    model = GPT2LMHeadModel.from_pretrained(model_path)
    return model


def load_tokenizer(tokenizer_path):
    tokenizer = GPT2Tokenizer.from_pretrained(tokenizer_path)
    return tokenizer


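def generate_text(sequence, max_length):
    # Reload the fine-tuned model and tokenizer saved in output_dir
    model_path = output_dir
    model = load_model(model_path)
    tokenizer = load_tokenizer(model_path)
    ids = tokenizer.encode(sequence, return_tensors='pt')
    final_outputs = model.generate(
        ids,
        do_sample=True,         # sample instead of greedy decoding
        max_length=max_length,  # total length in tokens, prompt included
        pad_token_id=model.config.eos_token_id,
        top_k=50,               # keep the 50 most likely tokens
        top_p=0.95,             # then keep the smallest set covering 95% probability
    )
    print(tokenizer.decode(final_outputs[0], skip_special_tokens=True))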

In [19]:
sequence = 'where is Sherlock holmes?'
max_len = 150
generate_text(sequence, max_len)

The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
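The warning above appears because `generate` receives only the token ids. Calling the tokenizer directly returns both `input_ids` and `attention_mask`, and the mask can then be passed along. The function below is a sketch with a hypothetical name, `generate_with_mask`; it takes an already loaded model and tokenizer as arguments and returns the text instead of printing it.

```python
def generate_with_mask(model, tokenizer, sequence, max_length):
    """Generate text while passing an explicit attention mask to `generate`."""
    # The tokenizer call returns a mapping with input_ids and attention_mask.
    encoded = tokenizer(sequence, return_tensors="pt")
    outputs = model.generate(
        encoded["input_ids"],
        attention_mask=encoded["attention_mask"],
        do_sample=True,
        max_length=max_length,
        pad_token_id=model.config.eos_token_id,
        top_k=50,
        top_p=0.95,
    )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)
```

With the helpers defined earlier, a call might look like `generate_with_mask(load_model(output_dir), load_tokenizer(output_dir), sequence, max_len)`. Loading the model once and reusing it also avoids reloading the weights on every call.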


where is Sherlock holmes?
Yes, indeed. I see him here, Holmes, in the room under the light-
     lamp. Sherlock?
You are the police,
You don't need me to tell you all about the affair,
I am not sure of the
     effect of your confession in the case at present,
My dear Watson, it is a pity that your daughter and I could never see each other
    again. It has always been my duty to get to know him, and I have always
     been glad to give him good words. I am sure that you can see the
     point in the story with a certain


In [24]:
import shutil

# Copy the model directory to your Drive (dirs_exist_ok lets the cell be re-run)
shutil.copytree('/content/HolmesModel', '/content/drive/MyDrive/HolmesModel', dirs_exist_ok=True)


'/content/drive/MyDrive/HolmesModel'
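Before closing the session, it can be worth confirming that the copy on Drive is complete. The check below is a sketch: the helper name is hypothetical, and the list of expected files is an assumption based on what a GPT-2 model and tokenizer typically save. The weights file is left out because its name varies between library versions.

```python
import os

# Assumed file names written by save_pretrained for a GPT-2 model and tokenizer.
EXPECTED_FILES = ("config.json", "vocab.json", "merges.txt")


def missing_model_files(model_dir, expected=EXPECTED_FILES):
    """Return the expected file names that are not present in `model_dir`."""
    return [
        name for name in expected
        if not os.path.isfile(os.path.join(model_dir, name))
    ]


# Example call (not executed here):
# missing_model_files('/content/drive/MyDrive/HolmesModel')
```

An empty list means every expected file was found; any names returned point to files that did not make it into the Drive folder.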