https://www.youtube.com/watch?v=elUCn_TFdQc
https://github.com/Pawandeep-prog/finetuned-gpt2-convai

In [16]:

from transformers import GPT2LMHeadModel, GPT2Tokenizer
from torch.optim import Adam
from torch.utils.data import DataLoader
import tqdm
import torch

In [17]:
import json
import urllib.error
import urllib.request

def open_json_from_url(url):
    """Download a JSON document from url; return None if the request fails."""
    try:
        with urllib.request.urlopen(url) as response:
            return json.loads(response.read().decode())
    except urllib.error.URLError as e:
        print("Error:", e)
        return None

In [18]:
device = "cuda" if torch.cuda.is_available() else "mps" if torch.backends.mps.is_available() else "cpu"

In [19]:

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.add_special_tokens({"pad_token": "<pad>", 
                                "bos_token": "<startofstring>",
                                "eos_token": "<endofstring>"})
tokenizer.add_tokens(["<bot>:"])

model = GPT2LMHeadModel.from_pretrained("gpt2")
model.resize_token_embeddings(len(tokenizer))

model = model.to(device)

In [20]:
from torch.utils.data import Dataset

class ChatData(Dataset):
    def __init__(self, path:str, tokenizer):
        self.data = open_json_from_url(path)

        self.X = []
        for i in self.data:
            for j in i['dialog']:
                self.X.append(j['text'])

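        # Pair every turn with the turn that follows it. The list is flat, so
        # a pair can span the end of one dialog and the start of the next.
        self.X = [
            "<startofstring> "+prompt+" <bot>: "+reply+" <endofstring>"
            for prompt, reply in zip(self.X, self.X[1:])
        ]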
        
        print(self.X[0])

        self.X_encoded = tokenizer(self.X,max_length=50, truncation=True, padding="max_length", return_tensors="pt")
        self.input_ids = self.X_encoded['input_ids']
        self.attention_mask = self.X_encoded['attention_mask']

    def __len__(self):
        return len(self.X)

    def __getitem__(self, idx):
        return (self.input_ids[idx], self.attention_mask[idx])
    
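def infer(inp):
    # Wrap the prompt in the same markers used for the training pairs.
    inp = "<startofstring> "+inp+" <bot>: "
    inp = tokenizer(inp, return_tensors="pt")
    X = inp["input_ids"].to(device)
    a = inp["attention_mask"].to(device)
    # Greedy decoding: temperature only takes effect with do_sample=True,
    # so it is omitted here.
    output = model.generate(X, attention_mask=a)
    output = tokenizer.decode(output[0])
    return output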
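
The dataset above flattens every dialog into one list before pairing, so the last turn of one conversation is paired with the first turn of the next. A minimal sketch of pairing within each dialog instead, assuming the same JSON layout (a list of records, each with a `dialog` list of `{'text': ...}` turns). `build_pairs` is a hypothetical helper, not part of the original notebook, and the sample data is made up:

```python
def build_pairs(records):
    """Pair each turn with the reply that follows it, one dialog at a time."""
    pairs = []
    for record in records:
        turns = [turn["text"] for turn in record["dialog"]]
        # zip stops at the shorter sequence, so the last turn of a dialog
        # has no reply and is skipped; pairs never cross into the next dialog.
        for prompt, reply in zip(turns, turns[1:]):
            pairs.append(f"<startofstring> {prompt} <bot>: {reply} <endofstring>")
    return pairs


# Toy data in the assumed layout (made up for illustration).
sample = [
    {"dialog": [{"text": "hi"}, {"text": "hello"}, {"text": "how are you"}]},
    {"dialog": [{"text": "bye"}]},
]
pairs = build_pairs(sample)
for p in pairs:
    print(p)
```

The result could replace `self.X` in `ChatData.__init__` before tokenization; whether it improves the replies on this dataset is untested here.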

In [21]:
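def train(chatData, model, optim):

    epochs = 20

    for i in tqdm.tqdm(range(epochs)):
        for X, a in chatData:
            X = X.to(device)
            a = a.to(device)
            optim.zero_grad()
            # Causal LM loss: the model shifts the labels internally.
            loss = model(X, attention_mask=a, labels=X).loss
            loss.backward()
            optim.step()
        # Checkpoint and print a sample reply after every epoch.
        torch.save(model.state_dict(), "model_state.pt")
        model.eval()   # turn dropout off while generating the sample
        print(infer("hello how are you"))
        model.train()  # back to training mode for the next epoch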




In [22]:

# print(tokenizer.decode(model.generate(**tokenizer("hey i was good at basketball but ",
#                          return_tensors="pt"))[0]))

chatData = ChatData("https://raw.githubusercontent.com/Pawandeep-prog/finetuned-gpt2-convai/main/chat_data.json", tokenizer)
chatData = DataLoader(chatData, batch_size=64)


<startofstring> I love iphone! i just bought new iphone! <bot>: Thats good for you, i'm not very into new tech <endofstring>


In [23]:

model.train()

optim = Adam(model.parameters(), lr=1e-3)

print("training .... ")
train(chatData, model, optim)


training .... 


  0%|                                                                                                                                                                                                                 | 0/20 [00:00<?, ?it/s]Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
  5%|██████████                                                                                                                                                                                               | 1/20 [00:18<05:58, 18.87s/it]

<startofstring> hello how are you <bot>: <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad>


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
 10%|████████████████████                                                                                                                                                                                     | 2/20 [00:37<05:39, 18.85s/it]

<startofstring> hello how are you <bot>: I am a very good? <bot>: I am a very good? <endofstring>


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
 15%|██████████████████████████████▏                                                                                                                                                                          | 3/20 [00:57<05:30, 19.45s/it]

<startofstring> hello how are you <bot>: Hi? <endofstring> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad>


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
 20%|████████████████████████████████████████▏                                                                                                                                                                | 4/20 [01:16<05:07, 19.22s/it]

<startofstring> hello how are you <bot>: i am a huge gamer <bot>: i am a huge gamer <endofstring> <pad> <pad>


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
 25%|██████████████████████████████████████████████████▎                                                                                                                                                      | 5/20 [01:35<04:46, 19.09s/it]

<startofstring> hello how are you <bot>: i am a huge gamer <endofstring> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad>


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
 30%|████████████████████████████████████████████████████████████▎                                                                                                                                            | 6/20 [01:55<04:31, 19.39s/it]

<startofstring> hello how are you <bot>: Hello, how are doing? <endofstring> <pad> <pad> <pad> <pad> <pad> <pad> <pad>


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
 35%|██████████████████████████████████████████████████████████████████████▎                                                                                                                                  | 7/20 [02:14<04:09, 19.22s/it]

<startofstring> hello how are you <bot>: i am not sure what that means. i am a very experienced person


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
 40%|████████████████████████████████████████████████████████████████████████████████▍                                                                                                                        | 8/20 [02:33<03:50, 19.21s/it]

<startofstring> hello how are you <bot>: Hi <endofstring> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad>


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
 45%|██████████████████████████████████████████████████████████████████████████████████████████▍                                                                                                              | 9/20 [02:53<03:32, 19.28s/it]

<startofstring> hello how are you <bot>: i am a huge gamer <endofstring> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad>


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
 50%|████████████████████████████████████████████████████████████████████████████████████████████████████                                                                                                    | 10/20 [03:12<03:13, 19.35s/it]

<startofstring> hello how are you <bot>: hello <endofstring> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad>


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
 55%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████                                                                                          | 11/20 [03:36<03:08, 20.89s/it]

<startofstring> hello how are you <bot>: i am a huge gamer <endofstring> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad>


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
 60%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████                                                                                | 12/20 [03:55<02:42, 20.29s/it]

<startofstring> hello how are you <bot>: hello <endofstring> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad>


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
 65%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████                                                                      | 13/20 [04:15<02:20, 20.04s/it]

<startofstring> hello how are you <bot>: hello <endofstring> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad>


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
 70%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████                                                            | 14/20 [04:34<01:59, 19.91s/it]

<startofstring> hello how are you <bot>: hello how are you? <endofstring> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad>


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
 75%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████                                                  | 15/20 [04:54<01:39, 19.85s/it]

<startofstring> hello how are you <bot>: Hi, how are doing? <endofstring> <pad> <pad> <pad> <pad> <pad> <pad> <pad>


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
 80%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████                                        | 16/20 [05:14<01:19, 19.82s/it]

<startofstring> hello how are you <bot>: Hi <endofstring> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad>


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
 85%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████                              | 17/20 [05:34<00:59, 19.83s/it]

<startofstring> hello how are you <bot>: hello <endofstring> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad>


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
 90%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████                    | 18/20 [05:54<00:39, 19.81s/it]

<startofstring> hello how are you <bot>: Hi <endofstring> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad>


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
 95%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████          | 19/20 [06:13<00:19, 19.83s/it]

<startofstring> hello how are you <bot>: you're a george <endofstring> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad>


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 20/20 [06:37<00:00, 19.88s/it]

<startofstring> hello how are you <bot>: i am a huge gamer, my mom is a very good person.
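
Many of the samples above trail off into `<pad>` tokens. A likely cause is that `labels=X` includes the padded positions, so the loss also rewards predicting `<pad>`; the attention mask does not remove them from the loss. A common remedy is to set the labels at padded positions to -100, which the causal-LM loss in `transformers` ignores. A minimal sketch on plain lists, with the tensor form shown in a comment. `mask_padding` is a hypothetical helper and the token ids are made up:

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def mask_padding(input_ids, attention_mask, ignore_index=IGNORE_INDEX):
    """Return a copy of input_ids with padded positions replaced by ignore_index.

    Both arguments are lists of equal-length rows; attention_mask holds 1 for
    real tokens and 0 for padding, as returned by the tokenizer.
    """
    return [
        [tok if keep == 1 else ignore_index for tok, keep in zip(ids, mask)]
        for ids, mask in zip(input_ids, attention_mask)
    ]


# Made-up token ids: 50257 stands in for <pad> here.
ids = [[50258, 31373, 50259, 50257, 50257]]
mask = [[1, 1, 1, 0, 0]]
labels = mask_padding(ids, mask)
print(labels)

# With the tensors X and a from the training loop, the equivalent would be:
#     labels = X.clone()
#     labels[a == 0] = -100
#     loss = model(X, attention_mask=a, labels=labels).loss
```

Masking only changes what the model is trained to predict. To stop generation at `<endofstring>`, `generate` would probably also need `eos_token_id=tokenizer.eos_token_id`, since the warnings in the log show it falling back to id 50256.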





https://huggingface.co/blog/how-to-generate

In [24]:
def infer(inp):
    inp = "<startofstring> "+inp+" <bot>: "
    inp = tokenizer(inp, return_tensors="pt")
    X = inp["input_ids"].to(device)
    a = inp["attention_mask"].to(device)
    # Sample from the model's distribution instead of decoding greedily;
    # temperature only has an effect when do_sample=True.
    output = model.generate(X, attention_mask=a, do_sample=True, temperature=0.9)
    output = tokenizer.decode(output[0])
    return output
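
With `do_sample=True`, `generate` draws each next token from the model's probability distribution instead of always taking the most likely one, and `temperature` rescales the logits before the softmax. Values below 1 sharpen the distribution and values above 1 flatten it. A small standalone sketch of that rescaling, in plain Python rather than the library's implementation, with made-up logits:

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw scores into probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate tokens
for t in (0.5, 0.9, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

At 0.9 the distribution is only slightly sharper than the unscaled one, so the replies stay varied while still favouring high-probability tokens.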

In [25]:

print("infer from model : ")
while True:
    inp = input()
    print(infer(inp))

infer from model : 


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


<startofstring> hiiii <bot>: Hello there! how are you? <endofstring> <pad> <pad> <pad> <pad> <pad> <pad> <pad>


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


<startofstring> can i have your phone please? <bot>: i am not sure what that is. i am a


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


<startofstring> <bot>: What are you doing now? <endofstring> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad>


KeyboardInterrupt: Interrupted by user
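
The decoded strings above still carry the prompt and the `<startofstring>`, `<bot>:`, `<endofstring>` and `<pad>` markers. A minimal sketch of pulling out just the reply, assuming the output format shown in the logs. `extract_reply` is a hypothetical helper, not part of the original notebook:

```python
def extract_reply(decoded, bot_tag="<bot>:", end_tag="<endofstring>"):
    """Return the text after the first bot_tag, cut at the next marker.

    The reply ends at the first end_tag or repeated bot_tag. If neither
    appears (generation hit the length limit), everything after bot_tag is
    returned, with any <pad> tokens removed.
    """
    _, found, after = decoded.partition(bot_tag)
    if not found:
        return ""  # no bot tag in the string
    reply = after.split(end_tag, 1)[0].split(bot_tag, 1)[0]
    return reply.replace("<pad>", "").strip()


sample = "<startofstring> hiiii <bot>: Hello there! how are you? <endofstring> <pad> <pad>"
print(extract_reply(sample))
```

In the interactive loop this would wrap the call as `print(extract_reply(infer(inp)))`.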