
Provide the Google Colab ipynb #25

Closed
aolko opened this issue Jun 4, 2020 · 11 comments
Labels
wontfix This will not be worked on

Comments

@aolko

aolko commented Jun 4, 2020

Please provide a working ipynb. I'm trying to run it in Google Colab (using transformers), but it's so messy it won't even run. And where's the transformers-compliant tokenizer?


Update 17.07: no response.

@fantik11

Yeah, it would be nice if someone provided a Colab notebook.

@mgrankin
Owner

It would be great to have the code working on Colab. Please, share if you make it work.

@Kepler-Br
Contributor

Kepler-Br commented Aug 26, 2020

Hello!
https://colab.research.google.com/drive/1jwFks82BLyy8x3oxyKpiNdlL1PfKSQwW?usp=sharing
Here's my notebook. It works perfectly for me, but I was unable to get TPU to work.
I still have doubts about preparing the data.
Hope it helps you.

@fantik11

#25 (comment) Thanks. Can you explain why you use SSH in your notebook?

@Kepler-Br
Contributor

Kepler-Br commented Aug 28, 2020

> #25 (comment) Thanks. Can you explain why you use SSH in your notebook?

Well, when I was training through the notebook, it glitched out when the output got too long for the web app to handle. Over SSH everything works fine.
If the notebook exits because of "no activity", use this little trick:

import time

# Print a heartbeat every 20 seconds so Colab doesn't treat the
# session as idle and disconnect it.
start_time = time.time()
while True:
    time_delta = time.time() - start_time
    # %H:%M:%S on gmtime() of a delta is fine for runs under 24 hours
    time_string = time.strftime("%H:%M:%S", time.gmtime(time_delta))
    print(f"Elapsed time: {time_string}")
    time.sleep(20)

@fantik11


What format of dataset do I need to pass to your notebook?
I tried passing "<|startoftext|>data1<|endoftext|><|startoftext|>data2<|endoftext|>...", but I feel that it's wrong.
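
Roughly, I build the training file like this (just a sketch; the filename is made up):

samples = ["data1", "data2"]  # my raw text pieces
# sketch: wrap each sample in GPT-2 special tokens and concatenate
dataset = "".join(f"<|startoftext|>{s}<|endoftext|>" for s in samples)
with open("train.txt", "w", encoding="utf-8") as f:  # made-up filename
    f.write(dataset)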

Can you help me?

@Kepler-Br
Contributor

> What format of dataset do I need to pass to your notebook?

#33 (comment)
Here, take a look at it.
TL;DR: feed your data to GPT-2 as is and hope that it will understand it.
But I recommend adding a space between each word and non-word character because, for example, "лес," is one token, while "лес ," is two tokens. You can look at my repo (https://github.com/Kepler-Br/ru_gpt2) for more information. In the near future I'll add a script that helps you process text data, and I'll soon PR some parts from my repo to the main repo. If you want to ask me other questions, just contact me.
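
A rough sketch of that spacing step (separate_punctuation here is just an illustration, not the actual script from my repo):

import re

def separate_punctuation(text):
    # Illustration only: pad every non-word, non-space character with
    # spaces so that, e.g., "лес," becomes "лес ," (two tokens, not one).
    text = re.sub(r"([^\w\s])", r" \1 ", text)
    # Collapse the doubled spaces the padding can produce.
    return re.sub(r"\s+", " ", text).strip()

print(separate_punctuation("Тёмный лес, поле."))  # -> Тёмный лес , поле .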

@aolko
Author

aolko commented Aug 29, 2020

Still no transformers-compliant tokenizer.
Also, why would I fine-tune an already pretrained model? Provide an option to generate from input.

@Kepler-Br
Contributor

Kepler-Br commented Aug 30, 2020

> Still no transformers-compliant tokenizer.

I can't understand what you want.

> Also, why would I fine-tune an already pretrained model?

There are reasons. I don't have equipment powerful enough to train GPT-2 from the ground up, and I want to fine-tune it on a Russian dataset.

> Provide an option to generate from input.

Here, take a look at this.
I've made a PR with new scripts (including this one).
UPD: My PR was accepted. evaluate_model.py is now in the main repo. Run evaluate_model.py --help for usage.

@aolko
Author

aolko commented Aug 30, 2020

> I can't understand what you want.

Transformers' vocab.json and merges.txt [ref], in order to use its tokenizer.
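
In other words, enough to construct the tokenizer directly, something like this (the paths are placeholders):

from transformers import GPT2Tokenizer

# A GPT-2 BPE tokenizer is fully defined by these two files;
# the paths below are placeholders.
tokenizer = GPT2Tokenizer(
    vocab_file="path/to/vocab.json",
    merges_file="path/to/merges.txt",
)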

> Here, take a look at this.

Yeah, no. What about a direct way of doing this rather than relying on some script?

Here's a raw GPT-2 sample; it does its thing well enough without requiring any script execution:

from transformers import GPT2LMHeadModel, GPT2Tokenizer

# "gpt2" is a placeholder checkpoint; substitute whatever model you use.
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

text_1 = input()

input_ids = tokenizer.encode(text_1, return_tensors='pt')

sample_outputs = model.generate(
    input_ids,
    eos_token_id=tokenizer.eos_token_id,
    pad_token_id=tokenizer.eos_token_id,
    do_sample=True,
    max_length=100,
    top_k=0,
    top_p=0.9,
    temperature=0.6,
    num_return_sequences=3,
    repetition_penalty=2.3,
    use_cache=True,
    early_stopping=True
)

# Hard-wrap text at the first space past `width` characters per line.
def console_print(text, width=75):
    last_newline = 0
    i = 0
    while i < len(text):
        if text[i] == "\n":
            last_newline = 0
        elif last_newline > width and text[i] == " ":
            text = text[:i] + "\n" + text[i:]
            last_newline = 0
        else:
            last_newline += 1
        i += 1
    return text

print("Output:\n" + 100 * '-')
for i, sample_output in enumerate(sample_outputs):
    print("{}: {}".format(i, console_print(tokenizer.decode(sample_output, skip_special_tokens=True))))

@stale

stale bot commented Oct 29, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix This will not be worked on label Oct 29, 2020
@stale stale bot closed this as completed Nov 5, 2020