
Provide the Google Colab ipynb #25

Closed
aolko opened this issue Jun 4, 2020 · 11 comments
Labels
wontfix This will not be worked on

Comments

@aolko

aolko commented Jun 4, 2020

Please provide a working ipynb. I'm trying to run it in Google Colab (using transformers), but it's so messy it won't even run. And where's the transformers-compliant tokenizer?


Update 17.07: no response.

@fantik11

Yeah, it would be nice if someone provided a Colab notebook.

@mgrankin
Owner

It would be great to have the code working on Colab. Please, share if you make it work.

@Kepler-Br
Contributor

Kepler-Br commented Aug 26, 2020

Hello!
https://colab.research.google.com/drive/1jwFks82BLyy8x3oxyKpiNdlL1PfKSQwW?usp=sharing
Here's my notebook. It works perfectly for me, but I was unable to get TPU to work.
I still have doubts about preparing the data.
Hope it helps you.

@fantik11

#25 (comment) Thanks. Can you explain why you use SSH in your notebook?

@Kepler-Br
Contributor

Kepler-Br commented Aug 28, 2020

> #25 (comment) Thanks. Can you explain why you use SSH in your notebook?

Well, when I was training through the notebook, it glitched out when the output got too long for the web app to handle. Over SSH everything works fine.
If the notebook exits because of "no activity", use this little trick:

import time

# Print a heartbeat every 20 seconds so Colab doesn't treat the
# session as idle and disconnect it.
start_time = time.time()
while True:
    time_delta = time.time() - start_time
    # %H:%M:%S on gmtime() of a delta is fine for runs under 24 hours
    time_string = time.strftime("%H:%M:%S", time.gmtime(time_delta))
    print(f"Elapsed time: {time_string}")
    time.sleep(20)

@fantik11


What format of dataset do I need to pass to your notebook?
I tried passing "<|startoftext|>data1<|endoftext|><|startoftext|>data2<|endoftext|>...", but I feel that it's wrong.
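
Roughly, I build the training file like this (just a sketch; the filename is made up):

samples = ["data1", "data2"]  # my raw text pieces
# sketch: wrap each sample in GPT-2 special tokens and concatenate
dataset = "".join(f"<|startoftext|>{s}<|endoftext|>" for s in samples)
with open("train.txt", "w", encoding="utf-8") as f:  # made-up filename
    f.write(dataset)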

Can you help me?

@Kepler-Br
Contributor

> What format of dataset do I need to pass to your notebook?

#33 (comment)
Here, take a look at it.
TL;DR: feed your data to GPT-2 as is and hope that it will understand it.
But I recommend adding a space between each word and non-word character because, for example, "лес," is one token, while "лес ," is two tokens. You can look at my repo (https://github.com/Kepler-Br/ru_gpt2) for more information. In the near future I'll add a script that helps you process text data, and I'll soon PR some parts from my repo to the main repo. If you want to ask me other questions, just contact me.
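
A rough sketch of that spacing step (separate_punctuation here is just an illustration, not the actual script from my repo):

import re

def separate_punctuation(text):
    # Illustration only: pad every non-word, non-space character with
    # spaces so that, e.g., "лес," becomes "лес ," (two tokens, not one).
    text = re.sub(r"([^\w\s])", r" \1 ", text)
    # Collapse the doubled spaces the padding can produce.
    return re.sub(r"\s+", " ", text).strip()

print(separate_punctuation("Тёмный лес, поле."))  # -> Тёмный лес , поле .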

@aolko
Author

aolko commented Aug 29, 2020

Still no transformers-compliant tokenizer.
Also, why would I fine-tune an already pretrained model? Provide an option to generate from input.

@Kepler-Br
Contributor

Kepler-Br commented Aug 30, 2020

> Still no transformers-compliant tokenizer.

I can't understand what you want.

> Also, why would I fine-tune an already pretrained model?

There are reasons. I don't have equipment powerful enough to train GPT-2 from the ground up, and I want to fine-tune it on a Russian dataset.

> Provide an option to generate from input.

Here, take a look at this.
I've made a PR with new scripts (including this one).
UPD: My PR was accepted. evaluate_model.py is now in the main repo. Run evaluate_model.py --help for usage.

@aolko
Author

aolko commented Aug 30, 2020

> I can't understand what you want.

Transformers' vocab.json and merges.txt [ref], in order to use its tokenizer.
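
In other words, enough to construct the tokenizer directly, something like this (the paths are placeholders):

from transformers import GPT2Tokenizer

# A GPT-2 BPE tokenizer is fully defined by these two files;
# the paths below are placeholders.
tokenizer = GPT2Tokenizer(
    vocab_file="path/to/vocab.json",
    merges_file="path/to/merges.txt",
)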

> Here, take a look at this.

Yeah, no. What about a direct way of doing this rather than relying on some script?

Here's a raw GPT-2 sample; it does its thing well enough without requiring any script execution:

from transformers import GPT2LMHeadModel, GPT2Tokenizer

# "gpt2" is a placeholder checkpoint; substitute whatever model you use.
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

text_1 = input()

input_ids = tokenizer.encode(text_1, return_tensors='pt')

sample_outputs = model.generate(
    input_ids,
    eos_token_id=tokenizer.eos_token_id,
    pad_token_id=tokenizer.eos_token_id,
    do_sample=True,
    max_length=100,
    top_k=0,
    top_p=0.9,
    temperature=0.6,
    num_return_sequences=3,
    repetition_penalty=2.3,
    use_cache=True,
    early_stopping=True
)

# Hard-wrap text at the first space past `width` characters per line.
def console_print(text, width=75):
    last_newline = 0
    i = 0
    while i < len(text):
        if text[i] == "\n":
            last_newline = 0
        elif last_newline > width and text[i] == " ":
            text = text[:i] + "\n" + text[i:]
            last_newline = 0
        else:
            last_newline += 1
        i += 1
    return text

print("Output:\n" + 100 * '-')
for i, sample_output in enumerate(sample_outputs):
    print("{}: {}".format(i, console_print(tokenizer.decode(sample_output, skip_special_tokens=True))))

@stale

stale bot commented Oct 29, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix This will not be worked on label Oct 29, 2020
@stale stale bot closed this as completed Nov 5, 2020