[Bug] Tokenizing with Llama 2 model freezes #199

@fozziethebeat

Description

Describe the bug

I manually downloaded the new meta-llama/Llama-2-7b-hf model to my local disk and loaded it with the Transformers.js tokenizer. When I try to tokenize one particular string, tokenization hangs and spins forever. Checking top, the node process is doing something, but it's not clear what. I've let it run for at least 10 minutes and it never finishes.

How to reproduce

const breaking_string = `They wrote the following:
Momentum in your online business comes from (crystal) clarity around who you serve and how you’re going to serve them. 🤔   Over on The Art of Paid Traffic podcast today, I’m pumped to share with you an interview with one of my Accelerator Group Coaching students, Trish Taylor. 🤗                                                                                           Trish is a stylist (and former celebrity stylist at that), coach and brand style expert for business women, teaching them to own their worth and show up in an authentic and meaningful way. 🙌🏼                                                        sI asked Trish to come on the show because she’s gone through what I think is a super relatable transformation in her business -- that of working with clients 1-on-1 to now scaling her business online and creating online courses and programs. 👩🏼🏼 YOU
<200d>💻
As you’ll hear, it wasn’t until she had clarity on who exactly she serves that she was able to clearly create offers that serve this audience that she’s so passionate about.  ♀️
You\'ll also hear how she’s using a quiz and challenge launch, why I don't think she should count out webinars, the metrics she should be paying attention to when evaluating her ad campaigns, and so much more. 📊
Follow the link in my bio to give today's episode a listen and I’d love to hear from you…  are there any transformations YOU
 RE wanting to make in your business? 👇🏼                                                                                  *
. . . . . . . . .`;

export default async (argv: any) => {
  // eval'd dynamic import keeps the ESM-only package loadable from transpiled CommonJS output.
  const { env, AutoTokenizer, PreTrainedTokenizer } = await eval(
    "import('@xenova/transformers')"
  );
  // Point Transformers.js at the locally downloaded model files.
  if (process.env.LOCAL_TRANSFORMER_PATH) {
    env.localModelPath = process.env.LOCAL_TRANSFORMER_PATH;
  }
  const pretrained = await AutoTokenizer.from_pretrained(
    "meta-llama/Llama-2-7b-hf"
  );
  // This call never returns for the string above.
  const tokens = pretrained.encode(breaking_string);
  console.log(tokens.length);
};

As a pre-step, the model must be downloaded locally and LOCAL_TRANSFORMER_PATH set to the download directory; a sketch of the layout I'm assuming follows.
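For concreteness, this is a minimal sketch of the pre-step check, assuming Transformers.js resolves local models under env.localModelPath/<model id> and that tokenizer.json and tokenizer_config.json are the files the tokenizer needs (the directory layout and file names here are my assumptions, not verified against the library internals):

import { existsSync } from "node:fs";
import { join } from "node:path";

// Hypothetical sanity check for the pre-step: verify the locally downloaded
// tokenizer files exist where the repro script expects them.
// Assumes <LOCAL_TRANSFORMER_PATH>/meta-llama/Llama-2-7b-hf/ holds the files.
const modelDir = join(
  process.env.LOCAL_TRANSFORMER_PATH ?? ".",
  "meta-llama",
  "Llama-2-7b-hf"
);
for (const file of ["tokenizer.json", "tokenizer_config.json"]) {
  if (!existsSync(join(modelDir, file))) {
    console.warn(`Missing ${join(modelDir, file)}; download it from the Hub first.`);
  }
}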

Expected behavior

The string should be tokenized and a token count printed. The same string tokenizes without issue with the mosaicml/mpt-7b-8k tokenizer (see the control sketch below).
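For comparison, this is roughly the control I ran, reusing AutoTokenizer and breaking_string from the repro above; treat it as a sketch rather than the exact script:

// Control: the same input encodes with the MPT tokenizer instead of hanging.
const mptTokenizer = await AutoTokenizer.from_pretrained("mosaicml/mpt-7b-8k");
const mptTokens = mptTokenizer.encode(breaking_string);
console.log(mptTokens.length); // prints a token count rather than spinning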

Environment

  • Transformers.js version: ^2.4.1
  • Browser (if applicable):
  • Operating system (if applicable): macOS
  • Other: Node v18.16.0
