
.Net: Add BERT ONNX embedding generation service #5518

Merged: 8 commits merged into microsoft:main on Mar 20, 2024

Conversation

@stephentoub (Member)

Adds a new Microsoft.SemanticKernel.Connectors.Onnx component. As of this PR, it contains one service, BertOnnxTextEmbeddingGenerationService, for using BERT-based ONNX models to generate embeddings. But in time we can add more ONNX-based implementations for using local models.

This is in part based on https://onnxruntime.ai/docs/tutorials/csharp/bert-nlp-csharp-console-app.html and https://github.com/dotnet-smartcomponents/smartcomponents. It doesn't support everything that's supported via sentence-transformers, but we should be able to extend it as needed.
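For illustration, a minimal usage sketch. The factory method name and the file names here are assumptions for the sketch, not taken verbatim from this PR; `GenerateEmbeddingsAsync` is the existing Semantic Kernel embedding abstraction.

```csharp
// Hypothetical usage sketch — CreateAsync and the file paths are assumptions.
using Microsoft.SemanticKernel.Connectors.Onnx;

var service = await BertOnnxTextEmbeddingGenerationService.CreateAsync(
    "model.onnx",  // a BERT-based ONNX embedding model
    "vocab.txt");  // the matching WordPiece vocabulary

// GenerateEmbeddingsAsync comes from SK's ITextEmbeddingGenerationService.
IList<ReadOnlyMemory<float>> embeddings =
    await service.GenerateEmbeddingsAsync(new[] { "Hello, world!" });
```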

cc: @luisquintanilla, @SteveSandersonMS, @JakeRadMSFT

@stephentoub stephentoub requested a review from a team as a code owner March 18, 2024 03:02
@markwallace-microsoft markwallace-microsoft added .NET Issue or Pull requests regarding .NET code kernel Issues or pull requests impacting the core kernel labels Mar 18, 2024
@github-actions github-actions bot changed the title Add BERT ONNX embedding generation service .Net: Add BERT ONNX embedding generation service Mar 18, 2024
@SteveSandersonMS

This looks great! I'd approve it but am probably not the relevant person to do so in this repo.

@luisquintanilla (Member)

Nice addition @stephentoub. Looks great.

@RogerBarreto (Member) left a comment

Is this the final version of the connector, or are we going to have more PRs? If the latter, I suggest creating a dedicated feature branch, following our current practices for new connectors.

@stephentoub (Member, Author)

Is this the final version of the connector, or are we going to have more PRs?

There's nothing further I plan to add right now.

When https://github.com/microsoft/onnxruntime-genai/ is further along, we'll want to add support for chat completion based on using ONNX with models like Phi2, Mistral, LLaMa, Gemma, etc. That's part of why I made the connector generally about working with ONNX models, and for now the only service it provides is one for BERT embeddings. Hopefully this can evolve to provide broad support for much more.

@stephentoub stephentoub added this pull request to the merge queue Mar 20, 2024
Merged via the queue into microsoft:main with commit c304e85 Mar 20, 2024
19 checks passed
@stephentoub stephentoub deleted the onnxembeddinggeneration branch March 20, 2024 17:10
@SteveSandersonMS

Does anyone here happen to know when this will make it to NuGet.org?

I'm wondering whether to re-plat SmartComponents.LocalEmbeddings on top of it, and whether I can use SK in a demo of embeddings next week or whether I need to stick with SmartComponents.LocalEmbeddings only for now.

@stephentoub (Member, Author)

Does anyone here happen to know when this will make it to NuGet.org?

@markwallace-microsoft ?

@georg-jung commented Mar 28, 2024

Regarding:

<TargetFramework>net6.0</TargetFramework> <!-- TODO: Support netstandard2.0 once replacement for FastBertTokenizer available -->

Hey @stephentoub, maintainer of FastBertTokenizer here. I'd be happy to help/contribute in that direction and would also be willing to move FastBertTokenizer in a direction that makes its use viable here. Is the main concern the lack of netstandard2.0 support, or something else (e.g. taking a dependency at all)?

I already considered supporting older .NET versions (georg-jung/FastBertTokenizer#23). However, because they lack System.Text.Rune, it is hard to get Unicode right there in the general case (for code points that cannot be represented by a single char, such as uncommon scripts, etc.). It would probably be possible to build that on top of NStack for netstandard2.0. I'd be willing to add that to FastBertTokenizer if there's demand.

Any suggestions on how to move this forward?

@georg-jung commented Mar 28, 2024

Same "happy to help/contribute" of course also applies to what's needed for SmartComponents.LocalEmbeddings and @SteveSandersonMS - just let me know/open an issue if you need something :)

@luisquintanilla (Member)

Hi @georg-jung,

Thanks for the offer!

Currently we're working on consolidating efforts in tokenizers around Microsoft.ML.Tokenizers.

The goal is to provide an extensible library that serves as a one-stop-shop for tokenizer needs.

dotnet/machinelearning#6984

So far, we've introduced tokenizers like:

  • TikToken
  • LlamaTokenizer

BERT Tokenizers are part of the roadmap.

dotnet/machinelearning#6991

Would you be up for making a contribution to Microsoft.ML.Tokenizers for the BERT tokenizers?

cc @stephentoub @ericstj @tarekgh

@georg-jung

Sorry for the delay in responding over the holidays.

I see. I think I'll need to take a more in depth look into the current API surface and get an impression of the current design to say something more substantial. A one-stop-shop tokenizer library for .NET would of course be great and I'm happy to help with that if it makes sense! Given the broad API surface, my gut feeling is that a BERT implementation in Tokenizers would be essentially a complete rewrite vs. FastBertTokenizers.

Some first thoughts from my experience creating FastBertTokenizer:

  • Regarding correct handling of Unicode, I'm aware of these options (maybe there are more?)
    1. Target a runtime that supports System.Text.Rune (not supported in netstandard2.0).
    2. Depend on NStack instead.
    3. Ditch Unicode correctness and get slightly different tokenization results than HuggingFace tokenizers would.
  • From a quick look, the API seems to use/create instances of string in many places. My experience with FastBertTokenizer is that this can lead to rather substantial performance hits/GC pressure. Changing this would probably be a rather impactful breaking change. Is this something you plan to do as part of #6984?

@luisquintanilla (Member)

No worries. Thanks for the response @georg-jung.

Tagging @tarekgh who can better comment.

@tarekgh (Member) commented Apr 4, 2024

a BERT implementation in Tokenizers would be essentially a complete rewrite vs. FastBertTokenizers.

Yeah, we can help with that as needed.

Target a runtime that supports System.Text.Rune (not supported in netstandard2.0).

We build for netstandard2.0. If you can tell me what functionality you need for that, I can try to provide a similar implementation for down-level targets, and we can use Rune when targeting modern .NET.

From a quick look, the API seems to use/create instances of string in many places. My experience with FastBertTokenizer is that this can lead to rather substantial performance hits/GC pressure. Changing this would probably be a rather impactful breaking change. Is this something you plan to do as part of dotnet/machinelearning#6984?

I am not sure if you are looking at the latest code? We now have the tokenizer model work with spans, like https://github.com/dotnet/machinelearning/blob/0fd58cbfb613113e920977b6891c05fd949486d8/src/Microsoft.ML.Tokenizers/Model/Model.cs#L35. If FastBertTokenizer uses regex-based pre-tokenization, it will be difficult to avoid creating strings, as regex on down-level targets always creates strings. If you can elaborate more on where you are seeing strings created, I can try to look at that.

@georg-jung commented Apr 4, 2024

If you can tell what functionality you need for that I can try provide similar implementation for downlevel and we can use Rune for .NET targeting.

One typical preprocessing step for BERT encoding is removing control chars etc. from the input text. FastBertTokenizer therefore enumerates over the runes and decides, based on Rune.GetUnicodeCategory, whether the thing we look at should be removed. Char.GetUnicodeCategory would work in most cases too. Input text might contain Unicode surrogate pairs, though, and Char.GetUnicodeCategory wouldn't return the correct category for them. The code points they represent would thus be removed even though they could be correctly encoded in many cases.

Example:

https://github.com/georg-jung/FastBertTokenizer/blob/cf29d5adc2b671694fb873335741334045560261/src/FastBertTokenizer/BertTokenizer.cs#L280-L304
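The Rune-based filtering described above can be sketched roughly like this (a simplified sketch of the idea; the real implementation linked above handles more cases):

```csharp
using System.Globalization;
using System.Text;

// Simplified sketch: drop control characters, deciding per rune rather than
// per char, so surrogate pairs are categorized correctly.
static string RemoveControlChars(string text)
{
    var sb = new StringBuilder(text.Length);
    foreach (Rune r in text.EnumerateRunes())
    {
        // Rune.GetUnicodeCategory is correct even for code points that
        // need a surrogate pair in UTF-16.
        if (Rune.GetUnicodeCategory(r) == UnicodeCategory.Control)
            continue;
        sb.Append(r.ToString());
    }
    return sb.ToString();
}
```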

I am not sure if you are looking at the latest code? we are now have the tokenizer model work with spans like

I hope I did. Specifically, I'm thinking of

https://github.com/dotnet/machinelearning/blob/0fd58cbfb613113e920977b6891c05fd949486d8/src/Microsoft.ML.Tokenizers/PreTokenizer/PreTokenizer.cs#L26

I didn't take an in-depth look yet, but if I understand correctly, this means that for every split the pre-tokenizer creates, at least one string will be allocated, no?

Edit: Oh, sorry, right after posting I noticed that Split won't allocate a string instance if just the TokenSpan property is used. But then, wouldn't the same still apply e.g. to Token.Value? https://github.com/dotnet/machinelearning/blob/0fd58cbfb613113e920977b6891c05fd949486d8/src/Microsoft.ML.Tokenizers/Token.cs#L25

If FastBertTokenizers uses pre-tokenization using regex, this will be difficult to avoid

It doesn't use regex. FastBertTokenizer uses a ref struct enumerator over the original input string that enumerates ReadOnlySpan<char>. Its pre-tokenization should be allocation-free (if lowercasing isn't required and the vocabulary doesn't require normalization to FormC, or the input is already in FormC; if lowercasing is required, it will only lowercase the current token in a small buffer before yielding, which is almost allocation-free).

For corresponding code see https://github.com/georg-jung/FastBertTokenizer/blob/cf29d5adc2b671694fb873335741334045560261/src/FastBertTokenizer/PreTokenizingEnumerator.cs

Also note that whitespace isn't "enough" for tokenization. E.g. Chinese characters tend not to be separated by whitespace but should still be split. I'm not sure if regexes could be used to correctly tokenize them, but e.g. deciding based on the Unicode category wouldn't be sufficient here. See https://github.com/georg-jung/FastBertTokenizer/blob/cf29d5adc2b671694fb873335741334045560261/src/FastBertTokenizer/PreTokenizingEnumerator.cs#L169

Most of the string allocations FastBertTokenizer does are required for dict lookups and, if required, Unicode normalization, because these are scenarios where, I think, the runtime doesn't yet support ReadOnlySpan<char>. If the span-based dict lookups arrive with .NET 9, that would probably have a quite positive impact. dotnet/runtime#27229 dotnet/runtime#87757
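If that API ships, the span-based vocabulary lookup could look roughly like this (a sketch assuming the alternate-key lookup proposed in dotnet/runtime#27229; the token id is made up for illustration):

```csharp
// Sketch of span-keyed dictionary lookup; down-level targets still allocate.
var vocab = new Dictionary<string, int>(StringComparer.Ordinal)
{
    ["hello"] = 7592, // hypothetical token id
};

ReadOnlySpan<char> token = "hello world".AsSpan(0, 5);

#if NET9_0_OR_GREATER
// Allocation-free lookup with a span key; StringComparer.Ordinal supports
// ReadOnlySpan<char> as an alternate key type on .NET 9.
var lookup = vocab.GetAlternateLookup<ReadOnlySpan<char>>();
bool found = lookup.TryGetValue(token, out int id);
#else
// Down-level fallback: materialize the span into a string (allocates).
bool found = vocab.TryGetValue(token.ToString(), out int id);
#endif
```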

@tarekgh (Member) commented Apr 4, 2024

Thanks for the details @georg-jung.

Input text might contain unicode surrogate pairs though and Char.GetUnicodeCategory wouldn't return the correct category for them.

You may consider using https://learn.microsoft.com/en-us/dotnet/api/system.globalization.charunicodeinfo.getunicodecategory?view=net-8.0#system-globalization-charunicodeinfo-getunicodecategory(system-string-system-int32) which should work with Surrogate too.

I noticed that Split won't allocate a string instance if just the TokenSpan property is used. But then, wouldn't the same still apply e.g. to Token.Value?

In our library, we have a function called Tokenizer.Encode(...). When you use this function, it returns comprehensive encoding data. This data includes the following components:

  • Ids: These represent the numerical identifiers for the tokens.
  • String tokens: These are the actual text tokens.
  • Offsets: These indicate the positions of the tokens within the text.
  • If there is normalization, we return the normalized text too, as the Offsets will be relative to it.

In certain scenarios, we must allocate memory for this data. Why? Because regardless of the specific use case, we always need to return the string tokens. However, if a user calls an API that doesn't return string tokens (for example, EncodeToIds), we take a different approach. In such cases, we utilize Split.TokenSpan, which avoids unnecessary allocations unless the split operation itself requires memory allocation. Notably, the allocation occurs in regex scenarios due to regex-specific requirements.

It doesn't use regex. FastBertTokenizer uses a ref struct enumerator over the original input string that enumerates ReadOnlySpan. Its pre-tokenization should be allocation-free (if lowercasing isn't required and the vocabulary doesn't require normalization to FormC, or the input is already in FormC; if lowercasing is required, it will only lowercase the current token in a small buffer before yielding, which is almost allocation-free).

We allocate memory for normalization for reasons similar to those for Unicode normalization. Our interfaces are designed to be generic, accommodating any tokenizer and any scenario. We've observed scenarios that require text manipulation: for instance, removing specific characters, replacing them, or adding new ones. Additionally, our APIs return offsets or indexes as part of the encoding process, and these are relative to the normalized string.

Now, I'm curious about your experience. Have you encountered frequent use cases for normalization? I ask because while normalization is an extra operation, it could potentially be offered as a pay-for-play feature.

Also note that whitespace isn't "enough" for tokenization. I think e.g. chinese characters tend to be not seperated by whitespace but should still be splitted. I'm not sure if regexes could be used to correctly tokenize them, but e.g. deciding based on the unicode category wouldn't be sufficient here.

I’m pretty sure that regular expressions (regex) can address this issue.

If the dict lookups arrive with .NET 9 that would probably have a quite positive impact. dotnet/runtime#27229

In our tokenizers, we’ve introduced a solution that enables us to search the dictionary using spans. You can find the relevant type in the following location: StringSpanOrdinalKey.cs. Additionally, take a look at how this type is utilized in examples like Tiktoken.cs.

For dotnet/runtime#87757, even providing that will not be enough for our scenarios, as we need to support down-level targets like netstandard2.0 and even .NET 8. SentencePiece tokenizers actually include the normalization data inside the tokenizer data, which makes them self-contained. But it would be too demanding to require every tokenizer to carry such non-trivial data.

@georg-jung commented Apr 5, 2024

You may consider using https://learn.microsoft.com/en-us/dotnet/api/system.globalization.charunicodeinfo.getunicodecategory?view=net-8.0#system-globalization-charunicodeinfo-getunicodecategory(system-string-system-int32) which should work with Surrogate too.

I think I originally ruled that out as it would require a string allocation for every category check. Thinking about this again, it might actually be a great fallback. On modern .NET we could use the non-allocating Rune; on older .NET we could use Char and, only if we detect a surrogate, use this API to get the category. Thanks!
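That fallback could be sketched like this (a simplified sketch of the idea, not production code):

```csharp
using System.Globalization;

// Sketch: use the cheap char-based category check on the fast path, and
// only for surrogates use the string/index overload, which resolves the
// category of the full surrogate pair correctly (available on netstandard2.0).
static UnicodeCategory GetCategoryAt(string text, int index)
{
    char c = text[index];
    return char.IsSurrogate(c)
        ? CharUnicodeInfo.GetUnicodeCategory(text, index)
        : CharUnicodeInfo.GetUnicodeCategory(c);
}
```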

Now, I’m curious about your experience. Have you encountered frequent use cases for normalization? I ask because while normalization is an extra operation, it could potentially be offered as a pay-for-play feature.

I think Unicode normalization, e.g. FormC and FormD, is a common requirement for BERT. I did a quick check on some random vocabs on HuggingFace, and bert-base-multilingual-cased, bert-base-french-europeana-cased, and bert-base-spanish-wwm-cased are in FormC. (For some common models/vocabs this isn't a thing because they e.g. remove diacritics and then comply with both FormC and FormD, e.g. bert-base-uncased, nomic-embed-text, baai/bge-large-zh.) Now, when encoding arbitrary input for e.g. bert-base-multilingual-cased, we'd need to ensure that the input is in FormC too before encoding (or maybe convert the input and vocab on import to FormD; one would need to check feasibility and benchmark, but it could provide perf benefits). Otherwise we wouldn't find the encodings for inputs that aren't in FormC.

Vocabs like bert-base-uncased, on the other hand, are typically lowercased. If we consider this a normalization form, I think most vocabs would require one or the other normalization. I'm not an expert, but my gut feeling is they tend to either be lowercased or include diacritics.

And even for e.g. bert-base-uncased we'd need to normalize input to FormD, because we need to remove the diacritics from arbitrary inputs. FastBertTokenizer circumvents the need to Unicode-normalize every arbitrary input by applying a modified encoding pipeline. It doesn't apply all the normalization steps and then encode, but first tries to encode best-effort and only normalizes if the input cannot be encoded. It's also easy to return original input indices without allocating strings (FastBertTokenizer currently doesn't have the option to return the normalized tokens, though).

As a side note, it should be possible to check whether a vocabulary needs Unicode normalization when loading it, so that we could do the right thing without explicit configuration.
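Such a load-time check could be sketched like this (a rough heuristic, assuming FormC is the form of interest; real detection would likely be more involved):

```csharp
using System.Linq;
using System.Text;

// Rough sketch: if every vocabulary token is already in FormC, inputs only
// need normalization when they aren't in FormC themselves; tokens outside
// FormC hint that a different normalization pipeline is required.
static bool VocabIsFormC(IEnumerable<string> vocab) =>
    vocab.All(token => token.IsNormalized(NormalizationForm.FormC));
```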

For dotnet/runtime#87757 even providing that will not be enough for our scenarios as we need to support down-levels netstandard 2.0 and even .NET 8.

But if that arrives it would be possible to multi-target and have the perf wins on modern .NET 9/... and use allocating behaviour on older .NET, no?

In our tokenizers, we’ve introduced a solution that enables us to search the dictionary using spans. You can find the relevant type in the following location: StringSpanOrdinalKey.cs. Additionally, take a look at how this type is utilized in examples like Tiktoken.cs.

Thanks, this looks really interesting!

@georg-jung commented Apr 5, 2024

FWIW, I did a quick benchmark of the different pre-tokenization approaches. Note that it doesn't compare exactly the same thing: the regex methods just create splits at whitespace, if I understand correctly, while the ref struct enumerator takes more options into account (punctuation + Chinese chars). The corpus I tested with is some thousands of articles from Simple English Wikipedia. It contains Unicode surrogate pairs, Chinese text, etc. RegexPublicMlNetNuget is the older variant splitting to strings; RegexCurrentMlNetGithub is the more recent code that uses ReadOnlySpan.

// * Summary *

BenchmarkDotNet v0.13.12, Windows 11 (10.0.22631.3374/23H2/2023Update/SunValley3)
AMD Ryzen 7 PRO 4750U with Radeon Graphics, 1 CPU, 16 logical and 8 physical cores
.NET SDK 8.0.202
  [Host]               : .NET 8.0.3 (8.0.324.11423), X64 RyuJIT AVX2
  .NET 6.0             : .NET 6.0.28 (6.0.2824.12007), X64 RyuJIT AVX2
  .NET 8.0             : .NET 8.0.3 (8.0.324.11423), X64 RyuJIT AVX2
  .NET Framework 4.8.1 : .NET Framework 4.8.1 (4.8.9181.0), X64 RyuJIT VectorSize=256


| Method                  | Job                  | Runtime              | Mean       | Error    | StdDev   | Ratio | RatioSD | Gen0        | Gen1        | Allocated    | Alloc Ratio  |
|------------------------ |--------------------- |--------------------- |-----------:|---------:|---------:|------:|--------:|------------:|------------:|-------------:|-------------:|
| RefStructEnumerator     | .NET 6.0             | .NET 6.0             |   240.9 ms |  1.35 ms |  1.05 ms |  1.00 |    0.00 |           - |           - |       4437 B |         1.00 |
| RegexPublicMlNetNuget   | .NET 6.0             | .NET 6.0             | 1,647.4 ms | 29.19 ms | 25.88 ms |  6.85 |    0.11 | 478000.0000 | 135000.0000 | 1700021840 B |   383,146.68 |
| RegexCurrentMlNetGithub | .NET 6.0             | .NET 6.0             |   685.6 ms | 11.07 ms |  9.81 ms |  2.84 |    0.04 | 528000.0000 |           - | 1104531776 B |   248,936.62 |
|                         |                      |                      |            |          |          |       |         |             |             |              |              |
| RefStructEnumerator     | .NET 8.0             | .NET 8.0             |   173.8 ms |  3.47 ms |  5.99 ms |  1.00 |    0.00 |           - |           - |        256 B |         1.00 |
| RegexPublicMlNetNuget   | .NET 8.0             | .NET 8.0             | 1,435.1 ms | 28.11 ms | 27.60 ms |  8.36 |    0.57 | 491000.0000 | 148000.0000 | 1700009576 B | 6,640,662.41 |
| RegexCurrentMlNetGithub | .NET 8.0             | .NET 8.0             |   313.1 ms |  1.52 ms |  1.27 ms |  1.83 |    0.13 |    500.0000 |           - |    1320536 B |     5,158.34 |
|                         |                      |                      |            |          |          |       |         |             |             |              |              |
| RefStructEnumerator     | .NET Framework 4.8.1 | .NET Framework 4.8.1 |   447.5 ms |  3.46 ms |  3.24 ms |  1.00 |    0.00 |           - |           - |            - |           NA |
| RegexPublicMlNetNuget   | .NET Framework 4.8.1 | .NET Framework 4.8.1 | 3,232.1 ms | 16.89 ms | 14.97 ms |  7.23 |    0.06 | 471000.0000 | 126000.0000 | 1731811528 B |           NA |
| RegexCurrentMlNetGithub | .NET Framework 4.8.1 | .NET Framework 4.8.1 | 2,200.3 ms | 24.80 ms | 23.20 ms |  4.92 |    0.06 | 528000.0000 |           - | 1107784272 B |           NA |

// * Hints *
Outliers
  Pretokenization.RefStructEnumerator: .NET 6.0               -> 3 outliers were removed (247.29 ms..252.96 ms)
  Pretokenization.RegexPublicMlNetNuget: .NET 6.0             -> 1 outlier  was  removed (1.80 s)
  Pretokenization.RegexCurrentMlNetGithub: .NET 6.0           -> 1 outlier  was  removed (722.77 ms)
  Pretokenization.RefStructEnumerator: .NET 8.0               -> 4 outliers were removed, 5 outliers were detected (138.11 ms, 177.90 ms..178.81 ms)
  Pretokenization.RegexCurrentMlNetGithub: .NET 8.0           -> 2 outliers were removed, 3 outliers were detected (309.58 ms, 316.50 ms, 321.06 ms)
  Pretokenization.RegexPublicMlNetNuget: .NET Framework 4.8.1 -> 1 outlier  was  removed (3.43 s)

Always impressive how much difference just using recent .NET makes :) - great work there!

Code is available here: https://github.com/georg-jung/TokenizerBenchmarks

@georg-jung

Thanks again @tarekgh for the suggestions and thoughts! By incorporating them, I was able to add netstandard2.0 support to FastBertTokenizer (GetUnicodeCategory) and also make the whole encoding process almost allocation-free (StringSpanOrdinalKey). A stable v1.0 is available now.

Thus, adding netstandard2.0 support to Microsoft.SemanticKernel.Connectors.Onnx isn't blocked by FastBertTokenizer anymore. (I could send a PR if you want.)

@SteveSandersonMS The API you use in SmartComponents changed slightly in v1.0.28: it doesn't return ReadOnlyMemory anymore but directly returns Memory, so consuming it and feeding the results to onnxruntime should be a bit more straightforward.

Labels: documentation, kernel, .NET