remove default number shape for embeddings
kermitt2 committed Jul 23, 2022
1 parent d28aaed commit 92785a7
Showing 1 changed file with 4 additions and 1 deletion.
@@ -354,6 +354,7 @@ private static String normaliseDescription(String wikitext, String lang) {
 // their numbers flattened, to prevent combinatorial explosions.
 // They might be specific numbers, prices, etc.
 // -> all numerical chars are actually all transformed to '0'
+// -> variant: all numerical chars are removed
 // 2. All letters: case-flattened.
 // 3. Mixed letters and numbers: a product ID? Flatten case and leave
 // numbers alone.
@@ -372,7 +373,9 @@ private static String normaliseDescription(String wikitext, String lang) {
 text = text.replaceAll("\\p{P}", " ");

 // flatten numerical chars
-text = text.replaceAll("\\d", "0");
+//text = text.replaceAll("\\d", "0");
+// remove numerical chars
+text = text.replaceAll("\\d", " ");

 text = text.replaceAll("\\|", " ");
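
The behavioural difference introduced by this change can be sketched in isolation. The class and method names below (DigitNormalisationSketch, flattenDigits, removeDigits) are hypothetical and not part of the repository; only the two replaceAll calls are taken from the diff. The sample string is made up for illustration.

```java
public class DigitNormalisationSketch {

    // Behaviour before the commit: each digit becomes '0',
    // so the length ("shape") of a number survives normalisation.
    static String flattenDigits(String text) {
        return text.replaceAll("\\d", "0");
    }

    // Behaviour after the commit: each digit becomes a space,
    // so numbers leave only whitespace behind.
    static String removeDigits(String text) {
        return text.replaceAll("\\d", " ");
    }

    public static void main(String[] args) {
        String sample = "born in 1984 model x200";
        System.out.println("[" + flattenDigits(sample) + "]");
        System.out.println("[" + removeDigits(sample) + "]");
    }
}
```

Two points follow from the sketch. Digits are replaced by spaces rather than deleted outright, so a mixed token such as x200 is reduced to x followed by whitespace instead of being collapsed into a neighbouring word. Also, without the UNICODE_CHARACTER_CLASS flag, Java's \d matches only the ASCII digits 0 to 9, so digits from other scripts pass through both variants unchanged.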
