Consider umlaut forms when building tokenized map #15235

tors42 · 2024-05-05T19:16:32Z

Here's an implementation proposal for the player replacement part of #15152

Allows for writing broadcast player replacements using the umlaut form (e.g. "Blübaum" instead of "Bluebaum") and having the replacement happen even if the name in the PGN is spelled "Bluebaum".

Previously each player replacement name was mapped into a single token string which identifies the replacement info:

"Matthias Blübaum" -> Map("blubaum matthias" -> ReplacementInfo)

Now names with umlauts will make an additional mapping:

"Matthias Blübaum" -> Map("blubaum matthias"  -> ReplacementInfo,
                          "bluebaum matthias" -> ReplacementInfo)

Relates: #15152

Note,
this can introduce mismatches though... I haven't looked up any "real" mismatches,
but here's a made up one for instance:

If broadcast is created with a player replacement for fictional player "Joe Sü",
there will be two mappings, "joe su" and "joe sue".
When "Joe Sü" then plays against fictional WGM "Sue Joe",
"Sue Joe"'s info will be replaced with "Joe Sü"'s info!
(Workaround is to change the player replacement for fictional player "Joe Sü" to "Joe Su",
which would avoid the additional mapping...)
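
To make the two cases above concrete, here is a minimal standalone sketch of the double mapping. The `tokenize` and `umlautify` helpers are simplified stand-ins written for this illustration (strip diacritics, lowercase, sort the words; expand only ä/ö/ü), not the actual lila implementations, and the replacement info is reduced to a plain string.

```scala
import java.text.Normalizer

// Assumption: a simplified tokenizer, not lila's real one.
// Strips diacritics, lowercases, and sorts the words of the name.
def tokenize(name: String): String =
  Normalizer.normalize(name, Normalizer.Form.NFD)
    .replaceAll("\\p{M}", "") // drop combining marks, e.g. ü -> u
    .toLowerCase
    .split("\\s+")
    .filter(_.nonEmpty)
    .sorted
    .mkString(" ")

// Assumption: a simplified umlaut expansion covering only lowercase ä, ö, ü.
def umlautify(name: String): String =
  name.replace("ä", "ae").replace("ö", "oe").replace("ü", "ue")

// Each name contributes one token, or two when its umlautified form differs.
def tokenizedPlayers[A](players: Map[String, A]): Map[String, A] =
  players.iterator
    .flatMap((name, info) => Set(name, umlautify(name)).map(n => tokenize(n) -> info))
    .toMap

val tokenized = tokenizedPlayers(Map(
  "Matthias Blübaum" -> "2649",
  "Joe Sü"           -> "1700 FM"
))

// The good case: the "Bluebaum" spelling in the PGN now finds the replacement.
val bluebaum = tokenized.get(tokenize("Matthias Bluebaum")) // Some("2649")

// The bad case: "Sue Joe" tokenizes to "joe sue", the extra key of "Joe Sü".
val sueJoe = tokenized.get(tokenize("Sue Joe")) // Some("1700 FM")
```

Because the sketch sorts the words of a name, "Sue Joe" and the umlautified "Joe Sue" end up with the same token, which is exactly the made-up mismatch described above.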

Here's an example scala-cli application which "passes" after this patch - including the bad Sü-case and the good Blübaum-case,

umlaut.scala

//> using scala 3.4.1
//> using dep io.github.tors42:chariot:0.0.87

@main
def main() =

  val lichessApi = "http://localhost:8080"
  val token      = "lip_diego"

  val broadcastReplacements = List(
                               // Tokenized form:
    "Magnus Carlsen / 2863",   // carlsen magnus
    "Senor Ramirez / 1812",    // ramirez senor
    "José Ángel / 2002",       // angel jose
    "Matthias Blübaum / 2649", // blubaum matthias (+ bluebaum matthias)
    "Joe Sü / 1700 / FM"       // joe su           (+ joe sue)
  ).mkString("\n")

  val incomingPGN =
    """
       [White "Matthias Bluebaum"]
       [Black "Magnus Carlsen"]

       1. d4 d5 *

       [White "Jose Angel"]
       [Black "Señor Ramirez"]

       1. d4 d5 *

       [White "Joe Sü"]
       [Black "Sue Joe"]
       [BlackElo "2000"]
       [BlackTitle "WGM"]

       1. d4 d5 *""".stripIndent().linesIterator.drop(1).mkString("\n")

  val expectedPGN =
    """
       [White "Matthias Bluebaum"]
       [Black "Magnus Carlsen"]
       [WhiteElo "2649"]
       [BlackElo "2863"]

       1. d4 d5 *

       [White "Jose Angel"]
       [Black "Señor Ramirez"]
       [WhiteElo "2002"]
       [BlackElo "1812"]

       1. d4 d5 *

       [White "Joe Sü"]
       [Black "Sue Joe"]
       [WhiteElo "1700"]
       [WhiteTitle "FM"]
       [BlackElo "1700"]
       [BlackTitle "FM"]

       1. d4 d5 *""".stripIndent().linesIterator.drop(1).mkString("\n")

  val client = chariot.Client.basic(conf => conf.api(lichessApi))
    .withToken(token)

  val broadcast = client.broadcasts().create(p => p
    .name("Broadcast Name")
    .shortDescription("Short Broadcast Description")
    .longDescription("Looooong Broadcast Description")
    .players(broadcastReplacements)).get()

  val round = client.broadcasts().createRound(broadcast.id(),
    p => p.name("Round Name")).get()

  client.broadcasts().pushPgnByRoundId(round.id(), incomingPGN)

  def filterTags(pgn: chariot.model.Pgn): chariot.model.Pgn =
    chariot.model.Pgn.of(
      pgn.tags().stream()
        .filter(tag => Set(
          "White", "WhiteElo", "WhiteTitle", "Black", "BlackElo", "BlackTitle"
        ).contains(tag.name()))
        .toList(),
        pgn.moves())

  val exportedPGN = String.join("\n\n",
    client.broadcasts().exportPgn(broadcast.id())
      .stream()
      .map(filterTags(_))
      .map(_.toString)
      .toList())

  if exportedPGN == expectedPGN then
    println("Exported PGN matched expected PGN")
  else
    println(s"""\nExpected:\n$expectedPGN\nActual:\n$exportedPGN\nDiffEnd""")


lenguyenthanh

This PR looks great, especially with the scala-cli example.

I think it'd be better to move the umlautify function into the tokenize object, as it is a part of it. That will also simplify tokenizedPlayers a bit more.

Also added some code golf suggestions because it's fun ⛳

modules/relay/src/main/RelayPlayers.scala

Co-authored-by: Thanh Le <lenguyenthanh@hotmail.com>


tors42 · 2024-05-06T17:11:49Z

Eeek, the code golfing changed the behaviour of the Pull Request 😅

The post-"code golfing" version,

private lazy val tokenizedPlayers: Map[PlayerToken, RelayPlayer] =
    players.mapKeys(umlautify.andThen(_.value).andThen(tokenize.apply))

, is a 1-1 mapping - all keys are "umlautified". We now only tokenize the umlautified form.

The pre-"code golfing" version,

private lazy val tokenizedPlayers: Map[PlayerToken, RelayPlayer] =
    players.iterator
      .flatMap((name, player) => Set(name, umlautify(name)).map((_, player)))
      .map((name, player) => (tokenize.apply(name.value), player))
      .toMap

, would possibly insert an extra element in the map for each key - it would tokenize the original name, and if the original name had an "umlautified" form, it would also tokenize the "umlautified" form.
(Set(name, umlautify(name)) can contain 1 or 2 entries)

pp:ing post-"code golfing"

HashMap(
    bluebaum matthias -> RelayPlayer(None,Some(2649),None,None)
    joe sue -> RelayPlayer(None,Some(1700),Some(FM),None)
    carlsen magnus -> RelayPlayer(None,Some(2863),None,None)
    ramirez senor -> RelayPlayer(None,Some(1812),None,None)
    angel jose -> RelayPlayer(None,Some(2002),None,None)
)

pp:ing pre-"code golfing"

HashMap(
    blubaum matthias -> RelayPlayer(None,Some(2649),None,None)
    bluebaum matthias -> RelayPlayer(None,Some(2649),None,None)
    joe su -> RelayPlayer(None,Some(1700),Some(FM),None)
    joe sue -> RelayPlayer(None,Some(1700),Some(FM),None)
    carlsen magnus -> RelayPlayer(None,Some(2863),None,None)
    ramirez senor -> RelayPlayer(None,Some(1812),None,None)
    angel jose -> RelayPlayer(None,Some(2002),None,None)
)

Maybe the post-"code golfing" is fine,
but it "loses support" for the (plausible?) [White "Matthias Blubaum"] spelling...
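
The difference between the two versions can be reproduced with a small standalone sketch. As before, `tokenize` and `umlautify` here are simplified stand-ins assumed for illustration rather than the code in RelayPlayers.scala, and `oneToOne` / `bothForms` are hypothetical names for the post- and pre-"code golfing" behaviour.

```scala
import java.text.Normalizer

// Assumption: simplified tokenizer (strip diacritics, lowercase, sort words).
def tokenize(name: String): String =
  Normalizer.normalize(name, Normalizer.Form.NFD)
    .replaceAll("\\p{M}", "")
    .toLowerCase
    .split("\\s+")
    .filter(_.nonEmpty)
    .sorted
    .mkString(" ")

// Assumption: simplified umlaut expansion for lowercase ä, ö, ü only.
def umlautify(name: String): String =
  name.replace("ä", "ae").replace("ö", "oe").replace("ü", "ue")

// Post-"code golfing": exactly one key per player, built from the umlautified name.
def oneToOne[A](players: Map[String, A]): Map[String, A] =
  players.map((name, info) => tokenize(umlautify(name)) -> info)

// Pre-"code golfing": one key for the original name, plus one for the
// umlautified form when it differs.
def bothForms[A](players: Map[String, A]): Map[String, A] =
  players.iterator
    .flatMap((name, info) => Set(name, umlautify(name)).map(n => tokenize(n) -> info))
    .toMap

val players = Map("Matthias Blübaum" -> 2649)

// A PGN that spells the name "Matthias Blubaum" tokenizes to "blubaum matthias".
val postGolf = oneToOne(players).get(tokenize("Matthias Blubaum"))  // None
val preGolf  = bothForms(players).get(tokenize("Matthias Blubaum")) // Some(2649)
```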

lenguyenthanh · 2024-05-07T01:22:40Z

Oh, sorry, I felt weird about Set(name, umlautify(name)) but then ignored it 😓. Let me fix it.

lenguyenthanh reviewed May 6, 2024


ornicar and others added 4 commits May 6, 2024 10:51

thanh code golf

8a4f607

Co-authored-by: Thanh Le <lenguyenthanh@hotmail.com>

fix github mess

4da8274

private stuff

9b49613

ornicar merged commit 0b2a2dd into lichess-org:master May 6, 2024
3 checks passed

kraktus mentioned this pull request May 6, 2024

Broadcast FIDE player guessing + replacements: match characters ä, ö, ü, ß #15152

Open

lenguyenthanh mentioned this pull request May 7, 2024

Duplicate player's name with it's umlautified version #15245

Merged

tors42 deleted the player-name-umlaut branch May 11, 2024 11:31
