What constitutes an acceptable keyword? #194

jacobwhall · 2021-07-02T02:31:35Z

First of all, thank you for maintaining this repository!

I wrote a rudimentary emoji search program using your data, and noticed that, for example, "poop" does not match any of the keywords for 💩:

emojilib/dist/emoji-en-US.json

Lines 746 to 753 in f3169dc

    "💩": [
      "pile_of_poo",
      "hankey",
      "shitface",
      "fail",
      "turd",
      "shit"
    ],

There are a lot of other poop synonyms listed here, so I feel that "poop" would be an uncontroversial addition. But there are many synonyms for poop, and we might not want to include them all?

Another example I ran into was for 📱:

emojilib/dist/emoji-en-US.json

Lines 7259 to 7265 in f3169dc

    "📱": [
      "mobile_phone",
      "technology",
      "apple",
      "gadgets",
      "dial"
    ],

The first phrase I'd say if you asked me to identify this emoji is "cell phone." However, none of the keywords for this emoji would match "cell." Would it be appropriate to add "cell," "cell_phone," or "cellular_phone?" Are non-official keywords that use underscores OK, or should substrings like "phone" be added as well as "mobile_phone?"

Finally, and I write this sincerely, I'd like to discuss 🍆:

emojilib/dist/emoji-en-US.json

Lines 4269 to 4275 in f3169dc

    "🍆": [
      "eggplant",
      "vegetable",
      "nature",
      "food",
      "aubergine"
    ],

This emoji is often used to signify a penis. Would it be acceptable to add "dick" or "penis" to the list of associated keywords for this emoji? I think that doing so would better reflect common usage, but might stray too far from Unicode's "intended use" for the emoji (if that's a thing).

I suggest that a section be added to CONTRIBUTING.md or README.md that gives guidance to future contributors about questions like these.

…and that's how I posted a GitHub issue about poop, cell phones, and penises 🤪

muan · 2022-02-03T14:44:05Z

Hey, sorry for the lack of response; I was largely away last year.

TBH I have not thought about this at length, but I agree with what you've written here. If pull requests were sent for these keywords, I'd accept them all.

> I suggest that a section be added to CONTRIBUTING.md or README.md that gives guidance to future contributors about questions like these.

I agree. I'd be happy to accept a PR for this if anyone's willing to send one.

thdoan · 2022-07-05T18:34:20Z

@jacobwhall I'm planning to fork this and start an emoji autocomplete project also. Have you settled on a fast way to search through the aliases? I was thinking about doing something like a filter, but not sure if there are faster options out there.

UPDATE: I did some performance tests, and I think for best performance I'm going to flatten the arrays into strings -- finding partial text matches in strings is faster than doing the same operation on arrays.

https://jsbench.me/zql58n0oew/1

When doing a partial match on every keystroke, every bit of performance counts ^^.
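The flattening idea above can be sketched roughly as follows. This is a minimal illustration, not the benchmarked code: the helper names `buildIndex` and `search` are hypothetical, and the sample data is copied from the keyword lists quoted earlier in this thread. It makes no performance claim of its own.

```javascript
// Sample data in emojilib's { emoji: [keywords] } shape, copied from this thread.
const emojiKeywords = {
  "💩": ["pile_of_poo", "hankey", "shitface", "fail", "turd", "shit"],
  "📱": ["mobile_phone", "technology", "apple", "gadgets", "dial"],
  "🍆": ["eggplant", "vegetable", "nature", "food", "aubergine"],
};

// Hypothetical helper: flatten each keyword array into one
// space-separated, lowercased string. This runs once, up front.
function buildIndex(keywords) {
  return Object.entries(keywords).map(([emoji, words]) => ({
    emoji,
    haystack: words.join(" ").toLowerCase(),
  }));
}

// Hypothetical helper: partial match with a single substring test
// per emoji, suitable for calling on every keystroke.
function search(index, query) {
  const needle = query.trim().toLowerCase();
  if (!needle) return [];
  return index
    .filter((entry) => entry.haystack.includes(needle))
    .map((entry) => entry.emoji);
}

const index = buildIndex(emojiKeywords);
console.log(search(index, "phone")); // matches inside "mobile_phone"
console.log(search(index, "cell")); // no keyword contains "cell"
```

One trade-off of joining with a space: a query that itself contains a space could match across two neighboring keywords, so a real picker might reject such queries or pick a separator that users cannot type.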

jacobwhall · 2022-07-07T02:42:56Z

@thdoan sounds like you've done as much as I have. I wrote an emoji picker in Python that you're welcome to check out. The search works surprisingly well!

thdoan · 2022-07-08T09:24:38Z

@jacobwhall cool, I'm experimenting with an emoji autocomplete by leveraging the browser's native datalist functionality. However, I've decided to start my emojis map from scratch based on https://emojipedia.org/ (all tedious manual work since they closed their API). We'll see how it goes.

JoshuaKGoldberg · 2023-05-19T14:35:12Z

+1, having docs on this would be great. I'm working on omnidan/node-emoji#132 to bring node-emoji to emojilib@3. The test cases in that draft PR are showing a lot of places where emojilib@3 removed conveniences the library relied on. For example, "heart" shows up in a few emojis, but not ❤️ itself:

emojilib/dist/emoji-en-US.json

Lines 986 to 991 in e8e9a84

    "❤️": [
      "red_heart",
      "love",
      "like",
      "valentines"
    ],

I wrote a quick script to find discrepancies:

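// npm i emojilib-2@npm:emojilib@2 emojilib-3@npm:emojilib@3
// Run as an ES module (e.g. a .mjs file), since this uses top-level await.
const { lib: emojisV2 } = await import("emojilib-2");
// Import attributes (`with`) replace the older `assert` syntax,
// which newer Node.js versions no longer accept.
const { default: emojisV3 } = await import("emojilib-3", {
  with: { type: "json" },
});

// v2 names that do not appear in the matching v3 keyword list.
const missing = [];
// Same, minus entries whose first v3 keyword matches a few known renaming patterns.
const missingIgnoringAliases = [];

// v2 is keyed by name; v3 is keyed by the emoji character.
for (const [nameV2, detailsV2] of Object.entries(emojisV2)) {
  const detailsV3 = emojisV3[detailsV2.char];
  if (detailsV3?.includes(nameV2)) {
    continue;
  }

  const complaint = { nameV2, detailsV2, detailsV3 };
  missing.push(complaint);

  const primaryAlias = detailsV3?.[0];
  if (
    primaryAlias &&
    !/^(?:flag|two|smiling_face_with)_|_face$/.test(primaryAlias)
  ) {
    missingIgnoringAliases.push(complaint);
  }
}

console.table({
  "Missing in general": missing.length,
  "Missing ignoring a few quick aliases": missingIgnoringAliases.length,
});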

┌──────────────────────────────────────┬────────┐
│               (index)                │ Values │
├──────────────────────────────────────┼────────┤
│          Missing in general          │  678   │
│ Missing ignoring a few quick aliases │  456   │
└──────────────────────────────────────┴────────┘

@muan is there a description anywhere of how #178's lists were generated? Or, if not, could you speak to how you generated it?

muan · 2023-09-22T17:37:25Z

> @muan is there a description anywhere of how #178's lists were generated? Or, if not, could you speak to how you generated it?

I believe I had some hacked-together local scripts, so I don't recall the exact differences. But here's what might have happened:

Previously this project was built exclusively for GitHub shortcodes at our internal hackathon, and with v3 I decided to move away from that. So the primary key became the official Unicode names, which would explain why tada was replaced with party popper and poop was replaced by pile of poo.

IIRC, the official name of the emoji changes with each version sometimes too (gun -> water gun), which was why I made the character be the key now.

I feel like I would/should have done the work to compare and keep the GitHub shortcodes but I guess I did not.

So to add them all back, a name/alias comparison between GitHub's set and the Unicode set could potentially do the trick.
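A rough sketch of what such a comparison might look like. Everything here is an assumption for illustration: the `gemojiEntries` sample only mimics the general shape of gemoji's data (an `emoji` plus an `aliases` array), the 🎉 keyword list is invented, and `findMissingAliases` is a hypothetical helper rather than anything in this repository.

```javascript
// Hypothetical sample entries in the general shape of gemoji's data:
// each entry has an `emoji` string and an `aliases` array of shortcodes.
const gemojiEntries = [
  { emoji: "💩", aliases: ["hankey", "poop", "shit"] },
  { emoji: "🎉", aliases: ["tada"] },
];

// emojilib v3-style data, keyed by the emoji character.
// The 💩 list is quoted in this thread; the 🎉 list is invented.
const emojilibV3 = {
  "💩": ["pile_of_poo", "hankey", "shitface", "fail", "turd", "shit"],
  "🎉": ["party_popper", "party"],
};

// For each emoji, collect the GitHub aliases that its keyword list lacks.
function findMissingAliases(entries, keywordsByEmoji) {
  const missing = {};
  for (const { emoji, aliases } of entries) {
    const known = new Set(keywordsByEmoji[emoji] ?? []);
    const absent = aliases.filter((alias) => !known.has(alias));
    if (absent.length > 0) {
      missing[emoji] = absent;
    }
  }
  return missing;
}

console.log(findMissingAliases(gemojiEntries, emojilibV3));
```

With this sample data the report lists `poop` for 💩 and `tada` for 🎉, which are the two shortcodes mentioned above as having been replaced.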

JoshuaKGoldberg · 2024-03-18T19:33:38Z

OK! Sorry for taking so long on this - I wanted to really think through the problem space. As in: what's a "keyword"?

Using the 🛫 emoji as an example, I think there are really 2-3 use cases for emoji keywords:

- 🆔 Identity: Where keywords can be used as either...
  - 🌕 Full Identity: Terms that are a complete alias or title for the emoji (e.g. airplane_departure)
  - 🌗 Partial Identity: Terms that can be a part of the complete identity of the emoji, but aren't standalone (e.g. airplane, departure)
- 🔗 Relation: Terms that would relate to the emoji in searching, but aren't part of its identity (e.g. airport, taking)

Ideally I'd propose emojilib separate at least 🆔 identity from 🔗 relation keywords. Some users will want only identity, e.g. node-emoji's :shortcode: replacement. Some users will want the relation ones as well, e.g. general text searches.
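As a sketch of what that separation could look like in data, here is one hypothetical record shape. The field names (`identity.full`, `identity.partial`, `relation`) and the helpers are assumptions made up for this example, not an emojilib format. Partial identity terms are derived by splitting full identity terms on underscores, which is only one possible rule.

```javascript
// Derive partial identity terms by splitting full identity terms on "_".
// Parts that are already full identity terms are skipped.
function partialIdentity(fullTerms) {
  const full = new Set(fullTerms);
  const parts = new Set();
  for (const term of fullTerms) {
    for (const part of term.split("_")) {
      if (part && !full.has(part)) {
        parts.add(part);
      }
    }
  }
  return [...parts];
}

// Hypothetical record shape that keeps identity and relation keywords apart.
function makeRecord(fullIdentity, relation) {
  return {
    identity: { full: fullIdentity, partial: partialIdentity(fullIdentity) },
    relation,
  };
}

// Shortcode-style consumers would read only the full identity terms.
function shortcodeTerms(record) {
  return record.identity.full;
}

// Search-style consumers would read every kind of keyword.
function searchTerms(record) {
  return [
    ...record.identity.full,
    ...record.identity.partial,
    ...record.relation,
  ];
}

// The 🛫 example, using the terms listed above.
const departure = makeRecord(["airplane_departure"], ["airport", "taking"]);
console.log(shortcodeTerms(departure));
console.log(searchTerms(departure));
```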

+1 to @muan's suggestion in #194 (comment) of a comparison. I'd say a programmatic approach would be the easiest and least controversial one for emojilib. My proposal would be something like:

🆔 Identity keywords should be sourced from the Unicode standard, Emojipedia also-known-as and title, and platform shortcodes
🔗 Relation keywords should be sourced from the search terms defined for emoji in individual platforms

As for setting up that programmatic approach... we can get halfway there. I made a standalone emojipedia package to scrape & store the Emojipedia data for each emoji. That data includes 🆔 identity shortcodes across Discord, Emojipedia (based on the Unicode standard), GitHub, and Slack.

Looking at the data that's in emojipedia and/or emojilib@3 today on the 🛫 emoji, we can see that there are a lot of 🆔 identity keywords that appear in only one of the two datasets:

| | Full Keywords | Partial Keywords |
| --- | --- | --- |
| In Both 🌕 | `airplane_departure` | `airplane` `departure` |
| Only in Emojilib 🌗 | `airport` `flight` `landing` | |
| Only in Emojipedia 🌓 | `aeroplane_taking_off` `airplane_taking_off` `flight_departure` `plane_taking_off` | `aeroplane` `off` `plane` `taking` |

Full comparison on: https://github.com/JoshuaKGoldberg/repros/tree/emojilib-emojipedia-keywords-comparison.

My next task will be trying to similarly source the 🔗 relation keywords programmatically. That way we can make a script that populates emojilib data automatically. 🔗 Relation keywords aren't stored on Emojipedia that I can find, so I plan on trying to find exports of individual platforms' emoji libraries such as https://github.com/github/gemoji.

JoshuaKGoldberg · 2024-03-20T13:53:00Z

Update: I have a proposal for your review now @muan! 🙌

Preview the full proposal of changes here: Proposed-all.html.

This follows what I proposed in the last comment: that emojilib's keywords be sourced from all associated words in Emojilib/Unicode and platforms we can scrape from. The ones I could easily access were: "fluemoji" (Fluent UI / Windows), "gemoji" (GitHub), and "twemoji" (Twitter).

Using 🛫 as an example, here's what that would look like:

| | Keywords |
| --- | --- |
| Current | `airplane_departure` `airport` `flight` `landing` |
| Proposed | `aeroplane` `aeroplane_taking_off` `airplane` `airplane_departure` `airplane_taking_off` `check-in` `departure` `departures` `flight` `flight_departure` `off` `plane` `plane_taking_off` `taking` `vehicle` |
| ➕ Added | `aeroplane` `aeroplane_taking_off` `airplane` `airplane_taking_off` `check-in` `departure` `departures` `flight_departure` `off` `plane` `plane_taking_off` `taking` `vehicle` |
| ➖ Removed | `airport` `landing` |
| ✔️ Unchanged | `airplane_departure` `flight` |

Full comparison and proposal tables on: https://github.com/JoshuaKGoldberg/repros/tree/emojilib-platforms-keywords-comparison.
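For readers who want to reproduce the added/removed/unchanged breakdown for 🛫, a small diff like the following would do it. This is a sketch and not the script used to generate the linked tables; `diffKeywords` is a hypothetical name, and the two input lists are copied from the 🛫 example above.

```javascript
// Split two keyword lists into added, removed, and unchanged groups.
function diffKeywords(current, proposed) {
  const currentSet = new Set(current);
  const proposedSet = new Set(proposed);
  return {
    added: proposed.filter((keyword) => !currentSet.has(keyword)),
    removed: current.filter((keyword) => !proposedSet.has(keyword)),
    unchanged: current.filter((keyword) => proposedSet.has(keyword)),
  };
}

// Keyword lists for 🛫, copied from the example in this comment.
const current = ["airplane_departure", "airport", "flight", "landing"];
const proposed = [
  "aeroplane", "aeroplane_taking_off", "airplane", "airplane_departure",
  "airplane_taking_off", "check-in", "departure", "departures", "flight",
  "flight_departure", "off", "plane", "plane_taking_off", "taking", "vehicle",
];

const changes = diffKeywords(current, proposed);
console.log(changes.removed); // keywords that would be dropped
console.log(changes.unchanged); // keywords present in both lists
console.log(changes.added.length); // number of new keywords
```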

Unless directed otherwise, I'll send a big PR updating the keywords in this repo... soon. Hopefully later this month.

Note that the following emojis have significantly fewer keywords in the proposed changes:

🐦 went from 6 keywords to 1: bird
🛃 went from 4 keywords to 1: customs
🏜️ went from 4 keywords to 1: desert
🐬 went from 9 keywords to 2: dolphin, flipper
🐘 went from 6 keywords to 1: elephant
🦍 went from 4 keywords to 1: gorilla
⛰️ went from 4 keywords to 1: mountain
🐙 went from 7 keywords to 1: octopus
❇️ went from 6 keywords to 2: *, sparkle

None of the platforms in emoji-platform-data have more than 1-2 keywords for them. Adding a richer platform would fill those missing keywords back in. For example, searching the native macOS emoji picker for "sea" includes 🐙 in the results. I added emoji-platform-data issues labeled platform support.

jacobwhall · 2024-03-22T19:41:33Z

Thank you for your work on this @JoshuaKGoldberg

> Note that the following emojis have significantly fewer keywords in the proposed changes

I suggest that we integrate individual keyword contributions into this new workflow. I think it's worth retaining the keywords from this project for the example emojis you provided. Contributions to this project could continue to add common-sense keywords that may have been overlooked by Unicode, Emojipedia, etc.

JoshuaKGoldberg · 2024-03-30T13:33:52Z

Makes sense! I sent #226 as a draft for reference that only augments, rather than removes.
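To illustrate the difference between augmenting and replacing, here is a minimal sketch. It is not the `augment-en` script from #226; `augmentKeywords` is a hypothetical helper, and the incoming list is a small subset of the proposed 🛫 keywords from the earlier comment.

```javascript
// Append incoming keywords that are not already present.
// Existing keywords are never removed, and their order is preserved.
function augmentKeywords(current, incoming) {
  const seen = new Set(current);
  const result = [...current];
  for (const keyword of incoming) {
    if (!seen.has(keyword)) {
      seen.add(keyword);
      result.push(keyword);
    }
  }
  return result;
}

// Current 🛫 keywords, plus a subset of the proposed ones.
const existing = ["airplane_departure", "airport", "flight", "landing"];
const fromPlatforms = ["airplane", "airplane_departure", "departure", "flight", "plane"];

const augmented = augmentKeywords(existing, fromPlatforms);
console.log(augmented);
```

Because existing entries stay in place, keywords such as `airport` and `landing` survive, and the first entry of each list is unchanged.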

yannickgloster · 2024-05-13T14:54:53Z

Is there any indication of when #226 will be moved out of draft and merged? I'm interested in seeing this resolved in the upstream lib for omnidan/node-emoji#132.

muan mentioned this issue Mar 23, 2022

Keyword workflow #205

Open

JoshuaKGoldberg mentioned this issue May 20, 2023

Fix woman-* and man-* short codes are now *_woman and *_man on GitHub omnidan/node-emoji#112

Open

This was referenced Sep 7, 2023

feat: update emojilib to v3 omnidan/node-emoji#132

Draft

Bump emojilib to 3.X omnidan/node-emoji#129

Open

muan added the helpwanted label Sep 22, 2023

JoshuaKGoldberg mentioned this issue Mar 22, 2024

🚀 Feature: Add emojis from emoji-mart (Bluesky) JoshuaKGoldberg/emoji-platform-data#14

Open

2 tasks

JoshuaKGoldberg linked a pull request Mar 30, 2024 that will close this issue

Added 'augment-en' script to pull keywords from platform data #226

Open

cvzi mentioned this issue Apr 1, 2024

Feature: Use muan/emojilib dataset carpedm20/emoji#286

Open

JoshuaKGoldberg mentioned this issue Jun 17, 2024

[Duplicate] Added 'augment-en' script to pull keywords from platform data #227

Closed

muan mentioned this issue Jun 27, 2024

First entry for some emoji is slug, some are not #228

Closed
