# Emoji Library Builder

The purpose of this notebook is to build a comprehensive library of emojis with their names, their UTF-8 byte encodings, and their Unicode code points.

The most comprehensive collection of emojis I've found (courtesy of James Caldwell and ['Barndog'](https://stackoverflow.com/questions/71404081/how-to-obtain-a-full-list-of-unicode-emojis-from-the-unicode-website)) is this [text file published by Unicode Inc. in 2021](https://unicode.org/Public/emoji/14.0/emoji-test.txt).

## Task List

* [X]  Copy and paste the tabular format of the txt file into a new txt file so the intro and headings don't interfere with the dataframe.
* [X]  Import the updated txt file as a dataframe and remove unnecessary features.
* [X]  Split the emoji feature into 3 separate features: emoji, emoji version, and description.
* [X]  Create two more features containing the utf-8 code points as a bytes object and as a string object for each emoji.
* [X]  Create another feature of cleaned utf-8 code points by removing any "\" for easy matching in other strings.
* [X]  Create another series of features for the unicode escape representations.


## Packages

In [73]:
import pandas as pd

## Txt to DataFrame

In [74]:
df = pd.read_fwf('emoji-test-cleaned.txt')

In [75]:
df.head(150)

Unnamed: 0,hex_codepoint,Unnamed: 1,;,status,#,emoji,Unnamed: 6
0,1F600,,;,fully-qualified,#,😀 E1.0 grinning face,
1,1F603,,;,fully-qualified,#,😃 E0.6 grinning face with big eyes,
2,1F604,,;,fully-qualified,#,😄 E0.6 grinning face with smiling eyes,
3,1F601,,;,fully-qualified,#,😁 E0.6 beaming face with smiling eyes,
4,1F606,,;,fully-qualified,#,😆 E0.6 grinning squinting face,
...,...,...,...,...,...,...,...
145,2764 FE0F 200D 1,A79,;,fully-qualified,#,❤️‍🩹 E13.1 mending heart,
146,2764 200D 1FA79,,;,unqualified,#,❤‍🩹 E13.1 mending heart,
147,2764 FE0F,,;,fully-qualified,#,❤️ E0.6 red heart,
148,2764,,;,unqualified,#,❤ E0.6 red heart,


The initial dataframe is a little messy.  Let's drop unnecessary columns.

In [76]:
df = df.drop(columns=['Unnamed: 1', ';', '#', 'Unnamed: 6'])

In [77]:
df.head()

Unnamed: 0,hex_codepoint,status,emoji
0,1F600,fully-qualified,😀 E1.0 grinning face
1,1F603,fully-qualified,😃 E0.6 grinning face with big eyes
2,1F604,fully-qualified,😄 E0.6 grinning face with smiling eyes
3,1F601,fully-qualified,😁 E0.6 beaming face with smiling eyes
4,1F606,fully-qualified,😆 E0.6 grinning squinting face
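
Row 145 of the earlier preview suggests that `read_fwf` split the longer sequence 2764 FE0F 200D 1FA79 between hex_codepoint and Unnamed: 1, so dropping Unnamed: 1 may truncate some multi-code-point entries. One possible alternative is to split each line on the ";" and "#" delimiters instead of relying on fixed widths. The sketch below assumes the usual emoji-test.txt layout (code points ; status # emoji version name). The function name `parse_emoji_test` and the sample lines are illustrative and not part of the original notebook.

```python
def parse_emoji_test(lines):
    """Parse emoji-test.txt style lines into a list of dicts (hypothetical helper)."""
    rows = []
    for line in lines:
        line = line.strip()
        # Skip blank lines and comment/heading lines.
        if not line or line.startswith("#"):
            continue
        # Code points and status sit left of the first '#'; the rest is the comment.
        left, comment = line.split("#", 1)
        codepoints, status = left.split(";", 1)
        # The comment holds "<emoji> <E version> <name>".
        emoji, e_version, description = comment.strip().split(" ", 2)
        rows.append({
            "hex_codepoint": codepoints.strip(),
            "status": status.strip(),
            "emoji": emoji,
            "e_version": e_version,
            "description": description,
        })
    return rows


# Sample lines with the column padding shortened; emojis written as escapes.
sample_lines = [
    "# group: Smileys & Emotion",
    "1F600 ; fully-qualified # \U0001F600 E1.0 grinning face",
    "2764 FE0F 200D 1FA79 ; fully-qualified # \u2764\uFE0F\u200D\U0001FA79 E13.1 mending heart",
]
rows = parse_emoji_test(sample_lines)
print(rows[1]["hex_codepoint"])
```

Passing the result to `pd.DataFrame(rows)` should give a dataframe with the same column names used in this notebook.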


## New Features: Emoji, E Version, Description

And now let's clean up the emoji column by separating out the contents into different columns: emoji, e_version and description.

In [78]:
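# n=2 limits the split to the first two spaces; recent pandas versions require it as a keyword argument
df[['emoji', 'e_version', 'description']] = df['emoji'].str.split(' ', n=2, expand=True)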

In [79]:
df.head()

Unnamed: 0,hex_codepoint,status,emoji,e_version,description
0,1F600,fully-qualified,😀,E1.0,grinning face
1,1F603,fully-qualified,😃,E0.6,grinning face with big eyes
2,1F604,fully-qualified,😄,E0.6,grinning face with smiling eyes
3,1F601,fully-qualified,😁,E0.6,beaming face with smiling eyes
4,1F606,fully-qualified,😆,E0.6,grinning squinting face
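
To see what the split does to a single value, here is a small sketch on one plain Python string. Limiting the split to the first two spaces keeps the multi-word description in one piece. The variable names are illustrative.

```python
# One raw value in the same "<emoji> <E version> <name>" shape as the emoji column.
raw = "\U0001F603 E0.6 grinning face with big eyes"

# maxsplit=2 cuts only at the first two spaces.
emoji, e_version, description = raw.split(" ", 2)
print(e_version)
print(description)
```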


In [80]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4697 entries, 0 to 4696
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   hex_codepoint  4697 non-null   object
 1   status         4697 non-null   object
 2   emoji          4697 non-null   object
 3   e_version      4697 non-null   object
 4   description    4697 non-null   object
dtypes: object(5)
memory usage: 183.6+ KB


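We have a comprehensive library of 4,697 entries (index 0 to 4696). Note that this count includes the unqualified variants as well as the fully-qualified emojis.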

## New Features: UTF8_Bytes, UTF8_Str, and UTF8_Clean Codepoints

Now let's test our utf-8 encoding and cleaning process on a single emoji to get it right and then iterate over the entire emoji column to create the new feature for every emoji.

Emoji

In [81]:
df.emoji[0]

'😀'

Bytes object

In [82]:
test = df.emoji[0].encode('utf-8')
print(test)

b'\xf0\x9f\x98\x80'


String object

In [83]:
test = str(test).replace("b'",'')
test = test.replace("'",'')
print(test)

\xf0\x9f\x98\x80


Cleaned, with every "\" removed.

In [84]:
test = test.replace("\\",'')
print(test)

xf0x9fx98x80


In [85]:
for i in df.index:
    # create the utf8_bytes code point feature in bytes datatype
    df.at[i,'utf8_bytes'] = df.emoji[i].encode('utf-8')
    # create the utf8_str code point feature as string datatype
    df.at[i,'utf8_str'] = str(df.utf8_bytes[i]).replace("b'",'')
    df.at[i,'utf8_str'] = df.utf8_str[i].replace("'",'')
    # create the utf8_cleaned code point feature for easy matching within other strings
    df.at[i,'utf8_clean'] = df.utf8_str[i].replace("\\",'')
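
The same three values can also be derived without calling str() on a bytes object. The sketch below formats every byte explicitly. Unlike the loop above, it also escapes ASCII bytes (for example the "#" in a keycap emoji), so its output can differ for those rows. The helper name `utf8_forms` is hypothetical.

```python
def utf8_forms(emoji):
    """Return (bytes, escaped string, cleaned string) for one emoji."""
    utf8_bytes = emoji.encode("utf-8")
    # Write each byte as \xNN, matching how Python displays non-ASCII bytes.
    utf8_str = "".join(f"\\x{byte:02x}" for byte in utf8_bytes)
    # Drop the backslashes for easier matching inside other strings.
    utf8_clean = utf8_str.replace("\\", "")
    return utf8_bytes, utf8_str, utf8_clean


print(utf8_forms("\U0001F600"))
```

With pandas, something like `df['emoji'].map(utf8_forms)` would apply it to every row; that step is left out here to keep the sketch free of dependencies.
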

In [86]:
df.head(20)

Unnamed: 0,hex_codepoint,status,emoji,e_version,description,utf8_bytes,utf8_str,utf8_clean
0,1F600,fully-qualified,😀,E1.0,grinning face,b'\xf0\x9f\x98\x80',\xf0\x9f\x98\x80,xf0x9fx98x80
1,1F603,fully-qualified,😃,E0.6,grinning face with big eyes,b'\xf0\x9f\x98\x83',\xf0\x9f\x98\x83,xf0x9fx98x83
2,1F604,fully-qualified,😄,E0.6,grinning face with smiling eyes,b'\xf0\x9f\x98\x84',\xf0\x9f\x98\x84,xf0x9fx98x84
3,1F601,fully-qualified,😁,E0.6,beaming face with smiling eyes,b'\xf0\x9f\x98\x81',\xf0\x9f\x98\x81,xf0x9fx98x81
4,1F606,fully-qualified,😆,E0.6,grinning squinting face,b'\xf0\x9f\x98\x86',\xf0\x9f\x98\x86,xf0x9fx98x86
5,1F605,fully-qualified,😅,E0.6,grinning face with sweat,b'\xf0\x9f\x98\x85',\xf0\x9f\x98\x85,xf0x9fx98x85
6,1F923,fully-qualified,🤣,E3.0,rolling on the floor laughing,b'\xf0\x9f\xa4\xa3',\xf0\x9f\xa4\xa3,xf0x9fxa4xa3
7,1F602,fully-qualified,😂,E0.6,face with tears of joy,b'\xf0\x9f\x98\x82',\xf0\x9f\x98\x82,xf0x9fx98x82
8,1F642,fully-qualified,🙂,E1.0,slightly smiling face,b'\xf0\x9f\x99\x82',\xf0\x9f\x99\x82,xf0x9fx99x82
9,1F643,fully-qualified,🙃,E1.0,upside-down face,b'\xf0\x9f\x99\x83',\xf0\x9f\x99\x83,xf0x9fx99x83


In [87]:
df.tail(20)

Unnamed: 0,hex_codepoint,status,emoji,e_version,description,utf8_bytes,utf8_str,utf8_clean
4677,1F1FA 1F1FE,fully-qualified,🇺🇾,E2.0,flag: Uruguay,b'\xf0\x9f\x87\xba\xf0\x9f\x87\xbe',\xf0\x9f\x87\xba\xf0\x9f\x87\xbe,xf0x9fx87xbaxf0x9fx87xbe
4678,1F1FA 1F1FF,fully-qualified,🇺🇿,E2.0,flag: Uzbekistan,b'\xf0\x9f\x87\xba\xf0\x9f\x87\xbf',\xf0\x9f\x87\xba\xf0\x9f\x87\xbf,xf0x9fx87xbaxf0x9fx87xbf
4679,1F1FB 1F1E6,fully-qualified,🇻🇦,E2.0,flag: Vatican City,b'\xf0\x9f\x87\xbb\xf0\x9f\x87\xa6',\xf0\x9f\x87\xbb\xf0\x9f\x87\xa6,xf0x9fx87xbbxf0x9fx87xa6
4680,1F1FB 1F1E8,fully-qualified,🇻🇨,E2.0,flag: St. Vincent & Grenadines,b'\xf0\x9f\x87\xbb\xf0\x9f\x87\xa8',\xf0\x9f\x87\xbb\xf0\x9f\x87\xa8,xf0x9fx87xbbxf0x9fx87xa8
4681,1F1FB 1F1EA,fully-qualified,🇻🇪,E2.0,flag: Venezuela,b'\xf0\x9f\x87\xbb\xf0\x9f\x87\xaa',\xf0\x9f\x87\xbb\xf0\x9f\x87\xaa,xf0x9fx87xbbxf0x9fx87xaa
4682,1F1FB 1F1EC,fully-qualified,🇻🇬,E2.0,flag: British Virgin Islands,b'\xf0\x9f\x87\xbb\xf0\x9f\x87\xac',\xf0\x9f\x87\xbb\xf0\x9f\x87\xac,xf0x9fx87xbbxf0x9fx87xac
4683,1F1FB 1F1EE,fully-qualified,🇻🇮,E2.0,flag: U.S. Virgin Islands,b'\xf0\x9f\x87\xbb\xf0\x9f\x87\xae',\xf0\x9f\x87\xbb\xf0\x9f\x87\xae,xf0x9fx87xbbxf0x9fx87xae
4684,1F1FB 1F1F3,fully-qualified,🇻🇳,E2.0,flag: Vietnam,b'\xf0\x9f\x87\xbb\xf0\x9f\x87\xb3',\xf0\x9f\x87\xbb\xf0\x9f\x87\xb3,xf0x9fx87xbbxf0x9fx87xb3
4685,1F1FB 1F1FA,fully-qualified,🇻🇺,E2.0,flag: Vanuatu,b'\xf0\x9f\x87\xbb\xf0\x9f\x87\xba',\xf0\x9f\x87\xbb\xf0\x9f\x87\xba,xf0x9fx87xbbxf0x9fx87xba
4686,1F1FC 1F1EB,fully-qualified,🇼🇫,E2.0,flag: Wallis & Futuna,b'\xf0\x9f\x87\xbc\xf0\x9f\x87\xab',\xf0\x9f\x87\xbc\xf0\x9f\x87\xab,xf0x9fx87xbcxf0x9fx87xab


In [88]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4697 entries, 0 to 4696
Data columns (total 8 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   hex_codepoint  4697 non-null   object
 1   status         4697 non-null   object
 2   emoji          4697 non-null   object
 3   e_version      4697 non-null   object
 4   description    4697 non-null   object
 5   utf8_bytes     4697 non-null   object
 6   utf8_str       4697 non-null   object
 7   utf8_clean     4697 non-null   object
dtypes: object(8)
memory usage: 293.7+ KB
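
As an illustration of the "easy matching" goal, here is a sketch that maps utf8_clean strings back to emojis inside a longer text. It assumes the other text contains emojis in exactly this backslash-free form. The `lookup` dict, the `restore_emojis` function, and the sample sentence are made up for the example; the two keys are taken from rows 0 and 4677.

```python
import re

# A tiny stand-in for a utf8_clean -> emoji mapping built from the dataframe.
lookup = {
    "xf0x9fx98x80": "\U0001F600",
    "xf0x9fx87xbaxf0x9fx87xbe": "\U0001F1FA\U0001F1FE",
}

# Longest keys first so multi-code-point sequences win over shorter prefixes.
keys = sorted(lookup, key=len, reverse=True)
pattern = re.compile("|".join(re.escape(k) for k in keys))


def restore_emojis(text):
    """Replace every known utf8_clean token in text with its emoji."""
    return pattern.sub(lambda m: lookup[m.group(0)], text)


print(restore_emojis("nice work xf0x9fx98x80"))
```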


## New Features: Unicode

We now repeat the same process for the unicode escape representation.

Emoji

In [97]:
df.emoji[0]

'😀'

Bytes object

In [103]:
test = df.emoji[0].encode('unicode_escape')
print(test)

b'\\U0001f600'


String object

In [108]:
test = str(test).replace("b'",'')
test = test.replace("'",'')

print(test)

\\U0001f600


Traditional format

In [109]:
test = test.replace("000",'+')
print(test)

\\U+1f600
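
Replacing "000" works for this emoji, but it would also hit any later run of zeros (U+1F000, for instance, would come out as U+1f+). A more direct route, sketched here, is to call ord() on each character of the emoji. The function name `to_u_plus` is illustrative.

```python
def to_u_plus(emoji):
    """Format each code point of the emoji in U+XXXX notation."""
    return " ".join(f"U+{ord(ch):04X}" for ch in emoji)


print(to_u_plus("\U0001F600"))
print(to_u_plus("\U0001F1FA\U0001F1FE"))
```

Without the "U+" prefix, the result should match the text in the hex_codepoint column.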


In [110]:
for i in df.index:
    # create the unicode_bytes code point feature in bytes datatype
    df.at[i,'unicode_bytes'] = df.emoji[i].encode('unicode_escape')
    # create the unicode_str code point feature as string datatype
    df.at[i,'unicode_str'] = str(df.unicode_bytes[i]).replace("b'",'')
    df.at[i,'unicode_str'] = df.unicode_str[i].replace("'",'')

In [111]:
df.head()

Unnamed: 0,hex_codepoint,status,emoji,e_version,description,utf8_bytes,utf8_str,utf8_clean,unicode_bytes,unicode_str
0,1F600,fully-qualified,😀,E1.0,grinning face,b'\xf0\x9f\x98\x80',\xf0\x9f\x98\x80,xf0x9fx98x80,b'\\U0001f600',\\U0001f600
1,1F603,fully-qualified,😃,E0.6,grinning face with big eyes,b'\xf0\x9f\x98\x83',\xf0\x9f\x98\x83,xf0x9fx98x83,b'\\U0001f603',\\U0001f603
2,1F604,fully-qualified,😄,E0.6,grinning face with smiling eyes,b'\xf0\x9f\x98\x84',\xf0\x9f\x98\x84,xf0x9fx98x84,b'\\U0001f604',\\U0001f604
3,1F601,fully-qualified,😁,E0.6,beaming face with smiling eyes,b'\xf0\x9f\x98\x81',\xf0\x9f\x98\x81,xf0x9fx98x81,b'\\U0001f601',\\U0001f601
4,1F606,fully-qualified,😆,E0.6,grinning squinting face,b'\xf0\x9f\x98\x86',\xf0\x9f\x98\x86,xf0x9fx98x86,b'\\U0001f606',\\U0001f606


In [112]:
df.tail(20)

Unnamed: 0,hex_codepoint,status,emoji,e_version,description,utf8_bytes,utf8_str,utf8_clean,unicode_bytes,unicode_str
4677,1F1FA 1F1FE,fully-qualified,🇺🇾,E2.0,flag: Uruguay,b'\xf0\x9f\x87\xba\xf0\x9f\x87\xbe',\xf0\x9f\x87\xba\xf0\x9f\x87\xbe,xf0x9fx87xbaxf0x9fx87xbe,b'\\U0001f1fa\\U0001f1fe',\\U0001f1fa\\U0001f1fe
4678,1F1FA 1F1FF,fully-qualified,🇺🇿,E2.0,flag: Uzbekistan,b'\xf0\x9f\x87\xba\xf0\x9f\x87\xbf',\xf0\x9f\x87\xba\xf0\x9f\x87\xbf,xf0x9fx87xbaxf0x9fx87xbf,b'\\U0001f1fa\\U0001f1ff',\\U0001f1fa\\U0001f1ff
4679,1F1FB 1F1E6,fully-qualified,🇻🇦,E2.0,flag: Vatican City,b'\xf0\x9f\x87\xbb\xf0\x9f\x87\xa6',\xf0\x9f\x87\xbb\xf0\x9f\x87\xa6,xf0x9fx87xbbxf0x9fx87xa6,b'\\U0001f1fb\\U0001f1e6',\\U0001f1fb\\U0001f1e6
4680,1F1FB 1F1E8,fully-qualified,🇻🇨,E2.0,flag: St. Vincent & Grenadines,b'\xf0\x9f\x87\xbb\xf0\x9f\x87\xa8',\xf0\x9f\x87\xbb\xf0\x9f\x87\xa8,xf0x9fx87xbbxf0x9fx87xa8,b'\\U0001f1fb\\U0001f1e8',\\U0001f1fb\\U0001f1e8
4681,1F1FB 1F1EA,fully-qualified,🇻🇪,E2.0,flag: Venezuela,b'\xf0\x9f\x87\xbb\xf0\x9f\x87\xaa',\xf0\x9f\x87\xbb\xf0\x9f\x87\xaa,xf0x9fx87xbbxf0x9fx87xaa,b'\\U0001f1fb\\U0001f1ea',\\U0001f1fb\\U0001f1ea
4682,1F1FB 1F1EC,fully-qualified,🇻🇬,E2.0,flag: British Virgin Islands,b'\xf0\x9f\x87\xbb\xf0\x9f\x87\xac',\xf0\x9f\x87\xbb\xf0\x9f\x87\xac,xf0x9fx87xbbxf0x9fx87xac,b'\\U0001f1fb\\U0001f1ec',\\U0001f1fb\\U0001f1ec
4683,1F1FB 1F1EE,fully-qualified,🇻🇮,E2.0,flag: U.S. Virgin Islands,b'\xf0\x9f\x87\xbb\xf0\x9f\x87\xae',\xf0\x9f\x87\xbb\xf0\x9f\x87\xae,xf0x9fx87xbbxf0x9fx87xae,b'\\U0001f1fb\\U0001f1ee',\\U0001f1fb\\U0001f1ee
4684,1F1FB 1F1F3,fully-qualified,🇻🇳,E2.0,flag: Vietnam,b'\xf0\x9f\x87\xbb\xf0\x9f\x87\xb3',\xf0\x9f\x87\xbb\xf0\x9f\x87\xb3,xf0x9fx87xbbxf0x9fx87xb3,b'\\U0001f1fb\\U0001f1f3',\\U0001f1fb\\U0001f1f3
4685,1F1FB 1F1FA,fully-qualified,🇻🇺,E2.0,flag: Vanuatu,b'\xf0\x9f\x87\xbb\xf0\x9f\x87\xba',\xf0\x9f\x87\xbb\xf0\x9f\x87\xba,xf0x9fx87xbbxf0x9fx87xba,b'\\U0001f1fb\\U0001f1fa',\\U0001f1fb\\U0001f1fa
4686,1F1FC 1F1EB,fully-qualified,🇼🇫,E2.0,flag: Wallis & Futuna,b'\xf0\x9f\x87\xbc\xf0\x9f\x87\xab',\xf0\x9f\x87\xbc\xf0\x9f\x87\xab,xf0x9fx87xbcxf0x9fx87xab,b'\\U0001f1fc\\U0001f1eb',\\U0001f1fc\\U0001f1eb


Finally, we save the completed library to our local directory so we can start using it in our other programs.

In [124]:
df.to_csv("emoji_lib_expanded.csv", index=False)
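
One thing to keep in mind when loading the CSV again: a CSV file only stores text, so I would expect the utf8_bytes and unicode_bytes columns to come back as strings that look like b'...' rather than as bytes objects. Assuming that is the case, `ast.literal_eval` can turn such a string back into bytes. The sample value below is copied from row 0.

```python
import ast

# Text as it would appear in the CSV cell (an assumption about the saved file).
stored = "b'\\xf0\\x9f\\x98\\x80'"

# literal_eval safely evaluates the bytes literal back into a bytes object.
recovered = ast.literal_eval(stored)
print(recovered.decode("utf-8"))
```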

## Side Notes

I'm not confident about the format of the unicode escapes, because printing an emoji from one of these codes takes a single \ rather than the doubled \\ stored in the dataframe. I'm having a hard time getting Python to produce just one \, because \ is the escape character in string literals. Since my primary need is the utf-8 code points, which are already in the right format, I'll leave this problem for a day when it needs solving. The cells below show the difference: the doubled form prints as literal text, while the single-backslash form prints the emoji.

In [66]:
'🏴󠁧󠁢󠁷󠁬󠁳󠁿'.encode('unicode_escape')

b'\\U0001f3f4\\U000e0067\\U000e0062\\U000e0077\\U000e006c\\U000e0073\\U000e007f'

In [123]:
print('\\U0001f3f4\\U000e0067\\U000e0062\\U000e0077\\U000e006c\\U000e0073\\U000e007f')

\U0001f3f4\U000e0067\U000e0062\U000e0077\U000e006c\U000e0073\U000e007f


In [71]:
print('\U0001f3f4\U000e0067\U000e0062\U000e0077\U000e006c\U000e0073\U000e007f')

🏴󠁧󠁢󠁷󠁬󠁳󠁿


In [72]:
print('\U0001F5FD')

🗽
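
One possible way around the doubled backslash, offered as a sketch rather than a definitive fix: the second backslash seems to come from calling str() on a bytes object, which returns its repr. Decoding the bytes as ASCII keeps a single backslash, and the `unicode_escape` codec can turn the escaped text back into the emoji. The variable names are illustrative.

```python
emoji = "\U0001F5FD"

# Decode instead of str(): the result contains one backslash per code point.
escaped = emoji.encode("unicode_escape").decode("ascii")
print(escaped)

# Reverse direction: escaped text back to the emoji itself.
restored = escaped.encode("ascii").decode("unicode_escape")
print(restored)

# A value already stored with doubled backslashes can be repaired first.
doubled = "\\\\U0001f5fd"
repaired = doubled.replace("\\\\", "\\").encode("ascii").decode("unicode_escape")
print(repaired)
```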
