The hypothesis is that there are two main ways of using emojis. One is to emphasize a word from the text by using an emoji that literally depicts it. The other, mainly associated with smileys, is to convey an emotion or a subtle meaning.

Some emojis can be used in both contexts, which makes it difficult for a single model to learn which emoji to predict. Having two models, one responsible for literal emoji prediction and one for subtle-meaning emoji prediction, might be helpful.

In [1]:
%load_ext lab_black

In [1]:
import pandas as pd
import numpy as np
import torch

from twemoji.twemoji_dataset import TwemojiData, TwemojiBalancedData, TwemojiDataChunks
from embert import Sembert, TopKAccuracy, LiteralModel, Baseline
from tqdm import tqdm

### get mapping dict

In [2]:
df_des = pd.read_csv("emoji_embedding/data/processed/emoji_descriptions.csv")
emoji_id_char = {k: v for k, v in zip(df_des.emoji_id, df_des.emoji_char)}

### validation set sentence performance prediction

In [30]:
data = TwemojiData("balanced_valid_v2")

In [31]:
wrong_e_model1 = pd.read_pickle("emoji_usage_data/wrong_e_model1.pkl")
wrong_e_model2 = pd.read_pickle("emoji_usage_data/wrong_e_model2.pkl")
wrong_l_model1 = pd.read_pickle("emoji_usage_data/wrong_l_model1.pkl")
wrong_l_model2 = pd.read_pickle("emoji_usage_data/wrong_l_model2.pkl")

In [32]:
df = data.df

### analyse performance to predict literal emojis

In [7]:
literal_df = pd.read_pickle("emoji_usage_data/literal_df.pkl")

In [57]:
literal_df.head()

Unnamed: 0,emoji_id,emjpd_emoji_name_og,like,anger,lit,e_model1_like,e_model2_like,l_model1_like,l_model2_like,e_model1_anger,e_model2_anger,l_model1_anger,l_model2_anger,e_model1_lit,e_model2_lit,l_model1_lit,l_model2_lit
0,0,2nd place medal,I really like 2nd place medal,I am angered by 2nd place medal,"2nd place medalis very lit, I wanna see it","{1410, 72, 371, 1111, 1407}","{1664, 1184, 1542, 1672, 1215}","{0, 1, 1793, 1542, 1711}","{0, 1793, 1, 1036, 1711}","{352, 369, 371, 1556, 923}","{1664, 22, 24, 409, 1663}","{0, 1, 1793, 294, 1711}","{0, 1793, 1, 1036, 1711}","{417, 354, 1509, 371, 923}","{417, 1542, 72, 1672, 371}","{0, 1, 1793, 1542, 1711}","{1793, 803, 1036, 1711, 1241}"
1,1,3rd place medal,I really like 3rd place medal,I am angered by 3rd place medal,"3rd place medalis very lit, I wanna see it","{1410, 72, 371, 1111, 1407}","{1184, 1542, 1672, 1624, 1215}","{0, 1, 1793, 1542, 1711}","{0, 1793, 1, 1036, 1711}","{352, 369, 371, 1556, 923}","{1664, 1556, 22, 24, 409}","{0, 1, 1793, 294, 1711}","{0, 1793, 1, 1036, 1711}","{417, 354, 1509, 371, 923}","{417, 354, 1542, 72, 371}","{0, 1, 1793, 270, 1711}","{1793, 803, 1036, 1711, 1241}"
2,2,a button (blood type),I really like a button (blood type),I am angered by a button (blood type),"a button (blood type)is very lit, I wanna see it","{1410, 1509, 1413, 371, 1407}","{2, 3, 1103, 1297, 21}","{2, 3, 1607, 1103, 1712}","{704, 240, 1110, 1144, 1083}","{352, 367, 369, 371, 1556}","{2, 3, 1294, 1103, 21}","{2, 3, 1607, 1103, 1712}","{704, 240, 1110, 1144, 1083}","{354, 1509, 371, 923, 1407}","{2, 3, 354, 1509, 21}","{2, 3, 1607, 1103, 1712}","{704, 240, 1110, 1144, 1083}"
3,3,ab button (blood type),I really like ab button (blood type),I am angered by ab button (blood type),"ab button (blood type)is very lit, I wanna see it","{1410, 358, 371, 1111, 1407}","{354, 1410, 1297, 1270, 1407}","{2, 3, 74, 1103, 1712}","{704, 240, 1110, 1144, 1083}","{352, 367, 369, 371, 1556}","{2, 1103, 1268, 21, 794}","{2, 3, 1103, 1712, 19}","{704, 240, 1110, 1144, 1083}","{354, 1509, 371, 923, 1407}","{354, 1509, 265, 21, 923}","{2, 3, 1103, 1712, 19}","{704, 240, 1110, 1144, 1083}"
4,4,abacus,I really like abacus,I am angered by abacus,"abacusis very lit, I wanna see it","{1410, 371, 1111, 923, 1407}","{1408, 354, 1069, 14, 271}","{4, 1326, 819, 822, 1048}","{256, 4, 936, 1117, 1246}","{352, 367, 369, 1556, 923}","{367, 369, 22, 23, 734}","{4, 230, 231, 1458, 1556}","{4, 21, 22, 1117, 1726}","{354, 1509, 371, 923, 1407}","{1408, 354, 1448, 299, 1431}","{4, 1543, 108, 407, 1720}","{417, 4, 680, 907, 1117}"
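The `like`, `anger` and `lit` columns in the table above look like they were filled from three fixed sentence templates around each emoji name. The code that built them is not shown here, so the following is only a sketch of how such probe sentences could be generated. The names `templates` and `build_probes` are hypothetical. The `lit` template is copied as it appears in the table, with no space before "is".

```python
# Hypothetical reconstruction of the probe templates, inferred from the table output.
templates = {
    "like": "I really like {name}",
    "anger": "I am angered by {name}",
    # Reproduced as seen in the table: there is no space between the name and "is".
    "lit": "{name}is very lit, I wanna see it",
}


def build_probes(name: str) -> dict:
    """Return one probe sentence per template for a given emoji name."""
    return {key: template.format(name=name) for key, template in templates.items()}


probes = build_probes("abacus")
for key, sentence in probes.items():
    print(f"{key}: {sentence}")
```

Each emoji name yields one sentence per template, and a model's prediction for that sentence can then be compared with the emoji the name belongs to.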


In [59]:
def get_accuracy(model, column):
    # Hit rate: share of rows whose true emoji_id is in the model's predicted id set.
    hits = literal_df.apply(lambda x: x.emoji_id in x[f"{model}_{column}"], axis=1)
    return round(hits.sum() / len(hits), 4)


# Report the hit rate per model and per sentence template.
for model in ["e_model1", "e_model2", "l_model1", "l_model2"]:
    print()
    for col in ["like", "anger", "lit"]:
        print(f"acc for {model} {col}: ", get_accuracy(model, col))


acc for e_model1 like:  0.0757
acc for e_model1 anger:  0.0238
acc for e_model1 lit:  0.0127

acc for e_model2 like:  0.3646
acc for e_model2 anger:  0.2669
acc for e_model2 lit:  0.1796

acc for l_model1 like:  0.9514
acc for l_model1 anger:  0.9398
acc for l_model1 lit:  0.9276

acc for l_model2 like:  0.7989
acc for l_model2 anger:  0.795
acc for l_model2 lit:  0.7619
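The accuracy reported above counts a prediction as correct when the target `emoji_id` is contained in the set of ids the model returned for that sentence. The toy example below illustrates that set-membership metric on made-up data. The frame `toy`, the column `toy_model_like` and the helper `top_k_hit_rate` are illustrative assumptions, not part of the notebook's data.

```python
import pandas as pd


def top_k_hit_rate(df: pd.DataFrame, target_col: str, pred_col: str) -> float:
    """Fraction of rows whose target id appears in that row's predicted id set."""
    hits = df.apply(lambda row: row[target_col] in row[pred_col], axis=1)
    return round(float(hits.mean()), 4)


# Made-up data: four target ids and a set of predicted ids for each.
toy = pd.DataFrame(
    {
        "emoji_id": [0, 1, 2, 3],
        "toy_model_like": [{0, 5, 9}, {4, 7, 8}, {2, 3, 6}, {1, 2, 4}],
    }
)

toy_acc = top_k_hit_rate(toy, "emoji_id", "toy_model_like")
print(toy_acc)  # rows 0 and 2 are hits, rows 1 and 3 are misses -> 0.5
```

Because only membership is checked, the rank of the correct emoji inside the predicted set does not affect the score.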


In [60]:
# for model in ["e_model1", "e_model2", "l_model1", "l_model2"]: 
#     for col in ["like", "anger", "lit"]: 
#         s = literal_df[f"{model}_{col}"].apply(lambda x: [emoji_id_char[y] for y in x]).explode()
#         print(f"{model} {col}: \n", s.value_counts()[:5]/len(literal_df), "\n\n") 

In [62]:
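# Count how many distinct emojis each model predicts across all rows, per template.
for model in ["e_model1", "e_model2", "l_model1", "l_model2"]:
    for col in ["like", "anger", "lit"]:
        predicted_chars = literal_df[f"{model}_{col}"].apply(lambda x: [emoji_id_char[y] for y in x])
        n_unique = predicted_chars.explode().nunique()
        print(f"{model} {col}: \n", n_unique, "\n\n")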

e_model1 like: 
 181 


e_model1 anger: 
 79 


e_model1 lit: 
 43 


e_model2 like: 
 944 


e_model2 anger: 
 773 


e_model2 lit: 
 608 


l_model1 like: 
 1763 


l_model1 anger: 
 1747 


l_model1 lit: 
 1733 


l_model2 like: 
 1544 


l_model2 anger: 
 1512 


l_model2 lit: 
 1508 
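The counts above show how many distinct emojis each model ever predicts over all rows. A small count suggests the model falls back on the same few emojis regardless of the input. The sketch below shows the same flatten-and-count step on made-up prediction sets. `distinct_predictions`, `collapsed` and `diverse` are hypothetical names used only for illustration.

```python
import pandas as pd


def distinct_predictions(pred_sets: pd.Series) -> int:
    """Number of distinct ids that occur anywhere in a Series of predicted id sets."""
    # Turn each set into a list, flatten to one id per row, then count unique ids.
    as_lists = pred_sets.apply(lambda ids: sorted(ids))
    return int(as_lists.explode().nunique())


# A model that always returns the same ids versus one that varies with the input.
collapsed = pd.Series([{1, 2, 3}, {1, 2, 3}, {1, 2, 3}])
diverse = pd.Series([{1, 2, 3}, {4, 5, 6}, {7, 8, 9}])

n_collapsed = distinct_predictions(collapsed)
n_diverse = distinct_predictions(diverse)
print(n_collapsed, n_diverse)  # 3 9
```

Read together with the accuracy figures, this count helps separate a model that is accurate on many different emojis from one that scores by repeating a few frequent ones.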


