# Analysis of emojis
## Pre-processing and initial data cleaning 2

An analysis of emojis was also carried out to determine whether they could influence the training
and the results of our models. The analysis showed the following:
- The frequencies of the different emojis in the texts were obtained, and it was determined that only 17.95% of
texts contained at least one emoji, i.e. 359 of the 2,000 total texts.
- Statistical calculations were carried out on the distributions of emojis according to the four categorisations
made. To statistically analyse these data, two statistical techniques were used: (i) overlap analysis and (ii)
Spearman’s rho correlation.
    * **Overlap analysis**: An overlap analysis compares the values between columns in one table or across tables.
This analysis allows us to identify overlapping or redundant data in columns. In our particular case, we compared
whether the number of emojis found in tweets tagged with 0 in each of the binary categories overlapped with the
number of emojis found in tweets tagged with 1 in the same category.
    * **Spearman’s rho correlation**: this analysis makes it possible to determine whether two data distributions are
correlated. Applied to emojis, it allows us to determine whether the distributions of emojis found in the tweets
of the two binary classes of each of the four categorisations are significantly similar. That is, it helps us
determine whether the number of emojis found in tweets that are labelled as "tweets written by ED patients" is
similar, statistically speaking, to the number of emojis found in tweets that are labelled as "tweets not written
by ED patients".
It was observed that, in all cases, the similarity between the two distributions was statistically significant. The
correlations are shown in Table 1.
- Therefore, in line with what has been done in other similar research [68, 69], it was decided to eliminate emojis
in the pre-processing of the texts.
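
The removal step decided on above could look roughly like the sketch below. It is only an illustration: the helper name `strip_emojis` and the Unicode ranges in `EMOJI_PATTERN` are assumptions made for this example, not the code used in the study, and the ranges do not cover every emoji.

```python
import re

# Approximate emoji ranges (an assumption for illustration; not exhaustive).
EMOJI_PATTERN = re.compile(
    "["
    "\U0001F1E6-\U0001F1FF"  # regional indicator symbols (flags)
    "\U0001F300-\U0001FAFF"  # pictographs, emoticons, transport, supplemental symbols
    "\u2600-\u27BF"          # miscellaneous symbols and dingbats
    "\u200d"                 # zero-width joiner used in emoji sequences
    "\ufe0f"                 # variation selector-16
    "]+"
)


def strip_emojis(text: str) -> str:
    """Remove characters matched by EMOJI_PATTERN and collapse leftover whitespace."""
    without_emojis = EMOJI_PATTERN.sub(" ", text)
    return " ".join(without_emojis.split())


print(strip_emojis("my poor mom sat there like 👁👄👁...oh"))
```

A dedicated emoji library would be more complete than hand-written ranges; the regular expression is used here only to keep the sketch free of dependencies.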

[68] S. Talebi, K. Manoj and G. Hemantha Kumar, Building Knowledge Graph Based on User Tweets, in: Data Analytics and Learning,
P. Nagabhushan, D.S. Guru, B.H. Shekar and Y.H.S. Kumar, eds, Lecture Notes in Networks and Systems, Springer, Singapore, 2019,
pp. 433–443. ISBN 9789811325144. doi:10.1007/978-981-13-2514-4_36.

[69] V. Pinheiro, Computational Processing of the Portuguese Language: 15th International Conference, PROPOR 2022, Fortaleza, Brazil,
March 21-23, 2022, Proceedings, Springer Nature, 2022, Google-Books-ID: Df9kEAAAQBAJ. ISBN 978-3-030-98305-5.




In [53]:
import os
import pandas as pd

# on_bad_lines='skip' replaces error_bad_lines=False, which was removed in pandas 2.0
df = pd.read_csv('TweetsEtiquetados-2000.csv', encoding='utf8', sep=';', on_bad_lines='skip')

In [54]:
df.head()

Unnamed: 0,Column1,id_tweet,link,stream_group,text_orig,Commercial,POLITICS,ED,Family,ED_patience,ProED,Offensive,Informative,Scientific,Sad,hashtag,Columna2,Columna3
0,0,"1,31851E+34",https://twitter.com/sophhhiiieeeeee/status/1318513897826123776,1,"RT @beatED: Learn more about anorexia and bulimia, as well as other eating disorders, here: https://t.co/Aj2HbjRH39 @BBCPanorama #BBCPanorama",0,0,1,0.0,0,0,0,1,0,0.0,['BBCPanorama'],,
1,3,"1,31851E+34",https://twitter.com/thelewespound/status/1318514436542500866,1,"A woman tries to balance her relationships with her mother and teenage daughter while under the shadow of #anorexia in the atmospheric British drama #BodyofWater. \r\n\r\nAt Depot from Friday, book now: https://lewesdepot.org/film/body-of-water\r\n",0,0,1,0.0,0,0,0,0,0,0.0,"['anorexia', 'BodyofWater']",,
2,6,"1,31851E+34",https://twitter.com/milkylbs/status/1318514816814882817,1,not a full on diagnosis but like my therapist legit told my mom i have anorexia nevrususu and my poor mom sat there like 👁👄👁...oh,0,0,1,0.0,1,0,0,0,0,0.0,[],,
3,11,"1,31852E+34",https://twitter.com/JamesJosephIgoe/status/1318515167743811584,1,Higher-calorie diets for patients with anorexia nervosa shorten hospital stays https://t.co/jQFWUHQELR via @instapaper,0,0,1,0.0,0,0,0,1,1,0.0,[],,
4,14,"1,31852E+34",https://twitter.com/GENDERYOON/status/1318515364217704449,1,"tw // ed ment freddie pissed me off but it was amazing. but the way they got the anorexic, bulimic actress to play a food-obsessed character will never sit right with me",0,0,1,0.0,1,1,0,0,0,0.0,[],,


In [3]:
df = df.drop(['Column1','Columna2','Columna3','hashtag','link'], axis=1)

In [4]:
df.columns

Index(['id_tweet', 'stream_group', 'text_orig', 'Commercial', 'POLITICS', 'ED',
       'Family', 'ED_patience', 'ProED', 'Offensive', 'Informative',
       'Scientific', 'Sad'],
      dtype='object')

In [5]:
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.options.display.max_colwidth = 300
#data2_1.info(memory_usage="deep")

We unite all the data sets into one.

In [6]:
def usageForType(df):
    for ctype in ['float','float64','int64','int','object','datetime','category']:
        columnType = df.select_dtypes(include=[ctype])
        meanMemoryUsage = columnType.memory_usage(deep=True).mean()
        meanMemoryUsageMB = meanMemoryUsage / 1024 ** 2
        print("Memory usage for type ",ctype , " : {:0.5f} MB".format(meanMemoryUsageMB))

In [7]:
df.info(memory_usage="deep")

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 13 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   id_tweet      2000 non-null   object 
 1   stream_group  2000 non-null   int64  
 2   text_orig     2000 non-null   object 
 3   Commercial    2000 non-null   int64  
 4   POLITICS      2000 non-null   int64  
 5   ED            2000 non-null   int64  
 6   Family        1733 non-null   float64
 7   ED_patience   2000 non-null   int64  
 8   ProED         2000 non-null   int64  
 9   Offensive     2000 non-null   int64  
 10  Informative   2000 non-null   int64  
 11  Scientific    2000 non-null   int64  
 12  Sad           1728 non-null   float64
dtypes: float64(2), int64(9), object(2)
memory usage: 1.1 MB


In [8]:
usageForType(df)

Memory usage for type  float  : 0.01021 MB
Memory usage for type  float64  : 0.01021 MB
Memory usage for type  int64  : 0.01375 MB
Memory usage for type  int  : 0.00012 MB
Memory usage for type  object  : 0.30499 MB
Memory usage for type  datetime  : 0.00012 MB
Memory usage for type  category  : 0.00012 MB


We cross-reference the columns 'id.tweet' and 'retweeted_id' to remove every tweet that is a retweet of a tweet already present in the data set.
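
One way to picture this cross-referencing step is the sketch below. The toy records, the dictionary keys and the helper `drop_known_retweets` are hypothetical, and the rule itself (drop a tweet when its `retweeted_id` matches the `id_tweet` of a tweet already in the collection) is an assumption about what the step does.

```python
def drop_known_retweets(tweets):
    """Keep tweets whose 'retweeted_id' does not point at a tweet in the same collection."""
    known_ids = {t["id_tweet"] for t in tweets}
    return [t for t in tweets if t.get("retweeted_id") not in known_ids]


# Hypothetical toy records, not taken from the data set.
tweets = [
    {"id_tweet": 1, "retweeted_id": None, "text": "original"},
    {"id_tweet": 2, "retweeted_id": 1, "text": "RT of tweet 1"},
    {"id_tweet": 3, "retweeted_id": 99, "text": "RT of a tweet we do not have"},
]

kept = drop_known_retweets(tweets)
print([t["id_tweet"] for t in kept])
```

In this sketch, a retweet whose original is not in the collection is kept, since it is the only copy of that text.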

In [9]:
df2 = df.copy()

In [16]:
import advertools as adv
emoji_summary = adv.extract_emoji(df2['text_orig'].tolist())
print(emoji_summary.keys())




dict_keys(['emoji', 'emoji_text', 'emoji_flat', 'emoji_flat_text', 'emoji_counts', 'emoji_freq', 'top_emoji', 'top_emoji_text', 'top_emoji_groups', 'top_emoji_sub_groups', 'overview'])


In [17]:
df2['emoji'] = emoji_summary['emoji']

In [18]:
emoji_summary['top_emoji']

[('😭', 42),
 ('✨', 38),
 ('\U0001f97a', 25),
 ('😂', 20),
 ('👇', 17),
 ('🙏', 13),
 ('❤️', 12),
 ('💜', 12),
 ('🤢', 11),
 ('😔', 10),
 ('💕', 10),
 ('😩', 9),
 ('\U0001fa79', 9),
 ('⭐', 9),
 ('👍', 8),
 ('😁', 8),
 ('😐', 7),
 ('🥀', 7),
 ('♥', 7),
 ('‼️', 7),
 ('👉', 7),
 ('💗', 7),
 ('⚠️', 7),
 ('😀', 7),
 ('\U0001f970', 7),
 ('👁', 6),
 ('\U0001f974', 6),
 ('\U0001f92e', 6),
 ('\U0001f972', 6),
 ('🙃', 6),
 ('🌀', 6),
 ('😅', 5),
 ('😊', 5),
 ('😋', 5),
 ('😍', 5),
 ('😜', 5),
 ('🎉', 5),
 ('💀', 5),
 ('🎄', 5),
 ('💔', 5),
 ('💚', 5),
 ('✌️', 4),
 ('💘', 4),
 ('❤', 4),
 ('😉', 4),
 ('🚨', 4),
 ('🖤', 4),
 ('🙄', 4),
 ('💙', 4),
 ('\U0001f9e1', 4),
 ('👄', 3),
 ('☁️', 3),
 ('➡', 3),
 ('😳', 3),
 ('🚶\u200d♀️', 3),
 ('💖', 3),
 ('🔥', 3),
 ('\U0001f973', 3),
 ('🤣', 3),
 ('🙏🏻', 3),
 ('🇳🇬', 3),
 ('💪', 3),
 ('💛', 3),
 ('🙈', 3),
 ('🦎', 3),
 ('🤷🏻\u200d♀️', 3),
 ('🤚', 3),
 ('💪🏼', 3),
 ('🌷', 3),
 ('☑️', 3),
 ('😑', 2),
 ('📣', 2),
 ('🤷🏼\u200d♀️', 2),
 ('💪🏻', 2),
 ('😎', 2),
 ('😕', 2),
 ('🤡', 2),
 ('😆', 2),
 ('😼', 2),
 ('🍎', 2),
 

In [19]:
emoji_summary['emoji_freq']

[(0, 1641),
 (1, 204),
 (2, 75),
 (3, 41),
 (4, 19),
 (5, 7),
 (6, 3),
 (7, 2),
 (8, 3),
 (9, 1),
 (12, 1),
 (13, 2),
 (14, 1)]
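
The share of tweets that contain at least one emoji can be read off this frequency table. The short sketch below copies the pairs from the output above and sums them; the variable names are chosen for this illustration only.

```python
# (number of emojis in a tweet, number of tweets), copied from the output above.
emoji_freq = [
    (0, 1641), (1, 204), (2, 75), (3, 41), (4, 19), (5, 7), (6, 3),
    (7, 2), (8, 3), (9, 1), (12, 1), (13, 2), (14, 1),
]

total_tweets = sum(n_tweets for _, n_tweets in emoji_freq)
tweets_with_emoji = sum(n_tweets for n_emoji, n_tweets in emoji_freq if n_emoji > 0)
share = tweets_with_emoji / total_tweets

print(total_tweets, tweets_with_emoji, f"{share:.2%}")
```

The totals agree with the 359 of 2,000 texts reported in the summary at the top of this section.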

In [20]:
df2['emoji']

0                                                 []
1                                                 []
2                                          [👁, 👄, 👁]
3                                                 []
4                                                 []
5                                                 []
6                                                 []
7                                                 []
8                                                 []
9                                                [🌞]
10                                                []
11                                                []
12                                                []
13                                                []
14                                                []
15                                                []
16                                                []
17                                                []
18                                            

In [21]:
import numpy as np
df2['emoji']=df2['emoji'].apply(lambda x: "-".join(x)) # convert each emoji list into a single string
df2['emoji'] = df2['emoji'].replace('', np.nan) # replace empty strings (tweets without emojis) with NaN
df2 = df2.dropna(axis=0, subset=['emoji']) # drop the rows whose emoji value is NaN
df2['emoji']= df2['emoji'].apply(lambda x: x.split('-')) # turn the column back into lists

print(df2.emoji)

2                                          [👁, 👄, 👁]
9                                                [🌞]
24                                              [☁️]
27                                               [😐]
30                                               [🥴]
32                                               [😅]
36                                               [😩]
38                                               [😂]
40                                               [😭]
41                                               [➡]
45                                               [😑]
49                                            [🤢, 🤢]
50                                               [👎]
57                             [🥀, 🥀, 🥀, 🥀, 🥀, 🥀, 🥀]
67                                            [🙏, 🙏]
69                                               [📣]
73                                            [😳, 😔]
75                                               [📹]
77                                         [😔,

In [26]:
print('Cat1 ED or non-ED:\n',df2['ED_patience'].value_counts())
print('\nCat2 ProED or non-ProED:\n',df2['ProED'].value_counts())
print('\nCat3 Informative or not:\n',df2['Informative'].value_counts())
print('\nCat4 Scientific or not:\n',df2['Scientific'].value_counts())

Cat1 ED or non-ED:
 1    248
0    111
Name: ED_patience, dtype: int64

Cat2 ProED or non-ProED:
 0    231
1    128
Name: ProED, dtype: int64

Cat3 Informative or not:
 0    285
1     74
Name: Informative, dtype: int64

Cat4 Scientific or not:
 0    326
1     33
Name: Scientific, dtype: int64


In [29]:
from collections import Counter, OrderedDict

def split_emojis_by_label(frame, label):
    """Return (emojis from rows where `label` is 1, emojis from all other rows)."""
    positive, negative = [], []
    for _, row in frame.iterrows():
        # Extend the list that matches this row's binary label.
        target = positive if row[label] == 1 else negative
        target.extend(row['emoji'])
    return positive, negative

list_ed_patience, list_ed_patience_no = split_emojis_by_label(df2, 'ED_patience')
list_proed, list_proed_no = split_emojis_by_label(df2, 'ProED')
list_inf, list_inf_no = split_emojis_by_label(df2, 'Informative')
list_sci, list_sci_no = split_emojis_by_label(df2, 'Scientific')
            
            

    
OrderedDict(Counter(list_ed_patience).most_common())

#dfEmoji1 = pd.DataFrame.from_dict(Counter(list_ed_patience).most_common(), orient='index', columns=['emoji','ed_patience'])
#dfEmoji2 = pd.DataFrame.from_dict(Counter(list_ed_patience_no).most_common(),orient='index', columns=['emoji','non_ed_patience'])

dfEmoji1ED = pd.DataFrame(list(Counter(list_ed_patience).most_common()), columns=['emoji','ed_patience'])
dfEmoji2ED = pd.DataFrame(list(Counter(list_ed_patience_no).most_common()), columns=['emoji','non_ed_patience'])

dfEmoji1Pro = pd.DataFrame(list(Counter(list_proed).most_common()), columns=['emoji','ProED'])
dfEmoji2Pro = pd.DataFrame(list(Counter(list_proed_no).most_common()), columns=['emoji','non_ProED'])

dfEmoji1Inf = pd.DataFrame(list(Counter(list_inf).most_common()), columns=['emoji','Informative'])
dfEmoji2Inf = pd.DataFrame(list(Counter(list_inf_no).most_common()), columns=['emoji','non_Informative'])

dfEmoji1Sci = pd.DataFrame(list(Counter(list_sci).most_common()), columns=['emoji','Scientific'])
dfEmoji2Sci = pd.DataFrame(list(Counter(list_sci_no).most_common()), columns=['emoji','non_Scientific'])



mergedRes = pd.merge(dfEmoji1ED, dfEmoji2ED, on ='emoji',how="left")
mergedRes = pd.merge(mergedRes, dfEmoji1Pro, on ='emoji',how="left")
mergedRes = pd.merge(mergedRes, dfEmoji2Pro, on ='emoji',how="left")
mergedRes = pd.merge(mergedRes, dfEmoji1Inf, on ='emoji',how="left")
mergedRes = pd.merge(mergedRes, dfEmoji2Inf, on ='emoji',how="left")
mergedRes = pd.merge(mergedRes, dfEmoji1Sci, on ='emoji',how="left")
mergedRes = pd.merge(mergedRes, dfEmoji2Sci, on ='emoji',how="left")

In [30]:
mergedRes = mergedRes.fillna(0)
# The left merges introduce NaN (and therefore float columns); cast every count column back to int.
count_cols = [col for col in mergedRes.columns if col != 'emoji']
mergedRes[count_cols] = mergedRes[count_cols].astype(int)

mergedRes

Unnamed: 0,emoji,ed_patience,non_ed_patience,ProED,non_ProED,Informative,non_Informative,Scientific,non_Scientific
0,✨,36,2,21,17,1,37,0,38
1,😭,30,12,15,27,0,42,0,42
2,🥺,25,0,11,14,0,25,0,25
3,😔,10,0,4,6,0,10,0,10
4,❤️,10,2,2,10,3,9,1,11
5,💕,10,0,9,1,0,10,0,10
6,🙏,9,4,8,5,1,12,1,12
7,😂,9,11,2,18,3,17,2,18
8,🩹,9,0,9,0,0,9,0,9
9,😩,8,1,4,5,1,8,0,9


In [45]:
nonED_l = mergedRes.non_ed_patience.tolist()
nonED_l.sort(reverse=True)
ED_l = mergedRes.ed_patience.tolist()
ED_l.sort(reverse=True)

nonProED_l = mergedRes.non_ProED.tolist()
nonProED_l.sort(reverse=True)
ProED_l = mergedRes.ProED.tolist()
ProED_l.sort(reverse=True)

nonSci_l = mergedRes.non_Scientific.tolist()
nonSci_l.sort(reverse=True)
Sci_l = mergedRes.Scientific.tolist()
Sci_l.sort(reverse=True)

nonInf_l = mergedRes.non_Informative.tolist()
nonInf_l.sort(reverse=True)
Inf_l = mergedRes.Informative.tolist()
Inf_l.sort(reverse=True)

In [46]:
import py_stringmatching
#from py_stringmatching import similarity_measure
# py_stringmatching.similarity_measure.overlap_coefficient.OverlapCoefficient

oc = py_stringmatching.similarity_measure.overlap_coefficient.OverlapCoefficient()
#print(oc.get_raw_score(mergedRes.non_ed_patience.tolist(),mergedRes.ed_patience.tolist()))
#print(oc.get_raw_score(mergedRes.non_ProED.tolist(),mergedRes.ProED.tolist()))
#print(oc.get_raw_score(mergedRes.Informative.tolist(),mergedRes.non_Informative.tolist()))
#print(oc.get_raw_score(mergedRes.Scientific.tolist(),mergedRes.non_Scientific.tolist()))


print(oc.get_raw_score(ED_l,nonED_l))
print(oc.get_raw_score(ProED_l,nonProED_l))
print(oc.get_raw_score(Sci_l,nonSci_l))
print(oc.get_raw_score(Inf_l,nonInf_l))

0.6666666666666666
0.75
1.0
1.0
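
For reference, the overlap coefficient (Szymkiewicz-Simpson) of two sets is the size of their intersection divided by the size of the smaller set. The sketch below is a plain-Python illustration of that formula on made-up count lists; `overlap_coefficient` is a hypothetical helper written for this example, not the `py_stringmatching` implementation. Because it works on sets, a value that is repeated in a list is counted once.

```python
def overlap_coefficient(a, b):
    """|A ∩ B| / min(|A|, |B|), computed on the distinct values of a and b."""
    set_a, set_b = set(a), set(b)
    if not set_a or not set_b:
        return 0.0  # convention chosen for this sketch when either side is empty
    return len(set_a & set_b) / min(len(set_a), len(set_b))


# Made-up per-emoji counts for two classes (not the values from the study).
class_1_counts = [36, 30, 25, 10, 10, 9]
class_0_counts = [12, 10, 9, 2, 0, 0]

print(overlap_coefficient(class_1_counts, class_0_counts))
```

A score of 1.0 under this definition means that every distinct value of the smaller set also appears in the larger one.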


In [52]:
from scipy import stats
r = stats.spearmanr(ED_l,nonED_l)
text = "correlation: {:.3f}, pvalue:{:.20f}".format(r.correlation, r.pvalue)
print(text)
r = stats.spearmanr(ProED_l,nonProED_l)
text = "correlation: {:.4f}, pvalue:{:.20f}".format(r.correlation, r.pvalue)
print(text)
r = stats.spearmanr(Sci_l,nonSci_l)
text = "correlation: {:.3f}, pvalue:{:.20f}".format(r.correlation, r.pvalue)
print(text)
r = stats.spearmanr(Inf_l,nonInf_l)
text = "correlation: {:.3f}, pvalue:{:.20f}".format(r.correlation, r.pvalue)
print(text)

correlation: 0.849, pvalue:0.00000000000000000000
correlation: 0.9643, pvalue:0.00000000000000000000
correlation: 0.567, pvalue:0.00000000000000202010
correlation: 0.821, pvalue:0.00000000000000000000
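
Spearman's rho is the Pearson correlation of the ranks of two paired samples, with tied values given their average rank. The sketch below spells this out in plain Python on made-up paired counts. `average_ranks` and `spearman_rho` are hypothetical helpers written for this illustration, and the values are paired by position (element i of one list with element i of the other).

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of items tied with the item at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Pearson correlation of the average ranks of x and y (assumes neither is constant)."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / sqrt(var_x * var_y)


# Made-up paired counts: position i in both lists refers to the same emoji.
class_1_counts = [36, 30, 25, 10, 9]
class_0_counts = [2, 12, 0, 1, 4]

print(round(spearman_rho(class_1_counts, class_0_counts), 3))
```

Since the statistic depends on which values are paired, the order of the two input lists matters: reordering one list without the other changes the result.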


In [102]:
mergedRes.to_csv('emojis.csv', encoding='utf-8',sep=";")

In [100]:
df2.shape

(359, 14)

In [80]:
import gc
import torch
del model1,model2,model3
gc.collect()
print(torch.cuda.memory_summary(device=0, abbreviated=False))


|                  PyTorch CUDA memory summary, device ID 0                 |
|---------------------------------------------------------------------------|
|            CUDA OOMs: 4            |        cudaMalloc retries: 6         |
|        Metric         | Cur Usage  | Peak Usage | Tot Alloc  | Tot Freed  |
|---------------------------------------------------------------------------|
| Allocated memory      |    1391 MB |    6527 MB |   12960 GB |   12959 GB |
|       from large pool |    1390 MB |    6518 MB |   12898 GB |   12897 GB |
|       from small pool |       0 MB |      22 MB |      62 GB |      62 GB |
|---------------------------------------------------------------------------|
| Active memory         |    1391 MB |    6527 MB |   12960 GB |   12959 GB |
|       from large pool |    1390 MB |    6518 MB |   12898 GB |   12897 GB |
|       from small pool |       0 MB |      22 MB |      62 GB |      62 GB |
|---------------------------------------------------------------