## Steps

- Load a `merged` file.
- ~~Join on `PostURL` for `Page Name`~~.
- Only look up the `Page Name`.
- Recalculate `LexFound`.
- Keep the desired columns.
- Repeat until all merged files are done.
- Merge the merged files.
- Do analysis.
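
The loop over merged files described above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: `combine_merged`, the `Source` column, the optional `recalc` callable, and the toy frames are all hypothetical names introduced here.

```python
import pandas as pd

def combine_merged(frames, keep_cols, recalc=None):
    """Process each frame, keep the desired columns, and stack the results.

    frames:    dict mapping a label (e.g. a date-range folder name) to a DataFrame
    keep_cols: columns to retain from every frame
    recalc:    optional callable applied to each frame first (hypothetical hook
               for a step such as recalculating LexFound)
    """
    parts = []
    for label, frame in frames.items():
        if recalc is not None:
            frame = recalc(frame)
        part = frame[keep_cols].copy()
        part['Source'] = label  # remember which merged file each row came from
        parts.append(part)
    return pd.concat(parts, ignore_index=True)

# Toy stand-ins for two merged files.
frames = {
    'range_a': pd.DataFrame({'Likes': [1, 2], 'MsgUniSeg': ['a b', 'c'], 'Extra': [0, 0]}),
    'range_b': pd.DataFrame({'Likes': [3], 'MsgUniSeg': ['d'], 'Extra': [0]}),
}
combined = combine_merged(frames, keep_cols=['Likes', 'MsgUniSeg'])
print(combined)
```

Tagging each row with its source keeps the date range recoverable after the files are merged.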

In [2]:
import pandas as pd
import numpy as np
from tqdm import tqdm
from glob import glob
# import sys
# sys.path.append('/home/bupi/Documents/pdy/hs/hsle/src')
import HsleCandidateGenerationUtils as hsle

In [3]:
merged_files = glob('../hsle/data/exportcomments-outputs/*/processed/merged.csv')

In [4]:
merged_files

['../hsle/data/exportcomments-outputs/20200329_20200331/processed/merged.csv',
 '../hsle/data/exportcomments-outputs/20200403_20200405/processed/merged.csv',
 '../hsle/data/exportcomments-outputs/20200323_20200325/processed/merged.csv']

In [5]:
# load merged file
mf = merged_files[0]
cols2load = [
    'Profile ID',
    'Date',
    'Likes',
    'MsgUniSeg',
    'LexFound',
    'PostURL'
]
df = pd.read_csv(mf, usecols=cols2load)
print(df.shape)
display(df.head(2))

(82552, 6)


Unnamed: 0,Profile ID,Date,Likes,MsgUniSeg,LexFound,PostURL
0,ID: 100007659289464,2020-03-29 23:48:23,20,အရမ်း များ စော နေ မ လား ဗျ ။ YBS တွေ မြို့ပတ်ရ...,,https://www.facebook.com/MOIWebportalMyanmar/p...
1,ID: 100017115008349,2020-03-29 23:51:47,7,စော သေး တယ် နေ ပါ အုံး လား သုံး လ ကြီး များ တေ...,,https://www.facebook.com/MOIWebportalMyanmar/p...


In [32]:
post_file = '../hsle/data/crowdtangle-posts/processed_{}.csv'.format(mf.split('/')[-3])
postdf = pd.read_csv(post_file, sep='~')
postdf.columns

Index(['Page Name', 'User Name', 'Facebook Id', 'Likes at Posting', 'Created',
       'Type', 'Likes', 'Comments', 'Shares', 'Love', 'Wow', 'Haha', 'Sad',
       'Angry', 'Thankful', 'Video Share Status', 'Post Views', 'Total Views',
       'Total Views for all Crossposts', 'Video Length', 'URL', 'Message',
       'Link', 'Final Link', 'Link Text', 'Description', 'Sponsor Id',
       'Sponsor Name', 'Overperforming Score', 'commentsFile', 'PostId',
       'MessageUni', 'MsgUniCleanSeg'],
      dtype='object')
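
The post file name is derived from the date-range folder two levels above `merged.csv`. A small sketch of that lookup using `pathlib` instead of `split('/')`; the helper name `post_file_for` and its `posts_dir` parameter are assumptions made for illustration.

```python
from pathlib import PurePosixPath

def post_file_for(merged_path, posts_dir='../hsle/data/crowdtangle-posts'):
    """Map a merged.csv path to its matching CrowdTangle post file path."""
    # .../<date_range>/processed/merged.csv -> <date_range>
    date_range = PurePosixPath(merged_path).parent.parent.name
    return '{}/processed_{}.csv'.format(posts_dir, date_range)

example = '../hsle/data/exportcomments-outputs/20200329_20200331/processed/merged.csv'
print(post_file_for(example))
```

Naming the parent folders explicitly makes the dependence on the directory layout easier to see than a bare `[-3]` index.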

In [23]:
post_cols2load = [
    'Page Name',
    'URL',
    'Likes',
    'Comments',
    'Shares',
    'Love',
    'Wow',
    'Haha',
    'Sad',
    'Angry',
    'Thankful',
    'MsgUniCleanSeg'
]
post_file = '../hsle/data/crowdtangle-posts/processed_{}.csv'.format(mf.split('/')[-3])
postdf = pd.read_csv(post_file, sep='~', usecols=post_cols2load)
# usecols does not guarantee column order, so select the columns explicitly,
# then rename the two columns whose names differ from the comment data.
postdf = postdf[post_cols2load].rename(columns={
    'Page Name': 'PageName',
    'MsgUniCleanSeg': 'MsgUniSeg'
})
print(postdf.shape)
postdf.head(2)

(400, 12)


Unnamed: 0,PageName,URL,Likes,Comments,Shares,Love,Wow,Haha,Sad,Angry,Thankful,MsgUniSeg
0,ApannPyay,https://www.facebook.com/ApannPyay/posts/36268...,33122,21539,45983,107,3754,526,870,11354,0,ခဏ လေး သည်းခံ ပြီး အစစ်ခံ လိုက် ရင် ဘာ များ ဖြ...
1,"Ministry of Health and Sports, Myanmar",https://www.facebook.com/MinistryOfHealthAndSp...,18327,732,3883,143,35,39,170,12,0,နယ်စပ် ဂိတ် များ မှတစ်ဆင့် မြန်မာနိုင်ငံ သို့ ...


## Update `LexFound`

In [20]:
lexes = hsle.LoadLexiconSet(False, None, True)
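# Tokens of each segmented comment that also appear in the lexicon.
df['LexFound'] = [set(l.split()).intersection(lexes) for l in tqdm(df.MsgUniSeg)]
# Join matches with '~' (sorted so the string is reproducible); NaN when nothing matched.
df['LexFound'] = [np.nan if len(l)==0 else '~'.join(sorted(l)) for l in df.LexFound]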
df.head(2)

100%|██████████| 82552/82552 [00:00<00:00, 134880.54it/s]


Unnamed: 0,Profile ID,Date,Likes,MsgUniSeg,LexFound,PostURL
0,ID: 100007659289464,2020-03-29 23:48:23,20,အရမ်း များ စော နေ မ လား ဗျ ။ YBS တွေ မြို့ပတ်ရ...,,https://www.facebook.com/MOIWebportalMyanmar/p...
1,ID: 100017115008349,2020-03-29 23:51:47,7,စော သေး တယ် နေ ပါ အုံး လား သုံး လ ကြီး များ တေ...,,https://www.facebook.com/MOIWebportalMyanmar/p...
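
The matching step is a plain set intersection between the whitespace-separated tokens of a message and the lexicon. A self-contained sketch of that idea, with a hypothetical helper `find_lexicon_terms` and a placeholder lexicon; it returns `None` rather than `np.nan` for "no match" so that it needs no third-party imports.

```python
def find_lexicon_terms(segmented_text, lexicon, sep='~'):
    """Return the lexicon terms present in a segmented message, joined by sep.

    Returns None when the input is not a string (e.g. a missing value)
    or when no token matches.
    """
    if not isinstance(segmented_text, str):
        return None
    hits = set(segmented_text.split()) & set(lexicon)
    # Sorting makes the joined string independent of set iteration order.
    return sep.join(sorted(hits)) if hits else None

lexicon = {'alpha', 'beta'}  # placeholder terms, not the real lexicon
print(find_lexicon_terms('gamma beta alpha beta', lexicon))
print(find_lexicon_terms('gamma delta', lexicon))
```

Because matching is done token by token, a lexicon entry is found only if segmentation emits it as a single token; multi-token phrases would need a different approach.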


In [25]:
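# str() guards against missing post messages, which pandas reads as float NaN.
postdf['LexFound'] = [set(str(l).split()).intersection(lexes) for l in postdf.MsgUniSeg]
# Join matches with '~' (sorted so the string is reproducible); NaN when nothing matched.
postdf['LexFound'] = [np.nan if len(l)==0 else '~'.join(sorted(l)) for l in postdf.LexFound]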
postdf.head(2)

Unnamed: 0,PageName,URL,Likes,Comments,Shares,Love,Wow,Haha,Sad,Angry,Thankful,MsgUniSeg,LexFound
0,ApannPyay,https://www.facebook.com/ApannPyay/posts/36268...,33122,21539,45983,107,3754,526,870,11354,0,ခဏ လေး သည်းခံ ပြီး အစစ်ခံ လိုက် ရင် ဘာ များ ဖြ...,
1,"Ministry of Health and Sports, Myanmar",https://www.facebook.com/MinistryOfHealthAndSp...,18327,732,3883,143,35,39,170,12,0,နယ်စပ် ဂိတ် များ မှတစ်ဆင့် မြန်မာနိုင်ငံ သို့ ...,


## Analyses
### HS-DateTime

In [30]:
tmp = df.loc[~df.LexFound.isna(),:]
datetime_sr = pd.to_datetime(tmp.Date)
lex_sr = tmp.LexFound.apply(lambda x: x.split('~'))

dtflat, lexflat = [], []
hours = []
for dt,lex in zip(datetime_sr, lex_sr):
    for l in lex:
        dtflat.append(dt)
        hours.append(dt.hour)
        lexflat.append(l)

lex_time = pd.DataFrame({
    'Hate Speech Phrase': lexflat,
    'DateTime': dtflat,
    'Hour': hours
})
# pd.datetime was deprecated in pandas 1.0 and removed in 2.0;
# normalize() truncates each timestamp to midnight of the same day.
lex_time['Date'] = lex_time.DateTime.dt.normalize()
display(lex_time.head(2))

Unnamed: 0,Hate Speech Phrase,DateTime,Hour,Date
0,သူခိုး,2020-03-30 07:23:13,7,2020-03-30
1,ခွေး,2020-03-30 16:45:30,16,2020-03-30


In [31]:
lex_time.shape

(3592, 4)
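
The explicit flattening loop can also be written with `DataFrame.explode`, which emits one row per list element. The sketch below assumes pandas 0.25 or later and uses placeholder phrases and a toy frame rather than the real data.

```python
import pandas as pd

# Toy stand-in for the comment frame; phrases are placeholders.
toy = pd.DataFrame({
    'Date': ['2020-03-30 07:23:13', '2020-03-30 16:45:30', '2020-03-31 01:00:00'],
    'LexFound': ['term_a~term_b', None, 'term_c'],
})

matched = toy.dropna(subset=['LexFound'])
flat = pd.DataFrame({
    'Hate Speech Phrase': matched['LexFound'].str.split('~'),
    'DateTime': pd.to_datetime(matched['Date']),
})
# One row per (comment, phrase) pair; the timestamp is repeated for each phrase.
flat = flat.explode('Hate Speech Phrase').reset_index(drop=True)
flat['Hour'] = flat['DateTime'].dt.hour
flat['Date'] = flat['DateTime'].dt.normalize()
print(flat)
```

This should yield the same four columns as `lex_time`, without building the intermediate Python lists by hand.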

In [34]:
from datetime import datetime

In [38]:
def datetime_suffix():
    now = datetime.now()
    return '{}{:02}{:02}{:02}{:02}{:02}'.format(
        now.year, now.month, now.day, now.hour, now.minute, now.second)
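
An equivalent and shorter form of this helper uses `strftime`. In the sketch below, the optional `now` argument and the output file name are assumptions added so the function can be exercised with a fixed timestamp.

```python
from datetime import datetime

def datetime_suffix(now=None):
    """Return a YYYYMMDDHHMMSS string for `now` (defaults to the current time)."""
    if now is None:
        now = datetime.now()
    return now.strftime('%Y%m%d%H%M%S')

# Hypothetical use: a timestamped output name so reruns do not overwrite each other.
fixed = datetime(2020, 3, 30, 7, 5, 9)
out_name = 'lex_time_{}.csv'.format(datetime_suffix(fixed))
print(out_name)
```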