## Planning / Scratch Work

Do /r/kpop commenters talk differently about male vs. female groups?

Initial exploration of this question:
- Identify submissions on 2 all-male groups, 2 all-female groups
- Collect their comments
- Contrast comments in general with "typical" Reddit language (using /r/funny as a baseline)
- Contrast comments on male groups vs. female groups

Using Pushshift to get Reddit comments:

See [Pushshift's GitHub API README](https://github.com/pushshift/api)

> Search for the most recent comments mentioning the word "science" within the subreddit /r/askscience
>
> `https://api.pushshift.io/reddit/search/comment/?q=science&subreddit=askscience`

Retrieve all comment ids for a submission object

`https://api.pushshift.io/reddit/submission/comment_ids/{base36_submission_id}`
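
The two endpoints quoted above can be assembled programmatically. This is a minimal sketch that only builds the URLs and does not send any request; the helper names `comment_search_url` and `comment_ids_url` are hypothetical and not part of Pushshift or of `data_collection_utils`, and the endpoints themselves are taken as written in the README (their availability may have changed).

```python
from urllib.parse import urlencode

# Base path as it appears in the README examples quoted above.
PUSHSHIFT_BASE = "https://api.pushshift.io/reddit"


def comment_search_url(query, subreddit):
    """Build the comment-search URL for a query term within one subreddit."""
    params = urlencode({"q": query, "subreddit": subreddit})
    return f"{PUSHSHIFT_BASE}/search/comment/?{params}"


def comment_ids_url(submission_id):
    """Build the URL that lists all comment ids for a base36 submission id."""
    return f"{PUSHSHIFT_BASE}/submission/comment_ids/{submission_id}"


print(comment_search_url("science", "askscience"))
print(comment_ids_url("lwpgcl"))
```

The resulting strings could then be passed to `requests.get`, ideally with a pause between calls.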

[New to Pushshift FAQ](https://www.reddit.com/r/pushshift/comments/bcxguf/new_to_pushshift_read_this_faq/)

[Pushshift Reddit API v4.0 Documentation](https://reddit-api.readthedocs.io/en/latest/#)

Related work (not comprehensive):
- "A Community of Curious Souls: An Analysis of Commenting Behavior on TED Talks Videos" (Tsou, Thelwall, Mongeon, and Sugimoto, 2014)
- "YouTube science channel video presenters and comments: female friendly or vestiges of sexism?" (Thelwall and Mas-Bleda, 2018)
- "Shirtless and dangerous: Quantifying linguistic signals of gender bias in an online fiction writing community." (Fast, Vachovsky, and Bernstein, 2016)
- "Using language models to quantify gender bias in sports journalism" (Fu, Danescu-Niculescu-Mizil, and Lee, 2016)

## Data Collection

Import statements

In [7]:
import string
import re
import requests
import logging
import pickle
import json
import time

from collections import Counter

from tqdm import tqdm

import pandas as pd

import nltk
from nltk.corpus import stopwords

from data_collection_utils import *

ENGLISH_STOPWORDS = stopwords.words('english')

## Pushshift Notes

What are the values that we can access for each submission?

```python
> response.json()['data'][1].keys()

> dict_keys(['all_awardings', 'allow_live_comments', 'author', 'author_flair_css_class', 'author_flair_richtext', 'author_flair_template_id', 'author_flair_text', 
'author_flair_text_color', 'author_flair_type', 'author_fullname', 'author_patreon_flair', 'author_premium', 'awarders', 'can_mod_post', 'contest_mode', 'created_utc', 'domain',
'full_link', 'gildings', 'id', 'is_crosspostable', 'is_meta', 'is_original_content', 'is_reddit_media_domain', 'is_robot_indexable', 'is_self', 'is_video',
'link_flair_background_color', 'link_flair_css_class', 'link_flair_richtext', 'link_flair_template_id', 'link_flair_text', 'link_flair_text_color', 'link_flair_type', 'locked',
'media_only', 'no_follow', 'num_comments', 'num_crossposts', 'over_18', 'parent_whitelist_status', 'permalink', 'pinned', 'pwls', 'retrieved_on', 'score', 'selftext',
'send_replies', 'spoiler', 'stickied', 'subreddit', 'subreddit_id', 'subreddit_subscribers', 'subreddit_type', 'thumbnail', 'title', 'total_awards_received', 'treatment_tags',
'upvote_ratio', 'url', 'url_overridden_by_dest', 'whitelist_status', 'wls'])
```

Note that `created_utc` is a Unix timestamp (seconds since the epoch, in UTC).

```
> [post['created_utc'] for post in response.json()['data']]

>[1595657105,
 1595641997,
 1595632191,
 1595623051,
 1595602847,
 1595599200,
 1595583205,
 1595581926,...
```

This tells us that newer posts come first (i.e., the order of posts in `response.json()['data']` is newest to oldest).
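
To make the timestamps readable and to check the ordering explicitly, something like the following could be used. It is a small sketch that hard-codes the first four values from the output above rather than calling the API.

```python
from datetime import datetime, timezone

# First few `created_utc` values copied from the output above.
created_utc = [1595657105, 1595641997, 1595632191, 1595623051]

# Convert each Unix timestamp to a timezone-aware UTC datetime.
readable = [datetime.fromtimestamp(ts, tz=timezone.utc) for ts in created_utc]
for ts, dt in zip(created_utc, readable):
    print(ts, dt.isoformat())

# Newest-to-oldest means every timestamp is >= the one that follows it.
is_newest_first = all(a >= b for a, b in zip(created_utc, created_utc[1:]))
print(is_newest_first)
```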

What are the values that we can access for each comment?

```python
comments_json[0].keys()

dict_keys(['all_awardings', 'approved_at_utc', 'associated_award', 'author', 'author_flair_background_color', 'author_flair_css_class', 'author_flair_richtext',
'author_flair_template_id', 'author_flair_text', 'author_flair_text_color', 'author_flair_type', 'author_fullname', 'author_patreon_flair', 'author_premium', 'awarders',
'banned_at_utc', 'body', 'can_mod_post', 'collapsed', 'collapsed_because_crowd_control', 'collapsed_reason', 'created_utc', 'distinguished', 'edited', 'gildings', 'id', 
'is_submitter', 'link_id', 'locked', 'no_follow', 'parent_id', 'permalink', 'retrieved_on', 'score', 'send_replies', 'stickied', 'subreddit', 'subreddit_id', 'top_awarded_type', 
'total_awards_received', 'treatment_tags'])
   
```


In [36]:
# Collect N posts
number_of_posts_to_collect = 50
posts = []
oldest_post_utc = None
while len(posts) < number_of_posts_to_collect: # < 1000
    if oldest_post_utc is None:
        posts = collect_posts()
        oldest_post_utc = posts[-1]['created_utc']
    else:
        print('oldest_post_utc: {}'.format(oldest_post_utc))
        time.sleep(1)
        older_posts = collect_posts(before=oldest_post_utc)
        if not older_posts:  # nothing older was returned; stop instead of raising IndexError
            break
        oldest_post_utc = older_posts[-1]['created_utc']
        posts.extend(older_posts)

data_filename = 'rkpop-{}-posts.pkl'.format(number_of_posts_to_collect)
save_data(posts, 'data/{}'.format(data_filename))

# Test if saving works
test_load = load_data('data/{}'.format(data_filename))
assert test_load == posts
del test_load
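
The paging logic in the cell above can be isolated and tried without network access. This is a sketch under assumptions: `fetch_page` is a hypothetical stand-in for `collect_posts` (whose real signature lives in `data_collection_utils` and is not shown here), and `fake_fetch_page` fabricates posts purely for demonstration. Unlike the original loop, this version trims the result to exactly `n` posts and stops when a page comes back empty.

```python
def collect_n_posts(fetch_page, n):
    """Page backwards in time until n posts are collected or no more are returned.

    fetch_page(before) is assumed to return posts ordered newest to oldest,
    all strictly older than `before` (or the newest posts when before is None).
    """
    posts = []
    before = None
    while len(posts) < n:
        page = fetch_page(before)
        if not page:  # nothing older is available, so stop paging
            break
        posts.extend(page)
        before = page[-1]["created_utc"]  # oldest timestamp seen so far
    return posts[:n]


def fake_fetch_page(before, page_size=3, newest=10):
    """Hypothetical data source: posts with timestamps newest, newest-1, ..., 1."""
    start = newest if before is None else before - 1
    stop = max(start - page_size, 0)
    return [{"id": str(t), "created_utc": t} for t in range(start, stop, -1)]


first_five = collect_n_posts(fake_fetch_page, 5)
print([p["created_utc"] for p in first_five])

# Asking for more posts than exist terminates instead of looping forever.
print(len(collect_n_posts(fake_fetch_page, 50)))
```

With the real collector, a wrapper such as `lambda before: collect_posts(before=before)` plus a `time.sleep(1)` between pages would play the role of `fetch_page`, assuming `collect_posts` accepts `before=None`.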

In [34]:
post_ids = [post['id'] for post in posts]
len(set(post_ids))

50

In [31]:
from multiprocessing import Pool
import multiprocessing

num_cpu = multiprocessing.cpu_count()

In [35]:
# Collect comments for the posts
# with Pool(10) as p:
#     comments = p.map(collect_comment, posts)

# Decided to do it serially because multiprocessing was getting my IP blocked?

comments = []
for post in tqdm(posts):
    comments.append(collect_comment(post, size_to_collect=50)) # Max to collect at once is 100....

save_data(comments, 'data/rkpop-50-comments.pkl')

100%|██████████| 50/50 [02:21<00:00,  2.83s/it]


In [42]:
import glob

In [25]:
# for comment_pkl in glob.glob('data/comments/*'):
#     print(len(load_data(comment_pkl))) 
# ??? Some sort of collection error... fewer than 50 for many

In [53]:
# Not sure what's going on with the saved rkpop-50-comments pkl

comments = {}
for filename in glob.glob('data/comments/*'):
    start = filename.rindex('/') + 1
    end = filename.rindex('-')
    post_id = filename[start:end]
    comments[post_id] = load_data(filename)

comments.keys()

dict_keys(['lsnwob-50', 'lwjk2z-50', 'ltln4b-50', 'lshlk9-50', 'lsd3o2-50', 'lu3jz4-50', 'lsy8d5-50', 'lsuap0-50', 'ltczoz-50', 'lsi5ps-50', 'ltcu3m-50', 'lsm25s-50', 'lsqrhi-50', 'ls48qf-50', 'lsowxb-50', 'ls7inp-50', 'lt1ud1-50', 'ls2btt-50', 'lss0sj-50', 'lu35aa-50', 'lwjwkh-50', 'ltna6n-50', 'lwmn31-50', 'lszzqo-50', 'lwlmck-50', 'ls3u29-50', 'lsc69z-50', 'lwmc3q-50', 'lt229t-50', 'ls1ge8-50', 'lsol9v-50', 'lsz15v-50', 'lwpgcl-50', 'ls0rin-50', 'lsrbpv-50', 'lu1cq0-50', 'lsuedo-50', 'ls8g51-50', 'ls2ajb-50', 'lwoei6-50', 'ltm7mp-50', 'lt32uy-50', 'ls27m9-50', 'lsz4y6-50', 'lslnvo-50', 'lsu027-50', 'lss0qb-50', 'lsuyzh-50', 'lu71qn-50', 'lsqd5f-50'])
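
The keys above keep a `-50` suffix because `rindex('-')` cuts at the last hyphen in the filename. One way to recover the bare post id is to cut at the first hyphen instead. This is a sketch that assumes filenames look like `data/comments/lsnwob-50-comments.pkl`; that layout is a guess consistent with the keys shown, not something confirmed by the notebook, and `post_id_from_filename` is a hypothetical helper.

```python
from pathlib import Path


def post_id_from_filename(filename):
    """Return the text before the first hyphen of the file's base name."""
    # Reddit base36 ids contain no hyphens, so the first hyphen ends the id.
    return Path(filename).name.split("-", 1)[0]


# Hypothetical filename, matching the assumed layout described above.
print(post_id_from_filename("data/comments/lsnwob-50-comments.pkl"))
```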

In [68]:
example_post_id = list(comments.keys())[0]
example_post_id

'lsnwob-50'

In [54]:
post_ids = [p['id'] for p in posts]
post_titles = [p['title'] for p in posts]

post_ids_titles_dict = dict(zip(post_ids, post_titles))

In [69]:
comments[example_post_id][:5]

['My company hires hundreds, if not 1000+ employees each year, and absolutely does background check on every hiree. They check your personal references, workplace reference, check my parents background (when I was already hired my boss said to the group something like "her dad works in X and was assigned in X" which I did not reveal in my application) and even go where you live.\n\nI definitely think they can do more rigorous background checks on the 10s of trainees they accept each year. If that is too much then they can do it one year after if the trainee remained with the company and shown great potential to debut.\n\nIt will definitely not be perfect. If your bullying is confined to a few people and hidden it\'ll be hard to track it down. There\'ll also be malicious accusations. But if someone is a known school or class bully (like Hyunjin appears to be) then it\'ll be possible to corroborate claims.',
 "This honestly just seems like middle school name calling, disagreements, and b

In [72]:
post_ids_titles_dict

{'lwpgcl': 'iKON - Why Why Why',
 'lwoei6': "More Brands And Variety Shows Pull APRIL's Naeun From Their Content Due To Recent Controversy",
 'lwmn31': "SHINee's Taemin awarded for being an exemplary taxpayer along with Park Minyoung, Jo Jungsuk, and Jo Seho",
 'lwmc3q': "Dong Suh Foods Corporation halts promotions with April's Naeun",
 'lwlmck': "'Delicious Rendezvous' to edit out April's Naeun in the upcoming episode",
 'lwjwkh': 'S. Korea to build state-run K-pop concert hall for unaffordable online performances',
 'lwjk2z': 'ITZY Yeji to be the featured artist for "M2 Studio CHOOM - Artist of the Month" for March 2021',
 'lu71qn': 'Pledis releases a statement regarding Seventeen’s Mingyu Bullying Allegations',
 'lu3jz4': "Lights ON! We're ONF. Let's AMA!",
 'lu35aa': 'Packed Circle Chart - K-Pop Groups and YouTube Distribution Channels',
 'lu1cq0': "Second accuser, owner of 'The North Face' jacket, reveals they were also asked to meet with Cube Entertainment reps",
 'ltna6n': 'Brav

## Saving to CSV

In [76]:
def get_comments_from_obj(post_id):
    # Workaround: keys in `comments` carry a '-50' suffix left over from filename parsing
    post_id_mod = post_id + '-50'
    return comments.get(post_id_mod)  # None if no comments were loaded for this post

df = pd.DataFrame.from_dict(post_ids_titles_dict, orient='index')
comments_as_series = df.reset_index()['index'].apply(lambda post_id: get_comments_from_obj(post_id))
df = df.reset_index()
df['comments'] = comments_as_series
df.columns = ['id', 'title', 'comments']
df.head()

Unnamed: 0,id,title,comments
0,lwpgcl,iKON - Why Why Why,[]
1,lwoei6,More Brands And Variety Shows Pull APRIL's Nae...,[]
2,lwmn31,SHINee's Taemin awarded for being an exemplary...,[Korea has awards for everything. Im still not...
3,lwmc3q,Dong Suh Foods Corporation halts promotions wi...,[DSP is fucked either way. DSP is literally su...
4,lwlmck,'Delicious Rendezvous' to edit out April's Nae...,[She was the second celebrity that made me bin...


In [77]:
df.to_csv('data/rkpop-data-2021-03-04.csv', index=False)

In [4]:
data_df = pd.read_csv('data/rkpop-data-2021-03-04.csv')
data_df.head()

Unnamed: 0,id,title,comments
0,lwpgcl,iKON - Why Why Why,[]
1,lwoei6,More Brands And Variety Shows Pull APRIL's Nae...,[]
2,lwmn31,SHINee's Taemin awarded for being an exemplary...,['Korea has awards for everything. Im still no...
3,lwmc3q,Dong Suh Foods Corporation halts promotions wi...,"[""DSP is fucked either way. DSP is literally s..."
4,lwlmck,'Delicious Rendezvous' to edit out April's Nae...,"[""She was the second celebrity that made me bi..."
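
Note that after the CSV round trip the `comments` column holds strings such as `"['Korea has...']"` rather than lists, as the quoting in the output above suggests. A possible way to restore the lists is `ast.literal_eval`; the helper name `parse_comment_list` below is hypothetical, and the sample strings are shortened for illustration.

```python
import ast


def parse_comment_list(cell):
    """Turn a stringified list such as "['a', 'b']" back into a Python list."""
    if not isinstance(cell, str):
        # Already a list, or a missing value; leave it unchanged.
        return cell
    return ast.literal_eval(cell)


print(parse_comment_list("['Korea has awards for everything.', '[removed]']"))
print(parse_comment_list("[]"))
```

Applied to the DataFrame, this would look like `data_df['comments'] = data_df['comments'].apply(parse_comment_list)`.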


## Identify male vs female groups

In [43]:
m_f_mapping = {'male': {'EXO', 'NCT', 'BTS', 'Stray Kids', 'G-Dragon', 'Big Bang', 
                        'AB6IX', 'Golden Child', 'SEVENTEEN', 'Top Secret', 'TST', 
                        'ONEUS', 'TVXQ', 'PENTAGON', 'THE BOYZ', 'VERIVERY', 'Ravi', 
                        'WayV', 'VIXX', 'Super Junior', 'SHINee', 'Monsta X',
                        'Block B', 'Zico', 'Treasure', 'iKON'},

               'female': {'GFriend', "Girl's Day", 'Red Velvet', 'AOA', 'BLACKPINK', 
               'Momoland', 'miss A', 'MAMAMOO', 'ITZY', 'Sunmi', 'Weeekly', 'NiziU', 
               'NATTY', 'Twice', 'LOONA', 'After School', 'IU', 'IZ*ONE', 'WJSN', 
               'Cosmic Girls', 'DIA', 'CHUNGHA', 'SNSD', 'Cherry Bullet', 'Somi', 
               '(G)I-DLE', 'Apink', 'Yukika', 'Oh My Girl', 'Lee Hi',
               'PURPLE K!SS', 'Singer Minty', 'Rocket Punch'}
}
m_f_mapping['male'] = {g.lower() for g in m_f_mapping['male']}
m_f_mapping['female'] = {g.lower() for g in m_f_mapping['female']}



## Tag submissions with male or female

In [44]:
# https://stackoverflow.com/questions/55941100/how-to-filter-pandas-dataframe-rows-which-contains-any-string-from-a-list
data_df['female'] = data_df['title'].str.contains('|'.join(m_f_mapping['female']), case=False)
data_df['male'] = data_df['title'].str.contains('|'.join(m_f_mapping['male']), case=False)

  return func(self, *args, **kwargs)
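
The title matching above joins the raw names into a regular expression, so names such as `(g)i-dle` and `iz*one` are interpreted as regex syntax, and short names such as `iu` or `dia` can match inside unrelated words. A more defensive variant might escape each name and require that the match is not embedded in a longer word. This is a sketch, not the notebook's method: `build_name_pattern` is a hypothetical helper, and two of the sample titles are invented for illustration.

```python
import re


def build_name_pattern(names):
    """Compile a case-insensitive regex that matches any name as a whole token."""
    # Try longer names first so multi-word names win over shorter alternatives.
    escaped = [re.escape(name) for name in sorted(names, key=len, reverse=True)]
    # Lookarounds are used instead of \b because names like '(g)i-dle'
    # begin with a non-word character, where \b would not behave as wanted.
    return re.compile(r"(?<!\w)(?:" + "|".join(escaped) + r")(?!\w)", re.IGNORECASE)


female_pattern = build_name_pattern({"dia", "iu", "(g)i-dle", "iz*one", "itzy"})

titles = [
    "ITZY Yeji to be the featured artist",
    "(G)I-DLE announce comeback",              # hypothetical title
    "Social media reacts to premium tickets",  # hypothetical title
]
for title in titles:
    print(bool(female_pattern.search(title)), title)
```

On the DataFrame this could be applied as `data_df['title'].apply(lambda t: bool(female_pattern.search(t)))`. The stricter matching trades some recall for fewer false positives, so the tagged rows would likely differ from the output shown in this notebook.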


In [48]:
data_df.head(15)

Unnamed: 0,id,title,comments,female,male
0,lwpgcl,iKON - Why Why Why,[],False,True
1,lwoei6,More Brands And Variety Shows Pull APRIL's Nae...,[],False,False
2,lwmn31,SHINee's Taemin awarded for being an exemplary...,['Korea has awards for everything. Im still no...,False,True
3,lwmc3q,Dong Suh Foods Corporation halts promotions wi...,"[""DSP is fucked either way. DSP is literally s...",False,False
4,lwlmck,'Delicious Rendezvous' to edit out April's Nae...,"[""She was the second celebrity that made me bi...",False,False
5,lwjwkh,S. Korea to build state-run K-pop concert hall...,"['Yup, kiswe the one in charge for BTS online ...",False,False
6,lwjk2z,"ITZY Yeji to be the featured artist for ""M2 St...","['Oh yes, I would love to see all of ITZYs dan...",True,False
7,lu71qn,Pledis releases a statement regarding Seventee...,"[""yeah they are a fan but they aren't 14 they'...",False,True
8,lu3jz4,Lights ON! We're ONF. Let's AMA!,[],False,False
9,lu35aa,Packed Circle Chart - K-Pop Groups and YouTube...,[],False,False


In [47]:
# Checking if any overlapping...
data_df[data_df['male'] & data_df['female']]

Unnamed: 0,id,title,comments,female,male
11,ltna6n,Brave Girls achieve their first #1 on Bugs Cha...,"['[removed]', 'They’re so close to breaking in...",True,True


In [50]:
data_df[data_df['male'] | data_df['female']]

Unnamed: 0,id,title,comments,female,male
0,lwpgcl,iKON - Why Why Why,[],False,True
2,lwmn31,SHINee's Taemin awarded for being an exemplary...,['Korea has awards for everything. Im still no...,False,True
6,lwjk2z,"ITZY Yeji to be the featured artist for ""M2 St...","['Oh yes, I would love to see all of ITZYs dan...",True,False
7,lu71qn,Pledis releases a statement regarding Seventee...,"[""yeah they are a fan but they aren't 14 they'...",False,True
11,ltna6n,Brave Girls achieve their first #1 on Bugs Cha...,"['[removed]', 'They’re so close to breaking in...",True,True
12,ltm7mp,JYP Releases Statement Addressing Stray Kids’ ...,[],False,True
14,ltczoz,"Halsey, MAX, And Lauv Show Support For BTS In ...","['[removed]', ""I think they are regulated, but...",False,True
15,ltcu3m,W Magazine- George Clooney's dramatic reading ...,['I think most longterm kpop fans dismissed th...,False,True
18,lt1ud1,Starship Entertainment Says That MONSTA X’s Ki...,"[""I think your points are valid. I also think ...",False,True
20,lsz4y6,German radio show Bayern 3 apologizes for host...,"['[removed]', 'Waiting on your ""constructive"" ...",False,True
