## Planning / Scratch Work

Do /r/kpop commenters talk differently about male vs. female groups?

Initial exploration of this question:
- Identify submissions on 2 all-male groups, 2 all-female groups
- Collect their comments
- Contrast the comments as a whole with "typical" Reddit language (using /r/funny as the baseline)
- Contrast comments on male groups with comments on female groups
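
The plan above does not yet say how the two sets of comments will be contrasted. One simple option, shown here only as a sketch, is to compare smoothed relative word frequencies between two corpora. The function names `tokenize` and `log_frequency_ratio`, the regex tokenizer, and the add-one smoothing are all assumptions made for illustration, not part of the original notebook.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a comment and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def log_frequency_ratio(comments_a, comments_b, smoothing=1.0):
    """Return {word: log(p_a / p_b)} using add-`smoothing` relative frequencies.

    Positive scores mark words that are relatively more frequent in
    comments_a, negative scores words more frequent in comments_b.
    """
    counts_a = Counter(tok for c in comments_a for tok in tokenize(c))
    counts_b = Counter(tok for c in comments_b for tok in tokenize(c))
    vocab = set(counts_a) | set(counts_b)

    # Smoothed totals, so a word missing from one corpus never yields log(0).
    total_a = sum(counts_a.values()) + smoothing * len(vocab)
    total_b = sum(counts_b.values()) + smoothing * len(vocab)

    scores = {}
    for word in vocab:
        p_a = (counts_a[word] + smoothing) / total_a
        p_b = (counts_b[word] + smoothing) / total_b
        scores[word] = math.log(p_a / p_b)
    return scores
```

Sorting the returned dictionary by score would surface the words most characteristic of each corpus, for example male-group comments versus female-group comments, or /r/kpop versus /r/funny.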

Using Pushshift to get Reddit comments

See [Pushshift's GitHub API README](https://github.com/pushshift/api)

> Search for the most recent comments mentioning the word "science" within the subreddit /r/askscience
>
> `https://api.pushshift.io/reddit/search/comment/?q=science&subreddit=askscience`

Retrieve all comment ids for a submission object

`https://api.pushshift.io/reddit/submission/comment_ids/{base36_submission_id}`
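
As a small illustration, the two URL patterns quoted above can be assembled from their parts with the standard library. The helper names `comment_search_url` and `comment_ids_url` are hypothetical; only the URL shapes come from the Pushshift README excerpts above.

```python
from urllib.parse import urlencode

BASE = 'https://api.pushshift.io/reddit'


def comment_search_url(**params):
    """Build a comment-search URL from query parameters (e.g. q, subreddit)."""
    return BASE + '/search/comment/?' + urlencode(params)


def comment_ids_url(submission_id):
    """Build the URL listing all comment ids for a base-36 submission id."""
    return f'{BASE}/submission/comment_ids/{submission_id}'


print(comment_search_url(q='science', subreddit='askscience'))
print(comment_ids_url('ho4kzo'))
```

Either URL can then be passed to `requests.get` in the same way as the calls in the Data Collection section.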

[New to Pushshift FAQ](https://www.reddit.com/r/pushshift/comments/bcxguf/new_to_pushshift_read_this_faq/)

[Pushshift Reddit API v4.0 Documentation](https://reddit-api.readthedocs.io/en/latest/#)

Non-comprehensive list of related work:
- "A Community of Curious Souls: An Analysis of Commenting Behavior on TED Talks Videos" (Tsou, Thelwall, Mongeon, and Sugimoto, 2014)
- "YouTube science channel video presenters and comments: female friendly or vestiges of sexism?" (Thelwall and Mas-Bleda, 2018)
- "Shirtless and dangerous: Quantifying linguistic signals of gender bias in an online fiction writing community" (Fast, Vachovsky, and Bernstein, 2016)
- "Using language models to quantify gender bias in sports journalism" (Fu, Danescu-Niculescu-Mizil, and Lee, 2016)

## Data Collection

Import statements

In [4]:
import requests
import json

import pandas as pd
from bs4 import BeautifulSoup

Collect relevant /r/kpop submissions

In [5]:
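# Submissions in /r/kpop with a score above 50 and more than 50 comments
url = 'https://api.pushshift.io/reddit/search/submission/?subreddit=kpop&score=>50&num_comments=>50'
user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36'
response = requests.get(url, headers={'User-Agent': user_agent})
response.raise_for_status()  # fail early on an HTTP error
submissions = response.json()['data']  # parse the JSON body once
post_titles = [post['title'] for post in submissions]
post_ids = [post['id'] for post in submissions]
post_id = post_ids[0]  # first submission, used to try out comment collection below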
# response.json()['data'][1].keys() # What are the values that we can access for each submission?
# dict_keys(['all_awardings', 'allow_live_comments', 'author', 'author_flair_css_class', 'author_flair_richtext', 'author_flair_template_id', 'author_flair_text', 'author_flair_text_color', 'author_flair_type', 'author_fullname', 'author_patreon_flair', 'author_premium', 'awarders', 'can_mod_post', 'contest_mode', 'created_utc', 'domain', 'full_link', 'gildings', 'id', 'is_crosspostable', 'is_meta', 'is_original_content', 'is_reddit_media_domain', 'is_robot_indexable', 'is_self', 'is_video', 'link_flair_background_color', 'link_flair_css_class', 'link_flair_richtext', 'link_flair_template_id', 'link_flair_text', 'link_flair_text_color', 'link_flair_type', 'locked', 'media_only', 'no_follow', 'num_comments', 'num_crossposts', 'over_18', 'parent_whitelist_status', 'permalink', 'pinned', 'pwls', 'retrieved_on', 'score', 'selftext', 'send_replies', 'spoiler', 'stickied', 'subreddit', 'subreddit_id', 'subreddit_subscribers', 'subreddit_type', 'thumbnail', 'title', 'total_awards_received', 'treatment_tags', 'upvote_ratio', 'url', 'url_overridden_by_dest', 'whitelist_status', 'wls'])

Collect comments for a given post_id

In [6]:
url = 'https://api.pushshift.io/reddit/comment/search?link_id=' + post_id
response = requests.get(url, headers={'User-Agent': user_agent})
comments_json = response.json()['data']
# comments_json[0].keys()
# dict_keys(['all_awardings', 'approved_at_utc', 'associated_award', 'author', 'author_flair_background_color', 'author_flair_css_class', 'author_flair_richtext', 'author_flair_template_id', 'author_flair_text', 'author_flair_text_color', 'author_flair_type', 'author_fullname', 'author_patreon_flair', 'author_premium', 'awarders', 'banned_at_utc', 'body', 'can_mod_post', 'collapsed', 'collapsed_because_crowd_control', 'collapsed_reason', 'created_utc', 'distinguished', 'edited', 'gildings', 'id', 'is_submitter', 'link_id', 'locked', 'no_follow', 'parent_id', 'permalink', 'retrieved_on', 'score', 'send_replies', 'stickied', 'subreddit', 'subreddit_id', 'top_awarded_type', 'total_awards_received', 'treatment_tags'])
comment_bodies = [comment['body'] for comment in comments_json]

In [7]:
# Build one [id, title, comment bodies] entry per submission
data = []
for post_id, title in zip(post_ids, post_titles):
    # TODO: Collect more than 25 comments (only the first page of results is fetched)
    url = 'https://api.pushshift.io/reddit/comment/search?link_id=' + post_id
    response = requests.get(url, headers={'User-Agent': user_agent})
    comments_json = response.json()['data']
    comment_bodies = [comment['body'] for comment in comments_json]
    data.append([post_id, title, comment_bodies])
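
The TODO above notes that only 25 comments are collected per submission. One possible way to lift that limit is to page backwards through the results. The sketch below assumes that the comment search endpoint accepts `size` and `before` parameters and returns newest comments first; both are assumptions to check against the Pushshift documentation. `collect_all_comments`, `pushshift_fetch`, and the placeholder user agent string are hypothetical names introduced for illustration.

```python
def collect_all_comments(post_id, fetch_page, page_size=100, max_pages=50):
    """Page through a submission's comments until no new ones come back.

    fetch_page(params) must return a list of comment dicts that each
    contain at least 'id' and 'created_utc'.
    """
    comments = []
    seen_ids = set()
    before = None
    for _ in range(max_pages):
        params = {'link_id': post_id, 'size': page_size}
        if before is not None:
            params['before'] = before  # assumed: only comments older than this
        page = fetch_page(params)
        new = [c for c in page if c['id'] not in seen_ids]
        if not new:
            break  # empty page or only repeats: nothing older is left
        comments.extend(new)
        seen_ids.update(c['id'] for c in new)
        # Next request starts from the oldest comment seen so far.
        before = min(c['created_utc'] for c in new)
    return comments


def pushshift_fetch(params):
    """Fetch one page from the endpoint used in the cells above."""
    import requests  # imported here so the paging logic has no hard dependency
    url = 'https://api.pushshift.io/reddit/comment/search'
    headers = {'User-Agent': 'kpop-comment-study (placeholder)'}
    response = requests.get(url, params=params, headers=headers)
    response.raise_for_status()
    return response.json()['data']
```

Calling `collect_all_comments(post_id, pushshift_fetch)` would replace the single request inside the loop. Because paging keys on `created_utc`, comments that share the exact timestamp of a page boundary could be skipped; passing `fetch_page` in as an argument also makes the paging logic easy to exercise with a fake fetcher.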

Identify male vs female groups

In [8]:
data_df = pd.DataFrame(data, columns=['id', 'title', 'comments'])
m_f_mapping = {
    'male': {'EXO', 'NCT', 'BTS', 'Stray Kids', 'G-Dragon', 'Big Bang'},
    'female': {'GFriend', "Girl's Day", 'Red Velvet', 'AOA', 'BLACKPINK', 'Momoland', 'miss A', 'MAMAMOO', 'ITZY', 'Sunmi', 'Weeekly', 'NiziU'},
}
m_f_mapping['male'] = {g.lower() for g in m_f_mapping['male']}
m_f_mapping['female'] = {g.lower() for g in m_f_mapping['female']}

Tag submissions as male or female

In [9]:
data_df['male'] = data_df.title.apply(lambda t: any(group in t.lower() for group in m_f_mapping['male']))
data_df['female'] = data_df.title.apply(lambda t: any(group in t.lower() for group in m_f_mapping['female']))
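
The tagging above uses plain substring matching, so a short group name such as 'exo' would also match inside an unrelated word like 'exotic'. A possible refinement, sketched here under the assumption that group names should only match as whole words, uses regular-expression lookarounds. The helper name `mentions_any` is hypothetical.

```python
import re


def mentions_any(title, groups):
    """Return True if any group name appears in the title as a whole word."""
    lowered = title.lower()
    for group in groups:
        # (?<!\w) and (?!\w) require a non-word character (or the string
        # boundary) on each side, so 'exo' matches 'EXO-SC' but not 'exotic'.
        pattern = r'(?<!\w)' + re.escape(group.lower()) + r'(?!\w)'
        if re.search(pattern, lowered):
            return True
    return False


print(mentions_any('EXO-SC - On Me (Sehun Solo - Track MV)', {'exo'}))
print(mentions_any('An exotic stage outfit', {'exo'}))
```

It could be dropped into the cell above as `data_df.title.apply(lambda t: mentions_any(t, m_f_mapping['male']))`, and likewise for the female set.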

In [10]:
data_df

| | id | title | comments | male | female |
|---|---|---|---|---|---|
| 0 | ho4kzo | EXO-SC - On Me (Sehun Solo - Track MV) | `[this was so different from what we usually ge...` | True | False |
| 1 | ho4kpi | GFriend - Apple (MV Teaser 1) | `[Obvious girl detected, lol. Gfriend is for me...` | False | True |
| 2 | hnin4k | Happy 10th Anniversary to Girl's Day! | `[i'm a somewhat new k pop fan (since late 2017...` | False | True |
| 3 | hm9ctk | Happy 4th anniversary to NCT 127! | `[n in nct stands for non stop bops, man im hap...` | True | False |
| 4 | hm9cak | Red Velvet - IRENE &amp; SEULGI - Monster Musi... | `[well, 3 people downvoted me but looks like I ...` | False | True |
| 5 | hlz0dr | FNC Entertainment confirms that AOA will no lo... | `[That could be, but AOA at the time was more h...` | False | True |
| 6 | hlydaf | BTS now holds the record for an artist with th... | `[ARMYs no chill, And the one lipbalm jungkook ...` | True | False |
| 7 | hlx4yj | BLACKPINK - How You Like That (Performance Video) | `[For real! When she's doing solo work, she def...` | False | True |
| 8 | hluxyt | What are some instances of kpop idols hating t... | `[[This video](https://youtu.be/u5XU9-wpqNY?t=1...` | False | False |
| 9 | hlqlje | Heyyy it’s Kyla (from pristin)! AMA | `[Hi Kylа! How did you or Pristin members react...` | False | False |
