## Planning / Scratch Work

Do /r/kpop commenters talk differently about male vs. female groups?

Initial exploration of this question:
- Identify submissions about 2 all-male groups and 2 all-female groups
- Collect their comments
- Contrast the comments overall with "typical" Reddit language (using /r/funny as a baseline)
- Contrast comments on male groups with comments on female groups

Using Pushshift to get reddit comments

See [Pushshift's GitHub API README](https://github.com/pushshift/api)

> Search for the most recent comments mentioning the word "science" within the subreddit /r/askscience
>
> `https://api.pushshift.io/reddit/search/comment/?q=science&subreddit=askscience`

Retrieve all comment ids for a submission object

`https://api.pushshift.io/reddit/submission/comment_ids/{base36_submission_id}`

[New to Pushshift FAQ](https://www.reddit.com/r/pushshift/comments/bcxguf/new_to_pushshift_read_this_faq/)

[Pushshift Reddit API v4.0 Documentation](https://reddit-api.readthedocs.io/en/latest/#)

Related work (not comprehensive):
- "A Community of Curious Souls: An Analysis of Commenting Behavior on TED Talks Videos" (Tsou, Thelwall, Mongeon, and Sugimoto, 2014)
- "YouTube science channel video presenters and comments: female friendly or vestiges of sexism?" (Thelwall and Mas-Bleda, 2018)
- "Shirtless and dangerous: Quantifying linguistic signals of gender bias in an online fiction writing community" (Fast, Vachovsky, and Bernstein, 2016)
- "Using language models to quantify gender bias in sports journalism" (Fu, Danescu-Niculescu-Mizil, and Lee, 2016)

## Data Collection

Import statements

In [111]:
import string
import re
import requests
import json

from collections import Counter

import pandas as pd

import nltk
from nltk.corpus import stopwords

In [134]:
ENGLISH_STOPWORDS = set(stopwords.words('english'))  # needs nltk.download('stopwords') once; a set makes membership tests fast

Collect relevant /r/kpop submissions

In [2]:
url = 'https://api.pushshift.io/reddit/search/submission/?subreddit=kpop&score=>50&num_comments=>50'
user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36'
response = requests.get(url, headers={'User-Agent': user_agent})
posts = response.json()['data']  # parse the JSON body once
post_titles = [post['title'] for post in posts]
post_ids = [post['id'] for post in posts]
post_id = post_ids[0]

What are the values that we can access for each submission?

```python
response.json()['data'][1].keys()

dict_keys(['all_awardings', 'allow_live_comments', 'author', 'author_flair_css_class', 'author_flair_richtext', 'author_flair_template_id', 'author_flair_text', 
'author_flair_text_color', 'author_flair_type', 'author_fullname', 'author_patreon_flair', 'author_premium', 'awarders', 'can_mod_post', 'contest_mode', 'created_utc', 'domain',
'full_link', 'gildings', 'id', 'is_crosspostable', 'is_meta', 'is_original_content', 'is_reddit_media_domain', 'is_robot_indexable', 'is_self', 'is_video',
'link_flair_background_color', 'link_flair_css_class', 'link_flair_richtext', 'link_flair_template_id', 'link_flair_text', 'link_flair_text_color', 'link_flair_type', 'locked',
'media_only', 'no_follow', 'num_comments', 'num_crossposts', 'over_18', 'parent_whitelist_status', 'permalink', 'pinned', 'pwls', 'retrieved_on', 'score', 'selftext',
'send_replies', 'spoiler', 'stickied', 'subreddit', 'subreddit_id', 'subreddit_subscribers', 'subreddit_type', 'thumbnail', 'title', 'total_awards_received', 'treatment_tags',
'upvote_ratio', 'url', 'url_overridden_by_dest', 'whitelist_status', 'wls'])
```

Collect comments given post_id

In [3]:
url = 'https://api.pushshift.io/reddit/comment/search?link_id=' + post_id
response = requests.get(url, headers={'User-Agent': user_agent})
comments_json = response.json()['data']
comment_bodies = [comment['body'] for comment in comments_json]

What are the values that we can access for each comment?

```python
comments_json[0].keys()

dict_keys(['all_awardings', 'approved_at_utc', 'associated_award', 'author', 'author_flair_background_color', 'author_flair_css_class', 'author_flair_richtext',
'author_flair_template_id', 'author_flair_text', 'author_flair_text_color', 'author_flair_type', 'author_fullname', 'author_patreon_flair', 'author_premium', 'awarders',
'banned_at_utc', 'body', 'can_mod_post', 'collapsed', 'collapsed_because_crowd_control', 'collapsed_reason', 'created_utc', 'distinguished', 'edited', 'gildings', 'id', 
'is_submitter', 'link_id', 'locked', 'no_follow', 'parent_id', 'permalink', 'retrieved_on', 'score', 'send_replies', 'stickied', 'subreddit', 'subreddit_id', 'top_awarded_type', 
'total_awards_received', 'treatment_tags'])
   
```

In [4]:
data = []
for i, post_id in enumerate(post_ids):
    url = 'https://api.pushshift.io/reddit/comment/search?link_id=' + post_id # TODO: Collect more than 25 comments
    response = requests.get(url, headers={'User-Agent': user_agent})
    comments_json = response.json()['data']
    comment_bodies = [comment['body'] for comment in comments_json]
    entry = [post_id, post_titles[i], comment_bodies]
    data.append(entry)
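
The TODO in the loop concerns the default page size. Below is a minimal sketch of one way to ask for more comments per request. It assumes Pushshift accepts a `size` query parameter (results per request, subject to a server-side cap) and a `before` parameter (an epoch-seconds cutoff), as described in the README linked earlier. The helper name `build_comment_search_url` and the example IDs are hypothetical, and the block only builds URLs; it does not call the API.

```python
from urllib.parse import urlencode

# Endpoint used in the cells above.
COMMENT_SEARCH_URL = 'https://api.pushshift.io/reddit/comment/search'


def build_comment_search_url(link_id, size=100, before=None):
    """Build a comment-search URL for one submission (hypothetical helper).

    Assumes Pushshift accepts `size` (results per request) and `before`
    (epoch-seconds cutoff) as query parameters.
    """
    params = {'link_id': link_id, 'size': size}
    if before is not None:
        params['before'] = before
    return COMMENT_SEARCH_URL + '?' + urlencode(params)


print(build_comment_search_url('abc123'))
print(build_comment_search_url('abc123', size=50, before=1593561600))
```

To page through a long thread, one option is to pass the smallest `created_utc` from the previous batch as `before` and repeat until a request returns no data.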

Identify male vs female groups

In [5]:
data_df = pd.DataFrame(data, columns=['id', 'title', 'comments'])
m_f_mapping = {
    'male': {'EXO', 'NCT', 'BTS', 'Stray Kids', 'G-Dragon', 'Big Bang'},
    'female': {'GFriend', "Girl's Day", 'Red Velvet', 'AOA', 'BLACKPINK', 'Momoland', 'miss A', 'MAMAMOO', 'ITZY', 'Sunmi', 'Weeekly', 'NiziU'},
}
m_f_mapping['male'] = {g.lower() for g in m_f_mapping['male']}
m_f_mapping['female'] = {g.lower() for g in m_f_mapping['female']}

Tag submissions with male or female

In [6]:
data_df['male'] = data_df.title.apply(lambda t: any(group in t.lower() for group in m_f_mapping['male']))
data_df['female'] = data_df.title.apply(lambda t: any(group in t.lower() for group in m_f_mapping['female']))
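
The tagging above uses substring matching, so a short name such as `exo` would also match a title containing "exotic". Below is a minimal sketch of a stricter whole-word check. The helper `mentions_group` and the example titles are hypothetical, and this is one possible approach rather than the method used for the results in this notebook.

```python
import re


def mentions_group(title, groups):
    """Return True if any group name appears in `title` as a whole word or phrase."""
    lowered = title.lower()
    return any(
        re.search(r'(?<!\w)' + re.escape(group) + r'(?!\w)', lowered)
        for group in groups
    )


male_groups = {'exo', 'nct', 'bts'}  # subset of m_f_mapping['male']

# Substring test as used above: 'exo' is found inside 'exotic'.
print(any(group in 'exotic concept photos' for group in male_groups))
# Whole-word test: no match for 'Exotic', match for 'EXO'.
print(mentions_group('Exotic concept photos', male_groups))
print(mentions_group('EXO - Obsession (MV)', male_groups))
```

The trade-off is that stricter matching can miss stylized spellings, so it is worth spot-checking the tagged titles either way.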

Clean comment text and prepare for analysis

[How to strip punctuation from a string](https://stackoverflow.com/questions/265960/best-way-to-strip-punctuation-from-a-string)

`s.translate(str.maketrans('', '', string.punctuation))`

[`maketrans` documentation](https://docs.python.org/3.3/library/stdtypes.html?highlight=maketrans#str.maketrans)

[Removing URLs from a string](https://stackoverflow.com/questions/11331982/how-to-remove-any-url-within-a-string-in-python)

In [135]:
def giant_cleaned_string(series_of_list_of_comments):
    """
    Combines multiple pandas rows with lists of strings into one giant string with URLs and punctuation removed.
    """
    comment_string = ' '.join(series_of_list_of_comments.apply(lambda x: ' '.join(x)))
    comment_string = re.sub(r'https?://\S+', '', comment_string)

    chars_to_replace = string.punctuation.replace("'", '') + '“”\n'  # Keep the straight apostrophe so contractions stay intact
    whitespace_to_replace_with = len(chars_to_replace) * ' '

    comment_string = comment_string.lower().translate(str.maketrans(chars_to_replace, whitespace_to_replace_with))
    return comment_string
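
Comments can contain curly apostrophes (for example `we’re` and `he’s` in the male sample output), which this cleaning step leaves in place. If a stopword list spells contractions with a straight apostrophe, the curly variants will not be filtered as stopwords. Below is a minimal sketch of one possible normalization step. The helper `normalize_apostrophes` is hypothetical, and a small made-up stopword set stands in for NLTK's list.

```python
# Map curly single quotes to the straight apostrophe.
APOSTROPHE_TABLE = str.maketrans({'\u2018': "'", '\u2019': "'"})


def normalize_apostrophes(text):
    """Replace curly single quotes with straight apostrophes."""
    return text.translate(APOSTROPHE_TABLE)


toy_stopwords = {"he's", "we're"}  # hypothetical stand-in for ENGLISH_STOPWORDS
sample = 'he\u2019s so brilliant'   # contains a curly apostrophe

kept_raw = [t for t in sample.split() if t not in toy_stopwords]
kept_normalized = [t for t in normalize_apostrophes(sample).split() if t not in toy_stopwords]
print(kept_raw)         # the curly-apostrophe token survives the filter
print(kept_normalized)  # after normalization it is removed
```

If adopted, a call like this would fit most naturally before the `translate` step in `giant_cleaned_string`.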


In [136]:
male_giant_comment_string = giant_cleaned_string(data_df[data_df['male']]['comments'])
female_giant_comment_string = giant_cleaned_string(data_df[data_df['female']]['comments'])

In [137]:
female_giant_comment_string[:250]

"obvious girl detected  lol  gfriend is for me the only group in kpop  this looks like a new concept for them  o omg i just watched that xxyx whatever thing you cited  and it's sooooo different from whatever is gfriend doing here    like the whole vib"

In [138]:
male_giant_comment_string[:250]

"this was so different from what we usually get from sc  but it's good the vivi stans we’re well fed today that character development though he’s so brilliant  my ult bias he and vivi popped off   vivi snapped  toben could never okay this is gonna sou"

In [140]:
def acceptable_token(token):
    """Keep tokens longer than one character that are not English stopwords."""
    return len(token) > 1 and token not in ENGLISH_STOPWORDS

def create_counter_object(giant_comment_string):
    """Count the acceptable tokens in a cleaned, space-separated string."""
    tokens = giant_comment_string.split(' ')
    return Counter(filter(acceptable_token, tokens))

In [156]:
female_word_counter = create_counter_object(female_giant_comment_string)
male_word_counter = create_counter_object(male_giant_comment_string)

In [190]:
len(male_giant_comment_string.split(' '))

6718

In [189]:
len(female_giant_comment_string.split(' '))

14882
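
These totals include empty strings. The cleaning step replaces each punctuation character with a space, so runs of spaces are common, and `split(' ')` yields an empty token between consecutive spaces. A small illustration on a toy string (not the real data), compared with `split()`, which splits on any run of whitespace:

```python
sample = 'obvious girl detected  lol  gfriend'

tokens_single_space = sample.split(' ')  # keeps empty strings between double spaces
tokens_whitespace = sample.split()       # collapses runs of whitespace

print(len(tokens_single_space))  # 7
print(len(tokens_whitespace))    # 5
print(tokens_single_space)
```

The word counters are unaffected, because `acceptable_token` drops tokens shorter than two characters, but the totals above likely overstate the number of words.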

In [177]:
male_top_50 = male_word_counter.most_common(50)
female_top_50 = female_word_counter.most_common(50)
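
Because the two corpora differ in size (the female string has more than twice as many tokens as the male string), raw counts in the two top-50 lists are not directly comparable. One common option is to convert counts to relative frequencies first. Below is a minimal sketch with toy counters. `relative_frequencies` is a hypothetical helper, and the numbers are made up for illustration.

```python
from collections import Counter


def relative_frequencies(counter, per=1000):
    """Convert raw counts to occurrences per `per` counted tokens."""
    total = sum(counter.values())
    return {word: count * per / total for word, count in counter.items()}


toy_male = Counter({'song': 30, 'vocals': 10, 'visuals': 10})
toy_female = Counter({'song': 40, 'vocals': 10, 'visuals': 50})

male_rates = relative_frequencies(toy_male)
female_rates = relative_frequencies(toy_female)

# Rank shared words by how much more frequent they are in the female corpus.
shared = toy_male.keys() & toy_female.keys()
ranked = sorted(shared, key=lambda w: female_rates[w] - male_rates[w], reverse=True)
for word in ranked:
    print(word, round(male_rates[word], 1), round(female_rates[word], 1))
```

With the real data, the same idea would apply to `male_word_counter` and `female_word_counter`, keeping in mind that rare words give noisy rates.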

Which adjectives are used? Which verbs?

[Categorizing and Tagging Words](https://www.nltk.org/book/ch05.html)