## Planning / Scratch Work

Do /r/kpop commenters talk differently about male vs. female groups?

Initial exploration of this question:
- Identify submissions on 2 all-male groups, 2 all-female groups
- Collect their comments
- Contrast comments in general to "typical" reddit language (using /r/funny as a standard)
- Contrast comments on male group vs female group 

Using Pushshift to get reddit comments

See [Pushshift's GitHub API README](https://github.com/pushshift/api)

> Search for the most recent comments mentioning the word "science" within the subreddit /r/askscience
>
> `https://api.pushshift.io/reddit/search/comment/?q=science&subreddit=askscience`

Retrieve all comment ids for a submission object

`https://api.pushshift.io/reddit/submission/comment_ids/{base36_submission_id}`

[New to Pushshift FAQ](https://www.reddit.com/r/pushshift/comments/bcxguf/new_to_pushshift_read_this_faq/)

[Pushshift Reddit API v4.0 Documentation](https://reddit-api.readthedocs.io/en/latest/#)

Non-comprehensive list of related work:
- "A Community of Curious Souls: An Analysis of Commenting Behavior on TED Talks Videos" (Tsou, Thelwall, Mongeon, and Sugimoto, 2014)
- "YouTube science channel video presenters and comments: female friendly or vestiges of sexism?" (Thelwall and Mas-Bleda, 2018)
- "Shirtless and dangerous: Quantifying linguistic signals of gender bias in an online fiction writing community" (Fast, Vachovsky, and Bernstein, 2016)
- "Using language models to quantify gender bias in sports journalism" (Fu, Danescu-Niculescu-Mizil, and Lee, 2016)

## Data Collection

Import statements

In [196]:
import string
import re
import requests
import json

from collections import Counter

import pandas as pd

import nltk
from nltk.corpus import stopwords

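# Requires the NLTK stopword corpus: run nltk.download('stopwords') once beforehand.
# A set makes the membership test in acceptable_token() fast.
ENGLISH_STOPWORDS = set(stopwords.words('english'))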

Collect relevant /r/kpop submissions

In [244]:
url = 'https://api.pushshift.io/reddit/search/submission/?subreddit=kpop&score=>50&num_comments=>50&size=100' # TODO: Collect more than 100 posts
response = requests.get(url)
submissions = response.json()['data']  # parse the response body once
post_titles = [post['title'] for post in submissions]
post_ids = [post['id'] for post in submissions]
post_id = post_ids[0]
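
The TODO above notes that only 100 posts are collected. One possible way past that limit is to page backwards in time. This is a minimal sketch, assuming Pushshift accepts a `before` epoch parameter and returns results newest first, as its README describes. `fetch_page` and `collect_submissions` are hypothetical names, not part of the notebook, and the sketch uses the standard library's `urllib` in place of `requests` so that it runs without extra installs.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = 'https://api.pushshift.io/reddit/search/submission/'


def fetch_page(params):
    """Request one page of submissions (hypothetical helper)."""
    query = urllib.parse.urlencode(params)
    with urllib.request.urlopen(BASE_URL + '?' + query) as response:
        return json.load(response)['data']


def collect_submissions(max_posts=500, page_size=100, fetch=fetch_page):
    """Page backwards in time until max_posts are collected or results run out."""
    params = {'subreddit': 'kpop', 'score': '>50', 'num_comments': '>50',
              'size': page_size}
    posts = []
    while len(posts) < max_posts:
        page = fetch(dict(params))
        if not page:
            break  # no older submissions match the filters
        posts.extend(page)
        # Assumption: results are sorted newest first, so the last item is the oldest.
        params['before'] = page[-1]['created_utc']
    return posts[:max_posts]
```

The `fetch` argument is injectable so the paging logic can be checked offline with a stub before any requests are sent to the API.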

What are the values that we can access for each submission?

```
response.json()['data'][1].keys()

dict_keys(['all_awardings', 'allow_live_comments', 'author', 'author_flair_css_class', 'author_flair_richtext', 'author_flair_template_id', 'author_flair_text', 
'author_flair_text_color', 'author_flair_type', 'author_fullname', 'author_patreon_flair', 'author_premium', 'awarders', 'can_mod_post', 'contest_mode', 'created_utc', 'domain',
'full_link', 'gildings', 'id', 'is_crosspostable', 'is_meta', 'is_original_content', 'is_reddit_media_domain', 'is_robot_indexable', 'is_self', 'is_video',
'link_flair_background_color', 'link_flair_css_class', 'link_flair_richtext', 'link_flair_template_id', 'link_flair_text', 'link_flair_text_color', 'link_flair_type', 'locked',
'media_only', 'no_follow', 'num_comments', 'num_crossposts', 'over_18', 'parent_whitelist_status', 'permalink', 'pinned', 'pwls', 'retrieved_on', 'score', 'selftext',
'send_replies', 'spoiler', 'stickied', 'subreddit', 'subreddit_id', 'subreddit_subscribers', 'subreddit_type', 'thumbnail', 'title', 'total_awards_received', 'treatment_tags',
'upvote_ratio', 'url', 'url_overridden_by_dest', 'whitelist_status', 'wls'])
```

Collect comments given post_id

In [245]:
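# user_agent is never defined in this notebook; the value below is a placeholder.
user_agent = 'rkpop-comment-analysis'
url = 'https://api.pushshift.io/reddit/comment/search?link_id=' + post_id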
response = requests.get(url, headers={'User-Agent': user_agent})
comments_json = response.json()['data']
comment_bodies = [comment['body'] for comment in comments_json]

What are the values that we can access for each comment?

```python
comments_json[0].keys()

dict_keys(['all_awardings', 'approved_at_utc', 'associated_award', 'author', 'author_flair_background_color', 'author_flair_css_class', 'author_flair_richtext',
'author_flair_template_id', 'author_flair_text', 'author_flair_text_color', 'author_flair_type', 'author_fullname', 'author_patreon_flair', 'author_premium', 'awarders',
'banned_at_utc', 'body', 'can_mod_post', 'collapsed', 'collapsed_because_crowd_control', 'collapsed_reason', 'created_utc', 'distinguished', 'edited', 'gildings', 'id', 
'is_submitter', 'link_id', 'locked', 'no_follow', 'parent_id', 'permalink', 'retrieved_on', 'score', 'send_replies', 'stickied', 'subreddit', 'subreddit_id', 'top_awarded_type', 
'total_awards_received', 'treatment_tags'])
   
```

In [246]:
data = []
for i, post_id in enumerate(post_ids):
    url = 'https://api.pushshift.io/reddit/comment/search?link_id=' + post_id # TODO: Collect more than 25 comments per post
    response = requests.get(url, headers={'User-Agent': user_agent})
    comments_json = response.json()['data']
    comment_bodies = [comment['body'] for comment in comments_json]
    entry = [post_id, post_titles[i], comment_bodies]
    data.append(entry)

In [251]:
data_df = pd.DataFrame(data, columns=['id', 'title', 'comments'])
data_df.to_csv('rkpop-data.csv',index=False)

Identify male vs female groups

In [201]:

m_f_mapping = {'male': {'EXO', 'NCT', 'BTS', 'Stray Kids', 'G-Dragon', 'Big Bang', 
                        'AB6IX', 'Golden Child', 'SEVENTEEN', 'Top Secret', 'TST', 
                        'ONEUS', 'TVXQ', 'PENTAGON', 'THE BOYZ', 'VERIVERY', 'Ravi', 'WayV', 'VIXX'},
               'female': {'GFriend', "Girl's Day", 'Red Velvet', 'AOA', 'BLACKPINK', 
               'Momoland', 'miss A', 'MAMAMOO', 'ITZY', 'Sunmi', 'Weeekly', 'NiziU', 
               'NATTY', 'Twice', 'LOONA', 'After School', 'IU', 'IZ*ONE', 'WJSN', 
               'Cosmic Girls', 'DIA', 'CHUNGHA'}
}
m_f_mapping['male'] = {g.lower() for g in m_f_mapping['male']}
m_f_mapping['female'] = {g.lower() for g in m_f_mapping['female']}

Tag submissions with male or female

In [264]:
# TODO: Count a submission as 'male' or 'female' only if it has one gender present?
data_df['male'] = data_df.title.apply(lambda t: any(group in t.lower() for group in m_f_mapping['male']))
data_df['female'] = data_df.title.apply(lambda t: any(group in t.lower() for group in m_f_mapping['female']))

In [269]:
# Checking if any overlapping...
data_df[data_df['male'] & data_df['female']]

Unnamed: 0,id,title,comments,male,female
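
The tagging above uses a plain substring test, so a short name such as `iu` would also match inside an unrelated word like "genius". The TODO also asks whether a submission should count only when a single gender is present. The following is a rough sketch of both ideas. `mentions_any` and `tag_title` are hypothetical helpers, not part of the notebook.

```python
import re


def mentions_any(title, groups):
    """True if any group name appears in the title as a whole word or phrase."""
    lowered = title.lower()
    for group in groups:
        # Lookarounds rather than \b, so names containing punctuation
        # (for example 'iz*one') are still matched as a unit.
        pattern = r'(?<!\w)' + re.escape(group) + r'(?!\w)'
        if re.search(pattern, lowered):
            return True
    return False


def tag_title(title, male_groups, female_groups):
    """Return 'male', 'female', 'mixed', or None for one submission title."""
    is_male = mentions_any(title, male_groups)
    is_female = mentions_any(title, female_groups)
    if is_male and is_female:
        return 'mixed'
    if is_male:
        return 'male'
    if is_female:
        return 'female'
    return None
```

With the lowercased sets in `m_f_mapping`, this could be applied as `data_df.title.apply(lambda t: tag_title(t, m_f_mapping['male'], m_f_mapping['female']))`, and rows tagged `'mixed'` could then be dropped or analysed separately.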


Clean comment text and prepare for analysis

[How to strip punctuation from a string](https://stackoverflow.com/questions/265960/best-way-to-strip-punctuation-from-a-string)

`s.translate(str.maketrans('', '', string.punctuation))`

[`maketrans` documentation](https://docs.python.org/3.3/library/stdtypes.html?highlight=maketrans#str.maketrans)

[Removing URLs from a string](https://stackoverflow.com/questions/11331982/how-to-remove-any-url-within-a-string-in-python)
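
As a small worked example of the two linked techniques combined, the snippet below removes URLs and then maps punctuation to spaces while keeping apostrophes. The sample comment and the `example.com` URL are made up for illustration.

```python
import re
import string

comment = "Check https://example.com/video NOW!!! It's great, isn't it?"

# Drop URLs first, so their punctuation is not split into stray tokens.
no_urls = re.sub(r'https?://\S+', '', comment)

# Map every punctuation mark except the apostrophe to a space (keeps "it's" intact).
punctuation = string.punctuation.replace("'", '')
table = str.maketrans(punctuation, ' ' * len(punctuation))
cleaned = no_urls.lower().translate(table)

tokens = cleaned.split()
print(tokens)  # ['check', 'now', "it's", 'great', "isn't", 'it']
```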

Helper functions

In [272]:
def giant_cleaned_string(series_of_list_of_comments):
    """Return string from Pandas Series of lists of strings.
    
    Combines multiple pandas rows with lists of strings into one giant string with URLs and punctuation removed.
    """
    comment_string = ' '.join(series_of_list_of_comments.apply(lambda x: ' '.join(x)))
    comment_string = re.sub(r'https?://\S+', '', comment_string)  # raw string avoids the invalid '\S' escape

    # All ASCII punctuation except the apostrophe, plus curly double quotes and newlines.
    chars_to_replace = string.punctuation.replace("'", '') + '“”\n'
    whitespace_to_replace_with = len(chars_to_replace) * ' '

    comment_string = comment_string.lower().translate(str.maketrans(chars_to_replace, whitespace_to_replace_with))
    return comment_string

def acceptable_token(token):
    """ Return True if token is longer than one character and is not present in ENGLISH_STOPWORDS
    """
    return (len(token) > 1 and token not in ENGLISH_STOPWORDS)

def tokenize(giant_comment_string):
    """ Return list of word tokens from given string.
    """
    tokens = giant_comment_string.split(' ')
    return list(filter(acceptable_token, tokens))

def create_counter_object(giant_comment_string):
    """ Return Counter with word counters for given string.
    """
    word_counter = Counter(tokenize(giant_comment_string))
    return word_counter

def top_adjectives(giant_comment_string, num_of_words=10):
    """ Return list with most common adjectives in given string.
    """

    def find_adjectives(word_pos_tuple):
        # 'JJ' is the Penn Treebank tag for base adjectives;
        # comparatives (JJR) and superlatives (JJS) are not counted.
        return word_pos_tuple[1] == 'JJ'

    comment_words_POS = nltk.pos_tag(tokenize(giant_comment_string))
    comment_adj_counter = Counter([adj[0] for adj in list(filter(find_adjectives, comment_words_POS))])
    return comment_adj_counter.most_common(num_of_words)

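# TODO: Determine association metric to use
# http://www.nltk.org/_modules/nltk/metrics/association.html
# Note: relies on BigramCollocationFinder, TrigramCollocationFinder, bigram_measures and
# trigram_measures, which are only defined in later cells (In [293] and In [317]).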
def top_ngrams(giant_comment_string, num_of_words=15, ngram=2):
    """ Return list with most frequently appearing n-grams in given string.
    """

    if ngram == 2:
        finder = BigramCollocationFinder.from_words(tokenize(giant_comment_string))
        return finder.nbest(bigram_measures.likelihood_ratio, num_of_words)
    elif ngram == 3:
        finder = TrigramCollocationFinder.from_words(tokenize(giant_comment_string))
        return finder.nbest(trigram_measures.likelihood_ratio, num_of_words)
    else:
        raise ValueError("Only bigrams (ngram=2) and trigrams (ngram=3) are supported.")

In [273]:
male_giant_comment_string = giant_cleaned_string(data_df[data_df['male']]['comments'])
female_giant_comment_string = giant_cleaned_string(data_df[data_df['female']]['comments'])

In [274]:
female_word_counter = create_counter_object(female_giant_comment_string)
male_word_counter = create_counter_object(male_giant_comment_string)

In [275]:
# TODO: Log-Odds Ratio of Words
# len(male_giant_comment_string.split(' ')) # 6718 
# len(female_giant_comment_string.split(' ')) # 14882
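
The token counts in the comments above differ by more than a factor of two, so raw word counts are not directly comparable between the two corpora. Below is a minimal sketch of the log-odds ratio named in the TODO, with simple add-alpha smoothing. `log_odds_ratio` is a hypothetical helper and `alpha=0.5` is an assumed default, not a value taken from the notebook. More elaborate variants, such as ones using an informative prior, are not shown here.

```python
import math
from collections import Counter


def log_odds_ratio(counter_a, counter_b, alpha=0.5):
    """Return {word: log-odds of the word in corpus A relative to corpus B}.

    alpha is an add-alpha smoothing constant (an assumption) so that words
    missing from one corpus do not produce log(0).
    """
    vocab = set(counter_a) | set(counter_b)
    if len(vocab) < 2:
        # With fewer than two distinct words the odds are undefined.
        return {word: 0.0 for word in vocab}
    total_a = sum(counter_a.values()) + alpha * len(vocab)
    total_b = sum(counter_b.values()) + alpha * len(vocab)
    scores = {}
    for word in vocab:
        count_a = counter_a[word] + alpha
        count_b = counter_b[word] + alpha
        odds_a = count_a / (total_a - count_a)
        odds_b = count_b / (total_b - count_b)
        scores[word] = math.log(odds_a) - math.log(odds_b)
    return scores
```

Called as `log_odds_ratio(female_word_counter, male_word_counter)`, positive scores would point to words relatively more frequent in comments on female groups and negative scores to words more frequent for male groups. Sorting the result by score gives the most distinctive words at either end.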

In [276]:
male_top_50 = male_word_counter.most_common(50)
female_top_50 = female_word_counter.most_common(50)

What adjectives are used? Verbs? 

[Categorizing and Tagging Words](https://www.nltk.org/book/ch05.html)

[collocations](https://www.nltk.org/howto/collocations.html)
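
The TODO in `top_ngrams` leaves the choice of association metric open. As one point of comparison, here is a dependency-free sketch of pointwise mutual information (PMI) for bigrams with a minimum-count filter. NLTK offers the same measure as `BigramAssocMeasures.pmi` and a comparable filter as `apply_freq_filter` on its collocation finders. `bigram_pmi` and its `min_count` default are assumptions made for illustration.

```python
import math
from collections import Counter


def bigram_pmi(tokens, min_count=2):
    """Score adjacent word pairs by pointwise mutual information (PMI).

    Pairs seen fewer than min_count times are skipped, because PMI
    tends to overrate pairs built from rare words.
    """
    unigram_counts = Counter(tokens)
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    n_tokens = len(tokens)
    n_bigrams = max(n_tokens - 1, 1)
    scores = {}
    for (first, second), count in bigram_counts.items():
        if count < min_count:
            continue
        p_pair = count / n_bigrams
        p_first = unigram_counts[first] / n_tokens
        p_second = unigram_counts[second] / n_tokens
        scores[(first, second)] = math.log2(p_pair / (p_first * p_second))
    # Highest-scoring pairs first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

Running it with `min_count=1` and then with a higher threshold on the same token list shows how strongly one-off pairs can dominate a PMI ranking, which is worth keeping in mind when choosing between PMI and the likelihood ratio.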

Most common ngrams

In [293]:
from nltk.collocations import BigramCollocationFinder, TrigramCollocationFinder

In [317]:
bigram_measures = nltk.collocations.BigramAssocMeasures()
trigram_measures = nltk.collocations.TrigramAssocMeasures()



In [336]:
top_ngrams(female_giant_comment_string, num_of_words=50, ngram=3)

[('city', 'pop', 'real'),
 ('city', 'pop', 'icing'),
 ('city', 'pop', 'influence'),
 ('city', 'pop', 'permit'),
 ('sucker', 'city', 'pop'),
 ('city', 'pop', 'catch'),
 ('city', 'pop', 'lately'),
 ('city', 'pop', 'term'),
 ('considered', 'city', 'pop'),
 ('fall', 'city', 'pop'),
 ('siren', 'city', 'pop'),
 ('sounding', 'city', 'pop'),
 ('term', 'city', 'pop'),
 ('ton', 'city', 'pop'),
 ('defend', 'city', 'pop'),
 ('game', 'city', 'pop'),
 ('using', 'city', 'pop'),
 ('red', 'velvet', 'leaders'),
 ('😄😆', 'red', 'velvet'),
 ('example', 'city', 'pop'),
 ('hear', 'city', 'pop'),
 ('quite', 'city', 'pop'),
 ('call', 'city', 'pop'),
 ('city', 'pop', 'like'),
 ('red', 'velvet', 'listener'),
 ('need', 'city', 'pop'),
 ('city', 'pop', 'excited'),
 ('casual', 'red', 'velvet'),
 ('perhaps', 'red', 'velvet'),
 ('red', 'velvet', 'oldest'),
 ('irene', 'red', 'velvet'),
 ('city', 'pop', 'well'),
 ('city', 'pop', 'even'),
 ('expecting', 'red', 'velvet'),
 ('red', 'velvet', 'promote'),
 ('red', 'velvet',

In [338]:
top_ngrams(male_giant_comment_string, num_of_words=50, ngram=3)

[('defending', 'stray', 'kids'),
 ('discussion', 'stray', 'kids'),
 ('familiar', 'stray', 'kids'),
 ('perception', 'stray', 'kids'),
 ('stray', 'kids', 'crackhead'),
 ('stray', 'kids', 'draws'),
 ('stray', 'kids', 'objectively'),
 ('stray', 'kids', 'touring'),
 ('stray', 'kids', 'specifically'),
 ('opinion', 'stray', 'kids'),
 ('stray', 'kids', 'called'),
 ('stray', 'kids', 'marketed'),
 ('blm', 'stray', 'kids'),
 ('point', 'stray', 'kids'),
 ('stray', 'kids', 'group'),
 ('groups', 'stray', 'kids'),
 ('love', 'stray', 'kids'),
 ('stray', 'kids', 'would'),
 ('culture', 'stray', 'kids'),
 ('stray', 'kids', "i'm"),
 ('hip', 'hop', 'rap'),
 ('features', 'hip', 'hop'),
 ('hoping', 'hip', 'hop'),
 ('consider', 'hip', 'hop'),
 ('hip', 'hop', 'banger'),
 ('american', 'hip', 'hop'),
 ('hip', 'hop', 'pop'),
 ('also', 'hip', 'hop'),
 ('find', 'new', 'home'),
 ('pretty', 'much', 'contained'),
 ('almost', 'pretty', 'much'),
 ('bans', 'depending', 'severity'),
 ('concert', 'entails', 'proper'),
 ('e

Most common adjectives

In [335]:
top_adjectives(female_giant_comment_string, num_of_words=50)

[('good', 46),
 ('much', 39),
 ('new', 37),
 ('different', 35),
 ('korean', 32),
 ('many', 25),
 ("i'm", 24),
 ('bad', 23),
 ('japanese', 19),
 ('similar', 18),
 ('english', 18),
 ('first', 17),
 ('last', 17),
 ('happy', 16),
 ('great', 16),
 ('big', 16),
 ('lol', 15),
 ('it’s', 15),
 ('right', 15),
 ('sure', 15),
 ('full', 14),
 ('favorite', 13),
 ('single', 13),
 ('whole', 12),
 ('weird', 12),
 ('little', 12),
 ('wrong', 11),
 ('amazing', 11),
 ('real', 11),
 ('american', 11),
 ('popular', 11),
 ('long', 11),
 ('high', 11),
 ('red', 10),
 ('international', 10),
 ('sad', 10),
 ('top', 10),
 ('ready', 10),
 ('cute', 9),
 ('hard', 9),
 ('i’m', 9),
 ('mean', 9),
 ('main', 8),
 ('original', 8),
 ('give', 8),
 ('western', 8),
 ('song', 8),
 ('stupid', 7),
 ('aware', 7),
 ("that's", 7)]

In [334]:
top_adjectives(male_giant_comment_string, num_of_words=50)

[('much', 26),
 ('black', 22),
 ('happy', 19),
 ('new', 16),
 ("i'm", 15),
 ('western', 13),
 ('different', 12),
 ('sure', 12),
 ('american', 12),
 ('korean', 12),
 ('good', 11),
 ('last', 11),
 ('big', 11),
 ('old', 11),
 ('many', 10),
 ('first', 10),
 ('great', 10),
 ('right', 9),
 ("that's", 9),
 ('little', 8),
 ('open', 8),
 ('wrong', 7),
 ('whole', 6),
 ('long', 6),
 ('it’s', 6),
 ('cultural', 6),
 ('live', 6),
 ('sm', 6),
 ('asian', 6),
 ('clear', 6),
 ('bad', 6),
 ('i’m', 5),
 ('hard', 5),
 ('specific', 5),
 ('bts', 5),
 ('shit', 5),
 ('likely', 5),
 ('exo', 4),
 ('next', 4),
 ('proud', 4),
 ('iconic', 4),
 ('amazing', 4),
 ('true', 4),
 ("can't", 4),
 ('nct', 4),
 ('anniversary', 4),
 ('thank', 4),
 ('nice', 4),
 ('huge', 4),
 ('fair', 4)]