## Planning / Scratch Work

Do /r/kpop commenters talk differently about male vs. female groups?

Initial exploration of this question:
- Identify submissions on 2 all-male groups, 2 all-female groups
- Collect their comments
- Contrast comments in general to "typical" reddit language (using /r/funny as a standard)
- Contrast comments on male group vs female group 

In [81]:
# TODO: Maybe separate analysis for emoji

Using Pushshift to get reddit comments

See [Pushshift's GitHub API README](https://github.com/pushshift/api)

> Search for the most recent comments mentioning the word "science" within the subreddit /r/askscience
>
> `https://api.pushshift.io/reddit/search/comment/?q=science&subreddit=askscience`

Retrieve all comment ids for a submission object

`https://api.pushshift.io/reddit/submission/comment_ids/{base36_submission_id}`
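A minimal sketch of how the two endpoints above could be assembled in Python. The helper names `search_comments_url` and `comment_ids_url` are hypothetical, and the snippet only builds URLs. Fetching them (for example with `requests.get(url).json()`) assumes the Pushshift service is reachable and still serves these routes.

```python
from urllib.parse import urlencode

PUSHSHIFT_BASE = "https://api.pushshift.io/reddit"

def search_comments_url(query, subreddit):
    """Build the comment-search URL for a query term within one subreddit."""
    params = urlencode({"q": query, "subreddit": subreddit})
    return f"{PUSHSHIFT_BASE}/search/comment/?{params}"

def comment_ids_url(submission_id):
    """Build the URL listing all comment ids for a base-36 submission id."""
    return f"{PUSHSHIFT_BASE}/submission/comment_ids/{submission_id}"

print(search_comments_url("science", "askscience"))
print(comment_ids_url("huclsf"))
```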

[New to Pushshift FAQ](https://www.reddit.com/r/pushshift/comments/bcxguf/new_to_pushshift_read_this_faq/)

[Pushshift Reddit API v4.0 Documentation](https://reddit-api.readthedocs.io/en/latest/#)

Related work (not a comprehensive list):
- "A Community of Curious Souls: An Analysis of Commenting Behavior on TED Talks Videos" (Tsou, Thelwall, Mongeon, and Sugimoto, 2014)
- "YouTube science channel video presenters and comments: female friendly or vestiges of sexism?" (Thelwall and Mas-Bleda, 2018)
- "Shirtless and dangerous: Quantifying linguistic signals of gender bias in an online fiction writing community." (Fast, Vachovsky, and Bernstein, 2016)
- "Using language models to quantify gender bias in sports journalism" (Fu, Danescu-Niculescu-Mizil, Lee, 2016)

## Data Collection

Import statements

In [102]:
import string
import re
import requests
import logging
import pickle
import json

from collections import Counter

from tqdm import tqdm

import pandas as pd

import nltk
from nltk.corpus import stopwords

ENGLISH_STOPWORDS = set(stopwords.words('english'))  # set for fast membership checks

## Pushshift Notes

What are the values that we can access for each submission?

```python
> response.json()['data'][1].keys()

> dict_keys(['all_awardings', 'allow_live_comments', 'author', 'author_flair_css_class', 'author_flair_richtext', 'author_flair_template_id', 'author_flair_text', 
'author_flair_text_color', 'author_flair_type', 'author_fullname', 'author_patreon_flair', 'author_premium', 'awarders', 'can_mod_post', 'contest_mode', 'created_utc', 'domain',
'full_link', 'gildings', 'id', 'is_crosspostable', 'is_meta', 'is_original_content', 'is_reddit_media_domain', 'is_robot_indexable', 'is_self', 'is_video',
'link_flair_background_color', 'link_flair_css_class', 'link_flair_richtext', 'link_flair_template_id', 'link_flair_text', 'link_flair_text_color', 'link_flair_type', 'locked',
'media_only', 'no_follow', 'num_comments', 'num_crossposts', 'over_18', 'parent_whitelist_status', 'permalink', 'pinned', 'pwls', 'retrieved_on', 'score', 'selftext',
'send_replies', 'spoiler', 'stickied', 'subreddit', 'subreddit_id', 'subreddit_subscribers', 'subreddit_type', 'thumbnail', 'title', 'total_awards_received', 'treatment_tags',
'upvote_ratio', 'url', 'url_overridden_by_dest', 'whitelist_status', 'wls'])
```

Note that `created_utc` is given as a Unix timestamp (seconds since the epoch):

```
> [post['created_utc'] for post in response.json()['data']]

>[1595657105,
 1595641997,
 1595632191,
 1595623051,
 1595602847,
 1595599200,
 1595583205,
 1595581926,...
```

This tells us that newer posts are given first (i.e. the order of posts in `response.json()['data']` is newest to oldest).
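A small sketch of how the ordering claim could be checked, and how a timestamp could be made human-readable. The helper names `to_utc` and `is_newest_first` are hypothetical; the values are the first few from the output above.

```python
from datetime import datetime, timezone

# First few `created_utc` values from the output above.
created = [1595657105, 1595641997, 1595632191, 1595623051]

def to_utc(ts):
    """Convert a Unix timestamp in seconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(ts, tz=timezone.utc)

def is_newest_first(timestamps):
    """Return True if timestamps are in non-increasing order."""
    return all(a >= b for a, b in zip(timestamps, timestamps[1:]))

print(to_utc(created[0]).isoformat())
print(is_newest_first(created))
```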

What are the values that we can access for each comment?

```python
comments_json[0].keys()

dict_keys(['all_awardings', 'approved_at_utc', 'associated_award', 'author', 'author_flair_background_color', 'author_flair_css_class', 'author_flair_richtext',
'author_flair_template_id', 'author_flair_text', 'author_flair_text_color', 'author_flair_type', 'author_fullname', 'author_patreon_flair', 'author_premium', 'awarders',
'banned_at_utc', 'body', 'can_mod_post', 'collapsed', 'collapsed_because_crowd_control', 'collapsed_reason', 'created_utc', 'distinguished', 'edited', 'gildings', 'id', 
'is_submitter', 'link_id', 'locked', 'no_follow', 'parent_id', 'permalink', 'retrieved_on', 'score', 'send_replies', 'stickied', 'subreddit', 'subreddit_id', 'top_awarded_type', 
'total_awards_received', 'treatment_tags'])
   
```


In [64]:
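# Collect 1000 posts
posts = []
oldest_post_id = None
while len(posts) < 1000:
    if oldest_post_id is None:
        posts.extend(collect_posts())
    else:
        posts.extend(collect_posts(oldest_post_id))
    # Posts arrive newest first, so the last one collected is the oldest.
    # Assumes collect_posts(oldest_post_id) continues from that post.
    oldest_post_id = posts[-1]['id']

save_data(posts, 'data/rkpop-1000-posts.pkl')
# test_load = load_data('data/rkpop-1000-posts.pkl')
# assert test_load == posts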

In [1]:
from multiprocessing import Pool
import multiprocessing

num_cpu = multiprocessing.cpu_count()

In [13]:
from data_collection_utils import collect_comment, load_data, save_data

In [5]:
posts = load_data('data/rkpop-1000-posts.pkl')

In [10]:
with Pool(10) as p:
    comments = p.map(collect_comment, posts)

In [14]:
save_data(comments, 'data/rkpop-1000-comments.pkl')

In [15]:
import glob

In [25]:
# for comment_pkl in glob.glob('data/comments/*'):
#     print(len(load_data(comment_pkl))) 
# ??? Some sort of collection error... fewer than 50 for many

In [56]:
comments = {}
for filename in glob.glob('data/comments/*'):
    start = filename.rindex('/') + 1
    end = filename.rindex('-')
    post_id = filename[start:end]
    comments[post_id] = load_data(filename)

In [57]:
save_data(comments, 'data/rkpop-3000-comments.pkl')

In [44]:
comments.keys()

dict_keys(['huclsf', 'htuc60', 'hvskyu', 'hucjmo', 'hu9rcu', 'huzmb5', 'hwtrm6', 'hrp506', 'hv2356', 'ho4kpi', 'hu3597', 'huu5n2', 'huvuwp', 'hu7vo3', 'hvtdw0', 'hv1ebo', 'huzsvw', 'hqyxzz', 'hrc8b0', 'hrljep', 'hub1jr', 'hvj3uw', 'hvka7v', 'hvj615', 'hudp5x', 'htybex', 'hx29od', 'hw0msh', 'huxala', 'hudowa', 'hrp4jz', 'hv8u2z', 'hxc2tq', 'hv422q', 'hxhvbk', 'hudxp1', 'hvvai2', 'htpot9', 'hvvasx', 'hwyk30', 'hu5qlw', 'hv41za', 'humrj9', 'hx3a8s', 'hvrr9j', 'hvqe4a', 'hub3bl', 'hvqpny', 'hui3wc', 'htvhtn', 'hvvaiq', 'hvs5o0', 'huoafl', 'hudqes', 'hx9kq3', 'hv7a2v', 'hrfis5', 'hubz5p', 'hxekwy', 'hv1n7f', 'husgf7', 'humrtx', 'hujyqp', 'hwg5tm', 'hu0d49', 'hvt82o', 'htujl0', 'hr34w6', 'hv4vrd'])

In [64]:
post_ids = [p['id'] for p in posts]
post_titles = [p['title'] for p in posts]

post_ids_titles_dict = dict(zip(post_ids, post_titles))

In [59]:
comments['huclsf'][:5]

["BTS ISN'T THAT GOOD, THEY LOOK GAY",
 "I'm really just venting/ranting because people were responsible for the safety of that stage, but failed to do their job. This could be career-altering and life-changing. They ruined this year for her and potentially limited her career.\n\nLike I said, it's good that she's recovering well. We all knew from the beginning that she would recover. It's the uncertainty after recovery that I'm worried about. It just might be too troublesome continue her career after because injuries can have lasting effects on the human body long after recovery. No young person should have to deal with that.",
 "Can I just clarify that my original post meant that they got away with allowing for a dangerous set up in the first place? The injury should never have occurred and i find it really bizarre that defending SBS seems to be the hill that you're wanting to die on. I'm just expressing my frustrations that such shitty working environments are created for idols in th

In [60]:
comments['htuc60'][:5]

['definitely, i was so sad to see them not make it far :(',
 "and the best group on that show but we're not ready for that conversation yet",
 'Oh yeah, Bomin is definitely helping their group get recognized but I think a good chunk of people don’t even know he’s an idol because he’s so good at acting!',
 "Yess, my boys deserve even more success! I've been following them since their Let Me comeback and there isn't really a song I didn't like from them (but Crush is still my favorite)!\nGo GolCha!",
 "They really worked so hard for this promotion I'm so so happy to see this, they've suffered so much through 2019 because of the hiatus and then seeing them crying on that circus show made my heart break again but I'm happy that they've left those memories behind and seem happier these days. Realise how golcha has never mentioned RTK at all since leaving that shit lmao"]

## Loading from saved CSV

In [6]:
data_df = pd.read_csv('rkpop-data.csv')
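If the `comments` column was written to CSV from Python lists, `pd.read_csv` returns each cell as the list's string form rather than as a list. A minimal sketch, under that assumption, of recovering the list with the standard library's `ast.literal_eval`. The `cell` value is an invented example, not taken from the dataset.

```python
import ast

# Invented example of how one `comments` cell might look after a CSV round trip.
cell = "['first comment', 'second line\\nof text']"

comments = ast.literal_eval(cell)  # parses Python literals only, never runs code
print(type(comments).__name__, len(comments))
print(comments[1])
```

Applied to the whole column this would look like `data_df['comments'].apply(ast.literal_eval)`. The cleaning helper below calls `x.split()` on each row, so it works on the raw string form; parsing first would turn escaped sequences such as `\n` back into real newlines before cleaning.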

Identify male vs female groups

In [7]:

m_f_mapping = {'male': {'EXO', 'NCT', 'BTS', 'Stray Kids', 'G-Dragon', 'Big Bang', 
                        'AB6IX', 'Golden Child', 'SEVENTEEN', 'Top Secret', 'TST', 
                        'ONEUS', 'TVXQ', 'PENTAGON', 'THE BOYZ', 'VERIVERY', 'Ravi', 'WayV', 'VIXX'},
               'female': {'GFriend', "Girl's Day", 'Red Velvet', 'AOA', 'BLACKPINK', 
               'Momoland', 'miss A', 'MAMAMOO', 'ITZY', 'Sunmi', 'Weeekly', 'NiziU', 
               'NATTY', 'Twice', 'LOONA', 'After School', 'IU', 'IZ*ONE', 'WJSN', 
               'Cosmic Girls', 'DIA', 'CHUNGHA'}
}
m_f_mapping['male'] = {g.lower() for g in m_f_mapping['male']}
m_f_mapping['female'] = {g.lower() for g in m_f_mapping['female']}

Tag submissions with male or female

In [8]:
# TODO: Count a submission as 'male' or 'female' only if it has one gender present?
data_df['male'] = data_df.title.apply(lambda t: any(group in t.lower() for group in m_f_mapping['male']))
data_df['female'] = data_df.title.apply(lambda t: any(group in t.lower() for group in m_f_mapping['female']))
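Plain substring matching can over-match short names: `'dia'` is contained in `'media'` and `'iu'` in `'premium'`. A hedged sketch of a stricter alternative that only accepts a group name when it is not surrounded by other word characters. `mentions_group` is a hypothetical helper and the titles are invented.

```python
import re

def mentions_group(title, groups):
    """Return True if any group name appears in title as a whole word or phrase."""
    lowered = title.lower()
    return any(
        re.search(r'(?<!\w)' + re.escape(group) + r'(?!\w)', lowered)
        for group in groups
    )

female_groups = {'iu', 'dia', 'iz*one'}  # small illustrative subset, lowercased

print(mentions_group('Social media reacts to new premium album', female_groups))
print(mentions_group('IU releases new single', female_groups))
print(mentions_group('IZ*ONE comeback teaser', female_groups))
```

`re.escape` matters here because names such as `iz*one` contain regex metacharacters.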

In [66]:
for key in comments.keys():
    print(post_ids_titles_dict[key])

Red Velvet Seulgi shares that Wendy is currently focusing on rehabilitation exercises and singing practice. Wendy often gets in contact with the members and is the first to cheer them on
Golden Child's 4th mini album 'Take A Leap' has surpassed 50,000 sales on Hanteo chart. It's their first album to do so.
MBC M Show Champion Performances (July 22, 2020) - MISTER T, JEONG SEWOON, TOO, GFRIEND, D.COY, GreatGuys, YUKIKA, MustB, Han Gabin, Kang Sori, +more
Oh My Girl’s Arin and TXT’s Soobin to become the new MCs for KBS Music Bank
SSAK3's "다시 여기 바닷가 (Summer Sea Again / Beach Again)" earns Perfect All-Kill
Red Velvet Irene - 놀이 (Naughty) + Diamond + Feel Good + Jelly (IRENE &amp; SEULGI Episode 2: Irene Solo Performance Video)
MAMAMOO to feature in Rain’s upcoming solo song for SSAK3.
Girls' Generation (SNSD) Hyoyeon (HYO) - 4th Single: DESSERT (feat. Loopy, (G)I-DLE Soyeon) (Teaser Image 2)
TWICE - Beyond LIVE - TWICE : World in A Day (Online Concert Poster)
GFriend - Apple (MV Teaser 1)


In [9]:
# Check whether any submission is tagged as both male and female
data_df[data_df['male'] & data_df['female']]

| Unnamed: 0 | id | title | comments | male | female |
|---|---|---|---|---|---|
| 27 | hegnwo | TWICE, IZ*ONE, (G)I-DLE, SEVENTEEN, NCT 127, T... | ['Seventeen and Izone collab stage to consolid... | True | True |


In [10]:
# TODO: Remove overlapping
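One way this TODO might be handled is to keep only rows tagged with exactly one gender. A sketch on a toy frame with the same boolean columns as `data_df`. The frame and its titles are invented, and whether untagged rows should also be dropped is an assumption.

```python
import pandas as pd

# Toy frame with the same boolean columns as `data_df`; titles are invented.
toy_df = pd.DataFrame({
    'title': ['boy group post', 'girl group post', 'mixed lineup post', 'unrelated post'],
    'male': [True, False, True, False],
    'female': [False, True, True, False],
})

# XOR keeps rows tagged with exactly one gender, dropping overlaps and untagged rows.
single_gender_df = toy_df[toy_df['male'] ^ toy_df['female']]
print(single_gender_df['title'].tolist())
```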

Clean comment text and prepare for analysis

[How to strip punctuation from a string](https://stackoverflow.com/questions/265960/best-way-to-strip-punctuation-from-a-string)

`s.translate(str.maketrans('', '', string.punctuation))`

[`maketrans` documentation](https://docs.python.org/3.3/library/stdtypes.html?highlight=maketrans#str.maketrans)

[Removing URLs from a string](https://stackoverflow.com/questions/11331982/how-to-remove-any-url-within-a-string-in-python)
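A short sketch combining the two linked techniques. `strip_urls_and_punctuation` is a hypothetical name and the input sentence is invented. Unlike the helper functions below, this version deletes punctuation outright (including apostrophes) rather than replacing it with spaces.

```python
import re
import string

def strip_urls_and_punctuation(text):
    """Remove URLs, then delete all ASCII punctuation characters."""
    without_urls = re.sub(r'https?://\S+', '', text)
    return without_urls.translate(str.maketrans('', '', string.punctuation))

cleaned = strip_urls_and_punctuation("Wow!!! See https://example.com/a?b=1 (so good).")
print(cleaned.split())
```

URLs are removed first because stripping punctuation beforehand would break the `://` that the pattern relies on.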

Helper functions

In [23]:
def giant_cleaned_string(series_of_list_of_comments):
    """Return one cleaned string built from a pandas Series of comment text.

    Each row must be a string (x.split() is called on it). Rows are
    whitespace-normalized and joined into one giant string, then URLs are
    removed, the text is lowercased, and punctuation (except the apostrophe)
    is replaced with spaces.
    """
    comment_string = ' '.join(series_of_list_of_comments.apply(lambda x: ' '.join(x.split())))
    # Raw string so that \S is passed to the regex engine unchanged
    comment_string = re.sub(r'https?://\S+', '', comment_string)

    chars_to_replace = string.punctuation[:6]+string.punctuation[7:]+'“”\n' # Don't remove single quotation mark
    whitespace_to_replace_with = len(chars_to_replace) * ' '

    comment_string = comment_string.lower().translate(str.maketrans(chars_to_replace, whitespace_to_replace_with))
    return comment_string

def acceptable_token(token):
    """ Return True if token is longer than one character and is not present in ENGLISH_STOPWORDS
    """
    return (len(token) > 1 and token not in ENGLISH_STOPWORDS)

def tokenize(giant_comment_string):
    """ Return list of word tokens from given string.
    """
    tokens = giant_comment_string.split(' ')
    return list(filter(acceptable_token, tokens))

def create_counter_object(giant_comment_string):
    """ Return Counter with word counts for given string.
    """
    word_counter = Counter(tokenize(giant_comment_string))
    return word_counter

def top_adjectives(giant_comment_string, num_of_words=10):
    """ Return list with most common adjectives in given string.
    """

    def find_adjectives(list_of_word_pos_tuple):
        return list_of_word_pos_tuple[1] == 'JJ'

    comment_words_POS = nltk.pos_tag(tokenize(giant_comment_string))
    comment_adj_counter = Counter([adj[0] for adj in list(filter(find_adjectives, comment_words_POS))])
    return comment_adj_counter.most_common(num_of_words)

# TODO: Determine association metric to use
# http://www.nltk.org/_modules/nltk/metrics/association.html
def top_ngrams(giant_comment_string, num_of_words=15, ngram=2):
    """ Return list with the top-scoring n-grams (ranked by likelihood ratio) in given string.
    """

    if ngram == 2:
        finder = BigramCollocationFinder.from_words(tokenize(giant_comment_string))
        return finder.nbest(bigram_measures.likelihood_ratio, num_of_words)
    elif ngram == 3:
        finder = TrigramCollocationFinder.from_words(tokenize(giant_comment_string))
        return finder.nbest(trigram_measures.likelihood_ratio, num_of_words)
    else:
        # Raise instead of returning an error string that callers could mistake for a result
        raise ValueError("Only bi- and trigrams supported.")

In [25]:
male_giant_comment_string = giant_cleaned_string(data_df[data_df['male']]['comments'])
female_giant_comment_string = giant_cleaned_string(data_df[data_df['female']]['comments'])

In [27]:
female_word_counter = create_counter_object(female_giant_comment_string)
male_word_counter = create_counter_object(male_giant_comment_string)

In [35]:
sum(female_word_counter.values())

23918

In [34]:
sum(male_word_counter.values())

13298

In [14]:
# TODO: Log-Odds Ratio of Words
# len(male_giant_comment_string.split(' ')) # 6718 
# len(female_giant_comment_string.split(' ')) # 14882
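A rough sketch of what the log-odds TODO could look like, using add-one smoothing over the joint vocabulary so that words missing from one corpus do not produce a division by zero. The function name, the smoothing choice, and the toy counters are all assumptions; the toy counters stand in for `male_word_counter` and `female_word_counter`.

```python
import math
from collections import Counter

def log_odds_ratio(word, counts_a, counts_b, smoothing=1.0):
    """Smoothed log-odds ratio of `word` in corpus A relative to corpus B."""
    vocab_size = len(set(counts_a) | set(counts_b))
    total_a = sum(counts_a.values()) + smoothing * vocab_size
    total_b = sum(counts_b.values()) + smoothing * vocab_size
    p_a = (counts_a[word] + smoothing) / total_a
    p_b = (counts_b[word] + smoothing) / total_b
    return math.log(p_a / (1 - p_a)) - math.log(p_b / (1 - p_b))

# Toy counters standing in for male_word_counter and female_word_counter.
toy_male = Counter({'song': 4, 'handsome': 3, 'album': 3})
toy_female = Counter({'song': 4, 'cute': 3, 'album': 3})

for w in ('handsome', 'song', 'cute'):
    print(w, round(log_odds_ratio(w, toy_male, toy_female), 3))
```

Positive values indicate a word relatively more frequent in the first corpus, negative values the second. Because the two corpora differ in size (the female token total is nearly double the male one), a ratio of this kind is more comparable than raw counts.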

In [28]:
male_top_50 = male_word_counter.most_common(50)
female_top_50 = female_word_counter.most_common(50)

In [31]:
print(male_top_50)
print()
print(female_top_50)

[('like', 157), ('really', 98), ('one', 84), ('song', 76), ('people', 74), ("i'm", 73), ('think', 71), ('would', 68), ('love', 66), ('time', 65), ('much', 58), ('get', 56), ('even', 56), ('know', 52), ('see', 49), ('also', 49), ("'i", 49), ('kpop', 48), ('songs', 47), ('good', 46), ('still', 46), ('album', 46), ('group', 46), ('fans', 45), ('well', 42), ('first', 42), ('groups', 42), ('going', 36), ('years', 36), ('since', 36), ('lot', 36), ('go', 35), ('gt', 34), ('new', 34), ('culture', 34), ('actually', 33), ('way', 32), ('ni', 32), ('music', 32), ('make', 32), ('could', 31), ('back', 31), ('bts', 31), ('sm', 31), ("that's", 30), ('pretty', 30), ('feel', 29), ('never', 28), ('sure', 28), ('though', 28)]

[('like', 295), ('really', 185), ('one', 135), ('song', 130), ('think', 127), ('even', 122), ("'i", 121), ("i'm", 117), ('people', 112), ('still', 110), ('also', 108), ('kpop', 105), ('group', 104), ('know', 100), ('good', 97), ('would', 94), ('songs', 93), ('much', 91), ('see', 88)

In [45]:
unique_male_words = set(male_word_counter.keys()) - set(female_word_counter.keys())
unique_female_words = set(female_word_counter.keys()) - set(male_word_counter.keys())

In [54]:
unique_male_word_counter = Counter()
unique_female_word_counter = Counter()

for word in unique_male_words:
    unique_male_word_counter[word] = male_word_counter[word] 

for word in unique_female_words:
    unique_female_word_counter[word] = female_word_counter[word] 

In [68]:
unique_male_words_str = ' '.join(unique_male_words)

In [75]:
unique_male_words_str = ' '.join(unique_male_words)
unique_female_words_str = ' '.join(unique_female_words)
unique_male_adj_tuples = list(filter(lambda x: x[1]=='JJ', nltk.pos_tag(tokenize(unique_male_words_str)))) # filter for words uniquely used toward male groups/people AND are adjectives
unique_female_adj_tuples = list(filter(lambda x: x[1]=='JJ', nltk.pos_tag(tokenize(unique_female_words_str)))) # filter for words uniquely used toward female groups/people AND are adjectives
unique_male_adj = [tup[0] for tup in unique_male_adj_tuples]
unique_female_adj = [tup[0] for tup in unique_female_adj_tuples]

In [77]:
unique_male_adj_counter = Counter()
unique_female_adj_counter = Counter()

for word in unique_male_adj:
    unique_male_adj_counter[word] = male_word_counter[word] 

for word in unique_female_adj:
    unique_female_adj_counter[word] = female_word_counter[word] 

In [78]:
print(unique_male_adj_counter.most_common(100))

[('rtk', 12), ('vixx', 11), ('golcha', 7), ('superm', 6), ('sc', 6), ("here's", 6), ('dog', 6), ('yohan', 5), ('discussed', 4), ('unbreakable', 4), ('nhappy', 4), ('animal', 3), ('double', 3), ('handsome', 3), ('unusual', 3), ('nokay', 3), ('busy', 3), ('excellent', 3), ('y’all', 3), ('skz', 3), ('youthful', 3), ('profile', 3), ('raw', 3), ('pet', 3), ('tempo', 3), ('corden', 3), ('electric', 3), ('normalize', 3), ('monotree', 3), ('ode', 3), ('grand', 3), ('people’s', 3), ('equal', 3), ('website', 3), ('tst', 3), ('jaejoong', 3), ('obv', 2), ('umpah', 2), ('subject', 2), ('superhuman', 2), ('sixth', 2), ('cpop', 2), ('subs', 2), ('army’s', 2), ('narrative', 2), ('jellyfish', 2), ('national', 2), ('temporary', 2), ('irresponsible', 2), ('jazzy', 2), ('manipulative', 2), ('leadt', 2), ('opposite', 2), ('adventure', 2), ('criticised', 2), ('sentimental', 2), ('formal', 2), ('piece', 2), ('armys', 2), ('visible', 2), ('john', 2), ("else's", 2), ('tricky', 2), ('amazed', 2), ('mymy', 2), (

In [79]:
print(unique_female_adj_counter.most_common(100))

[('mld', 13), ('lisa', 11), ('tzuyu', 11), ('fancams', 9), ('choa', 9), ('teddy', 9), ('nayeon', 8), ('female', 8), ('fancy', 8), ("girl's", 7), ('comfortable', 6), ('innocent', 6), ('ridiculous', 6), ('hot', 6), ('forgot', 6), ('write', 6), ("girls'", 6), ('sixteen', 6), ('aoa', 5), ('correct', 5), ('photo', 5), ('wtf', 5), ('mental', 5), ('powerful', 5), ('vita', 5), ('whistle', 5), ('vlive', 5), ('ktl', 4), ('sian', 4), ('jimin’s', 4), ('broad', 4), ('local', 4), ('technical', 4), ('unpopular', 4), ('impressive', 4), ('extra', 4), ('mediocre', 4), ('screentime', 4), ('sakura', 4), ('react', 4), ('iggy', 4), ('accustomed', 4), ('lied', 4), ('dead', 4), ('latin', 4), ("pretty'", 4), ('liberal', 4), ('nana', 4), ('dan', 4), ('fei', 4), ('lightstick', 4), ('npeople', 4), ('baam', 4), ('various', 4), ('childish', 4), ('valid', 3), ('usual', 3), ('mine', 3), ('nayun', 3), ('political', 3), ('awful', 3), ('precious', 3), ('photocard', 3), ('sucked', 3), ('translate', 3), ('unlikely', 3), (

What adjectives are used? Verbs? 

[Categorizing and Tagging Words](https://www.nltk.org/book/ch05.html)

[collocations](https://www.nltk.org/howto/collocations.html)

Most common ngrams

In [16]:
from nltk.collocations import BigramCollocationFinder, TrigramCollocationFinder

In [17]:
bigram_measures = nltk.collocations.BigramAssocMeasures()
trigram_measures = nltk.collocations.TrigramAssocMeasures()
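Association scores such as the likelihood ratio can rank n-grams that occur only once or twice very highly. A hedged sketch of NLTK's frequency filter, which discards n-grams below a minimum count before scoring. The token list is invented, and the threshold of 2 is an arbitrary choice for illustration.

```python
from nltk.collocations import BigramCollocationFinder, BigramAssocMeasures

# Invented token list; in the notebook this would come from tokenize(...).
tokens = 'red velvet is great red velvet new song city pop vibes'.split()

finder = BigramCollocationFinder.from_words(tokens)
finder.apply_freq_filter(2)  # keep only bigrams seen at least twice

measures = BigramAssocMeasures()
top = finder.nbest(measures.likelihood_ratio, 5)
print(top)
```

The same `apply_freq_filter` call is available on `TrigramCollocationFinder`.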



In [336]:
top_ngrams(female_giant_comment_string, num_of_words=50, ngram=3)

[('city', 'pop', 'real'),
 ('city', 'pop', 'icing'),
 ('city', 'pop', 'influence'),
 ('city', 'pop', 'permit'),
 ('sucker', 'city', 'pop'),
 ('city', 'pop', 'catch'),
 ('city', 'pop', 'lately'),
 ('city', 'pop', 'term'),
 ('considered', 'city', 'pop'),
 ('fall', 'city', 'pop'),
 ('siren', 'city', 'pop'),
 ('sounding', 'city', 'pop'),
 ('term', 'city', 'pop'),
 ('ton', 'city', 'pop'),
 ('defend', 'city', 'pop'),
 ('game', 'city', 'pop'),
 ('using', 'city', 'pop'),
 ('red', 'velvet', 'leaders'),
 ('😄😆', 'red', 'velvet'),
 ('example', 'city', 'pop'),
 ('hear', 'city', 'pop'),
 ('quite', 'city', 'pop'),
 ('call', 'city', 'pop'),
 ('city', 'pop', 'like'),
 ('red', 'velvet', 'listener'),
 ('need', 'city', 'pop'),
 ('city', 'pop', 'excited'),
 ('casual', 'red', 'velvet'),
 ('perhaps', 'red', 'velvet'),
 ('red', 'velvet', 'oldest'),
 ('irene', 'red', 'velvet'),
 ('city', 'pop', 'well'),
 ('city', 'pop', 'even'),
 ('expecting', 'red', 'velvet'),
 ('red', 'velvet', 'promote'),
 ('red', 'velvet',

In [338]:
top_ngrams(male_giant_comment_string, num_of_words=50, ngram=3)

[('defending', 'stray', 'kids'),
 ('discussion', 'stray', 'kids'),
 ('familiar', 'stray', 'kids'),
 ('perception', 'stray', 'kids'),
 ('stray', 'kids', 'crackhead'),
 ('stray', 'kids', 'draws'),
 ('stray', 'kids', 'objectively'),
 ('stray', 'kids', 'touring'),
 ('stray', 'kids', 'specifically'),
 ('opinion', 'stray', 'kids'),
 ('stray', 'kids', 'called'),
 ('stray', 'kids', 'marketed'),
 ('blm', 'stray', 'kids'),
 ('point', 'stray', 'kids'),
 ('stray', 'kids', 'group'),
 ('groups', 'stray', 'kids'),
 ('love', 'stray', 'kids'),
 ('stray', 'kids', 'would'),
 ('culture', 'stray', 'kids'),
 ('stray', 'kids', "i'm"),
 ('hip', 'hop', 'rap'),
 ('features', 'hip', 'hop'),
 ('hoping', 'hip', 'hop'),
 ('consider', 'hip', 'hop'),
 ('hip', 'hop', 'banger'),
 ('american', 'hip', 'hop'),
 ('hip', 'hop', 'pop'),
 ('also', 'hip', 'hop'),
 ('find', 'new', 'home'),
 ('pretty', 'much', 'contained'),
 ('almost', 'pretty', 'much'),
 ('bans', 'depending', 'severity'),
 ('concert', 'entails', 'proper'),
 ('e

Most common adjectives

In [335]:
top_adjectives(female_giant_comment_string, num_of_words=50)

[('good', 46),
 ('much', 39),
 ('new', 37),
 ('different', 35),
 ('korean', 32),
 ('many', 25),
 ("i'm", 24),
 ('bad', 23),
 ('japanese', 19),
 ('similar', 18),
 ('english', 18),
 ('first', 17),
 ('last', 17),
 ('happy', 16),
 ('great', 16),
 ('big', 16),
 ('lol', 15),
 ('it’s', 15),
 ('right', 15),
 ('sure', 15),
 ('full', 14),
 ('favorite', 13),
 ('single', 13),
 ('whole', 12),
 ('weird', 12),
 ('little', 12),
 ('wrong', 11),
 ('amazing', 11),
 ('real', 11),
 ('american', 11),
 ('popular', 11),
 ('long', 11),
 ('high', 11),
 ('red', 10),
 ('international', 10),
 ('sad', 10),
 ('top', 10),
 ('ready', 10),
 ('cute', 9),
 ('hard', 9),
 ('i’m', 9),
 ('mean', 9),
 ('main', 8),
 ('original', 8),
 ('give', 8),
 ('western', 8),
 ('song', 8),
 ('stupid', 7),
 ('aware', 7),
 ("that's", 7)]

In [334]:
top_adjectives(male_giant_comment_string, num_of_words=50)

[('much', 26),
 ('black', 22),
 ('happy', 19),
 ('new', 16),
 ("i'm", 15),
 ('western', 13),
 ('different', 12),
 ('sure', 12),
 ('american', 12),
 ('korean', 12),
 ('good', 11),
 ('last', 11),
 ('big', 11),
 ('old', 11),
 ('many', 10),
 ('first', 10),
 ('great', 10),
 ('right', 9),
 ("that's", 9),
 ('little', 8),
 ('open', 8),
 ('wrong', 7),
 ('whole', 6),
 ('long', 6),
 ('it’s', 6),
 ('cultural', 6),
 ('live', 6),
 ('sm', 6),
 ('asian', 6),
 ('clear', 6),
 ('bad', 6),
 ('i’m', 5),
 ('hard', 5),
 ('specific', 5),
 ('bts', 5),
 ('shit', 5),
 ('likely', 5),
 ('exo', 4),
 ('next', 4),
 ('proud', 4),
 ('iconic', 4),
 ('amazing', 4),
 ('true', 4),
 ("can't", 4),
 ('nct', 4),
 ('anniversary', 4),
 ('thank', 4),
 ('nice', 4),
 ('huge', 4),
 ('fair', 4)]