## Planning / Scratch Work

Do /r/kpop commenters talk differently about male vs. female groups?

Initial exploration of this question:
- Identify submissions on 2 all-male groups, 2 all-female groups
- Collect their comments
- Contrast comments in general to "typical" reddit language (using /r/funny as a standard)
- Contrast comments on male group vs female group 

In [81]:
# TODO: Maybe separate analysis for emoji

Using Pushshift to get reddit comments

See [Pushshift's GitHub API README](https://github.com/pushshift/api)

> Search for the most recent comments mentioning the word "science" within the subreddit /r/askscience
>
> `https://api.pushshift.io/reddit/search/comment/?q=science&subreddit=askscience`

Retrieve all comment ids for a submission object

`https://api.pushshift.io/reddit/submission/comment_ids/{base36_submission_id}`
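
Both endpoints can be assembled with the standard library before being passed to `requests.get`. This is a minimal sketch: the helper names `comment_search_url` and `comment_ids_url` are made up for illustration, and only the two URL patterns quoted above come from the Pushshift README.

```python
from urllib.parse import urlencode

PUSHSHIFT_BASE = "https://api.pushshift.io/reddit"


def comment_search_url(query, subreddit):
    """Build the comment-search URL for a query term within one subreddit."""
    params = urlencode({"q": query, "subreddit": subreddit})
    return f"{PUSHSHIFT_BASE}/search/comment/?{params}"


def comment_ids_url(submission_id):
    """Build the URL listing all comment ids for a base-36 submission id."""
    return f"{PUSHSHIFT_BASE}/submission/comment_ids/{submission_id}"


print(comment_search_url("science", "askscience"))
print(comment_ids_url("huclsf"))
```

Keeping URL construction separate from the network call makes it easy to check the request before sending it.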

[New to Pushshift FAQ](https://www.reddit.com/r/pushshift/comments/bcxguf/new_to_pushshift_read_this_faq/)

[Pushshift Reddit API v4.0 Documentation](https://reddit-api.readthedocs.io/en/latest/#)

A non-comprehensive list of related work:
- "A Community of Curious Souls: An Analysis of Commenting Behavior on TED Talks Videos" (Tsou, Thelwall, Mongeon, and Sugimoto, 2014)
- "YouTube science channel video presenters and comments: female friendly or vestiges of sexism?" (Thelwall and Mas-Bleda, 2018)
- "Shirtless and dangerous: Quantifying linguistic signals of gender bias in an online fiction writing community." (Fast, Vachovsky, and Bernstein, 2016)
- "Using language models to quantify gender bias in sports journalism" (Fu, Danescu-Niculescu-Mizil, Lee, 2016)

## Data Collection

Import statements

In [162]:
import string
import re
import requests
import logging
import pickle
import json

from collections import Counter

from tqdm import tqdm

import pandas as pd

import nltk
from nltk.corpus import stopwords

ENGLISH_STOPWORDS = stopwords.words('english')

## Pushshift Notes

What are the values that we can access for each submission?

```python
> response.json()['data'][1].keys()

> dict_keys(['all_awardings', 'allow_live_comments', 'author', 'author_flair_css_class', 'author_flair_richtext', 'author_flair_template_id', 'author_flair_text', 
'author_flair_text_color', 'author_flair_type', 'author_fullname', 'author_patreon_flair', 'author_premium', 'awarders', 'can_mod_post', 'contest_mode', 'created_utc', 'domain',
'full_link', 'gildings', 'id', 'is_crosspostable', 'is_meta', 'is_original_content', 'is_reddit_media_domain', 'is_robot_indexable', 'is_self', 'is_video',
'link_flair_background_color', 'link_flair_css_class', 'link_flair_richtext', 'link_flair_template_id', 'link_flair_text', 'link_flair_text_color', 'link_flair_type', 'locked',
'media_only', 'no_follow', 'num_comments', 'num_crossposts', 'over_18', 'parent_whitelist_status', 'permalink', 'pinned', 'pwls', 'retrieved_on', 'score', 'selftext',
'send_replies', 'spoiler', 'stickied', 'subreddit', 'subreddit_id', 'subreddit_subscribers', 'subreddit_type', 'thumbnail', 'title', 'total_awards_received', 'treatment_tags',
'upvote_ratio', 'url', 'url_overridden_by_dest', 'whitelist_status', 'wls'])
```

Note that `created_utc` is a Unix timestamp (seconds since the epoch):

```
> [post['created_utc'] for post in response.json()['data']]

>[1595657105,
 1595641997,
 1595632191,
 1595623051,
 1595602847,
 1595599200,
 1595583205,
 1595581926,...
```

This tells us that newer posts come first (i.e. the posts in `response.json()['data']` are ordered newest to oldest).
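
The ordering can be checked directly, and the timestamps are easier to read once converted to UTC datetimes. A small sketch using the first few values from the output above; the helper names `to_utc` and `is_newest_first` are illustrative only.

```python
from datetime import datetime, timezone

# First few `created_utc` values from the response shown above.
created = [1595657105, 1595641997, 1595632191, 1595623051]


def to_utc(ts):
    """Convert a Unix timestamp (seconds) to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(ts, tz=timezone.utc)


def is_newest_first(timestamps):
    """Return True if every timestamp is >= the one that follows it."""
    return all(a >= b for a, b in zip(timestamps, timestamps[1:]))


for ts in created:
    print(ts, to_utc(ts).isoformat())
print(is_newest_first(created))
```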

What are the values that we can access for each comment?

```python
comments_json[0].keys()

dict_keys(['all_awardings', 'approved_at_utc', 'associated_award', 'author', 'author_flair_background_color', 'author_flair_css_class', 'author_flair_richtext',
'author_flair_template_id', 'author_flair_text', 'author_flair_text_color', 'author_flair_type', 'author_fullname', 'author_patreon_flair', 'author_premium', 'awarders',
'banned_at_utc', 'body', 'can_mod_post', 'collapsed', 'collapsed_because_crowd_control', 'collapsed_reason', 'created_utc', 'distinguished', 'edited', 'gildings', 'id', 
'is_submitter', 'link_id', 'locked', 'no_follow', 'parent_id', 'permalink', 'retrieved_on', 'score', 'send_replies', 'stickied', 'subreddit', 'subreddit_id', 'top_awarded_type', 
'total_awards_received', 'treatment_tags'])
   
```


In [64]:
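# Collect 1000 posts, paging backwards from the newest one
posts = []
oldest_post_id = None
while len(posts) < 1000:
    if oldest_post_id is None:
        posts.extend(collect_posts())
    else:
        posts.extend(collect_posts(oldest_post_id))
    # Results arrive newest to oldest, so the last post collected is the oldest;
    # this assumes collect_posts(post_id) returns posts older than post_id.
    oldest_post_id = posts[-1]['id']

save_data(posts, 'data/rkpop-1000-posts.pkl')
# test_load = load_data('data/rkpop-1000-posts.pkl')
# assert test_load == posts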
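
A paging loop like the one above needs the cursor id to advance after each page, and it needs a way to stop when no older posts remain. The sketch below shows that pattern against a fake data source. `fetch_page` and `fake_fetch` are hypothetical stand-ins, not the real `collect_posts`, whose signature is not shown in this notebook.

```python
def collect_n_posts(fetch_page, target=1000):
    """Page backwards through results until `target` posts are gathered.

    `fetch_page(before_id)` is a hypothetical callable that returns a list of
    post dicts (newest first) older than `before_id`, or the newest page when
    `before_id` is None.
    """
    posts = []
    before_id = None
    while len(posts) < target:
        page = fetch_page(before_id)
        if not page:  # nothing older is available, so stop instead of looping forever
            break
        posts.extend(page)
        before_id = page[-1]["id"]  # last item in a newest-first page is the oldest
    return posts[:target]


# Stand-in for the real API: 25 fake posts with ids 24..0, served 10 per page.
FAKE_POSTS = [{"id": i} for i in range(24, -1, -1)]


def fake_fetch(before_id, page_size=10):
    older = [p for p in FAKE_POSTS if before_id is None or p["id"] < before_id]
    return older[:page_size]


print(len(collect_n_posts(fake_fetch, target=20)))
print(len(collect_n_posts(fake_fetch, target=1000)))
```

Passing the fetch function in as an argument keeps the loop testable without network access.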

In [1]:
from multiprocessing import Pool
import multiprocessing

num_cpu = multiprocessing.cpu_count()

In [13]:
from data_collection_utils import collect_comment, load_data, save_data

In [5]:
posts = load_data('data/rkpop-1000-posts.pkl')

In [10]:
with Pool(10) as p:
    comments = p.map(collect_comment, posts)

In [14]:
save_data(comments, 'data/rkpop-1000-comments.pkl')

In [15]:
import glob

In [25]:
# for comment_pkl in glob.glob('data/comments/*'):
#     print(len(load_data(comment_pkl))) 
# ??? Some sort of collection error... fewer than 50 for many

In [56]:
comments = {}
for filename in glob.glob('data/comments/*'):
    start = filename.rindex('/') + 1
    end = filename.rindex('-')
    post_id = filename[start:end]
    comments[post_id] = load_data(filename)
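
The slicing above depends on `/` being the path separator and on every file name containing a hyphen. A slightly more defensive sketch of the same idea follows; the file name in the example is hypothetical, since only the `<post_id>-...` shape is implied by the loop.

```python
import os


def post_id_from_path(path):
    """Return the part of the file name before the last hyphen."""
    filename = os.path.basename(path)
    post_id, _sep, _rest = filename.rpartition("-")
    if not post_id:
        raise ValueError(f"no '-' found in file name: {filename!r}")
    return post_id


# Hypothetical file name; only the `<post_id>-...` shape is implied by the loop above.
print(post_id_from_path("data/comments/huclsf-comments.pkl"))
```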

In [57]:
save_data(comments, 'data/rkpop-3000-comments.pkl')

In [44]:
comments.keys()

dict_keys(['huclsf', 'htuc60', 'hvskyu', 'hucjmo', 'hu9rcu', 'huzmb5', 'hwtrm6', 'hrp506', 'hv2356', 'ho4kpi', 'hu3597', 'huu5n2', 'huvuwp', 'hu7vo3', 'hvtdw0', 'hv1ebo', 'huzsvw', 'hqyxzz', 'hrc8b0', 'hrljep', 'hub1jr', 'hvj3uw', 'hvka7v', 'hvj615', 'hudp5x', 'htybex', 'hx29od', 'hw0msh', 'huxala', 'hudowa', 'hrp4jz', 'hv8u2z', 'hxc2tq', 'hv422q', 'hxhvbk', 'hudxp1', 'hvvai2', 'htpot9', 'hvvasx', 'hwyk30', 'hu5qlw', 'hv41za', 'humrj9', 'hx3a8s', 'hvrr9j', 'hvqe4a', 'hub3bl', 'hvqpny', 'hui3wc', 'htvhtn', 'hvvaiq', 'hvs5o0', 'huoafl', 'hudqes', 'hx9kq3', 'hv7a2v', 'hrfis5', 'hubz5p', 'hxekwy', 'hv1n7f', 'husgf7', 'humrtx', 'hujyqp', 'hwg5tm', 'hu0d49', 'hvt82o', 'htujl0', 'hr34w6', 'hv4vrd'])

In [64]:
post_ids = [p['id'] for p in posts]
post_titles = [p['title'] for p in posts]

post_ids_titles_dict = dict(zip(post_ids, post_titles))

In [59]:
comments['huclsf'][:5]

["BTS ISN'T THAT GOOD, THEY LOOK GAY",
 "I'm really just venting/ranting because people were responsible for the safety of that stage, but failed to do their job. This could be career-altering and life-changing. They ruined this year for her and potentially limited her career.\n\nLike I said, it's good that she's recovering well. We all knew from the beginning that she would recover. It's the uncertainty after recovery that I'm worried about. It just might be too troublesome continue her career after because injuries can have lasting effects on the human body long after recovery. No young person should have to deal with that.",
 "Can I just clarify that my original post meant that they got away with allowing for a dangerous set up in the first place? The injury should never have occurred and i find it really bizarre that defending SBS seems to be the hill that you're wanting to die on. I'm just expressing my frustrations that such shitty working environments are created for idols in th

In [60]:
comments['htuc60'][:5]

['definitely, i was so sad to see them not make it far :(',
 "and the best group on that show but we're not ready for that conversation yet",
 'Oh yeah, Bomin is definitely helping their group get recognized but I think a good chunk of people don’t even know he’s an idol because he’s so good at acting!',
 "Yess, my boys deserve even more success! I've been following them since their Let Me comeback and there isn't really a song I didn't like from them (but Crush is still my favorite)!\nGo GolCha!",
 "They really worked so hard for this promotion I'm so so happy to see this, they've suffered so much through 2019 because of the hiatus and then seeing them crying on that circus show made my heart break again but I'm happy that they've left those memories behind and seem happier these days. Realise how golcha has never mentioned RTK at all since leaving that shit lmao"]


## Loading from saved CSV

In [72]:
df.reset_index()

| | index | 0 |
|---|---|---|
| 0 | hxhvbk | Mamamoo Wheein - Candy (orig. Baekhyun) (Speci... |
| 1 | hxekwy | Red Velvet - IRENE &amp; SEULGI - Monster (Two... |
| 2 | hxc2tq | BLACKPINK Lisa Appointed as Ambassador for BVL... |
| 3 | hx9kq3 | The Rolling Stone included 9 K-Pop Boygroup so... |
| 4 | hx3a8s | PURPLE K!SS - Debut Trailer : WH0 CARES? - 유키 ... |
| ... | ... | ... |
| 95 | ho4kzo | EXO-SC - On Me (Sehun Solo - Track MV) |
| 96 | ho4kpi | GFriend - Apple (MV Teaser 1) |
| 97 | hnin4k | Happy 10th Anniversary to Girl's Day! |
| 98 | hm9ctk | Happy 4th anniversary to NCT 127! |


In [74]:
def get_comments_from_obj(post_id):
    if post_id in comments:
        return comments[post_id]
    else:
        return None

In [93]:
df = pd.DataFrame.from_dict(post_ids_titles_dict, orient='index')
comments_as_series = df.reset_index()['index'].apply(lambda post_id: get_comments_from_obj(post_id))
df = df.reset_index()
df['comments'] = comments_as_series
df.columns = ['id', 'title', 'comments']
df

| | id | title | comments |
|---|---|---|---|
| 0 | hxhvbk | Mamamoo Wheein - Candy (orig. Baekhyun) (Speci... | [I'm so happy she didn't change the pronouns, ... |
| 1 | hxekwy | Red Velvet - IRENE &amp; SEULGI - Monster (Two... | [RV never ever created a bad song, let alone a... |
| 2 | hxc2tq | BLACKPINK Lisa Appointed as Ambassador for BVL... | [count me in, what about Hera?, anyone partner... |
| 3 | hx9kq3 | The Rolling Stone included 9 K-Pop Boygroup so... | [Lol @ the strawman, Man I'm reading through t... |
| 4 | hx3a8s | PURPLE K!SS - Debut Trailer : WH0 CARES? - 유키 ... | [Yes Indeed, it\`s a classic produce thing whe... |
| ... | ... | ... | ... |
| 95 | ho4kzo | EXO-SC - On Me (Sehun Solo - Track MV) | |
| 96 | ho4kpi | GFriend - Apple (MV Teaser 1) | [Obvious girl detected, lol. Gfriend is for me... |
| 97 | hnin4k | Happy 10th Anniversary to Girl's Day! | |
| 98 | hm9ctk | Happy 4th anniversary to NCT 127! | |


In [94]:
df.to_csv('data/rkpop-data-2020-08-01.csv',index=False)

In [95]:
data_df = pd.read_csv('data/rkpop-data-2020-08-01.csv')
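
Because `to_csv` writes each list as its string representation, the `comments` column comes back from `read_csv` as text such as `"['a', 'b']"` rather than as a list. Newlines inside comments then appear as a literal backslash followed by `n`, which may be why tokens like `nmaybe` show up in the results further down. A minimal sketch of parsing a cell back into a list with `ast.literal_eval`; `parse_comment_cell` is an illustrative helper, not part of the original notebook.

```python
import ast


def parse_comment_cell(cell):
    """Turn one CSV cell back into a list of comment strings.

    Missing values (NaN floats or None) and empty cells become an empty list.
    """
    if not isinstance(cell, str) or not cell.strip():
        return []
    return ast.literal_eval(cell)


original = ["Go GolCha!", "first line\nsecond line"]
as_written = str(original)  # what ends up in the CSV cell for a list value
restored = parse_comment_cell(as_written)

print(restored == original)
print(parse_comment_cell(float("nan")))
```

Under these assumptions it could be applied with `data_df['comments'].apply(parse_comment_cell)` before any cleaning step.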

Identify male vs female groups

In [148]:
m_f_mapping = {'male': {'EXO', 'NCT', 'BTS', 'Stray Kids', 'G-Dragon', 'Big Bang', 
                        'AB6IX', 'Golden Child', 'SEVENTEEN', 'Top Secret', 'TST', 
                        'ONEUS', 'TVXQ', 'PENTAGON', 'THE BOYZ', 'VERIVERY', 'Ravi', 
                        'WayV', 'VIXX', 'Super Junior', 'SHINee', 'Monsta X',
                        'Block B', 'Zico', 'Treasure'},

               'female': {'GFriend', "Girl's Day", 'Red Velvet', 'AOA', 'BLACKPINK', 
               'Momoland', 'miss A', 'MAMAMOO', 'ITZY', 'Sunmi', 'Weeekly', 'NiziU', 
               'NATTY', 'Twice', 'LOONA', 'After School', 'IU', 'IZ*ONE', 'WJSN', 
               'Cosmic Girls', 'DIA', 'CHUNGHA', 'SNSD', 'Cherry Bullet', 'Somi', 
               '(G)I-DLE', 'Apink', 'Yukika', 'Oh My Girl', 'Lee Hi',
               'PURPLE K!SS', 'Singer Minty', 'Rocket Punch'}
}
m_f_mapping['male'] = {g.lower() for g in m_f_mapping['male']}
m_f_mapping['female'] = {g.lower() for g in m_f_mapping['female']}



Tag submissions with male or female

In [151]:
# Checking if any overlapping...
data_df[data_df['male'] & data_df['female']]

| | id | title | comments | male | female |
|---|---|---|---|---|---|
| 85 | hrc2pf | EXO-SC, MAMAMOO, Red Velvet Irene &amp; Seulgi... | | True | True |


In [149]:
# TODO: Count a submission as 'male' or 'female' only if it has one gender present?
data_df['male'] = data_df['title'].apply(lambda t: any(group in t.lower() for group in m_f_mapping['male']))
data_df['female'] = data_df['title'].apply(lambda t: any(group in t.lower() for group in m_f_mapping['female']))
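
Plain substring matching can over-match short names: `'dia'` occurs inside "media" and `'iu'` inside "NiziU". One possible alternative, sketched below, is to match names only as whole tokens. The helpers `build_group_pattern` and `title_mentions` are illustrative and not part of the original analysis.

```python
import re


def build_group_pattern(names):
    """Compile one regex matching any name as a whole token, case-insensitively."""
    # Longest names first so a longer name is tried before a shorter overlapping one.
    alternatives = "|".join(re.escape(n) for n in sorted(names, key=len, reverse=True))
    # Lookarounds instead of \b because names such as '(G)I-DLE' start or end
    # with characters that are not word characters.
    return re.compile(rf"(?<!\w)(?:{alternatives})(?!\w)", re.IGNORECASE)


def title_mentions(title, pattern):
    """Return True if the title contains at least one whole-token group name."""
    return pattern.search(title) is not None


female_pattern = build_group_pattern({"DIA", "IU", "(G)I-DLE", "Red Velvet"})

print(title_mentions("Red Velvet - IRENE & SEULGI - Monster", female_pattern))
print(title_mentions("Social media roundup", female_pattern))  # plain substring 'dia' would match
print(title_mentions("(G)I-DLE comeback teaser", female_pattern))
```

Hyphenated titles such as "EXO-SC" would still match `exo`, because the hyphen is not a word character.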

Clean comment text and prepare for analysis

In [156]:
# Keep rows tagged male or female that also have comments available
data_df_subset = data_df[(data_df['male'] | data_df['female']) & data_df['comments'].notna()]
# Drop rows whose title matches both a male and a female group
data_df_subset = data_df_subset[~(data_df_subset['male'] & data_df_subset['female'])]
print(len(data_df_subset))

59


[How to strip punctuation from a string](https://stackoverflow.com/questions/265960/best-way-to-strip-punctuation-from-a-string)

`s.translate(str.maketrans('', '', string.punctuation))`

[`maketrans` documentation](https://docs.python.org/3.3/library/stdtypes.html?highlight=maketrans#str.maketrans)

[Removing URLs from a string](https://stackoverflow.com/questions/11331982/how-to-remove-any-url-within-a-string-in-python)
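
The two techniques linked above can be combined into one cleaning step. This is a sketch under the assumption that apostrophes should be kept so contractions survive; `clean_text` is an illustrative name.

```python
import re
import string

# All ASCII punctuation except the apostrophe, so contractions like "don't" survive.
PUNCT_TO_SPACE = string.punctuation.replace("'", "")
TABLE = str.maketrans(PUNCT_TO_SPACE, " " * len(PUNCT_TO_SPACE))
URL_RE = re.compile(r"https?://\S+")


def clean_text(text):
    """Lowercase, drop URLs, replace punctuation with spaces, collapse whitespace."""
    text = URL_RE.sub("", text)
    text = text.lower().translate(TABLE)
    return " ".join(text.split())


print(clean_text("Don't stop, believing! See https://example.com/a?b=1 (now)"))
```

URLs are removed before punctuation so that the `://` and `?` inside a link do not split it into stray tokens.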

Helper functions

In [160]:
def giant_cleaned_string(series_of_list_of_comments):
    """Return one cleaned string built from a pandas Series of comment text.

    Each element must be a string, since `.split()` is called on it (here, the
    stringified comment list read back from the CSV). Elements are joined into
    one giant lowercase string with URLs removed and punctuation (except the
    apostrophe) replaced by spaces.
    """
    comment_string = ' '.join(series_of_list_of_comments.apply(lambda x: ' '.join(x.split())))
    comment_string = re.sub(r'https?://\S+', '', comment_string)

    chars_to_replace = string.punctuation.replace("'", '') + '“”\n'  # Keep the apostrophe
    whitespace_to_replace_with = len(chars_to_replace) * ' '

    comment_string = comment_string.lower().translate(str.maketrans(chars_to_replace, whitespace_to_replace_with))
    return comment_string

def acceptable_token(token):
    """ Return True if token is longer than one character and is not present in ENGLISH_STOPWORDS
    """
    return (len(token) > 1 and token not in ENGLISH_STOPWORDS)

def tokenize(giant_comment_string):
    """ Return list of word tokens from given string.
    """
    tokens = giant_comment_string.split(' ')
    return list(filter(acceptable_token, tokens))

def create_counter_object(giant_comment_string):
    """Return a Counter of word counts for the given string."""
    word_counter = Counter(tokenize(giant_comment_string))
    return word_counter

def top_adjectives(giant_comment_string, num_of_words=10):
    """ Return list with most common adjectives in given string.
    """

    def find_adjectives(list_of_word_pos_tuple):
        return list_of_word_pos_tuple[1] == 'JJ'

    comment_words_POS = nltk.pos_tag(tokenize(giant_comment_string))
    comment_adj_counter = Counter([adj[0] for adj in list(filter(find_adjectives, comment_words_POS))])
    return comment_adj_counter.most_common(num_of_words)

# TODO: Determine association metric to use
# http://www.nltk.org/_modules/nltk/metrics/association.html
def top_ngrams(giant_comment_string, num_of_words=15, ngram=2):
    """Return the top n-grams in the given string, ranked by likelihood ratio.

    Relies on BigramCollocationFinder, TrigramCollocationFinder, bigram_measures
    and trigram_measures being defined before the function is called.
    """
    tokens = tokenize(giant_comment_string)
    if ngram == 2:
        finder = BigramCollocationFinder.from_words(tokens)
        return finder.nbest(bigram_measures.likelihood_ratio, num_of_words)
    elif ngram == 3:
        finder = TrigramCollocationFinder.from_words(tokens)
        return finder.nbest(trigram_measures.likelihood_ratio, num_of_words)
    else:
        raise ValueError("Only bi- and trigrams are supported.")

In [186]:
# TODO: Move analysis helper functions into their own py file

In [163]:
male_giant_comment_string = giant_cleaned_string(data_df_subset[data_df_subset['male']]['comments'])
female_giant_comment_string = giant_cleaned_string(data_df_subset[data_df_subset['female']]['comments'])

In [164]:
female_word_counter = create_counter_object(female_giant_comment_string)
male_word_counter = create_counter_object(male_giant_comment_string)

In [165]:
sum(female_word_counter.values())

37134

In [166]:
sum(male_word_counter.values())

10120

In [14]:
# TODO: Log-Odds Ratio of Words
# len(male_giant_comment_string.split(' ')) # 6718 
# len(female_giant_comment_string.split(' ')) # 14882
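
The two corpora differ in size (37,134 vs. 10,120 counted tokens), so raw counts are hard to compare directly. One common option for the TODO above is a smoothed log-odds ratio. The sketch below is an assumption about how it could be done, not the analysis actually run here; the toy counters and the smoothing constant `alpha` are illustrative.

```python
import math
from collections import Counter


def log_odds_ratio(counts_a, counts_b, alpha=0.5):
    """Smoothed log-odds ratio of each word in corpus A relative to corpus B.

    Positive scores mean the word is relatively more frequent in A.
    `alpha` is an additive smoothing constant (an assumption, not tuned).
    """
    vocab = set(counts_a) | set(counts_b)
    total_a = sum(counts_a.values()) + alpha * len(vocab)
    total_b = sum(counts_b.values()) + alpha * len(vocab)
    scores = {}
    for word in vocab:
        a = counts_a[word] + alpha
        b = counts_b[word] + alpha
        scores[word] = math.log(a / (total_a - a)) - math.log(b / (total_b - b))
    return scores


# Toy counters standing in for the two word counters above.
toy_a = Counter({"enlist": 12, "song": 10, "like": 96})
toy_b = Counter({"song": 206, "like": 460, "album": 141})

scores = log_odds_ratio(toy_a, toy_b)
for word, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{word:8s} {score:+.2f}")
```

Smoothing keeps words that appear in only one corpus from producing infinite scores, which matters for the "unique" word lists computed below.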

In [167]:
male_top_50 = male_word_counter.most_common(50)
female_top_50 = female_word_counter.most_common(50)

In [168]:
print(male_top_50)
print()
print(female_top_50)

[('like', 96), ('think', 84), ('see', 54), ('would', 53), ("i'm", 53), ('back', 52), ('really', 51), ('get', 49), ('fans', 48), ('year', 47), ('even', 47), ("'s", 45), ("'i", 45), ('one', 44), ('still', 42), ('know', 40), ('much', 40), ('bts', 39), ('go', 37), ('heechul', 37), ('also', 35), ('sm', 35), ('big', 35), ('yg', 35), ('since', 33), ('i’m', 33), ('hope', 33), ('people', 33), ('time', 32), ('gt', 32), ('reporter', 31), ('right', 30), ('comments', 29), ('going', 28), ('way', 28), ('say', 28), ('group', 28), ('good', 27), ('hate', 27), ('well', 26), ('bighit', 26), ('sure', 25), ('love', 24), ('could', 24), ('debut', 24), ('got', 23), ('though', 23), ("can't", 23), ('lot', 23), ('it’s', 23)]

[('like', 460), ('really', 238), ("i'm", 221), ('song', 206), ('think', 203), ('even', 193), ('one', 187), ('also', 175), ('group', 173), ('would', 165), ('see', 165), ('people', 156), ('know', 152), ("'i", 148), ('love', 146), ('good', 144), ('album', 141), ('music', 135), ('much', 132), ('

In [169]:
unique_male_words = set(male_word_counter.keys()) - set(female_word_counter.keys())
unique_female_words = set(female_word_counter.keys()) - set(male_word_counter.keys())

In [170]:
unique_male_word_counter = Counter()
unique_female_word_counter = Counter()

for word in unique_male_words:
    unique_male_word_counter[word] = male_word_counter[word] 

for word in unique_female_words:
    unique_female_word_counter[word] = female_word_counter[word] 

In [171]:
unique_male_words_str = ' '.join(unique_male_words)

In [172]:
unique_male_words_str = ' '.join(unique_male_words)
unique_female_words_str = ' '.join(unique_female_words)
unique_male_adj_tuples = list(filter(lambda x: x[1]=='JJ', nltk.pos_tag(tokenize(unique_male_words_str)))) # filter for words uniquely used toward male groups/people AND are adjectives
unique_female_adj_tuples = list(filter(lambda x: x[1]=='JJ', nltk.pos_tag(tokenize(unique_female_words_str)))) # filter for words uniquely used toward female groups/people AND are adjectives
unique_male_adj = [tup[0] for tup in unique_male_adj_tuples]
unique_female_adj = [tup[0] for tup in unique_female_adj_tuples]

In [173]:
unique_male_adj_counter = Counter()
unique_female_adj_counter = Counter()

for word in unique_male_adj:
    unique_male_adj_counter[word] = male_word_counter[word] 

for word in unique_female_adj:
    unique_female_adj_counter[word] = female_word_counter[word] 

In [174]:
print(unique_male_adj_counter.most_common(100))

[('enlist', 12), ('enlistment', 10), ('wonho', 9), ('mx', 8), ('sergeant', 7), ('interpretation', 7), ('onew', 6), ('ygtb', 5), ('hara', 4), ('bh', 4), ('goo', 4), ("'welcome", 4), ('it’ll', 3), ('rejoin', 3), ('important', 3), ('boo', 3), ('spanish', 3), ('defamatory', 3), ('misogyny', 3), ('summary', 3), ('ambitious', 2), ("year'", 2), ('learned', 2), ('sparked', 2), ('rid', 2), ('fanname', 2), ('smthg', 2), ('cleared', 2), ('wig', 2), ('njonghyun', 2), ('ntreasure', 2), ('njongdae', 2), ('jumbo', 2), ('tribute', 2), ('netz', 2), ('corporate', 2), ('bt21', 2), ('😁😁', 2), ('pearl', 2), ('you’ll', 2), ('applied', 2), ('website', 2), ('considerable', 2), ('feb', 2), ('poet', 2), ('grateful', 2), ('literal', 2), ('innovative', 2), ('subsidary', 2), ('intern', 1), ('nwhich', 1), ('cancel', 1), ('sangbyeong', 1), ('uniform', 1), ('wolo', 1), ('intellectual', 1), ('sangdeungbyeong', 1), ('tsol', 1), ('dissimilar', 1), ('seventeen’s', 1), ('reasonable', 1), ('사람들이', 1), ("'ouch", 1), ('kryst

In [175]:
print(unique_female_adj_counter.most_common(100))

[('seulgi', 37), ('arin', 30), ('fun', 25), ('purple', 21), ('strong', 18), ('western', 17), ('mvs', 16), ('cube', 14), ('black', 14), ('visual', 14), ('adorable', 13), ('genre', 13), ('standard', 12), ('ssak3', 12), ('ot5', 12), ('psycho', 12), ('sns', 11), ('simple', 11), ('cannot', 11), ('solar', 11), ('cb', 10), ('mad', 10), ('ridiculous', 10), ('incredible', 10), ('natural', 10), ('instrumental', 9), ('ioi', 9), ('uncover', 9), ('aware', 9), ('common', 9), ('goeun', 9), ('average', 9), ('nonstop', 9), ('shot', 9), ('abc', 8), ('gothic', 8), ('front', 8), ('clear', 8), ('teen', 8), ('excellent', 7), ("yukika's", 7), ('cut', 7), ('involved', 7), ("now'", 7), ('pick', 7), ('skip', 7), ('shade', 7), ('mixed', 7), ('minimum', 6), ('healthy', 6), ('white', 6), ('idle', 6), ('hostess', 6), ('valid', 6), ('prostitute', 6), ('‘music', 6), ('iz', 6), ('confirmed', 6), ('attractive', 6), ('colour', 6), ('unfortunate', 6), ('minute', 6), ('taboo', 6), ('cha', 6), ('double', 6), ('typical', 6)

What adjectives are used? Verbs? 

[Categorizing and Tagging Words](https://www.nltk.org/book/ch05.html)
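
In the Penn Treebank tag set used by `nltk.pos_tag`, adjectives are tagged `JJ`, `JJR` (comparative) and `JJS` (superlative), and verb tags all start with `VB`. Filtering on the exact tag `'JJ'` therefore skips comparatives and superlatives. The sketch below groups by tag prefix instead; the tagged pairs are hand-written stand-ins so the example runs without NLTK data, and `count_by_tag_prefix` is an illustrative helper.

```python
from collections import Counter

# Hand-written (token, tag) pairs standing in for the output of nltk.pos_tag.
tagged = [
    ("good", "JJ"), ("better", "JJR"), ("best", "JJS"),
    ("love", "VBP"), ("loved", "VBD"), ("song", "NN"), ("good", "JJ"),
]


def count_by_tag_prefix(tagged_tokens, prefix):
    """Count tokens whose Penn Treebank tag starts with `prefix`."""
    return Counter(tok for tok, tag in tagged_tokens if tag.startswith(prefix))


adjectives = count_by_tag_prefix(tagged, "JJ")  # JJ, JJR, JJS
verbs = count_by_tag_prefix(tagged, "VB")       # VB, VBD, VBG, VBN, VBP, VBZ

print(adjectives.most_common(3))
print(verbs.most_common(3))
```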

[collocations](https://www.nltk.org/howto/collocations.html)

Most common ngrams

In [176]:
from nltk.collocations import BigramCollocationFinder, TrigramCollocationFinder

In [177]:
bigram_measures = nltk.collocations.BigramAssocMeasures()
trigram_measures = nltk.collocations.TrigramAssocMeasures()



In [178]:
top_ngrams(female_giant_comment_string, num_of_words=50, ngram=3)

[('red', 'velvet', 'irene'),
 ('red', 'velvet', 'quality'),
 ('experimental', 'red', 'velvet'),
 ('lacks', 'red', 'velvet'),
 ('red', 'velvet', 'dying'),
 ('red', 'velvet', 'photobooks'),
 ('red', 'velvet', 'pushes'),
 ('red', 'velvet', 'red'),
 ('velvet', 'red', 'velvet'),
 ('die', 'red', 'velvet'),
 ('nmaybe', 'red', 'velvet'),
 ('red', 'velvet', 'charm'),
 ('red', 'velvet', 'nmaybe'),
 ('yeri', 'red', 'velvet'),
 ('amp', 'red', 'velvet'),
 ('red', 'velvet', 'favorites'),
 ('respect', 'red', 'velvet'),
 ('created', 'red', 'velvet'),
 ('typical', 'red', 'velvet'),
 ('friends', 'red', 'velvet'),
 ('red', 'velvet', 'showing'),
 ('wrote', 'red', 'velvet'),
 ('red', 'velvet', 'car'),
 ('compare', 'red', 'velvet'),
 ('reputation', 'red', 'velvet'),
 ('red', 'velvet', 'lost'),
 ('love', 'red', 'velvet'),
 ('beat', 'red', 'velvet'),
 ('red', 'velvet', 'beat'),
 ('red', 'velvet', 'group'),
 ('follow', 'red', 'velvet'),
 ('red', 'velvet', 'also'),
 ('special', 'red', 'velvet'),
 ('gg', 'red', 

In [338]:
top_ngrams(male_giant_comment_string, num_of_words=50, ngram=3)

[('defending', 'stray', 'kids'),
 ('discussion', 'stray', 'kids'),
 ('familiar', 'stray', 'kids'),
 ('perception', 'stray', 'kids'),
 ('stray', 'kids', 'crackhead'),
 ('stray', 'kids', 'draws'),
 ('stray', 'kids', 'objectively'),
 ('stray', 'kids', 'touring'),
 ('stray', 'kids', 'specifically'),
 ('opinion', 'stray', 'kids'),
 ('stray', 'kids', 'called'),
 ('stray', 'kids', 'marketed'),
 ('blm', 'stray', 'kids'),
 ('point', 'stray', 'kids'),
 ('stray', 'kids', 'group'),
 ('groups', 'stray', 'kids'),
 ('love', 'stray', 'kids'),
 ('stray', 'kids', 'would'),
 ('culture', 'stray', 'kids'),
 ('stray', 'kids', "i'm"),
 ('hip', 'hop', 'rap'),
 ('features', 'hip', 'hop'),
 ('hoping', 'hip', 'hop'),
 ('consider', 'hip', 'hop'),
 ('hip', 'hop', 'banger'),
 ('american', 'hip', 'hop'),
 ('hip', 'hop', 'pop'),
 ('also', 'hip', 'hop'),
 ('find', 'new', 'home'),
 ('pretty', 'much', 'contained'),
 ('almost', 'pretty', 'much'),
 ('bans', 'depending', 'severity'),
 ('concert', 'entails', 'proper'),
 ('e

Most common adjectives

In [185]:
top_adjectives(female_giant_comment_string, num_of_words=50)

[('good', 140),
 ('much', 94),
 ('korean', 73),
 ('first', 69),
 ('sure', 68),
 ('many', 66),
 ('new', 66),
 ('great', 64),
 ('different', 63),
 ("i'm", 62),
 ('big', 58),
 ('last', 48),
 ('red', 46),
 ('next', 40),
 ('right', 39),
 ('single', 39),
 ('full', 35),
 ('whole', 35),
 ('main', 35),
 ('top', 34),
 ('bad', 33),
 ('ni', 33),
 ('happy', 32),
 ('japanese', 32),
 ('little', 31),
 ('digital', 31),
 ('similar', 30),
 ('song', 30),
 ('nice', 29),
 ('live', 29),
 ('real', 28),
 ('public', 28),
 ('mean', 27),
 ('english', 26),
 ('high', 26),
 ('it’s', 26),
 ('female', 25),
 ('wrong', 24),
 ('favorite', 24),
 ("can't", 23),
 ('sm', 23),
 ('able', 23),
 ('album', 22),
 ('irene', 22),
 ('popular', 22),
 ("they're", 21),
 ('lol', 21),
 ('beautiful', 21),
 ('huge', 20),
 ('general', 20)]

In [184]:
top_adjectives(male_giant_comment_string, num_of_words=50)

[('big', 35),
 ('much', 28),
 ('good', 27),
 ('sure', 24),
 ('many', 22),
 ('vlive', 19),
 ('korean', 18),
 ("i'm", 18),
 ('english', 17),
 ('happy', 17),
 ('last', 17),
 ('right', 14),
 ('sm', 13),
 ('huge', 13),
 ('i’m', 13),
 ('popular', 12),
 ('malicious', 12),
 ('new', 11),
 ('ready', 11),
 ('first', 11),
 ('different', 11),
 ('public', 11),
 ('full', 10),
 ('recent', 9),
 ('little', 9),
 ('ni', 9),
 ('whole', 9),
 ('it’s', 9),
 ('bad', 9),
 ('possible', 9),
 ('military', 9),
 ('able', 9),
 ('live', 9),
 ('yg', 9),
 ('online', 8),
 ('due', 8),
 ('gt', 8),
 ('next', 8),
 ('nct', 7),
 ('great', 7),
 ('song', 7),
 ('lol', 7),
 ('sad', 7),
 ('actual', 7),
 ('male', 7),
 ('single', 7),
 ('mean', 7),
 ('long', 7),
 ("can't", 7),
 ('curious', 7)]