## Planning / Scratch Work

Do /r/kpop commenters talk differently about male vs. female groups?

Initial exploration of this question:
- Identify submissions on 2 all-male groups, 2 all-female groups
- Collect their comments
- Contrast comments in general to "typical" Reddit language (using /r/funny as a standard)
- Contrast comments on male groups vs. female groups

In [81]:
# TODO: Maybe separate analysis for emoji

Using Pushshift to get reddit comments

See [Pushshift's GitHub API README](https://github.com/pushshift/api)

> Search for the most recent comments mentioning the word "science" within the subreddit /r/askscience
>
> `https://api.pushshift.io/reddit/search/comment/?q=science&subreddit=askscience`

Retrieve all comment ids for a submission object

`https://api.pushshift.io/reddit/submission/comment_ids/{base36_submission_id}`

[New to Pushshift FAQ](https://www.reddit.com/r/pushshift/comments/bcxguf/new_to_pushshift_read_this_faq/)

[Pushshift Reddit API v4.0 Documentation](https://reddit-api.readthedocs.io/en/latest/#)
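The two endpoints quoted above can be wrapped in small helpers. This is a minimal sketch, not the contents of `data_collection_utils`: the function names are hypothetical, the `{"data": [...]}` response shape is an assumption, and access to Pushshift may have changed since these notes were written. It uses only the standard library; `requests.get` would work equally well.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

PUSHSHIFT_BASE = 'https://api.pushshift.io/reddit'


def comment_search_url(query, subreddit):
    """Build the comment-search URL for a query term within one subreddit."""
    params = urlencode({'q': query, 'subreddit': subreddit})
    return '{}/search/comment/?{}'.format(PUSHSHIFT_BASE, params)


def comment_ids_url(submission_id):
    """Build the comment-id lookup URL for a base-36 submission id."""
    return '{}/submission/comment_ids/{}'.format(PUSHSHIFT_BASE, submission_id)


def fetch_comment_ids(submission_id, timeout=30):
    """Return the comment ids for one submission.

    Assumes the endpoint answers with JSON shaped like {"data": [...]}.
    """
    with urlopen(comment_ids_url(submission_id), timeout=timeout) as response:
        payload = json.load(response)
    return payload.get('data', [])


print(comment_search_url('science', 'askscience'))
print(comment_ids_url('hxhvbk'))
```

Keeping URL construction separate from the network call makes the helpers easy to check without hitting the API.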

Non-comprehensive list of related work:
- "A Community of Curious Souls: An Analysis of Commenting Behavior on TED Talks Videos" (Tsou, Thelwall, Mongeon, and Sugimoto, 2014)
- "YouTube science channel video presenters and comments: female friendly or vestiges of sexism?" (Thelwall and Mas-Bleda, 2018)
- "Shirtless and dangerous: Quantifying linguistic signals of gender bias in an online fiction writing community." (Fast, Vachovsky, and Bernstein, 2016)
- "Using language models to quantify gender bias in sports journalism" (Fu, Danescu-Niculescu-Mizil, Lee, 2016)

In [101]:
import glob
import string
import re
import requests
import logging
import pickle
import json

from collections import Counter

from tqdm import tqdm

import pandas as pd

import nltk
from nltk.corpus import stopwords

ENGLISH_STOPWORDS = stopwords.words('english')

import data_collection_utils

In [102]:
filename[:23]

'data/comments/dncyvd-50'

In [103]:
comments = {}
for filename in glob.glob('data/comments/*-500-comments.pkl'):
    start = filename.rindex('/') + 1
    post_id = filename[start:start+6]
    comments[post_id] = data_collection_utils.load_data(filename)

In [106]:
data_collection_utils.save_data(comments, 'data/rkpop-1000-posts-comments.pkl')

In [107]:
posts = data_collection_utils.load_data('data/rkpop-1000-posts.pkl')
post_ids = [p['id'] for p in posts]
post_titles = [p['title'] for p in posts]

post_ids_titles_dict = dict(zip(post_ids, post_titles))

## Loading from saved CSV

In [110]:
def get_comments_from_obj(post_id):
    if post_id in comments:
        return comments[post_id]
    else:
        return None

In [111]:
df = pd.DataFrame.from_dict(post_ids_titles_dict, orient='index')
comments_as_series = df.reset_index()['index'].apply(lambda post_id: get_comments_from_obj(post_id))
df = df.reset_index()
df['comments'] = comments_as_series
df.columns = ['id', 'title', 'comments']
df

Unnamed: 0,id,title,comments
0,hxhvbk,Mamamoo Wheein - Candy (orig. Baekhyun) (Speci...,"[I'm so happy she didn't change the pronouns, ..."
1,hxekwy,Red Velvet - IRENE &amp; SEULGI - Monster (Two...,"[RV never ever created a bad song, let alone a..."
2,hxc2tq,BLACKPINK Lisa Appointed as Ambassador for BVL...,"[count me in, what about Hera?, anyone partner..."
3,hx9kq3,The Rolling Stone included 9 K-Pop Boygroup so...,"[Lol @ the strawman, Man I'm reading through t..."
4,hx3a8s,PURPLE K!SS - Debut Trailer : WH0 CARES? - 유키 ...,"[Yes Indeed, it\`s a classic produce thing whe..."
...,...,...,...
995,cul56i,BTS’ Map Of The Soul: Persona is Now Riaa Cert...,"[Hmm, what does that mean for achievement thre..."
996,cuge9n,What headline would you love to wake up to?,"[NCT 2019 ot21 yearbook and comeback, This com..."
997,cuew53,Congrats r/kpop for 400k subs!,"[No worries! 😊, That’s the top post over the l..."
998,cud85y,Comeback Stage: Red Velvet - Umpah Umpah (음파음파...,[I would have preferred this to be releases ea...


In [112]:
df.to_csv('data/rkpop-data-id-title-comments-2020-08-01.csv',index=False)

In [115]:
data_df = pd.read_csv('data/rkpop-data-id-title-comments-2020-08-01.csv')

Identify male vs female groups

In [149]:
# TODO: Map entities within comments

In [166]:
pd.options.display.max_colwidth = 100

In [210]:
m_f_mapping = {'male': {'male idols', 'male soloists', 'boy group', 'boygroup', 'boy', 'EXO',
                        'NCT', 'BTS', 'Stray Kids', 'G-Dragon', 'Big Bang', 
                        'AB6IX', 'Golden Child', 'SEVENTEEN', 'Top Secret', 'TST', 
                        'ONEUS', 'TVXQ', 'PENTAGON', 'THE BOYZ', 'VERIVERY', 'Ravi', 
                        'WayV', 'VIXX', 'Super Junior', 'SHINee', 'Monsta X',
                        'Block B', 'Zico', 'Treasure', 'J.Y Park', 'ATEEZ', 'iKON',
                        'TXT', 'TOMORROW X TOGETHER', 'Jay Park', 'SuperM', 'GOT7', 'Dawn', 'X1',
                        'BIGBANG', 'D-Crunch', 'Kingdom', 'Epik High', 'Day6', 'Winner', 'Shinhwa',
                        'GDragon', 'Daesung', 'Taemin', 'Kang Daniel', 'J-Hope', 'Sleepy', 'OnlyOneOf',
                        'Jackson Wang', 'Jungkook', 'B.I', 'IN2IT', '2PM', 'Super M', 'J.Y. Park',
                        'CNBLUE', 'Seungri', 'Aoora'},

               'female': {'female idols', 'female soloists', 'girlgroup', 'girl group', 
               'girl', 'GFriend', "Girl's Day", 'Red Velvet', 'AOA', 'BLACKPINK', 
               'Momoland', 'miss A', 'MAMAMOO', 'ITZY', 'Sunmi', 'Weeekly', 'NiziU', 
               'NATTY', 'Twice', 'LOONA', 'After School', 'IU', 'IZ*ONE', 'WJSN', 
               'Cosmic Girls', 'DIA', 'CHUNGHA', 'SNSD', 'Cherry Bullet', 'Somi', 
               '(G)I-DLE', 'Apink', 'Yukika', 'Oh My Girl', 'Lee Hi',
               'PURPLE K!SS', 'Singer Minty', 'Rocket Punch', 'SISTAR', 'APRIL',
               'Dreamcatcher', 'Secret', 'GWSN', 'pristin', 'Minah', 'Taeyeon',
                'Girls Generation', '2NE1', 'Gong Minzy', 'Gugudan', 'Amber', 'f(x)',
                'Crayon Pop', 'Hyuna', 'HINAPIA', 'BVNDIT', 'I.O.I.', 'Queendom', 'Alexa',
                'LOOΠΔ', 'Sulli', 'Park Jimin', 'Jamie', 'PinkFantasy', 'Mina',
                'Weki Meki', 'Tiffany Young', 'Jessica Jung', 'Ladies\' Code', 'CLC',
                'J-Min', 'Kyla Massie', 'Everglow', 'fromis_9', 'BOL4', 'Baek A Yeon', 
                'Park Bom', 'Idol School', 'IOI', 'EXID', 'BoA'},
                
                'mixed': {'AKMU', 'KARD'}
}
m_f_mapping['male'] = {g.lower() for g in m_f_mapping['male']}
m_f_mapping['female'] = {g.lower() for g in m_f_mapping['female']}

data_df['male'] = data_df['title'].apply(lambda t: any(group in t.lower() for group in m_f_mapping['male']))
data_df['female'] = data_df['title'].apply(lambda t: any(group in t.lower() for group in m_f_mapping['female']))
# # Checking if any overlapping...
# data_df[data_df['male'] & data_df['female']]

subset = data_df[  ~(data_df['male'] | data_df['female'])  ]
# print(len(subset)) # number of titles left to sort
# sorting
# start = 202
# width = 20
# subset[['id', 'title']][start:start+width]
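The `any(group in t.lower() ...)` check above is a substring test, so short names such as `'iu'`, `'dia'`, or `'boy'` can match inside unrelated words. A possible alternative is to match names only as whole words. The sketch below is an assumption about how that could look, with hypothetical helper names; it is not what produced the counts in this notebook.

```python
import re


def build_name_pattern(names):
    """Compile one regex matching any name as a whole word (case-insensitive)."""
    # Longest names first so a longer name wins over a shorter overlapping one.
    escaped = [re.escape(name) for name in sorted(names, key=len, reverse=True)]
    # (?<!\w) and (?!\w) behave like word boundaries but also work for names
    # that start or end with punctuation, such as '(g)i-dle' or 'f(x)'.
    return re.compile(r'(?<!\w)(?:' + '|'.join(escaped) + r')(?!\w)', re.IGNORECASE)


def title_mentions(title, pattern):
    """Return True if the title contains any of the names as a whole word."""
    return pattern.search(title) is not None


female_pattern = build_name_pattern({'iu', 'dia', 'girl', '(g)i-dle'})
print(title_mentions('IU - eight', female_pattern))               # True
print(title_mentions('A genius media strategy', female_pattern))  # False
print(title_mentions('(G)I-DLE - Oh my god', female_pattern))     # True
```

One trade-off: with whole-word matching, `'girl'` no longer matches `'girls'`, so plural forms would need to be listed explicitly.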



## Tag submissions with male or female

Clean comment text and prepare for analysis

In [211]:
data_df_subset = data_df[(data_df['male'] | data_df['female']) & data_df['comments'].notna()]  # Keep only posts tagged male or female that have comments available
data_df_subset = data_df_subset[~(data_df_subset['male'] & data_df_subset['female'])]  # Drop posts tagged as both male and female
print(len(data_df_subset))

736


In [212]:
data_df_subset.head()

Unnamed: 0,id,title,comments,male,female
0,hxhvbk,Mamamoo Wheein - Candy (orig. Baekhyun) (Special video),"[""I'm so happy she didn't change the pronouns"", 'that last part is a question I think of almost ...",False,True
1,hxekwy,Red Velvet - IRENE &amp; SEULGI - Monster (Two Weeks Later),"['RV never ever created a bad song, let alone a bad album. \nBut what shocked me the most has to...",False,True
2,hxc2tq,BLACKPINK Lisa Appointed as Ambassador for BVLGARI,"['count me in', 'what about Hera?', 'anyone partnering with Payless? /s', 'yes we can only affor...",False,True
3,hx9kq3,"The Rolling Stone included 9 K-Pop Boygroup songs into their ""75 Greatest Boyband Songs of Allti...","['Lol @ the strawman', ""Man I'm reading through this thread as a casual kpop fan and i was like ...",True,False
4,hx3a8s,PURPLE K!SS - Debut Trailer : WH0 CARES? - 유키 (Yuki),"['Yes Indeed, it\\`s a classic produce thing where some really talented people get eliminated pr...",False,True


[How to strip punctuation from a string](https://stackoverflow.com/questions/265960/best-way-to-strip-punctuation-from-a-string)

`s.translate(str.maketrans('', '', string.punctuation))`

[`maketrans` documentation](https://docs.python.org/3.3/library/stdtypes.html?highlight=maketrans#str.maketrans)

[Removing URLs from a string](https://stackoverflow.com/questions/11331982/how-to-remove-any-url-within-a-string-in-python)
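As a quick illustration of the `translate`/`maketrans` approach linked above, the sketch below maps every ASCII punctuation character except the apostrophe to a space. The sample sentence is made up for the example.

```python
import string

# Every ASCII punctuation character except the apostrophe, so contractions survive.
chars_to_replace = string.punctuation.replace("'", '')
# Map each character to a space rather than deleting it, so 'title-track'
# becomes two tokens instead of 'titletrack'.
table = str.maketrans(chars_to_replace, ' ' * len(chars_to_replace))

sample = "I can't wait!!! (title-track is great)"
cleaned = sample.lower().translate(table)
print(cleaned.split())
# ['i', "can't", 'wait', 'title', 'track', 'is', 'great']
```

Replacing with spaces instead of deleting is a design choice: it avoids gluing together words that were separated only by punctuation.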

Helper functions

In [213]:
def giant_cleaned_string(series_of_list_of_comments):
    """Return one cleaned string built from a pandas Series of comment strings.

    Each entry must be a string (after the CSV round trip, the stringified list of comments for a post).
    Joins all rows into one giant lowercase string with URLs and punctuation (except apostrophes) removed.
    """
    comment_string = ' '.join(series_of_list_of_comments.apply(lambda x: ' '.join(x.split())))
    comment_string = re.sub(r'https?://\S+', '', comment_string)

    chars_to_replace = string.punctuation[:6]+string.punctuation[7:]+'“”\n' # Don't remove single quotation mark
    whitespace_to_replace_with = len(chars_to_replace) * ' '

    comment_string = comment_string.lower().translate(str.maketrans(chars_to_replace, whitespace_to_replace_with))
    return comment_string

def acceptable_token(token):
    """ Return True if token is longer than one character and is not present in ENGLISH_STOPWORDS
    """
    return (len(token) > 1 and token not in ENGLISH_STOPWORDS)

def tokenize(giant_comment_string):
    """ Return list of word tokens from given string.
    """
    tokens = giant_comment_string.split(' ')
    return list(filter(acceptable_token, tokens))

def create_counter_object(giant_comment_string):
    """ Return Counter with word counts for given string.
    """
    word_counter = Counter(tokenize(giant_comment_string))
    return word_counter

def top_adjectives(giant_comment_string, num_of_words=10):
    """ Return list with most common adjectives in given string.
    """

    def find_adjectives(list_of_word_pos_tuple):
        return list_of_word_pos_tuple[1] == 'JJ'

    comment_words_POS = nltk.pos_tag(tokenize(giant_comment_string))
    comment_adj_counter = Counter([adj[0] for adj in list(filter(find_adjectives, comment_words_POS))])
    return comment_adj_counter.most_common(num_of_words)

# TODO: Determine association metric to use
# http://www.nltk.org/_modules/nltk/metrics/association.html
def top_ngrams(giant_comment_string, top_n=15, ngram=2):
    """ Return top-n sized list with most frequently appearing n-grams in given string.
    """

    if ngram == 2:
        finder = BigramCollocationFinder.from_words(tokenize(giant_comment_string))
        return finder.nbest(bigram_measures.likelihood_ratio, top_n)
    elif ngram == 3:
        finder = TrigramCollocationFinder.from_words(tokenize(giant_comment_string))
        return finder.nbest(trigram_measures.likelihood_ratio, top_n)
    else:
        raise ValueError("Only bigrams (ngram=2) and trigrams (ngram=3) are supported.")

In [186]:
# TODO: Move analysis helper function into their own py file

In [214]:
overall_giant_comment_string = giant_cleaned_string(data_df_subset['comments'])
overall_word_counter = create_counter_object(overall_giant_comment_string)

In [215]:
male_giant_comment_string = giant_cleaned_string(data_df_subset[data_df_subset['male']]['comments'])
female_giant_comment_string = giant_cleaned_string(data_df_subset[data_df_subset['female']]['comments'])

In [216]:
female_word_counter = create_counter_object(female_giant_comment_string)
male_word_counter = create_counter_object(male_giant_comment_string)

In [217]:
sum(overall_word_counter.values())

1127531

In [218]:
sum(female_word_counter.values())

692335

In [219]:
sum(male_word_counter.values())

435196

In [220]:
# TODO: Log-Odds Ratio of Words
# len(male_giant_comment_string.split(' ')) # 6718 
# len(female_giant_comment_string.split(' ')) # 14882
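For the log-odds TODO above, one simple option is a smoothed log-odds ratio computed directly from two `Counter` objects. This is a sketch under assumptions: the function name and the add-one smoothing are illustrative choices, not a metric the notebook has settled on.

```python
import math
from collections import Counter


def log_odds_ratio(counter_a, counter_b, smoothing=1.0):
    """Return {word: log-odds of the word in corpus A versus corpus B}.

    Positive scores lean toward A, negative toward B. Additive smoothing
    keeps words missing from one corpus from producing log(0).
    """
    vocab = set(counter_a) | set(counter_b)
    total_a = sum(counter_a.values()) + smoothing * len(vocab)
    total_b = sum(counter_b.values()) + smoothing * len(vocab)

    scores = {}
    for word in vocab:
        count_a = counter_a[word] + smoothing
        count_b = counter_b[word] + smoothing
        odds_a = count_a / (total_a - count_a)
        odds_b = count_b / (total_b - count_b)
        scores[word] = math.log(odds_a) - math.log(odds_b)
    return scores


# Toy counters standing in for female_word_counter and male_word_counter.
toy_a = Counter({'song': 3, 'vocal': 1})
toy_b = Counter({'song': 3, 'rap': 1})
toy_scores = log_odds_ratio(toy_a, toy_b)
for word, score in sorted(toy_scores.items(), key=lambda item: item[1]):
    print('{}: {:+.3f}'.format(word, score))
```

Unlike the set-difference approach used below, this also ranks words that appear in both corpora but at different rates, and it accounts for the two corpora having different sizes.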

In [221]:
overall_top_50 = overall_word_counter.most_common(50)
male_top_50 = male_word_counter.most_common(50)
female_top_50 = female_word_counter.most_common(50)

In [222]:
print(overall_top_50)
print()
print(male_top_50)
print()
print(female_top_50)

[('like', 13156), ('really', 7572), ('one', 6644), ('think', 6481), ('people', 6263), ('song', 6134), ('even', 5389), ("i'm", 5344), ('would', 5163), ("'s", 4909), ('get', 4855), ('group', 4854), ('know', 4691), ('love', 4525), ("'i", 4356), ('also', 4302), ('time', 4246), ('see', 4185), ('good', 4170), ('much', 4085), ('still', 4035), ("'t", 3729), ('well', 3439), ('kpop', 3317), ('music', 3196), ('album', 3194), ('going', 3036), ('fans', 3004), ('gt', 2879), ('way', 2865), ('first', 2821), ('songs', 2804), ('make', 2732), ('groups', 2682), ('since', 2592), ('got', 2592), ('lot', 2584), ('say', 2581), ('something', 2555), ('want', 2488), ('feel', 2473), ('go', 2457), ('said', 2438), ('back', 2412), ('could', 2392), ('ni', 2384), ('year', 2351), ('show', 2326), ('actually', 2237), ('right', 2236)]

[('like', 5068), ('really', 2829), ('people', 2597), ('think', 2506), ('one', 2469), ("'s", 2152), ('song', 2142), ('even', 2089), ("i'm", 2087), ('get', 2003), ('know', 1938), ('group', 187

In [240]:
# TODO: Fix words that start with an apostrophe
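Tokens such as `'i`, `'nct`, `ni`, and `ntaemin` probably arise because the CSV round trip stores each list of comments as its string representation, so the list's quote characters and literal `\n` sequences end up being tokenized as text. Assuming that is the cause, parsing the column back into real lists before cleaning should remove most of them. The helper name below is hypothetical.

```python
import ast


def parse_comment_list(cell):
    """Turn a stringified list such as "['a', 'b']" back into a real list.

    Cells that are missing or cannot be parsed yield an empty list.
    """
    if not isinstance(cell, str):
        return []
    try:
        parsed = ast.literal_eval(cell)
    except (ValueError, SyntaxError):
        return []
    return parsed if isinstance(parsed, list) else []


# Simulate what a CSV cell looks like after saving a list of comments.
cell = repr(["I'm so happy", 'first line\nsecond line'])
parsed_comments = parse_comment_list(cell)
print(parsed_comments[1].split())   # ['first', 'line', 'second', 'line']
```

It could be applied with `data_df['comments'].apply(parse_comment_list)`; the cleaning helper would then need to join each list with `' '.join(x)` rather than call `x.split()` on a string.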

In [223]:
unique_male_words = set(male_word_counter.keys()) - set(female_word_counter.keys())
unique_female_words = set(female_word_counter.keys()) - set(male_word_counter.keys())

In [238]:
for word in unique_male_words:
    if male_word_counter[word] > 10:
        print('{}: {}'.format(word, male_word_counter[word]))

queer: 28
meat: 20
shindong: 13
fender: 14
brockhampton: 11
jisung: 14
he’ll: 11
riyadh: 14
he‘s: 14
beomgyu: 21
ntaemin: 14
sherlock: 18
thanksgiving: 14
taeyong's: 19
ptg: 12
youngk: 12
tangled: 13
yeonjun: 30
boys24: 13
weishennies: 12
jeonghan: 19
danity: 29
donghae: 12
hendery: 36
xiaojun: 41
daeun: 38
fanship: 11
seungkwan: 14
bv: 11
hakka: 16
interlude: 12
mihawk: 14
jooheon: 22
kihyun: 23
kun: 35
'nct: 14
molka: 18
127's: 12
monbebes: 12
seungyoun: 30
hongjoong: 17
exols: 25
juvie: 13
hwanwoong: 18
nick: 19
puma: 28
aoora: 12
basquiat: 18
rtk: 48
ten's: 28
joshua: 14
pension: 13
ycmn: 11
wayv's: 12
wonwoo: 12
nipples: 14
cowell: 14
markyong: 13
monarchs: 11
nctzen: 16
atinys: 16
icstr: 17
thermal: 12
yun: 14
seohee: 12
moa: 16
lm: 49
yesung: 16
cctv: 12
violations: 11
thanxx: 23
mist: 17
yangyang: 30
capitol: 72
mingi: 30
jongin: 16
cjh: 13
mark's: 29
corden: 19
seonghwa: 11
lauv: 16
backstreet: 11
sr: 18
caa: 11
drivers: 13
wonho's: 18
3racha: 13
dramarama: 24
seoho: 15
yeosan

In [239]:
for word in unique_female_words:
    if female_word_counter[word] > 10:
        print('{}: {}'.format(word, female_word_counter[word]))

hocus: 18
'clc: 11
twicelights: 14
gsd: 12
crayon: 30
amber's: 17
'somi: 15
blusher: 14
jennie's: 28
chuu: 25
hayi: 13
aviation: 22
nayeon's: 18
'tzuyu: 13
moonbyul: 71
bangle: 15
jamming: 12
ladies': 12
cowboy: 18
sailor: 15
minyoung: 19
detroit: 11
onces: 60
gimmicks: 15
'seulgi: 14
chu: 22
'queens: 18
gaeun: 12
hyolyn: 17
jea: 15
sunhwa: 13
nbom: 17
ponytail: 26
rouge: 14
wheein: 98
yoojung: 37
nayeon: 162
'irene: 14
'chaeyoung: 12
mina’s: 15
condolences: 15
flo: 17
tilapia: 13
cignature: 12
yeoreum: 14
yves: 32
lackluster: 15
taeha: 33
dlwlrma: 17
nugus: 16
dumhdurum: 14
happyface: 17
yeri: 70
alex: 12
akali: 19
russian: 41
diets: 49
yiren: 14
mospick: 11
wengie: 12
ssak3: 14
mimi's: 13
realtime: 16
'sana: 24
ollounder: 22
neon: 35
twitch: 19
sian: 32
dart: 15
tweaks: 34
seunghee: 72
jieun: 34
coni: 15
dahyun's: 13
lipa: 13
yujin: 51
solar's: 12
yuqi's: 11
bazooka: 29
loonatic: 25
momo's: 28
overweight: 16
nako: 28
dkdk: 79
orbit: 14
yein: 28
underweight: 24
lions: 14
aires: 17
min

In [224]:
unique_overall_words = set(overall_word_counter.keys()) - set(female_word_counter.keys()) - set(male_word_counter.keys())

In [225]:
unique_overall_words # No words that showed up only in overall but in neither female nor male

set()

In [226]:
unique_male_word_counter = Counter()
unique_female_word_counter = Counter()

for word in unique_male_words:
    unique_male_word_counter[word] = male_word_counter[word] 

for word in unique_female_words:
    unique_female_word_counter[word] = female_word_counter[word] 

In [227]:
unique_male_words_str = ' '.join(unique_male_words)

In [228]:
unique_male_words_str = ' '.join(unique_male_words)
unique_female_words_str = ' '.join(unique_female_words)
unique_male_adj_tuples = list(filter(lambda x: x[1]=='JJ', nltk.pos_tag(tokenize(unique_male_words_str)))) # filter for words uniquely used toward male groups/people AND are adjectives
unique_female_adj_tuples = list(filter(lambda x: x[1]=='JJ', nltk.pos_tag(tokenize(unique_female_words_str)))) # filter for words uniquely used toward female groups/people AND are adjectives
unique_male_adj = [tup[0] for tup in unique_male_adj_tuples]
unique_female_adj = [tup[0] for tup in unique_female_adj_tuples]

In [229]:
unique_male_adj_counter = Counter()
unique_female_adj_counter = Counter()

for word in unique_male_adj:
    unique_male_adj_counter[word] = male_word_counter[word] 

for word in unique_female_adj:
    unique_female_adj_counter[word] = female_word_counter[word] 

In [230]:
print(unique_male_adj_counter.most_common(100))

[('mingi', 30), ("'baekhyun", 21), ('nick', 19), ('hwanwoong', 18), ('hongjoong', 17), ('atinys', 16), ('ntaemin', 14), ("'nct", 14), ('in2it', 14), ('shindong', 13), ('markyong', 13), ('interlude', 12), ('wonwoo', 12), ('thermal', 12), ('cctv', 12), ('sungjin', 12), ('he’ll', 11), ('fanship', 11), ('bv', 11), ('wooyoung', 10), ('atiny', 9), ('gea', 9), ('jop', 9), ('upskirt', 9), ("shownu's", 8), ('ntaeyong', 8), ("lucas'", 8), ('unbreakable', 8), ('nsuperm', 8), ('ukpop', 7), ('nba', 7), ("yeonjun's", 7), ('unicef', 7), ('siento', 7), ('mbbs', 7), ('applicable', 7), ('nten', 7), ('hj', 7), ("onf's", 7), ('nateez', 7), ('locs', 7), ("woojin's", 7), ('dohyon', 7), ('yohan', 7), ('flush', 6), ('taeyong’s', 6), ('superm’s', 6), ('ungri', 6), ('ot21', 6), ('unimportant', 6), ('nwonho', 6), ('juliet', 6), ('turkish', 6), ('dohyun', 6), ("utopia'", 6), ('suhwan', 6), ("namjoon's", 5), ('nmonsta', 5), ('tmap', 5), ('noneus', 5), ('cwjltma', 5), ('nwoojin', 5), ('homoerotic', 5), ('nrun', 5),

In [231]:
print(unique_female_adj_counter.most_common(100))

[('wheein', 98), ('seunghee', 72), ('moonbyul', 71), ('choa', 65), ('rumpumpum', 58), ('handong', 57), ('natty', 50), ('pinky', 46), ('brian', 44), ('chanmi', 43), ('russian', 41), ('sian', 32), ('papi', 32), ('nbsp', 30), ('bazooka', 29), ("bom's", 29), ('sejeong', 28), ("soyeon's", 28), ('princess', 28), ('loonatic', 25), ('naoa', 25), ('underweight', 24), ('love4eva', 23), ("rv's", 22), ('noir', 21), ('sujeong', 20), ('rosy', 20), ("hwasa's", 19), ('gothic', 19), ("nayeon's", 18), ("amber's", 17), ('nbom', 17), ("queens'", 17), ('loopy', 17), ('nugus', 16), ("chaeyoung's", 16), ("soojin's", 16), ('nodes', 16), ('nomg', 16), ("nshe's", 16), ('sailor', 15), ('lackluster', 15), ('ireh', 15), ('indefinite', 15), ('iggy', 14), ('wonderboy', 13), ('rendezvous', 13), ('unpretty', 13), ('participant', 13), ('alex', 12), ('bna', 12), ('christian', 12), ('nina', 12), ('bingle', 12), ('austin', 11), ('weeekly', 11), ('upward', 10), ('garnered', 10), ("nancy's", 10), ('wendy’s', 10), ('nhonorab

What adjectives are used? Verbs? 

[Categorizing and Tagging Words](https://www.nltk.org/book/ch05.html)

[collocations](https://www.nltk.org/howto/collocations.html)

Most common ngrams

In [273]:
# TODO: Remove title tracks from ngram consideration
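One way to approach this TODO is to filter the n-gram tuples against a list of known song titles. The sketch below assumes such a list exists; the function name and the title list are hypothetical, and a real list would have to be compiled by hand or derived from the post titles.

```python
def drop_title_ngrams(ngrams, titles):
    """Remove n-grams that appear as consecutive words inside any known title.

    `titles` is an iterable of lowercase title strings, e.g. 'pirate king'.
    """
    blocked = set()
    for title in titles:
        words = title.split()
        for size in (2, 3):
            for start in range(len(words) - size + 1):
                blocked.add(tuple(words[start:start + size]))
    return [ngram for ngram in ngrams if tuple(ngram) not in blocked]


# Hypothetical title list used only for illustration.
title_tracks = ['pirate king', 'hala hala', 'dalla dalla']
bigrams = [('pirate', 'king'), ('mental', 'health'), ('hala', 'hala')]
print(drop_title_ngrams(bigrams, title_tracks))   # [('mental', 'health')]
```

Because `tokenize` drops stopwords and one-character tokens, titles should go through the same tokenization before being compared, otherwise a title like "say my name" would not line up with the bigram `('say', 'name')`.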

In [241]:
from nltk.collocations import BigramCollocationFinder, TrigramCollocationFinder

In [242]:
bigram_measures = nltk.collocations.BigramAssocMeasures()
trigram_measures = nltk.collocations.TrigramAssocMeasures()



In [254]:
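# Note: depends on top_male_bigram_tuples and top_female_bigram_tuples, which are created in the top_ngrams cell (In [270]); run that cell first.
set_top_male_bigram_tuples = set(top_male_bigram_tuples)
set_top_female_bigram_tuples = set(top_female_bigram_tuples)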

In [270]:
top_female_bigram_tuples = top_ngrams(female_giant_comment_string, top_n=1000, ngram=2)
top_male_bigram_tuples = top_ngrams(male_giant_comment_string, top_n=1000, ngram=2)

unique_top_male_bigram_tuples = []
for t in top_male_bigram_tuples:
    if t not in top_female_bigram_tuples:
        unique_top_male_bigram_tuples.append(t)
print('unique_top_male_bigram_tuples: {}'.format(unique_top_male_bigram_tuples))

unique_top_female_bigram_tuples = []
for t in top_female_bigram_tuples:
    if t not in top_male_bigram_tuples:
        unique_top_female_bigram_tuples.append(t)
print('unique_top_female_bigram_tuples: {}'.format(unique_top_female_bigram_tuples))


itar'), ('juvenile', 'detention'), ('seperate', 'rather'), ('physical', 'sales'), ('china', 'politically'), ('ngiving', 'level'), ('nap', 'star'), ('born', 'china'), ('inclined', 'take'), ('deep', 'breath'), ('getting', 'downvoted'), ('role', 'models'), ('critcism', 'nothing'), ('zero', 'lines'), ('consequences', 'half'), ('termination', 'fees'), ('half', 'ass'), ('dog', 'meat'), ('posted', 'siding'), ('understanding', 'situation'), ('line', 'born'), ('issue', 'ngiving'), ('black', 'suit'), ('take', 'advantage'), ("can't", 'survive'), ('city', 'lights'), ('urban', 'dictionary'), ('double', 'knot'), ('love', 'shot'), ('please', 'let'), ('rather', 'ironic'), ('thanksgiving', 'parade'), ('proven', 'guilty'), ("'i", 'can’t'), ('sexual', 'harassment'), ('don’t', 'want'), ('career', '2017'), ('hala', 'n2'), ('view', 'counts'), ('jackson', 'shitty'), ('new', 'world'), ('pearl', 'aqua'), ('street', 'team'), ('digital', 'downloads'), ("we've", 'seen'), ('rookie', 'award'), ('confession', 'apolo

In [272]:
print('unique_top_male_bigram_tuples: {}'.format(unique_top_male_bigram_tuples))

unique_top_male_bigram_tuples: [('wtf', 'wtf'), ('pirate', 'king'), ('hala', 'hala'), ('nct', '127'), ('saudi', 'arabia'), ('harry', 'potter'), ('bear', 'consequences'), ('say', 'name'), ('roller', 'coaster'), ('run', 'away'), ('stand', 'rain'), ('wanna', 'one'), ('fairy', 'shampoo'), ('yang', 'hyun'), ('pre', 'orders'), ('side', 'effects'), ('jay', 'park'), ('golden', 'phone'), ('cherry', 'bomb'), ('seo', 'hee'), ('da', 'eun'), ('consequences', 'apply'), ('reiterating', 'jackson'), ('butterfly', 'wings'), ('choose', 'bear'), ('magic', 'island'), ('shangri', 'la'), ('new', 'rules'), ('safety', 'video'), ('dazzling', 'light'), ('sexual', 'assault'), ('jackson', 'choose'), ('maze', 'mirror'), ('sex', 'workers'), ('tl', 'dr'), ('radio', 'play'), ('angel', 'devil'), ('black', 'culture'), ('exo', 'ls'), ('han', 'seo'), ('korean', 'air'), ('chow', 'yun'), ('yun', 'fat'), ('apply', 'different'), ('golden', 'child'), ('reading', 'comprehension'), ('sweet', 'chaos'), ('drunk', 'driving'), ('low

In [271]:
# set_top_male_bigram_tuples = set(top_male_bigram_tuples)
# set_top_female_bigram_tuples = set(top_female_bigram_tuples)
# print(set_top_male_bigram_tuples - set_top_female_bigram_tuples)

In [258]:
print(set_top_female_bigram_tuples - set_top_male_bigram_tuples)

{('24', 'hours'), ('four', 'seasons'), ('really', 'like'), ('ice', 'cream'), ('ah', 'choo'), ('new', 'music'), ('la', 'vie'), ('solo', 'activities'), ("'it", "'s"), ('peek', 'boo'), ('couple', 'years'), ('main', 'vocal'), ('orange', 'caramel'), ("i'd", 'say'), ('comfort', 'zone'), ('singing', 'rain'), ('dome', 'tour'), ('02', '2020'), ('hi', 'high'), ('red', 'sun'), ('hocus', 'pocus'), ('black', 'label'), ('reve', 'festival'), ('roll', 'deep'), ("'i", 'thought'), ('steel', 'wool'), ('united', 'states'), ("we're", 'getting'), ('weki', 'meki'), ('city', 'pop'), ('evil', 'editing'), ('dalla', 'dalla'), ('lil', 'nas'), ('digital', 'single'), ('lucky', 'strike'), ('light', 'stick'), ('dancing', 'queen'), ('melting', 'point'), ('en', 'rose'), ('looked', 'like'), ('uh', 'oh'), ('nowhere', 'near'), ("'hi", 'kyla'), ('real', 'name'), ('nice', 'see'), ('extreme', 'diets'), ('almost', 'every'), ('live', 'vocals'), ('ping', 'pong'), ('oh', 'girl'), ('pum', 'pum'), ('bboom', 'bboom'), ('boom', 'boo

In [243]:
print(top_ngrams(female_giant_comment_string, top_n=50, ngram=3))

[('red', 'velvet', "'s"), ('red', 'velvet', 'side'), ('red', 'velvet', 'irene'), ('gt', 'red', 'velvet'), ('red', 'velvet', 'umpah'), ('red', 'velvet', 'dance'), ('red', 'velvet', 'psycho'), ('red', 'velvet', 'red'), ('congrats', 'red', 'velvet'), ('velvet', 'red', 'velvet'), ('twice', 'red', 'velvet'), ('red', 'velvet', 'n2nd'), ('14th', 'red', 'velvet'), ('pig', 'red', 'velvet'), ('dismiss', 'red', 'velvet'), ('perfect', 'red', 'velvet'), ('called', 'red', 'velvet'), ('red', 'velvet', 'apink'), ('red', 'velvet', 'blackpink'), ('red', 'velvet', 'appeared'), ('red', 'velvet', 'flop'), ('except', 'red', 'velvet'), ('bile', 'red', 'velvet'), ('grats', 'red', 'velvet'), ('meaner', 'red', 'velvet'), ("mode'", 'red', 'velvet'), ('red', 'velvet', 'constaantly'), ('red', 'velvet', '레드벨벳'), ('spontaneously', 'red', 'velvet'), ('umpag', 'red', 'velvet'), ("unpopular'", 'red', 'velvet'), ('red', 'velvet', 'seulgi'), ('experimental', 'red', 'velvet'), ('red', 'velvet', 'russian'), ('red', 'velvet

In [249]:
print(top_ngrams(male_giant_comment_string, top_n=100, ngram=2))

[('hip', 'hop'), ('title', 'track'), ('feel', 'like'), ("can't", 'wait'), ('wtf', 'wtf'), ('stray', 'kids'), ('pirate', 'king'), ('super', 'junior'), ('gt', 'gt'), ('makes', 'sense'), ('looking', 'forward'), ('cultural', 'appropriation'), ('amp', 'x200b'), ('mental', 'health'), ('years', 'ago'), ('hala', 'hala'), ("i'm", 'sure'), ('burning', 'sun'), ('sounds', 'like'), ('hong', 'kong'), ('looks', 'like'), ('social', 'media'), ('title', 'tracks'), ('seems', 'like'), ('feels', 'like'), ('pretty', 'much'), ('last', 'year'), ('nct', '127'), ('even', 'though'), ('black', 'people'), ('big', 'bang'), ('saudi', 'arabia'), ('come', 'back'), ('harry', 'potter'), ("i've", 'seen'), ("i'm", 'glad'), ('bear', 'consequences'), ('wait', 'see'), ('big', 'deal'), ('holy', 'shit'), ('general', 'public'), ('lt', "3'"), ('every', 'single'), ('say', 'name'), ('roller', 'coaster'), ('long', 'time'), ('first', 'place'), ('run', 'away'), ('music', 'shows'), ('kang', 'daniel')]


In [247]:
print(top_ngrams(male_giant_comment_string, top_n=50, ngram=3))

[('wtf', 'wtf', 'wtf'), ('hip', 'hop', 'rap'), ('hip', 'hop', 'trap'), ('hip', 'hop', 'funk'), ('hip', 'hop', 'culture'), ('lyrical', 'hip', 'hop'), ('hip', 'hop', 'rnb'), ('classic', 'hip', 'hop'), ('hip', 'hop', 'style'), ('assimilated', 'hip', 'hop'), ('denounce', 'hip', 'hop'), ('elevating', 'hip', 'hop'), ('hip', 'hop', "dads'"), ('hip', 'hop', 'impoverished'), ('mirrored', 'hip', 'hop'), ('ndescribing', 'hip', 'hop'), ('rigidity', 'hip', 'hop'), ('rigidnessin', 'hip', 'hop'), ('hip', 'hop', 'banger'), ("'s", 'hip', 'hop'), ('hip', 'hop', 'edm'), ('hip', 'hop', 'concepts'), ('flourished', 'hip', 'hop'), ('hills', 'hip', 'hop'), ('hip', 'hop', 'flourished'), ('hip', 'hop', 'reggae'), ('hip', 'hop', 'shorthand'), ('lmfaooooo', 'hip', 'hop'), ('hip', 'hop', 'definitely'), ('hip', 'hop', 'pop'), ('hip', 'hop', 'nhere’s'), ('hip', 'hop', 'righteous'), ('hip', 'hop', 'skool'), ('hip', 'hop', 'beat'), ('hoping', 'hip', 'hop'), ('hip', 'hop', 'beyoncé'), ('hip', 'hop', 'superstar'), ('gar

Most common adjectives

In [None]:
top_adjectives(female_giant_comment_string, num_of_words=100)

In [None]:
top_adjectives(male_giant_comment_string, num_of_words=50)

In [274]:
# TODO: Just analyze "regular English" words?
# TODO: Generate markov chain of how someone would talk about one group versus another
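For the Markov chain TODO, a first-order chain over word tokens is probably the simplest starting point. The sketch below is illustrative only: the function names are hypothetical and the toy sentence stands in for the male or female comment corpus.

```python
import random
from collections import defaultdict


def build_chain(tokens):
    """Map each token to the list of tokens observed immediately after it."""
    chain = defaultdict(list)
    for current, following in zip(tokens, tokens[1:]):
        chain[current].append(following)
    return dict(chain)


def generate(chain, start, length=10, seed=None):
    """Random-walk the chain from `start`, stopping early at a dead end."""
    rng = random.Random(seed)
    words = [start]
    while len(words) < length and words[-1] in chain:
        words.append(rng.choice(chain[words[-1]]))
    return ' '.join(words)


toy_tokens = 'the song is good and the album is great'.split()
toy_chain = build_chain(toy_tokens)
print(toy_chain['the'])   # ['song', 'album']
print(generate(toy_chain, 'the', length=6, seed=0))
```

For readable output the chain would likely need to be built from tokens that still include stopwords, since the `tokenize` helper above removes them.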