# Social Network Analysis with Spark

## Dataset: Reddit comments

The dataset consists of about 250 million comments posted on the social media website Reddit from 2007 to 2011. Each comment is represented as a JSON object with the fields shown in the schema below:

In [21]:
rc = spark.read.format('json').load('hdfs://orion11:15000/rc/*')

In [25]:
rc

DataFrame[archived: boolean, author: string, author_flair_css_class: string, author_flair_text: string, body: string, controversiality: bigint, created_utc: string, distinguished: string, downs: bigint, edited: string, gilded: bigint, id: string, link_id: string, name: string, parent_id: string, removal_reason: string, retrieved_on: bigint, score: bigint, score_hidden: boolean, stickied: boolean, subreddit: string, subreddit_id: string, ups: bigint]

### Samples
We create samples of successively smaller size for more efficient analysis, creating subsequent subsets from previous subsets in a cacheable pipeline.

In [120]:
rc_samp = rc.sample(False, .1)
rc_samp.write.format('json').save('hdfs://orion11:15000/rc_samp')

In [164]:
%%time
rc_samp = spark.read.format('json').load('hdfs://orion11:15000/rc_samp/*')
# rc_samp.cache()
print(rc_samp.count())

26237538
CPU times: user 12 ms, sys: 2.97 ms, total: 15 ms
Wall time: 57.4 s


In [None]:
rc_s = rc_samp.sample(False, .1)
rc_s.write.format('json').save('hdfs://orion11:15000/rc_s')

In [1]:
%%time
rc_s = spark.read.format('json').load('hdfs://orion11:15000/rc_s/*')
rc_s.cache()
print(rc_s.count())

2622716
CPU times: user 5.8 ms, sys: 3.14 ms, total: 8.94 ms
Wall time: 15.8 s


In [126]:
rc_t = rc_s.sample(False, .1)
rc_t.write.format('json').save('hdfs://orion11:15000/rc_t')

In [1]:
%%time
rc_t = spark.read.format('json').load('hdfs://orion11:15000/rc_t/*')
rc_t.cache()
print(rc_t.count())

262372
CPU times: user 3.26 ms, sys: 445 µs, total: 3.7 ms
Wall time: 9.71 s


In [38]:
rc_u = rc_t.sample(False, .1)
rc_u.write.format('json').save('hdfs://orion11:15000/rc_u')

In [39]:
%%time
rc_u = spark.read.format('json').load('hdfs://orion11:15000/rc_u/*')
rc_u.cache()
print(rc_u.count())

26273
CPU times: user 1.71 ms, sys: 1.17 ms, total: 2.88 ms
Wall time: 1.56 s


In [75]:
rc_v = sc.parallelize(rc_t.take(10000))
rc_v.cache()
print(rc_v.count())

10000
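The sample sizes above (26237538, 2622716, 262372, 26273) are close to, but not exactly, one tenth of their parent sets. That is consistent with `sample(False, .1)` keeping each row independently with probability 0.1 rather than drawing a fixed number of rows. A minimal pure-Python sketch of that behaviour follows; the helper `bernoulli_sample`, the seeds, and the toy `rows` list are assumptions for illustration and are not part of the notebook.

```python
import random

def bernoulli_sample(rows, fraction, seed=None):
    """Keep each row independently with probability `fraction` (no replacement)."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fraction]

rows = list(range(100_000))                          # stand-in for the full dataset
sample_1 = bernoulli_sample(rows, 0.1, seed=42)      # roughly 10% of rows
sample_2 = bernoulli_sample(sample_1, 0.1, seed=43)  # roughly 10% of sample_1
print(len(sample_1), len(sample_2))
```

Because each subset is drawn from the previous one, every smaller sample is contained in the larger ones, and the sizes only approximate successive powers of ten.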


## Spark API
We explore how Spark enables data analysis with different paradigms, with the API for RDDs, SQL, and DataFrames.

### Number of comments
We find the number of records with the Spark API for RDDs (Map-Reduce style), SQL, and DataFrames.

In [135]:
rc_t.rdd \
    .map(lambda comment: ('key', 1)) \
    .reduceByKey(lambda accum, n: accum + n) \
    .collect()

[('key', 262372)]

In [136]:
rc_t.createOrReplaceTempView('rc_t')

In [139]:
spark.sql("\
SELECT COUNT(*) \
FROM rc_t").collect()

[Row(count(1)=262372)]

In [138]:
rc_t.count()

262372
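The RDD version above maps every comment to the same key and lets `reduceByKey` add up the ones. A rough pure-Python sketch of that pattern, assuming a hypothetical `reduce_by_key` helper and a made-up `comments` list (this only mimics the semantics, not Spark's distributed execution):

```python
from collections import defaultdict
from functools import reduce

def reduce_by_key(pairs, func):
    """Group (key, value) pairs by key and fold each group's values with func."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: reduce(func, values) for key, values in groups.items()}

# Made-up records; only the field used for keying is included.
comments = [{'subreddit': 'pics'}, {'subreddit': 'AskReddit'}, {'subreddit': 'pics'}]

# A constant key yields a single total count.
total = reduce_by_key((('key', 1) for _ in comments), lambda accum, n: accum + n)
# Keying on a field instead yields one count per group.
per_sub = reduce_by_key(((c['subreddit'], 1) for c in comments), lambda accum, n: accum + n)
print(total)    # {'key': 3}
print(per_sub)  # {'pics': 2, 'AskReddit': 1}
```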

### Number of subreddits, subreddits' comments
Likewise, we use the Spark RDD/SQL/DataFrame API to find the number of records grouped by field and the number of groups. 

In [137]:
sub_count_mr = rc_t.rdd \
    .map(lambda comment: (comment['subreddit'], 1)) \
    .reduceByKey(lambda accum, n: accum + n) \
    .collect()

In [138]:
print(len(sub_count_mr))
print(sorted(sub_count_mr, key=lambda sub_cnt: sub_cnt[1], reverse=True)[0:20])

4097
[('AskReddit', 38907), ('pics', 18693), ('reddit.com', 16228), ('politics', 10680), ('gaming', 10358), ('funny', 9590), ('IAmA', 7895), ('fffffffuuuuuuuuuuuu', 6876), ('atheism', 6753), ('WTF', 6696), ('trees', 4824), ('worldnews', 4191), ('videos', 3849), ('starcraft', 3246), ('programming', 2715), ('todayilearned', 2622), ('science', 2619), ('Minecraft', 2479), ('technology', 2211), ('gonewild', 1961)]


In [185]:
spark.sql("\
SELECT subreddit, COUNT(*) as count \
FROM rc_t \
GROUP BY subreddit \
ORDER BY count DESC").show()

+-------------------+-----+
|          subreddit|count|
+-------------------+-----+
|          AskReddit|38907|
|               pics|18693|
|         reddit.com|16228|
|           politics|10680|
|             gaming|10358|
|              funny| 9590|
|               IAmA| 7895|
|fffffffuuuuuuuuuuuu| 6876|
|            atheism| 6753|
|                WTF| 6696|
|              trees| 4824|
|          worldnews| 4191|
|             videos| 3849|
|          starcraft| 3246|
|        programming| 2715|
|      todayilearned| 2622|
|            science| 2619|
|          Minecraft| 2479|
|         technology| 2211|
|           gonewild| 1961|
+-------------------+-----+
only showing top 20 rows



In [186]:
spark.sql("\
SELECT COUNT(DISTINCT subreddit) AS count \
FROM rc_t").collect()

[Row(count=4097)]

In [139]:
rc_t.groupBy('subreddit').count().orderBy('count', ascending=False).show()

+-------------------+-----+
|          subreddit|count|
+-------------------+-----+
|          AskReddit|38907|
|               pics|18693|
|         reddit.com|16228|
|           politics|10680|
|             gaming|10358|
|              funny| 9590|
|               IAmA| 7895|
|fffffffuuuuuuuuuuuu| 6876|
|            atheism| 6753|
|                WTF| 6696|
|              trees| 4824|
|          worldnews| 4191|
|             videos| 3849|
|          starcraft| 3246|
|        programming| 2715|
|      todayilearned| 2622|
|            science| 2619|
|          Minecraft| 2479|
|         technology| 2211|
|           gonewild| 1961|
+-------------------+-----+
only showing top 20 rows



## Social Network Analysis with Spark

We analyze the text of Reddit comments (found in the `body` field) to discover communities (i.e. `subreddit`s) of interest on the social network. We can calculate established or custom metrics to find, for example, the following:

- the most scream-y subreddits (custom metric)
- the most positive/negative subreddits (sentiment analysis)
- the discussion topics on a subreddit (key terms/tf-idf)

### Custom Metric: Screamer Subreddits
We find the top subreddits for scream-y comments, for when we really want to get something off our chest, by calculating a "screamer score": the fraction of a subreddit's ASCII letters that are uppercase.

In [133]:
%%time

import string

def screamer_sub_mapper(comment):
    n_upper = len(list(filter(lambda c: c in string.ascii_uppercase, comment['body'])))
    n_alpha = len(list(filter(lambda c: c in string.ascii_letters, comment['body'])))
    return (comment['subreddit'], (n_upper, n_alpha))

def screamer_reducer(value_list):
    total_upper = 0
    total_alpha = 0
    for value in value_list:
        (n_upper, n_alpha) = value
        total_upper += n_upper
        total_alpha += n_alpha
    screamer_score = total_upper / total_alpha if total_alpha else 0
    return (screamer_score, len(value_list))
    
screamer_subs = rc_samp.rdd \
    .map(screamer_sub_mapper) \
    .groupByKey() \
    .mapValues(screamer_reducer)

screamer_subs.cache()

CPU times: user 50.4 ms, sys: 9.97 ms, total: 60.3 ms
Wall time: 59.1 s


In [144]:
(screamer_subs
    .filter(lambda sub_rval: sub_rval[1][1] > 1000)   # filter by comment count
    .sortBy(lambda sub_rval: sub_rval[1][0], False)   # sort by score, descending
    .take(3))

[('spacedicks', (0.5781305686052163, 1385)),
 ('circlejerk', (0.18934631577970992, 45861)),
 ('googleplusinvites', (0.11052298941380945, 1089))]
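The score above is a ratio of totals: all uppercase letters in a subreddit divided by all letters in that subreddit, so long comments weigh more than short ones. A small pure-Python sketch of the same arithmetic, using made-up comment bodies; the helper names `letter_counts` and `combine` are assumptions for illustration.

```python
import string
from functools import reduce

def letter_counts(body):
    """Return (n_upper, n_alpha, 1) for one comment body (ASCII letters only)."""
    n_upper = sum(1 for c in body if c in string.ascii_uppercase)
    n_alpha = sum(1 for c in body if c in string.ascii_letters)
    return (n_upper, n_alpha, 1)

def combine(a, b):
    """Element-wise sum, so partial results can be merged in any order."""
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])

bodies = ['HELLO there', 'ok', '!!!']   # made-up comments for one subreddit
n_upper, n_alpha, n_comments = reduce(combine, map(letter_counts, bodies))
score = n_upper / n_alpha if n_alpha else 0   # guard against subreddits with no letters
print(score, n_comments)
```

Since the triples combine pairwise, the same aggregation could probably be expressed with `reduceByKey` instead of `groupByKey`, which would avoid collecting every value for a key before reducing; the notebook itself uses `groupByKey`.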

### Sentiment Analysis: Positive/Negative Subreddits

We find the most positive and negative subreddits by using [Sentiment Analysis](https://en.wikipedia.org/wiki/Sentiment_analysis) to calculate a sentiment score for each comment, and an average sentiment score for each subreddit.

In [48]:
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer
sid = SentimentIntensityAnalyzer()

In [70]:
def sentiment_mapper(comment):
    # Note: the VADER sentiment analyzer is intended for sentence-level input
    # We take a naive initial approach here and simply treat a whole comment as a single sentence
    score = sid.polarity_scores(comment['body'])['compound']
    return (comment['subreddit'], score)

sentiment = rc_t.rdd \
    .map(sentiment_mapper)

In [67]:
%%time
avg_sentiment = (sentiment
    .mapValues(lambda v: (v, 1))                        # (score, 1)
    .reduceByKey(lambda a, b: (a[0]+b[0], a[1]+b[1]))   # (sum(scores), sum(ones))
    .mapValues(lambda v: (v[0]/v[1], v[1]))             # (avg_score, count)
    .filter(lambda k_v: k_v[1][1] > 1000)               # filter by count
    .sortBy(lambda k_v: k_v[1][0], False))              # sort by avg_score, descending

CPU times: user 95.8 ms, sys: 7.38 ms, total: 103 ms
Wall time: 10.1 s


The most positive subreddits are:

In [72]:
avg_sentiment.takeOrdered(10, key=lambda k_v: -k_v[1][0]) # take by avg_score descending

[('gonewild', (0.2941499235084141, 1961)),
 ('Fitness', (0.19091240875912405, 1370)),
 ('Music', (0.1868072727272727, 1485)),
 ('mylittlepony', (0.17997485322896284, 1022)),
 ('TwoXChromosomes', (0.1782817759421786, 1937)),
 ('Android', (0.1760810091743119, 1090)),
 ('trees', (0.1685991500829187, 4824)),
 ('soccer', (0.15145398655139292, 1041)),
 ('programming', (0.15078088397790054, 2715)),
 ('leagueoflegends', (0.14420487948265726, 1701))]

The most negative subreddits are:

In [73]:
avg_sentiment.takeOrdered(10, key=lambda k_v: k_v[1][0]) # take by avg_score ascending

[('worldnews', (-0.044262133142448104, 4191)),
 ('WTF', (-0.0032822580645161217, 6696)),
 ('politics', (-0.0011439325842696566, 10680)),
 ('Libertarian', (0.02533822091886608, 1023)),
 ('funny', (0.04700525547445254, 9590)),
 ('reddit.com', (0.04863151959576039, 16228)),
 ('fffffffuuuuuuuuuuuu', (0.048675159976730664, 6876)),
 ('todayilearned', (0.049298474446987046, 2622)),
 ('guns', (0.05428072390572392, 1188)),
 ('videos', (0.05748225513120292, 3849))]
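The averaging step in this section carries a (sum, count) pair per subreddit and divides only at the end, because two averages cannot be merged correctly without their counts. A pure-Python sketch of that pattern, with `heapq.nlargest` and `heapq.nsmallest` standing in for `takeOrdered`; the score values are made up.

```python
import heapq

# Made-up (subreddit, compound score) pairs standing in for the `sentiment` RDD.
scores = [('trees', 0.5), ('politics', -0.2), ('trees', 0.1),
          ('politics', 0.0), ('pics', 0.2)]

# Accumulate (sum, count) per key; these pairs can be merged in any order.
sums = {}
for sub, score in scores:
    total, count = sums.get(sub, (0.0, 0))
    sums[sub] = (total + score, count + 1)

# Divide only at the end to get (avg_score, count).
averages = {sub: (total / count, count) for sub, (total, count) in sums.items()}

most_positive = heapq.nlargest(1, averages.items(), key=lambda k_v: k_v[1][0])
most_negative = heapq.nsmallest(1, averages.items(), key=lambda k_v: k_v[1][0])
print(most_positive[0][0], most_negative[0][0])
```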

### Key Terms (TFIDF): Subreddit Discussion Topics
We find the key terms in the collection of comments to discover a subreddit's discussion topics at the time. To do so, we calculate the term frequency-inverse document frequency ([TFIDF](https://en.wikipedia.org/wiki/Tf–idf)) for the subreddit. We can cache each step/Spark job in the pipeline to calculate the successive results more efficiently.

#### Term Frequency (TF)
We calculate the term frequency/count for a subreddit by summing the term frequencies of all comments in the subreddit, treating them collectively as a single 'document' representing that subreddit. 

In [None]:
from collections import Counter
import string
import nltk

def term_freq_mapper(comment):
    body = comment['body']
    # Alternative tokenizer: tokens = nltk.tokenize.word_tokenize(body.lower())
    tokens = [word.strip(string.punctuation) for word in body.lower().split()]
    counter = Counter(tokens)
    return (comment['subreddit'], counter)

term_freq = rc_t.rdd \
    .map(term_freq_mapper) \
    .reduceByKey(lambda a,b: a+b)
term_freq.cache()
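The tokenizer in the cell above splits on whitespace and strips punctuation from both ends of each token, so a token made only of punctuation (such as `--`) becomes the empty string `''` and is counted like any other term. The sketch below reproduces that tokenization in plain Python and adds a hypothetical `drop_empty` option, which the notebook does not use, to show one way such tokens could be discarded.

```python
from collections import Counter
import string

def tokenize(body, drop_empty=True):
    """Lowercase, split on whitespace, strip leading/trailing punctuation."""
    tokens = [word.strip(string.punctuation) for word in body.lower().split()]
    return [t for t in tokens if t] if drop_empty else tokens

# Counters add element-wise, which is what reduceByKey(lambda a, b: a + b) relies on.
a = Counter(tokenize("Taxes, taxes -- and the vote!"))
b = Counter(tokenize("The vote."))
combined = a + b
print(combined)
```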

We show the frequency of top terms in a subreddit below. 

In [150]:
%%time
sub_term_freq_res_0 = term_freq.take(1)[0]
sub_0 = sub_term_freq_res_0[0]
term_freq_res_0 = sub_term_freq_res_0[1]
print(sub_0)
print(sorted(list(term_freq_res_0.items()), key=lambda t_f:t_f[1], reverse=True)[0:50])

politics
[('the', 21326), ('to', 13160), ('a', 10442), ('of', 9538), ('and', 9389), ('that', 7717), ('is', 7363), ('i', 6311), ('you', 6176), ('in', 6128), ('it', 5226), ('for', 4320), ('are', 3678), ('not', 3626), ('be', 3237), ('have', 3221), ('they', 3160), ('this', 3019), ('on', 2864), ('as', 2696), ('with', 2596), ('but', 2565), ('if', 2421), ('people', 2388), ('was', 2055), ('or', 1942), ('what', 1927), ('we', 1883), ('would', 1872), ('he', 1836), ('just', 1770), ('your', 1741), ('do', 1736), ('all', 1732), ('so', 1685), ('no', 1630), ('about', 1623), ("don't", 1580), ('their', 1579), ('like', 1566), ("it's", 1563), ('', 1554), ('at', 1530), ('more', 1521), ('by', 1474), ('an', 1473), ('can', 1472), ('who', 1449), ('from', 1444), ('will', 1365)]
CPU times: user 22.3 ms, sys: 1.59 ms, total: 23.8 ms
Wall time: 90.6 ms


#### Inverse Document Frequency (IDF)
We calculate inverse document frequency, a 'measure of how much information the word provides, i.e., if it's common or rare across all documents' (Wikipedia). To do so, we must first calculate the:
- document frequency 
- number of documents

##### Document Frequency
We calculate the terms' document frequency (i.e. how common the word is) as the number of 'documents' (subreddits) the word appears in. 

In [None]:
doc_freq = term_freq \
    .flatMap(lambda sub_counter: list(sub_counter[1])) \
    .map(lambda word: (word, 1)) \
    .reduceByKey(lambda a, b: a + b) \
    .sortBy(lambda word_docfreq: word_docfreq[1], False)
doc_freq.cache()

We show some of the most common words with their document frequencies below.

In [4]:
print(doc_freq.collect()[0:50])

[('the', 2818), ('i', 2677), ('a', 2612), ('to', 2601), ('and', 2404), ('of', 2290), ('is', 2232), ('it', 2216), ('you', 2186), ('that', 2148), ('in', 2148), ('for', 2087), ('this', 1938), ('but', 1852), ('on', 1840), ('have', 1805), ('be', 1792), ('with', 1700), ('not', 1666), ('are', 1647), ('if', 1646), ('my', 1588), ('just', 1588), ('so', 1560), ('like', 1519), ('as', 1484), ('was', 1477), ('', 1468), ('at', 1468), ('or', 1464), ('can', 1420), ("it's", 1409), ("i'm", 1375), ('they', 1373), ('do', 1359), ('one', 1358), ('what', 1355), ('all', 1349), ('me', 1344), ('out', 1340), ('from', 1325), ('get', 1323), ('your', 1314), ('about', 1311), ('there', 1309), ('deleted', 1303), ('up', 1296), ('would', 1283), ('an', 1279), ("don't", 1264)]
CPU times: user 151 ms, sys: 20.2 ms, total: 171 ms
Wall time: 1.5 s
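Iterating over a `Counter` yields each distinct term once, which is why `list(sub_counter[1])` contributes exactly one `(word, 1)` pair per subreddit regardless of how often the word occurs there. A plain-Python sketch of the same computation on made-up per-subreddit counts:

```python
from collections import Counter

# Made-up per-subreddit term counts standing in for the `term_freq` RDD.
term_freq = {
    'politics': Counter({'the': 5, 'tax': 2}),
    'programming': Counter({'the': 3, 'code': 4}),
    'trees': Counter({'the': 1}),
}

doc_freq = Counter()
for counter in term_freq.values():
    # Each document adds at most 1 per term, however often the term occurs in it.
    doc_freq.update(set(counter))

print(doc_freq.most_common())
```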


##### Number of Documents
We get the number of 'documents' in the 'corpus', i.e. the number of subreddits (in the dataset sample). This will be used to calculate the 'inverse document frequency' in the next step.

In [3]:
%%time
num_docs = term_freq.count()
print(num_docs)

4097
CPU times: user 14.4 ms, sys: 5.29 ms, total: 19.7 ms
Wall time: 200 ms


##### Inverse Document Frequency
Now, we calculate the inverse document frequency (the logarithmically scaled inverse fraction of the documents that contain the word).

In [None]:
%%time
import math

inv_doc_freq = doc_freq \
    .map(lambda t_df: (t_df[0], math.log(num_docs / t_df[1]))) \
    .sortBy(lambda t_idf: t_idf[1], True)
inv_doc_freq.cache()
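In symbols, with $N$ the number of documents and $\mathrm{df}(t)$ the document frequency of term $t$, the cell above computes the natural logarithm (since it calls `math.log`) of their ratio. The worked value uses the figures reported earlier in this notebook: 4097 subreddits, of which 2818 contain the word "the".

```latex
\mathrm{idf}(t) = \ln\frac{N}{\mathrm{df}(t)},
\qquad
\mathrm{idf}(\text{the}) = \ln\frac{4097}{2818} \approx 0.3742
```

A term that appears in every document would get an IDF of $\ln 1 = 0$, and rarer terms get progressively larger values.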

We show some of the least informative words with their IDF metric below.

In [5]:
inv_doc_freq_res = inv_doc_freq.collect()
print(inv_doc_freq_res[0:50])

[('the', 0.3742275850882455), ('i', 0.4255582340265186), ('a', 0.45013878715054945), ('to', 0.4543590120982195), ('and', 0.5331209818917729), ('of', 0.5817031809985858), ('is', 0.6073569540456695), ('it', 0.6145512296796967), ('you', 0.6281816088103872), ('that', 0.6457178219181158), ('in', 0.6457178219181158), ('for', 0.6745273704696704), ('this', 0.7485984850961596), ('but', 0.7939888623407463), ('on', 0.8004894269438397), ('have', 0.8196944067798898), ('be', 0.8269226840119954), ('with', 0.8796267475025636), ('not', 0.899829454820083), ('are', 0.9112995473692307), ('if', 0.911906896309856), ('my', 0.94777963573979), ('just', 0.94777963573979), ('so', 0.9655691773032884), ('like', 0.9922027749510982), ('as', 1.0155138538195454), ('was', 1.0202419950154913), ('', 1.0263540683724102), ('at', 1.0263540683724102), ('or', 1.0290825830256143), ('can', 1.0595981269515646), ("it's", 1.067374765648191), ("i'm", 1.0918012674461994), ('they', 1.0932568717789), ('do', 1.1035058633957273), ('one'

#### Term Frequency - Inverse Document Frequency (TFIDF)
Finally, we find the TFIDF metric for a specified document (subreddit) by multiplying the frequency of each term by the term's inverse document frequency.
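As a rough sketch of that multiplication, the snippet below weights a raw term count by its IDF, using the count of "the" in `politics` (21326) and its document frequency (2818 of 4097 subreddits) reported earlier. The helper `tfidf` and its `default_idf` fallback for terms missing from the IDF map are assumptions for illustration.

```python
import math
from collections import Counter

def tfidf(term_counts, idf, default_idf=0.0):
    """Weight each raw term count by the term's inverse document frequency."""
    return {term: count * idf.get(term, default_idf)
            for term, count in term_counts.items()}

num_docs = 4097
idf = {'the': math.log(num_docs / 2818)}       # IDF of 'the'
scores = tfidf(Counter({'the': 21326}), idf)   # raw count of 'the' in politics
print(scores['the'])
```

Note that the term frequency here is a raw count rather than a proportion, so scores grow with the size of a subreddit and are best compared within one subreddit rather than across subreddits.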

In [125]:
%%time
sub_1 = 'politics'
term_freq_res_1 = term_freq.sortByKey().lookup(sub_1)[0]
print(sub_1)
print(sorted(list(term_freq_res_1.items()), key=lambda t_f:t_f[1], reverse=True)[0:50])

politics
[('the', 21326), ('to', 13160), ('a', 10442), ('of', 9538), ('and', 9389), ('that', 7717), ('is', 7363), ('i', 6311), ('you', 6176), ('in', 6128), ('it', 5226), ('for', 4320), ('are', 3678), ('not', 3626), ('be', 3237), ('have', 3221), ('they', 3160), ('this', 3019), ('on', 2864), ('as', 2696), ('with', 2596), ('but', 2565), ('if', 2421), ('people', 2388), ('was', 2055), ('or', 1942), ('what', 1927), ('we', 1883), ('would', 1872), ('he', 1836), ('just', 1770), ('your', 1741), ('do', 1736), ('all', 1732), ('so', 1685), ('no', 1630), ('about', 1623), ("don't", 1580), ('their', 1579), ('like', 1566), ("it's", 1563), ('', 1554), ('at', 1530), ('more', 1521), ('by', 1474), ('an', 1473), ('can', 1472), ('who', 1449), ('from', 1444), ('will', 1365)]
CPU times: user 27.5 ms, sys: 5.58 ms, total: 33.1 ms
Wall time: 368 ms


In [126]:
%%time
sub_2 = 'programming'
term_freq_res_2 = term_freq.sortByKey().lookup(sub_2)[0]
print(sub_2)
print(sorted(list(term_freq_res_2.items()), key=lambda t_f:t_f[1], reverse=True)[0:50])

programming
[('the', 4242), ('to', 2970), ('a', 2631), ('of', 2040), ('and', 2008), ('i', 1859), ('is', 1834), ('that', 1759), ('it', 1662), ('you', 1592), ('in', 1357), ('for', 1130), ('', 915), ('with', 786), ('not', 777), ('be', 776), ('but', 767), ('have', 722), ('on', 719), ('this', 710), ('are', 692), ('as', 626), ('if', 615), ('or', 517), ('can', 495), ("it's", 488), ('just', 466), ('they', 445), ('like', 442), ('so', 433), ('an', 431), ('your', 429), ('was', 428), ('do', 414), ('my', 399), ("don't", 398), ('what', 398), ('at', 378), ('all', 360), ('use', 352), ('would', 337), ('more', 337), ('from', 326), ('gt', 326), ('about', 322), ('some', 298), ('there', 291), ("i'm", 290), ('one', 289), ('when', 285)]
CPU times: user 20 ms, sys: 3.16 ms, total: 23.2 ms
Wall time: 332 ms


In [69]:
inv_doc_freq_map_res = inv_doc_freq.collectAsMap()
inv_doc_freq_map_res

{'the': 0.3742275850882455,
 'i': 0.4255582340265186,
 'a': 0.45013878715054945,
 'to': 0.4543590120982195,
 'and': 0.5331209818917729,
 'of': 0.5817031809985858,
 'is': 0.6073569540456695,
 'it': 0.6145512296796967,
 'you': 0.6281816088103872,
 'in': 0.6457178219181158,
 'that': 0.6457178219181158,
 'for': 0.6745273704696704,
 'this': 0.7485984850961596,
 'but': 0.7939888623407463,
 'on': 0.8004894269438397,
 'have': 0.8196944067798898,
 'be': 0.8269226840119954,
 'with': 0.8796267475025636,
 'not': 0.899829454820083,
 'are': 0.9112995473692307,
 'if': 0.911906896309856,
 'my': 0.94777963573979,
 'just': 0.94777963573979,
 'so': 0.9655691773032884,
 'like': 0.9922027749510982,
 'as': 1.0155138538195454,
 'was': 1.0202419950154913,
 '': 1.0263540683724102,
 'at': 1.0263540683724102,
 'or': 1.0290825830256143,
 'can': 1.0595981269515646,
 "it's": 1.067374765648191,
 "i'm": 1.0918012674461994,
 'they': 1.0932568717789,
 'do': 1.1035058633957273,
 'one': 1.1042419694282297,
 'what': 1.106

We show some of the top terms for the chosen subreddits below:

In [128]:
%%time
tfidf_list_1 = list(map(lambda t_f: (t_f[0], t_f[1] * inv_doc_freq_map_res[t_f[0]]), 
                 term_freq_res_1.items()))

print(sub_1)
print(sorted(tfidf_list_1, key = lambda t_fidf: t_fidf[1], reverse = True)[0:50])

politics
[('the', 7980.777479591923), ('to', 5979.3645992125685), ('of', 5548.284940364511), ('and', 5005.472898981856), ('that', 4983.0044317421), ('a', 4700.3492154260375), ('is', 4471.969252638264), ('in', 3956.9588127142133), ('you', 3879.6496160129514), ('they', 3454.691714821324), ('are', 3351.7597352240305), ('not', 3262.781603177621), ('people', 3217.3051161751896), ('it', 3211.644726306095), ('government', 2956.781859719072), ('for', 2913.958240428976), ('he', 2819.762254747586), ('as', 2737.8253498974946), ('i', 2685.698014941359), ('be', 2676.7487281468293), ('have', 2640.235684238025), ('we', 2576.2336478543034), ('their', 2416.1310011796836), ('on', 2292.601718767157), ('with', 2283.511036516655), ('this', 2260.0188265053057), ('who', 2223.758676490278), ('if', 2207.7265959661613), ('obama', 2192.3983117970974), ('would', 2173.492925007271), ('what', 2132.1359797371256), ('was', 2096.5972997568347), ('no', 2068.071361154386), ('but', 2036.5814319040142), ('or', 1998.478376

In [130]:
%%time

tfidf_list_2 = list(map(lambda t_f: (t_f[0], t_f[1] * inv_doc_freq_map_res[t_f[0]]), 
                 term_freq_res_2.items()))

print(sub_2)
print(sorted(tfidf_list_2, key = lambda t_fidf: t_fidf[1], reverse = True)[0:50])

programming
[('the', 1587.4734159443374), ('to', 1349.4462659317119), ('of', 1186.674489237115), ('a', 1184.3151489930956), ('that', 1135.8176487539656), ('is', 1113.8926537197578), ('and', 1070.50693163868), ('it', 1021.3841437276559), ('you', 1000.0651212261364), ('', 939.1139725607553), ('in', 876.2390843428831), ('code', 818.2860278216028), ('i', 791.1127570552981), ('for', 762.2159286307275), ('not', 699.1674863952045), ('gt', 694.967339088719), ('with', 691.386623537015), ('be', 641.6920027933085), ('as', 635.7116724910354), ('are', 630.6192867795077), ('c', 626.6820159504492), ('but', 608.9894574153524), ('have', 591.8193616950805), ('use', 579.8298727587509), ('on', 575.5518979726207), ('if', 560.8227412305614), ('or', 532.0356954242426), ('this', 531.5049244182733), ('can', 524.5010728410244), ('language', 522.8502962703028), ("it's", 520.8788856363171), ('an', 501.76006114222025), ('your', 487.84982467749325), ('they', 486.4993079416105), ("don't", 468.0375337303472), ('do', 

##### Removing Stop-words

While weighting the term frequency with the inverse document frequency should deemphasize the most common/least informative words, we may still find some of them at the top of the adjusted frequency list. We thus filter out the [stop-words](https://en.wikipedia.org/wiki/Stop_words) to get the more informative key terms.

In [78]:
from nltk.corpus import stopwords
stop_words = stopwords.words()

In [131]:
%%time
f_tfidf_list_1 = list(filter(lambda t_fidf: t_fidf[0] not in stop_words, tfidf_list_1))

CPU times: user 1.17 s, sys: 0 ns, total: 1.17 s
Wall time: 1.17 s


In [132]:
%%time
f_tfidf_list_2 = list(filter(lambda t_fidf: t_fidf[0] not in stop_words, tfidf_list_2))

CPU times: user 545 ms, sys: 0 ns, total: 545 ms
Wall time: 546 ms
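Assuming `stopwords.words()` returns a plain list, each `not in stop_words` check scans that list, which may account for much of the filtering time reported above. A sketch of the same filter with the stop words held in a set, where membership tests take constant time on average; the tiny stop list is made up, and the scores are rounded from the outputs above.

```python
# Tiny made-up stop list; the notebook uses the NLTK stop-word corpus instead.
stop_words = {'the', 'to', 'of', 'and', 'a'}

# A few (term, tfidf) pairs with scores rounded from the politics output.
tfidf_list = [('the', 7980.8), ('people', 3217.3), ('to', 5979.4), ('government', 2956.8)]

filtered = [(term, score) for term, score in tfidf_list if term not in stop_words]
print(filtered)   # [('people', 3217.3), ('government', 2956.8)]
```

With the NLTK corpus, the equivalent change would be wrapping the word list in `set(...)` once before filtering.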


#### Key Terms
Finally, we can view the key terms for the specified subreddit.

In [107]:
from pprint import pprint

In [133]:
pprint(sub_1)
pprint(sorted(f_tfidf_list_1, key = lambda t_fidf: t_fidf[1], reverse = True)[0:50])

'politics'
[('people', 3217.3051161751896),
 ('government', 2956.781859719072),
 ('obama', 2192.3983117970974),
 ('would', 2173.492925007271),
 ('money', 1609.3852245466571),
 ('', 1594.9542222507255),
 ('us', 1587.0675099983946),
 ('like', 1553.78954557342),
 ('think', 1552.1000901547854),
 ('tax', 1511.3084122659845),
 ('vote', 1401.3016157115342),
 ('get', 1333.8166738496602),
 ('one', 1308.5267337724522),
 ('right', 1282.50450720408),
 ('even', 1249.060298263983),
 ('taxes', 1245.1607177632886),
 ('paul', 1227.8021700862641),
 ('country', 1216.8792847140562),
 ('republicans', 1195.0712944477862),
 ('make', 1175.699453076192),
 ('state', 1130.4281123947137),
 ('republican', 1099.0942182558558),
 ('ron', 1085.6796530851152),
 ('say', 1083.091086237188),
 ('bush', 1073.0548161002976),
 ("that's", 1063.2257973957508),
 ('way', 1055.570395235103),
 ('pay', 1045.8099690811764),
 ("i'm", 1040.4866078762282),
 ('point', 1001.1494716824),
 ('know', 982.449413910854),
 ('federal', 980.127930

In [135]:
pprint(sub_2)
pprint(sorted(f_tfidf_list_2, key = lambda t_fidf: t_fidf[1], reverse = True)[0:50])

'programming'
[('', 939.1139725607553),
 ('code', 818.2860278216028),
 ('gt', 694.967339088719),
 ('use', 579.8298727587509),
 ('language', 522.8502962703028),
 ('languages', 450.835282960784),
 ('java', 442.17261998463636),
 ('like', 438.5536265283854),
 ('programming', 438.0645953916532),
 ('would', 391.27516865782604),
 ('software', 364.9534705988249),
 ('python', 334.20355790808793),
 ('work', 333.7701495370556),
 ('lisp', 330.98764451023504),
 ('one', 319.12592916475836),
 ("i'm", 316.62236755939784),
 ('php', 313.7146518505741),
 ('people', 311.2217260621728),
 ('think', 299.8094768740961),
 ('windows', 295.25457696125284),
 ('get', 290.50075015200224),
 ('using', 287.49192181545527),
 ("that's", 282.1112672119966),
 ('make', 281.87576949528574),
 ('haskell', 281.76003933473635),
 ('something', 281.2034440012708),
 ('write', 279.77596257728527),
 ('web', 275.12358307753107),
 ('even', 273.2319402452463),
 ('linux', 267.6518064333667),
 ('time', 265.88332728246155),
 ('deleted', 2