# Get all relevant submissions

Author: Junita Sirait

I am tired of parsing the whole six months of data every time I need additional information from the JSON files. Here I will master the art of `ndjson` dump and load, so that in the future I can process the filtered files instead of the original dumps.

Table of contents:
1. [Reading the original massive files](#sub1)

In [1]:
# !pip install ndjson


In [1]:
import ndjson

import json
import pandas as pd
import zstandard as zstd
import io

from collections import defaultdict, Counter
from urllib.parse import urlparse
import re
import datetime, time
import tldextract

## Reading the original massive files

In [16]:
with open("D:\\Wellesley\\F21\\thesis\\data\\gm_intersection.json", "r") as infile:
    gm_news_sources = set(json.load(infile))

In [17]:
dctx = zstd.ZstdDecompressor(max_window_size=2147483648)

In [18]:
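def findURLs(phrase):
    """Return all http(s) URLs found in phrase."""
    # Raw string avoids invalid-escape warnings; note that `\+-=` inside the
    # character class is a range from '+' to '=', so ',' and '<' also match.
    regex = re.compile(r'((https?):((//)|(\\))+([\w\d:#@%/;$()~_?\+-=\\.&](#!)?)*)')
    # findall returns one tuple of groups per match; the first group is the full URL
    return [groups[0] for groups in regex.findall(phrase)]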
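
To make the behavior of the URL regex concrete, here is a minimal sketch that applies the same pattern to a few sample strings. The names `URL_PATTERN` and `find_urls` and the sample strings are hypothetical and only used for illustration; they are not part of the pipeline.

```python
import re

# Same pattern as findURLs, written as a raw string.
URL_PATTERN = re.compile(r'((https?):((//)|(\\))+([\w\d:#@%/;$()~_?\+-=\\.&](#!)?)*)')

def find_urls(text):
    """Return the full-match group of every URL found in text."""
    return [groups[0] for groups in URL_PATTERN.findall(text)]

# Hypothetical sample strings, not taken from the Reddit dumps.
plain = find_urls("see https://www.example.com/a?b=1 and more")
trailing = find_urls("visit http://example.org, then leave")
none_found = find_urls("no links here")

print(plain)
print(trailing)
print(none_found)
```

The second sample shows that trailing punctuation such as a comma is kept in the match, because the range `+` to `=` in the character class includes it. This is harmless here, since only the hostname is used afterwards.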

In [19]:
def get_hostname(url, uri_type='both'):
    """Get the host name from the url"""
    hostnames = set()
    extracted = tldextract.extract(url)
    subdomain, domain, suffix = extracted
    # add both versions of domain.suffix and subdomain.domain.suffix
    full = ""
    # with subdomain
    if len(subdomain) > 0 and len(suffix) > 0:
        #print(f"{subdomain}.{domain}.{suffix}")
        full = f"{subdomain}.{domain}.{suffix}"
        if len(full) > 0:
            hostnames.add(full[4:].strip('/')) if full.startswith("www.") else hostnames.add(full.strip('/'))
    # without subdomain
    full = f"{domain}.{suffix}"
    if len(full) > 0 and len(suffix) > 0:
        hostnames.add(full[4:].strip('/')) if full.startswith("www.") else hostnames.add(full.strip('/'))
    return hostnames

In [20]:
interesting_features = ["created_utc", "num_comments", "num_crossposts", "retrieved_utc", "subreddit",
                       "subreddit_type", "total_awards_received", "upvote_ratio"]

In [21]:
zst_files = [ "RS_2021-01.zst", "RS_2021-02.zst", "RS_2021-03.zst", "RS_2021-04.zst", "RS_2021-05.zst", "RS_2021-06.zst"]
zst_filepath = "E:/thesis_data/"

In [22]:
subreddits_total_activity = defaultdict(dict)

In [None]:
print("start time:", datetime.datetime.now())


for zst_file in zst_files:
    counter = 0
    num_containing_urls = 0
    nc_wo_reddit = 0
    added = 0
    month = zst_file[8:10]
    with open ("E:/relevant_posts_{}.ndjson".format(month), "a", encoding="utf-8") as ndjfile:
        writer = ndjson.writer(ndjfile, ensure_ascii=False)
        print("***** Start processing for {} *****".format(zst_file))
        with open(zst_filepath+zst_file, 'rb') as ifh:
            with dctx.stream_reader(ifh, read_size=2) as reader:
                text_stream = io.TextIOWrapper(reader, encoding='utf-8')
                for d in text_stream:
                    line = json.loads(d)
                    URLs = findURLs(line['url']) + findURLs(line['selftext'])                
                    hostnames = [get_hostname(url) for url in URLs]
                    URLs = [element for sub in hostnames for element in sub]
                    subreddit = line["subreddit"]
                    if month in subreddits_total_activity[subreddit]:
                        subreddits_total_activity[subreddit][month] += 1
                    else:
                        subreddits_total_activity[subreddit][month] = 1
                    if len(URLs) > 0:
                        num_containing_urls += 1
                        if "reddit.com" not in URLs:
                            nc_wo_reddit += 1
                    for url in URLs:
                        if url in gm_news_sources:
                            to_write = dict()
                            for f in interesting_features:
                                to_write[f] = line[f]
                            writer.writerow(to_write)
                            added += 1
                            break # write this post only once
                    counter += 1
                    if counter%500000 == 0: 
                        print("processed {}; added {}; {} has URLs; {} has URLs that are not reddit \t by {}".format(counter, added, num_containing_urls, nc_wo_reddit, str(datetime.datetime.now())[11:19]))
        
        print("****************************************** Summary ******************************************")
        print(f"There are {counter} total posts in {month}, {num_containing_urls} has URLs, {nc_wo_reddit} has URLs that are not reddit and {added} of them have urls to our news sources.")
        
        print("----------------------------------------------------------------------------------------")
        print("-------------------------------- Done processing for {} --------------------------------".format(zst_file))
        print("----------------------------------------------------------------------------------------")

with open("recent_subreddit_activity.json", "w", encoding="utf-8") as sub:
    json.dump(subreddits_total_activity, sub)

print("finish time:", datetime.datetime.now())

start time: 2022-03-20 00:46:58.758995
***** Start processing for RS_2021-01.zst *****
processed 500000; added 253697; 498685 has URLs; 304435 has URLs that are not reddit 	 by 00:49:06
processed 1000000; added 516349; 997246 has URLs; 601131 has URLs that are not reddit 	 by 00:51:10
processed 1500000; added 771349; 1495937 has URLs; 903613 has URLs that are not reddit 	 by 00:53:04
processed 2000000; added 1027480; 1994616 has URLs; 1199834 has URLs that are not reddit 	 by 00:54:46
processed 2500000; added 1279215; 2493245 has URLs; 1504453 has URLs that are not reddit 	 by 00:56:28
processed 3000000; added 1540102; 2991793 has URLs; 1799489 has URLs that are not reddit 	 by 00:58:06
processed 3500000; added 1795421; 3490492 has URLs; 2105254 has URLs that are not reddit 	 by 00:59:48
processed 4000000; added 2055866; 3989103 has URLs; 2403318 has URLs that are not reddit 	 by 01:01:27
processed 4500000; added 2313788; 4487870 has URLs; 2703049 has URLs that are not reddit 	 by 01:0

processed 4000000; added 2120699; 3984405 has URLs; 2338921 has URLs that are not reddit 	 by 02:51:33
processed 4500000; added 2379226; 4482208 has URLs; 2642618 has URLs that are not reddit 	 by 02:53:16
processed 5000000; added 2656598; 4980377 has URLs; 2921714 has URLs that are not reddit 	 by 02:54:56
processed 5500000; added 2911249; 5478555 has URLs; 3231520 has URLs that are not reddit 	 by 02:56:39
processed 6000000; added 3187428; 5976706 has URLs; 3516075 has URLs that are not reddit 	 by 02:58:22
processed 6500000; added 3458532; 6475143 has URLs; 3799850 has URLs that are not reddit 	 by 03:00:00
processed 7000000; added 3718415; 6973348 has URLs; 4097621 has URLs that are not reddit 	 by 03:01:45
processed 7500000; added 3976850; 7471593 has URLs; 4394353 has URLs that are not reddit 	 by 03:03:25
processed 8000000; added 4234684; 7969929 has URLs; 4693425 has URLs that are not reddit 	 by 03:05:07
processed 8500000; added 4488474; 8468384 has URLs; 4997077 has URLs that

processed 9500000; added 4858303; 9469398 has URLs; 5782649 has URLs that are not reddit 	 by 04:56:36
processed 10000000; added 5123568; 9967460 has URLs; 6078772 has URLs that are not reddit 	 by 04:58:18
processed 10500000; added 5383004; 10465754 has URLs; 6380624 has URLs that are not reddit 	 by 04:59:58
processed 11000000; added 5635005; 10964219 has URLs; 6685106 has URLs that are not reddit 	 by 05:01:37
processed 11500000; added 5896424; 11462673 has URLs; 6983187 has URLs that are not reddit 	 by 05:03:19
processed 12000000; added 6151007; 11961099 has URLs; 7287796 has URLs that are not reddit 	 by 05:05:03
processed 12500000; added 6409403; 12459647 has URLs; 7587359 has URLs that are not reddit 	 by 05:06:46
processed 13000000; added 6669069; 12958003 has URLs; 7892016 has URLs that are not reddit 	 by 05:08:30
processed 13500000; added 6934620; 13456449 has URLs; 8182374 has URLs that are not reddit 	 by 05:10:32
processed 14000000; added 7193803; 13954994 has URLs; 8481

processed 13000000; added 6844512; 12955148 has URLs; 7793293 has URLs that are not reddit 	 by 07:03:24
processed 13500000; added 7113647; 13453392 has URLs; 8086415 has URLs that are not reddit 	 by 07:05:09
processed 14000000; added 7375465; 13951764 has URLs; 8390425 has URLs that are not reddit 	 by 07:06:51
processed 14500000; added 7644830; 14450001 has URLs; 8684395 has URLs that are not reddit 	 by 07:08:36
processed 15000000; added 7906608; 14948413 has URLs; 8986702 has URLs that are not reddit 	 by 07:10:20
processed 15500000; added 8177483; 15446786 has URLs; 9280346 has URLs that are not reddit 	 by 07:12:06
processed 16000000; added 8445701; 15945204 has URLs; 9575486 has URLs that are not reddit 	 by 07:13:50
processed 16500000; added 8715639; 16443512 has URLs; 9869324 has URLs that are not reddit 	 by 07:15:31
processed 17000000; added 8984144; 16942125 has URLs; 10160960 has URLs that are not reddit 	 by 07:17:14
processed 17500000; added 9253249; 17440617 has URLs; 

processed 18000000; added 9168430; 17940534 has URLs; 10856294 has URLs that are not reddit 	 by 09:11:55
processed 18500000; added 9418875; 18438915 has URLs; 11154993 has URLs that are not reddit 	 by 09:14:18
processed 19000000; added 9631976; 18937606 has URLs; 11488020 has URLs that are not reddit 	 by 09:16:33
processed 19500000; added 9867315; 19435752 has URLs; 11807399 has URLs that are not reddit 	 by 09:18:53
processed 20000000; added 10098209; 19934383 has URLs; 12126378 has URLs that are not reddit 	 by 09:21:14
processed 20500000; added 10321128; 20432890 has URLs; 12457731 has URLs that are not reddit 	 by 09:23:30
processed 21000000; added 10560402; 20931318 has URLs; 12773360 has URLs that are not reddit 	 by 09:25:49
processed 21500000; added 10783611; 21430021 has URLs; 13101889 has URLs that are not reddit 	 by 09:28:02
processed 22000000; added 11012003; 21928512 has URLs; 13432187 has URLs that are not reddit 	 by 09:30:42
processed 22500000; added 11259562; 22426

```
There are 32704571 total posts in 01, 32611865 has URLs, 19864182 has URLs that are not reddit and 16764578 of them have urls to our news sources.

There are 31147947 total posts in 02, 31042457 has URLs, 18595912 has URLs that are not reddit and 16219186 of them have urls to our news sources.

There are 33006103 total posts in 03, 32898212 has URLs, 19828991 has URLs that are not reddit and 17229686 of them have urls to our news sources.

There are 31616206 total posts in 04, 31509254 has URLs, 18843731 has URLs that are not reddit and 16901090 of them have urls to our news sources.

There are 36310673 total posts in 05, 36192154 has URLs, 22181096 has URLs that are not reddit and 18177456 of them have urls to our news sources.
```
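
Once the per-month files exist, later analyses can stream them back instead of decompressing the original dumps. The sketch below uses only the standard library and assumes the files hold one JSON object per line, which is the NDJSON layout. The helper `iter_ndjson`, the sample rows, and the temporary file are hypothetical stand-ins for `relevant_posts_{month}.ndjson`.

```python
import json
import os
import tempfile

def iter_ndjson(path):
    """Yield one dict per non-empty line of a newline-delimited JSON file."""
    with open(path, "r", encoding="utf-8") as infile:
        for raw_line in infile:
            raw_line = raw_line.strip()
            if raw_line:
                yield json.loads(raw_line)

# Hypothetical records standing in for rows of a relevant_posts file.
sample_rows = [
    {"subreddit": "news", "num_comments": 12, "upvote_ratio": 0.91},
    {"subreddit": "worldnews", "num_comments": 3, "upvote_ratio": 0.75},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample.ndjson")
    # Dump: one JSON object per line.
    with open(sample_path, "w", encoding="utf-8") as outfile:
        for row in sample_rows:
            outfile.write(json.dumps(row, ensure_ascii=False) + "\n")
    # Load: stream the rows back without holding the whole file in memory.
    loaded_rows = list(iter_ndjson(sample_path))
    total_comments = sum(row["num_comments"] for row in loaded_rows)

print(loaded_rows)
print(total_comments)
```

Because the reader is a generator, aggregates such as `total_comments` can be computed in a single pass, even over files with millions of rows.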

In [5]:
18177456/22181096

0.8195021562505297

In [27]:
len(subreddits_total_activity)

2361255

In [38]:
len([s for s in subreddits_total_activity if len(subreddits_total_activity[s])==6])

129327
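
The per-month bookkeeping and the "active in all six months" check can be sketched on toy data as follows. This is an alternative formulation using `defaultdict(Counter)`, not the code that produced the numbers above; the `posts` list and the two-month range are made up for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical (subreddit, month) pairs standing in for the streamed posts.
posts = [("twilight", "01"), ("twilight", "02"), ("twilight", "01"), ("newsub", "02")]
months = ["01", "02"]

# A Counter as the default value removes the explicit "if month in ..." branch.
activity = defaultdict(Counter)
for subreddit, month in posts:
    activity[subreddit][month] += 1

# Subreddits with at least one post in every month under study.
always_active = [
    name for name, per_month in activity.items() if len(per_month) == len(months)
]

print(dict(activity))
print(always_active)
```

Both `defaultdict` and `Counter` are `dict` subclasses, so the result can still be written out with `json.dump`.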

In [37]:
subreddits_total_activity["twilight"]

{'02': 490, '03': 537, '04': 478, '05': 455, '06': 451, '01': 589}

In [21]:
# with open("subreddits_total_activity.json", "w", encoding="utf-8") as af:
#     json.dump(subreddits_total_activity, af)

```
Month            |   with news   |     total     |  % of posts with news  |
-----------------|---------------|---------------|------------------------|
January          |    1137576    |    32704571   |         3.48%          |
February         |    1026958    |    31147947   |         3.30%          |
March            |    1155554    |    33006103   |         3.50%          |
April            |    1090699    |    31616206   |         3.45%          |
May              |    1019503    |    36310673   |         2.81%          |
June             |     811758    |    34118481   |         2.38%          |
```
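
The percentage column can be recomputed from the two count columns in one step. The sketch below copies the counts from the table and rounds to two decimals, so a value may differ in the last digit from a truncated figure. The variable names are hypothetical.

```python
# (posts linking to a news source, total posts) per month, copied from the table.
monthly_counts = {
    "January": (1137576, 32704571),
    "February": (1026958, 31147947),
    "March": (1155554, 33006103),
    "April": (1090699, 31616206),
    "May": (1019503, 36310673),
    "June": (811758, 34118481),
}

# Share of all posts that link to one of our news sources, in percent.
share_with_news = {
    month: round(100 * with_news / total, 2)
    for month, (with_news, total) in monthly_counts.items()
}

for month, share in share_with_news.items():
    print(f"{month:<10}{share:>6.2f}%")
```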

In [26]:
811758/34118481

0.023792325338282204

In [26]:
"wikipedia.org" in gmm_news_sources

True

In [18]:
counter

235400