# Preliminary look at subreddits for 6k news sources

This notebook counts how many times each news source is mentioned in each subreddit.

Final data structures:

Six files named `ns_subreddit_{month}.json`, one for each month from January through June 2021. Each file holds a dictionary keyed by news source. Each value is itself a dictionary keyed by subreddit, and each subreddit maps to a triple (`mention_count`, `mention_count_weighted_by_upvote_ratio`, `mention_count_weighted_by_number_of_comments`). For example, suppose that in January `nytimes.com` is mentioned only in the subreddit `news`, twice, with the following details:
* the first mention has a 0.60 upvote ratio and 3 comments
* the second mention has a 1.00 upvote ratio and 2 comments

Then `ns_subreddit_2021-01.json` will contain the item `{"nytimes.com": {"news": [2, 1.60, 5]}}` (JSON stores the triple as a list). The `mention_count` is 2 because `nytimes.com` is mentioned twice in total. The `mention_count_weighted_by_upvote_ratio` is `1*0.60 + 1*1 = 1.60`. The `mention_count_weighted_by_number_of_comments` is `1*3 + 1*2 = 5`.
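The arithmetic in this example can be checked with a short sketch. The `mentions` list is hypothetical data made up for the illustration, and the nesting by subreddit follows what the processing loop later in this notebook builds; it is not the loop itself.

```python
import json
from collections import defaultdict

# Hypothetical mentions: (news source, subreddit, upvote_ratio, num_comments)
mentions = [
    ("nytimes.com", "news", 0.60, 3),
    ("nytimes.com", "news", 1.0, 2),
]

totals = defaultdict(dict)
for source, subreddit, upvote_ratio, num_comments in mentions:
    # Start from a zero triple the first time a (source, subreddit) pair is seen
    count, by_ratio, by_comments = totals[source].get(subreddit, (0, 0.0, 0))
    totals[source][subreddit] = (
        count + 1,
        by_ratio + 1 * upvote_ratio,
        by_comments + 1 * num_comments,
    )

# A JSON round trip turns each tuple into a list, which is how the files store it
round_trip = json.loads(json.dumps(totals))
print(round_trip)  # {'nytimes.com': {'news': [2, 1.6, 5]}}
```

Anything that reads the monthly files back should therefore expect three-element lists rather than tuples.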
    

In [19]:
!pip install zstandard





In [20]:
import json
import pandas as pd
import zstandard as zstd
import io

from collections import defaultdict, Counter
from urllib.parse import urlparse
import re
import datetime, time
import tldextract

In [21]:
print(datetime.datetime.now())
print(str(datetime.datetime.now())[11:19])

2022-03-16 20:00:01.883972
20:00:01


## Reading news sources that are the intersection of GDELT and Muck Rack news sources

In [22]:
with open("gmm_intersection.json", "r") as infile:
    news_sources = json.load(infile)

In [23]:
len(news_sources)

1631

In [24]:
news_sources[:10]

['wcyb.com',
 'betootaadvocate.com',
 'wikinews.org',
 'oklahoman.com',
 'codepink.org',
 'veteranstoday.com',
 'arktimes.com',
 'thelancet.com',
 'jewishworldreview.com',
 'mcsweeneys.net']

## Open reddit data from April 2021 for exploration

Reference: https://arxiv.org/pdf/2001.08435.pdf

Example of a single entry (one submission) in the data:

```
{
    'all_awardings': [], 
    'allow_live_comments': False, 
    'archived': False, 
    'author': 'elanglohablante9805', 
    'author_created_utc': 1609519842, 
    'author_flair_background_color': '#ffb000', 
    'author_flair_css_class': None, 
    'author_flair_richtext': [], 
    'author_flair_template_id': '4f908eaa-9664-11ea-a567-0ed46a42aec3', 
    'author_flair_text': 'Historiador 📜 | 80-Day Streak 🔥', 
    'author_flair_text_color': 'dark', 
    'author_flair_type': 'text', 
    'author_fullname': 't2_9lr431i4', 
    'author_patreon_flair': False, 
    'author_premium': False, 
    'can_gild': True, 
    'category': None, 
    'content_categories': None, 
    'contest_mode': False, 
    'created_utc': 1617235201, 
    'discussion_type': None, 
    'distinguished': None, 
    'domain': 'self.WriteStreakES', 
    'edited': False, 
    'gilded': 0, 
    'gildings': {}, 
    'hidden': False, 
    'hide_score': False, 
    'id': 'mhj2hj', 
    'is_created_from_ads_ui': False, 
    'is_crosspostable': True, 
    'is_meta': False, 
    'is_original_content': False, 
    'is_reddit_media_domain': False, 
    'is_robot_indexable': True, 
    'is_self': True, 
    'is_video': False, 
    'link_flair_background_color': '', 
    'link_flair_css_class': None, 
    'link_flair_richtext': [], 
    'link_flair_text': None, 
    'link_flair_text_color': 'dark', 
    'link_flair_type': 'text', 
    'locked': False,
    'media': None, 
    'media_embed': {}, 
    'media_only': False, 
    'name': 't3_mhj2hj', 
    'no_follow': True, 
    'num_comments': 2, 
    'num_crossposts': 0, 
    'over_18': False, 
    'parent_whitelist_status': None, 
    'permalink': '/r/WriteStreakES/comments/mhj2hj/streak_90_ha_llegado_la_primavera/', 
    'pinned': False, 
    'pwls': None, 
    'quarantine': False, 
    'removed_by_category': None, 
    'retrieved_utc': 1623447663, 
    'score': 1, 
    'secure_media': None, 
    'secure_media_embed': {}, 
    'selftext': 'Los pájaros están cantando, las hierbas verdes están brotando, y tengo alergias.  Esto es la temporada de las alergias.  Estornudo cada mañana cuando me despierto, y otra vez si voy afuera.  Necesito tomar medicina cada día, pero no funciona tan bien. \n\nPor fuera, las lomas son bonitas porque son verdes y los robles tienen hojas nuevas.  Por el fin de semana,  hago caminatas pero cuando regreso a casa, necesito ducharme para remover el polen.\n\nCuando me jubile, voy a viajar al desierto cada año por toda la primavera.  No me gustaría quedarme aquí.', 
    'send_replies': True, 
    'spoiler': False, 
    'stickied': False, 
    'subreddit': 'WriteStreakES', 
    'subreddit_id': 't5_2eamt5', 
    'subreddit_subscribers': 2205, 
    'subreddit_type': 'public', 
    'suggested_sort': None, 
    'thumbnail': 'self', 
    'thumbnail_height': None, 
    'thumbnail_width': None, 
    'title': 'Streak 90: Ha llegado la primavera', 
    'top_awarded_type': None, 
    'total_awards_received': 0, 
    'treatment_tags': [], 
    'upvote_ratio': 1.0, 
    'url': 'https://www.reddit.com/r/WriteStreakES/comments/mhj2hj/streak_90_ha_llegado_la_primavera/', 
    'whitelist_status': None, 
    'wls': None
}

```
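Of all these fields, the processing loop later in this notebook reads only six. The sketch below parses a trimmed copy of the entry above and keeps just those fields; `USED_FIELDS` and `used` are names introduced here for illustration, and the `selftext` value is shortened.

```python
import json

# One line of the dump, trimmed to the fields the counting loop relies on
raw = (
    '{"subreddit": "WriteStreakES", "subreddit_id": "t5_2eamt5", '
    '"num_comments": 2, "upvote_ratio": 1.0, '
    '"url": "https://www.reddit.com/r/WriteStreakES/comments/mhj2hj/streak_90_ha_llegado_la_primavera/", '
    '"selftext": "Los pájaros están cantando"}'
)

USED_FIELDS = ("subreddit", "subreddit_id", "num_comments", "upvote_ratio", "url", "selftext")

post = json.loads(raw)
# .get() yields None instead of raising KeyError if a record lacks a field
used = {field: post.get(field) for field in USED_FIELDS}
print(used["subreddit"], used["num_comments"], used["upvote_ratio"])  # WriteStreakES 2 1.0
```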

## Extracting count info

In [25]:
dctx = zstd.ZstdDecompressor(max_window_size=2147483648)

In [26]:
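# Compiled once; the raw string avoids invalid-escape warnings for \w, \d, \+ and \.
URL_REGEX = re.compile(r'((https?):((//)|(\\))+([\w\d:#@%/;$()~_?\+-=\\.&](#!)?)*)')

def findURLs(phrase):
    """Return every http(s) URL in phrase; bare domains without a scheme are not matched."""
    # findall returns one tuple per match; group 0 of the tuple is the full URL
    return [match[0] for match in URL_REGEX.findall(phrase)]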

In [27]:
# try out
findURLs("does this find https://lol.com or nytimes.com/2021/10/19/us/politics/trump-border.html or https://nytimes.com/2021/10/19/us/politics/trump-border.html")

['https://lol.com',
 'https://nytimes.com/2021/10/19/us/politics/trump-border.html']

In [28]:
def get_hostname(url, uri_type='both'):
    """Return the hostnames for url as a set: domain.suffix and, if present, subdomain.domain.suffix.

    A leading 'www.' is stripped. uri_type is currently unused.
    """
    def strip_www(host):
        return host[4:].strip('/') if host.startswith("www.") else host.strip('/')

    hostnames = set()
    extracted = tldextract.extract(url)
    # Attribute access does not depend on how many fields ExtractResult has
    subdomain, domain, suffix = extracted.subdomain, extracted.domain, extracted.suffix
    # Without a recognized suffix there is no hostname to report
    if len(suffix) > 0:
        if len(subdomain) > 0:
            hostnames.add(strip_www(f"{subdomain}.{domain}.{suffix}"))
        hostnames.add(strip_www(f"{domain}.{suffix}"))
    return hostnames

In [29]:
get_hostname("https://cs.wellesley.edu/~cs313/")

{'cs.wellesley.edu', 'wellesley.edu'}

In [30]:
# function try out
print(get_hostname("https://www.nytimes.com"))
print(get_hostname("http://www.aiaia.nytimes.com/add"))
print(get_hostname("www.nytimes.com/additional"))

{'nytimes.com'}
{'nytimes.com', 'aiaia.nytimes.com'}
{'nytimes.com'}
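For comparison, here is a stdlib-only sketch using `urlparse`, which is imported above but not otherwise used. `netloc_without_www` is a hypothetical helper, not part of the notebook. It returns only the full host and needs a scheme to find one, which is presumably why `tldextract` is used instead.

```python
from urllib.parse import urlparse

def netloc_without_www(url):
    """Hypothetical helper: host part of a URL, lowercased, minus a leading 'www.'."""
    host = urlparse(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host

print(netloc_without_www("https://www.nytimes.com"))           # nytimes.com
print(netloc_without_www("http://www.aiaia.nytimes.com/add"))  # aiaia.nytimes.com
# Without a scheme, urlparse treats the whole string as a path, so the host is empty
print(repr(netloc_without_www("www.nytimes.com/additional")))  # ''
```

Unlike `get_hostname`, this cannot reduce `aiaia.nytimes.com` to `nytimes.com`, because it has no knowledge of public suffixes.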


In [31]:
"realtor.com" in news_sources

False

In [32]:
zst_files = ["RS_2021-01.zst", "RS_2021-02.zst", "RS_2021-03.zst", "RS_2021-04.zst", "RS_2021-05.zst", "RS_2021-06.zst"]

In [33]:
subreddit_srid = dict() # to keep track of subreddit id's

In [34]:
# example of flattening
import itertools
x = [[], ['foo'], ['bar', 'baz'], ['quux'], ("tup_1", "tup_2"), {1:"one", 2:"two"}]
print(list(itertools.chain(*x)))
# print([element for sub in x for element in sub])

['foo', 'bar', 'baz', 'quux', 'tup_1', 'tup_2', 1, 2]
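The same flattening idea applies to the per-post hostname sets: each URL yields a set of hostnames, and the sets are flattened and counted together. A minimal sketch, with hypothetical `get_hostname` results standing in for real ones:

```python
from collections import Counter
from itertools import chain

# Hypothetical get_hostname results for three URLs found in one post
hostnames = [
    {"nytimes.com"},
    {"nytimes.com", "aiaia.nytimes.com"},
    set(),  # a URL for which no hostname could be extracted
]

# chain.from_iterable flattens the sets; Counter tallies each hostname
counts = Counter(chain.from_iterable(hostnames))
print(counts["nytimes.com"], counts["aiaia.nytimes.com"])  # 2 1
```

Note that a URL with a subdomain contributes to two keys, the registered domain and the full host, so one link can count toward more than one news source if both appear in the source list.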


In [35]:
print("start time:", datetime.datetime.now())

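counter = 0
# A set gives constant-time membership tests inside the loop (the loaded value is a list)
news_sources = set(news_sources)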
for zst_file in zst_files:
    ns_subreddit = defaultdict(dict)
    # count how many times each news source appears in each subreddit, along with the weighted counts
    print("***** Start processing for {} *****".format(zst_file))
    with open("E://thesis_data/"+zst_file, 'rb') as ifh:
        with dctx.stream_reader(ifh, read_size=2) as reader:
            text_stream = io.TextIOWrapper(reader, encoding='utf-8')
            for d in text_stream:
                line = json.loads(d)
                subreddit, subreddit_id = line['subreddit'], line['subreddit_id']
                num_comments = line['num_comments']
                upvote_ratio = line['upvote_ratio']
                if subreddit not in subreddit_srid:
                    subreddit_srid[subreddit] = subreddit_id
                URLs = findURLs(line['url']) + findURLs(line['selftext'])                
                hostnames = [get_hostname(url) for url in URLs]
                # count how many times each hostname appears in this post
                URLs = Counter([element for sub in hostnames for element in sub])
                for url in URLs:
                    if url in news_sources:
                        mention_count = URLs[url]
                        weighted_by_upvote_ratio = URLs[url]*upvote_ratio
                        weighted_by_num_comments = URLs[url]*num_comments
                        # update
                        triple = (mention_count, weighted_by_upvote_ratio, weighted_by_num_comments)
                        # if subreddit in ns_subreddit[url], update. Else, initialize
                        if subreddit not in ns_subreddit[url]:
                            ns_subreddit[url][subreddit] = triple
                        else:
                            ns_subreddit[url][subreddit] = (ns_subreddit[url][subreddit][0] + triple[0],
                                                            ns_subreddit[url][subreddit][1] + triple[1],
                                                            ns_subreddit[url][subreddit][2] + triple[2])
                counter += 1
                if counter%500000 == 0: 
                    print("processed {} by {}".format(counter, str(datetime.datetime.now())[11:19]))
                
    
    print("-------------------------------- Done reading, will write files now --------------------------------")
    
    # write into files separated by months
    with open("ns_subreddit_{}.json".format(zst_file[3:10]), "w", encoding="utf-8") as outfile:
        json.dump(ns_subreddit, outfile, indent=4)
        
    # with open("subreddit_ns_{}.json".format(zst_file[3:10]), "w", encoding = "utf-8") as outfile1:
    #     json.dump(subreddit_ns, outfile1, indent=4)
        
    counter = 0
        
    print("----------------------------------------------------------------------------------------")
    print("-------------------------------- Done processing for {} --------------------------------".format(zst_file))
    print("----------------------------------------------------------------------------------------")
                
# subreddit_srid accumulates across all months; the filename carries the last month processed
with open("subreddit_srid_{}.json".format(zst_file[3:10]), "w", encoding="utf-8") as outfile_srid:
    json.dump(subreddit_srid, outfile_srid, indent=4)

print("finish time:", datetime.datetime.now())

start time: 2022-03-16 20:00:02.403276
***** Start processing for RS_2021-01.zst *****
processed 500000 by 20:02:16
processed 1000000 by 20:04:32
processed 1500000 by 20:06:47
processed 2000000 by 20:08:59
processed 2500000 by 20:11:10
processed 3000000 by 20:13:22
processed 3500000 by 20:15:34
processed 4000000 by 20:17:49
processed 4500000 by 20:20:02
processed 5000000 by 20:22:16
processed 5500000 by 20:24:30
processed 6000000 by 20:26:45
processed 6500000 by 20:28:59
processed 7000000 by 20:31:12
processed 7500000 by 20:33:23
processed 8000000 by 20:35:38
processed 8500000 by 20:37:55
processed 9000000 by 20:40:10
processed 9500000 by 20:42:24
processed 10000000 by 20:44:40
processed 10500000 by 20:46:59
processed 11000000 by 20:49:22
processed 11500000 by 20:51:38
processed 12000000 by 20:53:58
processed 12500000 by 20:56:13
processed 13000000 by 20:58:26
processed 13500000 by 21:00:39
processed 14000000 by 21:03:01
processed 14500000 by 21:05:17
processed 15000000 by 21:07:32
[output truncated]

processed 15500000 by 04:34:19
processed 16000000 by 04:36:12
processed 16500000 by 04:38:03
processed 17000000 by 04:39:54
processed 17500000 by 04:41:49
processed 18000000 by 04:43:40
processed 18500000 by 04:45:33
processed 19000000 by 04:47:25
processed 19500000 by 04:49:18
processed 20000000 by 04:51:12
processed 20500000 by 04:53:10
processed 21000000 by 04:55:03
processed 21500000 by 04:56:57
processed 22000000 by 04:58:51
processed 22500000 by 05:00:47
processed 23000000 by 05:02:41
processed 23500000 by 05:04:36
processed 24000000 by 05:06:29
processed 24500000 by 05:08:22
processed 25000000 by 05:10:14
processed 25500000 by 05:12:08
processed 26000000 by 05:14:03
processed 26500000 by 05:15:55
processed 27000000 by 05:17:51
processed 27500000 by 05:19:42
processed 28000000 by 05:21:37
processed 28500000 by 05:23:31
processed 29000000 by 05:25:25
processed 29500000 by 05:27:17
processed 30000000 by 05:29:09
processed 30500000 by 05:31:01
processed 31000000 by 05:32:58
processe