# Preliminary look at subreddits for 6k news sources

This notebook counts how many times each news source is mentioned in each subreddit.

Final data structures:

Six files named `ns_subreddit_{month}.json`, one for each month from January through June 2021. Each file holds a dictionary whose keys are news sources. Each value is itself a dictionary keyed by subreddit, and its values are triples (`mention_count`, `mention_count_weighted_by_upvote_ratio`, `mention_count_weighted_by_number_of_comments`). For example, suppose that in January `nytimes.com` is mentioned only in subreddit `news`, two times, with the following details:
* first mention has 0.60 upvote ratio and 3 comments
* second mention has 1 upvote ratio and 2 comments

Then `ns_subreddit_2021-01.json` will contain the item `{"nytimes.com": {"news": [2, 1.60, 5]}}` (JSON stores the triple as a list). The `mention_count` is 2 because `nytimes.com` is mentioned two times in total. The `mention_count_weighted_by_upvote_ratio` is `1*0.60 + 1*1 = 1.60`. The `mention_count_weighted_by_number_of_comments` is `1*3 + 1*2 = 5`.
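
The arithmetic in this example can be checked with a small sketch. The `mentions` list and its layout are made up for illustration; only the three resulting numbers come from the example above.

```python
# Hypothetical input: one (upvote_ratio, num_comments) pair per mention
# of nytimes.com in subreddit "news" during January.
mentions = [(0.60, 3), (1.0, 2)]

# Each mention counts once, so the weighted counts are plain sums of the weights.
mention_count = len(mentions)
weighted_by_upvote_ratio = sum(ratio for ratio, _ in mentions)
weighted_by_num_comments = sum(comments for _, comments in mentions)

triple = (mention_count, weighted_by_upvote_ratio, weighted_by_num_comments)
print(triple)
```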
    

In [2]:
!pip install zstandard

Collecting zstandard
  Downloading zstandard-0.16.0-cp39-cp39-win_amd64.whl (733 kB)
Installing collected packages: zstandard
Successfully installed zstandard-0.16.0


In [3]:
import json
import pandas as pd
import zstandard as zstd
import io

from collections import defaultdict, Counter
from urllib.parse import urlparse
import re
import datetime, time
import tldextract

In [2]:
print(datetime.datetime.now())
print(str(datetime.datetime.now())[11:19])

2022-03-12 00:10:54.960419
00:10:54


## Reading the news sources in the intersection of GDELT and Muck Rack

In [3]:
with open("gm_intersection.json", "r") as infile:
    news_sources = json.load(infile)

In [4]:
len(news_sources)

42477

In [5]:
news_sources[:10]

['websterprogresstimes.com',
 'cordeledispatch.com',
 'k12.wv.us',
 'ukconstructionmedia.co.uk',
 'dylanpaulus.com',
 'arktimes.com',
 'asiafoodjournal.com',
 'corydontimes.com',
 'stuttgartdailyleader.com',
 'artrockermagazine.com']

## Open Reddit data from April 2021 for exploration

Reference: https://arxiv.org/pdf/2001.08435.pdf

Example of a single entry in the data:

```
{
    'all_awardings': [], 
    'allow_live_comments': False, 
    'archived': False, 
    'author': 'elanglohablante9805', 
    'author_created_utc': 1609519842, 
    'author_flair_background_color': '#ffb000', 
    'author_flair_css_class': None, 
    'author_flair_richtext': [], 
    'author_flair_template_id': '4f908eaa-9664-11ea-a567-0ed46a42aec3', 
    'author_flair_text': 'Historiador 📜 | 80-Day Streak 🔥', 
    'author_flair_text_color': 'dark', 
    'author_flair_type': 'text', 
    'author_fullname': 't2_9lr431i4', 
    'author_patreon_flair': False, 
    'author_premium': False, 
    'can_gild': True, 
    'category': None, 
    'content_categories': None, 
    'contest_mode': False, 
    'created_utc': 1617235201, 
    'discussion_type': None, 
    'distinguished': None, 
    'domain': 'self.WriteStreakES', 
    'edited': False, 
    'gilded': 0, 
    'gildings': {}, 
    'hidden': False, 
    'hide_score': False, 
    'id': 'mhj2hj', 
    'is_created_from_ads_ui': False, 
    'is_crosspostable': True, 
    'is_meta': False, 
    'is_original_content': False, 
    'is_reddit_media_domain': False, 
    'is_robot_indexable': True, 
    'is_self': True, 
    'is_video': False, 
    'link_flair_background_color': '', 
    'link_flair_css_class': None, 
    'link_flair_richtext': [], 
    'link_flair_text': None, 
    'link_flair_text_color': 'dark', 
    'link_flair_type': 'text', 
    'locked': False,
    'media': None, 
    'media_embed': {}, 
    'media_only': False, 
    'name': 't3_mhj2hj', 
    'no_follow': True, 
    'num_comments': 2, 
    'num_crossposts': 0, 
    'over_18': False, 
    'parent_whitelist_status': None, 
    'permalink': '/r/WriteStreakES/comments/mhj2hj/streak_90_ha_llegado_la_primavera/', 
    'pinned': False, 
    'pwls': None, 
    'quarantine': False, 
    'removed_by_category': None, 
    'retrieved_utc': 1623447663, 
    'score': 1, 
    'secure_media': None, 
    'secure_media_embed': {}, 
    'selftext': 'Los pájaros están cantando, las hierbas verdes están brotando, y tengo alergias.  Esto es la temporada de las alergias.  Estornudo cada mañana cuando me despierto, y otra vez si voy afuera.  Necesito tomar medicina cada día, pero no funciona tan bien. \n\nPor fuera, las lomas son bonitas porque son verdes y los robles tienen hojas nuevas.  Por el fin de semana,  hago caminatas pero cuando regreso a casa, necesito ducharme para remover el polen.\n\nCuando me jubile, voy a viajar al desierto cada año por toda la primavera.  No me gustaría quedarme aquí.', 
    'send_replies': True, 
    'spoiler': False, 
    'stickied': False, 
    'subreddit': 'WriteStreakES', 
    'subreddit_id': 't5_2eamt5', 
    'subreddit_subscribers': 2205, 
    'subreddit_type': 'public', 
    'suggested_sort': None, 
    'thumbnail': 'self', 
    'thumbnail_height': None, 
    'thumbnail_width': None, 
    'title': 'Streak 90: Ha llegado la primavera', 
    'top_awarded_type': None, 
    'total_awards_received': 0, 
    'treatment_tags': [], 
    'upvote_ratio': 1.0, 
    'url': 'https://www.reddit.com/r/WriteStreakES/comments/mhj2hj/streak_90_ha_llegado_la_primavera/', 
    'whitelist_status': None, 
    'wls': None
}

```
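
Only a handful of these fields are read by the counting loop further down. A minimal sketch, using a trimmed copy of the entry above (the `raw_line` value is abbreviated for illustration and is not read from the dump):

```python
import json

# Abbreviated copy of the example entry, serialized as one JSON object on one line.
raw_line = json.dumps({
    "subreddit": "WriteStreakES",
    "subreddit_id": "t5_2eamt5",
    "num_comments": 2,
    "upvote_ratio": 1.0,
    "url": "https://www.reddit.com/r/WriteStreakES/comments/mhj2hj/streak_90_ha_llegado_la_primavera/",
    "selftext": "Los pájaros están cantando...",  # shortened here
})

record = json.loads(raw_line)
# Fields used later: where the post lives, its two weights,
# and the two places where a link to a news source can appear.
wanted = ("subreddit", "subreddit_id", "num_comments", "upvote_ratio", "url", "selftext")
fields = {key: record[key] for key in wanted}
print(fields["subreddit"], fields["num_comments"], fields["upvote_ratio"])
```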

## Extracting count info

In [6]:
dctx = zstd.ZstdDecompressor(max_window_size=2147483648)

In [4]:
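# Compiled once; the raw string avoids invalid-escape warnings for \w, \d, etc.
URL_REGEX = re.compile(r'((https?):((//)|(\\))+([\w\d:#@%/;$()~_?\+-=\\.&](#!)?)*)')

def findURLs(phrase):
    """Return every http(s) URL found in `phrase`."""
    # findall returns one tuple per match; its first group spans the whole URL
    return [match[0] for match in URL_REGEX.findall(phrase)]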

In [24]:
get_hostname("https://cs.wellesley.edu/~cs313/")

{'cs.wellesley.edu', 'wellesley.edu'}

In [9]:
# try out
findURLs("does this find https://lol.com or nytimes.com/2021/10/19/us/politics/trump-border.html or https://nytimes.com/2021/10/19/us/politics/trump-border.html")

['https://lol.com',
 'https://nytimes.com/2021/10/19/us/politics/trump-border.html']
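
The try-out shows that only links starting with `http://` or `https://` are picked up; the bare `nytimes.com/...` mention is skipped. A simplified sketch of that behaviour, using a deliberately reduced pattern (`SCHEME_URL` is an illustrative name, not the notebook's regex):

```python
import re

# Simplified stand-in pattern: a scheme followed by any run of non-whitespace.
SCHEME_URL = re.compile(r"https?://\S+")

text = ("does this find https://lol.com or nytimes.com/2021/10/19/us/politics/trump-border.html "
        "or https://nytimes.com/2021/10/19/us/politics/trump-border.html")

# The scheme-less mention in the middle has no "http(s)://" prefix, so it is not matched.
found = SCHEME_URL.findall(text)
print(found)
```

If scheme-less mentions matter for the counts, they would need a separate pattern; the notebook as written does not count them.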

In [22]:
def get_hostname(url, uri_type='both'):
    """Return the host names for `url` as a set.

    The set holds `domain.suffix` and, if the URL has a subdomain,
    `subdomain.domain.suffix`; a leading "www." is dropped from each.
    `uri_type` is currently unused.
    """
    def clean(host):
        # drop a leading "www." and any stray slashes
        return (host[4:] if host.startswith("www.") else host).strip('/')

    extracted = tldextract.extract(url)
    # attribute access does not depend on how many fields the result has
    subdomain, domain, suffix = extracted.subdomain, extracted.domain, extracted.suffix
    hostnames = set()
    if not suffix:
        return hostnames  # no recognised public suffix, nothing to add
    if subdomain:
        hostnames.add(clean(f"{subdomain}.{domain}.{suffix}"))
    hostnames.add(clean(f"{domain}.{suffix}"))
    return hostnames

In [20]:
# function try out
print(get_hostname("https://www.nytimes.com"))
print(get_hostname("http://www.aiaia.nytimes.com/add"))
print(get_hostname("www.nytimes.com/additional"))

{'nytimes.com'}
{'nytimes.com', 'aiaia.nytimes.com'}
{'nytimes.com'}


In [21]:
"realtor.com" in news_sources

True

In [22]:
zst_files = ["RS_2021-01.zst", "RS_2021-02.zst", "RS_2021-03.zst", "RS_2021-04.zst", "RS_2021-05.zst", "RS_2021-06.zst"]

In [23]:
subreddit_srid = dict() # to keep track of subreddit id's

In [24]:
# example of flattening
import itertools
x = [[], ['foo'], ['bar', 'baz'], ['quux'], ("tup_1", "tup_2"), {1:"one", 2:"two"}]
print(list(itertools.chain(*x)))
# print([element for sub in x for element in sub])

['foo', 'bar', 'baz', 'quux', 'tup_1', 'tup_2', 1, 2]
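
The counting loop below relies on the same flattening: each URL in a post yields a set of hostnames, and the sets are flattened into a single `Counter`. A small sketch with hand-written hostname sets (the sample values are illustrative):

```python
from collections import Counter
from itertools import chain

# One set of hostnames per URL found in a hypothetical post.
hostnames = [{"nytimes.com"}, {"cs.wellesley.edu", "wellesley.edu"}, {"nytimes.com"}]

# chain.from_iterable flattens the sets without unpacking the outer list into arguments.
counts = Counter(chain.from_iterable(hostnames))
print(counts["nytimes.com"], counts["wellesley.edu"])
```

Note that a URL with a subdomain contributes to two keys, so `cs.wellesley.edu` and `wellesley.edu` are each counted once for the same link.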


In [None]:
print("start time:", datetime.datetime.now())

# set for O(1) membership tests; the list has tens of thousands of entries
news_source_set = set(news_sources)

for zst_file in zst_files:
    month = zst_file[3:10]  # e.g. "2021-01" from "RS_2021-01.zst"
    # ns_subreddit[news_source][subreddit] =
    #     (mention_count, weighted_by_upvote_ratio, weighted_by_num_comments)
    ns_subreddit = defaultdict(dict)
    counter = 0  # submissions processed so far in this file
    print("***** Start processing for {} *****".format(zst_file))
    with open(zst_file, 'rb') as ifh:
        # use the library's default read_size; 2-byte reads add a lot of overhead
        with dctx.stream_reader(ifh) as reader:
            text_stream = io.TextIOWrapper(reader, encoding='utf-8')
            for d in text_stream:
                line = json.loads(d)
                subreddit, subreddit_id = line['subreddit'], line['subreddit_id']
                num_comments = line['num_comments']
                upvote_ratio = line['upvote_ratio']
                if subreddit not in subreddit_srid:
                    subreddit_srid[subreddit] = subreddit_id
                URLs = findURLs(line['url']) + findURLs(line['selftext'])
                hostnames = [get_hostname(url) for url in URLs]
                # count how many times each hostname appears in the post
                host_counts = Counter(element for sub in hostnames for element in sub)
                for host, mention_count in host_counts.items():
                    if host in news_source_set:
                        triple = (mention_count,
                                  mention_count * upvote_ratio,
                                  mention_count * num_comments)
                        # add to the running totals for this (news source, subreddit) pair
                        previous = ns_subreddit[host].get(subreddit, (0, 0, 0))
                        ns_subreddit[host][subreddit] = tuple(p + t for p, t in zip(previous, triple))
                counter += 1
                if counter % 500000 == 0:
                    print("processed {} by {}".format(counter, str(datetime.datetime.now())[11:19]))

    print("-------------------------------- Done reading, will write files now --------------------------------")

    # Assumption: subreddit_ns is the inverse view of ns_subreddit,
    # i.e. subreddit_ns[subreddit][news_source] = triple
    subreddit_ns = defaultdict(dict)
    for news_source, by_subreddit in ns_subreddit.items():
        for subreddit, triple in by_subreddit.items():
            subreddit_ns[subreddit][news_source] = triple

    # write into files separated by months
    with open("ns_subreddit_{}.json".format(month), "w", encoding="utf-8") as outfile:
        json.dump(ns_subreddit, outfile, indent=4)

    with open("subreddit_ns_{}.json".format(month), "w", encoding="utf-8") as outfile1:
        json.dump(subreddit_ns, outfile1, indent=4)

    # subreddit_srid accumulates across months, so each file includes earlier months' ids
    with open("subreddit_srid_{}.json".format(month), "w", encoding="utf-8") as outfile_srid:
        json.dump(subreddit_srid, outfile_srid, indent=4)

    print("----------------------------------------------------------------------------------------")
    print("-------------------------------- Done processing for {} --------------------------------".format(zst_file))
    print("----------------------------------------------------------------------------------------")

print("finish time:", datetime.datetime.now())
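
Once the files are written, the nested dictionary can be queried directly. A minimal sketch that ranks subreddits for one news source by mention count; the `sample` dictionary stands in for a loaded `ns_subreddit_2021-01.json`, its numbers are made up, and `top_subreddits` is a hypothetical helper rather than part of the notebook:

```python
import json

# Made-up stand-in for the contents of ns_subreddit_2021-01.json.
sample = json.loads('{"nytimes.com": {"news": [2, 1.6, 5], "politics": [7, 5.9, 120]}}')

def top_subreddits(ns_subreddit, news_source, n=10):
    """Return up to n (subreddit, mention_count) pairs, highest count first."""
    by_subreddit = ns_subreddit.get(news_source, {})
    # element 0 of each stored triple is the plain mention count
    ranked = sorted(by_subreddit.items(), key=lambda item: item[1][0], reverse=True)
    return [(subreddit, triple[0]) for subreddit, triple in ranked[:n]]

print(top_subreddits(sample, "nytimes.com"))
```

Sorting on index 1 or 2 of the triple instead would rank by the upvote-weighted or comment-weighted count.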