# Finding Intersection between Muck Rack and GDELT

"What counts as a news source?" is an important question, and Muck Rack and GDELT each answer it with their own definition. Intersecting Muck Rack's manually curated list of news sources with the set of sources recognized by the global GDELT database gives us the news sources that two independent institutions both recognize.

Table of contents

1. [Read news sources from GDELT](#sec1)
    * [Does GDELT have domains and subdomains OR just domains?](#sec1a) GDELT has subdomains!
    
2. [Read news sources from Muck Rack](#sec2)
    * [New Muck Rack](#sec2a)
    * [Old Muck Rack](#sec2b)
    * [Union between old and new Muck Rack](#sec2c)

3. [Intersection between Muck Rack and GDELT](#sec3)
4. [Intersection between Muck Rack, GDELT, and MBFC](#sec4)

In [1]:
import json
import csv

from collections import defaultdict, Counter

import tldextract

<a id="sec1"></a>
## Read news sources from GDELT

In [2]:
years = [2016, 2017, 2018, 2019, 2020, 2021]

In [3]:
gdelt_ns = set()

for year in years:
    with open(f"D:\\Wellesley\\F21\\thesis\\data\\gdelt\\gdelt_{year}.csv", "r", encoding="utf-8", newline="") as gi:
        reader = csv.DictReader(gi)
        for row in reader:
            gdelt_ns.add(row['SourceCommonName'])

In [4]:
gdelt_ns = list(gdelt_ns)
len(gdelt_ns)

205998

In [5]:
gdelt_ns[:10]

['',
 'beststoreonthenet.com',
 'iflcars.co.uk',
 'addapinch.com',
 'future-business-consulting.com',
 'melted.design',
 'artemzin.com',
 'forbes.com',
 'robertlaird.com',
 'kakijalans.com']
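The sample above begins with an empty string, so at least one row has a blank `SourceCommonName`. Below is a minimal sketch of skipping blanks while reading, run against a small in-memory CSV rather than the real GDELT files. The helper name `read_source_names`, the `Count` column, and the sample rows are illustrative assumptions.

```python
import csv
import io


def read_source_names(handle, column="SourceCommonName"):
    """Collect the distinct non-blank values of one CSV column."""
    names = set()
    for row in csv.DictReader(handle):
        value = (row.get(column) or "").strip()
        if value:  # skip rows where the source name is missing or blank
            names.add(value)
    return names


# Illustrative rows only; the real inputs are the yearly gdelt_{year}.csv files.
sample = io.StringIO(
    "SourceCommonName,Count\n"
    "forbes.com,12\n"
    ",3\n"
    "forbes.com,7\n"
    "iflcars.co.uk,1\n"
)
sample_names = read_source_names(sample)
print(sorted(sample_names))
```

Filtering at read time would make the later removal of `""` from the combined sets unnecessary for GDELT, though it would not affect blanks coming from the other sources.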

<a id="sec1a"></a>
### Does GDELT have domains and subdomains OR just domains? YES, with subdomains!

In [6]:
with_subdomains = []

for gn in gdelt_ns:
    extracted = tldextract.extract(gn)
    # read the field by name rather than unpacking the result as a tuple
    if extracted.subdomain:
        with_subdomains.append(gn)
        
print(f"there are {len(with_subdomains)} news sources with subdomains.")

there are 15441 news sources with subdomains.


### Strip `www.` from GDELT news sources

In [7]:
# note: str.replace drops every "www." occurrence, not only a leading one
gdelt_ns = list({k.replace("www.", "") for k in gdelt_ns})

In [8]:
len(gdelt_ns)

205994
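`str.replace("www.", "")` removes the substring wherever it occurs, so a host such as the made-up `awww.example.com` would be altered as well. Below is a minimal sketch of a prefix-only alternative. `strip_www` is a hypothetical helper, not part of the notebook, and the counts reported above were produced with `replace`.

```python
def strip_www(host):
    """Remove one leading "www." label and leave the rest of the host untouched."""
    prefix = "www."
    return host[len(prefix):] if host.startswith(prefix) else host


print(strip_www("www.forbes.com"))             # leading label removed
print(strip_www("awww.example.com"))           # unchanged: "www." is not a prefix here
print("awww.example.com".replace("www.", ""))  # replace() alters the name
```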

<a id="sec2"></a>
## Read news sources from Muck Rack

<a id="sec2a"></a>
### First, new Muck Rack

(scraped in January 2022)

In [9]:
with open("D:\\Wellesley\\F21\\thesis\\data\\muckrack\\all_new_muckrack.json", "r", encoding="utf-8") as mrinfile:
    new_muckrack = json.load(mrinfile)

In [10]:
new_muckrack_ns = list(new_muckrack.keys())
new_muckrack_ns[:10]

['http://absolutelynot.libsyn.com/website',
 'https://ShaunSaunders.podbean.com',
 'https://www.ivoox.com/podcast-la-encarnacion-divina_sq_f1366021_1.html',
 'https://anchor.fm/wisconsinprod',
 'https://redstonearchender.wordpress.com',
 'https://soundcloud.com/manqeezus',
 'https://anchor.fm/tayland-garrison6',
 'www.ieshivaanoji.com',
 'https://anchor.fm/en-notes',
 'https://anchor.fm/shotaro-miyahara']

In [11]:
len(set(new_muckrack_ns))

594687

#### Let's clean it

In [12]:
clean_new_muckrack = set()

for nsm in new_muckrack_ns:
    extracted = tldextract.extract(nsm)
    # read the fields by name rather than unpacking the result as a tuple
    subdomain, domain, suffix = extracted.subdomain, extracted.domain, extracted.suffix
    # add both versions: subdomain.domain.suffix (when a subdomain exists) and domain.suffix
    if subdomain:
        full = f"{subdomain}.{domain}.{suffix}"
        clean_new_muckrack.add(full.replace("www.", "").strip('/'))
    # without subdomain; this string always contains the dot, so it is added unconditionally
    full = f"{domain}.{suffix}"
    clean_new_muckrack.add(full.replace("www.", "").strip('/'))
len(clean_new_muckrack)

450591

In [13]:
list(clean_new_muckrack)[:10]

['',
 'gercekgazetesi.net',
 'threedaysofglory.com',
 'mylifestyleacademy.com',
 'blancmange.tmstor.es',
 'commercetomorrow.com',
 'whatisblackpodcast.com',
 'prestigepraguetours.com',
 'elmerday.co.uk',
 'donate.unicef.ph']
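The same cleaning loop is used for each source list in this notebook, so it could be factored into one helper. The sketch below takes the three parts that `tldextract.extract` returns and produces the host names to keep. The name `host_variants` is hypothetical, and the sketch deliberately deviates from the loop above in two labeled ways: it drops only a leading `www` label, and it returns nothing when the domain or suffix is empty (with the f-strings above, such inputs produce strings with a stray dot, or `""` after the `www.` replacement).

```python
def host_variants(subdomain, domain, suffix):
    """Return the host names to keep for one extracted URL.

    Always includes the registered domain (domain.suffix) and, when a
    subdomain other than a leading "www" is present, the full host too.
    Assumption (differs from the notebook loop): inputs without a domain
    or without a suffix yield an empty set.
    """
    if not domain or not suffix:
        return set()
    registered = f"{domain}.{suffix}"
    variants = {registered}
    labels = [label for label in subdomain.split(".") if label]
    if labels and labels[0].lower() == "www":
        labels = labels[1:]  # drop only a leading "www" label
    if labels:
        variants.add(".".join(labels + [registered]))
    return variants


# With tldextract this would be called as (not executed here):
#     r = tldextract.extract(url)
#     clean.update(host_variants(r.subdomain, r.domain, r.suffix))
print(sorted(host_variants("donate", "unicef", "ph")))
print(sorted(host_variants("www", "forbes", "com")))
```

Because the helper changes which edge cases are kept, swapping it in would likely shift the set sizes reported in this notebook slightly.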

<a id="sec2b"></a>

### Old Muck Rack

In [14]:
with open("D:\\Wellesley\\F21\\thesis\\data\\muckrack\\all_old_muckrack.json", "r", encoding="utf-8") as mrinfile1:
    old_muckrack = json.load(mrinfile1)

In [15]:
old_muckrack_ns = list(old_muckrack.keys())
old_muckrack_ns[:10]

['http://theunionstar.com',
 'http://uniontimes.org',
 'http://unionrecorder.com',
 'http://unionandtimes.com',
 'http://unionvillerepublicanonline.com',
 'http://blog.uniquehomes.com',
 'http://uniservity.com',
 'http://uniteseattlemag.com',
 'http://unite.ai',
 'http://unitedbypop.com']

#### Let's clean it

In [16]:
len(set(old_muckrack_ns))

77878

In [17]:
clean_old_muckrack = set()

for lsm in old_muckrack_ns:
    extracted = tldextract.extract(lsm)
    # read the fields by name rather than unpacking the result as a tuple
    subdomain, domain, suffix = extracted.subdomain, extracted.domain, extracted.suffix
    # add both versions: subdomain.domain.suffix (when a subdomain exists) and domain.suffix
    if subdomain:
        full = f"{subdomain}.{domain}.{suffix}"
        clean_old_muckrack.add(full.replace("www.", "").strip('/'))
    # without subdomain; this string always contains the dot, so it is added unconditionally
    full = f"{domain}.{suffix}"
    clean_old_muckrack.add(full.replace("www.", "").strip('/'))
len(clean_old_muckrack)

67782

In [18]:
list(clean_old_muckrack)[:10]

['asahi.com',
 'subscribe.micromart.co.uk',
 'wrpi.org',
 'hrtools.com.mx',
 'caribbeancricket.com',
 'forbes.com',
 'artnewengland.com',
 'contactmanagement.ca',
 'bloomsburyreview.com',
 'inlandnwbroadcasting.com']

<a id="sec2c"></a>
### Union between Old and New Muck Rack

In [19]:
len(clean_old_muckrack & clean_new_muckrack)

60490

In [20]:
import difflib

In [21]:
difflib.get_close_matches('995magicfm.com', list(old_muckrack.keys()))

['http://995magicfm.com', 'http://989magicfm.com', 'http://magic93fm.com']

In [22]:
difflib.get_close_matches('http://995magicfm.com', list(new_muckrack.keys()))

['http://989magicfm.com', 'http://magic93fm.com', 'http://magic995fm.com']
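The two lookups above suggest that `http://995magicfm.com` from the old scrape appears as `http://magic995fm.com` in the new one. Below is a small sketch of that spot-check as a reusable function: look for an exact key first, then fall back to `difflib.get_close_matches`. The function name `find_source` is an assumption, and the three-item candidate list is copied from the output above as a stand-in for the full key list.

```python
import difflib


def find_source(url, known_urls, n=3, cutoff=0.6):
    """Return (exact_match_found, matches) for a URL among known URLs."""
    known = list(known_urls)
    if url in known:
        return True, [url]
    # No exact key: fall back to the most similar strings above the cutoff.
    return False, difflib.get_close_matches(url, known, n=n, cutoff=cutoff)


# Stand-in for list(new_muckrack.keys()), taken from the output above.
new_keys = [
    "http://989magicfm.com",
    "http://magic93fm.com",
    "http://magic995fm.com",
]
found, candidates = find_source("http://995magicfm.com", new_keys)
print(found, candidates)
```

String similarity only proposes candidates; whether two URLs refer to the same outlet still needs a manual check.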

In [23]:
muckrack_union = clean_old_muckrack | clean_new_muckrack

In [24]:
muckrack_union.discard("")  # drop the empty host; discard() is a no-op if it is absent

In [25]:
len(muckrack_union)

457882
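The sizes reported above are consistent with inclusion-exclusion, |A ∪ B| = |A| + |B| − |A ∩ B|, once the removed empty string is accounted for. Here is a quick arithmetic check using only the numbers printed in this notebook:

```python
new_size = 450591        # len(clean_new_muckrack)
old_size = 67782         # len(clean_old_muckrack)
overlap = 60490          # len(clean_old_muckrack & clean_new_muckrack)
reported_union = 457882  # len(muckrack_union) after the empty string was removed

# Inclusion-exclusion: items counted in both sets are subtracted once.
union_before_removal = new_size + old_size - overlap
print(union_before_removal)
# Removing the single "" entry should give the reported size.
print(union_before_removal - 1 == reported_union)
```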

In [26]:
# with open ("D:\\Wellesley\\F21\\thesis\\data\\muckrack\\muckrack_union.json", "w", encoding="utf-8") as unionfile:
#     json.dump(list(muckrack_union), unionfile, indent=4)

<a id="sec3"></a>
## Intersection between Muck Rack and GDELT

In [27]:
gm_intersection = set(gdelt_ns) & set(muckrack_union)
len(gm_intersection)

42477

In [28]:
# with open ("D:\\Wellesley\\F21\\thesis\\data\\gm_intersection.json", "w", encoding="utf-8") as outfile:
#     json.dump(list(gm_intersection), outfile, indent=4)

In [31]:
len(gdelt_ns)

205994

In [29]:
len(set(gdelt_ns) - gm_intersection)

163517

In [33]:
len(gm_intersection)/len(gdelt_ns)

0.20620503509810964

In [32]:
len(muckrack_union)

457882

In [30]:
len(set(muckrack_union) - gm_intersection)

415405

In [34]:
len(gm_intersection)/len(muckrack_union)

0.0927684425244932
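The cells above compute the overlap piece by piece: the shared count, what is left on each side, and the share of each list that the intersection covers (about 21% of GDELT and about 9% of the Muck Rack union). Below is a minimal sketch that gathers the same quantities in one call. `overlap_report` is a hypothetical helper, and the toy sets only reuse a few host names from the samples above.

```python
def overlap_report(a, b):
    """Summarize how two sets overlap and what share of each set is shared."""
    a, b = set(a), set(b)
    shared = a & b
    return {
        "shared": len(shared),
        "only_a": len(a - b),
        "only_b": len(b - a),
        "share_of_a": len(shared) / len(a) if a else 0.0,
        "share_of_b": len(shared) / len(b) if b else 0.0,
    }


# Toy inputs; in the notebook these would be gdelt_ns and muckrack_union.
toy_gdelt = {"forbes.com", "asahi.com", "addapinch.com", "melted.design"}
toy_muckrack = {"forbes.com", "asahi.com", "unite.ai"}
report = overlap_report(toy_gdelt, toy_muckrack)
print(report)
```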

<a id="sec4"></a>
## Intersection between Muck Rack, GDELT, and MBFC

In [74]:
categories = ["center", "conspiracy", "fake-news", "left", "leftcenter", "pro-science", "pseudoscience-dictionary", "right", "right-center", "satire"]

In [77]:
mbfc_all = set()
mbfc_divided = defaultdict(set)

for c in categories:
    with open(f"D:\\Wellesley\\F21\\thesis\\data\\mbfc\\mbfc_{c}.json", "r", encoding="utf-8") as infile:
        entries = json.load(infile)
        # each entry is a (source_name, source_link, page_link) triple
        for source_name, source_link, page_link in entries:
            mbfc_all.add(source_link)  # raw link; cleaned in the next section
            mbfc_divided[c].add(source_link)

In [163]:
len(mbfc_all)

3248

In [164]:
list(mbfc_all)[:10]

['',
 'www.srnnews.com',
 'north99.org',
 'wcyb.com',
 'www.gallup.com',
 'adamscountytimes.com',
 'sedenvernews.com',
 'southbirminghamtimes.com',
 'www.ced.org',
 'www.ihealthtube.com']

### Cleaning `mbfc_all`

In [165]:
mbfc_summary = set()

for ns in mbfc_all:
    extracted = tldextract.extract(ns)
    # read the fields by name rather than unpacking the result as a tuple
    subdomain, domain, suffix = extracted.subdomain, extracted.domain, extracted.suffix
    # add both versions: subdomain.domain.suffix (when a subdomain exists) and domain.suffix
    if subdomain:
        full = f"{subdomain}.{domain}.{suffix}"
        mbfc_summary.add(full.replace("www.", "").strip('/'))
    # without subdomain; this string always contains the dot, so it is added unconditionally
    full = f"{domain}.{suffix}"
    mbfc_summary.add(full.replace("www.", "").strip('/'))
len(mbfc_summary)

3289

In [166]:
list(mbfc_summary)[:10]

['north99.org',
 'wcyb.com',
 'betootaadvocate.com',
 'colddeadhands.us',
 'wikinews.org',
 'adamscountytimes.com',
 'sedenvernews.com',
 'southbirminghamtimes.com',
 'oklahoman.com',
 'centralgeorgianews.com']

In [167]:
with open("D:\\Wellesley\\F21\\thesis\\data\\mbfc\\mbfc_summary.json", "w", encoding="utf-8") as outfile1:
    json.dump(list(mbfc_summary), outfile1, indent=4)
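`list(mbfc_summary)` follows the set's iteration order, which for strings can differ from one Python session to the next, so the saved file may change order between runs. Below is a minimal sketch of writing a sorted list instead, shown against an in-memory buffer. The three hosts are taken from the sample above.

```python
import io
import json

hosts = {"wcyb.com", "north99.org", "oklahoman.com"}

buffer = io.StringIO()  # stands in for the output file
json.dump(sorted(hosts), buffer, indent=4)  # sorted() gives a stable, diff-friendly order

saved = json.loads(buffer.getvalue())
print(saved)
```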

#### Intersection with GDELT and Muck Rack

In [168]:
len(mbfc_summary & set(gm_intersection))

1631

In [169]:
gmm_intersection = list(mbfc_summary & set(gm_intersection))

In [170]:
with open("D:\\Wellesley\\F21\\thesis\\data\\gmm_intersection.json", "w", encoding="utf-8") as outfile2:
    json.dump(gmm_intersection, outfile2)
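`mbfc_divided` is filled in earlier but not used afterwards. One possible follow-up, sketched here with toy data, is to count how many sources in each MBFC category survive the three-way intersection. The function name `count_by_category` is hypothetical, and the sketch assumes the per-category links have already been cleaned the same way as `mbfc_summary`. The hosts and category labels below are placeholders invented to exercise the code, not real MBFC ratings.

```python
from collections import Counter


def count_by_category(divided, kept):
    """Count, per category, the sources that also appear in `kept`."""
    kept = set(kept)
    counts = Counter()
    for category, sources in divided.items():
        counts[category] = len(set(sources) & kept)
    return counts


# Placeholder data only; real inputs would be cleaned per-category sets
# and the gmm_intersection list.
toy_divided = {
    "category-a": {"one.example", "two.example"},
    "category-b": {"three.example"},
    "category-c": {"four.example"},
}
toy_kept = ["one.example", "three.example", "five.example"]
toy_counts = count_by_category(toy_divided, toy_kept)
print(dict(toy_counts))
```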