# Finding Intersection between Muck Rack and GDELT

"What counts as a news source?" is an important question. Muck Rack and GDELT have their own definition of what can be considered as a news source. An intersection between a set of manually curated news source selection from Muck Rack, and a set of news sources recognized by the global database GDELT will provide us with news sources that are recognized by at least two institutions.

Table of content

1. [Read news sources from GDELT](#sec1)
    * [Does GDELT have domains and subdomains OR just domains?](#sec1a) Gdelt has subdomains!
    
2. [Read news sources from Muck Rack](#sec2)
    * [New Muck Rack](#sec2a)
    * [Old Muck Rack](#sec2b)
    * [Union between old and new Muck Rack](#sec2c)

3. [Intersection between Muck Rack and GDELT](#sec3)
4. [Intersection between Muck Rack, GDELT, and MBFC](#sec4)

In [21]:
import json
import csv

from collections import defaultdict, Counter

import tldextract

<a id="sec1"></a>
## Read news sources from GDELT

In [14]:
years = [2016, 2017, 2018, 2019, 2020, 2021]

In [17]:
gdelt_ns = set()

for year in years:
    with open (f"D:\\Wellesley\\F21\\thesis\\data\\gdelt\\gdelt_{year}.csv", "r", encoding="utf-8") as gi:
        reader = csv.DictReader(gi)
        for row in reader:
            gdelt_ns.add(row['SourceCommonName'])

In [19]:
gdelt_ns = list(gdelt_ns)
len(gdelt_ns)

205998

In [20]:
gdelt_ns[:10]

['',
 'academyofmusic.ac.uk',
 'websterprogresstimes.com',
 'auto.onliner.by',
 'e-zone.com.hk',
 'leithcars.com',
 'dailyfootball.ir',
 'kolping.hotel.hu',
 'tuttotritiumgiana.com',
 'tonisant.com']

<a id="sec1a"></a>
### Does GDELT have domains and subdomains OR just domains? YES, with subdomains!

In [24]:
with_subdomains = []

for gn in gdelt_ns:
    extracted = tldextract.extract(gn)
    subdomain, domain, suffix = extracted
    if subdomain:
        with_subdomains.append(gn)
        
print(f"there are {len(with_subdomains)} news sources with subdomains.")

there are 15441 news sources with subdomains.


In [25]:
with_subdomains

['auto.onliner.by',
 'medicina.finance.si',
 'wojciechsadurski.natemat.pl',
 'boonepeter.github.io',
 'meloncafe.blog.jp',
 'auto.idnes.cz',
 'rust-lang.github.io',
 'auto.9am.ro',
 'blogs.citizenmatters.in',
 'blogs.prensaescuela.es',
 'armavir.riasv.ru',
 'blog.360dgrs.nl',
 'strefabiznesu.gloswielkopolski.pl',
 'anhdfjdni.diskstation.eu',
 'afukurs.oevsv.at',
 'paladin-t.github.io',
 'gmina.nowykorczyn.pl',
 'afrique.latribune.fr',
 'm.thepaper.cn',
 'old.klops.ru',
 'dn.kartquiz.se',
 'shop.welt.de',
 'hel.naszemiasto.pl',
 'babluboy.github.io',
 'site.fest.pt',
 'kj.81.cn',
 'vi.rfi.fr',
 'nsk.rbc.ru',
 'm.pg13.ru',
 'apps.vienna.at',
 'news.gtp.gr',
 'xl.huni.is',
 'fr.ami.mr',
 'alberta.ctvnews.ca',
 'ub.uni-muenchen.de',
 'kromerizsky.denik.cz',
 'en.vogue.fr',
 'especial.t13.cl',
 'roadshow.searchinform.ru',
 'jacektabisz.natemat.pl',
 'fici.sme.sk',
 'eichsfeld.tlz.de',
 'irpen-bucha.est.ua',
 'omma-son.doorblog.jp',
 'motori.corriere.it',
 'crimea.kp.ru',
 'grosi.znaj.ua',
 

<a id="sec2"></a>
## Read news sources from Muck Rack

<a id="sec2a"></a>
### First, new Muck Rack

(scraped in January 2022)

In [4]:
with open("D:\\Wellesley\\F21\\thesis\\data\\muckrack\\all_new_muckrack.json", "r", encoding="utf-8") as mrinfile:
    new_muckrack = json.load(mrinfile)

In [7]:
new_muckrack_ns = list(new_muckrack.keys())
new_muckrack_ns[:10]

['http://absolutelynot.libsyn.com/website',
 'https://ShaunSaunders.podbean.com',
 'https://www.ivoox.com/podcast-la-encarnacion-divina_sq_f1366021_1.html',
 'https://anchor.fm/wisconsinprod',
 'https://redstonearchender.wordpress.com',
 'https://soundcloud.com/manqeezus',
 'https://anchor.fm/tayland-garrison6',
 'www.ieshivaanoji.com',
 'https://anchor.fm/en-notes',
 'https://anchor.fm/shotaro-miyahara']

In [9]:
len(set(new_muckrack_ns))

594687

#### Let's clean it

In [49]:
clean_new_muckrack = set()

for nsm in new_muckrack_ns:
    extracted = tldextract.extract(nsm)
    subdomain, domain, suffix = extracted
    # add both versions of domain.suffix and subdomain.domain.suffix
    full = ""
    if len(subdomain) > 0:
        #print(f"{subdomain}.{domain}.{suffix}")
        full = f"{subdomain}.{domain}.{suffix}"
    else:
        full = f"{domain}.{suffix}"
    if len(full) > 0:
        clean_new_muckrack.add(full.strip("www.").strip('/'))
len(clean_new_muckrack)

433925

In [50]:
list(clean_new_muckrack)[:10]

['',
 'theorydish.blog',
 'allsaredancing.fr',
 'lists.gt.net',
 'theinnerconnection.co',
 'realproblemspodcast.com',
 'cippenhamttc.co.uk',
 'dylanpaulus.com',
 'apologetikalt.podbean.com',
 'collegesportsmaven.io']

<a id="sec2b"></a>

### Old Muck Rack

In [40]:
with open("D:\\Wellesley\\F21\\thesis\\data\\muckrack\\all_old_muckrack.json", "r", encoding="utf-8") as mrinfile1:
    old_muckrack = json.load(mrinfile1)

In [42]:
old_muckrack_ns = list(old_muckrack.keys())
old_muckrack_ns[:10]

['http://theunionstar.com',
 'http://uniontimes.org',
 'http://unionrecorder.com',
 'http://unionandtimes.com',
 'http://unionvillerepublicanonline.com',
 'http://blog.uniquehomes.com',
 'http://uniservity.com',
 'http://uniteseattlemag.com',
 'http://unite.ai',
 'http://unitedbypop.com']

#### Let's clean it

In [45]:
len(set(old_muckrack_ns))

77878

In [51]:
clean_old_muckrack = set()

for lsm in old_muckrack_ns:
    extracted = tldextract.extract(lsm)
    subdomain, domain, suffix = extracted
    # add both versions of domain.suffix and subdomain.domain.suffix
    full = ""
    if len(subdomain) > 0:
        #print(f"{subdomain}.{domain}.{suffix}")
        full = f"{subdomain}.{domain}.{suffix}"
    else:
        full = f"{domain}.{suffix}"
    if len(full) > 0:
        clean_old_muckrack.add(full.strip("www.").strip('/'))
len(clean_old_muckrack)

66480

In [52]:
list(clean_old_muckrack)[:10]

['vozlatinanewspaper.com',
 'b95forlife.iheart.com',
 'cordeledispatch.com',
 'invisiblewheelchair.com',
 'en.shafaqna.com',
 'fly2let.co.uk',
 'overloadr.com.br',
 'ukconstructionmedia.co.uk',
 'competitionlawinsight.com',
 'borntoplay.es']

<a id="sec2c"></a>
### Union between Old and New Muck Rack

In [53]:
len(clean_old_muckrack & clean_new_muckrack)

59223

In [55]:
import difflib

In [59]:
difflib.get_close_matches('995magicfm.com', list(old_muckrack.keys()))

['http://995magicfm.com', 'http://989magicfm.com', 'http://magic93fm.com']

In [61]:
difflib.get_close_matches('http://995magicfm.com', list(new_muckrack.keys()))

['http://989magicfm.com', 'http://magic93fm.com', 'http://magic995fm.com']

In [62]:
muckrack_union = clean_old_muckrack | clean_new_muckrack

In [63]:
len(muckrack_union)

441182

In [70]:
muckrack_union.remove("")

In [71]:
with open ("D:\\Wellesley\\F21\\thesis\\data\\muckrack\\muckrack_union.json", "w", encoding="utf-8") as unionfile:
    json.dump(list(muckrack_union), unionfile, indent=4)

<a id="sec3"></a>
## Intersection between Muck Rack and GDELT

In [72]:
gm_intersection = set(gdelt_ns) & set(muckrack_union)
len(gm_intersection)

38656

In [73]:
with open ("D:\\Wellesley\\F21\\thesis\\data\\muckrack\\gm_intersection.json", "w", encoding="utf-8") as outfile:
    json.dump(list(gm_intersection), outfile, indent=4)

<a id="sec4"></a>
## Intersection with MBFC

In [None]:
with open("")