# Reading Reddit Data

Some of the Reddit datasets are distributed as Zstandard-compressed (`.zst`) archives. I found the following code in the PushShift materials.

This notebook extracts links for one month only. To cover a full year, download the 12 monthly files and run the same steps once per file.

In [2]:
import zstandard as zstd
import ujson as json

**A Simple Class to Read the Compressed Archive**

In [3]:
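class Zreader:

    def __init__(self, file, chunk_size=16384):
        '''Open the compressed file and set up a streaming decompressor'''
        self.fh = open(file, 'rb')
        self.chunk_size = chunk_size
        self.dctx = zstd.ZstdDecompressor()
        self.reader = self.dctx.stream_reader(self.fh)
        # Keep the buffer as bytes so that a multi-byte UTF-8 character
        # split across two chunks is never decoded in halves
        self.buffer = b''

    def readlines(self):
        '''Generator method that yields one line of JSON at a time'''
        while True:
            chunk = self.reader.read(self.chunk_size)
            if not chunk:
                break
            lines = (self.buffer + chunk).split(b"\n")

            for line in lines[:-1]:
                yield line.decode()

            # The last piece may be an incomplete line; keep it for the next chunk
            self.buffer = lines[-1]

        # Yield a final line that has no trailing newline
        if self.buffer:
            yield self.buffer.decode()
            self.buffer = b''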
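
The buffering logic in `readlines` can be tried without an archive. The sketch below uses only the standard library and a hypothetical helper name, `iter_lines`; it splits on the newline byte before decoding, so a multi-byte UTF-8 character that straddles a chunk boundary is decoded whole. It illustrates the idea and is not a replacement for the class.

```python
def iter_lines(chunks):
    """Yield decoded lines from an iterable of byte chunks."""
    buffer = b""
    for chunk in chunks:
        lines = (buffer + chunk).split(b"\n")
        # Every piece except the last one is a complete line
        for line in lines[:-1]:
            yield line.decode("utf-8")
        buffer = lines[-1]
    # A final line without a trailing newline is still a line
    if buffer:
        yield buffer.decode("utf-8")


# "é" takes two bytes in UTF-8; put the chunk boundary between them
data = '{"body": "café"}\n{"body": "ok"}'.encode("utf-8")
split_at = data.index("é".encode("utf-8")) + 1
chunks = [data[:split_at], data[split_at:]]

print(list(iter_lines(chunks)))
```

Decoding each chunk on its own would raise `UnicodeDecodeError` on the first chunk here, because it ends in the middle of a character.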

**Instantiate the reader of the archive**

In [22]:
# Adjust chunk_size as necessary -- defaults to 16,384 if not specified
reader = Zreader("RC_2019-06.zst", chunk_size=8192)
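
To process a whole year, the file names can be generated rather than typed by hand. This is a small sketch that assumes the other months follow the same naming pattern as `RC_2019-06.zst`; the helper name `monthly_archives` is made up for illustration.

```python
def monthly_archives(year, prefix="RC"):
    """Return the 12 assumed archive names for one year, January first."""
    return [f"{prefix}_{year}-{month:02d}.zst" for month in range(1, 13)]


files = monthly_archives(2019)
print(files[5])
# Each name could then be passed to Zreader in turn, one month at a time
```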

**Store only partial posts that contain a URL in the body**

In [23]:
postCnt = 0
postsWithURLs = []

for line in reader.readlines():
    obj = json.loads(line)
    postCnt += 1
    if ('body' in obj) and ('http' in obj['body']):
        postsWithURLs.append({key: obj[key] for key in ['body', 'created_utc', 
                                               'score', 'subreddit'] if key in obj})

        
print(len(postsWithURLs))

6041437
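
The filtering step can also be written as a small function, which makes it easy to try on a few hand-written lines before running it on the full archive. This is only a sketch: `partial_post` and the sample lines are invented for illustration, and it uses the standard `json` module instead of `ujson`.

```python
import json

KEYS = ['body', 'created_utc', 'score', 'subreddit']


def partial_post(line, keys=KEYS):
    """Return the reduced post if its body contains 'http', else None."""
    obj = json.loads(line)
    if 'http' in obj.get('body', ''):
        return {key: obj[key] for key in keys if key in obj}
    return None


# Two made-up lines in the same JSON-per-line layout
sample = [
    '{"body": "see https://example.com", "score": 3, "subreddit": "test", "author": "a"}',
    '{"body": "no link here", "score": 1, "subreddit": "test"}',
]
kept = [p for p in map(partial_post, sample) if p is not None]
print(kept)
```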


**Total number of posts with URLs as a portion of all posts**

In [24]:
len(postsWithURLs)/postCnt

0.04502723906196912

Only 4.5% of all posts contain at least one link.

**Example of a partial post**

In [30]:
postsWithURLs[108010]

{'body': "Why haven't you followed[ this twitter account](https://twitter.com/UNmigration) yet?",
 'created_utc': 1559407265,
 'score': 2,
 'subreddit': 'neoliberal'}

## Extract URLs from text

I found the following function on the Web. I'm not convinced it is the best way to retrieve URLs, and the URLs it returns are not very clean, but for the moment it might be good enough.

In [32]:
import re

def findURLs(string):
    """One possible way to extract URLs from text.
    It finds all URLs in a long string.
    """
    regex = r"(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'\".,<>?«»“”‘’]))"
    url = re.findall(regex,string)      
    return [x[0] for x in url]

**IMPORTANT NOTE:** The function above is too slow and should not be used. Below is a faster version that extracts URLs from text.

In [42]:
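def findU(phrase):
    """Faster URL extraction; only matches URLs with an http or https scheme."""
    # Raw string avoids invalid-escape warnings; the pattern itself is unchanged
    regex = re.compile(r'((https?):((//)|(\\))+([\w\d:#@%/;$()~_?\+-=\\.&](#!)?)*)')
    url = re.findall(regex, phrase)
    return [x[0] for x in url]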

We'll add the URLs to the existing dataset of `postsWithURLs`.

In [43]:
for post in postsWithURLs:
    post['urls'] = findU(post['body'])

In [37]:
postsWithURLs[3]

{'body': '[Just gave $27.](https://imgur.com/FhFdthj)',
 'created_utc': 1559347203,
 'score': 2,
 'subreddit': 'SandersForPresident',
 'urls': ['https://imgur.com/FhFdthj']}

In [44]:
# Sanity check: count the posts that now have five keys
# (the four original fields plus 'urls')
done = [p for p in postsWithURLs if len(p) == 5]
len(done)

6041437

## Find total number of unique URLs / unique domains

In [45]:
from collections import defaultdict, Counter

In [46]:
uniqueURLs = Counter()

for post in postsWithURLs:
    urls = post['urls']
    for url in urls:
        uniqueURLs[url] += 1

In [48]:
len(uniqueURLs)

5373689

In [49]:
uniqueURLs.most_common(10)

[('https://www.reddit.com/message/compose?to=/r/dankmemes)', 209598),
 ('https://www.reddit.com/r/TranscribersOfReddit/wiki/index)', 200912),
 ('https://github.com/GrafeasGroup/tor)', 142651),
 ('https://www.reddit.com/message/compose?to=%2Fr%2FTranscribersOfReddit&amp;subject=Bot%20Question&amp;message=)',
  142651),
 ('https://www.reddit.com/r/PewdiepieSubmissions/comments/c0m06h/introducing_community_moderation/)',
  63539),
 ('https://discord.gg/BzUnwjt)', 60620),
 ('https://www.reddit.com/r/Market76/comments/biz92l/ign_megathread/)', 60618),
 ('https://reddit.com/message/compose?to=%2Fr%2FMarket76)', 60566),
 ('https://www.reddit.com/r/Market76/comments/bj7uzm/43019_big_subreddit_changes/)',
  60562),
 ('https://www.reddit.com/r/Market76/wiki/index)', 60562)]
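
Most of the URLs in this list end with a stray `)`, left over from Markdown links of the form `[text](url)`, and some still contain the HTML entity `&amp;`. A possible cleanup step is sketched below. The function name `clean_url` is hypothetical, and the rule (drop closing parentheses that have no opening partner) is an assumption that may not suit every URL.

```python
def clean_url(url):
    """Undo '&amp;' escaping and strip unmatched trailing ')' characters."""
    # Only '&amp;' is replaced, so other query parameters stay untouched
    url = url.replace('&amp;', '&')
    # Keep balanced parentheses, e.g. in Wikipedia article names
    while url.endswith(')') and url.count(')') > url.count('('):
        url = url[:-1]
    return url


print(clean_url('https://github.com/GrafeasGroup/tor)'))
print(clean_url('https://en.wikipedia.org/wiki/Python_(programming_language)'))
```

Applying such a step before counting would merge variants of the same link that differ only by the trailing parenthesis.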

Extract the domain name and organize links by domain, together with their counts.

In [50]:
from urllib.parse import urlparse
linksCounter = defaultdict(Counter)

for url in uniqueURLs:
    domain = urlparse(url).netloc
    linksCounter[domain][url] = uniqueURLs[url]

Total number of domains:

In [51]:
len(linksCounter)

367769

### Create a counter for domains and their total links

In [52]:
domainsWithCounts = Counter()

for dom in linksCounter:
    domainsWithCounts[dom] = sum(linksCounter[dom].values())

In [53]:
domainsWithCounts.most_common(10)

[('www.reddit.com', 4803689),
 ('www.youtube.com', 482683),
 ('np.reddit.com', 401595),
 ('github.com', 381511),
 ('imgur.com', 367810),
 ('youtu.be', 361313),
 ('i.imgur.com', 324031),
 ('reddit.com', 222838),
 ('en.wikipedia.org', 209967),
 ('discord.gg', 180598)]
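
The list counts `www.reddit.com` and `reddit.com` as separate domains. If they should be treated as one, the host name can be normalized before counting. A minimal sketch, with a hypothetical helper `normalize_domain`; whether subdomains such as `np.reddit.com` should also be folded in is a separate decision that this sketch does not make.

```python
from urllib.parse import urlparse


def normalize_domain(url):
    """Return the lower-cased host name without a leading 'www.'."""
    # .hostname is already lower-cased and excludes any port number
    host = urlparse(url).hostname or ''
    if host.startswith('www.'):
        host = host[4:]
    return host


print(normalize_domain('https://www.reddit.com/r/Market76/wiki/index'))
print(normalize_domain('https://np.reddit.com/r/test'))
```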

**How many links point to Reddit itself?**

In [54]:
reddit = [dom for dom in domainsWithCounts if 'reddit' in dom]
len(reddit)

1534

In [55]:
reddit[:10]

['www.reddit.com',
 'www\\.reddit\\.com',
 'reddit.com',
 'old.reddit.com',
 'np.reddit.com',
 'contact.dankmemesreddit.com)',
 'reddit.comepnk3a7)',
 'reddit.zendesk.com',
 'redditsearch.io',
 'redditmetrics.com']

In [57]:
sumReddit = sum([domainsWithCounts[dom] for dom in reddit])
sumReddit

5610232

In [59]:
sumReddit/sum(uniqueURLs.values())

0.42330070657359564

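**RESULT:** About 42% of all links point to a domain whose name contains "reddit". The substring match also counts third-party domains such as `redditsearch.io` and `redditmetrics.com`, so the share for Reddit's own domains is somewhat lower.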
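
The test `'reddit' in dom` matches any domain that contains the string, including `redditsearch.io` and `reddit.zendesk.com` from the list above. A stricter check could accept only `reddit.com` and its subdomains. The sketch below uses a hypothetical helper `is_reddit_domain` and a handful of the domains shown earlier.

```python
def is_reddit_domain(dom):
    """True for reddit.com and any of its subdomains."""
    dom = dom.lower()
    return dom == 'reddit.com' or dom.endswith('.reddit.com')


domains = ['www.reddit.com', 'reddit.com', 'old.reddit.com', 'np.reddit.com',
           'reddit.zendesk.com', 'redditsearch.io', 'redditmetrics.com']
strict = [d for d in domains if is_reddit_domain(d)]
print(strict)
```

Summing `domainsWithCounts` over such a filtered list would give the share of links that stay on Reddit's own domain.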

In [60]:
import json

# Note: the file name says 2020, but the data comes from RC_2019-06 (June 2019)
with open('june2020-domains.json', 'w') as f:
    json.dump(domainsWithCounts, f)
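
The saved file is plain JSON, so it loads back as an ordinary `dict`. Wrapping it in a `Counter` restores methods such as `most_common`. A small round-trip sketch with made-up counts and a temporary file:

```python
import json
import os
import tempfile
from collections import Counter

# Made-up counts standing in for domainsWithCounts
counts = Counter({'www.reddit.com': 3, 'github.com': 1})

path = os.path.join(tempfile.mkdtemp(), 'domains.json')
with open(path, 'w') as f:
    json.dump(counts, f)

with open(path) as f:
    restored = Counter(json.load(f))

print(restored.most_common(1))
```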