# <center>           Reddit: A Window Into Ancient Worlds     </center>

<img src="../images/romevgreece.jpg"
     style="float: left; margin-right: 2px;width:600px" />

## Background

For centuries, historians have done the bulk of their work in the field. Be it an archaeological excavation at a remote location or the arduous sifting of old documents in city archives, historians have seldom made a revolutionary discovery just by sitting in their offices and pondering times past.

That has changed over the past few decades as more and more research has been published online. Detailed lists of pottery found during an archaeological dig, groundbreaking research articles shedding light on previously unknown aspects of life long ago, and even those dusty old archival texts have all, slowly but surely, found their way onto the World Wide Web.

With such a vast array of historical information at their fingertips, historians can no longer rely on simple browser search techniques to find relevant content. As a result, a top university has hired us to build a machine learning model that can perform that search for them, first for history and, down the road, for other academic departments.

Why did we choose to start with history? There are several reasons, but the main one is that...it's in the past. As such, the topics, discussions and words used do not change as much as in fields such as technology and medicine, where disciplines seem to find an entirely new paradigm from one year to the next. To be sure, progress in historical research still takes place, but instead of reinventing the entire field, it takes place at the margins. That is why history is well suited to a machine learning model - once a model learns the language used in a historical topic, that language will not change much from year to year, making the model efficient to use and simple to maintain for longer periods of time.


## Task

The first stage of teaching the machine to recognize topics is to train it to distinguish between just a pair of them. This is where this project comes in.

We have chosen to train and test the model on Reddit, namely on its two subreddits dedicated to Ancient Greece and Ancient Rome. Reddit, as an online public forum, has a wide variety of participants - from university professors to teenagers - and as such represents a microcosm of the online experience. Another benefit is that the two ancient civilizations are far removed from the current events, pop culture and vitriolic debates that pervade online spaces, and yet, because of their importance in the development of our civilization, they still manage to have active subreddits with close to 50k followers each. Such a diverse but relatively serious set of followers should produce less spam and more useful vocabulary for the machine to train on than most other subreddits.

In this first step of the project we will develop a model that can learn to distinguish between texts on ancient Rome and ancient Greece, based on 6000 posts taken from Reddit.

Further on down the line, the model will have to be trained to distinguish between many different topics all at once, as well as to recognize when a topic does not belong to any of the academic fields, but that is beyond the scope of this notebook.

## Metrics and their names

For the purposes of the project, our model evaluation will mostly be based on total accuracy - that is, the percentage of correct predictions. Since there is no real difference between mistaking topic number one for number two and vice versa, there is little point in treating 1 as a positive outcome and 0 as a negative one. Granted, we will still use 1 and 0, but only in purely nominal terms, with 1 representing greece and 0 representing rome.

The concepts of false positive and false negative are likewise not used in this study, and accuracy will not be split into sensitivity (the share of positives correctly identified) and specificity (the share of negatives correctly identified). As neither greece nor rome is inherently positive or negative, such concepts have no place in our project (unless a built-in function automatically displays them).

For right/wrong predictions within each class, we will use adapted per-class metrics that we will call "rome recall" and "greece recall". As the names imply, they describe the number of rome (greece) posts correctly predicted out of the total number of rome (greece) posts in the dataset.
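
To make these definitions concrete, here is a minimal sketch of how total accuracy, "greece recall" and "rome recall" could be computed. The helper names (`accuracy`, `class_recall`, `LABELS`) and the eight example labels are hypothetical and chosen only for illustration; they are not taken from the project's data or code, and the sketch assumes each class appears at least once.

```python
# Label convention from the text: 1 = greece, 0 = rome (purely nominal).
LABELS = {"rome": 0, "greece": 1}


def accuracy(y_true, y_pred):
    """Share of all posts whose subreddit was predicted correctly."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def class_recall(y_true, y_pred, label):
    """Share of the posts truly in `label` that were also predicted as `label`."""
    # keep only the predictions made for posts that really belong to `label`
    predictions_for_class = [p for t, p in zip(y_true, y_pred) if t == label]
    hits = sum(p == label for p in predictions_for_class)
    return hits / len(predictions_for_class)


# Hypothetical true and predicted labels for eight posts, for illustration only.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 1]

total_accuracy = accuracy(y_true, y_pred)
greece_recall = class_recall(y_true, y_pred, LABELS["greece"])
rome_recall = class_recall(y_true, y_pred, LABELS["rome"])
```

In this made-up example, five of the eight posts are classified correctly, three of the four greece posts are recovered, and two of the four rome posts are recovered, so the same accuracy can hide quite different per-class results.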

In [1]:
import requests
import pandas as pd

- The real scraping was originally done for 3000 posts per subreddit, 6000 in total, using the cell below. Those results were saved and used throughout the project. The cell now scrapes only 500 posts per subreddit, which keeps it part of the notebook while saving time; the new results are not used.

In [2]:
url = 'https://api.pushshift.io/reddit/search/submission'  # the base link

# the columns we keep from every batch of posts
columns = ['created_utc', 'title', 'selftext', 'subreddit']

# the combined Ancient Greece/Rome dataframe; it stays None until the
# first batch of posts arrives
reddit = None

# we'll be scraping the ancientgreece and ancientrome subreddits
for subreddit in ['ancientgreece', 'ancientrome']:

    # set the initial time to the year 2030, so scraping starts
    # from the latest post
    utc_time = 1894262032

    # (5 loops) X (2 subreddits) X (100 posts per loop) = 1000 posts;
    # the original run used range(30), giving us 6000 posts to work with
    for i in range(5):
        params = {'subreddit': subreddit,
                  'size': 100,
                  'before': utc_time}

        req = requests.get(url, params=params)
        posts = req.json()['data']
        reddit_temp = pd.DataFrame(posts)[columns]
        if reddit is None:
            # create the dataframe only on the first iteration
            reddit = reddit_temp
        else:
            # on all subsequent iterations, concat by row
            reddit = pd.concat([reddit, reddit_temp], ignore_index=True)

        # reset the time to that of the last row, so the next iteration
        # scrapes only the posts that came earlier, resulting in no overlap
        utc_time = reddit.iloc[-1, 0]
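
The paging idea in the cell above (move the `before` cutoff back to the oldest post seen so far) can be checked offline. The sketch below is only an illustration: `fetch_before` is a hypothetical in-memory stand-in for the API, and it assumes posts come back newest first and that `before` is a strict bound. The post data is made up.

```python
def fetch_before(posts, before, size):
    """Hypothetical stand-in for the API: newest `size` posts older than `before`."""
    older = [p for p in posts if p["created_utc"] < before]
    older.sort(key=lambda p: p["created_utc"], reverse=True)
    return older[:size]


def scrape(posts, start_time, size, loops):
    """Page backwards in time, moving the cutoff to the oldest post seen so far."""
    collected = []
    cutoff = start_time
    for _ in range(loops):
        batch = fetch_before(posts, cutoff, size)
        if not batch:  # nothing older is left, so stop early
            break
        collected.extend(batch)
        cutoff = batch[-1]["created_utc"]  # oldest timestamp in this batch
    return collected


# Made-up posts with timestamps 1..10, for illustration only.
fake_posts = [{"created_utc": t, "title": f"post {t}"} for t in range(1, 11)]
result = scrape(fake_posts, start_time=1894262032, size=3, loops=5)
timestamps = [p["created_utc"] for p in result]
```

Under these assumptions every post is collected exactly once, newest first. One caveat of a strict cutoff is that posts sharing the exact timestamp of a batch's last row could be skipped.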

In [3]:
# this file is different every time we run the notebook...
reddit.to_csv('../datasets/ancients.csv', index=False)

# ... while the file below is used for NLP - it was saved once and the code
# was then commented out, so it's always the same.
# reddit.to_csv('../datasets/ancients_for_NLP.csv', index=False)