# GA Project 3: Classification of Subreddits
This project builds a machine learning classifier with a single goal: to accurately identify which subreddit each entry comes from. The workflow is loosely as follows:

1. **[Webscraping](P3-1_WebScrape.ipynb)** using the [Pushshift API](https://github.com/pushshift/api) to obtain the necessary submissions from each subreddit;
2. Performing **[preprocessing and EDA](P3-2_PP_EDA.ipynb)** on the data obtained; and
3. Evaluating a range of **relevant [models](P3-3_Model_Consolidated.ipynb)** with **hyperparameter tuning** to find the best predictive model for this job, and assessing its results.

## 0 Background of Scenario
*"For roughly 98 percent of the last 2,500 years of Western intellectual history, **philosophy was considered the mother of all knowledge**. It generated most of the fields of research still with us today. This is why we continue to call our highest degrees Ph.D.’s, namely, philosophy doctorates. At the same time, we live an age in which many seem no longer sure what philosophy is or is good for anymore. Most seem to see it as a highly abstracted discipline with little if any bearing on objective reality — something more akin to art, literature or religion. All have plenty to say about reality. But the overarching assumption is that none of it actually qualifies as knowledge until proven scientifically."*

This is the introduction from a [2012 New York Times opinion piece](https://archive.nytimes.com/opinionator.blogs.nytimes.com/2012/04/05/philosophy-is-not-a-science/) about philosophy, and how it has gone from being mankind's source of knowledge and wisdom to being relegated to the realm of the arts. In its place, science has become our society's fundamental source of objective truth, with people trusting scientific methods more than the lofty abstractions philosophers offer.

One observation you may have, however, is that the language used in these two fields often shares some commonality. Both are still treated as formal disciplines, complete with their own insular communities of scholars who debate using evidence-based and research-backed arguments. The common man would thus naturally have trouble distinguishing between the two, and may struggle to separate scientific fact from philosophical reasoning if given only the text.

The moderators of r/science and r/philosophy have caught on to this issue, and have teamed up to solve it by using **machine learning** to tell the difference between the two kinds of rhetoric.

Armed with a trove of data from each subreddit, they hope to create a tool that can help their redditors figure out (*with reasonable accuracy*) **whether their articles/posts/thoughts are better suited for scientific discussion or philosophical debate, and more importantly which subreddit they should go into**.

><font color=red>Clear explanation and problem statement

## 1 Data Extraction - Scraping from Subreddits

This notebook documents and runs the webscraping routine used to collect submission data from r/science and r/philosophy, using the [Pushshift API](https://github.com/pushshift/api). The scrape is set up with the following considerations:

1. Import 25,000 submissions (posts) from each subreddit;
2. Scrape about 200 submissions per request, with a 3-5 second delay between requests;
3. Compile the submissions into a single dataframe with the columns 'subreddit', 'title', 'selftext', 'url', and 'author'. Note that for the purposes of this analysis, only the title and selftext are used as features, with the subreddit as the target vector;
4. Restrict the scrape to posts before Monday, 3 October 2022 16:00:00 GMT (*Epoch Time: 1664812800*), so that the data is consistent each time the scraping function is run.
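
The cut-off in point 4 can be sanity-checked with the standard library. This is a small sketch rather than part of the original notebook, and `CUTOFF_EPOCH` is an illustrative name:

```python
from datetime import datetime, timezone

# Cut-off passed as the `before` parameter (value taken from the list above)
CUTOFF_EPOCH = 1664812800

# Convert the epoch seconds to a timezone-aware UTC datetime
cutoff = datetime.fromtimestamp(CUTOFF_EPOCH, tz=timezone.utc)

# ISO format avoids any locale-dependent day or month names
print(cutoff.isoformat())
# weekday() counts from Monday = 0
print(cutoff.weekday())
```

Keeping the cut-off as a fixed epoch value, rather than "now", is what makes repeated runs request the same window of posts.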

For generalisation, the function takes its settings as arguments, in case it needs to be rerun with different inputs or targets.

In [1]:
# Base Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# For Web Scraping
from bs4 import BeautifulSoup
import requests

# For Pre-Processing/Word Cleaning
import re, nltk, string
import demoji
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Miscellaneous
import time
from tqdm.notebook import tqdm

><font color=red>Import only libraries used in this notebook

In [2]:
# Add a switch so that we can run through the code without running the scrape
run_scrape = False

In [3]:
# Setting base API call url
url = 'https://api.pushshift.io/reddit/search/submission'

In [4]:
# Function to scrape subreddits
def get_subm(posts_per_subred, scrape_size, before, subred_list, columns):
    
    # Setting up loop function to run
    start_time = time.time()

    # Parameter variable setting
    posts_required = posts_per_subred
    size = scrape_size
    posts_before = before
    subreddits = subred_list

    # Setting empty dataframe to start concatenation from
    data_df = pd.DataFrame(columns = columns)

    # Iterating through the two subreddits
    for subreddit in subreddits:

        # Set the starting time to sieve posts from designated epoch time
        time_before = posts_before

        # Iterating through enough scrapes to make up required data size
        # for iterations in range(0,int(posts_required/size)):
        while len(data_df[data_df['subreddit'] == subreddit].index) < posts_required:

            # Parameters for API
            # Note: size is min(batch size, posts required - posts scraped so far)
            # This means the last scrape requests only the remaining balance (which may be <200)
            # This is necessary because the scraper sometimes returns 199 posts instead of the full 200
            # I suspect this is due to some complication on the Reddit end of the API
            # By scraping this way, we always get the desired amount of data, as long as it is available
            params = {
                'subreddit': subreddit, # subreddit to scrape from
                # Number of scrapes per iteration
                'size': min(size,int(posts_required - len(data_df[data_df['subreddit'] == subreddit].index))), 
                'before': time_before, # time ceiling to where to start scraping from
            }

            # Assign the data from the scraped text to a temporary dataframe
            rd_data_tmp = requests.get(url, params=params)
            posts = rd_data_tmp.json()['data']

            # Stop if the API returns no more posts (indexing an empty result would raise an error)
            if not posts:
                break

            data_df_tmp = pd.DataFrame(posts)[columns]

            # Set the scrape timing for next iteration to just before last scraped post
            time_before = posts[-1]['created_utc']

            # Add data to master dataframe, renumbering the rows from 0
            data_df = pd.concat([data_df, data_df_tmp], ignore_index = True)

            # Time delay between scrapes, and milestone tracking
            # Elapsed time, split into whole hours, minutes and seconds
            # (rounding the minutes would overstate them, e.g. 90s would show as 2 min 30 s)
            elapsed = int(time.time() - start_time)
            hours, remainder = divmod(elapsed, 60 ** 2)
            minutes, seconds = divmod(remainder, 60)

            # Print elapsed time and data volume
            print(f'Elapsed time(h.m.s):{hours}.{minutes}.{seconds}, Curr data vol: {len(data_df.index)}')

            # Loop will pause for a random period between 3 & 5s
            time.sleep(np.random.choice(range(3,6))) 

    for i in subreddits:
        print(f"data volume ({i}): {len(data_df[data_df['subreddit'] == i].index)} entries")
    
    return data_df
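
The paging logic in `get_subm` (ask for the remaining balance, then move the `before` ceiling to the oldest post received) can be exercised offline. The sketch below is not the notebook's code: `paginate` and `make_fake_fetch` are hypothetical helpers, and the fake fetcher assumes that `before` excludes the boundary timestamp and that every post has a unique `created_utc`.

```python
def paginate(fetch, posts_required, size, before):
    """Collect up to `posts_required` posts, newest first, walking back in time."""
    collected = []
    while len(collected) < posts_required:
        # Request only the balance still needed, capped at the batch size
        batch = fetch(size=min(size, posts_required - len(collected)), before=before)
        if not batch:
            # Nothing older is available, so stop instead of looping forever
            break
        collected.extend(batch)
        # Move the time ceiling to the oldest post seen so far
        before = batch[-1]['created_utc']
    return collected


def make_fake_fetch(timestamps, short_by=1):
    """Hypothetical stand-in for the API: returns posts strictly older than
    `before`, newest first, and deliberately returns fewer posts than requested."""
    ordered = sorted(timestamps, reverse=True)

    def fetch(size, before):
        older = [t for t in ordered if t < before]
        n = max(size - short_by, 1)
        return [{'created_utc': t} for t in older[:n]]

    return fetch


# 1,000 fake posts with timestamps 1..1000; ask for 450 in batches of 200
fake_fetch = make_fake_fetch(range(1, 1001))
posts = paginate(fake_fetch, posts_required=450, size=200, before=1001)
print(len(posts))
```

Because each request asks for `min(size, remaining)`, short batches are topped up by later requests, and the early `break` keeps the loop from stalling when a subreddit has fewer posts than requested.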

In [5]:
# Actual data scrape step to get data

if run_scrape:
    data_df = get_subm(posts_per_subred = 25000, # 25000 per subreddit
                       scrape_size = 200, # scraping 200 per scrape
                       before = 1664812800, # Epoch time for Monday, 3 October 2022 16:00:00 GMT
                       subred_list = ['science','philosophy'], # Required subreddits
                       columns = ['subreddit','author','selftext','title','url']) # Required fields


In [6]:
# Save the data to a csv for later use
if run_scrape:
    data_df.to_csv('Datasets/reddit_data.csv', index = False)
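
Once the file is saved, a quick check of the rows per subreddit confirms that both classes were scraped in full. The sketch below uses only the standard library. `count_by_subreddit` is a hypothetical helper, and the in-memory sample is made-up data standing in for `Datasets/reddit_data.csv`.

```python
import csv
import io
from collections import Counter


def count_by_subreddit(csv_file):
    """Count rows per value of the 'subreddit' column in an open CSV file."""
    reader = csv.DictReader(csv_file)
    return Counter(row['subreddit'] for row in reader)


# Tiny in-memory sample with the same columns as the scraped data (made-up rows)
sample = io.StringIO(
    'subreddit,author,selftext,title,url\n'
    'science,user_a,,First title,https://example.com/1\n'
    'philosophy,user_b,Some text,Second title,https://example.com/2\n'
    'science,user_c,,Third title,https://example.com/3\n'
)
counts = count_by_subreddit(sample)
print(counts)
```

For the real file, the same function could be called inside `with open('Datasets/reddit_data.csv', newline='', encoding='utf-8') as f:`. If the scrape completed, each subreddit should show 25,000 rows.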


><font color=red>Comprehensive use of function and explanation, well done!