<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Project 3 - Part 1: Web Scraping

## Problem Statement

Reddit is a diverse collection of forums called subreddits, each organised around a topic. The boundary between related subreddits, however, can be a very fine line. For example, /r/stocks and /r/wallstreetbets are both about stocks, but the tone and direction of their posts are quite different.

Imagine that Reddit would like to venture into the news industry. News outlets would feed Reddit articles, and a bot would automatically post each one to the subreddits whose readers are most likely to click on it. Reddit also needs to strike a balance so that users do not come to see these posts as spam and get annoyed with the site.

The final bot should be able to take in the headlines of articles and predict what kind of subreddits they should be propagated into, without using the subject of the article.

This project aims to train a classifier that will serve as the basis of such a bot: given a short piece of text (a news headline), it predicts which subreddit the text fits best, based on the contents of posts already on that subreddit.

### Contents:
- [Background](#Background)
- [Web Scraping](#Web-Scraping)
- [Preprocessing](#Preprocessing)

## Background

The two subreddits chosen are **/r/magicTCG** and **/r/mtgfinance**. The two are related, but the latter discusses the cards' prices while the former discusses gameplay and the metagame. What is interesting is that each subreddit also contains posts you would more commonly find in the other. For example, it is possible to find finance discussions on /r/magicTCG and metagame discussions on /r/mtgfinance (the moderators try to prevent this from happening, but the system isn't perfect).

If the classifier can separate posts from these two closely related subreddits with good accuracy, it should plausibly cope with pairs of subreddits that are less alike as well.

### A short intro to /r/magicTCG

Magic: the Gathering is a trading card game (TCG) introduced in 1993. It was the original TCG, and its cards therefore carry considerable collectible value. Boasting a global playerbase of 35 million, it is also the most widely played TCG in the world. For more details, refer to the [Wikipedia page](https://en.wikipedia.org/wiki/Magic:_The_Gathering).

The subreddit dedicated to this game is /r/magicTCG. Text posts in this subreddit are mostly questions about gameplay (for example, whether the 252-page rulebook is comprehensive enough) and discussions of the lore. Most other posts are image posts, such as fan art and card alters.

However, every quarter or so, the company behind the game, Wizards of the Coast (abbreviated Wotc), releases an expansion (known as a 'set' to the playerbase). Each release is a major event in the subreddit, and spoilers of the new cards take over the majority of posts.

### A short intro to /r/mtgfinance

And then we have /r/mtgfinance, the self-proclaimed /r/wallstreetbets of Magic: the Gathering with all of the same sass and attitude.

The posts on that subreddit mostly ask whether it is wise to hold onto old sets as investments, ask whether buying a suspicious [Black Lotus for $500,000](https://www.polygon.com/2021/1/27/22253079/magic-the-gathering-black-lotus-auction-price-2021) is wise, or discuss flipping cards as if they were cryptocurrencies.

Two main topics that frequently turn up are 'Secret Lairs' and the 'Reserved List'.

'Secret Lairs' are a fairly new initiative by Wotc (a subsidiary of Hasbro). Basically, they are collections of reprinted old cards with new art that can be purchased online. This initiative has drawn a lot of flak from the main subreddit, with people calling the move 'greedy' because Wotc had always stayed out of manipulating the secondary market until recently. /r/mtgfinance people, however, see each Secret Lair release as a potential investment, buying them in the hope that they will appreciate in value over time.

The 'Reserved List' is a list of old cards that Wotc has declared it will never reprint, or else face legal consequences. This creates scarcity, and scarcity combined with steady demand leads to high prices. This is the basis of the subreddit, and many people who use it are very interested in collecting these Reserved List cards.

### Workflow

The following steps will be taken to complete this project.

   1. Collect data through web scraping

   2. Preprocess the text (remove common words, define stop words, etc.)

   3. Perform EDA to get a preliminary look at how lemmatization and stemming might affect modelling

   4. Model and draw inferences

   5. Draw final conclusions

### Datasets Used

* [`test.csv`](../datasets/test.csv): Kaggle dataset
* [`train.csv`](../datasets/train.csv): Training dataset

### Library Imports

In [1]:
# basic libraries
import numpy as np
import pandas as pd

# scraping libraries
import requests

# nlp libraries
from nltk.tokenize import sent_tokenize, word_tokenize, RegexpTokenizer
from nltk.stem import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer
from nltk.corpus import stopwords
import re

### Functions Used

In [2]:
def get_posts(subreddit, n):
    """Pull up to n pages of 100 posts each from a subreddit via the Pushshift API."""
    frames = []  # one dataframe per page
    last_post = None  # epoch of the oldest post seen so far; None starts at the newest
    for _ in range(n):
        data = requests.get(
            'https://api.pushshift.io/reddit/search/submission',
            params={
                'subreddit': subreddit,
                'size': 100,
                'before': last_post
            }  # request the 100 posts older than last_post
        ).json()

        df = pd.DataFrame(data['data'])  # convert the page to a dataframe
        if df.empty:  # no older posts left, so stop early
            break
        last_post = df.iloc[-1]['created_utc']  # update last epoch
        frames.append(df)
    # concatenate all pages; return an empty dataframe if nothing came back
    return pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()
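
The pagination idea behind `get_posts` can be tried offline. The sketch below is only an illustration: `paginate_posts` and `fake_fetch` are hypothetical names, and `fake_fetch` is a stand-in that assumes the API returns posts newest first and treats `before` as a strict upper bound on `created_utc`.

```python
import pandas as pd


def paginate_posts(fetch_page, n_pages):
    """Collect up to n_pages pages, moving the `before` cursor back in time.

    fetch_page is a hypothetical callable: it takes a `before` timestamp
    (or None) and returns a list of post dicts, newest first.
    """
    frames = []
    before = None  # None means "start from the newest post"
    for _ in range(n_pages):
        page = fetch_page(before)
        if not page:  # nothing older is left, so stop early
            break
        frames.append(pd.DataFrame(page))
        before = page[-1]['created_utc']  # oldest post on this page
    if not frames:
        return pd.DataFrame()
    return pd.concat(frames, ignore_index=True)


def fake_fetch(before, page_size=3, newest=10):
    """Stand-in for the API: ten posts with timestamps 10 down to 1."""
    start = newest if before is None else before - 1
    stamps = range(start, max(start - page_size, 0), -1)
    return [{'created_utc': t, 'title': f'post {t}'} for t in stamps]


posts = paginate_posts(fake_fetch, n_pages=5)
print(posts['created_utc'].tolist())
```

Because each request only asks for posts older than the last one seen, no post is fetched twice, and the loop ends early once a page comes back empty.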

In [3]:
def get_relevant_columns(df):
    # keep only the text columns used for classification
    return df.loc[:, ['selftext', 'title']]

In [4]:
def drop_na(df):
    # treat blank, removed and deleted posts as missing values
    placeholders = ['', '[removed]', '[deleted]']
    df = (
        df.replace(placeholders, np.nan)
        .dropna()  # drop rows with missing text
        .drop_duplicates()  # drop repeated posts
        .reset_index(drop=True)
    )

    return df
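
To see what this cleaning step keeps and discards, here is a small sketch on invented data (the `toy` frame is made up for illustration). It applies the same chain of `replace`, `dropna` and `drop_duplicates`. Note that `replace` with a list matches whole cell values, so a post that merely mentions `[removed]` inside a longer sentence is kept.

```python
import numpy as np
import pandas as pd

# invented rows covering each case the cleaning step has to handle
toy = pd.DataFrame({
    'selftext': [
        'Is this a good buy?',           # normal post
        '[removed]',                     # removed by moderators
        '',                              # blank body
        'Is this a good buy?',           # exact duplicate of the first row
        '[deleted]',                     # deleted by the author
        'Why was my thread [removed]?',  # only mentions the placeholder
    ],
    'title': [
        'Price check', 'Deck help', 'Spoiler',
        'Price check', 'Rules question', 'Mod question',
    ],
})

placeholders = ['', '[removed]', '[deleted]']
cleaned = (
    toy.replace(placeholders, np.nan)  # whole-cell matches become NaN
    .dropna()                          # drop rows with any NaN
    .drop_duplicates()                 # drop repeated rows
    .reset_index(drop=True)
)
print(cleaned)
```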

In [5]:
def merge_text(df):
    df['alltext'] = df['selftext'] + ' ' + df['title']
    df = df.drop(columns=['selftext', 'title'])
    return df

In [6]:
def remove_duplicate(row):
    if row['selftext'] == row['title']:
        row['title'] = ' '  
    return row
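
The two helpers above (`merge_text` and `remove_duplicate`) can also be expressed without a row-wise `apply`. The following is a sketch of one vectorised alternative on invented data, not the approach used in this notebook: it blanks the title with a boolean mask and trims the leftover space after merging.

```python
import pandas as pd

# invented rows: the first repeats its title in the body, the second does not
toy = pd.DataFrame({
    'selftext': ['Black Lotus price?', 'Long post body'],
    'title': ['Black Lotus price?', 'Short title'],
})

# blank the title wherever it merely repeats the body
same = toy['selftext'] == toy['title']
toy.loc[same, 'title'] = ''

# join body and title, then trim the trailing space left by a blank title
toy['alltext'] = (toy['selftext'] + ' ' + toy['title']).str.strip()
print(toy['alltext'].tolist())
```

Without the blanking step, the first row would contain the same sentence twice and its words would be double-counted by any later word-frequency features.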

In [7]:
def remove_url(df):
    df['alltext'] = df['alltext'].apply(lambda row: re.sub(r'http\S+', '', row))
    return df
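
A quick check of what the `http\S+` pattern removes, using invented sample strings. The pattern consumes everything up to the next whitespace, so punctuation glued to a URL (such as the closing bracket of a markdown link) disappears with it, while the opening bracket stays behind.

```python
import re

URL_PATTERN = re.compile(r'http\S+')

# invented samples: a bare URL, a markdown link, and a post with no link
samples = [
    'Spoilers here: https://example.com/set and more',
    'See [the list](https://example.com/list) for details',
    'No link in this one',
]

stripped = [URL_PATTERN.sub('', s) for s in samples]
for text in stripped:
    print(repr(text))
```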

## Web Scraping

In [None]:
# scrape posts and convert to df
tcg_df = get_posts('magicTCG', 22)

In [None]:
mtgfin_df = get_posts('mtgfinance', 14)

In [None]:
# checking if scrape was successful
tcg_df.shape

In [None]:
mtgfin_df.shape

In [None]:
# isolating the columns with text that we actually want to analyze
# removing [removed] posts, [deleted] posts and blank posts
tcg_df = get_relevant_columns(tcg_df)
tcg_df = drop_na(tcg_df)

In [None]:
mtgfin_df = get_relevant_columns(mtgfin_df)
mtgfin_df = drop_na(mtgfin_df)

In [None]:
# checking if there are enough posts leftover
tcg_df.shape

In [None]:
mtgfin_df.shape

## Preprocessing

In [None]:
# check for duplicates in selftext and title
tcg_df.loc[(tcg_df['selftext'] == tcg_df['title']), :]

In [None]:
# check for duplicates in selftext and title
mtgfin_df.loc[(mtgfin_df['selftext'] == mtgfin_df['title']), :]

In [None]:
# removing duplicate text
mtgfin_df = mtgfin_df.apply(remove_duplicate, axis=1)

In [None]:
# merging selftext and title columns
tcg_df = merge_text(tcg_df)
mtgfin_df = merge_text(mtgfin_df)

In [None]:
# remove urls
mtgfin_df = remove_url(mtgfin_df)
tcg_df = remove_url(tcg_df)

The word 'finance' appears in the name of /r/mtgfinance and is intuitively far more common there, so we remove it and its stemmed form; leaving it in would make differentiating the subreddits too straightforward.
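
One way to test this intuition before removing anything is to measure how often the stem shows up in each corpus. The sketch below uses invented posts and a hypothetical helper, `share_containing`; on the real data the two series would be the `alltext` columns.

```python
import pandas as pd

# invented posts standing in for the two subreddits
tcg_toy = pd.Series([
    'New set spoilers are up',
    'Rules question about the stack',
    'Is mtg finance ruining the game?',
])
fin_toy = pd.Series([
    'Finance newbie here',
    'Reserved List buyouts again',
    'Financial advice on Secret Lairs',
])


def share_containing(texts, stem='financ'):
    """Fraction of posts containing the stem, ignoring case."""
    return texts.str.contains(stem, case=False).mean()


print(share_containing(tcg_toy), share_containing(fin_toy))
```

A large gap between the two shares would support treating the word as a giveaway feature.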

In [None]:
# instantiate porterstemmer
p_stemmer = PorterStemmer()

# stem the word 'finance'
p_stemmer.stem('finance')

In [None]:
# instantiate lemmatizer
lemmatizer = WordNetLemmatizer()

# lemmatize the word 'finance'
lemmatizer.lemmatize('finance')

In [None]:
# remove all instances of 'finance' and 'financ', in any casing (e.g. 'Finance')
financ_pattern = re.compile(r'finance?', flags=re.IGNORECASE)
tcg_df['alltext'] = tcg_df['alltext'].map(lambda x: financ_pattern.sub('', x))
mtgfin_df['alltext'] = mtgfin_df['alltext'].map(lambda x: financ_pattern.sub('', x))

In [None]:
# count posts that still contain 'financ' in any casing
mtgfin_df['alltext'].str.contains('financ', case=False).sum()

In [None]:
# export to csv
tcg_df.to_csv('../datasets/tcg_df_clean.csv', index=False)
mtgfin_df.to_csv('../datasets/mtgfin_df_clean.csv', index=False)