# Data Collection
I collect posts from the past two years from the Magic, Eternal, and Hearthstone subreddits here. This code is meant to be run once; it saves the collected data to a CSV file (tcg_raw.csv). Because scraping all of the data from the API takes a long time, collection only runs again if the code cannot find that data file. Up to 100 posts are gathered from each subreddit for each week of the past two years, resulting in a dataset containing over 30,000 entries. I sampled evenly across time in this way to reduce the influence of any specific card set or event and to better capture game-wide trends. The following information is gathered:

|Feature|Type|Description|
|---|---|---|
|num_comments|int|The number of comments attached to the post|
|title|str|The title of the post|
|sub|int|The subreddit that the post originated from, encoded as integers: magicTCG (0), EternalCardGame (1), or hearthstone (2)|
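
Because `sub` is stored as an integer, it can help to keep the encoding in one place and invert it whenever a readable label is needed. The following is a minimal sketch that assumes the encoding in the table above; the names `SUB_CODES` and `SUB_NAMES` and the sample rows are illustrative and do not come from the original notebook.

```python
import pandas as pd

# Encoding taken from the feature table; the constant names are hypothetical.
SUB_CODES = {'magicTCG': 0, 'EternalCardGame': 1, 'hearthstone': 2}
SUB_NAMES = {code: name for name, code in SUB_CODES.items()}

# Made-up rows standing in for the collected data.
sample = pd.DataFrame({
    'num_comments': [12, 3, 40],
    'title': ['example title a', 'example title b', 'example title c'],
    'sub': [0, 2, 1],
})

# Decode the integer labels back to subreddit names.
sample['sub_name'] = sample['sub'].map(SUB_NAMES)
```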

It is important to note that I did *not* collect the selftext of each post. A large number of posts have no selftext at all, and those that do will likely contain words that are already in, or related to, the title itself. The selftext would therefore not necessarily add useful information to the models, so I chose to analyze only the titles.

In [1]:
# Pulls 100 posts per week from each subreddit for the past two years. This notebook should only be run once! It takes a long time to collect all the posts

import requests
import pandas as pd
import pathlib
import time

# Only scrape when the data file does not already exist
if not pathlib.Path('../datasets/tcg_raw.csv').exists():
    url = 'https://api.pushshift.io/reddit/search/submission'
    reddits = ['magicTCG', 'EternalCardGame', 'hearthstone']
    subs = []

    for tcg in reddits:
        params = {'subreddit':tcg,
                  'size':100,
                  'fields':['title', 'num_comments']}

        # Walk backward one week at a time: week 0 covers the most recent 7 days
        for week in range(104):
            params['before'] = f'{7 * week}d'
            params['after'] = f'{7 * (week + 1)}d'
            data = requests.get(url, params=params).json()
            posts = data['data']
            for post in posts:
                post['sub'] = tcg
            subs.extend(posts)

            # Pause between requests to avoid overloading the API
            time.sleep(3)
            print(f'Scraped {len(posts)} posts from {week} weeks ago from {tcg}')

    df = pd.DataFrame(subs)
    df['sub'] = df['sub'].map({'magicTCG':0, 'EternalCardGame':1,'hearthstone':2})
    df = df.drop_duplicates()  # drop_duplicates returns a new frame, so reassign
    df.to_csv('../datasets/tcg_raw.csv', index=False)
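
The weekly windowing in the loop above can be isolated into a small helper, which makes the number of requests and the upper bound on the dataset size easy to check. This is only a sketch: `weekly_windows` is a hypothetical name that does not appear in the original notebook, and the bound assumes every request returns the full 100 posts.

```python
def weekly_windows(n_weeks=104):
    """Return (before, after) pairs as relative-day strings, newest week first."""
    return [(f'{7 * week}d', f'{7 * (week + 1)}d') for week in range(n_weeks)]


windows = weekly_windows()

# 3 subreddits, one request per window, at most 100 posts per request.
max_posts = 3 * len(windows) * 100
```

Each pair plugs into the `before` and `after` request parameters. Since `max_posts` is only a ceiling, the collected dataset can be smaller once short weeks and duplicates are accounted for.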

## Card Lists

I also load the full card lists (or, because of Magic's long history, only the past two years' worth of Magic cards) so that card names can be added as stop words. This reduces the models' reliance on specific card names or proper nouns found in only one game. The cells below concatenate all card names into a single space-separated string, which is stored and passed to the EDA notebook for parsing and duplicate removal. Sources for the card lists can be found in the project summary document.

In [2]:
all_cards = ''

In [3]:
for card in pd.read_csv('../datasets/mtg_cards.csv')['Card']:
    all_cards += card + ' '

In [4]:
for card in pd.read_csv('../datasets/hearthstone_cards.csv')['Card']:
    all_cards += card + ' '

In [5]:
for card in pd.read_csv('../datasets/eternal_cards.csv')['Cards']:
    all_cards += card + ' '

In [6]:
%%capture

%store all_cards
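
The parsing and duplicate removal deferred to the EDA notebook is not shown here. As a rough sketch of what that step might look like, the helper below lowercases the stored string, splits it into word tokens, and removes duplicates. The function name `card_stop_words`, the tokenization rule, and the example card names are all assumptions made for illustration.

```python
import re


def card_stop_words(card_text):
    """Lowercase the text, extract alphabetic tokens, and drop duplicates."""
    tokens = re.findall(r"[a-z']+", card_text.lower())
    return sorted(set(tokens))


# Made-up card names standing in for the stored `all_cards` string.
example_cards = 'Fire Drake Ice Drake Fire Storm '
stop_words = card_stop_words(example_cards)
```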