# Project Overview
In this notebook and subsequent notebooks I will attempt to predict mechanically meaningful labels associated with some fan-generated roleplaying game (RPG) content.

I collected post and comment data from Reddit in multiple runs using a command-line tool I wrote for this purpose (see: [PRAW-CoDiaLS](https://github.com/nkuehnle/praw-codials)).

The text I will be classifying is hosted on one of two websites: [GM Binder](https://gmbinder.com/) or [Homebrewery](https://homebrewery.naturalcrit.com/), which are tools and hosting providers for fan-made RPG content, primarily Dungeons and Dragons 5th Edition (DnD5e).

The labels make use of the user-assigned "flair" from the subreddit [/r/UnearthedArcana](https://www.reddit.com/r/UnearthedArcana/). UnearthedArcana describes itself as a source of "homebrew" (fan-generated) content and its community rules ensure that every post has a meaningful label ("flair") associated with it. These labels describe the "type" of game content that has been created, be it a character "class" (think archetypal high-fantasy characters: knights, wizards, archers, etc.) or a "race" (elves, dwarves, humans, etc.).

In [9]:
# Utility Imports
import os
from pathlib import Path
# Imports for data processing/handling
import numpy as np
import pandas as pd
# Custom modules
from src.preprocessing.text_cleaning import clean_raw_scrapes
from src.preprocessing import data_io

# This Notebook
Next, we'll import the submission data and collect the mixed html/css/markdown-style source code used by Homebrewery/GMBinder.

**These will form the documents which we will later work to learn from.**

I've defined several complex functions in src/dnd_scraper_tools.py unique to the websites that I'll be scraping.

How to call these functions and how they work internally are described below.

Note: This notebook revises a quick, scrappy project done over a single weekend for a class, so I have already collected the texts for many of the entries included. Some may need updating now that I've taken a more complete, cleaner approach. I'll integrate new data into the existing dataset, which will also serve as a blueprint for all future integration.

In [2]:
# Define a constant (CWD) to use
CWD = Path(os.getcwd())
DATA = CWD / 'data'
PRAW = DATA / 'praw_data'

In [3]:
TEXT = DATA / 'text_files'
try:
    TEXT.mkdir(parents=True, exist_ok=False)
    print("Raw text files directory created")
except FileExistsError:
    # Only swallow the "already exists" case; other OS errors should surface
    print("Raw text files directory already exists")

CLEAN_TEXT = DATA / "Clean_Text_Files"
try:
    CLEAN_TEXT.mkdir(parents=True, exist_ok=False)
    print("Cleaned text files directory created")
except FileExistsError:
    print("Cleaned text files directory already exists")

Raw text files directory already exists
Cleaned text files directory already exists


# Data Imports
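There are now five files I'll be working with, all loaded from the praw_data directory in the cells below:
1. SubmissionsFiltered.csv
2. FinalCommentsFiltered.csv
3. SubmissionsDropped.csv
4. FinalCommentsDropped.csv
5. CollectedData.csv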

If you are not familiar with Reddit, user posts take two forms: "submissions" and "comments."

Submissions are associated with a specific community or "subreddit" and users can comment on Submissions to discuss them or provide additional commentary.
Submissions can come in multiple forms: link, text, image, etc. The submissions here are only link submissions. Comments are searched for all other types of submissions as users will frequently post an image version of their content and then link to the text copy in the comments on these subreddits. The vast majority of content on /r/UnearthedArcana comes in this second form (an image submission with the link in the comments).

In any case, PRAW-CoDiaLS is written so that they both share the same overall structure, although certain fields are only used for comments (and blank otherwise).


### Shared Fields

| column name | description |
| ----- | ----- |
| link | Raw text URL |
| submission_author | Reddit user who posted the submission |
| submission_id | A unique ID used for locating the submission on Reddit |
| submission_title | The title of the submission on Reddit |
| subreddit | The name of the Subreddit the content was found on (in this case, either UnearthedArcana or DnDHomebrew) |
| submission_flair | The user-assigned flair (content-type label) attached to the submission |
| submission_score | The approximate number of net positive votes received by the submission on Reddit; the actual values are obfuscated by Reddit to avoid manipulation |
| submission_upvote_ratio | The fraction of all votes on the submission that are upvotes |
| submission_date | Date on which the submission was originally made |

### Comment-only fields

| column name | description |
| ----- | ----- |
| comment_author | Reddit user who posted the comment |
| comment_id | A unique ID used for locating the comment on Reddit |
| comment_score | The approximate number of net positive votes received by the comment on Reddit; the actual values are obfuscated by Reddit to avoid manipulation |
| comment_body | Text of the comment containing markdown elements |
| comment_date | Date on which the comment was originally posted |
| related_link | Boolean indicator for whether linked text was related to the labeled post (MultiLinkCmts.csv only) |
| manually_reviewed | Boolean indicator of whether the linked text has been manually reviewed for accuracy |

In [4]:
filtered_subs = pd.read_csv(PRAW/'SubmissionsFiltered.csv', index_col='idx')
filtered_cmts = pd.read_csv(PRAW/'FinalCommentsFiltered.csv', index_col='idx')
dropped_subs = pd.read_csv(PRAW/'SubmissionsDropped.csv', index_col='idx')
dropped_cmts = pd.read_csv(PRAW/'FinalCommentsDropped.csv', index_col='idx')

In [5]:
filtered_df = pd.concat([filtered_subs,filtered_cmts])
dropped_df = pd.concat([dropped_subs,dropped_cmts])
collected_df = pd.read_csv(PRAW/'CollectedData.csv')

## Updating manually-reviewed labels
Some of the labels were manually reviewed and corrected in the previous steps. Here I'll carry those corrections over to the already-collected data.

In [6]:
manually_updated = filtered_df[filtered_df['manually_reviewed'] == True].copy()
collected_df = collected_df.sort_values(by='link')
manually_updated = manually_updated.sort_values(by='link')

In [7]:
to_update = collected_df['link'].isin(manually_updated['link'])
new_vals = manually_updated[manually_updated['link'].isin(collected_df['link'])]
mapper = new_vals.set_index("link")["submission_flair"].to_dict()

In [8]:
collected_df.loc[to_update, 'submission_flair'] = collected_df.loc[to_update, 'link'].map(mapper)
collected_df = collected_df.sort_values(by='UID')
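
To make the update pattern above concrete, here is a minimal sketch on toy data. The links and flair values are made up for illustration; only the `isin`/`map`/`.loc` pattern mirrors the cells above.

```python
import pandas as pd

# Toy stand-ins for collected_df and manually_updated (values are hypothetical)
collected = pd.DataFrame({
    "link": ["a", "b", "c"],
    "submission_flair": ["Other", "Class", "5e"],
})
reviewed = pd.DataFrame({
    "link": ["a", "c"],
    "submission_flair": ["Spell", "Monster"],
})

# Map each reviewed link to its corrected flair
mapper = reviewed.set_index("link")["submission_flair"].to_dict()

# Only overwrite rows whose link has a reviewed label; other rows are untouched
to_update = collected["link"].isin(reviewed["link"])
collected.loc[to_update, "submission_flair"] = collected.loc[to_update, "link"].map(mapper)

print(collected)
```

Restricting the assignment to `to_update` matters: mapping the whole column would turn every unreviewed label into `NaN`. If a link appeared more than once among the reviewed rows, `to_dict()` would silently keep the last value.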

## Inspecting New Data

Here I'll do the following:

1. Check how much of the data is new (not already in the collected dataset)
2. Check if any previously collected data has been excluded by my newer, more nuanced filtering approach
3. If any previously collected links have now been excluded, I'll take a look at why and assess whether their exclusion was appropriate or not.

In [9]:
collected_filtered = filtered_df['link'].isin(collected_df['link'])
collected_dropped = dropped_df['link'].isin(collected_df['link'])
print(f"{sum(collected_filtered)} of {len(filtered_df)} filtered links have been collected already.")
print(f"{sum(collected_dropped)} of {len(dropped_df)} dropped links have been collected already.")

2356 of 3376 filtered links have been collected already.
452 of 1345 dropped links have been collected already.


Based on the above, roughly 70% of the filtered links (2356 of 3376) have already had their text collected. The remaining new links should nicely expand the corpus of texts.
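
As a quick check of the proportions, the arithmetic can be done directly from the counts printed above. The variable names are mine; the four numbers are copied from the cell output.

```python
# Counts copied from the printed output above
filtered_total, filtered_collected = 3376, 2356
dropped_total, dropped_collected = 1345, 452

frac_filtered = filtered_collected / filtered_total
frac_dropped = dropped_collected / dropped_total
new_filtered = filtered_total - filtered_collected

print(f"{frac_filtered:.1%} of filtered links already collected")
print(f"{frac_dropped:.1%} of dropped links already collected")
print(f"{new_filtered} filtered links are new")
```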

In [10]:
collected_dropped_only = collected_dropped & (~dropped_df['link'].isin(filtered_df['link']))
collected_dropped_only = dropped_df[collected_dropped_only]
print(f"{len(collected_dropped_only['link'].unique())} links that are strictly in the discard collection ({len(collected_dropped_only)} entries) have already been collected.")

7 links that are strictly in the discard collection (10 entries) have already been collected.


Based on the above, we can see that only a few of the links that we've already collected were dropped and aren't simply duplicates of links found in the filtered dataset. Now I'll see why these links were excluded.
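
The "strictly dropped" check can be illustrated on toy data. This is only a sketch with made-up single-letter links; it shows why the number of entries and the number of unique links can differ, as they do above (10 entries, 7 links).

```python
import pandas as pd

# Hypothetical toy tables standing in for filtered_df, dropped_df, collected_df
filtered = pd.DataFrame({"link": ["a", "b"]})
dropped = pd.DataFrame({"link": ["b", "c", "c", "d"]})
collected = pd.DataFrame({"link": ["a", "b", "c"]})

already_collected = dropped["link"].isin(collected["link"])
also_filtered = dropped["link"].isin(filtered["link"])

# Collected earlier, dropped now, and not rescued by the filtered set
strictly_dropped = dropped[already_collected & ~also_filtered]
n_entries = len(strictly_dropped)
n_links = strictly_dropped["link"].nunique()

print(f"{n_links} links ({n_entries} entries) are strictly dropped")
```

Here link `b` is excluded because it also appears in the filtered set, `d` because it was never collected, and `c` counts as one link across two entries.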

In [11]:
collected_dropped_only['discard_reason'].value_counts()

Invalid/Missing Content Type    10
Name: discard_reason, dtype: int64

In [12]:
manually_inspected_links = collected_dropped_only['link'].unique()
manually_inspected_links

array(['https://homebrewery.naturalcrit.com/share/B1Hju_QaTx',
       'https://gmbinder.com/share/-Kx5i9fuxhuci6BUrbJq',
       'https://gmbinder.com/share/-MbAAe7gUkgvdAdE47qp',
       'https://gmbinder.com/share/-LsEd761uB8mQlHJmpYd',
       'https://gmbinder.com/share/-LM5jFKi0XzDEt-KZACa',
       'https://gmbinder.com/share/-LNKsc1ZWIwwDRvfScQu',
       'https://gmbinder.com/share/-MVT7DxS1Z08371vBCcL'], dtype=object)

Manual inspection confirms that none of these links contain useful training data.

#### Drop Discarded Links

In [13]:
collected_df = collected_df[~collected_df['link'].isin(manually_inspected_links)]

#### Are any collected links not in the filtered dataset?

In [14]:
collected_only = collected_df[~collected_df['link'].isin(filtered_df['link'])]
print(f"{len(collected_only)} links have been collected but are not contained in the present dataset")
collected_only.to_csv(PRAW/'MissingCollected.csv', index_label='idx')
collected_only

12 links have been collected but are not contained in the present dataset


Unnamed: 0,UID,link,submission_author,submission_id,submission_title,subreddit,submission_flair,submission_score,submission_upvote_ratio,submission_date,comment_author,comment_id,comment_score,comment_body,comment_date,src_url,manually_reviewed,related_link
123,127,https://gmbinder.com/images/TdynXtN.png,RSquared,o8jg66,"Wild Magic, Revisited (v1.4): When you have a ...",UnearthedArcana,Subclass,119,0.96,1624742801,RSquared,h35bgcy,5.0,"## Wild Magic, Revisited\nPDF/GMBinder: [https...",1624743040,https://www.gmbinder.com/images/TdynXtN.png,,
169,174,https://gmbinder.com/share/-Mo3iAw7pM6ey8vc3U5o,vonBoomslang,s9d5bl,[Subclass] Fighter: Swiftblade,DnDHomebrew,Subclass,74,0.97,,vonBoomslang,htlw4sf,2.0,[Art](https://www.artstation.com/artwork/DBWEe...,,https://www.gmbinder.com/share/-Mo3iAw7pM6ey8v...,,
319,326,https://gmbinder.com/share/-LTuQ9tRb_ZGFIfKLno4,FungalBrews,stvbuc,The Warrior's Codex - a 156-page remaster of a...,UnearthedArcana,Compendium,280,0.98,2022-02-16 13:00:13,FungalBrews,hx75s6n,1.0,Can't believe I forgot to put the [GMBinder](h...,2022-02-16 17:39:09,https://www.gmbinder.com/share/-LTuQ9tRb_ZGFIf...,,
384,391,https://gmbinder.com/share/-MZ8TgPaZ7Cy-GfA6ODa,zaelos_3,t08upg,Class: Warlord by zaelos_3 (v1.8.4) - Non-magi...,UnearthedArcana,Class,50,0.9,2022-02-24 11:32:53,zaelos_3,hy8as2c,2.0,"Hey! Still working on adjustments, but in gene...",2022-02-24 11:52:14,https://www.gmbinder.com/share/-MZ8TgPaZ7Cy-Gf...,,
435,444,https://gmbinder.com/images/NwecXt3.png,Frail_Luna,lx0mo3,Way of the Dance - A Monastic Tradition for dr...,UnearthedArcana,Subclass,1282,0.99,1614795693,Nihil_esque,gpmg6gf,2.0,"I don't know how they achieved it, but you can...",1614835599,https://www.gmbinder.com/images/NwecXt3.png,,
463,472,https://gmbinder.com/share/-Muk72f9o48Y_sxYWEsD,Sensitive_Coyote_865,shyemc,Summon weapon (corrected) - A cantrip,UnearthedArcana,Spell,513,0.97,2022-02-01 16:03:05,Sensitive_Coyote_865,hv5ay6b,1.0,Here is a simple conjuration cantrip I designe...,2022-02-01 16:04:44,https://www.gmbinder.com/share/-Muk72f9o48Y_sx...,,
622,635,https://gmbinder.com/share/-MY3BKQPmwzGOm8b58d1,KaiTries5,npyqnp,The Chevalier 2.0 - A Charismatic Martial Clas...,UnearthedArcana,Class,349,0.97,2021-06-01 16:59:03,KaiTries5,h07o3fm,16.0,"Hello, r/UnearthedArcana!\n\nA little over a w...",2021-06-01 17:02:47,https://www.gmbinder.com/share/-MY3BKQPmwzGOm8...,,
1248,1659,https://gmbinder.com/share/-MsDn-S30J-qAuY7b0VU,FriskyRisque,siukf3,The Seeker class is updated! [Alpha v1.0] Tell...,UnearthedArcana,Class,254,0.97,2022-02-02 17:34:39,FriskyRisque,hvavqh9,5.0,[Google PDF](https://drive.google.com/file/d/1...,2022-02-02 17:35:11,https://www.gmbinder.com/share/-MsDn-S30J-qAuY...,,
1304,1716,https://gmbinder.com/share/-LtakOF3PZ1Re19Sn6ca,actlikeyoubelong__c,nghrwf,Oath of the Frontier: The frontier is vast and...,UnearthedArcana,Subclass,510,0.98,2021-05-19 21:31:08,actlikeyoubelong__c,gyvyyxm,1.0,Swift justice has been changed to quick draw! ...,2021-05-21 00:15:47,https://www.gmbinder.com/share/-LtakOF3PZ1Re19...,,
1368,1782,https://homebrewery.naturalcrit.com/share/v_5s...,Xenoezen,saxugc,Eldritch Invocation: Pact Tactics,UnearthedArcana,Feat,1438,0.97,2022-01-23 16:48:55,Xenoezen,htwcgso,15.0,"Here's an eldritch invocation for chainlocks, ...",2022-01-23 16:50:49,https://homebrewery.naturalcrit.com/source/v_5...,,


In [15]:
collected_only['link'].unique()

array(['https://gmbinder.com/images/TdynXtN.png',
       'https://gmbinder.com/share/-Mo3iAw7pM6ey8vc3U5o',
       'https://gmbinder.com/share/-LTuQ9tRb_ZGFIfKLno4',
       'https://gmbinder.com/share/-MZ8TgPaZ7Cy-GfA6ODa',
       'https://gmbinder.com/images/NwecXt3.png',
       'https://gmbinder.com/share/-Muk72f9o48Y_sxYWEsD',
       'https://gmbinder.com/share/-MY3BKQPmwzGOm8b58d1',
       'https://gmbinder.com/share/-MsDn-S30J-qAuY7b0VU',
       'https://gmbinder.com/share/-LtakOF3PZ1Re19Sn6ca',
       'https://homebrewery.naturalcrit.com/share/v_5saUNTTEtN]',
       'https://gmbinder.com/images/3ORYq56.png',
       'https://homebrewery.naturalcrit.com/share/1qJmpQuXYOOUZErtdKwC'],
      dtype=object)

#### UID 127, 444, 2117, and 2811 are bad links

In [16]:
collected_df = collected_df[~collected_df['UID'].isin([127,444,2117,2811])].copy()

# Collect Markdown/Source Code

The code below attempts to scrape the actual text for each entry. This code only runs if the data hasn't been processed already.

In [17]:
new_links = filtered_df[~filtered_df['link'].isin(collected_df['link'])]
has_new_links = len(new_links) >= 1
print(f"{len(new_links)} links not in existing collection")

1020 links not in existing collection


## dnd_scraper_tools

As mentioned previously, the actual text scraping is handled by functions defined in dnd_scraper_tools.

Note that either gmbinder.com or homebrewery.naturalcrit.com could change substantially enough in the future to break these functions.

### `get_source_texts()`
General function: this function wraps three other functions.

It also includes rate_limit and pass_attempts variables that can be used to slow down the rate of requests that are sent to these sites to minimize impact(s).

In the event that any requests fail due to HTTP exceptions, the function will pass back over those URLs a number of times equal to pass_attempts.
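
The throttling and multi-pass retry behavior described above could look roughly like the sketch below. This is not the code in dnd_scraper_tools: the function name `fetch_with_passes`, the injected `fetch` and `sleep` callables, and the reading of `pass_attempts` as the total number of passes are all assumptions made for illustration.

```python
import time
from typing import Callable, Dict, Iterable, Optional


def fetch_with_passes(
    urls: Iterable[str],
    fetch: Callable[[str], str],
    rate_limit: float = 1.0,
    pass_attempts: int = 3,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Optional[str]]:
    """Fetch each URL, revisiting failures for up to `pass_attempts` passes."""
    results: Dict[str, Optional[str]] = {url: None for url in urls}
    pending = list(results)
    # rate_limit is requests per second, so wait 1/rate_limit between requests
    min_interval = 1.0 / rate_limit if rate_limit > 0 else 0.0

    for _ in range(pass_attempts):
        failed = []
        for url in pending:
            try:
                results[url] = fetch(url)
            except Exception:  # assumption: real code would catch HTTP errors only
                failed.append(url)
            sleep(min_interval)  # crude throttle between consecutive requests
        pending = failed
        if not pending:
            break  # everything succeeded, no further passes needed
    return results
```

Passing `fetch` and `sleep` in as arguments keeps the sketch testable without touching the network; URLs that never succeed simply keep a value of `None`.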

#### Step 1: Find the best URL to get the markdown text.

The first function simply checks for a "source code" button on the rendered view linked to by each submission (users typically link to this view, as it is more aesthetically pleasing and the primary draw of these websites).

This clean raw source code is a form of Markdown (and potentially some CSS) unique to GMBinder/Homebrewery and designed for rendering nice content. Server-side, this is converted to a much more complex mix of HTML/JS/CSS that is delivered to the clients to produce the rendered viewing page.

In the event that there is a raw source page, we'll try to use that, as it generally provides clear section delineations.

If the source page is not found, we'll try to collect the text from the rendered page and convert it to markdown, so in these cases, we'll return the original URL.

Note: This works only for GMBinder's rendered pages, as Homebrewery uses a ReactJS framework that doesn't play nicely with BeautifulSoup4. To keep things simple, the function returns a value of None if the source is not available for a Homebrewery link. We'll need to drop any such rows later.

From what I observed clicking through some of the links, an enabled source link seems to be standard on Homebrewery, but can be optionally disabled on GMBinder (and is done with appreciable frequency).
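
The fallback rules from Step 1 can be summarized as a small decision function. This is a sketch of the logic as described in the text, not the actual implementation: the name `choose_scrape_url` is hypothetical, and it assumes the source-page lookup has already happened and returned either a URL or `None`.

```python
from typing import Optional


def choose_scrape_url(link: str, source_url: Optional[str]) -> Optional[str]:
    """Pick the URL to scrape, following the fallback rules described above."""
    if source_url:
        # A raw source page exists: prefer it for its clean section structure
        return source_url
    if "gmbinder.com" in link:
        # No source page, but GMBinder's rendered page can still be parsed
        return link
    # Homebrewery without a source page: nothing usable, the caller drops the row
    return None
```

Returning `None` for the unusable case keeps the caller simple, since those rows can later be removed with a single null check.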

#### Step 2: Collect Text

The second function from the dnd_scraper_tools module is used to collect the text from the newly located source URLs.

This can either be the raw markdown source text or the HTML/JS/CSS mixture from the rendered view.

In either case, we will try to make sure it's in the form of markdown, not HTML or raw text.

There are still many undesirable elements at this point, but we'll clean these up towards the end of this notebook.

In [18]:
new_collection_path = PRAW/'NewCollectionFiltered.csv'
TRY_N_TIMES = 7
if has_new_links:
    from src import get_source_texts
    
    new_links = new_links.reset_index(drop=False).copy()

    if new_collection_path.is_file():
        print("Loading data")
        _new_links = pd.read_csv(new_collection_path)
        start_range = max([collected_df['UID'].max(), _new_links['UID'].max()]) + 1
        missing_links = new_links[~new_links['link'].isin(_new_links['link'])].copy()
        missing_links.insert(0, 'UID', range(start_range, start_range + len(missing_links)))
        new_links = _new_links[_new_links['link'].isin(new_links['link'])].copy()

        mapper = manually_updated.set_index("link")['submission_flair'].to_dict()
        new_links.loc[new_links['link'].isin(mapper.keys()),'submission_flair'] = new_links[new_links['link'].isin(mapper.keys())]['link'].map(mapper)
        new_links.loc[new_links['link'].isin(mapper.keys()), 'manually_reviewed'] = True
        
        if len(missing_links) > 0:
            print(f"Starting new entries at UID {start_range}")
            missing_links = get_source_texts(
                df = missing_links,
                rate_limit = 1, # Max one request per second
                pass_attempts = TRY_N_TIMES, # Try up to N times for timeout, etc,
                verbose = True
                )
            new_links = pd.concat([_new_links, missing_links])
            new_links.to_csv(new_collection_path, index=False)
        else:
            new_links.to_csv(new_collection_path, index=False)
            print("All target data has been collected already.")
    else:
        start_range = collected_df['UID'].max() + 1
        new_links.insert(0, 'UID', range(start_range, start_range + len(new_links)))
        # Get new texts
        new_links = get_source_texts(
            df = new_links,
            rate_limit = 1, # Max one request per second
            pass_attempts = TRY_N_TIMES, # Try up to N times for timeout, etc,
            verbose = True
            )

        new_links.to_csv(new_collection_path, index=False)

Loading data
All target data has been collected already.


Some of the links are to dead/removed content or content that is purely an image (typically from a single user). I didn't plan for this in my initial collection, so I'm going to drop them now.

In [19]:
if has_new_links:
    to_drop = [
        # Drop GMBinder links that are dead.
        new_links['Text'].str.contains('NoSuchKey'),
        # Drop Homebrewery links that are dead
        new_links['Text'].str.contains('Can not find brew'),
        # Some users have relocated their files to other sources and are also being pruned here.
        new_links['Text'].str.contains('This content has been moved'),
        # Some are not found
        new_links['Text'].str.contains('Error: File not found'),
        # There are a few examples of nonsensical text, which is generally short, so I drop anything of 200 characters or fewer
        new_links['Text'].str.len() <= 200,
        # Pernicious watercolor links...
        new_links['Text'].str.contains("# Full Page Watercolor Stains"),
        # Drop anything totally empty.
        new_links['Text'].isna()
        ]

    drop_mask = None
    for mask in to_drop:
        if isinstance(drop_mask, pd.Series) or isinstance(drop_mask, np.ndarray):
            drop_mask = drop_mask | mask
        else:
            drop_mask = mask
    print(*new_links[drop_mask]['link'].unique(), sep='\n')
    dropped = new_links[drop_mask]
    new_links = new_links[~drop_mask]
    print(f"Dropped {sum(drop_mask)} instances of unsuccessfully retrieved content")

https://gmbinder.com/share/-LlMLqn89zSK9HarUVb0
https://homebrewery.naturalcrit.com/share/fxIFF3_otodv
https://homebrewery.naturalcrit.com/share/rkv7KcWhN
https://homebrewery.naturalcrit.com/share/S1bgDI6oe
https://homebrewery.naturalcrit.com/share/SJb7f4h2Jm
https://homebrewery.naturalcrit.com/share/Q8h-5qPRnSIt
https://homebrewery.naturalcrit.com/share/AIs9CFcEl5cZ
https://gmbinder.com/share/-MpmR3BmlaL98mXF9Tat
https://gmbinder.com/share/-M6Pbsi5tQ896CKv0i6
https://gmbinder.com/share/-MZK9zMAViuPzPOGJwLj
https://homebrewery.naturalcrit.com/share/rJbFLYlib
https://homebrewery.naturalcrit.com/share/vlLpF0uS968g
https://homebrewery.naturalcrit.com/share/BJQGzT-md4
https://homebrewery.naturalcrit.com/share/r1Z_Pk2a1Q
https://gmbinder.com/share/-LjvXYwjrv22ExcBsPGV
https://homebrewery.naturalcrit.com/share/aFjGJsE89
https://homebrewery.naturalcrit.com/share/Lk4tshmkY0en
https://homebrewery.naturalcrit.com/share/F3gBoje3Lg45
https://homebrewery.naturalcrit.com/share/B1m1TTXHBb
https://hom

In [20]:
new_links = new_links.sort_values(['manually_reviewed', "UID"], ascending=[False, True])
new_links = new_links[~new_links.duplicated("src_url", keep='first')]

Clicking through the above links confirms no useful content was collected.
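
The sort-then-deduplicate step in the cell above keeps, for each source URL, the manually reviewed row if there is one and otherwise the row with the lowest UID. A toy sketch of that pattern, with made-up values:

```python
import pandas as pd

# Hypothetical rows: two entries per source URL
df = pd.DataFrame({
    "UID": [3, 1, 2, 4],
    "src_url": ["x", "x", "y", "y"],
    "manually_reviewed": [True, False, False, False],
})

# Reviewed rows first, then ascending UID as the tie-breaker
df = df.sort_values(["manually_reviewed", "UID"], ascending=[False, True])

# duplicated(keep="first") flags every later repeat of a src_url
deduped = df[~df.duplicated("src_url", keep="first")]
kept = deduped["UID"].tolist()

print(deduped)
```

For URL `x` the reviewed row (UID 3) wins over the older UID 1; for `y`, where nothing was reviewed, the lower UID 2 is kept.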

# Saving collected data

Lastly, I'll merge, filter, and save the newly collected data.

In [21]:
final_df = pd.concat([collected_df, new_links.drop(columns=['Text','idx'])])
final_df['submission_flair'] = final_df['submission_flair'].replace('Feature', 'Feat')
final_df = final_df.sort_values(['manually_reviewed', "UID"], ascending=[False, True])
final_df[final_df.duplicated(["src_url", "submission_flair"], keep=False)].sort_values("src_url")

Unnamed: 0,UID,link,submission_author,submission_id,submission_title,subreddit,submission_flair,submission_score,submission_upvote_ratio,submission_date,comment_author,comment_id,comment_score,comment_body,comment_date,src_url,manually_reviewed,related_link
1103,1510,https://gmbinder.com/share/-M-kCpCFpxZA3chLFuOF,RSquared,oov1no,Martial Prowess 2.1: Expanding martial combat ...,UnearthedArcana,Compendium,228.0,0.98,2021-07-21 17:46:21,RSquared,h611j1j,6.0,PDF: [https://drive.google.com/file/d/1212E9yQ...,2021-07-21 17:49:17,https://www.gmbinder.com/share/-M-kCpCFpxZA3ch...,True,
1741,2272,https://gmbinder.com/share/-M-kCpCFpxZA3chLFuO...,RSquared,scfbxq,Martial Prowess v2.3: A 5E Tome of Battle with...,UnearthedArcana,Compendium,341.0,0.99,,RSquared,hu5oydh,6.0,PDF: [https://drive.google.com/file/d/1p0vEwqH...,,https://www.gmbinder.com/share/-M-kCpCFpxZA3ch...,True,True
1739,2270,https://gmbinder.com/share/-Muk72f9o48Y_sxYWEs...,Sensitive_Coyote_865,svicxd,Trickster's Quirk - A cantrip,UnearthedArcana,Spell,1062.0,0.99,,Sensitive_Coyote_865,hxlhmxp,1.0,How about this one? https://www.gmbinder.com/s...,,https://www.gmbinder.com/share/-Muk72f9o48Y_sx...,True,False
463,472,https://gmbinder.com/share/-Muk72f9o48Y_sxYWEsD,Sensitive_Coyote_865,shyemc,Summon weapon (corrected) - A cantrip,UnearthedArcana,Spell,513.0,0.97,2022-02-01 16:03:05,Sensitive_Coyote_865,hv5ay6b,1.0,Here is a simple conjuration cantrip I designe...,2022-02-01 16:04:44,https://www.gmbinder.com/share/-Muk72f9o48Y_sx...,,
2320,2920,https://gmbinder.com/share/-MwdP3QqzF1UxUiy5MD...,St0rmyknight,t0qic7,Artificer Rebrewed Alchemist,DnDHomebrew,Class,3.0,0.81,,St0rmyknight,hyd7b0b,1.0,"Here is the new iteration, new to GM Binder an...",,https://www.gmbinder.com/share/-MwdP3QqzF1UxUi...,False,
2173,2745,https://gmbinder.com/share/-MwdP3QqzF1UxUiy5MDy,St0rmyknight,t0qic7,Artificer Rebrewed Alchemist,DnDHomebrew,Class,3.0,0.81,,,,,,,https://www.gmbinder.com/share/-MwdP3QqzF1UxUi...,,True


In [22]:
final_df[final_df['UID'].duplicated(keep=False)]

Unnamed: 0,UID,link,submission_author,submission_id,submission_title,subreddit,submission_flair,submission_score,submission_upvote_ratio,submission_date,comment_author,comment_id,comment_score,comment_body,comment_date,src_url,manually_reviewed,related_link


In [23]:
final_df = final_df[~final_df.duplicated(["src_url", "submission_flair"], keep='first')]
final_df[final_df.duplicated("src_url", keep=False)].sort_values("src_url")

Unnamed: 0,UID,link,submission_author,submission_id,submission_title,subreddit,submission_flair,submission_score,submission_upvote_ratio,submission_date,comment_author,comment_id,comment_score,comment_body,comment_date,src_url,manually_reviewed,related_link


In [24]:
submissions = final_df['comment_body'].isna() & final_df["comment_id"].isna()
final_df.loc[submissions, "related_link"] = True

In [25]:
final_df.to_pickle(DATA/'Metadata.pkl')
final_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3250 entries, 2 to 2335
Data columns (total 18 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   UID                      3250 non-null   int64  
 1   link                     3250 non-null   object 
 2   submission_author        3237 non-null   object 
 3   submission_id            3250 non-null   object 
 4   submission_title         3250 non-null   object 
 5   subreddit                3250 non-null   object 
 6   submission_flair         3235 non-null   object 
 7   submission_score         3250 non-null   float64
 8   submission_upvote_ratio  3247 non-null   float64
 9   submission_date          2009 non-null   object 
 10  comment_author           2628 non-null   object 
 11  comment_id               2628 non-null   object 
 12  comment_score            2628 non-null   float64
 13  comment_body             2628 non-null   object 
 14  comment_date            

# Export .txt versions of each file.

I'd also like to export a .txt version of each file, which is easier to inspect manually than the main CSV.

This also allows me to share the metadata and labels of the processed examples without redistributing anyone else's content.

Each file is named {uid}.txt.

In [26]:
raw_uids = data_io.get_uids_from_path(TEXT)
to_write = new_links[~new_links['UID'].isin(raw_uids)].copy()
print(f"{len(raw_uids)} texts already collected. Writing {len(to_write)} text files.")

for _, row in to_write.iterrows():
    data_io.write_row_to_file(row, TEXT)

3786 texts already collected. Writing 0 text files.
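
For readers without access to the `data_io` module, helpers of this kind could look roughly like the sketch below. The names `uids_in_dir` and `write_text_file` are hypothetical stand-ins, and the real functions may differ in signature and behavior; the only thing taken from the notebook is the `{uid}.txt` naming scheme.

```python
from pathlib import Path
from typing import Set


def uids_in_dir(directory: Path) -> Set[int]:
    """Collect integer UIDs from files named like '123.txt'."""
    return {int(p.stem) for p in directory.glob("*.txt") if p.stem.isdigit()}


def write_text_file(uid: int, text: str, directory: Path) -> Path:
    """Write one document to '<uid>.txt' and return the resulting path."""
    out_path = directory / f"{uid}.txt"
    out_path.write_text(text, encoding="utf-8")
    return out_path
```

Checking the existing UIDs before writing, as the cell above does, makes the export safe to re-run without overwriting files that may have been fixed by hand.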


## Make clean copies of texts

**Note**: The cleaning function requires some human help.

Occasionally, human errors/typos (typically just missing `}` CSS closing brackets) produce errors.

I've kept the texts as simple .txt files in part for this reason to make it easier to manually fix these.
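
One way to find candidates for manual fixing ahead of time is to count curly braces in each raw file. This is a crude heuristic of my own, not part of `clean_raw_scrapes`: braces that legitimately appear in prose will produce false positives, and the function names here are hypothetical.

```python
from pathlib import Path
from typing import Dict


def brace_imbalance(text: str) -> int:
    """Return opening minus closing curly braces (0 means balanced)."""
    return text.count("{") - text.count("}")


def find_suspect_files(directory: Path) -> Dict[str, int]:
    """Map file name to imbalance for every .txt file that is not balanced."""
    suspects = {}
    for path in sorted(directory.glob("*.txt")):
        text = path.read_text(encoding="utf-8", errors="replace")
        diff = brace_imbalance(text)
        if diff != 0:
            suspects[path.name] = diff
    return suspects
```

A positive value suggests a missing `}`, which is the typical case mentioned above; a negative value suggests a stray closing bracket.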

In [27]:
clean_raw_scrapes(final_df, raw_path=TEXT, clean_path=CLEAN_TEXT, limit_to_df=True)

3786 raw texts found
0 clean texts found
Limiting cleaning only to files in the provided dataframe
Cleaning 3250 texts, starting with UID 1.
Note processes ~4-6 texts/sec.
ETA: 9 min 1 sec to 13 min 32 sec


## Manually review unusual flair
Since flair on /r/DnDHomebrew is optional and often less informative, I'll be checking all of these.
Similarly, a few flairs are uncommon and possibly uninformative even on /r/UnearthedArcana, so I'll review these to see whether they might be better categorized as something more meaningful.

In [28]:
final_df.groupby(by=["submission_flair"])["UID"].agg("count")

submission_flair
5e              46
Adventure        4
Background       8
Class          470
Compendium     146
Feat           118
Item           163
Mechanic        96
Monster        558
Official         1
Other           12
Prestige         3
Race           228
Resource        12
Spell          200
Subclass      1166
World            4
Name: UID, dtype: int64

In [29]:
MAJOR_LABELS = [
    "Class",
    "Item",
    "Monster",
    "Race",
    "Spell",
    "Subclass",
    "Feat",
    "Compendium",
    "Mechanic",
]
good_flair = final_df['submission_flair'].isin(MAJOR_LABELS)
manually_reviewed = final_df['manually_reviewed']
good_labels = good_flair
bad_labels = final_df[~good_labels].copy()
bad_labels['manually_reviewed'] = False

In [30]:
bad_labels_path = PRAW/'LabelsToReview.csv'
if bad_labels_path.is_file():
    # Read in reviewed labels
    reviewed_labels = pd.read_csv(bad_labels_path)
    reviewed_labels['manually_reviewed'] = reviewed_labels['manually_reviewed'].fillna(False)
    reviewed_labels = reviewed_labels[reviewed_labels['manually_reviewed']]
    
    # Update with any missing labels
    new_to_review = ~bad_labels['src_url'].isin(reviewed_labels['src_url'])
    print(f"{new_to_review.sum()} new labels to review.")
    new_to_review = bad_labels[new_to_review].copy()
    bad_labels = pd.concat([reviewed_labels, new_to_review])
    print(f"{bad_labels['manually_reviewed'].sum()}/{len(bad_labels)} labels have been manually reviewed.")

bad_labels.to_csv(bad_labels_path, index=False)

0 new labels to review.
371/371 labels have been manually reviewed.


### An example of how the review tool looks
I haven't figured out how to use this outside the command line yet.

In [31]:
import sys
!{sys.executable} ./label_review_tool.py ingest -p 55 -d ./data/praw_data/LabelsToReview.csv -m ./data/Metadata.pkl -t ./data/Text_Files/ --clean

[33m[2m371 records already processed out of 371[0m
[0m[0m