# Import & examine data

In [1]:
import pandas as pd
import os
from pathlib import Path

cwd = Path(os.getcwd())
cmts_csv = cwd / 'Data' / 'Comments_20210319_005325.csv'
posts_csv = cwd / 'Data' / 'Posts_20210319_005325.csv'

cmts_df = pd.read_csv(cmts_csv,sep=',',index_col=0)
cmts_df.head(15)

Unnamed: 0,comment_id,post_id,body,sub,post_title,post_flair
0,grfoljw,m86g8n,Here's the first in a project I've been workin...,UnearthedArcana,Variant Hobgoblin - Emporium of the Races,Race
1,grf99dk,m7y9rp,This subclass has features requiring the *Psyc...,UnearthedArcana,Bard - College of Forgotten Echoes; a spiritua...,Subclass
2,grd344y,m7s097,"Hello, everyone! Hoping to get some serious c...",UnearthedArcana,Fortune Domain - 5e Cleric Subclass (CONSTRUCT...,Subclass
3,gr9amwd,m75i2f,"When I say partly inspired by Hollow Knight, I...",UnearthedArcana,"Bugfolk, a Race of Creepy-Crawlies Partly Insp...",Race
4,gr99ryw,m75bsj,Note: The class currently just uses the Warloc...,UnearthedArcana,"The Eldritch Brawler | Class with a unique, fa...",Class
5,gr8y8lg,m73c7l,Hi everyone! This a subclass that I've been th...,UnearthedArcana,Oath of Self - A Paladin subclass for those wh...,Subclass
6,gr67wsv,m6kq63,Survey [here](https://docs.google.com/forms/d/...,UnearthedArcana,Ranger - Crystalline Trapper,Subclass
7,gr577dw,m68lk9,[https://homebrewery.naturalcrit.com/share/5L4...,UnearthedArcana,Greywatcher: An Advanced Demon Hunter Class,Class
8,gr2twa9,m5y7f2,[Link to my Homebrewer Version](https://homebr...,UnearthedArcana,"Gourd of the Dragon Lords - a ""Journey to the ...",Item
9,gqyg3ll,m56sxy,I think most people here know by now that WOTC...,UnearthedArcana,Fairy Race Remixed (5e),Race


### Exploring comments...
Let's print the first 10 comments in full to see how they look.

In [2]:
for i in cmts_df['body'][0:10]:
    print(i+"\n")

Here's the first in a project I've been working on for a while: Emporium of the Races! Essentially, I aim to upgrade and add new options for every race in 5e, as well as introduce new Versatile Subraces inspired by Pathfinder 2e's versatile heritages. I will post updates to this project every week. If you have any suggestions, comments, or requests feel free to share. 

For this first entry, I've decided to do my favorite race: Hobgoblins.   The Strongblade Hobgoblin is essentially the original race, the Forge Hobgoblin makes a great artificer, The aquatic Koalinth is an old d&d monster that I'm surprised hasn't been made playable yet, and the Drakogoblin is an original subrace based on Hobgoblins apparent penchant for drakes. Tell me what you would like me to cover next!

Gmbinder Link: https://www.gmbinder.com/share/-MW5xKc1QPj4TXEgBzQm

Google Drive Link: https://drive.google.com/file/d/1fxn2YHNDnM2W8KJPV_17tpmpyFxqeWJS/view?usp=sharing

This subclass has features requiring the *Psy

# Extracting the GMBinder and/or Homebrewery links

We can see we have some different varieties of potentially useful links, including /share/ and /edit/ links. There are also some from unrelated domains (Google Docs, Wikipedia articles, etc.) and some direct links to PDF files that lack the HTML source we want to collect.

We also have three varieties of link text:

## Raw links
URLs with no special formatting around them.
> <pre>Gmbinder Link: https://www.gmbinder.com/share/-MW5xKc1QPj4TXEgBzQm</pre>

## Markdown links
These take the form of a display text portion in brackets immediately followed by a URL in parentheses.
> <pre>[Display text](url)</pre>

Moreover, in some cases the display text is itself a URL, and in at least one early case the URL in the display text is incorrect (note the stray backslash in the example below) while the URL in the link target is correct.

> <pre>[https://homebrewery.naturalcrit.com/share/MhTK9\_7zHuB6](https://homebrewery.naturalcrit.com/share/MhTK9_7zHuB6)</pre>

Since the URL portion is more likely to be correct, we'll extract only raw links and the (url) portion of Markdown links.

To do this we can use a pair of regular expressions. After some trial and error, I arrived at the following:

### Homebrewery
> <pre>(?:https?:\/\/)?(?:www\.)?homebrewery\.naturalcrit\.com\/[\w\-][\w\-]*\/?[\w\-]*\/?[\w\-]*</pre>

### GMBinder
> <pre>(?:https?:\/\/)?(?:www\.)?gmbinder\.com\/[\w\-][\w\-]*\/?[\w\-]*\/?[\w\-]*[\s\)]</pre>
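As a quick sanity check, the two patterns can be tried on the example strings shown above. This is a minimal sketch that uses Python's `re` module directly rather than pandas. The constant names (`PATH`, `LINK_RE`, and so on) are illustrative, and the patterns are wrapped with a leading `[\s\(]` and a trailing `[\s\)]` delimiter, which is an assumption based on how the extraction cell combines them.

```python
import re

# Shared path portion: up to three path segments of word characters or hyphens.
PATH = r"\/[\w\-][\w\-]*\/?[\w\-]*\/?[\w\-]*"
HOMEBREWERY = r"(?:https?:\/\/)?(?:www\.)?homebrewery\.naturalcrit\.com" + PATH
GMBINDER = r"(?:https?:\/\/)?(?:www\.)?gmbinder\.com" + PATH

# Require whitespace or "(" before the URL and whitespace or ")" after it,
# so the display-text half of a Markdown link (preceded by "[") is skipped.
LINK_RE = re.compile(r"[\s\(](?:" + HOMEBREWERY + "|" + GMBINDER + r")[\s\)]")

raw = "Gmbinder Link: https://www.gmbinder.com/share/-MW5xKc1QPj4TXEgBzQm\n"
markdown = (
    r"[https://homebrewery.naturalcrit.com/share/MhTK9\_7zHuB6]"
    "(https://homebrewery.naturalcrit.com/share/MhTK9_7zHuB6)"
)

raw_matches = LINK_RE.findall(raw)
markdown_matches = LINK_RE.findall(markdown)
print(raw_matches)       # the match keeps the leading space and trailing newline
print(markdown_matches)  # only the (url) half is matched, parentheses included
```

Note that the delimiters are part of each match, so the extracted strings still need trimming afterwards.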

In [3]:
cmts_df['url'] = cmts_df['body'].str.findall(r'[\s\(](?:https?:\/\/)?(?:www\.)?homebrewery\.naturalcrit\.com\/[\w\-][\w\-]*\/?[\w\-]*\/?[\w\-]*[\s\\)]|[\s\(](?:https?:\/\/)?(?:www\.)?gmbinder\.com\/[\w\-][\w\-]*\/?[\w\-]*\/?[\w\-]*[\s\)]')
for i in cmts_df['url'][0:14]:
    print(i)

[' https://www.gmbinder.com/share/-MW5xKc1QPj4TXEgBzQm\n']
['(https://www.gmbinder.com/share/-MVeY68xfxI2MrESmAq0)', '(https://www.gmbinder.com/share/-M90l_C5MPCYvWvO1mpD)']
['(https://homebrewery.naturalcrit.com/share/MhTK9_7zHuB6)']
['(https://www.gmbinder.com/share/-MVy-6VQ01Ft9YsPWZ15)']
['(https://homebrewery.naturalcrit.com/share/leDBql2umJCU)']
['(https://www.gmbinder.com/share/-MVxOdThHWkEdg07s5yH)']
['(https://www.gmbinder.com/share/-MVJujifYnF1PBcZg2yu)']
['(https://homebrewery.naturalcrit.com/share/5L4tFD0_a0Kt)']
['(https://homebrewery.naturalcrit.com/edit/gUo-uGIukIe-)']
['(https://www.gmbinder.com/share/-MVm5LgT_HqPPsPa8Bgd)']
[]
['(https://www.gmbinder.com/share/-MU8f5FYNfXKKi7QkQXQ)', '(https://www.gmbinder.com/profile/FrostBladestorm)', '(https://www.gmbinder.com/profile/FrostBladestorm)']
['(https://homebrewery.naturalcrit.com/share/1X_0vq1WBXrQq2EeNZ1CMOeTVt5PxyWmSLs3wWVRyBTEy)']
['(https://www.gmbinder.com/share/-MHtvbKTKQu8Faoz4YJp)']


# Final cleaning steps...

As expected, some entries have multiple URLs, while others have none.

Many of our strings also have stray characters at the ends, captured by the delimiters in our regular expression:
* '('
* ')'
* '\n'
* ' '

For our final data, we'd like each row to represent one URL rather than one post, with each URL unique and cleaned.

Lastly, we'll remove anything with 'profile' or 'pdf' in the URL, as these URLs require significantly more processing to handle and would not be worth the effort (especially the PDFs).

Luckily, we have a few convenient tools at our disposal, so we will tackle this in the following manner:

1) Expand each entry into multiple rows for each URL in the dataset. To do this we will use Pandas' .explode() function designed for this very purpose.

2) Remove rows that are now empty (i.e. posts that only contained direct links to the homebrewery/gmbinder homepage) by dropping NA values

3) Strip each individual string to remove the extraneous characters left over, using the map function.

4) Remove duplicates using the drop_duplicates() function in Pandas.

5) Lastly, we'll remove any rows whose URL contains the string 'pdf' or 'profile' (the code below also drops 'edit' links).
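The steps above can be sketched on a tiny, made-up table before running them on the real data. The rows and the `toy` name are purely illustrative. The only assumption is that the `url` column holds lists of delimiter-wrapped strings, as `findall` produces.

```python
import pandas as pd

# Toy stand-in for the comments table; the rows are made up for illustration.
toy = pd.DataFrame({
    "comment_id": ["a", "b", "c", "d"],
    "url": [
        ["(https://www.gmbinder.com/share/AAA)", " https://www.gmbinder.com/share/BBB\n"],
        [],                                          # comment with no usable link
        ["(https://www.gmbinder.com/share/AAA)"],    # duplicate of the first URL
        ["(https://www.gmbinder.com/profile/SomeUser)"],
    ],
})

toy = toy.explode("url")                     # 1) one row per URL
toy = toy.dropna(subset=["url"])             # 2) empty lists became NaN
toy["url"] = toy["url"].map(lambda x: x.strip("() \n"))  # 3) trim delimiters
toy = toy.drop_duplicates(subset=["url"])    # 4) keep the first occurrence
toy = toy[~toy["url"].str.contains("pdf|profile|edit")]  # 5) drop unwanted links

urls = toy["url"].tolist()
ids = toy["comment_id"].tolist()
print(urls)
print(ids)
```

Both surviving rows come from comment `a`, which shows that `explode` repeats the other columns (and the index label) for every URL in a list.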

In [4]:
# 1) One row per URL instead of one row per comment
cmts_df = cmts_df.explode('url')
# 2) Comments with no matching URL become NaN after explode; drop them
cmts_df = cmts_df.dropna(subset=['url'])
# 3) Trim the delimiter characters the regex captured around each URL
cmts_df['url'] = cmts_df['url'].map(lambda x: x.strip('() \n'))
# 4) Keep only the first occurrence of each URL
cmts_df = cmts_df.drop_duplicates(subset=['url'], keep='first')
# 5) Filter out PDF, profile and edit links
cmts_df = cmts_df[~cmts_df['url'].str.contains('pdf|profile|edit')]
cmts_df.head(25)

Unnamed: 0,comment_id,post_id,body,sub,post_title,post_flair,url
0,grfoljw,m86g8n,Here's the first in a project I've been workin...,UnearthedArcana,Variant Hobgoblin - Emporium of the Races,Race,https://www.gmbinder.com/share/-MW5xKc1QPj4TXE...
1,grf99dk,m7y9rp,This subclass has features requiring the *Psyc...,UnearthedArcana,Bard - College of Forgotten Echoes; a spiritua...,Subclass,https://www.gmbinder.com/share/-MVeY68xfxI2MrE...
1,grf99dk,m7y9rp,This subclass has features requiring the *Psyc...,UnearthedArcana,Bard - College of Forgotten Echoes; a spiritua...,Subclass,https://www.gmbinder.com/share/-M90l_C5MPCYvWv...
2,grd344y,m7s097,"Hello, everyone! Hoping to get some serious c...",UnearthedArcana,Fortune Domain - 5e Cleric Subclass (CONSTRUCT...,Subclass,https://homebrewery.naturalcrit.com/share/MhTK...
3,gr9amwd,m75i2f,"When I say partly inspired by Hollow Knight, I...",UnearthedArcana,"Bugfolk, a Race of Creepy-Crawlies Partly Insp...",Race,https://www.gmbinder.com/share/-MVy-6VQ01Ft9Ys...
4,gr99ryw,m75bsj,Note: The class currently just uses the Warloc...,UnearthedArcana,"The Eldritch Brawler | Class with a unique, fa...",Class,https://homebrewery.naturalcrit.com/share/leDB...
5,gr8y8lg,m73c7l,Hi everyone! This a subclass that I've been th...,UnearthedArcana,Oath of Self - A Paladin subclass for those wh...,Subclass,https://www.gmbinder.com/share/-MVxOdThHWkEdg0...
6,gr67wsv,m6kq63,Survey [here](https://docs.google.com/forms/d/...,UnearthedArcana,Ranger - Crystalline Trapper,Subclass,https://www.gmbinder.com/share/-MVJujifYnF1PBc...
7,gr577dw,m68lk9,[https://homebrewery.naturalcrit.com/share/5L4...,UnearthedArcana,Greywatcher: An Advanced Demon Hunter Class,Class,https://homebrewery.naturalcrit.com/share/5L4t...
9,gqyg3ll,m56sxy,I think most people here know by now that WOTC...,UnearthedArcana,Fairy Race Remixed (5e),Race,https://www.gmbinder.com/share/-MVm5LgT_HqPPsP...


# Collect Markdown/Source Code

Next, we'll import the post data and collect the mixed HTML/CSS/Markdown-style source code used by Homebrewery/GMBinder.

For this, I've defined two functions in dnd_scraper_tools that I'll be importing.

The first function simply checks for a source code button. This seems to be standard on Homebrewery, but can be optionally disabled on GMBinder.

In that event, we'll try to get the cleanest text we can from the HTML available on the main display page, so the function will return the original URL if it can't find a button. This is done only for the GMBinder links, as Homebrewery's main display page is rendered by React JS and collecting the text via BeautifulSoup is not possible. My initial exploration of the links suggests that Homebrewery pages always provide the button. If one does not, the script will return a value of None that we can later filter out.
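Since `dnd_scraper_tools` is not shown here, the following is only a rough sketch of the decision logic described above, not the actual `grab_src_url`. It assumes the page HTML has already been downloaded, and that the source button is an `<a>` tag whose `href` contains the word `source`. Both that heuristic and the names `SourceLinkFinder` and `pick_src_url` are hypothetical. It uses the standard library's `html.parser` rather than BeautifulSoup so that it stays self-contained.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class SourceLinkFinder(HTMLParser):
    """Record the href of the first <a> tag whose href contains 'source'."""

    def __init__(self):
        super().__init__()
        self.href = None

    def handle_starttag(self, tag, attrs):
        if tag == "a" and self.href is None:
            href = dict(attrs).get("href") or ""
            if "source" in href:  # hypothetical heuristic for the button
                self.href = href


def pick_src_url(page_url, page_html):
    """Return the source URL, the original URL (GMBinder), or None (Homebrewery)."""
    finder = SourceLinkFinder()
    finder.feed(page_html)
    if finder.href is not None:
        return urljoin(page_url, finder.href)
    # No button found: GMBinder's display HTML is still usable, Homebrewery's is not.
    return page_url if "gmbinder.com" in page_url else None


with_button = pick_src_url(
    "https://homebrewery.naturalcrit.com/share/abc",
    '<a href="/source/abc">source</a>',
)
gmbinder_fallback = pick_src_url("https://www.gmbinder.com/share/xyz", "<p>no button</p>")
homebrewery_missing = pick_src_url("https://homebrewery.naturalcrit.com/share/abc", "<div></div>")
print(with_button, gmbinder_fallback, homebrewery_missing)
```

Returning `None` rather than raising keeps the function usable with `Series.map`, and the missing values can be dropped afterwards.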

##### Note: I performed some manual curation of the post data to add UnearthedArcana-style descriptive flair tags to the DnDHomebrew posts, so that this dataset is initially complete with respect to that column.

In [5]:
from dnd_scraper_tools import grab_src_url, collect_text, remove_html

posts_df = pd.read_csv(posts_csv,sep=',',index_col=0)

posts_df = posts_df[posts_df['url'].str.contains('pdf') == False]
posts_df = posts_df[posts_df['url'].str.contains('profile') == False]
posts_df = posts_df[posts_df['url'].str.contains('edit') == False]

In [6]:
posts_df.head(10)

Unnamed: 0,post_id,title,url,sub,score,flair,upvote_ratio
0,m7s6jd,"[5E] Metamorph, a (hopefully) balanced substit...",https://homebrewery.naturalcrit.com/share/tpKj...,UnearthedArcana,2,Class,0.75
1,m7ezuy,Cute Magical Critter Warlock Patron | Sign a c...,https://www.gmbinder.com/share/-MT4G-aRMFvYwYw...,UnearthedArcana,21,Subclass,0.93
2,m73s8f,Akkator's Compendium of Classes,https://homebrewery.naturalcrit.com/share/gvVS...,UnearthedArcana,5,Compendium,1.0
3,m71mtw,Order of the Thereon-Blood Hunter subclass v 1.01,https://homebrewery.naturalcrit.com/share/1ScB...,UnearthedArcana,3,Subclass,1.0
4,m6t7k8,"Selakyn - a Race of Nomadic Shark People, stil...",https://homebrewery.naturalcrit.com/share/p-k2...,UnearthedArcana,4,Race,1.0
5,m6odpv,HD and XP as Currency? Thoughts and critiques ...,https://homebrewery.naturalcrit.com/share/1fCq...,UnearthedArcana,2,Mechanic,1.0
6,m6o99q,"Satyr Reimagined! Now with 3 new subraces, fea...",https://homebrewery.naturalcrit.com/share/vxyP...,UnearthedArcana,1,Race,0.6
7,m6h3jr,Sealed Dragon v1.4 | A Dragon Sealed Inside a ...,https://homebrewery.naturalcrit.com/share/SRjD...,UnearthedArcana,5,Class,0.7
8,m5xg2a,"Taking a swing at my own Ranger revision, with...",https://www.gmbinder.com/share/-MTpa3ordIxC74W...,UnearthedArcana,0,Class,0.4
9,m5tt9a,Way of the Coiled Cobra | Dodge and Strike wit...,https://www.gmbinder.com/share/-MUKsI6Ft8OQgYe...,UnearthedArcana,12,Subclass,1.0


In [7]:
posts_df['src_url'] = posts_df['url'].map(grab_src_url)
posts_df.head(10)

Unnamed: 0,post_id,title,url,sub,score,flair,upvote_ratio,src_url
0,m7s6jd,"[5E] Metamorph, a (hopefully) balanced substit...",https://homebrewery.naturalcrit.com/share/tpKj...,UnearthedArcana,2,Class,0.75,https://homebrewery.naturalcrit.com/source/tpK...
1,m7ezuy,Cute Magical Critter Warlock Patron | Sign a c...,https://www.gmbinder.com/share/-MT4G-aRMFvYwYw...,UnearthedArcana,21,Subclass,0.93,https://www.gmbinder.com/share/-MT4G-aRMFvYwYw...
2,m73s8f,Akkator's Compendium of Classes,https://homebrewery.naturalcrit.com/share/gvVS...,UnearthedArcana,5,Compendium,1.0,https://homebrewery.naturalcrit.com/source/gvV...
3,m71mtw,Order of the Thereon-Blood Hunter subclass v 1.01,https://homebrewery.naturalcrit.com/share/1ScB...,UnearthedArcana,3,Subclass,1.0,https://homebrewery.naturalcrit.com/source/1Sc...
4,m6t7k8,"Selakyn - a Race of Nomadic Shark People, stil...",https://homebrewery.naturalcrit.com/share/p-k2...,UnearthedArcana,4,Race,1.0,https://homebrewery.naturalcrit.com/source/p-k...
5,m6odpv,HD and XP as Currency? Thoughts and critiques ...,https://homebrewery.naturalcrit.com/share/1fCq...,UnearthedArcana,2,Mechanic,1.0,https://homebrewery.naturalcrit.com/source/1fC...
6,m6o99q,"Satyr Reimagined! Now with 3 new subraces, fea...",https://homebrewery.naturalcrit.com/share/vxyP...,UnearthedArcana,1,Race,0.6,https://homebrewery.naturalcrit.com/source/vxy...
7,m6h3jr,Sealed Dragon v1.4 | A Dragon Sealed Inside a ...,https://homebrewery.naturalcrit.com/share/SRjD...,UnearthedArcana,5,Class,0.7,https://homebrewery.naturalcrit.com/source/SRj...
8,m5xg2a,"Taking a swing at my own Ranger revision, with...",https://www.gmbinder.com/share/-MTpa3ordIxC74W...,UnearthedArcana,0,Class,0.4,https://www.gmbinder.com/share/-MTpa3ordIxC74W...
9,m5tt9a,Way of the Coiled Cobra | Dodge and Strike wit...,https://www.gmbinder.com/share/-MUKsI6Ft8OQgYe...,UnearthedArcana,12,Subclass,1.0,https://www.gmbinder.com/share/-MUKsI6Ft8OQgYe...


In [8]:
cmts_df['src_url'] = cmts_df['url'].map(grab_src_url)
cmts_df.head(10)

Unnamed: 0,comment_id,post_id,body,sub,post_title,post_flair,url,src_url
0,grfoljw,m86g8n,Here's the first in a project I've been workin...,UnearthedArcana,Variant Hobgoblin - Emporium of the Races,Race,https://www.gmbinder.com/share/-MW5xKc1QPj4TXE...,https://www.gmbinder.com/share/-MW5xKc1QPj4TXE...
1,grf99dk,m7y9rp,This subclass has features requiring the *Psyc...,UnearthedArcana,Bard - College of Forgotten Echoes; a spiritua...,Subclass,https://www.gmbinder.com/share/-MVeY68xfxI2MrE...,https://www.gmbinder.com/share/-MVeY68xfxI2MrE...
1,grf99dk,m7y9rp,This subclass has features requiring the *Psyc...,UnearthedArcana,Bard - College of Forgotten Echoes; a spiritua...,Subclass,https://www.gmbinder.com/share/-M90l_C5MPCYvWv...,https://www.gmbinder.com/share/-M90l_C5MPCYvWv...
2,grd344y,m7s097,"Hello, everyone! Hoping to get some serious c...",UnearthedArcana,Fortune Domain - 5e Cleric Subclass (CONSTRUCT...,Subclass,https://homebrewery.naturalcrit.com/share/MhTK...,https://homebrewery.naturalcrit.com/source/MhT...
3,gr9amwd,m75i2f,"When I say partly inspired by Hollow Knight, I...",UnearthedArcana,"Bugfolk, a Race of Creepy-Crawlies Partly Insp...",Race,https://www.gmbinder.com/share/-MVy-6VQ01Ft9Ys...,https://www.gmbinder.com/share/-MVy-6VQ01Ft9Ys...
4,gr99ryw,m75bsj,Note: The class currently just uses the Warloc...,UnearthedArcana,"The Eldritch Brawler | Class with a unique, fa...",Class,https://homebrewery.naturalcrit.com/share/leDB...,https://homebrewery.naturalcrit.com/source/leD...
5,gr8y8lg,m73c7l,Hi everyone! This a subclass that I've been th...,UnearthedArcana,Oath of Self - A Paladin subclass for those wh...,Subclass,https://www.gmbinder.com/share/-MVxOdThHWkEdg0...,https://www.gmbinder.com/share/-MVxOdThHWkEdg0...
6,gr67wsv,m6kq63,Survey [here](https://docs.google.com/forms/d/...,UnearthedArcana,Ranger - Crystalline Trapper,Subclass,https://www.gmbinder.com/share/-MVJujifYnF1PBc...,https://www.gmbinder.com/share/-MVJujifYnF1PBc...
7,gr577dw,m68lk9,[https://homebrewery.naturalcrit.com/share/5L4...,UnearthedArcana,Greywatcher: An Advanced Demon Hunter Class,Class,https://homebrewery.naturalcrit.com/share/5L4t...,https://homebrewery.naturalcrit.com/source/5L4...
9,gqyg3ll,m56sxy,I think most people here know by now that WOTC...,UnearthedArcana,Fairy Race Remixed (5e),Race,https://www.gmbinder.com/share/-MVm5LgT_HqPPsP...,https://www.gmbinder.com/share/-MVm5LgT_HqPPsP...


In [9]:
cmts_df['Text'] = cmts_df['src_url'].map(collect_text)
posts_df['Text'] = posts_df['src_url'].map(collect_text)

In [10]:
cmts_df.head(10)

Unnamed: 0,comment_id,post_id,body,sub,post_title,post_flair,url,src_url,Text
0,grfoljw,m86g8n,Here's the first in a project I've been workin...,UnearthedArcana,Variant Hobgoblin - Emporium of the Races,Race,https://www.gmbinder.com/share/-MW5xKc1QPj4TXE...,https://www.gmbinder.com/share/-MW5xKc1QPj4TXE...,<img\nsrc='https://images.squarespace-cdn.com/...
1,grf99dk,m7y9rp,This subclass has features requiring the *Psyc...,UnearthedArcana,Bard - College of Forgotten Echoes; a spiritua...,Subclass,https://www.gmbinder.com/share/-MVeY68xfxI2MrE...,https://www.gmbinder.com/share/-MVeY68xfxI2MrE...,<style>\n .phb#p2:after { display:none; }\n ...
1,grf99dk,m7y9rp,This subclass has features requiring the *Psyc...,UnearthedArcana,Bard - College of Forgotten Echoes; a spiritua...,Subclass,https://www.gmbinder.com/share/-M90l_C5MPCYvWv...,https://www.gmbinder.com/share/-M90l_C5MPCYvWv...,<style>\n .phb:after {\n display: no...
2,grd344y,m7s097,"Hello, everyone! Hoping to get some serious c...",UnearthedArcana,Fortune Domain - 5e Cleric Subclass (CONSTRUCT...,Subclass,https://homebrewery.naturalcrit.com/share/MhTK...,https://homebrewery.naturalcrit.com/source/MhT...,### Fortune Domain\nClerics of the Fortune dom...
3,gr9amwd,m75i2f,"When I say partly inspired by Hollow Knight, I...",UnearthedArcana,"Bugfolk, a Race of Creepy-Crawlies Partly Insp...",Race,https://www.gmbinder.com/share/-MVy-6VQ01Ft9Ys...,https://www.gmbinder.com/share/-MVy-6VQ01Ft9Ys...,<style>\n/* Background */\n .phb{ background-...
4,gr99ryw,m75bsj,Note: The class currently just uses the Warloc...,UnearthedArcana,"The Eldritch Brawler | Class with a unique, fa...",Class,https://homebrewery.naturalcrit.com/share/leDB...,https://homebrewery.naturalcrit.com/source/leD...,<div>\n# Eldritch Brawler\n<div class='pageNum...
5,gr8y8lg,m73c7l,Hi everyone! This a subclass that I've been th...,UnearthedArcana,Oath of Self - A Paladin subclass for those wh...,Subclass,https://www.gmbinder.com/share/-MVxOdThHWkEdg0...,https://www.gmbinder.com/share/-MVxOdThHWkEdg0...,### Oath of Self\n\nAn Oath of Self paladin is...
6,gr67wsv,m6kq63,Survey [here](https://docs.google.com/forms/d/...,UnearthedArcana,Ranger - Crystalline Trapper,Subclass,https://www.gmbinder.com/share/-MVJujifYnF1PBc...,https://www.gmbinder.com/share/-MVJujifYnF1PBc...,\n \n \n \n \n \n \n \n ...
7,gr577dw,m68lk9,[https://homebrewery.naturalcrit.com/share/5L4...,UnearthedArcana,Greywatcher: An Advanced Demon Hunter Class,Class,https://homebrewery.naturalcrit.com/share/5L4t...,https://homebrewery.naturalcrit.com/source/5L4...,<style>\n .phb#p1{ text-align:center; }\n .p...
9,gqyg3ll,m56sxy,I think most people here know by now that WOTC...,UnearthedArcana,Fairy Race Remixed (5e),Race,https://www.gmbinder.com/share/-MVm5LgT_HqPPsP...,https://www.gmbinder.com/share/-MVm5LgT_HqPPsP...,# Fairies - A remixed race (v1.1)\n\nFairies h...


In [11]:
posts_df.head(10)

Unnamed: 0,post_id,title,url,sub,score,flair,upvote_ratio,src_url,Text
0,m7s6jd,"[5E] Metamorph, a (hopefully) balanced substit...",https://homebrewery.naturalcrit.com/share/tpKj...,UnearthedArcana,2,Class,0.75,https://homebrewery.naturalcrit.com/source/tpK...,# Metamorph\nAn eldarly human sits locked in c...
1,m7ezuy,Cute Magical Critter Warlock Patron | Sign a c...,https://www.gmbinder.com/share/-MT4G-aRMFvYwYw...,UnearthedArcana,21,Subclass,0.93,https://www.gmbinder.com/share/-MT4G-aRMFvYwYw...,## Otherworldly Patron: Cute Magical Critter (...
2,m73s8f,Akkator's Compendium of Classes,https://homebrewery.naturalcrit.com/share/gvVS...,UnearthedArcana,5,Compendium,1.0,https://homebrewery.naturalcrit.com/source/gvV...,# Akkator's Compendium of Classes\n\nWithin th...
3,m71mtw,Order of the Thereon-Blood Hunter subclass v 1.01,https://homebrewery.naturalcrit.com/share/1ScB...,UnearthedArcana,3,Subclass,1.0,https://homebrewery.naturalcrit.com/source/1Sc...,## Blood Hunter Order\n<img \n src='https://p...
4,m6t7k8,"Selakyn - a Race of Nomadic Shark People, stil...",https://homebrewery.naturalcrit.com/share/p-k2...,UnearthedArcana,4,Race,1.0,https://homebrewery.naturalcrit.com/source/p-k...,"## The Selakyn\nSelakyns, also known as Shark ..."
5,m6odpv,HD and XP as Currency? Thoughts and critiques ...,https://homebrewery.naturalcrit.com/share/1fCq...,UnearthedArcana,2,Mechanic,1.0,https://homebrewery.naturalcrit.com/source/1fC...,## Hit Dice and Experience Point Currency\n*Du...
6,m6o99q,"Satyr Reimagined! Now with 3 new subraces, fea...",https://homebrewery.naturalcrit.com/share/vxyP...,UnearthedArcana,1,Race,0.6,https://homebrewery.naturalcrit.com/source/vxy...,# The Satyr / Faun \nThe Satyr are a race of j...
7,m6h3jr,Sealed Dragon v1.4 | A Dragon Sealed Inside a ...,https://homebrewery.naturalcrit.com/share/SRjD...,UnearthedArcana,5,Class,0.7,https://homebrewery.naturalcrit.com/source/SRj...,<div class='classTable wide'>\n##### The Seale...
8,m5xg2a,"Taking a swing at my own Ranger revision, with...",https://www.gmbinder.com/share/-MTpa3ordIxC74W...,UnearthedArcana,0,Class,0.4,https://www.gmbinder.com/share/-MTpa3ordIxC74W...,"# Ranger\n## Class Features\nAs a Ranger, you ..."
9,m5tt9a,Way of the Coiled Cobra | Dodge and Strike wit...,https://www.gmbinder.com/share/-MUKsI6Ft8OQgYe...,UnearthedArcana,12,Subclass,1.0,https://www.gmbinder.com/share/-MUKsI6Ft8OQgYe...,### Way of the Coiled Cobra\n\nMonks of the Wa...


# Removing residual HTML
A quick scan reveals that some HTML tags in the source text (extra styling elements, etc) did not get removed. Let's strip all of the HTML one more time.
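For illustration, a tag stripper of this kind can be as small as one regular expression. This is a hedged sketch, not the `remove_html` imported from `dnd_scraper_tools`; the name `strip_tags` and the pattern are assumptions. It removes anything that looks like a tag and keeps the text between tags, so CSS rules inside a `<style>` block would survive.

```python
import re

# Anything between "<" and the next ">" is treated as a tag.
TAG_RE = re.compile(r"<[^>]+>")


def strip_tags(text):
    """Remove tag-like substrings, keeping the text between them."""
    if not isinstance(text, str):
        return text  # leave None/NaN untouched so they can be filtered later
    return TAG_RE.sub("", text)


cleaned = strip_tags("<div class='classTable wide'>\n##### The Sealed Dragon")
styled = strip_tags("<style>\n .phb{ color: red; }\n</style>")
print(repr(cleaned))
print(repr(styled))
```

A pattern this simple will also delete ordinary text that happens to sit between a `<` and a later `>`, so it is best suited to a final clean-up pass like this one.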

In [12]:
cmts_df['Text'] = cmts_df['Text'].map(remove_html)
posts_df['Text'] = posts_df['Text'].map(remove_html)

In [13]:
posts_df.head(10)

Unnamed: 0,post_id,title,url,sub,score,flair,upvote_ratio,src_url,Text
0,m7s6jd,"[5E] Metamorph, a (hopefully) balanced substit...",https://homebrewery.naturalcrit.com/share/tpKj...,UnearthedArcana,2,Class,0.75,https://homebrewery.naturalcrit.com/source/tpK...,# Metamorph\nAn eldarly human sits locked in c...
1,m7ezuy,Cute Magical Critter Warlock Patron | Sign a c...,https://www.gmbinder.com/share/-MT4G-aRMFvYwYw...,UnearthedArcana,21,Subclass,0.93,https://www.gmbinder.com/share/-MT4G-aRMFvYwYw...,## Otherworldly Patron: Cute Magical Critter (...
2,m73s8f,Akkator's Compendium of Classes,https://homebrewery.naturalcrit.com/share/gvVS...,UnearthedArcana,5,Compendium,1.0,https://homebrewery.naturalcrit.com/source/gvV...,# Akkator's Compendium of Classes\n\nWithin th...
3,m71mtw,Order of the Thereon-Blood Hunter subclass v 1.01,https://homebrewery.naturalcrit.com/share/1ScB...,UnearthedArcana,3,Subclass,1.0,https://homebrewery.naturalcrit.com/source/1Sc...,## Blood Hunter Order\n\nCredit: Jikayen\n####...
4,m6t7k8,"Selakyn - a Race of Nomadic Shark People, stil...",https://homebrewery.naturalcrit.com/share/p-k2...,UnearthedArcana,4,Race,1.0,https://homebrewery.naturalcrit.com/source/p-k...,"## The Selakyn\nSelakyns, also known as Shark ..."
5,m6odpv,HD and XP as Currency? Thoughts and critiques ...,https://homebrewery.naturalcrit.com/share/1fCq...,UnearthedArcana,2,Mechanic,1.0,https://homebrewery.naturalcrit.com/source/1fC...,## Hit Dice and Experience Point Currency\n*Du...
6,m6o99q,"Satyr Reimagined! Now with 3 new subraces, fea...",https://homebrewery.naturalcrit.com/share/vxyP...,UnearthedArcana,1,Race,0.6,https://homebrewery.naturalcrit.com/source/vxy...,# The Satyr / Faun \nThe Satyr are a race of j...
7,m6h3jr,Sealed Dragon v1.4 | A Dragon Sealed Inside a ...,https://homebrewery.naturalcrit.com/share/SRjD...,UnearthedArcana,5,Class,0.7,https://homebrewery.naturalcrit.com/source/SRj...,\n##### The Sealed Dragon\n| Level | Proficien...
8,m5xg2a,"Taking a swing at my own Ranger revision, with...",https://www.gmbinder.com/share/-MTpa3ordIxC74W...,UnearthedArcana,0,Class,0.4,https://www.gmbinder.com/share/-MTpa3ordIxC74W...,"# Ranger\n## Class Features\nAs a Ranger, you ..."
9,m5tt9a,Way of the Coiled Cobra | Dodge and Strike wit...,https://www.gmbinder.com/share/-MUKsI6Ft8OQgYe...,UnearthedArcana,12,Subclass,1.0,https://www.gmbinder.com/share/-MUKsI6Ft8OQgYe...,### Way of the Coiled Cobra\n\nMonks of the Wa...


In [14]:
cmts_df.head(10)

Unnamed: 0,comment_id,post_id,body,sub,post_title,post_flair,url,src_url,Text
0,grfoljw,m86g8n,Here's the first in a project I've been workin...,UnearthedArcana,Variant Hobgoblin - Emporium of the Races,Race,https://www.gmbinder.com/share/-MW5xKc1QPj4TXE...,https://www.gmbinder.com/share/-MW5xKc1QPj4TXE...,\n\n\n\n\n\n## Variant Hobgoblin\n\n\nOften co...
1,grf99dk,m7y9rp,This subclass has features requiring the *Psyc...,UnearthedArcana,Bard - College of Forgotten Echoes; a spiritua...,Subclass,https://www.gmbinder.com/share/-MVeY68xfxI2MrE...,https://www.gmbinder.com/share/-MVeY68xfxI2MrE...,\n .phb#p2:after { display:none; }\n .phb#p5...
1,grf99dk,m7y9rp,This subclass has features requiring the *Psyc...,UnearthedArcana,Bard - College of Forgotten Echoes; a spiritua...,Subclass,https://www.gmbinder.com/share/-M90l_C5MPCYvWv...,https://www.gmbinder.com/share/-M90l_C5MPCYvWv...,\n .phb:after {\n display: none;\n ...
2,grd344y,m7s097,"Hello, everyone! Hoping to get some serious c...",UnearthedArcana,Fortune Domain - 5e Cleric Subclass (CONSTRUCT...,Subclass,https://homebrewery.naturalcrit.com/share/MhTK...,https://homebrewery.naturalcrit.com/source/MhT...,### Fortune Domain\nClerics of the Fortune dom...
3,gr9amwd,m75i2f,"When I say partly inspired by Hollow Knight, I...",UnearthedArcana,"Bugfolk, a Race of Creepy-Crawlies Partly Insp...",Race,https://www.gmbinder.com/share/-MVy-6VQ01Ft9Ys...,https://www.gmbinder.com/share/-MVy-6VQ01Ft9Ys...,\n/* Background */\n .phb{ background-image: ...
4,gr99ryw,m75bsj,Note: The class currently just uses the Warloc...,UnearthedArcana,"The Eldritch Brawler | Class with a unique, fa...",Class,https://homebrewery.naturalcrit.com/share/leDB...,https://homebrewery.naturalcrit.com/source/leD...,\n# Eldritch Brawler\n1\n1 | Eldritch Brawler\...
5,gr8y8lg,m73c7l,Hi everyone! This a subclass that I've been th...,UnearthedArcana,Oath of Self - A Paladin subclass for those wh...,Subclass,https://www.gmbinder.com/share/-MVxOdThHWkEdg0...,https://www.gmbinder.com/share/-MVxOdThHWkEdg0...,### Oath of Self\n\nAn Oath of Self paladin is...
6,gr67wsv,m6kq63,Survey [here](https://docs.google.com/forms/d/...,UnearthedArcana,Ranger - Crystalline Trapper,Subclass,https://www.gmbinder.com/share/-MVJujifYnF1PBc...,https://www.gmbinder.com/share/-MVJujifYnF1PBc...,Ranger Subclass: Crystalline Trapper\n ...
7,gr577dw,m68lk9,[https://homebrewery.naturalcrit.com/share/5L4...,UnearthedArcana,Greywatcher: An Advanced Demon Hunter Class,Class,https://homebrewery.naturalcrit.com/share/5L4t...,https://homebrewery.naturalcrit.com/source/5L4...,\n .phb#p1{ text-align:center; }\n .phb#p1:a...
9,gqyg3ll,m56sxy,I think most people here know by now that WOTC...,UnearthedArcana,Fairy Race Remixed (5e),Race,https://www.gmbinder.com/share/-MVm5LgT_HqPPsP...,https://www.gmbinder.com/share/-MVm5LgT_HqPPsP...,# Fairies - A remixed race (v1.1)\n\nFairies h...


In [15]:
processed_cmts = cwd / 'Data' / 'CommentsProcessed.csv'
processed_posts = cwd / 'Data' / 'PostsProcessed.csv'

posts_df.to_csv(processed_posts,sep=',',encoding='utf-8')
cmts_df.to_csv(processed_cmts,sep=',',encoding='utf-8')

In [16]:
cmts_df.rename(columns={"post_flair":"flair"},inplace=True)
cmts_df.rename(columns={"post_title":"title"},inplace=True)

In [17]:
all_data = pd.concat([posts_df[["src_url","sub","flair","title","Text"]], cmts_df[["src_url","sub","flair","title","Text"]]])
all_data.drop_duplicates(subset=["src_url"], inplace=True)

In [18]:
processed = cwd / 'Data' / 'TrainProcessed.csv'

all_data = all_data[["sub","flair","title","Text","src_url"]]

all_data.to_csv(processed, sep=',', index=False,encoding='utf-8')