# Welcome Back

...To the great unveiling of the **Brakefield Enterprises RPG Classifier&trade;**!

In our previous notebook we showed you how we used the Reddit API to glean text data from posts and comments in the popular [rpg](https://www.reddit.com/r/rpg/new/) and [rpg_gamers](https://www.reddit.com/r/rpg_gamers/new/) subreddits.

Now take a look at just how we turn that data into a useful decision-assisting model!

In [1]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split,GridSearchCV

from sklearn.ensemble import VotingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import GradientBoostingClassifier,AdaBoostClassifier
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier

import requests
import time
import json
from IPython.display import clear_output

In [2]:
dfpp = pd.read_csv('./data/pen-and-paper.csv')
dfvg = pd.read_csv('./data/video-game.csv')

In [3]:
dfpp.head()

Unnamed: 0,sub,name,text,url,comments
0,rpg,t3_a7pwer,What are your favorite pre-made campaigns?I do...,https://www.reddit.com/r/rpg/comments/a7pwer/w...,"[""Operation Morpheus for Aftermath!. It is th..."
1,rpg,t3_a7pprr,50 Fantasy RPG Quest Ideas,https://www.reddit.com/r/rpg/comments/a7pprr/5...,
2,rpg,t3_a7pdaz,What system should I use for a fantasy army vs...,https://www.reddit.com/r/rpg/comments/a7pdaz/w...,['Gurps. Also check out the novel The Doomfare...
3,rpg,t3_a7pd5z,Physical Purchases 2018I just did an inventory...,https://www.reddit.com/r/rpg/comments/a7pd5z/p...,"[""It's only a problem if you don't play them a..."
4,rpg,t3_a7opa2,Roleplaying Intelligent Creatures in D&amp;D 5...,https://www.reddit.com/r/rpg/comments/a7opa2/r...,


In [4]:
dfpp.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 993 entries, 0 to 992
Data columns (total 5 columns):
sub         993 non-null object
name        993 non-null object
text        993 non-null object
url         993 non-null object
comments    857 non-null object
dtypes: object(5)
memory usage: 38.9+ KB


In [5]:
dfvg.head()

Unnamed: 0,sub,name,text,url,comments
0,rpg_gamers,t3_a7q3sm,Fallout Inspired RPG Atom Released,https://www.reddit.com/r/rpg_gamers/comments/a...,
1,rpg_gamers,t3_a7olpq,Check out new game in the making !,https://www.reddit.com/r/rpg_gamers/comments/a...,
2,rpg_gamers,t3_a7o0qt,People who've played Ni No Kuni 2.https://yout...,https://www.reddit.com/r/rpg_gamers/comments/a...,
3,rpg_gamers,t3_a7mq9q,Open-World hardcore RPG 'Outward’ Trailer,https://www.reddit.com/r/rpg_gamers/comments/a...,"['Looks good, but I do hope there is a story t..."
4,rpg_gamers,t3_a7mmjs,The Philosophy of Planescape: Torment,https://www.reddit.com/r/rpg_gamers/comments/a...,"['Nice channel, thanks for sharing. ']"


In [6]:
dfvg.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 5 columns):
sub         1000 non-null object
name        1000 non-null object
text        1000 non-null object
url         1000 non-null object
comments    890 non-null object
dtypes: object(5)
memory usage: 39.1+ KB


In [7]:
# Looking ahead towards modeling, I'm going to binarize the sub feature
dfpp['sub'] = 1
dfvg['sub'] = 0

# And everything should be in one dataframe
df = pd.concat([dfpp,dfvg])
df.reset_index(drop=True,inplace=True)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1993 entries, 0 to 1992
Data columns (total 5 columns):
sub         1993 non-null int64
name        1993 non-null object
text        1993 non-null object
url         1993 non-null object
comments    1747 non-null object
dtypes: int64(1), object(4)
memory usage: 77.9+ KB


In [8]:
# We won't need the 'name' or 'url' columns for our purposes:
df.drop(columns=['name','url'],inplace=True)

In [9]:
# We're going to use a CountVectorizer on our text, which will work best with a single
# string for each row:

def text_blob(row):
    # Missing comments are read in as NaN (a float), so only append real strings
    if isinstance(df['comments'][row], str):
        return df['text'][row] + df['comments'][row]
    else:
        return df['text'][row]

In [10]:
df['alltext'] = [text_blob(i) for i in range(len(df))]
df.head()

Unnamed: 0,sub,text,comments,alltext
0,1,What are your favorite pre-made campaigns?I do...,"[""Operation Morpheus for Aftermath!. It is th...",What are your favorite pre-made campaigns?I do...
1,1,50 Fantasy RPG Quest Ideas,,50 Fantasy RPG Quest Ideas
2,1,What system should I use for a fantasy army vs...,['Gurps. Also check out the novel The Doomfare...,What system should I use for a fantasy army vs...
3,1,Physical Purchases 2018I just did an inventory...,"[""It's only a problem if you don't play them a...",Physical Purchases 2018I just did an inventory...
4,1,Roleplaying Intelligent Creatures in D&amp;D 5...,,Roleplaying Intelligent Creatures in D&amp;D 5...
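
The row-by-row helper above can also be written as a single column operation. Here is a minimal sketch on a made-up two-row frame (`toy` and its contents are illustrative, not the real data), assuming missing comments should simply contribute nothing:

```python
import pandas as pd

# Toy stand-in for df: one post with comments, one without
toy = pd.DataFrame({
    "text": ["Favorite campaigns?", "50 Quest Ideas"],
    "comments": ["['Try Aftermath!']", None],
})

# fillna('') turns missing comments into empty strings, so the
# concatenation never produces NaN
toy["alltext"] = toy["text"] + toy["comments"].fillna("")
print(toy["alltext"].tolist())
```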


In [11]:
# Looking good, df! Now to establish and split our modeling data
X = df['alltext']
y = df['sub']

# Our data is about a 50/50 split, but I always like to stratify just in case
X_train,X_test,y_train,y_test = train_test_split(X,y,stratify=y,random_state=42)

In [12]:
# Here we establish our CountVectorizer. For starters I'll keep only the 5000 most
# frequent terms across the training text, and will exclude English stop words

cv = CountVectorizer(max_features = 5000, stop_words = 'english')
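
To make the two arguments concrete, here is a small sketch on three made-up sentences (the `docs` list and the `toy_cv` name are illustrative only). `max_features` ranks terms by their total count over everything passed to `fit`, and `stop_words='english'` drops common function words before counting.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up corpus: "dice" appears 4 times overall, "dungeon" twice,
# every other content word once
docs = [
    "the dungeon master rolled the dice",
    "the game has a dungeon and a boss",
    "dice dice dice",
]

# Keep only the 2 most frequent non-stop-word terms
toy_cv = CountVectorizer(max_features=2, stop_words="english")
counts = toy_cv.fit_transform(docs)

# vocabulary_ maps each kept term to its column index
print(sorted(toy_cv.vocabulary_))
print(counts.toarray())
```

Only `dice` and `dungeon` survive here, so each sentence becomes a two-column row of counts.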

In [13]:
X_train.head()

437     Another Werewolf: The Apocalypse 20th ed quest...
1852    Obsidian Entertainment to be Bought out by Mic...
572     Medieval, realistic rpg systems?Hello! I've go...
1218    JRPGS with intuitive progression systems and f...
1550    What is your favorite kind of prologue? (the f...
Name: alltext, dtype: object

In [14]:
# apply our CountVectorizer 
X_train_cv = pd.DataFrame(cv.fit_transform(X_train).todense(),
                          columns = cv.get_feature_names())
X_test_cv = pd.DataFrame(cv.transform(X_test).todense(),
                         columns = cv.get_feature_names())

In [15]:
X_train_cv.head()

Unnamed: 0,00,000,01,03,04,06,07,08,09,10,...,youtube,ys,zelda,zero,zodiac,zombie,zombies,zone,zones,zweihander
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,2,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,2,0,2,0,0,...,2,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [16]:
# With a large amount of text data like this, I'm feeling good about a Naive-Bayes model
nb = MultinomialNB(alpha=0.01)
model = nb.fit(X_train_cv, y_train)
model.score(X_train_cv, y_train)

0.963855421686747

In [17]:
model.score(X_test_cv, y_test)

0.9498997995991983
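
For context, here is a quick sketch of the majority-class baseline, using the row counts reported by `info()` earlier (993 pen-and-paper rows, 1000 video game rows). A model that always guessed the larger class would score roughly 0.50, so the test score above is well clear of chance.

```python
# Row counts taken from the info() outputs earlier in the notebook
n_pen_and_paper = 993
n_video_game = 1000

# Accuracy of always predicting the larger class
baseline = max(n_pen_and_paper, n_video_game) / (n_pen_and_paper + n_video_game)
print(round(baseline, 4))
```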

In [18]:
# # NB looks really good. Next we tried a RandomForestClassifier, which was very much overfit.
# # Let's see how a Gridsearch helps
# rf = RandomForestClassifier()
# rf_params = {
#     'n_estimators': [ 400,500,600],
#     'max_depth': [None, 2,5],
#     'min_samples_split': [1.0,2,10]
# }
# gs = GridSearchCV(rf, param_grid=rf_params)
# gs.fit(X_train_cv, y_train)
# print(gs.best_score_)
# gs.best_params_
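
The commented-out search above tunes only the forest on counts that were already vectorized. One possible variation, sketched here on made-up toy posts, wraps the vectorizer and classifier in a `Pipeline` so the vocabulary is rebuilt inside each cross-validation fold. The step names (`cv`, `nb`), the toy texts, and the parameter values are assumptions for illustration, not settings used in this notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up posts: 1 = pen-and-paper, 0 = video game
toy_texts = [
    "dungeon master rolls dice", "tabletop campaign with dice",
    "dungeon campaign module", "tabletop dice and miniatures",
    "console game trailer released", "new game release on steam",
    "steam sale console trailer", "game soundtrack and trailer",
]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]

pipe = Pipeline([
    ("cv", CountVectorizer(stop_words="english")),
    ("nb", MultinomialNB()),
])

# '<step>__<param>' addresses a parameter of a pipeline step
grid = {"cv__max_features": [5, 10], "nb__alpha": [0.01, 1.0]}

search = GridSearchCV(pipe, param_grid=grid, cv=2)
search.fit(toy_texts, toy_labels)
print(search.best_params_, search.best_score_)
```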

In [19]:
# NB looks pretty good! Now, as per the prompt, to try running a random forest
rf = RandomForestClassifier(min_samples_split=2,n_estimators=500)

rf.fit(X_train_cv,y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=500, n_jobs=None,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [20]:
rf.score(X_train_cv,y_train)

1.0

In [21]:
rf.score(X_test_cv,y_test)

0.9218436873747495
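
The two test scores can be turned into error counts. This is a rough sketch, assuming the split used scikit-learn's default `test_size` of 0.25, which is rounded up to 499 test rows out of 1993:

```python
import math

n_rows = 1993
n_test = math.ceil(0.25 * n_rows)  # default test_size=0.25, rounded up

# Scores copied from the cell outputs above
scores = {
    "naive_bayes": {"train": 0.963855421686747, "test": 0.9498997995991983},
    "random_forest": {"train": 1.0, "test": 0.9218436873747495},
}

errors_by_model = {}
for name, s in scores.items():
    errors_by_model[name] = round((1 - s["test"]) * n_test)
    gap = s["train"] - s["test"]
    print(f"{name}: {errors_by_model[name]} test errors, train/test gap {gap:.3f}")
```

Under that assumption Naive Bayes misclassifies 25 test posts and the random forest 39, even though the forest is perfect on the training set.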

## But wait, there's less!

This all got pretty technical, but we here at Brakefield Enterprises always want to put our clients first. That's why we've packaged our model into a single, user-friendly Python function! Simply input the URL of your target subreddit and let the **Brakefield Enterprises RPG Classifier&trade;** do the rest!

In [22]:
# build a single function to export and run in a single cell! So convenient!
def rpg_classify(suburl):
    print("Importing Modules...")
    import pandas as pd
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.model_selection import train_test_split

    from sklearn.naive_bayes import MultinomialNB

    import requests
    import time
    import json
    from IPython.display import clear_output
    
    # first let's build our model. It will read in our data:
    clear_output()
    print("Reading Data...")
    dfpp = pd.read_csv('./data/pen-and-paper.csv')
    dfvg = pd.read_csv('./data/video-game.csv')
    
    dfpp['sub'] = 1
    dfvg['sub'] = 0
    df = pd.concat([dfpp,dfvg])
    df.reset_index(drop=True,inplace=True)
    df.drop(columns=['name','url'],inplace=True)
    
    def text_blob(row):
        if type(df['comments'][row]) == str:
            return df['text'][row] + df['comments'][row]
        else:
            return df['text'][row]
    
    df['alltext'] = [text_blob(i) for i in range(len(df))]

    X = df['alltext']
    y = df['sub']
    X_train,X_test,y_train,y_test = train_test_split(X,y,stratify=y,random_state=42)
    
    # we'll vectorize the data too, but that won't get a print statement
    # (note: max_features is 4000 here, while the model scored earlier used 5000)
    cv = CountVectorizer(max_features = 4000, stop_words = 'english')
    X_train_cv = pd.DataFrame(cv.fit_transform(X_train).todense(),
                          columns = cv.get_feature_names())
    
    # second, let's instantiate and fit our Naive-Bayes model
    clear_output()
    print("Building Model...")
    nb = MultinomialNB(alpha=0.01)
    model = nb.fit(X_train_cv, y_train)
    
    # And now we'll repeat our earlier steps to read in our posts
    clear_output()
    print("Reading Posts...")
    posts = []
    after = None
    headers = {'User-agent':'BBLab03'}

    # iterating any more than this is a waste.
    for i in range(40):
        # our API scrape uses Reddit's 'after' parameter
        if after is None:
            params = {}
        else:
            params = {'after':after}

        url = suburl+'.json'
        res = requests.get(url, params = params, headers=headers)
        # A quick check that everything is coming through alright
        if res.status_code == 200:
            the_json = res.json()
            # add to posts
            posts.extend(the_json['data']['children'])
            after = the_json['data']['after']
        # Print any error codes that come up
        else:
            print(res.status_code)
            break
        # Print something to make sure progress happens while the program runs 
        if i % 10 == 0:
            headers['User-agent'] = 'BBLab03-'+str(i)
            clear_output()
            print("Reading Posts...")
            print(str(len(posts))+" posts so far.")
        # make sure not to overload Reddit with requests!
        time.sleep(.25)
    
    clear_output()
    print("Formatting Posts...")
    # that read in all our posts. Now we extract what info we need into a dataframe
    list_of_lists = []
    # iterate over the posts
    for i in range(len(posts)):
        # fill in our desired fields
        sub = posts[i]['data']['subreddit']
        name = posts[i]['data']['name']
        title = posts[i]['data']['title']
        body = posts[i]['data']['selftext']
        suffix = posts[i]['data']['permalink']
        url = 'https://www.reddit.com'+ str(suffix)
        text = title + body
        row = [sub,name,text,url,None]
        # Here's where I catch duplicates
        if row not in list_of_lists:
            list_of_lists.append(row)
    # Here I put my list into an easy-to-use DataFrame!    
    df = pd.DataFrame(data=list_of_lists,columns=['sub','name','text','url','comments'])
    
    # iterate over the dataframe
    for row in range(len(df)):
        # finish formatting the url address
        url = str(df['url'][row]+'.json')
        res = requests.get(url,headers=headers)
        the_json = res.json()
        # empty list for depositing our comments
        comment_list = []
        # A quick check to skip over comment-less posts
        if the_json[1]['data']['children']:
            for comment in range(len(the_json[1]['data']['children'])):
                # the Reddit API doesn't give body text for more than 50 comments on a single
                # post
                if comment <= 50:
                    try:
                        comment_list.append(the_json[1]['data']['children'][comment]['data']['body'])
                    except KeyError:
                        print('We got some invalid comments!')
                        print("row: ",row,'; comment: ',comment)
                        break
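                    # store the list as a string, matching how the training CSV reads back in
                    # (text_blob below only appends comments that are strings)
                    df.at[row, 'comments'] = str(comment_list)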
        # print to ensure the program runs, delay it enough that it doesn't clog up the series
        # of tubes
        if row % 10 == 0:
            clear_output()
            print("Reading Comments...")
            print(str(row)+" rows down!")
        time.sleep(.07)
    
    clear_output()
    print("Formatting Text...")
    def text_blob(row):
        if type(df['comments'][row]) == str:
            return df['text'][row] + df['comments'][row]
        else:
            return df['text'][row]

    df['alltext'] = [text_blob(i) for i in range(len(df))]
    
    clear_output()
    print("Classifying Subreddit...")
    X = df['alltext']
    X_cv = pd.DataFrame(cv.transform(X).todense(),
                         columns = cv.get_feature_names())
    
    # since 1=pen and paper and 0=video game, rounding the mean of all posts' predictions
    # will give the model its classification!
    subscore = model.predict(X_cv).mean()
    if subscore >= 0.5:
        print("This is a pen-and-paper RPG subreddit!")
    elif 0 <= subscore < 0.5:
        print("This is a video game RPG subreddit!")
    else:
        print("Something has gone horribly wrong!")
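
The final decision rule does not depend on the scraping and can be checked on its own. In this minimal sketch, `label_from_predictions` is a hypothetical helper (not part of the function above) that takes per-post 0/1 predictions, with 1 meaning pen-and-paper:

```python
def label_from_predictions(predictions):
    """Return a subreddit label from per-post 0/1 predictions."""
    # No posts means no evidence either way
    if not predictions:
        return "unknown"
    # Share of posts predicted as pen-and-paper (class 1)
    share = sum(predictions) / len(predictions)
    # Ties go to pen-and-paper, mirroring the >= 0.5 check above
    return "pen-and-paper" if share >= 0.5 else "video game"


print(label_from_predictions([1, 1, 0]))
print(label_from_predictions([0, 0, 1]))
```

Returning a label instead of printing makes the rule easy to test without calling Reddit.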
        