#### Notebook Summary
Notebook 01 imports the data.

## Executive summary
Using NLP techniques, we can build a highly accurate classification model that automatically assigns Reddit comments to the sub-Reddit they came from.  Misclassified comments are usually ones that even a human familiar with both sub-Reddits could not classify from the comment text alone, without the conversation the comment was part of.

The model works with any two sub-Reddits; the defaults used here are the sub-Reddits for the two games Stellaris and Dwarf Fortress.


## Data science problem
Are the comments in two given sub-Reddits different enough for a machine learning model to assign each comment to its sub-Reddit?


## Requirements
Python 3.6+, requests, pandas, re, scikit-learn, time, seaborn, itertools, warnings

## Index

- [API Setup](#API-Setup)
- [Reddit Comment Import](#Reddit-Comment-Import)
- [Convert to Dataframe](#Convert-to-Dataframe)
- [Save Comments as CSV's](#Save-Comments-as-CSV's)

## API Setup

The API is set up and tested to confirm a status code of 200 (no problems with the connection).

In [1]:
# imports
import requests
import time
import pandas as pd
import re

In [2]:
# set up API
url = 'https://www.reddit.com/hot.json'
headers = {'User-agent': 'DSI SDbot'}
res = requests.get(url, headers = headers)
res.status_code

200

In [3]:
the_json = res.json()
sorted(the_json.keys())

['data', 'kind']
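
The import loops later in this notebook rely on two fields under `data`: `children` (the list of posts) and `after` (the token for the next page). The sketch below shows that shape on a made-up, trimmed-down payload. The sample values, the `kind` strings, and the helper name `extract_selftexts` are assumptions for illustration, not output captured from the API.

```python
# Hypothetical payload, trimmed to the fields this notebook reads.
sample_listing = {
    "kind": "Listing",
    "data": {
        "after": "t3_abc123",  # made-up pagination token
        "children": [
            {"kind": "t3", "data": {"selftext": "First post body"}},
            {"kind": "t3", "data": {"selftext": ""}},  # e.g. a post with no body text
        ],
    },
}


def extract_selftexts(listing):
    """Return the non-empty selftext of every post in one listing page."""
    children = listing.get("data", {}).get("children", [])
    texts = [child["data"].get("selftext", "") for child in children]
    return [text for text in texts if text.strip()]


selftexts = extract_selftexts(sample_listing)
next_token = sample_listing["data"]["after"]
```

Here `selftexts` holds only the one non-empty body, and `next_token` is the value that would be sent as the `after` parameter of the next request.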

In [4]:
# Naming the URL objects
subreddit_1 = 'stellaris'
subreddit_2 = 'dwarffortress'

subreddit_1_url = f'https://www.reddit.com/r/{subreddit_1}/.json'
subreddit_2_url = f'https://www.reddit.com/r/{subreddit_2}/.json'

## Reddit Comment Import

Reddit comments will be imported using the API.

In [5]:
posts_1 = []
after = None
for _ in range(40):
    if after is None:
        params = {}
    else:
        params = {'after': after}
    res = requests.get(subreddit_1_url, params=params, headers=headers)
    if res.status_code == 200:
        the_json = res.json()
        posts_1.extend(the_json['data']['children'])
        after = the_json['data']['after']
    else:
        print(res.status_code)
        break
    time.sleep(1)

In [6]:
len(posts_1)

991

In [7]:
posts_2 = []
after = None
for _ in range(40):
    if after is None:
        params = {}
    else:
        params = {'after': after}
    res = requests.get(subreddit_2_url, params=params, headers=headers)
    if res.status_code == 200:
        the_json = res.json()
        posts_2.extend(the_json['data']['children'])
        after = the_json['data']['after']
    else:
        print(res.status_code)
        break
    time.sleep(1)

In [8]:
len(posts_2)

992
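
The two import loops above differ only in the URL, so they could be folded into one helper. The following is a minimal sketch, not the notebook's actual code: `fetch_posts` and the `get_page` callable are hypothetical names, and the fake pages exist only so the example runs without a network connection. Unlike the loops above, the sketch also stops when `after` comes back as `None`, on the assumption that this marks the last page.

```python
import time


def fetch_posts(get_page, max_pages=40, pause=1.0):
    """Collect posts across listing pages.

    get_page(after) is a hypothetical callable that returns
    (status_code, parsed_json) for the page following `after`.
    """
    posts = []
    after = None
    for _ in range(max_pages):
        status, payload = get_page(after)
        if status != 200:
            print(status)  # report the failing status code and stop
            break
        posts.extend(payload["data"]["children"])
        after = payload["data"]["after"]
        if after is None:  # assumed to mean there are no further pages
            break
        time.sleep(pause)  # stay polite between requests
    return posts


# Made-up two-page listing used in place of real API responses.
fake_pages = {
    None: (200, {"data": {"children": [{"data": {"selftext": "a"}}], "after": "t3_x"}}),
    "t3_x": (200, {"data": {"children": [{"data": {"selftext": "b"}}], "after": None}}),
}
demo_posts = fetch_posts(lambda after: fake_pages[after], pause=0)
```

With `requests`, `get_page` could be a small wrapper that calls `requests.get(subreddit_1_url, params=..., headers=headers)` and returns `res.status_code` together with `res.json()`.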

## Convert to Dataframe

Reddit comments are converted to a pandas dataframe, with all text lowercased.  The common stray string &amp;#x200B; (an HTML-escaped zero-width space) is removed.

In [9]:
# Collect the non-empty post bodies, then build the dataframe in one step
# (DataFrame.append was removed in pandas 2.0).
comments_1 = []
for post in posts_1:
    # Strip the entity before lowercasing: .lower() turns it into
    # '&amp;#x200b;', which a case-sensitive match would miss.
    new_entry = post['data']['selftext'].replace('&amp;#x200B;', '').lower()
    if new_entry.strip():
        comments_1.append(new_entry)
df_1 = pd.DataFrame({'Comment': comments_1})
df_1.head()

Unnamed: 0,Comment
0,"every december, the admins give reddit moderat..."
1,**greetings!**\n\*^(\*rules and conditions may...
2,to establish a branch office!
3,"this is not a new idea, but it has become espe..."
4,"a lot of perks and what not have things ""we"" a..."


In [10]:
# Same steps as for the first sub-Reddit.
comments_2 = []
for post in posts_2:
    # Strip the entity before lowercasing so the match is not missed.
    new_entry = post['data']['selftext'].replace('&amp;#x200B;', '').lower()
    if new_entry.strip():
        comments_2.append(new_entry)
df_2 = pd.DataFrame({'Comment': comments_2})
df_2.head()

Unnamed: 0,Comment
0,ask about anything related to dwarf fortress -...
1,i am a fairly new player and am still trying t...
2,"anytime i try to gen a world now, it rejects t..."
3,wait i think i have it. a safe material drop c...
4,how do you organize pasturing space in your fo...


In [11]:
df_2.head()

Unnamed: 0,Comment
0,ask about anything related to dwarf fortress -...
1,i am a fairly new player and am still trying t...
2,"anytime i try to gen a world now, it rejects t..."
3,wait i think i have it. a safe material drop c...
4,how do you organize pasturing space in your fo...
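
Because the text is lowercased, a case-sensitive search for `&amp;#x200B;` can miss the lowercased form `&amp;#x200b;`. A quick sanity check is to count the rows that still contain the entity in either case. This is a minimal sketch: `count_entity_rows` is a hypothetical helper, and the demo rows are made up to stand in for `df_1` and `df_2`.

```python
import pandas as pd

ENTITY = "&amp;#x200B;"


def count_entity_rows(df, column="Comment", entity=ENTITY):
    """Count rows whose text still contains the entity, ignoring case."""
    lowered = df[column].str.lower()
    return int(lowered.str.contains(entity.lower(), regex=False).sum())


# Made-up rows standing in for the real comment dataframes.
demo = pd.DataFrame(
    {"Comment": ["plain text", "&amp;#x200b; leftover", "another &amp;#x200B; one"]}
)
leftover = count_entity_rows(demo)
```

A result of zero on `df_1` and `df_2` would suggest the cleanup step removed every occurrence.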


## Save Comments as CSV's

Comment dataframes are saved as CSV files for future use, so that the API import does not have to be rerun for the analysis and model evaluation.

In [12]:
df_1.to_csv('./Data/data1.csv')
df_2.to_csv('./Data/data2.csv')
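
The save step assumes the `./Data` directory already exists, and writing the default index adds an extra unnamed column when the file is read back. The sketch below shows one way to handle both. `save_comments` is a hypothetical helper, the demo rows are made up, and `index=False` is a deviation from the cell above, so any later notebook that expects the index column would need adjusting.

```python
import os
import tempfile

import pandas as pd


def save_comments(df, path):
    """Write df to path as CSV, creating the parent directory if needed."""
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    # index=False avoids an 'Unnamed: 0' column when the file is reloaded.
    df.to_csv(path, index=False)


# Made-up rows; a temporary directory keeps the example from touching ./Data.
demo = pd.DataFrame({"Comment": ["first comment", "second comment"]})
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "Data", "demo.csv")
    save_comments(demo, demo_path)
    reloaded = pd.read_csv(demo_path)
```

In the notebook, the equivalent calls would be `save_comments(df_1, './Data/data1.csv')` and `save_comments(df_2, './Data/data2.csv')`.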