# Problem Statement

The objective of this project is to classify posts from the following subreddits:
- r/TalesFromTheCustomer
- r/TalesFromYourServer

TalesFromTheCustomer posts generally comprise accounts of poor customer service encountered by contributors.

TalesFromYourServer posts mainly comprise contributions from people who work(ed) as waiters/waitresses regarding unreasonable customers they encountered at work.

As a result, the vocabulary in both types of posts is very similar. The challenge is therefore to build a model that relies not only on individual words but also on multi-word sequences to tell the posts apart.
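
To make the idea of multi-word sequences concrete, here is a minimal sketch in plain Python that extracts single words and two-word sequences (bigrams) from a sentence. The function name `extract_ngrams` and the two example sentences are illustrative assumptions and are not taken from the collected data or from this project's code.

```python
def extract_ngrams(text, n):
    """Return the list of n-word sequences (as strings) found in `text`."""
    tokens = text.lower().split()
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Hypothetical example sentences that use exactly the same words.
customer_post = 'the waiter ignored my table'
server_post = 'my table ignored the waiter'

# Individual words cannot separate the two sentences ...
shared_words = set(extract_ngrams(customer_post, 1)) & set(extract_ngrams(server_post, 1))

# ... but only some of the bigrams overlap, so word order carries signal.
shared_bigrams = set(extract_ngrams(customer_post, 2)) & set(extract_ngrams(server_post, 2))

print(len(shared_words), sorted(shared_bigrams))
```

In this toy case every individual word is shared, while only two of the four bigrams in each sentence overlap, which is the kind of difference a model using multi-word sequences could exploit.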

# Import libraries/packages

In [1]:
import requests
import pandas as pd
import time
import random

### Set URL for subreddit --- 'Tales From The Customer'

In [2]:
url = 'https://www.reddit.com/r/TalesFromTheCustomer/new.json?limit=100'

### Retrieve posts iteratively, 100 per request. Cap the loop at 15 iterations and stop early if a pagination token repeats or the listing is exhausted.

In [3]:
posts = []
after = None          # pagination token returned by the previous response
list_of_afters = []   # tokens already requested, used to detect repeats

for page in range(15):
    if after is None:
        # No token after the first request means the listing is exhausted.
        if page > 0:
            print('collected', len(posts), 'posts ...')
            break
        current_url = url
    else:
        # A repeated token means the same page would be fetched again.
        if after in list_of_afters:
            print('collected', len(posts), 'posts ...')
            break
        list_of_afters.append(after)
        current_url = url + '&after=' + after
    print(current_url)
    res = requests.get(current_url, headers={'User-agent': 'Pony Inc 1.0'})

    if res.status_code != 200:
        print('Status error', res.status_code)
        break

    current_dict = res.json()
    current_posts = [p['data'] for p in current_dict['data']['children']]
    posts.extend(current_posts)
    after = current_dict['data']['after']

    # Sleep for a random 2 to 10 seconds so the requests do not look like a denial-of-service attack.
    sleep_duration = random.randint(2, 10)
    time.sleep(sleep_duration)

https://www.reddit.com/r/TalesFromTheCustomer/new.json?limit=100
https://www.reddit.com/r/TalesFromTheCustomer/new.json?limit=100&after=t3_d8x1wd
https://www.reddit.com/r/TalesFromTheCustomer/new.json?limit=100&after=t3_cxor6a
https://www.reddit.com/r/TalesFromTheCustomer/new.json?limit=100&after=t3_co4m5a
https://www.reddit.com/r/TalesFromTheCustomer/new.json?limit=100&after=t3_ch487j
https://www.reddit.com/r/TalesFromTheCustomer/new.json?limit=100&after=t3_caqgrw
https://www.reddit.com/r/TalesFromTheCustomer/new.json?limit=100&after=t3_c2b93p
https://www.reddit.com/r/TalesFromTheCustomer/new.json?limit=100&after=t3_bu4ahw
https://www.reddit.com/r/TalesFromTheCustomer/new.json?limit=100&after=t3_bko32k
https://www.reddit.com/r/TalesFromTheCustomer/new.json?limit=100&after=t3_bbughr
collected 997 posts ...
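
Since the same pagination loop is needed for each subreddit, it could be factored into a reusable function. The following is a minimal sketch, not the notebook's actual code: the names `collect_posts`, `fetch_page`, `fetch_with_requests`, and `pause_range` are hypothetical, and the `timeout=30` value is an assumption. Passing the page fetcher in as an argument makes it possible to try the loop on canned data without contacting Reddit.

```python
import random
import time


def collect_posts(url, fetch_page, max_pages=15, pause_range=(2, 10)):
    """Page through a listing and return the collected post dicts.

    `fetch_page` is a hypothetical callable that takes a URL and returns the
    decoded JSON listing (a dict), or None if the request failed.
    """
    posts = []
    seen_afters = set()
    after = None

    for page in range(max_pages):
        if after is None and page > 0:
            break  # no token after the first request: listing exhausted
        if after in seen_afters:
            break  # repeated token: the same page would be fetched again

        if after is None:
            current_url = url
        else:
            seen_afters.add(after)
            current_url = url + '&after=' + after

        listing = fetch_page(current_url)
        if listing is None:
            break  # request failed

        posts.extend(child['data'] for child in listing['data']['children'])
        after = listing['data']['after']

        # Random pause between requests, as in the loop above.
        time.sleep(random.uniform(*pause_range))

    return posts


def fetch_with_requests(current_url):
    """Fetch one listing page with requests; returns None on a non-200 status."""
    import requests  # imported here so collect_posts can be tested offline

    res = requests.get(current_url, headers={'User-agent': 'Pony Inc 1.0'}, timeout=30)
    return res.json() if res.status_code == 200 else None
```

With these definitions, a call such as `collect_posts(url, fetch_with_requests)` would stand in for the loop, once per subreddit.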


### Save posts to file 'talesfromthecustomer.csv'

In [4]:
pd.DataFrame(posts).to_csv('../data/talesfromthecustomer.csv',index=False)
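
The loop above checks for repeated pagination tokens rather than repeated posts, so the collected list could in principle contain the same post more than once (for example, if new posts are submitted between requests). A minimal sketch of a duplicate check follows, assuming each post dict carries Reddit's `name` field (the fullname, such as `t3_d8x1wd`); the helper name `drop_duplicate_posts` is hypothetical.

```python
def drop_duplicate_posts(posts):
    """Return `posts` with repeated `name` values removed, keeping the first occurrence."""
    seen = set()
    unique = []
    for post in posts:
        if post['name'] not in seen:
            seen.add(post['name'])
            unique.append(post)
    return unique


# Hypothetical miniature example of a list with one repeated post.
sample = [{'name': 't3_a'}, {'name': 't3_b'}, {'name': 't3_a'}]
deduplicated = drop_duplicate_posts(sample)
print(len(sample) - len(deduplicated), 'duplicate(s) removed')
```

Applying such a check to `posts` before building the DataFrame would show whether any duplicates were collected.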

### Set URL for subreddit --- 'Tales From Your Server'

In [5]:
url = 'https://www.reddit.com/r/TalesFromYourServer/new.json?limit=100'

### Retrieve posts iteratively, 100 per request. Cap the loop at 15 iterations and stop early if a pagination token repeats or the listing is exhausted.

In [6]:
posts = []
after = None          # pagination token returned by the previous response
list_of_afters = []   # tokens already requested, used to detect repeats

for page in range(15):
    if after is None:
        # No token after the first request means the listing is exhausted.
        if page > 0:
            print('collected', len(posts), 'posts ...')
            break
        current_url = url
    else:
        # A repeated token means the same page would be fetched again.
        if after in list_of_afters:
            print('collected', len(posts), 'posts ...')
            break
        list_of_afters.append(after)
        current_url = url + '&after=' + after
    print(current_url)
    res = requests.get(current_url, headers={'User-agent': 'Pony Inc 1.0'})

    if res.status_code != 200:
        print('Status error', res.status_code)
        break

    current_dict = res.json()
    current_posts = [p['data'] for p in current_dict['data']['children']]
    posts.extend(current_posts)
    after = current_dict['data']['after']

    # Sleep for a random 2 to 10 seconds so the requests do not look like a denial-of-service attack.
    sleep_duration = random.randint(2, 10)
    time.sleep(sleep_duration)

https://www.reddit.com/r/TalesFromYourServer/new.json?limit=100
https://www.reddit.com/r/TalesFromYourServer/new.json?limit=100&after=t3_dhvc3l
https://www.reddit.com/r/TalesFromYourServer/new.json?limit=100&after=t3_dfxcwh
https://www.reddit.com/r/TalesFromYourServer/new.json?limit=100&after=t3_de3lgj
https://www.reddit.com/r/TalesFromYourServer/new.json?limit=100&after=t3_dbond0
https://www.reddit.com/r/TalesFromYourServer/new.json?limit=100&after=t3_da3ue5
https://www.reddit.com/r/TalesFromYourServer/new.json?limit=100&after=t3_d7zu7b
https://www.reddit.com/r/TalesFromYourServer/new.json?limit=100&after=t3_d5i06t
https://www.reddit.com/r/TalesFromYourServer/new.json?limit=100&after=t3_d3j37f
https://www.reddit.com/r/TalesFromYourServer/new.json?limit=100&after=t3_d14o67
collected 996 posts ...


### Save posts to file 'talesfromyourserver.csv'

In [7]:
pd.DataFrame(posts).to_csv('../data/talesfromyourserver.csv',index=False)

## The raw data has been collected and saved. Kindly refer to notebook 02.