# Problem Statement.
The moderators of a Spirituality discussion forum have asked me to help them filter out unwanted posts. The forum focuses on Spirituality, but religious groups frequently post content that disrupts the discussion, and senior forum members have asked for such posts to be removed. With hundreds of new posts each day, the moderators need a quick way to identify and remove them.

### Stakeholders:
Moderators of the Spirituality forum.

### Approach:
Using classification models such as Logistic Regression, Naive Bayes and RandomForest, predict whether a post belongs to Spirituality or Religion. Data will be scraped from reddit.com, using the subreddits /r/spirituality and /r/religion. A model will be trained with Spirituality as the positive class, and the top predictive features will also be identified.

### Measure of Success:
The classification model will be assessed on its test accuracy and specificity. The model should reduce false positives as much as possible: we do not want posts related to Religion to be classified as Spirituality.
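
As a small illustration of these two measures, the sketch below computes accuracy and specificity from a list of true and predicted labels, with Spirituality as the positive class. The labels are made up for the example and are not project data, and the helper names are hypothetical.

```python
def confusion_counts(y_true, y_pred, positive="spirituality"):
    """Count TP, FP, TN, FN for a binary problem with the given positive label."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1  # a Religion post predicted as Spirituality
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def accuracy(y_true, y_pred, positive="spirituality"):
    """Share of all posts that were classified correctly."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive)
    return (tp + tn) / (tp + fp + tn + fn)


def specificity(y_true, y_pred, positive="spirituality"):
    """Share of true Religion posts that were correctly predicted as Religion."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive)
    return tn / (tn + fp)


# Made-up labels for illustration only.
y_true = ["religion", "religion", "religion", "religion", "spirituality", "spirituality"]
y_pred = ["religion", "religion", "religion", "spirituality", "spirituality", "religion"]

print("accuracy:", accuracy(y_true, y_pred))
print("specificity:", specificity(y_true, y_pred))
```

Specificity is the measure that falls when Religion posts slip through as Spirituality, which is why it is tracked alongside accuracy.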

# Executive Summary.

To solve the problem of religious content being posted in our Spirituality forum, we built a classification model using posts from reddit.com/r/spirituality/ and reddit.com/r/religion/. About 1300 posts from each subreddit were collected for training and testing the model.

A range of classification models, including Logistic Regression, Multinomial Naive Bayes and RandomForest Classifier, were tested. The best performing model was the Naive Bayes classifier, with a test accuracy of 0.904, a specificity of 0.880 and less overfitting than the other models. On new data, accuracy remained high at 0.917 and specificity reached 0.971.

The final model can now be deployed with above 90% accuracy. To guard against false positives (i.e. posts related to Religion being predicted as Spirituality), posts classified as Spirituality should be checked again if they contain keywords like phone, day, life, time and death, as these are known to cause false positive misclassifications by the model.
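
The second check described above could be sketched as a simple keyword filter. This is a minimal illustration, not part of the trained model: the function name `needs_review` and the exact whole-word matching rule are assumptions, and only the five keywords come from the summary.

```python
import re

# Keywords named in the summary as associated with false positives.
REVIEW_KEYWORDS = {"phone", "day", "life", "time", "death"}


def needs_review(predicted_label, text, keywords=REVIEW_KEYWORDS):
    """Return True if a post predicted as Spirituality contains a review keyword.

    Matching is on whole lowercase words (an assumption for this sketch),
    so "daydream" does not match "day".
    """
    if predicted_label != "spirituality":
        return False
    words = set(re.findall(r"[a-z']+", text.lower()))
    return not words.isdisjoint(keywords)


print(needs_review("spirituality", "What happens after death?"))
print(needs_review("spirituality", "Meditation tips for beginners"))
```

Posts flagged this way would go to a moderator for a manual look rather than being removed automatically.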

# Part I. Data Collection.
Posts and discussions on the subjects of religion and spirituality will be collected from reddit.com. Using subreddits 'https://www.reddit.com/r/religion' and 'https://www.reddit.com/r/spirituality', data will be collected through reddit's API using the .json feeds.

In [1]:
import requests
import pandas as pd
import time
import random

In [2]:
# set the parameters needed
headers = {'User-agent': 'Laz Inc 1.0'}
subjects = ['religion','spirituality']
areas = ['hot', 'controversial', 'new', 'top']
periods = ['week','month', 'year', 'all']

In [3]:
# subreddit json can be downloaded using the url format below. we repeat the api call for each subject, area and period
# https://www.reddit.com/r/boardgames/top.json?t=all&limit=25
for subject in subjects:
    for area in areas:
        for period in periods:
            url = 'https://www.reddit.com/r/' + subject + '/' + area + '.json?t=' + period + '&limit=50'
            after = None
            filename = subject + '_' + area + '_' + period + '.csv'
            for a in range(10):
                if after is None:
                    params = None
                else:
                    params = {'after':after}
                print(params)
                res = requests.get(url, headers=headers, params=params)

                if res.status_code != 200:
                    print('Status error', res.status_code)
                    break

                current_dict = res.json()
                current_posts = [p['data'] for p in current_dict['data']['children']]

                # convert current posts into dataframe
                current_df = pd.DataFrame(current_posts)

                # add new posts to csv file, always in the same folder,
                # and before checking the cursor so the last page is kept
                filepath = './csv_files_2/' + filename
                if a > 0:
                    prev_posts = pd.read_csv(filepath)
                    pd.concat([prev_posts, current_df]).to_csv(filepath, index = False)
                else:
                    current_df.to_csv(filepath, index = False)

                # get the after value; stop when there is no further page
                after = current_dict['data']['after']
                if after is None:
                    break

                # generate a random sleep duration to look more 'natural'
                sleep_duration = random.randint(2,5)
                # print(sleep_duration)
                time.sleep(sleep_duration)

None
{'after': 't3_k5echi'}
{'after': 't3_k4apfr'}
{'after': 't3_k302sl'}
{'after': 't3_k1i5q5'}
{'after': 't3_jzzibs'}
{'after': 't3_jydi7h'}
{'after': 't3_jx19jc'}
{'after': 't3_jvim5r'}
{'after': 't3_ju511t'}
None
{'after': 't3_k626o4'}
{'after': 't3_k5oz4a'}
{'after': 't3_k56e22'}
{'after': 't3_k4q3ol'}
{'after': 't3_k4d6g1'}
{'after': 't3_k3zy4h'}
{'after': 't3_k3mu1s'}
{'after': 't3_k37wkr'}
{'after': 't3_k2r2v1'}
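
The cell above walks through each listing by passing the `after` value of one page as a parameter of the next request. A minimal sketch of that cursor pattern on its own, without network or file handling, might look like the following. `collect_posts`, `fetch_page` and `FAKE_PAGES` are hypothetical names introduced for illustration, and the fake pages only imitate the shape of the listing JSON used above.

```python
def collect_posts(fetch_page, max_pages=10):
    """Follow the `after` cursor for up to max_pages pages and return all posts.

    fetch_page is a hypothetical callable: it takes the current cursor
    (None for the first page) and returns a dict shaped like the listing JSON.
    """
    posts = []
    after = None
    for _ in range(max_pages):
        listing = fetch_page(after)
        # keep the posts of this page before looking at the cursor
        posts.extend(child["data"] for child in listing["data"]["children"])
        after = listing["data"]["after"]
        if after is None:  # no further page
            break
    return posts


# Offline stand-in for the API: two fake pages linked by a cursor.
FAKE_PAGES = {
    None: {"data": {"children": [{"data": {"title": "a"}},
                                 {"data": {"title": "b"}}],
                    "after": "t3_x"}},
    "t3_x": {"data": {"children": [{"data": {"title": "c"}}],
                      "after": None}},
}

titles = [post["title"] for post in collect_posts(FAKE_PAGES.get)]
print(titles)
```

Collecting all pages in memory and writing once is a design choice of this sketch; the notebook cell instead appends to a csv file after every page.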


## Combine all csv files downloaded and drop duplicates and NAs.
After downloading the json files and saving each download as a separate .csv file, we have 16 csv files in total for each subreddit.

Next we will combine these 16 files into a single file for each subreddit, removing duplicate rows and any rows with NaN data.
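
To show what the two cleaning steps do, here is a small sketch on a made-up table. The rows are invented for illustration; only the column names and the `drop_duplicates` / `dropna` calls mirror the cell that follows.

```python
import pandas as pd

# Made-up rows standing in for posts gathered from overlapping listings.
df = pd.DataFrame({
    "subreddit": ["religion", "religion", "religion", "religion"],
    "title": ["Question about fasting", "Question about fasting",
              "Link only", "Weekly thread"],
    "selftext": ["How do you fast?", "How do you fast?",
                 None, "Open discussion"],
    "created_utc": [1606780800, 1606780800, 1606784400, 1606788000],
})

df_clean = (
    df[["subreddit", "title", "selftext", "created_utc"]]
    # the same post can appear in several listings, so keep one copy
    .drop_duplicates(subset=["subreddit", "title", "selftext"])
    # drop rows where any of the selected columns is missing
    .dropna()
)

print(len(df), "->", len(df_clean))
```

The duplicated post is kept once and the row with no `selftext` is dropped, leaving two of the four rows.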

In [2]:
subjects = ['religion', 'spirituality']

In [5]:
# using glob to combine
import glob
import pandas as pd

# loop for each subject
for subject in subjects:
    # create an empty dataframe for concatenation
    df = pd.DataFrame()

    # loop for all csv files belonging to the subject
    for name in glob.glob('./csv_files/' + subject + '*.csv'):
        print(name)
        # read in the csv file as dataframe
        df_new = pd.read_csv(name)
        
        # concatenate the newly read dataframe with the df dataframe
        df = pd.concat([df, df_new])
    
    # drop all duplicate rows and any rows with NaN values
    df_clean = df[['subreddit', 'title','selftext','created_utc']].drop_duplicates(subset=['subreddit', 'title','selftext']).dropna()
    
    # save the combined data for each subreddit 
    df_clean.to_csv(subject + '_cleaned.csv', index=False)
    
    # print out the number of rows saved for each subreddit
    print(subject, len(df_clean))

./csv_files\religion_controversial_all.csv
./csv_files\religion_controversial_month.csv
./csv_files\religion_controversial_week.csv
./csv_files\religion_controversial_year.csv
./csv_files\religion_hot.csv
./csv_files\religion_hot_all.csv
./csv_files\religion_hot_month.csv
./csv_files\religion_hot_week.csv
./csv_files\religion_hot_year.csv
./csv_files\religion_new.csv
./csv_files\religion_new_all.csv
./csv_files\religion_new_month.csv
./csv_files\religion_new_week.csv
./csv_files\religion_new_year.csv
./csv_files\religion_top_all.csv
./csv_files\religion_top_month.csv
./csv_files\religion_top_week.csv
./csv_files\religion_top_year.csv
religion 1342
./csv_files\spirituality_controversial_all.csv
./csv_files\spirituality_controversial_month.csv
./csv_files\spirituality_controversial_week.csv
./csv_files\spirituality_controversial_year.csv
./csv_files\spirituality_hot_all.csv
./csv_files\spirituality_hot_month.csv
./csv_files\spirituality_hot_week.csv
./csv_files\spirituality_hot_year.csv


Now that the data is ready, [continue to classification](classification.ipynb).