![image.png](images/image.jpg)

## Part 1 - Introduction & Business Problem
---

### Background


Reddit is an American social news aggregation, web content rating, and discussion website. Registered members submit content to the site, which is organized by subject into user-created boards called "subreddits". These cover a wide range of topics, such as news, politics, religion, movies, video games, music, books, sports, fitness, cooking, pets, and image sharing. Moderation is conducted by community-specific moderators, who are not Reddit employees.


Because Reddit takes a laissez-faire approach to online content, inappropriate content, such as posts about illegal or undesirable activities, is often visible on the site before community moderators can remove it. For larger subreddits, removing these posts manually is time-consuming and tedious.

### Business Problem

The moderators of the r/Football subreddit (the customer) have engaged our tech consultancy firm to develop a classification model that detects whether a post in the subreddit is related to soccer betting. The moderators do not want gambling content on the subreddit because some of its users are minors, for whom such content is inappropriate. The model is intended to be run periodically to remove soccer-betting posts from the subreddit.


### Broad Methodology


Data will be obtained from both the r/Football and r/SoccerBetting subreddits so that the classification model can be trained on the keywords in both subreddits. Posts from r/SoccerBetting will be used to obtain words that are associated with soccer betting and gambling. A post being classified as belonging to r/SoccerBetting will serve as a proxy for the post containing gambling content.


The eventual binary classification model will classify whether a future post belongs to r/Football or r/SoccerBetting. If a post is classified as belonging to r/SoccerBetting, it will be flagged as inappropriate and deleted. Automating the moderation of the subreddit in this way will save the moderators significant effort.
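The proxy labelling described above can be sketched as follows. This is a minimal illustration, not the final pipeline: the helper names `to_label` and `should_flag` are hypothetical, and encoding the betting subreddit as `1` and r/Football as `0` is an assumption made for this example.

```python
# Hypothetical helpers illustrating the subreddit-as-proxy labelling.
# Assumption: betting subreddit -> 1 (positive class), anything else -> 0.
BETTING_SUBREDDIT = "soccerbetting"


def to_label(subreddit: str) -> int:
    """Map a subreddit name to a binary target, ignoring capitalisation."""
    return 1 if subreddit.strip().lower() == BETTING_SUBREDDIT else 0


def should_flag(predicted_label: int) -> bool:
    """A post predicted as betting content is flagged for removal."""
    return predicted_label == 1


examples = ["Football", "SoccerBetting", "Soccerbetting"]
labels = [to_label(name) for name in examples]
print(labels)  # [0, 1, 1]
```

Comparing names case-insensitively keeps the label stable even if the subreddit name is capitalised differently from one source to another.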


## Using the PushShift API to obtain Subreddit Data


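The [PushShift API](https://github.com/pushshift/api) is used to obtain 2,000 posts each from the r/Football and r/SoccerBetting subreddits. This is done via the function `get_posts`, which requests posts in batches of 100 and passes the `created_utc` timestamp of the last post in each batch as the `before` parameter of the next request, accumulating the results in a DataFrame.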
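The paging idea behind `get_posts` can be tried out without any network access. The sketch below is illustrative only: `paginate`, `fake_fetch`, and `FAKE_POSTS` are hypothetical names, and the stand-in fetcher assumes that each page is returned newest first and that `before` excludes posts at or after the given timestamp.

```python
from typing import Callable, Dict, List

Post = Dict[str, int]


def paginate(fetch_page: Callable[[Dict[str, object]], List[Post]],
             total: int, page_size: int = 100) -> List[Post]:
    """Collect up to `total` posts by repeatedly asking for older pages."""
    params: Dict[str, object] = {"size": page_size}
    collected: List[Post] = []
    while len(collected) < total:
        page = fetch_page(dict(params))
        if not page:  # nothing older is available, so stop early
            break
        collected.extend(page)
        # The next request only asks for posts older than the oldest one seen.
        params["before"] = page[-1]["created_utc"]
    return collected[:total]


# Hypothetical stand-in for the API: 250 fake posts, newest first.
FAKE_POSTS = [{"id": i, "created_utc": 1000 - i} for i in range(250)]


def fake_fetch(params: Dict[str, object]) -> List[Post]:
    before = params.get("before", float("inf"))
    older = [p for p in FAKE_POSTS if p["created_utc"] < before]
    return older[: params["size"]]


posts = paginate(fake_fetch, total=400)
print(len(posts))  # 250: the loop stops once a page comes back empty
```

Stopping on an empty page matters for smaller subreddits, where fewer posts may exist than were requested.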


The DataFrame is then exported as a .csv file for processing in Part 2, which is a separate notebook.

In [257]:
import requests
import pandas as pd
import re

pd.options.display.max_columns = 100
pd.options.display.max_rows = 100

In [258]:
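def get_posts(num_posts_required, subreddit_name, url='https://api.pushshift.io/reddit/search/submission'):
    """Fetch posts from a subreddit in pages of 100, paging backwards with `before`."""
    num_iter = int(num_posts_required/100)
    params = {'size':100,
              'subreddit': subreddit_name}
    frames = []

    for i in range(num_iter):
        res = requests.get(url, params=params)
        res.raise_for_status()  # fail on HTTP errors instead of parsing an error response
        posts = res.json()['data']
        if not posts:  # no older posts left, avoid an IndexError below
            break
        frames.append(pd.DataFrame(posts))
        # Request only posts older than the last one in this page
        params['before'] = posts[-1]['created_utc']

    return pd.concat(frames, axis=0) if frames else pd.DataFrame()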

In [262]:
df_soccer = get_posts(2000,"Football")
df_betting = get_posts(2000,"SoccerBetting")

print(df_soccer.shape)
print(df_betting.shape)

df = pd.concat([df_soccer, df_betting], axis=0)
df.reset_index(drop=True, inplace=True)
print(df.shape)

(2000, 82)
(2000, 77)
(4000, 84)


In [263]:
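# Keep only the columns needed for modelling; reset_index returns a new
# DataFrame, which avoids modifying a slice of df in place
subreddit = df[['subreddit', 'selftext', 'title']].reset_index(drop=True)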
display(subreddit.shape)

subreddit.to_csv('subreddit.csv', index_label=False)

(4000, 3)