# Table of Content (Code Part 1 of 6)

1. [Problem Statement](#Problem_Statement)
2. [Libraries and Functions](#libraries_and_functions)
3. [Import Bouldering Data](#import-bouldering-data)
4. [Import Climbharder Data](#import-climbharder-data)


# Problem Statement

Mr Najib from MYActive currently has five empty sports hall across Singapore which he wants to turn them into either sport climbing gym or bouldering gym. However, he is not sure which type of sport have more interest in Singapore and he has come to our company for advice. We know that MYActive has a social platform which allows the public to post their idea/comment, we are going to utilize this information and generate a classification model to classify the posts whether they show more interest towards sport climbing or bouldering. 

In order to train our model, we are going to pull the information from Reddit. In Reddit, there are two major subreddits namely Bouldering and Climbharder. BOuldering subreddit is for people to post anything about bouldering while Climbharder subreddit contains posts mainly for sport climbing.

Performance of the model will be evaluated and the best model will be selected to be implented on MYActive social platform's posts.

## Libraries and Functions

In [1]:
import numpy as np
import pandas as pd
import requests

pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 100)

In [14]:
# Function to remove ay row with word count less than 20 in the title or selftext
def remove_unsatisfactory_row(df):
    wordlimit = 20
    df = df[(df['wordcount']>=wordlimit) | (df['titlewordcount']>=wordlimit)]
    return df

In [3]:
# Function to set the 'before' parameter based on the last collected post from Reddit
def get_params(df, subreddit):
    params = {
        'subreddit': subreddit,
        'size': 101,
        'before': df.loc[(df.index[-1]), 'created_utc']
    }
    return params

In [4]:
# Function to collect posts from Reddit
def get_posts(params, baseurl='https://api.pushshift.io/reddit/search/submission'):
    res = requests.get(baseurl, params)
    if res.status_code != 200:
        return f'Error! Status code: {res.status_code}'
    else:
        data = res.json()
        posts = data['data']
    return posts

In [5]:
# Function to update DataFrame with latest collected posts from Reddit
def update_df(base_df, subreddit):
    params = get_params(base_df, subreddit)
    # print(params)
    posts = get_posts(params)
    # print(len(posts))
    df2 = pd.DataFrame(posts)
    # print(df2.shape)
    updated = pd.concat([base_df, df2], axis=0, ignore_index=True, sort=True)
    return updated

## Import Bouldering Data

In [6]:
# Initialize the link with pushshift
url = 'https://api.pushshift.io/reddit/search/submission'

In [1]:
# Set initial parameter for Bouldering subreddit
params_bd_init = {
    'subreddit': 'bouldering',
    'size': 100,
}

In [8]:
# Check if any error code
res = requests.get(url, params_bd_init)
res.status_code

200

In [9]:
# Convert the posts collected into Json format

data = res.json()
bdposts = data['data']

In [10]:
len(bdposts)

100

In [11]:
# Convert data in Json format to DataFrame
bd = pd.DataFrame(bdposts)

In [12]:
# Read first 5 rows of the bd DataFrame
bd.head()

Unnamed: 0,all_awardings,allow_live_comments,author,author_flair_css_class,author_flair_richtext,author_flair_text,author_flair_type,author_fullname,author_is_blocked,author_patreon_flair,author_premium,awarders,can_mod_post,contest_mode,created_utc,domain,full_link,gildings,id,is_created_from_ads_ui,is_crosspostable,is_meta,is_original_content,is_reddit_media_domain,is_robot_indexable,is_self,is_video,link_flair_background_color,link_flair_css_class,link_flair_richtext,link_flair_template_id,link_flair_text,link_flair_text_color,link_flair_type,locked,media_only,no_follow,num_comments,num_crossposts,over_18,parent_whitelist_status,permalink,pinned,post_hint,preview,pwls,retrieved_on,score,selftext,send_replies,spoiler,stickied,subreddit,subreddit_id,subreddit_subscribers,subreddit_type,thumbnail,thumbnail_height,thumbnail_width,title,total_awards_received,treatment_tags,upvote_ratio,url,url_overridden_by_dest,whitelist_status,wls,media,media_embed,secure_media,secure_media_embed,author_flair_background_color,author_flair_text_color,removed_by_category,suggested_sort,author_flair_template_id,gallery_data,is_gallery,media_metadata
0,[],False,lamar0320,,[],,text,t2_b4vbtnim,False,False,False,[],False,False,1667616241,/r/bouldering/comments/ymi23i/got_my_first_ove...,https://www.reddit.com/r/bouldering/comments/y...,{},ymi23i,False,True,False,False,False,True,False,True,,redflair,[],58784d7c-9dfe-11e5-b10f-0e22511d9e11,Indoor,dark,text,False,False,True,0,0,False,all_ads,/r/bouldering/comments/ymi23i/got_my_first_ove...,False,hosted:video,"{'enabled': False, 'images': [{'id': 'Ecb8TZhG...",6,1667616251,1,,True,False,False,bouldering,t5_2rb1o,337100,public,https://a.thumbs.redditmedia.com/ygEOKINXrbPIy...,140.0,140.0,Got my first overhang climb last week!,0,[],1.0,https://v.redd.it/xtrj7geqp1y91,https://v.redd.it/xtrj7geqp1y91,all_ads,6,,,,,,,,,,,,
1,[],False,husky868,,[],,text,t2_144w2r,False,False,False,[],False,False,1667612579,youtube.com,https://www.reddit.com/r/bouldering/comments/y...,{},ymgssm,False,True,False,False,False,True,False,False,,,[],,,dark,text,False,False,False,0,0,False,all_ads,/r/bouldering/comments/ymgssm/some_high_qualit...,False,rich:video,"{'enabled': False, 'images': [{'id': 'Xp76OTjA...",6,1667612590,1,,True,False,False,bouldering,t5_2rb1o,337094,public,https://b.thumbs.redditmedia.com/J1qfdB5I2ampC...,105.0,140.0,Some high quality New England boulders from th...,0,[],1.0,https://youtube.com/watch?v=kSpu0tVAPcY&amp;fe...,https://youtube.com/watch?v=kSpu0tVAPcY&amp;fe...,all_ads,6,"{'oembed': {'author_name': 'Evan Decina', 'aut...","{'content': '&lt;iframe width=""356"" height=""20...","{'oembed': {'author_name': 'Evan Decina', 'aut...","{'content': '&lt;iframe width=""356"" height=""20...",,,,,,,,
2,[],False,[deleted],,,,,,False,,,[],False,False,1667612432,,https://www.reddit.com/r/bouldering/comments/y...,{},ymgqx3,False,False,False,False,False,False,False,False,,,[],,,dark,text,False,False,True,0,0,False,all_ads,/r/bouldering/comments/ymgqx3/some_high_qualit...,False,,,6,1667612443,1,[deleted],True,False,False,bouldering,t5_2rb1o,337094,public,default,105.0,140.0,Some high quality New England boulders from th...,0,[],1.0,,,all_ads,6,"{'oembed': {'author_name': 'Evan Decina', 'aut...","{'content': '&lt;iframe width=""356"" height=""20...","{'oembed': {'author_name': 'Evan Decina', 'aut...","{'content': '&lt;iframe width=""356"" height=""20...",,dark,deleted,,,,,
3,[],False,_sharleen,,[],,text,t2_ch0qlojn,False,False,False,[],False,False,1667608183,/r/bouldering/comments/ymf7o0/fun_project_comp...,https://www.reddit.com/r/bouldering/comments/y...,{},ymf7o0,False,True,False,False,False,True,False,True,,,[],,,dark,text,False,False,False,0,0,False,all_ads,/r/bouldering/comments/ymf7o0/fun_project_comp...,False,,,6,1667608193,1,,True,False,False,bouldering,t5_2rb1o,337082,public,default,,,Fun project completed 💯,0,[],1.0,https://v.redd.it/0p0lbyqr11y91,https://v.redd.it/0p0lbyqr11y91,all_ads,6,,,,,,,,,,,,
4,[],False,alaska_boulders,,[],,text,t2_4nlwwhx9,False,False,False,[],False,False,1667600628,/r/bouldering/comments/ymcbby/my_favorite_line...,https://www.reddit.com/r/bouldering/comments/y...,{},ymcbby,False,True,False,False,False,True,False,True,,,[],9e2afd4c-8011-11eb-9343-0e5a7fc82f21,Outdoor,dark,text,False,False,True,0,0,False,all_ads,/r/bouldering/comments/ymcbby/my_favorite_line...,False,hosted:video,"{'enabled': False, 'images': [{'id': 'lYbxdtHR...",6,1667600639,1,,True,False,False,bouldering,t5_2rb1o,337068,public,https://b.thumbs.redditmedia.com/xnO2m5RbcVMQ8...,140.0,140.0,My favorite line from an area I helped develop...,0,[],1.0,https://v.redd.it/21wljgf7f0y91,https://v.redd.it/21wljgf7f0y91,all_ads,6,,,,,,,,,,,,


In [15]:
# Loop the process of collecting posts until 5000 rows is reached
while len(bd) < 5000:
    bd = update_df(bd, 'bouldering')
    bd = bd[~bd['selftext'].isnull()]
    bd = bd.drop_duplicates(subset=['title', 'selftext'], keep='last')
    bd['splitstring'] = bd['selftext'].map(lambda x: x.split())
    bd['splittitle'] = bd['title'].map(lambda y: y.split())
    bd['wordcount'] = bd['selftext'].map(lambda a: len(a))
    bd['titlewordcount'] = bd['splittitle'].map(lambda b: len(b))
    bd = remove_unsatisfactory_row(bd)
    print (bd.shape)


(444, 84)
(459, 84)
(468, 84)
(482, 84)
(491, 84)
(503, 84)
(518, 84)
(529, 84)
(545, 84)
(564, 84)
(579, 84)
(594, 85)
(614, 85)
(633, 85)
(650, 85)
(662, 85)
(678, 85)
(699, 85)
(723, 85)
(741, 85)
(756, 85)
(769, 85)
(786, 85)
(806, 85)
(821, 85)
(834, 85)
(849, 85)
(871, 85)
(893, 85)
(911, 85)
(922, 85)
(938, 85)
(952, 85)
(975, 85)
(997, 85)
(1022, 85)
(1037, 85)
(1061, 85)
(1075, 85)
(1102, 85)
(1133, 85)
(1170, 85)
(1200, 85)
(1227, 85)
(1260, 85)
(1286, 85)
(1313, 85)
(1347, 85)
(1390, 85)
(1418, 85)
(1445, 86)
(1486, 86)
(1521, 86)
(1557, 86)
(1599, 86)
(1635, 86)
(1665, 86)
(1699, 86)
(1730, 86)
(1761, 86)
(1793, 86)
(1814, 87)
(1838, 87)
(1873, 87)
(1904, 87)
(1924, 87)
(1948, 87)
(1976, 87)
(2001, 88)
(2020, 88)
(2035, 88)
(2052, 88)
(2073, 89)
(2097, 92)
(2119, 92)
(2141, 92)
(2159, 92)
(2175, 92)
(2197, 92)
(2222, 92)
(2246, 92)
(2267, 92)
(2282, 92)
(2291, 92)
(2300, 92)
(2313, 92)
(2330, 92)
(2354, 92)
(2375, 92)
(2392, 92)
(2410, 92)
(2424, 92)
(2444, 92)
(2452, 92)
(

In [16]:
# Make sure we have at least 5000 rows
bd.shape

(5007, 100)

In [17]:
# Save the data into csv file
bd.to_csv('../data/bouldering.csv')

## Import Climbharder Data

In [18]:
# Set initial parameter for Climbharder subreddit
params_rc_init = {
    'subreddit': 'climbharder',
    'size': 100,
}

In [19]:
# Check if any error code
res = requests.get(url, params_rc_init)
res.status_code

200

In [20]:
# Convert data in Json format to DataFrame
data = res.json()
rcposts = data['data']
rc = pd.DataFrame(rcposts)

In [21]:
# Display first 5 rows of rc DataFrame
rc.head()

Unnamed: 0,all_awardings,allow_live_comments,author,author_flair_css_class,author_flair_richtext,author_flair_text,author_flair_type,author_fullname,author_is_blocked,author_patreon_flair,author_premium,awarders,can_mod_post,contest_mode,created_utc,domain,full_link,gildings,id,is_created_from_ads_ui,is_crosspostable,is_meta,is_original_content,is_reddit_media_domain,is_robot_indexable,is_self,is_video,link_flair_background_color,link_flair_richtext,link_flair_text_color,link_flair_type,locked,media_only,no_follow,num_comments,num_crossposts,over_18,parent_whitelist_status,permalink,pinned,pwls,removed_by_category,retrieved_on,score,selftext,send_replies,spoiler,stickied,subreddit,subreddit_id,subreddit_subscribers,subreddit_type,thumbnail,title,total_awards_received,treatment_tags,upvote_ratio,url,whitelist_status,wls,post_hint,preview,thumbnail_height,thumbnail_width,url_overridden_by_dest,author_flair_background_color,author_flair_template_id,author_flair_text_color,crosspost_parent,crosspost_parent_list,media,media_embed,secure_media,secure_media_embed,gallery_data,is_gallery,media_metadata,suggested_sort
0,[],False,slutbuttfucker,,[],,text,t2_q0clo5nz,False,False,False,[],False,False,1667600650,self.climbharder,https://www.reddit.com/r/climbharder/comments/...,{},ymcbo2,False,False,False,False,False,False,True,False,,[],dark,text,False,False,True,1,0,False,all_ads,/r/climbharder/comments/ymcbo2/experiencing_so...,False,6,moderator,1667600661,1,[removed],True,False,False,climbharder,t5_2s5er,156008,public,self,Experiencing some tightness/soreness for a few...,0,[],1.0,https://www.reddit.com/r/climbharder/comments/...,all_ads,6,,,,,,,,,,,,,,,,,,
1,[],False,Rotem_,,[],,text,t2_5x7eezu0,False,False,False,[],False,False,1667599784,self.climbharder,https://www.reddit.com/r/climbharder/comments/...,{},ymbzfv,False,True,False,False,False,True,True,False,,[],dark,text,False,False,True,0,0,False,all_ads,/r/climbharder/comments/ymbzfv/gumby_wanting_t...,False,6,,1667599795,1,Hey!\nIn the past 6-7 month I’ve been climbing...,True,False,False,climbharder,t5_2s5er,156008,public,self,Gumby wanting to send in a trip,0,[],1.0,https://www.reddit.com/r/climbharder/comments/...,all_ads,6,,,,,,,,,,,,,,,,,,
2,[],False,cbclimbfeedbackacct,,[],,text,t2_la1p39yq,False,False,False,[],False,False,1667586622,/r/climbharder/comments/ym6ip4/why_i_did_i_do_...,https://www.reddit.com/r/climbharder/comments/...,{},ym6ip4,False,True,False,False,False,True,False,True,,[],dark,text,False,False,True,0,0,False,all_ads,/r/climbharder/comments/ym6ip4/why_i_did_i_do_...,False,6,,1667586632,1,,True,False,False,climbharder,t5_2s5er,156005,public,default,Why I did I do the last crux move in isolation...,0,[],1.0,https://v.redd.it/ptnmohrn9zx91,all_ads,6,hosted:video,"{'enabled': False, 'images': [{'id': 'VJq0rIbW...",140.0,140.0,https://v.redd.it/ptnmohrn9zx91,,,,,,,,,,,,,
3,[],False,golf_ST,,"[{'e': 'text', 't': 'V10ish'}]",V10ish,richtext,t2_68oug6cv,False,False,False,[],False,False,1667586198,self.weightroom,https://www.reddit.com/r/climbharder/comments/...,{},ym6c8v,False,True,False,False,False,True,False,False,,[],dark,text,False,False,True,0,0,False,all_ads,/r/climbharder/comments/ym6c8v/from_0_pull_ups...,False,6,,1667586209,1,,False,False,False,climbharder,t5_2s5er,156004,public,default,From 0 Pull Ups to One Arm Chin Up in 1000 Day...,0,[],1.0,/r/weightroom/comments/thqzer/from_0_pull_ups_...,all_ads,6,link,"{'enabled': False, 'images': [{'id': '6mF0fWEQ...",,,/r/weightroom/comments/thqzer/from_0_pull_ups_...,transparent,9de5d878-f71d-11ec-a271-96674fc1f5b9,dark,t3_thqzer,[{'all_awardings': [{'award_sub_type': 'GLOBAL...,,,,,,,,
4,[],False,Little_Beat_8862,,[],,text,t2_apug4i5k,False,False,False,[],False,False,1667576510,self.climbharder,https://www.reddit.com/r/climbharder/comments/...,{},ym2aie,False,True,False,False,False,True,True,False,,[],dark,text,False,False,True,0,0,False,all_ads,/r/climbharder/comments/ym2aie/v10_climbers_wh...,False,6,,1667576520,1,Part-time lurker and late 30’s climbing coach ...,True,False,False,climbharder,t5_2s5er,156006,public,self,"V10+ climbers, what’s my low hanging fruit?",0,[],1.0,https://www.reddit.com/r/climbharder/comments/...,all_ads,6,,,,,,,,,,,,,,,,,,


In [22]:
# Loop the process of collecting posts until 5000 rows is reached
while len(rc) < 5000:
    rc = update_df(rc, 'climbharder')
    rc = rc[~rc['selftext'].isnull()]
    rc = rc.drop_duplicates(subset=['title', 'selftext'], keep='last')
    rc['splitstring'] = rc['selftext'].map(lambda x: x.split())
    rc['splittitle'] = rc['title'].map(lambda y: y.split())
    rc['wordcount'] = rc['selftext'].map(lambda a: len(a))
    rc['titlewordcount'] = rc['splittitle'].map(lambda b: len(b))
    rc = remove_unsatisfactory_row(rc)
    print (rc.shape)

(55, 82)
(81, 82)
(104, 82)
(129, 83)
(159, 83)
(190, 83)
(271, 85)
(346, 85)
(427, 86)
(513, 86)
(602, 86)
(692, 86)
(778, 88)
(855, 88)
(940, 88)
(1022, 88)
(1110, 89)
(1186, 89)
(1277, 89)
(1363, 89)
(1446, 89)
(1535, 89)
(1622, 89)
(1675, 89)
(1705, 89)
(1740, 89)
(1808, 89)
(1891, 89)
(1959, 89)
(2040, 89)
(2126, 89)
(2212, 89)
(2302, 89)
(2382, 89)
(2460, 89)
(2548, 89)
(2634, 89)
(2719, 89)
(2785, 89)
(2850, 89)
(2933, 89)
(2959, 89)
(3037, 89)
(3115, 89)
(3195, 89)
(3259, 89)
(3322, 89)
(3404, 89)
(3484, 89)
(3554, 89)
(3631, 89)
(3703, 89)
(3785, 89)
(3863, 89)
(3951, 89)
(4036, 89)
(4129, 89)
(4220, 89)
(4305, 89)
(4399, 89)
(4488, 89)
(4576, 89)
(4660, 89)
(4748, 89)
(4836, 89)
(4925, 89)
(5004, 89)


In [24]:
# Export rc DataFrame to csv file
rc.to_csv('../data/rockclimbing.csv')