# Project 3: Web APIs & NLP
> By: Matthew Lio
---


Project notebook organisation:

01. Data Scraping (current notebook)
02. EDA and Preprocessing
03. Model Tuning and Insights

# 01. Data Scraping
---

## Introduction

In today's media-driven world, it is very easy to share information. However, the lack of regulation means that information found online often has no factual backing from experts. Fitness and health is one of the topics that suffers most from misinformation online. There is a huge array of ways to keep fit, and with it come many different sub-topics under fitness. The internet is also full of fitness fads, the latest diet crazes, celebrity-endorsed fitness products and wellness trends. Which advice can you trust? Even with sports science degrees available at universities, the sheer variety of fitness activities and training methods means that fitness experts often have to pick a specific sport or type of training to specialise in. Methods, techniques and diets that are effective for one sport may be less effective for another. Individuals seeking advice online often find vast and varied information, and without factual backing from experts in the relevant field, that advice could be ineffective or even dangerous when applied.

### Contents:
- [Problem Statement](#Problem-Statement)
- [Executive Summary](#Executive-Summary)
- [Library Imports](#Library-Imports)
- [Scrape Function](#Scrape-Function)
- [Scraping r/gainit](#Scraping-r/gainit)
- [Scraping r/bodyweightfitness](#Scraping-r/bodyweightfitness)
- [Export data to CSV](#Export-data-to-CSV)

## Problem Statement

Misinformation is evident in the world of fitness. A popular fitness and health forum that welcomes all kinds of topics on sports, fitness and health has realised the potential negative impact on individuals who follow the wrong advice. Beginners, lacking the knowledge to know where to ask, often post in the wrong threads and direct their questions to the wrong experts, and end up misinformed.

The fitness forum wants to focus on 2 popular topics: weightlifting and calisthenics (bodyweight fitness). Although these 2 sports share some similarities, they are very different in terms of training methods and techniques. The forum wants to hire a data scientist with knowledge of the fitness industry to build a classification model that sorts beginners' questions into the proper threads, so that the respondents are the right experts in each field.

## Executive Summary

Our problem is to classify fitness and health related questions or posts into the correct subreddit, so that the right experts can give proper advice and answers. If a post is incorrectly classified, the original poster may receive wrong information, because the wrong experts would be answering the question. Wrong information includes inefficient training methods or unsuitable diet plans for the intended sport, which in turn could lead to slower improvement or growth, and even injury. Wrongly classified posts can therefore have a real negative impact, and we should reduce them as much as possible.

We selected 3 different estimators and 2 different transformers, for a total of 6 model combinations. Different parameters are passed into each model using Pipeline and GridSearchCV, which cross-validates every parameter combination and returns the best score together with the parameters that produced it.
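As a rough illustration of this setup, the sketch below wires one transformer/estimator pair into a `Pipeline` and searches a small grid with `GridSearchCV`. The toy corpus, the step names (`tvec`, `lr`) and the grid values are assumptions made for illustration; they are not the data or parameters tuned in notebook 03.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy corpus, made up for illustration: 1 = weightlifting, 0 = bodyweight
texts = [
    "how much should I bench press to gain mass",
    "barbell squat program for bulking",
    "deadlift form check and calorie surplus",
    "best dumbbell routine to gain weight",
    "progression from push ups to handstand push ups",
    "pull up bar routine at home without weights",
    "how to learn the planche and front lever",
    "calisthenics skills for beginners at the park",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# One transformer followed by one estimator
pipe = Pipeline([
    ("tvec", TfidfVectorizer()),
    ("lr", LogisticRegression(max_iter=1000)),
])

# Keys follow the "<step name>__<parameter>" convention (hypothetical values)
param_grid = {
    "tvec__max_features": [50, 100],
    "tvec__ngram_range": [(1, 1), (1, 2)],
    "lr__C": [0.1, 1.0],
}

# cv=2 only because the toy corpus is tiny
gs = GridSearchCV(pipe, param_grid, cv=2)
gs.fit(texts, labels)

best_params = gs.best_params_  # best combination found in the grid
best_score = gs.best_score_    # mean cross-validated accuracy for it
```

Swapping `TfidfVectorizer` for another transformer, or `LogisticRegression` for another estimator, gives the other combinations in the same pattern.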

To summarise the outcome of model tuning, we managed to tune a model to our accuracy expectations, although there is room for improvement. Our model, Logistic Regression with a TF-IDF Vectorizer, scored reasonably well, with test accuracy, PPV and NPV all above 0.8. This translates to fewer than 1 wrongly classified post in every 5. We consider this acceptable for our classification problem: wrongly classified posts, although they may have a negative impact on a poster who follows the wrong advice, are not immediately dangerous to the poster or the general public. In addition, weightlifting and bodyweight fitness share much of their vocabulary, and experts in both may give similar advice, so we expected some posts to overlap and to plausibly belong to either subreddit. A small number of "wrongly" classified posts might therefore not be so wrong after all.
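For reference, accuracy, PPV (positive predictive value) and NPV (negative predictive value) can all be derived from the four cells of a confusion matrix. The sketch below uses hypothetical counts chosen only to illustrate the arithmetic; they are not the results of this project.

```python
def classification_scores(tp, fp, tn, fn):
    """Return accuracy, PPV and NPV from confusion-matrix counts."""
    total = tp + fp + tn + fn
    return {
        # share of all posts that were classified correctly
        "accuracy": (tp + tn) / total,
        # of the posts predicted positive, the share that truly are positive
        "ppv": tp / (tp + fp),
        # of the posts predicted negative, the share that truly are negative
        "npv": tn / (tn + fn),
    }


# Hypothetical counts for a 400-post test set
scores = classification_scores(tp=170, fp=30, tn=165, fn=35)
# accuracy = 335 / 400 = 0.8375, ppv = 170 / 200 = 0.85, npv = 165 / 200 = 0.825
```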

We also considered ways to improve our production model, including fine-tuning the current model to reduce overfitting, or tuning more complex models that could potentially be more accurate than Logistic Regression. More parameters could be explored, such as increasing the maximum number of features or trying more n-gram ranges. Other classification models not tested in this project could also be explored.

## Library Imports

In [1]:
import pandas as pd
import numpy as np
pd.set_option('display.max_columns', None)

import requests

## Scrape Function

In [2]:
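def scrape(subreddit, utc):
    """Page back through a subreddit's submissions posted before `utc`.

    Stops once at least 1,500 posts are collected or no older posts remain,
    and returns them as a DataFrame with one row per submission.
    """
    # Pushshift submission search endpoint
    url = 'https://api.pushshift.io/reddit/search/submission'

    # accumulate raw post dicts here and convert to a DataFrame after the loop
    posts = []

    # only request posts created before this epoch timestamp
    oldest_utc = utc

    # each request asks for up to 100 posts; keep paging back in time
    while len(posts) < 1500:
        params = {
            'subreddit': subreddit,
            'size': 100,
            'before': oldest_utc
        }
        res = requests.get(url, params=params, timeout=30)
        # fail loudly on HTTP errors instead of parsing an error page as JSON
        res.raise_for_status()
        posts_new = res.json()['data']

        # stop when no older posts are left; otherwise the loop never ends
        if not posts_new:
            break

        # append this page to the posts collected so far
        posts.extend(posts_new)

        # move the cursor to just before the oldest post collected so far
        oldest_utc = posts[-1]['created_utc'] - 1

    df = pd.DataFrame(posts)
    return df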
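The paging logic in `scrape` can be tried offline with a stand-in for the API. In the minimal sketch below, `ARCHIVE`, `fake_search` and `paginate` are hypothetical names introduced for illustration. The stand-in returns posts newest first and treats `before` as exclusive, which is an assumption about how the real endpoint behaves.

```python
# Fake archive of 250 posts, one every 60 seconds (hypothetical data)
ARCHIVE = [{"id": i, "created_utc": 1_000_000 + 60 * i} for i in range(250)]


def fake_search(before, size):
    """Stand-in for the search endpoint: newest-first posts older than `before`."""
    older = [p for p in ARCHIVE if p["created_utc"] < before]
    older.sort(key=lambda p: p["created_utc"], reverse=True)
    return older[:size]


def paginate(fetch, start_utc, target, size=100):
    """Collect posts page by page, moving the `before` cursor back each time."""
    posts = []
    before = start_utc
    while len(posts) < target:
        batch = fetch(before, size)
        if not batch:
            # archive exhausted before reaching the target
            break
        posts.extend(batch)
        # the oldest post of this page becomes the next (exclusive) cursor
        before = batch[-1]["created_utc"]
    return posts


collected = paginate(fake_search, start_utc=2_000_000, target=1500)
# Only 250 posts exist, so the loop ends on the first empty page
```

Because the cursor only ever moves backwards, no post is fetched twice, and the empty-page check guarantees termination when the subreddit has fewer posts than the target.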

## Scraping r/gainit

In [3]:
%%time
wl_df = scrape('gainit', 1642321000)
wl_df

Wall time: 1min 13s


Unnamed: 0,all_awardings,allow_live_comments,author,author_flair_css_class,author_flair_richtext,author_flair_text,author_flair_type,author_fullname,author_is_blocked,author_patreon_flair,author_premium,awarders,can_mod_post,contest_mode,created_utc,domain,full_link,gildings,id,is_created_from_ads_ui,is_crosspostable,is_meta,is_original_content,is_reddit_media_domain,is_robot_indexable,is_self,is_video,link_flair_background_color,link_flair_richtext,link_flair_text_color,link_flair_type,locked,media_only,no_follow,num_comments,num_crossposts,over_18,parent_whitelist_status,permalink,pinned,pwls,removed_by_category,retrieved_on,score,selftext,send_replies,spoiler,stickied,subreddit,subreddit_id,subreddit_subscribers,subreddit_type,thumbnail,title,total_awards_received,treatment_tags,upvote_ratio,url,whitelist_status,wls,post_hint,preview,poll_data,author_flair_template_id,author_flair_text_color,author_flair_background_color,author_cakeday,banned_by,crosspost_parent,crosspost_parent_list,url_overridden_by_dest,edited,thumbnail_height,thumbnail_width
0,[],False,Jax1030,,[],,text,t2_859ozklo,False,False,False,[],False,False,1642320187,self.gainit,https://www.reddit.com/r/gainit/comments/s570p...,{},s570ps,False,False,False,False,False,False,True,False,,[],dark,text,False,False,True,1,0,False,all_ads,/r/gainit/comments/s570ps/always_shit_myself_w...,False,6,moderator,1642320197,1,[removed],True,False,False,gainit,t5_2s9bg,329427,public,self,Always shit myself when I do deadlifts?,0,[],1.0,https://www.reddit.com/r/gainit/comments/s570p...,all_ads,6,,,,,,,,,,,,,,
1,[],False,Tylenol4ThePain,,[],,text,t2_d93bpntm,False,False,False,[],False,False,1642307971,self.gainit,https://www.reddit.com/r/gainit/comments/s53nv...,{},s53nvf,False,True,False,False,False,True,True,False,,[],dark,text,False,False,True,1,0,False,all_ads,/r/gainit/comments/s53nvf/can_i_still_do_cardi...,False,6,,1642307982,1,I want to continue running but I feel like tha...,True,False,False,gainit,t5_2s9bg,329418,public,self,Can I still do cardio while trying to gain wei...,0,[],1.0,https://www.reddit.com/r/gainit/comments/s53nv...,all_ads,6,,,,,,,,,,,,,,
2,[],False,imjustadudeguy,,[],,text,t2_1bkxsocn,False,False,False,[],False,False,1642304686,self.gainit,https://www.reddit.com/r/gainit/comments/s52mn...,{},s52mnc,False,True,False,False,False,True,True,False,,[],dark,text,False,False,True,1,0,False,all_ads,/r/gainit/comments/s52mnc/60_118lbs_worked_out...,False,6,,1642304696,1,"The day after was nice, a little proof of my h...",True,False,False,gainit,t5_2s9bg,329416,public,self,"6'0, 118lbs. Worked out for the first time wit...",0,[],1.0,https://www.reddit.com/r/gainit/comments/s52mn...,all_ads,6,,,,,,,,,,,,,,
3,[],False,DrTiki43,,[],,text,t2_656afnui,False,False,False,[],False,False,1642303898,self.gainit,https://www.reddit.com/r/gainit/comments/s52di...,{},s52dix,False,True,False,False,False,True,True,False,,[],dark,text,False,False,True,1,0,False,all_ads,/r/gainit/comments/s52dix/need_help_feeling_ch...,False,6,,1642303908,1,\nHi! I’m a really skinny dude trying to start...,True,False,False,gainit,t5_2s9bg,329417,public,self,Need help feeling chest,0,[],1.0,https://www.reddit.com/r/gainit/comments/s52di...,all_ads,6,,,,,,,,,,,,,,
4,[],False,BobbyJohnson31,,[],,text,t2_kcr6h,False,False,False,[],False,False,1642301774,self.gainit,https://www.reddit.com/r/gainit/comments/s51oq...,{},s51oqg,False,False,False,False,False,False,True,False,,[],dark,text,False,False,True,0,0,False,all_ads,/r/gainit/comments/s51oqg/lost_10_pounds_due_t...,False,6,moderator,1642301785,1,[removed],True,False,False,gainit,t5_2s9bg,329416,public,self,Lost 10 Pounds due to covid,0,[],1.0,https://www.reddit.com/r/gainit/comments/s51oq...,all_ads,6,,,,,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1495,[],False,Familiar_Editor1240,,[],,text,t2_af0u659i,False,False,False,[],False,False,1638392233,self.gainit,https://www.reddit.com/r/gainit/comments/r6ozh...,{},r6ozhe,False,False,False,False,False,False,True,False,,[],dark,text,False,False,True,2,0,False,all_ads,/r/gainit/comments/r6ozhe/pawg/,False,6,moderator,1638392245,1,[removed],True,False,False,gainit,t5_2s9bg,326030,public,self,PAWG,0,[],1.0,https://www.reddit.com/r/gainit/comments/r6ozh...,all_ads,6,,,,,,,,,,,,,,
1496,[],False,namejefffiy,,[],,text,t2_66fiqbg5,False,False,False,[],False,False,1638389956,self.gainit,https://www.reddit.com/r/gainit/comments/r6o3b...,{},r6o3bs,False,True,False,False,False,True,True,False,,[],dark,text,False,False,True,14,0,False,all_ads,/r/gainit/comments/r6o3bs/when_will_i_see_size...,False,6,,1638389967,1,I’ve been working out 6x a week for 3 months. ...,True,False,False,gainit,t5_2s9bg,326027,public,self,When will I see size improvements,0,[],1.0,https://www.reddit.com/r/gainit/comments/r6o3b...,all_ads,6,,,,,,,,,,,,,,
1497,[],False,Carpenter4875,,[],,text,t2_pkyvl,False,False,False,[],False,False,1638389613,self.gainit,https://www.reddit.com/r/gainit/comments/r6nym...,{},r6nymh,False,True,False,False,False,True,True,False,,[],dark,text,False,False,True,13,0,False,all_ads,/r/gainit/comments/r6nymh/really_weak_after_a_...,False,6,,1638389624,1,Was getting to some respectable numbers before...,True,False,False,gainit,t5_2s9bg,326028,public,self,Really weak after a year-long break. How do yo...,0,[],1.0,https://www.reddit.com/r/gainit/comments/r6nym...,all_ads,6,,,,,,,,,,,,,,
1498,[],False,DaltionNeo,,[],,text,t2_getprr9x,False,False,False,[],False,False,1638387289,self.gainit,https://www.reddit.com/r/gainit/comments/r6n1y...,{},r6n1y0,False,False,False,False,False,False,True,False,,[],dark,text,False,False,True,2,0,False,all_ads,/r/gainit/comments/r6n1y0/does_boxing_destroy_...,False,6,moderator,1638387300,1,[removed],True,False,False,gainit,t5_2s9bg,326024,public,self,Does Boxing destroy upper body gains?,0,[],1.0,https://www.reddit.com/r/gainit/comments/r6n1y...,all_ads,6,,,,,,,,,,,,,,
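The `utc` argument passed to `scrape` and the `created_utc` column in the output above are Unix epoch timestamps in seconds. A small helper (the name `utc_to_iso` is hypothetical) makes them readable when checking the date range that was scraped:

```python
from datetime import datetime, timezone


def utc_to_iso(epoch_seconds):
    """Convert a Unix epoch timestamp in seconds to an ISO 8601 UTC string."""
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc).isoformat()


# Starting point used for both subreddits
start = utc_to_iso(1642321000)  # '2022-01-16T08:16:40+00:00'
```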


## Scraping r/bodyweightfitness

In [4]:
%%time
bw_df = scrape('bodyweightfitness', 1642321000)
bw_df

Wall time: 1min 17s


Unnamed: 0,all_awardings,allow_live_comments,author,author_flair_css_class,author_flair_richtext,author_flair_text,author_flair_type,author_fullname,author_is_blocked,author_patreon_flair,author_premium,awarders,can_mod_post,contest_mode,created_utc,domain,full_link,gildings,id,is_created_from_ads_ui,is_crosspostable,is_meta,is_original_content,is_reddit_media_domain,is_robot_indexable,is_self,is_video,link_flair_background_color,link_flair_richtext,link_flair_text_color,link_flair_type,locked,media_only,no_follow,num_comments,num_crossposts,over_18,parent_whitelist_status,permalink,pinned,pwls,retrieved_on,score,selftext,send_replies,spoiler,stickied,subreddit,subreddit_id,subreddit_subscribers,subreddit_type,thumbnail,title,total_awards_received,treatment_tags,upvote_ratio,url,whitelist_status,wls,post_hint,preview,removed_by_category,suggested_sort,crosspost_parent,crosspost_parent_list,url_overridden_by_dest,author_flair_template_id,author_flair_text_color,author_cakeday,author_flair_background_color,banned_by,thumbnail_height,thumbnail_width
0,[],False,Forward-Pineapple849,,[],,text,t2_cufwg8u4,False,False,False,[],False,False,1642319742,self.bodyweightfitness,https://www.reddit.com/r/bodyweightfitness/com...,{},s56wij,False,True,False,False,False,True,True,False,,[],dark,text,False,False,True,0,0,False,all_ads,/r/bodyweightfitness/comments/s56wij/what_size...,False,6,1642319752,1,I’m looking at buying some weights and was loo...,True,False,False,bodyweightfitness,t5_2tf0a,2350576,public,self,What size weights should I start off with? Female,0,[],1.0,https://www.reddit.com/r/bodyweightfitness/com...,all_ads,6,,,,,,,,,,,,,,
1,[],False,figtarr,,[],,text,t2_c8fbq,False,False,False,[],False,False,1642314809,self.bodyweightfitness,https://www.reddit.com/r/bodyweightfitness/com...,{},s55njz,False,False,False,False,False,False,True,False,,[],dark,text,False,False,True,1,0,False,all_ads,/r/bodyweightfitness/comments/s55njz/do_you_pr...,False,6,1642314819,1,[removed],True,False,False,bodyweightfitness,t5_2tf0a,2350532,public,self,Do you protract your shoulder blades at the bo...,0,[],1.0,https://www.reddit.com/r/bodyweightfitness/com...,all_ads,6,self,"{'enabled': False, 'images': [{'id': 'TFMtWn6o...",moderator,,,,,,,,,,,
2,[],False,figtarr,,[],,text,t2_c8fbq,False,False,False,[],False,False,1642314679,self.bodyweightfitness,https://www.reddit.com/r/bodyweightfitness/com...,{},s55m95,False,False,False,False,False,False,True,False,,[],dark,text,False,False,True,1,0,False,all_ads,/r/bodyweightfitness/comments/s55m95/do_you_pr...,False,6,1642314689,1,[removed],True,False,False,bodyweightfitness,t5_2tf0a,2350530,public,self,Do you protract your shoulder blades at the bo...,0,[],1.0,https://www.reddit.com/r/bodyweightfitness/com...,all_ads,6,,,moderator,,,,,,,,,,,
3,[],False,virtualenthusiast,,[],,text,t2_10zid3,False,False,False,[],False,False,1642309985,self.bodyweightfitness,https://www.reddit.com/r/bodyweightfitness/com...,{},s549mg,False,False,False,False,False,False,True,False,,[],dark,text,False,False,True,1,0,False,all_ads,/r/bodyweightfitness/comments/s549mg/does_cycl...,False,6,1642309996,1,[removed],True,False,False,bodyweightfitness,t5_2tf0a,2350486,public,self,Does cycling cause muscle loss?,0,[],1.0,https://www.reddit.com/r/bodyweightfitness/com...,all_ads,6,,,moderator,,,,,,,,,,,
4,[],False,Sir-Pinball_Wizard,,[],,text,t2_3j4gfzt0,False,False,False,[],False,False,1642307721,self.bodyweightfitness,https://www.reddit.com/r/bodyweightfitness/com...,{},s53lb1,False,False,False,False,False,False,True,False,,[],dark,text,False,False,True,1,0,False,all_ads,/r/bodyweightfitness/comments/s53lb1/would_thi...,False,6,1642307731,1,[removed],True,False,False,bodyweightfitness,t5_2tf0a,2350461,public,self,Would this routine target all my muscles?,0,[],1.0,https://www.reddit.com/r/bodyweightfitness/com...,all_ads,6,,,moderator,,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1495,[],False,gymfoc,,[],,text,t2_fiau7ltf,False,False,False,[],False,False,1639668226,self.bodyweightfitness,https://www.reddit.com/r/bodyweightfitness/com...,{},rhsyzq,False,True,False,False,False,True,True,False,,[],dark,text,False,False,True,0,0,False,all_ads,/r/bodyweightfitness/comments/rhsyzq/protein_j...,False,6,1639668237,1,[Protein](https://gym-foc.blogspot.com/2021/1...,True,False,False,bodyweightfitness,t5_2tf0a,2315858,public,self,Protein Just Aides In Muscle Advancement: Fant...,0,[],1.0,https://www.reddit.com/r/bodyweightfitness/com...,all_ads,6,self,"{'enabled': False, 'images': [{'id': 'MKSznYwp...",,,,,,,,,,,,
1496,[],False,gymfoc,,[],,text,t2_fiau7ltf,False,False,False,[],False,False,1639667964,self.bodyweightfitness,https://www.reddit.com/r/bodyweightfitness/com...,{},rhsvsa,False,True,False,False,False,True,True,False,,[],dark,text,False,False,False,0,0,False,all_ads,/r/bodyweightfitness/comments/rhsvsa/what_happ...,False,6,1639667975,1,**How eggs are useful.**\n\nAn egg is 13% pro...,True,False,False,bodyweightfitness,t5_2tf0a,2315855,public,self,What happens if you eat eggs every day?,0,[],1.0,https://www.reddit.com/r/bodyweightfitness/com...,all_ads,6,self,"{'enabled': False, 'images': [{'id': '0e6uuoxu...",,,,,,,,,,,,
1497,[],False,gymfoc,,[],,text,t2_fiau7ltf,False,False,False,[],False,False,1639667622,self.bodyweightfitness,https://www.reddit.com/r/bodyweightfitness/com...,{},rhsrku,False,True,False,False,False,True,True,False,,[],dark,text,False,False,True,0,0,False,all_ads,/r/bodyweightfitness/comments/rhsrku/the_compl...,False,6,1639667632,1,If you don’t recognize what to do within the f...,True,False,False,bodyweightfitness,t5_2tf0a,2315852,public,self,The Complete gym Workout Guidance For Beginners,0,[],1.0,https://www.reddit.com/r/bodyweightfitness/com...,all_ads,6,self,"{'enabled': False, 'images': [{'id': 'ERfuMSuN...",,,,,,,,,,,,
1498,[],False,Mcerioni86,,[],,text,t2_244fn7sk,False,False,False,[],False,False,1639665653,self.bodyweightfitness,https://www.reddit.com/r/bodyweightfitness/com...,{},rhs338,False,True,False,False,False,True,True,False,,[],dark,text,False,False,True,0,0,False,all_ads,/r/bodyweightfitness/comments/rhs338/progress_...,False,6,1639665664,1,Hi!\n\nI need help to get my current program t...,True,False,False,bodyweightfitness,t5_2tf0a,2315830,public,self,Progress from my current routine and get to th...,0,[],1.0,https://www.reddit.com/r/bodyweightfitness/com...,all_ads,6,,,,,,,,,,,,,,


## Export data to CSV

In [5]:
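# save the raw scrapes for the EDA and preprocessing notebook;
# the DataFrame index is written as an unnamed first column
wl_df.to_csv('../data/weightlifting.csv')
bw_df.to_csv('../data/bodyweight.csv')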
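Since `to_csv` writes the index by default, reading these files back with a plain `read_csv` yields an extra `Unnamed: 0` column. The sketch below shows this round trip on a tiny in-memory frame (the two rows are illustrative, and `io.StringIO` stands in for the files on disk) and how `index_col=0` restores the original shape.

```python
import io

import pandas as pd

# Tiny stand-in for a scraped DataFrame
df = pd.DataFrame({
    "title": ["Need help feeling chest", "Does cycling cause muscle loss?"],
    "subreddit": ["gainit", "bodyweightfitness"],
})

# Write with the default settings, so the index is included
buffer = io.StringIO()
df.to_csv(buffer)

# Plain read: the saved index comes back as a column named 'Unnamed: 0'
buffer.seek(0)
with_unnamed = pd.read_csv(buffer)

# Treat the first column as the index to recover the original columns
buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)
```

Passing `index=False` to `to_csv` would avoid the extra column altogether, at the cost of changing the file layout that later notebooks read.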