<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Project 3 - Web APIs & NLP (Part 1)

## Problem Statement

As part of the marketing team for ClassPass, we are tasked to increase awareness for 'Pilates'. Hence, the stakeholders have instructed our team to conduct a NLP research on 'Pilates' and another similar exercise, 'Yoga' using reddit. As we will be using Search Engines to publish our advertisements, this research aims to come out with a classification model for a more effective advertising campaigns for 'Pilates'. Beside the classification model, this reserach will also address the following:

- Provide a general sentiment analysis on 'Yoga' and 'Pilates'.

### Contents:
- Background
- Outside Research
- Data Collection

### Background

Based on the 'ClassPass Comeback Report: Fitness and Beauty Trends 2021' we found out that Yoga and Pilates are both in the top 10 classes being booked. Yoga is ranked second while Pilates is ranked fourth. One can read more on it from ClassPass blog ([*source*](https://classpass.com/blog/2021-comeback-fitness-trends/)).

We were also able to understand that 'Yoga is growing a loyal following.' Newcomers are most likely to book a yoga class first. Livestream yoga is the only digital class type in the top 10 most booked classes and appointments. In-studio yoga has climbed to the second most popular class type since studios have reopened. It’s likely that the at-home yoga spike produced some new students, and that both new and seasoned yogis are excited to practice with a community again.

Pilates is ranked fourth as people expressed that it can be hard to fit a Pilates reformer (an equipment used when practicing Pilates) into their house. Hence 2/3 of our members are saying that access to equipment is one of the top reason for heading back to classes for Pilates.

### Outside Research

Yoga and Pilates are often compared as they have many similarities but the two are inherently very different forms of exercise. They are both low-impact exercise and are able to offer connection to the body and relief stress while developing flexibility, strength, control and endurance. One can understand more from the article from Harpers Bazaar ([*source*](https://www.harpersbazaar.com/uk/beauty/fitness-wellbeing/a25626354/yoga-vs-pilates/)).

But the biggest difference between the two is the emphasis on the spiritual side in Yoga classes. Yoga is an integrated health management system using breath, movement and meditation to unite mind, body and spirit. Also there are different types of Yoga like 'Hatha Yoga', 'Power Yoga', 'Hot Yoga' and many more.

While Pilates is a physical system that uses very specific targeted exercises to improve strength, flexibility and posture with particular focus on the core. Pilates are mainly split into 'Mat Pilates' or 'Classical Pilates'.

Pilates have been on an upwards trend in terms of popularity and accessibility. Pilates was invented by Joseph Pilates more than a century ago. Due to covid, gyms were closed and we were all confined to our homes. Thanks to social media being the driving force behind the recent resurgence of the low-impact workout, awareness for Pilates was increased. As we had to look for new ways to work out that involved minimal equipment that are generally accessible and easy to follow with just a mat and a resistance band while watching online videos. One can understand more from the article from Body and Soul ([*source*](https://www.bodyandsoul.com.au/fitness/training-tips/why-is-pilates-everywhere-now-and-what-did-covid-have-to-do-with-it/news-story/49ece150142fad14048bdb6608b44581)).

In [2]:
#Importing requests, pandas, numpy and time

import requests
import pandas as pd
import numpy as np
import time

### Data Collection

In [4]:
#Define function to loop the required number of posts

def data_collection_post(before='',
                        after='', 
                        subreddit='yoga', #default is yoga
                        no_of_posts=100,
                        add_columns = []): #Additional step to allow user to input more columns if needed

    url = "https://api.pushshift.io/reddit/search/submission/" # target web page
    
    loop = 1          # initialize with loop 1 for easier tracking in the loop later
    error_count = 0   # initialize with variable for error count checking to break from while loop
    
    #Additional Step: this is all the columns we are pulling
    identified_columns = ['subreddit', 'id', 'author', 'title', 'selftext', 'score', 'num_comments', 'created_utc']
    
    #Initialize the dataframe using list which would be substituted in the while loop
    all_df = []
    
    print(f"Data Collection for {subreddit}\n")

    while len(all_df) < no_of_posts: # to get the number of entries
        print(f"Loop #{loop}") 
    
        params = {
            'subreddit': subreddit,
            'size': 100,
            'before': before,  # Would be substituted with min epoch, so that later loops would get earlier posts
            'after': after, 
            'fields': identified_columns, #These are the identified useful columns
            #'selftext:not': '""' #This is to eliminate blank selftext field 
        }
        print("=== Retrieving... ========")
        res = requests.get(url, params) #Establish connection to the web page  
        print(f"Status Code: {res.status_code}")
    
        #Error checking: to continue if success and "error_count+=1 if not successful"
        if res.status_code == 200:
            print("=== Success! =============")
            data = res.json() #Store the json data (dict) into "data"
            posts = data['data'] #Retrieve the posts from the dictionary
            posts_df = pd.DataFrame(posts) #Convert to dataframe
            
            if len(posts_df) == 0: 
                print("No more posts to collect! \nTry adjusting before/after epoch time!")
                break  
            
            #Additional step: eliminates empty, [deleted], [removed] posts
            posts_df.drop(posts_df[posts_df.selftext==""].index, inplace=True)
            posts_df.drop(posts_df[posts_df.selftext=="[removed]"].index, inplace=True)
            posts_df.drop(posts_df[posts_df.selftext=="[deleted]"].index, inplace=True)
            
            before = posts_df.created_utc.min() # get the earliest utc in this loop

            if loop == 1:
                all_df = posts_df
                latest_epoch = posts_df.created_utc.max()
                latest_post = time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(latest_epoch)) # get the date/time of latest post
            else:
                all_df = pd.concat([all_df, posts_df], axis=0)
                
                #Additional step: check and remove the "extra" posts
                if len(all_df) > no_of_posts: # 
                    all_df = all_df.iloc[:no_of_posts,:]

            print(f"{len(all_df)*100/no_of_posts}% of data has been added to the dataframe! \n")

            #Provide short summary at the end
            if len(all_df) >= no_of_posts:
                earliest_epoch = all_df.created_utc.min()
                earliest_post = time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(earliest_epoch))
                print("=== Summary ==============")
                print(f"Subreddit: {subreddit}")
                print(f"No of Posts: {len(all_df)}")
                print(f"Start Date: {earliest_post}")
                print(f"End Date: {latest_post}")
                print(f"Start Epoch Time: {earliest_epoch}")
                print(f"End Epoch Time: {latest_epoch}")

            else: #Loop is still active
                time.sleep(np.random.randint(20, 35)) #provide a random time (seconds) for code to sleep
                loop += 1

        else: #Handle error where break out of while loop when there are 6 or more connection error.
            loop += 1
            error_count += 1
            print("=== Error! ===============")
            print(f"Error Count: {error_count}\n")
            if error_count > 5:
                print("=== Break ================")
                print(f"Detected more than 5 errors.")
                break
            time.sleep(np.random.randint(20, 35))
            continue
            
    all_df.reset_index(drop= True, inplace= True) #Reset the index

    return all_df

In [5]:
#Data collection for yoga

yoga = data_collection_post(subreddit='yoga', no_of_posts=1000)

Data Collection for yoga

Loop #1
Status Code: 200
3.0% of data has been added to the dataframe! 

Loop #2
Status Code: 200
5.8% of data has been added to the dataframe! 

Loop #3
Status Code: 200
8.8% of data has been added to the dataframe! 

Loop #4
Status Code: 200
11.6% of data has been added to the dataframe! 

Loop #5
Status Code: 200
14.0% of data has been added to the dataframe! 

Loop #6
Status Code: 200
16.6% of data has been added to the dataframe! 

Loop #7
Status Code: 200
18.6% of data has been added to the dataframe! 

Loop #8
Status Code: 200
20.8% of data has been added to the dataframe! 

Loop #9
Status Code: 200
23.1% of data has been added to the dataframe! 

Loop #10
Status Code: 200
24.9% of data has been added to the dataframe! 

Loop #11
Status Code: 200
28.0% of data has been added to the dataframe! 

Loop #12
Status Code: 200
30.7% of data has been added to the dataframe! 

Loop #13
Status Code: 200
33.7% of data has been added to the dataframe! 

Loop #14
St

In [8]:
yoga.shape

(1000, 8)

#### Summary on yoga data collection:

Expected number of loops is 10. But it took 35 loops to of 100 posts to fully collect 1,000 posts in yoga. It might be due to the fact that most of the posts are either deleted, removed or blank (posts like images). This data is collected from 19th March 2022 to 22nd July 2022. We can infer that this community is more active as we are able to collect 1,000 posts in 4 months.

In [7]:
#Data collection for pilates

pilates = data_collection_post(subreddit='pilates', no_of_posts=1000)

Data Collection for pilates

Loop #1
Status Code: 200
8.0% of data has been added to the dataframe! 

Loop #2
Status Code: 200
15.9% of data has been added to the dataframe! 

Loop #3
Status Code: 200
21.7% of data has been added to the dataframe! 

Loop #4
Status Code: 200
28.6% of data has been added to the dataframe! 

Loop #5
Status Code: 200
34.6% of data has been added to the dataframe! 

Loop #6
Status Code: 200
40.9% of data has been added to the dataframe! 

Loop #7
Status Code: 200
45.3% of data has been added to the dataframe! 

Loop #8
Status Code: 200
49.5% of data has been added to the dataframe! 

Loop #9
Status Code: 200
51.4% of data has been added to the dataframe! 

Loop #10
Status Code: 200
55.3% of data has been added to the dataframe! 

Loop #11
Status Code: 200
61.0% of data has been added to the dataframe! 

Loop #12
Status Code: 200
66.3% of data has been added to the dataframe! 

Loop #13
Status Code: 200
71.6% of data has been added to the dataframe! 

Loop #

In [9]:
pilates.shape

(1000, 8)

#### Summary on pilates data collection:

Expected number of loops is 10. But it took 19 loops to of 100 posts to fully collect 1,000 posts in pilates. It might be due to the fact that most of the posts are either deleted, removed or blank (posts like images). This data is collected from 25th October 2020 to 22nd July 2022. We can infer that this community is not so active as we need to collect data from past 2 years to hit 1,000 posts.

Comparing to yoga, pilates has lesser deleted or removed or blank posts as it only took 18 loops to collect 1,000 posts. But yoga community is more active as we were able to collect 1,000 posts in 4 months.

In [20]:
yoga.isnull().sum()

author          0
created_utc     0
id              0
num_comments    0
score           0
selftext        7
subreddit       0
title           0
dtype: int64

In [21]:
pilates.isnull().sum()

author          0
created_utc     0
id              0
num_comments    0
score           0
selftext        2
subreddit       0
title           0
dtype: int64

As the function above have taken into consideration of empty cells, we have minimal missing values from both community. Hence, we will proceed first and only address the missing values later on under data cleaning.

In [10]:
#Columns needed for NLP

nlp_columns = ['subreddit','title', 'selftext']

In [13]:
yoga_nlp = yoga[nlp_columns]
yoga_nlp.head()

Unnamed: 0,subreddit,title,selftext
0,yoga,Ankle Swelling From Yoga?,I recently signed up for a 30 day pass to a lo...
1,yoga,Are cork yoga mats better than a regular mat?,I find that I get a better workout with no mat...
2,yoga,Found My New Yoga Studio,"I just moved to Henderson, NV a while ago. Hav..."
3,yoga,We have started going to classes!,I’m sure you get these kind of posts frequentl...
4,yoga,How long did it take you to get good at yoga?,I'm struggling with core strength at the momen...


In [14]:
pilates_nlp = pilates[nlp_columns]
pilates_nlp.head()

Unnamed: 0,subreddit,title,selftext
0,pilates,Thoughts on BB IQ reformer?,Does anybody have experience with the Balanced...
1,pilates,Hundred - I’m doing it wrong,I don’t feel anything in my abs when I do the ...
2,pilates,Mini pro or reformer,Hi there!\n\nI am new to lagree and have been ...
3,pilates,Should I tip my instructor for private session?,What’s the rule here? Do I tip my instructor f...
4,pilates,Looking to Purchase Reformer - Seeking Advice/...,Hey all - I'm looking to purchase a reformer f...


In [15]:
#Exporting the subreddits into csv file

yoga_nlp.to_csv('yoga_nlp.csv', index = False)
pilates_nlp.to_csv('pilates_nlp.csv', index = False)