# Webscraping

# Background

Competition among coding bootcamps is increasing, from providers such as Hack Reactor, Vertical Institute, Rocket Academy and Le Wagon. If no action is taken, there will be a decline in market share, a poorer marketing return on investment (ROI) and weaker lead generation, which means we will not be able to meet the enrollment KPIs.

GA marketing is therefore trying to figure out how to better identify the online persona of a bootcamp seeker as opposed to that of a computer science major to aid in targeted advertising.


Because the two audiences have quite a bit in common, distinguishing them more sharply could yield a better advertising ROI.

# Problem Statement

Due to increased competition in the bootcamp market, General Assembly has been facing poorer enrollments, and it intends to maintain its position as one of the better bootcamps. We are a team of data scientists tasked by General Assembly with building a model, with >90% accuracy, that distinguishes people looking for bootcamp-style learning from computer science majors and prospective students, based on the words they use online.

## Data Dictionary

| Feature | Type | Dataset | Description|
| :--- | :--- | :--- | :--- |
| subreddit | Object | cs_major / coding_bootcamp | Name of the subreddit the post belongs to: either csMajors or codingbootcamp. |
| selftext | Object | cs_major / coding_bootcamp | selftext contains the text or the message of the post written by the end user. |
| title | Object | cs_major / coding_bootcamp | title contains the title of the post. |
| csMajors | Object | cs_major | csMajors is the topic, also known as the subreddit. It refers to the Computer Science major that universities offer to students. |
| coding_bootcamp | Object | coding_bootcamp | coding_bootcamp is the topic, also known as the subreddit. It refers to coding bootcamps taken by mid-career switchers, companies and students who are interested in upskilling. |
| combined_text | Object | cs_major / coding_bootcamp | combined_text is the combined columns of selftext and title. |
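
As a minimal sketch, `combined_text` could be derived from `title` and `selftext` as follows. The sample rows, the single-space separator and the handling of missing post bodies are assumptions for illustration, not values taken from the scraped data.

```python
import pandas as pd

# Hypothetical sample rows standing in for scraped posts.
posts = pd.DataFrame({
    "subreddit": ["csMajors", "codingbootcamp"],
    "title": ["Internship advice", "Is a bootcamp worth it?"],
    "selftext": ["Any tips for sophomore year?", None],
})

# Assumption: treat a missing post body as an empty string before joining.
selftext = posts["selftext"].fillna("")

# Join title and body with a single space, then trim stray whitespace.
posts["combined_text"] = (posts["title"] + " " + selftext).str.strip()
```

Filling missing bodies first matters because concatenating a string with a missing value would otherwise make the whole `combined_text` entry missing.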


# Import Libraries

In [4]:
import numpy as np
import pandas as pd

import requests
from bs4 import BeautifulSoup
import time

# Webscrape data from Reddit

We will use pushshift.io, a third-party API for Reddit data, to collect submissions (posts) from each subreddit. A subreddit is the topic-specific community where end users go to start discussions or post about their interests. We will then organise the posts into a data frame with 3 columns: subreddit, selftext and title.

In [5]:
# Define a function to scrape posts from a subreddit
def get_posts(subreddit, number):
    url = 'https://api.pushshift.io/reddit/search/submission'
    params = {
        'subreddit': subreddit,
        'size': 100
    }  # 'before' is added later in the while loop
    returned = pd.DataFrame()  # accumulates the scraped posts
    while len(returned) < number:
        time.sleep(3)  # pause 3 seconds between requests to avoid overloading the API

        res = requests.get(url, params=params)  # request up to 100 posts
        res.raise_for_status()  # stop on an HTTP error instead of parsing a bad response
        posts = res.json()['data']
        if not posts:  # no older posts are left in this subreddit
            break
        df = pd.DataFrame(posts)

        # Use the oldest post's timestamp so the next request fetches older posts
        params['before'] = int(df['created_utc'].iloc[-1])

        returned = pd.concat([returned, df[['subreddit', 'selftext', 'title']]], axis=0)
        returned.drop_duplicates(inplace=True)

    returned.reset_index(inplace=True, drop=True)
    return returned[:number]

In [6]:
cs_major = get_posts('csMajors', 4000)  # scrape 4000 posts
coding_bootcamp = get_posts('codingbootcamp', 4000)  # scrape 4000 posts

Here we assign the posts scraped from each subreddit to their own dataframe variable. Note that we tried to scrape at least 5000 posts from each subreddit; however, codingbootcamp does not have that many, so we lowered the target to 4000, which could be scraped from both.
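
The backwards paging used by `get_posts`, and what happens when a subreddit has fewer posts than requested, can be illustrated offline. This is a minimal sketch under stated assumptions: `fake_fetch` and `collect` are hypothetical helpers that stand in for the real API call, and each fake post simply uses its index as its `created_utc` timestamp.

```python
def fake_fetch(before=None, size=100, total=250):
    """Hypothetical stand-in for one API call: returns up to `size` posts
    strictly older than `before`, newest first. Post i has created_utc == i."""
    newest = total if before is None else min(before, total)
    start = newest - 1
    stop = max(start - size, -1)
    return [{"created_utc": ts, "title": f"post {ts}"} for ts in range(start, stop, -1)]


def collect(fetch, number):
    """Page backwards in time until `number` posts are collected
    or the source runs out of older posts."""
    collected = []
    before = None
    while len(collected) < number:
        page = fetch(before=before)
        if not page:  # nothing older is available: stop early
            break
        collected.extend(page)
        # The oldest post on this page becomes the cursor for the next request.
        before = page[-1]["created_utc"]
    return collected[:number]


enough = collect(fake_fetch, 150)  # the fake source has 250 posts, 150 requested
short = collect(fake_fetch, 400)   # only 250 posts exist, so fewer come back
print(len(enough), len(short))
```

Stopping on an empty page is the design choice that keeps the loop from running forever when the requested number exceeds what the source can supply.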

In [7]:
cs_major.to_csv('data/cs_major.csv', index=False)
cs_major.head()  # check the dataframe (shown only when it is the last expression in the cell)

In [8]:
coding_bootcamp.to_csv('data/coding_bootcamp.csv', index=False)
coding_bootcamp.head()  # check the dataframe (shown only when it is the last expression in the cell)

We save these dataframes as .csv files so that we can later clean the data, do EDA, model and evaluate.
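
A minimal sketch of writing the two files and reading them back into a single frame for the later steps. The temporary directory and the one-row sample frames are hypothetical stand-ins so that the example runs without the real scraped data.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical one-row frames standing in for the scraped data.
cs_sample = pd.DataFrame(
    {"subreddit": ["csMajors"], "selftext": ["body a"], "title": ["title a"]}
)
bootcamp_sample = pd.DataFrame(
    {"subreddit": ["codingbootcamp"], "selftext": ["body b"], "title": ["title b"]}
)

with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp) / "data"
    data_dir.mkdir(parents=True, exist_ok=True)  # to_csv does not create missing folders

    cs_sample.to_csv(data_dir / "cs_major.csv", index=False)
    bootcamp_sample.to_csv(data_dir / "coding_bootcamp.csv", index=False)

    # Read both files back and stack them into one frame.
    combined = pd.concat(
        [
            pd.read_csv(data_dir / "cs_major.csv"),
            pd.read_csv(data_dir / "coding_bootcamp.csv"),
        ],
        ignore_index=True,
    )
```

Writing with `index=False` keeps the row index out of the files, so the frames read back with only the three original columns.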