# Project 3: Part 1
## Web APIs & NLP

###  Background
Demand for data professionals has surged in recent years, which has increased competition among coding bootcamps. Competitors such as Hack Reactor, Le Wagon, Vertical Institute, and Rocket Academy have risen to meet that demand. If no action is taken, General Assembly may face a decline in market share, poor marketing ROI, and weaker lead generation.

The General Assembly marketing team needs to distinguish the online presence of a bootcamp seeker from that of a computer science major to aid targeted advertising. As the two audiences are fairly similar, separating them more clearly could yield better advertising ROI.

Keywords are an important aspect of digital advertising, allowing for targeted strategies at all levels of the marketing funnel. They also guide marketing teams on the sort of advertising content that is required.

Thus, the aim is to segment and target the right audience in order to streamline marketing efforts, raise brand awareness among interested individuals, and increase advertising ROI.

### Problem Statement
Build a model with at least 90% accuracy that distinguishes those looking for bootcamp-style learning from computer science majors or prospective students, based on the words they use online.

## Data Collection
This notebook focuses on scraping the raw data in preparation for cleaning, EDA and modelling (Part 2).

### Data Dictionary

#### Dataset name: df
|Feature|Type|Description|
|---|---|---|
|subreddit|string| The subreddit the posts are extracted from|
|title|string| Title of the posts|
|selftext|string|Body of the posts|

### Data Collection

In [1]:
import requests
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline


In [2]:
url = 'https://api.pushshift.io/reddit/search/submission'

Reddit has a strong, active following where coding bootcamps and full computer science degrees are discussed at length. A benefit of using Reddit as a source is that each subreddit is managed by moderators, which reduces the number of posts that could otherwise skew our findings.

We begin by looking at the data that can be scraped. The group has decided to look at these two subreddits:
- r/codingbootcamp
- r/csMajors

#### First look at data
The data needs to be scraped from Reddit. Using the Pushshift API, a third-party archive of Reddit data, we can obtain the data we require.

In [3]:
# r/codingbootcamp
# Set parameters for scraping
params_cbc =  {
    'subreddit':'codingbootcamp',
    'size':500,
    'sort': 'desc',
    'sort_type': 'created_utc',
    'is_video': False
    }

In [4]:
# Submit request
res_cbc = requests.get(url, params_cbc)

In [5]:
res_cbc.status_code

200

In [6]:
data_cbc = res_cbc.json()

In [7]:
# Extract just posts
posts_cbc = data_cbc['data']

In [8]:
len(posts_cbc)

250

In [9]:
df_cbc = pd.DataFrame(posts_cbc)

In [10]:
df_cbc.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 250 entries, 0 to 249
Data columns (total 77 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   all_awardings                  250 non-null    object 
 1   allow_live_comments            250 non-null    bool   
 2   author                         250 non-null    object 
 3   author_flair_css_class         0 non-null      object 
 4   author_flair_richtext          249 non-null    object 
 5   author_flair_text              0 non-null      object 
 6   author_flair_type              249 non-null    object 
 7   author_fullname                249 non-null    object 
 8   author_is_blocked              250 non-null    bool   
 9   author_patreon_flair           249 non-null    object 
 10  author_premium                 249 non-null    object 
 11  awarders                       250 non-null    object 
 12  can_mod_post                   250 non-null    boo

The dataframe has 77 columns, not all of which are necessary for what we are trying to achieve. Reducing them will help with storage, speed, and accuracy.

In [11]:
df_cbc[['subreddit', 'selftext', 'title', 'created_utc']].head()

| |subreddit|selftext|title|created_utc|
|---|---|---|---|---|
|0|codingbootcamp|Looking for a career change to programming/cod...|Hello all, brand new to the coding world.|1669899316|
|1|codingbootcamp|https://www.codespaces.com/best-data-structure...|Just thought to share :)|1669897818|
|2|codingbootcamp|\n\n[View Poll](https://www.reddit.com/poll/z9...|I've considered taking up a bootcamp however, ...|1669856640|
|3|codingbootcamp|does anyone tried Le Wagon Bootcamp in Lausann...|Le Wagon bootcamp Lausanne|1669840304|
|4|codingbootcamp|I ask because they have so many five star revi...|Has anyone done Tech Elevator and not had a po...|1669839828|


In [12]:
# r/csMajors
# Set parameters, submit request, extract and store as df

params_cm =  {
    'subreddit':'csMajors',
    'size':500,
    'sort': 'desc',
    'sort_type': 'created_utc',
    'is_video': False}
res_cm = requests.get(url, params_cm)
print(res_cm.status_code)
data_cm = res_cm.json()
posts_cm = data_cm['data']
print(len(posts_cm))
df_cm = pd.DataFrame(posts_cm)
df_cm[['subreddit', 'selftext', 'title', 'created_utc']].head()

200
250


| |subreddit|selftext|title|created_utc|
|---|---|---|---|---|
|0|csMajors|Honestly just curious about this one. I’m an u...|Stay @ UCLA or transfer to Berkeley|1669902859|
|1|csMajors|Title says it all. Have been in both Undergrad...|Know of any good online Masters of CS programs...|1669897046|
|2|csMajors|Hi,\n\nI know there’s lots of posts like this ...|Dropped out of college. What now?|1669895244|
|3|csMajors|Hi! I am really worried because I have no offe...|I am an international Sophomore student in the...|1669894479|
|4|csMajors|[removed]|Dark net site as experience|1669892017|


Even though the limit is set to 500, only 250 posts were returned from each subreddit, which suggests that the API caps a single request at 250 posts. This is not enough for model training. We will need to scrape at least 3000 posts per subreddit, as the dataset could shrink after cleaning.
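
As a quick check on how many requests that implies, the arithmetic can be sketched as below. The helper name `requests_needed` and its `spare_pages` argument are made up for this illustration; the page size of 250 is simply what the responses above returned.

```python
import math

def requests_needed(target_posts, page_size=250, spare_pages=1):
    """Number of API calls needed to collect at least target_posts rows.

    spare_pages adds a safety margin, since some batches may come back
    slightly short of page_size.
    """
    return math.ceil(target_posts / page_size) + spare_pages

print(requests_needed(3000))
```

The extra page is a judgment call: it costs one more request but leaves some headroom if a batch returns fewer posts than requested.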

`created_utc` is the timestamp of the post. It is necessary to keep track of it to avoid scraping the same data over and over. It also serves as a marker showing where the last scrape stopped.
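
To make the marker idea concrete, here is a minimal sketch that picks the oldest `created_utc` from a batch and formats it as a readable UTC time. The helper names `next_before` and `utc_label` are hypothetical, and the sample values are the first three timestamps from the r/codingbootcamp preview above. It assumes the oldest timestamp is the right value to pass as the next `before` parameter.

```python
from datetime import datetime, timezone

def next_before(posts):
    """Return the oldest created_utc in a batch of post dicts."""
    return min(int(post['created_utc']) for post in posts)

def utc_label(timestamp):
    """Format a Unix timestamp (seconds) as a UTC date-time string."""
    moment = datetime.fromtimestamp(timestamp, tz=timezone.utc)
    return moment.strftime('%Y-%m-%d %H:%M:%S')

# Three timestamps copied from the preview table; the rest of each post is omitted
sample = [
    {'created_utc': 1669899316},
    {'created_utc': 1669897818},
    {'created_utc': 1669856640},
]
cursor = next_before(sample)
print(cursor, utc_label(cursor))
```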

#### Create function to automate scraping process and scrape >3000 posts

In [13]:
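# Define function
def scrapedata(count, subreddit, before):
    """Scrape `count` batches of up to 250 posts from `subreddit`, each older than `before`."""
    df = pd.DataFrame()
    url='https://api.pushshift.io/reddit/search/submission'
    for i in range(count):
        params = {'subreddit': subreddit, 'size':250, 'before': before}
        req = requests.get(url,params)
        print(req.status_code)
        data = req.json()
        # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
        df = pd.concat([df, pd.DataFrame(data['data'])])
        df.reset_index(drop=True, inplace=True)
        # Timestamp of the last (oldest) post so far becomes the next 'before' value
        before=df.loc[len(df)-1,['created_utc']]
        print(before)
    print(f'{len(df)} post scrapped')
    return df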
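
The function above assumes every request succeeds and returns fresh posts. A slightly more defensive variant might stop when a batch comes back empty and drop duplicate posts. The sketch below is one way to do that, not the approach used in this notebook. `scrape_pages`, `fetch_page` and `fake_fetch` are hypothetical names, and the `id` column is assumed to uniquely identify a post. The network call is passed in as a function so the loop can be tried offline; a real `fetch_page` would wrap `requests.get(url, params).json()['data']`.

```python
import time
import pandas as pd

def scrape_pages(fetch_page, subreddit, before, pages, pause=0.0):
    """Collect up to `pages` batches, walking backwards in time.

    fetch_page(subreddit, before) must return a list of post dicts.
    """
    frames = []
    for _ in range(pages):
        posts = fetch_page(subreddit, before)
        if not posts:  # nothing older left, so stop early
            break
        batch = pd.DataFrame(posts)
        frames.append(batch)
        # Oldest timestamp in this batch becomes the next cursor
        before = int(batch['created_utc'].min())
        time.sleep(pause)  # optional pause between requests
    if not frames:
        return pd.DataFrame()
    combined = pd.concat(frames, ignore_index=True)
    # Assumes an 'id' column that is unique per post
    return combined.drop_duplicates(subset='id').reset_index(drop=True)

def fake_fetch(subreddit, before, page_size=3):
    """Stand-in for the API: serves synthetic posts with timestamps 7 down to 1."""
    older = [t for t in range(7, 0, -1) if t < before]
    return [{'id': f'p{t}', 'subreddit': subreddit, 'created_utc': t}
            for t in older[:page_size]]

demo = scrape_pages(fake_fetch, 'codingbootcamp', before=8, pages=5)
print(len(demo), demo['created_utc'].tolist())
```

Collecting the batches in a list and concatenating once avoids rebuilding the dataframe on every iteration.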


In [14]:
# Use function to scrape >3000 posts
count = 13   # (3000/250 + 1)

# Scrape for r/codingbootcamp
# 'Before' parameter in function will be taken from 'created_utc' from the first post in the above df.
# before = 1669244663
df_cbc = scrapedata(count, 'codingbootcamp', 1669244663)

200
created_utc    1666690179
Name: 248, dtype: object
200
created_utc    1664390254
Name: 497, dtype: object
200
created_utc    1662067301
Name: 746, dtype: object
200
created_utc    1659580596
Name: 994, dtype: object
200
created_utc    1657032489
Name: 1242, dtype: object
200
created_utc    1654274797
Name: 1492, dtype: object
200
created_utc    1651628663
Name: 1739, dtype: object
200
created_utc    1649373636
Name: 1989, dtype: object
200
created_utc    1646067460
Name: 2239, dtype: object
200
created_utc    1642694158
Name: 2489, dtype: object
200
created_utc    1639056962
Name: 2739, dtype: object
200
created_utc    1635019825
Name: 2989, dtype: object
200
created_utc    1630467703
Name: 3239, dtype: object
3240 post scrapped


In [15]:
# Scrape for r/csMajors
# 'Before' parameter in function will be taken from 'created_utc' from the first post in the above df.
# before = 1669252773

df_cm = scrapedata(count, 'csMajors', 1669252773)

200
created_utc    1669050306
Name: 249, dtype: object
200
created_utc    1668797107
Name: 499, dtype: object
200
created_utc    1668611059
Name: 746, dtype: object
200
created_utc    1668444854
Name: 996, dtype: object
200
created_utc    1668199245
Name: 1245, dtype: object
200
created_utc    1668031724
Name: 1494, dtype: object
200
created_utc    1667886130
Name: 1744, dtype: object
200
created_utc    1667761624
Name: 1994, dtype: object
200
created_utc    1667586358
Name: 2243, dtype: object
200
created_utc    1667443257
Name: 2493, dtype: object
200
created_utc    1667318260
Name: 2743, dtype: object
200
created_utc    1667117791
Name: 2993, dtype: object
200
created_utc    1666906918
Name: 3243, dtype: object
3244 post scrapped


###### Data was scraped on 24th November 2022.

In [16]:
df_cbc.shape

(3240, 81)

In [17]:
df_cm.shape

(3244, 74)

As both dataframes have over 70 columns, most of which will not be used for model training, it is best to reduce this to just the necessary columns. 

In [18]:
df_cbc = df_cbc[['subreddit', 'selftext', 'title']]

In [19]:
df_cm = df_cm[['subreddit', 'selftext', 'title']]

In [20]:
# Combine both dfs
df = pd.concat([df_cbc, df_cm], ignore_index=True)

In [21]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6484 entries, 0 to 6483
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   subreddit  6484 non-null   object
 1   selftext   6479 non-null   object
 2   title      6484 non-null   object
dtypes: object(3)
memory usage: 152.1+ KB
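
The `info()` output shows a few missing `selftext` values, and the earlier preview contained a `[removed]` body. Before moving on, it may be worth counting how many rows per subreddit carry no usable body text. The sketch below runs on a small made-up dataframe rather than the scraped data. `text_quality_report` is a hypothetical helper, and treating `[deleted]` as a placeholder is an assumption, since only `[removed]` appears in the output above.

```python
import pandas as pd

def text_quality_report(frame):
    """Count posts per subreddit whose selftext is missing, blank or a placeholder."""
    body = frame['selftext']
    placeholder = body.isin(['[removed]', '[deleted]'])
    blank = body.fillna('').str.strip() == ''  # covers both NaN and empty strings
    unusable = blank | placeholder
    report = (frame.assign(unusable=unusable)
                   .groupby('subreddit')['unusable']
                   .agg(['size', 'sum']))
    return report.rename(columns={'size': 'posts', 'sum': 'unusable_selftext'})

# Made-up rows for illustration only
toy = pd.DataFrame({
    'subreddit': ['codingbootcamp', 'codingbootcamp', 'csMajors', 'csMajors'],
    'selftext': ['Looking for a career change', None, '[removed]', ''],
    'title': ['a', 'b', 'c', 'd'],
})
report = text_quality_report(toy)
print(report)
```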


In [22]:
# Save df as csv
df.to_csv('data/df.csv', index=False)
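
Note that `to_csv` fails if the `data` folder does not exist yet. A minimal sketch of a save step that creates the folder first is shown below. `save_csv` is a hypothetical helper, and the example writes to a temporary directory so it does not touch the project files.

```python
from pathlib import Path
import tempfile
import pandas as pd

def save_csv(frame, path):
    """Write frame to path, creating parent folders if they are missing."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    frame.to_csv(path, index=False)
    return path

# One made-up row with an empty body
sample = pd.DataFrame({'subreddit': ['csMajors'], 'selftext': [''], 'title': ['t']})

with tempfile.TemporaryDirectory() as tmp:
    out = save_csv(sample, Path(tmp) / 'data' / 'df.csv')
    restored = pd.read_csv(out)

print(restored.shape, restored['selftext'].isna().sum())
```

With pandas defaults, an empty string is read back as a missing value, so empty bodies will show up as `NaN` when the file is loaded in Part 2.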

The data has been scraped and stored in a dataframe. It is now ready for the next part of the process. This will continue in Part 2. 