# Zoom Subreddit API

In this notebook, we extract submissions from the Zoom subreddit ([link](https://www.reddit.com/r/Zoom/)) using the Pushshift API ([link](https://github.com/pushshift/api)).

## Problem Statement

`Microsoft Teams` is one of `Zoom`'s largest and fastest-growing competitors. We want to examine what users of both products have been discussing on Reddit by applying NLP techniques.

__Subreddit links:__
- r/Zoom: https://www.reddit.com/r/Zoom/ (26.7k members, 14.7k submissions since 1 Jan 2020)
- r/MicrosoftTeams: https://www.reddit.com/r/MicrosoftTeams/ (36.5k members, 12.6k submissions since 1 Jan 2020)

We will then train a classifier to predict which of the two subreddits, `Zoom` or `Microsoft Teams`, a submission belongs to. Based on the models, we will make recommendations to two teams, the software development team and the marketing team:
1. Software Development Team - to highlight the common issues faced by users, as well as any additional features that users would like (if any).
2. Marketing Team - (i) to look at which features Microsoft Teams users have more issues with than Zoom users, and tweak our campaigns to capitalise on these perceived weaknesses, and (ii) to look at which words are closely associated with Zoom and with Microsoft Teams. These words can be considered for our Search Engine Marketing and Search Engine Optimisation campaigns, either as paid keywords (e.g. Google AdWords) or as organic keywords on our sites.

### Report
This report addresses the problem statement above and is meant to build our skills in `Web APIs and NLP`.

The report is split into 3 notebooks:
1. [Zoom_Subreddit_API](./zoom_api.ipynb) <font color='red'>(Current)</font>: Data extraction of `Zoom` subreddit
2. [Microsoft_Teams_Subreddit_API](./teams_api.ipynb): Data extraction of `Microsoft Teams` subreddit
3. [Analysis_of_the_subreddits](./analysis.ipynb): Sentiment analysis, (key) problems faced by users of Zoom and/or Microsoft Teams, and prediction of which subreddit a submission belongs to.

## Contents:
1. [Libraries](#Libraries)
2. [Data Extraction from subreddit](#Data-Extraction-from-subreddit)
3. [Data Export](#Data-Export)

---

## Libraries
In this section, we will be importing all the libraries used in this code notebook.

In [1]:
# To read url
import requests

# For Calculation and Data Manipulation
import numpy as np
import pandas as pd

# For creating the folder that the `.csv` files are exported to
import os

# for datetime conversion
import datetime

# for data collection server buffer time
import time

# this setting widens how many characters pandas will display in a column:
pd.options.display.max_colwidth = 400

---

## Data Extraction from subreddit

We will use the Pushshift API ([link](https://github.com/pushshift/api)) to extract the submissions from the subreddit.

In [2]:
# to get submissions
url_submissions = 'https://api.pushshift.io/reddit/search/submission'   

# to get comments
# url_comments = 'https://api.pushshift.io/reddit/search/comment'

In [3]:
# subreddit url: https://www.reddit.com/r/Zoom/
# create parameters dictionary
params = {
    'subreddit': 'Zoom',   # subreddit name
    'size': 100,   # number of posts to return, integer <= 100
    
    # 'after' returns posts created after the given date/time, i.e. the start of the window
    # option 1: relative offset, a number followed by s, m, h or d (second, minute, hour, day)
    #'after': '30d',
    # option 2: epoch value
    'after': 1577836800,   # Date and Time (GMT): 1 Jan 2020, 00:00
    
    # 'before' returns posts created before the given date/time, i.e. the end of the window
    # option 1: relative offset, a number followed by s, m, h or d (second, minute, hour, day)
    #'before': '30d',
    # option 2: epoch value
    #'before': 1614009600,   # 23 Feb 2021, Time: 00:00, to collect a month of data with reference to 23 Jan 2021
}
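
The epoch value used for `after` can be derived rather than looked up. The sketch below shows one way to do this with the standard library, assuming the date is meant in UTC (GMT); the helper name `to_epoch_utc` is hypothetical and is not used elsewhere in this notebook.

```python
import datetime


def to_epoch_utc(year, month, day, hour=0, minute=0):
    """Return the Unix epoch (in seconds) for a date and time given in UTC."""
    moment = datetime.datetime(year, month, day, hour, minute,
                               tzinfo=datetime.timezone.utc)
    return int(moment.timestamp())


# 1 Jan 2020, 00:00 GMT, the start date used in `params`
start_epoch = to_epoch_utc(2020, 1, 1)
print(start_epoch)  # 1577836800
```

Attaching `tzinfo` explicitly keeps the result independent of the timezone of the machine running the notebook.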

In [4]:
# Create counter of data set:
counter = 1

# get first (counter value) set of data using Pushshift API
zoom_sub_res = requests.get(url_submissions, params)

# confirm that data is obtained
print(f'Status Code of Set {counter}: {zoom_sub_res.status_code}')

# extract out the information
zoom_data = zoom_sub_res.json()['data']

# confirm that 100 submission posts were extracted
print(f'Length of Set {counter}: {len(zoom_data)}')

Status Code of Set 1: 200
Length of Set 1: 100


In [5]:
%%time

# create list to store temp values of response from scraping
zoom_data_add = ['temp']   # dummy value in the list to allow below while loop to work

# create loop for submission posts extraction
while len(zoom_data_add) > 0:
    
    # to include buffer timing for each request 
    time.sleep(3)
    
    # update the `after` params with the latest timestamp of the extraction
    params['after'] = zoom_data[-1]['created_utc']    # zoom_data is a list, each element inside is a dictionary, created_utc is the time of submission
    # print(params)    # check that parameters is being updated
    
    # Update counter:
    counter += 1

    # get (counter value) set of data using Pushshift API
    zoom_sub_res_add = requests.get(url_submissions, params)

    # confirm that data is obtained
    # uncomment to print the status code of each request
    # print(f'Status Code of Set {counter}: {zoom_sub_res_add.status_code}')

    # extract out the information
    zoom_data_add = zoom_sub_res_add.json()['data']
    # print(zoom_data_add[-1])   # check that zoom_data_add is being extracted

    # report how many submission posts were extracted in this set
    print(f'Length of Set {counter}: {len(zoom_data_add)}')
    
    # add / append the extraction to `zoom_data`
    zoom_data += zoom_data_add

Length of Set 2: 100
Length of Set 3: 100
Length of Set 4: 100
Length of Set 5: 100
Length of Set 6: 99
Length of Set 7: 100
Length of Set 8: 100
Length of Set 9: 100
Length of Set 10: 100
Length of Set 11: 100
Length of Set 12: 100
Length of Set 13: 100
Length of Set 14: 100
Length of Set 15: 99
Length of Set 16: 100
Length of Set 17: 100
Length of Set 18: 100
Length of Set 19: 100
Length of Set 20: 100
Length of Set 21: 100
Length of Set 22: 100
Length of Set 23: 100
Length of Set 24: 100
Length of Set 25: 100
Length of Set 26: 100
Length of Set 27: 100
Length of Set 28: 100
Length of Set 29: 100
Length of Set 30: 100
Length of Set 31: 100
Length of Set 32: 100
Length of Set 33: 100
Length of Set 34: 100
Length of Set 35: 99
Length of Set 36: 100
Length of Set 37: 100
Length of Set 38: 100
Length of Set 39: 100
Length of Set 40: 100
Length of Set 41: 100
Length of Set 42: 100
Length of Set 43: 100
Length of Set 44: 100
Length of Set 45: 100
Length of Set 46: 100
Length of Set 47: 100
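
The loop above pages through the subreddit by repeatedly setting `after` to the `created_utc` of the last post received, and stops once a request returns an empty list. The sketch below isolates that pattern so it can be run without network access. The names `collect_submissions`, `fetch_page` and `fake_fetch` are hypothetical, and the fake data stands in for the API. In real use, `fetch_page` could wrap `requests.get(url_submissions, params)` and check the status code before calling `.json()`.

```python
import time


def collect_submissions(fetch_page, start_after, pause_seconds=0.0, max_pages=1000):
    """Page through results by advancing `after` to the newest timestamp seen.

    `fetch_page(after)` is a hypothetical callable returning a list of dicts,
    each with a `created_utc` key, ordered from oldest to newest.
    """
    collected = []
    after = start_after
    for _ in range(max_pages):            # upper bound guards against an endless loop
        page = fetch_page(after)
        if not page:                      # an empty page means nothing newer is left
            break
        collected.extend(page)
        after = page[-1]['created_utc']   # next request starts after the last post
        time.sleep(pause_seconds)         # buffer time between requests
    return collected


# Stand-in for the API: 250 fake posts, served at most 100 at a time.
fake_posts = [{'id': i, 'created_utc': 1577836800 + i} for i in range(1, 251)]


def fake_fetch(after, size=100):
    newer = [post for post in fake_posts if post['created_utc'] > after]
    return newer[:size]


result = collect_submissions(fake_fetch, start_after=1577836800)
print(len(result))  # 250
```

Compared with seeding the list with a dummy value, checking for an empty page inside the loop avoids one extra state variable, and `max_pages` gives a hard stop if the server keeps returning data unexpectedly.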

In [6]:
# put the submissions posts into a DataFrame
zoom_df_original = pd.DataFrame(zoom_data)

# Let's take a look at the dataframe shape and its first 5 rows
print(zoom_df_original.shape)
zoom_df_original.head()

(15150, 87)


Unnamed: 0,all_awardings,allow_live_comments,author,author_flair_css_class,author_flair_richtext,author_flair_text,author_flair_type,author_fullname,author_patreon_flair,author_premium,...,url_overridden_by_dest,parent_whitelist_status,pwls,whitelist_status,wls,gallery_data,is_gallery,distinguished,is_created_from_ads_ui,author_is_blocked
0,[],False,rifaterdemsahin,,[],,text,t2_cjg05,False,False,...,,,,,,,,,,
1,[],False,rifaterdemsahin,,[],,text,t2_cjg05,False,False,...,,,,,,,,,,
2,[],False,roastedpot,,[],,text,t2_9chhr,False,False,...,,,,,,,,,,
3,[],False,secmehmet,,[],,text,t2_50mboo32,False,False,...,,,,,,,,,,
4,[],False,jumpinjj81,,[],,text,t2_e7de5,False,False,...,,,,,,,,,,


In [7]:
# Let's take a look at the columns and missing values
zoom_df_original.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15150 entries, 0 to 15149
Data columns (total 87 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   all_awardings                  15150 non-null  object 
 1   allow_live_comments            15150 non-null  bool   
 2   author                         15150 non-null  object 
 3   author_flair_css_class         0 non-null      object 
 4   author_flair_richtext          14910 non-null  object 
 5   author_flair_text              18 non-null     object 
 6   author_flair_type              14910 non-null  object 
 7   author_fullname                14910 non-null  object 
 8   author_patreon_flair           14910 non-null  object 
 9   author_premium                 14910 non-null  object 
 10  awarders                       15150 non-null  object 
 11  can_mod_post                   15150 non-null  bool   
 12  contest_mode                   15150 non-null 

In [8]:
# create copy of extraction
# this allows subsequent manipulation to not affect original data
zoom_df = zoom_df_original.copy()

In [9]:
# convert epoch value to datetime

# utc datetime convert
zoom_df['datetime_utc'] = pd.to_datetime(zoom_df['created_utc'], unit='s')

# local time convert (uses the timezone of the machine running the notebook)
zoom_df['datetime_local'] = pd.to_datetime(
    [datetime.datetime.fromtimestamp(i).strftime('%Y-%m-%d %H:%M:%S')
     for i in zoom_df['created_utc']])
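
Because `datetime.datetime.fromtimestamp` without a timezone argument depends on the machine's timezone, the local column may differ between machines. The sketch below shows a variant that makes the offset explicit. It assumes a fixed UTC+8 offset, which matches the 8-hour gap between `datetime_utc` and `datetime_local` in this notebook's output; the helper `epoch_to_utc_and_local` and the sample epoch value (chosen to correspond to 2020-01-09 09:38:02 UTC) are illustrative only.

```python
import datetime

# Assumption: local time is UTC+8, matching the 8-hour gap seen in the output.
LOCAL_TZ = datetime.timezone(datetime.timedelta(hours=8))


def epoch_to_utc_and_local(epoch_seconds, local_tz=LOCAL_TZ):
    """Convert an epoch value to timezone-aware UTC and local datetimes."""
    utc_dt = datetime.datetime.fromtimestamp(epoch_seconds,
                                             tz=datetime.timezone.utc)
    local_dt = utc_dt.astimezone(local_tz)
    return utc_dt, local_dt


utc_dt, local_dt = epoch_to_utc_and_local(1578562682)
print(utc_dt.strftime('%Y-%m-%d %H:%M:%S'))    # 2020-01-09 09:38:02
print(local_dt.strftime('%Y-%m-%d %H:%M:%S'))  # 2020-01-09 17:38:02
```

A function like this could be applied to the `created_utc` column element by element, so that the exported `datetime_local` values are reproducible regardless of where the notebook is run.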

In [10]:
# Let's take a look at the new dataframe shape and its first 5 rows
print(zoom_df.shape)
zoom_df.head()

(15150, 89)


Unnamed: 0,all_awardings,allow_live_comments,author,author_flair_css_class,author_flair_richtext,author_flair_text,author_flair_type,author_fullname,author_patreon_flair,author_premium,...,pwls,whitelist_status,wls,gallery_data,is_gallery,distinguished,is_created_from_ads_ui,author_is_blocked,datetime_utc,datetime_local
0,[],False,rifaterdemsahin,,[],,text,t2_cjg05,False,False,...,,,,,,,,,2020-01-09 09:38:02,2020-01-09 17:38:02
1,[],False,rifaterdemsahin,,[],,text,t2_cjg05,False,False,...,,,,,,,,,2020-01-09 11:51:14,2020-01-09 19:51:14
2,[],False,roastedpot,,[],,text,t2_9chhr,False,False,...,,,,,,,,,2020-02-13 15:10:36,2020-02-13 23:10:36
3,[],False,secmehmet,,[],,text,t2_50mboo32,False,False,...,,,,,,,,,2020-02-17 13:30:47,2020-02-17 21:30:47
4,[],False,jumpinjj81,,[],,text,t2_e7de5,False,False,...,,,,,,,,,2020-02-22 18:16:41,2020-02-23 02:16:41


In [11]:
# Let's take a look at the columns and missing values
zoom_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15150 entries, 0 to 15149
Data columns (total 89 columns):
 #   Column                         Non-Null Count  Dtype         
---  ------                         --------------  -----         
 0   all_awardings                  15150 non-null  object        
 1   allow_live_comments            15150 non-null  bool          
 2   author                         15150 non-null  object        
 3   author_flair_css_class         0 non-null      object        
 4   author_flair_richtext          14910 non-null  object        
 5   author_flair_text              18 non-null     object        
 6   author_flair_type              14910 non-null  object        
 7   author_fullname                14910 non-null  object        
 8   author_patreon_flair           14910 non-null  object        
 9   author_premium                 14910 non-null  object        
 10  awarders                       15150 non-null  object        
 11  can_mod_post   

In [12]:
# Let's take a look at the columns we are interested in
zoom_df[['id', 'author', 'subreddit',
         'selftext', 'title', 'datetime_utc', 'datetime_local']].head()

Unnamed: 0,id,author,subreddit,selftext,title,datetime_utc,datetime_local
0,em7f7x,rifaterdemsahin,Zoom,I want to have a recurring meeting I want automatically any one who joins to room?,Zoom room online all the time,2020-01-09 09:38:02,2020-01-09 17:38:02
1,em8lan,rifaterdemsahin,Zoom,How can I multiple usages automatically meet on zoom?,Multiple Usage Zoom,2020-01-09 11:51:14,2020-01-09 19:51:14
2,f3bcaa,roastedpot,Zoom,I was wondering if anyone had found a way to get the chat portion of Zoom to change to a dark mode? I have the Left Sidebar theme set to Dark but the chat window is still white.,Dark mode in chats,2020-02-13 15:10:36,2020-02-13 23:10:36
3,f58rff,secmehmet,Zoom,"hey, I have a problem with the zoom I can't share multi-monitor on zoom room how can I share multi-monitor please help me ??\n\nnext question we're meeting zoom room but we cant share all monitor why?",zoom multi-monitor,2020-02-17 13:30:47,2020-02-17 21:30:47
4,f7wcwg,jumpinjj81,Zoom,"I have a potential client who has a conference that needs to be live streamed. I’m using a tricaster with three PTZ cameras to switch between the in room speakers, slideshows and zoom callers. They want to attendees in the room to see the video from the current zoom speaker but they want to keep the other zoom attendees from seeing each other. They all need to remain anonymous. Is there a way ...",Caller anonymity question,2020-02-22 18:16:41,2020-02-23 02:16:41


---

## Data Export
We will export the data so that it can be used in the analysis notebook.

In [13]:
# create the data folder if it does not already exist
os.makedirs('../data', exist_ok=True)

# export the full extraction and the columns of interest as csv files into the data folder
zoom_df_original.to_csv('../data/zoom_original.csv', index=False)
zoom_df[['id', 'author', 'subreddit',
         'selftext', 'title', 'datetime_utc', 'datetime_local']].to_csv('../data/zoom_df.csv', index=False)