# Microsoft Teams Subreddit API

In this notebook, we extract submissions from the Microsoft Teams subreddit ([link](https://www.reddit.com/r/MicrosoftTeams/)) using the Pushshift API ([link](https://github.com/pushshift/api)).

## Problem Statement

`Microsoft Teams` is one of `Zoom`'s largest and fastest-growing competitors. We want to examine what users have been discussing on Reddit by applying NLP techniques.

__Subreddit links:__
- r/Zoom: https://www.reddit.com/r/Zoom/ (26.7k members, 14.7k submissions since 1 Jan 2020)
- r/MicrosoftTeams: https://www.reddit.com/r/MicrosoftTeams/ (36.5k members, 12.6k submissions since 1 Jan 2020)

We will then train a classifier to distinguish content from the two subreddits, `Zoom` and `Microsoft Teams`. Based on the models, we will make recommendations to two teams, software development and marketing:
1. Software Development Team - highlight the common issues faced by users, as well as any additional features that users would like (if any)
2. Marketing - (i) identify which features Microsoft Teams users have more issues with than Zoom users, and tweak our campaigns to capitalise on these perceived weaknesses, and (ii) identify which words are closely associated with Zoom and Microsoft Teams. These words can be considered for our Search Engine Marketing and Search Engine Optimisation campaigns, either as paid keywords (e.g. Google AdWords) or as organic keywords on our sites.

### Report
This report tackles the problem statement above, which is meant to build our skills in `Web APIs and NLP`.

The report is split into 3 notebooks:
1. [Zoom_Subreddit_API](./zoom_api.ipynb): Data extraction of `Zoom` subreddit
2. [Microsoft_Teams_Subreddit_API](./teams_api.ipynb) <font color='red'>(Current)</font>: Data extraction of `Microsoft Teams` subreddit
3. [Analysis_of_the_subreddits](./analysis.ipynb): Sentiment analysis, key problems faced by users of Zoom and/or Microsoft Teams, and prediction of which subreddit a submission belongs to.

## Contents:
1. [Libraries](#Libraries)
2. [Data Extraction from subreddit](#Data-Extraction-from-Subreddit)
3. [Data Export](#Data-Export)

---

## Libraries
In this section, we will be importing all the libraries used in this code notebook.

In [1]:
# To read url
import requests

# For Calculation and Data Manipulation
import numpy as np
import pandas as pd

# For creating the export folder for `.csv` files
import os

# for datetime conversion
import datetime

# for server buffer time
import time

# this setting widens how many characters pandas will display in a column:
pd.options.display.max_colwidth = 400

---

## Data Extraction from Subreddit

We will use the Pushshift API ([link](https://github.com/pushshift/api)) to extract the submissions from the subreddit.

In [2]:
# to get submissions
url_submissions = 'https://api.pushshift.io/reddit/search/submission'   

# to get comments
# url_comments = 'https://api.pushshift.io/reddit/search/comment'

In [3]:
# subreddit url: https://www.reddit.com/r/MicrosoftTeams/
# create parameters dictionary
params = {
    'subreddit': 'MicrosoftTeams',   # subreddit name
    'size': 100,   # number of posts to return, integer <= 100
    
    # 'after' gets posts after indicated date/period, i.e. the start date/time
    #'after': 30d,   # number followed by the characters s,m,h,d (which stand for second, minute, hour and day)
    # option 2: epoch value
    'after': 1577836800,   # Date and Time (GMT): 1 Jan 2020, 00:00
    
    # 'before' gets posts before indicated date/period, i.e. the end date/time
    #'before': 30d,   # option 1: number followed by the characters s,m,h,d (which stand for second, minute, hour and day)
    # option 2: epoch value
    #'before': 1614009600,   # 23 Feb 2021, 00:00 (UTC+8), to collect a month of data with reference to 23 Jan 2021
}
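The epoch value passed to `after` can be derived from a calendar date instead of being looked up by hand. The snippet below is a minimal sketch using only the standard library; the helper name `to_epoch_utc` is hypothetical and not part of the original notebook.

```python
from datetime import datetime, timezone


def to_epoch_utc(year, month, day, hour=0, minute=0):
    """Return the Unix epoch (in seconds) for a date/time interpreted as UTC."""
    dt = datetime(year, month, day, hour, minute, tzinfo=timezone.utc)
    return int(dt.timestamp())


# 1 Jan 2020, 00:00 GMT, i.e. the start date used in the params dictionary
start_epoch = to_epoch_utc(2020, 1, 1)
print(start_epoch)  # 1577836800
```

Attaching `timezone.utc` explicitly keeps the result independent of the time zone of the machine that runs the notebook.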

In [4]:
# Create counter of data set:
counter = 1

# get first (counter value) set of data using Pushshift API
mst_sub_res = requests.get(url_submissions, params)

# confirm that data is obtained
print(f'Status Code of Set {counter}: {mst_sub_res.status_code}')

# extract out the information
mst_data = mst_sub_res.json()['data']

# confirm that 100 submission posts were extracted
print(f'Length of Set {counter}: {len(mst_data)}')

Status Code of Set 1: 200
Length of Set 1: 100
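The cell above prints the status code but would still try to read `['data']` if the request had failed. A small guard could make such a failure explicit. This is only a sketch: `extract_data` is a hypothetical helper, and the values passed to it below are stand-ins for `mst_sub_res.status_code` and `mst_sub_res.json()`.

```python
def extract_data(status_code, payload):
    """Return payload['data'], or raise a descriptive error if it is unavailable."""
    if status_code != 200:
        raise RuntimeError(f'Request failed with status code {status_code}')
    if 'data' not in payload:
        raise KeyError("Response JSON has no 'data' key")
    return payload['data']


# Stand-in values; in the notebook these would come from
# mst_sub_res.status_code and mst_sub_res.json()
posts = extract_data(200, {'data': [{'id': 'eibfah'}]})
print(len(posts))  # 1
```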


In [5]:
%%time

# create list to store temp values of response from scraping
mst_data_add = ['temp']   # dummy value in the list to allow below while loop to work

# create loop for submission posts extraction
while len(mst_data_add) > 0:
    
    # to include buffer timing for each request 
    time.sleep(3)
    
    # update the `after` params with the latest timestamp of the extraction
    params['after'] = mst_data[-1]['created_utc']    # mst_data is a list of dictionaries; created_utc is the submission time (epoch)
    # print(params)    # check that parameters is being updated
    
    # Update counter:
    counter += 1

    # get (counter value) set of data using Pushshift API
    mst_sub_res_add = requests.get(url_submissions, params)

    # confirm that data is obtained
    # to print when you want to check
    # print(f'Status Code of Set {counter}: {mst_sub_res_add.status_code}')

    # extract out the information
    mst_data_add = mst_sub_res_add.json()['data']
    # print(mst_data_add[-1])   # check that mst_data_add is being extracted

    # confirm how many submission posts were extracted (100 per full set)
    # to print when you want to check
    print(f'Length of Set {counter}: {len(mst_data_add)}')
    
    # add / append the extraction to `mst_data`
    mst_data += mst_data_add

Length of Set 2: 100
Length of Set 3: 100
Length of Set 4: 100
Length of Set 5: 100
Length of Set 6: 100
Length of Set 7: 100
Length of Set 8: 100
Length of Set 9: 100
Length of Set 10: 100
Length of Set 11: 100
Length of Set 12: 100
Length of Set 13: 100
Length of Set 14: 100
Length of Set 15: 100
Length of Set 16: 100
Length of Set 17: 100
Length of Set 18: 100
Length of Set 19: 100
Length of Set 20: 100
Length of Set 21: 100
Length of Set 22: 100
Length of Set 23: 100
Length of Set 24: 100
Length of Set 25: 100
Length of Set 26: 100
Length of Set 27: 100
Length of Set 28: 100
Length of Set 29: 100
Length of Set 30: 100
Length of Set 31: 100
Length of Set 32: 100
Length of Set 33: 100
Length of Set 34: 100
Length of Set 35: 100
Length of Set 36: 100
Length of Set 37: 100
Length of Set 38: 100
Length of Set 39: 100
Length of Set 40: 100
Length of Set 41: 100
Length of Set 42: 100
Length of Set 43: 100
Length of Set 44: 100
Length of Set 45: 100
Length of Set 46: 100
Length of Set 47: 
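The loop above pages through the subreddit by moving `after` to the `created_utc` of the last record received, and stops when an empty set comes back. The same idea can be written as a reusable function. The sketch below is not the notebook's code: `fetch_all`, `fetch_page` and `fake_fetch` are hypothetical names, and the API is replaced by an in-memory stand-in so that the example runs without network access.

```python
import time


def fetch_all(fetch_page, start_after, pause=0.0, max_pages=1000):
    """Collect records page by page until an empty page is returned.

    fetch_page: callable that takes an `after` epoch and returns a list of
                dicts, each carrying a 'created_utc' key (hypothetical interface).
    """
    records = []
    after = start_after
    for _ in range(max_pages):
        page = fetch_page(after)
        if not page:                      # empty page: nothing newer is left
            break
        records.extend(page)
        after = page[-1]['created_utc']   # move the cursor to the last record
        time.sleep(pause)                 # buffer time between requests
    return records


# Stand-in for the API: 250 fake posts with increasing timestamps
fake_posts = [{'id': i, 'created_utc': 1577836800 + i} for i in range(1, 251)]


def fake_fetch(after, size=100):
    """Return up to `size` fake posts created strictly after `after`."""
    newer = [p for p in fake_posts if p['created_utc'] > after]
    return newer[:size]


all_posts = fetch_all(fake_fetch, start_after=1577836800)
print(len(all_posts))  # 250
```

With the real API, `fetch_page` could wrap the `requests.get(url_submissions, params)` call from the cells above, with `pause=3` to keep the same buffer time. The `max_pages` cap is an added safeguard against an endless loop and is not part of the original notebook.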

In [6]:
# put the submissions posts into a DataFrame
mst_df_original = pd.DataFrame(mst_data)

# Let's take a look at the dataframe shape and its first 5 rows
print(mst_df_original.shape)
mst_df_original.head()

(13137, 91)


Unnamed: 0,all_awardings,allow_live_comments,author,author_flair_css_class,author_flair_richtext,author_flair_text,author_flair_type,author_fullname,author_patreon_flair,author_premium,...,gallery_data,is_gallery,event_end,event_is_live,event_start,gilded,is_created_from_ads_ui,author_is_blocked,discussion_type,suggested_sort
0,[],False,x12Mike,,[],,text,t2_8g3nd,False,False,...,,,,,,,,,,
1,[],False,Repent2019,,[],,text,t2_2xirsuwe,False,False,...,,,,,,,,,,
2,[],False,jdlnewborn,,[],,text,t2_mxiqs,False,False,...,,,,,,,,,,
3,[],False,jacoke3,,[],,text,t2_12aj9l,False,False,...,,,,,,,,,,
4,[],False,mikeprennie,,[],,text,t2_3fpkfnn0,False,False,...,,,,,,,,,,


In [7]:
# Let's take a look at the columns and missing values
mst_df_original.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13137 entries, 0 to 13136
Data columns (total 91 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   all_awardings                  13137 non-null  object 
 1   allow_live_comments            13137 non-null  bool   
 2   author                         13137 non-null  object 
 3   author_flair_css_class         1 non-null      object 
 4   author_flair_richtext          13028 non-null  object 
 5   author_flair_text              145 non-null    object 
 6   author_flair_type              13028 non-null  object 
 7   author_fullname                13028 non-null  object 
 8   author_patreon_flair           13028 non-null  object 
 9   author_premium                 13028 non-null  object 
 10  awarders                       13137 non-null  object 
 11  can_mod_post                   13137 non-null  bool   
 12  contest_mode                   13137 non-null 

In [8]:
# create copy of extraction
# this allows subsequent manipulation to not affect original data
mst_df = mst_df_original.copy()

In [9]:
# convert epoch value to datetime

# utc datetime convert
mst_df['datetime_utc'] = pd.to_datetime(mst_df['created_utc'], unit='s')

# local time convert (uses the time zone of the machine running the notebook)
mst_df['datetime_local'] = pd.to_datetime(
    [datetime.datetime.fromtimestamp(i).strftime('%Y-%m-%d %H:%M:%S')
     for i in mst_df['created_utc']])
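Because `datetime.datetime.fromtimestamp` without a time zone relies on the machine's settings, the `datetime_local` column can differ from one machine to another. A conversion with an explicit offset avoids that. The sketch below uses only the standard library and assumes UTC+8, which is inferred from the 8-hour gap between `datetime_utc` and `datetime_local` in this notebook's output; `LOCAL_TZ` and `epoch_to_utc_and_local` are hypothetical names.

```python
from datetime import datetime, timedelta, timezone

# Assumption: local time is UTC+8, inferred from the notebook's output
LOCAL_TZ = timezone(timedelta(hours=8))


def epoch_to_utc_and_local(epoch, local_tz=LOCAL_TZ):
    """Return (UTC string, local string) for an epoch value given in seconds."""
    utc_dt = datetime.fromtimestamp(epoch, tz=timezone.utc)
    local_dt = utc_dt.astimezone(local_tz)
    fmt = '%Y-%m-%d %H:%M:%S'
    return utc_dt.strftime(fmt), local_dt.strftime(fmt)


print(epoch_to_utc_and_local(1577836800))
# ('2020-01-01 00:00:00', '2020-01-01 08:00:00')
```

In pandas, a comparable result should be obtainable with `pd.to_datetime(..., unit='s', utc=True)` followed by `.dt.tz_convert(...)`, although that variant is not used in this notebook.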

In [10]:
# Let's take a look at the new dataframe shape and its first 5 rows
print(mst_df.shape)
mst_df.head()

(13137, 93)


Unnamed: 0,all_awardings,allow_live_comments,author,author_flair_css_class,author_flair_richtext,author_flair_text,author_flair_type,author_fullname,author_patreon_flair,author_premium,...,event_end,event_is_live,event_start,gilded,is_created_from_ads_ui,author_is_blocked,discussion_type,suggested_sort,datetime_utc,datetime_local
0,[],False,x12Mike,,[],,text,t2_8g3nd,False,False,...,,,,,,,,,2020-01-01 00:33:22,2020-01-01 08:33:22
1,[],False,Repent2019,,[],,text,t2_2xirsuwe,False,False,...,,,,,,,,,2020-01-01 19:45:40,2020-01-02 03:45:40
2,[],False,jdlnewborn,,[],,text,t2_mxiqs,False,False,...,,,,,,,,,2020-01-02 01:52:35,2020-01-02 09:52:35
3,[],False,jacoke3,,[],,text,t2_12aj9l,False,False,...,,,,,,,,,2020-01-02 16:04:37,2020-01-03 00:04:37
4,[],False,mikeprennie,,[],,text,t2_3fpkfnn0,False,False,...,,,,,,,,,2020-01-03 14:24:10,2020-01-03 22:24:10


In [11]:
# Let's take a look at the columns and missing values
mst_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13137 entries, 0 to 13136
Data columns (total 93 columns):
 #   Column                         Non-Null Count  Dtype         
---  ------                         --------------  -----         
 0   all_awardings                  13137 non-null  object        
 1   allow_live_comments            13137 non-null  bool          
 2   author                         13137 non-null  object        
 3   author_flair_css_class         1 non-null      object        
 4   author_flair_richtext          13028 non-null  object        
 5   author_flair_text              145 non-null    object        
 6   author_flair_type              13028 non-null  object        
 7   author_fullname                13028 non-null  object        
 8   author_patreon_flair           13028 non-null  object        
 9   author_premium                 13028 non-null  object        
 10  awarders                       13137 non-null  object        
 11  can_mod_post   

In [12]:
# Let's take a look at the columns we are interested in
mst_df[['id', 'author', 'subreddit',
         'selftext', 'title', 'datetime_utc', 'datetime_local']].head()

Unnamed: 0,id,author,subreddit,selftext,title,datetime_utc,datetime_local
0,eibfah,x12Mike,MicrosoftTeams,"Has anyone found a PPA for easier installation under Ubuntu? We're a 50/50 (Win/Lin) shop, and I've been working on adding Teams to our Ubuntu preseed. I can get things working with the MS gpg key from an existing system and just adding a copy of the teams.list file to `/etc/apt.sources.d`, but ideally, I'd rather do this smarter.\n\nAnyone potentially found a way to simplify Teams installat...",Linux PPA?,2020-01-01 00:33:22,2020-01-01 08:33:22
1,eimpat,Repent2019,MicrosoftTeams,"We use Teams at our university, and we have a research colloquium every term with a submission deadline. I've created a Team for the colloquium to house materials, send announcements, etc. but I'd **REALLY** love for all team members to be able to set reminders that the submission deadline is upcoming. From my feckless Googling, I gather that such an ability is not built in, but have any of yo...","Third party app, or other work-around, to send reminders?",2020-01-01 19:45:40,2020-01-02 03:45:40
2,eirf48,jdlnewborn,MicrosoftTeams,"I am in the midst of a move to the Teams/SharePoint system, so there is a lot of backend work happening. Therefore my users are getting welcome emails to certain teams, and my phone starts ringing before I am ready to roll this out for a department. Lastly, I’m making new channels, and moving files into those channels, so its generating emails along the lines of ‘you’ve deleted emails, you ...",Disable team welcome emails and ‘files deleted’ emails,2020-01-02 01:52:35,2020-01-02 09:52:35
3,ej01y1,jacoke3,MicrosoftTeams,,"Teams Adoption Flipbook - Any help on what tool they used to create this? Would love it to create flipbooks for our company. I know Issuu and Flipbook Pro, but are they the same as this?",2020-01-02 16:04:37,2020-01-03 00:04:37
4,ejg1kd,mikeprennie,MicrosoftTeams,Are there any Teams specific GPO files out there? I want to disable gpu acceleration globally in a persistent VDI environment.,Teams group policy files?,2020-01-03 14:24:10,2020-01-03 22:24:10


---

## Data Export
We will export the data for use in the analysis notebook.

In [13]:
# create the `../data` folder if it does not already exist
if not os.path.exists('../data'):
    os.makedirs('../data')
    
# Export the csv files into the `../data` folder
mst_df_original.to_csv('../data/mst_original.csv', index=False)
(mst_df[['id', 'author', 'subreddit',
         'selftext', 'title', 'datetime_utc', 'datetime_local']]).to_csv('../data/mst_df.csv', index=False)