This is an API practice notebook; code is taken from a [Medium post by Karan Bhanot](https://towardsdatascience.com/creating-a-dataset-using-an-api-with-python-dcc1607616d).

In [1]:
import numpy as np
import pandas as pd
import requests
import json

In [2]:
url = "https://wind-bow.glitch.me/twitch-api/channels/freecodecamp"
JSONContent = requests.get(url).json()
content = json.dumps(JSONContent, indent = 4, sort_keys=True)
print(content)

{
    "_id": 79776140,
    "_links": {
        "chat": "https://api.twitch.tv/kraken/chat/freecodecamp",
        "commercial": "https://api.twitch.tv/kraken/channels/freecodecamp/commercial",
        "editors": "https://api.twitch.tv/kraken/channels/freecodecamp/editors",
        "follows": "https://api.twitch.tv/kraken/channels/freecodecamp/follows",
        "self": "https://api.twitch.tv/kraken/channels/freecodecamp",
        "stream_key": "https://api.twitch.tv/kraken/channels/freecodecamp/stream_key",
        "subscriptions": "https://api.twitch.tv/kraken/channels/freecodecamp/subscriptions",
        "teams": "https://api.twitch.tv/kraken/channels/freecodecamp/teams",
        "videos": "https://api.twitch.tv/kraken/channels/freecodecamp/videos"
    },
    "background": null,
    "banner": null,
    "broadcaster_language": "en",
    "created_at": "2015-01-14T03:36:47Z",
    "delay": null,
    "display_name": "FreeCodeCamp",
    "followers": 10122,
    "game": "Creative",
    "langua

In [9]:
# List of channels we want to access
channels = ["ESL_SC2", "OgamingSC2", "cretetion", "freecodecamp", "storbeck", "habathcx", "RobotCaleb", "noobs2ninjas",
            "ninja", "shroud", "Dakotaz", "esltv_cs", "pokimane", "tsm_bjergsen", "boxbox", "wtcn", "a_seagull",
            "kinggothalion", "amazhs", "jahrein", "thenadeshot", "sivhd", "kingrichard"]  # each channel listed once

In [10]:
channels_list = []

In [11]:
# For each channel, we access its information through its API
for channel in channels:
    JSONContent = requests.get("https://wind-bow.glitch.me/twitch-api/channels/" + channel).json()
    if 'error' not in JSONContent:
        channels_list.append([JSONContent['_id'], JSONContent['display_name'], JSONContent['status'],
                             JSONContent['followers'], JSONContent['views']])
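
The loop above indexes each key directly, so a payload that lacks one of them would raise a `KeyError`, and a channel name listed twice would be fetched and stored twice. Below is a minimal sketch of two helpers that guard against both cases. The names `channel_row` and `unique_channels` are hypothetical (they are not from the Medium post), and the sample payload is a trimmed copy of the freecodecamp response shown earlier.

```python
def channel_row(payload):
    """Return [id, name, status, followers, views], or None for an error payload."""
    if "error" in payload:
        return None
    fields = ("_id", "display_name", "status", "followers", "views")
    # dict.get returns None for a missing key instead of raising KeyError
    return [payload.get(field) for field in fields]


def unique_channels(names):
    """Drop repeated channel names (case-insensitive), keeping first-seen order."""
    seen = set()
    result = []
    for name in names:
        key = name.lower()
        if key not in seen:
            seen.add(key)
            result.append(name)
    return result


# Trimmed sample payload; "status" is left out on purpose to show the None fallback
sample = {"_id": 79776140, "display_name": "FreeCodeCamp",
          "followers": 10122, "views": 163747}
print(channel_row(sample))
print(unique_channels(["ninja", "shroud", "Dakotaz", "shroud"]))
```

Rows with a `None` status are later removed by `dropna`, so nothing else in the notebook would need to change.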

In [19]:
dataset = pd.DataFrame(channels_list)
dataset.head(20)

|   | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| 0 | 30220059 | ESL_SC2 | RERUN: StarCraft 2 - Terminator vs. Parting (P... | 135394 | 60991791 |
| 1 | 71852806 | OgamingSC2 | UnderDogs - Rediffusion - Qualifier. | 40895 | 20694507 |
| 2 | 90401618 | cretetion | It's a Divison kind of Day | 908 | 11631 |
| 3 | 79776140 | FreeCodeCamp | Greg working on Electron-Vue boilerplate w/ Ak... | 10122 | 163747 |
| 4 | 86238744 | storbeck |  | 10 | 1019 |
| 5 | 6726509 | Habathcx | Massively Effective | 14 | 764 |
| 6 | 54925078 | RobotCaleb | Code wrangling | 20 | 4602 |
| 7 | 82534701 | noobs2ninjas | Building a new hackintosh for #programming and... | 835 | 48102 |


In [20]:
dataset.columns = ['Id', 'Name', 'Status', 'Followers', 'Views']
dataset.dropna(axis = 0, how = 'any', inplace = True)
dataset.index = pd.RangeIndex(len(dataset.index))
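
As an alternative sketch, the same cleanup can be written without `inplace=True`: pass the column names when the frame is built, then chain `dropna` and `reset_index`. The rows below are a small sample in the same shape as `channels_list`, with values copied from the output above. This assumes pandas is installed.

```python
import pandas as pd

# Sample rows shaped like channels_list: [id, name, status, followers, views]
rows = [
    [90401618, "cretetion", "It's a Divison kind of Day", 908, 11631],
    [86238744, "storbeck", None, 10, 1019],
    [6726509, "Habathcx", "Massively Effective", 14, 764],
    [54925078, "RobotCaleb", "Code wrangling", 20, 4602],
]
columns = ["Id", "Name", "Status", "Followers", "Views"]

dataset = pd.DataFrame(rows, columns=columns)
# dropna returns a new frame without rows containing missing values;
# reset_index(drop=True) renumbers the remaining rows from 0
dataset = dataset.dropna(axis=0, how="any").reset_index(drop=True)
print(dataset)
```

The storbeck row has no status, so it is dropped and the remaining rows are renumbered 0 to 2.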

In [21]:
dataset.head()

|   | Id | Name | Status | Followers | Views |
|---|---|---|---|---|---|
| 0 | 30220059 | ESL_SC2 | RERUN: StarCraft 2 - Terminator vs. Parting (P... | 135394 | 60991791 |
| 1 | 71852806 | OgamingSC2 | UnderDogs - Rediffusion - Qualifier. | 40895 | 20694507 |
| 2 | 90401618 | cretetion | It's a Divison kind of Day | 908 | 11631 |
| 3 | 79776140 | FreeCodeCamp | Greg working on Electron-Vue boilerplate w/ Ak... | 10122 | 163747 |
| 4 | 6726509 | Habathcx | Massively Effective | 14 | 764 |


In this example, we used a Twitch API. But what about a "bigger API", like pushshift.io, which provides data on Reddit? To learn how to use the pushshift API, I turned to a [second tutorial, by Jean-Christophe Chouinard](https://www.jcchouinard.com/how-to-use-reddit-api-with-python/).

In [42]:
import requests

In [43]:
query="seo" #Define Your Query
url = f"https://api.pushshift.io/reddit/search/comment/?q={query}"
request = requests.get(url)
json_response = request.json()
json_response

{'data': [{'all_awardings': [],
   'associated_award': None,
   'author': 'bazjoe',
   'author_flair_background_color': None,
   'author_flair_css_class': None,
   'author_flair_richtext': [],
   'author_flair_template_id': None,
   'author_flair_text': None,
   'author_flair_text_color': None,
   'author_flair_type': 'text',
   'author_fullname': 't2_dniyv',
   'author_patreon_flair': False,
   'author_premium': True,
   'awarders': [],
   'body': 'I had a SEO expert (who wasn’t looking at it with the desire to earn my business)  and pronto has done zero SEO.  Even the basics of every page needs a description.',
   'collapsed_because_crowd_control': None,
   'created_utc': 1589753176,
   'gildings': {},
   'id': 'fqyr53n',
   'is_submitter': False,
   'link_id': 't3_gl710a',
   'locked': False,
   'no_follow': True,
   'parent_id': 't1_fqvu70x',
   'permalink': '/r/msp/comments/gl710a/seeking_an_msp_marketing_company/fqyr53n/',
   'retrieved_on': 1589757280,
   'score': 1,
   'send_re
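
Each entry under `data` is one comment, stored as a plain dict. Below is a minimal sketch of pulling a few fields out of it and converting `created_utc` (epoch seconds) into a readable timestamp. The helper name `comment_summary` is hypothetical, and the sample is a trimmed copy of the first comment in the output above.

```python
from datetime import datetime, timezone


def comment_summary(comment):
    """Pick a few fields from one pushshift comment dict."""
    return {
        "author": comment.get("author"),
        "score": comment.get("score"),
        "permalink": comment.get("permalink"),
        # created_utc is epoch seconds; convert it to a timezone-aware datetime
        "created": datetime.fromtimestamp(comment["created_utc"], tz=timezone.utc),
    }


# Trimmed copy of the first comment in the response above
sample_response = {
    "data": [
        {
            "author": "bazjoe",
            "created_utc": 1589753176,
            "score": 1,
            "permalink": "/r/msp/comments/gl710a/seeking_an_msp_marketing_company/fqyr53n/",
        }
    ]
}

summaries = [comment_summary(comment) for comment in sample_response["data"]]
print(summaries[0]["author"], summaries[0]["created"].isoformat())
```

With the real response, `json_response["data"]` can be passed through the same list comprehension.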

In [44]:
def get_pushshift_data(data_type, **kwargs):
    """
    Gets data from the pushshift api.
 
    data_type can be 'comment' or 'submission'
    The rest of the args are interpreted as payload.
 
    Read more: https://github.com/pushshift/api
    """
 
    base_url = f"https://api.pushshift.io/reddit/search/{data_type}/"
    payload = kwargs
    request = requests.get(base_url, params=payload)
    return request.json()
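
To see roughly what `params=payload` puts on the wire without sending a request, the standard library's `urlencode` gives a close approximation of the query string that `requests` builds. This is a sketch only: `build_pushshift_url` is a hypothetical helper, not part of the tutorial or of `requests`.

```python
from urllib.parse import urlencode


def build_pushshift_url(data_type, **kwargs):
    """Return the URL that a GET with these query parameters would target."""
    base_url = f"https://api.pushshift.io/reddit/search/{data_type}/"
    # requests leaves out parameters whose value is None, so do the same here
    payload = {key: value for key, value in kwargs.items() if value is not None}
    return f"{base_url}?{urlencode(payload)}"


print(build_pushshift_url("comment", q="python", after="30d", size=1000))
# https://api.pushshift.io/reddit/search/comment/?q=python&after=30d&size=1000
```

Printing the URL this way is a quick check that a keyword argument is spelled the way the API expects before spending a request on it.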

In [45]:
data_type="comment"     # search comments; use "submission" to search posts instead
query="python"          # Add your query
duration="30d"          # Select the timeframe. Epoch value or integer + "s", "m", "h", or "d" (second, minute, hour, day)
size=1000               # maximum 1000 comments
sort_type="score"       # Sort by score (Accepted: "score", "num_comments", "created_utc")
sort="desc"             # sort descending
aggs="subreddit"        # Field to aggregate on: "author", "link_id", "created_utc", "subreddit"

In [47]:
get_pushshift_data(data_type=data_type,     
                   q=query,                 
                   after=duration,          
                   size=size,               
                   sort_type=sort_type,
                   sort=sort)

{'data': [{'all_awardings': [],
   'associated_award': None,
   'author': 'rawbamatic',
   'author_flair_background_color': None,
   'author_flair_css_class': None,
   'author_flair_richtext': [],
   'author_flair_template_id': None,
   'author_flair_text': None,
   'author_flair_text_color': None,
   'author_flair_type': 'text',
   'author_fullname': 't2_5rdvt',
   'author_patreon_flair': False,
   'author_premium': True,
   'awarders': [],
   'body': 'Something right out of Monty Python.',
   'collapsed_because_crowd_control': None,
   'created_utc': 1589081510,
   'gildings': {},
   'id': 'fq4q0a8',
   'is_submitter': False,
   'link_id': 't3_ggqw9w',
   'locked': False,
   'no_follow': False,
   'parent_id': 't1_fq4pclc',
   'permalink': '/r/AskReddit/comments/ggqw9w/what_is_the_greatest_fuck_it_ill_do_it_myself_in/fq4q0a8/',
   'retrieved_on': 1589095305,
   'score': 588,
   'send_replies': True,
   'stickied': False,
   'subreddit': 'AskReddit',
   'subreddit_id': 't5_2qh1i',
   

In [53]:
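# requests.get("aggs") raises MissingSchema because "aggs" is a key in the response, not a URL.
# Request the aggregation through the helper, then read it from the returned dict.
response = get_pushshift_data(data_type=data_type, q=query, after=duration, size=size, aggs=aggs)
data = response.get("aggs", {}).get(aggs)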

## Scraping Practice

How do you scrape tables from PDFs? I followed along with a [Medium article](https://medium.com/better-programming/convert-tables-from-pdfs-to-pandas-with-python-d74f8ac31dc2) that uses tabula-py to do this.

In [1]:
import pandas as pd
import tabula

Let's test tabula out on the NYC_marathon_winners_wikipedia.pdf file in this folder.

In [2]:
file_path = "./NYC_marathon_winners_wikipedia.pdf"

In [4]:
df = tabula.read_pdf(file_path)

'pages' argument isn't specified.Will extract only from page 1 by default.


JavaNotFoundError: `java` command is not found from this Python process.Please ensure Java is installed and PATH is set for `java`
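
The error above means the `java` executable is not on the `PATH` seen by the Python process. A quick way to confirm that before calling tabula is to look the command up with the standard library. This is a sketch; `java_available` is a hypothetical helper, not part of tabula-py.

```python
import shutil


def java_available():
    """Return True if a `java` executable can be found on PATH."""
    return shutil.which("java") is not None


if java_available():
    print("java found at", shutil.which("java"))
else:
    print("java not found on PATH; tabula-py will raise JavaNotFoundError")
```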

I really despise that tabula-py has a Java 8 dependency. Let's try a different approach, namely camelot-py.