<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px"> 

# Data Science on the Net!
---

## Learning Objectives
* **Explain** how HTTP works
* **Make** HTTP requests from Python
* **Read** API documentation and get the data
* **Scrape** a website using BeautifulSoup



![](http://i.imgur.com/zjwHc.jpg)

## Part 1: HTML = The Language of the Web
HTML is a language that describes the "nouns" of the internet. HTML objects come in _tags_ that look like this:

```html
<tag>Contents of the tag</tag>
```

Let's go to a website (any website!) and open the object viewer (your browser's developer tools) by pressing **Ctrl-Shift-I** (**Cmd-Option-I** on macOS).
Maybe:
* www.washingtonpost.com
* www.example.com
* www.realpython.com

HTML tags can have **classes** and **ids**, too.
* **Classes** are non-unique descriptors that you can use to group HTML tags by a shared quality. For example, you might define a "foreground" class and then use CSS to color every "foreground" element blue.
* **ids** are _unique_ descriptors. They work the same way as classes, except that only one tag on a page should carry a given id.

For example:
```html
<p class="speech" id="gettysburg">Four score and seven years ago our fathers brought forth on this continent, a new nation, conceived in Liberty, and dedicated to the proposition that all men are created equal.</p>
```
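
To make the tag, class, and id vocabulary concrete, here is a minimal sketch that pulls those pieces out of a shortened version of the tag above. It uses `xml.etree.ElementTree` from the standard library, which works here only because this snippet happens to be well-formed XML. Real-world HTML usually needs a proper HTML parser such as Beautiful Soup (Part 4). The variable names are illustrative.

```python
import xml.etree.ElementTree as ET

# A shortened version of the example tag above.
snippet = '<p class="speech" id="gettysburg">Four score and seven years ago...</p>'

element = ET.fromstring(snippet)

print(element.tag)              # the tag name: p
print(element.attrib["class"])  # the class: speech
print(element.attrib["id"])     # the id: gettysburg
print(element.text)             # the contents of the tag
```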

## Part 2: The HTTP Protocol
HTTP is the **hypertext transfer protocol**. It's just a system of rules for throwing data around the internet.

> Aside: There are other protocols you might have heard of. For example: FTP (File Transfer Protocol), SFTP (SSH File Transfer Protocol), and IP (Internet Protocol).

You interact with a server over HTTP through different kinds of **requests**. There are several kinds, but two are used most widely: **GET** and **POST**.
* The GET request is for getting data.
    - What Tweets are in my news feed?
* The POST request is for telling a server what data _you_ have, often causing the server to act upon it.
    - Post this Tweet espousing my controversial political opinions!
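
The difference between the two request types can be sketched with the standard library's `urllib.request.Request`, which builds a request object without sending anything over the network. The URL and the form field below are hypothetical, chosen only for illustration.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical endpoint, used only for illustration; nothing is sent.
url = "https://example.com/tweets"

# A GET request carries no body: it only asks the server for data.
get_request = Request(url)

# A POST request carries data for the server to act on.
body = urlencode({"text": "hello world"}).encode("utf-8")
post_request = Request(url, data=body, method="POST")

print(get_request.get_method())   # GET
print(post_request.get_method())  # POST
print(post_request.data)          # b'text=hello+world'
```

In Part 3 the `requests` library plays the same role with `requests.get(...)` and `requests.post(...)`.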

Every HTTP response comes with a **response code (response status)**. You've probably heard of 404 (page not found). But there are many more: 200 (everything good), the 300s (redirection, such as 301), 401 (unauthorized).
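
If you want to look up what a code means from Python, the standard library's `http.HTTPStatus` enum knows the official phrase for each one. The sketch below prints a few codes along with their general family. The `FAMILIES` dictionary is our own illustrative helper, not part of the library.

```python
from http import HTTPStatus

# The first digit of a status code tells you its general family.
FAMILIES = {
    1: "informational",
    2: "success",
    3: "redirection",
    4: "client error",
    5: "server error",
}

for code in (200, 301, 401, 404):
    status = HTTPStatus(code)
    print(code, status.phrase, "->", FAMILIES[code // 100])

# 200 OK -> success
# 301 Moved Permanently -> redirection
# 401 Unauthorized -> client error
# 404 Not Found -> client error
```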
    
Let's go to any of those previous websites and open the "Network" tab of the object viewer. Refresh the page. Every line you see is one HTTP request, shown with its response code.

## Part 3: The `requests` Library & APIs
The `requests` library lets you submit HTTP requests from Python. Despite its frequent use, it's not part of the Python standard library, so you'll need to `pip install requests` yourself.
![](assets/pokeapi.png)

In [1]:
import requests

In [2]:
# Create url for API call.
base_url = 'https://pokeapi.co/api/v2/'
get_pokemon_endpoint = 'pokemon/'

In [3]:
# Make request
pokemon_req = requests.get(base_url + get_pokemon_endpoint + 'snorlax')

In [4]:
# Response status code
pokemon_req

<Response [200]>

In [5]:
# Text of the response
pokemon_req.text[:500]

'{"abilities":[{"ability":{"name":"immunity","url":"https://pokeapi.co/api/v2/ability/17/"},"is_hidden":false,"slot":1},{"ability":{"name":"thick-fat","url":"https://pokeapi.co/api/v2/ability/47/"},"is_hidden":false,"slot":2},{"ability":{"name":"gluttony","url":"https://pokeapi.co/api/v2/ability/82/"},"is_hidden":true,"slot":3}],"base_experience":189,"forms":[{"name":"snorlax","url":"https://pokeapi.co/api/v2/pokemon-form/143/"}],"game_indices":[{"game_index":132,"version":{"name":"red","url":"ht'

In [6]:
# Bring in the JSON!
snorlax = pokemon_req.json()

In [7]:
# Since we've converted the JSON -> dict, we know how to work with this!
snorlax.keys()

dict_keys(['abilities', 'base_experience', 'forms', 'game_indices', 'height', 'held_items', 'id', 'is_default', 'location_area_encounters', 'moves', 'name', 'order', 'species', 'sprites', 'stats', 'types', 'weight'])
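
What `.json()` does is parse the JSON text of the response into ordinary Python objects. You can try the same idea offline with the standard library's `json.loads`. The string below is a hand-trimmed piece of the response text shown earlier, not the full response.

```python
import json

# A shortened, hand-trimmed piece of the response text shown above.
text = (
    '{"abilities":[{"ability":{"name":"immunity",'
    '"url":"https://pokeapi.co/api/v2/ability/17/"},'
    '"is_hidden":false,"slot":1}],"base_experience":189}'
)

data = json.loads(text)

print(type(data))                               # <class 'dict'>
print(data["base_experience"])                  # 189
print(data["abilities"][0]["ability"]["name"])  # immunity
print(data["abilities"][0]["is_hidden"])        # False (JSON false becomes Python False)
```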

In [8]:
# Height, Weight
height = snorlax['height']
weight = snorlax['weight']

height, weight

(21, 4600)

In [9]:
# Sprites?
snorlax['sprites']['front_default']

'https://raw.githubusercontent.com/PokeAPI/sprites/master/sprites/pokemon/143.png'

In [10]:
# What moves can squirtle learn?
pokemon_req = requests.get(base_url + get_pokemon_endpoint + 'squirtle')
squirtle = pokemon_req.json()

[move['move']['name'] for move in squirtle['moves']]

['mega-punch',
 'ice-punch',
 'mega-kick',
 'headbutt',
 'tackle',
 'body-slam',
 'take-down',
 'double-edge',
 'tail-whip',
 'bite',
 'mist',
 'water-gun',
 'hydro-pump',
 'surf',
 'ice-beam',
 'blizzard',
 'bubble-beam',
 'submission',
 'counter',
 'seismic-toss',
 'strength',
 'dig',
 'toxic',
 'confusion',
 'rage',
 'mimic',
 'double-team',
 'withdraw',
 'defense-curl',
 'haze',
 'reflect',
 'bide',
 'waterfall',
 'skull-bash',
 'bubble',
 'rest',
 'substitute',
 'snore',
 'curse',
 'flail',
 'protect',
 'mud-slap',
 'foresight',
 'icy-wind',
 'endure',
 'rollout',
 'swagger',
 'attract',
 'sleep-talk',
 'return',
 'frustration',
 'dynamic-punch',
 'rapid-spin',
 'iron-tail',
 'hidden-power',
 'rain-dance',
 'mirror-coat',
 'rock-smash',
 'whirlpool',
 'fake-out',
 'hail',
 'facade',
 'focus-punch',
 'brick-break',
 'yawn',
 'refresh',
 'secret-power',
 'dive',
 'mud-sport',
 'rock-tomb',
 'water-spout',
 'muddy-water',
 'iron-defense',
 'water-pulse',
 'gyro-ball',
 'brine',
 'nat

In [14]:
# Whoa! Let's build a function to extract a pokemon's possible moves

def get_moves(pokemon_name):
    url = base_url + get_pokemon_endpoint + pokemon_name
    pokemon_req = requests.get(url)
    pokemon = pokemon_req.json()

    return [move['move']['name'] for move in pokemon['moves']]

In [15]:
get_moves('charmander')

['mega-punch',
 'fire-punch',
 'thunder-punch',
 'scratch',
 'swords-dance',
 'cut',
 'mega-kick',
 'headbutt',
 'body-slam',
 'take-down',
 'double-edge',
 'leer',
 'bite',
 'growl',
 'ember',
 'flamethrower',
 'submission',
 'counter',
 'seismic-toss',
 'strength',
 'dragon-rage',
 'fire-spin',
 'dig',
 'toxic',
 'rage',
 'mimic',
 'double-team',
 'smokescreen',
 'defense-curl',
 'reflect',
 'bide',
 'fire-blast',
 'swift',
 'skull-bash',
 'rest',
 'rock-slide',
 'slash',
 'substitute',
 'snore',
 'curse',
 'protect',
 'scary-face',
 'belly-drum',
 'mud-slap',
 'outrage',
 'endure',
 'swagger',
 'fury-cutter',
 'attract',
 'sleep-talk',
 'return',
 'frustration',
 'dynamic-punch',
 'dragon-breath',
 'iron-tail',
 'metal-claw',
 'hidden-power',
 'sunny-day',
 'crunch',
 'ancient-power',
 'rock-smash',
 'beat-up',
 'heat-wave',
 'will-o-wisp',
 'facade',
 'focus-punch',
 'brick-break',
 'secret-power',
 'air-cutter',
 'overheat',
 'rock-tomb',
 'aerial-ace',
 'dragon-claw',
 'dragon-da
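
One possible design choice, sketched below, is to separate the parsing step from the HTTP call so the parsing can be tried without a network connection. The helper name `extract_moves` and the sample dictionary are hypothetical. The sample only imitates the shape of the API response.

```python
def extract_moves(pokemon):
    """Return the move names from an already-parsed Pokémon dictionary."""
    return [move["move"]["name"] for move in pokemon["moves"]]


# Hand-written sample with the same shape as the API response.
sample = {
    "name": "squirtle",
    "moves": [
        {"move": {"name": "tackle"}},
        {"move": {"name": "water-gun"}},
    ],
}

print(extract_moves(sample))  # ['tackle', 'water-gun']
```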

In [20]:
url = 'https://pokeapi.co/api/v2/pokemon'
params = {'offset': 0, 'limit': 300}

pokemon_urls = {}

while url:
    poke_req = requests.get(url, params=params)
    pokemon = poke_req.json()
    url = pokemon['next']  # URL of the next page, or None on the last page
    params = None          # the 'next' URL already includes offset and limit

    for poke in pokemon['results']:
        pokemon_urls[poke['name']] = poke['url']
    print(url)

pokemon_urls

https://pokeapi.co/api/v2/pokemon?offset=300&limit=300
https://pokeapi.co/api/v2/pokemon?offset=600&limit=300
https://pokeapi.co/api/v2/pokemon?offset=900&limit=218
None


{'bulbasaur': 'https://pokeapi.co/api/v2/pokemon/1/',
 'ivysaur': 'https://pokeapi.co/api/v2/pokemon/2/',
 'venusaur': 'https://pokeapi.co/api/v2/pokemon/3/',
 'charmander': 'https://pokeapi.co/api/v2/pokemon/4/',
 'charmeleon': 'https://pokeapi.co/api/v2/pokemon/5/',
 'charizard': 'https://pokeapi.co/api/v2/pokemon/6/',
 'squirtle': 'https://pokeapi.co/api/v2/pokemon/7/',
 'wartortle': 'https://pokeapi.co/api/v2/pokemon/8/',
 'blastoise': 'https://pokeapi.co/api/v2/pokemon/9/',
 'caterpie': 'https://pokeapi.co/api/v2/pokemon/10/',
 'metapod': 'https://pokeapi.co/api/v2/pokemon/11/',
 'butterfree': 'https://pokeapi.co/api/v2/pokemon/12/',
 'weedle': 'https://pokeapi.co/api/v2/pokemon/13/',
 'kakuna': 'https://pokeapi.co/api/v2/pokemon/14/',
 'beedrill': 'https://pokeapi.co/api/v2/pokemon/15/',
 'pidgey': 'https://pokeapi.co/api/v2/pokemon/16/',
 'pidgeotto': 'https://pokeapi.co/api/v2/pokemon/17/',
 'pidgeot': 'https://pokeapi.co/api/v2/pokemon/18/',
 'rattata': 'https://pokeapi.co/api
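
The loop above is an instance of a common pagination pattern: keep following the `next` link until the API returns `None`. Here is a minimal offline sketch of that pattern. `FAKE_PAGES`, `fetch_page`, and `collect_all` are hypothetical stand-ins, with `fetch_page` playing the role of `requests.get(url).json()`.

```python
# Hypothetical in-memory "API": each URL maps to one page of results.
FAKE_PAGES = {
    "page-1": {
        "results": [{"name": "bulbasaur", "url": "pokemon/1/"}],
        "next": "page-2",
    },
    "page-2": {
        "results": [{"name": "ivysaur", "url": "pokemon/2/"}],
        "next": None,
    },
}


def fetch_page(url):
    """Stand-in for requests.get(url).json(); returns one page as a dict."""
    return FAKE_PAGES[url]


def collect_all(first_url):
    """Follow 'next' links until the API reports there are no more pages."""
    collected = {}
    url = first_url
    while url:
        page = fetch_page(url)
        for item in page["results"]:
            collected[item["name"]] = item["url"]
        url = page["next"]  # None on the last page, which ends the loop
    return collected


print(collect_all("page-1"))
# {'bulbasaur': 'pokemon/1/', 'ivysaur': 'pokemon/2/'}
```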

## Ok, let's try a more complicated API - for stocks!
![](assets/alpha-vantage.png)
If you haven't already - grab your free API key for Alpha Vantage [here](https://www.alphavantage.co). It takes five seconds.

**(THREAD): Why do you think companies would require the use of an API key?**

Alpha Vantage has documentation [here](https://www.alphavantage.co/documentation/).

In [67]:
# Most APIs have a single base URL from which API calls are made.
# If you look closely at the examples, this is Alpha Vantage's.

intraday_function = 'TIME_SERIES_INTRADAY'

base_url = "https://www.alphavantage.co/"
query_endpoint = "query"
api_key = "YOUR_KEY"
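
Rather than pasting a real key into a notebook, one common option is to read it from an environment variable. The sketch below assumes a variable named `ALPHAVANTAGE_API_KEY`. That name and the `load_api_key` helper are our own inventions, not something Alpha Vantage prescribes.

```python
import os


def load_api_key(env_var="ALPHAVANTAGE_API_KEY", default="demo"):
    """Read the API key from an environment variable, falling back to a default."""
    return os.environ.get(env_var, default)


# Uses your own key if the (hypothetical) variable is set, otherwise 'demo'.
api_key = load_api_key()
```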

In [62]:
# Let's build out this request.
# Passing the query parameters as a dictionary is a very common pattern for API requests.
params = {
    'function': intraday_function,
    'symbol': 'IBM',
    'interval': '5min',
    'apikey': 'demo'  # the demo key used in the documentation examples; swap in api_key to use your own
}

# Let's grab that data!
r = requests.get(base_url + query_endpoint, params=params)
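
Behind the scenes, `requests` encodes the `params` dictionary into the query string of the URL (you can inspect the result with `r.url`). The standard library's `urlencode` gives a rough picture of that step. This is a sketch of the idea, not a description of how `requests` is implemented internally.

```python
from urllib.parse import urlencode

params = {
    "function": "TIME_SERIES_INTRADAY",
    "symbol": "IBM",
    "interval": "5min",
    "apikey": "demo",
}

# Everything after the '?' is the query string built from the dictionary.
full_url = "https://www.alphavantage.co/query" + "?" + urlencode(params)
print(full_url)
# https://www.alphavantage.co/query?function=TIME_SERIES_INTRADAY&symbol=IBM&interval=5min&apikey=demo
```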

In [66]:
ibm = r.json()
ibm['Meta Data']

{'1. Information': 'Intraday (5min) open, high, low, close prices and volume',
 '2. Symbol': 'IBM',
 '3. Last Refreshed': '2021-02-12 19:50:00',
 '4. Interval': '5min',
 '5. Output Size': 'Compact',
 '6. Time Zone': 'US/Eastern'}

### Challenge
Write your own function that takes a ticker symbol and returns the response shown above.

In [69]:
def get_intraday(stock_symbol, apikey='demo'):
    params = {
        'function': intraday_function,
        'symbol': stock_symbol,
        'interval': '5min',
        'apikey': apikey
    }

    # Let's grab that data!
    r = requests.get(base_url + query_endpoint, params=params)
    
    return r.json()

ibm = get_intraday('IBM', apikey='demo')
ibm['Meta Data']

{'1. Information': 'Intraday (5min) open, high, low, close prices and volume',
 '2. Symbol': 'IBM',
 '3. Last Refreshed': '2021-02-12 19:50:00',
 '4. Interval': '5min',
 '5. Output Size': 'Compact',
 '6. Time Zone': 'US/Eastern'}

### Did this feel like a lot of work? You're not alone.
For web APIs such as these, open sourcerers (ordinary programmers like you and me!) like to build language-specific **API wrappers** that make calling the API easier. Interestingly, based on our very vague definition of APIs, API wrappers are themselves APIs!

Alpha Vantage has a Python API wrapper made by user `RomelTorres` [here](https://github.com/RomelTorres/alpha_vantage)!

![](assets/opensource.jpg)

## You want data? You got data.

### Key Takeaway #1: Your favorite thing has a free API
* **Stock prices**: [Alpha Vantage](https://github.com/RomelTorres/alpha_vantage)
* **Cryptocurrency prices**: [ccxt](https://github.com/ccxt/ccxt) provides a unified API for several cryptocurrency markets. You can even buy and sell crypto from within Python!
* **Weather**: [OpenWeather](https://openweathermap.org/api)

### Key Takeaway #2: Your favorite website has a free API
Below is a brief list of websites that have a free API. Note that "free" here means "zero-cost", not "permissive and easy to use." APIs can be abused. Not all Twitter bots are friendly like [Every Sheriff Bot](https://twitter.com/EverySheriff).
* Twitter
* Reddit
* Yelp
* Twitch
* Facebook/Instagram
* GitHub (yes, even GitHub!)
* Most Google services
* Spotify
* Slack (no, you can't have a key.)

## Part 4: Web Scraping with Beautiful Soup!
![](https://static.datasciencedojo.com/wp-content/uploads/PythonBeautifulSoup-04-495x400.png)

The library `bs4` (Beautiful Soup) allows you to take the raw HTML from a `requests` response and pick out the parts you need!

In [70]:
from bs4 import BeautifulSoup

In [71]:
url = "https://example.com"
req = requests.get(url)
soup = BeautifulSoup(req.text, 'html.parser')  # Parse the HTML into a BeautifulSoup object

In [72]:
soup

<!DOCTYPE html>
<html>
<head>
<title>Example Domain</title>
<meta charset="utf-8"/>
<meta content="text/html; charset=utf-8" http-equiv="Content-type"/>
<meta content="width=device-width, initial-scale=1" name="viewport"/>
<style type="text/css">
    body {
        background-color: #f0f0f2;
        margin: 0;
        padding: 0;
        font-family: -apple-system, system-ui, BlinkMacSystemFont, "Segoe UI", "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif;
        
    }
    div {
        width: 600px;
        margin: 5em auto;
        padding: 2em;
        background-color: #fdfdff;
        border-radius: 0.5em;
        box-shadow: 2px 3px 7px 2px rgba(0,0,0,0.02);
    }
    a:link, a:visited {
        color: #38488f;
        text-decoration: none;
    }
    @media (max-width: 700px) {
        div {
            margin: 0 auto;
            width: auto;
        }
    }
    </style>
</head>
<body>
<div>
<h1>Example Domain</h1>
<p>This domain is for use in illustrative examples

In [75]:
# Get a list of all h1 tags
h1_tags = soup.select('h1')
h1_tags

[<h1>Example Domain</h1>]

In [76]:
# Get the text part of the first h1 tag
h1_tags[0].text

'Example Domain'

In [78]:
# Challenge: Can you print the text of all <p> tags?

for ptag in soup.select('p'):
    print(ptag.text)

This domain is for use in illustrative examples in documents. You may use this
    domain in literature without prior coordination or asking for permission.
More information...
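
To get a feel for what a call like `soup.select('p')` saves you from writing, here is a rough standard-library stand-in built on `html.parser.HTMLParser`. It is not how Beautiful Soup works internally, and it ignores edge cases such as nested or unclosed `<p>` tags. The class name and the sample HTML are made up for illustration.

```python
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Collect the text inside every <p> tag (edge cases are not handled)."""

    def __init__(self):
        super().__init__()
        self.in_paragraph = False
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.in_paragraph = True
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_paragraph = False

    def handle_data(self, data):
        # Text inside nested tags (like <a>) still belongs to the open <p>.
        if self.in_paragraph:
            self.paragraphs[-1] += data


html = (
    "<div><h1>Example Domain</h1>"
    "<p>First paragraph.</p>"
    "<p>Second <a href='#'>linked</a> paragraph.</p></div>"
)

collector = ParagraphCollector()
collector.feed(html)
collector.close()
print(collector.paragraphs)
# ['First paragraph.', 'Second linked paragraph.']
```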


### Let's use BeautifulSoup to actually scrape a website!
Let's scrape [Real Python](https://realpython.com), a (very good) blog site for Python users, aimed mostly at beginners.

In [82]:
url = "https://realpython.com"
req = requests.get(url)

In [83]:
req.status_code

200

In [85]:
soup = BeautifulSoup(req.text, 'html.parser')

In [86]:
# Let's scrape a list of articles from the front page!
h2s = soup.select('h2.card-title')
h2s

[<h2 class="card-title h2 my-0 py-0">Creating PyQt Layouts for GUI Applications</h2>,
 <h2 class="card-title h4 my-0 py-0">Pandas Sort: Your Guide to Sorting Data in Python</h2>,
 <h2 class="card-title h4 my-0 py-0">Python Microservices With gRPC</h2>,
 <h2 class="card-title h4 my-0 py-0">Python Modulo: Using the % Operator</h2>,
 <h2 class="card-title h4 my-0 py-0">Python Inner Functions: What Are They Good For?</h2>,
 <h2 class="card-title h4 my-0 py-0">Qt Designer and Python: Build Your GUI Applications Faster</h2>,
 <h2 class="card-title h4 my-0 py-0">Plot With Pandas: Python Data Visualization Basics</h2>,
 <h2 class="card-title h4 my-0 py-0">Python Web Applications: Deploy Your Script as a Flask App</h2>,
 <h2 class="card-title h4 my-0 py-0">Stochastic Gradient Descent Algorithm With Python and NumPy</h2>,
 <h2 class="card-title h4 my-0 py-0">Evaluate Expressions Dynamically With Python eval()</h2>,
 <h2 class="card-title h4 my-0 py-0">How to Use Python: Your First Steps</h2>,
 <

In [91]:
titles = [h2.get_text() for h2 in h2s]
titles

['Creating PyQt Layouts for GUI Applications',
 'Pandas Sort: Your Guide to Sorting Data in Python',
 'Python Microservices With gRPC',
 'Python Modulo: Using the % Operator',
 'Python Inner Functions: What Are They Good For?',
 'Qt Designer and Python: Build Your GUI Applications Faster',
 'Plot With Pandas: Python Data Visualization Basics',
 'Python Web Applications: Deploy Your Script as a Flask App',
 'Stochastic Gradient Descent Algorithm With Python and NumPy',
 'Evaluate Expressions Dynamically With Python eval()',
 'How to Use Python: Your First Steps',
 'C for Python Programmers',
 'Introduction to Sorting Algorithms in Python']

In [90]:
link = h2s[5].find_parent()
link

<a href="/qt-designer-python/">
<h2 class="card-title h4 my-0 py-0">Qt Designer and Python: Build Your GUI Applications Faster</h2>
</a>

In [89]:
link.attrs['href']

'/qt-designer-python/'

In [92]:
links = ["https://realpython.com" + h2.find_parent().attrs['href'] for h2 in h2s]
links

['https://realpython.com/courses/creating-pyqt-layouts-gui-applications/',
 'https://realpython.com/pandas-sort-python/',
 'https://realpython.com/python-microservices-grpc/',
 'https://realpython.com/courses/python-modulo-operator/',
 'https://realpython.com/inner-functions-what-are-they-good-for/',
 'https://realpython.com/qt-designer-python/',
 'https://realpython.com/courses/plot-pandas-data-visualization/',
 'https://realpython.com/python-web-applications/',
 'https://realpython.com/gradient-descent-algorithm-python/',
 'https://realpython.com/courses/evaluate-expressions-dynamically-python-eval/',
 'https://realpython.com/python-first-steps/',
 'https://realpython.com/c-for-python-programmers/',
 'https://realpython.com/courses/intro-sorting-algorithms/']
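
Gluing the base URL onto each `href` with `+` works here because every link on this page is relative and starts with `/`. A slightly more defensive option, sketched below, is the standard library's `urljoin`, which also copes with links that are already absolute. The second URL is a made-up example.

```python
from urllib.parse import urljoin

base = "https://realpython.com"

# A relative href is resolved against the base URL.
print(urljoin(base, "/qt-designer-python/"))
# https://realpython.com/qt-designer-python/

# An href that is already absolute is returned unchanged.
print(urljoin(base, "https://example.com/elsewhere/"))
# https://example.com/elsewhere/
```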

In [93]:
article_req = requests.get(links[5])
article_soup = BeautifulSoup(article_req.text, 'html.parser')
article_soup.select('p')[1].get_text()

'\nby Leodanis Pozo Ramos\n Feb 03, 2021\n\n\ngui\nintermediate\n\n\n\nTweet\nShare\nEmail\n\n\n\n'

In [94]:
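def get_article_contents(link):
    """Return the text of the third <p> tag on the page, or "" if there is none."""
    article_req = requests.get(link)
    article_soup = BeautifulSoup(article_req.text, 'html.parser')
    try:
        return article_soup.select('p')[2].get_text()
    except IndexError:
        # The page has fewer than three <p> tags.
        return ""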

In [95]:
contents = [get_article_contents(link) for link in links[:5]]
contents

['In this course, you’ll learn:',
 'Table of Contents',
 'Table of Contents',
 'In this course, you’ll learn:',
 'Table of Contents']

## Conclusion & Summary
Today, we:
* Learned how HTTP works
* Made HTTP requests from Python
* Read API documentation and got the data we want
* Scraped a website using BeautifulSoup