# How The World Wide Web Works

---

## HTTP

Hyper Text Transfer Protocol. 

HTTP is a protocol, which is to say a set of rules and formats, that allows one computer to get content from another computer. 

---

## Client and Server

HTTP is asymmetric. One computer requests the content, while the other computer serves the content. 

The computer requesting the content is the "client."

The computer serving the content is the "server."

You can think of HTTP as similar to the transaction that happens in a restaurant. Information needs to flow both ways between client and server, but the two parties have very different roles in the transaction. Thus it is an asymmetric operation.

---

## Visiting a web page

When you visit twitter.com: 

* Your browser (e.g. Firefox) is the client.
* The server is some custom-built software running in Twitter's cloud infrastructure somewhere.

Note that the client and the server are just software. They usually run on separate physical hardware, but they don't need to. 

---

## Request/Response

* The format that the client uses to send information to the server is called a "request".
* The format that the server uses to send information to the client is called a "response".

When you visit twitter.com, your browser (the client) makes an HTTP request, and Twitter's server returns an HTTP response.

---

## HTTP Methods

There are several types of requests the client can make. The type of the request is called the "method." For now we will focus on one method: 

GET

GET requests are the most common type of request when browsing the internet. It's a way for your client to say "give me some content."

---

## Requests 

GET requests consist of: 

* URL
* Additional Metadata

Responses consist of: 

* Body (content)
* Status Code
* Additional Metadata
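
To make these pieces concrete, here is a minimal sketch that runs a tiny server and a client on the same machine, using only the Python standard library. The handler class `HelloHandler`, the function `demo_request`, and the path `/hello` are made up for this illustration.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class HelloHandler(BaseHTTPRequestHandler):
    """Hypothetical handler: answers every GET with a short plain-text body."""

    def do_GET(self):
        body = f"You asked for {self.path}".encode("utf-8")
        self.send_response(200)  # the status code
        self.send_header("Content-Type", "text/plain; charset=utf-8")  # metadata
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)  # the body (content)

    def log_message(self, format, *args):
        pass  # keep the example quiet


def demo_request(path="/hello"):
    # Port 0 asks the operating system for any free port.
    server = HTTPServer(("127.0.0.1", 0), HelloHandler)
    port = server.server_address[1]
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        conn = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
        conn.request("GET", path)  # the request: a method plus a path
        res = conn.getresponse()   # the response
        result = (res.status, res.getheader("Content-Type"), res.read().decode("utf-8"))
        conn.close()
        return result
    finally:
        server.shutdown()
        server.server_close()


status, content_type, body = demo_request()
print(status, content_type, body)
```

The client and the server here are both just Python code on one computer, which is enough to see a status code, a piece of metadata, and a body come back from a GET request.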

---

## URLs

A URL (Uniform Resource Locator) is a type of URI (Uniform Resource Identifier).

A URL is meant to be a unique identifier for a "resource", or a piece of content, that can be accessed via the world wide web. 

A URL consists of: 
  

| protocol | [subdomain.] | domain | [:port] | [/path] | [?query] |
|----------|--------------|--------|---------|---------|----------|
| http://  | blog.        | science.com | :443 | /foo  | ?bar=baz |
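
As a quick illustration, Python's standard library can split the example URL into these same parts. This is a small sketch using `urllib.parse.urlsplit`; note that it reports the subdomain and domain together as the hostname.

```python
from urllib.parse import urlsplit

parts = urlsplit("http://blog.science.com:443/foo?bar=baz")

print(parts.scheme)    # the protocol
print(parts.hostname)  # the subdomain and domain together
print(parts.port)      # the port, as an integer
print(parts.path)      # the path
print(parts.query)     # the query string, without the leading "?"
```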


---

## Protocol 

HTTP is unencrypted. 

HTTPS is encrypted. 

---

## Domains

Every computer connected to the internet lives at a certain address. The domain (plus an optional subdomain) is the human-readable name for that address.

---

## Ports

Ports are like doors. 

Every computer connected to the internet has tens of thousands of potential ports (they are numbered up to 65535).

A server is a piece of software that runs on a computer, and "listens" for HTTP requests on a certain port.

---


## Default Ports

* HTTP requests default to port 80
* HTTPS requests default to port 443

Most web pages serve content on the default ports, and as such, we drop the port from the URL.
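
The rule above can be sketched in a few lines of Python. The helper name `effective_port` is made up for this example; it returns the port written in the URL, or falls back to the default for the protocol.

```python
from urllib.parse import urlsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def effective_port(url):
    """Return the port a client would connect to for this URL."""
    parts = urlsplit(url)
    if parts.port is not None:
        return parts.port  # the URL names a port explicitly
    return DEFAULT_PORTS[parts.scheme]  # otherwise use the protocol's default


print(effective_port("https://twitter.com"))
print(effective_port("http://localhost:8000/"))
```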

---

## Servers

It's the internet's job to direct the HTTP request to the right address. 

It's the computer's job to direct the HTTP request to the right port. 

Then, it's up to the server listening on that port to send a response. 


---

## Path

From one server, we often want to serve lots of different content. 

Like the menu of a restaurant. 

A path is a unique identifier for each piece of content the server can provide (each item on the menu).

Paths are trees! Just like menus are organized categorically by their creators (burritos, tacos, tostadas), content within the server is organized with a taxonomy that is created by whoever wrote the server software.

---

## Query String

In a restaurant, the menu might be organized like this: 

- Burritos
  - Chimichangas
  - Grande
  - Children's

While each burrito might have the following options:

- Meat: Asada, Chorizo, Lengua, Soyrizo, Tofu
- Cheese: Queso Fresco, Vegan Cheese

---

## Query String

Sometimes we want to allow options for each piece of content. A query string is a collection of key-value pairs (like a Python dictionary) that describe the content requested.

A Grande Burrito with Soyrizo and Vegan Cheese would look like this in a URL: 

https://myburrito.com/burritos/grande?meat=soyrizo&cheese=vegan
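
Since a query string is a collection of key-value pairs, it converts naturally to and from a Python dictionary. Here is a small sketch with the standard library's `urllib.parse`, using the burrito URL above.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

url = "https://myburrito.com/burritos/grande?meat=soyrizo&cheese=vegan"

# From URL to dictionary. Each value is a list, because a key may
# appear more than once in a query string.
options = parse_qs(urlsplit(url).query)
print(options)

# From dictionary back to a query string.
rebuilt = urlencode({"meat": "soyrizo", "cheese": "vegan"})
print(rebuilt)
```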

---

## Servers

Paths and query strings are nothing but a way for the server to organize its content for the client. There are no strict rules, as long as the client can learn the format and the server is happy with the organization.

---

## Headers

All metadata sent in HTTP requests and responses lives in the "headers". 

The headers are just a collection of key-value pairs.

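Status codes travel alongside the headers. In HTTP/1.1 the code sits on the very first line of the response (the "status line"), just before the headers. HTTP/2 sends it as a special header named ":status". 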

Another common header in HTTP Responses exists under the key "Content-Type". This tells the client what type of content is in the body, and thus how to decode it.
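
As a rough sketch of how a client might use this header, the hypothetical function `decode_body` below picks a decoding strategy from the Content-Type value. It assumes the body is UTF-8 text, which a real client would check rather than assume.

```python
import json


def decode_body(content_type, body):
    """Decode raw response bytes according to the Content-Type header."""
    # "application/json; charset=utf-8" -> "application/json"
    media_type = content_type.split(";")[0].strip().lower()
    text = body.decode("utf-8")  # assumption: the body is UTF-8 encoded
    if media_type == "application/json":
        return json.loads(text)  # JSON becomes a Python object
    return text  # plain text and HTML stay as strings


print(decode_body("application/json; charset=utf-8", b'{"a": 1}'))
print(decode_body("text/plain", b"hello"))
```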

---

## Content-Types

There are 3 main content types you should know about: 

* Plain text
* HTML
* JSON

---

## Plain Text

Plain text is the simplest form of content. It is not used very often in actual applications. 

---

## JSON

JavaScript Object Notation.

An "object" in JavaScript is similar to a dictionary in Python. It's an associative data structure, a collection of key-value pairs! 

JSON is the most common format for sending data over HTTP when the data is meant to be consumed by a computer program, rather than presented for a human to view.

APIs (application programming interfaces) commonly use JSON to send data.

---

## JSON

An example of JSON: 

```{js}

{
    "id": "b4vd345s45gd",
    "tweets": [12543, 9878945, 90384],
    "profile": { 
        "name": "Man Onthe Moon",
        "location": "moon"
    }
}

```

---

## HTML

Hyper Text Markup Language. 

HTML is a format used to encode content, so that it can be displayed for humans to read in a web browser. 

HTML tells the browser what to display, and how to display it. 

For example, if you have a heading (title) followed by two paragraphs, you need to tell the browser not only about the order of the text, but also to make the heading larger and bolder, and to separate the paragraphs with a new line!

---

## HTML

HTML is a tree. It organizes all the content for the browser into a hierarchical taxonomy. 

```{html}

                  |-- meta qux
       |-- head --|
       |          |-- script baz
html --| 
       |          |-- div.foo
       |-- body --|
                  |-- div.bar
```


---

## HTML

The root node is called "html", which has only two possible child nodes, "head" and "body." Those two nodes can have unlimited children. 

```{html}
<html>
    <head>
        ...
    </head>
    <body> 
        <div class="foo"></div>
        <div class="bar"></div>
    </body>
</html>

```
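
To see the tree structure from code, here is a minimal sketch using the standard library's `html.parser`. The class name `OutlineParser` is made up for this example, and it assumes every tag is explicitly closed, which is true for the snippet above but not for all real pages.

```python
from html.parser import HTMLParser


class OutlineParser(HTMLParser):
    """Records each element together with its depth in the tree."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.outline = []

    def handle_starttag(self, tag, attrs):
        self.outline.append((self.depth, tag))
        self.depth += 1  # children of this element are one level deeper

    def handle_endtag(self, tag):
        self.depth -= 1  # back up to the parent's level


page = """
<html>
    <head></head>
    <body>
        <div class="foo"></div>
        <div class="bar"></div>
    </body>
</html>
"""

parser = OutlineParser()
parser.feed(page)
for depth, tag in parser.outline:
    print("    " * depth + tag)
```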

## HTML or JSON??? 

JSON and HTML are used in two very different contexts: 

* HTML is used to create a "UI", a user interface, for consumption by human eyes. 

* JSON is used in an "API", an application programming interface, for consumption by other computer programs.

It should be clear that, in general, you should prefer APIs for getting data, wherever they are available. 

Extracting data from the HTML of websites is referred to as "scraping". 

# Making HTTP Requests

In Python, there are many libraries to make HTTP requests. We will use a 3rd-party library called "requests", which is very easy to use. 

Making a "GET" request is as simple as: 

```python
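import requests

url = 'https://pokeapi.co/api/v2/berry'  # any URL you want to fetch
res = requests.get(url)  # returns a "Response" object
res.content  # the "body" of the response, as raw bytes
res.text  # the same body, decoded to a string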
```

You might need to install the requests library! 

You can do that with the following code in a Jupyter cell: 

```python
! pip install requests
```

Or, if you're using anaconda, optionally you can also do: 

```python
! conda install -c anaconda requests
```

## Parsing JSON data

To parse JSON data in Python, we will use the "json" module: 

```python
import json
```

Read more about the module on the [documentation page](https://docs.python.org/3/library/json.html)!

All we care about for this part is the function "loads" (short for "load string"), which turns JSON data into a Python object: 

```python
json.loads(my_string_encoded_json)
```
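
For instance, here is a short sketch that parses the JSON example from the earlier slide. The variable names `raw` and `user` are chosen just for this illustration.

```python
import json

raw = """
{
    "id": "b4vd345s45gd",
    "tweets": [12543, 9878945, 90384],
    "profile": {
        "name": "Man Onthe Moon",
        "location": "moon"
    }
}
"""

user = json.loads(raw)  # a JSON object becomes a Python dictionary

print(type(user))
print(user["profile"]["location"])  # nested objects become nested dictionaries
print(len(user["tweets"]))          # JSON arrays become Python lists
```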

## Pokemon API

There is a simple, open API called "pokeapi" that allows us to make requests and see how to use APIs. Like everything, we first look at the documentation: 

https://pokeapi.co/docs/v2.html

In [3]:
! pip install --user requests




In [4]:
# Let's see how to make a get request to the API: 
import requests
import json

res = requests.get('https://pokeapi.co/api/v2/berry')
# Avoid naming the variable "str": that would shadow Python's built-in str type.
data = json.loads(res.content)
data

{'count': 64,
 'next': 'https://pokeapi.co/api/v2/berry?offset=20&limit=20',
 'previous': None,
 'results': [{'name': 'cheri', 'url': 'https://pokeapi.co/api/v2/berry/1/'},
  {'name': 'chesto', 'url': 'https://pokeapi.co/api/v2/berry/2/'},
  {'name': 'pecha', 'url': 'https://pokeapi.co/api/v2/berry/3/'},
  {'name': 'rawst', 'url': 'https://pokeapi.co/api/v2/berry/4/'},
  {'name': 'aspear', 'url': 'https://pokeapi.co/api/v2/berry/5/'},
  {'name': 'leppa', 'url': 'https://pokeapi.co/api/v2/berry/6/'},
  {'name': 'oran', 'url': 'https://pokeapi.co/api/v2/berry/7/'},
  {'name': 'persim', 'url': 'https://pokeapi.co/api/v2/berry/8/'},
  {'name': 'lum', 'url': 'https://pokeapi.co/api/v2/berry/9/'},
  {'name': 'sitrus', 'url': 'https://pokeapi.co/api/v2/berry/10/'},
  {'name': 'figy', 'url': 'https://pokeapi.co/api/v2/berry/11/'},
  {'name': 'wiki', 'url': 'https://pokeapi.co/api/v2/berry/12/'},
  {'name': 'mago', 'url': 'https://pokeapi.co/api/v2/berry/13/'},
  {'name': 'aguav', 'url': 'https

In [24]:
# Challenge: 
# Create a Dataframe with all the Pokemon names and their URLs. 

def get_pokes(url):
    # Make the HTTP request to the given url. 
    # Parse the response as json
    # return the "next" and the "results" (as a 2-tuple!)
    # make sure to return a "falsey" value (such as None)
    # if there is not a "next!"
    res = requests.get(url)  # make the HTTP request
    data = res.json()  # parse the response body as JSON
    next_url = data['next']  # URL of the next page (None on the last page)
    results = data['results']  # list of dictionaries, one per Pokemon
    return next_url, results
    
    


def catch_em_all(url):
    pokes = []
    
    # While loop! Like a for-loop, 
    # but goes on for an indeterminate amount
    # of time (while condition is truthy):
    while url:
        url, results = get_pokes(url)
        pokes += results
    return pokes
        
    
list_of_pokes = catch_em_all('https://pokeapi.co/api/v2/pokemon')

# This data is most naturally represented as a list of dictionaries. 
# How can we create a dataframe from a list of dictionaries? 
# Try to find out on your own, from the internet!

# TODO: turn list_of_pokes into a dataframe.

In [25]:
list_of_pokes

[{'name': 'bulbasaur', 'url': 'https://pokeapi.co/api/v2/pokemon/1/'},
 {'name': 'ivysaur', 'url': 'https://pokeapi.co/api/v2/pokemon/2/'},
 {'name': 'venusaur', 'url': 'https://pokeapi.co/api/v2/pokemon/3/'},
 {'name': 'charmander', 'url': 'https://pokeapi.co/api/v2/pokemon/4/'},
 {'name': 'charmeleon', 'url': 'https://pokeapi.co/api/v2/pokemon/5/'},
 {'name': 'charizard', 'url': 'https://pokeapi.co/api/v2/pokemon/6/'},
 {'name': 'squirtle', 'url': 'https://pokeapi.co/api/v2/pokemon/7/'},
 {'name': 'wartortle', 'url': 'https://pokeapi.co/api/v2/pokemon/8/'},
 {'name': 'blastoise', 'url': 'https://pokeapi.co/api/v2/pokemon/9/'},
 {'name': 'caterpie', 'url': 'https://pokeapi.co/api/v2/pokemon/10/'},
 {'name': 'metapod', 'url': 'https://pokeapi.co/api/v2/pokemon/11/'},
 {'name': 'butterfree', 'url': 'https://pokeapi.co/api/v2/pokemon/12/'},
 {'name': 'weedle', 'url': 'https://pokeapi.co/api/v2/pokemon/13/'},
 {'name': 'kakuna', 'url': 'https://pokeapi.co/api/v2/pokemon/14/'},
 {'name': '

# Scraping!

---

## Terminology: Crawling vs Scraping

Web crawling refers to the act of traversing the internet with a bot, that is, a piece of software.

Search engines, traditionally, crawl the internet and don't do much with the content except index the text.

Scraping refers to the act of extracting specific data from web pages.

Often, scraping involves some crawling, and uses the same libraries and techniques. Thus, the terms and concepts will overlap.

---

## Scraping

Scraping consists of:

1. Making HTTP requests to servers for HTML content.
2. Parsing that HTML content to:
   1. Store desired information.
   2. Find links to follow (for each link, go back to 1.)

---

## Making HTTP Requests

Browsers make HTTP requests.

Every major programming language also has a way to make HTTP requests.

This is sometimes done via a third-party library.

---

## Parsing HTML

HTML parsing is something you will use a trusted library for.

HTML parsers take the raw HTML from an HTTP response, and turn it into a (usually custom) tree-like data structure or class that you can easily traverse, and from which you can easily extract the desired content.

This functionality is conceptually different from that of making the HTTP request. It will usually be included in a separate library, for this reason.

You will need to learn the API, the interface, of the HTML parser you are using!

Read the documentation.

---

## Following Links

Making an HTTP request and parsing it is all you need to scrape a single page.

Usually, however, we want to scrape more than one page.

How do we get all the pages we will scrape?

Often, we get them from links in other pages!

---

## Following Links

What are some websites you might want to scrape?

Which pages?

How can we access all the pages?

---

## Sitemaps

Some sites provide a link to an XML sitemap in their robots.txt file (more on that later).

Other sites provide a sitemap directly as an HTML page, labelled "sitemap".

Still others provide no sitemap at all.

Sitemaps are generally meant as a way in which crawlers can easily get to all the pages on the site. There might also be some sort of "directory" pages for part of the content.

---

## Rules of the road

What can you scrape?

What should you scrape?

What is legal to scrape?

---

## Public vs. Private

There are two types of content on the web: that which everybody can see (public), and that which only certain individuals can see (private).

When you login to a website, you usually see some private content. You also agreed to a legal document, whether you read it or not, their Terms and Conditions!

Those Terms and Conditions can, and often do, make it illegal for you to scrape private content. And because you have agreed to them, you are bound by them.

---

## Public vs. Private

Public content, on the other hand, is less black and white.

There are websites that are happy with you scraping their content, as long as you do it politely.

There are others that don't want you scraping their content unless they know who you are. Almost everyone wants Google to crawl and index all their pages. But they may not want their competitor doing the same!

---

## Being Polite

Let's assume you have the website's blessing.

How does one act politely?

1. Follow robots.txt file
2. Scrape slowly
3. Identify yourself

---

## Robots.txt

Most major websites will have a robots.txt file. This is just a text file that they create in order to tell bots (web crawlers and scrapers) the rules of their website.

You should obey robots.txt files. It's part of being a good citizen on the web!

Let's look at an example. Canonically, they are always at /robots.txt:

<https://www.airbnb.com/robots.txt>

Mostly, they just describe which paths bots are allowed to access, and which they are not.
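
Python's standard library includes a parser for these files. The sketch below feeds it a made-up two-line robots.txt rather than a real one; the bot name `my-course-bot` and the example.com URLs are placeholders.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt: every bot is asked to stay out of /private/.
robots_txt = """\
User-agent: *
Disallow: /private/
"""

rules = RobotFileParser()
rules.parse(robots_txt.splitlines())

allowed = rules.can_fetch("my-course-bot", "https://example.com/listings/1")
blocked = rules.can_fetch("my-course-bot", "https://example.com/private/data")
print(allowed, blocked)
```

For a live site, the same class can download the file itself: call `set_url()` with the robots.txt address and then `read()` instead of `parse()`.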

---

## Scraping Speed

HTTP requests take time to complete. Even at the speed of light, the data might have to go all around the world and back.

While an HTTP request is being made, your computer, and its processor, is idling. Your processor can prepare many other requests, and handle many other responses, while waiting for its first HTTP request to complete (it could be hundreds of milliseconds!).

---

## Scraping Speed

Scraping in parallel can happen, depending on the language, via processes, threads, or an asynchronous event loop.

Modern machines can thus make many requests very quickly!

However, servers are limited in how many requests they can handle at a given time. They prefer to spread the load of requests as evenly as possible, avoiding large spikes in usage.

For this reason, they want you to scrape slowly.
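
The simplest way to scrape slowly is to pause between requests. Here is one possible sketch: the function `fetch_politely` and its parameters are invented for this example, and the `fetch` argument stands in for whatever function actually makes the HTTP request.

```python
import time


def fetch_politely(urls, fetch, delay_seconds=1.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing between consecutive requests."""
    results = []
    for i, url in enumerate(urls):
        if i > 0:
            sleep(delay_seconds)  # wait before every request except the first
        results.append(fetch(url))
    return results


# A stand-in for a real HTTP request, so the example runs offline.
def fake_fetch(url):
    return f"content of {url}"


pages = fetch_politely(["/a", "/b", "/c"], fake_fetch, delay_seconds=0.01)
print(pages)
```

The right delay depends on the site. One second between requests is only a starting guess, not a rule.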

---

## Identify Yourself

One header that you can send with an HTTP request is "User-Agent".

In the case of normal web browsing, "User-Agent" refers to the exact browser and version being used.

In the case of scraping, however, it is polite to use a name that refers to your bot and your website. This identifies you uniquely, so their engineers can know who you are.
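
As a sketch of what this looks like in code, the example below builds (but does not send) a request with a custom User-Agent using the standard library. The bot name and the example.com URLs are placeholders to replace with your own.

```python
from urllib.request import Request

# Placeholder identity: a bot name, a version, and a page describing the bot.
BOT_USER_AGENT = "my-course-bot/0.1 (+https://example.com/bot-info)"

req = Request(
    "https://example.com/listings",
    headers={"User-Agent": BOT_USER_AGENT},
)

# urllib stores header names capitalized as "User-agent".
print(req.get_header("User-agent"))
```

With the requests library, the equivalent is to pass the same dictionary through the `headers` keyword argument of `requests.get`.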

---

## Problems

* Getting Blocked
* JavaScript

---

## Getting blocked

If you scrape too quickly, scrape from a commercial IP address (AWS), or the website doesn't know you, you might be blocked from crawling. Instead of giving you the page you asked for, they might give you a different page, potentially with a 403 status code, that tells you that you have been denied access.

There are many ways around being blocked. You can lie about your user-agent, use proxies, hold on to cookies, etc.

In general, however, you should be careful. Even if you feel you are ethically justified, this can be a slippery cat-and-mouse game that eats up a lot of your time!

It might be easier just to ask the website to whitelist you!

---

## JavaScript

We have been focusing on the paradigm wherein the server responds to HTTP GET requests with HTML.

Sometimes, however, not all the HTML that we see in our browser is actually sent by the server. The server might send a small amount of HTML, along with some JavaScript code, whose responsibility it is to generate the rest of the HTML.

This poses a major problem for scraping, as the content we want isn't returned by the server!

---

## JavaScript

The solution is to use a headless browser. 

A headless browser is just a browser that does not render the content to a UI. 

Headless browsers can be embedded within your scraping program via a library, or run as a separate piece of software and accessed over HTTP. 

Popular options: Selenium and Splash

---

## Storing Data

There are many options for storing data from scraping: 

* Flat files (json lines, csv, etc.)
* Database
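
As an illustration of the flat-file option, here is a small sketch that stores records as JSON lines (one JSON object per line) and reads them back. The helper names `write_jsonl` and `read_jsonl` and the file name `pokes.jsonl` are made up for this example.

```python
import json
import os
import tempfile

records = [
    {"name": "bulbasaur", "url": "https://pokeapi.co/api/v2/pokemon/1/"},
    {"name": "ivysaur", "url": "https://pokeapi.co/api/v2/pokemon/2/"},
]


def write_jsonl(path, rows):
    # Append mode lets a long-running scraper add rows as it goes.
    with open(path, "a", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")


def read_jsonl(path):
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


path = os.path.join(tempfile.mkdtemp(), "pokes.jsonl")
write_jsonl(path, records)
loaded = read_jsonl(path)
print(loaded)
```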

---

## Following Many Links

Let's return to the problem of following links. 

Often, the links grow exponentially in number as we scrape. 

This is because one page in a directory or search results might link to 10-20 "detail" pages which are often the ones we actually want data from. 

In other words, we might want to scrape thousands or millions of individual pages, but we won't have that list of pages ahead of time. We will build it as we go. 

---

## Following Many Links

How can we deal with this ever-growing list? 

* Loops in loops in loops
* Recursive function calls
* A queue

Which of these will work in a distributed or parallel framework? 

Only the queue does so elegantly.
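
Here is a minimal, single-machine sketch of the queue idea. To keep it runnable offline, the "website" is a made-up dictionary (`FAKE_SITE`) mapping each page to the links found on it; in a real scraper, `get_links` would make an HTTP request and parse the HTML.

```python
from collections import deque

# Made-up site: each page maps to the links it contains.
FAKE_SITE = {
    "/": ["/page/1", "/page/2"],
    "/page/1": ["/item/a", "/item/b", "/page/2"],
    "/page/2": ["/item/b", "/item/c"],
    "/item/a": [],
    "/item/b": [],
    "/item/c": [],
}


def crawl(start, get_links):
    queue = deque([start])  # pages we still have to visit
    seen = {start}          # pages already queued, so nothing is visited twice
    visited = []
    while queue:
        url = queue.popleft()
        visited.append(url)
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)  # the list of work grows as we go
    return visited


order = crawl("/", lambda url: FAKE_SITE[url])
print(order)
```

In a distributed setup, the in-memory `deque` would be swapped for a shared queue that several workers pull from, but the loop keeps the same shape.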

---

## Production Scraping

When you actually want to scrape a large site, you will need to make requests in parallel to get the speed needed. 

You can build this yourself, or use a scraping library that provides it out of the box. 

Scrapy is a Python library that gives you this, and much more, for free. It's a highly opinionated and structured library, so there is a learning curve, but it is well documented and popular. 


# Parsing HTML

---

## HTML

Each node of the tree is an "HTML element."

Some common elements:

```{html}
<div>
<p>
<span>
<h1>
<a>
```

---

## HTML

In addition to having children and/or text, each element can have "attributes." Some common attributes are "id" and "class":

```{html}
<div id="foo">
    <span class="email"> man@themoon.space </span>
</div>

```

---

## HTML

Elements, classes, and ids give us a way to traverse the HTML tree and target a specific node (and its subtree!)

This is very important. This is used in styling webpages as well as in web scraping.

Let's see an example:

---

## HTML

```{html}
<body>
    <div class="foo">
        <h3> EMAIL </h3>
        <span id="email"> man@themoon.space </span>
    </div>
    <article class="bar">
        <span> My Day </span>
        <p> Hello, I would like to discuss...</p>
    </article>
</body>
```

---

## HTML


Using CSS notation, we can target the email via:

```{css}
div.foo span#email
```

Additionally, we could simplify it to:

```{css}
.foo span
```

Because there is only one element with the class "foo", and only one span element inside that!

Or, because there is an id, we can use that and nothing else:

```{css}
#email
```

---


## HTML

(example with browser inspector on live webpage)

---

## HTML

Some elements have special attributes.

Anchor tags can have an "href" attribute, which is a link to another page. Anchor tags and hrefs form the basis of the web!

```{html}
<a href="https://man.mars/redmanred">
    Checkout my boy's homepage!
</a>
```
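
To show how hrefs can be collected from a page, here is a small sketch built on the standard library's `html.parser` (not a dedicated scraping library). The class `LinkExtractor`, the second link, and the base URL are made up for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href of every anchor tag as an absolute URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            # Relative links like "/about" are resolved against the page's URL.
            self.links.append(urljoin(self.base_url, href))


page = """
<a href="https://man.mars/redmanred">Checkout my boy's homepage!</a>
<a href="/about">About</a>
"""

extractor = LinkExtractor("https://themoon.space/home")
extractor.feed(page)
print(extractor.links)
```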


In [None]:
# We will see how we can use a library called Beautiful Soup to parse
# the html. Getting the correct node out of an HTML tree can be tricky, 
# but there are many tutorials online and the Beautiful Soup documentation is
# a great place to start!

# Install "beautifulsoup4" if you don't have it! (pip or conda)

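import requests
from bs4 import BeautifulSoup

def get_soup(url):
    res = requests.get(url)
    # Name the parser explicitly; otherwise Beautiful Soup guesses one and warns.
    return BeautifulSoup(res.text, 'html.parser')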


soup = get_soup('https://brickset.com/sets/year-2016')

soup.select('.set h1 a')

# This gets the titles on the first page. How can we follow the pages to get ALL the titles?

In [None]:
def get_titles(soup):    
    """ Returns a list of titles on the page """
    return [s.get_text() for s in soup.select('.set h1 a')]

def get_next(soup):
    try:
        return soup.select('li.next a')[0]['href']
    except IndexError:
        return None

def get_soup(url):
    res = requests.get(url)
    # Name the parser explicitly; otherwise Beautiful Soup guesses one and warns.
    return BeautifulSoup(res.text, 'html.parser')

def parse_bricks(url):
    """ Fetches Lego Bricks page and extracts titles """
    # Use a while loop to get all brickset titles
    # gather them into a list
    # and return the whole list
    pass

In [None]:
bricks = parse_bricks('https://brickset.com/sets/year-2016')

In [None]:
assert(bricks[0] == '10251:  Brick Bank')
assert(bricks[9] == '10722:  Snake Showdown')

## Project: Live Exchange Rates

Imagine that you work with financial assets which are denominated in different currencies. You analyze this data regularly, and want to create a "transformation" function that transforms all your assets into EUR prices, based on today's exchange rate. 

Your data with the local-currency-denominated value of each asset lives in a file called "assets.csv" which should be located in the same folder as this notebook. 

Write a "data loading" function that: 

1. Reads the data, given the path to the file. 
2. Returns a dataframe with an additional column that has the asset's value in euros, as of today.

Use this free API to get today's exchange rates: https://exchangeratesapi.io/. You will need to read the documentation and try it out to see how it works. 

HINT: Write a separate function to get the current exchange rates! That can be reused!

In [4]:
import pandas as pd
import requests
import json
 
df = pd.read_csv("assets.csv")
df

Unnamed: 0,value,curr
0,48.910052,THB
1,16.505115,THB
2,30.370579,INR
3,14.126967,SEK
4,23.406904,HKD
...,...,...
995,13.593894,HRK
996,41.710860,ZAR
997,12.877760,AUD
998,29.561696,KRW


In [29]:
def get_exchange(url):
    res = requests.get(url)  ## requesting the url
    res = res.json()   ## response as json
    return pd.DataFrame(res).reset_index().rename(columns = {"index": "curr"})
        
url = 'https://api.exchangeratesapi.io/latest'
df1 = get_exchange(url)
df1

Unnamed: 0,curr,rates,base,date
0,AUD,1.6734,EUR,2020-04-28
1,BGN,1.9558,EUR,2020-04-28
2,BRL,6.1004,EUR,2020-04-28
3,CAD,1.5179,EUR,2020-04-28
4,CHF,1.0586,EUR,2020-04-28
5,CNY,7.6977,EUR,2020-04-28
6,CZK,27.227,EUR,2020-04-28
7,DKK,7.457,EUR,2020-04-28
8,GBP,0.87078,EUR,2020-04-28
9,HKD,8.4301,EUR,2020-04-28


In [38]:
new_df = pd.merge(df, df1, on='curr', how='outer').drop(columns = ['base', 'date'])
new_df

Unnamed: 0,value,curr,rates
0,48.910052,THB,35.2800
1,16.505115,THB,35.2800
2,26.431815,THB,35.2800
3,15.357862,THB,35.2800
4,41.928861,THB,35.2800
...,...,...,...
995,27.189609,RON,4.8445
996,46.409661,RON,4.8445
997,13.215449,RON,4.8445
998,38.344216,RON,4.8445


In [39]:
def transformation(new_df):
    # Rates are quoted per 1 EUR, so dividing the local value by the rate gives the EUR value.
    return new_df.assign(value_eur = new_df.value / new_df.rates)
transformation(new_df)

Unnamed: 0,value,curr,rates,value_eur
0,48.910052,THB,35.2800,1.386339
1,16.505115,THB,35.2800,0.467832
2,26.431815,THB,35.2800,0.749201
3,15.357862,THB,35.2800,0.435314
4,41.928861,THB,35.2800,1.188460
...,...,...,...,...
995,27.189609,RON,4.8445,5.612470
996,46.409661,RON,4.8445,9.579866
997,13.215449,RON,4.8445,2.727928
998,38.344216,RON,4.8445,7.915000
