# APIs and Webscraping Summary : Guided Practice

This week, we learned about two common methods for acquiring data from online sources:

 * APIs (Application Programming Interfaces)
 * Webscraping

### APIs
APIs are commonly used to retrieve data from remote websites. Sites like Reddit, Twitter, and Facebook all offer certain data through their APIs. To use an API, you make a specific request to a remote server and retrieve the data you need based on the parameters you have indicated in your request.

The URL you access through an API acts as a portal to the website's backend database, which you can query.

![](images/web_api.png)
> <a href="https://commons.wikimedia.org/wiki/File:Web_API.png">Brivadeneira</a>, <a href="https://creativecommons.org/licenses/by-sa/4.0">CC BY-SA 4.0</a>, via Wikimedia Commons

## Parts of an API

* **Access Permissions**
    + User allowed to ask?
* **API Call/Request**
    + Code used to make API call to implement complicated tasks/features
    + *Methods*: what questions can we ask?
    + *Parameters*: more info to be sent
* **Response**
    + Result of request
    


## Secure APIs vs Insecure APIs

As we saw in last week's lessons, the APIs we interact with can differ in their security requirements.
We first took a look at OpenNotify, an insecure API that gives the current location of the International Space Station (ISS) and its predicted passes over a given point. What makes this an insecure API is that it does not require the client or user making requests to identify themselves, or "authenticate". This was unlike our experience with the Yelp API, which required us to create an account and generate an API key that we could use to authenticate our requests (identify ourselves).

In [1]:
import requests

url = 'http://api.open-notify.org/iss-now.json'
iss_response = requests.get(url)

In [2]:
type(iss_response)

requests.models.Response

In [3]:
iss_response.status_code

200

#### Status Codes

The requests we make may not always be successful. The best way to find out is to check the status code returned with the response: `response.status_code`

For more extensive descriptions of the different status codes, see:

[Status Code Info](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status)
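
As a rough sketch of how a status code might be checked before using a response, the helper below (`describe_status`, a hypothetical name) looks up the standard phrase for a code with Python's built-in `http.HTTPStatus` and treats the 2xx range as success. It works on plain integers, so it can be tried without making a request.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a short, readable summary of an HTTP status code."""
    phrase = HTTPStatus(code).phrase  # e.g. 200 -> "OK"
    outcome = "success" if 200 <= code < 300 else "failure"
    return f"{code} {phrase}: {outcome}"


for code in (200, 400, 404):
    print(describe_status(code))
```

With a real response, the same check would be `describe_status(iss_response.status_code)`.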

In [4]:
iss_response.content

b'{"iss_position": {"longitude": "37.2389", "latitude": "35.5545"}, "message": "success", "timestamp": 1652465480}'

#### Endpoints

OpenNotify has several API **endpoints**. We just made a request to the `iss-now` endpoint. An endpoint is a server route that is used to retrieve different data from the API. For example, the `/comments` endpoint on the Reddit API might retrieve information about comments, whereas the `/users` endpoint might retrieve data about users. To access them, you would add the endpoint to the base URL of the API.
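
To illustrate the idea of adding an endpoint to a base URL, here is a minimal sketch that joins the OpenNotify base URL with the endpoint names used in this notebook. It uses `urljoin` from the standard library; simple string concatenation would work as well.

```python
from urllib.parse import urljoin

base_url = "http://api.open-notify.org/"

# Endpoints used in this notebook
endpoints = ["iss-now.json", "astros.json", "iss-pass.json"]

# Map each endpoint name to its full URL
endpoint_urls = {name: urljoin(base_url, name) for name in endpoints}

for name, url in endpoint_urls.items():
    print(f"{name:>14} -> {url}")
```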

In [5]:
# Let's check out who is in space right now!

url = 'http://api.open-notify.org/astros.json'
astro_response = requests.get(url)
print(astro_response.status_code)

200


In [6]:
astro_response.content

b'{"message": "success", "number": 11, "people": [{"name": "Raja Chari", "craft": "ISS"}, {"name": "Tom Marshburn", "craft": "ISS"}, {"name": "Kayla Barron", "craft": "ISS"}, {"name": "Matthias Maurer", "craft": "ISS"}, {"name": "Oleg Artemyev", "craft": "ISS"}, {"name": "Denis Matveev", "craft": "ISS"}, {"name": "Sergey Korsakov", "craft": "ISS"}, {"name": "Kjell Lindgren", "craft": "ISS"}, {"name": "Bob Hines", "craft": "ISS"}, {"name": "Samantha Cristoforetti", "craft": "ISS"}, {"name": "Jessica Watkins", "craft": "ISS"}]}'

In [7]:
type(astro_response.content)

bytes

In [8]:
astro_response.text

'{"message": "success", "number": 11, "people": [{"name": "Raja Chari", "craft": "ISS"}, {"name": "Tom Marshburn", "craft": "ISS"}, {"name": "Kayla Barron", "craft": "ISS"}, {"name": "Matthias Maurer", "craft": "ISS"}, {"name": "Oleg Artemyev", "craft": "ISS"}, {"name": "Denis Matveev", "craft": "ISS"}, {"name": "Sergey Korsakov", "craft": "ISS"}, {"name": "Kjell Lindgren", "craft": "ISS"}, {"name": "Bob Hines", "craft": "ISS"}, {"name": "Samantha Cristoforetti", "craft": "ISS"}, {"name": "Jessica Watkins", "craft": "ISS"}]}'

In [11]:
print(type(astro_response.text))

<class 'str'>


Most of the time, the data we obtain from a request will be in JSON format, so we can parse it into Python objects with the response's `.json()` method (or with Python's `json` library).

In [12]:
astro_data = astro_response.json()
astro_data.keys()

dict_keys(['message', 'number', 'people'])
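
Once parsed, the data is an ordinary Python dictionary that we can index and loop over. The sketch below uses a shortened, hand-made sample in the same shape as the `astros.json` response (three people instead of eleven) and parses it with `json.loads`, which should give a result equivalent to calling `.json()` on the response.

```python
import json

# Shortened sample in the same shape as the astros.json response above
sample_text = (
    '{"message": "success", "number": 3, "people": ['
    '{"name": "Raja Chari", "craft": "ISS"}, '
    '{"name": "Kayla Barron", "craft": "ISS"}, '
    '{"name": "Jessica Watkins", "craft": "ISS"}]}'
)

sample_data = json.loads(sample_text)  # str -> dict

# "people" is a list of dictionaries, one per astronaut
names = [person["name"] for person in sample_data["people"]]
print(sample_data["number"], "people:", ", ".join(names))
```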

#### Hitting the right endpoint

In [16]:
# Let's make a request to the iss-pass.json endpoint

iss_pass_url = 'http://api.open-notify.org/iss-pass.json'
response = requests.get(iss_pass_url)
response.status_code

400

In [18]:
# Let's take a look at the response message by inspecting response.content
response.content

b'{\n  "message": "failure", \n  "reason": "Latitude must be specified"\n}\n'

#### Query Parameters

If we look at the [documentation](https://web.archive.org/web/20201224141953/http://open-notify.org/Open-Notify-API/ISS-Pass-Times/), we see that the ISS Pass endpoint requires two parameters.

We can supply them by adding an optional keyword argument, `params`, to our request. In this case, there are two parameters we need to pass:

* lat — The latitude of the location we want.
* lon — The longitude of the location we want.

We can make a dictionary with these parameters, and then pass them into the `requests.get()` method. We’ll make a request using the coordinates of New York City, and see what response we get.

We can also add the query parameters to the URL, like this: http://api.open-notify.org/iss-pass.json?lat=47.6&lon=-122.3. However, it’s almost always preferable to pass the parameters as a dictionary, because `requests` takes care of some potential issues, like properly formatting the query parameters.
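
To see what "properly formatting the query parameters" involves, the sketch below builds the query string by hand with `urlencode` from the standard library. This is only an illustration of the encoding step, not a description of how `requests` implements it internally; the `city` parameter is made up to show escaping and is not one this API accepts.

```python
from urllib.parse import urlencode

iss_pass_url = "http://api.open-notify.org/iss-pass.json"
params = {"lat": 40.7, "lon": -74}

# Turn the dictionary into "key=value" pairs joined by "&"
query_string = urlencode(params)
full_url = f"{iss_pass_url}?{query_string}"
print(full_url)

# Characters such as spaces are escaped automatically
# ("city" is a made-up parameter, not one this API accepts)
print(urlencode({"city": "New York"}))
```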

In [19]:
# Our code here

response = requests.get(iss_pass_url,
            params={'lat': 40.7, 'lon': -74})

# Print the content of the response (the data the server returned)

print(response.text)

# This gets the same data as the command above:
# requests.get("http://api.open-notify.org/iss-pass.json?lat=40.7&lon=-74")

{
  "message": "success", 
  "request": {
    "altitude": 100, 
    "datetime": 1652466692, 
    "latitude": 40.7, 
    "longitude": -74.0, 
    "passes": 5
  }, 
  "response": [
    {
      "duration": 492, 
      "risetime": 1652487430
    }, 
    {
      "duration": 649, 
      "risetime": 1652493124
    }, 
    {
      "duration": 600, 
      "risetime": 1652498973
    }, 
    {
      "duration": 553, 
      "risetime": 1652504860
    }, 
    {
      "duration": 608, 
      "risetime": 1652510693
    }
  ]
}
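
The `risetime` values in this response are not very readable as they stand. Assuming they are Unix timestamps in seconds and that `duration` is also in seconds, a small sketch like the following converts the first two passes into UTC date-times and minute/second durations.

```python
from datetime import datetime, timezone

# First two passes copied from the response above
passes = [
    {"duration": 492, "risetime": 1652487430},
    {"duration": 649, "risetime": 1652493124},
]

readable_passes = []
for iss_pass in passes:
    # Assumption: risetime is a Unix timestamp in seconds
    rise = datetime.fromtimestamp(iss_pass["risetime"], tz=timezone.utc)
    # Assumption: duration is in seconds
    minutes, seconds = divmod(iss_pass["duration"], 60)
    readable_passes.append((rise.isoformat(), f"{minutes}m {seconds}s"))

for rise_time, length in readable_passes:
    print(rise_time, length)
```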



For an exercise with a secure API that requires authentication, refer to your Canvas assignment: `Using the Yelp API - codealong`
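
As a rough picture of what changes when an API requires authentication, the sketch below builds (but does not send) two requests with the standard library's `urllib.request.Request`. The search URL and key are hypothetical placeholders, and the `Authorization: Bearer ...` header is a common convention assumed here for illustration; check the API's own documentation for the scheme it expects.

```python
from urllib.request import Request

# Hypothetical values for illustration only
API_KEY = "YOUR_API_KEY"
search_url = "https://api.example.com/v1/search?term=coffee"

# Unauthenticated request: no identifying headers
open_request = Request("http://api.open-notify.org/iss-now.json")

# Authenticated request: the key travels in an Authorization header
secure_request = Request(
    search_url,
    headers={"Authorization": f"Bearer {API_KEY}"},
)

print(open_request.has_header("Authorization"))
print(secure_request.get_header("Authorization"))
```

With `requests`, the same dictionary of headers can be passed through the `headers` keyword argument of `requests.get()`.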

### Webscraping

There is publicly available data all over the internet ripe for scraping, whether that be artist information from Wikipedia, song lyrics from songlyrics.com, or the texts of famous books from Project Gutenberg.

#### The components of a web page

When we visit a web page, our web browser makes a GET request to a web server. The server then sends back files that tell our browser how to render the page for us. The files fall into a few main types:

- HTML: contains the main content of the page.
- CSS: adds styling to make the page look nicer.
- JS: JavaScript files add interactivity to web pages.
- Images: image formats, such as JPG and PNG, allow web pages to show pictures.

After our browser receives all the files, it renders the page and displays it to us. There’s a lot that happens behind the scenes to render a page nicely, but we don’t need to worry about most of it when we’re web scraping.

#### HTML

HyperText Markup Language (HTML) is the language that web pages are written in. HTML isn’t a programming language like Python; instead, it’s a markup language that tells a browser how to lay out content.

Let’s take a quick tour through HTML so we know enough to scrape effectively. HTML consists of elements called tags. The most basic tag is the `<html>` tag. This tag tells the web browser that everything inside of it is HTML. We can make a simple HTML document just using this tag:

~~~html
<html>
</html>
~~~

Right inside an html tag, we put two other tags, the head tag, and the body tag. The main content of the web page goes into the body tag. The head tag contains data about the title of the page, and other information that generally isn’t useful in web scraping:

~~~html
<html>
    <head>
    </head>
    <body>
    </body>
</html>
~~~

We’ll now add our first content to the page, in the form of the p tag. The p tag defines a paragraph, and any text inside the tag is shown as a separate paragraph:
~~~html
<html>
    <head>
    </head>
    <body>
        <p>
            Here's a paragraph of text!
        </p>
        <p>
            Here's a second paragraph of text!
        </p>
    </body>
</html>
~~~

<html>
    <head>
    </head>
    <body>
          <p>
            Here's a paragraph of text!
        </p>
        <p>
            Here's a second paragraph of text!
        </p>
    </body>
</html>

Tags have commonly used names that depend on their position in relation to other tags:

- **child**: a child is a tag inside another tag. So the two p tags above are both children of the body tag.
- **parent**: a parent is the tag another tag is inside. Above, the html tag is the parent of the body tag.
- **sibling**: a sibling is a tag that is nested inside the same parent as another tag. For example, head and body are siblings, since they’re both inside html. Both p tags are siblings, since they’re both inside body.
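
These relationships can be explored in code. The sketch below parses a condensed copy of the example page with the standard library's `xml.etree.ElementTree`, which works here only because this small snippet happens to be well-formed XML; real-world HTML usually needs a dedicated HTML parser.

```python
import xml.etree.ElementTree as ET

# Condensed copy of the example page above
page = """
<html>
    <head>
    </head>
    <body>
        <p>Here's a paragraph of text!</p>
        <p>Here's a second paragraph of text!</p>
    </body>
</html>
"""

root = ET.fromstring(page.strip())  # the <html> element
body = root.find("body")

# Iterating over an element yields its direct children
children_of_html = [child.tag for child in root]
children_of_body = [child.tag for child in body]

print(root.tag, "is the parent of", children_of_html)
print(body.tag, "is the parent of", children_of_body)
```

Tags that appear in the same list, such as `head` and `body`, are siblings.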

We can also add properties to HTML tags that change their behavior:

~~~html
<html>
  <head></head>
  <body>
    <p>
      Here's a paragraph of text!
      <a href="https://www.dataquest.io">Learn Data Science Online</a>
    </p>
    <p>
      Here's a second paragraph of text!
      <a href="https://www.python.org">Python</a>        
    </p>
  </body>
</html>
~~~

<html>
    <head>
    </head>
    <body>
        <p>
            Here's a paragraph of text!
            <a href="https://www.dataquest.io">Learn Data Science Online</a>
        </p>
        <p>
            Here's a second paragraph of text!
            <a href="https://www.python.org">Python</a>
        </p>
    </body>
</html>

In the above example, we added two `a` tags. An `a` tag is a link: it tells the browser to render a link to another web page. The href property of the tag determines where the link goes.

The `a` and `p` tags are extremely common HTML tags. Here are a few others:

- *div*: indicates a division, or area, of the page.
- *b*: bolds any text inside.
- *i*: italicizes any text inside.
- *u*: underlines any text inside.
- *table*: creates a table.
- *form*: creates an input form.


For a full list of tags, look [here](https://developer.mozilla.org/en-US/docs/Web/HTML/Element).

<html>
    <b>bold</b> <br/>
    <i>italics</i> <br/>
    <u>underlining</u>
</html>

There are two special properties that give HTML elements names, and make them easier to interact with when we’re scraping: **class** and **id**. 

- One element can have multiple classes, and a class can be shared between elements. 
- Each element can only have one id, and an id can only be used once on a page. 
- Classes and ids are optional, and not all elements will have them.

We can add classes and ids to our example:

~~~html
<html>
    <head>
    </head>
    <body>
        <p class="bold-paragraph">
            Here's a paragraph of text!
            <a href="https://www.dataquest.io" id="learn-link">Learn Data Science Online</a>
        </p>
        <p class="bold-paragraph extra-large">
            Here's a second paragraph of text!
            <a href="https://www.python.org" class="extra-large">Python</a>
        </p>
    </body>
</html>
~~~

These classes and ids are the hooks we exploit with packages like Beautiful Soup to home in on the pieces of information or data we would like to obtain. Refer to the webscraping lecture notebook for an example walkthrough.
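
As a minimal sketch of the idea, the example below selects links from the page above by id and by class. It uses the standard library's `html.parser` as a stand-in so that it runs without extra installs; Beautiful Soup offers a more convenient interface for the same kind of lookup. `LinkCollector` is a hypothetical helper written for this illustration.

```python
from html.parser import HTMLParser

HTML_DOC = """
<html>
    <head>
    </head>
    <body>
        <p class="bold-paragraph">
            Here's a paragraph of text!
            <a href="https://www.dataquest.io" id="learn-link">Learn Data Science Online</a>
        </p>
        <p class="bold-paragraph extra-large">
            Here's a second paragraph of text!
            <a href="https://www.python.org" class="extra-large">Python</a>
        </p>
    </body>
</html>
"""


class LinkCollector(HTMLParser):
    """Collect the href, id, and classes of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            attributes = dict(attrs)
            self.links.append({
                "href": attributes.get("href"),
                "id": attributes.get("id"),
                # An element can carry several space-separated classes
                "classes": (attributes.get("class") or "").split(),
            })


collector = LinkCollector()
collector.feed(HTML_DOC)

# Select links by id and by class
by_id = [link["href"] for link in collector.links if link["id"] == "learn-link"]
by_class = [link["href"] for link in collector.links if "extra-large" in link["classes"]]

print(by_id)
print(by_class)
```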