# Web Scraping 
Web scraping is the process of gathering information from the Internet. Some websites don’t like it when automatic scrapers gather their data, while others don’t mind.

Scraping a page in Python generally involves two steps, each handled by its own library:
- `requests` retrieves the content of a webpage
- `bs4` (Beautiful Soup) extracts the relevant information from that content

These two libraries are often used together in the following manner:
- First, we make a GET request to a website.
- Then, we create a Beautiful Soup object from the returned content and parse it using that object's methods.

In [23]:
%%capture
%env http_proxy=http://entproxy.kdc.capitalone.com:8099
%env https_proxy=http://entproxy.kdc.capitalone.com:8099

In [1]:
import requests

### 01_Download a Webpage
The first step in scraping a web page is to download it. We can do this with the Python `requests` library, which makes a `GET` request to a web server and returns the HTML contents of the given page.
- For more on interacting with APIs, see: https://www.dataquest.io/blog/python-api-tutorial/

In [4]:
# Make the GET request to a url
URL = 'https://www.monster.com/jobs/search/?q=Software-Developer&where=Australia'
page = requests.get(URL)

You can deconstruct the above URL into two main parts:
- The base URL is the path to the search functionality of the website: https://www.monster.com/jobs/search/.
- The query parameters `?q=Software-Developer&where=Australia` are additional values passed to the page, here the search term and the location.

The `requests.get(URL)` call performs an HTTP request to the given URL. It retrieves the HTML data that the server sends back and stores that data in a Python object.
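
The split into base URL and query parameters described above can be checked with the standard library. This is a minimal sketch that only parses the URL string and makes no request; the variable names `base_url` and `query_params` are chosen here for illustration.

```python
from urllib.parse import urlsplit, parse_qs

URL = 'https://www.monster.com/jobs/search/?q=Software-Developer&where=Australia'

# Split the URL into its components (scheme, host, path, query, ...)
parts = urlsplit(URL)

# The base URL is everything before the "?"
base_url = f"{parts.scheme}://{parts.netloc}{parts.path}"

# parse_qs maps each parameter name to a list of its values
query_params = parse_qs(parts.query)

print(base_url)
print(query_params)
```

Each value comes back as a list because a query string may repeat the same parameter name more than once.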

In [7]:
page.status_code

200

After running our request, we get a `Response` object. This object has a `status_code` property, which indicates whether the page was downloaded successfully. A `status_code` of 200 means that it was.
- a status code starting with a 2 generally indicates success
- a code starting with a 4 or a 5 indicates an error

Specifically:
- 200: everything went okay, and the result has been returned (if any).
- 301: the server is redirecting you to a different endpoint. This can happen when a company switches domain names, or an endpoint name is changed.
- 400: the server thinks you made a bad request. This can happen when you don’t send along the right data, among other things.
- 401: the server thinks you’re not authenticated. This happens when you don’t send the right credentials to access an API.
- 403: the resource you’re trying to access is forbidden, because you don’t have the right permissions to see it.
- 404: the resource you tried to access wasn’t found on the server.
- ...
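
The rule of thumb above (2xx means success, 4xx or 5xx means an error) can be sketched with the standard library's `http.HTTPStatus` enum. The helper name `classify_status` and its category labels are hypothetical choices made for this illustration; they are not part of `requests`.

```python
from http import HTTPStatus


def classify_status(code: int) -> str:
    """Return a coarse category for an HTTP status code (illustrative helper)."""
    if 200 <= code < 300:
        return "success"
    if 300 <= code < 400:
        return "redirect"
    if 400 <= code < 500:
        return "client error"
    if 500 <= code < 600:
        return "server error"
    return "other"


# Look up the standard reason phrase for each code listed above
for code in (200, 301, 400, 401, 403, 404):
    print(code, HTTPStatus(code).phrase, "->", classify_status(code))
```

On a `requests` response, `page.ok` is `True` for status codes below 400, and `page.raise_for_status()` raises an exception for 4xx and 5xx codes.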

In [10]:
page2 = requests.get("http://api.open-notify.org/iss-pass")
page2.status_code

407

A 407 status means "Proxy Authentication Required": the request was rejected by a proxy between the client and the server, not by the API itself.

### 02_Query parameters

The ISS Pass endpoint returns the times at which the ISS will next pass over a given location on Earth. To compute this, the API needs the coordinates of the location, which we pass as two parameters: latitude and longitude.

We can do this by adding an optional keyword argument `params` to our request. In this case, there are two parameters we need to pass:
- lat — The latitude of the location we want.
- lon — The longitude of the location we want.

We can make a dictionary with these parameters, and then pass them into the `requests.get` function.

In [15]:
# Set up the parameters we want to pass to the API.
# This is the latitude and longitude of New York City.
parameters = {"lat": 40.71, "lon": -74}

# Make a get request with the parameters.
page3 = requests.get("http://api.open-notify.org/iss-pass.json", params=parameters)
page3.status_code

200
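
To see what a `params` dictionary corresponds to in the final URL, here is a minimal sketch that builds the query string with the standard library instead of sending a request. It approximates the encoding that `requests` performs; the variable names `query_string` and `full_url` are chosen here for illustration.

```python
from urllib.parse import urlencode

base_url = "http://api.open-notify.org/iss-pass.json"
parameters = {"lat": 40.71, "lon": -74}

# Turn the dictionary into a "key=value&key=value" query string
query_string = urlencode(parameters)

# Join the base URL and the query string with "?"
full_url = f"{base_url}?{query_string}"
print(full_url)
```

After a real request, the URL that was actually used is available as `page3.url`.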

In [16]:
# Get the response data as a Python object. Verify that it's a dictionary.
data = page3.json()
print(type(data))
print(data)

<class 'dict'>
{'message': 'success', 'request': {'altitude': 100, 'datetime': 1595446695, 'latitude': 40.71, 'longitude': -74.0, 'passes': 5}, 'response': [{'duration': 460, 'risetime': 1595452524}, {'duration': 651, 'risetime': 1595458199}, {'duration': 613, 'risetime': 1595464040}, {'duration': 559, 'risetime': 1595469931}, {'duration': 606, 'risetime': 1595475772}]}
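
The `risetime` values in this output are easier to read as dates. The sketch below copies the `response` list from the output above and assumes that `risetime` is a Unix timestamp in seconds and `duration` is a length in seconds; the name `passes` is chosen here for illustration.

```python
from datetime import datetime, timezone

# The 'response' part of the dictionary shown above
data = {
    "response": [
        {"duration": 460, "risetime": 1595452524},
        {"duration": 651, "risetime": 1595458199},
        {"duration": 613, "risetime": 1595464040},
        {"duration": 559, "risetime": 1595469931},
        {"duration": 606, "risetime": 1595475772},
    ]
}

# Convert each timestamp to a timezone-aware UTC datetime
passes = [
    (datetime.fromtimestamp(p["risetime"], tz=timezone.utc), p["duration"])
    for p in data["response"]
]

for rise, duration in passes:
    print(rise.isoformat(), f"{duration} s")
```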


### 03_Beautiful_Soup

In [17]:
from bs4 import BeautifulSoup
import lxml  # not used directly; confirms the 'lxml' parser backend is installed

In [18]:
links = ['https://h1bdata.info/index.php?em=&job=Data+Scientist&year=All+Years',
         'https://h1bdata.info/index.php?em=&job=Data+Engineer&year=All+Years',
         'https://h1bdata.info/index.php?em=&job=Data+Analyst&year=All+Years',
        ]

In [21]:
# Scrape the table rows from each of the above links and store them in a list

all_data = []
for link in links:
    # Download the page (the timeout is given in seconds)
    page_response = requests.get(link, timeout=1000)
    page_content = BeautifulSoup(page_response.content, 'lxml')

    # Skip the header row, then collect the text of every cell in each row
    for row in page_content.find_all('tr')[1:]:
        row_data = [cell.text for cell in row.find_all('td')]
        all_data.append(row_data)
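
To see what the row-extraction step does without touching the network, here is a minimal sketch on a small hand-written table. The HTML and its values are invented for illustration, and the built-in `html.parser` backend is used so the example does not depend on `lxml`.

```python
from bs4 import BeautifulSoup

# A tiny, made-up HTML table standing in for a downloaded page
html = """
<table>
  <tr><th>EMPLOYER</th><th>JOB TITLE</th><th>BASE SALARY</th></tr>
  <tr><td>EXAMPLE CORP</td><td>DATA SCIENTIST</td><td>100,000</td></tr>
  <tr><td>SAMPLE LLC</td><td>DATA ANALYST</td><td>80,000</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

rows = []
# find_all('tr') returns every table row; [1:] skips the header row
for tr in soup.find_all("tr")[1:]:
    # Collect the text of each <td> cell, stripped of surrounding whitespace
    rows.append([td.get_text(strip=True) for td in tr.find_all("td")])

print(rows)
```

Selecting only the `<td>` elements avoids picking up the whitespace between cells, which would otherwise show up as extra entries in each row.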

In [25]:
page_response.content



In [28]:
len(all_data)

19718

In [29]:
all_data[0:3]

[['PERCOLATA CORPORATION',
  'DATA SCIENTIST',
  '46,060',
  'PALO ALTO, CA',
  '03/18/2016',
  '09/02/2016',
  'CERTIFIED'],
 ['MY LIFE REGISTRY LLC',
  'DATA SCIENTIST',
  '47,960',
  'FORT LEE, NJ',
  '02/18/2015',
  '08/20/2015',
  'CERTIFIED'],
 ['MY LIFE REGISTRY LLC',
  'DATA SCIENTIST',
  '47,960',
  'FORT LEE, NJ',
  '02/18/2015',
  '08/20/2015',
  'CERTIFIED']]
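
Each scraped row is a list of strings, so a common next step is to turn it into a typed record. The sketch below uses two of the rows shown above. The column names are assumptions inferred from the order of the values, not read from the page, and `clean_row` is a hypothetical helper.

```python
from datetime import datetime

# Assumed column names, inferred from the order of the scraped values
COLUMNS = ["employer", "job_title", "base_salary", "location",
           "submit_date", "start_date", "case_status"]

sample_rows = [
    ["PERCOLATA CORPORATION", "DATA SCIENTIST", "46,060", "PALO ALTO, CA",
     "03/18/2016", "09/02/2016", "CERTIFIED"],
    ["MY LIFE REGISTRY LLC", "DATA SCIENTIST", "47,960", "FORT LEE, NJ",
     "02/18/2015", "08/20/2015", "CERTIFIED"],
]


def clean_row(row):
    """Convert one scraped row into a dictionary with typed values."""
    record = dict(zip(COLUMNS, row))
    # "46,060" -> 46060
    record["base_salary"] = int(record["base_salary"].replace(",", ""))
    # Dates appear to be in month/day/year order
    for key in ("submit_date", "start_date"):
        record[key] = datetime.strptime(record[key], "%m/%d/%Y").date()
    return record


records = [clean_row(row) for row in sample_rows]
print(records[0]["base_salary"], records[0]["start_date"])
```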

### Tutorials
- Space ISS example: https://www.dataquest.io/blog/python-api-tutorial/
- Beautiful Soup web scraper example: https://realpython.com/beautiful-soup-web-scraper-python/

We have now successfully downloaded a webpage as HTML and can extract the content of the page.