# Web Scraping 
Web scraping is the process of gathering information from the Internet. Some websites don’t like it when automatic scrapers gather their data, while others don’t mind.

Scraping generally involves two steps:
- Retrieve the content of a web page with the `requests` library.
- Extract the relevant information with `bs4` (Beautiful Soup).
    - We create a Beautiful Soup object from the returned content and parse it with the object's search methods.

In [1]:
import requests

## 01_Download a Webpage
The first thing we’ll need to do to scrape a web page is to download the page. 
The requests library will make a `GET` request to a web server, which will download the HTML contents of a given web page for us.

In [7]:
# Make the GET request to a url
URL = 'https://www.capitalone.ca/credit-cards/guaranteed-mastercard/?filter=all'
page = requests.get(URL)

You can deconstruct the above URL into two main parts:
- The base URL is the address of the page itself: https://www.capitalone.ca/credit-cards/guaranteed-mastercard/
- The query string `?filter=all` carries additional parameters (here `filter` with the value `all`) that the server can use to adjust what the page shows.
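To see the two parts programmatically, here is a minimal sketch using the standard library's `urllib.parse`. This module is not used elsewhere in the notebook, and the variable names `parts`, `base_url` and `query_params` are only illustrative.

```python
from urllib.parse import urlsplit, parse_qs

URL = 'https://www.capitalone.ca/credit-cards/guaranteed-mastercard/?filter=all'

parts = urlsplit(URL)

# Everything before the "?" identifies the page itself.
base_url = f"{parts.scheme}://{parts.netloc}{parts.path}"

# Everything after the "?" is the query string; parse_qs turns it into a
# dict that maps each parameter name to a list of values.
query_params = parse_qs(parts.query)

print(base_url)
print(query_params)
```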

This code performs an HTTP request to the given URL. It retrieves the HTML data that the server sends back and stores that data in a Python object.
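In practice a request can fail (no connection, a malformed URL, an error status), so it can help to wrap the download in a small function. The following is a sketch under assumptions, not the notebook's own method: the helper name `download_html` and the 10-second timeout are choices made for this example.

```python
import requests

def download_html(url, timeout=10):
    """Return the page's HTML as text, or None if the request fails."""
    try:
        response = requests.get(url, timeout=timeout)
        response.raise_for_status()   # turn 4xx/5xx responses into exceptions
    except requests.exceptions.RequestException as exc:
        print(f"Request failed: {exc}")
        return None
    return response.text
```

Calling `download_html(URL)` would then give either the HTML as a string or `None`, which the caller can check before parsing.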

In [45]:
type(page)

requests.models.Response

In [3]:
page.status_code

200

After running our request, we get a `Response` object. Its `status_code` attribute indicates whether the page was downloaded successfully; a `status_code` of 200 means it was.
- A status code starting with 2 generally indicates success.
- A code starting with 3 indicates a redirect.
- A code starting with 4 or 5 indicates an error.

Specifically:
- 200: everything went okay, and the result has been returned (if any).
- 301: the server is redirecting you to a different endpoint. This can happen when a company switches domain names or an endpoint is renamed.
- 400: the server thinks you made a bad request. This can happen when you don't send along the right data, among other things.
- 401: the server thinks you're not authenticated. This happens when you don't send the right credentials to access an API.
- 403: the resource you're trying to access is forbidden; you don't have the right permissions to see it.
- 404: the resource you tried to access wasn't found on the server.
- ...
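
The first digit of the code is enough for a rough classification. Here is a small sketch using the standard library's `http.HTTPStatus`; the helper name `describe_status` is made up for this example.

```python
from http import HTTPStatus

def describe_status(code):
    """Return a short, human-readable description of an HTTP status code."""
    categories = {2: "success", 3: "redirection", 4: "client error", 5: "server error"}
    category = categories.get(code // 100, "other")
    try:
        phrase = HTTPStatus(code).phrase   # standard reason phrase for the code
    except ValueError:
        phrase = "Unknown"                 # code is not in the standard registry
    return f"{code} {phrase} ({category})"

for code in (200, 301, 400, 401, 403, 404):
    print(describe_status(code))
```

With `requests`, calling `page.raise_for_status()` raises an exception for any 4xx or 5xx response, which is often simpler than checking `status_code` by hand.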

In [13]:
# look into all attributes of the page  
# look into requests.get

In [51]:
requests.get?

Signature: requests.get(url, params=None, **kwargs)
Docstring:
Sends a GET request.

:param url: URL for the new :class:`Request` object.
:param params: (optional) Dictionary, list of tuples or bytes to send
    in the query string for the :class:`Request`.
:param \*\*kwargs: Optional arguments that ``request`` takes.
:return: :class:`Response <Response>` object
:rtype: requests.Response
File:      ~/.conda/envs/h1b/lib/python3.7/site-packages/requests/api.py
Type:      function


## 02_Query parameters

The Open Notify ISS Pass endpoint reports when the International Space Station (ISS) will next pass over a given location on Earth. To compute this, the API needs the coordinates of the location, which we can supply through the optional `params` keyword argument of `requests.get`. Two parameters are required:
- `lat`: the latitude of the location we want.
- `lon`: the longitude of the location we want.

We can put these parameters in a dictionary and pass it to `requests.get`. The first two cells below show what happens otherwise: the wrong endpoint name returns 404, and the right endpoint without parameters returns 400.

In [60]:
page2 = requests.get("http://api.open-notify.org/iss-pass")
page2.status_code

404

In [61]:
page2 = requests.get("http://api.open-notify.org/iss-pass.json")
page2.status_code

400

In [62]:
# Set up the parameters we want to pass to the API.
# This is the latitude and longitude of New York City.
location_dict = {"lat": 40.71, "lon": -74}

# Make a get request with the parameters.
page3 = requests.get("http://api.open-notify.org/iss-pass.json", params=location_dict)
page3.status_code

200

In [63]:
page3.url

'http://api.open-notify.org/iss-pass.json?lat=40.71&lon=-74'
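
The query string in this URL comes from URL-encoding the `params` dictionary. For a simple dictionary like this one, the standard library's `urllib.parse.urlencode` produces the same string, so the following sketch shows the construction without sending a request. The names `base`, `query` and `full_url` are illustrative.

```python
from urllib.parse import urlencode

base = "http://api.open-notify.org/iss-pass.json"
location_dict = {"lat": 40.71, "lon": -74}

# Each key/value pair becomes "key=value"; pairs are joined with "&".
query = urlencode(location_dict)
full_url = f"{base}?{query}"

print(full_url)
```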

In [16]:
# Get the response data as a python object. Verify that it's a dictionary.
data = page3.json()
print(type(data))
print(data)

<class 'dict'>
{'message': 'success', 'request': {'altitude': 100, 'datetime': 1595605325, 'latitude': 40.71, 'longitude': -74.0, 'passes': 5}, 'response': [{'duration': 604, 'risetime': 1595625283}, {'duration': 646, 'risetime': 1595631058}, {'duration': 577, 'risetime': 1595636939}, {'duration': 571, 'risetime': 1595642812}, {'duration': 641, 'risetime': 1595648627}]}
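
The `risetime` values look like Unix timestamps (seconds since 1970-01-01 UTC), and the `duration` values are assumed here to be in seconds. Under those assumptions, the response can be turned into readable values with the standard library. In this sketch, `data` is copied from the output above and trimmed to the fields used, and `summarize_passes` is a hypothetical helper.

```python
from datetime import datetime, timezone

# Copied from the response printed above, trimmed to the fields used here.
data = {
    "message": "success",
    "response": [
        {"duration": 604, "risetime": 1595625283},
        {"duration": 646, "risetime": 1595631058},
        {"duration": 577, "risetime": 1595636939},
        {"duration": 571, "risetime": 1595642812},
        {"duration": 641, "risetime": 1595648627},
    ],
}

def summarize_passes(payload):
    """Return (rise time as an aware UTC datetime, duration in seconds) pairs."""
    return [
        (datetime.fromtimestamp(p["risetime"], tz=timezone.utc), p["duration"])
        for p in payload["response"]
    ]

passes = summarize_passes(data)
for rise, seconds in passes:
    print(rise.isoformat(), f"{seconds / 60:.1f} min")
```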



Unlike the tidy JSON returned by the API, the HTML of the page downloaded in the first section looks like a huge mess: there are tons of HTML elements, thousands of attributes scattered around, and some JavaScript mixed in as well.

## 03_Beautiful_Soup
Beautiful Soup is a Python library for parsing structured data. 

It allows you to interact with HTML in a similar way to how you would interact with a web page using developer tools.

In [20]:
from bs4 import BeautifulSoup
import lxml

First, we create a `BeautifulSoup` object that takes the HTML content of a downloaded page as its input:

In [64]:
# Download the page, then hand its raw HTML to Beautiful Soup
page4 = requests.get('https://h1bdata.info/index.php?em=Capital+One+Services+Llc&job=Data&city=&year=2020')
soup = BeautifulSoup(page4.content, 'html.parser')

### Find Elements by ID

In an HTML web page, every element can have an id attribute assigned. As the name already suggests, that id attribute makes the element uniquely identifiable on the page. You can begin to parse your page by selecting a specific element by its ID. Beautiful Soup allows you to find that specific element easily by its ID.

For easier viewing, you can `.prettify()` any Beautiful Soup object when you print it out. If you call this method on the `results` variable assigned in the next cell, you will see all the HTML contained within the `<div>`.

In [66]:
results = soup.find(id='joblink')
print(results.prettify())

<div class="alert alert-success alert-dismissible" id="joblink" role="alert" style="display:none;">
 <button aria-label="Close" class="close" data-dismiss="alert" type="button">
  <span aria-hidden="true">
   ×
  </span>
 </button>
 <div align="left" style="font-size:18px;">
  We have found
  <a href="http://www.indeed.com/jobs?q=company:capital one services llc title:Data&amp;l=&amp;indpubnum=7749215865220997" onclick="ga('send', 'event', 'Indeed Link', 'to-list-page', 'company:capital one services llc title:Data');" rel="nofollow" target="indeed_search">
   job openings
  </a>
  of
  <b>
   Data
  </b>
  job from
  <b>
   Capital One Services Llc
  </b>
  .
 </div>
</div>
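
`find` returns `None` when no element has the requested id, so it is worth checking the result before calling methods on it. Here is a minimal sketch on a small inline HTML string; the markup is a shortened, made-up version of the output above, and the id `does-not-exist` is deliberately absent.

```python
from bs4 import BeautifulSoup

html = """
<div id="joblink" role="alert">
  <div>We have found <a href="#">job openings</a> of <b>Data</b></div>
</div>
"""
demo_soup = BeautifulSoup(html, "html.parser")

found = demo_soup.find(id="joblink")
missing = demo_soup.find(id="does-not-exist")

if found is not None:
    # Attributes are available with dictionary-style access.
    print(found["role"])
    # get_text collects the text of all nested elements.
    print(found.get_text(" ", strip=True))

print(missing)
```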



### Find Elements by Class Name and Text Content

In [27]:
soup.find_all?

Signature:
soup.find_all(
    name=None,
    attrs={},
    recursive=True,
    text=None,
    limit=None,
    **kwargs,
)
Docstring:
Extracts a list of Tag objects that match the given
criteria.  You can specify the name of the Tag and any
attributes you want the Tag to have.

The value of a key-value pair in the 'attrs' map can be a
string, a list of strings, a regular expression object, or a
callable that takes a string and returns whether or not the
string matches for some custom definition of '
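
Besides a tag name, `find_all` can filter on a CSS class and on an element's text. The sketch below uses a made-up two-row table: the class names `title` and `status` and the second row are invented for illustration and do not come from the scraped page.

```python
from bs4 import BeautifulSoup

html = """
<table>
  <tr><td class="title">DATA ANALYSIS MANAGER</td><td class="status">CERTIFIED</td></tr>
  <tr><td class="title">DATA ENGINEER</td><td class="status">WITHDRAWN</td></tr>
</table>
"""
demo_soup = BeautifulSoup(html, "html.parser")

# class_ (with a trailing underscore) filters on the CSS class,
# because "class" is a reserved word in Python.
titles = [td.get_text() for td in demo_soup.find_all("td", class_="title")]

# string= matches on the text inside an element; here an exact match.
certified = demo_soup.find_all("td", string="CERTIFIED")

# A function can be passed for looser matching, e.g. a substring test.
managers = demo_soup.find_all("td", string=lambda s: s is not None and "MANAGER" in s)

print(titles)
print(certified)
print(managers)
```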

In [67]:
results2 = soup.find_all('tr')
len(results2)

7

In [70]:
results2[0:3]

[<tr><th>EMPLOYER</th><th>JOB TITLE</th><th>BASE SALARY</th><th>LOCATION</th><th data-date-format="mm/dd/yy">SUBMIT DATE</th><th data-date-format="mm/dd/yy">START DATE</th><th>CASE STATUS</th></tr>,
 <tr><td><a href="index.php?em=CAPITAL+ONE+SERVICES+LLC&amp;job=DATA&amp;city=&amp;year=2020">CAPITAL ONE SERVICES LLC</a></td><td><a href="index.php?em=CAPITAL+ONE+SERVICES+LLC&amp;job=DATA+ANALYSIS+MANAGER&amp;city=&amp;year=2020">DATA ANALYSIS MANAGER</a></td><td>75,171</td><td><a href="index.php?em=CAPITAL+ONE+SERVICES+LLC&amp;job=DATA&amp;city=RICHMOND&amp;year=2020">RICHMOND, VIRGINIA</a></td><td>02/25/2020</td><td>06/23/2020</td><td>CERTIFIED</td></tr>,
 <tr><td><a href="index.php?em=CAPITAL+ONE+SERVICES+LLC&amp;job=DATA&amp;city=&amp;year=2020">CAPITAL ONE SERVICES LLC</a></td><td><a href="index.php?em=CAPITAL+ONE+SERVICES+LLC&amp;job=DATA+ANALYSIS+MANAGER&amp;city=&amp;year=2020">DATA ANALYSIS MANAGER</a></td><td>75,171</td><td><a href="index.php?em=CAPITAL+ONE+SERVICES+LLC&amp;job

In [72]:
results2[1].text

'CAPITAL ONE SERVICES LLCDATA ANALYSIS MANAGER75,171RICHMOND, VIRGINIA02/25/202006/23/2020CERTIFIED'
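
Calling `.text` on a whole row concatenates the cells with nothing in between, which makes the fields hard to separate. One way around this is to read each `<td>` on its own, or to pass a separator to `get_text`. The sketch rebuilds the row from the output above as an inline string, with the links removed for brevity.

```python
from bs4 import BeautifulSoup

row_html = (
    "<table><tr><td>CAPITAL ONE SERVICES LLC</td><td>DATA ANALYSIS MANAGER</td>"
    "<td>75,171</td><td>RICHMOND, VIRGINIA</td><td>02/25/2020</td>"
    "<td>06/23/2020</td><td>CERTIFIED</td></tr></table>"
)
row = BeautifulSoup(row_html, "html.parser").find("tr")

# One list entry per cell keeps the fields apart.
cells = [td.get_text(strip=True) for td in row.find_all("td")]

# Alternatively, keep a single string but insert a visible separator.
joined = row.get_text(separator=" | ", strip=True)

print(cells)
print(joined)
```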

In [73]:
links = ['https://h1bdata.info/index.php?em=Capital+One+Services+Llc&job=Data&city=&year=All+Years',
         'https://h1bdata.info/index.php?em=Capital+One+Services+Llc&job=Business&city=&year=All+Years', ]

In [74]:
# Scrape table data from each of the above links and store it in a list

all_data = []
for link in links:
    page_response = requests.get(link, timeout=10)  # timeout is in seconds
    page_content = BeautifulSoup(page_response.content, 'lxml')

    # Skip the header row, then collect the text of every cell in each row
    for row in page_content.find_all('tr')[1:]:
        row_data = [cell.text for cell in row.find_all('td')]
        all_data.append(row_data)

In [75]:
len(all_data)

289

In [76]:
all_data[0:3]

[['CAPITAL ONE SERVICES LLC',
  'DATA ANALYSIS MANAGER',
  '64,355',
  'PLANO, TEXAS',
  '11/22/2019',
  '12/02/2019',
  'CERTIFIED'],
 ['CAPITAL ONE SERVICES LLC',
  'DATA ANALYSIS MANAGER',
  '64,355',
  'PLANO, TEXAS',
  '11/22/2019',
  '12/02/2019',
  'CERTIFIED'],
 ['CAPITAL ONE SERVICES LLC',
  'DATA ANALYSIS MANAGER',
  '64,355',
  'PLANO, TEXAS',
  '11/22/2019',
  '12/02/2019',
  'CERTIFIED']]
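
Each row is still a plain list of strings. As a possible next step, the rows can be paired with the column names from the table header, and the salary and dates converted to proper types. This is a sketch only: `HEADER` is copied from the first `<tr>` shown earlier, `sample_rows` repeats one row of the output above, and `to_record` is a hypothetical helper that assumes every row has all seven fields in the `mm/dd/yyyy` date format seen here.

```python
from datetime import datetime

HEADER = ["EMPLOYER", "JOB TITLE", "BASE SALARY", "LOCATION",
          "SUBMIT DATE", "START DATE", "CASE STATUS"]

sample_rows = [
    ["CAPITAL ONE SERVICES LLC", "DATA ANALYSIS MANAGER", "64,355",
     "PLANO, TEXAS", "11/22/2019", "12/02/2019", "CERTIFIED"],
]

def to_record(row):
    """Pair a scraped row with the column names and convert its types."""
    record = dict(zip(HEADER, row))
    # "64,355" -> 64355
    record["BASE SALARY"] = int(record["BASE SALARY"].replace(",", ""))
    # "11/22/2019" -> datetime.date(2019, 11, 22)
    for key in ("SUBMIT DATE", "START DATE"):
        record[key] = datetime.strptime(record[key], "%m/%d/%Y").date()
    return record

records = [to_record(r) for r in sample_rows]
print(records[0])
```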

### Tutorials
- Space ISS example: https://www.dataquest.io/blog/python-api-tutorial/
- bs4 tutorial: https://realpython.com/beautiful-soup-web-scraper-python/

We have now downloaded a web page as HTML and extracted the content of the page with Beautiful Soup.