# Web Scraping With Python And BeautifulSoup

## Learning Outcomes

- Understand the benefits and use cases of web scraping.
- Learn how to parse the HTML content of a webpage using BeautifulSoup to extract specific elements.
- Learn how to scan the HTML for specific keywords.
- Learn how to scrape multiple webpages.
- Learn how to use the pandas .apply() method to extract multiple elements into new columns.

Potential extras:
- Learn how to scrape dynamically rendered HTML content with a headless browser.

-----------------------------------------------------------------

The following install commands are written for a Jupyter Notebook. If you are using a command line instead, simply leave out the leading ! symbol.

~~~
!pip install beautifulsoup4
!pip install requests

~~~

In [14]:
# Library Imports
import pandas as pd
from bs4 import BeautifulSoup
import requests

--------------------------------------------------------

## Why Learn Web Scraping?

Web scraping is a useful skill, whether you work as a programmer, marketer or analyst. It's a fantastic way to analyse websites. Web scraping should never replace a tool such as ScreamingFrog. However, when you're creating Python or JavaScript scripts and data pipelines, you'll likely want to write a custom scraper.

After all, what's the point of doing a full website crawl if you only need a few pieces of information?

------------------------------------------------------------

Once you have acquired advanced web scraping skills, you can:
    
- Accurately monitor your competitors.
- Create data pipelines that push fresh HTML data into a data warehouse such as BigQuery.
- Blend that data with other data sources such as Google Search Console data or Google Analytics data.
- Create your own APIs for websites that don't publicly have an API.

There are many other reasons why [web scraping](https://understandingdata.com/what-is-web-scraping/) is a powerful skill to possess.

------------------------------------------------------------------------------------

## Challenges of Web Scraping

Firstly, every website is different, which makes it difficult to build a robust web scraper that will work on every website. You'll likely need to create unique selectors for each website, which can be time-consuming.

Secondly, your scripts are likely to fail over time because websites change. Whenever a marketer, owner or developer makes changes to their website, it could lead to your script breaking. Therefore, for larger projects it's essential that you create a monitoring system so that you can fix these problems as they arise.

------------------------------------------------------------

## How To Web Scrape A Single HTML Page:

To scrape a web page in Python, or in any other programming language, we first need to download the HTML content.

The library that we'll be using is [requests](https://requests.readthedocs.io/en/master/). 

In [50]:
url = 'https://www.indeed.co.uk/jobs?q=data%20scientist&l=london&start=40&advn=2102673149993430&vjk=40339845379bc411'

response = requests.get(url)

In [51]:
print(response)

<Response [200]>


As long as the status code is 200 (which means OK), we'll be able to access the web page. You can always check the status code with:

~~~

print(response.status_code)

~~~

In [52]:
if response.status_code == 200:
    print(response)

<Response [200]>


To access the content of a request, simply use:
    
~~~

response.content

~~~

![HTML content of the response](https://understandingdata.com/wp-content/uploads/2020/11/html_content.png)

In [53]:
# This will store the HTML content as a stream of bytes:
html_content = response.content

# This will store the HTML content as a string:
html_content_string = response.text

--------------------------------------------------------------------------------

### Passing the HTML Content to a Parser

Simply downloading the HTML page is not enough, particularly if we would like to extract elements from it. Therefore we will use a Python package called [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/), which provides a large number of methods for parsing the DOM (Document Object Model).

This is very useful when we'd like to extract specific elements from the page.

To parse the DOM of a page, simply use:



In [55]:
soup = BeautifulSoup(html_content, 'html.parser')

In [56]:
help(soup)

Help on BeautifulSoup in module bs4 object:

class BeautifulSoup(bs4.element.Tag)
 |  BeautifulSoup(markup='', features=None, builder=None, parse_only=None, from_encoding=None, exclude_encodings=None, element_classes=None, **kwargs)
 |  
 |  A data structure representing a parsed HTML or XML document.
 |  
 |  Most of the methods you'll call on a BeautifulSoup object are inherited from
 |  PageElement or Tag.
 |  
 |  Internally, this class defines the basic interface called by the
 |  tree builders when converting an HTML/XML document into a data
 |  structure. The interface abstracts away the differences between
 |  parsers. To write a new tree builder, you'll need to understand
 |  these methods as a whole.
 |  
 |  These methods will be called by the BeautifulSoup constructor:
 |    * reset()
 |    * feed(markup)
 |  
 |  The tree builder may call these methods from its feed() implementation:
 |    * handle_starttag(name, attrs) # See note about return value
 |    * handle_endtag(n

We can now see that instead of an HTML string, we have a BeautifulSoup object with a range of methods available on it!

------------------------------------------------------------------------

In our example, we'll be scraping Indeed and extracting job information.

* The job will be: data scientist.
* The area will be London.

## Investigate The URL

url = 'https://www.indeed.co.uk/jobs?q=data%20scientist&l=london&start=40&advn=2102673149993430&vjk=40339845379bc411'

There can be a lot of information inside a URL. It's important to be able to identify the structure of URLs and to reverse engineer how they might have been created.

1. <strong> The base URL </strong> is the path to the jobs functionality of the website, which in this case is: https://www.indeed.co.uk/jobs
2. <strong> Query Parameters </strong> are a way for the jobs search to be dynamic. In the above example they are: 
<strong> ?q=data%20scientist&l=london&start=40&advn=2102673149993430&vjk=40339845379bc411</strong>

Query parameters consist of:
- The start of the query string, marked by a question mark (?).
- A key and value for each query parameter (e.g. l=london or start=40).
- A separator, the ampersand symbol (&), placed between each key and value pair.
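
The breakdown above can be checked in code. The following is a minimal sketch using Python's built-in `urllib.parse` module, which is not otherwise used in this tutorial. It treats everything before the `?` as the base URL and decodes the query string into keys and values.

```python
from urllib.parse import urlsplit, parse_qs

url = ('https://www.indeed.co.uk/jobs?q=data%20scientist&l=london'
       '&start=40&advn=2102673149993430&vjk=40339845379bc411')

parts = urlsplit(url)

# Everything before the "?" is treated as the base URL here.
base_url = f'{parts.scheme}://{parts.netloc}{parts.path}'

# parse_qs decodes the query string into a dict of lists,
# because the same key is allowed to appear more than once.
query_params = parse_qs(parts.query)

print(base_url)               # https://www.indeed.co.uk/jobs
print(query_params['q'])      # ['data scientist']
print(query_params['start'])  # ['40']
```

Notice that `%20` is decoded back into a space, which is why the value of `q` reads as `data scientist`.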

------------------------------------------------------------------------------------------

## Visually Inspect The Webpage In Google Chrome Dev Tools

Before jumping straight into coding, it's worthwhile visually inspecting the HTML page content within your browser. This will give you a sense of how the website is constructed and what repeating patterns you can see within the HTML.

Google Chrome Developer Tools is a freely available tool that allows you to visually inspect the HTML code. To open it:

1. Open up Google Chrome.
2. Right click on a webpage.
3. Click inspect.

![alt text](https://understandingdata.com/wp-content/uploads/2020/11/google-chrome-dev-tools.png)

------------------------------------------------------------------------------

![alt text](https://understandingdata.com/wp-content/uploads/2020/11/google-chrome-dev-tools-2.png)

------------------------------------------------------------------------

### Find Elements By HTML ID

It is possible to select a specific HTML element by its <strong> id attribute</strong>, which is the attribute that the #id CSS selector targets.

In [64]:
appPromoBanner = soup.findAll('div', {'id':'appPromoBanner'})
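
As a rough sketch of the two ways to look up an ID, the example below runs against a small hypothetical HTML snippet (the markup and the `resultsCol` ID are made up for illustration, so no network request is needed). It assumes you only want a single element, so it uses `find()` and `select_one()` rather than `findAll()`.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet, used so the example runs without a request.
sample_html = '''
<html><body>
  <div id="appPromoBanner">Get the app</div>
  <div id="resultsCol">Job results</div>
</body></html>
'''

sample_soup = BeautifulSoup(sample_html, 'html.parser')

# find() returns the first matching element, or None when nothing matches.
banner = sample_soup.find('div', id='appPromoBanner')

# select_one() accepts a CSS selector, so the #id syntax works here.
banner_css = sample_soup.select_one('#appPromoBanner')

# An ID that is not on the page gives None rather than raising an error.
missing = sample_soup.find('div', id='doesNotExist')

print(banner.text)
print(banner_css.get('id'))
print(missing)
```

Because an ID should be unique on a page, `find()` is often more convenient than `findAll()`, which always returns a list.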

----------------------------------------------------------------------------

## Find Elements By HTML Class Name

Alternatively, you can find elements by their class name:

In [72]:
container_div = soup.findAll('div', class_='tab-container')

In [73]:
len(container_div)

15
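
To show how class matching behaves, here is a small sketch on a hypothetical HTML snippet (the markup is made up for illustration). It shows that `class_` matches an element even when that element carries additional classes, and that a CSS selector with a leading dot gives the same result.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet; the markup is invented for this example.
sample_html = '''
<div class="tab-container active">First</div>
<div class="tab-container">Second</div>
<div class="sidebar">Third</div>
'''

sample_soup = BeautifulSoup(sample_html, 'html.parser')

# class_ has a trailing underscore because "class" is a reserved word in Python.
# It matches any element whose list of classes contains "tab-container".
containers = sample_soup.find_all('div', class_='tab-container')

# The equivalent CSS selector uses a leading dot.
containers_css = sample_soup.select('div.tab-container')

print(len(containers))                       # 2
print([div.text for div in containers_css])  # ['First', 'Second']
```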

--------------------------------------------------------------------------------------------

## How To Extract Text From HTML Elements

As well as selecting the entire HTML element, you can also easily extract the text using BeautifulSoup. 

Let's see how this might work whilst scraping a specific job advertisement:

----------------------------------------------------

In [74]:
job_url = 'https://www.indeed.co.uk/m/viewjob?jk=c4fa65390861f13b&from=serp'
resp = requests.get(job_url)
soup = BeautifulSoup(resp.content, 'html.parser')

In [86]:
# First we grab the title tag and then use .text to access the element's text:

In [84]:
title_tag_text = soup.title.text

In [85]:
print(title_tag_text)

Data Scientist - London - Indeed.co.uk


Or we can extract the first paragraph on the webpage, then get the text for that element:

In [88]:
first_paragraph = soup.find('p')
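
Getting the text out of that first paragraph could look like the sketch below. It uses a hypothetical HTML snippet in place of the downloaded job page, and it guards against `find()` returning `None`, which happens when a page has no paragraph at all.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet standing in for a downloaded job page.
sample_html = ('<html><body><p> Data Scientist <span>London</span> </p>'
               '<p>Second paragraph</p></body></html>')
sample_soup = BeautifulSoup(sample_html, 'html.parser')

first_paragraph = sample_soup.find('p')

# find() returns None when there is no match, so check before reading .text.
if first_paragraph is not None:
    # .text joins the text of the element and its children;
    # .strip() removes the surrounding whitespace.
    first_paragraph_text = first_paragraph.text.strip()
else:
    first_paragraph_text = ''

print(first_paragraph_text)  # Data Scientist London
```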

----------------------------------------------------------------------------------------

## How To Extract Multiple HTML Elements

Sometimes you'll want to store multiple elements, for example when there is a list of job advertisements on the same page. The following method returns a list of elements rather than just the first element:

~~~

soup.findAll(some_element)

~~~

In [91]:
all_paragraphs = soup.findAll('p')

In [92]:
print(all_paragraphs[0:3])

[<p><b><font size="+1">Data Scientist</font></b><br/>
Deutsche Bank - <span class="location">London</span>
</p>, <p></p>, <p><b>Job Title: </b>Data Scientist
</p>]


If we wanted to extract the text of every paragraph element, we could use a list comprehension:

In [97]:
all_paragraphs_text = [paragraph.text.strip() for paragraph in all_paragraphs]

-------------------------------------------------------

It's also possible to clean the data by removing any paragraphs that are empty strings:

In [99]:
# Keep only the paragraphs that are not empty strings:
full_paragraphs = [paragraph for paragraph in all_paragraphs_text if paragraph]

In [100]:
print(len(full_paragraphs))

37


--------------------------------------------------------------------------------------------------------------

### Exercises

- Try to extract all of the H3s and H4s.
- Try to extract the meta description of the page.

-------------------------------------------------------------------------------------------

## How To Web Scrape Multiple HTML Pages:
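
One possible approach, sketched below, is to generate a URL for each results page by changing the `start` query parameter and then request each URL in turn. Several things here are assumptions rather than facts about Indeed: the page size of 10 results per page, the hypothetical helper names `build_page_urls` and `scrape_page_titles`, and the one second delay between requests.

```python
import time
from urllib.parse import quote, urlencode

import requests
from bs4 import BeautifulSoup

BASE_URL = 'https://www.indeed.co.uk/jobs'


def build_page_urls(query, location, pages, step=10):
    """Return one search URL per results page.

    step is the assumed number of results per page.
    """
    urls = []
    for page in range(pages):
        params = {'q': query, 'l': location, 'start': page * step}
        # quote_via=quote encodes spaces as %20, matching the URL used earlier.
        urls.append(f'{BASE_URL}?{urlencode(params, quote_via=quote)}')
    return urls


def scrape_page_titles(urls, delay=1.0):
    """Download each URL and return a dict of {url: title text or None}."""
    titles = {}
    for url in urls:
        response = requests.get(url, timeout=10)
        if response.status_code == 200:
            soup = BeautifulSoup(response.content, 'html.parser')
            titles[url] = soup.title.text.strip() if soup.title else None
        else:
            # Record the failure instead of stopping the whole run.
            titles[url] = None
        # Pause between requests to avoid hammering the website.
        time.sleep(delay)
    return titles


urls = build_page_urls('data scientist', 'london', pages=3)
print(urls)
```

Calling `scrape_page_titles(urls)` would then make the actual requests. Keeping the URL building separate from the downloading makes it easy to check the URLs before sending any traffic to the website.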

--------------------------------------------------------------------------------

## How To Scan HTML Content For Specific Keywords 

Particularly in a marketing context, if one of your web pages is ranking for 5 keywords, it would be beneficial to know:

- Whether every keyword appears on the HTML page.
- Which keywords are present on, or missing from, the HTML page.

By writing a web scraper we can easily answer these questions.
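
A minimal sketch of such a check is shown below. The `find_keywords` helper, the sample HTML and the keyword list are all hypothetical. The check is a simple case-insensitive substring match on the visible text, so a short keyword such as "sql" would also match inside a longer word such as "mysql".

```python
from bs4 import BeautifulSoup


def find_keywords(html, keywords):
    """Return a dict mapping each keyword to True if it appears in the page text."""
    soup = BeautifulSoup(html, 'html.parser')
    # get_text() drops the tags; lower-casing makes the match case-insensitive.
    page_text = soup.get_text(separator=' ').lower()
    return {keyword: keyword.lower() in page_text for keyword in keywords}


# Hypothetical page and keyword list, used so the example runs offline.
sample_html = ('<html><head><title>Data Scientist Jobs</title></head>'
               '<body><p>We use Python and SQL daily.</p></body></html>')
keywords = ['python', 'sql', 'machine learning']

results = find_keywords(sample_html, keywords)
missing_keywords = [keyword for keyword, found in results.items() if not found]

print(results)           # {'python': True, 'sql': True, 'machine learning': False}
print(missing_keywords)  # ['machine learning']
```

For a real page you would pass `response.content` from a `requests.get()` call instead of `sample_html`.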

----------------------------------------------------------------------------------------------------