# Web scraping in Python: Scraping with Requests and BeautifulSoup

Web scraping lets programmers gather information from web pages automatically instead of copying it by hand. It is generally acceptable for non-commercial purposes when the data is publicly available, but take care not to scrape protected information such as personal data, intellectual property, or confidential information.

Social media is a harder case because its content has varying levels of accessibility, which makes cautious and informed scraping practices all the more important.


In this Coffee and Coding session, we will use two Python libraries, <b>Requests</b> and <b>BeautifulSoup</b>, to scrape book reviews. <br>http://books.toscrape.com/ contains reviews of fake books for beginners learning web scraping. <br>
This session is aimed at beginners (like me!) as an introduction to web scraping.

To gather information from the internet through web scraping, one typically follows a four-step process:

<ol>
<li> Sending an HTTP GET request to the URL </li>
<li> Retrieving HTML content </li>
<li> Building the HTML document tree </li>
<li> Extracting information from the HTML document tree </li>
</ol>
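
The four steps above can be sketched roughly as follows. This is a minimal illustration rather than the code used later in the session: the helper names `fetch_html` and `extract_title` and the 10-second timeout are assumptions made for the example. Only steps 3 and 4 are actually run here, on a hard-coded page, so the sketch works without network access.

```python
import requests
from bs4 import BeautifulSoup


def fetch_html(url, timeout=10):
    """Steps 1 and 2: send a GET request and return the HTML text.

    Hypothetical helper; the timeout value is an arbitrary choice.
    """
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raise an error for 4xx/5xx responses
    return response.text


def extract_title(html):
    """Steps 3 and 4: build the document tree and pull one value out of it."""
    soup = BeautifulSoup(html, "html.parser")
    return soup.title.string if soup.title else None


# Offline demonstration of steps 3 and 4 with a hard-coded page.
page_title = extract_title("<html><head><title>Demo</title></head></html>")
print(page_title)

# With network access, all four steps would be chained like this:
# extract_title(fetch_html("http://books.toscrape.com/"))
```

Keeping the download and the parsing in separate functions makes it easy to test the parsing on saved HTML without hitting the site repeatedly.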

### Requests


The Requests library is a popular, widely used Python library for making HTTP requests. It lets you send HTTP requests to a server, receive the response, and handle it in a simple and efficient manner.

It supports various HTTP methods, such as GET, POST, HEAD, PUT, and DELETE.
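
To make the different methods concrete, the sketch below builds a GET and a POST request with `requests.Request(...).prepare()`, which constructs the request without sending it, so nothing goes over the network. The query parameter `page` and the form field `q` are made up for illustration and are not something the site is known to accept.

```python
import requests

# Build (but do not send) a GET request; "page" is a made-up query parameter.
get_request = requests.Request(
    "GET", "http://books.toscrape.com/", params={"page": 2}
).prepare()
print(get_request.method, get_request.url)

# The same pattern with POST and a form body; "q" is a made-up form field.
post_request = requests.Request(
    "POST", "http://books.toscrape.com/", data={"q": "travel"}
).prepare()
print(post_request.method, post_request.body)
```

In everyday use the shortcuts `requests.get(url)` and `requests.post(url, data=...)` build and send the request in one call.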

### BeautifulSoup

BeautifulSoup is a Python library for web scraping and data extraction from HTML and XML files. <br>
It provides a convenient and efficient way to parse and navigate HTML content, so that specific elements and data can be extracted.

We will first fetch HTML code from our fake-books site. But before we start, I will show you a very quick overview of basic HTML. 

HTML (Hypertext Markup Language) is a standard markup language used for creating web pages and other information that can be displayed in a web browser. It consists of a set of tags and attributes that define the structure, content, and appearance of a web page. 

In [3]:
# HTML and display live in IPython.display; importing from IPython.core.display is deprecated
from IPython.display import HTML, display

In [48]:
html_test = """
<html>
  <head>
    <title>My First Web Page</title>
  </head>
  <body>
    <h1>Welcome to my first test web page</h1>
    <p>This is a paragraph of text. I want to list some of my favourite composer and their music </p>
    <ul id = "composer", class = "myclass">
      <li>Listz</li>
      <li>Mozart</li>
      <li>Debussy</li>
    </ul>
    <ul id = "piece", class = "myclass2">
        <li>Love Dream (No.3)</li>
        <li>Sonatina No.1 in C Major</li>
        <li>Moonlight</li>        
    </ul>
  </body>
</html>
"""

In [49]:
# Let's import libraries first
from bs4 import BeautifulSoup
import requests
import pandas as pd

We explore functions in BeautifulSoup using the sample HTML we've just created. 

In [50]:
# create soup variable
soup = BeautifulSoup(html_test, features = "html.parser")

In [51]:
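# soup now holds the parsed document; evaluating it shows the HTML as BeautifulSoup read it.
# Note the ,="" in the output below: the stray comma after id = "composer" in html_test
# is not valid HTML, so the parser treats it as an extra attribute named ",".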
soup


<html>
<head>
<title>My First Web Page</title>
</head>
<body>
<h1>Welcome to my first test web page</h1>
<p>This is a paragraph of text. I want to list some of my favourite composer and their music </p>
<ul ,="" class="myclass" id="composer">
<li>Listz</li>
<li>Mozart</li>
<li>Debussy</li>
</ul>
<ul ,="" class="myclass2" id="piece">
<li>Love Dream (No.3)</li>
<li>Sonatina No.1 in C Major</li>
<li>Moonlight</li>
</ul>
</body>
</html>

In [52]:
# prettify() returns the parse tree as a string with each tag on its own line,
# indented by nesting depth, which is easier to read.
print(soup.prettify())

<html>
 <head>
  <title>
   My First Web Page
  </title>
 </head>
 <body>
  <h1>
   Welcome to my first test web page
  </h1>
  <p>
   This is a paragraph of text. I want to list some of my favourite composer and their music
  </p>
  <ul ,="" class="myclass" id="composer">
   <li>
    Listz
   </li>
   <li>
    Mozart
   </li>
   <li>
    Debussy
   </li>
  </ul>
  <ul ,="" class="myclass2" id="piece">
   <li>
    Love Dream (No.3)
   </li>
   <li>
    Sonatina No.1 in C Major
   </li>
   <li>
    Moonlight
   </li>
  </ul>
 </body>
</html>



In [53]:
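# Ways to navigate the data structure (see https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
display(soup.title)         # the whole <title> tag
display(soup.title.name)    # the tag's name
display(soup.title.string)  # the text inside the tag
display(soup.get_text())    # all text in the document, with the tags stripped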

<title>My First Web Page</title>

'title'

'My First Web Page'

'\n\n\nMy First Web Page\n\n\nWelcome to my first test web page\nThis is a paragraph of text. I want to list some of my favourite composer and their music \n\nListz\nMozart\nDebussy\n\n\nLove Dream (No.3)\nSonatina No.1 in C Major\nMoonlight\n\n\n\n'

Let's try the `find()` and `find_all()` methods to search for specific tags in the HTML content. <br>
`find()` returns only the first match for the search query. `find_all()` returns a list of all matches.

In [54]:
display(soup.find('li'))
display(soup.find('li').text) # This allows us to extract the inner HTML text
display(soup.find_all('li'))

<li>Listz</li>

'Listz'

[<li>Listz</li>,
 <li>Mozart</li>,
 <li>Debussy</li>,
 <li>Love Dream (No.3)</li>,
 <li>Sonatina No.1 in C Major</li>,
 <li>Moonlight</li>]
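
One difference worth seeing in code is what happens when nothing matches. The sketch below uses a cut-down list rather than the full sample page, and the variable names are chosen just for this example.

```python
from bs4 import BeautifulSoup

# A cut-down list, similar to the composer list in the sample page.
snippet = "<ul><li>Mozart</li><li>Debussy</li></ul>"
soup_demo = BeautifulSoup(snippet, "html.parser")

first_item = soup_demo.find("li")           # a single Tag
all_items = soup_demo.find_all("li")        # a list-like ResultSet of Tags
missing_one = soup_demo.find("table")       # None when nothing matches
missing_all = soup_demo.find_all("table")   # empty when nothing matches

print(first_item.text)
print([item.text for item in all_items])
print(missing_one, len(missing_all))
```

Because `find()` can return `None`, chaining straight onto it (for example `soup.find('table').text`) raises an `AttributeError` when the tag is absent, so it is safer to check the result first.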

If we only want the `<li>` items from one particular list, we can pass an `attrs={}` dictionary that describes the attributes of the HTML tag to match. The dictionary keys are the attribute names, and the values are the attribute values.<br>

In [59]:
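# Find the first tag whose id and class both match
myList = soup.find(attrs = {'id':'composer', 'class':'myclass'})
# Then collect only the <li> items inside that tag
myList.find_all('li')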

[<li>Listz</li>, <li>Mozart</li>, <li>Debussy</li>]
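
The same selection can be written in a couple of other ways. The sketch below shows the keyword-argument form and a CSS selector on a shortened copy of the two lists; it is an alternative to the `attrs` dictionary above, not a replacement, and the variable names are illustrative.

```python
from bs4 import BeautifulSoup

# Shortened copy of the two lists, without the stray commas.
html_lists = """
<ul id="composer" class="myclass"><li>Mozart</li><li>Debussy</li></ul>
<ul id="piece" class="myclass2"><li>Moonlight</li></ul>
"""
soup_lists = BeautifulSoup(html_lists, "html.parser")

# Keyword form: class_ has a trailing underscore because class is reserved in Python.
by_keyword = soup_lists.find("ul", id="composer", class_="myclass").find_all("li")

# CSS selector form: "#composer li" means <li> tags inside the element with id="composer".
by_selector = soup_lists.select("#composer li")

print([li.text for li in by_keyword])
print([li.text for li in by_selector])
```

Both forms select the same two items here; which one reads better is largely a matter of taste.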

You can also traverse the parent and child elements in the HTML code.
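
A minimal sketch of that kind of traversal, assuming a single short list with no whitespace between the tags (whitespace would show up as extra text nodes among the children):

```python
from bs4 import BeautifulSoup

html_tree = '<ul id="composer"><li>Mozart</li><li>Debussy</li></ul>'
soup_tree = BeautifulSoup(html_tree, "html.parser")

first_li = soup_tree.find("li")

# .parent moves up one level, to the enclosing <ul>
parent_id = first_li.parent["id"]

# .children iterates over the direct children of a tag
child_texts = [child.text for child in first_li.parent.children]

# find_next_sibling() moves sideways to the next matching tag at the same level
next_text = first_li.find_next_sibling("li").text

print(parent_id, child_texts, next_text)
```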