# REQUESTS AND BEAUTIFULSOUP MODULES AND WEB SCRAPING

Before using the requests and BeautifulSoup modules, it helps to know a few basic HTML tags. Here are some of them:

In [None]:
<div> Tag:
Purpose: Used to create a block-level section in an HTML document.
Example:
<div>
    This is a block-level section.
</div>


In [None]:
<table> Tag:
Purpose: Used to create an HTML table.

<td> Tag:
Purpose: Represents a table data cell in an HTML table.

<tr> Tag:
Purpose: Represents a table row in an HTML table.
    
Example:
<table>
    <tr>
        <td>Row 1, Cell 1</td>
        <td>Row 1, Cell 2</td>
    </tr>
    <tr>
        <td>Row 2, Cell 1</td>
        <td>Row 2, Cell 2</td>
    </tr>
</table>

In [None]:
<a> Tag:
Purpose: Used to create links (hyperlinks) in an HTML document.
    
Example:
<a href="https://www.example.com">This is a link.</a>

In [None]:
<p> Tag:
Purpose: Used to create paragraphs of text in an HTML document.

Example:
<p>This is a text paragraph.</p>


In [None]:
<h1>, <h2>, <h3>, <h4>, <h5>, <h6> Tags:

Purpose: Used to create headings of different levels in an HTML document.
# Example: headings follow the same logic as Markdown headings in a Jupyter notebook, where # corresponds to <h1>, ## to <h2>, and so on.

In [None]:
<head> and <body> Tags:
Purpose: Used to define the head and body sections of an HTML document.

Example:
<!DOCTYPE html>
<html>
<head>
    <title>Web Page Title</title>
</head>
<body>
    <h1>Hello World!</h1>
</body>
</html>

### 1) Requests Module

The requests module in Python is a powerful library used to send HTTP requests and handle HTTP responses. It allows you to interact with web APIs, make HTTP GET, POST, PUT, DELETE, and other types of requests to fetch data from web servers, submit form data, and perform various web-related tasks. The requests module provides a more user-friendly and straightforward interface compared to the standard urllib library for working with HTTP.

The main reason to use the requests module here is that it retrieves the HTML source of a URL that we provide. After obtaining the source code, we use the Beautiful Soup module to process the data.

get(): This method sends an HTTP GET request to the URL and returns the server's response.

In [2]:
import requests 

url = "https://www.python.org/"

response = requests.get(url)

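If we print the response object, we get <Response [200]>, which means the request succeeded. A 4xx or 5xx code instead signals a problem with the request or the server, and a connection failure raises an exception rather than returning a response.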

In [3]:
print(response)

<Response [200]>


To pass this data to the Beautiful Soup module, we use the response's ".content" attribute, which holds the body of the response as bytes.

In [None]:
url_content = response.content
print(url_content)  # prints the HTML source of the page as bytes
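
The request step can be made a little more defensive. The sketch below is one possible way to wrap it, not part of the original notebook: the helper name `fetch_html` and the 10-second timeout are assumptions chosen for illustration. It relies on `raise_for_status()`, which raises an exception for 4xx and 5xx responses, and on `requests.RequestException`, the base class for the errors that requests raises.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the page body as bytes, or None if the request fails.

    `fetch_html` is a hypothetical helper; the timeout value is an assumption.
    """
    try:
        response = requests.get(url, timeout=timeout)
        # Turn 4xx/5xx status codes into an exception
        response.raise_for_status()
    except requests.RequestException as error:
        # Covers connection errors, timeouts, and bad status codes
        print(f"Request failed: {error}")
        return None
    return response.content
```

Calling `fetch_html("https://www.python.org/")` would then return the same bytes as `response.content` above when the request succeeds.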

### 2) Beautiful Soup Module

Beautiful Soup is a Python library used for web scraping and parsing HTML and XML documents. It provides a convenient way to extract data from HTML and XML files by parsing the document and navigating the parsed data using Python objects. Beautiful Soup is widely used for web scraping tasks due to its simplicity and flexibility.

After obtaining the content, parse the HTML with Beautiful Soup: BeautifulSoup(url_content, "html.parser")
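
As a minimal sketch of this step, the snippet below parses a bytes object in the same way. The bytes literal is a made-up stand-in for `response.content` so that the example runs without a network connection.

```python
from bs4 import BeautifulSoup

# Stand-in for response.content (assumed sample data, not a real page)
url_content = b"<html><head><title>Sample Page</title></head><body></body></html>"

# Beautiful Soup accepts bytes as well as str
soup = BeautifulSoup(url_content, "html.parser")

page_title = soup.title.text
print(page_title)  # Sample Page
```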

Here are some other methods to extract and modify data:

find() method:<br>
This method is used to find the first occurrence of an element that matches a given tag name or set of attributes.

In [6]:
from bs4 import BeautifulSoup

html_content = """
<div>
    <h1>Hello, Beautiful Soup!</h1>
    <p>This is a sample paragraph.</p>
</div>
"""

soup = BeautifulSoup(html_content, 'html.parser')

# Find the first <h1> tag
h1_tag = soup.find('h1')
print(h1_tag.text)  # Output: Hello, Beautiful Soup!

# Find the first <p> tag
p_tag = soup.find('p')
print(p_tag.text)   # Output: This is a sample paragraph.

Hello, Beautiful Soup!
This is a sample paragraph.
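
The example above searches by tag name only. The following sketch, using made-up HTML, shows the attribute side of `find()` and what happens when nothing matches: the method returns `None`, so accessing `.text` on the result without a check would raise an `AttributeError`.

```python
from bs4 import BeautifulSoup

html_content = """
<div>
    <p id="intro">First paragraph.</p>
    <p class="note">Second paragraph.</p>
</div>
"""

soup = BeautifulSoup(html_content, "html.parser")

# Match on an attribute passed as a keyword argument
intro = soup.find("p", id="intro")
print(intro.text)  # First paragraph.

# Match on attributes passed as a dictionary
note = soup.find("p", attrs={"class": "note"})
print(note.text)  # Second paragraph.

# find() returns None when no element matches
missing = soup.find("table")
if missing is None:
    print("No <table> found")
```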


find_all() method:<br>
This method is used to find all occurrences of elements that match a given tag name or set of attributes. It returns a list-like ResultSet of matching elements, which is empty if nothing matches.

In [7]:
from bs4 import BeautifulSoup

html_content = """
<ul>
    <li>Item 1</li>
    <li>Item 2</li>
    <li>Item 3</li>
</ul>
"""

soup = BeautifulSoup(html_content, 'html.parser')

# Find all <li> tags
li_tags = soup.find_all('li')

for li in li_tags:
    print(li.text)
# Output:
# Item 1
# Item 2
# Item 3

Item 1
Item 2
Item 3
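
`find_all()` also accepts a list of tag names and a `limit` argument. The short sketch below, with made-up HTML, illustrates both.

```python
from bs4 import BeautifulSoup

html_content = "<h1>Title</h1><p>One</p><ul><li>A</li><li>B</li><li>C</li></ul>"

soup = BeautifulSoup(html_content, "html.parser")

# Stop after the first two matches
first_two = soup.find_all("li", limit=2)
print([li.text for li in first_two])  # ['A', 'B']

# Match several tag names at once, in document order
headings_and_paragraphs = soup.find_all(["h1", "p"])
print([tag.name for tag in headings_and_paragraphs])  # ['h1', 'p']
```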


prettify() method:<br>
This method formats the HTML or XML content with each tag and string on its own line and consistent indentation, which makes it easier to read.

In [9]:
from bs4 import BeautifulSoup

html_content = "<div><h1>Hello, Beautiful Soup!</h1><p>This is a sample paragraph.</p></div>"

soup = BeautifulSoup(html_content, 'html.parser')

# Prettify the HTML content
prettified_html = soup.prettify()
print(prettified_html)
# Output:
# <div>
#  <h1>
#   Hello, Beautiful Soup!
#  </h1>
#  <p>
#   This is a sample paragraph.
#  </p>
# </div>

<div>
 <h1>
  Hello, Beautiful Soup!
 </h1>
 <p>
  This is a sample paragraph.
 </p>
</div>



get() method:<br>
This method is used to retrieve the value of a specific attribute of an element.

In [12]:
from bs4 import BeautifulSoup

html_content = """
<a href="https://www.example.com">Click here</a>
"""

soup = BeautifulSoup(html_content, 'html.parser')

# Get the value of the 'href' attribute of the <a> tag
link = soup.find('a')
print(link)
print(link.get('href')) 
print(link['href'])          # same result as above
print(soup.a.get('href'))    # same result as above

<a href="https://www.example.com">Click here</a>
https://www.example.com
https://www.example.com
https://www.example.com
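
The two access styles differ when the attribute is missing. This sketch, using made-up HTML, shows that `.get()` returns `None` (or a default you supply), while square-bracket access raises a `KeyError`.

```python
from bs4 import BeautifulSoup

html_content = """
<a href="https://www.example.com">Link with href</a>
<a>Link without href</a>
"""

soup = BeautifulSoup(html_content, "html.parser")
links = soup.find_all("a")

# .get() returns None when the attribute is missing
hrefs = [link.get("href") for link in links]
print(hrefs)  # ['https://www.example.com', None]

# A second argument supplies a fallback value instead of None
fallbacks = [link.get("href", "#") for link in links]
print(fallbacks)  # ['https://www.example.com', '#']

# Square-bracket access raises KeyError for a missing attribute
try:
    links[1]["href"]
    bracket_access_failed = False
except KeyError:
    bracket_access_failed = True
print(bracket_access_failed)  # True
```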


In Beautiful Soup, the .text attribute is used to extract the textual content of an HTML or XML element. When you parse an HTML or XML document using Beautiful Soup, it creates a tree-like data structure that represents the structure of the document. Each element in the document is represented as a Python object with various attributes and methods, and .text is one of those attributes: it returns all the text inside the element, including the text of its nested tags.

In [14]:
from bs4 import BeautifulSoup

html_content = """
<div>
    <h1>Hello, Beautiful Soup!</h1>
    <p>This is a sample paragraph.</p>
</div>
"""

soup = BeautifulSoup(html_content, 'html.parser')

# Extract the text of the <h1> tag
h1_text = soup.h1.text
print(h1_text)
# Output: Hello, Beautiful Soup!

# Extract the text of the <p> tag
p_text = soup.p.text
print(p_text)
# Output: This is a sample paragraph.

Hello, Beautiful Soup!
This is a sample paragraph.
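
When `.text` is read from a container element, the whitespace between nested tags is kept. The related `get_text()` method takes `separator` and `strip` arguments to tidy this up. A small sketch with made-up HTML:

```python
from bs4 import BeautifulSoup

html_content = "<div>\n  <h1>Hello</h1>\n  <p>World</p>\n</div>"

soup = BeautifulSoup(html_content, "html.parser")

# .text keeps the newlines and indentation between the nested tags
raw_text = soup.div.text
print(repr(raw_text))

# get_text() can strip each piece of text and join the pieces with a separator
clean_text = soup.div.get_text(separator=" ", strip=True)
print(clean_text)  # Hello World
```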


Here is a general example:

In [None]:
from bs4 import BeautifulSoup

# Sample HTML content
html_content = """
<html>
<head>
    <title>Sample Page</title>
</head>
<body>
    <h1>Hello, Beautiful Soup!</h1>
    <p>This is a sample paragraph.</p>
    <ul>
        <li>Item 1</li>
        <li>Item 2</li>
    </ul>
</body>
</html>
"""

# Parse the HTML content with Beautiful Soup
soup = BeautifulSoup(html_content, 'html.parser')

# Extract data from the parsed document
title = soup.title.text
heading = soup.h1.text
paragraph = soup.p.text
items = [li.text for li in soup.ul.find_all('li')]

# Print the extracted data
print("Title:", title)
print("Heading:", heading)
print("Paragraph:", paragraph)
print("Items:", items)
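
The same pattern extends to the `<table>`, `<tr>`, and `<td>` tags introduced at the start. The sketch below is one possible way to turn a table into a list of rows. It assumes a simple table with no header cells and no nested tables.

```python
from bs4 import BeautifulSoup

html_content = """
<table>
    <tr>
        <td>Row 1, Cell 1</td>
        <td>Row 1, Cell 2</td>
    </tr>
    <tr>
        <td>Row 2, Cell 1</td>
        <td>Row 2, Cell 2</td>
    </tr>
</table>
"""

soup = BeautifulSoup(html_content, "html.parser")

# One inner list per <tr>, one string per <td>
rows = [
    [td.text for td in tr.find_all("td")]
    for tr in soup.find_all("tr")
]

for row in rows:
    print(row)
```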


### Filtering find_all() Results

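When you use the find_all() method in Beautiful Soup, you can search more precisely by specifying additional parameters, such as the class attribute, to filter the results. This allows you to find only the elements that match specific criteria, such as having a particular class value. The keyword is spelled class_, with a trailing underscore, because class is a reserved word in Python. It matches any element whose class list contains the given value, even if the element has other classes as well.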

In [15]:
# Here's an example of how to search for <div> tags with class="yp-example":
from bs4 import BeautifulSoup

html_content = """
<div class="yp-example">This is the first yp-example div.</div>
<div class="another-class yp-example">This is the second yp-example div.</div>
<div class="yp-example">This is the third yp-example div.</div>
"""

soup = BeautifulSoup(html_content, 'html.parser')

# Find all <div> tags with class="yp-example"
div_tags_with_class = soup.find_all('div', class_='yp-example')

for div_tag in div_tags_with_class:
    print(div_tag.text)

This is the first yp-example div.
This is the second yp-example div.
This is the third yp-example div.
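
Filtering is not limited to classes. The sketch below, using made-up HTML, shows two other filters: an `attrs` dictionary for attribute names such as `data-role` that cannot be written as Python keyword arguments, and `href=True` to keep only tags that have an `href` attribute at all.

```python
from bs4 import BeautifulSoup

html_content = """
<div class="card" data-role="summary">Summary card</div>
<div class="card" data-role="detail">Detail card</div>
<a href="https://www.example.com">Link with href</a>
<a>Anchor without href</a>
"""

soup = BeautifulSoup(html_content, "html.parser")

# Hyphenated attribute names go in the attrs dictionary
summary_cards = soup.find_all("div", attrs={"data-role": "summary"})
print([div.text for div in summary_cards])  # ['Summary card']

# True matches any tag that has the attribute, whatever its value
links_with_href = soup.find_all("a", href=True)
print([a["href"] for a in links_with_href])  # ['https://www.example.com']
```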
