# <font color=green>1. WEB SCRAPING FOR BEGINNERS

# 1.1. What is web scraping?

> *Web scraping* is the practice of extracting data from websites, and there are many reasons to do it.

> For example, suppose you want to build a machine learning model that recognizes whether a photo contains a car. To train it, you will need hundreds of photos of cars.

> Then you can create a bot that accesses car sales websites and downloads these photos to train your model later.

> The idea here is to show how to do this in Python with a few libraries: BeautifulSoup to work with the HTML we are analyzing, urllib to request and download the content of pages and images, and pandas to turn this information into structured data that we can store and work with.

# 1.2. Importing the libraries

> For this web scraping example, I used a fake website about cars. The goal is to download images of the cars and extract some information about them.

> The first step is to import the libraries, as in the code below.


In [41]:
import bs4
import urllib.request as urllib_request
import pandas

print("BeautifulSoup ->", bs4.__version__)
print("urllib ->", urllib_request.__version__)
print("pandas ->", pandas.__version__)

BeautifulSoup -> 4.10.0
urllib -> 3.9
pandas -> 1.3.4


---
# <font color=green>2. Working with requests

# 2.1. Getting the HTML content of a website

> To download the HTML code of a web page, I used the urllib.request library. The link to its documentation and a usage sample are below.

# urllib.request
## https://docs.python.org/3/library/urllib.html
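
> A minimal sketch of fetching a page with `urllib.request.urlopen`. The helper name `fetch_html` is an illustrative choice, and a `data:` URL stands in for the real site so the example runs offline; a regular `https://` address is passed the same way.

```python
from urllib.request import urlopen


def fetch_html(url):
    """Return the raw body of `url` as bytes."""
    with urlopen(url) as response:
        return response.read()


# A data: URL keeps the example offline; replace it with a real page address.
html = fetch_html("data:text/html,<h1>Alura</h1>")
print(type(html))  # the body arrives as bytes, not str
print(html)
```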

# 2.2. If it's possible to fail, it will fail
> When requesting a website, many things can go wrong: the site may be down or slow, the domain may have changed, access may be blocked, and so on.

> In this case it is important to deal with possible errors. The code below was created to deal with 3 different errors:

## https://docs.python.org/3/library/urllib.request.html#urllib.request.Request
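
> The original code cell is not reproduced here, so what follows is only a sketch. It assumes the three errors are `HTTPError`, `URLError` and a read timeout. The function name `fetch_html_safely`, the `User-Agent` value and the 10-second timeout are illustrative choices.

```python
import socket
from urllib.error import HTTPError, URLError
from urllib.request import Request, urlopen


def fetch_html_safely(url, timeout=10):
    """Return the page body as bytes, or None if the request fails."""
    # Some sites reject the default Python user agent, so send a browser-like one.
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    try:
        with urlopen(request, timeout=timeout) as response:
            return response.read()
    except HTTPError as error:
        # The server answered, but with an error status such as 404 or 500.
        # HTTPError is a subclass of URLError, so it must be caught first.
        print("HTTP error:", error.code, error.reason)
    except URLError as error:
        # No usable answer: unknown host, refused connection, bad scheme, etc.
        print("URL error:", error.reason)
    except socket.timeout:
        # The connection opened, but reading the response took too long.
        print("Request timed out")
    return None


# An unsupported scheme triggers URLError without touching the network.
print(fetch_html_safely("foo://bar"))
```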

# 2.3. String handling

### Converting type bytes to string
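
A short sketch of this step, assuming the page is encoded as UTF-8. The `raw` value is a made-up stand-in for what `response.read()` returns.

```python
# Hypothetical bytes, standing in for the result of response.read()
raw = b"<title>Alura Motors</title>"

# decode() turns bytes into str; the encoding must match the page's charset
html = raw.decode("utf-8")

print(type(raw))
print(type(html))
```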

### Remove tab characters, line breaks, etc.
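
One possible way to do this, shown on a made-up HTML string: `str.split()` with no arguments splits on any run of whitespace, so joining the pieces with a single space removes tabs, line breaks and repeated spaces in one pass.

```python
# Hypothetical HTML containing line breaks, tabs and repeated spaces
html = "<div>\n\t<h1>Alura</h1>\n\t<p>Hello   world</p>\n</div>"

# split() breaks on any whitespace run; join() puts one space between the pieces
html = " ".join(html.split())

print(html)  # <div> <h1>Alura</h1> <p>Hello world</p> </div>
```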

### Eliminating white spaces between TAGS
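
A minimal sketch, assuming whitespace has already been collapsed to single spaces: replacing `"> <"` with `"><"` removes the space left between a closing `>` and the next `<`.

```python
# Hypothetical HTML whose whitespace was already collapsed to single spaces
html = "<div> <h1>Alura</h1> <p>Hello world</p> </div>"

# Only the spaces between tags are removed; spaces inside text are kept
html = html.replace("> <", "><")

print(html)  # <div><h1>Alura</h1><p>Hello world</p></div>
```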

### String handling function
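
The three treatments can be gathered into one helper. This is a sketch: the name `clean_html` and the UTF-8 default are assumptions, not taken from the original notebook.

```python
def clean_html(raw, encoding="utf-8"):
    """Decode raw bytes and strip layout whitespace from the HTML."""
    text = raw.decode(encoding)        # bytes -> str
    text = " ".join(text.split())      # collapse tabs, line breaks, repeated spaces
    return text.replace("> <", "><")   # drop the spaces left between tags


print(clean_html(b"<div>\n\t<p>Hello   world</p>\n</div>"))
# <div><p>Hello world</p></div>
```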

---
# <font color=green>3. Extracting information with BeautifulSoup

# 3.1. Understanding the format of HTML code

**HTML** (*HyperText Markup Language*) is a markup language made up of **tags** that determine the role each part of the document plays. Each **tag** consists of a name and optional attributes. Attributes configure a **tag** and modify its default characteristics.

## Basic Structure

```html
<html>
    <head>
        <meta charset="utf-8" />
        <title>Alura Motors</title>
    </head>
    <body>
        <div id="container">
            <h1>Alura</h1>
            <h2 class="formato">Technology Courses</h2>
            <p>You will study, practice, discuss and learn.</p>
            <a href="https://www.alura.com.br/">Click here</a>
        </div>
    </body>
</html>
```

```<html>``` - determines the beginning of the document.

```<head>``` - header. Contains document information and settings.

```<body>``` - is the body of the document, where all the content is placed. This is the part visible in a browser.

## Most common tags

```<div>``` - Defines a division of the page. Can be formatted in different ways.

```<h1>, <h2>, <h3>, <h4>, <h5>, <h6>``` - Heading markers.

```<p>``` - Paragraph marker.

```<a>``` - Hyperlink.

```<img>``` - Image display.

```<table>``` - Tables.

```<ul>, <li>``` - Lists.


# 3.2. Creating a BeautifulSoup object

## https://www.crummy.com/software/BeautifulSoup/

### About parser: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#parser-installation
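
A minimal sketch of building the object from an HTML string. It uses `"html.parser"`, the parser bundled with Python, so nothing extra needs to be installed. The HTML is a shortened, made-up variant of the basic structure shown earlier.

```python
from bs4 import BeautifulSoup

# Shortened, illustrative version of the sample document
html = (
    "<html><head><title>Alura Motors</title></head>"
    '<body><div id="container"><h1>Alura</h1></div></body></html>'
)

# The second argument names the parser that builds the tree
soup = BeautifulSoup(html, "html.parser")

print(type(soup))
print(soup.prettify())  # indented view of the parsed tree
```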

# 3.3. Accessing tags
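
Tags can be reached as attributes of the soup object, which returns the first tag with that name. A short sketch on a made-up document:

```python
from bs4 import BeautifulSoup

html = (
    "<html><head><title>Alura Motors</title></head>"
    '<body><div id="container"><h1>Alura</h1><p>Hello world</p></div></body></html>'
)
soup = BeautifulSoup(html, "html.parser")

print(soup.html.head.title)  # <title>Alura Motors</title>
print(soup.title)            # same tag, reached directly
print(soup.div.h1)           # <h1>Alura</h1>
```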

# 3.4. Accessing tag content
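
Once a tag is selected, `get_text()` returns its text content. A small sketch on a made-up document:

```python
from bs4 import BeautifulSoup

html = "<html><head><title>Alura Motors</title></head><body><h1>Alura</h1></body></html>"
soup = BeautifulSoup(html, "html.parser")

print(soup.title)             # the whole tag: <title>Alura Motors</title>
print(soup.title.get_text())  # only the text: Alura Motors
print(soup.h1.get_text())     # Alura
```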

# 3.5. Accessing the attributes of a tag
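
A tag behaves like a dictionary of its attributes. The sketch below uses a made-up snippet: `attrs` returns all attributes, square brackets read one of them, and `get()` avoids a `KeyError` when the attribute is missing.

```python
from bs4 import BeautifulSoup

html = '<div id="container"><a href="https://www.alura.com.br/">Click here</a></div>'
soup = BeautifulSoup(html, "html.parser")

print(soup.div.attrs)          # {'id': 'container'}
print(soup.a["href"])          # https://www.alura.com.br/
print(soup.a.get("target"))    # None, because the attribute does not exist
```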

---
# <font color=green>4. SEARCHING WITH BEAUTIFULSOUP

# 4.1. The *find()* and *findAll()* methods

- ### *find(tag, attributes, recursive, text, **kwargs)*

- ### *findAll(tag, attributes, recursive, text, limit, **kwargs)*

#### https://www.crummy.com/software/BeautifulSoup/bs4/doc/#find
#### https://www.crummy.com/software/BeautifulSoup/bs4/doc/#find-all

> **Note:**
> - *findAll()* can also be used as *find_all()*

### *find()* method
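
`find()` returns the first matching tag, or `None` when nothing matches. A short sketch on a made-up snippet:

```python
from bs4 import BeautifulSoup

html = "<div><p>First</p><p>Second</p></div>"
soup = BeautifulSoup(html, "html.parser")

print(soup.find("p"))      # <p>First</p>
print(soup.find("table"))  # None
```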

### *findAll()* method
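
`findAll()` returns every match in a list-like result, and `limit` caps how many are returned. The sketch uses the `find_all()` spelling on a made-up snippet:

```python
from bs4 import BeautifulSoup

html = "<div><p>First</p><p>Second</p></div>"
soup = BeautifulSoup(html, "html.parser")

paragraphs = soup.find_all("p")
print(paragraphs)       # [<p>First</p>, <p>Second</p>]
print(len(paragraphs))  # 2

first_only = soup.find_all("p", limit=1)
print(first_only)       # [<p>First</p>]
```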

### Command equivalent to the *find()* method

### Shortcut to *findAll()* method
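
Calling the soup object, or any tag, like a function is equivalent to calling `find_all()` on it. A short sketch on a made-up snippet:

```python
from bs4 import BeautifulSoup

html = "<div><p>First</p><p>Second</p></div>"
soup = BeautifulSoup(html, "html.parser")

via_shortcut = soup("p")          # same as soup.find_all("p")
via_method = soup.find_all("p")

print(via_shortcut)
print(soup.div("p"))              # the shortcut also works on a tag
```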

### Passing tag lists
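
Passing a list of names matches any of them, and results come back in document order. A short sketch on a made-up snippet:

```python
from bs4 import BeautifulSoup

html = "<div><h1>Alura</h1><h2>Courses</h2><p>Hello</p></div>"
soup = BeautifulSoup(html, "html.parser")

headings = soup.find_all(["h1", "h2"])
print(headings)  # [<h1>Alura</h1>, <h2>Courses</h2>]
```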

### Using the *attributes* argument
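
The second positional argument is a dictionary of attribute values to match. The class and id names in this sketch are made up for illustration:

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the class and id values are illustrative
html = '<div><p class="intro">One</p><p class="body">Two</p><p id="last">Three</p></div>'
soup = BeautifulSoup(html, "html.parser")

by_class = soup.find_all("p", {"class": "body"})
by_id = soup.find("p", {"id": "last"})

print(by_class)  # [<p class="body">Two</p>]
print(by_id)     # <p id="last">Three</p>
```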

### Searching for the content of a TAG
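
Tags can also be selected by their text. In current Beautiful Soup versions the argument is named `string`; it is the one that appears as *text* in the signatures above. A sketch on a made-up snippet, with an exact match and a regular expression:

```python
import re

from bs4 import BeautifulSoup

html = "<div><p>One</p><p>Two</p><p>Three</p></div>"
soup = BeautifulSoup(html, "html.parser")

exact = soup.find_all("p", string="Two")                # text equal to "Two"
starts_with_t = soup.find_all("p", string=re.compile("^T"))  # text starting with "T"

print(exact)          # [<p>Two</p>]
print(starts_with_t)  # [<p>Two</p>, <p>Three</p>]
```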

### Using attributes directly

### Beware of the "class" attribute
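
`class` is a reserved word in Python, so it cannot be used as a keyword argument. Beautiful Soup accepts `class_` instead. The class names in this sketch are made up:

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the class values are illustrative
html = '<div><p class="intro">One</p><p class="body">Two</p></div>'
soup = BeautifulSoup(html, "html.parser")

# soup.find_all("p", class="body") would be a SyntaxError
matches = soup.find_all("p", class_="body")
print(matches)  # [<p class="body">Two</p>]
```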

### Getting all the text content of a page
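
`get_text()` on the soup object returns the text of the whole page. A sketch on a made-up snippet; the `separator` and `strip` arguments keep the pieces from running together.

```python
from bs4 import BeautifulSoup

html = "<div><h1>Alura</h1><p>Hello world</p></div>"
soup = BeautifulSoup(html, "html.parser")

plain = soup.get_text()
spaced = soup.get_text(separator=" ", strip=True)

print(plain)   # AluraHello world
print(spaced)  # Alura Hello world
```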

# 4.2. Other search methods

- ### *findParent(tag, attributes, text, **kwargs)*

- ### *findParents(tag, attributes, text, limit, **kwargs)*

#### https://www.crummy.com/software/BeautifulSoup/bs4/doc/#find-parents-and-find-parent

> **Notes:**
> - *findParent()* and *findParents()* can also be used as *find_parent()* and *find_parents()*, respectively.
---
- ### *findNextSibling(tag, attributes, text, **kwargs)*

- ### *findNextSiblings(tag, attributes, text, limit, **kwargs)*

- ### *findPreviousSibling(tag, attributes, text, **kwargs)*

- ### *findPreviousSiblings(tag, attributes, text, limit, **kwargs)*

#### https://www.crummy.com/software/BeautifulSoup/bs4/doc/#find-next-siblings-and-find-next-sibling
#### https://www.crummy.com/software/BeautifulSoup/bs4/doc/#find-previous-siblings-and-find-previous-sibling

> **Notes:**
> - *findNextSibling()*, *findNextSiblings()*, *findPreviousSibling()* and *findPreviousSiblings()* can also be used as *find_next_sibling()*, *find_next_siblings()*, *find_previous_sibling()* and *find_previous_siblings()*.
---
- ### *findNext(tag, attributes, text, **kwargs)*

- ### *findAllNext(tag, attributes, text, limit, **kwargs)*

- ### *findPrevious(tag, attributes, text, **kwargs)*

- ### *findAllPrevious(tag, attributes, text, limit, **kwargs)*

#### https://www.crummy.com/software/BeautifulSoup/bs4/doc/#find-all-next-and-find-next
#### https://www.crummy.com/software/BeautifulSoup/bs4/doc/#find-all-previous-and-find-previous

> **Notes:**
> - *findNext()*, *findAllNext()*, *findPrevious()* and *findAllPrevious()* can also be used as *find_next()*, *find_all_next()*, *find_previous()* and *find_all_previous()*.

## Sample HTML to illustrate the use of BeautifulSoup search methods

<img src="https://caelum-online-public.s3.amazonaws.com/1381-scraping/01/BeautifulSoup-method.png" width=80%>

---
## Result

<html>
    <body>
        <div id="container-a">
            <h1>Title A</h1>
            <h2 class="ref-a">Subtitle A</h2>
            <p>Content text A</p>
        </div>
        <div id="container-b">
            <h1>Title B</h1>
            <h2 class="ref-b">Subtitle B</h2>
            <p>Content text B</p>
        </div>
    </body>
</html>

### HTML string treatments

### Creating the BeautifulSoup object

### Parents
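
A sketch of the parent searches, using a compact stand-in for the two-container sample HTML with placeholder English text:

```python
from bs4 import BeautifulSoup

# Compact stand-in for the two-container sample document
html = (
    "<html><body>"
    '<div id="container-a"><h1>Title A</h1><h2 class="ref-a">Subtitle A</h2><p>Content text A</p></div>'
    '<div id="container-b"><h1>Title B</h1><h2 class="ref-b">Subtitle B</h2><p>Content text B</p></div>'
    "</body></html>"
)
soup = BeautifulSoup(html, "html.parser")

h2 = soup.find("h2")  # <h2 class="ref-a">

# Nearest enclosing <div>
parent = h2.find_parent("div")
print(parent["id"])  # container-a

# Every enclosing <div> or <body>, from the closest outwards
ancestors = h2.find_parents(["div", "body"])
print([tag.name for tag in ancestors])  # ['div', 'body']
```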

## Siblings
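
A sketch of the sibling searches on a compact stand-in for the sample HTML. Siblings are the tags that share the same parent.

```python
from bs4 import BeautifulSoup

# Compact stand-in for the first container of the sample document
html = (
    '<div id="container-a"><h1>Title A</h1>'
    '<h2 class="ref-a">Subtitle A</h2><p>Content text A</p></div>'
)
soup = BeautifulSoup(html, "html.parser")

h2 = soup.find("h2")

after = h2.find_next_sibling("p")        # <p>Content text A</p>
before = h2.find_previous_sibling("h1")  # <h1>Title A</h1>
rest = soup.find("h1").find_next_siblings()  # every sibling tag after the <h1>

print(after, before)
print([tag.name for tag in rest])  # ['h2', 'p']
```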

## Next and Previous
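
Unlike the sibling methods, the next and previous searches walk the whole document in order, so they can cross from one container into another. A sketch on a compact stand-in for the sample HTML:

```python
from bs4 import BeautifulSoup

# Compact stand-in for the two-container sample document
html = (
    "<html><body>"
    '<div id="container-a"><h1>Title A</h1><h2 class="ref-a">Subtitle A</h2><p>Content text A</p></div>'
    '<div id="container-b"><h1>Title B</h1><h2 class="ref-b">Subtitle B</h2><p>Content text B</p></div>'
    "</body></html>"
)
soup = BeautifulSoup(html, "html.parser")

h2 = soup.find("h2", class_="ref-a")

next_h1 = h2.find_next("h1")          # first <h1> after this tag, in container-b
previous_h1 = h2.find_previous("h1")  # last <h1> before this tag, in container-a
later_paragraphs = h2.find_all_next("p")

print(next_h1.get_text())      # Title B
print(previous_h1.get_text())  # Title A
print([p.get_text() for p in later_paragraphs])
```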

# <font color=green>5. CAR SITE WEB SCRAPING - GETTING THE DATA OF AN AD

# 5.1. Identifying and Selecting Data in HTML

### Getting the HTML and creating the BeautifulSoup object

### Creating variable to store information

### Getting the data of the first CARD

# 5.2. Getting the VALUE of the advertised vehicle

### <font color=red>Summary

In [42]:
# Value


# 5.3. Obtaining information about the advertised vehicle

### <font color=red>Summary

In [43]:
# Information


# 5.4. Getting the ACCESSORIES of the advertised vehicle

### <font color=red>Summary

In [44]:
# Accessories


# 5.5. Creating a DataFrame with the collected data
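
A minimal sketch of this step, assuming the data of one ad was collected into a dictionary. The keys and values in `card` are made up for illustration and do not come from the site.

```python
import pandas as pd

# Hypothetical data for a single ad; every key and value here is illustrative
card = {
    "name": "Example Car",
    "value": 50000.0,
    "year": 2020,
    "accessories": ["Air conditioning", "GPS"],
}

# A list with one dictionary becomes a DataFrame with one row
dataset = pd.DataFrame([card])

print(dataset)
```

Wrapping the dictionary in a list keeps the accessories list inside a single cell instead of spreading it over several rows.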

# 5.6. Getting the ad PHOTO

### Viewing the PHOTO on the notebook (extra)

### Routine to access and save the ad PHOTO

## https://docs.python.org/3/library/urllib.request.html#urllib.request.urlretrieve
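
A sketch of a save routine built on `urlretrieve`. The helper name `save_photo`, the folder and the file name are assumptions. A `data:` URL with placeholder content stands in for the real image address so the example runs offline.

```python
import os
import tempfile
from urllib.request import urlretrieve


def save_photo(url, folder, filename):
    """Download `url` into `folder/filename` and return the saved path."""
    os.makedirs(folder, exist_ok=True)  # create the folder if it is missing
    path = os.path.join(folder, filename)
    urlretrieve(url, path)              # writes the response body to `path`
    return path


# Placeholder content instead of a real image, so no network access is needed
photo_path = save_photo(
    "data:text/plain,not-a-real-photo",
    tempfile.mkdtemp(),
    "car.jpg",
)
print(photo_path)
```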

### <font color=red>Summary

# <font color=green>6. WEBSITE WEB SCRAPING - GETTING THE DATA OF ALL ADS FROM A PAGE

# 6.1. Identifying information in HTML

# 6.2. Creating a scraping routine

# <font color=green>7. WEB SCRAPING THE SITE - GETTING THE DATA OF ALL ADS ON THE SITE

# 7.1. Identifying information in HTML

# 7.2. Creating a scraping routine