<a href="https://colab.research.google.com/github/natelson/python/blob/main/web_scraping/download_info_about_cars_basic.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# <font color=green>1. WEB SCRAPING FOR BEGINNERS

# 1.1. What is web scraping?

> *Web scraping* is the practice of extracting data from websites, and there are many reasons to do it.

> For example, suppose you want to build a machine learning model that recognizes whether a photo contains a car. To train it, you will need hundreds of photos of cars.

> You can then create a bot that visits car sales websites and downloads these photos so that you can train your model later.

> The idea here is to show you how to do this in Python with three libraries: BeautifulSoup to work with the HTML we are analyzing, urllib to request and download the content of pages and images, and pandas to turn this information into structured data that we can store and work with.

# 1.2. Importing the libraries

> For this web scraping example, I used a fake website about cars. The goal is to download images of the cars and extract some information about them.

> The first step is to import the libraries, as in the code below.


In [68]:
import bs4
import urllib.request as urllib_request
import pandas

print("BeautifulSoup ->", bs4.__version__)
print("urllib ->", urllib_request.__version__)
print("pandas ->", pandas.__version__)

BeautifulSoup -> 4.6.3
urllib -> 3.7
pandas -> 1.3.5


---
# <font color=green>2. Working with requests

## 2.1. Getting the HTML content of a website

> To download the HTML code of a web page, I used the urllib.request library. The link to the documentation and a usage sample are below.

# urllib.request
## https://docs.python.org/3/library/urllib.html

In [69]:
from urllib.request import urlopen

url = 'https://alura-site-scraping.herokuapp.com/hello-world.php'

response = urlopen(url)
html = response.read()
print(html)

b'<!DOCTYPE html>\r\n<html lang="pt-br">\r\n<head>\r\n    <meta charset="utf-8">\r\n    <meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">\r\n\r\n    <title>Alura Motors</title>\r\n\t<link rel="stylesheet" href="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.7/css/bootstrap.min.css" integrity="sha384-BVYiiSIFeK1dGmJRAkycuHAHRg32OmUcww7on3RYdg4Va+PmSTsz/K68vbdEjh4u" crossorigin="anonymous">\r\n\t<link rel="stylesheet" href="css/styles.css" media="all">\r\n\r\n\t<script src="https://code.jquery.com/jquery-1.12.4.js"></script>\r\n\t<script src="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.7/js/bootstrap.min.js" integrity="sha384-Tc5IQib027qvyjSMfHjOMaLkfuWVxZxUPnCJA7l2mCWNIpG9mGCD8wGNIcPD7Txa" crossorigin="anonymous"></script>\r\n\t<script type="text/javascript" src="js/index.js"></script>\r\n\r\n</head>\r\n<body cz-shortcut-listen="true">\r\n    <noscript>You need to enable JavaScript to run this app.</noscript>\r\n\r\n    <div id="root">\r\n        <h

## 2.2. Some sites do not allow bots. What can you do?

> Some sites do not allow bots. To "lie" to the website in this case, the request has to send a User-Agent header, which tells the web server what kind of browser is visiting.
> When working with web scraping bots, it is important to behave like a user browsing with a web browser such as Chrome. The User-Agent plays an important role here.

In [70]:
from urllib.request import Request, urlopen
url = 'https://alura-site-scraping.herokuapp.com'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36'}

req = Request(url, headers = headers)
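
As a small check that needs no network access, the sketch below builds the same kind of request and reads the header back before anything is sent. The shortened User-Agent string is only a placeholder. Note that `Request` stores header names in capitalized form, so the key to look up is `User-agent`.

```python
from urllib.request import Request

url = 'https://alura-site-scraping.herokuapp.com'
# Placeholder User-Agent; any realistic browser string can be used here.
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}

req = Request(url, headers=headers)

# Request normalizes header names with str.capitalize().
user_agent = req.get_header('User-agent')
print(user_agent)
print(req.full_url)
```

Passing `req` to `urlopen(req)` would then send the request with this header attached.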

## 2.3. If it's possible to fail, it will fail
> When requesting a website, many things can go wrong: the website is down or slow, the domain has changed, access was blocked, and so on.

> This is why it is important to handle possible errors. The sections below cover two kinds of error:

### 2.3.1 HTTP status code other than 2xx

> Sometimes a website does not allow direct access and returns a status code that raises an error in your application.
> For example, the website of an online course company does not allow access without headers and returns HTTP status 403. The code below handles errors of type HTTPError.
Note: For more information about HTTP status codes, visit https://developer.mozilla.org/en-US/docs/Web/HTTP/Status


In [71]:
from urllib.request import Request, urlopen
from urllib.error import URLError, HTTPError
url = 'https://alura.com.br/'
try:
    req = Request(url)
    response = urlopen(req)
    print(response.read())

except HTTPError as e:
    print(e.status, e.reason)


403 Forbidden


### 2.3.2 URL Errors

Sometimes a website is down or its domain has changed. In this case your request raises a URLError.

For this example, I changed the URL https://twitter.com to https://twitter.comx. Since this domain does not exist, urlopen raises a URLError.

In [72]:
from urllib.request import Request, urlopen
from urllib.error import URLError, HTTPError
url = 'https://twitter.comx/'
try:
    req = Request(url)
    response = urlopen(req)
    html = response.read()

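except HTTPError as e:
    # HTTPError is a subclass of URLError, so it has to be caught first.
    print(e.status, e.reason)

except URLError as e:
    # Raised when the server cannot be reached, e.g. the domain does not exist.
    print(e.reason)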

[Errno -2] Name or service not known


#### For more information about the Request class, visit https://docs.python.org/3/library/urllib.request.html#urllib.request.Request
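
The two handlers can be combined into one helper. The sketch below is one possible arrangement and not part of the original notebook: the name `fetch_html` and the 10-second timeout are assumptions. It returns the page as bytes, or `None` when the request fails.

```python
from urllib.request import Request, urlopen
from urllib.error import URLError, HTTPError


def fetch_html(url, headers=None, timeout=10):
    """Return the body of `url` as bytes, or None if the request fails.

    Hypothetical helper: the name and the default timeout are assumptions.
    """
    req = Request(url, headers=headers or {})
    try:
        with urlopen(req, timeout=timeout) as response:
            return response.read()
    except HTTPError as e:
        # The server answered, but with an error status such as 403 or 404.
        print('HTTP error:', e.code, e.reason)
    except URLError as e:
        # The server could not be reached (unknown domain, refused connection).
        print('URL error:', e.reason)
    return None
```

A call such as `fetch_html('https://alura-site-scraping.herokuapp.com/index.php')` would then either give the raw HTML or print the reason for the failure.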

## 2.4 String handling

> When working with web scraping, it is important to turn the HTML code into text that is easier to work with.
> Below are some examples of converting the HTML code into friendlier text.

### Converting type bytes to string

In [73]:
from urllib.request import urlopen

url = 'https://alura-site-scraping.herokuapp.com/index.php'

response = urlopen(url)
html = response.read()
html

b'<!DOCTYPE html>\r\n<html lang="pt-br">\r\n<head>\r\n    <meta charset="utf-8">\r\n    <meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">\r\n\r\n    <title>Alura Motors</title>\r\n\r\n\t<style>\r\n\t\t/*Regra para a animacao*/\r\n\t\t@keyframes spin {\r\n\t\t\t0% { transform: rotate(0deg); }\r\n\t\t\t100% { transform: rotate(360deg); }\r\n\t\t}\r\n\t\t/*Mudando o tamanho do icone de resposta*/\r\n\t\tdiv.glyphicon {\r\n\t\t\tcolor:#6B8E23;\r\n\t\t\tfont-size: 38px;\r\n\t\t}\r\n\t\t/*Classe que mostra a animacao \'spin\'*/\r\n\t\t.loader {\r\n\t\t\tborder: 16px solid #f3f3f3;\r\n\t\t\tborder-radius: 50%;\r\n\t\t\tborder-top: 16px solid #3498db;\r\n\t\t\twidth: 80px;\r\n\t\t\theight: 80px;\r\n\t\t\t-webkit-animation: spin 2s linear infinite;\r\n\t\t\tanimation: spin 2s linear infinite;\r\n\t\t}\r\n\t</style>\r\n\t<link rel="stylesheet" href="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.7/css/bootstrap.min.css" integrity="sha384-BVYiiSIFeK1dGmJRAkycuH

In [74]:
type(html)

bytes

In [75]:
html_text = html.decode('utf-8')

In [76]:
type(html_text)

str

### Remove tab characters, line breaks, etc.

In [77]:
html_text.split()[:50]

['<!DOCTYPE',
 'html>',
 '<html',
 'lang="pt-br">',
 '<head>',
 '<meta',
 'charset="utf-8">',
 '<meta',
 'name="viewport"',
 'content="width=device-width,',
 'initial-scale=1,',
 'shrink-to-fit=no">',
 '<title>Alura',
 'Motors</title>',
 '<style>',
 '/*Regra',
 'para',
 'a',
 'animacao*/',
 '@keyframes',
 'spin',
 '{',
 '0%',
 '{',
 'transform:',
 'rotate(0deg);',
 '}',
 '100%',
 '{',
 'transform:',
 'rotate(360deg);',
 '}',
 '}',
 '/*Mudando',
 'o',
 'tamanho',
 'do',
 'icone',
 'de',
 'resposta*/',
 'div.glyphicon',
 '{',
 'color:#6B8E23;',
 'font-size:',
 '38px;',
 '}',
 '/*Classe',
 'que',
 'mostra',
 'a']

In [78]:
" ".join(html_text.split())

'<!DOCTYPE html> <html lang="pt-br"> <head> <meta charset="utf-8"> <meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no"> <title>Alura Motors</title> <style> /*Regra para a animacao*/ @keyframes spin { 0% { transform: rotate(0deg); } 100% { transform: rotate(360deg); } } /*Mudando o tamanho do icone de resposta*/ div.glyphicon { color:#6B8E23; font-size: 38px; } /*Classe que mostra a animacao \'spin\'*/ .loader { border: 16px solid #f3f3f3; border-radius: 50%; border-top: 16px solid #3498db; width: 80px; height: 80px; -webkit-animation: spin 2s linear infinite; animation: spin 2s linear infinite; } </style> <link rel="stylesheet" href="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.7/css/bootstrap.min.css" integrity="sha384-BVYiiSIFeK1dGmJRAkycuHAHRg32OmUcww7on3RYdg4Va+PmSTsz/K68vbdEjh4u" crossorigin="anonymous"> <link rel="stylesheet" href="css/styles.css" media="all"> <script src="https://code.jquery.com/jquery-1.12.4.js"></script> <script src="https://

### Eliminating white spaces between TAGS

In [87]:
" ".join(html_text.split()).replace('> <', '><')[:500]

'<!DOCTYPE html><html lang="pt-br"><head><meta charset="utf-8"><meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no"><title>Alura Motors</title><style> /*Regra para a animacao*/ @keyframes spin { 0% { transform: rotate(0deg); } 100% { transform: rotate(360deg); } } /*Mudando o tamanho do icone de resposta*/ div.glyphicon { color:#6B8E23; font-size: 38px; } /*Classe que mostra a animacao \'spin\'*/ .loader { border: 16px solid #f3f3f3; border-radius: 50%; border-top: '

### String handling function

In [80]:
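def change_html_byte_to_text(html_byte):
    # Decode the raw bytes, collapse every run of whitespace into a single
    # space, and drop the spaces left between adjacent tags.
    html_text = html_byte.decode('utf-8')
    html_text = " ".join(html_text.split()).replace('> <', '><')
    return html_text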
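
To see the effect on a small input, the same function can be applied to a hand-written byte string instead of a downloaded page. The sample markup below is made up for illustration.

```python
def change_html_byte_to_text(html_byte):
    html_text = html_byte.decode('utf-8')
    html_text = " ".join(html_text.split()).replace('> <', '><')
    return html_text


# Made-up sample with the same kind of line breaks and tabs as a real page.
sample = b'<div>\r\n\t<p>Hello   world</p>\r\n</div>'

print(change_html_byte_to_text(sample))
# <div><p>Hello world</p></div>
```

Runs of whitespace inside the text are collapsed as well, which only matters for whitespace-sensitive content such as `<pre>` blocks.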

In [81]:
html = change_html_byte_to_text(html)

In [86]:
html[:500]

'<!DOCTYPE html><html lang="pt-br"><head><meta charset="utf-8"><meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no"><title>Alura Motors</title><style> /*Regra para a animacao*/ @keyframes spin { 0% { transform: rotate(0deg); } 100% { transform: rotate(360deg); } } /*Mudando o tamanho do icone de resposta*/ div.glyphicon { color:#6B8E23; font-size: 38px; } /*Classe que mostra a animacao \'spin\'*/ .loader { border: 16px solid #f3f3f3; border-radius: 50%; border-top: '

# <font color=green>3. Extracting information with BeautifulSoup

# 3.1. Understanding the structure of HTML

**HTML** (*HyperText Markup Language*) is a markup language made up of **tags** that determine the role each part of the document plays. Each **tag** consists of a name and optional attributes. Attributes configure a **tag** and change its default characteristics.

## Basic Structure

```html
<html>
    <head>
        <meta charset="utf-8" />
        <title>Alura Motors</title>
    </head>
    <body>
        <div id="container">
            <h1>Alura</h1>
            <h2 class="formato">Cursos de Tecnologia</h2>
            <p>Você vai estudar, praticar, discutir e aprender.</p>
            <a href="https://www.alura.com.br/">Clique aqui</a>
        </div>
    </body>
</html>
```

```<html>``` - determines the beginning of the document.

```<head>``` - header. Contains document information and settings.

```<body>``` - is the body of the document, where all the content is placed. This is the part visible in a browser.

## Most common tags

```<div>``` - Defines a division of the page. Can be formatted in different ways.

```<h1>, <h2>, <h3>, <h4>, <h5>, <h6>``` - Heading markers.

```<p>``` - Paragraph marker.

```<a>``` - hyperlink.

```<img>``` - image display.

```<table>``` - tables.

```<ul>, <li>``` - lists.


# 3.2. Creating a BeautifulSoup object

## https://www.crummy.com/software/BeautifulSoup/

### About parser: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#parser-installation
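
A minimal sketch of creating the object, assuming the `bs4` package is installed. It parses the sample document from section 3.1 with `html.parser`, the parser that ships with Python. The variable names are illustrative.

```python
from bs4 import BeautifulSoup

html = """
<html>
    <head>
        <meta charset="utf-8" />
        <title>Alura Motors</title>
    </head>
    <body>
        <div id="container">
            <h1>Alura</h1>
            <h2 class="formato">Cursos de Tecnologia</h2>
            <p>Você vai estudar, praticar, discutir e aprender.</p>
            <a href="https://www.alura.com.br/">Clique aqui</a>
        </div>
    </body>
</html>
"""

# 'html.parser' is built in, so no extra parser has to be installed.
soup = BeautifulSoup(html, 'html.parser')

print(type(soup))
print(soup.prettify())
```

Other parsers such as `lxml` can be passed as the second argument if they are installed.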

# 3.3. Accessing tags
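
One way to reach a tag is to use its name as an attribute of the soup object, which returns the first tag with that name. The sketch below uses a compact copy of the sample document from section 3.1.

```python
from bs4 import BeautifulSoup

html = ('<html><head><title>Alura Motors</title></head><body>'
        '<div id="container"><h1>Alura</h1>'
        '<h2 class="formato">Cursos de Tecnologia</h2>'
        '<p>Você vai estudar, praticar, discutir e aprender.</p>'
        '<a href="https://www.alura.com.br/">Clique aqui</a>'
        '</div></body></html>')
soup = BeautifulSoup(html, 'html.parser')

# A tag name used as an attribute returns the first tag with that name.
print(soup.title)   # <title>Alura Motors</title>
print(soup.h1)      # <h1>Alura</h1>

# Attribute access can be chained to walk down the tree.
print(soup.html.head.title)
print(soup.div.h2)

# A tag that does not exist in the document gives None.
print(soup.img)
```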

# 3.4. Accessing tag content
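
The text inside a tag can be read with `get_text()`. The sketch below again uses a compact copy of the sample document from section 3.1; the space separator is a choice made here so that the pieces of text do not run together.

```python
from bs4 import BeautifulSoup

html = ('<html><head><title>Alura Motors</title></head><body>'
        '<div id="container"><h1>Alura</h1>'
        '<h2 class="formato">Cursos de Tecnologia</h2>'
        '<p>Você vai estudar, praticar, discutir e aprender.</p>'
        '<a href="https://www.alura.com.br/">Clique aqui</a>'
        '</div></body></html>')
soup = BeautifulSoup(html, 'html.parser')

# Text of a single tag.
print(soup.h1.get_text())   # Alura

# Text of a tag and everything nested inside it.
div_text = soup.div.get_text(separator=' ')
print(div_text)
```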

# 3.5. Accessing the attributes of a tag
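
Attributes behave like dictionary entries on the tag. A short sketch on a fragment of the sample document from section 3.1:

```python
from bs4 import BeautifulSoup

html = ('<div id="container"><h1>Alura</h1>'
        '<h2 class="formato">Cursos de Tecnologia</h2>'
        '<a href="https://www.alura.com.br/">Clique aqui</a></div>')
soup = BeautifulSoup(html, 'html.parser')

# All attributes of a tag, as a dictionary.
print(soup.div.attrs)        # {'id': 'container'}

# A single attribute, dictionary style.
print(soup.a['href'])        # https://www.alura.com.br/

# get() returns None instead of raising KeyError when the attribute is missing.
print(soup.a.get('target'))  # None

# "class" can hold several values, so it comes back as a list.
print(soup.h2['class'])
```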

---
# <font color=green>4. SEARCHING WITH BEAUTIFULSOUP

# 4.1. The *find()* and *findAll()* methods

- ### *find(name, attrs, recursive, string, **kwargs)*

- ### *findAll(name, attrs, recursive, string, limit, **kwargs)*

#### https://www.crummy.com/software/BeautifulSoup/bs4/doc/#find
#### https://www.crummy.com/software/BeautifulSoup/bs4/doc/#find-all

> **Note:**
> - *findAll()* can also be used as *find_all()*

### *find()* method
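
A minimal sketch of `find()`, which returns the first matching tag or `None`. The HTML is a fragment of the sample document from section 3.1.

```python
from bs4 import BeautifulSoup

html = ('<div id="container"><h1>Alura</h1>'
        '<h2 class="formato">Cursos de Tecnologia</h2>'
        '<p>Você vai estudar, praticar, discutir e aprender.</p>'
        '<a href="https://www.alura.com.br/">Clique aqui</a></div>')
soup = BeautifulSoup(html, 'html.parser')

first_h2 = soup.find('h2')   # first matching tag
missing = soup.find('img')   # None, because the sample has no <img>
print(first_h2)
print(missing)
```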

### *findAll()* method
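
A sketch of the same search returning every match. It uses the `find_all()` spelling and a hand-written two-container sample in the style of this notebook's examples; `limit` caps the number of results.

```python
from bs4 import BeautifulSoup

html = ('<div id="container-a"><h1>Título A</h1>'
        '<h2 class="ref-a">Sub título A</h2><p>Texto de conteúdo A</p></div>'
        '<div id="container-b"><h1>Título B</h1>'
        '<h2 class="ref-b">Sub título B</h2><p>Texto de conteúdo B</p></div>')
soup = BeautifulSoup(html, 'html.parser')

paragraphs = soup.find_all('p')            # every <p>, in document order
only_first = soup.find_all('p', limit=1)   # stop after the first match

print([p.get_text() for p in paragraphs])
print(len(only_first))
```

With `limit=1` the result is still a list, but its single element is the same tag that `find('p')` returns.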

### Command equivalent to the *find()* method

### Shortcut to *findAll()* method
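
Beautiful Soup lets you call the soup object (or any tag) like a function, which is equivalent to calling `find_all()` on it. Assuming that is the shortcut this heading refers to, a small sketch on a hand-written sample:

```python
from bs4 import BeautifulSoup

html = ('<div id="container-a"><p>Texto de conteúdo A</p></div>'
        '<div id="container-b"><p>Texto de conteúdo B</p></div>')
soup = BeautifulSoup(html, 'html.parser')

explicit = soup.find_all('p')
shortcut = soup('p')   # same search, written as a call on the object

print([str(tag) for tag in shortcut])
```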

### Passing tag lists
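
When a list of names is passed, any tag whose name is in the list matches. A sketch on a hand-written two-container sample:

```python
from bs4 import BeautifulSoup

html = ('<div id="container-a"><h1>Título A</h1>'
        '<h2 class="ref-a">Sub título A</h2><p>Texto de conteúdo A</p></div>'
        '<div id="container-b"><h1>Título B</h1>'
        '<h2 class="ref-b">Sub título B</h2><p>Texto de conteúdo B</p></div>')
soup = BeautifulSoup(html, 'html.parser')

# Matches <h1> and <h2> tags, returned in document order.
headings = soup.find_all(['h1', 'h2'])
print([tag.name for tag in headings])
```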

### Using the *attributes* argument
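
The attributes argument (named `attrs` in the Beautiful Soup documentation) takes a dictionary of attribute names and values to match. A sketch on a hand-written two-container sample:

```python
from bs4 import BeautifulSoup

html = ('<div id="container-a"><h2 class="ref-a">Sub título A</h2></div>'
        '<div id="container-b"><h2 class="ref-b">Sub título B</h2></div>')
soup = BeautifulSoup(html, 'html.parser')

# Positional form: the second argument is the attribute dictionary.
ref_b = soup.find_all('h2', {'class': 'ref-b'})

# Keyword form of the same argument.
container_a = soup.find_all('div', attrs={'id': 'container-a'})

print(ref_b)
print(len(container_a))
```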

### Searching for the content of a TAG
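
Tags can also be selected by their text. The sketch below uses the `string` argument, which is the current name of the argument older code calls `text`. The sample HTML is hand-written.

```python
import re

from bs4 import BeautifulSoup

html = ('<div id="container-a"><p>Texto de conteúdo A</p></div>'
        '<div id="container-b"><p>Texto de conteúdo B</p></div>')
soup = BeautifulSoup(html, 'html.parser')

# Exact match: <p> tags whose text is exactly this string.
exact = soup.find_all('p', string='Texto de conteúdo A')

# Pattern match without a tag name: returns the matching strings themselves.
partial = soup.find_all(string=re.compile('conteúdo'))

print(exact)
print(len(partial))
```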

### Using attributes directly
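
Most attributes can be passed straight to the search as keyword arguments, without building a dictionary. A sketch on a hand-written sample:

```python
from bs4 import BeautifulSoup

html = ('<div id="container-a"><p>Texto de conteúdo A</p></div>'
        '<div id="container-b"><p>Texto de conteúdo B</p></div>'
        '<a href="https://www.alura.com.br/">Clique aqui</a>')
soup = BeautifulSoup(html, 'html.parser')

# Match on the value of an attribute.
container_b = soup.find_all(id='container-b')

# True matches any tag that has the attribute, whatever its value.
with_id = soup.find_all(id=True)
links = soup.find_all('a', href=True)

print(container_b)
print(len(with_id), len(links))
```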

### Beware of the "class" attribute
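
`class` is a reserved word in Python, so it cannot be used as a keyword argument. Beautiful Soup accepts `class_` instead, and the dictionary form keeps working. A sketch on a hand-written sample:

```python
from bs4 import BeautifulSoup

html = ('<div id="container-a"><h2 class="ref-a">Sub título A</h2></div>'
        '<div id="container-b"><h2 class="ref-b">Sub título B</h2></div>')
soup = BeautifulSoup(html, 'html.parser')

# soup.find_all('h2', class='ref-a') would be a SyntaxError.
by_keyword = soup.find_all('h2', class_='ref-a')
by_dict = soup.find_all('h2', attrs={'class': 'ref-a'})

print(by_keyword)
```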

### Getting all the text content of a page
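
Calling `get_text()` on the soup object returns the text of the whole document. In the sketch below, the `separator` and `strip` arguments are choices made here to keep the pieces of text apart and trimmed; the sample HTML is hand-written.

```python
from bs4 import BeautifulSoup

html = ('<div id="container-a"><h1>Título A</h1>'
        '<h2 class="ref-a">Sub título A</h2><p>Texto de conteúdo A</p></div>'
        '<div id="container-b"><h1>Título B</h1>'
        '<h2 class="ref-b">Sub título B</h2><p>Texto de conteúdo B</p></div>')
soup = BeautifulSoup(html, 'html.parser')

all_text = soup.get_text(separator=' ', strip=True)
print(all_text)
```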

# 4.2. Other search methods

- ### *findParent(name, attrs, string, **kwargs)*

- ### *findParents(name, attrs, string, limit, **kwargs)*

#### https://www.crummy.com/software/BeautifulSoup/bs4/doc/#find-parents-and-find-parent

> **Notes:**
> - *findParent()* and *findParents()* can also be used as *find_parent()* and *find_parents()*, respectively.
---
- ### *findNextSibling(name, attrs, string, **kwargs)*

- ### *findNextSiblings(name, attrs, string, limit, **kwargs)*

- ### *findPreviousSibling(name, attrs, string, **kwargs)*

- ### *findPreviousSiblings(name, attrs, string, limit, **kwargs)*

#### https://www.crummy.com/software/BeautifulSoup/bs4/doc/#find-next-siblings-and-find-next-sibling
#### https://www.crummy.com/software/BeautifulSoup/bs4/doc/#find-previous-siblings-and-find-previous-sibling

> **Notes:**
> - *findNextSibling()*, *findNextSiblings()*, *findPreviousSibling()* and *findPreviousSiblings()* can also be used as *find_next_sibling()*, *find_next_siblings()*, *find_previous_sibling()* and *find_previous_siblings()*.
---
- ### *findNext(name, attrs, string, **kwargs)*

- ### *findAllNext(name, attrs, string, limit, **kwargs)*

- ### *findPrevious(name, attrs, string, **kwargs)*

- ### *findAllPrevious(name, attrs, string, limit, **kwargs)*

#### https://www.crummy.com/software/BeautifulSoup/bs4/doc/#find-all-next-and-find-next
#### https://www.crummy.com/software/BeautifulSoup/bs4/doc/#find-all-previous-and-find-previous

> **Notes:**
> - *findNext()*, *findAllNext()*, *findPrevious()* and *findAllPrevious()* can also be used as *find_next()*, *find_all_next()*, *find_previous()* and *find_all_previous()*.

## Sample HTML to illustrate the use of BeautifulSoup search methods

<img src="https://caelum-online-public.s3.amazonaws.com/1381-scraping/01/BeautifulSoup-method.png" width=80%>

---
## Result

<html>
    <body>
        <div id="container-a">
            <h1>Título A</h1>
            <h2 class="ref-a">Sub título A</h2>
            <p>Texto de conteúdo A</p>
        </div>
        <div id="container-b">
            <h1>Título B</h1>
            <h2 class="ref-b">Sub título B</h2>
            <p>Texto de conteúdo B</p>
        </div>
    </body>
</html>

### HTML string treatments

### Creating the BeautifulSoup object

### Parents
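
A sketch of the parent searches on the two-container sample, written here as a compact string. The snake_case method names are used, and the variable names are illustrative.

```python
from bs4 import BeautifulSoup

html = ('<html><body>'
        '<div id="container-a"><h1>Título A</h1>'
        '<h2 class="ref-a">Sub título A</h2><p>Texto de conteúdo A</p></div>'
        '<div id="container-b"><h1>Título B</h1>'
        '<h2 class="ref-b">Sub título B</h2><p>Texto de conteúdo B</p></div>'
        '</body></html>')
soup = BeautifulSoup(html, 'html.parser')

h2_b = soup.find('h2', class_='ref-b')

# Closest enclosing <div> of the tag.
parent_div = h2_b.find_parent('div')
print(parent_div['id'])   # container-b

# Every enclosing tag with the given name, innermost first.
print([tag.name for tag in h2_b.find_parents(['div', 'body', 'html'])])
```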

## Siblings
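
Sibling searches stay inside the same parent. A sketch on the two-container sample, written as a compact string so that no whitespace nodes sit between the tags:

```python
from bs4 import BeautifulSoup

html = ('<div id="container-a"><h1>Título A</h1>'
        '<h2 class="ref-a">Sub título A</h2><p>Texto de conteúdo A</p></div>'
        '<div id="container-b"><h1>Título B</h1>'
        '<h2 class="ref-b">Sub título B</h2><p>Texto de conteúdo B</p></div>')
soup = BeautifulSoup(html, 'html.parser')

h1_a = soup.find('h1')
p_a = soup.find('p')

next_h2 = h1_a.find_next_sibling('h2')            # <h2> in container-a
following = h1_a.find_next_siblings(['h2', 'p'])  # <h2> and <p> of container-a
previous_h1 = p_a.find_previous_sibling('h1')     # back to the <h1>

print(next_h2.get_text())
print([tag.name for tag in following])
print(previous_h1.get_text())
```

The tags of `container-b` never appear in these results, because they have a different parent.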

## Next and Previous
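
Unlike the sibling searches, the next and previous searches follow document order and can cross into other containers. A sketch on the two-container sample, written as a compact string:

```python
from bs4 import BeautifulSoup

html = ('<div id="container-a"><h1>Título A</h1>'
        '<h2 class="ref-a">Sub título A</h2><p>Texto de conteúdo A</p></div>'
        '<div id="container-b"><h1>Título B</h1>'
        '<h2 class="ref-b">Sub título B</h2><p>Texto de conteúdo B</p></div>')
soup = BeautifulSoup(html, 'html.parser')

h1_a = soup.find('h1')
div_b = soup.find('div', id='container-b')

next_p = h1_a.find_next('p')                 # first <p> after the first <h1>
later_h2 = h1_a.find_all_next('h2')          # <h2> tags of both containers
previous_p = div_b.find_previous('p')        # last <p> before container-b
earlier_h1 = div_b.find_all_previous('h1')   # <h1> tags before container-b

print(next_p.get_text())
print([tag.get_text() for tag in later_h2])
print(previous_p.get_text())
print([tag.get_text() for tag in earlier_h1])
```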

# <font color=green>5. CAR SITE WEB SCRAPING - GETTING THE DATA OF AN AD

# 5.1. Identifying and Selecting Data in HTML

### Getting the HTML and creating the BeautifulSoup object

### Creating variable to store information

### Getting the data of the first CARD

# 5.2. Getting the VALUE of the advertised vehicle

### <font color=red>Summary

In [82]:
# Value


# 5.3. Obtaining information about the advertised vehicle

### <font color=red>Summary

In [83]:
# Information


# 5.4. Getting the ACCESSORIES of the advertised vehicle

### <font color=red>Summary

In [84]:
# Accessories


# 5.5 Creating a DataFrame with the data collected 
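
A minimal sketch of this step, assuming each ad has been collected into a dictionary. The keys and values below are placeholders invented for illustration, not data scraped from the site, and the names `cards` and `dataset` are hypothetical.

```python
import pandas as pd

# Placeholder records: one dictionary per ad. These values are NOT real
# scraped data; they only show the shape the DataFrame constructor expects.
cards = [
    {'name': 'Car A', 'value': 100000.0, 'accessories': ['Airbag', 'ABS']},
    {'name': 'Car B', 'value': 85000.0, 'accessories': ['Radio']},
]

# Each dictionary becomes a row and each key becomes a column.
dataset = pd.DataFrame(cards)
print(dataset)
```

From here, `dataset.to_csv(...)` can store the table in a file for later use.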

# 5.6. Getting the ad PHOTO

### Viewing the PHOTO on the notebook (extra)

### Routine to access and save the ad PHOTO

## https://docs.python.org/3/library/urllib.request.html#urllib.request.urlretrieve
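
A sketch of saving a file with `urlretrieve(url, filename)`. To keep it runnable without network access, a small local file addressed by a `file://` URL stands in for the real photo; with an ad photo, `image_url` would be the address taken from the image tag. All names and paths here are hypothetical.

```python
import tempfile
from pathlib import Path
from urllib.request import urlretrieve

# Stand-in for a real photo: a small local file with a file:// URL.
workdir = Path(tempfile.mkdtemp())
source = workdir / 'source.jpg'
source.write_bytes(b'not a real image')
image_url = source.as_uri()

# Folder and file name where the downloaded copy will be stored.
target = workdir / 'img' / 'photo.jpg'
target.parent.mkdir(exist_ok=True)

# urlretrieve fetches the resource and writes it to the given file name.
saved_path, response_headers = urlretrieve(image_url, str(target))
print(saved_path)
```

The Python documentation lists `urlretrieve` as a legacy interface, so for new code it may be worth reading the response with `urlopen` and writing the bytes to a file instead.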

### <font color=red>Summary

# <font color=green>6. WEBSITE WEB SCRAPING - GETTING THE DATA OF ALL ADS FROM A PAGE

# 6.1. Identifying information in HTML

# 6.2. Creating a scraping routine

# <font color=green>7. WEB SCRAPING THE SITE - GETTING THE DATA OF ALL ADS ON THE SITE

# 7.1. Identifying information in HTML

# 7.2. Creating a scraping routine