# Web Scraping

**Info/Comments**   
- Principle: 
    - Web scraping is the automated retrieval of data from web pages.
    - The technique rests on a simple principle and is used in many applications: search engines, price comparison, monitoring tools, etc.
- 2 steps: 
    - downloading the HTML code of the page to be scraped, 
    - and parsing it.

Page to test web scraping: https://codedamn-classrooms.github.io/webscraper-python-codedamn-classroom-website/

## Table of Contents 

* [1. Importing the required packages into our python environment. ](#part1)
* [2. Download html code ](#part2)
* [3. Extracting features](#part3)
* [4. Application](#part4)



## 1. Importing packages <a class="anchor" id="part1"></a>

In [1]:
# !pip install requests
# !pip install beautifulsoup4
import requests
from bs4 import BeautifulSoup
import pandas as pd

## 2. Download html code <a class="anchor" id="part2"></a>

- To download the content of a web page, you send an **HTTP request** and wait for the response.
- The `requests` module allows you to send HTTP requests from **Python**.
- `requests.get` returns a `Response` object holding all the response data (content, encoding, status code, and so on).
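
In practice a request can hang or come back with an error status, so it may help to wrap the download in a small helper. The sketch below is one possible way to do that, not part of the original notebook: the function name `fetch_html` and its `get` parameter (which only exists so the function can be exercised without network access) are hypothetical, while `timeout` and `raise_for_status()` are standard parts of `requests`.

```python
import requests


def fetch_html(url: str, timeout: float = 10.0, get=requests.get) -> str:
    """Download a page and return its HTML, raising on HTTP errors.

    `get` is a hypothetical injection point; by default it is requests.get.
    """
    # Without a timeout, requests can wait indefinitely for a slow server
    response = get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx and 5xx responses
    response.raise_for_status()
    return response.text
```

With network access, `fetch_html(url)` would return the same text as `requests.get(url).text` for a successful response, and raise instead of silently returning an error page otherwise.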

Status code information:
- Informational responses (100 – 199)
- Successful responses (200 – 299)
- Redirection messages (300 – 399)
- Client error responses (400 – 499)
- Server error responses (500 – 599)   
   
For more information see: https://developer.mozilla.org/en-US/docs/web/http/status
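
As a small illustration of the ranges listed above, the following sketch maps a status code to its class using only the standard library. The helper name `status_class` is hypothetical; `http.HTTPStatus` is part of Python and provides the standard reason phrase for each code.

```python
from http import HTTPStatus


def status_class(code: int) -> str:
    """Return the name of the class an HTTP status code belongs to."""
    if 100 <= code <= 199:
        return "informational"
    if 200 <= code <= 299:
        return "successful"
    if 300 <= code <= 399:
        return "redirection"
    if 400 <= code <= 499:
        return "client error"
    if 500 <= code <= 599:
        return "server error"
    raise ValueError(f"not a valid HTTP status code: {code}")


for code in (200, 301, 404, 503):
    # HTTPStatus(code).phrase is the standard reason phrase for the code
    print(code, status_class(code), "-", HTTPStatus(code).phrase)
```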

In [2]:
# Send a GET request to the web page
url = "https://codedamn-classrooms.github.io/webscraper-python-codedamn-classroom-website/"


reponse = requests.get(url)


Store the status code in a variable called `status`

In [3]:
status = reponse.status_code
print(f"Status: {status}")

Status: 200


Store the text response in a variable called `content`

In [4]:
content = reponse.text
# Overview of the data
print(content[:100])

<!DOCTYPE html>
<html lang="en">
	<head>
		<!-- Anti-flicker snippet (recommended)  -->
		<style>
		


## 3. Extract with BeautifulSoup <a class="anchor" id="part3"></a>

URL → HTTP request → HTML → BeautifulSoup

- We will use the `BeautifulSoup` library to parse the downloaded HTML.
- Beautiful Soup sits on top of popular Python parsers like `lxml` and `html5lib`, allowing you to try out different parsing strategies. The cell below uses `html.parser`, the parser built into Python's standard library.
- It parses HTML and XML, including markup that is not perfectly well-formed.

In [5]:
soup = BeautifulSoup(content, 'html.parser')

# Extracting title with BeautifulSoup
title = soup.title.text # gets you the text of the <title>(...)</title>
print(f"Page title without .text: '{soup.title}'")
print(f"Page title: '{title}'")

Page title without .text: '<title>codedamn Web Scraper demo</title>'
Page title: 'codedamn Web Scraper demo'


In [6]:
# Extract body of page
page_body = soup.body

# Extract head of page
page_head = soup.head
print(f"Type var head: '{type(page_head)}'")
print(f"\nVar head: \n{page_head}")


Type var head: '<class 'bs4.element.Tag'>'

Var head: 
<head>
<!-- Anti-flicker snippet (recommended)  -->
<style>
			.async-hide {
				opacity: 0 !important;
			}
		</style>
<title>codedamn Web Scraper demo</title>
<meta charset="utf-8"/>
<meta content="IE=edge,chrome=1" http-equiv="X-UA-Compatible"/>
<meta content="web scraping,Web Scraper,Chrome extension,Crawling,Cross platform scraper, " name="keywords"/>
<meta content="The most popular web scraping website." name="description"/>
<link href="/webscraper-python-codedamn-classroom-website/favicon.png" rel="icon" sizes="128x128"/>
<meta content="width=device-width, initial-scale=1.0" name="viewport"/>
<link href="/webscraper-python-codedamn-classroom-website/app.css" rel="stylesheet"/>
<link href="/webscraper-python-codedamn-classroom-website/logo-icon.png" rel="apple-touch-icon"/>
<script defer="" src="/webscraper-python-codedamn-classroom-website/app.js"></script>
</head>


- Once you have the `soup` object, you can call `.select` on it. It takes a CSS selector and returns a list of all matching elements.
- That is, you can reach down the DOM tree the same way you select elements with CSS. Let's look at an example:

In [7]:
# Extract first <h1>(...)</h1> text
first_h1 = soup.select('h1')[0].text
first_h1

'Test Sites'

In [8]:
# Create all_h1_tags as empty list
all_h1_tags = list()

# Set all_h1_tags to all h1 tags of the soup
for element in soup.select('h1'):
    all_h1_tags.append(element.text)
all_h1_tags

['Test Sites', 'E-commerce training site']
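
The same `.select` call accepts richer CSS selectors than a bare tag name. The sketch below shows a class selector and a child combinator on a small, made-up HTML snippet (the markup is hypothetical and only loosely modelled on a product card), plus `select_one`, which returns the first match or `None`.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet, used here so the example runs without a download
html = """
<div class="thumbnail">
  <h4 class="price">$10.00</h4>
  <h4><a class="title" title="Sample item" href="/item/1">Sample...</a></h4>
  <p class="description">A short description.</p>
</div>
"""

snippet = BeautifulSoup(html, "html.parser")

# Tag selector: every <h4> in the document
print(len(snippet.select("h4")))            # 2

# Class selector: <h4> elements carrying the class "price"
print(snippet.select("h4.price")[0].text)   # $10.00

# Child combinator: <a class="title"> that is a direct child of an <h4>
link = snippet.select_one("h4 > a.title")
print(link.get("title"))                    # Sample item

# select_one returns None when nothing matches, instead of raising IndexError
print(snippet.select_one("span.missing"))   # None
```

Indexing with `[0]` on an empty `.select` result raises `IndexError`, so `select_one` with a `None` check can be the safer choice when an element may be absent.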

In [9]:
# All <p> elements of the page
soup.select('p')

[<p>Web Scraper</p>,
 <p>Cloud Scraper</p>,
 <p>Pricing</p>,
 <p>Learn</p>,
 <p>
 								Welcome to WebScraper e-commerce site. You can use this site for
 								training to learn how to use the Web Scraper. Items listed here are
 								not for sale.
 							</p>,
 <p class="description">
 											Asus AsusPro Advanced BU401LA-FA271G Dark Grey,
 											14", Core i5-4210U, 4GB, 128GB SSD, Win7 Pro 64bit,
 											ENG
 										</p>,
 <p class="pull-right">7 reviews</p>,
 <p data-rating="3">
 <span class="glyphicon glyphicon-star"></span>
 <span class="glyphicon glyphicon-star"></span>
 <span class="glyphicon glyphicon-star"></span>
 </p>,
 <p class="description">
 											Apple MacBook Air 13.3", Core i5 1.8GHz, 8GB, 128GB
 											SSD, Intel HD 4000, RUS
 										</p>,
 <p class="pull-right">4 reviews</p>,
 <p data-rating="2">
 <span class="glyphicon glyphicon-star"></span>
 <span class="glyphicon glyphicon-star"></span>
 </p>,
 <p class="description">
 									

## 4. Application <a class="anchor" id="part4"></a>

### Getting Top items being scraped

In [10]:
# Create top_items as empty list
top_items = list()

# Extract the fields of each product card and store them in top_items
products = soup.select('div.thumbnail')
for elem in products:
    # title = elem.select('h4 > a.title')[0]['title']
    title = elem.select('h4 > a.title')[0].get('title')
    review_label = elem.select('div.ratings')[0].text
    description = elem.select('div.caption > p.description')[0].text.replace('\t', '').replace('\n', '')
    # price = elem.select('div.caption > h4')[0].text or
    price = elem.select('h4.price')[0].text
    image = elem.select('img')[0].get('src')
    info = {
        "title": title.strip(),
        "review": review_label.strip(),
        "description": description.strip(),
        "price": price.strip(),
        "image": image
    }
    top_items.append(info)

top_items


[{'title': 'Asus AsusPro Advanced BU401LA-FA271G Dark Grey',
  'review': '7 reviews',
  'description': 'Asus AsusPro Advanced BU401LA-FA271G Dark Grey,14", Core i5-4210U, 4GB, 128GB SSD, Win7 Pro 64bit,ENG',
  'price': '$1139.54',
  'image': '/webscraper-python-codedamn-classroom-website/cart2.png'},
 {'title': 'Asus ROG Strix GL553VD-DM535T',
  'review': '4 reviews',
  'description': 'Apple MacBook Air 13.3", Core i5 1.8GHz, 8GB, 128GBSSD, Intel HD 4000, RUS',
  'price': '$1101.83',
  'image': '/webscraper-python-codedamn-classroom-website/cart2.png'},
 {'title': 'Acer Aspire 3 A315-51 Black',
  'review': '2 reviews',
  'description': 'Acer Aspire 3 A315-51 Black, 15.6" FHD, Corei3-7100U, 4GB, 500GB + 128GB SSD, Windows 10 Home',
  'price': '$494.71',
  'image': '/webscraper-python-codedamn-classroom-website/cart2.png'}]

In [11]:

# Build the DataFrame directly from the list of dicts; columns follow the dict key order
data = pd.DataFrame(top_items)
data

|   | title | review | description | price | image |
|---|---|---|---|---|---|
| 0 | Asus AsusPro Advanced BU401LA-FA271G Dark Grey | 7 reviews | Asus AsusPro Advanced BU401LA-FA271G Dark Grey... | $1139.54 | /webscraper-python-codedamn-classroom-website/... |
| 1 | Asus ROG Strix GL553VD-DM535T | 4 reviews | Apple MacBook Air 13.3", Core i5 1.8GHz, 8GB, ... | $1101.83 | /webscraper-python-codedamn-classroom-website/... |
| 2 | Acer Aspire 3 A315-51 Black | 2 reviews | Acer Aspire 3 A315-51 Black, 15.6" FHD, Corei3... | $494.71 | /webscraper-python-codedamn-classroom-website/... |
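
The scraped values are all strings, and the image paths are relative to the site root. A possible follow-up step, sketched below under the assumption that prices always look like `$494.71` and reviews like `2 reviews`, converts them to numbers and resolves the image path against the page URL. The helper `clean_item` and the constant `BASE_URL` are hypothetical names; the sample values are taken from the output above.

```python
from urllib.parse import urljoin

# Assumed to be the page the items were scraped from
BASE_URL = "https://codedamn-classrooms.github.io/webscraper-python-codedamn-classroom-website/"


def clean_item(item: dict) -> dict:
    """Return a copy of a scraped item with typed price, review count and absolute image URL."""
    cleaned = dict(item)
    # "$1139.54" -> 1139.54 (assumes a leading "$" and no other currency symbol)
    cleaned["price"] = float(item["price"].lstrip("$").replace(",", ""))
    # "7 reviews" -> 7 (assumes the count is the first whitespace-separated token)
    cleaned["review"] = int(item["review"].split()[0])
    # Resolve the root-relative src against the page URL
    cleaned["image"] = urljoin(BASE_URL, item["image"])
    return cleaned


sample = {
    "title": "Acer Aspire 3 A315-51 Black",
    "review": "2 reviews",
    "price": "$494.71",
    "image": "/webscraper-python-codedamn-classroom-website/cart2.png",
}
print(clean_item(sample))
```

Applied to the whole list, `pd.DataFrame([clean_item(item) for item in top_items])` would give numeric `price` and `review` columns that can be sorted or aggregated.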
