# Web Scraping

* A **Website** is a collection of related web pages that may contain text, images, audio and video. It can be static or dynamic.


* **HTTP** is the data communication protocol used to establish communication between client and server.


* An **HTTP request** is the message sent by a client (such as a browser or a script) to a web server. It carries a method, a URL, headers and, optionally, a body.


* A **Server** is used to manage the network resources (**web server**). Servers are also used for running the program or software that provides services (**application server**).


* **API** or **Application Programming Interface** allows interactions between systems by following a set of standards and protocols in order to share features, information and data. API acts as an interface between different applications.


* A **REST API** (or **Representational State Transfer API**) is an architectural style for building web services. It uses HTTP as the communication interface that lets different software systems talk to each other over the internet, and it transfers data through HTTP methods. There are five main methods used in a REST API:
	* `GET` - retrieves a specific resource or collection of resources
	* `POST` - creates a new resource
	* `PUT` - updates an existing resource
	* `DELETE` - removes a specific resource
	* `PATCH` - partially updates an existing resource

* **GET** requests data from a resource (URL), whereas **POST** sends data to the server to create or update a resource.
	* `GET` : `requests.get("<url>")`
	* `POST` : `requests.post("<url>", data={"key":"value"})`
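
To see where the data ends up in each case, the sketch below builds a GET and a POST with `requests` but only prepares them, so nothing is sent over the network. The `https://example.com/search` URL and the parameter names are placeholders chosen for illustration.

```python
import requests

# Placeholder URL; the requests are prepared but never sent.
URL = "https://example.com/search"

# GET: parameters are encoded into the URL query string, and there is no body.
get_req = requests.Request("GET", URL, params={"q": "python"}).prepare()

# POST: form data travels in the request body instead of the URL.
post_req = requests.Request("POST", URL, data={"key": "value"}).prepare()

print(get_req.method, get_req.url, get_req.body)
print(post_req.method, post_req.url, post_req.body)
print(post_req.headers["Content-Type"])
```

Inspecting a prepared request like this can be a convenient way to check what a call would send before pointing it at a real server.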

* The **Response** object is returned from the HTTP request and holds the result of the request, which is either a success or an error. A success response typically includes the requested information or a message confirming that the requested action was completed. An error response includes a message explaining why the request could not be completed. Besides the **page content**, the Response object also carries other details about the result, such as the **HTTP status code** and **headers**.
	* **Content-Type** is the HTTP header that describes the media type of the body being sent, for example `text/html` or `application/json`.


* Commonly used attributes of a `requests` response:
	* `response.content`: raw response payload as bytes
	* `response.text`: payload decoded to a string using a character encoding (e.g. UTF-8)
	* `response.headers`: dictionary-like object holding the response headers as key-value pairs
	* `response.status_code`: status code returned by the external service
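
The relationship between the raw bytes, the `Content-Type` header and the decoded text can be sketched without a network call. The snippet below uses only the standard library; the header value and body are made up, and it is a simplified picture rather than what Requests does internally (Requests has extra fallbacks when the header names no charset).

```python
from email.message import Message


def parse_content_type(header_value):
    """Split a Content-Type header into (media type, charset or None)."""
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_type(), msg.get_content_charset()


# Hypothetical response pieces: a header value and the raw body bytes.
header = "text/html; charset=utf-8"
content = "<p>Kraków café</p>".encode("utf-8")  # plays the role of response.content

media_type, charset = parse_content_type(header)
text = content.decode(charset or "utf-8")       # plays the role of response.text

print(media_type, charset)
print(type(content).__name__, type(text).__name__)
```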

* Requests is used only to get the page; it **does not do any parsing**.

* We use **Beautiful Soup** to do the parsing of the HTML and also the finding of content within the HTML.
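
As a minimal sketch of that split of responsibilities, the snippet below parses a small, made-up HTML string with Beautiful Soup and pulls out the links. It assumes `bs4` is installed and uses Python's built-in `html.parser` so that no extra parser is required.

```python
from bs4 import BeautifulSoup

# Made-up HTML fragment used only for this illustration.
html = """
<ul class="menu">
  <li><a href="/first">First</a></li>
  <li><a href="/second">Second</a></li>
</ul>
"""

# "html.parser" ships with Python, so no additional dependency is needed.
soup = BeautifulSoup(html, "html.parser")

# Find the list by its CSS class, then collect the text and href of each link.
items = soup.find("ul", {"class": "menu"}).find_all("li")
links = [(li.a.text, li.a["href"]) for li in items]
print(links)
```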

## Scrape the upcoming Python events

In [1]:
from bs4 import BeautifulSoup
import pandas as pd

In [2]:
# Get CSS classes from "Right Click / Inspect", and search the element
css_class_dict = {'recent_events_class' : 'list-recent-events',
                 'event_location_class' : 'event-location'}

# URL for website
url = 'https://www.python.org/events/python-events/'

In [3]:
def _parse_response(response_text, css_class_dict):
    
    # create a BeautifulSoup object and pass it the HTML text
    soup = BeautifulSoup(response_text, 'lxml')
    
    # find the main <ul> tag for the recent events, then get all the <li> tags below it
    events = soup.find('ul', {'class': css_class_dict['recent_events_class']}).find_all('li')
    
    # mapping all results to upcoming_events
    upcoming_events = []
    for event in events:
        event_details = dict()
        event_details['name'] = event.find('h3').find("a").text
        event_details['location'] = event.find('span', {'class': css_class_dict['event_location_class']}).text
        event_details['time'] = event.find('time').text
        upcoming_events.append(event_details)

    # Creating events dataframe
    events_df = pd.DataFrame(upcoming_events)
        
    return events_df

In [4]:
def get_upcoming_events(response_text, css_class_dict):
    events_df = _parse_response(response_text, css_class_dict)
    return events_df

### Using `requests` library

In [5]:
import requests

In [7]:
# GET HTTP request using requests library
response = requests.get(url)
response_text = response.text

events_df = get_upcoming_events(response_text, css_class_dict)
events_df

| | name | location | time |
|---|---|---|---|
| 0 | EuroPython 2023 | Prague, Czech Republic | 17 July – 23 July 2023 |
| 1 | North Bay Python | Petaluma, California, USA | 29 July – 30 July 2023 |
| 2 | PyCon PL 2023 | Gliwice, Poland | 29 July – 02 Aug. 2023 |
| 3 | PyCon KR | Seoul, South Korea | 11 Aug. – 13 Aug. 2023 |
| 4 | EuroSciPy 2023 | Basel, Switzerland | 14 Aug. – 18 Aug. 2023 |
| 5 | DjangoConAU 2023 | Adelaide, Australia | 18 Aug. 2023 |


### Using `urllib3` library
This is another common library for retrieving data from URLs and for other functions involving URLs such as parsing of the parts of the actual URL and handling various encodings.

In [8]:
import urllib3

In [9]:
# urllib3 returns the raw body as bytes in `response.data`; it does not decode it for you
pool_manager = urllib3.PoolManager()

# GET HTTP request using urllib3 library
response = pool_manager.request('GET', url)
response_text = response.data

events_df = get_upcoming_events(response_text, css_class_dict)
events_df

| | name | location | time |
|---|---|---|---|
| 0 | EuroPython 2023 | Prague, Czech Republic | 17 July – 23 July 2023 |
| 1 | North Bay Python | Petaluma, California, USA | 29 July – 30 July 2023 |
| 2 | PyCon PL 2023 | Gliwice, Poland | 29 July – 02 Aug. 2023 |
| 3 | PyCon KR | Seoul, South Korea | 11 Aug. – 13 Aug. 2023 |
| 4 | EuroSciPy 2023 | Basel, Switzerland | 14 Aug. – 18 Aug. 2023 |
| 5 | DjangoConAU 2023 | Adelaide, Australia | 18 Aug. 2023 |


`requests` is built on top of `urllib3`, so the two offer similar capabilities. For most HTTP work, Requests is generally recommended because of its simpler, higher-level API.

## Session

In [10]:
# builds on top of urllib3's connection pooling

# session reuses the same TCP connection if requests are made to the same host
session = requests.Session()

r = session.get('http://httpbin.org/get', cookies={'my-cookie': 'browser'})
print(r.text)

{
  "args": {}, 
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate", 
    "Cookie": "my-cookie=browser", 
    "Host": "httpbin.org", 
    "User-Agent": "python-requests/2.28.1", 
    "X-Amzn-Trace-Id": "Root=1-64b46142-593c4af214b336c527d485ba"
  }, 
  "origin": "165.23.15.159", 
  "url": "http://httpbin.org/get"
}
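
A session can also carry state of its own. The sketch below differs slightly from the cell above: instead of passing the cookie on a single call, it stores a cookie and a header on the session and then uses `prepare_request()` to show that they are merged into an outgoing request. Nothing is sent, and the `my-scraper/0.1` agent string is a made-up value.

```python
import requests

session = requests.Session()

# State stored on the session is applied to every request made through it.
session.headers.update({"User-Agent": "my-scraper/0.1"})  # hypothetical agent string
session.cookies.set("my-cookie", "browser")

# prepare_request() merges the session state into the request without sending it.
request = requests.Request("GET", "http://httpbin.org/get")
prepared = session.prepare_request(request)

print(prepared.headers["User-Agent"])
print(prepared.headers["Cookie"])
```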



## Streaming

In [11]:
import json

# With stream=True the body is not downloaded up front; it is read as we iterate over it
r = requests.get('http://httpbin.org/stream/5', stream=True)

for line in r.iter_lines():
    # filter out keep-alive new lines
    if line:
        decoded_line = line.decode('utf-8')
        print(json.loads(decoded_line))

{'url': 'http://httpbin.org/stream/5', 'args': {}, 'headers': {'Host': 'httpbin.org', 'X-Amzn-Trace-Id': 'Root=1-64b46142-0f4a3eaf4a030b065b0a641c', 'User-Agent': 'python-requests/2.28.1', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*'}, 'origin': '165.23.15.159', 'id': 0}
{'url': 'http://httpbin.org/stream/5', 'args': {}, 'headers': {'Host': 'httpbin.org', 'X-Amzn-Trace-Id': 'Root=1-64b46142-0f4a3eaf4a030b065b0a641c', 'User-Agent': 'python-requests/2.28.1', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*'}, 'origin': '165.23.15.159', 'id': 1}
{'url': 'http://httpbin.org/stream/5', 'args': {}, 'headers': {'Host': 'httpbin.org', 'X-Amzn-Trace-Id': 'Root=1-64b46142-0f4a3eaf4a030b065b0a641c', 'User-Agent': 'python-requests/2.28.1', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*'}, 'origin': '165.23.15.159', 'id': 2}
{'url': 'http://httpbin.org/stream/5', 'args': {}, 'headers': {'Host': 'httpbin.org', 'X-Amzn-Trace-Id': 'Root=1-64b46142-0f4a3eaf4a030b065b0a641c', 'User-Agent': 
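
The line-by-line handling used in the streaming cell can be tried offline. The sketch below replaces the HTTP response with an in-memory byte stream holding made-up data, one JSON document per line, and applies the same filter-decode-parse steps.

```python
import io
import json

# Simulated body of a streaming response: one JSON document per line,
# with an empty keep-alive line in between (made-up data).
raw_body = b'{"id": 0}\n\n{"id": 1}\n{"id": 2}\n'
stream = io.BytesIO(raw_body)

records = []
for line in stream:          # yields one line at a time, much like r.iter_lines()
    line = line.strip()
    if line:                 # skip keep-alive blank lines
        records.append(json.loads(line.decode("utf-8")))

print(records)
```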

## Scraping with Selenium

In [12]:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

def get_upcoming_events_with_Selenium(url):
    options = webdriver.ChromeOptions()
    options.add_argument('--ignore-certificate-errors')
    options.add_argument("start-maximized")
    options.add_argument("disable-infobars")
    options.add_argument("--disable-extensions")
    
    driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)
    driver.maximize_window()
    driver.get(url)
    
    upcoming_events = []
    events = driver.find_elements("xpath",'//ul[contains(@class, "list-recent-events")]/li')
    for event in events:
        event_details = dict()
        event_details['name'] = event.find_element("xpath",'h3[@class="event-title"]/a').text
        event_details['location'] = event.find_element("xpath",'p/span[@class="event-location"]').text
        event_details['time'] = event.find_element("xpath",'p/time').text
        upcoming_events.append(event_details)
    
    driver.quit()  # end the browser session and stop the driver process
    events_df = pd.DataFrame(upcoming_events)
    return events_df

In [13]:
events = get_upcoming_events_with_Selenium(url)
events


| | name | location | time |
|---|---|---|---|
| 0 | EuroPython 2023 | Prague, Czech Republic | 17 July – 23 July |
| 1 | North Bay Python | Petaluma, California, USA | 29 July – 30 July |
| 2 | PyCon PL 2023 | Gliwice, Poland | 29 July – 02 Aug. |
| 3 | PyCon KR | Seoul, South Korea | 11 Aug. – 13 Aug. |
| 4 | EuroSciPy 2023 | Basel, Switzerland | 14 Aug. – 18 Aug. |
| 5 | DjangoConAU 2023 | Adelaide, Australia | 18 Aug. |
| 6 | SciPy 2023 | Austin, Texas, USA | 10 July – 16 July |
| 7 | PyCon Israel 2023 | Ramat Gan, Israel | 04 July – 05 July |
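
The nested XPath lookups (an outer path that selects each `<li>`, then inner paths relative to it) can be explored without a browser. The sketch below uses the standard library's `xml.etree.ElementTree` on a made-up, well-formed fragment. ElementTree supports only a subset of XPath, so it uses an exact `@class` match in place of `contains()`.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed fragment that mimics the structure queried above.
page = ET.fromstring("""
<div>
  <ul class="list-recent-events">
    <li>
      <h3 class="event-title"><a href="#">Sample Conf</a></h3>
      <p><time>01 Jan.</time><span class="event-location">Sample City</span></p>
    </li>
  </ul>
</div>
""")

events = []
# The outer path selects each <li>; the inner paths are relative to that <li>.
for li in page.findall(".//ul[@class='list-recent-events']/li"):
    events.append({
        "name": li.find("h3[@class='event-title']/a").text,
        "location": li.find("p/span[@class='event-location']").text,
        "time": li.find("p/time").text,
    })

print(events)
```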


## DOM

When the browser displays a web page it builds a model of the content of the page in a representation known as the **document object model** (**DOM**). The DOM is a hierarchical representation of the page's entire content, as well as structural information, style information, scripts, and links to other content.
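
To make the hierarchy concrete, the sketch below walks a small, made-up HTML string with the standard library's `HTMLParser` and records each element indented by its nesting depth. It is only an illustration of the tree shape: a browser's DOM is far richer, and this simple depth counter assumes every tag in the input is explicitly closed.

```python
from html.parser import HTMLParser


class TreePrinter(HTMLParser):
    """Records each element and text node with its nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        self.lines.append("  " * self.depth + tag)
        self.depth += 1

    def handle_endtag(self, tag):
        self.depth -= 1

    def handle_data(self, data):
        if data.strip():
            self.lines.append("  " * self.depth + repr(data.strip()))


# Made-up page used only for this illustration.
html = "<html><body><h1>Events</h1><ul><li>PyCon</li><li>SciPy</li></ul></body></html>"

printer = TreePrinter()
printer.feed(html)
printer.close()
print("\n".join(printer.lines))
```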

# Scrapy Fundamentals

Scrapy is a very popular open source Python scraping framework for extracting data. It was originally designed only for scraping, but it has also evolved into a powerful web crawling solution. Scrapy offers a number of powerful features:

* Built-in extensions to make HTTP requests, to handle compression, authentication and caching, and to manipulate user agents and HTTP headers

* Built-in support for selecting and extracting data with selector languages such as CSS and XPath, as well as support for utilizing regular expressions for selection of content and links

* Encoding support to deal with languages and non-standard encoding declarations

* Flexible APIs to reuse and write custom middleware and pipelines, which provide a clean and easy way to implement tasks such as automatically downloading assets (for example, images or media) and storing data in storage such as file systems, S3, databases, and others.

## References

* https://github.com/PacktPublishing/Python-Web-Scraping-Cookbook/tree/master