# Web Scraping Basics Workshop

Do you want to get data from a Website or lots of Websites from the Interwebs? Then this workshop is for you!

And for this Workshop, assume that you are a jobless bloke but can't relocate so you need a Work From Home job. And you want to know **what kind of jobs are out there that are available for Work From Home**. So that you can upskill and apply for those kinds of jobs and you won't be so broke anymore.

But before we dive into code, let's just set up what we need.

![image.png](attachment:834f4661-8417-4eb7-b490-5a740a5a900f.png "Source: https://www.reddit.com/r/memes/comments/f1a9uv/this_meme_was_made_by_unemployment_gang/")

## Chapter 0: Setup and Installation

First of all, we do hope that you have your Laptop ready and have installed [**Anaconda**](https://www.anaconda.com/download), because I'm not sure how you're opening this Jupyter Notebook. You kinda need something to execute this one, so either use [**Google Colab**](https://colab.research.google.com/), [**VSCode**](https://code.visualstudio.com/download), or just plain [**Python**](https://www.python.org/downloads/) and [**Jupyter Notebook**](https://jupyter.org/install).

Hats off to you if you're able to do that. Now let me do my memes in peace.

We're just installing some Python Packages here.

Execute the code in each cell using ***Shift + Enter***.

In [1]:
!pip install lxml pandas requests scrapy -qU
!conda install -y -c conda-forge python-chromedriver-binary selenium
!pip install webdriver-manager 



In [2]:
import lxml
import pandas as pd
import requests
import scrapy
import selenium
from IPython.display import HTML

First important note:

**If any** of these commands **error** out (they'll usually show Red in color), **please call your Instructor** so he/she has something to do aside from chatting with the ones who invited him/her over.
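Before you wave your Instructor over, it can help to know *which* package is the problem. The cell below is a minimal sketch of one way to check, using only the standard library. The names `WORKSHOP_PACKAGES` and `find_missing` are made up for this illustration and are not part of any library.

```python
from importlib.util import find_spec

# Top-level packages this workshop imports (same names as the import cell above).
WORKSHOP_PACKAGES = ["lxml", "pandas", "requests", "scrapy", "selenium"]


def find_missing(packages):
    """Return the names in `packages` that Python cannot locate (hypothetical helper)."""
    return [name for name in packages if find_spec(name) is None]


missing = find_missing(WORKSHOP_PACKAGES)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All workshop packages are installed.")
```

If a name shows up as missing, re-running the install cell for that package is usually the first thing to try.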

Second important note:

We're using **Jupyter Notebook for Instructional Purposes only**. Web Scraping is (usually) something that has to be done repeatedly, depending on how often the website you want to get data from or interact with is updated. So you'll have to create a Production-ready Codebase and a way to Run this Code regularly... which are both outside the scope of this workshop.

## Chapter 1: Web Scraping Basics with Requests and LXML

The fastest way to find available WFH Jobs is to search on Job Posting sites. And one of our favorite Job Posting sites (as Cebuanos) is [**Mynimo**](https://www.mynimo.com).

So for Chapters 1 and 2, we'll be learning the basics of Web Scraping and Web Crawling by Scraping and Crawling for Work From Home Job Postings in Mynimo.

And to start with definitions, **Web Scraping** is basically using a program to **download and process content from the web** [[1]](#reference_1). In other words, instead of you surfing the web with your browser, **you let a program surf the web for you**.

How do we do that? Well, why not start with how you usually surf the web and download stuff from it?

### Section 1.1: URL and Web Browser Basics

So you surf the web by **opening your** favorite **browser** (mine is Chrome because I'm a normie).

Then, you type the website address, also known as **URL *(uniform resource locator)***, on the address bar of the browser.

Let's type in **[mynimo.com](https://www.mynimo.com)** and press **Enter**.

![image.png](attachment:626438b6-45d3-442b-9086-8eb2aa477505.png)

Voila! The homepage of Mynimo appears!

But that's not what we want! We want Job Postings! I don't see any Job Postings here!

So, let's click on W... ohhh! I'm also curious what Jobs are available in Cebu! I heard the parties there are fun!

So we click on the [**Cebu Jobs**](https://www.mynimo.com/cebu-jobs) link and see the Job Postings available in Cebu.

![image.png](attachment:a28c2bce-060b-4c48-a57b-81826358a06a.png)

Did you notice the change in URL?

Well... you should. Because that's a fundamental concept that we use in Web Scraping!

So, sure. You all know how to type a URL on the Address Bar. But what's a URL in the first place?

To explain succinctly, and with the help of Ch... Mozilla (you can check the explanation here: [[2]](#reference_2)):

A **URL** is an **address of a unique Resource**. A **Resource** may be a Page (like what you see on your Browser), an Image, a PDF, etc. And like a real-life address, a URL is only useful while something lives there: the Resource can be moved somewhere else or removed entirely, leaving the URL pointing to nothing. That's why we have the famous **HTTP 404 Not Found** error.
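404 is just one of many numeric status codes a server can send back with its reply. As a small illustration (the choice of codes here is ours, not something the workshop depends on), Python's standard library can tell you what a given code means:

```python
from http import HTTPStatus

# Look up the standard phrase for a few status codes a server can send back.
for code in (200, 301, 404):
    status = HTTPStatus(code)
    print(code, status.phrase)
```

Roughly speaking, codes in the 200s mean success, the 300s mean the Resource lives somewhere else, and the 400s mean something was wrong with the request, such as asking for an address where nothing lives.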

URLs also have structure as illustrated below:

![image.png](attachment:6361b2d9-3e44-4d8f-b6da-ff3c5d9f8321.png "Source: https://developer.mozilla.org/en-US/docs/Learn/Common_questions/Web_mechanics/What_is_a_URL")

Let's discuss the following parts:
1. **Scheme**: This indicates the **Protocol** that our Browser must use to access the Resource. For Web Scraping, we'll deal mostly with **HTTP *(http://)*** and its secure version, **HTTPS *(https://)***. The **HTTP (Hypertext Transfer Protocol)** has a set of **Request Methods** (also known as *HTTP Verbs*) that the Browser has to use to do something with the resource on the given URL. For our application, we only need to **GET** a Page from a URL, so we'll use that HTTP Request Method.
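2. **Domain Name**: This indicates which Web Server is being asked for the Resource, like `www.mynimo.com`. An IP address can be used here too, but that's rare because it's harder to remember.
3. **Port**: The technical "gate" used to access the Resources on the Web Server. It's usually omitted when the server uses the standard ports: 80 for HTTP and 443 for HTTPS.
4. **Path to the file**: The path to the Resource on the Web Server, like `/cebu-jobs` in the Cebu Jobs URL above.
5. **Parameters**: Extra key/value pairs that come after the `?` and are separated by `&`, like `?key1=value1&key2=value2`. The Web Server can use them to decide what to send back.
6. **Anchor**: A kind of bookmark inside the Resource that comes after the `#`. This part is never sent to the Web Server. The Browser handles it by itself, for example by scrolling to that spot on the Page.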
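To see the parts listed above without squinting at the address bar, here's a minimal sketch that splits a URL with Python's standard `urllib.parse` module. The first URL is an `example.com` address written in the style of the illustration and is only an assumption for demonstration; the second is the Cebu Jobs URL from earlier.

```python
from urllib.parse import parse_qs, urlsplit

# Illustrative URL that contains every part from the list above.
url = "http://www.example.com:80/path/to/myfile.html?key1=value1&key2=value2#SomewhereInTheDocument"

parts = urlsplit(url)
print("Scheme:     ", parts.scheme)
print("Domain Name:", parts.hostname)
print("Port:       ", parts.port)
print("Path:       ", parts.path)
print("Parameters: ", parse_qs(parts.query))  # dict of key -> list of values
print("Anchor:     ", parts.fragment)

# The Cebu Jobs URL has no explicit port, parameters, or anchor.
mynimo_parts = urlsplit("https://www.mynimo.com/cebu-jobs")
print(mynimo_parts.scheme, mynimo_parts.hostname, mynimo_parts.port, mynimo_parts.path)
```

Note that `port` comes back as `None` when the URL doesn't spell one out, since the standard port for the scheme is implied.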

### Section 1.2: Downloading Web Pages from a URL with the Requests Library

Our Web Browsers can do various actions on that URL. But for our application, we only need to use **GET** to GET a page from a URL.

So our next question is... can we use Python to *GET* a Page from a URL?

The answer is YES! And that is by using the ***Requests*** library!

In [3]:
import requests
from IPython.display import HTML, display

# Use this header configuration so that you can access the websites properly. Trust me, Bro/Brodette. I'll explain later.
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36"
}
url = "https://www.mynimo.com/cebu-jobs"

# Send an HTTP GET request to the URL. The timeout keeps the cell from hanging forever.
response = requests.get(url=url, headers=headers, timeout=30)

# Stop here with an error if the server answered with a 4xx or 5xx status code.
response.raise_for_status()

# Display the HTML document from the Response inside the notebook.
display(HTML(response.text))
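If you're curious what a GET request is made of before it goes out, here's a rough sketch that only *builds* one and never sends it, so it works even offline. Note that it uses the standard library's `urllib.request` instead of Requests, purely so the cell has no extra dependencies. This is an illustration, not how Requests works internally.

```python
from urllib.request import Request

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36"
}
url = "https://www.mynimo.com/cebu-jobs"

# Build the request object only. Nothing is sent over the network here.
request = Request(url, headers=headers, method="GET")

print("Method:", request.get_method())
print("URL:   ", request.full_url)
print("Host:  ", request.host)
# urllib stores header names with only the first letter capitalized.
print("Header:", request.get_header("User-agent"))
```

So a GET request boils down to a Request Method, a URL, and some headers that describe who is asking.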

## Chapter 2: Web Crawling Basics with Scrapy

## Chapter 3: Browser Automation Basics with Selenium


In [9]:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service as ChromeService
from webdriver_manager.chrome import ChromeDriverManager

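# Download a ChromeDriver build (if it isn't cached yet) and start Chrome with it.
# The version string should match the Chrome version installed on your machine.
service = ChromeService(ChromeDriverManager("123.0.6312.59").install())
browser = webdriver.Chrome(service=service)

# Open a page to check that the browser is under our control.
browser.get("https://www.python.org")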

In [None]:
browser.close()
browser.quit()

## Chapter 4: Technical and Ethical Issues on Web Scraping

## Chapter 5: Closing Notes

Please join our **[Meetup.com page](https://www.meetup.com/PizzaPy-PH/)** and like our **[Facebook Page](https://www.facebook.com/PizzaPy.PH/)** to keep updated on our latest events, just in case you'll be in Cebu. Some of our events are also livestreamed on the Facebook Page, so please do subscribe to our channels.

And if you'd like to join the Community Chatter (PizzaPy Superchat) on Messenger so you can find people to chat with about Python, Tech News, and Memes, please let me know or **ask anyone here if they're a member of the group**, and have them add and introduce you to the Group Chat.

From, PizzaPy - Cebu Python Users Group:

**Thank you!**

![pizzapy_big_pie.png](attachment:a76e74e5-4358-4ff7-878c-75e1ac8b03e5.png)

### References:
<a id='reference_1'></a>[1] A. Sweigart and Recorded Books, Inc, _Automate the boring stuff with Python: practical programming for total beginners_. San Francisco: No Starch Press, 2020.

<a id='reference_2'></a>[2] “What is a URL?,” *What is a URL? - Learn web development | MDN*, Aug. 03, 2023. https://developer.mozilla.org/en-US/docs/Learn/Common_questions/Web_mechanics/What_is_a_URL (accessed Mar. 11, 2024).