# Web Scraping Basics

Web scraping is a powerful tool to have in a data science, analytics, or engineering toolkit.
It allows you to extract data from websites and use it for your own projects or analysis.

In football analytics, web scraping can be used to collect data on players, teams, and matches.
Most of the data that teams use comes from large, expensive data providers, but we can collect some of it ourselves via web scraping.

In this notebook, we will cover the basics of web scraping using Python and the `requests` and `BeautifulSoup` libraries.

#### Web Scraping Steps
1. Send an HTTP request to the URL of the webpage you want to access
2. Get the HTML content of the webpage
3. Parse the HTML content
4. Extract the data
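
As a rough illustration of the four steps, here is a minimal sketch. The helper names `fetch_html` and `extract_heading` and the sample HTML are made up for this example and are not part of the notebook; the demo parses an inline HTML string so that it runs without a network call.

```python
from typing import Optional

import requests
from bs4 import BeautifulSoup


def fetch_html(url: str, timeout: float = 10.0) -> str:
    """Steps 1 and 2: send the HTTP request and return the page's HTML."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raise an exception on 4xx/5xx responses
    return response.text


def extract_heading(html: str) -> Optional[str]:
    """Steps 3 and 4: parse the HTML and extract the first <h1> text."""
    soup = BeautifulSoup(html, "html.parser")
    heading = soup.select_one("h1")
    return heading.get_text(strip=True) if heading else None


# Hypothetical HTML standing in for a downloaded page (no request is sent here)
sample_html = "<html><body><h1> Example Product </h1></body></html>"
print(extract_heading(sample_html))
```

With a real page, `extract_heading(fetch_html(url))` would chain all four steps together.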

In [2]:
import requests
from bs4 import BeautifulSoup

In [3]:
# We'll start by scraping a normal ecommerce website, Gymshark.com
# First, we'll send an HTTP request to the URL of the webpage we want to access

url = ","

# Let's also define the headers to send, using a browser-like User-Agent
headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36"
}

response = requests.get(
    url,
    headers=headers
)

In [4]:
# We can check the status code of the response to see if the request was successful
response.status_code

200

#### Status Codes
You'll mainly see the following status codes when web scraping:

- 200, the request was successful
- 404, the page was not found
- 403, access to the page was forbidden, which often means the site is blocking our request; adding browser-like headers or using a proxy may help
- 500, there was an internal server error
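
One way to act on these codes in a script is a small lookup helper. The sketch below is only an illustration: `STATUS_NOTES` and `describe_status` are hypothetical names, and the fallback messages are grouped by the standard HTTP ranges (2xx success, 4xx client error, 5xx server error).

```python
# Notes for the status codes listed above (wording is illustrative)
STATUS_NOTES = {
    200: "OK: the request was successful",
    403: "Forbidden: the server refused the request",
    404: "Not Found: check the URL",
    500: "Internal Server Error: the problem is on the server side",
}


def describe_status(status_code: int) -> str:
    """Return a short note for a status code, falling back to its range."""
    if status_code in STATUS_NOTES:
        return STATUS_NOTES[status_code]
    if 200 <= status_code < 300:
        return "Success"
    if 400 <= status_code < 500:
        return "Client error"
    if 500 <= status_code < 600:
        return "Server error"
    return "Other"


for code in (200, 403, 404, 500, 429):
    print(code, "->", describe_status(code))
```

In a notebook you would pass `response.status_code` to the helper. `requests` also provides `response.raise_for_status()`, which raises an exception for any 4xx or 5xx response.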

In [6]:
# We can then parse the HTML content of the webpage using BeautifulSoup
soup = BeautifulSoup(response.content, "html.parser")

In [10]:
# Now let's use some CSS selectors to extract the data we want
# Let's start off by getting the title of the product
# To select just one element, we can use the `select_one` method
title = soup.select_one('h1[class="product-information_title__3jR8K"]').text

In [11]:
# Now let's get the price of the product
price = soup.select_one('div[class="product-information_price__pEWjj"]').text
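
Note that `select_one` returns `None` when nothing matches, so calling `.text` on the result raises an `AttributeError` if the site changes its class names. A small guard can help. This is a sketch: `select_text` is a hypothetical helper, and the HTML and class names are made up for the example.

```python
from bs4 import BeautifulSoup


def select_text(soup, selector, default=None):
    """Return the stripped text of the first match, or `default` if none."""
    element = soup.select_one(selector)
    if element is None:
        return default
    return element.get_text(strip=True)


# Hypothetical HTML; the class names are illustrative only
sample_html = '<div><h1 class="title"> Example Tee </h1></div>'
sample_soup = BeautifulSoup(sample_html, "html.parser")

print(select_text(sample_soup, "h1.title"))
print(select_text(sample_soup, "div.price", default="n/a"))
```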

In [15]:
# Now let's try getting multiple elements
# To do this, we can use the `select` method

# Let's get the colors of the product
# We can chain CSS selectors (separated by spaces) to walk down to the element we want
colors = soup.select('div[class="variants_variants__C9MOx"] a img')

In [16]:
# That returned a list of elements, so we can loop through them and read each image's alt text
colors = [x.attrs['alt'].replace(title + ' in', '').strip() for x in colors]
print(colors)

['Black', 'Light Pink', 'Navy', 'Core Olive', 'Stone Grey', 'Light Grey Marl', 'Natural Sage Green', 'Faded Blue']


### Selector helpers
- `soup.select_one` returns the first element that matches the selector
- `soup.select` returns a list of elements that match the selector

We can also match elements more loosely, in a wildcard-like way, for example:
- `*=` -> attribute value (such as a class name or `alt` text) contains a substring
- `:-soup-contains()` -> element text contains a value
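
To see both selectors in isolation, here is a small sketch run against a hypothetical HTML snippet. The product name and markup are invented for the example and do not come from the scraped page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with two color images, one unrelated image, and two buttons
sample_html = """
<div>
  <img alt="Example Tee in Black">
  <img alt="Example Tee in Navy">
  <img alt="Size guide">
  <button>Add to bag</button>
  <button>Add to wishlist</button>
</div>
"""
demo_soup = BeautifulSoup(sample_html, "html.parser")

# `*=` keeps only the images whose alt attribute contains the substring
color_images = demo_soup.select('img[alt*="Example Tee in "]')
color_names = [img["alt"].replace("Example Tee in ", "") for img in color_images]
print(color_names)

# `:-soup-contains()` matches on the element's text rather than an attribute
bag_button = demo_soup.select_one('button:-soup-contains("Add to bag")')
print(bag_button.get_text())
```

The attribute selector ignores the "Size guide" image, and the text selector skips the wishlist button because its text does not contain "Add to bag".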

In [20]:
# So we could rewrite the colors selector as
colors = soup.select(f'img[alt*="{title} in "]')

In [22]:
# If we wanted to get the button whose text contains "Add to bag"
add_to_bag = soup.select_one('button:-soup-contains("Add to bag")')

### Exercise:

Now that you've seen how to get the colors, use the same method to get the sizes of the product.
Make sure that you only select the sizes for this specific product and not all the sizes on the page (there should be 8).

In [17]:
sizes = "YOUR CODE HERE"

In [25]:
# This is the answer: match the buttons whose data-locator-id contains "pdp-size"
sizes = soup.select('button[data-locator-id*="pdp-size"]')