# Web Scraping Basics

Web scraping is a useful skill to have in a data science, analytics, or engineering toolkit.
It lets you extract data from websites and use it in your own projects or analysis.

In football analytics, web scraping can be used to collect data on players, teams, and matches.
Most of the data that teams use comes from large, expensive data providers, but we can collect some of it ourselves via web scraping.

In this notebook, we will cover the basics of web scraping using Python and the `requests` and `BeautifulSoup` libraries.

#### Web Scraping Steps
1. Send an HTTP request to the URL of the webpage you want to access
2. Get the HTML content of the webpage
3. Parse the HTML content
4. Extract the data

In [2]:
import requests
from bs4 import BeautifulSoup

In [3]:
# We'll start by scraping a practice site built for learning web scraping, scrapethissite.com
# First, we'll send an HTTP request to the URL of the webpage we want to access
url = "https://www.scrapethissite.com/pages/simple/"

# We'll also pass a browser-like User-Agent header with the request
headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36"
}

response = requests.get(
    url,
    headers=headers
)

In [4]:
# We can check the status code of the response to see if the request was successful
response.status_code

200

#### Status Codes
You'll mainly see the following status codes when web scraping:

- 200, the request was successful
- 404, the page was not found
- 403, access to the page was forbidden; the server may be blocking automated requests, so adding browser-like headers or using a proxy can sometimes help
- 500, there was an internal server error
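
As a rough sketch of how these codes might be handled in a scraper, the helper below maps a status code to a short note using the standard library's `http.HTTPStatus`. The function name `describe_status` and the wording of the notes are assumptions made for illustration; they are not part of `requests`.

```python
from http import HTTPStatus


def describe_status(code: int) -> str:
    """Return a short, human-readable note for an HTTP status code."""
    # Hypothetical helper: the notes are this sketch's own wording.
    notes = {
        HTTPStatus.OK: "success, safe to parse the body",
        HTTPStatus.FORBIDDEN: "forbidden, the server refused the request",
        HTTPStatus.NOT_FOUND: "not found, check the URL",
        HTTPStatus.INTERNAL_SERVER_ERROR: "server error, try again later",
    }
    try:
        status = HTTPStatus(code)
    except ValueError:
        # The code is not a registered HTTP status.
        return f"{code}: unrecognised status code"
    note = notes.get(status, "no note for this code")
    return f"{code} {status.phrase}: {note}"


print(describe_status(200))
print(describe_status(403))
```

In a notebook you would pass `response.status_code` to the helper. If you would rather fail loudly on 4xx and 5xx responses, `requests` also provides `response.raise_for_status()`.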

In [5]:
# We can then parse the HTML content of the webpage using BeautifulSoup
soup = BeautifulSoup(response.content, "html.parser")

In [6]:
# Now let's use some CSS selectors to extract the data we want
# Let's start off by getting the page title
# To select just one element, we can use the `select_one` method
title = soup.select_one('h1').text
print(title)


                            Countries of the World: A Simple Example
                            250 items
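
The printed heading keeps the indentation and line breaks from the page source. One way to tidy it is `get_text` with a separator and `strip=True`. The sketch below runs on a small inline snippet rather than the live page; the `<small>` tag around the item count is an assumption about the markup, inferred from the two-line output.

```python
from bs4 import BeautifulSoup

# Assumed markup, modelled on the heading output printed in the notebook.
html = """
<h1>
    Countries of the World: A Simple Example
    <small>250 items</small>
</h1>
"""

soup = BeautifulSoup(html, "html.parser")
heading = soup.select_one("h1")

# .text returns every text fragment as-is, whitespace included.
raw = heading.text

# strip=True trims each fragment and drops empty ones;
# the first argument is the separator used to join what remains.
clean = heading.get_text(" ", strip=True)
print(clean)
```

For a single-text element, `element.text.strip()` does the same job; `get_text` is mostly helpful when an element contains nested tags.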



In [7]:
# Now let's get the description of the page
description = soup.select_one('p[class="lead"]').text
print(description)


                            A single page that lists information about all the countries in the world. Good for those just get started with web scraping.
                            Practice looking for patterns in the HTML that will allow you to extract information about each country. Then, build a simple web scraper that makes a request to this page, parses the HTML and prints out each country's name.
                        


In [12]:
# Now let's try getting multiple elements
# To do this, we can use the `select` method

# Let's get all of the country names
country_names = soup.select('h3[class="country-name"]')
print(country_names[0:5])

[<h3 class="country-name">
<i class="flag-icon flag-icon-ad"></i>
                            Andorra
                        </h3>, <h3 class="country-name">
<i class="flag-icon flag-icon-ae"></i>
                            United Arab Emirates
                        </h3>, <h3 class="country-name">
<i class="flag-icon flag-icon-af"></i>
                            Afghanistan
                        </h3>, <h3 class="country-name">
<i class="flag-icon flag-icon-ag"></i>
                            Antigua and Barbuda
                        </h3>, <h3 class="country-name">
<i class="flag-icon flag-icon-ai"></i>
                            Anguilla
                        </h3>]


In [13]:
# Keep just the text of each element, with surrounding whitespace stripped
country_names = [x.text.strip() for x in country_names]
print(country_names)

['Andorra', 'United Arab Emirates', 'Afghanistan', 'Antigua and Barbuda', 'Anguilla', 'Albania', 'Armenia', 'Angola', 'Antarctica', 'Argentina', 'American Samoa', 'Austria', 'Australia', 'Aruba', 'Åland', 'Azerbaijan', 'Bosnia and Herzegovina', 'Barbados', 'Bangladesh', 'Belgium', 'Burkina Faso', 'Bulgaria', 'Bahrain', 'Burundi', 'Benin', 'Saint Barthélemy', 'Bermuda', 'Brunei', 'Bolivia', 'Bonaire', 'Brazil', 'Bahamas', 'Bhutan', 'Bouvet Island', 'Botswana', 'Belarus', 'Belize', 'Canada', 'Cocos [Keeling] Islands', 'Democratic Republic of the Congo', 'Central African Republic', 'Republic of the Congo', 'Switzerland', 'Ivory Coast', 'Cook Islands', 'Chile', 'Cameroon', 'China', 'Colombia', 'Costa Rica', 'Cuba', 'Cape Verde', 'Curacao', 'Christmas Island', 'Cyprus', 'Czech Republic', 'Germany', 'Djibouti', 'Denmark', 'Dominica', 'Dominican Republic', 'Algeria', 'Ecuador', 'Estonia', 'Egypt', 'Western Sahara', 'Eritrea', 'Spain', 'Ethiopia', 'Finland', 'Fiji', 'Falkland Islands', 'Micron

### Selector helpers
- `soup.select_one` returns the first element that matches the selector
- `soup.select` returns a list of elements that match the selector

Selectors can also match loosely, much like a wildcard. For example:
- `*=` -> the attribute value contains a substring, e.g. `[class*="-name"]`
- `:-soup-contains()` -> the element's text contains a value
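
Here is a minimal, offline sketch of these helpers side by side. The inline HTML is an assumed miniature of the page (class names are taken from the output above; the wrapping `div` is hypothetical), so it runs without a network request.

```python
from bs4 import BeautifulSoup

# Assumed miniature page, modelled on the markup printed earlier.
html = """
<div class="country">
  <h3 class="country-name">Andorra</h3>
  <span class="country-capital">Andorra la Vella</span>
</div>
<div class="country">
  <h3 class="country-name">Angola</h3>
  <span class="country-capital">Luanda</span>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# select_one: the first match, or None when nothing matches.
first = soup.select_one("h3.country-name")
missing = soup.select_one("h3.does-not-exist")

# select: a list of every match, in document order.
all_names = soup.select("h3.country-name")

# *= matches when the attribute value contains the substring.
wildcard = soup.select('span[class*="capital"]')

# :-soup-contains() matches on the element's text.
by_text = soup.select('h3:-soup-contains("Ango")')

print(first.text, missing, len(all_names))
print([el.text for el in wildcard])
print([el.text for el in by_text])
```

Because `select_one` returns `None` when nothing matches, calling `.text` on the result directly raises an `AttributeError` for a missing element, so it is worth checking for `None` on pages you do not control.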

In [16]:
# So we could rewrite the above code to match class names more loosely with a wildcard
wild_card_names = soup.select('h3[class*="-name"]')
wild_card_names = [x.text.strip() for x in wild_card_names]
print(wild_card_names)

['Andorra', 'United Arab Emirates', 'Afghanistan', 'Antigua and Barbuda', 'Anguilla', 'Albania', 'Armenia', 'Angola', 'Antarctica', 'Argentina', 'American Samoa', 'Austria', 'Australia', 'Aruba', 'Åland', 'Azerbaijan', 'Bosnia and Herzegovina', 'Barbados', 'Bangladesh', 'Belgium', 'Burkina Faso', 'Bulgaria', 'Bahrain', 'Burundi', 'Benin', 'Saint Barthélemy', 'Bermuda', 'Brunei', 'Bolivia', 'Bonaire', 'Brazil', 'Bahamas', 'Bhutan', 'Bouvet Island', 'Botswana', 'Belarus', 'Belize', 'Canada', 'Cocos [Keeling] Islands', 'Democratic Republic of the Congo', 'Central African Republic', 'Republic of the Congo', 'Switzerland', 'Ivory Coast', 'Cook Islands', 'Chile', 'Cameroon', 'China', 'Colombia', 'Costa Rica', 'Cuba', 'Cape Verde', 'Curacao', 'Christmas Island', 'Cyprus', 'Czech Republic', 'Germany', 'Djibouti', 'Denmark', 'Dominica', 'Dominican Republic', 'Algeria', 'Ecuador', 'Estonia', 'Egypt', 'Western Sahara', 'Eritrea', 'Spain', 'Ethiopia', 'Finland', 'Fiji', 'Falkland Islands', 'Micron

In [19]:
# To get a specific country by its text, we can use :-soup-contains() in the selector
andorra = soup.select_one('h3:-soup-contains("Andorra")').text
print(andorra.strip())

Andorra


In [20]:
# One last thing we can do is step down the tree with a descendant selector
# (a space between two selectors means "inside of"), for example to get each country's population
population = soup.select('div[class="country-info"] span[class*="population"]')
print([x.text for x in population])

['84000', '4975593', '29121286', '86754', '13254', '2986952', '2968000', '13068161', '0', '41343201', '57881', '8205000', '21515754', '71566', '26711', '8303512', '4590000', '285653', '156118464', '10403000', '16241811', '7148785', '738004', '9863117', '9056010', '8450', '65365', '395027', '9947418', '18012', '201103330', '301790', '699847', '0', '2029307', '9685000', '314522', '33679000', '628', '70916439', '4844927', '3039126', '7581000', '21058798', '21388', '16746491', '19294149', '1330044000', '47790000', '4516220', '11423000', '508659', '141766', '1500', '1102677', '10476000', '81802257', '740528', '5484000', '72813', '9823821', '34586184', '14790608', '1291170', '80471869', '273008', '5792984', '46505963', '88013491', '5244000', '875983', '2638', '107708', '48228', '64768389', '1545255', '62348447', '107818', '4630000', '195506', '65228', '24339838', '27884', '56375', '1593256', '10324025', '443000', '1014999', '11000000', '30', '13550440', '159358', '1565126', '748486', '689868
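
The separate lists of names and populations only line up if every country has every field. A sketch of an alternative is to loop over each country's container and step down from there, which keeps the fields together and lets us convert the population to an integer. The `div.country` wrapper below is an assumption about the page structure and the inline HTML is a hand-made sample, so adjust the selectors to the real markup.

```python
from bs4 import BeautifulSoup

# Hand-made sample; the outer div.country wrapper is assumed, not shown in the notebook.
html = """
<div class="country">
  <h3 class="country-name"><i class="flag-icon flag-icon-ad"></i> Andorra</h3>
  <div class="country-info">
    <span class="country-capital">Andorra la Vella</span>
    <span class="country-population">84000</span>
  </div>
</div>
<div class="country">
  <h3 class="country-name">Antarctica</h3>
  <div class="country-info">
    <span class="country-capital">None</span>
    <span class="country-population">0</span>
  </div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

records = []
for block in soup.select("div.country"):
    # Calling select_one on the block searches only inside that country.
    name = block.select_one("h3.country-name").get_text(strip=True)
    capital = block.select_one("span.country-capital").get_text(strip=True)
    population = block.select_one("span.country-population").get_text(strip=True)
    records.append(
        {"name": name, "capital": capital, "population": int(population)}
    )

print(records)
```

A list of dictionaries like `records` can be passed straight to `pandas.DataFrame` if you want a table.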

### Exercise:

Now that you've seen how to get the country names, use the same method to get the capital cities of the countries.

In [14]:
capitals = "YOUR CODE HERE"

In [21]:
# This is one way to do it
capitals = soup.select('span[class="country-capital"]')
print([x.text.strip() for x in capitals])

['Andorra la Vella', 'Abu Dhabi', 'Kabul', "St. John's", 'The Valley', 'Tirana', 'Yerevan', 'Luanda', 'None', 'Buenos Aires', 'Pago Pago', 'Vienna', 'Canberra', 'Oranjestad', 'Mariehamn', 'Baku', 'Sarajevo', 'Bridgetown', 'Dhaka', 'Brussels', 'Ouagadougou', 'Sofia', 'Manama', 'Bujumbura', 'Porto-Novo', 'Gustavia', 'Hamilton', 'Bandar Seri Begawan', 'Sucre', 'Kralendijk', 'Brasília', 'Nassau', 'Thimphu', 'None', 'Gaborone', 'Minsk', 'Belmopan', 'Ottawa', 'West Island', 'Kinshasa', 'Bangui', 'Brazzaville', 'Bern', 'Yamoussoukro', 'Avarua', 'Santiago', 'Yaoundé', 'Beijing', 'Bogotá', 'San José', 'Havana', 'Praia', 'Willemstad', 'Flying Fish Cove', 'Nicosia', 'Prague', 'Berlin', 'Djibouti', 'Copenhagen', 'Roseau', 'Santo Domingo', 'Algiers', 'Quito', 'Tallinn', 'Cairo', 'Laâyoune / El Aaiún', 'Asmara', 'Madrid', 'Addis Ababa', 'Helsinki', 'Suva', 'Stanley', 'Palikir', 'Tórshavn', 'Paris', 'Libreville', 'London', "St. George's", 'Tbilisi', 'Cayenne', 'St Peter Port', 'Accra', 'Gibraltar', '

In [22]:
# This is another way to do it
capitals = soup.select('div[class="country-info"] span[class*="capital"]')
print([x.text for x in capitals])

['Andorra la Vella', 'Abu Dhabi', 'Kabul', "St. John's", 'The Valley', 'Tirana', 'Yerevan', 'Luanda', 'None', 'Buenos Aires', 'Pago Pago', 'Vienna', 'Canberra', 'Oranjestad', 'Mariehamn', 'Baku', 'Sarajevo', 'Bridgetown', 'Dhaka', 'Brussels', 'Ouagadougou', 'Sofia', 'Manama', 'Bujumbura', 'Porto-Novo', 'Gustavia', 'Hamilton', 'Bandar Seri Begawan', 'Sucre', 'Kralendijk', 'Brasília', 'Nassau', 'Thimphu', 'None', 'Gaborone', 'Minsk', 'Belmopan', 'Ottawa', 'West Island', 'Kinshasa', 'Bangui', 'Brazzaville', 'Bern', 'Yamoussoukro', 'Avarua', 'Santiago', 'Yaoundé', 'Beijing', 'Bogotá', 'San José', 'Havana', 'Praia', 'Willemstad', 'Flying Fish Cove', 'Nicosia', 'Prague', 'Berlin', 'Djibouti', 'Copenhagen', 'Roseau', 'Santo Domingo', 'Algiers', 'Quito', 'Tallinn', 'Cairo', 'Laâyoune / El Aaiún', 'Asmara', 'Madrid', 'Addis Ababa', 'Helsinki', 'Suva', 'Stanley', 'Palikir', 'Tórshavn', 'Paris', 'Libreville', 'London', "St. George's", 'Tbilisi', 'Cayenne', 'St Peter Port', 'Accra', 'Gibraltar', '