# Exercises
Do one of the following exercises using the requests and BeautifulSoup libraries.

1) Web scrape [flexcar.gr](http://flexcar.gr)
Get the features (brand, model, price, hp, gearbox, extras..) for all leasing car offers.  

2) Web scrape [ATPWorldTour](https://www.atptour.com/en/rankings/singles) website (singles)
Get all weeks (from 1973 to today) of the top 100 rankings. Scrape date, player name, ranking, country and points, store the data in a dataframe and save it at the end to a CSV file. (Hint: use an empty string “” in the headers)  

headers={'User-Agent': ''}  
page = requests.get(url,timeout=15, headers= headers)  

Bonus:  
3) Web scrape [kariera.gr](http://www.kariera.gr) using Selenium
Retrieve all job ads for Data Analyst, Data Scientist and Data Engineer, and store features such as company, job title, content, location and job occupation in a dataframe.

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

## 2) Web scrape ATPWorldTour website (singles)
Get all weeks (from 1973 to today) of the top 100 rankings. Scrape date, player name, ranking, country and points, store the data in a dataframe and save it at the end to a CSV file. (Hint: use an empty string “” in the headers)  

I will tackle each step individually, then stitch the steps into functions and orchestrate them.

### Crawling

First I want to deal with the crawling part. I want to get hold of all the URLs, each representing a weekly ranking since 1973.

In [2]:
first_week_url = "https://www.atptour.com/en/rankings/singles?dateWeek=1973-08-23"
response = requests.get(first_week_url)
response.status_code
# Check status code, we want 200
if response.status_code != 200:
    print(response.status_code, response.reason)

403 Forbidden


Access is forbidden. This probably means that I need to pretend I am a real user and not a bot. I will do that by adding a browser-like User-Agent header to my request.

In [3]:
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0 Safari/537.36"
}

first_week_url = "https://www.atptour.com/en/rankings/singles?dateWeek=1973-08-23"
response = requests.get(first_week_url, headers=headers)
response.status_code
# Check status code, we want 200
if response.status_code != 200:
    print(response.status_code, response.reason)

In [4]:
soup = BeautifulSoup(response.content, 'html.parser')
type(soup)

bs4.BeautifulSoup

To inspect the html I am using a combination of 'view source' and 'inspect element' in my browser.

To create a list of all the URLs I could iterate from the starting date to today in weekly steps, using datetime and a weekly timedelta.  
After careful inspection I noticed two things:
- not every week is populated
- I can get hold of all the dates (the contents of the dropdown menu)
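
For comparison, the date-stepping idea could be sketched as below. This is only an illustration with the standard library; `weekly_dates` is a hypothetical helper and not part of the final solution. Since not every week is populated, a list built this way would include dates that have no ranking page, which is why I read the dates from the dropdown instead.

```python
from datetime import date, timedelta


def weekly_dates(start: date, end: date) -> list[str]:
    """Return ISO-formatted dates from start to end (inclusive) in 7-day steps."""
    dates = []
    current = start
    while current <= end:
        dates.append(current.isoformat())
        current += timedelta(weeks=1)
    return dates


# Short sample range starting at the first ranking week
sample = weekly_dates(date(1973, 8, 23), date(1973, 9, 20))
print(sample)
```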

So the next step is to create a list of the URLs of each week.

In [5]:
# tag containing all date tags
select_tag = soup.find('select', id='dateWeek-filter')

In [6]:
# verify
select_tag.text

'\n2025.05.05\n2025.04.21\n2025.04.14\n2025.04.07\n2025.03.31\n2025.03.17\n2025.03.03\n2025.02.24\n2025.02.17\n2025.02.10\n2025.02.03\n2025.01.27\n2025.01.13\n2025.01.06\n2024.12.30\n2024.12.23\n2024.12.16\n2024.12.09\n2024.12.02\n2024.11.25\n2024.11.18\n2024.11.11\n2024.11.04\n2024.10.28\n2024.10.21\n2024.10.14\n2024.09.30\n2024.09.23\n2024.09.16\n2024.09.09\n2024.08.26\n2024.08.19\n2024.08.12\n2024.08.05\n2024.07.29\n2024.07.22\n2024.07.15\n2024.07.01\n2024.06.24\n2024.06.17\n2024.06.10\n2024.05.27\n2024.05.20\n2024.05.06\n2024.04.22\n2024.04.15\n2024.04.08\n2024.04.01\n2024.03.18\n2024.03.04\n2024.02.26\n2024.02.19\n2024.02.12\n2024.02.05\n2024.01.29\n2024.01.15\n2024.01.08\n2024.01.01\n2023.12.25\n2023.12.18\n2023.12.11\n2023.12.04\n2023.11.27\n2023.11.20\n2023.11.13\n2023.11.06\n2023.10.30\n2023.10.23\n2023.10.16\n2023.10.02\n2023.09.25\n2023.09.18\n2023.09.11\n2023.08.28\n2023.08.21\n2023.08.14\n2023.08.07\n2023.07.31\n2023.07.24\n2023.07.17\n2023.07.03\n2023.06.26\n2023.06.19\n2

In [7]:
date_tags = select_tag.find_all('option')

In [8]:
date_tags[:5]

[<option value="Current Week">2025.05.05</option>,
 <option value="2025-04-21">2025.04.21</option>,
 <option value="2025-04-14">2025.04.14</option>,
 <option value="2025-04-07">2025.04.07</option>,
 <option value="2025-03-31">2025.03.31</option>]

Now I will extract the date part of each option tag

In [9]:
dates = [date_tag['value'] for date_tag in date_tags]
dates[:5]

['Current Week', '2025-04-21', '2025-04-14', '2025-04-07', '2025-03-31']

The next step is to construct a URL for each week. For this I will use the following base URL and concatenate the date.  
In the function I will convert 'Current Week' to 'Current+Week' so that the space is URL-encoded.

In [10]:
url = "https://www.atptour.com/en/rankings/singles?dateWeek="

In [11]:
urls = sorted(f'{url}{date}' for date in dates)

In [12]:
urls[-5:]

['https://www.atptour.com/en/rankings/singles?dateWeek=2025-03-31',
 'https://www.atptour.com/en/rankings/singles?dateWeek=2025-04-07',
 'https://www.atptour.com/en/rankings/singles?dateWeek=2025-04-14',
 'https://www.atptour.com/en/rankings/singles?dateWeek=2025-04-21',
 'https://www.atptour.com/en/rankings/singles?dateWeek=Current Week']
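
Note that the last URL still contains a raw space. A minimal sketch of how the query string could be encoded with the standard library's `urllib.parse.urlencode`, which turns the space into `+`. The helper name `build_week_url` is hypothetical and used only for illustration.

```python
from urllib.parse import urlencode

base_url = "https://www.atptour.com/en/rankings/singles"


def build_week_url(date_value: str) -> str:
    """Build a weekly ranking URL with a properly encoded dateWeek parameter."""
    return f"{base_url}?{urlencode({'dateWeek': date_value})}"


print(build_week_url("Current Week"))
print(build_week_url("2025-04-21"))
```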

### Scraping

Now I will move on to the scraping part. I will use the first week as a sample and try to extract the data from there.
The data we need:
- date
- player name 
- ranking
- country
- points

##### Date

This information is contained in the URL, but I would prefer to get it from the HTML itself because I want to completely decouple the crawler and the scraper.  
In the same dropdown where we found the dates, one of the option tags carries the `selected` attribute. That is the one we are looking for.

In [13]:
current_date  = select_tag.find('option', selected=True).text

In [14]:
current_date

'1973.08.23'

##### Player name

For this part, and probably the next ones, I had to bring out the big guns.  
https://html.onlineviewer.net/ will allow me to view the full structure of the HTML cleanly.

I have noticed that all data appears in order. All I have to do is find the tags that contain the names, and that should bring them back in the correct order.

In [15]:
player_names = soup.find_all('li', class_='name center')

In [16]:
len(player_names), player_names[:5]

(100,
 [<li class="name center">
  <a href="/en/players/ilie-nastase/n008/overview">
  <span>Ilie Nastase</span>
  </a>
  </li>,
  <li class="name center">
  <a href="/en/players/manuel-orantes/o017/overview">
  <span>Manuel Orantes</span>
  </a>
  </li>,
  <li class="name center">
  <a href="/en/players/stan-smith/s060/overview">
  <span>Stan Smith</span>
  </a>
  </li>,
  <li class="name center">
  <a href="/en/players/arthur-ashe/a063/overview">
  <span>Arthur Ashe</span>
  </a>
  </li>,
  <li class="name center">
  <a href="/en/players/rod-laver/l058/overview">
  <span>Rod Laver</span>
  </a>
  </li>])

As we can see, we got exactly 100 matches as we should, and I can verify that the order is correct too.

In [17]:
player_names = [li.find('span').text for li in player_names]
player_names[:5]

['Ilie Nastase', 'Manuel Orantes', 'Stan Smith', 'Arthur Ashe', 'Rod Laver']

##### Ranking

Since our data is ordered, there is no need to extract the ranking. I will simply generate it programmatically.
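
A minimal sketch of generating the ranks from list positions, using the first three names scraped above as sample data:

```python
# Sample data: the first three names from the 1973-08-23 page, in page order.
player_names = ["Ilie Nastase", "Manuel Orantes", "Stan Smith"]

# enumerate(..., start=1) pairs each name with its 1-based position, i.e. its rank.
ranked = list(enumerate(player_names, start=1))
print(ranked)
```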

##### Country

I have noticed that we have an abbreviation of the country for each player. Luckily, there is a table that will help us convert it to a full country name.

I started by searching for our first player, I. Nastase from Romania. I used a regular expression to isolate the `romania` or `rou` entries in the HTML viewer.

`(?<!g|\.)rou` generated two entries for each player from Romania.
They both seem to be ordered, so I will just pick one of the two tag patterns.
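
To illustrate what the negative lookbehind in that pattern does, here is a small sketch with Python's `re` module. The first sample string is a flag link from the page; the other two are made-up strings, chosen only to show occurrences of `rou` that the pattern skips because they are preceded by `g` or a literal dot.

```python
import re

# "rou" not immediately preceded by "g" or "."
pattern = re.compile(r"(?<!g|\.)rou")

samples = [
    "/assets/atptour/assets/flags.svg#flag-rou",  # real link format from the page
    "button-group",                               # made-up: "rou" preceded by "g"
    "style.round",                                # made-up: "rou" preceded by "."
]

matches = [s for s in samples if pattern.search(s)]
print(matches)
```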

In [18]:
svg_tags = soup.find_all('svg', class_='atp-flag')
len(svg_tags), svg_tags[:5]

(200,
 [<svg class="atp-flag flag"><use href="/assets/atptour/assets/flags.svg#flag-rou"></use></svg>,
  <svg class="atp-flag flag"><use href="/assets/atptour/assets/flags.svg#flag-esp"></use></svg>,
  <svg class="atp-flag flag"><use href="/assets/atptour/assets/flags.svg#flag-usa"></use></svg>,
  <svg class="atp-flag flag"><use href="/assets/atptour/assets/flags.svg#flag-usa"></use></svg>,
  <svg class="atp-flag flag"><use href="/assets/atptour/assets/flags.svg#flag-aus"></use></svg>])

Now I will verify that the first 100 flags match the last 100 flags, in the same order.

In [19]:
svg_tags[:100] == svg_tags[100:]

True

We can now safely discard the second half

In [20]:
svg_tags = svg_tags[:100]

In [21]:
use_tags = [svg_tag.find('use') for svg_tag in svg_tags]

In [22]:
use_tags[:5]

[<use href="/assets/atptour/assets/flags.svg#flag-rou"></use>,
 <use href="/assets/atptour/assets/flags.svg#flag-esp"></use>,
 <use href="/assets/atptour/assets/flags.svg#flag-usa"></use>,
 <use href="/assets/atptour/assets/flags.svg#flag-usa"></use>,
 <use href="/assets/atptour/assets/flags.svg#flag-aus"></use>]

In [23]:
links = [use_tag['href'] for use_tag in use_tags]

In [24]:
links[:5]

['/assets/atptour/assets/flags.svg#flag-rou',
 '/assets/atptour/assets/flags.svg#flag-esp',
 '/assets/atptour/assets/flags.svg#flag-usa',
 '/assets/atptour/assets/flags.svg#flag-usa',
 '/assets/atptour/assets/flags.svg#flag-aus']

Now I will use a regex to isolate the country abbreviation from the link.

In [25]:
import re

def extract_flag_abbr(string: str) -> str:
    return re.search(r'(?<=#flag-)[A-Za-z]{3}$', string).group(0)

countries =  [extract_flag_abbr(link) for link in links]

In [26]:
len(countries)

100

In [27]:
countries[:5]

['rou', 'esp', 'usa', 'usa', 'aus']
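
One caveat: `re.search` returns `None` when a link does not match, so calling `.group(0)` directly would raise an `AttributeError`. A more defensive variant might look like the sketch below; `extract_flag_abbr_safe` is a hypothetical name, and returning `None` for unmatched links is my own assumption about how such cases should be handled.

```python
import re
from typing import Optional

# Same pattern as above: three letters right after "#flag-" at the end of the link.
FLAG_PATTERN = re.compile(r"(?<=#flag-)[A-Za-z]{3}$")


def extract_flag_abbr_safe(link: str) -> Optional[str]:
    """Return the flag abbreviation, or None if the link has no flag fragment."""
    match = FLAG_PATTERN.search(link)
    return match.group(0) if match else None


print(extract_flag_abbr_safe("/assets/atptour/assets/flags.svg#flag-rou"))
print(extract_flag_abbr_safe("/assets/atptour/assets/flags.svg"))
```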

Now let's get the dictionary to convert to real country names

In [28]:
select_region_filter = soup.find('select', id='region-filter')

In [29]:
region_option_tags  = select_region_filter.find_all('option')
countries_tuple = [(region_option_tag['value'], region_option_tag.text) for region_option_tag in region_option_tags]
countries_tuple[:5]


[('all', 'All Countries'),
 ('AFG', 'Afghanistan'),
 ('ALB', 'Albania'),
 ('ALG', 'Algeria'),
 ('ASA', 'American Samoa')]

In [30]:
country_dict = {k.lower(): v for k, v in countries_tuple}

In [31]:
countries = [country_dict.get(country_abbr, country_abbr) for country_abbr in countries]
countries[:5]

['Romania', 'Spain', 'United States', 'United States', 'Australia']

##### Points

The same applies here: everything appears twice, but in order. In any case I will verify after retrieving them.  
Note that the tag I selected is not 100% consistent, but this is fixed later in the function.

In [32]:
points_tds  = soup.find_all('td', class_='points center bold extrabold small-cell')
len(points_tds)

200

In [33]:
points_tds[0:5]

[<td class="points center bold extrabold small-cell" colspan="3">
 -                </td>,
 <td class="points center bold extrabold small-cell" colspan="3">
 -                </td>,
 <td class="points center bold extrabold small-cell" colspan="3">
 -                </td>,
 <td class="points center bold extrabold small-cell" colspan="3">
 -                </td>,
 <td class="points center bold extrabold small-cell" colspan="3">
 -                </td>]

The first hundred do not contain the link with the points, so I will try the second half.

In [34]:
points_tds[100:105]

[<td class="points center bold extrabold small-cell" colspan="2">
 <a href="/en/players/ilie-nastase/n008/rankings-breakdown?team=singles">
                                 0
                             </a>
 </td>,
 <td class="points center bold extrabold small-cell" colspan="2">
 <a href="/en/players/manuel-orantes/o017/rankings-breakdown?team=singles">
                                 0
                             </a>
 </td>,
 <td class="points center bold extrabold small-cell" colspan="2">
 <a href="/en/players/stan-smith/s060/rankings-breakdown?team=singles">
                                 0
                             </a>
 </td>,
 <td class="points center bold extrabold small-cell" colspan="2">
 <a href="/en/players/arthur-ashe/a063/rankings-breakdown?team=singles">
                                 0
                             </a>
 </td>,
 <td class="points center bold extrabold small-cell" colspan="2">
 <a href="/en/players/rod-laver/l058/rankings-breakdown?team=single

That should work fine

In [35]:
points = [points_td.find('a').text.strip() for points_td in points_tds[100:200]]

In [36]:
points[:5]

['0', '0', '0', '0', '0']

### Putting it all together

I will first create functions for each task and then orchestrate them.

##### Create functions

In [8]:
import re


def get_all_urls(soup: BeautifulSoup) -> list[str]:
    """Return a list with all weekly urls from the main soup object"""
    select_tag = soup.find("select", id="dateWeek-filter")
    date_tags = select_tag.find_all("option")
    # The option value for the latest week is "Current Week"; encode the space as "+"
    dates = [
        "Current+Week" if "Current" in date_tag["value"] else date_tag["value"]
        for date_tag in date_tags
    ]
    base_url = "https://www.atptour.com/en/rankings/singles?dateWeek="
    return sorted([f"{base_url}{date}" for date in dates])


def get_active_week(soup: BeautifulSoup) -> str:
    """
    Return the active week by scraping the page rather than parsing the URL.
    Unlike get_all_urls, this reads the option's text instead of its value
    attribute; otherwise the most recent week would come back as
    'Current Week' instead of a date.
    """
    
    select_tag = soup.find("select", id="dateWeek-filter")
    return select_tag.find("option", selected=True).text


def get_player_names(soup: BeautifulSoup) -> list[str]:
    """
    returns a list of the top-100 players names ordered by rank
    """
    li_tags = soup.find_all("li", class_="name center")
    return [li.find("span").text for li in li_tags]


def get_countries(soup: BeautifulSoup) -> list[str]:
    """
    returns a list of the top-100 players countries ordered by rank
    """
    svg_tags = soup.find_all("svg", class_="atp-flag")[:100]
    use_tags = [svg_tag.find("use") for svg_tag in svg_tags]
    links = [use_tag["href"] for use_tag in use_tags]
    countries = [_extract_flag_abbr(link) for link in links]
    return _convert_flag_abbr(soup, countries)


def _extract_flag_abbr(string: str) -> str:
    """
    Extract the three-letter flag abbreviation from a flags.svg link
    (the part after '#flag-').
    """
    return re.search(r"(?<=#flag-)[A-Za-z]{3}$", string).group(0)


def _convert_flag_abbr(
    soup: BeautifulSoup, countries: list[str]
) -> list[str]:
    """
    Helper function to convert flag/country abbreviations to the
    actual country names. The mapping between the two can be found inside
    the page source, so I did not have to rely on external sources.
    """
    select_region_filter = soup.find("select", id="region-filter")
    region_option_tags = select_region_filter.find_all("option")
    countries_tuple = [
        (region_option_tag["value"], region_option_tag.text)
        for region_option_tag in region_option_tags
    ]
    country_dict = {k.lower(): v for k, v in countries_tuple}
    return [
        country_dict.get(country_abbr, country_abbr)
        for country_abbr in countries
    ]


def get_points(soup: BeautifulSoup) -> list[str]:
    """
    After debugging some edge-case failures, I adjusted and tested
    the points extraction to arrive at the following code.
    I essentially had to find the tag before the one I was looking for
    and then seek its sibling.
    The reason for taking the slice is that I sometimes get 101 results
    instead of 100, and in that case the first one does not lead to any points.
    Taking the slice of the last 100 seems safe.
    """
    points_tds = soup.find_all("td", class_="age small-cell")
    return [
        points_td.find_next_sibling("td").find("a").text.strip()
        for points_td in points_tds[-100:]
    ]

##### Orchestration

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

In [2]:
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0 Safari/537.36"
}

url = "https://www.atptour.com/en/rankings/singles"
response = requests.get(url, headers=headers)
response.status_code
# Check status code
if response.status_code != 200:
    print(response.status_code, response.reason)

In [3]:
soup = BeautifulSoup(response.content, 'html.parser')

In [6]:
import logging

logger = logging.getLogger(__name__)
logger.addHandler(logging.FileHandler("log.txt"))
logger.setLevel(logging.INFO)

In [None]:
# gather all responses

import time
import random


urls = get_all_urls(soup)
urls_length = len(urls)

response_list = []

for i, url in enumerate(urls):
    logger.info(f"processing request {i + 1} out of {urls_length}")
    time.sleep(random.uniform(0.1, 0.5))
    response_list.append(requests.get(url, headers=headers))

In [71]:
# save html to files
import os

os.makedirs("pages", exist_ok=True)  # open() fails if the directory does not exist
for i, response in enumerate(response_list):
    if response and response.status_code == 200:
        with open(f"pages/page_{i}.html", "wb") as f:
            f.write(response.content)

In [4]:
# load files to a list of html text
import glob

content_list = []
file_paths = sorted(glob.glob("pages/page_*.html"))

for file_path in file_paths:
    with open(file_path, "r", encoding="utf-8") as f:
        content_list.append(f.read())
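
One thing to watch: `sorted()` on these file names is lexicographic, so `page_10.html` comes before `page_2.html` and the pages are not loaded in chronological order. That would explain why the full DataFrame printed near the end finishes with a 1997 week. If chronological order matters, a numeric sort key is one option. The sketch below uses a hypothetical helper `page_index` and a few made-up sample paths in the same naming scheme.

```python
import re


def page_index(path: str) -> int:
    """Return the integer in a 'page_<n>.html' file name."""
    match = re.search(r"page_(\d+)\.html$", path)
    if match is None:
        raise ValueError(f"unexpected file name: {path}")
    return int(match.group(1))


paths = ["pages/page_10.html", "pages/page_2.html", "pages/page_1.html"]

lexicographic = sorted(paths)            # string order: 1, 10, 2
numeric = sorted(paths, key=page_index)  # numeric order: 1, 2, 10
print(lexicographic)
print(numeric)
```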

In [32]:
# load html list to a list of dictionaries
content_length = len(content_list)
results_list = []

for i, content in enumerate(content_list):
    logger.info(f"processing response {i + 1} out of {content_length}")
    soup = BeautifulSoup(content, "html.parser")
    week = get_active_week(soup)
    ranks = list(range(1, 101))
    player_names = get_player_names(soup)
    countries = get_countries(soup)
    points_all = get_points(soup)
    results_list.extend(
        [
            {
                "week": week,
                "rank": rank,
                "player_name": player_name,
                "country": country,
                "points": points,
            }
            for rank, player_name, country, points in zip(
                ranks, player_names, countries, points_all
            )
        ]
    )
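
Note that `zip` stops silently at the shortest list, so a page where one field yields fewer values produces fewer rows without any warning. The row count reported by `df.info()` further down (232,490) is not a multiple of 100, which suggests this happened at least once. A hedged sketch of a stricter row builder follows; `build_rows` is a hypothetical helper, and raising an error on a mismatch is just one possible policy.

```python
def build_rows(week, player_names, countries, points_all):
    """Build one dict per player, failing loudly if the lists differ in length."""
    lengths = {len(player_names), len(countries), len(points_all)}
    if len(lengths) != 1:
        raise ValueError(
            f"length mismatch for week {week}: "
            f"{len(player_names)} names, {len(countries)} countries, "
            f"{len(points_all)} points"
        )
    return [
        {
            "week": week,
            "rank": rank,
            "player_name": player_name,
            "country": country,
            "points": points,
        }
        for rank, (player_name, country, points) in enumerate(
            zip(player_names, countries, points_all), start=1
        )
    ]


rows = build_rows(
    "1973.08.23",
    ["Ilie Nastase", "Manuel Orantes"],
    ["Romania", "Spain"],
    ["0", "0"],
)
print(rows)
```

In the loop above, a mismatch could also be logged and the page skipped instead of raising.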

In [33]:
results_list[:5]

[{'week': '1973.08.23',
  'rank': 1,
  'player_name': 'Ilie Nastase',
  'country': 'Romania',
  'points': '0'},
 {'week': '1973.08.23',
  'rank': 2,
  'player_name': 'Manuel Orantes',
  'country': 'Spain',
  'points': '0'},
 {'week': '1973.08.23',
  'rank': 3,
  'player_name': 'Stan Smith',
  'country': 'United States',
  'points': '0'},
 {'week': '1973.08.23',
  'rank': 4,
  'player_name': 'Arthur Ashe',
  'country': 'United States',
  'points': '0'},
 {'week': '1973.08.23',
  'rank': 5,
  'player_name': 'Rod Laver',
  'country': 'Australia',
  'points': '0'}]

In [62]:
df = pd.DataFrame(results_list)
df.head()

| | week | rank | player_name | country | points |
|---|---|---|---|---|---|
| 0 | 1973.08.23 | 1 | Ilie Nastase | Romania | 0 |
| 1 | 1973.08.23 | 2 | Manuel Orantes | Spain | 0 |
| 2 | 1973.08.23 | 3 | Stan Smith | United States | 0 |
| 3 | 1973.08.23 | 4 | Arthur Ashe | United States | 0 |
| 4 | 1973.08.23 | 5 | Rod Laver | Australia | 0 |


In [63]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 232490 entries, 0 to 232489
Data columns (total 5 columns):
 #   Column       Non-Null Count   Dtype 
---  ------       --------------   ----- 
 0   week         232490 non-null  object
 1   rank         232490 non-null  int64 
 2   player_name  232490 non-null  object
 3   country      232490 non-null  object
 4   points       232490 non-null  object
dtypes: int64(1), object(4)
memory usage: 8.9+ MB


In [65]:
# Convert types as necessary
df["week"] = pd.to_datetime(df["week"], format="%Y.%m.%d").dt.date  # explicit format, e.g. 1973.08.23
df['points'] = df['points'].str.replace(',', '').astype('Int64')
df['rank'] = df['rank'].astype('Int64')

In [66]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 232490 entries, 0 to 232489
Data columns (total 5 columns):
 #   Column       Non-Null Count   Dtype 
---  ------       --------------   ----- 
 0   week         232490 non-null  object
 1   rank         232490 non-null  Int64 
 2   player_name  232490 non-null  object
 3   country      232490 non-null  object
 4   points       232490 non-null  Int64 
dtypes: Int64(2), object(3)
memory usage: 9.3+ MB


In [67]:
df.head()

| | week | rank | player_name | country | points |
|---|---|---|---|---|---|
| 0 | 1973-08-23 | 1 | Ilie Nastase | Romania | 0 |
| 1 | 1973-08-23 | 2 | Manuel Orantes | Spain | 0 |
| 2 | 1973-08-23 | 3 | Stan Smith | United States | 0 |
| 3 | 1973-08-23 | 4 | Arthur Ashe | United States | 0 |
| 4 | 1973-08-23 | 5 | Rod Laver | Australia | 0 |


In [68]:
df[(df['player_name'].str.contains('Stefanos')) & (df['rank'] <= 3)]

| | week | rank | player_name | country | points |
|---|---|---|---|---|---|
| 129397 | 2021-08-09 | 3 | Stefanos Tsitsipas | Greece | 8115 |
| 129497 | 2021-08-16 | 3 | Stefanos Tsitsipas | Greece | 8350 |
| 129597 | 2021-08-23 | 3 | Stefanos Tsitsipas | Greece | 8350 |
| 129697 | 2021-08-30 | 3 | Stefanos Tsitsipas | Greece | 8350 |
| 129797 | 2021-09-13 | 3 | Stefanos Tsitsipas | Greece | 8350 |
| 129897 | 2021-09-20 | 3 | Stefanos Tsitsipas | Greece | 8350 |
| 129997 | 2021-09-27 | 3 | Stefanos Tsitsipas | Greece | 8350 |
| 130097 | 2021-10-04 | 3 | Stefanos Tsitsipas | Greece | 8175 |
| 130297 | 2021-10-18 | 3 | Stefanos Tsitsipas | Greece | 7995 |
| 130397 | 2021-10-25 | 3 | Stefanos Tsitsipas | Greece | 7930 |


In [69]:
df

| | week | rank | player_name | country | points |
|---|---|---|---|---|---|
| 0 | 1973-08-23 | 1 | Ilie Nastase | Romania | 0 |
| 1 | 1973-08-23 | 2 | Manuel Orantes | Spain | 0 |
| 2 | 1973-08-23 | 3 | Stan Smith | United States | 0 |
| 3 | 1973-08-23 | 4 | Arthur Ashe | United States | 0 |
| 4 | 1973-08-23 | 5 | Rod Laver | Australia | 0 |
| ... | ... | ... | ... | ... | ... |
| 232485 | 1997-09-29 | 96 | Andrei Pavel | Romania | 0 |
| 232486 | 1997-09-29 | 97 | Juan Antonio Marin | Costa Rica | 0 |
| 232487 | 1997-09-29 | 98 | Jens Knippschild | Germany | 0 |
| 232488 | 1997-09-29 | 99 | Tomas Nydahl | Sweden | 0 |


In [71]:
df.to_csv('all-time-atp-top-100.csv', index=False)