Web scraping the ATP singles rankings. We'll use Request and urlopen from the urllib.request module to open and read the website, and BeautifulSoup to parse the HTML. The standard library's html.parser module could be used to parse the HTML instead (the parse module in the urllib package only handles URLs), but I'm more familiar with BeautifulSoup, so I'll use that.

In [37]:
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup
import numpy as np
import pandas as pd  # needed later for pd.DataFrame
import csv

Website for the ATP rankings is here: https://www.atptour.com/en/rankings/singles

In [2]:
urlpage='https://www.atptour.com/en/rankings/singles'
request = Request(urlpage,headers={'User-Agent':'Mozilla/5.0'})
webpage = urlopen(request).read()
soup = BeautifulSoup(webpage, 'html.parser')

We need to send a browser-like User-Agent header ('Mozilla/5.0') with the request, as urllib's default User-Agent is rejected with a 403 error. See below:

In [3]:
webpage = urlopen(urlpage).read()
soup = BeautifulSoup(webpage, 'html.parser')

HTTPError: HTTP Error 403: Forbidden
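
As a minimal sketch of the same idea, the header can be wrapped in a small helper and the 403 caught rather than left to crash the cell. The names build_request and fetch_html are my own (hypothetical), not part of urllib, and the example only builds the request, so it runs without a network connection.

```python
from urllib.request import Request, urlopen
from urllib.error import HTTPError

def build_request(url, user_agent='Mozilla/5.0'):
    """Return a Request carrying a browser-like User-Agent header."""
    return Request(url, headers={'User-Agent': user_agent})

def fetch_html(url):
    """Download a page, returning None instead of raising on an HTTP error."""
    try:
        return urlopen(build_request(url)).read()
    except HTTPError as err:
        print(f'Request failed with status {err.code}')
        return None

request = build_request('https://www.atptour.com/en/rankings/singles')
# urllib normalises header names, so the key is stored as 'User-agent'.
print(request.get_header('User-agent'))
```

Calling fetch_html(urlpage) would then perform the actual download.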

In [4]:
print(soup)


<!DOCTYPE html>

<!-- START : /modules/global/head -->
<!--[if lt IE 7]>
    <html class="no-js lt-ie10 lt-ie9 lt-ie8 lt-ie7 ">
<![endif]-->
<!--[if IE 7]>
    <html class="no-js lt-ie10 lt-ie9 lt-ie8 ">
<![endif]-->
<!--[if IE 8]>
    <html class="no-js lt-ie10 lt-ie9 ">
<![endif]-->
<!--[if IE 9]>
    <html class="no-js lt-ie10 ">
<![endif]-->
<!--[if gt IE 9]><!-->
<html class="no-js">
<!--<![endif]-->
<head>
<!-- disable auto format for telephone numbers -->
<meta content="telephone=no" name="format-detection"/>
<title>
	Rankings | Singles | ATP Tour | Tennis
</title>
<meta content="initial-scale=1.0, width=768, user-scalable=yes, minimum-scale=1.0, maximum-scale=1.25" name="viewport"/>
<meta charset="utf-8"/>
<meta content="IE=edge,chrome=1" http-equiv="X-UA-Compatible"/>
<meta content="" name="keywords"/>
<meta content="Tennis' official singles rankings of the ATP Tour, featuring Novak Djokovic, Roger Federer, Rafael Nadal, Alexander Zverev and more." name="description

This is the HTML parsed by BeautifulSoup, printed out. It doesn't look pretty, and it would be pretty difficult to find what we want just by looking at the raw HTML. Using inspect element in the browser, we find that the information is stored in a table element with the class 'mega-table'.

In [5]:
table = soup.find('table',attrs={'class':'mega-table'})
print(table)

<table class="mega-table">
<thead>
<tr>
<th class="sorting-cell rank-heading sort-up">
<div class="sorting-inner">
<div class="sorting-label">
                Ranking
            </div>
<a class="sorting-arrow" data-filter-modules="" href="/en/rankings/singles?rankDate=2019-7-22&amp;countryCode=all&amp;rankRange=0-100&amp;sort=rank&amp;sortAscending=False"></a>
</div>
</th>
<th class="sorting-cell rank-heading sort-up">
<div class="sorting-inner">
<div class="sorting-label">
                Move
            </div>
<a class="sorting-arrow" data-filter-modules="" href="/en/rankings/singles?rankDate=2019-7-22&amp;countryCode=all&amp;rankRange=0-100&amp;sort=move&amp;sortAscending=False"></a>
</div>
</th>
<th class="sorting-cell rank-heading sort-up">
<div class="sorting-inner">
<div class="sorting-label">
                Country
            </div>
<a class="sorting-arrow" data-filter-modules="" href="/en/rankings/singles?rankDate=2019-7-22&amp;countryCode=all&amp;rankRange=0-100&amp

So we narrow down the HTML to the table we're interested in. At first glance it looks pretty horrendous, but if you read through it carefully, you'll actually start seeing some names like Novak Djokovic, Rafael Nadal and Roger Federer.

We'll subset the soup again by looking for parts that contain the 'td' tags which actually contain the information we want.

In [6]:
results = table.find_all('td')
print(results)

[<td class="rank-cell">
					1
				</td>, <td class="move-cell">
<div class="move-none"></div>
<div class="move-text">
</div>
</td>, <td class="country-cell">
<div class="country-inner">
<div class="country-item">
<img alt="SRB" onerror="this.remove()" src="/en/~/media/images/flags/srb.svg"/>
</div>
</div>
</td>, <td class="player-cell">
<a data-ga-label="Novak Djokovic" href="/en/players/novak-djokovic/d643/overview">Novak Djokovic</a> </td>, <td class="age-cell">
32				</td>, <td class="points-cell">
<a data-ga-label="rankings-breakdown" href="/en/players/novak-djokovic/d643/rankings-breakdown?team=singles">12,415</a> </td>, <td class="tourn-cell">
<a data-ga-label="16" href="/en/players/novak-djokovic/d643/player-activity?matchType=singles">16</a> </td>, <td class="pts-cell">
						0
					</td>, <td class="next-cell">
						0
					</td>, <td class="rank-cell">
					2
				</td>, <td class="move-cell">
<div class="move-none"></div>
<div class="move-text">
</div>
</td>, <td cl
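
To see the same steps on something small enough to read, here is a sketch using a made-up two-row table. The HTML, the player names and the toy_ variable names are all assumptions for illustration; only the find() and find_all() calls mirror the cells above.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the rankings table: two rows, two cells each.
html = """
<table class="mega-table">
<tr><td class="rank-cell">1</td><td class="player-cell"><a>Player A</a></td></tr>
<tr><td class="rank-cell">2</td><td class="player-cell"><a>Player B</a></td></tr>
</table>
"""

toy_soup = BeautifulSoup(html, 'html.parser')
toy_table = toy_soup.find('table', attrs={'class': 'mega-table'})

# find_all returns every td in document order, ignoring which row it came from.
toy_cells = toy_table.find_all('td')
cell_text = [cell.getText().strip() for cell in toy_cells]
print(cell_text)
```

The two rows come back as one flat list of four cells, which is the shape the real results list has.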

The HTML code contained in the results variable is in a list-like form, meaning we can loop through each element and grab the text contained within. It's not so straightforward in this case, as each element of the list corresponds to a cell of the table rather than a row. So what we'll end up with is an unzipped version of the table stacked in one dimension. For example, the first element results[0] gives us the ranking of the current world number one (which should be 1!)

In [7]:
print(results[0])

<td class="rank-cell">
					1
				</td>


The fourth element gives us the name of the world number one, which at the time of writing is Novak Djokovic.

In [8]:
print(results[3])

<td class="player-cell">
<a data-ga-label="Novak Djokovic" href="/en/players/novak-djokovic/d643/overview">Novak Djokovic</a> </td>


Since there are nine columns in each row, the column pattern in this list repeats every nine entries. For instance, the 10th element of the results list gives us the ranking of the world number two.

In [9]:
print(results[9])

<td class="rank-cell">
					2
				</td>


The thirteenth element of the list gives us his name, since 4 + 9 = 13.

In [10]:
print(results[12])

<td class="player-cell">
<a data-ga-label="Rafael Nadal" href="/en/players/rafael-nadal/n409/overview">Rafael Nadal</a> </td>


And so on and so forth.
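
The pattern behind these lookups is just index arithmetic, sketched below. The helper names flat_index and row_and_col are hypothetical, and both use zero-based positions, so the "thirteenth" element is index 12.

```python
N_COLS = 9  # cells per table row

def flat_index(row, col, n_cols=N_COLS):
    """Position in the flat list of the cell at (row, col), both zero-based."""
    return row * n_cols + col

def row_and_col(index, n_cols=N_COLS):
    """Inverse lookup: which (row, col) does a flat index belong to?"""
    return divmod(index, n_cols)

print(flat_index(0, 3))   # name cell of the first player
print(flat_index(1, 0))   # ranking cell of the second player
print(flat_index(1, 3))   # name cell of the second player
print(row_and_col(12))
```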

To make analysis easier, we'll need to convert this list into a list of lists, so that each row of the new list corresponds to a row of the table. The original table has one hundred rows (for the top one hundred players) and nine columns, one for each variable, which implies that when unzipped into the results list it should have nine hundred elements. Let's check just to be sure.

In [11]:
print(len(results))

900


Good. Converting it into a list of lists is a little bit tricky, and there's no one best way to do it. I've chosen my approach because it made the most sense to me.

There are nine entries in each row of the table meaning every ninth row of the unzipped list will correspond to a different player. Index 0-8 will contain information about the world number one: Novak Djokovic, 9-17 will have information about Rafael Nadal and 18-26 will store the data on Roger Federer. 

The idea is to have an outer loop which cycles through each player by looping through the index using a step size of 9: {0,9,18,..,882,891}. Inside the outer loop there will be an inner loop going from 0 to 8 (step size = 1) which is added onto the outer loop index. This will allow us to grab the data relevant to a particular player and put it in a temporary list for that player. This temporary list will be added onto the list of lists and then refreshed for the next player.

Here's an example run through:

Starting with outer loop i = 0, we create a temporary list to store the results for Novak Djokovic.

For j = {0,1,2,..,7,8} we extract the (i+j)th element from the unzipped results list and put it in a list for Novak Djokovic.

    i=0, j=0 => i+j = 0: Ranking of the world number one
    i=0, j=1 => i+j = 1: Movement on the leaderboard
    
    ...
    
    i=0, j=8 => i+j = 8: Next best

This list for Novak Djokovic will then be added to the empty list of lists which we will initialise at the start.

For i = 9, we create a temporary list for Rafael Nadal, and for j={0,1,2,..,7,8} we repeat the same steps

    i=9, j=0 => i+j = 9: Ranking of the world number two
    i=9, j=1 => i+j = 10: Movement on the leaderboard
    
    ...
    
    i=9, j=8 => i+j = 17: Next best

The (i+j)th entries will be extracted and put into their own list and appended to the list of lists already containing a list for Novak Djokovic. We'll keep cycling through the outer loop i = {18,27,...,882,891} until we've got every single player added to the list of lists.

Apologies for the long-windedness, but if you're a beginner you'll hopefully find these sorts of explanations useful.

For reference, the position of each variable within a player's list:

0: Ranking
1: Move?
2: Country Flag
3: Name
4: Age
5: Points
6: Tournaments Played
7: Points Dropping
8: Next Best
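
The walk-through above can be traced on a small stand-in list before running it on the real data. This is only a sketch: the labelled strings stand in for the td cells, and the slicing version at the end is an alternative I'm adding for comparison, not the approach used in the next cell.

```python
N_COLS = 9

# Stand-in for the flat results list: 18 labelled cells, i.e. two players.
flat = [f'player{p}_cell{c}' for p in range(2) for c in range(N_COLS)]

# Nested-loop version, mirroring the walk-through.
rows_loop = []
for i in range(0, len(flat), N_COLS):   # i = 0, 9
    temp = []
    for j in range(N_COLS):             # j = 0..8
        temp.append(flat[i + j])
    rows_loop.append(temp)

# Slicing version: take nine cells at a time.
rows_slice = [flat[i:i + N_COLS] for i in range(0, len(flat), N_COLS)]

print(rows_loop == rows_slice)
print(rows_slice[1][3])
```

Both versions give the same list of lists; the nested loop just makes each (i+j) lookup explicit.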


In [12]:
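results_ll = []
# Step through the flat list nine cells at a time; each step starts a new player.
for i in range(0, len(results), 9):
    result_l_temp = []
    # Collect the nine cells belonging to this player.
    for j in range(0, 9):
        result_l_temp.append(results[i+j])
    results_ll.append(result_l_temp)
print(results_ll[1])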

[<td class="rank-cell">
					2
				</td>, <td class="move-cell">
<div class="move-none"></div>
<div class="move-text">
</div>
</td>, <td class="country-cell">
<div class="country-inner">
<div class="country-item">
<img alt="ESP" onerror="this.remove()" src="/en/~/media/images/flags/esp.svg"/>
</div>
</div>
</td>, <td class="player-cell">
<a data-ga-label="Rafael Nadal" href="/en/players/rafael-nadal/n409/overview">Rafael Nadal</a> </td>, <td class="age-cell">
33				</td>, <td class="points-cell">
<a data-ga-label="rankings-breakdown" href="/en/players/rafael-nadal/n409/rankings-breakdown?team=singles">7,945</a> </td>, <td class="tourn-cell">
<a data-ga-label="16" href="/en/players/rafael-nadal/n409/player-activity?matchType=singles">16</a> </td>, <td class="pts-cell">
						0
					</td>, <td class="next-cell">
						0
					</td>]


If this code works, the 4th element of the second list in this list of lists should give us the HTML code holding the name of the current world number two.

In [13]:
print(results_ll[1][3])

<td class="player-cell">
<a data-ga-label="Rafael Nadal" href="/en/players/rafael-nadal/n409/overview">Rafael Nadal</a> </td>


Great! The next thing to do will be to extract the text embedded within the HTML code for each cell in the table. The getText() method is very useful for this. Using the above example, we want to extract the name 'Rafael Nadal' from the HTML above using the getText() method.

In [14]:
print(results_ll[1][3].getText())


Rafael Nadal 


This doesn't work so well for extracting the leaderboard movement in the case where movement is zero: it simply returns a string with no visible text. It also will not work on the HTML for the player's country, as the table on the website actually displays an image of the country flag.

In [15]:
print(results_ll[1][1].getText()) #Rafael's movement on the leaderboard
print(results_ll[1][2].getText()) #Rafael's country















Dealing with the first issue is relatively easy: if the string turns out to be empty, replace it with a zero.

Now, when trying to find the player's country of origin, we'll have to delve into the HTML a bit further. This time we'll look at Kei Nishikori. 

In [17]:
print(results_ll[6][2])

<td class="country-cell">
<div class="country-inner">
<div class="country-item">
<img alt="JPN" onerror="this.remove()" src="/en/~/media/images/flags/jpn.svg"/>
</div>
</div>
</td>


So while the HTML contains no text to extract using the getText() method, it does have an 'alt' attribute which gives the abbreviated form of the country name. The alt attribute provides text to display in case the image can't load for whatever reason. 

To extract it we first use the find() method to locate the 'img' tag within the 'td' tag.

In [27]:
nishikori_img = results_ll[6][2].find('img')
print(nishikori_img)

<img alt="JPN" onerror="this.remove()" src="/en/~/media/images/flags/jpn.svg"/>


We can then use the get() method to store the text associated with the alt attribute.

In [28]:
nishikori_country_text = results_ll[6][2].find('img').get('alt')
print(nishikori_country_text)

JPN


The get() method is usually used with dictionaries to return a value associated with a particular key. We can use it in this case because BeautifulSoup treats all of the attributes of the img tag as a dictionary. 

In [31]:
results_ll[6][2].find('img').attrs

{'src': '/en/~/media/images/flags/jpn.svg',
 'alt': 'JPN',
 'onerror': 'this.remove()'}

If we wanted to, we could also extract the other parts of the HTML using the get() method as well, but it's not really relevant to what we're looking for. Something for future reference though!
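
To illustrate the dictionary behaviour without needing the parsed page, here is a sketch using a plain Python dict copied from the attrs output above. The variable name img_attrs and the 'title' key are my own; the assumption is that the tag's attributes behave like this ordinary dictionary.

```python
# Plain dictionary with the same keys and values as the attrs output.
img_attrs = {
    'src': '/en/~/media/images/flags/jpn.svg',
    'alt': 'JPN',
    'onerror': 'this.remove()',
}

country = img_attrs.get('alt')
flag_path = img_attrs.get('src')

# get() returns None (or a chosen default) for a missing key instead of raising.
missing = img_attrs.get('title')
fallback = img_attrs.get('title', 'unknown')

print(country, flag_path, missing, fallback)
```

That forgiving behaviour is the main reason to prefer get() over square brackets when an attribute might be absent.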

Putting everything together, we can start to extract the relevant data from the HTML we have parsed. What I'm going to do here is to work through each player in the results list of lists and sequentially pull out the text for each variable and put it into a list of lists.

In [33]:
rows = []
rows.append(['Ranking','Move','Country','Name','Age','Points','Tournaments Played','Points Dropping','Next Best'])

for result in results_ll:
    rank = result[0].getText()
    
    if result[1].getText() == '':
        move = 0
    else:
        move = result[1].getText()
    
    
    country = result[2].find('img').get('alt')
    name = result[3].getText()
    age = result[4].getText()
    points = result[5].getText()
    tournaments = result[6].getText()
    dropping = result[7].getText()
    nextbest = result[8].getText()
    
    row_temp = [rank,move,country,name,age,points,tournaments,dropping,nextbest]
    rows.append(row_temp)
    
print(rows)

[['Ranking', 'Move', 'Country', 'Name', 'Age', 'Points', 'Tournaments Played', 'Points Dropping', 'Next Best'], ['\r\n\t\t\t\t\t1\r\n\t\t\t\t', '\n\n\n\n', 'SRB', '\nNovak Djokovic ', '\r\n32\t\t\t\t', '\n12,415 ', '\n16 ', '\r\n\t\t\t\t\t\t0\r\n\t\t\t\t\t', '\r\n\t\t\t\t\t\t0\r\n\t\t\t\t\t'], ['\r\n\t\t\t\t\t2\r\n\t\t\t\t', '\n\n\n\n', 'ESP', '\nRafael Nadal ', '\r\n33\t\t\t\t', '\n7,945 ', '\n16 ', '\r\n\t\t\t\t\t\t0\r\n\t\t\t\t\t', '\r\n\t\t\t\t\t\t0\r\n\t\t\t\t\t'], ['\r\n\t\t\t\t\t3\r\n\t\t\t\t', '\n\n\n\n', 'SUI', '\nRoger Federer ', '\r\n37\t\t\t\t', '\n7,460 ', '\n16 ', '\r\n\t\t\t\t\t\t0\r\n\t\t\t\t\t', '\r\n\t\t\t\t\t\t0\r\n\t\t\t\t\t'], ['\r\n\t\t\t\t\t4\r\n\t\t\t\t', '\n\n\n\n', 'AUT', '\nDominic Thiem ', '\r\n25\t\t\t\t', '\n4,595 ', '\n23 ', '\r\n\t\t\t\t\t\t0\r\n\t\t\t\t\t', '\r\n\t\t\t\t\t\t0\r\n\t\t\t\t\t'], ['\r\n\t\t\t\t\t5\r\n\t\t\t\t', '\n\n\n\n', 'GER', '\nAlexander Zverev ', '\r\n22\t\t\t\t', '\n4,325 ', '\n24 ', '\r\n\t\t\t\t\t\t0\r\n\t\t\t\t\t', '\r\n\t\t\t\t\t

While the code does seem to be working somewhat, we do have an issue with the actual text we got. We couldn't see it when we printed the result of the getText() method, but there are many instances of '\n', '\r' and '\t'. These are known as control characters and relate to the formatting of the text itself: \n is a line feed (new line), \r is a carriage return (Windows-style line endings are written as \r\n) and \t is a tab.
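
A quick way to make these characters visible is repr(), sketched here on the ranking string for the world number one copied from the output above. The variable names are my own.

```python
# Raw text of the first ranking cell, copied from the printed rows.
raw_rank = '\r\n\t\t\t\t\t1\r\n\t\t\t\t'

print(raw_rank)        # looks like a lone "1" surrounded by blank space
print(repr(raw_rank))  # shows every control character explicitly

# Count each kind of control character in the string.
counts = {ch: raw_rank.count(ch) for ch in ('\r', '\n', '\t')}
print(len(raw_rank), counts)
```

Printing a list, as we did with rows, shows the repr of each string, which is why the control characters only turned up at that point.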

We don't want these characters to appear in our text so we'll need to use the replace() method to deal with them. I've modified the code slightly to allow for this.

In [35]:
rows = []
rows.append(['Ranking','Move','Country','Name','Age','Points','Tournaments Played','Points Dropping','Next Best'])

for result in results_ll:
    rank = result[0].getText()
    move = result[1].getText()
    country = result[2].find('img').get('alt')
    name = result[3].getText()
    age = result[4].getText()
    points = result[5].getText()
    tournaments = result[6].getText()
    dropping = result[7].getText()
    nextbest = result[8].getText()
    
    row_temp = [rank,move,country,name,age,points,tournaments,dropping,nextbest]
    row_temp = [x.replace('\r','').replace('\n','').replace('\t','') for x in row_temp]
    
    if row_temp[1] == '':
        row_temp[1]=0
    
    rows.append(row_temp)

# Use the first row as the column names and the remaining rows as the data.
ATP_df = pd.DataFrame(rows[1:], columns=rows[0])
print(rows)

[['Ranking', 'Move', 'Country', 'Name', 'Age', 'Points', 'Tournaments Played', 'Points Dropping', 'Next Best'], ['1', 0, 'SRB', 'Novak Djokovic ', '32', '12,415 ', '16 ', '0', '0'], ['2', 0, 'ESP', 'Rafael Nadal ', '33', '7,945 ', '16 ', '0', '0'], ['3', 0, 'SUI', 'Roger Federer ', '37', '7,460 ', '16 ', '0', '0'], ['4', 0, 'AUT', 'Dominic Thiem ', '25', '4,595 ', '23 ', '0', '0'], ['5', 0, 'GER', 'Alexander Zverev ', '22', '4,325 ', '24 ', '0', '0'], ['6', 0, 'GRE', 'Stefanos Tsitsipas ', '20', '4,045 ', '28 ', '0', '0'], ['7', 0, 'JPN', 'Kei Nishikori ', '29', '4,040 ', '22 ', '0', '0'], ['8', 0, 'RUS', 'Karen Khachanov ', '23', '2,890 ', '26 ', '0', '0'], ['9', '1', 'RUS', 'Daniil Medvedev ', '23', '2,625 ', '26 ', '0', '0'], ['10', '1', 'ITA', 'Fabio Fognini ', '32', '2,535 ', '24 ', '0', '0'], ['11', 0, 'RSA', 'Kevin Anderson ', '33', '2,500 ', '16 ', '0', '0'], ['12', 0, 'ARG', 'Juan Martin del Potro ', '30', '2,380 ', '15 ', '0', '0'], ['13', 0, 'ESP', 'Roberto Bautista Agut ', 

A few things to note: the if statement for the player's movement on the leaderboard has been moved to the bottom of the code, after the control characters have been removed, because the presence of the \n character broke the comparison ('\n\n\n\n' != '').

I've also used a list comprehension for the removal of the control characters so I don't have to apply it to each of the player's variables.
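
Both points can be checked on a cut-down row, using strings copied from the earlier output. The helper name clean is hypothetical, and the strip() line is an alternative I'm adding for comparison rather than what the cell above does.

```python
raw_move = '\n\n\n\n'
print(raw_move == '')   # the comparison fails while the newlines are present

def clean(text):
    """Remove carriage returns, new lines and tabs, as in the cell above."""
    return text.replace('\r', '').replace('\n', '').replace('\t', '')

# First four cells of the world number one's row.
raw_row = ['\r\n\t\t\t\t\t1\r\n\t\t\t\t', '\n\n\n\n', 'SRB', '\nNovak Djokovic ']

cleaned = [clean(x) for x in raw_row]
if cleaned[1] == '':
    cleaned[1] = 0
print(cleaned)

# strip() also drops leading and trailing spaces, so the name loses its trailing space.
stripped = [x.strip() for x in raw_row]
print(stripped)
```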

You can then export it into a CSV file to use in your own analysis.

In [39]:
# newline='' stops the csv module from writing blank lines between rows on Windows.
with open("ATP Rankings.csv", 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerows(rows)
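
As a sketch of what using the file might look like, the example below writes a three-row sample to a temporary file, reads it back and converts the points to integers. The file name sample_rankings.csv and the variable names are assumptions; the sample values are taken from the output above.

```python
import csv
import os
import tempfile

sample_rows = [
    ['Ranking', 'Name', 'Points'],
    ['1', 'Novak Djokovic ', '12,415 '],
    ['2', 'Rafael Nadal ', '7,945 '],
]

# Write to a temporary directory so the real export is left untouched.
path = os.path.join(tempfile.mkdtemp(), 'sample_rankings.csv')
with open(path, 'w', newline='') as f:
    csv.writer(f).writerows(sample_rows)

# DictReader uses the header row as the keys for each record.
with open(path, newline='') as f:
    records = list(csv.DictReader(f))

# Points still carry a thousands separator and a trailing space.
points = [int(r['Points'].strip().replace(',', '')) for r in records]
names = [r['Name'].strip() for r in records]
print(names, points)
```

The csv module quotes fields containing commas, so values like '12,415 ' survive the round trip intact.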