# Elite Chess Grandmaster Web Scraping Project 

## Motivation

Interested in how rating progression differs among the top grandmasters, I found several online databases that keep extensive rating records for both top players and amateurs. Websites like [fide.com](https://www.fide.com/), [chesstempo.com](https://chesstempo.com/), [2700chess.com](https://2700chess.com/), and [chessgraphs.com](https://www.chessgraphs.com/) provide great resources for comparing a few players at a time using graphs and data tables. However, most of these resources only let you look at player attributes and rating history separately.

To get a more complete picture of what rating progression looks like for elite chess players, I decided to build a dataset that combines elite players' attributes with their rating histories, using web scraping tools like BeautifulSoup and Selenium. This is the first of several chess web scraping projects: here I scrape FIDE rating history data from the official world chess federation website [fide.com](https://www.fide.com/) and merge it with the attribute data from the June rating list (also obtained from fide.com).



## Introduction 

To measure playing strength and performance in chess competitions, federations give chess players Elo ratings. Ratings start at 0 and have no fixed upper bound. As a player competes in tournaments, their rating fluctuates depending on whether they win, draw, or lose games. National federations usually maintain their own Elo ratings, but there is also an overarching world chess federation (FIDE), and it is the only federation that awards the title of grandmaster. To keep it simple, grandmasters are players who at some point in their lives reached a rating of at least 2500 and earned three norms (superior tournament performances). For more information about ratings and chess titles, please visit:

- https://en.wikipedia.org/wiki/Elo_rating_system 
- https://en.wikipedia.org/wiki/Chess_title 

For this project, we are only concerned with the top 100 or so grandmasters on the June rating list. Their ratings range from 2648 to 2847.
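
To make the idea of a rating that "fluctuates" with results concrete, here is a minimal sketch of the standard Elo update. The function names are illustrative, and the K-factor of 10 is an assumption: FIDE applies different K values depending on a player's rating and experience.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of player A against player B under the Elo model."""
    return 1 / (1 + 10 ** ((rating_b - rating_a) / 400))


def updated_rating(rating: float, expected: float, actual: float, k: float = 10) -> float:
    """New rating after one game; actual is 1 for a win, 0.5 for a draw, 0 for a loss."""
    return rating + k * (actual - expected)


# Equal ratings give an expected score of 0.5, so a win gains k / 2 points.
e = expected_score(2700, 2700)
print(updated_rating(2700, e, 1.0))  # 2705.0
```

A draw between equally rated players leaves both ratings unchanged, while beating a much stronger opponent yields close to the full K points.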





## Data Acquisition 

In this step, we will be loading the attribute data from the June chess player list and then web scraping the chess players' rating histories from the FIDE website.



In [11]:
# Import Necessary Library
import pandas as pd


In [12]:
# Read in fide player characteristics (sex, federation etc)
file = 'players_list_foa_june.txt'
chess = pd.read_fwf(file)


In [13]:
#Subset for top grandmasters 
player_attributes_table = chess[chess['SRtng']  >= 2600]
len(player_attributes_table)

265

In [14]:
# a cleaner  table to work with 
player_attributes_table_clean = player_attributes_table[['ID Number', 'Name', 'Fed', 'Sex', 'B-day']].reset_index()
display(player_attributes_table_clean)

Unnamed: 0,index,ID Number,Name,Fed,Sex,B-day
0,1923,13402960,"Abasov, Nijat",AZE,M,1995
1,6193,14204118,"Abdusattorov, Nodirbek",UZB,M,2004
2,12106,400041,"Adams, Michael",ENG,M,1971
3,12857,5018471,"Adhiban, B.",IND,M,1992
4,14191,10601619,"Adly, Ahmed",EGY,M,1987
...,...,...,...,...,...,...
260,994561,8600694,"Zhang, Zhong",CHN,M,1978
261,994700,8602522,"Zhao, Jun",CHN,M,1986
262,995339,14116804,"Zherebukh, Yaroslav",USA,M,1993
263,995897,8603537,"Zhou, Jianchao",CHN,M,1988



Our goal in the next block is to use Selenium, along with other web scraping libraries, to extract the rating history URL for each chess player in the [top 100 FIDE player table](https://ratings.fide.com/). Each row of that table represents a chess player; clicking a player's name leads to their profile page, which keeps a record of their rating history.


In [15]:
# Load web scraping libraries
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup
import requests
import lxml.html as lh
from requests_html import HTMLSession
from pprint import pprint

# Start the Selenium-controlled Chrome browser (Selenium 4 syntax; adjust the driver path for your machine)
driver = webdriver.Chrome(service=Service(r"C:\Users\laryl\Downloads\chromedriver_win32\chromedriver.exe"))
website_url = 'https://ratings.fide.com'
driver.get(website_url)

# Find each player's link to the profile page that contains their rating history
elements = driver.find_elements(By.XPATH, "//td/a")

# Store each player's URL in a list
all_player_urls = []
for element in elements:
    all_player_urls.append(element.get_attribute("href"))

# Print the full list once, after the loop has finished
pprint(all_player_urls)


['https://ratings.fide.com/profile/1503014',
 'https://ratings.fide.com/profile/2020009',
 'https://ratings.fide.com/profile/8603677',
 'https://ratings.fide.com/profile/4168119',
 'https://ratings.fide.com/profile/13300474',
 'https://ratings.fide.com/profile/24116068',
 ...
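
The XPath `//td/a` selects every link that sits directly inside a table cell. As a browser-free illustration of the same idea, the sketch below collects those `href` values from a static HTML string with the standard library's `html.parser`. The HTML snippet, the player labels, and the class name `CellLinkCollector` are made up for the example, and this approach is not a substitute for Selenium on pages that build their tables with JavaScript.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not go on the stack.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class CellLinkCollector(HTMLParser):
    """Collect href values of <a> tags whose direct parent is a <td>."""

    def __init__(self):
        super().__init__()
        self.open_tags = []  # stack of currently open tags
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a" and self.open_tags and self.open_tags[-1] == "td":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed tags.
        if tag in self.open_tags:
            while self.open_tags.pop() != tag:
                pass


# Hypothetical page fragment; only links inside <td> cells should be kept.
sample_html = """
<table>
  <tr><td><a href="https://ratings.fide.com/profile/1503014">Player A</a></td></tr>
  <tr><td><a href="https://ratings.fide.com/profile/2020009">Player B</a></td></tr>
</table>
<p><a href="https://www.fide.com/">not inside a table cell</a></p>
"""

collector = CellLinkCollector()
collector.feed(sample_html)
print(collector.links)
```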

From this, we can see that the only difference between the players' URLs is the ID number at the end. To make sure that we navigate to the rating history pages specifically, we do one more preparatory step in the next block: appending `/chart` to each URL.


In [16]:
# Prepare URLs for rating history extraction
all_player_urls_prep = [url + '/chart' for url in all_player_urls]
len(all_player_urls_prep)


100

In [18]:
# Strip a chart URL down to the player's FIDE ID, which later serves as the key for merging rating histories with attributes
def trim_url(x):
    y = int(x.replace("https://ratings.fide.com/profile/", "").replace('/chart', ""))
    return y
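
The trimming step depends on the URL matching an exact prefix and suffix. As a slightly more defensive sketch, a regular expression can pull out the ID and fail loudly when a URL does not look like a profile link. The helper name `extract_fide_id` is hypothetical and not part of the original notebook.

```python
import re

# Matches the numeric ID that follows "/profile/" in a FIDE profile URL.
PROFILE_ID = re.compile(r"/profile/(\d+)")


def extract_fide_id(url: str) -> int:
    """Return the numeric FIDE ID embedded in a profile or chart URL."""
    match = PROFILE_ID.search(url)
    if match is None:
        raise ValueError(f"no FIDE profile ID found in {url!r}")
    return int(match.group(1))


print(extract_fide_id("https://ratings.fide.com/profile/1503014/chart"))  # 1503014
```

Because the pattern only looks for `/profile/<digits>`, the same helper works on both the plain profile URL and the `/chart` variant.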



In [19]:
# Extract each rating history as a table and store it in a dictionary keyed by the player's FIDE ID
player_statistics = {}
for player in all_player_urls_prep:
    # read_html returns a list of matching tables; the rating history is the first one
    player_table = pd.read_html(player, attrs={'class': 'profile-table profile-table_chart-table'})
    player_id = trim_url(player)  # avoid shadowing the built-in id()
    player_statistics[player_id] = player_table[0]
pprint(player_statistics)

{309095:        Period  RTNG  GMS  RAPID RTNG  RAPID GMS  BLITZ RTNG  BLITZ GMS
0    2021-Jun  2697    0      2717.0        0.0      2758.0        0.0
1    2021-May  2697    0      2717.0        0.0      2758.0        0.0
2    2021-Apr  2697    0      2717.0        0.0      2758.0        0.0
3    2021-Mar  2697    0      2717.0        0.0      2758.0        0.0
4    2021-Feb  2697    0      2717.0        0.0      2758.0        0.0
..        ...   ...  ...         ...        ...         ...        ...
158  2001-Apr  2479   11         NaN        NaN         NaN        NaN
159  2001-Jan  2466   31         NaN        NaN         NaN        NaN
160  2000-Oct  2433    0         NaN        NaN         NaN        NaN
161  2000-Jul  2433   42         NaN        NaN         NaN        NaN
162  2000-Jan  2443   20         NaN        NaN         NaN        NaN

[163 rows x 7 columns],
 400041:        Period  RTNG  GMS  RAPID RTNG  RAPID GMS  BLITZ RTNG  BLITZ GMS
0    2021-Jun  2716    0      2684
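
The loop above requests all pages back to back and stops at the first page that fails to parse. A possible refinement, sketched here under assumptions, pauses between requests and records failures instead of aborting. The names `scrape_all`, `fetch_table`, `to_id`, and `pause_seconds` are hypothetical, and `fake_fetch` is a stand-in so the example runs without network access.

```python
import time


def scrape_all(urls, fetch_table, to_id, pause_seconds=1.0):
    """Fetch one table per URL, pausing between requests and recording failures."""
    histories = {}
    failed = []
    for url in urls:
        try:
            table = fetch_table(url)
            histories[to_id(url)] = table
        except Exception as error:  # keep going if one page fails
            failed.append((url, repr(error)))
        time.sleep(pause_seconds)  # be polite to the server
    return histories, failed


def fake_fetch(url):
    """Stand-in for a real page fetch; fails for URLs containing 'missing'."""
    if "missing" in url:
        raise ValueError("no tables found")
    return f"table for {url}"


def id_from_url(url):
    """Take the path segment just before '/chart' as the player ID."""
    return int(url.split("/")[-2])


urls = [
    "https://ratings.fide.com/profile/1503014/chart",
    "https://ratings.fide.com/profile/missing/chart",
]
histories, failed = scrape_all(urls, fake_fetch, id_from_url, pause_seconds=0)
print(sorted(histories), len(failed))
```

In the notebook, `fetch_table` could be `lambda url: pd.read_html(url, attrs={'class': 'profile-table profile-table_chart-table'})[0]` and `to_id` could be `trim_url`.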

In [20]:
def attach_id():
    # Add an "ID Number" column to each player's history table, filled with that table's dictionary key
    for tag in player_statistics:
        player_statistics[tag]["ID Number"] = tag
    return player_statistics

player_statistics_with_id = attach_id()


In [21]:
# Convert the dictionary to a list of tables and stack every player's table into one dataframe
list_of_player_dfs = list(player_statistics_with_id.values())
all_player_statistics_table = pd.concat(list_of_player_dfs)
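
As an alternative sketch, `pd.concat` can take the dictionary directly and turn its keys into the `ID Number` column, which combines the attach-and-stack steps. The two small tables below reuse a few values from the scraped output above purely for illustration; the level name `row` is an arbitrary choice.

```python
import pandas as pd

# Tiny stand-in for the scraped dictionary: FIDE ID -> rating history table.
player_statistics = {
    309095: pd.DataFrame({"Period": ["2021-Jun", "2021-May"], "RTNG": [2697, 2697]}),
    400041: pd.DataFrame({"Period": ["2021-Jun"], "RTNG": [2716]}),
}

combined = (
    pd.concat(player_statistics, names=["ID Number", "row"])  # keys become an index level
    .reset_index(level="ID Number")                           # move the keys into a column
    .reset_index(drop=True)                                   # discard the per-table row numbers
)
print(combined)
```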



In [23]:
#Merge player attributes with rating histories using id columns as key
top_grandmasters_table = player_attributes_table_clean.merge(all_player_statistics_table, on= "ID Number", how= "outer")

## Data Preparation 

Now that all the data has been acquired, it’s time to filter it. Because the merge is an outer join, players in the attributes table whose histories were not scraped still appear in the merged table, but only once and with empty rating columns. Only 104 of the 265 players in the attributes table had their rating histories scraped.


In [24]:
#Filter table by removing all chess players that were recorded only once
top_grandmasters_table_filtered = top_grandmasters_table[top_grandmasters_table["ID Number"].map(top_grandmasters_table["ID Number"].value_counts()) > 1]
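
To see why counting rows per ID works as a filter, the sketch below runs the same outer merge on two made-up players (the IDs, names, and ratings are hypothetical). The `indicator=True` option is an addition not used in the notebook; it labels each row by which table it came from.

```python
import pandas as pd

# Hypothetical data: Player A has a scraped history, Player B does not.
attributes = pd.DataFrame({"ID Number": [1, 2], "Name": ["Player A", "Player B"]})
histories = pd.DataFrame(
    {"ID Number": [1, 1], "Period": ["2021-Jun", "2021-May"], "RTNG": [2700, 2695]}
)

merged = attributes.merge(histories, on="ID Number", how="outer", indicator=True)

# Same filter as in the notebook: keep IDs that appear more than once.
counts = merged["ID Number"].map(merged["ID Number"].value_counts())
filtered = merged[counts > 1]

print(merged[["ID Number", "Name", "_merge"]])
print(filtered["Name"].unique())
```

One caveat of the count-based filter is that a player with exactly one history row would also be dropped; filtering on `merged["_merge"] == "both"` would avoid that.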

In [25]:
# Save the data table for future use (to_csv writes the file and returns None, so nothing is assigned)
top_grandmasters_table_filtered.to_csv("top_grandmasters.csv")


## Sources
As mentioned above, the URLs for the top grandmasters were taken from the [top 100 FIDE player table](https://ratings.fide.com/).

For the June rating list, see:
https://ratings.fide.com/download_lists.phtml
