FIDE Chess Machine Learning
---

Frame the Problem and Look at the Big Picture
---

**1. Define the objective in business terms.**

The objective is to discover interesting trends in the world of competitive chess. We focus on FIDE rather than national chess organizations because FIDE has the toughest competition and is where the next world champion is decided. We want to predict who the next world chess champion will be, and whether the record for the highest ELO will be broken. We also want to identify which players are trending toward setting a new record and becoming world champion, and which countries will produce the next best players.

**2. How will your solution be used?**

Our solution can be used in a variety of ways, from determining specific statistics and trends about current players to studying how players improve at chess over time depending on age, gender, and country of origin. Our dataset can also be used by chess federations worldwide to determine how to grow and develop better players.

**3. What are the current solutions/workarounds (if any)?**

There are no existing solutions that we know of. There are, however, machine learning solutions that play chess. These are highly effective, and computers are extremely strong at chess. Such engines could possibly help us if we applied them to analyzing the games of different players to find trends.

**4. How should you frame this problem (supervised/unsupervised, online/offline, ...)?**

This is a supervised regression problem because we are using historical data to predict a continuous value (a player's future ELO). The system should be offline, as the data is only updated about once a year. It could also be framed as a classification problem, because we are trying to figure out whether someone will become world champion. The world champion will most likely just be the person with the highest rating, but this does not always need to be true, especially in the future. This part could become a classification task if we create a second model that takes the output of the first model (the one that predicts ELO) as its input.
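
The two-stage idea can be sketched on toy data. This is only an illustration under stated assumptions: the player names and ratings are made up, a per-player linear trend stands in for whatever regression model is eventually chosen, and the second stage is reduced to the simple rule from the text (the highest predicted rating is the most likely champion) rather than a trained classifier.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical yearly ratings for two made-up players.
years = np.array([[2019], [2020], [2021]])
ratings_by_player = {
    "A": np.array([2800, 2810, 2820]),
    "B": np.array([2700, 2760, 2820]),
}

# Stage 1: extrapolate each player's rating one year ahead
# with a linear trend (an assumption, not the final model).
next_year = np.array([[2022]])
predicted = {}
for player, ratings in ratings_by_player.items():
    model = LinearRegression().fit(years, ratings)
    predicted[player] = float(model.predict(next_year)[0])

# Stage 2 (placeholder rule): the highest predicted rating
# is treated as the most likely champion.
likely_champion = max(predicted, key=predicted.get)
print(predicted, likely_champion)
```

A real second stage would replace the `max` rule with a classifier that takes the predicted ratings, and possibly other features, as input.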

**5. How should performance be measured? Is the performance measure aligned with the business objective?**

Performance on the predicted ELO can be measured by the mean squared error of the predictions. Because this is a forecasting problem, we would evaluate on a held-out time window of our data instead of a randomly sampled test set. For the classification part, performance can be measured by the accuracy of the predictions. These measures are aligned with our business objective, since we want to accurately predict future ELO and future world chess champions.
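
A minimal sketch of measuring mean squared error on a held-out time window follows. The column names (`player_id`, `year`, `rating`) and the values are assumptions for illustration, and the "model" is just a naive baseline that carries each player's last known rating forward.

```python
import pandas as pd
from sklearn.metrics import mean_squared_error

# Toy yearly ratings; column names and values are hypothetical.
ratings = pd.DataFrame({
    "player_id": [1, 1, 1, 2, 2, 2],
    "year": [2019, 2020, 2021, 2019, 2020, 2021],
    "rating": [2700, 2720, 2730, 2500, 2540, 2560],
})

# Everything after the cutoff year is the evaluation window.
cutoff_year = 2020
train = ratings[ratings["year"] <= cutoff_year]
window = ratings[ratings["year"] > cutoff_year]

# Naive baseline: predict each player's last known training rating.
last_known = train.sort_values("year").groupby("player_id")["rating"].last()
predicted = window["player_id"].map(last_known)

mse = mean_squared_error(window["rating"], predicted)
rmse = mse ** 0.5  # same units as the rating itself
print(mse, rmse)
```

Any real model should beat this carry-forward baseline on the same window to be considered useful.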

**6. What would be the minimum performance needed to reach the business objective?**

As a minimum, we want our solution to place players within 10 to 20 ranks of where they will actually be placed in the coming years. Being able to identify which countries have the most prolific chess players, and which countries are achieving the most growth, would also be a good starting point for minimum performance.
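
One way to turn the rank criterion into a number is the share of players whose predicted rank falls within a tolerance of their actual rank. The sketch below uses made-up players and ratings, and a tolerance of 1 only because the toy list has four players; on the full list the tolerance would be 10 to 20.

```python
import pandas as pd

# Hypothetical actual and predicted ratings for one future year.
results = pd.DataFrame({
    "player": ["A", "B", "C", "D"],
    "actual_rating": [2800, 2780, 2760, 2740],
    "predicted_rating": [2790, 2795, 2700, 2792],
})

# Rank 1 is the highest rating.
results["actual_rank"] = results["actual_rating"].rank(ascending=False, method="min")
results["predicted_rank"] = results["predicted_rating"].rank(ascending=False, method="min")

# Toy tolerance; the text suggests 10 to 20 ranks on the real list.
rank_tolerance = 1
rank_error = (results["actual_rank"] - results["predicted_rank"]).abs()
share_within_tolerance = (rank_error <= rank_tolerance).mean()
print(share_within_tolerance)
```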

**7. What are comparable problems? Can you reuse experience or tools?**

The first comparable problem that came to my mind was the housing problem, where we tried to discover interesting patterns in income and house prices in a specified area. The analogy is that we are taking individuals (houses) from specific countries (locations) and estimating their skill at chess (price). Weather forecasting is also somewhat similar, since it also predicts future values from historical data.

**8. Is human expertise available?**

Chess is a very popular hobby and sport in which millions of people take part. Many skilled developers have built projects similar to ours, which may be of use to us. Human expertise is also available in the form of chess grandmasters and commentators, who understand the players as people and understand their playing styles. They can predict rankings and world champions based on these criteria.

**9. How would you solve the problem manually?**

The problem could be solved manually by first finding out whether someone has peaked. If someone has peaked, they can be ruled out as a candidate for increasing their ELO or becoming world chess champion. Most younger players have not yet peaked, which makes them the most interesting and promising group for future world champions. We could find the fastest-growing younger players and simply choose one of them as a candidate for future world championships.
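
The manual procedure can be sketched with pandas. Everything here is an assumption for illustration: the column names, the toy values, the crude definition of "peaked" (latest rating below the player's personal best), and the age cutoff of 25.

```python
import pandas as pd

# Hypothetical rating history for three made-up players.
history = pd.DataFrame({
    "player": ["A"] * 3 + ["B"] * 3 + ["C"] * 3,
    "birth_year": [1985] * 3 + [2003] * 3 + [2004] * 3,
    "year": [2019, 2020, 2021] * 3,
    "rating": [2750, 2740, 2730, 2500, 2580, 2650, 2450, 2480, 2500],
})

# Per-player summary, computed on rows sorted by year.
ordered = history.sort_values(["player", "year"])
summary = ordered.groupby("player").agg(
    birth_year=("birth_year", "first"),
    first_year=("year", "first"),
    last_year=("year", "last"),
    first_rating=("rating", "first"),
    latest_rating=("rating", "last"),
    peak_rating=("rating", "max"),
)
summary["growth_per_year"] = (
    (summary["latest_rating"] - summary["first_rating"])
    / (summary["last_year"] - summary["first_year"])
)

# Crude heuristic: a player has "peaked" if the latest rating is below the best.
summary["has_peaked"] = summary["latest_rating"] < summary["peak_rating"]

current_year = 2021
max_age = 25  # hypothetical cutoff for a "younger" player
summary["age"] = current_year - summary["birth_year"]

# Young players who have not peaked, fastest growth first.
candidates = summary[(~summary["has_peaked"]) & (summary["age"] <= max_age)]
candidates = candidates.sort_values("growth_per_year", ascending=False)
top_candidate = candidates.index[0]
print(candidates[["age", "latest_rating", "growth_per_year"]])
```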

**10. List the assumptions you (or others) have made so far. Verify assumptions if possible.**

So far, we have assumed that all of the data provided is accurate and up to date. This includes each player's age, gender, ELO rating, country of origin, and title.
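
Some of these assumptions can be checked mechanically. The sketch below runs a few sanity checks on a toy frame; the column names, the plausible birth-year range, and the gender codes are assumptions and would need to be adapted to the real files.

```python
import pandas as pd

# Toy player table with deliberately bad rows; column names are hypothetical.
players = pd.DataFrame({
    "player_id": [1, 2, 3, 3],
    "birth_year": [1990, 2005, 1850, 1850],
    "gender": ["M", "F", "M", "M"],
    "rating": [2700, 2450, None, None],
})

# Each check counts the rows that violate one assumption.
checks = {
    "duplicate_ids": int(players["player_id"].duplicated().sum()),
    "missing_ratings": int(players["rating"].isna().sum()),
    "implausible_birth_years": int((~players["birth_year"].between(1900, 2021)).sum()),
    "unexpected_gender_codes": int((~players["gender"].isin(["M", "F"])).sum()),
}
print(checks)
```

A count of zero for every check would support the assumption that the data is clean; non-zero counts point to rows worth inspecting by hand.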


In [1]:
import ast

import numpy as np
import scipy.sparse
import scipy as sp
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler, RobustScaler, FunctionTransformer
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import MultiLabelBinarizer

Get the Data
---
**1. List the data you need and how much you need**

For our problem, we will need data on the demographics, ratings, and names of all players listed and rated under FIDE. It would also be beneficial to have this data for multiple years.

**2. Find and document where you can get that data** 

We will most likely use data compiled on Kaggle. The link is posted below.

https://www.kaggle.com/datasets/rohanrao/chess-fide-ratings?select=players.csv

**3. Get access authorizations** 

Kaggle datasets are publicly available.

**4. Create a workspace (with enough storage space)** 

VSCode Jupyter Notebooks

**5. Get the data** 

Done

**6. Convert the data to a format you can easily manipulate (without changing the data itself)** 

The data is already in CSV format, which is easy to work with as-is.

**7. Ensure sensitive information is deleted or protected (e.g. anonymized)** 

Done

**8. Check the size and type of data (time series, geographical, ...)** 

Around 200.5 MB.
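
The size and column types can be checked directly from Python. The sketch below writes a tiny stand-in CSV to a temporary directory so it runs without the real files; the file name and column names are assumptions, and the same calls would be pointed at the downloaded CSVs in practice.

```python
import os
import tempfile

import pandas as pd

# Write a tiny stand-in CSV so the sketch is self-contained.
tmp_dir = tempfile.mkdtemp()
csv_path = os.path.join(tmp_dir, "ratings_sample.csv")
pd.DataFrame({
    "player_id": [1, 2],
    "year": [2021, 2021],
    "rating": [2700.0, 2450.0],
}).to_csv(csv_path, index=False)

# Size on disk, in megabytes.
size_mb = os.path.getsize(csv_path) / 1e6

# Column types and in-memory footprint after loading.
sample = pd.read_csv(csv_path)
print(f"{size_mb:.6f} MB on disk")
print(sample.dtypes)
print(f"{sample.memory_usage(deep=True).sum()} bytes in memory")
```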

**9. Sample a test set, put it aside, and never look at it (no data snooping!)** 

In [None]:
def load_players_data():
    """
    Loads the CSV file which contains our data for players.
    """
    return pd.read_csv('players.csv')

# These functions load the ratings data for each individual year.
def load_ratings_15():
    return pd.read_csv('ratings_2015.csv')

def load_ratings_16():
    return pd.read_csv('ratings_2016.csv')

def load_ratings_17():
    return pd.read_csv('ratings_2017.csv')

def load_ratings_18():
    return pd.read_csv('ratings_2018.csv')

def load_ratings_19():
    return pd.read_csv('ratings_2019.csv')

def load_ratings_20():
    return pd.read_csv('ratings_2020.csv')

def load_ratings_21():
    return pd.read_csv('ratings_2021.csv')
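
The loaders above return one frame per year but do not yet set a test set aside. Since this is a forecasting problem, one option is to hold out the most recent year rather than sample rows at random. The sketch below uses small stand-in frames in place of the loader output; the column names and the choice to hold out exactly one year are assumptions.

```python
import pandas as pd

# Stand-ins for the frames returned by the yearly loaders, keyed by year.
yearly_frames = {
    2019: pd.DataFrame({"player_id": [1, 2], "rating": [2700, 2500]}),
    2020: pd.DataFrame({"player_id": [1, 2], "rating": [2720, 2540]}),
    2021: pd.DataFrame({"player_id": [1, 2], "rating": [2730, 2560]}),
}

# Tag each frame with its year and stack them into one table.
all_ratings = pd.concat(
    [frame.assign(year=year) for year, frame in yearly_frames.items()],
    ignore_index=True,
)

# Hold out the most recent year as the test set and do not look at it.
test_year = all_ratings["year"].max()
train_set = all_ratings[all_ratings["year"] < test_year]
test_set = all_ratings[all_ratings["year"] == test_year]
print(len(train_set), len(test_set))
```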

Explore the Data 
---
**1. Copy the data for exploration, downsampling to a manageable size if necessary.**

**2. Study each attribute and its characteristics: Name; Type (categorical, numerical, bounded, text, structured, ...); % of missing values; Noisiness and type of noise (stochastic, outliers, rounding errors, ...); Usefulness for the task; Type of distribution (Gaussian, uniform, logarithmic, ...)** 

**3. For supervised learning tasks, identify the target attribute(s)**

**4. Visualize the data** 

**5. Study the correlations between attributes** 

**6. Study how you would solve the problem manually** 

**7. Identify the promising transformations you may want to apply** 

**8. Identify extra data that would be useful (go back to “Get the Data”)** 

**9. Document what you have learned** 