# Data

The original data set is taken from the
[Deloitte/FIDE Chess Rating Challenge](https://www.kaggle.com/c/ChessRatings2/data) on Kaggle. It contains the outcomes of 3,140,356 professional chess games played between 54,205 unique players over 135 monthly periods, extracted from the database of the World Chess Federation (FIDE).

The data download and processing steps below are optional as raw and processed data sets are already provided in the [data](../data) folder. These instructions are provided to create a fully reproducible framework if users would like to start from scratch.

## Step 1: Data Download (Optional)

### Step 1a: Set up your Kaggle API token

You will need to generate a personal Kaggle API token to fetch the raw project data from the Kaggle API. If you do not want to get the token, feel free to proceed to the data processing step (Step 2) below.

1. Accept the competition rules for the Deloitte/FIDE Chess Rating Challenge
    - Go to the competition page at https://www.kaggle.com/c/ChessRatings2/rules
    - Click on the **I Understand and Accept** button
2. Download an API token
    - Go to your account tab at https://www.kaggle.com/{username}/account
    - Click on the **Create New API Token** button
3. Place the API token (`kaggle.json`) in the `.kaggle` folder of your home directory
    - Mac & Linux: `mkdir -p ~/.kaggle && cp ~/Downloads/kaggle.json ~/.kaggle/kaggle.json`
4. Restrict the API token's permissions so that only your user can read it
    - Mac & Linux: `chmod 600 ~/.kaggle/kaggle.json`
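
Steps 3 and 4 can also be scripted. The following is a minimal sketch using only the Python standard library; the helper name `install_kaggle_token` and its arguments are assumptions made for illustration and are not part of this project.

```python
import os
import shutil
import stat
from pathlib import Path


def install_kaggle_token(source, target_dir=None):
    """Copy kaggle.json into target_dir and restrict it to the owner.

    Hypothetical helper: `source` is wherever your browser saved the token.
    """
    source = Path(source)
    # Default to ~/.kaggle, the location used in the steps above
    target_dir = Path(target_dir) if target_dir else Path.home() / ".kaggle"
    target_dir.mkdir(parents=True, exist_ok=True)

    target = target_dir / "kaggle.json"
    shutil.copyfile(source, target)
    # Equivalent of `chmod 600`: read and write for the owner only
    os.chmod(target, stat.S_IRUSR | stat.S_IWUSR)
    return target
```

For example, `install_kaggle_token(Path.home() / "Downloads" / "kaggle.json")` would place the token in `~/.kaggle`, assuming the browser saved it to the `Downloads` folder.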

### Step 1b: Download data

The following command downloads the raw project data from the Kaggle API and overwrites the existing data sets in the [data/raw](../data/raw) folder.

```python
from data import download_data

download_data(
    path="../data/raw",  # Output path
    force=True,          # Overwrite existing files
    quiet=True           # Run silently
)
```
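
For readers curious about what such a download might involve, here is a hedged sketch built on the official `kaggle` Python package. It is not the project's `download_data` implementation, which may differ. The function name `fetch_raw_data` and the injectable `api` argument are hypothetical, and the competition slug `ChessRatings2` is taken from the competition URL above.

```python
from pathlib import Path


def fetch_raw_data(path="../data/raw", force=True, quiet=True, api=None):
    """Download the competition files into `path` (illustrative sketch)."""
    if api is None:
        # Imported lazily: the kaggle package looks for the API token
        # and authenticates as soon as it is imported.
        import kaggle
        api = kaggle.api

    out_dir = Path(path)
    out_dir.mkdir(parents=True, exist_ok=True)
    # Writes the downloaded competition files into out_dir
    api.competition_download_files(
        "ChessRatings2", path=str(out_dir), force=force, quiet=quiet
    )
    return out_dir
```

Passing the client in through `api` keeps the sketch usable without credentials, for instance when substituting a stub object in a test.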

## Step 2: Data Processing (Optional)

In order to downsample and process the raw data, we
- removed games that resulted in ties
- converted monthly periods to yearly periods, resulting in a total of 12 periods (years)
- kept the 103 players who played the most games
- used the first 10 periods (years) for training and the remaining 2 periods (years) for testing
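
The four steps above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the project's `process_data` code: the function name, the parameter names, the tuple layout of a game, the 1-based month numbering, the order of the steps, and the rule that a game is kept only when both players are among the most active ones are all assumptions.

```python
from collections import Counter


def process_games(games, top_n=103, months_per_period=12, train_periods=10):
    """Split games into training and testing sets (illustrative sketch).

    `games` is an iterable of (month, white, black, white_score) tuples,
    with months numbered from 1 and white_score in {0, 0.5, 1}. This
    layout is an assumption, not the project's actual schema.
    """
    # 1. Remove tied games
    decisive = [g for g in games if g[3] != 0.5]

    # 2. Convert monthly periods to longer periods (1-based)
    rescaled = [
        ((m - 1) // months_per_period + 1, w, b, s) for m, w, b, s in decisive
    ]

    # 3. Keep the players with the most games
    counts = Counter()
    for _, w, b, _ in rescaled:
        counts[w] += 1
        counts[b] += 1
    top = {p for p, _ in counts.most_common(top_n)}
    # Assumption: a game is kept only if both players are in the top set
    kept = [g for g in rescaled if g[1] in top and g[2] in top]

    # 4. Split by period: early periods for training, the rest for testing
    train = [g for g in kept if g[0] <= train_periods]
    test = [g for g in kept if g[0] > train_periods]
    return train, test
```

With 12 months per period, month 135 maps to period `(135 - 1) // 12 + 1 = 12`, which agrees with the 12 yearly periods listed above.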

The resulting training and testing data sets include **2570 (90.8%)** and **261 (9.2%)** games, respectively, spanning 12 yearly periods.
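
As a quick sanity check, the quoted shares follow directly from the two game counts. The variable names below are illustrative only.

```python
train_games = 2570
test_games = 261
total = train_games + test_games

# Share of each split, in percent, rounded to one decimal place
train_share = round(100 * train_games / total, 1)
test_share = round(100 * test_games / total, 1)

print(total, train_share, test_share)  # 2831 90.8 9.2
```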

The following command processes the raw project data and overwrites the existing data sets in the [data/processed](../data/processed) folder.

```python
from data import process_data

process_data(
    player_num=103,                  # Keep the top 103 most frequent players
    period_length=12,                # Combine every 12 months into one period
    period_train=10,                 # Use the first 10 periods as training set
    path_input="../data/raw",        # Input path
    path_output="../data/processed"  # Output path
)
```