# A dive into chess dataset
##### Welcome! In this notebook, I explored how I would start creating a model that plays chess. Along the way, I learned a lot about handling the dataset before diving into the actual modeling process.

1. PGN Format
    Past chess games are stored in a file format known as PGN (Portable Game Notation). To interact with this data by making moves and recording them, I installed the chess Python library. Think of each move as manipulating the game in object format. By calling functions like move, we can observe how the board evolves after each action and capture this data for analysis.

2. Data Representation
    To work with this data, we represent the chess board using numbers. Each piece is assigned a unique value to indicate its position on the board, while the empty squares are also represented numerically.

##### This project opened my eyes to how innovative and exciting data handling can be, especially when applied to something as strategic as chess!

In [1]:
!pip install chess
!pip install memory_profiler
import chess.pgn

# collecting the chess data in pgn formats

%time pgn=open("chess_data.pgn")
game=chess.pgn.read_game(pgn)
board=game.board()


CPU times: total: 0 ns
Wall time: 36.7 ms


##### Here we map out the pieces to values

In [2]:
def piece_value(piece):
    piece_values={'p':1,'n':3,'b':4,'r':5,'q':9,'k':0}
    return piece_values[piece.lower()]

##### function to iterate through every spot on the board to assign the values to the pieces and ultimately represent the board numerically

In [3]:
def extract_features(board):
    pieces=str(board).split()
    features=[]

    for row in pieces:
        for piece in row:
            row_features=[]
            if piece=='.' :
                row_features.append(0)
            elif piece.islower():
                row_features.append(-piece_value(piece))
            else:
                row_features.append(piece_value(piece))
            features.append(row_features)
    return features

##### Extracting the Features: Data Processing

Processing the dataset using the above functions can be quite demanding due to its size—over 5 million records. This would require significant time and computational resources. To manage this, I explored alternatives like batch processing or multiprocessing.

I opted for batch processing, where I processed the data in groups of 1,000 records and stored each batch in a CSV file. After reaching 11,335 batches, I decided to interrupt the run.

In [11]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
import pandas as pd

batch_size=1000
X_batch,y_batch=[],[]
batch_count=0


while True:
    game=chess.pgn.read_game(pgn)
    if game is None:
        break
    board=game.board()

    # Iterating through the moves in the board
    for move in game.mainline_moves():
        board.push(move) #play the move

        # Extract the board position (features) after the move
        features=extract_features(board)
        X_batch.append(features)
        y_batch.append(move.uci())

        if len(X_batch) > batch_size:
            # save the batch to the disk
            df=pd.DataFrame(X_batch)
            df['label']=y_batch
            batch_count+=1
            df.to_csv(f'batch_{batch_count}')

            # clear the batch lists
            X_batch.clear()
            y_batch.clear()
            %time
            


CPU times: total: 0 ns
Wall time: 0 ns
CPU times: total: 0 ns
Wall time: 0 ns
CPU times: total: 0 ns
Wall time: 0 ns
CPU times: total: 0 ns
Wall time: 0 ns
CPU times: total: 0 ns
Wall time: 0 ns
CPU times: total: 0 ns
Wall time: 0 ns
CPU times: total: 0 ns
Wall time: 0 ns
CPU times: total: 0 ns
Wall time: 0 ns
CPU times: total: 0 ns
Wall time: 0 ns
CPU times: total: 0 ns
Wall time: 0 ns
CPU times: total: 0 ns
Wall time: 0 ns
CPU times: total: 0 ns
Wall time: 0 ns
CPU times: total: 0 ns
Wall time: 0 ns
CPU times: total: 0 ns
Wall time: 0 ns
CPU times: total: 0 ns
Wall time: 0 ns
CPU times: total: 0 ns
Wall time: 0 ns
CPU times: total: 0 ns
Wall time: 0 ns
CPU times: total: 0 ns
Wall time: 0 ns
CPU times: total: 0 ns
Wall time: 0 ns
CPU times: total: 0 ns
Wall time: 0 ns
CPU times: total: 0 ns
Wall time: 0 ns
CPU times: total: 0 ns
Wall time: 0 ns
CPU times: total: 0 ns
Wall time: 0 ns
CPU times: total: 0 ns
Wall time: 0 ns
CPU times: total: 0 ns
Wall time: 0 ns
CPU times: total: 0 ns
Wa

KeyboardInterrupt: 

In [52]:
# alternatively you can subdivide the task into subtasks
# Going with the multi processing option
import multiprocessing as mp
from joblib import Parallel,delayed


def process_game(game):
    board=game.board()
    X_local,y_local=[],[]

    for move in game.mainline_moves():
        board.push(move)
        features=extract_features(board)
        X_local.append(features)
        y_local.append(move.uci())
    return X_local,y_local

def load_games(pgn):
    while True:
        game=chess.pgn.read_game(pgn)
        if game is None:
            break
        yield game

# parralel processing of the games
games=list(load_games(pgn))
results=Parallel(n_jobs=4)(delayed(process_game(game)) for game in games)

X,y=zip(*results)
X=[item for sublist in X for item in sublist]
y=[item for sublist in y for item in sublist]




##### Lets go ahead and look at how we can mnipulate our data from a selected batch

In [53]:
batch1=pd.read_csv('batch_1')
batch1

Unnamed: 0.1,Unnamed: 0,0,1,2,3,4,5,6,7,8,...,55,56,57,58,59,60,61,62,63,label
0,0,[-5],[-3],[-4],[-9],[0],[-4],[-3],[-5],[-1],...,[1],[5],[3],[4],[9],[0],[4],[3],[5],e2e3
1,1,[-5],[-3],[-4],[-9],[0],[-4],[-3],[-5],[-1],...,[1],[5],[3],[4],[9],[0],[4],[3],[5],b7b5
2,2,[-5],[-3],[-4],[-9],[0],[-4],[-3],[-5],[-1],...,[1],[5],[3],[4],[9],[0],[0],[3],[5],f1b5
3,3,[-5],[-3],[-4],[-9],[0],[-4],[-3],[-5],[-1],...,[1],[5],[3],[4],[9],[0],[0],[3],[5],g7g5
4,4,[-5],[-3],[-4],[-9],[0],[-4],[-3],[-5],[-1],...,[1],[5],[3],[4],[9],[0],[0],[3],[5],b5d7
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
996,996,[-5],[-3],[0],[-9],[0],[-4],[-3],[-5],[-1],...,[1],[5],[3],[4],[0],[0],[0],[3],[5],d5e4
997,997,[-5],[-3],[0],[-9],[0],[-4],[-3],[-5],[-1],...,[1],[5],[3],[4],[0],[0],[0],[3],[5],e2e4
998,998,[-5],[-3],[0],[0],[0],[-4],[-3],[-5],[-1],...,[1],[5],[3],[4],[0],[0],[0],[3],[5],d8d2
999,999,[-5],[-3],[0],[0],[0],[-4],[-3],[-5],[-1],...,[1],[5],[3],[4],[0],[0],[0],[3],[5],e4e7


### Data Cleanup
* The x values are enclosed in square brackets, indicating they are stored as a list. We need to convert these values into numerical data for further processing.
* Additionally, there is a column named 'Unnamed: 0' which is unnecessary for our analysis, so we'll remove it from the dataset.

In [51]:
batch1.drop(columns=['Unnamed: 0'],axis=1,inplace=True)
batch1.apply(pd.to_numeric,errors='coerce')
batch1_cleaned=batch1.replace(r'\[|\]','',regex=True)
batch1_cleaned

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,55,56,57,58,59,60,61,62,63,label
0,-5,-3,-4,-9,0,-4,-3,-5,-1,-1,...,1,5,3,4,9,0,4,3,5,e2e3
1,-5,-3,-4,-9,0,-4,-3,-5,-1,0,...,1,5,3,4,9,0,4,3,5,b7b5
2,-5,-3,-4,-9,0,-4,-3,-5,-1,0,...,1,5,3,4,9,0,0,3,5,f1b5
3,-5,-3,-4,-9,0,-4,-3,-5,-1,0,...,1,5,3,4,9,0,0,3,5,g7g5
4,-5,-3,-4,-9,0,-4,-3,-5,-1,0,...,1,5,3,4,9,0,0,3,5,b5d7
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
996,-5,-3,0,-9,0,-4,-3,-5,-1,0,...,1,5,3,4,0,0,0,3,5,d5e4
997,-5,-3,0,-9,0,-4,-3,-5,-1,0,...,1,5,3,4,0,0,0,3,5,e2e4
998,-5,-3,0,0,0,-4,-3,-5,-1,0,...,1,5,3,4,0,0,0,3,5,d8d2
999,-5,-3,0,0,0,-4,-3,-5,-1,0,...,1,5,3,4,0,0,0,3,5,e4e7


## Training the Model Using Data from the Batches
We selected a model capable of performing partial fits to train on selected batches. To ensure the model works effectively, the data must be processed and converted into numeric values:

1. ### Processing X (Independent Variables)

We'll use the Pandas to_numeric() function, but since the column values are enclosed in [], this function will return null. To resolve this, we will apply regex to remove the brackets, leaving only the numeric values.

2. ### Processing y (Dependent Variable)

We'll use one-hot encoding to convert the target variable into numeric form. Since the data is processed in batches, the encoder might encounter new y values. To address this, we will apply fit_transform() across all batches.


In [46]:
from sklearn.linear_model import SGDRegressor
import pandas as pd
import numpy as np
import os
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder
model=SGDRegressor()
imputer=SimpleImputer(strategy='constant', fill_value=0)
directory='/'
total_batches=11335
train_count=int(0.80*total_batches)
encoder=LabelEncoder()
for batch_number in range(1,train_count):
        # load the batch from the csv file
        df=pd.read_csv(f'batch_{batch_number}')
        df.drop(columns=['Unnamed: 0'],axis=1,inplace=True)
        df_cleaned=df.replace(r'\[|\]', '', regex=True)        
        # df.dropna(inplace=True)        
        # df_cleaned = df_cleaned.apply(pd.to_numeric, errors='coerce')  
        if df_cleaned.empty:
            print(f"Warning: batch_{batch_number} is empty after conversion.")
            continue
        # seperating the features and labels
        X_batch=df_cleaned.drop('label',axis=1).values
        y_batch=df_cleaned['label'].values
        # Drop rows where y_batch is NaN       
        y_batch_encoded=encoder.fit_transform(y_batch) 
        # Skip if X_batch is empty
        if X_batch.shape[0] == 0:
            print(f"Warning: X_batch is empty for batch {batch_number}. Skipping batch.")
            continue
        if batch_number == 1:
            X_batch = imputer.fit_transform(X_batch)
        else:
            X_batch = imputer.transform(X_batch)            
        try:
            # Incrementally train the model
            model.partial_fit(X_batch,y_batch_encoded)
        except ValueError as e:
             print(f'Error {e} while fitting the model at batch {batch_number}')

#### Testing and evaluating the model

In [50]:
# Evaluating the model
from sklearn.metrics import r2_score,mean_absolute_error,mean_squared_error

y_true=[]
y_pred=[]

for batch_numnber in range(train_count,total_batches+1):
    df=pd.read_csv(f'batch_{batch_number}')
    y_true.extend(encoder.transform(df['label'].values))
    X_test=df.drop(columns=['label','Unnamed: 0'],axis=1)
    X_test=X_test.replace(r'\[|\]','',regex=True)
    
    y_pred=model.predict(X_test.values)


# Convert the numpy arrays for evaluation
y_true=np.array(y_true)
y_pred=np.array(y_pred)

mse=mean_squared_error(y_true,y_pred)
r2=r2_score(y_true,y_pred)

print(f'Mean Squared Error: {mse}')
print(f'R^2 score {r2}')

ValueError: Found input variables with inconsistent numbers of samples: [2270268, 1001]

### Saving the model as a pkl file

In [47]:
import joblib
joblib.dump(model,'trained_model.pkl')

['trained_model.pkl']