# Exercise 1: Chess

The file `chess_games.csv` is a collection of chess games from Lichess and Chess.com, along with metrics describing
each player's performance during the game (the moves themselves are not included).

The task is this: __Determine the difference in player behavior when they’re winning vs losing vs maintaining their elo score__

Look at the problem carefully: while we are looking for overall patterns among players, *how* you split a
player's performance into winning, losing, and maintaining is important.


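Note: What is the Elo score?
- The Elo score is a rating based on the results of your matches, the number of matches you played, and the Elo scores of your opponents.
- Roughly, $R \approx \frac{1}{N}\sum_{i=1}^{N}\left(R_{\text{opp},i} + 400\,s_i\right)$, where $N$ is the number of games, $R_{\text{opp},i}$ is the rating of your opponent in game $i$, and $s_i = 1$ for a win, $-1$ for a loss, and $0$ for a draw.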
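The rough formula above can be written as a short function. `performance_rating` is a hypothetical helper for illustration only; it implements just this approximation, not the rating system the chess sites actually use.

```python
def performance_rating(opponent_ratings, results):
    """Rough performance rating: average of (opponent rating + 400 * result).

    results: 1 for a win, -1 for a loss, 0 for a draw.
    """
    if not opponent_ratings or len(opponent_ratings) != len(results):
        raise ValueError("need one result per opponent rating")
    total = sum(r + 400 * s for r, s in zip(opponent_ratings, results))
    return total / len(results)


# Hypothetical three games: a win vs 1500, a loss vs 1600, a draw vs 1400
print(performance_rating([1500, 1600, 1400], [1, -1, 0]))  # → 1500.0
```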

# Table of Contents
* [Load Data and Run Exploratory Data Analysis](#eda)
    * [Basic data/dataframe checks](#data)
    * [Examine Categorical columns](#categorical)
    * [Create EDA Plots](#plots)
* [What does it mean to be Winning, Losing, or Maintaining?](#win_lose_main)
    * [Feature engineering](#feat_eng)
    * [Plot timeseries](#timeseries)
    * [Slope Calculation](#slope)
        * [Side bar: Elo score: Can I calculate it?](#calculate-elo)
* [Different behaviors of different groups](#behavior)
    * [Histograms, medians, and means](#hist)
    * [Common opening moves](#eco)
    * [How do errors evolve within a game](#gameplay)
    * [Difference between Time Classes](#timeclass)
    * [Statistical Tests](#hypothesis)
* [Summary of Results](#summary)
* [Bonus: ML Classification](#bonus)

In [None]:
# Import basic packages
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

# Load Data and Run Exploratory Data Analysis (EDA) <a class="anchor" id="eda"></a>

Here are some notes about the data, for reference while I work through the dataset.
- `outcome`:
    - 1: white won
    - 0: black won
    - 2: draw
- Elo score = `rating`
- `service`: the website the game was played on. Note: the prompt says there are two data sources, but the data contains games from Lichess only.
- `eco_name`: name of the opening
- `eco_code`: ECO classification code for the opening
- `eco_category`: broader ECO grouping of the opening
- ACL = average centipawn loss: the engine's evaluation of the best move minus the evaluation of the move actually played, averaged over the game. Lower is better.
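A minimal sketch of how an ACL value could be computed from per-move engine evaluations. The function name, the assumption that evaluations are in centipawns from the moving player's point of view, and the clipping of negative losses at zero are all illustrative assumptions; the dataset only provides the final ACL, and the sites may apply extra caps.

```python
def average_centipawn_loss(best_evals, played_evals):
    """Mean per-move loss, in centipawns, relative to the engine's best move.

    Both lists hold engine evaluations from the moving player's point of view:
    best_evals[i] after the engine's preferred move, played_evals[i] after the
    move actually played. Negative losses are clipped to zero (an assumption).
    """
    if not best_evals or len(best_evals) != len(played_evals):
        raise ValueError("need one played evaluation per best evaluation")
    losses = [max(0, best - played) for best, played in zip(best_evals, played_evals)]
    return sum(losses) / len(losses)


# Hypothetical three moves: a perfect move, a 40 cp slip, and a 100 cp mistake
print(round(average_centipawn_loss([30, 50, 20], [30, 10, -80]), 1))  # → 46.7
```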

Action items:
- Basic Data Checks
    * Look at data and columns
    * Check for Nulls
- Examine categorical columns
- Plot histograms of the data and examine correlations.

In [None]:
# Read in Data
chess = pd.read_csv('chess_games.csv',  sep='\t')
chess.head()

## Basic data/dataframe checks <a class="anchor" id="data"></a>
* Look at data and columns
* Check for Nulls

In [None]:
# Run a basic check on the data frame to see the data types and non-null counts
chess.info()
# Some columns have fewer non-null entries than rows, so count the nulls per column
chess.isnull().sum()

In [None]:
# Timestamp is an int column, but it can be more helpful to look at it in real date time.
chess["datetime"] = pd.to_datetime(chess["timestamp"], unit='s')
chess.head()


What is the number of unique chess players? Each game has two players (one white, one black), so there are at most 2000 unique players.

In [None]:
len(set(list(chess.black_username) + list(chess.white_username)))
# 847/2000

With only 847 unique players across 2000 player slots, an even split would give each player only 2 to 3 games. Can we still use timestamps, i.e., a time series, to track how winning, losing, and maintaining evolve?

To answer this question, I counted the number of games per player as white and as black. `ieshuaganocry`, `hooligandi`, and `aimchess_bot` are the most prolific players, so games are far from evenly distributed, and I can use time to define "winning", "losing", and "maintaining", at least for players with several games.

In [None]:
print(chess.groupby("black_username").timestamp.count().reset_index().sort_values(by='timestamp', ascending=False).head(10), '\n\n')
print(chess.groupby("white_username").timestamp.count().reset_index().sort_values(by='timestamp', ascending=False).head(10))

In [None]:
# how many players have more than one entry?
white_username = chess.groupby("white_username").timestamp.count().reset_index().rename(columns={'timestamp': "count", "white_username": "username"})
black_username = chess.groupby("black_username").timestamp.count().reset_index().rename(columns={'timestamp': "count", "black_username": "username"})

usernames = pd.concat([white_username, black_username])
usernames = usernames.groupby('username').sum().reset_index()
print("# > 1:", len(usernames[usernames["count"] > 1]))
print("# > 2:", len(usernames[usernames["count"] > 2]))
print("# > 3:", len(usernames[usernames["count"] > 3]))

Only 80 users have more than 1 game, only 28 have more than 2, and only 13 have more than 3 games. This will limit the robustness of the split because there is limited data to run a time series analysis on.
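The per-player counts can also be computed more compactly by stacking both username columns and calling `value_counts`. A sketch on made-up usernames (not from the dataset); the same two lines would apply to `chess`.

```python
import pandas as pd

# Toy games table with the same username columns as chess_games.csv
games = pd.DataFrame({
    "white_username": ["alice", "bob", "alice", "carol"],
    "black_username": ["bob", "alice", "dave", "bob"],
})

# One entry per (game, player): stack the white and black columns, then count
games_per_player = pd.concat([games["white_username"], games["black_username"]]).value_counts()

for n in (1, 2, 3):
    print(f"# > {n}:", int((games_per_player > n).sum()))
```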

## Categorical columns <a class="anchor" id="categorical"></a>

- What types of games are there?
- What are the different services (we expect 2)?
- What is ECO?

In [None]:
# different types of chess play
chess.groupby(["time_class"]).timestamp.count().reset_index().rename(columns={'timestamp': 'count'})


The time class is dominated by `blitz`, with `rapid` a distant second. We likely will not get good statistics for bullet or classical games given their low counts.

Below I checked what websites were used. The prompt says there are two, but I only see one in the data set.

In [None]:
# Where does the data come from? Looks like it is actually all from one service. 
chess["service"].unique() 

ECO (Encyclopaedia of Chess Openings) is a system for naming and categorizing chess openings.
Based on the analysis below, `eco_name` is the most granular and most descriptive while `eco_category` places the options into broader categories.

In [None]:
chess.groupby(["eco_name", "eco_code", "eco_category"]).timestamp.count().reset_index().sort_values("eco_code")

In [None]:
for col in ["eco_name", "eco_code", "eco_category"]:
    print(col, len(chess[col].unique()))

# different levels of specificity. 

## Create EDA Plots <a class="anchor" id="plots"></a>

- Histograms of the distributions of all columns
- Correlation Matrix across all entries to understand what might be related.


To start, pandas' `hist` function cannot plot histograms of categorical variables. I will ordinally encode them (map each category to an integer from 0 to the number of categories minus 1) and see broadly where the density is centered.
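As a small illustration of what ordinal encoding does, here is the same idea with pandas alone on toy values. scikit-learn's `OrdinalEncoder` (with its default `categories="auto"`) also orders categories by sorted unique values, so the codes should line up; treat that equivalence as an assumption worth checking on the real columns.

```python
import pandas as pd

# Toy time_class values (not from the dataset)
time_class = pd.Series(["blitz", "rapid", "bullet", "blitz"])

# Categories are inferred in sorted order: blitz=0, bullet=1, rapid=2
codes = pd.Categorical(time_class).codes.tolist()
print(codes)  # → [0, 2, 1, 0]
```

The integer codes impose an arbitrary (alphabetical) order, so the histograms show where the counts concentrate, but the distance between two codes carries no meaning.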

In [None]:
# Get a list of categorical columns to encode 
from sklearn.compose import make_column_selector as selector

categorical_columns_selector = selector(dtype_include=object)
categorical_columns = categorical_columns_selector(chess)
categorical_columns

In [None]:
from sklearn.preprocessing import OrdinalEncoder

encoder = OrdinalEncoder().set_output(transform="pandas")
encoded_names = []

# Encode the categorical columns, skipping the usernames
for col in ["time_class", "eco_name", "eco_code", "eco_category"]:
    _encoded = encoder.fit_transform(chess[[col]])  # single-column DataFrame named `col`
    chess[f'{col}_encoded'] = _encoded[col]
    encoded_names.append(f'{col}_encoded')

pandas' `DataFrame.hist` plots a histogram of every numeric column at once. I exclude the raw categorical columns (already encoded above) and `timestamp`.

In [None]:
fig = plt.figure(figsize=(25,25))
ax = fig.gca()

_ = chess.loc[:, ~chess.columns.isin(categorical_columns + ['timestamp'])].hist(ax=ax)

- Looking through the histograms, the data is well behaved, without many outliers.
- White wins about as often as white loses, but draws are less common.
- Several metrics are binary 0 or 1 entries (typically the `winrate`s).
- The `eco` columns have peaks, indicating that some openings are preferred.
- The white and black distributions should be roughly the same, assuming colors are assigned randomly.


Next, the correlation matrix, again excluding the categorical columns.

In [None]:
# numeric_only=True skips non-numeric columns such as `datetime`
correlation_matrix = chess.loc[:, ~chess.columns.isin(categorical_columns + ['timestamp'])].corr(numeric_only=True)

fig = plt.figure(figsize=(15,15))
ax = fig.gca()

sns.heatmap(correlation_matrix, xticklabels=correlation_matrix.columns, yticklabels=correlation_matrix.columns, ax=ax)

# Complex range of correlations

Many of the features are correlated with one another, with a mix of positive and negative correlations.

Let's take a closer look at what is correlated with `outcome`.

In [None]:
fig = plt.figure(figsize=(2, 15))
ax = fig.gca()

# Select the outcome column by name rather than by position
sns.heatmap(correlation_matrix[['outcome']], yticklabels=correlation_matrix.index, annot=True, ax=ax)

Interestingly, `outcome` is strongly negatively correlated with the black win-rate metrics, while its correlations with the white win-rate metrics are positive (the opposite sign) but much weaker.

# What does it mean to be Winning, Losing, or Maintaining? <a class="anchor" id="win_lose_main"></a>


Naively, we could assign:
- Winning match = conditions for winning
- Losing match = conditions for losing
- Draw = conditions for maintaining

However, the prompt says examine `player behavior when they’re winning vs losing vs maintaining their elo score`.
Does one win constitute "winning"? Given the wording of the prompt, I will assume that multiple wins are needed to define "winning".


Choices to define the groups:
1. Multiple wins, losses, or draws (maintain) in a row. If so, how many?
2. Positive, negative, or unchanged Elo score (rating) relative to the previous rating.
3. Tracking the change in score across many timesteps, requiring the Elo score to show the same pattern multiple times.
4. Measuring the slope of the change across multiple entries

Let's look at some timeseries to get a better sense of which method to use and how many data points we need.

Before we do that, let's update the data a little bit.


## Feature engineering  <a class="anchor" id="feat_eng"></a>

The data is not yet in the right shape for this task: each row is one game with two players, but we need one row per player per game.
Here are the action items for feature engineering:

- Double the length of the dataframe by creating individual rows per player (both white and black).
- Add a column indicating whether the player was white or black, because white moves first.
- Add a column with the time elapsed since the player's first game.
- Add a column for the difference between the player's and the opponent's ratings.

Let's start by creating the new table focused on _players x games_ instead of only games.

Note: I rename `white` and `black` to `player` (primary) and `opponent` (secondary). Every white player and every black player gets a row in which they are the primary.

In [None]:
chess_white = chess.copy()
chess_black = chess.copy()

for col in chess.columns:
    if 'black' in col:
        chess_white.rename(columns={col: col.replace("black", "opponent")}, inplace=True)
        chess_black.rename(columns={col: col.replace("black", "player")}, inplace=True)
    if 'white' in col:
        chess_white.rename(columns={col: col.replace("white", "player")}, inplace=True)
        chess_black.rename(columns={col: col.replace("white", "opponent")}, inplace=True)

# encode meaning of white versus black which encodes order
chess_white["goes_first"] = 1
chess_black["goes_first"] = 0

# outcome is coded from white's perspective (1 = white won, 0 = black won, 2 = draw).
# For black rows, swap wins and losses so that 1 always means the player won.
chess_black['outcome'] = chess_black['outcome'].map({0: 1, 1: 0, 2: 2})

chess_df = pd.concat([chess_white, chess_black])
chess_df.head()
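To sanity-check the reshaping logic, here is a condensed sketch of the same idea on a two-game toy table. `to_player_rows` and the toy data are hypothetical; it mirrors the cell above (rename `white`/`black` to `player`/`opponent`, add `goes_first`, flip black's wins and losses) but is not a drop-in replacement for it.

```python
import pandas as pd

def to_player_rows(games):
    """Turn one row per game into one row per (game, player)."""

    def side(me, other, goes_first):
        # Rename e.g. white_rating -> player_rating and black_rating -> opponent_rating
        rows = games.rename(columns=lambda c: c.replace(me, "player").replace(other, "opponent"))
        rows["goes_first"] = goes_first
        return rows

    white = side("white", "black", 1)
    black = side("black", "white", 0)
    # outcome is coded from white's perspective (1 = white won, 0 = black won, 2 = draw);
    # swap wins and losses on black's rows so 1 always means "the player won"
    black["outcome"] = black["outcome"].map({0: 1, 1: 0, 2: 2})
    return pd.concat([white, black], ignore_index=True)


games = pd.DataFrame({
    "white_username": ["alice", "bob"],
    "black_username": ["bob", "carol"],
    "white_rating": [1500, 1600],
    "black_rating": [1550, 1580],
    "outcome": [1, 2],  # alice beat bob; bob and carol drew
})
print(to_player_rows(games))
```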



How spread out in time are each player's games? Below is the time elapsed since each player's first game, in minutes. Most games are played within a short time of the player's first game, but there is a long tail of games played much later.

In [None]:
chess_df.sort_values(['player_username', 'timestamp'], inplace=True)
chess_df = chess_df.reset_index(drop=True)

chess_df['time_diff'] = chess_df.groupby('player_username').timestamp.transform(lambda x: (x - x.min())/60)
chess_df["time_diff"].hist() # in units of minutes.

Let's also look at the difference between each player's ratings for every game.

We see a few outliers, but almost all ratings are within 100 points of the opponent's. This makes sense given that the US Chess Federation assigns rating classes in 200-point bands.

In [None]:
chess_df["score_diff"] = chess_df["player_rating"] - chess_df["opponent_rating"]

chess_df["score_diff"].hist()

## Plot timeseries <a class="anchor" id="timeseries"></a>

I'm considering four choices to separate the classes.
Let's get some insights from the timeseries plots.

To start, let's create a dataframe recording how many games each player has played.

In [None]:
num_of_games = chess_df.groupby("player_username").timestamp.count().reset_index().sort_values(by='timestamp', ascending=False)
num_of_games.head()

The player with the most games has roughly 100 times as many games as the other players!

Let's first focus on players with more than 3 data points to get an understanding of what "winning", "losing", and "maintaining" might mean.

In [None]:
# plot the timeseries.

plt.figure(figsize=(10, 5))
plt.title('ieshuaganocry')
plt.plot(chess_df.loc[chess_df.player_username == 'ieshuaganocry', 'datetime'],
         chess_df.loc[chess_df.player_username == 'ieshuaganocry', 'player_rating'])
plt.scatter(chess_df.loc[chess_df.player_username == 'ieshuaganocry', 'datetime'],
            chess_df.loc[chess_df.player_username =='ieshuaganocry', 'player_rating'],
            alpha=0.3)

plt.xlabel("time")
plt.ylabel('player rating/elo score')

fig, axes = plt.subplots(3, 4, figsize=(10,10))
ax = axes.flat

for i, player in enumerate(num_of_games.loc[(num_of_games.timestamp > 3) & (num_of_games.timestamp < 100), 'player_username']):
    ax[i].set_title(player)
    ax[i].plot(chess_df.loc[chess_df.player_username == player, 'time_diff'],
               chess_df.loc[chess_df.player_username == player, 'player_rating'])

    ax[i].scatter(chess_df.loc[chess_df.player_username == player, 'time_diff'],
                  chess_df.loc[chess_df.player_username ==player, 'player_rating'],
                  alpha=0.3)
    
ax[0].set_ylabel('elo score')
ax[-1].set_xlabel('minutes since first game')
plt.tight_layout()

Before we answer the question of what to use, I noticed in these plots that there were often repeated ratings at approximately the same time.
(I set an `alpha` on the scatter points, so darker circles mean more overlapping data points.)

Let's look at what this means for a high-frequency player.
In the table below, the outcome changes (the player wins and loses games), but the player rating does not change over time, even though they play the same time class on the same service. Looking at the highest-frequency player, the scores appear to update approximately once every 30 minutes.
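One way to check the ~30-minute estimate is to measure the time between consecutive rating changes for a single player. A sketch on a made-up sequence; `minutes_between_rating_changes` is a hypothetical helper that assumes `timestamp` is in seconds (as the `unit='s'` conversion earlier suggests) and that rows are sorted by time.

```python
import pandas as pd

def minutes_between_rating_changes(games):
    """Minutes between consecutive rating changes for one player's games.

    Assumes rows are sorted by `timestamp` (in seconds) and that the rating
    shown on each row is the rating at the time of that game.
    """
    changed = games["player_rating"].diff().fillna(0).ne(0)
    change_times = games.loc[changed, "timestamp"]
    return (change_times.diff().dropna() / 60).tolist()


# Hypothetical games: the rating changes only at 1800 s and 3600 s
games = pd.DataFrame({
    "timestamp": [0, 300, 600, 1800, 2100, 3600],
    "player_rating": [1500, 1500, 1500, 1512, 1512, 1498],
})
print(minutes_between_rating_changes(games))  # → [30.0]
```

On the real data this could be applied as `minutes_between_rating_changes(chess_df[chess_df.player_username == 'hooligandi'])`, since `chess_df` is already sorted by player and timestamp.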

In [None]:
player = 'hooligandi'
chess_df.loc[chess_df.player_username == player, ['datetime',  'player_rating','outcome', 'opponent_rating', 'time_class']].copy()


Let's go back and examine the four choices I outlined above.

1. Multiple wins, losses, or draws (maintain) in a row. If so, how many?
    * The timeseries did not address this question because I did not plot outcomes over time.
    * However, the table above shows games being won and lost while the Elo rating stays unchanged. Since the prompt explicitly mentions the Elo score, I will assume that is the metric to track (though outcomes do drive the Elo score).
2. Positive, negative, or unchanged Elo score (rating) relative to the previous rating.
    * This is viable as a one-shot description, which again may not be enough to define "winning".
3. Tracking the change in score across many timesteps, requiring the Elo score to show the same pattern multiple times.
    * This suffers from the repeated Elo scores.
    * Why is this not the same as #1? The same rating repeats across many games: ratings appear to update only about every 30 minutes, so if you play many games quickly in a row, the Elo score doesn't change.
4. Measuring the slope of the change across multiple entries. If so, how many?
    * This is the measurement I am going to use.
    * It takes a wider view of the current trend, measuring whether the Elo rating is trending up or down. If a player is on an overall winning streak but loses a game, the slope may still be positive.
    * How many points to use when calculating the slope? I chose 3 points, though I would prefer more. Using 3 points includes the maximum number of players for whom we can measure performance.



## Slope Calculation <a class="anchor" id="slope"></a>

To calculate the slope, I use:
- A rolling calculation over 3 data points. At least 3 data points are needed to calculate a slope, so each player's first two games have no measurement.
- The x-axis is time; however, because there can be long gaps between games, I do not want the elapsed time (in seconds) to flatten the slope. Instead, I keep only the ordering of the games and use the game index within the window (0, 1, 2) as the x-axis.
    - Note: I tested using the time difference, and the general results were equivalent.
- I also tried a rolling window defined over time, but many games were separated by wide gaps, so many windows contained only a single entry.
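With the game index $x = 0, 1, 2$ as the x-axis, the least-squares slope has a simple closed form: $\bar{x} = 1$ and $\sum_i (x_i - \bar{x})^2 = 2$, so the slope is $\sum_i (x_i - \bar{x})\,y_i / 2 = (y_2 - y_0)/2$, and the middle game drops out entirely. The check below uses a made-up rating sequence and `np.polyfit` (not the notebook's `slope` function) to confirm the rolling fit matches this shortcut.

```python
import numpy as np
import pandas as pd

def ols_slope(y):
    """Least-squares slope of y against x = 0, 1, ..., len(y) - 1."""
    x = np.arange(len(y))
    return np.polyfit(x, y, 1)[0]

# Hypothetical rating sequence for one player (not from the dataset)
ratings = pd.Series([1500, 1500, 1500, 1520, 1510, 1510])

rolling_slope = ratings.rolling(3).apply(ols_slope, raw=True)
shortcut = (ratings - ratings.shift(2)) / 2  # closed form for a 3-point window

print(pd.DataFrame({"rating": ratings, "ols_slope": rolling_slope.round(3), "shortcut": shortcut}))
```

Because only the first and last ratings in each window matter, a single rating jump produces a nonzero slope for two consecutive windows, regardless of what happens in between.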

In [None]:
# Create the slope function to apply to the group by.
from scipy import stats

def slope(window):
    """Least-squares slope of the Elo scores in one rolling window against game order."""
    time = np.arange(len(window))  # game index within the window: 0, 1, 2, ...
    # Alternative: use elapsed minutes instead of game order as the x-axis
    # time = chess_df.loc[window.index, 'time_diff'].to_numpy()

    result = stats.linregress(time, window)
    return round(result.slope, 3)

Create a slope column using 3 data points

In [None]:
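# Rolling 3-game slope per player; drop the group level so the result aligns on chess_df's index
chess_df['elo_slope'] = (chess_df.groupby('player_username')['player_rating']
                         .rolling(3)
                         .apply(slope, raw=False)
                         .reset_index(level=0, drop=True))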


Plot the timeseries and the slope as a function of time for a few players. The red line marks zero slope: points above it are "winning", points below it are "losing", and points on it are "maintaining".

In [None]:
for player in num_of_games.loc[(num_of_games.timestamp > 3), 'player_username'][:4]:
    # Use real dates for the most prolific player; minutes since first game for everyone else
    if player == 'ieshuaganocry':
        figsize = (10, 10)
        time = 'datetime'
    else:
        figsize = (6, 6)
        time = 'time_diff'

    player_games = chess_df[chess_df.player_username == player]

    fig, axes = plt.subplots(2, 1, figsize=figsize, sharex=True)
    axes[0].set_title(player)

    # rating plot
    axes[0].plot(player_games[time], player_games['player_rating'], label='player rating')
    axes[0].scatter(player_games[time], player_games['player_rating'], alpha=0.3)
    axes[0].legend()

    # slope plot
    axes[1].plot(player_games[time], player_games['elo_slope'])
    axes[1].scatter(player_games[time], player_games['elo_slope'], label='slope', alpha=0.3)
    axes[1].axhline(0, color='r', label='zero slope')
    axes[1].legend()


The slope is not a perfect indicator with only 3 data points. For example, `hooligandi` shows a spike on the "winning" side even though they are in a losing trend. This happens because several consecutive data points share the same Elo score, and the 3-point rolling calculation picks up those plateaus. Including more data points would dampen this effect, but it would also decrease the number of players in the behavioral analysis.

I will keep the slope as is for now, but note that a longer term analysis needs more data to avoid these issues.

### Side bar: Elo score: Can I calculate it? <a class="anchor" id="calculate-elo"></a>

Looking at the timeseries of the Elo score, the score does not appear to update after every game. Using the formula from Wikipedia, I want to check whether I can recompute the Elo score so that it updates after every game and matches the recorded score whenever it changes. That way the slope results would be more accurate.

The formula: sum the opponents' ratings, add 400 for each win, subtract 400 for each loss (nothing for a draw), then divide by the number of games played, i.e. $R = \frac{1}{N}\left(\sum_{i=1}^{N} R_{\text{opp},i} + 400\,(W - L)\right)$ for $W$ wins and $L$ losses in $N$ games.

As we can see below, I am _not_ able to reproduce the Elo scores using a subset of the data. I will have to continue with the data set as is.

In [None]:
# Test the formula on one player: average the current rating with this game's performance (opponent rating +/- 400)
player = 'hooligandi'
check = chess_df.loc[chess_df.player_username == player, ['datetime',  'time_diff', 'player_rating','outcome', 'opponent_rating', 'time_class', 'elo_slope']].copy()

check['adjustment'] = 0
check.loc[check.outcome == 0, 'adjustment'] = -400
check.loc[check.outcome == 1, 'adjustment'] = 400
check["elo_update"] = (check["player_rating"] + check["opponent_rating"] + check['adjustment'])/2.
check
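For comparison, the classic incremental Elo update adjusts a rating after each game by $K(S - E)$, where $E = 1/(1 + 10^{(R_{\text{opp}} - R)/400})$ is the expected score and $S$ is 1, 0.5, or 0. The sketch below is illustrative only: `elo_update` and `expected_score` are hypothetical helpers, K = 20 is an arbitrary choice, and the mapping assumes the reshaped `outcome` coding (1 = player won, 0 = player lost, 2 = draw). Lichess states that it rates players with Glicko-2 rather than classic Elo, which may be one reason simple Elo-style formulas do not reproduce the recorded ratings.

```python
# Map the reshaped outcome column (player's perspective) to an Elo score
OUTCOME_TO_SCORE = {1: 1.0, 0: 0.0, 2: 0.5}


def expected_score(r_player, r_opponent):
    """Expected score of the player against the opponent under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_opponent - r_player) / 400))


def elo_update(r_player, r_opponent, outcome, k=20):
    """Player's new rating after one game; k=20 is an arbitrary assumption."""
    score = OUTCOME_TO_SCORE[outcome]
    return r_player + k * (score - expected_score(r_player, r_opponent))


# Hypothetical sequence of games for one player: a win, a loss, a draw
rating = 1500.0
for opponent_rating, outcome in [(1500, 1), (1600, 0), (1450, 2)]:
    rating = elo_update(rating, opponent_rating, outcome)
    print(round(rating, 1))
```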

Now we have our separations!

- Winning occurs when `elo_slope` > 0
- Losing occurs when `elo_slope` < 0
- Maintaining occurs when `elo_slope` = 0
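The three rules above can also be expressed as one labeling step, which keeps every row in a single DataFrame. A sketch using `np.select`; `label_trend` and the `"unlabeled"` default (for rows whose slope is NaN, such as each player's first two games) are hypothetical choices.

```python
import numpy as np
import pandas as pd

def label_trend(elo_slope):
    """Map each slope to 'winning', 'losing', or 'maintaining'; NaN slopes stay 'unlabeled'."""
    conditions = [elo_slope > 0, elo_slope < 0, elo_slope == 0]
    labels = ["winning", "losing", "maintaining"]
    return pd.Series(np.select(conditions, labels, default="unlabeled"), index=elo_slope.index)


# Toy slopes: two NaNs (first two games), then a rise, a plateau, and a drop
slopes = pd.Series([np.nan, np.nan, 10.0, 0.0, -5.0])
print(label_trend(slopes).tolist())
```

NaN compares as False against every condition, so unlabeled rows fall through to the default instead of being counted as "maintaining".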

# Different behaviors of different groups <a class="anchor" id="behavior"></a>

Now that the classes have been defined, let's examine how the behavior is different.

1. Histograms, medians, and means
2. Common opening moves across the classes
3. How errors evolve over time within a game
4. Differences between time classes
5. Whether the different behaviors are statistically significant



Separate the data into three data frames and check their lengths to understand the scope. Winning and losing have about the same number of entries, while maintaining is the smallest set.

In [None]:
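# Rows without a slope (each player's first two games) are NaN and fall into none of the groups
winning = chess_df[chess_df.elo_slope > 0].copy()
losing = chess_df[chess_df.elo_slope < 0].copy()
maintain = chess_df[chess_df.elo_slope == 0].copy()  # exact zero is safe because slopes are rounded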

In [None]:
print('length winning:', len(winning))
print('length losing:', len(losing))
print('length maintain:', len(maintain))

## Histograms, medians, and means <a class="anchor" id="hist"></a>

The question asks how a _player_'s behavior changes, so I focus on the primary player and ignore what the opponent is doing.

In [None]:
player_columns = []
opponent_columns = []
all_other = ['goes_first', 'score_diff', 'time_class_encoded',
             'eco_name_encoded', 'eco_code_encoded', 'eco_category_encoded']
for c in winning.columns:
    if 'player' in c and 'username' not in c:
        player_columns.append(c)
    if 'opponent' in c and 'username' not in c:
        opponent_columns.append(c)

print(len(player_columns), len(opponent_columns), len(all_other))
# Only want PLAYER difference, so don't need to look at opponent columns

Create histograms of all the features for the winning, losing, and maintaining groups. I also plot the median of each distribution, with its value printed in the legend.

To keep each figure readable, I split the columns into player columns and other features of interest.

In [None]:
stats_dict = {}  # Save mean, median, standard deviation
fig, axes = plt.subplots(6, 5, figsize=(22,22))
ax = axes.flat

for i, c in enumerate(player_columns):
    # Plot the histogram per winning, losing, maintain. 
    # Want these to be directly comparable, so they need the same binning.
    # Start the bins at min(0, data minimum) so negative values are not cut off
    _min = min(0, winning[c].min(), losing[c].min(), maintain[c].min())
    _max = max(winning[c].max(), losing[c].max(), maintain[c].max())
    bins = np.linspace(_min, _max, 15)
    ax[i].hist(winning[c], density=True, bins=bins, alpha=0.5)
    ax[i].hist(losing[c], density=True, bins=bins, alpha=0.4)
    ax[i].hist(maintain[c], density=True, bins=bins, alpha=0.3)

    # Calculate median and save other interesting metrics.
    win_med = winning[c].median()
    los_med = losing[c].median()
    mai_med = maintain[c].median()

    stats_dict[c] = {'winning_median': win_med, 'losing_median': los_med, "maintain_median": mai_med,
                     'winning_mean': winning[c].mean(), 'losing_mean': losing[c].mean(), "maintain_mean": maintain[c].mean(),
                     'winning_std': winning[c].std(), 'losing_std': losing[c].std(), "maintain_std": maintain[c].std()}

    # Add vertical line at median.
    ax[i].axvline(win_med, color='#1f77b4', label=f'Winning: {round(win_med, 2)}')
    ax[i].axvline(los_med, color='#ff7f0e', label=f'Losing: {round(los_med, 2)}')
    ax[i].axvline(mai_med, color='#2ca02c', label=f'Maintaining: {round(mai_med, 2)}')

    ax[i].legend()
    ax[i].set_xlabel(c[7:])  # removing "player" from label to make easier to read
    
ax[0].set_ylabel('probability density')
 

In [None]:
fig, axes = plt.subplots(2, 3, figsize=(10, 7))
ax = axes.flat

for i, c in enumerate(all_other):
    # Plot the histogram per winning, losing, maintain. 
    # Want these to be directly comparable, so they need the same binning.
    # Start the bins at min(0, data minimum) so negative values (e.g. score_diff) are not cut off
    _min = min(0, winning[c].min(), losing[c].min(), maintain[c].min())
    _max = max(winning[c].max(), losing[c].max(), maintain[c].max())
    bins = np.linspace(_min, _max, 15)
    ax[i].hist(winning[c], density=True, bins=bins, alpha=0.5)
    ax[i].hist(losing[c], density=True, bins=bins, alpha=0.4)
    ax[i].hist(maintain[c], density=True, bins=bins, alpha=0.3)

    # Calculate median and save other interesting metrics.
    win_med = winning[c].median()
    los_med = losing[c].median()
    mai_med = maintain[c].median()

    stats_dict[c] = {'winning_median': win_med, 'losing_median': los_med, "maintain_median": mai_med,
                       'winning_mean': winning[c].mean(), 'losing_mean': losing[c].mean(), "maintain_mean": maintain[c].mean(),
                       'winning_std': winning[c].std(), 'losing_std': losing[c].std(), "maintain_std": maintain[c].std()}

    # Add vertical line at median.
    ax[i].axvline(win_med, color='#1f77b4', label=f'Winning: {round(win_med, 2)}')
    ax[i].axvline(los_med, color='#ff7f0e', label=f'Losing: {round(los_med, 2)}')
    ax[i].axvline(mai_med, color='#2ca02c', label=f'Maintaining: {round(mai_med, 2)}')

    ax[i].legend()
    ax[i].set_xlabel(c)
    
ax[0].set_ylabel('probability density')

I recorded the mean, median, and standard deviation for each distribution to look a little closer at the numbers.

Note: when the mean differs noticeably from the median, the distribution is skewed, and therefore not Gaussian.
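The gap between mean and median can be turned into a single number with Pearson's second skewness coefficient, $3(\text{mean} - \text{median})/\text{std}$. A sketch on toy arrays; `pearson_median_skewness` is a hypothetical helper, and it uses the sample standard deviation (`ddof=1`) to match pandas' default `.std()`. The same ratio could be computed from the mean, median, and std columns recorded for each group.

```python
import numpy as np

def pearson_median_skewness(values):
    """Pearson's second skewness coefficient: 3 * (mean - median) / std."""
    values = np.asarray(values, dtype=float)
    std = values.std(ddof=1)
    if std == 0:
        return 0.0
    return 3 * (values.mean() - np.median(values)) / std


print(pearson_median_skewness([1, 2, 3]))       # symmetric toy data
print(pearson_median_skewness([1, 2, 3, 10]))   # long right tail: positive
print(pearson_median_skewness([-10, 1, 2, 3]))  # long left tail: negative
```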

In [None]:
stats_df = pd.DataFrame(stats_dict)
stats_df.T
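The same mean/median/std table could be produced without the manual `stats_dict` bookkeeping if each row carries a label column (for example, one built from `elo_slope` with the rules above) and the data is grouped once. A sketch on toy data; `summarize_by_trend`, the `trend` column, and `tactics_score` are hypothetical names.

```python
import numpy as np
import pandas as pd

def summarize_by_trend(df, columns, trend_col="trend"):
    """Mean, median, and std of each column per trend group (rows: column/stat, columns: group)."""
    return df.groupby(trend_col)[columns].agg(["mean", "median", "std"]).T


# Toy data: two games per group
df = pd.DataFrame({
    "trend": ["winning", "winning", "losing", "losing", "maintaining", "maintaining"],
    "tactics_score": [0.8, 0.6, 0.5, 0.3, 0.7, 0.7],
})
summary = summarize_by_trend(df, ["tactics_score"])
print(summary.round(3))
```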

### Summary of histogram, means, medians seen above:

- `Opening Score`: The winning group plays better openings than the maintaining or losing groups.
    - Options = [0, 1] (int)
    - winning > maintaining > losing: the winning group has more 1s than 0s, while the losing group has more 0s than 1s.
- `Advantage Capitalization Winrate`: Measures how well a player converts early advantages into wins.
    - On average, the winning group's win rate is 2x that of the losing group. The gap between winning and maintaining is much smaller.
    - Options = [0, 1] (int)
    - winning > maintaining > losing: the winning group has more 1s than 0s, while the losing group has more 0s than 1s.
- `Resourcefulness Win Draw Rate`: The ability to win or draw from a disadvantage.
    - The winning group is much better at winning or drawing from a disadvantage: its average score is 2x the maintaining group's and >4x the losing group's.
    - Options = [0, 1] (int)
    - winning > maintaining > losing: the winning group has more 1s than 0s, while the losing group has more 0s than 1s.
- `Tactics Score`: Percentage assigned based on performance.
    - Float percent, range 0-1
    - On average, the winning group scores 18% higher than the losing group, but only 5% higher than the maintaining group.
    - winning > maintaining > losing
- `Inaccuracies per game`: Range 0-1. The winning group makes slightly fewer inaccuracies per game: winning < maintaining < losing.
    - `Opening`: The maintaining group makes more inaccuracies in the opening than the losing group: winning < losing < maintaining.
    - `Middle`: By the middlegame the expected order returns: winning < maintaining < losing.
    - `End`: The gap between the groups widens by the endgame.
- `Mistakes per game`: Range 0-1. The winning group makes fewer mistakes per game: winning < maintaining < losing.
    - `Opening`: As with inaccuracies, the maintaining group makes more mistakes in the opening: winning < losing < maintaining.
    - `Middle`: The expected order returns by the middlegame: winning < maintaining < losing.
    - `End`: The expected order holds in the endgame: winning < maintaining < losing.
- `Blunders per game`: Range 0-1. The losing group blunders almost 2x as often as the winning group: winning < maintaining < losing.
    - `Opening`: The winning group does much better, but again the maintaining group performs slightly worse than the losing group: winning < losing < maintaining.
    - `Middle`: In the middlegame, the maintaining group is close to the winning group: winning < maintaining < losing.
    - `End`: The losing group's blunder rate is 3x the winning group's, while the maintaining group is much closer to winning: winning < maintaining < losing.
- `Endgame Win Rate`: Options = [0, 1] (int). The losing group has very few endgame wins (as expected). The maintaining group is similar to the winning group, but winning is better.
    - `Endgame Win Rate with equal`: Same as above.
    - `Endgame Win Rate with advantage`: Same as above, but winning and maintaining are 2x higher than losing.
    - `Endgame Win Rate with disadvantage`: The biggest difference between groups: winning performs 2x better than maintaining and almost 10x better than losing.
- `Long thinking outcome score`
    - Float percent, range 0-1
    - The distribution is fairly flat, but we still see winning > maintaining > losing: the winning group tends to have better long-thinking outcome scores.
- `Time advantage score`: The weaker player receives more time to think. The winning group receives the smallest advantage and the maintaining group the largest. Float percent.
    - `Significant Time advantage win rate`: Options = [0, 1] (int); 1 if the player received the advantage and won. Players in the winning group are much more likely to win these games than the other groups, nearly 3x the losing group.
    - `Significant Time disadvantage win rate`: Options = [0, 1] (int); 1 if the player received the disadvantage and won. The winning group is less likely to convert these games into wins, but it still has a major advantage over the losing group, converting about 5x as often.



Extra notes:
- Going first gives an advantage, but it may be small. More rigorous statistical tests are required.
- A player with a higher Elo score is more likely to win.
- Player ACL = average centipawn loss. A lower ACL means higher accuracy. Interestingly, the losing group has a lower ACL than the winning group.


## Common opening moves across the classes <a class="anchor" id="eco"></a>

Because the ECO columns were ordinally encoded, their histograms above are hard to interpret, so let's look at the count of each opening per group instead.

(These plots are messy and I would not show them to a customer, but they are to give me an understanding of what is going on.)

In [None]:
# Count how often each opening appears in each group
for col in ['eco_category', 'eco_code', 'eco_name']:  # in increasing granularity
    fig, ax = plt.subplots(1, 3, figsize=(15, 10), sharex=True)
    # player_username is only used as a column to count rows
    winning.groupby([col]).player_username.count().reset_index().plot.barh(x=col, y='player_username', ax=ax[0], label='winning')
    losing.groupby([col]).player_username.count().reset_index().plot.barh(x=col, y='player_username', ax=ax[1], color='#ff7f0e', label='losing')
    maintain.groupby([col]).player_username.count().reset_index().plot.barh(x=col, y='player_username', ax=ax[2], color='#2ca02c', label='maintain')

    # use same x limit across all subplots
    ax[0].set_xlim(0, max(winning.groupby([col]).player_username.count().reset_index().player_username.max(),
                          losing.groupby([col]).player_username.count().reset_index().player_username.max(),
                          maintain.groupby([col]).player_username.count().reset_index().player_username.max()
                          ))
    plt.subplots_adjust(wspace=0, hspace=0)

### Eco Classification Summary

- `eco_category`
    - Winning opening moves/games played:
        1. Russian Game
        2. Sicilian Defense
        3. Queen's Pawn Game
        4. French Defense
    - Losing opening moves/games played:
        1. Russian Game
        2. Queen's Pawn Game
        3. Sicilian Defense
        4. The Queen's Gambit Refused
    - Maintain opening moves/games played:
        1. Russian Game
        2. Sicilian Defense
        3. French Defense
        4. Four Knights Game
- `eco_code`
    - C42 is the Russian Game.
    - Which level (category, code, or name) to use depends on how granular the information needs to be.
- `eco_name`
    - The most granular classification, and as such it has the flattest distribution. Although a few specific openings were played up to 16 times, most appear only once per group.

All groups play the same opening moves most often. Another interesting thing to explore would be the error and win rates per opening; however, that is out of scope for this project.
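The bar charts above show raw counts, which also scale with the size of each group. A minimal sketch that puts the groups on the same footing by converting counts into each group's share of games per opening; `opening_shares` is a hypothetical helper and the toy data is illustrative only.

```python
import pandas as pd

def opening_shares(groups, col="eco_category", top_n=5):
    """Share of each group's games played with each opening (top_n openings overall).

    groups maps a group name to a DataFrame that contains `col`.
    """
    combined = pd.concat(
        [df[[col]].assign(group=name) for name, df in groups.items()],
        ignore_index=True,
    )
    # normalize="columns" divides each count by that group's total number of games
    shares = pd.crosstab(combined[col], combined["group"], normalize="columns")
    top = combined[col].value_counts().index[:top_n]
    return shares.loc[top]

# Toy data for illustration only
toy = {
    "winning": pd.DataFrame({"eco_category": ["Russian Game", "Sicilian Defense", "Russian Game"]}),
    "losing": pd.DataFrame({"eco_category": ["Queen's Pawn Game", "Russian Game"]}),
}
print(opening_shares(toy, top_n=2))
```

On the real data this would be called as `opening_shares({'winning': winning, 'losing': losing, 'maintain': maintain})`.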

## How do errors evolve over time within a game <a class="anchor" id="gameplay"></a>

Relative to the winning group, are the losing or maintaining groups making more or fewer errors in each phase of the game? I look at ratios rather than absolute changes because I want to understand how each group differs from the winning group.

In [None]:
# exploring different times in the game play
# Ratio of each error rate to the winning group's, per game phase.
# stats_df rows 3, 4, 5 hold the winning, losing, and maintain statistics (built earlier).
rows = []
for type_ in ['inaccuracies', 'mistakes', 'blunders']:
    for game in ['opening', 'middlegame', 'endgame']:
        col = stats_df[f'player_{type_}_per_{game}']
        rows.append({"issue": type_, "time": game,
                     "losing": col.iloc[4] / col.iloc[3],
                     "maintain": col.iloc[5] / col.iloc[3]})

change_v_winning = pd.DataFrame(rows)
change_v_winning


### Error Evolution Summary

- Losing:
    - Inaccuracies increase as the game goes on
    - Mistakes increase as the game goes on
    - Blunders decrease in the middlegame but then increase
- Maintain:
    - Starts with the most errors, performs fairly close to the winning group in the middlegame, then makes more errors again in the endgame, though typically fewer than at the start.
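The ratio table above depends on row positions in `stats_df`. A minimal sketch that builds the same kind of table directly from the group DataFrames; `error_ratios` is a hypothetical helper, and it assumes group means are the statistic of interest (if `stats_df` holds a different statistic, such as medians, the numbers will differ). A ratio above 1 means more errors than the winning group.

```python
import pandas as pd

ISSUES = ["inaccuracies", "mistakes", "blunders"]
PHASES = ["opening", "middlegame", "endgame"]

def error_ratios(groups, reference="winning"):
    """Ratio of each group's mean error rate to the reference group's mean.

    groups maps a group name to a DataFrame with columns named
    player_{issue}_per_{phase}.
    """
    rows = []
    for issue in ISSUES:
        for phase in PHASES:
            col = f"player_{issue}_per_{phase}"
            ref_mean = groups[reference][col].mean()
            row = {"issue": issue, "time": phase}
            for name, df in groups.items():
                if name != reference:
                    row[name] = df[col].mean() / ref_mean
            rows.append(row)
    return pd.DataFrame(rows)

# Toy data: every error rate of the "losing" group is twice the "winning" one
cols = [f"player_{i}_per_{p}" for i in ISSUES for p in PHASES]
toy_groups = {
    "winning": pd.DataFrame({c: [1.0, 3.0] for c in cols}),
    "losing": pd.DataFrame({c: [3.0, 5.0] for c in cols}),
}
print(error_ratios(toy_groups))
```

On the real data: `error_ratios({'winning': winning, 'losing': losing, 'maintain': maintain})`.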

## Difference Between Time Classes <a class="anchor" id="timeclass"></a>

Are the distributions different across the different kinds of games people are playing?

Do we have enough data to answer this question?

In [None]:
# Number of games per time class, overall and for each group

print('All', chess_df.groupby('time_class').outcome.count(), '\n')
print('Winning', winning.groupby('time_class').outcome.count(), '\n')
print('Losing', losing.groupby('time_class').outcome.count(), '\n')
print('Maintain', maintain.groupby('time_class').outcome.count())

Let's check whether game play differs across time classes within the winning group, since it has the most entries in 2 of the classes.

In [None]:
# Winning: distribution of each player metric, split by time class
time_class_dict = {}
fig, axes = plt.subplots(6, 5, figsize=(22, 22))
ax = axes.flat

for i, c in enumerate(player_columns):
    for tc, color in zip(winning.time_class.unique(), ['#1f77b4', '#ff7f0e', '#2ca02c', '#9467bd']):
        values = winning.loc[winning.time_class == tc, c]
        ax[i].hist(values, density=True, bins=np.linspace(0, winning[c].max(), 15), color=color, alpha=0.5)
        # accumulate stats for every time class (assigning a new dict would keep only the last one)
        time_class_dict.setdefault(c, {}).update({f'{tc}_median': values.median(),
                                                  f'{tc}_mean': values.mean(),
                                                  f'{tc}_std': values.std()})
        median = time_class_dict[c][f'{tc}_median']
        ax[i].axvline(median, color=color, label=f'{tc}: {round(median, 2)}')

    ax[i].legend()
    ax[i].set_xlabel(c[7:])  # drop the 'player_' prefix

ax[0].set_ylabel('probability density')
plt.suptitle('Winning')


In [None]:
# Losing: distribution of each player metric, split by time class
time_class_dict = {}
fig, axes = plt.subplots(6, 5, figsize=(22, 22))
ax = axes.flat

for i, c in enumerate(player_columns):
    for tc, color in zip(losing.time_class.unique(), ['#1f77b4', '#ff7f0e', '#2ca02c', '#9467bd']):
        values = losing.loc[losing.time_class == tc, c]
        # bins are based on the winning group's maximum, as in the winning plot, so both figures share bin edges
        ax[i].hist(values, density=True, bins=np.linspace(0, winning[c].max(), 15), color=color, alpha=0.5)
        # accumulate stats for every time class (assigning a new dict would keep only the last one)
        time_class_dict.setdefault(c, {}).update({f'{tc}_median': values.median(),
                                                  f'{tc}_mean': values.mean(),
                                                  f'{tc}_std': values.std()})
        median = time_class_dict[c][f'{tc}_median']
        ax[i].axvline(median, color=color, label=f'{tc}: {round(median, 2)}')

    ax[i].legend()
    ax[i].set_xlabel(c[7:])  # drop the 'player_' prefix

ax[0].set_ylabel('probability density')
plt.suptitle('Losing')

### Time Class Summary

The winning group showed little difference between time classes, while the losing group showed larger differences. However, most of the games are `blitz`, and the losing group has even fewer samples, so the differences are likely due to low statistics in the other time classes, though I would expect _strategies_ to differ across time controls.
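To quantify the low-statistics concern, a minimal sketch that puts the per-group counts printed earlier into one table with totals, so thin cells are easy to spot. `games_per_time_class` is a hypothetical helper and the toy data is illustrative only.

```python
import pandas as pd

def games_per_time_class(groups):
    """Count games per time class for each group, with row and column totals.

    groups maps a group name to a DataFrame with a `time_class` column.
    """
    combined = pd.concat(
        [df[["time_class"]].assign(group=name) for name, df in groups.items()],
        ignore_index=True,
    )
    return pd.crosstab(combined["time_class"], combined["group"],
                       margins=True, margins_name="total")

# Toy data for illustration only
toy_groups = {
    "winning": pd.DataFrame({"time_class": ["blitz", "blitz", "rapid"]}),
    "losing": pd.DataFrame({"time_class": ["blitz"]}),
}
print(games_per_time_class(toy_groups))
```

On the real data: `games_per_time_class({'winning': winning, 'losing': losing, 'maintain': maintain})`.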

## Statistical Tests <a class="anchor" id="hypothesis"></a>
Are these really 3 distinct samples?

I will run 3 tests:
- One-way ANOVA to test whether all three groups have the same mean, for float columns
- 2-sample z-test to test whether each pair of groups has the same mean, for float columns
- 2-sample proportion test to test whether each pair of groups has the same proportion, for binary columns

In all cases the null hypothesis is that the groups are the same (equal means or equal proportions), and p < 0.05 is taken as significant enough to reject it. I will run this over all columns. Not rejecting the null hypothesis does not prove the groups are identical; it only means the data cannot distinguish them.

Note: 
- I will not do this for the different time classes. There is not enough data across the classes.
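Running one test per column at p < 0.05 means that, across dozens of columns, a few rejections are expected by chance alone. A common adjustment is the Holm-Bonferroni procedure. A minimal pure-Python sketch (statsmodels also offers this as `statsmodels.stats.multitest.multipletests(pvals, method='holm')`); applying it to the p-values from the cells that follow is an optional extra step that the analysis here does not take.

```python
def holm_adjust(pvalues):
    """Holm-Bonferroni adjusted p-values, returned in the original order.

    Rejecting H0 wherever the adjusted p-value is below alpha controls the
    family-wise error rate across all the tests.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices, smallest p first
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # multiply the k-th smallest p-value by (m - k), cap it at 1,
        # and keep the adjusted values non-decreasing with a running maximum
        running_max = max(running_max, min(1.0, (m - rank) * pvalues[idx]))
        adjusted[idx] = running_max
    return adjusted

pvals = [0.01, 0.04, 0.03, 0.5]
adjusted = holm_adjust(pvals)
print([round(p, 3) for p in adjusted])  # -> [0.04, 0.09, 0.09, 0.5]
print([p < 0.05 for p in adjusted])     # -> [True, False, False, False]
```

Without the correction, three of these four tests would count as significant; after it, only one does.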

In [None]:
# Define which columns use binary designations and so need the Proportion test instead of the z-test
binary_columns = ['player_opening_score', 'player_advantage_capitalization_winrate', 'player_resourcefulness_windrawrate',
                  'player_endgame_winrate',
                  'player_endgame_winrate_with_equal',
                  'player_endgame_winrate_with_advantage',
                  'player_endgame_winrate_with_disadvantage',
                  'player_significant_time_advantage_winrate',
                  'player_significant_time_disadvantage_winrate',
                  'goes_first']

### ANOVA test across the 3 samples
`scipy.stats.f_oneway` returns the F statistic and the p-value. I only print the p-value to get a sense of the numbers, and I flag columns where the null hypothesis is not rejected (`SAME DISTRIBUTION`).

In [None]:
from scipy.stats import f_oneway

# Float columns for the ANOVA and z-tests: everything except binary, categorical, and bookkeeping columns.
# sorted() makes the print order reproducible (set order is arbitrary).
cols_for_float_test = sorted(set(chess_df.columns) - set(binary_columns)
                             - {'datetime', 'timestamp', 'time_diff', 'outcome', 'player_username', 'time_class_encoded'}
                             - set(categorical_columns))

# One-way ANOVA across the three groups for every player float statistic
for fcol in cols_for_float_test:
    if 'opponent' not in fcol and 'eco' not in fcol:
        fstat, pvalue = f_oneway(winning[fcol], losing[fcol], maintain[fcol], nan_policy='omit')
        if pvalue > 0.05:
            print(f'SAME DISTRIBUTION: {fcol} pvalue={round(pvalue, 4)}')
        else:
            print(f'{fcol} pvalue={round(pvalue, 4)}')

### 2 Sample z-test
This checks whether each pair of groups has the same mean, for float columns:
- Winning versus Losing
- Winning versus Maintain
- Losing versus Maintain
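For reference, a sketch of the statistic that, to my understanding, `statsmodels.stats.weightstats.ztest` computes with its defaults (`usevar='pooled'`, `ddof=1`) and `value=0`. Here $\bar{x}_1, \bar{x}_2$ are the two group means, $s_1^2, s_2^2$ the sample variances, and $n_1, n_2$ the group sizes:

$$
z = \frac{\bar{x}_1 - \bar{x}_2}{s_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}},
\qquad
s_p^2 = \frac{(n_1 - 1)\,s_1^2 + (n_2 - 1)\,s_2^2}{n_1 + n_2 - 2}
$$

The two-sided p-value is $2\,(1 - \Phi(|z|))$, where $\Phi$ is the standard normal CDF. Because only the means enter the numerator, the test compares means rather than whole distributions.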

In [None]:
import statsmodels as sm
# Import the submodules explicitly so that sm.stats.weightstats and
# sm.stats.proportion (used in the proportion-test cell) are loaded.
import statsmodels.stats.weightstats
import statsmodels.stats.proportion
from statsmodels.stats.weightstats import ztest


def ztest_columns(a, b, columns, alpha=0.05):
    """Two-sample z-test of equal means for each 'player' column.
    Returns (columns where H0 is not rejected, columns where H0 is rejected)."""
    not_rejected, rejected = [], []
    for fcol in columns:
        if 'player' in fcol:
            zstat, pvalue = ztest(a[fcol].dropna(), b[fcol].dropna(), value=0)
            if pvalue > alpha:
                not_rejected.append(fcol)
            else:
                rejected.append(fcol)
    return not_rejected, rejected


# Winning versus Losing
same, different = ztest_columns(winning, losing, cols_for_float_test)
print(f'Between Winning and Losing, {len(same)} columns do not reject the null hypothesis, {len(different)} columns reject it')

# Winning versus Maintaining
# note: despite its name, win_maintain_diff lists the columns where H0 is NOT rejected
win_maintain_diff, different = ztest_columns(winning, maintain, cols_for_float_test)
print(f'Between Winning and Maintain, {len(win_maintain_diff)} columns do not reject the null hypothesis, {len(different)} columns reject it')

# Losing versus Maintaining (same naming caveat as above)
lose_maintain_diff, different = ztest_columns(maintain, losing, cols_for_float_test)
print(f'Between Maintain and Losing, {len(lose_maintain_diff)} columns do not reject the null hypothesis, {len(different)} columns reject it')

In [None]:
print('Win/Maintain Diff', win_maintain_diff, '\n')
print('Lose/Maintain Diff', lose_maintain_diff)

### 2 Sample Proportion test
This checks whether each pair of groups has the same proportion, for binary columns:
- Winning versus Losing
- Winning versus Maintain
- Losing versus Maintain
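For reference, a standard-library sketch of the two-sample proportion z-test with the pooled proportion under the null hypothesis, which to my understanding is what `proportions_ztest` computes by default. `two_proportion_ztest` is a hypothetical helper and the counts are illustrative only.

```python
import math

def two_proportion_ztest(successes1, n1, successes2, n2):
    """Two-sided two-sample proportion z-test using the pooled proportion under H0.

    Returns (z, p_value); both are NaN when the test is undefined.
    """
    p1, p2 = successes1 / n1, successes2 / n2
    pooled = (successes1 + successes2) / (n1 + n2)  # common proportion if H0 is true
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se == 0:
        return math.nan, math.nan  # all 0s or all 1s in both groups
    z = (p1 - p2) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return z, p_value

# Illustrative counts: 50/100 ones in one group versus 30/100 in the other
z, p = two_proportion_ztest(50, 100, 30, 100)
print(f"z = {z:.3f}, p = {p:.4f}")
```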

In [None]:
from statsmodels.stats.proportion import proportions_ztest


def proportion_test_columns(a, b, columns, alpha=0.05):
    """Two-sample proportion z-test for each binary column.
    Returns (number of columns where H0 is not rejected, number where it is rejected)."""
    not_rejected, rejected = 0, 0
    for col in columns:
        ncount = np.array([a[col].sum(), b[col].sum()])    # number of 1s in each group
        nobs = np.array([a[col].count(), b[col].count()])  # non-null rows in each group
        stat, pval = proportions_ztest(ncount, nobs)
        if pval > alpha:
            not_rejected += 1
        else:
            rejected += 1
    return not_rejected, rejected


pairs = [('Winning', winning, 'Losing', losing),
         ('Winning', winning, 'Maintain', maintain),
         ('Losing', losing, 'Maintain', maintain)]
for name_a, a, name_b, b in pairs:
    same, different = proportion_test_columns(a, b, binary_columns)
    print(f'Between {name_a} and {name_b}, {same} columns do not reject the null hypothesis, {different} columns reject it')

### Hypothesis/Statistics Summary

- The ANOVA rejects the null hypothesis for the majority of statistics across all 3 groups (winning, losing, maintain), except `mistakes_per_opening`, `inaccuracies_per_opening`, and `time_advantage_score`.
- Winning and Losing are the most distinct from each other, with the largest number of columns that reject the null hypothesis.
    - The null hypothesis is not rejected for the opening score or for which player moved first.
    - This is not surprising: the groups play the same openings, and black versus white is assigned randomly.
- Winning and Maintaining are statistically indistinguishable for most columns; 10 columns differ.
    - The null hypothesis is not rejected for inaccuracies, mistakes, middlegame blunders, endgame win rates, opening score, and which player moved first.
- Maintain and Losing are less similar than Winning and Maintaining, but more similar than Winning and Losing.
    - The null hypothesis is not rejected for ACL, mistakes, opening blunders, opening score, and which player moved first.


# Summary of Results <a class="anchor" id="summary"></a>
- The winning group plays strong opening moves and makes relatively few mistakes, inaccuracies, and blunders.
- The losing group is typically weak all around, making more mistakes, inaccuracies, and blunders.
    - Their errors (inaccuracies, mistakes, blunders) typically increase as the game continues.
- The maintaining group is the most interesting. They make more mistakes, inaccuracies, and blunders at the start of the game, but recover to be close to the winning group by the middlegame. This suggests they could win more games with a stronger start.
    - This group has the most errors at the beginning of the game and the fewest in the middlegame, then makes more errors toward the end (though never as many as at the start).
- Every group plays the same categories of opening moves most often.

Although these groups share some similarities, they broadly represent different playing distributions.

---
---

## Bonus: ML Classification <a class="anchor" id="bonus"></a>

We have defined "winning" (positive slope), "losing" (negative slope), and "maintain" (zero slope) classes.
We have also seen that the game-play metrics generally form distinct groups.
With these labels and this information, we can try training a classifier to predict which group a player is in.

What steps do we need to take:
1. Preprocess data:
    - One-hot encode categoricals
    - Handle nulls
2. Create a train/test split
3. Standardize data
4. Train a model
5. Evaluate performance


### Step 1: Preprocess Data

The categorical columns do not represent a meaningful order: we could rearrange the categories and nothing would change. Ordinal encoding therefore creates ordering information that does not exist. One-hot encoding avoids this while still encoding the categoricals: it creates one new column per category, with an entry of 1 if the row belongs to that category and 0 otherwise.
The disadvantage is that we introduce many sparse columns into the data.
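A toy illustration of the difference, on made-up data: the category-code column stands in for an ordinal encoding (the notebook's earlier encoding may have used a different scheme), and `pd.get_dummies` produces the one-hot version used in the next cell.

```python
import pandas as pd

toy = pd.DataFrame({"time_class": ["blitz", "rapid", "blitz", "bullet"]})

# Ordinal encoding imposes an arbitrary order (alphabetical here: blitz=0, bullet=1, rapid=2)
toy["time_class_ordinal"] = toy["time_class"].astype("category").cat.codes

# One-hot encoding: one 0/1 column per category and no implied order
one_hot = pd.get_dummies(toy["time_class"], prefix="time_class", dtype=float)
print(pd.concat([toy, one_hot], axis=1))
```

Each row has exactly one 1 across the new columns, so with many categories (such as `eco_name`) most entries are 0, which is the sparsity mentioned above.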

In [None]:
# Drop Ordinally encoded categoricals. Will replace with one-hot
chess_df_enc = chess_df.drop(columns=encoded_names)

chess_df_enc = pd.get_dummies(chess_df_enc, columns=["time_class", "eco_name", "eco_code", "eco_category", "service"],  dtype=float)
chess_df_enc


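Drop any entry that does not have an `elo_slope`, because we cannot label it.

Fill the remaining NaNs with zeros, since scikit-learn estimators such as `LogisticRegression` do not accept missing values. (Zero may be a meaningful value for some metrics, so this is a simplification.)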

In [None]:
chess_df_enc = chess_df_enc[~chess_df_enc.elo_slope.isna()].copy()

# fill remaining nulls with zero
chess_df_enc.fillna(0, inplace=True)

Create the target column
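The labelling in the next cell can also be written as a single vectorized expression. In this sketch, `slope_to_class` and its `tol` parameter are hypothetical and not used in the notebook: `tol=0.0` reproduces the exact-zero rule, while a small positive value would treat nearly flat slopes as "maintain". NaN slopes fall through to the default label (2), which is one reason to drop them first, as done above.

```python
import numpy as np

def slope_to_class(elo_slope, tol=0.0):
    """Map Elo slopes to labels: 0 = winning, 1 = losing, 2 = maintain.

    tol is a hypothetical tolerance; 0.0 reproduces the exact-zero rule.
    """
    slope = np.asarray(elo_slope, dtype=float)
    # conditions are checked in order; anything matching neither is 'maintain'
    return np.select([slope > tol, slope < -tol], [0, 1], default=2)

print(slope_to_class([1.5, -0.2, 0.0, 0.01]))            # -> [0 1 2 0]
print(slope_to_class([1.5, -0.2, 0.0, 0.01], tol=0.05))  # -> [0 1 2 2]
```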

In [None]:
# Also need to change `elo_slope` to classification
# 0 = winning
# 1 = losing
# 2 = maintain
chess_df_enc['classification'] = 0
chess_df_enc.loc[chess_df_enc.elo_slope < 0, 'classification'] = 1
chess_df_enc.loc[chess_df_enc.elo_slope == 0, 'classification'] = 2

In [None]:
# drop unneeded columns
chess_df_enc.drop(columns=['timestamp', 'player_username', 'opponent_username', 'time_diff', 'datetime', 'elo_slope'], inplace=True)


### Step 2: Create a train/test split

Now that the data is processed, we can split it into training and test sets.

In [None]:
from sklearn.model_selection import train_test_split

features = list(chess_df_enc.columns)
features.remove('classification')

# split data into features and labels
X = chess_df_enc[features].copy()
y = chess_df_enc["classification"].copy()


# Split data into train and test
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    train_size=.7, # train size is 70%
                                                    random_state=25) 

### Step 3: Standardize data

Fit the scaler only on the training set, after the train/test split. Fitting it on all of the data would leak test-set statistics (mean and variance) into training.
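What "fit on the training set only" means, as a minimal NumPy sketch: the mean and standard deviation come from the training rows, and the test rows are transformed with those same numbers. `standardize_train_test` is a hypothetical helper on random data; the notebook itself uses `StandardScaler`, which follows the same idea (population standard deviation, constant columns left unscaled).

```python
import numpy as np

def standardize_train_test(X_train, X_test):
    """Standardize both sets using statistics from the training set only."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    std[std == 0] = 1.0  # constant columns: avoid division by zero
    # the test set reuses the training mean/std, so nothing about it leaks into the scaling
    return (X_train - mean) / std, (X_test - mean) / std

rng = np.random.default_rng(0)
X_tr = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
X_te = rng.normal(loc=5.0, scale=2.0, size=(30, 3))
X_tr_s, X_te_s = standardize_train_test(X_tr, X_te)
print(X_tr_s.mean(axis=0).round(6), X_tr_s.std(axis=0).round(6))
```

The training columns come out with mean 0 and standard deviation 1; the test columns will be close to, but not exactly, those values, which is expected.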


In [None]:
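from sklearn.preprocessing import StandardScaler

# Instantiate the scaler and fit it on the training features only
scaler = StandardScaler()
scaler.fit(X_train)

# Transform features. Pass DataFrames (not .values) to match how the scaler was fit;
# mixing the two makes scikit-learn warn about missing feature names.
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)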

### Step 4: Train a Model

This is tabular data, so I will test two options:
1. Logistic Regression
2. Gradient Boosted Decision Tree Classifier

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier

# Instantiating the models 
logistic_regression = LogisticRegression(max_iter=500)
tree = GradientBoostingClassifier(n_estimators=700, subsample=0.7, learning_rate=0.01, max_depth=3)

# Training the models 
logistic_regression.fit(X_train_scaled, y_train)
tree.fit(X_train_scaled, y_train)

# Making predictions with each model
log_reg_preds = logistic_regression.predict(X_test_scaled)
tree_preds = tree.predict(X_test_scaled)

### Step 5: Evaluate Performance

I will create a classification report measuring precision, recall, F1-score, and accuracy, as well as a confusion matrix. There are many other ways to measure performance, but this is not the main goal of this exercise, so I will limit the evaluation.
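The per-class numbers in a classification report can be read straight off a confusion matrix. A small NumPy sketch that makes the relationship explicit; `per_class_metrics` is a hypothetical helper, and the matrix is illustrative, not the notebook's results. It assumes the scikit-learn layout: rows are actual labels, columns are predicted labels.

```python
import numpy as np

def per_class_metrics(cm):
    """Precision, recall, and F1 per class, plus overall accuracy, from a confusion matrix."""
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)
    predicted = cm.sum(axis=0)  # column sums: how often each class was predicted
    actual = cm.sum(axis=1)     # row sums: how often each class actually occurs
    precision = np.divide(tp, predicted, out=np.zeros_like(tp), where=predicted > 0)
    recall = np.divide(tp, actual, out=np.zeros_like(tp), where=actual > 0)
    denom = precision + recall
    f1 = np.divide(2 * precision * recall, denom, out=np.zeros_like(tp), where=denom > 0)
    accuracy = tp.sum() / cm.sum()
    return precision, recall, f1, accuracy

# Illustrative 3-class matrix (winning, losing, maintaining), not real results
cm = [[8, 1, 1],
      [2, 6, 2],
      [4, 4, 2]]
precision, recall, f1, accuracy = per_class_metrics(cm)
print(precision.round(3), recall.round(3), f1.round(3), round(accuracy, 3))
```

A class whose row is spread across the other columns (the last row here) gets a low recall even if the overall accuracy looks moderate.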


In [None]:
from sklearn.metrics import classification_report

# Store model predictions in a dictionary
# this makes it easier to iterate through each model
# and print the results.
model_preds = {
    "Logistic Regression": log_reg_preds,
    "Boosted Decision Tree": tree_preds,
}

for model, preds in model_preds.items():
    # end="\n\n" leaves a blank line between reports (sep has no effect with a single argument)
    print(f"{model} Results:\n{classification_report(y_test, preds)}", end="\n\n")


The Boosted Decision Tree outperforms the logistic regression. Logistic regression learns a linear decision boundary, so a non-linear model such as the boosted tree may be a better fit for this data set.

Create the confusion matrix.

In [None]:
from sklearn import metrics

# Confusion matrix per model (rows = actual, columns = predicted).
# labels=[0, 1, 2] fixes the row/column order to match class_names.
cnf_matrix_log = metrics.confusion_matrix(y_test, log_reg_preds, labels=[0, 1, 2])
cnf_matrix_tree = metrics.confusion_matrix(y_test, tree_preds, labels=[0, 1, 2])

class_names = ["winning", "losing", "maintaining"]

# plot
fig, ax = plt.subplots(1, 2, figsize=(12, 5))

for a, cm, title in zip(ax, [cnf_matrix_log, cnf_matrix_tree], ['Logistic Regression', 'Boosted Decision Tree']):
    sns.heatmap(pd.DataFrame(cm, columns=class_names, index=class_names), cmap="YlGnBu", annot=True, ax=a, fmt='g')
    # label each axis explicitly; plt.xlabel/plt.ylabel only affect the last active axis
    a.set_ylabel('Actual label')
    a.set_xlabel('Predicted label')
    a.set_title(title)

We can see there is a lot of confusion around the `maintain` class. The majority of these labels are misclassified. Winning and losing are more easily separable.

This quick classification exercise did not yield good results! The BDT classifier did perform better than the logistic regression, but its overall accuracy was still very low.

One other quick question:
- Everything above, except these classifiers, ignored the opponent's information. Let's train two more models without the opponent features.

In [None]:
# Remove opponent features. Filter with a comprehension: calling list.remove()
# while iterating over the same list skips elements.
feature_list_no_opp = [c for c in X_train.columns if 'opponent' not in c]

# Need to rescale the data, but will leave the train/test split unchanged.
# Instantiate the scaler and fit on the training features
scaler_no_opp = StandardScaler()
scaler_no_opp.fit(X_train[feature_list_no_opp])

# Transform features (pass DataFrames, consistent with how the scaler was fit)
X_train_scaled_no_opp = scaler_no_opp.transform(X_train[feature_list_no_opp])
X_test_scaled_no_opp = scaler_no_opp.transform(X_test[feature_list_no_opp])

In [None]:
# Train new models without the opponent features

# Instantiating the models
logistic_regression_no_opp = LogisticRegression(max_iter=500)
tree_no_opp = GradientBoostingClassifier(n_estimators=700, subsample=0.7, learning_rate=0.01, max_depth=3)

# Training the new *_no_opp models (refitting the original models would overwrite them)
logistic_regression_no_opp.fit(X_train_scaled_no_opp, y_train)
tree_no_opp.fit(X_train_scaled_no_opp, y_train)

# Making predictions with each model
log_reg_preds_no_opp = logistic_regression_no_opp.predict(X_test_scaled_no_opp)
tree_preds_no_opp = tree_no_opp.predict(X_test_scaled_no_opp)

In [None]:
# Examine classification report.
model_preds_no_opp = {
    "Logistic Regression": log_reg_preds_no_opp,
    "Boosted Decision Tree": tree_preds_no_opp,
}

for model, preds in model_preds_no_opp.items():
    # end="\n\n" leaves a blank line between reports (sep has no effect with a single argument)
    print(f"{model} Results:\n{classification_report(y_test, preds)}", end="\n\n")


In [None]:
# Plot the confusion matrices (rows = actual, columns = predicted).
cnf_matrix_log = metrics.confusion_matrix(y_test, log_reg_preds_no_opp, labels=[0, 1, 2])
cnf_matrix_tree = metrics.confusion_matrix(y_test, tree_preds_no_opp, labels=[0, 1, 2])

class_names = ["winning", "losing", "maintaining"]

fig, ax = plt.subplots(1, 2, figsize=(12, 5))

for a, cm, title in zip(ax, [cnf_matrix_log, cnf_matrix_tree], ['Logistic Regression', 'Boosted Decision Tree']):
    sns.heatmap(pd.DataFrame(cm, columns=class_names, index=class_names), cmap="YlGnBu", annot=True, ax=a, fmt='g')
    # label each axis explicitly; plt.xlabel/plt.ylabel only affect the last active axis
    a.set_ylabel('Actual label')
    a.set_xlabel('Predicted label')
    a.set_title(title)


The results are very similar with and without the opponent features.

If we wanted to pursue this further, we would need to explore:
- The definitions of winning, losing, and maintaining
- Fine-tuning the models
- Picking models better suited to the problem
- Creating a larger sample size
- Better feature engineering