# Evaluating Bias in Cards Dealt to Players within a Session by the International Skat Server (ISS)
## Overview
Personal observations by players suggested that cards dealt by the International Skat Server (ISS) were not fully random: certain sessions seemed to have sequences of ‘good’ cards, while other sessions had sequences of ‘bad’ cards.

The resulting question is whether high- or low-quality cards occur more often within a particular session than can be explained by chance alone.

What are high- or low-quality cards? The quality of a hand in Skat is complex, as it depends on the combination of cards, the type of game that a player could choose with them, the value of that game in bidding versus the hands of other players, and the skill of the player in recognising all these factors.

## Libraries Used


In [None]:
import numpy
import matplotlib.pyplot as pyplot
import scipy.stats as stats
import statsmodels.stats.multitest as multitest
import statsmodels.api as api
import datetime
import random

import issgame


## Skat Hand Quality
An established approximation of the quality of a Skat hand can be obtained from the Stegen Model, used in particular to give Skat professionals in training a reference for recognising which games are worth bidding for (https://www.skatfuchs.eu/SB-Kapitel3.pdf):

For a normal suit game (pick highest scoring suit):
+ 1 point per trump
+ 1 point per jack
+ 1 point per trump A, 10
+ 1 point per other A, 10
+ 0.5 points if JC JS
+ 1.5 points if JC JS JH
+ 0.5 points if JS JH JC
+ 2 points if JC JS JH JD
+ 0.5 points per missing suit
+ (0.5 if opponent does not bid)

A score of 10 or higher is considered good enough to bid.

For a grand game:
+ 1 point per jack
+ 1 point per A, 10

For grands, a score of 6 is considered good enough to bid. To make this comparable to the suit score, we scale the grand score by 5/3, so that a grand score of 6 maps to the suit threshold of 10.

Based on this, we evaluate the quality of a hand as its total score for either a suit or a grand game, whichever is higher.

Stegen does not account for null games, so this analysis does not take into account the quality of cards for a null game.
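
The scoring rules above can be sketched in a few lines of Python. This is an illustrative approximation, not the code used by `issgame`: the card codes (suit letter followed by rank, e.g. `HT` for the ten of hearts), the function names, and the choice to count only side suits as "missing" are assumptions made here. The jack-sequence bonuses and the bidding bonus are deliberately left out.

```python
SUITS = "CSHD"  # clubs, spades, hearts, diamonds (assumed card coding)

def suit_score(hand, trump):
    """Simplified Stegen score for a suit game with the given trump suit."""
    jacks = [c for c in hand if c[1] == "J"]
    trump_cards = [c for c in hand if c[0] == trump and c[1] != "J"]
    aces_tens = [c for c in hand if c[1] in "AT"]
    # a side suit is missing if the hand holds no non-jack card of it
    missing = sum(
        1 for s in SUITS
        if s != trump and not any(c[0] == s and c[1] != "J" for c in hand)
    )
    trumps = len(jacks) + len(trump_cards)  # jacks are trumps too
    # 1 per trump, 1 per jack, 1 per ace/ten (trump or not), 0.5 per missing suit
    return trumps + len(jacks) + len(aces_tens) + 0.5 * missing

def grand_score(hand):
    """Stegen score for a grand: 1 point per jack, ace, or ten."""
    return sum(1 for c in hand if c[1] in "JAT")

def hand_score(hand):
    """Best suit score or the grand score scaled by 5/3, whichever is higher."""
    best_suit = max(suit_score(hand, s) for s in SUITS)
    return max(best_suit, grand_score(hand) * 5 / 3)

example_hand = ["CJ", "SJ", "CA", "CT", "CK", "C9", "HA", "HT", "D7", "D8"]
print(hand_score(example_hand))
```

Taking the maximum over all four suits mirrors the "pick highest scoring suit" rule, and the 5/3 factor puts the grand score on the same scale as the suit score.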

## Dataset
Two datasets, which include over 7 million Skat games played, are available from the ISS server:
+ https://skatgame.net/iss/iss-games-04-2021.sgf.bz2
+ https://skatgame.net/iss/iss2-games-04-2021.sgf.bz2

In [None]:
with open('data/iss-games-04-2021.sgf') as games_file:
    print(games_file.readline())

These datasets contain data points relevant to the analysis:
 + Players involved in a game (*P0\[Montana\]P1\[vaun\]P2\[Ben\]*)
 + Date and time of the game (*DT\[2007-10-29/04:44:01/UTC\]*)
 + Cards dealt to the Skat and players (*HT.ST.DK.HK.CT.* etc.)

The datasets also contain details of the result of bidding and the resulting game, which we do not use for this analysis.
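
As a rough illustration of how such `TAG[value]` fields could be pulled out of a record, the sketch below parses a sample string assembled from the fragments quoted above. The sample string and the regular expression are assumptions; a real record contains more fields (including the dealt cards, which are not parsed here), and `issgame.extract_data` may work differently.

```python
import re
from datetime import datetime

# Hypothetical fragment built from the examples above, not a full record
sample = "P0[Montana]P1[vaun]P2[Ben]DT[2007-10-29/04:44:01/UTC]"

# Collect every TAG[value] pair into a dictionary
fields = dict(re.findall(r"([A-Z][A-Z0-9]*)\[([^\]]*)\]", sample))

# Players seated at positions 0, 1 and 2
players = [fields[f"P{i}"] for i in range(3)]

# Date and time of the game (the timestamp is given in UTC)
played_at = datetime.strptime(fields["DT"], "%Y-%m-%d/%H:%M:%S/UTC")

print(players, played_at)
```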

## Extraction and conversion
The datasets use a unique format and contain more information than required, so we extract only the relevant fields from both sets and write them to a clean dataset.

In [None]:
input_filenames = ['data/iss-games-04-2021.sgf', 'data/iss2-games-04-2021.sgf']
output_filename = 'data/iss_all_games.csv'

issgame.extract_data(input_filenames, output_filename)

This reduces the data to only the required fields by calculating a score for each hand of each player:

In [None]:
with open(output_filename) as games_file:
    line = games_file.readline()
    print(line)
    id_tag, session, player, position, hand_score = line.split(',')
    print('ID:', id_tag, '\nSession:', session, '\nPlayer:', player, '\nPosition:', position, '\nHand Score:', hand_score)

Since we are particularly interested in the sessions played by player PeterB, we further summarise those sessions into a single file:

In [None]:
input_filename = output_filename
issgame.extract_sessions(input_filename, 'PeterB')

with open(input_filename[:-4] + '_PeterB.csv') as games_file:
    for i in range(4):
        line = games_file.readline()
        print(line)

    # the first two fields are the player and session; the rest are hand scores
    player, session = line.split(',')[:2]
    hand_scores = line.rstrip('\n').split(',')[2:]
    print('\nPlayer:', player, '\nSession:', session, '\nHand scores:', hand_scores)


## PeterB Session Means Compared
If the hands in a session were particularly good or poor, this would be reflected in a high or low mean hand score for that session. To assess whether the hands dealt to PeterB in certain sessions were unusually good or poor, we compare the mean hand score of each session to the mean hand score of all hands played from position 1 on the ISS server.

We first load all scores for hands played by PeterB, excluding sessions with fewer than 10 hands, since a meaningful comparison cannot be made with so few hands.


In [None]:
player_filename = 'data/iss_all_games_PeterB.csv'
player_hands = issgame.load_sessions(player_filename, 'PeterB')

player_sessions = []
player_means = []
session_n = []
session_count = 0

for session in player_hands['PeterB']:
    # only consider sessions with 10 or more games
    if len(player_hands['PeterB'][session]) >= 10:

        player_means.append(numpy.mean(player_hands['PeterB'][session]))
        player_sessions.append(session_count)
        session_count += 1
        session_n.append(len(player_hands['PeterB'][session]))

We then load all scores for hands played by the player in position 1.

In [None]:
all_hands_filename = 'data/iss_all_games.csv'
all_hands = issgame.load_hands(all_hands_filename, 1)

Our null hypothesis, which we are attempting to reject, is:
+ H0: The mean score of the session is the same as the mean score of the overall sample.

If H0 can be rejected, we accept the alternative hypothesis:
+ HA: The mean score of the session is different from the mean score of the overall sample.

The two populations whose mean hand score we are comparing are:
+ All hands that could be dealt to all players on the ISS server
+ The hands that could be dealt to PeterB in a particular session

Since we do not know the standard deviation of the population, and the session sample sizes are small (so we cannot rely on the central limit theorem to assume normality), we cannot use a z-test.

Since we cannot assume that the variances of the sessions and the overall sample are the same, we cannot use Student's t-test, so we use an independent two-sample Welch's t-test instead. Since we are testing for equality, we use a two-tailed test (the session mean could be lower or higher than the overall sample mean). The following additional assumptions must be met for this test:

1. The data should be sampled independently.
   a) Technically, the scores of all hands include a small number of scores from a particular session's hands. However, since the sample of all hand scores contains several million values, this overlap should have a negligible effect.

2. The means of the two populations being compared should follow normal distributions.
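
For reference, the statistic behind Welch's t-test can be written as follows. The subscripts $s$ (one PeterB session) and $a$ (all hands) are notation introduced here for illustration; $\bar{x}$, $s^2$, and $n$ denote the sample mean, sample variance, and sample size of each group.

$$
t = \frac{\bar{x}_s - \bar{x}_a}{\sqrt{\dfrac{s_s^2}{n_s} + \dfrac{s_a^2}{n_a}}},
\qquad
\nu \approx \frac{\left(\dfrac{s_s^2}{n_s} + \dfrac{s_a^2}{n_a}\right)^2}{\dfrac{\left(s_s^2 / n_s\right)^2}{n_s - 1} + \dfrac{\left(s_a^2 / n_a\right)^2}{n_a - 1}}
$$

The two variances enter separately rather than as a pooled estimate, which is why the test does not require equal variances. Because $n_a$ is in the millions, the degrees of freedom $\nu$ are driven almost entirely by the session size $n_s$.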

A Q-Q plot of a subsample of all hand scores shows that the scores are not fully normally distributed:

In [None]:
# draw 100,000 hand scores at random (with replacement)
all_hands_sample = numpy.array(random.choices(all_hands, k=100000))

api.qqplot(all_hands_sample, line='s')
pyplot.show()

Applying a square root transformation to the data corrects for this:

In [None]:
all_hands_sample_sqrt = numpy.sqrt(all_hands_sample)

api.qqplot(all_hands_sample_sqrt, line='s')
pyplot.show()

In order to fulfill the requirement of normality, we transform (by taking the square root) both the dataset of all hands and the dataset of PeterB's sessions.
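
To illustrate why a square-root transform can help, the sketch below applies it to synthetic, right-skewed data and compares the skewness before and after. The gamma-distributed values are a stand-in chosen for illustration and are not the ISS hand scores; the `skewness` helper is likewise defined here only for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic, right-skewed stand-in for hand scores (NOT the ISS data)
scores = rng.gamma(shape=4.0, scale=2.0, size=50_000)

def skewness(x):
    """Third standardised moment; 0 for a symmetric distribution."""
    x = np.asarray(x, dtype=float)
    centred = x - x.mean()
    return float((centred ** 3).mean() / x.std() ** 3)

skew_raw = skewness(scores)
skew_sqrt = skewness(np.sqrt(scores))
print(f"skewness raw: {skew_raw:.3f}, after sqrt: {skew_sqrt:.3f}")
```

A skewness closer to zero after the transform indicates a more symmetric distribution, which is what the second Q-Q plot checks visually.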

In [None]:
all_hands = numpy.sqrt(all_hands)
for session in player_hands['PeterB']:
    player_hands['PeterB'][session] = numpy.sqrt(player_hands['PeterB'][session])

A repeat of the subsample Q-Q plot for scores of all hands now shows a normal distribution:

In [None]:
# draw 100,000 transformed hand scores at random (with replacement)
all_hands_sample = numpy.array(random.choices(all_hands, k=100000))

api.qqplot(all_hands_sample, line='s')
pyplot.show()

For the sessions, the sample sizes are much smaller, so we use a formal test rather than a Q-Q plot. The session sizes are:


In [None]:
print(session_n)


A Shapiro-Wilk test can be used to test whether the data was drawn from a normal distribution. We perform this test at the 5% significance level. Since we are testing the same hypothesis (that the overall population of hand scores dealt to PeterB is normal) multiple times, we use the Holm-Bonferroni method to ensure a family-wise error rate of less than 5%.
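
The Holm-Bonferroni step-down procedure itself is short enough to sketch in plain Python. The function name `holm_bonferroni` and the example p-values are made up for illustration; the analysis relies on `multitest.multipletests` rather than on this sketch.

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Return (reject, adjusted_p) lists in the original order of p_values."""
    m = len(p_values)
    # indices of the p-values, from smallest to largest
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # the k-th smallest p-value (rank = k - 1) is scaled by m - k + 1
        scaled = min(1.0, (m - rank) * p_values[idx])
        # adjusted p-values must not decrease along the sorted order
        running_max = max(running_max, scaled)
        adjusted[idx] = running_max
    reject = [p <= alpha for p in adjusted]
    return reject, adjusted

# Made-up p-values, for illustration only
reject, adjusted = holm_bonferroni([0.01, 0.04, 0.03, 0.005])
print(reject, adjusted)
```

The smallest p-value is held to the strictest threshold (alpha divided by the number of tests), and each subsequent one to a slightly looser threshold, which keeps the family-wise error rate at or below alpha while being less conservative than a plain Bonferroni correction.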

In [None]:
shwi_ps = []
for session in player_hands['PeterB']:
    # only consider sessions with 10 or more games
    if len(player_hands['PeterB'][session]) >= 10:
        #store p value of Shapiro-Wilk test for each session
        shwi_ps.append(stats.shapiro(numpy.array(player_hands['PeterB'][session]))[1])

# sort the p values
shwi_ps.sort()

shwi_multi = multitest.multipletests(shwi_ps, alpha=0.05, method='holm', is_sorted=True)
print("Result", "P", "Corr. P", "Threshold")
for i in range(len(shwi_multi[0])):
    print(shwi_multi[0][i], format(round(shwi_ps[i], 3), '.3f'), format(round(shwi_multi[1][i],3),'.3f'), 0.050)


We fail to reject the null hypothesis that the population from which the samples were drawn is normally distributed.

Since both the individual session samples and the overall hand score sample are consistent with a normal distribution, we can proceed with Welch's t-test. Once again, since we are performing multiple tests of the same hypothesis, we apply a Holm-Bonferroni correction.

Since the impact of a type I error is high (unfairly accusing ISS of unfair shuffling), but the sample size within individual sessions is small, we select a 5% significance level.


In [None]:
ttest = []

for session in player_hands['PeterB']:
    # only consider sessions with 10 or more games
    if len(player_hands['PeterB'][session]) >= 10:
      
        # store p-value of Welch's t-test of difference in mean score between
        # this session and all hands in the datasets
        ttest.append(stats.ttest_ind(player_hands['PeterB'][session], all_hands, equal_var = False)[1])

# sort the p values
ttest.sort()

ttest_multi = multitest.multipletests(ttest, alpha=0.05, method='holm', is_sorted=True)
print("Result", "P", "Corr. P", "Threshold")
for i in range(len(ttest_multi[0])):
    print(ttest_multi[0][i], format(round(ttest[i], 3), '.3f'), format(round(ttest_multi[1][i],3),'.3f'), 0.050)

## Conclusion
We fail to reject the null hypothesis and conclude that we have found no evidence that the mean scores of hands dealt to PeterB in individual sessions differ from the mean score of all hands dealt by the ISS. Thus there is no evidence that the quality of cards within a particular session differs from the others by more than could be expected due to random chance.
