# Introduction

This notebook shows how to select part of the gametime dataset and how to visualize it
with [seaborn](https://seaborn.pydata.org/). All cells can be executed directly
in the notebook. The order of execution matters: variables declared in a cell are only
available to other cells after that cell has been executed.

## Concepts and background

The dataset is stored in a CSV file and is both read and written by
[pandas](https://pandas.pydata.org/). The plotting is handled through
[seaborn](https://seaborn.pydata.org/). The introductions to both libraries are a good
starting point:

- [pandas intro](https://pandas.pydata.org/docs/getting_started/index.html)
- [seaborn intro](https://seaborn.pydata.org/tutorial/introduction.html)

## Imports and reading the dataset

The cell below will import all the functions and constants we need in this notebook.

In [None]:
import pandas as pd
import seaborn as sns

from wp.gametime import DF_DTYPES
from wp.gametime.selection import prepare_dataframe, select_datetimes, select_steam_ids
from wp.gametime.viz import make_plot_prettier

%matplotlib inline

Next, we need to define which file we are going to read. For the purposes of this
introduction, we'll use a sample file shipped with this package. If you want to use a
different file, enter the full path to your `gametime.csv` in the variable `fname` and
skip the cell that loads the sample file.

In [None]:
# ignore this cell if you want to run the introduction/demo, else provide the file path
# to your dataset
fname: str = ""

In [None]:
from importlib.resources import files

fname = files("wp.gametime.tests") / "data" / "gametime.csv"

Now that we know which file we are going to read, we can open it with `pd.read_csv`.
We include some non-default arguments to:
- force the datatype of the different columns (`str` for `steam_id` and `game_id`, ...)
- parse the datetime from the `acq_time` column

In [None]:
df = pd.read_csv(fname, index_col=0, dtype=DF_DTYPES, parse_dates=["acq_time"])
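
To see why forcing the datatypes matters, here is a minimal sketch on a made-up
two-row CSV. The values, the `game_time` numbers and the `dtypes` dictionary are
assumptions standing in for the real file and for `DF_DTYPES`, whose actual content is
not shown in this notebook.

```python
import io

import pandas as pd

# Hypothetical CSV content mimicking the column names used in this notebook.
csv_text = (
    ",steam_id,game_id,acq_time,game_time\n"
    "0,76561198329580271,0440,2024-04-12 12:41:00+00:00,120\n"
    "1,76561198329580279,0570,2024-04-12 13:41:00+00:00,45\n"
)

# Assumed stand-in for DF_DTYPES: IDs are kept as strings.
dtypes = {"steam_id": str, "game_id": str, "game_time": int}

toy = pd.read_csv(
    io.StringIO(csv_text), index_col=0, dtype=dtypes, parse_dates=["acq_time"]
)
print(toy.dtypes)
```

Without the `dtype` argument, pandas would read `game_id` as an integer and the leading
zero of `0440` would be lost.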

We can render the dataframe inline in this notebook:

In [None]:
df

To explore the dataset, let's have a look at the different acquisition times and the
different steam IDs present.

In [None]:
df["steam_id"].unique()

In [None]:
df["acq_time"].unique()

## Select part of the dataset

The functions `prepare_dataframe`, `select_datetimes`, and `select_steam_ids` are used
to prepare the dataset before plotting and to select data spans.

- `prepare_dataframe` will map the steam IDs to usernames/tokens and map the game IDs to
  game names. It will make the plots prettier!
- `select_steam_ids` will select a limited list of steam IDs. Note that if a steam ID
  was mapped to a username, the username must be used to select this user.
- `select_datetimes` will select a time span and will resample the dataset.

### Prepare the dataset

In [None]:
mapping = {"76561198329580271": "necromancia"}  # let's map this steam ID to a username
df = prepare_dataframe(df, mapping)

In [None]:
df

### Select steam IDs

Let's select only 2 steam IDs, `76561198329580279` and the recently mapped 
`necromancia`.

In [None]:
df = select_steam_ids(df, ["necromancia", "76561198329580279"])
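
The reason the username has to be used after the mapping can be illustrated on a toy
dataframe with plain pandas. This is only a sketch: it assumes that the mapping replaces
the ID in the `steam_id` column and leaves unmapped IDs untouched, and it does not
reproduce how `prepare_dataframe` or `select_steam_ids` are implemented. The `game_time`
values are made up.

```python
import pandas as pd

# Made-up frame; only the two steam IDs come from this notebook.
toy = pd.DataFrame(
    {
        "steam_id": ["76561198329580271", "76561198329580279"],
        "game_time": [120, 45],
    }
)
mapping = {"76561198329580271": "necromancia"}

# Assumed behaviour: mapped IDs are replaced, unmapped IDs are left as-is.
toy["steam_id"] = toy["steam_id"].replace(mapping)

# The raw ID no longer exists in the column, the username does.
by_raw_id = toy[toy["steam_id"].isin(["76561198329580271"])]
by_username = toy[toy["steam_id"].isin(["necromancia"])]
print(len(by_raw_id), len(by_username))
```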

In [None]:
df

### Select time span and resample

`select_datetimes` takes 3 arguments in addition to the dataframe: `start`, `end` and
`freq`. At least one of them must be provided.

*Note: if you want help with any function, just enter `function_name?` in a cell. You 
can click on `View as a scrollable element` if the output is truncated.*

In [None]:
select_datetimes?

This sample dataset has 8 time points, spaced 1 hour apart, on the 12th of April
2024.

In [None]:
df["acq_time"].unique()

We can select the dates from 12h to 17h with a spacing (frequency) of 3 hours:

In [None]:
df = df.copy(deep=True)  # let's make a copy to try different selection
df_sel = select_datetimes(
    df, start="2024-04-12 12:00", end="2024-04-12 17:00", freq="3h"
)

In [None]:
df_sel

That's odd: our 2 selected time points are 2 hours apart instead of the 3 hours we
requested. Let's dissect what is going on:

- We select all dates between `12h00` and `17h00`.
- We create an index between `start` and `end` with a resolution of `3h`, and select
  the `acq_time` closest to the index values.

This second step is anchored on `12h00` and `17h00`, not on the first acquisition time
per `steam_id`. The closest remaining `acq_time` to `12h00` is `12h41` and the closest
remaining `acq_time` to `15h00` is `14h41` (and not `15h41`).

In [None]:
pd.date_range(start="2024-04-12 12:00", end="2024-04-12 17:00", freq="3h", tz="utc")
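
The nearest-match behaviour can be reproduced with plain pandas. This is a sketch under
the assumption that, for each value of the resampling index, the closest remaining
`acq_time` is kept; it is not the implementation of `select_datetimes`. The acquisition
times are rebuilt by hand from the description of the sample dataset.

```python
import pandas as pd

# The 5 acquisition times remaining between 12h00 and 17h00.
acq = pd.date_range("2024-04-12 12:41", periods=5, freq="h", tz="utc")

# Resampling index anchored on `start` and `end`: 12h00 and 15h00.
grid = pd.date_range("2024-04-12 12:00", "2024-04-12 17:00", freq="3h", tz="utc")

# Position of the closest acquisition time for each value of the index.
positions = acq.get_indexer(grid, method="nearest")
selected = acq[positions]
print(selected.strftime("%Hh%M").tolist())  # ['12h41', '14h41']
```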

Instead of performing both the time-span selection and the resampling in one operation,
we can perform them in 2 operations. By doing so, the resampling bases its `start` and
`end` arguments on the absolute `min()` and `max()` acquisition times.

In [None]:
df = select_datetimes(df, start="2024-04-12 12:00", end="2024-04-12 17:00", freq=None)
df = select_datetimes(df, start=None, end=None, freq="3h")

In [None]:
df

This time, we do get a selection of `12h41` and `15h41`.
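
The effect of anchoring the resampling on the `min()` and `max()` acquisition times can
be sketched the same way. As before, this assumes a nearest-match selection and
hand-built acquisition times; it is an illustration, not the code of
`select_datetimes`.

```python
import pandas as pd

# The 5 acquisition times left after the time-span selection.
acq = pd.date_range("2024-04-12 12:41", periods=5, freq="h", tz="utc")

# The resampling index now starts on the first acquisition time itself.
grid = pd.date_range(acq.min(), acq.max(), freq="3h")

# Each index value coincides with an existing acquisition time.
selected = acq[acq.get_indexer(grid, method="nearest")]
print(selected.strftime("%Hh%M").tolist())  # ['12h41', '15h41']
```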

## Plot the dataframe

Enter the beauty of [seaborn](https://seaborn.pydata.org/), a very high-level plotting
library in Python. Tell it the type of plot, the X-axis, the Y-axis and the category
grouping to apply, and it will handle the rest.

The function `make_plot_prettier` is used to improve the rendering of labels and ticks
on the created axes.

To start, let's reload the entire dataset to discard all the selections we made before.

In [None]:
df = pd.read_csv(fname, index_col=0, dtype=DF_DTYPES, parse_dates=["acq_time"])
df = prepare_dataframe(df, dict())  # to map the game names

In [None]:
df

Let's start with a bar plot to compare the participants' `game_time` as a function of
the date, per game.

In [None]:
grid = sns.catplot(
    df, kind="bar", x="acq_time", y="game_time", col="game_id", errorbar=None
)
make_plot_prettier(grid)

Or maybe you also want to split by steam ID.

In [None]:
grid = sns.catplot(
    df,
    kind="bar",
    x="acq_time",
    y="game_time",
    col="game_id",
    hue="steam_id",
    errorbar=None,
)
make_plot_prettier(grid)

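Or maybe line plots of the `game_time` per game as a function of the acquisition time.
With several participants per time point, seaborn draws their mean and a confidence
band by default.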

In [None]:
ax = sns.lineplot(df, x="acq_time", y="game_time", hue="game_id")
make_plot_prettier(ax)

Or a line plot per participant, which conveys information similar to the categorical
plot above.

In [None]:
ax = sns.lineplot(df, x="acq_time", y="game_time", hue="steam_id")
make_plot_prettier(ax)

Or a line plot separating both participants and games:

In [None]:
ax = sns.lineplot(df, x="acq_time", y="game_time", hue="steam_id", style="game_id")
make_plot_prettier(ax)

Or splitting those between 2 plots to separate per `game_id`:

In [None]:
grid = sns.relplot(
    df, x="acq_time", y="game_time", hue="steam_id", col="game_id", kind="line"
)
make_plot_prettier(grid)

Or splitting per `steam_id`:

In [None]:
grid = sns.relplot(
    df,
    x="acq_time",
    y="game_time",
    hue="game_id",
    col="steam_id",
    kind="line",
    col_wrap=5,
)
make_plot_prettier(grid)

Or as a heatmap to plot the `game_time_diff` per participant as a function of time.

In [None]:
pivot_df = df.pivot_table(index="steam_id", columns="acq_time", values="game_time_diff")
ax = sns.heatmap(pivot_df)
make_plot_prettier(ax)
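
To get a feel for what the heatmap receives, here is a small sketch of the pivot step on
a made-up dataframe. The participants `a` and `b`, the games `g1` and `g2` and all
`game_time_diff` values are hypothetical; only the column names come from this notebook.

```python
import pandas as pd

# Made-up long-format data: 2 participants, 2 games, 2 acquisition times.
toy = pd.DataFrame(
    {
        "steam_id": ["a", "a", "a", "b", "b", "b"],
        "game_id": ["g1", "g2", "g1", "g1", "g2", "g1"],
        "acq_time": pd.to_datetime(
            ["2024-04-12 12:41", "2024-04-12 12:41", "2024-04-12 13:41"] * 2,
            utc=True,
        ),
        "game_time_diff": [10, 30, 5, 0, 20, 15],
    }
)

# One row per participant, one column per acquisition time.
pivot = toy.pivot_table(index="steam_id", columns="acq_time", values="game_time_diff")
print(pivot)
```

When a participant has several games at the same acquisition time, `pivot_table`
averages them, since its default `aggfunc` is the mean. Here, participant `a` gets
`20.0` at the first time point, the mean of `10` and `30`.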