## Introduction to tomes

Let's say you want to cluster smoke grenades on dust2. One match may have about 100 smokes, which isn't enough for clustering. To get a large enough dataset, you need hundreds of thousands or millions of smokes, which means looping through thousands of matches and reading thousands of files. No matter the file size, reading that many files is time-consuming and cumbersome.

The solution is to combine the data from many matches into a "**tome**". A tome contains "**pages**", each holding concatenated dataframes, which reduces the number of files to read. You control the maximum size a page can grow to. Making a tome takes three basic steps:

1. Determine which matches to include.
2. Loop through these matches and apply a transformation to a single dataframe.
3. Combine all of these dataframes into a tome.
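The three steps above can be sketched in plain pandas. This is only an illustration of the idea, not how the curator is implemented: the `matches` dictionary, `transform`, `build_pages`, and the row-based `max_rows_per_page` limit are all hypothetical stand-ins, and the real page size limit may be measured differently.

```python
import pandas as pd

# Hypothetical in-memory stand-in for per-match data.
# In practice each dataframe would be read from a csds file.
matches = {
    "match-a": pd.DataFrame({"x": [1.0, 2.0], "y": [3.0, 4.0]}),
    "match-b": pd.DataFrame({"x": [5.0], "y": [6.0]}),
    "match-c": pd.DataFrame({"x": [7.0, 8.0, 9.0], "y": [1.0, 2.0, 3.0]}),
}


def transform(match_id, df):
    """Step 2: reduce one match to a single dataframe tagged with its match id."""
    out = df.copy()
    out["match_id"] = match_id
    return out


def build_pages(frames, max_rows_per_page):
    """Step 3: concatenate dataframes into pages, starting a new page
    whenever adding the next dataframe would exceed the row limit."""
    pages, current, current_rows = [], [], 0
    for frame in frames:
        if current and current_rows + len(frame) > max_rows_per_page:
            pages.append(pd.concat(current, ignore_index=True))
            current, current_rows = [], 0
        current.append(frame)
        current_rows += len(frame)
    if current:
        pages.append(pd.concat(current, ignore_index=True))
    return pages


# Step 1: decide which matches to include.
selected = ["match-a", "match-b", "match-c"]
frames = [transform(match_id, matches[match_id]) for match_id in selected]
pages = build_pages(frames, max_rows_per_page=3)
print(len(pages), "pages with", [len(page) for page in pages], "rows")
```

Reading a handful of pages is much cheaper than reopening every match file, which is the point of building a tome.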

The tome "**curator**" manages steps 1 and 3, while step 2 happens in a loop you define. The first step is to decide which csds files to include, and you do that by pointing at a *special kind of tome*: a **header** or **subheader** tome. This notebook shows you how to make those special tomes. We discuss tomes in more detail in later steps of the tutorial.

## Make header tome

The header tome contains the header channel data and the path for every csds file it can find. Making it requires a special function on the tome curator called `create_header_tome`. It uses glob to find all your csds files, reads in the header channel of each one, then stitches them all together.

The end result is a tome that contains a dataframe where each row corresponds to one match's header channel data and the path to the csds file (from glob). Since the path is included, we never need to use glob to find the files again. 
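To make the find, read, and stitch idea concrete, here is a minimal sketch that uses small JSON files as stand-ins for csds files. The file layout, the `read_header` helper, and `build_header_frame` are assumptions made for illustration only; `create_header_tome` handles the real csds format for you.

```python
import glob
import json
import os
import tempfile

import pandas as pd


def read_header(path):
    """Hypothetical reader: each stand-in file is JSON with a "header" object.
    Real csds files are not read this way."""
    with open(path) as f:
        return json.load(f)["header"]


def build_header_frame(root):
    """Find every stand-in file under root and build one row per match."""
    pattern = os.path.join(root, "**", "*.json")
    paths = sorted(glob.glob(pattern, recursive=True))
    # Keep the path next to the header data so the files never need to be
    # searched for again.
    rows = [{**read_header(path), "path": path} for path in paths]
    return pd.DataFrame(rows)


with tempfile.TemporaryDirectory() as root:
    # Write two fake matches to a temporary directory.
    for name, map_name in [("m1", "de_dust2"), ("m2", "de_mirage")]:
        with open(os.path.join(root, f"{name}.json"), "w") as f:
            json.dump({"header": {"match_id": name, "map_name": map_name}}, f)
    header_df = build_header_frame(root)

print(header_df[["match_id", "map_name"]])
```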

If you want to include all matches in a tome, point at the header tome to determine which matches to include (step 1 from the previous cell). Pointing at the header tome is also the default option.

(Don't forget the subheader section below.)

_**Run this notebook as-is.**_

In [None]:
from pureskillgg_makenew_pyskill.notebook import setup_notebook

In [None]:
setup_notebook()

In [None]:
import os
from pureskillgg_dsdk.tome import create_tome_curator 

In [None]:
# The curator is our interface to the tomes
curator = create_tome_curator()

In [None]:
header_loader = curator.get_header_loader()

In [None]:
if not header_loader.exists:
    header_loader = curator.create_header_tome()

In [None]:
df = header_loader.get_dataframe()
keys = header_loader.get_keyset()
if df is None:
    raise RuntimeError('Something went wrong when making the header.')
print(f'There are {len(df)} matches in the header.')

## Make subheaders

You can also make subheader tomes that don't include some of the header tome rows (remember, each row = one match). You might want to analyze players on a specific map, rank, or platform. You can create "subheaders" that are a filtered view of the main header. Then, when making a tome, you can point at a subheader to run your transformation and combination only on relevant matches.

Subheader tomes are useful when exploring certain maps or skill ranges, but the filtering is limited to information you can find in a header channel. Subheader tomes do not use glob to find files. Instead, they read in the header tome and apply a filter that goes through pandas' `loc` indexer.
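As a rough illustration of what such a filter does, pandas' `loc` accepts a callable that receives the dataframe and returns a boolean mask. The `header_df` below is a made-up stand-in for a header dataframe, and whether the curator calls `loc` in exactly this way is an assumption.

```python
import pandas as pd

# Made-up stand-in for a header dataframe (one row per match).
header_df = pd.DataFrame(
    {
        "match_id": ["m1", "m2", "m3"],
        "map_name": ["de_dust2", "de_mirage", "de_dust2"],
    }
)


def map_name_selector(map_name):
    """Return a callable that maps a dataframe to a boolean row mask."""
    return lambda df: df["map_name"] == map_name


# loc calls the selector with header_df and keeps rows where the mask is True.
subheader_df = header_df.loc[map_name_selector("de_dust2")]
print(subheader_df)
```

Because the selector only sees the header dataframe, it can only filter on columns that exist in the header channel.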

An example of making a subheader tome is below this cell. `create_subheader_tome` creates the subheader tome by applying the specified filter to the header tome.

Remember that the convention for tome names is: `tome_name.start-date,end-date.comment`
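If it helps to see the convention spelled out, the sketch below splits a name into its parts. `parse_tome_name` is a hypothetical helper, not part of the library, and it assumes ISO-formatted dates and an optional comment.

```python
from datetime import date


def parse_tome_name(full_name):
    """Hypothetical helper: split 'tome_name.start-date,end-date.comment'.
    Assumes dates are ISO formatted (YYYY-MM-DD) and the comment is optional."""
    parts = full_name.split(".", 2)
    if len(parts) < 2:
        raise ValueError(f"Expected 'name.start,end[.comment]', got {full_name!r}")
    name, date_range = parts[0], parts[1]
    comment = parts[2] if len(parts) == 3 else None
    start_text, end_text = date_range.split(",")
    return {
        "name": name,
        "start": date.fromisoformat(start_text),
        "end": date.fromisoformat(end_text),
        "comment": comment,
    }


parsed = parse_tome_name("subheader_dust2.2022-05-15,2022-05-15")
print(parsed["name"], parsed["start"], parsed["end"], parsed["comment"])
```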

In [None]:
# Build a selector that returns a boolean mask of header rows on the given map
def map_name_selector(map_name):
    return lambda df: df['map_name']==map_name

subheader_loader = curator.create_subheader_tome('subheader_dust2.2022-05-15,2022-05-15', map_name_selector('de_dust2'))

In [None]:
df = subheader_loader.get_dataframe()

In [None]:
df.head()

Advance to the [next notebook](4%20-%20Do%20datascience%20exploration.ipynb).