# Train-Test Split by Small Sessions

The audio files in the ReCANVo data set were collected by clipping vocalizations from longer recordings, taken in *sessions*. Part of our goal for this project is to understand if and how our classifiers can learn to distinguish sessions and use that information to "cheat" during label classification. To that end, there are two important things to note:
1. In order to validate our models on unseen data, we need our test set to contain vocalizations from sessions that are not represented in training data;
1. The number of vocalizations in each session varies considerably, to the point that some sessions are too small to serve reliably as groups during cross-validation.

With this in mind, we decided that every session with fewer than 15 vocalizations left in the training set after our initial train-test split should be moved entirely to the test set. This ensures that every session remaining in the training data is reasonably well populated, and that the test set contains plenty of data from sessions not seen during training.
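
To make the rule concrete, here is a minimal sketch on toy data. The helper name `move_small_sessions` and its `min_train_count` parameter are hypothetical and introduced only for illustration; the column names `Session` and `is_test` match the ones used in the cells below. The toy call uses a threshold of 2 rather than 15 to keep the example small.

```python
import pandas as pd


def move_small_sessions(df: pd.DataFrame, min_train_count: int = 15) -> pd.DataFrame:
    """Return a copy of df in which sparsely populated training sessions are moved to test."""
    # Count only the rows currently assigned to training (is_test == 0).
    train_counts = df.loc[df["is_test"] == 0, "Session"].value_counts()
    small_sessions = train_counts[train_counts < min_train_count].index
    out = df.copy()
    # Every row of a small session is reassigned to the test set.
    out.loc[out["Session"].isin(small_sessions), "is_test"] = 1
    return out


# Toy data: session "A" has 3 training rows, session "B" has only 1.
toy = pd.DataFrame({
    "Session": ["A", "A", "A", "A", "B", "B"],
    "is_test": [0, 0, 0, 1, 0, 1],
})
result = move_small_sessions(toy, min_train_count=2)
print(result)
```

In this toy case, session "B" falls below the threshold, so its single training row is moved to the test set, while session "A" is left unchanged.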

In [1]:
import pandas as pd
from pathlib import Path

In [2]:
old_csv_loc = Path('directory_w_train_test.csv')
new_csv_loc = Path('tt_small_sessions.csv')

In [3]:
old_df = pd.read_csv(old_csv_loc)

In [4]:
# The session of each recording is encoded in its file name.
def get_session(filename: str) -> str:
    """Return the text before the first '-', with its last three characters removed."""
    return filename.split('-')[0][:-3]
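
As a quick illustration of what the helper above does, the sketch below applies it to a made-up file name. The file name is hypothetical and chosen only to show the string operations; it is not taken from the data set.

```python
def get_session(filename: str) -> str:
    # Same logic as the cell above: keep the text before the first '-',
    # then drop the last three characters of that prefix.
    return filename.split('-')[0][:-3]


# Hypothetical file name, made up for illustration only.
example_name = "200126_2142_00-example.wav"
session_id = get_session(example_name)
print(session_id)  # 200126_2142
```

Note that a file name containing no `-` is not rejected: `split('-')` returns the whole name, and only its last three characters are dropped.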

In [5]:
old_df['Session'] = old_df.Filename.apply(get_session)

In [6]:
# Count vocalizations per session among the training rows only.
session_counts = old_df.loc[old_df.is_test==0].Session.value_counts()

In [7]:
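# Move every session with fewer than 15 training vocalizations entirely to the test set.
small_sessions = session_counts[session_counts < 15].index
new_df = old_df.copy()
new_df.loc[new_df.Session.isin(small_sessions), 'is_test'] = 1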

In [8]:
new_df.to_csv(new_csv_loc)
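
After reassigning sessions, it can be worth checking that the split has the properties we wanted. The sketch below is one possible check, shown on toy data; the helper name `summarize_split` and the keys of the dictionary it returns are hypothetical. It reports the size of the smallest training session, the sessions that appear only in the test set, and the sessions that appear on both sides of the split.

```python
import pandas as pd


def summarize_split(df: pd.DataFrame) -> dict:
    """Summarize how sessions are distributed across the train and test sets."""
    train_sessions = df.loc[df["is_test"] == 0, "Session"]
    test_sessions = df.loc[df["is_test"] == 1, "Session"]
    train_counts = train_sessions.value_counts()
    return {
        # Size of the least populated session still in training (0 if training is empty).
        "min_train_session_size": int(train_counts.min()) if len(train_counts) else 0,
        # Sessions the model never sees during training.
        "test_only_sessions": sorted(set(test_sessions) - set(train_sessions)),
        # Sessions with rows on both sides of the split.
        "shared_sessions": sorted(set(test_sessions) & set(train_sessions)),
    }


# Toy data in the same layout as new_df: "B" is test-only, "A" is split.
toy = pd.DataFrame({
    "Session": ["A", "A", "A", "A", "B", "B"],
    "is_test": [0, 0, 0, 1, 1, 1],
})
summary = summarize_split(toy)
print(summary)
```

The `shared_sessions` entry matters because the rule above only moves small sessions: if the initial split placed rows from one well-populated session on both sides, that session stays in both sets.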