In [7]:
%matplotlib inline
from __future__ import division, print_function
import numpy as np
import os

# Sketches and progress for SHS I/O

## Finished

### Read URIs

```Python
def read_uris():
...
```

In [2]:
import SHS_data

uris, ids = SHS_data.read_uris()

### Read Cliques

```Python
def read_cliques(clique_file='shs_pruned.txt'):
...
```

In [3]:
import SHS_data

cliques_by_name, cliques_by_id = SHS_data.read_cliques()

#### Let's take a quick look at set sizes

In [9]:
# check set sizes
uris, ids = SHS_data.read_uris()
N_ids = len(ids)
print('N_ids: ', N_ids)

cliques_pruned, _ = SHS_data.read_cliques(clique_file='shs_pruned.txt')
N_pruned = np.sum([len(clique) for clique in cliques_pruned.values()])
print('N_pruned: ', N_pruned)

chroma_path = os.path.join(SHS_data.data_dir, 'chroma')
chroma_files = os.listdir(chroma_path)
N_chroma_files = len(chroma_files)
print('N_chroma_files', N_chroma_files)

cliques_train, _ = SHS_data.read_cliques(clique_file='shs_train.txt')
cliques_test, _ = SHS_data.read_cliques(clique_file='shs_test.txt')
N_train = np.sum([len(clique) for clique in cliques_train.values()])
N_test = np.sum([len(clique) for clique in cliques_test.values()])
print('N_train + N_test: ', N_train + N_test)

N_ids:  18069
N_pruned:  18069
N_chroma_files 18069
N_train + N_test:  18196


Good news:

    N_ids == N_pruned == N_chroma_files

The train and test sets together contain 18196 songs, so `18196 - 18069 = 127` files were 'pruned'.

#### Check set sizes in more detail

In [11]:
pruned_uris = {uri for clique in cliques_pruned.values() for uri in clique}
train_uris = {uri for clique in cliques_train.values() for uri in clique}
test_uris = {uri for clique in cliques_test.values() for uri in clique}
train_test_uris = train_uris | test_uris

# check if that made sense:
assert N_train == len(train_uris)
assert N_test == len(test_uris)

# check if no overlap between train and test data
assert len(train_uris) + len(test_uris) == N_train + N_test
assert len(train_uris) + len(test_uris) == len(train_test_uris)

# check that len(uris) equals len(pruned_uris)
assert len(uris) == len(pruned_uris)

# check again how many songs were pruned
pruned_files = train_test_uris - pruned_uris
print(len(pruned_files))

127


Good news: no overlap between train and test data.

The number of pruned files is confirmed.


#### Were any songs added?

In [13]:
len(pruned_uris - train_uris - test_uris)

0

As expected, no extra songs in SHS_pruned that weren't already in SHS_train or SHS_test.
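
The three set checks above (overlap, pruned songs, added songs) can be bundled into one small helper. This is only a sketch: `compare_splits` is a hypothetical name that is not part of `SHS_data`, and the toy URIs below are made up rather than taken from the SHS files.

```python
def compare_splits(pruned, train, test):
    """Summarise how a pruned URI collection relates to train and test."""
    pruned, train, test = set(pruned), set(train), set(test)
    union = train | test
    return {
        'overlap': len(train & test),   # URIs that appear in both train and test
        'removed': len(union - pruned), # in train/test but missing from pruned
        'added': len(pruned - union),   # in pruned but in neither train nor test
    }

# Toy data (assumption: any hashable identifiers work, not only SHS URIs).
toy_train = ['a', 'b', 'c', 'd']
toy_test = ['e', 'f']
toy_pruned = ['a', 'b', 'c', 'e']

summary = compare_splits(toy_pruned, toy_train, toy_test)
print(summary)
```

With the real data, calling it as `compare_splits(pruned_uris, train_uris, test_uris)` should reproduce the counts found above: no overlap, 127 removed, 0 added.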

#### In short:

I originally thought the situation looked like this (before duplicate clique names and the last clique in the file were handled correctly):

       ____________________________________
     /           /                           \
    | ids       | SHS_pruned                 |
    |           |                            |
    |    196    |             2              |
    |    _______|____________________________|
    |  /        |                 |          |\
    | |   131   |      12557      |   5183   | |
     \|__________\________________|_________/  |
      |                           |            |
      |            104            |     23     |  < missing
      |                           |            |
      |                 SHS_train | SHS_test   |
       \ _________________________|__________ /
       
       
It's actually like this (much simpler):

       __________________________________
     //                    |             \\
    ||           SHS_train | SHS_test     ||
    ||                     |              ||__ SHS_pruned
    ||        12856        |     5213     ||
    ||                     |              ||
     \ ____________________|_____________ /
     |                     |              |
     |         104         |      23      |
      \ ___________________|____________ /


### Train, test, validation sets

```Python
def split_train_test_validation(clique_dict, ratio=(50, 20, 30),
                                random_state=1988):
...
```
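
The body of the function is elided above, so the following is not its implementation. It is a minimal sketch of one way to split at the clique level, so that all versions of a song end up in the same subset. It assumes that `ratio` is ordered as (train, test, validation) and that `clique_dict` maps clique names to lists of URIs; `split_cliques_sketch` and the toy cliques are hypothetical.

```python
import random

def split_cliques_sketch(clique_dict, ratio=(50, 20, 30), random_state=1988):
    """Hypothetical sketch: split cliques (not songs) into three dicts."""
    names = sorted(clique_dict)        # fixed order, so the shuffle is reproducible
    rng = random.Random(random_state)  # local generator, leaves global state alone
    rng.shuffle(names)

    total = float(sum(ratio))
    n_train = int(round(len(names) * ratio[0] / total))
    n_test = int(round(len(names) * ratio[1] / total))

    train_names = names[:n_train]
    test_names = names[n_train:n_train + n_test]
    validation_names = names[n_train + n_test:]  # whatever remains

    train = {name: clique_dict[name] for name in train_names}
    test = {name: clique_dict[name] for name in test_names}
    validation = {name: clique_dict[name] for name in validation_names}
    return train, test, validation

# Toy cliques (made up): ten cliques with two URIs each.
toy_cliques = {'clique_%d' % i: ['uri_%d_a' % i, 'uri_%d_b' % i]
               for i in range(10)}

train, test, validation = split_cliques_sketch(toy_cliques)
print(len(train), len(test), len(validation))
```

Splitting by clique rather than by song means the ratio applies to the number of cliques; since cliques differ in size, the song counts per subset will only approximate it.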