# Matching files between the TUT and Isophonics datasets

In this notebook, we create a system for matching the two Beatles datasets. To do this, we create two `.csv` files that list the Beatles `.jams` files from TUT and Isophonics in the same order.

A few notes: 
* Paths are hard-coded, so you will need to update them for your own machine before this notebook will run.

### Set Up

In this section, we import the required Python packages and define variables for the relevant directory structure.

In [None]:
import os 
import numpy


In [None]:
dir_name = "/Users/kkinnaird/Documents/Research/R-Music/Brian-AHE/jams-data/datasets/"

tut_dir = dir_name + "BeatlesTUT/"
iso_dir = dir_name + "Isophonics/The Beatles/"

### Understanding the directories

The two datasets are organized differently. The Isophonics data has a subdirectory for each CD, while the TUT data has no subdirectories and lists all songs in a single directory. The cells below explore these differences:

In [None]:
# List of the CDs inside the Isophonics data directory

iso_cd_list = os.listdir(iso_dir)

In [None]:
print(iso_cd_list)

In [None]:
# List of songs in TUT data directory

tut_list = os.listdir(tut_dir)

In [None]:
tut_list
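
As a rough illustration of the two layouts, the sketch below builds a tiny throwaway directory tree and lists the `.jams` files in each layout with `os.walk`. The helper `list_jams` and all file and folder names are hypothetical and are not part of the datasets. Filtering on the `.jams` extension also skips stray files such as `.DS_Store`, which a bare `os.listdir` would return.

```python
import os
import tempfile


def list_jams(root):
    """Return sorted paths, relative to root, of every .jams file under root."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(".jams"):
                rel = os.path.relpath(os.path.join(dirpath, name), root)
                # Normalize separators so the result looks the same on every OS
                found.append(rel.replace(os.sep, "/"))
    return sorted(found)


# Toy directory trees with hypothetical names: one nested, one flat
with tempfile.TemporaryDirectory() as tmp:
    iso_root = os.path.join(tmp, "iso")
    tut_root = os.path.join(tmp, "tut")
    os.makedirs(os.path.join(iso_root, "CD_A"))
    os.makedirs(tut_root)
    for path in [
        os.path.join(iso_root, "CD_A", "01_-_Song.jams"),
        os.path.join(iso_root, ".DS_Store"),
        os.path.join(tut_root, "01_-_Song.jams"),
    ]:
        open(path, "w").close()

    iso_listing = list_jams(iso_root)
    tut_listing = list_jams(tut_root)

print(iso_listing)
print(tut_listing)
```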

### First pass at matching 

We will match songs between the two directories in two passes. In the first pass, we use string matching on the song titles to match tracks automatically. Tracks left unmatched after the first pass are matched manually in the second pass.

Below is the code for the first pass. As a reminder, we are building two lists with the tracks in the same order. 

In [None]:
# Initialize the lists for comparison
tut_files = []
iso_files = []

# Keep track of the files that do not have a match through direct string matching
iso_files_2_match = []

for subdir in os.listdir(iso_dir):
    for filename in os.listdir(iso_dir + subdir + "/"):
        # Select a track
        
        # write out the full track name with the CD that it is on (which is the 
        # subdirectory name in the ISO file organization)
        iso_filename = subdir + "/" + filename
        
        # Two CDs have a leading set of characters that do not exist in TUT
        # Removing this preamble allows for more tracks to be matched automatically
        if filename[0] == "C":
            filename = filename[6:]
        
        # Check for the query track name in the TUT list
        # If the track is in TUT, add both track names to their respective lists
        # If not, add the ISO track name to the files to match in the next phase
        if filename in tut_list:
            ind = tut_list.index(filename)
            tut_files.append(tut_list[ind])
            
            iso_files.append(iso_filename)
        else:
            iso_files_2_match.append(iso_filename)
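
The same first-pass logic can be sketched as a small function that runs on plain lists, which makes it easy to try without the datasets on disk. This is only an illustration: the function name `first_pass_match`, the `prefix_len` parameter, and the toy file names are assumptions, not part of the notebook's data.

```python
def first_pass_match(iso_tracks, tut_names, prefix_len=6):
    """Match Isophonics tracks to TUT file names by exact string comparison.

    iso_tracks: list of "subdirectory/filename" strings
    tut_names:  list of TUT file names
    Returns (iso_matched, tut_matched, iso_unmatched); the first two lists
    are aligned, so entry i of one corresponds to entry i of the other.
    """
    tut_set = set(tut_names)
    iso_matched, tut_matched, iso_unmatched = [], [], []
    for iso_path in iso_tracks:
        filename = iso_path.split("/")[-1]
        # Drop a leading CD marker (assumed to be prefix_len characters long)
        if filename.startswith("C"):
            filename = filename[prefix_len:]
        if filename in tut_set:
            iso_matched.append(iso_path)
            tut_matched.append(filename)
        else:
            iso_unmatched.append(iso_path)
    return iso_matched, tut_matched, iso_unmatched


# Toy input with hypothetical names; the third title is deliberately misspelled
toy_iso = [
    "Album_A/01_-_First_Song.jams",
    "Album_B/CD1_-_02_-_Second_Song.jams",
    "Album_B/03_-_Thrid_Song.jams",
]
toy_tut = [
    "01_-_First_Song.jams",
    "02_-_Second_Song.jams",
    "03_-_Third_Song.jams",
]
iso_matched, tut_matched, iso_unmatched = first_pass_match(toy_iso, toy_tut)
```

Here the misspelled third track falls through to `iso_unmatched`, which is the situation the manual pass handles.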

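The four cells below report, in order, the number of matched Isophonics files, the total number of files in the TUT directory, the number of Isophonics files still unmatched, and the number of matched TUT files.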

In [None]:
len(iso_files)

In [None]:
len(tut_list)

In [None]:
len(iso_files_2_match)

In [None]:
len(tut_files)
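
These counts can also be gathered in one place together with a check that the two matched lists have the same length, since the lists only make sense if they stay aligned. The sketch below is one possible way to do that; `summarize_match` and the toy file names are hypothetical.

```python
def summarize_match(iso_matched, tut_matched, iso_unmatched, tut_all):
    """Check that the matched lists are aligned and count what is left over."""
    if len(iso_matched) != len(tut_matched):
        raise ValueError("matched lists must have the same length")
    # TUT files that no Isophonics track has been paired with yet
    tut_unmatched = sorted(set(tut_all) - set(tut_matched))
    return {
        "matched_pairs": len(iso_matched),
        "iso_unmatched": len(iso_unmatched),
        "tut_unmatched": len(tut_unmatched),
    }


# Toy values with hypothetical file names
summary = summarize_match(
    iso_matched=["Album_A/01_-_First_Song.jams"],
    tut_matched=["01_-_First_Song.jams"],
    iso_unmatched=["Album_A/02_-_Secnd_Song.jams"],
    tut_all=["01_-_First_Song.jams", "02_-_Second_Song.jams"],
)
print(summary)
```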

### Manual Matching 

In the first pass, tracks were matched automatically using string matching. Eleven files in the Isophonics dataset were not matched to files in the TUT dataset this way. We examine each of these 11 files and manually look for a match.

We are able to match 5 of these to files in the TUT set. These pairs were not matched in the first pass because of small differences in spelling (such as *Lizzy* vs *Lizzie*) or misspellings (such as *Trough* instead of *Through*).
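
Near-miss spellings like these can be surfaced with approximate string matching before inspecting files by hand. The sketch below uses `difflib.get_close_matches` from the Python standard library to propose candidates. It is an optional aid and not the method used in this notebook; the helper `suggest_matches`, the album folder name, and the cutoff value are assumptions, and any suggestion still needs a manual check.

```python
import difflib


def suggest_matches(iso_path, tut_names, n=3, cutoff=0.6):
    """Return up to n TUT file names that closely resemble the Isophonics file name."""
    filename = iso_path.split("/")[-1]
    return difflib.get_close_matches(filename, tut_names, n=n, cutoff=cutoff)


# Toy input: the folder name is hypothetical, the spellings follow the text above
toy_tut = ["14_-_Dizzy_Miss_Lizzie.jams", "01_-_Help!.jams"]
suggestions = suggest_matches("Some_Album/14_-_Dizzy_Miss_Lizzy.jams", toy_tut)
print(suggestions)
```

Candidates are returned most similar first, so the top suggestion is the natural one to inspect.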

In [None]:
iso_files_2_match

In [None]:
iso_files.append(iso_files_2_match[2])
tut_files.append("13_-_She_Came_In_Trough_The_Bathroom_Window.jams")

In [None]:
iso_files.append(iso_files_2_match[3])
tut_files.append("14_-_Dizzy_Miss_Lizzie.jams")

In [None]:
iso_files.append(iso_files_2_match[4])
tut_files.append("06_-_You're_Going_to_Lose_That_Girl.jams")

In [None]:
iso_files.append(iso_files_2_match[7])
tut_files.append("06_-The_Continuing_Story_of_Bungalow_Bill.jams")

In [None]:
iso_files.append(iso_files_2_match[9])
tut_files.append("04_-_Everybody's_Got_Something_To_Hide_Except_Me_and_M.jams")

In [None]:
iso_files

In [None]:
tut_files

In [None]:
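# Remove the manually matched files from the "To be matched" list
# Pop in descending index order so that removing one entry does not
# shift the positions of the entries still to be removed

for i in [9,7,4,3,2]:
    iso_files_2_match.pop(i)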
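
An alternative that does not depend on the order of the indices is to build a new list that skips them. The sketch below shows the idea on placeholder values; `remove_indices` is a hypothetical helper, not part of the notebook.

```python
def remove_indices(items, indices):
    """Return a copy of items without the entries at the given positions."""
    drop = set(indices)
    return [item for i, item in enumerate(items) if i not in drop]


# Eleven placeholder entries standing in for the unmatched files
placeholders = list("abcdefghijk")
remaining = remove_indices(placeholders, [2, 3, 4, 7, 9])
print(remaining)
```

Unlike repeated `pop` calls, this leaves the original list unchanged and gives the same result whatever order the indices are listed in.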

In [None]:
iso_files_2_match

### Saving the results

Create two files, one for each of the lists built above.

In [None]:
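save_dir = "/Users/kkinnaird/Documents/Research/R-Music/Brian-AHE/" 
# Note: save_dir is defined but not used below, so both files are written
# to the current working directory. Each list is 1-D, so every row holds
# one file name and the delimiter argument has no effect.
numpy.savetxt("iso_file_list.csv", iso_files, fmt="%s", delimiter=",")
numpy.savetxt("tut_file_list.csv", tut_files, fmt="%s")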

### Post-processing step

Three of the files have commas in their titles. Because the lists are saved as `.csv` files, a comma splits such a title across several cells, so you need to hand-edit three rows in each file to restore the full name in the first cell.
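
One way to avoid the hand-editing, sketched below, is to write the lists with Python's standard `csv` module, which wraps any field containing a comma in quotes so that it stays in a single cell. This is an alternative to the `numpy.savetxt` call used above, not what the notebook does; the helper `write_names` and the file names are hypothetical, and an in-memory buffer stands in for a real file.

```python
import csv
import io


def write_names(names, handle):
    """Write one file name per row, quoting names that contain commas."""
    writer = csv.writer(handle)
    for name in names:
        writer.writerow([name])


# Toy list with a hypothetical title that contains a comma
names = ["01_-_Plain_Title.jams", "02_-_One,_Two.jams"]

buffer = io.StringIO(newline="")
write_names(names, buffer)
written_text = buffer.getvalue()

# Read the rows back to confirm each name survives in the first cell
buffer.seek(0)
round_trip = [row[0] for row in csv.reader(buffer)]
print(round_trip)
```

To write to disk instead, the same function can be given a file opened with `open(path, "w", newline="")`.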