# Exploring Recommendation System Suitability By User History Depth  
## Data Imports and Cleaning
Pippi de Bree

## Table of Contents
### [Introduction](#Introduction)

### [Library Imports](#Imports)
### [Lakh MIDI Dataset - Import and Cleaning](#Lakh-MIDI-Dataset)
    - Dataset Description
    - Dataset Cleaning
 
    
### [The Echo Nest Taste Profile Subset -  Import and Cleaning](#The-Echo-Nest-Taste-Profile-Subset) 
    - Dataset Description
    - Dataset Cleaning 
    
 
### [Intersection Data Creation](#Intersection)

### [Audio File Intersection Creation](#Audio-Intersection)
### [Conclusion of Data Loading and Cleaning](#Conclusion)


# Introduction <a id=Introduction a>

The dataset used in this exploration is [The Lakh MIDI Dataset v0.1](https://colinraffel.com/projects/lmd/), a subset taken from the [Million Song Dataset](http://millionsongdataset.com/) (MSD). This subset contains the MSD data for tracks that have been matched with MIDI files. (Audio files were obtained by reaching out to the owner.) This dataset was combined with [The Echo Nest Taste Profile Subset](http://millionsongdataset.com/tasteprofile/), also from the Million Song Dataset, to create a subset of audio-matched songs that have user taste information. This will allow us to build a recommendation system based on these users' tastes. 

References: 
*Thierry Bertin-Mahieux, Daniel P. W. Ellis, Brian Whitman, and Paul Lamere. "The Million Song Dataset". In Proceedings of the 12th International Society for Music Information Retrieval Conference, pages 591–596, 2011.*

*Colin Raffel. "Learning-Based Methods for Comparing Sequences, with Applications to Audio-to-MIDI Alignment and Matching". PhD Thesis, 2016.*

# Get Data

# Library Imports <a id=Imports a>

In order to work with our data we will import some libraries that will allow for easier data manipulation, namely `numpy` and `pandas`. These libraries let us easily transform our data to dataframes with a high level of functionality.

In [39]:
# Importing numpy and pandas in order to gain functionality for working with .csv files
import numpy as np
import pandas as pd

We will also import a custom library named `ds_utils_capstone`, which contains a set of utility functions that will be used across this project. 

In [40]:
# Importing ds_utils_capstone from the current working directory. 
import ds_utils_capstone

# Lakh MIDI Dataset Import  <a id=Lakh-MIDI-Dataset a>

## Data Set Description

The Lakh MIDI dataset is a subset of the MSD, matched to a group of audio files (made available by the dataset creator). The following attribute descriptions are outlines based on the [MSD website descriptions](http://millionsongdataset.com/pages/example-track-description/), supplemented with some further research. This is an extensive list of attributes that will likely be edited.

LMD-matched

- `analysis_sample_rate`	not sure 
- `artist_7digitalid`	IDs for artists in 7digital database
- `artist_familiarity`	Measure of how known the artist is generally (e.g. Cyndi Lauper has 0.71)
- `artist_hotttnesss`	Measure of how 'hot' or currently popular an artist is (Cyndi Lauper has 0.56) - could be time biased
- `artist_id`	ID for artist in MSD 
- `artist_latitude`	Latitude for artists location
- `artist_location`	Artist's declared location (e.g. Brooklyn, NY)
- `artist_longitude`	Longitude for artist's location
- `artist_mbid`	Artist ID for musicbrainz.org
- `artist_mbtags`	Tags this artist is associated with on musicbrainz.org in list form (e.g. 'classic pop and rock')
- `artist_mbtags_count`	Number of tags this artist has on musicbrainz.org
- `artist_name`	Name of the Artist 
- `artist_playmeid`	Artist ID for playme.com
- `artist_terms`	Tags this artist is associated with on The Echo Nest (a main source for data) - a list
- `artist_terms_freq`	Frequency of usage of terms, as a proportion (0-1), when the artist is mentioned - as list e.g. \[1, 0.6...]
- `artist_terms_weight`	Weight of the terms, list of proportions between 0 and 1 
- `audio_md5`	Hash code for the audio used for analysis by The Echo Nest
- `bars_confidence`	Confidence value associated with start of each bar, from The Echo Nest
- `bars_start`	Start time of each bar, according to The Echo Nest 
- `beats_confidence`	Confidence value associated with start of each beat, from The Echo Nest
- `beats_start`	Start time of each beat, according to The Echo Nest
- `danceability`	Measure of how 'danceable' the song is (i.e. how easy it is to dance to - indicator of listening style) between 0 and 1.
- `duration`	Length of track, in seconds.
- `end_of_fade_in`	Time of the end of the fade in, at the beginning of the song, according to The Echo Nest
- `energy`	Measure of how much 'energy' a song has, according to The Echo Nest. Between 0 and 1 (0 means not analysed).
- `key`	Estimation of the key, by The Echo Nest
- `key_confidence` 	Confidence in the key estimation 
- `loudness`	Measure of the 'loudness' of a song (how much noise is present within a song)
- `mode` 	Estimation of the mode, by The Echo Nest
- `mode_confidence`	Confidence in the mode estimation
- `release`	Name of the release (album) the track comes from
- `release_7digitalid`	ID of the album on the 7digital service 
- `sections_confidence`	Confidence associated with each section according to The Echo Nest (between 0 and 1)
- `sections_start`	Start time of each section, according to The Echo Nest
- `segments_confidence`	Confidence value associated with each segment, by The Echo Nest
- `segments_loudness_max`	Max loudness during each segment (list)
- `segments_loudness_max_time`	Time of max loudness during each segment (list)
- `segments_loudness_start`	Loudness at the beginning of each segment
- `segments_pitches`	Chroma features for each segment - normalised so max = 1
- `segments_start`	Start time of each segment 
- `segments_timbre`	MFCC-like features for each segment
- `similar_artists`	List of 100 artists (their Echo Nest ID) similar to this song's artist 
- `song_hotttnesss`	According to The Echo Nest, when downloaded in 2010, the popularity/prevalence of the song (potentially outdated and time biased)
- `song_id` 	ID of song within MSD 
- `start_of_fade_out`	Start time of fade out, in seconds, at the end of song 
- `tatums_confidence`	Confidence in each tatum value (list)
- `tatums_start`	Start of each tatum (smallest interval between successive notes) according to Echo Nest
- `tempo`	Tempo in BPM according to Echo Nest
- `time_signature`	Time signature according to The Echo Nest
- `time_signature_confidence`	Confidence in the time signature estimation 
- `title`	Title of the song (e.g. Never Gonna Give You Up)
- `track_7digitalid` 	ID of the song on 7digital 
- `track_id` 	ID of track in The Echo Nest (used as a tag for .h5 files, but `song_id` would be used to merge datasets).
- `year`	Year of song release, according to musicbrainz.org


## Accessing the Data

The data takes the form of hierarchical `.h5` files - meaning we need to expand these to create a `.csv` file. To do this we used code from the following:
https://github.com/rcrdclub/mm-songs-db-tools

However, some changes were made to allow this to work with the dataset. In hindsight, this preprocessing could have been done using a different version of Python, but only a few lines of code were changed. To execute this I ran the code below, which calls `mmsongsdb_to_csv.py` to create a `.csv` by iterating over all of the `.h5` files in the `lmd_matched_h5` dataset. I decided to use this subset because it gives me the option to add in MIDI or audio files. 

The code in `mmsongsdb_to_csv.py` is from the GitHub repo mentioned above - it iterates over the entire folder of nested `.h5` files. When it accesses each file, it calls `mmsongsdbtocsvconverter.py`, which in turn adds each attribute in the HDF file to a growing `.csv` file by using `hdf5_getters.py`. 
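The directory walk described above can be sketched as follows. This is a minimal illustration, not the code from the repository: the helper name `find_h5_files` and the folder layout in the demonstration are assumptions made for this example, and the per-file attribute extraction is left out.

```python
from pathlib import Path
import tempfile


def find_h5_files(root):
    """Return every .h5 file below `root`, sorted for a reproducible order."""
    return sorted(Path(root).rglob("*.h5"))


# Demonstration on a throwaway directory that mimics a nested layout.
# The folder names below are invented for illustration only.
with tempfile.TemporaryDirectory() as tmp:
    nested = Path(tmp) / "A" / "B" / "C"
    nested.mkdir(parents=True)
    (nested / "TRAAAGR128F425B14B.h5").touch()
    (nested / "notes.txt").touch()  # non-.h5 files are ignored
    found = [p.name for p in find_h5_files(tmp)]

print(found)
```

Collecting the paths first makes it easy to count the files before starting a conversion that runs for hours.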

The only change I made to this code is in the `mmsongsdbtocsvconverter.py` - this code is designed to be used with an older version of Python so I had some syntax issues but resolved them. 

The code below uses the terminal to run `mmsongsdb_to_csv.py` on the folder `lmd_matched_h5`, and thereby create `lmd.csv`. It is shown here to document how I created the `lmd.csv` dataset, though I do not recommend running it (this would take hours to execute).

In [41]:
# Code to convert h5 files to a csv 
# ! python code_for_h5/mmsongsdb_to_csv.py data/lmd.csv data/lmd_matched_h5

Having created `lmd.csv`, we read it in as a pandas dataframe in order to look at the format of the data.

To do so we will use pandas and numpy together with a function defined in the `ds_utils_capstone.py` file, which can be found in the current directory. Throughout this project we will use functions from this file, as they give us shared functionality across notebooks.  

Below we list the imports needed for this data cleaning stage.

Now we will read in the data as a `.csv` file using the `ds_utils_capstone` function `.read_csv_pd`. This code is written for direct use of the local data - it is very large so a copy has been saved in [this](https://drive.google.com/drive/folders/12nRIYn6zHiYHOLe4or-fGiN7FAAdxTnW) Google Drive folder.

In [42]:
# Reading in the .csv file created from the h5 files, using ds_utils_capstone read-in function 
lmd = ds_utils_capstone.read_csv_pd("data/lmd.csv")

DataFrame contains 124136 rows and 54 columns.
Missing values or duplicated rows found.
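
The source of `ds_utils_capstone.read_csv_pd` is not shown in this notebook. As a rough idea of what a helper producing a report like the one above might do, here is a hypothetical sketch: the function name `read_csv_with_report` and its exact checks are assumptions, not the project's implementation.

```python
import io

import pandas as pd


def read_csv_with_report(path_or_buffer, **kwargs):
    """Read a CSV and print a short report on shape, missing values and duplicates."""
    df = pd.read_csv(path_or_buffer, **kwargs)
    print(f"DataFrame contains {df.shape[0]} rows and {df.shape[1]} columns.")
    has_missing = bool(df.isna().any().any())
    has_duplicates = bool(df.duplicated().any())
    if has_missing or has_duplicates:
        print("Missing values or duplicated rows found.")
    else:
        print("No missing values or duplicated rows found.")
    return df


# Toy input: one duplicated row and one missing tempo value.
toy_csv = io.StringIO("track_id,tempo\nT1,120.0\nT1,120.0\nT2,\n")
toy = read_csv_with_report(toy_csv)
```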


## Dataset Shape 

From the above output we can see the created dataset has 124,136 rows and 54 columns. However, we know from the data source that the intersection is only supposed to be 45,129 songs, so there was an error in the import. (This is an issue that could be returned to.) Otherwise the read-in was successful, with the first few rows looking as they should.

In [43]:
# looking at the first few rows of the data 
lmd.head(5)

Unnamed: 0,analysis_sample_rate,artist_7digitalid,artist_familiarity,artist_hotttnesss,artist_id,artist_latitude,artist_location,artist_longitude,artist_mbid,artist_mbtags,...,start_of_fade_out,tatums_confidence,tatums_start,tempo,time_signature,time_signature_confidence,title,track_7digitalid,track_id,year
0,22050,11319,0.712886,0.559257,ARGE7G11187FB37E05,,"Brooklyn, NY",,7bd9e20e-74b9-446a-a2ed-a223f82a36e7,['classic pop and rock'],...,240.64,[1. 1. 1. 1. 1. 1. 1. 1. ...,[1.2807000e-01 3.7284000e-01 6.1882000e-01 8.6...,123.989,4,0.8,Into The Nightlife,3110092,TRAAAGR128F425B14B,2008
1,22050,93189,0.546102,0.383787,ARJJ8611187FB5321F,40.79086,"New York, NY [Manhattan]",-73.96644,471e21ab-7a14-4190-a9d2-f95197616df4,[],...,167.607,[1. 1. 1. 1. 1. 1. 1. 1. ...,[ 0.76749 1.04856 1.32007 1.59294 1.8...,110.129,4,0.711,Break My Stride,8473798,TRAAAZF12903CCCF6B,1983
2,22050,1396,0.7072,0.513463,ARYKCQI1187FB3B18F,,,,eeacb319-8d4c-48e0-80a0-944e71c375bf,[],...,285.605,[0.238 0.232 0.216 ... 0. 0. 0. ],[5.7280000e-02 2.6051000e-01 4.6674000e-01 ......,150.062,4,0.931,Caught In A Dream,4143071,TRAABVM128F92CA9DC,2004
3,22050,611,0.635346,0.463478,ARD9UVF1187B9B17FE,,"Hawthorne, CA",,634fe78e-fc6b-4b2a-ba83-c8c66e13a8aa,['classic pop and rock'],...,160.717,[0.897 0.839 0.784 0.759 0.68 0.641 0.604 0.5...,[ 0.23383 0.54057 0.84423 1.15034 1.4...,100.494,3,1.0,Keep An Eye On Summer (Album Version),1140917,TRAABXH128F42955D6,1998
4,22050,153505,0.583006,0.333922,ARDDIBO1187B9B0822,,,,7720a649-0c70-4c7a-972a-c29ccb898201,[],...,156.973,[0.894 0.808 0.741 0.676 0.631 0.598 0.571 0.5...,[ 0.27878 0.53605 0.7946 1.05329 1.3...,118.43,4,0.61,Summer,7473946,TRAACQE12903CC706C,2007


The format of the data seems to make sense, so we will look at the columns and see if we find any issues in how the columns were imported. 

In [44]:
lmd.columns

Index(['analysis_sample_rate', 'artist_7digitalid', 'artist_familiarity',
       'artist_hotttnesss', 'artist_id', 'artist_latitude', 'artist_location',
       'artist_longitude', 'artist_mbid', 'artist_mbtags',
       'artist_mbtags_count', 'artist_name', 'artist_playmeid', 'artist_terms',
       'artist_terms_freq', 'artist_terms_weight', 'audio_md5',
       'bars_confidence', 'bars_start', 'beats_confidence', 'beats_start',
       'danceability', 'duration', 'end_of_fade_in', 'energy', 'key',
       'key_confidence', 'loudness', 'mode', 'mode_confidence', 'release',
       'release_7digitalid', 'sections_confidence', 'sections_start',
       'segments_confidence', 'segments_loudness_max',
       'segments_loudness_max_time', 'segments_loudness_start',
       'segments_pitches', 'segments_start', 'segments_timbre',
       'similar_artists', 'song_hotttnesss', 'song_id', 'start_of_fade_out',
       'tatums_confidence', 'tatums_start', 'tempo', 'time_signature',
       'time_signature_

From this we can see that we have a lot of attributes, but none that are extra as they all match the data description.
We will look further into these columns in our Exploratory Data Analysis.

## Duplicated Rows

We may now consider that we have an issue of duplicated rows. First we will count the number of duplicated rows to see if this is the cause of our issue.

In [45]:
# We will look at the number of duplicated rows.
print("The number of duplicated rows are:", lmd.duplicated().sum())

The number of duplicated rows are: 93102


To see the number of non-duplicated rows we will use the boolean output of the `.duplicated()` to select only rows that are designated as `False`. By default, the function marks the first instance of every duplicated row as `False` in order to let us keep one instance of the duplication. 

In [46]:
# The number of non-duplicated rows (first instances are marked False)
print("The number of non-duplicated rows are:", len(lmd[lmd.duplicated() == False]))

The number of non-duplicated rows are: 31034
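
To make the default behaviour of `.duplicated()` concrete, the toy example below (invented data, not rows from `lmd.csv`) shows that the first occurrence of each row is marked `False`, and that filtering on the mask gives the same result as `.drop_duplicates()`.

```python
import pandas as pd

toy = pd.DataFrame({
    "track_id": ["T1", "T2", "T1", "T1", "T2"],
    "tempo": [120.0, 98.5, 120.0, 120.0, 98.5],
})

# keep="first" is the default: only later repeats are flagged as duplicates.
mask = toy.duplicated()
print(mask.tolist())

# Two equivalent ways of keeping one instance of every row.
via_mask = toy[~mask]
via_method = toy.drop_duplicates()
print(len(via_mask), len(via_method))
```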


As we have 124,136 rows total, we should hope that the number of duplicated and non-duplicated rows add up:

31034 + 93102 = 124136 

93102 is exactly 3 times 31034, so each unique row appears four times in the file (one first instance plus three duplicates). This may be the result of an issue in our import. 
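
One way to check this kind of arithmetic directly is to count how many copies of each distinct row exist. The sketch below uses a small invented frame in which every row appears four times; `dropna=False` is passed so that rows containing missing values would still be counted (this needs pandas 1.1 or later).

```python
import pandas as pd

# Toy data: two distinct rows, each repeated four times.
toy = pd.DataFrame({
    "track_id": ["T1", "T2"] * 4,
    "tempo": [120.0, 98.5] * 4,
})

n_total = len(toy)
n_unique = len(toy.drop_duplicates())
n_dup = int(toy.duplicated().sum())

# Number of copies of each distinct row.
copies_per_row = toy.groupby(list(toy.columns), dropna=False).size()

print(n_total, n_unique, n_dup)
print(copies_per_row.tolist())
```

If every entry of `copies_per_row` is the same number, the duplication is uniform, which points towards a systematic cause such as the conversion script visiting each file several times.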

We have successfully isolated singular instances of each row, so we will create a dataframe with only the singular instances. This will allow us to continue on with our exploration. 

In [47]:
# Creating a new dataframe with only one instance of every row. 
lmd_no_dup = lmd[lmd.duplicated() == False]

As a check, we will see if this gives us the expected number of rows (matching the above output).

In [48]:
print("Length of singular instance dataframe:", len(lmd_no_dup))

Length of singular instance dataframe: 31034


This gives us the correct number of rows - 31034. 

## Missing Data 

Now we are going to look into missing data in order to see if there are any serious issues. We will use the `.info()` function to see if any of the columns are missing data points. It also allows us to look at data types, but we will address those later.

In [49]:
lmd_no_dup.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 31034 entries, 0 to 91877
Data columns (total 54 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   analysis_sample_rate        31034 non-null  int64  
 1   artist_7digitalid           31034 non-null  int64  
 2   artist_familiarity          31030 non-null  float64
 3   artist_hotttnesss           31034 non-null  float64
 4   artist_id                   31034 non-null  object 
 5   artist_latitude             10507 non-null  float64
 6   artist_location             17438 non-null  object 
 7   artist_longitude            10507 non-null  float64
 8   artist_mbid                 29049 non-null  object 
 9   artist_mbtags               31034 non-null  object 
 10  artist_mbtags_count         31034 non-null  object 
 11  artist_name                 31034 non-null  object 
 12  artist_playmeid             31034 non-null  int64  
 13  artist_terms                310

From this we can see that the columns we may have problems with are `artist_latitude`, `artist_location`, `artist_longitude`, `artist_mbid` and `song_hotttnesss`. 
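
Reading non-null counts off `.info()` works, but the share of missing values per column can also be computed directly. The sketch below uses invented values, and the 5% threshold is an illustrative assumption rather than a rule used elsewhere in this project.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "artist_latitude": [40.8, np.nan, np.nan, np.nan],
    "song_hotttnesss": [0.5, np.nan, 0.3, np.nan],
    "tempo": [120.0, 110.1, 150.0, 100.5],
})

# Fraction of missing values per column, largest first.
missing_fraction = toy.isna().mean().sort_values(ascending=False)
print(missing_fraction)

# Columns whose missing share exceeds an (assumed) threshold.
threshold = 0.05
sparse_columns = missing_fraction[missing_fraction > threshold].index.tolist()
print(sparse_columns)
```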

If we think of music as localised, it may make sense that people listen to music from their own areas - specifically, more underground and local music scenes would have local listeners. This could be interesting to look into, especially if we had user location data. However, we do not, so we will not use this location data, and overall it seems unlikely that location would have a large impact. Though fewer data points are missing from `artist_location` (about 44%) than from `artist_latitude` and `artist_longitude` (about 66% each), we still cannot use `artist_location` because too many values are missing. 

Missing `artist_mbid` values are not a big issue because this is data that we would not use. This attribute is the artist ID for musicbrainz.org - as we are not using data from this website, we will simply not use the column.

Finally, the `song_hotttnesss` attribute is missing about half of its data. The data source describes it as a measure of the prevalence of the song in 2010 (when the dataset was created), which could have been helpful. However, with about half of the values missing, it is not sensible to impute them.

Therefore, we will continue on without these variables.

In [50]:
# Dropping the attributes with too many missing values
lmd_no_dup = lmd_no_dup.drop(columns=['artist_latitude', 'artist_location', 'artist_longitude', 'artist_mbid',
                      'song_hotttnesss'])

In [51]:
# Checking number of columns.
len(lmd_no_dup.columns)

49

After dropping the missing values columns we have 49 columns. 

Now we can look more deeply into our attributes. From the description list above we can see that there are many columns that will not be helpful to us as they reference different datasets or website that will not be used in this analysis. We will go through each column, and decide whether we want to bring them into our EDA. We will also consider datatypes as we do this.

In [52]:
lmd_no_dup.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 31034 entries, 0 to 91877
Data columns (total 49 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   analysis_sample_rate        31034 non-null  int64  
 1   artist_7digitalid           31034 non-null  int64  
 2   artist_familiarity          31030 non-null  float64
 3   artist_hotttnesss           31034 non-null  float64
 4   artist_id                   31034 non-null  object 
 5   artist_mbtags               31034 non-null  object 
 6   artist_mbtags_count         31034 non-null  object 
 7   artist_name                 31034 non-null  object 
 8   artist_playmeid             31034 non-null  int64  
 9   artist_terms                31034 non-null  object 
 10  artist_terms_freq           31034 non-null  object 
 11  artist_terms_weight         31034 non-null  object 
 12  audio_md5                   31034 non-null  object 
 13  bars_confidence             310

Firstly, looking at `analysis_sample_rate`, we found no description of what it is and therefore cannot infer anything from its values. 

The attributes that reference different datasets or attributes in different datasets are:
- `artist_7digitalid` - artist reference id for the 7digital platform
- `artist_playmeid` - artist reference id for playme.com
- `release_7digitalid` - album reference for the 7digital platform
- `track_7digitalid` - track reference for the 7digital platform
- `artist_mbtags` - tags associated with the artist on musicbrainz.org (we will remove this because terms from the original dataset will serve a similar function).
- `artist_mbtags_count` - number of times the musicbrainz.org tags are used in reference to the artist
- `audio_md5` - a hash code used to link audio to the main dataset, we will not need this. 

We will drop all of these variables, as they will not aid our analysis. As we are dropping them, their data types do not matter. Now we will look at which variables we believe are not well suited to our analysis, even though they are musical in nature. 
- `bars_start` and `bars_confidence` - give the start time of every bar in the song, and the confidence behind these calculations. This seems too specific a timing interval and in itself would not tell us anything about how users listen to songs. (It could be argued that the number of bar starts hints at duration and tempo, but we have those attributes so do not need the bars information.) 
- `beats_start` and `beats_confidence` - give the start time of every beat in the song, with the confidence. Similarly to the bars, this information seems unhelpful because what would be used to infer user listening habits is already covered in other attributes. 
- `tatums_start` and `tatums_confidence` - give the start time of every tatum (the smallest interval between successive notes) and the confidence. Again, the information that could be inferred from these attributes can be found elsewhere and the data is too specific for our analysis.
- `segments_confidence`, `segments_loudness_max`, `segments_loudness_max_time`, `segments_loudness_start`, `segments_pitches`, `segments_start`, `segments_timbre` - A segment, according to the MSD, is a musical event, or onset. The sample song for the dataset has 935 segments. This is a lot of specific and localised data for our recommendation system, so we will remove this segment data and instead focus on using the section data (the sample song only has 10 sections). This will reduce computational load and the lost specificity will likely not be missed.

To drop: 
- `analysis_sample_rate`
- `artist_7digitalid`
- `artist_playmeid`
- `release_7digitalid`
- `track_7digitalid`
- `artist_mbtags`
- `artist_mbtags_count`
- `audio_md5` 
- `bars_start`
- `bars_confidence`
- `beats_start`
- `beats_confidence`
- `tatums_start`
- `tatums_confidence`
- `segments_confidence`
- `segments_loudness_max` 
- `segments_loudness_max_time` 
- `segments_loudness_start` 
- `segments_pitches` 
- `segments_start` 
- `segments_timbre`

We use the pandas `.drop()` method to do so. 

In [53]:
# Dropping unnecessary columns
lmd_clean = lmd_no_dup.drop(columns=[
    'analysis_sample_rate',
    'artist_7digitalid',
    'artist_playmeid',
    'release_7digitalid',
    'track_7digitalid',
    'artist_mbtags',
    'artist_mbtags_count',
    'audio_md5', 
    'bars_start',
    'bars_confidence',
    'beats_start',
    'beats_confidence',
    'tatums_start',
    'tatums_confidence',
    'segments_confidence',
    'segments_loudness_max', 
    'segments_loudness_max_time',
    'segments_loudness_start', 
    'segments_pitches', 
    'segments_start',
    'segments_timbre'
])
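
With a drop list this long, a mistyped name (for example a hypothetical `audio_m5` instead of `audio_md5`) is easy to miss, and `.drop()` raises a `KeyError` for unknown labels by default. A small pre-check, sketched here on an invented empty frame, compares the list against the actual columns and also shows how a whole family of columns can be selected by prefix.

```python
import pandas as pd

# Toy frame with only column names; the values do not matter for this check.
toy = pd.DataFrame(columns=[
    "audio_md5", "bars_start", "segments_start", "segments_timbre",
    "tempo", "title",
])

to_drop = ["audio_m5", "bars_start"]  # "audio_m5" is a deliberate typo

# Names in the drop list that do not exist in the frame.
unknown = sorted(set(to_drop) - set(toy.columns))
print("Unknown columns:", unknown)

# Select every segment-level column by its shared prefix.
segment_cols = [c for c in toy.columns if c.startswith("segments_")]

# Drop only the names that really exist, plus the prefixed group.
valid = [c for c in to_drop if c in toy.columns] + segment_cols
cleaned = toy.drop(columns=valid)
print(list(cleaned.columns))
```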

As we are dropping 21 columns, we expect to end up with 28 left over.

In [54]:
print("Number of columns left after cleaning:", len(lmd_clean.columns))

Number of columns left after cleaning: 28


Now we can consider the data types of the attributes that we have remaining:

- `artist_familiarity` - As a measure (from 0 to 1) it makes sense that this is a float.
- `artist_hotttnesss` - As a measure (from 0 to 1) it makes sense that this is a float.
- `artist_id` - As an alphanumeric value, it makes sense that this is an object.
- `artist_name` - As a string, it makes sense that this is an object.
- `artist_terms` - As a list of strings, it makes sense that this is an object. 
- `artist_terms_freq` - As a list of floats, it makes sense that this is an object.
- `artist_terms_weight` - As a list of floats, it makes sense that this is an object.
- `duration` - As a value for the length of a song, it makes sense that this is a float.
- `danceability` - As a measure (from 0 to 1) it makes sense that this is a float.
- `end_of_fade_in` - As a value for the time at which the fade in ends, it makes sense that this is a float.
- `energy` - As a measure (from 0 to 1) it makes sense that this is a float.
- `key` - The key is numeric in this case, for easier categorisation, so it makes sense that this is an int. 
- `key_confidence` - As a measure (from 0 to 1) it makes sense that this is a float.
- `loudness` - The value of the loudness is measured in a way such that it makes sense for it to be a float.
- `mode` - Similar to key, the mode of the song is numeric here for easier categorisation, so it makes sense that it is an int.
- `mode_confidence` - As a measure (from 0 to 1) it makes sense that this is a float.
- `release` - As a string, it makes sense that this is an object.
- `sections_confidence` - As a list of floats (confidences between 0 and 1), it makes sense that this is an object (we will need to consider expanding)
- `sections_start` - As a list of floats (start times), it makes sense that this is an object (we will need to consider expanding)
- `similar_artists` - As a list of strings, it makes sense that this is an object (we will need to consider expanding)
- `song_id` - As an alphanumeric value, it makes sense that this is an object. 
- `start_of_fade_out` - As a value for the time at which the fade out starts, it makes sense that this is a float.
- `tempo` - As a measure of beats per minute, it makes sense that this is a float. 
- `time_signature` - It makes sense for this to be an int because this value will be an integer
- `time_signature_confidence` - As a measure (from 0 to 1) of the confidence of the time signature, it makes sense that this is a float.
- `title` - As a string, it makes sense that this is an object.
- `track_id` - As an alphanumeric value, it makes sense that this is an object. 
- `year` - The year attribute is an int, which makes sense given that it is a whole-number year.
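
Several of the object columns hold lists that were written to the `.csv` as text. Judging only from the `.head()` output earlier, string lists look like Python literals and numeric lists look like printed NumPy arrays; under that assumption, the sketch below shows one way they might be expanded later. The helper names are hypothetical, and values that were abbreviated with `...` when written cannot be recovered this way.

```python
import ast

import numpy as np


def parse_str_list(text):
    """Parse text such as "['classic pop and rock']" into a Python list."""
    return ast.literal_eval(text)


def parse_numeric_array(text):
    """Parse text such as "[0.1 0.2 0.3]" into a float array.

    Returns None if the text contains "...", meaning the values were
    abbreviated when written and the full array is not available.
    """
    tokens = text.strip().strip("[]").split()
    if "..." in tokens:
        return None
    return np.array([float(t) for t in tokens])


print(parse_str_list("['classic pop and rock']"))
print(parse_numeric_array("[ 0.76749 1.04856 1.32007]"))
print(parse_numeric_array("[0.238 0.232 ... 0. 0. ]"))
```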


In [55]:
lmd_clean.shape

(31034, 28)

# The Echo Nest Taste Profile Subset Import and Cleaning <a id=The-Echo-Nest-Taste-Profile-Subset a>

## Dataset Description

The user data for our recommendation system comes from [The Echo Nest Taste Profile](http://millionsongdataset.com/tasteprofile/). It takes the form of a tab-separated file with three columns: `user`, `song` and `count`.

- `user` - contains an identification number for the user who listened to the song. 
- `song` - contains the song identification number, which can be paired with `song_id` in the Lakh Dataset
- `count` - contains the number of times the user listened to the song.

For the entire dataset, each user appears at least 10 times, meaning there are at least 10 songs associated with each user. Overall, the dataset contains 1,019,318 unique users, 384,546 unique MSD songs and 48,373,586 user play-count data points. 

The dataset comes in the form of a `.txt` file in which the three columns are separated by tabs. We will look at the beginning of the file using the command line code below (again, this is commented out because it has a very long run time).

In [56]:
# We will show the beginning of the .txt file to see if we correctly import it 
# ! head data/train_triplets.txt

We now know that there is no introduction or column titles, so using the `.read_csv` method with a tab separator and no header should give us the correct first rows. This code is written for direct use of the local data - it is very large so a copy has been saved in [this](https://drive.google.com/drive/folders/12nRIYn6zHiYHOLe4or-fGiN7FAAdxTnW) Google Drive folder.

In [57]:
# The dataset is tab separated, and has no column titles, so we name them below. 
plays = pd.read_csv("data/train_triplets.txt", sep="\t", header=None, names=['user', 'song', 'count'])

FileNotFoundError: [Errno 2] No such file or directory: 'data/train_triplets.txt'

From this read in we know that there are no missing values or duplicated values. We will also check that we have correctly read in our data by comparing the head of our command line code to the head of our pandas dataframe.

In [None]:
# The head method shows the first 5 values of the dataframe.
plays.head()

In [None]:
# The shape attribute shows us the number of data points we have in the dataset. 
plays.shape
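
The figures quoted in the dataset description (unique users, unique songs, and at least 10 songs per user) can be checked with a few aggregations. The sketch below runs on a tiny invented table with the same three columns; on the real `plays` dataframe the same expressions would apply, but the numbers printed here are for the toy data only.

```python
import pandas as pd

toy_plays = pd.DataFrame({
    "user": ["u1", "u1", "u1", "u2", "u2"],
    "song": ["s1", "s2", "s3", "s1", "s4"],
    "count": [1, 5, 2, 3, 1],
})

n_users = toy_plays["user"].nunique()
n_songs = toy_plays["song"].nunique()

# History depth: number of distinct songs each user has listened to.
songs_per_user = toy_plays.groupby("user")["song"].nunique()
min_depth = int(songs_per_user.min())

print(n_users, n_songs, min_depth)
```

The `songs_per_user` series is also the natural starting point for grouping users by history depth later in the project.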

## Datatypes 

We will check the data types of the columns in the Taste Profile by using the `.info()` method.

In [None]:
# Looking at the information for the columns
plays.info()

We can see from this output that `user` and `song` are object types, which makes sense given that these identifiers are both alphanumeric values. We can also see that `count` is an int, which is appropriate because the value should be a finite whole number for each row. 

At this point we could save our data as a `.csv` file in order to keep the full dataset in a format that could be  more easily used. However, the dataset is very large so we will not do this in the practical report of this project.

In [None]:
# Exporting the plays data to a csv file
# plays.to_csv("data/play_counts.csv", index=False)
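
Because the full triplets file is so large, an alternative to loading it in one go is to stream it in chunks and keep only the rows of interest. This is a sketch of that idea rather than what this notebook does: the function name `filter_triplets` and the chunk size are assumptions, and the demonstration reads from an in-memory string instead of `train_triplets.txt`.

```python
import io

import pandas as pd


def filter_triplets(path_or_buffer, keep_songs, chunksize=1_000_000):
    """Stream a tab-separated triplets file, keeping rows whose song is in keep_songs."""
    keep_songs = set(keep_songs)
    kept = []
    reader = pd.read_csv(
        path_or_buffer,
        sep="\t",
        header=None,
        names=["user", "song", "count"],
        chunksize=chunksize,
    )
    for chunk in reader:
        kept.append(chunk[chunk["song"].isin(keep_songs)])
    return pd.concat(kept, ignore_index=True)


# Toy file with four rows, read two rows at a time.
toy_file = io.StringIO("u1\ts1\t1\nu1\ts2\t5\nu2\ts1\t3\nu2\ts4\t1\n")
subset = filter_triplets(toy_file, keep_songs={"s1", "s4"}, chunksize=2)
print(subset)
```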

# Intersection Dataset Creation <a id=Intersection a>

The code above shows us there are 48,373,586 unique data points that can come from any song in the entire Million Song Dataset. However, we are only using a subset of the full dataset (tracks that are matched to audio files), so we will look into how this dataset interacts with the Lakh Dataset. To do so we will look at the intersection of the two datasets, starting with how many of the songs in the Taste Profile are present in the Lakh Dataset. 

We will create the intersection of songs that appear in both the Taste Profile and the Lakh Dataset. Doing this here will seriously reduce the size of our dataset and allow for much easier storage. 

In [None]:
# Unique songs in the Taste Profile
unique_play_songs = plays['song'].unique()
print("The number of unique songs in the Taste Profile is:", len(unique_play_songs))

In [None]:
# Unique songs in the Lakh Dataset
unique_lmd_songs = lmd_clean["song_id"].unique()
print("The number of unique songs in the Lakh Dataset is:", len(unique_lmd_songs))

The Lakh dataset has far fewer songs, so now we want to find how large the intersection between these two is. If it is large enough we will be able to use this as our dataset and have audio data as well as user data. To find this intersection we create sets for both song groups, giving us only the unique values. Then we will make a list of the songs that are in both by using the `&` set intersection operator. 

In [None]:
# Finding the unique songs that are in both 
intersect_songs = list(set(unique_play_songs) & set(unique_lmd_songs))

In [None]:
print("The number of songs in intersection of the Taste Profile and Lakh Dataset is:\n", len(intersect_songs) ,sep='')
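
As a small illustration of the set intersection used above, the toy example below (invented song IDs) compares the `&` operator on Python sets with NumPy's `np.intersect1d`, which returns the sorted common values of two arrays.

```python
import numpy as np
import pandas as pd

# .unique() returns NumPy arrays, as in the cells above.
play_songs = pd.Series(["s1", "s2", "s3", "s4", "s3"]).unique()
lmd_songs = pd.Series(["s3", "s4", "s5"]).unique()

# Set intersection; sorted only to make the order predictable.
via_sets = sorted(set(play_songs) & set(lmd_songs))

# NumPy alternative, which already returns sorted unique values.
via_numpy = np.intersect1d(play_songs, lmd_songs).tolist()

print(via_sets, via_numpy)
```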

Though this is not a huge dataset, we will continue on with it because if it was much larger we would begin to get a serious issue of computational load (especially if we begin to consider audio data). We will now look at how big the Lakh Subset will be if we only include these songs and how much user data we get if we only include these songs. 

First we will look at the Lakh Dataset with only the intersection songs. To do so we will use the `song_id` attribute and compare it to the `intersect_songs` list. This should act as a sanity check, as if there are no duplicated songs in the dataset we should see 11890 rows. 

In [None]:
# First we will check the number of data points with songs in the intersect_songs list
lmd_clean['song_id'].isin(intersect_songs).sum()

This matches our assumption, so we will now add these rows to a new dataframe. 

In [None]:
# Creating new dataframe
sect_lmd = lmd_clean[lmd_clean['song_id'].isin(intersect_songs)]

In [None]:
# Checking dataframe length 
len(sect_lmd)

Now we know that we have 11890 songs with matched audio data that appear in the Taste Profile. At this point we will save our cleaned subset as `sect_lmd.csv`, so that we can continue our exploration of this dataset.

In [None]:
# Saving the subset dataset
sect_lmd.to_csv("data/sect_lmd.csv", index=False)

We will now look into how many times each of these songs appears in the Taste Profile, by creating a subset of the Taste Profile that includes every instance of the songs in the `intersect_songs` list. First we will see how large this set is. 

In [None]:
# Counting the number of unique user-song-count combinations that contain a song in the intersection. 
plays['song'].isin(intersect_songs).sum()

In [None]:
sect_plays = plays[plays['song'].isin(intersect_songs)]

This is a very encouraging result, as it suggests that, on average, we have over 500 occurrences of each song in the Taste Profile. This suggests that we have a lot of data to create a specific categorisation/view of every track. We will also export this subset for later use.

In [None]:
sect_plays.to_csv("data/sect_plays.csv", index=False)
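
The average number of occurrences per song mentioned above is the number of rows divided by the number of distinct songs. The sketch below, on invented data, computes that average along with the per-song row counts and total play counts, which give a fuller picture than the mean alone.

```python
import pandas as pd

toy_sect_plays = pd.DataFrame({
    "user": ["u1", "u2", "u3", "u1"],
    "song": ["s1", "s1", "s1", "s2"],
    "count": [1, 4, 2, 3],
})

# How many user rows mention each song.
rows_per_song = toy_sect_plays["song"].value_counts()

# Average number of rows per distinct song.
mean_rows_per_song = len(toy_sect_plays) / toy_sect_plays["song"].nunique()

# Total number of plays per song, summing the count column.
total_plays_per_song = toy_sect_plays.groupby("song")["count"].sum()

print(rows_per_song.to_dict(), mean_rows_per_song, total_plays_per_song.to_dict())
```

A high mean can hide a long tail of songs with very few listeners, so looking at the distribution of `rows_per_song` is worthwhile before relying on the average.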

# Audio Intersection Matching <a id=Audio-Intersection a>

We now want to reduce our audio files to only include those with songs in `intersect_songs`. As the audio files are saved under their track names, not song names, we want to check that they are matched one-to-one, and that no tracks or songs appear multiple times. To do this we will compare the number of unique songs, unique tracks and unique song/track combinations. 

In [None]:
# Number of unique songs in the sect_lmd
print("The number of unique songs:", sect_lmd['song_id'].nunique())
print("The number of unique tracks:", sect_lmd['track_id'].nunique())
print("The number of unique song/track combinations:",len(sect_lmd.groupby(['song_id', 'track_id'])))
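
If the three counts printed above are equal, every song has exactly one track and every track exactly one song. The same condition can be wrapped in a small reusable check; the helper name `is_one_to_one` and the toy frames are assumptions made for this sketch.

```python
import pandas as pd


def is_one_to_one(df, left, right):
    """Return True if each `left` value pairs with exactly one `right` value and vice versa."""
    pairs = df[[left, right]].drop_duplicates()
    return bool(pairs[left].is_unique and pairs[right].is_unique)


good = pd.DataFrame({"song_id": ["S1", "S2"], "track_id": ["T1", "T2"]})
bad = pd.DataFrame({"song_id": ["S1", "S1"], "track_id": ["T1", "T2"]})

print(is_one_to_one(good, "song_id", "track_id"))
print(is_one_to_one(bad, "song_id", "track_id"))
```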

Now we know that each song matches exactly one track. If we pick the tracks from the audio files that are in the `sect_lmd` data, we know that they are associated with a song in the `intersect_songs` set, meaning we are only picking files that match our subset. We will likely do this iteration over the audio files from the command line, so we can create a file that contains only the names of the tracks for easier iteration. 

In [None]:
# Creating a list of the tracks in the subset
sect_tracks = sect_lmd['track_id']

In [None]:
# we do not actually need to create this subset, so we will not export the data
#sect_tracks.to_csv('data/sect_tracks.csv', index=False)
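
The file selection itself could also be done from Python rather than the command line. The sketch below assumes, hypothetically, that each audio file is named after its track ID with an `.mp3` extension somewhere under a root folder; the actual layout and file type are handled in the separate audio notebook and may differ.

```python
from pathlib import Path
import tempfile


def select_audio_files(audio_root, track_ids, suffix=".mp3"):
    """Return files under audio_root whose stem (name without extension) is a wanted track ID."""
    wanted = set(track_ids)
    return sorted(
        p for p in Path(audio_root).rglob(f"*{suffix}") if p.stem in wanted
    )


# Demonstration with empty placeholder files in a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp) / "A" / "A"
    folder.mkdir(parents=True)
    for name in ("TRAAAGR128F425B14B.mp3", "TRZZZZZ000000000000.mp3"):
        (folder / name).touch()
    selected = [
        p.name for p in select_audio_files(tmp, ["TRAAAGR128F425B14B"])
    ]

print(selected)
```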

The collection of the Audio files into a format that would be usable with the rest of our dataset will be done in a separate notebook **1B_audio_subset**. This is done because the data takes a very different format. 

# Conclusion <a id=Conclusion a>
    
After creating this intersection between the Echo Nest Taste Profile Subset and the Lakh MIDI Matched Subset, we are ready to move on to Exploratory Data Analysis. This is where we will begin looking more deeply into the two datasets and their usability in our aim of finding the best recommendation system for a specific type of user.
    
As a note, our initial exploration while cleaning the data showed us the vast range of attributes that can be found within the Million Song Dataset. This is very encouraging for future use, as there are many different potential additions to our dataset.