# Making Sense of Data

Dear Diary, <br/>
It is Saturday, September 21. I am sitting in the Bean🥫.<br /><br/>
The purpose of this notebook is to make sense of the data contained in the [UCI FMA Music Analysis Dataset](https://archive.ics.uci.edu/ml/datasets/FMA:+A+Dataset+For+Music+Analysis): **genres and tracks**. <br/>For genres, we are interested in exploring the **colors** associated with each sub-genre and the **hierarchy** organizing the 164 genres. For tracks, we are interested in mapping tracks to genres to find the genres with the most songs, which we will use for our initial model. We also want to explore associated track metadata, such as **year**.

<hr />

# Genres
The `raw_genres.csv` file was small enough that it was easier to analyze the data in Google Sheets. Sorry to betray the CS community by using layman's tools.

The file had 164 rows with the following columns:

| genre_id | genre_color | genre_handle | genre_parent_id | genre_title |
| :-: | :-: | :-: | :-: | :-: |
| 46 | #CC3300 | Latin_America | 2 | Latin America |
| ... | ... | ... | ... | ... |

### Comments:
- Parent genres did not have a `genre_parent_id`.
- The rows were in a haphazard order; they were sorted neither numerically by `genre_id`/`genre_parent_id` nor alphabetically by `genre_handle`/`genre_title`.
- I did not check whether the rows were sorted by `genre_color`; even if they were, that ordering is not useful to me.

### In Google Sheets, I did the following:
1. I sorted rows by `genre_parent_id` to get a sense of which genres had the most breadth (the most sub-genres).
2. This moved all the parent rows (the ones with an empty `genre_parent_id`) to the bottom, and I pulled them out to the side into their own sub-table.
3. I created a new column for the parent sub-table, `num sub-genres`.
4. For each parent, I counted the rows whose `genre_parent_id` matched its `genre_id` and added that count to the parent table.
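
If I ever redo those spreadsheet steps in pandas, a minimal sketch could look like the one below. It assumes that parent genres show up with a missing (`NaN`) `genre_parent_id` and that only direct children count as sub-genres. The function name `count_sub_genres` and the toy rows are made up for illustration (only the Latin America row comes from the file); the real input would be `raw_genres.csv`.

```python
import pandas as pd


def count_sub_genres(genres: pd.DataFrame) -> pd.DataFrame:
    """Return one row per parent genre with its number of direct sub-genres."""
    # Assumption: parent genres have no genre_parent_id (NaN after loading).
    is_parent = genres["genre_parent_id"].isna()
    parents = genres.loc[is_parent, ["genre_id", "genre_title"]].copy()

    # Count how many child rows point at each parent id.
    counts = genres.loc[~is_parent, "genre_parent_id"].astype(int).value_counts()

    # Parents that never appear as a parent id get 0 sub-genres.
    parents["num_sub_genres"] = (
        parents["genre_id"].map(counts).fillna(0).astype(int)
    )
    return parents.sort_values("num_sub_genres", ascending=False).reset_index(drop=True)


# Toy rows for illustration only; ids 101 and 102 are invented.
toy = pd.DataFrame(
    {
        "genre_id": [2, 12, 46, 101, 102],
        "genre_parent_id": [float("nan"), float("nan"), 2, 2, 12],
        "genre_title": ["International", "Rock", "Latin America", "Toy A", "Toy B"],
    }
)

result = count_sub_genres(toy)
print(result)
```

On the toy rows, International ends up with 2 sub-genres and Rock with 1, which mirrors what the manual count in Sheets does on the full file.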

<hr />

### Results:

|  Top Genres (sub-genres) | Graph |
| :- | :-: |
| <ol><li>International (15)</li><li>Rock (15)</li><li>Electronic (14)</li><li>Experimental (14)</li><li>Spoken (8) </li></ol> | <img src="images/sub_genre_pie_uci_fma.png" /> |

<!--
| genre_id | genre_color | genre_handle | genre_parent_id | genre_title | num sub-genres |
| :-: | :-: | :-: | :-: | :-: |	:-: |
| 2	|#CC3300|	International |	|	International	|15|
|3|	#000099	|Blues	|	|Blues	|1|
|4|	#990099	|Jazz	|	|Jazz	|6|
|5|	#8A8A65	|Classical|	|	Classical|	7|
|8|	#665666	|Old-Time__Historic	|	|Old-Time / Historic|	0|
|9|	#663366	|Country	|	|Country	|4|
|10| #009900|	Pop	|	|Pop	|2|
|12	|#840000|	Rock	| |	Rock|	15|
|14	|#330033|	Soul-RB	|	| Soul-RnB|	2|
|15	|#FF6600|	Electronic|	|	Electronic	|14|
|17	|#5E6D3F|	Folk	|	|Folk|	5|
|20	|#006699|	Spoken	| |	Spoken|	8|
|21	|#CC0000|	Hip-Hop	| 	|Hip-Hop|	7|
|38	|#dddd00|	Experimental|	|	Experimental|	14|
|1235|	#000000|	Instrumental|	|	Instrumental	|3|
-->    

# Tracks
The file containing track data is too big to open in Google Sheets (wah), so let's do some pandas parsing instead.
The goal here is to see whether the top genres above (ranked by number of sub-genres) are also the genres with the most tracks. I'll start by loading `raw_tracks.csv` into a pandas DataFrame:

In [40]:
import numpy as np
import pandas as pd

# Change this filepath if running on another machine; this one is local to mine.
tracks = pd.read_csv("/Users/mkarroqe/Desktop/github/dancing-screen/fma_metadata/raw_tracks.csv")

Next, I want to create a dictionary that maps genres to number of tracks with those genres.
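
As a rough sketch of what that dictionary could look like, the snippet below counts tracks per genre title. It assumes each `track_genres` entry is stored as a string holding a Python-style list of dicts (like the Folk example in the cell below), and that a track tagged with several genres counts once for each of them. The helper name `count_tracks_per_genre` and the toy entries are hypothetical; on the real data the input would be `tracks['track_genres']`.

```python
import ast
from collections import Counter

import pandas as pd


def count_tracks_per_genre(track_genres: pd.Series) -> dict:
    """Map each genre title to the number of tracks tagged with it."""
    counts = Counter()
    # Skip tracks with no genre information at all.
    for raw in track_genres.dropna():
        # Assumption: each entry is a stringified list of dicts.
        for genre in ast.literal_eval(raw):
            counts[genre["genre_title"]] += 1
    return dict(counts)


# Toy entries for illustration only.
toy_genres = pd.Series(
    [
        "[{'genre_id': '17', 'genre_title': 'Folk'}]",
        "[{'genre_id': '17', 'genre_title': 'Folk'}, {'genre_id': '12', 'genre_title': 'Rock'}]",
        None,
    ]
)

genre_counts = count_tracks_per_genre(toy_genres)
print(genre_counts)  # {'Folk': 2, 'Rock': 1}
```

I went with `ast.literal_eval` rather than `json.loads` because the sample entry uses single quotes, which is not valid JSON.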

In [45]:
headers = ["track_id","album_id","album_title","album_url","artist_id","artist_name","artist_url","artist_website","license_image_file","license_image_file_large","license_parent_id","license_title","license_url","tags","track_bit_rate","track_comments","track_composer","track_copyright_c","track_copyright_p","track_date_created","track_date_recorded","track_disc_number","track_duration","track_explicit","track_explicit_notes","track_favorites","track_file","track_genres","track_image_file","track_information","track_instrumental","track_interest","track_language_code","track_listens","track_lyricist","track_number","track_publisher","track_title","track_url"]
print("num headers:", len(headers))
print("num cols in row 1:", len(tracks.iloc[1]))

# Show the second row (index 1) as a sample record.
print(tracks.iloc[1])

# Example entry in tracks['track_genres']:
#     [{'genre_id': '17', 'genre_title': 'Folk', 'genre_url': 'http://freemusicarchive.org/genre/Folk/'}]

num headers: 39
num cols in row 1: 39
track_id                                                                    3
album_id                                                                    1
album_title                                              AWOL - A Way Of Life
album_url                   http://freemusicarchive.org/music/AWOL/AWOL_-_...
artist_id                                                                   1
artist_name                                                              AWOL
artist_url                            http://freemusicarchive.org/music/AWOL/
artist_website                        http://www.AzillionRecords.blogspot.com
license_image_file          http://i.creativecommons.org/l/by-nc-sa/3.0/us...
license_image_file_large    http://fma-files.s3.amazonaws.com/resources/im...
license_parent_id                                                           5
license_title               Attribution-NonCommercial-ShareAlike 3.0 Inter...
license_url               