# Amadeus Further Analysis

## Problem Statement and Background

In this project, we attempt to discover the features behind the popular music of each generation. For instance, if Britney Spears' “Oops I Did It Again” made the charts in 2001 and The Beatles' “Real Love” made the charts in 1996, we want to see what made each song popular at the time: was it the timbre, the audio quality, or the lyrics? We will then build a model that predicts whether a song will be popular, and attempt to apply it to modern music.

Questions: 
1. What features best predict the popularity of a song?
2. Do those features change with time?
3. Do different features change how long it takes for a song to become popular?

This problem is interesting because a model of what makes songs popular could guide the creation of music that is more likely to become popular.

## Data Sources you Intend to Use?


We are using the Million Song Dataset, which can be found here: http://labrosa.ee.columbia.edu/millionsong/.

We are starting with the 10,000-song subset of the dataset. Once we are confident in our model, we will expand to all one million songs and run the model on an EC2 server.

## Data Joining/Cleaning You Did (4 points)
If data is being joined, describe the joining process and any problems with it - explain the metric used for fuzzy joins.

Explain how you will handle missing or duplicate keys. Describe the tools you used to examine/repair/clean the data.

If you found any statistical anomalies last time, explain how you plan to deal with them.
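
As one possible way to handle missing and duplicate keys, the sketch below drops songs without a year and keeps only the first song seen for each track ID. It assumes each song is represented as a dict with `track_id` and `year` keys; that representation and the name `clean_songs` are hypothetical and not taken from the project code.

```python
def clean_songs(songs):
    """Drop songs without a year and keep the first song per track_id.

    `songs` is assumed to be an iterable of dicts with 'track_id' and
    'year' keys (a hypothetical in-memory representation).
    """
    seen = set()
    cleaned = []
    for song in songs:
        # A missing or zero year is treated as unknown.
        if not song.get('year'):
            continue
        # Skip any track ID that has already been kept.
        if song['track_id'] in seen:
            continue
        seen.add(song['track_id'])
        cleaned.append(song)
    return cleaned
```

Keeping the first occurrence is an arbitrary choice; another rule, such as keeping the row with the most complete fields, may suit the data better.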


## Analysis Approach (3 points)
Describe what analysis you are doing. This will probably comprise:

- Featurization: Explain how you generated features from the raw data. e.g. thresholding to produce binary features, binning, tf-idf, multinomial -> multiple binary features (one-hot encoding). 
- Describe any value transformations you did, e.g. histogram normalization.
- Modeling: Which machine learning models did you try? Which do you plan to try in the future?
- Performance measurement: How will you evaluate your model and improve featurization, etc.?
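
As a minimal sketch of two of the featurization options listed above, the snippet below one-hot encodes the `key` field and bins `tempo` into ranges. It assumes `key` is an integer from 0 to 11; the helper names and the tempo bin edges are illustrative choices, not values taken from this project.

```python
import bisect

NUM_KEYS = 12  # assumes key is an integer pitch class, 0 through 11

# Hypothetical bin edges in beats per minute; tune them to the data.
TEMPO_EDGES = [80, 100, 120, 140]


def one_hot_key(key):
    """Return a 12-element list with a 1 at the position of `key`."""
    vector = [0] * NUM_KEYS
    vector[int(key)] = 1
    return vector


def tempo_bin(tempo):
    """Return the index of the tempo range that `tempo` falls into."""
    return bisect.bisect_right(TEMPO_EDGES, tempo)


def featurize(key, tempo):
    """Concatenate the one-hot key with a one-hot tempo bin."""
    bins = [0] * (len(TEMPO_EDGES) + 1)
    bins[tempo_bin(tempo)] = 1
    return one_hot_key(key) + bins
```

One-hot encoding avoids treating `key` as an ordered quantity, which matters for models that would otherwise read key 11 as "larger" than key 0.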


## Preliminary Results (6 Points)
Summarize the results you have so far:

- Define suitable performance measures for your problem. Explain why they make sense, and what other measures you considered.
- Give the results. These might include accuracy scores, ROC plots and AUC, or precision/recall plots, or results of hypothesis tests.
- Describe any tuning that you did.
- Explain any hypothesis tests you did. Be explicit about the null and alternative hypothesis. Be very clear about the test you used and how you used it. Include all the experiment details (between/within-subjects, degrees-of-freedom etc). Be frugal with tests. Do not try many tests and report the best results.
- Use graphics! Please use visual presentation whenever possible. The next best option is a table. Try to avoid "inlining" important results.
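
To make some of the measures named above concrete, here is a small sketch of accuracy, precision, and recall computed from binary labels. It assumes that 1 marks a popular song and 0 an unpopular one; that labeling and the function name `binary_metrics` are assumptions for illustration. In practice, `sklearn.metrics` offers `accuracy_score`, `precision_score`, and `recall_score` for the same purpose.

```python
def binary_metrics(y_true, y_pred):
    """Return accuracy, precision, and recall for 0/1 labels.

    Assumes 1 means "popular" and 0 means "not popular".
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    return {
        'accuracy': (tp + tn) / len(pairs),
        # Report 0.0 when a ratio is undefined (no predicted or actual positives).
        'precision': tp / (tp + fp) if tp + fp else 0.0,
        'recall': tp / (tp + fn) if tp + fn else 0.0,
    }
```

If popular songs turn out to be rare in the data, accuracy alone can look high for a model that never predicts "popular", which is one reason to report precision and recall alongside it.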


In [None]:
from setup import *  # expected to provide os, glob, csv, np, pprint, parser, and msd_subset_data_path
from sklearn import datasets

pp = pprint.PrettyPrinter(indent=2)


def convert_to_csv():
    """
    Convert the HDF5 files of the subset into two CSV files with matching rows.

    data_no_timbre.csv, one row per song that has a known year:
    duration, end_of_fade_in, key, key_confidence, loudness,
    start_of_fade_out, tempo, time_signature, time_signature_confidence

    target_no_timbre.csv, one row per song:
    year

    The indices of the two match up.
    """
    # Song-level analysis fields, in the order they are written to each row.
    properties = [
        'duration', 'end_of_fade_in', 'key', 'key_confidence', 'loudness',
        'start_of_fade_out', 'tempo', 'time_signature',
        'time_signature_confidence',
    ]
    data = []
    target = []

    for root, dirs, files in os.walk(msd_subset_data_path):
        for f in glob.glob(os.path.join(root, '*.h5')):
            print("Getting data from: " + str(f))
            with parser.File(f, 'r') as h5:
                year = get_year(h5)
                # get_year returns 0 when the year is missing; skip those songs.
                if year:
                    target.append([year])
                    row = []
                    for prop in properties:
                        print("Getting " + prop + "...")
                        row.append(get_analysis_property(h5, prop))
                    print(str(len(row)) + " features acquired.")
                    print(row)
                    # Uncomment the line below to get the timbre as well.
                    # row += [get_timbre(h5)]
                    data.append(row)

    # newline='' (Python 3) keeps the csv module from writing blank lines on Windows.
    with open('data_no_timbre.csv', 'w', newline='') as f:
        csv.writer(f).writerows(data)

    print(target)
    with open('target_no_timbre.csv', 'w', newline='') as f:
        csv.writer(f).writerows(target)


def get_timbre(h5):
    """Return the song's segment timbre matrix flattened into one list."""
    timbres = h5['analysis']['segments_timbre']
    print(len(timbres))
    # float16 keeps the output small at the cost of precision.
    timbres = np.array(timbres, dtype='f2')
    return list(timbres.flatten())


def get_analysis_property(h5, prop):
    """Return a song-level analysis field, or 0 if the stored value is falsy."""
    value = h5['/analysis/songs'][prop][0]
    return value if value else 0


def get_year(h5):
    """Return the MusicBrainz year of the song, or 0 if it is missing."""
    value = h5['/musicbrainz/songs']['year'][0]
    return value if value else 0


convert_to_csv()
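
The two CSV files written by `convert_to_csv` can be read back into aligned feature and target lists for modeling. The sketch below assumes every feature value is numeric and that both files have the same number of rows; `load_rows` is a hypothetical helper, not part of the project code.

```python
import csv


def load_rows(data_file, target_file):
    """Read features and years from two CSV sources with matching rows.

    Each argument is an open text file (or any iterable of CSV lines).
    Returns (features, years), where features[i] belongs to years[i].
    """
    features = [[float(v) for v in row]
                for row in csv.reader(data_file) if row]
    # Go through float first in case the year was written as e.g. "1996.0".
    years = [int(float(row[0]))
             for row in csv.reader(target_file) if row]
    if len(features) != len(years):
        raise ValueError("data and target files have different row counts")
    return features, years
```

To use it, open `data_no_timbre.csv` and `target_no_timbre.csv` with `open(..., newline='')` and pass both file objects to `load_rows`. The row-count check guards the assumption that the indices of the two files match up.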

## Final Analysis, any Obstacles (3 Points)
Describe the final analysis you plan to do, and outline any obstacles you foresee:

- Scale: How much data will you use?
- Model complexity: What complexity of models will you use? This is relevant for models like clustering, factor models, Random Forests, etc.
- What tools will you use?
- Estimate of processing time? You should be able to form an estimate of how much time you need on your chosen tools.
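
One simple way to form a processing-time estimate is to time the work on a small sample and extrapolate linearly to the full dataset. The function below is a sketch under that assumption; per-file cost may not stay constant once disk or network I/O dominates, so the result is a rough guide only. The name `estimate_total_seconds` is hypothetical.

```python
def estimate_total_seconds(sample_seconds, sample_files, total_files):
    """Linearly extrapolate a measured sample time to `total_files` files."""
    if sample_files <= 0:
        raise ValueError("sample_files must be positive")
    return sample_seconds * total_files / sample_files
```

As a purely hypothetical illustration, if converting the 10,000-song subset took 600 seconds, the same rate applied to 1,000,000 songs would give 60,000 seconds, or roughly 17 hours.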

