# Amadeus Further Analysis

## Problem Statement and Background

In this project, we attempt to discover the features behind the popular music of each generation. For instance, if Britney Spears' “Oops!... I Did It Again” made the charts in 2001 and The Beatles' “Real Love” made the charts in 1996, we want to see what made the music popular at the time: was it the timbre, the audio quality, or the lyrics? We will then attempt to build a model that predicts whether a song will be popular, and apply it to modern music.

Questions: 
1. What features best predict the popularity of a song?
2. Do those features change with time?
3. Do different features change how long it takes for a song to become popular?

This is an interesting problem because the answers could be used to create music that is more likely to become popular.

## Data Sources you Intend to Use?


We are using the Million Song Dataset, which can be found here: http://labrosa.ee.columbia.edu/millionsong/.

We are initially using the 10,000-song subset of the dataset. Once we are confident that we have a solid model, we will expand to all one million songs and run our model on an EC2 server.

## Data Joining/Cleaning You Did (4 points)
If data is being joined, describe the joining process and any problems with it - explain the metric used for fuzzy joins.

Explain how you will handle missing or duplicate keys. Describe the tools you used to examine/repair/clean the data.

If you found any statistical anomalies last time, explain how you plan to deal with them.


We started by extracting features from the dataset: duration, end of fade in, key, key confidence, loudness, start of fade out, tempo, time signature, and time signature confidence. These features are all already stored in a numeric format, so they were easy to extract.

We removed missing values (e.g. NaNs) as part of our model prediction, since no neutral value would be valid for them. There is no duplicate data (as promised by the data source), but we will use fuzzy joins to try to match potentially mislabelled data in the dataset to real-world data.

Currently, the data is denormalized into one HDF5 file per song. To process the songs, we have to iterate over every file to read its characteristics. For more general information, such as the names of all the artists, we use the databases provided in the .db format. We then extract more information by performing an equi-join on the idx_artist_id field (which is a primary key in the subset_track_metadata.db file, and a foreign key in most of the other databases). To do this, we use the sqlite3 library in Python, which lets us run SQL queries.
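
As a rough illustration of the equi-join described above, the sketch below uses Python's `sqlite3` module on an in-memory database. The table names (`songs`, `artist_term`), the column names, and the sample rows are made up for the example and are not taken from the actual `.db` files; only the join pattern is the point.

```python
import sqlite3

# In-memory stand-in for the real .db files; schema and rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE songs (track_id TEXT PRIMARY KEY, title TEXT, artist_id TEXT);
    CREATE TABLE artist_term (artist_id TEXT, term TEXT);
    INSERT INTO songs VALUES ('T1', 'Song A', 'AR1'), ('T2', 'Song B', 'AR2');
    INSERT INTO artist_term VALUES ('AR1', 'pop'), ('AR1', 'dance'), ('AR2', 'rock');
""")

# Equi-join: keep only the row pairs whose artist_id values are equal.
rows = conn.execute("""
    SELECT s.title, t.term
    FROM songs AS s
    JOIN artist_term AS t ON s.artist_id = t.artist_id
    ORDER BY s.title, t.term
""").fetchall()
conn.close()

print(rows)
```

With the real files, `sqlite3.connect` would take the path to a `.db` file, and SQLite's `ATTACH DATABASE` statement can make tables from a second file available to the same query.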

We used h5py to parse the HDF5 files, scipy to collect statistics, and the csv library to write the CSV output.
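
To illustrate the kind of check involved in examining the data for missing values, here is a minimal NumPy sketch that measures how complete each feature column is and drops the rows that contain a NaN. The array `features` and its values are invented for the example.

```python
import numpy as np

# Hypothetical feature matrix: rows are songs, columns are features.
features = np.array([
    [148.0, 6.0, -9.8],
    [210.5, np.nan, -7.1],
    [183.2, 2.0, -12.4],
    [95.7, 9.0, np.nan],
])

# Fraction of non-missing values in each column.
fill_rate = 1.0 - np.isnan(features).mean(axis=0)

# Keep only the rows that have no missing value at all.
complete_rows = features[~np.isnan(features).any(axis=1)]

print(fill_rate)
print(complete_rows.shape)
```

A column's fill rate can then be compared against a threshold before the feature is used in a model.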

## Analysis Approach (3 points)
Describe what analysis you are doing: This will probably comprise:

- Featurization: Explain how you generated features from the raw data. e.g. thresholding to produce binary features, binning, tf-idf, multinomial -> multiple binary features (one-hot encoding). 
- Describe any value transformations you did, e.g. histogram normalization.
- Modeling: Which machine learning models did you try? Which do you plan to try in the future?
- Performance measurement: How will you evaluate your model and improve featurization etc.


### Featurization

Our featurization is performed by the script below. We first extract a set of important features from the HDF5 data format: duration, key, loudness, end of fade in, start of fade out, tempo, and time signature. Most of these features required no additional cleaning because they are already in a workable format (float), so we simply piped them into our various models. Before including a feature in our model, we checked that at least 90% of its rows were filled. One particular feature, ‘timbre’, is stored as a 2D array whose size varies from song to song. To handle this, we flatten it to a 1D array and then ensure that all of the arrays are the same size.
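
The timbre handling can be sketched as follows. This assumes that each song's timbre is an array of shape (segments, 12) and that "the same size" is reached by truncating long arrays and zero-padding short ones; the target length and the padding strategy are illustrative assumptions, not necessarily what our script does.

```python
import numpy as np

def fixed_length_timbre(timbre, length):
    """Flatten a 2D timbre array and truncate or zero-pad it to `length` values."""
    flat = np.asarray(timbre, dtype=float).flatten()
    out = np.zeros(length)
    n = min(length, flat.size)
    out[:n] = flat[:n]
    return out

short_song = np.ones((2, 12))   # 2 segments: 24 values, padded with zeros
long_song = np.ones((10, 12))   # 10 segments: 120 values, truncated

a = fixed_length_timbre(short_song, 36)
b = fixed_length_timbre(long_song, 36)
print(a.shape, b.shape)
```

Truncation discards the end of long songs, so summary statistics per timbre dimension (for example, the mean of each of the 12 columns) would be another option to consider.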


### Modelling

We tried a number of machine learning models: k Nearest Neighbors, Random Forest, Linear Regression, and Logistic Regression. We had relatively poor performance with k Nearest Neighbors, and noticeably better performance with Random Forest, Logistic Regression, and Linear Regression. We plan to tune these models further to improve performance, and also to try unsupervised machine learning to cluster our songs on measures other than release year (potential clusters could be genre, band gender, or band age). To validate those clusters, we will also need to acquire a dataset that contains information about the bands in our dataset.
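
One tuning step we could try for k Nearest Neighbors is feature standardization, since kNN compares songs by distance and our features are on very different scales (for example, duration in seconds versus confidences between 0 and 1). We have not verified that this explains the weak kNN result; the snippet below is only a sketch of the transformation, with made-up values.

```python
import numpy as np

def standardize(train, test):
    """Scale columns to zero mean and unit variance using training statistics only."""
    mean = train.mean(axis=0)
    std = train.std(axis=0)
    std[std == 0] = 1.0  # Avoid dividing by zero for constant columns.
    return (train - mean) / std, (test - mean) / std

# Hypothetical rows of (duration, key_confidence).
X_train = np.array([[100.0, 0.2], [200.0, 0.4], [300.0, 0.6]])
X_test = np.array([[200.0, 0.4]])

X_train_s, X_test_s = standardize(X_train, X_test)
print(X_train_s.mean(axis=0))
print(X_test_s)
```

The statistics are computed on the training rows only, so that no information from the test set leaks into the model.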

### Performance Measurements

For our preliminary analysis, we aimed to complete one of our initial goals: determining when a song was released based on its features. Since the release years in our data span about 100 years, we count a prediction as successful if it is within 5 years of the correct release year. We use the accuracy under this criterion as the only indicator of our model's performance. To measure it, we hold out 10% of the data as a test set (a single train/test split rather than full k-fold cross-validation), which lets us test our model against data it was not trained on.
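
The performance measure described above can be written compactly. The sketch below counts a prediction as correct when it is less than 5 years away from the true release year; the function name and the sample years are invented for the example.

```python
import numpy as np

def within_tolerance_accuracy(predicted, actual, tolerance=5):
    """Fraction of predictions less than `tolerance` years from the true year."""
    predicted = np.asarray(predicted, dtype=float)
    actual = np.asarray(actual, dtype=float)
    return float(np.mean(np.abs(predicted - actual) < tolerance))

predicted_years = [1995, 2001, 1970, 1988]
actual_years = [1998, 2001, 1990, 1993]

accuracy = within_tolerance_accuracy(predicted_years, actual_years)
print(accuracy)
```

Because the comparison is strict, a prediction that is off by exactly 5 years counts as a miss.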


```python
# `setup` is our project module. This script expects it to provide os, glob,
# csv, pprint, np (numpy), parser (h5py) and msd_subset_data_path.
from setup import *

pp = pprint.PrettyPrinter(indent=2)

print_local = False  # Set to True to print progress; False suppresses output in the notebook.


def convert_to_csv():
    """
    Convert all HDF5 files into two CSV files whose rows line up by index:

    data_no_timbre.csv:   a header row, then one row of features per song
    target_no_timbre.csv: the release year of each song

    Songs without a release year are skipped.
    """
    header = ['duration',
              'end_of_fade_in',
              'key',
              'key_confidence',
              'loudness',
              'start_of_fade_out',
              'tempo',
              'time_signature',
              'time_signature_confidence']
    # Include a header row that describes the extracted features.
    data = [header]
    target = []

    for root, dirs, files in os.walk(msd_subset_data_path):
        for f in glob.glob(os.path.join(root, '*.h5')):
            local_print("Getting data from: " + str(f))
            with parser.File(f, 'r') as h5:
                year = get_year(h5)
                if not year:
                    continue  # No release year, so there is no target value.
                target.append([year])
                row = []
                for name in header:
                    local_print("Getting " + name + "...")
                    row.append(get_analysis_property(h5, name))
                local_print(str(len(row)) + " features acquired.")
                local_print(row)
                # Uncomment the line below to get the timbre as well.
                # row += [get_timbre(h5)]
                data.append(row)

    with open('data_no_timbre.csv', 'w+') as f:
        writer = csv.writer(f)
        writer.writerows(data)

    local_print(target)
    with open('target_no_timbre.csv', 'w+') as f:
        writer = csv.writer(f)
        writer.writerows(target)

    # Return the first song's values paired with their feature names.
    sample = data[1]
    return list(zip(sample, header))


def local_print(string):
    """Print only when verbose output is enabled."""
    if print_local:
        print(string)


def get_timbre(h5):
    """Return the song's segments_timbre 2D array flattened into a 1D list."""
    timbres = h5['analysis']['segments_timbre']
    local_print(len(timbres))
    timbres = np.array(timbres, dtype='f2')
    return list(timbres.flatten())


def get_analysis_property(h5, prop):
    """Return a field from /analysis/songs, or 0 if the stored value is empty or zero."""
    to_return = h5['/analysis/songs'][prop][0]
    return to_return if to_return else 0


def get_year(h5):
    """Return the release year, or 0 if it is missing."""
    to_return = h5['/musicbrainz/songs']['year'][0]
    return to_return if to_return else 0


sample = convert_to_csv()
print(sample)
```


```
[(148.03546, 'duration'), (0.14799999999999999, 'end_of_fade_in'), (6, 'key'), (0.16900000000000001, 'key_confidence'), (-9.843, 'loudness'), (137.91499999999999, 'start_of_fade_out'), (121.274, 'tempo'), (4, 'time_signature'), (0.38400000000000001, 'time_signature_confidence')]
```


## Preliminary Results (6 Points)
Summarize the results you have so far:

- Define suitable performance measures for your problem. Explain why they make sense, and what other measures you considered.
- Give the results. These might include accuracy scores, ROC plots and AUC, or precision/recall plots, or results of hypothesis tests.
- Describe any tuning that you did.
- Explain any hypothesis tests you did. Be explicit about the null and alternative hypothesis. Be very clear about the test you used and how you used it. Include all the experiment details (between/within-subjects, degrees-of-freedom etc). Be frugal with tests. Do not try many tests and report the best results.
- Use graphics! Please use visual presentation whenever possible. The next best option is a table. Try to avoid "inlining" important results.


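Here are the results of our machine learning experiments on our data. Accuracy is the fraction of test-set songs whose predicted release year is within 5 years of the true release year. The script that generates these measurements is included after the table for reference. Note that the figures in the table do not exactly match the script output shown further below (for example, Random Forest is 44.0% in the table and 41.0% in the output), and the neural network is not part of the script.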


| Model               | Prediction Accuracy on Test Set |
|---------------------|---------------------------------|
| k Nearest Neighbors | 25.4%                           |
| Neural Networks     | 25.0%                           |
| Linear Regression   | 44.9%                           |
| Logistic Regression | 45.2%                           |
| Random Forest       | 44.0%                           |

```python
# Note: sklearn.cross_validation was removed in scikit-learn 0.20; on newer
# releases, train_test_split lives in sklearn.model_selection instead.
from sklearn import cross_validation, ensemble, linear_model, feature_selection, neighbors
import csv
import numpy as np


def main():
    print("Loading Data...")
    with open('data_no_timbre.csv', 'r') as f:
        headers = next(csv.reader(f))  # Feature names, kept for reference.
    train = np.genfromtxt('data_no_timbre.csv', delimiter=',', dtype='f8', skip_header=1)
    train[np.isnan(train)] = 0  # Replace missing feature values with 0.
    target = np.genfromtxt('target_no_timbre.csv', delimiter=',', dtype='f8')

    print("Creating test and train data...")
    # Hold out 10% of the songs as the test set.
    X_train, X_test, y_train, y_test = cross_validation.train_test_split(
        train, target, test_size=0.1, random_state=0)

    # kNN classifier (each release year is treated as a class)
    print("Generating kNN...")
    knn = neighbors.KNeighborsClassifier()  # n_neighbors defaults to 5
    knn.fit(X_train, y_train)

    # Random Forest
    print("Generating Random Forest...")
    rf = ensemble.RandomForestClassifier(n_estimators=1000, n_jobs=8)
    rf.fit(X_train, y_train)

    # Linear Regression Model
    print("Generating Linear Regression...")
    lr = linear_model.LinearRegression(n_jobs=8)
    lr.fit(X_train, y_train)

    # Logistic Regression Model, wrapped in recursive feature elimination (RFE)
    # that keeps the 2 highest-ranked features.
    print("Generating Logistic Regression...")
    lr2 = linear_model.LogisticRegression()
    rfe = feature_selection.RFE(lr2, n_features_to_select=2)
    rfe.fit(X_train, y_train)

    print("kNN Accuracy: ")
    model_accuracy(knn, X_test, y_test)
    print("Random Forest Accuracy: ")
    model_accuracy(rf, X_test, y_test)
    print("Linear Regression Accuracy: ")
    model_accuracy(lr, X_test, y_test)
    print("Logistic Regression Accuracy: ")
    model_accuracy(rfe, X_test, y_test)


def model_accuracy(model, test_set, test_target):
    """Print how many predictions are less than 5 years from the true release year."""
    guesses = model.predict(test_set)
    right = 0

    for guess, actual in zip(guesses, test_target):
        if abs(guess - actual) < 5:
            right += 1

    print('Number of close guesses are: ' + str(right))
    print('Total accuracy = ' + str(right * 1.0 / len(test_target)))


if __name__ == "__main__":
    main()
```

```
Loading Data...
Creating test and train data...
Generating kNN...
Generating Random Forest...
Generating Linear Regression...
Generating Logistic Regression...
kNN Accuracy:
Number of close guesses are: 119
Total accuracy = 0.254273504274
Random Forest Accuracy:
Number of close guesses are: 192
Total accuracy = 0.410256410256
Linear Regression Accuracy:
Number of close guesses are: 193
Total accuracy = 0.412393162393
Logistic Regression Accuracy:
Number of close guesses are: 218
Total accuracy = 0.465811965812
```


## Final Analysis, any Obstacles (3 Points)
Describe the final analysis you plan to do, and outline any obstacles you foresee:

- Scale: how much data will you use?
- Model complexity: What complexity of models will you use, this is relevant for models like clustering, factor models, Random Forests etc.
- What tools will you use?
- Estimate of processing time? You should be able to form an estimate of how much time you need on your chosen tools.



### Scale

We've been running our preliminary analysis on a subset of 10,000 songs (2.7 GB). We will run our final analysis on all 1,000,000 songs (~270 GB).

### Model Complexity

So far we've used fairly basic configurations for our Random Forest and other ML algorithms. As we continue to explore our models, we'll refine them to try to improve our predictions.

### Tools

We will mainly use **scikit-learn** for our machine learning and modelling needs. We'll also likely use **D3** and **matplotlib** to create visualizations of the factors that we are considering.

### Estimate of Processing Time

Converting our HDF5 data into a usable format currently takes about 5 minutes, and we expect that to grow to around 9 hours on the full dataset. The model generation time will likely also increase proportionately, from 2 minutes to around 3.5 hours. Our total processing time will therefore likely be about half a day for the 270 GB of data we expect to process. We will perform this full analysis on an EC2 instance provided by the course staff.
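
The estimate above assumes that processing time grows linearly with the number of songs. Here is a minimal sketch of that arithmetic, using the timings we measured on the subset; linear scaling is an assumption, and model training in particular may scale differently.

```python
SUBSET_SONGS = 10000
FULL_SONGS = 1000000

def scale_minutes_to_hours(minutes_on_subset):
    """Linearly extrapolate a subset timing (minutes) to the full dataset (hours)."""
    scale = FULL_SONGS / float(SUBSET_SONGS)  # 100 times more songs
    return minutes_on_subset * scale / 60.0

conversion_hours = scale_minutes_to_hours(5)  # HDF5 to CSV conversion
training_hours = scale_minutes_to_hours(2)    # model generation
total_hours = conversion_hours + training_hours

print(round(conversion_hours, 1), round(training_hours, 1), round(total_hours, 1))
```

This gives roughly 8.3, 3.3, and 11.7 hours, in line with the rounded figures of about 9 hours, 3.5 hours, and half a day quoted above.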