# iTunes music library analysis: Novelty Detection
This is the fourth post in a series on analysing my iTunes music library with Scikit-Learn tools.   
The purpose of the analysis is to detect tracks in my iTunes music library that would suit my fitness practices, which are "cycling", "yoga", and "ballet". To solve that problem I use machine learning classification algorithms.    

The previous posts cover the following steps:
1. [00_Summary]() — Summary of this analysis, its goals and methods, installation notes.
2. [01_Data_preparation]() — Data gathering and cleaning.
3. [02_Data_visualisation]() — Visualisation and overview of data.
4. [03_Preprocessing]() — Data preprocessing to use it as input for Scikit-learn machine learning algorithms.

After the previous steps I have two databases (DBs): 
* the training DB contains 88 tracks, each labeled with one of the three classes: "ballet", "cycling", "yoga";
* the test DB contains 444 unlabeled tracks. 

The three classes in the training set don't cover all the kinds of music in my iTunes music library (the test DB). Because of that, I can't apply a classification algorithm to the whole test set: it would assign every track, including the irrelevant ones, to one of the three classes.  

In this notebook I identify the tracks in the unlabeled dataset that fit the classes in the training dataset and eliminate the tracks that don't fit any of them. Only then will I perform classification.  

As a shortcut, this notebook imports my module "data_processing.py", which performs the steps from the [01_Data_preparation]() and [03_Preprocessing]() notebooks.  
I start by importing the modules required in this notebook.

In [14]:
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
from IPython.display import display
import pandas as pd
import numpy as np
from sqlitedict import SqliteDict

# import my module from the previous notebook
import data_processing as prs

# set seaborn plot defaults
import seaborn as sns
sns.set(palette="husl")
sns.set_context("notebook")
sns.set_style("ticks")

# format floating point numbers
# within pandas data structures
pd.set_option('float_format', '{:.2f}'.format)

#### Preprocessing data

In [17]:
# training DB
train_db = SqliteDict('./labeled_tracks')
# create a df with training data
train_df = prs.read_db_in_pandas(train_db)

# test DB
test_db = SqliteDict('./itunes_tracks')
# create a df with test data
test_df = prs.read_db_in_pandas(test_db)

# convert both df's to numpy arrays with standardized features
train_std, test_std = prs.standardize_data(prs.convert_df_to_array(train_df), 
                                           prs.convert_df_to_array(test_df))


NameError: global name 'np' is not defined
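
The helper `prs.standardize_data` lives in "data_processing.py" and is not shown in this notebook. As a rough illustration only, a standardization step of this kind could look like the sketch below. It assumes the scaler is fitted on the training data and then applied to both arrays; the function body, the toy arrays and their values are hypothetical stand-ins, not the module's actual code.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler


def standardize_data(train, test):
    """Hypothetical stand-in for prs.standardize_data (an assumption)."""
    # learn per-feature mean and standard deviation from the training data only
    scaler = StandardScaler().fit(train)
    # apply the same transformation to both arrays
    return scaler.transform(train), scaler.transform(test)


# toy arrays with the same row counts as the two DBs (values are made up)
rng = np.random.RandomState(0)
toy_train = rng.normal(loc=120.0, scale=15.0, size=(88, 3))
toy_test = rng.normal(loc=110.0, scale=30.0, size=(444, 3))

toy_train_std, toy_test_std = standardize_data(toy_train, toy_test)
print(toy_train_std.shape, toy_test_std.shape)
```

Fitting the scaler on the training set alone keeps the test data from influencing the feature scaling, which matters here because the test set is expected to contain tracks unlike anything in the training set.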

For this purpose I use the unsupervised [One-Class SVM](http://scikit-learn.org/stable/modules/generated/sklearn.svm.OneClassSVM.html#sklearn.svm.OneClassSVM) algorithm. One-Class SVM is used for novelty detection: given a set of samples (the training set), it learns a soft boundary around that set and then classifies new points (the test set) as belonging to the set or not. Importantly, the algorithm assumes that the training data is not polluted by outliers.  
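
To make the idea of a soft boundary concrete, here is a minimal sketch on synthetic 2D data rather than on the track features. The cluster, the point locations, and the 'gamma' and 'nu' values are arbitrary choices for illustration and are not the settings used in this analysis.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(42)

# synthetic "training set": one tight cluster around the origin
X_train = 0.3 * rng.randn(200, 2)

# new points: some drawn from the same cluster, some far away from it
X_near = 0.3 * rng.randn(20, 2)
X_far = rng.uniform(low=4, high=6, size=(20, 2))

# learn the boundary from the training cluster only
model = OneClassSVM(kernel="rbf", gamma=0.5, nu=0.05).fit(X_train)

near_pred = model.predict(X_near)  # mostly 1 (inside the boundary)
far_pred = model.predict(X_far)    # -1 (outside the boundary)

print("near points flagged as inliers:", np.count_nonzero(near_pred == 1))
print("far points flagged as outliers:", np.count_nonzero(far_pred == -1))
```

Note that no labels are passed to `fit`: the model only sees examples of the "known" kind and treats anything sufficiently different as novel.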

I use the following parameters: the radial basis function ('rbf') kernel and a 'nu' value chosen by trial and error. 'nu' is an upper bound on the fraction of training errors and a lower bound on the fraction of support vectors.
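
The effect of 'nu' can be explored with a small experiment like the one below. It is a sketch on random synthetic data, with an arbitrary set of 'nu' values and `gamma="scale"` chosen only for illustration. For each value it reports the fraction of training points predicted as outliers and the fraction of points that become support vectors. In practice the observed outlier fraction can deviate slightly from the theoretical bound because the solver stops at a numerical tolerance.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(0)
X = rng.randn(500, 2)  # synthetic data, not the track features

results = []
for nu in (0.01, 0.07, 0.3):
    model = OneClassSVM(kernel="rbf", gamma="scale", nu=nu).fit(X)
    # fraction of training points that fall outside the learned boundary
    outlier_fraction = np.mean(model.predict(X) == -1)
    # fraction of training points used as support vectors
    sv_fraction = len(model.support_) / float(len(X))
    results.append((nu, outlier_fraction, sv_fraction))
    print("nu={:.2f}  training outliers={:.3f}  support vectors={:.3f}"
          .format(nu, outlier_fraction, sv_fraction))
```

A larger 'nu' lets the boundary exclude more of the training data, so it is the main knob for how strict the novelty filter is.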

In [16]:
from sklearn import svm

# fit the model
svm_model = svm.OneClassSVM(kernel="rbf", nu=0.07)
svm_model.fit(train_std)

NameError: name 'train_std' is not defined

New observations (the test data) can now be sorted into inliers and outliers with the predict method. Inliers are labeled 1 and outliers are labeled -1.

In [None]:
# make prediction
test_novelty_pred = svm_model.predict(test_std)

# number of tracks outside the set boundaries
n_error_test = test_novelty_pred[test_novelty_pred == -1].size

print("Number of tracks that match the training set: {0}."
      "\nNumber of tracks outside set boundaries: {1}."
      .format((test_novelty_pred.size - n_error_test),
              n_error_test))
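
Once the inliers are known, the irrelevant tracks can be dropped before classification. The sketch below shows one way to do that with a boolean mask. It uses a toy DataFrame and a toy prediction array as hypothetical stand-ins for `test_df` and `test_novelty_pred`, and it assumes the rows of the DataFrame are in the same order as the rows passed to predict. The column names and values are made up.

```python
import numpy as np
import pandas as pd

# hypothetical stand-ins for test_df and test_novelty_pred
toy_df = pd.DataFrame({"name": ["track_a", "track_b", "track_c", "track_d"],
                       "bpm": [128, 60, 95, 174]})
toy_pred = np.array([1, -1, 1, -1])

# True for tracks inside the learned boundary
inlier_mask = toy_pred == 1

# keep only the tracks that resemble the training set
candidates_df = toy_df[inlier_mask].reset_index(drop=True)
print(candidates_df)
```

The same mask can be applied to the standardized array, so that the features and the track metadata stay aligned for the classification step.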