# R&D engineer test

## Assignment

Build an HTTP API that receives two sound-recording IDs as input and returns a JSON response classifying whether the two IDs correspond to the same actual sound recording. When two SRs are the same, the classifier outputs the class `"valid"`; otherwise it outputs `"invalid"`.



### Machine learning approach

The candidate is not expected to implement hand-crafted rules for the classification. Instead, we provide a groundtruth file from which a classifier can be trained automatically. This groundtruth gives the actual relationship between two given sound-recording IDs (also called `source_id`).

The metadata for each sound-recording ID can be found in the SQLite3 database file `db.db`.

We suggest training a simple classifier using the following four features:
* Title similarity
* Artists similarity
* ISRC coincidence
* Contributors similarity

Note: string similarities can be easily computed with the Python package `fuzzywuzzy`.
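
As a rough illustration of the four suggested features, the sketch below scores one pair of records. It uses the standard-library `difflib.SequenceMatcher` as a stand-in for `fuzzywuzzy` so that it runs without third-party packages; the record dictionaries, their values, and the `similarity` and `pair_features` helpers are all hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Return a 0-100 similarity score; missing values score 0."""
    if not a or not b:
        return 0
    return int(round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio()))


def pair_features(q, m):
    """Build the four suggested features for a (query, match) record pair."""
    return {
        "title_similarity": similarity(q["title"], m["title"]),
        "artists_similarity": similarity(q["artists"], m["artists"]),
        "contributors_similarity": similarity(q["contributors"], m["contributors"]),
        # ISRC coincidence: 1 only when both ISRCs are present and equal.
        "isrc_coincidence": int(bool(q["isrcs"]) and q["isrcs"] == m["isrcs"]),
    }


# Hypothetical records, made up for illustration only.
q_record = {"title": "Blue Sky", "artists": "The Examples",
            "contributors": "J. Doe", "isrcs": "XX0000000001"}
m_record = {"title": "Blue Sky (Remastered)", "artists": "The Examples",
            "contributors": None, "isrcs": "XX0000000001"}
example_features = pair_features(q_record, m_record)
```

Each pair then becomes one numeric row that a classifier can be trained on, with the groundtruth tag as the label.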

### API


## Questions to think about

In the interview, we may discuss the following:

* We want to run your system to deduplicate our catalog of 100M SRs: do you recommend it?
* After developing such a system, how would it evolve over time in terms of algorithm and feedback loop?
* What other features would you add to the model for a new version? What enhancements would be part of further development (algorithm, data, external sources, …)?
* How would you proceed to deploy this system on AWS for large-scale usage?
* In the future, we would like to use embeddings for candidate retrieval and validation. Could you present an approach for doing so? How could this go into production?

In [1]:
import pandas as pd
import sqlite3

from fuzzywuzzy import fuzz
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from hpsklearn import HyperoptEstimator, any_classifier
from hyperopt import tpe
import numpy as np
import pickle

## ETL - Building the dataset for the training

In [2]:
groundtruth = pd.read_csv('groundtruth.csv')
groundtruth.head()

Unnamed: 0,q_source_id,m_source_id,tag
0,spotify_apidsr__2NbYAPqE6FTyQte9kW4vgr,crawler_believe__26052217,invalid
1,crawler_believe__34028360,crawler_believe__34168410,valid
2,crawler_fuga__7427128907609_1_6_ITZB42136782,crawler_believe__42573832,valid
3,crawler_believe__34168476,spotify_apidsr__3kOHtCewbmdWgMVgJ8rpkC,invalid
4,spotify_apidsr__28JA0VuEMS8i3N6fpRXr2M,spotify_apidsr__1d6j1PD3Z8NqbCgCYKDbCy,invalid


In [3]:
len(set(groundtruth.q_source_id.unique()).intersection(set(groundtruth.m_source_id.unique())))

635

In [4]:
len(groundtruth.q_source_id.unique()), len(groundtruth.m_source_id.unique())

(28093, 28091)

In [5]:
conn = sqlite3.connect("db.db")    
soundrecording = pd.read_sql_query('SELECT sr_id, title, artists, isrcs, contributors FROM soundrecording', conn)
conn.close()

In [6]:
soundrecording.columns

Index(['sr_id', 'title', 'artists', 'isrcs', 'contributors'], dtype='object')

In [7]:
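def add_similarity_feature(df_data, functions, fields):
    """Add one fuzzywuzzy similarity column per (function, field) pair.

    Expects `<field>_q` and `<field>_m` columns; modifies `df_data` in place.
    """
    new_features = []
    for function in functions:
        fuzz_function = getattr(fuzz, function)  # e.g. fuzz.ratio
        for field in fields:
            assert f'{field}_q' in df_data.columns and f'{field}_m' in df_data.columns
            name = f'{field}_{function}'
            df_data[name] = df_data.apply(lambda x: fuzz_function(x[f'{field}_q'], x[f'{field}_m']), axis=1)
            new_features.append(name)
    return df_data, new_features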

In [8]:
def join_sr(groundtruth, soundrecording):
    # Attach metadata for the query ID, then for the match ID; the overlapping
    # metadata columns get the suffixes '_q' (query) and '_m' (match).
    data = groundtruth.merge(soundrecording, right_on='sr_id', left_on='q_source_id', how='inner')
    return data.merge(soundrecording, right_on='sr_id', left_on='m_source_id', how='inner', suffixes=('_q', '_m'))

In [9]:
def get_features(input_string, features, fields, conn):
    # Parse 'q_sr_id=<id>&m_sr_id=<id>'; split on the first '=' only so IDs may contain '='.
    q_sr_id, m_sr_id = [item.split('=', 1)[1] for item in input_string.split('&')]
    # Bind the IDs as parameters instead of formatting them into the SQL string.
    soundrecording = pd.read_sql_query(
        'SELECT * FROM soundrecording WHERE sr_id IN (?, ?)', conn, params=(q_sr_id, m_sr_id)
    )
    groundtruth = pd.DataFrame({'q_source_id': [q_sr_id], 'm_source_id': [m_sr_id]})
    data = join_sr(groundtruth, soundrecording)
    new_data, new_features = add_similarity_feature(data, features, fields)
    new_data['isrcs_coincidence'] = (new_data['isrcs_m'] == new_data['isrcs_q']).astype(int)
    return new_data[[*new_features, 'isrcs_coincidence']]
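
The lookup in `get_features` takes IDs straight from the request, so they are best bound as query parameters rather than formatted into the SQL text. A minimal sketch of that pattern with an in-memory database follows; only the `soundrecording` table name and the `sr_id` and `title` columns come from the notebook, while the rows and the `fetch_pair` helper are invented for illustration.

```python
import sqlite3

# In-memory stand-in for db.db; the rows are invented for illustration.
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE soundrecording (sr_id TEXT PRIMARY KEY, title TEXT)")
demo_conn.executemany(
    "INSERT INTO soundrecording VALUES (?, ?)",
    [("src__1", "Blue Sky"), ("src__2", "Blue Sky (Live)"), ("src__3", "Other")],
)


def fetch_pair(conn, q_sr_id, m_sr_id):
    """Fetch both records by ID, returned as {sr_id: title}."""
    rows = conn.execute(
        "SELECT sr_id, title FROM soundrecording WHERE sr_id IN (?, ?)",
        (q_sr_id, m_sr_id),
    ).fetchall()
    return dict(rows)


pair = fetch_pair(demo_conn, "src__1", "src__2")
# An ID that looks like SQL is treated as a plain value and matches nothing.
injected = fetch_pair(demo_conn, "src__1", "x') OR ('1'='1")
demo_conn.close()
```

With bound parameters, the second call returns only the record for `src__1` instead of every row in the table.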

In [10]:
def check_input_string(input_string):
    # Minimal format check: both keys and the '&' separator must be present.
    return all(item in input_string for item in ['q_sr_id=', '&', 'm_sr_id='])

def get_api_response(input_string, loaded_model, features, fields, conn):
    if not check_input_string(input_string):
        return {"error": "Incorrect request format. Please use q_sr_id= & m_sr_id="}
    X = get_features(input_string, features, fields, conn)
    # predict returns an array; the request holds a single pair, so take its first element.
    return {"class": "valid" if loaded_model.predict(X)[0] == 1 else "invalid"}
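
`check_input_string` only verifies that the three substrings occur somewhere in the input. A stricter alternative, sketched here under the assumption that the input is a standard URL query string, is to parse it with `urllib.parse.parse_qs`. `parse_request` is a hypothetical helper, not part of the notebook, and note that `parse_qs` also percent-decodes the values.

```python
from urllib.parse import parse_qs


def parse_request(input_string):
    """Return (q_sr_id, m_sr_id), or None if either key is missing or repeated."""
    params = parse_qs(input_string)
    q_values = params.get("q_sr_id", [])
    m_values = params.get("m_sr_id", [])
    if len(q_values) != 1 or len(m_values) != 1:
        return None
    return q_values[0], m_values[0]


ok = parse_request("q_sr_id=crawler_believe__34028360&m_sr_id=crawler_believe__34168410")
missing_amp = parse_request("q_sr_id=crawler_believe__34028360m_sr_id=crawler_believe__34168410")
```

Here `ok` holds the two IDs as a tuple, while `missing_amp` is `None` because the second key is never found once the `&` is missing.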

In [11]:
features = [
    'partial_ratio',
    'partial_token_set_ratio',
    'partial_token_sort_ratio',
    'ratio',
    'token_set_ratio',
    'token_sort_ratio'
]
fields = ['artists', 'contributors', 'title']
data = join_sr(groundtruth, soundrecording)
new_data, new_features = add_similarity_feature(data, features, fields)
new_data['isrcs_coincidence'] = (new_data['isrcs_m'] == new_data['isrcs_q']).astype(int)

In [12]:
new_data['target'] = new_data['tag'].apply(lambda x: 1 if x == 'valid' else 0)
num_feature_dataset = new_data[[*new_features, *['isrcs_coincidence', 'target']]]

In [13]:
num_feature_dataset.head()

Unnamed: 0,artists_partial_ratio,contributors_partial_ratio,title_partial_ratio,artists_partial_token_set_ratio,contributors_partial_token_set_ratio,title_partial_token_set_ratio,artists_partial_token_sort_ratio,contributors_partial_token_sort_ratio,title_partial_token_sort_ratio,artists_ratio,contributors_ratio,title_ratio,artists_token_set_ratio,contributors_token_set_ratio,title_token_set_ratio,artists_token_sort_ratio,contributors_token_sort_ratio,title_token_sort_ratio,isrcs_coincidence,target
0,38,0,61,25,0,100,25,0,75,13,0,62,15,0,74,13,0,74,0,0
1,100,100,84,100,100,100,100,100,82,100,100,87,100,100,100,100,100,80,0,1
2,25,35,62,25,35,100,25,35,62,25,24,50,25,26,62,25,26,62,0,0
3,100,38,57,100,100,100,100,63,62,100,33,49,100,79,65,100,57,51,0,1
4,17,40,70,17,100,100,17,53,67,14,28,75,14,79,100,14,46,79,0,0


## Training the ML model

In [14]:
X = num_feature_dataset.drop('target', axis=1)
y = num_feature_dataset['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

In [15]:
clf = GradientBoostingClassifier(n_estimators=300, learning_rate=0.4, max_depth=1, random_state=0).fit(X_train, y_train)
clf.score(X_test, y_test)

0.9679589875040051
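
`clf.score` reports plain accuracy. If the two classes are unbalanced, precision and recall for the `valid` class may be more informative. The sketch below computes both without dependencies on invented labels; `precision_recall` is a hypothetical helper, and the demo labels are not taken from the groundtruth.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Precision and recall for the `positive` class, from two label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Invented labels: 3 'valid' pairs (1) and 5 'invalid' pairs (0).
y_true_demo = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred_demo = [1, 1, 0, 1, 0, 0, 0, 0]
precision, recall = precision_recall(y_true_demo, y_pred_demo)
```

In this toy case accuracy is 6/8 while precision and recall are both 2/3, which shows how accuracy alone can look better than the classifier's performance on the `valid` class.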

In [16]:
estim = HyperoptEstimator(classifier=any_classifier('clf'), algo=tpe.suggest, trial_timeout=300)
estim.fit(X_train, y_train)

print(estim.score(X_test, y_test))
print(estim.best_model())

100%|██████████| 1/1 [00:01<00:00,  1.94s/trial, best loss: 0.05207785376117835]
100%|██████████| 2/2 [00:03<00:00,  3.67s/trial, best loss: 0.0352446081009995]
100%|██████████| 3/3 [00:01<00:00,  1.36s/trial, best loss: 0.0352446081009995]
100%|██████████| 4/4 [00:04<00:00,  4.66s/trial, best loss: 0.0352446081009995]
100%|██████████| 5/5 [00:01<00:00,  1.96s/trial, best loss: 0.0352446081009995]
100%|██████████| 6/6 [00:01<00:00,  1.29s/trial, best loss: 0.0352446081009995]
100%|██████████| 7/7 [00:01<00:00,  1.19s/trial, best loss: 0.0352446081009995]
100%|██████████| 8/8 [00:01<00:00,  1.27s/trial, best loss: 0.032877432930036865]
100%|██████████| 9/9 [00:05<00:00,  5.25s/trial, best loss: 0.032877432930036865]
100%|██████████| 10/10 [00:01<00:00,  1.28s/trial, best loss: 0.032877432930036865]
0.962512015379686
{'learner': GradientBoostingClassifier(learning_rate=0.020906439945477112,
                           loss='exponential', max_features='log2',
                           max



In [17]:
# best_model = estim.best_model()['learner']
# best_model.fit(X_train, y_train)

In [18]:

filename = 'best_model.sav'
# pickle.dump(best_model, open(filename, 'wb'))
# Use a context manager so the file is flushed and closed after writing.
with open(filename, 'wb') as f:
    pickle.dump(clf, f)

In [19]:
with open(filename, 'rb') as f:
    loaded_model = pickle.load(f)
result = loaded_model.score(X_test, y_test)
print(result)

0.9679589875040051


### Testing the model

In [20]:
input_string = 'q_sr_id=spotify_apidsr__2NbYAPqE6FTyQte9kW4vgr&m_sr_id=crawler_fuga__7427128907609_1_6_ITZB42136782'

In [21]:
conn = sqlite3.connect("db.db")
get_api_response(input_string, loaded_model, features, fields, conn)

{'class': 'invalid'}

In [23]:
# Removing the &
input_string = 'q_sr_id=spotify_apidsr__2NbYAPqE6FTyQte9kW4vgrm_sr_id=crawler_fuga__7427128907609_1_6_ITZB42136782'
get_api_response(input_string, loaded_model, features, fields, conn)

{'error': 'Incorrect request format. Please use q_sr_id= & m_sr_id='}

In [1]:
!curl -X GET -d '"q_sr_id=crawler_believe__34028360&m_sr_id=crawler_believe__34168410"' http://localhost:8002/

{"class":"valid"}
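
The server answering the `curl` call is not shown in this notebook. As a hedged sketch of one way to expose `get_api_response`, the standard-library handler below reads the request body and returns JSON. The `classify` function is a stub standing in for the model call, and the handler name, body decoding, and status codes are assumptions rather than the actual implementation.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def classify(input_string):
    """Stub standing in for get_api_response(...); always answers 'invalid'."""
    if not all(part in input_string for part in ("q_sr_id=", "&", "m_sr_id=")):
        return {"error": "Incorrect request format. Please use q_sr_id= & m_sr_id="}
    return {"class": "invalid"}


def decode_body(raw):
    """Decode the body; accept both a bare string and a JSON-quoted string."""
    text = raw.decode("utf-8").strip()
    try:
        value = json.loads(text)
    except ValueError:
        return text
    return value if isinstance(value, str) else text


class ClassifierHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # curl -d sends the query in the request body, even with -X GET.
        length = int(self.headers.get("Content-Length", 0))
        result = classify(decode_body(self.rfile.read(length)))
        payload = json.dumps(result).encode("utf-8")
        self.send_response(400 if "error" in result else 200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def make_server(port=8002):
    """Create (but do not start) the server; call .serve_forever() to run it."""
    return HTTPServer(("localhost", port), ClassifierHandler)
```

Calling `make_server().serve_forever()` would start it on port 8002, the port used in the `curl` example; replacing the stub with a call to `get_api_response` and the loaded model would connect it to the classifier.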