# R&D engineer test

Imagine you have a large catalog of music sound recordings (SRs) with metadata only (no audio available). In this large catalog, you might have duplicates: the same sound recording (same master recording) appears more than once written in slightly different ways. For example:

```
{'source_id': '123',
 'title': 'Yesterday',
 'artist': 'Beatles The',
 'isrc': 'None',
 'contributors': 'Lennon|McCartney'
 }
{'source_id': '456',
 'title': 'Yesterday',
 'artist': 'The Beatles',
 'isrc': 'GBAYE6500521',
 'contributors': 'John Lennon|Paul McCartney'
 }
```

Suppose we have already run a rough deduplication process that returns a set of duplicate candidates for each SR in the database. This process can retrieve duplicate candidates, but it cannot reliably classify each candidate as a duplicate or a non-duplicate. For example, given this query:


```
Query:
{'source_id': '123',
 'title': 'Yesterday',
 'artist': 'Beatles The',
 'isrc': 'None',
 'contributors': 'Lennon|McCartney'
 }
```

The candidates might be:

```
{'source_id': '456',
 'title': 'Yesterday',
 'artist': 'The Beatles',
 'isrc': 'GBAYE6500521',
 'contributors': 'John Lennon|Paul McCartney'
 }
{'source_id': '789',
 'title': 'Yesterday',
 'artist': 'Elvis Presley',
 'isrc': 'USRC16908444',
 'contributors': 'John Lennon|Paul McCartney|Elvis Presley'
 }
```

We now have the following links, which may or may not correspond to the same SR:

* `id 123 vs. id 456`
* `id 123 vs. id 789`
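
The rough candidate-retrieval step is not specified in this document. As a purely illustrative sketch, assuming candidates are found by grouping records on a normalized title, the links above could be produced as follows. The helpers `normalize` and `candidates_for`, and the record with id `999`, are hypothetical and only serve the example.

```python
# Hypothetical candidate retrieval: block records on a normalized title.
records = [
    {"source_id": "123", "title": "Yesterday", "artist": "Beatles The"},
    {"source_id": "456", "title": "Yesterday", "artist": "The Beatles"},
    {"source_id": "789", "title": "Yesterday", "artist": "Elvis Presley"},
    {"source_id": "999", "title": "Let It Be", "artist": "The Beatles"},  # invented record
]


def normalize(text):
    """Lowercase, replace non-alphanumeric characters with spaces, collapse spaces."""
    cleaned = "".join(ch if ch.isalnum() else " " for ch in text.lower())
    return " ".join(cleaned.split())


def candidates_for(query_id, records):
    """Return the ids of all other records that share the query's normalized title."""
    by_id = {record["source_id"]: record for record in records}
    key = normalize(by_id[query_id]["title"])
    return [
        record["source_id"]
        for record in records
        if record["source_id"] != query_id and normalize(record["title"]) == key
    ]


for candidate_id in candidates_for("123", records):
    print(f"id 123 vs. id {candidate_id}")
```

Such a step favors recall: it returns the Elvis Presley cover as a candidate too, which is exactly why a separate classifier is needed.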

We want to implement a system that determines whether the metadata of two SRs really corresponds to the same SR. The system should be very easy to call from external processes, so we suggest exposing it through an HTTP API.

## Assignment

Build an HTTP API that receives two sound-recording IDs as input and returns a JSON output with an automatic classification of whether the two IDs correspond to the same actual sound recording. When the two SRs are the same, the classifier outputs the class `"valid"`; otherwise it outputs `"invalid"`.

### Example of usage

Given these three SRs:

```
{'source_id': '123',
 'title': 'Yesterday',
 'artist': 'Beatles The',
 'isrc': 'None',
 'contributors': 'Lennon|McCartney'
 }
{'source_id': '456',
 'title': 'Yesterday',
 'artist': 'The Beatles',
 'isrc': 'GBAYE6500521',
 'contributors': 'John Lennon|Paul McCartney'
 }
{'source_id': '789',
 'title': 'Yesterday',
 'artist': 'Elvis Presley',
 'isrc': 'USRC16908444',
 'contributors': 'John Lennon|Paul McCartney|Elvis Presley'
 }
```

Ideally, the API would return these outputs for the following URLs:

```
$ curl -X GET "http://127.0.0.1:8002/?q_sr_id=123&m_sr_id=456"
{"class": "valid"}
$ curl -X GET "http://127.0.0.1:8002/?q_sr_id=123&m_sr_id=789"
{"class": "invalid"}
$ curl -X GET "http://127.0.0.1:8002/?q_sr_id=456&m_sr_id=789"
{"class": "invalid"}
```

Note: these examples are not present in the provided database.

### Machine learning approach

The candidate is not expected to implement hand-crafted rules for the classification. Instead, we provide a groundtruth file that can be used to train a classifier automatically. This groundtruth gives the actual relationship between two given sound-recording IDs (also called `source_id`).

The metadata for each sound-recording ID can be found in the SQLite3 database file `db.db`.

We suggest training a simple classifier using the following four features:
* Title similarity
* Artists similarity
* ISRC coincidence
* Contributors similarity

Note: string similarities can be computed easily with the Python package `fuzzywuzzy`.
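
As a rough illustration of the four suggested features, the sketch below computes them for the example records. It uses the standard-library `difflib.SequenceMatcher` as a stand-in for `fuzzywuzzy` so that it runs without extra dependencies; the scores will therefore not match `fuzzywuzzy` exactly. The helper names (`similarity`, `isrc_coincidence`, `pair_features`) and the choice to treat the string `'None'` as a missing ISRC are assumptions.

```python
from difflib import SequenceMatcher

MISSING = {None, "", "None"}  # assumption: how a missing ISRC may be encoded


def similarity(a, b):
    """Token-order-insensitive similarity in [0, 100] (stand-in for a fuzzywuzzy scorer)."""
    tokens_a = sorted((a or "").lower().replace("|", " ").split())
    tokens_b = sorted((b or "").lower().replace("|", " ").split())
    ratio = SequenceMatcher(None, " ".join(tokens_a), " ".join(tokens_b)).ratio()
    return round(100 * ratio)


def isrc_coincidence(a, b):
    """1 only when both ISRCs are present and equal, otherwise 0."""
    if a in MISSING or b in MISSING:
        return 0
    return int(a == b)


def pair_features(q, m):
    """Compute the four suggested features for a (query, match) pair."""
    return {
        "title_similarity": similarity(q["title"], m["title"]),
        "artist_similarity": similarity(q["artist"], m["artist"]),
        "isrc_coincidence": isrc_coincidence(q["isrc"], m["isrc"]),
        "contributors_similarity": similarity(q["contributors"], m["contributors"]),
    }


sr_123 = {"title": "Yesterday", "artist": "Beatles The", "isrc": "None",
          "contributors": "Lennon|McCartney"}
sr_456 = {"title": "Yesterday", "artist": "The Beatles", "isrc": "GBAYE6500521",
          "contributors": "John Lennon|Paul McCartney"}
sr_789 = {"title": "Yesterday", "artist": "Elvis Presley", "isrc": "USRC16908444",
          "contributors": "John Lennon|Paul McCartney|Elvis Presley"}

print(pair_features(sr_123, sr_456))
print(pair_features(sr_123, sr_789))
```

Sorting the tokens before comparing makes `Beatles The` and `The Beatles` identical, which is the same idea behind the token-sort scorers in `fuzzywuzzy`.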

### API

The API program should access the provided database `db.db` (to fetch the metadata of each input source) and load the previously trained model, so that it can compute the suggested features for each pair of SRs and return a classification.
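
The request flow described above can be sketched without any web framework. The example below is a minimal, hypothetical handler: it parses the query string, fetches both records from a `soundrecording` table with parameterized queries, and returns a status code plus a JSON body. The in-memory database, the `handle_query` and `fetch_record` helpers, and the `toy_predict` function (a placeholder for the trained model) are all assumptions made for illustration.

```python
import json
import sqlite3
from urllib.parse import parse_qs

COLUMNS = ("sr_id", "title", "artists", "isrcs", "contributors")


def fetch_record(conn, sr_id):
    """Fetch one sound recording as a dict, or None if the id is unknown."""
    row = conn.execute(
        "SELECT sr_id, title, artists, isrcs, contributors "
        "FROM soundrecording WHERE sr_id = ?",
        (sr_id,),
    ).fetchone()
    return dict(zip(COLUMNS, row)) if row else None


def handle_query(query_string, conn, predict):
    """Return (status_code, json_body) for a query such as 'q_sr_id=1&m_sr_id=2'."""
    params = parse_qs(query_string)
    if "q_sr_id" not in params or "m_sr_id" not in params:
        return 400, json.dumps({"error": "q_sr_id and m_sr_id are required"})
    q = fetch_record(conn, params["q_sr_id"][0])
    m = fetch_record(conn, params["m_sr_id"][0])
    if q is None or m is None:
        return 404, json.dumps({"error": "unknown sound-recording id"})
    return 200, json.dumps({"class": "valid" if predict(q, m) else "invalid"})


def toy_predict(q, m):
    """Placeholder for the trained model: same artist tokens means duplicate."""
    return set(q["artists"].lower().split()) == set(m["artists"].lower().split())


# In-memory database standing in for db.db.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE soundrecording (id INTEGER PRIMARY KEY, sr_id TEXT, "
    "title TEXT, artists TEXT, isrcs TEXT, contributors TEXT)"
)
conn.executemany(
    "INSERT INTO soundrecording (sr_id, title, artists, isrcs, contributors) "
    "VALUES (?, ?, ?, ?, ?)",
    [
        ("123", "Yesterday", "Beatles The", None, "Lennon|McCartney"),
        ("456", "Yesterday", "The Beatles", "GBAYE6500521", "John Lennon|Paul McCartney"),
        ("789", "Yesterday", "Elvis Presley", "USRC16908444",
         "John Lennon|Paul McCartney|Elvis Presley"),
    ],
)

print(handle_query("q_sr_id=123&m_sr_id=456", conn, toy_predict))
print(handle_query("q_sr_id=123&m_sr_id=789", conn, toy_predict))
```

Keeping the handler independent of the framework makes it straightforward to wrap in a FastAPI route later and to unit-test without starting a server.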

### Evaluation criteria

We are looking for a properly implemented MVP / PoC that follows good software engineering and ML practices. **Do not overengineer your solution.** We are not expecting a highly optimized implementation or ML model, but we do value it when the candidate keeps that aspect in mind in all her/his choices.

Make it easy for us to run your application: please list the dependencies, or create a very simple Docker image that runs your API with all dependencies installed.

Finally, we are **very** interested in your insight about your solution. Does it work well for the purpose? What else is needed to keep improving your solution? Any extra insight about the nature of the problem in the music industry, etc. is very welcome.

### Suggestions:

* Use a Jupyter notebook to train the classifier and present results
* Use FastAPI to implement the API
* It's OK to run the API on a development server on localhost

## Questions to think about

In the interview, we may discuss the following:

* We want to run your system to deduplicate our catalog of 100M SRs: do you recommend it?
* After developing such a system, how would it evolve over time in terms of algorithm and feedback loop?
* What other features would you add to the model for a new version? What enhancements would be part of further development (algorithm, data, external sources, …)?
* How would you proceed if you wanted to deploy this system on AWS for large-scale usage?
* In the future, we would like to use embeddings for candidate retrieval and validation. Could you present an approach for doing so? How could it go into production?

In [1]:
import pandas as pd

In [2]:
groundtruth = pd.read_csv('groundtruth.csv')

In [3]:
groundtruth

Unnamed: 0,q_source_id,m_source_id,tag
0,spotify_apidsr__2NbYAPqE6FTyQte9kW4vgr,crawler_believe__26052217,invalid
1,crawler_believe__34028360,crawler_believe__34168410,valid
2,crawler_fuga__7427128907609_1_6_ITZB42136782,crawler_believe__42573832,valid
3,crawler_believe__34168476,spotify_apidsr__3kOHtCewbmdWgMVgJ8rpkC,invalid
4,spotify_apidsr__28JA0VuEMS8i3N6fpRXr2M,spotify_apidsr__1d6j1PD3Z8NqbCgCYKDbCy,invalid
...,...,...,...
28366,apple__1354975784,youtube_dsr__A461439239803827,valid
28367,apple__1052537885,crawler_pias__5060099505690_1_2_GBRNP1400106,valid
28368,crawler_247__5060099505690_GBRNP1400106,spotify_apidsr__3x99UdcqjXhQcqdgadKeXA,valid
28369,youtube_dsr__A219026358613851,spotify__3x99UdcqjXhQcqdgadKeXA,valid


In [4]:
len(set(groundtruth.q_source_id.unique()).intersection(set(groundtruth.m_source_id.unique())))

635

In [5]:
len(groundtruth.q_source_id.unique()), len(groundtruth.m_source_id.unique())

(28093, 28091)

In [6]:
import sqlite3

import pandas as pd

# To list the available tables first:
# conn.execute("SELECT name FROM sqlite_master WHERE type='table';").fetchall()

# Load the whole metadata table and close the connection even if the query fails.
conn = sqlite3.connect("db.db")
try:
    soundrecording = pd.read_sql_query('SELECT * FROM soundrecording', conn)
finally:
    conn.close()

In [7]:
soundrecording.columns

Index(['id', 'sr_id', 'title', 'artists', 'isrcs', 'contributors'], dtype='object')

In [8]:
len(groundtruth)

28371

In [9]:
data = groundtruth.merge(soundrecording, right_on='sr_id', left_on='q_source_id', how='inner').merge(soundrecording, right_on='sr_id', left_on='m_source_id', how='inner')[['q_source_id', 'm_source_id', 'tag', 'title_x',
       'artists_x', 'isrcs_x', 'contributors_x', 'title_y',
       'artists_y', 'isrcs_y', 'contributors_y']]

In [10]:
data.columns = ['q_source_id', 'm_source_id', 'tag', 'title_q', 'artists_q', 'isrcs_q',
       'contributors_q', 'title_m', 'artists_m', 'isrcs_m', 'contributors_m']

In [11]:
data.head()

Unnamed: 0,q_source_id,m_source_id,tag,title_q,artists_q,isrcs_q,contributors_q,title_m,artists_m,isrcs_m,contributors_m
0,spotify_apidsr__2NbYAPqE6FTyQte9kW4vgr,crawler_believe__26052217,invalid,Astronomia - Tequila Edit,Bing Lee,ITZB42033435,Edizioni Lungoviaggio|Victor Pool|Ruben Christ...,Astronomia (feat. Tish),"Marco Marzi, Marco Skarica, David White",ITF341800025,Игумнов
1,crawler_believe__34028360,crawler_believe__34168410,valid,Astronomia (Coffin Dance) [Dance Edit],Josh Nor,FR2X42061192,Victor Pool|Ruben den Boer|Антон Игумнов,Astronomia (Coffin Dance) [Tequila Edit],Josh Nor,FR96X2013991,Victor Pool|Ruben den Boer|Антон Игумнов
2,crawler_believe__34028360,apple__1535650073,invalid,Astronomia (Coffin Dance) [Dance Edit],Josh Nor,FR2X42061192,Victor Pool|Ruben den Boer|Антон Игумнов,Astronomia (Never Go Home),Tony Igy,DEE862002424,Reinhard Raith|Kristin Carpenter|Anton Igumnov...
3,crawler_fuga__7427128907609_1_6_ITZB42136782,crawler_believe__42573832,valid,Astronomia (Purple Mix),Josh Nor,ITZB42136782,Ruben Christopher Den Boer|Anton Igumnov|Victo...,Astronomia (Coffin Dance) [Dance Edit],Josh Nor,FR2X42204962,Victor Pool|Ruben den Boer|Антон Игумнов
4,crawler_believe__34168476,spotify_apidsr__3kOHtCewbmdWgMVgJ8rpkC,invalid,Astronomia (Coffin Dance) [EDM Edit],Haures,FR96X2014044,Victor Pool|Ruben den Boer|Антон Игумнов,Astronomia - Dance Edit,Bing Lee,ITZB42033425,Ruben Christopher den Boer|Anton Igumnov|Victo...


In [12]:
from fuzzywuzzy import fuzz

In [13]:
# Early single-feature experiment, superseded by add_similarity_feature below:
# data['artists_partial_ratio'] = data.apply(lambda x: fuzz.partial_ratio(x['artists_q'], x['artists_m']), axis=1)

In [14]:
data.columns

Index(['q_source_id', 'm_source_id', 'tag', 'title_q', 'artists_q', 'isrcs_q',
       'contributors_q', 'title_m', 'artists_m', 'isrcs_m', 'contributors_m'],
      dtype='object')

In [15]:
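def add_similarity_feature(df_data, functions, fields):
    """Add one `{field}_{function}` column per (fuzz function, field) pair.

    Modifies `df_data` in place and returns it together with the new column names.
    """
    new_features = []
    for function in functions:
        # Resolve the scorer by name, e.g. 'token_set_ratio' -> fuzz.token_set_ratio.
        fuzz_function = getattr(fuzz, function)
        for field in fields:
            assert field + '_q' in df_data.columns
            assert field + '_m' in df_data.columns
            df_data[f'{field}_{function}'] = df_data.apply(lambda x: fuzz_function(x[f'{field}_q'], x[f'{field}_m']), axis=1)
            new_features.append(f'{field}_{function}')
    return df_data, new_features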

In [37]:
from urllib.parse import parse_qs

def get_features(input_string, features, fields, conn):
    """Build the feature row for a query string such as 'q_sr_id=123&m_sr_id=456'."""
    # Read the ids by parameter name rather than by position.
    params = parse_qs(input_string)
    q_sr_id, m_sr_id = params['q_sr_id'][0], params['m_sr_id'][0]
    # Bind the ids as SQL parameters instead of formatting them into the query.
    soundrecording = pd.read_sql_query("SELECT * FROM soundrecording WHERE sr_id IN (?, ?)", conn, params=(q_sr_id, m_sr_id))
    groundtruth = pd.DataFrame({'q_source_id': [q_sr_id], 'm_source_id': [m_sr_id]})
    data = groundtruth.merge(soundrecording, right_on='sr_id', left_on='q_source_id', how='inner').merge(soundrecording, right_on='sr_id', left_on='m_source_id', how='inner')[['q_source_id', 'm_source_id', 'title_x',
       'artists_x', 'isrcs_x', 'contributors_x', 'title_y',
       'artists_y', 'isrcs_y', 'contributors_y']]
    data.columns = ['q_source_id', 'm_source_id', 'title_q', 'artists_q', 'isrcs_q',
       'contributors_q', 'title_m', 'artists_m', 'isrcs_m', 'contributors_m']
    new_data, new_features = add_similarity_feature(data, features, fields)
    new_data['isrcs_coincidence'] = (new_data['isrcs_m'] == new_data['isrcs_q']).astype(int)
    # Column order must match the order used at training time.
    return new_data[[*new_features, 'isrcs_coincidence']]

In [34]:
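def get_api_response(input_string, loaded_model, features, fields, conn):
    X = get_features(input_string, features, fields, conn)
    if X.empty:
        # The inner merges drop the row when either id is missing from the database.
        raise ValueError(f"Unknown sound-recording id in: {input_string}")
    return {"class": "valid" if loaded_model.predict(X)[0] == 1 else "invalid"}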

In [18]:
features = [
    'partial_ratio',
    'partial_token_set_ratio',
    'partial_token_sort_ratio',
    'ratio',
    'token_set_ratio',
    'token_sort_ratio'
]
fields = ['artists', 'contributors', 'title']
new_data, new_features = add_similarity_feature(data, features, fields)

In [19]:
new_data['isrcs_coincidence'] = (new_data['isrcs_m'] == new_data['isrcs_q']).astype(int)

In [21]:
new_data['target'] = new_data['tag'].apply(lambda x: 1 if x == 'valid' else 0)

In [22]:
# 'target' was already added in the previous cell.
num_feature_dataset = new_data[[*new_features, 'isrcs_coincidence', 'target']]

In [23]:
num_feature_dataset

Unnamed: 0,artists_partial_ratio,contributors_partial_ratio,title_partial_ratio,artists_partial_token_set_ratio,contributors_partial_token_set_ratio,title_partial_token_set_ratio,artists_partial_token_sort_ratio,contributors_partial_token_sort_ratio,title_partial_token_sort_ratio,artists_ratio,contributors_ratio,title_ratio,artists_token_set_ratio,contributors_token_set_ratio,title_token_set_ratio,artists_token_sort_ratio,contributors_token_sort_ratio,title_token_sort_ratio,isrcs_coincidence,target
0,38,0,61,25,0,100,25,0,75,13,0,62,15,0,74,13,0,74,0,0
1,100,100,84,100,100,100,100,100,82,100,100,87,100,100,100,100,100,80,0,1
2,25,35,62,25,35,100,25,35,62,25,24,50,25,26,62,25,26,62,0,0
3,100,38,57,100,100,100,100,63,62,100,33,49,100,79,65,100,57,51,0,1
4,17,40,70,17,100,100,17,53,67,14,28,75,14,79,100,14,46,79,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
28366,15,27,23,100,100,100,100,91,100,15,26,23,100,93,100,100,93,100,1,1
28367,100,100,100,100,100,100,100,83,100,79,48,100,81,100,100,81,44,100,1,1
28368,100,0,100,100,0,100,100,0,100,100,0,100,100,0,100,100,0,100,1,1
28369,23,29,10,100,83,100,100,83,100,23,19,10,100,60,100,100,60,100,1,1


In [24]:
from sklearn.model_selection import train_test_split

X = num_feature_dataset.drop('target', axis=1)
y = num_feature_dataset['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
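
The random split above treats every pair independently, but the same `source_id` can appear in several pairs (cell In [5] shows fewer unique ids than groundtruth rows), so related pairs may land on both sides of the split. The sketch below is one possible alternative, not what this notebook does: it groups pairs that share any id (connected components via union-find) and assigns whole groups to train or test. The function name `group_split` and the toy pairs are hypothetical.

```python
import random


def group_split(pairs, test_size=0.33, seed=42):
    """Split pair indices so that no source id appears in both train and test."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Union the two ids of every pair so linked pairs share one root.
    for q_id, m_id in pairs:
        parent[find(q_id)] = find(m_id)

    # Collect pair indices per connected component.
    groups = {}
    for index, (q_id, _) in enumerate(pairs):
        groups.setdefault(find(q_id), []).append(index)

    components = sorted(groups.values())
    random.Random(seed).shuffle(components)

    # Fill the test set with whole components until it reaches the target size.
    target = test_size * len(pairs)
    train, test = [], []
    for component in components:
        (test if len(test) < target else train).extend(component)
    return sorted(train), sorted(test)


toy_pairs = [("a", "b"), ("b", "c"), ("d", "e"), ("f", "g"), ("g", "h"), ("x", "y")]
train_idx, test_idx = group_split(toy_pairs)
print(train_idx, test_idx)
```

If most ids turn out to be linked into a few very large components, the resulting split can be far from the requested ratio, so the component sizes are worth inspecting first. With scikit-learn available, `GroupShuffleSplit` can perform the split once a group label per pair has been computed.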

In [25]:
from sklearn.ensemble import GradientBoostingClassifier

clf = GradientBoostingClassifier(n_estimators=200, learning_rate=0.1, max_depth=1, random_state=0).fit(X_train, y_train)
clf.score(X_test, y_test)

0.967104560504112
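
`clf.score` reports accuracy only. For a deduplication task, wrongly merging two different recordings and missing a true duplicate usually carry different costs, so precision and recall on the `valid` class are worth reporting as well. The sketch below shows the computation on a handful of invented labels; `binary_metrics` is a hypothetical helper, and the numbers say nothing about this model's actual performance.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for the positive class (label 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Invented labels: 1 = valid (duplicate), 0 = invalid.
y_true_toy = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred_toy = [1, 1, 0, 0, 0, 0, 0, 1]
print(binary_metrics(y_true_toy, y_pred_toy))
```

In the notebook itself, `sklearn.metrics.classification_report(y_test, clf.predict(X_test))` would give the same quantities per class.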

In [26]:
from hpsklearn import HyperoptEstimator, any_classifier
from hyperopt import tpe
import numpy as np


estim = HyperoptEstimator(classifier=any_classifier('clf'), algo=tpe.suggest, trial_timeout=300)
estim.fit(X_train, y_train)

print(estim.score(X_test, y_test))
print(estim.best_model())

100%|██████████| 1/1 [00:02<00:00,  2.29s/trial, best loss: 0.07785376117832721]
100%|██████████| 2/2 [00:01<00:00,  1.75s/trial, best loss: 0.03235139400315623]
100%|██████████| 3/3 [00:03<00:00,  3.73s/trial, best loss: 0.03235139400315623]
100%|██████████| 4/4 [00:01<00:00,  1.94s/trial, best loss: 0.02972119936875328]
100%|██████████| 5/5 [00:02<00:00,  2.27s/trial, best loss: 0.02972119936875328]
100%|██████████| 6/6 [00:01<00:00,  1.30s/trial, best loss: 0.02972119936875328]
100%|██████████| 7/7 [00:04<00:00,  4.08s/trial, best loss: 0.02972119936875328]
100%|██████████| 8/8 [00:01<00:00,  1.01s/trial, best loss: 0.02972119936875328]
100%|██████████| 9/9 [00:01<00:00,  1.94s/trial, best loss: 0.02972119936875328]
100%|██████████| 10/10 [00:03<00:00,  3.61s/trial, best loss: 0.02972119936875328]
0.9662501335042187
{'learner': XGBClassifier(base_score=0.5, booster=None, callbacks=None,
              colsample_bylevel=0.6385602875974441, colsample_bynode=None,
              colsampl



In [27]:
best_model = estim.best_model()['learner']
best_model.fit(X_train, y_train)

In [30]:
import pickle

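filename = 'best_model.sav'
# Use a context manager so the file is flushed and closed after writing.
with open(filename, 'wb') as model_file:
    pickle.dump(best_model, model_file)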

In [31]:
with open(filename, 'rb') as model_file:
    loaded_model = pickle.load(model_file)
result = loaded_model.score(X_test, y_test)
print(result)

0.9662501335042187


In [32]:
input_string = 'q_sr_id=spotify_apidsr__2NbYAPqE6FTyQte9kW4vgr&m_sr_id=crawler_fuga__7427128907609_1_6_ITZB42136782'

In [36]:
conn = sqlite3.connect("db.db")
get_api_response(input_string, loaded_model, features, fields, conn)

{'class': 'invalid'}