# R&D engineer test

Imagine you have a large catalog of music sound recordings (SRs) with metadata only (no audio available). The catalog may contain duplicates: the same sound recording (same master recording) appears more than once, with its metadata written in slightly different ways. For example:

```
{'source_id': '123',
 'title': 'Yesterday',
 'artist': 'Beatles The',
 'isrc': 'None',
 'contributors': 'Lennon|McCartney'
 }
{'source_id': '456',
 'title': 'Yesterday',
 'artist': 'The Beatles',
 'isrc': 'GBAYE6500521',
 'contributors': 'John Lennon|Paul McCartney'
 }
```

Let's imagine that we have already run a rough deduplication process, which provides a set of duplicate candidates in your database for each SR. This process can retrieve duplicate candidates, but it cannot reliably classify each candidate as a duplicate or a non-duplicate. For example, given this query:


```
Query:
{'source_id': '123',
 'title': 'Yesterday',
 'artist': 'Beatles The',
 'isrc': 'None',
 'contributors': 'Lennon|McCartney'
 }
```

The candidates might be these ones:

```
{'source_id': '456',
 'title': 'Yesterday',
 'artist': 'The Beatles',
 'isrc': 'GBAYE6500521',
 'contributors': 'John Lennon|Paul McCartney'
 }
{'source_id': '789',
 'title': 'Yesterday',
 'artist': 'Elvis Presley',
 'isrc': 'USRC16908444',
 'contributors': 'John Lennon|Paul McCartney|Elvis Presley'
 }
```

We now have the following links, each of which may or may not correspond to the same SR:

* `id 123 vs. id 456`
* `id 123 vs. id 789`

We want to implement a system that determines whether the metadata of two SRs really corresponds to the same SR. The system should be very easy to call from external processes, so we suggest exposing it through an HTTP API.

## Assignment

Build an HTTP API that receives two sound-recording IDs as input and returns a JSON output with an automatic classification of whether the two IDs correspond to the same actual sound recording. When two SRs are the same, the classifier outputs the class `"valid"`; otherwise it outputs `"invalid"`.

### Example of usage

Given these three SRs:

```
{'source_id': '123',
 'title': 'Yesterday',
 'artist': 'Beatles The',
 'isrc': 'None',
 'contributors': 'Lennon|McCartney'
 }
{'source_id': '456',
 'title': 'Yesterday',
 'artist': 'The Beatles',
 'isrc': 'GBAYE6500521',
 'contributors': 'John Lennon|Paul McCartney'
 }
{'source_id': '789',
 'title': 'Yesterday',
 'artist': 'Elvis Presley',
 'isrc': 'USRC16908444',
 'contributors': 'John Lennon|Paul McCartney|Elvis Presley'
 }
```

Ideally, the API would return these outputs for the following requests:

```
$ curl -X GET "http://127.0.0.1:8002/?q_sr_id=123&m_sr_id=456"
{"class": "valid"}
$ curl -X GET "http://127.0.0.1:8002/?q_sr_id=123&m_sr_id=789"
{"class": "invalid"}
$ curl -X GET "http://127.0.0.1:8002/?q_sr_id=456&m_sr_id=789"
{"class": "invalid"}
```

Note: these examples are not present in the provided database.

### Machine learning approach

The candidate is not expected to implement hand-crafted rules for the classification. Instead, we provide a ground-truth file that can be used to train a classifier automatically. This ground truth gives the actual relationship between two given sound-recording IDs (also called `source_id`).

The metadata for each sound-recording ID can be found in the SQLite3 database file `db.db`.
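As a rough illustration, metadata lookup by `source_id` could look like the sketch below. The table name `soundrecording` and the column list are assumptions made for this example, not the confirmed schema of `db.db`; inspect the file (for instance with `SELECT name FROM sqlite_master`) before relying on them. `fetch_sr` is a hypothetical helper name.

```python
import sqlite3
from typing import Optional

# Assumption: table and column names are hypothetical placeholders.
TABLE = "soundrecording"
FIELDS = ("source_id", "title", "artist", "isrc", "contributors")


def fetch_sr(conn: sqlite3.Connection, source_id: str) -> Optional[dict]:
    """Return the metadata of one sound recording, or None if it is unknown."""
    query = f"SELECT {', '.join(FIELDS)} FROM {TABLE} WHERE source_id = ?"
    # The id is passed as a bound parameter, never formatted into the SQL.
    row = conn.execute(query, (source_id,)).fetchone()
    if row is None:
        return None
    return dict(zip(FIELDS, row))
```

A caller would open the connection once, for example with `sqlite3.connect("db.db")`, and reuse it across lookups.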

We suggest training a simple classifier using the following four features:
* Title similarity
* Artist similarity
* ISRC coincidence
* Contributors similarity

Note: string similarities can be easily computed with the Python package `fuzzywuzzy`.
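The four features above could be computed along the lines of the following sketch. To stay dependency-free it uses the standard library's `difflib.SequenceMatcher` as a stand-in for `fuzzywuzzy`; a scorer such as `fuzz.token_sort_ratio` could be swapped in. The helper names, the token-sorting normalization, and the choice of `0.5` for an unknown ISRC are all assumptions of this example, not requirements of the assignment.

```python
from difflib import SequenceMatcher


def _normalize(text):
    """Lowercase, split on whitespace and '|', and sort the tokens."""
    tokens = (text or "").lower().replace("|", " ").split()
    return " ".join(sorted(tokens))


def similarity(a, b):
    """Token-order-insensitive string similarity in [0, 1]."""
    na, nb = _normalize(a), _normalize(b)
    if not na or not nb:
        return 0.0
    return SequenceMatcher(None, na, nb).ratio()


def _isrc(value):
    """Map empty values and the literal string 'None' to a missing ISRC."""
    value = (value or "").strip().upper()
    return None if value in {"", "NONE"} else value


def isrc_match(a, b):
    """1.0 if equal, 0.0 if different, 0.5 (assumption) if either is missing."""
    a, b = _isrc(a), _isrc(b)
    if a is None or b is None:
        return 0.5
    return 1.0 if a == b else 0.0


def features(q, m):
    """Feature vector: [title, artist, isrc, contributors]."""
    return [
        similarity(q["title"], m["title"]),
        similarity(q["artist"], m["artist"]),
        isrc_match(q["isrc"], m["isrc"]),
        similarity(q["contributors"], m["contributors"]),
    ]


query = {"title": "Yesterday", "artist": "Beatles The",
         "isrc": "None", "contributors": "Lennon|McCartney"}
match = {"title": "Yesterday", "artist": "The Beatles",
         "isrc": "GBAYE6500521", "contributors": "John Lennon|Paul McCartney"}
print(features(query, match))
```

Sorting tokens before comparison makes `Beatles The` and `The Beatles` compare as identical, which is the kind of variation the examples in this document show.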

### API

The API program should be able to access the provided database `db.db` (to fetch the metadata of each input source) and to load the previously trained model, so that it can compute the suggested features for each pair of SRs and return a classification value.
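One way to keep the request path testable is to isolate it in a plain function that receives its collaborators as arguments, as in this minimal sketch. The name `classify_pair` and its three callables (`fetch`, `featurize`, `predict`) are hypothetical; in a real service they would be backed by the database lookup, the feature computation, and the loaded model.

```python
from typing import Callable, Optional, Sequence


def classify_pair(
    q_sr_id: str,
    m_sr_id: str,
    fetch: Callable[[str], Optional[dict]],
    featurize: Callable[[dict, dict], Sequence[float]],
    predict: Callable[[Sequence[float]], bool],
) -> dict:
    """Fetch both records, build the feature vector, and classify the pair."""
    q, m = fetch(q_sr_id), fetch(m_sr_id)
    if q is None or m is None:
        # Unknown ids are reported to the caller instead of being classified.
        missing = q_sr_id if q is None else m_sr_id
        raise KeyError(f"unknown source_id: {missing}")
    is_same = predict(featurize(q, m))
    # Response shape follows the usage example: {"class": "valid" | "invalid"}.
    return {"class": "valid" if is_same else "invalid"}
```

With FastAPI, a `@app.get("/")` handler taking `q_sr_id: str` and `m_sr_id: str` as query parameters could call this function and translate the `KeyError` into an `HTTPException(status_code=404)`.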

### Evaluation criteria

We are looking for an MVP / PoC that is properly implemented and follows good software engineering and ML practices. **Do not overengineer your solution.** We do not expect a highly optimized implementation or ML model, but we value candidates who take that aspect into consideration in all their choices.

Make it easy for us to run your application: please list the dependencies, or create a very simple Docker image that runs your API with all dependencies installed.

Finally, we are **very** interested in your insight about your solution. Does it work well for the purpose? What else is needed to keep improving it? Any extra insight about the nature of the problem in the music industry is very welcome.

### Suggestions:

* Use a Jupyter notebook to train the classifier and present results
* Use FastAPI to implement the API
* It's OK to run the API with a development server on localhost

## Questions to think about

In the interview, we may discuss the following:

* We want to run your system to deduplicate our catalog of 100M SRs: would you recommend it?
* Once such a system is developed, how would it evolve over time in terms of algorithm and feedback loop?
* What other features would you add to the model for a new release? What enhancements would be part of further developments (algorithm, data, external sources, …)?
* How would you proceed if you wanted to deploy this system on AWS for large-scale usage?
* In the future we would like to use embeddings for candidate retrieval and validation. Could you present an approach for doing so? How could it go into production?