# Geolocator - Vio.com coding challenge

(For the full description of the challenge check here)
The challenge is split into 2 parts:
The main features of this module are:
- Owns the `location` entity/schema and the access to the DB (I have chosen Postgres w/ ecto)
- Has a CSV loader used to populate the database with a given file
- The load process generates statistics that are returned at the end of the process
- Elixir
- Erlang
- Docker (and docker compose)
- Note that all examples in this document use `docker compose`, which is the MacOS syntax. In unix we have to use `docker-compose` instead (also, it may require `sudo`)
I have included a `.tool-versions` file so we can use `asdf install` and install the dependencies automatically (see asdf)
Docker can be found here
Once elixir and erlang are installed, we can compile the module with
mix deps.get
mix compile
This will compile the module using `MIX_ENV=dev`, which is enough for using the CSV loader functionality
Since we need a running database with a valid table in order to load the locations into it, we need to start an instance of Postgres. I've configured a docker compose file to help with that in a local dev environment. In order to start the DB just type
docker compose up db
The DB files are stored at `./postgres_data`, so you can stop the service and delete the folder to start from scratch.
Once the DB is running, we need to run the migrations in order to create the database and the `locations` table. We can set up everything by running the following:
mix ecto.setup
Once mix gives us the ok, the database will be ready for importing the CSV.
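For reference, a rough sketch of what the `locations` migration could look like, with columns inferred from the API response shown later in this document (the real migration lives in `priv/repo/migrations` and may differ):

```elixir
defmodule Geolocator.Repo.Migrations.CreateLocations do
  use Ecto.Migration

  def change do
    create table(:locations) do
      # Stored as a native Postgres inet column (via ecto_network)
      add :ip_address, :inet, null: false
      add :country_code, :string
      add :country, :string
      add :city, :string
      add :latitude, :float
      add :longitude, :float
      add :mystery_value, :bigint
    end

    # A unique index on ip_address is what makes the upsert/on_conflict
    # strategy described later possible
    create unique_index(:locations, [:ip_address])
  end
end
```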
In order to not require using `iex` and calling `Geolocator.Loader` directly, I've created a Mix Task that will do so for us.
LOADER_BATCH_SIZE=2500 mix geolocator.loader -f <file_path>
The `LOADER_BATCH_SIZE` can be any number between 1 and 9362, this last number being due to ecto/postgres' limit of 65535 parameters in a query: since the `location` schema has 7 parameters per row, we can only fit floor(65535 / 7) = 9362 rows per insert. The environment variable can also be omitted and will default to 2000 rows per insert.
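The task itself is presumably just a thin wrapper around `Geolocator.Loader.load_from_csv`; a minimal sketch of how such a task could look (the option parsing and exact function arity are assumptions, not the actual implementation):

```elixir
defmodule Mix.Tasks.Geolocator.Loader do
  use Mix.Task

  @shortdoc "Loads a CSV file of locations into the database"

  @impl Mix.Task
  def run(args) do
    # Parse the -f/--file flag used in the example above
    {opts, _argv, _errors} =
      OptionParser.parse(args, strict: [file: :string], aliases: [f: :file])

    # Start the application (and therefore the Repo) before loading
    Mix.Task.run("app.start")

    Geolocator.Loader.load_from_csv(opts[:file])
  end
end
```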
I've performed some preliminary benchmarking to decide on the batch size and it doesn't seem to affect the process too much:
| Batch size | Run 1 (ms) | Run 2 (ms) | Run 3 (ms) |
|---|---|---|---|
| 1 | Cancelled it after 5 mins | | |
| 500 | 38217631 | 43317925 | 41711280 |
| 900¹ | 40358881 | 41050354 | 40085983 |
| 2000 | 37804438 | 37651701 | 37803080 |
| 5000 | 38389044 | 43081773 | 43060038 |
| 9362² | 45540493 | 39264909 | 45297188 |
As we can see, the fastest load times seem to happen around the 2000 batch size. Some other values have been chosen because they seemed relevant:
- ¹ The file buffer is set to be 100,000 bytes and, with rows being around 100 characters, it means it can fit about 1000 rows, so I went to slightly less in order to try to fit a whole batch in the buffer.
- ² As previously mentioned, the max number of rows allowed by Postgres/ecto due to parameter limits in queries. This number will change if more columns are added to (or removed from) the `locations` schema.
Instead of creating a whole separate mix project, I decided to combine this component and the web server in the same project. This is due to the way phoenix naturally splits the http server and service layers into different modules. It seemed a little redundant to create a mix project with all the code and then a phoenix project with just a 5 line controller and a dependency on the module. I could have used an umbrella project to make the separation of concerns/ownership more apparent, but it still seemed like complexity for the sake of complexity.
In a real world scenario, if the geolocation library were to be used by multiple services, it could make sense to extract it from this repo (and it would be quite trivial to do so), but there would still be a number of questions to answer, like who owns the database (it is a good, or at least common, practice for a database to have a single owner/service), how/when to run the DB migrations, etc.
The `stats` struct is as simple as I could make it and still make sense.
Also regarding the stats, the `&load_row/2` in the loader returns some simple counters (accepted and discarded rows) instead of a `stats` object of its own for the sake of simplicity, but if the stats were to be more complex, it would make sense for the whole load pipeline to have access to a `stats` object and pass it around each stage while filling it.
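As an illustration, the current struct could be as small as the following (the exact field names in `lib/geolocator/loader/stats.ex` are an assumption here):

```elixir
defmodule Geolocator.Loader.Stats do
  @moduledoc "Counters returned at the end of a CSV load."

  # accepted/discarded mirror the counters returned by &load_row/2;
  # any richer metrics (timings, discard reasons, ...) would be added here
  defstruct accepted: 0, discarded: 0
end
```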
Lastly, regarding what to do when conflicts happen (a conflict being two or more rows with the same ip address), I've decided to take the last occurrence as the truth, as I expect that to be the correct behaviour in the case of some other service/process appending the most up to date information to the file. This also has the benefit of being really easy to code, as it only requires transforming the insert operation into an upsert one. If we wanted to do the opposite (only accept the first occurrence), we can just change the insert's `on_conflict` config so it rejects conflicts, meaning it will only allow the insert the first time.
If we just don't want to insert rows with any conflict at all, it would require us to either load the whole file in memory (not really an option for really huge files) or read it twice: first to pre-process it and second to actually load it. The pre-processing would mean building a map with the conflicting ip addresses and then using that in the second pass to ignore rows with an ip in the map, but at this point it would be a better idea to just process the file outside elixir, for example directly in the filesystem using any of the tools provided by the os (for example with something like `awk`).
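A sketch of the two `on_conflict` strategies mentioned above, assuming the batch insert goes through `Repo.insert_all/3` and that `locations` has a unique index on `ip_address` (the exact options used in the loader may differ):

```elixir
# `rows` would be one parsed batch: a list of maps matching the schema fields
rows = [
  %{
    ip_address: %Postgrex.INET{address: {35, 227, 116, 242}},
    country_code: "LR",
    country: "Uganda",
    city: "Port Simoneside",
    latitude: 27.028236306590998,
    longitude: -86.40283568649986,
    mystery_value: 3_227_619_485
  }
]

# Last occurrence wins (the chosen behaviour): turn the insert into an upsert
Geolocator.Repo.insert_all(Geolocator.Locations.Location, rows,
  on_conflict: {:replace, [:country_code, :country, :city, :latitude, :longitude, :mystery_value]},
  conflict_target: :ip_address
)

# First occurrence wins instead: conflicting rows are simply not inserted
Geolocator.Repo.insert_all(Geolocator.Locations.Location, rows,
  on_conflict: :nothing,
  conflict_target: :ip_address
)
```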
There are some improvements that could be done in a real world scenario as the project grows:
- Add more metrics to the `stats` object, for example subcategories for discarded rows: were they discarded during CSV parsing, while checking the changeset, or at the insert itself? This would also require passing a `stats` object around the load pipeline as previously mentioned.
- Separate the repository layer from the service one via a ports-and-adapters pattern, dependency injection or something similar. I didn't think it necessary just yet because we are only using a single storage backend and for testing we can use Ecto's Sandbox adapter (although you could argue that by using it, any test we write cannot be considered a unit test at all).
There are 2 ways of running the service: locally using a console, or having it deployed in a docker container.
Running locally is useful when you are actively working with the codebase and want to test the changes fast. This requires having both `elixir` and `erlang` installed as described in the pre-requirements section.
First get the dependencies with
mix deps.get
And then either launch the phoenix server by itself or inside an interactive terminal
# Standalone phoenix server
mix phx.server
# This will open an iex terminal with Phoenix running in it
iex -S mix phx.server
Take into account that since we are not specifying the MIX_ENV, by default it will be set to `dev`, so the configuration file it will use will be `dev.exs` (plus `runtime.exs`, but other than `LOADER_BATCH_SIZE` all other properties will only be used in the `prod` environment)
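For reference, reading `LOADER_BATCH_SIZE` in `runtime.exs` could look roughly like this (a sketch; the exact config keys used by this project may differ):

```elixir
# config/runtime.exs (sketch)
import Config

# Fall back to the 2000-row default when the variable is not set
batch_size =
  "LOADER_BATCH_SIZE"
  |> System.get_env("2000")
  |> String.to_integer()

config :geolocator, Geolocator.Loader, batch_size: batch_size
```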
This is the most convenient way of deploying the service if we either want to simulate the production environment or don't even want to install elixir/erlang to run the service. We only need to have docker installed and running (check the pre-requirements section).
First we will ask docker compose to build an image with the phoenix release
docker compose build
This will use a build image with all the required dependencies to generate a release of the phoenix service and then copy the release inside a much more lightweight image that will be used as the deploy image (check the Dockerfile for details)
After the deploy image has been built, depending on whether the database container is already running, we can either start just the phoenix service or both the service and the database:
# This will start both the database and the phoenix service (phoenix will wait for Postgres)
docker compose up
# This will just start the phoenix service. If the database is not running it will crash on boot
docker compose up geolocator
In both cases, the phoenix service will attempt to run any pending database migrations (stored in `priv/repo/migrations`)
There is just a single endpoint in the API, a `GET` request to `/api/locations/<IP_ADDRESS>`, where `<IP_ADDRESS>` is the ip address we want to obtain the location info for. Of course, in order for this endpoint to return anything other than a 404, we need to have previously loaded the database (see Loading a CSV file).
Request:
curl --location 'http://localhost:4000/api/locations/35.227.116.242'
Returns a 200 with the body:
{
"city": "Port Simoneside",
"country": "Uganda",
"country_code": "LR",
"ip_address": "35.227.116.242",
"latitude": 27.028236306590998,
"longitude": -86.40283568649986,
"mystery_value": 3227619485
}
The API returns a 404 when there is no match in the database:
curl --location 'http://localhost:4000/api/locations/192.168.1.1'
Returns a 404 with the body:
{
"errors": {
"detail": "Not Found"
}
}
The API returns a 400 when the path param provided is not a valid ip address
curl --location 'http://localhost:4000/api/locations/a'
Returns a 400 with the body:
{
"errors": {
"detail": "Bad Request"
}
}
If the API were to grow in the future, it would be a good idea to add Swagger/OpenAPI specs, for example using the open_api_spex library.
Also, we could either create our own metrics and/or reuse the ones autogenerated by phoenix in telemetry.ex (although to be fair I haven't really used them before, so I am not sure if they can be sent to an external metrics storage like prometheus or if they can only be consumed by the Phoenix.LiveDashboard)
Lastly, the way the ip address validation works is by trying to cast the value provided in the path param into an `EctoNetwork.INET`. This works fine but has the weird effect of also allowing any integer between 0 and 2^32-1, in which case it will take the integer as the decimal representation of the bits underneath. For example, `0.255.0.255` is `00000000 11111111 00000000 11111111` in binary, which translates to 16711935 in decimal, so if we try to ask for `/api/locations/16711935`, it will accept it and look for `/api/locations/0.255.0.255`. This is a weird (and probably unexpected) case and could be fixed by checking the ip format in the controller instead of relying on the database schema, but I figured it would be fine for now (and it is still technically correct).
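To make the flow concrete, here is a sketch of a controller action built around that cast (apart from `EctoNetwork.INET`, the module and function names are assumptions, not necessarily the exact implementation):

```elixir
defmodule GeolocatorWeb.LocationController do
  use GeolocatorWeb, :controller

  action_fallback GeolocatorWeb.FallbackController

  def show(conn, %{"ip_address" => ip_param}) do
    # Casting through the same Ecto type that backs the schema field is what
    # also lets plain integers such as 16711935 through
    with {:ok, inet} <- Ecto.Type.cast(EctoNetwork.INET, ip_param),
         %Geolocator.Locations.Location{} = location <- Geolocator.Locations.get_location(inet) do
      render(conn, :show, location: location)
    else
      # Rendered as a 400 by the fallback controller
      :error -> {:error, :bad_request}
      # Rendered as a 404 by the fallback controller
      nil -> {:error, :not_found}
    end
  end
end
```

A stricter check could run `:inet.parse_strict_address(String.to_charlist(ip_param))` before the cast, which rejects the bare-integer form while still accepting regular dotted addresses.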
We can use quantum in order to call functions with a cron-like syntax. That way we can call `Geolocator.Loader.load_from_csv` with any given path. The only problem would be what to do with already loaded files: should we rename them, delete them, or are they appended to and we need to keep an offset?
We would need to start a scheduler in the supervision tree of the application and then just configure it.
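A sketch of that wiring, assuming quantum v3 (the scheduler module name matches the config example below; the children list is illustrative):

```elixir
# Define the scheduler, reading its jobs from the :geolocator app config
defmodule Geolocator.Scheduler do
  use Quantum, otp_app: :geolocator
end

# And add it to the children list in Geolocator.Application.start/2
children = [
  Geolocator.Repo,
  Geolocator.Scheduler,
  GeolocatorWeb.Endpoint
]
```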
The config to enable a daily load would be something like (in `prod.exs`):
config :geolocator, Geolocator.Scheduler,
jobs: [
{"@daily", {Geolocator.Loader, :load_from_csv, ["path/to/file"]}}
]
Some tests need access to the database (although they will not persist data in it): we are using the Sandbox adapter for ecto, which wraps test operations in a transaction that is later rolled back, but it still requires a live connection to a real database with the migrations up to date.
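For context, the per-test checkout follows the standard Ecto SQL Sandbox pattern, roughly like this (a sketch, not necessarily verbatim what this project's test case modules do):

```elixir
defmodule Geolocator.DataCase do
  # A DataCase-like case template (name assumed); each test gets its own
  # sandboxed connection whose transaction is rolled back when the test exits
  use ExUnit.CaseTemplate

  setup tags do
    pid = Ecto.Adapters.SQL.Sandbox.start_owner!(Geolocator.Repo, shared: not tags[:async])
    on_exit(fn -> Ecto.Adapters.SQL.Sandbox.stop_owner(pid) end)
    :ok
  end
end
```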
Tests can be launched with
mix test
Results
......................
Finished in 0.1 seconds (0.04s async, 0.09s sync)
22 tests, 0 failures
Test coverage can be obtained with:
MIX_ENV=test mix coveralls
......................
Finished in 0.1 seconds (0.04s async, 0.08s sync)
22 tests, 0 failures
Randomized with seed 604344
----------------
COV FILE LINES RELEVANT MISSED
80.0% lib/geolocator/application.ex 34 5 1
50.0% lib/geolocator/loader.ex 77 14 7
100.0% lib/geolocator/loader/stats.ex 27 1 0
100.0% lib/geolocator/locations.ex 65 3 0
100.0% lib/geolocator/locations/location.ex 36 2 0
100.0% lib/geolocator_web/controllers/error_json.ex 15 1 0
66.6% lib/geolocator_web/controllers/fallback_controller.ex 31 3 1
100.0% lib/geolocator_web/controllers/location_controller.ex 18 3 0
100.0% lib/geolocator_web/controllers/location_json.ex 18 8 0
100.0% lib/geolocator_web/router.ex 12 2 0
[TOTAL] 78.5%
----------------
It is possible to dump this into a file and to modify the output so it works better with different tools such as Travis. For the full documentation of coveralls, check their GitHub repo.