## Design considerations for extracting data from the San Francisco civic issue tracking public API

Data would have to be pulled from https://data.sfgov.org/, so it makes sense to run the job periodically as a batch process (it doesn't seem likely that the data is updated in close to real time).

The system should be able to load data incrementally if possible. In this case it is possible through the provided API (https://dev.socrata.com/foundry/data.sfgov.org/ktji-gk7t). The initial data import would have to use the big file containing the full dataset, but after that the endpoint can be queried with a date filter to get incremental updates. The full import procedure should be repeatable and automated.
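
As a rough illustration of the date-filtered incremental pull, the sketch below only builds the query URL. It assumes the usual Socrata resource URL form for the dataset id `ktji-gk7t`, and the column name `updated_datetime` is a guess that should be checked against the dataset schema; `incremental_url` itself is a hypothetical helper, not part of any library.

```python
import urllib.parse
from datetime import datetime

# Assumption: standard Socrata resource URL for the dataset id in the docs link.
BASE_URL = "https://data.sfgov.org/resource/ktji-gk7t.json"


def incremental_url(since, date_field="updated_datetime", limit=1000, offset=0):
    """Build a query URL for rows whose `date_field` is later than `since`.

    `date_field` is an assumed column name; verify it against the schema.
    """
    stamp = since.strftime("%Y-%m-%dT%H:%M:%S")
    params = {
        "$where": f"{date_field} > '{stamp}'",  # SoQL date filter
        "$order": f"{date_field} ASC",          # stable order for paging
        "$limit": limit,
        "$offset": offset,
    }
    return f"{BASE_URL}?{urllib.parse.urlencode(params)}"


if __name__ == "__main__":
    # The timestamp of the last successful run would normally come from stored state.
    print(incremental_url(datetime(2020, 1, 1)))
```

A job could page through results by increasing `offset` until a page comes back with fewer than `limit` rows, then store the newest timestamp it saw for the next run.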

If the ETL process is complicated enough, it's good to divide it into components. Each component should follow the basic UNIX philosophy: be composable, do one thing well, and expect the output of one component to be the input to another. If one component fails, the system should still be in a stable state, without incomplete data, duplicates or data corruption.
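
One way a single component could avoid leaving incomplete data or duplicates behind is sketched below: output goes to a temporary file that is renamed into place only once it is complete, and records are de-duplicated by key so that re-running a step is harmless. The function names and the `id` key are hypothetical and only serve the illustration.

```python
import json
import os
import tempfile


def write_atomically(records, dest_path):
    """Write records as JSON Lines; readers never see a half-written file."""
    dest_dir = os.path.dirname(os.path.abspath(dest_path))
    fd, tmp_path = tempfile.mkstemp(dir=dest_dir, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            for record in records:
                tmp.write(json.dumps(record) + "\n")
        # Rename only after everything was written; atomic on one filesystem.
        os.replace(tmp_path, dest_path)
    except BaseException:
        os.unlink(tmp_path)  # leave no partial output behind
        raise


def dedupe(records, key="id"):
    """Keep the last record seen for each key (the key name is an assumption)."""
    latest = {}
    for record in records:
        latest[record[key]] = record
    return list(latest.values())
```

Because the output file either exists in full or not at all, a downstream component can treat its presence as a signal that the upstream step finished.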

Extending this pipeline to handle more data sources would be a good use case for a modular architecture. Extracting data could be done by different components, or maybe by the same one run with different parameters. The same elements of data normalization and processing could probably be shared, so this design approach would enable code reuse.

If the data volume is very large and/or we want to use multiple workers to speed up the process, the data should be partitioned. Files should be read in bulk for performance.
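
Partitioning by date is one option that fits an API with a date filter. The sketch below splits a date range into month-sized, half-open windows that independent workers could each process; the choice of monthly windows and the name `month_partitions` are assumptions made for the example.

```python
from datetime import date


def month_partitions(start, end):
    """Yield (window_start, window_end) pairs covering [start, end), one per month."""
    if start >= end:
        return
    current = date(start.year, start.month, 1)
    while current < end:
        if current.month == 12:
            following = date(current.year + 1, 1, 1)
        else:
            following = date(current.year, current.month + 1, 1)
        # Clip the first and last windows to the requested range.
        yield max(current, start), min(following, end)
        current = following


if __name__ == "__main__":
    for window in month_partitions(date(2020, 11, 15), date(2021, 2, 10)):
        print(window)
```

Each window can then be handed to a worker, for example through `concurrent.futures`, and written to its own output file so that a failed partition can be re-run alone.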

If multiple components are involved, there should be some kind of workflow system controlling their execution and allowing task dependencies and conditions to be defined. Such a system should also handle failures in some way: maybe just reporting them, maybe retrying, running rollbacks, etc. This could be achieved with systems that take a more programmatic approach (Airflow, Luigi, Spring Batch), a full-on ETL system like Pentaho or Clover, or a custom one implemented using Spark, Kafka or a similar technology that allows distributing and controlling work between workers. Of course, it's also possible to implement everything completely by hand.
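
To make the "by hand" option concrete, here is a minimal sketch of a runner that executes tasks in dependency order and retries failures. It is not how Airflow or Luigi work internally; it only uses the standard-library `graphlib` module (Python 3.9+), and `run_pipeline` and its parameters are hypothetical.

```python
import time
from graphlib import TopologicalSorter  # Python 3.9+


def run_pipeline(tasks, dependencies, max_retries=2, delay_seconds=0.0):
    """Run tasks in dependency order, retrying each failed task.

    tasks: dict mapping task name -> zero-argument callable
    dependencies: dict mapping task name -> set of names that must finish first
    Returns the completed task names in execution order. If a task keeps
    failing, the exception is re-raised and downstream tasks never run.
    """
    graph = {name: set(dependencies.get(name, ())) for name in tasks}
    completed = []
    for name in TopologicalSorter(graph).static_order():
        for attempt in range(max_retries + 1):
            try:
                tasks[name]()
                break
            except Exception as exc:
                # Failure handling here is just reporting plus retry.
                print(f"{name} failed on attempt {attempt + 1}: {exc}")
                if attempt == max_retries:
                    raise
                time.sleep(delay_seconds)
        completed.append(name)
    return completed
```

A call such as `run_pipeline({"extract": extract, "load": load}, {"load": {"extract"}})` would run `extract` before `load`. A real workflow system adds scheduling, persistence of task state and parallel execution on top of this idea.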

A safe, preferably non-local storage system should be used to store raw data from the endpoint and data from intermediate steps. This is especially important if we want to use many distributed workers. That can be a self-hosted file system, Amazon S3, GCS or a similar solution.

Then it would be time for the other parts of ETL, the transform (T) operations, such as the following or similar:
* joining with other data sets
* aggregation
* adding calculated values
* transposing or pivoting
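
A few of these operations can be sketched with the standard library alone. The record layout below (`district_id`, `opened`, `closed`) is invented for the illustration and does not reflect the real dataset schema; `enrich` and `count_by` are hypothetical helpers.

```python
from collections import Counter
from datetime import datetime


def enrich(cases, districts):
    """Join each case with a district lookup and add a calculated value."""
    for case in cases:
        opened = datetime.fromisoformat(case["opened"])
        closed = datetime.fromisoformat(case["closed"])
        yield {
            **case,
            # Join: look up the district name from the other data set.
            "district_name": districts.get(case["district_id"], "unknown"),
            # Calculated value: whole days between opening and closing.
            "days_to_close": (closed - opened).days,
        }


def count_by(rows, field):
    """Aggregate: number of rows per distinct value of `field`."""
    return Counter(row[field] for row in rows)


if __name__ == "__main__":
    districts = {"d1": "Mission"}  # made-up lookup table
    cases = [
        {"id": 1, "district_id": "d1",
         "opened": "2020-01-01T10:00:00", "closed": "2020-01-04T10:00:00"},
    ]
    rows = list(enrich(cases, districts))
    print(rows, count_by(rows, "district_name"))
```

For larger volumes the same steps would more likely be expressed in a dataframe library or in SQL, but the shape of the transformation stays the same.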
