💻 M1 compatibility and generalized CLI #99

Merged
merged 10 commits into master from fixes/cli_dockerization
Mar 1, 2022

Conversation

mattigrthr
Contributor

Since we experienced several system-related issues across platforms due to the pure Python approach, we are trying to dockerize the CLI after all.

We've realized several compatibility issues exist for Docker containers on M1-based Macs. We have fixed those and switched to pulling pre-built images from our Docker Hub in the docker-compose.yml.

Closes #62

@mattigrthr mattigrthr added the labels bug, pipeline, pipeline/population-density, pipeline/osm-poi, pipeline/google-poi, core, CLI, Jupyter Notebook, pipeline/admin-boundaries, pipeline/google-trends Feb 18, 2022
@mattigrthr mattigrthr self-assigned this Feb 18, 2022
@mattigrthr
Contributor Author

I have discovered python-on-whales which is a great pip package to execute Docker stuff from a Python script.

I got rid of all the initializing scripts that were building the Docker containers.

Using the individual images for the pipelines works nicely on all operating systems.
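As a rough illustration of the approach described above, the sketch below runs one pre-built pipeline image through python-on-whales. The container path `/app/tmp` and the helper names `build_run_kwargs` and `run_pipeline` are hypothetical; the thread does not show the project's actual values.

```python
from pathlib import Path


def build_run_kwargs(host_dir, container_dir="/app/tmp"):
    """Build keyword arguments for python_on_whales' docker.run().

    container_dir is a hypothetical mount point, not taken from the project.
    """
    host_path = Path(host_dir).resolve()
    return {
        # A (host path, container path) tuple describes a bind mount.
        "volumes": [(str(host_path), container_dir)],
        # Remove the container once the pipeline has finished.
        "remove": True,
    }


def run_pipeline(image, host_dir):
    """Pull a pre-built image and run it with host_dir mounted."""
    # Imported lazily so build_run_kwargs() works without Docker installed.
    from python_on_whales import docker

    docker.pull(image)
    return docker.run(image, **build_run_kwargs(host_dir))
```

A call such as `run_pipeline("<image name>", "./tmp")` would then replace a local build step, assuming the image has been published to a registry.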

The last problem now is the dockerized CLI. Once the CLI is dockerized, we have a "Docker in Docker" scenario. In that case, the mounted volumes don't seem to work across two containers. E.g., if I mount the tmp directory for the population files into the CLI Docker which then mounts that into the population-density Docker which it is executing, the files are not written to the local host file system.

I had a look at Airflow but didn't see a way to get it integrated in a good way with the current CLI flow.

Any help would be appreciated!

@mattigrthr mattigrthr added the help wanted label Feb 24, 2022
@mattigrthr
Contributor Author

Here are a few more details from the user perspective of what we are trying to achieve.

User story

As a user, I want a CLI to select all the pipelines I'd like to run. I can choose the geographical region for which I want to run the pipelines. In the case of the population data, I also want to select specific demographic groups I am interested in.
Once I have made all my selections, the CLI will run all the pipelines in the correct order (e.g., the google-poi pipeline depends on the osm-poi pipeline). Once the data pipelines have run successfully, all the data should be imported into the Postgres database.
When all the data has been imported, the Jupyter environment should be launched so I can start working with the data conveniently.
Besides running the individual data pipelines, I want to be able to download the demo data through the CLI. Once the demo data is downloaded, the database and the Jupyter notebook with the popularity correlation should be launched.
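The ordering requirement in the user story could be handled with a topological sort. The following is a minimal sketch using the standard library's `graphlib` (Python 3.9+). Only the google-poi to osm-poi dependency comes from the thread; treating the other pipelines as independent, and the names `DEPENDENCIES` and `execution_order`, are assumptions made for illustration.

```python
from graphlib import TopologicalSorter

# Only the google-poi -> osm-poi dependency is stated in the thread;
# the other pipelines are assumed to be independent for this sketch.
DEPENDENCIES = {
    "admin-boundaries": set(),
    "population-density": set(),
    "osm-poi": set(),
    "google-poi": {"osm-poi"},
    "google-trends": set(),
}


def execution_order(selected):
    """Return the selected pipelines plus their prerequisites, in run order."""
    needed = set()
    stack = list(selected)
    # Collect every pipeline that the selection depends on, directly or not.
    while stack:
        name = stack.pop()
        if name not in needed:
            needed.add(name)
            stack.extend(DEPENDENCIES[name])
    graph = {name: DEPENDENCIES[name] for name in needed}
    # static_order() yields each pipeline after all of its prerequisites.
    return list(TopologicalSorter(graph).static_order())
```

Selecting only google-poi would schedule osm-poi first, so the user does not have to know about the dependency.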

Current status

We have a Python script for the CLI that collects all the necessary user inputs and executes all components (data pipelines, database importer, database, Jupyter) as Docker images. It works perfectly if it is executed as a local Python script. However, we need to dockerize the CLI so that it runs independently of the user's local setup and requires only Docker.
Currently, we have a "Docker in Docker" scenario when the CLI is a Docker container from which other Docker containers are executed. We need to have the files of the data pipelines saved on the host to make them shareable. A volume mounted from the host into one container, which then mounts the same volume into another container (e.g., local host to CLI to population-density), does not work correctly, and the data is not saved to the local host.
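The thread does not identify the cause. One common explanation, offered here only as an assumption, applies when the CLI container talks to the host's Docker daemon through a mounted Docker socket: the daemon resolves bind-mount source paths on the host, so a path that exists only inside the CLI container does not refer to the intended directory. A minimal sketch of translating a container path back to its host equivalent before passing it on follows; the function name and all paths are hypothetical.

```python
from pathlib import PurePosixPath


def to_host_path(container_path, container_root, host_root):
    """Map a path inside the CLI container to the matching host path.

    container_root is where the host directory is mounted inside the CLI
    container; host_root is that same directory's location on the host.
    Raises ValueError if container_path is not below container_root.
    """
    relative = PurePosixPath(container_path).relative_to(container_root)
    return str(PurePosixPath(host_root) / relative)
```

Under this assumption, `host_root` would have to be handed to the CLI container explicitly (for example through an environment variable), and the translated path would be used as the bind-mount source for the pipeline container. The sketch assumes POSIX-style paths, so Windows hosts would need separate handling.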

@mattigrthr mattigrthr changed the title 🐳 Dockerizing CLI and M1 compatibility 💻 M1 compatibility and generalized CLI Feb 28, 2022
@mattigrthr mattigrthr marked this pull request as ready for review February 28, 2022 17:52
@mattigrthr
Contributor Author

We are going to merge this PR and open a new issue for the Dockerization, as this PR contains fixes for M1 compatibility issues and uses pre-built Docker images, which simplifies the local Python setup for the CLI.

Contributor

@floriankuwala floriankuwala left a comment

Changed the setup README to refer to PostgreSQL and to note that the newest version of pip is needed.

@mattigrthr mattigrthr merged commit bc20dba into master Mar 1, 2022
@mattigrthr mattigrthr deleted the fixes/cli_dockerization branch March 1, 2022 10:21
Labels
bug, CLI, core, help wanted, Jupyter Notebook, pipeline/admin-boundaries, pipeline/google-poi, pipeline/google-trends, pipeline/osm-poi, pipeline/population-density, pipeline
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

Running demo on Windows
2 participants