💻 M1 compatibility and generalized CLI #99

Merged
merged 10 commits into from
Mar 1, 2022
2 changes: 1 addition & 1 deletion kuwala/.dockerignore → .dockerignore
@@ -1,5 +1,5 @@
node_modules
docker-compose.yml
prettier.confiog.js
prettier.config.js
tmp
env
54 changes: 38 additions & 16 deletions README.md
@@ -66,10 +66,10 @@ We currently have five pipelines for different third-party data sources which ca
database. The following pipelines are integrated:

- [Admin Boundaries](https://github.com/kuwala-io/kuwala/tree/master/kuwala/pipelines/admin-boundaries/README.md)
- [Google POIs](https://github.com/kuwala-io/kuwala/tree/master/kuwala/pipelines/google-poi/README.md)
- [Google Trends](https://github.com/kuwala-io/kuwala/tree/master/kuwala/pipelines/google-trends/README.md)
- [OSM POIs](https://github.com/kuwala-io/kuwala/tree/master/kuwala/pipelines/osm-poi/README.md)
- [Population Density](https://github.com/kuwala-io/kuwala/tree/master/kuwala/pipelines/population-density/README.md)
- [Google POIs](https://github.com/kuwala-io/kuwala/tree/master/kuwala/pipelines/google-poi/README.md)
- [Google Trends](https://github.com/kuwala-io/kuwala/tree/master/kuwala/pipelines/google-trends/README.md)
- [OSM POIs](https://github.com/kuwala-io/kuwala/tree/master/kuwala/pipelines/osm-poi/README.md)
- [Population Density](https://github.com/kuwala-io/kuwala/tree/master/kuwala/pipelines/population-density/README.md)

### Jupyter environment & CLI

@@ -90,22 +90,44 @@ score. In the demo, we have preprocessed popularity data and a test dataset with
#### Run the demo

You could either use the deployed example on Binder using the badge above or run everything locally. The Binder example
simply uses Pandas dataframes and is not connecting to a data warehouse. \
To run the demo locally, launch Docker in the background and from inside the root directory run:
simply uses Pandas dataframes and is not connecting to a data warehouse.

Linux/Mac:
```zsh
cd kuwala/scripts && sh initialize_core_components.sh && sh run_cli.sh
```
and for Windows (Please use PowerShell or any Docker integrated terminal):
```PS
cd kuwala/scripts && sh initialize_windows.sh && cd windows && sh initialize_core_components.sh && sh run_cli.sh
<details>
<summary>Setting up and running the CLI</summary><br/>

#### Prerequisites

1. Installed version of `Docker` and `docker-compose v2`.
- We recommend using the latest version of [`Docker Desktop`](https://www.docker.com/products/docker-desktop).
2. Installed version of `Python3` and *latest* `pip, setuptools, and wheel` version.
- We recommend using version `3.9.5` or higher.
- To check your current version run `python3 --version`.
3. Installed version of `libpq`.
- For Mac, you can use brew: `brew install libpq`
4. Installed version of `postgresql`.
- For Mac, you can use brew: `brew install postgresql`

#### Setup

1. Change your directory to `kuwala/core/cli`.
2. Create a virtual environment.
- For instructions on how to set up a `venv` on different system see [here](https://packaging.python.org/en/latest/guides/installing-using-pip-and-virtual-environments/).
3. Install dependencies by running `pip3 install --no-cache-dir -r requirements.txt`

#### Run

To start the CLI, run the following command from inside the `kuwala/core/cli/src` directory and follow the instructions:

```zsh
python3 main.py
```

#### Run the data pipelines yourself
</details>

### Using Kuwala components individually

To run the pipelines yourself, please follow the instructions for the
[CLI](https://github.com/kuwala-io/kuwala/tree/master/kuwala/core/cli/README.md).
To use Kuwala's components, such as the data pipelines or the Jupyter environment, individually, please refer to the
[instructions under `/kuwala`](https://github.com/kuwala-io/kuwala/blob/master/kuwala/README.md).

---

82 changes: 31 additions & 51 deletions kuwala/docker-compose.yml → docker-compose.yml
@@ -18,9 +18,7 @@ services:
# docker-compose --profile database up
postgres:
container_name: postgres
build:
context: .
dockerfile: ./core/database/dockerfile
image: kuwala/core:postgis-h3-0.2.0-alpha
shm_size: 16g
restart: always
environment:
@@ -31,34 +29,30 @@ services:
ports:
- '5432:5432'
volumes:
- ./tmp/kuwala/db/postgres:/var/lib/postgresql
- ./kuwala/tmp/kuwala/db/postgres:/var/lib/postgresql
profiles:
- network
- database

# docker-compose run database-importer --continent=<> --country=<> --country_region=<>
# docker-compose run database-importer --continent=<> --country=<> --country_region=<> [--population_density_date=<>]
database-importer:
container_name: database-importer
build:
context: .
dockerfile: ./core/database/importer/dockerfile
image: kuwala/core:database-importer-0.2.0-alpha
restart: always
environment:
- DATABASE_HOST=postgres
- DATABASE_NAME=kuwala
- DATABASE_USER=kuwala
- DATABASE_PASSWORD=password
volumes:
- ./tmp/kuwala:/opt/tmp/kuwala
- ./kuwala/tmp/kuwala:/opt/tmp/kuwala
profiles:
- network

# docker-compose run database-transformer
database-transformer:
container_name: database-transformer
build:
context: .
dockerfile: ./core/database/transformer/dockerfile
image: kuwala/core:database-transformer-0.2.0-alpha
restart: always
environment:
- DBT_HOST=postgres
@@ -68,35 +62,33 @@ services:
# docker-compose run --service-ports jupyter
jupyter:
container_name: jupyter
build:
context: .
dockerfile: core/jupyter/dockerfile
image: kuwala/core:jupyter-0.2.0-alpha
restart: always
environment:
- JUPYTER_ENABLE_LAB=yes
- DBT_HOST=postgres
volumes:
- ./core/jupyter/modules:/home/jovyan/kuwala/modules
- ./core/database/transformer:/home/jovyan/kuwala/dbt
- ./core/jupyter/notebooks:/home/jovyan/kuwala/notebooks
- ./core/jupyter/resources:/home/jovyan/kuwala/resources
- ./tmp/kuwala/transformer:/home/jovyan/kuwala/tmp/kuwala/transformer
- ./kuwala/core/jupyter/modules:/home/jovyan/kuwala/modules
- ./kuwala/core/database/transformer:/home/jovyan/kuwala/dbt
- ./kuwala/core/jupyter/notebooks:/home/jovyan/kuwala/notebooks
- ./kuwala/core/jupyter/resources:/home/jovyan/kuwala/resources
- ./kuwala/tmp/kuwala/transformer:/home/jovyan/kuwala/tmp/kuwala/transformer
ports:
- '8888:8888'
profiles:
- network
- jupyter


# docker-compose run admin-boundaries --continent=<> --country=<> --country_region=<>
admin-boundaries:
container_name: admin-boundaries
image: kuwala/data-pipelines:admin-boundaries-0.2.0-alpha
environment:
- SPARK_MEMORY=16g
build:
context: .
dockerfile: ./pipelines/admin-boundaries/dockerfile
volumes:
- ./tmp/kuwala/osm_files:/opt/tmp/kuwala/osm_files
- ./tmp/kuwala/admin_boundary_files:/opt/tmp/kuwala/admin_boundary_files
- ./kuwala/tmp/kuwala/osm_files:/opt/tmp/kuwala/osm_files
- ./kuwala/tmp/kuwala/admin_boundary_files:/opt/tmp/kuwala/admin_boundary_files
restart: always
profiles:
- network
@@ -105,14 +97,12 @@ services:
# docker-compose --profile google-poi-scraper up
google-poi-api:
container_name: google-poi-api
image: kuwala/data-pipelines:google-poi-api-0.2.0-alpha
environment:
- PROXY_ADDRESS=socks5://torproxy:9050
- QUART_DEBUG=False
build:
context: .
dockerfile: ./pipelines/google-poi/dockerfile
volumes:
- ./pipelines/google-poi/resources/categories.json:/opt/app/pipelines/google-poi/resources/categories.json
- ./kuwala/pipelines/google-poi/resources/categories.json:/opt/app/pipelines/google-poi/resources/categories.json
restart: always
depends_on: [torproxy]
ports:
@@ -124,30 +114,26 @@ services:
# docker-compose run google-poi-pipeline --continent=<> --country=<> --country_region=<> --polygon_coords=<> --polygon_resolution=<> --search_string_basis=<>
google-poi-pipeline:
container_name: google-poi-pipeline
image: kuwala/data-pipelines:google-poi-pipeline-0.2.0-alpha
environment:
- GOOGLE_POI_API_HOST=google-poi-api
- SPARK_MEMORY=16g
build:
context: .
dockerfile: ./pipelines/google-poi/src/pipeline/dockerfile
volumes:
- ./tmp/kuwala/google_files:/opt/app/tmp/kuwala/google_files
- ./tmp/kuwala/osm_files:/opt/app/tmp/kuwala/osm_files
- ./kuwala/tmp/kuwala/google_files:/opt/app/tmp/kuwala/google_files
- ./kuwala/tmp/kuwala/osm_files:/opt/app/tmp/kuwala/osm_files
restart: always
profiles:
- network

# docker-compose run google-trends --continent=<> --country=<> --country_region=<> --keyword=<>
google-trends:
container_name: google-trends
image: kuwala/data-pipelines:google-trends-0.2.0-alpha
environment:
- PROXY_ADDRESS=socks5://torproxy:9050
build:
context: .
dockerfile: ./pipelines/google-trends/dockerfile
volumes:
- ./tmp/kuwala/admin_boundary_files:/opt/tmp/kuwala/admin_boundary_files
- ./tmp/kuwala/google_trends_files:/opt/tmp/kuwala/google_trends_files
- ./kuwala/tmp/kuwala/admin_boundary_files:/opt/tmp/kuwala/admin_boundary_files
- ./kuwala/tmp/kuwala/google_trends_files:/opt/tmp/kuwala/google_trends_files
restart: always
depends_on: [torproxy]
profiles:
@@ -156,42 +142,36 @@ services:
# docker-compose run osm-parquetizer java -jar target/osm-parquetizer-1.0.1-SNAPSHOT.jar --continent=<> --country=<> --country_region=<>
osm-parquetizer:
container_name: osm-parquetizer
build:
context: .
dockerfile: ./pipelines/osm-poi/osm-parquetizer/dockerfile
image: kuwala/data-pipelines:osm-parquetizer-0.2.0-alpha
restart: always
volumes:
- ./tmp/kuwala/osm_files:/opt/app/tmp/kuwala/osm_files
- ./kuwala/tmp/kuwala/osm_files:/opt/app/tmp/kuwala/osm_files
profiles:
- network

# docker-compose run osm-poi
osm-poi:
container_name: osm-poi
build:
context: .
dockerfile: ./pipelines/osm-poi/dockerfile
image: kuwala/data-pipelines:osm-poi-0.2.0-alpha
environment:
- SPARK_MEMORY=16g
- PROXY_ADDRESS=socks5://torproxy:9050
restart: always
depends_on: [torproxy]
volumes:
- ./tmp/kuwala/osm_files:/opt/app/tmp/kuwala/osm_files
- ./kuwala/tmp/kuwala/osm_files:/opt/app/tmp/kuwala/osm_files
profiles:
- network


# docker-compose run population-density
population-density:
container_name: population-density
build:
context: .
dockerfile: ./pipelines/population-density/dockerfile
image: kuwala/data-pipelines:population-density-0.2.0-alpha
environment:
- SPARK_MEMORY=16g
restart: always
volumes:
- ./tmp/kuwala/population_files:/opt/app/tmp/kuwala/population_files
- ./kuwala/tmp/kuwala/population_files:/opt/app/tmp/kuwala/population_files
profiles:
- network
43 changes: 23 additions & 20 deletions kuwala/README.md
@@ -9,24 +9,17 @@ Installed version of *Docker* and *docker-compose v2*

### Pipelines

If you want to build all containers for all pipelines, change your working directory to `./kuwala/scripts` (or move to
`./kuwala/scripts/`, run `initialize_windows.sh`, and change directory to `windows/` if you are running a Windows
machine) and run:

```zsh
sh initialize_all_components.sh
```

You can also build the containers individually for single pipelines. All services are listed in the
[`./docker-compose.yml`](https://github.com/kuwala-io/kuwala/tree/master/kuwala/docker-compose.yml). Please refer to
each pipeline's `README.md` on how to run them. You can find the pipeline directories under
Please refer to each pipeline's `README.md` on how to run them. You can find the pipeline directories under
[`./pipelines`](https://github.com/kuwala-io/kuwala/tree/master/kuwala/pipelines).

It would be safer to always run the build commands after pulling new code changes so that your local images
have the latest code.
We currently have five pipelines for different third-party data sources which can easily be imported into a Postgres
database. The following pipelines are integrated:

Docker images will only be built once when you run the `initialize_all_components.sh` script, so new changes in the code
will not reflect on your local unless you explicitly run the build commands.
- [Admin Boundaries](https://github.com/kuwala-io/kuwala/tree/master/kuwala/pipelines/admin-boundaries/README.md)
- [Google POIs](https://github.com/kuwala-io/kuwala/tree/master/kuwala/pipelines/google-poi/README.md)
- [Google Trends](https://github.com/kuwala-io/kuwala/tree/master/kuwala/pipelines/google-trends/README.md)
- [OSM POIs](https://github.com/kuwala-io/kuwala/tree/master/kuwala/pipelines/osm-poi/README.md)
- [Population Density](https://github.com/kuwala-io/kuwala/tree/master/kuwala/pipelines/population-density/README.md)

Please note that the Docker runs will create data folders under `./kuwala/tmp/kuwala` that will be used for db, file
downloads, and processing results. You can always find the downloaded files over there.
@@ -35,20 +28,30 @@ Now you can proceed to any of the pipelines' `README.md` and follow the steps to

### Core

To initialize the CLI and Jupyter notebook run within the `./kuwala/scripts` directory (or `./kuwala/scripts/windows` ):
#### Database

To launch the database in the background, run:

```zsh
sh initialize_core_components.sh
docker-compose --profile database up
```

To launch the CLI run:
#### Database Importer

To import the result of the data pipelines, run the `database-importer`:

```zsh
sh run_cli.sh
docker-compose run database-importer --continent=<> --country=<> --country_region=<> [--population_density_date=<>]
```

#### CLI

To launch the CLI, please refer to the instructions in its [README](https://github.com/kuwala-io/kuwala/tree/master/kuwala/core/cli/README.md).

#### Jupyter

If you only want to start the Jupyter environment run:

```zsh
sh run_jupyter_notebook.sh
docker-compose run --service-ports jupyter
```
13 changes: 13 additions & 0 deletions kuwala/common/docker/python-java/dockerfile
@@ -0,0 +1,13 @@
ARG IMAGE_VARIANT=slim-buster
ARG OPENJDK_VERSION=8
ARG PYTHON_VERSION=3.10.2

FROM python:${PYTHON_VERSION}-${IMAGE_VARIANT} AS py3
FROM openjdk:${OPENJDK_VERSION}-${IMAGE_VARIANT}

COPY --from=py3 / /

RUN apt-get update && \
apt-get install --no-install-recommends build-essential=12.6 -y && \
apt-get clean && \
rm -rf /var/lib/apt/lists/*
4 changes: 2 additions & 2 deletions kuwala/common/python_utils/src/FileSelector.py
@@ -3,15 +3,15 @@
from time import sleep
import urllib.error

from fuzzywuzzy import fuzz
from hdx.api.configuration import Configuration
from hdx.data.dataset import Dataset
from hdx.data.organization import Organization
from hdx.hdx_configuration import Configuration
import pycountry
import pycountry_convert as pcc
from pyquery import PyQuery
import questionary
import requests.exceptions
from thefuzz import fuzz

CONTINENTS = [
{"code": "af", "name": "Africa", "geofabrik": "africa"},
2 changes: 1 addition & 1 deletion kuwala/common/python_utils/src/spark_udfs.py
@@ -1,6 +1,5 @@
import json

from fuzzywuzzy import fuzz
import h3
from pyspark.sql.functions import udf
from pyspark.sql.types import (
@@ -13,6 +12,7 @@
StructType,
)
from shapely.geometry import shape
from thefuzz import fuzz

DEFAULT_RESOLUTION = 11
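
Note on the `fuzzywuzzy` to `thefuzz` import change in `FileSelector.py` and `spark_udfs.py`: the diff swaps the import outright. For environments that may still have only the old package installed, a tolerant import could look roughly like the sketch below. This is not part of the PR; the helper names `_load_fuzz` and `similarity` are hypothetical, and the `difflib` fallback is a stand-in whose scores may differ slightly from those of `thefuzz`.

```python
from difflib import SequenceMatcher


def _load_fuzz():
    """Return the `fuzz` module from thefuzz or fuzzywuzzy, or None if neither is installed."""
    try:
        from thefuzz import fuzz  # package name used after this PR

        return fuzz
    except ImportError:
        pass
    try:
        from fuzzywuzzy import fuzz  # package name used before this PR

        return fuzz
    except ImportError:
        return None


_fuzz = _load_fuzz()


def similarity(a: str, b: str) -> int:
    """Similarity score between 0 and 100 (hypothetical helper, not in the PR)."""
    if _fuzz is not None:
        return _fuzz.ratio(a, b)
    # Standard-library stand-in when neither fuzzy-matching package is available.
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))
```

Both packages expose `fuzz.ratio` with the same call shape, so code that only calls `fuzz.ratio(a, b)` should not need further changes beyond the import line.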
