Elasticsearch loader and playground for the SIRENE dataset
This project is meant to be run with Docker so it requires:
- A Unix-compatible environment (Linux, MacOSX)
- Docker and Docker Compose
You can also run it natively, in which case the requirements are:
- Elasticsearch 5.0
- Python 3.5
In both cases, Elasticsearch 5.0 requires you to raise the vm.max_map_count kernel setting:
# As root
sysctl -w vm.max_map_count=262144
# As a user with sudo permissions
sudo sysctl -w vm.max_map_count=262144
More details are available in the Elasticsearch Virtual Memory documentation section and the official Docker details.
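Before starting Elasticsearch, it can help to confirm that the setting is actually in effect. The following is a minimal sketch, assuming a Linux host that exposes the value at /proc/sys/vm/max_map_count; the function names are illustrative and are not part of splashes.

```python
from pathlib import Path

# Minimum value taken from the sysctl command above.
REQUIRED_MAX_MAP_COUNT = 262144


def is_sufficient(raw_value, required=REQUIRED_MAX_MAP_COUNT):
    """Return True if the raw sysctl value meets the required minimum."""
    return int(raw_value.strip()) >= required


def check_host(path="/proc/sys/vm/max_map_count"):
    """Read the current value from procfs (Linux only) and check it.

    The procfs path is an assumption about the host; it does not exist
    on non-Linux systems.
    """
    return is_sufficient(Path(path).read_text())
```

Calling check_host() on the Docker host (not inside a container) should return True once the sysctl command has been applied.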
For a quick start, splashes provides a dockerized playground, which we will use to get ready.
You can use it in three ways:
- Fully native: native splashes on Python 3.5 with native Elasticsearch 5.0
- Hybrid: native splashes on Python 3.5 with dockerized Elasticsearch/Kibana
- Fully dockerized: fully dockerized environment
For the fully native method, install Elasticsearch using your favorite package manager or as described in the official documentation.
Then, install the ICU Analysis Plugin using the Elasticsearch plugin manager:
$ELASTIC_HOME/bin/elasticsearch-plugin install analysis-icu
(where $ELASTIC_HOME is the Elasticsearch installation directory)
Restart Elasticsearch, then install the Python executable with:
pip install -e .
splashes --help
In the hybrid configuration, you will use the provided Elasticsearch/Kibana Docker stack with Docker Compose.
A ./dc executable helper is provided to drive docker-compose.
Persistent data is stored in the elasticsearch/data directory.
Start the Elasticsearch stack with:
./dc up
Then go grab a coffee, because it can take some time on the first launch.
This command uses your current terminal, so if you want to launch everything in the background, run this command instead:
./dc up -d
You can then access:
- Elasticsearch on http://localhost:9200
- Kibana on http://localhost:5601
Then install the splashes application:
pip install -e .
splashes --help
Note: You can override the docker-compose configuration with a docker-compose.override.yml file.
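Once the stack is up, you may want to confirm that Elasticsearch answers on the URL above before loading any data. This is a minimal sketch using only the standard library; it relies on the JSON document that Elasticsearch serves at its root URL, which contains a version.number field. The function names are illustrative and are not part of splashes.

```python
import json
from urllib.request import urlopen


def parse_version(body):
    """Extract the version string from the JSON served at the Elasticsearch root URL."""
    return json.loads(body)["version"]["number"]


def fetch_version(url="http://localhost:9200", timeout=5):
    """Query a running Elasticsearch node and return its version string.

    Raises urllib.error.URLError if the node is not reachable yet.
    """
    with urlopen(url, timeout=timeout) as response:
        return parse_version(response.read().decode("utf-8"))
```

If fetch_version() raises a connection error right after ./dc up, the node is probably still starting; retrying after a few seconds is usually enough.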
The fully dockerized method uses the provided Elasticsearch/Kibana Docker stack from the hybrid method, plus a dockerized splashes application.
You will use the ./dc-splashes helper to drive both docker-compose and splashes.
You can download and/or build the Docker images and get the services up and ready with:
./dc-splashes up
This command uses your current terminal, so if you want to launch everything in the background, run this command instead:
./dc-splashes up -d
You can then access:
- Elasticsearch on http://localhost:9200
- Kibana on http://localhost:5601
and you can use splashes with:
./dc-splashes --help
Note: You can override the docker-compose configuration with a docker-compose.override.yml file.
You can list all available commands using:
splashes --help
You can get help on each command using:
splashes CMD --help
You can pass common options before your command:
splashes --es http://elastic.somewhere.com --index splashes -v CMD
Options are:
- -es/--elasticsearch: The Elasticsearch URL, defaults to http://localhost:9200
- -i/--index: The Elasticsearch index, defaults to sirene
- -v/--verbose: More verbose output
You can also use environment variables:
SPLASHES_ELASTICSEARCH
SPLASHES_INDEX
SPLASHES_VERBOSE
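To illustrate how these variables relate to the options above, here is a minimal sketch of environment-based configuration with the documented defaults. It is not the actual splashes implementation: the resolve_settings function is hypothetical, and the set of values treated as "true" for SPLASHES_VERBOSE is an assumption.

```python
import os

# Defaults documented in the options list above.
DEFAULTS = {
    "elasticsearch": "http://localhost:9200",
    "index": "sirene",
}

# Assumption: which strings switch verbose mode on is not documented.
TRUTHY = ("1", "true", "yes")


def resolve_settings(environ=None):
    """Build a settings dict from SPLASHES_* variables, falling back to defaults."""
    environ = os.environ if environ is None else environ
    return {
        "elasticsearch": environ.get("SPLASHES_ELASTICSEARCH", DEFAULTS["elasticsearch"]),
        "index": environ.get("SPLASHES_INDEX", DEFAULTS["index"]),
        "verbose": environ.get("SPLASHES_VERBOSE", "").lower() in TRUTHY,
    }
```

With no variables set, resolve_settings({}) yields the same values as running splashes without any option.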
You can load stock data with:
splashes load my-data.csv
and daily updates with:
splashes update daily/updates/directory
# or
splashes update daily/updates/directory/file.csv
Both commands accept two optional parameters:
- -l/--lines to limit the amount of data loaded to X lines
- -p/--progress to display a progress indication every X lines
Note: the fully dockerized method requires the dataset to be present in the current directory (or any child directory), or the dataset directory to be added as a volume.
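The effect of the two parameters can be pictured with a short sketch: read at most a given number of CSV rows and report progress at a fixed interval. This is only an illustration, not the splashes loader itself. The iter_rows function is hypothetical, the semicolon delimiter is an assumption about the input file, and whether the real command counts the header row is not specified here (this sketch counts every row).

```python
import csv
from itertools import islice


def iter_rows(stream, lines=None, progress=None, report=print, delimiter=";"):
    """Yield parsed CSV rows, stopping after `lines` rows if a limit is given.

    `progress` is the reporting interval: every time that many rows have
    been read, `report` is called with a short message.
    """
    reader = csv.reader(stream, delimiter=delimiter)
    # islice(reader, None) iterates over everything, so no limit means all rows.
    for count, row in enumerate(islice(reader, lines), start=1):
        if progress and count % progress == 0:
            report("{} lines processed".format(count))
        yield row
```

For example, list(iter_rows(open("my-data.csv", newline=""), lines=100000, progress=1000)) mirrors the -l 100000 -p 1000 options.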
The load command can also load geo-sirene data with the --geo parameter:
splashes load path/to/geo-sirene/data --geo -l 100000 -p 1000
The shell command requires IPython:
pip install ipython
splashes --es http://my.elasticsearch:9200 shell
You will land in an IPython interactive shell with the following objects available:
- es: an instantiated Elasticsearch connection
- Company: the Elasticsearch document model class
- config: the global splashes configuration
# List PME names
companies = es.search_companies().filter(legal='SARL', category='PME').execute()
for company in companies:
    print(company.name)
The search_companies() method returns a Search object.
See the Elasticsearch DSL documentation for details.
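For readers who want to query the index without the shell, the sketch below builds a raw Elasticsearch query body with exact-match filters and posts it to the _search endpoint. It is one plausible raw equivalent, not necessarily what search_companies().filter(...) generates: the use of term clauses and the field names legal and category (borrowed from the example above) are assumptions about the index mapping, and both functions are hypothetical.

```python
import json
from urllib.request import Request, urlopen


def build_filter_query(**terms):
    """Build a bool query whose filter clauses require an exact match on each field."""
    clauses = [{"term": {field: value}} for field, value in sorted(terms.items())]
    return {"query": {"bool": {"filter": clauses}}}


def search(body, es_url="http://localhost:9200", index="sirene"):
    """POST a query body to the index's _search endpoint and return the parsed JSON."""
    request = Request(
        "{}/{}/_search".format(es_url, index),
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

A call such as search(build_filter_query(legal="SARL", category="PME")) returns the raw response, whose matching documents are listed under hits.hits.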
You can install extra Elasticsearch plugins with:
./dc run elasticsearch elasticsearch-plugin install my-plugin
Don't forget to restart Elasticsearch afterwards, either by pressing CTRL-C and starting the stack again if docker-compose is running in the foreground, or with docker-compose stop && docker-compose up -d when it is running in the background.
You can install extra Kibana plugins with:
./dc run kibana kibana-plugin install my-plugin