Add dockerfile #83

Merged · 8 commits · Nov 8, 2016
49 changes: 41 additions & 8 deletions CONTRIBUTING.md
@@ -17,6 +17,8 @@ A lot of discussions about ideas take place in the [Issues](https://github.com/datasciencebr/serenata-de-amor/issues)

## Environment

##### Local Installation Environment

The recommended way of setting your environment up is with [Anaconda](https://www.continuum.io/), a Python distribution with useful packages for Data Science. [Download it](https://www.continuum.io/downloads) and create an _environment_ for the project.

@@ -29,14 +31,45 @@
```console
$ ./setup
```

The `activate serenata_de_amor` command must be run every time you enter the project folder to start working.
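
For reference, a typical conda workflow for this step looks like the sketch below (the exact commands live in the collapsed diff lines above; the environment name comes from this guide, the rest is standard conda usage):

```console
$ conda create --name serenata_de_amor python=3
$ source activate serenata_de_amor
$ ./setup
```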

**For Pyenv users:** If you installed Anaconda via [pyenv](https://github.com/yyuu/pyenv), `source activate serenata_de_amor` will probably fail _unless_ you explicitly use the path to the Anaconda `activate` script. For example:

```console
$ source /usr/local/var/pyenv/versions/anaconda3-4.1.1/bin/activate serenata_de_amor
```
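
If you are unsure which version pyenv installed or where it lives, standard pyenv commands can locate the `activate` script (the `anaconda3-4.1.1` version below is just the example from above):

```console
$ pyenv versions
$ ls "$(pyenv root)/versions/anaconda3-4.1.1/bin/activate"
```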

##### Docker Installation Environment

Requirements:

* [Docker](https://docs.docker.com/engine/installation/)
* [Docker-compose](https://docs.docker.com/compose/install/)
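
Once both are installed, a quick version check confirms they are on your `PATH` (standard CLI flags, nothing project-specific):

```console
$ docker --version
$ docker-compose --version
```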

Start the environment (the first run may take a while; the Docker image has 4GB):

```console
$ docker-compose up -d
```
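
The first run builds and downloads the image; standard Compose commands let you follow progress and confirm the service is up:

```console
$ docker-compose logs -f jupyter
$ docker-compose ps
```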

Create your `config.ini` file from the example:

```console
$ cp config.ini.example config.ini
```

Run the script to fetch the Quota for Exercising Parliamentary Activity (CEAP) datasets:

```console
$ docker-compose run --rm jupyter python src/fetch_datasets.py
```
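
Since the repository is mounted into the container (see `docker-compose.yml` below), the fetched datasets land in `data/` on your host and can be inspected directly:

```console
$ ls data/
```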

If you want a console inside the container:

```console
$ docker-compose run --rm jupyter bash
```

Then access the Jupyter Notebook at [localhost:8888](http://localhost:8888).
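
If the page does not load, a quick check that the server is answering on that port (standard tooling, not project-specific):

```console
$ curl -sI http://localhost:8888 | head -n 1
```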

## Best practices

In order to avoid tons of conflicts when trying to merge [Jupyter Notebooks](http://jupyter.org), there are some [guidelines we follow](http://www.svds.com/jupyter-notebook-best-practices-for-data-science/).
@@ -46,7 +79,7 @@

Basically we have four big directories with different purposes:
| Directory | Purpose | File naming |
|-----------|---------|-------------|
| **`develop/`** | This is where we _explore_ data, feel free to create your own notebook for your exploration. | `[ISO 8601 date]-[author-initials]-[2-4 word description].ipynb` (e.g. `2016-05-13-ec-air-tickets.ipynb`) |
|**`report/`** | This is where we write up the findings and results; here is where we put together different data, analyses and strategies to make a point. Feel free to jump in. | Meaningful title for the report (e.g. `Transport-allowances.ipynb`) |
| **`src/`** | This is where our auxiliary scripts lie: code to scrape data, to convert files, etc. | Small caps, no special character, `-` instead of spaces. |
| **`data/`** | This is not supposed to be committed, but it is where saved databases will be stored locally (scripts from `src/` should be able to get this data for you); a copy of this data will be available elsewhere (_just in case_). | Small caps, no special character, `-` instead of spaces. |

@@ -56,13 +89,13 @@

Here we explain what each script from `src/` does for you:

##### One script to rule them all

1. `src/fetch_datasets.py` downloads all the available datasets to `data/` in `.xz` compressed CSV format, with headers translated to English.


##### Quota for Exercising Parliamentary Activity (CEAP)

1. `src/fetch_datasets.py --from-source` downloads all CEAP datasets to `data/` from the official source (in XML format, in Portuguese).
1. `src/fetch_datasets.py` downloads the CEAP datasets into `data/`; it can download them from the official source (in XML format, in Portuguese) or from our backup server (`.xz` compressed CSV format, with headers translated to English). For a combined run of the scripts below, see the sketch after this list.
1. `src/xml2csv.py` converts the original XML datasets to `.xz` compressed CSV format.
1. `src/translate_datasets.py` translates the dataset file names and the labels of the variables within these files.
1. `src/translation_table.py` creates a `data/YYYY-MM-DD-ceap-datasets.md` file with details of the meaning and of the translation of each variable from the _Quota for Exercising Parliamentary Activity_ datasets.
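
Run in sequence, these scripts form a small pipeline from the official XML to translated CSVs. A sketch of a typical local run (assuming no arguments beyond the documented `--from-source` flag are needed):

```console
$ python src/fetch_datasets.py --from-source  # official XML, in Portuguese
$ python src/xml2csv.py                       # convert to .xz compressed CSV
$ python src/translate_datasets.py            # translate names and labels
$ python src/translation_table.py             # document the translation
```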
@@ -99,6 +132,6 @@

The project basically happens in four moments, and contributions are welcomed in all of them.

## Jarbas

As soon as we started _Serenata de Amor_ [we felt the need for a simple webservice](https://github.com/datasciencebr/serenata-de-amor/issues/34) to browse our data and refer to documents we analyze. This is how [Jarbas](https://github.com/datasciencebr/jarbas) was created.

If you fancy web development, feel free to check Jarbas' source code, browse [Jarbas' own Issues](https://github.com/datasciencebr/jarbas/issues) and contribute there too.
4 changes: 2 additions & 2 deletions README.md
@@ -8,9 +8,9 @@

The Serenata de Amor Operation arose from a combination of needs, from many people

We are building an intelligence capable of analyzing public spending and estimating, with reliability, the likelihood of each receipt being unlawful. This information will be used beyond the code, in the world outside of GitHub. Everything is open source from the beginning, allowing others to fork the project when their ideas diverge from those of Operation Serenata de Amor.

Our current milestone is to create the means for this kind of automation with the Quota for Exercising Parliamentary Activity (CEAP), from the Brazilian Chamber of Deputies. This job includes the development of APIs, data cleaning and analyses, conception and validation of scientific hypotheses, and confirmation of illicit acts via investigation and reports - to the population and to legal authorities.

To achieve this unprecedented goal, we invite everyone to train the intelligence, collect information, cross databases, validate hypotheses and apply Machine Learning with models competing against each other and getting combined in ensembles with higher precision than any previous option.

## Before contributing

12 changes: 12 additions & 0 deletions docker-compose.yml
@@ -0,0 +1,12 @@
version: '2'

services:
  jupyter:
    build:
      context: .                    # build from the repository root
      dockerfile: docker/Dockerfile
    ports:
      - 8888:8888                   # expose the Jupyter Notebook server
    volumes:
      - .:/notebook                 # mount the repository into the container
    working_dir: /notebook
20 changes: 20 additions & 0 deletions docker/Dockerfile
@@ -0,0 +1,20 @@
FROM jupyter/datascience-notebook:latest
MAINTAINER Serenata de Amor "datasciencebr@gmail.com"

# System packages must be installed as root
USER root

RUN apt-get update && apt-get install -y \
    unzip

# Drop back to the image's default unprivileged user
USER jovyan

COPY requirements.txt ./
COPY conda_requirements.txt ./

RUN pip install --upgrade pip
RUN pip install -r requirements.txt

RUN conda update --yes conda
RUN conda config --add channels Rufone
RUN conda config --add channels conda-forge
RUN conda install --yes --file conda_requirements.txt
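
After changing `requirements.txt` or `conda_requirements.txt`, the image can be rebuilt through the Compose service defined above (standard Compose usage, not part of this PR):

```console
$ docker-compose build jupyter
$ docker-compose up -d
```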