# Knesset data pipelines

Data processing pipelines for loading, processing and visualizing data about the Knesset

Uses the datapackage pipelines and DataFlows frameworks.

## Quickstart for data science

Follow these steps to get started quickly with exploring, processing, and testing the Knesset data.

### Running using Docker

Docker is used to run the notebooks so that everyone works in a consistent environment.

Install Docker for Windows, Mac or Linux

Pull the latest Docker image:

```shell
docker pull orihoch/knesset-data-pipelines
```

Run Jupyter Lab

Create a directory which will be shared between the host PC and the container:

```shell
sudo mkdir -p /opt/knesset-data-pipelines
```

Start the Jupyter lab server:

```shell
docker run -it -p 8888:8888 --entrypoint jupyter \
           -v /opt/knesset-data-pipelines:/pipelines \
           orihoch/knesset-data-pipelines lab --allow-root --ip 0.0.0.0 --no-browser \
                --NotebookApp.token= --NotebookApp.custom_display_url=http://localhost:8888/
```

Access the server at http://localhost:8888/

Open a terminal inside the Jupyter Lab web UI and clone the knesset-data-pipelines project:

```shell
git clone https://github.com/hasadna/knesset-data-pipelines.git .
```

You should now see the project files on the left sidebar.

Access the jupyter-notebooks directory and open one of the available notebooks.

You can now add notebooks or modify existing ones, then open a pull request with your changes.

You can also modify the pipelines code from the host machine and it will be reflected in the notebook environment.
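Each pipeline writes its output as a Frictionless datapackage (a `datapackage.json` descriptor plus data files) under the shared data directory. The following is a minimal, self-contained sketch of how you might inspect such a descriptor from a notebook using only the standard library; the sample descriptor, its resource names, and the `/pipelines/data/...` location mentioned in the comments are illustrative assumptions, not actual pipeline output.

```python
import json
import tempfile
from pathlib import Path

# Build a tiny sample datapackage.json so this sketch is self-contained;
# real descriptors are written by the pipelines under the mounted data
# directory (e.g. /pipelines/data/<pipeline>/datapackage.json -- assumed path).
sample = {
    "name": "example-committees",
    "resources": [{"name": "kns_committee", "path": "kns_committee.csv"}],
}
datadir = Path(tempfile.mkdtemp())
(datadir / "datapackage.json").write_text(json.dumps(sample))

# In a notebook you would point this at the real data directory instead.
descriptor = json.loads((datadir / "datapackage.json").read_text())
resource_names = [r["name"] for r in descriptor["resources"]]
print(resource_names)  # ['kns_committee']
```

The same descriptor can also be loaded with the DataFlows / datapackage libraries installed in the image, which additionally parse the data files themselves.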

### Running from a local copy of knesset-data-pipelines

From your local PC, clone the repository into `./knesset-data-pipelines`:

```shell
git clone https://github.com/hasadna/knesset-data-pipelines.git
```

Change directory:

```shell
cd knesset-data-pipelines
```

Run with Docker, mounting the local directory:

```shell
docker run -it -p 8888:8888 --entrypoint jupyter \
           -v `pwd`:/pipelines \
           orihoch/knesset-data-pipelines lab --allow-root --ip 0.0.0.0 --no-browser \
                --NotebookApp.token= --NotebookApp.custom_display_url=http://localhost:8888/
```

When running with this setup you might hit permission problems; fix them by taking ownership of the directory:

```shell
sudo chown -R $USER .
```

### Running locally without Docker

The following instructions were tested on Ubuntu 18.04.

Install system dependencies:

```shell
sudo apt-get install python3.6 python3.6-dev build-essential libxml2-dev libxslt1-dev libleveldb1v5 libleveldb-dev \
                     python3-pip bash jq git openssl antiword python3-venv
```

Install Python dependencies:

```shell
python3.6 -m venv env
source env/bin/activate
pip install 'https://github.com/OriHoch/datapackage-pipelines/archive/1.7.1-oh-2.zip#egg=datapackage-pipelines[speedup]'
pip install wheel
pip install psycopg2-binary knesset-data requests[socks] botocore boto3 python-dotenv google-cloud-storage sh
pip install datapackage-pipelines-metrics psutil crcmod jsonpickle tika kvfile pyquery dataflows==0.0.14 pymongo \
            tabulate jupyter jupyterlab
pip install -e .
```

Start the environment (these steps are required each time before running pipelines):

```shell
source env/bin/activate
export KNESSET_PIPELINES_DATA_PATH=`pwd`/data
```

Now you can run pipelines with `dpp` or start the notebook server with `jupyter lab`.
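The `KNESSET_PIPELINES_DATA_PATH` variable exported above tells the pipeline code where to read and write data. As a hedged illustration of how code might consume it, here is a small stdlib-only sketch; the helper name `get_data_path` and the `"data"` fallback are assumptions for this example, only the environment variable name comes from the instructions above.

```python
import os

# Hypothetical helper showing how pipeline code can resolve the data directory;
# the function name and default are assumptions -- only the
# KNESSET_PIPELINES_DATA_PATH variable name comes from this README.
def get_data_path(default="data"):
    return os.environ.get("KNESSET_PIPELINES_DATA_PATH", default)

os.environ["KNESSET_PIPELINES_DATA_PATH"] = "/tmp/knesset-data-pipelines/data"
print(get_data_path())  # /tmp/knesset-data-pipelines/data
```

If the variable is unset, such a lookup would fall back to a relative `data` directory under the current working directory.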

## Contributing

Looking to contribute? Check out the Help Wanted issues or the Noob Friendly issues for some ideas.

Useful resources for getting acquainted:

- DPP documentation
- Code for the periodic execution component
- Info on available data from the Knesset site
- Living document with a short list of ongoing project activities