Project to demonstrate basic data engineering skills.
-
Clone the repository:
git clone git@github.com:pavel-filatov/yelp-challenge.git
Alternatively, download only the bash script run_docker_and_prepare_environment.sh.
-
Download the Yelp dataset.
-
Run the bash script to download the Docker image and prepare the environment:
bash run_docker_and_prepare_environment.sh </yelp/data/directory/path.tar>
This script will:
- run the Docker container in detached mode, publishing port 4040 so that Spark jobs can be inspected from the host,
- copy the Yelp data into the container,
- open an interactive bash session in the container.
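The three steps above could be sketched roughly as follows. This is an illustration only, not the actual contents of run_docker_and_prepare_environment.sh: the function name `prepare_environment`, the `DOCKER` override variable, and the destination directory `/tmp/` inside the container are assumptions made for the example.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the three steps; NOT the real script from the repository.
# DOCKER can be overridden (e.g. DOCKER=echo) to print the commands instead of running them.
DOCKER="${DOCKER:-docker}"
CONTAINER="spark-cassandra"        # container name used in this README
IMAGE="pfilatov/spark-cassandra"   # image name used in this README
DEST_DIR="/tmp"                    # assumption: the real destination path may differ

prepare_environment() {
  local yelp_tar="$1"
  # 1. Start the container in detached mode and publish the Spark UI port.
  $DOCKER run -d --name "$CONTAINER" -p 4040:4040 "$IMAGE" || return 1
  # 2. Copy the Yelp archive from the host into the container.
  $DOCKER cp "$yelp_tar" "$CONTAINER:$DEST_DIR/" || return 1
  # 3. Open an interactive bash session inside the running container.
  $DOCKER exec -it "$CONTAINER" bash
}
```

Calling `prepare_environment /path/to/yelp_dataset.tar` would run the three commands in order and stop early if starting the container or copying the data fails.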
Note that the Docker image used here (pfilatov/spark-cassandra) will be downloaded if it is not already available locally.
-
Inside a container, run:
bash ingest_yelp_data_into_cassandra.sh
What this script does:
- creates the keyspace and tables in Cassandra
- runs Spark application for data ingestion
IMPORTANT: This script may fail several times with the following message:
Connection error: ('Unable to connect to any servers', {'127.0.0.1': error(111, "Tried connecting to [('127.0.0.1', 9042)]. Last error: Connection refused")})
This happens when Cassandra has not finished starting yet. Wait a little and run the script again.
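Instead of re-running the script by hand, the waiting can be automated with a small retry loop. The sketch below is a generic helper, not part of the repository; the function name `retry` and its argument order are assumptions made for this example.

```shell
#!/usr/bin/env bash
# retry MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or MAX_ATTEMPTS attempts have been made.
retry() {
  local max_attempts="$1" delay="$2"
  shift 2
  local attempt=1
  until "$@"; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "Giving up after $attempt attempts: $*" >&2
      return 1
    fi
    echo "Attempt $attempt failed; retrying in ${delay}s..." >&2
    sleep "$delay"
    attempt=$((attempt + 1))
  done
}
```

For example, `retry 10 15 bash ingest_yelp_data_into_cassandra.sh` would try the ingestion script up to ten times, 15 seconds apart (both numbers are arbitrary). The helper retries on any non-zero exit status, not only on the connection error, so a failure with a different cause is repeated as well.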
-
Once the ingestion app has completed, you can explore the data in Cassandra using cqlsh.
-
To exit the container, type:
exit
-
To open a shell in the running container again, use:
docker exec -it spark-cassandra bash
-
To stop the container (without removing the data), use:
docker stop spark-cassandra
-
To start it again, use:
docker start spark-cassandra
-
To remove the container completely (including the data), use:
docker rm -f spark-cassandra
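The lifecycle commands above can be combined into a small helper that picks the right one depending on the container's current state. This is only a sketch under assumptions: the function names `container_state` and `enter_container` and the `DOCKER` override variable are made up for this example and do not exist in the repository.

```shell
#!/usr/bin/env bash
# Hypothetical helper; not part of the repository.
DOCKER="${DOCKER:-docker}"
CONTAINER="spark-cassandra"   # container name used in this README

# Prints the state reported by Docker (e.g. "running" or "exited"),
# or "missing" if the container cannot be inspected.
container_state() {
  $DOCKER container inspect -f '{{.State.Status}}' "$CONTAINER" 2>/dev/null || echo "missing"
}

# Opens a bash session in the container, starting it first if it is stopped.
enter_container() {
  local state
  state="$(container_state)"
  case "$state" in
    running)
      $DOCKER exec -it "$CONTAINER" bash ;;
    exited|created)
      $DOCKER start "$CONTAINER" && $DOCKER exec -it "$CONTAINER" bash ;;
    missing)
      echo "Container $CONTAINER not found; run run_docker_and_prepare_environment.sh first." >&2
      return 1 ;;
    *)
      echo "Container $CONTAINER is in state '$state'; handle it manually." >&2
      return 1 ;;
  esac
}
```

After a `docker start`, Cassandra may again need some time to come up before cqlsh can connect.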