Homework for Week 5: Streaming Pipelines
Submissions are processed automatically through GitHub Classroom, so we cannot accommodate individual requests for changes. Please read all instructions carefully.
- Sync fork: Ensure your fork is up-to-date with the upstream repository. You can do this through the GitHub UI or by running the following commands in your local forked repository:
  - `git pull upstream main --rebase`
  - `git push origin <main/your_homework_repo> -f`
- Open a PR in the upstream repository to submit your work:
  - Open a Pull Request (PR) to merge changes from your `main` or custom homework branch in your forked repository into the `main` branch in the upstream repository, i.e., the repository your fork was created from.
  - Ensure your PR is opened correctly to trigger the GitHub workflow. Submitting to the wrong branch or repository will cause the GitHub action to fail.
  - See the Steps to open a PR section below if you need help with this.
- Complete assignment prompts: Write your query files corresponding to the prompt number in the `submission` folder. Do not change or rename these files!
  - Lint your SQL code for readability. Ensure your code is clean and easy to follow.
  - Add comments to your queries. Use the `--` syntax to explain each step and help the reviewer understand your thought process.
A link to your PR will be shared with our TA team after the homework deadline.
Steps to open a PR:
- Go to the upstream `streaming-pipelines` repository.
- Click the Pull Requests tab.
- Click the "New pull request" button on the top-right. This will take you to the "Compare changes" page.
- Click the "compare across forks" link in the text.
- Leave the base repository as is. For the "head repository", select your forked repository and then the name of the branch you want to compare from your forked repo.
- Click the "Create pull request" button to open the PR
- Grades are pass or fail, used solely for certification.
- Final grades will be submitted by a TA after the deadline.
- An approved PR means a Pass grade. If changes are requested, the grade will be marked as Fail.
- Reviewers may provide comments or suggestions with requested changes. These are optional and intended for your benefit. Any changes you make in response will not be re-reviewed.
Your task is to write a streaming pipeline to understand DataExpert.io user behavior using sessionization.
Use the `glue_job_runner.py` connectors and examples in the airflow-dbt-project repo to get started. Copy and paste the DDL into `submission/session_ddl.sql` and copy the job into `submission/session_job.py`!
- Write a DDL query (`session_ddl.sql`) that creates a table that tracks Data Expert sessions. Make sure it includes (a sketch follows this list):
  - a unique identifier for each session
  - the start and end of the session
  - how many events happened in that session
  - the date of the beginning of the session
  - the city, country, and state associated with the session
  - the operating system and browser associated with the session
  - whether the session is for logged in or logged out users
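For reference, here is a minimal sketch of what such a DDL could look like, written as a `spark.sql` call so the example stays in the same language as the job file. The table name (`user_sessions`), the column names, `USING iceberg`, and the partitioning choice are assumptions for illustration, not the required schema.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("session_ddl_sketch").getOrCreate()

# Hypothetical table and column names -- adapt them to the course catalog and
# to whatever you actually compute in session_job.py.
spark.sql("""
CREATE TABLE IF NOT EXISTS user_sessions (
    session_id        STRING,     -- unique identifier for the session
    user_id           STRING,     -- null for logged out users
    ip                STRING,
    session_start     TIMESTAMP,  -- first event in the session
    session_end       TIMESTAMP,  -- last event in the session
    event_count       BIGINT,     -- how many events happened in the session
    session_date      DATE,       -- date of the beginning of the session
    city              STRING,
    country           STRING,
    state             STRING,
    operating_system  STRING,
    browser           STRING,
    is_logged_in      BOOLEAN     -- true when user_id is not null
)
USING iceberg
PARTITIONED BY (session_date)
""")
```

The `--` comments double as an example of the commenting requirement above; whether you keep Iceberg and the `session_date` partition depends on the catalog your environment uses.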
- Write a Spark Streaming Job (`session_job.py`) that reads from Kafka (see the sketch after this list):
  - It groups the Kafka events with a `session_window` with a 5-minute gap (sessions end after 5 minutes of inactivity).
    - It is also grouped by `user_id` and `ip` to track each user's behavior.
  - Make sure to increase the `timeout` of your job to `1 hour` so you can capture real sessions.
  - Create a unique identifier for each session called `session_id` (you should think about using `hash` and the necessary columns to uniquely identify the session).
  - If `user_id` is not null, that means the event is from a logged in user.
  - If a user logs in, that creates a new session (grouping by `ip` and `user_id` solves this problem pretty elegantly).
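As a starting point, here is a minimal sketch of the overall shape such a job could take. The bootstrap servers, topic name, checkpoint path, sink, and event schema (field names like `event_time`, `operating_system`, `browser`) are placeholders assumed for illustration; the real connection details come from `glue_job_runner.py` and the course examples, and in the actual Glue job the 1 hour timeout is normally set on the job itself rather than in code.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StringType, StructField, StructType, TimestampType

spark = SparkSession.builder.appName("session_job_sketch").getOrCreate()

# Assumed shape of the Kafka payload -- replace with the real event schema.
event_schema = StructType([
    StructField("user_id", StringType()),
    StructField("ip", StringType()),
    StructField("event_time", TimestampType()),
    StructField("city", StringType()),
    StructField("country", StringType()),
    StructField("state", StringType()),
    StructField("operating_system", StringType()),
    StructField("browser", StringType()),
])

# Read raw events from Kafka (placeholder servers/topic).
raw = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "<bootstrap-servers>")
    .option("subscribe", "<events-topic>")
    .option("startingOffsets", "latest")
    .load()
)

events = raw.select(
    F.from_json(F.col("value").cast("string"), event_schema).alias("e")
).select("e.*")

# Sessionize: 5-minute-gap session windows, also grouped by user_id and ip so
# that logging in (user_id changing from null to a value) starts a new session.
sessions = (
    events.withWatermark("event_time", "10 minutes")
    .groupBy(
        F.session_window(F.col("event_time"), "5 minutes"),
        F.col("user_id"),
        F.col("ip"),
    )
    .agg(
        F.count(F.lit(1)).alias("event_count"),      # how many events in the session
        F.min("event_time").alias("session_start"),
        F.max("event_time").alias("session_end"),
        F.first("city").alias("city"),
        F.first("country").alias("country"),
        F.first("state").alias("state"),
        F.first("operating_system").alias("operating_system"),
        F.first("browser").alias("browser"),
    )
    # A hash of the grouping keys plus the window start uniquely identifies a session.
    .withColumn(
        "session_id",
        F.hash(F.col("user_id"), F.col("ip"), F.col("session_window.start")).cast("string"),
    )
    .withColumn("session_date", F.to_date(F.col("session_window.start")))
    .withColumn("is_logged_in", F.col("user_id").isNotNull())
    .drop("session_window")
)

query = (
    sessions.writeStream
    .outputMode("append")
    .format("console")  # replace with the real sink, e.g. the table from session_ddl.sql
    .option("checkpointLocation", "<checkpoint-path>")
    .start()
)

# Keep the job alive for up to 1 hour (argument is in seconds); in the actual
# Glue job, configure the 1 hour timeout through the job runner instead.
query.awaitTermination(timeout=60 * 60)
```

Grouping on `user_id` and `ip` alongside the session window means a login (the `user_id` going from null to a value) naturally starts a new session, and hashing those keys together with the window start gives a stable `session_id`.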