Homework for Week 5: Streaming Pipelines
Submissions are processed automatically through GitHub Classroom, so we cannot accommodate individual requests for changes. Please read all instructions carefully.
- Sync fork: Ensure your fork is up-to-date with the upstream repository. You can do this through the GitHub UI or by running the following commands in your local forked repository:
  - `git pull upstream main --rebase`
  - `git push origin <main/your_homework_repo> -f`
- Open a PR in the upstream repository to submit your work:
  - Open a Pull Request (PR) to merge changes from your `main` or custom homework branch in your forked repository into the `main` branch in the upstream repository, i.e., the repository your fork was created from.
  - Ensure your PR is opened correctly to trigger the GitHub workflow. Submitting to the wrong branch or repository will cause the GitHub action to fail.
  - See the Steps to open a PR section below if you need help with this.
- Complete assignment prompts: Write your query files corresponding to the prompt number in the `submission` folder. Do not change or rename these files!
  - Lint your SQL code for readability. Ensure your code is clean and easy to follow.
  - Add comments to your queries. Use the `--` syntax to explain each step and help the reviewer understand your thought process.
A link to your PR will be shared with our TA team after the homework deadline.
Steps to open a PR:
- Go to the upstream `streaming-pipelines` repository.
- Click the Pull Requests tab.
- Click the "New pull request" button on the top-right. This will take you to the "Compare changes" page.
- Click the "compare across forks" link in the text.
- Leave the base repository as is. For the "head repository", select your forked repository and then the name of the branch you want to compare from your forked repo.
- Click the "Create pull request" button to open the PR
- Grades are pass or fail, used solely for certification.
- Final grades will be submitted by a TA after the deadline.
- An approved PR means a Pass grade. If changes are requested, the grade will be marked as Fail.
- Reviewers may provide comments or suggestions with requested changes. These are optional and intended for your benefit. Any changes you make in response will not be re-reviewed.
Your task is to write a streaming pipeline to understand DataExpert.io user behavior using sessionization.
Use the `glue_job_runner.py` connectors and examples in the airflow-dbt-project repo to get started. Copy and paste the DDL into `submission/session_ddl.sql` and copy the job into `submission/session_job.py`!
- Write a DDL query (`session_ddl.sql`) that creates a table that tracks Data Expert sessions. Make sure it includes (a sketch follows this list):
  - a unique identifier for each session
  - the start and end of the session
  - how many events happened in that session
  - the date of the beginning of the session
  - the city, country, and state associated with the session
  - the operating system and browser associated with the session
  - whether the session is for logged in or logged out users
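For reference, here is a minimal sketch of what such a DDL could look like, written as a `spark.sql` call so the example stays in the same language as the job file. The table name (`user_sessions`), the column names, `USING iceberg`, and the partitioning choice are assumptions for illustration, not the required schema.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("session_ddl_sketch").getOrCreate()

# Hypothetical table and column names -- adapt them to the course catalog and
# to whatever you actually compute in session_job.py.
spark.sql("""
CREATE TABLE IF NOT EXISTS user_sessions (
    session_id        STRING,     -- unique identifier for the session
    user_id           STRING,     -- null for logged out users
    ip                STRING,
    session_start     TIMESTAMP,  -- first event in the session
    session_end       TIMESTAMP,  -- last event in the session
    event_count       BIGINT,     -- how many events happened in the session
    session_date      DATE,       -- date of the beginning of the session
    city              STRING,
    country           STRING,
    state             STRING,
    operating_system  STRING,
    browser           STRING,
    is_logged_in      BOOLEAN     -- true when user_id is not null
)
USING iceberg
PARTITIONED BY (session_date)
""")
```

The `--` comments double as an example of the commenting requirement above; whether you keep Iceberg and the `session_date` partition depends on the catalog your environment uses.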
- Write a Spark Streaming Job (`session_job.py`) that reads from Kafka (see the sketch after this list):
  - It groups the Kafka events with a `session_window` with a 5-minute gap (sessions end after 5 minutes of inactivity).
    - It is also grouped by `user_id` and `ip` to track each user's behavior.
  - Make sure to increase the `timeout` of your job to `1 hour` so you can capture real sessions.
  - Create a unique identifier for each session called `session_id` (you should think about using `hash` and the necessary columns to uniquely identify the session).
  - If `user_id` is not null, that means the event is from a logged in user.
  - If a user logs in, that creates a new session (grouping by `ip` and `user_id` solves this problem pretty elegantly).
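As a starting point, here is a minimal sketch of the overall shape such a job could take. The bootstrap servers, topic name, checkpoint path, sink, and event schema (field names like `event_time`, `operating_system`, `browser`) are placeholders assumed for illustration; the real connection details come from `glue_job_runner.py` and the course examples, and in the actual Glue job the 1 hour timeout is normally set on the job itself rather than in code.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StringType, StructField, StructType, TimestampType

spark = SparkSession.builder.appName("session_job_sketch").getOrCreate()

# Assumed shape of the Kafka payload -- replace with the real event schema.
event_schema = StructType([
    StructField("user_id", StringType()),
    StructField("ip", StringType()),
    StructField("event_time", TimestampType()),
    StructField("city", StringType()),
    StructField("country", StringType()),
    StructField("state", StringType()),
    StructField("operating_system", StringType()),
    StructField("browser", StringType()),
])

# Read raw events from Kafka (placeholder servers/topic).
raw = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "<bootstrap-servers>")
    .option("subscribe", "<events-topic>")
    .option("startingOffsets", "latest")
    .load()
)

events = raw.select(
    F.from_json(F.col("value").cast("string"), event_schema).alias("e")
).select("e.*")

# Sessionize: 5-minute-gap session windows, also grouped by user_id and ip so
# that logging in (user_id changing from null to a value) starts a new session.
sessions = (
    events.withWatermark("event_time", "10 minutes")
    .groupBy(
        F.session_window(F.col("event_time"), "5 minutes"),
        F.col("user_id"),
        F.col("ip"),
    )
    .agg(
        F.count(F.lit(1)).alias("event_count"),      # how many events in the session
        F.min("event_time").alias("session_start"),
        F.max("event_time").alias("session_end"),
        F.first("city").alias("city"),
        F.first("country").alias("country"),
        F.first("state").alias("state"),
        F.first("operating_system").alias("operating_system"),
        F.first("browser").alias("browser"),
    )
    # A hash of the grouping keys plus the window start uniquely identifies a session.
    .withColumn(
        "session_id",
        F.hash(F.col("user_id"), F.col("ip"), F.col("session_window.start")).cast("string"),
    )
    .withColumn("session_date", F.to_date(F.col("session_window.start")))
    .withColumn("is_logged_in", F.col("user_id").isNotNull())
    .drop("session_window")
)

query = (
    sessions.writeStream
    .outputMode("append")
    .format("console")  # replace with the real sink, e.g. the table from session_ddl.sql
    .option("checkpointLocation", "<checkpoint-path>")
    .start()
)

# Keep the job alive for up to 1 hour (argument is in seconds); in the actual
# Glue job, configure the 1 hour timeout through the job runner instead.
query.awaitTermination(timeout=60 * 60)
```

Grouping on `user_id` and `ip` alongside the session window means a login (the `user_id` going from null to a value) naturally starts a new session, and hashing those keys together with the window start gives a stable `session_id`.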