redkite

Kaio Motawara - Redkite data engineer assignment

  • Instead of doing a code review, I've taken the liberty of refactoring and restructuring the project to reflect the approach I would have taken. The changes to, and bug fixes in, the provided script effectively constitute the review.
  • To run the pipeline, please create the environment specified in the `environment.yml` file and run `main.py`, which orchestrates the pipeline.
  • There are separate folders for schemas and for processing functions. I've used one `.py` file per schema to store the three schemas, and a similar layout for the processing functions. While this might be overkill for this problem, it's meant to reflect a real-world situation where I might have to deal with a large number of complex schemas and processing functions (a sketch of one such schema module follows this list).
  • While I have created the prescribed schemas for the parquet files, loading the files with those schemas raises an error that I haven't been able to resolve. So I've loaded them without enforcing the schemas, just so that I could continue with the rest of the assignment (see the read sketch after this list).
  • The final step of the provided pipeline calls `.coalesce()` and writes to a Databricks `.csv`. Since the number of partitions is already 1, and since I don't have access to Databricks, I would have replaced the step with a plain save to `.csv`. However, when I tried to write the dataframe to CSV, Spark threw a Java error that I couldn't resolve. Therefore, in order to complete the assignment, I converted the Spark dataframe to a pandas dataframe and saved it as a tab-separated file, as required (see the write sketch after this list).
  • I've created one unittest in the `tests` folder just to demonstrate how tests might be written (a minimal sketch follows this list). To run the test, please run `python -m unittest discover tests` from the root directory of the project (i.e. the same directory that contains `main.py`).
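
For illustration, a minimal sketch of what one of the schema modules might look like. The file name `schemas/sales_schema.py` and the field names are hypothetical; the actual schemas live in the `schemas` folder:

```python
# schemas/sales_schema.py (hypothetical file name) -- each schema lives in
# its own module so that a large collection stays easy to navigate.
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, DateType

SALES_SCHEMA = StructType([
    StructField("product_id", StringType(), nullable=False),
    StructField("sale_date", DateType(), nullable=True),
    StructField("amount", DoubleType(), nullable=True),
])
```

`main.py` (or a processing function) would then import it with `from schemas.sales_schema import SALES_SCHEMA`.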
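A sketch of the parquet-loading workaround described above, assuming an illustrative path and the hypothetical schema module from the previous sketch:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("redkite").getOrCreate()

# Intended read, enforcing the prescribed schema -- this raised an error
# I couldn't resolve:
#   df = spark.read.schema(SALES_SCHEMA).parquet("data/sales.parquet")

# Fallback actually used: let Spark take the schema from the parquet metadata.
df = spark.read.parquet("data/sales.parquet")
```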
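A sketch of the replacement final step, with an illustrative output path:

```python
# Original final step: a .coalesce() that is a no-op here (the dataframe
# already has a single partition) plus a Databricks CSV write.
# Writing directly with df.write.csv(...) threw an unresolved Java error,
# so the dataframe is converted to pandas and written tab-separated instead.
df.toPandas().to_csv("output/result.csv", sep="\t", index=False)
```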
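And a minimal sketch of how a test in the `tests` folder might be structured (the file name and test body are hypothetical; a real test would exercise one of the processing functions):

```python
# tests/test_processing.py (hypothetical file name)
import unittest

from pyspark.sql import SparkSession


class TestProcessing(unittest.TestCase):
    @classmethod
    def setUpClass(cls):
        # A small local Spark session is enough for unit tests.
        cls.spark = SparkSession.builder.master("local[1]").appName("tests").getOrCreate()

    @classmethod
    def tearDownClass(cls):
        cls.spark.stop()

    def test_row_count(self):
        # A real test would call one of the processing functions and assert
        # on its output; this placeholder just checks dataframe construction.
        df = self.spark.createDataFrame([("a", 1.0), ("b", 2.0)], ["product_id", "amount"])
        self.assertEqual(df.count(), 2)


if __name__ == "__main__":
    unittest.main()
```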
