DSAID Data Engineering Technical Test

This test is split into 5 sections, data pipelines, databases, system design, charts & APIs, machine learning.

Submission Guidelines

Please create a Github repository containing your submission and send us an email containing a link to the repository.

Dos:

Frequent commits
Descriptive commit messages
Clear documentation
Comments in your code

Donts:

Only one commit containing all the files
Submitting a zip file
Sparse or absent documentation
Code which is hard to read

Section 1: Data Pipelines

The objective of this section is to design and implement a solution to process a data file on a regular interval (e.g. daily). Assume that there are 2 data files dataset1.csv and dataset2.csv, design a solution to process these files, along with the scheduling component. The expected output of the processing task is a CSV file including a header containing the field names.

You can use common scheduling solutions such as cron or airflow to implement the scheduling component. You may assume that the data file will be available at 1am everyday. Please provide documentation (a markdown file will help) to explain your solution.

Processing tasks:

Split the name field into first_name, and last_name
Remove any zeros prepended to the price field
Delete any rows which do not have a name
Create a new field named above_100, which is true if the price is strictly greater than 100

Note: please submit the processed dataset too.

Section 2: Databases

You are appointed by a car dealership to create their database infrastructure. There is only one store. In each business day, cars are being sold by a team of salespersons. Each transaction would contain information on the date and time of transaction, customer transacted with, and the car that was sold.

The following are known:

Each car can only be sold by one salesperson.
There are multiple manufacturers’ cars sold.
Each car has the following characteristics:
Manufacturer
Model name
Serial number
Weight
Price

Each sale transaction contains the following information:

Customer Name
Customer Phone
Salesperson
Characteristics of car sold

Set up a PostgreSQL database using the base docker image here given the above. We expect at least a Dockerfile which will stand up your database with the DDL statements to create the necessary tables. Produce entity-relationship diagrams as necessary to illustrate your design.

Your team also needs you to query some information from the database that you have designed. You are tasked to write a sql statement for each of the following task:

I want to know the list of our customers and their spending.
I want to find out the top 3 car manufacturers that customers bought by sales (quantity) and the sales number for it in the current month.

Section 3: System Design

You are designing data infrastructure on the cloud for a company whose main business is in processing images.

The company has a web application which collects images uploaded by customers. The company also has a separate web application which provides a stream of images using a Kafka stream. The company’s software engineers have already some code written to process the images. The company would like to save processed images for a minimum of 7 days for archival purposes. Ideally, the company would also want to be able to have some Business Intelligence (BI) on key statistics including number and type of images processed, and by which customers.

Produce a system architecture diagram (e.g. Visio, Powerpoint) using any of the commercial cloud providers' ecosystem to explain your design. Please also indicate clearly if you have made any assumptions at any point.

Section 4: Charts and APIs

Your team decided to design a dashboard to display the statistic of COVID19 cases. You are tasked to display one of the components of the dashboard which is to display a visualisation representation of number of COVID19 cases in Singapore over time.

Your team decided to use the public data from https://documenter.getpostman.com/view/10808728/SzS8rjbc#b07f97ba-24f4-4ebe-ad71-97fa35f3b683.

Display a graph to show the number cases in Singapore over time using the APIs from https://covid19api.com/.

Section 5: Machine Learning

Using the dataset from https://archive.ics.uci.edu/ml/datasets/Car+Evaluation, create a machine learning model to predict the buying price given the following parameters:

Maintenance = High
Number of doors = 4
Lug Boot Size = Big
Safety = High
Class Value = Good

Name		Name	Last commit message	Last commit date
Latest commit History 27 Commits
section1		section1
section2		section2
section3		section3
section4		section4
section5		section5
.gitignore		.gitignore
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

section1

section1

section2

section2

section3

section3

section4

section4

section5

section5

.gitignore

.gitignore

README.md

README.md

Repository files navigation

DSAID Data Engineering Technical Test

Submission Guidelines

Section 1: Data Pipelines

Section 2: Databases

Section 3: System Design

Section 4: Charts and APIs

Section 5: Machine Learning

About

Releases

Packages

Languages

nakamura41/dataeng_test

Folders and files

Latest commit

History

Repository files navigation

DSAID Data Engineering Technical Test

Submission Guidelines

Section 1: Data Pipelines

Section 2: Databases

Section 3: System Design

Section 4: Charts and APIs

Section 5: Machine Learning

About

Resources

Stars

Watchers

Forks

Languages