# Data Engineer Take-Home Exercise

## Overview
Your task is to design an end-to-end data pipeline. Assume that new data may arrive at most once a day as a ZIP file in a publicly accessible S3 bucket. Files appear on random days of the month, and the pipeline must download each one as soon as it becomes available. Whether a file is available can be determined from the data provider's website or through a publicly accessible API. The data volume will grow over time, so assume that a single compressed file may reach hundreds of GB.

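
Because a single archive may reach hundreds of GB, any download step has to avoid holding the whole file in memory. The following is only a minimal sketch of that idea, using the standard library: `copy_in_chunks` and `CHUNK_SIZE` are hypothetical names, and the in-memory streams stand in for a real download stream and its destination.

```python
import hashlib
import io
from typing import BinaryIO, Tuple

CHUNK_SIZE = 8 * 1024 * 1024  # 8 MiB per read; an arbitrary illustrative value


def copy_in_chunks(
    source: BinaryIO, target: BinaryIO, chunk_size: int = CHUNK_SIZE
) -> Tuple[int, str]:
    """Copy source to target one chunk at a time.

    Returns the number of bytes copied and the SHA-256 hex digest, so that
    memory use stays bounded by chunk_size regardless of the file size.
    """
    digest = hashlib.sha256()
    total = 0
    while True:
        chunk = source.read(chunk_size)
        if not chunk:  # an empty read means the stream is exhausted
            break
        target.write(chunk)
        digest.update(chunk)
        total += len(chunk)
    return total, digest.hexdigest()


# In-memory stand-ins for a download stream and its destination (assumption).
payload = b"example-bytes" * 1000
destination = io.BytesIO()
copied, checksum = copy_in_chunks(io.BytesIO(payload), destination, chunk_size=64)
```

Returning a checksum alongside the byte count is one possible way to verify a transfer afterwards; the exercise itself does not prescribe it.
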
## Task Breakdown

### 1. Pipeline Design
- **Objective**: Develop a conceptual architecture for the data pipeline.
- **Requirements**: No coding required. Prepare a presentation (the format is flexible; PowerPoint is not required) illustrating the proposed architecture, including your choice of AWS services and frameworks.

### 2. Data Processing
- **Objective**: Write code to transform the provided source data.
- **Requirements**: Use Python 3.9 or newer, Spark 3.3 or newer, or both. Prepare a Jupyter Notebook to explain your approach and thought process.
- **Deliverables**: The source data, transformed according to the provided specifications.

### 3. Presentation
- **Procedure**: Submit your solutions for the first two parts. We will review them and arrange a meeting to discuss your approach and the solutions in detail.

## Detailed Instructions for Data Processing

### Task: Transforming Data
You will find `data.csv` in the files folder. Your task involves three key steps, each focusing on manipulating specific field values:

1. **String Cleaning**: The 'bio' field contains inconsistently formatted text. Standardize these values to a space-delimited string.
2. **Code Swap**: Use the `state_abbreviations` CSV from the files folder. Replace state abbreviations in the 'state' field of `data.csv` with the full state names from this supplementary file.
3. **Date Offset**: The 'start_date' field contains dates in varied formats, some of them invalid. Normalize valid dates to ISO 8601 format (YYYY-MM-DD), and create a new field, 'start_date_description', that marks whether each date is valid so that invalid dates can be filtered out.
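
The three steps above can be sketched at row level in plain Python. This is an illustration under stated assumptions, not a reference solution: the function names, the two-entry `STATE_NAMES` sample, the candidate `DATE_FORMATS`, and the `"valid"` / `"invalid"` labels are all hypothetical, and the formats actually present in `data.csv` have to be discovered from the data itself.

```python
from datetime import datetime
from typing import Dict, Optional, Tuple

# Hypothetical sample of the lookup table; the real mapping would be loaded
# from the state_abbreviations CSV.
STATE_NAMES: Dict[str, str] = {"TX": "Texas", "CA": "California"}

# Assumed candidate formats, tried in order. Note that "%m/%d/%Y" treats
# 03/05/2021 as March 5; that month-first reading is an assumption.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y")


def clean_bio(text: str) -> str:
    """Collapse every run of whitespace (tabs, newlines, spaces) to one space."""
    return " ".join(text.split())


def expand_state(code: str, lookup: Dict[str, str] = STATE_NAMES) -> str:
    """Return the full state name, or the trimmed input when there is no match."""
    trimmed = code.strip()
    return lookup.get(trimmed.upper(), trimmed)


def normalize_date(raw: str) -> Tuple[Optional[str], str]:
    """Return (iso_date, description); iso_date is None when no format parses."""
    value = raw.strip()
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value, fmt).date().isoformat(), "valid"
        except ValueError:
            continue  # wrong format or impossible date: try the next format
    return None, "invalid"
```

In a Spark solution the same logic would more likely be expressed with built-in column functions and a join against the lookup table, since row-level Python functions are typically slower there; the plain functions are used here only to make the intended behavior easy to read and test.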

**Final Output**: Your script should process `data.csv` and produce two files, `enriched.csv` and `enriched.snappy.parquet`, as per the above requirements. Address any additional data quality issues you encounter.
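
The "additional data quality issues" are left open on purpose. One possible way to surface them before writing any output is a quick profile of the input, sketched below with the standard library only. The helper `profile_rows` and the `SAMPLE` rows are hypothetical; only the column names 'bio', 'state', and 'start_date' come from the task description.

```python
import csv
import io
from collections import Counter
from typing import Dict, List


def profile_rows(rows: List[Dict[str, str]]) -> dict:
    """Count empty values per column and exact duplicate rows."""
    missing: Counter = Counter()
    seen: Counter = Counter()
    for row in rows:
        for column, value in row.items():
            if value is None or not value.strip():
                missing[column] += 1
        seen[tuple(row.items())] += 1
    # Every occurrence beyond the first of an identical row is a duplicate.
    duplicates = sum(count - 1 for count in seen.values())
    return {"rows": len(rows), "missing": dict(missing), "duplicates": duplicates}


# Hypothetical sample standing in for data.csv.
SAMPLE = """bio,state,start_date
Likes hiking,TX,2021-03-05
Likes hiking,TX,2021-03-05
,CA,
"""

report = profile_rows(list(csv.DictReader(io.StringIO(SAMPLE))))
```

A report like this only points at candidate problems; whether to drop, repair, or keep the affected rows remains a design decision to document in the notebook.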

### Assessment Criteria
Our aim is to see your best work. We appreciate clean, DRY (Don't Repeat Yourself) code that is well-documented. Our assessment will focus on:

1. Problem-solving effectiveness.
2. Clarity and logical structuring of your solution.
3. Quality of documentation and code commenting.

# Write your code here