# Data Engineer - Technical Assessment

In this section of the interview at Beyond Finance, you will be assessed on your ability to perform several Data Engineering tasks. To perform well on this task, you will demonstate competence in the following areas:

* preprocessing data to prepare for a database load
* understanding entity relationships in a database
* merging data from different tables
* filtering data to relevant subsets
* calculating aggregations and descriptive statistics

It will be pretty difficult to complete all questions in the allotted time. Your goal is not to speed through the answers, but to come up with answers that demonstrate your knowledge. It's more about your thought process and logic than getting the right answer or your code.


## Getting Started

This exercise will be broken into 2 parts
1. Data Processing
2. Data Wrangling

### Data Processing
In this section you will take files from the ./raw_data/ subfolders, combine them into a single newline-delimited `json.gz` file per subfolder, and place that CSV file in a ./processed_data/ directory. You may have to do some light investigation into the data files to understand their file formats and delimiters

**Example**

Files
- ./raw_data/tracks/tracks_0.csv
- ./raw_data/tracks/tracks_1.json
- ./raw_data/tracks/tracks_2.csv
- etc... 

should be combined into a single file ./processed_data/tracks.json.gz

**What we look for**

- Can you handle all subfolders in a single pass over the raw data files?
- What if the file sizes are in GigaBytes? Can your code (if run on a standard laptop) load the files without going out of memory? (hint `chunksize`)
- Can you identify edge cases? What scenarios could break your code?
- Please directly respond to the above questions in your submission.

### Data Wrangling
For this section, we'll pretend you loaded the raw data plus additional tables into a small SQLite database containing roughly a dozen tables. **We've provided this database for you so don't worry about loading it yourself**. If you are not familiar with the SQLite database, it uses a fairly complete and standard SQL syntax, though does not many advanced analytics functions. Consider it just a remote datastore for storing and retrieving data from. 

![](db-diagram.png)

## Data Processing

In [4]:
import pandas as pd 

!pip install memory_profiler
!pip install ijson
%pip install ipython-sql
%load_ext memory_profiler

Collecting ipython-sql
  Downloading ipython_sql-0.5.0-py3-none-any.whl.metadata (17 kB)
Collecting prettytable (from ipython-sql)
  Downloading prettytable-3.16.0-py3-none-any.whl.metadata (33 kB)
Collecting sqlparse (from ipython-sql)
  Downloading sqlparse-0.5.3-py3-none-any.whl.metadata (3.9 kB)
Collecting ipython-genutils (from ipython-sql)
  Downloading ipython_genutils-0.2.0-py2.py3-none-any.whl.metadata (755 bytes)
Downloading ipython_sql-0.5.0-py3-none-any.whl (20 kB)
Downloading ipython_genutils-0.2.0-py2.py3-none-any.whl (26 kB)
Downloading prettytable-3.16.0-py3-none-any.whl (33 kB)
Downloading sqlparse-0.5.3-py3-none-any.whl (44 kB)
Installing collected packages: ipython-genutils, sqlparse, prettytable, ipython-sql
Successfully installed ipython-genutils-0.2.0 ipython-sql-0.5.0 prettytable-3.16.0 sqlparse-0.5.3
Note: you may need to restart the kernel to use updated packages.
The memory_profiler extension is already loaded. To reload it, use:
  %reload_ext memory_profiler


In [None]:
%%memit
# ... your code here

## Data Wrangling

In [9]:
import prettytable
prettytable.DEFAULT_STYLE = prettytable.TableStyle

In [11]:
%reload_ext sql 
%sql sqlite:///db/sqlite/chinook.db

In [12]:
import sqlite3

con = sqlite3.connect("db/sqlite/chinook.db")

### 1. How many different customers are there?

In [13]:
%%sql
select count(distinct customerid) from customers

 * sqlite:///db/sqlite/chinook.db
Done.


KeyError: 'DEFAULT'

### 2. How long is the longest track in minutes?

### 3. Which genre has the shortest average track length?

### 4. Which artist shows up in the most playlists?

### 5. What album had the most purchases?

### 6. Which customer has the highest number of sales in terms of dollars?

In [None]:
%%sql


### 7. Count of customers who have dollar sales more than $40?

In [None]:
%%sql


### Python File Processor with Memory Profiling

In [None]:

import os
import json
import gzip
import csv
from pathlib import Path
from datetime import datetime

RAW_DIR = Path("./raw_data")
PROCESSED_DIR = Path("./processed_data")
PROCESSED_DIR.mkdir(exist_ok=True)
LOG_DIR = Path("./logs")

LOG_FILE = LOG_DIR / f"unsupported_files_{datetime.now().strftime('%Y%m%d_%H%M%S')}.log"

def read_csv(file_path):
    with open(file_path, newline='', encoding='utf-8') as f:
        reader = csv.DictReader(f)
        for row in reader:
            yield row

def read_json(file_path):
    with open(file_path, 'r', encoding='utf-8') as f:
        first_char = f.read(1)
        f.seek(0)
        if first_char == '[':
            # Entire list-style JSON
            data = json.load(f)
            for item in data:
                yield item
        else:
            for line in f:
                line = line.strip()
                if line:
                    try:
                        yield json.loads(line)
                    except json.JSONDecodeError:
                        continue

def read_with_ijson(file_path):
    """
    reads records from a JSON file contained in array using ijson.
    """
    import ijson
    with open(file_path, 'r', encoding='utf-8') as f:
        for record in ijson.items(f, 'item'):
            yield record

def read_records(file_path):
    ext = file_path.suffix.lower()
    if ext == '.csv':
        yield from read_csv(file_path)
    elif ext == '.json':
        yield from read_json(file_path)
    else:
        with open(LOG_FILE, "a") as log_file:
            log_file.write(f"[SKIPPED] unsupported file format: {file_path}")


def process_folder(subfolder):
    output_path = PROCESSED_DIR / f"{subfolder.name}.json.gz"
    with gzip.open(output_path, 'wt', encoding='utf-8') as out_file:
        for file in subfolder.iterdir():
            if file.is_file() and file.suffix.lower() in {'.csv', '.json'}:
                for record in read_records(file):
                    json.dump(record, out_file)
                    out_file.write('\n')

def main():
    for folder in RAW_DIR.iterdir():
        if folder.is_dir():
            print(f"processing: {folder.name}")
            process_folder(folder)

# Run the main function with memory profiling
%load_ext memory_profiler
%%memit
main()
