## Section 2: File Formats & Encoding (~7 min)

## Libraries and settings

In [2]:
# Libraries
import os
import warnings
import json
import pandas as pd
import numpy as np

# Ignore warnings
warnings.filterwarnings('ignore')

# Show current working directory
print(os.getcwd())

/workspaces/data_engineer_assessment/part_2


### Task 2.1: Reading Different File Formats

The `part_2/` directory contains data in two formats:
- `apartments_data_winterthur.csv` — apartment rental data in CSV format
- `supermarkets.json` — supermarket locations from OpenStreetMap in JSON format

**Your tasks:**
1. Read the CSV file into a DataFrame using `pd.read_csv()` with appropriate parameters
2. Read the JSON file into a DataFrame using `pd.read_json()`
3. Display the first 3 rows and the shape of each DataFrame
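
A minimal sketch of both readers, using small in-memory samples instead of the assessment files. The column names, the `;` separator, and the flat list-of-records JSON layout are assumptions made for illustration; the real files may be laid out differently.

```python
import io

import pandas as pd

# Hypothetical CSV sample: the columns and the ';' separator are assumptions
csv_text = (
    "address_raw;rooms;price\n"
    "Technikumstrasse 9, 8400 Winterthur;3.5;1850\n"
    "Zürcherstrasse 12, 8406 Winterthur;2.5;1420\n"
)
df_csv = pd.read_csv(io.StringIO(csv_text), sep=";")

# Hypothetical JSON sample: a flat list of records, one object per row
json_text = (
    '[{"brand": "Migros", "lat": 47.50, "lon": 8.72},'
    ' {"brand": "Coop", "lat": 47.49, "lon": 8.73}]'
)
df_json = pd.read_json(io.StringIO(json_text))

# Shape and first rows of each frame
print(df_csv.shape, df_json.shape)
print(df_csv.head(3))
print(df_json.head(3))
```

For a file on disk, the path takes the place of the `io.StringIO(...)` object, and `encoding="utf-8"` can be passed to `pd.read_csv()` explicitly. JSON whose records sit under a top-level key would need flattening first, for example with `pd.json_normalize()`.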

In [None]:
# Task 2.1.1 — Read the CSV file
# TODO: Read 'apartments_data_winterthur.csv' into a DataFrame
df_apartments = ...

# Print shape and first 3 rows
print(f'Shape: {df_apartments.shape}')
df_apartments.head(3)

In [None]:
# Task 2.1.2 — Read the JSON file
# TODO: Read 'supermarkets.json' into a DataFrame
df_supermarkets = ...

# Print shape and first 3 rows
print(f'Shape: {df_supermarkets.shape}')
df_supermarkets.head(3)

### Task 2.2: File Format Trade-offs

Answer the following questions, writing your responses as **Python comments** or inside **`print()`** statements.

1. You need to store 50 GB of tabular sensor data that will be queried by column. Which file format would you choose: CSV, JSON, or Parquet? Why?

2. You are building a REST API that returns nested, hierarchical data (e.g., a product catalog with categories and subcategories). Which format is most suitable: CSV, JSON, or XML? Why?

3. Your colleague sends you a CSV file, but some columns contain nested objects (e.g., a list of tags per row). What problem does this create, and how would you handle it?

4. Name one advantage of Parquet over CSV and one advantage of CSV over Parquet.
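
Question 3 describes list-valued cells inside a flat CSV. The sketch below shows one possible way to handle them, assuming the lists were serialized as JSON strings. The sample data and the column names `id` and `tags` are hypothetical.

```python
import io
import json

import pandas as pd

# Hypothetical CSV in which each 'tags' cell holds a JSON-encoded list
csv_text = (
    "id,tags\n"
    '1,"[""organic"", ""bakery""]"\n'
    '2,"[]"\n'
    '3,"[""discount""]"\n'
)
df = pd.read_csv(io.StringIO(csv_text))

# After reading, the cells are plain strings, so parse them into Python lists
df["tags"] = df["tags"].apply(json.loads)

# One row per tag; a row with an empty list is kept, with a missing tag value
exploded = df.explode("tags")
print(exploded)
```

If the cells had been written with Python's `repr()` rather than as JSON, `ast.literal_eval` would be the parsing function to try instead of `json.loads`.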

In [None]:
# TODO: Write your answers as comments or print statements

# Q1: 50 GB sensor data, column-based queries — which format and why?


# Q2: REST API with nested/hierarchical data — which format and why?


# Q3: CSV with nested objects — what's the problem and how to handle it?


# Q4: One advantage of Parquet over CSV, and one advantage of CSV over Parquet?


### Task 2.3: Character Encoding

The apartment dataset contains Swiss German text with special characters (ä, ö, ü, etc.).

**Your tasks:**
1. Read `apartments_data_zuerich.csv` with `encoding='utf-8'` and verify that special characters display correctly
2. Try reading the same file with `encoding='ascii'` — observe and handle the error using a `try/except` block
3. Extract the `address_raw` column and count how many addresses contain non-ASCII characters (e.g., ü, ä, ö)
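
The three steps can be sketched on a small in-memory sample. Only the column name `address_raw` comes from the task; the addresses are made up, and the regex is one of several ways to detect non-ASCII characters.

```python
import io

import pandas as pd

# Hypothetical sample, encoded as UTF-8 bytes the way a file on disk would be
raw = (
    "address_raw\n"
    "Zürcherstrasse 12\n"
    "Römerstrasse 3\n"
    "Bahnhofplatz 1\n"
).encode("utf-8")

# ASCII cannot represent ü or ö, so decoding is expected to fail
ascii_error = None
try:
    pd.read_csv(io.BytesIO(raw), encoding="ascii")
except UnicodeDecodeError as exc:
    ascii_error = str(exc)
    print(f"ASCII decoding failed: {ascii_error}")

# UTF-8 decodes the same bytes without loss
df = pd.read_csv(io.BytesIO(raw), encoding="utf-8")

# Flag rows containing any character outside the 7-bit ASCII range
non_ascii = df["address_raw"].str.contains(r"[^\x00-\x7F]", na=False)
print(f"Addresses with non-ASCII characters: {int(non_ascii.sum())}")
```

Passing `na=False` counts missing addresses as ASCII-only rather than propagating `NaN` into the sum.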

In [None]:
# Task 2.3.1 — Read with UTF-8 encoding
# TODO: Read the Zürich apartments CSV with UTF-8 encoding
df_zh = ...

# Print the first 5 address values to verify special characters render correctly
print(df_zh['address_raw'].head())

In [None]:
# Task 2.3.2 — Try reading with ASCII encoding (this should fail)
# TODO: Wrap in try/except and print the error message


In [None]:
# Task 2.3.3 — Count addresses with non-ASCII characters
# TODO: Count how many address values contain characters like ä, ö, ü, é, etc.
# Hint: You can use str.contains() with a regex pattern or check if a string is ASCII
non_ascii_count = ...

print(f'Addresses with non-ASCII characters: {non_ascii_count}')

### Task 2.4: File Format Conversion

**Your tasks:**
1. Take the apartments DataFrame (from Task 2.1) and write it to a **Parquet** file
2. Read the Parquet file back and verify the data is identical
3. Compare the file sizes of the CSV and Parquet files
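
The equality check and the file-size lookup can be sketched without writing Parquet at all, which avoids depending on the optional engine (`pyarrow` or `fastparquet`) that `DataFrame.to_parquet()` needs. The frames below are hypothetical stand-ins: `reloaded` imitates a frame that came back from disk with one column's dtype changed.

```python
import os
import tempfile

import pandas as pd
from pandas.testing import assert_frame_equal

original = pd.DataFrame({"rooms": [3, 4], "price": [1500, 2100]})

# Hypothetical stand-in for a reloaded frame: same values, different dtype
reloaded = original.astype({"price": "float64"})

# equals() is strict about dtypes, so this comparison is False
strict_equal = original.equals(reloaded)
print(f"Strictly equal: {strict_equal}")

# assert_frame_equal raises on a mismatch; here dtype differences are ignored
assert_frame_equal(original, reloaded, check_dtype=False)

# File size in bytes via os.path.getsize(), using a temporary CSV file
with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "sample.csv")
    original.to_csv(csv_path, index=False)
    csv_size = os.path.getsize(csv_path)

print(f"CSV file size: {csv_size:,} bytes")
```

Because `DataFrame.equals()` returns `False` whenever dtypes differ, a mismatch after a round trip does not by itself mean that values were lost; comparing `dtypes` first helps tell the two cases apart.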

In [None]:
# Task 2.4.1 — Write to Parquet
# TODO: Save df_apartments to 'apartments_winterthur.parquet'


# Task 2.4.2 — Read back from Parquet and verify
# TODO: Read the parquet file and compare shape/dtypes with the original
df_from_parquet = ...

print(f'Original shape:  {df_apartments.shape}')
print(f'Parquet shape:   {df_from_parquet.shape}')
print(f'DataFrames equal: {df_apartments.equals(df_from_parquet)}')

In [None]:
# Task 2.4.3 — Compare file sizes
# TODO: Use os.path.getsize() to compare the CSV and Parquet file sizes
csv_size = ...
parquet_size = ...

print(f'CSV file size:      {csv_size:>10,} bytes')
print(f'Parquet file size:  {parquet_size:>10,} bytes')
print(f'Compression ratio:  {csv_size / parquet_size:.2f}x')