# Part 1: Data Fundamentals

**Data Engineer Assessment** | Estimated time: ~15 minutes

This notebook tests your understanding of core data engineering fundamentals:
- **Section 1**: Data Types & Data Structures (~8 min)
- **Section 2**: File Formats & Encoding (~7 min)

Complete the tasks by writing code in the provided cells. Some cells contain starter code or hints — follow the instructions in the comments.

## Libraries and settings

In [None]:
# Libraries
import os
import warnings
import json
import pandas as pd
import numpy as np

# Ignore warnings
warnings.filterwarnings('ignore')

# Show current working directory
print(os.getcwd())

## Section 1: Data Types & Data Structures (~8 min)

### Task 1.1: Python Data Types & Type Casting

The apartment rental data contains raw string values like these:

| Raw value | Expected result |
|---|---|
| `"3,5 Zimmer, 88 m², CHF 2244.—"` | rooms = `3.5`, area = `88`, price = `2244` |
| `"CHF 3017.—"` | `3017.0` (float) |
| `"Preis auf Anfrage"` | `None` |

**Your tasks:**
1. Extract the number of rooms and the area from `raw_value`, and the price from `price_raw_1`, as proper numeric types
2. Handle the case where the price is not available (`"Preis auf Anfrage"`)
3. Print each result together with its Python type using `type()`
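
The parsing steps above can be sketched on a different sample string, so the cell below stays an exercise. The helper names `parse_rooms`, `parse_area` and `parse_price` and the regular expressions are illustrative assumptions, not the expected solution. The sketch also assumes prices are written without thousands separators.

```python
import re

def parse_rooms(text):
    """Return the room count as a float, or None if no 'Zimmer' figure is found."""
    match = re.search(r'(\d+(?:[.,]\d+)?)\s*Zimmer', text)
    if match is None:
        return None
    # The raw data uses a decimal comma, so normalise it before casting
    return float(match.group(1).replace(',', '.'))

def parse_area(text):
    """Return the area in square metres as an int, or None if it is missing."""
    match = re.search(r'(\d+)\s*m²', text)
    return int(match.group(1)) if match else None

def parse_price(text):
    """Return the CHF amount as a float, or None if no amount is present."""
    match = re.search(r'CHF\s*(\d+)', text)
    return float(match.group(1)) if match else None

# Hypothetical sample values, not taken from the dataset
sample = '4,5 Zimmer, 102 m², CHF 2890.—'
for label, value in [
    ('rooms', parse_rooms(sample)),
    ('area', parse_area(sample)),
    ('price', parse_price(sample)),
    ('no price', parse_price('Preis auf Anfrage')),
]:
    print(f'{label}: {value} -> {type(value)}')
```

Returning `None` instead of raising keeps missing prices distinguishable from a price of zero once the values are loaded into a table.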

In [None]:
# Raw data from the apartments dataset
raw_value = '3,5 Zimmer, 88 m², CHF 2244.—'
price_raw_1 = 'CHF 3017.—'
price_raw_2 = 'Preis auf Anfrage'

# TODO: Extract rooms as a float (e.g., 3.5)
rooms = ...

# TODO: Extract area as an integer (e.g., 88)
area = ...

# TODO: Extract price from price_raw_1 as a float (e.g., 3017.0)
price_1 = ...

# TODO: Handle price_raw_2 — assign None if it cannot be converted to a number
price_2 = ...

# Print results with their types
print(f'rooms:   {rooms}  -> {type(rooms)}')
print(f'area:    {area}   -> {type(area)}')
print(f'price_1: {price_1} -> {type(price_1)}')
print(f'price_2: {price_2} -> {type(price_2)}')


### Task 1.2: Core Data Structures

**Your tasks:**

1. Create a **list** of apartment descriptions and use it to count how many contain the word "Winterthur"
2. Create a **dictionary** that maps 3 city names to their average apartment price (use made-up values)
3. Given two lists of cities, use a **set** to find which cities appear in both lists
4. Create a **tuple** representing a single apartment record (address, rooms, price) — and demonstrate that it cannot be modified
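
As a rough illustration of the four structures listed above, the following sketch uses made-up data that is unrelated to the task cells. All variable names and values are hypothetical.

```python
# Hypothetical sample data, unrelated to the task cells
listings = ['Loft in Basel', 'Studio in Genf', 'Altbau in Basel']

# List: count the items that contain a substring
count_basel = sum('Basel' in text for text in listings)

# Dictionary: look up a value by its key
avg_price = {'Basel': 2100, 'Genf': 2600, 'Lugano': 1800}
basel_price = avg_price['Basel']

# Set: the & operator (or .intersection()) returns the shared elements;
# sets do not support the + operator
shared = set(['Basel', 'Genf']) & set(['Genf', 'Lugano'])

# Tuple: item assignment raises TypeError because tuples are immutable
record = ('Musterstrasse 1', 2.5, 1950)
try:
    record[2] = 2000
except TypeError as exc:
    error_message = str(exc)

print(count_basel, basel_price, shared, error_message)
```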

In [None]:
# Task 1.2.1 — List: count apartments mentioning 'Winterthur'
descriptions = [
    'Schöne 3.5-Zimmerwohnung mit Balkon in Winterthur',
    'Moderne Wohnung in Zürich mit Seesicht',
    'Grosszügige Wohnung im Herzen von Winterthur',
    'Attraktive Wohnung in Bern Altstadt',
    'Erstbezug in Winterthur Lokstadt'
]

# TODO: Count how many descriptions contain the word 'Winterthur'
count_winterthur = ...
print(f'Apartments in Winterthur: {count_winterthur}')

In [None]:
# Task 1.2.2 — Dictionary: city to average price mapping
# TODO: Create a dictionary with at least 3 cities as keys and average prices as values
city_prices = ...

# TODO: Print the price for one specific city using key access


In [None]:
# Task 1.2.3 — Set: find cities that appear in both lists
cities_list_a = ['Zürich', 'Winterthur', 'Bern', 'Basel']
cities_list_b = ['Winterthur', 'Luzern', 'Zürich', 'St. Gallen']

# TODO: Use sets to find cities that appear in both lists
common_cities = ...
print(f'Cities in both lists: {common_cities}')

In [None]:
# Task 1.2.4 — Tuple: immutable apartment record
# TODO: Create a tuple with (address, rooms, price) for one apartment
apartment_record = ...
print(f'Apartment record: {apartment_record}')

# TODO: Try to modify one element of the tuple (e.g., change the price)
# This should raise a TypeError — wrap it in a try/except block


### Task 1.3: Pandas Fundamentals

**Your tasks:**

1. Create a pandas **Series** from a list of prices and calculate the mean, median and standard deviation
2. Create a **DataFrame** from a dictionary of apartment data
3. Use `.loc[]` and `.iloc[]` to select specific rows and columns
4. Add a new calculated column and filter the DataFrame
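
The pandas operations named above can be sketched on a small made-up table. The column values and variable names (`rents`, `flats`, `large`) are assumptions for illustration only, and the example deliberately uses a `city` column rather than the `address` column the tasks ask for.

```python
import pandas as pd

# Made-up values for illustration only
rents = pd.Series([1500, 2000, 2500])
mean_rent = rents.mean()

flats = pd.DataFrame({
    'city': ['Basel', 'Genf', 'Lugano'],
    'rooms': [2.5, 4.5, 3.5],
    'area_m2': [60, 110, 80],
    'price_chf': [1500, 2750, 2000],
})

# .iloc selects by integer position; the end of the slice is exclusive
first_row = flats.iloc[0:1]

# .loc selects by label; ':' keeps all rows
city_price = flats.loc[:, ['city', 'price_chf']]

# Derived column, computed element-wise from two existing columns
flats['price_per_m2'] = flats['price_chf'] / flats['area_m2']

# Boolean mask: keep only the rows where the condition holds
large = flats[flats['rooms'] > 3]

print(mean_rent)
print(first_row)
print(city_price)
print(large)
```

Note that `.iloc[0:1]` excludes the end position, whereas a label slice with `.loc` includes its end label.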

In [None]:
# Task 1.3.1 — Create a Series of monthly rent prices and calculate basic statistics
prices = [1630, 1970, 2244, 2520, 3017, 3260, 3782]

# TODO: Create a pandas Series from the prices list
price_series = ...

# TODO: Print the mean, median and standard deviation


In [None]:
# Task 1.3.2 — Create a DataFrame from a dictionary
# TODO: Create a DataFrame with columns: 'address', 'rooms', 'area_m2', 'price_chf'
#       Use at least 4 rows of data (you can use values from the apartment data above)
df = ...

# Print the DataFrame
print(df)
print(f'\nShape: {df.shape}')
print(f'\nData types:\n{df.dtypes}')

In [None]:
# Task 1.3.3 — Indexing with .loc[] and .iloc[]
# TODO: Select the first two rows using .iloc[]
first_two = ...

# TODO: Select only the 'address' and 'price_chf' columns using .loc[]
address_price = ...

# Print results
print('First two rows:')
print(first_two)
print('\nAddress and price:')
print(address_price)

In [None]:
# Task 1.3.4 — Add a calculated column and filter
# TODO: Add a new column 'price_per_m2' = price_chf / area_m2
df['price_per_m2'] = ...

# TODO: Filter the DataFrame to show only apartments with more than 3 rooms
df_filtered = ...

print('DataFrame with price_per_m2:')
print(df)
print('\nApartments with more than 3 rooms:')
print(df_filtered)