## Libraries and settings

## Section 2: File Formats

In [1]:
# Libraries
import os
import warnings
import pandas as pd
import numpy as np
import pyarrow

# Ignore warnings
warnings.filterwarnings('ignore')

# Show current working directory
print(os.getcwd())

/workspaces/data_engineer_assessment/part_2


### Task 2.1: Reading Different File Formats

The `part_2/` directory contains data in two formats:
- `apartments_data_winterthur.csv` — apartment rental data in CSV format
- `supermarkets.json` — supermarket locations from OpenStreetMap in JSON format

**Your tasks:**
1. Read the CSV file into a DataFrame
2. Read the JSON file into a DataFrame
3. Display the first 3 rows and the shape of each DataFrame

In [2]:
# Task 2.1.1 — Read the CSV file
# TODO: Read 'apartments_data_winterthur.csv' into a DataFrame
df_apartments = pd.read_csv('apartments_data_winterthur.csv')

# TODO: Print info and first 3 rows
print(f'Info: {df_apartments.head(n = 3)}')
...

Info:   web-scraper-order                              web-scraper-start-url  \
0      1693993818-1  https://www.immoscout24.ch/de/wohnung/mieten/o...   
1      1693993818-2  https://www.immoscout24.ch/de/wohnung/mieten/o...   
2      1693993818-3  https://www.immoscout24.ch/de/wohnung/mieten/o...   

             rooms_area_price_raw  \
0  6,5 Zimmer, 143 m², CHF 3017.—   
1    1 Zimmer, 132 m², CHF 3260.—   
2  4,5 Zimmer, 117 m², CHF 3782.—   

                                     address_raw   price_raw  \
0          Am Eulachpark 25, 8404 Winterthur, ZH  CHF 3017.—   
1  Katharina Sulzer Platz 2, 8400 Winterthur, ZH  CHF 3260.—   
2                            8400 Winterthur, ZH  CHF 3782.—   

                                     description_raw  \
0      «Sie suchen die spezielle Maisonettewohnung?»   
1            «In Loft-iger Höhe MIETEN OHNE KAUTION»   
2  «MÖBLIERT, TEMPORÄR: 4½ ZI-WOHNUNG IN WINTERTH...   

                                            text_raw  
0  6,5 Zimm

Ellipsis

In [4]:
# Task 2.1.2 — Read the JSON file
# TODO: Read 'supermarkets.json' into a DataFrame
df_supermarkets = pd.read_json('supermarkets.json')

# TODO: Print last 3 rows
print(f"Info: {df_supermarkets.tail(n=3)}")

Info:       type           id        lat       lon  \
3389  node  11107076347  47.466556  9.048250   
3390  node  11107594883  47.322228  8.529748   
3391  node  11129298207  47.537518  7.608581   

                                                   tags  
3389  {'addr:city': 'Wil SG', 'addr:housenumber': '3...  
3390  {'addr:city': 'Adliswil', 'addr:housenumber': ...  
3391  {'brand': 'Coop', 'brand:wikidata': 'Q432564',...  


### Task 2.2: Nested Objects

The `df_supermarkets` DataFrame has a nested object in the `tags` column. Each row contains a dictionary with multiple OSM (OpenStreetMap) attributes like brand, opening hours, address details, etc.

**Your tasks:**
1. Inspect the `tags` column to understand its structure (display one example)
2. Flatten the nested `tags` dictionary into separate columns
3. Combine the flattened columns with the original location columns (`type`, `id`, `lat`, `lon`)
4. Drop the original `tags` column and display the resulting DataFrame
5. Compare the shape before and after flattening

In [5]:
# Task 2.2 Solution — Flattening Nested Objects

#TODO Step 1: Inspect the tags column structure
df_supermarkets['tags'].head()


0    {'brand': 'Spar', 'brand:wikidata': 'Q610492',...
1    {'addr:city': 'Uznach', 'addr:housenumber': '2...
2    {'addr:city': 'Uznach', 'addr:postcode': '8730...
3    {'addr:city': 'Zürich', 'addr:country': 'CH', ...
4    {'addr:city': 'Zürich', 'addr:housenumber': '7...
Name: tags, dtype: object

In [7]:
# Step 2 & 3: Flatten the tags columns and combine with original columns

#TODO Flatten the tags dictionary into separate columns
tags_normalized = pd.json_normalize(df_supermarkets['tags'])

#TODO Combine with the original location columns
df_supermarkets_flattened = df_supermarkets.add(tags_normalized)

print(f"Original shape: {df_supermarkets.shape}")
print(f"Flattened shape: {df_supermarkets_flattened.shape}")


Original shape: (3392, 5)
Flattened shape: (3392, 237)


In [9]:
#TODO Step 4: Display information about the flattened DataFrame
print(f"Info: {df_supermarkets_flattened.describe()}")


Info:        access  access:covid19  addr:city  addr:city:de  addr:city:fr  \
count     0.0             0.0        0.0           0.0           0.0   
mean      NaN             NaN        NaN           NaN           NaN   
std       NaN             NaN        NaN           NaN           NaN   
min       NaN             NaN        NaN           NaN           NaN   
25%       NaN             NaN        NaN           NaN           NaN   
50%       NaN             NaN        NaN           NaN           NaN   
75%       NaN             NaN        NaN           NaN           NaN   
max       NaN             NaN        NaN           NaN           NaN   

       addr:country  addr:floor  addr:full  addr:housename  addr:housenumber  \
count           0.0         0.0        0.0             0.0               0.0   
mean            NaN         NaN        NaN             NaN               NaN   
std             NaN         NaN        NaN             NaN               NaN   
min             NaN      

### Task 2.3: File Format Conversion

**Your tasks:**
1. Take the apartments DataFrame (from Task 2.1) and write it to a **Parquet** file
2. Read the Parquet file back and verify the data is identical
3. Compare the file sizes of the CSV and Parquet files

In [10]:
# Task 2.3.1 — Write to Parquet
# TODO: Save df_apartments to 'apartments_winterthur.parquet'
df_apartments.to_parquet('apartments_winterthur.parquet')
# Task 2.3.2 — Read back from Parquet and verify
# TODO: Read the parquet file and compare shape/dtypes with the original
df_from_parquet = pd.read_parquet('apartments_winterthur.parquet')

print(f'Original shape:  {df_apartments.shape}')
print(f'Parquet shape:   {df_from_parquet.shape}')
print(f'DataFrames equal: {df_apartments.equals(df_from_parquet)}')

Original shape:  (120, 7)
Parquet shape:   (120, 7)
DataFrames equal: True


In [None]:
# Task 2.3.3 — Compare file sizes
# TODO: Use os.path.getsize() to compare the CSV and Parquet file sizes
csv_size = ...
parquet_size = ...

print(f'CSV file size:     {csv_size:>10,} bytes')
print(f'Parquet file size:  {parquet_size:>10,} bytes')
print(f'Compression ratio:  {csv_size / parquet_size:.2f}x')