# Downloading the dataset

This notebook downloads and unpacks the dataset used in the course.

- The zip file (~344 MB) is downloaded to:  
  `<repo-root>/data/raw/Database_CPT_PremstallerGeotechnik.zip`

- The contents are unpacked into:  
  `<repo-root>/data/raw/`

You only need to run this notebook **once**.


Keyboard shortcuts for working in a Jupyter notebook (the single-letter shortcuts apply in command mode):
- `Shift + Enter`: Run the current cell
- `Enter`: Enter edit mode
- `Esc`: Enter command mode
- `a`: Insert cell above
- `b`: Insert cell below
- `dd`: Delete cell
- `m`: Change cell to markdown
- `y`: Change cell to code
- `x`: Cut cell
- `c`: Copy cell
- `v`: Paste cell
- `Shift + m`: Merge selected cells
- `Shift + Arrow Up/Down`: Select multiple cells
- `Ctrl + Shift + -`: Split cell at cursor

Other remarks:
- Set the kernel to the project's environment before running any cell
- Run the cells from top to bottom; running them out of order can leave stale variables and produce unexpected results

Set up paths and URL

In [None]:
from pathlib import Path

URL = "https://www.tugraz.at/fileadmin/user_upload/Institute/IBG/Datenbank/Database_CPT_PremstallerGeotechnik.zip"

# Relative path: it resolves against the current working directory,
# so this assumes the notebook is run from the repository root
DATA_RAW_DIR = Path("data/raw")
DATA_RAW_DIR.mkdir(parents=True, exist_ok=True)

zip_path = DATA_RAW_DIR / "Database_CPT_PremstallerGeotechnik.zip"

print(f"Data directory: {DATA_RAW_DIR.resolve()}")
print(f"Zip file will be saved to: {zip_path}")
zip_path
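
The relative path above only points at `<repo-root>/data/raw/` when the working directory is the repository root. If the notebook is started from a subfolder, one option is to search upward for a marker file. The sketch below is an illustration, not part of the course code: `find_repo_root` is a hypothetical helper, and the marker names (`.git`, `pyproject.toml`) are assumptions about how the repository is laid out.

```python
from pathlib import Path


def find_repo_root(start: Path, markers=(".git", "pyproject.toml")) -> Path:
    """Walk upward from `start` and return the first directory holding a marker.

    Hypothetical helper: the marker names are assumptions, adjust them
    to whatever file or folder sits at the top of your repository.
    """
    start = start.resolve()
    for candidate in (start, *start.parents):
        if any((candidate / marker).exists() for marker in markers):
            return candidate
    # No marker found: fall back to the starting directory
    return start
```

With this helper, the data directory could be defined as `find_repo_root(Path.cwd()) / "data" / "raw"`, which gives the same location regardless of which subfolder the notebook is launched from.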


Download zip if needed

In [None]:
import urllib.request

if zip_path.exists():
    print(f"Zip file already exists at:\n{zip_path}")
else:
    print(f"Downloading dataset from:\n{URL}\n")
    print("This may take a few minutes...")
    urllib.request.urlretrieve(URL, zip_path)
    print(f"\nDownload complete. Saved to:\n{zip_path}")
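
One caveat with the cell above: if the download is interrupted, a partial file may remain at `zip_path`, and the `exists()` check will then skip the download on the next run. A possible way around this, sketched here under the assumption that a temporary `.part` file next to the target is acceptable, is to download to the temporary name and rename it only after the transfer finishes. `download_atomically` is a hypothetical helper, not something the course provides.

```python
import shutil
import urllib.request
from pathlib import Path


def download_atomically(url: str, dest: Path, timeout: float = 60.0) -> Path:
    """Download `url` to `dest` via a temporary '.part' file (hypothetical helper)."""
    tmp = dest.with_name(dest.name + ".part")
    try:
        # Stream the response to the temporary file in chunks
        with urllib.request.urlopen(url, timeout=timeout) as response, open(tmp, "wb") as out:
            shutil.copyfileobj(response, out)
        # Rename only after the whole body has been written
        tmp.replace(dest)
    finally:
        # Remove a leftover partial file if the download failed
        tmp.unlink(missing_ok=True)
    return dest
```

In the cell above, `urllib.request.urlretrieve(URL, zip_path)` could then be swapped for `download_atomically(URL, zip_path)`. If a partial zip file already exists from an earlier interrupted run, delete it manually before rerunning.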

Unpack zip file

In [None]:
from zipfile import ZipFile

if not zip_path.exists():
    raise FileNotFoundError(f"Zip file not found at {zip_path}. Run the download cell first.")

print(f"Unpacking zip file:\n{zip_path}\n")

with ZipFile(zip_path, "r") as zf:
    print("Files in archive:")
    for name in zf.namelist():
        print(f"  - {name}")
    print(f"\nExtracting to: {DATA_RAW_DIR}\n")
    zf.extractall(DATA_RAW_DIR)

print("✓ Unpacking complete.")
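
If extraction fails with an error about a bad or truncated archive, the zip file is probably incomplete. A quick integrity check before unpacking can make that easier to diagnose. The following is a minimal sketch using the standard library; `zip_is_intact` is a hypothetical name introduced for illustration.

```python
from pathlib import Path
from zipfile import BadZipFile, ZipFile, is_zipfile


def zip_is_intact(path: Path) -> bool:
    """Return True if `path` is a readable zip whose members pass the CRC check."""
    if not is_zipfile(path):
        return False
    try:
        with ZipFile(path, "r") as zf:
            # testzip() returns the name of the first corrupt member, or None
            return zf.testzip() is None
    except BadZipFile:
        return False
```

If `zip_is_intact(zip_path)` returns `False`, deleting the zip file and rerunning the download cell is usually the simplest fix. Note that the check reads every member of the archive, so it takes a little while on a file of this size.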

Verify extraction

In [None]:
# Check if the CSV file was extracted successfully
csv_file = DATA_RAW_DIR / "CPT_PremstallerGeotechnik_revised.csv"

if csv_file.exists():
    file_size_mb = csv_file.stat().st_size / (1024 * 1024)
    print("✓ Dataset file found:")
    print(f"  {csv_file}")
    print(f"  Size: {file_size_mb:.1f} MB")
    print("\n✅ Setup complete! Dataset is ready at:")
    print(f"  {DATA_RAW_DIR.resolve()}")
    print("\nYou can now run the other notebooks.")
else:
    print("⚠ Warning: CSV file not found!")
    print(f"Expected location: {csv_file}")
    print(f"\nFiles in {DATA_RAW_DIR}:")
    for f in DATA_RAW_DIR.iterdir():
        print(f"  - {f.name}")
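
The fallback listing above only shows the top level of `DATA_RAW_DIR`. If the archive unpacks into a subfolder, the CSV file will not appear there. A recursive search, sketched below with a hypothetical helper `find_files`, can locate it wherever it ended up.

```python
from pathlib import Path


def find_files(root: Path, pattern: str = "*.csv") -> list[Path]:
    """Return all files under `root` (searched recursively) that match `pattern`."""
    return sorted(p for p in root.rglob(pattern) if p.is_file())
```

For example, `find_files(DATA_RAW_DIR)` returns every CSV file below the data directory, which shows whether `CPT_PremstallerGeotechnik_revised.csv` was extracted to a different location than expected.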


## Next Steps

You can now:
1. Close this notebook
2. Open `01_eda_cpt.ipynb` to start exploring the data
3. Continue with the other course notebooks in sequence
