# Downloading the dataset

This notebook downloads and unpacks the dataset used in the course.

- The zip file (~344 MB) is downloaded to:  
  `data/raw/Database_CPT_PremstallerGeotechnik.zip`

- The contents are unpacked into:  
  `data/raw/`

You only need to run this notebook **once**. If you run it again, the download cell skips the download when the zip file already exists.

## Set up paths and URL

In [2]:
from pathlib import Path

URL = "https://www.tugraz.at/fileadmin/user_upload/Institute/IBG/Datenbank/Database_CPT_PremstallerGeotechnik.zip"

REPO_ROOT = Path("..").resolve()
DATA_RAW_DIR = REPO_ROOT / "data/raw"
DATA_RAW_DIR.mkdir(parents=True, exist_ok=True)

zip_path = DATA_RAW_DIR / "Database_CPT_PremstallerGeotechnik.zip"
zip_path


WindowsPath('C:/Users/TFH/git_projects/course-machine-learning-for-geotechnics-intro/data/raw/Database_CPT_PremstallerGeotechnik.zip')
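Note that `Path("..").resolve()` depends on the current working directory, so it only points at the repository root when the notebook is started from its own folder. A minimal sketch of a more defensive lookup follows. It assumes the repository contains a `.git` directory at its root, and `find_repo_root` is a hypothetical helper, not part of the course code.

```python
from pathlib import Path


def find_repo_root(start: Path, marker: str = ".git") -> Path:
    """Walk upwards from `start` until a folder containing `marker` is found.

    Hypothetical helper; the `.git` marker is an assumption about the repo layout.
    """
    start = Path(start).resolve()
    # Check the starting folder first, then each parent in turn.
    for candidate in (start, *start.parents):
        if (candidate / marker).exists():
            return candidate
    raise FileNotFoundError(f"No '{marker}' found in or above {start}")
```

With this helper, `REPO_ROOT = find_repo_root(Path.cwd())` would replace the relative `Path("..")` lookup.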

## Download zip if needed

In [3]:
import urllib.request

if zip_path.exists():
    print(f"Zip file already exists at:\n{zip_path}")
else:
    print(f"Downloading dataset from:\n{URL}\n")
    print("This may take a few minutes...")
    urllib.request.urlretrieve(URL, zip_path)
    print(f"\nDownload complete. Saved to:\n{zip_path}")

Zip file already exists at:
C:\Users\TFH\git_projects\course-machine-learning-for-geotechnics-intro\data\raw\Database_CPT_PremstallerGeotechnik.zip
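One caveat with the cell above: if the download is interrupted, a partial file can remain at `zip_path`, and the `exists()` check will then skip the download on the next run. A minimal sketch of one way to guard against this writes to a temporary `.part` file and renames it only after the transfer has finished. `download_atomic` is a hypothetical helper, not part of the notebook.

```python
import shutil
import urllib.request
from pathlib import Path


def download_atomic(url: str, dest: Path) -> Path:
    """Download `url` to `dest` via a temporary '.part' file (hypothetical helper)."""
    dest = Path(dest)
    if dest.exists():
        return dest  # already downloaded; nothing to do
    tmp = dest.with_name(dest.name + ".part")
    try:
        # Stream the response body to the temporary file.
        with urllib.request.urlopen(url) as response, open(tmp, "wb") as out:
            shutil.copyfileobj(response, out)
        # Rename only after the whole body has been written.
        tmp.replace(dest)
    finally:
        # Clean up a leftover partial file if the download failed.
        tmp.unlink(missing_ok=True)
    return dest
```

In this notebook it would be called as `download_atomic(URL, zip_path)`. If a truncated zip is already on disk from an earlier attempt, delete it by hand before re-running.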


## Unpack zip file

In [4]:
from zipfile import ZipFile

if not zip_path.exists():
    raise FileNotFoundError(f"Zip file not found at {zip_path}. Run the download cell first.")

print(f"Unpacking zip file:\n{zip_path}\n")

with ZipFile(zip_path, "r") as zf:
    print("Files in archive:")
    for name in zf.namelist():
        print(f"  - {name}")
    print(f"\nExtracting to: {DATA_RAW_DIR}\n")
    zf.extractall(DATA_RAW_DIR)

print("✓ Unpacking complete.")

Unpacking zip file:
C:\Users\TFH\git_projects\course-machine-learning-for-geotechnics-intro\data\raw\Database_CPT_PremstallerGeotechnik.zip

Files in archive:
  - CPT_PremstallerGeotechnik_revised.csv

Extracting to: C:\Users\TFH\git_projects\course-machine-learning-for-geotechnics-intro\data\raw

✓ Unpacking complete.
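Before extracting a large archive, it can be worth checking that the file is a complete, uncorrupted zip. The sketch below uses only the standard `zipfile` module. `zip_is_intact` is a hypothetical helper name, and running the CRC check on a file of this size may take a little while.

```python
from pathlib import Path
from zipfile import BadZipFile, ZipFile, is_zipfile


def zip_is_intact(path: Path) -> bool:
    """Return True if `path` is a readable zip whose members pass the CRC check."""
    if not is_zipfile(path):
        return False  # missing, truncated, or not a zip archive
    try:
        with ZipFile(path, "r") as zf:
            # testzip() returns the name of the first corrupt member, or None.
            return zf.testzip() is None
    except BadZipFile:
        return False
```

If `zip_is_intact(zip_path)` returns `False`, deleting the zip file and re-running the download cell is the likely remedy.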


## Verify extraction

In [5]:
# Check if the CSV file was extracted successfully
csv_file = DATA_RAW_DIR / "CPT_PremstallerGeotechnik_revised.csv"

if csv_file.exists():
    file_size_mb = csv_file.stat().st_size / (1024 * 1024)
    print("✓ Dataset file found:")
    print(f"  {csv_file}")
    print(f"  Size: {file_size_mb:.1f} MB")
    print("\n✅ Setup complete! You can now run the other notebooks.")
else:
    print("⚠ Warning: CSV file not found!")
    print(f"Expected location: {csv_file}")
    print("\nFiles in data/raw/:")
    for f in DATA_RAW_DIR.iterdir():
        print(f"  - {f.name}")

✓ Dataset file found:
  C:\Users\TFH\git_projects\course-machine-learning-for-geotechnics-intro\data\raw\CPT_PremstallerGeotechnik_revised.csv
  Size: 328.3 MB

✅ Setup complete! You can now run the other notebooks.
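The check above only confirms that the CSV exists. A slightly stricter sketch compares the size of the extracted file with the uncompressed size recorded in the archive, which would catch an extraction that stopped partway. `extracted_matches_archive` is a hypothetical helper; it raises `KeyError` if the named member is not in the archive.

```python
from pathlib import Path
from zipfile import ZipFile


def extracted_matches_archive(zip_path: Path, member: str, out_dir: Path) -> bool:
    """Compare an extracted file's size with the size stored in the zip (hypothetical helper)."""
    with ZipFile(zip_path, "r") as zf:
        # file_size is the uncompressed size of the member, in bytes.
        expected = zf.getinfo(member).file_size
    extracted = Path(out_dir) / member
    return extracted.exists() and extracted.stat().st_size == expected
```

For this dataset the call would be `extracted_matches_archive(zip_path, "CPT_PremstallerGeotechnik_revised.csv", DATA_RAW_DIR)`.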


## Next Steps

You can now:
1. Close this notebook
2. Open `eda_cpt.ipynb` to start exploring the data
3. Continue with the other course notebooks in sequence