## 📁 Understanding the Project Structure

This notebook is part of a larger project that contains **many datasets** stored in a shared `data/` folder.

Because this notebook lives in the `lessons/` folder, we cannot assume that the current working directory is the project root. Instead, we programmatically locate the project root and then build paths from there.

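This approach prevents common file-not-found errors. Note that `Path.cwd().parents[0]` assumes the notebook runs from a folder directly under the project root, such as `lessons/`; if a notebook moves to a different depth, the path needs to be adjusted.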
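
One way to make the root lookup less sensitive to folder depth is to search upward for the `data/` folder instead of counting parent levels. The sketch below is an illustration, not the notebook's own method; the helper name `find_project_root` and its `marker` parameter are hypothetical.

```python
from pathlib import Path


def find_project_root(start: Path, marker: str = "data") -> Path:
    """Walk upward from `start` and return the first folder that contains `marker`.

    Hypothetical helper: it assumes the project root is the nearest ancestor
    (or `start` itself) that holds a `data/` directory.
    """
    start = start.resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).is_dir():
            return candidate
    raise FileNotFoundError(f"No '{marker}/' folder found at or above {start}")
```

In a notebook, this could be called as `REPO_ROOT = find_project_root(Path.cwd())`, followed by `DATA_DIR = REPO_ROOT / "data"`.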







## 🗂️ Step 4: View a Catalog of Available Datasets

After loading the data, we create a summary table that lists:

* Dataset name
* Number of rows
* Number of columns

This table functions as a lightweight **data catalog** and helps you quickly understand what data is available for analysis.
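
A minimal sketch of such a catalog is shown here. It assumes a dictionary of DataFrames shaped like `dfs`; the function name `build_catalog`, the output column names, and the toy data are illustrative assumptions rather than the notebook's own code.

```python
import pandas as pd


def build_catalog(dfs: dict) -> pd.DataFrame:
    """Summarize a {name: DataFrame} dictionary as one row per dataset."""
    rows = [
        {"dataset": name, "n_rows": df.shape[0], "n_cols": df.shape[1]}
        for name, df in dfs.items()
    ]
    return pd.DataFrame(rows, columns=["dataset", "n_rows", "n_cols"])


# Toy stand-in for the real `dfs` dictionary (values are made up).
example_dfs = {
    "patients": pd.DataFrame({"id": [1, 2, 3], "age": [34, 51, 47]}),
    "medications": pd.DataFrame({"id": [1, 2], "drug": ["a", "b"], "dose": [5, 10]}),
}

catalog = build_catalog(example_dfs)
print(catalog)
```

With the real data, `build_catalog(dfs).sort_values("n_rows", ascending=False)` would list the largest datasets first.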

---

## 👀 Step 5: Preview Individual Datasets

To explore a dataset, we can display:

* The first few rows
* Column names
* Dataset dimensions
* Missing values

This is an important step in **exploratory data analysis (EDA)** and helps you understand:

* What each variable represents
* Which variables may require cleaning or transformation
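
The four preview items listed above can be gathered in one small helper. This is a sketch under stated assumptions: `preview` is a hypothetical function name and the toy DataFrame is made up for illustration.

```python
import pandas as pd


def preview(df: pd.DataFrame, n: int = 5) -> dict:
    """Collect the basic EDA views of a DataFrame in one dictionary."""
    return {
        "head": df.head(n),            # first n rows
        "columns": list(df.columns),   # column names
        "shape": df.shape,             # (rows, columns)
        "missing": df.isna().sum(),    # missing values per column
    }


# Toy data with one missing value in each column.
toy = pd.DataFrame({"age": [34, None, 47], "sex": ["F", "M", None]})

summary = preview(toy, n=2)
for label, value in summary.items():
    print(f"--- {label} ---")
    print(value)
```

On the real data, the same call would look like `preview(dfs["cardio"])`.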

---

## ⚠️ Step 6: Understanding Missing Values

Some datasets intentionally contain missing values.

These are not errors; they represent real-world data challenges that analysts must address.

In this course, you will:

* Identify missing values
* Decide how to handle them (drop, impute, or analyze separately)
* Learn how missing data affects analysis and modeling
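
The three handling options can be sketched on a small made-up table. The column names and values below are hypothetical, and median imputation is only one of several reasonable choices.

```python
import pandas as pd

# Toy data: one missing value in each column (values are made up).
toy = pd.DataFrame({
    "age": [34.0, None, 47.0, 51.0],
    "cholesterol": [1.0, 2.0, None, 3.0],
})

# Identify: count missing values per column.
missing_counts = toy.isna().sum()

# Option 1, drop: keep only rows with no missing values.
dropped = toy.dropna()

# Option 2, impute: fill each numeric column with its own median.
imputed = toy.fillna(toy.median(numeric_only=True))

# Option 3, analyze separately: keep the rows and add a flag column.
flagged = toy.assign(age_missing=toy["age"].isna())

print(missing_counts)
print(len(toy), "rows before dropping,", len(dropped), "after")
```

Dropping is the simplest option but discards rows; imputing keeps every row at the cost of inventing values; flagging lets you compare records with and without the missing field.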

---

## 🧠 Why This Workflow Matters

This notebook demonstrates best practices used by professional data scientists:

* Writing **reusable code**
* Avoiding hard-coded file paths
* Verifying data before analysis
* Separating data loading from modeling

Mastering this workflow will make your future analyses:

* More reliable
* Easier to debug
* Easier to scale to larger projects
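
As one illustration of reusable loading code kept separate from modeling, the loading loop can be wrapped in a function that takes the data folder as an argument. The name `load_csv_folder` is hypothetical; this is a sketch, not the course's official helper.

```python
from pathlib import Path

import pandas as pd


def load_csv_folder(data_dir: Path) -> dict:
    """Load every CSV in `data_dir` into a {name: DataFrame} dictionary."""
    if not data_dir.is_dir():
        # Fail early with a clear message instead of returning an empty dict.
        raise FileNotFoundError(f"Data directory not found: {data_dir}")

    dfs = {}
    for path in sorted(data_dir.glob("*.csv")):
        # Hyphens become underscores so names work as Python identifiers.
        name = path.stem.replace("-", "_")
        dfs[name] = pd.read_csv(path)
    return dfs
```

Because the folder is passed in rather than hard-coded, the same function can be reused from any notebook, for example `dfs = load_csv_folder(DATA_DIR)`.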

---

## 🧪 What Comes Next

With all datasets loaded, you are now ready to:

* Explore relationships between variables
* Join datasets together
* Clean and transform data
* Build statistical or machine learning models

Each future lesson will build on the foundation established here.


## 📍 Step 1: Locate the Data Directory

In this step, we:

* Identify the project's root directory
* Build a reliable path to the `data/` folder
* Verify that the folder exists

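```python
from pathlib import Path
# Assumes the notebook runs from a folder one level below the project root
REPO_ROOT = Path.cwd().parents[0]
DATA_DIR = REPO_ROOT / "data"
```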

If the data directory exists, we know the notebook is correctly configured.



```python
from pathlib import Path
import pandas as pd

DATA_DIR = Path.cwd().parent / "data"

print("Working directory:", Path.cwd())
print("Data directory:", DATA_DIR)
print("Data directory exists:", DATA_DIR.exists())
```

## 📦 Step 2: Load All CSV Files Automatically

Rather than loading each dataset manually, we scan the `data/` folder and load **every CSV file** into pandas.

Each dataset is stored in a dictionary called `dfs`, where:

* The **key** is the dataset name
* The **value** is a pandas DataFrame

This allows us to work with many datasets using a consistent and scalable approach.

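Empty cells and markers such as `"NA"` are converted to `NaN` on load. pandas already treats both as missing by default; the explicit `na_values` list documents that intent, and `keep_default_na=True` keeps the rest of pandas' default missing-value markers active.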



```python
from pathlib import Path
import pandas as pd

REPO_ROOT = Path.cwd().parents[0]
DATA_DIR = REPO_ROOT / "data"

# Sort the file list so the load order is the same on every run
csv_files = sorted(DATA_DIR.glob("*.csv"))

dfs = {}

for f in csv_files:
    # Use the file name without its extension as the key, with hyphens replaced
    name = f.stem.replace("-", "_")
    dfs[name] = pd.read_csv(
        f,
        na_values=["NA", ""],
        keep_default_na=True
    )
    print(f"Loaded {name:30s} shape={dfs[name].shape}")
```


```
Loaded atbats                         shape=(528, 23)
Loaded batter_player                  shape=(2105, 11)
Loaded bp_readings                    shape=(9, 7)
Loaded cardio                         shape=(70000, 30)
Loaded cardio1                        shape=(70000, 24)
Loaded cardio10                       shape=(67066, 27)
Loaded game_player                    shape=(2105, 11)
Loaded games                          shape=(100, 18)
Loaded medications                    shape=(5, 7)
Loaded patients                       shape=(5, 8)
Loaded pitches                        shape=(451, 32)
Loaded pitches_w_inplay               shape=(29616, 33)
Loaded pitches_w_inplay_nb            shape=(29616, 33)
Loaded players                        shape=(825, 13)
Loaded players100                     shape=(911, 5)
Loaded players_cuba                   shape=(17, 6)
Loaded players_cuba_no_index          shape=(17, 5)
Loaded signatures_simulated_dataset   shape=(1001, 28)
Loaded teams                 
```

## ✅ Step 3: Confirm Successful Data Loading

As each file is loaded, the notebook prints:

* The dataset name
* The number of rows
* The number of columns

Example output:

```
Loaded cardio10 shape=(67066, 27)
```

This acts as a **sanity check** to confirm that all files loaded correctly.

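If a file is missing from `data/`, no line is printed for it. If a file is malformed, `pd.read_csv` typically raises an error, or the printed shape looks wrong (for example, a single column).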
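
Reading the printed lines by eye works for a handful of files; for a longer list, the same check can be automated. The sketch below assumes you keep a list of dataset names you expect to find. The helper `check_loaded` and the toy dictionary are hypothetical.

```python
import pandas as pd


def check_loaded(dfs: dict, expected: list) -> dict:
    """Report expected datasets that are absent and loaded datasets with no rows."""
    missing = [name for name in expected if name not in dfs]
    empty = [name for name, df in dfs.items() if len(df) == 0]
    return {"missing": missing, "empty": empty}


# Toy stand-in for `dfs`: one normal table and one with zero rows.
example_dfs = {
    "patients": pd.DataFrame({"id": [1, 2]}),
    "empty_example": pd.DataFrame({"id": []}),
}

report = check_loaded(example_dfs, expected=["patients", "medications"])
print(report)
```

An empty `missing` list and an empty `empty` list would indicate that every expected file was found and contains at least one row.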

