## 📁 Understanding the Project Structure

This notebook is part of a larger project that contains **many datasets** stored in a shared `data/` folder.

Because this notebook lives in the `lessons/` folder, we cannot assume that the current working directory is the project root. Instead, we programmatically locate the project root and then build paths from there.

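This approach avoids hard-coded absolute paths, a common source of file-not-found errors, and keeps working when the project is cloned to a different machine or location, provided the notebook stays one level below the project root.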
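One way to make the root lookup independent of folder depth is to walk upward from the working directory until a folder containing `data/` is found. The sketch below is an illustration, not the method this notebook uses; the helper name `find_repo_root` and the choice of `data` as the marker folder are assumptions.

```python
from pathlib import Path


def find_repo_root(start: Path, marker: str = "data") -> Path:
    """Return `start` or its nearest ancestor that contains a `marker` folder.

    Hypothetical helper: the notebook itself simply uses Path.cwd().parent.
    """
    start = start.resolve()
    # Check the starting folder first, then each parent in turn
    for candidate in (start, *start.parents):
        if (candidate / marker).is_dir():
            return candidate
    raise FileNotFoundError(f"No '{marker}/' folder found at or above {start}")


# Usage inside a notebook:
# DATA_DIR = find_repo_root(Path.cwd()) / "data"
```

With a helper like this, a notebook nested several folders deep would still resolve the same `data/` directory.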





## 📍 Step 1: Locate the Data Directory

In this step, we:

* Identify the project's root directory
* Build a reliable path to the `data/` folder
* Verify that the folder exists

```python
from pathlib import Path

REPO_ROOT = Path.cwd().parent  # the notebook runs from lessons/, one level below the root
DATA_DIR = REPO_ROOT / "data"
```

If the data directory exists, we know the notebook is correctly configured.
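When the directory does not exist, it helps to stop immediately with a clear message rather than let a later cell fail. A minimal sketch of such a guard follows; the helper name `require_dir` is hypothetical and not part of the notebook.

```python
from pathlib import Path


def require_dir(path: Path) -> Path:
    """Return `path` unchanged if it is an existing directory, otherwise raise."""
    if not path.is_dir():
        raise FileNotFoundError(
            f"Expected a directory at {path}. "
            "Check that the notebook is running from the lessons/ folder."
        )
    return path


# Usage inside a notebook:
# DATA_DIR = require_dir(Path.cwd().parent / "data")
```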



In [3]:
from pathlib import Path
import pandas as pd

DATA_DIR = Path.cwd().parent / "data"

print("Working directory:", Path.cwd())
print("Data directory:", DATA_DIR)
print("Data directory exists:", DATA_DIR.exists())


Working directory: /workspaces/Data_Science-notebooks/lessons
Data directory: /workspaces/Data_Science-notebooks/data
Data directory exists: True


## 📦 Step 2: Load All CSV Files Automatically

Rather than loading each dataset manually, we scan the `data/` folder and load **every CSV file** into pandas.

Each dataset is stored in a dictionary called `dfs`, where:

* The **key** is the dataset name
* The **value** is a pandas DataFrame

This allows us to work with many datasets using a consistent and scalable approach.

Missing values such as `"NA"` or empty cells are automatically converted to `NaN`.
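To see the conversion in isolation, the sketch below parses a small in-memory CSV instead of a course file. The column names and values are made up for illustration. Note that `"NA"` and the empty string are already among pandas' default missing-value markers, so listing them in `na_values` mainly documents the intent.

```python
import io

import pandas as pd

# Toy CSV text (not one of the course datasets)
csv_text = "id,score,group\n1,10,A\n2,NA,B\n3,,NA\n"

demo = pd.read_csv(io.StringIO(csv_text), na_values=["NA", ""])

# Count missing values per column: "NA" and the empty cell both become NaN
print(demo.isna().sum())
```

Here `score` ends up with two missing values (one `"NA"`, one empty cell) and `group` with one.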



In [None]:
from pathlib import Path
import pandas as pd

REPO_ROOT = Path.cwd().parents[0]
DATA_DIR = REPO_ROOT / "data"

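# Collect every CSV in data/, sorted so the load order is reproducible
csv_files = sorted(DATA_DIR.glob("*.csv"))

dfs = {}

for f in csv_files:
    # Use the file name without its extension as the key; hyphens become underscores
    name = f.stem.replace("-", "_")
    dfs[name] = pd.read_csv(
        f,
        na_values=["NA", ""],   # extra strings to treat as missing
        keep_default_na=True    # also keep pandas' default missing-value markers
    )
    print(f"Loaded {name:30s} shape={dfs[name].shape}")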


Loaded atbats                         shape=(528, 23)
Loaded batter_player                  shape=(2105, 11)
Loaded bp_readings                    shape=(9, 7)
Loaded cardio                         shape=(70000, 30)
Loaded cardio1                        shape=(70000, 24)
Loaded cardio10                       shape=(67066, 27)
Loaded game_player                    shape=(2105, 11)
Loaded games                          shape=(100, 18)
Loaded medications                    shape=(5, 7)
Loaded patients                       shape=(5, 8)
Loaded pitches                        shape=(451, 32)
Loaded pitches_w_inplay               shape=(29616, 33)
Loaded pitches_w_inplay_nb            shape=(29616, 33)
Loaded players                        shape=(825, 13)
Loaded players100                     shape=(911, 5)
Loaded players_cuba                   shape=(17, 6)
Loaded players_cuba_no_index          shape=(17, 5)
Loaded signatures_simulated_dataset   shape=(1001, 28)
Loaded teams                 

## ✅ Step 3: Confirm Successful Data Loading

As each file is loaded, the notebook prints:

* The dataset name
* The number of rows
* The number of columns

Example output:

```
Loaded cardio10 shape=(67066, 27)
```

This acts as a **sanity check** to confirm that all files loaded correctly.

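If a dataset is missing from this list, or its shape looks wrong (for example, a single column where many were expected), the problem is visible before any analysis begins.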
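The same check can be made explicit by comparing the loaded names against a list of datasets you expect. The sketch below runs on a small stand-in for `dfs`; the helper name `check_loaded` and the `expected` list are assumptions for illustration.

```python
import pandas as pd


def check_loaded(dfs: dict, expected: list) -> dict:
    """Report expected datasets that are absent and loaded datasets with no rows."""
    missing = sorted(set(expected) - set(dfs))
    empty = sorted(name for name, df in dfs.items() if df.empty)
    return {"missing": missing, "empty": empty}


# Stand-in for the real `dfs` dictionary (toy data, not the course files)
demo_dfs = {
    "patients": pd.DataFrame({"id": [1, 2]}),
    "medications": pd.DataFrame({"id": []}),
}

report = check_loaded(demo_dfs, expected=["patients", "medications", "bp_readings"])
print(report)
```

In this toy case the report lists `bp_readings` as missing and `medications` as empty.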



In [None]:
catalog = (
    pd.DataFrame(
        [{"dataset": k, "rows": v.shape[0], "cols": v.shape[1]} for k, v in dfs.items()]
    )
    .sort_values(["rows"], ascending=False)
    .reset_index(drop=True)
)

catalog



|   | dataset | rows | cols |
|---|---|---|---|
| 0 | cardio | 70000 | 30 |
| 1 | cardio1 | 70000 | 24 |
| 2 | cardio10 | 67066 | 27 |
| 3 | pitches_w_inplay_nb | 29616 | 33 |
| 4 | pitches_w_inplay | 29616 | 33 |
| 5 | batter_player | 2105 | 11 |
| 6 | game_player | 2105 | 11 |
| 7 | signatures_simulated_dataset | 1001 | 28 |
| 8 | players100 | 911 | 5 |
| 9 | players | 825 | 13 |


## 🗂️ Step 4: View a Catalog of Available Datasets

After loading the data, we create a summary table that lists:

* Dataset name
* Number of rows
* Number of columns

This table functions as a lightweight **data catalog** and helps you quickly understand what data is available for analysis.
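A catalog like this can carry more than row and column counts. As a sketch, the version below also counts missing cells per dataset. It runs on a toy `demo_dfs` dictionary, and the `missing_cells` column is an addition that does not appear in the notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real `dfs` dictionary
demo_dfs = {
    "patients": pd.DataFrame({"id": [1, 2, 3], "age": [34, np.nan, 51]}),
    "bp_readings": pd.DataFrame({"id": [1, 1], "sbp": [120, 118]}),
}

catalog_demo = (
    pd.DataFrame(
        [
            {
                "dataset": name,
                "rows": df.shape[0],
                "cols": df.shape[1],
                # Total number of NaN cells across all columns
                "missing_cells": int(df.isna().sum().sum()),
            }
            for name, df in demo_dfs.items()
        ]
    )
    .sort_values("dataset")  # alphabetical
    .reset_index(drop=True)
)

print(catalog_demo)
```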



In [None]:
import pandas as pd

catalog = (
    pd.DataFrame(
        [{"dataset": name, "rows": df.shape[0], "cols": df.shape[1]}
         for name, df in dfs.items()]
    )
    .sort_values("dataset")  # alphabetical
    .reset_index(drop=True)
)

catalog


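|   | dataset | rows | cols |
|---|---|---|---|
| 0 | atbats | 528 | 23 |
| 1 | batter_player | 2105 | 11 |
| 2 | bp_readings | 9 | 7 |
| 3 | cardio | 70000 | 30 |
| 4 | cardio1 | 70000 | 24 |
| 5 | cardio10 | 67066 | 27 |
| 6 | game_player | 2105 | 11 |
| 7 | games | 100 | 18 |
| 8 | medications | 5 | 7 |
| 9 | patients | 5 | 8 |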


## 👀 Step 5: Preview Individual Datasets

To explore a dataset, we can display:

* The first few rows
* Column names
* Dataset dimensions
* Missing values

This is an important step in **exploratory data analysis (EDA)** and helps you understand:

* What each variable represents
* Which variables may require cleaning or transformation
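The four checks listed above can be bundled into one small function so that every dataset is previewed the same way. This is a sketch on a toy DataFrame; the helper name `preview` and the toy columns are assumptions.

```python
import numpy as np
import pandas as pd


def preview(df: pd.DataFrame, n: int = 5) -> pd.DataFrame:
    """Print the shape and first rows, then return a per-column summary."""
    print(f"shape: {df.shape}")
    print(df.head(n))
    return pd.DataFrame(
        {
            "dtype": df.dtypes.astype(str),
            "missing": df.isna().sum(),
            "unique": df.nunique(),  # NaN is not counted as a distinct value
        }
    )


# Toy data, not one of the course datasets
toy = pd.DataFrame({"sbp": [110, 120, np.nan], "sex": [0, 1, 1]})
summary = preview(toy)
print(summary)
```

In the notebook, the same function could be called on any entry of `dfs`, for example `preview(dfs["cardio10"])`.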



In [6]:
from pathlib import Path
import pandas as pd

# Locate the data directory
DATA_DIR = Path.cwd().parent / "data"

# Load one dataset directly
df = pd.read_csv(DATA_DIR / "cardio10.csv", na_values=["NA", ""])

# Preview the first 5 rows
df.head()



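|   | id | age_days | sex | height | weight | sbp | diastolic | cholesterol | gluc | smoking | ... | dm | egfr | bptreat | statin | uacr | sdi | hba1c | prevent_full_10yr_CVD | prevent_full_10yr_ASCVD | prevent_full_10yr_HF |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 57792 | 14455 | 0 | 164 | 62 | 110 | 70 | 0 | 0 | 0 | ... | 0 | 88 | 1 | 1 | 0.5 | 2 | 6.5 | 0.19941 | 0.657339 | 0.19941 |
| 1 | 17274 | 15452 | 0 | 156 | 48 | 110 | 70 | 0 | 0 | 0 | ... | 0 | 80 | 0 | 1 | 1.0 | 1 | 5.4 | 0.20853 | 0.762402 | 0.20853 |
| 2 | 59931 | 15235 | 1 | 168 | 72 | 110 | 80 | 1 | 0 | 0 | ... | 0 | 109 | 0 | 1 | 6.7 | 1 | 5.6 | 0.212875 | 0.828295 | 0.212875 |
| 3 | 72971 | 15370 | 1 | 178 | 78 | 110 | 70 | 0 | 0 | 0 | ... | 0 | 100 | 0 | 1 | 4.8 | 1 | 5.6 | 0.215769 | 0.540801 | 0.215769 |
| 4 | 33519 | 15184 | 0 | 153 | 63 | 110 | 70 | 0 | 0 | 0 | ... | 0 | 81 | 0 | 0 | 2.2 | 4 | 5.1 | 0.217907 | 0.323234 | 0.217907 |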


In [9]:
df.shape


(67066, 27)

In [10]:
df.columns


Index(['id', 'age_days', 'sex', 'height', 'weight', 'sbp', 'diastolic',
       'cholesterol', 'gluc', 'smoking', 'alco', 'active', 'cardio', 'age',
       'bmi', 'tc', 'hdl', 'dm', 'egfr', 'bptreat', 'statin', 'uacr', 'sdi',
       'hba1c', 'prevent_full_10yr_CVD', 'prevent_full_10yr_ASCVD',
       'prevent_full_10yr_HF'],
      dtype='str')

## ⚠️ Step 6: Understanding Missing Values

Some datasets intentionally contain missing values.

These are not errors; they represent real-world data challenges that analysts must address.

In this course, you will:

* Identify missing values
* Decide how to handle them (drop, impute, or analyze separately)
* Learn how missing data affects analysis and modeling
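The three handling options can be sketched on a toy DataFrame as follows. The data is made up, and median imputation is only one of several possible strategies, chosen here for brevity.

```python
import numpy as np
import pandas as pd

# Toy data with one missing lab value (not taken from the course files)
toy = pd.DataFrame({"id": [1, 2, 3, 4], "hba1c": [5.4, np.nan, 6.5, 5.6]})

# Option 1: drop rows where the value is missing
dropped = toy.dropna(subset=["hba1c"])

# Option 2: impute with the column median, keeping a flag for imputed rows
imputed = toy.assign(
    hba1c_missing=toy["hba1c"].isna(),
    hba1c=toy["hba1c"].fillna(toy["hba1c"].median()),
)

# Option 3: set the incomplete rows aside to analyze separately
missing_only = toy[toy["hba1c"].isna()]

print(len(dropped), len(imputed), len(missing_only))
```

Keeping the `hba1c_missing` flag makes it possible to check later whether imputed rows behave differently from observed ones.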

---

In [12]:
df.isna().sum()


id                         0
age_days                   0
sex                        0
height                     0
weight                     0
sbp                        0
diastolic                  0
cholesterol                0
gluc                       0
smoking                    0
alco                       0
active                     0
cardio                     0
age                        0
bmi                        0
tc                         0
hdl                        0
dm                         0
egfr                       0
bptreat                    0
statin                     0
uacr                       0
sdi                        0
hba1c                      0
prevent_full_10yr_CVD      0
prevent_full_10yr_ASCVD    0
prevent_full_10yr_HF       0
dtype: int64

## 🧠 Why This Workflow Matters

This notebook demonstrates best practices used by professional data scientists:

* Writing **reusable code**
* Avoiding hard-coded file paths
* Verifying data before analysis
* Separating data loading from modeling

Mastering this workflow will make your future analyses:

* More reliable
* Easier to debug
* Easier to scale to larger projects

---

## 🧪 What Comes Next

With all datasets loaded, you are now ready to:

* Explore relationships between variables
* Join datasets together
* Clean and transform data
* Build statistical or machine learning models

Each future lesson will build on the foundation established here.