# 🏋️‍♀️ Exercise Set: Loading, Exploring, and Cleaning Data

**Dataset:** `trainingham.csv` (a modified version of the Framingham Heart Study dataset)

This exercise set practices the skills from Lectures 2.1 to 2.3:
- Loading and exploring data
- Indexing and slicing to explore subgroups
- Cleaning and coercing variable types

## 1. Load the Dataset

Import pandas and load the `trainingham.csv` file, then inspect the structure of the dataset. How many variables (columns) and observations (rows) does it contain?

**Tasks:**
- Import `pandas` as `pd`
- Load the dataset using `pd.read_csv()`
- Use `.info()` and `.head()` to preview the dataset


In [1]:
import pandas as pd

# Load the dataset
df = pd.read_csv('../Data/trainingham.csv')

# Inspect the dataset
df.info()
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12208 entries, 0 to 12207
Data columns (total 40 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   RANDID      12208 non-null  int64  
 1   SEX         12208 non-null  float64
 2   TOTCHOL     11192 non-null  float64
 3   AGE         11600 non-null  float64
 4   SYSBP       12208 non-null  float64
 5   DIABP       12208 non-null  float64
 6   CURSMOKE    12208 non-null  int64  
 7   CIGPDAY     12127 non-null  float64
 8   BMI         12155 non-null  float64
 9   DIABETES    12208 non-null  int64  
 10  BPMEDS      11590 non-null  float64
 11  HEARTRTE    12202 non-null  float64
 12  GLUCOSE     10684 non-null  float64
 13  educ        11903 non-null  object 
 14  PREVCHD     12208 non-null  int64  
 15  PREVAP      12208 non-null  int64  
 16  PREVMI      12208 non-null  int64  
 17  PREVSTRK    12208 non-null  int64  
 18  PREVHYP     12208 non-null  int64  
 19  TIME        12208 non-nul

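|   | RANDID | SEX | TOTCHOL | AGE | SYSBP | DIABP | CURSMOKE | CIGPDAY | BMI | DIABETES | ... | HYPERTEN | TIMEAP | TIMEMI | TIMEMIFC | TIMECHD | TIMESTRK | TIMECVD | TIMEDTH | TIMEHYP | CIGSPERDAY |
|---|--------|-----|---------|-----|-------|-------|----------|---------|-----|----------|-----|----------|--------|--------|----------|---------|----------|---------|---------|---------|------------|
| 0 | 2448 | 1.0 | 195.0 | 39.0 | 106.0 | 70.0 | 0 | 0.0 | 26.97 | 0 | ... | 0 | 8766 | 6438 | 6438 | 6438 | 8766 | 6438 | 8766 | 8766 | NaN |
| 1 | 2448 | 1.0 | 209.0 | 52.0 | 121.0 | 66.0 | 0 | 0.0 | NaN | 0 | ... | 0 | 8766 | 6438 | 6438 | 6438 | 8766 | 6438 | 8766 | 8766 | NaN |
| 2 | 6238 | 2.0 | 250.0 | 46.0 | 121.0 | 81.0 | 0 | 0.0 | 28.73 | 0 | ... | 0 | 8766 | 8766 | 8766 | 8766 | 8766 | 8766 | 8766 | 8766 | NaN |
| 3 | 6238 | 2.0 | 260.0 | 52.0 | 105.0 | 69.5 | 0 | 0.0 | 29.43 | 0 | ... | 0 | 8766 | 8766 | 8766 | 8766 | 8766 | 8766 | 8766 | 8766 | NaN |
| 4 | 6238 | 2.0 | 237.0 | 58.0 | 108.0 | 66.0 | 0 | 0.0 | 28.5 | 0 | ... | 0 | 8766 | 8766 | 8766 | 8766 | 8766 | 8766 | 8766 | 8766 | NaN |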


## 2. Explore the Dataset

**Tasks:**
- Check the number of rows and columns
- Use `.describe()` to get a summary of numerical columns
- Count the unique values of the `SEX` and `CURSMOKE` variables
- Find how many rows have missing values
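
As a hint, the tasks above can be sketched on a small made-up DataFrame. This is only an illustration under stated assumptions: the name `toy` and all values are hypothetical, and only the column names are borrowed from the dataset. Swap in `df` to work on the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: column names mirror the exercise, values are made up.
toy = pd.DataFrame({
    "SEX": [1.0, 2.0, 2.0, 1.0],
    "CURSMOKE": [0, 1, 1, 0],
    "BMI": [26.9, np.nan, 28.7, 29.4],
    "GLUCOSE": [77.0, 80.0, np.nan, np.nan],
})

# .shape is a (rows, columns) tuple
n_rows, n_cols = toy.shape

# Summary statistics (count, mean, std, quartiles) for numeric columns
summary = toy.describe()

# Frequency of each distinct value; dropna=False also counts missing values
sex_counts = toy["SEX"].value_counts(dropna=False)

# Rows containing at least one missing value
rows_with_missing = int(toy.isna().any(axis=1).sum())

# Missing values per column, for comparison
missing_per_column = toy.isna().sum()

print(n_rows, n_cols)
print(sex_counts)
print(rows_with_missing)
```

Note the difference between `isna().any(axis=1).sum()`, which counts rows, and `isna().sum()`, which counts missing cells per column.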

In [None]:
# Your code here

## 3. Investigate Subgroups

**Tasks:**
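- Select only female participants (`SEX == 2` in the standard Framingham coding, where 1 = male and 2 = female; confirm the codes with `.value_counts()` first)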
- Calculate the average `AGE` of male and female participants separately
- Create a new DataFrame with only participants over age 60 who are current smokers
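
One possible approach to these tasks uses boolean masks and `groupby`. The sketch below runs on a hypothetical DataFrame called `toy` with made-up values, and it assumes that `SEX == 2` codes female participants, which should be verified against the real data.

```python
import pandas as pd

# Hypothetical toy data: column names mirror the exercise, values are made up.
toy = pd.DataFrame({
    "SEX": [1, 2, 2, 1, 2],
    "AGE": [39.0, 65.0, 52.0, 70.0, 61.0],
    "CURSMOKE": [0, 1, 0, 1, 0],
})

# Boolean mask: keep rows where the condition is True
# (assumption: 2 codes female participants)
women = toy[toy["SEX"] == 2]

# Mean age within each SEX code
mean_age_by_sex = toy.groupby("SEX")["AGE"].mean()

# Combine conditions with & and wrap each one in parentheses;
# .copy() makes the result an independent DataFrame
older_smokers = toy[(toy["AGE"] > 60) & (toy["CURSMOKE"] == 1)].copy()

print(len(women))
print(mean_age_by_sex)
print(older_smokers)
```

The parentheses around each condition matter because `&` binds more tightly than comparison operators in Python.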

In [None]:
# Your code here

## 4. Clean and Coerce Variables

Some columns were imported with incorrect types. Fix them!

**Tasks:**
- Convert `SEX` and `CURSMOKE` to `category`
- Convert `AGE` to numeric if needed (handle errors with `errors='coerce'`)
- Check and convert any date variables to datetime (if applicable)
- Create a new column `AGE_GROUP` with values: `'Under 40'`, `'40-60'`, `'Over 60'`
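
The conversions listed above might look roughly like the following. This is a minimal sketch on a hypothetical DataFrame `toy`: the values are made up, `VISIT_DATE` is an invented column used only to illustrate date parsing, and the boundary choice (40 and 60 both fall in `'40-60'`) is an assumption about how the labels should be read.

```python
import pandas as pd

# Hypothetical toy data; VISIT_DATE is an invented column for illustration.
toy = pd.DataFrame({
    "SEX": [1, 2, 2, 1],
    "AGE": ["39", "52", "unknown", "67"],
    "VISIT_DATE": ["1956-03-01", "1962-07-15", "not recorded", "1968-11-30"],
})

# Coded variables become categoricals
toy["SEX"] = toy["SEX"].astype("category")

# Unparseable entries become NaN instead of raising an error
toy["AGE"] = pd.to_numeric(toy["AGE"], errors="coerce")

# Unparseable dates become NaT
toy["VISIT_DATE"] = pd.to_datetime(
    toy["VISIT_DATE"], format="%Y-%m-%d", errors="coerce"
)

# Build AGE_GROUP with explicit conditions; rows with missing AGE stay missing
toy["AGE_GROUP"] = pd.Series(index=toy.index, dtype="object")
toy.loc[toy["AGE"] < 40, "AGE_GROUP"] = "Under 40"
toy.loc[toy["AGE"].between(40, 60), "AGE_GROUP"] = "40-60"  # inclusive on both ends
toy.loc[toy["AGE"] > 60, "AGE_GROUP"] = "Over 60"

# Store the groups as an ordered categorical so they sort naturally
toy["AGE_GROUP"] = pd.Categorical(
    toy["AGE_GROUP"],
    categories=["Under 40", "40-60", "Over 60"],
    ordered=True,
)

print(toy.dtypes)
print(toy)
```

`pd.cut` is a common alternative for binning, but its intervals are closed on only one side, so the explicit conditions make the boundary handling easier to see.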

In [None]:
# Your code here

## 5. Summary Questions

- How many missing values are left in the dataset?
- What proportion of participants are current smokers in each `AGE_GROUP`?
- What would you do next in a real analysis project?
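
For the first two questions, a sketch along these lines may help. It uses a hypothetical DataFrame `toy` with made-up values and assumes `CURSMOKE` is stored as a numeric 0/1 indicator. If it has been converted to `category`, it would need to be converted back (for example with `.astype(int)`) before averaging.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: column names mirror the exercise, values are made up.
toy = pd.DataFrame({
    "AGE_GROUP": ["Under 40", "40-60", "40-60", "Over 60", "Over 60", "Over 60"],
    "CURSMOKE": [1, 1, 0, 0, 0, 1],
    "BMI": [26.9, np.nan, 28.7, 29.4, np.nan, 25.1],
})

# Total number of missing cells across the whole DataFrame
total_missing = int(toy.isna().sum().sum())

# The mean of a 0/1 indicator is the proportion of ones in each group
smoker_share = toy.groupby("AGE_GROUP")["CURSMOKE"].mean()

print(total_missing)
print(smoker_share)
```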