# Data Wrangling for Scientists with Pandas

*From library documentation*: **pandas** is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool,
built on top of the Python programming language.

In [10]:
# SETUP: Import libraries and configure display
import pandas as pd
import numpy as np
# import matplotlib.pyplot as plt  # will be used later
from datetime import datetime, timedelta
# import seaborn as sns    # will be used later

# Visualization settings (optional) - will be used later
# sns.set_theme(style="whitegrid")
# plt.rcParams["figure.figsize"] = (10,6)


# Configure pandas display options for better readability
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.precision', 2)

print("✓ Libraries loaded successfully!")
print(f"Pandas version: {pd.__version__}")

✓ Libraries loaded successfully!
Pandas version: 2.3.3


-----

We will use a slightly modified "Heart Disease" dataset published on Kaggle (https://www.kaggle.com/datasets/neurocipher/heartdisease/data).

Dataset description:

- Age: Age in years
- Sex: 0 = Female, 1 = Male
- Chest pain type: 1-4 (1: typical angina, 2: atypical angina, 3: non-anginal pain, 4: asymptomatic)
- BP: Resting blood pressure (mm Hg)
- Cholesterol: Serum cholesterol (mg/dL)
- FBS over 120: Fasting blood sugar > 120 mg/dL (0 = False, 1 = True)
- EKG results: Resting electrocardiographic results (0: normal, 1: ST-T wave abnormality, 2: left ventricular hypertrophy)
- Max HR: Maximum heart rate 
- Exercise angina: Exercise-induced angina (0 = False, 1 = True)
- ST depression: ST depression induced by exercise relative to rest
- Slope of ST: Slope of the peak exercise ST segment (1: upsloping, 2: flat, 3: downsloping)
- Number of vessels fluro: Number of major vessels colored by fluoroscopy (0-3)
- Thallium: Thallium stress test result (3: normal, 6: fixed defect, 7: reversible defect)
- Heart Disease: Presence or Absence

In [11]:
# Input File path (origin: www.kaggle.com/)
input_file = "https://rcs.bu.edu/examples/python/DataAnalysis/Heart_Disease_Prediction.csv"

# Read the full dataset (270 records)
df = pd.read_csv(input_file)
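
When a file is large, `read_csv` can load only its first rows through the `nrows` argument. The following is a minimal sketch; it assumes a small inline CSV with made-up values in place of the real URL so that it runs without network access.

```python
import io
import pandas as pd

# Hypothetical inline CSV standing in for the real file
csv_text = """Age,Sex,BP
70,1,130
67,0,115
57,1,124
64,1,128
"""

# nrows limits how many data rows are parsed (the header line is not counted)
preview = pd.read_csv(io.StringIO(csv_text), nrows=2)
print(preview.shape)
```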



---

## Data Exploration

The pandas `head()` method returns the **first n rows** of an object based on position (5 rows by default). It is particularly useful for a quick initial inspection of a dataset's structure and contents.

For a negative value of n, the function returns all rows except the last |n| rows.

In [13]:
# Display first 5 records (default)
df.head()

,Age,Sex,Chest pain type,BP,Cholesterol,FBS over 120,EKG results,Max HR,Exercise angina,ST depression,Slope of ST,Number of vessels fluro,Thallium,Heart Disease
0,70.0,1,4.0,130.0,322,0,2,109,0,2.4,2,3,3,Presence
1,67.0,0,3.0,115.0,564,0,2,160,0,1.6,2,0,7,Absence
2,57.0,1,2.0,124.0,261,0,0,141,0,0.3,1,0,7,Presence
3,64.0,1,4.0,128.0,263,0,0,105,1,0.2,2,1,7,Absence
4,74.0,0,2.0,120.0,269,0,2,121,1,0.2,1,1,3,Absence
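
The effect of a positive versus a negative `n` can be checked on a small frame. This sketch uses a toy DataFrame with made-up values rather than the heart disease data.

```python
import pandas as pd

# Toy frame (hypothetical values) with six rows, index 0..5
toy = pd.DataFrame({"Age": [70, 67, 57, 64, 74, 65]})

first_two = toy.head(2)          # rows 0 and 1
all_but_last_two = toy.head(-2)  # rows 0 through 3

print(len(first_two), len(all_but_last_two))
```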


Similarly, the `.tail()` method displays the last 5 records by default. The number of records can be changed by passing an argument:

In [12]:
# Display the last 10 records
df.tail(10)

,Age,Sex,Chest pain type,BP,Cholesterol,FBS over 120,EKG results,Max HR,Exercise angina,ST depression,Slope of ST,Number of vessels fluro,Thallium,Heart Disease
260,58.0,0,3.0,120.0,340,0,0,172,0,0.0,1,0,3,Absence
261,60.0,1,4.0,130.0,206,0,2,132,1,2.4,2,2,7,Presence
262,58.0,1,2.0,120.0,284,0,2,160,0,1.8,2,0,3,Presence
263,49.0,1,2.0,130.0,266,0,0,171,0,0.6,1,0,3,Absence
264,48.0,1,2.0,110.0,229,0,0,168,0,1.0,3,0,7,Presence
265,52.0,1,3.0,172.0,199,1,0,162,0,0.5,1,0,7,Absence
266,44.0,1,2.0,120.0,263,0,0,173,0,0.0,1,0,7,Absence
267,56.0,0,2.0,140.0,294,0,2,153,0,1.3,2,0,3,Absence
268,57.0,1,4.0,140.0,192,0,0,148,0,0.4,2,0,6,Absence
269,67.0,1,4.0,160.0,286,0,2,108,1,1.5,2,3,3,Presence


The `pandas.DataFrame.describe()` method generates descriptive statistics that summarize the central tendency, dispersion, and shape of a dataset's distribution. By default, it provides a quick overview of *numerical data* in a DataFrame, but it can also be customized to analyze categorical data or all data types. 

In [14]:
df.describe()

,Age,Sex,Chest pain type,BP,Cholesterol,FBS over 120,EKG results,Max HR,Exercise angina,ST depression,Slope of ST,Number of vessels fluro,Thallium
count,269.0,270.0,269.0,266.0,270.0,270.0,270.0,270.0,270.0,270.0,270.0,270.0,270.0
mean,54.44,0.68,3.17,131.37,249.66,0.15,1.02,149.68,0.33,1.05,1.59,0.67,4.7
std,9.13,0.47,0.95,17.98,51.69,0.36,1.0,23.17,0.47,1.15,0.61,0.94,1.94
min,29.0,0.0,1.0,94.0,126.0,0.0,0.0,71.0,0.0,0.0,1.0,0.0,3.0
25%,48.0,0.0,3.0,120.0,213.0,0.0,0.0,133.0,0.0,0.0,1.0,0.0,3.0
50%,55.0,1.0,3.0,130.0,245.0,0.0,2.0,153.5,0.0,0.8,2.0,0.0,3.0
75%,61.0,1.0,4.0,140.0,280.0,0.0,2.0,166.0,1.0,1.6,2.0,1.0,7.0
max,77.0,1.0,4.0,200.0,564.0,1.0,2.0,202.0,1.0,6.2,3.0,3.0,7.0
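
The customization mentioned above is controlled by the `include` and `exclude` arguments of `describe()`. Below is a minimal sketch on a toy frame with made-up values; `exclude="number"` is used here as one way to summarize only the non-numeric columns.

```python
import pandas as pd

# Toy frame (hypothetical values): one numeric and one text column
toy = pd.DataFrame({
    "Age": [70, 67, 57, 64],
    "Heart Disease": ["Presence", "Absence", "Presence", "Absence"],
})

# Non-numeric columns only: count, unique, top, freq
cat_summary = toy.describe(exclude="number")

# Every column at once; statistics that do not apply are shown as NaN
full_summary = toy.describe(include="all")

print(cat_summary)
print(full_summary)
```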


In [15]:
# Get information about the DataFrame structure
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 270 entries, 0 to 269
Data columns (total 14 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Age                      269 non-null    float64
 1   Sex                      270 non-null    int64  
 2   Chest pain type          269 non-null    float64
 3   BP                       266 non-null    float64
 4   Cholesterol              270 non-null    int64  
 5   FBS over 120             270 non-null    int64  
 6   EKG results              270 non-null    int64  
 7   Max HR                   270 non-null    int64  
 8   Exercise angina          270 non-null    int64  
 9   ST depression            270 non-null    float64
 10  Slope of ST              270 non-null    int64  
 11  Number of vessels fluro  270 non-null    int64  
 12  Thallium                 270 non-null    int64  
 13  Heart Disease            270 non-null    object 
dtypes: float64(4), int64(9), o
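
The non-null counts reported by `info()` hint at missing values (for example, `BP` has 266 non-null entries out of 270). One way to count the missing entries directly is `isna().sum()`, sketched here on a toy frame with made-up values.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with a few missing entries
toy = pd.DataFrame({
    "Age": [70.0, np.nan, 57.0, 64.0],
    "BP": [130.0, 115.0, np.nan, np.nan],
    "Sex": [1, 0, 1, 1],
})

# isna() marks missing cells as True; sum() counts them per column
missing_per_column = toy.isna().sum()
print(missing_per_column)
```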

In [16]:
# Further explore the categorical variables (Sex, Chest pain type, EKG results, Exercise angina, Slope of ST):
df.Sex.value_counts()

Sex
1    183
0     87
Name: count, dtype: int64

In [20]:
# Access column with space using bracket notation
df['Chest pain type'].value_counts()

Chest pain type
4.0    128
3.0     79
2.0     42
1.0     20
Name: count, dtype: int64
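
Note that the counts above add up to 269, not 270, because `value_counts()` skips missing values by default. The sketch below, on a toy Series with made-up values, shows two commonly useful options: `normalize=True` for proportions and `dropna=False` to include missing values in the tally.

```python
import numpy as np
import pandas as pd

# Toy series (hypothetical values) with one missing entry
chest_pain = pd.Series([4.0, 3.0, 4.0, np.nan, 2.0, 4.0], name="Chest pain type")

# Fractions of the non-missing values instead of raw counts
proportions = chest_pain.value_counts(normalize=True)

# Keep NaN as its own category so the counts add up to the number of rows
with_missing = chest_pain.value_counts(dropna=False)

print(proportions)
print(with_missing)
```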

---

### Handling Column Names with Spaces or Other Special Characters

Column names with spaces are common in real-world data, but they cannot be accessed with dot notation (`df.Max HR` is a syntax error). Here are a few solutions:

- Use **square brackets `[]` with quotes** around the column name instead of dot notation
- Bracket notation works with any column name, including names with spaces and special characters
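
As a quick illustration of bracket versus dot notation, here is a sketch on a toy frame with made-up values:

```python
import pandas as pd

# Toy frame (hypothetical values)
toy = pd.DataFrame({"Max HR": [109, 160, 141], "Sex": [1, 0, 1]})

max_hr = toy["Max HR"]            # brackets work for any column name
sex = toy.Sex                     # dot notation needs a valid Python identifier
# toy.Max HR                      # would raise a SyntaxError
subset = toy[["Max HR", "Sex"]]   # a list of names selects several columns

print(max_hr.tolist())
```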



#### **Method 3: Strip whitespace (if spaces are at edges)**
```python
df.columns = df.columns.str.strip()
```
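
A small sketch of the effect, assuming a toy frame whose column names carry stray leading or trailing spaces (made-up names). Note that interior spaces are left untouched.

```python
import pandas as pd

# Toy frame (hypothetical) with whitespace at the edges of the column names
toy = pd.DataFrame({" Age ": [70, 67], "Max HR ": [109, 160]})

before = list(toy.columns)
toy.columns = toy.columns.str.strip()
after = list(toy.columns)

print(before, "->", after)
```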

#### **Method 4: Replace all problematic characters**
```python
df.columns = df.columns.str.replace(' ', '_').str.replace('-', '_').str.lower()
```
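
The chained replacement can be tried on a toy set of column names (made up for illustration). The second variant, which handles spaces and hyphens in one pass with a regular expression, is an alternative sketch rather than part of the original recipe.

```python
import pandas as pd

# Empty toy frame; only the (hypothetical) column names matter here
toy = pd.DataFrame(columns=["Chest pain type", "ST-depression", "Max HR"])

# Chained literal replacements, as in Method 4
chained = toy.columns.str.replace(" ", "_").str.replace("-", "_").str.lower()

# Alternative: one regex character class covering both space and hyphen
single_pass = toy.columns.str.replace(r"[ -]", "_", regex=True).str.lower()

print(list(chained))
```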

---

**Best Practice for Biological Data:**
- Rename columns immediately after loading the data
- Use `snake_case` (lowercase with underscores)
- Make column names self-documenting
- Examples: `patient_id`, `cholesterol_mg_dl`, `max_heart_rate_bpm`
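
One possible way to apply the `snake_case` convention right after loading is to pass a small helper to `rename()`. The helper `to_snake_case` below is hypothetical (it is not a pandas function), and the column names are made up for illustration.

```python
import re
import pandas as pd

def to_snake_case(name: str) -> str:
    """Hypothetical helper: lowercase the name and turn each run of
    non-alphanumeric characters into a single underscore."""
    cleaned = re.sub(r"[^0-9a-zA-Z]+", "_", name.strip())
    return cleaned.strip("_").lower()

# Empty toy frame; only the column names matter here
toy = pd.DataFrame(columns=["Max HR", " FBS over 120", "ST depression (mm)"])

# rename() accepts a function that is applied to every column name
toy = toy.rename(columns=to_snake_case)
print(list(toy.columns))
```

Automatic conversion only normalizes the format; making names self-documenting (for example, adding units) still requires an explicit mapping.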

---

In [23]:
df['Chest pain type'].value_counts() 

Chest pain type
4.0    128
3.0     79
2.0     42
1.0     20
Name: count, dtype: int64

In [24]:
df['Max HR'].head()

0    109
1    160
2    141
3    105
4    121
Name: Max HR, dtype: int64

**Rename columns** to remove spaces and/or special characters

You can either replace them with underscores, remove spaces completely, or give them a new name.

In [25]:
# Replace spaces with underscores in all column names that have spaces
df.columns = df.columns.str.replace(' ', '_')
df.columns 


Index(['Age', 'Sex', 'Chest_pain_type', 'BP', 'Cholesterol', 'FBS_over_120',
       'EKG_results', 'Max_HR', 'Exercise_angina', 'ST_depression',
       'Slope_of_ST', 'Number_of_vessels_fluro', 'Thallium', 'Heart_Disease'],
      dtype='object')
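
Removing the spaces completely follows the same pattern, with an empty string as the replacement. A sketch on toy column names (made up for illustration):

```python
import pandas as pd

# Empty toy frame; only the column names matter here
toy = pd.DataFrame(columns=["Chest pain type", "Max HR", "Age"])

# Replace every space with nothing
toy.columns = toy.columns.str.replace(" ", "")
print(list(toy.columns))
```

Names without any separator can be harder to read, which is one reason underscores are often preferred.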


**Use the `rename()` method** with a dictionary that maps old column names to new ones

In [26]:
# Read the dataset again to get the original column names
# In practice, you would do this step only once at the beginning
df = pd.read_csv(input_file)

df = df.rename(columns={
    'Chest pain type': 'chest_pain_type',
    'Max HR': 'max_heart_rate',
    'FBS over 120': 'fasting_blood_sugar_high'
})

# Now you can use dot notation:
df.chest_pain_type.value_counts()

chest_pain_type
4.0    128
3.0     79
2.0     42
1.0     20
Name: count, dtype: int64
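
By default, `rename()` silently ignores dictionary keys that do not match any column, so a typo in an old name goes unnoticed. Passing `errors="raise"` makes such mismatches fail loudly. A minimal sketch on a toy frame with made-up values:

```python
import pandas as pd

# Toy frame (hypothetical values)
toy = pd.DataFrame({"Max HR": [109, 160], "Age": [70, 67]})

# "Max Hr" is a deliberate typo: with the default errors="ignore"
# nothing is renamed and no error is reported
unchanged = toy.rename(columns={"Max Hr": "max_heart_rate"})

# errors="raise" turns the same typo into a KeyError
try:
    toy.rename(columns={"Max Hr": "max_heart_rate"}, errors="raise")
    typo_detected = False
except KeyError:
    typo_detected = True

print(list(unchanged.columns), typo_detected)
```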