# Data Wrangling for Scientists with Pandas

*From library documentation*: **pandas** is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool,
built on top of the Python programming language.

In [None]:
# SETUP: Import libraries and configure display
import pandas as pd
import numpy as np
# import matplotlib.pyplot as plt  # will be used later
from datetime import datetime, timedelta
# import seaborn as sns    # will be used later

# Visualization settings (optional) - will be used later
# sns.set_theme(style="whitegrid")
# plt.rcParams["figure.figsize"] = (10,6)


# Configure pandas display options for better readability
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.precision', 2)

print("✓ Libraries loaded successfully!")
print(f"Pandas version: {pd.__version__}")

-----

We will use a slightly modified "Heart Disease" dataset published on Kaggle (https://www.kaggle.com/datasets/neurocipher/heartdisease/data).

Dataset description:

- Age: Age in years
- Sex: 0 = Female, 1 = Male
- Chest pain type: 1-4 (1: typical angina, 2: atypical angina, 3: non-anginal pain, 4: asymptomatic)
- BP: Resting blood pressure (mm Hg)
- Cholesterol: Serum cholesterol (mg/dL)
- FBS over 120: Fasting blood sugar > 120 mg/dL (0 = False, 1 = True)
- EKG results: Resting electrocardiographic results (0: normal, 1: ST-T wave abnormality, 2: left ventricular hypertrophy)
- Max HR: Maximum heart rate 
- Exercise angina: Exercise-induced angina (0 = False, 1 = True)
- ST depression: ST depression induced by exercise relative to rest
- Slope of ST: Slope of the peak exercise ST segment (1: upsloping, 2: flat, 3: downsloping)
- Number of vessels fluro: Number of major vessels colored by fluoroscopy (0-3)
- Thallium: Thallium stress test result (3: normal, 6: fixed defect, 7: reversible defect)
- Heart Disease: Presence or Absence

In [None]:
# Input File path (origin: www.kaggle.com/)
input_file = "https://rcs.bu.edu/examples/python/DataAnalysis/Heart_Disease_Prediction.csv"

# Read the full dataset (pass nrows=100 to pd.read_csv to load only the first 100 records)
df = pd.read_csv(input_file)
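
To load only part of a large file, `read_csv` accepts an `nrows` argument. The sketch below uses a small in-memory CSV with made-up values (not the heart disease file), so it runs without network access:

```python
import io
import pandas as pd

# Toy CSV text standing in for a real file (values are illustrative only)
csv_text = "Age,Sex,BP\n63,1,145\n67,1,160\n41,0,130\n56,1,120\n"

# nrows limits how many data rows are parsed (the header line is not counted)
first_two = pd.read_csv(io.StringIO(csv_text), nrows=2)
print(first_two.shape)
```

Reading a small slice first is a cheap way to check column names and types before loading everything.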



---

## Data Exploration

The pandas `head()` method returns the **first n rows** (or elements) of an object based on position; n defaults to 5. It is particularly useful for a quick initial inspection of a dataset's structure and contents.

For a negative value of n, the function returns all rows except the last |n| rows.

In [None]:
# Display first 5 records (default)
df.head()
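
The positive and negative forms of `n` can be compared on a small frame. This is a minimal sketch; the `toy` DataFrame and its values are made up for illustration:

```python
import pandas as pd

# Toy frame with 6 rows (illustrative values only)
toy = pd.DataFrame({"Age": [63, 67, 41, 56, 62, 57]})

first_two = toy.head(2)      # the first 2 rows
all_but_two = toy.head(-2)   # every row except the last 2

print(len(first_two), len(all_but_two))
```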

Similarly, the `.tail()` method displays the last 5 records by default. The number of records displayed can be changed by passing an argument:

In [None]:
# Display the last 10 records
df.tail(10)

The `pandas.DataFrame.describe()` method generates descriptive statistics that summarize the central tendency, dispersion, and shape of a dataset's distribution. By default, it provides a quick overview of the *numerical data* in a DataFrame, but its `include` and `exclude` parameters let it summarize categorical data or all data types.

In [None]:
df.describe()
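
The `include` and `exclude` parameters of `describe()` control which columns are summarized. The sketch below uses a made-up `toy` frame rather than the heart disease data:

```python
import numpy as np
import pandas as pd

# Toy frame with one numeric and one text column (illustrative values only)
toy = pd.DataFrame({
    "Age": [63, 67, 41, 56],
    "Heart Disease": ["Presence", "Absence", "Absence", "Presence"],
})

numeric_summary = toy.describe()                     # numeric columns only (default)
text_summary = toy.describe(exclude=[np.number])     # non-numeric: count, unique, top, freq
full_summary = toy.describe(include="all")           # every column; inapplicable cells are NaN

print(full_summary)
```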

In [None]:
# Get information about the DataFrame structure
df.info()

In [None]:
# Further explore the categorical variables (Sex, Chest pain type, EKG results, Exercise angina, Slope of ST):
df.Sex.value_counts()

In [None]:
# Access column with space using bracket notation
df['Chest pain type'].value_counts()
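
Counts of coded categories are easier to read as proportions or with labels attached. A minimal sketch on a made-up `Sex` column, using the coding from the dataset description (0 = Female, 1 = Male):

```python
import pandas as pd

# Toy column (illustrative values only): 0 = Female, 1 = Male
sex = pd.Series([1, 0, 1, 1], name="Sex")

counts = sex.value_counts()                   # absolute count per code
shares = sex.value_counts(normalize=True)     # fractions that sum to 1
labeled = sex.map({0: "Female", 1: "Male"}).value_counts()

print(labeled)
```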

---

### Handling Column Names with Spaces or Other Special Characters

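Column names with spaces are common in real-world data, but they cannot be accessed with dot notation in Python (`df.Max HR` is a syntax error). Here are a few solutions:

#### **Method 1: Use bracket notation**
- Use **square brackets `[]` with quotes** around the column name instead of dot notation
- Works with any column name, including spaces and special characters

#### **Method 2: Rename the columns**
- Use `rename()` with a dictionary that maps old names to new names
- Dot notation then works for the renamed columns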
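
Bracket access works for any column name, whereas dot access needs a name that is a valid Python identifier. A minimal sketch on a made-up `toy` frame:

```python
import pandas as pd

# Toy frame (illustrative values only)
toy = pd.DataFrame({"Max HR": [150, 172], "Age": [54, 61]})

# Bracket notation works for any column name, including names with spaces
max_hr = toy["Max HR"]

# Dot notation works only when the name is a valid Python identifier
age = toy.Age

# After renaming, dot notation also works for the former "Max HR" column
renamed = toy.rename(columns={"Max HR": "max_heart_rate"})
print(renamed.max_heart_rate.tolist())
```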



#### **Method 3: Strip whitespace (if spaces are at edges)**
```python
df.columns = df.columns.str.strip()
```

#### **Method 4: Replace all problematic characters**
```python
df.columns = df.columns.str.replace(' ', '_').str.replace('-', '_').str.lower()
```
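
The effect of the two cleanup steps above can be checked on a frame with deliberately messy headers. The column names below are made up for illustration (`ST-depression` is not a column in the dataset):

```python
import pandas as pd

# Empty toy frame: only the column labels matter here
toy = pd.DataFrame(columns=[" Chest pain type ", "ST-depression", "Max HR"])

# Step 1: remove whitespace at the edges of each name
toy.columns = toy.columns.str.strip()

# Step 2: replace spaces and hyphens with underscores, then lowercase
toy.columns = toy.columns.str.replace(" ", "_").str.replace("-", "_").str.lower()

print(list(toy.columns))
```

Stripping first matters: otherwise the edge spaces would be turned into leading and trailing underscores.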

---

**Best Practice for Biological Data:**
- Rename columns immediately after loading the data
- Use `snake_case` (lowercase with underscores)
- Make column names self-documenting
- Examples: `patient_id`, `cholesterol_mg_dl`, `max_heart_rate_bpm`
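
One way to apply these practices in a single step is to pass a function to `rename()`. The helper `to_snake_case` below is hypothetical (not part of pandas), and the sketch assumes that any run of non-alphanumeric characters should become one underscore:

```python
import re
import pandas as pd

def to_snake_case(name: str) -> str:
    """Hypothetical helper: lowercase a name and join its alphanumeric runs with underscores."""
    cleaned = re.sub(r"[^0-9a-zA-Z]+", "_", name.strip())
    return cleaned.strip("_").lower()

# Empty toy frame: only the column labels matter here
toy = pd.DataFrame(columns=["Chest pain type", "FBS over 120", " Max HR "])

# rename() accepts a function and applies it to every column label
toy = toy.rename(columns=to_snake_case)
print(list(toy.columns))
```

A helper like this only normalizes the format; making names self-documenting (for example adding units) still needs an explicit mapping.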

---

In [None]:
df['Chest pain type'].value_counts() 

In [None]:
df['Max HR'].head()

**Rename columns** to remove spaces and/or special characters

You can either replace them with underscores, remove spaces completely, or give them a new name.

In [None]:
# Replace spaces with underscores in all column names that have spaces
df.columns = df.columns.str.replace(' ', '_')
df.columns 



**Use the `rename()` method** from pandas with a dictionary

In [None]:
# Read the dataset again to get the original column names
# In practice, you would do this step only once at the beginning
df = pd.read_csv(input_file)

df = df.rename(columns={
    'Chest pain type': 'chest_pain_type',
    'Max HR': 'max_heart_rate',
    'FBS over 120': 'fasting_blood_sugar_high'
})

# Now you can use dot notation:
df.chest_pain_type.value_counts()
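
By default, `rename()` silently skips dictionary keys that do not match any column, so a typo in an old name goes unnoticed. Passing `errors="raise"` turns that into a `KeyError`. A minimal sketch on a made-up `toy` frame:

```python
import pandas as pd

# Toy frame (illustrative values only)
toy = pd.DataFrame({"Max HR": [150, 172]})

# The key "Max Hr" (wrong capitalization) matches nothing and is ignored by default
unchanged = toy.rename(columns={"Max Hr": "max_heart_rate"})

# With errors="raise", the same typo raises a KeyError instead
try:
    toy.rename(columns={"Max Hr": "max_heart_rate"}, errors="raise")
    typo_detected = False
except KeyError:
    typo_detected = True

print(list(unchanged.columns), typo_detected)
```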

In [None]:
# Check for duplicate rows
duplicates = df.duplicated().sum()
print(f"Number of duplicate rows: {duplicates}")

# Display duplicates if any exist
if duplicates > 0:
    print("\nDuplicate rows:")
    print(df[df.duplicated(keep=False)])
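
The `keep` argument changes which rows `duplicated()` flags: the default marks only the repeats after the first occurrence, while `keep=False` marks every copy. A minimal sketch on a made-up `toy` frame, including one way to drop the repeats:

```python
import pandas as pd

# Toy frame in which the first two rows are identical (illustrative values only)
toy = pd.DataFrame({"Age": [63, 63, 41], "BP": [145, 145, 130]})

later_copies = toy.duplicated().sum()           # repeats after the first occurrence
all_copies = toy.duplicated(keep=False).sum()   # every row that has an identical twin

# Keep the first occurrence of each row and renumber the index
deduplicated = toy.drop_duplicates().reset_index(drop=True)

print(later_copies, all_copies, len(deduplicated))
```

Whether identical rows are true duplicates or distinct patients with the same measurements depends on the data, so it is worth inspecting them before dropping anything.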