# Graphics

*From library documentation*: **pandas** is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool,
built on top of the Python programming language.

In [None]:
# SETUP: Import libraries and configure display
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from datetime import datetime, timedelta
import seaborn as sns

# Visualization settings (optional)
sns.set_theme(style="whitegrid")
plt.rcParams["figure.figsize"] = (10,6)


# Configure pandas display options for better readability
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.precision', 2)

In [None]:
# Input File path (origin: www.kaggle.com/)
input_file = "https://rcs.bu.edu/examples/python/DataAnalysis/Heart_Disease_Prediction.csv"

# Read the full dataset (pass nrows=100 to read_csv to load only the first 100 records)
df = pd.read_csv(input_file)



In [None]:
# Summary statistics for the numeric columns
df.describe()

In [None]:
df.head()

Once we know which variables matter for our study, we can group them by type and restrict the analysis to those columns:

In [None]:
# Numerical features - 6 features
num_feats = ['Age', 'Cholesterol', 'BP', 'Max HR', 'ST depression', 'Number of vessels fluro']

# Categorical (binary) - 4 features
bin_feats = ['Sex', 'FBS over 120', 'Exercise angina', 'Heart Disease']

# Categorical (multi-level) - 4 features
nom_feats = ['Chest pain type', 'EKG results', 'Slope of ST', 'Thallium']
cat_feats = nom_feats + bin_feats

In [None]:
# Let's examine the numerical columns only.
df[num_feats].describe().T

## Data Visualization (variable exploration)

Once you have explored the dataset with the pandas `.describe()`, `.info()`, `.head()`, and `.tail()` methods, take some time to visually explore the variables of interest using the **matplotlib** and **seaborn** libraries.

- [Seaborn guide](https://seaborn.pydata.org/tutorial.html)
- [Matplotlib guide](https://matplotlib.org/stable/tutorials/index.html)


#### For **Continuous/Numeric Variables**:

1. **Histogram (`histplot` or `hist`)**
   - Shows distribution of a single continuous variable
   - Use when: Examining the shape and spread of data; understanding if data is normally distributed or skewed

2. **Box Plot (`boxplot`)**
   - Shows quartiles, median, and outliers of a continuous variable
   - Use when: Comparing distributions across groups or detecting outliers

3. **Violin Plot (`violinplot`)**
   - Like box plot but shows full distribution shape
   - Use when: You want detailed distribution information across groups
   - Best for: Displaying multimodal distributions

4. **Density Plot (KDE)**
   - Smooth curve showing probability density
   - Use when: Comparing continuous distributions (especially overlapping)

#### For **Categorical Variables**:

5. **Bar Plot (`barplot`)**
   - Shows average value of a metric across categories
   - Use when: Comparing averages across groups with error bars


6. **Count Plot (`countplot`)**
   - Shows frequency of each category
   - Use when: Understanding the distribution of categorical data

7. **Pie Chart (`pie`)**
   - Shows proportion of each category
   - Use when: Displaying parts of a whole (percentages)
   - Best for: Showing relative proportions
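
As a rough sketch of the bar plot and pie chart described above: the values below are made up for illustration and only reuse the column names `Chest pain type` and `Max HR`. By default `barplot` draws the mean of each group with an error bar, and the pie chart here goes through the pandas plotting interface rather than seaborn.

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Hypothetical values for illustration only (not from the real dataset)
toy = pd.DataFrame({
    "Chest pain type": [1, 1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 4],
    "Max HR": [170, 160, 165, 155, 150, 160, 155, 140, 130, 145, 135, 150],
})

fig, (ax_bar, ax_pie) = plt.subplots(1, 2, figsize=(12, 5))

# Bar plot: bar height is the group mean; the error bar shows its uncertainty
sns.barplot(data=toy, x="Chest pain type", y="Max HR", ax=ax_bar)
ax_bar.set_title("Mean Max HR by chest pain type")

# The same group means computed directly with pandas
means = toy.groupby("Chest pain type")["Max HR"].mean()

# Pie chart: share of records in each category
counts = toy["Chest pain type"].value_counts().sort_index()
counts.plot(kind="pie", autopct="%1.0f%%", ax=ax_pie)
ax_pie.set_ylabel("")
ax_pie.set_title("Share of each chest pain type")
```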

#### For **Relationships Between Variables**:

8. **Scatter Plot (`scatterplot`)**
   - Shows relationship between two continuous variables
   - Use when: Examining correlations and clusters


9. **Line Plot (`lineplot`)**
   - Shows trend over time or sequence
   - Use when: Tracking changes across ordered values (time series, progression)


10. **Heatmap (`heatmap`)**
    - Shows correlation or values in a 2D grid with colors
    - Use when: Visualizing correlation matrices or large 2D datasets
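
A correlation heatmap could look like the sketch below. The data are synthetic: `Max HR` is constructed to decrease with `Age` and `Cholesterol` is independent noise, so the numbers say nothing about the real dataset.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Synthetic data for illustration only
rng = np.random.default_rng(42)
age = rng.integers(30, 78, size=150)
toy = pd.DataFrame({
    "Age": age,
    "Max HR": 210 - 0.8 * age + rng.normal(0, 12, size=150),
    "Cholesterol": rng.normal(250, 50, size=150),
})

# Pairwise Pearson correlations between the numeric columns
corr = toy.corr()

# Fix the color scale to [-1, 1] so that 0 maps to the middle of the palette
ax = sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm",
                 vmin=-1, vmax=1, square=True)
ax.set_title("Correlation matrix (synthetic data)")
```

With the notebook's data, `df[num_feats].corr()` would be the natural input.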


#### For **Multivariate Analysis**:

11. **Pair Plot (`pairplot`)**
    - Matrix of scatter plots showing all pairwise relationships
    - Use when: Exploring all relationships simultaneously


12. **Facet Grid (`FacetGrid`)**
    - Multiple subplots showing same plot across different groups
    - Use when: Comparing patterns across categorical subsets
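
The facet grid idea can be sketched as below: one scatter plot per value of a categorical column. The `toy` frame is randomly generated, and the 0/1 coding of `Sex` and the `Absence`/`Presence` labels are assumptions made for the example.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Randomly generated stand-in for df (illustration only)
rng = np.random.default_rng(1)
n = 80
toy = pd.DataFrame({
    "Age": rng.integers(30, 78, size=n),
    "Max HR": rng.normal(150, 20, size=n),
    "Sex": rng.integers(0, 2, size=n),
    "Heart Disease": rng.choice(["Absence", "Presence"], size=n),
})

# One column of subplots per value of 'Sex'; points colored by outcome
g = sns.FacetGrid(toy, col="Sex", hue="Heart Disease", height=4)
g.map_dataframe(sns.scatterplot, x="Age", y="Max HR")
g.add_legend()
```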


---

### Quick Reference Table

| Data Type | Goal | Best Plot | Function |
|-----------|------|-----------|----------|
| Single Continuous | Distribution | Histogram | `histplot()` |
| Single Continuous | Outliers | Box Plot | `boxplot()` |
| Single Continuous | Density | KDE | `kdeplot()` or `histplot(..., kde=True)` |
| Single Categorical | Frequencies | Count Plot | `countplot()` |
| Single Categorical | Proportions | Pie Chart | pandas `.plot(kind='pie')` |
| 2 Continuous | Relationship | Scatter | `scatterplot()` |
| Continuous + Categorical | Comparison | Box/Violin | `boxplot()` / `violinplot()` |
| Continuous + Categorical | Means | Bar Plot | `barplot()` |
| Multiple Continuous | Correlations | Heatmap | `heatmap()` |
| Multiple Variables | All Pairs | Pair Plot | `pairplot()` |

---

### Outcome variable exploration

Let's first explore the distribution of our outcome variable, `Heart Disease`:

In [None]:
ax = sns.countplot(data=df, x='Heart Disease')

# Add value labels on top of the bars
for p in ax.patches:
    ax.annotate(f'{int(p.get_height())}', (p.get_x() + p.get_width() / 2., p.get_height()),
                ha='center', va='bottom')

In [None]:
# Plot histogram for a numerical column - the Age column in this case
sns.histplot(data=df, x='Age', bins=20, kde=True)
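
The shape read off a histogram can be cross-checked numerically. The sketch below uses a small made-up series with a long right tail; `Series.skew()` is positive for right-skewed data, negative for left-skewed data, and close to zero for roughly symmetric data.

```python
import pandas as pd

# Hypothetical values with a long right tail (illustration only)
values = pd.Series([1, 2, 2, 3, 3, 3, 4, 4, 5, 12])

mean = values.mean()
median = values.median()
skewness = values.skew()

# In right-skewed data the tail pulls the mean above the median
print(f"mean={mean:.2f}, median={median:.2f}, skew={skewness:.2f}")
```

On the notebook's data the equivalent call would be `df['Age'].skew()`.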


### Box Plot

Box plots are used to visualize the distribution of a continuous variable and identify outliers. They show the median, quartiles, and potential outliers in the data.

**Example:** Compare cholesterol levels across heart disease categories.

In [None]:
# Box plot for cholesterol levels across heart disease categories
sns.boxplot(data=df, x='Heart Disease', y='Cholesterol', hue='Heart Disease', palette='Set2', legend=False)
plt.title('Cholesterol Levels by Heart Disease Category')
plt.show()
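
To make the quartiles and outliers concrete, the sketch below reproduces the arithmetic behind a box plot on a handful of made-up cholesterol readings. It assumes the default whisker setting (`whis=1.5`), under which points beyond 1.5 × IQR from the box are drawn as outliers.

```python
import pandas as pd

# Hypothetical cholesterol readings for illustration only
chol = pd.Series([180, 200, 210, 220, 230, 240, 250, 260, 270, 420], name="Cholesterol")

# Quartiles: the box spans Q1..Q3, with the median drawn inside
q1, median, q3 = chol.quantile([0.25, 0.50, 0.75])
iqr = q3 - q1

# Fences: points outside [lower_fence, upper_fence] are flagged as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = chol[(chol < lower_fence) | (chol > upper_fence)]

print(f"Q1={q1}, median={median}, Q3={q3}, IQR={iqr}")
print(f"fences=({lower_fence}, {upper_fence}), outliers={outliers.tolist()}")
```

Here Q1 = 212.5 and Q3 = 257.5, so the IQR is 45, the fences sit at 145 and 325, and only the reading of 420 falls outside them.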

### Scatter Plot

Scatter plots are used to visualize the relationship between two continuous variables. They are useful for identifying trends, clusters, and potential correlations.

**Example:** Age vs. cholesterol, colored by heart disease presence.

In [None]:
# Scatter plot for Age vs. Cholesterol, colored by Heart Disease presence
sns.scatterplot(data=df, x='Age', y='Cholesterol', hue='Heart Disease', palette='coolwarm')
plt.title('Age vs. Cholesterol')
plt.show()

### Pairplot

Examine the pairwise relationships between the numerical variables, colored by the outcome.

In [None]:
sub_feats = ['Age', 'Cholesterol', 'BP', 'Max HR', 'ST depression', 'Heart Disease']
df_sub = df[sub_feats]
g = sns.pairplot(df_sub, hue="Heart Disease", corner=True, diag_kind='hist')
plt.suptitle('Pairplot', fontsize=24);