In [1]:
import pandas as pd

### Read the data into a pandas DataFrame

In [3]:
df = pd.read_csv('heart.csv')

### Display the first five rows in the dataset

In [5]:
df.head()

,Age,Sex,ChestPainType,RestingBP,Cholesterol,FastingBS,RestingECG,MaxHR,ExerciseAngina,Oldpeak,ST_Slope,HeartDisease
0,40,M,ATA,140,289,0,Normal,172,N,0.0,Up,0
1,49,F,NAP,160,180,0,Normal,156,N,1.0,Flat,1
2,37,M,ATA,130,283,0,ST,98,N,0.0,Up,0
3,48,F,ASY,138,214,0,Normal,108,Y,1.5,Flat,1
4,54,M,NAP,150,195,0,Normal,122,N,0.0,Up,0


### Display the shape of the dataset

In [6]:
df.shape

(918, 12)

### Print information about the dataframe

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 918 entries, 0 to 917
Data columns (total 12 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Age             918 non-null    int64  
 1   Sex             918 non-null    object 
 2   ChestPainType   918 non-null    object 
 3   RestingBP       918 non-null    int64  
 4   Cholesterol     918 non-null    int64  
 5   FastingBS       918 non-null    int64  
 6   RestingECG      918 non-null    object 
 7   MaxHR           918 non-null    int64  
 8   ExerciseAngina  918 non-null    object 
 9   Oldpeak         918 non-null    float64
 10  ST_Slope        918 non-null    object 
 11  HeartDisease    918 non-null    int64  
dtypes: float64(1), int64(6), object(5)
memory usage: 86.2+ KB


### Check whether there are any null values in the dataset and count them

In [8]:
df.isnull().sum()

Age               0
Sex               0
ChestPainType     0
RestingBP         0
Cholesterol       0
FastingBS         0
RestingECG        0
MaxHR             0
ExerciseAngina    0
Oldpeak           0
ST_Slope          0
HeartDisease      0
dtype: int64

### Print statistical information about each column in the dataset, including the count, mean, std, min, and max

In [9]:
df.describe()

,Age,RestingBP,Cholesterol,FastingBS,MaxHR,Oldpeak,HeartDisease
count,918.0,918.0,918.0,918.0,918.0,918.0,918.0
mean,53.510893,132.396514,198.799564,0.233115,136.809368,0.887364,0.553377
std,9.432617,18.514154,109.384145,0.423046,25.460334,1.06657,0.497414
min,28.0,0.0,0.0,0.0,60.0,-2.6,0.0
25%,47.0,120.0,173.25,0.0,120.0,0.0,0.0
50%,54.0,130.0,223.0,0.0,138.0,0.6,1.0
75%,60.0,140.0,267.0,0.0,156.0,1.5,1.0
max,77.0,200.0,603.0,1.0,202.0,6.2,1.0


### Print the correlation matrix of the numeric columns in the dataset

In [10]:
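# Restrict to numeric columns: pandas 2.0+ raises an error if corr() meets string columns
df.select_dtypes(include='number').corr()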

,Age,RestingBP,Cholesterol,FastingBS,MaxHR,Oldpeak,HeartDisease
Age,1.0,0.254399,-0.095282,0.198039,-0.382045,0.258612,0.282039
RestingBP,0.254399,1.0,0.100893,0.070193,-0.112135,0.164803,0.107589
Cholesterol,-0.095282,0.100893,1.0,-0.260974,0.235792,0.050148,-0.232741
FastingBS,0.198039,0.070193,-0.260974,1.0,-0.131438,0.052698,0.267291
MaxHR,-0.382045,-0.112135,0.235792,-0.131438,1.0,-0.160691,-0.400421
Oldpeak,0.258612,0.164803,0.050148,0.052698,-0.160691,1.0,0.403951
HeartDisease,0.282039,0.107589,-0.232741,0.267291,-0.400421,0.403951,1.0


### Example 1: Counting Occurrences of Each Chest Pain Type

Explanation: In this example, we count how many times each chest pain type appears in the dataset. Calling value_counts() on the 'ChestPainType' column returns a Series that maps each distinct value to its number of occurrences, sorted from most to least frequent.

In [12]:
df['ChestPainType'].value_counts()

ASY    496
NAP    203
ATA    173
TA      46
Name: ChestPainType, dtype: int64
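
For comparison, the same tally can be built by hand with a plain dictionary. This is only a sketch: the `rows` list is an illustrative stand-in holding the five rows shown by `df.head()`, not the full dataset, and the name `chest_pain_counts` is chosen here for illustration.

```python
# Illustrative sample: Age and ChestPainType of the five rows shown by df.head().
rows = [
    {"Age": 40, "ChestPainType": "ATA"},
    {"Age": 49, "ChestPainType": "NAP"},
    {"Age": 37, "ChestPainType": "ATA"},
    {"Age": 48, "ChestPainType": "ASY"},
    {"Age": 54, "ChestPainType": "NAP"},
]

chest_pain_counts = {}
for row in rows:
    pain_type = row["ChestPainType"]
    # Start from 0 for a type not seen yet, then add one for this row.
    chest_pain_counts[pain_type] = chest_pain_counts.get(pain_type, 0) + 1

print(chest_pain_counts)
```

On a DataFrame, `value_counts()` gives the same mapping in one call and sorts it by frequency, which is why the notebook cell uses it.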

### Example 2: Filtering Patients with High Cholesterol

Explanation: In this example, we select the patients with high cholesterol, defined here as a cholesterol level greater than 360. The comparison df['Cholesterol'] > cholesterol_threshold produces a boolean mask, and indexing df with that mask keeps only the matching rows in a new DataFrame, filtered_df. Calling head() then shows its first five rows.

In [16]:
# Create a new DataFrame with patients whose cholesterol is higher than the threshold
cholesterol_threshold = 360
filtered_df = df[df['Cholesterol'] > cholesterol_threshold]
filtered_df.head()

,Age,Sex,ChestPainType,RestingBP,Cholesterol,FastingBS,RestingECG,MaxHR,ExerciseAngina,Oldpeak,ST_Slope,HeartDisease
28,53,F,ATA,113,468,0,Normal,127,N,0.0,Up,0
30,53,M,NAP,145,518,0,Normal,130,N,0.0,Flat,1
58,54,M,ASY,150,365,0,ST,134,N,1.0,Up,0
69,44,M,ASY,150,412,0,Normal,170,N,0.0,Up,0
76,32,M,ASY,118,529,0,Normal,130,N,0.0,Flat,1
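
If a plain dictionary is needed rather than a filtered DataFrame, one option is `to_dict(orient="index")`, which keys each record by its row label. The sketch below uses a small illustrative `sample` frame built from four rows shown in the outputs above; the name `high_cholesterol_patients` is an assumption made for this example.

```python
import pandas as pd

# Illustrative sample: four rows copied from the outputs above, keeping their row labels.
sample = pd.DataFrame(
    {
        "Age": [40, 49, 53, 53],
        "Sex": ["M", "F", "F", "M"],
        "Cholesterol": [289, 180, 468, 518],
    },
    index=[0, 1, 28, 30],
)

cholesterol_threshold = 360

# Boolean mask: True for rows whose cholesterol exceeds the threshold.
high = sample[sample["Cholesterol"] > cholesterol_threshold]

# One dictionary entry per remaining row, keyed by the row label.
high_cholesterol_patients = high.to_dict(orient="index")
print(high_cholesterol_patients)
```

Keying by row label rather than by age avoids collisions: both high-cholesterol patients in this sample are 53, so an age-keyed dictionary would keep only one of them.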


### Example 3: Calculating Average Max Heart Rate for Male and Female Patients

Explanation: In this example, we calculate the average maximum heart rate for male and female patients. We group the rows by the 'Sex' column with groupby(), select 'MaxHR' within each group, and take its mean. The result is a Series with one average per sex.

In [17]:
# Group the DataFrame by the 'Sex' column
grouped = df.groupby('Sex')
# Calculate the average max heart rate for male and female patients
average_max_heart_rate = grouped['MaxHR'].mean()
print(average_max_heart_rate)

Sex
F    146.139896
M    134.325517
Name: MaxHR, dtype: float64
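
To make explicit what groupby() followed by mean() computes, here is a hand-rolled sketch that keeps a running sum and count per sex. The `rows` list holds only the (Sex, MaxHR) pairs of the five rows shown by `df.head()`, so its averages differ from the full-dataset values above; the names `totals` and `average_max_hr` are illustrative.

```python
# Illustrative sample: (Sex, MaxHR) pairs of the five rows shown by df.head().
rows = [("M", 172), ("F", 156), ("M", 98), ("F", 108), ("M", 122)]

totals = {}  # maps sex -> [sum of MaxHR, number of patients]
for sex, max_hr in rows:
    entry = totals.setdefault(sex, [0, 0])
    entry[0] += max_hr
    entry[1] += 1

# Divide each running sum by its count to get the per-sex average.
average_max_hr = {sex: total / count for sex, (total, count) in totals.items()}
print(average_max_hr)
```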


### Example 4: Finding the Patient with the Highest Cholesterol

Explanation: In this example, we find the patient with the highest cholesterol level. idxmax() returns the index label of the first row holding the maximum 'Cholesterol' value, and df.loc with that label retrieves the full record for that patient, which we store in patient_with_highest_cholesterol.

In [19]:
# Find the index of the row with the highest cholesterol level
highest_cholesterol_index = df['Cholesterol'].idxmax()

# Get the patient with the highest cholesterol level
patient_with_highest_cholesterol = df.loc[highest_cholesterol_index]

print(patient_with_highest_cholesterol)

Age                   54
Sex                    M
ChestPainType        ASY
RestingBP            130
Cholesterol          603
FastingBS              1
RestingECG        Normal
MaxHR                125
ExerciseAngina         Y
Oldpeak              1.0
ST_Slope            Flat
HeartDisease           1
Name: 149, dtype: object
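
The idxmax() and loc pair has a close plain-Python analogue in max() with a key function. The sketch below assumes a small dictionary `rows`, keyed by row label and filled with the Age and Cholesterol values of the five rows shown by `filtered_df.head()`, so it finds the highest value in that sample only, not in the full dataset.

```python
# Illustrative sample: row label -> record, from the filtered_df.head() output.
rows = {
    28: {"Age": 53, "Cholesterol": 468},
    30: {"Age": 53, "Cholesterol": 518},
    58: {"Age": 54, "Cholesterol": 365},
    69: {"Age": 44, "Cholesterol": 412},
    76: {"Age": 32, "Cholesterol": 529},
}

# Like idxmax(): the label of the first row with the largest cholesterol value.
highest_cholesterol_index = max(rows, key=lambda label: rows[label]["Cholesterol"])

# Like df.loc[label]: look the full record up by that label.
patient_with_highest_cholesterol = rows[highest_cholesterol_index]

print(highest_cholesterol_index, patient_with_highest_cholesterol)
```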


### Example 5: Grouping Patients by Age Range

Explanation: In this example, we group patients into age ranges. We define the bin edges and their labels, then pd.cut assigns each patient's age to a bin and stores the label in a new 'age_group' column. Grouping by that column collects the rows for each age range, and the loop prints the content of each group.

In [26]:
# Define the age group bins and labels
age_bins = [18, 30, 40, 50, 60, float('inf')]  # 'inf' represents positive infinity
age_labels = ['18-30', '31-40', '41-50', '51-60', '61+']

# Use pd.cut to create the 'age_group' based on the age bins
df['age_group'] = pd.cut(df['Age'], bins=age_bins, labels=age_labels)
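
By default, pd.cut builds intervals that are open on the left and closed on the right, so the first bin here is (18, 30] and an age of exactly 18 would receive no label. The youngest patient in this dataset is 28, so nothing is lost in practice. The sketch below uses a hypothetical `ages` Series, chosen to sit on the bin edges, to show the default behavior next to `include_lowest=True`.

```python
import pandas as pd

age_bins = [18, 30, 40, 50, 60, float("inf")]
age_labels = ["18-30", "31-40", "41-50", "51-60", "61+"]

# Hypothetical ages chosen to sit on or next to the bin edges.
ages = pd.Series([18, 30, 31, 40, 61])

# Default: right-closed bins, so 18 falls outside (18, 30] and becomes NaN.
default_groups = pd.cut(ages, bins=age_bins, labels=age_labels)

# include_lowest=True also admits the lowest edge into the first bin.
inclusive_groups = pd.cut(ages, bins=age_bins, labels=age_labels, include_lowest=True)

print(default_groups.tolist())
print(inclusive_groups.tolist())
```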

In [27]:
grouped_patients = df.groupby('age_group')

In [28]:
# Print the content of each group
for age_grp in grouped_patients.groups:
    print(f"Group: {age_grp}")
    print(grouped_patients.get_group(age_grp))
    print("\n")

Group: 18-30
     Age Sex ChestPainType  RestingBP  Cholesterol  FastingBS RestingECG  \
170   29   M           ATA        120          243          0     Normal   
208   28   M           ATA        130          132          0        LVH   
215   30   F            TA        170          237          0         ST   
219   29   M           ATA        140          263          0     Normal   
829   29   M           ATA        130          204          0        LVH   

     MaxHR ExerciseAngina  Oldpeak ST_Slope  HeartDisease age_group  
170    160              N      0.0       Up             0     18-30  
208    185              N      0.0       Up             0     18-30  
215    170              N      0.0       Up             0     18-30  
219    170              N      0.0       Up             0     18-30  
829    202              N      0.0       Up             0     18-30  


Group: 31-40
     Age Sex ChestPainType  RestingBP  Cholesterol  FastingBS RestingECG  \
0     40   M       

### Example 6: Calculating the Percentage of Patients with Heart Disease

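Explanation: In this example, we look at the patients in the dataset who have heart disease. The percentage of such patients is the mean of the 0/1 'HeartDisease' column multiplied by 100; df.describe() above reports that mean as 0.553377, or about 55.3%. Note that the cell below does something different: it filters the rows where 'HeartDisease' equals 1 and calls value_counts() on the result, which counts how many times each distinct row occurs rather than computing a percentage.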

In [31]:
df[df['HeartDisease']==1].value_counts()

Age  Sex  ChestPainType  RestingBP  Cholesterol  FastingBS  RestingECG  MaxHR  ExerciseAngina  Oldpeak  ST_Slope  HeartDisease  age_group
31   M    ASY            120        270          0          Normal      153    Y               1.5      Flat      1             31-40        1
60   M    ASY            132        218          0          ST          140    Y               1.5      Down      1             51-60        1
61   F    ASY            130        330          0          LVH         169    N               0.0      Up        1             61+          1
60   M    NAP            141        316          1          ST          122    Y               1.7      Flat      1             51-60        1
                         140        185          0          LVH         155    N               3.0      Flat      1             51-60        1
                                                                                                                                            ..
53  
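
The percentage itself can be computed directly from the 'HeartDisease' column. The sketch below is a minimal example on an illustrative `sample` frame holding the 'HeartDisease' values of the five rows shown by `df.head()`; the variable names are assumptions for this example. Run against the full `df`, the same expression should agree with the mean of 0.553377 reported by `df.describe()`, roughly 55.34%.

```python
import pandas as pd

# Illustrative sample: HeartDisease values of the five rows shown by df.head().
sample = pd.DataFrame({"HeartDisease": [0, 1, 0, 1, 0]})

# Boolean Series: True where the patient has heart disease.
has_disease = sample["HeartDisease"] == 1

# The mean of a boolean Series is the fraction of True values.
heart_disease_percentage = has_disease.mean() * 100

print(f"{heart_disease_percentage:.2f}%")
```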