My imports

In [105]:
import pandas as pd
import numpy as np
from scipy import stats

Step 1: Create a Python script that generates a synthetic dataset matching the description of your study. The dataset should be saved as a CSV file named "exercise_data.csv".

In [106]:
def generate_synthetic_data(filename='exercise_data.csv'):
    
    # Setting up a seed for reproducibility
    np.random.seed(1)
    
    # My constants
    num_participants = 100
    exercise_groups = ['jogging', 'weightlifting', 'yoga']
    pre_exercise_mean = 120
    pre_exercise_std = 10
    bp_changes = {'jogging': -5, 'weightlifting': -3, 'yoga': -7}
    
    # Generating dataset
    participant_ids = np.arange(1, num_participants + 1)
    exercise_group_assignments = np.random.choice(exercise_groups, num_participants)
    pre_exercise_bp = np.random.normal(pre_exercise_mean, pre_exercise_std, num_participants).astype(int)
    post_exercise_bp = (pre_exercise_bp + np.vectorize(bp_changes.get)(exercise_group_assignments) + np.random.normal(0, 3, num_participants)).astype(int)
    
    # Creating and saving data frame
    df = pd.DataFrame({
        'Participant ID': participant_ids,
        'Exercise Group': exercise_group_assignments,
        'Pre-Exercise Systolic Blood Pressure': pre_exercise_bp,
        'Post-Exercise Systolic Blood Pressure': post_exercise_bp,
    })
    
    df.to_csv(filename, index=False)
    
    # Printing the results
    print(f"Synthetic dataset saved to '{filename}'.")

In step 1, I am using NumPy to generate a synthetic dataset that simulates the blood pressure measurements of participants in the study. Participant IDs are sequential (1 to 100), while NumPy's random number generator assigns the exercise groups and draws the pre- and post-exercise blood pressure values. Pre-exercise values follow a normal distribution with mean 120 and standard deviation 10, and each post-exercise value adds the group's assumed effect (-5 for jogging, -3 for weightlifting, -7 for yoga) plus normally distributed noise with standard deviation 3. Fixing the seed makes the dataset reproducible, so the later steps always run on the same data.
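
As a quick sanity check on this setup, here is a minimal sketch (not part of the assignment) that redraws data with the same parameters and compares the observed mean change per group with the configured effect. The names `rng`, `n`, and `check` are my own, and I am assuming a larger sample of 3,000 and a different seed so that the averages are stable. It also illustrates one side effect of `.astype(int)`: it truncates instead of rounding, which should pull the observed mean change roughly half a unit below the configured value.

```python
import numpy as np
import pandas as pd

# Assumption: an independent generator and a larger sample than the notebook uses
rng = np.random.default_rng(0)
n = 3000
bp_changes = {'jogging': -5, 'weightlifting': -3, 'yoga': -7}

groups = rng.choice(list(bp_changes), size=n)
shift = pd.Series(groups).map(bp_changes).to_numpy()
pre = rng.normal(120, 10, size=n)
noise = rng.normal(0, 3, size=n)

# Float version: no truncation, so the mean change should sit near the configured effect
change_float = (pre + shift + noise) - pre

# Integer version: same arithmetic as the generator, with truncation via astype(int)
pre_int = pre.astype(int)
post_int = (pre_int + shift + noise).astype(int)
change_int = post_int - pre_int

check = pd.DataFrame({'group': groups, 'float': change_float, 'truncated': change_int})
mean_change = check.groupby('group')[['float', 'truncated']].mean()
print(mean_change.round(2))
```

If the truncation matters for the analysis, rounding before the integer conversion (for example with `np.round`) would be one way to avoid the shift.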

Step 2: Write a Python script to read the "exercise_data.csv" file and print the participant with the highest pre-exercise systolic blood pressure in each exercise group.

In [107]:
def find_highest_pre_exercise_bp(df):

    # Grouping participants by exercise group and finding participant w/ highest pre-exercise systolic bp
    max_pre_bp_participants = df.loc[df.groupby('Exercise Group')['Pre-Exercise Systolic Blood Pressure'].idxmax()]
    
    # Printing the results
    print("Participant w/ the highest pre-exercise systolic blood pressure in each exercise group: ")
    print(max_pre_bp_participants[['Exercise Group', 'Participant ID', 'Pre-Exercise Systolic Blood Pressure']])

In step 2, I am using Pandas to group the participants by exercise group and select, for each group, the row with the highest pre-exercise blood pressure. The DataFrame itself is read from the CSV file in the main script. groupby(...).idxmax() returns the index label of the highest row in each group, and df.loc uses those labels to pull the full records. The result shows the upper end of the pre-exercise readings within each group.
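
To make the idxmax step easier to follow, here is a small sketch on made-up values. The `toy` frame and its numbers are hypothetical and are not taken from the study data.

```python
import pandas as pd

col = 'Pre-Exercise Systolic Blood Pressure'

# Hypothetical mini dataset: two groups with two participants each
toy = pd.DataFrame({
    'Participant ID': [1, 2, 3, 4],
    'Exercise Group': ['yoga', 'yoga', 'jogging', 'jogging'],
    col: [118, 131, 125, 122],
})

# idxmax gives one index label per group: the row holding that group's maximum
row_labels = toy.groupby('Exercise Group')[col].idxmax()

# loc turns those labels back into full records
top_per_group = toy.loc[row_labels]
print(top_per_group)
```

If two participants in a group share the maximum, idxmax keeps the first one it encounters, so only one row per group is returned.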

Step 3: Write a Python function that sorts the list based on blood pressure and displays the full record of the top 5.

In [108]:
def top_5_by_blood_pressure(df, column):
    
    # Sorting the df based on the name of the column in descending order
    sorted_df = df.sort_values(by=column, ascending=False)
    
    # Displaying the full record of the top 5 participants
    print(f"Top 5 participants based on {column}:")
    print(sorted_df.head(5))

In step 3, the DataFrame is sorted in descending order by the chosen blood pressure column (pre-exercise in the main script), and the full records of the top 5 participants are displayed. I am using Pandas' sort_values and head for this. Sorting by blood pressure makes it easy to spot extreme values, and the top 5 show the upper end of the measurements in the study population.
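
An alternative I could have used here is `DataFrame.nlargest`, which selects the top n rows by a column in one call. The sketch below uses a hypothetical `toy` frame with made-up values and n=2 to keep the output short.

```python
import pandas as pd

col = 'Pre-Exercise Systolic Blood Pressure'

# Hypothetical values, not the study data
toy = pd.DataFrame({
    'Participant ID': [1, 2, 3, 4, 5],
    col: [118, 131, 125, 122, 140],
})

# Sorting and slicing, as in the function above
top_sorted = toy.sort_values(by=col, ascending=False).head(2)

# nlargest returns the same rows here, already in descending order
top_nlargest = toy.nlargest(2, col)
print(top_nlargest)
```

With tied values the two approaches may order the tied rows differently, so I would not rely on them being interchangeable in that case.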

Step 4: Write a Python script that assumes that blood pressure measurements were taken monthly. Compute and print the average change in blood pressure for each exercise group. Note: This is hypothetical as the original study is for 6 weeks only.

In [109]:
def compute_average_monthly_change(df):
    
    # Calculating the change in bp
    df['Change in Blood Pressure'] = df['Post-Exercise Systolic Blood Pressure'] - df['Pre-Exercise Systolic Blood Pressure']
    
    # Converting the 6-week change to a monthly change 
    df['Monthly Change in Blood Pressure'] = df['Change in Blood Pressure'] * (4/6)
    
    # Computing avg monthly change for each exercise group
    avg_monthly_change = df.groupby('Exercise Group')['Monthly Change in Blood Pressure'].mean()
    
    # Printing my results
    print("Average monthly change in blood pressure for each exercise group: ")
    print(avg_monthly_change)

In step 4, I am computing the average monthly change in blood pressure for each exercise group. This involves subtracting pre-exercise blood pressure from post-exercise blood pressure, scaling the 6-week change by 4/6 to approximate a 4-week month, and averaging the result across participants in each group. The scaling assumes that blood pressure changes at a constant rate over the program, which is a simplification. The averages give a rough, hypothetical comparison of how strongly each regimen lowers blood pressure per month.
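
The scaling can be checked by hand with a single made-up participant. In this sketch the value of -6 and the variable names are hypothetical, and I am assuming a month is exactly 4 weeks.

```python
# Hypothetical participant whose blood pressure dropped by 6 over the 6-week program
six_week_change = -6.0

# Assuming a constant rate of change, this is the change per week
weekly_rate = six_week_change / 6

# A 4-week month then accounts for four weekly steps, equivalent to multiplying by 4/6
monthly_change = weekly_rate * 4
print(monthly_change)
```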

Step 5: Search for the 5 participants from the pre-exercise (Topic 4) and find their post-exercise blood pressure. Produce a table that compares their pre- and post-exercise pressure and displays the difference.

In [110]:
def compare_pre_post_bp(df, top_5_df):
    
    # Looking up post-exercise bp by participant ID so each value stays w/ the right participant
    post_bp_by_id = df.set_index('Participant ID')['Post-Exercise Systolic Blood Pressure']
    
    comparison_df = top_5_df[['Participant ID', 'Pre-Exercise Systolic Blood Pressure']].copy()
    comparison_df['Post-Exercise Systolic Blood Pressure'] = comparison_df['Participant ID'].map(post_bp_by_id)
    comparison_df['Difference'] = comparison_df['Post-Exercise Systolic Blood Pressure'] - comparison_df['Pre-Exercise Systolic Blood Pressure']
    
    # Printing the results before returning, so the print statements are reachable
    print("Comparison of the pre- and post-exercise blood pressure for the top 5 participants:")
    print(comparison_df)
    
    return comparison_df

In step 5, I am taking the 5 participants with the highest pre-exercise blood pressure from step 3 and looking up their post-exercise blood pressure by participant ID. The function then computes the difference (post minus pre) for each of them, so a negative value means the blood pressure dropped. Comparing the two measurements side by side shows how these individual participants responded over the 6-week program.
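
One thing to watch in this step is row order: the top 5 are sorted by blood pressure, while the full dataset is ordered by participant ID, so values copied by position can end up next to the wrong participant. A minimal sketch of a join on Participant ID, using a hypothetical `toy` frame with made-up values, keeps the pairs aligned:

```python
import pandas as pd

pre = 'Pre-Exercise Systolic Blood Pressure'
post = 'Post-Exercise Systolic Blood Pressure'

# Hypothetical values, not the study data
toy = pd.DataFrame({
    'Participant ID': [1, 2, 3],
    pre: [120, 140, 130],
    post: [115, 133, 128],
})

# Top 2 by pre-exercise bp, which puts participant 2 before participant 3
top = toy.nlargest(2, pre)[['Participant ID', pre]]

# A left merge on the ID keeps the order of `top` and attaches the matching post value
table = top.merge(toy[['Participant ID', post]], on='Participant ID', how='left')
table['Difference'] = table[post] - table[pre]
print(table)
```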

Step 6: Write a Python script to read the "exercise_data.csv" file and compute the measures of central tendency for each exercise group: mean, mode, standard deviation.

In [111]:
def compute_central_tendency(df):
    
    exercise_groups = df['Exercise Group'].unique()
    
    for group in exercise_groups:
        group_df = df[df['Exercise Group'] == group]
        print(f"\nExercise Group: {group}")
        
        for column in ['Pre-Exercise Systolic Blood Pressure', 'Post-Exercise Systolic Blood Pressure']:
            mean = group_df[column].mean()
            
            # Newer SciPy releases return a scalar mode for 1-D input and older ones return
            # a length-1 array, so np.atleast_1d normalizes both before indexing
            mode_result = stats.mode(group_df[column].to_numpy(), nan_policy='omit')
            mode = np.atleast_1d(mode_result.mode)[0]
            
            std_dev = group_df[column].std()
            
            print(f"  {column}:")
            print(f"    Mean: {mean}")
            print(f"    Mode: {mode}")
            print(f"    Standard Deviation: {std_dev}")

In step 6, I am calculating the mean, mode, and standard deviation for pre- and post-exercise blood pressure within each exercise group. I am using Pandas and SciPy for the statistical computations. The mean and mode summarize the typical blood pressure level in each group, while the standard deviation measures how widely the readings spread around the mean.
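
The same three statistics can also be produced as one table with a Pandas-only aggregation. This is a sketch on a hypothetical `toy` frame with made-up values, not a replacement for the function above. Two details I am relying on: `Series.mode()` returns all tied values in ascending order, so `.iloc[0]` picks the smallest, and `std` in Pandas is the sample standard deviation (ddof=1).

```python
import pandas as pd

col = 'Pre-Exercise Systolic Blood Pressure'

# Hypothetical values, not the study data
toy = pd.DataFrame({
    'Exercise Group': ['yoga', 'yoga', 'yoga', 'jogging', 'jogging', 'jogging'],
    col: [120, 120, 126, 118, 124, 124],
})

# Named aggregation: one output column per statistic, one row per group
summary = toy.groupby('Exercise Group').agg(
    mean=(col, 'mean'),
    mode=(col, lambda s: s.mode().iloc[0]),
    std=(col, 'std'),
)
print(summary)
```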

This is my main script to run all the steps

In [112]:
if __name__ == "__main__":
    filename = 'exercise_data.csv'
    
    # Step 1: Generating synthetic data
    generate_synthetic_data(filename)
    # Reading the generated CSV file
    df = pd.read_csv(filename)
    
    # Step 2: Finding the participant w/ the highest pre-exercise bp in each group
    find_highest_pre_exercise_bp(df)
    
    # Step 3: Sort by pre-exercise bp and display top 5
    top_5_df = df.sort_values(by='Pre-Exercise Systolic Blood Pressure', ascending=False).head(5)
    top_5_by_blood_pressure(df, 'Pre-Exercise Systolic Blood Pressure')
    
    # Step 4: Computing the avg monthly change in bp for each group
    compute_average_monthly_change(df)
    
    # Step 5: Comparing pre- and post-exercise bp for top 5 participants
    compare_pre_post_bp(df, top_5_df)
    
    # Step 6: Computing measures of the central tendency for each exercise group
    compute_central_tendency(df)

Synthetic dataset saved to 'exercise_data.csv'.
Participant w/ the highest pre-exercise systolic blood pressure in each exercise group: 
   Exercise Group  Participant ID  Pre-Exercise Systolic Blood Pressure
78        jogging              79                                   136
75  weightlifting              76                                   139
20           yoga              21                                   141
Top 5 participants based on Pre-Exercise Systolic Blood Pressure:
    Participant ID Exercise Group  Pre-Exercise Systolic Blood Pressure  \
20              21           yoga                                   141   
75              76  weightlifting                                   139   
78              79        jogging                                   136   
19              20        jogging                                   135   
13              14           yoga                                   132   

    Post-Exercise Systolic Blood Pressure  
20            