# Python Proficiency for Statistics

A researcher is conducting a study on the effects of different exercise regimens on blood pressure. The study involves 100 participants who are randomly assigned to one of three exercise groups: jogging, weightlifting, or yoga. Each participant's blood pressure is measured before and after the 6-week exercise program.

### Task 1: Create a Python script that generates a synthetic dataset matching the description of your study. The dataset should be saved as a CSV file named "exercise_data.csv".

In this task, the goal is to create a synthetic dataset containing each participant's unique ID, the exercise group they were assigned to, and their pre-exercise and post-exercise systolic blood pressures.

The pre-exercise systolic blood pressure is measured before the 6-week exercise program starts, and the post-exercise systolic blood pressure is measured after the program ends.

A synthetic dataset must be generated with the following columns: 
 
* Participant ID (numeric)
* Exercise group (text: "jogging", "weightlifting", or "yoga")
* Pre-exercise systolic blood pressure (numeric)
* Post-exercise systolic blood pressure (numeric)

In [287]:
# Import necessary libraries
import numpy as np
import pandas as pd

### Generating the synthetic dataset

In [288]:
participants = 100

# Define data to go into the dataframe
participant_ids = np.arange(1, participants + 1)
activities = ["jogging", "weightlifting", "yoga"]
group = np.random.choice(activities, size=participants)

# Assuming pre-exercise systolic blood pressure is between 100 and 140 mmHg (inclusive)
pre_exercise_values = np.arange(100, 141)
pre_exercise_systolic_bp = np.random.choice(pre_exercise_values, size=participants)

# Assuming post-exercise systolic blood pressure is 15-19 mmHg lower than the pre-exercise value
# (np.arange(15, 20) stops at 19 because the end value is excluded)
decrease = np.random.choice(np.arange(15, 20), size=participants)
post_exercise_systolic_bp = pre_exercise_systolic_bp - decrease

# Assemble NumPy arrays into the dataframe
df = pd.DataFrame({ 
    "id": participant_ids,
    "group": group,
    "pre_exercise_bp": pre_exercise_systolic_bp,
    "post_exercise_bp": post_exercise_systolic_bp
})

# Convert blood pressure columns to float datatype
df["pre_exercise_bp"] = df["pre_exercise_bp"].astype(float)
df["post_exercise_bp"] = df["post_exercise_bp"].astype(float)

# Export to CSV
df.to_csv("exercise_data.csv", index=False, header=True)
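
The cell above draws different values on every run, so the outputs shown later in this notebook cannot be reproduced exactly. A minimal sketch of a repeatable variant follows, assuming NumPy's `default_rng` generator is acceptable for this exercise. The function name `make_exercise_data` and the `seed` parameter are illustrative and not part of the original task.

```python
import numpy as np
import pandas as pd


def make_exercise_data(n_participants=100, seed=42):
    """Build the synthetic dataset; `seed` (an assumption) makes the draws repeatable."""
    rng = np.random.default_rng(seed)
    activities = ["jogging", "weightlifting", "yoga"]

    # rng.integers excludes the upper bound by default, so endpoint=True includes it
    pre_bp = rng.integers(100, 140, size=n_participants, endpoint=True)
    decrease = rng.integers(15, 19, size=n_participants, endpoint=True)

    return pd.DataFrame({
        "id": np.arange(1, n_participants + 1),
        "group": rng.choice(activities, size=n_participants),
        "pre_exercise_bp": pre_bp.astype(float),
        "post_exercise_bp": (pre_bp - decrease).astype(float),
    })


synthetic_df = make_exercise_data()
```

Calling `synthetic_df.to_csv("exercise_data.csv", index=False)` would then write the same file on every run for a given seed.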

### Task 2: Write a Python script to read the "exercise_data.csv" file and print the participant with the highest pre-exercise systolic blood pressure in each exercise group.

The goal in this task is to group participants by exercise group, then show the record of the participant with the highest pre-exercise blood pressure in each group.

In [289]:
# Read the CSV back in and preview the first rows
try: 
    df = pd.read_csv("exercise_data.csv")
except FileNotFoundError: 
    print("Error: The file 'exercise_data.csv' was not found.")
except pd.errors.EmptyDataError: 
    print("Error: The file 'exercise_data.csv' is empty.")
except pd.errors.ParserError: 
    print("Error: There was an issue parsing the file 'exercise_data.csv'.")
except Exception as e: 
    print(f"An unexpected error occurred: {e}")

df.head()

Unnamed: 0,id,group,pre_exercise_bp,post_exercise_bp
0,1,jogging,132.0,116.0
1,2,weightlifting,133.0,114.0
2,3,yoga,119.0,104.0
3,4,yoga,131.0,116.0
4,5,yoga,111.0,94.0
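
Printing a message and continuing can hide a failed read, because later cells would keep using whatever `df` was already in memory. One possible alternative, sketched below, is a small loader that checks for the expected columns and raises if any are missing. The name `load_exercise_data` and the `EXPECTED_COLUMNS` list are assumptions made for illustration, and an in-memory string stands in for the CSV file.

```python
import io

import pandas as pd

# Column names used throughout this notebook
EXPECTED_COLUMNS = ["id", "group", "pre_exercise_bp", "post_exercise_bp"]


def load_exercise_data(source):
    """Read the CSV and fail loudly if an expected column is missing."""
    data = pd.read_csv(source)
    missing = [col for col in EXPECTED_COLUMNS if col not in data.columns]
    if missing:
        raise ValueError(f"Missing expected columns: {missing}")
    return data


# In-memory stand-in for "exercise_data.csv" (first rows of the preview above)
sample_csv = io.StringIO(
    "id,group,pre_exercise_bp,post_exercise_bp\n"
    "1,jogging,132.0,116.0\n"
    "2,weightlifting,133.0,114.0\n"
    "3,yoga,119.0,104.0\n"
)
loaded = load_exercise_data(sample_csv)
```

Passing the path `"exercise_data.csv"` instead of the `StringIO` object works the same way, since `pd.read_csv` accepts both.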


In [290]:
# Group participants by exercise group and look up the row with the highest pre-exercise blood pressure in each
# (idxmax returns the first matching row label, so a tie within a group shows only one participant)
max_pre_exercise_bp_rows = df.groupby("group")["pre_exercise_bp"].idxmax()
max_pre_exercise_bp_by_group = df.loc[max_pre_exercise_bp_rows, ["id", "group", "pre_exercise_bp"]]
max_pre_exercise_bp_by_group

Unnamed: 0,id,group,pre_exercise_bp
24,25,jogging,139.0
20,21,weightlifting,140.0
12,13,yoga,140.0
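
Because `idxmax` reports a single row per group, participants who tie for the highest reading are not all shown. A minimal sketch of one way to list every tied participant follows, using a small hypothetical sample rather than the real dataset: each group's maximum is broadcast back onto its rows with `transform("max")`, and the rows equal to that maximum are kept.

```python
import pandas as pd

# Small hypothetical sample in which two yoga participants tie at 140
sample = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "group": ["jogging", "jogging", "yoga", "yoga", "yoga"],
    "pre_exercise_bp": [132.0, 139.0, 140.0, 119.0, 140.0],
})

# Broadcast each group's maximum onto its rows, then keep the rows that match it
group_max = sample.groupby("group")["pre_exercise_bp"].transform("max")
tied_for_max = sample[sample["pre_exercise_bp"] == group_max]
```

Here `tied_for_max` holds participants 2, 3, and 5, whereas `idxmax` would return only one of the two yoga participants.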


### Task 3: Write a Python function that sorts the list based on blood pressure and displays the full record of the top 5.

The goal in this task is to find the 5 participants with the highest pre-exercise blood pressure, the 5 with the highest post-exercise blood pressure, and the 5 with the highest average blood pressure throughout the program. This can be done by sorting the original dataset `df` by `pre_exercise_bp`, `post_exercise_bp` and `avg_bp` respectively, and showing the top 5 results for each.

In [291]:
# Find 5 participants with the highest pre-exercise blood pressure
top_5_highest_pre_exercise_bp = df.sort_values(by="pre_exercise_bp", ascending=False).head(5)
top_5_highest_pre_exercise_bp

Unnamed: 0,id,group,pre_exercise_bp,post_exercise_bp
12,13,yoga,140.0,123.0
96,97,weightlifting,140.0,121.0
20,21,weightlifting,140.0,122.0
51,52,weightlifting,140.0,124.0
52,53,yoga,140.0,121.0


In [292]:
# Find 5 participants with the highest post-exercise blood pressure
df.sort_values(by="post_exercise_bp", ascending=False).head(5)

Unnamed: 0,id,group,pre_exercise_bp,post_exercise_bp
15,16,yoga,140.0,125.0
51,52,weightlifting,140.0,124.0
24,25,jogging,139.0,123.0
55,56,yoga,138.0,123.0
12,13,yoga,140.0,123.0


In [293]:
# Add column for average bp
df["avg_bp"] = (df["pre_exercise_bp"] + df["post_exercise_bp"]) / 2

In [294]:
# Find 5 participants with the highest average blood pressure
df.sort_values(by="avg_bp", ascending=False).head(5)

Unnamed: 0,id,group,pre_exercise_bp,post_exercise_bp,avg_bp
15,16,yoga,140.0,125.0,132.5
51,52,weightlifting,140.0,124.0,132.0
12,13,yoga,140.0,123.0,131.5
24,25,jogging,139.0,123.0,131.0
20,21,weightlifting,140.0,122.0,131.0
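
The task statement asks for a function, and the three cells above repeat the same sort-and-head pattern. A minimal sketch of how that pattern could be wrapped is shown below. The function name `top_n_by` and the sample rows are hypothetical and only illustrate the call.

```python
import pandas as pd


def top_n_by(data, column, n=5):
    """Return the full records of the n rows with the largest values in `column`."""
    return data.sort_values(by=column, ascending=False).head(n)


# Hypothetical sample rows, only to show the call
sample = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6],
    "group": ["jogging", "yoga", "yoga", "weightlifting", "jogging", "yoga"],
    "pre_exercise_bp": [132.0, 119.0, 140.0, 133.0, 111.0, 125.0],
    "post_exercise_bp": [116.0, 104.0, 123.0, 114.0, 94.0, 108.0],
})
sample["avg_bp"] = (sample["pre_exercise_bp"] + sample["post_exercise_bp"]) / 2

top_pre = top_n_by(sample, "pre_exercise_bp")
top_avg = top_n_by(sample, "avg_bp", n=3)
```

With the notebook's data, the equivalent calls would be `top_n_by(df, "pre_exercise_bp")`, `top_n_by(df, "post_exercise_bp")`, and `top_n_by(df, "avg_bp")`.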


### Task 4: Write a Python script that assumes that blood pressure measurements were taken monthly. Compute and print the average change in blood pressure for each exercise group. Note: This is hypothetical as the original study is for 6 weeks only.

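The goal in this task is to find how much blood pressure changed, on average, for participants in each group over the course of the study. This can be done by computing each participant's change in blood pressure, grouping participants by exercise group, and taking the mean change within each group. The dataset holds only a pre-program and a post-program reading, so the change is computed between those two values.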

In [295]:
# Create new column for change in blood pressure
df["change_in_bp"] = df["post_exercise_bp"] - df["pre_exercise_bp"]
df.head(1)

Unnamed: 0,id,group,pre_exercise_bp,post_exercise_bp,avg_bp,change_in_bp
0,1,jogging,132.0,116.0,124.0,-16.0


In [296]:
# Find average change in blood pressure for each group
df.groupby("group")["change_in_bp"].mean()

group
jogging         -17.242424
weightlifting   -16.833333
yoga            -16.540541
Name: change_in_bp, dtype: float64
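
The task statement imagines monthly measurements, which the two-reading dataset does not contain. As a rough sketch of how that hypothetical case might be handled, the example below assumes a long-format table with one row per participant per month. The `month` and `systolic_bp` columns and all values are invented for illustration.

```python
import pandas as pd

# Hypothetical long-format data: one row per participant per monthly reading
monthly = pd.DataFrame({
    "id": [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "group": ["jogging"] * 3 + ["yoga"] * 6,
    "month": [0, 1, 2, 0, 1, 2, 0, 1, 2],
    "systolic_bp": [132.0, 124.0, 116.0, 119.0, 110.0, 104.0, 131.0, 122.0, 116.0],
})

# Sort so that diff() compares each reading with the same participant's previous month
monthly = monthly.sort_values(["id", "month"])
monthly["monthly_change"] = monthly.groupby("id")["systolic_bp"].diff()

# Each participant's first reading has no previous month, so its change is NaN;
# mean() skips NaN values by default
avg_monthly_change = monthly.groupby("group")["monthly_change"].mean()
```

In this sample the jogging group averages a change of -8.0 mmHg per month and the yoga group -7.5 mmHg per month.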

### Task 5: Search for the 5 participants from the pre-exercise (Topic 4) and find their post-exercise blood pressure. Produce a table that compares their pre- and post-exercise pressure and displays the difference.

The goal in this task is to locate the 5 participants with the highest pre-exercise blood pressures and show how much their blood pressure changed. These participants are selected the same way as `top_5_highest_pre_exercise_bp` in Task 3, and `df` already holds each participant's `change_in_bp` from Task 4, so we only need to select the top 5 again and keep the relevant columns.

In [297]:
# Columns to show in the comparison table
columns = ["id", "group", "pre_exercise_bp", "post_exercise_bp", "change_in_bp"]

# Re-select the top 5 from df so that the result includes the change_in_bp column added in Task 4
top_5_highest_pre_exercise_bp = df.sort_values(by="pre_exercise_bp", ascending=False).head(5)

# Keep only the comparison columns
diff_in_pre_exercise_bp_among_top_5 = top_5_highest_pre_exercise_bp[columns]
diff_in_pre_exercise_bp_among_top_5

Unnamed: 0,id,group,pre_exercise_bp,post_exercise_bp,change_in_bp
12,13,yoga,140.0,123.0,-17.0
96,97,weightlifting,140.0,121.0,-19.0
20,21,weightlifting,140.0,122.0,-18.0
51,52,weightlifting,140.0,124.0,-16.0
52,53,yoga,140.0,121.0,-19.0
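
All five rows above share a reading of 140.0, and the Task 3 output shows that participant 16 also has a pre-exercise reading of 140.0, so which tied participants make the cut depends on the sort order. A minimal sketch of one way to keep every tied participant is shown below, using `nlargest` with `keep="all"` on rows copied from the outputs in this notebook. Whether the task intends exactly 5 rows or all ties is an open question, so this is an alternative rather than a correction.

```python
import pandas as pd

# Rows copied from the outputs in this notebook; six participants share a reading of 140
rows = pd.DataFrame({
    "id": [13, 97, 21, 52, 53, 16, 25],
    "group": ["yoga", "weightlifting", "weightlifting", "weightlifting",
              "yoga", "yoga", "jogging"],
    "pre_exercise_bp": [140.0, 140.0, 140.0, 140.0, 140.0, 140.0, 139.0],
    "post_exercise_bp": [123.0, 121.0, 122.0, 124.0, 121.0, 125.0, 123.0],
})

comparison = (
    rows.nlargest(5, "pre_exercise_bp", keep="all")  # keep every row tied with the fifth
        .assign(change_in_bp=lambda d: d["post_exercise_bp"] - d["pre_exercise_bp"])
)
```

The result has six rows instead of five, because all participants tied at 140.0 are retained.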


### Task 6: Write a Python script to read the "exercise_data.csv" file and compute the measures of central tendency for each exercise group: mean, mode, standard deviation.

The goal in this task is to compute the mean, mode, and standard deviation for every group. Strictly speaking, the standard deviation measures spread rather than central tendency, but it is included because the task asks for it. We can accomplish this by splitting the original dataset `df` into three separate subsets, one for each exercise group. Then, we can use `.describe()` to find the mean and standard deviation of each numeric column in each subset, and `.mode()` to find the modes.

#### Computing measures of central tendency for the yoga group

In [298]:
yoga_participants = df[df["group"] == "yoga"]

# Compute mean and standard deviation
yoga_participants.describe().transpose()

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
id,37.0,46.675676,28.288252,3.0,29.0,43.0,62.0,98.0
pre_exercise_bp,37.0,123.72973,10.910364,101.0,116.0,125.0,133.0,140.0
post_exercise_bp,37.0,107.189189,10.627109,86.0,100.0,107.0,116.0,125.0
avg_bp,37.0,115.459459,10.740874,93.5,107.5,116.0,123.5,132.5
change_in_bp,37.0,-16.540541,1.574,-19.0,-18.0,-16.0,-15.0,-15.0


In [299]:
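# Compute modes for each column
# mode() can return several rows when values tie; iloc[0] keeps the smallest mode of each column.
# Every id is unique, so the "mode" reported for id is simply the smallest id in the group.
yoga_modes = yoga_participants.mode().iloc[0].to_frame().transpose()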
yoga_modes

Unnamed: 0,id,group,pre_exercise_bp,post_exercise_bp,avg_bp,change_in_bp
0,3,yoga,133.0,100.0,109.5,-15.0


#### Computing measures of central tendency for the jogging group

In [300]:
jogging_participants = df[df["group"] == "jogging"]

# Compute mean and standard deviation
jogging_participants.describe().transpose()

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
id,33.0,54.30303,29.531217,1.0,26.0,61.0,79.0,100.0
pre_exercise_bp,33.0,117.242424,10.7559,100.0,110.0,115.0,126.0,139.0
post_exercise_bp,33.0,100.0,10.544548,83.0,92.0,98.0,108.0,123.0
avg_bp,33.0,108.621212,10.625022,91.5,101.0,106.5,116.5,131.0
change_in_bp,33.0,-17.242424,1.47966,-19.0,-19.0,-17.0,-16.0,-15.0


In [301]:
# Compute modes for each column (iloc[0] keeps the smallest mode when several values tie)
jogging_modes = jogging_participants.mode().iloc[0].to_frame().transpose()
jogging_modes

Unnamed: 0,id,group,pre_exercise_bp,post_exercise_bp,avg_bp,change_in_bp
0,1,jogging,102.0,95.0,94.5,-19.0


#### Computing measures of central tendency for the weightlifting group

In [302]:
weightlifting_participants = df[df["group"] == "weightlifting"]

# Compute mean and standard deviation
weightlifting_participants.describe().transpose()

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
id,30.0,51.033333,29.701368,2.0,23.5,53.5,73.5,99.0
pre_exercise_bp,30.0,121.033333,13.259956,101.0,109.0,123.5,132.0,140.0
post_exercise_bp,30.0,104.2,13.220673,83.0,90.75,107.5,114.75,124.0
avg_bp,30.0,112.616667,13.218771,92.5,99.875,115.5,123.5,132.0
change_in_bp,30.0,-16.833333,1.5105,-19.0,-18.0,-16.5,-16.0,-15.0


In [303]:
# Compute modes for each column (iloc[0] keeps the smallest mode when several values tie)
weightlifting_modes = weightlifting_participants.mode().iloc[0].to_frame().transpose()
weightlifting_modes

Unnamed: 0,id,group,pre_exercise_bp,post_exercise_bp,avg_bp,change_in_bp
0,2,weightlifting,140.0,111.0,93.0,-16.0
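
The three per-group subsets above could also be summarized in a single call. The sketch below shows one possible approach with `groupby(...).agg(...)`, using a few hypothetical rows. The helper `first_mode` is an illustrative name and, like the cells above, keeps only the smallest mode when several values tie.

```python
import pandas as pd

# Hypothetical sample rows; the column names match the dataset above
sample = pd.DataFrame({
    "group": ["jogging", "jogging", "jogging", "yoga", "yoga", "yoga"],
    "pre_exercise_bp": [110.0, 110.0, 128.0, 133.0, 120.0, 133.0],
    "post_exercise_bp": [92.0, 95.0, 110.0, 116.0, 104.0, 116.0],
})


def first_mode(values):
    """Smallest of the most frequent values (Series.mode returns them sorted)."""
    return values.mode().iloc[0]


# One row per group, with mean, mode, and standard deviation for each column
summary = sample.groupby("group")[["pre_exercise_bp", "post_exercise_bp"]].agg(
    ["mean", first_mode, "std"]
)
```

The result has one row per exercise group and a two-level column index, so a single value can be read with, for example, `summary.loc["yoga", ("pre_exercise_bp", "first_mode")]`.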
