# Pymaceuticals Inc.
---

### Analysis
Pymaceuticals, Inc. conducted a study to evaluate the effectiveness of various drug regimens in treating squamous cell carcinoma (SCC) in mice. The analysis explores several key aspects of the data, including:

- **Overall Drug Efficacy**: Comparing the performance of different drug regimens, with a primary focus on Capomulin and Ramicane, two promising treatments.
- **Gender-Based Analysis**: Examining whether there is a significant difference in treatment outcomes based on the gender of the mice.
- **Tumor Volume Trends Over Time**: Observing tumor volume reduction over time for an individual mouse treated with Capomulin.
- **Weight and Tumor Volume Correlation**: Analyzing the relationship between mouse weight and tumor volume under the Capomulin regimen to understand if weight influences treatment efficacy.

---

### **Key Findings**

#### **1. Drug Regimen Comparison**
- **Capomulin and Ramicane** show lower average tumor volumes than the other treatments, suggesting they may be more effective at reducing tumor size.
- **Outliers** in certain treatments, particularly Infubinol, suggest variable responses among mice, which may indicate inconsistent drug efficacy.
- **Summary Statistics**: The mean, median, variance, standard deviation, and SEM of tumor volume were calculated for each drug regimen. Capomulin and Ramicane stand out as leading candidates for effective SCC treatment, based on their lower average tumor volumes and smaller variability.
#### **2. Gender Distribution and Impact**
- The study maintains a nearly balanced gender distribution, minimizing gender bias in the sample.
- A comparison of tumor volumes by gender for Capomulin reveals slight variations, hinting at potential gender-based differences in response to treatment. However, further research with larger sample sizes is recommended to confirm whether gender significantly impacts treatment efficacy.
#### **3. Tumor Volume Trends Over Time for Capomulin**
- **Tumor Reduction Over Time**: Tracking tumor volume over time for a single mouse treated with Capomulin reveals a steady decrease in tumor size, underscoring Capomulin's potential as an effective SCC treatment.
- The trend provides valuable insight into Capomulin’s action timeline, helping estimate how quickly the drug may begin showing effects, which is crucial for treatment planning.
#### **4. Mouse Weight and Tumor Volume Correlation**
- **Positive Correlation**: A scatter plot analysis of mouse weight versus average tumor volume for the Capomulin regimen shows a positive correlation, indicating that heavier mice tend to have larger tumor volumes on average.
- The **Linear Regression Line** further supports this correlation. One possible reading is that heavier mice may need higher doses of Capomulin to achieve the same tumor reduction, but this analysis is correlational and does not test dosing directly, so weight-based dosing remains a hypothesis for future work.

---

### **Conclusion**

The analysis highlights Capomulin and Ramicane as the most effective candidates for SCC treatment, given their ability to consistently reduce tumor size. Additionally, the data suggests that **gender and weight may influence treatment efficacy**. Future studies should expand on these findings to:

- Optimize dosing strategies to account for weight differences, ensuring consistent effectiveness across mice of varying sizes.
- Investigate potential gender-specific responses to better understand treatment effects.

In [None]:
# dependencies and setup
import matplotlib.pyplot as plt
import pandas as pd
import scipy.stats as st

# data files
mouse_metadata_path = "data/Mouse_metadata.csv"
study_results_path = "data/Study_results.csv"

# read the mouse data and the study results
mouse_metadata_df = pd.read_csv(mouse_metadata_path)
study_results_df = pd.read_csv(study_results_path)

# combine the data into a single df
merged_df = pd.merge(mouse_metadata_df, study_results_df, on="Mouse ID")

# preview display
print("preview of merged data:")
display(merged_df.head())
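
A quick sanity check on the merge can catch silent row loss or duplicated metadata. The block below is a minimal sketch on made-up frames (`toy_metadata` and `toy_results` are hypothetical stand-ins for the two CSV files, not the study data). It assumes each mouse appears once in the metadata and many times in the results.

```python
import pandas as pd

# Toy stand-ins for the two CSV files (values are made up for illustration)
toy_metadata = pd.DataFrame({
    "Mouse ID": ["a1", "b2"],
    "Drug Regimen": ["Capomulin", "Ramicane"],
})
toy_results = pd.DataFrame({
    "Mouse ID": ["a1", "a1", "b2"],
    "Timepoint": [0, 5, 0],
    "Tumor Volume (mm3)": [45.0, 43.1, 45.0],
})

# validate="one_to_many" raises MergeError if a Mouse ID repeats in the metadata
toy_merged = pd.merge(toy_metadata, toy_results, on="Mouse ID", validate="one_to_many")

# an inner merge keeps every study row as long as each Mouse ID has metadata
rows_preserved = len(toy_merged) == len(toy_results)
print("All study rows preserved:", rows_preserved)
```

The same `validate` argument could be passed to the real merge if the one-metadata-row-per-mouse assumption holds for the actual files.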

In [None]:
# checking number of mice
unique_mice_count = merged_df["Mouse ID"].nunique()
print("Unique mice count:", unique_mice_count)

In [None]:
# each row should be uniquely identified by mouse id and timepoint

# find rows whose (mouse id, timepoint) pair appears more than once
duplicate_entries = merged_df[merged_df.duplicated(subset=["Mouse ID", "Timepoint"], keep=False)]
print("Duplicate entries based on Mouse ID and Timepoint:\n", duplicate_entries)

In [None]:
# opt: get all the data for the duplicate mouse id
duplicate_mouse_id = duplicate_entries["Mouse ID"].unique()[0] 
duplicate_mouse_data = merged_df[merged_df["Mouse ID"] == duplicate_mouse_id]
print(f"All data for duplicate mouse ID {duplicate_mouse_id}:\n", duplicate_mouse_data)

In [5]:
# create a clean df by dropping every row for any mouse id that has duplicated timepoints
duplicate_mouse_ids = duplicate_entries["Mouse ID"].unique()
cleaned_df = merged_df[~merged_df["Mouse ID"].isin(duplicate_mouse_ids)]

In [None]:
# check number of mice in the clean df
cleaned_unique_mice_count = cleaned_df["Mouse ID"].nunique()
print("Number of unique mice in the cleaned data:", cleaned_unique_mice_count)
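
There are two common ways to handle repeated (Mouse ID, Timepoint) pairs, and they give different mouse counts. The block below is a minimal sketch on made-up data (`toy_df` is hypothetical) that contrasts dropping only the repeated rows with dropping the affected mouse entirely.

```python
import pandas as pd

# Toy data: mouse "g9" has timepoint 0 recorded twice (values are made up)
toy_df = pd.DataFrame({
    "Mouse ID": ["a1", "a1", "g9", "g9", "g9"],
    "Timepoint": [0, 5, 0, 0, 5],
    "Tumor Volume (mm3)": [45.0, 43.0, 45.0, 45.0, 47.5],
})

# Option 1: keep the first row of each repeated (Mouse ID, Timepoint) pair
rows_deduped = toy_df.drop_duplicates(subset=["Mouse ID", "Timepoint"])

# Option 2: remove every row for any mouse that has a repeated pair
dup_mask = toy_df.duplicated(subset=["Mouse ID", "Timepoint"], keep=False)
dup_ids = toy_df.loc[dup_mask, "Mouse ID"].unique()
mouse_dropped = toy_df[~toy_df["Mouse ID"].isin(dup_ids)]

print("Mice after option 1:", rows_deduped["Mouse ID"].nunique())
print("Mice after option 2:", mouse_dropped["Mouse ID"].nunique())
```

Option 1 keeps the mouse but trusts whichever repeated row came first. Option 2 is the more conservative choice when it is unclear which measurement is correct.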

## Summary Statistics

In [None]:
# generate a summary statistics table of mean, median, variance, standard deviation,
# and SEM of the tumor volume for each regimen, using the cleaned data

# one groupby, reused for each summary statistic
tumor_by_regimen = cleaned_df.groupby("Drug Regimen")["Tumor Volume (mm3)"]
mean_tumor_volume = tumor_by_regimen.mean()
median_tumor_volume = tumor_by_regimen.median()
variance_tumor_volume = tumor_by_regimen.var()
std_tumor_volume = tumor_by_regimen.std()
sem_tumor_volume = tumor_by_regimen.sem()

# assemble into a single summary DataFrame.
summary_stats_df = pd.DataFrame({
    "Mean Tumor Volume": mean_tumor_volume,
    "Median Tumor Volume": median_tumor_volume,
    "Tumor Volume Variance": variance_tumor_volume,
    "Tumor Volume Std. Dev.": std_tumor_volume,
    "Tumor Volume Std. Err.": sem_tumor_volume
})

summary_stats_df


In [None]:
# more advanced method: produce the same summary statistics (mean, median, variance,
# standard deviation, SEM of the tumor volume per regimen) with a single agg call
# (only one method is required in the solution)
summary_stats_agg_df = cleaned_df.groupby("Drug Regimen")["Tumor Volume (mm3)"].agg(
    Mean="mean",
    Median="median",
    Variance="var",
    StdDev="std",
    SEM="sem"
)

summary_stats_agg_df

## Bar and Pie Charts

In [None]:
# generate bar plot showing the total number of rows (mouse id/timepoints) for each drug regimen using pandas
timepoint_counts = cleaned_df["Drug Regimen"].value_counts()
timepoint_counts.plot(kind="bar", title="Total Timepoints for Each Drug Regimen", xlabel="Drug Regimen", ylabel="Number of Timepoints")
plt.show()

In [None]:
# generate a bar plot showing the total number of rows (mouse id/timepoints) for each drug regimen using pyplot
plt.bar(timepoint_counts.index, timepoint_counts.values)
plt.title("Total Timepoints for Each Drug Regimen")
plt.xlabel("Drug Regimen")
plt.ylabel("Number of Timepoints")
plt.xticks(rotation=90)
plt.show()

In [None]:
# generate pie chart, using pandas, showing the distribution of unique female versus male mice used in the study

# count each mouse once, then tally by sex
gender_distribution = cleaned_df.drop_duplicates(subset=["Mouse ID"])["Sex"].value_counts()

# make pie chart
gender_distribution.plot(kind="pie", autopct='%1.1f%%', title="Gender Distribution of Mice")

# hide the default y-axis label that pandas adds
plt.ylabel("")
plt.show()

In [None]:
# generate pie chart, using pyplot, showing the distribution of unique female versus male mice used in the study

# make pie chart from the per-mouse sex counts computed for the pandas chart
plt.pie(gender_distribution, labels=gender_distribution.index, autopct='%1.1f%%')

# add the title
plt.title("Gender Distribution of Mice")
plt.show()
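
The findings mention a comparison of tumor volumes by sex under Capomulin. The block below is a minimal sketch of how such a comparison might be tabulated, on made-up values (`toy_final` is hypothetical, and its numbers are not study results). It assumes one final tumor volume per mouse.

```python
import pandas as pd

# Toy final tumor volumes, one row per mouse (values are made up for illustration)
toy_final = pd.DataFrame({
    "Drug Regimen": ["Capomulin"] * 4 + ["Ramicane"] * 2,
    "Sex": ["Female", "Female", "Male", "Male", "Female", "Male"],
    "Tumor Volume (mm3)": [38.0, 40.0, 37.0, 41.0, 36.0, 39.0],
})

# restrict to Capomulin, then summarize tumor volume separately for each sex
capomulin_only = toy_final[toy_final["Drug Regimen"] == "Capomulin"]
by_sex = capomulin_only.groupby("Sex")["Tumor Volume (mm3)"].agg(["count", "mean", "std"])
print(by_sex)
```

Reporting the count alongside the mean makes it easier to judge whether a small difference could simply reflect a small group size.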

## Quartiles, Outliers and Boxplots

In [13]:
# calculate the final tumor volume of each mouse across four of the treatment regimens:
    # Capomulin, Ramicane, Infubinol, and Ceftamin

# last (greatest) timepoint for each mouse, from the cleaned data
last_timepoint_df = cleaned_df.groupby("Mouse ID")["Timepoint"].max().reset_index()

# merge back onto the cleaned df to get the tumor volume at that last timepoint
final_tumor_df = pd.merge(last_timepoint_df, cleaned_df, on=["Mouse ID", "Timepoint"])

In [None]:
# list treatments for for loop (and later for plot labels)
treatment_list = ["Capomulin", "Ramicane", "Infubinol", "Ceftamin"]

# empty list to fill with tumor vol data (for plotting)
tumor_volumes = []

# loop through
for drug in treatment_list:
    # locate the rows which contain mice on each drug and get the tumor volumes
    drug_data = final_tumor_df[final_tumor_df["Drug Regimen"] == drug]["Tumor Volume (mm3)"]
    
    # append the tumor volume data to the list
    tumor_volumes.append(drug_data)
    
    # calculate the IQR and quantitatively determine if there are any potential outliers
    quartiles = drug_data.quantile([0.25, 0.5, 0.75])
    lowerq = quartiles[0.25]
    upperq = quartiles[0.75]
    iqr = upperq - lowerq
    
    # outliers using upper and lower bounds
    lower_bound = lowerq - (1.5 * iqr)
    upper_bound = upperq + (1.5 * iqr)
    outliers = drug_data[(drug_data < lower_bound) | (drug_data > upper_bound)]
    print(f"{drug} potential outliers:\n{outliers}\n")
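
The 1.5 × IQR rule used in the loop can also be wrapped in a small helper so the bounds are computed in one place. The block below is a minimal sketch. The function name `iqr_outliers`, the parameter `k`, and the toy values are illustrative assumptions, not part of the study.

```python
import pandas as pd

def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the entries lying more than k * IQR outside the quartiles."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    lower_bound = q1 - k * iqr
    upper_bound = q3 + k * iqr
    return values[(values < lower_bound) | (values > upper_bound)]

# Toy final tumor volumes with one unusually low value (made up for illustration)
toy_volumes = pd.Series([60.0, 62.0, 65.0, 66.0, 67.0, 68.0, 36.3])
flagged = iqr_outliers(toy_volumes)
print(flagged)
```

For these toy values the quartiles are 61.0 and 66.5, so the bounds are 52.75 and 74.75, and only 36.3 falls outside them.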

In [None]:
# boxplot to show the distribution of the tumor volume for each treatment group
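plt.boxplot(tumor_volumes)
# label each box by position (boxes sit at x = 1..N) rather than via the `labels` keyword,
# which newer matplotlib releases deprecate
plt.xticks(range(1, len(treatment_list) + 1), treatment_list)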
plt.title("Final Tumor Volume for Selected Drug Regimens")
plt.ylabel("Tumor Volume (mm3)")
plt.show()

## Line and Scatter Plots

In [None]:
# one mouse treated with Capomulin (swap in another Mouse ID to inspect a different mouse)
capomulin_data = cleaned_df[cleaned_df["Drug Regimen"] == "Capomulin"]
single_mouse_data = capomulin_data[capomulin_data["Mouse ID"] == "l509"]

# line plot of tumor volume vs. timepoint for mouse l509
plt.plot(single_mouse_data["Timepoint"], single_mouse_data["Tumor Volume (mm3)"], marker='o')
plt.title("Capomulin Treatment of Mouse l509")
plt.xlabel("Timepoint (days)")
plt.ylabel("Tumor Volume (mm3)")
plt.show()


In [None]:
# average tumor volume for each mouse on the Capomulin regimen
capomulin_avg_tumor_vol = capomulin_data.groupby("Mouse ID").agg(
    avg_tumor_volume=("Tumor Volume (mm3)", "mean"),
    weight=("Weight (g)", "mean")
)

# scatter plot of mouse weight vs. average tumor volume
plt.scatter(capomulin_avg_tumor_vol["weight"], capomulin_avg_tumor_vol["avg_tumor_volume"])
plt.title("Average Tumor Volume vs. Mouse Weight for Capomulin Regimen")
plt.xlabel("Weight (g)")
plt.ylabel("Average Tumor Volume (mm3)")
plt.show()

## Correlation and Regression

In [None]:
# calculate the correlation coefficient and a linear regression model
# for mouse weight and average observed tumor volume across the Capomulin regimen

from scipy.stats import linregress

# correlation coefficient
correlation = capomulin_avg_tumor_vol["weight"].corr(capomulin_avg_tumor_vol["avg_tumor_volume"])
print(f"Correlation coefficient between mouse weight and average tumor volume: {correlation}")

# linear regression
slope, intercept, r_value, p_value, std_err = linregress(
    capomulin_avg_tumor_vol["weight"], capomulin_avg_tumor_vol["avg_tumor_volume"]
)

# plot linear regression model on the scatter plot
plt.scatter(capomulin_avg_tumor_vol["weight"], capomulin_avg_tumor_vol["avg_tumor_volume"], label='Data')
plt.plot(capomulin_avg_tumor_vol["weight"], slope * capomulin_avg_tumor_vol["weight"] + intercept, color='red', label='Fit Line')
plt.title("Average Tumor Volume vs. Mouse Weight for Capomulin Regimen")
plt.xlabel("Weight (g)")
plt.ylabel("Average Tumor Volume (mm3)")
plt.legend()
plt.show()
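
Beyond the fitted line, `linregress` also returns the r-value and p-value, which help gauge how strong the weight relationship is. The block below is a minimal sketch on made-up numbers (`toy` is hypothetical and does not reproduce the study's values). It also checks that the pandas correlation matches the regression's r-value.

```python
import pandas as pd
from scipy.stats import linregress

# Toy per-mouse averages (values are made up for illustration)
toy = pd.DataFrame({
    "weight": [15.0, 17.0, 19.0, 21.0, 23.0, 25.0],
    "avg_tumor_volume": [36.0, 37.5, 40.0, 41.0, 43.5, 45.0],
})

# fit the line and read the named fields from the result object
fit = linregress(toy["weight"], toy["avg_tumor_volume"])

# for a simple linear fit, the Pearson correlation equals the regression r-value
pearson_r = toy["weight"].corr(toy["avg_tumor_volume"])
r_squared = fit.rvalue ** 2

print(f"y = {fit.slope:.2f}x + {fit.intercept:.2f}")
print(f"r = {fit.rvalue:.3f}, r^2 = {r_squared:.3f}, p = {fit.pvalue:.3g}")
```

A strong correlation here would still describe an association only. It does not by itself show that weight changes how well the drug works.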
