# Pymaceuticals Inc.
---

### Analysis

Observations 
- Of all the drugs tested, Capomulin and Ramicane had the greatest impact on shrinking tumors in the mice. This suggests that these two treatments were more effective at controlling or reducing tumor growth than the other drugs in the study.

- Capomulin and Ramicane show similar decreases in tumor volume, and both yield better results than Infubinol and Ceftamin. This conclusion comes from the box plots of final tumor volume, where Capomulin is comparable to Ramicane and both handle this cancer type better than Infubinol and Ceftamin.
 
- Among mice on the Capomulin regimen, mouse weight and average tumor volume are positively correlated: the linear regression analysis gives a high r-value of 0.95. Because tumor volume rises with weight, weight and drug effectiveness are negatively related; as the weight of the mice increases, the effectiveness of the drug decreases.
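
The sign of the r-value can be checked on a small example. The sketch below uses made-up weights and tumor volumes (hypothetical values, not study data; the variable names are also assumptions) to show that a positive Pearson r and a positive regression slope both mean the two quantities rise together.

```python
import scipy.stats as st

# Hypothetical toy values for illustration only; not taken from the study data.
weights_g = [15, 17, 19, 21, 23, 25]
avg_tumor_vol_mm3 = [36.0, 37.5, 39.0, 41.0, 42.5, 45.0]

# Pearson correlation coefficient and its p-value
r, p_value = st.pearsonr(weights_g, avg_tumor_vol_mm3)

# Slope of the least-squares regression line
slope = st.linregress(weights_g, avg_tumor_vol_mm3).slope

# A positive r and a positive slope both indicate that heavier mice
# have larger average tumor volumes in this toy sample.
print(f"r = {r:.2f}, slope = {slope:.2f}")
```

If larger tumors are read as lower effectiveness, a positive r between weight and tumor volume corresponds to a negative relationship between weight and effectiveness.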

In [None]:
# Dependencies and Setup
import matplotlib.pyplot as plt
import pandas as pd
import scipy.stats as st

# Added additional Dependencies 
import numpy as np
from scipy.stats import linregress

# Study data files
mouse_metadata_path = (r'C:/Users/jcath/OneDrive/Desktop/GWU/Module 5 Challenge/Pymaceuticals/data/Mouse_metadata.csv')
study_results_path = (r'C:/Users/jcath/OneDrive/Desktop/GWU/Module 5 Challenge/Pymaceuticals/data/Study_results.csv')

# Read the mouse data and the study results
mouse_metadata = pd.read_csv(mouse_metadata_path)
study_results = pd.read_csv(study_results_path)

# Combine the data into a single DataFrame
merge_df = pd.merge(mouse_metadata, study_results, how="outer", on="Mouse ID")

# Display the data table for preview
merge_df.head()


In [None]:
# Checking the number of mice.
mice_count = merge_df["Mouse ID"].unique().size
mice_count

In [None]:
# Our data should be uniquely identified by Mouse ID and Timepoint
# Get the duplicate mice by ID number that shows up for Mouse ID and Timepoint.
duplicate_mouse = merge_df[merge_df.duplicated(subset=['Mouse ID', 'Timepoint'])]
duplicate_mouse

In [None]:
# Optional: Get all the data for the duplicate mouse ID.
duplicate_mice = merge_df.loc[merge_df["Mouse ID"] == "g989", :]
duplicate_mice

In [None]:
# Create a clean DataFrame by dropping the duplicate mouse by its ID.
clean_mice = merge_df[merge_df["Mouse ID"] != "g989"]
clean_mice


In [None]:
# Checking the number of mice in the clean DataFrame.
num_mice = clean_mice["Mouse ID"].nunique()
num_mice

## Summary Statistics
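
As a quick illustration of what the summary statistics cells compute, here is a minimal sketch on a toy DataFrame. The rows and regimen names `A` and `B` are hypothetical; only the column names mirror the study data. It uses pandas named aggregation, which is one possible way to label the output columns.

```python
import pandas as pd

# Hypothetical toy rows; not study data. Column names mirror the study data.
toy = pd.DataFrame({
    "Drug Regimen": ["A", "A", "A", "B", "B", "B"],
    "Tumor Volume (mm3)": [40.0, 42.0, 44.0, 50.0, 55.0, 60.0],
})

# One row per regimen, one column per statistic.
# var and std use the sample (n - 1) definition; sem is std / sqrt(n).
summary = toy.groupby("Drug Regimen")["Tumor Volume (mm3)"].agg(
    mean="mean", median="median", var="var", std="std", sem="sem"
)
print(summary)
```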

In [None]:
# Generate a summary statistics table of mean, median, variance, standard deviation, and SEM of the tumor volume for each regimen

# Use groupby and summary statistical methods to calculate the following properties of each drug regimen:
# mean, median, variance, standard deviation, and SEM of the tumor volume.
# Group the cleaned data (duplicate mouse removed) by drug regimen
tumor_by_regimen = clean_mice.groupby('Drug Regimen')['Tumor Volume (mm3)']
mean = tumor_by_regimen.mean()
median = tumor_by_regimen.median()
var = tumor_by_regimen.var()
std = tumor_by_regimen.std()
sem = tumor_by_regimen.sem()

# Assemble the resulting series into a single summary DataFrame.
summary_stat = pd.DataFrame({"Mean Tumor Volume": mean,
                             "Median Tumor Volume": median,
                             "Tumor Volume Variance": var,
                             "Tumor Volume Std. Dev.": std,
                             "Tumor Volume Std. Err.": sem})

# Display the summary statistics table grouped by the 'Drug Regimen' column
summary_stat

In [None]:
# A more advanced method to generate a summary statistics table of mean, median, variance, standard deviation,
# and SEM of the tumor volume for each regimen (only one method is required in the solution)

# Using the aggregation method, produce the same summary statistics in a single line
summary_agg = clean_mice.groupby(['Drug Regimen'])[['Tumor Volume (mm3)']].agg(['mean', 'median', 'var', 'std', 'sem'])
summary_agg

## Bar and Pie Charts

In [None]:
# Generate a bar plot showing the total number of rows (Mouse ID/Timepoints) for each drug regimen using Pandas.
total_timepoints = clean_mice[["Timepoint", "Drug Regimen"]].groupby(["Drug Regimen"]).count()
sorted_timepoints = total_timepoints.sort_values(["Timepoint"], ascending=False)

# Plot bar chart using pandas
timepoints_plot = sorted_timepoints.plot.bar()
timepoints_plot.set_ylabel("Timepoint")



In [None]:
# Generate a bar plot showing the total number of rows (Mouse ID/Timepoints) for each drug regimen using pyplot.
plt.bar(sorted_timepoints.index, sorted_timepoints['Timepoint'], color='b', alpha=1, align='center')

# Set axis labels
plt.ylabel('Timepoints') 
plt.xlabel('Drug Regimen')
plt.xticks(rotation='vertical')

# Show plot using pyplot
plt.show()

In [None]:
# Generate a pie chart, using Pandas, showing the distribution of unique female versus male mice used in the study
# Get the unique mice with their gender (one row per Mouse ID, duplicate mouse removed)
gender_data = clean_mice.drop_duplicates(subset="Mouse ID")["Sex"].value_counts()

# Make the pie chart
gender_data.plot.pie(autopct="%1.1f%%")
plt.title("")
plt.show()

In [None]:
# Generate a pie chart, using pyplot, showing the distribution of unique female versus male mice used in the study
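# Get the unique mice with their gender (one row per Mouse ID, duplicate mouse removed)
mice_by_sex = clean_mice.drop_duplicates(subset="Mouse ID")["Sex"].value_counts()

# Make the pie chart with pyplot, using the computed counts instead of hard-coded sizes
plt.pie(mice_by_sex, labels=mice_by_sex.index, autopct="%1.1f%%")
plt.title("")
plt.ylabel('Sex')
plt.show()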

## Quartiles, Outliers and Boxplots
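
The outlier check in this section follows the 1.5 × IQR rule. A minimal sketch of that rule on a made-up Series is shown here; the values are hypothetical and not taken from the study, and the variable names are illustrative only.

```python
import pandas as pd

# Hypothetical final tumor volumes (mm3); not study data.
volumes = pd.Series([30.0, 32.0, 33.0, 34.0, 35.0, 36.0, 38.0, 60.0])

# Quartiles and interquartile range
q1 = volumes.quantile(0.25)
q3 = volumes.quantile(0.75)
iqr = q3 - q1

# Anything beyond 1.5 * IQR from the quartiles is flagged as a potential outlier
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr
outliers = volumes[(volumes < lower_bound) | (volumes > upper_bound)]

print(f"bounds: [{lower_bound}, {upper_bound}], outliers: {outliers.tolist()}")
```

In this toy sample only the value 60.0 falls outside the bounds; the same test is applied per drug regimen in the loop that follows.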

In [None]:
# Calculate the final tumor volume of each mouse across four of the treatment regimens:
# Capomulin, Ramicane, Infubinol, and Ceftamin
cap_df = merge_df.loc[merge_df["Drug Regimen"] == "Capomulin",:]
ram_df = merge_df.loc[merge_df["Drug Regimen"] == "Ramicane", :]
inf_df = merge_df.loc[merge_df["Drug Regimen"] == "Infubinol", :]
ceft_df = merge_df.loc[merge_df["Drug Regimen"] == "Ceftamin", :]

# Start by getting the last (greatest) timepoint for each mouse
last_timepoint = merge_df.groupby(["Mouse ID"])["Timepoint"].max()
last_timepoint = last_timepoint.reset_index()

In [None]:
# Start by getting the last (greatest) timepoint for each mouse
#Capomulin
caplast = cap_df.groupby('Mouse ID').max()['Timepoint']
caplastvol = pd.DataFrame(caplast)
caplastmerge = pd.merge(caplastvol, merge_df, on=("Mouse ID","Timepoint"),how="left")
caplastmerge.head(5)

In [None]:
# Start by getting the last (greatest) timepoint for each mouse
#Ramicane
ramlast = ram_df.groupby('Mouse ID').max()['Timepoint']
ramlastvol = pd.DataFrame(ramlast)
ramlastmerge = pd.merge(ramlastvol, merge_df, on=("Mouse ID","Timepoint"),how="left")
ramlastmerge.head(5)

In [None]:
# Start by getting the last (greatest) timepoint for each mouse
#Infubinol
inflast = inf_df.groupby('Mouse ID').max()['Timepoint']
inflastvol = pd.DataFrame(inflast)
inflastmerge = pd.merge(inflastvol, merge_df, on=("Mouse ID","Timepoint"),how="left")
inflastmerge.head(5)

In [None]:
# Start by getting the last (greatest) timepoint for each mouse
#Ceftamin
ceftlast = ceft_df.groupby('Mouse ID').max()['Timepoint']
ceftlastvol = pd.DataFrame(ceftlast)
ceftlastmerge = pd.merge(ceftlastvol, merge_df, on=("Mouse ID","Timepoint"),how="left")
ceftlastmerge.head(5)

In [None]:
# Merge this group df with the original DataFrame to get the tumor volume at the last timepoint
merged_data_lasttp = pd.merge(last_timepoint, merge_df, on=("Mouse ID", "Timepoint"))
merged_data_lasttp


In [None]:
# Put treatments into a list for for loop (and later for plot labels)
treatments = ["Capomulin","Ramicane","Infubinol","Ceftamin"]

# Create empty list to fill with tumor vol data (for plotting)
total_tumor_vol = []

# Calculate the IQR and quantitatively determine if there are any potential outliers.
for drug in treatments:

    # Locate the rows which contain mice on each drug and get the tumor volumes
    tumor_vol = merged_data_lasttp.loc[merged_data_lasttp["Drug Regimen"] == drug, "Tumor Volume (mm3)"]

    # add subset
    total_tumor_vol.append(tumor_vol)

    # Determine outliers using upper and lower bounds
    quartiles = tumor_vol.quantile([.25, .5, .75])
    lowerq = quartiles[.25]
    upperq = quartiles[.75]
    iqr = upperq - lowerq 
    
    lower_bound = lowerq - (1.5*iqr)
    upper_bound = upperq + (1.5*iqr)
    outliers = tumor_vol.loc[(tumor_vol < lower_bound) | (tumor_vol > upper_bound)]
    print(f"{drug}, the outliers: {outliers}")

In [None]:
# Generate a box plot that shows the distribution of the tumor volume for each treatment group.
fig1, ax1 = plt.subplots(figsize=(6,5))
ax1.set_title('',fontsize =8)
ax1.set_ylabel('Final Tumor Volume (mm3)',fontsize = 8)
ax1.set_xlabel('Drug Regimen',fontsize = 8)
ax1.boxplot(total_tumor_vol, labels=treatments, widths = 0.4, patch_artist=True,vert=True)

plt.show()

## Line and Scatter Plots

In [None]:
# Generate a line plot of tumor volume vs. time point for a single mouse treated with Capomulin
forline_df = cap_df.loc[cap_df["Mouse ID"] == "l509",:]
forline_df.head()
# defined the x axis list by calling the timepoints from the l509 dataframe
x_axisTP = forline_df["Timepoint"] 
# defined the y axis or tumor size list by calling the tumor size from the dataframe
tumsiz = forline_df["Tumor Volume (mm3)"] 

# the plot function plt.plot() with x and y values and customizations
plt.title('Capomulin treatment of mouse l509') # created title
plt.plot(x_axisTP, tumsiz,linewidth=2, markersize=12) 
plt.xlabel('Timepoint (Days)')
plt.ylabel('Tumor Volume (mm3)')

plt.show()

In [None]:
# Generate a scatter plot of mouse weight vs. the average observed tumor volume for the entire Capomulin regimen
capomulin_df = merge_df.loc[merge_df['Drug Regimen'] == 'Capomulin']

# Find average tumor volume for each mouse

avg_vol_df = pd.DataFrame(capomulin_df.groupby('Mouse ID')['Tumor Volume (mm3)'].mean().sort_values()).reset_index().rename(columns={'Tumor Volume (mm3)': 'avg_tumor_vol'})

# Merge average tumor volume onto merge_df and drop duplicates
avg_vol_df = pd.merge(capomulin_df, avg_vol_df, on='Mouse ID')
final_avg_vol_df = avg_vol_df[['Weight (g)', 'avg_tumor_vol']].drop_duplicates()
final_avg_vol_df

x = final_avg_vol_df['Weight (g)']
y = final_avg_vol_df['avg_tumor_vol']

# Create a scatter plot based on new dataframe above with circle markers and listed colors
plt.scatter(x, y)

# Add labels and title to plot
plt.xlabel("Weight (g)")
plt.ylabel("Average Tumor Volume (mm3)")
plt.title('')
# Display plot
plt.show()


## Correlation and Regression

In [None]:
# Calculate the correlation coefficient and a linear regression model
# for mouse weight and average observed tumor volume for the entire Capomulin regimen

capomulin_df = merge_df.loc[merge_df['Drug Regimen'] == 'Capomulin']
avg_vol_df = pd.DataFrame(capomulin_df.groupby('Mouse ID')['Tumor Volume (mm3)'].mean().sort_values()).reset_index().rename(columns={'Tumor Volume (mm3)': 'avg_tumor_vol'})
avg_vol_df = pd.merge(capomulin_df, avg_vol_df, on='Mouse ID')
final_avg_vol_df = avg_vol_df[['Weight (g)', 'avg_tumor_vol']].drop_duplicates()
final_avg_vol_df
x = final_avg_vol_df['Weight (g)']
y = final_avg_vol_df['avg_tumor_vol']

# Correlation coefficient between mouse weight and average tumor volume
correlation = st.pearsonr(x,y)

# Print the answer to above calculation
print(f"""The correlation between weight and average tumor volume
on the Capomulin regimen is {round(correlation[0],2)}.""")

# Calculate linear regression
(slope, intercept, rvalue, pvalue, stderr) = linregress(x, y)
regress_values = x * slope + intercept
line_eq = "y = " + str(round(slope,2)) + "x + " + str(round(intercept,2))

# Plot linear regression on top of scatter plot
plt.scatter(x,y)
plt.plot(x,regress_values,"r-")

# Add labels and title to plot
plt.xlabel("Weight (g)")
plt.ylabel("Average Tumor Volume (mm3)")
plt.title('')
plt.show()