## Observations and Insights
**Study:** SCC tumor growth in mice treated with a variety of drug regimens

Results of a study comparing the performance of Pymaceuticals' drug of interest, Capomulin, against the other treatment regimens.
>
>The notebook contains the calculations and graphs of the study results that support the observations below:
>>- The effectiveness of a drug regimen is assessed by comparing summary statistics of the last measured tumor volume of each mouse under each regimen. The smaller the average final tumor volume, the more effective the regimen.
>>- The box plot illustrates how each regimen compares with the others: less variation in the final tumor volumes indicates a more predictable treatment outcome (see the box plot below).
>>- Capomulin effectively reduced the tumor volume over time in mouse m957 (see the line graph below).
>>- Among Capomulin-treated mice, final tumor volume appears to be positively correlated with mouse weight (correlation coefficient = 0.88).

**Conclusion:** *Capomulin* is about as effective at shrinking tumors as the most effective existing drug regimen, Ramicane (mean final tumor volume of 36.67 mm3 versus 36.19 mm3).

In [1]:
%matplotlib notebook

In [2]:
# Dependencies and Setup
import matplotlib.pyplot as plt
import pandas as pd
import scipy.stats as st
import numpy as np

# Study data files
mouse_metadata_path = "data/Mouse_metadata.csv"
study_results_path = "data/Study_results.csv"

# Read the mouse data and the study results
mouse_metadata = pd.read_csv(mouse_metadata_path)
study_results = pd.read_csv(study_results_path)

# Combine the data into a single dataset
mouse_results_df = pd.merge(mouse_metadata,study_results, on="Mouse ID")

# Display the data table for preview
mouse_results_df.head()

Unnamed: 0,Mouse ID,Drug Regimen,Sex,Age_months,Weight (g),Timepoint,Tumor Volume (mm3),Metastatic Sites
0,k403,Ramicane,Male,21,16,0,45.0,0
1,k403,Ramicane,Male,21,16,5,38.825898,0
2,k403,Ramicane,Male,21,16,10,35.014271,1
3,k403,Ramicane,Male,21,16,15,34.223992,1
4,k403,Ramicane,Male,21,16,20,32.997729,1


In [3]:
# Check the number of unique mice.
print(mouse_results_df["Mouse ID"].nunique())

249


In [4]:
# Keep only each mouse's final measurement: the row at its last recorded timepoint.
mouse_group = mouse_results_df.groupby(["Mouse ID"]) \
                                ['Timepoint'].transform("max") == mouse_results_df['Timepoint']

max_result_df = mouse_results_df[mouse_group]
max_result_df.head()

Unnamed: 0,Mouse ID,Drug Regimen,Sex,Age_months,Weight (g),Timepoint,Tumor Volume (mm3),Metastatic Sites
9,k403,Ramicane,Male,21,16,45,22.050126,1
19,s185,Capomulin,Female,3,17,45,23.343598,1
29,x401,Capomulin,Female,16,15,45,28.484033,0
39,m601,Capomulin,Male,22,17,45,28.430964,1
49,g791,Ramicane,Male,11,16,45,29.128472,1


In [5]:
# Checking the number of mice in the clean DataFrame.
print(max_result_df["Mouse ID"].count())

249
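
The cell above keeps one row per mouse (its last timepoint); it does not itself look for repeated records. If duplicated (Mouse ID, Timepoint) rows need to be checked first, a minimal sketch could look like the following. The toy data and the names `toy_df`, `dup_ids`, `clean_toy_df` and `final_toy_df` are hypothetical and are not part of the study notebook.

```python
import pandas as pd

# Toy data (hypothetical, not the study data): mouse "a1" has a repeated Timepoint 0 row.
toy_df = pd.DataFrame({
    "Mouse ID": ["a1", "a1", "a1", "b2", "b2"],
    "Timepoint": [0, 0, 5, 0, 5],
    "Tumor Volume (mm3)": [45.0, 45.0, 47.1, 45.0, 43.9],
})

# Flag every row whose (Mouse ID, Timepoint) pair occurs more than once.
dup_mask = toy_df.duplicated(subset=["Mouse ID", "Timepoint"], keep=False)
dup_ids = toy_df.loc[dup_mask, "Mouse ID"].unique()

# Drop all records for mice that have duplicated timepoints.
clean_toy_df = toy_df[~toy_df["Mouse ID"].isin(dup_ids)]

# Keep only the last recorded timepoint of each remaining mouse.
final_rows = clean_toy_df.groupby("Mouse ID")["Timepoint"].idxmax()
final_toy_df = clean_toy_df.loc[final_rows]

print(list(dup_ids), final_toy_df["Mouse ID"].tolist())
```

Whether a mouse with duplicated records should be dropped entirely or only de-duplicated is a judgment call; this sketch drops it.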


## Summary Statistics

In [6]:
# Generate a summary statistics table of mean, median, variance, standard deviation, and SEM of the tumor volume for each regimen
stat_tumor_by_regimen_df = max_result_df[['Drug Regimen','Tumor Volume (mm3)']]

# Use groupby and summary statistical methods to calculate the following properties of each drug regimen: 
summary_stat_by_regimen = stat_tumor_by_regimen_df.groupby("Drug Regimen").describe()

# mean, median, variance, standard deviation, and SEM of the tumor volume. 
mean_by_regimen = stat_tumor_by_regimen_df.groupby("Drug Regimen").mean()
median_by_regimen = stat_tumor_by_regimen_df.groupby("Drug Regimen").median()
var_by_regimen = stat_tumor_by_regimen_df.groupby("Drug Regimen").var()
std_by_regimen = stat_tumor_by_regimen_df.groupby("Drug Regimen").std()
sem_by_regimen = stat_tumor_by_regimen_df.groupby("Drug Regimen").sem()

# Assemble the resulting series into a single summary dataframe.
sum_stats_by_regimen_df = pd.DataFrame({"Mean":mean_by_regimen['Tumor Volume (mm3)'],
                                        "Median":median_by_regimen['Tumor Volume (mm3)'],
                                        "Variance":var_by_regimen['Tumor Volume (mm3)'],
                                        "Std Dev":std_by_regimen['Tumor Volume (mm3)'],
                                        "SEM":sem_by_regimen['Tumor Volume (mm3)']}).reset_index()
summary_stat_by_regimen

Drug Regimen,count,mean,std,min,25%,50%,75%,max
Capomulin,25.0,36.667568,5.715188,23.343598,32.377357,38.125164,40.15922,47.685963
Ceftamin,25.0,57.753977,8.365568,45.0,48.722078,59.851956,64.29983,68.923185
Infubinol,25.0,58.178246,8.602957,36.321346,54.048608,60.16518,65.525743,72.226731
Ketapril,25.0,62.806191,9.94592,45.0,56.720095,64.487812,69.872251,78.567014
Naftisol,25.0,61.205757,10.297083,45.0,52.07951,63.283288,69.563621,76.668817
Placebo,25.0,60.508414,8.874672,45.0,52.942902,62.030594,68.134288,73.212939
Propriva,25.0,56.736964,8.327605,45.0,49.122969,55.84141,62.57088,72.455421
Ramicane,25.0,36.19139,5.671539,22.050126,31.56047,36.561652,40.659006,45.220869
Stelasyn,24.0,61.001707,9.504293,45.0,52.476596,62.19235,69.103944,75.12369
Zoniferol,25.0,59.181258,8.767099,45.0,49.988302,61.840058,66.794156,73.324432
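
`describe()` does not report the standard error of the mean, but it can be recovered from the `std` and `count` columns as std / sqrt(count). The short check below uses two rows copied from the table above; the variable names are illustrative only.

```python
import math

# (std, count) pairs copied from the describe() table above.
described = {"Capomulin": (5.715188, 25), "Stelasyn": (9.504293, 24)}

sem_values = {}
for regimen, (std, n) in described.items():
    # Standard error of the mean = sample standard deviation / sqrt(sample size).
    sem_values[regimen] = std / math.sqrt(n)
    print(f"{regimen}: SEM = {sem_values[regimen]:.4f}")
```

The results should agree with the SEM column of the summary table (1.143038 for Capomulin and 1.940056 for Stelasyn) up to rounding.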


In [7]:
# Display the summary table assembled above: mean, median, variance, standard deviation, and SEM of the tumor volume for each regimen
sum_stats_by_regimen_df

# Using the aggregation method, produce the same summary statistics in a single line
# (not implemented in this notebook; the table was built from separate groupby calls)

Unnamed: 0,Drug Regimen,Mean,Median,Variance,Std Dev,SEM
0,Capomulin,36.667568,38.125164,32.663378,5.715188,1.143038
1,Ceftamin,57.753977,59.851956,69.982735,8.365568,1.673114
2,Infubinol,58.178246,60.16518,74.010875,8.602957,1.720591
3,Ketapril,62.806191,64.487812,98.92133,9.94592,1.989184
4,Naftisol,61.205757,63.283288,106.029927,10.297083,2.059417
5,Placebo,60.508414,62.030594,78.759797,8.874672,1.774934
6,Propriva,56.736964,55.84141,69.349002,8.327605,1.665521
7,Ramicane,36.19139,36.561652,32.166354,5.671539,1.134308
8,Stelasyn,61.001707,62.19235,90.331586,9.504293,1.940056
9,Zoniferol,59.181258,61.840058,76.862027,8.767099,1.75342
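
The "aggregation method" mentioned in the cell above most likely refers to pandas' `GroupBy.agg`, which can compute several statistics in one expression. A minimal sketch on toy data follows; the values and the names `toy_df` and `agg_summary` are hypothetical.

```python
import pandas as pd

# Toy data (hypothetical values, not the study data).
toy_df = pd.DataFrame({
    "Drug Regimen": ["A", "A", "A", "B", "B", "B"],
    "Tumor Volume (mm3)": [30.0, 32.0, 34.0, 50.0, 55.0, 60.0],
})

# One groupby plus agg yields all five statistics in a single expression.
agg_summary = toy_df.groupby("Drug Regimen")["Tumor Volume (mm3)"].agg(
    ["mean", "median", "var", "std", "sem"]
)
print(agg_summary)
```

Applied to the notebook's data, the same pattern on `stat_tumor_by_regimen_df` should reproduce the table above, assuming that frame is defined as in the earlier cell.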


## Bar and Pie Charts

In [8]:
# Generate a bar plot showing the number of mice (one final measurement per mouse) on each drug regimen using pandas.
count_by_regimen = stat_tumor_by_regimen_df.groupby("Drug Regimen").count().reset_index()
count_by_regimen = count_by_regimen.rename(columns={'Tumor Volume (mm3)':'Count of Mice'})
x_axis = np.arange(len(count_by_regimen))
count_by_regimen.plot(kind='bar',x= "Drug Regimen",y= 'Count of Mice').legend(loc='lower right')
plt.title("Total Number of Mice per Drug Regimen using pandas")
plt.xlabel("Drug Regimen")
plt.ylabel("Number of Mice")
plt.tight_layout()

[Figure: bar chart, "Total Number of Mice per Drug Regimen using pandas"]

In [9]:
# Generate a bar plot showing the number of mice (one final measurement per mouse) on each drug regimen using pyplot.
plt.figure()
plt.bar(x_axis, count_by_regimen['Count of Mice'], color='g', alpha=0.5, align="center")

# Format bar chart
tick_locations = list(x_axis)
plt.xticks(tick_locations, count_by_regimen['Drug Regimen'], rotation="vertical")

plt.title("Total Number of Mice per Drug Regimen using pyplot")
plt.xlabel("Drug Regimen")
plt.ylabel("Number of Mice")
plt.tight_layout()


[Figure: bar chart, "Total Number of Mice per Drug Regimen using pyplot"]
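
The two bar charts count mice, because they are built from the one-row-per-mouse frame. If the total number of measurements per regimen is wanted instead, the rows of the full merged data can be counted. The sketch below contrasts the two counts on toy data; all names and values are hypothetical.

```python
import pandas as pd

# Toy data (hypothetical): regimen "A" has two mice, regimen "B" has one.
toy_df = pd.DataFrame({
    "Mouse ID": ["a1", "a1", "a2", "b1", "b1", "b1"],
    "Drug Regimen": ["A", "A", "A", "B", "B", "B"],
    "Timepoint": [0, 5, 0, 0, 5, 10],
})

# Every row is one measurement, so row counts give measurements per regimen.
measurements_per_regimen = toy_df["Drug Regimen"].value_counts().sort_index()

# Unique mouse IDs give the number of mice per regimen.
mice_per_regimen = toy_df.groupby("Drug Regimen")["Mouse ID"].nunique()

print(measurements_per_regimen.to_dict(), mice_per_regimen.to_dict())
```

Either series can be passed to `.plot(kind="bar")` in the same way as the charts above.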

In [10]:
# Generate a pie plot showing the distribution of female versus male mice using pandas
count_by_gender_df = max_result_df.groupby('Sex').count()
count_by_gender_df = count_by_gender_df.rename(columns={'Mouse ID':'Count of Mice'})

plot = count_by_gender_df.plot(kind='pie',y='Count of Mice',title="Count of Mice by Gender using pandas",shadow=True,explode=(0,0.05), autopct='%1.1f%%',startangle=60)


[Figure: pie chart, "Count of Mice by Gender using pandas"]

In [11]:
# Generate a pie plot showing the distribution of female versus male mice using pyplot
plt.figure()

plt.pie(count_by_gender_df['Count of Mice'],labels=count_by_gender_df.index,shadow=True,autopct='%1.1f%%',explode=(0,0.05),startangle=60)
plt.title("Count of Mice by Gender using pyplot")
plt.ylabel("Count of Mice")
plt.axis("equal")

[Figure: pie chart, "Count of Mice by Gender using pyplot"]

## Quartiles, Outliers and Boxplots

In [12]:
# Calculate the final tumor volume of each mouse across four of the treatment regimens:  

#Capomulin 
capomulin_df = max_result_df.loc[max_result_df['Drug Regimen'] == "Capomulin"]
capomulin_df = capomulin_df.sort_values(['Tumor Volume (mm3)'])
capomulin = capomulin_df['Tumor Volume (mm3)']
capomulin_quartiles = capomulin.quantile([0.25,0.5,0.75])
capomulin_lowerq = capomulin_quartiles[0.25]
capomulin_upperq = capomulin_quartiles[0.75]
capomulin_iqr = capomulin_upperq-capomulin_lowerq
capomulin_lower_bound = capomulin_lowerq - (1.5*capomulin_iqr)
capomulin_upper_bound = capomulin_upperq + (1.5*capomulin_iqr)
capomulin_outlier_mice = capomulin_df.loc[(capomulin_df['Tumor Volume (mm3)'] < capomulin_lower_bound) \
                                | (capomulin_df['Tumor Volume (mm3)'] > capomulin_upper_bound)]

print(f"The lower quartile of Capomulin is: {round(capomulin_lowerq,2)}")
print(f"The upper quartile of Capomulin is: {round(capomulin_upperq,2)}")
print(f"The interquartile range (IQR) of Capomulin is: {round(capomulin_iqr,3)}")
print(f"The median of Capomulin is: {round(capomulin_quartiles[0.5],2)}")
print(f"Values below {round(capomulin_lower_bound,2)} could be outliers.")
print(f"Values above {round(capomulin_upper_bound,2)} could be outliers.")
print(f"Number of Outlier values = {capomulin_outlier_mice['Tumor Volume (mm3)'].count()}")

The lower quartile of Capomulin is: 32.38
The upper quartile of Capomulin is: 40.16
The interquartile range (IQR) of Capomulin is: 7.782
The median of Capomulin is: 38.13
Values below 20.7 could be outliers.
Values above 51.83 could be outliers.
Number of Outlier values = 0


In [13]:
#Ramicane 
ramicane_df = max_result_df.loc[max_result_df['Drug Regimen'] == "Ramicane"]
ramicane_df = ramicane_df.sort_values(['Tumor Volume (mm3)'])
ramicane = ramicane_df['Tumor Volume (mm3)']
ramicane_quartiles = ramicane.quantile([.25,.5,.75])
ramicane_lowerq = ramicane_quartiles[0.25]
ramicane_upperq = ramicane_quartiles[0.75]
ramicane_iqr = ramicane_upperq-ramicane_lowerq
ramicane_lower_bound = ramicane_lowerq - (1.5*ramicane_iqr)
ramicane_upper_bound = ramicane_upperq + (1.5*ramicane_iqr)
ramicane_outlier_mice = ramicane_df.loc[(ramicane_df['Tumor Volume (mm3)'] < ramicane_lower_bound) \
                                | (ramicane_df['Tumor Volume (mm3)'] > ramicane_upper_bound)]

print(f"The lower quartile of Ramicane is: {round(ramicane_lowerq,2)}")
print(f"The upper quartile of Ramicane is: {round(ramicane_upperq,2)}")
print(f"The interquartile range (IQR) of Ramicane is: {round(ramicane_iqr,3)}")
print(f"The median of Ramicane is: {round(ramicane_quartiles[0.5],2)}")
print(f"Values below {round(ramicane_lower_bound,2)} could be outliers.")
print(f"Values above {round(ramicane_upper_bound,2)} could be outliers.")
print(f"Number of Outlier values = {ramicane_outlier_mice['Tumor Volume (mm3)'].count()}")

The lower quartile of Ramicane is: 31.56
The upper quartile of Ramicane is: 40.66
The interquartile range (IQR) of Ramicane is: 9.099
The median of Ramicane is: 36.56
Values below 17.91 could be outliers.
Values above 54.31 could be outliers.
Number of Outlier values = 0


In [14]:
#Infubinol
infubinol_df = max_result_df.loc[max_result_df['Drug Regimen'] == "Infubinol"]
infubinol_df = infubinol_df.sort_values(['Tumor Volume (mm3)'])
infubinol = infubinol_df['Tumor Volume (mm3)']
infubinol_quartiles = infubinol.quantile([.25,.5,.75])
infubinol_lowerq = infubinol_quartiles[0.25]
infubinol_upperq = infubinol_quartiles[0.75]
infubinol_iqr = infubinol_upperq-infubinol_lowerq
infubinol_lower_bound = infubinol_lowerq - (1.5*infubinol_iqr)
infubinol_upper_bound = infubinol_upperq + (1.5*infubinol_iqr)
infubinol_outlier_mice = infubinol_df.loc[(infubinol_df['Tumor Volume (mm3)'] < infubinol_lower_bound) \
                                | (infubinol_df['Tumor Volume (mm3)'] > infubinol_upper_bound)]

print(f"The lower quartile of Infubinol is: {round(infubinol_lowerq,2)}")
print(f"The upper quartile of Infubinol is: {round(infubinol_upperq,2)}")
print(f"The interquartile range (IQR) of Infubinol is: {round(infubinol_iqr,2)}")
print(f"The median of Infubinol is: {round(infubinol_quartiles[0.5],2)}")
print(f"Values below {round(infubinol_lower_bound,2)} could be outliers.")
print(f"Values above {round(infubinol_upper_bound,2)} could be outliers.")
print(f"Outlier value(s) = {infubinol_outlier_mice['Tumor Volume (mm3)']}")


The lower quartile of Infubinol is: 54.05
The upper quartile of Infubinol is: 65.53
The interquartile range (IQR) of Infubinol is: 11.48
The median of Infubinol is: 60.17
Values below 36.83 could be outliers.
Values above 82.74 could be outliers.
Outlier value(s) = 669    36.321346
Name: Tumor Volume (mm3), dtype: float64


In [15]:
#Ceftamin
ceftamin_df = max_result_df.loc[max_result_df['Drug Regimen'] == "Ceftamin"].reset_index()
ceftamin_df = ceftamin_df.sort_values(['Tumor Volume (mm3)'])
ceftamin = ceftamin_df['Tumor Volume (mm3)']
ceftamin_quartiles = ceftamin.quantile([.25,.5,.75])
ceftamin_lowerq = ceftamin_quartiles[0.25]
ceftamin_upperq = ceftamin_quartiles[0.75]
ceftamin_iqr = ceftamin_upperq - ceftamin_lowerq
ceftamin_lower_bound = ceftamin_lowerq - (1.5*ceftamin_iqr)
ceftamin_upper_bound = ceftamin_upperq + (1.5*ceftamin_iqr)
ceftamin_outlier_mice = ceftamin_df.loc[(ceftamin_df['Tumor Volume (mm3)'] < ceftamin_lower_bound) \
                                | (ceftamin_df['Tumor Volume (mm3)'] > ceftamin_upper_bound)]

print(f"The lower quartile of Ceftamin is: {round(ceftamin_lowerq,2)}")
print(f"The upper quartile of Ceftamin is: {round(ceftamin_upperq,2)}")
print(f"The interquartile range (IQR) of Ceftamin is: {round(ceftamin_iqr,2)}")
print(f"The median of Ceftamin is: {round(ceftamin_quartiles[0.5],2)}")
print(f"Values below {round(ceftamin_lower_bound,2)} could be outliers.")
print(f"Values above {round(ceftamin_upper_bound,2)} could be outliers.")
print(f"Number of Outlier values = {ceftamin_outlier_mice['Tumor Volume (mm3)'].count()}")

The lower quartile of Ceftamin is: 48.72
The upper quartile of Ceftamin is: 64.3
The interquartile range (IQR) of Ceftamin is: 15.58
The median of Ceftamin is: 59.85
Values below 25.36 could be outliers.
Values above 87.67 could be outliers.
Number of Outlier values = 0


In [16]:
# Put treatments into a list for for loop (and later for plot labels)
drug_regimen_list = ['Capomulin','Ramicane','Infubinol','Ceftamin']
data = [capomulin,ramicane,infubinol,ceftamin]
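
The four quartile cells above repeat the same steps for each regimen. As a sketch of how that could be folded into one loop, the example below applies the same 1.5 × IQR rule to toy data. The helper `iqr_bounds`, the frame `toy_final_df` and its values are hypothetical and not part of the notebook.

```python
import pandas as pd

def iqr_bounds(values):
    """Return (lower quartile, upper quartile, IQR, lower bound, upper bound)."""
    lowerq, upperq = values.quantile([0.25, 0.75])
    iqr = upperq - lowerq
    return lowerq, upperq, iqr, lowerq - 1.5 * iqr, upperq + 1.5 * iqr

# Toy final-tumor-volume data (hypothetical values, not the study data).
toy_final_df = pd.DataFrame({
    "Drug Regimen": ["A"] * 5 + ["B"] * 5,
    "Tumor Volume (mm3)": [30.0, 31.0, 32.0, 33.0, 34.0,
                           10.0, 50.0, 51.0, 52.0, 53.0],
})

outliers_by_regimen = {}
for regimen in ["A", "B"]:
    is_regimen = toy_final_df["Drug Regimen"] == regimen
    volumes = toy_final_df.loc[is_regimen, "Tumor Volume (mm3)"]
    lowerq, upperq, iqr, lower_bound, upper_bound = iqr_bounds(volumes)
    # Values outside [lower_bound, upper_bound] are flagged as potential outliers.
    outliers = volumes[(volumes < lower_bound) | (volumes > upper_bound)]
    outliers_by_regimen[regimen] = outliers.tolist()
    print(f"{regimen}: IQR = {iqr:.2f}, potential outliers = {outliers.tolist()}")
```

In the notebook, the same loop could run over `drug_regimen_list` and `max_result_df`, collecting each regimen's series for the box plot as it goes.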

In [17]:
# Generate a box plot of the final tumor volume of each mouse across four regimens of interest
red_diamond = dict(markerfacecolor='r', marker='D')

fig1, ax1 = plt.subplots()
ax1.set_title('Final Tumor Volume Measurement by Drug Regimen')
ax1.boxplot(data, labels = drug_regimen_list,flierprops=red_diamond)
ax1.set_xlabel('Drug Regimen')
ax1.set_ylabel('Tumor Volume (mm3)')
plt.show()

[Figure: box plot, "Final Tumor Volume Measurement by Drug Regimen"; x-axis: Drug Regimen, y-axis: Tumor Volume (mm3)]

## Line and Scatter Plots

In [18]:
# Generate a line plot of tumor volume vs. time point for a mouse treated with Capomulin
m957_mouse_results = mouse_results_df.loc[mouse_results_df['Mouse ID'] == "m957"].reset_index()
line_m957 = m957_mouse_results.plot.line(x='Timepoint',y='Tumor Volume (mm3)')
plt.title("m957 on Capomulin Drug Regimen Tumor Volume over time")
plt.ylabel('Tumor Volume (mm3)')

[Figure: line plot, "m957 on Capomulin Drug Regimen Tumor Volume over time"; x-axis: Timepoint, y-axis: Tumor Volume (mm3)]

In [19]:
# Generate a scatter plot of final tumor volume vs. mouse weight for the Capomulin regimen
capomulin_df.plot(kind="scatter", x='Tumor Volume (mm3)', y='Weight (g)', grid=True,
              title="Capomulin regimen Tumor Volume Vs. Mouse Weight")
plt.show()

[Figure: scatter plot, "Capomulin regimen Tumor Volume Vs. Mouse Weight"; x-axis: Tumor Volume (mm3), y-axis: Weight (g)]

## Correlation and Regression

In [22]:
# Calculate the correlation coefficient
# for mouse weight and final tumor volume for the Capomulin regimen
capomulin_tumor = capomulin_df['Tumor Volume (mm3)']
capomulin_weight = capomulin_df['Weight (g)']
capomulin_describe = round(st.pearsonr(capomulin_tumor,capomulin_weight)[0],2)

print(f"The correlation coefficient between Tumor Size and Mouse Weight is {capomulin_describe}")

The correlation coefficient between Tumor Size and Mouse Weight is 0.88
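
The correlation above is computed on each mouse's final tumor volume. If the per-mouse average over all timepoints is wanted instead, the volumes can be averaged per mouse before correlating. The sketch below shows the pattern on toy data; the values and the names `toy_df` and `per_mouse` are hypothetical, and the resulting coefficient says nothing about the study data.

```python
import pandas as pd
import scipy.stats as st

# Toy longitudinal data (hypothetical): three mice, two timepoints each.
toy_df = pd.DataFrame({
    "Mouse ID": ["a1", "a1", "b2", "b2", "c3", "c3"],
    "Weight (g)": [15, 15, 17, 17, 20, 20],
    "Tumor Volume (mm3)": [45.0, 35.0, 45.0, 39.0, 45.0, 43.0],
})

# Average each mouse's weight and tumor volume over all of its timepoints.
per_mouse = toy_df.groupby("Mouse ID")[["Weight (g)", "Tumor Volume (mm3)"]].mean()

# Pearson correlation between weight and average tumor volume.
r, p_value = st.pearsonr(per_mouse["Weight (g)"], per_mouse["Tumor Volume (mm3)"])
print(f"r = {r:.2f}")
```

In the notebook, the equivalent would start from the Capomulin rows of `mouse_results_df` rather than from `capomulin_df`, which holds only the final measurements.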


In [21]:
# Linear regression model: mouse weight (y) as a function of final tumor volume (x)
vc_slope, vc_int, vc_r, vc_p, vc_std_err = st.linregress(capomulin_tumor, capomulin_weight)
vc_fit = vc_slope * capomulin_tumor + vc_int
plt.figure()  # start a new figure so the fit is not drawn on the previous plot
plt.scatter(capomulin_tumor, capomulin_weight)
plt.plot(capomulin_tumor,vc_fit,"--")
line_eq = "y = " + str(round(vc_slope,2)) + "x + " + str(round(vc_int,2))
plt.annotate(line_eq,(25,14),fontsize=15,color="red")
plt.xlabel('Tumor Volume (mm3)')
plt.ylabel('Weight of Mouse (g)')
plt.show()

print(line_eq)

y = 0.44x + 4.02
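
The result of `st.linregress` also carries the correlation coefficient, so the share of variance explained (r squared) and a prediction for a given tumor volume can be read off the same fit. A minimal sketch on toy data, with hypothetical values and names:

```python
import scipy.stats as st

# Toy data (hypothetical values): tumor volumes (x) and mouse weights (y).
tumor_volume = [30.0, 34.0, 38.0, 42.0, 46.0]
weight = [17.0, 19.0, 20.0, 23.0, 24.0]

fit = st.linregress(tumor_volume, weight)

# Coefficient of determination: share of the variance in weight explained by the line.
r_squared = fit.rvalue ** 2

# Weight predicted by the fitted line for a 40 mm3 tumor.
predicted_weight = fit.slope * 40.0 + fit.intercept

print(f"slope = {fit.slope:.2f}, intercept = {fit.intercept:.2f}")
print(f"r^2 = {r_squared:.3f}, predicted weight at 40 mm3 = {predicted_weight:.2f} g")
```

With the notebook's fitted line, y = 0.44x + 4.02, the same arithmetic gives 0.44 × 40 + 4.02 = 21.62 g for a 40 mm3 tumor.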

