# General Questions

(1) When presented with a new dataset or database, the steps I generally take to evaluate it prior to working with it:

1. What information does it provide?
2. Does the data give me enough information to answer my question/solve my problem?
3. If the dataset does not have all the information I need, are there any data tables that can be brought together to give a richer dataset?

 After confirming that and prior to setting up for a test:
 

4. If the dataset has the information I need, is the sample large enough to support statistically meaningful conclusions?
5. Are there any NaNs that need to be filtered out?
6. What are the overall statistics of the numerical data?
7. Are there any outliers that need to be taken out?
8. Are there similar features that would produce multicollinearity?
9. Does the data need to be rescaled?
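
Steps 4 to 9 can be sketched with pandas. The snippet below is a minimal illustration on a small made-up DataFrame; the column names (`age`, `dose_mg`, `dose_g`) and the 1.5 * IQR outlier rule are assumptions chosen for the example and are not part of the datasets analyzed later.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, for illustration only.
df = pd.DataFrame({
    "age": [34, 51, 47, np.nan, 62, 58],
    "dose_mg": [100.0, 150.0, 120.0, 130.0, 900.0, 110.0],
})
df["dose_g"] = df["dose_mg"] / 1000.0  # perfectly collinear with dose_mg

n_rows, n_cols = df.shape                # step 4: how much data is there?
missing_per_column = df.isnull().sum()   # step 5: NaNs per column
summary = df.describe()                  # step 6: overall statistics

# Step 7: flag outliers with the 1.5 * IQR rule (one common convention).
q1, q3 = df["dose_mg"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["dose_mg"] < q1 - 1.5 * iqr) | (df["dose_mg"] > q3 + 1.5 * iqr)]

# Step 8: highly correlated feature pairs hint at multicollinearity.
corr = df.corr()

# Step 9: standardize a column (zero mean, unit variance) if rescaling is needed.
dose_scaled = (df["dose_mg"] - df["dose_mg"].mean()) / df["dose_mg"].std()
```

Whether an outlier should actually be removed, or a correlated feature dropped, still depends on the question being asked; the code only surfaces candidates.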

(2) Based on the information provided and the attached dataset, the three questions I would like to understand prior to conducting any analysis of the data are:

1. How sick are the patients (early stage/late stage)?
2. Are the patients taking any other drugs besides those listed?
3. How much of the drugs are given?

# Data Analysis Questions

## EDA

In [1]:
import numpy as np
import pandas as pd
import plotly.plotly as py
import cufflinks as cf
from plotly import figure_factory as FF
import scipy.stats  # explicit submodule import so scipy.stats.ttest_ind is available later

In [2]:
#Reading in Patient_Diagnosis data

diagnosis = pd.read_csv("Patient_Diagnosis.csv")

In [3]:
diagnosis.head(10)

Unnamed: 0,patient_id,diagnosis_date,diagnosis_code,diagnosis
0,2120,1/9/10,174.1,Breast Cancer
1,2720,1/9/10,174.1,Breast Cancer
2,2038,1/21/10,174.9,Breast Cancer
3,2238,1/21/10,174.9,Breast Cancer
4,2175,2/17/10,174.7,Breast Cancer
5,2475,2/17/10,174.7,Breast Cancer
6,2407,6/13/10,174.9,Breast Cancer
7,2607,6/13/10,174.9,Breast Cancer
8,2425,12/15/10,174.9,Breast Cancer
9,3025,12/15/10,174.9,Breast Cancer


In [4]:
diagnosis.tail()

Unnamed: 0,patient_id,diagnosis_date,diagnosis_code,diagnosis
52,8480,5/16/13,174.3,Breast Cancer
53,8615,7/18/13,174.7,Breast Cancer
54,8827,7/21/13,174.9,Breast Cancer
55,9489,8/19/13,174.9,Breast Cancer
56,9331,8/23/13,174.9,Breast Cancer


In [5]:
diagnosis.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 57 entries, 0 to 56
Data columns (total 4 columns):
patient_id        57 non-null int64
diagnosis_date    57 non-null object
diagnosis_code    57 non-null float64
diagnosis         57 non-null object
dtypes: float64(1), int64(1), object(2)
memory usage: 1.9+ KB


In [6]:
#Reading in Patient_Treatment data

treatment = pd.read_csv("Patient_Treatment.csv")

In [7]:
treatment.head(10)

Unnamed: 0,patient_id,treatment_date,drug_code
0,2720,1/20/10,B
1,2238,1/21/10,B
2,2120,1/23/10,B
3,2038,1/24/10,A
4,2120,1/24/10,A
5,2038,1/24/10,B
6,2120,1/26/10,A
7,2120,1/26/10,B
8,2038,1/27/10,A
9,2038,1/27/10,B


In [8]:
treatment.tail()

Unnamed: 0,patient_id,treatment_date,drug_code
1091,2038,2/2/17,A
1092,2038,2/6/17,A
1093,2038,2/11/17,A
1094,2038,2/18/17,A
1095,2038,2/20/17,A


In [9]:
treatment.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1096 entries, 0 to 1095
Data columns (total 3 columns):
patient_id        1096 non-null int64
treatment_date    1096 non-null object
drug_code         1096 non-null object
dtypes: int64(1), object(2)
memory usage: 25.8+ KB


### (1) Distribution of cancer types across clinic's patients.

In [10]:
series = diagnosis['diagnosis'].value_counts()
series.head()

Breast Cancer    39
Colon Cancer     18
Name: diagnosis, dtype: int64

In [11]:
#Hovering over each bar gives the count for each cancer type, which verifies the previous series.head() output.

series.iplot(kind='bar', yTitle='Count', title='Distribution of Cancer Types')

### (2) How long it takes for patients to start therapy after being diagnosed (quality of care for the patient).

1. Verify that patients undergoing treatment are found in diagnosis dataset.
2. Join the two tables on patient_id.
3. Create a new column that is the difference between diagnosis_date and the FIRST treatment_date (will need to format dates).
4. Get summary statistics on numerical values (average, min, max).
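
The four steps above can be sketched end to end on toy data. This is a minimal sketch, not the notebook's actual code: the tables `diagnosis_toy` and `treatment_toy` are hypothetical, and taking the minimum treatment date per patient is one assumed way to find the first treatment without relying on row order.

```python
import pandas as pd

# Hypothetical toy tables that mimic the structure of the two CSV files.
diagnosis_toy = pd.DataFrame({
    "patient_id": [1, 2],
    "diagnosis_date": ["1/9/10", "2/17/10"],
})
treatment_toy = pd.DataFrame({
    "patient_id": [1, 1, 2, 2],
    "treatment_date": ["1/26/10", "1/23/10", "2/17/10", "3/1/10"],
})

# Step 1: every treated patient should appear in the diagnosis table.
all_known = treatment_toy["patient_id"].isin(diagnosis_toy["patient_id"]).all()

# Step 2: join on patient_id.
merged = diagnosis_toy.merge(treatment_toy, on="patient_id", how="left")

# Step 3: parse dates, then take the earliest treatment date per patient.
merged["diagnosis_date"] = pd.to_datetime(merged["diagnosis_date"], format="%m/%d/%y")
merged["treatment_date"] = pd.to_datetime(merged["treatment_date"], format="%m/%d/%y")
first_treatment = merged.groupby("patient_id").agg(
    diagnosis_date=("diagnosis_date", "first"),
    first_treatment_date=("treatment_date", "min"),
)
first_treatment["days_to_treatment"] = (
    first_treatment["first_treatment_date"] - first_treatment["diagnosis_date"]
).dt.days

# Step 4: summary statistics.
stats_summary = first_treatment["days_to_treatment"].describe()
```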

In [12]:
# 1. Verify that patients undergoing treatment are found in diagnosis dataset.

check_pat_id_inboth = treatment.patient_id.isin(diagnosis.patient_id)
check_pat_id_inboth.head(10)

0    True
1    True
2    True
3    True
4    True
5    True
6    True
7    True
8    True
9    True
Name: patient_id, dtype: bool

In [13]:
#Check if above series is all True by counting all distinct groups

check_pat_id_inboth.value_counts()

True    1096
Name: patient_id, dtype: int64

After confirming that every patient in the treatment dataset also appears in the diagnosis dataset, the two datasets are combined.

In [14]:
# 2. Join the two tables on patient_id.

overall_pat =  pd.merge(diagnosis, treatment, on='patient_id', how='outer')

In [15]:
overall_pat.head()

Unnamed: 0,patient_id,diagnosis_date,diagnosis_code,diagnosis,treatment_date,drug_code
0,2120,1/9/10,174.1,Breast Cancer,1/23/10,B
1,2120,1/9/10,174.1,Breast Cancer,1/24/10,A
2,2120,1/9/10,174.1,Breast Cancer,1/26/10,A
3,2120,1/9/10,174.1,Breast Cancer,1/26/10,B
4,2120,1/9/10,174.1,Breast Cancer,1/27/10,A


In [16]:
overall_pat.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1239 entries, 0 to 1238
Data columns (total 6 columns):
patient_id        1239 non-null int64
diagnosis_date    1239 non-null object
diagnosis_code    1239 non-null float64
diagnosis         1239 non-null object
treatment_date    1237 non-null object
drug_code         1237 non-null object
dtypes: float64(1), int64(1), object(4)
memory usage: 67.8+ KB


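The merged dataframe is larger than either input because each patient_id in the diagnosis dataset matches multiple rows in the treatment dataset (one row per treatment date and drug). The 1,237 matched rows exceed the 1,096 treatment rows, which means some patients also have more than one diagnosis row, so their treatment rows are repeated once per diagnosis. The two rows with a null treatment_date come from diagnosis rows that have no matching treatment.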
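
To see how an outer merge changes the row count, here is a small sketch on hypothetical tables (`diag` and `treat` are made up). The `indicator=True` argument of `pd.merge` adds a `_merge` column that records which side each row came from.

```python
import pandas as pd

# Hypothetical toy tables: patient 10 has two diagnosis rows, patient 30 has no treatment.
diag = pd.DataFrame({
    "patient_id": [10, 10, 20, 30],
    "diagnosis_code": [174.5, 174.8, 153.9, 174.9],
})
treat = pd.DataFrame({
    "patient_id": [10, 10, 10, 20],
    "drug_code": ["A", "A", "B", "B"],
})

merged = pd.merge(diag, treat, on="patient_id", how="outer", indicator=True)

# Patient 10: 2 diagnosis rows x 3 treatment rows = 6 merged rows.
rows_per_patient = merged.groupby("patient_id").size()

# "both" = matched rows, "left_only" = diagnosis rows without any treatment.
merge_counts = merged["_merge"].value_counts()
```

Four treatment rows become seven matched rows here, for the same reason the real merge produces more matched rows than there are treatment records.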

In [17]:
overall_pat.head(30)

Unnamed: 0,patient_id,diagnosis_date,diagnosis_code,diagnosis,treatment_date,drug_code
0,2120,1/9/10,174.1,Breast Cancer,1/23/10,B
1,2120,1/9/10,174.1,Breast Cancer,1/24/10,A
2,2120,1/9/10,174.1,Breast Cancer,1/26/10,A
3,2120,1/9/10,174.1,Breast Cancer,1/26/10,B
4,2120,1/9/10,174.1,Breast Cancer,1/27/10,A
5,2120,1/9/10,174.1,Breast Cancer,1/27/10,B
6,2120,1/9/10,174.1,Breast Cancer,1/29/10,A
7,2120,1/9/10,174.1,Breast Cancer,1/29/10,B
8,2120,1/9/10,174.1,Breast Cancer,2/1/10,A
9,2120,1/9/10,174.1,Breast Cancer,2/1/10,B


In [18]:
#Checking that each patient has its corresponding treatment data and that treatment dates are in ascending order
#Checking for Pat 2120
diagnosis.loc[diagnosis['patient_id'] == 2120]

Unnamed: 0,patient_id,diagnosis_date,diagnosis_code,diagnosis
0,2120,1/9/10,174.1,Breast Cancer


In [19]:
treatment_2120 = treatment.loc[treatment['patient_id'] == 2120]
treatment_2120.head(5)

Unnamed: 0,patient_id,treatment_date,drug_code
2,2120,1/23/10,B
4,2120,1/24/10,A
6,2120,1/26/10,A
7,2120,1/26/10,B
10,2120,1/27/10,A


In [20]:
treatment_2120.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 26 entries, 2 to 63
Data columns (total 3 columns):
patient_id        26 non-null int64
treatment_date    26 non-null object
drug_code         26 non-null object
dtypes: int64(1), object(2)
memory usage: 832.0+ bytes


Checked that all treatment rows for a particular patient_id (in this case, #2120) are matched with the patient_id from the diagnosis dataframe. 
Also, noticed that the dates are in chronological order. =)
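
The chronological-order check can also be done programmatically instead of by eye. The sketch below uses a hypothetical `toy` table; the dates are parsed first because strings such as "12/19/10" and "1/5/11" do not compare in date order.

```python
import pandas as pd

# Hypothetical treatment rows for two patients, stored as strings.
toy = pd.DataFrame({
    "patient_id": [1, 1, 1, 2, 2],
    "treatment_date": ["1/23/10", "1/24/10", "2/1/10", "12/19/10", "1/5/11"],
})
toy["treatment_date"] = pd.to_datetime(toy["treatment_date"], format="%m/%d/%y")

# True for each patient whose treatment dates never go backwards.
is_chronological = toy.groupby("patient_id")["treatment_date"].apply(
    lambda dates: dates.is_monotonic_increasing
)
all_sorted = bool(is_chronological.all())
```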

In [21]:
# 3. Create a new column that is the difference between diagnosis_date and the FIRST treatment_date (will need to format dates).

overall_pat['treatment_date'] = pd.to_datetime(overall_pat['treatment_date'])
overall_pat['treatment_date'].head(5)

0   2010-01-23
1   2010-01-24
2   2010-01-26
3   2010-01-26
4   2010-01-27
Name: treatment_date, dtype: datetime64[ns]

In [22]:
overall_pat['diagnosis_date'] = pd.to_datetime(overall_pat['diagnosis_date'])
overall_pat['diagnosis_date'].head(5)

0   2010-01-09
1   2010-01-09
2   2010-01-09
3   2010-01-09
4   2010-01-09
Name: diagnosis_date, dtype: datetime64[ns]

In [23]:
#New column is called first_treatment-diagnosis for now. It holds the gap for every treatment row; a later step (groupby + first) retains only the first treatment date for each patient.
overall_pat['first_treatment-diagnosis']= overall_pat['treatment_date']-overall_pat['diagnosis_date']

In [24]:
overall_pat.head(30)

Unnamed: 0,patient_id,diagnosis_date,diagnosis_code,diagnosis,treatment_date,drug_code,first_treatment-diagnosis
0,2120,2010-01-09,174.1,Breast Cancer,2010-01-23,B,14 days
1,2120,2010-01-09,174.1,Breast Cancer,2010-01-24,A,15 days
2,2120,2010-01-09,174.1,Breast Cancer,2010-01-26,A,17 days
3,2120,2010-01-09,174.1,Breast Cancer,2010-01-26,B,17 days
4,2120,2010-01-09,174.1,Breast Cancer,2010-01-27,A,18 days
5,2120,2010-01-09,174.1,Breast Cancer,2010-01-27,B,18 days
6,2120,2010-01-09,174.1,Breast Cancer,2010-01-29,A,20 days
7,2120,2010-01-09,174.1,Breast Cancer,2010-01-29,B,20 days
8,2120,2010-01-09,174.1,Breast Cancer,2010-02-01,A,23 days
9,2120,2010-01-09,174.1,Breast Cancer,2010-02-01,B,23 days


In [25]:
care_quality = overall_pat.groupby('patient_id').first()
care_quality

Unnamed: 0_level_0,diagnosis_date,diagnosis_code,diagnosis,treatment_date,drug_code,first_treatment-diagnosis
patient_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
2038,2010-01-21,174.9,Breast Cancer,2010-01-24,A,3 days
2120,2010-01-09,174.1,Breast Cancer,2010-01-23,B,14 days
2175,2010-02-17,174.7,Breast Cancer,2010-02-21,A,4 days
2238,2010-01-21,174.9,Breast Cancer,2010-01-21,B,0 days
2407,2010-06-13,174.9,Breast Cancer,2010-06-19,A,6 days
2425,2010-12-15,174.9,Breast Cancer,2010-12-19,A,4 days
2462,2011-01-07,174.9,Breast Cancer,2011-01-11,A,4 days
2475,2010-02-17,174.7,Breast Cancer,2010-02-17,B,0 days
2607,2010-06-13,174.9,Breast Cancer,2010-07-03,B,20 days
2634,2011-02-19,153.9,Colon Cancer,2011-12-20,A,304 days


In [76]:
care_quality['first_treatment-diagnosis'].describe()

count                         46
mean     11 days 11:28:41.739130
std      44 days 06:01:41.221946
min            -3 days +00:00:00
25%              4 days 00:00:00
50%              5 days 00:00:00
75%              6 days 00:00:00
max            304 days 00:00:00
Name: first_treatment-diagnosis, dtype: object

In [81]:
data = overall_pat['first_treatment-diagnosis']/np.timedelta64(1, 'D')
series2 = data.value_counts()
series2.head()

40.0    38
20.0    38
30.0    37
24.0    33
60.0    32
Name: first_treatment-diagnosis, dtype: int64

In [82]:
series2.iplot(kind='bar', xTitle='Number of Days', yTitle='Count', title='First_treatment-Diagnosis (Days)')

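Note that this bar chart is built from overall_pat, so it counts every treatment row rather than only each patient's first treatment. The summary statistics above are the better guide: the mean delay is about 11 days, but the median is 5 days and 75% of patients started treatment within 6 days of diagnosis.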

In [27]:
#Verifying patient_id = 2238 actually started treatment that soon
overall_pat.loc[overall_pat['patient_id'] == 2238]

Unnamed: 0,patient_id,diagnosis_date,diagnosis_code,diagnosis,treatment_date,drug_code,first_treatment-diagnosis
56,2238,2010-01-21,174.9,Breast Cancer,2010-01-21,B,0 days
57,2238,2010-01-21,174.9,Breast Cancer,2010-01-31,B,10 days
58,2238,2010-01-21,174.9,Breast Cancer,2010-02-10,B,20 days
59,2238,2010-01-21,174.9,Breast Cancer,2010-02-20,B,30 days
60,2238,2010-01-21,174.9,Breast Cancer,2010-03-02,B,40 days
61,2238,2010-01-21,174.9,Breast Cancer,2010-03-12,B,50 days
62,2238,2010-01-21,174.9,Breast Cancer,2010-03-22,B,60 days
63,2238,2010-01-21,174.9,Breast Cancer,2010-04-01,B,70 days
64,2238,2010-01-21,174.9,Breast Cancer,2012-09-18,B,971 days
65,2238,2010-01-21,174.9,Breast Cancer,2012-09-28,B,981 days


No lag between diagnosis and the start of treatment is indicative of great quality of care for this patient.

In [28]:
#Verifying that patient_id = 2634 actually took that long to start treatment
overall_pat.loc[overall_pat['patient_id'] == 2634]

Unnamed: 0,patient_id,diagnosis_date,diagnosis_code,diagnosis,treatment_date,drug_code,first_treatment-diagnosis
216,2634,2011-02-19,153.9,Colon Cancer,2011-12-20,A,304 days
217,2634,2011-02-19,153.9,Colon Cancer,2011-12-20,B,304 days
218,2634,2011-02-19,153.9,Colon Cancer,2011-12-24,A,308 days
219,2634,2011-02-19,153.9,Colon Cancer,2011-12-24,B,308 days
220,2634,2011-02-19,153.9,Colon Cancer,2012-01-01,A,316 days
221,2634,2011-02-19,153.9,Colon Cancer,2012-01-01,B,316 days
222,2634,2011-02-19,153.9,Colon Cancer,2012-01-05,A,320 days
223,2634,2011-02-19,153.9,Colon Cancer,2012-01-05,B,320 days
224,2634,2011-02-19,153.9,Colon Cancer,2012-01-09,A,324 days
225,2634,2011-02-19,153.9,Colon Cancer,2012-01-09,B,324 days


Quality of care for this patient needs to be investigated since approximately 10 months passed before this patient's treatment began.

In [29]:
#Investigating why values turned out to be Not a Number(NaN)/Not a Timestamp(NaT)
overall_pat.loc[overall_pat['patient_id'] == 4256]

Unnamed: 0,patient_id,diagnosis_date,diagnosis_code,diagnosis,treatment_date,drug_code,first_treatment-diagnosis
470,4256,2011-11-07,174.5,Breast Cancer,NaT,,NaT
471,4256,2011-11-07,174.8,Breast Cancer,NaT,,NaT


In [30]:
diagnosis.loc[diagnosis['patient_id'] == 4256]

Unnamed: 0,patient_id,diagnosis_date,diagnosis_code,diagnosis
24,4256,11/7/11,174.5,Breast Cancer
25,4256,11/7/11,174.8,Breast Cancer


In [31]:
treatment.loc[treatment['patient_id'] == 4256]

Unnamed: 0,patient_id,treatment_date,drug_code


Bingo! There is no treatment data for patient_id = 4256. Therefore, the values for treatment_date/first_treatment-diagnosis and drug_code were NaT and NaN.

In [34]:
treatment.isnull().values.any()

False

In [35]:
#Checking whether the patient_id column of the treatment dataframe is empty
treatment['patient_id'].empty

False

Since the treatment dataset contains no NaNs and is not empty, patient_id 4256 simply has no treatment records.

Therefore, this patient's care was nonexistent in this dataset, so his/her care needs to be investigated as well.
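
Rather than inspecting one patient_id at a time, all diagnosed patients without treatment records can be listed in one step. This is a minimal sketch on hypothetical tables (`diag` and `treat` are made up for the example).

```python
import pandas as pd

# Hypothetical toy tables, for illustration only.
diag = pd.DataFrame({"patient_id": [100, 200, 200, 300]})
treat = pd.DataFrame({"patient_id": [100, 100, 300]})

# Diagnosis rows whose patient_id never appears in the treatment table.
no_treatment_rows = diag[~diag["patient_id"].isin(treat["patient_id"])]
untreated_ids = sorted(no_treatment_rows["patient_id"].unique().tolist())
```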

Lastly, patient_id = 8827 has a negative value for the time to start treatment, meaning the first recorded treatment precedes the recorded diagnosis date. One possible explanation is that the doctor prescribed drugs to try before there was an official diagnosis; if so, the quality of care appears to be high, since the doctor seems to have been very vigilant. A data-entry error in either date is another possibility. At the same time, patients who did not receive treatment in a timely manner could have been non-compliant. Regardless of the actual reason, these cases need to be factored in when assessing quality of care at this cancer clinic.

In [36]:
# 4. Get summary statistics on care_quality['first_treatment-diagnosis'].

care_quality['first_treatment-diagnosis'].describe()

count                         46
mean     11 days 11:28:41.739130
std      44 days 06:01:41.221946
min            -3 days +00:00:00
25%              4 days 00:00:00
50%              5 days 00:00:00
75%              6 days 00:00:00
max            304 days 00:00:00
Name: first_treatment-diagnosis, dtype: object

Conclusion: On average, patients took about 11 days to start treatment after a cancer diagnosis. The median was 5 days, so the mean is pulled upward by outliers such as the 304-day delay.

### (3) Which treatment regimens would be indicated to be used as first-line of treatment for breast cancer and colon cancer?

According to NCI Dictionary of Cancer Terms, **first-line therapy**:

The first treatment given for a disease. It is often part of a standard set of treatments, such as surgery followed by chemotherapy and radiation. When used by itself, first-line therapy is the one _accepted as the best treatment_. **If it doesn’t cure the disease or it causes severe side effects, other treatment may be added or used instead**.

Based on the above definition, the first-line drug for Breast Cancer and Colon Cancer is identified by splitting the care_quality dataframe into two smaller dataframes, one per diagnosis. Within each smaller dataframe, the drug used most often is classified as the first-line therapy.

As a reminder, the dataset that will be used only considers what drug was used on first treatment date for each patient.
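
One compact way to express this idea is to keep each patient's earliest treatment row and cross-tabulate diagnosis against drug. The sketch below uses a hypothetical `events` table and is an illustration of the approach, not the code used in this notebook.

```python
import pandas as pd

# Hypothetical toy data: one row per treatment event.
events = pd.DataFrame({
    "patient_id": [1, 1, 2, 2, 3, 4],
    "diagnosis": ["Breast Cancer"] * 4 + ["Colon Cancer"] * 2,
    "treatment_date": pd.to_datetime(
        ["2010-01-24", "2010-01-23", "2010-02-21", "2010-03-01", "2011-07-25", "2011-04-22"]
    ),
    "drug_code": ["A", "B", "A", "B", "D", "A"],
})

# Keep each patient's earliest treatment row, then count drugs per diagnosis.
first_rows = events.sort_values("treatment_date").drop_duplicates("patient_id", keep="first")
first_line_counts = pd.crosstab(first_rows["diagnosis"], first_rows["drug_code"])
```

A caveat of keeping one row per patient: when two drugs are given on the same first day, only one of them is counted, so such patients may deserve a separate look.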

In [37]:
gb = care_quality.groupby('diagnosis')
[gb.get_group(x) for x in gb.groups]

[           diagnosis_date  diagnosis_code      diagnosis treatment_date  \
 patient_id                                                                
 2038           2010-01-21           174.9  Breast Cancer     2010-01-24   
 2120           2010-01-09           174.1  Breast Cancer     2010-01-23   
 2175           2010-02-17           174.7  Breast Cancer     2010-02-21   
 2238           2010-01-21           174.9  Breast Cancer     2010-01-21   
 2407           2010-06-13           174.9  Breast Cancer     2010-06-19   
 2425           2010-12-15           174.9  Breast Cancer     2010-12-19   
 2462           2011-01-07           174.9  Breast Cancer     2011-01-11   
 2475           2010-02-17           174.7  Breast Cancer     2010-02-17   
 2607           2010-06-13           174.9  Breast Cancer     2010-07-03   
 2720           2010-01-09           174.1  Breast Cancer     2010-01-20   
 2735           2011-04-18           174.9  Breast Cancer     2011-04-23   
 2762       

In [38]:
#First, Breast Cancer treatment is analyzed.
breast_cancer = gb.get_group('Breast Cancer')  # select by name instead of relying on group order
breast_cancer

Unnamed: 0_level_0,diagnosis_date,diagnosis_code,diagnosis,treatment_date,drug_code,first_treatment-diagnosis
patient_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
2038,2010-01-21,174.9,Breast Cancer,2010-01-24,A,3 days
2120,2010-01-09,174.1,Breast Cancer,2010-01-23,B,14 days
2175,2010-02-17,174.7,Breast Cancer,2010-02-21,A,4 days
2238,2010-01-21,174.9,Breast Cancer,2010-01-21,B,0 days
2407,2010-06-13,174.9,Breast Cancer,2010-06-19,A,6 days
2425,2010-12-15,174.9,Breast Cancer,2010-12-19,A,4 days
2462,2011-01-07,174.9,Breast Cancer,2011-01-11,A,4 days
2475,2010-02-17,174.7,Breast Cancer,2010-02-17,B,0 days
2607,2010-06-13,174.9,Breast Cancer,2010-07-03,B,20 days
2720,2010-01-09,174.1,Breast Cancer,2010-01-20,B,11 days


In [39]:
breast_cancer.drug_code.value_counts()

A    19
B     9
C     3
Name: drug_code, dtype: int64

In [87]:
data = breast_cancer['drug_code']
series3 = data.value_counts()
series3.head()

series3.iplot(kind='bar', xTitle='drug_code', yTitle='Count', title='Breast Cancer - Line of Treatment')

For a Breast Cancer diagnosis, the most commonly used first drug is A (19 of 31 patients). Frequency of use does not by itself prove that a drug is the most effective, but based on the first-line therapy definition above, it suggests that drug A is the regimen most indicated as first-line therapy.

In [40]:
#Now, Colon Cancer treatment is analyzed.
colon_cancer = gb.get_group('Colon Cancer')  # select by name instead of relying on group order
colon_cancer

Unnamed: 0_level_0,diagnosis_date,diagnosis_code,diagnosis,treatment_date,drug_code,first_treatment-diagnosis
patient_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
2634,2011-02-19,153.9,Colon Cancer,2011-12-20,A,304 days
2770,2011-04-16,153.9,Colon Cancer,2011-04-22,A,6 days
3070,2011-07-25,153.9,Colon Cancer,2011-07-25,D,0 days
3095,2011-07-10,153.3,Colon Cancer,2011-07-13,B,3 days
3395,2011-10-18,153.3,Colon Cancer,2011-10-18,D,0 days
3449,2011-09-09,153.5,Colon Cancer,2011-09-13,A,4 days
3749,2011-12-18,153.5,Colon Cancer,2011-12-18,D,0 days
4057,2012-01-25,153.4,Colon Cancer,2012-01-25,D,0 days
6837,2012-10-20,153.3,Colon Cancer,2012-10-25,B,5 days
6840,2012-11-15,153.4,Colon Cancer,2012-11-20,B,5 days


In [41]:
colon_cancer.drug_code.value_counts()

B    4
D    4
A    4
C    3
Name: drug_code, dtype: int64

In [88]:
data = colon_cancer['drug_code']
series4 = data.value_counts()
series4.head()

series4.iplot(kind='bar', xTitle='drug_code', yTitle='Count', title='Colon Cancer - Line of Treatment')

Here, the distribution is not as skewed: drugs A, B and D are each used 4 times and drug C 3 times, so no drug is predominantly chosen first based on value counts alone. The order in which the drug_codes are listed (B, D, A, C rather than A, B, C, D) only reflects how value_counts orders tied counts and carries no meaning. Looking more closely at when each drug is given, however, shows a preference based on the number of days between diagnosis and treatment.

Drug D has a time difference of 0 days between diagnosis date and treatment date, meaning it is the drug chosen on the day of diagnosis. This may indicate an aggressive treatment, possibly for late-stage Colon Cancer patients. The remaining drugs, A, B and C, have averages of 6, 6.3 and 7 days respectively (discounting the outlier) from diagnosis date to treatment date. These drugs are therefore possibly given to Colon Cancer patients about a week after diagnosis, possibly because those patients do not have late-stage Colon Cancer.
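
The per-drug delay comparison can be computed directly with a groupby. The values below are toy numbers loosely modeled on the rows shown above, not the full dataset, and the column name `days_to_treatment` is an assumption for the example.

```python
import pandas as pd

# Hypothetical first-treatment rows for Colon Cancer patients.
first_rows = pd.DataFrame({
    "drug_code": ["A", "A", "A", "B", "B", "D", "D"],
    "days_to_treatment": [304, 6, 4, 3, 5, 0, 0],
})

# Mean and median delay per drug; the median is less sensitive to outliers.
delay_by_drug = first_rows.groupby("drug_code")["days_to_treatment"].agg(
    ["mean", "median", "count"]
)
```

Reporting the median alongside the mean avoids having to discount outliers by hand.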

### (4) Do the patients taking Regimen A vs. Regimen B as first-line therapy for Breast Cancer vary in terms of duration of therapy?

1. Look at each Breast Cancer patient only, and see if they have Regimen A/Regimen B as first-line therapy. 
2. Then for each of these patients, on top of having first day of treatment for Regimen A/B, need to grab last day of treatment for given Regimen A/B.
3. Get difference between these dates.
4. Compare differences in Regimen A versus Regimen B. 
    a. Preliminary assessment: Get overall statistics on the new column (mean, standard deviation, and the 25%, 50% and 75% quartiles).  
    b. Final assessment: Perform a t-test to assess the difference in duration of therapy between Regimen A and Regimen B.
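
Steps 2 and 3 can be condensed into a single groupby that takes the first and last treatment date per patient and regimen. This is a minimal sketch on a hypothetical `events` table; it is an alternative formulation, not the cell-by-cell approach used below.

```python
import pandas as pd

# Hypothetical treatment events for Breast Cancer patients.
events = pd.DataFrame({
    "patient_id": [1, 1, 1, 1, 2, 2],
    "drug_code": ["A", "A", "B", "B", "A", "A"],
    "treatment_date": pd.to_datetime(
        ["2010-01-24", "2010-03-02", "2010-01-23", "2010-03-02", "2010-02-21", "2010-04-03"]
    ),
})

# Duration = last treatment date minus first treatment date, per patient and drug.
span = events.groupby(["patient_id", "drug_code"])["treatment_date"].agg(["min", "max"])
span["duration_days"] = (span["max"] - span["min"]).dt.days

# One column per regimen makes the A-versus-B comparison easy to read.
duration_table = span["duration_days"].unstack("drug_code")
```

Using min and max does not depend on the rows being sorted by date, and patients who never received a regimen show up as NaN in that regimen's column.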

In [42]:
#As reminder, here is the df with all treatment dates per patient. The first and last treatment dates need to be grabbed from this.
overall_pat2 = overall_pat.drop(['first_treatment-diagnosis'], axis = 1)
overall_pat2.head(5)

Unnamed: 0,patient_id,diagnosis_date,diagnosis_code,diagnosis,treatment_date,drug_code
0,2120,2010-01-09,174.1,Breast Cancer,2010-01-23,B
1,2120,2010-01-09,174.1,Breast Cancer,2010-01-24,A
2,2120,2010-01-09,174.1,Breast Cancer,2010-01-26,A
3,2120,2010-01-09,174.1,Breast Cancer,2010-01-26,B
4,2120,2010-01-09,174.1,Breast Cancer,2010-01-27,A


In [43]:
#Separating out Breast Cancer diagnosis from Colon Cancer diagnosis
gb2 = overall_pat2.groupby('diagnosis')
[gb2.get_group(x) for x in gb2.groups]

[      patient_id diagnosis_date  diagnosis_code      diagnosis treatment_date  \
 0           2120     2010-01-09           174.1  Breast Cancer     2010-01-23   
 1           2120     2010-01-09           174.1  Breast Cancer     2010-01-24   
 2           2120     2010-01-09           174.1  Breast Cancer     2010-01-26   
 3           2120     2010-01-09           174.1  Breast Cancer     2010-01-26   
 4           2120     2010-01-09           174.1  Breast Cancer     2010-01-27   
 5           2120     2010-01-09           174.1  Breast Cancer     2010-01-27   
 6           2120     2010-01-09           174.1  Breast Cancer     2010-01-29   
 7           2120     2010-01-09           174.1  Breast Cancer     2010-01-29   
 8           2120     2010-01-09           174.1  Breast Cancer     2010-02-01   
 9           2120     2010-01-09           174.1  Breast Cancer     2010-02-01   
 10          2120     2010-01-09           174.1  Breast Cancer     2010-02-04   
 11          212

In [44]:
#Breast Cancer patient data only
breast_cancer_pats = gb2.get_group('Breast Cancer')  # select by name instead of relying on group order
breast_cancer_pats.head(5)

Unnamed: 0,patient_id,diagnosis_date,diagnosis_code,diagnosis,treatment_date,drug_code
0,2120,2010-01-09,174.1,Breast Cancer,2010-01-23,B
1,2120,2010-01-09,174.1,Breast Cancer,2010-01-24,A
2,2120,2010-01-09,174.1,Breast Cancer,2010-01-26,A
3,2120,2010-01-09,174.1,Breast Cancer,2010-01-26,B
4,2120,2010-01-09,174.1,Breast Cancer,2010-01-27,A


In [45]:
#Breast Cancer patients with drugs grouped together; final result needed: Regimen A grouping, Regimen B grouping
gb3 = breast_cancer_pats.groupby('drug_code')
[gb3.get_group(x) for x in gb3.groups]

[      patient_id diagnosis_date  diagnosis_code      diagnosis treatment_date  \
 1           2120     2010-01-09           174.1  Breast Cancer     2010-01-24   
 2           2120     2010-01-09           174.1  Breast Cancer     2010-01-26   
 4           2120     2010-01-09           174.1  Breast Cancer     2010-01-27   
 6           2120     2010-01-09           174.1  Breast Cancer     2010-01-29   
 8           2120     2010-01-09           174.1  Breast Cancer     2010-02-01   
 10          2120     2010-01-09           174.1  Breast Cancer     2010-02-04   
 12          2120     2010-01-09           174.1  Breast Cancer     2010-02-07   
 14          2120     2010-01-09           174.1  Breast Cancer     2010-02-10   
 16          2120     2010-01-09           174.1  Breast Cancer     2010-02-14   
 18          2120     2010-01-09           174.1  Breast Cancer     2010-02-18   
 20          2120     2010-01-09           174.1  Breast Cancer     2010-02-22   
 22          212

In [46]:
#Breast Cancer patients with Regimen A treatment dates
breast_cancer_drugA = gb3.get_group('A')  # select by name instead of relying on group order
breast_cancer_drugA.head(5)

Unnamed: 0,patient_id,diagnosis_date,diagnosis_code,diagnosis,treatment_date,drug_code
1,2120,2010-01-09,174.1,Breast Cancer,2010-01-24,A
2,2120,2010-01-09,174.1,Breast Cancer,2010-01-26,A
4,2120,2010-01-09,174.1,Breast Cancer,2010-01-27,A
6,2120,2010-01-09,174.1,Breast Cancer,2010-01-29,A
8,2120,2010-01-09,174.1,Breast Cancer,2010-02-01,A


In [47]:
#Breast Cancer patients with Regimen B treatment dates
breast_cancer_drugB = gb3.get_group('B')  # select by name instead of relying on group order
breast_cancer_drugB.head(30)

Unnamed: 0,patient_id,diagnosis_date,diagnosis_code,diagnosis,treatment_date,drug_code
0,2120,2010-01-09,174.1,Breast Cancer,2010-01-23,B
3,2120,2010-01-09,174.1,Breast Cancer,2010-01-26,B
5,2120,2010-01-09,174.1,Breast Cancer,2010-01-27,B
7,2120,2010-01-09,174.1,Breast Cancer,2010-01-29,B
9,2120,2010-01-09,174.1,Breast Cancer,2010-02-01,B
11,2120,2010-01-09,174.1,Breast Cancer,2010-02-04,B
13,2120,2010-01-09,174.1,Breast Cancer,2010-02-07,B
15,2120,2010-01-09,174.1,Breast Cancer,2010-02-10,B
17,2120,2010-01-09,174.1,Breast Cancer,2010-02-14,B
19,2120,2010-01-09,174.1,Breast Cancer,2010-02-18,B


In [48]:
#For each patient_id, get only first and last day of treatment for drug: A

treatment_duration_A = breast_cancer_drugA.groupby('patient_id').nth([0,-1])
treatment_duration_A.head(5)

Unnamed: 0_level_0,diagnosis,diagnosis_code,diagnosis_date,drug_code,treatment_date
patient_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
2038,Breast Cancer,174.9,2010-01-21,A,2010-01-24
2038,Breast Cancer,174.9,2010-01-21,A,2017-02-20
2120,Breast Cancer,174.1,2010-01-09,A,2010-01-24
2120,Breast Cancer,174.1,2010-01-09,A,2010-03-02
2175,Breast Cancer,174.7,2010-02-17,A,2010-02-21


In [49]:
#For each patient_id, get only first and last day of treatment for drug: B

treatment_duration_B = breast_cancer_drugB.groupby('patient_id').nth([0,-1])
treatment_duration_B.head(5)

Unnamed: 0_level_0,diagnosis,diagnosis_code,diagnosis_date,drug_code,treatment_date
patient_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
2038,Breast Cancer,174.9,2010-01-21,B,2010-01-24
2038,Breast Cancer,174.9,2010-01-21,B,2010-01-30
2120,Breast Cancer,174.1,2010-01-09,B,2010-01-23
2120,Breast Cancer,174.1,2010-01-09,B,2010-03-02
2175,Breast Cancer,174.7,2010-02-17,B,2010-02-21


In [50]:
#Getting difference in days between first and last treatment dates for Regimen A

treatment_dur_A = treatment_duration_A.groupby(['patient_id'])['treatment_date'].agg(lambda x: (abs(x.iloc[-1] - x.iloc[0])))
treatment_dur_A

patient_id
2038   2584 days
2120     37 days
2175     41 days
2407     45 days
2425     51 days
2462     52 days
2735     53 days
2763     60 days
3948     62 days
4354     20 days
4692     42 days
5259     63 days
5657     63 days
6281     69 days
6321     70 days
6889     24 days
7796     69 days
7937     74 days
7976     77 days
8480     77 days
8615     80 days
8827     93 days
9331     95 days
Name: treatment_date, dtype: timedelta64[ns]

In [91]:
data = treatment_dur_A/np.timedelta64(1, 'D')
series5 = data.value_counts()
series5.head()

series5.iplot(kind='bar', xTitle='Treatment Duration (Days)', yTitle='Count', title='Regimen A - Treatment Duration')

In [51]:
#Getting difference in days between first and last treatment dates for Regimen B

treatment_dur_B = treatment_duration_B.groupby(['patient_id'])['treatment_date'].agg(lambda x: (abs(x.iloc[-1] - x.iloc[0])))
treatment_dur_B

patient_id
2038      6 days
2120     38 days
2175     41 days
2238   1001 days
2407     45 days
2425     51 days
2462     52 days
2475      0 days
2607     74 days
2720     76 days
2735     53 days
2762     76 days
2763     60 days
3025     77 days
3948     62 days
4354     20 days
4692     42 days
5259     63 days
5657     63 days
6281     69 days
6321     70 days
6889     32 days
7796     30 days
8615     36 days
9331     24 days
Name: treatment_date, dtype: timedelta64[ns]

In [92]:
data = treatment_dur_B/np.timedelta64(1, 'D')
series6 = data.value_counts()
series6.head()

series6.iplot(kind='bar', xTitle='Treatment Duration (Days)', yTitle='Count', title='Regimen B - Treatment Duration')

The treatment duration distributions for Regimen A and Regimen B look very similar, but to make the assessment less subjective we need to look at the overall statistics.

In [52]:
#Preliminary assessment of Regimen A treatment duration/patient

treatment_dur_A.describe()

count                          23
mean     169 days 14:36:31.304347
std      526 days 16:15:03.638398
min              20 days 00:00:00
25%              48 days 00:00:00
50%              63 days 00:00:00
75%              75 days 12:00:00
max            2584 days 00:00:00
Name: treatment_date, dtype: object

In [53]:
#Preliminary assessment of Regimen B treatment duration/patient

treatment_dur_B.describe()

count                          25
mean             86 days 10:33:36
std      191 days 18:04:47.809755
min               0 days 00:00:00
25%              36 days 00:00:00
50%              52 days 00:00:00
75%              69 days 00:00:00
max            1001 days 00:00:00
Name: treatment_date, dtype: object

It is easy to compare descriptive statistics describing central tendency (mean) and dispersion (standard deviation and quartiles), but a more rigorous comparison will come through a paired t-test.

A paired t-test, sometimes called the dependent sample t-test, is a statistical procedure used to determine whether the mean difference between two sets of observations is zero. In a paired sample t-test, each subject or entity is measured twice, resulting in pairs of observations.

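So let us set one up. Note that the code further below calls scipy.stats.ttest_ind, which is an independent two-sample t-test rather than a paired one: the two regimen groups have different sizes (23 and 25 patients) and are not matched one-to-one. A strictly paired test would have to be restricted to patients who received both regimens.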

Hypotheses:

• The null hypothesis (H0) assumes that the true mean difference (μd) is equal to zero.
• The two-tailed alternative hypothesis (Ha) assumes that μd is not equal to zero.
• The upper-tailed alternative hypothesis (Ha) assumes that μd is greater than zero.
• The lower-tailed alternative hypothesis (Ha) assumes that μd is less than zero.

Assumptions:

• The dependent variable must be continuous (interval/ratio).
• The observations are independent of one another.
• The dependent variable should be approximately normally distributed.
• The dependent variable should not contain any outliers.

In order to perform a paired t-test:

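1. Calculate the sample mean of the paired differences.

2. Calculate the sample standard deviation of the paired differences.

3. Calculate the test statistic: the mean difference divided by its standard error (the standard deviation divided by the square root of n).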

4. Calculate the probability of observing the test statistic under the null hypothesis. This value is obtained by comparing t to a t-distribution with (n − 1) degrees of freedom. This can be done by looking up the value in a table, such as those found in many statistical textbooks, or with statistical software for more accurate results.

p = 2 ⋅ Pr(T > |t|)    (two-tailed)
p = Pr(T > t)          (upper-tailed)
p = Pr(T < t)          (lower-tailed)

Determine whether the results provide sufficient evidence to reject the null hypothesis in favor of the alternative hypothesis.
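
The four steps can be traced on a tiny example. The `before` and `after` arrays are hypothetical paired measurements invented for illustration; the hand-computed statistic is cross-checked against `scipy.stats.ttest_rel`, SciPy's paired t-test.

```python
import numpy as np
from scipy import stats

# Hypothetical paired measurements (the same subjects measured twice).
before = np.array([12.0, 15.0, 11.0, 14.0, 13.0])
after = np.array([14.0, 15.0, 14.0, 17.0, 15.0])

diff = after - before
n = len(diff)
mean_diff = diff.mean()                       # step 1: sample mean of the differences
sd_diff = diff.std(ddof=1)                    # step 2: sample standard deviation
t_stat = mean_diff / (sd_diff / np.sqrt(n))   # step 3: test statistic

# Step 4: two-tailed p-value from a t-distribution with n - 1 degrees of freedom.
p_two_tailed = 2 * stats.t.sf(abs(t_stat), df=n - 1)

# Cross-check against SciPy's built-in paired test.
t_ref, p_ref = stats.ttest_rel(after, before)
```

The null hypothesis is rejected only when the p-value falls below a significance level chosen in advance (0.05 is a common choice).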

In [54]:
#Need to convert the timedelta data into a numeric format to perform the t-test

#Use: pd.to_numeric

In [55]:
twosample_results = scipy.stats.ttest_ind(pd.to_numeric(treatment_dur_A) , pd.to_numeric(treatment_dur_B))

matrix_twosample = [
    ['', 'Test Statistic', 'p-value'],
    ['Treatment Duration Data', twosample_results[0], twosample_results[1]]
]

twosample_table = FF.create_table(matrix_twosample, index=True)
py.iplot(twosample_table)

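A p-value is compared against a significance level (conventionally 0.05), not against the test statistic. Since our p-value (0.463841502294255) is greater than 0.05, we fail to reject the null hypothesis of identical means. The difference in sample means (169 days for Regimen A versus 86 days for Regimen B) is not statistically significant; it is driven largely by one extreme duration in each group (2,584 days for Regimen A and 1,001 days for Regimen B), while the medians are much closer (63 versus 52 days).

Therefore, this data does not show that duration of therapy differs between Regimen A and Regimen B for Breast Cancer patients from this particular population. If length of therapy were taken as a proxy for efficacy, meaning a longer therapy indicates a more favored or more effective drug, Regimen A's longer mean duration might hint at a preference, but that difference is not supported by the test.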
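
For reference, the test can be reproduced from the durations printed earlier, expressed in days. This is a sketch under two assumptions that are not in the original analysis: a significance level `alpha` of 0.05, and the Mann-Whitney U test as an optional rank-based alternative that is less sensitive to extreme values.

```python
from scipy import stats

# Treatment durations in days, copied from treatment_dur_A and treatment_dur_B above.
days_a = [2584, 37, 41, 45, 51, 52, 53, 60, 62, 20, 42, 63,
          63, 69, 70, 24, 69, 74, 77, 77, 80, 93, 95]
days_b = [6, 38, 41, 1001, 45, 51, 52, 0, 74, 76, 53, 76, 60,
          77, 62, 20, 42, 63, 63, 69, 70, 32, 30, 36, 24]

alpha = 0.05  # assumed significance level; a common convention, not given in the source

# Independent two-sample t-test, as in the cell above (result does not depend on time units).
t_stat, p_value = stats.ttest_ind(days_a, days_b)
reject_null = p_value < alpha

# Rank-based alternative that is less sensitive to the two extreme durations.
u_stat, p_mw = stats.mannwhitneyu(days_a, days_b, alternative="two-sided")
```

The decision rule is `p_value < alpha`; the size of the test statistic relative to the p-value plays no role in it.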