# Cancer Test Results

In this section, we'll explore a simulated dataset of cancer test results, recording each patient's test result and whether the patient actually has cancer.

*We will investigate the dataset to answer the following questions.*

* How many patients are there in total?
* How many patients have cancer?
* How many patients do not have cancer?
* What proportion of patients have cancer?
* What proportion of patients don't have cancer?
* What proportion of patients with cancer test positive?
* What proportion of patients with cancer test negative?
* What proportion of patients without cancer test positive?
* What proportion of patients without cancer test negative?

In [1]:
# load dataset
import pandas as pd
import numpy as np

df = pd.read_csv("cancer_test_data.csv")

# Print the first 5 rows of the dataset
df.head()

   patient_id test_result  has_cancer
0       79452    Negative       False
1       81667    Positive        True
2       76297    Negative       False
3       36593    Negative       False
4       53717    Negative       False


In [2]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2914 entries, 0 to 2913
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   patient_id   2914 non-null   int64 
 1   test_result  2914 non-null   object
 2   has_cancer   2914 non-null   bool  
dtypes: bool(1), int64(1), object(1)
memory usage: 48.5+ KB


In [3]:
df.shape

(2914, 3)

In [4]:
# number of patients
df.shape[0]

2914

In [5]:
# number of patients with cancer
df.loc[df.has_cancer == True].shape[0]

# OR

df.loc[df.has_cancer].shape[0]

# OR
df.has_cancer.sum()

306

In [6]:
# number of patients without cancer
df.loc[df.has_cancer == False].shape[0]

2608

In [7]:
# proportion of patients with cancer
df.has_cancer.mean()

0.10501029512697323

In [8]:
# proportion of patients without cancer
df.loc[df.has_cancer == False].has_cancer.count() / df.shape[0]

# OR

1 - df.has_cancer.mean()

0.8949897048730268

In [9]:
# proportion of patients with cancer who test positive
df.loc[(df.has_cancer == True) & (df.test_result == 'Positive')].shape[0] / df.loc[df.has_cancer].shape[0]

# OR

(df.query('has_cancer == True')['test_result'] == 'Positive').mean()

0.9052287581699346

In [10]:
# proportion of patients with cancer who test negative
(df.query('has_cancer == True')['test_result'] == 'Negative').mean()

0.09477124183006536

In [11]:
# proportion of patients without cancer who test positive
(df.query('has_cancer == False')['test_result'] == 'Positive').mean()

0.2036042944785276

In [12]:
# proportion of patients without cancer who test negative
(df.query('has_cancer == False')['test_result'] == 'Negative').mean()

0.7963957055214724
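The four conditional proportions above can also be read from a single row-normalized cross-tabulation. The sketch below is self-contained, so it does not load `cancer_test_data.csv`. Instead it rebuilds a stand-in frame, `demo_df` (a hypothetical name), from counts inferred from the proportions reported above (for example, 0.9052 × 306 ≈ 277 patients with cancer who test positive). Those counts are an assumption derived from the outputs, not values read from the file.

```python
import pandas as pd

# Counts inferred from the proportions reported above (assumption:
# reconstructed from the outputs, not read from the CSV).
counts = {
    (True, "Positive"): 277,    # has cancer, tests positive
    (True, "Negative"): 29,     # has cancer, tests negative
    (False, "Positive"): 531,   # no cancer, tests positive
    (False, "Negative"): 2077,  # no cancer, tests negative
}

# Expand the counts into one row per patient.
rows = [
    {"has_cancer": has_cancer, "test_result": result}
    for (has_cancer, result), n in counts.items()
    for _ in range(n)
]
demo_df = pd.DataFrame(rows)

# normalize="index" divides each row by its row total, so every row
# holds P(test_result | has_cancer) and sums to 1.
conditional = pd.crosstab(
    demo_df["has_cancer"], demo_df["test_result"], normalize="index"
)
print(conditional)
```

With the real data, passing `df["has_cancer"]` and `df["test_result"]` to `pd.crosstab` in the same way should give the same table in one call.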

# Conditional Probability & Bayes Rule Quiz

In the previous section, we found the following proportions from the cancer results dataset.

* Patients with cancer: 0.105
* Patients without cancer: 0.895
* Patients with cancer who tested positive: 0.905
* Patients with cancer who tested negative: 0.095
* Patients without cancer who tested positive: 0.204
* Patients without cancer who tested negative: 0.796

Based on the above proportions observed in the data, we can assume the following probabilities.

* P(cancer) = 0.105 ----------------> Probability a patient has cancer  
* P(\~cancer) = 0.895 --------------> Probability a patient does not have cancer  
* P(positive|cancer) = 0.905 ----> Probability a patient with cancer tests positive  
* P(negative|cancer) = 0.095 ---> Probability a patient with cancer tests negative  
* P(positive|\~cancer) = 0.204 --> Probability a patient without cancer tests positive  
* P(negative|\~cancer) = 0.796 -> Probability a patient without cancer tests negative  

We will use the probabilities given above and Bayes' rule to compute the following probabilities, and check them against the proportions observed directly in the data.

* Probability a patient who tested positive has cancer, or P(cancer|positive)
* Probability a patient who tested positive doesn't have cancer, or P(~cancer|positive)
* Probability a patient who tested negative has cancer, or P(cancer|negative)
* Probability a patient who tested negative doesn't have cancer, or P(~cancer|negative)
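As a minimal sketch of the Bayes' rule calculation, the four posteriors can be computed from the rounded probabilities listed above alone. The variable names are illustrative, and P(~cancer) is taken as 0.895, matching the proportion found in the previous section. The denominator of each posterior comes from the law of total probability, for example P(positive) = P(positive|cancer)·P(cancer) + P(positive|~cancer)·P(~cancer).

```python
# Probabilities from the list above, rounded to three decimals.
p_cancer = 0.105
p_no_cancer = 0.895
p_pos_given_cancer = 0.905
p_neg_given_cancer = 0.095
p_pos_given_no_cancer = 0.204
p_neg_given_no_cancer = 0.796

# Law of total probability: overall chance of each test result.
p_pos = p_pos_given_cancer * p_cancer + p_pos_given_no_cancer * p_no_cancer
p_neg = p_neg_given_cancer * p_cancer + p_neg_given_no_cancer * p_no_cancer

# Bayes' rule: P(A|B) = P(B|A) * P(A) / P(B)
p_cancer_given_pos = p_pos_given_cancer * p_cancer / p_pos
p_no_cancer_given_pos = p_pos_given_no_cancer * p_no_cancer / p_pos
p_cancer_given_neg = p_neg_given_cancer * p_cancer / p_neg
p_no_cancer_given_neg = p_neg_given_no_cancer * p_no_cancer / p_neg

print(f"P(cancer|positive)  = {p_cancer_given_pos:.3f}")
print(f"P(~cancer|positive) = {p_no_cancer_given_pos:.3f}")
print(f"P(cancer|negative)  = {p_cancer_given_neg:.3f}")
print(f"P(~cancer|negative) = {p_no_cancer_given_neg:.3f}")
```

Because the inputs are rounded to three decimals, these results can differ slightly in the later decimal places from proportions computed directly on the full dataset.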

In [13]:
# What proportion of patients who tested positive have cancer?
# P(cancer | positive)
print(df.query("test_result == 'Positive'")['has_cancer'].mean())

0.34282178217821785


In [14]:
# What proportion of patients who tested positive don't have cancer?
print((df.query("test_result == 'Positive'")['has_cancer'] == False).mean())
# OR
print(1 - df.query("test_result == 'Positive'")['has_cancer'].mean())

0.6571782178217822
0.6571782178217822


In [15]:
# What proportion of patients who tested negative have cancer?
df.query("test_result == 'Negative'")['has_cancer'].mean()

0.013770180436847104

In [16]:
# What proportion of patients who tested negative don't have cancer?
print((df.query("test_result == 'Negative'")['has_cancer'] == False).mean())
# OR
1 - df.query("test_result == 'Negative'")['has_cancer'].mean()

0.9862298195631529


0.9862298195631529