# Coding Applications in Medicine: Data Science - Statistical Analysis Practice

Review the following notebooks before attempting this practice question.

- Data Science - Handling Data with Pandas
- Data Science - Visualization
- Data Science - Categorical Hypothesis Tests
- Data Science - Numerical Hypothesis Tests
- Data Science - Other Statistical Analysis
- Data Science - Regression Analysis

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import RocCurveDisplay
from sklearn.model_selection import train_test_split

from scipy.stats import ttest_ind
from scipy.stats import chi2_contingency
from scipy.stats import fisher_exact
from scipy.stats import f_oneway
from scipy.stats import pearsonr
from scipy.stats.contingency import odds_ratio
from scipy.stats.contingency import relative_risk

import sklearn.datasets

We will analyze the Wisconsin Breast Cancer Dataset from scikit-learn. Our goal is to create a model that predicts whether a tumor is malignant or benign.

In [None]:
# Load the breast cancer data.
breastCancer_dict = sklearn.datasets.load_breast_cancer()
breastCancerDF = pd.DataFrame(breastCancer_dict["data"], columns=breastCancer_dict["feature_names"])
breastCancerDF

In [None]:
breastCancer_dict["target_names"]

Below is the list of all the features provided by the data. Which feature do you think would be good to use to determine whether a tumor is benign or malignant?

In [None]:
breastCancerDF.columns

Take a look at the encoded labels in the target data. The target names are listed as malignant first and benign second, so 0 represents malignant and 1 represents benign. We should reverse this encoding so that 1 means malignant, which is more intuitive for a column named "malignant", before adding it as a new column to our data frame.
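One way to confirm the encoding before changing it is to pair each label value with its name and count the samples per label. This is a minimal sketch; the variable names `data`, `labels`, and `counts` are illustrative and not part of the exercise.

```python
import numpy as np
import sklearn.datasets

data = sklearn.datasets.load_breast_cancer()

# target_names is indexed by label value: position 0 names label 0, position 1 names label 1.
for label, name in enumerate(data["target_names"]):
    print(label, name)

# Count how many samples carry each label value.
labels, counts = np.unique(data["target"], return_counts=True)
print(dict(zip(labels.tolist(), counts.tolist())))
```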

In [None]:
# Hint: Currently, breastCancer_dict["target"] stores the labels with 0 for malignant and 1 for benign.

### breastCancerDF["malignant"] = ________

Based on the one feature you chose, make an appropriate visualization.

In [None]:
# Hint: In this case, the "malignant" column is the dependent variable.


Looking at your visualization, can you identify a pattern that helps you predict whether a tumor is benign or malignant?

Fit the appropriate model to help make the prediction.

In [None]:
# Step 1: Split our dataset into training and testing datasets.

### breastCancerDF_tr, breastCancerDF_te = ________
### breastCancerDF_tr.reset_index(inplace=True, drop=True)
### breastCancerDF_te.reset_index(inplace=True, drop=True)

# Step 2: Create the X and Y matrix for model training.

### breastCancerDF_trX = ________
### breastCancerDF_trY = ________

# Step 3: Create and fit the model.

### breastCancerAreaModel = ________
### breastCancerAreaModel.________(________, ________);

In [None]:
# Determine the accuracy of the model.

### breastCancerAreaModel.________(________, ________)

In [None]:
# Try using the model to make predictions.

### breastCancerAreaModel.________(________)

Now we want to see if there is a significant difference in the feature you selected between benign and malignant tumors.

Based on the one feature you chose, make an appropriate visualization.

In [None]:
# Hint: In this case, the "malignant" column is the independent variable.


Looking at your visualization, can you determine whether there is a significant difference between benign and malignant tumors?

Calculate the test statistic using an appropriate test.

In [None]:
# Step 1: Create the appropriate grouping or table.

### benignGroup = ________
### tumorGroup = ________

# Step 2: Use the appropriate test for analysis.

### ________(________, ________)

Below is one way to approach this problem.

In [None]:
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#


## Logistic Regression

In [None]:
# Load the breast cancer data.
breastCancer_dict = sklearn.datasets.load_breast_cancer()
breastCancerDF = pd.DataFrame(breastCancer_dict["data"], columns=breastCancer_dict["feature_names"])

In [None]:
# Example using a comparison operator to produce True/False values, which are then converted to 1/0.
### breastCancerDF["malignant"] = (breastCancer_dict["target"] == 0).astype(int)

# Example using mathematical manipulations.
breastCancerDF["malignant"] = 1 - breastCancer_dict["target"]
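As a quick sanity check, the two approaches above should produce identical columns. The sketch below uses a small made-up target array (the values are illustrative, not taken from the dataset) so the result is easy to verify by eye.

```python
import numpy as np

# Hypothetical labels in the original encoding (0 = malignant, 1 = benign).
target = np.array([0, 1, 1, 0, 1])

# Approach 1: the comparison yields booleans, which are cast to 1/0.
flipped_by_comparison = (target == 0).astype(int)

# Approach 2: subtracting from 1 swaps 0 and 1.
flipped_by_arithmetic = 1 - target

print(flipped_by_comparison)
print(flipped_by_arithmetic)
print(np.array_equal(flipped_by_comparison, flipped_by_arithmetic))
```

The arithmetic version only works because the labels are exactly 0 and 1; the comparison version states the intent more explicitly.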

In [None]:
# Histogram.
sns.histplot(data = breastCancerDF, x = "mean area", hue = "malignant", binwidth=50, kde=True)
plt.legend(labels = ["Malignant", "Benign"])
plt.show();

In [None]:
# Split our dataset into training and testing datasets.
breastCancerDF_tr, breastCancerDF_te = train_test_split(breastCancerDF, test_size=0.10, random_state=33)
breastCancerDF_tr.reset_index(inplace=True, drop=True)
breastCancerDF_te.reset_index(inplace=True, drop=True)

# Create the X and Y matrix for model training.
breastCancerDF_trX = breastCancerDF_tr[["mean area"]].to_numpy()
breastCancerDF_trY = breastCancerDF_tr["malignant"].to_numpy()

# Create and fit the logistic regression model.
breastCancerAreaModel = LogisticRegression(fit_intercept=True)
breastCancerAreaModel.fit(breastCancerDF_trX, breastCancerDF_trY);
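With a single feature, a fitted logistic regression can be summarized by the feature value at which the predicted probability crosses 0.5, which is where `intercept + slope * x = 0`. The sketch below repeats the setup so it runs on its own; the names `df`, `model`, and `boundary` are illustrative, and `max_iter=1000` is an assumption added only to give the solver room to converge on the unscaled feature.

```python
import pandas as pd
import sklearn.datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

data = sklearn.datasets.load_breast_cancer()
df = pd.DataFrame(data["data"], columns=data["feature_names"])
df["malignant"] = 1 - data["target"]

train, test = train_test_split(df, test_size=0.10, random_state=33)
X = train[["mean area"]].to_numpy()
y = train["malignant"].to_numpy()

# max_iter is raised here as a precaution; it is not part of the original cell.
model = LogisticRegression(fit_intercept=True, max_iter=1000)
model.fit(X, y)

slope = model.coef_[0][0]
intercept = model.intercept_[0]

# The predicted class switches where intercept + slope * x equals zero.
boundary = -intercept / slope
print(f"slope={slope:.4f}, intercept={intercept:.2f}, boundary={boundary:.1f}")
```

A positive slope means larger mean areas push the prediction toward malignant; tumors with a mean area above `boundary` are classified as malignant by this model.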

In [None]:
# Calculate the accuracy of the model on the testing dataset.
breastCancerAreaModel.score(breastCancerDF_te[["mean area"]].to_numpy(),
                            breastCancerDF_te["malignant"].to_numpy())
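Accuracy alone does not show which kind of mistake the model makes, and in a diagnostic setting a missed malignant tumor is usually more costly than a false alarm. The sketch below is one possible follow-up using `confusion_matrix` and `roc_auc_score` from scikit-learn; the setup is repeated so the block runs on its own, and the names `df`, `model`, `cm`, and `auc` are illustrative.

```python
import pandas as pd
import sklearn.datasets
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

data = sklearn.datasets.load_breast_cancer()
df = pd.DataFrame(data["data"], columns=data["feature_names"])
df["malignant"] = 1 - data["target"]

train, test = train_test_split(df, test_size=0.10, random_state=33)

# max_iter is raised as a precaution; it is an assumption, not part of the original cell.
model = LogisticRegression(max_iter=1000)
model.fit(train[["mean area"]].to_numpy(), train["malignant"].to_numpy())

test_X = test[["mean area"]].to_numpy()
test_y = test["malignant"].to_numpy()

# Rows are true classes and columns are predicted classes, both ordered [0, 1].
cm = confusion_matrix(test_y, model.predict(test_X), labels=[0, 1])
print(cm)

# AUC is computed from the predicted probability of the positive class (malignant = 1).
auc = roc_auc_score(test_y, model.predict_proba(test_X)[:, 1])
print(f"AUC: {auc:.3f}")
```

In the printed matrix, the bottom-left entry counts malignant tumors that the model labeled benign.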

In [None]:
# Predict whether a tumor is benign (0) or malignant (1) based on its mean area.
breastCancerAreaModel.predict([[1500]])
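`predict` returns only the class label. When the confidence of a prediction matters, `predict_proba` returns the estimated probability of each class. A minimal sketch, with the setup repeated so it runs on its own; the three area values are arbitrary examples, and `max_iter=1000` is an assumption added for solver convergence.

```python
import numpy as np
import pandas as pd
import sklearn.datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

data = sklearn.datasets.load_breast_cancer()
df = pd.DataFrame(data["data"], columns=data["feature_names"])
df["malignant"] = 1 - data["target"]

train, test = train_test_split(df, test_size=0.10, random_state=33)
model = LogisticRegression(max_iter=1000)
model.fit(train[["mean area"]].to_numpy(), train["malignant"].to_numpy())

# Arbitrary example areas: small, intermediate, and large.
areas = np.array([[300.0], [700.0], [1500.0]])

# Column 1 holds the probability of class 1 (malignant).
probabilities = model.predict_proba(areas)[:, 1]
for area, p in zip(areas[:, 0], probabilities):
    print(f"mean area {area:.0f}: P(malignant) = {p:.3f}")
```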

## Two-Sample t-Test

In [None]:
# Box plot.
sns.boxplot(data=breastCancerDF, x="malignant", y="mean area")
plt.show();

In [None]:
# Create the benign and malignant groups.
benignGroup = breastCancerDF[breastCancerDF["malignant"] == 0]
tumorGroup = breastCancerDF[breastCancerDF["malignant"] == 1]

# Two-sample t-test for analysis.
ttest_ind(benignGroup["mean area"], tumorGroup["mean area"])
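By default, `ttest_ind` assumes the two groups have equal variances. If the box plot suggests the spreads differ, Welch's t-test (`equal_var=False`) is a common alternative. The sketch below is one way to run it and read the result; the names `benign_area`, `malignant_area`, and `result` are illustrative, and the 0.05 threshold is a conventional choice rather than something fixed by the exercise.

```python
import pandas as pd
import sklearn.datasets
from scipy.stats import ttest_ind

data = sklearn.datasets.load_breast_cancer()
df = pd.DataFrame(data["data"], columns=data["feature_names"])
df["malignant"] = 1 - data["target"]

benign_area = df.loc[df["malignant"] == 0, "mean area"]
malignant_area = df.loc[df["malignant"] == 1, "mean area"]

# Welch's t-test: does not assume the two groups share the same variance.
result = ttest_ind(benign_area, malignant_area, equal_var=False)
print(f"t = {result.statistic:.2f}, p = {result.pvalue:.3g}")

# A negative t statistic here means the first group (benign) has the smaller mean.
alpha = 0.05  # conventional significance level, chosen for illustration
print("significant difference" if result.pvalue < alpha else "no significant difference")
```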