<a href="https://colab.research.google.com/github/hanna0702/Machine-Learning-and-AI-Workshop/blob/main/01_eda.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Explore and Visualize the Data
Authors: Ryan Nie, Christina Xu

In [1]:
# connect to Google Drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
# import packages 

# data manipulation
import numpy as np
import pandas as pd

# data visualization
import matplotlib.pyplot as plt
import seaborn as sns

## 1 Data attributes 

1. **Race**: White, Black, and Other


2. **Marital Status**: Married, Divorced, Single, Widowed, and Separated


3. **T Stage** - refers to the size and extent of the main tumor. The higher the number after the T, the larger the tumor or the more it has grown into nearby tissues. 
    * T0: No evidence of primary tumor
    * T1 (includes T1a, T1b, and T1c): Tumor is 2 cm (3/4 of an inch) or less across
    * T2: Tumor is more than 2 cm but not more than 5 cm (2 inches) across
    * T3: Tumor is more than 5 cm across


4. **N Stage** - refers to the number of nearby lymph nodes that have cancer. N1, N2, and N3 refer to the number and location of lymph nodes that contain cancer: the higher the number after the N, the more lymph nodes contain cancer. (The M in the T, N, and M classification refers to whether the cancer has metastasized, meaning that it has spread from the main tumor to other parts of the body.)


5. **6th Stage**: Stage groups for breast cancer. Doctors assign the stage of the cancer by combining the T, N, and M classifications (see above), the tumor grade, and the results of ER/PR and HER2 testing. This information is used to help determine prognosis. Doctors may refer to stage I to stage IIA cancer as "early stage" and stage IIB to stage III as "locally advanced."

    * Stage IIA:

        * There is no evidence of a tumor in the breast, but the cancer has spread to 1 to 3 axillary lymph nodes. It has not spread to distant parts of the body (T0, N1, M0).

        * The tumor is 20 mm or smaller and has spread to 1 to 3 axillary lymph nodes (T1, N1, M0).

        * The tumor is larger than 20 mm but not larger than 50 mm and has not spread to the axillary lymph nodes (T2, N0, M0).
    * Stage IIB:

         * The tumor is larger than 20 mm but not larger than 50 mm and has spread to 1 to 3 axillary lymph nodes (T2, N1, M0).

         * The tumor is larger than 50 mm but has not spread to the axillary lymph nodes (T3, N0, M0).
    * Stage IIIA: The tumor of any size has spread to 4 to 9 axillary lymph nodes or to internal mammary lymph nodes. It has not spread to other parts of the body (T0, T1, T2, or T3; N2; M0). Stage IIIA may also be a tumor larger than 50 mm that has spread to 1 to 3 axillary lymph nodes (T3, N1, M0).

    * Stage IIIB: The tumor has spread to the chest wall or caused swelling or ulceration of the breast, or it is diagnosed as inflammatory breast cancer. It may or may not have spread to up to 9 axillary or internal mammary lymph nodes. It has not spread to other parts of the body (T4; N0, N1, or N2; M0).

    * Stage IIIC: A tumor of any size that has spread to 10 or more axillary lymph nodes, the internal mammary lymph nodes, and/or the lymph nodes under the collarbone. It has not spread to other parts of the body (any T, N3, M0).


6. **differentiate**: Poorly differentiated, Moderately differentiated, Well differentiated, and Undifferentiated


7. **Grade**:
    * 1: looks most like normal breast cells and is usually slow growing 
    * 2: looks less like normal cells and is growing faster 
    * 3: looks different from normal breast cells and is usually fast growing


8. **A Stage** - A summary of the stage of the cancer. It is an attribute that involves the T, N, and Grade data.

    * Regional: The cancer has spread outside the breast to nearby structures or lymph nodes.
    * Distant: The cancer has spread to distant parts of the body such as the lungs, liver or bones.


9. **Estrogen Status**: Estrogen positive and Estrogen negative

    * Estrogen positive: Cancer cells that are ER positive may need estrogen to grow. These cells may stop growing or die when treated with substances that block the binding and actions of estrogen. Also called estrogen receptor positive.

    * Estrogen negative: ER-negative breast cancers are a group of tumors with poor prognosis and fewer cancer prevention and treatment strategies than ER-positive tumors.


10. **Progesterone Status**: Progesterone positive and Progesterone negative

    * Progesterone positive: This type of breast cancer is sensitive to progesterone, and the cells have receptors that allow them to use this hormone to grow. Treatment with endocrine therapy blocks the growth of the cancer cells.

    * Progesterone negative: The cancer cells lack progesterone receptors. For breast cancers with neither estrogen nor progesterone receptors, treatment with hormone therapy drugs is not helpful. These cancers tend to grow faster than hormone receptor-positive cancers.


11. **Regional Node Examined** - Records the total number of regional lymph nodes that were removed and examined by the pathologist.


12. **Reginol Node Positive** -  Records the exact number of regional lymph nodes examined by the pathologist that were found to contain metastases.


13. **Survival Months** - Created using complete dates, including days, therefore may differ from survival time calculated from year and month only.


14. **Status**: Alive or Dead. Any patient who dies after the follow-up cut-off date is recoded as alive as of the cut-off date.

**References**:

https://ieee-dataport.org/open-access/seer-breast-cancer-data

## 2 Download the data 

The data has been downloaded and extracted locally from Kaggle into the folder named data. Kaggle data source: https://www.kaggle.com/datasets/reihanenamdari/breast-cancer

## 3 Explore the data using Pandas
## 3.1 Load data from disk and view the raw data.
At a quick glance, we want to:
* Look for NaNs (null values)
* Understand the features and hypothesize how they may predict Status
* Identify the types of data (numerical vs categorical)

In [None]:
data = pd.read_csv('/content/drive/MyDrive/Copy of Breast Cancer Detection/data/Breast_Cancer.csv')
data.head() # first 5 rows

Unnamed: 0,Age,Race,Marital Status,T Stage,N Stage,6th Stage,differentiate,Grade,A Stage,Tumor Size,Estrogen Status,Progesterone Status,Regional Node Examined,Reginol Node Positive,Survival Months,Status
0,68,White,Married,T1,N1,IIA,Poorly differentiated,3,Regional,4,Positive,Positive,24,1,60,Alive
1,50,White,Married,T2,N2,IIIA,Moderately differentiated,2,Regional,35,Positive,Positive,14,5,62,Alive
2,58,White,Divorced,T3,N3,IIIC,Moderately differentiated,2,Regional,63,Positive,Positive,14,7,75,Alive
3,58,White,Married,T1,N1,IIA,Poorly differentiated,3,Regional,18,Positive,Positive,2,1,84,Alive
4,47,White,Married,T2,N1,IIB,Poorly differentiated,3,Regional,41,Positive,Positive,3,1,50,Alive


## 3.2 View information about the data
* How many features do we have?
* What features seem redundant?
* How many null values does each of them have?
    * The fewer non-null values a feature has, the less useful it is
* What are the different data types? 

In [None]:
data.info() 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4024 entries, 0 to 4023
Data columns (total 16 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   Age                     4024 non-null   int64 
 1   Race                    4024 non-null   object
 2   Marital Status          4024 non-null   object
 3   T Stage                 4024 non-null   object
 4   N Stage                 4024 non-null   object
 5   6th Stage               4024 non-null   object
 6   differentiate           4024 non-null   object
 7   Grade                   4024 non-null   object
 8   A Stage                 4024 non-null   object
 9   Tumor Size              4024 non-null   int64 
 10  Estrogen Status         4024 non-null   object
 11  Progesterone Status     4024 non-null   object
 12  Regional Node Examined  4024 non-null   int64 
 13  Reginol Node Positive   4024 non-null   int64 
 14  Survival Months         4024 non-null   int64 
 15  Stat

Our data has no null values. Real-world data typically has some, if not many. When it does, the next step is to explore the null values in the features.
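If a dataset did contain missing values, a first pass might look like the sketch below. The toy DataFrame and its values are made up for illustration and are not taken from the breast cancer data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with a few missing values (not from the real dataset)
toy = pd.DataFrame({
    "Age": [68, 50, np.nan, 58],
    "Race": ["White", None, "Black", "Other"],
    "Tumor Size": [4, 35, 63, 18],
})

# Number of missing values per column
null_counts = toy.isna().sum()

# Fraction of missing values per column
null_fraction = toy.isna().mean()

print(null_counts)
print(null_fraction)
```

Columns with a high fraction of missing values are the ones to inspect first.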

## 3.3 Get Summary Statistics

* Count - each column has 4024 values
* Mean, min(imum), and max(imum) are self-explanatory
* Std (standard deviation) - explains how spread out the values are from the mean
  * Normal (Gaussian) distribution follows 68-95-99.7 rule
    * 68% of values are within 1 std
    * 95% of values are within 2 std
    * 99.7% of values are within 3 std
* 25% - first quartile
    * e.g., 25% of the patients were 47 years old or younger
* 50% - second quartile or median
* 75% - third quartile
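The 68-95-99.7 rule and the quartiles can be checked numerically. The sketch below uses simulated, normally distributed values (an assumption for illustration; the real features need not be normal), and `share_within` is a hypothetical helper name.

```python
import numpy as np

# Simulated, normally distributed values (illustrative only, not patient data)
rng = np.random.default_rng(seed=0)
values = rng.normal(loc=50.0, scale=10.0, size=100_000)

mean, std = values.mean(), values.std()

def share_within(k):
    """Fraction of values within k standard deviations of the mean."""
    return np.mean(np.abs(values - mean) <= k * std)

for k in (1, 2, 3):
    print(f"within {k} std: {share_within(k):.3f}")

# Quartiles, corresponding to the 25% / 50% / 75% rows of describe()
q1, median, q3 = np.percentile(values, [25, 50, 75])
print(q1, median, q3)
```

For skewed features the shares can differ noticeably from 68%, 95%, and 99.7%, which is one reason to look at the quartiles as well.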

In [None]:
data.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Age,4024.0,53.972167,8.963134,30.0,47.0,54.0,61.0,69.0
Tumor Size,4024.0,30.473658,21.119696,1.0,16.0,25.0,38.0,140.0
Regional Node Examined,4024.0,14.357107,8.099675,1.0,9.0,14.0,19.0,61.0
Reginol Node Positive,4024.0,4.158052,5.109331,1.0,1.0,2.0,5.0,46.0
Survival Months,4024.0,71.297962,22.92143,1.0,56.0,73.0,90.0,107.0


## 3.4 How many patients are alive? 

In [None]:
data['Status'].value_counts()
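Raw counts are easier to compare as proportions. A minimal sketch, using a made-up `status` Series rather than the real labels:

```python
import pandas as pd

# Hypothetical labels for illustration (not the real class counts)
status = pd.Series(["Alive", "Alive", "Alive", "Dead"], name="Status")

counts = status.value_counts()                # absolute counts per class
shares = status.value_counts(normalize=True)  # fraction of rows per class

print(counts)
print(shares)
```

On the real data, the same call would be `data['Status'].value_counts(normalize=True)`.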

## 4 Visualize the data with Matplotlib and Seaborn

In [None]:
# store the Status label as y
y = data['Status']

# separate features into numerical and categorical datasets
X_num = data[['Age', 'Tumor Size', 'Reginol Node Positive', 'Survival Months', 'Regional Node Examined']]
X_cat = data[['Race', 'Marital Status', 'T Stage ', 'N Stage', '6th Stage', 'differentiate', 'Grade', 'A Stage','Estrogen Status',
                  'Progesterone Status']]


## 4.1 Plot histograms for numerical features

Pay attention to clear splits in the plots for the numerical features.

In [5]:
plt.figure(figsize=(18,12))
plt.subplots_adjust(hspace = .5)

for i, column in enumerate(X_num, 1):
    plt.subplot(3,2,i)
    sns.histplot(data=X_num, x=column, hue=y, stat="density", common_norm=False, bins=60, kde=True)
   
    # stat - Aggregate statistic to compute in each bin; "density" normalizes counts so that the area of the histogram is 1
    # common_norm - If False, normalize each histogram independently
    # kde - If True, compute a kernel density estimate to smooth the distribution and show on the plot as (one or more) line(s)


Next steps:
1. Age and Survival Months show clear separation between Alive and Dead patients.
2. Tumor Size, Reginol Node Positive, and Regional Node Examined show no clear separation, so they are candidates to drop.
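The histograms in this section use `stat="density"` with `common_norm=False`, so each class is normalized on its own and groups of different sizes can be compared by shape. The sketch below illustrates the idea with NumPy on two made-up groups; the group sizes and distributions are assumptions, not properties of the dataset.

```python
import numpy as np

# Two hypothetical groups of very different size (illustrative values only)
rng = np.random.default_rng(seed=0)
groups = {
    "large": rng.normal(loc=70, scale=20, size=3000),
    "small": rng.normal(loc=45, scale=20, size=600),
}

areas = {}
for name, group in groups.items():
    # density=True rescales bar heights so that sum(height * width) equals 1
    heights, edges = np.histogram(group, bins=60, density=True)
    areas[name] = np.sum(heights * np.diff(edges))

print(areas)
```

Because each histogram integrates to 1 regardless of group size, the smaller group is not visually dwarfed by the larger one.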

## 4.2 Plot histograms for categorical features

In [None]:
plt.figure(figsize=(18,12))
plt.subplots_adjust(hspace = .5)
for i, column in enumerate(X_cat, 1):
    plt.subplot(5,2,i)
    # multiple='fill' stacks the classes so that each bar sums to 1
    sns.histplot(data, x=column, hue=y, multiple='fill', stat='proportion', shrink=.8)

Next steps:
1. T Stage, N Stage, 6th Stage, differentiate, and Grade are **ordinal** variables, meaning that there is a clear ordering of the categories.
   e.g., for T Stage, the higher the number following the T, the higher the proportion of patients who die from cancer.
2. Race, Marital Status, A Stage, Estrogen Status, and Progesterone Status are **non-ordinal** variables, meaning that they have no intrinsic ordering to the categories.
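One way to make the ordering of an ordinal variable explicit in pandas is an ordered categorical. This is a sketch on made-up T Stage values, not the notebook's own preprocessing:

```python
import pandas as pd

# Hypothetical T Stage values for illustration
t_stage = pd.Series(["T2", "T1", "T3", "T1", "T4"])

# Declare the category order explicitly so that T1 < T2 < T3 < T4
ordered = pd.Categorical(t_stage, categories=["T1", "T2", "T3", "T4"], ordered=True)

# Integer codes follow the declared order: T1 -> 0, T2 -> 1, and so on
codes = pd.Series(ordered.codes)

print(ordered.min(), ordered.max())
print(codes.tolist())
```

Non-ordinal variables such as Race have no meaningful order, so integer codes like these would impose one that does not exist.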

## 5 Explore correlation among features

In [None]:
corr_matrix = X_num.corr()
mask = np.triu(np.ones_like(corr_matrix, dtype=bool))
sns.heatmap(corr_matrix, mask=mask, annot=True)

Reginol Node Positive and Tumor Size have a weak positive correlation (0.24), while Regional Node Examined and Reginol Node Positive have a moderate positive correlation (0.41).
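`DataFrame.corr()` computes the Pearson correlation coefficient by default. The sketch below reproduces it by hand on a made-up two-column DataFrame (the columns `x` and `y` are hypothetical) to show what the numbers in the heatmap mean.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric columns (made-up values, not the real data)
toy = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0],
    "y": [2.0, 1.0, 4.0, 3.0, 5.0],
})

# Pearson r = sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2)),
# where dx and dy are deviations from the column means
x_centered = toy["x"] - toy["x"].mean()
y_centered = toy["y"] - toy["y"].mean()
r_manual = (x_centered * y_centered).sum() / np.sqrt(
    (x_centered ** 2).sum() * (y_centered ** 2).sum()
)

# DataFrame.corr() uses Pearson correlation by default
r_pandas = toy.corr().loc["x", "y"]

print(r_manual, r_pandas)
```

Values near 0 indicate little linear relationship, and values near 1 or -1 indicate a strong one.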

## Summary:
* We perform EDA to analyze and investigate data sets and summarize their main characteristics.
* Age and Survival Months show clear separation between Alive and Dead patients.
* T Stage, N Stage, 6th Stage, differentiate, and Grade are ordinal variables. The higher the stage or grade, the more likely a patient is to die from cancer.

Now that you know how to perform EDA, let's head over to the next notebook, `02-feature_preprocessing.ipynb`, to see how we can prepare our data for model ingestion.