# Project Proposal - Group 39

In [17]:
library("tidyverse")

set.seed(1000)

## 1. Introduction

Our project analyzes the proteomes of breast cancer samples to predict the stage of breast cancer. We will use the "Breast Cancer Proteomes" dataset, sourced from Kaggle and uploaded by Kajot, available at the following URL:

https://www.kaggle.com/datasets/piotrgrabo/breastcancerproteomes/?fbclid=IwAR0c9gONWYSIGAQ6R66rUkjaFVtiwxH15b_XfVATdICBEEp0Ait9o6VZaIY

This dataset contains iTRAQ proteomics data for over 12,000 proteins in 77 breast cancer samples and 3 healthy breast samples. The samples were collected by the Clinical Proteomic Tumor Analysis Consortium (CPTAC).

To narrow the scope of the data, our group will look at the 10 most highly expressed proteins in tumors that are at AJCC stage III or higher. AJCC staging is the cancer staging system developed by the American Joint Committee on Cancer. It classifies the severity of cancer with regard to the primary tumor and can be used to show the extent to which the cancer has spread. This tool is helpful for determining a prognosis and can serve as the basis for developing the most effective treatment plan for an individual patient.

We will investigate whether the expression levels of the 10 most highly expressed proteins in tumors at AJCC stage III or higher can accurately classify the AJCC stage of breast cancer samples.

## 2. Preliminary Analysis

Our GitHub repository can be found at: https://github.com/hesoru/DSCI_100_Breast_Cancer_Classification

Dataset Source: Our Breast Cancer Proteomes dataset was obtained from Kaggle at the following URL: 

https://www.kaggle.com/datasets/piotrgrabo/breastcancerproteomes/?fbclid=IwAR0c9gONWYSIGAQ6R66rUkjaFVtiwxH15b_XfVATdICBEEp0Ait9o6VZaIY

This dataset was downloaded, and the following files were uploaded to our GitHub repository (hesoru/DSCI_100_Breast_Cancer_Classification/Original_Datasets):
- Proteomics data: https://github.com/hesoru/DSCI_100_Breast_Cancer_Classification/blob/main/Original_Datasets/77_cancer_proteomes_CPTAC_itraq.csv
- Clinical data: https://github.com/hesoru/DSCI_100_Breast_Cancer_Classification/blob/main/Original_Datasets/clinical_data_breast_cancer.csv

We used a Jupyter notebook to document how we read, cleaned, and wrangled our data into a tidy format, along with our preliminary exploration of the tidied dataset. This notebook can be found at:

https://github.com/hesoru/DSCI_100_Breast_Cancer_Classification/blob/main/Tidying_Data_and_Exploration.ipynb
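
As an illustration of one wrangling step of this kind, the sketch below ranks proteins by mean expression among stage III+ samples. It runs on a small made-up table, not on our data: `toy_expression`, `sample_id`, the `GENE_*` symbols, and all values are hypothetical, and the column names `ajcc_stage`, `gene_symbol`, and `protein_expression_log2_iTRAQ_ratios` are assumed to match our tidied dataset. The steps in the notebook itself may differ.

```r
library(tidyverse)

# Hypothetical long-format data: one row per (sample, protein) pair.
# All identifiers and values are invented for illustration only.
toy_expression <- tibble(
    sample_id = rep(c("S1", "S2", "S3", "S4"), each = 3),
    ajcc_stage = rep(c("Stage IIA", "Stage III", "Stage IIIA", "Stage IIIC"), each = 3),
    gene_symbol = rep(c("GENE_A", "GENE_B", "GENE_C"), times = 4),
    protein_expression_log2_iTRAQ_ratios = c(
        0.1,  0.2, 0.3,
        1.0, -0.5, 2.0,
        1.2, -0.3, 1.8,
        0.8, -0.1, 2.2
    )
)

# Number of proteins to keep (10 in our proposal, 2 for this toy table)
top_n_proteins <- 2

top_stage_III_plus <- toy_expression |>
    # keep only samples whose stage label starts with "Stage III" or "Stage IV"
    filter(str_detect(ajcc_stage, "^Stage (III|IV)")) |>
    group_by(gene_symbol) |>
    # mean log2 expression per protein, ignoring missing measurements
    summarize(mean_expression = mean(protein_expression_log2_iTRAQ_ratios, na.rm = TRUE)) |>
    # keep the proteins with the highest mean expression, highest first
    slice_max(mean_expression, n = top_n_proteins)

top_stage_III_plus
```

Matching on the stage label prefix keeps every substage (IIIA, IIIB, IIIC) without listing them one by one.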



Patient counts by AJCC stage:

| AJCC Stage | Patients |
|------------|----------|
| Stage I    | 2        |
| Stage IA   | 3        |
| Stage IB   | 1        |
| Stage II   | 8        |
| Stage IIA  | 21       |
| Stage IIB  | 14       |
| Stage III  | 2        |
| Stage IIIA | 6        |
| Stage IIIB | 4        |
| Stage IIIC | 4        |
| Stage IV   | 0        |

In [18]:
# read in data
# NOTE: these paths are relative to the working directory, so read_csv() fails
# unless the notebook is run from the folder that contains Outputs/
top_10_mean_protein_expression_genes_stage_III_plus <- read_csv("Outputs/top_10_mean_protein_expression_genes_stage_III_plus")
top_10_mean_protein_expression_genes_stage_III_plus

patients_per_stage <- read_csv("Outputs/patients_per_stage")
patients_per_stage

proteome_and_clinical_data_training_merged <- read_csv("Outputs/proteome_and_clinical_data_training_merged.csv")

# visualize protein expression distributions across all AJCC stages,
# investigating only the top 10 most highly expressed proteins in stage III+ tumors
expression_distribution_plot <- proteome_and_clinical_data_training_merged |>
    ggplot(aes(x = protein_expression_log2_iTRAQ_ratios, fill = ajcc_stage)) +
    # binwidth overrides bins in geom_histogram(), so only binwidth is set
    geom_histogram(binwidth = 1) +
    # one row of panels per AJCC stage, one column per protein
    facet_grid(rows = vars(ajcc_stage), cols = vars(RefSeq_accession_number, gene_symbol)) +
    labs(x = "Protein Expression (Log2 iTRAQ ratios)", y = "Number of Tumor Samples") +
    theme(text = element_text(size = 13), legend.position = "none")
expression_distribution_plot

ERROR: Error: 'Outputs/top_10_mean_protein_expression_genes_stage_III_plus' does not exist in current working directory ('/home/jovyan/DSCI_100_Breast_Cancer_Classification/Proposal').


## 3. Methods

We will perform K-nearest neighbors classification on our testing dataset (25% of our entire dataset):
- **Predictors:** the log2 protein expression of the top 10 most highly expressed proteins in stage III+ breast cancer, identified earlier in our analysis (see the Jupyter notebook in our GitHub repository)
- **Predicted class:** AJCC stage of tumor samples


Our classifier will be trained on our training data (75% of our entire dataset). We will tune K using our training dataset and assess classifier accuracy by comparing classifier predictions of AJCC stage to the actual AJCC stages of our tumor samples in the testing dataset.
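
The workflow described above could be sketched roughly as follows. This is a minimal sketch under several assumptions that the proposal does not fix: it uses the `tidymodels` and `kknn` packages, tunes K with 5-fold cross-validation on the training set, standardizes the predictors, and runs on simulated stand-in data (`toy_data`, `protein_1`, `protein_2`, and the three stage labels are hypothetical, not our dataset). The accuracy values it produces describe the toy data only.

```r
library(tidyverse)
library(tidymodels)

set.seed(1000)

# Hypothetical stand-in data: 60 simulated samples, 2 simulated predictors.
toy_data <- tibble(
    ajcc_stage = factor(rep(c("Stage I", "Stage II", "Stage III"), each = 20)),
    protein_1 = rnorm(60, mean = rep(c(-1, 0, 1), each = 20)),
    protein_2 = rnorm(60, mean = rep(c(1, 0, -1), each = 20))
)

# 75% training / 25% testing split, stratified by stage
toy_split <- initial_split(toy_data, prop = 0.75, strata = ajcc_stage)
toy_train <- training(toy_split)
toy_test <- testing(toy_split)

# Standardize predictors using training data only (assumed preprocessing)
knn_recipe <- recipe(ajcc_stage ~ ., data = toy_train) |>
    step_scale(all_predictors()) |>
    step_center(all_predictors())

# KNN model specification with K left to be tuned
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
    set_engine("kknn") |>
    set_mode("classification")

# Assumed tuning scheme: 5-fold cross-validation over a small grid of K values
toy_vfold <- vfold_cv(toy_train, v = 5, strata = ajcc_stage)
k_grid <- tibble(neighbors = seq(from = 1, to = 15, by = 2))

knn_results <- workflow() |>
    add_recipe(knn_recipe) |>
    add_model(knn_spec) |>
    tune_grid(resamples = toy_vfold, grid = k_grid) |>
    collect_metrics() |>
    filter(.metric == "accuracy")

# Estimated accuracy against K (the tuning plot)
accuracy_vs_k_plot <- knn_results |>
    ggplot(aes(x = neighbors, y = mean)) +
    geom_point() +
    geom_line() +
    labs(x = "Neighbors (K)", y = "Estimated Accuracy")

# K with the highest cross-validated accuracy
best_k <- knn_results |>
    slice_max(mean, n = 1, with_ties = FALSE) |>
    pull(neighbors)

# Refit on the full training set with the chosen K
final_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = best_k) |>
    set_engine("kknn") |>
    set_mode("classification")

final_fit <- workflow() |>
    add_recipe(knn_recipe) |>
    add_model(final_spec) |>
    fit(data = toy_train)

# Compare predicted and actual stages on the held-out testing set
test_predictions <- predict(final_fit, toy_test) |>
    bind_cols(toy_test)

test_accuracy <- test_predictions |>
    metrics(truth = ajcc_stage, estimate = .pred_class) |>
    filter(.metric == "accuracy")
```

Keeping the testing set out of both the recipe and the tuning step means the final accuracy estimate is not influenced by data the classifier has already seen.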


We aim to create the following visualizations:
- Bar plot with AJCC stage on the x-axis and, on the y-axis, the number of samples in each stage according to the classifier vs. the actual observations
- Line plot of estimated classifier accuracy (y-axis) against the number of neighbors K (x-axis), used to tune K

Since our classifier uses 10 predictors, it is not practical to plot the training/testing data on a scatterplot that includes all of them (10 axes!).
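
The first plot in the list above could be built along these lines. This is a sketch on randomly generated placeholder labels, not classifier output: `toy_predictions` and its three stage levels are hypothetical, and the column names `ajcc_stage` and `.pred_class` are assumed to mirror a table of actual and predicted stages.

```r
library(tidyverse)

set.seed(1000)

stages <- c("Stage I", "Stage II", "Stage III")

# Hypothetical table of actual vs. predicted stages (random placeholder labels)
toy_predictions <- tibble(
    ajcc_stage = factor(sample(stages, 20, replace = TRUE), levels = stages),
    .pred_class = factor(sample(stages, 20, replace = TRUE), levels = stages)
)

# Reshape to one row per (sample, source) and count samples per stage and source
stage_counts <- toy_predictions |>
    pivot_longer(cols = c(ajcc_stage, .pred_class),
                 names_to = "source",
                 values_to = "stage") |>
    mutate(source = if_else(source == "ajcc_stage", "Actual", "Predicted")) |>
    count(stage, source)

# Side-by-side bars: actual vs. predicted sample counts for each stage
stage_count_plot <- stage_counts |>
    ggplot(aes(x = stage, y = n, fill = source)) +
    geom_col(position = "dodge") +
    labs(x = "AJCC Stage", y = "Number of Tumor Samples", fill = "Source")

stage_count_plot
```

Plotting the two sets of counts side by side shows at a glance which stages the classifier over- or under-predicts, although it does not show which individual samples were misclassified.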

## 4. Expected Outcomes and Significance

We expect to train a model that can classify the stage of breast cancer in new samples using proteome data. Although our training dataset is large, it is still only a selection of all possible proteomes related to breast cancer, so we expect our model to be accurate for a subgroup of breast cancer patients rather than for all of them.

Such a classifier is significant because it not only offers a practical tool for classifying existing patients, but also generates knowledge in related fields by identifying a combination of proteins whose expression correlates strongly with cancer development.

Once a classifier that is sufficiently accurate on this dataset is established, it is reasonable to ask whether its accuracy can be reproduced on other samples of breast cancer patients, and eventually whether it generalizes to other subgroups of cancer patients.