
# **Predicting Clinical Outcomes of Breast Cancer Patients**

## **Neural Networks and PCA**

**Author:** Meg Hutch

**Date:** November 27, 2019

**Objective:** Use the principal components produced by the **Gene Expression Analysis - PCA** as features for predicting clinical outcomes.

These include the principal components that together explain 90% of the variance.

Unlike the **Neural_Network_Clinical_Outcomes** analysis, this analysis uses the processed patients from merged_expression.txt.
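As a reminder of what "explaining 90% of the variance" means, here is a minimal sketch that picks the smallest number of components whose cumulative explained-variance ratio reaches 0.90. It uses a plain NumPy SVD on a small synthetic matrix. The data, the variable names, and the use of SVD are assumptions for illustration, not necessarily how the upstream PCA analysis was run.

```python
import numpy as np

# Hypothetical stand-in for the expression matrix (rows = patients, columns = genes)
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 20)) * np.linspace(5.0, 0.5, 20)

# PCA via SVD of the column-centered matrix
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

# Fraction of total variance explained by each component, and its running sum
explained_ratio = s**2 / np.sum(s**2)
cumulative = np.cumsum(explained_ratio)

# Smallest number of components whose cumulative ratio reaches 0.90
n_components_90 = int(np.searchsorted(cumulative, 0.90) + 1)

# Patient scores on the retained components (the analogue of pc_90)
scores_90 = Xc @ Vt[:n_components_90].T
print(n_components_90, scores_90.shape)
```

The retained scores play the role of the `pc_90` features, while keeping every component corresponds to `pc_all`.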

In [0]:
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import matplotlib.colors
import seaborn as sns

In [113]:
# Connect Colab to Google Drive
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [0]:
# Import Data
# Merged Expression Data
gene_data = pd.read_csv('/content/drive/My Drive/Projects/Breast_Cancer_Classification/Data/merged_expression.txt', sep=',')

## Principal Component Data
# All Principal Components
pc_all = pd.read_csv('/content/drive/My Drive/Projects/Breast_Cancer_Classification/Processed_Data/gene_pca_components_All_1747.txt')

# Principal Components Responsible for 90% of the variation
pc_90 = pd.read_csv('/content/drive/My Drive/Projects/Breast_Cancer_Classification/Processed_Data/gene_pca_components_90.txt')

# **Data Pre-Processing**

Check for missing values and add the event label to the principal component dataframes.

**Check if there are any missing values**

In [115]:
gene_data.isna().any()

Unnamed: 0    False
EVENT          True
OS_MONTHS     False
FIVE_YEAR      True
RERE          False
              ...  
CC2D1A        False
CB986545      False
IGSF9         False
DA110839      False
FAM71A         True
Length: 24372, dtype: bool

**Remove observations with missing values**

In [0]:
gene_data = gene_data.dropna()

**Check Final Number of Patients**

In [117]:
gene_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1747 entries, 0 to 1903
Columns: 24372 entries, Unnamed: 0 to FAM71A
dtypes: float64(24371), object(1)
memory usage: 324.9+ MB
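Called with no arguments, `dropna()` discards a row if any of its columns is missing, so a single sparse gene column can remove patients who still have an outcome label. The index above runs to 1903 while only 1747 entries remain, which shows that rows were dropped. A minimal sketch, on a small hypothetical dataframe (the `GENE_A` and `GENE_B` columns are made up), of how one might count what is missing before dropping and compare complete-case filtering with filtering on the label only:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for gene_data
toy = pd.DataFrame({
    "EVENT": [1.0, 0.0, np.nan, 1.0],
    "GENE_A": [0.2, 0.5, 0.1, 0.9],
    "GENE_B": [1.1, np.nan, 0.3, 0.4],
})

# Number of missing values per column, keeping only columns that have any
missing_per_column = toy.isna().sum()
missing_per_column = missing_per_column[missing_per_column > 0]

# Number of rows that dropna() with no arguments would remove
rows_with_any_missing = int(toy.isna().any(axis=1).sum())

# Complete-case filtering (what dropna() does by default)
complete_cases = toy.dropna()

# Alternative: require only the outcome label to be present
labelled_only = toy.dropna(subset=["EVENT"])

print(missing_per_column)
print(rows_with_any_missing, len(complete_cases), len(labelled_only))
```

Whether the label-only variant is appropriate depends on how missing gene values are handled downstream; the PCA inputs here were built from complete cases.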


**Add Outcome Variable to the PC Dataframes**

In [0]:
# Subset only the Event: whether the patient was alive (0) or died from breast cancer (1)
labels = gene_data.EVENT

# Create a list of row names
patients = list(gene_data.index)

# Convert labels into a dataframe and indicate patients as the index
labels = pd.DataFrame(labels, index = patients)

# Create a row ID from the row names 
#labels['ID'] = np.arange(len(labels))

pc_all = pc_all.set_index('Unnamed: 0')
pc_90 = pc_90.set_index('Unnamed: 0')

# Remove the index name (Unnamed: 0)
pc_all.index.name = None
pc_90.index.name = None

# Add labels to the pc dataframes
pc_all = pd.merge(pc_all, labels, left_index=True, right_index=True)
pc_90 = pd.merge(pc_90, labels, left_index=True, right_index=True)
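`pd.merge` defaults to an inner join, so any principal component row whose index has no match in `labels` is dropped without warning. A minimal sketch, on hypothetical toy frames, of one way to confirm that every row found a label, using the `how`, `validate`, and `indicator` parameters of `pd.merge`:

```python
import pandas as pd

# Hypothetical toy frames standing in for a PC dataframe and the labels
pcs = pd.DataFrame(
    {"PC1": [0.1, -0.4, 0.7], "PC2": [1.2, 0.3, -0.5]},
    index=[0, 2, 5],
)
labels = pd.DataFrame({"EVENT": [1.0, 0.0, 1.0, 0.0]}, index=[0, 2, 3, 5])

# Left join keeps every PC row; validate raises if an index value is duplicated;
# indicator adds a _merge column recording where each row was found
merged = pd.merge(
    pcs,
    labels,
    left_index=True,
    right_index=True,
    how="left",
    validate="one_to_one",
    indicator=True,
)

# PC rows that did not find a label (empty if the indexes line up)
unmatched = merged[merged["_merge"] != "both"]
merged = merged.drop(columns="_merge")

print(len(unmatched), merged.shape)
```

If `unmatched` is empty, the left join and the default inner join give the same rows, and the simpler call used above is safe.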

# **Predict Outcomes Using Gene Expression Principal Components**

We will attempt to predict Event (1 = died from breast cancer, 0 = alive) using only principal components as features, with the following classification methods:

* Logistic Regression
* Random Forest
* Neural Networks

Logistic regression and random forest classifiers will serve as performance benchmarks for the neural network classifier once it is developed.

*   Test all principal components
*   Test the principal components explaining 90% of the variance
*   We can continue reducing the number of principal components. I'm unsure whether doing so without clinical information will yield any results. Can a logistic regression even handle that many variables?
*   Then I will run analyses with principal components + clinical data and compare performance. I have more patients now than in the Neural Network Clinical Outcomes analysis. I should go back to the data processing_MH script to see why, but it will be good to re-run all models to assess.
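A minimal sketch of the benchmarking plan, assuming scikit-learn is available and using synthetic features in place of the real principal components. The sample sizes, the synthetic outcome, the hyperparameters, and the choice of ROC AUC as the metric are all assumptions for illustration. scikit-learn's `LogisticRegression` applies an L2 penalty by default, which is one common way to keep a model with many features relative to patients from overfitting.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a PC feature matrix (rows = patients, columns = PCs)
rng = np.random.default_rng(0)
n_patients, n_pcs = 300, 40
X = rng.normal(size=(n_patients, n_pcs))

# Synthetic binary outcome that depends on the first two components
logits = 1.5 * X[:, 0] - 1.0 * X[:, 1]
y = (rng.random(n_patients) < 1 / (1 + np.exp(-logits))).astype(int)

# Hold out a stratified test set so both classes appear in each split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Benchmark models; LogisticRegression uses L2 regularization by default
models = {
    "logistic_regression": LogisticRegression(C=1.0, max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
}

# Fit each model and score it by ROC AUC on the held-out patients
auc = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    auc[name] = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])

print(auc)
```

The same loop could be run once with the `pc_all` features and once with the `pc_90` features, giving the neural network a like-for-like baseline.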

