# Biol 359  |  Linear Discriminant Analysis
### Spring 2021, Week 6

<hr style="border:2px solid gray"> </hr>

The first 6 cells are the same as last week's activity. They still need to be run in order for this notebook to work, but are not the focus of this week.


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")

%matplotlib inline

sns.set(rc={'figure.figsize':(11.7,8.27)}) 
sns.set_style("white")

#### Import breast cancer data 
Optional reference: https://pandas.pydata.org/docs/index.html

In [None]:
from sklearn.datasets import load_breast_cancer
# NOTE:
# `breast_raw.data`: Stores the raw data (breast feature data)
# `breast_raw.feature_names`: Stores the raw data feature labels
# `breast_raw.target`: Stores the tumor type (0 = 'malignant', 1 = 'benign')
# `breast_raw.target_names`: Stores the tumor type labels ('malignant' or 'benign')
# `breast_raw.DESCR`: Description of the data
breast_raw = load_breast_cancer()

# Print a description of the data (comment out the following line to suppress it)
print(breast_raw.DESCR)
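
Because every legend and label later in the notebook depends on which integer means which tumor type, it may be worth checking the encoding directly rather than trusting a comment. The sketch below is a minimal check that assumes the scikit-learn copy of the data set; `label_by_code` and `counts` are illustrative names, not part of the notebook.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()

# target_names is indexed by the integer code stored in `target`
label_by_code = {code: str(name) for code, name in enumerate(data.target_names)}

# Count how many samples carry each code
counts = {label_by_code[code]: int(np.sum(data.target == code))
          for code in label_by_code}

for code, name in label_by_code.items():
    print(f"target == {code}: {name} ({counts[name]} samples)")
```

The position of a name in `target_names` is the integer code used in `target`, so this printout is the mapping to use when relabelling plots.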

In [None]:
# Feature data set
features = pd.DataFrame(breast_raw.data, columns=breast_raw.feature_names)
features.head()

In [None]:
# Tumor label data set (0 = malignant, 1 = benign)
tumor = pd.DataFrame(breast_raw.target, columns=['tumor'])
# tumor.replace({'tumor': {0: 'malignant', 1: 'benign'}}, inplace=True)
tumor.head()

In [None]:
# Concatenate into one data frame
breast = pd.concat([features, tumor], axis=1)
# breast.loc[:, breast.columns != 'tumor'].head()
# breast.loc[:, breast.columns == 'tumor'].head()

#### Assess feature data statistics

In [None]:
features.describe()

### Plotting our data! 

With 30 features, plotting every feature against every other one would take hundreds of plots. Below are four examples of how we could plot pairs of features against one another. 

In [None]:
# Create scatter plots of the various features
fig, axs = plt.subplots(2, 2, figsize=(10, 10))
features.plot.scatter(ax=axs[0, 0], x="mean radius", y="mean area", alpha=0.5, color='red');
features.plot.scatter(ax=axs[0, 1], x="mean radius", y="mean texture", alpha=0.5, color='green');
features.plot.scatter(ax=axs[1, 0], x="mean concave points", y="mean concavity", alpha=0.5, color='blue');
features.plot.scatter(ax=axs[1, 1], x="mean concave points", y="mean fractal dimension", alpha=0.5, color='orange');
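
To put a number on how many scatter plots a full pairwise comparison of 30 features would need, here is a small sketch using only the standard library. The variable names are illustrative.

```python
from math import comb

n_features = 30

# Each unordered pair of distinct features gives one scatter plot
n_pairs = comb(n_features, 2)
print(f"{n_features} features -> {n_pairs} distinct scatter plots")
```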

### Another popular method is to use a Pair Plot. 


In [None]:
num_feats = 6
g = sns.pairplot(data=breast, corner=True, vars=features.columns.to_list()[0:num_feats], 
                 hue='tumor', plot_kws=dict(alpha=.4))

#you can ignore this
handles = g._legend_data.values()
g.fig.legend(handles=handles, loc='lower center', ncol=2, labels=['malignant','benign'], frameon=False)
g._legend.remove()
g.fig.suptitle('Pair Plot', weight='bold', size='xx-large')
g.fig.subplots_adjust(top=0.96, bottom=0.06)

### We can use PCA to identify axes (eigenvectors) that will retain the most information (variance) from our data:

In [None]:
# Last week's code did PCA from scratch; this week we use the PCA tool from sklearn
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

def PerformPCA(X):
    """
    Uses sklearn PCA tool to perform PCA
    input:
    X: Pandas Dataframe or Numpy Array of features
    
    output:
    X_pca: Pandas dataframe with column titles of PC1,...,PCn
    pca: the fitted sklearn PCA object
    """
    X_standardized = StandardScaler().fit_transform(X)
    pca = PCA()
    pca.fit(X_standardized)
    X_pca_array = pca.transform(X_standardized)
    column_names = ['PC{}'.format(i+1) for i in range(X_pca_array.shape[1])] 
    X_pca = pd.DataFrame(X_pca_array, columns=column_names)
    return X_pca, pca

pca_features, pca_results = PerformPCA(features)
pca_features

### Note: the PCs have been ordered from highest amount of variance explained to least: PC1 captures the most variance in the data.

In [None]:
PC_values = np.arange(pca_results.n_components_) + 1
plt.bar(PC_values, pca_results.explained_variance_ratio_);
plt.title("Scree Plot", weight='bold', size='xx-large' )
plt.xlabel("Principal Component")
plt.ylabel("Fraction of Variance Explained")
sns.despine()
plt.show()

### Can you see the diminishing returns in how many principal components to include?
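
One way to make the diminishing returns concrete is to look at the cumulative share of variance. The following is a minimal sketch, assuming the same standardized breast cancer features used above; the 95% `threshold` is an arbitrary illustrative cutoff, not a rule from this course.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Standardize the features, then fit PCA with all components
X = StandardScaler().fit_transform(load_breast_cancer().data)
ratios = PCA().fit(X).explained_variance_ratio_

# Running total of the variance explained by the first k PCs
cumulative = np.cumsum(ratios)

# Smallest number of PCs whose cumulative share reaches the threshold
threshold = 0.95  # hypothetical cutoff, chosen only for illustration
n_needed = int(np.argmax(cumulative >= threshold)) + 1

for k in (1, 2, 5, 10):
    print(f"First {k:2d} PCs: {cumulative[k - 1]:.1%} of the variance")
print(f"{n_needed} PCs are needed to reach {threshold:.0%}")
```

Each additional PC adds less to the running total than the one before it, which is the flattening you see on the right of the scree plot.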

In [None]:
pca_breast = pd.concat([pca_features, tumor], axis=1)

num_feats = 6
g = sns.pairplot(data=pca_breast, corner=True, vars=pca_features.columns.to_list()[0:num_feats], 
                 hue='tumor', plot_kws=dict(alpha=.4))

#you can ignore this
handles = g._legend_data.values()
g.fig.legend(handles=handles, loc='lower center', ncol=2, labels=['malignant','benign'], frameon=False)
g._legend.remove()
g.fig.suptitle('Pair Plot', weight='bold', size='xx-large')
g.fig.subplots_adjust(top=0.96, bottom=0.06)

### Since PC1 and PC2 together capture most of the variance in our data, we can make a 2D plot from just those two:

Be careful with your interpretations here. We haven't necessarily done an analysis; we have just adjusted the **perspective** from which we are looking at the data.

In [None]:
g = sns.scatterplot(data=pca_breast, x='PC1', y='PC2', hue='tumor', alpha=.6)
sns.despine()
#you can ignore this
handles, _ = g.get_legend_handles_labels()
g.legend(handles=handles, loc='lower center', ncol=2, labels=['malignant','benign'])
plt.title('PCA plot with PC1 and PC2', weight='bold', size='xx-large')
plt.show()
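
The idea that PCA only changes the perspective can be checked numerically: when all components are kept, the transformation is a rotation, so distances between samples and the total variance are unchanged. The sketch below illustrates this with NumPy on synthetic data (not the tumor data), using the SVD of the centered matrix as a stand-in for sklearn's PCA. All names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic correlated data: 50 samples, 4 features
X = rng.normal(size=(50, 4)) @ rng.normal(size=(4, 4))
Xc = X - X.mean(axis=0)  # center each feature

# Rows of Vt are the principal axes
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = Xc @ Vt.T  # coordinates of each sample in PC space

# Distance between the first two samples, before and after
d_before = np.linalg.norm(Xc[0] - Xc[1])
d_after = np.linalg.norm(scores[0] - scores[1])

# Total variance, before and after
total_var_before = Xc.var(axis=0).sum()
total_var_after = scores.var(axis=0).sum()

print(f"distance: {d_before:.4f} -> {d_after:.4f}")
print(f"total variance: {total_var_before:.4f} -> {total_var_after:.4f}")
```

Information is only lost once we drop the later PCs, as in the 2D plot above.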

### Remember, PCA is unsupervised:
Unsupervised means that we did **not** give the algorithm any information about which tumors are benign and which are malignant. 
What if our goal was to separate the two classes or classify new data?
What if we tried to use that information to inform our new axes?

**Let's move on to Linear Discriminant Analysis**

### Supervised Learning: 

If unsupervised learning helps us identify **patterns** in the data that we couldn't otherwise see, supervised learning is used for making **predictions** about our data. What if we had a new tumor sample, measured with the same metrics as before (*e.g.* mean area, mean concavity)? Could we use our previous data to predict malignancy? 

Let's use Linear Discriminant Analysis to try to make these predictions. As you will see, it has some similar utility to PCA: we will project our features onto a new dimension, trying to separate our classes, benign and malignant.
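
To give a feel for what "projecting onto a new dimension to separate the classes" means, here is a rough sketch of the classic two-class Fisher discriminant written in NumPy. It uses synthetic data with equal class sizes, and it is not how sklearn implements LDA internally; the `separation` helper and all variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic two-class data in 2D (illustrative only, not the tumor data)
class0 = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(200, 2))
class1 = rng.normal(loc=[3.0, 1.0], scale=1.0, size=(200, 2))

mu0, mu1 = class0.mean(axis=0), class1.mean(axis=0)

# Within-class scatter: spread of each class around its own mean
S_w = (class0 - mu0).T @ (class0 - mu0) + (class1 - mu1).T @ (class1 - mu1)

# Fisher direction: w is proportional to S_w^{-1} (mu1 - mu0)
w = np.linalg.solve(S_w, mu1 - mu0)
w /= np.linalg.norm(w)

def separation(a, b):
    """Squared distance between class means relative to within-class variance."""
    return (a.mean() - b.mean()) ** 2 / (a.var() + b.var())

# Compare projecting onto w with simply using the first raw feature
fisher_score = separation(class1 @ w, class0 @ w)
axis_score = separation(class1[:, 0], class0[:, 0])

print(f"separation along Fisher direction: {fisher_score:.2f}")
print(f"separation along first feature:    {axis_score:.2f}")
```

Unlike PCA, the direction `w` is chosen using the class labels: it pushes the class means apart while keeping each class tight.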

In [None]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def PerformLDA(X, y):
    """
    Uses sklearn LinearDiscriminantAnalysis tool to perform LDA
    input:
    X: Pandas Dataframe or Numpy Array of features
    y: Pandas Series or Numpy Vector of target 
    
    output:
    X_lda: Pandas dataframe with column titles of LD1,...,LDn
    lda: the fitted sklearn LinearDiscriminantAnalysis object
    """
    X_standardized = StandardScaler().fit_transform(X)
    lda = LinearDiscriminantAnalysis()
    lda.fit(X_standardized,y)
    X_lda_array = lda.transform(X_standardized)
    column_names = ['LD{}'.format(i+1) for i in range(X_lda_array.shape[1])] 
    X_lda = pd.DataFrame(X_lda_array, columns=column_names)
    return X_lda, lda

lda_features, lda_results = PerformLDA(features, tumor)
lda_breast = lda_features.join(tumor)
lda_features
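
Notice that the LDA output has a single column, while PCA gave 30. LDA yields at most one fewer discriminant than there are classes, so two classes give only LD1. The following sketch, which assumes the same standardized data as above and uses illustrative variable names, checks the shape directly.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X = StandardScaler().fit_transform(data.data)
y = data.target

# Fit LDA with default settings and project the samples
projected = LinearDiscriminantAnalysis().fit(X, y).transform(X)

n_classes = len(set(y))
print(f"{X.shape[1]} features, {n_classes} classes -> "
      f"{projected.shape[1]} discriminant axis")
```

This is why the comparison plots that follow are one-dimensional for LDA.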

### Now that we've reduced the dimensionality with LDA, let's compare it visually to PCA. 

These four plots show the exact same data, but one might be easier to interpret than another. Feel free to draw your conclusions from any of them (and maybe think about which conclusions are easier to draw from which visualizations!) 

In [None]:
fig, axs = plt.subplots(2)
sns.swarmplot(ax=axs[0], data=lda_breast[lda_breast['tumor']==0], x='LD1', color=sns.color_palette()[0], alpha=.7)
sns.swarmplot(ax=axs[0], data=lda_breast[lda_breast['tumor']==1], x='LD1', color=sns.color_palette()[1], alpha=.7)
sns.despine(left=True)
sns.swarmplot(ax=axs[1], data=pca_breast[pca_breast['tumor']==0], x='PC1', color=sns.color_palette()[0], alpha=.7)
sns.swarmplot(ax=axs[1], data=pca_breast[pca_breast['tumor']==1], x='PC1', color=sns.color_palette()[1], alpha=.7)
sns.despine(left=True)
axs[0].text(-6, .2, 'LDA', weight='bold', size='xx-large')
axs[1].text(-15, .2, 'PCA', weight='bold', size='xx-large')
axs[0].set_xlim(-6,6)
axs[1].set_xlim(-15,15)
fig.legend(handles=handles, loc='lower center', ncol=2, labels=['malignant','benign'], frameon=False)
fig.suptitle('Swarm Plot', weight='bold', size='xx-large')
plt.show()



In [None]:
fig, axs = plt.subplots(2)

sns.boxplot(ax=axs[0], data=lda_breast.replace({'tumor': {0: 'malignant', 1: 'benign'}}), 
              x='LD1', y='tumor', color='white', fliersize=0)
sns.swarmplot(ax=axs[0], data=lda_breast.replace({'tumor': {0: 'malignant', 1: 'benign'}}), 
              x='LD1', y='tumor', alpha=.7)

sns.despine(left=True)
sns.boxplot(ax=axs[1], data=pca_breast.replace({'tumor': {0: 'malignant', 1: 'benign'}}), 
              x='PC1', y='tumor', color='white', fliersize=0)
sns.swarmplot(ax=axs[1], data=pca_breast.replace({'tumor': {0: 'malignant', 1: 'benign'}}), 
              x='PC1', y='tumor', alpha=.7)
sns.despine(left=True)
axs[0].text(-6, .5, 'LDA', weight='bold', size='xx-large')
axs[1].text(-15, .5, 'PCA', weight='bold', size='xx-large')
axs[0].set_xlim(-6,6)
axs[1].set_xlim(-15,15)
fig.legend(handles=handles, loc='lower center', ncol=2, labels=['malignant','benign'], frameon=False)
fig.suptitle('Swarm Plot with Box Plot', weight='bold', size='xx-large')
plt.show()


In [None]:
fig, axs = plt.subplots(2)
sns.histplot(ax=axs[0], data=lda_breast, x='LD1', hue='tumor', alpha=.4, fill=True, legend=False)
sns.despine()
sns.histplot(ax=axs[1], data=pca_breast, x='PC1', hue='tumor', alpha=.4, fill=True, legend=False)
sns.despine()
axs[0].text(-5.6, 60, 'LDA', weight='bold', size='xx-large')
axs[1].text(-14, 60, 'PCA', weight='bold', size='xx-large')
axs[0].set_xlim(-6,6)
axs[1].set_xlim(-15,15)
fig.legend(handles=handles, loc='lower center', ncol=2, labels=['malignant','benign'], frameon=False)
fig.suptitle('Histograms', weight='bold', size='xx-large')
plt.show()



In [None]:
fig, axs = plt.subplots(2)
sns.kdeplot(ax=axs[0], data=lda_breast, x='LD1', hue='tumor', alpha=.4, fill=True, legend=False)
sns.despine()
sns.kdeplot(ax=axs[1], data=pca_breast, x='PC1', hue='tumor', alpha=.4, fill=True, legend=False)
sns.despine()
axs[0].text(-5.6, .15, 'LDA', weight='bold', size='xx-large')
axs[1].text(-14, .08, 'PCA', weight='bold', size='xx-large')
axs[0].set_xlim(-6,6)
axs[1].set_xlim(-15,15)
fig.legend(handles=handles, loc='lower center', ncol=2, labels=['malignant','benign'], frameon=False)
fig.suptitle('Kernel Density Plot', weight='bold', size='xx-large')
plt.show()

### Now let's try to use LDA to predict whether a tumor is benign or malignant. 

First we will split our data into an 80% training set and a 20% test set. 

In [None]:
from sklearn.model_selection import train_test_split

split = 0.2

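X_train_raw, X_test_raw, y_train, y_test = train_test_split(features, tumor, 
                                                            test_size=split, random_state=5)

# Fit the scaler on the training set only, so no information from the test set leaks into training
scaler = StandardScaler().fit(X_train_raw)
X_train = scaler.transform(X_train_raw)
X_test = scaler.transform(X_test_raw)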

print(f"My training set has {X_train.shape[0]} observations, and my test set has {X_test.shape[0]}.")
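
To see where the two set sizes come from, the sketch below reproduces an 80/20 split by hand with NumPy. It is only an illustration of the idea: the seed is arbitrary and the shuffle will not match the one `train_test_split` produces with `random_state=5`. To my understanding scikit-learn rounds the test-set size up, which is mirrored here with `math.ceil`.

```python
import math
import numpy as np

n_samples = 569        # number of rows in the breast cancer data set
test_fraction = 0.2

# Shuffle the row indices (hypothetical seed, unrelated to random_state above)
rng = np.random.default_rng(5)
shuffled = rng.permutation(n_samples)

# Round the test-set size up, then give the remaining rows to training
n_test = math.ceil(test_fraction * n_samples)
test_idx, train_idx = shuffled[:n_test], shuffled[n_test:]

print(f"train: {len(train_idx)} rows, test: {len(test_idx)} rows")
```

Every row lands in exactly one of the two sets, so the test rows are never seen during fitting.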

### How can we evaluate our classification? 

A ubiquitous metric is "accuracy", which is the percentage of the set (training or test) that the algorithm predicted correctly. The training set is the data where the algorithm "sees" the target/response class. The test set is the data where we withhold the class until the algorithm has made its prediction.

Remember: we have the ground truth of benign v. malignant to compare to, and we just need to give the algorithm the features. Please be critical of any biases in your ground truth data, as your algorithm will only be as effective as the data you provide.
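
As a small worked example of what accuracy measures (and what it hides), the sketch below scores six made-up predictions. The labels are hypothetical and follow scikit-learn's encoding for this data set (0 = malignant, 1 = benign).

```python
import numpy as np

# Hypothetical ground truth and predictions for six tumors
# (0 = malignant, 1 = benign)
y_true = np.array([0, 0, 0, 1, 1, 1])
y_pred = np.array([0, 0, 1, 1, 1, 1])

# Accuracy: fraction of predictions that match the ground truth
accuracy = np.mean(y_true == y_pred)

# The two kinds of mistakes are not equally costly in a clinical setting
missed_malignant = int(np.sum((y_true == 0) & (y_pred == 1)))
false_alarms = int(np.sum((y_true == 1) & (y_pred == 0)))

print(f"accuracy: {accuracy:.1%}")
print(f"malignant tumors predicted benign: {missed_malignant}")
print(f"benign tumors predicted malignant: {false_alarms}")
```

A single accuracy number treats both kinds of error the same, so it is worth looking at which samples were missed as well as how many.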

In [None]:
def PerformLDA(X, y):
    """
    Uses sklearn LinearDiscriminantAnalysis tool to perform LDA
    input:
    X: Pandas Dataframe or Numpy Array of features (already standardized)
    y: Pandas Series or Numpy Vector of target 
    
    output:
    X_lda: Pandas dataframe with column titles of LD1,...,LDn
    lda: the fitted sklearn LinearDiscriminantAnalysis object
    """
    # The features were already standardized in the train/test split cell, so we do not rescale here
    lda = LinearDiscriminantAnalysis()
    lda.fit(X, y)
    X_lda_array = lda.transform(X)
    column_names = ['LD{}'.format(i+1) for i in range(X_lda_array.shape[1])] 
    X_lda = pd.DataFrame(X_lda_array, columns=column_names)
    return X_lda, lda

lda_train, lda_model = PerformLDA(X_train, y_train)
train_accuracy = lda_model.score(X_train, y_train)
print(f"Training classification accuracy of {train_accuracy*100:0.1f}%")
test_accuracy = lda_model.score(X_test, y_test)
print(f"Test classification accuracy of {test_accuracy*100:0.1f}%")

In [None]:
lda_train

In [None]:
train = pd.DataFrame(lda_model.transform(X_train), columns=['LD1'])
train['tumor'] = y_train.values
train['train'] = 'train'
test = pd.DataFrame(lda_model.transform(X_test), columns=['LD1'])
test['tumor'] = y_test.values
test['train'] = 'test'
test['predict'] = lda_model.predict(X_test)


total_set = pd.concat([train, test], ignore_index=True)
total_set = total_set.replace({'tumor': {0: 'malignant', 1: 'benign'}})
total_set = total_set.replace({'predict': {0: 'malignant', 1: 'benign'}})

sns.swarmplot(data=total_set, x='LD1', y='train', hue='tumor', hue_order=['malignant','benign'],alpha=.7)
sns.despine()
# Training rows have no prediction (NaN), so dropna() keeps only the misclassified test samples
misses=total_set[total_set['predict']!=total_set['tumor']].dropna()
# Dashed lines mark the range of LD1 values in which test samples were misclassified
plt.axvline(misses['LD1'].min(), linestyle='--', color='r', linewidth=1)
plt.axvline(misses['LD1'].max(), linestyle='--', color='r', linewidth=1)
plt.show()

In [None]:
misses[['LD1','tumor','predict']]