# **Gene Expression Cancer RNA-Seq Analysis**

# **Context**

- RNA sequencing (RNA-Seq) is one of the most commonly used techniques in the life sciences and has been widely applied in cancer research, drug development, and cancer diagnosis and prognosis.
- Sequencing the coding regions or the whole cancer transcriptome can provide valuable information about gene expression changes in tumors.
- Cancer RNA-Seq enables detection of strand-specific information, an important component of gene regulation.
- Cancer transcriptome sequencing captures both coding and noncoding RNA and provides strand orientation for a complete view of expression dynamics.

# **Objective**

To use principal component analysis (PCA) to transform a large set of variables into a smaller one that still retains as much of the information in the original set as possible.

# **Data Description**

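This collection of data is part of the RNA-Seq (HiSeq) PANCAN data set. It is a random extraction of gene expression measurements from patients with different types of tumor:
- BRCA (breast invasive carcinoma)
- KIRC (kidney renal clear cell carcinoma)
- COAD (colon adenocarcinoma)
- LUAD (lung adenocarcinoma)
- PRAD (prostate adenocarcinoma)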


Samples (instances) are stored row-wise. The variables (attributes) of each sample are RNA-Seq gene expression levels measured on the Illumina HiSeq platform.

A dummy name (gene_XX) is given to each attribute.



------------------------
# **Concepts to Cover**
------------------------
- 1. <a href = #link1>Overview of the data</a>
- 2. <a href = #link2>Data Preparation</a>
- 3. <a href = #link3>Apply PCA technique</a>
- 4. <a href = #link4>Visualize the Data Points</a>
- 5. <a href = #link5>Conclusion</a>

# **Let's Start Coding!**

In [2]:
import pandas as pd
import numpy as np

In [3]:
path = 'F:/GL Office/case study/Unsupervised Learning/Practice exercise 2/PCA/TCGA-PANCAN-HiSeq-801x20531/'
data = pd.read_csv(path+'data.csv', index_col=[0])

In [4]:
labels = pd.read_csv(path+'labels.csv', index_col=[0])

# <a id='link1'>Overview of the data</a>

In [5]:
labels.head()

Unnamed: 0,Class
sample_0,PRAD
sample_1,LUAD
sample_2,PRAD
sample_3,PRAD
sample_4,BRCA


### In the above output, the Class column gives the tumor type, which is one of five: 'PRAD', 'LUAD', 'BRCA', 'KIRC', and 'COAD'

In [6]:
data.head()

Unnamed: 0,gene_0,gene_1,gene_2,gene_3,gene_4,gene_5,gene_6,gene_7,gene_8,gene_9,...,gene_20521,gene_20522,gene_20523,gene_20524,gene_20525,gene_20526,gene_20527,gene_20528,gene_20529,gene_20530
sample_0,0.0,2.017209,3.265527,5.478487,10.431999,0.0,7.175175,0.591871,0.0,0.0,...,4.926711,8.210257,9.723516,7.22003,9.119813,12.003135,9.650743,8.921326,5.286759,0.0
sample_1,0.0,0.592732,1.588421,7.586157,9.623011,0.0,6.816049,0.0,0.0,0.0,...,4.593372,7.323865,9.740931,6.256586,8.381612,12.674552,10.517059,9.397854,2.094168,0.0
sample_2,0.0,3.511759,4.327199,6.881787,9.87073,0.0,6.97213,0.452595,0.0,0.0,...,5.125213,8.127123,10.90864,5.401607,9.911597,9.045255,9.788359,10.09047,1.683023,0.0
sample_3,0.0,3.663618,4.507649,6.659068,10.196184,0.0,7.843375,0.434882,0.0,0.0,...,6.076566,8.792959,10.14152,8.942805,9.601208,11.392682,9.694814,9.684365,3.292001,0.0
sample_4,0.0,2.655741,2.821547,6.539454,9.738265,0.0,6.566967,0.360982,0.0,0.0,...,5.996032,8.891425,10.37379,7.181162,9.84691,11.922439,9.217749,9.461191,5.110372,0.0


In [7]:
data.shape

(801, 20531)

In [8]:
labels.shape

(801, 1)

### Let's combine the data and its labels using the below code:

In [9]:
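frames = [labels, data]
# axis=1 places the Class column next to the gene columns, matching rows by sample ID.
# reset_index(drop=True) then replaces the sample IDs with 0..800 so that rows line up
# by position with the PCA output DataFrame built later.
result = pd.concat(frames, axis=1).reset_index(drop=True)

In [10]:
result.head()

Unnamed: 0,Class,gene_0,gene_1,gene_2,gene_3,gene_4,gene_5,gene_6,gene_7,gene_8,...,gene_20521,gene_20522,gene_20523,gene_20524,gene_20525,gene_20526,gene_20527,gene_20528,gene_20529,gene_20530
0,PRAD,0.0,2.017209,3.265527,5.478487,10.431999,0.0,7.175175,0.591871,0.0,...,4.926711,8.210257,9.723516,7.22003,9.119813,12.003135,9.650743,8.921326,5.286759,0.0
1,LUAD,0.0,0.592732,1.588421,7.586157,9.623011,0.0,6.816049,0.0,0.0,...,4.593372,7.323865,9.740931,6.256586,8.381612,12.674552,10.517059,9.397854,2.094168,0.0
2,PRAD,0.0,3.511759,4.327199,6.881787,9.87073,0.0,6.97213,0.452595,0.0,...,5.125213,8.127123,10.90864,5.401607,9.911597,9.045255,9.788359,10.09047,1.683023,0.0
3,PRAD,0.0,3.663618,4.507649,6.659068,10.196184,0.0,7.843375,0.434882,0.0,...,6.076566,8.792959,10.14152,8.942805,9.601208,11.392682,9.694814,9.684365,3.292001,0.0
4,BRCA,0.0,2.655741,2.821547,6.539454,9.738265,0.0,6.566967,0.360982,0.0,...,5.996032,8.891425,10.37379,7.181162,9.84691,11.922439,9.217749,9.461191,5.110372,0.0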
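The axis passed to `pd.concat` decides whether labels and features end up side by side or stacked on top of each other. The sketch below illustrates the difference on two tiny frames; `toy_labels` and `toy_data` are hypothetical stand-ins with made-up values, not the real tables.

```python
import pandas as pd

# Hypothetical two-sample stand-ins for the labels and expression tables.
toy_labels = pd.DataFrame({"Class": ["PRAD", "LUAD"]}, index=["sample_0", "sample_1"])
toy_data = pd.DataFrame(
    {"gene_0": [0.0, 0.5], "gene_1": [2.0, 1.5]}, index=["sample_0", "sample_1"]
)

# axis=0 (the default) stacks the frames vertically: 4 rows, and every cell
# without a matching column is filled with NaN.
stacked = pd.concat([toy_labels, toy_data], ignore_index=True)

# axis=1 aligns rows on the shared index and places the columns side by side.
side_by_side = pd.concat([toy_labels, toy_data], axis=1)

print(stacked.shape, int(stacked.isna().sum().sum()))
print(side_by_side.shape, int(side_by_side.isna().sum().sum()))
```

If the gene columns of a combined frame are all NaN in the first rows, row-wise stacking is the likely cause.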


# <a id='link2'>Data Preparation</a>

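- Standardization is important in PCA because PCA is a variance-maximizing exercise: it projects the original data onto the directions that maximize the variance.
- Without it, features measured on larger scales would dominate the principal components simply because their variance is larger.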
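To see why scaling matters, here is a minimal sketch on synthetic data (not the RNA-Seq set). It assumes two independent features whose scales differ by a factor of 100; the names `raw`, `ratio_raw`, and `ratio_scaled` are illustrative only.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Two independent synthetic features; the second has a much larger scale.
raw = np.column_stack([rng.normal(0, 1, 500), rng.normal(0, 100, 500)])

# PCA on the raw data: the large-scale feature dominates the first component.
ratio_raw = PCA(n_components=2).fit(raw).explained_variance_ratio_

# PCA after standardization: both features contribute on an equal footing.
scaled = StandardScaler().fit_transform(raw)
ratio_scaled = PCA(n_components=2).fit(scaled).explained_variance_ratio_

print(ratio_raw)
print(ratio_scaled)
```

On the raw data almost all of the variance lands on the first component, purely because of the units. After standardization the two components share the variance roughly evenly.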

In [11]:
from sklearn.preprocessing import StandardScaler
X = data.values  # feature matrix: one row per sample, one column per gene
X = StandardScaler().fit_transform(X)  # standardize each gene to zero mean and unit variance

In [12]:
X.shape

(801, 20531)

In [13]:
np.mean(X),np.std(X)

(-1.091446422314019e-18, 0.9934763587711302)

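### The mean is nearly zero and the standard deviation is close to 1, as expected after standardization. The standard deviation is slightly below 1, most likely because genes with the same value in every sample have zero variance and StandardScaler leaves such columns at zero.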
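One plausible reason the overall standard deviation is not exactly 1 is the presence of constant columns. Several genes are zero in every row of the preview, though that alone does not prove they are zero in all samples. The sketch below reproduces the effect on a hypothetical toy matrix in which two of ten columns are all zeros.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
toy = rng.normal(5.0, 2.0, size=(100, 10))
toy[:, :2] = 0.0  # two constant (all-zero) columns, an assumption for illustration

scaled = StandardScaler().fit_transform(toy)

# Constant columns stay at zero after scaling, so their standard deviation is 0.
per_column_std = scaled.std(axis=0)
n_constant = int((per_column_std == 0).sum())

# The overall variance is the average of the per-column variances: 8/10 here.
overall_std = np.std(scaled)
print(n_constant, overall_std)
```

Running the same per-column check on `X` would show how many genes are constant in the real data.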

# <a id='link3'>Apply PCA technique</a>

### Now comes the critical part: the next few lines of code project the 20531-dimensional cancer RNA-Seq data onto two principal components.

- Original dimensions = 20531
- Dimensions after applying PCA = 2

### Note: We use only the features for dimensionality reduction; the labels are not needed.

In [14]:
from sklearn.decomposition import PCA
pca_data = PCA(n_components=2)
principalComponents_data = pca_data.fit_transform(X)
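For intuition about what `fit_transform` computes, the following sketch compares scikit-learn's PCA with a direct singular value decomposition of the centered data. It uses a small random matrix (`toy`, an illustrative name) rather than the RNA-Seq data, and it compares absolute values because the sign of each component is arbitrary.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
toy = rng.normal(size=(50, 8))
toy_centered = toy - toy.mean(axis=0)

# SVD of the centered matrix: the rows of Vt are the principal directions.
U, S, Vt = np.linalg.svd(toy_centered, full_matrices=False)
scores_svd = toy_centered @ Vt[:2].T       # projection onto the first two directions
ratio_svd = (S**2 / np.sum(S**2))[:2]      # share of variance per component

pca = PCA(n_components=2, svd_solver="full").fit(toy)
scores_pca = pca.transform(toy)

print(np.allclose(np.abs(scores_svd), np.abs(scores_pca)))
print(np.allclose(ratio_svd, pca.explained_variance_ratio_))
```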

### Let's convert the above result into a dataframe.

In [15]:
principal_data_df = pd.DataFrame(data = principalComponents_data, columns = ['principal component 1', 'principal component 2'])

In [16]:
principal_data_df.tail()

Unnamed: 0,principal component 1,principal component 2
796,-12.417385,-42.321573
797,-29.415555,28.526281
798,-4.13309,15.690014
799,-30.814757,33.526422
800,-22.344557,4.052356


In [17]:
print('Explained variation per principal component: {}'.format(pca_data.explained_variance_ratio_))

Explained variation per principal component: [0.10539781 0.08754232]


### In the above result, the explained variance is shown.

- The first principal component explains 10.54% of the total variance in the data.
- The second principal component explains 8.75% of the total variance in the data.

So, that's a huge reduction in dimensions, from 20531 dimensions to 2, and these 2 dimensions together explain 19.29% of the total variance in this dataset.

For some other datasets, PCA can explain a significantly larger share of the variance than we see here (e.g. the first two principal components may explain more than 90% of the variance in the dataset).
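When a target share of variance is the goal rather than a fixed number of components, the cumulative explained variance ratio can be used to pick the count. The sketch below does this on synthetic data generated from 5 latent factors (an assumption made only for illustration); the result says nothing about how many components the RNA-Seq data would need.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic data: 200 samples, 50 features driven by 5 latent factors plus noise.
latent = rng.normal(size=(200, 5))
toy = latent @ rng.normal(size=(5, 50)) + 0.1 * rng.normal(size=(200, 50))

# Fit all components and accumulate their explained variance ratios.
pca_full = PCA().fit(toy)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio reaches 90%.
n_for_90 = int(np.searchsorted(cumulative, 0.90) + 1)

# scikit-learn also accepts a float in (0, 1) and selects the count itself.
pca_90 = PCA(n_components=0.90, svd_solver="full").fit(toy)
print(n_for_90, pca_90.n_components_)
```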

In [18]:
targets = list(labels.Class.unique())
targets

['PRAD', 'LUAD', 'BRCA', 'KIRC', 'COAD']

# <a id='link4'>Visualize the Data Points</a>

### We know that it's impossible for humans to visualize data in 20531 dimensions.

- We can visualize the data in 2 dimensions.
- We can also visualize the data in 3 dimensions using 3-D plots.
- In some cases we can also visualize the data in 4-D by using a different hue for the 4th dimension in a 3-D plot.
- But as mentioned earlier, it's impossible for us to visualize and interpret the data in 20531 dimensions.

### So, using PCA we reduced the data to 2-D, and now it's easy for us to visualize it.

In [19]:
import matplotlib.pyplot as plt

# Create a single figure (a second plt.figure() call would leave an empty extra figure)
plt.figure(figsize=(10,10))
plt.xticks(fontsize=12)
plt.yticks(fontsize=14)
plt.xlabel('Principal Component - 1',fontsize=20)
plt.ylabel('Principal Component - 2',fontsize=20)
plt.title("Principal Component Analysis of RNA-Seq Dataset",fontsize=20)
colors = ['r', 'g', 'b', 'c', 'm']
for target, color in zip(targets, colors):
    # Boolean mask selecting the samples that belong to this tumor type
    indicesToKeep = result['Class'] == target
    plt.scatter(principal_data_df.loc[indicesToKeep, 'principal component 1'],
                principal_data_df.loc[indicesToKeep, 'principal component 2'],
                c = color, s = 50, label = target)

# The legend entries come from the label passed to each scatter call
plt.legend(prop={'size': 15})


### Insights:

- From the above plot, we can observe that the five classes (**PRAD**, **LUAD**, **BRCA**, **KIRC**, and **COAD**), when projected onto a two-dimensional space, are linearly separable to some extent.
- The **KIRC** class is clearly separated from the other classes.
- The **BRCA** class (the data points in blue) is more spread out (sparser) than the other classes.
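A visual impression of linear separability can be backed by a number. One option, sketched here as an assumption rather than as part of the original analysis, is to fit a linear classifier on the two principal components and report cross-validated accuracy. The example uses synthetic five-class data from `make_blobs`, so the score it prints does not describe the RNA-Seq data.

```python
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 500 samples, 50 features, 5 classes.
toy_X, toy_y = make_blobs(n_samples=500, centers=5, n_features=50, random_state=0)

# Standardize, project onto 2 principal components, then fit a linear classifier.
# Putting all steps in one pipeline keeps scaling and PCA inside each CV fold.
model = make_pipeline(
    StandardScaler(),
    PCA(n_components=2),
    LogisticRegression(max_iter=1000),
)
scores = cross_val_score(model, toy_X, toy_y, cv=5)
print(scores.mean())
```

With the real data, `toy_X` and `toy_y` would be replaced by `data.values` and `labels['Class']`. An accuracy well above the 20% chance level would support the "separable to some extent" reading.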

# <a id='link5'>Conclusion</a>

### As data sets grow ever larger, new techniques are required to analyze them effectively.

### In this case study we can see a visual representation of our data using PCA - Principal Component Analysis.

- The very high-dimensional nature of many data sets makes direct visualization impossible, as humans can only visualize up to three dimensions. The solution is to use dimensionality reduction techniques.
- When reducing the dimensions of data, it’s important not to lose more information than is necessary.
- The variation in a data set can be seen as representing the information that we would like to keep.
- Principal Component Analysis (PCA) is a well-established mathematical technique for reducing the dimensionality of data, while keeping as much variation as possible.
- PCA is not only about dimensionality reduction and saving computation time; it is also a way to avoid multicollinearity and to better understand the behavior of the process that generated the data.
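The point about multicollinearity can be checked directly: principal component scores are uncorrelated with one another, even when the input features are strongly correlated. Here is a minimal sketch on three synthetic features built from one shared signal (all names and values are illustrative).

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
base = rng.normal(size=(300, 1))
# Three strongly correlated synthetic features: the same signal plus small noise.
toy = np.hstack([base + 0.1 * rng.normal(size=(300, 1)) for _ in range(3)])

# Correlation matrix of the original features (off-diagonals close to 1).
corr_before = np.corrcoef(toy, rowvar=False)

# Correlation matrix of the principal component scores (off-diagonals close to 0).
scores = PCA(n_components=3).fit_transform(toy)
corr_after = np.corrcoef(scores, rowvar=False)

off_diagonal = ~np.eye(3, dtype=bool)
print(corr_before[off_diagonal].min())
print(np.abs(corr_after[off_diagonal]).max())
```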

# **Appendix**

- **Pandas** : Pandas is an open source library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.

- **Numpy** : The fundamental package for scientific computing with Python.

- **Matplotlib** : Matplotlib is a comprehensive library for creating static, animated, and interactive visualizations in Python.

- **sklearn.preprocessing.StandardScaler** : Standardize features by removing the mean and scaling to unit variance.

- **[PCA-Principal Component Analysis](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html)**