### <center>A machine learning approach to classify cancer genes and cancer-associated immune genes in next-generation RNA-Seq data</center>

#### Authors
* Lawrence Thanakumar Rajappa (**IDA, Linköping University, Linköping**)
* Krzysztof Bartoszek (**STIMA, Linköping University, Linköping**)
* Jenny Persson (**Molecular Biology, Umeå University, Umeå**)

#### Abstract
<p align="justify">
In this study, we aim to predict the invasiveness of cancer subtypes using supervised and unsupervised machine learning. Based on next-generation RNA sequencing data from cancer cells and cancer-associated immune cells, we distinguish cancerous cells from immune cells in samples taken from cancer patients. We then compare the classification/prediction performance of machine learning models such as SVM, decision tree, random forest, and hierarchical clustering, and choose the best-performing model.

Advances in genomic sequencing technology have enabled researchers to run a wide range of experiments in search of cures for genetic diseases such as cancer. The advent of artificial intelligence, especially in the healthcare sector, has paved the way for applications such as drug discovery and cost-effective patient treatment, which help pharmaceutical organizations, hospitals, and researchers carry out their experiments at a faster pace. Further, next-generation RNA sequencing data has helped to relate gene expression profiles to the developmental stages of a cell or a disease efficiently, cost-effectively, and with higher accuracy. Although the literature describes many approaches to curing cancer, its fatality rate has not yet been reduced. A machine learning approach helps us understand the complexities and processes in cancer samples, which can support the development of effective cancer treatments and drugs in a reliable, fast, and efficient way.</p>


### Loading required libraries

In [1]:
%matplotlib inline
# Data loading and manipulation libraries
import pandas as pd
import numpy as np

# Visualization libraries
import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt
import graphviz 

# ML algorithms libraries
from sklearn import tree # Decision tree
from sklearn.ensemble import RandomForestClassifier # Random Forest
from sklearn import svm # SVM

### Loading the data

In [2]:
# Read the labelled gene list; index_col=False keeps the first CSV column as a regular column
gene_data = pd.read_csv("12_Sample_Gene_List_Updated_with_labels.csv", index_col=False)

In [3]:
# Drop the leftover row-number column that was saved with the CSV
gene_data = gene_data.drop(columns=["Unnamed: 0"])
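
The `Unnamed: 0` column dropped above typically appears when a CSV was written together with its row index. As a minimal sketch, assuming that is the case here, the same result can be obtained at read time with `index_col=0`. The two-row CSV below is a hypothetical stand-in for the real file, not the actual data.

```python
import io

import pandas as pd

# Hypothetical two-row CSV mimicking a file saved together with its row index
csv_text = ",Genes,Type\n0,ENSG00000280303,Cancer\n1,ENSG00000233746,Immune\n"

# Default parsing keeps the unnamed first column as a regular column
with_extra = pd.read_csv(io.StringIO(csv_text))

# index_col=0 uses that first column as the row index instead
as_index = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(with_extra.columns))
print(list(as_index.columns))
```

Either route leaves the same data columns; reading with `index_col=0` simply avoids the separate drop step.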

In [4]:
# Shuffle the rows and preview six of them
gene_data.sample(frac=1).head(6)

| | Genes | hgnc_symbol | baseMean | log2FoldChange | lfcSE | stat | pvalue | padj | DR_UR | Type |
|---|---|---|---|---|---|---|---|---|---|---|
| 347 | ENSG00000280303 | ERICD | 73.842639 | 9.181362 | 0.861248 | 10.660533 | 0.0 | 0.0 | UR | Cancer |
| 243 | ENSG00000011143 | MKS1 | 20.237539 | 7.184085 | 0.933233 | 7.698061 | 1.38e-14 | 1.112e-13 | UR | Cancer |
| 188 | ENSG00000233746 | LINC00656 | 11.255121 | -5.889553 | 0.965961 | -6.097093 | 1.080144e-09 | 5.78176e-09 | DR | Immune |
| 394 | ENSG00000075089 | ACTR6 | 5.120283 | 4.832235 | 0.988273 | 4.889575 | 1.010541e-06 | 4.001483e-06 | UR | Cancer |
| 295 | ENSG00000178343 | SHISA3 | 59.985477 | 8.3944 | 0.904831 | 9.277311 | 0.0 | 0.0 | UR | Cancer |
| 203 | ENSG00000268941 | LINC01711 | 10.379298 | -5.868086 | 1.074047 | -5.46353 | 4.667576e-08 | 2.139446e-07 | DR | Immune |


### Preprocessing of data

##### Checking for missing data

In [5]:
gene_data.isna().sum()

Genes               0
hgnc_symbol       124
baseMean            0
log2FoldChange      0
lfcSE               0
stat                0
pvalue              0
padj                0
DR_UR               0
Type                0
dtype: int64
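
Raw counts are easier to judge next to the share of rows they represent. The following is a minimal sketch on a small hypothetical frame (`toy`, built here only for illustration and not taken from the full dataset) that puts both numbers side by side using `isna().sum()` and `isna().mean()`.

```python
import numpy as np
import pandas as pd

# Hypothetical four-row frame standing in for gene_data
toy = pd.DataFrame({
    "Genes": ["ENSG00000280303", "ENSG00000279780", "ENSG00000233746", "ENSG00000230852"],
    "hgnc_symbol": ["ERICD", np.nan, "LINC00656", np.nan],
    "Type": ["Cancer", "Immune", "Immune", "Cancer"],
})

# isna() gives a boolean frame: sum() counts missing cells, mean() gives their fraction
summary = pd.DataFrame({
    "n_missing": toy.isna().sum(),
    "share_missing": toy.isna().mean(),
})

print(summary)
```

Applied to `gene_data`, the same two calls would show what fraction of genes lack an HGNC symbol, which helps decide whether those rows can be kept.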

In [9]:
# Keep only the rows whose HGNC symbol is missing
gene_data_with_NA = gene_data[gene_data["hgnc_symbol"].isna()]

In [11]:
# Shuffle the rows with a missing symbol and preview five of them
gene_data_with_NA.sample(frac=1).head()

| | Genes | hgnc_symbol | baseMean | log2FoldChange | lfcSE | stat | pvalue | padj | DR_UR | Type |
|---|---|---|---|---|---|---|---|---|---|---|
| 70 | ENSG00000279780 | NaN | 3.175288 | -4.585986 | 1.157537 | -3.961848 | 7.437183e-05 | 0.0002304736 | DR | Immune |
| 317 | ENSG00000230852 | NaN | 65.901123 | 9.032002 | 0.859868 | 10.503938 | 0.0 | 0.0 | UR | Cancer |
| 382 | ENSG00000257341 | NaN | 2.396192 | -3.589426 | 1.142654 | -3.141306 | 0.001681963 | 0.004222539 | DR | Cancer |
| 336 | ENSG00000250073 | NaN | 16.874108 | 6.984561 | 0.915175 | 7.631938 | 2.31e-14 | 1.839e-13 | UR | Cancer |
| 499 | ENSG00000259076 | NaN | 5.453164 | 5.363207 | 1.000003 | 5.363193 | 8.176337e-08 | 3.653483e-07 | UR | Cancer |
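
Every row still carries an Ensembl ID in `Genes`, so a missing `hgnc_symbol` does not have to mean discarding the gene. As a sketch of one possible treatment (an assumption, not necessarily the step taken in this notebook), the code below counts missing symbols per class and builds a fallback label. The `toy` frame and the `gene_label` column are hypothetical names introduced for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical four-row frame standing in for gene_data
toy = pd.DataFrame({
    "Genes": ["ENSG00000280303", "ENSG00000279780", "ENSG00000257341", "ENSG00000233746"],
    "hgnc_symbol": ["ERICD", np.nan, np.nan, "LINC00656"],
    "Type": ["Cancer", "Immune", "Cancer", "Immune"],
})

# Number of rows with a missing symbol in each class
missing_by_type = toy.loc[toy["hgnc_symbol"].isna(), "Type"].value_counts()

# Fallback label: use the HGNC symbol where present, otherwise the Ensembl ID
toy["gene_label"] = toy["hgnc_symbol"].fillna(toy["Genes"])

print(missing_by_type)
print(toy[["Genes", "hgnc_symbol", "gene_label"]])
```

Because `fillna` aligns the two columns on the row index, each missing symbol is replaced by the Ensembl ID from the same row.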
