## Notes on building an automatic cell type classifier for scRNA-seq data
Implementing an SVM and an ANN to identify cell types, trained on previously annotated datasets.

10-13-2020:
- Attempting to import data using ScanPy (https://scanpy.readthedocs.io/en/stable/index.html)
- Currently having issues setting the variable (barcode) names on the AnnData object created by ScanPy
- Also attempting an alternative method of importing the data using SciPy and Pandas

10-15-2020:
- Learned how to import the data with SciPy and Pandas instead of ScanPy.
- Continuing to build the sctype class and attempting to implement an SVM

## Approach #1 to importing data: ScanPy (10-13-2020)

In [18]:
import pandas as pd
import scanpy as sc

### Read in the 10Xv2 data
- Files: matrix.mtx, barcodes.tsv, and genes.tsv
- Set the observation and variable names of the AnnData object to the genes and barcodes, respectively (the matrix is read as genes x barcodes)

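Note: the variable names (barcodes) were not being labeled properly. The most likely cause is the attribute name: AnnData exposes `var_names`, while the original assignment went to `vars_names`, which Python accepts silently as a new, unused attribute. The cell below uses `var_names`; the output shown further down was produced before this correction, so its rows are still numbered rather than labeled with barcodes.

In [120]:
# `path` is the dataset root directory set in cell In [20]
data = sc.read_mtx(filename = path + 'filtered_matrices_mex/hg19/matrix.mtx')
# The matrix is read as stored: genes are the rows (obs), barcodes are the columns (var)
var = pd.read_csv(path + 'filtered_matrices_mex/hg19/barcodes.tsv', sep = '\t', header = None)
obs = pd.read_csv(path + 'filtered_matrices_mex/hg19/genes.tsv', sep = '\t', header = None)
data.var_names = var.iloc[:,0]   # barcodes
data.obs_names = obs.iloc[:,1]   # gene symbols (second column of genes.tsv)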

### Transpose the AnnData object and convert to a pd.DataFrame
Even though expression matrices usually have genes as rows and barcodes as columns, it seems that I need the features (genes) as the columns and each single cell as a row.

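Note: Getting a "names are not unique" warning for the variables. After the transpose, the variables are the genes rather than the barcodes, and gene symbols in genes.tsv can repeat, which is the likely source of the warning.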

In [126]:
data_df = data.T.to_df()
data_df

Variable names are not unique. To make them unique, call `.var_names_make_unique`.


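| | MIR1302-10 | FAM138A | OR4F5 | RP11-34P13.7 | RP11-34P13.8 | AL627309.1 | RP11-34P13.14 | RP11-34P13.9 | AP006222.2 | RP4-669L17.10 | ... | KIR3DL2 | AL590523.1 | CT476828.1 | PNRC2 | SRSF10 | AC145205.1 | BAGE5 | CU459201.1 | AC002321.2 | AC002321.1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 68574 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 68575 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 68576 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 68577 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |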
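
The "not unique" warning in the output above can be reproduced on a toy list of names. The sketch below is an illustration only: `make_names_unique` is a hypothetical helper (not part of ScanPy or AnnData), and the gene symbols are made up. It appends a running counter to repeated names, which is similar in spirit to what `.var_names_make_unique` is meant to achieve.

```python
import pandas as pd

def make_names_unique(names, sep="-"):
    """Suffix repeated names with a running counter; the first occurrence is unchanged."""
    s = pd.Series(list(names), dtype="object")
    # cumcount() numbers each occurrence within its group of identical names: 0, 1, 2, ...
    occurrence = s.groupby(s).cumcount()
    return [name if k == 0 else f"{name}{sep}{k}" for name, k in zip(s, occurrence)]

# Hypothetical gene symbols with one symbol repeated three times
symbols = ["GENE_A", "GENE_B", "GENE_A", "GENE_C", "GENE_A"]
print(pd.Index(symbols).is_unique)

unique_symbols = make_names_unique(symbols)
print(unique_symbols)
```

One caveat of this simple scheme: a generated name such as `GENE_A-1` could collide with a name that already exists in the list, so it is worth checking `pd.Index(unique_symbols).is_unique` afterwards.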


## Approach #2 to importing data: SciPy and Pandas (10-15-2020)

In [127]:
import pandas as pd
import scipy.io as io

### Import 10Xv2 data 

In [128]:
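path = '/Users/leealj/py_projects/biof509_final/zheng68k/filtered_matrices_mex/hg19/'
genes = pd.read_csv(path + 'genes.tsv', sep = '\t', header = None)
barcodes = pd.read_csv(path + 'barcodes.tsv', sep = '\t', header = None)
# mmread returns a sparse matrix with genes as rows and barcodes as columns
expression = io.mmread(path + 'matrix.mtx')
# Label rows with gene symbols (second column of genes.tsv) and columns with barcodes,
# passing a single column for each rather than the whole table
data = pd.DataFrame.sparse.from_spmatrix(data = expression, index = genes.iloc[:,1], columns = barcodes.iloc[:,0])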
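
The import steps above can be tried end to end on a tiny synthetic dataset before running them on the full 68k matrix. The sketch below is a minimal example under stated assumptions: the gene IDs, gene symbols, barcodes, and counts are all made up, and the files are written to a temporary directory. It also transposes the matrix so that each row is a cell and each column is a gene.

```python
import os
import tempfile

import numpy as np
import pandas as pd
import scipy.io as io
import scipy.sparse as sp

# Toy counts: 3 genes (rows) x 2 cells (columns), the same orientation as matrix.mtx above
counts = sp.coo_matrix(np.array([[0, 2], [1, 0], [0, 5]]))
gene_table = pd.DataFrame({0: ["ID_1", "ID_2", "ID_3"], 1: ["GENE_A", "GENE_B", "GENE_C"]})
barcode_table = pd.DataFrame({0: ["AAAC-1", "TTTG-1"]})

with tempfile.TemporaryDirectory() as tmp:
    # Write the three files, then read them back the same way as in the cell above
    io.mmwrite(os.path.join(tmp, "matrix.mtx"), counts)
    gene_table.to_csv(os.path.join(tmp, "genes.tsv"), sep="\t", header=False, index=False)
    barcode_table.to_csv(os.path.join(tmp, "barcodes.tsv"), sep="\t", header=False, index=False)

    genes = pd.read_csv(os.path.join(tmp, "genes.tsv"), sep="\t", header=None)
    barcodes = pd.read_csv(os.path.join(tmp, "barcodes.tsv"), sep="\t", header=None)
    expression = io.mmread(os.path.join(tmp, "matrix.mtx"))

# Transpose the sparse matrix so that rows are cells and columns are genes
cells_by_genes = pd.DataFrame.sparse.from_spmatrix(
    expression.T,
    index=barcodes.iloc[:, 0],   # one label per cell
    columns=genes.iloc[:, 1],    # gene symbols (second column of genes.tsv)
)
print(cells_by_genes.shape)
```

Transposing the sparse matrix before building the DataFrame keeps the data sparse throughout, which matters at the scale of the real dataset.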

### Import the 'correct' annotations made by the original authors

In [20]:
path = '/Users/leealj/py_projects/biof509_final/zheng68k/'
anno = pd.read_csv(path + '68k_pbmc_barcodes_annotation.tsv', sep = '\t')
anno

| | TSNE.1 | TSNE.2 | barcodes | celltype |
|---|---|---|---|---|
| 0 | 7.565540 | 0.441370 | AAACATACACCCAA-1 | CD8+ Cytotoxic T |
| 1 | 2.552626 | -25.786672 | AAACATACCCCTCA-1 | CD8+/CD45RA+ Naive Cytotoxic |
| 2 | -5.771831 | 11.830846 | AAACATACCGGAGA-1 | CD4+/CD45RO+ Memory |
| 3 | 1.762556 | 25.979346 | AAACATACTAACCG-1 | CD19+ B |
| 4 | -16.793856 | -16.589970 | AAACATACTCTTCA-1 | CD4+/CD25 T Reg |
| ... | ... | ... | ... | ... |
| 68574 | 1.430476 | -23.815174 | TTTGCATGAGCCTA-8 | CD8+ Cytotoxic T |
| 68575 | 3.120762 | -19.108131 | TTTGCATGCTAGCA-8 | CD8+/CD45RA+ Naive Cytotoxic |
| 68576 | 13.526124 | -1.559099 | TTTGCATGCTGCAA-8 | CD8+ Cytotoxic T |
| 68577 | 11.646083 | -3.386890 | TTTGCATGGCTCCT-8 | CD8+ Cytotoxic T |
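
Before training a classifier, the annotation labels need to line up with the cells in the expression matrix. The sketch below shows one way to do that by barcode, using the `barcodes` and `celltype` column names from the table above. The barcodes and the row order are made up for illustration, and `matrix_barcodes` and `anno_toy` are hypothetical stand-ins for the real objects.

```python
import pandas as pd

# Toy stand-ins: barcodes in matrix order, and an annotation table in a different order
matrix_barcodes = pd.Index(["AAAC-1", "CCCG-1", "TTTG-1"])
anno_toy = pd.DataFrame({
    "barcodes": ["TTTG-1", "AAAC-1", "CCCG-1"],
    "celltype": ["CD19+ B", "CD8+ Cytotoxic T", "CD4+/CD45RO+ Memory"],
})

# Index the labels by barcode, then reorder them to match the matrix;
# a barcode missing from the annotation table would come back as NaN
labels = anno_toy.set_index("barcodes")["celltype"].reindex(matrix_barcodes)

# Fail early if any cell in the matrix has no annotation
assert labels.notna().all(), "some barcodes have no annotation"
print(labels.tolist())
```

Reindexing by barcode avoids relying on the two files happening to list the cells in the same order.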


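In [None]:
import numpy as np
# Maximum over the whole DataFrame; this likely converts the sparse data to a dense array first
np.nanmax(data)

In [None]:
# Maximum taken directly from the stored (non-zero) values of the sparse matrix
np.max(expression.data)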
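
The two cells above look for the largest count in different ways. As a small sketch on a made-up matrix (the name `toy` and its values are assumptions for illustration), the maximum can be taken from the stored values or with the sparse matrix's own `max()` method, neither of which builds a dense array.

```python
import numpy as np
import scipy.sparse as sp

# Toy genes x cells count matrix in COO format (the format returned by scipy.io.mmread)
toy = sp.coo_matrix(np.array([[0, 3, 0], [7, 0, 0]]))

# Maximum over the stored (non-zero) entries only
max_stored = toy.data.max()

# Maximum over the whole matrix, implicit zeros included, without densifying
max_all = toy.max()

# Fraction of entries that are non-zero, a quick check on how sparse the data is
density = toy.nnz / (toy.shape[0] * toy.shape[1])
print(max_stored, max_all, density)
```

For count data the two maxima agree, since counts are never negative; they would differ only if every stored value were below zero.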