# Neural Network Classifier 

We are going to build a PyTorch neural network classifier using scRNAseq data. 

Helpful links:

- Interfacing PyTorch models with anndata: https://anndata.readthedocs.io/en/latest/tutorials/notebooks/annloader.html
- Possibly helpful for speeding up PyTorch modeling: https://sebastianraschka.com/blog/2023/pytorch-faster.html

This program will:

- pull in scRNAseq data from cell_census
- wrangle the data for modeling
- apply the NN

At first, we're just going to pick some data from cell_census. Once the basics work, we will modify the query to select the exact data we want.


In [1]:
import cell_census
import anndata as ad

import numpy as np

import torch
from torch import nn
import torch.nn.functional as F

from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

In [2]:
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print(device)

cpu


## Read in the data

We are somewhat randomly selecting the data now, but will come back and update this later. Let's just get something working.

In [3]:
census = cell_census.open_soma(census_version="latest")


In [4]:
adata = cell_census.get_anndata(
        census=census,
        organism = "Homo sapiens",
        obs_value_filter = 'tissue_ontology_term_id == "UBERON:0002299" and assay == "10x 3\' v3"',
        column_names={"obs": ["sex"]},
        )

display(adata)

AnnData object with n_obs × n_vars = 57747 × 60664
    obs: 'sex', 'tissue_ontology_term_id', 'assay'
    var: 'soma_joinid', 'feature_id', 'feature_name', 'feature_length'

In [5]:
# the data is stored in adata.X
adata.X

<57747x60664 sparse matrix of type '<class 'numpy.float32'>'
	with 136730898 stored elements in Compressed Sparse Row format>

Only 3.9% of the entries have values.
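
The 3.9% figure is the number of stored elements divided by the total number of entries: 136,730,898 / (57,747 × 60,664) ≈ 0.039. A minimal sketch of that calculation follows. The `density` helper and the toy matrix are made up for illustration; on the real data you would pass `adata.X`.

```python
import numpy as np
from scipy import sparse

def density(matrix):
    """Fraction of entries that are explicitly stored in a SciPy sparse matrix."""
    n_rows, n_cols = matrix.shape
    return matrix.nnz / (n_rows * n_cols)

# Synthetic stand-in for adata.X: 4 cells x 5 genes with 3 non-zero counts.
toy = sparse.csr_matrix(np.array([
    [0, 2, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [1, 0, 0, 0, 3],
    [0, 0, 0, 0, 0],
], dtype=np.float32))

print(f"{density(toy):.1%}")  # 3 stored / 20 total = 15.0%
```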

In [6]:
# find the number of features (genes)

adata.X.shape[1]

# returns 2.0
#adata.X[0,40]

60664

## Encode and Split the Data to prepare for modeling

**Outstanding question: Do we need to set requires_grad = True at some point??**
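
On the question above: as far as we can tell, parameters created by `nn.Linear` already have `requires_grad=True` by default, so nothing needs to be set by hand for ordinary training. Input tensors only need it if you want gradients with respect to the inputs themselves. A small check on a toy layer (the sizes are arbitrary and not taken from this notebook):

```python
import torch
from torch import nn

layer = nn.Linear(4, 2)   # toy layer; sizes are arbitrary
x = torch.randn(3, 4)     # input data: requires_grad is False by default

params_need_grad = all(p.requires_grad for p in layer.parameters())
print(params_need_grad, x.requires_grad)

loss = layer(x).sum()
loss.backward()

# Gradients were populated for the weights, but not for the input.
print(layer.weight.grad is not None, x.grad is None)
```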

In [7]:
# select the labels. 
labels = adata.obs

In [8]:
labels.head()

|   | sex | tissue_ontology_term_id | assay |
|---|-----|-------------------------|-------|
| 0 | male | UBERON:0002299 | 10x 3' v3 |
| 1 | male | UBERON:0002299 | 10x 3' v3 |
| 2 | male | UBERON:0002299 | 10x 3' v3 |
| 3 | male | UBERON:0002299 | 10x 3' v3 |
| 4 | male | UBERON:0002299 | 10x 3' v3 |


In [9]:
lb = LabelEncoder()
labels['encoded_labels'] = lb.fit_transform(labels['sex'])
labels.head(5)

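|   | sex | tissue_ontology_term_id | assay | encoded_labels |
|---|-----|-------------------------|-------|----------------|
| 0 | male | UBERON:0002299 | 10x 3' v3 | 1 |
| 1 | male | UBERON:0002299 | 10x 3' v3 | 1 |
| 2 | male | UBERON:0002299 | 10x 3' v3 | 1 |
| 3 | male | UBERON:0002299 | 10x 3' v3 | 1 |
| 4 | male | UBERON:0002299 | 10x 3' v3 | 1 |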


In [10]:
# find the number of unique labels
labels['encoded_labels'].nunique()

2
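
`LabelEncoder` assigns integer codes in sorted order of the class labels, which is why `male` shows up as 1 above (and `female` would be 0). The sketch below shows the mapping and how to reverse it. The toy Series is invented for illustration and stands in for `labels['sex']`.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy stand-in for adata.obs['sex']; values are invented for illustration.
sex = pd.Series(["male", "female", "male", "female", "female"])

encoder = LabelEncoder()
codes = encoder.fit_transform(sex)

print(list(encoder.classes_))                        # classes in sorted order
print(codes.tolist())                                # integer code per cell
decoded = encoder.inverse_transform([0, 1]).tolist() # codes back to strings
print(decoded)
```

Keeping the fitted encoder around (`lb` in this notebook) makes it possible to turn predicted class indices back into readable labels later.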

In [11]:
y_data = labels['encoded_labels']

In [12]:
# keep the matrix sparse: a dense copy would need 57747 x 60664 float32
# values (about 14 GB)

#x_data = adata.X.todense()
x_data = adata.X.copy()

In [13]:
y_data.shape

(57747,)

In [14]:
x_data

<57747x60664 sparse matrix of type '<class 'numpy.float32'>'
	with 136730898 stored elements in Compressed Sparse Row format>

In [15]:
# create training and testing sets here

# using a very small train size to speed up development for now

# note that x_data is still a sparse matrix here; train_test_split works
# directly with sparse matrices (woohoo)

X_train, X_test, y_train, y_test = train_test_split(x_data,y_data,
                                                   train_size = 0.1)
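
The split above is random on every run and does not control the class balance. One possible refinement, sketched here on synthetic data, is to pass `stratify` and `random_state` to `train_test_split`. Both are standard scikit-learn parameters but are not used in the original cell, and the matrix and labels below are made up.

```python
import numpy as np
from scipy import sparse
from sklearn.model_selection import train_test_split

# Synthetic stand-ins: 100 cells x 20 genes with a 70/30 class imbalance.
X = sparse.random(100, 20, density=0.1, format="csr",
                  random_state=0, dtype=np.float32)
y = np.array([0] * 70 + [1] * 30)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y,
    train_size=0.1,
    stratify=y,        # keep the 70/30 class ratio in both splits
    random_state=42,   # make the split reproducible
)

# The splits are still sparse; only the row selection changed.
print(sparse.issparse(X_tr), X_tr.shape, X_te.shape)
print(np.bincount(y_tr), np.bincount(y_te))
```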


In [16]:
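# convert the data to tensors
# we'll change the data from CSR (compressed sparse row) format
# to COO (coordinate) format for better use with pytorch
# see https://pytorch.org/docs/stable/sparse.html for some details/thoughts
#
# NOTE: no size is passed to the constructor, so the tensor shape is inferred
# from the largest row/column index that is actually stored. Trailing all-zero
# gene columns are dropped, and X_train and X_test can end up with different
# widths. Passing an explicit size, for example
# torch.sparse_coo_tensor(indices, values, size=X_train_coo.shape), avoids this.
# NOTE: torch.sparse.LongTensor is a legacy constructor that stores the counts
# as int64, which is why .float() is needed before the forward pass.

X_train_coo = X_train.tocoo()

X_train = torch.sparse.LongTensor(torch.LongTensor([X_train_coo.row.tolist(),X_train_coo.col.tolist()]),
                                 torch.LongTensor(X_train_coo.data.astype(np.float32)))

# y_train is a pandas Series; convert via NumPy so the shuffled index is ignored
y_train = torch.tensor(y_train.to_numpy(), dtype=torch.long)

#X_train = torch.tensor.to_sparse_csr(X_train,layout=torch.sparse_csr)


# and the same for the test set
X_test_coo = X_test.tocoo()

X_test = torch.sparse.LongTensor(torch.LongTensor([X_test_coo.row.tolist(),X_test_coo.col.tolist()]),
                                 torch.LongTensor(X_test_coo.data.astype(np.float32)))

# y_test is a pandas Series too; convert it the same way
y_test = torch.tensor(y_test.to_numpy(), dtype=torch.long)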



In [17]:
X_train

tensor(indices=tensor([[    0,     0,     0,  ...,  5773,  5773,  5773],
                       [   17,    24,    32,  ..., 41284, 46738, 52812]]),
       values=tensor([1, 1, 1,  ..., 1, 1, 2]),
       size=(5774, 56710), nnz=13561218, layout=torch.sparse_coo)

In [18]:
y_train

tensor([0, 0, 0,  ..., 0, 0, 1])
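
Note that the tensor above reports `size=(5774, 56710)` even though the AnnData object has 60664 genes: when no size is given, the shape is inferred from the largest index that is stored. A minimal sketch of a conversion that keeps the original shape and produces float32 values directly is shown below. The helper name `scipy_to_torch_coo` is hypothetical, and the toy matrix is made up.

```python
import numpy as np
import torch
from scipy import sparse

def scipy_to_torch_coo(matrix):
    """Convert a SciPy sparse matrix to a float32 torch sparse COO tensor."""
    coo = matrix.tocoo()
    indices = torch.from_numpy(np.vstack((coo.row, coo.col)).astype(np.int64))
    values = torch.from_numpy(coo.data.astype(np.float32))
    # Passing size explicitly keeps trailing all-zero rows and columns.
    return torch.sparse_coo_tensor(indices, values, size=coo.shape)

# Toy matrix whose last column is entirely zero.
toy = sparse.csr_matrix(np.array([[0, 2, 0], [1, 0, 0]], dtype=np.float32))
tensor = scipy_to_torch_coo(toy)

print(tensor.shape, tensor.dtype)
```

With a fixed shape, the training and test tensors are guaranteed to have the same number of columns, which the first `Linear` layer depends on.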

## Build Neural Network Classifier

Based on https://medium.com/analytics-vidhya/a-simple-neural-network-classifier-using-pytorch-from-scratch-7ebb477422d2

In [19]:
# sanity check: number of distinct classes in the training labels
n_classes = torch.unique(y_train).size(dim=0)

n_classes

2

In [20]:
# number of features (len of X cols)
# select the number of gene columns
input_dim = X_train.size(dim=1) #adata.X.shape[1] 

# number of units in the single hidden layer
# (despite the name, this is a layer width, not a count of layers)
hidden_layers = 2

# number of classes (unique of y)
output_dim = torch.unique(y_train).size(dim=0) #labels['encoded_labels'].nunique()

In [21]:
print(input_dim,hidden_layers,output_dim)

56710 2 2


In [22]:
class Network(torch.nn.Module):
    def __init__(self):
        super(Network, self).__init__()
        self.linear1 = torch.nn.Linear(input_dim,hidden_layers)
        self.linear2 = torch.nn.Linear(hidden_layers,output_dim)
        
    def forward(self,x):
        x = torch.sigmoid(self.linear1(x))
        x = self.linear2(x)
        return x
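
The class above reads `input_dim`, `hidden_layers`, and `output_dim` from notebook globals. One possible variation, sketched here, passes the sizes to the constructor so the model can be rebuilt with different dimensions. The class name `SmallClassifier` and the sizes used are hypothetical.

```python
import torch
from torch import nn

class SmallClassifier(nn.Module):
    """One hidden layer; sizes are passed in rather than read from globals."""

    def __init__(self, n_features, n_hidden, n_classes):
        super().__init__()
        self.linear1 = nn.Linear(n_features, n_hidden)
        self.linear2 = nn.Linear(n_hidden, n_classes)

    def forward(self, x):
        x = torch.sigmoid(self.linear1(x))
        return self.linear2(x)  # raw logits, one column per class

# Toy sizes for illustration; the notebook would use its own dimensions.
model = SmallClassifier(n_features=10, n_hidden=2, n_classes=2)
logits = model(torch.randn(5, 10))
print(logits.shape)
```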

In [23]:
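# NOTE: F.nll_loss expects log-probabilities, but Network.forward returns raw
# logits, so the value printed below is not a true negative log-likelihood.
# F.cross_entropy is the matching loss for logits.
loss_func = F.nll_loss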

In [24]:
clf = Network()

In [25]:
X_train.dtype

torch.int64

In [26]:
y_train.dtype

torch.int64

In [27]:
clf(X_train.float())

tensor([[-0.4685, -0.1877],
        [-0.4113, -0.1948],
        [-0.4707, -0.2702],
        ...,
        [-0.4796, -0.0594],
        [-0.5006, -0.0846],
        [-0.4276, -0.2486]], grad_fn=<AddmmBackward0>)

In [28]:
print(loss_func(clf(X_train.float()), y_train))

tensor(0.4187, grad_fn=<NllLossBackward0>)
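
The relationship between the two losses can be checked directly: `F.cross_entropy` on logits equals `F.nll_loss` applied to `F.log_softmax` of those logits, whereas `F.nll_loss` on raw logits simply averages the negated logit of the target class. The logits and targets below are made up for illustration.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(6, 2)                  # made-up raw network outputs
targets = torch.tensor([0, 1, 1, 0, 1, 0])

ce = F.cross_entropy(logits, targets)
nll_of_log_softmax = F.nll_loss(F.log_softmax(logits, dim=1), targets)
nll_of_raw_logits = F.nll_loss(logits, targets)  # same pattern as the cell above

# The first two agree; the third is generally a different number.
print(torch.allclose(ce, nll_of_log_softmax))
print(ce.item(), nll_of_raw_logits.item())
```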


In [29]:
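# clf.parameters is a method, so this prints the bound method (its repr
# includes the module summary); print(clf) shows the layers directly
print(clf.parameters)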

<bound method Module.parameters of Network(
  (linear1): Linear(in_features=56710, out_features=2, bias=True)
  (linear2): Linear(in_features=2, out_features=2, bias=True)
)>


In [30]:
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(clf.parameters(), lr=0.1)

In [31]:
epochs = 4
for epoch in range(epochs):
  running_loss = 0.0
  #for i, data in enumerate(trainloader, 0):
  #inputs, labels = data
  # set optimizer to zero grad to remove previous epoch gradients
  optimizer.zero_grad()
  # forward propagation
  outputs = clf(X_train.float())
  loss = criterion(outputs, y_train)
  # backward propagation
  loss.backward()
  # optimize
  optimizer.step()
  running_loss += loss.item()
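  # display statistics
  # NOTE: the division by 2000 is left over from the mini-batch tutorial this
  # loop is based on; with one full-batch step per epoch, the actual loss is
  # 2000 times the printed value
  print(f'[{epoch + 1}] loss: {running_loss / 2000:.5f}')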


[1] loss: 0.00040
[2] loss: 0.00034
[3] loss: 0.00033
[4] loss: 0.00033
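
The commented-out `trainloader` lines in the loop above hint at mini-batch training. A minimal sketch of one way to do that without a `DataLoader` is shown below: rows of the SciPy sparse matrix are sliced per batch and only that batch is converted to a dense tensor. All data, sizes, and names here are synthetic assumptions, and the reported value is the mean loss per batch.

```python
import numpy as np
import torch
from torch import nn
from scipy import sparse

torch.manual_seed(0)
rng = np.random.default_rng(0)

# Synthetic stand-ins for the training split: 64 cells x 30 genes.
X = sparse.random(64, 30, density=0.1, format="csr",
                  random_state=0, dtype=np.float32)
y = torch.from_numpy(rng.integers(0, 2, size=64))

model = nn.Sequential(nn.Linear(30, 2), nn.Sigmoid(), nn.Linear(2, 2))
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

batch_size = 16
epoch_losses = []
for epoch in range(3):
    order = rng.permutation(X.shape[0])       # reshuffle cells every epoch
    running_loss, n_batches = 0.0, 0
    for start in range(0, len(order), batch_size):
        rows = order[start:start + batch_size]
        # Only the current batch is densified, so memory use stays small.
        xb = torch.from_numpy(X[rows].toarray())
        yb = y[torch.from_numpy(rows)]
        optimizer.zero_grad()
        loss = criterion(model(xb), yb)
        loss.backward()
        optimizer.step()
        running_loss += loss.item()
        n_batches += 1
    epoch_losses.append(running_loss / n_batches)  # mean loss per batch
    print(f"[{epoch + 1}] loss: {epoch_losses[-1]:.5f}")
```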


### Test the model

Most of this work is based on https://medium.com/analytics-vidhya/a-simple-neural-network-classifier-using-pytorch-from-scratch-7ebb477422d2

In [32]:
outputs_test = clf(X_test.float())
__, predicted = torch.max(outputs_test,1)
print(predicted)

tensor([0, 0, 0,  ..., 0, 0, 0])


In [33]:
correct, total = 0, 0

with torch.no_grad():
    # calculate output by running through the network
    outputs = clf(X_test.float())
    # get the predictions (index of the largest logit per row)
    _, predicted = torch.max(outputs, 1)
    # update results
    total += y_test.size(0)
    correct += (predicted == y_test).sum().item()
    
print(f'Accuracy of the network on the test data: {100 * correct // total} %')

Accuracy of the network on the test data: 71 %
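
Because the predictions printed earlier are all 0, it may be worth comparing the 71% against the accuracy of always predicting the most common class, and looking at recall per class. The sketch below uses made-up labels and predictions, not results from this notebook.

```python
import torch

# Made-up labels and predictions for illustration only.
y_true = torch.tensor([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])
y_pred = torch.zeros_like(y_true)   # a model that always predicts class 0

accuracy = (y_pred == y_true).float().mean().item()

# Accuracy of always guessing the most common class.
class_counts = torch.bincount(y_true)
baseline = (class_counts.max() / class_counts.sum()).item()

# Recall per class: fraction of each true class that was predicted correctly.
recall = [
    ((y_pred == c) & (y_true == c)).sum().item() / (y_true == c).sum().item()
    for c in range(len(class_counts))
]

print(accuracy, baseline, recall)
```

If accuracy and the majority-class baseline coincide, as they do in this toy case, the accuracy figure alone says little about whether the model has learned anything.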
