<a href="https://colab.research.google.com/github/jialun1221/scRNA-seq/blob/main/Preprocessing1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Preprocessing and clustering PD astrocytes
### Part 1. Data selection

In May 2017, this started out as a demonstration that Scanpy would allow reproducing most of Seurat's [guided clustering tutorial](http://satijalab.org/seurat/pbmc3k_tutorial.html) ([Satija et al., 2015](https://doi.org/10.1038/nbt.3192)).

We gratefully acknowledge Seurat's authors for the tutorial! In the meantime, we have added and removed a few pieces.

The data consist of *3k PBMCs from a Healthy Donor* and are freely available from 10x Genomics ([here](http://cf.10xgenomics.com/samples/cell-exp/1.1.0/pbmc3k/pbmc3k_filtered_gene_bc_matrices.tar.gz) from this [webpage](https://support.10xgenomics.com/single-cell-gene-expression/datasets/1.1.0/pbmc3k)). On a unix system, you can uncomment and run the following to download and unpack the data. The last line creates a directory for writing processed data.

In this notebook, we perform ***data selection***. We drop the cells labelled Lewy body dementia and create a new AnnData object that contains only PD and control cells. All other features of the original AnnData object are retained.

In [None]:
#import packages 
!pip install scanpy
import numpy as np
import pandas as pd
import scanpy as sc

In [None]:
#make directories for file storage (-p avoids an error if they already exist)
!mkdir -p data
!mkdir -p write


In [None]:
sc.settings.verbosity = 3             # verbosity: errors (0), warnings (1), info (2), hints (3)
sc.logging.print_header()
sc.settings.set_figure_params(dpi=80, facecolor='white')

scanpy==1.9.1 anndata==0.8.0 umap==0.5.3 numpy==1.21.6 scipy==1.7.3 pandas==1.3.5 scikit-learn==1.0.2 statsmodels==0.12.2 python-igraph==0.9.11 pynndescent==0.5.7


In [None]:
# results_file = 'write/pd_astro.h5ad'  # the file that will store the analysis results

In [None]:
# pca_file = 'write/pca.h5ad'

In [None]:
#file that will store the new AnnData object
new_anndata = 'write/new_anndata.h5ad'

Read the count matrix into an [AnnData](https://anndata.readthedocs.io/en/latest/anndata.AnnData.html) object, which holds many slots for annotations and different representations of the data. It also comes with its own HDF5-based file format: `.h5ad`.

In [None]:
!pip install --quiet scvi-tools[tutorials]
import scvi

INFO:pytorch_lightning.utilities.seed:Global seed set to 0


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
adata = scvi.data.read_h5ad("drive/MyDrive/PD_astro.h5ad") 

In [None]:
adata

AnnData object with n_obs × n_vars = 33506 × 41625
    obs: 'orig.ident', 'nCount_RNA', 'nFeature_RNA', 'Cell_Subtype', 'Cell_Type', 'disease__ontology_label', 'organ__ontology_label'
    var: 'features'



**Note**
    
Start with some basic checking.

In [None]:
adata.var_names #this gives genes!

In [None]:
adata.obs_names #these are the cell (observation) labels

In [None]:
#Check which rows are unwanted (Lewy body dementia cells).
adata.obs.loc[adata.obs['disease__ontology_label'].str.contains("Lewy body dementia", case=False)]
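To see the size of each group before dropping anything, a per-label count can be quicker to read than the matching rows themselves. The sketch below runs on a small made-up `obs` table: only the column name `disease__ontology_label` and the label `Lewy body dementia` come from this notebook, while the other labels and the barcodes are invented for illustration.

```python
import pandas as pd

# Toy stand-in for adata.obs; labels other than "Lewy body dementia" are assumed.
obs = pd.DataFrame(
    {
        "disease__ontology_label": [
            "Parkinson disease", "normal", "Lewy body dementia",
            "normal", "Lewy body dementia", "Parkinson disease",
        ]
    },
    index=[f"cell_{i}" for i in range(6)],
)

# Number of cells per disease label.
label_counts = obs["disease__ontology_label"].value_counts()

# Case-insensitive match, as in the cell above; na=False treats missing labels as non-matches.
is_lbd = obs["disease__ontology_label"].str.contains(
    "Lewy body dementia", case=False, na=False
)
n_lbd = int(is_lbd.sum())

print(label_counts)
print(f"{n_lbd} of {len(obs)} cells are labelled Lewy body dementia")
```

On the real data, replacing the toy table with `adata.obs` should give the number of rows that the selection step is expected to remove.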

### Data selection

In [None]:
!pip install matplotlib==3.1.3
from numpy import inf

Drop the Lewy body dementia:

In [None]:
adata.obs = adata.obs.reset_index() #Set index for the labels
k = adata.obs #create a variable for further uses (a DataFrame)
# print(k)

In [None]:
y = k.index[k['disease__ontology_label'] == 'Lewy body dementia'].tolist() #get the index that contains the Lewy Body Dementia samples, stored in variable y (a list)
# print(y)

In [None]:
m = adata.X.toarray() #convert sparse matrix X to array

Apply the selection separately to `adata.X` and `adata.obs`.

In [None]:
m = np.delete(m, obj = y, axis=0) #delete rows that contain Lewy Body Dementia according to the previously generated index stored in y
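Converting `adata.X` to a dense array can be memory-hungry: a 33506 × 41625 matrix holds about 1.4 billion values, which is roughly 5.6 GB at 4 bytes per value. One possible alternative, sketched below on toy data, is to select the rows to keep directly on the SciPy sparse matrix so that no dense copy is needed. The variable names are hypothetical, and the sketch assumes `adata.X` is a CSR matrix.

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Toy data: 5 cells x 3 genes, stored sparsely like adata.X (assumed CSR).
X = sparse.csr_matrix(np.arange(15, dtype=np.float32).reshape(5, 3))
obs = pd.DataFrame(
    {
        "disease__ontology_label": [
            "normal", "Lewy body dementia", "Parkinson disease",
            "Lewy body dementia", "normal",
        ]
    }
)

# Boolean mask of the rows to keep, then their integer positions.
keep = (obs["disease__ontology_label"] != "Lewy body dementia").to_numpy()
keep_positions = np.flatnonzero(keep)

X_kept = X[keep_positions]                       # row selection; result stays sparse
obs_kept = obs.iloc[keep_positions].reset_index(drop=True)

# For comparison: the dense np.delete route used in this notebook.
drop_positions = np.flatnonzero(~keep)
dense_route = np.delete(X.toarray(), drop_positions, axis=0)

print(X_kept.shape, dense_route.shape)
```

Both routes give the same rows. The sparse one only avoids the intermediate dense array.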

In [None]:
#drop command for adata.obs
adata.obs.drop(adata.obs.index[adata.obs['disease__ontology_label'] == 'Lewy body dementia'], inplace=True)
adata.obs

## Making a new AnnData object

In [None]:
pip install anndata

In [None]:
#Build a new AnnData object, passing a copy of each slot of the original object.
new = sc.AnnData(X = m,
  obs = adata.obs.copy(),
  var = adata.var.copy(),
  uns = adata.uns.copy(),
  obsm = adata.obsm.copy(),
  varm = adata.varm.copy(),
  layers = adata.layers.copy(),
  raw = adata.raw.copy(),
  dtype = "float32",
  shape = None,
  #filename = NULL,
  #filemode = NULL,
  obsp = adata.obsp.copy(),
  varp = adata.varp
  )
#Note: varp = adata.varp.copy() raised an error here, whereas direct assignment did not.



In [None]:
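#Workaround found necessary for the object to work: rename the '_index' column of raw.var to 'features' (anndata reserves the column name '_index' when writing .h5ad files).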
new.__dict__['_raw'].__dict__['_var'] = adata.__dict__['_raw'].__dict__['_var'].rename(columns={'_index': 'features'})

In [None]:
new.write(new_anndata)

In [None]:
print(adata.X.shape, new.X.shape) #Now the new AnnData object is generated. Check the dimension!

(33506, 41625) (26535, 41625)
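The shapes above imply that 33506 − 26535 = 6971 rows were removed. A small consistency check can confirm that this number matches the count of dropped labels and that the matrix and the annotation table still line up. The helper `check_selection` below is hypothetical and runs on toy data.

```python
import numpy as np
import pandas as pd

def check_selection(obs_before, obs_after, X_after, column, dropped_label):
    """Return True if the filtered table and matrix are mutually consistent."""
    n_dropped = int((obs_before[column] == dropped_label).sum())
    rows_match = len(obs_after) == len(obs_before) - n_dropped   # expected rows removed
    aligned = X_after.shape[0] == len(obs_after)                 # matrix matches table
    label_gone = not bool((obs_after[column] == dropped_label).any())
    return rows_match and aligned and label_gone

column = "disease__ontology_label"
obs_before = pd.DataFrame(
    {column: ["normal", "Lewy body dementia", "Parkinson disease", "normal"]}
)
X_before = np.ones((4, 2), dtype=np.float32)

keep = (obs_before[column] != "Lewy body dementia").to_numpy()
obs_after = obs_before.loc[keep]
X_after = X_before[keep]

selection_ok = check_selection(obs_before, obs_after, X_after, column, "Lewy body dementia")
# Passing the unfiltered matrix should fail the alignment check.
misaligned_ok = check_selection(obs_before, obs_after, X_before, column, "Lewy body dementia")
print(selection_ok, misaligned_ok)
```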


A new AnnData object has been created and stored on the Colab disk. Click the folder icon in the left side panel and open `write` to find the `new_anndata.h5ad` file. Either download it to your local disk and then upload it to your Google Drive, or move it directly by dragging it into the `drive` folder.

---
The purpose of creating a new AnnData object is to keep the other features, stored in `adata.obsm`, `adata.varm`, etc., accessible.
