## Import the packages
We will use *scanpy* as the main tool for the analysis. Scanpy has a comprehensive [manual webpage](https://scanpy.readthedocs.io/en/stable/) that includes many tutorials you can use for further practice. Scanpy is used in the discussion paper and the tutorial paper of this course.
An alternative and well-established tool for R users is [Seurat](https://satijalab.org/seurat/). However, scanpy is maintained and updated by a wider community and gives access to many of the most recently developed tools.

In [15]:
import scanpy as sc
import pandas as pd
import scvelo as scv
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import sklearn

## Loading and understanding the dataset structure

Data can be loaded from many different formats. Each format has a dedicated reading command, for example `read_h5ad`, `read_10x_mtx`, `read_text`. We are going to use `read_10x_mtx` to load the output of the 10X software that produces the aligned data.

Note the option `cache=True`. If you read the same data again, it will load much faster, because it has been stored in a format that is convenient for large datasets (the `h5ad` format).

In [16]:
crypto_2 = sc.read_10x_mtx('../../../../scRNASeq_course/Data/cellranger_crypto2/outs/filtered_feature_bc_matrix/', cache=True)

The dataset `crypto_2` is now created. It is a so-called annotated dataset (an `AnnData` object). Each annotated dataset contains:


* The data matrix `X` of size $N_{cells} \times N_{genes}$
* Vectors of cell-related quantities in the table `obs` (for example, how many transcripts there are in each cell)
* Vectors of gene-related quantities in the table `var` (for example, in how many cells each gene is detected)
* Matrices of size $N_{cells} \times N_{genes}$ in `layers` (for example, normalized data matrix, imputed data matrix, ...)

We will often refer to the cells as observations (obs) and to the genes as variables (var) when this is practical in relation to the annotated dataset.

During the analysis we will encounter other components of the annotated dataset. They will be explained when necessary, so you can skip the following list if you prefer.

* Matrices where each row is cell-related in `obsm` (for example, the PCA coordinates of each cell)
* Matrices where each row is gene-related in `varm` (for example, mean of the gene in each cell type)
* Anything else useful is in `uns`, and some quantities necessary for the `scanpy` package are saved in `obsp`

![alt text](https://falexwolf.de/img/scanpy/anndata.svg)

**Above:** a representation of the data matrix, variables and observations in an annotated dataset.

Each component of the annotated dataset is accessed with a dot (`.`). For example, we can see the data matrix with

In [17]:
crypto_2.X

<5176x36601 sparse matrix of type '<class 'numpy.float32'>'
	with 6517637 stored elements in Compressed Sparse Row format>

The matrix is stored in a compressed (sparse) format, which keeps only the non-zero values. We can reassign it as a dense matrix, so that we can see what it contains.

In [18]:
crypto_2.X = crypto_2.X.toarray()  # convert the sparse matrix to a dense NumPy array

In [19]:
crypto_2.X

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]], dtype=float32)
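
To make the difference between the two storage formats concrete, here is a small sketch with SciPy on a made-up 3 × 4 matrix (the names `dense`, `csr` and `restored` are illustrative). It also uses the numbers printed above for `crypto_2.X` to estimate how few entries are non-zero and how much memory the dense version needs.

```python
import numpy as np
from scipy import sparse

# Made-up matrix with mostly zeros, like a count matrix
dense = np.array(
    [[0, 0, 3, 0],
     [1, 0, 0, 0],
     [0, 0, 0, 2]],
    dtype=np.float32,
)

# Compressed Sparse Row format stores only the non-zero entries
csr = sparse.csr_matrix(dense)
print(csr.nnz)      # number of stored elements
print(csr.data)     # the non-zero values, row by row
print(csr.indices)  # the column index of each stored value
print(csr.indptr)   # where each row starts inside data/indices

# Converting back gives the original dense array
restored = csr.toarray()

# Numbers taken from the printed summary of crypto_2.X
n_cells, n_genes, n_stored = 5176, 36601, 6517637
fraction_nonzero = n_stored / (n_cells * n_genes)
dense_bytes = n_cells * n_genes * 4  # float32 uses 4 bytes per entry
print(f"non-zero entries: {fraction_nonzero:.2%}")
print(f"dense size: {dense_bytes / 1e6:.0f} MB")
```

The dense matrix is convenient to inspect, but it takes considerably more memory than the sparse one, which matters for larger datasets.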

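With the dense matrix in place, we can calculate some statistics for both cells and genes with the following `scanpy` command. Note that scanpy commands follow a similar format: a function from one of the scanpy modules is applied to the annotated dataset, and options are passed as keyword arguments.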

In [20]:
sc.preprocessing.calculate_qc_metrics(crypto_2, inplace=True)

We can see that `obs` and `var` now contain many different values whose names are mostly self-explanatory. For example:
- `n_genes_by_counts` is the number of detected genes in each cell
- `total_counts` is the number of transcripts in each cell
- `mean_counts` is the average count of each gene across all cells
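
The quantities above can be reproduced by hand on a small example. The following is a NumPy sketch of how we understand these metrics to be defined, applied to a made-up 4 cells × 3 genes matrix; it is not scanpy's own implementation, and the array `counts` is purely illustrative.

```python
import numpy as np

# Made-up counts: rows are cells, columns are genes
counts = np.array(
    [[2, 0, 1],
     [0, 0, 4],
     [1, 0, 0],
     [3, 0, 2]],
    dtype=np.float32,
)

# Cell-level quantities (one value per row)
n_genes_by_counts = (counts > 0).sum(axis=1)  # genes detected in each cell
total_counts = counts.sum(axis=1)             # transcripts in each cell
log1p_total_counts = np.log1p(total_counts)   # natural log of (1 + total)

# Gene-level quantities (one value per column)
n_cells_by_counts = (counts > 0).sum(axis=0)  # cells in which each gene is detected
mean_counts = counts.mean(axis=0)             # average count of each gene
pct_dropout_by_counts = 100 * (1 - n_cells_by_counts / counts.shape[0])

print(n_genes_by_counts)      # [2 1 1 2]
print(total_counts)           # [3. 4. 1. 5.]
print(n_cells_by_counts)      # [3 0 3]
print(mean_counts)            # [1.5  0.   1.75]
print(pct_dropout_by_counts)  # [ 25. 100.  25.]
```

The same definitions are consistent with the tables shown for `crypto_2`: a gene detected in 1 of 5176 cells has a dropout percentage of about 99.98.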

In [21]:
crypto_2

AnnData object with n_obs × n_vars = 5176 × 36601
    obs: 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts', 'pct_counts_in_top_50_genes', 'pct_counts_in_top_100_genes', 'pct_counts_in_top_200_genes', 'pct_counts_in_top_500_genes'
    var: 'gene_ids', 'feature_types', 'n_cells_by_counts', 'mean_counts', 'log1p_mean_counts', 'pct_dropout_by_counts', 'total_counts', 'log1p_total_counts'

You can access all observations/variables directly, or only some of them. Each row of `obs` is named with a cell barcode, while each row of `var` is named with a gene name.

In [22]:
crypto_2.obs

| | n_genes_by_counts | log1p_n_genes_by_counts | total_counts | log1p_total_counts | pct_counts_in_top_50_genes | pct_counts_in_top_100_genes | pct_counts_in_top_200_genes | pct_counts_in_top_500_genes |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| AAACCTGAGACCTAGG-1 | 981 | 6.889591 | 2623.0 | 7.872455 | 38.658025 | 53.602745 | 66.831872 | 81.662219 |
| AAACCTGAGCGATGAC-1 | 1508 | 7.319202 | 2831.0 | 7.948739 | 24.443659 | 34.934652 | 45.849523 | 64.394207 |
| AAACCTGAGTTAACGA-1 | 1019 | 6.927558 | 2312.0 | 7.746301 | 37.110727 | 49.048443 | 60.813149 | 77.551903 |
| AAACCTGCACATCTTT-1 | 374 | 5.926926 | 556.0 | 6.322565 | 39.208633 | 50.719424 | 68.705036 | 100.000000 |
| AAACCTGCAGCATACT-1 | 547 | 6.306275 | 1037.0 | 6.945051 | 43.394407 | 54.676953 | 66.538091 | 95.467695 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| TTTGTCAGTACGACCC-1 | 1494 | 7.309881 | 2516.0 | 7.830823 | 18.441971 | 28.060413 | 40.779014 | 60.492846 |
| TTTGTCAGTATCAGTC-1 | 724 | 6.586172 | 1406.0 | 7.249215 | 39.687055 | 50.426743 | 62.731152 | 84.068279 |
| TTTGTCAGTTAGATGA-1 | 638 | 6.459904 | 1117.0 | 7.019297 | 33.751119 | 47.269472 | 60.787825 | 87.645479 |
| TTTGTCATCGGGAGTA-1 | 515 | 6.246107 | 914.0 | 6.818924 | 40.153173 | 52.844639 | 65.536105 | 98.358862 |


In [23]:
crypto_2.obs[ ['total_counts','n_genes_by_counts'] ]

| | total_counts | n_genes_by_counts |
| --- | --- | --- |
| AAACCTGAGACCTAGG-1 | 2623.0 | 981 |
| AAACCTGAGCGATGAC-1 | 2831.0 | 1508 |
| AAACCTGAGTTAACGA-1 | 2312.0 | 1019 |
| AAACCTGCACATCTTT-1 | 556.0 | 374 |
| AAACCTGCAGCATACT-1 | 1037.0 | 547 |
| ... | ... | ... |
| TTTGTCAGTACGACCC-1 | 2516.0 | 1494 |
| TTTGTCAGTATCAGTC-1 | 1406.0 | 724 |
| TTTGTCAGTTAGATGA-1 | 1117.0 | 638 |
| TTTGTCATCGGGAGTA-1 | 914.0 | 515 |
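
Since `obs` and `var` are pandas tables, the usual pandas selection tools apply. The sketch below rebuilds the first four rows of the table above as a standalone `DataFrame` (named `obs` here only for illustration) and shows selection by column, by cell barcode, and by a condition. The threshold of 2500 counts is arbitrary and not a recommended filtering value.

```python
import pandas as pd

# First four rows of the table above, rebuilt by hand for illustration
obs = pd.DataFrame(
    {
        "total_counts": [2623.0, 2831.0, 2312.0, 556.0],
        "n_genes_by_counts": [981, 1508, 1019, 374],
    },
    index=[
        "AAACCTGAGACCTAGG-1",
        "AAACCTGAGCGATGAC-1",
        "AAACCTGAGTTAACGA-1",
        "AAACCTGCACATCTTT-1",
    ],
)

# A single column, as a pandas Series indexed by barcode
one_column = obs["total_counts"]

# A single cell, selected by its barcode
one_cell = obs.loc["AAACCTGCACATCTTT-1"]

# All cells satisfying a condition (arbitrary threshold)
high_count_cells = obs[obs["total_counts"] > 2500]

print(one_cell)
print(high_count_cells)
```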


In [24]:
crypto_2.var

| | gene_ids | feature_types | n_cells_by_counts | mean_counts | log1p_mean_counts | pct_dropout_by_counts | total_counts | log1p_total_counts |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| MIR1302-2HG | ENSG00000243485 | Gene Expression | 1 | 0.000193 | 0.000193 | 99.980680 | 1.0 | 0.693147 |
| FAM138A | ENSG00000237613 | Gene Expression | 0 | 0.000000 | 0.000000 | 100.000000 | 0.0 | 0.000000 |
| OR4F5 | ENSG00000186092 | Gene Expression | 0 | 0.000000 | 0.000000 | 100.000000 | 0.0 | 0.000000 |
| AL627309.1 | ENSG00000238009 | Gene Expression | 1 | 0.000193 | 0.000193 | 99.980680 | 1.0 | 0.693147 |
| AL627309.3 | ENSG00000239945 | Gene Expression | 0 | 0.000000 | 0.000000 | 100.000000 | 0.0 | 0.000000 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| AC141272.1 | ENSG00000277836 | Gene Expression | 0 | 0.000000 | 0.000000 | 100.000000 | 0.0 | 0.000000 |
| AC023491.2 | ENSG00000278633 | Gene Expression | 40 | 0.024536 | 0.024240 | 99.227202 | 127.0 | 4.852030 |
| AC007325.1 | ENSG00000276017 | Gene Expression | 8 | 0.001932 | 0.001930 | 99.845440 | 10.0 | 2.397895 |
| AC007325.4 | ENSG00000278817 | Gene Expression | 72 | 0.015649 | 0.015528 | 98.608964 | 81.0 | 4.406719 |


We copy the matrix `X` into `layers` to save the raw values. The copy will remain available in `layers`, independently of how we transform the matrix `X`.

In [25]:
crypto_2.layers[ 'umi_raw' ] = crypto_2.X.copy()

We can see the matrix in `layers`, and reassign it to `X` or use it directly if needed in some future analysis.

In [26]:
crypto_2

AnnData object with n_obs × n_vars = 5176 × 36601
    obs: 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts', 'pct_counts_in_top_50_genes', 'pct_counts_in_top_100_genes', 'pct_counts_in_top_200_genes', 'pct_counts_in_top_500_genes'
    var: 'gene_ids', 'feature_types', 'n_cells_by_counts', 'mean_counts', 'log1p_mean_counts', 'pct_dropout_by_counts', 'total_counts', 'log1p_total_counts'
    layers: 'umi_raw'

In [27]:
crypto_2.layers['umi_raw']

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]], dtype=float32)

The annotated datasets can be easily saved by using `write`. The format to be used is `h5ad`.

In [28]:
crypto_2.write('../../../Data/notebooks_data/crypto_2.h5ad')

... storing 'feature_types' as categorical
