## Import the packages
We will use *scanpy* as the main tool for the analysis. Scanpy has a comprehensive [manual webpage](https://scanpy.readthedocs.io/en/stable/) that includes many tutorials you can use for further practice. Scanpy is used in the discussion paper and the tutorial paper of this course. 
An alternative and well-established tool for R users is [Seurat](https://satijalab.org/seurat/). However, scanpy is maintained and updated by a wider community and integrates many recently developed tools.

In [1]:
import scanpy as sc
import pandas as pd
import scvelo as scv
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import sklearn

## Loading and understanding the dataset structure

Data can be loaded from many different formats. Each format has a dedicated reading command, for example `read_h5ad`, `read_10x_mtx`, `read_text`. We are going to use `read_10x_mtx` to load the output of the 10X software that produces the aligned data. 

Note the option `cache=True`. If you read the same data again, it will load much faster, because it has been stored in a format that is convenient for large datasets (`h5ad` format).

In [2]:
crypto_3 = sc.read_10x_mtx('../../../../scRNASeq_course/Data/cellranger_crypto3/outs/filtered_feature_bc_matrix/', cache=True)

The dataset `crypto_3` is now created. It is a so-called annotated dataset (an `AnnData` object). Each annotated dataset contains:

* The data matrix `X` of size $N\_cells \times N\_genes$
* Vectors of cell-related quantities in the table `obs` (for example, how many transcripts there are in each cell)
* Vectors of gene-related quantities in the table `var` (for example, in how many cells each gene is detected)
* Matrices of size $N\_cells \times N\_genes$ in `layers` (for example, normalized data matrix, imputed data matrix, ...)

We will often refer to the cells as observations (`obs`) and to the genes as variables (`var`) when it is practical in relation to the annotated dataset.

During the analysis we will encounter other components of the annotated datasets. They will be explained when necessary, so you can skip the following list if you want.

* Matrices where each row is cell-related in `obsm` (for example, the PCA coordinates of each cell)
* Matrices where each row is gene-related in `varm` (for example, mean of the gene in each cell type)
* Anything else useful is in `uns`, and pairwise cell-to-cell matrices needed by the `scanpy` package (for example, neighbor graphs) are saved in `obsp`

![alt text](https://falexwolf.de/img/scanpy/anndata.svg)

**Above:** a representation of the data matrix, variable and observations in an annotated dataset.  

Each component of the annotated dataset is accessed with a dot (`.`). For example, we can see the data matrix with

In [3]:
crypto_3.X

<5764x36601 sparse matrix of type '<class 'numpy.float32'>'
	with 11034764 stored elements in Compressed Sparse Row format>

The matrix is stored in a compressed sparse format, which keeps only the non-zero entries. We can reassign it as a dense matrix, so that we can see what it contains.

In [4]:
crypto_3.X = np.array( crypto_3.X.todense() )

In [5]:
crypto_3.X

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 1., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]], dtype=float32)
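The sparse matrix printed earlier stores 11,034,764 of its 5764 × 36601 entries, that is roughly 5% of them; all the others are zeros. The following sketch uses a made-up 3 × 4 matrix and `scipy.sparse` to illustrate what the Compressed Sparse Row format keeps and how it converts back to a dense array.

```python
import numpy as np
from scipy import sparse

# Made-up dense matrix: 3 cells x 4 genes, mostly zeros
dense = np.array([[0, 0, 3, 0],
                  [1, 0, 0, 0],
                  [0, 0, 0, 2]], dtype=np.float32)

csr = sparse.csr_matrix(dense)

print(csr.data)     # the non-zero values, stored row by row
print(csr.indices)  # the column index of each stored value
print(csr.indptr)   # where each row starts and ends inside `data`

# Fraction of entries that are actually stored
density = csr.nnz / (csr.shape[0] * csr.shape[1])
print(density)

# Converting back recovers the original dense matrix
roundtrip = csr.toarray()
```

Here `toarray()` returns a NumPy array directly, which gives the same result as wrapping `todense()` in `np.array`. Keep in mind that a dense `float32` matrix of size 5764 × 36601 takes about 0.84 GB of memory, so the dense form is convenient for inspection but expensive for larger datasets.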

With the dense matrix in place, we can calculate some statistics for both cells and genes with the following `scanpy` command. Note that all scanpy commands follow a similar format.

In [6]:
sc.preprocessing.calculate_qc_metrics(crypto_3, inplace=True)

We can see that `obs` and `var` now contain many different values whose names are mostly self-explanatory. For example
- `n_genes_by_counts` is the number of detected genes in each cell
- `total_counts` is the number of transcripts in each cell
- `mean_counts` is the average count of each gene across all cells

In [7]:
crypto_3

AnnData object with n_obs × n_vars = 5764 × 36601
    obs: 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts', 'pct_counts_in_top_50_genes', 'pct_counts_in_top_100_genes', 'pct_counts_in_top_200_genes', 'pct_counts_in_top_500_genes'
    var: 'gene_ids', 'feature_types', 'n_cells_by_counts', 'mean_counts', 'log1p_mean_counts', 'pct_dropout_by_counts', 'total_counts', 'log1p_total_counts'
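To show what some of these quantities mean, the sketch below recomputes a few of them with plain NumPy on a made-up count matrix. It is not scanpy's implementation; it assumes the definitions suggested by the column names, namely that a gene is "detected" in a cell when its count is greater than zero and that the dropout percentage is the share of cells in which the gene is not detected.

```python
import numpy as np

# Made-up count matrix: 4 cells (rows) x 4 genes (columns)
counts = np.array([[1, 0, 0, 4],
                   [0, 0, 2, 2],
                   [3, 0, 0, 1],
                   [0, 0, 0, 5]], dtype=np.float32)

n_cells = counts.shape[0]

# Per-cell quantities (the kind of values stored in obs)
n_genes_by_counts = (counts > 0).sum(axis=1)   # detected genes in each cell
total_counts = counts.sum(axis=1)              # transcripts in each cell

# Per-gene quantities (the kind of values stored in var)
n_cells_by_counts = (counts > 0).sum(axis=0)   # cells in which each gene is detected
mean_counts = counts.mean(axis=0)              # average count over all cells
pct_dropout_by_counts = 100 * (1 - n_cells_by_counts / n_cells)

print(n_genes_by_counts, total_counts)
print(n_cells_by_counts, mean_counts, pct_dropout_by_counts)
```

The second gene in this toy matrix is never detected, so its dropout percentage is 100, just like the genes with `n_cells_by_counts` equal to 0 in the real dataset.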

You can access all observations/variables directly, or select only some of them. Each row of `obs` is named with the cell barcode, while each row of `var` is named with the gene name.

In [8]:
crypto_3.obs

Unnamed: 0,n_genes_by_counts,log1p_n_genes_by_counts,total_counts,log1p_total_counts,pct_counts_in_top_50_genes,pct_counts_in_top_100_genes,pct_counts_in_top_200_genes,pct_counts_in_top_500_genes
AAACCTGAGGGAGTAA-1,749,6.620073,1706.0,7.442492,40.504103,54.572098,67.643611,85.404455
AAACCTGAGTACATGA-1,983,6.891626,1272.0,7.149132,18.632075,26.493711,38.443396,62.028302
AAACCTGCAAGAAGAG-1,838,6.732211,1410.0,7.252054,26.879433,39.219858,54.042553,76.028369
AAACCTGCAATGACCT-1,922,6.827629,1452.0,7.281385,28.236915,37.327824,50.275482,70.936639
AAACCTGCACGAAGCA-1,2723,7.909857,5288.0,8.573384,15.733737,23.827534,33.812405,50.945537
...,...,...,...,...,...,...,...,...
TTTGTCAGTACGAAAT-1,1140,7.039660,2108.0,7.653969,29.506641,39.990512,51.707780,69.639469
TTTGTCAGTTGGTGGA-1,358,5.883322,506.0,6.228511,35.177866,49.011858,68.774704,100.000000
TTTGTCATCAGAAATG-1,3298,8.101375,6762.0,8.819221,12.348418,18.988465,28.349601,45.489500
TTTGTCATCCTACAGA-1,1790,7.490529,2507.0,7.827241,13.442361,19.585162,28.639809,48.544077


In [9]:
crypto_3.obs[ ['total_counts','n_genes_by_counts'] ]

Unnamed: 0,total_counts,n_genes_by_counts
AAACCTGAGGGAGTAA-1,1706.0,749
AAACCTGAGTACATGA-1,1272.0,983
AAACCTGCAAGAAGAG-1,1410.0,838
AAACCTGCAATGACCT-1,1452.0,922
AAACCTGCACGAAGCA-1,5288.0,2723
...,...,...
TTTGTCAGTACGAAAT-1,2108.0,1140
TTTGTCAGTTGGTGGA-1,506.0,358
TTTGTCATCAGAAATG-1,6762.0,3298
TTTGTCATCCTACAGA-1,2507.0,1790
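Since `obs` and `var` are displayed as tables indexed by barcode and gene name, the usual pandas selection tools should apply to them. The sketch below assumes this and rebuilds a small table from three rows of the output above; the threshold of 1500 is arbitrary and chosen only for illustration.

```python
import pandas as pd

# Three rows copied from the obs table printed above
obs = pd.DataFrame(
    {"total_counts": [1706.0, 1272.0, 5288.0],
     "n_genes_by_counts": [749, 983, 2723]},
    index=["AAACCTGAGGGAGTAA-1", "AAACCTGAGTACATGA-1", "AAACCTGCACGAAGCA-1"],
)

# A single column, returned as a pandas Series
totals = obs["total_counts"]

# All values for one cell, selected by its barcode
one_cell = obs.loc["AAACCTGAGTACATGA-1"]

# A single value: one cell, one column
value = obs.loc["AAACCTGCACGAAGCA-1", "n_genes_by_counts"]

# Only the cells with more than 1500 transcripts (arbitrary threshold)
rich = obs[obs["total_counts"] > 1500]

print(value)
print(rich)
```

The same pattern can be tried on `crypto_3.obs` and `crypto_3.var`, replacing the barcodes with gene names in the second case.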


In [10]:
crypto_3.var

Unnamed: 0,gene_ids,feature_types,n_cells_by_counts,mean_counts,log1p_mean_counts,pct_dropout_by_counts,total_counts,log1p_total_counts
MIR1302-2HG,ENSG00000243485,Gene Expression,18,0.003123,0.003118,99.687717,18.0,2.944439
FAM138A,ENSG00000237613,Gene Expression,0,0.000000,0.000000,100.000000,0.0,0.000000
OR4F5,ENSG00000186092,Gene Expression,0,0.000000,0.000000,100.000000,0.0,0.000000
AL627309.1,ENSG00000238009,Gene Expression,29,0.005031,0.005019,99.496877,29.0,3.401197
AL627309.3,ENSG00000239945,Gene Expression,0,0.000000,0.000000,100.000000,0.0,0.000000
...,...,...,...,...,...,...,...,...
AC141272.1,ENSG00000277836,Gene Expression,0,0.000000,0.000000,100.000000,0.0,0.000000
AC023491.2,ENSG00000278633,Gene Expression,143,0.090909,0.087011,97.519084,524.0,6.263398
AC007325.1,ENSG00000276017,Gene Expression,81,0.023595,0.023321,98.594726,136.0,4.919981
AC007325.4,ENSG00000278817,Gene Expression,566,0.148334,0.138313,90.180430,855.0,6.752270


We copy the matrix `X` into a layer to preserve the raw values. They will remain available in `layers`, independently of how we transform the matrix `X`.

In [11]:
crypto_3.layers[ 'umi_raw' ] = crypto_3.X.copy()

We can now see the matrix listed in `layers`. It can be reassigned to `X`, or used directly, if needed in some future analysis.

In [12]:
crypto_3

AnnData object with n_obs × n_vars = 5764 × 36601
    obs: 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts', 'pct_counts_in_top_50_genes', 'pct_counts_in_top_100_genes', 'pct_counts_in_top_200_genes', 'pct_counts_in_top_500_genes'
    var: 'gene_ids', 'feature_types', 'n_cells_by_counts', 'mean_counts', 'log1p_mean_counts', 'pct_dropout_by_counts', 'total_counts', 'log1p_total_counts'
    layers: 'umi_raw'

In [13]:
crypto_3.layers['umi_raw']

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 1., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]], dtype=float32)
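The `.copy()` used when storing the raw matrix matters because NumPy arrays are shared by reference. The sketch below illustrates this with a plain dictionary standing in for `layers`; it shows NumPy behaviour only and makes no claim about how `AnnData` handles assignments internally. The matrix values and key names are made up.

```python
import numpy as np

# Made-up 2 x 2 count matrix
X = np.array([[0., 2.],
              [4., 0.]], dtype=np.float32)

# A plain dict as a stand-in for adata.layers (illustration only)
layers = {}
layers["with_copy"] = X.copy()   # independent copy of the values
layers["without_copy"] = X       # just another name for the same array

# Transform X in place, as a later processing step might do
np.log1p(X, out=X)

print(layers["with_copy"])     # still the raw counts
print(layers["without_copy"])  # changed together with X
```

Only the entry created with `.copy()` still holds the raw counts after the transformation.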

Annotated datasets can be easily saved with `write`. The format used is `h5ad`.

In [14]:
crypto_3.write('../../../Data/notebooks_data/crypto_3.h5ad')

... storing 'feature_types' as categorical
