## Import the packages
We will use *scanpy* as the main tool for the analysis. Scanpy has a comprehensive [manual webpage](https://scanpy.readthedocs.io/en/stable/) that includes many tutorials you can use for further practice. Scanpy is used in the discussion paper and the tutorial paper of this course.
An alternative and well-established tool for R users is [Seurat](https://satijalab.org/seurat/). However, scanpy is maintained and updated by a wider community and integrates many of the most recently developed tools.

In [1]:
import scanpy as sc
import pandas as pd
import scvelo as scv
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import sklearn

## Loading and understanding the dataset structure

Data can be loaded from many different formats. Each format has a dedicated reading command, for example `read_h5ad`, `read_10x_mtx`, `read_text`. We are going to use `read_10x_mtx` to load the output of the 10X software that produces the aligned data.

Note the option `cache=True`. If you read the same data again, it will load much faster, because a copy has been stored in a format convenient for large datasets (the `h5ad` format).

In [2]:
crypto_1 = sc.read_10x_mtx('../../../../scRNASeq_course/Data/cellranger_crypto1/outs/filtered_feature_bc_matrix/', cache=True)

The dataset `crypto_1` is now created. It is a so-called *annotated dataset* (an `AnnData` object). Each annotated dataset contains:


* The data matrix `X` of size $N_{cells} \times N_{genes}$
* Vectors of cell-related quantities in the table `obs` (for example, how many transcripts there are in each cell)
* Vectors of gene-related quantities in the table `var` (for example, in how many cells each gene is detected)
* Matrices of size $N_{cells} \times N_{genes}$ in `layers` (for example, normalized data matrix, imputed data matrix, ...)

We will often refer to the cells as observations (`obs`) and to the genes as variables (`var`) when this is practical in relation to the annotated dataset.

During the analysis we will encounter other components of the annotated dataset. They will be explained when necessary, so you can skip the following list if you want.

* Matrices where each row is cell-related in `obsm` (for example, the PCA coordinates of each cell)
* Matrices where each row is gene-related in `varm` (for example, the mean of the gene in each cell type)
* Anything else useful is in `uns`, and some quantities necessary for the `scanpy` package (pairwise cell-by-cell matrices, such as distances between neighboring cells) are saved in `obsp`

![alt text](https://falexwolf.de/img/scanpy/anndata.svg)

**Above:** a representation of the data matrix, variables and observations in an annotated dataset.

Each component of the annotated dataset is accessed with a dot (`.`). For example, we can see the data matrix with

In [3]:
crypto_1.X

<4333x36601 sparse matrix of type '<class 'numpy.float32'>'
	with 7088526 stored elements in Compressed Sparse Row format>

The matrix is stored in a compressed (sparse) format. We can reassign it as a dense matrix, so that we can see what it contains.

In [4]:
# Convert the sparse matrix to a dense NumPy array (this needs much more memory)
crypto_1.X = crypto_1.X.toarray()

In [5]:
crypto_1.X

array([[0., 0., 0., ..., 0., 1., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]], dtype=float32)
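
To illustrate why the count matrix is stored compressed in the first place, the sketch below compares the memory used by a dense array and by its Compressed Sparse Row (CSR) version. It uses a simulated matrix (random Poisson counts, an assumption made only for the demonstration) and then applies the same arithmetic to the dimensions reported above for `crypto_1.X`.

```python
import numpy as np
from scipy import sparse

# Simulated count matrix in which most entries are zero (toy data).
rng = np.random.default_rng(0)
dense = rng.poisson(0.05, size=(1000, 2000)).astype(np.float32)

# CSR keeps only the nonzero values plus their column indices and row pointers.
csr = sparse.csr_matrix(dense)
sparse_bytes = csr.data.nbytes + csr.indices.nbytes + csr.indptr.nbytes
dense_bytes = dense.nbytes
print(f"dense: {dense_bytes} bytes, sparse: {sparse_bytes} bytes")

# Converting back to a dense array gives exactly the same values.
roundtrip = csr.toarray()

# Same arithmetic with the numbers printed for crypto_1.X above.
n_cells, n_genes, n_stored = 4333, 36601, 7088526
fraction_stored = n_stored / (n_cells * n_genes)
dense_megabytes = n_cells * n_genes * 4 / 1e6  # float32 uses 4 bytes per value
print(f"stored fraction: {fraction_stored:.3f}, dense size: {dense_megabytes:.0f} MB")
```

For `crypto_1`, fewer than 5% of the entries are nonzero, while the dense version needs several hundred megabytes. Converting to dense is convenient for inspection here, but it may not be practical for larger datasets.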

Now that the matrix is no longer compressed, we can calculate some statistics for both cells and genes with the following `scanpy` command. Note that most scanpy commands follow a similar format: the annotated dataset is the first argument, followed by options.

In [6]:
sc.preprocessing.calculate_qc_metrics(crypto_1, inplace=True)

We can see that `obs` and `var` now contain many different values whose names are mostly self-explanatory. For example
- `n_genes_by_counts` (in `obs`) is the number of detected genes in each cell
- `total_counts` (in `obs`) is the number of transcripts in each cell
- `mean_counts` (in `var`) is the average count of each gene across all cells
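
To make these definitions concrete, the sketch below recomputes a few of the metrics by hand with NumPy on an invented 3 x 4 count matrix. It is only meant to show what the names stand for, under the assumption that a gene counts as "detected" when its count is above zero; it is not a replacement for the scanpy command.

```python
import numpy as np

# Toy count matrix: 3 cells (rows) x 4 genes (columns). Values are invented.
X = np.array([[0, 2, 1, 0],
              [3, 0, 0, 0],
              [1, 1, 4, 0]], dtype=np.float32)

# Per-cell quantities (one value per row, as stored in obs).
n_genes_by_counts = (X > 0).sum(axis=1)   # genes with at least one count
total_counts = X.sum(axis=1)              # sum of all counts in the cell

# Per-gene quantities (one value per column, as stored in var).
n_cells_by_counts = (X > 0).sum(axis=0)   # cells in which the gene is detected
mean_counts = X.mean(axis=0)              # average count across all cells
pct_dropout_by_counts = 100 * (1 - n_cells_by_counts / X.shape[0])

print(n_genes_by_counts, total_counts)
print(n_cells_by_counts, mean_counts, pct_dropout_by_counts)
```

The same formulas are consistent with the `var` table shown later: a gene detected in 1 of the 4333 cells has `mean_counts` of about 1/4333 and a `pct_dropout_by_counts` just below 100.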

In [7]:
crypto_1

AnnData object with n_obs × n_vars = 4333 × 36601
    obs: 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts', 'pct_counts_in_top_50_genes', 'pct_counts_in_top_100_genes', 'pct_counts_in_top_200_genes', 'pct_counts_in_top_500_genes'
    var: 'gene_ids', 'feature_types', 'n_cells_by_counts', 'mean_counts', 'log1p_mean_counts', 'pct_dropout_by_counts', 'total_counts', 'log1p_total_counts'

You can directly access all observations/variables or only some of them. Each row of `obs` is named with the cell barcode, while each row of `var` is named with the gene name.

In [8]:
crypto_1.obs

| | n_genes_by_counts | log1p_n_genes_by_counts | total_counts | log1p_total_counts | pct_counts_in_top_50_genes | pct_counts_in_top_100_genes | pct_counts_in_top_200_genes | pct_counts_in_top_500_genes |
|---|---|---|---|---|---|---|---|---|
| AAACCTGAGAGTGACC-1 | 1237 | 7.121252 | 2693.0 | 7.898782 | 34.868177 | 44.931303 | 55.699963 | 72.632752 |
| AAACCTGAGCGTTTAC-1 | 2799 | 7.937375 | 5979.0 | 8.696176 | 14.768356 | 22.077270 | 31.844790 | 50.091989 |
| AAACCTGAGTGTCCCG-1 | 2261 | 7.724005 | 4412.0 | 8.392310 | 17.248413 | 25.362647 | 36.196736 | 54.351768 |
| AAACCTGCACCTTGTC-1 | 1815 | 7.504392 | 2789.0 | 7.933797 | 16.600932 | 24.058802 | 33.775547 | 52.850484 |
| AAACCTGCATTTCAGG-1 | 860 | 6.758095 | 1161.0 | 7.057898 | 21.016365 | 29.629630 | 43.152455 | 68.992248 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| TTTGTCAGTGCACCAC-1 | 2001 | 7.601902 | 3137.0 | 8.051341 | 11.603443 | 18.265859 | 28.243545 | 48.262671 |
| TTTGTCAGTGCATCTA-1 | 2705 | 7.903227 | 9400.0 | 9.148571 | 36.978723 | 43.563830 | 52.265957 | 67.117021 |
| TTTGTCAGTTGTTTGG-1 | 2466 | 7.810758 | 5150.0 | 8.546947 | 15.883495 | 23.844660 | 34.466019 | 53.766990 |
| TTTGTCAGTTTAGCTG-1 | 1034 | 6.942157 | 1978.0 | 7.590347 | 31.041456 | 42.315470 | 54.600607 | 73.003033 |


In [9]:
crypto_1.obs[ ['total_counts','n_genes_by_counts'] ]

| | total_counts | n_genes_by_counts |
|---|---|---|
| AAACCTGAGAGTGACC-1 | 2693.0 | 1237 |
| AAACCTGAGCGTTTAC-1 | 5979.0 | 2799 |
| AAACCTGAGTGTCCCG-1 | 4412.0 | 2261 |
| AAACCTGCACCTTGTC-1 | 2789.0 | 1815 |
| AAACCTGCATTTCAGG-1 | 1161.0 | 860 |
| ... | ... | ... |
| TTTGTCAGTGCACCAC-1 | 3137.0 | 2001 |
| TTTGTCAGTGCATCTA-1 | 9400.0 | 2705 |
| TTTGTCAGTTGTTTGG-1 | 5150.0 | 2466 |
| TTTGTCAGTTTAGCTG-1 | 1978.0 | 1034 |
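
Because `obs` and `var` are pandas data frames, the usual pandas selection tools apply to them. The sketch below rebuilds a three-row excerpt of the table above by hand (so that it runs without the course data) and shows column selection, row selection by barcode, and filtering by a condition. The threshold of 4000 counts is an arbitrary value chosen for illustration, not a recommended cutoff.

```python
import pandas as pd

# Three rows copied from the output above, indexed by cell barcode.
obs = pd.DataFrame(
    {
        "total_counts": [2693.0, 5979.0, 4412.0],
        "n_genes_by_counts": [1237, 2799, 2261],
    },
    index=["AAACCTGAGAGTGACC-1", "AAACCTGAGCGTTTAC-1", "AAACCTGAGTGTCCCG-1"],
)

# One column, returned as a pandas Series.
one_column = obs["total_counts"]

# One cell, selected by its barcode.
one_cell = obs.loc["AAACCTGAGCGTTTAC-1"]

# All cells above an (arbitrary) threshold on the number of counts.
deep_cells = obs[obs["total_counts"] > 4000]

print(one_cell)
print(deep_cells)
```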


In [10]:
crypto_1.var

| | gene_ids | feature_types | n_cells_by_counts | mean_counts | log1p_mean_counts | pct_dropout_by_counts | total_counts | log1p_total_counts |
|---|---|---|---|---|---|---|---|---|
| MIR1302-2HG | ENSG00000243485 | Gene Expression | 1 | 0.000231 | 0.000231 | 99.976921 | 1.0 | 0.693147 |
| FAM138A | ENSG00000237613 | Gene Expression | 0 | 0.000000 | 0.000000 | 100.000000 | 0.0 | 0.000000 |
| OR4F5 | ENSG00000186092 | Gene Expression | 1 | 0.000231 | 0.000231 | 99.976921 | 1.0 | 0.693147 |
| AL627309.1 | ENSG00000238009 | Gene Expression | 8 | 0.001846 | 0.001845 | 99.815370 | 8.0 | 2.197225 |
| AL627309.3 | ENSG00000239945 | Gene Expression | 0 | 0.000000 | 0.000000 | 100.000000 | 0.0 | 0.000000 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| AC141272.1 | ENSG00000277836 | Gene Expression | 0 | 0.000000 | 0.000000 | 100.000000 | 0.0 | 0.000000 |
| AC023491.2 | ENSG00000278633 | Gene Expression | 204 | 0.103393 | 0.098390 | 95.291946 | 448.0 | 6.107023 |
| AC007325.1 | ENSG00000276017 | Gene Expression | 8 | 0.002077 | 0.002075 | 99.815370 | 9.0 | 2.302585 |
| AC007325.4 | ENSG00000278817 | Gene Expression | 141 | 0.039926 | 0.039150 | 96.745904 | 173.0 | 5.159055 |


We copy the matrix `X` into `layers` to preserve the raw values. The copy stays available in `layers`, independently of how we later transform the matrix `X`.

In [11]:
crypto_1.layers[ 'umi_raw' ] = crypto_1.X.copy()

We can now see the matrix listed in `layers`. It can be reassigned to `X`, or used directly, if needed in a future analysis.

In [12]:
crypto_1

AnnData object with n_obs × n_vars = 4333 × 36601
    obs: 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts', 'pct_counts_in_top_50_genes', 'pct_counts_in_top_100_genes', 'pct_counts_in_top_200_genes', 'pct_counts_in_top_500_genes'
    var: 'gene_ids', 'feature_types', 'n_cells_by_counts', 'mean_counts', 'log1p_mean_counts', 'pct_dropout_by_counts', 'total_counts', 'log1p_total_counts'
    layers: 'umi_raw'

In [13]:
crypto_1.layers['umi_raw']

array([[0., 0., 0., ..., 0., 1., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]], dtype=float32)
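
The `.copy()` used when filling the layer matters. The sketch below shows the difference between an independent copy and a second name for the same array when the matrix is later transformed in place. A plain dictionary stands in for `layers` here, an assumption made to keep the example independent of scanpy, and the values are invented.

```python
import numpy as np

# Toy count matrix (invented values).
X = np.array([[0., 2., 1.],
              [3., 0., 0.]], dtype=np.float32)

layers = {}                     # plain dict standing in for the layers component
layers["umi_raw"] = X.copy()    # independent copy of the raw counts
alias = X                       # NOT a copy: just another name for the same array

# Transform X in place, as a later normalization step might do.
np.log1p(X, out=X)

# The copy still holds the raw counts, while the alias follows X.
raw_unchanged = np.array_equal(layers["umi_raw"], [[0, 2, 1], [3, 0, 0]])
alias_follows_x = alias is X
print(raw_unchanged, alias_follows_x)
```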

The annotated datasets can be easily saved by using `write`. The format to be used is `h5ad`.

In [15]:
crypto_1.write('../../../Data/notebooks_data/crypto_1.h5ad')