# **Read the data**

---------------------------

## Learning objectives:
- Get an overview of the `scanpy` package and the `python` language syntax
- Learn and explore the data structure containing a single cell dataset
- Understand and apply basic interactions with the transcript matrix and the components of a dataset
----------------
**Execution time: 30-60 minutes**

------------------------------------

## Import the packages
We will use `scanpy` as the main analysis tool, together with a few other packages. Scanpy has a comprehensive [manual webpage](https://scanpy.readthedocs.io/en/stable/) that includes many tutorials you can use for further practice. Packages are imported with the keyword `import`, and the keyword `as` gives each of them a shorter name, so that we can write shorter names in our code.

An alternative and well-established tool for `R` users is [Seurat](https://satijalab.org/seurat/). This is used in the `R` version of this course.

In [2]:
import scanpy as sc
import pandas as pd
import scvelo as scv
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import sklearn

Commands from scanpy are grouped into categories: preprocessing (`pp`), tools (`tl`), plotting (`pl`). Each category contains functions that work on single cell data. Scanpy also has a category called `external`, where a few external packages have been integrated to work with scanpy. Use the `help()` command to see what a command does in `python`.

In [3]:
help(sc.preprocessing.calculate_qc_metrics)

Help on function calculate_qc_metrics in module scanpy.preprocessing._qc:

calculate_qc_metrics(adata: anndata._core.anndata.AnnData, *, expr_type: str = 'counts', var_type: str = 'genes', qc_vars: Collection[str] = (), percent_top: Union[Collection[int], NoneType] = (50, 100, 200, 500), layer: Union[str, NoneType] = None, use_raw: bool = False, inplace: bool = False, log1p: bool = True, parallel: Union[bool, NoneType] = None) -> Union[Tuple[pandas.core.frame.DataFrame, pandas.core.frame.DataFrame], NoneType]
    Calculate quality control metrics.
    
    Calculates a number of qc metrics for an AnnData object, see section
    `Returns` for specifics. Largely based on `calculateQCMetrics` from scater
    [McCarthy17]_. Currently is most efficient on a sparse CSR or dense matrix.
    
    Note that this method can take a while to compile on the first call. That
    result is then cached to disk to be used later.
    
    Parameters
    ----------
    adata
        Annotated data matrix

## Loading and understanding the dataset structure

Data can be loaded from many different formats. Each format has a dedicated reading command, for example `read_h5ad`, `read_10x_mtx`, `read_text`. We are going to use `read_10x_mtx` to load the output of the 10X software that produces the aligned data.

Note the option `cache=True`. If you read the same data again, it will load much faster, because the first read stores it in a format convenient for large datasets (`h5ad` format).

In [17]:
sample_2 = sc.read_10x_mtx('../../../../sandbox_scRNA_testAndFeedback/scRNASeq_course/Data/cellranger_sample2/outs/filtered_feature_bc_matrix/', cache=True)

In [None]:
sample_3 = sc.read_10x_mtx('../../../../sandbox_scRNA_testAndFeedback/scRNASeq_course/Data/cellranger_sample3/outs/filtered_feature_bc_matrix/', cache=True)

The datasets `sample_2` and `sample_3` are now created. They are so-called annotated datasets (`AnnData` objects). Each annotated dataset contains:

* The data matrix `X` of size $N\_cells \times N\_genes$
* Vectors of cell-related quantities in the table `obs` (for example, how many transcripts there are in each cell)
* Vectors of gene-related quantities in the table `var` (for example, in how many cells each gene is detected)
* Matrices of size $N\_cells \times N\_genes$ in `adata.layers` (for example, normalized data matrix, imputed data matrix, ...)

We will often refer to the cells as observations (obs) and to the genes as variables (var) when it is practical in relation to the annotated dataset.

During the analysis we will encounter other components of the annotated datasets. They will be explained when necessary, so you can skip the following list if you prefer.

* Matrices where each row is cell-related in `obsm` (for example, the PCA coordinates of each cell)
* Matrices where each row is gene-related in `adata.varm` (for example, mean of the gene in each cell type)
* Anything else useful is in `adata.uns`, and pairwise quantities between cells that the `scanpy` package needs (for example, the neighborhood graph of the cells) are saved in `obsp`

![Structure of an annotated dataset](https://falexwolf.de/img/scanpy/anndata.svg)

**Above:** a representation of the data matrix, variables and observations in an annotated dataset.

Each component of the annotated dataset is accessed by using a dot. For example, we can see the data matrix with

In [None]:
sample_2.X

The matrix is stored in a compressed (sparse) format, which keeps only the non-zero values. We can reassign it as a dense matrix, so that we can see what it contains.

In [None]:
sample_2.X = np.array( sample_2.X.todense() )

In [None]:
sample_2.X

In [None]:
sample_3.X = np.array( sample_3.X.todense() )

In [None]:
sample_3.X
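
What the conversion does can be sketched on a toy matrix. This example assumes `scipy` is installed and uses invented values; it only illustrates the difference between the compressed and the dense representation, not how `scanpy` stores data internally.

```python
import numpy as np
from scipy import sparse

# Toy matrix with mostly zeros, as is typical for single cell counts
dense = np.array([[0, 0, 3],
                  [1, 0, 0],
                  [0, 0, 0],
                  [0, 2, 0]], dtype=np.float32)

# Compressed sparse row (CSR) format stores only the non-zero entries
compressed = sparse.csr_matrix(dense)
print(compressed.nnz)   # number of stored values: 3 out of 12

# toarray() returns a regular NumPy array with the zeros filled back in
restored = compressed.toarray()
print(restored)
```

A dense matrix is easier to inspect, but it uses memory for every zero as well, so for large datasets the compressed format is much lighter.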

With the matrices converted, we can calculate some statistics for both cells and genes with the following `scanpy` command. Note that all scanpy commands follow a similar format. The two commands used below are the same function applied to the two samples, but in the second we used the short form `pp` for the `preprocessing` category.

In [None]:
sc.preprocessing.calculate_qc_metrics(sample_2, inplace=True)
sc.pp.calculate_qc_metrics(sample_3, inplace=True)

We can see that `obs` and `var` now contain many different values whose names are mostly self-explanatory. For example
- `n_genes_by_counts` is the number of detected genes in each cell
- `total_counts` is the number of transcripts in each cell
- `mean_counts` is the average of counts of each gene across all cells
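
To make these definitions concrete, the same quantities can be computed by hand on a toy matrix. This is only a sketch with invented values, following the descriptions above; it is not the implementation used by `scanpy`.

```python
import numpy as np

# Toy count matrix: 4 cells (rows) x 3 genes (columns). All values are made up.
counts = np.array([[1, 0, 3],
                   [0, 2, 0],
                   [4, 0, 0],
                   [0, 0, 5]], dtype=np.float32)

# Per cell (one value per row)
n_genes_by_counts = (counts > 0).sum(axis=1)   # detected genes: [2, 1, 1, 1]
total_counts = counts.sum(axis=1)              # transcripts:    [4, 2, 4, 5]

# Per gene (one value per column)
mean_counts = counts.mean(axis=0)              # average counts:  [1.25, 0.5, 2.0]
n_cells_by_counts = (counts > 0).sum(axis=0)   # cells with gene: [2, 1, 2]

print(n_genes_by_counts, total_counts)
print(mean_counts, n_cells_by_counts)
```

Quantities computed along `axis=1` describe cells and belong in `obs`, while those computed along `axis=0` describe genes and belong in `var`.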

In [None]:
sample_2

You can access all observations/variables directly, or select only some of them. Each row of the observations is named with the cell barcode, while each row of the variables is named with the gene name.

In [None]:
sample_2.obs

In [None]:
sample_2.obs[ ['total_counts','n_genes_by_counts'] ]

In [None]:
sample_2.var

We copy the matrix `X` into `layers` to save the raw values. We will be able to find them there, independently of how we transform the matrix `X`.

In [None]:
sample_2.layers[ 'umi_raw' ] = sample_2.X.copy()

In [None]:
sample_3.layers[ 'umi_raw' ] = sample_3.X.copy()

We can see that the matrix is stored in `.layers['umi_raw']`. We can reassign it to `.X`, or use it directly if needed in some future analysis.

In [None]:
sample_2

In [None]:
sample_2.layers['umi_raw']
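
Why `.copy()` is used when saving the raw values can be sketched with plain NumPy arrays. This example uses invented values and deliberately does not involve an annotated dataset, whose internal handling may differ; it only shows the general difference between a second name for the same array and an independent copy.

```python
import numpy as np

X = np.array([[1.0, 0.0],
              [2.0, 3.0]])

alias = X           # a second name for the same array in memory
backup = X.copy()   # an independent copy of the values

X *= 10             # in-place transformation of X

print(alias[0, 0])   # 10.0, the alias changed together with X
print(backup[0, 0])  # 1.0, the copy kept the original value
```

An explicit copy therefore guarantees that later in-place changes to the matrix cannot alter the saved raw values.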

You can always subset a dataset by using a selection of cells and genes, and assign it as a new dataset (or to itself if you want to filter out some cells or genes).

An annotated dataset can be subsetted by cells, for example using a quality measure such as the number of transcripts per cell.

In [None]:
sample_2_qc = sample_2[ sample_2.obs['total_counts']<10000, : ].copy()

In [None]:
sample_2_qc

In a similar way, you can use values calculated on the genes to subset the data by genes, for example the number of cells in which each gene is detected.

In [None]:
sample_2_qc = sample_2_qc[ :, sample_2_qc.var['n_cells_by_counts']>3 ].copy()

In [None]:
sample_2_qc

Comparing the two outputs, note how the first subset has a reduced number of cells and the second has a reduced number of genes.

Remember that subsetting by cells and genes at the same time, for example
```
sample_2[ sample_2.obs['total_counts']<10000, sample_2.var['mean_counts']>1 ]
```
is not supported by every version of `anndata`, so it is safest to do the two steps separately, as shown before.

The annotated datasets can be easily saved by using `write`. The file name should end with the extension `.h5ad`.

In [None]:
!mkdir -p ../../Data/notebooks_data

In [None]:
sample_2.write('../../Data/notebooks_data/sample_2.h5ad')

In [None]:
sample_3.write('../../Data/notebooks_data/sample_3.h5ad')