In [3]:
!pip install scanpy


Collecting scanpy
  Downloading scanpy-1.10.3-py3-none-any.whl.metadata (9.4 kB)
Collecting anndata>=0.8 (from scanpy)
  Downloading anndata-0.10.9-py3-none-any.whl.metadata (6.9 kB)
Collecting legacy-api-wrap>=1.4 (from scanpy)
  Downloading legacy_api_wrap-1.4-py3-none-any.whl.metadata (1.8 kB)
Collecting pynndescent>=0.5 (from scanpy)
  Downloading pynndescent-0.5.13-py3-none-any.whl.metadata (6.8 kB)
Collecting session-info (from scanpy)
  Downloading session_info-1.0.0.tar.gz (24 kB)
  Preparing metadata (setup.py) ... done
Collecting umap-learn!=0.5.0,>=0.5 (from scanpy)
  Downloading umap_learn-0.5.7-py3-none-any.whl.metadata (21 kB)
Collecting array-api-compat!=1.5,>1.4 (from anndata>=0.8->scanpy)
  Downloading array_api_compat-1.9.1-py3-none-any.whl.metadata (1.6 kB)
Collecting stdlib_list (from session-info->scanpy)
  Downloading stdlib_list-0.11.0-py3-none-any.whl.metadata (3.3 kB)
Downloading scanpy-1.10.3-py3-none-any.whl (2.1 MB)

In [4]:
import numpy as np
import pandas as pd
import scanpy as sc

# **1. AnnData as a data object**


We will explore a small single-cell RNA sequencing dataset of fallopian tube cells. The dataset includes benign samples from 5 patients and high-grade serous ovarian carcinoma (HGOC) samples from 5 patients.

HGOC is an ovarian cancer, but it is thought to originate from cells in the fallopian tube.

**1. Read in data to create an AnnData object**


In [6]:
hu = sc.read_h5ad('')

**2. View the structure of the data**

In [7]:
hu

AnnData object with n_obs × n_vars = 4545 × 19683
    obs: 'Patient', 'Author', 'Tissue', 'Disease_stage'

**3. View the metadata**

In [None]:
hu.obs
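
Because `.obs` is a pandas DataFrame with one row per cell, the usual pandas summaries apply to it. The sketch below uses a small made-up table in place of `hu.obs`: only the column names come from the object shown above, and every value is invented for illustration.

```python
import pandas as pd

# Hypothetical stand-in for hu.obs: the values are invented,
# only the column names are taken from the dataset.
obs = pd.DataFrame(
    {
        "Patient": ["P1", "P1", "P2", "P2", "P3", "P3"],
        "Author": ["A", "A", "A", "B", "B", "B"],
        "Tissue": ["FT", "FT", "FT", "Tumour", "Tumour", "Tumour"],
        "Disease_stage": ["Benign", "Benign", "Benign", "HGOC", "HGOC", "HGOC"],
    },
    index=[f"cell_{i}" for i in range(6)],
)

# Number of cells contributed by each patient.
cells_per_patient = obs["Patient"].value_counts()

# Cross-tabulate patients against disease stage to check the study design.
design = pd.crosstab(obs["Patient"], obs["Disease_stage"])

print(obs.head())
print(cells_per_patient)
print(design)
```

On the real object, replacing `obs` with `hu.obs` should give the same kind of summary.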

**4. View the most highly expressed genes**

  Using sc.pl.highest_expr_genes()

In [None]:
sc.pl.highest_expr_genes(hu, n_top=20)
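
To see what this plot summarises, here is a rough pandas sketch of the underlying calculation: for every cell, express each gene as a percentage of that cell's total counts, then average over cells and keep the top genes. The count matrix and gene names are toy placeholders, not values from the dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Toy count matrix (cells x genes); gene names are placeholders, not real genes.
counts = pd.DataFrame(
    rng.poisson(lam=[50, 5, 1, 20, 2], size=(100, 5)),
    columns=["geneA", "geneB", "geneC", "geneD", "geneE"],
)

# Percentage of each cell's total counts contributed by each gene.
pct_per_cell = counts.div(counts.sum(axis=1), axis=0) * 100

# Average that percentage over all cells and keep the top three genes.
top_genes = pct_per_cell.mean(axis=0).sort_values(ascending=False).head(3)
print(top_genes)
```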

# **2. Filtering of the data**


**1. Filter out cells which have fewer than 200 genes expressed, and genes which are expressed in fewer than 3 cells.**

Use the scanpy commands:

* sc.pp.filter_cells(data, min_genes= *int*)
* sc.pp.filter_genes(data, min_cells= *int*)
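
The two filters above can be pictured as simple row and column masks on the count matrix. This is a NumPy sketch on toy data, assuming a gene counts as "expressed" in a cell when its count is greater than zero; it illustrates the idea rather than reproducing scanpy's implementation.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy count matrix: 50 cells x 400 genes.
X = rng.poisson(2.0, size=(50, 400))
X[:5, 150:] = 0    # first five cells are low quality: fewer than 200 genes detected
X[:, :10] = 0      # genes 0-9 are switched off everywhere ...
X[5:7, :10] = 1    # ... except in two cells, so they are detected in only 2 cells

min_genes, min_cells = 200, 3

# Cells: keep those with at least min_genes genes detected (count > 0).
genes_per_cell = (X > 0).sum(axis=1)
X = X[genes_per_cell >= min_genes, :]

# Genes: keep those detected in at least min_cells of the remaining cells.
cells_per_gene = (X > 0).sum(axis=0)
X = X[:, cells_per_gene >= min_cells]

print(X.shape)
```

The five low-quality cells and the ten rarely detected genes are removed, leaving the rest of the matrix untouched.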



**2. Calculate quality control metrics**

* sc.pp.calculate_qc_metrics(data, percent_top=(50, 100, 200, 500), inplace=True, log1p=False)
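
For orientation, here is a NumPy sketch of three of the per-cell metrics this step adds to `.obs`, computed by hand on a toy matrix. The variable names mirror the scanpy column names; the data are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.poisson(1.0, size=(20, 300))  # toy counts: 20 cells x 300 genes

# n_genes_by_counts: number of genes with at least one count in each cell.
n_genes_by_counts = (X > 0).sum(axis=1)

# total_counts: total number of counts (library size) of each cell.
total_counts = X.sum(axis=1)

# pct_counts_in_top_50_genes: share of a cell's counts held by its
# 50 most highly expressed genes.
top50 = np.sort(X, axis=1)[:, -50:].sum(axis=1)
pct_counts_in_top_50_genes = 100 * top50 / total_counts

print(n_genes_by_counts[:5], total_counts[:5], pct_counts_in_top_50_genes[:5])
```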

**3. View quality control metrics in a violin plot**
* sc.pl.violin(data, ['n_genes_by_counts','total_counts'],jitter=0.4, multi_panel=True)

**4. Filter the cells further by slicing the AnnData object on 'n_genes_by_counts' and 'total_counts'**

For n_genes_by_counts (remove cells with a high number of detected genes) this would be:
* data = data[data.obs.n_genes_by_counts < 6000, :]
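
Both thresholds can be combined into a single boolean mask. The sketch below builds the mask on a made-up QC table standing in for `data.obs`; the `total_counts` cut-off of 40,000 is an assumed value for illustration only and should be chosen from your own violin plot.

```python
import pandas as pd

# Hypothetical QC table standing in for data.obs (all values invented).
obs = pd.DataFrame(
    {
        "n_genes_by_counts": [1500, 7200, 3000, 5900, 800],
        "total_counts": [4000, 60000, 12000, 45000, 1500],
    },
    index=["c1", "c2", "c3", "c4", "c5"],
)

max_genes = 6000     # threshold used in the text above
max_counts = 40000   # assumed value for illustration; pick yours from the violin plot

# Combine both conditions with & (each condition needs its own parentheses).
keep = (obs["n_genes_by_counts"] < max_genes) & (obs["total_counts"] < max_counts)
print(obs[keep])

# On the AnnData object the same mask would be applied as:
#   data = data[keep.values, :].copy()
```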




**5. View the structure of the data after the filtering steps**

# **3. Dimensionality reduction and visualisation**
To plot the data as a UMAP, work through the following steps:


1. Initial pre-processing:
* Normalise the counts per cell, so that each cell has 10,000 counts
* Log-transform the data

2.   Identify the highly variable genes, subset the data to them, then scale the data
3.   Compute the PCA
4.   Compute the nearest-neighbours graph
5.   Compute the UMAP




**1. Normalise and log-transform the data**
* sc.pp.normalize_total(data, target_sum=1e4)
* sc.pp.log1p(data)
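
As a minimal sketch of what these two steps do to the numbers, the NumPy version below rescales each cell to 10,000 counts and then applies log(1 + x). The matrix is a toy example, and it assumes no cell has zero total counts.

```python
import numpy as np

# Toy counts: 3 cells x 4 genes with very different sequencing depths.
X = np.array(
    [
        [10, 0, 5, 5],
        [100, 0, 50, 50],
        [1, 1, 1, 1],
    ],
    dtype=float,
)

target_sum = 1e4

# Scale every cell (row) so that its counts add up to target_sum.
X_norm = X / X.sum(axis=1, keepdims=True) * target_sum

# Natural log of (1 + x), which keeps zeros at zero.
X_log = np.log1p(X_norm)

print(X_norm.sum(axis=1))
print(np.allclose(X_log[0], X_log[1]))
```

The first two cells have the same composition at different depths, so they become identical after normalisation. Removing that depth effect is the purpose of this step.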

**2. Subset the data to highly variable genes only**
* sc.pp.highly_variable_genes(data, min_mean=0.0125, max_mean=3, min_disp=0.5)
* sc.pl.highly_variable_genes(data)
* data.raw = data
* data = data[:, data.var.highly_variable]
* sc.pp.scale(data, max_value=10)

**3. Calculate a principal component analysis (PCA) and plot the elbow plot**
* sc.tl.pca(data, svd_solver='arpack')
* sc.pl.pca_variance_ratio(data, log=True)
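
The elbow plot shows the fraction of variance explained by each principal component. A rough NumPy sketch of that quantity, on toy data built with three strong signals plus noise, is given below; it uses a plain SVD rather than scanpy's solver.

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy data: 200 cells x 30 genes with 3 strong underlying signals plus noise.
signals = (rng.normal(size=(200, 3)) @ rng.normal(size=(3, 30))) * 3
X = signals + rng.normal(size=(200, 30))

# PCA via SVD of the gene-centred matrix.
Xc = X - X.mean(axis=0)
singular_values = np.linalg.svd(Xc, compute_uv=False)

# Fraction of the total variance explained by each component, largest first.
variance_ratio = singular_values**2 / np.sum(singular_values**2)
print(np.round(variance_ratio[:6], 3))
```

Here the ratios drop sharply after the third component. That drop is the "elbow" to look for in the real plot.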

**4. Compute the nearest-neighbours graph. Choose the number of principal components to use based on the elbow (inflection point) of the PCA plot**
* sc.pp.neighbors(hu, n_pcs=8)
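
To make the role of `n_pcs` concrete, the sketch below finds each cell's nearest neighbours by Euclidean distance in the space of the first `n_pcs` principal components. The helper `knn_from_pcs` is a hypothetical name introduced here, and the data are toy values. scanpy's own routine goes further (it uses 15 neighbours by default and turns the result into a weighted graph), so treat this only as an illustration of the first step.

```python
import numpy as np


def knn_from_pcs(X, n_pcs, k):
    """Return the indices of the k nearest neighbours of each row,
    measured in the space of the top n_pcs principal components."""
    Xc = X - X.mean(axis=0)
    U, S, _ = np.linalg.svd(Xc, full_matrices=False)
    pcs = U[:, :n_pcs] * S[:n_pcs]  # PC coordinates of each cell
    # Pairwise Euclidean distances between cells in PC space.
    diff = pcs[:, None, :] - pcs[None, :, :]
    dist = np.sqrt((diff**2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)  # a cell is not its own neighbour
    return np.argsort(dist, axis=1)[:, :k]


rng = np.random.default_rng(4)
# Two well-separated toy populations of 30 cells each, 20 genes.
X = np.vstack([rng.normal(0, 1, (30, 20)), rng.normal(6, 1, (30, 20))])

neighbours = knn_from_pcs(X, n_pcs=8, k=5)
print(neighbours[:3])
```

Calling the helper with a different `n_pcs` and comparing the neighbour sets is one way to see how much the graph depends on that choice.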

**5. Compute and plot the UMAP**
* sc.tl.umap(hu)

Colour the UMAP by columns of .obs
* sc.pl.umap(hu, color=[''])

Colour the UMAP by gene expression (use one of the top 20 most highly expressed genes)



**6. Examine how the UMAP structure changes when you change the number of PCs used to generate the nearest-neighbours graph**

**7. Re-run the nearest-neighbours graph and the UMAP with the optimal number of PCs**


# **Visualising the data**



Plot Patient ID on the UMAP

Plot Disease on the UMAP

# **Leiden clustering**
Cluster the cells to identify groups of similar cells.



Visualise these marker genes to determine the cell types present in each cluster:

* Epithelial: EPCAM
* Secretory Epithelial: OVGP1
* Ciliated Epithelial: SNTN
* Endothelial:
* Immune:
* Fibroblast:




Create a new annotation of 'cell_type' and assign this to cells based on the value of their leiden cluster
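
One common way to build such an annotation is to map cluster labels through a dictionary. The sketch below does this on a small stand-in for `hu.obs`; the cluster-to-cell-type assignment is invented for illustration and has to be replaced with your own calls based on the marker genes.

```python
import pandas as pd

# Hypothetical stand-in for hu.obs after clustering.
# Cluster labels are stored as strings, not integers.
obs = pd.DataFrame({"leiden": pd.Categorical(["0", "1", "2", "1", "0", "3"])})

# Invented cluster-to-cell-type assignment: replace with your own marker-based calls.
cluster_to_cell_type = {
    "0": "Secretory Epithelial",
    "1": "Ciliated Epithelial",
    "2": "Fibroblast",
    "3": "Immune",
}

# Look up every cell's cluster in the dictionary and store the result as a category.
obs["cell_type"] = obs["leiden"].map(cluster_to_cell_type).astype("category")
print(obs)
```

On the real object the last assignment would target `hu.obs['cell_type']`, reading from `hu.obs['leiden']`.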

Given the previous commands, plot the cell types onto the UMAP

Plot the proportions of cell types per patient. You will have to use matplotlib to do this.
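
A possible approach, sketched on an invented table standing in for `hu.obs`, is to cross-tabulate patient against cell type with row-wise normalisation and draw the result as a stacked bar chart. The patient labels and cell-type values below are placeholders.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; drop this line inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for hu.obs (values invented).
obs = pd.DataFrame(
    {
        "Patient": ["P1"] * 4 + ["P2"] * 4,
        "cell_type": [
            "Epithelial", "Epithelial", "Immune", "Fibroblast",
            "Epithelial", "Immune", "Immune", "Immune",
        ],
    }
)

# Rows are patients, columns are cell types, and each row sums to 1.
proportions = pd.crosstab(obs["Patient"], obs["cell_type"], normalize="index")

# One stacked bar per patient.
ax = proportions.plot(kind="bar", stacked=True)
ax.set_ylabel("Proportion of cells")
ax.legend(title="cell_type", bbox_to_anchor=(1.02, 1), loc="upper left")
plt.tight_layout()
print(proportions)
```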

Visualise expression of a gene in all cell types
* In a violin plot

As HGOC is thought to originate from epithelial cells in the fallopian tube, create a subset of the data containing only the epithelial cells.
Plot the expression of HGOC genes in the epithelial cells per disease type, then per patient.

We can evaluate which genes are highly expressed in the HGOC epithelial cells versus the benign epithelial cells.
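
scanpy provides `sc.tl.rank_genes_groups` for this kind of comparison. To show the basic idea without it, the sketch below ranks genes by the difference in mean log-normalised expression between two groups of toy cells. This is a crude stand-in for a proper statistical test, and both the gene names and the expression values are invented.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(5)
genes = [f"gene{i}" for i in range(6)]  # placeholder gene names

# Toy log-normalised expression for 40 benign and 40 HGOC epithelial cells.
benign = rng.normal(1.0, 0.3, size=(40, 6))
hgoc = rng.normal(1.0, 0.3, size=(40, 6))
hgoc[:, 2] += 2.0  # gene2 is made higher in the HGOC group on purpose

# Difference in mean log expression (HGOC minus benign), largest first.
mean_diff = pd.Series(hgoc.mean(axis=0) - benign.mean(axis=0), index=genes)
ranked = mean_diff.sort_values(ascending=False)
print(ranked)
```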

Using the top-ranked genes from epithelial cells in HGOC and in benign FT, query the WebGestalt API to get gene ontology gene set enrichments.


We can use the GWAS Catalog to look at the expression of genes which are associated with high-grade serous ovarian cancer.