 ![CellphoneDB Logo](https://www.cellphonedb.org/images/cellphonedb_logo_33.png) CellphoneDB is a publicly available repository of curated receptors, ligands and their interactions.

# CellphoneDB method 3 (differential expression)  
In this example we use method 3 (`degs_analysis_method`) to study how cell-cell interactions change between perivascular cells (PVs) and trophoblast cells as the trophoblasts differentiate and invade the maternal uterus. This method retrieves interactions where at least one of the interacting partners (genes involved in the interaction) is differentially expressed.

> The **differentially expressed genes should be pre-computed by the user** using their preferred methodology (e.g. limma, DESeq2, Seurat, tradeSeq), and the significant results passed to CellphoneDB in a predefined tabular format.


This notebook assumes that you know either how to **download the CellphoneDB database or how to create your own database**. If this is not the case, please check `T0_BuildDBfromFiles.ipynb` or `T0_BuildDBfromRelease.ipynb`. In this notebook we explain how to run CellphoneDB with the **differential expression method**.

In [None]:
# Thresholds; their string forms are used below in the DEG file name and the output suffix
FDR_THRESH = 0.05
LOGFC_THRESH = 0.2
MIN_GENES_THRESH = 20
MIN_PER_THRESH = 10

FDRstr = str(FDR_THRESH).replace('.','')
LOGFCstr = str(LOGFC_THRESH).replace('.','')
MIN_GENESstr = str(MIN_GENES_THRESH)
MIN_PERstr = str(MIN_PER_THRESH)

### Check python version

In [None]:
import pandas as pd
import pathlib as pt
import scanpy as sc
import sys
import os

pd.set_option('display.max_columns', 100)

Check that the environment contains Python >= 3.8, as required by CellphoneDB.

In [3]:
print(sys.version)

3.8.16 | packaged by conda-forge | (default, Feb  1 2023, 16:01:55) 
[GCC 11.3.0]
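
Printing the version leaves the comparison to the reader. As a minimal sketch, the same requirement can be checked programmatically; the helper name `python_is_supported` is hypothetical and not part of CellphoneDB.

```python
import sys

# Minimum Python version stated in this notebook for CellphoneDB.
MIN_PYTHON = (3, 8)

def python_is_supported(version_info=sys.version_info, minimum=MIN_PYTHON):
    """Return True when (major, minor) of `version_info` is at least `minimum`."""
    return tuple(version_info[:2]) >= tuple(minimum)

# Stop early with a readable message instead of failing later during import.
if not python_is_supported():
    raise RuntimeError(
        f"Python >= {MIN_PYTHON[0]}.{MIN_PYTHON[1]} is required, found {sys.version.split()[0]}"
    )
```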


### Install CellphoneDB

Install the latest version of CellphoneDB in the current conda environment. \
Remove the `--quiet` flag if you want to see a detailed description of the installation process.

> pip install --quiet cellphonedb

___
### Input files
The differential expression method accepts 6 input files (4 mandatory).
- **cpdb_file_path**: (mandatory) path to the database.
- **meta_file_path**: (mandatory) path to the meta file linking cell barcodes to cluster labels.
- **counts_file_path**: (mandatory) path to the normalized counts file (not z-transformed), either in text format or h5ad (recommended).
- **degs_file_path**: (mandatory) path to the DEG file indicating the differentially expressed genes in each cluster. Only differentially expressed genes that are significant should be included.
- **microenvs_file_path**: (optional) path to microenvironment file that groups cell types/clusters by microenvironments. When providing a microenvironment file, CellphoneDB will restrict the interactions to those cells within the microenvironment.
- **active_tf_path**: (optional) path to the file listing the active transcription factors.

The content of both `degs_file_path` and `microenvs_file_path` depends on the biological question that the researcher wants to answer.


> In **this example** we are studying how cell-cell interactions change between perivascular cells (PVs) and trophoblast cells as the trophoblasts differentiate and invade the maternal uterus. Therefore, the `degs_file_path` contains only the genes differentially expressed along the trophoblast differentiation lineage, as we are interested in those cell-cell communication processes that change along this differentiation process. The `microenvs_file_path` contains only cells present in a specific anatomical region of the placenta (i.e. PVs and trophoblast cell states in the same spatiotemporal neighborhood). The `meta_file_path` and `counts_file_path` contain all the PVs and trophoblasts that we are interested in.

> CellphoneDB will retrieve all the interactions occurring between PVs and trophoblasts where: (i) all the proteins are expressed in the corresponding cell type and (ii) at least one gene is differentially expressed by the trophoblasts.

In [None]:
cpdb_file_path = '/home/jovyan/cpdb_tutorial/db/v5.0.0/cellphonedb.zip'
meta_file_path = '../cpdh_input_T21_vs_2n/cdata_all_MATCHING_metadata_p1Luz.tsv'
counts_file_path = '../../../workingObj/cdata_all_T21vs2n.h5ad'
microenvs_file_path = '../cpdh_input_T21_vs_2n/cdata_all_MATCHING_microenvironment_p1Luz.tsv'
degs_file_path = f'../cpdh_input_T21_vs_2n/dge_mergeMi_T21_vs_2n_{MIN_GENESstr}min_{MIN_PERstr}per_p1Luz.tsv'
out_path = '../cpdh_output_T21_vs_2n'
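
Before running the analysis it can help to confirm that every provided input path points to an existing file. The sketch below is one way to do this with the standard library; `missing_inputs` is a hypothetical helper and the example paths are placeholders, not files from this notebook.

```python
from pathlib import Path

def missing_inputs(paths):
    """Return the names in `paths` whose value is set but is not an existing file.

    Entries whose value is None are treated as optional inputs that were not provided.
    """
    return [
        name
        for name, path in paths.items()
        if path is not None and not Path(path).is_file()
    ]

# Placeholder paths for illustration only.
example_paths = {
    "cpdb_file_path": "no_such_dir/cellphonedb.zip",
    "microenvs_file_path": None,  # optional input left out
}
print(missing_inputs(example_paths))
```

In this notebook the dictionary would be built from `cpdb_file_path`, `meta_file_path`, `counts_file_path`, `degs_file_path` and `microenvs_file_path`.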

### Inspect input files

<span style="color:green">**1)**</span> The **metadata** file is composed of two columns:
- **barcode_sample**: this column indicates the barcode of each cell in the experiment.
- **cell_type**: this column denotes the cell label assigned.

In [5]:
metadata = pd.read_csv(meta_file_path, sep = '\t')
metadata.head(3)

Unnamed: 0,index,0
0,HCA_GLNDrna13255573_AAACCTGCAGGCTGAA,thy_TH_processing.2n
1,HCA_GLNDrna13255573_AAACCTGGTAGCTCCG,thy_TH_processing.2n
2,HCA_GLNDrna13255573_AAACCTGGTCTTTCAT,thy_TH_processing.2n


<span style="color:green">**2)**</span>  The **counts** file is an h5ad object from scanpy. The dimensions and order of this object must coincide with those of the metadata file, i.e. both files must contain the same number of cells.

In [8]:
import anndata

adata = anndata.read_h5ad(counts_file_path)
print(adata.shape)
print(set(adata.obs['celltype']))

(131497, 30997)
{'end_Capillary', 'mes_CYGB', 'thy_TH_processing', 'mes_ACKR3/KCNB2/LTBP1', 'thy_Lumen-forming', 'mes_CCL19', 'end_Lymphatic', 'mes_SCN7A', 'end_Venous', 'mes_Cycling', 'thy_Cycling', 'end_Cycling', 'end_Arterial'}


In [9]:
# update '-' to '_'
adata.obs['celltype'] = adata.obs['celltype'].replace({'thy_Lumen-forming' : 'thy_Lumen_forming', 'epi_HLA-B' : 'epi_HLA_B'})

print(len(set(adata.obs['celltype'])))
# filter adata to include cells from metadata only
adata = adata[adata.obs.index.isin(metadata['index'].unique())]
# make cell_labels column
adata.obs['cell_labels'] = adata.obs['celltype'].astype(str)+'.'+adata.obs['karyotype'].astype(str)
print(set(adata.obs['cell_labels']))

13


  adata.obs['cell_labels'] = adata.obs['celltype'].astype(str)+'.'+adata.obs['karyotype'].astype(str)


{'thy_TH_processing.2n', 'end_Lymphatic.2n', 'thy_Lumen_forming.2n', 'end_Arterial.T21', 'mes_CYGB.2n', 'end_Capillary.T21', 'end_Arterial.2n', 'mes_CYGB.T21', 'end_Lymphatic.T21', 'end_Capillary.2n', 'thy_Lumen_forming.T21', 'thy_TH_processing.T21', 'end_Venous.2n', 'end_Venous.T21'}


Check barcodes in metadata and counts are the same.

In [10]:
# list.sort() sorts in place and returns None, so compare sorted copies instead
sorted(adata.obs.index) == sorted(metadata['index'])

True
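
An equality check gives only a yes/no answer. As a sketch, a set comparison also reports which barcodes are missing on either side; `compare_barcodes` is a hypothetical helper and the barcodes below are toy values.

```python
def compare_barcodes(counts_barcodes, meta_barcodes):
    """Return the barcodes found in only one of the two collections."""
    counts_set = set(counts_barcodes)
    meta_set = set(meta_barcodes)
    return {
        "only_in_counts": sorted(counts_set - meta_set),
        "only_in_metadata": sorted(meta_set - counts_set),
    }

# Toy barcodes for illustration only.
report = compare_barcodes(
    ["cell_A", "cell_B", "cell_C"],
    ["cell_B", "cell_C", "cell_D"],
)
print(report)
```

With the objects in this notebook the call would be `compare_barcodes(adata.obs.index, metadata['index'])`; two empty lists mean the barcodes match.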

In [11]:
adata.X = adata.layers['counts']
# Normalizing to median total counts
sc.pp.normalize_total(adata)
# Logarithmize the data
sc.pp.log1p(adata)



<span style="color:green">**3)**</span> The **differentially expressed genes** file is a two-column file indicating which genes are up-regulated (or specific) in each cell type. The **first column** corresponds to the cluster name (these must match those in the metadata file) and the **second column** to the up-regulated gene. The remaining columns are ignored by CellphoneDB. All genes present in this file will be taken into account, so the user must include only those genes considered up-regulated or relevant for the analysis.

In [12]:
pd.read_csv(degs_file_path, sep = '\t') \
    .head(3)

Unnamed: 0.1,cell_type,gene,chr,isCosmic,cosmicTier,comparison,F,isCSM,FDR,logFC,isTSG,PValue,logCPM,ensID,isTF,Unnamed: 0,tumourType,direction
0,thy_TH_processing.T21,PDXK,chr21,False,,thy_TH_processing,143.840007,False,6.3e-05,0.904324,False,9.560567e-09,5.666743,ENSG00000160209,False,ENSG00000160209,,T21_up
1,thy_TH_processing.T21,AGPAT3,chr21,False,,thy_TH_processing,142.627732,False,6.3e-05,0.95521,False,1.009336e-08,6.036753,ENSG00000160216,False,ENSG00000160216,,T21_up
2,thy_TH_processing.T21,USP16,chr21,False,,thy_TH_processing,115.505041,False,0.000161,0.635044,False,3.848646e-08,6.881284,ENSG00000156256,False,ENSG00000156256,,T21_up
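
As a minimal sketch of how such a file could be derived from a full differential expression table, the example below keeps only rows that pass an FDR and a logFC cut-off and writes the cluster and gene columns first. The filtering rule (FDR below the threshold and logFC above it) is an assumption, not necessarily how the file loaded above was produced; the toy rows other than `PDXK` are invented.

```python
import pandas as pd

# Same values as the thresholds defined at the top of this notebook.
FDR_THRESH = 0.05
LOGFC_THRESH = 0.2

# Toy differential expression results; GENE_X and GENE_Y are made-up rows.
de_results = pd.DataFrame({
    "cell_type": ["thy_TH_processing.T21"] * 3,
    "gene": ["PDXK", "GENE_X", "GENE_Y"],
    "FDR": [6.3e-05, 0.2, 0.01],
    "logFC": [0.904324, 1.5, 0.05],
})

# Assumed rule: keep significant, up-regulated genes only.
is_significant = de_results["FDR"] < FDR_THRESH
is_upregulated = de_results["logFC"] > LOGFC_THRESH
degs = de_results.loc[is_significant & is_upregulated, ["cell_type", "gene"]]
degs = degs.reset_index(drop=True)
print(degs)

# Writing without the index keeps the cluster name as the first column:
# degs.to_csv("degs.tsv", sep="\t", index=False)
```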


<span style="color:green">**4)**</span> The **microenvironments** file defines the cell types that belong to a given microenvironment. CellphoneDB will only calculate interactions between cells that belong to the same microenvironment. In this file we are defining just one microenvironment.

In [13]:
microenv = pd.read_csv(microenvs_file_path,
                       sep = '\t')
microenv.head(3)

Unnamed: 0,0,karyotype
0,thy_TH_processing.2n,2n
1,thy_Lumen_forming.2n,2n
2,thy_TH_processing.T21,T21


Displaying cells grouped per microenvironment

In [14]:
# microenv.groupby('microenvironment')['cell_type'].apply(lambda x : list(x.value_counts().index))
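
The grouping can be illustrated on a small table. The sketch below assumes columns named `cell_type` and `microenvironment`; the file loaded above uses `0` and `karyotype` instead, so the column names would need adapting before running this on `microenv`.

```python
import pandas as pd

# Toy table mirroring the three rows shown above, with assumed column names.
microenv_example = pd.DataFrame({
    "cell_type": [
        "thy_TH_processing.2n",
        "thy_Lumen_forming.2n",
        "thy_TH_processing.T21",
    ],
    "microenvironment": ["2n", "2n", "T21"],
})

# One list of cell types per microenvironment, in file order.
cells_per_microenv = (
    microenv_example.groupby("microenvironment")["cell_type"].apply(list)
)
print(cells_per_microenv.to_dict())
```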

<span style="color:green">**5)**</span> The **active transcription factors** file defines the transcription factors that are active in a given cell type.

In [15]:
# pd.read_csv(active_tf_path, sep = '\t') \
#     .head(3)

____
### Run CellphoneDB with differential analysis (method 3)
The output of this method is saved in `output_path` and also returned as a dictionary (assigned here to `cpdb_results`).

In [None]:
from cellphonedb.src.core.methods import cpdb_degs_analysis_method

cpdb_results = cpdb_degs_analysis_method.call(
    cpdb_file_path = cpdb_file_path,                            # mandatory: CellphoneDB database zip file.
    meta_file_path = meta_file_path,                            # mandatory: tsv file defining barcodes to cell label.
    counts_file_path = counts_file_path,                        # mandatory: normalized count matrix - a path to the counts file, or an in-memory AnnData object
    degs_file_path = degs_file_path,                            # mandatory: tsv file with DEG to account.
    counts_data = 'hgnc_symbol',                                # defines the gene annotation in counts matrix.
    microenvs_file_path = microenvs_file_path,                  # optional (default: None): defines cells per microenvironment.
    score_interactions = True,                                  # optional: whether to score interactions or not. 
    threshold = 0.1,                                            # defines the min % of cells expressing a gene for this to be employed in the analysis.
    result_precision = 3,                                       # Sets the rounding for the mean values in significant_means.
    separator = '|',                                            # Sets the string used to separate cells in the results dataframes "cellA|CellB".
    debug = False,                                              # Saves all intermediate tables employed during the analysis in pkl format.
    output_path = out_path,                                     # Path to save results
    output_suffix = f'mergeMi_{MIN_GENESstr}min_{MIN_PERstr}per_p_val_adj_{FDRstr}_avg_log2FC_{LOGFCstr}Luz',                                       # Replaces the timestamp in the output file names with a user-defined string (default: None)
    threads = 25
    )

[ ][CORE][28/10/24-22:48:20][INFO] [Cluster DEGs Analysis] Threshold:0.1 Precision:3
Reading user files...
The following user files were loaded successfully:
../../../workingObj/cdata_all_T21vs2n.h5ad
../cpdh_input_T21_vs_2n/cdata_all_MATCHING_metadata_p1Luz.tsv
../cpdh_input_T21_vs_2n/cdata_all_MATCHING_microenvironment_p1Luz.tsv
../cpdh_input_T21_vs_2n/dge_mergeMi_T21_vs_2n_20min_10per_p1Luz.tsv
[ ][CORE][28/10/24-22:49:07][INFO] Running Real Analysis
[ ][CORE][28/10/24-22:49:07][INFO] Limiting cluster combinations using microenvironments
[ ][CORE][28/10/24-22:49:07][INFO] Running DEGs-based Analysis
[ ][CORE][28/10/24-22:49:07][INFO] Building results
[ ][CORE][28/10/24-22:49:08][INFO] Scoring interactions: Filtering genes per cell type..


100%|██████████| 14/14 [00:04<00:00,  3.43it/s]

[ ][CORE][28/10/24-22:49:12][INFO] Scoring interactions: Calculating mean expression of each gene per group/cell type..



100%|██████████| 14/14 [00:00<00:00, 24.71it/s]


[ ][CORE][28/10/24-22:49:14][INFO] Scoring interactions: Calculating scores for all interactions and cell types..


100%|██████████| 98/98 [00:19<00:00,  5.02it/s]


Saved deconvoluted to ../cpdh_output_T21_vs_2n/degs_analysis_deconvoluted_mergeMi_20min_10per_p_val_adj_005_avg_log2FC_02Luz.txt
Saved deconvoluted_percents to ../cpdh_output_T21_vs_2n/degs_analysis_deconvoluted_percents_mergeMi_20min_10per_p_val_adj_005_avg_log2FC_02Luz.txt
Saved means to ../cpdh_output_T21_vs_2n/degs_analysis_means_mergeMi_20min_10per_p_val_adj_005_avg_log2FC_02Luz.txt
Saved relevant_interactions to ../cpdh_output_T21_vs_2n/degs_analysis_relevant_interactions_mergeMi_20min_10per_p_val_adj_005_avg_log2FC_02Luz.txt
Saved significant_means to ../cpdh_output_T21_vs_2n/degs_analysis_significant_means_mergeMi_20min_10per_p_val_adj_005_avg_log2FC_02Luz.txt
Saved interaction_scores to ../cpdh_output_T21_vs_2n/degs_analysis_interaction_scores_mergeMi_20min_10per_p_val_adj_005_avg_log2FC_02Luz.txt


Results are saved as files and as a dictionary in the `cpdb_results` variable.

In [17]:
list(cpdb_results.keys())

['deconvoluted',
 'deconvoluted_percents',
 'means',
 'relevant_interactions',
 'significant_means',
 'CellSign_active_interactions',
 'CellSign_active_interactions_deconvoluted',
 'interaction_scores']

___
### Description of output files
Most output files share common columns:
- **id_cp_interaction**: Unique CellphoneDB identifier for each interaction stored in the database.
- **interacting_pair**: Name of the interacting pairs separated by “|”.
- **partner A or B**: Identifier for the first interacting partner (A) or the second (B). It is either a UniProt identifier (prefix `simple:`) or a complex (prefix `complex:`).
- **gene A or B**: Gene identifier for the first interacting partner (A) or the second (B). The identifier will depend on the input user list.
- **secreted**: True if one of the partners is secreted.
- **Receptor A or B**: True if the first interacting partner (A) or the second (B) is annotated as a receptor in our database.
- **annotation_strategy**: Curated if the interaction was annotated by the CellphoneDB developers. Otherwise, the name of the database where the interaction has been downloaded from.
- **is_integrin**: True if one of the partners is integrin.
- **directionality**: Indicates the directionality of the interaction and the characteristics of the interactors.
- **classification**: Pathway classification for the interacting partners.
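
Besides these shared columns, result tables hold one column per pair of cell types, named with the `separator` passed to the method ("cellA|CellB"). The sketch below reshapes such a table to long format and splits the pair name. The toy table, its interaction IDs and its values are invented for illustration; only the column naming convention comes from this notebook.

```python
import pandas as pd

SEPARATOR = "|"  # same value as the `separator` argument used above

# Toy table: two shared columns plus two cell-pair columns (invented values).
means_example = pd.DataFrame({
    "id_cp_interaction": ["CPI-X1", "CPI-X2"],
    "interacting_pair": ["LIG1_REC1", "LIG2_REC2"],
    "end_Arterial.2n|thy_TH_processing.2n": [0.512, 0.0],
    "thy_TH_processing.2n|end_Arterial.2n": [0.0, 1.204],
})

shared_cols = ["id_cp_interaction", "interacting_pair"]

# One row per (interaction, cell pair) instead of one column per cell pair.
means_long = means_example.melt(
    id_vars=shared_cols, var_name="cell_pair", value_name="mean"
)

# Split "cellA|cellB" into its two cell types.
pair_parts = means_long["cell_pair"].str.split(SEPARATOR, n=1, expand=True)
means_long["cell_a"] = pair_parts[0]
means_long["cell_b"] = pair_parts[1]

# Keep only the pairs with a non-zero mean.
expressed = means_long[means_long["mean"] > 0].reset_index(drop=True)
print(expressed[["interacting_pair", "cell_a", "cell_b", "mean"]])
```

The long format is convenient for filtering by a single cell type or for plotting, since each row then carries its own sender and receiver labels.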