## Perform Cell-Cell Interaction Analysis with CellphoneDB
CellphoneDB is a publicly available repository of curated human receptors, ligands, and their interactions, paired with a tool to interrogate your own single-cell transcriptomics data.

In this example we use method 2 (statistical_analysis_method) to study how cell-cell interactions change between a subset of immune cells and trophoblast cells as the trophoblasts differentiate and invade the maternal uterus. This method retrieves interactions where the mean expression of the interacting partners (the proteins participating in the interaction) displays significant cell state specificity, which it assesses with a random shuffling methodology.
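The random shuffling idea can be illustrated with a toy permutation test. The sketch below is not CellphoneDB's implementation: the function names, the toy expression vectors, and the one-sided comparison are all assumptions made for illustration. It shuffles the cluster labels, recomputes the ligand-receptor mean for one cluster pair, and reports the fraction of shuffles whose mean is at least as large as the observed one.

```python
import numpy as np

def pair_mean(ligand, receptor, labels, sender, receiver):
    """Mean of the ligand average in `sender` and the receptor average in `receiver`."""
    lig = ligand[labels == sender].mean()
    rec = receptor[labels == receiver].mean()
    # If either partner is not expressed, the interaction mean is set to 0.
    return 0.0 if lig == 0 or rec == 0 else (lig + rec) / 2

def permutation_pvalue(ligand, receptor, labels, sender, receiver,
                       iterations=1000, seed=42):
    rng = np.random.default_rng(seed)
    observed = pair_mean(ligand, receptor, labels, sender, receiver)
    hits = 0
    for _ in range(iterations):
        shuffled = rng.permutation(labels)  # break the cell-to-cluster link
        if pair_mean(ligand, receptor, shuffled, sender, receiver) >= observed:
            hits += 1
    return observed, hits / iterations

# Toy data (hypothetical): ligand is specific to cluster A, receptor to cluster B.
toy_labels = np.array(["A"] * 50 + ["B"] * 50)
toy_ligand = np.where(toy_labels == "A", 2.0, 0.1)
toy_receptor = np.where(toy_labels == "B", 3.0, 0.1)

observed, pval = permutation_pvalue(toy_ligand, toy_receptor, toy_labels, "A", "B")
```

Because the toy ligand and receptor are each confined to one cluster, almost no shuffle reaches the observed mean, so the empirical p-value is small.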

In [None]:
import pandas as pd
import anndata


In [None]:
cpdb_file_path = 'v5.0.0/cellphonedb.zip'
meta_file_path = 'data/metadata.tsv'
counts_file_path = 'data/normalised_log_counts.h5ad'
microenvs_file_path = 'data/microenvironment.tsv'
active_tf_path = 'data/active_TFs.tsv'
out_path = 'results/method2_withScore'

**cpdb_file_path**: (mandatory) path to the database file cellphonedb.zip.

**meta_file_path**: (mandatory) path to the meta file linking cell barcodes to cluster labels (metadata.tsv).

**counts_file_path**: (mandatory) path to the normalized counts file (not z-transformed), either in text format or h5ad (recommended), here normalised_log_counts.h5ad.

**microenvs_file_path**: (optional) path to the microenvironment file that groups cell clusters by microenvironment. When a microenvironment file is provided, CellphoneDB restricts the interactions to those between cells within the same microenvironment.

**active_tf_path**: (optional) path to the file listing the active transcription factors.
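Before running the analysis it can help to check that the input tables agree with each other. The following is a minimal sketch on in-memory toy tables, not part of CellphoneDB. The metadata column names `barcode_sample` and `cell_type` are assumptions, and the toy contents are hypothetical; the microenvironment columns follow the ones used later in this notebook.

```python
import io
import pandas as pd

# Toy stand-ins for metadata.tsv and microenvironment.tsv (hypothetical contents).
meta_tsv = "barcode_sample\tcell_type\ncell1\tEVT_1\ncell2\tEVT_1\ncell3\tPV MYH11\n"
microenv_tsv = "cell_type\tmicroenvironment\nEVT_1\tEnv1\nPV MYH11\tEnv1\n"

toy_meta = pd.read_csv(io.StringIO(meta_tsv), sep="\t")
toy_microenv = pd.read_csv(io.StringIO(microenv_tsv), sep="\t")
toy_barcodes = ["cell1", "cell2", "cell3"]  # for an h5ad file: adata.obs_names

# Every barcode in the counts should have exactly one label in the metadata.
missing_barcodes = set(toy_barcodes) - set(toy_meta["barcode_sample"])
duplicated_barcodes = bool(toy_meta["barcode_sample"].duplicated().any())

# Every cell type in the microenvironment file should exist in the metadata.
unknown_cell_types = set(toy_microenv["cell_type"]) - set(toy_meta["cell_type"])
```

Non-empty `missing_barcodes` or `unknown_cell_types` would point to a mismatch between the files that is worth resolving before the run.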

### Inspect Input Files

In [None]:
metadata = pd.read_csv(meta_file_path,sep='\t')
metadata.head()

In [None]:
adata = anndata.read_h5ad(counts_file_path)
adata.shape

In [None]:
microenv = pd.read_csv(microenvs_file_path, sep = '\t')
microenv.head(3)

In [None]:
microenv.groupby('microenvironment', group_keys = False)['cell_type'].apply(lambda x : list(x.value_counts().index))
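To see what restricting interactions to a microenvironment means in practice, the sketch below enumerates the cell-type pairs that share a microenvironment. It is an illustration on a toy table, not CellphoneDB's code: the microenvironment names are hypothetical, and it assumes that ordered pairs and self-pairs are both considered.

```python
from itertools import product
import pandas as pd

# Toy microenvironment table (hypothetical assignment of cell types).
toy_microenv = pd.DataFrame({
    "cell_type": ["EVT_1", "PV MYH11", "GC", "VCT_CCC"],
    "microenvironment": ["Env1", "Env1", "Env2", "Env2"],
})

allowed_pairs = set()
for _, group in toy_microenv.groupby("microenvironment"):
    cell_types = group["cell_type"].unique()
    # All ordered pairs within one microenvironment, including self-pairs.
    allowed_pairs.update("|".join(pair) for pair in product(cell_types, repeat=2))
```

Here `EVT_1|PV MYH11` is an allowed pair, while `EVT_1|GC` is not, because the two cell types sit in different microenvironments.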

### Run Statistical Analysis

The output of this method is saved in output_path and is also returned, here in the variable cpdb_results.

The statistical method allows the user to downsample the data to speed up the analysis (subsampling arguments). To this end, CellphoneDB employs a geometric sketching procedure (Hie et al. 2019) that preserves the structure of the data without losing information from lowly represented cells. For this tutorial, we have opted to manually downsample the count matrix and the metadata file accordingly.
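A manual downsampling step could look like the sketch below. This is plain random sampling with a per-cell-type cap, not geometric sketching, and it is not necessarily how the tutorial data were downsampled. The column names, the cap, and the toy table are assumptions.

```python
import pandas as pd

# Toy metadata (hypothetical): 8 cells of one type and 4 of another.
toy_meta = pd.DataFrame({
    "barcode_sample": [f"cell{i}" for i in range(12)],
    "cell_type": ["EVT_1"] * 8 + ["GC"] * 4,
})

max_cells_per_type = 3  # hypothetical cap

sampled_meta = (
    toy_meta.sample(frac=1, random_state=0)  # shuffle the rows reproducibly
    .groupby("cell_type")
    .head(max_cells_per_type)                # keep at most N cells per cell type
)

# With an AnnData object, the counts could then be subset to the same cells,
# e.g. adata[sampled_meta["barcode_sample"]], assuming barcodes are the obs index.
```

Capping each cell type separately keeps rare populations represented, which uniform sampling over all cells would not guarantee.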

In [None]:
from cellphonedb.src.core.methods import cpdb_statistical_analysis_method

cpdb_results = cpdb_statistical_analysis_method.call(
    cpdb_file_path = cpdb_file_path,                 # mandatory: CellphoneDB database zip file.
    meta_file_path = meta_file_path,                 # mandatory: tsv file mapping barcodes to cell labels.
    counts_file_path = counts_file_path,             # mandatory: normalized count matrix - a path to the counts file, or an in-memory AnnData object.
    counts_data = 'hgnc_symbol',                     # defines the gene annotation used in the counts matrix.
    active_tfs_file_path = active_tf_path,           # optional: defines cell types and their active TFs.
    microenvs_file_path = microenvs_file_path,       # optional (default: None): defines cells per microenvironment.
    score_interactions = True,                       # optional: whether to score interactions or not.
    iterations = 1000,                               # number of shufflings performed in the analysis.
    threshold = 0.1,                                 # min fraction of cells expressing a gene for it to be used in the analysis.
    threads = 5,                                     # number of threads to use in the analysis.
    debug_seed = 42,                                 # debug random seed: values >= 0 enable it, a negative value disables it.
    result_precision = 3,                            # sets the rounding for the mean values in significant_means.
    pvalue = 0.05,                                   # p-value threshold to employ for significance.
    subsampling = False,                             # enables subsampling of the data (geometric sketching).
    subsampling_log = False,                         # (mandatory when subsampling) enables log1p for non log-transformed data inputs.
    subsampling_num_pc = 100,                        # number of components to subsample via geometric sketching (default: 100).
    subsampling_num_cells = 1000,                    # number of cells to subsample (integer) (default: 1/3 of the dataset).
    separator = '|',                                 # string used to separate cells in the results dataframes, e.g. "cellA|CellB".
    debug = False,                                   # saves all intermediate tables employed during the analysis in pkl format.
    output_path = out_path,                          # path to save results.
    output_suffix = None                             # replaces the timestamp in the output file names with a user-defined string (default: None).
    )


### Description of output files
Most output files share common columns:

**id_cp_interaction**: Unique CellphoneDB identifier for each interaction stored in the database.

**interacting_pair**: Name of the interacting pairs separated by “|”.

**partner A or B**: Identifier for the first interacting partner (A) or the second (B). It can be a UniProt identifier (prefix simple:) or a complex (prefix complex:).

**gene A or B**: Gene identifier for the first interacting partner (A) or the second (B). The identifier will depend on the input user list.

**secreted**: True if one of the partners is secreted.

**Receptor A or B**: True if the first interacting partner (A) or the second (B) is annotated as a receptor in our database.

**annotation_strategy**: Curated if the interaction was annotated by the CellphoneDB developers. Otherwise, the name of the database where the interaction has been downloaded from.

**is_integrin**: True if one of the partners is an integrin.

**directionality**: Indicates the directionality of the interaction and the characteristics of the interactors.

**classification**: Pathway classification for the interacting partners.

In [None]:
# pvalues: each cell_a|cell_b column holds the p-value resulting from the statistical analysis.
cpdb_results['pvalues'].head(2)

In [None]:
# means: Mean values for all the interacting partners.
# The mean value refers to the total mean of the individual partner average expression
# values in the corresponding interacting pairs of cell types.
# If one of the mean values is 0, then the total mean is set to 0.
cpdb_results['means'].head(2)
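The rule described in the comments above can be written out on a toy table. This is a sketch under assumptions, not CellphoneDB's code: the partner names and expression values are hypothetical, while the cell-type labels are borrowed from this dataset.

```python
import pandas as pd

# Hypothetical per-cluster average expression of two interacting partners.
cluster_means = pd.DataFrame(
    {"partner_A": [1.0, 0.0], "partner_B": [0.5, 2.0]},
    index=["PV MYH11", "EVT_1"],
)

def interaction_mean(cell_a, cell_b):
    """Mean of partner A in cell_a and partner B in cell_b; 0 if either is 0."""
    a = cluster_means.loc[cell_a, "partner_A"]
    b = cluster_means.loc[cell_b, "partner_B"]
    if a == 0 or b == 0:
        return 0.0
    return float((a + b) / 2)

mean_pv_to_evt = interaction_mean("PV MYH11", "EVT_1")  # both partners expressed
mean_evt_to_pv = interaction_mean("EVT_1", "PV MYH11")  # partner A absent in EVT_1
```

The two directions differ because the value depends on which cell type contributes which partner.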

In [None]:
# significant_means: Significant mean calculation for all the interacting partners.
# If the interaction has been found significant, the value will be the mean.
# Otherwise, the value is absent.
cpdb_results['significant_means'].head(2)
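The relationship between the means, p-values, and significant means tables can be sketched with toy data. The tables and interaction names below are hypothetical and contain only cell-pair columns; the real tables also carry the descriptive columns listed above, which would need to be set aside first.

```python
import pandas as pd

toy_means = pd.DataFrame(
    {"A|B": [1.5, 0.8], "B|A": [0.3, 0.0]}, index=["pair1", "pair2"]
)
toy_pvalues = pd.DataFrame(
    {"A|B": [0.01, 0.40], "B|A": [0.20, 1.00]}, index=["pair1", "pair2"]
)

# Keep the mean where the p-value passes the threshold, leave it missing otherwise.
toy_significant = toy_means.where(toy_pvalues < 0.05)
```

Only `pair1` in `A|B` passes the 0.05 threshold here, so it is the only entry that keeps its mean.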

In [None]:
#scores: scores ranging from 0 to 100. The higher the score is, the more specific the interaction is expected to be.
cpdb_results['interaction_scores'].head(2)

**Deconvoluted** fields

**gene_name**: Gene identifier for one of the subunits that are participating in the interaction defined in “means.csv” file. The identifier will depend on the input of the user list.

**uniprot**: UniProt identifier for one of the subunits that are participating in the interaction defined in “means.csv” file.

**is_complex**: True if the subunit is part of a complex. Single if it is not, complex if it is.

**protein_name**: Protein name for one of the subunits that are participating in the interaction defined in “means.csv” file.

**complex_name**: Complex name if the subunit is part of a complex. Empty if not.

**mean**: Mean expression of the corresponding gene in each cluster.

In [None]:
cpdb_results['deconvoluted'].head(2)

## Basic Plotting

In [None]:
import os
import pandas as pd
import ktplotspy as kpy
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
kpy.plot_cpdb_heatmap(pvals = cpdb_results['pvalues'],
                      degs_analysis = False,
                      figsize = (5, 5),
                      title = "Sum of significant interactions")
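The numbers behind such a heatmap can be approximated by counting, for each pair of cell types, how many interactions pass the p-value threshold. The sketch below uses a toy p-value table and is not the ktplotspy implementation, which may process the counts further (for example by combining both directions of a pair).

```python
import pandas as pd

# Toy p-values (hypothetical): rows are interactions, columns are "cellA|cellB".
toy_pvalues = pd.DataFrame(
    {
        "A|A": [0.01, 0.50, 1.00],
        "A|B": [0.01, 0.02, 0.30],
        "B|A": [0.60, 0.04, 1.00],
        "B|B": [1.00, 1.00, 1.00],
    },
    index=["pair1", "pair2", "pair3"],
)

n_significant = (toy_pvalues < 0.05).sum()  # one count per "cellA|cellB" column

# Reshape the per-column counts into a cell type by cell type matrix.
counts = {}
for column, n in n_significant.items():
    cell_a, cell_b = column.split("|")
    counts[(cell_a, cell_b)] = int(n)
count_matrix = pd.Series(counts).unstack(fill_value=0)
```

In `count_matrix`, rows correspond to the first cell type of each pair and columns to the second.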

Here we are plotting the interactions between the PVs and the trophoblasts that are mediated by TGFB2 and CSF1R.

In [None]:
kpy.plot_cpdb(
    adata = adata,
    cell_type1 = "PV MYH11|PV STEAP4|PV MMPP11",
    cell_type2 = "EVT_1|EVT_2|GC|iEVT|eEVT|VCT_CCC",
    means = cpdb_results['means'],
    pvals = cpdb_results['pvalues'],
    celltype_key = "cell_labels",
    genes = ["TGFB2", "CSF1R"],
    figsize = (10, 3),
    title = "Interactions between\nPV and trophoblast",
    max_size = 3,
    highlight_size = 0.75,
    degs_analysis = False,
    standard_scale = True,
    interaction_scores = cpdb_results['interaction_scores'],
    scale_alpha_by_interaction_scores = True
)

Interactions can also be plotted grouped by pathway.



In [None]:
from plotnine import facet_wrap

p = kpy.plot_cpdb(
    adata = adata,
    cell_type1 = "PV MYH11",
    cell_type2 = "EVT_1|EVT_2|GC|iEVT|eEVT|VCT_CCC",
    means = cpdb_results['means'],
    pvals = cpdb_results['pvalues'],
    celltype_key = "cell_labels",
    genes = ["TGFB2", "CSF1R", "COL1A1"],
    figsize = (12, 8),
    title = "Interactions between PV and trophoblasts\ngrouped by classification",
    max_size = 6,
    highlight_size = 0.75,
    degs_analysis = False,
    standard_scale = True,
)
p + facet_wrap("~ classification", ncol = 1)