Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ranking genes between specific clusters in clonotype network #204

Closed
josinejansen opened this issue Oct 15, 2020 · 19 comments
Closed

Ranking genes between specific clusters in clonotype network #204

josinejansen opened this issue Oct 15, 2020 · 19 comments
Labels
question Further information is requested

Comments

@josinejansen
Copy link

josinejansen commented Oct 15, 2020

Dear ICBI lab,

I have some questions regarding the Scirpy package, I would really appreciate your help.
For my analyses I first merged TCR and transcriptomics data from two different samples (organ 1 and organ 2), subsequently I then merged these two files.

1. a) If using the Scirpy clonotype_network tool the 'sequence' is set to 'nt', are the clusters in the clonotype network (where each node represents a cell) based on identical nucleotide sequences, or is it based on similarity, meaning that one cluster could consists of cells with slightly different nucleotide sequences?

As I understand, each node represents a cell, and edges connect cells belonging to the same clonotype. The function makes visualization of the clonotype-network possible, analogous to the construction of a neighborhood graph from transcriptomics data with the Scanpy package, so that based on the above, it computes a neighborhood graph of CDR3 nucleotide sequences with "scirpy.pp.tcr_neighbors())". However answer to my question I couldn't find, I was hoping you can help me out?

b) Do the lines that connect the cells within a cluster have any meaning?

Screen Shot 2020-10-14 at 2 24 29 PM

c) In the plot above, is it correct that the closer the different clusters are together, the more similar their nucleotide sequences are? Meaning that the sequences of the clonotypes consisting of only 2 cells in this case are most different from the clonotype clusters consisting of >5 cells (as they are further apart)?

2. a) The package allows for specifying what organs the clusters consist of:

Screen Shot 2020-10-14 at 2 32 06 PM

My question is: how can I select specific clusters in the clonotype network graph (I only would like to include the clonotypes in the network that have identical clonotypes between the two different samples from my data, i.e. organ 1 and organ 2), meaning that in the plot above I would like to filter for the clusters only that display both blue and orange nodes.

b) How can I assign numbers to my clusters based on identical nucleotide sequences shared between two samples?

c) How can I add a legend to my plot, and how can I change the name 'batch' into 'sample'?

3. Using Scirpy, how can one best specify differentially expressed genes (based on the transcriptomics data) between the different clusters based on shared nucleotide sequences between blood and fat, for example cluster 1 vs. rest of the clusters, and cluster 1 and 2 vs. cluster 3 and 4? I have tried implementing Scanpy's tool to rank genes using WIlcoxon, but unfortunately I can't make It work.

sc.tl.rank_genes_groups(adata, 'clonotype', groups=['1','2'], reference=['3','4'], method='wilcoxon')
sc.pl.rank_genes_groups(adata, groups=['1', '2'], n_genes=20)

-The key of the observations grouping to be considered would be:
"clonotype clusters based on share identical nucleotide sequences between organ 1 and organ 2".
-Subset of groups to which comparison wouldl be restricted: "clonotype 1 and clonotype 2"
-Comparison: Compare with respect to a specific group
-Group identifier with respect to which compare: "clonotype 3 and clonotype 4"
“The number of genes that appear in the returned tables”: 100
“Method”: Wilcoxon-Rank-Sum

Thanks in advance,

Josine

@grst
Copy link
Collaborator

grst commented Oct 15, 2020

Hi Josine,

If using the Scirpy clonotype_network tool the 'sequence' is set to 'nt', are the clusters in the clonotype network (where each node represents a cell) based on identical nucleotide sequences

If you use the default options (i.e. don't specify the metric parameter) it will be based on sequence identity.

b) Do the lines that connect the cells within a cluster have any meaning?

an edge means "these two cells have the same clonotype" (when using identity metric), or "these two cells are in the same clonotype cluster" (when using a similarity metric). When using an identity metric, clusters are always fully connected. This is not necessarily the case when using a similarity metric.

In the plot above, is it correct that the closer the different clusters are together, the more similar their nucleotide sequences are? Meaning that the sequences of the clonotypes consisting of only 2 cells in this case are most different from the clonotype clusters consisting of >5 cells (as they are further apart)?

No, the distance has no meaning. This is just an artifact from the layouting algorithm.

When using a similarity metric, edges are colored by the distance, but the point can be made that it's hard to see (one could experiment with the edges_cmap parameter).

how can I select specific clusters in the clonotype network graph

You can just subset the AnnData object, e.g.

adata_subset = adata[adata.obs["clonotype"].isin(["23", "42"]), :]
ir.tl.clonotype_network(adata_subset)
ir.pl.clonotype_network(adata_subset)

How can I assign numbers to my clusters based on identical nucleotide sequences shared between two samples?

You just need to find out which clonotypes have more than two samples. There are many ways to do this, here's an (untested) approach using pandas. You can then subset adata as described above.

samples_per_clonotype = adata.obs.groupby("clonotype").apply(lambda group_df: len(set(group_df["batch"])))
clonotypes_with_two_or_more_samples = samples_per_clonotype[samples_per_clonotype > 2].index

How can I add a legend to my plot, and how can I change the name 'batch' into 'sample'?

Set legend_loc="right margin". The easiest way to change the title is to rename the column. Alternatively, you can retrieve the matplotlib.Axes object and call set_title. This is also the way to go if you further want to customize the plot.

  1. Using Scirpy, how can one best specify differentially expressed genes (based on the transcriptomics data) between the different clusters based on shared nucleotide sequences between blood and fat, for example cluster 1 vs. rest of the clusters, and cluster 1 and 2 vs. cluster 3 and 4? I have tried implementing Scanpy's tool to rank genes using WIlcoxon, but unfortunately I can't make It work.

This is exactely how it should work. There's also an example in the tutorial. Do you get any error message?

Cheers,
Gregor

@grst grst added the question Further information is requested label Oct 15, 2020
@josinejansen
Copy link
Author

josinejansen commented Oct 15, 2020 via email

@grst
Copy link
Collaborator

grst commented Oct 16, 2020

Hi Josine,

glad I could help :)

When I plot my clonotype network it doesn't show the number of the clonotype, so how can I determine which ones I have to select, what name (in tutorial it is "163_TCR" and "277_TCR")

You'll have to plot the clonotype network colored by clonotype and with legend_loc="on data".

ir.pl.clonotype_network(adata, color="clonotype", legend_loc="on data")

Actually, it would be nice to show overlay the clonotype number also if colored by other categories, I'll consider implementing that.

How can I plot differentially expressed genes between the two batches (samples) within one clonotype?

Once you found out the clonotype number, you can again subset adata and perform differential expression using scanpy:

adata_subset = adata[adata.obs["clonotype"].isin(["23", "42"]), :]
sc.tl.rank_genes_groups(adata_subset, 'batch', method="wilcoxon")

Let me know if there are further issues!

Best,
Gregor

@josinejansen
Copy link
Author

josinejansen commented Oct 19, 2020

Dear Gregor,

Ranking the DE genes between specific clonotypes now works perfectly, as does ranking DE genes within one clonotype that is shared between two organs, so first of all: thank you!

My other scripts seem to throw some errors, could you perhaps point me into the right direction once more?

When ranking DE genes between different subsets of clonotypes, I create multiple subsets, as suggested by your previous comment, for example:

final_data_subset = final_data[final_data.obs["clonotype"].isin(["100"]), :]
final_data_subset2 = final_data[final_data.obs["clonotype"].isin(["908", "9775", "2488", "501"]), :]

Unfortunately the following lines then throw an error:

sc.tl.rank_genes_groups(final_data, "clonotype", groups=["final_data_subset"], reference="final_data_subset2", method="wilcoxon"
)
sc.pl.rank_genes_groups_violin(final_data, groups="final_data_subset", n_genes=15)

ValueError: reference = final_data_subset2 needs to be one of groupby = ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', etc.]

Subsequently, I tried the following, but there was another error:

sc.tl.rank_genes_groups(final_data, "clonotype", groups=[“100”], reference= (“908", "9775", "2488", "501"), method="wilcoxon")
sc.pl.rank_genes_groups_violin(final_data, groups=“100”, n_genes=15)

ValueError: reference = ('848', '996') needs to be one of groupby = ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13’, etc.]

Not sure what I am doing wrong, how can I best solve these errors?**

Appreciatively,
Josine

@grst
Copy link
Collaborator

grst commented Oct 20, 2020

Hi Josine,

the first one doesn't work, because you are defining two variables (final_data_subset, final_data_subset2) which are essentially AnnData objects on their own. They are not stored in final_data. Therefore, rank_gene_groups(final_data, ...) can't know about them.

I initially thought your second approach should work, but apparently, scanpy only accepts a single group as reference, not a list.

To make this work, I would create an additional column in adata.obs containing your annotation. One way to do that is using pandas indexes (see below). But you could also use a loop or list-comprehension to build the column.

# initialize the whole column with 'other'
adata.obs["my_groups"] = "other" 
# set the corresponding rows in the "my_groups" column to 'group1'. 
# .loc works like this: .loc[ ROW_LABELS, COL_LABELS ]
adata.obs.loc[adata.obs["clonotype"].isin(["908", "9775", "2488", "501"]), "my_groups"] = "group1"
adata.obs.loc[adata.obs["clonotype"].isin(["100"]), "my_groups"] = "group2"

Then you can do

sc.tl.rank_genes_groups(adata, "my_groups", groups="group1", reference="group2", method='wilcoxon')

Best,
Gregor

P.S.: you can format code in github using backticks:

```python
[your code here]
```

@josinejansen
Copy link
Author

josinejansen commented Oct 21, 2020

Topic: DE genes between organ 1 cells belonging to the clonotypes that have identical CDR3 sequences in both organ 1 and 2 & NON-clonotypes in organ 1

So subset 1 now only consists of clonotypes that have a shared cdr3_nt sequence in organ 1 as well as in organ 2, thanks for helping me with that - really needed, because I am such a freshman in both Scanpy & Scirpy (actually every package/Python 🙈).

I would like to make two other subsets to only check for DE genes between the organ 1 cells that are part of subset 1 (which should create subset 2), and the organ 1 cells that belong to the NON-clonotypes (= singletons) (constructing subset 3).

I think subset 2 would need to be:

adata_subset2 = adata_subset1[adata_subset1.obs["batch"].isin(["blood"]), :]

But how to fix subset 3 and define DE between subset 2 and 3?

I tried the following methods (none of which works completely):

Method 1)

  • To make subset 2: the one which selects for the blood cells in clonotypes from subset1 that was made earlier
    adata_subset_organ1_from_clonotypes = adata_subset[adata_subset.obs["batch"].isin(["organ1"]), :]

  • Make this into a column:
    adata.obs["subset2"] = "adata_subset_organ1_from_clonotypes"

  • Then I need subset 3, namely the non-clonotypes that occur in organ 1:
    adata_NonClonotypes= adata[adata.obs["TRB_1_cdr3_nt"].isin(adata.obs['TRB_1_cdr3_nt'].value_counts()[adata.obs['TRB_1_cdr3_nt'].value_counts()==1].index)]

  • And in order to really get subset 3, filtering for ones in organ 1
    adata_subset_organ1_from_NONclonotypes = adata_NonClonotypes[adata_NonClonotypes.obs["batch"].isin(["organ1"]), :]

  • Now I want to make this into a column:
    adata.obs["subset3"] = "adata_subset_organ1_from_NONclonotypes"

  • Then using your last bit of advice:
    adata.obs["my_groups"] = "other"
    adata.obs.loc[adata.obs["subset2"].isin(["organ1"]), "my_groups"] = "group2"
    adata.obs.loc[adata.obs["subset3"].isin(["organ1"]), "my_groups"] = "group3"
    sc.tl.rank_genes_groups(adata, "my_groups", groups=(["group2"]), reference="group3", method='wilcoxon')

Not sure if this is the right direction, and getting many errors here.

Method 2)
Trying the same thing as I did for subset 1 (manually selecting the clonotypes that are shared in organ 1 and 2 based on the clonotype_network batch function and overlay with numbers). However, manually selecting the appropriate clonotypes here is impossible, as when I display the clonotypes min_size = 1 (because those are the ones I want for subset 3) displays an unreadable graph (also when changing palette and/or panel_size):

Overlaying with number so to select the ones for subset 3:

Method 3)
In order to reduce noise, creating a subset that only includes singletons - again, manually based on plotting the clonotype_network, after which I would - just like for subset 2 - select for ["batch"].isin(["blood"]), :] (this would construct subset 3). However, apparently, when computing the clonotype_network, a max_size is not an option, and throws an error:

[ir.tl.clonotype_network(adata, sequence='nt', max_size=1)]

Do you have any coding suggestion to create subset 3 and define DE between subset 2 and 3?
Any advice > welcome.

Many, thanks (again),
Josine

@josinejansen
Copy link
Author

josinejansen commented Oct 22, 2020

One more question:

When I run the following code, the intended graphs are displayed for some clonotypes, but not for others.

adata_subsetX = adata[adata.obs["clonotype"].isin(["Number"]), :]
sc.tl.rank_genes_groups(adata_subsetX, 'batch', method="wilcoxon")
sc.pl.rank_genes_groups(adata_subsetX, n_genes=25, sharey=False)

Error: 'ZeroDivisionError'

How to solve this?

@grst
Copy link
Collaborator

grst commented Oct 22, 2020

Hi Josine,

you are on the right track with making a column (as unfortunately can't do the DE comparison between subsets with scanpy). However, there are a few (pandas) problems with your steps, I'll try to explain what's wrong:

adata.obs["subset2"] = "adata_subset_organ1_from_clonotypes"

Like that you just assign a string ("adata_subset_organ1_from_clonotypes") to all rows of adata.obs. To a column, you can either assign a constant value (like here) or an array (or list) with the same length as the adata.obs has rows.
Therefore, leaving away the quotes would lead to an error.

adata_NonClonotypes= adata[adata.obs["TRB_1_cdr3_nt"].isin(adata.obs['TRB_1_cdr3_nt'].value_counts()[adata.obs['TRB_1_cdr3_nt'].value_counts()==1].index)]

You can have that easier by using the clonotype_size column ;)

adata.obs.loc[adata.obs["subset2"].isin(["organ1"]), "my_groups"] = "group2"

Here you assign group2 to those rows, where the column subset2 == "organ1". This works in principle, but
I think you don't have the right values in the subset2 columns.

Alternatively, you can use the cell indexes from the subset you created to assign the values:

adata.obs.loc[adata_subset_organ1_from_clonotypes.obs_names, "my_groups"] = "group2"`

Maybe this tutorial could help you to familiarize yourself with the pandas indexes:
https://www.shanelynn.ie/select-pandas-dataframe-rows-and-columns-using-iloc-loc-and-ix/


I think a short version of what you're trying to achieve could look like this:

# Step 1: create "boolean masks", i.e. arrays with the same length as adata.obs
# that indicate if a certain row meets a certain condition
is_organ1 = (adata.obs["batch"] == "organ1")
is_singleton = (adata.obs["clonotype_size"] == 1) 
# (or just use directly what you used to create subset1)
is_subset1 = adata.obs_names.isin(adata_subset1.obs_names) 


# Step 2: create a column using the masks. & means boolean
# and, i.e. decide for each row if both conditions are true
adata.obs["new_col"] = "other"
adata.obs.loc[is_subset1 & is_organ1, "new_col"] = "subset1+organ1"
adata.obs.loc[is_organ1 & is_singleton, "new_col"] = "singleton+organ1"

Then you can perform the wilcox test as you already know.


When I run the following code, the intended graphs are displayed for some clonotypes, but not for others.

adata_subsetX = adata[adata.obs["clonotype"].isin(["Number"]), :]
sc.tl.rank_genes_groups(adata_subsetX, 'batch', method="wilcoxon")
sc.pl.rank_genes_groups(adata_subsetX, n_genes=25, sharey=False)

I suspect this happens if there are too few/no cells in a subset and no meaningful ranks can be computed for the genes. I would check the gene expression of these groups manually (e.g. using a scanpy matrixplot) to get an idea.

@josinejansen
Copy link
Author

Thanks so much for your help/suggestions, they have been very useful and I can apply them now for my other analyses!

@josinejansen
Copy link
Author

josinejansen commented Nov 10, 2020

Hi Gregor,

I have a question in line with the thread above: how can I save only the annotated data from one batch (organ_1) within a clonotype (number 1116) to .csv? The reason I am asking is: I'd like to perform some pathway analyses, but the software I am using requires .csv format.

I think the annotated data belonging to clonotype 1116 can be saved like this:
Clonotype1116 = adata[adata.obs["clonotype"].isin(["1116"]), :]
Clonotype1116.obs.to_csv('Filename', sep=',')

But the following does not work, which I can understand (AttributeError: 'DataFrame' object has no attribute 'obs'):
File_Organ1 = Clonotype1116.obs.loc[Clonotype1116.obs["batch"].isin(["tumor"])]
FIle_Organ1.obs.to_csv('Filename', sep=',')

However, I can't think of another way to save the annotated data belonging to batch = Organ 1, specifically from clonotype 1116?

Your help would again be much appreciated.

Thanks in advance,
Josine

@grst
Copy link
Collaborator

grst commented Nov 10, 2020 via email

@josinejansen
Copy link
Author

josinejansen commented Nov 10, 2020

Thanks, I see. Saving the df directly works in general, but not when trying to save only the data belonging to cells in a specific batch within a clonotype. The following error then occurs: "AttributeError: 'DataFrame' object has no attribute 'write_csvs'"

Saving data belonging to for example one clonotype works:
df1 = adata[adata.obs["clonotype"].isin(["1116"]), :]
df1.write_csvs('file_name1', sep=',')

Saving annotated data from all cells from the batch 'organ1' within clonotype '1116' does not work, however that's what I'd like to end up with:
df2 = adata.obs.loc[df1.obs["batch"].isin(["blood"]), :] # This works fine
df2.write_csvs('file_name2', sep=',') # This is the line throwing the error.

@grst
Copy link
Collaborator

grst commented Nov 10, 2020

It's a detail ;)

write_csv is defined on a DataFrame, i.e. on adata.obs.
write_csvs is defined on AnnData, i.e. on adata directly.

The former will only save the obs data frame, the latter will write both obs and var data frames (and potentially others) stored in an AnnData object.

@josinejansen
Copy link
Author

josinejansen commented Nov 10, 2020

OH, 🙈THANK YOU!!!
Works now.

@josinejansen
Copy link
Author

josinejansen commented Nov 15, 2020

Hi again, so I have tried many things now, but I am not proceeding with the following:
ranking genes between selection A of clonotypes --> clonotypes only occurring in the brain ("is_brainOnly") and selection B --> only batch=brain from selection B ("is_shared"). Selection B consists of clonotypes that occur both in brain AND in blood.

What am I doing wrong:

As you suggested, here is step 1 to create boolean masks
**is_brain** = (adata.obs["batch"] == "brain")

is_shared = adata[adata.obs["clonotype"].isin(["1527","1611","5502","1494","7112","296","3399","9309","8951","863","1991","3461","6214","5086","5864","3605","6286","3892","996","2488","100","6123","7317","5417","7279","2596","6527","501","1688","287","2984","3033","1116","2574","76","5250","1506","1209","1445","848","42"]) , :]

is_brainOnly = final_data.obs.loc[final_data.obs["clonotype"].isin(["10231", "10491", "9803", "10698", "10084", "9447", "9448", "10362", "10492", "9641", "11010", "10584", "11483", "10061", "10015", "10036", "10437", "10459", "9775", "9784", "9812", "9919", "9465", "9599", "9647", "9642", "9916", "9830", "9450", "10526", "10985", "10717", "11279", "9453"]), :]

Step 2: create a column using the masks and decide for each row if both conditions are true
adata.obs["new_col"] = "other"

Setting the groups here, first brain only:
adata.obs.loc[adata.obs["clonotype"].isin(["10231", "10491", "9803", "10698", "10084", "9447", "9448", "10362", "10492", "9641", "11010", "10584", "11483", "10061", "10015", "10036", "10437", "10459", "9775", "9784", "9812", "9919", "9465", "9599", "9647", "9642", "9916", "9830", "9450", "10526", "10985", "10717", "11279", "9453"]), "new_col"] = "group1"

Second group is from shared clonotypes the tumor cells:
adata.obs.loc[is_shared & is_brain), "new_col"] = "group2"

sc.tl.rank_genes_groups(adata, "new_col", groups=(["group1"]), reference = "group2", method='wilcoxon')
sc.pl.rank_genes_groups(adata, groups=['group1'], n_genes=15)
sc.pl.rank_genes_groups_violin(adata, groups=['group1'],n_genes=15)

The script keeps hanging on the line with '& - sign', or there are errors with invalid syntax. Really not sure what I am doing wrong, I have read the documentation that you suggested. Thank you, Josine

@grst
Copy link
Collaborator

grst commented Nov 16, 2020

Hi Josine,

can you please always include the complete error message? Otherwise it's hard to tell what the actual problem is.
In the line

adata.obs.loc[is_shared & is_brain), "new_col"] = "group2"

I spot an unmatched bracket is_brain). Maybe that's already the problem?

Best,
Gregor

@josinejansen
Copy link
Author

image

Hi Gregor,

So yesterday and today no error, just hangs forever (see above). I have restarted the kernel multiple times, but it doesn't seem to work. As you can see I use somewhat different names to describe the groups than I used above, so please don't get confused :) -> It should be pretty straight-forward what indicates the different groups. Thanks for your time, best, Josine

@grst
Copy link
Collaborator

grst commented Nov 18, 2020

  • is_tumor correctly is a boolean mask, i.e. a pandas series of booleans
  • is_shared, however, is an object of class anndata (-> you'll need to get rid of the outer final_data[...])

tbh, I don't know why this is not throwing an error though

@grst
Copy link
Collaborator

grst commented Dec 6, 2020

Hi Josine,

I'm closing this issue for now. Feel free to reopen, or create a new one, if you have further questions.

Best,
Gregor

@grst grst closed this as completed Dec 6, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
question Further information is requested
Projects
None yet
Development

No branches or pull requests

2 participants