-
Notifications
You must be signed in to change notification settings - Fork 17
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Some distances silently not computed (for identical entries) #59
Comments
Andreas, thanks for this feedback. I agree we can add more warnings around this point. However, note the correspondence is between the TCRrep.clone_df not the input passed to the cell_df assert tr.pw_alpha.shape[0] == tr.clone_df.shape[0] This occurs, because there are often fewer clones in the clone_df than the cell_df if any of data fields are missing or there are duplicates. If you want to retain the all the rows add a column to the input 'rank' corresponding to the indices. Note the order of the TCRrep.clone_df will not match the input unless Before proceeding, it is also helpful to understand that each TCRrep instance contains two Pandas DataFrames: (i) the cell_df, which is provided by the user at initialization, and (ii) the clone_df, which is generated by the program immediately thereafter. The cell_df contains the data specified by the user, which is then augmented with columns containing IMGT aligned CDR1, CDR2, and CDR2.5 inferred from the V-gene name. The clone_df is a derivative Pandas DataFrame generated by deduplicating identical rows in the cell_df. That is, the rows of the cell_df with identical values are grouped together and the count column is updated to reflect the aggregation of multiple rows. Also, it is helpful to know that the order of the rows in the clone_df will not match the order in cell_df. (Although not recommended for new users of tcrdist3, users who pre-check their data to ensure no missing values and no unrecognized V-gene names, may use the deduplicate = False option which will allow the cell_df row order to be directly transferred to the clone_df without any row removal.)
It is possible to see those lines of cell_df not integrated into clone_df by calling tr.show_incomplete() after initialization.** (Note: Advanced users who wish to add new genes not currently in the tcrdist3 library can do so by modifying the content of the ‘alphabeta_gammadelta_db.tsv ‘ file in the package source code (python3.8/site-packages/tcrdist/db/alphabeta_gammadelta_db.tsv)) |
Excellent and exhaustive answer. Thanks so much! One add-on question: is there a way to relate the entries of Many thanks! |
Thanks for the questions. The best way to relate the clone_df to cell_df one-to-one is to include a 'rank' column providing a unique number to each clone in the input. This way the deduplicate step will not agglomerate any rows as each will be unique. (Some rows may be dropped if no valid v-gene is provided). Then one can you an inner join on the 'rank' column. You can also re sort the clone_df by the rank column before running tr.compute_distances() Everything is based on the row order in the tr.clone_df at the time you compute distances. You can compute after you sort. df = pd.read_csv("/Users/andreas/Downloads/dash.csv")
df['rank'] = df.index.to_list()
tr = TCRrep(cell_df = df,
organism = 'mouse',
chains = ['alpha','beta'],
db_file = 'alphabeta_gammadelta_db.tsv',
compute_distances = False)
tr.clone_df.sort_values('rank')
tr.compute_distances() Also, you can always use deduplicate = False is you've pre-checked your input so that there are no NAs or invalid v-genes, but given that tcrdist3 still is missing CDRs for some pseudogenes this will fail unless you are careful. df = pd.read_csv("/Users/andreas/Downloads/dash.csv")
df['rank'] = df.index.to_list()
tr = TCRrep(cell_df = df,
organism = 'mouse',
chains = ['alpha','beta'],
db_file = 'alphabeta_gammadelta_db.tsv',
compute_distances = False,
deduplicate = False)
tr.compute_distances() |
Awesome. Thank you! |
Hi there,
I noticed that under certain cirumstances some distances are silently not computed. I assume this happens for identical entries (what exactly is identical here?
sum(df.duplicated())
returns 0 in the example below). While it makes sense to save compute time, there should be at least a warning. It's only logical to assume that the output matrix has the same number of rows as the input data frame.The following is taken straight from the docs:
And a
fails (1920 != 1924).
Best,
Andreas
The text was updated successfully, but these errors were encountered: