Fix functions #5
Codecov Report
Patch coverage:
Additional details and impacted files
@@ Coverage Diff @@
## main #5 +/- ##
==========================================
- Coverage 79.59% 71.92% -7.67%
==========================================
Files 2 2
Lines 98 114 +16
==========================================
+ Hits 78 82 +4
- Misses 20 32 +12
… correct card labels from this function
@khynder , I addressed 10. -14. (everything regarding the A new To Do:
…efault return_df to true
…ch function that accepts dfs from users
Hey, I think this PR might be good to go now.
I wondered if it was possible to replace the content of
The pure numpy solution is simpler and faster, as it avoids looping in Python and accessing single cells in the dataframe.
That seems like a smart solution, great! I replaced it in commit 9c629a9
Glad I could be of some help! @khynder Are you satisfied with the proposed changes?
I'll have a look at it all on Sunday :)
Here's the first part of my review, it's a good job!
I'll get into the details of the _check_data function and answer the other points tomorrow.
src/cardsort/analysis.py
Outdated
    return None
else:
    user_ids = df["user_id"].unique()
    for id in user_ids:
I think using "id" as a variable name shadows the built-in id() function and could lead to conflicts; replacing it with id_ would be safer.
src/cardsort/analysis.py
Outdated
if len(cluster_cards) > 0:
    user_ids = df["user_id"].unique()

    for id in user_ids:
Same as before: I think using "id" as a variable name shadows the built-in id() function and could lead to conflicts; replacing it with id_ would be safer.
both done: 192bda9
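The shadowing concern from the two comments above can be illustrated with a small sketch. The function names here are hypothetical and not part of the project; they only show what happens when `id` is rebound inside a function.

```python
def with_shadowing(user_ids):
    """Loop variable named `id` rebinds the name inside this function."""
    for id in user_ids:
        pass
    try:
        # `id` now refers to the last element, not the built-in function
        return id(user_ids)
    except TypeError as err:
        return str(err)


def without_shadowing(user_ids):
    """A trailing underscore leaves the built-in id() untouched."""
    for id_ in user_ids:
        pass
    return id(user_ids)


shadowed_result = with_shadowing([1, 2, 3])
clean_result = without_shadowing([1, 2, 3])
```

In the first function the call fails with a TypeError because an integer is not callable; in the second the built-in still works as expected.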
src/cardsort/analysis.py
Outdated
)
if print_results:
    logger.info(
        "User " + str(id) + " labeled card(s): " + cluster_label
I think you could replace the + between strings with an f-string: add an "f" in front of the string and insert the variables in {}. Here it would be:
f"User {id} labeled card(s): {cluster_label}"
src/cardsort/analysis.py
Outdated
break
return cluster_df
if print_results:
    logger.info("User " + str(id) + " did not cluster cards together.")
With the f-string format it would be:
f"User {id} did not cluster cards together."
both done: 2ce154b
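For reference, a minimal sketch of how the f-string messages could look together with the standard `logging` module. The logger name and the helper functions are assumptions made for illustration; the project's actual logger setup is not shown in this thread.

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("cardsort_example")  # hypothetical logger name


def labeled_message(user_id, cluster_label):
    # Both variables go inside braces; no str() conversion is needed
    return f"User {user_id} labeled card(s): {cluster_label}"


def not_clustered_message(user_id):
    return f"User {user_id} did not cluster cards together."


logger.info(labeled_message(1, "Fruit"))
logger.info(not_clustered_message(2))
```

Note that the loop variable is called `user_id` here rather than `id`, in line with the naming comments above.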
src/cardsort/analysis.py
Outdated
)
if return_df_results:
    cards = _get_cards_for_label(cluster_label, df_u)
    cluster_df = pd.concat(
Instead of a concatenation for each user_id, you can use .append on a list and build the dataframe only once at the end (pd.concat does not accept a columns argument, so pd.DataFrame is used for the final step):
# before the loop
cluster_list = []
# inside the loop
cluster_list.append([id, cluster_label, cards])
# after the loop
cluster_df = pd.DataFrame(cluster_list, columns=["user_id", "cluster_label", "cards"])
Done: 0cf44a0
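A self-contained sketch of the list-accumulation pattern discussed above, in case it is useful for comparison. The function name and the sample records are hypothetical. The final frame is built with `pd.DataFrame`, since `pd.concat` expects pandas objects and has no `columns` parameter.

```python
import pandas as pd


def collect_cluster_rows(records):
    """Accumulate plain rows in a list and build the DataFrame once."""
    cluster_list = []
    for user_id, cluster_label, cards in records:
        cluster_list.append([user_id, cluster_label, cards])
    return pd.DataFrame(
        cluster_list, columns=["user_id", "cluster_label", "cards"]
    )


# Made-up example data, only for illustration
example_df = collect_cluster_rows(
    [
        (1, "Fruit", ["Apple", "Pear"]),
        (2, "Food", ["Apple", "Pear", "Bread"]),
    ]
)
```

Building the frame once avoids copying the growing dataframe on every loop iteration.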
Great, thank you @khynder ! I will try to implement the changes before the end of the week :)
More precise comments on the _check_data function, because the data format requirements changed with the distance matrix implementation changes :).
Here's my review:
Great work :)
Thank you for the review @khynder ! I tried to address everything.
Then it's only No. 9 that's missing :)
Actually adding a test on the dendrogram labels is a bit more difficult than I thought. I was thinking about the methods
Last little correction: the name of the
Hello,
this is a draft PR for changes in functions based on the reviews of @khynder and @Batalex .
It is quite a lot and I will probably need some help with some of the comments.
I'll list the changes/ToDos here in a numbered list ordered by function:
General
I changed all print statements to logs (6cd3fe2, 934f6f7). I have never used logs before but I assume this is how it is intended to be?
Apart from checking if the first user_id is 1 in "get_distance_matrix" (see d43f500 and below in section for this function for details), I have not created any data format tests yet. An easy fix would be to only allow kardSort data for now. Otherwise, I will probably need to create a new function for these checks that is called in any user-facing function that accepts dataframes?
I think this has been solved in a prior PR based on @Robaina's feedback: all while statements have been replaced by for. A check if the first user ID is 1 has been added for get_distance_matrix in d43f500.
Distance matrix
I have not looked into this yet
How do you think scipy's pdist could be used here? I would have used it from the start, but thought it would not be possible, because this is not a standard distance function like euclidean distance etc.?
I added a check if user_id is 1 (d43f500), but I'm not sure if it is the most elegant solution like this. I added an else statement that logs an error message and returns None in case the ID is different.
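On the pdist question, one possible approach is sketched below. It assumes that the distance between two cards is the share of users who put them in different categories, and that every user sorted every card. Under those assumptions the built-in "hamming" metric applies to a cards-by-users table of category codes. The column names `card_label` and `category_label` and the sample data are hypothetical, and this may not match how get_distance_matrix defines its distance.

```python
import pandas as pd
from scipy.spatial.distance import pdist, squareform

# Hypothetical long-format data: one row per (user, card) pair
df = pd.DataFrame(
    {
        "user_id": [1, 1, 1, 2, 2, 2],
        "card_label": ["Apple", "Pear", "Bread", "Apple", "Pear", "Bread"],
        "category_label": ["Fruit", "Fruit", "Bakery", "Food", "Food", "Food"],
    }
)

# Encode category names as integers so pdist gets a numeric array
df["category_code"] = df["category_label"].astype("category").cat.codes

# Rows are cards, columns are users, cells hold the category each user chose
card_by_user = df.pivot(
    index="card_label", columns="user_id", values="category_code"
)

# Hamming distance = fraction of users who separated the two cards
condensed = pdist(card_by_user.to_numpy(), metric="hamming")
distance_matrix = squareform(condensed)
card_order = list(card_by_user.index)
```

Multiplying the result by the number of users would turn the fractions into counts. pdist also accepts a Python callable as `metric` if a custom definition is needed, at the cost of speed.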
Dendrogram
I changed the code to include the sort_values statement (7c20d8c).
I'm sorry, I am not sure what exactly should be changed here. The changes in the create_dendrogram function should automatically apply to the notebook?
Do you mean comparing the "labels" list with the card_labels in the df?
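If the idea is to compare the dendrogram's leaf labels with the card labels, one option could be scipy's `dendrogram` with `no_plot=True`, which returns the leaf labels under the key `"ivl"` without drawing anything. The sketch below calls scipy directly with a made-up distance matrix; how create_dendrogram exposes its labels is not shown in this thread, so this is only an assumption about how such a check might be written.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import squareform

# Made-up labels and symmetric distance matrix, only for illustration
card_labels = ["Apple", "Bread", "Pear"]
distance_matrix = np.array(
    [
        [0.0, 1.0, 0.2],
        [1.0, 0.0, 0.9],
        [0.2, 0.9, 0.0],
    ]
)

# linkage expects the condensed (upper-triangle) form of the matrix
linkage_matrix = linkage(squareform(distance_matrix), method="average")

# no_plot=True computes the layout only; "ivl" holds the leaf labels in order
result = dendrogram(linkage_matrix, labels=card_labels, no_plot=True)
leaf_labels = result["ivl"]
```

A test could then assert that the sorted leaf labels equal the sorted card labels, since the leaf order depends on the clustering.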
Cluster labels
I think this might have been solved in a prior PR, since there is no if/else condition after the for-loop. Or do you mean the if/else condition in the for loop?
Thank you for suggesting code for a common function, I agree that it would be cleaner if there was only one function for both. I have not looked in detail at your function suggestion yet, but wonder if it is really necessary to have 2 boolean values? Would one not be enough, e.g. return_df = true or false?
Open to do: I have not looked into this yet in detail.
Open to do: I have not looked into this yet in detail.
Open to do: I have not looked into this yet in detail.