Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix functions #5

Merged
merged 31 commits into from
Aug 1, 2023
Merged

Fix functions #5

merged 31 commits into from
Aug 1, 2023

Conversation

katoss
Copy link
Owner

@katoss katoss commented Jul 5, 2023

Hello,

this is a draft PR for changes in functions based on the reviews of @khynder and @Batalex .
It is quite a lot and I will probably need some help with some of the comments.

I'll list the changes/ToDos here in a numbered list ordered by function:

General

  • 1.

avoid using print in code, if you have a lot of users in the database it could slow down the execution. You can add a Boolean parameter print_progression or use loggers (@Batalex could explain loggers better than me I think) (comment by @khynder)

I changed all print statements to logs (6cd3fe2, 934f6f7). I have never used logs before but I assume this is how it is intended to be?

  • 2.

Be careful with the data format when it does not come directly from kardsort.com, more details below. It would be safer if a function was added to check that the data format is correct, especially if the package is used with data which does not come from kardsort.com. More specifically:

  • The first user in df should have 1 as user_id (otherwise in get_distance_matrix, distance_matrix_all will be used before creation, and labels could not be computed)
  • All users should categorize all cards and each card only once
  • card_id should go from 1 to n (n being the number of cards)
  • one card_id should always be associated with exactly one card_label (not two or more) (comment by @khynder)

Apart from checking if the first user_id is 1 in "get_distance_matrix" (see d43f500 and below in section for this function for details), I have not created any data format tests yet. An easy fix would be to only allow kardSort data for now. Otherwise, I will probably need to create a new function for these checks that is called in any user-facing function that accepts dataframes?

  • 3. several functions

get_cluster_labels, get_distance_matrix and get_cluster_labels_df use the form id = 1; while id <= max(df.user.id). We can make this more robust by using a for loop on the values in the dataframe. The original version would fail for string ids or even if a value is missing, e.g [1, 2, 4, 5]. Since the value 1 is used to initialize the distance matrix, you may add a boolean condition (to be changed in the first iteration), or you may create a matrix with np.zeros before the for loop. I like using for user_id, sub_df in df.groupby("user_id"):, you are guaranteed to loop over all user_id values in the dataframe. (by @Batalex)

I think this has been solved in a prior PR based on @Robaina's feedback, all while statements have been replaced by for. A check if the first user ID is 1 has been added for get_distance_matrix in d43f500.

Distance matrix

  • 4. get_distance_matrix_for_user

It can be written as a comprehension list if you want (no obligation, if you feel more at ease with the current code keep it 😊) (by @khynder)

I have not looked into this yet

  • 5. get_distance_matrix_for_user

_get_distance_matrix_for_user could be rewritten using scipy's pdist, but it is worth checking if the rewrite to compute the upper triangular matrix is more efficient. (by @Batalex)

How do you think scipy's pdist could be used here? I would have used it from the start, but thought it would not be possible, because this is not a standard distance function like euclidean distance etc.?

  • 6. get_distance_matrix

the first user should have user_id = 1, otherwise distance_matrix_all won’t exist. It can stay that way, with a check at the beginning of the function to stop the computation if the first user_id is not 1. Or you can create before the for loop a variable distance_matrix_all = None, and replace the id == 1 condition by a distance_matrix_all is not None condition (by @khynder)

I added a check if user_id is 1 (d43f500), but I'm not sure if it is the most elegant solution like this. I added an else statement that logs an error message and returns None in case the ID is different.

Dendrogram

  • 7. create_dendrogram

when you define labels labels = df.loc[df["user_id"] == 1]["card_label"].squeeze().to_list()It is very important that the order of the card_label in labels is exactly the same as the one used to create the distance_matrix. The distance_matrix is created with the card_id in the order 1, 2, 3, 4, …, n. So labels should be: first the card_label corresponding to card_id = 1, then card_label corresponding to card_id = 2, then card_label corresponding to card_id = 3, etc. The above code line works fine if the card_id for the user_id 1 are in the order 1, 2, 3 etc, but does not work if the card_id are shuffled. One way to correct this behavior is to add a sort_values(“card_id”) to the labels definition. labels = df.loc[df["user_id"] == 1].sort_values(“card_id”)["card_label"].squeeze().to_list() (by @khynder)

I changed the code to include the sort_values statement (7c20d8c).

  • 8. example.ipynb

Note that it should also be changed in the example.ipynb. (by @khynder)

I'm sorry, I am not sure what exactly should be changed here. The changes in the create_dendrogram function should automatically apply to the notebook?

  • 9. test

Maybe a test could be added on the order of labels in the dendrogram (on the test-data.csv) (comment by @khynder)

Do you mean comparing the "labels" list with the card_labels in the df?

Cluster labels

  • 10. get_cluster_labels

I think the if/else condition after the for loop can be removed. (by @khynder)

I think this might have been solved in a prior PR, since there is no if/else condition after the for-loop. Or do you mean the if/else condition in the for loop?

  • 11. get_cluster_labels

There is a lot of common code lines between get_cluster_labels and get_cluster_labels_df. These two functions can be factorized into one function, with 2 boolean parameters to indicate what type of return you want, for example print_results and return_df_results. (Note: See review for example function) (by @khynder)

Thank you for suggesting code for a common function, I agree that it would be cleaner if there was only one function for both. I have not looked in detail at your function suggestion yet, but wonder if it is really necessary to have 2 boolean values? Would one not be enough, e.g. return_df = true or false?

  • 12. get_cluster_labels

the warning "is/are not a valid card label" can be moved in this function, instead of being checked at a user_id level (Note: See review for example function) (by @khynder)

Open to do: I have not looked into this yet in detail.

  • 13. get_cluster_labels

I believe most of the complexity of get_cluster_labels could be cut down by checking the values of cluster_cards against the possible values before entering the loop. It is quite risky to mutate a datastructure (especially removing values) while using it in a loop, and usually considered a bad practice. (by @Batalex)

Open to do: I have not looked into this yet in detail.

  • 14. get_cluster_label_for_user

It can be simplified with the “isin()” pandas method:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.isin.html (Note: See review for example function) (by @khynder)

Open to do: I have not looked into this yet in detail.

@codecov
Copy link

codecov bot commented Jul 5, 2023

Codecov Report

Patch coverage: 67.34% and project coverage change: -7.67% ⚠️

Comparison is base (57697d5) 79.59% compared to head (9569604) 71.92%.
Report is 4 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main       #5      +/-   ##
==========================================
- Coverage   79.59%   71.92%   -7.67%     
==========================================
  Files           2        2              
  Lines          98      114      +16     
==========================================
+ Hits           78       82       +4     
- Misses         20       32      +12     
Files Changed Coverage Δ
src/cardsort/analysis.py 70.64% <65.59%> (-8.31%) ⬇️
src/cardsort/__init__.py 100.00% <100.00%> (ø)

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@katoss katoss requested review from Batalex and Robaina July 5, 2023 09:37
@katoss katoss added the enhancement New feature or request label Jul 5, 2023
@katoss katoss self-assigned this Jul 5, 2023
src/cardsort/analysis.py Outdated Show resolved Hide resolved
@katoss
Copy link
Owner Author

katoss commented Jul 12, 2023

@khynder , I addressed 10. -14. (everything regarding the get_cluster_labels functions). Your code suggestions helped me a lot, thank you :)

A new To Do:

  • 15. Adapt documentation to changes in get_cluster_label function

@katoss katoss requested a review from khynder July 13, 2023 07:38
@katoss
Copy link
Owner Author

katoss commented Jul 19, 2023

Hey,

I think this PR might be good to go now.
Comments by issue (see original post on top of this page for more info on each number):

  • 1. Use logging instead of print (6cd3fe2, 934f6f7, 3f1b18e, 3867cfc)
  • 2. Created a _check_data function that performs theses tests and added it to all functions that take a df from users (926a0bf, a3e30be).
  • 3. solved before, see above
  • 4. I would suggest leaving the for loop, I am not sure if the comprehension list would make the code necessarily easier to read.
  • 5. As said in the original post, I don't think there is a standard pdist function that can replace the _get_distance_matrix_for_user function. Which one did you think about?
  • 6. Added a check for user_id=1 first, but later on replaced it with the _check_data function (see 2.)
  • 7. Done, see original post
  • 8. I think the notebook includes all changes (apart from the point mentioned in 15. below)
  • 9. I don't think it is possible to extract the dendrogram labels, as the function does not output anything, and just plots the visualization.
  • 10. to 14. Merged get_cluster_labels and get_cluster_labels_df functions into one function, simplified code, move check for wrong cards into main function, adapted tests (76ed2da, ab1d72c, caa2cb7, a9d9ef9, 0b2a65f)
  • 15. Updated documentation in readme (4bffebe) and notebook (584137b). Regarding the notebook, the get_cluster_labels function does not yet return the right output, because it imports the latest published version of cardsort that does not include the function changes yet. I will update it once this PR is merged.

@katoss katoss marked this pull request as ready for review July 19, 2023 13:47
@Batalex
Copy link
Collaborator

Batalex commented Jul 19, 2023

  • 5. As said in the original post, I don't think there is a standard pdist function that can replace the _get_distance_matrix_for_user function. Which one did you think about?

pdist can be used with a custom function of two arguments (see the docs, specifically 24.).

I wondered if it was possible to replace the content of _get_distance_matrix_for_user by pdist with hamming distance. As it turns out, I thought about something even simpler:

>>> df
   card_id card_label  category_id category_label  user_id
0        1        Dog            1           pets        1
1        2      Tiger            1           pets        1
2        3        Cat            1           pets        1
3        4      Apple            2          lunch        1
4        5   Sandwich            2          lunch        1
5        6     Banana            3      long food        1
6        7    Hot Dog            3      long food        1
7        8  Croissant            4    Moon-shaped        1
8        9   Mooncake            4    Moon-shaped        1
9       10       Moon            4    Moon-shaped        1
>>> _get_distance_matrix_for_user(df)  # from main branch
array([[0., 0., 0., 1., 1., 1., 1., 1., 1., 1.],
       [0., 0., 0., 1., 1., 1., 1., 1., 1., 1.],
       [0., 0., 0., 1., 1., 1., 1., 1., 1., 1.],
       [1., 1., 1., 0., 0., 1., 1., 1., 1., 1.],
       [1., 1., 1., 0., 0., 1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1., 0., 0., 1., 1., 1.],
       [1., 1., 1., 1., 1., 0., 0., 1., 1., 1.],
       [1., 1., 1., 1., 1., 1., 1., 0., 0., 0.],
       [1., 1., 1., 1., 1., 1., 1., 0., 0., 0.],
       [1., 1., 1., 1., 1., 1., 1., 0., 0., 0.]])
>>> arr = df["category_label"].values
>>> (arr != arr[:, None]).astype(float)
array([[0., 0., 0., 1., 1., 1., 1., 1., 1., 1.],
       [0., 0., 0., 1., 1., 1., 1., 1., 1., 1.],
       [0., 0., 0., 1., 1., 1., 1., 1., 1., 1.],
       [1., 1., 1., 0., 0., 1., 1., 1., 1., 1.],
       [1., 1., 1., 0., 0., 1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1., 0., 0., 1., 1., 1.],
       [1., 1., 1., 1., 1., 0., 0., 1., 1., 1.],
       [1., 1., 1., 1., 1., 1., 1., 0., 0., 0.],
       [1., 1., 1., 1., 1., 1., 1., 0., 0., 0.],
       [1., 1., 1., 1., 1., 1., 1., 0., 0., 0.]])

The pure numpy solution is simpler and faster, as it avoids looping in Python and accessing single cells in the dataframe.

@gedankenstuecke
Copy link
Collaborator

That seems like a super clever solution, thanks @Batalex! @katoss and I had looked at the Hamming distance but then struggled in adapting everything else given that it would reverse the metrics effectively!

@katoss
Copy link
Owner Author

katoss commented Jul 20, 2023

That seems like a smart solution, great! I replaced it in commit 9c629a9

@Batalex
Copy link
Collaborator

Batalex commented Jul 20, 2023

Glad I could be of some help! @khynder Are you satisfied with the proposed changes?

@khynder
Copy link
Collaborator

khynder commented Jul 20, 2023

I'll have a look at it all on Sunday :)

Copy link
Collaborator

@khynder khynder left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's the first part of my review, it's a good job!

I'll get into the details of the _check_data function and answer the other points tomorrow.

return None
else:
user_ids = df["user_id"].unique()
for id in user_ids:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thing using "id" as a variable name could lead to conflicts with the built-in id() function, maybe just replace by id_ would be safer

if len(cluster_cards) > 0:
user_ids = df["user_id"].unique()

for id in user_ids:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as before: I thing using "id" as a variable name could lead to conflicts with the built-in id() function, maybe just replace by id_ would be safer

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

both done: 192bda9

)
if print_results:
logger.info(
"User " + str(id) + " labeled card(s): " + cluster_label
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you could replace the + between strings by the f format, by adding a "f" in front of the big string and inserting the variable in {}. Here it would be:

f"User {id} labeled card(s): cluster_label"

break
return cluster_df
if print_results:
logger.info("User " + str(id) + " did not cluster cards together.")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

with the f-string format it would be

f"User {id} did not cluster cards together."

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

both done: 2ce154b

)
if return_df_results:
cards = _get_cards_for_label(cluster_label, df_u)
cluster_df = pd.concat(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of a concatenation for each user_id, you can use .append on a list and use pd.concat only once at the end:

# before the loop
cluster_list = [] 
# inside the loop
cluster_list.append([id, cluster_label, cards])
# after the loop
cluster_df = pd.concat(cluster_list, columns=["user_id", "cluster_label", "cards"])

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done: 0cf44a0

@katoss
Copy link
Owner Author

katoss commented Jul 24, 2023

Great, thank you @khynder ! I will try to implement the changes before the end of the week :)

Copy link
Collaborator

@khynder khynder left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More precise comments on the _check_data function, because the data format requirements changed with the distance matrix implementation changes :).

src/cardsort/analysis.py Outdated Show resolved Hide resolved
src/cardsort/analysis.py Outdated Show resolved Hide resolved
src/cardsort/analysis.py Outdated Show resolved Hide resolved
src/cardsort/analysis.py Show resolved Hide resolved
@khynder
Copy link
Collaborator

khynder commented Jul 24, 2023

Here's my review:

  • 1. OK
  • 2. comments in the review (the data requirements changed with the new distance matrix implementation)
  • 3. OK
  • 4. OK
  • 5. OK
  • 6. OK
  • 7. OK
  • 8. In the example.ipynb file, a .sort_values("card_id") should be added in the last session ("Create a custom dendrogram")
  • 9. I'll have a look at this tomorrow!
  • 10. OK
  • 11. The 2 options are OK for me (1: keep the 2 arguments return_df_results and print_results, 2: keep only return_df_results). If you used print_results when you used the package, maybe some other users would be happy to have the print option :).
  • 12. OK
  • 13. OK
  • 14. OK
  • 15. OK

Great work :)

@katoss
Copy link
Owner Author

katoss commented Jul 27, 2023

Thank you for the review @khynder ! I tried to address everything.

  • 2. Adapted the function (see comments in review above)
  • 8. Ah yes, makes sense! d2a0812
  • 9. TBD
  • 11. Ok, kept it as it is for now. I'll see if I get any feedback regarding the use of this later on from people using the package.

Then it's only No. 9 that's missing :)

@khynder
Copy link
Collaborator

khynder commented Jul 31, 2023

Actually adding a test on the dendrogram labels is a bit more difficult than I thought. I was thinking about the methods get_yticklabels() and .get_text(), but I don't know how to get the plt.axes on which the dendrogram is drawn (plt.gca() seems to be an empty axes). So you can forget about my comment, sorry for the delay!

Last little correction: the name of the color_threshold argument in the create_dendrogram function call in example.ipynb is missing an "h" :).

@katoss
Copy link
Owner Author

katoss commented Aug 1, 2023

Great, thank you @khynder. I fixed the typos in 9569604. And yes I also struggled with the dendrogram labels, maybe there is really no easy solution. Thank you for checking the possibilities!

I guess the PR is ready to merge then 🚀 :)

@katoss katoss merged commit 91e3d03 into main Aug 1, 2023
1 of 3 checks passed
@katoss katoss deleted the fix-functions branch August 1, 2023 18:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants