Fix functions #5

katoss · 2023-07-05T09:36:39Z

Hello,

this is a draft PR for changes in functions based on the reviews of @khynder and @Batalex .
It is quite a lot and I will probably need some help with some of the comments.

I'll list the changes/ToDos here in a numbered list ordered by function:

General

1.

avoid using print in code, if you have a lot of users in the database it could slow down the execution. You can add a Boolean parameter print_progression or use loggers (@Batalex could explain loggers better than me I think) (comment by @khynder)

I changed all print statements to logs (6cd3fe2, 934f6f7). I have never used logs before but I assume this is how it is intended to be?

2.

Be careful with the data format when it does not come directly from kardsort.com, more details below. It would be safer if a function was added to check that the data format is correct, especially if the package is used with data which does not come from kardsort.com. More specifically:

The first user in df should have 1 as user_id (otherwise in get_distance_matrix, distance_matrix_all will be used before creation, and labels could not be computed)

All users should categorize all cards and each card only once

card_id should go from 1 to n (n being the number of cards)

one card_id should always be associated with exactly one card_label (not two or more) (comment by @khynder)

Apart from checking if the first user_id is 1 in "get_distance_matrix" (see d43f500 and below in section for this function for details), I have not created any data format tests yet. An easy fix would be to only allow kardSort data for now. Otherwise, I will probably need to create a new function for these checks that is called in any user-facing function that accepts dataframes?

3. several functions

get_cluster_labels, get_distance_matrix and get_cluster_labels_df use the form id = 1; while id <= max(df.user.id). We can make this more robust by using a for loop on the values in the dataframe. The original version would fail for string ids or even if a value is missing, e.g [1, 2, 4, 5]. Since the value 1 is used to initialize the distance matrix, you may add a boolean condition (to be changed in the first iteration), or you may create a matrix with np.zeros before the for loop. I like using for user_id, sub_df in df.groupby("user_id"):, you are guaranteed to loop over all user_id values in the dataframe. (by @Batalex)

I think this has been solved in a prior PR based on @Robaina's feedback, all while statements have been replaced by for. A check if the first user ID is 1 has been added for get_distance_matrix in d43f500.

Distance matrix

4. get_distance_matrix_for_user

It can be written as a comprehension list if you want (no obligation, if you feel more at ease with the current code keep it 😊) (by @khynder)

I have not looked into this yet

5. get_distance_matrix_for_user

_get_distance_matrix_for_user could be rewritten using scipy's pdist, but it is worth checking if the rewrite to compute the upper triangular matrix is more efficient. (by @Batalex)

How do you think scipy's pdist could be used here? I would have used it from the start, but thought it would not be possible, because this is not a standard distance function like euclidean distance etc.?

6. get_distance_matrix

the first user should have user_id = 1, otherwise distance_matrix_all won’t exist. It can stay that way, with a check at the beginning of the function to stop the computation if the first user_id is not 1. Or you can create before the for loop a variable distance_matrix_all = None, and replace the id == 1 condition by a distance_matrix_all is not None condition (by @khynder)

I added a check if user_id is 1 (d43f500), but I'm not sure if it is the most elegant solution like this. I added an else statement that logs an error message and returns None in case the ID is different.

Dendrogram

7. create_dendrogram

when you define labels labels = df.loc[df["user_id"] == 1]["card_label"].squeeze().to_list()It is very important that the order of the card_label in labels is exactly the same as the one used to create the distance_matrix. The distance_matrix is created with the card_id in the order 1, 2, 3, 4, …, n. So labels should be: first the card_label corresponding to card_id = 1, then card_label corresponding to card_id = 2, then card_label corresponding to card_id = 3, etc. The above code line works fine if the card_id for the user_id 1 are in the order 1, 2, 3 etc, but does not work if the card_id are shuffled. One way to correct this behavior is to add a sort_values(“card_id”) to the labels definition. labels = df.loc[df["user_id"] == 1].sort_values(“card_id”)["card_label"].squeeze().to_list() (by @khynder)

I changed the code to include the sort_values statement (7c20d8c).

8. example.ipynb

Note that it should also be changed in the example.ipynb. (by @khynder)

I'm sorry, I am not sure what exactly should be changed here. The changes in the create_dendrogram function should automatically apply to the notebook?

9. test

Maybe a test could be added on the order of labels in the dendrogram (on the test-data.csv) (comment by @khynder)

Do you mean comparing the "labels" list with the card_labels in the df?

Cluster labels

10. get_cluster_labels

I think the if/else condition after the for loop can be removed. (by @khynder)

I think this might have been solved in a prior PR, since there is no if/else condition after the for-loop. Or do you mean the if/else condition in the for loop?

11. get_cluster_labels

There is a lot of common code lines between get_cluster_labels and get_cluster_labels_df. These two functions can be factorized into one function, with 2 boolean parameters to indicate what type of return you want, for example print_results and return_df_results. (Note: See review for example function) (by @khynder)

Thank you for suggesting code for a common function, I agree that it would be cleaner if there was only one function for both. I have not looked in detail at your function suggestion yet, but wonder if it is really necessary to have 2 boolean values? Would one not be enough, e.g. return_df = true or false?

12. get_cluster_labels

the warning "is/are not a valid card label" can be moved in this function, instead of being checked at a user_id level (Note: See review for example function) (by @khynder)

Open to do: I have not looked into this yet in detail.

13. get_cluster_labels

I believe most of the complexity of get_cluster_labels could be cut down by checking the values of cluster_cards against the possible values before entering the loop. It is quite risky to mutate a datastructure (especially removing values) while using it in a loop, and usually considered a bad practice. (by @Batalex)

Open to do: I have not looked into this yet in detail.

14. get_cluster_label_for_user

It can be simplified with the “isin()” pandas method:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.isin.html (Note: See review for example function) (by @khynder)

Open to do: I have not looked into this yet in detail.

codecov · 2023-07-05T09:37:12Z

Codecov Report

Patch coverage: 67.34% and project coverage change: -7.67% ⚠️

Comparison is base (57697d5) 79.59% compared to head (9569604) 71.92%.
Report is 4 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main       #5      +/-   ##
==========================================
- Coverage   79.59%   71.92%   -7.67%     
==========================================
  Files           2        2              
  Lines          98      114      +16     
==========================================
+ Hits           78       82       +4     
- Misses         20       32      +12

Files Changed	Coverage Δ
src/cardsort/analysis.py	`70.64% <65.59%> (-8.31%)`	⬇️
src/cardsort/__init__.py	`100.00% <100.00%> (ø)`

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

src/cardsort/analysis.py

… correct card labels from this function

katoss · 2023-07-12T17:18:58Z

@khynder , I addressed 10. -14. (everything regarding the get_cluster_labels functions). Your code suggestions helped me a lot, thank you :)

A new To Do:

15. Adapt documentation to changes in get_cluster_label function

…efault return_df to true

…ch function that accepts dfs from users

katoss · 2023-07-19T13:47:36Z

Batalex · 2023-07-19T17:32:01Z

5. As said in the original post, I don't think there is a standard pdist function that can replace the _get_distance_matrix_for_user function. Which one did you think about?

pdist can be used with a custom function of two arguments (see the docs, specifically 24.).

I wondered if it was possible to replace the content of _get_distance_matrix_for_user by pdist with hamming distance. As it turns out, I thought about something even simpler:

>>> df
   card_id card_label  category_id category_label  user_id
0        1        Dog            1           pets        1
1        2      Tiger            1           pets        1
2        3        Cat            1           pets        1
3        4      Apple            2          lunch        1
4        5   Sandwich            2          lunch        1
5        6     Banana            3      long food        1
6        7    Hot Dog            3      long food        1
7        8  Croissant            4    Moon-shaped        1
8        9   Mooncake            4    Moon-shaped        1
9       10       Moon            4    Moon-shaped        1
>>> _get_distance_matrix_for_user(df)  # from main branch
array([[0., 0., 0., 1., 1., 1., 1., 1., 1., 1.],
       [0., 0., 0., 1., 1., 1., 1., 1., 1., 1.],
       [0., 0., 0., 1., 1., 1., 1., 1., 1., 1.],
       [1., 1., 1., 0., 0., 1., 1., 1., 1., 1.],
       [1., 1., 1., 0., 0., 1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1., 0., 0., 1., 1., 1.],
       [1., 1., 1., 1., 1., 0., 0., 1., 1., 1.],
       [1., 1., 1., 1., 1., 1., 1., 0., 0., 0.],
       [1., 1., 1., 1., 1., 1., 1., 0., 0., 0.],
       [1., 1., 1., 1., 1., 1., 1., 0., 0., 0.]])
>>> arr = df["category_label"].values
>>> (arr != arr[:, None]).astype(float)
array([[0., 0., 0., 1., 1., 1., 1., 1., 1., 1.],
       [0., 0., 0., 1., 1., 1., 1., 1., 1., 1.],
       [0., 0., 0., 1., 1., 1., 1., 1., 1., 1.],
       [1., 1., 1., 0., 0., 1., 1., 1., 1., 1.],
       [1., 1., 1., 0., 0., 1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1., 0., 0., 1., 1., 1.],
       [1., 1., 1., 1., 1., 0., 0., 1., 1., 1.],
       [1., 1., 1., 1., 1., 1., 1., 0., 0., 0.],
       [1., 1., 1., 1., 1., 1., 1., 0., 0., 0.],
       [1., 1., 1., 1., 1., 1., 1., 0., 0., 0.]])

The pure numpy solution is simpler and faster, as it avoids looping in Python and accessing single cells in the dataframe.

gedankenstuecke · 2023-07-20T11:29:41Z

That seems like a super clever solution, thanks @Batalex! @katoss and I had looked at the Hamming distance but then struggled in adapting everything else given that it would reverse the metrics effectively!

…matrix_for_user

katoss · 2023-07-20T15:15:23Z

That seems like a smart solution, great! I replaced it in commit 9c629a9

Batalex · 2023-07-20T17:30:53Z

Glad I could be of some help! @khynder Are you satisfied with the proposed changes?

khynder · 2023-07-20T22:01:59Z

I'll have a look at it all on Sunday :)

khynder

Here's the first part of my review, it's a good job!

I'll get into the details of the _check_data function and answer the other points tomorrow.

khynder · 2023-07-23T18:57:57Z

src/cardsort/analysis.py

+        return None
+    else:
+        user_ids = df["user_id"].unique()
+        for id in user_ids:


I thing using "id" as a variable name could lead to conflicts with the built-in id() function, maybe just replace by id_ would be safer

khynder · 2023-07-23T18:58:24Z

src/cardsort/analysis.py

-        if len(cluster_cards) > 0:
+        user_ids = df["user_id"].unique()
+
+        for id in user_ids:


Same as before: I thing using "id" as a variable name could lead to conflicts with the built-in id() function, maybe just replace by id_ would be safer

both done: 192bda9

khynder · 2023-07-23T19:00:57Z

src/cardsort/analysis.py

-                )
+                if print_results:
+                    logger.info(
+                        "User " + str(id) + " labeled card(s): " + cluster_label


I think you could replace the + between strings by the f format, by adding a "f" in front of the big string and inserting the variable in {}. Here it would be:

f"User {id} labeled card(s): cluster_label"

khynder · 2023-07-23T19:01:32Z

src/cardsort/analysis.py

-            break
-    return cluster_df
+                if print_results:
+                    logger.info("User " + str(id) + " did not cluster cards together.")


with the f-string format it would be

f"User {id} did not cluster cards together."

both done: 2ce154b

khynder · 2023-07-23T21:59:10Z

src/cardsort/analysis.py

+                    )
+                if return_df_results:
+                    cards = _get_cards_for_label(cluster_label, df_u)
+                    cluster_df = pd.concat(


Instead of a concatenation for each user_id, you can use .append on a list and use pd.concat only once at the end:

# before the loop cluster_list = [] # inside the loop cluster_list.append([id, cluster_label, cards]) # after the loop cluster_df = pd.concat(cluster_list, columns=["user_id", "cluster_label", "cards"])

Done: 0cf44a0

katoss · 2023-07-24T18:02:09Z

Great, thank you @khynder ! I will try to implement the changes before the end of the week :)

khynder

More precise comments on the _check_data function, because the data format requirements changed with the distance matrix implementation changes :).

src/cardsort/analysis.py

khynder · 2023-07-24T21:02:30Z

…h built-in 'id'

…_id is 1

katoss · 2023-07-27T14:24:45Z

Thank you for the review @khynder ! I tried to address everything.

2. Adapted the function (see comments in review above)
8. Ah yes, makes sense! d2a0812
9. TBD
11. Ok, kept it as it is for now. I'll see if I get any feedback regarding the use of this later on from people using the package.

Then it's only No. 9 that's missing :)

khynder · 2023-07-31T17:37:02Z

Actually adding a test on the dendrogram labels is a bit more difficult than I thought. I was thinking about the methods get_yticklabels() and .get_text(), but I don't know how to get the plt.axes on which the dendrogram is drawn (plt.gca() seems to be an empty axes). So you can forget about my comment, sorry for the delay!

Last little correction: the name of the color_threshold argument in the create_dendrogram function call in example.ipynb is missing an "h" :).

katoss · 2023-08-01T18:16:31Z

Great, thank you @khynder. I fixed the typos in 9569604. And yes I also struggled with the dendrogram labels, maybe there is really no easy solution. Thank you for checking the possibilities!

I guess the PR is ready to merge then 🚀 :)

katoss added 4 commits July 5, 2023 09:34

fix formatting with black

7c20d8c

Changed print to logs

6cd3fe2

Change logging level to info

934f6f7

Check if first user id is 1 in get_distance_matrix

d43f500

katoss requested review from Batalex and Robaina July 5, 2023 09:37

katoss added the enhancement New feature or request label Jul 5, 2023

katoss self-assigned this Jul 5, 2023

katoss mentioned this pull request Jul 5, 2023

cardsort: analyzing data from open card sorting tasks pyOpenSci/software-submission#102

Closed

30 tasks

Removed pull request trigger from CI

8d22703

Batalex reviewed Jul 7, 2023

View reviewed changes

src/cardsort/analysis.py Outdated Show resolved Hide resolved

katoss added 10 commits July 12, 2023 17:12

Fix link to logo

687feb7

Fix logging, add nullHandler to init

3f1b18e

Simplify _get_cluster_label_for_user with isin(), and remove test for…

76ed2da

… correct card labels from this function

Merge get_cluster_labels and get_cluster_labels_df into one function

ab1d72c

fix formatting with black

9b436a8

fix formatting

78ddc44

Remove get_cluster_labels_df function

caa2cb7

Add check if there are still cards in list. Otherwise return None

a9d9ef9

Adapt tests to adapted version of get_cluster_labels

0b2a65f

Remove get_cluster_labels_df from imports

c0bf22e

katoss requested a review from khynder July 13, 2023 07:38

katoss added 6 commits July 19, 2023 11:20

Change logging to logger, change remaining prints to logger, change d…

3867cfc

…efault return_df to true

Fix test_get_cluster_labels_print by adding return_df_results=False

1e02db6

Created _check_data function to check input format and added it to ea…

a3e30be

…ch function that accepts dfs from users

Add test for _check_data function

926a0bf

Adapt usage example to changes in get_cluster_labels function

4bffebe

Adapt notebook to new functions

584137b

katoss marked this pull request as ready for review July 19, 2023 13:47

Replace for loop with faster numpy array comparison in _get_distance_…

9c629a9

…matrix_for_user

khynder reviewed Jul 23, 2023

View reviewed changes

khynder reviewed Jul 24, 2023

View reviewed changes

src/cardsort/analysis.py Outdated Show resolved Hide resolved

src/cardsort/analysis.py Outdated Show resolved Hide resolved

src/cardsort/analysis.py Outdated Show resolved Hide resolved

src/cardsort/analysis.py Show resolved Hide resolved

katoss added 8 commits July 27, 2023 14:31

Add sort_labels to 'creating a custom dendrogram' example

d2a0812

Replace 'id' variable name with 'id_' to prevent potential issues wit…

192bda9

…h built-in 'id'

reformat code with black

2ce154b

Append to list to concat list only once after loop

0cf44a0

Adapt test to changes in get_cluster_labels

ff42545

Remove check if card_ids go from 1 to n in _check_data

c9f27d7

Fix format with black

a3c958f

Replace check if user_ids range go from 1 to n to check if first user…

85acb7a

…_id is 1

fix typos

9569604

katoss merged commit 91e3d03 into main Aug 1, 2023
1 of 3 checks passed

katoss deleted the fix-functions branch August 1, 2023 18:16

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix functions #5

Fix functions #5

katoss commented Jul 5, 2023 •

edited

Loading

codecov bot commented Jul 5, 2023 •

edited

Loading

katoss commented Jul 12, 2023 •

edited

Loading

katoss commented Jul 19, 2023

Batalex commented Jul 19, 2023 •

edited

Loading

gedankenstuecke commented Jul 20, 2023

katoss commented Jul 20, 2023 •

edited

Loading

Batalex commented Jul 20, 2023

khynder commented Jul 20, 2023

khynder left a comment

khynder Jul 23, 2023

khynder Jul 23, 2023

katoss Jul 27, 2023

khynder Jul 23, 2023

khynder Jul 23, 2023

katoss Jul 27, 2023

khynder Jul 23, 2023

katoss Jul 27, 2023

katoss commented Jul 24, 2023

khynder left a comment

khynder commented Jul 24, 2023 •

edited

Loading

katoss commented Jul 27, 2023 •

edited by khynder

Loading

khynder commented Jul 31, 2023

katoss commented Aug 1, 2023

Fix functions #5

Fix functions #5

Conversation

katoss commented Jul 5, 2023 • edited Loading

General

Distance matrix

Dendrogram

Cluster labels

codecov bot commented Jul 5, 2023 • edited Loading

Codecov Report

katoss commented Jul 12, 2023 • edited Loading

katoss commented Jul 19, 2023

Batalex commented Jul 19, 2023 • edited Loading

gedankenstuecke commented Jul 20, 2023

katoss commented Jul 20, 2023 • edited Loading

Batalex commented Jul 20, 2023

khynder commented Jul 20, 2023

khynder left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

katoss commented Jul 24, 2023

khynder left a comment

Choose a reason for hiding this comment

khynder commented Jul 24, 2023 • edited Loading

katoss commented Jul 27, 2023 • edited by khynder Loading

khynder commented Jul 31, 2023

katoss commented Aug 1, 2023

katoss commented Jul 5, 2023 •

edited

Loading

codecov bot commented Jul 5, 2023 •

edited

Loading

katoss commented Jul 12, 2023 •

edited

Loading

Batalex commented Jul 19, 2023 •

edited

Loading

katoss commented Jul 20, 2023 •

edited

Loading

khynder commented Jul 24, 2023 •

edited

Loading

katoss commented Jul 27, 2023 •

edited by khynder

Loading