Lesson proposal: Clustering and Visualising Documents using Word Embeddings (PH/JISC/TNA) #415

tiagosousagarcia · 2021-11-02T15:16:52Z

The Programming Historian has received the following proposal for a lesson on 'Clustering and Visualising Documents using Word Embeddings' by @jreades and @jenniewilliams. The proposed learning outcomes of the lesson are:

The ability to generate word embeddings from a large corpus.
The ability to use dimensionality reduction and clustering techniques for visualisation and analysis
purposes.
The ability to use these steps to find and explore groups of similar documents within a large data set.

In order to promote speedy publication of this important topic, we have agreed to a submission date of no later than April 2022. The author(s) agree to contact the editor in advance if they need to revise the deadline.

If the lesson is not submitted by April 2022, the editor will attempt to contact the author(s). If they do not receive an update, this ticket will be closed. The ticket can be reopened at a future date at the request of the author(s).

The main editorial contact for this lesson is @tiagosousagarcia.

Our dedicated Ombudsperson is (Ian Milligan - http://programminghistorian.org/en/project-team). Please feel free to contact him at any time if you have concerns that you would like addressed by an impartial observer. Contacting the ombudsperson will have no impact on the outcome of any peer review.

svmelton · 2022-02-02T01:43:46Z

@hawc2 has offered to edit this piece.

hawc2 · 2022-02-18T20:21:51Z

Hi @jreades and @jenniewilliams, I look forward to reading your submission. Please let me know if you have any questions in the meantime. Feel free to email me or post questions on this ticket.

jreades · 2022-03-09T14:58:54Z

Hi sorry -- between strikes, childcare, and general... aaaaaaaargh... I'm behind where I'd hoped to be with this! I do have a perfectly serviceable draft of the core explanatory part (what are Word Embeddings, etc.) and I have separate code that I've already used in other analysis of the same data so I know the path to completion...

However, is it helpful for me to share this early draft, or would you prefer to see only a full submission? Having open review creates more opportunities to shape the work as it develops rather than afterwards, but it could also be confusing/unhelpful. Let me know!

If helpful then I can: 1) share access to the GitHub repo where I'm writing the draft (so that we don't pollute this 'timeline'); 2) attach a draft to this thread; 3) submit a draft but recognise that it will need be versioned later (a pull request or similar).

Best,

Jon

hawc2 · 2022-03-09T16:54:59Z

If you'd prefer me to look at an early draft and give you feedback over email, I can, but otherwise I'd say let's go ahead and get a rough draft uploaded as a markdown file to the PH-Submissions repo with previews working, I can give you a first round of feedback in this ticket thread, and then you can finalize for submission to peer review. Does that sound good?

…

On Wed, 9 Mar 2022 at 09:59, Jon Reades ***@***.***> wrote: Hi sorry -- between strikes, childcare, and general... aaaaaaaargh... I'm behind where I'd hoped to be with this! I do have a perfectly serviceable draft of the core explanatory part (what are Word Embeddings, etc.) and I have separate code that I've already used in other analysis of the same data so I know the path to completion... However, is it helpful for me to share this early draft, or would you prefer to see only a full submission? Having open review creates more opportunities to shape the work as it develops rather than afterwards, but it could also be confusing/unhelpful. Let me know! If helpful then I can: 1) share access to the GitHub repo where I'm writing the draft (so that we don't pollute this 'timeline'); 2) attach a draft to this thread; 3) submit a draft but recognise that it will need be versioned later (a pull request or similar). Best, Jon — Reply to this email directly, view it on GitHub <#415 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ADXF4EG3TD3QOMBWREOJPDLU7C4EDANCNFSM5HGW6GQA> . You are receiving this because you were assigned.Message ID: ***@***.***>

-- *Alex Wermer-Colan, PhD * *Digital Scholarship Coordinator* *Temple University, Scholars Studio*

jreades · 2022-03-10T09:38:00Z

It’s ok, I’ll have a go at finishing a proper draft before sending anything over — think I was feeling guilty that I’d been radio-silent for so long and wanted to have a “See, I have done work on this!” moment. ;-) Jon

…

-- mob: 07976987392 email: ***@***.*** skype: jreades

On 9 Mar 2022, 16:55 +0000, Alex Wermer-Colan ***@***.***>, wrote: If you'd prefer me to look at an early draft and give you feedback over email, I can, but otherwise I'd say let's go ahead and get a rough draft uploaded as a markdown file to the PH-Submissions repo with previews working, I can give you a first round of feedback in this ticket thread, and then you can finalize for submission to peer review. Does that sound good? On Wed, 9 Mar 2022 at 09:59, Jon Reades ***@***.***> wrote: > Hi sorry -- between strikes, childcare, and general... aaaaaaaargh... I'm > behind where I'd hoped to be with this! I do have a perfectly serviceable > draft of the core explanatory part (what are Word Embeddings, etc.) and I > have separate code that I've already used in other analysis of the same > data so I know the path to completion... > > However, is it helpful for me to share this early draft, or would you > prefer to see only a full submission? Having open review creates more > opportunities to shape the work as it develops rather than afterwards, but > it could also be confusing/unhelpful. Let me know! > > If helpful then I can: 1) share access to the GitHub repo where I'm > writing the draft (so that we don't pollute this 'timeline'); 2) attach a > draft to this thread; 3) submit a draft but recognise that it will need be > versioned later (a pull request or similar). > > Best, > > Jon > > — > Reply to this email directly, view it on GitHub > <#415 (comment)>, > or unsubscribe > <https://github.com/notifications/unsubscribe-auth/ADXF4EG3TD3QOMBWREOJPDLU7C4EDANCNFSM5HGW6GQA> > . > You are receiving this because you were assigned.Message ID: > ***@***.***> > -- *Alex Wermer-Colan, PhD * *Digital Scholarship Coordinator* *Temple University, Scholars Studio* — Reply to this email directly, view it on GitHub, or unsubscribe. Triage notifications on the go with GitHub Mobile for iOS or Android. You are receiving this because you were mentioned.Message ID: ***@***.***>

jreades · 2022-04-06T11:57:03Z

I assume the first draft is submitted as an attachment to this issue... so here goes!

The article is a README from our private repo (will make a public version prior to publication):
README.md

Images are here:

hawc2 · 2022-04-06T15:15:36Z

@jreades, I'll try to get the lesson set up, and I'll email you with more specific questions/issues with the files. More soon!

tiagosousagarcia · 2022-04-19T06:39:14Z

@hawc2 -- I can setup the lesson later today, if you haven't had a chance

jreades · 2022-04-19T08:39:14Z

This was my bad — in discussing the submission with Alex it become obvious to me that I’d deviated a long way from a format that worked as a standalone tutorial. I’ve just this morning sent a substantially rewritten version that I hope will work a lot better: you can copy+paste the code ‘as is’ from the Markdown document to create a new notebook, but I can also supply a standalone notebook that is ready to run as well. We’ve discussed whether or not to split the tutorial at the point where there is a shift from word embeddings to dimensionality reduction, but until Alex has had a chance to have a look it’s TBD. Apologies again for making this such a protracted, difficult process — as I said to Alex: I’d not realised the extent to which my approach to writing has been deeply reshaped by the academic article format. Jon

…

On 19 Apr 2022, 07:39 +0100, tiagosousagarcia ***@***.***>, wrote: @hawc2 -- I can setup the lesson later today, if you haven't had a chance — Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you were mentioned.Message ID: ***@***.***>

tiagosousagarcia · 2022-04-19T11:15:36Z

no worries @jreades, it's all part of the process. I haven't seen any development here, that's why I was asking if there was something I could do -- but if @hawc2 has the matter in hands, then we are all good (though the offer to help if needed still stands)

hawc2 · 2022-04-21T02:34:28Z

It's looking good now, here is a preview link to the lesson: https://programminghistorian.github.io/ph-submissions/en/drafts/originals/clustering-visualizing-word-embeddings

@jreades can you let me know if you see anything basic in the markdown rendering that might be incorrect?

I'll follow up with some preliminary feedback on the lesson itself in the coming week. Once you finish that round of edits, I'll work on sending this out for peer review.

Thanks also @jreades for putting together now a Github repo that will be linked in the lesson. The repo will include the Python code in a Jupyter notebook runnable in Google Colab for testing purposes

jreades · 2022-04-21T06:36:37Z

Definitely some rendering issues (around some of the maths especially) and a few typos that I’ve just spotted now (naturally). If you can give me editing access I’ll get this tidied up today. Jon

…

-- mob: 07976987392 email: ***@***.*** skype: jreades

On 21 Apr 2022, 03:34 +0100, Alex Wermer-Colan ***@***.***>, wrote: It's looking good now, here is a preview link to the lesson: https://programminghistorian.github.io/ph-submissions/en/drafts/originals/clustering-visualizing-word-embeddings @jreades can you let me know if you see anything basic in the markdown rendering that might be incorrect? I'll follow up with some preliminary feedback on the lesson itself in the coming week. Once you finish that round of edits, I'll work on sending this out for peer review. Thanks also @jreades for putting together now a Github repo that will be linked in the lesson. The repo will include the Python code in a Jupyter notebook runnable in Google Colab for testing purposes — Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you were mentioned.Message ID: ***@***.***>

hawc2 · 2022-04-21T13:19:44Z

I'm giving you write access now. Can you also post the Github repo and Colab notebook here for reference?

jreades · 2022-04-22T06:49:44Z

I’ve updated the tutorial Markdown file with links across to the public GitHub repo and Colab. Fixed the minor typos and one substantive content area that I wanted to correct. Have committed this back to: https://github.com/programminghistorian/ph-submissions/blob/gh-pages/en/drafts/originals/clustering-visualizing-word-embeddings.md Jon

…

-- mob: 07976987392 email: ***@***.*** skype: jreades

On 21 Apr 2022, 14:19 +0100, Alex Wermer-Colan ***@***.***>, wrote: I'm giving you write access now. Can you also post the Github repo and Colab notebook here for reference? — Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you were mentioned.Message ID: ***@***.***>

hawc2 · 2022-04-26T14:15:55Z

This is looking like a very solid first draft. My main feedback is pretty general, so I’ll hold off from giving you specific line edits, and just ask for some broad revisions before we send out for review.

My main observation is that this is quite a difficult lesson, and more work will be required to translate terminology for beginner audiences, signposting where the lesson is going, and onboarding the reader to each phase of the methodology. It will be helpful for you to do some basic revisions in this direction before I send it out for reviewers, so they don’t need worry as much about how this lesson caters to its audience.

My only other concern is that this lesson is very long. Lessons usually don’t go over 8,000 words. I’d rather not see it bulge into a two part lesson, although that is a possible solution. For now, I’d encourage you to focus on the difficult task of editing this draft for both clarity and length, making it ideally more concise and more concrete at the same time.

As an example of clarifying your language for introductory steps, in your first Learning Outcome, you say: “we use a selection of nearly 50,000 records relating to U.K. PhD completions.” Right off the bat, you should use language that more clearly identifies what kind of data your tutorial works with. What kind of records are these? As an American, I’m not sure what “records relating to U.K. PhD completions” would look like, nor why someone would do word embedding analysis on this type of data. I would’ve expected “a corpus of doctoral dissertations” as the main dataset. In this vein, on Paragraph 9, where you introduce this dataset in more detail, it’s still not clear yet what “textual data” you will be analyzing within the “metadata” about dissertations. I have to admit that the section on the Case Study gets so technical and detailed about the metadata that I lost the main thread: What is the text you are going to model?

The part where you explain word embeddings and compare them to other text mining algorithms also requires more revision. In the learning outcomes, the tutorial jumps right into ‘dimensionality reduction’ and ‘hierarchal clustering,’ but maybe a preliminary learning outcome should be something about teaching the reader why these methods are appropriate next steps once you’ve created a word embedding model, in order to pursue a research question about the dataset. Putting it in these less technical terms will help readers understand how the algorithmic processes relate to broader scholarly work.

The subsequent paragraphs do a good job of distinguishing PCA, LDA, and TF-IDF from WEs, but they do assume that the reader knows something about what all these have in common. In these opening paragraphs, try to find more ways to spell this out, in terms of approaches like predictive modeling and latent meaning. For example, this sentence clause doesn’t really clarify what TF-IDF is, so its comparison with WEs remains a bit vague: “ The benefit of simple frequency-based analyses such as TF/IDF is that they are readily intelligible and fairly easy to calculate . . .” What seems essential to highlight here is the type of meaning WEs offer us insight into about the text the other approaches overlook. There’s some explanation in the Word Embedding section (beginning Paragraph 39) that helpfully explains why dimensionality reduction is necessary; a brief version of this could be included early on in the tutorial to explain why that tutorial leads the reader through this specific series of steps. Similarly, under Prerequisites, you explain how this lesson differs from the Scikit Learn Clustering lesson, but you don’t really explain first what the two lessons have in common. Alot of these comparison examples are useful for clarifying what your lesson on word embeddings does, but ideally they’d all occur in one section, and focus mostly on clarifying what word embedding analysis can show about the text.

In this context, the Word Embedding section, in particular paragraphs 40-44, jumps very quickly from the mathematical to the semantic. Could you spend more time here explaining the analogical nature of word embedding models and vector relationships?

The Sample Output section similarly jumps right into the weeds. Could you have a little more introductory info here about the outputs, and how this is a useful sample for elucidating some key points?

A couple of your Tables take up a lot of real estate. Could they be condensed? Table 3 for example, could just be one row? Table 5 is also long.

Paragraph 57 - I agree this is a good break point. You can remove the signpost you put here for review. I think the next section on Words to Documents is very useful, and could be contextualized a bit in terms of Word2Vec and Doc2Vec or how this method differs from those. I see why it’s useful to get into Manifold Learning, but if TSNE-UMAP is the main point, you should get to that sooner. I kinda got lost in this section. Generally I think you go into too much behind the scenes background detail about alternative options, and not enough info on the specific thing you are teaching. Try to offload some of the secondary comparisons with other methods to footnotes.

The Visualization section seems like a good place to conclude. Right now that Figure isn’t rendering in Markdown. But Visualizing and Clustering the data ideally could’ve been foreshadowed earlier in the lesson. The current version of these sections can be condensed to focus on concluding the lesson with some first steps in these visualization directions. What about this section is really essential to this lesson? Is all the validation and related steps necessary, or could that be included as supplemental material on your Github repo and Colab notebook for more advanced users? There’s a bunch, like the Confusion Matrix, that just seems so dense and complicated, that you’d have to do a lot more work to justify its inclusion for the proof-of-concept word embedding methodology. Since that would take up more space, so I’m inclined to think a bunch of it can be removed?

If you can try to take a shot at edits along these lines in the next couple weeks, after one round of revision, it’ll be ready to send out for review. Let me know if you have any questions

jreades · 2022-05-03T11:30:32Z

I've nearly finished -- I just need to review the final bits of analysis in light of the edits above, but have been able to prune the tutorial down to about 9,800 words. I've fixed issues with maths rendering (GItHub doesn't actually do this directly for Markdown) and tried to generally tidy up.

jreades · 2022-05-03T20:26:13Z

Done. I've gone the whole way through and yanked as much as I think we can while preserving the overall intention of the submission. I'm sure there's more that could be done but I'm not able to see it at this point. The commit is in. The only thing I wasn't sure about is the images: I can see that they eventually go into an images/<tutorial_name>/ folder but figured you'd want to do this yourselves.

Let me know if you need anything else or have any further comments/ideas before sending out for review. As you can see, your initial comments prompted a major rethink and I hope you'll think we've done a good job acting on them.

jreades · 2022-05-04T08:59:28Z

Quick note to self: clarify that Euclidean distance works well with UMAP in this case because the abstracts don't vary enormously in length; this means that the magnitude of the averaged document vector isn't an issue. Cosine would probably be a better choice where there was significant variation in the length of the documents.

jreades · 2022-05-06T09:00:21Z

Quick note to self: clarify that Euclidean distance works well with UMAP in this case because the abstracts don't vary enormously in length; this means that the magnitude of the averaged document vector isn't an issue. Cosine would probably be a better choice where there was significant variation in the length of the documents.

I've now fixed this. This is ready for a review... I hope!

hawc2 · 2022-05-06T13:54:31Z

@jreades regarding the images, let's make sure those are all rendering correctly. I put them all in the directory: https://github.com/programminghistorian/ph-submissions/tree/gh-pages/images/clustering-visualizing-word-embeddings

Can you make sure your markdown file has each embedded in the appropriate place with alt-text? You can see information on naming the image files and inserting them into the markdown here: https://programminghistorian.org/en/author-guidelines

Once the lesson is rendering correctly, I'll do one last skim and send it out for peer review. Thanks so much for your thorough edits

jreades · 2022-05-06T14:23:50Z

Done. <fingers crossed I’ve done it right>

…

-- mob: 07976987392 email: ***@***.*** skype: jreades

On 6 May 2022, 14:54 +0100, programminghistorian/ph-submissions ***@***.***>, wrote: https://github.com/programminghistorian/ph-submissions/tree/gh-pages/images/clustering-visualizing-word-embeddings

hawc2 · 2022-05-06T14:34:36Z

So images don't need the whole directory link, just the name of the image file. You should be able to look at this preview to know when everything looks right: https://programminghistorian.github.io/ph-submissions/en/drafts/originals/clustering-visualizing-word-embeddings

I edited the final image in your lesson to show you what it should look like. The last image now renders correctly. I'd rather you finalize it since you know how it should look. Once you think it looks like there, I'll send this out for review. I don't have any other immediate feedback, but after we get reviewer feedback, I'll synthesize their feedback and add any remaining thoughts I have for further revision.

jreades · 2022-05-09T15:40:07Z

I’m definitely missing something here: I’ve removed the full path and followed the convention used in the other tutorials that I peeked at (image name only, no other path info) but I still can’t get the images to display even though they appear to me to be in the right place for the includes to work. I don’t know if the GitHub pages are only rebuilt intermittently or, more likely, if I’m still mucking up something in the placement of the images/code… but I’m stuck. I’m sure the images are ‘fine' in the sense that if you can get them working we can sort out any issues that they might present during the review stage. I’m not too worried about minor look-and-feel issues since the reviewers will presumably also comment on these if they noticed anything wrong. Jon

…

-- mob: 07976987392 email: ***@***.*** skype: jreades

On 6 May 2022, 15:34 +0100, Alex Wermer-Colan ***@***.***>, wrote: So images don't need the whole directory link, just the name of the image file. You should be able to look at this preview to know when everything looks right: https://programminghistorian.github.io/ph-submissions/en/drafts/originals/clustering-visualizing-word-embeddings I'd rather you finalize it since you know how it should look. Once you think it looks like there, I'll send this out for review. I don't have any other immediate feedback, but after we get reviewer feedback, I'll synthesize their feedback and add any remaining thoughts I have for further revision. — Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you were mentioned.Message ID: ***@***.***>

tiagosousagarcia · 2022-05-09T15:48:01Z

I’m definitely missing something here: I’ve removed the full path and followed the convention used in the other tutorials that I peeked at (image name only, no other path info) but I still can’t get the images to display even though they appear to me to be in the right place for the includes to work. I don’t know if the GitHub pages are only rebuilt intermittently or, more likely, if I’m still mucking up something in the placement of the images/code… but I’m stuck. I’m sure the images are ‘fine' in the sense that if you can get them working we can sort out any issues that they might present during the review stage. I’m not too worried about minor look-and-feel issues since the reviewers will presumably also comment on these if they noticed anything wrong. Jon

@jreades and @hawc2, there's apparently something wrong with preview in the submissions repo which means that the relative paths don't work, we need the full path to the image for the preview to display it, according to what @anisa-hawes told me here

tiagosousagarcia · 2022-05-09T15:50:05Z

I’m definitely missing something here: I’ve removed the full path and followed the convention used in the other tutorials that I peeked at (image name only, no other path info) but I still can’t get the images to display even though they appear to me to be in the right place for the includes to work. I don’t know if the GitHub pages are only rebuilt intermittently or, more likely, if I’m still mucking up something in the placement of the images/code… but I’m stuck. I’m sure the images are ‘fine' in the sense that if you can get them working we can sort out any issues that they might present during the review stage. I’m not too worried about minor look-and-feel issues since the reviewers will presumably also comment on these if they noticed anything wrong. Jon

@jreades and @hawc2, there's apparently something wrong with preview in the submissions repo which means that the relative paths don't work, we need the full path to the image for the preview to display it, according to what @anisa-hawes told me here

I'll go through the file and correct the paths, give me 5 mins

jreades · 2023-03-13T08:51:15Z

Yes, will do. Didn’t want to polish that until I was sure the content wouldn’t be changing radically again. We also deal with reviewer feedback at same time… probably towards the end of this week

…

On 11 Mar 2023 at 18:23 +0100, Alex Wermer-Colan ***@***.***>, wrote: @jreades can you fix the Colab notebook and share a working link so @quinnanya and I can test it? Thanks! — Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you were mentioned.Message ID: ***@***.***>

jreades · 2023-03-15T07:48:08Z

Colab notebook is now up-to-date with the code in the tutorial: https://colab.research.google.com/github/jreades/ph-tutorial-code/blob/main/Clustering_Word_Embeddings.ipynb#scrollTo=OGwAfxDfLB0L). (And I’ve updated the link in the tutorial as well) I’ve also revised the introductory section to add a link to Barbara’s "How to use word embeddings for Natural Language Processing”. Jon On 13 Mar 2023, 08:51 +0000, Jon Reades ***@***.***>, wrote: Yes, will do. Didn’t want to polish that until I was sure the content wouldn’t be changing radically again. We also deal with reviewer feedback at same time… probably towards the end of this week On 11 Mar 2023 at 18:23 +0100, Alex Wermer-Colan ***@***.***>, wrote: Message ID: ***@***.***>

hawc2 · 2023-03-20T23:27:57Z

Colab notebook worked well for me, thank you @jreades.

@quinnanya can you take a look and share any last thoughts on this lesson?

quinnanya · 2023-03-21T00:04:05Z

The notebook worked well for me, too -- and admittedly, I was on Colab Pro, but the completion times were quite short.

Running through the code, though, leaves me with one major concern: how can people run this on any other data set they might have? There's a somewhat breezy note about parquet, but I can think of exactly one person I know who regularly uses that as a data format. If everything is going to depend on parquet files, for this to have a prayer of being reusable, it needs at a very minimum a pointer to some external tutorial for how one might go about converting, say, a CSV to a parquet file.

jreades · 2023-03-21T07:38:59Z

This is something that fell out of the streamlining of the notebook and tutorial to focus only on the post-embedding stage: the process of creating WEs (from a CSV with a complex encoding) disappeared. In Python converting to Parquet from CSV can be as simple as: pd.read_csv(…).to_parquet(…) The main advantages relate to size (Parquet files are much, much smaller than their equivalent CSV) and more complex data structures (lists and dicts can be embedded without the need to deserialise them). So in this case the document embeddings are just ’there’ in ‘doc2vec’ and ‘word2vec’ columns, and not something that needs ast.literal to convert. I will also note how to do this and just explain what you’d do if given a CSV containing a literal list (“['elem1’, ‘elem2’, ‘elem3’, … ]”). I would tend to lean towards adding an explanatory note at the start of the notebook together with the illustrative code above. The dependency on pyarrow is installed automatically in Colab and is listed in the requirements.txt file (which I should probably also signpost at the start of the notebook). Does this work? Jon On 21 Mar 2023, 00:04 +0000, Quinn Dombrowski ***@***.***>, wrote: Running through the code, though, leaves me with one major concern: how can people run this on any other data set they might have? There's a somewhat breezy note about parquet, but I can think of exactly one person I know who regularly uses that as a data format. If everything is going to depend on parquet files, for this to have a prayer of being reusable, it needs at a very minimum a pointer to some external tutorial for how one might go about converting, say, a CSV to a parquet file.

quinnanya · 2023-03-22T17:34:59Z

Hi Jon,

Got it! Yup, I think a quick "if you've got a CSV you can convert it easily with [code]" insert there would take care of it, thanks!

~Quinn

BarbaraMcG · 2023-03-23T15:50:30Z

I too have checked the code and it runs quickly with no issues. I agree that the input format needs some clarification, as well as some more comments to the code in the initial part of the notebook; it gets more verbose later on, but the beginning is a little dense.

hawc2 · 2023-03-23T18:51:16Z

Thanks @BarbaraMcG and @quinnanya for this useful and precise feedback. @jreades it sounds like the lesson is ready for copy-editing. @anisa-hawes can start working on that next phase.

Thanks again to our reviewers for taking a second look at this lesson, and giving it such a careful eye. It's going to be an excellent lesson, and I look forward to seeing how @quinnanya's in-progress lesson provides an introduction/background useful for this lesson.

Separately, I can work with you @jreades on revising and finalizing the accompanying Google Colab notebook. What you described doing sounds good to me. Since we don't want to replicate commentary available in the Programming Historian lesson, you can focus on minimal commentary and headings in the colab notebook that make it easy to follow along with, and explain any technical diversions from the PH lesson. The parquet pandas implementation is very elegant!

jreades · 2023-03-23T20:08:11Z

I added a bit of preamble to the Collab notebook: https://github.com/jreades/ph-tutorial-code/blob/main/Clustering_Word_Embeddings.ipynb If that’s not where you want it let me know. Jon

…

On 23 Mar 2023 at 18:51 +0000, Alex Wermer-Colan ***@***.***>, wrote: Thanks @BarbaraMcG and @quinnanya for this useful and precise feedback. @jreades it sounds like the lesson is ready for copy-editing. @anisa-hawes can start working on that next phase. Thanks again to our reviewers for taking a second look at this lesson, and giving it such a careful eye. It's going to be an excellent lesson, and I look forward to seeing how @quinnanya's in-progress lesson provides an introduction/background useful for this lesson. Separately, I can work with you @jreades on revising and finalizing the accompanying Google Colab notebook. What you described doing sounds good to me. Since we don't want to replicate commentary available in the Programming Historian lesson, you can focus on minimal commentary and headings in the colab notebook that make it easy to follow along with, and explain any technical diversions from the PH lesson. The parquet pandas implementation is very elegant! — Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you were mentioned.Message ID: ***@***.***>

anisa-hawes · 2023-03-30T18:31:44Z

Thank you @jreades. This lesson /en/drafts/originals/clustering-visualizing-word-embeddings is now being copyedited.

anisa-hawes · 2023-04-05T21:19:45Z

Hello @jreades,

I hope you are well.

Our copyeditor Iphgenia has prepared the edits for this lesson. I've staged these edits in a Pull Request #554. You can review the changes she's made in the rich diff by navigating to the "Files changed" tab.

Please let me know if you're happy with the adjustments. You'll notice that I have left some small comments/queries, and indicated where a few additions are needed.

Descriptive alt-text to accompany all figure images and visualisations
Math to be formatted as LaTeX

With many thanks,
Anisa

cc. @hawc2

hawc2 · 2023-04-17T19:09:00Z

@jreades did you see the pull request #554 awaiting your approval for copyedits on your lesson?

jreades · 2023-04-19T14:30:20Z

I had -- what I wasn't sure about was whether I was supposed to use the inline commenting function or approve the pull request. I've now done this and then added in the requested revisions.

I think this means we're there? 🤞

anisa-hawes · 2023-04-19T18:55:56Z

Thank you, @jreades!

--

Hello @hawc2,

This lesson is almost ready for your final review.

Preview: http://programminghistorian.github.io/ph-submissions/en/drafts/originals/clustering-visualizing-word-embeddings
.md file: /drafts/originals/clustering-visualizing-word-embeddings.md
original image: /gallery/originals/clustering-visualizing-word-embeddings-original.png
gallery image: /gallery/clustering-visualizing-word-embeddings.png
images: /images/clustering-visualizing-word-embeddings
@anisa-hawes to check images through. There are more than the 10 figures actually in use in this directory, so some sorting is required. I will also resize them. [I've removed all figure images not in use in the final lesson, renamed remaining figures, resized remaining figures (maximum 840 pixels on longest edge)]
assets: [none] ? Q: @hawc2 do you think we should we be hosting this data in our assets repository?
Q: @jreades I note there is one reference (line 731) to code which is available on GitHub but the link is missing (within the subsection titled '11 Clusters')

Sustainability + accessibility actions status:

Copyediting
Typesetting
Addition of Perma.cc links
Addition of alt-text for all figures (thank you, Jon)
Hello Jon @jreades and Jennie @jenniewilliams. May I ask if one of you could download and complete this authorial copyright declaration form? For co-authored lessons, we only require one lead author to complete the form. Please email your completed form to me admin[@]programminghistorian.org
@jreades and @jenniewilliams, could you also confirm whether you have ORCIDs that you'd like us to add to your bios? (below)

- name: Jon Reades
  orcid: 0000-0002-1443-9263
  team: false
  bio:
    en: |
      Jon Reades is Associate Professor at the Centre for Advanced Spatial Analysis, University College London.

- name: Jennie Williams
  orcid: 0000-0000-0000-0000
  team: false
  bio:
    en: |
      Jennie Williams is a PhD Student at the Centre for Advanced Spatial Analysis, University College London.

Next steps @hawc2 :

Define the lesson's difficulty: level, based on the criteria set out here
Define the lesson's activity:
Define the lesson's topics:
Liaise with the authors to provide a short abstract: for the lesson : Jon has provided this
Select + upload an image for the lesson (incl. avatar_alt:) : I've done this
Prepare x2 posts for our Twitter/Mastodon Bot (Let me know if you'd like me to do this? Happy to once we have an abstract)

hawc2 · 2023-04-28T01:41:26Z

thanks @anisa-hawes for getting everything ready. Yes, we can host these assets either in github large file storage or our zenodo repository.

@jreades, can you add the link to github where code is hosted as Anisa mentioned on line 731?

jreades · 2023-04-28T12:30:28Z

Done.

…

On 28 Apr 2023 at 02:41 +0100, programminghistorian/ph-submissions ***@***.***>, wrote: @jreades, can you add the link to github where code is hosted as Anisa mentioned on line 731?

anisa-hawes · 2023-04-28T12:35:22Z

Thank you, @jreades 🙂

hawc2 · 2023-05-03T15:39:33Z

@jreades I did a final line edit of the lesson and standardized some of the wording, tried to clarify a few points. The lesson overall is looking really solid, it's impressive, and while difficult, illuminating. It provides a lighthouse around which future PH lessons on embeddings can situate themselves, and it'll be interesting to see how we publish more lesssons that go into the weeds of emerging machine learning methods while trying to stay sustainable. I also appreciate how you engage with the Scikit Clustering lesson, and more generally situate the lesson in the context of other PH lessons.

In terms of the publishing timeline, Anisa is working on finalizing how we'll store all the assets for this lesson, and I'm preparing other elements for publication. I'm hoping to publish the lesson in the next couple weeks.

So this would be your last chance to make any edits to it. I had a couple lingering questions I was hoping you could clarify, and two suggestions for additional minor edits.

Under Configuring the Context, what does this line mean?: "A mixture of experimentation and reading indicated that Euclidean distance with Ward's quality measure is best" - does "reading" here mean "research"? Is there an article to cite for this decision?
I also wasn't sure what this sentence meant?: "Indeed, the assumptions about the theses being swapped between History DDCs are probably more robust, since the number of misclassified records is substantial enough for the differences to be relatively more robust." Can we use different words than robust here?
One thing I should've flagged earlier is I'm not a fan of commentary embedded in code blocks, especially when it breaks up functions. In this case, however, your lesson is so complex, and there's such detailed code blocks, that I think it works in many places. I tried to condense how many lines some of these commentary sections take up in distrupting the code, but I'd also encourage you to take one last look at this element of the lesson. In a few instances, some of the commentary in-line the code could be taken out and added as a paragraph before or after the code chunk. Ideally you are explaining in the tutorial prose what each code chunk is about to do or has just done. This might be especially helpful in the cases of functions being broken up by commentary. I'll defer to you in the end on how you prefer the in-line commentary to appear case by case, but I just wanted to flag it as something you could alter.
It's fine for you to leave that sort of commentary in the Google Colab notebook. The notebook itself works great, and my only ask for edits on the Colab notebook would be for you to try to incorporate more of the Section Headings from the PH Leson into the Colab notebook itself. Ideally a reader could switch from the lesson to the colab notebook and use the outlines to figure out where in the lesson the google colab code fits. It doesn't have to be perfect, but adding some more sign posts might help a reader juggle everything.

jreades · 2023-05-09T13:54:15Z

Hi Alex — replies below (and cc’ing Jennie to see if she has a quick ref for Q1). On 3 May 2023 at 16:39 +0100, Alex Wermer-Colan ***@***.***>, wrote: @jreades I did a final line edit of the lesson and standardized some of the wording, tried to clarify a few points. The lesson overall is looking really solid, it's impressive, and while difficult, illuminating. It provides a lighthouse around which future PH lessons on embeddings can situate themselves, and it'll be interesting to see how we publish more lesssons that go into the weeds of emerging machine learning methods while trying to stay sustainable. I also appreciate how you engage with the Scikit Clustering lesson, and more generally situate the lesson in the context of other PH lessons. Great, that’s nice to hear, thank you! [ ] Under Configuring the Context, what does this line mean?: "A mixture of experimentation and reading indicated that Euclidean distance with Ward's quality measure is best" - does "reading" here mean "research"? Is there an article to cite for this decision? Yes, we did mean ‘research’. ***@***.*** do you have a reference or two handy? [ ] I also wasn't sure what this sentence meant?: "Indeed, the assumptions about the theses being swapped between History DDCs are probably more robust, since the number of misclassified records is substantial enough for the differences to be relatively more robust." Can we use different words than robust here? Ah, good catch. That paragraph is doing two things which slightly confuse the matter: 1) there’s an argument about there being enough misclassified History theses that it suggests we’ve got substantive overlap between the input classes in the clustering space and so the differences in outcome classifications are significant (in a statistical sense); 2) an argument that the absence of misclassifications between Philosophy/Linguistics and History of Ancient World suggest a strong separation in the clustering space and it’s why we get slightly unhelpful distinguishing words such as ‘Bulgarian’ and ‘Mozambique’ in the TF/IDF plots (so these terms are more like artefacts of the statistics). I’ve rewritten this to try to make it easier to navigate. [ ] One thing I should've flagged earlier is I'm not a fan of commentary embedded in code blocks, especially when it breaks up functions. In this case, however, your lesson is so complex, and there's such detailed code blocks, that I think it works in many places. I tried to condense how many lines some of these commentary sections take up in distrupting the code, but I'd also encourage you to take one last look at this element of the lesson. In a few instances, some of the commentary in-line the code could be taken out and added as a paragraph before or after the code chunk. Ideally you are explaining in the tutorial prose what each code chunk is about to do or has just done. This might be especially helpful in the cases of functions being broken up by commentary. I'll defer to you in the end on how you prefer the in-line commentary to appear case by case, but I just wanted to flag it as something you could alter. I’ve tidied this up where it seemed sensible to me and left it in-place where I thought removing it would make the resulting code more forbidding to less experienced Python programmers. [ ] It's fine for you to leave that sort of commentary in the Google Colab notebook. The notebook itself works great, and my only ask for edits on the Colab notebook would be for you to try to incorporate more of the Section Headings from the PH Leson into the Colab notebook itself. Ideally a reader could switch from the lesson to the colab notebook and use the outlines to figure out where in the lesson the google colab code fits. It doesn't have to be perfect, but adding some more sign posts might help a reader juggle everything. I’ve done that now. Also realised as a result that the code for comparing clustering algorithms wasn’t in the notebook. Have fixed this and it led to a new sub-head in the tutorial near the end with one paragraph moving. If you would like to take a look at what I’ve done, I’ll get the refs from Jennie and we can work out where to put the parquet file. Jon — Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you were mentioned.Message ID: ***@***.***>

hawc2 · 2023-05-09T23:53:01Z

@jreades thanks for making these edits and additions, including the references. .

For the parquet file you mentioned, is that currently in the directory of assets you link? I think @anisa-hawes is planning to store all of those files on our Zenodo repository and relink to that location in the lesson

jreades · 2023-06-08T08:31:10Z

Just chasing this: I can certainly put the Parquet file in GitHub so that you can access it via the 'assets' directory, but if you're then going to move it elsewhere there's not much point as it will bulk up your repo with a 26MB file that's never actually used.

Let me know where you want it to go and I can do it, or feel free to download using the URL in the code and move it wherever you like.

Is there anything else you ened from me?

anisa-hawes · 2023-06-14T11:35:30Z

Dear @jreades. Thank you for following up.

Apologies for the delay. I am working through a few questions about how we handle lessons that integrate codebooks, and also about how to host large data assets. Both are important to ensuring we can manage and sustain this lesson into the future.

I have downloaded your code and created a .zip file which combines all the data assets and uploaded it to our PH Zenodo repository. However, I've expressed to Alex that I am a bit unsure about the data .zip having its own DOI. This appears to be automatically assigned by Zenodo unless we supply one (the lesson's own DOI isn't activated until shortly after publication). I've contacted the library who coordinate and register our DOIs with crossref to ask for advice here.

I’m also uncertain about how the download would work within the code. At line 245 of the Markdown, a block of Python specifies df = pd.read_parquet. Would this work to download and save a .zip? Sorry for all the questions and doubts here.

Anisa

hawc2 · 2023-07-24T20:27:25Z

@anisa-hawes Can we move this lesson forward to publish in the next couple weeks?

anisa-hawes · 2023-07-27T18:03:39Z

Hello @hawc2 ,

Thank you for your extended patience. All the sustainability + accessibility actions are complete:

Copyediting
Typesetting
Addition of Perma.cc links
Addition of alt-text for all figures
Receipt of authorial copyright agreement
Select + upload an image for the lesson (incl. avatar_alt:)
Liaise with authors to prepare bios for ph_authors.yml

As you are in the unusual position of being both Managing Editor and Editor of this lesson, I have prepared the files on Jekyll to help you.

Everything is ready for your review here: programminghistorian/jekyll#2987

(the first thing after publication will be for us to prepare x1 announcement + x2 future posts for our social media channels)

hawc2 · 2023-08-14T14:39:33Z

Huge congrats @jreades and @jenniewilliams on the publication of this amazing new lesson on word embeddings: https://programminghistorian.org/en/lessons/clustering-visualizing-word-embeddings

It's been my pleasure editing this piece, and I'm grateful to @quinnanya and @BarbaraMcG for their careful review of the lesson. Big thanks to @anisa-hawes too for helping prepare this lesson for publication and developing a new way for us to manage Jupyter and Colab Notebooks going forward.

@jreades and @jenniewilliams, we'll be promoting the published lesson on social media, and we encourage you to share it around as well. I look forward to recommending students read it, and I'm sure I'll make use of it in my own research in the future as well. Thanks for all your work and time on this lesson, and again congratulations!

tiagosousagarcia added English 0. Proposal 2021/22-JiscTNA Articles submitted in answer to the PH/JISC/TNA call for papers labels Nov 2, 2021

tiagosousagarcia self-assigned this Nov 2, 2021

drjwbaker assigned hawc2 Feb 18, 2022

anisa-hawes added 5. Revision 2 and removed Author Revising labels Mar 23, 2023

anisa-hawes added 7. Publication and removed 5. Revision 2 labels Apr 19, 2023

hawc2 closed this as completed Aug 14, 2023

Lesson proposal: Clustering and Visualising Documents using Word Embeddings (PH/JISC/TNA) #415

Lesson proposal: Clustering and Visualising Documents using Word Embeddings (PH/JISC/TNA) #415

Comments

tiagosousagarcia commented Nov 2, 2021 • edited Loading

svmelton commented Feb 2, 2022

hawc2 commented Feb 18, 2022

jreades commented Mar 9, 2022

hawc2 commented Mar 9, 2022 via email

jreades commented Mar 10, 2022 via email

jreades commented Apr 6, 2022

hawc2 commented Apr 6, 2022

tiagosousagarcia commented Apr 19, 2022

jreades commented Apr 19, 2022 via email

tiagosousagarcia commented Apr 19, 2022

hawc2 commented Apr 21, 2022

jreades commented Apr 21, 2022 via email

hawc2 commented Apr 21, 2022

jreades commented Apr 22, 2022 via email

hawc2 commented Apr 26, 2022

jreades commented May 3, 2022

jreades commented May 3, 2022

jreades commented May 4, 2022

jreades commented May 6, 2022

hawc2 commented May 6, 2022

jreades commented May 6, 2022 via email

hawc2 commented May 6, 2022 • edited Loading

jreades commented May 9, 2022 via email

tiagosousagarcia commented May 9, 2022

tiagosousagarcia commented May 9, 2022

jreades commented Mar 13, 2023 via email

jreades commented Mar 15, 2023 via email

hawc2 commented Mar 20, 2023

quinnanya commented Mar 21, 2023

jreades commented Mar 21, 2023 via email

quinnanya commented Mar 22, 2023

BarbaraMcG commented Mar 23, 2023

hawc2 commented Mar 23, 2023

jreades commented Mar 23, 2023 via email

anisa-hawes commented Mar 30, 2023

anisa-hawes commented Apr 5, 2023 • edited Loading

hawc2 commented Apr 17, 2023

jreades commented Apr 19, 2023

anisa-hawes commented Apr 19, 2023 • edited Loading

hawc2 commented Apr 28, 2023

jreades commented Apr 28, 2023 via email

anisa-hawes commented Apr 28, 2023

hawc2 commented May 3, 2023 • edited Loading

jreades commented May 9, 2023 via email

hawc2 commented May 9, 2023

jreades commented Jun 8, 2023

anisa-hawes commented Jun 14, 2023

hawc2 commented Jul 24, 2023

anisa-hawes commented Jul 27, 2023

hawc2 commented Aug 14, 2023 • edited Loading

tiagosousagarcia commented Nov 2, 2021 •

edited

Loading

hawc2 commented May 6, 2022 •

edited

Loading

anisa-hawes commented Apr 5, 2023 •

edited

Loading

anisa-hawes commented Apr 19, 2023 •

edited

Loading

hawc2 commented May 3, 2023 •

edited

Loading

hawc2 commented Aug 14, 2023 •

edited

Loading