Copyedit clustering visualizing word embeddings #554

Merged

jreades merged 5 commits into gh-pages from copyedit-clustering-visualizing-word-embeddings

Apr 19, 2023

Contributor

anisa-hawes commented Apr 5, 2023

Hello @jreades,

I hope you are well.

Our copyeditor Iphgenia has prepared the edits for this lesson. You can review the changes she's made in the rich diff by navigating to the "Files changed" tab (above).

Please let me know if you're happy with the adjustments. You'll notice that I have left some small comments/queries, and indicated where a few additions are needed.

- Descriptive alt-text to accompany all figure images
- Math to be formatted as LaTeX

With many thanks,
Anisa

anisa-hawes and others added 5 commits

March 30, 2023 19:42


- e192a50: Update clustering-visualizing-word-embeddings.md
  Adjustments to YAML header.
- 7e9a54e: Update clustering-visualizing-word-embeddings.md
- 21e393a: Update clustering-visualizing-word-embeddings.md
- 3748e03: Update clustering-visualizing-word-embeddings.md
- c768df9: Update clustering-visualizing-word-embeddings.md
  - add in alt-text template
  - add Figure # into captions

anisa-hawes commented

en/drafts/originals/clustering-visualizing-word-embeddings.md (resolved)

anisa-hawes commented

en/drafts/originals/clustering-visualizing-word-embeddings.md (resolved)

anisa-hawes commented

en/drafts/originals/clustering-visualizing-word-embeddings.md

@@ -421,7 +416,7 @@ dendrogram(
               plt.show()
               ```
-              The dendrogram is a top-down view, but recall that this is _not_ how we clustered the data; you can peek inside the `Z` object to see what happened and when. Table 6 shows what happened on the first and final iterations of the algorithm as well when we were one-quarter, one-half and three-quarters done with the clustering. On the first iteration, observations 4,445 and 6,569 were merged into a cluster of size 2 (<img alt="sum of ci and cj" src="https://render.githubusercontent.com/render/math?math={\sum c_{i}, c_{j}}" />) because the distance (_d_) between them was close to 0.000. Iteration 6,002 is a merge of two clusters to form a larger cluster of 5 observations: we know this because <img alt="ci" src="https://render.githubusercontent.com/render/math?math={c_{i}}" /> and <img alt="cj" src="https://render.githubusercontent.com/render/math?math={c_{j}}" /> *both* have higher indices than there are data points in the sample. On the last iteration, clusters 16,000 and 16,001 were merged  to create one cluster of 8,002 records. That is the 'link' shown at the very top of the dendrogram and it also has a very large <img alt="dij" src="https://render.githubusercontent.com/render/math?math={d_{ij}}" /> between clusters.
+              The dendrogram is a top-down view, but recall that this is _not_ how we clustered the data; you can peek inside the `Z` object to see what happened and when. Table 6 shows what happened on the first and final iterations of the algorithm as well when we were one-quarter, one-half and three-quarters done with the clustering. On the first iteration, observations 4,445 and 6,569 were merged into a cluster of size 2 (<img alt="sum of ci and cj" src="https://render.githubusercontent.com/render/math?math={\sum c_{i}, c_{j}}" />) because the distance (_d_) between them was close to 0.000. Iteration 6,002 is a merge of two clusters to form a larger cluster of five observations. We know this because <img alt="ci" src="https://render.githubusercontent.com/render/math?math={c_{i}}" /> and <img alt="cj" src="https://render.githubusercontent.com/render/math?math={c_{j}}" /> *both* have higher indices than there are data points in the sample. On the last iteration, clusters 16,000 and 16,001 were merged  to create one cluster of 8,002 records. That is the 'link' shown at the very top of the dendrogram and it also has a very large <img alt="dij" src="https://render.githubusercontent.com/render/math?math={d_{ij}}" /> between clusters.

Contributor Author

anisa-hawes Apr 5, 2023

As above (line 369), size 1 and size 2 left as integers.

For the math here, please could you provide this formatted as LaTeX?
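
For reference, the four inline expressions in the hunk above could be written as LaTeX along the following lines. This is only a sketch: the expressions are copied from the `math=` parameters of the image URLs, and the `$...$` delimiters are an assumption, since this thread does not show the syntax that was finally committed.

```latex
% Cluster size column (expression as given in the original image URL)
$\sum c_{i}, c_{j}$

% Indices of the two items merged on a given iteration
$c_{i}$ and $c_{j}$

% Distance between the two items being merged
$d_{ij}$
```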

Collaborator

jreades Apr 19, 2023

Have now done this in new commit (incoming shortly).
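
The index rule described in the hunk above can be sketched in a few lines. This assumes `Z` is a SciPy-style linkage matrix whose rows are `[c_i, c_j, d_ij, size]`, and that a cluster created on 0-based iteration `k` receives the index `n + k`. The helper names are hypothetical and the distance in the example row is a placeholder for 'close to 0.000'.

```python
def describe_member(index: int, n_observations: int) -> str:
    """Say whether a linkage index is a raw observation or an earlier merge."""
    if index < n_observations:
        return f"observation {index}"
    # Clusters are numbered n, n+1, ... in the order they were created,
    # so index - n is the 0-based iteration that produced this cluster.
    return f"cluster formed on iteration {index - n_observations}"


def describe_merge(row, n_observations: int) -> str:
    """Summarise one row of the linkage matrix in words."""
    c_i, c_j, d_ij, size = row
    left = describe_member(int(c_i), n_observations)
    right = describe_member(int(c_j), n_observations)
    return f"{left} + {right} -> size {int(size)} (d = {d_ij:.3f})"


N = 8002  # number of records quoted in the hunk above

# First iteration as described in the text; 0.0 is a placeholder distance.
print(describe_merge((4445, 6569, 0.0, 2), N))

# The two clusters joined on the final iteration both have indices >= N.
print(describe_member(16000, N))
print(describe_member(16001, N))
```

With 8,002 records there are 8,001 merges, so indices 16,000 and 16,001 refer to the clusters created on the two iterations immediately before the last one.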

anisa-hawes commented

en/drafts/originals/clustering-visualizing-word-embeddings.md

                           margins=True, margins_name='Total')
               ```
-              Table 4 compares the top-level DDC assignments against the 3-cluster assignment: if the clustering has gone well, we'd expect to see the majority of the observations on the diagonal and, indeed, that's exactly what we see here with just a small fraction of theses being assigned to the 'wrong' cluster. The overall precision and recall are both 93%, as is the F1 score. Notice, however, the lower recall on Philosophy and psychology:  308 (17%) were misclassified compared to less than 4% of History theses.
+              Table 4 compares the top-level DDC assignments against the 3-cluster assignment: if the clustering has gone well, we'd expect to see the majority of the observations on the diagonal and, indeed, that's exactly what we see here with just a small fraction of theses being assigned to the 'wrong' cluster. The overall precision and recall are both 93%, as is the F1 score. Notice, however, the lower recall on Philosophy and psychology: 308 (17%) were misclassified compared to less than 4% of History theses.

Contributor Author

anisa-hawes Apr 5, 2023

Is 3-cluster assignment the same as 3 Cluster Solution? (below)
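
As a reminder of how the per-class figures in that paragraph relate to the cross-tabulation, here is a minimal sketch that derives precision, recall and F1 from a confusion matrix. The counts are made-up toy values, not the lesson's Table 4. The function name is hypothetical, and the code assumes that cluster `k` has already been matched to class `k`.

```python
def per_class_scores(confusion, labels):
    """Return {label: (precision, recall, f1)} from a square confusion matrix.

    Rows are the reference classes and columns are the assigned clusters,
    so the diagonal holds the 'correct' assignments.
    """
    scores = {}
    for k, label in enumerate(labels):
        true_pos = confusion[k][k]
        row_total = sum(confusion[k])                 # items truly in class k
        col_total = sum(row[k] for row in confusion)  # items assigned to cluster k
        recall = true_pos / row_total if row_total else 0.0
        precision = true_pos / col_total if col_total else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[label] = (precision, recall, f1)
    return scores


# Toy counts for illustration only (hypothetical, not from the lesson).
labels = ["class_a", "class_b", "class_c"]
toy = [
    [8, 1, 1],
    [2, 6, 2],
    [0, 1, 9],
]

scores = per_class_scores(toy, labels)
for label, (p, r, f1) in scores.items():
    print(f"{label}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

A low recall for one class, as noted for Philosophy and psychology, corresponds to a row whose off-diagonal cells are large relative to its diagonal cell.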

anisa-hawes commented

en/drafts/originals/clustering-visualizing-word-embeddings.md

-              For instance, taking the two history DDCs we can see that documents clustered with Linguistics seem to have an educational component, while those clustered with Philosophy from History of the Ancient World reveal terms associated with Greek philosophy and those from more general History appear to have a strong 17th and 18th Century element. From what we can see of the misclustered History and Ancient History theses it's reasonable to infer a subtle, but meaningful difference between concerns with history as one of objects and sites, and one more focussed in issues of power, work, politics and empire.
+              For instance, taking the two History DDCs we can see that documents clustered with Linguistics seem to have an educational component, while those clustered with Philosophy from History of the Ancient World reveal terms associated with Greek philosophy and those from more general History appear to have a strong seventeenth and eighteenth century element. From what we can see of the misclustered History and Ancient History theses it's reasonable to infer a subtle, but meaningful difference between concerns with history as one of objects and sites, and one more focussed in issues of power, work, politics and empire.

Contributor Author

anisa-hawes Apr 5, 2023

Our style guide advises use of lower case for university departments. Are the capital letters critical for this aspect of the clustering?

anisa-hawes commented

en/drafts/originals/clustering-visualizing-word-embeddings.md

-              For instance, taking the two history DDCs we can see that documents clustered with Linguistics seem to have an educational component, while those clustered with Philosophy from History of the Ancient World reveal terms associated with Greek philosophy and those from more general History appear to have a strong 17th and 18th Century element. From what we can see of the misclustered History and Ancient History theses it's reasonable to infer a subtle, but meaningful difference between concerns with history as one of objects and sites, and one more focussed in issues of power, work, politics and empire.
+              For instance, taking the two History DDCs we can see that documents clustered with Linguistics seem to have an educational component, while those clustered with Philosophy from History of the Ancient World reveal terms associated with Greek philosophy and those from more general History appear to have a strong seventeenth and eighteenth century element. From what we can see of the misclustered History and Ancient History theses it's reasonable to infer a subtle, but meaningful difference between concerns with history as one of objects and sites, and one more focussed in issues of power, work, politics and empire.

              Indeed, the assumptions about the theses being swapped between History DDCs are probably more robust since the number of misclassified records is substantial enough for the differences to be relatively more robust. Conversely, the tiny number of Philosophy and Linguistics theses clustered with the History of the Ancient World indicate a strong separation between these topics and throw up apparently unrelated significant words such as 'bulgarian' and 'scandinavian' (Linguistics), and 'mozambique' or 'habitus' (Philosophy).

Contributor Author

anisa-hawes Apr 5, 2023

'bulgarian', 'scandinavian' and 'mozambique' have retained lower case, as you discuss (at the beginning of the lesson) making all the words lower case for processing.

anisa-hawes self-assigned this

anisa-hawes added the English label

anisa-hawes mentioned this pull request

Lesson proposal: Clustering and Visualising Documents using Word Embeddings (PH/JISC/TNA) #415

Closed

jreades merged commit 799ef67 into gh-pages

anisa-hawes deleted the copyedit-clustering-visualizing-word-embeddings branch

April 19, 2023 14:38
