
Lesson proposal: Interrogating a National Narrative with Recurrent Neural Networks (PH/JISC/TNA) #418

Closed
tiagosousagarcia opened this issue Nov 2, 2021 · 61 comments
Assignees: tiagosousagarcia, anisa-hawes
Labels: 7. Publication · 2021/22-JiscTNA (articles submitted in answer to the PH/JISC/TNA call for papers) · English

Comments

@tiagosousagarcia
Contributor

tiagosousagarcia commented Nov 2, 2021

The Programming Historian has received the following proposal for a lesson on 'Interrogating a National Narrative with Recurrent Neural Networks' by @ChantalMB. The proposed learning outcomes of the lesson are:

  • Understanding how general-purpose neural networks like GPT-2 can be applied to the distant study of large corpora in a way that helps guide further close readings, while also acknowledging the technical and ethical flaws that come with using large-scale language models
  • Creating a workflow for performing large-scale computational analysis that works for the individual learner, by advancing their technical knowledge of the machine-learning software and hardware required to perform these kinds of tasks

In order to promote speedy publication of this important topic, we have agreed to a submission date of no later than 24/01/2022. The author(s) agree to contact the editor in advance if they need to revise the deadline.

If the lesson is not submitted by 24/01/2022, the editor will attempt to contact the author(s). If they do not receive an update, this ticket will be closed. The ticket can be reopened at a future date at the request of the author(s).

The main editorial contact for this lesson is @tiagosousagarcia.

Our dedicated ombudsperson is Ian Milligan (http://programminghistorian.org/en/project-team). Please feel free to contact him at any time if you have concerns that you would like addressed by an impartial observer. Contacting the ombudsperson will have no impact on the outcome of any peer review.

@tiagosousagarcia added the English, 0. Proposal, and 2021/22-JiscTNA (articles submitted in answer to the PH/JISC/TNA call for papers) labels on Nov 2, 2021
@tiagosousagarcia self-assigned this on Nov 2, 2021
@drjwbaker
Member

@svmelton and I discussed potential editors for this article. Sarah will seek an editor from the EN team when the article arrives. I note that @ChantalMB has emailed to report a technical issue completing the submission.

@drjwbaker
Member

This lesson has now been submitted and staged here https://programminghistorian.github.io/ph-submissions/en/lessons/interrogating-national-narrative-gpt and is ready for technical review prior to peer review. Many thanks to @ChantalMB for submitting on time!

@tiagosousagarcia
Contributor Author

The link for the staged submission has been updated and is now here: https://programminghistorian.github.io/ph-submissions/en/drafts/originals/interrogating-national-narrative-gpt

I'm starting the technical review now and will come back with comments, corrections, and suggestions shortly.

tiagosousagarcia added a commit that referenced this issue Feb 17, 2022
@tiagosousagarcia
Contributor Author

@ChantalMB, congratulations on an excellent tutorial -- this is a challenging and exciting topic (and something very close to my current work), and you've tackled it beautifully.

Apologies for the delay in moving your article forward -- there were a few personal reasons that kept me away from this for a while.

I've made an initial technical review in commit 757f056, making small changes as I went along. A summary of these changes:

Changes made

  • l. 30 -- changed single to double quotation marks (consistency) (and in other places, silently)
  • l. 30 -- added wiki link to machine learning
  • l. 30 -- added wiki link to artificial intelligence
  • l. 30 -- ie -> i.e.
  • l. 32 -- changed single to double quotation marks (consistency)
  • l. 36 -- "As the first half implies, " -> deleted (clarity)
  • l. 36 -- changed single to double quotation marks (consistency)
  • l. 38 -- "official" -> "officially"
  • l. 60 -- "ex" -> "e.g."
  • l. 64 -- added wiki link to multi-core processor
  • l. 68 -- "propriety" -> "proprietary"
  • l. 85 -- added link to conda environment
  • l. 183 -- "outline" -> "outlines"
  • l. 319 -- capitalised and italicised PH

Additionally, I suggest you consider the following before we send your article for peer review:

Suggestions

  • l. 42 -- add a local copy of the dataset to PH (can be done now, if no further changes to the dataset)
  • l. 42 -- add references to other PH tutorials on relevant methods (e.g., web scraping, data cleaning)
  • l. 44 -- explain what is meant by 'prefix functionality'
  • l. 46 -- requirements should probably appear earlier in the lesson (after overview?)
  • l. 52 -- "you may use an online service that offers cloud-based GPU computing" -- add a note to explain that a few examples will be discussed in more detail in a section below
  • ll. 85-95 -- should the virtual environment have a more descriptive name than "gpt2"? Users who have dabbled before (or may wish to dabble again) would benefit from a clearer label
  • l. 155 -- add link to more documentation on GPT-2's different options
  • ll. 185-188 -- add a few notes explaining why you chose those specific default values, more detail on units (i.e., are they all in steps?), and more detail on 'learning_rate' (see the sketch after this list for the parameters in question)
  • l. 197 -- Google Colab took 37 minutes for me; it might be worth being a little more general here (i.e., noting that execution times will vary) and just offering a minimum time (e.g., at least 20 minutes)
  • l. 258 -- link to example output text is missing
  • ll. 252-302 -- I really like this section. It presents a good example of how to use a generative language model to interrogate a media narrative. A few more things that I would like to see addressed here are a rough hit rate for useful generated text (i.e., how many generations until something interesting comes along) and your process for determining what makes a particular generation interesting. I understand that this is a thorny and complex issue that falls slightly outside the scope of the article, but it might be useful to acknowledge some of these complexities here. Another thing that could be addressed here (or elsewhere) is a hint at other scholarly uses for AI-generated text.
  • ll. 303-322 -- another excellent section. A couple of notes here: 1) it might be useful to point towards other text-generation AIs, some of which attempt to at least address some of the concerns around OpenAI's practices and base model (EleutherAI, for example); 2) the last paragraph (l. 321) reads more like a conclusion than an ethical discussion. I would probably separate it out and expand it slightly with this in mind.
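For context, here is roughly the fine-tuning call these last suggestions refer to. This is a minimal sketch assuming the lesson uses the gpt-2-simple library (whose finetune() parameters match the names discussed above); the corpus file name and hyperparameter values are illustrative placeholders, not the lesson's own.

```python
# Minimal GPT-2 fine-tuning sketch, assuming the gpt-2-simple library.
# "articles.txt" and all hyperparameter values are placeholders.
import gpt_2_simple as gpt2

gpt2.download_gpt2(model_name="124M")  # smallest GPT-2 model

sess = gpt2.start_tf_sess()
gpt2.finetune(
    sess,
    dataset="articles.txt",   # plain-text training corpus
    model_name="124M",
    steps=1000,               # units: training steps, one batch per step
    learning_rate=0.0001,     # scales each gradient-descent update
    sample_every=200,         # print a sample generation every 200 steps
    save_every=500,           # write a checkpoint every 500 steps
)
```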

Thank you for all your hard work!

@drjwbaker
Member

@ChantalMB Would you be able to get these small changes done in the next couple of weeks? If so, that'll give us some time to assign an editor (pinging @svmelton) before getting it out to peer review.

@svmelton
Contributor

Thanks, @drjwbaker! @jrladd will serve as an editor for this piece.

@ChantalMB
Contributor

Thanks @tiagosousagarcia for the initial technical review! @drjwbaker I should be able to fully review + get these changes done over the course of next week!

@anisa-hawes self-assigned this on Feb 18, 2022
@anisa-hawes
Contributor

(I am co-assigning myself here so that I can shadow the editorial process).

@drjwbaker
Member

@jrladd so pleased to have you as editor on this. Note that this article is part of a special series for which we have funding. As a result, @tiagosousagarcia and I will be offering additional support. For example, in addition to doing the technical edit, Tiago has identified potential peer reviewers. So do write to us (https://programminghistorian.org/en/project-team) when you are ready!

@ChantalMB
Contributor

@tiagosousagarcia (or anyone who may have the answer) In adding references to other PH tutorials, I've discovered that the "Intro to Beautiful Soup" tutorial has been retired. Is it acceptable to link to a retired tutorial? This is actually the tutorial I used to learn web scraping with Python, so it is the most accurate in terms of what I'm mentioning in my tutorial!

@tiagosousagarcia
Contributor Author

@ChantalMB -- in principle, I would avoid linking to retired tutorials. However, two things are worth taking into account: 1) the lesson was retired because the example used was no longer available, rather than because the technology itself has been superseded; and 2) I don't think (I may be wrong) we have another entry-level Beautiful Soup tutorial available. So it might be OK in this case, as long as the link is contextualised. I may be very wrong though -- @drjwbaker, @jrladd, do we have a strict policy about this?

@drjwbaker
Member

If we are sending people to a retired lesson to work through it, we shouldn't do that. If it is merely to reference a point there or to understand a principle, that is fine. Retired articles are retired because they no longer work (and can no longer be made to work), rather than because there is anything terrible about what they were trying to achieve when active.

@tiagosousagarcia
Contributor Author

Hi @ChantalMB -- I wonder if we could get a sense of when you expect to have these revisions ready?

@ChantalMB
Contributor

ChantalMB commented Mar 15, 2022

@tiagosousagarcia The revised tutorial is ready now, actually! Just wrapped up everything today -- apologies for the delay, I got knocked out by a cold for a week!

I also never got notified by GitHub that you and @drjwbaker had responded to my question, so thank you! I ended up linking the Beautiful Soup tutorial, but then also added a link to the "Automated Downloading with Wget" tutorial as an alternative resource for specific instruction, since wget can also be used to download web pages.
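For readers following along, a minimal sketch of the kind of single-page download those tutorials cover, assuming the requests and beautifulsoup4 packages; the URL and output file name are placeholders, not ones used in the lesson.

```python
# Download one web page and save its visible text; illustrative only.
import requests
from bs4 import BeautifulSoup

url = "https://example.com/article"  # placeholder URL
html = requests.get(url, timeout=30).text
soup = BeautifulSoup(html, "html.parser")

with open("article.txt", "w", encoding="utf-8") as f:
    f.write(soup.get_text(separator="\n", strip=True))
```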

I did quite a large edit re: the suggestions for lines 185-188, because expanding on the learning rate meant that I also had to explain gradient descent, so my tutorial now includes diagrams (a toy illustration of the idea appears after this comment). Should I be sending the revised article + attached images by email, or attached to a comment in this ticket?

Similarly, you stated that a local copy of my data could be made; if possible, I'd like to do that for the training data and also the output text (the missing link at line 258).

Thanks for your help in advance!
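To ground the gradient-descent point above, here is a toy sketch (not the lesson's code or diagrams) of the update rule that a learning rate controls.

```python
# Toy gradient descent: repeatedly step against the gradient of a loss.
# The learning rate scales each step: too large overshoots the minimum,
# too small makes training crawl.
def gradient_descent(grad, x0, learning_rate=0.1, steps=50):
    x = x0
    for _ in range(steps):
        x -= learning_rate * grad(x)
    return x

# Minimise f(x) = (x - 3)**2, whose gradient is 2 * (x - 3).
minimum = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)
print(round(minimum, 4))  # converges towards 3.0
```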

@tiagosousagarcia
Contributor Author

Great news, @ChantalMB! Hope you are feeling better now. The easiest option is probably to send me the corrected .md + additional files via email; I'll add them here and link the commit to the discussion. You can also make the changes via a pull request, but you'd still need to send me the additional files separately.

@ChantalMB
Contributor

@tiagosousagarcia Just sent everything your way via email!

tiagosousagarcia added a commit that referenced this issue Mar 16, 2022
@tiagosousagarcia
Contributor Author

@programminghistorian/technical-team or @anisa-hawes, I wonder if anyone could give me a hand here: since the location of the drafts has been changed, it seems that the link for local datasets is broken in the preview -- is there any trick to referencing it that I'm missing? Currently I have it as /assets/[LESSON-SLUG]/[FILE-NAME].EXT -- I know other lessons also suffer from this problem (at least #416)

@anisa-hawes
Contributor

Hello @tiagosousagarcia. Hmmm. This is strange... Let me take a look... When we made changes to the directories where the lesson .md files are saved, we didn't make any changes to the images or assets directories. The URL format should indeed be:

/assets/lesson-slug-here/asset-file-name.ext

@anisa-hawes
Contributor

Ah, so when the lesson is moved over to Jekyll for publication, we update any /assets or other internal links so that they are 'relative' links. Until then, we need to use full links, i.e., https://github.com/programminghistorian/ph-submissions/blob/gh-pages/assets/interrogating-national-narrative-gpt/articles.txt.

@jrladd
Collaborator

jrladd commented Mar 18, 2022

Thanks so much for these revisions, @ChantalMB. Just a quick note to officially introduce myself--I'm glad to be working with you all on this! We've reached out to potential reviewers, and once we know more I'll be back in touch about next steps.

@jrladd
Collaborator

jrladd commented Jul 28, 2022

Thanks for all your hard work on this, @ChantalMB! I think this looks great, and we're almost ready to move to the next step.

I fixed a couple of typos, added links to your files, and took care of a problem with footnote formatting. Those changes can be found in these two commits. Here's a complete list of what I changed:

  • Added links to the articles.txt file in assets in paragraphs 10, 41. (@anisa-hawes These links aren't currently working correctly. Did I put them in wrong?)
  • Removed "in the field of" in paragraph 6
  • Added "comes from" to "The code used for scraping articles comes from this tutorial" in paragraph 11
  • Changed "cloud GPU serve" to "cloud GPU service" in paragraph 35
  • Added "of" to "composed of two parts" in paragraph 63
  • Changed "withdrawl" to "withdrawal" in paragraph 68
  • Fixed subscript syntax for CO2 in paragraph 92
  • Fixed syntax for endnote 18 in paragraph 94
  • Fixed all endnote syntax in the endnotes section

Have a look at these when you get a chance, and let me know if it looks good. I also have two quick questions:

  • In paragraph 10, you have missing links for [INSERT SCRAPING LINK] and [INSERT SCRAPING ARTICLES]. If you have the links for these, I can add them.
  • In paragraph 65, there's a missing link for lesson output. Is this a link to a webpage you can send me, or is it an additional file you want us to add? Either way, I'm happy to add this, too.

Lastly, I'll need a brief author bio from you. This is just your name as you'd like it to appear on the article, your ORCID if you have one (completely optional), and a one- or two-sentence bio. This tutorial has a few example bios at the bottom (including my own). You can add your bio here or email it to me: whichever you prefer is fine.

Thanks again—excited to be in the home stretch with this!

@anisa-hawes
Contributor

Thank you, @jrladd!

I've updated the links at paragraphs 10 and 41. Until the asset files are moved over to our Jekyll repository for publication, we need to provide the full link to the assets folder in ph-submissions (rather than the 'relative' link for internal pages). So, in this case the link required for now is: https://github.com/programminghistorian/ph-submissions/blob/gh-pages/assets/interrogating-national-narrative-gpt/articles.txt. I'll update it again ahead of publication!

Let me know if I can help you with any of the next steps, @jrladd & @ChantalMB.

@ChantalMB
Contributor

Thanks @jrladd and @anisa-hawes for your quick work!

The missing links were to files that should have been in the ZIP I emailed, but alas, I guess they were missing! I sorted out the issues I was having with GitHub (note: if you're teaching an "Intro to Git" lesson and generate a lot of PATs (personal access tokens) in one day during demos, this is apparently "bot-like behaviour" haha), so I was able to create a pull request with the added assets and edited links 👍

You can use this as my bio:

Chantal Brousseau is an MA candidate in the History and Data Science program at Carleton University. Her research focuses on applications of machine learning and computational creativity to the study of historical data.

And my ORCID is:

https://orcid.org/0000-0002-4649-059X

As always, let me know if there's anything else you need me to do!

@jrladd
Collaborator

jrladd commented Jul 29, 2022

Thanks again, @ChantalMB and @anisa-hawes! This looks great to me. (By the way, you did add those files to the ZIP, and I just misunderstood. Sorry about that, and thanks for uploading them!)

@svmelton @hawc2 This is ready to advance. The list of file locations you'll need is below. Here's a new author bio for this lesson:

- name: Chantal Brousseau
  team: false
  orcid: 0000-0002-4649-059X
  bio:
      en: |
          Chantal Brousseau is an MA candidate in the History and Data Science program at Carleton University. Her research focuses on applications of machine learning and computational creativity to the study of historical data.

File locations:

  • Lesson file: en/drafts/originals/interrogating-national-narrative-gpt.md
  • Image files: images/interrogating-national-narrative-gpt
  • Asset files: assets/interrogating-national-narrative-gpt
  • Original avatar: gallery/originals/interrogating-national-narrative-gpt-original.png
  • Modified avatar: gallery/interrogating-national-narrative-gpt.png

Let me know if you need anything else, and when you're ready to merge I can add the tweets to the spreadsheet.

@svmelton
Contributor

Thank you @jrladd! Excited to see this move forward. @anisa-hawes I think this still needs copyediting. Do you have capacity for that?

@tiagosousagarcia
Contributor Author

Hello all -- just a quick note to say that this is my last week working for PH. It's been an absolute pleasure working on this, and I'm only sorry I'm not going to be around for its publication (from this side -- I'll definitely be reading and using it as a regular Joe). Big thanks to @ChantalMB for writing it, and @jrladd for taking it forward, as well as @svmelton and @anisa-hawes for all your support. Well done everyone!

@anisa-hawes
Contributor

Dear @ChantalMB + @jrladd,

I've applied the copyedits + perma.cc links to your lesson. Please let me know if you're happy with the adjustments. See Update interrogating-national-narrative-gpt.md.

Note that perma.cc can't satisfactorily capture URLs that link directly to image files, so you'll see that the two URLs at lines 216 (convex function) and 228 (local minimum) of the Markdown file remain as links to the live web.

Could you also add alt-text to your (very beautiful and expressive!) figure images? The syntax to use is: {% include figure.html filename="file-name.png" alt="Visual description of figure image" caption="Caption text to display" %}. One thing to note is that Markdown styling should not be included within your alt-text, because screen readers read the characters directly (so bold text is read out including its asterisks).

Last thing is a new step in our workflow: we are now asking authors and translators to complete a declaration form to acknowledge their copyright and grant us permission to publish. I already have your email @ChantalMB, so I'll send you the form in a moment.

--

@jrladd, I can see that you've already found an image to represent this lesson. Thank you! A next step for you will be to prepare 2 Tweets for our Twitter Bot. Instructions for how to do that are here. Let me know if you have any questions! We can Tweet directly after publication, but the Bot will help us to publicise the lesson in the future.

Meanwhile, I'm aware that we would like to resolve the outstanding question of how to set up a banner logic that will enable readers to find lessons published as part of the PH/Jisc/TNA partnership. (cc. @drjwbaker)

Also tagging @hawc2 here, just to keep him up-to-date.

Very best to all,
Anisa

@jrladd
Collaborator

jrladd commented Aug 19, 2022

Thanks very much, @anisa-hawes! I just drafted the tweets in the spreadsheet. It looks like all we need now is a stable DOI/link to the lesson. Do we have that already, or does that come after publication?

@ChantalMB do you have a Twitter handle you'd like us to use in the tweets?

@ChantalMB
Contributor

ChantalMB commented Aug 23, 2022

@anisa-hawes I just made a pull request that adds alt text to the figures, and you should have gotten my email with the signed declaration forms!

@jrladd I just updated my Twitter account info so you can tag me @ chntlbrouz -- I tend to only use Twitter to lurk haha

@anisa-hawes
Contributor

anisa-hawes commented Sep 1, 2022

Dear @ChantalMB.

Thank you for preparing the alt text, and thank you to @hawc2 for merging in these edits.

I've received your Authorial Copyright form by email too – thank you.

@jrladd the DOI is allocated and added to the YAML by Alex (Managing Editor) and the link is live shortly after publication. I can add it onto the Twitter Bot spreadsheet then!

Looks like this lesson is ready for @hawc2's final read-through, and we can move forwards with publication 🙂

@drjwbaker
Member

Amazing job all. So looking forward to seeing this live.

@hawc2
Collaborator

hawc2 commented Sep 7, 2022

Thanks all, looking forward to publishing this shortly. One minor question: how come the difficulty for this is set to a 2? I would've said this is an Advanced lesson.

@jrladd
Collaborator

jrladd commented Sep 7, 2022

I think the difficulty was already set when I came on as editor, @hawc2, so I'll let others speak to the original intent. But the way I've thought about this lesson is that though its subject is complex (and very clearly explained!), the code for fine-tuning the model is not as involved as some of our other lessons. I'm fine with changing the difficulty if others think this is better as a 3.

@ChantalMB
Contributor

I personally aimed this lesson at an ambitious beginner, with the same reasoning that @jrladd states, but I'm the author so I'm not very objective haha. Thank you for your revisions, @hawc2!

@hawc2
Collaborator

hawc2 commented Sep 7, 2022

I see -- thanks for clarifying! It's fine to leave it as a 2, then, but I'll slightly tweak the lesson abstract so it says the lesson "introduces the basics" or something like that.

@anisa-hawes
Contributor

Hello @jrladd.

Just a quick note to say that I've removed the two Tweets you've prepared from the Google Sheet for now – I think we need to wait until the lesson is published to add these, otherwise the Bot might go ahead and Tweet them before we're ready!

When we're set up for publication and have a DOI, I'll add them back in! 🙂

@hawc2
Collaborator

hawc2 commented Oct 3, 2022

@anisa-hawes with the banner feature updated on the site, is this lesson now ready for me to review and publish?

@drjwbaker
Member

I think so.

@anisa-hawes
Contributor

Hello @hawc2. Yes, it's ready.

And I have the two Tweets (prepared by John) to slot into our Bot pipeline as soon as the lesson is published.

@hawc2
Collaborator

hawc2 commented Oct 13, 2022

Hi everyone, congratulations! This lesson has been published:

https://programminghistorian.org/en/lessons/interrogating-national-narrative-gpt

@lorellav

lorellav commented Oct 13, 2022 via email

@anisa-hawes
Contributor

Congratulations to all! 🎉

@jrladd, I've added the two Tweets you prepared to our Bot pipeline.

If anyone would like to share in immediate promotion: https://twitter.com/ProgHist/status/1580587171614195712

@ChantalMB
Contributor

Yay! Thanks to everyone for all of your help with getting this tutorial edited and published 😊🎉!

@jrladd
Collaborator

jrladd commented Oct 13, 2022

Congrats to @ChantalMB and everyone who worked on this, and thanks everybody for all the hard work!
