
Lesson proposal: Interrogating a National Narrative with Recurrent Neural Networks (PH/JISC/TNA) #418

Closed
tiagosousagarcia opened this issue Nov 2, 2021 · 61 comments
Assignees: tiagosousagarcia, anisa-hawes
Labels: 7. Publication · 2021/22-JiscTNA (articles submitted in answer to the PH/JISC/TNA call for papers) · English

Comments

@tiagosousagarcia
Contributor

tiagosousagarcia commented Nov 2, 2021

The Programming Historian has received the following proposal for a lesson on 'Interrogating a National Narrative with Recurrent Neural Networks' by @ChantalMB. The proposed learning outcomes of the lesson are:

  • Understanding how general-purpose neural networks like GPT-2 can be applied to the distant study of large corpora in a way that helps guide further close readings, while also acknowledging the technical and ethical flaws that come with using large-scale language models
  • Creating a workflow for performing large-scale computational analysis that works for the individual learner, by advancing their technical knowledge of the machine-learning software and hardware required to perform these kinds of tasks

In order to promote speedy publication of this important topic, we have agreed to a submission date of no later than 24/01/2022. The author(s) agree to contact the editor in advance if they need to revise the deadline.

If the lesson is not submitted by 24/01/2022, the editor will attempt to contact the author(s). If they do not receive an update, this ticket will be closed. The ticket can be reopened at a future date at the request of the author(s).

The main editorial contact for this lesson is @tiagosousagarcia.

Our dedicated ombudsperson is Ian Milligan (http://programminghistorian.org/en/project-team). Please feel free to contact him at any time if you have concerns that you would like addressed by an impartial observer. Contacting the ombudsperson will have no impact on the outcome of any peer review.

@tiagosousagarcia added the English, 0. Proposal, and 2021/22-JiscTNA (articles submitted in answer to the PH/JISC/TNA call for papers) labels on Nov 2, 2021
@tiagosousagarcia self-assigned this on Nov 2, 2021
@drjwbaker
Member

@svmelton and I discussed potential editors for this article. Sarah will seek an editor from the EN team when the article arrives. I note that @ChantalMB has emailed to report a technical issue completing the submission.

@drjwbaker
Member

This lesson has now been submitted and staged here https://programminghistorian.github.io/ph-submissions/en/lessons/interrogating-national-narrative-gpt and is ready for technical review prior to peer review. Many thanks to @ChantalMB for submitting on time!

@tiagosousagarcia
Contributor Author

The link for the staged submission has been updated and is now here: https://programminghistorian.github.io/ph-submissions/en/drafts/originals/interrogating-national-narrative-gpt

I'm starting the technical review now and will come back with comments, corrections, and suggestions shortly.

tiagosousagarcia added a commit that referenced this issue Feb 17, 2022
@tiagosousagarcia
Contributor Author

@ChantalMB, congratulations on an excellent tutorial -- this is a challenging and exciting topic (and something very close to my current work), and you've tackled it beautifully.

Apologies for the delay in moving your article forward -- there were a few personal reasons that kept me away from this for a while.

I've made an initial technical review in commit 757f056, making small changes as I went along. A summary of these changes:

Changes made

  • l. 30 -- changed single to double quotation marks (consistency) (and in other places, silently)
  • l. 30 -- added wiki link to machine learning
  • l. 30 -- added wiki link to artificial intelligence
  • l. 30 -- ie -> i.e.
  • l. 32 -- changed single to double quotation marks (consistency)
  • l. 36 -- "As the first half implies, " -> deleted (clarity)
  • l. 36 -- changed single to double quotation marks (consistency)
  • l. 38 -- "official" -> "officially"
  • l. 60 -- "ex" -> "e.g."
  • l. 64 -- added wiki link to multi-core processor
  • l. 68 -- "propriety" -> "proprietary"
  • l. 85 -- added link to conda environment
  • l. 183 -- "outline" -> "outlines"
  • l. 319 -- capitalised and italicised PH

Additionally, I suggest you consider the following before we send your article for peer review:

Suggestions

  • l. 42 -- add a local copy of the dataset to PH (can be done now, if no further changes to the dataset)
  • l. 42 -- add references to other PH tutorials on relevant methods (e.g., web scraping, data cleaning)
  • l. 44 -- explain what is meant by 'prefix functionality'
  • l. 46 -- requirements should probably appear earlier in the lesson (after overview?)
  • l. 52 -- "you may use an online service that offers cloud-based GPU computing" -- add a note to explain that a few examples will be discussed in more detail in a section below
  • ll. 85-95 -- should the virtual environment have a more descriptive name than "gpt2"? Users who have dabbled before (or may wish to dabble again) would benefit from a clearer label
  • l. 155 -- add link to more documentation on GPT-2's different options
  • ll. 185-188 -- add a few notes explaining why you chose those specific default values, more detail on units (i.e., are they all in steps?), and more detail on 'learning_rate' (see the sketch after this list for the parameters in question)
  • l. 197 -- Google Colab took 37 minutes for me; it might be worth being a little more general here (i.e., noting that execution times will vary) and just offering a minimum time (e.g., at least 20 minutes)
  • l. 258 -- link to example output text is missing
  • ll. 252-302 -- I really like this section. It presents a good example of how to use a generative language model to interrogate a media narrative. A few more things that I would like to see addressed here are a rough hit rate for useful generated text (i.e., how many generations until something interesting comes along) and your process for determining what makes a particular generation interesting. I understand that this is a thorny and complex issue that falls slightly outside the scope of the article, but it might be useful to acknowledge some of these complexities here. Another thing that could be addressed here (or elsewhere) is a hint at other scholarly uses for AI-generated text.
  • ll. 303-322 -- another excellent section. A couple of notes here: 1) it might be useful to point towards other text-generation AIs, some of which attempt to at least address some of the concerns around OpenAI's practices and base model (EleutherAI, for example); 2) the last paragraph (l. 321) reads more like a conclusion than an ethical discussion. I would probably separate it out and expand it slightly with this in mind.
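For context, here is roughly the fine-tuning call these last suggestions refer to. This is a minimal sketch assuming the lesson uses the gpt-2-simple library (whose finetune() parameters match the names discussed above); the corpus file name and hyperparameter values are illustrative placeholders, not the lesson's own.

```python
# Minimal GPT-2 fine-tuning sketch, assuming the gpt-2-simple library.
# "articles.txt" and all hyperparameter values are placeholders.
import gpt_2_simple as gpt2

gpt2.download_gpt2(model_name="124M")  # smallest GPT-2 model

sess = gpt2.start_tf_sess()
gpt2.finetune(
    sess,
    dataset="articles.txt",   # plain-text training corpus
    model_name="124M",
    steps=1000,               # units: training steps, one batch per step
    learning_rate=0.0001,     # scales each gradient-descent update
    sample_every=200,         # print a sample generation every 200 steps
    save_every=500,           # write a checkpoint every 500 steps
)
```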

Thank you for all your hard work!

@drjwbaker
Member

@ChantalMB Would you be able to get these small changes done in the next couple of weeks? If so, that'll give us some time to assign an editor (pinging @svmelton) before getting it out to peer review.

@svmelton
Contributor

Thanks, @drjwbaker! @jrladd will serve as an editor for this piece.

@ChantalMB
Contributor

Thanks @tiagosousagarcia for the initial technical review! @drjwbaker I should be able to fully review + get these changes done over the course of next week!

@anisa-hawes self-assigned this on Feb 18, 2022
@anisa-hawes
Contributor

(I am co-assigning myself here so that I can shadow the editorial process).

@drjwbaker
Member

@jrladd so pleased to have you as editor on this. Note that this article is part of a special series for which we have funding. As a result, @tiagosousagarcia and I will be offering additional support. For example, in addition to doing the technical edit, Tiago has identified potential peer reviewers. So do write to us (https://programminghistorian.org/en/project-team) when you are ready!

@ChantalMB
Contributor

@tiagosousagarcia (or anyone who may have the answer) In adding references to other PH tutorials, I've discovered that the "Intro to Beautiful Soup" tutorial has been retired. Is it acceptable to link to a retired tutorial? This is actually the tutorial I used to learn web scraping with Python, so it is the most accurate in terms of what I'm mentioning in my tutorial!

@tiagosousagarcia
Contributor Author

@ChantalMB -- in principle, I would avoid linking to retired tutorials. However, two things are worth taking into account: 1) the lesson was retired because the example used was no longer available, rather than because the technology itself has been superseded; and 2) I don't think (I may be wrong) we have another entry-level Beautiful Soup tutorial available. So it might be OK in this case, as long as the link is contextualised. I may be very wrong though -- @drjwbaker, @jrladd, do we have a strict policy about this?

@drjwbaker
Member

If we are sending people to a retired lesson to work through it, we shouldn't do that. If it is merely to reference a point there or to understand a principle, that is fine. Retired articles are retired because they no longer work (and can no longer be made to work), rather than because there is anything terrible about what they were trying to achieve when active.

@tiagosousagarcia
Contributor Author

Hi @ChantalMB -- I wonder if we could get a sense of when you expect to have these revisions ready?

@ChantalMB
Contributor

ChantalMB commented Mar 15, 2022

@tiagosousagarcia The revised tutorial is ready now, actually! Just wrapped up everything today -- apologies for the delay, I got knocked out by a cold for a week!

I also never got notified by GitHub that you and @drjwbaker had responded to my question, so thank you! I ended up linking the Beautiful Soup tutorial, but then also added a link to the "Automated Downloading with Wget" tutorial as an alternative resource for specific instruction, since wget can also be used to download web pages.
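For readers following along, a minimal sketch of the kind of single-page download those tutorials cover, assuming the requests and beautifulsoup4 packages; the URL and output file name are placeholders, not ones used in the lesson.

```python
# Download one web page and save its visible text; illustrative only.
import requests
from bs4 import BeautifulSoup

url = "https://example.com/article"  # placeholder URL
html = requests.get(url, timeout=30).text
soup = BeautifulSoup(html, "html.parser")

with open("article.txt", "w", encoding="utf-8") as f:
    f.write(soup.get_text(separator="\n", strip=True))
```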

I did quite a large edit re: the suggestions for lines 185-188, because expanding on the learning rate meant that I also had to explain gradient descent, so my tutorial now includes diagrams (a toy illustration of the idea appears after this comment). Should I be sending the revised article + attached images by email, or attached to a comment in this ticket?

Similarly, you stated that a local copy of my data could be made; if possible, I'd like to do that for the training data and also the output text (the missing link at line 258).

Thanks for your help in advance!
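To ground the gradient-descent point above, here is a toy sketch (not the lesson's code or diagrams) of the update rule that a learning rate controls.

```python
# Toy gradient descent: repeatedly step against the gradient of a loss.
# The learning rate scales each step: too large overshoots the minimum,
# too small makes training crawl.
def gradient_descent(grad, x0, learning_rate=0.1, steps=50):
    x = x0
    for _ in range(steps):
        x -= learning_rate * grad(x)
    return x

# Minimise f(x) = (x - 3)**2, whose gradient is 2 * (x - 3).
minimum = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)
print(round(minimum, 4))  # converges towards 3.0
```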

@tiagosousagarcia
Contributor Author

Great news, @ChantalMB! Hope you are feeling better now. The easiest option is probably to send me the corrected .md + additional files via email; I'll add them here and link the commit to the discussion. You can also make the changes via a pull request, but you'd still need to send me the additional files separately.

@ChantalMB
Contributor

@tiagosousagarcia Just sent everything your way via email!

tiagosousagarcia added a commit that referenced this issue Mar 16, 2022
@tiagosousagarcia
Contributor Author

@programminghistorian/technical-team or @anisa-hawes, I wonder if anyone could give me a hand here: since the location of the drafts has been changed, it seems that the link for local datasets is broken in the preview -- is there any trick to referencing it that I'm missing? Currently I have it as /assets/[LESSON-SLUG]/[FILE-NAME].EXT -- I know other lessons also suffer from this problem (at least #416)

@anisa-hawes
Contributor

Hello @tiagosousagarcia. Hmmm. This is strange... Let me take a look... When we made changes to the directories where the lesson .md files are saved, we didn't make any changes to the images or assets directories. The URL format should indeed be:

/assets/lesson-slug-here/asset-file-name.ext

@anisa-hawes
Contributor

Ah, so when the lesson is moved over to Jekyll for publication, we update any /assets or other internal links so that they are 'relative' links. Until then, we need to use full links, i.e., https://github.com/programminghistorian/ph-submissions/blob/gh-pages/assets/interrogating-national-narrative-gpt/articles.txt.

@jrladd
Collaborator

jrladd commented Mar 18, 2022

Thanks so much for these revisions, @ChantalMB. Just a quick note to officially introduce myself--I'm glad to be working with you all on this! We've reached out to potential reviewers, and once we know more I'll be back in touch about next steps.

@jrladd
Collaborator

jrladd commented Jul 28, 2022

Thanks for all your hard work on this, @ChantalMB! I think this looks great, and we're almost ready to move to the next step.

I fixed a couple of typos, added links to your files, and took care of a problem with footnote formatting. Those changes can be found in these two commits. Here's a complete list of what I changed:

  • Added links to the articles.txt file in assets in paragraphs 10, 41. (@anisa-hawes These links aren't currently working correctly. Did I put them in wrong?)
  • Removed "in the field of" in paragraph 6
  • Added "comes from" to "The code used for scraping articles comes from this tutorial" in paragraph 11
  • Changed "cloud GPU serve" to "cloud GPU service" in paragraph 35
  • Added "of" to "composed of two parts" in paragraph 63
  • Changed "withdrawl" to "withdrawal" in paragraph 68
  • Fixed subscript syntax for CO2 in paragraph 92
  • Fixed syntax for endnote 18 in paragraph 94
  • Fixed all endnote syntax in the endnotes section

Have a look at these when you get a chance, and let me know if it looks good. I also have two quick questions:

  • In paragraph 10, you have missing links for [INSERT SCRAPING LINK] and [INSERT SCRAPING ARTICLES]. If you have the links for these, I can add them.
  • In paragraph 65, there's a missing link for lesson output. Is this a link to a webpage you can send me, or is it an additional file you want us to add? Either way, I'm happy to add this, too.

Lastly, I'll need a brief author bio from you. This is just your name as you'd like it to appear on the article, your ORCID if you have one (completely optional), and a one- or two-sentence bio. This tutorial has a few example bios at the bottom (including my own). You can add your bio here or email it to me: whichever you prefer is fine.

Thanks again—excited to be in the home stretch with this!

@anisa-hawes
Contributor

Thank you, @jrladd!

I've updated the links at paragraphs 10 and 41. Until the asset files are moved over to our Jekyll repository for publication, we need to provide the full link to the assets folder in ph-submissions (rather than the 'relative' link for internal pages). So, in this case the link required for now is: https://github.com/programminghistorian/ph-submissions/blob/gh-pages/assets/interrogating-national-narrative-gpt/articles.txt. I'll update it again ahead of publication!

Let me know if I can help you with any of the next steps, @jrladd & @ChantalMB.

@ChantalMB
Contributor

Thanks @jrladd and @anisa-hawes for your quick work!

The missing links were to files that should have been in the ZIP I emailed, but alas, I guess they were missing! I sorted out the issues I was having with GitHub (note: if you're teaching an "Intro to Git" lesson and generate a lot of PATs (personal access tokens) in one day during demos, this is apparently "bot-like behaviour" haha), so I was able to create a pull request with the added assets and edited links 👍

You can use this as my bio:

Chantal Brousseau is an MA candidate in the History and Data Science program at Carleton University. Her research focuses on applications of machine learning and computational creativity to the study of historical data.

And my ORCID is:

https://orcid.org/0000-0002-4649-059X

As always, let me know if there's anything else you need me to do!

@jrladd
Collaborator

jrladd commented Jul 29, 2022

Thanks again, @ChantalMB and @anisa-hawes! This looks great to me. (By the way, you did add those files to the ZIP, and I just misunderstood. Sorry about that, and thanks for uploading them!)

@svmelton @hawc2 This is ready to advance. The list of file locations you'll need is below. Here's a new author bio for this lesson:

- name: Chantal Brousseau
  team: false
  orcid: 0000-0002-4649-059X
  bio:
      en: |
          Chantal Brousseau is an MA candidate in the History and Data Science program at Carleton University. Her research focuses on applications of machine learning and computational creativity to the study of historical data.

File locations:

  • Lesson file: en/drafts/originals/interrogating-national-narrative-gpt.md
  • Image files: images/interrogating-national-narrative-gpt
  • Asset files: assets/interrogating-national-narrative-gpt
  • Original avatar: gallery/originals/interrogating-national-narrative-gpt-original.png
  • Modified avatar: gallery/interrogating-national-narrative-gpt.png

Let me know if you need anything else, and when you're ready to merge I can add the tweets to the spreadsheet.

@svmelton
Contributor

Thank you @jrladd! Excited to see this move forward. @anisa-hawes I think this still needs copyediting. Do you have capacity for that?

@tiagosousagarcia
Contributor Author

Hello all -- just a quick note to say that this is my last week working for PH. It's been an absolute pleasure working on this, and I'm only sorry I'm not going to be around for its publication (from this side -- I'll definitely be reading and using it as a regular Joe). Big thanks to @ChantalMB for writing it, and @jrladd for taking it forward, as well as @svmelton and @anisa-hawes for all your support. Well done everyone!

@anisa-hawes
Contributor

Dear @ChantalMB + @jrladd,

I've applied the copyedits + perma.cc links to your lesson. Please let me know if you're happy with the adjustments. See Update interrogating-national-narrative-gpt.md.

Note that perma.cc can't satisfactorily capture URLs that link directly to image files, so you'll see that the two URLs at lines 216 (convex function) and 228 (local minimum) of the Markdown file remain as links to the live web.

Could you also add alt-text to your (very beautiful and expressive!) figure images? The syntax to use is: {% include figure.html filename="file-name.png" alt="Visual description of figure image" caption="Caption text to display" %}. One thing to note is that Markdown styling should not be included within your alt-text, because screen readers read the characters directly (so bold text is read out including its asterisks).

Last thing is a new step in our workflow: we are now asking authors and translators to complete a declaration form to acknowledge their copyright and grant us permission to publish. I already have your email @ChantalMB, so I'll send you the form in a moment.

--

@jrladd, I can see that you've already found an image to represent this lesson. Thank you! A next step for you will be to prepare 2 Tweets for our Twitter Bot. Instructions for how to do that are here. Let me know if you have any questions! We can Tweet directly after publication, but the Bot will help us to publicise the lesson in the future.

Meanwhile, I'm aware that we would like to resolve the outstanding question of how to set up a banner logic that will enable readers to find lessons published as part of the PH/Jisc/TNA partnership. (cc. @drjwbaker)

Also tagging @hawc2 here, just to keep him up-to-date.

Very best to all,
Anisa

@jrladd
Collaborator

jrladd commented Aug 19, 2022

Thanks very much, @anisa-hawes! I just drafted the tweets in the spreadsheet. It looks like all we need now is a stable DOI/link to the lesson. Do we have that already, or does that come after publication?

@ChantalMB do you have a Twitter handle you'd like us to use in the tweets?

@ChantalMB
Contributor

ChantalMB commented Aug 23, 2022

@anisa-hawes I just made a pull request that adds alt text to the figures, and you should have gotten my email with the signed declaration forms!

@jrladd I just updated my Twitter account info so you can tag me @ chntlbrouz -- I tend to only use Twitter to lurk haha

@anisa-hawes
Contributor

anisa-hawes commented Sep 1, 2022

Dear @ChantalMB.

Thank you for preparing the alt text, and thank you to @hawc2 for merging in these edits.

I've received your Authorial Copyright form by email too – thank you.

@jrladd the DOI is allocated and added to the YAML by Alex (Managing Editor) and the link is live shortly after publication. I can add it onto the Twitter Bot spreadsheet then!

Looks like this lesson is ready for @hawc2's final read-through, and we can move forwards with publication 🙂

@drjwbaker
Member

Amazing job all. So looking forward to seeing this live.

@hawc2
Collaborator

hawc2 commented Sep 7, 2022

Thanks all, looking forward to publishing this shortly. One minor question: how come the difficulty for this is set to a 2? I would've said this is an Advanced lesson.

@jrladd
Collaborator

jrladd commented Sep 7, 2022

I think the difficulty was already set when I came on as editor, @hawc2, so I'll let others speak to the original intent. But the way I've thought about this lesson is that though its subject is complex (and very clearly explained!), the code for fine-tuning the model is not as involved as some of our other lessons. I'm fine with changing the difficulty if others think this is better as a 3.

@ChantalMB
Contributor

I personally aimed this lesson at an ambitious beginner, with the same reasoning that @jrladd states, but I'm the author so I'm not very objective haha. Thank you for your revisions, @hawc2!

@hawc2
Collaborator

hawc2 commented Sep 7, 2022

I see -- thanks for clarifying! It's fine to leave it as a 2, then, but I'll slightly tweak the lesson abstract so it says the lesson "introduces the basics" or something like that.

@anisa-hawes
Contributor

Hello @jrladd.

Just a quick note to say that I've removed the two Tweets you've prepared from the Google Sheet for now – I think we need to wait until the lesson is published to add these, otherwise the Bot might go ahead and Tweet them before we're ready!

When we're set up for publication and have a DOI, I'll add them back in! 🙂

@hawc2
Collaborator

hawc2 commented Oct 3, 2022

@anisa-hawes with the banner feature updated on the site, is this lesson now ready for me to review and publish?

@drjwbaker
Member

I think so.

@anisa-hawes
Contributor

Hello @hawc2. Yes, it's ready.

And I have the two Tweets (prepared by John) to slot into our Bot pipeline as soon as the lesson is published.

@hawc2
Collaborator

hawc2 commented Oct 13, 2022

Hi everyone, congratulations! This lesson has been published:

https://programminghistorian.org/en/lessons/interrogating-national-narrative-gpt

@lorellav

lorellav commented Oct 13, 2022 via email

@anisa-hawes
Contributor

Congratulations to all! 🎉

@jrladd, I've added the two Tweets you prepared to our Bot pipeline.

If anyone would like to share in immediate promotion: https://twitter.com/ProgHist/status/1580587171614195712

@ChantalMB
Contributor

Yay! Thanks to everyone for all of your help with getting this tutorial edited and published 😊🎉!

@jrladd
Collaborator

jrladd commented Oct 13, 2022

Congrats to @ChantalMB and everyone who worked on this, and thanks everybody for all the hard work!
