Review Ticket for Fetching and Parsing Data from the Web with OpenRefine #69

Closed
jerielizabeth opened this issue Apr 26, 2017 · 41 comments

@jerielizabeth
Contributor

The Programming Historian has received the following tutorial on 'Fetching and Parsing Data from the Web with OpenRefine' by @evanwill. This lesson is now under review and can be read at:

http://programminghistorian.github.io/ph-submissions/lessons/fetch-and-parse-data-with-openrefine

Please feel free to use the line numbers provided on the preview if that helps with anchoring your comments, although you can structure your review as you see fit.

I will act as editor for the review process. My role is to solicit two reviews from the community and to manage the discussions, which should be held here on this forum. I have already read through the lesson and provided feedback, to which the author has responded.

Members of the wider community are also invited to offer constructive feedback, which they should post to this message thread, but they are asked to first read our Reviewer Guidelines (http://programminghistorian.org/reviewer-guidelines) and to adhere to our anti-harassment policy (below). We ask that all reviews stop after the second formal review has been submitted so that the author can focus on any revisions. I will make an announcement on this thread when that has occurred.

I will endeavor to keep the conversation open here on Github. If anyone feels the need to discuss anything privately, you are welcome to email me. You can always turn to @ianmilligan1 or @amandavisconti if you feel there's a need for an ombudsperson to step in.

Anti-Harassment Policy

This is a statement of the Programming Historian's principles and sets expectations for the tone and style of all correspondence between reviewers, authors, editors, and contributors to our public forums.

The Programming Historian is dedicated to providing an open scholarly environment that offers community participants the freedom to thoroughly scrutinize ideas, to ask questions, to make suggestions, or to request clarification, but also provides a harassment-free space for all contributors to the project, regardless of gender, gender identity and expression, sexual orientation, disability, physical appearance, body size, race, age or religion, or technical experience. We do not tolerate harassment or ad hominem attacks of community participants in any form. Participants violating these rules may be expelled from the community at the discretion of the editorial board. If anyone witnesses or feels they have been the victim of the above described activity, please contact our ombudspeople (Ian Milligan and Amanda Visconti - http://programminghistorian.org/project-team). Thank you for helping us to create a safe space.

@jerielizabeth
Contributor Author

To start out, this is really an excellent lesson and a very valuable addition! It provides an excellent intermediary point for people who are interested in manipulating digital materials but don't have time to learn an entire programming language first.

I have a couple of comments and suggestions for some additional images and clarifications to get us started, and then I'll set about to recruit reviewers for this lesson.

General Comments:

  • The lesson covers a lot of ground and builds on itself well. Most of my suggestions involve adding more images to help clarify the task and keep the reader oriented in the interface. I think you did a good job of explaining the key concepts, but I have also used OpenRefine in the past, so it will be interesting to hear from the reviewers if there is assumed knowledge that should be fleshed out.
  • In thinking about the difficulty level for this lesson, I think we should pitch this as Intermediate at least, if not Advanced. The existing OpenRefine lesson is listed as Intermediate.

Line Comments:

  • ¶ 3 -- Our styling on the quote boxes tends more to the warning box than the aside, so I would use them very sparingly for places where you need the reader to slow down and pay attention.
  • ¶ 6 -- Though I don't think OpenRefine updates much, it would be useful for future stability to mention which version of OpenRefine you are using in this tutorial.
  • ¶ 11 -- For long-term stability, we should probably store and serve this file from the PH repo in our 'assets' folder. Similar to images, we'll create a directory that corresponds to your lesson and store it there.
  • ¶ 13 -- For consistency, I'd recommend providing users with a suggested project name, in this case "Sonnets"
  • ¶ 19 -- An annotated image for this step would help clarify what you mean.
  • ¶ 26-28 -- These might read more clearly as bullet points under ¶ 25.
  • ¶ 30 -- It would be helpful to clarify which expression we're expected to have at this point.
  • ¶ 45 -- An image of this step and the resulting table would be helpful for clarifying.
  • ¶ 61 -- Similar to before, I would recommend a project name so that it's consistent.
  • ¶ 64 -- An image of the step would be useful for clarifying
  • ¶ 89 -- While I realize it wouldn't be much different than the image above, I would include an image of this step as well, just to keep the reader oriented.
  • ¶ 90 -- For readers who don't know Python, it would be helpful to show how adding the throttle delay would look.
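
For readers following this thread, a rough sketch of what a throttle delay can look like in plain Python. This is not the lesson's code: `throttled_fetch`, `fetch`, and `delay_seconds` are hypothetical names, and `fetch` stands in for whatever function actually requests a URL.

```python
import time


def throttled_fetch(urls, fetch, delay_seconds=1.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing between requests.

    fetch and sleep are passed in (hypothetical design) so the pause can
    be swapped out or tested without touching the network.
    """
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            # Wait before every request except the first one.
            sleep(delay_seconds)
        results.append(fetch(url))
    return results


if __name__ == "__main__":
    # Stand-in for a real request: it only echoes the URL it was given.
    pages = throttled_fetch(
        ["http://example.com/a", "http://example.com/b"],
        fetch=lambda url: "fetched " + url,
        delay_seconds=0.1,
    )
    print(pages)
```

The point of the pause is simply to avoid sending requests faster than the remote service allows; the right delay depends on that service's limits.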

@evanwill, those are my suggestions to start. I am happy to clarify or talk through any of them as needed. Once you've had a chance to make these revisions, I'll recruit two external reviewers.
Is one month (deadline of May 26) sufficient time for this first round?

@evanwill
Contributor

Thank you for the comments!
I made edits to the text addressing most of the "Line Comments" suggestions above, pushed in this commit: 067f69c

However, I have not created additional images yet, and have to think more about the use of blockquotes/asides. I will push those edits in a separate commit.

@jerielizabeth
Contributor Author

Thanks, @evanwill! The changes so far look great!

I'll wait for the next commit and then recruit reviewers.

@evanwill
Contributor

evanwill commented May 3, 2017

I added a few more images and edits, I believe addressing @jerielizabeth 's initial suggestions, in commit ffe736f

@jerielizabeth
Contributor Author

@evanwill Thank you for the edits! I think the lesson is in good shape to bring in external reviewers.

@ettorerizza

ettorerizza commented May 18, 2017

Very good tutorial ! In paragraph 66, "Values from the same row in other columns can be retrieved using cells['column name'].value. ", it may be worth clarifying that the square brackets notation may be replaced by a dot notation when the column name or the path doesn't contain any space. This makes writing a lot easier, since value.parseJson()['geonames'][0]['alternateNames'][0]['name'] can be written value.parseJson().geonames[0].alternateNames[0].name.

@evanwill
Contributor

thanks @ettorerizza.

I had an aside about bracket versus dot notation plus not using spaces in column and key names in an earlier draft, but ended up removing it to avoid introducing too many concepts at once. Since bracket notation works for all column or key names, I just went with that version to simplify.

I will think about how to get that back in somewhere, because I notice that in paragraph 91 I mention that dot notation is replaced by brackets for Jython.
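
To illustrate the Jython side of this point: in Python, parsed JSON becomes dictionaries and lists, which are indexed with brackets rather than dots. The sketch below uses the standard `json` module and a made-up sample string shaped like the path quoted above; it is not real API output and not the lesson's code.

```python
import json

# Made-up sample shaped like the path quoted above; not real API output.
raw = '{"geonames": [{"alternateNames": [{"name": "Example Name"}]}]}'

data = json.loads(raw)

# Python dictionaries are indexed with brackets; data.geonames would raise
# an AttributeError because dictionary keys are not attributes.
name = data["geonames"][0]["alternateNames"][0]["name"]
print(name)
```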

@jerielizabeth
Contributor Author

Thank you, @ettorerizza!

A quick status update on the reviews for this lesson -- I am still recruiting reviewers. I have one tentatively agreed, and the twitter appeal has generated a lot of interest, but no volunteers (perhaps @ettorerizza excepted). :) So, still working on it!

@jerielizabeth
Contributor Author

Update on reviewers: @peggygriesinger is the first reviewer on the lesson, with a review due date of June 16.

@jerielizabeth
Contributor Author

Update on reviewers: @ljlow has agreed to be the second reviewer on the lesson, with a due date of June 28

@peggygriesinger

This is a fascinating, very well explained, and really useful lesson. I learned a ton, so thank you! I'm already excited to share this with other DH enthusiasts. Overall I think you do a great job of gradually increasing the difficulty. I agree with @jerielizabeth that this should probably be classified as Advanced. My knowledge of OpenRefine definitely helped me perform the more difficult tasks without having to stop to figure out the simpler underlying steps. I was impressed with how much use you got out of the clipboard option for creating a project - I'd always been baffled as to why it was even there, but you demonstrated multiple good uses for it. I also appreciated the "sanity checks" - they are a great way to ensure you're on track with the lesson.

I was also able to go back to the Sonnets project at the end, and separate out the sentiments values into columns that just displayed "Positive/Negative/Neutral" - so great job explaining those concepts in Example 2! I was able to apply it on my own based on this tutorial.

Some line comments:

  • ¶ 32 - You might want to make it clearer that users shouldn't actually click Ok at this point, or else offer a quick suggestion to undo using the undo/redo tab if people did.
  • ¶ 47 - I think possibly a screenshot could be a helpful indicator that a step should be taken at this moment.
  • ¶ 48-50 - I got a little lost in this section. It was unclear to me when the explanation of a step was complete and when I should start actually transforming the data myself, as opposed to when you were explaining the interim steps and coding. For example, it took me a moment to understand that the expression in ¶ 50 was building off the expression in ¶ 49, and that they were not separate expressions/steps.
  • ¶ 50 - I'm not sure if I screwed up somewhere to cause this, but I had to change any instance of "<br>" in the code to "<br />" - this is also what you show in screenshot ¶51
  • ¶ 68 - I believe 'express' should be 'expression'.
  • ¶ 80 - A screenshot of how the spreadsheet should look with all the new columns would be helpful.
  • ¶ 95 - It would be useful to make a note that users should clear their expression box before starting these steps, and not yet create the new sentiments column.

@evanwill I hope these suggestions are helpful. I'm happy to discuss them with you further if you'd like more clarification.

@jerielizabeth
Contributor Author

@peggygriesinger Thank you so much for your review!

@evanwill feel free to follow up with @peggygriesinger if you need clarification or would like to discuss possible ways of addressing her suggestions. However, please don't make any changes to the text until we have heard from @ljlow.

Thanks all!

@ljlow

ljlow commented Jun 19, 2017

@evanwill, what a fantastic lesson. Thank you for writing this! I felt the lesson was well-organized and contained the right amount of screenshots to follow along easily. The lesson also demonstrated why OpenRefine is a valuable tool and the examples provided an introduction to how it could be used in a research workflow.

I feel the difficulty level of the lesson is intermediate. My only experience with OpenRefine comes from working through the other lesson on Programming Historian, but I felt that the explanations included were clear and I had no trouble following along step by step.

I do feel, however, that having certain prior knowledge of some concepts would make this lesson more valuable for readers. Working knowledge of HTML and programming basics would be good prerequisites, but not required to follow the steps in this lesson. Linking to the Programming Historian series of lessons on Python might be a good idea.

A few line comments:

¶ 15 - As I was reading this, I wondered why we were pasting the url as a clipboard option instead of a url option. I spent a minute playing around with the url option just to see the difference in results. I think a brief sentence here explaining the use of clipboard vs url would be useful.

¶ 27 - I don't think 'Delete value' is necessary here.

¶ 29 - It might be useful to include a sentence here pointing out what is different between the two value columns; something along the lines of 'Notice tags like <!DOCTYPE> and <title> have been removed, leaving only tags.' Unless you are accustomed to looking at html, it can look like a jumble of code, especially in the small value preview window.

¶ 31 - I really like this. Good discussion of GREL function syntax and encouragement to experiment.

¶ 36 - Good reminder here to check that things are consistent as you transform the data.

¶ 82 - I think the formatting here should match the formatting in 66.

¶ 105 - Would it be useful to also link to the lesson on Supervised Classification here? http://programminghistorian.org/lessons/naive-bayesian

¶ 107 - Great discussion here of the importance of investigating metrics produced by algorithms. I love this sentence in particular: 'This is not a new technical skill, but an application of the historian's traditional expertise, not unlike interrogating physical primary materials to unravel bias and read between the lines.' I feel compelled to add, though, that unlike physical primary materials, digital tools are customizable. If an algorithm is not producing useful results, it could mean that it is the incorrect algorithm for that situation. Or it could simply mean that the algorithm needs to be tuned or trained differently. You correctly point out in 106 that both APIs are not suited for the sonnets because they have been trained using modern English. A sentiment analysis tool could be trained to analyze Shakespeare, though. Many different techniques exist to optimize machine learning algorithms for a given scenario; from choosing the right training data set to adjusting the algorithm itself. Naive Bayes has many implementations, Gaussian, Multinomial, and Bernoulli, for example, and one may be more suited to a certain problem than others. These are important distinctions to investigate and keep in consideration when using machine learning tools.
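
As a rough illustration of what training a classifier on your own data can mean, here is a toy multinomial Naive Bayes written with only the Python 3 standard library. Everything in it is an assumption for demonstration: the function names `train` and `classify` are hypothetical, the four training phrases are invented, and it is not how the APIs used in the lesson work internally.

```python
import math
from collections import Counter, defaultdict


def train(labeled_texts):
    """Count words per label from (text, label) pairs."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in labeled_texts:
        label_counts[label] += 1
        word_counts[label].update(text.lower().split())
    return word_counts, label_counts


def classify(text, word_counts, label_counts):
    """Return the label with the highest log-probability score."""
    vocabulary = set()
    for counts in word_counts.values():
        vocabulary.update(counts)
    total_docs = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label in sorted(label_counts):
        # Prior: share of training texts carrying this label.
        score = math.log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            # Add-one (Laplace) smoothing so unseen words do not zero the score.
            count = word_counts[label][word] + 1
            score += math.log(count / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Invented toy training data, far too small for real use.
training = [
    ("sweet love and beauty", "positive"),
    ("fair summer day", "positive"),
    ("death and decay", "negative"),
    ("cruel winter grief", "negative"),
]
word_counts, label_counts = train(training)
print(classify("sweet summer love", word_counts, label_counts))
```

Swapping the training phrases for period-appropriate labeled text is the kind of tuning described above; a real project would need far more data and a tested library.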

@evanwill thank you again for sharing this lesson; I've really enjoyed reviewing it!

@evanwill
Contributor

@peggygriesinger and @ljlow thank you for the detailed reviews!
I will work on integrating your suggestions this week.

I appreciate your input and help!

@jerielizabeth am I okay to start pushing changes to the text at this point?

Thank you.

@evanwill
Contributor

@ljlow re: ¶ 107 algorithm discussion.

I think this is really important too, and appreciate your discussion--Almost want to paste it directly into the lesson! I don't want to try to explain everything, but just highlight that we need to be critical and investigative to understand what the algorithms offer. I like your suggestion to add that we can also be DIY.

I was originally thinking of including a more in depth discussion, but I was afraid it got too distracting from the practical flow of the tutorial. For example, I had a link to parts of NLTK book about training your own classifier in Python, but it seemed too specific.

Do you have any suggestions for a concise introduction to text analysis algorithms and implementations that I could point to?

thank you

@jerielizabeth
Contributor Author

Thank you again to both @peggygriesinger and @ljlow for these excellent and helpful reviews.

@evanwill Reading through the review suggestions, I think they are all on point and helpful. My only addition is to note that you are of course welcome to address the concerns noted differently than suggested if you have another way to solve them.

Chiming in on the ¶107 discussion, I like the idea of adding a sentence to note that, while convenient algorithm services are often trained on data that is not well suited for humanities data and so do not perform well on our corpuses, they can be improved or implemented differently. Perhaps it could be put in the context of a "call" to keep learning? One advantage of doing data gathering and parsing with OpenRefine is that it lowers the coding barrier, but that might be a nice way to gesture toward what can be added as the reader learns more. If you want to include links to additional training, you could include a "Next Steps" section at the end to suggest ways to build on the skills you're teaching here.

My one other suggestion, drawing on @ljlow 's comment about prior knowledge, is that you add a bit to the front section that notes the things you expect readers to know prior to the lesson. And if there are existing PH lessons on the topics, linking to them would be great to point readers to places to gain that background knowledge.

@evanwill Go ahead and push up your changes. Once you're done, I will read through one more time for copy editing and then we will move forward with publishing!

Thanks, all!

@ljlow

ljlow commented Jun 22, 2017

@evanwill I agree that a detailed discussion on algorithms would distract from the practical flow of the tutorial. I think @jerielizabeth's suggestion of adding a brief sentence framed in the context of a call to keep learning would be a great fit in ¶107. And a link to the NLTK book would certainly fit in a Next Steps section.

re: a concise introduction to text analysis algorithms - Not sure if this fits the bill, but I remember really appreciating this image from scikit-learn when I was new to programming: NLTK and scikit-learn are used together frequently, so it might be worth including in a Next Steps section.

Thank you again; I'm looking forward to seeing the finished article!

@evanwill
Contributor

evanwill commented Jul 7, 2017

Thanks for all the input!
Sorry I haven't had a chance to work on this yet (due to general summer-time-ness), but hope to get going next week.

thank you

@evanwill
Contributor

evanwill commented Jul 18, 2017

@peggygriesinger ,

I figured out the issues with <br> versus <br /> in the code and screen shots--the original ebook used <br>, but Refine's parser replaces them with <br />. I must have rewritten the code snippets after looking at the ebook markup, despite doing it correctly in the screen shots.

To avoid the issue, I replaced all the <br> in the ebook file with <br />, and updated the snippets to <br />.
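
For anyone who hits the same mismatch in their own source file, one possible way to normalize the tags is a short Python substitution. This is only a sketch of the idea, not the method used for the ebook file; `normalize_br` is a hypothetical helper name.

```python
import re


def normalize_br(html):
    """Rewrite <br>, <br/>, and <BR> style tags as '<br />'."""
    return re.sub(r"<br\s*/?>", "<br />", html, flags=re.IGNORECASE)


sample = "Line one<br>Line two<br/>Line three<br />"
print(normalize_br(sample))
```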

Sorry for the confusion!

@evanwill
Contributor

@jerielizabeth I pushed my edits and a few new screenshots today. I think they clarify all the points raised by the reviewers. I will give it another look over in the next couple days, but I wanted to check in about a few things:

I need to update the link to the ebook (currently), which I think translates to the main site like:
http://programminghistorian.org/assets/fetch-and-parse-data-with-openrefine/pg1105.html
Is this correct?

What is the procedure for adding author data?

Thank you!

@jerielizabeth
Contributor Author

Thanks, @evanwill! Let me know when you're ready for me to go through one last time for copy editing. For the link, is the ebook the only file which we will be hosting for the lesson? If so, I can update the link when we move the lesson over.

As far as the author data goes, if you would send me a 1 to 2 line bio, I will add it to the necessary files. You can either add it here or email it to me!

Thanks!

@evanwill
Contributor

The ebook is the only file, but I wanted to update the screenshots that feature the link (since people tend to look very closely at the details and I don't want confusion).

@evanwill
Contributor

Stock bio:

Evan Peter Williamson is the Digital Infrastructure Librarian at University of Idaho Library, working with Data & Digital Services to bring cool projects, enlightening workshops, and innovative services to life. Despite a background in Art History, Classical Studies, and Archives, he always manages to get involved in all things digital.

@ettorerizza

ettorerizza commented Jul 20, 2017

For the sake of completeness, since the tutorial is in the advanced category, I wonder if it would not be worth adding a footnote stating that Jython can only import modules contained in the Python Standard Library (such as urllib), but that it's also possible to install third-party modules (for instance requests or anything that's not coded in C language, as is unfortunately the case of NLTK) by following this tutorial.

Example:

screenshot-localhost-3333-2017-07-20-10-07-47

@evanwill
Contributor

evanwill commented Jul 20, 2017

thanks @ettorerizza, agreed:

However, it is already there! Look in the blockquote box below the intro to Example 3 (paragraph 92).

@evanwill
Contributor

@jerielizabeth I am done with my author revision / edits.

Let me know if there is anything else I can improve.

Thank you to everyone (@ettorerizza , @peggygriesinger, @ljlow ) for the helpful reviews and comments.

@jerielizabeth
Contributor Author

Thank you, @evanwill! I will do my final editorial check this week. Are you comfortable receiving minor edits (should there be any) via a pull request?

@evanwill
Contributor

yep, thanks @jerielizabeth

@jerielizabeth
Contributor Author

Hi @evanwill! I am working on the final editing pass - sorry for the delay. In addition, I just found a bug with pulling the file down from our server that we're working to fix. But I should have the final edits to you soon.

@evanwill
Contributor

evanwill commented Aug 6, 2017

In my last commit I changed the sonnets download link to what should be its future home on the main site, so it won't work until the lesson is pushed over there. Is that what you mean?

@jerielizabeth
Contributor Author

@evanwill Thank you! not quite -- I moved the file over ahead of the lesson to do one last check. We made a recent change to our servers to go to https and it had the unexpected consequence of blocking traffic from Refine. All is well now.

I just submitted a pull request with some final edits. In addition, in ¶2: "programming concepts" is a bit too big to be useful. Are there particular concepts you would encourage people to encounter first, such as a "loop" or variables? Is there a Programming Historian lesson (or other lesson) we can link to for those concepts?

@jerielizabeth
Contributor Author

Thoughts on images for the lesson:

https://flic.kr/p/oegZJw
https://flic.kr/p/oxLVzZ

I feel like distilling is a good metaphor here :D

@evanwill
Contributor

evanwill commented Aug 7, 2017

@jerielizabeth thank you for the final edits--very helpful little details!

I love the distilling image idea, I think image one, https://flic.kr/p/oegZJw

I pushed three little final commits:

  • updated one spot about removing a column, where it referred to using the "All" menu, but should have been just the column menu.
  • updated my full name
  • tweaked the pre-req section "programming concepts"--what you mention above makes sense. Basic familiarity with variables is probably the most important thing, maybe loops. I think arrays are introduced in the lesson. I wasn't sure what to link to: while there are a lot of lessons where you can learn by doing, there isn't exactly a lesson that focuses on basic programming vocabulary. I like this Library Carpentry Introduction to programming with Python, but it's still a draft, so I'm not sure I want to link to it. If you have a suggestion, feel free to add it; otherwise, I am okay without a link there.

Thank you for all your work on this (and Programming Historian in general)!

@jerielizabeth
Contributor Author

Hi @evanwill! the updates look good to me. I think we can leave the general concepts without a link for now (I am also hesitant about linking to a draft lesson.)

An update on the publishing front: I seem to have triggered some sort of error on our primary site and it is not building at the moment. However, as soon as that is back up and working, we should be rolling out the lesson.

@mdlincoln
Contributor

It hardly seems to be @jerielizabeth's doing! But GitHub is looking into it. @evanwill, you can keep tabs on the conversation here: programminghistorian/jekyll#565

@evanwill
Contributor

oh boy, always a bit of excitement....
thanks!

@mdlincoln
Contributor

mdlincoln commented Aug 22, 2017

@evanwill and @jerielizabeth I'm happy to say that I found a workaround for the site build problems, and https://programminghistorian.org/lessons/fetch-and-parse-data-with-openrefine is now live!

I'm very sorry that this took so long... but rest assured, I blame it all on GitHub's opaque meddling with the build process 😉 We are totally blameless for this delay.

@jerielizabeth
Contributor Author

Party!! Thank you again @mdlincoln!

Since we shuffled things around a few times trying to get the site to build, we should double check that everything looks as intended before I publicize that the lesson is up. I'll give it a read through, and @evanwill, if you would be willing to look things over at some point today and confirm that the lesson looks as intended, that would be great.

If everything looks good, we will announce it!

@evanwill
Contributor

Thank you @mdlincoln and @jerielizabeth, everything looks good.

I appreciate all the Jekyll / gh-pages sleuthing--always fascinating adventures!

thank you again!

@mdlincoln
Contributor

Careful @evanwill if you reveal yourself to be too enthusiastic about Jekyll, you'll be dragooned onto our editorial team 😉

@jerielizabeth
Contributor Author

Thank you everyone (@evanwill @peggygriesinger @ljlow and @ettorerizza) for your excellent work on this lesson! Working on this lesson has been a pleasure for me and I am very excited to see it out in the world.

Please keep tweeting about it so that it gets the attention it deserves!

I am closing this issue as we are now live!
