Review Ticket for Fetching and Parsing Data from the Web with OpenRefine #69
Comments
To start out, this is really an excellent lesson and a very valuable addition! It provides an excellent intermediary point for people who are interested in manipulating digital materials but don't have time to learn an entire programming language first. I have a couple of comments and suggestions for some additional images and clarifications to get us started, and then I'll set about to recruit reviewers for this lesson. General Comments:
Line Comments:
@evanwill, those are my suggestions to start. I am happy to clarify or talk through any of them as needed. Once you've had a chance to make these revisions, I'll recruit two external reviewers. |
Thank you for the comments! However, I have not created additional images yet, and have to think more about the use of blockquotes/asides. I will push those edits in a separate commit. |
Thanks, @evanwill! The changes so far look great! I'll wait for the next commit and then recruit reviewers. |
I added a few more images and edits, I believe addressing @jerielizabeth 's initial suggestions, in commit ffe736f |
@evanwill Thank you for the edits! I think the lesson is in good shape to bring in external reviewers. |
Very good tutorial! In paragraph 66, "Values from the same row in other columns can be retrieved using cells['column name'].value. ", it may be worth clarifying that the square bracket notation may be replaced by dot notation when the column name or the path doesn't contain any spaces. This makes writing a lot easier, since |
thanks @ettorerizza. I had an aside about bracket versus dot notation plus not using spaces in column and key names in an earlier draft, but ended up removing it to avoid introducing too many concepts at once. Since bracket notation works for all column or key names, I just went with that version to simplify. I will think about how to get that back in somewhere, because I notice that in paragraph 91 I mention that dot notation is replaced by brackets for Jython. |
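The bracket versus dot distinction discussed above can be illustrated with a rough Python analogy. This is only a sketch: the `Cells` class and the column names are hypothetical stand-ins, not OpenRefine's actual objects, and GREL's own rules may differ in detail.

```python
class Cells(dict):
    """Hypothetical stand-in for a row of cells (not OpenRefine's real object)."""

    def __getattr__(self, name):
        # Allow dot access as a shortcut for keys that are valid identifiers.
        try:
            return self[name]
        except KeyError:
            raise AttributeError(name)


# Assumed example data: one column name without a space, one with a space.
cells = Cells({
    "title": "Sonnet 1",
    "first line": "From fairest creatures we desire increase",
})

by_bracket = cells["title"]       # bracket notation always works
by_dot = cells.title              # dot notation works: no space in the name
with_space = cells["first line"]  # a name containing a space needs brackets
# cells.first line                # would be a syntax error, so it stays commented out
```

Bracket notation covers every name, which is the reason given above for using it throughout; dot notation is shorter but only applies to names without spaces.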
Thank you, @ettorerizza! A quick status update on the reviews for this lesson -- I am still recruiting reviewers. One has tentatively agreed, and the Twitter appeal has generated a lot of interest, but no volunteers (perhaps @ettorerizza excepted). :) So, still working on it! |
Update on reviewers: @peggygriesinger is the first reviewer on the lesson, with a review due date of June 16. |
Update on reviewers: @ljlow has agreed to be the second reviewer on the lesson, with a due date of June 28 |
This is a fascinating, very well explained, and really useful lesson. I learned a ton, so thank you! I'm already excited to share this with other DH enthusiasts. Overall I think you do a great job of gradually increasing the difficulty. I agree with @jerielizabeth that this should probably be classified as Advanced. My knowledge of OpenRefine definitely helped me perform the more difficult tasks without having to stop to figure out the simpler underlying steps. I was impressed with how much use you got out of the clipboard option for creating a project - I'd always been baffled as to why it was even there, but you demonstrated multiple good uses for it. I also appreciated the "sanity checks" - they are a great way to ensure you're on track with the lesson. I was also able to go back to the Sonnets project at the end, and separate out the sentiments values into columns that just displayed "Positive/Negative/Neutral" - so great job explaining those concepts in Example 2! I was able to apply it on my own based on this tutorial. Some line comments:
@evanwill I hope these suggestions are helpful. I'm happy to discuss them with you further if you'd like more clarification. |
@peggygriesinger Thank you so much for your review! @evanwill feel free to follow up with @peggygriesinger if you need clarification or would like to discuss possible ways of addressing her suggestions. However, please don't make any changes to the text until we have heard from @ljlow. Thanks all! |
@evanwill, what a fantastic lesson. Thank you for writing this! I felt the lesson was well-organized and contained the right amount of screenshots to follow along easily. The lesson also demonstrated why OpenRefine is a valuable tool and the examples provided an introduction to how it could be used in a research workflow.
I feel the difficulty level of the lesson is intermediate. My only experience with OpenRefine comes from working through the other lesson on Programming Historian, but I felt that the explanations included were clear and I had no trouble following along step by step. I do feel, however, that having certain prior knowledge of some concepts would make this lesson more valuable for readers. Working knowledge of HTML and programming basics would be good prerequisites, but not required to follow the steps in this lesson. Linking to the Programming Historian series of lessons on Python might be a good idea.
A few line comments:
¶ 15 - As I was reading this, I wondered why we were pasting the url as a clipboard option instead of a url option. I spent a minute playing around with the url option just to see the difference in results. I think a brief sentence here explaining the use of clipboard vs url would be useful.
¶ 27 - I don't think 'Delete value' is necessary here.
¶ 29 - It might be useful to include a sentence here pointing out what is different between the two value columns; something along the lines of 'Notice tags like <!DOCTYPE> and <title> have been removed, leaving only tags.' Unless you are accustomed to looking at html, it can look like a jumble of code, especially in the small value preview window.
¶ 31 - I really like this. Good discussion of GREL function syntax and encouragement to experiment.
¶ 36 - Good reminder here to check that things are consistent as you transform the data.
¶ 82 - I think the formatting here should match the formatting in 66.
¶ 105 - Would it be useful to also link to the lesson on Supervised Classification here? http://programminghistorian.org/lessons/naive-bayesian
¶ 107 - Great discussion here of the importance of investigating metrics produced by algorithms. I love this sentence in particular: 'This is not a new technical skill, but an application of the historian's traditional expertise, not unlike interrogating physical primary materials to unravel bias and read between the lines.' I feel compelled to add, though, that unlike physical primary materials, digital tools are customizable. If an algorithm is not producing useful results, it could mean that it is the incorrect algorithm for that situation. Or it could simply mean that the algorithm needs to be tuned or trained differently. You correctly point out in 106 that both APIs are not suited for the sonnets because they have been trained using modern English. A sentiment analysis tool could be trained to analyze Shakespeare, though. Many different techniques exist to optimize machine learning algorithms for a given scenario; from choosing the right training data set to adjusting the algorithm itself. Naive Bayes has many implementations, Gaussian, Multinomial, and Bernoulli, for example, and one may be more suited to a certain problem than others. These are important distinctions to investigate and keep in consideration when using machine learning tools.
@evanwill thank you again for sharing this lesson; I've really enjoyed reviewing it! |
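To make the point about training on different data concrete, here is a minimal sketch of a multinomial-style Naive Bayes classifier in plain Python. It is not the method used by either API discussed in the lesson; the function names and the four training phrases are invented purely for illustration, and a real classifier would need far more data.

```python
import math
from collections import Counter, defaultdict


def train(examples):
    """Count labels and per-label word frequencies from (text, label) pairs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in examples:
        label_counts[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocab.add(word)
    return label_counts, word_counts, vocab


def classify(text, model):
    """Return the label with the highest log-probability for the text."""
    label_counts, word_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label in label_counts:
        # Start from the log prior for this label.
        score = math.log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocab:
                continue  # ignore words never seen in training
            # Laplace (add-one) smoothing avoids zero probabilities.
            count = word_counts[label][word] + 1
            score += math.log(count / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy training set in an early modern register.
TRAINING = [
    ("thou art fair and lovely", "positive"),
    ("sweet love and gentle joy", "positive"),
    ("cruel death and bitter woe", "negative"),
    ("thy foul disgrace and shame", "negative"),
]

model = train(TRAINING)
guess = classify("sweet and lovely love", model)
```

Swapping the training phrases changes what the classifier considers positive or negative, which is the sense in which such tools are customizable.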
@peggygriesinger and @ljlow thank you for the detailed reviews! I appreciate your input and help! @jerielizabeth am I okay to start pushing changes to the text at this point? Thank you. |
@ljlow re: ¶ 107 algorithm discussion. I think this is really important too, and appreciate your discussion. I almost want to paste it directly into the lesson! I don't want to try to explain everything, but just highlight that we need to be critical and investigative to understand what the algorithms offer. I like your suggestion to add that we can also be DIY. I was originally thinking of including a more in-depth discussion, but I was afraid it would distract too much from the practical flow of the tutorial. For example, I had a link to parts of the NLTK book about training your own classifier in Python, but it seemed too specific. Do you have any suggestions for a concise introduction to text analysis algorithms and implementations that I could point to? Thank you |
Thank you again to both @peggygriesinger and @ljlow for these excellent and helpful reviews. @evanwill Reading through the review suggestions, I think they are all on point and helpful. My only addition is to note that you are of course welcome to address the concerns noted differently than suggested if you have another way to solve them. Chiming in on the ¶107 discussion, I like the idea of adding a sentence to note that, while convenient algorithm services are often trained on data that is not well suited for humanities data and so do not perform well on our corpuses, they can be improved or implemented differently. Perhaps it could be put in the context of a "call" to keep learning? One advantage of doing data gathering and parsing with OpenRefine is that it lowers the coding barrier .. but that might be a nice way to gesture toward what can be added as the reader learns more. If you want to include links to additional training, you could include a "Next Steps" section at the end to suggest ways to build on the skills you're teaching here. My one other suggestion, drawing on @ljlow 's comment about prior knowledge, is that you add a bit to the front section that notes the things you expect readers to know prior to the lesson. And if there are existing PH lessons on the topics, linking to them would be great to point readers to places to gain that background knowledge. @evanwill Go ahead and push up your changes. Once you're done, I will read through one more time for copy editing and then we will move forward with publishing! Thanks, all! |
@evanwill I agree that a detailed discussion on algorithms would distract from the practical flow of the tutorial. I think @jerielizabeth's suggestion of adding a brief sentence framed in the context of a call to keep learning would be a great fit in ¶107. And a link to the NLTK book would certainly fit in a Next Steps section. re: a concise introduction to text analysis algorithms - Not sure if this fits the bill, but I remember really appreciating this image from scikit-learn when I was new to programming: NLTK and scikit-learn are used together frequently, so it might be worth including in a Next Steps section. Thank you again; I'm looking forward to seeing the finished article! |
Thanks for all the input! thank you |
I figured out the issues with To avoid the issue, I replaced all the Sorry for the confusion! |
@jerielizabeth I pushed my edits and a few new screenshots today. I think they clarify all the points raised by the reviewers. I will give it another look over in the next couple days, but I wanted to check in about a few things: I need to update the link to the ebook (currently), which I think translates to the main site like: What is the procedure for adding author data? Thank you! |
Thanks, @evanwill! Let me know when you're ready for me to go through one last time for copy editing. For the link, is the ebook the only file which we will be hosting for the lesson? If so, I can update the link when we move the lesson over. As far as the author data goes, if you would send me a 1 to 2 line bio, I will add it to the necessary files. You can either add it here or email it to me! Thanks! |
The ebook is the only file, but I wanted to update the screenshots that feature the link (since people look very closely at the details and I don't want confusion). |
Stock bio: Evan Peter Williamson is the Digital Infrastructure Librarian at University of Idaho Library, working with Data & Digital Services to bring cool projects, enlightening workshops, and innovative services to life. Despite a background in Art History, Classical Studies, and Archives, he always manages to get involved in all things digital. |
For the sake of completeness, since the tutorial is in the advanced category, I wonder if it would be worth adding a footnote stating that Jython can only import modules contained in the Python Standard Library (such as urllib), but that it's also possible to install third-party modules (for instance requests, or anything that's not coded in C, as is unfortunately the case with NLTK) by following this tutorial. Example: |
thanks @ettorerizza, agreed: However, it is already there! Look in the blockquote box below the intro to Example 3 (paragraph 92). |
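The standard-library route mentioned above can be sketched as follows. This is a hedged illustration rather than the code from the lesson: `fetch_text` is a hypothetical helper name, and the Python 2 fallback assumes a Jython 2.7 environment, where the equivalent function lives in `urllib2`.

```python
try:
    from urllib.request import urlopen  # Python 3
except ImportError:
    from urllib2 import urlopen  # Python 2.7, the version Jython 2.7 implements


def fetch_text(url, encoding="utf-8"):
    """Return the body at `url` as text, using only the standard library."""
    response = urlopen(url)
    try:
        return response.read().decode(encoding)
    finally:
        response.close()  # always release the connection
```

Because it relies only on the standard library, a helper like this needs no extra installation step, unlike third-party modules such as requests.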
@jerielizabeth I am done with my author revision / edits. Let me know if there is anything else I can improve. Thank you to everyone (@ettorerizza , @peggygriesinger, @ljlow ) for the helpful reviews and comments. |
Thank you, @evanwill! I will do my final editorial check this week. Are you comfortable receiving minor edits (should there be any) via a pull request? |
yep, thanks @jerielizabeth |
Hi @evanwill! I am working on the final editing pass - sorry for the delay. In addition, I just found a bug with pulling the file down from our server that we're working to fix. But I should have the final edits to you soon. |
In my last commit I changed the sonnets download link to what should be its future home on the main site, so it won't work until the lesson is pushed over there. Is that what you mean? |
@evanwill Thank you! not quite -- I moved the file over ahead of the lesson to do one last check. We made a recent change to our servers to go to https and it had the unexpected consequence of blocking traffic from Refine. All is well now. I just submitted a pull request with some final edits. In addition, in ¶2: "programming concepts" is a bit too big to be useful. Are there particular concepts you would encourage people to encounter first, such as a "loop" or variables? Is there a Programming Historian lesson (or other lesson) we can link to for those concepts? |
Thoughts on images for the lesson: https://flic.kr/p/oegZJw I feel like distilling is a good metaphor here :D |
@jerielizabeth thank you for the final edits--very helpful little details! I love the distilling image idea, I think image one, https://flic.kr/p/oegZJw I pushed three little final commits:
Thank you for all your work on this (and Programming Historian in general)! |
Hi @evanwill! the updates look good to me. I think we can leave the general concepts without a link for now (I am also hesitant about linking to a draft lesson.) An update on the publishing front: I seem to have triggered some sort of error on our primary site and it is not building at the moment. However, as soon as that is back up and working, we should be rolling out the lesson. |
It hardly seems to be @jerielizabeth's doing! But GitHub is looking into it. @evanwill, you can keep tabs on the conversation here: programminghistorian/jekyll#565 |
oh boy, always a bit of excitement.... |
@evanwill and @jerielizabeth I'm happy to say that I found a workaround for the site build problems, and https://programminghistorian.org/lessons/fetch-and-parse-data-with-openrefine is now live! I'm very sorry that this took so long... but rest assured, I blame it all on GitHub's opaque meddling with the build process 😉 We are totally blameless for this delay. |
Party!! Thank you again @mdlincoln! Since we shuffled things around a few times trying to get the site to build, we should double check that everything looks as intended before I publicize that the lesson is up. I'll give it a read through, and @evanwill, would you be willing to look things over at some point today and confirm that the lesson looks as intended? If everything looks good, we will announce it! |
Thank you @mdlincoln and @jerielizabeth, everything looks good. I appreciate all the Jekyll / gh-pages sleuthing--always fascinating adventures! thank you again! |
Careful @evanwill if you reveal yourself to be too enthusiastic about Jekyll, you'll be dragooned onto our editorial team 😉 |
Thank you everyone (@evanwill @peggygriesinger @ljlow and @ettorerizza) for your excellent work on this lesson! Working on this lesson has been a pleasure for me and I am very excited to see it out in the world. Please keep tweeting about it so that it gets the attention it deserves! I am closing this issue as we are now live! |
The Programming Historian has received the following tutorial on 'Fetching and Parsing Data from the Web with OpenRefine' by @evanwill. This lesson is now under review and can be read at:
http://programminghistorian.github.io/ph-submissions/lessons/fetch-and-parse-data-with-openrefine
Please feel free to use the line numbers provided on the preview if that helps with anchoring your comments, although you can structure your review as you see fit.
I will act as editor for the review process. My role is to solicit two reviews from the community and to manage the discussions, which should be held here on this forum. I have already read through the lesson and provided feedback, to which the author has responded.
Members of the wider community are also invited to offer constructive feedback, which they should post to this message thread, but they are asked to first read our Reviewer Guidelines (http://programminghistorian.org/reviewer-guidelines) and to adhere to our anti-harassment policy (below). We ask that all reviews stop after the second formal review has been submitted so that the author can focus on any revisions. I will make an announcement on this thread when that has occurred.
I will endeavor to keep the conversation open here on Github. If anyone feels the need to discuss anything privately, you are welcome to email me. You can always turn to @ianmilligan1 or @amandavisconti if you feel there's a need for an ombudsperson to step in.
Anti-Harassment Policy
This is a statement of the Programming Historian's principles and sets expectations for the tone and style of all correspondence between reviewers, authors, editors, and contributors to our public forums.