Review Ticket: Visualizing Data with Bokeh and Pandas #152
The Programming Historian has received the following tutorial on 'Visualizing Data with Bokeh and Pandas' by @archaeocharlie. This lesson is now under review and can be read at:
Please feel free to use the line numbers provided on the preview if that helps with anchoring your comments, although you can structure your review as you see fit.
@drjwbaker and myself will act as editors for the review process. Our role is to solicit two reviews from the community and to manage the discussions, which should be held here on this forum. We have already read through the lesson and provided feedback, to which the author has responded.
Members of the wider community are also invited to offer constructive feedback, which should be posted to this message thread, but they are asked to first read our Reviewer Guidelines (http://programminghistorian.org/reviewer-guidelines) and to adhere to our anti-harassment policy (below). We ask that all reviews stop after the second formal review has been submitted so that the author can focus on any revisions. I will make an announcement on this thread when that has occurred.
I will endeavor to keep the conversation open here on Github. If anyone feels the need to discuss anything privately, you are welcome to email me. You can always turn to @amandavisconti if you feel there's a need for an ombudsperson to step in.
This is a statement of the Programming Historian's principles and sets expectations for the tone and style of all correspondence between reviewers, authors, editors, and contributors to our public forums.
This is a good introduction with clear instructions about using Bokeh and pandas to do some preliminary exploration of
It is written in a personable and communicative style which will be welcoming to many readers. The dataset is a fairly interesting one to use, but the content of the data is mostly ignored. This is possibly good--both because military history may be alienating to some users, and because who wants to think seriously about Hiroshima and Dresden while learning a programming API?--but also means the usefulness of visualization in generating insight is mostly unaddressed. The dataset is also unfortunately short in continuous, quantitative variables to plot--there's basically just date and tonnage, although maybe there's more in the original data. I think the strongest possible version of this article would probably use a different data set, but that this one is good enough as it is.
My general take is that the major areas of improvement lie in tightening the scope of what is being taught, and introducing new elements more slowly.
The goals should be more general.
The first paragraph gives bullet points that describe what the tutorial will do. This should be pushed down,
What is Bokeh good for?
The most important thing to fix should be a clear and succinct description early on of what Bokeh is good for.
The closest thing currently is in paragraphs 23-25. (Too late, I think). But this explanation isn't sufficiently
This section needs to do a little more.
I myself can't answer 2. Questions 1 and 3 are related, but the examples
In particular, I think that hovering tooltips are an incredibly useful feature that historians both need,
Add lines of code more slowly to the plots.
"Your first graph" is a useful approximation, but the second graph is a little confusing and could probably go.
In particular, I'd break off legends and titles into their own sections. Legends are introduced earlier than necessary; and the examples very quickly come to include several lines apiece about axis/title positioning, styling, and the like. Even at the cost of making the initial graphs uglier, I would avoid changing the default settings for things like axis ticks. Having the first plots of the Thor data be so long makes things look more complicated than they are.
As I said earlier, I think that this should ultimately lead to a plot of some type with tooltips for elements, which shows the distinctive possibilities of interactive web plots.
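A minimal sketch of the kind of tooltip plot described above, assuming a recent Bokeh release. The column names (`day`, `tons`, `label`) and the three rows are made-up placeholders, not values from the THOR dataset or code from the lesson.

```python
from bokeh.models import ColumnDataSource, HoverTool
from bokeh.plotting import figure

# Placeholder rows for illustration only; these are NOT THOR records.
source = ColumnDataSource(data={
    "day": [1, 2, 3],
    "tons": [10.0, 4.5, 7.2],
    "label": ["mission A", "mission B", "mission C"],
})

p = figure(title="Placeholder scatterplot with tooltips",
           tools="pan,wheel_zoom,reset")
p.scatter("day", "tons", source=source, size=10)

# Each tooltip row is (display name, "@column"); "@label" is looked up
# in the ColumnDataSource for the point under the cursor.
hover = HoverTool(tooltips=[("Mission", "@label"), ("Tons", "@tons")])
p.add_tools(hover)
```

Only the last three lines are specific to tooltips, which suggests the feature could be introduced as a small addition to an existing plot.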
Have a stronger first plot of the data.
The first plot here is a scatterplot by latitude and longitude that, without a map underlying it, is almost unreadable. There's no need to make this when a lot of other plots would be more interesting; even this one requires some filtering, and the description of it ('top 100 targets') isn't quite right (this is the 100 largest individual raids, or bombs, or whatever the row-unit is here).
I think it would be useful to have an ungrouped example earlier in the running. Typically this would be a bivariate scatterplot; possibly something like tonnage on the y-axis vs date on the x. Anything that shows individual bombs is also pedagogically useful because it highlights the two atomic bombings, which don't stick out in the time-series plots later.
It's possible this should be random-sampled rather than taking a sort.
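The difference between taking a sort and taking a random sample can be sketched in pandas as follows. The DataFrame is a hypothetical stand-in with invented column names (`date`, `tons`) and invented values; the lesson's actual columns may differ.

```python
import pandas as pd

# Hypothetical stand-in data; values are invented for illustration.
df = pd.DataFrame({
    "date": pd.date_range("1944-01-01", periods=10, freq="D"),
    "tons": [1.0, 5.0, 2.5, 40.0, 3.0, 0.5, 12.0, 7.0, 9.0, 2.0],
})

# Sorting: keeps only the largest rows, so the plot shows extremes.
top = df.nlargest(3, "tons")

# Sampling: keeps a random subset, so the plot reflects the overall
# distribution. random_state makes the draw repeatable.
sampled = df.sample(n=3, random_state=0)
```

Either result can be passed to a `ColumnDataSource` in the same way, so switching between the two approaches is a one-line change.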
Some other candidates:
Consider one or two more plot types.
Currently there are instructions for making a map, a bar chart, and a line chart. This is a good set of choices if only three plot types will be shown.
There are also instructions for a scatterplot of latitude and longitude,
I would consider including a true scatterplot, although this dataset doesn't lend itself to a single very obvious one. (tonnage of bombs by date, as I said: maybe number of planes against tonnage, or something like that).
A heatmap, or a binned scatterplot, (or a single plot incorporating both) of who's bombing whom might be useful. It would presumably show the US bombing Germany and Japan, and the British bombing just Germany, possibly with smaller numbers.
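One possible way to prepare the counts behind such a heatmap is a pandas cross-tabulation. This is a sketch only: the column names (`attacker`, `target`) and the five rows are invented placeholders, not figures from the dataset.

```python
import pandas as pd

# Invented placeholder rows; NOT real counts from the THOR data.
df = pd.DataFrame({
    "attacker": ["USA", "USA", "GB", "USA", "GB"],
    "target": ["GERMANY", "JAPAN", "GERMANY", "GERMANY", "GERMANY"],
})

# Rows are attacking countries, columns are target countries, and each
# cell counts the rows with that pairing (0 where none occur).
counts = pd.crosstab(df["attacker"], df["target"])
```

The resulting table has one cell per attacker/target pair, which is the shape a heatmap glyph would need after being reshaped to long form.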
Less importantly, I'd consider a boxplot or a filled area chart. For either of these, you could plot the US vs Britain.
How much other stuff needs to be taught to use Bokeh and pandas on data?
Some of this tutorial is a partial explanation of things that will be elementary to experienced
Avoid difficult list methods.
I would consider removing the following two list methods, which are more than self-taught beginners will have seen:
The latter could possibly be kept with an inline comment noting it's a trick.
Slim down the pandas vocabulary.
Pandas is a library with an endless vocabulary of functions that everyone seems to only use a subset of.
Things that are useful:
In practice, I think that means the pandas commands here should be limited to
Some of the commands that don't need to be shown are:
A description of the deprecated 3-D pandas type, or even of the pandas
(Is there a good explanation of pandas elsewhere that can be used?)
Can conda be removed?
I find the overhead of virtual environments to be rather high for
Especially given that PH already has a lesson about python and pip that doesn't include virtual environments, I think it is not useful to suggest them here.
All the code examples are rendering as a box-in-a-box on my computer, and appear to have four levels of html elements associated with them.
I've lined up a second peer reviewer who should have their comments by the end of this week or early next.
@archaeocharlie: feel free to read Ben's comments and begin thinking about them, although since we have a second review coming in I'd hold off on any substantial revisions (in case there are disagreements - and to keep the preview copy intact to track the second set of suggestions). So be in touch again fairly soon!
Very comprehensive introduction to Bokeh and Pandas! I enjoyed learning more about some of the functionality of these libraries, and I think you do a good job of 'narrating' the code.
I'll try to keep my review fairly short since Ben gave you a lot to work with. I'll first go through the sections I had comments on directly and then I have some general comments below. I'm happy to go into further detail if anything is unclear or if you would like more specificity.
I see that Ben is pushing for more generalized skills in this section, which I think is right, but I would also push for you to address briefly why someone should turn their historical data into visual arguments. This is pretty much assumed in the entire tutorial, but I think it's worthwhile to talk about visualization as a way to test hypotheses or elucidate patterns. I would then move your bullet points about what you'll do in the tutorial to The WWII THOR Dataset section, where you could talk about your specific hypotheses. I would finish the overview section by listing the generalized skills you will learn/need to visualize historical arguments.
The WWII THOR Dataset
I agree with Ben that you don't really go into the content of this data, which I think you could if you framed each graphing exercise as a way to test your hypotheses. Those bullet points from the overview could be reworked as questions that you want to explore in this tutorial. I do think the content of this dataset may be less exciting for some researchers so you might want to mention datasets used in other Programming Historian tutorials as alternatives. You also need to add a download tag to the dataset otherwise people have to manually copy and paste it into a file.
Creating a Python 3 Environment
Agree with Ben that conda is not really standard for most people and that using a virtual environment that works with a standard install of python would be preferable. I'm not sure if Programming Historian has guidelines on this but my understanding is that Pipenv is the current gold standard. You can install it with pip and it manages all dependencies https://pipenv.readthedocs.io/en/latest/
I agree with most of what Ben wrote, and I think this section should actually be shorter and titled What is Bokeh? and that you could move the second and third paragraph to the end of the tutorial where you could talk more generally about Bokeh vs other visualization libraries.
Adding to Your First plot
Mention that you should add these code lines before the show method.
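The ordering point can be illustrated with a short sketch, assuming a recent Bokeh release. The data and labels are placeholders rather than the lesson's own code, and `show` is left commented out so the snippet runs without opening a browser.

```python
from bokeh.plotting import figure

p = figure(title="Placeholder title")
p.line([1, 2, 3], [4, 6, 5])

# Styling lines go here: after the glyphs are added, before show().
p.title.text_font_size = "14pt"
p.xaxis.axis_label = "x (placeholder)"
p.yaxis.axis_label = "y (placeholder)"

# show() renders the plot as it stands at that moment, so it comes last.
# from bokeh.plotting import show
# show(p)
```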
The Bokeh ColumnDataSource
At line 72 "This approach to styling is mostly self-explanatory and this is an advantage of Bokeh, but you can see that sometimes Bokeh takes this too far. Some of its naming conventions are too verbose; for example, axis_label_text_font_size could probably just be called label_font_size since we already know we’re operating on an axis and the use of text and font seems redundant. Hopefully, this verbosity will be reigned in with future releases."
Overall I think you have really clear language for explaining the code, which would be strengthened if you spent some more time on framing your plots as ways to explore historical questions. You don't have to go super in depth, but a line or two about what a map versus a bar chart (and so on) helps visualize would be really helpful for people new to visualization who might not understand how you're selecting which variables to graph. I realize it's tough to balance explaining the code vs explaining the research question/visualization method, but I think some more time spent on the latter will really help readers understand how they could use these examples in their own research.
@drjwbaker and I have had a chance to discuss the reviews and our own work on the lesson. The reviews don't seem to conflict, so I think the best pathway forward is to begin responding and incorporating the reviewer suggestions. If it's possible to do this without a virtual environment that's worth considering – for what it's worth I similarly bypassed those steps and did it through my usual package manager.
Why don't you take a go at revising the lesson, and upon revisions, @drjwbaker and I are also happy to bounce ideas around with respect to framing and making the case for Bokeh as clear as possible.
We're really excited to see this lesson move through the pipeline!
Would April 16 work for a revision deadline, @archaeocharlie?
I should also note that now that we've received two reviews, we'd like to close this ticket to further reviews.
Author emailed for update 29 May 2018 and to agree fresh deadline for revisions. Will close ticket if no response is forthcoming.
Note: we are now past our recommended 4 week period for revisions to be completed (see policy at https://programminghistorian.org/en/editor-guidelines#managing-the-revision-process)
Thanks all for your great input and ample patience with this. I've just uploaded a modified version.
I've done my best to simplify code and also to emphasize interactivity in Bokeh. I'm sure there are many other parts that can continue to be improved. Although I didn't add it, one possibility is to include another example on linking plots together (e.g. linking two maps together in the last example with one showing incendiary and the other fragmentation could be nice).
I've also left the virtual env as is. I think it's extremely important, even for the beginner, to use one and conda, in my opinion, beats out pipenv! We can certainly discuss this though.
I've gone through and copy edited this proposed lesson. Actions for the author @archaeocharlie:
Overall, this is a huge improvement from the original submission that has benefited from astute peer review: so my thanks to @ZoeLeBlanc and @bmschmidt for their marvelous contributions! The substantive comments of both peer reviewers have been incorporated. I am happy with the use of a Virtual Environment (in fact, compared to other Python based lessons I found it easier to set up as I didn't have to figure out clashes with old versions of Python, especially Python 2). So @archaeocharlie if we can resolve the issues I note above, I think we can start moving this to the next part of the editorial process.
Would 27 July work as a revision deadline, @archaeocharlie?
@drjwbaker I was amazingly given a day full of free time, and I believe I've resolved all of these issues. The Witch Trial dataset does not have a spatial component. Should I drop this?
I'll be double checking tomorrow on another machine that code runs smoothly and there are no other problems. If things look good, I see two last things I need to do.
How does this sound?
Looks great. Further comments before I do a copy edit:
@mdlincoln Is it normal that links inside
@archaeocharlie Once the issues above (#152 (comment) & #152 (comment)) are resolved, we are ready to start publishing! Looking at my workflow for this https://programminghistorian.org/en/editor-guidelines#acceptance-and-publication---editorial-checklist the only remaining items I need from you are:
Let's try and do all this by the July 27th deadline above.
@archaeocharlie Thank you for your hard work. I've made some light edits (a8a5e85#diff-897efc7433bf34c69626feeec85fcf9f and headers on the live site version) and made a pull request for the lesson on the live site programminghistorian/jekyll#952 Note: #152 (comment) is unresolved, but hoping someone can help during the pull request process. I will close this issue once the lesson is live!