If this post excites you, I encourage you to apply to our 12 week immersive bootcamp (applications close August 5th) where you will learn data science through hands-on exercises and real world projects!
In our previous post, we outlined the best data science resources we have found online. In this post, I will walk you through my general data science process by analyzing the inspections of San Francisco restaurants using publicly available data from the Department of Public health. I will explore this data to map the cleanliness of the city, and get a better perspective on the relative meaning of these scores by looking at statistics of the data. Throughout the analysis, I show how to use a spectrum of powerful tools for data science (from UNIX shell to pandas and matplotlib) and provide some tips and data tricks.
The mode of restaurants is perfect score of 100, the distribution is heavily skewed towards high values (mean of 92, 75% quartile of 98, 25% quartile of 88), and there exists a long tail of restaurants. What does this mean, the most common score is 100, 75% of restaurants receive a score of 88 or better (and the scale is very nonlinear), i.e. you probably don’t want to eat at a restaurant that scored below 90.http://zipfianacademy.com/maps/h3/"></iframe>
Each restaurant is geographically binned using the D3.js hexbin plugin. The color gradient of each hexagon reflects the median inspection score of the bin, and the radius of the hexagon is proportional to the number of restaurants that fall in that the bin. Binning is first computed with a uniform hexagon radius over the map, and then the radius of each individual hexagon is adjusted for how many restaurants ended up in its bin.
Large blue hexagons represent many high scoring restaurants and small red mean a few very poorly scoring restaurants. The controls on the map allow users to adjust the radius (Bin:) of the hexagon for computing the binning as well as the range (Score:) of scores to show/use on the map. The color of the Bin: slider represents the median color of the Score: range and its size represents the radius of the hexagons. The colors of each of the Score: sliders represent the threshold color for that score, i.e. if the range is 40 - 100 the left slider’s color corresponds to a score of 40 and the right slider to a score of 100. The colors for every score in-between are computed using a linear gradient.
Somewhat recently, Yelp announced that it is partnering with Code for America and the City of San Francisco to develop LIVES, an open data standard which allows municipalities to publish restaurant inspection data in a standardized format. This is a step towards allows a much much more transparent government, leading ultimately to a more engaged citizenry.
To understand what those opaque numbers in restaurant windows mean, I set out to use statistics and data science to better grasp the implications of the ratings.
The entire process has been documented in an IPython notebook here and I hope anyone who is curious will run the code and review the analyses before they take the results at face value (because No one should trust a data scientist).
Some interesting results and insights I have found can be summed up by the plots below.
In order to learn more about the relative rating of each restaurant and find out just how good a 90 is, I simply plotted all the data in a histogram. It turns out (quite surprisingly) that the majority of restaurants score better than 94 and that 100 is the mode of the dataset. This is actually quite comforting to know that so many restaurants score so well, but might make you think twice about eating at your favorite restaurant that happened to score a 90.
The right plot is a binning of the scores into the categories the city defined to give a more qualitative interpretation of the scores (‘Poor’, ‘Needs Improvement’, ‘Adequate’, and ‘Good’). The interesting thing to note about these quantizations of the scores is that the scale is very nonlinear: 0 -> 70, 71 -> 85, 86 -> 90, 91 -> 100.
With such a skewed distribution and nonlinear scales, often our old way of thinking does not directly translate. To get a better grasp on the relative scores of restaurants compared to each other (and potential other cities) I computed the quantiles for the distribution. This allows us to have a somewhat standardized ranking to compare different scales and distributions in normalized fashion. It is for this reason that summary statistics can be quite powerful tools for inference and a standard tool in any statistician’s (or data scientist’s) tool belt.
Due to these very basic and easy to implement analyses, I am now a much more informed citizen and realize that scales in general can have subliminal/collateral effects on your perception of the rest of the world. In school we come to internalize 70 as a passing score, anything better than 90 quite good, and 98-100 to be unheard of… for Berkeley Physics at least ;)
I hope this post showed you that you do not necessarily need to do very complex analyses to get interesting insights and inspires folks to get out there and start working with open data. The first step to breaking into data science is to start making, and pick a project that you are passionate about (or always wanted to know the answer to). If you have any question about restaurant health inspection data, the data science process, or our program and classes please do not hesitate to reach out (or to just say hello!) at firstname.lastname@example.org. Happy Data-ing!