Additional dataset for chipotle homework assignment #1

Closed

micahmarkman

Howdy Kevin,
After getting the feedback from Rachael on the homework from class 2, I had a quick back and forth with her on why the solution isn't actually quite as good as the one I hacked out in class.
I've added an updated chipotle dataset that includes an outlier Steak Burrito order that breaks the homework solution. I'll admit I actually saw this earlier: when I first cloned the DAT8 repository it already had the solution, and I looked at it and said something to the effect of "this is an illustration of the family of jokes about the difference between an engineer and a scientist, the slow precision vs. quick estimation". Shockingly, I fell onto the scientist side of things here (unusual for me).
Anyway, this also gave me a quick minute to refresh my memory on how to do pull requests (hadn't done one in a couple of years) so here's my dataset.

--Micah

@justmarkham
Owner

Hi @micahmarkman, for which question(s) does the outlier order break the solution? And, if it's easy for you to briefly describe how it breaks the solution, I'd appreciate it. (Otherwise, no worries, I'll just try it out myself.) Thanks!

@rachnp89

He's referring to question 4 (or the prescribed solution to question 4), which does not take into account the "quantity" column. So it looks like he changed order 45 to consist of 250 steak burritos.
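
To make the difference concrete, here is a minimal sketch comparing the two ways of counting. The rows are made up for illustration and are not taken from the actual chipotle file; the field order (`order_id`, `quantity`, `item_name`) is assumed, and the 250-burrito row only mimics the modified order 45.

```python
from collections import Counter

# Hypothetical rows: (order_id, quantity, item_name). Not real data.
rows = [
    (1, 1, "Chicken Burrito"),
    (2, 1, "Chicken Burrito"),
    (3, 2, "Chicken Burrito"),
    (4, 1, "Steak Burrito"),
    (45, 250, "Steak Burrito"),  # outlier, mimicking the modified order 45
]

def count_lines(rows):
    """Count one per order line, ignoring the quantity column."""
    counts = Counter()
    for _order_id, _quantity, item in rows:
        counts[item] += 1
    return counts

def count_quantity(rows):
    """Sum the quantity column for each item."""
    counts = Counter()
    for _order_id, quantity, item in rows:
        counts[item] += quantity
    return counts

by_line = count_lines(rows)
by_quantity = count_quantity(rows)

# The two approaches can disagree about which item comes out on top.
print("by line count:", by_line.most_common(1)[0][0])
print("by quantity:  ", by_quantity.most_common(1)[0][0])
```

With these sample rows the line count favors chicken while the quantity sum favors steak, which is the kind of disagreement the outlier order is meant to expose.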

@justmarkham
Owner

Thank you, @rachnp89. And great critical thinking, @micahmarkman, I appreciate you bringing this up! I have many thoughts on this:

  1. To answer the question of which burrito is more "popular", we have to adopt a definition of popular. Presumably, it has something to do with the number of burritos of that type ordered. It makes sense that an order for 1 chicken burrito represents 1 "vote" for chicken. But if I see an order with 250 steak burritos, should I interpret that as 250 "votes" for steak?
    • I might decide that this is erroneous data, if evidence supports that line of thinking, in which case I would ignore it.
    • I might decide that this order represents the preferences of 1 person, not 250 people, under the assumption that they are ordering for a group and didn't actually ask every person what they wanted. Or I might award it some number of votes that is less than 250, to account for the fact that they probably asked some people what they wanted, but not all 250.
    • If the same order also included 219 chicken burritos, I would likely conclude that the person ordering did actually ask those 469 people what they wanted, in which case I would conclude that it should represent 250 votes for steak and 219 votes for chicken.
    • All of that being said, I can't answer how I would interpret that particular case, because it didn't actually arise in this dataset.
  2. The question "which burrito is more popular" implicitly assumes that you have looked at the data enough to figure out how to answer that question. In my case, I have looked at the data enough to know that the quantity column is irrelevant for answering this question, unless the chicken/steak line counts are close to one another. And since the counts are not close, it is not relevant for answering the specific question that was asked.
    • It would be reasonable to counter my point by saying that questions 1 through 3 do not explicitly require students to look at the data in depth, and thus they would not necessarily know that quantity is irrelevant for answering this question.
    • However, the answer I wrote is "my answer", not "the answer". Meaning, the answers I will give out to assignments during the course represent "how I would answer the question", not "the one right way to answer this question." I celebrate different ways of answering questions, and will sometimes share those different ways with the class.
  3. Although I did not state it explicitly, my design was for this assignment to be an exercise in data exploration. And in exploratory data analysis, you are almost always writing "throw-away code", meaning code that you write once and never look at again. In those cases, it doesn't matter whether the code is elegant or handles all cases that could arise in the data; all that matters is that it answers your question using the data that you do have.
    • If this assignment were instead about producing a report that would be emailed to Chipotle managers once a day, then it would be important to take quantity into account, since our code would have to handle all future iterations of this dataset, not just the existing dataset.
    • I realize that I did not explicitly state the design of the assignment, and thus it's not unreasonable for you (or any other student) to think that handling future iterations of the dataset is within the scope of what should be expected.
    • My guiding principle here is that time always matters in data science, because you could spend an unlimited amount of time on any one step in the data science process, and so at some point you have to stop and move on. (This point will become obvious once we near the end of the course and you are building machine learning models for your project, and you realize that there is an endless combination of features/models/parameters you could try. At some point you just have to accept that you are unlikely to find the "best" model, and accept that what you have is "good enough".)
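
The interpretations in point 1 can be sketched as interchangeable counting rules. This is an illustration only: the rule names, the `cap` value of 5, and the sample rows are all hypothetical and do not come from the homework or the dataset.

```python
from collections import Counter

def votes_for(quantity, rule, cap=5):
    """Translate one order line's quantity into "votes" under a given rule."""
    if rule == "full":    # every burrito ordered counts as one vote
        return quantity
    if rule == "one":     # the whole line reflects one person's preference
        return 1
    if rule == "capped":  # partial credit, limited by a hypothetical cap
        return min(quantity, cap)
    raise ValueError(f"unknown rule: {rule}")

def tally(rows, rule, cap=5):
    """Total the votes per item for rows of (quantity, item_name)."""
    counts = Counter()
    for quantity, item in rows:
        counts[item] += votes_for(quantity, rule, cap)
    return counts

# Hypothetical rows: (quantity, item_name). Not real data.
rows = [
    (1, "Chicken Burrito"),
    (1, "Chicken Burrito"),
    (2, "Chicken Burrito"),
    (1, "Steak Burrito"),
    (250, "Steak Burrito"),  # the outlier line
]

for rule in ("full", "one", "capped"):
    print(rule, dict(tally(rows, rule)))
```

Treating the outlier as erroneous data would amount to filtering that row out before calling `tally`. Which rule is appropriate depends on the definition of "popular" adopted for the question.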

Thanks again for bringing this up, and I will surely mention this discussion to the class so that those interested can read through it.

As for the pull request itself, I'm closing it because I don't actually need the CSV file.

Anyone is free to comment further, as I'm happy to continue the discussion!

Kevin


3 participants