In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("hw04.ipynb")

# Homework 04: Functions, Histograms, and Groups

**Helpful Reference:**
* [Python Reference](https://www.data8.org/sp22/python-reference.html). Cheat sheet of helpful array & table methods used in this course!

**Reading**: 
* [Visualizing Numerical Distributions](https://inferentialthinking.com/chapters/07/2/Visualizing_Numerical_Distributions.html) 
* [Functions and Tables](https://inferentialthinking.com/chapters/08/Functions_and_Tables.html)

For every problem that asks for a written explanation, you **must** provide your answer in the designated space. Moreover, throughout this homework and all future ones, do not reassign variables! For example, if you use `max_temperature` in your answer to one question, do not reassign it later on. Otherwise, you will fail tests that you thought you were passing previously!

**Note: This homework has hidden tests on it. Additional tests will be run once your homework is submitted for grading. While you may pass all the tests you have access to before submission, you may not earn full credit if you do not pass the hidden tests as well.**

Many of the tests you have access to before submitting only check that your answer is formatted correctly and/or that it *could* make sense in context. For example, a test you have access to while completing the assignment may check that you selected a valid choice for a multiple choice problem (1, 2, or 3) or that your answer is an integer between 0 and 50 if asked to count a subset of states in the United States. The tests that are run after submission will evaluate your work for accuracy. **Do not assume that your answers are correct just because all your tests pass before submission!**

Consult with your teacher and course syllabus for information and policies regarding appropriate collaboration with other students, appropriate use of AI tools, and submission of late work.

In [None]:
# Don't change this cell; just run it. 
import numpy as np
from datascience import *

import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')

## 1. Burrito-ful San Diego

Eunice, Ciara, and Kanchana are trying to use Data Science to find the best burritos in San Diego! Their friends Jessica and Sonya provided them with two comprehensive datasets on many burrito establishments in the San Diego area taken from (and cleaned from): https://www.kaggle.com/srcole/burritos-in-san-diego/data. You can find the full data set in the folder of this assignment, named `burrito.csv`.

## `ratings.csv`
The following cell reads in a table called `ratings` which contains names of burrito restaurants, their Yelp rating, Google rating, as well as their overall rating. The `Overall` rating is not an average of the `Yelp` and `Google` ratings, but rather it is the overall rating from the customers that were surveyed in the study above. 

In [None]:
ratings = Table.read_table("ratings.csv")
ratings

## `burritos_types.csv`

The following cell reads in a table called `burritos_types` which contains names of burrito restaurants in San Diego, their menu items, and the cost of the respective menu item at the restaurant when this data was collected in 2018.

In [None]:
burritos_types = Table.read_table("burritos_types.csv").drop(0)
burritos_types

### Question 1.1

It would be easier if we could combine the information in both tables. Assign `burritos` to the result of joining the two tables together, so that we have a table with the ratings for every menu item from every restaurant. **Each menu item will have the same rating as the restaurant that made it.** This is not a perfect way to score individual menu items, but it is an assumption we will make because it is the best we can do with the data we have.

*Note: It doesn't matter which table you pass as the argument to the table method; either order will work for the autograder tests.*

*Hint: If you need refreshers on table methods, look at the [python reference](http://data8.org/sp20/python-reference.html).*
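As a rough illustration of what a join does, the sketch below pairs rows from two toy collections that share a restaurant name. It uses plain Python dictionaries rather than the `datascience` Table methods, and every name and value in it (`toy_ratings`, `toy_menu`, the column labels) is made up for the example; it is not the homework data or the expected solution.

```python
# Toy stand-ins for the two tables; all names and values here are made up.
toy_ratings = [
    {"Name": "Taco Stand A", "Overall": 4.0},
    {"Name": "Taco Stand B", "Overall": 3.5},
]
toy_menu = [
    {"Name": "Taco Stand A", "Menu_Item": "California", "Cost": 7.0},
    {"Name": "Taco Stand A", "Menu_Item": "Carnitas", "Cost": 6.5},
    {"Name": "Taco Stand B", "Menu_Item": "California", "Cost": 8.0},
]

# Look up each restaurant's rating row by name.
rating_by_name = {row["Name"]: row for row in toy_ratings}

# A join pairs every menu row with the rating row that shares its name.
toy_joined = [
    {**item, **rating_by_name[item["Name"]]}
    for item in toy_menu
    if item["Name"] in rating_by_name
]

for row in toy_joined:
    print(row)
```

Each menu row ends up carrying its restaurant's rating, which mirrors the assumption stated above that every menu item shares the rating of the restaurant that made it.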

In [None]:
burritos = ...
burritos.show(5)

In [None]:
grader.check("q1_1")

<!-- BEGIN QUESTION -->

### Question 1.2.

Let's look at how the Yelp scores compare to the Google scores in the `burritos` table. First, assign `yelp_and_google` to a Table containing only the columns `Yelp` and `Google`. Then, make a scatter plot with the Yelp scores on the x-axis and the Google scores on the y-axis. 

In [None]:
yelp_and_google = ...
...
# Don't change/edit/remove the following line.
# To help you make conclusions, we have plotted a straight line on the graph (y=x)
plt.plot(np.arange(2.5,5,.5), np.arange(2.5,5,.5));

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

### Question 1.3.

Looking at the scatter plot you just made in the previous question, do you notice any pattern(s)? Write an explanation about your observations and what it might imply about reviews found on Google and Yelp.

_Type your answer here, replacing this text._

<!-- END QUESTION -->

### `.group` refresher

In case you need a refresher on how `.group` works, you can read about it in the [textbook](https://www.inferentialthinking.com/chapters/08/2/Classifying_by_One_Variable.html), or you can use the [Table Functions Visualizer](http://data8.org/interactive_table_functions/) to get some more hands-on experience with the `.group` function.
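Conceptually, grouping with an aggregating function happens in two steps: rows are collected by the distinct values of one column, and then each collection is collapsed to a single value. The sketch below mimics those two steps in plain Python on made-up data. It is not the `datascience` implementation, and names such as `toy_rows` and `toy_averages` are hypothetical.

```python
from collections import defaultdict

# Made-up rows of (item, rating); not taken from the homework data.
toy_rows = [
    ("California", 4.0),
    ("Carnitas", 3.0),
    ("California", 5.0),
    ("Carnitas", 4.0),
    ("Veggie", 3.5),
]

# Step 1: collect the ratings that belong to each distinct item.
ratings_by_item = defaultdict(list)
for item, rating in toy_rows:
    ratings_by_item[item].append(rating)

# Step 2: collapse each collection to a single value, here the mean.
toy_averages = {
    item: sum(values) / len(values)
    for item, values in ratings_by_item.items()
}

print(toy_averages)
```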

### Question 1.4.

There are so many types of California burritos in the `burritos` table! Kanchana wants to consider her options for California burritos based on rankings. Remember, for the sake of these questions, we are treating each menu item's rating the same as its respective restaurant's, as we do not have the rating of every single item at these restaurants.

Create a table with two columns: the first column includes the **names of the burritos** and the second column should contain the **average overall rating** of that burrito across all the restaurants that serve it.

In [None]:
california_burritos = ...
california_burritos

In [None]:
grader.check("q1_4")

### Question 1.5.

Given this new table `california_burritos`, Ciara can figure out the name of the California burrito with the highest overall average rating! Assign `best_california_burrito` to a line of code that evaluates to a string that corresponds to the name of the California burrito with the highest overall average rating. If multiple burritos tie for the highest average, you can output any of them.

In [None]:
best_california_burrito = ...
best_california_burrito

In [None]:
grader.check("q1_5")

<!-- BEGIN QUESTION -->

### Question 1.6.

Eunice thinks that burritos in San Diego are cheaper (and taste better) than the burritos in North Carolina. Plot a histogram that visualizes the distribution of the costs of the burritos from San Diego in the `burritos` table. Use the provided `cost_bins` variable when making your histogram.

In [None]:
cost_bins = np.arange(0, 15, 1)
# Please also use the provided bins
...

<!-- END QUESTION -->

### Question 1.7.

What percentage of burritos in San Diego cost less than $6? Assign `burritos_less_than_six` to your answer, **which should be between 0 and 100** since it is a percentage, not a proportion. You should estimate this value using the histogram first, and then use code to compute the exact value.

**Hint:** Your solution should probably use the `np.count_nonzero` function.
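If `np.count_nonzero` is unfamiliar, the minimal sketch below shows the general pattern on a made-up array: compare the array to a threshold, count the `True` entries, and convert the count to a percentage. The array `toy_values` and the threshold of 5 are assumptions chosen for illustration only.

```python
import numpy as np

# Made-up values, not the burrito costs.
toy_values = np.array([2.0, 7.5, 4.0, 9.0])

# Comparing an array to a number gives an array of True/False values.
is_small = toy_values < 5

# count_nonzero counts the True entries (True counts as nonzero).
num_small = np.count_nonzero(is_small)

# Divide by the total and scale by 100 to get a percentage.
percent_small = num_small / len(toy_values) * 100
print(percent_small)
```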

In [None]:
burritos_less_than_six = ...
burritos_less_than_six

In [None]:
grader.check("q1_7")

## 2. Faculty Ages and Salaries

This question is designed to give you practice using the Table methods `pivot` and `group`, and see how they can both be used to summarize large Tables of values. Here is a link to the [Python Reference](https://www.data8.org/sp22/python-reference.html) page in case you need a quick refresher. The [Table Functions Visualizer](http://data8.org/interactive_table_functions/) may also be a helpful tool.
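To make the shape of a pivot table concrete, here is a small sketch in plain Python, not the `datascience` `pivot` method. It arranges made-up rows into a grid with one row per title and one column per school, collapsing each cell with `max`. The data, the names `toy_rows` and `toy_pivot`, and the use of `None` for empty cells are all assumptions of this sketch.

```python
from collections import defaultdict

# Made-up rows of (school, title, age); not taken from salary_2023.csv.
toy_rows = [
    ("School X", "Lecturer", 40),
    ("School X", "Lecturer", 50),
    ("School X", "Professor", 60),
    ("School Y", "Lecturer", 30),
]

# Collect the ages for every (title, school) pair.
cells = defaultdict(list)
for school, title, age in toy_rows:
    cells[(title, school)].append(age)

schools = sorted({school for school, _, _ in toy_rows})
titles = sorted({title for _, title, _ in toy_rows})

# One row per title, one column per school; each cell holds the largest age
# for that pair, or None when no row matches (an assumption of this sketch).
toy_pivot = {
    title: {
        school: max(cells[(title, school)]) if (title, school) in cells else None
        for school in schools
    }
    for title in titles
}

for title, row in toy_pivot.items():
    print(title, row)
```

The same information could be listed as one row per (title, school) pair, which is what grouping by two columns produces; the grid layout simply makes individual lookups easier.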

Run the cell below to view a demo on how you can use pivot on a table.

In [None]:
from IPython.display import YouTubeVideo
YouTubeVideo("4WzXo8eKLAg")

In the next cell, we load a dataset maintained by the University of North Carolina System Office. It contains information on **UNC System professors**, including their institution, age, job category, and base salary as of June 2023.

In [None]:
# Just run this cell
unc = Table.read_table("salary_2023.csv").drop('INITIAL HIRE DATE', 'EMPLOYEE HOME DEPARTMENT', 'PRIMARY WORKING TITLE').relabeled('INSTITUTION NAME', 'School Name').relabeled('AGE', 'Age').relabeled('JOB CATEGORY', 'Title').relabeled('EMPLOYEE ANNUAL BASE SALARY', 'Salary').set_format('Salary', CurrencyFormatter)
unc

### Question 2.1.

Suppose you wanted to know the average age and salary for each type of professor at each of the UNC System schools. Set `average_unc_stats` to a Table with 4 columns that shows the average age and salary for each combination of `School Name` and `Title` found in the `unc` table. For example, you should be able to determine the average age and salary for an Assistant Professor at ASU or for a Lecturer at ECU by looking at the corresponding row in the Table you create.

**Hint:** Use the `.group` method to create this table.

In [None]:
average_unc_stats = ...
average_unc_stats

In [None]:
grader.check("q2_1")

### Question 2.2

The Table that is created using the `.group` method can be a bit long when there are many combinations of the two specified variables, in this case `School Name` and `Title`. A `pivot` Table should make it easier to look up these averages.

Create a pivot table assigned to `unc_pivot` that has each `School Name` for the column labels and a row that corresponds to each unique `Title`. The values in each column should be the average value of `Age` for each `Title` at that school.

In [None]:
unc_pivot = ...
unc_pivot

In [None]:
grader.check("q2_2")

### Question 2.3

Recall that pivot tables can use *any* function to compute the collected value in the table.

Write your own function named `salary_range` that takes in an array of floats (which will represent salaries in our case) and returns the range of the array. The range is the largest value in the array minus the smallest value.

In [None]:
...
    ...

In [None]:
grader.check("q2_3")

### Question 2.4.

Set `job_ranges` to a table with one row for each `Title` and one column for each `School Name`. The values in each row should be the salary range for that job at each school, where the range is calculated using the `salary_range` function you wrote in Question 2.3.

In [None]:
job_ranges = ...
job_ranges

In [None]:
grader.check("q2_4")

<!-- BEGIN QUESTION -->

### Question 2.5.

Write an explanation as to why some of the values are 0 in the `job_ranges` Table you created in the previous question. There may be more than one reason, so think carefully and include as many reasons as you can in your response.

_Type your answer here, replacing this text._

<!-- END QUESTION -->

# Submitting your work
You're done with this assignment! Assignments should be turned in using the following best practices:
1. Save your notebook.
2. Restart the kernel and run all cells up to this one.
3. Run the cell below with the code `grader.export(...)`. This will re-run all the tests. Make sure they are passing as you expect them to.
4. Download the file named `hw04_<date-time-stamp>.zip`, found in the explorer pane on the left side of the screen. **Note**: Clicking on the link in this notebook may result in an error, so it's best to download from the file explorer panel.
5. Upload `hw04_<date-time-stamp>.zip` to the corresponding assignment on Canvas.

## Submission

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit.

In [None]:
grader.export(pdf=False, force_save=True)