In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("lab04.ipynb")

# Lab 4: Tables and Visualizations 

Welcome to Lab 4! 

In this lab we move on to additional table functions and to ways of visualizing data.

The [Python Reference](https://pages.mtu.edu/~lebrown/data1202-s24/reference/index.html) has information that will be useful for this lab.

**Recommended Reading**:
 * [Tables](https://inferentialthinking.com/chapters/06/Tables.html)
 * [Visualizing Categorical Distributions](https://inferentialthinking.com/chapters/07/1/Visualizing_Categorical_Distributions.html)
 * [Visualizing Numerical Distributions](https://inferentialthinking.com/chapters/07/2/Visualizing_Numerical_Distributions.html)


**Submission**: Once you’re finished, run all cells besides the last one, select File > Save Notebook, and then execute the final cell. Then submit the downloaded zip file, which includes your notebook, according to your instructor's directions.

In [None]:
# Just run this cell
import numpy as np
import math
from datascience import *

%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')

# Cal Football 

In this lab we will work with the file `"football.csv"`, which contains information about the Cal football team.


### Load table from the file 

**Question 1.1** 
Load the data file into a table called `cal`. 

In [None]:
cal = ...
cal.show(10)

In [None]:
grader.check("q11")

### Excluding columns: `drop` 

We now have information about Cal Football's seasons since statistics were kept. Because this file was pulled from the internet, it may contain data that we are not interested in, like the columns with many `nan` values (`nan` means "Not a Number", and it is commonly used to indicate that there is no value there).

**Caution**: It is not a good idea to blindly drop all columns with several NaN values from a table. What information would have been lost if we just dropped all missing values?

However, for the sake of this exercise, we'll do so. We can use the `drop` method to remove columns like this from the table. 


**Question 1.2** 
Let's drop the `Notes` column. 

Let's also drop the `AP Pre`, `AP High`, `AP Post`, `SRS` and `SOS` columns from the table. These are statistics specific to college football, and they are not important for what we're doing. `drop` can take in as many columns as you need, and it will drop them all from the table.

Call the new table `cal_improved_columns`.

In [None]:
cal_improved_columns = ...
cal_improved_columns.show(5)

In [None]:
grader.check("q12")

### Querying 

**Question 1.3**
Let's try querying our new table using the `column` method to determine which conferences Cal has played in during its history. This information is contained within the `"Conf"` column of the `cal_improved_columns` table.

Use the `np.unique` function to list each conference only once.
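
To illustrate what `np.unique` does, here is a short sketch on a made-up array. The name `leagues` and its values are hypothetical stand-ins for a column of labels.

```python
import numpy as np

# Hypothetical array with repeated labels, standing in for a table column.
leagues = np.array(["East", "West", "East", "North", "West"])

# np.unique returns the distinct values in sorted order.
distinct = np.unique(leagues)
print(distinct)
```

Each label appears once in the result, no matter how many times it appears in the input.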

In [None]:
conference_list = ...
conference_list

In [None]:
grader.check("q13")

### Picking columns: `select`

It appears that there are also several other columns that we are not very interested in. Instead of dropping several columns, we can use the `select` method to grab only the columns we want. 

**Question 1.4:** In this case, we only want to keep the `"Year"`, `"W"`, `"L"`, `"T"`, and `"Pct"` columns. Fill in the following code so that the `football` table has only the relevant columns.

In [None]:
football = ...
football.show(5)

In [None]:
grader.check("q14")

### Changing column labels: `relabeled`

We can rename column labels using the `relabeled` method. With this method, you can:
1. Relabel a *single column*
2. Relabel *several columns* at once

To change the names of multiple columns, we pass in an array of the old names and an array of the new names as the 2 inputs to `relabeled`.

*Note*: You may see another method called `relabel` in the `datascience` documentation. Please avoid using it, as it modifies the original table in place and can change your data when you may not want it to.

**Question 1.5:** Some of the columns in the `football` table have labels that may not be best for what they store. Let's change the column labels to the following:

- `"W"` should be changed to `"Wins"`
- `"L"` should be changed to `"Losses"`
- `"T"` should be changed to `"Ties"`
- `"Pct"` should be changed to `"Winning Percentage"`

*Hint*: We've provided skeleton code for you to use.

In [None]:
old_names = ...
new_names = ...

football_relabeled = football.relabeled(..., ...)

football_relabeled.show(5)

In [None]:
grader.check("q15")

# More Table Operations

Now that we have the table we want, let's try to write some code that tells us some information about Cal Football's wins. Let's write three queries that can help us answer these three questions. 

1. What is the most wins Cal has ever had in one season?
2. How many total games has Cal lost?
3. What is the average number of games Cal wins each year?


**Question 2.1**: What is the most wins Cal has ever had in one season?

In [None]:
most_wins_ever = ...
most_wins_ever

In [None]:
grader.check("q21")

**Question 2.2 (Losses)** Use a `NumPy` function, the `football_relabeled` table, and a table method to answer the following question:

>How many total games has Cal lost?

Assign the value to the variable `games_lost_alltime`.

In [None]:
games_lost_alltime = ...
games_lost_alltime

In [None]:
grader.check("q22")

**Question 2.3 (Wins)**: Similar to above, let's answer the third question using a combination of a function, table, and table method:

>What is the average number of games Cal wins each year?

Assign your answer to the variable `average_wins`.

In [None]:
average_wins = ...
average_wins

In [None]:
grader.check("q23")

### Interpreting Our Data

What does winning 5.52 games even mean?! Well, this means you can (roughly) expect Cal to win 5-6 games a year. 

While this is not a perfect statistic (some seasons are longer than others, football is a completely different game than it was a long time ago, etc.), in a 12-13 game season, do you think this is a good number of wins? The answer to this question is not concrete, and even with data to back up either side, neither answer seems more right than the other.

**Important**: Data science is not only about being able to *compute* the answers to questions, but also about forming thoughtful questions in response to your findings and understanding the limitations of your analysis.

### Sorting a column: `sort`

Let's say we want to ask the question: **What is Cal's best season ever?** There are many ways to answer the question, but you may argue that a season with the most wins or the fewest losses could be considered the best:

In [None]:
# We can sort in descending order
football_relabeled.sort("Wins", descending=True).show(5)

In [None]:
# Or we can sort in ascending order
football_relabeled.sort("Losses", descending=False).show(5)

As you can see, queries about the most wins and the fewest losses can both answer the question **What is Cal's best season ever?** in different ways. Note that the same seasons do not necessarily show up at the top of each queried table.

**Question 2.4**: Yet another way to answer this question about Cal's best seasons ever is to sort by winning percentage. Assign the variable `best_win_pct_year` to the year corresponding to the season with the **highest winning percentage**.

To do so, we want to assign `seasons_sorted` to the result of a table query sorting the `football_relabeled` table by winning percentage in **descending** order. 

*Note*: We want descending order because we want the best seasons **at the top of the table**.

In [None]:
seasons_sorted = ...
best_win_pct_year = ...
best_win_pct_year

In [None]:
grader.check("q24")

### Row selection: `where` and the `are` Predicates

The last table method we will talk about is the `where` method. The `where` method keeps all rows that satisfy a particular Boolean condition. It takes in a column label and an `are` statement, which can be crafted using the `are` library. These are the most important `are` library methods, but there are many more if you would like to investigate: [Explore the 'are' library here.](http://data8.org/datascience/predicates.html)

| Method | Input Type | Method Description |
| --- | --- | --- |
| `are.equal_to(n)` | number | Is the value from the column equal to `n`? |
| `are.above(n)` | number | Is the value from the column above `n`? |
| `are.above_or_equal_to(n)` | number | Is the value from the column above or equal to `n`? |
| `are.below(n)` | number | Is the value from the column below `n`? |
| `are.below_or_equal_to(n)` | number | Is the value from the column below or equal to `n`? |
| `are.containing(s)` | string | Is `s` contained in the string value from the given column? |
| `are.contained_in(s)` | string | Is the string value from the given column contained in `s`? |

Adding a `not_` in front of all of these methods makes each method do the opposite of what it does (ex: `are.not_equal_to(n)`).

*Note*: As we've seen in lecture, we can achieve an **exact match** without explicitly using an `are` predicate. That is, `where("col", are.equal_to("something"))` is identical to `where("col", "something")`; the latter is shorthand for the former.

For example, if we only wanted to see the Cal Football seasons where Cal had a tie, we could use the `where()` method combined with an `are` method:

In [None]:
football_relabeled.where("Ties", are.above(0)).show(5)

For the 2021 season, Cal will play 12 games. If we wanted to see Cal's worst seasons where they lost more than 6 games, we can use a similar query:

In [None]:
football_relabeled.where("Losses", are.above(6)).show(5)

**Question 2.5 (Bowl Eligibility)**: In college football, a team advances to the post-season (to play "bowl games") if it has a winning/non-losing record. In other words, a team must have a winning percentage of at least 0.500 to become eligible to play in a bowl game.

Assign the variable `bowl_eligible` to a float that describes the proportion of seasons in which Cal was eligible to play in college bowls throughout its history, based on its winning percentage.

*Hint:* If you're stuck, feel free to add additional variables *before* you assign the float to `bowl_eligible`. It's often easier to break down these problems into multiple steps to make sure you're properly calculating each step and performing them in the right order. 



In [None]:
bowl_eligible = ...
bowl_eligible

In [None]:
grader.check("q25")

# Visualization 

Let's look at some of the patterns in the Cal football data. 

In Question 1.3, we observed that Cal has played seasons in different conferences. 

Suppose we want to look at the number of seasons Cal has played in the different conferences. 

We can get this data with the `group` method. You will learn more about this method next week.



In [None]:
seasons_in_conf = cal_improved_columns.select("Conf").group("Conf")
seasons_in_conf

**Question 3.1** 
What would be the best type of chart to display the number of seasons played in each conference?

1. Line Plot
2. Scatter Plot
3. Bar Chart
4. Histogram

Put your answer (1, 2, 3, 4) in `q31_chart_type`

In [None]:
q31_chart_type = ...

In [None]:
grader.check("q31")

<!-- BEGIN QUESTION -->

**Question 3.2** Make the chart displaying the number of seasons played in each conference.

*Note*: It is good practice to sort the information before presenting it.

In [None]:
...

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 3.3** 
We want to look at the number of wins for the years when Cal played in the "Pac-10" conference. 

Use the `cal_improved_columns` Table. 

In [None]:
...

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

Let's now look at the relationship between the number of games won vs. the number of games lost. 

**Question 3.4** 
Plot the relationship between the number of games won (x-axis) and the number of games lost (y-axis). Only plot data points for seasons when Cal played in any of the "Pac" conferences, that is, "Pac-8", "Pac-10", or "Pac-12".

Look at the predicate options for the [`where`](https://www.data8.org/datascience/reference-nb/datascience-reference.html#Table.where-Predicates) method.

In [None]:
...

<!-- END QUESTION -->

This plot, as expected, shows a relationship between wins and losses, because there are only a fixed number of games in a season (11-14).

However, the plot does have one outlier. 

<!-- BEGIN QUESTION -->

**Question 3.5** Briefly explain that outlier point in the plot above. 


_Type your answer here, replacing this text._

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

Let's now look at the *distribution* of winning percentage. 

**Question 3.6** Make a histogram of the `Winning Percentage` column in the `football_relabeled` table. Use bins that make sense, e.g., equal-width bins covering the range from 0 to 1.

In [None]:
...

<!-- END QUESTION -->

# Submission 

**Important submission steps:** 
1. Run the tests and verify that they all pass.
2. Choose **Save Notebook** from the **File** menu, then **run the final cell**. 
3. Click the link to download the zip file.
4. Then submit the zip file to the corresponding assignment according to your instructor's directions. 

**It is your responsibility to make sure your work is saved before running the last cell.**

## Submission

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit. **Please save before exporting!**

In [None]:
# Save your notebook first, then run this cell to export your submission.
grader.export(pdf=False, run_tests=True)