In [1]:
# Initialize Otter
import otter
grader = otter.Notebook("lab.ipynb")

# Lab 2 – `pandas` 

## DSC 80, Spring 2023

### Due Date: Monday, April 17th at 11:59PM

## Instructions
Welcome to the second lab assignment in DSC 80 this quarter!

Much like in DSC 10, this Jupyter Notebook contains the statements of the problems and provides code and Markdown cells to display your answers to the problems. Unlike DSC 10, the notebook is *only* for displaying a readable version of your final answers. The coding will be done in an accompanying `lab.py` file that is imported into the current notebook, and **you will only submit that `lab.py` file**, not this notebook!

Some additional guidelines:
- **Unlike in DSC 10, labs will have both public tests and hidden tests.** The bulk of your grade will come from your scores on hidden tests, which you will only see on Gradescope after the assignment deadline.
- **Do not change the function names in the `lab.py` file!** The functions in the `lab.py` file are how your assignment is graded, and they are graded by their name. If you changed something you weren't supposed to, you can find the original code in the [course GitHub repository](https://github.com/dsc-courses/dsc80-2023-sp).
- Notebooks are nice for testing and experimenting with different implementations before designing your function in your `lab.py` file. You can write code here, but make sure that all of your real work is in the `lab.py` file, since that's all you're submitting.
- **To ensure that all of your work to be submitted is in `lab.py`, we've provided an additional uneditable notebook, called `lab-validation.ipynb`, that contains only the tests and their setup. Make sure you are able to run it top-to-bottom without error before submitting!**
- You are encouraged to write your own additional helper functions to solve the lab, as long as they also end up in `lab.py`.

**Importing code from `lab.py`**:

* Below, we import the `.py` file that's contained in the same directory as this notebook.
* We use the `autoreload` notebook extension to make changes to our `lab.py` file immediately available in our notebook. Without this extension, we would need to restart the notebook kernel to see any changes to `lab.py` in the notebook.
    - `autoreload` is necessary because, upon import, `lab.py` is compiled to bytecode (in the directory `__pycache__`) and the loaded module is cached. Subsequent imports of `lab` merely reuse that already-loaded module instead of rereading `lab.py`.
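
To see why a plain re-import is not enough, here is a minimal sketch. It writes a throwaway module (hypothetically named `demo_mod`, standing in for `lab.py`), edits it on disk, and compares a second `import` with `importlib.reload`. It only illustrates the caching behavior; `autoreload` takes care of the reload step for you.

```python
import importlib
import os
import sys
import tempfile

# Write a throwaway module to a temporary directory (a stand-in for lab.py).
tmp_dir = tempfile.mkdtemp()
mod_path = os.path.join(tmp_dir, 'demo_mod.py')
with open(mod_path, 'w') as f:
    f.write('VALUE = 1\n')

sys.path.insert(0, tmp_dir)
import demo_mod
first = demo_mod.VALUE

# Edit the file on disk, as you would edit lab.py in your editor.
with open(mod_path, 'w') as f:
    f.write('VALUE = 22\n')

import demo_mod                  # no effect: the module is already loaded
after_reimport = demo_mod.VALUE

importlib.invalidate_caches()
importlib.reload(demo_mod)       # re-executes the module's source
after_reload = demo_mod.VALUE

print(first, after_reimport, after_reload)
```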

In [2]:
%load_ext autoreload
%autoreload 2

In [3]:
from lab import *

In [4]:
import pandas as pd
import numpy as np
import os

## Part 1: `pandas` Basics 👶

In this section, you'll implement several functions. The public tests run your functions on an example dataset, which is stored in `data/scores.csv`. You're free to import this `.csv` file as a DataFrame in your notebook and experiment with it. **However,** the functions you write must be general enough to work on other datasets with the same column names but different values.

In addition:
* Do not hard-code any answers.
* Do not use any loops – you will not receive full credit if you do!

### Question 1

#### `data_load`

Complete the implementation of the function `data_load`, which takes in the file path of a dataset to be read as a string and returns the DataFrame that results from following the steps below:
    
a. First, read in only a subset of the columns: `'name'`, `'tries'`, `'highest_score'`, and `'sex'`.

b. Then, drop the `'sex'` column.

c. Rename the `'name'` column to `'firstname'` and the `'tries'` column to `'attempts'`.

d. Turn the `'firstname'` column into the index.
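
Steps a through d can be chained together, as in the rough sketch below. It reads a made-up in-memory CSV rather than `data/scores.csv`; the `'age'` column and the two rows are assumptions added only to show that `usecols` skips extra columns.

```python
import io
import pandas as pd

# Toy CSV standing in for data/scores.csv; the 'age' column is made up.
toy_csv = io.StringIO(
    'name,tries,highest_score,sex,age\n'
    'Julia,4,90.0,F,20\n'
    'Tyler,2,88.0,M,21\n'
)

cols = ['name', 'tries', 'highest_score', 'sex']
toy_scores = (
    pd.read_csv(toy_csv, usecols=cols)                            # step a
    .drop(columns='sex')                                          # step b
    .rename(columns={'name': 'firstname', 'tries': 'attempts'})   # step c
    .set_index('firstname')                                       # step d
)
print(toy_scores)
```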

<br>
    
#### `pass_fail`

Complete the implementation of the function `pass_fail`, which takes a DataFrame returned from `data_load` and adds a column `'pass'` that contains `'Yes'` or `'No'` for each row, based on the following conditions:

* `'No'` if the number of attempts is strictly larger than 1 but the score is less than 60
* `'No'` if the number of attempts is strictly larger than 4 but the score is less than 70
* `'No'` if the number of attempts is strictly larger than 6 but the score is less than 90
* `'No'` if the number of attempts is strictly larger than 8
* `'Yes'` otherwise
 
Your function should return a DataFrame identical to the input `scores` with the column `'pass'` added. However, your function should not modify the original input DataFrame.
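
One loop-free way to express these conditions is to combine Boolean Series with `|` and `&`, then map the result to `'Yes'`/`'No'` with `np.where`. The helper name `pass_fail_sketch` is hypothetical and this is a sketch, not a reference solution. Because `pass` is a Python keyword, the new column is passed to `assign` through a dictionary, and `assign` returns a new DataFrame rather than modifying its input.

```python
import numpy as np
import pandas as pd

def pass_fail_sketch(scores):
    attempts = scores['attempts']
    score = scores['highest_score']
    # A row fails if any one of the four conditions holds.
    failed = (
        ((attempts > 1) & (score < 60))
        | ((attempts > 4) & (score < 70))
        | ((attempts > 6) & (score < 90))
        | (attempts > 8)
    )
    # assign builds a new DataFrame, so the input is left untouched.
    return scores.assign(**{'pass': np.where(failed, 'No', 'Yes')})

toy = pd.DataFrame(
    {'attempts': [4, 7, 2, 14], 'highest_score': [90.0, 88.5, 34.0, 99.0]},
    index=pd.Index(['Julia', 'Kathleen', 'Amiya', 'Torrey'], name='firstname'),
)
toy_result = pass_fail_sketch(toy)
print(toy_result)
```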

In [5]:
# don't change this cell -- it is needed for the tests to work
scores_fp = os.path.join('data', 'scores.csv')
scores = data_load(scores_fp)
passfail = pass_fail(scores.copy())
print(passfail)

           attempts  highest_score pass
firstname                              
Julia             4           90.0  Yes
Angelica          2           70.0  Yes
Tyler             2           88.0  Yes
Kathleen          7           88.5   No
Axel              5           45.3   No
Amiya             2           34.0   No
Marina            2          100.0  Yes
Torrey           14           99.0   No
Mariah           10           98.1   No
Grayson           3           67.0  Yes
Yvette            4           55.9   No
Marina            3          100.0  Yes
Marina            2          100.0  Yes


In [6]:
grader.check("q1")

### Question 2

#### `med_score`

Complete the implementation of the function `med_score`, which takes in a DataFrame that is returned by `pass_fail` and returns the median score amongst students who passed the test.

<br>

#### `highest_score_name`
    
Complete the implementation of the function `highest_score_name`, which takes in a DataFrame that is returned by `pass_fail` and returns a tuple in which:
- The first item is the maximum score any student received.
- The second item should be a list of the name(s) of the person(s) with the maximum score (attempts do not count). If just one student received the maximum score, the list you create will have length 1.

As a reminder, please follow these requirements:

* For all questions you need to write code general enough to be applied to another similar dataset. 
* Do not hard-code any answers. 
* Do not use `for` or `while` loops.
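
As a rough illustration of both computations without loops, the sketch below uses a small made-up DataFrame shaped like the output of `pass_fail`. Boolean filtering picks out the relevant rows, and the index supplies the names.

```python
import pandas as pd

toy = pd.DataFrame(
    {'highest_score': [90.0, 100.0, 34.0, 100.0],
     'pass': ['Yes', 'Yes', 'No', 'Yes']},
    index=pd.Index(['Julia', 'Marina', 'Amiya', 'Marina'], name='firstname'),
)

# Median score among rows whose 'pass' value is 'Yes'.
median_passed = toy.loc[toy['pass'] == 'Yes', 'highest_score'].median()

# Maximum score, and every index label that achieves it (duplicates kept).
top_score = toy['highest_score'].max()
top_names = toy[toy['highest_score'] == top_score].index.tolist()
result = (top_score, top_names)
print(median_passed, result)
```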

In [7]:
# don't change this cell -- it is needed for the tests to work
medscore = med_score(passfail.copy())
highest = highest_score_name(passfail)
print(medscore, highest)

90.0 (100.0, ['Marina', 'Marina', 'Marina'])


In [8]:
grader.check("q2")

### Question 3

Complete the implementation of the function `idx_dup`, which does not take any arguments and returns a single integer, answering the question below:

Is it possible for a DataFrame's index to have duplicate values?
1. No, index values must be unique and use non-negative integers only, just like in `numpy` arrays.
2. No, index values must be unique and use integers only.
3. No, index values must be unique but index values are not restricted to integers.
4. Yes, but index values must be non-negative integers only.
5. Yes, but index values must be integers only.
6. Yes, and index values are not restricted to integers.

In [9]:
# don't change this cell -- it is needed for the tests to work
idxdup = idx_dup()

In [10]:
grader.check("q3")

## Part 2: Tricky Pandas 🤔

Sometimes, `pandas` gives you weird outputs that you may not expect. The next set of questions walks you through a few examples that might surprise you. 

In [11]:
trick_me()

3

### Question 4

The following subparts all require you to define a function and return a number that is the answer to a multiple-choice question. You may need to write code and experiment with DataFrames to arrive at your answers.

#### `trick_me`

`trick_me` should not take any arguments. 
<br>

Inside the function:

* Create a DataFrame named `tricky_1` that has three columns labeled `'Name'`, `'Name'`, and `'Age'`. `tricky_1` should have 5 rows; the values are up to you.
* Save this DataFrame in the `.csv` file called `'tricky_1.csv'` without the index.
* Now create another DataFrame, named `tricky_2`, by reading in the file `'tricky_1.csv'`. What are your observations?

  1. It was not possible to create a DataFrame with the duplicate columns.
  2. `tricky_1` and `tricky_2` have the same column names.
  3. `tricky_1` and `tricky_2` have different column names.
   
Your function should return `1`, `2`, or `3`, answering the above question.
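
The round trip can be tried with a sketch like the one below. It uses an in-memory buffer instead of the file `'tricky_1.csv'` and only two made-up rows, so it differs from what `trick_me` itself must do.

```python
import io
import pandas as pd

# A DataFrame with two columns that share the label 'Name'.
tricky_1 = pd.DataFrame(
    [['Ann', 'Lee', 20], ['Bo', 'Kim', 21]],
    columns=['Name', 'Name', 'Age'],
)

# Write to CSV without the index, then read the same text back in.
buffer = io.StringIO()
tricky_1.to_csv(buffer, index=False)
buffer.seek(0)
tricky_2 = pd.read_csv(buffer)

cols_before = list(tricky_1.columns)
cols_after = list(tricky_2.columns)
print(cols_before, cols_after)
```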

<br>
  
#### `trick_bool`
`trick_bool` should not take any arguments.

To determine the correct answer from the list below, you should follow the steps outlined by experimenting in **the notebook** (or in the Terminal by running `python`). Outside the function:

* Create a DataFrame named `bools` that has four columns: `True`, `True`, `False`, `False`. Each column name should be Boolean.
* `bools` should have 4 rows; the values are up to you.
* Predict the shape of the DataFrame that results by running each of the three lines of code below. Pick a corresponding answer from the given list. Your function should return a list with three numbers, one for each line.
* You should be able to answer without running any code, but feel free to run code to check your answer.
* **Your function should not do anything other than return a hardcoded list.**

```py
bools[True]
bools[[True, True, False, False]]
bools[[True, False]]
```
    
Answer choices:
1. DataFrame: 2 columns, 1 row
2. DataFrame: 2 columns, 2 rows
3. DataFrame: 2 columns, 3 rows
4. DataFrame: 2 columns, 4 rows
5. DataFrame: 3 columns, 1 row
6. DataFrame: 3 columns, 2 rows
7. DataFrame: 3 columns, 3 rows
8. DataFrame: 3 columns, 4 rows
9. DataFrame: 4 columns, 1 row
10. DataFrame: 4 columns, 2 rows
11. DataFrame: 4 columns, 3 rows
12. DataFrame: 4 columns, 4 rows
13. Error
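
A useful fact when reasoning about these lines: inside `[]`, a list of Booleans is treated as a row mask, not as a list of column labels. The sketch below shows this on a made-up DataFrame with ordinary string column names, so it is deliberately not the same situation as `bools`.

```python
import pandas as pd

toy = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

by_label = toy['a']                   # a single label selects a column
by_mask = toy[[True, False, True]]    # a Boolean list of length len(toy) filters rows
print(by_mask)
```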

In [12]:
bools = pd.DataFrame(columns=[True, True, False, False], data = [[1,2,3,4],[1,2,3,4],[1,2,3,4],[1,2,3,4]])

In [13]:
print(bools[True, False])

TypeError: Cannot convert bool to numpy.ndarray

In [None]:
# don't change this cell -- it is needed for the tests to work
trick_ans = trick_bool()

list

In [None]:
grader.check("q4")

In [17]:
nans = pd.DataFrame([[0,1,np.NaN],[np.NaN,np.NaN,np.NaN],[1,2,3]])
print(nans)
print(correct_replacement(nans))

     0    1    2
0  0.0  1.0  NaN
1  NaN  NaN  NaN
2  1.0  2.0  3.0
         0        1        2
0      0.0      1.0  MISSING
1  MISSING  MISSING  MISSING
2      1.0      2.0      3.0


### Question 5

In the notebook, use the line of code given below to create a DataFrame named `nans`. Note that we use `np.NaN` (`numpy`'s representation of "Not a Number") to create missing values.
 
```py
nans = pd.DataFrame([[0, 1, np.NaN], [np.NaN, np.NaN, np.NaN], [1, 2, 3]])
```
Now, you decide to make your DataFrame more interpretable for data scientists who don't yet know about `np.NaN`, and replace each `np.NaN` with the string `'MISSING'`. In order to do that, you've written the following function:

```py
def change(x):
    if x == np.NaN:
        return 'MISSING'
    else:
        return x
```

In your notebook, write a line of code that applies the function above to the last column of the `nans` DataFrame. What was the result?
* A: It worked: all `np.NaN`s in the last column were changed to `"MISSING"`.
* B: It did not work.

You should end up answering B. What happened? 🤔 It turns out that you can't use a simple `==` comparison to detect whether a value is `np.NaN`, because `np.NaN` is not equal to anything, including itself. You need another way to check for `np.NaN`. [Read more about it here](https://stackoverflow.com/questions/41342609/the-difference-between-comparison-to-np-nan-and-isnull).
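
The behavior can be checked directly with a few lines. This sketch spells the value `np.nan` (lowercase), which is the same value as `np.NaN`; the uppercase alias was removed in NumPy 2.0.

```python
import numpy as np
import pandas as pd

x = np.nan

equals_itself = (x == x)       # NaN never compares equal, even to itself
detected_pd = pd.isna(x)       # pandas' missing-value check
detected_np = np.isnan(x)      # NumPy's check (numeric inputs only)
print(equals_itself, detected_pd, detected_np)
```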

<br>

#### `change`

Once you've read the linked Stack Overflow discussion, fix `change` so that it works as intended.

<br>

####  `correct_replacement`
Complete the implementation of the function `correct_replacement`, which takes in a DataFrame like `nans` and uses your updated `change` function to replace all of the `np.NaN`s in the input DataFrame (in all columns) with `'MISSING'`.

* You **cannot** use the `fillna` method, though the `applymap` method might be useful.
* Make sure **not** to modify the input DataFrame in-place. Instead, return a new DataFrame.
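
Here is a minimal sketch of element-wise replacement that leaves the input untouched. It assumes one possible fix for `change` (using `pd.isna`). Since `applymap` is renamed `DataFrame.map` in newer pandas releases, the sketch picks whichever exists; the variable name `elementwise` is just for illustration.

```python
import numpy as np
import pandas as pd

def change(x):
    # pd.isna detects NaN where `x == np.nan` cannot.
    return 'MISSING' if pd.isna(x) else x

nans = pd.DataFrame([[0, 1, np.nan], [np.nan, np.nan, np.nan], [1, 2, 3]])

# Use DataFrame.map if this pandas version has it, otherwise fall back to applymap.
elementwise = getattr(nans, 'map', None) or nans.applymap
replaced = elementwise(change)   # returns a new DataFrame
print(replaced)
```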


<br>

####  `missing_ser`

Complete the implementation of the function `missing_ser`, which does not take any arguments and returns the answer to the following multiple choice question.

Consider a Series named `ser` that has six elements:

```py
ser = pd.Series([np.NaN, 'DSC 80', np.NaN, 'King Triton', 'Queen Triton', np.NaN])
```

What would be the result of running the following code?

```py
ser[ser.isna()] = 'MISSING'
```

* Predict the output of running the lines of code above. Pick a corresponding answer from the given options below, and have `missing_ser` return that number.
* You should be able to answer without running any code, but feel free to run code to check your answer.
* **Your function should not do anything other than return a hardcoded answer.**


1. `pd.Series([np.NaN, 'MISSING', np.NaN, 'MISSING', 'MISSING', np.NaN])`
2. `pd.Series(['MISSING', 'DSC 80', 'MISSING', 'King Triton', 'Queen Triton', 'MISSING'])`
3. Error. The code would not run.
      
<br>
        
####  `fill_ser`

Complete the implementation of the function `fill_ser`, which takes in a DataFrame with many `np.NaN`s and replaces each `np.NaN` with the string `'MISSING'` instead. This modification should be **in-place**, meaning that the function **should not return anything** and should simply modify the DataFrame given as input.

As a reminder, please follow these requirements:

* You need to write code general enough to be applied to a different DataFrame. 
* Do not hard-code any answers. 
* Looping over the columns *is* allowed, but `applymap` *is not* allowed.
* `apply` and `fillna` are not allowed.
* You shouldn't use any methods in this function (`loc` *is not* a method).
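
One way to modify a DataFrame in place is to assign through `loc` with a Boolean row mask and a column label, as in the sketch below. Assigning through a chained index such as `df[col][mask] = ...` can instead act on a copy and trigger a `SettingWithCopyWarning`. The helper name `fill_sketch` is hypothetical, and the toy DataFrame is created with `dtype=object` (an assumption) so that strings can be stored without any dtype conversion. The mask here is built with `isna`, as in the `ser` example above; check for yourself whether that fits the rules listed for `fill_ser`.

```python
import numpy as np
import pandas as pd

def fill_sketch(df):
    # Loop over columns (not rows); loc writes into df itself.
    for col in df.columns:
        df.loc[df[col].isna(), col] = 'MISSING'
    # No return statement: the function returns None.

toy = pd.DataFrame(
    [[0, 1, np.nan], [np.nan, np.nan, np.nan], [1, 2, 3]],
    dtype=object,
)
returned = fill_sketch(toy)
print(toy)
```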

In [26]:
ser = pd.Series([np.NaN,'DSC', np.NaN])
ser[ser.isna()] = 'MISSING'
ser

0    MISSING
1        DSC
2    MISSING
dtype: object

In [30]:
nans = pd.DataFrame([[0,1,np.NaN],[np.NaN,np.NaN,np.NaN],[1,2,3]])
print(nans)
fill_ser(nans)
print(nans)

     0    1    2
0  0.0  1.0  NaN
1  NaN  NaN  NaN
2  1.0  2.0  3.0
         0        1        2
0      0.0      1.0  MISSING
1  MISSING  MISSING  MISSING
2      1.0      2.0      3.0


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  current_col[current_col.isna()] = 'MISSING'


In [31]:
grader.check("q5")

## Part 3: Summary Statistics 📊

In this part, you will define two general-purpose functions that make it easy to qualitatively assess the contents of a DataFrame.

### Question 6

Complete the implementation of the function `population_stats`, which takes in a DataFrame `df` and returns a DataFrame indexed by the columns of `df`, with the following columns:
* `'num_nonnull'`, which contains the number of non-null entries in each column.
* `'prop_nonnull'`, which contains the proportion of entries in each column that are non-null.
* `'num_distinct'`, which contains the number of distinct non-null entries in each column.
* `'prop_distinct'`, which contains the proportion of non-null entries that are distinct in each column.
       
For example, if `df` has a column named `'ages'` with the following elements:
       
```py
[2, 2, 2, np.NaN, 5, 7, 5, 10, 11, np.NaN]
```

Then:
- `'num_nonnull'` is 8, and `'prop_nonnull'` is $\frac{8}{10}$ = 0.8.
- There are six distinct entries, `[2, 5, 7, 10, 11, np.NaN]`, but only 5 of them are non-null. So the number of distinct non-null entries, `'num_distinct'`, is 5.
- There are 5 distinct non-null entries, and there are 8 total non-null entries, so `'prop_distinct'` is $\frac{5}{8}$ = 0.625.

Putting it all together, `population_stats(df).loc['ages']` should be a Series containing the numbers 8, 0.8, 5, and 0.625.
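
The `'ages'` example can be reproduced with a sketch like the following, which relies on `DataFrame.count` (non-null entries per column) and `DataFrame.nunique` (distinct non-null entries per column). The helper name `population_stats_sketch` is hypothetical; treat this as one possible approach.

```python
import numpy as np
import pandas as pd

def population_stats_sketch(df):
    num_nonnull = df.count()       # non-null entries per column
    num_distinct = df.nunique()    # distinct non-null entries per column
    return pd.DataFrame({
        'num_nonnull': num_nonnull,
        'prop_nonnull': num_nonnull / len(df),
        'num_distinct': num_distinct,
        'prop_distinct': num_distinct / num_nonnull,
    })

ages_df = pd.DataFrame({'ages': [2, 2, 2, np.nan, 5, 7, 5, 10, 11, np.nan]})
ages_row = population_stats_sketch(ages_df).loc['ages']
print(ages_row)
```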

In [43]:
# don't change this cell -- it is needed for the tests to work
pop_data = np.random.choice(range(10), size=(100, 4))
df_pop = pd.DataFrame(pop_data, columns='A B C D'.split())
out_pop = population_stats(df_pop)
print(out_pop)

   num_nonnull  prop_nonnull  num_distinct  prop_distinct
A        100.0           1.0          10.0            0.1
B        100.0           1.0          10.0            0.1
C        100.0           1.0          10.0            0.1
D        100.0           1.0          10.0            0.1


In [44]:
grader.check("q6")

### Question 7
    
Complete the implementation of the function `most_common`, which takes in a DataFrame `df` and a number `N` and returns a DataFrame of the `N` most-common values and their counts for each column of `df`. Any column with fewer than `N` distinct values should contain `np.NaN` in those entries.

For example, consider the DataFrame shown on the left. This DataFrame is a subset of `salaries`, a larger DataFrame containing information on employees in the City of San Diego. The subset below contains two of the original columns: `'Job Title'`, which contains job titles for employees, and `'status'`, which denotes whether the employee works a full-time position (`'FT'`) or a part-time position (`'PT'`). On the right, the return value of `most_common(salaries, N=5)` is shown.

You can assume that there are no ties in our hidden tests.

<table><tr>
    <td><img src="data/imgs/dataframe.png" width="90%"/></td>
    <td><img src="data/imgs/most_common.png" width="90%"/></td>
</tr></table>

***Note:*** You can loop through the *columns* of `df` to construct your output. You should **not** be looping through rows.

***Hint:*** You may find that initializing an empty DataFrame with `N` rows and adding columns to it is useful in your implementation.
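
The hint can be sketched as follows. The output column names (`'A_values'`, `'A_counts'`, and so on) follow the sample output shown in this notebook; the helper name `most_common_sketch` and the toy data are made up. `reindex(range(N))` pads any column that has fewer than `N` distinct values with `np.NaN`.

```python
import pandas as pd

def most_common_sketch(df, N=10):
    out = pd.DataFrame(index=range(N))    # empty DataFrame with N rows
    for col in df.columns:                # loop over columns, not rows
        top = df[col].value_counts().iloc[:N]
        # Re-index both pieces to 0..N-1 so short columns are padded with NaN.
        out[f'{col}_values'] = pd.Series(top.index).reindex(range(N))
        out[f'{col}_counts'] = pd.Series(top.to_numpy()).reindex(range(N))
    return out

toy = pd.DataFrame({'A': ['x', 'x', 'x', 'y', 'y', 'z']})
toy_common = most_common_sketch(toy, N=4)
print(toy_common)
```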

In [65]:
# don't change this cell -- it is needed for the tests to work
common_data = np.random.choice(range(10), size=(100, 2))
common_df = pd.DataFrame(common_data, columns='A B'.split())
print(common_df)
common_out = most_common(common_df,N=3)
common_out

    A  B
0   5  6
1   9  8
2   8  1
3   9  4
4   2  4
.. .. ..
95  9  0
96  8  6
97  4  8
98  1  4
99  3  3

[100 rows x 2 columns]
9    17
8    13
1    13
Name: A, dtype: int64
1    15
8    13
3    12
Name: B, dtype: int64


,A_values,A_counts,B_values,B_counts
0,9,17,1,15
1,8,13,8,13
2,1,13,3,12


In [66]:
grader.check("q7")

## Part 4: Superheroes 🦸

The questions below analyze two datasets about superheroes, both found in the `data` directory. One of the datasets lists the attributes of each superhero, while the other is a *Boolean* DataFrame describing which superheroes have which superpowers. Note that the datasets contain information on both **good** superheroes and **bad** superheroes (AKA villains).

If you took DSC 10 in Fall 2021, this dataset may seem familiar – it was used for the Final Project that quarter!

### Question 8

Let's start working with the `powers` dataset, which you can see in `data/superheroes_powers.csv`. 

Complete the implementation of the function `super_hero_powers`, which takes in a DataFrame like `powers` and returns a list with the following three entries:

1. The name of the superhero with the greatest number of superpowers.
2. The name of the second most common superpower among superheroes who can fly (the most common being `'Flight'` itself).
3. The name of the most common superpower among superheroes with only one superpower.

You should **not** be hard-coding your answers in this question; your function should work on any DataFrame similar to `powers`. You should not be using loops in this question. In each case, you can assume the answer is unique.

***Hint:*** You may find the `idxmax` method useful in this problem.
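
To illustrate how `idxmax` pairs with Boolean sums, here is a sketch on a tiny made-up Boolean DataFrame. The column name `'hero_names'`, the hero labels, and the powers other than `'Flight'` are assumptions for illustration. Summing along `axis=1` counts powers per hero, while the default column-wise sum counts heroes per power.

```python
import pandas as pd

toy_powers = pd.DataFrame({
    'hero_names': ['A', 'B', 'C', 'D', 'E'],
    'Flight':  [True, True, False, False, False],
    'Agility': [True, True, False, False, True],
    'Stamina': [True, False, True, True, False],
}).set_index('hero_names')

# Hero with the most powers: sum across each row, then take the label of the max.
most_powers = toy_powers.sum(axis=1).idxmax()

# Most common power among fliers, other than 'Flight' itself.
fliers = toy_powers[toy_powers['Flight']]
second_among_fliers = fliers.drop(columns='Flight').sum().idxmax()

# Most common power among heroes with exactly one power.
single = toy_powers[toy_powers.sum(axis=1) == 1]
top_single_power = single.sum().idxmax()

print([most_powers, second_among_fliers, top_single_power])
```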

In [78]:
# don't change this cell -- it is needed for the tests to work
super_fp = os.path.join('data', 'superheroes_powers.csv')
powers = pd.read_csv(super_fp)
super_out = super_hero_powers(powers)
print(super_out)

Spectre
['Spectre', 'Super Strength', 'Intelligence']


In [77]:
grader.check("q8")

### Question 9

In the notebook, load in the dataset in `data/superheroes.csv` as a DataFrame and explore it. Call your `population_stats` function from Question 6 on the DataFrame. You should notice that there are very few actually null (`np.NaN`) values, but there are many entries that **should** be null.

Complete the implementation of the function `clean_heroes`, which takes in a DataFrame like the one mentioned above and returns a new DataFrame with all of the missing values replaced with `np.NaN`.

***Note:*** Most of the work in this question is identifying how the missing values are stored in the DataFrame. The implementation of the function should only take one line.
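
As a sketch of the general technique, `DataFrame.replace` can swap placeholder values for `np.NaN` and return a new DataFrame. The toy data below assumes the placeholders are `'-'` for text columns and `-99.0` for numeric columns, which is what the raw and cleaned previews in this notebook suggest; confirm this against the real data yourself.

```python
import numpy as np
import pandas as pd

toy_heroes = pd.DataFrame({
    'Skin color': ['-', 'blue'],
    'Height': [203.0, -99.0],
})

# replace returns a new DataFrame; toy_heroes itself is not modified.
toy_clean = toy_heroes.replace('-', np.nan).replace(-99.0, np.nan)
print(toy_clean)
```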

In [80]:
superheroes_fp = os.path.join('data', 'superheroes.csv')
heroes = pd.read_csv(superheroes_fp, index_col=0)
population_stats(heroes)
heroes

,name,Gender,Eye color,Race,Hair color,Height,Publisher,Skin color,Alignment,Weight
0,A-Bomb,Male,yellow,Human,No Hair,203.0,Marvel Comics,-,good,441.0
1,Abe Sapien,Male,blue,Icthyo Sapien,No Hair,191.0,Dark Horse Comics,blue,good,65.0
2,Abin Sur,Male,blue,Ungaran,No Hair,185.0,DC Comics,red,good,90.0
3,Abomination,Male,green,Human / Radiation,No Hair,203.0,Marvel Comics,-,bad,441.0
4,Abraxas,Male,blue,Cosmic Entity,Black,-99.0,Marvel Comics,-,bad,-99.0
...,...,...,...,...,...,...,...,...,...,...
729,Yellowjacket II,Female,blue,Human,Strawberry Blond,165.0,Marvel Comics,-,good,52.0
730,Ymir,Male,white,Frost Giant,No Hair,304.8,Marvel Comics,white,good,-99.0
731,Yoda,Male,brown,Yoda's species,White,66.0,George Lucas,green,good,17.0
732,Zatanna,Female,blue,Human,Black,170.0,DC Comics,-,good,57.0


In [108]:
# don't change this cell -- it is needed for the tests to work
superheroes_fp = os.path.join('data', 'superheroes.csv')
heroes = pd.read_csv(superheroes_fp, index_col=0)
clean_out = clean_heroes(heroes)
population_stats(clean_out)

,num_nonnull,prop_nonnull,num_distinct,prop_distinct
name,734.0,1.0,715.0,0.974114
Gender,705.0,0.96049,2.0,0.002837
Eye color,562.0,0.765668,22.0,0.039146
Race,430.0,0.585831,61.0,0.14186
Hair color,562.0,0.765668,29.0,0.051601
Height,517.0,0.70436,53.0,0.102515
Publisher,719.0,0.979564,24.0,0.03338
Skin color,72.0,0.098093,16.0,0.222222
Alignment,727.0,0.990463,3.0,0.004127
Weight,495.0,0.674387,134.0,0.270707


In [89]:
grader.check("q9")

Below, we have displayed the first 10 rows of the cleaned DataFrame.

In [90]:
clean_out.head(10)

,name,Gender,Eye color,Race,Hair color,Height,Publisher,Skin color,Alignment,Weight
0,A-Bomb,Male,yellow,Human,No Hair,203.0,Marvel Comics,,good,441.0
1,Abe Sapien,Male,blue,Icthyo Sapien,No Hair,191.0,Dark Horse Comics,blue,good,65.0
2,Abin Sur,Male,blue,Ungaran,No Hair,185.0,DC Comics,red,good,90.0
3,Abomination,Male,green,Human / Radiation,No Hair,203.0,Marvel Comics,,bad,441.0
4,Abraxas,Male,blue,Cosmic Entity,Black,-99.0,Marvel Comics,,bad,-99.0
5,Absorbing Man,Male,blue,Human,No Hair,193.0,Marvel Comics,,bad,122.0
6,Adam Monroe,Male,blue,,Blond,-99.0,NBC - Heroes,,good,-99.0
7,Adam Strange,Male,blue,Human,Blond,185.0,DC Comics,,good,88.0
8,Agent 13,Female,blue,,Blond,173.0,Marvel Comics,,good,61.0
9,Agent Bob,Male,brown,Human,Brown,178.0,Marvel Comics,,good,81.0


### Question 10

Using the **cleaned** superhero data, we will now generate some insights.

Complete the implementation of the function `super_hero_stats`, which takes no arguments and returns a list of length 6 containing your answers to the questions below. **Your answers should be hard-coded in the function.**

0. What is the name of the tallest `'Mutant'` with `'No Hair'`?
1. Among the publishers who have more than 5 characters, which publisher has the highest proportion of human characters? If there is a tie, return the publisher whose name is first alphabetically. We define a character to be human if their `'Race'` is exactly the string `'Human'`; for instance, a `'Race'` of `'Human / Radiation'` is non-human for the purposes of this question.
2. Among the characters whose `'Height'`s we know, who is taller on average – `'good'` characters or `'bad'` characters?
3. Which publisher has a greater proportion of `'bad'` characters – `'Marvel Comics'` or `'DC Comics'`?
4. Which `'Publisher'` that isn't `'Marvel Comics'` or `'DC Comics'` has the most characters? Consider all characters whose `'Publisher'` we know – that is, don't drop rows because they have null values in other columns.
5. There is only one character that is **both** more than one standard deviation above the mean in height and more than one standard deviation below the mean in weight. What is their name?

***Note:*** Although you'll be writing code to find the answers, you should not include your code in your `.py` file. Just return a hard-coded list with your answers to the 6 questions; all 6 elements in the list should be strings.
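
Several of these questions ask for a proportion within each group. One common pattern, sketched here on made-up data (the publisher labels `'P1'` and `'P2'` are hypothetical), is to build a Boolean column and take its mean per group, since the mean of a Boolean Series is the proportion of `True` values.

```python
import pandas as pd

toy_chars = pd.DataFrame({
    'Publisher': ['P1', 'P1', 'P1', 'P1', 'P2', 'P2'],
    'Alignment': ['bad', 'good', 'good', 'good', 'bad', 'bad'],
})

# Mean of a Boolean column within each group = proportion of True in that group.
prop_bad = (
    toy_chars
    .assign(is_bad=toy_chars['Alignment'] == 'bad')
    .groupby('Publisher')['is_bad']
    .mean()
)
print(prop_bad)
```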

In [109]:
indexed = clean_out.set_index('name')
first = indexed[(indexed['Race'] == 'Mutant') & (indexed['Hair color'] == 'No Hair')]['Height'].idxmax()
indexed


name,Gender,Eye color,Race,Hair color,Height,Publisher,Skin color,Alignment,Weight
A-Bomb,Male,yellow,Human,No Hair,203.0,Marvel Comics,,good,441.0
Abe Sapien,Male,blue,Icthyo Sapien,No Hair,191.0,Dark Horse Comics,blue,good,65.0
Abin Sur,Male,blue,Ungaran,No Hair,185.0,DC Comics,red,good,90.0
Abomination,Male,green,Human / Radiation,No Hair,203.0,Marvel Comics,,bad,441.0
Abraxas,Male,blue,Cosmic Entity,Black,,Marvel Comics,,bad,
...,...,...,...,...,...,...,...,...,...
Yellowjacket II,Female,blue,Human,Strawberry Blond,165.0,Marvel Comics,,good,52.0
Ymir,Male,white,Frost Giant,No Hair,304.8,Marvel Comics,white,good,
Yoda,Male,brown,Yoda's species,White,66.0,George Lucas,green,good,17.0
Zatanna,Female,blue,Human,Black,170.0,DC Comics,,good,57.0


In [104]:
more_5 = indexed.groupby('Publisher').filter(lambda df: df.shape[0] > 5)
more_5['Is Human'] = more_5['Race'] == 'Human'
proportions = more_5.groupby('Publisher')['Is Human'].mean().sort_values(ascending=False)
proportions

Publisher
George Lucas         0.500000
Star Trek            0.500000
DC Comics            0.423256
Dark Horse Comics    0.388889
Marvel Comics        0.219072
HarperCollins        0.000000
Image Comics         0.000000
NBC - Heroes         0.000000
Name: Is Human, dtype: float64

In [111]:
known_heights = indexed[indexed['Height'].notna()]
known_heights.groupby('Alignment').mean(numeric_only=True)

Alignment,Height,Weight
bad,187.082432,140.057971
good,183.845245,95.61747
neutral,237.411765,198.117647


In [112]:
with_is_bad = indexed.assign(is_bad=indexed['Alignment'] == 'bad')
with_is_bad.groupby('Publisher').mean(numeric_only=True)

Publisher,Height,Weight,is_bad
ABC Studios,,,0.0
DC Comics,180.900685,102.784722,0.274419
Dark Horse Comics,176.909091,117.357143,0.333333
George Lucas,183.916667,77.4,0.428571
Hanna-Barbera,,,0.0
HarperCollins,,,0.0
IDW Publishing,,,0.0
Icon Comics,,,0.25
Image Comics,211.0,405.0,0.785714
J. K. Rowling,,,0.0


In [115]:
by_publisher = indexed.assign(is_character = True).groupby('Publisher').count()
by_publisher['is_character'].sort_values()

Publisher
Titan Books            1
South Park             1
Hanna-Barbera          1
Rebellion              1
Microsoft              1
Universal Studios      1
J. K. Rowling          1
J. R. R. Tolkien       1
Sony Pictures          2
Wildstorm              3
Shueisha               4
ABC Studios            4
Icon Comics            4
IDW Publishing         4
SyFy                   5
Team Epic TV           5
HarperCollins          6
Star Trek              6
George Lucas          14
Image Comics          14
Dark Horse Comics     18
NBC - Heroes          19
DC Comics            215
Marvel Comics        388
Name: is_character, dtype: int64

In [None]:
weight = indexed['Weight']
height = indexed['Height']
# Combine element-wise Boolean Series with `&`; Python's `and` raises a ValueError on Series.
fits_conditions = indexed[(height > (height.mean() + height.std())) & (weight < (weight.mean() - weight.std()))]


In [None]:
# don't change this cell -- it is needed for the tests to work
stats_out = super_hero_stats()

In [None]:
grader.check("q10")

## Congratulations! You're done with Lab 2! 🏁

As a reminder, all of the work you want to submit needs to be in `lab.py`.

To verify that all of your work is indeed in `lab.py`, and that you didn't accidentally implement a function in this notebook and not in `lab.py`, we've included another notebook in the lab folder, called `lab-validation.ipynb`. `lab-validation.ipynb` is a version of this notebook with only the `grader.check` cells and the code needed to set up the tests. 

### **Go to `lab-validation.ipynb`, and go to Kernel > Restart & Run All.** This will check if all `grader.check` test cases pass using just the code in `lab.py`.

Once you're able to pass all test cases in `lab-validation.ipynb`, including the call to `grader.check_all()` at the very bottom, then you're ready to submit your `lab.py` (and only your `lab.py`) to Gradescope. After submitting to Gradescope, make sure to stick around until all test cases pass.

There is also a call to `grader.check_all()` below in _this_ notebook, but make sure to also follow the steps above.

---

To double-check your work, the cell below will rerun all of the autograder tests.

In [None]:
grader.check_all()