# Discussion 06: Pandas Review

This notebook consists of review problems. They don't need to be turned in and solutions will *not* be provided (ask Staff if you have questions!)

Below is a DataFrame `submissions`, consisting of all HW submissions by students in a course:
* `PID` refers to the student ID
* `LVL` refers to the student's class level: Freshman (FR), Sophomore (SO), Junior (JR), or Senior (SR)
* `ASSGN` refers to which assignment number the submission was for (e.g. submission of HW#4).
* `GRADE` refers to the grade given to that submission.

Every student has *at most* one submission per assignment. There are no null values.

|PID|LVL|ASSGN|GRADE|
|---|---|---|---|
|A23452342|SO|HW2|79|
|A35434334|JR|HW5|96|
|A23452342|SO|HW1|99|
|A23452342|SO|HW5|90|
|A39598745|JR|HW7|67|
|A37534462|SR|HW2|93|
|A37534462|SR|HW9|79|
|...|...|...|...|
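
The two guarantees stated for `submissions` (at most one submission per student per assignment, and no nulls) can be checked directly. The sketch below is only an illustration: it rebuilds the seven rows visible in the table, not the full dataset, and the variable names `has_duplicates` and `has_nulls` are made up for this example.

```python
import pandas as pd

# Only the seven rows visible in the table; the real data has more rows.
submissions = pd.DataFrame(
    {
        'PID': ['A23452342', 'A35434334', 'A23452342', 'A23452342',
                'A39598745', 'A37534462', 'A37534462'],
        'LVL': ['SO', 'JR', 'SO', 'SO', 'JR', 'SR', 'SR'],
        'ASSGN': ['HW2', 'HW5', 'HW1', 'HW5', 'HW7', 'HW2', 'HW9'],
        'GRADE': [79, 96, 99, 90, 67, 93, 79],
    }
)

# True if any (student, assignment) pair appears more than once.
has_duplicates = submissions.duplicated(subset=['PID', 'ASSGN']).any()

# True if any cell in the table is null.
has_nulls = submissions.isna().any().any()

print(has_duplicates, has_nulls)
```

Both flags should come out `False` on data that satisfies the stated guarantees.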

Below is a DataFrame `grades`, representing the course gradebook, constructed from the submissions above:

|PID|LVL|HW1|HW2|...|HW9|
|---|---|---|---|---|---|
|A23452342|SO|99|79|...|88|
|A37534462|SR|67|93|...|79|
|...|...|...|...|...|...|



## Questions:

For each of the following questions, produce the desired result **in two ways**: using `submissions` and using `grades`. (Don't just transform `submissions` into `grades` and compute the answer two ways; strive for solutions that are efficient both computationally and as measured in lines of code!)

1. What are the kinds of data represented in each column?
1. Calculate the number of students in the class.
1. Calculate the number of distinct assignments.
1. Calculate the number of FR/SO/JR/SR that turned in at least one assignment. (Return a Series indexed by LVL).
1. For each assignment, calculate the average grade and the number of submissions. (Return a DataFrame indexed by ASSGN, with one column per statistic).
1. Calculate the highest grade on each assignment (for `submissions`, both using groupby and not using groupby).
1. Which assignment was the highest grade for the largest number of students?
1. Suppose each assignment comes in its own DataFrame called `hwXX` (with columns PID, LVL, ASSGN, GRADE). Construct `submissions` and `grades` from these DataFrames.

Lastly: 
* Write a single line of code which transforms `submissions` to `grades`.
* Add a column to `grades` called `Letter Grades` by computing the letter grades from the average HW grades of each student.

In [2]:
import pandas as pd
import numpy as np

In [32]:
df = pd.DataFrame(
    {
        'score': np.arange(5),
        'level': ['a','b','a','b','a']
    }
)
df

| |score|level|
|---|---|---|
|0|0|a|
|1|1|b|
|2|2|a|
|3|3|b|
|4|4|a|


In [64]:
# Per-level sum and mean of `score`, collected into one table indexed by `level`.
grouped = df.groupby('level').score
replacement = pd.DataFrame(
    {
        'sum': grouped.sum(),
        'mean': grouped.mean()
    }
)

In [65]:
replacement

|level|sum|mean|
|---|---|---|
|a|6|2|
|b|4|2|


In [39]:
# Look up each row's level in the `mean` column of the per-level table.
df['mean'] = df.level.map(replacement['mean'])

In [40]:
df

| |score|level|mean|
|---|---|---|---|
|0|0|a|2.0|
|1|1|b|2.0|
|2|2|a|2.0|
|3|3|b|2.0|
|4|4|a|2.0|
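
The cells above attach a per-level statistic to every row by way of an intermediate table. One possible alternative, sketched here on the same toy `df` and not necessarily the approach intended in discussion, is `groupby(...).transform`, which returns a result aligned with the original rows.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        'score': np.arange(5),
        'level': ['a', 'b', 'a', 'b', 'a']
    }
)

# transform returns one value per original row,
# so each row receives the mean score of its own level.
df['mean'] = df.groupby('level').score.transform('mean')
df
```

This skips the lookup table entirely, at the cost of computing one statistic per call.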


## Other Questions:

These are focused on the statistical concepts, which are pretty well covered by our HWs and the practice midterm. However, here are more questions using the same dataset (which also could be good material for studying for the final!)

* Do Sophomores have significantly better average HW grades than the rest of the class?
    - How will the sampling distribution (under the null hypothesis) change if there are only a few Sophomores in the class, as opposed to a class made up mostly of Sophomores? In which case are you more confident in your answer?
* Do Seniors have significantly lower grades than Juniors? How do your results change when the group sizes are equal? When they are very unequal?
* Compute the distribution of `Letter Grades` conditional on `LVL`.
* In `grades`, there are missing values in the `HWXX` columns. While we know that they are missing exactly when a student didn't turn in the assignment, we can still ask what the student may have gotten had they turned the assignment in. To follow this line of thought:
    1. Check if the missingness of `HWXX` is dependent on `LVL`.
    1. Check if the missingness of `HWXX` is dependent on `HWYY`. Since `HWYY` is quantitative, you should first convert it to letter grades, before checking dependency.
    1. What are the test statistics for testing the above two questions?
    1. Consider all the methods of imputation we've learned (single-valued/probabilistic, conditional/unconditional). Which of these give the best results for the possible results from the missingness tests?
