# DSC 80: Project 01

### Checkpoint Due Date: Thursday April 9, 11:59:59 PM (Questions 1-4)
### Due Date: Thursday, April 16, 11:59:59 PM

---
# Instructions

This Jupyter Notebook contains the statements of the problems and provides code and markdown cells to display your answers to the problems.  
* As in the labs, your coding work will be developed in the accompanying `project01.py` file, which will be imported into the current notebook. This code will be autograded.
* **For the checkpoint, turn in questions 1-4**

**Do not change the function names in the `*.py` file**
- The functions in the `*.py` file are how your assignment is graded, and they are graded by their name. The dictionary at the end of the file (`GRADED FUNCTIONS`) contains the "grading list". The final function in the file allows your doctests to check that all the necessary functions exist.
- If you changed something you weren't supposed to, just use git to revert!

**Tips for developing in the .py file**:
- Do not change the function names in the starter code; grading is done using these function names.
- Do not change the docstrings in the functions. These are there to tell you if your work is on the right track!
- You are **encouraged to write your own additional functions** to solve the questions! 
    - Developing in python usually consists of larger files, with many short functions.
    - You may write your other functions in an additional `.py` file that you import in `project01.py` -- however, be sure to upload these to gradescope as well!
- Always document your code!

**Tips for testing the correctness of your answers!**
Once you have your work saved in the .py file, import `project01` to test your functions in the notebook. In the notebook, inspect and analyze the output to assess its correctness!
* Run your functions on the main dataset (`grades`) and ask yourself if the output *looks correct.*
* Run your functions on very small datasets (e.g. 1-5 row table), calculate the expected response by hand, and see if the function output matches (this *is* unit-testing your code with data).
* Run your functions on (large and small) samples of the dataset `grades` (with and without replacement). Does your code break, or does it still run as expected?
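
One way to apply these tips is sketched below. The function `mean_proportion` and the column names `score` and `max_points` are hypothetical stand-ins, not part of `project01.py`; substitute your own function and a hand-built table that matches its expected input.

```python
import pandas as pd

# Hypothetical stand-in for a function under test: mean score as a proportion.
def mean_proportion(df):
    return (df['score'] / df['max_points']).mean()

# A tiny table whose answer is easy to work out by hand.
tiny = pd.DataFrame({'score': [5.0, 10.0, 0.0],
                     'max_points': [10.0, 10.0, 10.0]})
expected = (0.5 + 1.0 + 0.0) / 3
assert abs(mean_proportion(tiny) - expected) < 1e-12

# Resample with replacement: the function should still run and stay in range.
sample = tiny.sample(n=10, replace=True, random_state=0)
resampled_result = mean_proportion(sample)
assert 0.0 <= resampled_result <= 1.0
```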

In [2]:
%load_ext autoreload
%autoreload 2

In [3]:
import project01 as proj

In [4]:
%matplotlib inline
import pandas as pd
import numpy as np

import os

# The Other Side of Gradescope

The file contains the grade-book from a fictional data science course with 535 students. 

**Note: this dataset is synthetically generated; it does not contain real student grades.**

In this project, you will:
1. clean and process the data to compute total course grades according to a fictional syllabus (below),
2. qualitatively understand how students did in the course,
3. understand how student grades vary with small changes in performance on each assignment.

---

The course syllabus is as follows:

* Lab assignments 
    - Each lab is worth the same amount, regardless of its raw point total.
    - The lowest lab is dropped.
    - Each lab may be revised for one week after submission for a 10% penalty, for two weeks after submission for a 20% penalty, and beyond that for a 50% penalty. Such revisions are reflected in the `Lateness` columns in the gradebook.
    - Labs are 20% of the total grade.
* Projects 
    - Each project consists of an autograded portion, and *possibly* a free response portion.
    - The total points for a single project are the sum of the raw scores of the two portions.
    - Each project is worth the same amount, regardless of its raw point total.
    - Projects are 30% of the total grade.
* Checkpoints
    - Project checkpoints are worth 2.5% of the total grade.
* Discussion
    - Discussion notebooks are worth 2.5% of the total grade.
* Exams
    - The midterm is worth 15% of the total grade.
    - The final is worth 30% of the total grade.
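
As a quick sanity check on the syllabus, the weights can be written down and combined as in the sketch below. The names `WEIGHTS` and `weighted_total` are illustrative assumptions, not part of the starter code, and the input is assumed to hold one already-computed proportion per syllabus area.

```python
import pandas as pd

# Weights copied from the syllabus; the keys are the syllabus areas.
WEIGHTS = pd.Series({
    'lab': 0.20, 'project': 0.30, 'checkpoint': 0.025,
    'disc': 0.025, 'midterm': 0.15, 'final': 0.30,
})

def weighted_total(components):
    """Combine per-area proportions (one column per area) into a course total."""
    return components[WEIGHTS.index].mul(WEIGHTS, axis=1).sum(axis=1)

# Hypothetical students: one with perfect scores, one with 50% everywhere.
components = pd.DataFrame(1.0, index=[0, 1], columns=WEIGHTS.index)
components.loc[1] = 0.5
totals = weighted_total(components)
```

The weights sum to 1, so a student with the same proportion in every area receives that proportion as a course total.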


# A note on generalization

You may assume that your code will only need to work on a gradebook for a class with the syllabus given above. That is, you may assume that the dataframe `grades` looks like the given one in `data/grades.csv`.

However, such a class:
1. may have a different number of labs, projects, discussions, and project checkpoints.
2. may have a different number of students.

You may assume the course components and the naming conventions are as given in the data file.

The dataset was generated by Gradescope; you must attempt to reason about the data as given using what you know as a student who uses Gradescope.

### A note on 'putting everything together'

The goal of this project is to create and assess final grades for a fictional course; if anything, the process is broken down into functions for your convenience and guidance. Here are a few remarks and tips for approaching the projects:
1. If you are having trouble figuring out what a question is asking you to do, look at the big picture and try to understand what the current step is doing to contribute to this big picture. This may clarify what's being asked!
1. These questions intentionally build off of each other and the final result matters! In fact, you can 'get a question correct', but only receive partial credit on it because a previous answer was wrong.
    - A question will typically receive partial credit based on *how close* your answer is to correct (as well as some credit for a solution in the correct form). 
    - You should try to assess your answer to each question based on what you understand of the data. This might involve writing extensive code (that isn't turned in) just to check your work! Suggestions on checking your work are given in the assignment, but you should also think of your own ways of checking your work.
    - As you do this project, think about the data from the perspective of the student (which should be easy to do!)

In [6]:
grades_fp = os.path.join('data', 'grades.csv')
grades = pd.read_csv(grades_fp)

In [23]:
cols = 'lab01 lab02 lab03'.split()
processed = pd.DataFrame([[0.2, 0.90, 1.0]], index=[0], columns=cols)
a= proj.lab_total(processed)
np.isclose(proj.lab_total(processed), 0.95).all()

True

In [11]:
processed.sum(axis=1)-processed.min(axis=1)

0    1.9
dtype: float64

In [28]:
out = proj.simulate_pval(grades, 100)
out

0.01

### Getting started: enumerating the assignments

First, you will list all the 'assignment names' and the part of the syllabus to which each belongs.

**Question 1:**

Create a function `get_assignment_names` that takes in a dataframe like `grades` and returns a dictionary with the following structure:
- The keys are the general areas of the syllabus: `lab, project, midterm, final, disc, checkpoint`
- The values are lists that contain the assignment names of that type. For example the lab assignments all have names of the form `labXX` where `XX` is a zero-padded two digit number. See the doctests for more details.
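
A minimal sketch of one possible approach, assuming the column naming seen in the gradebook (`lab01`, `project01`, `project01_free_response`, `project02_checkpoint01`, `Midterm`, `Final`, `discussion01`). The function name `sketch_assignment_names` and the regular expressions are assumptions for illustration; the doctests remain the authority on the exact expected output.

```python
import re

def sketch_assignment_names(columns):
    # One pattern per syllabus area; fullmatch excludes the
    # ' - Max Points' and ' - Lateness (H:M:S)' companion columns.
    patterns = {
        'lab': r'lab\d{2}',
        'project': r'project\d{2}',
        'midterm': r'Midterm',
        'final': r'Final',
        'disc': r'discussion\d{2}',
        'checkpoint': r'project\d{2}_checkpoint\d{2}',
    }
    return {key: [c for c in columns if re.fullmatch(pat, c)]
            for key, pat in patterns.items()}

# Toy column list modeled on the gradebook header.
toy_columns = ['PID', 'lab01', 'lab01 - Max Points', 'lab01 - Lateness (H:M:S)',
               'project01', 'project01_free_response', 'project02_checkpoint01',
               'Midterm', 'Final', 'discussion01']
names = sketch_assignment_names(toy_columns)
```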

### Computing project grades

**Question 2**

Compute the total score for the project portion of the course according to the syllabus. Create a function `projects_total` that takes in `grades` and computes the total project grade for the quarter according to the syllabus. The output Series should contain values between 0 and 1.

*Note*: Don't forget to properly handle students who didn't turn in assignments! (Use your experience and common sense).

*Note:* To check your work, try (1) calculating the score for a few types of students by hand, and (2) calculating the statistics for the class performance on each individual course project, making sure they look reasonable.

In [8]:
proj.projects_total(grades)

0      0.900000
1      0.759333
2      0.673333
3      0.952667
4      0.718667
         ...   
530    0.949333
531    0.846667
532    0.837333
533    0.797333
534    0.948000
Length: 535, dtype: float64

In [9]:
header = grades.columns
dic = proj.get_assignment_names(grades)
lis = dic['project']

In [10]:
total = pd.Series(0,index = range(len(grades[lis[0]])))
for i in lis:
    st = i+'_free_response'
    s = i+' - Max Points'
    fs = st+' - Max Points'
    if (st in header):
        g = grades[st].add(grades[i],fill_value=0)
        t = grades[s].add(grades[fs],fill_value=0)
    else:
        g = grades[i]
        t = grades[s]
    proportion = g.divide(t,fill_value=0)
    total = proportion/len(lis)+total
total

0      0.900000
1      0.759333
2      0.673333
3      0.952667
4      0.718667
         ...   
530    0.949333
531    0.846667
532    0.837333
533    0.797333
534    0.948000
Length: 535, dtype: float64

In [11]:
b = grades[header[result_max2][0]]
for i in header[result_max2]:
    b = b+grades[i]
b = b-grades[header[result_max2][0]]
b

NameError: name 'result_max2' is not defined

In [12]:
result1 =  np.where(header.str.contains('project') == True)
result3 = np.where(header.str.contains('-')==False)
resultm3 = np.where(header.str.contains('4')==False)
result_p = np.intersect1d(result1,result3)
result_p1 = np.intersect1d(result_p,resultm3)
header[result_p1]

Index(['project01', 'project01_free_response', 'project02_checkpoint01',
       'project02_checkpoint02', 'project02', 'project02_free_response',
       'project03_checkpoint01', 'project03', 'project05_free_response',
       'project05'],
      dtype='object')

In [13]:
a = grades[header[result_p1][0]]
for i in header[result_p1]:
    a = a.add(grades[i],fill_value=0)
a = a - grades[header[result_p1][0]]
a

0      381.0
1      332.0
2      300.0
3      385.0
4      304.0
       ...  
530    397.0
531    349.0
532    360.0
533    348.0
534    386.0
Length: 535, dtype: float64

In [14]:
proportion = a/b
proportion

NameError: name 'b' is not defined

In [15]:
proportion1 = grades['project04'].divide(grades['project04 - Max Points'], fill_value=0)
proportion1

0      0.880000
1      0.666667
2      0.546667
3      0.973333
4      0.573333
         ...   
530    0.986667
531    0.893333
532    0.746667
533    0.706667
534    1.000000
Length: 535, dtype: float64

In [16]:
result = proportion*4/5+proportion1/5
result

0      0.920000
1      0.709333
2      0.597333
3      0.978667
4      0.610667
         ...   
530    0.973333
531    0.922667
532    0.797333
533    0.757333
534    0.984000
Length: 535, dtype: float64

### Computing lab grades

Now, you will clean and process the lab grades, which is a little more complicated. To do this, you will develop functions that:
- 'normalize' the grades, 
- adjust for late submissions, 
- drop the lowest lab grade, and 
- create a total lab score for each student.

**Question 3**

Unfortunately, Gradescope sometimes experiences a delay in registering when an assignment is submitted during "periods of heavy usage" (i.e. near a submission deadline). You need to assess when a student's assignment was actually turned in on time, even if Gradescope did not process it in time. To do this, it is helpful to know:
* Every late submission has to be submitted by a TA (late submissions are turned off).
* TAs never submitted a late assignment "just after" the deadline. 
* The deadlines were at midnight and students had to come to staff hours to late-submit their assignment.

Create a function `last_minute_submissions` that takes in the dataframe `grades` and outputs the number of submissions on each assignment that were turned in on time by the student, yet marked 'late' by Gradescope. See the doctest for more details.

*Note:* You have to figure out what truly is a late submission by looking at the data and understanding the facts about the data generating process above. There is some ambiguity in finding which submissions are truly late; you will *make a best guess for a threshold* by looking at this dataset. This question is about 'cleaning' a messy 'data recording process'.
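
To look for a threshold, it can help to convert the `H:M:S` strings to hours and inspect the nonzero values. The sketch below uses a handful of lateness strings that appear in this gradebook; the helper name `lateness_to_hours` is an assumption, and the choice of threshold remains a judgment call.

```python
import pandas as pd

def lateness_to_hours(col):
    # Split 'H:M:S' into three integer columns and convert to fractional hours.
    parts = col.str.split(':', expand=True).astype(int)
    return parts[0] + parts[1] / 60 + parts[2] / 3600

# A few lateness values taken from the gradebook's lab columns.
lateness = pd.Series(['00:00:00', '00:04:51', '00:00:13', '47:42:33', '252:56:22'])
hours = lateness_to_hours(lateness)

# Sorting the nonzero values exposes any gap between "seconds or minutes late"
# and "many hours late".
late_only = hours[hours > 0].sort_values()
```

On the full dataset, a histogram of `late_only` per lab gives the same view at scale.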

In [17]:
header = grades.columns
result1 =  np.where(header.str.contains('lab') == True)
result2 = np.where(header.str.contains('Lateness')==True)
result_p = np.intersect1d(result1,result2)
df = grades[header[result_p]]
df

Unnamed: 0,lab01 - Lateness (H:M:S),lab02 - Lateness (H:M:S),lab03 - Lateness (H:M:S),lab04 - Lateness (H:M:S),lab05 - Lateness (H:M:S),lab06 - Lateness (H:M:S),lab07 - Lateness (H:M:S),lab08 - Lateness (H:M:S),lab09 - Lateness (H:M:S)
0,00:00:00,00:00:00,252:56:22,00:00:00,00:00:00,00:00:00,382:51:44,00:00:00,00:00:00
1,00:00:00,00:00:00,00:00:00,00:00:00,00:00:00,645:24:50,00:00:00,00:00:00,00:00:00
2,00:00:00,00:00:00,00:00:00,00:00:00,00:00:00,764:40:45,00:04:51,00:00:00,00:00:00
3,00:00:00,00:00:00,00:00:00,00:00:00,00:00:00,00:00:00,00:00:00,00:00:00,00:00:00
4,00:00:00,00:00:00,00:00:00,47:42:33,00:00:00,00:00:00,00:00:00,00:00:00,00:00:00
...,...,...,...,...,...,...,...,...,...
530,47:26:10,00:00:00,00:00:00,12:08:58,00:00:00,431:48:42,00:00:00,00:00:00,00:00:13
531,00:00:00,00:00:00,00:00:00,47:03:14,00:00:00,00:00:00,00:00:00,00:00:00,00:00:00
532,00:00:00,00:00:00,00:00:00,00:00:00,00:00:00,00:00:00,00:00:00,00:00:00,00:00:00
533,00:00:00,00:00:00,00:00:00,00:00:00,00:00:00,419:06:41,00:00:00,00:00:00,00:00:00


In [18]:
def cal(st):
    lst = st.split(':')
    i = int(lst[0])*3600+int(lst[1])*60+int(lst[2])
    if (i <= 36000 and i >0):
        return True
    else:
        return False

In [19]:
l = []
for i in header[result_p]:
    a = df[i].apply(cal)
    b = (a==True).sum()
    l.append(b)
ser = pd.Series(l, index =header[result_p])

**Question 4**

Now you need to adjust the lab grades for late submissions -- however, you need to take into account your investigation in the previous question, since students shouldn't be penalized by a bug in Gradescope!

Create a function `lateness_penalty` that takes in a 'Lateness' column and returns a column of penalties (represented by the values `1.0,0.9,0.8,0.5` according to the syllabus). Only *truly* late submissions should be counted as late.

*Note*: For the purpose of this project, we will only be calculating lateness for labs. There is no penalty for lateness for projects, discussions, nor checkpoints.
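
A vectorized sketch of the penalty rule, under stated assumptions: the name `sketch_lateness_penalty` and the `grace_hours` parameter are hypothetical, and the 10-hour default is only one possible guess at the threshold from Question 3.

```python
import numpy as np
import pandas as pd

def sketch_lateness_penalty(col, grace_hours=10):
    # Convert 'H:M:S' strings to fractional hours.
    parts = col.str.split(':', expand=True).astype(int)
    hours = (parts[0] + parts[1] / 60 + parts[2] / 3600).to_numpy()
    week = 7 * 24
    # np.select uses the first condition that holds for each entry.
    conditions = [hours <= grace_hours, hours <= week, hours <= 2 * week]
    choices = [1.0, 0.9, 0.8]
    penalties = np.select(conditions, choices, default=0.5)
    return pd.Series(penalties, index=col.index)

lateness = pd.Series(['00:00:00', '00:04:51', '47:42:33', '252:56:22', '382:51:44'])
penalties = sketch_lateness_penalty(lateness)
```

Because the conditions are checked in order, each later condition only applies to entries that failed the earlier ones, which keeps the bands from overlapping.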

In [20]:
def calculate(st):
    lst = st.split(':')
    i = int(lst[0])*3600+int(lst[1])*60+int(lst[2])
    if (i <= 604800 and i >36000):
        return 0.9
    elif(i>604800 and i <=1209600):
        return 0.8
    elif(i >1209600):
        return 0.5
    elif(i<=36000):
        return 1.0

In [21]:
fp = os.path.join('data', 'grades.csv')
grades = pd.read_csv(fp)
out = proj.last_minute_submissions(grades)
(out > 0).sum()

8

**Question 5**

Create a function `process_labs` that takes in a dataframe like `grades` and returns a dataframe of processed lab scores. The output should:
* share the same index as `grades`,
* have columns given by the lab assignment names (e.g. `lab01,...lab10`)
* have values representing the lab grades for each assignment, adjusted for Lateness and scaled to a score between 0 and 1.

In [29]:
cols = 'lab01 lab02 lab03'.split()
processed = pd.DataFrame([[0.2, 0.90, 1.0]], index=[0], columns=cols)
np.isclose(proj.lab_total(processed), 0.95).all()

True

In [36]:
fp = os.path.join('data', 'grades.csv')
grades = pd.read_csv(fp)
out = proj.simulate_pval(grades, 100)
out

0.04

**Question 6**

Create a function `lab_total` that takes in a dataframe of processed assignments (like the output of Question 5) and computes the total lab grade for each student according to the syllabus (returning a Series). Your answers should be proportions between 0 and 1. For example, if there are only 3 labs, and a student received scores of {80%,90%,100%}, then the total score would be 0.95.

*Note*: Don't forget to properly handle students who didn't turn in assignments! (Use your experience and common sense).
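
The drop-lowest arithmetic can be sketched as follows. The function name `sketch_lab_total` is hypothetical, and treating a missing lab as a zero is an assumption you should check against your own reading of the syllabus.

```python
import pandas as pd

def sketch_lab_total(processed):
    # Assumption: a lab that was never turned in counts as 0.
    filled = processed.fillna(0)
    n_labs = filled.shape[1]
    # Sum all labs, remove the lowest, and average over the remaining labs.
    return (filled.sum(axis=1) - filled.min(axis=1)) / (n_labs - 1)

processed = pd.DataFrame([[0.8, 0.9, 1.0],
                          [0.2, 0.9, 1.0]],
                         columns=['lab01', 'lab02', 'lab03'])
totals = sketch_lab_total(processed)
```

Both rows give (0.9 + 1.0) / 2 = 0.95, since the lowest lab is dropped regardless of how low it is.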

In [193]:
fp = os.path.join('data', 'grades.csv')
grades = pd.read_csv(fp)

In [205]:
lab_total = proj.total_points(grades)

In [206]:
lab_total

0      0.887323
1      0.804042
2      0.742045
3      0.900989
4      0.648927
         ...   
530    0.861339
531    0.749270
532    0.843537
533    0.851361
534    0.897919
Length: 535, dtype: float64

In [177]:
proj.process_labs(grades)

Unnamed: 0,lab01,lab02,lab03,lab04,lab05,lab06,lab07,lab08,lab09
0,0.99,0.86,0.576,0.9800,1.000000,0.976471,0.2425,0.88,0.86
1,0.98,0.52,0.730,0.7700,1.000000,0.250000,0.8900,0.94,0.86
2,0.86,0.45,0.400,0.7300,0.900000,0.214706,0.7200,0.71,0.76
3,1.00,1.00,0.920,0.9100,0.885714,0.670588,1.0000,0.95,0.78
4,0.66,0.33,0.690,0.6561,0.642857,0.741176,0.6000,0.36,1.00
...,...,...,...,...,...,...,...,...,...
530,0.81,0.82,1.000,0.7128,1.000000,0.173529,0.7600,0.91,0.98
531,1.00,0.86,0.800,0.5022,0.971429,0.705882,0.8500,0.84,0.56
532,0.87,0.90,1.000,0.9900,0.928571,0.764706,0.9500,0.87,1.00
533,0.84,0.83,0.880,0.9300,0.742857,0.229412,0.7700,0.95,0.82


In [101]:
result

Unnamed: 0,lab01,lab02,lab03,lab04,lab05,lab06,lab07,lab08,lab09
0,0.99,0.86,0.72,0.980,1.000000,0.976471,0.485,0.88,0.86
1,0.98,0.52,0.73,0.770,1.000000,0.500000,0.890,0.94,0.86
2,0.86,0.45,0.40,0.730,0.900000,0.429412,0.720,0.71,0.76
3,1.00,1.00,0.92,0.910,0.885714,0.670588,1.000,0.95,0.78
4,0.66,0.33,0.69,0.729,0.642857,0.741176,0.600,0.36,1.00
...,...,...,...,...,...,...,...,...,...
530,0.90,0.82,1.00,0.792,1.000000,0.347059,0.760,0.91,0.98
531,1.00,0.86,0.80,0.558,0.971429,0.705882,0.850,0.84,0.56
532,0.87,0.90,1.00,0.990,0.928571,0.764706,0.950,0.87,1.00
533,0.84,0.83,0.88,0.930,0.742857,0.458824,0.770,0.95,0.82


### Putting it together

**Question 7**

Finally, you need to create the final course grades. To do this, you will add up the total of each course component according to the weights given in the syllabus. 

* Create a function `total_points` that takes in `grades` and returns the final course grades according to the syllabus. Course grades should be proportions between zero and one.
* Create a function `final_grades` that takes in the final course grades as above and returns a Series of letter grades given by the standard cutoffs (`A >= .90`, `.90 > B >= .80`, `.80 > C >= .70`, `.70 > D >= .60`, `.60 > F`). You should not use rounding when determining the letter grades.
* Create a function `letter_proportions` which takes in the dataframe `grades` and outputs a Series that contains the proportion of the class that received each grade. (This question requires you to put everything together).
* The indices should be ordered by the proportion of the class that receives that grade, from largest to smallest.

*Note 1*: Don't repeat yourself when computing the checkpoint and discussion portions of the course.

*Note 2*: Only the lab portion of the course accounts for late assignments; you may assume all assignments in other portions are turned in without penalty.

*Note 3*: These values should add up to exactly 1.0. If you are getting something close such as 0.99999, that means there is a slight issue with your code from above. 

To check your work, verify the course grade distribution and relevant statistics! Do the work by hand for a few students.
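
One way to apply the cutoffs without rounding is left-closed binning, sketched below. The names `sketch_final_grades` and `sketch_letter_proportions` are hypothetical, and the second function takes letter grades directly rather than the full `grades` dataframe to keep the example self-contained.

```python
import numpy as np
import pandas as pd

def sketch_final_grades(total):
    # right=False makes each bin closed on the left, so 0.90 itself is an A.
    bins = [-np.inf, 0.6, 0.7, 0.8, 0.9, np.inf]
    letters = pd.cut(total, bins=bins, labels=['F', 'D', 'C', 'B', 'A'], right=False)
    return letters.astype(str)

def sketch_letter_proportions(letters):
    # value_counts sorts from most to least common by default.
    return letters.value_counts(normalize=True)

totals = pd.Series([0.95, 0.90, 0.85, 0.8999, 0.72, 0.30])
letters = sketch_final_grades(totals)
proportions = sketch_letter_proportions(letters)
```

Note that 0.8999 stays a B here, which is the behavior the "no rounding" requirement asks for.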

In [111]:
lab_total = proj.lab_total(proj.process_labs(grades))
proj_total = proj.projects_total(grades)
mid = proj.get_assignment_names(grades)['midterm']
mid_total = 0
for i in mid:
    j = i+' - Max Points'
    g = grades[i]
    t = grades[j]
    proportion = g.divide(t,fill_value=0)
    mid_total = proportion/len(mid)+mid_total
final = proj.get_assignment_names(grades)['final']
fin_total = 0
for i in final:
    j = i+' - Max Points'
    g = grades[i]
    t = grades[j]
    proportion = g.divide(t,fill_value=0)
    fin_total = proportion/len(final)+fin_total
check = proj.get_assignment_names(grades)['checkpoint']
check_total = 0
for i in check:
    j = i+' - Max Points'
    g = grades[i]
    t = grades[j]
    proportion = g.divide(t,fill_value=0)
    check_total = proportion/len(check)+check_total
disc = proj.get_assignment_names(grades)['disc']
disc_total = 0
for i in disc:
    j = i+' - Max Points'
    g = grades[i]
    t = grades[j]
    proportion = g.divide(t,fill_value=0)
    disc_total = proportion/len(disc)+disc_total
result = lab_total*0.2+proj_total*0.3+mid_total*0.15+fin_total*0.3+check_total*0.025+disc_total*0.025
result

0      0.873098
1      0.804042
2      0.741309
3      0.900989
4      0.639419
         ...   
530    0.839271
531    0.749270
532    0.843537
533    0.851361
534    0.897919
Length: 535, dtype: float64

In [118]:
lab_total = proj.lab_total(proj.process_labs(grades))
lab_total

0      0.824494
1      0.836250
2      0.691250
3      0.930714
4      0.628004
         ...   
530    0.743484
531    0.823414
532    0.938571
533    0.845357
534    0.856460
Length: 535, dtype: float64

In [30]:
pd.set_option('display.max_columns', None)
grades

Unnamed: 0,PID,College,Level,lab01,lab01 - Max Points,lab01 - Lateness (H:M:S),lab02,lab02 - Max Points,lab02 - Lateness (H:M:S),project01,project01 - Max Points,project01 - Lateness (H:M:S),lab03,lab03 - Max Points,lab03 - Lateness (H:M:S),project01_free_response,project01_free_response - Max Points,project01_free_response - Lateness (H:M:S),lab04,lab04 - Max Points,lab04 - Lateness (H:M:S),lab05,lab05 - Max Points,lab05 - Lateness (H:M:S),project02_checkpoint01,project02_checkpoint01 - Max Points,project02_checkpoint01 - Lateness (H:M:S),Midterm,Midterm - Max Points,Midterm - Lateness (H:M:S),lab06,lab06 - Max Points,lab06 - Lateness (H:M:S),project02_checkpoint02,project02_checkpoint02 - Max Points,project02_checkpoint02 - Lateness (H:M:S),lab07,lab07 - Max Points,lab07 - Lateness (H:M:S),project02,project02 - Max Points,project02 - Lateness (H:M:S),project02_free_response,project02_free_response - Max Points,project02_free_response - Lateness (H:M:S),lab08,lab08 - Max Points,lab08 - Lateness (H:M:S),lab09,lab09 - Max Points,lab09 - Lateness (H:M:S),project03_checkpoint01,project03_checkpoint01 - Max Points,project03_checkpoint01 - Lateness (H:M:S),project03,project03 - Max Points,project03 - Lateness (H:M:S),Final,Final - Max Points,Final - Lateness (H:M:S),Total Lateness (H:M:S),project05_free_response,project05_free_response - Max Points,project05_free_response - Lateness (H:M:S),project04,project04 - Max Points,project04 - Lateness (H:M:S),project05,project05 - Max Points,project05 - Lateness (H:M:S),discussion01,discussion01 - Max Points,discussion01 - Lateness (H:M:S),discussion02,discussion02 - Max Points,discussion02 - Lateness (H:M:S),discussion03,discussion03 - Max Points,discussion03 - Lateness (H:M:S),discussion04,discussion04 - Max Points,discussion04 - Lateness (H:M:S),discussion05,discussion05 - Max Points,discussion05 - Lateness (H:M:S),discussion06,discussion06 - Max Points,discussion06 - Lateness (H:M:S),discussion07,discussion07 - Max 
Points,discussion07 - Lateness (H:M:S),discussion08,discussion08 - Max Points,discussion08 - Lateness (H:M:S),discussion09,discussion09 - Max Points,discussion09 - Lateness (H:M:S),discussion10,discussion10 - Max Points,discussion10 - Lateness (H:M:S)
0,A14721419,SI,JR,9.900000e-17,100.0,00:00:00,8.600000e-17,100.0,00:00:00,75.0,85.0,00:00:00,1.207960e-17,100.0,252:56:22,15.0,15.0,00:00:00,9.800000e-17,100.0,00:00:00,1.734665e-15,70.0,00:00:00,10.0,10.0,00:00:00,47.0,47.0,00:00:00,3.583503e-16,85.0,00:00:00,9.0,10.0,00:00:00,1.894531e-19,100.0,382:51:44,75.0,75.0,00:00:00,18.0,25.0,00:00:00,8.800000e-17,100.0,00:00:00,2.201600e-14,50.0,00:00:00,0.0,10.0,00:00:00,86.0,100.0,00:00:00,71.0,87.0,00:00:00,780:01:28,21.0,25,00:00:00,66.0,75,00:00:00,72.0,75,00:00:00,10.0,10,00:00:00,10.0,10,780:01:28,10.0,10,00:00:00,10.0,10,00:00:00,10.0,10,00:00:00,10.0,10,00:00:00,10.0,10,00:00:00,10.0,10,00:00:00,10.0,10,780:01:28,10.0,10,00:00:00
1,A14883274,TH,JR,9.800000e-17,100.0,00:00:00,5.200000e-17,100.0,00:00:00,53.0,85.0,00:00:00,7.300000e-17,100.0,00:00:00,11.0,15.0,00:00:00,7.700000e-17,100.0,00:00:00,1.734665e-15,70.0,00:00:00,10.0,10.0,00:00:00,44.0,47.0,00:00:00,7.167680e-19,85.0,645:24:50,9.0,10.0,00:00:00,8.900000e-17,100.0,00:00:00,64.0,75.0,00:00:00,25.0,25.0,00:00:00,9.400000e-17,100.0,00:00:00,2.201600e-14,50.0,00:00:00,0.0,10.0,00:00:00,88.0,100.0,00:00:00,68.0,87.0,00:00:00,669:12:21,16.0,25,00:00:00,50.0,75,00:00:00,56.0,75,669:12:21,7.0,10,00:00:00,7.0,10,669:12:21,8.0,10,00:00:00,7.0,10,669:12:21,7.0,10,00:00:00,8.0,10,00:00:00,7.0,10,669:12:21,7.0,10,00:00:00,7.0,10,00:00:00,8.0,10,00:00:00
2,A14164800,SI,SR,8.600000e-17,100.0,00:00:00,4.500000e-17,100.0,00:00:00,44.0,85.0,00:00:00,4.000000e-17,100.0,00:00:00,14.0,15.0,00:00:00,7.300000e-17,100.0,00:00:00,1.561199e-15,70.0,00:00:00,5.0,10.0,00:00:00,37.0,47.0,00:00:00,6.155773e-19,85.0,764:40:45,7.0,10.0,00:00:00,7.200000e-17,100.0,00:04:51,63.0,75.0,00:00:00,25.0,25.0,00:00:00,7.100000e-17,100.0,00:00:00,1.945600e-14,50.0,00:00:00,6.0,10.0,00:00:00,75.0,100.0,00:00:00,73.0,87.0,00:00:00,828:47:53,14.0,25,00:00:00,41.0,75,764:40:45,47.0,75,00:00:00,6.0,10,00:00:00,7.0,10,00:00:00,6.0,10,00:00:00,6.0,10,00:00:00,7.0,10,00:00:00,7.0,10,00:00:00,7.0,10,00:00:00,6.0,10,00:04:51,6.0,10,00:00:00,7.0,10,00:00:00
3,A14847419,TH,JR,1.000000e-16,100.0,00:00:00,1.000000e-16,100.0,00:00:00,78.0,85.0,00:00:00,9.200000e-17,100.0,00:00:00,15.0,15.0,00:00:00,9.100000e-17,100.0,00:00:00,1.536418e-15,70.0,00:00:00,4.0,10.0,00:00:00,44.0,47.0,00:00:00,2.460960e-16,85.0,00:00:00,2.0,10.0,00:00:00,1.000000e-16,100.0,00:00:00,69.0,75.0,00:00:00,25.0,25.0,00:00:00,9.500000e-17,100.0,00:00:00,1.996800e-14,50.0,00:00:00,0.0,10.0,00:00:00,94.0,100.0,00:00:00,75.0,87.0,00:00:00,120:01:11,23.0,25,00:00:00,73.0,75,00:00:00,75.0,75,00:00:00,10.0,10,00:00:00,10.0,10,00:00:00,10.0,10,00:00:00,10.0,10,00:00:00,10.0,10,00:00:00,10.0,10,00:00:00,10.0,10,00:00:00,10.0,10,00:00:00,10.0,10,00:00:00,10.0,10,00:00:00
4,A14162943,SI,JR,6.600000e-17,100.0,00:00:00,3.300000e-17,100.0,00:00:00,42.0,85.0,00:00:00,6.900000e-17,100.0,00:00:00,13.0,15.0,00:00:00,3.138106e-17,100.0,47:42:33,1.115142e-15,70.0,00:00:00,0.0,10.0,00:00:00,18.0,47.0,00:00:00,2.720008e-16,85.0,00:00:00,2.0,10.0,00:00:00,6.000000e-17,100.0,00:00:00,71.0,75.0,00:00:00,24.0,25.0,00:00:00,3.600000e-17,100.0,00:00:00,2.560000e-14,50.0,00:00:00,0.0,10.0,00:00:00,90.0,100.0,00:00:00,65.0,87.0,00:00:00,93:16:10,13.0,25,00:00:00,43.0,75,00:00:00,49.0,75,00:00:00,6.0,10,00:00:00,6.0,10,00:00:00,6.0,10,00:00:00,6.0,10,00:00:00,6.0,10,00:00:00,6.0,10,00:00:00,6.0,10,00:00:00,5.0,10,00:00:00,5.0,10,00:00:00,6.0,10,00:00:00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
530,A14490387,SI,JR,3.874205e-17,100.0,47:26:10,8.200000e-17,100.0,00:00:00,78.0,85.0,00:00:00,1.000000e-16,100.0,00:00:00,11.0,15.0,00:00:00,3.409300e-17,100.0,12:08:58,1.734665e-15,70.0,00:00:00,10.0,10.0,00:00:00,41.0,47.0,00:00:00,4.975213e-19,85.0,431:48:42,10.0,10.0,00:00:00,7.600000e-17,100.0,00:00:00,66.0,75.0,00:00:00,25.0,25.0,00:00:00,9.100000e-17,100.0,00:00:00,2.508800e-14,50.0,00:00:13,1.0,10.0,00:00:00,99.0,100.0,00:00:00,65.0,87.0,00:00:00,491:24:29,22.0,25,431:48:42,74.0,75,00:00:00,75.0,75,00:00:00,10.0,10,00:00:00,10.0,10,00:00:00,10.0,10,00:00:00,10.0,10,12:08:58,10.0,10,00:00:00,10.0,10,00:00:00,10.0,10,00:00:00,10.0,10,12:08:58,10.0,10,00:00:00,10.0,10,00:00:00
531,A14088257,SI,SO,1.000000e-16,100.0,00:00:00,8.600000e-17,100.0,00:00:00,72.0,85.0,00:00:00,8.000000e-17,100.0,00:00:00,3.0,15.0,00:00:00,2.402007e-17,100.0,47:03:14,1.685103e-15,70.0,00:00:00,7.0,10.0,00:00:00,24.0,47.0,00:00:00,2.590484e-16,85.0,00:00:00,5.0,10.0,00:00:00,8.500000e-17,100.0,00:00:00,69.0,75.0,00:00:00,22.0,25.0,00:00:00,8.400000e-17,100.0,00:00:00,1.433600e-14,50.0,00:00:00,3.0,10.0,00:00:00,75.0,100.0,00:00:00,63.0,87.0,00:00:00,47:03:14,20.0,25,00:00:00,67.0,75,00:00:00,73.0,75,00:00:00,10.0,10,00:00:00,10.0,10,00:00:00,10.0,10,00:00:00,9.0,10,00:00:00,10.0,10,00:00:00,9.0,10,00:00:00,10.0,10,00:00:00,9.0,10,00:00:00,10.0,10,00:00:00,10.0,10,00:00:00
532,A14847419,WA,JR,8.700000e-17,100.0,00:00:00,9.000000e-17,100.0,00:00:00,66.0,85.0,00:00:00,1.000000e-16,100.0,00:00:00,15.0,15.0,00:00:00,9.900000e-17,100.0,00:00:00,1.610761e-15,70.0,00:00:00,10.0,10.0,00:00:00,40.0,47.0,00:00:00,2.806358e-16,85.0,00:00:00,6.0,10.0,00:00:00,9.500000e-17,100.0,00:00:00,74.0,75.0,00:00:00,20.0,25.0,00:00:00,8.700000e-17,100.0,00:00:00,2.560000e-14,50.0,00:00:00,0.0,10.0,00:00:00,88.0,100.0,00:00:00,70.0,87.0,00:00:00,120:01:11,19.0,25,00:00:00,56.0,75,00:00:00,62.0,75,00:00:00,9.0,10,00:00:00,9.0,10,00:00:00,9.0,10,00:00:00,9.0,10,00:00:00,9.0,10,00:00:00,9.0,10,00:00:00,8.0,10,00:00:00,9.0,10,00:00:00,9.0,10,00:00:00,9.0,10,00:00:00
533,A14513929,TH,SR,8.400000e-17,100.0,00:00:00,8.300000e-17,100.0,00:00:00,62.0,85.0,00:00:00,8.800000e-17,100.0,00:00:00,10.0,15.0,00:00:00,9.300000e-17,100.0,00:00:00,1.288608e-15,70.0,00:00:00,10.0,10.0,00:00:00,47.0,47.0,00:00:00,6.577401e-19,85.0,419:06:41,10.0,10.0,00:00:00,7.700000e-17,100.0,00:00:00,75.0,75.0,00:00:00,17.0,25.0,00:00:00,9.500000e-17,100.0,00:00:00,2.099200e-14,50.0,00:00:00,,10.0,00:00:00,87.0,100.0,00:00:00,74.0,87.0,00:00:00,419:06:41,18.0,25,00:00:00,53.0,75,419:06:41,59.0,75,00:00:00,9.0,10,419:06:41,8.0,10,00:00:00,9.0,10,419:06:41,8.0,10,00:00:00,8.0,10,00:00:00,9.0,10,419:06:41,8.0,10,00:00:00,9.0,10,00:00:00,9.0,10,419:06:41,8.0,10,00:00:00


### Do Sophomores get better grades?

**Question 8**

You notice that students who are sophomores on average did better in the class (if you can't verify this, you should go back and check your work!). Is this difference significant, or just due to noise?

Perform a hypothesis test, assessing the likelihood of the null hypothesis: 
> "sophomores earn grades that are roughly equal on average to the rest of the class."


Create a function `simulate_pval` which takes in `grades` and the number of simulations `N` and returns the likelihood that the grade of sophomores was no better on average than the class as a whole (i.e. calculate the p-value).

*Note:* To check your work, plot the sampling distribution and the observation. Do these values look reasonable?
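
A minimal sketch of one simulation-based test, under stated assumptions: the function `sketch_simulate_pval` is hypothetical and takes precomputed course totals plus a boolean sophomore mask, rather than the `grades` dataframe the graded function receives. It draws random groups of the same size as the sophomore group from the whole class and counts how often their mean is at least the observed sophomore mean.

```python
import numpy as np
import pandas as pd

def sketch_simulate_pval(totals, is_soph, N, seed=None):
    rng = np.random.default_rng(seed)
    scores = totals.to_numpy()
    mask = is_soph.to_numpy()
    n_soph = int(mask.sum())
    observed = scores[mask].mean()
    # Under the null, sophomores look like a random group drawn from the class.
    sim_means = np.array([
        rng.choice(scores, size=n_soph, replace=False).mean()
        for _ in range(N)
    ])
    # Proportion of simulated means at least as large as the observed mean.
    return float((sim_means >= observed).mean())

# Toy data (made up for illustration): two "sophomores" with the top scores.
totals = pd.Series([0.90, 0.95, 0.50, 0.60, 0.55, 0.65])
is_soph = pd.Series([True, True, False, False, False, False])
pval = sketch_simulate_pval(totals, is_soph, N=200, seed=0)
```

Plotting `sim_means` as a histogram with a vertical line at `observed` is the check suggested in the note above.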

### What is the true distribution of grades?

The gradebook for this class only reflects one particular instance of each student's performance, subject to the effects of all the little events and hiccups that occurred throughout the quarter. Might you have done better on the midterm had your roommate kept you up all night with their coughing? Wasn't it lucky that the example you were studying just before the final happened to appear on the exam?

**Question 9**

This question will simulate these '(un)lucky, random events' by adding or subtracting random amounts to each assignment before calculating the final grades. These 'random amounts' will be drawn from a Gaussian distribution with mean 0 and standard deviation 0.02:
```
np.random.normal(0, 0.02, size=(num_rows, num_cols))
```
Intuitively, such a model says that random events may bump up or down a given grade (given as a proportion):
- which on average has no effect on the class as a whole (mean 0),
- which not uncommonly might perturb a grade by 2% (std dev 0.02).

Create a function `total_points_with_noise` that takes in a dataframe like `grades`, adds noise to the assignments as described above, and returns the final scores using *the same procedure* as questions 1-7.

*Note:* You should be able to reuse (or slightly modify) the code from previous problems. Try to be DRY (don't repeat yourself)!

*Note 1:* Once adding the noise to the assignment scores, use the `np.clip` function to be sure each assignment retains a score between 0% and 100%.

*Note 2:* To check your work -- what would you expect the difference between the actual scores and noisy scores to be, on average?
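
The noise-and-clip step can be sketched on a small table of proportions. The helper `add_noise` is hypothetical, and it uses `np.random.default_rng` with a seed for reproducibility instead of the `np.random.normal` call shown above; the distribution is the same.

```python
import numpy as np
import pandas as pd

def add_noise(proportions, std=0.02, seed=None):
    rng = np.random.default_rng(seed)
    # One independent draw per student and assignment.
    noise = rng.normal(0, std, size=proportions.shape)
    # Clip so every score stays between 0% and 100%.
    noisy = np.clip(proportions.to_numpy() + noise, 0, 1)
    return pd.DataFrame(noisy, index=proportions.index, columns=proportions.columns)

# Toy proportions, including scores at both boundaries.
proportions = pd.DataFrame({'lab01': [1.0, 0.0, 0.5],
                            'lab02': [0.9, 0.2, 0.7]})
noisy = add_noise(proportions, seed=0)
mean_shift = (noisy - proportions).to_numpy().mean()
```

Away from the boundaries the noise has mean 0, so `mean_shift` should be close to zero on a large gradebook; clipping pulls scores near 0 or 1 slightly toward the interior.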

In [152]:
ran = np.random.normal(0, 0.02, size=(num_rows, num_cols))

In [207]:
num_cols = 0
num_rows = len(grades.index)
for x in di:
    num_cols += len(di[x])

In [208]:
df1

Unnamed: 0,lab01,lab02,lab03,lab04,lab05,lab06,lab07,lab08,lab09,project01,project02,project03,project04,project05,Midterm,Final,project02_checkpoint01,project02_checkpoint02,project03_checkpoint01,discussion01,discussion02,discussion03,discussion04,discussion05,discussion06,discussion07,discussion08,discussion09,discussion10,lab01 - Lateness (H:M:S),lab02 - Lateness (H:M:S),lab03 - Lateness (H:M:S),lab04 - Lateness (H:M:S),lab05 - Lateness (H:M:S),lab06 - Lateness (H:M:S),lab07 - Lateness (H:M:S),lab08 - Lateness (H:M:S),lab09 - Lateness (H:M:S)
0,99.0,86.0,72.0,98.0,70.0,83.0,48.5,88.0,43.0,75.0,75.0,86.0,66.0,72.0,47.0,71.0,10.0,9.0,0.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,00:00:00,00:00:00,252:56:22,00:00:00,00:00:00,00:00:00,382:51:44,00:00:00,00:00:00
1,98.0,52.0,73.0,77.0,70.0,42.5,89.0,94.0,43.0,53.0,64.0,88.0,50.0,56.0,44.0,68.0,10.0,9.0,0.0,7.0,7.0,8.0,7.0,7.0,8.0,7.0,7.0,7.0,8.0,00:00:00,00:00:00,00:00:00,00:00:00,00:00:00,645:24:50,00:00:00,00:00:00,00:00:00
2,86.0,45.0,40.0,73.0,63.0,36.5,72.0,71.0,38.0,44.0,63.0,75.0,41.0,47.0,37.0,73.0,5.0,7.0,6.0,6.0,7.0,6.0,6.0,7.0,7.0,7.0,6.0,6.0,7.0,00:00:00,00:00:00,00:00:00,00:00:00,00:00:00,764:40:45,00:04:51,00:00:00,00:00:00
3,100.0,100.0,92.0,91.0,62.0,57.0,100.0,95.0,39.0,78.0,69.0,94.0,73.0,75.0,44.0,75.0,4.0,2.0,0.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,00:00:00,00:00:00,00:00:00,00:00:00,00:00:00,00:00:00,00:00:00,00:00:00,00:00:00
4,66.0,33.0,69.0,72.9,45.0,63.0,60.0,36.0,50.0,42.0,71.0,90.0,43.0,49.0,18.0,65.0,0.0,2.0,0.0,6.0,6.0,6.0,6.0,6.0,6.0,6.0,5.0,5.0,6.0,00:00:00,00:00:00,00:00:00,47:42:33,00:00:00,00:00:00,00:00:00,00:00:00,00:00:00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
530,90.0,82.0,100.0,79.2,70.0,29.5,76.0,91.0,49.0,78.0,66.0,99.0,74.0,75.0,41.0,65.0,10.0,10.0,1.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,47:26:10,00:00:00,00:00:00,12:08:58,00:00:00,431:48:42,00:00:00,00:00:00,00:00:13
531,100.0,86.0,80.0,55.8,68.0,60.0,85.0,84.0,28.0,72.0,69.0,75.0,67.0,73.0,24.0,63.0,7.0,5.0,3.0,10.0,10.0,10.0,9.0,10.0,9.0,10.0,9.0,10.0,10.0,00:00:00,00:00:00,00:00:00,47:03:14,00:00:00,00:00:00,00:00:00,00:00:00,00:00:00
532,87.0,90.0,100.0,99.0,65.0,65.0,95.0,87.0,50.0,66.0,74.0,88.0,56.0,62.0,40.0,70.0,10.0,6.0,0.0,9.0,9.0,9.0,9.0,9.0,9.0,8.0,9.0,9.0,9.0,00:00:00,00:00:00,00:00:00,00:00:00,00:00:00,00:00:00,00:00:00,00:00:00,00:00:00
533,84.0,83.0,88.0,93.0,52.0,39.0,77.0,95.0,41.0,62.0,75.0,87.0,53.0,59.0,47.0,74.0,10.0,10.0,,9.0,8.0,9.0,8.0,8.0,9.0,8.0,9.0,9.0,8.0,00:00:00,00:00:00,00:00:00,00:00:00,00:00:00,419:06:41,00:00:00,00:00:00,00:00:00


In [300]:
total_with_noise = proj.total_points_with_noise(grades)
total = proj.total_points(grades)

### Short-answer questions (hard-coded)

Use your functions from above to understand the data and answer the following questions. The function below should return **hard-coded values**. It should not compute anything!

**Question 10**

Create a function `short_answer` that takes no arguments and returns (hard-coded) answers to the following questions in a list:
0. For the class on average, what is the difference between students' scores (`total_points`) and their scores with noise (`total_points_with_noise`)? (Remark: plot the distribution of differences; does this align with what you know about binomial distributions?)
1. What percentage of the class sees their grade change by strictly less than 0.01 in either direction (i.e., the change lies strictly between $-0.01$ and $+0.01$)?
2. What is the 95% confidence interval for the statistic above? (see [DSC10](https://www.inferentialthinking.com/chapters/13/3/Confidence_Intervals.html) and use `np.percentile`)
3. What proportion of the class sees a change in their letter grade?
4. The assumption behind the model in Question 9 is that:
    - The (observed) gradebook well represents the true population of students,
    - The noisy scores represent other possible observations drawn from the true population of students.
    - Answer `True` or `False`
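The percentile-based interval from items 1 and 2 can be sketched as follows. This is only an illustration under stated assumptions: the synthetic `diff` stands in for the difference between noisy and actual total scores, and its spread (0.01), the class size, and the number of simulations are made up.

```python
import numpy as np

rng = np.random.default_rng(7)  # seeded only for reproducibility
n_students, n_sims = 500, 200   # hypothetical sizes

proportions = []
for _ in range(n_sims):
    # Stand-in for (noisy total - actual total); the 0.01 spread is made up.
    diff = rng.normal(0, 0.01, size=n_students)
    # Proportion of students whose change is strictly inside (-0.01, 0.01).
    proportions.append((np.abs(diff) < 0.01).mean())

# A 95% interval takes the 2.5th and 97.5th percentiles of the simulated statistic.
lower, upper = np.percentile(proportions, [2.5, 97.5])
print(lower, upper)
```

Note that the 5th and 95th percentiles would instead bound the middle 90% of the simulated values.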

In [341]:
fp = os.path.join('data', 'grades.csv')
grades = pd.read_csv(fp)

In [369]:

abcdf_grade = proj.final_grades(proj.total_points(grades))
out = abcdf_grade.value_counts()/len(abcdf_grade)
out = out.round(decimals=5)
out.sum()

1.0

In [7]:
total_with_noise = proj.total_points_with_noise(grades)
total = proj.total_points(grades)
total_with_noise-total

0      0.013028
1     -0.008557
2     -0.033905
3      0.009326
4     -0.039318
         ...   
530    0.014095
531   -0.003001
532    0.031540
533   -0.013717
534   -0.015932
Length: 535, dtype: float64

In [339]:
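ls = []
# The noise-free totals are deterministic, so compute them once.
total = proj.total_points(grades)
for i in range(1000):
    total_with_noise = proj.total_points_with_noise(grades)
    change = total - total_with_noise
    # Proportion of students whose total changes by strictly less than 0.01.
    result = change.abs() < 0.01
    ls.append(result.mean())
# Note: the 5th and 95th percentiles bound the middle 90% of the simulations;
# a 95% interval would use the 2.5th and 97.5th percentiles instead.
[np.percentile(ls, 5), np.percentile(ls, 95)]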

[0.6953271028037383, 0.7551401869158878]

In [338]:
[np.percentile(ls, 5),np.percentile(ls, 95)]

[0.6951401869158879, 0.7551401869158878]

# Congratulations, you finished the project!

### Before you submit:
* Be sure you run the doctests on all your code in project01.py

### To submit:
* **Upload the .py file to gradescope**