# Welcome to Lab: Simpson's Paradox 🔬

Before this lab section, you learned about Simpson's Paradox and confounding variables in lecture.  This week, you will find Simpson's Paradox through analysis of a dataset in Python, and get some practice writing conditionals for pandas DataFrames! 

A few tips to remember:

- Refer to your **lecture notebook** and the **pandas cheat sheet** to help you out with the code!
- If you find yourself stuck for more than a few minutes, ask a neighbor or course staff for help!  When you are giving help to your neighbor, explain the **idea and approach** to the problem without sharing the answer itself so they can have the same **<i>ah-hah</i>** moment!
- We are here to help you!  Don't feel embarrassed or shy to ask us for help!

Let's get started!

In [64]:
# Meet your CAs and TA if you haven't already!
# ...first name is enough, we'll know who they are! :)
ta_name = "Ramya"
ca1_name = "Eliana"
ca2_name = "Vikram"


# Say hello to each other!
# - Groups of 3 are ideal :)
# - However, groups of 2 or 4 are fine too!
#
# QOTD to Ask Your Group: "Orange or Blue?"
partner1_name = ""
partner1_netid = ""
partner1_orange_or_blue = ""

partner2_name = ""
partner2_netid = ""
partner2_orange_or_blue = ""

partner3_name = ""
partner3_netid = ""
partner3_orange_or_blue = ""

<hr style="color: #DD3403;">

# Part 1: The GPA Dataset

Many of you have likely come across one of the GPA visualizations found at https://waf.cs.illinois.edu/discovery/gpa/, potentially out of curiosity or the need to investigate a mysterious GenEd class you signed up for.

Regardless, these visualizations are built on the **GPA Dataset** of UIUC students across all course subjects. Today, you're going to do some analysis and discover a case of Simpson's Paradox within data taken from UIUC courses!

## Load the GPA Dataset

The most recent version of the "GPA Dataset" is available here:
```
https://raw.githubusercontent.com/wadefagen/datasets/master/gpa/uiuc-gpa-dataset.csv
```

Use Python to load this dataset into a DataFrame called `df`:

In [9]:
import pandas as pd
df = pd.read_csv("https://raw.githubusercontent.com/wadefagen/datasets/master/gpa/uiuc-gpa-dataset.csv")
df.drop('W', inplace=True, axis=1) #Drop W grade column
df

Unnamed: 0,Year,Term,YearTerm,Subject,Number,Course Title,Sched Type,A+,A,A-,...,B,B-,C+,C,C-,D+,D,D-,F,Primary Instructor
0,2021,Fall,2021-fa,AAS,100,Intro Asian American Studies,DIS,2,14,2,...,5,3,0,1,0,0,0,0,0,"Zheng, Reanne"
1,2021,Fall,2021-fa,AAS,100,Intro Asian American Studies,DIS,0,15,0,...,5,1,0,2,0,0,0,0,1,"Atienza, Paul Michael L"
2,2021,Fall,2021-fa,AAS,100,Intro Asian American Studies,OD,7,4,1,...,7,0,2,3,0,0,1,0,1,"Wang, Yu"
3,2021,Fall,2021-fa,AAS,100,Intro Asian American Studies,DIS,1,18,0,...,4,1,0,0,0,0,0,0,0,"Zheng, Reanne"
4,2021,Fall,2021-fa,AAS,100,Intro Asian American Studies,DIS,0,16,1,...,5,1,0,2,0,0,0,0,0,"Atienza, Paul Michael L"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
61552,2010,Summer,2010-su,STAT,410,Statistics and Probability II,LEC,5,10,2,...,5,1,0,1,3,0,0,0,2,"Stepanov, Alexei G"
61553,2010,Summer,2010-su,STAT,440,Statistical Data Management,LEC,4,12,8,...,3,0,0,0,0,0,0,0,0,"Unger, David"
61554,2010,Summer,2010-su,TAM,212,Introductory Dynamics,LEC,0,1,3,...,5,7,5,1,1,0,2,0,1,"Morgan, William T"
61555,2010,Summer,2010-su,TAM,251,Introductory Solid Mechanics,LCD,1,2,2,...,5,0,3,3,2,0,0,1,1,"Ott-Monsivais, Stephanie"


### 🔬 Test Case Checkpoint 🔬

In [10]:
## == CHECKPOINT TEST CASES ==
# - This read-only cell contains test cases for your previous cell.
# - If this cell runs without any errors, you PASSED all test cases!
# - If this cell results in any errors, check your previous cell, make changes, and RE-RUN your code and then this cell.
assert(len(df) == 61557 ), "This is not the GPA dataset you're looking for"

## == SUCCESS MESSAGE ==
# You will only see this message (with the emoji showing) if you passed all test cases:
tada = "\N{PARTY POPPER}"
print(f"{tada} All tests passed! {tada}")

🎉 All tests passed! 🎉


## Data Cleaning

The GPA dataset contains only raw GPA data and is not the easiest to work with if we want to analyze intricacies such as the average GPA or passing rate.  Luckily, DataFrames are modifiable, so we can add more columns based on what questions we want to answer.

The process of modifying a dataset via deletion (cleaning up empty/unwanted values) or addition (adding new columns) is often called **data cleaning**. This is an important concept in Data Science, because you won't always receive your data in the perfect format for your purposes.

With the GPA dataset, we want to investigate one innocent question posed by a theoretical incoming student: 
- *Is it easier to get an A in STAT or CS courses at UIUC?*

To do this, we'll need to first perform some modifications on our loaded DataFrame, `df`. Specifically, we will need to create two new columns:
- A `Total Students` column, the total number of students in every course
- An `A_Grade` column, the total number of students receiving an A+, A, or A- in every course
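Both columns are row-wise sums: one value per course row. As a minimal sketch on a small, hypothetical DataFrame (the name `toy` and its values are made up for illustration, and the real dataset has more grade columns), summing across columns with `axis=1` could look like this:

```python
import pandas as pd

# Hypothetical toy data: three course sections with only a few grade columns.
toy = pd.DataFrame({
    "Subject": ["CS", "CS", "STAT"],
    "A+": [2, 0, 5],
    "A": [14, 15, 10],
    "A-": [2, 0, 2],
    "B": [5, 5, 5],
    "F": [0, 1, 2],
})

a_cols = ["A+", "A", "A-"]
grade_cols = a_cols + ["B", "F"]  # assumption: the full list would include every letter grade

# axis=1 sums across columns, producing one value per row (per course section).
toy["Total Students"] = toy[grade_cols].sum(axis=1)
toy["A_Grade"] = toy[a_cols].sum(axis=1)
print(toy)
```

Note that a grouped sum (for example `groupby("Subject")`) has one row per subject rather than one row per course, so assigning it back to `df` aligns on the index and leaves most rows as `NaN`.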

### Puzzle 1.1: Creating a `Total Students` Column

Create the new column `Total Students` that stores the total number of students in every course.

- The `Total Students` column should include every grade **except** `W`.  W means the student withdrew from the course. We don't want to include withdrawals in our analysis.

In [11]:
group_subject = df.groupby("Subject").agg("sum").reset_index()
df['Total Students'] = group_subject["A+"] + group_subject["A"] + group_subject["A-"] + group_subject["B+"] + group_subject["B"] + group_subject["B-"] + group_subject["C+"] + group_subject["C"] + group_subject["C-"] + group_subject["D+"] + group_subject["D"] + group_subject["D-"] + group_subject["F"]
df

Unnamed: 0,Year,Term,YearTerm,Subject,Number,Course Title,Sched Type,A+,A,A-,...,B-,C+,C,C-,D+,D,D-,F,Primary Instructor,Total Students
0,2021,Fall,2021-fa,AAS,100,Intro Asian American Studies,DIS,2,14,2,...,3,0,1,0,0,0,0,0,"Zheng, Reanne",6521.0
1,2021,Fall,2021-fa,AAS,100,Intro Asian American Studies,DIS,0,15,0,...,1,0,2,0,0,0,0,1,"Atienza, Paul Michael L",6617.0
2,2021,Fall,2021-fa,AAS,100,Intro Asian American Studies,OD,7,4,1,...,0,2,3,0,0,1,0,1,"Wang, Yu",107661.0
3,2021,Fall,2021-fa,AAS,100,Intro Asian American Studies,DIS,1,18,0,...,1,0,0,0,0,0,0,0,"Zheng, Reanne",51249.0
4,2021,Fall,2021-fa,AAS,100,Intro Asian American Studies,DIS,0,16,1,...,1,0,2,0,0,0,0,0,"Atienza, Paul Michael L",4460.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
61552,2010,Summer,2010-su,STAT,410,Statistics and Probability II,LEC,5,10,2,...,1,0,1,3,0,0,0,2,"Stepanov, Alexei G",
61553,2010,Summer,2010-su,STAT,440,Statistical Data Management,LEC,4,12,8,...,0,0,0,0,0,0,0,0,"Unger, David",
61554,2010,Summer,2010-su,TAM,212,Introductory Dynamics,LEC,0,1,3,...,7,5,1,1,0,2,0,1,"Morgan, William T",
61555,2010,Summer,2010-su,TAM,251,Introductory Solid Mechanics,LCD,1,2,2,...,0,3,3,2,0,0,1,1,"Ott-Monsivais, Stephanie",


**Note:** Our DataFrame is so large that you may have to scroll to the right to see the new column, `Total Students`.

### Puzzle 1.2: Creating an `A_Grade` Column

Create the new column `A_Grade` that stores the total number of "A"s given in every course.

- We consider an "A" to be any type of A, in other words "A+", "A", or an "A-" are all included.

In [12]:
df["A_Grade"] = group_subject["A+"] + group_subject["A"] + group_subject["A-"]
df

Unnamed: 0,Year,Term,YearTerm,Subject,Number,Course Title,Sched Type,A+,A,A-,...,C+,C,C-,D+,D,D-,F,Primary Instructor,Total Students,A_Grade
0,2021,Fall,2021-fa,AAS,100,Intro Asian American Studies,DIS,2,14,2,...,0,1,0,0,0,0,0,"Zheng, Reanne",6521.0,4438.0
1,2021,Fall,2021-fa,AAS,100,Intro Asian American Studies,DIS,0,15,0,...,0,2,0,0,0,0,1,"Atienza, Paul Michael L",6617.0,4109.0
2,2021,Fall,2021-fa,AAS,100,Intro Asian American Studies,OD,7,4,1,...,2,3,0,0,1,0,1,"Wang, Yu",107661.0,57797.0
3,2021,Fall,2021-fa,AAS,100,Intro Asian American Studies,DIS,1,18,0,...,0,0,0,0,0,0,0,"Zheng, Reanne",51249.0,27626.0
4,2021,Fall,2021-fa,AAS,100,Intro Asian American Studies,DIS,0,16,1,...,0,2,0,0,0,0,0,"Atienza, Paul Michael L",4460.0,3548.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
61552,2010,Summer,2010-su,STAT,410,Statistics and Probability II,LEC,5,10,2,...,0,1,3,0,0,0,2,"Stepanov, Alexei G",,
61553,2010,Summer,2010-su,STAT,440,Statistical Data Management,LEC,4,12,8,...,0,0,0,0,0,0,0,"Unger, David",,
61554,2010,Summer,2010-su,TAM,212,Introductory Dynamics,LEC,0,1,3,...,5,1,1,0,2,0,1,"Morgan, William T",,
61555,2010,Summer,2010-su,TAM,251,Introductory Solid Mechanics,LCD,1,2,2,...,3,3,2,0,0,1,1,"Ott-Monsivais, Stephanie",,


### 🔬 Test Case Checkpoint 🔬

In [13]:
## == CHECKPOINT TEST CASES ==
# - This read-only cell contains test cases for your previous cell.
# - If this cell runs without any errors, you PASSED all test cases!
# - If this cell results in any errors, check your previous cell, make changes, and RE-RUN your code and then this cell.
assert('A_Grade' in df), "Make sure you've named the A_Grade column properly and added it to the dataframe"
assert(df['A_Grade'].sum() == 1983288), "Double check the values of your A_Grade column, and make sure you are calling .sum() on the correct list (A_Grade)"

## == SUCCESS MESSAGE ==
# You will only see this message (with the emoji showing) if you passed all test cases:
tada = "\N{PARTY POPPER}"
print(f"{tada} All tests passed! {tada}")

🎉 All tests passed! 🎉


# Part 2: Working with the GPA Dataset
Now that we have our columns set up, we can perform basic mathematical analysis on the DataFrame to gain insight into our question.

Should be simple, right? Let's just see which subject, STAT or CS, has a greater percentage of A grades.

### Puzzle 2.1: Subject DataFrames

Select only the rows of the GPA dataset `df` with a `Subject` of `STAT`. Assign these rows to a new DataFrame, `df_STAT`. 
- Make sure your result only contains STAT courses!
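Row selection of this kind relies on a boolean mask: a comparison on a column produces a Series of `True`/`False` values, and indexing the DataFrame with that Series keeps only the `True` rows. A minimal sketch on hypothetical data (the `toy` DataFrame is an assumption for illustration):

```python
import pandas as pd

# Hypothetical toy data standing in for the GPA dataset.
toy = pd.DataFrame({
    "Subject": ["CS", "STAT", "CS", "MATH"],
    "Number": [101, 100, 225, 221],
})

# The comparison yields a boolean Series with one entry per row.
mask = toy["Subject"] == "STAT"

# Indexing with the mask keeps only the rows where the mask is True.
toy_stat = toy[mask]
print(toy_stat)
```

The original row labels are preserved in the result, which is why selected rows keep their index values from the full DataFrame.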

In [26]:
df_STAT = df[df.Subject == "STAT"]
df_STAT

Unnamed: 0,Year,Term,YearTerm,Subject,Number,Course Title,Sched Type,A+,A,A-,...,C+,C,C-,D+,D,D-,F,Primary Instructor,Total Students,A_Grade
2576,2021,Fall,2021-fa,STAT,100,Statistics,LCD,150,78,49,...,12,18,10,5,5,3,18,"Flanagan, Karle A",,
2577,2021,Fall,2021-fa,STAT,100,Statistics,ONL,208,151,81,...,41,42,24,17,25,5,45,"Flanagan, Karle A",,
2578,2021,Fall,2021-fa,STAT,107,Data Science Discovery,OLC,127,61,33,...,2,5,4,2,2,0,12,"Flanagan, Karle A",,
2579,2021,Fall,2021-fa,STAT,200,Statistical Analysis,ONL,75,135,34,...,12,9,3,0,0,1,3,"Fireman, Ellen S",,
2580,2021,Fall,2021-fa,STAT,207,Data Science Exploration,LEC,17,3,2,...,0,1,0,0,0,1,3,"Ellison, Victoria M",,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
61549,2010,Summer,2010-su,STAT,100,Statistics,LCD,1,13,3,...,1,2,2,3,1,1,1,"Hirtz, Nathaniel R",,
61550,2010,Summer,2010-su,STAT,100,Statistics,LCD,0,14,3,...,0,3,0,0,1,0,3,"Dalpiaz, David M",,
61551,2010,Summer,2010-su,STAT,400,Statistics and Probability I,LEC,4,15,7,...,1,2,2,0,1,0,3,"Monrad, Ditlev",,
61552,2010,Summer,2010-su,STAT,410,Statistics and Probability II,LEC,5,10,2,...,0,1,3,0,0,0,2,"Stepanov, Alexei G",,


Select only the rows of the GPA dataset `df` with a `Subject` of `CS`. Assign these rows to a new dataframe, `df_CS`. 
- Make sure your result only contains CS courses!

In [27]:
df_CS = df[df.Subject == "CS"]
df_CS

Unnamed: 0,Year,Term,YearTerm,Subject,Number,Course Title,Sched Type,A+,A,A-,...,C+,C,C-,D+,D,D-,F,Primary Instructor,Total Students,A_Grade
800,2021,Fall,2021-fa,CS,100,Freshman Orientation,LEC,0,246,7,...,0,0,0,0,0,0,2,"Gunter, Elsa",,
801,2021,Fall,2021-fa,CS,100,Freshman Orientation,OLC,0,223,6,...,0,1,0,0,2,0,6,"Gunter, Elsa",,
802,2021,Fall,2021-fa,CS,101,Intro Computing: Engrg & Sci,OLC,112,264,48,...,10,7,5,4,2,2,10,"Davis, Neal E",,
803,2021,Fall,2021-fa,CS,105,Intro Computing: Non-Tech,OLC,27,83,53,...,17,20,11,9,6,1,24,"Zilles, Craig",,
804,2021,Fall,2021-fa,CS,105,Intro Computing: Non-Tech,OLC,23,100,37,...,15,19,9,6,5,4,14,"Zilles, Craig",,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
61433,2010,Summer,2010-su,CS,101,Intro Computing: Engrg & Sci,LBD,4,6,2,...,2,0,1,0,0,0,0,"Gambill, Thomas N",,
61434,2010,Summer,2010-su,CS,225,Data Structures,LBD,1,5,1,...,1,3,1,0,2,0,1,"Earls, John C",,
61435,2010,Summer,2010-su,CS,373,Theory of Computation,LEC,5,1,5,...,2,0,2,0,2,0,1,"Kumar, Viraj",,
61436,2010,Summer,2010-su,CS,421,Progrmg Languages & Compilers,LCD,2,5,5,...,0,1,4,0,4,0,0,"Hafiz, Munawar",,


### 🔬 Test Case Checkpoint 🔬

In [28]:
## == CHECKPOINT TEST CASES ==
# - This read-only cell contains test cases for your previous cell.
# - If this cell runs without any errors, you PASSED all test cases!
# - If this cell results in any errors, check your previous cell, make changes, and RE-RUN your code and then this cell.
assert( 'df_STAT' in vars() ), "Make certain to name the STAT courses df_STAT."
assert( 'df_CS' in vars() ), "Make certain to name the CS courses df_CS."
assert( len(df_STAT[df_STAT.Subject != "STAT"] ) == 0 ), "It looks like you did not subset df_STAT to only STAT courses."
assert( len(df_CS[df_CS.Subject != "CS"] ) == 0 ), "It looks like you did not subset df_CS to only CS courses."

## == SUCCESS MESSAGE ==
# You will only see this message (with the emoji showing) if you passed all test cases:
tada = "\N{PARTY POPPER}"
print(f"{tada} All tests passed! {tada}")

🎉 All tests passed! 🎉


### Puzzle 2.2: Comparing Overall Percentages

With our two new DataFrames of STAT and CS courses, use the following code cell to determine the **percentage** of A grades received in STAT and CS courses, storing them as variables `stat_a` and `cs_a` respectively. 

Print statements have been provided to show the values you calculate. 

**Hint:** To find the % of As, divide the total number of As by the total number of students. Remember the function `sum(df['column_name'])` can be used to add up the values of all rows in a particular column of a DataFrame.
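The share of A grades is the number of As divided by the number of students, so the result is a decimal between 0 and 1. A minimal sketch with hypothetical per-row counts (the `toy` DataFrame and its values are assumptions, not taken from the GPA dataset):

```python
import pandas as pd

# Hypothetical per-course counts of A grades and total students.
toy = pd.DataFrame({
    "A_Grade": [18, 15, 17],
    "Total Students": [23, 21, 24],
})

# Add up each column over all rows, then divide As by students.
total_as = sum(toy["A_Grade"])
total_students = sum(toy["Total Students"])
share_of_as = total_as / total_students

print(f"Share of As: {share_of_as}")
```

Summing first and dividing once weights every student equally; averaging per-course percentages instead would give small and large courses the same weight.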

In [36]:
stat_a = (sum(df_STAT['A+']) + sum(df_STAT['A']) + sum(df_STAT['A-'])) / (sum(df_STAT['A+']) + sum(df_STAT['A']) + sum(df_STAT['A-']) + sum(df_STAT['B+']) + sum(df_STAT['B']) + sum(df_STAT['B-']) + sum(df_STAT['C+']) + sum(df_STAT['C']) + sum(df_STAT['C-']) + sum(df_STAT['D+']) + sum(df_STAT['D']) + sum(df_STAT['D-']) + sum(df_STAT['F']))
print(f'Overall % of As in STAT is: {stat_a}')

cs_a = (sum(df_CS['A+']) + sum(df_CS['A']) + sum(df_CS['A-'])) / (sum(df_CS['A+']) + sum(df_CS['A']) + sum(df_CS['A-']) + sum(df_CS['B+']) + sum(df_CS['B']) + sum(df_CS['B-']) + sum(df_CS['C+']) + sum(df_CS['C']) + sum(df_CS['C-']) + sum(df_CS['D+']) + sum(df_CS['D']) + sum(df_CS['D-']) + sum(df_CS['F']))
print(f'Overall % of As in CS is: {cs_a}')

Overall % of As in STAT is: 0.6027997446909787
Overall % of As in CS is: 0.5891362965182221


### 🔬 Test Case Checkpoint 🔬

In [37]:
## == CHECKPOINT TEST CASES ==
# - This read-only cell contains test cases for your previous cell.
# - If this cell runs without any errors, you PASSED all test cases!
# - If this cell results in any errors, check your previous cell, make changes, and RE-RUN your code and then this cell.
import math
assert(math.isclose(stat_a, 0.6027997446909787)), "The overall percentage of A grades received in STAT courses does not appear to have been correctly calculated"
assert(math.isclose(cs_a, 0.5891362965182221)), "The overall percentage of A grades received in CS courses does not appear to have been correctly calculated"

## == SUCCESS MESSAGE ==
# You will only see this message (with the emoji showing) if you passed all test cases:
tada = "\N{PARTY POPPER}"
print(f"{tada} All tests passed! {tada}")

🎉 All tests passed! 🎉


### Analysis: Good Comparison?
**Q: What conclusion can you take from the overall percentages found above when asking the original question:**
- *Is it easier to get an A in STAT or CS courses at UIUC?*

Based on the data above, it is easier to get an A in STAT courses at UIUC.

**Q: Given what you've learned about experimental design, what are some reasons (specific to this dataset) you may not trust this conclusion? If you would trust it, explain why.**

There could be a number of reasons why the data leads to this conclusion. For example, the number of students in STAT courses may be greater than the number of students in CS courses. We would also have to look at the average GPA of each course, which could help us reach a more concrete conclusion about which department may be easier or harder.

## An Extra Consideration

If you look at the `Year` column of our GPA dataset, you might notice that we have some old data in our set - all the way back to **2010**! This means we aren't really answering our question from the perspective of a student now. 

If we want to know if it is easier to **currently** get an A in a STAT or CS course, we should control for the **date** of the data by looking at more **recent years** specifically.

### Puzzle 2.3: More DataFrames

Using the code cells below, define four new DataFrames by selecting from rows of our previously created `df_CS` and `df_STAT`. 

- `df_cs_recent`: all `CS` course data in *recent years*
- `df_stat_recent`: all `STAT` course data in *recent years*
- `df_cs_old`: all other, older `CS` course data
- `df_stat_old`: all other, older `STAT` course data

Define *recent years* as **any year after and including 2020**, and older years as any year before 2020. 

Feel free to use row selection conditionals OR the `.isin()` function you learned in the last lab.
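Either approach can express the same split. As a sketch on hypothetical data (the `toy` DataFrame and the explicit year list are assumptions for illustration), a comparison and `.isin()` select the same recent rows, and the complementary comparison selects the older ones:

```python
import pandas as pd

# Hypothetical toy data with one row per year.
toy = pd.DataFrame({
    "Year": [2010, 2019, 2020, 2021],
    "Subject": ["CS", "CS", "CS", "CS"],
})

# Option 1: a comparison conditional.
recent_cond = toy[toy["Year"] >= 2020]

# Option 2: .isin() with an explicit list of the years that count as recent.
recent_isin = toy[toy["Year"].isin([2020, 2021])]

# The older rows are everything before 2020.
old = toy[toy["Year"] < 2020]

print(recent_cond)
print(old)
```

The comparison form keeps working if newer years are added to the data, while the `.isin()` list would need to be extended by hand.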

In [38]:
df_stat_recent = df_STAT[(df_STAT.Year >= 2020)]
df_stat_recent

Unnamed: 0,Year,Term,YearTerm,Subject,Number,Course Title,Sched Type,A+,A,A-,...,C+,C,C-,D+,D,D-,F,Primary Instructor,Total Students,A_Grade
2576,2021,Fall,2021-fa,STAT,100,Statistics,LCD,150,78,49,...,12,18,10,5,5,3,18,"Flanagan, Karle A",,
2577,2021,Fall,2021-fa,STAT,100,Statistics,ONL,208,151,81,...,41,42,24,17,25,5,45,"Flanagan, Karle A",,
2578,2021,Fall,2021-fa,STAT,107,Data Science Discovery,OLC,127,61,33,...,2,5,4,2,2,0,12,"Flanagan, Karle A",,
2579,2021,Fall,2021-fa,STAT,200,Statistical Analysis,ONL,75,135,34,...,12,9,3,0,0,1,3,"Fireman, Ellen S",,
2580,2021,Fall,2021-fa,STAT,207,Data Science Exploration,LEC,17,3,2,...,0,1,0,0,0,1,3,"Ellison, Victoria M",,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9936,2020,Winter,2020-wi,STAT,420,Statistical Modeling in R,ONL,28,123,18,...,0,1,0,0,0,0,1,"Unger, David",,
9937,2020,Winter,2020-wi,STAT,420,Methods of Applied Statistics,ONL,5,14,5,...,0,0,0,0,0,0,1,"Unger, David",,
9938,2020,Winter,2020-wi,STAT,420,Methods of Applied Statistics,ONL,10,48,11,...,3,1,0,0,0,0,2,"Unger, David",,
9939,2020,Winter,2020-wi,STAT,440,Statistical Data Management,ONL,0,29,0,...,0,0,0,0,0,0,3,"Kinson, Christopher L",,


In [39]:
df_cs_recent = df_CS[(df_CS.Year >= 2020)]
df_cs_recent

Unnamed: 0,Year,Term,YearTerm,Subject,Number,Course Title,Sched Type,A+,A,A-,...,C+,C,C-,D+,D,D-,F,Primary Instructor,Total Students,A_Grade
800,2021,Fall,2021-fa,CS,100,Freshman Orientation,LEC,0,246,7,...,0,0,0,0,0,0,2,"Gunter, Elsa",,
801,2021,Fall,2021-fa,CS,100,Freshman Orientation,OLC,0,223,6,...,0,1,0,0,2,0,6,"Gunter, Elsa",,
802,2021,Fall,2021-fa,CS,101,Intro Computing: Engrg & Sci,OLC,112,264,48,...,10,7,5,4,2,2,10,"Davis, Neal E",,
803,2021,Fall,2021-fa,CS,105,Intro Computing: Non-Tech,OLC,27,83,53,...,17,20,11,9,6,1,24,"Zilles, Craig",,
804,2021,Fall,2021-fa,CS,105,Intro Computing: Non-Tech,OLC,23,100,37,...,15,19,9,6,5,4,14,"Zilles, Craig",,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9785,2020,Winter,2020-wi,CS,498,Data Visualiztion,ONL,226,169,30,...,2,3,0,3,0,0,9,"Hart, John C",,
9786,2020,Winter,2020-wi,CS,498,Data Visualiztion,OLC,20,14,3,...,1,0,0,0,0,0,1,"Hart, John C",,
9787,2020,Winter,2020-wi,CS,513,Theory & Pract Data Cleaning,ONL,350,28,3,...,0,0,0,0,0,0,1,"Ludaescher, Bertram",,
9788,2020,Winter,2020-wi,CS,598,Cloud Capstone,ONL,35,6,0,...,0,0,0,0,0,0,1,"Farivar, Reza",,


In [40]:
df_stat_old = df_STAT[(df_STAT.Year < 2020)]
df_stat_old

Unnamed: 0,Year,Term,YearTerm,Subject,Number,Course Title,Sched Type,A+,A,A-,...,C+,C,C-,D+,D,D-,F,Primary Instructor,Total Students,A_Grade
12452,2019,Fall,2019-fa,STAT,100,Statistics,ONL,69,105,63,...,31,25,12,4,10,1,13,"Flanagan, Karle A",,
12453,2019,Fall,2019-fa,STAT,100,Statistics,LCD,71,143,95,...,31,24,19,19,14,5,8,"Flanagan, Karle A",,
12454,2019,Fall,2019-fa,STAT,100,Statistics,ONL,19,72,23,...,8,19,3,5,4,3,9,"Yu, Albert",,
12455,2019,Fall,2019-fa,STAT,107,Data Science Discovery,LEC,6,109,7,...,3,3,3,4,2,0,5,"Flanagan, Karle A",,
12456,2019,Fall,2019-fa,STAT,200,Statistical Analysis,LCD,28,1,1,...,0,0,1,0,0,0,1,"Simpson, Douglas G",,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
61549,2010,Summer,2010-su,STAT,100,Statistics,LCD,1,13,3,...,1,2,2,3,1,1,1,"Hirtz, Nathaniel R",,
61550,2010,Summer,2010-su,STAT,100,Statistics,LCD,0,14,3,...,0,3,0,0,1,0,3,"Dalpiaz, David M",,
61551,2010,Summer,2010-su,STAT,400,Statistics and Probability I,LEC,4,15,7,...,1,2,2,0,1,0,3,"Monrad, Ditlev",,
61552,2010,Summer,2010-su,STAT,410,Statistics and Probability II,LEC,5,10,2,...,0,1,3,0,0,0,2,"Stepanov, Alexei G",,


In [41]:
df_cs_old = df_CS[df_CS.Year < 2020]
df_cs_old

Unnamed: 0,Year,Term,YearTerm,Subject,Number,Course Title,Sched Type,A+,A,A-,...,C+,C,C-,D+,D,D-,F,Primary Instructor,Total Students,A_Grade
10833,2019,Fall,2019-fa,CS,100,Freshman Orientation,LEC,0,214,22,...,29,12,10,4,2,2,4,"Gunter, Elsa",,
10834,2019,Fall,2019-fa,CS,101,Intro Computing: Engrg & Sci,LEC,282,135,62,...,7,6,3,4,3,5,12,"Davis, Neal E",,
10835,2019,Fall,2019-fa,CS,105,Intro Computing: Non-Tech,LEC,86,134,87,...,26,37,14,12,10,10,14,"Harris, Albert F",,
10836,2019,Fall,2019-fa,CS,125,Intro to Computer Science,LEC,150,357,45,...,7,11,6,0,13,0,23,"Challen, Geoffrey W",,
10837,2019,Fall,2019-fa,CS,126,Software Design Studio,LCD,9,56,37,...,3,3,1,3,0,2,3,"Woodley, Michael J",,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
61433,2010,Summer,2010-su,CS,101,Intro Computing: Engrg & Sci,LBD,4,6,2,...,2,0,1,0,0,0,0,"Gambill, Thomas N",,
61434,2010,Summer,2010-su,CS,225,Data Structures,LBD,1,5,1,...,1,3,1,0,2,0,1,"Earls, John C",,
61435,2010,Summer,2010-su,CS,373,Theory of Computation,LEC,5,1,5,...,2,0,2,0,2,0,1,"Kumar, Viraj",,
61436,2010,Summer,2010-su,CS,421,Progrmg Languages & Compilers,LCD,2,5,5,...,0,1,4,0,4,0,0,"Hafiz, Munawar",,


### 🔬 Test Case Checkpoint 🔬

In [42]:
## == CHECKPOINT TEST CASES ==
# - This read-only cell contains test cases for your previous cell.
# - If this cell runs without any errors, you PASSED all test cases!
# - If this cell results in any errors, check your previous cell, make changes, and RE-RUN your code and then this cell.
assert( 'df_stat_recent' in vars() ), "Make certain to name the recent STAT courses df_stat_recent."
assert( 'df_cs_recent' in vars() ), "Make certain to name the recent CS courses df_cs_recent."
assert( 'df_stat_old' in vars() ), "Make certain to name the old STAT courses df_stat_old."
assert( 'df_cs_old' in vars() ), "Make certain to name the old CS courses df_cs_old."

assert( len(df_stat_recent[df_stat_recent.Year < 2020] ) == 0 ), "Make sure only years after and including 2020 are in the df_stat_recent dataframe."
assert( len(df_cs_recent[df_cs_recent.Year < 2020] ) == 0 ), "Make sure only years after and including 2020 are in the df_cs_recent dataframe."
assert( len(df_stat_old[df_stat_old.Year >= 2020] ) == 0 ), "Make sure only years before 2020 are in the df_stat_old dataframe."
assert( len(df_cs_old[df_cs_old.Year >= 2020] ) == 0 ), "Make sure only years before 2020 are in the df_cs_old dataframe."

assert(len(df[ df.index.isin(df_stat_recent.index) & df.index.isin(df_stat_old.index) ]) == 0), "Check for duplicate values in your df_stat_recent and df_stat_old dataframes."
assert(len(df[ df.index.isin(df_cs_recent.index) & df.index.isin(df_cs_old.index) ]) == 0), "Check for duplicate values in your df_cs_recent and df_cs_old dataframes."

## == SUCCESS MESSAGE ==
# You will only see this message (with the emoji showing) if you passed all test cases:
tada = "\N{PARTY POPPER}"
print(f"{tada} All tests passed! {tada}")

🎉 All tests passed! 🎉


### Puzzle 2.4: New Analysis
Now that we've got all the DataFrames set up with GPA data of the CS and STAT courses separated by recency (2020 or newer being 'recent'), we can do more in-depth analysis to investigate our question. 

In the following code cells, **calculate the percentages** described by the comment. Your answer should be a decimal between 0 and 1. 

**Hint:** the function `sum(df['column_name'])` can be used to add up the values of all rows in a particular column of a DataFrame.

In [43]:
# Percentage of As received in CS in recent years
cs_recent_a = (sum(df_cs_recent['A+']) + sum(df_cs_recent['A']) + sum(df_cs_recent['A-'])) / (sum(df_cs_recent['A+']) + sum(df_cs_recent['A']) + sum(df_cs_recent['A-']) + sum(df_cs_recent['B+']) + sum(df_cs_recent['B']) + sum(df_cs_recent['B-']) + sum(df_cs_recent['C+']) + sum(df_cs_recent['C']) + sum(df_cs_recent['C-']) + sum(df_cs_recent['D+']) + sum(df_cs_recent['D']) + sum(df_cs_recent['D-']) + sum(df_cs_recent['F']))
print(f'Percentage of As received in CS in recent years: {cs_recent_a}')

Percentage of As received in CS in recent years: 0.7305164527732138


In [44]:
# Percentage of As received in STAT in recent years
stat_recent_a = (sum(df_stat_recent['A+']) + sum(df_stat_recent['A']) + sum(df_stat_recent['A-'])) / (sum(df_stat_recent['A+']) + sum(df_stat_recent['A']) + sum(df_stat_recent['A-']) + sum(df_stat_recent['B+']) + sum(df_stat_recent['B']) + sum(df_stat_recent['B-']) + sum(df_stat_recent['C+']) + sum(df_stat_recent['C']) + sum(df_stat_recent['C-']) + sum(df_stat_recent['D+']) + sum(df_stat_recent['D']) + sum(df_stat_recent['D-']) + sum(df_stat_recent['F']))
print(f'Percentage of As received in STAT in recent years: {stat_recent_a}')

Percentage of As received in STAT in recent years: 0.683119837776349


In [45]:
# Percentage of As received in CS in older years
cs_old_a = (sum(df_cs_old['A+']) + sum(df_cs_old['A']) + sum(df_cs_old['A-'])) / (sum(df_cs_old['A+']) + sum(df_cs_old['A']) + sum(df_cs_old['A-']) + sum(df_cs_old['B+']) + sum(df_cs_old['B']) + sum(df_cs_old['B-']) + sum(df_cs_old['C+']) + sum(df_cs_old['C']) + sum(df_cs_old['C-']) + sum(df_cs_old['D+']) + sum(df_cs_old['D']) + sum(df_cs_old['D-']) + sum(df_cs_old['F']))
print(f'Percentage of As received in CS in older years: {cs_old_a}') 

Percentage of As received in CS in older years: 0.5442563329057359


In [46]:
# Percentage of As received in STAT in older years
stat_old_a = (sum(df_stat_old['A+']) + sum(df_stat_old['A']) + sum(df_stat_old['A-'])) / (sum(df_stat_old['A+']) + sum(df_stat_old['A']) + sum(df_stat_old['A-']) + sum(df_stat_old['B+']) + sum(df_stat_old['B']) + sum(df_stat_old['B-']) + sum(df_stat_old['C+']) + sum(df_stat_old['C']) + sum(df_stat_old['C-']) + sum(df_stat_old['D+']) + sum(df_stat_old['D']) + sum(df_stat_old['D-']) + sum(df_stat_old['F']))
print(f'Percentage of As received in STAT in older years: {stat_old_a}')

Percentage of As received in STAT in older years: 0.5803123874934227


### 🔬 Test Case Checkpoint 🔬

In [47]:
## == CHECKPOINT TEST CASES ==
# - This read-only cell contains test cases for your previous cell.
# - If this cell runs without any errors, you PASSED all test cases!
# - If this cell results in any errors, check your previous cell, make changes, and RE-RUN your code and then this cell.
assert(math.isclose(cs_recent_a,  0.7305164527732138)), "The overall percentage of A grades received in CS courses recently does not appear to have been correctly calculated"
assert(math.isclose(stat_recent_a, 0.683119837776349)), "The overall percentage of A grades received in STAT courses recently does not appear to have been correctly calculated"

assert(math.isclose(cs_old_a,  0.5442563329057359)), "The overall percentage of A grades received in CS courses in older years does not appear to have been correctly calculated"
assert(math.isclose(stat_old_a, 0.5803123874934227)), "The overall percentage of A grades received in STAT courses in older years does not appear to have been correctly calculated"

## == SUCCESS MESSAGE ==
# You will only see this message (with the emoji showing) if you passed all test cases:
tada = "\N{PARTY POPPER}"
print(f"{tada} All tests passed! {tada}")

🎉 All tests passed! 🎉


### Observe the Results
Run the following cell to format all of your answers as a DataFrame, keeping in mind that "Old" means data from courses before 2020, and "New" means courses held during or after 2020. 

In [48]:
pd.DataFrame([
  {'Old % of A': cs_old_a, 'New % of A': cs_recent_a, 'Overall % of A': cs_a},
  {'Old % of A': stat_old_a, 'New % of A': stat_recent_a, 'Overall % of A': stat_a}
], index=['CS', 'STAT'])

Unnamed: 0,Old % of A,New % of A,Overall % of A
CS,0.544256,0.730516,0.589136
STAT,0.580312,0.68312,0.6028


Notice that when observing the overall % of A grades received, you may think it is easier to get an A in `STAT` than in `CS`. But in the subgroup of courses held in 2020 and later, we see that CS actually has a higher A-grade rate! 

This is **Simpson's Paradox**: a pattern within a population can appear, disappear, or reverse when you look at subpopulations.

In more formal terms, Simpson's Paradox can cause you to observe a pattern reverse when you look at the overall group statistics versus statistics of groups post-stratification. In this case we are stratifying by time.
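The arithmetic behind a full reversal can be sketched with made-up numbers. The counts below are hypothetical (they are not taken from the GPA dataset, and the subject names `X` and `Y` are placeholders); they are chosen so that `X` has the higher A rate in each time period but the lower rate overall:

```python
# Hypothetical counts per subject and period: (A grades, total students).
counts = {
    "X": {"old": (300, 1000), "recent": (160, 200)},
    "Y": {"old": (50, 200), "recent": (750, 1000)},
}


def rate(a_grades, total):
    """Share of A grades as a decimal between 0 and 1."""
    return a_grades / total


rates = {}
for subject, groups in counts.items():
    # Overall counts are the sums over both periods.
    total_a = sum(a for a, _ in groups.values())
    total_n = sum(n for _, n in groups.values())
    rates[subject] = {
        "old": rate(*groups["old"]),
        "recent": rate(*groups["recent"]),
        "overall": rate(total_a, total_n),
    }

for subject, r in rates.items():
    print(subject, {period: round(value, 3) for period, value in r.items()})
```

Each overall rate is a weighted average of the subgroup rates, weighted by how many students fall in each subgroup. Here most `X` students sit in the low-rate old period and most `Y` students sit in the high-rate recent period, which is enough to flip the overall comparison.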


### Analysis: Reflecting on New Observations 

You should see the pattern reverse when you look at the overall A grade percentages vs. the percentages stratified to account for recency.

Now think about how you would respond differently to the incoming student's question:
- *Is it easier to get an A in STAT or CS courses at UIUC?*

**Q: Which comparison of percentages do you trust more and why? Are there any other potential confounding variables when answering this question that could be investigated further? Respond with at least three full sentences.**

I believe that recent data is more applicable to the public, so I would trust the comparison that shows the percentage of As in both newer and older classes within the STAT and CS departments. A lot of factors could contribute to why this new trend is being observed. Perhaps the CS department has more opportunities for students to seek assistance and get help on their work in comparison to the STAT department. The number of students within each department could also change our results. 

<hr style="color: #DD3403;">

# Part 3: Revisiting the Hello Dataset

Over the past two weeks, you created a series of questions that made up the "Hello Dataset" and completed the survey by answering all of the questions yourself. Now, we will load this dataset again and briefly answer a few questions.

## Load the Hello Dataset

The "Hello Dataset" is available here:
```
https://waf.cs.illinois.edu/discovery/hello-fa22.csv
```

Use Python to load this dataset into a DataFrame called `df_hello`:

In [51]:
df_hello = pd.read_csv("https://waf.cs.illinois.edu/discovery/hello-fa22.csv")
df_hello

Unnamed: 0,Name,Lab,Number 1 or Number 2?,Number 3 or Number 5?,What is your preferred way of saying one half?,7UP or Sierra Mist?,Android or iOS?,Apple or Orange?,Beef or Pork?,Your birthday is...,...,What is your favorite coffee shop drink?,What is your favorite mobile app?,What is your favorite subject?,Who is your favorite singer?,What is your favorite season?,What is your Zodiac sign?,What is your favorite video game?,What is your favorite non-video game?,What is your favorite food?,What's your favorite movie?
0,CA Ana,AL1,2.0,5.0,2-Jan,Sierra Mist,iOS,Apple,Pork,On/Before June 30,...,Cold brew,Snapchat,Math,Taylor Swift,Fall,Piceis,Minecraft,Chutes and Ladders,Yogurt,?
1,CA Ram,AYR,2.0,5.0,2-Jan,7UP,iOS,Apple,,On/After July 1,...,Black coffee,Google chrome,Mathematics,TWICE,Spring,Scorpio,Player Unkown's Battleground,Basketball,Rice,The conjouring
2,CA Xin,AYO,1.0,5.0,2-Jan,Sierra Mist,iOS,Apple,Beef,On/Before June 30,...,starbucks,Starbucks,Psychology,Five Exercuse,Winter,aquarius,,,Eggs,Flipped
3,Lekha,AYH,2.0,2.0,0.5,7UP,iOS,Orange,Pork,On/After July 1,...,pumpkin spice latte,twitter,english (writing),lana del rey,autumn,leo,animal crossing,catan,tteokbokki,dune (2021)
4,Reni,AYD,2.0,5.0,2-Jan,7UP,iOS,Apple,Pork,On/Before June 30,...,Mocha,Instagram,Math,Mark kozelek,Winter,Aquarias,Rainbow six siege,Catan,Tacos,End of Evangelion
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
519,Bill,AYC,1.0,5.0,2-Jan,Sierra Mist,iOS,Apple,Beef,On/After July 1,...,,,,,,,,,,
520,Renault,AYR,2.0,5.0,0.5,Sierra Mist,Android,Orange,Pork,On/Before June 30,...,moca,wechat,cs,Michael Jackson,Spring,idk,WOW classic,badminton,rice,guardians of gahoole
521,Shivam Patel,AYC,1.0,3.0,0.5,7UP,iOS,Apple,,On/Before June 30,...,Don’t have one,Youtube,World History,Kumar Sanu,Summer,Aquarius,FIFA 20,Chess,Pizza,Avengers Endgame
522,Colin,AYI,1.0,5.0,2-Jan,Sierra Mist,iOS,Orange,Beef,On/Before June 30,...,tea,YouTube,Math,The Weeknd,Summer,Aquarious,Rocket League,Poker,Tacos,Inception


### 🔬 Test Case Checkpoint 🔬

In [52]:
## == CHECKPOINT TEST CASES ==
# - This read-only cell contains test cases for your previous cell.
# - If this cell runs without any errors, you PASSED all test cases!
# - If this cell results in any errors, check your previous cell, make changes, and RE-RUN your code and then this cell.
assert(len(df_hello) == 524), "This is not the Hello dataset you're looking for. Check the URL."

## == SUCCESS MESSAGE ==
# You will only see this message (with the emoji showing) if you passed all test cases:
tada = "\N{PARTY POPPER}"
print(f"{tada} All tests passed! {tada}")

🎉 All tests passed! 🎉


## Classes v. Sleep

With the Hello Dataset, we are going to briefly explore two quantitative questions:
- How many hours of sleep do you get on average? 
- How many classes are you taking this semester?

### Puzzle 3.1: Observation Subsets

In this situation, let's define an average of **6 or more hours** of sleep as "good sleep" and anything less as "bad sleep".

From this, create two DataFrames that contain subsets of the Hello Dataset: 
- `df_goodsleep`: including everyone who gets "good sleep" on average
- `df_badsleep`: including everyone who gets "bad sleep" on average
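
A minimal sketch of this kind of conditional subsetting, using a tiny made-up DataFrame (the column names `name` and `hours` and all values are hypothetical, not from the Hello Dataset). A comparison such as `df["hours"] >= 6` produces a boolean Series, and indexing the DataFrame with it keeps only the rows where it is `True`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the last row has a missing response.
df = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "hours": [8.0, 5.5, 6.0, np.nan],
})

# Boolean mask: True where the condition holds, False otherwise (including NaN).
is_good = df["hours"] >= 6

df_good = df[is_good]           # rows with 6 or more hours
df_bad = df[df["hours"] < 6]    # rows with fewer than 6 hours

print(len(df_good), len(df_bad), len(df))
```

Note that a missing value (`NaN`) fails both `>= 6` and `< 6`, so in this toy example row `D` lands in neither subset.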

In [53]:
# Column holding each person's average hours of sleep
sleep = 'How many hours of sleep do you get on average?'
# Keep only the rows reporting 6 or more hours ("good sleep")
df_goodsleep = df_hello[df_hello[sleep] >= 6]
df_goodsleep

Unnamed: 0,Name,Lab,Number 1 or Number 2?,Number 3 or Number 5?,What is your preferred way of saying one half?,7UP or Sierra Mist?,Android or iOS?,Apple or Orange?,Beef or Pork?,Your birthday is...,...,What is your favorite coffee shop drink?,What is your favorite mobile app?,What is your favorite subject?,Who is your favorite singer?,What is your favorite season?,What is your Zodiac sign?,What is your favorite video game?,What is your favorite non-video game?,What is your favorite food?,What's your favorite movie?
0,CA Ana,AL1,2.0,5.0,2-Jan,Sierra Mist,iOS,Apple,Pork,On/Before June 30,...,Cold brew,Snapchat,Math,Taylor Swift,Fall,Piceis,Minecraft,Chutes and Ladders,Yogurt,?
2,CA Xin,AYO,1.0,5.0,2-Jan,Sierra Mist,iOS,Apple,Beef,On/Before June 30,...,starbucks,Starbucks,Psychology,Five Exercuse,Winter,aquarius,,,Eggs,Flipped
3,Lekha,AYH,2.0,2.0,0.5,7UP,iOS,Orange,Pork,On/After July 1,...,pumpkin spice latte,twitter,english (writing),lana del rey,autumn,leo,animal crossing,catan,tteokbokki,dune (2021)
4,Reni,AYD,2.0,5.0,2-Jan,7UP,iOS,Apple,Pork,On/Before June 30,...,Mocha,Instagram,Math,Mark kozelek,Winter,Aquarias,Rainbow six siege,Catan,Tacos,End of Evangelion
5,Linger,AYF,2.0,5.0,2-Jan,7UP,iOS,Apple,Beef,On/Before June 30,...,Starbucks,App Store,Any STEM,Don’t have one,Fall,Gemini,Minecraft,UNO,Authentic Chinese Food,Harry Potter
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
517,Lauren,AYP,1.0,3.0,2-Jan,7UP,iOS,Apple,Beef,On/Before June 30,...,tea,instagram,english,ariana grande,summer,taurus,call of duty,flappy bird,pasta,the notebook
518,Celesta,AYS,1.0,3.0,0.5,7UP,iOS,Apple,Beef,On/After July 1,...,Frappuccino,instagram,science,bts,winter,capricorn,mario kart,genshin impact,ramen,disney rapanzel
519,Bill,AYC,1.0,5.0,2-Jan,Sierra Mist,iOS,Apple,Beef,On/After July 1,...,,,,,,,,,,
520,Renault,AYR,2.0,5.0,0.5,Sierra Mist,Android,Orange,Pork,On/Before June 30,...,moca,wechat,cs,Michael Jackson,Spring,idk,WOW classic,badminton,rice,guardians of gahoole


In [54]:
# Keep only the rows reporting fewer than 6 hours ("bad sleep")
df_badsleep = df_hello[df_hello[sleep] < 6]
df_badsleep

Unnamed: 0,Name,Lab,Number 1 or Number 2?,Number 3 or Number 5?,What is your preferred way of saying one half?,7UP or Sierra Mist?,Android or iOS?,Apple or Orange?,Beef or Pork?,Your birthday is...,...,What is your favorite coffee shop drink?,What is your favorite mobile app?,What is your favorite subject?,Who is your favorite singer?,What is your favorite season?,What is your Zodiac sign?,What is your favorite video game?,What is your favorite non-video game?,What is your favorite food?,What's your favorite movie?
1,CA Ram,AYR,2.0,5.0,2-Jan,7UP,iOS,Apple,,On/After July 1,...,Black coffee,Google chrome,Mathematics,TWICE,Spring,Scorpio,Player Unkown's Battleground,Basketball,Rice,The conjouring
8,Suvinay,AYP,1.0,5.0,2-Jan,7UP,Android,Orange,,On/After July 1,...,Frappuccino,Netflix,Astronomy,Freddie Mercury,Spring,Sagitarrius,Minecraft,Chess,Basmati Rice,Interstellar
16,Claudia,AL1,2.0,3.0,2-Jan,Sierra Mist,iOS,Apple,Beef,On/After July 1,...,Ice caramel macchiato or salted caramel cream ...,Disney +,Math,Giveon,Spring,Capricorn,Smash bros,Uno,Chicken,Tangled
17,Wing,AYB,1.0,5.0,2-Jan,Sierra Mist,iOS,Apple,Beef,On/Before June 30,...,Bridgeport Coffees,iMessage,Math,IU,Winter,Aquarius,Valorant,Exploding Kittens,Lobster,Avengers End Game
20,Sakshyam,AYT,1.0,5.0,0.5,7UP,iOS,Orange,Beef,On/After July 1,...,Latte,Duolingo,Philosophy,Elvis Presley,summer,Scorpio,Call of duty,Ludo,Burger,Interstellar
85,Rachel,AYK,1.0,5.0,2-Jan,7UP,iOS,Apple,Beef,On/After July 1,...,Matcha latte,Instagram,Statistics,Enhypen,Fall,Virgo,Minecraft,Uno,Sushi,Your Name
90,Lily,AYB,1.0,5.0,2-Jan,7UP,iOS,Orange,Beef,On/Before June 30,...,Tous Les Jours,Discord,Math,Powfu,Summer,Cancer,Valorant,Chess,Pasta,Weathering with You
156,rafy,AYK,1.0,3.0,2-Jan,Sierra Mist,iOS,Orange,Pork,On/After July 1,...,cafe paradiso,tiktok,literature,,fall,leo,mobile legends / league of legends,jenga,chocolate,my neighbor totoro
179,Alisha,AYC,1.0,5.0,2-Jan,7UP,iOS,Orange,Beef,On/Before June 30,...,Vanilla Cold Brew,Instagram,Biology,Harry Styles,Winter,Capricorn,Minecraft,Magic: The Gathering,Mac & Cheese,Black Panther
247,Oddie,AYR,1.0,3.0,2-Jan,Sierra Mist,Android,Orange,Pork,On/Before June 30,...,Cold Brew,Discord,Computer Science,Bruno Mars,Fall,Aquarius,League of Legends,Catan,Pho,Summer wars


### 🔬 Test Case Checkpoint 🔬

In [55]:
## == CHECKPOINT TEST CASES ==
# - This read-only cell contains test cases for your previous cell.
# - If this cell runs without any errors, you PASSED all test cases!
# - If this cell results in any errors, check your previous cell, make changes, and RE-RUN your code and then this cell.
assert(len(df_goodsleep) == 503 ), "Double check your conditional used to create df_goodsleep from df_hello - remember, good sleep is response values of 6 hours or more"
assert(len(df_badsleep) == 21 ), "Double check your conditional used to create df_badsleep from df_hello - remember, bad sleep is response values of less than 6 hours"

## == SUCCESS MESSAGE ==
# You will only see this message (with the emoji showing) if you passed all test cases:
tada = "\N{PARTY POPPER}"
print(f"{tada} All tests passed! {tada}")

🎉 All tests passed! 🎉


### Puzzle 3.2: Average Number of Classes by Group

Now find the **average number of classes** of each group (good sleep and bad sleep):

Hint: the `df['column name'].mean()` function returns the mean of all values in the specified column of `df`
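
As a small illustration of the hint, on made-up data (the column names `group` and `classes` and all values are hypothetical): `.mean()` on a column averages its values and skips missing values by default. `groupby` is shown only as one alternative way to get every group's mean at once; the lab itself just needs `.mean()` on each subset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing response.
df = pd.DataFrame({
    "group": ["good", "good", "good", "bad", "bad"],
    "classes": [5, 6, np.nan, 4, 6],
})

# Mean of one column within each subset; NaN is skipped by default.
good_mean = df[df["group"] == "good"]["classes"].mean()
bad_mean = df[df["group"] == "bad"]["classes"].mean()
print(good_mean, bad_mean)

# Alternative: compute the mean of every group in one step.
group_means = df.groupby("group")["classes"].mean()
print(group_means)
```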

In [56]:
# Column holding the number of classes each person is taking
classes = 'How many classes are you taking this semester?'
# Average number of classes among the "good sleep" group
goodsleep_avg_classes = df_goodsleep[classes].mean()
goodsleep_avg_classes

5.656

In [57]:
# Average number of classes among the "bad sleep" group
badsleep_avg_classes = df_badsleep[classes].mean()
badsleep_avg_classes

5.285714285714286
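
The two averages can be compared by subtracting one from the other. A short sketch, re-entering the two values printed above as literals so it runs on its own; the variable `difference` is introduced here for illustration only. The result is measured in classes, the same unit as the averages.

```python
# Values copied from the two outputs above.
goodsleep_avg_classes = 5.656
badsleep_avg_classes = 5.285714285714286

# Difference in the average number of classes (unit: classes).
difference = goodsleep_avg_classes - badsleep_avg_classes
print(f"{difference:.2f} classes")
```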

### 🔬 Test Case Checkpoint 🔬

In [60]:
## == CHECKPOINT TEST CASES ==
# - This read-only cell contains test cases for your previous cell.
# - If this cell runs without any errors, you PASSED all test cases!
# - If this cell results in any errors, check your previous cell, make changes, and RE-RUN your code and then this cell.
assert(math.isclose(goodsleep_avg_classes,  5.656)), "The average number of classes for those with good sleep does not appear to have been correctly calculated"
assert(math.isclose(badsleep_avg_classes,  5.285714285714286)), "The average number of classes for those with bad sleep does not appear to have been correctly calculated"

## == SUCCESS MESSAGE ==
# You will only see this message (with the emoji showing) if you passed all test cases:
tada = "\N{PARTY POPPER}"
print(f"{tada} All tests passed! {tada}")

🎉 All tests passed! 🎉


### Analysis: Classes v. Sleeptime

**Q: What is the relationship between classes and average sleep time?  Can you think of a possible *confounding variable* in the observed relationship (or lack thereof) between classes and average sleep time?** Write at least three complete sentences.

To me, it looks like there isn't much of a difference in how many classes students take between those who get good sleep and those who get bad sleep. The difference in the average number of classes between the two groups is only about 0.37 classes (5.656 vs. about 5.286), with the good sleep group taking slightly more. A possible confounding variable could be social media usage.

<hr style="color: #DD3403;">

# Submission

You're almost done!  All you need to do is to commit your lab to GitHub:

1.  ⚠️ **Make certain to save your work.** ⚠️ To do this, go to **File => Save All**

2.  After you have saved, exit this notebook and follow the Canvas instructions to commit this lab to your Git repository!

3. Your TA will grade your submission and provide you feedback after the lab is due. :)