<a href="https://colab.research.google.com/github/pdesire-20/Lab6/blob/main/lab_simpsons_paradox.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Welcome to Lab: Simpson's Paradox 🔬

Before this lab section, you learned about Simpson's Paradox and confounding variables in lecture.  This week, you will find Simpson's Paradox through analysis of a dataset in Python, and get some practice writing conditionals for pandas DataFrames!

A few tips to remember:

- **You are not alone on your journey in learning programming!**
- If you find yourself stuck for more than a few minutes, ask a neighbor or course staff for help!  When you are giving help to your neighbor, explain the **idea and approach** to the problem without sharing the answer itself so they can have the same **<i>ah-hah</i>** moment!
- We are here to help you!  Don't feel embarrassed or shy to ask us for help!

Let's get started!

# Part 1: The GPA Dataset

The GPA dataset contains data about every section of every course at UIUC. Using this data, we can analyze how students are doing in each class.

Simpson's paradox occurs when a trend that appears in *subsets* of the data disappears, or even reverses, when you look at the data as a whole.

![image.png](https://upload.wikimedia.org/wikipedia/commons/f/fb/Simpsons_paradox_-_animation.gif)

Picture credit: Wikipedia
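
To make the idea concrete before touching the real data, here is a minimal sketch using invented counts for two hypothetical subjects, `X` and `Y`, split into "easy" and "hard" courses. None of these numbers come from the GPA dataset; they are chosen only so that the reversal shows up.

```python
import pandas as pd

# Hypothetical counts, invented purely for illustration (not from the GPA dataset).
toy = pd.DataFrame({
    "subject": ["X", "X", "Y", "Y"],
    "level": ["easy", "hard", "easy", "hard"],
    "a_grades": [90, 300, 800, 30],
    "students": [100, 900, 900, 100],
})

# Rate of A grades inside each (subject, level) subgroup.
toy["rate"] = toy["a_grades"] / toy["students"]
by_level = toy.pivot(index="level", columns="subject", values="rate")

# Pooled rate per subject: total A grades divided by total students.
totals = toy.groupby("subject")[["a_grades", "students"]].sum()
overall = totals["a_grades"] / totals["students"]

print(by_level)
print(overall)
```

In this toy table, subject `X` has the higher rate within both levels, yet subject `Y` has the higher pooled rate, because most of `Y`'s students are in the "easy" courses.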

## Load the GPA Dataset

The most recent version of the "GPA Dataset" is available here:
```
https://waf.cs.illinois.edu/discovery/gpa.csv
```

Use Python to load this dataset into a DataFrame called `df`:

In [6]:


import pandas as pd

df = pd.read_csv("https://waf.cs.illinois.edu/discovery/gpa.csv")
df

Unnamed: 0,Year,Term,YearTerm,Subject,Number,Course Title,Sched Type,A+,A,A-,...,C+,C,C-,D+,D,D-,F,W,Students,Primary Instructor
0,2023,Spring,2023-sp,AAS,100,Intro Asian American Studies,DIS,0,11,5,...,0,0,0,0,0,0,1,0,22,"Shin, Jeongsu"
1,2023,Spring,2023-sp,AAS,100,Intro Asian American Studies,DIS,0,17,2,...,0,0,0,0,0,0,0,1,23,"Shin, Jeongsu"
2,2023,Spring,2023-sp,AAS,100,Intro Asian American Studies,DIS,0,13,2,...,0,0,1,0,0,0,1,0,21,"Lee, Sabrina Y"
3,2023,Spring,2023-sp,AAS,200,U.S. Race and Empire,LCD,6,15,5,...,0,0,0,0,1,0,1,0,33,"Sawada, Emilia"
4,2023,Spring,2023-sp,AAS,215,US Citizenship Comparatively,LCD,16,12,2,...,0,0,0,0,0,0,0,0,33,"Kwon, Soo Ah"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
69064,2010,Summer,2010-su,STAT,410,Statistics and Probability II,LEC,5,10,2,...,0,1,3,0,0,0,2,1,31,"Stepanov, Alexei G"
69065,2010,Summer,2010-su,STAT,440,Statistical Data Management,LEC,4,12,8,...,0,0,0,0,0,0,0,0,28,"Unger, David"
69066,2010,Summer,2010-su,TAM,212,Introductory Dynamics,LEC,0,1,3,...,5,1,1,0,2,0,1,0,28,"Morgan, William T"
69067,2010,Summer,2010-su,TAM,251,Introductory Solid Mechanics,LCD,1,2,2,...,3,3,2,0,0,1,1,0,21,"Ott-Monsivais, Stephanie"


### 🔬 Test Case Checkpoint 🔬

In [9]:
## == CHECKPOINT TEST CASES ==
# - This read-only cell contains test cases for your previous cell.
# - If this cell runs without any errors, you PASSED all test cases!
# - If this cell results in any errors, check your previous cell, make changes, and RE-RUN your code and then this cell.
assert(len(df) == 69069 ), "This is not the GPA dataset you're looking for."

## == SUCCESS MESSAGE ==
# You will only see this message (with the emoji showing) if you passed all test cases:
tada = "\N{PARTY POPPER}"
print(f"{tada} All tests passed! {tada}")

🎉 All tests passed! 🎉


## Data Cleaning: An Additional Column

The GPA dataset contains raw GPA data and is not the easiest to work with if we want to analyze intricacies such as the average GPA or passing rate.  Luckily, DataFrames are modifiable, so we can **add more columns** based on what questions we want to answer.

The process of modifying a dataset via deletion (cleaning up empty/unwanted values) or addition (adding new columns) is often called **data cleaning**. This is an important concept in Data Science, because you won't always receive your data in the perfect format for your purposes.
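
As a rough illustration of both kinds of cleaning, the sketch below uses a tiny made-up table. The column names `course`, `passed`, `enrolled`, and `pass_rate` are hypothetical and are not part of the GPA dataset.

```python
import pandas as pd

# Hypothetical raw data with one missing value (None becomes NaN).
raw = pd.DataFrame({
    "course": ["A", "B", "C"],
    "passed": [18, None, 25],
    "enrolled": [20, 30, 25],
})

# Cleaning by deletion: drop rows where "passed" is missing.
clean = raw.dropna(subset=["passed"]).copy()

# Cleaning by addition: derive a new column from existing ones.
clean["pass_rate"] = clean["passed"] / clean["enrolled"]
print(clean)
```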

With the GPA dataset, we want to investigate one innocent question posed by a theoretical incoming student:
- *Is it easier to get an A in STAT or CS courses at UIUC?*

To do this, we'll need to first perform some modifications on our loaded DataFrame, `df`. Specifically, we will need to create:
- An `A_Grades` column, containing the total number of students receiving an A+, A, or A- in every course

### Puzzle 1.1: Creating an `A_Grades` Column

Create the new column `A_Grades` that stores the total number of "A"s given in every course.

- We consider an "A" to be any type of A, in other words "A+", "A", or an "A-" are all included.
- In our `df`, the number of students receiving a particular grade in a course is found in the `df['Grade']` column. For example, `df['A']` contains the number of students who received an A.

In [7]:
df['A_Grades'] = df["A+"] + df["A"] + df["A-"]
df['A_Grades']

0        16
1        19
2        15
3        26
4        30
         ..
69064    17
69065    24
69066     4
69067     5
69068    22
Name: A_Grades, Length: 69069, dtype: int64

### 🔬 Test Case Checkpoint 🔬

In [19]:
## == CHECKPOINT TEST CASES ==
# - This read-only cell contains test cases for your previous cell.
# - If this cell runs without any errors, you PASSED all test cases!
# - If this cell results in any errors, check your previous cell, make changes, and RE-RUN your code and then this cell.
assert('A_Grades' in df), "Make sure you've named the A_Grades column properly and added it to the DataFrame."
assert(df['A_Grades'].sum() == 2325768), "Double check the values of your A_Grades column."

## == SUCCESS MESSAGE ==
# You will only see this message (with the emoji showing) if you passed all test cases:
tada = "\N{PARTY POPPER}"
print(f"{tada} All tests passed! {tada}")

🎉 All tests passed! 🎉


## Working with the GPA Dataset
Now that we have that extra column set up, we can perform some basic mathematical analysis on the GPA Dataset to gain insight into our question:
- *Is it easier to get an A in STAT or CS courses at UIUC?*

Should be simple, right? Let's just see which subject, STAT or CS, has a greater percentage of A grades.

### Puzzle 1.2: Subject DataFrames

Select only the rows of the GPA dataset `df` with a `Subject` of `STAT`. Assign these rows to a new DataFrame, `df_STAT`.
- Make sure your result only contains STAT courses!

In [8]:
df_STAT = df[df["Subject"] == "STAT"]

Select only the rows of the GPA dataset `df` with a `Subject` of `CS`. Assign these rows to a new DataFrame, `df_CS`.
- Make sure your result only contains CS courses!

In [10]:
df_CS = df[df["Subject"] == "CS"]

### 🔬 Test Case Checkpoint 🔬

In [20]:
## == CHECKPOINT TEST CASES ==
# - This read-only cell contains test cases for your previous cell.
# - If this cell runs without any errors, you PASSED all test cases!
# - If this cell results in any errors, check your previous cell, make changes, and RE-RUN your code and then this cell.
assert( 'df_STAT' in vars() ), "Make certain to name the STAT courses df_STAT."
assert( 'df_CS' in vars() ), "Make certain to name the CS courses df_CS."
assert( len(df_STAT[df_STAT.Subject != "STAT"] ) == 0 ), "It looks like you did not subset df_STAT to only STAT courses."
assert( len(df_CS[df_CS.Subject != "CS"] ) == 0 ), "It looks like you did not subset df_CS to only CS courses."

## == SUCCESS MESSAGE ==
# You will only see this message (with the emoji showing) if you passed all test cases:
tada = "\N{PARTY POPPER}"
print(f"{tada} All tests passed! {tada}")

🎉 All tests passed! 🎉


### Puzzle 1.3: Comparing Overall Percentages

With our two new DataFrames of STAT and CS courses, use the following code cell to determine the **percentage** of A grades received in STAT and CS courses, storing them as variables `stat_a` and `cs_a` respectively.

Print statements have been provided to show the values you calculate.

As you work, remember:
- To find the **% of As**, divide the total number of `A_Grades` by the total number of `Students` ($\space \frac{A \space Grades}{Students} \space$).
- You will need the `A_Grades` column you made earlier and the `Students` column, which contains the total number of students in each course.
- The syntax `sum(df['column_name'])` can be used to add up the values of all rows in a particular column of a DataFrame.
- Your % should be a decimal between 0 and 1.
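
Dividing the two totals is not the same as averaging each section's own percentage. The sketch below shows the difference on two invented sections (the numbers are hypothetical, only the column names match the GPA dataset).

```python
import pandas as pd

# Hypothetical sections: one small, one large (numbers invented for illustration).
sections = pd.DataFrame({
    "A_Grades": [9, 100],
    "Students": [10, 400],
})

# Pooled rate: every student counts equally.
pooled = sum(sections["A_Grades"]) / sum(sections["Students"])

# Mean of per-section rates: every section counts equally, regardless of size.
per_section_mean = (sections["A_Grades"] / sections["Students"]).mean()

print(f"Pooled rate: {pooled:.3f}")
print(f"Mean of section rates: {per_section_mean:.3f}")
```

The pooled rate is the one this puzzle asks for, since the question is about the share of students who received an A.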

In [11]:
stat_a = sum(df_STAT['A_Grades'] ) / sum(df_STAT["Students"])
print(f'Overall % of As in STAT is: {stat_a}')

cs_a = sum(df_CS['A_Grades'] ) / sum(df_CS["Students"])
print(f'Overall % of As in CS is: {cs_a}')

Overall % of As in STAT is: 0.609647329889628
Overall % of As in CS is: 0.6119729348794934


### 🔬 Test Case Checkpoint 🔬

In [None]:
## == CHECKPOINT TEST CASES ==
# - This read-only cell contains test cases for your previous cell.
# - If this cell runs without any errors, you PASSED all test cases!
# - If this cell results in any errors, check your previous cell, make changes, and RE-RUN your code and then this cell.
import math
assert(math.isclose(stat_a, 0.609647329889628)), "The overall percentage of A grades received in STAT courses does not appear to have been correctly calculated."
assert(math.isclose(cs_a, 0.6119729348794934)), "The overall percentage of A grades received in CS courses does not appear to have been correctly calculated."

## == SUCCESS MESSAGE ==
# You will only see this message (with the emoji showing) if you passed all test cases:
tada = "\N{PARTY POPPER}"
print(f"{tada} All tests passed! {tada}")

### Analysis: Good Comparison?
**Q: What conclusion can you take from the overall percentages found above when asking the original question:**
- *Is it significantly easier to get an A in STAT or CS courses at UIUC, or is it so close that it would be hard to make general conclusions?*

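*(✏️ The overall percentages are so close (about 61.0% for STAT vs. 61.2% for CS) that it is hard to make a general conclusion.)*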

**Q: Given what you've learned about experimental design, what are some reasons (specific to this dataset) you may not trust this conclusion? If you would trust it, explain why.**

There are other variables not accounted for here that could affect the percentage of A grades.

## An Extra Consideration

If you look at the `Year` column of our GPA dataset, you might notice that we have some old data in our dataset - all the way back to **2010**! This means we aren't really answering our question from the perspective of a student now.

If we want to know if it is easier to **currently** get an A in a STAT or CS course, we should control for the **date** of the data by looking at more **recent years** specifically.

### Puzzle 1.4: More DataFrames

Using the code cells below, define four new DataFrames by selecting from rows of our previously created `df_CS` and `df_STAT`.

- `df_cs_recent`: all `CS` course data in *recent years* ($\space \geq 2021 \space$)
- `df_stat_recent`: all `STAT` course data in *recent years* ($\space \geq 2021 \space$)
- `df_cs_old`: all other, older `CS` course data ($\space < 2021 \space$)
- `df_stat_old`: all other, older `STAT` course data ($\space < 2021 \space$)

Define *recent years* as **2021 or any later year**, and older years as any year before 2021.

Feel free to use conditionals OR the `.isin()` syntax you learned in the last lab.
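
As a reminder of how the two approaches compare, here is a small sketch on a made-up table. Only the `Year` and `Students` column names are borrowed from the GPA dataset; the rows are hypothetical.

```python
import pandas as pd

# Hypothetical miniature table with the same "Year" column name as the GPA dataset.
toy = pd.DataFrame({
    "Year": [2019, 2020, 2021, 2022, 2023],
    "Students": [30, 25, 40, 35, 20],
})

# Option 1: a comparison conditional.
recent_by_comparison = toy[toy["Year"] >= 2021]

# Option 2: .isin() with an explicit list of years.
# This only matches the years listed, so the list must cover every recent year in the data.
recent_by_isin = toy[toy["Year"].isin([2021, 2022, 2023])]

# The complement (~) of the same mask gives the older rows with no overlap or gaps.
older = toy[~(toy["Year"] >= 2021)]

print(recent_by_comparison.equals(recent_by_isin))
print(len(older) + len(recent_by_comparison) == len(toy))
```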

In [13]:
df_stat_recent = df_STAT[df_STAT["Year"] >= 2021]
df_stat_recent

Unnamed: 0,Year,Term,YearTerm,Subject,Number,Course Title,Sched Type,A+,A,A-,...,C,C-,D+,D,D-,F,W,Students,Primary Instructor,A_Grades
2118,2023,Spring,2023-sp,STAT,100,Statistics,LEC,51,165,50,...,24,15,7,14,10,19,3,478,"Flanagan, Karle A",266
2119,2023,Spring,2023-sp,STAT,100,Statistics,ONL,82,285,83,...,51,23,17,19,6,41,3,849,"Flanagan, Karle A",450
2120,2023,Spring,2023-sp,STAT,107,Data Science Discovery,LEC,44,177,36,...,7,5,2,0,0,8,0,341,"Flanagan, Karle A",257
2121,2023,Spring,2023-sp,STAT,200,Statistical Analysis,LEC,59,54,23,...,3,0,2,0,0,1,0,178,"Yu, Albert",136
2122,2023,Spring,2023-sp,STAT,207,Data Science Exploration,LEC,27,23,8,...,6,3,1,0,0,3,0,93,"Deeke, Julie M",58
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
12562,2021,Summer,2021-su,STAT,400,Statistics and Probability I,OLC,11,22,6,...,4,3,2,1,0,7,2,84,"Nguyen, Ha Khanh",39
12563,2021,Summer,2021-su,STAT,410,Statistics and Probability II,ONL,8,14,4,...,3,0,0,3,0,4,2,56,"Stepanov, Alexey G",26
12564,2021,Summer,2021-su,STAT,420,Statistical Modeling in R,ONL,101,94,15,...,1,0,0,0,0,2,4,227,"Unger, David",210
12565,2021,Summer,2021-su,STAT,420,Methods of Applied Statistics,ONL,5,8,1,...,2,0,0,0,0,1,0,22,"Unger, David",14


In [14]:
df_cs_recent = df_CS[df_CS["Year"] >= 2021]
df_cs_recent

In [22]:
df_stat_old = df_STAT[df_STAT["Year"] < 2021]

In [36]:
df_cs_old = df_CS[df_CS["Year"] < 2021]

### 🔬 Test Case Checkpoint 🔬

In [23]:
## == CHECKPOINT TEST CASES ==
# - This read-only cell contains test cases for your previous cell.
# - If this cell runs without any errors, you PASSED all test cases!
# - If this cell results in any errors, check your previous cell, make changes, and RE-RUN your code and then this cell.
assert( 'df_stat_recent' in vars() ), "Make certain to name the recent STAT courses df_stat_recent."
assert( 'df_cs_recent' in vars() ), "Make certain to name the recent CS courses df_cs_recent."
assert( 'df_stat_old' in vars() ), "Make certain to name the old STAT courses df_stat_old."
assert( 'df_cs_old' in vars() ), "Make certain to name the old CS courses df_cs_old."

assert( len(df_stat_recent[df_stat_recent.Year < 2021] ) == 0 ), "Make sure only years after and including 2021 are in the df_stat_recent DataFrame."
assert( len(df_cs_recent[df_cs_recent.Year < 2021] ) == 0 ), "Make sure only years after and including 2021 are in the df_cs_recent DataFrame."
assert( len(df_stat_old[df_stat_old.Year >= 2021] ) == 0 ), "Make sure only years before 2021 are in the df_stat_old DataFrame."
assert( len(df_cs_old[df_cs_old.Year >= 2021] ) == 0 ), "Make sure only years before 2021 are in the df_cs_old DataFrame."

assert( len(df[ df.index.isin(df_stat_recent.index) & df.index.isin(df_stat_old.index) ]) == 0 ), "Check for duplicate values in your df_stat_recent and df_stat_old DataFrames."
assert( len(df[ df.index.isin(df_cs_recent.index) & df.index.isin(df_cs_old.index) ]) == 0 ), "Check for duplicate values in your df_cs_recent and df_cs_old DataFrames."
assert( len(df_cs_old) + len(df_cs_recent) == len(df_CS) ), "You're excluding some rows from df_cs_old or df_cs_recent. Please double check your conditionals."
assert( len(df_stat_old) + len(df_stat_recent) == len(df_STAT) ), "You're excluding some rows from df_stat_old or df_stat_recent. Please double check your conditionals."

## == SUCCESS MESSAGE ==
# You will only see this message (with the emoji showing) if you passed all test cases:
tada = "\N{PARTY POPPER}"
print(f"{tada} All tests passed! {tada}")

🎉 All tests passed! 🎉


### Puzzle 1.5: New Analysis
Now that we've got all the DataFrames set up with GPA data of the `CS` and `STAT` courses separated by recency (2021 or newer being 'recent'), we can do more in-depth analysis to investigate our question.

In the following code cells, **calculate the percentages** described by the comment in the cell. Your answer should always be a **decimal between 0 and 1**.

Just like Puzzle 1.3, remember:
- To find the % of As, divide the total number of As by the total number of students.
- You will need the `A_Grades` column you made earlier and the `Students` column, which contains the total number of students in each course.
- The syntax `sum(df['column_name'])` can be used to add up the values of all rows in a particular column of a DataFrame.

In [24]:
# Percentage of As received in CS in recent years
cs_recent_a = sum(df_cs_recent['A_Grades'] ) / sum(df_cs_recent["Students"])
print(f'Percentage of As received in CS in recent years: {cs_recent_a}')

Percentage of As received in CS in recent years: 0.7130059975764269


In [30]:
# Percentage of As received in STAT in recent years
stat_recent_a = sum(df_stat_recent["A_Grades"] ) / sum(df_stat_recent["Students"])
print(f'Percentage of As received in STAT in recent years: {stat_recent_a}')

Percentage of As received in STAT in recent years: 0.6505842595985506


In [37]:
# percentage of As received in CS in older years
cs_old_a = sum(df_cs_old["A_Grades"] ) / sum(df_cs_old["Students"])
print(f'Percentage of As received in CS in older years: {cs_old_a}')

Percentage of As received in CS in older years: 0.5714531786360225


In [38]:
# percentage of As received in STAT in older years
stat_old_a = sum(df_stat_old["A_Grades"] ) / sum(df_stat_old["Students"])
print(f'Percentage of As received in STAT in older years: {stat_old_a}')

Percentage of As received in STAT in older years: 0.5975443876015648


### 🔬 Test Case Checkpoint 🔬

In [28]:
## == CHECKPOINT TEST CASES ==
# - This read-only cell contains test cases for your previous cell.
# - If this cell runs without any errors, you PASSED all test cases!
# - If this cell results in any errors, check your previous cell, make changes, and RE-RUN your code and then this cell.
import math  # imported here so this cell runs even if the Puzzle 1.3 checkpoint was skipped
assert(math.isclose(cs_recent_a,  0.7130059975764269)), "The overall percentage of A grades received in CS courses recently does not appear to have been correctly calculated."
assert(math.isclose(stat_recent_a, 0.6505842595985506)), "The overall percentage of A grades received in STAT courses recently does not appear to have been correctly calculated."

assert(math.isclose(cs_old_a, 0.5714531786360225)), "The overall percentage of A grades received in CS courses in older years does not appear to have been correctly calculated."
assert(math.isclose(stat_old_a, 0.5975443876015648)), "The overall percentage of A grades received in STAT courses in older years does not appear to have been correctly calculated."

## == SUCCESS MESSAGE ==
# You will only see this message (with the emoji showing) if you passed all test cases:
tada = "\N{PARTY POPPER}"
print(f"{tada} All tests passed! {tada}")

### Observe the Results
**Run the following cell** to format all of your answers as a DataFrame.

Keep in mind that "Older" means data from courses **before 2021**, and "Recent" means courses held **during or after 2021**.

In [None]:
pd.DataFrame([
  {'Older % of A': cs_old_a, 'Recent % of A': cs_recent_a, 'Overall % of A': cs_a},
  {'Older % of A': stat_old_a, 'Recent % of A': stat_recent_a, 'Overall % of A': stat_a}
], index=['CS', 'STAT'])

Notice that when observing the "Overall % of A" grades received:
- It appears that `CS` and `STAT` are about **equally difficult** to get an A in (61.197% vs. 60.965%).
- However, if you look at the subgroup of courses held in **2021 and later**, `CS` has a clearly higher rate of A grades (about 71.3% vs. 65.1%), while in the subgroup of courses held **before 2021**, `STAT` has the higher rate (about 59.8% vs. 57.1%).

This is **Simpson's Paradox**: a pattern within a population can appear, disappear, or reverse when you look at subpopulations.

In more formal terms, Simpson's Paradox can cause you to observe a pattern reverse when you look at the overall group statistics versus statistics of groups post-stratification. In this case we are stratifying by time.
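
One way to see how this can happen: because the older and recent DataFrames split each subject's rows with no overlap, each overall rate is a weighted average of the older and recent rates, where the weight is the share of that subject's students enrolled in recent years. The sketch below back-solves that weight from the six percentages printed in Puzzles 1.3 and 1.5. The helper name `implied_recent_share` is hypothetical, and the weights are implied by the rates rather than read from the dataset.

```python
# Rates copied from the outputs of Puzzle 1.3 and Puzzle 1.5.
rates = {
    "CS":   {"old": 0.5714531786360225, "recent": 0.7130059975764269, "overall": 0.6119729348794934},
    "STAT": {"old": 0.5975443876015648, "recent": 0.6505842595985506, "overall": 0.609647329889628},
}

def implied_recent_share(old, recent, overall):
    """Solve overall = w * recent + (1 - w) * old for w (hypothetical helper)."""
    return (overall - old) / (recent - old)

shares = {}
for subject, r in rates.items():
    w = implied_recent_share(r["old"], r["recent"], r["overall"])
    shares[subject] = w
    # Rebuild the overall rate from the two subgroup rates and the weight.
    rebuilt = w * r["recent"] + (1 - w) * r["old"]
    print(f"{subject}: implied share of students in recent years = {w:.3f}, rebuilt overall = {rebuilt:.4f}")
```

Under this identity, the two subjects place different weights on their recent years, which is how the overall rates can land so close together even though the subgroup rates differ in opposite directions.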


### Analysis: Reflecting on New Observations

Now think about how you would respond differently to the incoming student's question:
- *Is it easier to get an A in STAT or CS courses at UIUC?*

**Q: Which comparison of percentages do you trust more and why? Are there any other potential confounding variables when answering this question that could be investigated further? Respond with at least three full sentences.**

*(✏️ Edit this cell to replace this text with your answer. ✏️)*

<hr style="color: #DD3403;">

# Part 2: Revisiting the Hello Dataset

Enough about GPA (for now). Two weeks ago, you created a series of questions that made up the **Hello Dataset** and completed the survey by answering all of the questions made by students in DISCOVERY.

Now, we will load this dataset again and briefly answer a few questions with data about YOU!

## Load the Hello Dataset

The "Hello Dataset" is available here:
```
https://waf.cs.illinois.edu/discovery/hello-sp24.csv
```

Use Python to load this dataset into a DataFrame called `df_hello`:

In [None]:
df_hello = ...
df_hello

### 🔬 Test Case Checkpoint 🔬

In [None]:
## == CHECKPOINT TEST CASES ==
# - This read-only cell contains test cases for your previous cell.
# - If this cell runs without any errors, you PASSED all test cases!
# - If this cell results in any errors, check your previous cell, make changes, and RE-RUN your code and then this cell.
assert(len(df_hello) >= 600 and len(df_hello) < 700), "This is not the Hello Dataset you're looking for. Check the URL."

## == SUCCESS MESSAGE ==
# You will only see this message (with the emoji showing) if you passed all test cases:
tada = "\N{PARTY POPPER}"
print(f"{tada} All tests passed! {tada}")

## Happiness vs. Battery Life

There is a narrative that "Gen Zs are only happy when they're on their phone" -- let's explore the relationship between your happiness and your cell phone's battery life!

With the Hello Dataset, there are two questions with numeric responses that can help us explore this:
- From 1 to 10, how happy are you right now?
- What is your phone battery percentage currently at?

### Puzzle 2.1: Observation Subsets

In this situation, let's define three levels of happiness:
- You're **very happy** if you answered "how happy are you right now?" with a score of 9 or higher,
- You're **unhappy** if you answered "how happy are you right now?" with a score of less than 3,
- Otherwise, you're just **happy**.

From this, create three DataFrames that contain subsets of the Hello Dataset:
- `df_very_happy`: including everyone who is "very happy" ($\space \geq 9 \space$)
- `df_happy`: including everyone who is "happy" ($\space <9 $ and $\geq 3 \space$)
- `df_unhappy`: including everyone who is "unhappy" ($\space < 3 \space$)

The `happiness_question` question has been provided to you as a **string** for ease of DataFrame column access.
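
The "in between" group needs two conditions at once. As a hedged sketch of that pattern, the example below uses a made-up column called `score` with invented values; it is not the Hello Dataset.

```python
import pandas as pd

# Hypothetical survey with a made-up column name and made-up scores.
score_question = "score"
toy = pd.DataFrame({score_question: [1, 2, 3, 5, 8, 9, 10]})

# Each comparison yields a boolean Series; wrap each one in parentheses
# and join them with & (element-wise "and"), not the Python keyword `and`.
high = toy[toy[score_question] >= 9]
middle = toy[(toy[score_question] >= 3) & (toy[score_question] < 9)]
low = toy[toy[score_question] < 3]

# The three subsets should partition the rows: no overlaps and no gaps.
print(len(high), len(middle), len(low), len(toy))
```

Checking that the three lengths add up to the length of the full table is a quick way to catch a boundary value that was left out or counted twice.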

In [None]:
happiness_question = 'From 1 to 10, how happy are you right now?'

df_very_happy = ...
df_very_happy

In [None]:
df_happy = ...
df_happy

In [None]:
df_unhappy = ...
df_unhappy

### 🔬 Test Case Checkpoint 🔬

In [None]:
## == CHECKPOINT TEST CASES ==
# - This read-only cell contains test cases for your previous cell.
# - If this cell runs without any errors, you PASSED all test cases!
# - If this cell results in any errors, check your previous cell, make changes, and RE-RUN your code and then this cell.
assert(len(df_very_happy) == 54 ), "Double check your conditional used to create df_very_happy."
assert(len(df_happy) == 519 ), "Double check your conditional used to create df_happy."
assert(len(df_unhappy) == 36 ), "Double check your conditional used to create df_unhappy."

## == SUCCESS MESSAGE ==
# You will only see this message (with the emoji showing) if you passed all test cases:
tada = "\N{PARTY POPPER}"
print(f"{tada} All tests passed! {tada}")

### Puzzle 2.2: Average Phone Battery Percentage by Group

Now find the **average phone battery percentage** of each group (`df_very_happy`, `df_happy`, and `df_unhappy`):

- The `df['column name'].mean()` function returns the mean of all values in the specified column of `df`.
- The `phone_battery_question` question has been provided to you as a **string** for ease of DataFrame column access.
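
For reference, here is a small sketch of `.mean()` on a subset, alongside a `groupby` version that computes the same statistic for every group at once. The `group` and `battery` columns and their values are invented for illustration.

```python
import pandas as pd

# Hypothetical data; the column names here are invented for illustration.
toy = pd.DataFrame({
    "group": ["a", "a", "b", "b", "b"],
    "battery": [40, 60, 20, 50, 95],
})

# Mean of one column within a single subset.
subset_a = toy[toy["group"] == "a"]
mean_a = subset_a["battery"].mean()

# The same statistic for every group at once, using groupby.
means_by_group = toy.groupby("group")["battery"].mean()

print(mean_a)
print(means_by_group)
```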

In [None]:
phone_battery_question = 'What is your phone battery percentage currently at?'

very_happy_people_avg_phone_battery = ...
very_happy_people_avg_phone_battery

In [None]:
happy_people_avg_phone_battery = ...
happy_people_avg_phone_battery

In [None]:
unhappy_people_avg_phone_battery = ...
unhappy_people_avg_phone_battery

### 🔬 Test Case Checkpoint 🔬

In [None]:
## == CHECKPOINT TEST CASES ==
# - This read-only cell contains test cases for your previous cell.
# - If this cell runs without any errors, you PASSED all test cases!
# - If this cell results in any errors, check your previous cell, make changes, and RE-RUN your code and then this cell.
assert(math.isclose(very_happy_people_avg_phone_battery,  56.96296296296296)), "The average phone battery percentage for very happy people does not appear to have been correctly calculated."
assert(math.isclose(happy_people_avg_phone_battery,  59.335907335907336)), "The average phone battery percentage for happy people does not appear to have been correctly calculated."
assert(math.isclose(unhappy_people_avg_phone_battery,  56.02777777777778)), "The average phone battery percentage for unhappy people does not appear to have been correctly calculated."

## == SUCCESS MESSAGE ==
# You will only see this message (with the emoji showing) if you passed all test cases:
tada = "\N{PARTY POPPER}"
print(f"{tada} All tests passed! {tada}")

### Analysis: Happiness vs. Phone Battery

**Q: What is the relationship between happiness and phone battery?  Can you think of a possible *confounding variable* in the observed relationship (or lack thereof) between happiness and phone battery?**  Write at least three complete sentences that include:
- Your analysis of the relationship between the two variables,
- Your third, possibly confounding, variable that might be impacting the results, **AND**
- Your analysis on why this third variable could be a possible confounder.
- Any additional analysis on other variables or possible confounders.

*(✏️ Edit this cell to replace this text with your answer. ✏️)*