
# Welcome to lab_populations! 👥🌎🌍🌏👥

In lecture, you have been learning about sampling and inference: the idea that we can calculate statistics from a random sample of a population and use those statistics to estimate what we would find if we asked every single person in the population the same question.

The goal of this lab is to gain a more intuitive understanding of what inference is. We will explore sampling from a population and the meaning behind confidence intervals, error, and the Central Limit Theorem (CLT).

<hr>

A few tips to remember:

- **You are not alone on your journey in learning programming!**
- If you find yourself stuck for more than a few minutes, ask a neighbor or course staff for help!  When you are giving help to your neighbor, explain the **idea and approach** to the problem without sharing the answer itself so they can have the same **<i>ah-hah</i>** moment!
- We are here to help you!  Don't feel embarrassed or shy to ask us for help!

Let's get started!

<hr style="color: #DD3403;">

## Part 1: Sampling the Population

The `DISCOVERY_populations` library is included with this lab and contains a **very large** population (over 100,000 students) of current and former University of Illinois students.  We have simulated their answers to three questions:

1. Do you support the Kingfisher as the new Illinois mascot?
2. Do you follow @datascienceduo on Instagram?
3. Are you a Data Science major?

Right now, **we do NOT know the answers for the entire population and there is NO WAY to ask everyone**. Instead, we can only ask a sample of students and get answers for that sample. Run the following code to import the `DISCOVERY_populations` library and retrieve the sample:

In [14]:
import DISCOVERY_populations
sample = DISCOVERY_populations.getSample()
sample.head()

| Unnamed: 0 | DSmajor | FollowsDuo | ProKingfisher |
|---|---|---|---|
| 161332 | 1 | 1 | 1 |
| 40258 | 0 | 0 | 1 |
| 105828 | 1 | 1 | 1 |
| 37930 | 1 | 1 | 1 |
| 78486 | 0 | 0 | 0 |


### Puzzle 1.1: Statistics about the Sample

You have received a **random sample** from the population and it looks like it has three columns: `DSmajor`, `FollowsDuo`, and `ProKingfisher`. Using the `len` function, create a variable `n` that stores the number of people in your sample:

In [15]:
n = len(sample)
n

52

We'll first focus on people who follow @datascienceduo -- the people who follow the DUO are coded with a `1` in the sample and the people who do not follow the DUO are coded with a `0`.  

In your sample, how many people follow the DUO?

In [19]:
followers = len(sample[sample.FollowsDuo == 1])
followers

27

In [20]:
## == CHECKPOINT TEST CASES ==
# - This read-only cell contains test cases for your previous cell.
# - If this cell runs without any errors, you PASSED all test cases!
# - If this cell results in any errors, check your previous cell, make changes, and RE-RUN your code and then this cell.
assert("sample" in vars()), "Check to make sure you have the variable `sample`."
assert(len(sample) == n), "Check to make sure `n` stores the number of observations in your sample."
assert(followers == sum(sample.FollowsDuo)), "Check to make sure `followers` stores the number of people following @datascienceduo."

## == SUCCESS MESSAGE ==
# You will only see this message (with the emoji showing) if you passed all test cases:
tada = "\N{PARTY POPPER}"
print(f"{tada} All tests passed! {tada}")

🎉 All tests passed! 🎉


### Puzzle 1.2: Finding the 95% Confidence Interval for the Percentage of DUO followers

We want to estimate what percentage of the population follows @datascienceduo. To do that, we need to use the confidence interval formula you learned in lecture.

$$ CI = {SamplePercent} \pm ({z} \times {SampleStandardError})$$

Let's work on finding all three of the components we need: `samplePercent`, `z`, and `sampleSE`. For this entire puzzle, make sure your percentages (samplePercent and sampleSE) are in percent form and not decimal form. In other words, they should be numbers between 0% and 100%.


#### Puzzle 1.2(a): Finding `samplePercent`

Using the `FollowsDuo` column, store the **percentage of the sample that follows the DUO** in the variable `samplePercent`:

*Note: Because the `FollowsDuo` column is encoded so that `0` is a non-follower and `1` is a follower, the mean of the column is a proportion (a decimal between 0 and 1). We want a **percentage**, so make sure to convert your answer to a value between 0 and 100.*
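
To see why the mean of a 0/1 column is a proportion, here is a minimal sketch on a tiny made-up column. The data and the names `demo`, `proportion`, and `percent` are assumptions for illustration only and are not part of the lab's sample.

```python
import pandas as pd

# Hypothetical 0/1 column (NOT the lab's data): 3 followers out of 8 people.
demo = pd.DataFrame({"FollowsDuo": [1, 0, 0, 1, 0, 1, 0, 0]})

# Summing a 0/1 column counts the 1s, so the mean is (number of 1s) / (number of rows).
proportion = demo.FollowsDuo.mean()   # a decimal between 0 and 1
percent = proportion * 100            # the same quantity on a 0-100 scale

print(proportion, percent)
```

Multiplying by 100 only changes the scale, so the percentage and the proportion always describe the same share of the sample.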

In [23]:
samplePercent = sample.FollowsDuo.mean() * 100
samplePercent

51.92307692307693

In [24]:
## == CHECKPOINT TEST CASES ==
# - This read-only cell contains test cases for your previous cell.
# - If this cell runs without any errors, you PASSED all test cases!
# - If this cell results in any errors, check your previous cell, make changes, and RE-RUN your code and then this cell.
import math
F = sample.FollowsDuo
assert(math.isclose(samplePercent, F.sum()/n*100)), "Check your `samplePercent`."

## == SUCCESS MESSAGE ==
# You will only see this message (with the emoji showing) if you passed all test cases:
tada = "\N{PARTY POPPER}"
print(f"{tada} All tests passed! {tada}")

🎉 All tests passed! 🎉


#### Puzzle 1.2(b): Finding `z`

We want a range that we are 95% confident contains the true percentage of people who follow the DUO. Find the z-score we need to use to create a 95% CI:

*Hint: Because the sample size is greater than 30 and the sample was randomly selected from the population, we can use the standard normal curve to find the z-score when creating our 95% CI.*

In [26]:
from scipy.stats import norm
# A 95% CI leaves 2.5% in each tail, so we need the 97.5th percentile.
z = norm.ppf(0.95 + 0.025)
z

1.959963984540054
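
The `0.95 + 0.025` argument comes from splitting the leftover 5% evenly between the two tails of the normal curve. The sketch below wraps that reasoning in a small helper, `z_for_confidence`, which is a hypothetical name introduced here for illustration and is not part of the lab or of `scipy`.

```python
from scipy.stats import norm

def z_for_confidence(level):
    """Two-sided z-score for a confidence level given as a decimal (e.g. 0.95)."""
    tail = (1 - level) / 2       # area left over in EACH tail
    return norm.ppf(1 - tail)    # for 0.95 this is norm.ppf(0.975)

for level in (0.90, 0.95, 0.99):
    print(f"{level:.0%} confidence -> z = {z_for_confidence(level):.3f}")
```

A higher confidence level pushes the cutoff further into the tail, which gives a larger z-score and therefore a wider interval.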

In [None]:
## == CHECKPOINT TEST CASES ==
# - This read-only cell contains test cases for your previous cell.
# - If this cell runs without any errors, you PASSED all test cases!
# - If this cell results in any errors, check your previous cell, make changes, and RE-RUN your code and then this cell.
assert(math.isclose(abs(z) + abs(z)**abs(z), 5.69931068079139)), "Check your `z`."

## == SUCCESS MESSAGE ==
# You will only see this message (with the emoji showing) if you passed all test cases:
tada = "\N{PARTY POPPER}"
print(f"{tada} All tests passed! {tada}")


#### Puzzle 1.2(c): Finding `sampleSE`

Finally, we need to find the standard error of the sample as a **percentage**.

Remember: $SE_{\%} = \frac{SD}{\sqrt{n}} \times 100$, where $SE$ is standard error, $SD$ is standard deviation, and $n$ is the sample size.
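
As a rough illustration of the formula, the sketch below computes the standard error for a made-up 0/1 column. The data and variable names are assumptions and are not the lab's sample. It also compares the pandas sample SD (which divides by `n - 1` by default) with the shortcut `sqrt(p * (1 - p))` that holds for 0/1 data when dividing by `n`.

```python
import math
import pandas as pd

# Hypothetical 0/1 column (NOT the lab's data): 30 ones and 70 zeros.
col = pd.Series([1] * 30 + [0] * 70)
n_demo = len(col)
p = col.mean()

sd_sample = col.std()                   # pandas default is ddof=1 (divides by n - 1)
sd_shortcut = math.sqrt(p * (1 - p))    # SD of 0/1 data when dividing by n

# Standard error as a percentage: SD / sqrt(n) * 100.
se_percent = sd_sample / math.sqrt(n_demo) * 100

print(round(sd_sample, 4), round(sd_shortcut, 4), round(se_percent, 2))
```

The two SD values differ only by a factor of `sqrt(n / (n - 1))`, which is close to 1 for samples of the size used in this lab.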

In [27]:
sampleSE = (sample["FollowsDuo"].std()/(n**0.5))*100
sampleSE

6.996219952971145

In [None]:
## == CHECKPOINT TEST CASES ==
# - This read-only cell contains test cases for your previous cell.
# - If this cell runs without any errors, you PASSED all test cases!
# - If this cell results in any errors, check your previous cell, make changes, and RE-RUN your code and then this cell.
assert(math.isclose(sampleSE, (n / F.var())**-0.5 * 100)), "Check your `sampleSE`."

## == SUCCESS MESSAGE ==
# You will only see this message (with the emoji showing) if you passed all test cases:
tada = "\N{PARTY POPPER}"
print(f"{tada} All tests passed! {tada}")

### Puzzle 1.3: Finding the Confidence Interval

The formula for the confidence interval has both a "lower bound" (when you subtract the margin of error from the sample percentage) and an "upper bound" (when you add the margin of error to the sample percentage). Recall the formula you learned in lecture:

$$ CI = {SamplePercent} \pm ({z} \times {SampleStandardError})$$

Using the variables you just calculated in the previous section, find the `lower_bound_CI` and `upper_bound_CI` of your confidence interval:

In [28]:
lower_bound_CI = samplePercent - (z * sampleSE)
lower_bound_CI

38.210737787332974

In [29]:
upper_bound_CI = samplePercent + (z*sampleSE)
upper_bound_CI

65.63541605882088

Putting it all together, run the following code that will write out your full confidence interval interpretation:

In [30]:
print(f"Based on the sample, we are 95% confident that the true percentage of followers of @datascienceduo in the full population is between:\n   {round(lower_bound_CI, 2)}% - {round(upper_bound_CI, 2)}%")

Based on the sample, we are 95% confident that the true percentage of followers of @datascienceduo in the full population is between:
   38.21% - 65.64%
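
One way to build intuition for what "95% confident" means is to simulate many samples from a population whose true percentage is known and count how often the interval contains it. The sketch below does that. The true percentage of 60, the sample size, the number of trials, and the seed are all assumptions chosen for illustration and say nothing about the real population in this lab.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)   # fixed seed so the sketch is repeatable
true_percent = 60.0              # ASSUMED population value, for illustration only
n_demo, trials = 52, 2000
z_demo = norm.ppf(0.975)

covered = 0
for _ in range(trials):
    # Simulate n_demo yes/no answers coded as 1/0.
    draws = (rng.random(n_demo) < true_percent / 100).astype(int)
    pct = draws.mean() * 100
    se = draws.std(ddof=1) / np.sqrt(n_demo) * 100
    # Does this sample's interval contain the true percentage?
    if pct - z_demo * se <= true_percent <= pct + z_demo * se:
        covered += 1

coverage = covered / trials
print(f"{coverage:.1%} of the simulated intervals contain the true percentage")
```

Each simulated sample produces a different interval, and roughly 95% of them should contain the true value. That is why two people with different random samples can both have a valid interval.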


### Reflections

**Q1**: Talk to your group members and share your confidence intervals.
- (a): What is the confidence interval of another group member's sample?
- (b): Is it the same or different?
- (c): Why is it okay that it's the same or different?

*(✏️ Edit this cell to replace this text with your answers. ✏️)*

**Q2**: What is the margin of error (the numeric value) of your confidence interval?  *You may need to do a bit of calculation.*

*(✏️ Edit this cell to replace this text with your answers. ✏️)*

**Q3**: If the complete population is 1,000,000 people, **at least how many people** can we be 95% confident are following the DUO?  First, in English (not math equations), explain in at least one sentence how you will calculate this result.  Then, calculate it and include your answer below.

*(✏️ Edit this cell to replace this text with your answers. ✏️)*

**Q4**: Finally, given your confidence interval, can we accurately predict whether at least 50% of the population follows @datascienceduo?

*(✏️ Edit this cell to replace this text with your answers. ✏️)*

<hr style="color: #DD3403;">

## Part 2: Towards a Smaller Margin of Error

The number of followers of @datascienceduo is fun, but the large margin of error you had is a little alarming.  For really important issues, we want a smaller margin of error in our sample.

**Q5**: What are at least **TWO** things we can do as data scientists to reduce the margin of error?

*(✏️ Edit this cell to replace this text with your answers. ✏️)*

### Part 2.1: An Expensive Sample

Taking a large sample requires surveying more people and getting more responses, which is almost always more expensive.  In the `DISCOVERY_populations` library you imported in Part 1, we have a second function: `getExpensiveSample()`.
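
To get a feel for how much a larger sample helps, here is a hedged sketch that tabulates the approximate margin of error for a yes/no question at several sample sizes. The helper `margin_of_error` is a hypothetical name, the proportion of 0.5 is an assumed value, and `sqrt(p * (1 - p))` is used as an approximation of the SD.

```python
import math
from scipy.stats import norm

def margin_of_error(p, n, level=0.95):
    """Approximate margin of error, in percentage points, for a 0/1 question.

    p is an ASSUMED proportion of 1s; sqrt(p * (1 - p)) approximates the SD.
    """
    z = norm.ppf(1 - (1 - level) / 2)
    return z * math.sqrt(p * (1 - p) / n) * 100

for n_demo in (50, 200, 800, 3200):
    print(n_demo, round(margin_of_error(0.5, n_demo), 2))
```

Because `n` sits under a square root, quadrupling the sample size only halves the margin of error, which is part of why precise polls are expensive.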

The following code gets a larger and more expensive sample and stores it in `sample2`:

In [31]:
sample2 = DISCOVERY_populations.getExpensiveSample()
sample2

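| Unnamed: 0 | DSmajor | FollowsDuo | ProKingfisher |
|---|---|---|---|
| 94526 | 1 | 1 | 1 |
| 94213 | 1 | 1 | 1 |
| 48609 | 1 | 1 | 1 |
| 20982 | 0 | 0 | 0 |
| 147652 | 0 | 0 | 1 |
| ... | ... | ... | ... |
| 42995 | 0 | 0 | 0 |
| 80334 | 1 | 1 | 1 |
| 166399 | 0 | 0 | 0 |
| 112764 | 0 | 0 | 1 |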


### Part 2.2: Finding the Confidence Interval for Kingfisher Support

Find the lower and upper bounds for the 95% CI for the support of the Kingfisher mascot, storing them in `kingfisher_CI_lower` and `kingfisher_CI_upper`.  We provided individual cells for each stage of the computation, and you should make sure your answer is reasonable at each step. We also want your answers as percentages between 0 and 100 percent.

Make sure you're using `sample2` since you have the better, more expensive sample now! :)

In [42]:
# Step 1: Find the samplePercent:
samplePercent2 = sample2.ProKingfisher.mean() * 100

In [34]:
# Step 2: Find the z-score for the 95% CI:
from scipy.stats import norm
z = norm.ppf(0.95 + 0.025)
z

1.959963984540054

In [37]:
# Step 3: Find the sampleSE:
n2 = len(sample2)

sampleSE2 = (sample2["ProKingfisher"].std() / (n2**0.5)) * 100

sampleSE2

1.4265160619173862

In [40]:
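# Step 4: Find the margin of error:
from scipy.stats import norm
z2 = norm.ppf(0.95 + 0.025)

# Note: without the `* 100`, this result is a proportion (0 to 1), not a percentage.
# The margin of error in percentage points is z2 * sampleSE2.
(sample2["ProKingfisher"].std() / (n2**0.5)) * z2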

0.02795920104725987

In [43]:
# Find the lower bound of the CI:
kingfisher_CI_lower = samplePercent2 - (z2 * sampleSE2)
kingfisher_CI_lower

68.00054007226515

In [44]:
# Find the upper bound of the CI:
kingfisher_CI_upper = samplePercent2 + (z2*sampleSE2)
kingfisher_CI_upper

73.59238028171714

In [45]:
## == CHECKPOINT TEST CASES ==
# - This read-only cell contains test cases for your previous cell.
# - If this cell runs without any errors, you PASSED all test cases!
# - If this cell results in any errors, check your previous cell, make changes, and RE-RUN your code and then this cell.
import math
from scipy.stats import norm

F = sample2.ProKingfisher
N = norm(F.mean(), F.std() / (len(F)**0.5))
low, high = N.interval(0.95)
assert(kingfisher_CI_upper > kingfisher_CI_lower), "The upper bound must be larger than the lower bound."
assert( math.isclose(kingfisher_CI_lower, low * 100) ), "Check your `kingfisher_CI_lower` calculation"
assert( math.isclose(kingfisher_CI_upper, high * 100) ), "Check your `kingfisher_CI_upper` calculation"

## == SUCCESS MESSAGE ==
# You will only see this message (with the emoji showing) if you passed all test cases:
tada = "\N{PARTY POPPER}"
print(f"{tada} All tests passed! {tada}")

🎉 All tests passed! 🎉
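
The checkpoint cell builds its reference interval with a frozen `scipy.stats.norm` distribution rather than the `± z × SE` formula. The sketch below suggests that the two routes agree, using made-up numbers. The values of `center` and `se` are assumptions and are not taken from `sample2`.

```python
import math
from scipy.stats import norm

# Hypothetical numbers (NOT from the lab): a sample proportion and its standard error.
center, se = 0.70, 0.015

# Route 1: a normal distribution centered on the sample proportion, as in the checkpoint cell.
low, high = norm(center, se).interval(0.95)

# Route 2: the lecture formula, center +/- z * SE.
z_demo = norm.ppf(0.975)
manual_low, manual_high = center - z_demo * se, center + z_demo * se

print(math.isclose(low, manual_low), math.isclose(high, manual_high))
```

Both routes work on the proportion scale here, which is why the checkpoint multiplies `low` and `high` by 100 before comparing them with your percentage bounds.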


### Part 2.3: Reflections

**Q6**: Write out the interpretation of your confidence interval in a complete sentence.

*(✏️ Edit this cell to replace this text with your answers. ✏️)*

**Q7**: If the whole population voted on whether the next mascot should be the Kingfisher, how confident are you that the resolution will pass (that is, receive at least 50% of the vote)? Explain in at least one complete sentence how the data analysis you did backs up your confidence.

*(✏️ Edit this cell to replace this text with your answers. ✏️)*

**Q8**: Is the confidence interval of your larger (and more expensive) sample wider or narrower than the confidence interval of the first sample?  Explain in at least one complete sentence.

*(✏️ Edit this cell to replace this text with your answers. ✏️)*

<hr style="color: #DD3403;">

## Part 3: The Election is Here!

The polling is complete and the election day is here!  Run the following code to find your election-day results:

In [46]:
DISCOVERY_populations.electionDay()

The election was held and 20% of the population voted.

== Kingfisher Support ==
SUPPORT KINGFISHER: 28001 70.0%
OPPOSE KINGFISHER : 11999 30.0%

== Follows @datascienceduo ==
FOLLOWS DUO    : 20810 52.02%
DOES NOT FOLLOW: 19190 47.98%


**Q9**: In at least one complete sentence, explain whether your analysis of the samples accurately predicted the outcomes.

*(✏️ Edit this cell to replace this text with your answers. ✏️)*

<hr style="color: #DD3403;">