# **U.S. Health Insurance - Project 3**
## Inference for the Population Proportion

# **Importing Necessary Python Modules**

Python supports a variety of open-source add-ins called modules that add extra features to the basic setup. Each module's name follows the `import` or `from` keyword, and its purpose is given in a comment after the hash symbol (#).




In [2]:
import pandas as pd                 #Data analysis
import numpy as np                  #Calculations
from IPython.display import Image   #Display images
from scipy.stats import norm        #Confidence Interval

In [3]:
# Assigns the URL of the image to display to the name 'image_url'.
image_url = 'https://blog.amopportunities.org/wp-content/uploads/2019/07/Health-Insurance.jpg'

# Display the image
Image(url=image_url, width = 575)

# **Context**

This dataset supports a simple yet illuminating study of risk underwriting in health insurance: how the attributes of the insured interact and how they affect the insurance premium.


# **About the Dataset**

This dataset contains 1338 rows of insured data, where the insurance charges are given against the following attributes of the insured: Age, Sex, BMI, Number of Children, Smoker, and Region. There are no missing or undefined values in the dataset.

Body mass index (BMI) indicates whether body weight is relatively high or low for a given height. It is an objective index of body weight, computed as weight divided by the square of height (kg/m<sup>2</sup>); a value from 18.5 to 24.9 is considered ideal.
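To make the BMI definition concrete, here is a minimal sketch of the calculation in the kg/m<sup>2</sup> unit listed in the variable table. The function name `bmi` and the example person are hypothetical and are not taken from the dataset.

```python
def bmi(weight_kg, height_m):
    """Body mass index in kg/m^2: weight divided by height squared."""
    return weight_kg / height_m ** 2

# Hypothetical person (illustrative values only, not from the dataset).
example_bmi = bmi(70, 1.75)
print(f"BMI = {example_bmi:.1f} kg/m^2")
```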

| **Variable**| **Description**                                      |
|:------------|:-----------------------------------------------------|
| AGE         | The age of the primary beneficiary   |
| SEX         | Male or Female                       |
| BMI         | Body Mass Index (kg/m<sup>2</sup>)    |
| Number of children | Number of dependents covered by the insurance |
| Smoker      | Yes or No                                    |
| Region      | The beneficiary's residential area in the U.S.<br>northeast, southeast, southwest, and northwest                            |
| Charges     | Individual medical costs billed by health insurance  |



Let's take a look at the data. To do this, first we import it directly from the url below.



# **A Snippet of the Data**

In [5]:
# Assigns the URL where the data file is stored to 'url'.
url='https://raw.githubusercontent.com/thamilton562/STAT108_Projects_Students/main/DataSets/US%20Health%20Insurance.csv'

# Reads in the CSV data file and assigns it to the DataFrame 'df'.
df=pd.read_csv(url)

Next, we can display the data by *typing the name* of the DataFrame. To ensure we can see all columns, we'll use the *pd.set_option* method.

In [6]:
# Set display options to show all columns
pd.set_option('display.max_columns', None)

# When you type the object name, the object gets printed.
df

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
0,19,female,27.900,0,yes,southwest,16884.92400
1,18,male,33.770,1,no,southeast,1725.55230
2,28,male,33.000,3,no,southeast,4449.46200
3,33,male,22.705,0,no,northwest,21984.47061
4,32,male,28.880,0,no,northwest,3866.85520
...,...,...,...,...,...,...,...
1333,50,male,30.970,3,no,northwest,10600.54830
1334,18,female,31.920,0,no,northeast,2205.98080
1335,18,female,36.850,0,no,southeast,1629.83350
1336,21,female,25.800,0,no,southwest,2007.94500


# **INSTRUCTIONS**

* Use Python to analyze the data set and complete each of the following.
* Replace each ellipsis (...) with the relevant names or code.
* For problems that require a written response, replace the ellipsis (...) by double clicking the text box to start typing.
* Refer to the tutorial from the activity for assistance.
* If you still need help:
 * Watch the video.
 * Attend office hours.

# **The variable to analyze**
You will analyze a category of a qualitative variable. Based on the first initial of your LAST name, analyze the category of the variable listed in the table. Use this category for the entire project.

| Last Name | Variable = Category |
|-----------|-------------------------------|
| A-L       | Smoker = yes  |
| M-Z       | Region = southwest  |

In [12]:
# Print all the category names.
# Use this list to ensure correct spelling of your category.

# Use the following code for students.
#print("... category names")                #Replace ... with the variable name written out
#print("--------------------------------")
#freq_table = pd.Series(df['...']).value_counts()        #Replace ... with the variable name
#print(freq_table)

#---------------------------------------------------
#A-L
print("Smoker category counts")
print("--------------------------------")
freq_table = pd.Series(df['smoker']).value_counts()
print(freq_table)

#---------------------------------------------------
#M-Z

print("\n================================\n")
print("Region category counts")
print("--------------------------------")
freq_table = pd.Series(df['region']).value_counts()
print(freq_table)


Smoker category counts
--------------------------------
smoker
no     1064
yes     274
Name: count, dtype: int64


Region category counts
--------------------------------
region
southeast    364
southwest    325
northwest    325
northeast    324
Name: count, dtype: int64


# **QUESTION 1**
## Confidence Interval

**Last Names A-L:** Construct and interpret the 91% confidence interval for the population proportion of people who are smokers.

**Last Names M-Z:** Construct and interpret the 91% confidence interval for the population proportion of people who live in the southwest region.



**1.1) Parameter: Define the parameter, using correct notation.**

**A-L:** p is the population proportion of people who are smokers.

**M-Z:** p is the population proportion of people who live in the southwest.

**1.2) Method: Name the method you will use.**

1 sample z-interval for p

-OR-

1 proportion z-interval

**1.3) Assumptions:**

Complete the code below to find out how many people fall under the category assigned to you.

In [15]:
# Count total observations
n = len(df)

#Use this code for students
# Count total successes
# Replace the 1st ... with the variable name
# Replace the 2nd ... with the name of the category to be analyzed
#obs_count = df['...'].value_counts().get('...')

#print(f"{obs_count} out of {n} people ...") #Replace ... with a description of the category you are analyzing.

#---------------------------------
#A-L
obs_count_AL = df['smoker'].value_counts().get('yes')
print(f"{obs_count_AL} out of {n} people are smokers.")

#---------------------------------
#M-Z
obs_count_MZ = df['region'].value_counts().get('southwest')
print(f"{obs_count_MZ} out of {n} people live in the southwest.")

274 out of 1338 people are smokers.
325 out of 1338 people live in the southwest.


**Show that both assumptions are met.**

**A-L:**

1. We will assume the sample represents the population (representative sample).

2. actual successes: 1338(274/1338) = 274 > 10

   actual failures: 1338(1 - 274/1338) = 1064 > 10

   The sampling distribution of p̂ is approximately normal.

--------------------------------------------------------------
**M-Z:**

1. We will assume the sample represents the population (representative sample).

2. actual successes: 1338(325/1338) = 325 > 10

   actual failures: 1338(1 - 325/1338) = 1013 > 10

   The sampling distribution of p̂ is approximately normal.
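The success/failure check for the interval can also be done in code. The following is a minimal sketch under the assumptions of this notebook (n = 1338, with 274 smokers and 325 southwest residents, as counted earlier); the helper name `success_failure_check` is hypothetical.

```python
def success_failure_check(successes, n, threshold=10):
    """Return observed successes, observed failures, and whether both exceed the threshold."""
    failures = n - successes
    return successes, failures, (successes > threshold and failures > threshold)

n = 1338  # total observations reported in the notebook
for label, count in [("A-L (smokers)", 274), ("M-Z (southwest)", 325)]:
    s, f, ok = success_failure_check(count, n)
    print(f"{label}: successes = {s}, failures = {f}, condition met = {ok}")
```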

**1.4) Calculate: Complete the code below to calculate the sample proportion of people who fall under your assigned category, and the confidence interval.**

In [16]:
# Define the confidence level
# Replace the ... with the stated confidence level, as a decimal (ex: 0.83, not 83%)
#CL = ...

#Use this code for students
#Calculate the values needed; p-hat, critical value (CV), and standard error (se).
#p_hat = obs_count / n
#cv = norm.ppf((1+CL)/2)
#se = np.sqrt(p_hat * (1-p_hat) / n)

#Calculate the bounds of the interval
#ci_lower = (p_hat - cv * se)
#ci_upper = (p_hat + cv * se)

#print(f"p-hat = {obs_count}/{n} = {p_hat.round(5)}")
#print(f"The {CL*100}% CI is ({ci_lower.round(5)}, {ci_upper.round(5)})")

#---------------------------------
#A-L
#Calculate the values needed; p-hat, critical value (CV), and standard error (se).
CL = 0.91
p_hat_AL = obs_count_AL / n
cv = norm.ppf((1+CL)/2)
se_AL = np.sqrt(p_hat_AL * (1-p_hat_AL) / n)

#Calculate the bounds of the interval
ci_lower_AL = (p_hat_AL - cv * se_AL)
ci_upper_AL = (p_hat_AL + cv * se_AL)

print("A-L:")
print(f"p-hat = {obs_count_AL}/{n} = {p_hat_AL.round(5)}")
print(f"The {CL*100}% CI is ({ci_lower_AL.round(5)}, {ci_upper_AL.round(5)})")

#---------------------------------
#M-Z
#Calculate the values needed; p-hat, critical value (CV), and standard error (se).
p_hat_MZ = obs_count_MZ / n
cv = norm.ppf((1+CL)/2)
se_MZ = np.sqrt(p_hat_MZ * (1-p_hat_MZ) / n)

#Calculate the bounds of the interval
ci_lower_MZ = (p_hat_MZ - cv * se_MZ)
ci_upper_MZ = (p_hat_MZ + cv * se_MZ)

print("\nM-Z:")
print(f"p-hat = {obs_count_MZ}/{n} = {p_hat_MZ.round(5)}")
print(f"The {CL*100}% CI is ({ci_lower_MZ.round(5)}, {ci_upper_MZ.round(5)})")

A-L:
p-hat = 274/1338 = 0.20478
The 91.0% CI is (0.18608, 0.22349)

M-Z:
p-hat = 325/1338 = 0.2429
The 91.0% CI is (0.22302, 0.26278)
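As a cross-check on the output above, the same intervals can be reproduced with only the Python standard library. This is a sketch rather than the notebook's own code: the function name `proportion_z_interval` is hypothetical, and `statistics.NormalDist().inv_cdf` is used in place of `norm.ppf`.

```python
from math import sqrt
from statistics import NormalDist

def proportion_z_interval(successes, n, confidence):
    """One-proportion z-interval: p-hat +/- critical value * standard error."""
    p_hat = successes / n
    cv = NormalDist().inv_cdf((1 + confidence) / 2)  # two-sided critical value
    se = sqrt(p_hat * (1 - p_hat) / n)               # standard error based on p-hat
    return p_hat - cv * se, p_hat + cv * se

lower_AL, upper_AL = proportion_z_interval(274, 1338, 0.91)
lower_MZ, upper_MZ = proportion_z_interval(325, 1338, 0.91)
print(f"A-L: ({lower_AL:.5f}, {upper_AL:.5f})")
print(f"M-Z: ({lower_MZ:.5f}, {upper_MZ:.5f})")
```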


**1.5) Communicate Results: Interpret the confidence interval calculated in 1.4 above. Round to three (3) decimal places.**

**A-L:** I am 91% confident that the interval 0.186 to 0.223 captures the population proportion of people who are smokers.

**M-Z:** I am 91% confident that the interval 0.223 to 0.263 captures the population proportion of people who live in the southwest.

**1.6) Show work to calculate the margin of error. Then interpret the margin of error.**

**A-L:**
**Calculation:**

ME = (0.223 - 0.186)/2 = 0.0185

**Interpretation:**

I am 91% confident that the population proportion of people who are smokers differs from p̂ = 0.205 by at most 0.0185.

-------------------------------------------------------
**M-Z:**
**Calculation:**

ME = (0.263 - 0.223)/2 = 0.02

**Interpretation:**

I am 91% confident that the population proportion of people who live in the southwest differs from p̂ = 0.243 by at most 0.02.
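The margin of error is half the width of the interval, which equals the critical value times the standard error. The sketch below computes it directly from the counts; the function name `margin_of_error` is hypothetical. Because the hand calculations above use endpoints rounded to three decimal places, the values here may differ slightly in the later decimal places.

```python
from math import sqrt
from statistics import NormalDist

def margin_of_error(successes, n, confidence):
    """Margin of error for a one-proportion z-interval: critical value * standard error."""
    p_hat = successes / n
    cv = NormalDist().inv_cdf((1 + confidence) / 2)
    return cv * sqrt(p_hat * (1 - p_hat) / n)

me_AL = margin_of_error(274, 1338, 0.91)
me_MZ = margin_of_error(325, 1338, 0.91)
print(f"A-L: ME = {me_AL:.4f}")
print(f"M-Z: ME = {me_MZ:.4f}")
```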



# **QUESTION 2**

## **Hypothesis Test**

**A-L:** According to the Centers for Disease Control, 11.6% of U.S. adults smoke cigarettes. Is there convincing evidence that the population proportion of people who are smokers is different from 11.6% (0.116)? Use α=0.09. Write up the solution using the PMACC procedure.

**M-Z:** Is there convincing evidence that the population proportion of people who live in the southwest is different from 25% (0.25)? Use α=0.09. Write up the solution using the PMACC procedure.

**2.1) Parameter: Define the parameter, using correct notation.**

**A-L:** p is the population proportion of people who are smokers.

**M-Z:** p is the population proportion of people who live in the southwest.

**2.2) Method: Name the method you will use, and write the hypotheses.**

**Method name:**

1 sample z-test for p

-OR-

1 proportion z-test

**Hypotheses:**

|**A-L**      |...| **M-Z**      |
|-------------|---|--------------|
| H0: p = 0.116 |...| H0: p = 0.25 |
| H1: p ≠ 0.116 |...| H1: p ≠ 0.25 |

**2.3) Assumptions: Show that both assumptions are met. Do not round.**

**A-L**

1. We will assume the sample represents the population (representative sample).

2. expected successes: 1338(0.116) = 155.208 > 10

   expected failures: 1338(1 - 0.116) = 1182.792 > 10

   The sampling distribution of p̂ is approximately normal.

--------------------------------------------------------
**M-Z**

1. We will assume the sample represents the population (representative sample).

2. expected successes: 1338(0.25) = 334.5 > 10

   expected failures: 1338(1 - 0.25) = 1003.5 > 10

   The sampling distribution of p̂ is approximately normal.
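For the hypothesis test, the counts are the ones expected if H0 is true, so they use p0 rather than p̂. A minimal sketch of that arithmetic follows; the helper name `expected_counts` is hypothetical.

```python
def expected_counts(n, p0):
    """Expected successes and failures under H0: n*p0 and n*(1 - p0)."""
    return n * p0, n * (1 - p0)

n = 1338
for label, p0 in [("A-L", 0.116), ("M-Z", 0.25)]:
    exp_s, exp_f = expected_counts(n, p0)
    # Both expected counts must exceed 10 for the normal approximation.
    print(f"{label}: expected successes = {exp_s:.3f}, expected failures = {exp_f:.3f}, "
          f"condition met = {exp_s > 10 and exp_f > 10}")
```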

**2.4) Calculate: Complete the code below to calculate the values required.**

In [17]:
#Use this code for students
#Define p0, the value in H0.
#p_0 = ... #Replace ... with p0.

#Calculate the values needed; p-hat, and standard error (se).
#p_hat = obs_count / n
#se = np.sqrt(p_0 * (1-p_0) / n)

#Calculate the z-score of our p-hat, under the assumption H0 is true.
#z_score = (p_hat - p_0) / se

#Calculate the p-value for 1- and 2-sided tests
#p_value1 = (1 - norm.cdf(abs(z_score)))
#p_value2 = 2 * p_value1

#print(f"p-hat = {obs_count}/{n} = {p_hat.round(7)}")
#print(f"z-score = {z_score.round(7)}")
#print(f"1 sided p-value = {p_value1:.11f}")
#print(f"2 sided p-value = {p_value2:.11f}")

#-------------------------------
#A-L
#Define P0, the value in H0.
#Replace ... with p0, the value in the null hypothesis.
p_0 = 0.116     #p_0 = ...

#Calculate the values needed; p-hat, and standard error (se).
p_hat_AL = obs_count_AL / n
se_AL = np.sqrt(p_0 * (1-p_0) / n)

#Calculate the z-score of our p-hat, under the assumption H0 is true.
z_score_AL = (p_hat_AL - p_0) / se_AL

#Calculate the p-value for 1- and 2-sided tests
p_value1_AL = (1 - norm.cdf(abs(z_score_AL)))
p_value2_AL = 2 * p_value1_AL

print("A-L:")
print(f"p-hat = {obs_count_AL}/{n} = {p_hat_AL.round(7)}")
print(f"z-score = {z_score_AL.round(7)}")
print(f"1 sided p-value = {p_value1_AL:.11f}")
print(f"2 sided p-value = {p_value2_AL:.11f}")

#-------------------------------
#M-Z
#Define P0, the value in H0.
#Replace ... with p0, the value in the null hypothesis.
p_0 = 0.25     #p_0 = ...

#Calculate the values needed; p-hat, and standard error (se).
p_hat_MZ = obs_count_MZ / n
se_MZ = np.sqrt(p_0 * (1-p_0) / n)

#Calculate the z-score of our p-hat, under the assumption H0 is true.
z_score_MZ = (p_hat_MZ - p_0) / se_MZ

#Calculate the p-value for 1- and 2-sided tests
p_value1_MZ = (1 - norm.cdf(abs(z_score_MZ)))
p_value2_MZ = 2 * p_value1_MZ

print("\nM-Z")
print(f"p-hat = {obs_count_MZ}/{n} = {p_hat_MZ.round(7)}")
print(f"z-score = {z_score_MZ.round(7)}")
print(f"1 sided p-value = {p_value1_MZ:.11f}")
print(f"2 sided p-value = {p_value2_MZ:.11f}")

A-L:
p-hat = 274/1338 = 0.2047833
z-score = 10.1415424
1 sided p-value = 0.00000000000
2 sided p-value = 0.00000000000

M-Z
p-hat = 325/1338 = 0.2428999
z-score = -0.5997841
1 sided p-value = 0.27432508221
2 sided p-value = 0.54865016442


**2.5) Communicate Results: What conclusion is made about the null hypothesis? And what does that mean about the alternate hypothesis?**

**A-L:**
Because p-value = 0.000 < α = 0.09, we reject H0. We do have convincing evidence the population proportion of people who smoke is not equal to 0.116.

**M-Z:**
Because p-value = 0.549 > α = 0.09, we fail to reject H0. We do not have convincing evidence the population proportion of people who live in the southwest is not equal to 0.25.

# **QUESTION 3**

## **Do you make the same conclusion if you use the confidence interval?**

**In question 2 you concluded that we either do have or do not have convincing evidence for the alternate hypothesis. Using your confidence interval from question 1, do you reach the same conclusion?**

**A-L:** Yes, 0.116 is not in the 91% interval (0.18608, 0.22349), so we do have convincing evidence that the population proportion of people who smoke is different from 0.116.

**M-Z:** Yes, 0.25 is in the 91% interval (0.22302, 0.26278), so we do not have convincing evidence that the population proportion of people who live in the southwest is different from 0.25.
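The comparison amounts to checking whether p0 falls inside the 91% interval, which pairs with a two-sided test at α = 0.09. The sketch below uses the interval endpoints reported in 1.4; the helper name `contains` is hypothetical. The interval uses a standard error based on p̂ while the test uses one based on p0, so the two approaches usually agree but are not guaranteed to match in borderline cases.

```python
def contains(interval, value):
    """True if value lies within the closed interval (lower, upper)."""
    lower, upper = interval
    return lower <= value <= upper

ci_AL = (0.18608, 0.22349)  # 91% CI for smokers, from 1.4
ci_MZ = (0.22302, 0.26278)  # 91% CI for southwest, from 1.4

# p0 outside the interval corresponds to rejecting H0 at alpha = 0.09.
reject_AL = not contains(ci_AL, 0.116)
reject_MZ = not contains(ci_MZ, 0.25)
print(f"A-L: reject H0 based on the interval? {reject_AL}")
print(f"M-Z: reject H0 based on the interval? {reject_MZ}")
```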

# **QUESTION 4**

Write a paragraph of at least 100 words to address one of the following prompts. That is, answer only 4a or 4b, but not both.

**4a)** Discuss how analyzing your chosen data set using statistical methods could help you become better prepared for future courses in your major.

...

--OR--

**4b)** Discuss how analyzing your chosen data set using statistical methods could be instrumental in becoming better prepared for your future career.

...


<br><br>
### Once you are done and ready to submit, follow the instructions below to save as a PDF and submit to GradeScope.

### Save as PDF
* Note 1: You do not have to select Print Preview. You can print directly from the notebook.
* Note 2: Image and graph sizes have been set so you should be able to see them correctly without making any changes to the browser width or the layout (portrait vs landscape).

1. Run all code one last time and make sure your graphs can be seen.
2. File -> Print (or ctrl-p/cmnd-p)
3. Change the "Destination" to PDF.
4. Save the PDF, taking note of where it is saved.

### Submit to GradeScope
**Watch the "GradeScope Submission" video for help.**
1. Login to the Canvas course
2. Click on GradeScope in the course navigation.
3. If you see multiple courses in GradeScope, click on the STAT 108 course
4. Click on the name of the assignment that matches your data set
5. Click on "Submit Work", select PDF
6. Select the PDF you just created
7. You need to tell GradeScope which page each problem answer/output is on. You should see a list of problems on the left, and a display of pages (thumbnails) on the right. Assign pages to questions by clicking on the question number on the left, then clicking on all pages that question is on.
8. After ALL questions have been assigned to their respective page(s), click "Submit"

#### **Still need help? Your STAT 108 team is here to help. Take your laptop to office hours.**


