- Create a Jupyter notebook in Google Colab named "assignment_08.ipynb"
- Use the following URL to directly load data from the source URL: https://ed-public-download.app.cloud.gov/downloads/Most-Recent-Cohorts-All-Data-Elements.csv
- Only load two columns:  CONTROL (ownership type), MD_EARN_WNE_P10 (potential earning)
- Filter the data so you have only public institutions (using CONTROL variable)
- Drop the colleges that have zero or missing potential earnings. The remaining colleges constitute the population for which we are to perform interval estimation.
- Get a random sample of 50 colleges from the population.
- Calculate the sample mean and sample standard error.
- Calculate the confidence intervals of the mean estimate at the 68%, 95%, and 99.7% confidence levels
- Calculate the population mean.
- Compare the population mean with the sample mean - display the difference
- Check the confidence intervals and determine if the population mean is within the confidence intervals calculated above. You don't need to write code, just check using your eyes.
- Start from step 6 again with a larger sample of 100 colleges
- Observe the difference of confidence intervals between sample size 50 and sample size 100. Draw some conclusions. 
- Save the notebook to your GitHub repository. 
- Submit the link of the notebook in Blackboard.


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import math

#### Loading the data

#### load two columns: CONTROL (ownership type), MD_EARN_WNE_P10 (potential earning)

In [2]:
data=pd.read_csv('Most-Recent-Cohorts-Scorecard-Elements.csv', usecols=['CONTROL', 'MD_EARN_WNE_P10'])

In [3]:
data.head()

,CONTROL,MD_EARN_WNE_P10
0,1,31000
1,1,41200
2,2,39600
3,1,46700
4,1,27700
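
The assignment asks for the data to be read directly from the source URL, whereas the cell above reads a local file. A minimal sketch of one way to do that is shown here; `load_earnings` is a hypothetical helper that is not part of the original notebook, and the in-memory CSV is made-up data so the example runs offline. Passing `na_values` is an assumption about how one might treat the 'PrivacySuppressed' marker as missing at load time.

```python
import io
import pandas as pd

# Column names come from the assignment text.
COLUMNS = ["CONTROL", "MD_EARN_WNE_P10"]

def load_earnings(source):
    """Read only the two needed columns from a path, URL, or file-like object.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    return pd.read_csv(
        source,
        usecols=COLUMNS,
        na_values=["PrivacySuppressed"],  # treat the suppression marker as missing
    )

# Tiny in-memory stand-in for the real file (made-up rows).
demo_csv = io.StringIO(
    "UNITID,CONTROL,MD_EARN_WNE_P10\n"
    "1,1,31000\n"
    "2,2,PrivacySuppressed\n"
    "3,1,41200\n"
)
demo = load_earnings(demo_csv)
print(demo)
```

For the real data, the same function could be called with the URL given in the assignment: `load_earnings("https://ed-public-download.app.cloud.gov/downloads/Most-Recent-Cohorts-All-Data-Elements.csv")`.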


#### Filter the data so you have only public institutions (using CONTROL variable)

A CONTROL value of 1 denotes a public institution.

In [4]:
data=data[data['CONTROL']==1]

#### Drop the colleges that have zero or missing potential earnings. The remaining colleges constitute the population for which we are to perform interval estimation.

In [5]:
data=data.dropna()

In [6]:
# Rows whose earnings value is the string 'PrivacySuppressed' are dropped as well
data=data[data['MD_EARN_WNE_P10']!='PrivacySuppressed']

In [7]:
data.dtypes

CONTROL             int64
MD_EARN_WNE_P10    object
dtype: object

In [8]:
# MD_EARN_WNE_P10 is still stored as strings (object dtype), so convert it to float
convert_dict={'CONTROL':object, 'MD_EARN_WNE_P10':float}
data = data.astype(convert_dict)

#### No zero values present in the data
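
The statement that no zeros are present can be checked explicitly rather than by inspection. The sketch below is one possible approach, using a small made-up DataFrame (not College Scorecard data): `pd.to_numeric` with `errors="coerce"` turns non-numeric strings into NaN, and a `> 0` filter then removes missing and zero earnings in one step.

```python
import pandas as pd

# Made-up rows for illustration only; not taken from the College Scorecard file.
raw = pd.DataFrame({
    "CONTROL": [1, 1, 1, 2, 1],
    "MD_EARN_WNE_P10": ["31000", "PrivacySuppressed", "0", "39600", None],
})

public = raw[raw["CONTROL"] == 1].copy()
# Non-numeric strings such as 'PrivacySuppressed' become NaN.
public["MD_EARN_WNE_P10"] = pd.to_numeric(public["MD_EARN_WNE_P10"], errors="coerce")

# Count zeros before filtering, then keep only present, strictly positive earnings.
zero_count = int((public["MD_EARN_WNE_P10"] == 0).sum())
population = public[public["MD_EARN_WNE_P10"] > 0]
print(len(population), zero_count)
```

A comparison with NaN evaluates to False, so the `> 0` filter drops missing values without a separate `dropna()` call.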

#### Get a random sample of 50 colleges from the population.

In [9]:
sample=data.sample(50)
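
Because `sample()` draws different rows on every run, the numbers below will change each time the notebook is executed. One way to make the draw repeatable is to pass `random_state`; the sketch below assumes a synthetic stand-in population and an arbitrary seed of 42.

```python
import pandas as pd

# Synthetic stand-in for the cleaned population; values are made up.
population = pd.DataFrame(
    {"MD_EARN_WNE_P10": [float(30000 + 100 * i) for i in range(200)]}
)

# random_state=42 is an arbitrary seed chosen for this sketch.
sample_a = population.sample(n=50, random_state=42)
sample_b = population.sample(n=50, random_state=42)

# The same seed yields the same rows in the same order.
print(sample_a.equals(sample_b))
```

By default `sample()` draws without replacement, so no college appears twice in a sample.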

#### Calculate the sample mean and sample standard error.

In [10]:
# Sample mean
sample_mean=sample['MD_EARN_WNE_P10'].mean()
sample_mean

36174.0

In [11]:
def standard_error(column):
    '''
    Returns the standard error of the mean for the input column:
    standard error = sample standard deviation / sqrt(number of observations)
    '''
    return (column.std()/math.sqrt(len(column)))

In [12]:
std_err=standard_error(sample['MD_EARN_WNE_P10'])
print(std_err)

1397.2737595702831
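
As a cross-check on the hand-written function, pandas provides `Series.sem()`, which computes the same quantity (sample standard deviation with `ddof=1` divided by the square root of the count). The sketch below compares the two on the five earnings values shown in the `data.head()` output; it only illustrates the equivalence and is not the sample used in the notebook.

```python
import math
import pandas as pd

# The five earnings values from the data.head() output, used here only as a small example.
values = pd.Series([31000.0, 41200.0, 39600.0, 46700.0, 27700.0])

manual = values.std() / math.sqrt(len(values))  # same formula as standard_error()
builtin = values.sem()                          # pandas' standard error of the mean

print(manual, builtin)
```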


#### Calculate the confidence intervals of the mean estimate at the 68%, 95%, and 99.7% confidence levels

In [13]:
def confidence_interval(mean, standard_error, multiply):
    '''
    Function only works for 68%, 95% and 99.7% confidence intervals
    mean: sample mean
    standard_error: standard error of the sample mean
    multiply: number of standard errors, one of 68% --> 1, 95% --> 2, 99.7% --> 3
    '''

    return (mean-multiply*standard_error, mean+multiply*standard_error)

In [14]:
# 68% confidence interval: within 1 standard error of the sample mean
confidence_interval(sample_mean, std_err, 1)

(34776.726240429714, 37571.273759570286)

In [15]:
# 95% confidence interval: within 2 standard errors of the sample mean
confidence_interval(sample_mean, std_err, 2)

(33379.452480859436, 38968.547519140564)

In [16]:
# 99.7% confidence interval: within 3 standard errors of the sample mean
confidence_interval(sample_mean, std_err, 3)

(31982.17872128915, 40365.82127871085)
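
The multipliers 1, 2 and 3 are the rounded values from the 68-95-99.7 rule. As a sketch of how the multiplier could be derived for any confidence level, the example below uses the standard library's `statistics.NormalDist`; `z_multiplier` and `confidence_interval_z` are hypothetical names introduced for illustration, and the sample mean and standard error are the values reported above for the sample of 50.

```python
from statistics import NormalDist

def z_multiplier(confidence):
    """Two-sided standard-normal multiplier for a confidence level in (0, 1)."""
    return NormalDist().inv_cdf(0.5 + confidence / 2)

def confidence_interval_z(mean, standard_error, confidence):
    """Normal-approximation interval: mean +/- z * standard_error."""
    z = z_multiplier(confidence)
    return (mean - z * standard_error, mean + z * standard_error)

# Sample mean and standard error reported above for the sample of 50.
sample_mean = 36174.0
std_err = 1397.2737595702831

for level in (0.68, 0.95, 0.997):
    print(level, round(z_multiplier(level), 3),
          confidence_interval_z(sample_mean, std_err, level))
```

The computed multipliers are slightly below 1, 2 and 3, so these intervals are marginally narrower than the ones produced with the rounded values.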

#### Calculate the population mean.




In [17]:
population_mean=data['MD_EARN_WNE_P10'].mean()
population_mean

36071.244635193136

#### Compare the population mean with the sample mean - display the difference

In [18]:
difference = population_mean - sample_mean
abs(difference)

102.75536480686424
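
The assignment allows the containment check to be done by eye, but it can also be expressed in a few lines. The sketch below hard-codes the population mean and the three intervals reported above for the sample of 50 and tests whether each interval contains the population mean.

```python
population_mean = 36071.244635193136  # value reported above

# Intervals reported above for the sample of 50.
intervals = {
    "68%": (34776.726240429714, 37571.273759570286),
    "95%": (33379.452480859436, 38968.547519140564),
    "99.7%": (31982.17872128915, 40365.82127871085),
}

# True where the interval contains the population mean.
contains = {
    level: low <= population_mean <= high
    for level, (low, high) in intervals.items()
}
print(contains)
```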

#### Start from step 6 again with a larger sample of 100 colleges

In [19]:
sample=data.sample(100)

In [20]:
# Sample mean
sample_mean=sample['MD_EARN_WNE_P10'].mean()
sample_mean

36698.0

In [21]:
std_err=standard_error(sample['MD_EARN_WNE_P10'])
print(std_err)

1156.7893394034957


In [22]:
# 68% confidence interval: within 1 standard error of the sample mean
confidence_interval(sample_mean, std_err, 1)

(35541.2106605965, 37854.7893394035)

In [23]:
# 95% confidence interval: within 2 standard errors of the sample mean
confidence_interval(sample_mean, std_err, 2)

(34384.42132119301, 39011.57867880699)

In [24]:
# 99.7% confidence interval: within 3 standard errors of the sample mean
confidence_interval(sample_mean, std_err, 3)

(33227.63198178951, 40168.36801821049)

In [25]:
# Population Mean
population_mean=data['MD_EARN_WNE_P10'].mean()
population_mean

36071.244635193136

In [26]:
difference = population_mean - sample_mean
abs(difference)

626.7553648068642

#### Observe the difference of confidence intervals between sample size 50 and sample size 100. Draw some conclusions. 

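Observation: increasing the sample size from 50 to 100 lowers the standard error (from about 1397 to about 1157), so every confidence interval becomes narrower; for example, the 95% interval shrinks from (33379, 38969) to (34384, 39012). In this particular run the sample mean for n = 100 (36698) is actually farther from the population mean (36071) than the sample mean for n = 50 (36174), which is ordinary sampling variation. In both cases the population mean lies inside all three intervals.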
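
A single pair of samples cannot show the general pattern, so a small simulation may help. The sketch below is an illustration under stated assumptions, not part of the original analysis: it builds a synthetic, right-skewed population (made-up lognormal parameters, arbitrary seed), repeatedly draws samples of 50 and 100, and records the average width of the 2-standard-error interval and how often that interval covers the true mean.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed for repeatability
# Synthetic, right-skewed "population"; parameters are made up for illustration.
population = rng.lognormal(mean=10.5, sigma=0.25, size=2000)
true_mean = population.mean()

def simulate(n, trials=2000):
    """Return (average interval width, share of intervals covering true_mean)."""
    widths, hits = [], 0
    for _ in range(trials):
        sample = rng.choice(population, size=n, replace=False)
        se = sample.std(ddof=1) / np.sqrt(n)  # standard error of the mean
        low, high = sample.mean() - 2 * se, sample.mean() + 2 * se
        widths.append(high - low)
        hits += int(low <= true_mean <= high)
    return float(np.mean(widths)), hits / trials

for n in (50, 100):
    print(n, simulate(n))
```

On average the interval for n = 100 should be narrower than for n = 50, since the standard error scales with 1/sqrt(n), while the share of intervals covering the true mean should stay close to the nominal level for both sizes.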