## Table of Contents

1. **Sampling**
2. **Parameter Estimation**

**Import the required libraries**

In [18]:
# import pandas
import pandas as pd

# import 'numpy' 
import numpy as np

# import subpackage of matplotlib
import matplotlib.pyplot as plt

# import 'seaborn'
import seaborn as sns

# to suppress warnings 
from warnings import filterwarnings
filterwarnings('ignore')

# import statistics to perform statistical computation  
import statistics

# import 'stats' package from scipy library
from scipy import stats

# import a library to perform Z-test
from statsmodels.stats import weightstats as stests

# to test the normality 
from scipy.stats import shapiro

# import the function to calculate the power of test
from statsmodels.stats import power

# import 'random' to generate a random sample
import random

In [2]:
# set the plot size using 'rcParams'
# once the plot size is set using 'rcParams', it sets the size of all the forthcoming plots in the file
# pass width and height in inches to 'figure.figsize' 
plt.rcParams['figure.figsize'] = [15,8]

### Let's begin with some hands-on practice exercises

<a id="sample"></a>
## 1. Sampling

<table align="left">
    <tr>
        <td width="6%">
            <img src="question_icon.png">
        </td>
        <td>
            <div align="left", style="font-size:120%">
                <font color="#21618C">
                    <b>1. A farmer planted 98 tomato plants last year and numbered them 1, 2, ..., 98. He now wants to study the growth of the plants. Help the farmer select 12 plants at random as a sample for the study, using an appropriate sampling technique.</b>
                </font>
            </div>
        </td>
    </tr>
</table>

**Answer**

In [9]:
data = list(range(1,99))
print(data)
print('\n')

# Simple random sampling without replacement
sample_plant = random.sample(population = data, k = 12)
print('Sample of 12 plants selected randomly without replacement:', sample_plant)

[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98]


Sample of 12 plants selected randomly without replacement: [61, 12, 88, 31, 80, 64, 53, 54, 18, 71, 93, 85]
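`random.sample` draws a different sample on every run, so the output above will not reproduce exactly. The following is a minimal sketch of a repeatable draw, assuming a fixed seed is acceptable for the study. The seed value 42 and the names `rng` and `reproducible_sample` are illustrative and not part of the original answer.

```python
import random

# plants are numbered 1 to 98
plants = list(range(1, 99))

# a dedicated generator with a fixed seed (42 is an arbitrary choice)
rng = random.Random(42)

# simple random sampling without replacement
reproducible_sample = rng.sample(plants, k=12)
print(sorted(reproducible_sample))
```

Using a separate `random.Random` instance keeps the seed local, so other cells that rely on the global `random` state are unaffected.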


<table align="left">
    <tr>
        <td width="6%">
            <img src="question_icon.png">
        </td>
        <td>
            <div align="left", style="font-size:120%">
                <font color="#21618C">
                    <b>2. Ann collected 20 beautiful blue marble pebbles on her last summer vacation. Her mother has allowed her to take only 4 pebbles for her friends. Each pebble is numbered 1, 2, ..., 20. As 2 is her favorite number, she wants to select pebbles starting from the 2nd pebble. Help Ann systematically select the 4 pebbles for her friends.</b>
                </font>
            </div>
        </td>
    </tr>
</table>

**Answer**

In [69]:
pebbles = np.array(range(1, 21))

# total number of pebbles
N = len(pebbles)

# required number of pebbles
n = 4

# sampling interval: k = N/n = 20/4 = 5
k = N // n

# arrange data in n rows and k columns using 'reshape()'
pebbles = pebbles.reshape(n, k)
print(pebbles)
print('\n')
print('Systematic sample:')

sample_pebbles = []
# start from the 2nd pebble (column index 1) and pick one pebble from each row
for i in range(n):
    sample_pebbles.append(pebbles[i][1])

print(sample_pebbles)

[[ 1  2  3  4  5]
 [ 6  7  8  9 10]
 [11 12 13 14 15]
 [16 17 18 19 20]]


Systematic sample:
[2, 7, 12, 17]
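The same systematic sample can be obtained without reshaping, by stepping through the population with the sampling interval. This is a minimal sketch; the names `start` and `systematic_sample` are illustrative.

```python
# population of 20 numbered pebbles
pebbles = list(range(1, 21))

N = len(pebbles)   # population size
n = 4              # required sample size
k = N // n         # sampling interval, here 20 // 4 = 5
start = 2          # Ann's chosen starting pebble (1-based position)

# take every k-th pebble beginning at position 'start'
systematic_sample = pebbles[start - 1::k]
print(systematic_sample)   # prints [2, 7, 12, 17]
```

Slicing also works when `N` is not an exact multiple of `n`, a case in which `reshape()` would raise an error.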


<table align="left">
    <tr>
        <td width="6%">
            <img src="question_icon.png">
        </td>
        <td>
            <div align="left", style="font-size:120%">
                <font color="#21618C">
                    <b>3. A rose nursery contains roses of 5 distinct colors. Select two plants of each color randomly.</b>
                </font>
            </div>
        </td>
    </tr>
</table>

Given data:

          rose_col = ['White', 'Pink', 'White', 'Red', 'Yellow', 'Orange', 'Orange', 'Red', 'Yellow', 'White', 'Pink', 
                      'White', 'Red', 'Orange']

**Answer**

In [55]:
# create dataframe from given data
rose_col = pd.DataFrame(dict(Color = ['White', 'Pink', 'White', 'Red', 'Yellow', 'Orange', 'Orange', 'Red', 'Yellow', 'White', 'Pink', 
                  'White', 'Red', 'Orange']))

print(rose_col.groupby(['Color']).size())
print('\nStratified sample:')

rose_col.groupby('Color', group_keys = False).apply(lambda x: x.sample(2, random_state = 1))

Color
Orange    3
Pink      2
Red       3
White     4
Yellow    2
dtype: int64

Stratified sample:


     Color
5   Orange
13  Orange
1     Pink
10    Pink
3      Red
12     Red
11   White
9    White
4   Yellow
8   Yellow
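As an alternative to `apply()` with a lambda, recent pandas versions let you sample directly on the grouped object. The sketch below assumes pandas 1.1 or later, where `groupby(...).sample()` is available. The names `roses` and `stratified` are illustrative, and the rows drawn may differ from the output above.

```python
import pandas as pd

roses = pd.DataFrame({'Color': ['White', 'Pink', 'White', 'Red', 'Yellow', 'Orange', 'Orange',
                                'Red', 'Yellow', 'White', 'Pink', 'White', 'Red', 'Orange']})

# draw 2 rows from every colour group; random_state fixes the draw
stratified = roses.groupby('Color').sample(n=2, random_state=1)

# confirm that each stratum contributes exactly 2 plants
print(stratified['Color'].value_counts())
```

Checking the counts per stratum is a quick way to confirm that the sample is balanced across colours.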


<table align="left">
    <tr>
        <td width="6%">
            <img src="question_icon.png">
        </td>
        <td>
            <div align="left", style="font-size:120%">
                <font color="#21618C">
                    <b>4. Ron found a gold, a silver and a copper coin on his way home. He asked 10 people to choose a coin at random and recorded the coin chosen by each person. Find one possible sample (outcome) for this experiment.</b>
                </font>
            </div>
        </td>
    </tr>
</table>

The obtained sample is one of the possible outcomes of Ron's experiment.

**Answer**

In [28]:
# Given Ron found a gold, silver and copper coin
coin_space = ['Gold', 'Silver', 'Copper']

# Sample with replacement
sample_wr = random.choices(population = coin_space, k = 10)
print('A possible sample for this experiment is:\n', sample_wr)

A possible sample for this experiment is:
 ['Gold', 'Silver', 'Silver', 'Silver', 'Silver', 'Silver', 'Copper', 'Copper', 'Copper', 'Silver']
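A single list of 10 choices is one outcome. The sample space is the set of all such lists. The sketch below counts that set and lazily enumerates its first few members, assuming each person chooses independently from the same three coins. The names `space_size` and `first_three` are illustrative.

```python
from itertools import islice, product

coins = ['Gold', 'Silver', 'Copper']
people = 10

# each of the 10 people picks one of 3 coins, so there are 3**10 outcomes
space_size = len(coins) ** people
print('Number of outcomes in the sample space:', space_size)   # 59049

# enumerate the sample space lazily and show only the first three outcomes
sample_space = product(coins, repeat=people)
first_three = list(islice(sample_space, 3))
for outcome in first_three:
    print(outcome)
```

`product` returns an iterator, so the full set of 59049 tuples is never held in memory unless you explicitly convert it to a list.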


<a id = "est"> </a>
## 2. Parameter Estimation

<table align="left">
    <tr>
        <td width="6%">
            <img src="question_icon.png">
        </td>
        <td>
            <div align="left", style="font-size:120%">
                <font color="#21618C">
                    <b>1. The quality controller at the automobile company needs to know the average length of a steel rod produced in the company. He managed to collect the length (in cm) of 15 rods produced in the last week. Find the point estimate for the average length of the steel rod using sample data. </b>
                </font>
            </div>
        </td>
    </tr>
</table>

Given data:

        len_rod (cm) = [25.2, 26.3, 28, 21.9, 23.4, 24, 27.2, 23, 29.2, 28.7, 23.1, 23.5, 26.4, 22.8, 24.7]

**Answer**

In [64]:
len_rod = [25.2, 26.3, 28, 21.9, 23.4, 24, 27.2, 23, 29.2, 28.7, 23.1, 23.5, 26.4, 22.8, 24.7]

print('The point estimate for the average length of the steel rod is', np.mean(len_rod), 'cm')

The point estimate for the average length of the steel rod is 25.16 cm


<table align="left">
    <tr>
        <td width="6%">
            <img src="question_icon.png">
        </td>
        <td>
            <div align="left", style="font-size:120%">
                <font color="#21618C">
                    <b>2. The production manager at the automobile company states that all the steel rods are produced with an average length of 26 cm. Use the data given in the previous question and calculate the sampling error for the mean.</b>
                </font>
            </div>
        </td>
    </tr>
</table>

Given data:

        len_rod (cm) = [25.2, 26.3, 28, 21.9, 23.4, 24, 27.2, 23, 29.2, 28.7, 23.1, 23.5, 26.4, 22.8, 24.7]

**Answer**

In [65]:
len_rod = [25.2, 26.3, 28, 21.9, 23.4, 24, 27.2, 23, 29.2, 28.7, 23.1, 23.5, 26.4, 22.8, 24.7]

sample_mean = np.mean(len_rod)

# Given: population mean = 26cm
pop_mean = 26

print("Sampling error for the mean is", round(np.abs(sample_mean - pop_mean), 4), 'cm')

Sampling error for the mean is 0.84 cm


<table align="left">
    <tr>
        <td width="6%">
            <img src="question_icon.png">
        </td>
        <td>
            <div align="left", style="font-size:120%">
                <font color="#21618C">
                    <b>3. The NY university has opened a post for an Astrophysics professor. A total of 36 applications were received. To check the authenticity of the applicants, a sample of 10 applications was examined, of which 3 were found to be fraudulent. Estimate the number of fraudulent applications among all the applications.</b>
                </font>
            </div>
        </td>
    </tr>
</table>

**Answer**

In [4]:
# Use the sample proportion as the point estimate of the population proportion
# Given: 

# total number of applications
N = 36 

# number of applications considered in sample
n = 10

# number of fraud applications in sample
x = 3

# sample proportion
p_samp = x/n

# estimate the number of fraud applications
num_fraud = p_samp * N

print('Estimated number of fraud applications is', round(num_fraud))

Estimated number of fraud applications is 11


<table align="left">
    <tr>
        <td width="6%">
            <img src="question_icon.png">
        </td>
        <td>
            <div align="left", style="font-size:120%">
                <font color="#21618C">
                    <b>4. The production manager at a tea emporium wants to estimate the weight of a green tea bag. A previous study shows that the standard deviation of the weight is 2.3 g. The manager collects 65 tea bags for the study. What margin of error should the manager consider to estimate the weight with 99% confidence?</b>
                </font>
            </div>
        </td>
    </tr>
</table>

**Answer**

In [89]:
# population standard deviation (from the previous study)
std = 2.3

# number of tea bags in the sample
n = 65

# given alpha (1 - confidence level)
alpha = 0.01

# calculate z_alpha_by_2 for alpha = 0.01
z_alpha_by_2 = round(stats.norm.isf(q = alpha/2), 4)
print('z_alpha_by_2 =', z_alpha_by_2)

error = (z_alpha_by_2 * std)/np.sqrt(n)

print("Margin of error to estimate weight with 99% confidence is", round(error, 4))

z_alpha_by_2 = 2.5758
Margin of error to estimate weight with 99% confidence is 0.7348
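The calculation above can be wrapped in a small reusable function. The following is a sketch that cross-checks the result using only the standard library (`statistics.NormalDist`, available from Python 3.8). The function name `margin_of_error` and its parameters are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def margin_of_error(sigma, n, confidence):
    """Margin of error for a mean when the population std is known."""
    alpha = 1 - confidence
    # upper alpha/2 critical value of the standard normal distribution
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return z * sigma / sqrt(n)

moe = margin_of_error(sigma=2.3, n=65, confidence=0.99)
print(round(moe, 4))   # 0.7348
```

Because the margin of error shrinks with the square root of `n`, quadrupling the sample size only halves it.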


<table align="left">
    <tr>
        <td width="6%">
            <img src="question_icon.png">
        </td>
        <td>
            <div align="left", style="font-size:120%">
                <font color="#21618C">
                    <b>5. The bank manager has received several fraud complaints over the past few weeks. The accountant's report states that the standard deviation of the number of frauds is 16. The manager is willing to accept a margin of error of 5. To estimate the average number of frauds with 90% confidence, how many transactions should the manager consider?</b>
                </font>
            </div>
        </td>
    </tr>
</table>

**Answer**

In [8]:
# standard deviation
std = 16

# given alpha (1 - confidence level)
alpha = 0.1

# calculate z_alpha_by_2 for alpha = 0.1
z_alpha_by_2 = round(stats.norm.isf(q = alpha/2), 4)
print('z_alpha_by_2 =', z_alpha_by_2)

# given margin of error
margin_error = 5

# margin_error = (z_alpha_by_2 * std)/np.sqrt(n), solved for n
n = ((z_alpha_by_2 * std)/margin_error) ** 2

# round up so that the margin of error does not exceed the given value
print("Number of transactions to consider to estimate average number of frauds with 90% confidence is", int(np.ceil(n)))

z_alpha_by_2 = 1.6449
Number of transactions to consider to estimate average number of frauds with 90% confidence is 28
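A required sample size is conventionally rounded up rather than to the nearest integer, since rounding down would leave the margin of error slightly larger than requested. The following is a minimal sketch using only the standard library. The function name `required_sample_size` and its parameters are hypothetical.

```python
from math import ceil
from statistics import NormalDist

def required_sample_size(sigma, margin, confidence):
    """Smallest n whose margin of error does not exceed 'margin'."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # solve margin = z * sigma / sqrt(n) for n, then round up
    return ceil((z * sigma / margin) ** 2)

n_required = required_sample_size(sigma=16, margin=5, confidence=0.90)
print(n_required)   # 28
```

Here the unrounded value is about 27.7, so rounding up and rounding to the nearest integer happen to agree.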


<table align="left">
    <tr>
        <td width="6%">
            <img src="question_icon.png">
        </td>
        <td>
            <div align="left", style="font-size:120%">
                <font color="#21618C">
                    <b>6. A paediatrician wants to check the amount of sugar in the 100 g pack of baby food produced by the KidsGrow company. A medical journal states that the standard deviation of sugar in a 100 g pack is 8 g. The paediatrician collects 37 packets of baby food and finds that the average sugar content is 24 g. Find the 90% confidence interval for the population mean.</b>
                </font>
            </div>
        </td>
    </tr>
</table>

As the population standard deviation is known and the sample size is large (> 30), we use the Z-distribution to calculate the confidence interval.

**Answer**

In [6]:
# population standard deviation (from the medical journal)
sample_std = 8

# sample number of baby food packets
n = 37

# sample mean 
sample_avg = 24

# calculate the 90% confidence interval
interval = stats.norm.interval(0.9, loc = sample_avg, scale = sample_std/np.sqrt(n))

print('90% confidence interval for the population mean is', interval)

90% confidence interval for the population mean is (21.83670183570907, 26.16329816429093)
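`stats.norm.interval` hides the underlying formula, which is the sample mean plus or minus the critical value times the standard error. The sketch below reproduces the interval step by step with the standard library. The names `half_width` and `ci` are illustrative.

```python
from math import sqrt
from statistics import NormalDist

sigma = 8          # known population standard deviation (g)
n = 37             # number of packets in the sample
x_bar = 24         # sample mean (g)
confidence = 0.90

# critical value z_alpha_by_2 for the chosen confidence level
z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)

# half-width of the interval = critical value * standard error
half_width = z * sigma / sqrt(n)

ci = (x_bar - half_width, x_bar + half_width)
print(ci)
```

The result agrees with the interval printed above up to floating-point precision.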


<table align="left">
    <tr>
        <td width="6%">
            <img src="question_icon.png">
        </td>
        <td>
            <div align="left", style="font-size:120%">
                <font color="#21618C">
                    <b>7. The physical trainer at a university wants to estimate the average height of students at the university. The trainer collects the data of 100 students and finds that the average height is 168 cm with a standard deviation of 12 cm. Find the 95% confidence interval for the population average height.</b>
                </font>
            </div>
        </td>
    </tr>
</table>

As the sample size is large (> 30), the sample standard deviation is a reasonable stand-in for the population value, so we use the Z-distribution to calculate the confidence interval.

**Answer**

In [7]:
# standard deviation
sample_std = 12

# number of students in the sample
n = 100

# sample mean 
sample_avg = 168

# calculate the 95% confidence interval
interval = stats.norm.interval(0.95, loc = sample_avg, scale = sample_std/np.sqrt(n))

print('95% confidence interval for the population average height is', interval)

95% confidence interval for the population average height is (165.64804321855195, 170.35195678144805)


<table align="left">
    <tr>
        <td width="6%">
            <img src="question_icon.png">
        </td>
        <td>
            <div align="left", style="font-size:120%">
                <font color="#21618C">
                    <b>8. A health magazine in Los Angeles states that a person should drink 1.8 L of water every day. To study this statement, a physician collects the data of 15 people and finds that their average water intake is 1.6 L with a standard deviation of 0.5 L. Calculate the 90% confidence interval for the population average water intake.</b>
                </font>
            </div>
        </td>
    </tr>
</table>

As the sample size is small (< 30) and the population standard deviation is unknown, we use the t-distribution to calculate the confidence interval.

**Answer**

In [9]:
# standard deviation
sample_std = 0.5

# number of people in the sample
n = 15

# sample mean 
sample_avg = 1.6

# calculate the 90% confidence interval using t-distribution
interval = stats.t.interval(0.9, df = n-1, loc = sample_avg, scale = sample_std/np.sqrt(n))

print('90% confidence interval for the population average water intake is', interval)

90% confidence interval for the population average water intake is (1.3726158392212553, 1.8273841607787449)
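To see why the t-distribution matters for a small sample, the sketch below rebuilds the interval from the t critical value and compares that value with the corresponding Z critical value. It assumes SciPy is available, as elsewhere in this notebook. The names `t_crit`, `z_crit` and `t_ci` are illustrative.

```python
import numpy as np
from scipy import stats

n = 15             # number of people in the sample
x_bar = 1.6        # sample mean (L)
s = 0.5            # sample standard deviation (L)
confidence = 0.90

# critical values for the same confidence level
t_crit = stats.t.ppf(1 - (1 - confidence) / 2, df=n - 1)
z_crit = stats.norm.ppf(1 - (1 - confidence) / 2)
print('t critical value:', round(t_crit, 4))
print('z critical value:', round(z_crit, 4))

# interval built from the t critical value and the standard error
half_width = t_crit * s / np.sqrt(n)
t_ci = (x_bar - half_width, x_bar + half_width)
print(t_ci)
```

The t critical value is larger than the Z critical value, so the t-based interval is wider, reflecting the extra uncertainty from estimating the standard deviation with only 15 observations.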


<table align="left">
    <tr>
        <td width="6%">
            <img src="question_icon.png">
        </td>
        <td>
            <div align="left", style="font-size:120%">
                <font color="#21618C">
                    <b>9. The NY university has opened a post for an Astrophysics professor. A total of 36 applications were received. To check the authenticity of the applicants, a sample of 10 applications was examined, of which 3 were found to be fraudulent. Construct the 95% confidence interval for the population proportion.</b>
                </font>
            </div>
        </td>
    </tr>
</table>

**Answer**

In [16]:
# total number of applications
N = 36 

# number of applications considered in sample
n = 10

# number of fraud applications in sample
x = 3

# sample proportion
p_samp = x/n

# calculate the 95% confidence interval using the normal approximation
interval = stats.norm.interval(0.95, loc = p_samp, scale = np.sqrt((p_samp * (1 - p_samp))/n))

print('95% confidence interval for the population proportion is', interval)

95% confidence interval for the population proportion is (0.015974234910674623, 0.5840257650893254)
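The interval above relies on the normal approximation to the sampling distribution of a proportion. The sketch below recomputes it by hand and adds a rule-of-thumb check on whether the approximation is comfortable. The threshold of 10 expected successes and failures is a common textbook convention, assumed here rather than taken from the question. The names `wald_ci` and `approx_ok` are illustrative.

```python
from math import sqrt
from statistics import NormalDist

n = 10             # applications in the sample
x = 3              # fraudulent applications in the sample
confidence = 0.95

p_hat = x / n
z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)

# standard error of the sample proportion
se = sqrt(p_hat * (1 - p_hat) / n)
wald_ci = (p_hat - z * se, p_hat + z * se)
print(wald_ci)

# rule-of-thumb check: both n*p and n*(1-p) should be at least 10 (assumed threshold)
approx_ok = min(n * p_hat, n * (1 - p_hat)) >= 10
print('Normal approximation comfortable:', approx_ok)   # False
```

With only 3 fraudulent applications in a sample of 10, the check fails, which is consistent with the very wide interval and suggests treating it as a rough estimate.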
