## Inferential Statistics

Inferential statistics draws conclusions about a general population from a sample of data.
It assumes that the sample is representative of the larger population, so conclusions drawn from the sample under study can be extended to the population.

![alt text](https://brainalyst.in/wp-content/uploads/2023/02/inferential-Statistics-1024x394.jpg)

### What is a "population" and a "sample"?

- Population - every single data point within the universe of observation
- Sample - a subset of that specific universe or the population

![alt text](https://www.programsbuzz.com/sites/default/files/inline-images/populationvssample-e1556351520474.png)

- It is usually impractical to gather and work with every single data point, hence the need to obtain a sample
- Analysis is faster and more efficient when we make inferences about the population from the sample

BUT - how do we make sure that the sample is representative of the population? What is "Representativeness"?

### Importance of a Representative Sample

![alt text](https://www.gloriafood.com/wp-content/uploads/2021/03/25_Restaurant_Survey_Questions_to_Help_You_Gain_Valuable_Insight_-_fb.png)


#### A representative sample is a sample without any bias - that is, the data points are chosen without any bias

How do we check that the sample is fully representative?
<br> One check is to compare the descriptive statistics of the sample with those of the population and see whether they follow the same pattern

BUT - the best way to get a representative sample? Pick at random!

### A Random Sample

![alt text](https://i.ytimg.com/vi/QaAKnQPHW-I/maxresdefault.jpg)


In most business scenarios, samples are chosen at random. There are different types of random sampling, such as simple and stratified:

![alt text](https://www.qualtrics.com/m/assets/wp-content/uploads/2021/08/Screen-Shot-2021-08-31-at-10.17.31-AM.png)

BUT - the most important thing is to make sure that your sample is representative of your population, because if it isn't, whatever conclusions you draw about the population will end up being inaccurate!
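
To make the difference between simple and stratified random sampling concrete, here is a minimal sketch using only the Python standard library. The population of 1,000 customers split 80/20 across two regions, and the helper names `simple_random_sample` and `stratified_sample`, are assumptions made up for illustration.

```python
import random
from collections import Counter

random.seed(42)  # fixed seed so the sketch is reproducible

# Hypothetical population: 1,000 customers, 80% in region "A" and 20% in region "B"
population = [("A", i) for i in range(800)] + [("B", i) for i in range(200)]

def simple_random_sample(items, k):
    """Draw k items without replacement, each item equally likely."""
    return random.sample(items, k)

def stratified_sample(items, k, key):
    """Draw from each stratum in proportion to its share of the population."""
    strata = {}
    for item in items:
        strata.setdefault(key(item), []).append(item)
    sample = []
    for members in strata.values():
        share = round(k * len(members) / len(items))
        sample.extend(random.sample(members, share))
    return sample

simple = simple_random_sample(population, 100)
stratified = stratified_sample(population, 100, key=lambda item: item[0])

print("Simple random sample:", Counter(region for region, _ in simple))
print("Stratified sample:   ", Counter(region for region, _ in stratified))
```

The simple random sample usually lands close to the 80/20 split but can drift by chance, while the stratified sample reproduces the split exactly by construction.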

## Probability

Understanding and successfully utilising statistical inferential procedures, the cornerstone of analytical methods, requires a thorough understanding of probability methods.

It is rarely possible to know a population completely, because gathering data on every member of a group is typically too difficult or costly, so we have to work with samples. Therefore, we need to grasp probability theory and its applications in order to draw inferences from a sample that can be applied to the population.

Probability of an Event X = no. of outcomes where X occurs/Total no. of possible outcomes

### Understanding Probability

![alt text](https://i.imgur.com/IcsOXl0.png?0)


##### Probability
- How likely it is that an event will occur

Let A be an event
- P(A) = Number of favorable outcomes/total number of possible outcomes
- 0<=P(A)<=1

### Conditional Probability:
It is the likelihood of a specific event occurring given that another event is known to have already happened.

![alt text](https://assets.tivadardanka.com/2022_10_conditional_probability_featured_c9d47cc379.jpg)
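
As a small worked example, conditional probability can be computed as P(A | B) = P(A and B) / P(B). The sketch below uses a single roll of a fair die; the events A and B are chosen purely for illustration.

```python
from fractions import Fraction

sample_space = {1, 2, 3, 4, 5, 6}   # one roll of a fair die
A = {4, 5, 6}                        # event A: roll is greater than 3
B = {2, 4, 6}                        # event B: roll is even

def prob(event):
    """Probability of an event when all outcomes are equally likely."""
    return Fraction(len(event), len(sample_space))

# P(A | B) = P(A and B) / P(B)
p_a_given_b = prob(A & B) / prob(B)

print("P(A) =", prob(A))
print("P(A | B) =", p_a_given_b)
```

Knowing that the roll is even raises the probability of "greater than 3" from 1/2 to 2/3, because two of the three even faces (4 and 6) are greater than 3.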

#### Basics of Probability
- Random Experiment: An action that leads to one of several possible outcomes; e.g., tossing a coin
- Event: An event is an outcome (or set of outcomes) of an experiment; e.g., Head
- Trial: Each time a random experiment is conducted, it is called a trial
- Sample Space: A list of all possible outcomes of an experiment; eg., {H,T} for coin toss

###### Random Variable
- A variable whose possible values depend on the outcomes of a certain random phenomenon
- Ex: the number of heads seen in two coin tosses, which can be 0, 1 or 2

### Understanding Sample Space - Coin Toss Example

![alt text](https://www.theshirtlist.com/wp-content/uploads/2020/01/Toss-a-Coin-v2.jpg)
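
A sample space can be enumerated directly for small experiments. The sketch below lists every outcome of three coin tosses with `itertools.product`, then counts heads in each outcome to show how a random variable (here called X, a name chosen for illustration) attaches a number to every outcome.

```python
from itertools import product
from collections import Counter

# All outcomes of tossing a fair coin 3 times
sample_space = ["".join(toss) for toss in product("HT", repeat=3)]
print(sample_space)

# Random variable X = number of heads in each outcome
heads_count = Counter(outcome.count("H") for outcome in sample_space)
for heads in sorted(heads_count):
    print(f"P(X = {heads}) = {heads_count[heads]}/{len(sample_space)}")
```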


### Probability with and without Replacement
![alt text](https://cdn.dribbble.com/users/2712/screenshots/16449995/media/44c002c726f852f390b154bf33daa026.png?resize=400x0)
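
A short sketch of the difference, assuming a standard 52-card deck with 4 aces: with replacement the second draw faces the same deck, while without replacement both the number of aces and the number of cards shrink by one.

```python
from fractions import Fraction

aces, cards = 4, 52

# With replacement: the first card goes back, so the second draw is unchanged
p_with = Fraction(aces, cards) * Fraction(aces, cards)

# Without replacement: one ace and one card fewer on the second draw
p_without = Fraction(aces, cards) * Fraction(aces - 1, cards - 1)

print("Two aces with replacement:   ", p_with, "=", round(float(p_with), 4))
print("Two aces without replacement:", p_without, "=", round(float(p_without), 4))
```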


### Independent Events

If the probability of occurrence of an event A is not affected by the occurrence of another event B, then A and B are said to be independent events.

### Mutually Exclusive Events

Two events are mutually exclusive when they cannot occur at the same time.

![alt text](https://keydifferences.com/wp-content/uploads/2016/05/mutually-exclusive-vs-independent-event-thumbnail.jpg)

![alt text](https://servicemasterofcolumbia.com/wp-content/uploads/2016/11/claim-diaries-2.jpg)
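
The two ideas are easy to confuse, so here is a minimal check on one roll of a fair die. Independence is tested as P(A and B) = P(A) * P(B), and mutual exclusivity as an empty intersection. The events A, B and C are illustrative choices, not taken from the figures above.

```python
from fractions import Fraction

sample_space = {1, 2, 3, 4, 5, 6}   # one roll of a fair die

def prob(event):
    return Fraction(len(event), len(sample_space))

def independent(x, y):
    """True when knowing one event does not change the probability of the other."""
    return prob(x & y) == prob(x) * prob(y)

def mutually_exclusive(x, y):
    """True when the two events share no outcome."""
    return len(x & y) == 0

A = {2, 4, 6}       # roll is even
B = {1, 2, 3, 4}    # roll is at most 4
C = {1, 3, 5}       # roll is odd

print("A, B independent?        ", independent(A, B))
print("A, C mutually exclusive? ", mutually_exclusive(A, C))
print("A, C independent?        ", independent(A, C))
```

Note that the mutually exclusive pair is not independent: once the roll is known to be even, the probability of it being odd drops to zero.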


In [None]:
#1. Three unbiased coins are tossed: probability of seeing more than one head
S = ['HHH','HHT','HTH','THH','HTT','THT','TTH','TTT']
A = ['HHH','HHT','HTH','THH']
P = len(A)/len(S)
print(P)

0.5


In [None]:
#2. Probability of 3 heads
B = ['HHH']
P = len(B)/len(S)
print(P)

0.125


In [None]:
# Probability of drawing a black card from a deck of cards
cards = 52
black_cards = 26
P_black = black_cards/cards
print(P_black)

0.5


## Random Variables

Numerical outcomes of a Random event

### Understanding Random Variables: Airline no-show

![alt text](https://i.pinimg.com/originals/2b/4d/5a/2b4d5a6be05138a924497130add7491c.png)


### Why are Random variables so important for us?

A significant portion of the data we use in business settings is the result of random variables. Consider "Sales" - can we be certain of the sales we will make in the upcoming week? Can we be absolutely certain of the precise number of sales we will experience the next week? Unlikely.

We describe random variables in terms of probabilities: outcomes can be highly likely or highly unlikely, but never certain.

##### Discrete and Continuous Random Variable
- Discrete Random Variable: takes a countable number of distinct values; e.g., customers in a queue, products added to cart
- Continuous Random Variable: can take any value within an interval, so its possible values are uncountable; e.g., amount of time to complete a task

## Probability Distribution

A probability distribution is the mathematical function that gives the probabilities of occurrence of different possible outcomes for an experiment.

### Understanding Probability Distribution: Airline no-show

![alt text](https://i.pinimg.com/originals/2b/4d/5a/2b4d5a6be05138a924497130add7491c.png)


#### Probability Distribution

- A statistical function that describes all the possible values and likelihoods that a random variable can take within a given range
- The sum of all probabilities for all possible values must equal 1; the probability for a particular value or range of values must be between 0 and 1
- May be discrete or continuous

![alt text](https://www.programsbuzz.com/sites/default/files/inline-images/1_c2ylMCItL1XG6O3mGhjzng.png)
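
The two properties listed above (each probability lies between 0 and 1, and all probabilities sum to 1) can be checked on a small discrete distribution. The sketch below builds the distribution of the sum of two fair dice, an example chosen here for illustration.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Distribution of the sum of two fair dice (36 equally likely outcomes)
counts = Counter(a + b for a, b in product(range(1, 7), repeat=2))
pmf = {total: Fraction(n, 36) for total, n in sorted(counts.items())}

for total, p in pmf.items():
    print(total, p)

# Every probability lies in [0, 1] and together they sum to 1
assert all(0 <= p <= 1 for p in pmf.values())
print("Sum of probabilities:", sum(pmf.values()))
```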

##### Discrete Probability Distribution

- A function that can assume a discrete number of values
- Each possible value has a non-zero likelihood
- Types of discrete distributions:
   1. Binomial (Bernoulli is the special case of a single trial)
   2. Negative Binomial
   3. Geometric
   4. Poisson

##### Binomial Distribution

- Binomial distribution with parameters n and p is the discrete probability distribution of the `number of successes in a sequence of n independent experiments`, each asking a yes–no question, and each with its own boolean-valued outcome - success or failure
- Features:
   1. The number of observations or trials is fixed
   2. Each observation or trial is independent
   3. The probability of success is exactly the same from one trial to another
   
![alt text](https://cdn.educba.com/academy/wp-content/uploads/2019/05/Binomial-Distribution-Formula.jpg)

##### Probability of definite no. of successes in certain no. of trials

The random chance probability of seeing exactly r successes in n Bernoulli trials, when the probability of success on any trial is p, is calculated using probability mass function (PMF)  
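
For reference, the PMF can be written out by hand as C(n, r) * p^r * (1 - p)^(n - r). The helper name `binomial_pmf` below is an illustrative choice, and the sketch uses only the standard library.

```python
from math import comb

def binomial_pmf(r, n, p):
    """P(exactly r successes in n independent trials with success probability p)."""
    return comb(n, r) * p**r * (1 - p)**(n - r)

# Exactly 5 heads in 10 tosses of a fair coin
print(round(binomial_pmf(5, 10, 0.5), 3))
```

Writing the formula out makes the roles of the three parameters explicit: `comb(n, r)` counts the arrangements of r successes among n trials, and the two powers give the probability of any one such arrangement.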

In [3]:
# Probability of getting 5 heads while tossing a coin 10 times

from scipy.stats import binom
n = 10                                     #no. of trials
p = 0.5                                    #probability of getting head
r = 5                                      #no. of successful outcomes, i.e., head
dist = binom.pmf(r,n,p)
print("Probability of getting 5 heads in 10 trials of coin tossing: ", round(dist,3))

Probability of getting 5 heads in 10 trials of coin tossing:  0.246


In [8]:
# Q: You're the quality control manager at a shipyard parts manufacturing unit. A defect rate of 15% has been observed during the process.
# As part of the QC process, you randomly decide to evaluate the products in batches of 10.
# Supposing from a random batch you found 3 defects, how likely is this outcome due to random chance?

n = 10                                     #no. of items in each batch
p = 0.15                                    #defect rate
r = 3                                      #no. of defects drawn
dist = binom.pmf(r,n,p)
print("Probability of drawing 3 defects from a batch of 10 items: ", round(dist,3))


Probability of drawing 3 defects from a batch of 10 items:  0.13


### Complex example of applying a binomial distribution to a business problem

![alt text](https://img.freepik.com/premium-vector/cash-flow-dollar-bill-investment-fund-flow-currency-exchange-vector-stock-illustration_100456-9962.jpg?w=2000)


##### Cumulative Probability

- Probability that a random variable falls within a specified range, i.e., the random variable value is less than or equal to a specified value
- The binomial cumulative distribution function lets you obtain the probability of observing less than or equal to r successes in n trials, with the probability p of success on a single trial

![alt text](https://images.deepai.org/glossary-terms/f25ef2b2f937462890db2d019ff2cf12/CDF.png)
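
A cumulative probability is just the PMF summed over every value up to the cut-off. The sketch below spells that out for the binomial case with the standard library; `binomial_pmf` and `binomial_cdf` are illustrative helper names.

```python
from math import comb

def binomial_pmf(r, n, p):
    return comb(n, r) * p**r * (1 - p)**(n - r)

def binomial_cdf(r, n, p):
    """P(at most r successes) = sum of the PMF over 0..r."""
    return sum(binomial_pmf(k, n, p) for k in range(r + 1))

# Fair coin, 10 tosses: probability of at most 5 heads, and its complement
at_most_5 = binomial_cdf(5, 10, 0.5)
print("P(X <= 5):", round(at_most_5, 3))
print("P(X > 5): ", round(1 - at_most_5, 3))
```

The complement on the last line is the pattern to use for "more than r" questions: P(X > r) = 1 - P(X <= r).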

In [None]:
# Q: A utility provider bills its customers after they have used the electricity.
# While waiting for its customers to pay their bills, the electric company keeps track of overdue invoices as an account receivable (AR).
# You are analysing the AR balances as the CFO.
# With 150 clients, you want to prepare for a scenario in which more than 50% of clients are late with payments.
# The size of the contingency fund is directly proportional to the probability of more than 50% being late in any given month.
# You need to have an idea of the likelihood of more than 50% being late in any given month.

# What's the probability of more than 50% of clients being late in any month?

# A: Probability of > 50% of clients being late = Probability of > 75 clients being late
#                                                        =  1 - Probability of <= 75 clients being late

In [None]:
n = 150                              #total clients
p = 0.4                              #probability of a client being late in payments (assumed here; the question does not state it)
r = 75                               #50% of the clients
a = 1 - binom.cdf(r,n,p)             #P(more than 75 late) = 1 - P(at most 75 late)
print("Probability of more than 50% of clients being late: ", round(a,3))

Probability of more than 50% of clients being late:  0.005


##### Hypergeometric Distribution

A discrete probability distribution that describes the `probability of 'k' successes` (random draws for which the object drawn has a specified feature) in 'n' draws, `without replacement`, from a finite population of size 'N' that contains exactly 'K' objects with that feature, wherein each draw is either a success or a failure

![alt text](https://static.packt-cdn.com/products/9781839217074/graphics/assets/6ed708c8-ca7c-4167-9b98-26cedebfc493.png)
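
The definition translates into a ratio of counts: ways to pick k successes from the K available, times ways to pick the remaining n - k draws from the N - K failures, divided by all ways to pick n from N. The sketch below uses a made-up urn (20 balls, 7 red, 5 drawn) and an illustrative helper name `hypergeom_pmf`.

```python
from math import comb

def hypergeom_pmf(k, N, K, n):
    """P(k successes in n draws without replacement; population N holds K successes)."""
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)

# Hypothetical urn: 20 balls, 7 of them red; draw 5 without replacement
p_two_red = hypergeom_pmf(2, 20, 7, 5)
print("P(exactly 2 red):", round(p_two_red, 3))

# The probabilities over all feasible k add up to 1
print("Total:", round(sum(hypergeom_pmf(k, 20, 7, 5) for k in range(6)), 6))
```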

In [None]:
# Q: HR Policies and Diversity
# A leading e-commerce company wishes to encourage diversity in its management ranks in the FY 23-24
# Of the 25 employees eligible for promotion into middle management, 9 are women
# The company announced promotions in the month of June '23 and 8 are promoted
# What's the probability that 4 women are promoted?

<H5>
A binomial distribution applies when multiple Bernoulli trials happen with replacement, so the probability of success stays the same on every trial. <br>
In a coin toss, if heads comes up once, it can come up again. <br>

A hypergeometric distribution applies when multiple Bernoulli trials happen without replacement, so each draw changes the odds for the next one. <br>
When choosing employees to promote, once an employee is chosen, we cannot choose them again.
</H5>

In [None]:
from scipy.stats import hypergeom
N = 25                                 #total employees eligible for promotion
n = 8                                  #no. of promotions
K = 9                                  #no. of women eligible for promotion
k = 4                                  #no. of women actually promoted whose probability we've to find
prob_hyp = hypergeom.pmf(k,N,K,n)
print("Probability of 4 women being promoted: ", round(prob_hyp,3))

Probability of 4 women being promoted:  0.212


In [None]:
# Q: Customer Service
# A customer service representative at the e-commerce firm mentioned previously receives a large batch of 200 customer complaints,
# of which 30 involve billing issues
# The representative randomly selects 10 complaints to address
# What is the probability that 2 of the selected complaints involve billing issues?

In [None]:
N = 200                                #total no. of complaints received
n = 10                                 #no. of complaints randomly chosen to address
K = 30                                 #no. of billing issues
k = 2                                  #no. of selected complaints involving billing issues
prob_hyp = hypergeom.pmf(k,N,K,n)
print("Probability of 2 billing issue complaints: ", round(prob_hyp,3))

Probability of 2 billing issue complaints:  0.284


##### Negative Binomial Distribution

- A discrete probability distribution that models the `number of failures` in a sequence of independent and identically distributed Bernoulli trials `before a specified (non-random) number of successes (denoted 'r') occurs`
- Differs from the binomial distribution in that the number of trials is not fixed: trials continue until the r-th success, and the random variable is how many failures (and therefore how many trials) that takes

![alt text](https://i0.wp.com/statisticsbyjim.com/wp-content/uploads/2022/10/negative_binomial_formula.png?resize=592%2C139&ssl=1)
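
Under the convention of counting k failures before the r-th success (the convention `scipy.stats.nbinom` follows), the PMF is C(k + r - 1, k) * p^r * (1 - p)^k. The coin example and the helper name `neg_binomial_pmf` are illustrative assumptions.

```python
from math import comb

def neg_binomial_pmf(k, r, p):
    """P(exactly k failures occur before the r-th success)."""
    return comb(k + r - 1, k) * p**r * (1 - p)**k

# Hypothetical: fair coin, probability that the 3rd head arrives on the 5th toss,
# i.e. exactly 2 tails are seen before the 3rd head
p_third_head_on_fifth = neg_binomial_pmf(2, 3, 0.5)
print(p_third_head_on_fifth)
```

The total number of trials is k + r, which is why "r-th success on trial number t" questions are answered with k = t - r failures.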

In [14]:
# Q: If every customer at a shoe store in Bengaluru has a 30% chance of making a purchase, what is the likelihood that the 30th purchase will be made by the 100th customer?

from scipy.stats import nbinom
n = 70                                         #no. of trials resulting in failures (100 customers - 30 purchases)
r = 30                                         #no. of specified successes (purchases)
p = 0.3                                       #probability of success in a single trial
prob_nbinom = nbinom.pmf(n, r, p)              #P(70 failures before the 30th success)
print("Probability of 30th purchase happening with 100th customer: ", round(prob_nbinom,4))

Probability of 30th purchase happening with 100th customer:  0.026


##### Geometric Distribution

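- Distribution representing the number of Bernoulli trials needed to get the first success, i.e., the failures before it plus the successful trial itself (this is the convention `scipy.stats.geom` uses, with values starting at 1)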
- The assumptions of this distribution are:
   1. There are two possible outcomes for each trial (success or failure)
   2. The trials are independent
   3. The probability of success is the same for each trial

![alt text](https://cdn.educba.com/academy/wp-content/uploads/2019/07/Geometric-Distribution-Formula.jpg)
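
Counting trials up to and including the first success, the PMF is (1 - p)^(k - 1) * p and the cumulative probability is 1 - (1 - p)^k. The sketch below writes both out; the die example and the helper names are assumptions for illustration.

```python
def geometric_pmf(k, p):
    """P(first success happens on trial k), for k = 1, 2, 3, ..."""
    return (1 - p) ** (k - 1) * p

def geometric_cdf(k, p):
    """P(first success happens on or before trial k)."""
    return 1 - (1 - p) ** k

# Hypothetical: rolling a fair die until the first six appears
p_six = 1 / 6
print("First six on exactly the 3rd roll:", round(geometric_pmf(3, p_six), 4))
print("First six within 3 rolls:         ", round(geometric_cdf(3, p_six), 4))
print("No six in the first 3 rolls:      ", round(1 - geometric_cdf(3, p_six), 4))
```

The last line is the "survival" probability (1 - p)^k, which answers "how likely is it to get through k trials with no success at all".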

In [None]:
# Q: What is the likelihood that a QC Inspector will need to inspect at most 15 items before detecting a problem in hardware
#    produced at an aircraft manufacturing facility with a 2% failure rate?

# A: P(reviewing at most 15 items)  =   P(x<=15)
# Here we use cumulative function to solve the problem

In [15]:
from scipy.stats import geom
x = 15                                         #no. of items reviewed before encountering a defective one
p = 0.02                                        #probability of getting defective piece
dist = geom.cdf(x,p)
print("Probability of reviewing at most 15 items before finding a defect: ", round(dist,4))

Probability of reviewing at most 15 items before finding a defect:  0.2614


In [None]:
# Q: A software developer writes code for a complex algorithm to create a chatbot
# The probability of introducing a bug in any given line of code is 0.02
# What is the probability that the developer writes 10 lines of code without introducing a bug?

In [None]:
# A: In this scenario, the software developer has a geometric probability distribution since they write lines of code until they introduce a bug.
# The probability of introducing a bug in any given line of code is 0.02.
# We need to find the probability of the first bug occurring after 10 lines of code have been written (i.e., writing 10 lines of code without introducing a bug).

In [None]:
x = 10                                          #no. of lines of code written without introducing a bug
p = 0.02                                        #probability of introducing a bug in any given line
dist = geom.sf(x,p)                             #P(first bug comes after line 10) = (1 - p)**10; geom.pmf(x,p) would instead give the first bug exactly on line 10
print("Probability of writing 10 lines of code without introducing a bug: ", round(dist,4))

Probability of writing 10 lines of code without introducing a bug:  0.8171


##### Poisson Distribution
- A discrete probability distribution that expresses the probability of a given number of events occurring in a fixed interval of time or space, if these events occur with a known constant mean rate and independently of the time since the last event
- Used by business organizations to make forecasts about the number of customers or sales on certain days or seasons of the year
- Conditions to be satisfied are:
   1. Events have to be counted as whole numbers
   2. Events are independent so if one event occurs, it does not impact the chances of the second event occurring
   3. Average frequency of occurrence for the given time period is known
   4. Number of events that have already occurred can be counted
   
![alt text](https://cdn.educba.com/academy/wp-content/uploads/2019/06/Poisson-Distribution-Formula-1.jpg)

![alt text](https://upload.wikimedia.org/wikipedia/commons/thumb/1/16/Poisson_pmf.svg/325px-Poisson_pmf.svg.png)
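
The Poisson PMF is e^(-m) * m^k / k!, where m is the mean number of events per interval. A minimal standard-library sketch follows; the help desk receiving 3 tickets per hour and the helper names are hypothetical.

```python
from math import exp, factorial

def poisson_pmf(k, mean):
    """P(exactly k events in the interval, given the mean number of events)."""
    return exp(-mean) * mean**k / factorial(k)

def poisson_cdf(k, mean):
    """P(at most k events in the interval)."""
    return sum(poisson_pmf(i, mean) for i in range(k + 1))

# Hypothetical: a help desk that receives 3 tickets per hour on average
print("P(no tickets in an hour):     ", round(poisson_pmf(0, 3), 4))
print("P(more than 5 tickets in one):", round(1 - poisson_cdf(5, 3), 4))
```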

### Example of Poisson distribution

![alt text](https://img.freepik.com/premium-vector/cute-smiling-happy-atm-with-money-flat-cartoon-character-illustration-isolated-white-background-automated-teller-machine-atm-character-concept_92289-1445.jpg)



In [None]:
# Q:
# You oversee the call centre for a consultation service provider, where there are 55 employees and an average of 330 calls handled per hour.
# A holiday is approaching, and 5 resources have requested time off.
# You estimate that the 50 resources still available can handle 20% more calls, but you want to prepare for the possibility of a call volume spike of more than 20%.
# What are the chances that the number of calls on that day will increase by more than 20%?

# A:
# m = (330)/55 = 6 calls/hour per resource
# 20% more calls with 5 fewer resources = (330 * 1.2)/50 = 7.92 (~8) calls an hour per resource
# P(calls going up by more than 20%) = P(x>8)

In [None]:
from scipy.stats import poisson
m = 6
x = 8
dist = 1 - poisson.cdf(x,m)
print("Probability of calls going up by 20%: ", round(dist,3))

Probability of calls going up by 20%:  0.153


In [None]:
# Q: Assuming a zero starting balance, how much cash should be stocked, for a period of four days,
#   at an ATM located at HSR in Bengaluru, where Federal Bank notes that there are typically 70 withdrawals per day with an average transaction amount of INR 55?
#   A customer service KRA is to keep client complaints below 10%.

In [None]:
# Calculating the probability of demand exceeding each stocking level using poisson dist
# 1 - cdf(x) is the probability of MORE than x withdrawals in a day

m = 70
x1 = 70
P1 = 1 - poisson.cdf(x1,m)
print("Probability of more than 70 withdrawals a day: ", round(P1,3))

x2 = 75
P2 = 1 - poisson.cdf(x2,m)
print("Probability of more than 75 withdrawals a day: ", round(P2,3))

x3 = 80
P3 = 1 - poisson.cdf(x3,m)
print("Probability of more than 80 withdrawals a day: ", round(P3,3))

x4 = 81
P4 = 1 - poisson.cdf(x4,m)
print("Probability of more than 81 withdrawals a day: ", round(P4,3))

x5 = 82
P5 = 1 - poisson.cdf(x5,m)
print("Probability of more than 82 withdrawals a day: ", round(P5,3))

x6 = 85
P6 = 1 - poisson.cdf(x6,m)
print("Probability of more than 85 withdrawals a day: ", round(P6,3))

x7 = 90
P7 = 1 - poisson.cdf(x7,m)
print("Probability of more than 90 withdrawals a day: ", round(P7,3))

Probability of more than 70 withdrawals a day:  0.468
Probability of more than 75 withdrawals a day:  0.252
Probability of more than 80 withdrawals a day:  0.107
Probability of more than 81 withdrawals a day:  0.087
Probability of more than 82 withdrawals a day:  0.071
Probability of more than 85 withdrawals a day:  0.035
Probability of more than 90 withdrawals a day:  0.009


In [None]:
# Customer complaints to be kept <10%
# Stocking for 80 withdrawals per day still leaves a 10.7% chance of running short;
# 81 withdrawals per day is the smallest level above that brings the chance under 10% (8.7%)
# Calculating amount of cash required for 4 day stocking

avg_amount = 55
amount_stock = x4 * avg_amount * 4
print("The most appropriate amount of cash that needs to be stocked for a 4 day period: ", amount_stock)

The most appropriate amount of cash that needs to be stocked for a 4 day period:  17820


##### Continuous Probability Distribution

- A probability distribution in which the random variable can take on any value within a range, i.e., is continuous
- The probability that the random variable takes on any one specific value is zero, so probabilities are computed over ranges of values

##### Normal Distribution
- The most common kind of continuous probability distribution, due to its useful applications in Statistics
- A family of distributions, i.e., an infinite number of distributions with differing means (μ) and standard deviations (σ)
- Symmetric about the mean
- Mean = Median = Mode
- The two tails extend indefinitely, approaching but never touching the horizontal axis
- The standard normal distribution has a mean of zero and a standard deviation of one


![alt text](https://v6e8y6s7.stackpathcdn.com/wp-content/uploads/2022/07/Normal-Distribution.png)

![alt text](https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcSpm86MO5-IbzWi3Ae_OVC54ucvJ-vhg3jr5dgONfVo2T3yKMvMnoDgRTkd4Jaoh-U9qB0&usqp=CAU)
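
Any normal distribution can be mapped onto the standard normal by the z-score z = (x - μ) / σ. The sketch below checks that both routes give the same probability, using the standard library's `statistics.NormalDist` rather than scipy; the mean of 50 and standard deviation of 10 are made-up values.

```python
from statistics import NormalDist

# Hypothetical measurements: mean 50, standard deviation 10
mu, sigma = 50, 10
x = 65

z = (x - mu) / sigma                       # distance from the mean in standard deviations
p_direct = NormalDist(mu, sigma).cdf(x)    # P(X <= 65) on the original scale
p_standard = NormalDist(0, 1).cdf(z)       # P(Z <= z) on the standard normal scale

print("z-score:", z)
print("P(X <= 65):", round(p_direct, 4))
print("P(Z <= z): ", round(p_standard, 4))
```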

In [None]:
# Q:
# Mensa set up a workshop at XYZ School and administered IQ tests to 100 children.
# They found that the average IQ is 110, with a standard deviation of 7
# What are the chances that the student you choose has an IQ higher than 120, assuming you chose a random student from the group of 100?

# A: P(score > 120) = 1 - P(score <= 120)

In [None]:
from scipy.stats import norm
x = 120
m = 110
s = 7
dist = 1-norm.cdf(x,m,s)
print("Probability that the random student chosen has an IQ > 120: ", round(dist,3))

Probability that the random student chosen has an IQ > 120:  0.077


In [None]:
# Q:
# A well-known TV manufacturer wishes to provide an hours-based performance guarantee for their new product line so that product failure rates based on
# performance hours are limited to fewer than 5%.
# They evaluate a batch of 1000 samples and discover that performance hours are distributed normally,
# with an average life of 71,450 hours and a standard deviation of 2700 hours.
# How many performance hours should they promise to keep failure rates under 5%?

In [None]:
# Calculating probability for arbitrary number of hours using normal dist

m = 71450
s = 2700
x1 = 71450
P1 = norm.cdf(x1,m,s)
print("Probability that the random sample performs less than 71450 hours: ", round(P1,3))

x2 = 70000
P2 = norm.cdf(x2,m,s)
print("Probability that the random sample performs less than 70000 hours: ", round(P2,3))

x3 = 69000
P3 = norm.cdf(x3,m,s)
print("Probability that the random sample performs less than 69000 hours: ", round(P3,3))

x4 = 68000
P4 = norm.cdf(x4,m,s)
print("Probability that the random sample performs less than 68000 hours: ", round(P4,3))

x5 = 67000
P5 = norm.cdf(x5,m,s)
print("Probability that the random sample performs less than 67000 hours: ", round(P5,3))

x6 = 66000
P6 = norm.cdf(x6,m,s)
print("Probability that the random sample performs less than 66000 hours: ", round(P6,3))

Probability that the random sample performs less than 71450 hours:  0.5
Probability that the random sample performs less than 70000 hours:  0.296
Probability that the random sample performs less than 69000 hours:  0.182
Probability that the random sample performs less than 68000 hours:  0.101
Probability that the random sample performs less than 67000 hours:  0.05
Probability that the random sample performs less than 66000 hours:  0.022


In [None]:
# A performance guarantee of 67000 hours keeps the failure rate just under 5%
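
Instead of trying candidate values one by one, the threshold can be read off the inverse CDF (the percentile function). This is a sketch with the standard library's `statistics.NormalDist.inv_cdf`, used here in place of scipy, and it reuses the mean and standard deviation from the question.

```python
from statistics import NormalDist

lifetime = NormalDist(mu=71450, sigma=2700)   # figures from the question above

# Hours below which only 5% of units are expected to fail
threshold = lifetime.inv_cdf(0.05)
print("5th percentile of lifetime (hours):", round(threshold))

# Rounding down to 67000 hours keeps the failure rate under 5%
print("Failure rate at 67000 hours:", round(lifetime.cdf(67000), 4))
```

Any guarantee at or below the 5th percentile keeps expected failures under 5%, so a rounded-down figure such as 67000 hours is consistent with the trial-and-error result.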

In [None]:
# Q: A pharmaceutical company's marketing campaign has a conversion rate of 3%
# If the company sends 5000 marketing emails, what is the probability of getting more than 200 conversions?

In [None]:
# A: In this scenario, the number of conversions from the marketing campaign follows a binomial distribution,
# but we are interested in the probability of getting more than 200 conversions, which involves working with a normal distribution approximation.
# To calculate the mean and standard deviation for this normal distribution approximation, we use the formula for the mean and standard deviation of a binomial distribution,
# which is applicable when the sample size is large (e.g., emails sent) and the probability of success (e.g., conversion rate) is not too close to 0 or 1.

# Mean = n * p
# Standard Deviation = sqrt(n * p * (1 - p))
# where:
# n is the number of trials (number of marketing emails sent)
# p is the probability of success in each trial (conversion rate)

In [None]:
import math
conversion_rate = 0.03
emails_sent = 5000

mean = conversion_rate * emails_sent
std_dev = math.sqrt(emails_sent * conversion_rate * (1 - conversion_rate))

print("Mean conversion: ", mean)
print("Std dev: ", std_dev)

Mean conversion:  150.0
Std dev:  12.062338081814818


In [None]:
x = 200
dist = 1 - norm.cdf(x, mean, std_dev)
print("Probability of getting more than 200 conversions: ", dist)

Probability of getting more than 200 conversions:  1.6980799791843637e-05
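
As a sanity check on the approximation, the exact binomial tail can be summed directly. The sketch below works in logarithms (via `math.lgamma`) to avoid overflow with n = 5000 and uses `statistics.NormalDist` for the approximation; the helper name `log_binomial_pmf` is an illustrative choice. This far out in the tail, the two answers are expected to agree in order of magnitude rather than digit for digit.

```python
from math import exp, lgamma, log, sqrt
from statistics import NormalDist

n, p, threshold = 5000, 0.03, 200   # figures from the question above

def log_binomial_pmf(k, n, p):
    """Log of the binomial PMF, computed with lgamma to avoid overflow for large n."""
    log_comb = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return log_comb + k * log(p) + (n - k) * log(1 - p)

# Exact tail: P(X > 200) = sum of the PMF over 201..n
exact_tail = sum(exp(log_binomial_pmf(k, n, p)) for k in range(threshold + 1, n + 1))

# Normal approximation with the same mean and standard deviation
approx = NormalDist(n * p, sqrt(n * p * (1 - p)))
approx_tail = 1 - approx.cdf(threshold)

print("Exact binomial tail: ", exact_tail)
print("Normal approximation:", approx_tail)
```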


### end of the notebook