## **Distributions & Some Business Problems**

*For example, suppose we want to know the average age of the people in a region. 10,000 people live there, and it is impractical to meet each person and compute the average over everyone. Instead, we take a sample of 100 individuals out of the ten thousand, calculate the average age of those 100 people, and use it as an estimate of the average age of the whole region.*

In [1]:
import numpy as np


In [2]:
# create a population of ten thousand people with ages from 0 to 79 (randint excludes the upper bound)
population = np.random.randint(0, 80, 10000)
population

array([ 9, 76, 54, ...,  6,  3, 56])

*What is a seed? We can pass any number we like to the seed function: 5, 66, 1 and so on. Calling np.random.seed(5), for example, says: "I am about to make a random selection, and I want to be able to get the same numbers again later." Whenever we call seed(5) again before the same selection, we get those same numbers back. The seed works like a box where the numbers are kept so that they do not get lost.*
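A minimal sketch of this behaviour, assuming a hypothetical stand-in array `values` instead of the population above. Resetting the same seed before the same call should reproduce the same draw, while a different seed starts a different sequence:

```python
import numpy as np

# Hypothetical stand-in data; any array works for this demonstration.
values = np.arange(1000)

np.random.seed(5)
first_draw = np.random.choice(values, size=5)

np.random.seed(5)   # reset the generator to the same starting state
second_draw = np.random.choice(values, size=5)

np.random.seed(6)   # a different seed starts a different sequence
third_draw = np.random.choice(values, size=5)

print("Same seed, same draw:     ", np.array_equal(first_draw, second_draw))
print("Different seed, same draw:", np.array_equal(first_draw, third_draw))
```

Note that the seed only fixes the sequence of random numbers. If other random calls are made between the seed and the selection, the selection will differ.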

In [3]:
# do sampling

np.random.seed(1)
sample = np.random.choice(a = population, size = 100)
sample

array([65, 73, 76,  0, 68,  2, 42, 27, 63,  0, 67, 41, 69, 50, 72,  2, 73,
       41, 73, 28, 15, 25, 65,  2,  9, 22, 14, 29, 32,  0, 74, 24, 22, 18,
       40, 45, 20, 44, 58, 56,  2,  4, 73, 75, 34, 59, 76, 20, 63, 17, 45,
       76, 55,  8, 23, 69, 30, 39, 10, 18, 28, 54, 18,  8, 29, 78, 69, 75,
       66, 75, 70, 19, 58, 57, 10, 50, 37, 25, 30,  0, 72, 74, 69, 38,  4,
       33,  6, 14, 53, 68, 16, 49, 14,  3, 15, 35, 52, 37, 44, 13])

In [4]:
# look at the mean of the sample and the population

print("Sample mean is", sample.mean())
print("Population mean is", population.mean())

Sample mean is 39.07
Population mean is 39.5231


Here the power of statistical sampling theory is evident: a sample of only 100 people gives a mean (39.07) very close to the population mean (39.52). We will make further statistical comments on this later.
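How close a sample mean lands depends on the sample size. The sketch below is an illustration under assumptions: it rebuilds a population with the same recipe as above (with its own `RandomState`, so the numbers will not match the cells above) and averages the absolute error over a hypothetical 200 repeated samples per size.

```python
import numpy as np

rng = np.random.RandomState(0)
population = rng.randint(0, 80, 10000)   # same recipe as the population above
true_mean = population.mean()

repeats = 200   # assumption: number of samples averaged per sample size
average_error = {}
for size in (10, 100, 1000):
    # absolute distance between each sample mean and the population mean
    errors = [abs(rng.choice(population, size=size).mean() - true_mean)
              for _ in range(repeats)]
    average_error[size] = np.mean(errors)
    print("Sample size:", size, "average absolute error:", round(average_error[size], 2))
```

Larger samples should give smaller errors on average, which is why 100 people can already describe 10,000 reasonably well.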

**Sampling distribution**

*Here, each sample contains different values, but thanks to the seed those values are reproducible. If we want the same values again elsewhere, we can get them by calling seed(10) before sampling.*

*Again, a value close to the population mean came out. If we draw more samples, the average of the sample means gets closer to the population mean (this is the law of large numbers). The Central Limit Theorem adds that the sample means themselves are approximately normally distributed around the population mean.*

In [5]:
np.random.seed(10)
# draw ten samples of 100 people each and collect their means
sample_means = [np.random.choice(a = population, size = 100).mean() for _ in range(10)]

# average of the ten sample means
sum(sample_means) / 10

39.154999999999994
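To see what happens with many more than ten samples, the following sketch draws a hypothetical 2,000 samples of size 100 from a population built with the same recipe as above. It compares the centre and spread of the sample means with the population mean and with sigma / sqrt(n), the standard error that theory predicts. The names `repeats` and `sample_means` are illustrative.

```python
import numpy as np

rng = np.random.RandomState(10)
population = rng.randint(0, 80, 10000)   # same recipe as the population above

n = 100          # size of each sample
repeats = 2000   # assumption: any large number of samples works here
sample_means = np.array([rng.choice(population, size=n).mean()
                         for _ in range(repeats)])

print("Population mean:     ", population.mean())
print("Mean of sample means:", sample_means.mean())
print("Std of sample means: ", sample_means.std())
print("sigma / sqrt(n):     ", population.std() / np.sqrt(n))
```

The individual ages are roughly uniform, not normal, yet a histogram of `sample_means` should look close to a bell curve centred on the population mean.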

**Descriptive statistics, covariance and correlation**

Covariance measures how two variables vary together: it is positive when they tend to move in the same direction and negative when they tend to move in opposite directions.

Covariance formula: cov(X,Y) = E[(X - E[X])(Y - E[Y])]

Correlation is a standardized measure of the relationship between two variables. It expresses the direction and the strength of the linear relationship on a scale from -1 to 1. Whether the relationship is statistically significant has to be checked with a separate test.
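The formula can be checked by hand on a small example. The arrays `x` and `y` below are hypothetical toy data. The sketch uses the sample version of the formula (dividing by n - 1), which is also the default of `np.cov` and of pandas `.cov()`.

```python
import numpy as np

# Hypothetical toy data, chosen only to illustrate the formulas.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# Sample covariance: sum of products of deviations, divided by n - 1
cov_xy = ((x - x.mean()) * (y - y.mean())).sum() / (len(x) - 1)

# Pearson correlation: covariance rescaled by both sample standard deviations
corr_xy = cov_xy / (x.std(ddof=1) * y.std(ddof=1))

print("Covariance by hand: ", cov_xy, " numpy:", np.cov(x, y)[0, 1])
print("Correlation by hand:", corr_xy, " numpy:", np.corrcoef(x, y)[0, 1])
```

Covariance depends on the units of the variables, while correlation is unit-free, which makes correlation easier to compare across pairs of variables.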

**Descriptive statistics application by using researchpy**

In [6]:
# install researchpy library
!pip install researchpy

import researchpy as rp
import seaborn as sns

tips = sns.load_dataset("tips")
df = tips.copy()
df.head()

Collecting researchpy
  Downloading researchpy-0.3.5-py3-none-any.whl (33 kB)
Installing collected packages: researchpy
Successfully installed researchpy-0.3.5


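|   | total_bill | tip | sex | smoker | day | time | size |
|---|---|---|---|---|---|---|---|
| 0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
| 1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
| 2 | 21.01 | 3.5 | Male | No | Sun | Dinner | 3 |
| 3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
| 4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |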


In [7]:
#for numeric variables
rp.summary_cont(df[["total_bill", "tip", "size"]])







|   | Variable | N | Mean | SD | SE | 95% Conf. | Interval |
|---|---|---|---|---|---|---|---|
| 0 | total_bill | 244.0 | 19.7859 | 8.9024 | 0.5699 | 18.6633 | 20.9086 |
| 1 | tip | 244.0 | 2.9983 | 1.3836 | 0.0886 | 2.8238 | 3.1728 |
| 2 | size | 244.0 | 2.5697 | 0.9511 | 0.0609 | 2.4497 | 2.6896 |


In [8]:
#for categoric variables
rp.summary_cat(df[["sex", "day", "smoker"]])



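|   | Variable | Outcome | Count | Percent |
|---|---|---|---|---|
| 0 | sex | Male | 157 | 64.34 |
| 1 |  | Female | 87 | 35.66 |
| 2 | day | Sat | 87 | 35.66 |
| 3 |  | Sun | 76 | 31.15 |
| 4 |  | Thur | 62 | 25.41 |
| 5 |  | Fri | 19 | 7.79 |
| 6 | smoker | No | 151 | 61.89 |
| 7 |  | Yes | 93 | 38.11 |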


In [9]:
# COVARIANCE between two variables
df[["tip", "total_bill"]].cov()

|   | tip | total_bill |
|---|---|---|
| tip | 1.914455 | 8.323502 |
| total_bill | 8.323502 | 79.252939 |


In [10]:
# CORRELATION between two variables
df[["tip", "total_bill"]].corr()

|   | tip | total_bill |
|---|---|---|
| tip | 1.0 | 0.675734 |
| total_bill | 0.675734 | 1.0 |
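The two matrices are linked: the correlation is the covariance divided by the product of the two standard deviations. As a quick check, the sketch below recomputes the correlation from the numbers printed in the covariance matrix (the diagonal holds the variances). The variable names are illustrative.

```python
import math

# Values copied from the covariance matrix printed above
cov_tip_bill = 8.323502      # cov(tip, total_bill)
var_tip = 1.914455           # variance of tip
var_total_bill = 79.252939   # variance of total_bill

# correlation = covariance / (std of tip * std of total_bill)
correlation = cov_tip_bill / math.sqrt(var_tip * var_total_bill)
print("Correlation recomputed from the covariance matrix:", correlation)
```

Up to rounding of the printed inputs, the result should agree with the 0.675734 reported by `corr()`.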


**Application of bernoulli distribution**

For example, here we are interested in the probability of getting tails in a coin flip experiment, and we know that this probability is 0.6 (the coin is biased).

Setting k = 1 gives the probability of tails, and setting k = 0 gives the probability of heads.

In [11]:
from scipy.stats import bernoulli
p = 0.6
bd = bernoulli(p)
#pmf - probability mass function
print("The probability of getting a tail is", bd.pmf(k = 1))
print("The probability of getting a head is", bd.pmf(k = 0))

The probability of getting a tail is 0.6
The probability of getting a head is 0.4000000000000001
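The same frozen distribution can also generate flips, which gives a way to compare the stated probability with an observed frequency. This is a sketch under assumptions: the number of flips (100,000) and the `random_state` value are arbitrary choices for the illustration.

```python
from scipy.stats import bernoulli

p = 0.6
bd = bernoulli(p)

# Simulate coin flips: 1 = tails, 0 = heads (same coding as above)
flips = bd.rvs(size=100000, random_state=0)

print("Theoretical mean (p):        ", bd.mean())
print("Theoretical variance p(1-p): ", bd.var())
print("Observed share of tails:     ", flips.mean())
```

The observed share should sit close to 0.6, and it gets closer as the number of flips grows.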


**The law of large numbers - application**

In [12]:
rng = np.random.RandomState(123)
for i in np.arange(1, 25):
    number_of_experiments = 2**i
    tails_heads = rng.randint(0, 2, size = number_of_experiments)
    heads_probabilities = np.mean(tails_heads)
    print("Shot count: ", number_of_experiments, "___", "Heads probabilities: %.2f" % (heads_probabilities*100))

Shot count:  2 ___ Heads probabilities: 50.00
Shot count:  4 ___ Heads probabilities: 0.00
Shot count:  8 ___ Heads probabilities: 62.50
Shot count:  16 ___ Heads probabilities: 43.75
Shot count:  32 ___ Heads probabilities: 46.88
Shot count:  64 ___ Heads probabilities: 56.25
Shot count:  128 ___ Heads probabilities: 50.78
Shot count:  256 ___ Heads probabilities: 52.73
Shot count:  512 ___ Heads probabilities: 52.93
Shot count:  1024 ___ Heads probabilities: 50.20
Shot count:  2048 ___ Heads probabilities: 48.58
Shot count:  4096 ___ Heads probabilities: 49.49
Shot count:  8192 ___ Heads probabilities: 49.58
Shot count:  16384 ___ Heads probabilities: 49.96
Shot count:  32768 ___ Heads probabilities: 50.00
Shot count:  65536 ___ Heads probabilities: 49.68
Shot count:  131072 ___ Heads probabilities: 49.97
Shot count:  262144 ___ Heads probabilities: 50.13
Shot count:  524288 ___ Heads probabilities: 50.01
Shot count:  1048576 ___ Heads probabilities: 50.09
Shot count:  2097152 ___ He

**Application of binomial distribution**

*The business problem to be solved:*

A company advertises on various channels and tries to optimize the probability that people who view these advertisements apply for the company's products.

*Details:*

- The company advertises in some area

- The distribution (binomial) and the click probability (p = 0.01) are known

*Question:*

What are the probabilities of exactly 1, 5, and 10 clicks on the ad when 100 people see it?



In [13]:
from scipy.stats import binom
p = 0.01
n = 100
bd = binom(n, p)
print("Probability of one click:", bd.pmf(k = 1))
print("Probability of five clicks:", bd.pmf(k = 5))
print("Probability of ten clicks:", bd.pmf(k = 10))

Probability of one click: 0.36972963764972666
Probability of five clicks: 0.002897787123761478
Probability of ten clicks: 7.006035693977194e-08
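These values follow from the binomial formula P(X = k) = C(n, k) p^k (1 - p)^(n - k). The sketch below recomputes them with the standard library only. The helper `binomial_pmf` is a hypothetical name introduced for this illustration, and the "at least one click" question is an extra example, not part of the original problem.

```python
from math import comb

n, p = 100, 0.01

def binomial_pmf(k, n, p):
    """P(X = k) = C(n, k) * p**k * (1 - p)**(n - k)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

for k in (1, 5, 10):
    print("Probability of", k, "click(s):", binomial_pmf(k, n, p))

# Probability of at least one click: complement of zero clicks
p_at_least_one = 1 - binomial_pmf(0, n, p)
print("Probability of at least one click:", p_at_least_one)
```

Working with the complement avoids summing the probabilities of 1, 2, ..., 100 clicks one by one.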


*The business problem to be solved:*

A coin is tossed 4 times. What is the probability of getting heads exactly 1 time or exactly 2 times?



In [14]:
p = 0.5
n = 4
bd = binom(n, p)
print(bd.pmf(k = 2))
print(bd.pmf(k = 1))

0.375
0.25
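The question asks for 1 time or 2 times. Since the two outcomes cannot happen together, their probabilities add up: 0.25 + 0.375 = 0.625. A short sketch of that last step, with the cumulative distribution function as a cross-check:

```python
from scipy.stats import binom

bd = binom(4, 0.5)

# "1 or 2 heads": mutually exclusive outcomes, so the probabilities add
p_one_or_two = bd.pmf(1) + bd.pmf(2)

# Same quantity via the cdf: P(X <= 2) - P(X <= 0)
p_via_cdf = bd.cdf(2) - bd.cdf(0)

print("P(1 or 2 heads):", p_one_or_two)
print("Via the cdf:    ", p_via_cdf)
```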


**Poisson distribution - application**

*The business problem to be solved:*

There is a classifieds site where ads are entered. We want to calculate the probability of incorrect ad entries.

*Details:*

- Measurements are carried out for a year

- the distribution is known (Poisson) with lambda = 0.1 (the average number of errors: entries were observed for one year, errors were counted, and their average was calculated to be 0.1)

*Question:*

What are the probabilities of no errors, 3 errors, and 5 errors?



In [15]:
from scipy.stats import poisson
lambda_ = 0.1
bd = poisson(mu = lambda_)
print("Probability of no errors:", bd.pmf(k = 0))
print("Probability of three errors:", bd.pmf(k = 3))
print("Probability of five errors:", bd.pmf(k = 5))

Probability of no errors: 0.9048374180359595
Probability of three errors: 0.00015080623633932676
Probability of five errors: 7.54031181696634e-08
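The Poisson probabilities come from the formula P(X = k) = lambda^k e^(-lambda) / k!. The sketch below recomputes the three values with the standard library. `poisson_pmf` is a hypothetical helper name introduced for this illustration.

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """P(X = k) = lam**k * exp(-lam) / k!"""
    return lam**k * exp(-lam) / factorial(k)

lam = 0.1
for k in (0, 3, 5):
    print("Probability of", k, "errors:", poisson_pmf(k, lam))
```

With lambda = 0.1, the k = 0 term is simply e^(-0.1), which is why "no errors" is by far the most likely outcome.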


*The business problem to be solved:*

What is the probability that 5 grades are entered incorrectly out of 5000 grade entries in a university? The distribution is known to be Poisson and lambda = 0.2



In [16]:
from scipy.stats import poisson
lambda_ = 0.2
bd = poisson(mu = lambda_)
print(bd.pmf(k = 5))

2.183282008207956e-06
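One way to read this problem is that lambda is the expected number of wrong entries among the 5000, so each entry has a tiny error probability p = lambda / n. This reading is an assumption, since the problem only states the distribution and lambda. Under it, the Poisson result should be close to the exact binomial one, as this sketch shows:

```python
from math import comb, exp, factorial

n = 5000      # number of grade entries (from the problem statement)
lam = 0.2     # Poisson rate (from the problem statement)
p = lam / n   # assumption: per-entry error probability implied by lam = n * p
k = 5

binomial_prob = comb(n, k) * p**k * (1 - p)**(n - k)
poisson_prob = lam**k * exp(-lam) / factorial(k)
relative_gap = abs(binomial_prob - poisson_prob) / poisson_prob

print("Binomial:    ", binomial_prob)
print("Poisson:     ", poisson_prob)
print("Relative gap:", relative_gap)
```

This is the usual motivation for the Poisson distribution: many trials, each with a very small probability of the event.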


**Normal distribution - application**

*The business problem to be solved:*

A meeting is held before an investment. In this meeting, we try to determine the probability that next month's sales reach certain values.

*Details:*
- the distribution is known to be normal (checked by visualization, e.g. distplot, and by hypothesis tests)

- the monthly average number of sales is 80K and the standard deviation is 5K. By the basic logic of the normal distribution, most sales fall within one standard deviation of the mean, that is, between 75K and 85K (about 68% of the time). In short, sales are most likely to land in this interval.

*Questions:*
- What is the probability of selling more than 90K?

- What is the probability of selling more than 70K?

- What is the probability of selling less than 73K?

- What is the probability of selling between 85K and 90K?

In [17]:
from scipy.stats import norm

#cdf - cumulative distribution function
print("The probability of selling more than 90K is", 1 - norm.cdf(90, 80, 5))
print("The probability of selling more than 70K is", 1 - norm.cdf(70, 80, 5))
print("The probability of selling less than 73K is", norm.cdf(73, 80, 5))

print("The probability of selling between 85K and 90K is",
      norm.cdf(90, 80, 5) - norm.cdf(85, 80, 5))

The probability of selling more than 90K is 0.02275013194817921
The probability of selling more than 70K is 0.9772498680518208
The probability of selling less than 73K is 0.08075665923377107
The probability of selling between 85K and 90K is 0.13590512198327787
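The same answers can be read through z-scores, which express each threshold as a number of standard deviations from the mean. The sketch below also uses `norm.sf`, the survival function (1 - cdf), as an alternative to writing the subtraction by hand. The variable names are illustrative.

```python
from scipy.stats import norm

mean, std = 80, 5

# z-score: distance from the mean measured in standard deviations
z = (90 - mean) / std

# sf is the survival function, i.e. 1 - cdf
p_more_than_90 = norm.sf(90, mean, std)
p_from_z = norm.sf(z)   # same probability on the standard normal scale

# share of months expected within one standard deviation of the mean
p_within_one_std = norm.cdf(85, mean, std) - norm.cdf(75, mean, std)

print("z-score of 90K:      ", z)
print("P(sales > 90K):      ", p_more_than_90)
print("Same via the z-score:", p_from_z)
print("P(75K < sales < 85K):", p_within_one_std)
```

90K is two standard deviations above the mean, which explains why selling more than 90K is so unlikely, and also why "more than 70K" (two standard deviations below) has the complementary probability.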


**Note:**

The probability that a continuous random variable takes any exact value is zero, because for continuous distributions probability is measured as an area under the density curve. A single exact value (e.g. 90) has no width, so no area is formed. A condition such as "smaller than 90", however, does define an area, and that area is calculated by the cumulative distribution function.
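A small numerical illustration of this point, using the sales distribution from above: the probability of landing in an interval around 90 shrinks towards zero as the interval gets narrower. The interval widths are arbitrary choices for the sketch.

```python
from scipy.stats import norm

mean, std = 80, 5
center = 90

probabilities = []
for width in (1.0, 0.1, 0.001, 1e-6):
    # area under the density curve between center - width/2 and center + width/2
    prob = (norm.cdf(center + width / 2, mean, std)
            - norm.cdf(center - width / 2, mean, std))
    probabilities.append(prob)
    print("Interval width:", width, "probability:", prob)
```

In practice, a question like "exactly 90K" is therefore rephrased as an interval, for example between 89.5K and 90.5K.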

**Thanks for your attention ❗️**